autotune_missForest {NADIA} | R Documentation |
Perform imputation using missForest from the missForest package.
Description
The function uses the missForest package for data imputation. The out-of-bag error, OOBerror (see autotune_mice for details), is used to perform the grid search.
Usage
autotune_missForest(
df,
col_type = NULL,
percent_of_missing = NULL,
cores = NULL,
ntree_set = c(100, 200, 500, 1000),
mtry_set = NULL,
parallel = FALSE,
col_0_1 = FALSE,
optimize = TRUE,
ntree = 100,
mtry = NULL,
verbose = FALSE,
maxiter = 20,
maxnodes = NULL,
out_file = NULL
)
Arguments
df |
data.frame. Data frame to impute, with column names. |
col_type |
character vector. Vector containing column type names. |
percent_of_missing |
numeric vector. Vector containing the percentage of missing data in each column, for example c(0,1,0,0,11.3,..) |
cores |
integer. Number of threads used by parallel calculations. By default approximately half of available CPU cores. |
ntree_set |
integer vector. Numbers of trees to try in the grid search. |
mtry_set |
integer vector. Numbers of variables randomly sampled at each split to try in the grid search. |
parallel |
logical. If TRUE, parallel computation is used. |
col_0_1 |
logical. Decides whether to add extra columns indicating where imputation was done. 0 - value was in the dataset, 1 - value was imputed. Default FALSE. |
optimize |
logical. If TRUE, parameters are optimized inside the function. |
ntree |
ntree parameter of the missForest function. |
mtry |
mtry parameter of the missForest function. |
verbose |
logical. If FALSE, the function does not print to the console. |
maxiter |
maxiter parameter of the missForest function. |
maxnodes |
maxnodes parameter of the missForest function. |
out_file |
Output log file location. If the file already exists, log messages are appended to it. If NULL, no log is produced. |
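The col_type and percent_of_missing arguments can be derived from the data frame itself. The following is a minimal base R sketch, assuming that col_type should hold the first class of each column and that percent_of_missing is on a 0-100 scale, as the argument descriptions suggest. The toy data frame is illustrative only.

```r
# Toy data frame with a few missing values (illustrative only).
df <- data.frame(
  a = factor(c("red", NA, "blue", "red")),
  b = c(1L, 2L, NA, 4L),
  d = c(0.5, 1.5, 2.5, NA),
  e = c(NA, NA, 3.5, 4.5)
)

# First class of each column, e.g. "factor", "integer", "numeric".
col_type <- vapply(df, function(x) class(x)[1], character(1),
                   USE.NAMES = FALSE)

# Percentage (0-100) of missing values in each column.
percent_of_missing <- unname(100 * colMeans(is.na(df)))

print(col_type)
print(percent_of_missing)

# These vectors could then be passed on, for example:
# imp <- autotune_missForest(df, col_type, percent_of_missing,
#                            optimize = FALSE)
```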
Details
The function tries to use a parallel backend when possible. Half of the available cores are used, or the number passed as the cores parameter. (The number of cores used cannot be higher than the number of variables in df. If it is, the number of cores is set to ncol(df)-2, unless this number is <= 0, in which case cores = 1.) To perform parallel computation, the function uses registerDoParallel to create the parallel backend.
Creating the backend can have a significant time cost, so for a very small df, cores = 1 can speed up the calculation. After the calculation, the function turns off the parallel backend.
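The rule for choosing the number of cores can be sketched as follows. This is only an illustration of the rule as described above, not the package's internal code: choose_cores is a hypothetical helper, and rounding "half of the available cores" down with floor() is an assumption.

```r
# Hypothetical helper illustrating the core-selection rule described above.
choose_cores <- function(n_col, cores = NULL,
                         available = parallel::detectCores()) {
  # detectCores() may return NA on some platforms; fall back to 1.
  if (is.na(available)) available <- 1L
  # Default: roughly half of the available cores (assumed to round down).
  if (is.null(cores)) cores <- max(1L, floor(available / 2))
  # More cores than variables: fall back to ncol(df) - 2, but at least 1.
  if (cores > n_col) {
    cores <- n_col - 2
    if (cores <= 0) cores <- 1
  }
  as.integer(cores)
}

choose_cores(n_col = 6, cores = 4, available = 8)   # 4
choose_cores(n_col = 5, cores = 8, available = 16)  # 3
choose_cores(n_col = 2, cores = 8, available = 16)  # 1
```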
Grid search is used to choose the number of variables sampled at each split and the number of trees; it can be turned off. The parameters in the grid search have a significant influence on imputation quality, but the function should work with any reasonable values of these parameters.
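The idea of a grid search over ntree and mtry can be sketched in base R. This is a generic illustration under stated assumptions, not the implementation used by autotune_missForest: grid_search and toy_score are hypothetical names, and the scoring function here is a stand-in for an error estimate such as the out-of-bag error.

```r
# Hypothetical grid search: evaluate score_fn on every (ntree, mtry) pair
# and return the row with the lowest error.
grid_search <- function(ntree_set, mtry_set, score_fn) {
  grid <- expand.grid(ntree = ntree_set, mtry = mtry_set)
  grid$error <- vapply(
    seq_len(nrow(grid)),
    function(i) score_fn(grid$ntree[i], grid$mtry[i]),
    numeric(1)
  )
  grid[which.min(grid$error), , drop = FALSE]
}

# Stand-in scoring function with a known minimum at ntree = 200, mtry = 2.
toy_score <- function(ntree, mtry) {
  abs(ntree - 200) / 1000 + abs(mtry - 2) / 10
}

best <- grid_search(c(100, 200, 500, 1000), 1:3, toy_score)
print(best)
```

In practice, the scoring function would run an imputation with the given parameters and return a single number summarizing its error. With missForest this could be based on the OOBerror element of the returned object, which may contain more than one value for mixed-type data, so it would need to be reduced to one number first.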
Value
Returns a data.frame with imputed values.
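When col_0_1 = TRUE, the result also carries 0/1 indicator columns marking imputed cells. How such indicators relate to the input can be sketched in base R. The "_where" suffix and the layout below are assumptions made for illustration; the column names actually produced by the package may differ.

```r
# Toy input with missing values (illustrative only).
df <- data.frame(x = c(1, NA, 3), y = factor(c("a", "b", NA)))

# 1 where the original value was missing (and would be imputed), 0 otherwise.
where <- as.data.frame(is.na(df) * 1L)
names(where) <- paste0(names(df), "_where")  # hypothetical naming

print(where)
```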
Author(s)
Daniel J. Stekhoven (2013), Stekhoven D. J., & Buehlmann, P. (2012).
References
Daniel J. Stekhoven (2013). missForest: Nonparametric Missing Value Imputation using Random Forest. R package version 1.4.
Stekhoven D. J., & Buehlmann, P. (2012). MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.
Examples
{
  raw_data <- data.frame(
    a = as.factor(sample(c("red", "yellow", "blue", NA), 1000, replace = TRUE)),
    b = as.integer(1:1000),
    c = as.factor(sample(c("YES", "NO", NA), 1000, replace = TRUE)),
    d = runif(1000, 1, 10),
    e = as.factor(sample(c("YES", "NO"), 1000, replace = TRUE)),
    f = as.factor(sample(c("male", "female", "trans", "other", NA), 1000, replace = TRUE))
  )
  # Preparing col_type
  col_type <- c("factor", "integer", "factor", "numeric", "factor", "factor")
  # Percentage of missing values in each column
  percent_of_missing <- 1:6
  for (i in percent_of_missing) {
    percent_of_missing[i] <- 100 * (sum(is.na(raw_data[, i])) / nrow(raw_data))
  }
  imp_data <- autotune_missForest(raw_data, col_type, percent_of_missing,
    optimize = FALSE, parallel = FALSE
  )
  # Check that all missing values were imputed
  sum(is.na(imp_data)) == 0
  # TRUE
}