autotune_mice {NADIA} | R Documentation |
Automatic tuning of parameters and imputation using the mice package.
Description
Imputes missing data using mice functions. The function first performs a random search using linear models (generalized linear models if only
categorical values are available). Using glm is problematic; the function lets users skip optimization in that case, but doing so can lead to errors.
The function optimizes the prediction matrix and the method. Other mice parameters, such as the number of imputed sets (m) or the maximum number of iterations (maxit), should be set
as high as possible for best results (higher values require more time to perform imputation). If you choose to return a single imputed dataset, m is not important.
More information can be found in random_param_mice_search, formula_creating, and mice.
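The two objects this function tunes, the prediction matrix and the method vector, are ordinary mice inputs. The sketch below only inspects them with mice's own helpers; it is not the search performed by autotune_mice, and using quickpred with mincor = 0.5 to mimic the set_cor default is an assumption made for illustration.

```r
# Small numeric example data shipped with mice (age, bmi, hyp, chl).
dat <- mice::nhanes

# Predictor matrix: rows are variables to impute, columns are candidate predictors.
# mincor = 0.5 mirrors the set_cor default; that mapping is an assumption.
pred <- mice::quickpred(dat, mincor = 0.5)

# Default method vector: one entry per column, "" for columns without missing values.
meth <- mice::make.method(dat)

print(pred)
print(meth)
```

Both objects can be passed to mice::mice through its predictorMatrix and method arguments.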
Usage
autotune_mice(
df,
m = 5,
maxit = 5,
col_miss = NULL,
col_no_miss = NULL,
col_type = NULL,
set_cor = 0.5,
set_method = "pmm",
percent_of_missing = NULL,
low_corr = 0,
up_corr = 1,
methods_random = c("pmm"),
iter = 5,
random.seed = 123,
optimize = TRUE,
correlation = TRUE,
return_one = TRUE,
col_0_1 = FALSE,
verbose = FALSE,
out_file = NULL
)
Arguments
df |
data frame for imputation. |
m |
number of sets produced by mice. |
maxit |
maximum number of iterations for mice. |
col_miss |
names of columns with missing values. |
col_no_miss |
character vector. Names of columns without NA. |
col_type |
character vector. Vector containing column type names. |
set_cor |
Correlation or fraction of features used if optimize = FALSE. |
set_method |
Method used if optimize = FALSE. If NULL, the default method is used (see the methods_random argument). |
percent_of_missing |
numeric vector. Vector containing the percentage of missing data in each column, for example c(0,1,0,0,11.3,..) |
low_corr |
double between 0 and 1, default 0. Lower boundary of the correlation set. |
up_corr |
double between 0 and 1, default 1. Upper boundary of the correlation set. Both of these parameters work the same way for a fraction of features. |
methods_random |
set of methods to choose from. Default 'pmm'. If set to NULL, the following methods are used: pmm, predictive mean matching (numeric data); logreg, logistic regression imputation (binary data, factor with 2 levels); polyreg, polytomous regression imputation for unordered categorical data (factor > 2 levels); polr, proportional odds model (ordered, > 2 levels). |
iter |
number of iterations for the random search. |
random.seed |
random seed. |
optimize |
whether the user wants to optimize. |
correlation |
If TRUE, correlation is used; if FALSE, a fraction of features is used. Default TRUE. |
return_one |
Whether one or many imputed sets will be returned. If TRUE, a single imputed dataset is returned. Default TRUE. |
col_0_1 |
Whether to add an extra column indicating where imputation was done. 0 - value was in the dataset, 1 - value was imputed. Default FALSE. (Works only when returning one dataset.) |
verbose |
If FALSE, the function does not print to the console. |
out_file |
Output log file location. If the file already exists, log messages are appended to it. If NULL, no log is produced. |
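The 0/1 indicator described for col_0_1 can be pictured with base R alone. The sketch below is an illustration, not the package's implementation: the toy data frame and the `_where` suffix are hypothetical, and the column names actually produced by autotune_mice may differ.

```r
# Toy data frame with missing values (hypothetical example data).
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

# 1 where the value was missing (and would be imputed), 0 where it was observed.
mask <- as.data.frame(is.na(df) * 1L)
names(mask) <- paste0(names(df), "_where")  # hypothetical naming scheme

print(mask)
```

The mask has to be computed before imputation, because the completed data no longer contains NA.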
Value
Returns the imputed dataset, or a mids object containing the multiply imputed datasets.
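For readers unfamiliar with the two return shapes, the sketch below calls mice directly, not autotune_mice, to show the difference between a mids object and a single completed data frame. The small m and maxit values are chosen only to keep the run short.

```r
# Run mice directly on its bundled numeric example data.
imp <- mice::mice(mice::nhanes, m = 2, maxit = 2, method = "pmm",
                  seed = 123, printFlag = FALSE)

# A "mids" object stores every imputed set alongside the original data.
print(class(imp))

# Extract the first completed data frame.
one_set <- mice::complete(imp, 1)

# Extract all completed data frames as a list, one element per imputation.
all_sets <- mice::complete(imp, "all")

# The completed data frame should contain no missing values.
print(sum(is.na(one_set)) == 0)
print(length(all_sets))
```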
Author(s)
Stef van Buuren, Karin Groothuis-Oudshoorn (2011).
Examples
{
  raw_data <- mice::nhanes2

  # Class name of every column
  col_type <- 1:ncol(raw_data)
  for (i in col_type) {
    col_type[i] <- class(raw_data[, i])
  }

  # Percentage of missing values in every column
  percent_of_missing <- 1:ncol(raw_data)
  for (i in percent_of_missing) {
    percent_of_missing[i] <- 100 * (sum(is.na(raw_data[, i])) / nrow(raw_data))
  }

  col_no_miss <- colnames(raw_data)[percent_of_missing == 0]
  col_miss <- colnames(raw_data)[percent_of_missing > 0]

  # Impute without optimization, using the default set_method ("pmm")
  imp_data <- autotune_mice(raw_data, optimize = FALSE, iter = 2,
    col_type = col_type, percent_of_missing = percent_of_missing,
    col_no_miss = col_no_miss, col_miss = col_miss)

  # Check that all missing values were imputed
  sum(is.na(imp_data)) == 0
  # TRUE
}
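The bookkeeping vectors built with loops in the example can also be computed in vectorized form. This is a base R sketch on the built-in airquality data, used here only because it ships with R and contains missing values. The variable names match the arguments of autotune_mice, but the call itself is left out.

```r
dat <- datasets::airquality

# First class name of each column, as a plain character vector.
col_type <- unname(vapply(dat, function(x) class(x)[1], character(1)))

# Percentage of missing values per column, in column order.
percent_of_missing <- unname(100 * colMeans(is.na(dat)))

# Split column names by whether they contain any missing values.
col_miss <- colnames(dat)[percent_of_missing > 0]
col_no_miss <- colnames(dat)[percent_of_missing == 0]

print(col_miss)
print(col_no_miss)
```

Taking only the first class name keeps one entry per column even for columns with several classes, such as ordered factors.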