missMDA_FMAD_MCA_PCA {NADIA} | R Documentation |
Perform imputation using the MCA, PCA, or FAMD algorithm.
Description
The function uses the missMDA package to perform data imputation. It can find the best number of dimensions for the imputation. The user can choose whether to return a single imputed dataset or a list of imputed datasets from multiple imputation.
Usage
missMDA_FMAD_MCA_PCA(
df,
col_type = NULL,
percent_of_missing = NULL,
optimize_ncp = TRUE,
set_ncp = 2,
col_0_1 = FALSE,
ncp.max = 5,
return_one = TRUE,
random.seed = 123,
maxiter = 998,
coeff.ridge = 1,
threshold = 1e-06,
method = "Regularized",
out_file = NULL,
return_ncp = FALSE
)
Arguments
df |
data.frame. Data frame to impute, with column names and without the target column. |
col_type |
character vector. Vector containing column type names. |
percent_of_missing |
numeric vector. Vector containing the percentage of missing data in each column, for example c(0,1,0,0,11.3,..) |
optimize_ncp |
logical. If TRUE, the number of dimensions used to predict the missing entries will be optimized. If FALSE, the value of set_ncp (2 by default) is used. |
set_ncp |
integer > 0. Number of dimensions used by the algorithms. Used only if optimize_ncp = FALSE. |
col_0_1 |
logical. Whether to add an extra column indicating where imputation was performed: 0 means the value was present in the dataset, 1 means the value was imputed. Default FALSE. Works only when a single dataset is returned. |
ncp.max |
integer corresponding to the maximum number of components to test. Default 5. |
return_one |
logical. If TRUE, one imputed dataset is returned; if FALSE, a list of imputed datasets is returned. Default TRUE. |
random.seed |
integer. Default 123. random.seed = NULL implies that missing values are initially imputed by the mean of each variable. Other values lead to a random initialization. |
maxiter |
maximum number of iterations of the algorithm. |
coeff.ridge |
Value used in the Regularized method. |
threshold |
threshold for convergence. |
method |
method used by the imputation algorithm. Default "Regularized". |
out_file |
Output log file location. If the file already exists, log messages will be appended to it. If NULL, no log will be produced. |
return_ncp |
logical. Whether the function should also return the ncp value that was used. |
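The 0/1 coding described for col_0_1 can be illustrated with base R alone. This is a minimal sketch, not the package's implementation: the toy data, the use of one indicator per original column, and the "_where" column suffix are all assumptions made for illustration.

```r
# Toy data with missing values (illustrative only).
df <- data.frame(x = c(1.5, NA, 3.0), y = factor(c("a", "b", NA)))

# 0 = value was present in the original data,
# 1 = value was missing (and would therefore be imputed).
mask <- as.data.frame(is.na(df) * 1L)

# The "_where" suffix is an assumed naming scheme, not confirmed by this page.
names(mask) <- paste0(names(df), "_where")
mask
```

Such a mask has to be computed before imputation, because the imputed data no longer contains NA values.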
Details
The function chooses the algorithm according to the variable types in df: PCA is used for purely numeric data, MCA for purely categorical data, and FAMD (spelled FMAD in the function name) for mixed data. If optimize_ncp = TRUE, the function tries to find the optimal ncp; if that is not possible, the default ncp = 2 is used. In some cases ncp = 1 is used when ncp = 2 does not work. For multiple imputation, an error is returned if the chosen ncp does not work.
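The selection rule above can be sketched as a small base R function. The helper name choose_missMDA_method is hypothetical and is not part of NADIA. The sketch assumes that "numeric" and "integer" count as numeric types and that every other type is treated as categorical. The returned strings are the names of the corresponding missMDA imputation functions.

```r
# Hypothetical helper (not part of NADIA): pick the missMDA imputation
# routine from a vector of column types, following the rule described above.
choose_missMDA_method <- function(col_type) {
  # Assumption: anything that is not numeric/integer is categorical.
  is_num <- col_type %in% c("numeric", "integer")
  if (all(is_num)) {
    "imputePCA"    # only numeric columns
  } else if (!any(is_num)) {
    "imputeMCA"    # only categorical columns
  } else {
    "imputeFAMD"   # mixed numeric and categorical columns
  }
}

choose_missMDA_method(c("numeric", "integer"))
choose_missMDA_method(c("factor", "factor"))
choose_missMDA_method(c("factor", "integer", "numeric"))
```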
Value
Returns one imputed data.frame if return_one = TRUE, or a list of imputed data.frames if return_one = FALSE.
Author(s)
Julie Josse, Francois Husson (2016) doi:10.18637/jss.v070.i01
References
Julie Josse, Francois Husson (2016). missMDA: A Package for Handling Missing Values in Multivariate Data Analysis. Journal of Statistical Software, 70(1), 1-31. doi:10.18637/jss.v070.i01
Examples
{
raw_data <- data.frame(
a = as.factor(sample(c("red", "yellow", "blue", NA), 1000, replace = TRUE)),
b = as.integer(1:1000),
c = as.factor(sample(c("YES", "NO", NA), 1000, replace = TRUE)),
d = runif(1000, 1, 10),
e = as.factor(sample(c("YES", "NO"), 1000, replace = TRUE)),
f = as.factor(sample(c("male", "female", "trans", "other", NA), 1000, replace = TRUE)))
# Preparing col_type
col_type <- c("factor", "integer", "factor", "numeric", "factor", "factor")
# Percentage of missing values in each column
percent_of_missing <- unname(100 * colMeans(is.na(raw_data)))
imp_data <- missMDA_FMAD_MCA_PCA(raw_data, col_type, percent_of_missing, optimize_ncp = FALSE)
# Check that all missing values were imputed
sum(is.na(imp_data)) == 0
# TRUE
}