auto_simon_ml {immunaut}    R Documentation
Automated Machine Learning Model Building
Description
This function automates the process of building machine learning models using the caret package. It supports both binary and multi-class classification and lets users specify a list of machine learning algorithms to train on the dataset. The function splits the dataset into training and testing sets, applies preprocessing steps, and trains models using cross-validation. It computes performance metrics including the confusion matrix, AUROC (macro-averaged for multi-class problems), and prAUC (binary classification only).
Usage
auto_simon_ml(dataset_ml, settings)
Arguments
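dataset_ml
A data frame containing the dataset for training. Every column other than the outcome column should contain a feature.
settings
A list of parameters that control the run. The Examples section sets excludedColumns, preProcessDataset, selectedPartitionSplit, selectedPackages, and trainingTimeout.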
Details
The function performs preprocessing (e.g., centering, scaling, and imputation of missing values) on the dataset based on the provided settings. It splits the data into training and testing sets using the specified partition, trains models using cross-validation, and computes performance metrics.
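The steps described above can be sketched directly with caret. This is a minimal illustration, not the internals of auto_simon_ml: the iris dataset, the 70% split, the 5 folds, and the "rpart" algorithm are all assumptions chosen to keep the example small.

```r
library(caret)

set.seed(1337)
data(iris)
outcome <- "Species"  # illustrative outcome column

# Introduce a few missing values so that imputation has something to do
iris$Sepal.Length[c(5, 60, 120)] <- NA

# Stratified train/test split (the 0.7 fraction is an illustrative choice)
train_idx <- createDataPartition(iris[[outcome]], p = 0.7, list = FALSE)
train_df <- iris[train_idx, ]
test_df  <- iris[-train_idx, ]
features <- setdiff(names(iris), outcome)

# Learn centering, scaling, and median imputation on the training rows only,
# then apply the same transformation to both sets
pp <- preProcess(train_df[, features],
                 method = c("center", "scale", "medianImpute"))
train_x <- predict(pp, train_df[, features])
test_x  <- predict(pp, test_df[, features])

# Train one algorithm with 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE)
fit <- train(x = train_x, y = train_df[[outcome]],
             method = "rpart", trControl = ctrl)

# Evaluate on the held-out test set
pred <- predict(fit, newdata = test_x)
cm   <- confusionMatrix(pred, test_df[[outcome]])  # confusion matrix
post <- postResample(pred, test_df[[outcome]])     # Accuracy and Kappa
vi   <- varImp(fit)                                # variable importance
```

Fitting the preprocessing on the training rows only keeps information from the test set out of the model, which is the usual reason for splitting before preprocessing.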
For binary classification problems, the function calculates AUROC and prAUC. For multi-class classification, it calculates macro-averaged AUROC, though prAUC is not used.
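As an illustration of what a macro-averaged AUROC means, the base R sketch below computes a one-vs-rest AUROC per class and takes their unweighted mean. This page does not state which multi-class definition the function uses, so the one-vs-rest scheme, the helper names auroc_binary and macro_auroc, and the toy data are assumptions made for illustration only.

```r
# Binary AUROC via the rank-sum (Mann-Whitney) identity.
# rank() assigns average ranks to ties, which matches the usual AUROC convention.
auroc_binary <- function(scores, is_positive) {
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  if (n_pos == 0 || n_neg == 0) return(NA_real_)
  r <- rank(scores)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Macro average: one-vs-rest AUROC for each class, then an unweighted mean.
# `probs` is a matrix of class probabilities with one named column per class.
macro_auroc <- function(probs, truth) {
  per_class <- vapply(
    colnames(probs),
    function(cl) auroc_binary(probs[, cl], truth == cl),
    numeric(1)
  )
  mean(per_class, na.rm = TRUE)
}

# Toy data: six samples, three classes (values are made up)
truth <- factor(c("a", "a", "b", "b", "c", "c"))
probs <- rbind(
  c(0.7, 0.2, 0.1),
  c(0.3, 0.5, 0.2),
  c(0.2, 0.6, 0.2),
  c(0.4, 0.4, 0.2),
  c(0.1, 0.2, 0.7),
  c(0.2, 0.3, 0.5)
)
colnames(probs) <- levels(truth)

macro <- macro_auroc(probs, truth)
```

Because every class contributes equally to the mean, a macro average gives rare classes the same weight as common ones.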
The function returns a list of trained models along with their performance metrics, including confusion matrix, variable importance, and post-resample metrics.
Value
A list where each element corresponds to a trained model for one of the algorithms specified in settings$selectedPackages. Each element contains:
info: General information about the model, including resampling indices, problem type, and outcome mapping.
training: The trained model object and variable importance.
predictions: Predictions on the test set, including probabilities, confusion matrix, post-resample statistics, AUROC (for binary classification), and prAUC (for binary classification).
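A returned list of this shape can be inspected with ordinary list tools. The sketch below uses a hypothetical stand-in object: only the three top-level component names (info, training, predictions) come from this page, while naming the elements by algorithm and leaving the components empty are assumptions made for illustration.

```r
# Hypothetical stand-in for the return value of auto_simon_ml().
# Element names ("rf", "rpart2") and the empty components are assumptions.
ml_results <- list(
  rf     = list(info = list(), training = list(), predictions = list()),
  rpart2 = list(info = list(), training = list(), predictions = list())
)

# Which algorithms produced a result?
model_names <- names(ml_results)

# Check that each element carries the three documented components
has_parts <- vapply(
  ml_results,
  function(m) all(c("info", "training", "predictions") %in% names(m)),
  logical(1)
)

# Look at the top-level structure of one model without printing everything
str(ml_results[["rf"]], max.level = 1)
```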
Examples
## Not run:
dataset <- read.csv("fc_wo_noise.csv", header = TRUE, row.names = 1)
# Generate a file header for the dataset to use in downstream analysis
file_header <- generate_file_header(dataset)
settings <- list(
  fileHeader = file_header,
  # Columns selected for analysis
  selectedColumns = c("ExampleColumn1", "ExampleColumn2"),
  clusterType = "Louvain",
  removeNA = TRUE,
  preProcessDataset = c("scale", "center", "medianImpute", "corr", "zv", "nzv"),
  target_clusters_range = c(3, 4),
  resolution_increments = c(0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5),
  min_modularities = c(0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9),
  pickBestClusterMethod = "Modularity",
  seed = 1337
)
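result <- immunaut(dataset, settings)
dataset_ml <- result$dataset$original
# NOTE: `tsne_clust` and `i` are not defined in this example; they must be
# supplied by the caller's own clustering loop before this line can run.
dataset_ml$pandora_cluster <- tsne_clust[[i]]$info.norm$pandora_cluster
# Rename the cluster label column to `immunaut` and move it to the front
dataset_ml <- dplyr::rename(dataset_ml, immunaut = pandora_cluster)
dataset_ml <- dataset_ml[, c("immunaut", setdiff(names(dataset_ml), "immunaut"))]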
settings_ml <- list(
  excludedColumns = c("ExampleColumn0"),
  preProcessDataset = c("scale", "center", "medianImpute", "corr", "zv", "nzv"),
  # NOTE: `split` must be defined by the caller; if it is not, R resolves the
  # name to the base function split() rather than a partition value.
  selectedPartitionSplit = split, # Use the current partition split
  selectedPackages = c("rf", "RRF", "RRFglobal", "rpart2", "c5.0", "sparseLDA",
                       "gcvEarth", "cforest", "gaussPRPoly", "monmlp", "slda", "spls"),
  trainingTimeout = 180 # Timeout 3 minutes
)
ml_results <- auto_simon_ml(dataset_ml, settings_ml)
## End(Not run)