cat_glm_initialization {catalytic} | R Documentation |
Initialization for Catalytic Generalized Linear Models (GLMs)
Description
This function prepares and initializes a catalytic Generalized Linear Model (GLM) by processing input data, extracting necessary variables, generating synthetic datasets, and fitting a model.
Usage
cat_glm_initialization(
formula,
family = "gaussian",
data,
syn_size = NULL,
custom_variance = NULL,
gaussian_known_variance = FALSE,
x_degree = NULL,
resample_only = FALSE,
na_replace = stats::na.omit
)
Arguments
formula |
A formula specifying the GLM. Should include the response and predictor variables. |
family |
The type of GLM family. Defaults to Gaussian. |
data |
A data frame containing the data for modeling. |
syn_size |
An integer specifying the size of the synthetic dataset to be generated. If NULL (the default), it is set to four times the number of predictor columns. |
custom_variance |
A custom variance value to be applied if using a Gaussian model. Defaults to NULL. |
gaussian_known_variance |
A logical value indicating whether the data variance is known. Defaults to FALSE. |
x_degree |
A numeric vector indicating the degree for polynomial expansion of each predictor. If NULL (the default), degree 1 is used for each predictor. |
resample_only |
A logical indicating whether to perform resampling only. Default is FALSE. |
na_replace |
A function to handle NA values in the data. Default is stats::na.omit. |
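The documented defaults for syn_size and x_degree can be made concrete with a small base R sketch. It does not call catalytic: the helper resolve_defaults is hypothetical, and it assumes that the predictors are all columns of data other than the response. It only illustrates the rules stated above and is not the package's internal code.

```r
# Hypothetical helper (not part of catalytic): mirrors the documented defaults.
# Assumption: every column of `data` except the response is a predictor.
resolve_defaults <- function(formula, data, syn_size = NULL, x_degree = NULL) {
  y_col_name <- all.vars(formula)[1]               # response is the first variable
  x_col_names <- setdiff(names(data), y_col_name)  # remaining columns
  n_pred <- length(x_col_names)

  # Documented default: four synthetic rows per predictor column.
  if (is.null(syn_size)) syn_size <- 4 * n_pred
  # Documented default: degree 1 for each predictor.
  if (is.null(x_degree)) x_degree <- rep(1, n_pred)

  list(
    y_col_name = y_col_name,
    x_col_names = x_col_names,
    syn_size = syn_size,
    x_degree = x_degree
  )
}

demo_data <- data.frame(
  X1 = stats::rnorm(10),
  X2 = stats::rnorm(10),
  Y = stats::rnorm(10)
)
defaults <- resolve_defaults(Y ~ 1, demo_data)
defaults$syn_size  # 8, i.e. 4 * 2 predictor columns
defaults$x_degree  # 1 1
```

Passing explicit values, for example syn_size = 100, bypasses both defaults in this sketch.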
Value
A list containing the values of all the input arguments and the following components:
- Function Information
  - function_name: The name of the function, "cat_glm_initialization".
  - y_col_name: The name of the response variable in the dataset.
  - simple_model: An object of class stats::glm, representing the fitted model for generating the synthetic response from the original data.
- Observation Data Information
  - obs_size: Number of observations in the original dataset.
  - obs_data: Data frame of standardized observation data.
  - obs_x: Predictor variables for observed data.
  - obs_y: Response variable for observed data.
- Synthetic Data Information
  - syn_size: Number of synthetic observations generated.
  - syn_data: Data frame of synthetic predictor and response variables.
  - syn_x: Synthetic predictor variables.
  - syn_y: Synthetic response variable.
  - syn_x_resample_inform: Information about resampling methods for synthetic predictors:
    - Coordinate: Preserves the original data values as reference coordinates during processing.
    - Deskewing: Adjusts the data distribution to reduce skewness and enhance symmetry.
    - Smoothing: Reduces noise in the data to stabilize the dataset and prevent overfitting.
    - Flattening: Creates a more uniform distribution by modifying low-frequency categories in categorical variables.
    - Symmetrizing: Balances the data around its mean to improve statistical properties for model fitting.
- Whole Data Information
  - size: Total number of combined original and synthetic observations.
  - data: Data frame combining original and synthetic datasets.
  - x: Combined predictor variables from original and synthetic data.
  - y: Combined response variable from original and synthetic data.
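How the observed, synthetic, and whole components relate can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: it assumes each synthetic predictor column is drawn independently, with replacement, from the observed values, and that the synthetic response is the prediction of an intercept-only Gaussian model. The variable names follow the components listed above.

```r
set.seed(1)

# Observed data (obs_data, obs_size).
obs_data <- data.frame(
  X1 = stats::rnorm(10),
  X2 = stats::rnorm(10),
  Y = stats::rnorm(10)
)
obs_size <- nrow(obs_data)
syn_size <- 100

# Simple model fitted to the observed data (intercept only, as in Y ~ 1).
simple_model <- stats::glm(Y ~ 1, family = stats::gaussian(), data = obs_data)

# Assumed resampling scheme: draw each predictor column independently,
# with replacement, from its observed values.
obs_x <- obs_data[, c("X1", "X2")]
syn_x <- as.data.frame(
  lapply(obs_x, function(col) sample(col, syn_size, replace = TRUE))
)

# Assumed synthetic response: for an intercept-only Gaussian model the
# prediction is the fitted intercept, identical for every synthetic row.
syn_y <- rep(unname(stats::coef(simple_model)[1]), syn_size)
syn_data <- cbind(syn_x, Y = syn_y)

# Whole data: observed rows stacked on top of synthetic rows.
whole_data <- rbind(obs_data, syn_data)
size <- nrow(whole_data)
size == obs_size + syn_size  # TRUE
```

Resampling columns independently breaks any dependence between predictors in the synthetic rows, which is one reason the resampling information is reported separately in syn_x_resample_inform.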
Examples
gaussian_data <- data.frame(
  X1 = stats::rnorm(10),
  X2 = stats::rnorm(10),
  Y = stats::rnorm(10)
)

cat_init <- cat_glm_initialization(
  formula = Y ~ 1,                # formula for simple model
  data = gaussian_data,
  syn_size = 100,                 # Synthetic data size
  custom_variance = NULL,         # User customized variance value
  gaussian_known_variance = TRUE, # Indicating whether the data variance is known
  x_degree = c(1, 1),             # Degrees for polynomial expansion of predictors
  resample_only = FALSE,          # Whether to perform resampling only
  na_replace = stats::na.omit     # How to handle NA values in data
)

cat_init
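A short follow-up sketch shows how the returned list might be inspected. The component names (obs_size, syn_size, size, simple_model, syn_x_resample_inform) are taken from the Value section; the requireNamespace guard is an addition so that the snippet also runs where catalytic is not installed, and the printed values are not shown because they depend on the installed version.

```r
# Guard added for illustration: only call the package if it is available.
has_catalytic <- requireNamespace("catalytic", quietly = TRUE)

if (has_catalytic) {
  example_data <- data.frame(
    X1 = stats::rnorm(10),
    X2 = stats::rnorm(10),
    Y = stats::rnorm(10)
  )

  cat_init <- catalytic::cat_glm_initialization(
    formula = Y ~ 1,
    data = example_data,
    syn_size = 100,
    gaussian_known_variance = TRUE,
    x_degree = c(1, 1)
  )

  # Per the Value section, size is the combined number of observed
  # and synthetic observations.
  print(c(
    obs = cat_init$obs_size,
    syn = cat_init$syn_size,
    whole = cat_init$size
  ))

  # The fitted model used to generate the synthetic response.
  print(class(cat_init$simple_model))

  # Resampling information for the synthetic predictors.
  utils::str(cat_init$syn_x_resample_inform)
} else {
  message("Package 'catalytic' is not installed; skipping the example.")
}
```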