ssp.softmax {subsampling} | R Documentation
Optimal Subsampling Method for Softmax (multinomial logistic) Regression Model
Description
Draw a subsample from the full dataset and fit a softmax (multinomial logistic) regression model on the subsample. Refer to the vignette for a quick start.
Usage
ssp.softmax(
formula,
data,
subset,
n.plt,
n.ssp,
criterion = "MSPE",
sampling.method = "poisson",
likelihood = "MSCLE",
constraint = "summation",
control = list(...),
contrasts = NULL,
...
)
Arguments
formula
A model formula object of class "formula" that describes the model to be fitted.
data
A data frame containing the variables in the model. Denote N as the number of observations in data.
subset
An optional vector specifying a subset of observations from data to be used in the fitting process.
n.plt
The pilot subsample size (first-step subsample size). This subsample is used to compute the pilot estimator and estimate the optimal subsampling probabilities.
n.ssp
The expected size of the optimal subsample (second-step subsample). For sampling.method = 'withReplacement', exactly n.ssp rows are drawn; for sampling.method = 'poisson', the realized subsample size is random with expectation n.ssp.
criterion
The criterion for computing the optimal subsampling probabilities. Choices include MSPE (default; based on the mean squared prediction error), optA and optL.
sampling.method
The sampling method to use. Choices include poisson (default) and withReplacement.
likelihood
A bias-corrected likelihood function is required for the subsample since unequal subsampling probabilities introduce bias. Choices include MSCLE (default; maximum sampled conditional likelihood) and weighted.
constraint
The constraint used to ensure identifiability of the softmax model. Choices include summation (default) and baseline.
control
A list of parameters for controlling the sampling process. It contains two tuning parameters; see the package vignette for details.
contrasts
An optional list that specifies how categorical variables are represented in the design matrix. For example, contrasts = list(v1 = 'contr.treatment', v2 = 'contr.sum').
...
A list of additional parameters passed to the underlying model-fitting function.
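To make the sampling.method choices concrete, here is a base R sketch (an illustration, not the package's internal code) of how a with-replacement draw and a Poisson draw differ, given subsampling probabilities p over the N rows:

```r
set.seed(2)
N <- 1e4          # full data size
n.ssp <- 1000     # target (expected) subsample size
p <- rep(1/N, N)  # illustrative subsampling probabilities summing to 1

# 'withReplacement': a fixed-size draw of exactly n.ssp indices
idx.rep <- sample(N, n.ssp, replace = TRUE, prob = p)

# 'poisson': each row is kept independently, so the realized
# subsample size is random with expectation close to n.ssp
keep <- runif(N) < pmin(n.ssp * p, 1)
idx.poi <- which(keep)
```

Under 'poisson' the realized subsample size varies around n.ssp, which is why the returned component subsample.size.expect reports an expected rather than exact size.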
Details
A pilot estimator for the unknown parameter \beta is required because the MSPE, optA and optL subsampling probabilities depend on \beta; there is no "free lunch" when determining optimal subsampling probabilities. For softmax regression, the pilot estimator is obtained by drawing a subsample of size n.plt with replacement from the full dataset with uniform sampling probabilities.
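The pilot draw described above can be sketched in base R (an illustration of the described scheme, not the package's internal implementation):

```r
set.seed(1)
N <- 1e4       # number of observations in the full dataset
n.plt <- 500   # pilot subsample size
# uniform sampling with replacement: each row has probability 1/N per draw
index.plt <- sample(N, n.plt, replace = TRUE)
# the pilot estimator is then the softmax fit on data[index.plt, ],
# from which the optimal subsampling probabilities are computed
```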
Value
ssp.softmax returns an object of class "ssp.softmax" containing the following components (some are optional):
- model.call
The original function call.
- coef.plt
The pilot estimator. See Details for more information.
- coef.ssp
The estimator obtained from the optimal subsample.
- coef
The weighted linear combination of coef.plt and coef.ssp, under the baseline constraint. The combination weights depend on the relative size of n.plt and n.ssp and on the estimated covariance matrices of coef.plt and coef.ssp. We blend the pilot subsample information into the optimal subsample estimator since the pilot subsample has already been drawn. The coefficients and standard errors reported by summary are coef and the square root of diag(cov).
- coef.plt.sum
The pilot estimator under the summation constraint. coef.plt.sum = G %*% as.vector(coef.plt).
- coef.ssp.sum
The estimator obtained from the optimal subsample under the summation constraint. coef.ssp.sum = G %*% as.vector(coef.ssp).
- coef.sum
The weighted linear combination of coef.plt and coef.ssp, under the summation constraint. coef.sum = G %*% as.vector(coef).
- cov.plt
The covariance matrix of coef.plt.
- cov.ssp
The covariance matrix of coef.ssp.
- cov
The covariance matrix of coef.
- cov.plt.sum
The covariance matrix of coef.plt.sum.
- cov.ssp.sum
The covariance matrix of coef.ssp.sum.
- cov.sum
The covariance matrix of coef.sum.
- index.plt
Row indices of the pilot subsample in the full dataset.
- index.ssp
Row indices of the optimal subsample in the full dataset.
- N
The number of observations in the full dataset.
- subsample.size.expect
The expected subsample size.
- terms
The terms object for the fitted model.
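The matrix G that maps baseline-constraint coefficients to summation-constraint coefficients (used for coef.sum above) can be checked directly, reusing the construction from the Examples section. Under the summation constraint, each covariate's coefficients sum to zero across the K + 1 classes:

```r
d <- 3  # dimension of covariates
K <- 2  # K + 1 classes
G <- rbind(rep(-1/(K+1), K), diag(K) - 1/(K+1)) %x% diag(d)

# baseline constraint: the first class's coefficients are fixed at zero,
# so the free parameters form the d x K block for the remaining classes
beta.baseline <- matrix(-1.5, d, K)

# map to the summation constraint and reshape to d x (K + 1)
beta.summation <- matrix(G %*% as.vector(beta.baseline), d, K + 1)
rowSums(beta.summation)  # numerically zero for every covariate
```

This also shows that beta.true.baseline and beta.true.summation in the Examples section parameterize the same model.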
References
Yao, Y., & Wang, H. (2019). Optimal subsampling for softmax regression. Statistical Papers, 60, 585-599.
Han, L., Tan, K. M., Yang, T., & Zhang, T. (2020). Local uncertainty sampling for large-scale multiclass logistic regression. Annals of Statistics, 48(3), 1770-1788.
Wang, H., & Kim, J. K. (2022). Maximum sampled conditional likelihood for informative subsampling. Journal of Machine Learning Research, 23(332), 1-50.
Yao, Y., Zou, J., & Wang, H. (2023). Optimal Poisson subsampling for softmax regression. Journal of Systems Science and Complexity, 36(4), 1609-1625.
Yao, Y., Zou, J., & Wang, H. (2023). Model constraints independent optimal subsampling probabilities for softmax regression. Journal of Statistical Planning and Inference, 225, 188-201.
Examples
# softmax regression
d <- 3 # dimension of covariates
K <- 2 # K + 1 classes in total
G <- rbind(rep(-1/(K+1), K), diag(K) - 1/(K+1)) %x% diag(d) # baseline-to-summation transform
N <- 1e4
beta.true.baseline <- cbind(rep(0, d), matrix(-1.5, d, K))
beta.true.summation <- cbind(rep(1, d), 0.5 * matrix(-1, d, K))
set.seed(1)
mu <- rep(0, d)
sigma <- matrix(0.5, nrow = d, ncol = d)
diag(sigma) <- rep(1, d)
X <- MASS::mvrnorm(N, mu, sigma)
prob <- exp(X %*% beta.true.summation)
prob <- prob / rowSums(prob)
Y <- apply(prob, 1, function(row) sample(0:K, size = 1, prob = row))
n.plt <- 500
n.ssp <- 1000
data <- as.data.frame(cbind(Y, X))
colnames(data) <- c("Y", paste("V", 1:ncol(X), sep=""))
head(data)
formula <- Y ~ . -1
WithRep.MSPE <- ssp.softmax(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
criterion = 'MSPE',
sampling.method = 'withReplacement',
likelihood = 'weighted',
constraint = 'baseline')
summary(WithRep.MSPE)
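For comparison, a second call continuing from the objects defined above can use the function's defaults (poisson sampling with the MSCLE likelihood and summation constraint, per the Usage section):

```r
Poisson.MSPE <- ssp.softmax(formula = formula,
                            data = data,
                            n.plt = n.plt,
                            n.ssp = n.ssp,
                            criterion = 'MSPE',
                            sampling.method = 'poisson',
                            likelihood = 'MSCLE',
                            constraint = 'summation')
summary(Poisson.MSPE)
```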