estimate_accuracy {scR} | R Documentation |
Estimate sample complexity bounds for a binary classification algorithm using either simulated or user-supplied data.
Description
Estimate sample complexity bounds for a binary classification algorithm using either simulated or user-supplied data.
Usage
estimate_accuracy(
formula,
model,
data = NULL,
dim = NULL,
maxn = NULL,
upperlimit = NULL,
nsample = 30,
steps = 50,
eta = 0.05,
delta = 0.05,
epsilon = 0.05,
predictfn = NULL,
power = FALSE,
effect_size = NULL,
powersims = NULL,
alpha = 0.05,
parallel = TRUE,
coreoffset = 0,
packages = list(),
method = c("Uniform", "Class Imbalance"),
p = NULL,
minn = ifelse(is.null(data), (dim + 1), (ncol(data) + 1)),
x = NULL,
y = NULL,
...
)
Arguments
formula |
A |
model |
A binary classification model supplied by the user. Must take arguments |
data |
Optional. A rectangular |
dim |
Required if |
maxn |
Required if |
upperlimit |
Optional. A positive integer giving the maximum sample size to be simulated, if data was supplied. |
nsample |
A positive integer giving the number of samples to be generated for each value of $n$. Larger values give more accurate results. |
steps |
A positive integer giving the number of values of $n$ for which simulations should be conducted. Larger values give more accurate results. |
eta |
A real number between 0 and 1 giving the probability of misclassification error in the training data. |
delta |
A real number between 0 and 1 giving the targeted maximum probability of observing an OOS error rate higher than |
epsilon |
A real number between 0 and 1 giving the targeted maximum out-of-sample (OOS) error rate. |
predictfn |
An optional user-defined function giving a custom predict method. If also using a user-defined model, the |
power |
A logical indicating whether experimental power based on the predictions should also be reported. |
effect_size |
If |
powersims |
If |
alpha |
If |
parallel |
Boolean indicating whether or not to use parallel processing. |
coreoffset |
If |
packages |
A list of packages that need to be loaded in order to run |
method |
An optional string stating the distribution from which data is to be generated. Default is i.i.d. uniform sampling. Can also take a function outputting a vector of probabilities if the user wishes to specify a custom distribution. |
p |
If method is 'Class Imbalance', gives the degree of weight placed on the positive class. |
minn |
Optional argument to set a different minimum n than the dimension of the algorithm. Useful with e.g. regularized regression models such as elastic net. |
x |
Optional argument for methods that take separate predictor and outcome data. Specifies a matrix-like object containing predictors. Note that if used, the x and y objects are bound together columnwise; this must be handled in the user-supplied helper function. |
y |
Optional argument for methods that take separate predictor and outcome data. Specifies a vector-like object containing outcome values. Note that if used, the x and y objects are bound together columnwise; this must be handled in the user-supplied helper function. |
... |
Additional arguments that need to be passed to |
Value
A list containing two named elements. Raw gives the exact output of the simulations, while Summary gives a table of accuracy metrics, including the achieved levels of $\epsilon$ and $\delta$ given the specified values. Alternative values can be calculated using getpac().
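The relationship between $\epsilon$ and $\delta$ described in the arguments can be illustrated with a small base-R sketch. The error rates below are made-up values standing in for the OOS error rates of simulated samples at one value of $n$; the variable names are hypothetical, and this illustrates the definitions only, not the package's internal computation.

```r
# Hypothetical OOS error rates from 10 simulated samples at a single n
# (illustrative numbers only, not output from estimate_accuracy()).
oos_error <- c(0.02, 0.03, 0.04, 0.04, 0.05, 0.06, 0.03, 0.08, 0.02, 0.04)

epsilon <- 0.05  # targeted maximum OOS error rate
delta   <- 0.05  # targeted maximum probability of exceeding epsilon

# Empirical delta: share of samples whose error rate exceeds epsilon.
achieved_delta <- mean(oos_error > epsilon)

# Empirical epsilon: error rate exceeded in at most a delta share of samples,
# taken here as the (1 - delta) sample quantile (an assumption of this sketch).
achieved_epsilon <- unname(quantile(oos_error, probs = 1 - delta))

print(achieved_delta)
print(achieved_epsilon)
```

In this toy case two of the ten error rates exceed `epsilon`, so the empirical `achieved_delta` is well above the 0.05 target, which would suggest that a larger $n$ is needed.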
See Also
plot_accuracy()
, to represent simulations visually, getpac()
, to calculate summaries for alternate values of \epsilon
and \delta
without conducting a new simulation, and gendata()
, to generated synthetic datasets.
Examples
mylogit <- function(formula, data) {
  m <- structure(
    glm(formula = formula, data = data, family = binomial(link = "logit")),
    class = c("svrclass", "glm") # IMPORTANT - must use the class svrclass to work correctly
  )
  return(m)
}
mypred <- function(m, newdata) {
  out <- predict.glm(m, newdata, type = "response")
  out <- factor(ifelse(out > 0.5, 1, 0), levels = c("0", "1"))
  # Important - must specify levels to account for possibility of all
  # observations being classified into the same class in smaller samples
  return(out)
}
library(parallel)
results <- estimate_accuracy(
  two_year_recid ~ race + sex + age + juv_fel_count + juv_misd_count +
    priors_count + charge_degree..misd.fel.,
  mylogit,
  br,
  predictfn = mypred,
  nsample = 10,
  steps = 10,
  coreoffset = (detectCores() - 2)
)
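The comment in mypred about specifying levels can be seen in isolation with base R. The following sketch uses made-up predicted probabilities (hypothetical values, not model output) that all fall below the 0.5 threshold, as might happen in a small sample.

```r
# Hypothetical predicted probabilities, all below the 0.5 cutoff.
pred_probs <- c(0.10, 0.20, 0.30)

# Without explicit levels, the unobserved class "1" is dropped from the factor.
no_levels <- factor(ifelse(pred_probs > 0.5, 1, 0))

# With explicit levels, both classes are retained even if one is never predicted.
with_levels <- factor(ifelse(pred_probs > 0.5, 1, 0), levels = c("0", "1"))

print(levels(no_levels))
print(levels(with_levels))
print(table(with_levels))
```

Keeping both levels means a cross-tabulation of predictions against observed outcomes always has a row or column for each class, even when one class is never predicted.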