estimate_accuracy {scR}    R Documentation

Estimate sample complexity bounds for a binary classification algorithm using either simulated or user-supplied data.

Description

Estimate sample complexity bounds for a binary classification algorithm using either simulated or user-supplied data.

Usage

estimate_accuracy(
  formula,
  model,
  data = NULL,
  dim = NULL,
  maxn = NULL,
  upperlimit = NULL,
  nsample = 30,
  steps = 50,
  eta = 0.05,
  delta = 0.05,
  epsilon = 0.05,
  predictfn = NULL,
  power = FALSE,
  effect_size = NULL,
  powersims = NULL,
  alpha = 0.05,
  parallel = TRUE,
  coreoffset = 0,
  packages = list(),
  method = c("Uniform", "Class Imbalance"),
  p = NULL,
  ...
)

Arguments

formula

A formula that can be passed to the model argument to define the classification algorithm.

model

A binary classification model supplied by the user. Must accept the arguments formula and data.

data

Optional. A rectangular data.frame object giving the full data from which samples are to be drawn. If left unspecified, gendata() is called to produce synthetic data with an appropriate structure.

dim

Required if data is unspecified. Gives the horizontal dimension of the data (number of predictor variables) to be generated.

maxn

Required if data is unspecified. Gives the vertical dimension of the data (number of observations) to be generated.

upperlimit

Optional. If data is supplied, a positive integer giving the maximum sample size to be simulated.

nsample

A positive integer giving the number of samples to be generated for each value of $n$. Larger values give more accurate results.

steps

A positive integer giving the number of values of $n$ for which simulations should be conducted. Larger values give more accurate results.

eta

A real number between 0 and 1 giving the probability of misclassification error in the training data.

delta

A real number between 0 and 1 giving the targeted maximum probability of observing an out-of-sample (OOS) error rate higher than epsilon.

epsilon

A real number between 0 and 1 giving the targeted maximum out-of-sample (OOS) error rate.

predictfn

An optional user-defined function giving a custom predict method. If also using a user-defined model, the model should output an object of class "svrclass" to avoid errors.

power

A logical indicating whether experimental power based on the predictions should also be reported.

effect_size

If power is TRUE, a real number indicating the scaled effect size the user would like to be able to detect.

powersims

If power is TRUE, an integer indicating the number of simulations to be conducted at each step to calculate power.

alpha

If power is TRUE, a real number between 0 and 1 indicating the probability of Type I error to be used for hypothesis testing. Default is 0.05.

parallel

A logical indicating whether or not to use parallel processing.

coreoffset

If parallel is TRUE, a non-negative integer indicating the number of threads to be left unused. Should not be larger than the number of CPU cores.

packages

A list of packages that need to be loaded in order to run model.

method

An optional string stating the distribution from which data is to be generated, either "Uniform" or "Class Imbalance". The default is i.i.d. uniform sampling. Can also take a function outputting a vector of probabilities if the user wishes to specify a custom distribution.

p

If method is 'Class Imbalance', gives the degree of weight placed on the positive class.

...

Additional arguments that need to be passed to model.
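
As a rough illustration of how coreoffset relates to the machine's core count, the sketch below subtracts the offset from the value reported by parallel::detectCores(). The helper name n_workers and the floor of one worker are assumptions made for this illustration, not the package's internal logic.

```r
library(parallel)

# Hypothetical helper (not part of scR): number of worker threads left
# after reserving `coreoffset` threads, never dropping below one.
n_workers <- function(coreoffset = 0, total = detectCores()) {
  if (is.na(total)) total <- 1L  # detectCores() can return NA
  max(1L, as.integer(total) - as.integer(coreoffset))
}

n_workers(coreoffset = 2, total = 8)  # 6
n_workers(coreoffset = 8, total = 8)  # 1 (floor assumed in this sketch)
```

Under this reading, a call that sets coreoffset = detectCores() - 2 would leave two threads available for the simulations.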

Value

A list containing two named elements. Raw gives the exact output of the simulations, while Summary gives a table of accuracy metrics, including the achieved levels of $\epsilon$ and $\delta$ given the specified values. Alternative values can be calculated using getpac().
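
Read together, the delta and epsilon arguments describe a joint target on the out-of-sample error. As a sketch, writing $\mathrm{err}_{\mathrm{OOS}}(n)$ (a symbol introduced here for illustration only) for the out-of-sample error rate of a model trained on $n$ observations, the target can be stated as:

```latex
\Pr\left[\,\mathrm{err}_{\mathrm{OOS}}(n) > \epsilon\,\right] \le \delta
```

With the defaults epsilon = 0.05 and delta = 0.05, this asks that the out-of-sample error rate exceed 5% with probability at most 5%.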

See Also

plot_accuracy(), to represent simulations visually; getpac(), to calculate summaries for alternate values of $\epsilon$ and $\delta$ without conducting a new simulation; and gendata(), to generate synthetic datasets.

Examples

mylogit <- function(formula, data) {
  m <- structure(
    glm(formula = formula, data = data, family = binomial(link = "logit")),
    # IMPORTANT: the object must carry the class "svrclass" to work correctly
    class = c("svrclass", "glm")
  )
  return(m)
}
mypred <- function(m, newdata) {
  out <- predict.glm(m, newdata, type = "response")
  # IMPORTANT: specify levels explicitly, since in smaller samples all
  # observations may be classified into the same class
  out <- factor(ifelse(out > 0.5, 1, 0), levels = c("0", "1"))
  return(out)
}

library(parallel)
results <- estimate_accuracy(
  two_year_recid ~ race + sex + age + juv_fel_count + juv_misd_count +
    priors_count + charge_degree..misd.fel.,
  mylogit,
  br,
  predictfn = mypred,
  nsample = 10,
  steps = 10,
  coreoffset = (detectCores() - 2)
)
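
The comment in mypred about specifying levels can be checked in base R. The sketch below uses made-up probabilities (illustrative values, not package output) that all fall below the 0.5 cutoff, and compares the factor built with and without explicit levels.

```r
# Illustrative predicted probabilities, all below the 0.5 cutoff
probs <- c(0.10, 0.32, 0.45)

# Without explicit levels, the unused class "1" is dropped
no_levels <- factor(ifelse(probs > 0.5, 1, 0))
levels(no_levels)    # "0"

# With explicit levels, both classes are retained
with_levels <- factor(ifelse(probs > 0.5, 1, 0), levels = c("0", "1"))
levels(with_levels)  # "0" "1"
table(with_levels)   # counts for both classes, including a zero for "1"
```

Keeping both levels means that tabulations of the predictions have the same shape at every sample size, even when one class is never predicted.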


[Package scR version 0.1.0 Index]