feature_effects {effectplots}    R Documentation

Feature Effects

Description

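This is the main function of the package. By default, it calculates the following statistics per feature X over values/bins:

  • Average observed y values.

  • Average predictions.

  • Average residuals (bias), if both y and predictions are available.

  • Partial dependence.

  • Accumulated local effects (ALE).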

Additionally, the corresponding counts/weights and the standard deviations of observed y and residuals are calculated.

Numeric X with more than discrete_m = 5 distinct values are binned as in graphics::hist() via breaks. Before calculating bins, outliers are capped at 2 IQR beyond the quartiles (see outlier_iqr).
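The capping and binning steps can be sketched in base R as follows. This is only an illustration under stated assumptions: cap_iqr() is a hypothetical helper, not a function of the package, and the package's internal logic (for example, sampling at most 10k observations for the quartiles) may differ.

```r
# Hypothetical helper: cap values lying more than m * IQR beyond the quartiles
cap_iqr <- function(x, m = 2) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), names = FALSE, na.rm = TRUE)
  r <- m * (q[2] - q[1])
  pmin(pmax(x, q[1] - r), q[2] + r)
}

x <- c(1:20, 1000)          # 20 regular values and one extreme value
x_cap <- cap_iqr(x, m = 2)  # the extreme value is pulled back to Q3 + 2 * IQR

# Bin the capped values with breaks chosen as in graphics::hist()
br <- graphics::hist(x_cap, breaks = "Sturges", plot = FALSE)$breaks
bins <- cut(x_cap, breaks = br, include.lowest = TRUE, right = TRUE)
table(bins)
```

Capping first keeps a single extreme value from stretching the break points over a mostly empty range.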

All averages and standard deviations are weighted by optional weights w.
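As a rough illustration of weighted averages per bin, the following base R sketch uses stats::weighted.mean() on made-up data. The vectors y, w, and bin are assumptions for this example only, and the package's own implementation may differ.

```r
# Made-up data: response, case weights, and a bin label per observation
y <- c(1, 2, 3, 4, 5, 6)
w <- c(1, 1, 2, 2, 1, 1)
bin <- factor(c("a", "a", "a", "b", "b", "b"))

# Weighted average of y within each bin
idx <- split(seq_along(y), bin)
wmean <- sapply(idx, function(i) stats::weighted.mean(y[i], w[i]))

# Sum of weights per bin (the weighted analogue of counts)
wsum <- sapply(idx, function(i) sum(w[i]))
wmean
wsum
```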

If you need only one specific statistic, you can use the simplified APIs of average_observed(), average_predicted(), bias(), partial_dependence(), and ale().

Usage

feature_effects(object, ...)

## Default S3 method:
feature_effects(
  object,
  v,
  data,
  y = NULL,
  pred = NULL,
  pred_fun = stats::predict,
  trafo = NULL,
  which_pred = NULL,
  w = NULL,
  breaks = "Sturges",
  right = TRUE,
  discrete_m = 5L,
  outlier_iqr = 2,
  calc_pred = TRUE,
  pd_n = 500L,
  ale_n = 50000L,
  ale_bin_size = 200L,
  seed = NULL,
  ...
)

## S3 method for class 'ranger'
feature_effects(
  object,
  v,
  data,
  y = NULL,
  pred = NULL,
  pred_fun = NULL,
  trafo = NULL,
  which_pred = NULL,
  w = NULL,
  breaks = "Sturges",
  right = TRUE,
  discrete_m = 5L,
  outlier_iqr = 2,
  calc_pred = TRUE,
  pd_n = 500L,
  ale_n = 50000L,
  ale_bin_size = 200L,
  ...
)

## S3 method for class 'explainer'
feature_effects(
  object,
  v = colnames(data),
  data = object$data,
  y = object$y,
  pred = NULL,
  pred_fun = object$predict_function,
  trafo = NULL,
  which_pred = NULL,
  w = object$weights,
  breaks = "Sturges",
  right = TRUE,
  discrete_m = 5L,
  outlier_iqr = 2,
  calc_pred = TRUE,
  pd_n = 500L,
  ale_n = 50000L,
  ale_bin_size = 200L,
  ...
)

Arguments

object

Fitted model.

...

Further arguments passed to pred_fun(), e.g., type = "response" in a glm() or (typically) prob = TRUE in classification models.

v

Vector of variable names for which to calculate statistics.

data

Matrix or data.frame.

y

Numeric vector with observed values of the response. Can also be a column name in data. Omitted if NULL (default).

pred

Numeric vector with predictions. If NULL, it is calculated as pred_fun(object, data, ...). Supplying it saves time if feature_effects() is to be called multiple times.

pred_fun

Prediction function, by default stats::predict. The function takes three arguments (names irrelevant): object, data, and ....

trafo

How should predictions be transformed? A function or NULL (default). Examples are log (to switch to link scale) or exp (to switch from link scale to the original scale).
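A minimal sketch of how a prediction function and a transformation fit together, assuming a linear model fitted on the log scale. The names my_pred_fun and my_trafo are illustrative only; the sketch does not call the package.

```r
# Model fitted on the log scale of the response
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

# A prediction function with the expected three arguments: object, data, ...
my_pred_fun <- function(object, data, ...) {
  stats::predict(object, newdata = data, ...)
}

# Transformation that maps log-scale predictions back to the original scale
my_trafo <- exp

pred_log <- my_pred_fun(fit, mtcars)  # predictions on the log scale
pred <- my_trafo(pred_log)            # predictions on the scale of mpg
head(pred)
```

Functions of this shape could be supplied as pred_fun and trafo, respectively.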

which_pred

If the predictions are multivariate: which column to pick (integer or column name). By default NULL (picks last column).
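The column selection described here can be pictured with a small base R sketch. pick_pred() is a hypothetical helper written for this illustration, not part of the package, and the prediction matrix is made up.

```r
# Made-up multivariate predictions, e.g., class probabilities
pred <- cbind(no = c(0.8, 0.3), yes = c(0.2, 0.7))

# Hypothetical helper: pick one column; NULL selects the last one
pick_pred <- function(pred, which_pred = NULL) {
  if (is.null(dim(pred))) {
    return(pred)  # already a vector, nothing to pick
  }
  if (is.null(which_pred)) {
    which_pred <- ncol(pred)
  }
  pred[, which_pred]
}

pick_pred(pred)        # last column ("yes")
pick_pred(pred, "no")  # selection by column name
pick_pred(pred, 1L)    # selection by position
```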

w

Optional vector with case weights. Can also be a column name in data.

breaks

An integer, vector, string or function specifying the bins of the numeric X variables as in graphics::hist(). The default is "Sturges". To allow varying values of breaks across variables, it can be a list of the same length as v, or a named list with breaks for certain variables.
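The accepted forms of breaks mirror those of graphics::hist(). The base R sketch below shows the break points each form produces for one variable, followed by a named list of the kind described above. The object names are illustrative, and the list is only constructed here, not passed to the package.

```r
x <- iris$Petal.Length

# Integer: suggested number of bins
b_int <- graphics::hist(x, breaks = 5, plot = FALSE)$breaks
# Vector: explicit break points (must cover the range of x)
b_vec <- graphics::hist(x, breaks = c(1, 3, 5, 7), plot = FALSE)$breaks
# String: name of a rule for the number of bins
b_str <- graphics::hist(x, breaks = "Scott", plot = FALSE)$breaks
# Function: returns the number of bins (or the break points)
b_fun <- graphics::hist(x, breaks = function(z) 4, plot = FALSE)$breaks

# Named list giving breaks for selected variables only
breaks_by_var <- list(Petal.Length = c(1, 3, 5, 7), Sepal.Width = "Scott")
```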

right

Should bins be right-closed? The default is TRUE. Vectorized over v. Only relevant for numeric X.

discrete_m

Numeric X variables with up to this number of unique values are not binned but treated as a factor (after calculating partial dependence). The default is 5. Vectorized over v.
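The rule can be pictured with a one-line check in base R. is_discrete() is a hypothetical helper for illustration; how the package treats missing values when counting unique values is not covered here.

```r
# Hypothetical helper: TRUE if x has at most discrete_m unique values
is_discrete <- function(x, discrete_m = 5L) {
  length(unique(x)) <= discrete_m
}

is_discrete(mtcars$cyl)  # three unique values: treated as discrete
is_discrete(mtcars$mpg)  # many unique values: would be binned
```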

outlier_iqr

Outliers of a numeric X are capped via the boxplot rule, i.e., values farther than outlier_iqr * IQR from the quartiles are capped. The default of 2 is more conservative than the usual rule (1.5) to account for right-skewed distributions. Set to 0 or Inf for no capping. Note that at most 10k observations are sampled to calculate quartiles. Vectorized over v.

calc_pred

Should predictions be calculated? Default is TRUE. Only relevant if pred = NULL.

pd_n

Size of the data used for calculating partial dependence. The default is 500. For larger data (and w), pd_n rows are randomly sampled. Each variable specified by v uses the same subsample. Set to 0 to omit.
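To make the role of the subsample concrete, here is a minimal base R sketch of partial dependence on a sampled background dataset. It is an illustration of the general technique, not the package's implementation; the grid values and the small pd_n are arbitrary choices for this example.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Subsample the background data if it has more than pd_n rows
pd_n <- 20L
set.seed(1)
bg <- mtcars
if (nrow(bg) > pd_n) {
  bg <- bg[sample(nrow(bg), pd_n), , drop = FALSE]
}

# Partial dependence of wt: fix wt at each grid value, average the predictions
grid <- c(2, 3, 4)
pd <- sapply(grid, function(g) {
  d <- bg
  d$wt <- g
  mean(stats::predict(fit, newdata = d))
})
pd
```

For this additive linear model, consecutive partial dependence values differ by the coefficient of wt times the grid spacing, whichever rows are sampled.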

ale_n

Size of the data used for calculating ALE. The default is 50000. For larger data (and w), ale_n rows are randomly sampled. Each variable specified by v uses the same subsample. Set to 0 to omit.

ale_bin_size

Maximal number of observations used per bin for ALE calculations. If there are more observations in a bin, ale_bin_size indices are randomly sampled. The default is 200. Applied after subsampling regarding ale_n.
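The following base R sketch shows the general idea of uncentered ALE with per-bin subsampling. It is a simplified illustration, not the package's algorithm: the break points, the tiny ale_bin_size, and the absence of centering are assumptions made for this example.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

br <- c(1, 2, 3, 4, 6)  # bin edges covering the range of wt
bin <- cut(mtcars$wt, breaks = br, include.lowest = TRUE)
ale_bin_size <- 5L      # small value so that subsampling is triggered
set.seed(1)

# Local effect per bin: average prediction change when moving wt
# from the lower to the upper bin edge, using only rows in that bin
local_eff <- sapply(seq_len(nlevels(bin)), function(k) {
  d <- mtcars[as.integer(bin) == k, , drop = FALSE]
  if (nrow(d) > ale_bin_size) {
    d <- d[sample(nrow(d), ale_bin_size), , drop = FALSE]
  }
  lo <- d
  hi <- d
  lo$wt <- br[k]
  hi$wt <- br[k + 1]
  mean(stats::predict(fit, newdata = hi) - stats::predict(fit, newdata = lo))
})

# Accumulate the local effects over the bins (uncentered ALE)
ale <- c(0, cumsum(local_eff))
ale
```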

seed

Optional random seed (an integer) used for:

  • Partial dependence: select background data if n > pd_n.

  • ALE: select background data if n > ale_n, and subsample bins with more than ale_bin_size observations.

  • Capping X: quartiles are calculated from a sample of at most 10k observations.

Value

A list (of class "EffectData") with a data.frame of statistics per feature. Use single bracket subsetting to select part of the output.

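Methods (by class)

  • feature_effects(default): Default method.

  • feature_effects(ranger): Method for ranger models.

  • feature_effects(explainer): Method for DALEX explainers.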

References

  1. Molnar, Christoph. 2019. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book/.

  2. Friedman, Jerome H. 2001. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29 (5): 1189-1232. doi:10.1214/aos/1013203451.

  3. Apley, Daniel W., and Jingyu Zhu. 2020. Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models. Journal of the Royal Statistical Society Series B: Statistical Methodology 82 (4): 1059–1086. doi:10.1111/rssb.12377.

See Also

plot.EffectData(), update.EffectData(), partial_dependence(), ale(), average_observed(), average_predicted(), bias()

Examples

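# Fit a linear model and calculate the effects of all features
fit <- lm(Sepal.Length ~ ., data = iris)
xvars <- colnames(iris)[2:5]
M <- feature_effects(fit, v = xvars, data = iris, y = "Sepal.Length", breaks = 5)
M
# Sort the features via the partial dependence statistic, then plot
M |> update(sort = "pd") |> plot(share_y = "all")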

[Package effectplots version 0.1.0 Index]