distfreereg {distfreereg} | R Documentation |
Distribution-Free Parametric Regression Testing
Description
Conduct distribution-free parametric regression testing using the process introduced in Khmaladze (2021). A parametric model for the conditional mean (specified by test_mean) is checked against the data by fitting the model, transforming the resulting residuals, and then calculating a statistic on the empirical partial sum process of the transformed residuals. The statistic's null distribution can be simulated in a straightforward way, thereby producing a p-value.
Using f to denote the mean function being tested, the specific test has the following null and alternative hypotheses:
H_0\colon\ \exists\theta\in\Theta\subseteq\mathbb R^p \mathrel{\bigl|} \textrm{E}(Y| X)=f(X;\theta)
\quad\hbox{against}\quad
H_1\colon\ \forall\theta\in\Theta\subseteq\mathbb R^p \mathrel{\bigl|} \textrm{E}(Y| X)\neq f(X;\theta).
See the vignette An Introduction to the distfreereg Package for an overview.
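The procedure described above can be sketched in base R. This is a minimal illustration, not the package's implementation: it assumes that the transformed residuals behave like independent standard normal draws under the null hypothesis, and it assumes conventional forms for the two statistics (the maximum absolute value and the mean square of the scaled partial sums). The helper names epsp, KS, and CvM are hypothetical.

```r
set.seed(1)
n <- 200

# Stand-in for transformed residuals (assumption: approximately independent
# standard normal under the null hypothesis).
e <- rnorm(n)

# Scaled partial sum process and two illustrative statistics (assumed forms).
epsp <- function(e) cumsum(e) / sqrt(length(e))
KS   <- function(w) max(abs(w))
CvM  <- function(w) mean(w^2)

w_obs <- epsp(e)
observed <- c(KS = KS(w_obs), CvM = CvM(w_obs))

# Simulate the null distribution by repeating the calculation on fresh noise.
B <- 2000
sims <- replicate(B, {
  w <- epsp(rnorm(n))
  c(KS = KS(w), CvM = CvM(w))
})

# Monte Carlo p-values: proportion of simulated statistics at least as large
# as the observed ones (rows of sims line up with the elements of observed).
p_values <- rowMeans(sims >= observed)
p_values
```

Because the simulation uses only standard normal noise, it does not depend on the tested mean function, which is what makes the simulated null distribution reusable.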
Usage
distfreereg(test_mean, ordering = "simplex", group = FALSE,
            stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
            ...)
## Default S3 method:
distfreereg(test_mean = NULL, ordering = "simplex", group = FALSE,
            stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
            ..., Y, X = NULL, covariance, J, fitted_values)
## S3 method for class 'formula'
distfreereg(test_mean, ordering = "simplex", group = FALSE,
            stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
            ..., data, covariance = NULL, method = "lm", theta_init = NULL)
## S3 method for class 'function'
distfreereg(test_mean, ordering = "simplex", group = FALSE,
            stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
            ..., Y, X = NULL, covariance, theta_init)
## S3 method for class 'lm'
distfreereg(test_mean, ordering = "simplex", group = FALSE,
            stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
            ...)
## S3 method for class 'nls'
distfreereg(test_mean, ordering = "simplex", group = FALSE,
            stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
            ...)
Arguments
test_mean |
A specification of the mean function to be tested. Methods exist for objects of classes |
covariance |
Named list; specifies the covariance structure of the model's error distribution. Valid element names are "
See details. |
ordering |
A character string or a list; specifies how to order the residuals to form the empirical partial sum process. Valid character strings are:
If |
group |
Logical; if |
J |
Numeric matrix; specifies the Jacobian of the function evaluated at the covariates and the estimated parameters. |
fitted_values |
Numeric vector; specifies the model's fitted values. |
stat |
Character vector; specifies the names of the functions used to calculate the desired statistics. By default, a Kolmogorov–Smirnov statistic and a Cramér–von Mises-like statistic are calculated:
where |
B |
Numeric vector of length one; specifies the Monte Carlo sample size used when simulating statistics. Silently converted to integer. |
control |
Optional named list of elements that control the details of the algorithm's computations. The following elements are accepted for all methods:
The following named elements, all but the first of which control the computation of the generalized least squares estimate of the parameter vector, are accepted for the
Finally, the following element is available for the |
override |
Optional named list of arguments that override internally calculated values. Used primarily by
|
verbose |
Logical; if |
... |
Additional arguments to pass to various methods; should be empty except in a call to the generic function. |
Y |
Numeric vector of observations. A matrix value is silently converted to a vector. |
X |
Optional numeric matrix of covariates. A vector value is converted to a single-column matrix with a warning. |
method |
Character vector; specifies the function to use for fitting the model when |
theta_init |
Numeric vector; specifies the starting parameter values passed to the optimizing function to be used to estimate the parameter vector. Must be |
data |
Optional data frame of covariate values; required for |
Details
This function implements distribution-free parametric regression testing. The model is specified by a mean structure and a covariance structure.
The mean structure is specified by the argument test_mean. This can be a function, formula, lm object, nls object, or NULL.
If test_mean is a function, then it must have one or two arguments: either theta only, or theta and either X (uppercase) or x (lowercase). An uppercase X is interpreted in the function definition as a matrix, while a lowercase x is interpreted as a vector. (See the examples and this vignette.) The primary reason to use a lowercase x is to allow for a function definition using an R function that is not vectorized. In general, an uppercase X should be preferred for speed.
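The difference between the two forms can be seen in base R. The sketch below defines the same mean function both ways and evaluates the lowercase version one row at a time with apply(); the apply() call is only an illustration of row-wise evaluation, not a description of how distfreereg() evaluates the function internally. The names f_upper and f_lower are hypothetical.

```r
set.seed(2)
X <- matrix(runif(10, min = 1, max = 5), nrow = 5)
theta <- c(2, 5)

# Uppercase X: the function receives the whole covariate matrix.
f_upper <- function(X, theta) X[, 1]^theta[1] + theta[2] * X[, 2]

# Lowercase x: the function receives a single observation (one row) as a vector.
f_lower <- function(x, theta) x[1]^theta[1] + theta[2] * x[2]

vals_upper <- f_upper(X, theta)                    # one vectorized call
vals_lower <- apply(X, 1, f_lower, theta = theta)  # one call per row

all.equal(vals_upper, vals_lower)
```

The vectorized form makes one call for all rows, which is why it is generally faster.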
If test_mean is an lm or nls object, then the covariance structure is obtained from the supplied model.
If test_mean is a formula, then it must be a formula that can be passed to lm or nls, and the data argument must be specified. The appropriate model will be created, and then sent back to distfreereg() for method dispatch.
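As a hedged illustration of the formula and lm routes, the following sketch builds a small simulated data set and calls distfreereg() both ways. The data and variable names are invented for the example, and the calls are guarded so that the code still runs when the package is not installed. Based on the description above, the two calls are expected to test the same model, but that is not verified here.

```r
set.seed(3)
n <- 100
df <- data.frame(x = runif(n, min = 1, max = 5))
df$y <- 1 + 2 * df$x + rnorm(n)

form <- y ~ x
fit <- lm(form, data = df)

if (requireNamespace("distfreereg", quietly = TRUE)) {
  # Formula method: the model is fit internally (method = "lm" by default).
  dfr_form <- distfreereg::distfreereg(test_mean = form, data = df,
                                       verbose = FALSE)
  # lm method: the already fitted model is supplied directly.
  dfr_lm <- distfreereg::distfreereg(test_mean = fit, verbose = FALSE)
}
```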
The function method estimates parameter values, and then uses those to evaluate the Jacobian of the mean function and to calculate fitted values. It then calls the default method, which does not use test_mean. The default method also allows the user to implement the algorithm even when the mean structure is not specified in R. (This is useful if a particularly complicated function is defined in another language and cannot easily be copied into R.) It requires specifying the vector of fitted values and the Jacobian matrix of the mean function evaluated at the estimated parameter values.
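The inputs that the default method needs can be prepared by hand. The sketch below is one possible workflow under stated assumptions: the parameter estimate comes from a weighted least squares fit with optim() (standing in for an external tool), and the Jacobian is written out analytically for this particular mean function. The final call is guarded so that the code runs without the package installed.

```r
set.seed(4)
n <- 100
X <- matrix(runif(2 * n, min = 1, max = 5), nrow = n)
Sig <- runif(n, min = 1, max = 3)
f <- function(X, theta) X[, 1]^theta[1] + theta[2] * X[, 2]
Y <- f(X, c(2, 5)) + rnorm(n, sd = sqrt(Sig))

# Weighted least squares fit done outside distfreereg() (illustrative only).
wss <- function(theta) sum((Y - f(X, theta))^2 / Sig)
theta_hat <- optim(c(1, 1), wss)$par

# Fitted values and analytic Jacobian at theta_hat:
# one row per observation, one column per parameter.
fitted_vals <- f(X, theta_hat)
J <- cbind(X[, 1]^theta_hat[1] * log(X[, 1]),  # derivative w.r.t. theta[1]
           X[, 2])                             # derivative w.r.t. theta[2]

if (requireNamespace("distfreereg", quietly = TRUE)) {
  dfr_default <- distfreereg::distfreereg(Y = Y, X = X,
                                          covariance = list(Sigma = Sig),
                                          J = J, fitted_values = fitted_vals,
                                          verbose = FALSE)
}
```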
The covariance structure for Y|X must be specified using the covariance argument for the function and default methods. It is optional for the formula method; when present in that case, it must specify a diagonal matrix which is converted internally into a vector of weights. For the lm and nls methods, the covariance is determined using the supplied object.
Any element of covariance can be a numeric matrix or a numeric vector. If it is a vector, its length must be either 1 or the sample size. This option is mathematically equivalent to setting a covariance list element to a diagonal matrix with the specified value(s) along the diagonal. Using vectors, when possible, is more efficient than using the corresponding matrix.
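The equivalence between a vector and the corresponding diagonal matrix can be checked directly in base R. The sketch below scales a vector of stand-in residuals both ways; treating this scaling as the relevant operation is an assumption made for illustration, and the variable names are hypothetical.

```r
set.seed(5)
n <- 6
sigma_vec <- runif(n, min = 1, max = 3)  # error variances as a vector
res <- rnorm(n)                          # stand-in residuals

# With the vector: elementwise division, no n-by-n matrix is ever formed.
scaled_vec <- res / sqrt(sigma_vec)

# With the equivalent diagonal matrix: for a diagonal matrix the elementwise
# square root is also the matrix square root, which is then inverted.
Sigma_mat <- diag(sigma_vec)
scaled_mat <- drop(solve(sqrt(Sigma_mat)) %*% res)

all.equal(scaled_vec, scaled_mat)
```

The matrix route stores and inverts an n-by-n object, which is where the efficiency difference comes from.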
Internally, distfreereg() only needs Q, so some efficiency can be gained by supplying that directly when available. When Q is not specified, it is calculated using whichever element is specified. When more than one of the other elements are specified, Q is calculated using the least expensive path, with no warning given if the specified elements are incompatible. (For example, if both Sigma and SqrtSigma elements are supplied to covariance, then Q is calculated using SqrtSigma with no attempt to verify that SqrtSigma is the matrix square root of Sigma.)
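Since compatibility is not verified, it can be worth checking it before the call. The sketch below builds mutually consistent matrices from a single Sigma and confirms the relationships numerically. It assumes that P is the inverse of Sigma, as stated in the Warnings section, and that Q is the inverse of SqrtSigma; the latter is an assumption made for this example.

```r
set.seed(6)
A <- matrix(rnorm(16), nrow = 4)
Sigma <- crossprod(A) + diag(4)  # symmetric positive definite by construction

# Symmetric matrix square root via the eigendecomposition.
eig <- eigen(Sigma, symmetric = TRUE)
SqrtSigma <- eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)

P <- solve(Sigma)      # inverse of Sigma
Q <- solve(SqrtSigma)  # assumed definition: inverse of SqrtSigma

# Numerical consistency checks to run before supplying several elements.
consistent <- c(
  sqrt_ok = isTRUE(all.equal(SqrtSigma %*% SqrtSigma, Sigma)),
  P_ok    = isTRUE(all.equal(P %*% Sigma, diag(4))),
  Q_ok    = isTRUE(all.equal(Q %*% Q, P))
)
consistent
```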
The override argument is used primarily by update.distfreereg to avoid unnecessary and potentially computationally expensive recomputation. This update method imports appropriate values automatically from a previously created object of class distfreereg, and therefore validation is not always done. Use manually with caution.
Value
An object of class distfreereg with the following components:
call |
The matched call. |
data |
A list containing |
test_mean |
The value supplied to the argument |
model |
The model built when using the |
covariance |
The list of covariance matrices, containing at least |
theta_hat |
The estimated parameter vector. |
optimization_output |
The output of |
fitted_values |
The vector of fitted values, |
J |
The Jacobian matrix. |
mu |
The mu matrix. |
r |
The matrix of transformation anchor vectors. |
r_tilde |
The matrix of modified transformation anchor vectors. |
residuals |
A named list of three vectors containing raw, sphered, and transformed residuals. |
res_order |
A numeric vector indicating the ordering of the residuals used to form the empirical partial sum process, in a format analogous to the output of |
epsp |
The empirical partial sum process formed by calculating the scaled partial sums of the transformed residuals ordered according to |
observed_stat |
A named list of the observed statistic(s) corresponding to the transformed residuals. |
mcsim_stats |
A named list, each element of which contains the values of a simulated statistic. |
p |
A named list with two elements: |
Warnings
Consistency between test_mean and theta_init is verified only indirectly. Uninformative errors can occur when, for example, theta_init does not have the correct length. The two most common error messages that arise in this case are "f_out cannot have NA values", indicating that theta_init is too short, and "Unable to invert square root of J^tJ", indicating that theta_init is too long. (Both of these errors might occur for other reasons, as well.) To be safe, always define test_mean to use every element of theta.
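Both failure modes can be reproduced in base R without calling distfreereg(). The sketch below uses a hypothetical finite-difference helper, num_jacobian, to show that an unused element of theta produces a zero column in the Jacobian, so that J^tJ is rank deficient, and that a theta that is too short produces NA values. It illustrates the cause of the errors only; it does not reproduce the package's own checks.

```r
set.seed(7)
X <- matrix(runif(40, min = 1, max = 5), nrow = 20)
f <- function(X, theta) X[, 1]^theta[1] + theta[2] * X[, 2]

# Central-difference Jacobian (hypothetical helper, for illustration only).
num_jacobian <- function(f, X, theta, h = 1e-6) {
  sapply(seq_along(theta), function(k) {
    d <- replace(numeric(length(theta)), k, h)
    (f(X, theta + d) - f(X, theta - d)) / (2 * h)
  })
}

J_ok   <- num_jacobian(f, X, c(2, 5))     # correct length
J_long <- num_jacobian(f, X, c(2, 5, 1))  # third element is never used by f

rank_ok   <- qr(crossprod(J_ok))$rank    # full rank: J^tJ is invertible
rank_long <- qr(crossprod(J_long))$rank  # rank deficient: J^tJ is singular

# Too short: theta[2] is NA, so every function value is NA.
has_na <- anyNA(f(X, 2))
```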
No verification of consistency is done when multiple elements of covariance are specified. For example, if P and Sigma are both specified, then the code will use only one of these, and will not verify that P is the inverse of Sigma.
When using the control argument element optimization_fun to specify an optimization function other than optim, the verification that theta_hat_name actually matches the name of an element of the optimization function's output is done only after the optimization has been done. If this optimization will likely take a long time, it is important to verify the value of theta_hat_name before running distfreereg().
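One inexpensive way to do this check is to run the optimizer on a trivial problem and inspect the names of its output. The sketch below does so for optim() and nlm() from base R; the helper has_theta_hat_name and the toy objective are hypothetical and exist only for this illustration.

```r
# Trivial objective with a known minimum at c(1, 2).
obj <- function(theta) sum((theta - c(1, 2))^2)

out_optim <- optim(c(0, 0), obj)
out_nlm   <- nlm(obj, c(0, 0))

# Does the optimizer's output contain an element with the intended name?
has_theta_hat_name <- function(out, theta_hat_name) {
  theta_hat_name %in% names(out)
}

check_optim    <- has_theta_hat_name(out_optim, "par")      # optim() stores the estimate in "par"
check_nlm_bad  <- has_theta_hat_name(out_nlm, "par")        # nlm() has no "par" element
check_nlm_good <- has_theta_hat_name(out_nlm, "estimate")   # nlm() uses "estimate"
```

Running this on a toy problem takes a fraction of a second, so a wrong name is caught before any expensive optimization starts.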
Author(s)
Jesse Miller
References
Khmaladze, Estate V. (2021). Distribution-free testing in linear and parametric regression. Annals of the Institute of Statistical Mathematics, 73(6), 1063–1087. doi:10.1007/s10463-021-00786-3
See Also
coef.distfreereg, confint.distfreereg, fitted.distfreereg, formula.distfreereg, plot.distfreereg, predict.distfreereg, print.distfreereg, residuals.distfreereg, update.distfreereg, vcov.distfreereg
Examples
set.seed(20240218)
n <- 1e2
# Mean function with an uppercase X: operates on the whole covariate matrix.
func <- function(X, theta) X[,1]^theta[1] + theta[2]*X[,2]
# Error variances, supplied as a vector (equivalent to a diagonal Sigma).
Sig <- runif(n, min = 1, max = 3)
theta <- c(2,5)
X <- matrix(runif(2*n, min = 1, max = 5), nrow = n)
Y <- X[,1]^theta[1] + theta[2]*X[,2] + rnorm(n, sd = sqrt(Sig))
(dfr <- distfreereg(Y = Y, X = X, test_mean = func,
                    covariance = list(Sigma = Sig),
                    theta_init = c(1,1)))
# Same mean function with a lowercase x: operates on one row at a time.
func_lower <- function(x, theta) x[1]^theta[1] + theta[2]*x[2]
(dfr_lower <- distfreereg(Y = Y, X = X, test_mean = func_lower,
                          covariance = list(Sigma = Sig),
                          theta_init = c(1,1)))
# Compare the observed statistics from the two specifications.
identical(dfr$observed_stats, dfr_lower$observed_stats)