Fitting Lasso/Ridge Regression with Abundance-Based Constraints (ABCs)
Source: R/cv.penlmabc.R
cv.penlmabc fits penalized (lasso or ridge) linear models
using abundance-based constraints (ABCs). For penalized regression
with categorical covariates, ABCs eliminate harmful biases
(e.g., with respect to race, sex, religion, etc.), provide
more meaningful notions of sparsity, and improve interpretability.
Usage
cv.penlmabc(
formula,
data,
type = "lasso",
lambda_path = NULL,
K = 10,
props = NULL,
plot = FALSE
)
Arguments
- formula
an object of class "formula" (or one that can be coerced to that class); a symbolic description of the model to be fitted.
- data
an optional data frame (or object coercible by as.data.frame to a data frame) containing the variables in the model.
- type
either "lasso" or "ridge"
- lambda_path
optional vector of tuning parameters; defaults are inherited from glmnet (for ridge) or genlasso (for lasso)
- K
number of folds for cross-validation; default is 10
- props
an optional named list with an entry for each named categorical variable in the model, specifying the proportions of each category. By default, props will be calculated from the empirical proportions in the data.
- plot
logical; if TRUE, include a plot of the cross-validated MSE across lambda_path values
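For illustration, a call that sets these arguments explicitly might look as follows; the exact format of props is an assumption based on the description above (a named list containing one named vector of proportions per categorical variable):

# Hypothetical call with explicit settings; the format of `props` is assumed
# from the argument description, not confirmed by the package:
fit_ridge <- cv.penlmabc(
  Sepal.Length ~ Petal.Length + Species,
  data = iris,
  type = "ridge",
  K = 5,
  props = list(Species = c(setosa = 1/3, versicolor = 1/3, virginica = 1/3)),
  plot = TRUE
)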
Details
cv.penlmabc solves the penalized least squares problem
\(\|y - X\beta\|^2 + \lambda \sum_j |\beta_j|^\gamma\)
for lasso (\(\gamma=1\)) or ridge (\(\gamma=2\)) regression, specifically using ABCs for categorical covariates and their interactions.
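For concreteness, the criterion can be written as a small R helper; this is only a sketch of the objective (cv.penlmabc performs the constrained optimization internally):

# Sketch of the penalized least squares objective (not part of the package):
pen_obj <- function(y, X, beta, lambda, gamma = 1) {
  # residual sum of squares plus the lasso (gamma = 1) or ridge (gamma = 2) penalty
  sum((y - X %*% beta)^2) + lambda * sum(abs(beta)^gamma)
}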
Default strategies for categorical covariates typically
use reference group encoding (RGE), which removes
the coefficient for one group of each
categorical variable (and its interactions).
However, because lasso and ridge shrink coefficients
toward zero, all remaining coefficients are
biased toward the reference group. Such bias is
clearly problematic for variables such as race, sex, and other
protected attributes, but it also attenuates the estimated differences
among groups, which can obscure important group-specific
effects (e.g., for x:race). Further, the penalized estimates and predictions
under RGE depend on the choice of the reference group.
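The reference group encoding is visible in R's default treatment contrasts, which drop one level of Species (setosa, the first level) so that every remaining coefficient is measured, and shrunk, relative to that group:

# Default treatment contrasts (RGE): the reference level setosa gets no columns,
# so its main effect and slope are absorbed into the intercept and Petal.Length.
colnames(model.matrix(~ Petal.Length * Species, data = iris))
# no "Speciessetosa" or "Petal.Length:Speciessetosa" columns appear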
Alternatively, it is possible to fit penalized least squares with an overparametrized model, i.e., without deleting a reference group (or applying any constraints). However, the parameters are then not identifiable and thus not interpretable in general. With lasso estimation, this approach empirically tends to select a reference group, and therefore suffers from the same problems as RGE.
Instead, ABCs provide a parametrization of the main
effects as "group-averaged" effects, with interaction terms
(e.g., x:race) as "group-specific deviations".
These estimators are not biased toward any single group
and the predictive performance does not depend on the
choice of a reference group. ABCs provide appealing estimation
invariance properties for models with or without interactions
(e.g., x:race), and therefore offer a natural parametrization
for sparse (e.g., lasso) estimation with categorical covariates
and their interactions.
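For intuition, when the group proportions are equal (as for Species in iris), the ABCs reduce to sum-to-zero constraints, under which the Petal.Length main effect is the group-averaged slope. A rough, unpenalized sketch using lm() (the \(\lambda = 0\) case; cv.penlmabc additionally accommodates unequal, abundance-based proportions and the penalty):

# Unpenalized illustration of the group-averaged parametrization:
fit_avg <- lm(Sepal.Length ~ Petal.Length * Species, data = iris,
              contrasts = list(Species = contr.sum))
coef(fit_avg)["Petal.Length"]  # Species-averaged slope for Petal.Length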
Value
cv.penlmabc returns a list with the following elements:
- coefficients
estimated coefficients at each tuning parameter in lambda_path
- lambda_path
vector of tuning parameters
- df
degrees of freedom at each tuning parameter in lambda_path
- mse
cross-validated mean squared error (MSE) at each tuning parameter in lambda_path
- se
standard error of the CV-MSE at each tuning parameter in lambda_path
- ind.min
index of the minimum CV-MSE in lambda_path
- ind.1se
index of the one-standard-error rule in lambda_path
- lambda.min
tuning parameter that achieves the minimum CV-MSE
- lambda.1se
tuning parameter that achieves the one-standard-error rule
Examples
# Example lasso fit:
fit <- cv.penlmabc(Sepal.Length ~ Petal.Length + Species + Petal.Length*Species, data = iris)
names(fit)
#> [1] "coefficients" "lambda_path" "df" "mse" "se"
#> [6] "ind.min" "ind.1se" "lambda.min" "lambda.1se"
# Estimated coefficients at the one-standard error rule:
coef(fit)[,fit$ind.1se]
#> (Intercept) Petal.Length
#> 5.4684391 0.5527580
#> Speciessetosa Speciesversicolor
#> 0.0000000 0.0000000
#> Speciesvirginica Petal.Length:Speciessetosa
#> 0.0000000 -0.3454960
#> Petal.Length:Speciesversicolor Petal.Length:Speciesvirginica
#> 0.2737565 0.0717395
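Because the Species groups in iris have equal proportions, the ABC constraint implies that the group-specific deviations shown above sum to zero, which can be checked directly (assuming coef(fit) returns the coefficient matrix printed above, with rows named as shown):

# Check the ABC constraint on the interaction terms: with equal Species
# proportions, the group-specific deviations sum to zero
# (-0.3454960 + 0.2737565 + 0.0717395 is numerically zero).
sum(coef(fit)[grep("Petal.Length:Species", rownames(coef(fit))), fit$ind.1se])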