Fitting Lasso/Ridge Regression with Abundance-Based Constraints (ABCs)
Source:R/cv.penlmabc.R
cv.penlmabc.Rd
cv.penlmabc
fits penalized (lasso or ridge) linear models
using abundance-based constraints (ABCs). For penalized regression
with categorical covariates, ABCs eliminate harmful biases
(e.g., with respect to race, sex, religion, etc.), provide
more meaningful notions of sparsity, and improve interpretability.
Usage
cv.penlmabc(
formula,
data,
type = "lasso",
lambda_path = NULL,
K = 10,
props = NULL,
plot = FALSE
)
Arguments
- formula
an object of class "
formula()
" (or one that can be coerced to that class); a symbolic description of the model to be fitted.- data
an optional data frame (or object coercible by
as.data.frame
to a data frame) containing the variables in the model.- type
either "lasso" or "ridge"
- lambda_path
optional vector of tuning parameters; defaults are inherited from
glmnet
(for ridge) orgenlasso
(for lasso)- K
number of folds for cross-validation; default is 10
- props
an optional named list with an entry for each named categorical variable in the model, specifying the proportions of each category. By default,
props
will be calculated from the empirical proportions in the data.- plot
logical; if TRUE, include a plot of the cross-validated MSE across
lambda_path
values
Details
cv.penlmabc
solves the penalized least squares problem
\(|| y - X\beta||^2 + \lambda \sum_j |\beta_j|^\gamma\)
for lasso (\(\gamma=1\)) or ridge (\(\gamma=2\)) regression, and specifically using ABCs for categorical covariates and interactions.
Default strategies for categorical covariates typically
use reference group encoding (RGE), which removes
the coefficient for one group for each
categorical variable (and their interactions).
However, because lasso and ridge shrink coefficients
toward zero, this implies that all other coefficients are
biased toward their reference group. Such bias is
clearly problematic for variables such as race, sex, and other
protected groups, but also it attenuates the estimated differences
among groups, which can obscure important group-specific
effects (e.g., for x:race
). The penalized estimates and predictions
under RGE are dependent on the choice of the reference group.
Alternatively, it is possible to fit penalized least squares with an overparametrized model, i.e., without deleting a reference group (or applying any types of constraints). However, the parameters are not identifiable and thus not interpretable in general. With lasso estimation, this approach empirically tends to select a reference group, and therefore suffers from the same significant problems as RGE.
Instead, ABCs provide a parametrization of the main
effects as "group-averaged" effects, with interaction terms
(e.g., x:race
) as "group-specific deviations".
These estimators are not biased toward any single group
and the predictive performance does not depend on the
choice of a reference group. ABCs provide appealing estimation
invariance properties for models with or without interactions
(e.g., x:race
), and therefore offer a natural parametrization
for sparse (e.g., lasso) estimation with categorical covariates
and their interactions.
Value
cv.penlmabc
returns a list with the following elements:
coefficients
estimated coefficients at each tuning parameter inlambda_path
lambda_path
vector of tuning parametersdf
degrees of freedom at each tuning parameter inlambda_path
mse
cross-validated mean squared error (MSE) at each tuning parameter inlambda_path
se
standard error of the CV-MSE at each tuning parameter inlambda_path
ind.min
index of the minimum CV-MSE inlambda_path
ind.1se
index of the one-standard error rule inlambda_path
lambda.min
tuning parameter that achieves the minimum CV-MSElambda.1se
tuning parameter that achieves the one-standard error rule
Examples
# Example lasso fit:
fit <- cv.penlmabc(Sepal.Length ~ Petal.Length + Species + Petal.Length*Species, data = iris)
names(fit)
#> [1] "coefficients" "lambda_path" "df" "mse" "se"
#> [6] "ind.min" "ind.1se" "lambda.min" "lambda.1se"
# Estimated coefficients at the one-standard error rule:
coef(fit)[,fit$ind.1se]
#> (Intercept) Petal.Length
#> 5.4684391 0.5527580
#> Speciessetosa Speciesversicolor
#> 0.0000000 0.0000000
#> Speciesvirginica Petal.Length:Speciessetosa
#> 0.0000000 -0.3454960
#> Petal.Length:Speciesversicolor Petal.Length:Speciesvirginica
#> 0.2737565 0.0717395