lmabc is used to fit linear models using abundance-based constraints (ABCs).
For regression models with categorical covariates, ABCs
provide 1) more equitable output, 2) better statistical efficiency,
and 3) more interpretable parameters, especially in the
presence of interaction (or modifier) effects.
Arguments
- formula: an object of class "formula" (or one that can be coerced to that class); a symbolic description of the model to be fitted.
- data: an optional data frame (or an object coercible by as.data.frame() to a data frame) containing the variables in the model.
- props: an optional named list with an entry for each categorical variable in the model, specifying the proportions of each category. By default, props is calculated from the empirical proportions in the data.
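A sketch of supplying props, assuming the structure described above (a named list with one named proportion vector per categorical variable); the lmabc call is shown as an illustrative comment:

```r
# Default behavior: proportions are taken from the data
default_props <- prop.table(table(iris$Species))  # 1/3 each for iris

# Hypothetical user-supplied proportions (e.g., known population shares);
# the exact list structure is an assumption based on the description above
pop_props <- list(Species = c(setosa = 0.2, versicolor = 0.3, virginica = 0.5))
# fit <- lmabc(Sepal.Length ~ Petal.Length + Species, data = iris, props = pop_props)
```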
Details
An lmabc model is specified identically to the corresponding lm model. At this time, lmabc only supports a single response variable.
Differences from lm
The default approach for linear regression with categorical covariates is reference group encoding (RGE): one category is selected as the "reference group" for each categorical variable, and all results are presented relative to this group. However, this approach produces output that is inequitable (and potentially misleading), estimators that are statistically inefficient, and model parameters that are difficult to interpret.
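The RGE design matrix can be inspected directly in base R: with the default treatment contrasts, the first factor level becomes the reference group and is absorbed into the intercept (shown here with iris, using Species in place of a demographic variable):

```r
# One row per species: treatment coding drops the reference level (setosa),
# whose effect is absorbed into the intercept column.
X <- model.matrix(~ Species, data = iris[c(1, 51, 101), ])
colnames(X)
#> [1] "(Intercept)"       "Speciesversicolor" "Speciesvirginica"
```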
For example, suppose an analyst fits a model of the form y ~ x + race + x:race, where x is a continuous covariate. This model allows for race-specific coefficients on x (or slopes). However, RGE requires a reference group; for race, this is typically non-Hispanic White (NHW). This creates several problems:
- All output is presented relative to the reference (NHW) group: the results for (Intercept) and x refer to the intercept and slope, respectively, for the reference (NHW) group. This can lead to misleading conclusions about the overall x effect, since the dependence on the reference group is not made clear. All other race and x:race effects are presented relative to the reference (NHW) group, which elevates this group above all others.
- Compared to the main-only model y ~ x + race, adding the x:race interaction alters the estimates and standard errors of the x effect. This typically results in a loss of statistical power: the x effect now refers to a subset of the data (the reference group).
- Since all categorical covariates and interactions are anchored at a reference group, it becomes increasingly difficult to interpret the model parameters in the presence of multiple categorical variables (race, sex, etc.) and interactions.
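These issues can be seen with base R's lm() on iris, with Species standing in for the categorical covariate:

```r
fit_main <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
fit_int  <- lm(Sepal.Length ~ Petal.Length * Species, data = iris)

# Under RGE, adding the interaction redefines the Petal.Length coefficient:
# it shifts from a shared slope (~0.90) to the slope for the reference
# group alone (setosa, ~0.54).
coef(fit_main)["Petal.Length"]
coef(fit_int)["Petal.Length"]
```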
lmabc addresses these issues. ABCs parametrize the regression model so that the main effect terms, here (Intercept) and x, are averages of the race-specific terms. The notion of "average" derives from the props argument: these proportions can be supplied by the user (e.g., population proportions); otherwise they are set to the sample proportions for each group. ABCs provide several key advantages:
- Equitability: the main x effect is parameterized as a "group-averaged" effect. It does not elevate any single group (e.g., NHW). All other group-specific effects are relative to this global term, rather than to any single group.
- Efficiency: comparing the main-only model y ~ x + race with the race-modified model y ~ x + race + x:race, ABCs (with the default props) ensure that the main x effect estimates are (nearly) unchanged and the standard errors are (nearly) unchanged or smaller. Remarkably, there are no negative (statistical) consequences for including the interaction x:race, even if it is irrelevant.
- Interpretability: the x and x:race coefficients are "group-averaged" x-effects and "group-specific deviations", respectively. Coupled with the ABCs' estimation/inference invariance, this yields simple interpretations of main and interaction effects.
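The "group-averaged" main effect can be reproduced by hand from a standard lm() interaction fit: with the default props (sample proportions, equal for iris), the proportion-weighted average of the group-specific slopes recovers the Petal.Length estimate that lmabc reports in the Examples below. A minimal sketch in base R:

```r
fit_rge <- lm(Sepal.Length ~ Petal.Length * Species, data = iris)
b <- coef(fit_rge)

# Group-specific slopes under RGE (reference group: setosa)
slopes <- c(setosa     = unname(b["Petal.Length"]),
            versicolor = unname(b["Petal.Length"] + b["Petal.Length:Speciesversicolor"]),
            virginica  = unname(b["Petal.Length"] + b["Petal.Length:Speciesvirginica"]))

# Proportion-weighted average (iris has equal group sizes, so weights are 1/3)
avg_slope <- sum(as.numeric(prop.table(table(iris$Species))) * slopes)
avg_slope  # ~0.7888, matching the lmabc Petal.Length estimate
```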
Similarities to lm
lmabc is a reparametrization of the linear model, but the fitted values, predictions, and residuals will be the same as lm (see cv.penlmabc() for an example where this is no longer the case). Without categorical covariates, lmabc output will be identical to lm.
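This invariance can be illustrated with base R alone: refitting under a different full-rank coding of the categorical variable, such as sum-to-zero contrasts, changes the coefficients but leaves fitted values and residuals unchanged, just as with ABCs:

```r
fit_trt <- lm(Sepal.Length ~ Petal.Length * Species, data = iris)  # treatment coding (RGE)
fit_sum <- lm(Sepal.Length ~ Petal.Length * Species, data = iris,
              contrasts = list(Species = contr.sum))               # sum-to-zero coding

# Coefficients differ, but the fit itself does not
all.equal(unname(fitted(fit_trt)), unname(fitted(fit_sum)))
#> [1] TRUE
```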
Value
lmabc returns an object of class "lmabc". Many generics commonly used for lm objects have been implemented for lmabc: summary, coefficients, plot, predict, and more. See the DESCRIPTION file for all implemented S3 methods.
See also
lm() for the standard linear regression implementation in R.
Examples
fit <- lmabc(Sepal.Length ~ Petal.Length + Species + Petal.Length*Species, data = iris)
summary(fit)
#>
#> Call:
#> lmabc(formula = Sepal.Length ~ Petal.Length + Species + Petal.Length *
#> Species, data = iris)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.73479 -0.22785 -0.03132 0.24375 0.93608
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 5.524317 0.220516 25.052 < 2e-16 ***
#> Petal.Length 0.788771 0.102549 7.692 2.09e-12 ***
#> Speciessetosa 0.726787 0.428933 1.694 0.09235 .
#> Speciesversicolor -0.004114 0.224189 -0.018 0.98538
#> Speciesvirginica -0.722672 0.239830 -3.013 0.00305 **
#> Petal.Length:Speciessetosa -0.246478 0.189867 -1.298 0.19631
#> Petal.Length:Speciesversicolor 0.039510 0.118337 0.334 0.73896
#> Petal.Length:Speciesvirginica 0.206968 0.114212 1.812 0.07205 .
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 0.3365 on 144 degrees of freedom
#> Multiple R-squared: 0.8405, Adjusted R-squared: 0.8349
#> F-statistic: 151.7 on 5 and 144 DF, p-value: < 2.2e-16
#>
predict(fit, newdata = data.frame(Petal.Length = 1.5, Species = "setosa"))
#> 1
#> 5.026607