lmabc is used to fit linear models using abundance-based constraints (ABCs).
For regression models with categorical covariates, ABCs
provide 1) more equitable output, 2) better statistical efficiency,
and 3) more interpretable parameters, especially in the
presence of interaction (or modifier) effects.
Arguments
- formula: an object of class "formula" (or one that can be coerced to that class); a symbolic description of the model to be fitted.
- data: an optional data frame (or object coercible by as.data.frame to a data frame) containing the variables in the model.
- props: an optional named list with an entry for each named categorical variable in the model, specifying the proportions of each category. By default, props will be calculated from the empirical proportions in the data.
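For example, props can be supplied directly when external (e.g., population) proportions are known. A minimal sketch using the iris data; the named-vector format for each list entry is an assumption based on the description above, and the proportions are purely illustrative:
fit <- lmabc(Sepal.Length ~ Petal.Length + Species, data = iris,
             props = list(Species = c(setosa = 0.4, versicolor = 0.3, virginica = 0.3)))
coef(fit)  # group-averaged terms now reflect the supplied proportions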
Details
An lmabc model is specified identically to the corresponding lm model.
At this time, lmabc only supports a single response variable.
Differences from lm
The default approach for linear regression with categorical covariates is reference group encoding (RGE): one category is selected as the "reference group" for each categorical variable, and all results are presented relative to this group. However, this approach produces output that is inequitable (and potentially misleading), estimators that are statistically inefficient, and model parameters that are difficult to interpret.
For example, suppose an analyst fits a model of the form y ~ x + race + x:race,
where x is a continuous covariate. This model allows for race-specific
coefficients on x (or slopes). However, RGE requires a reference group;
for race, this is typically non-Hispanic White (NHW). This creates several problems:
- All output is presented relative to the reference (NHW) group: the results for (Intercept) and x refer to the intercept and slope, respectively, for the reference (NHW) group. This can lead to misleading conclusions about the overall x effect, since the dependence on the reference group is not made clear. All other race and x:race effects are presented relative to the reference (NHW) group, which elevates this group above all others.
- Compared to the main-only model y ~ x + race, adding the x:race interaction alters the estimates and standard errors of the x effect. This typically results in a loss of statistical power: the x effect now refers to a subset of the data (the reference group).
- Since all categorical covariates and interactions are anchored at a reference group, it becomes increasingly difficult to interpret the model parameters in the presence of multiple categorical variables (race, sex, etc.) and interactions.
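To see the reference-group anchoring concretely, here is the analogous lm fit on the iris data (Species is the categorical covariate, and setosa is R's default reference level):
fit_rge <- lm(Sepal.Length ~ Petal.Length + Species + Petal.Length:Species, data = iris)
coef(fit_rge)
# (Intercept) and Petal.Length are the intercept and slope for the reference
# group (setosa); all other coefficients are contrasts against setosa.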
lmabc addresses these issues. ABCs parametrize the regression model so that the main effect terms, here
(Intercept) and x, are averages of the race-specific terms.
The notion of "average" derives from the argument props: these proportions can be
supplied by the user (e.g., population proportions); otherwise they are set to the
sample proportions for each group. ABCs provide
several key advantages:
- Equitability: the main x effect is parameterized as a "group-averaged" effect. It does not elevate any single group (e.g., NHW). All other group-specific effects are relative to this global term, rather than to any single group.
- Efficiency: comparing the main-only model y ~ x + race with the race-modified model y ~ x + race + x:race, ABCs (with the default props) ensure that the main x effect estimates are (nearly) unchanged and the standard errors are (nearly) unchanged or smaller. Remarkably, there are no negative (statistical) consequences for including the interaction x:race, even if it is irrelevant.
- Interpretability: the x and x:race coefficients are "group-averaged" x-effects and "group-specific deviations", respectively. Coupled with the estimation/inference invariance of ABCs, this yields simple interpretations of the main and interaction effects.
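The efficiency property can be checked directly by comparing the group-averaged Petal.Length coefficient across the main-only and interaction fits. A minimal sketch using the iris data and the default props; with ABCs, the estimate should be (nearly) unchanged:
fit_main <- lmabc(Sepal.Length ~ Petal.Length + Species, data = iris)
fit_int  <- lmabc(Sepal.Length ~ Petal.Length + Species + Petal.Length:Species, data = iris)
coef(fit_main)["Petal.Length"]  # group-averaged slope, main-only model
coef(fit_int)["Petal.Length"]   # (nearly) identical after adding the interaction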
Similarities to lm
lmabc is a reparametrization of the linear model: the fitted values,
predictions, and residuals will be the same as those from lm
(see cv.penlmabc() for an example where this is no longer the case).
Without categorical covariates, the lmabc output will be
identical to that of lm.
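For instance, a quick equivalence check (a sketch; predict is one of the implemented generics, see Value below):
fit_abc <- lmabc(Sepal.Length ~ Petal.Length + Species, data = iris)
fit_lm  <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
all.equal(unname(predict(fit_abc, newdata = iris)),
          unname(predict(fit_lm, newdata = iris)))  # same fitted values, different parametrization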
Value
lmabc returns an object of class "lmabc". Many generics commonly used for lm objects have been implemented for lmabc: summary, coefficients, plot, predict, and more. See the DESCRIPTION file for all implemented S3 methods.
See also
lm() for the standard linear regression implementation in R.
Examples
fit <- lmabc(Sepal.Length ~ Petal.Length + Species + Petal.Length*Species, data = iris)
summary(fit)
#>
#> Call:
#> lmabc(formula = Sepal.Length ~ Petal.Length + Species + Petal.Length *
#> Species, data = iris)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.73479 -0.22785 -0.03132 0.24375 0.93608
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 5.524317 0.220516 25.052 < 2e-16 ***
#> Petal.Length 0.788771 0.102549 7.692 2.09e-12 ***
#> Speciessetosa 0.726787 0.428933 1.694 0.09235 .
#> Speciesversicolor -0.004114 0.224189 -0.018 0.98538
#> Speciesvirginica -0.722672 0.239830 -3.013 0.00305 **
#> Petal.Length:Speciessetosa -0.246478 0.189867 -1.298 0.19631
#> Petal.Length:Speciesversicolor 0.039510 0.118337 0.334 0.73896
#> Petal.Length:Speciesvirginica 0.206968 0.114212 1.812 0.07205 .
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 0.3365 on 144 degrees of freedom
#> Multiple R-squared: 0.8405, Adjusted R-squared: 0.8349
#> F-statistic: 151.7 on 5 and 144 DF, p-value: < 2.2e-16
#>
predict(fit, newdata = data.frame(Petal.Length = 1.5, Species = "setosa"))
#> 1
#> 5.026607