lmabc is used to fit linear models using abundance-based constraints (ABCs).
For regression models with categorical covariates, ABCs
provide 1) more equitable output, 2) better statistical efficiency,
and 3) more interpretable parameters, especially in the
presence of interaction (or modifier) effects.
Arguments
- formula: an object of class "formula" (or one that can be coerced to that class); a symbolic description of the model to be fitted.
- data: an optional data frame (or an object coercible by as.data.frame() to a data frame) containing the variables in the model.
- props: an optional named list with an entry for each categorical variable in the model, specifying the proportions of each category. By default, props is calculated from the empirical proportions in the data.
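A sketch of supplying props, assuming the structure described above (a named list with one named proportion vector per categorical variable); the lmabc call is shown as an illustrative comment:

```r
# Default behavior: proportions are taken from the data
default_props <- prop.table(table(iris$Species))  # 1/3 each for iris

# Hypothetical user-supplied proportions (e.g., known population shares);
# the exact list structure is an assumption based on the description above
pop_props <- list(Species = c(setosa = 0.2, versicolor = 0.3, virginica = 0.5))
# fit <- lmabc(Sepal.Length ~ Petal.Length + Species, data = iris, props = pop_props)
```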
Details
An lmabc model is specified identically to the corresponding lm model. At this time, lmabc only supports a single response variable.
Differences from lm
The default approach for linear regression with categorical covariates is reference group encoding (RGE): one category is selected as the "reference group" for each categorical variable, and all results are presented relative to this group. However, this approach produces output that is inequitable (and potentially misleading), estimators that are statistically inefficient, and model parameters that are difficult to interpret.
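The RGE design matrix can be inspected directly in base R: with the default treatment contrasts, the first factor level becomes the reference group and is absorbed into the intercept (shown here with iris, using Species in place of a demographic variable):

```r
# One row per species: treatment coding drops the reference level (setosa),
# whose effect is absorbed into the intercept column.
X <- model.matrix(~ Species, data = iris[c(1, 51, 101), ])
colnames(X)
#> [1] "(Intercept)"       "Speciesversicolor" "Speciesvirginica"
```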
For example, suppose an analyst fits a model of the form y ~ x + race + x:race, where x is a continuous covariate. This model allows for race-specific coefficients on x (or slopes). However, RGE requires a reference group; for race, this is typically non-Hispanic White (NHW). This creates several problems:
- All output is presented relative to the reference (NHW) group: the results for (Intercept) and x refer to the intercept and slope, respectively, for the reference (NHW) group. This can lead to misleading conclusions about the overall x effect, since the dependence on the reference group is not made clear. All other race and x:race effects are presented relative to the reference (NHW) group, which elevates this group above all others.
- Compared to the main-only model y ~ x + race, adding the x:race interaction alters the estimates and standard errors of the x effect. This typically results in a loss of statistical power: the x effect now refers to a subset of the data (the reference group).
- Since all categorical covariates and interactions are anchored at a reference group, it becomes increasingly difficult to interpret the model parameters in the presence of multiple categorical variables (race, sex, etc.) and interactions.
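These issues can be seen with base R's lm() on iris, with Species standing in for the categorical covariate:

```r
fit_main <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
fit_int  <- lm(Sepal.Length ~ Petal.Length * Species, data = iris)

# Under RGE, adding the interaction redefines the Petal.Length coefficient:
# it shifts from a shared slope (~0.90) to the slope for the reference
# group alone (setosa, ~0.54).
coef(fit_main)["Petal.Length"]
coef(fit_int)["Petal.Length"]
```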
lmabc addresses these issues. ABCs parametrize the regression model so that the main effect terms, here (Intercept) and x, are averages of the race-specific terms. The notion of "average" derives from the props argument: these proportions can be supplied by the user (e.g., population proportions); otherwise they are set to the sample proportions for each group. ABCs provide several key advantages:
- Equitability: the main x effect is parameterized as a "group-averaged" effect. It does not elevate any single group (e.g., NHW). All other group-specific effects are relative to this global term, rather than to any single group.
- Efficiency: comparing the main-only model y ~ x + race with the race-modified model y ~ x + race + x:race, ABCs (with the default props) ensure that the main x effect estimates are (nearly) unchanged and the standard errors are (nearly) unchanged or smaller. Remarkably, there are no negative (statistical) consequences for including the interaction x:race, even if it is irrelevant.
- Interpretability: the x and x:race coefficients are "group-averaged" x-effects and "group-specific deviations", respectively. Coupled with the ABCs' estimation/inference invariance, this yields simple interpretations of main and interaction effects.
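The "group-averaged" main effect can be reproduced by hand from a standard lm() interaction fit: with the default props (sample proportions, equal for iris), the proportion-weighted average of the group-specific slopes recovers the Petal.Length estimate that lmabc reports in the Examples below. A minimal sketch in base R:

```r
fit_rge <- lm(Sepal.Length ~ Petal.Length * Species, data = iris)
b <- coef(fit_rge)

# Group-specific slopes under RGE (reference group: setosa)
slopes <- c(setosa     = unname(b["Petal.Length"]),
            versicolor = unname(b["Petal.Length"] + b["Petal.Length:Speciesversicolor"]),
            virginica  = unname(b["Petal.Length"] + b["Petal.Length:Speciesvirginica"]))

# Proportion-weighted average (iris has equal group sizes, so weights are 1/3)
avg_slope <- sum(as.numeric(prop.table(table(iris$Species))) * slopes)
avg_slope  # ~0.7888, matching the lmabc Petal.Length estimate
```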
Similarities to lm
lmabc is a reparametrization of the linear model, but the fitted values, predictions, and residuals will be the same as lm (see cv.penlmabc() for an example where this is no longer the case). Without categorical covariates, lmabc output will be identical to lm.
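This invariance can be illustrated with base R alone: refitting under a different full-rank coding of the categorical variable, such as sum-to-zero contrasts, changes the coefficients but leaves fitted values and residuals unchanged, just as with ABCs:

```r
fit_trt <- lm(Sepal.Length ~ Petal.Length * Species, data = iris)  # treatment coding (RGE)
fit_sum <- lm(Sepal.Length ~ Petal.Length * Species, data = iris,
              contrasts = list(Species = contr.sum))               # sum-to-zero coding

# Coefficients differ, but the fit itself does not
all.equal(unname(fitted(fit_trt)), unname(fitted(fit_sum)))
#> [1] TRUE
```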
Value
lmabc returns an object of class "lmabc". Many generics commonly used for lm objects have been implemented for lmabc: summary, coefficients, plot, predict, and more. See the DESCRIPTION file for all implemented S3 methods.
See also
lm() for the standard linear regression implementation in R.
Examples
fit <- lmabc(Sepal.Length ~ Petal.Length + Species + Petal.Length*Species, data = iris)
summary(fit)
#>
#> Call:
#> lmabc(formula = Sepal.Length ~ Petal.Length + Species + Petal.Length *
#> Species, data = iris)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.73479 -0.22785 -0.03132 0.24375 0.93608
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 5.524317 0.220516 25.052 < 2e-16 ***
#> Petal.Length 0.788771 0.102549 7.692 2.09e-12 ***
#> Speciessetosa 0.726787 0.428933 1.694 0.09235 .
#> Speciesversicolor -0.004114 0.224189 -0.018 0.98538
#> Speciesvirginica -0.722672 0.239830 -3.013 0.00305 **
#> Petal.Length:Speciessetosa -0.246478 0.189867 -1.298 0.19631
#> Petal.Length:Speciesversicolor 0.039510 0.118337 0.334 0.73896
#> Petal.Length:Speciesvirginica 0.206968 0.114212 1.812 0.07205 .
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 0.3365 on 144 degrees of freedom
#> Multiple R-squared: 0.8405, Adjusted R-squared: 0.8349
#> F-statistic: 151.7 on 5 and 144 DF, p-value: < 2.2e-16
#>
predict(fit, newdata = data.frame(Petal.Length = 1.5, Species = "setosa"))
#> 1
#> 5.026607