Semiparametric Bayesian linear model with horseshoe priors for high-dimensional data
Source:R/experimental_funs.R
sblm_hs.Rd
MCMC sampling for semiparametric Bayesian linear regression with
1) an unknown (nonparametric) transformation and 2) a horseshoe prior for
the (possibly high-dimensional) regression coefficients. Here, unlike sblm
,
Gibbs sampling is needed for the regression coefficients and the horseshoe
prior variance components. The transformation g
is still sampled
unconditionally on the regression coefficients, which provides a more
efficient blocking within the Gibbs sampler.
Usage
sblm_hs(
y,
X,
X_test = X,
fixedX = FALSE,
approx_g = FALSE,
init_screen = NULL,
pilot_hs = FALSE,
nsave = 1000,
nburn = 1000,
ngrid = 100,
verbose = TRUE
)
Arguments
- y
n x 1
response vector- X
n x p
matrix of predictors (no intercept)- X_test
n_test x p
matrix of predictors for test data; default is the observed covariatesX
- fixedX
logical; if TRUE, treat the design as fixed (non-random) when sampling the transformation; otherwise treat covariates as random with an unknown distribution
- approx_g
logical; if TRUE, apply large-sample approximation for the transformation
- init_screen
for the initial approximation, number of covariates to pre-screen (necessary when
p > n
); if NULL, use n/log(n)- pilot_hs
logical; if TRUE, use a short pilot run with a horseshoe prior to estimate the marginal CDF of the latent z (otherwise, use Laplace approximation)
- nsave
number of MCMC simulations to save
- nburn
number of MCMC iterations to discard
- ngrid
number of grid points for inverse approximations
- verbose
logical; if TRUE, print time remaining
Value
a list with the following elements:
coefficients
the posterior mean of the regression coefficientsfitted.values
the posterior predictive mean at the test pointsX_test
post_theta
:nsave x p
samples from the posterior distribution of the regression coefficientspost_ypred
:nsave x n_test
samples from the posterior predictive distribution at test pointsX_test
post_g
:nsave
posterior samples of the transformation evaluated at the uniquey
valuesmodel
: the model fit (here,sblm_hs
)
as well as the arguments passed in.
Details
This function provides fully Bayesian inference for a
transformed linear model with horseshoe priors using efficiently-blocked Gibbs sampling.
The transformation is modeled as unknown and learned jointly
with the regression coefficients (unless approx_g
= TRUE, which then uses
a point approximation). This model applies for real-valued data, positive data, and
compactly-supported data (the support is automatically deduced from the observed y
values).
The horseshoe prior is especially useful for high-dimensional settings with
many (possibly correlated) covariates. Compared to sparse or spike-and-slab
alternatives (see sblm_ssvs
), the horseshoe prior
delivers more scalable computing in p
. This function
uses a fast Cholesky-forward/backward sampler when p < n
and the Bhattacharya et al. (https://doi.org/10.1093/biomet/asw042) sampler
when p > n
. Thus, the sampler can scale linear in n
(for fixed/small p
) or linear in p
(for fixed/small n
).
Empirically, the horseshoe prior performs best under sparse regimes,
i.e., when the number of true signals (nonzero regression coefficients)
is a small fraction of the total number of variables.
To learn the transformation, SeBR
infers the marginal CDF
of the latent data model Fz
by averaging over the covariates
X
(via the Bayesian bootstrap, bb
) and
the coefficients theta
. When pilot_hs = TRUE
,
the algorithm fits an initial linear regression model
with a horseshoe prior fit to transformed data
(under a preliminary point estimate of the transformation) and
uses that posterior distribution to marginalize over theta
.
Otherwise, this marginalization is done using a Laplace approximation
for speed and simplicity.
Note
The location (intercept) and scale (sigma_epsilon
) are
not identified, so any intercepts in X
and X_test
will
be removed. The model-fitting *does* include an internal location-scale
adjustment, but the function only outputs inferential summaries for the
identifiable parameters.
Examples
# \donttest{
# Simulate some data:
dat = simulate_tlm(n = 100, p = 50, g_type = 'step', prop_sig = 0.1)
y = dat$y; X = dat$X # training data
y_test = dat$y_test; X_test = dat$X_test # testing data
hist(y, breaks = 25) # marginal distribution
# Fit the semiparametric Bayesian linear model with horseshoe priors:
fit = sblm_hs(y = y, X = X, X_test = X_test)
#> [1] "3 seconds remaining"
#> [1] "Total time: 9 seconds"
names(fit) # what is returned
#> [1] "coefficients" "fitted.values" "post_theta" "post_ypred"
#> [5] "post_g" "model" "y" "X"
#> [9] "X_test" "init_screen" "approx_g"
# Evaluate posterior predictive means and intervals on the testing data:
plot_pptest(fit$post_ypred, y_test,
alpha_level = 0.10) # coverage should be about 90%
#> [1] 0.969
# Check: correlation with true coefficients
cor(dat$beta_true, coef(fit))
#> [1] 0.9737789
# Compute 95% credible intervals for the coefficients:
ci_theta = t(apply(fit$post_theta, 2, quantile, c(0.05/2, 1 - 0.05/2)))
# True positive/negative rates for "selected" coefficients:
selected = ((ci_theta[,1] >0 | ci_theta[,2] < 0)) # intervals exclude zero
sigs_true = dat$beta_true != 0 # true signals
(TPR = sum(selected & sigs_true)/sum(sigs_true))
#> [1] 1
(TNR = sum(!selected & !sigs_true)/sum(!sigs_true))
#> [1] 1
# Summarize the transformation:
y0 = sort(unique(y)) # posterior draws of g are evaluated at the unique y observations
plot(y0, fit$post_g[1,], type='n', ylim = range(fit$post_g),
xlab = 'y', ylab = 'g(y)', main = "Posterior draws of the transformation")
temp = sapply(1:nrow(fit$post_g), function(s)
lines(y0, fit$post_g[s,], col='gray')) # posterior draws
lines(y0, colMeans(fit$post_g), lwd = 3) # posterior mean
# Add the true transformation, rescaled for easier comparisons:
lines(y,
scale(dat$g_true)*sd(colMeans(fit$post_g)) + mean(colMeans(fit$post_g)), type='p', pch=2)
legend('bottomright', c('Truth'), pch = 2) # annotate the true transformation
# Posterior predictive checks on testing data: empirical CDF
y0 = sort(unique(y_test))
plot(y0, y0, type='n', ylim = c(0,1),
xlab='y', ylab='F_y', main = 'Posterior predictive ECDF')
temp = sapply(1:nrow(fit$post_ypred), function(s)
lines(y0, ecdf(fit$post_ypred[s,])(y0), # ECDF of posterior predictive draws
col='gray', type ='s'))
lines(y0, ecdf(y_test)(y0), # ECDF of testing data
col='black', type = 's', lwd = 3)
# }