Use posterior predictive draws and a sampling-importance resampling (SIR) algorithm to approximate the cross-validated predictive loss. The empirical loss (i.e., the usual quantity in cross-validation) is also returned. Both values are computed relative to the "best" subset, defined by minimum empirical loss. Specifically, these quantities are computed for a collection of linear models fit to the Bayesian model output, where each linear model features a different subset of predictors. The loss function may be either cross-entropy or the misclassification rate.

Usage

pp_loss_binary(
  post_y_pred,
  post_lpd,
  XX,
  yy,
  indicators,
  loss_type = "cross-ent",
  post_y_hat = NULL,
  K = 10,
  sir_frac = 0.5
)

Arguments

post_y_pred

S x n matrix of posterior predictive draws at the given XX covariate values

post_lpd

S evaluations of the log-likelihood computed at each posterior draw of the parameters

XX

n x p matrix of covariates at which to evaluate

yy

n-dimensional vector of response variables

indicators

L x p matrix of inclusion indicators (logicals), where each row denotes a candidate subset of predictors (see the construction sketch after this list)

loss_type

loss function to be used: "cross-ent" (cross-entropy) or "misclass" (misclassification rate)

post_y_hat

S x n matrix of posterior fitted values at the given XX covariate values

K

number of cross-validation folds

sir_frac

fraction of the posterior samples to use for SIR
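For illustration, an all-subsets comparison over the p predictors can be encoded by enumerating every inclusion pattern. A minimal sketch, assuming p = 3 and that the empty subset is not of interest:

p <- 3
# Enumerate all 2^p inclusion patterns as an L x p logical matrix,
# then drop the first row (the empty subset):
indicators <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), p)))[-1, ]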

Value

a list with two elements, pred_loss and emp_loss, containing the predictive and empirical loss, respectively, for each subset

Details

The quantity post_y_hat is the conditional expectation of the response at each covariate value (columns) under each posterior draw of the parameters (rows). For binary data, this is also the estimated probability of "success". If unspecified, the algorithm will instead use post_y_pred, which is still correct but has lower Monte Carlo efficiency.
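
Examples

A minimal sketch using simulated stand-ins for the posterior quantities; in a real application, post_y_pred, post_lpd, and post_y_hat would come from a fitted Bayesian binary regression model (e.g., MCMC output).

set.seed(123)
n <- 100; p <- 3; S <- 500

# Simulated covariates and binary responses (hypothetical data):
XX <- matrix(rnorm(n * p), n, p)
beta <- c(1, -1, 0)
yy <- rbinom(n, 1, plogis(XX %*% beta))

# Stand-in posterior draws of the success probabilities (S x n):
post_y_hat <- plogis(matrix(XX %*% beta, S, n, byrow = TRUE) +
                       matrix(rnorm(S * n, sd = 0.1), S, n))

# Posterior predictive draws (S x n) and log-likelihood per draw (length S):
post_y_pred <- matrix(rbinom(S * n, 1, post_y_hat), S, n)
post_lpd <- rowSums(dbinom(matrix(yy, S, n, byrow = TRUE), 1,
                           post_y_hat, log = TRUE))

# Candidate subsets: all nonempty subsets (as sketched under Arguments):
indicators <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), p)))[-1, ]

out <- pp_loss_binary(post_y_pred, post_lpd, XX, yy, indicators,
                      loss_type = "misclass", post_y_hat = post_y_hat,
                      K = 10, sir_frac = 0.5)
out$pred_loss  # predictive loss for each subset (relative to the best)
out$emp_loss   # empirical loss for each subset (relative to the best)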