Use posterior predictive draws and a sampling-importance resampling (SIR) algorithm to approximate the cross-validated predictive loss. The empirical loss (i.e., the usual quantity in cross-validation) is also returned. Both values are computed relative to the "best" subset, defined by minimum empirical loss. Specifically, these quantities are computed for a collection of linear models fit to the Bayesian model output, where each linear model features a different subset of predictors. The loss function may be either cross-entropy or the misclassification rate.

Usage

pp_loss_binary(
  post_y_pred,
  post_lpd,
  XX,
  yy,
  indicators,
  loss_type = "cross-ent",
  post_y_hat = NULL,
  K = 10,
  sir_frac = 0.5
)

Arguments

post_y_pred

S x n matrix of posterior predictive draws at the given XX covariate values

post_lpd

S evaluations of the log-likelihood computed at each posterior draw of the parameters

XX

n x p matrix of covariates at which to evaluate

yy

n-dimensional vector of response variables

indicators

L x p matrix of inclusion indicators (logicals), where each row denotes a candidate subset of predictors (see the construction sketch after this list)

loss_type

loss function to be used: "cross-ent" (cross-entropy) or "misclass" (misclassification rate)

post_y_hat

S x n matrix of posterior fitted values at the given XX covariate values

K

number of cross-validation folds

sir_frac

fraction of the posterior samples to use for SIR
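For illustration, an all-subsets comparison over the p predictors can be encoded by enumerating every inclusion pattern. A minimal sketch, assuming p = 3 and that the empty subset is not of interest:

p <- 3
# Enumerate all 2^p inclusion patterns as an L x p logical matrix,
# then drop the first row (the empty subset):
indicators <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), p)))[-1, ]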

Value

a list with two elements, pred_loss and emp_loss, containing the predictive and empirical loss, respectively, for each subset

Details

The quantity post_y_hat is the conditional expectation of the response at each covariate value (columns) under each posterior draw of the parameters (rows). For binary data, this is also the estimated probability of "success". If unspecified, the algorithm will instead use post_y_pred, which is still correct but has lower Monte Carlo efficiency.
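
Examples

A minimal sketch using simulated stand-ins for the posterior quantities; in a real application, post_y_pred, post_lpd, and post_y_hat would come from a fitted Bayesian binary regression model (e.g., MCMC output).

set.seed(123)
n <- 100; p <- 3; S <- 500

# Simulated covariates and binary responses (hypothetical data):
XX <- matrix(rnorm(n * p), n, p)
beta <- c(1, -1, 0)
yy <- rbinom(n, 1, plogis(XX %*% beta))

# Stand-in posterior draws of the success probabilities (S x n):
post_y_hat <- plogis(matrix(XX %*% beta, S, n, byrow = TRUE) +
                       matrix(rnorm(S * n, sd = 0.1), S, n))

# Posterior predictive draws (S x n) and log-likelihood per draw (length S):
post_y_pred <- matrix(rbinom(S * n, 1, post_y_hat), S, n)
post_lpd <- rowSums(dbinom(matrix(yy, S, n, byrow = TRUE), 1,
                           post_y_hat, log = TRUE))

# Candidate subsets: all nonempty subsets (as sketched under Arguments):
indicators <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), p)))[-1, ]

out <- pp_loss_binary(post_y_pred, post_lpd, XX, yy, indicators,
                      loss_type = "misclass", post_y_hat = post_y_hat,
                      K = 10, sir_frac = 0.5)
out$pred_loss  # predictive loss for each subset (relative to the best)
out$emp_loss   # empirical loss for each subset (relative to the best)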