Overview. Data transformations are a useful companion for parametric regression models. A well-chosen or learned transformation can greatly enhance the applicability of a given model, especially for data with irregular marginal features (e.g., multimodality, skewness) or various data domains (e.g., real-valued, positive, or compactly-supported data).
Given paired data $\{x_i, y_i\}$ for $i = 1, \dots, n$, SeBR implements efficient and fully Bayesian inference for semiparametric regression models that incorporate (1) an unknown data transformation $z = g(y)$ and (2) a useful parametric regression model $z = f_\theta(x) + \epsilon$ with unknown parameters $\theta$ and independent errors $\epsilon$.
Examples. We focus on the following important special cases:
- The linear model is a natural starting point: $g(y) = x^\top\beta + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$ independently. The transformation $g$ broadens the applicability of this useful class of models, including for positive or compactly-supported data.
- The quantile regression model replaces the Gaussian assumption in the linear model with an asymmetric Laplace distribution (ALD) to target the $\tau$th quantile of $z$ at $x$, or equivalently, the $\tau$th quantile of $y$ at $x$. The ALD is quite often a very poor model for real data, especially when $\tau$ is near zero or one. The transformation $g$ offers a pathway to significantly improve the model adequacy, while still targeting the desired quantile of the data.
- The Gaussian process (GP) model generalizes the linear model to include a nonparametric regression function: $g(y) = f_\theta(x) + \epsilon$, where $f_\theta$ is a GP and $\theta$ parameterizes the mean and covariance functions. Although GPs offer substantial flexibility for the regression function $f_\theta$, this model may be inadequate when $y$ has irregular marginal features or a restricted domain (e.g., positive or compact).
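To make precise why the quantile regression model above targets the $\tau$th quantile, recall the standard ALD error density and its link to the check loss (these are textbook definitions, not package-specific notation):

$$
p(\epsilon \mid \sigma) = \frac{\tau(1-\tau)}{\sigma}\exp\{-\rho_\tau(\epsilon/\sigma)\},
\qquad
\rho_\tau(u) = u\{\tau - \mathbb{I}(u < 0)\},
$$

so maximizing the ALD likelihood is equivalent to minimizing the check loss $\rho_\tau$, whose minimizer is the $\tau$th quantile.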
Challenges: The goal is to provide fully Bayesian posterior inference for the unknowns $(\theta, g)$ and posterior predictive inference for future/unobserved data $\tilde y(x)$. We prefer a model and algorithm that offer both (i) flexible modeling of $g$ and (ii) efficient posterior and predictive computations.
Innovations: Our approach (https://doi.org/10.1080/01621459.2024.2395586) specifies a nonparametric model for , yet also provides Monte Carlo (not MCMC) sampling for the posterior and predictive distributions. As a result, we control the approximation accuracy via the number of simulations, but do not require the lengthy runs, burn-in periods, convergence diagnostics, or inefficiency factors that accompany MCMC. The Monte Carlo sampling is typically quite fast.
Using SeBR
The package SeBR is installed and loaded as follows:
# CRAN version:
# install.packages("SeBR")
# Development version:
# devtools::install_github("drkowal/SeBR")
library(SeBR)
The main functions in SeBR are:
- `sblm()`: Monte Carlo sampling for posterior and predictive inference with the semiparametric Bayesian linear model;
- `sbsm()`: Monte Carlo sampling for posterior and predictive inference with the semiparametric Bayesian spline model, which replaces the linear model with a spline for nonlinear modeling of $x$;
- `sbqr()`: blocked Gibbs sampling for posterior and predictive inference with the semiparametric Bayesian quantile regression; and
- `sbgp()`: Monte Carlo sampling for predictive inference with the semiparametric Bayesian Gaussian process model.
Each function returns a point estimate of $\theta$ (`coefficients`), point predictions at some specified testing points (`fitted.values`), posterior samples of the transformation $g$ (`post_g`), and posterior predictive samples of $\tilde y$ at the testing points (`post_ypred`), as well as other function-specific quantities (e.g., posterior draws of $\theta$, `post_theta`). The calls `coef()` and `fitted()` extract the point estimates and point predictions, respectively.
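A minimal sketch of a typical workflow, here using `sblm()` on simulated positive (log-normal-style) data; the argument names `y`, `X`, and `X_test` follow the package documentation, but consult `?sblm` for the full interface:

```r
library(SeBR)

# Simulate covariates and a positive, skewed response (illustrative only)
set.seed(123)
n <- 200; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(exp(1 + X %*% c(1, -1, 0.5) + rnorm(n)))  # positive data

# Fit the semiparametric Bayesian linear model; predict at the observed points
fit <- sblm(y = y, X = X, X_test = X)

coef(fit)            # point estimates of the regression coefficients
head(fitted(fit))    # point predictions at the testing points
dim(fit$post_ypred)  # posterior predictive draws at the testing points
```

Because the sampler is Monte Carlo rather than MCMC, the draws in `post_ypred` can be used directly (e.g., for prediction intervals) without burn-in or convergence diagnostics.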
Note: The package also includes Box-Cox variants of these functions, i.e., restricting $g$ to the (signed) Box-Cox parametric family $g(t; \lambda)$ with known or unknown $\lambda$. The parametric transformation is less flexible, especially for irregular marginals or restricted domains, and requires MCMC sampling. These functions (e.g., `blm_bc()`, etc.) are primarily for benchmarking.
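For reference, the signed Box-Cox family mentioned above is the standard parametric transformation (a general definition, not package-specific notation):

$$
g(t; \lambda) =
\begin{cases}
\{\operatorname{sign}(t)\,|t|^\lambda - 1\}/\lambda, & \lambda \neq 0,\\[2pt]
\log(t), & \lambda = 0 \ (t > 0),
\end{cases}
$$

which includes the identity ($\lambda = 1$) and log ($\lambda = 0$) transformations as special cases, but is far more restrictive than the nonparametric $g$ used by the main functions.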
Detailed documentation and examples are available at https://drkowal.github.io/SeBR/.
References
Kowal, D. and Wu, B. (2024). Monte Carlo inference for semiparametric Bayesian regression. JASA. https://doi.org/10.1080/01621459.2024.2395586