**Overview.** Data transformations are a useful companion for parametric regression models. A well-chosen or learned transformation can greatly enhance the applicability of a given model, especially for data with irregular marginal features (e.g., multimodality, skewness) or various data domains (e.g., real-valued, positive, or compactly-supported data).

Given paired data (*x*_{i},*y*_{i}) for *i* = 1, …, *n*, `SeBR`

implements efficient and fully Bayesian inference for *semiparametric regression models* that incorporate (1) an unknown data transformation

*g*(*y*_{i}) = *z*_{i}

and (2) a useful parametric regression model

$$ z_i \stackrel{indep}{\sim} P_{Z \mid \theta, X = x_i} $$

with unknown parameters *θ*.

**Examples.** We focus on the following important special cases of *P*_{Z ∣ θ, X}:

- The
**linear model**is a natural starting point:

$$ z_i = x_i'\theta + \epsilon_i, \quad \epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2) $$

The transformation *g* broadens the applicability of this useful class of models, including for positive or compactly-supported data, while *P*_{Z ∣ θ, X = x} = *N*(*x*′*θ*,*σ*_{ϵ}^{2}).

- The
**quantile regression model**replaces the Gaussian assumption in the linear model with an*asymmetric Laplace*distribution (ALD)

$$ z_i = x_i'\theta + \epsilon_i, \quad \epsilon_i \stackrel{iid}{\sim} ALD(\tau) $$

to target the *τ*th quantile of *z* at *x*, or equivalently, the *g*^{−1}(*τ*)th quantile of *y* at *x*. The ALD is quite often a very poor model for real data, especially when *τ* is near zero or one. The transformation *g* offers a pathway to significantly improve the model adequacy, while still targeting the desired quantile of the data.

- The
**Gaussian process (GP) model**generalizes the linear model to include a nonparametric regression function,

$$ z_i = f_\theta(x_i) + \epsilon_i, \quad \epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2) $$

where *f*_{θ} is a GP and *θ* parameterizes the mean and covariance functions. Although GPs offer substantial flexibility for the regression function *f*_{θ}, this model may be inadequate when *y* has irregular marginal features or a restricted domain (e.g., positive or compact).

**Challenges:** The goal is to provide fully Bayesian posterior inference for the unknowns (*g*,*θ*) and posterior predictive inference for future/unobserved data *ỹ*(*x*). We prefer a model and algorithm that offer both (i) flexible modeling of *g* and (ii) efficient posterior and predictive computations.

**Innovations:** Our approach (https://arxiv.org/abs/2306.05498) specifies a *nonparametric* model for *g*, yet also provides *Monte Carlo* (not MCMC) sampling for the posterior and predictive distributions. As a result, we control the approximation accuracy via the number of simulations, but do *not* require the lengthy runs, burn-in periods, convergence diagnostics, or inefficiency factors that accompany MCMC. The Monte Carlo sampling is typically quite fast.

# Using `SeBR`

The package `SeBR`

is installed and loaded as follows:

```
# CRAN version:
# install.packages("SeBR")
# Development version:
# devtools::install_github("drkowal/SeBR")
library(SeBR)
```

The main functions in `SeBR`

are:

`sblm()`

: Monte Carlo sampling for posterior and predictive inference with the*semiparametric Bayesian linear model*;`sbsm()`

: Monte Carlo sampling for posterior and predictive inference with the*semiparametric Bayesian spline model*, which replaces the linear model with a spline for nonlinear modeling of*x*∈ ℝ;`sbqr()`

: blocked Gibbs sampling for posterior and predictive inference with the*semiparametric Bayesian quantile regression*; and`sbgp()`

: Monte Carlo sampling for predictive inference with the*semiparametric Bayesian Gaussian process model*.

Each function returns a point estimate of *θ* (`coefficients`

), point predictions at some specified testing points (`fitted.values`

), posterior samples of the transformation *g* (`post_g`

), and posterior predictive samples of *ỹ*(*x*) at the testing points (`post_ypred`

), as well as other function-specific quantities (e.g., posterior draws of *θ*, `post_theta`

). The calls `coef()`

and `fitted()`

extract the point estimates and point predictions, respectively.

**Note:** The package also includes Box-Cox variants of these functions, i.e., restricting *g* to the (signed) Box-Cox parametric family *g*(*t*;*λ*) = {sign(*t*)|*t*|^{λ} − 1}/*λ* with known or unknown *λ*. The parametric transformation is less flexible, especially for irregular marginals or restricted domains, and requires MCMC sampling. These functions (e.g., `blm_bc()`

, etc.) are primarily for benchmarking.

Detailed documentation and examples are available at https://drkowal.github.io/SeBR/.