Title: | Flexible Modeling of Count Data |
---|---|
Description: | For Bayesian and classical inference and prediction with count-valued data, Simultaneous Transformation and Rounding (STAR) Models provide a flexible, interpretable, and easy-to-use approach. STAR models the observed count data using a rounded continuous data model and incorporates a transformation for greater flexibility. Implicitly, STAR formalizes the commonly-applied yet incoherent procedure of (i) transforming count-valued data and subsequently (ii) modeling the transformed data using Gaussian models. STAR is well-defined for count-valued data, which is reflected in predictive accuracy, and is designed to account for zero-inflation, bounded or censored data, and over- or underdispersion. Importantly, STAR is easy to combine with existing MCMC or point estimation methods for continuous data, which allows seamless adaptation of continuous data models (such as linear regressions, additive models, BART, random forests, and gradient boosting machines) for count-valued data. The package also includes several methods for modeling count time series data, namely via warped Dynamic Linear Models. For more details and background on these methodologies, see the works of Kowal and Canale (2020) <doi:10.1214/20-EJS1707>, Kowal and Wu (2022) <doi:10.1111/biom.13617>, King and Kowal (2023) <doi:10.1214/23-BA1394>, and Kowal and Wu (2023) <arXiv:2110.12316>. |
Authors: | Brian King [aut, cre], Dan Kowal [aut] |
Maintainer: | Brian King <[email protected]> |
License: | GPL (>=2) |
Version: | 1.0.2.9000 |
Built: | 2024-11-20 04:52:40 UTC |
Source: | https://github.com/bking124/countstar |
Define the intervals associated with y = j
based on the flooring function.
The function returns -Inf
for j = 0
(or smaller) and Inf
for
any j >= y_max + 1
, where y_max
is a known upper bound on the data y
(if specified).
a_j(j, y_max = Inf)
a_j(j, y_max = Inf)
j |
the integer-valued input(s) |
y_max |
a fixed and known upper bound for all observations; default is |
The (lower) interval endpoint(s) associated with j
.
# Standard cases: a_j(1) a_j(20) # Boundary cases: a_j(0) a_j(20, y_max = 15)
# Standard cases: a_j(1) a_j(20) # Boundary cases: a_j(0) a_j(20, y_max = 15)
Run the MCMC algorithm for a STAR Bayesian additive model The transformation can be known (e.g., log or sqrt) or unknown (Box-Cox or estimated nonparametrically) for greater flexibility.
bam_star( y, X_lin, X_nonlin, splinetype = "orthogonal", transformation = "np", y_max = Inf, nsave = 1000, nburn = 1000, nskip = 0, save_y_hat = FALSE, verbose = TRUE )
bam_star( y, X_lin, X_nonlin, splinetype = "orthogonal", transformation = "np", y_max = Inf, nsave = 1000, nburn = 1000, nskip = 0, save_y_hat = FALSE, verbose = TRUE )
y |
|
X_lin |
|
X_nonlin |
|
splinetype |
Type of spline to use for modelling the nonlinear predictors; must be either "orthogonal" (orthogonalized splines–the default) or "thinplate" (low-rank thin plate splines) |
transformation |
transformation to use for the latent data; must be one of
|
y_max |
a fixed and known upper bound for all observations; default is |
nsave |
number of MCMC iterations to save |
nburn |
number of MCMC iterations to discard |
nskip |
number of MCMC iterations to skip between saving iterations, i.e., save every (nskip + 1)th draw |
save_y_hat |
logical; if TRUE, compute and save the posterior draws of the expected counts, E(y), which may be slow to compute |
verbose |
logical; if TRUE, print time remaining |
STAR defines a count-valued probability model by (1) specifying a Gaussian model for continuous *latent* data and (2) connecting the latent data to the observed data via a *transformation and rounding* operation.
Posterior and predictive inference is obtained via a Gibbs sampler that combines (i) a latent data augmentation step (like in probit regression) and (ii) an existing sampler for a continuous data model.
There are several options for the transformation. First, the transformation
can belong to the *Box-Cox* family, which includes the known transformations
'identity', 'log', and 'sqrt', as well as a version in which the Box-Cox parameter
is inferred within the MCMC sampler ('box-cox'). Second, the transformation
can be estimated (before model fitting) using the empirical distribution of the
data y
. Options in this case include the empirical cumulative
distribution function (CDF), which is fully nonparametric ('np'), or the parametric
alternatives based on Poisson ('pois') or Negative-Binomial ('neg-bin')
distributions. For the parametric distributions, the parameters of the distribution
are estimated using moments (means and variances) of y
. Third, the transformation can be
modeled as an unknown, monotone function using I-splines ('ispline'). The
Robust Adaptive Metropolis (RAM) sampler is used for drawing the parameter
of the transformation function.
a list with at least the following elements:
coefficients
: the posterior mean of the coefficients
fitted.values
: the posterior mean of the conditional expectation of the counts y
post.coefficients
: posterior draws of the coefficients
post.fitted.values
: posterior draws of the conditional mean of the counts y
post.pred
: draws from the posterior predictive distribution of y
post.lambda
: draws from the posterior distribution of lambda
post.sigma
: draws from the posterior distribution of sigma
post.log.like.point
: draws of the log-likelihood for each of the n
observations
WAIC
: Widely-Applicable/Watanabe-Akaike Information Criterion
p_waic
: Effective number of parameters based on WAIC
In the case of transformation="ispline"
, the list also contains
post.g
: draws from the posterior distribution of the transformation g
post.sigma.gamma
: draws from the posterior distribution of sigma.gamma
,
the prior standard deviation of the transformation g() coefficients
# Simulate data with count-valued response y: sim_dat = simulate_nb_friedman(n = 100, p = 5, seed=32) y = sim_dat$y; X = sim_dat$X # Linear and nonlinear components: X_lin = as.matrix(X[,-(1:3)]) X_nonlin = as.matrix(X[,(1:3)]) # STAR: nonparametric transformation fit = bam_star(y = y, X_lin = X_lin, X_nonlin = X_nonlin) # What is included: names(fit) # Posterior mean of each coefficient: coef(fit) # WAIC: fit$WAIC # MCMC diagnostics: plot(as.ts(fit$post.coefficients[,1:3])) # Posterior predictive check: hist(apply(fit$post.pred, 1, function(x) mean(x==0)), main = 'Proportion of Zeros', xlab=''); abline(v = mean(y==0), lwd=4, col ='blue')
# Simulate data with count-valued response y: sim_dat = simulate_nb_friedman(n = 100, p = 5, seed=32) y = sim_dat$y; X = sim_dat$X # Linear and nonlinear components: X_lin = as.matrix(X[,-(1:3)]) X_nonlin = as.matrix(X[,(1:3)]) # STAR: nonparametric transformation fit = bam_star(y = y, X_lin = X_lin, X_nonlin = X_nonlin) # What is included: names(fit) # Posterior mean of each coefficient: coef(fit) # WAIC: fit$WAIC # MCMC diagnostics: plot(as.ts(fit$post.coefficients[,1:3])) # Posterior predictive check: hist(apply(fit$post.pred, 1, function(x) mean(x==0)), main = 'Proportion of Zeros', xlab=''); abline(v = mean(y==0), lwd=4, col ='blue')
Run the MCMC algorithm for a BART model for count-valued responses using STAR. The transformation can be known (e.g., log or sqrt) or unknown (Box-Cox or estimated nonparametrically) for greater flexibility.
bart_star( y, X, X_test = NULL, y_test = NULL, transformation = "np", y_max = Inf, n.trees = 200, sigest = NULL, sigdf = 3, sigquant = 0.9, k = 2, power = 2, base = 0.95, nsave = 1000, nburn = 1000, nskip = 0, save_y_hat = FALSE, verbose = TRUE )
bart_star( y, X, X_test = NULL, y_test = NULL, transformation = "np", y_max = Inf, n.trees = 200, sigest = NULL, sigdf = 3, sigquant = 0.9, k = 2, power = 2, base = 0.95, nsave = 1000, nburn = 1000, nskip = 0, save_y_hat = FALSE, verbose = TRUE )
y |
|
X |
|
X_test |
|
y_test |
|
transformation |
transformation to use for the latent process; must be one of
|
y_max |
a fixed and known upper bound for all observations; default is |
n.trees |
number of trees to use in BART; default is 200 |
sigest |
positive numeric estimate of the residual standard deviation (see ?bart) |
sigdf |
degrees of freedom for error variance prior (see ?bart) |
sigquant |
quantile of the error variance prior that the rough estimate (sigest) is placed at. The closer the quantile is to 1, the more aggressive the fit will be (see ?bart) |
k |
the number of prior standard deviations E(Y|x) = f(x) is away from +/- 0.5. The response is internally scaled to range from -0.5 to 0.5. The bigger k is, the more conservative the fitting will be (see ?bart) |
power |
power parameter for tree prior (see ?bart) |
base |
base parameter for tree prior (see ?bart) |
nsave |
number of MCMC iterations to save |
nburn |
number of MCMC iterations to discard |
nskip |
number of MCMC iterations to skip between saving iterations, i.e., save every (nskip + 1)th draw |
save_y_hat |
logical; if TRUE, compute and save the posterior draws of the expected counts, E(y), which may be slow to compute |
verbose |
logical; if TRUE, print time remaining |
STAR defines a count-valued probability model by (1) specifying a Gaussian model for continuous *latent* data and (2) connecting the latent data to the observed data via a *transformation and rounding* operation. Here, the model in (1) is a Bayesian additive regression tree (BART) model.
Posterior and predictive inference is obtained via a Gibbs sampler that combines (i) a latent data augmentation step (like in probit regression) and (ii) an existing sampler for a continuous data model.
There are several options for the transformation. First, the transformation
can belong to the *Box-Cox* family, which includes the known transformations
'identity', 'log', and 'sqrt', as well as a version in which the Box-Cox parameter
is inferred within the MCMC sampler ('box-cox'). Second, the transformation
can be estimated (before model fitting) using the empirical distribution of the
data y
. Options in this case include the empirical cumulative
distribution function (CDF), which is fully nonparametric ('np'), or the parametric
alternatives based on Poisson ('pois') or Negative-Binomial ('neg-bin')
distributions. For the parametric distributions, the parameters of the distribution
are estimated using moments (means and variances) of y
. Third, the transformation can be
modeled as an unknown, monotone function using I-splines ('ispline'). The
Robust Adaptive Metropolis (RAM) sampler is used for drawing the parameter
of the transformation function.
a list with the following elements:
post.pred
: draws from the posterior predictive distribution of y
post.sigma
: draws from the posterior distribution of sigma
post.log.like.point
: draws of the log-likelihood for each of the n
observations
WAIC
: Widely-Applicable/Watanabe-Akaike Information Criterion
p_waic
: Effective number of parameters based on WAIC
post.pred.test
: draws from the posterior predictive distribution at the test points X_test
(NULL
if X_test
is not given)
post.fitted.values.test
: posterior draws of the conditional mean at the test points X_test
(NULL
if X_test
is not given)
post.mu.test
: draws of the conditional mean of z_star at the test points X_test
(NULL
if X_test
is not given)
post.log.pred.test
: draws of the log-predictive distribution for each of the n0
test cases
(NULL
if X_test
is not given)
fitted.values
: the posterior mean of the conditional expectation of the counts y
(NULL
if save_y_hat=FALSE
)
post.fitted.values
: posterior draws of the conditional mean of the counts y
(NULL
if save_y_hat=FALSE
)
In the case of transformation="ispline"
, the list also contains
post.g
: draws from the posterior distribution of the transformation g
post.sigma.gamma
: draws from the posterior distribution of sigma.gamma
,
the prior standard deviation of the transformation g() coefficients
If transformation="box-cox"
, then the list also contains
post.lambda
: draws from the posterior distribution of lambda
# Simulate data with count-valued response y: sim_dat = simulate_nb_friedman(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X # BART-STAR with log-transformation: fit_log = bart_star(y = y, X = X, transformation = 'log', save_y_hat = TRUE, nburn=1000, nskip=0) # Fitted values plot_fitted(y = sim_dat$Ey, post_y = fit_log$post.fitted.values, main = 'Fitted Values: BART-STAR-log') # WAIC for BART-STAR-log: fit_log$WAIC # MCMC diagnostics: plot(as.ts(fit_log$post.fitted.values[,1:10])) # Posterior predictive check: hist(apply(fit_log$post.pred, 1, function(x) mean(x==0)), main = 'Proportion of Zeros', xlab=''); abline(v = mean(y==0), lwd=4, col ='blue') # BART-STAR with nonparametric transformation: fit = bart_star(y = y, X = X, transformation = 'np', save_y_hat = TRUE) # Fitted values plot_fitted(y = sim_dat$Ey, post_y = fit$post.fitted.values, main = 'Fitted Values: BART-STAR-np') # WAIC for BART-STAR-np: fit$WAIC # MCMC diagnostics: plot(as.ts(fit$post.fitted.values[,1:10])) # Posterior predictive check: hist(apply(fit$post.pred, 1, function(x) mean(x==0)), main = 'Proportion of Zeros', xlab=''); abline(v = mean(y==0), lwd=4, col ='blue')
# Simulate data with count-valued response y: sim_dat = simulate_nb_friedman(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X # BART-STAR with log-transformation: fit_log = bart_star(y = y, X = X, transformation = 'log', save_y_hat = TRUE, nburn=1000, nskip=0) # Fitted values plot_fitted(y = sim_dat$Ey, post_y = fit_log$post.fitted.values, main = 'Fitted Values: BART-STAR-log') # WAIC for BART-STAR-log: fit_log$WAIC # MCMC diagnostics: plot(as.ts(fit_log$post.fitted.values[,1:10])) # Posterior predictive check: hist(apply(fit_log$post.pred, 1, function(x) mean(x==0)), main = 'Proportion of Zeros', xlab=''); abline(v = mean(y==0), lwd=4, col ='blue') # BART-STAR with nonparametric transformation: fit = bart_star(y = y, X = X, transformation = 'np', save_y_hat = TRUE) # Fitted values plot_fitted(y = sim_dat$Ey, post_y = fit$post.fitted.values, main = 'Fitted Values: BART-STAR-np') # WAIC for BART-STAR-np: fit$WAIC # MCMC diagnostics: plot(as.ts(fit$post.fitted.values[,1:10])) # Posterior predictive check: hist(apply(fit$post.pred, 1, function(x) mean(x==0)), main = 'Proportion of Zeros', xlab=''); abline(v = mean(y==0), lwd=4, col ='blue')
Posterior inference for STAR linear model
blm_star( y, X, X_test = NULL, transformation = "np", y_max = Inf, prior = "gprior", use_MCMC = TRUE, nsave = 1000, nburn = 1000, nskip = 0, psi = NULL, compute_marg = FALSE )
blm_star( y, X, X_test = NULL, transformation = "np", y_max = Inf, prior = "gprior", use_MCMC = TRUE, nsave = 1000, nburn = 1000, nskip = 0, psi = NULL, compute_marg = FALSE )
y |
|
X |
|
X_test |
|
transformation |
transformation to use for the latent process; must be one of
|
y_max |
a fixed and known upper bound for all observations; default is |
prior |
prior to use for the latent linear regression; currently implemented options are "gprior", "horseshoe", and "ridge" |
use_MCMC |
logical; whether to run Gibbs sampler or Monte Carlo (default is TRUE) |
nsave |
number of MCMC iterations to save (or MC samples to draw if use_MCMC=FALSE) |
nburn |
number of MCMC iterations to discard |
nskip |
number of MCMC iterations to skip between saving iterations, i.e., save every (nskip + 1)th draw |
psi |
prior variance (g-prior) |
compute_marg |
logical; if TRUE, compute and return the marginal likelihood (only available when using exact sampler, i.e. use_MCMC=FALSE) |
STAR defines a count-valued probability model by (1) specifying a Gaussian model for continuous *latent* data and (2) connecting the latent data to the observed data via a *transformation and rounding* operation. Here, the continuous latent data model is a linear regression.
There are several options for the transformation. First, the transformation
can belong to the *Box-Cox* family, which includes the known transformations
'identity', 'log', and 'sqrt', as well as a version in which the Box-Cox parameter
is inferred within the MCMC sampler ('box-cox'). Second, the transformation
can be estimated (before model fitting) using the empirical distribution of the
data y
. Options in this case include the empirical cumulative
distribution function (CDF), which is fully nonparametric ('np'), or the parametric
alternatives based on Poisson ('pois') or Negative-Binomial ('neg-bin')
distributions. For the parametric distributions, the parameters of the distribution
are estimated using moments (means and variances) of y
. The distribution-based
transformations approximately preserve the mean and variance of the count data y
on the latent data scale, which lends interpretability to the model parameters.
Lastly, the transformation can be modeled using the Bayesian bootstrap ('bnp'),
which is a Bayesian nonparametric model and incorporates the uncertainty
about the transformation into posterior and predictive inference.
The Monte Carlo sampler (use_MCMC=FALSE
) produces direct, discrete, and joint draws
from the posterior distribution and the posterior predictive distribution
of the linear regression model with a g-prior.
a list with at least the following elements:
coefficients
: the posterior mean of the regression coefficients
post.beta
: posterior draws of the regression coefficients
post.pred
: draws from the posterior predictive distribution of y
post.log.like.point
: draws of the log-likelihood for each of the n
observations
WAIC
: Widely-Applicable/Watanabe-Akaike Information Criterion
p_waic
: Effective number of parameters based on WAIC
If test points are passed in, then the list will also have post.predtest
,
which contains draws from the posterior predictive distribution at test points.
Other elements may be present depending on the choice of prior, transformation, and sampling approach.
The 'bnp' transformation is
slower than the other transformations because of the way
the TruncatedNormal
sampler must be updated as the lower and upper
limits change (due to the sampling of g
). Thus, computational
improvements are likely available.
# Simulate data with count-valued response y: sim_dat = simulate_nb_lm(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X # Fit the Bayesian STAR linear model: fit = blm_star(y = y, X = X) # What is included: names(fit) # Posterior mean of each coefficient: coef(fit) # WAIC: fit$WAIC # MCMC diagnostics: plot(as.ts(fit$post.beta)) # Posterior predictive check: hist(apply(fit$post.pred, 1, function(x) mean(x==0)), main = 'Proportion of Zeros', xlab=''); abline(v = mean(y==0), lwd=4, col ='blue')
# Simulate data with count-valued response y: sim_dat = simulate_nb_lm(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X # Fit the Bayesian STAR linear model: fit = blm_star(y = y, X = X) # What is included: names(fit) # Posterior mean of each coefficient: coef(fit) # WAIC: fit$WAIC # MCMC diagnostics: plot(as.ts(fit$post.beta)) # Posterior predictive check: hist(apply(fit$post.pred, 1, function(x) mean(x==0)), main = 'Proportion of Zeros', xlab=''); abline(v = mean(y==0), lwd=4, col ='blue')
For a linear regression model within the STAR framework, compute (asymptotic) confidence intervals for a regression coefficient of interest. Confidence intervals are computed by inverting the likelihood ratio test and profiling the log-likelihood.
## S3 method for class 'lmstar' confint(object, parm, level = 0.95, ...)
## S3 method for class 'lmstar' confint(object, parm, level = 0.95, ...)
object |
Object of class "lmstar" as output by |
parm |
a specification of which parameters are to be given confidence intervals, either a vector of numbers or a vector of names. If missing, all parameters are considered. |
level |
confidence level; default is 0.95 |
... |
Ignored |
A matrix (or vector) with columns giving lower and upper confidence limits for each parameter. These will be labelled as (1-level)/2 and 1 - (1-level)/2 in
#Simulate data with count-valued response y: sim_dat = simulate_nb_lm(n = 100, p = 2) y = sim_dat$y; X = sim_dat$X[,-1] # remove intercept # Select a transformation: transformation = 'np' #Estimate model fit = lm_star(y~X, transformation = transformation) #Confidence interval for all parameters confint(fit)
#Simulate data with count-valued response y: sim_dat = simulate_nb_lm(n = 100, p = 2) y = sim_dat$y; X = sim_dat$X[,-1] # remove intercept # Select a transformation: transformation = 'np' #Estimate model fit = lm_star(y~X, transformation = transformation) #Confidence interval for all parameters confint(fit)
Compute (1-alpha)% credible BANDS for a function based on MCMC samples using Crainiceanu et al. (2007)
credBands(sampFuns, alpha = 0.05)
credBands(sampFuns, alpha = 0.05)
sampFuns |
|
alpha |
confidence level |
m x 2
matrix of credible bands; the first column is the lower band, the second is the upper band
The input needs not be curves: the simultaneous credible "bands" may be computed for vectors. The resulting credible intervals will provide joint coverage at the (1-alpha) level across all components of the vector.
Compute the ergodic (running) mean.
ergMean(x)
ergMean(x)
x |
vector for which to compute the running mean |
A vector y
with each element defined by y[i] = mean(x[1:i])
# Compare: ergMean(1:10) mean(1:10) # Running mean for iid N(5, 1) samples: x = rnorm(n = 10^4, mean = 5, sd = 1) plot(ergMean(x)) abline(h=5)
# Compare: ergMean(1:10) mean(1:10) # Running mean for iid N(5, 1) samples: x = rnorm(n = 10^4, mean = 5, sd = 1) plot(ergMean(x)) abline(h=5)
Evaluate the Box-Cox transformation, which is a scaled power transformation
to preserve continuity in the index lambda
at zero. Negative values are
permitted.
g_bc(t, lambda)
g_bc(t, lambda)
t |
argument(s) at which to evaluate the function |
lambda |
Box-Cox parameter |
The evaluation(s) of the Box-Cox function at the given input(s) t
.
Special cases include
the identity transformation (lambda = 1
),
the square-root transformation (lambda = 1/2
),
and the log transformation (lambda = 0
).
# Log-transformation: g_bc(1:5, lambda = 0); log(1:5) # Square-root transformation: note the shift and scaling g_bc(1:5, lambda = 1/2); sqrt(1:5)
# Log-transformation: g_bc(1:5, lambda = 0); log(1:5) # Square-root transformation: note the shift and scaling g_bc(1:5, lambda = 1/2); sqrt(1:5)
Compute one posterior draw from the smoothed transformation
implied by (separate) Bayesian bootstrap models for the CDFs
of y
and X
.
g_bnp(y, xt_Sigma_x = rep(0, length(y)), z_grid = NULL)
g_bnp(y, xt_Sigma_x = rep(0, length(y)), z_grid = NULL)
y |
|
xt_Sigma_x |
|
z_grid |
optional vector of grid points for evaluating the CDF
of z ( |
A smooth monotone function which can be used for evaluations of the transformation at each posterior draw.
# Sample some data: y = rpois(n = 200, lambda = 5) # Compute 100 draws of g on a grid: t = seq(0, max(y), length.out = 50) # grid g_post = t(sapply(1:100, function(s) g_bnp(y)(t))) # Plot together: plot(t, t, ylim = range(g_post), type='n', ylab = 'g(t)', main = 'Bayesian bootstrap posterior: g') temp = apply(g_post, 1, function(g) lines(t, g, col='gray')) # And the posterior mean of g: lines(t, colMeans(g_post), lwd=3)
# Sample some data: y = rpois(n = 200, lambda = 5) # Compute 100 draws of g on a grid: t = seq(0, max(y), length.out = 50) # grid g_post = t(sapply(1:100, function(s) g_bnp(y)(t))) # Plot together: plot(t, t, ylim = range(g_post), type='n', ylab = 'g(t)', main = 'Bayesian bootstrap posterior: g') temp = apply(g_post, 1, function(g) lines(t, g, col='gray')) # And the posterior mean of g: lines(t, colMeans(g_post), lwd=3)
Compute a CDF-based transformation using the observed count data.
The CDF can be estimated nonparametrically or parametrically based on the
Poisson or Negative Binomial distributions. In the parametric case,
the parameters are determined based on the moments of y
.
Note that this is a fixed quantity and does not come with uncertainty quantification.
g_cdf(y, distribution = "np")
g_cdf(y, distribution = "np")
y |
|
distribution |
the distribution used for the CDF; must be one of
|
A smooth monotone function which can be used for evaluations of the transformation.
# Sample some data: y = rpois(n = 500, lambda = 5) # Empirical CDF version: g_np = g_cdf(y, distribution = 'np') # Poisson version: g_pois = g_cdf(y, distribution = 'pois') # Negative binomial version: g_negbin = g_cdf(y, distribution = 'neg-bin') # Plot together: t = 1:max(y) # grid plot(t, g_np(t), type='l') lines(t, g_pois(t), lty = 2) lines(t, g_negbin(t), lty = 3)
# Sample some data: y = rpois(n = 500, lambda = 5) # Empirical CDF version: g_np = g_cdf(y, distribution = 'np') # Poisson version: g_pois = g_cdf(y, distribution = 'pois') # Negative binomial version: g_negbin = g_cdf(y, distribution = 'neg-bin') # Plot together: t = 1:max(y) # grid plot(t, g_np(t), type='l') lines(t, g_pois(t), lty = 2) lines(t, g_negbin(t), lty = 3)
Compute the inverse function of a transformation g
based on a grid search.
g_inv_approx(g, t_grid)
g_inv_approx(g, t_grid)
g |
the transformation function |
t_grid |
grid of arguments at which to evaluate the transformation function |
A function which can be used for evaluations of the (approximate) inverse transformation function.
# Sample some data: y = rpois(n = 500, lambda = 5) # Empirical CDF transformation: g_np = g_cdf(y, distribution = 'np') # Grid for approximation: t_grid = seq(1, max(y), length.out = 100) # Approximate inverse: g_inv = g_inv_approx(g = g_np, t_grid = t_grid) # Check the approximation: plot(t_grid, g_inv(g_np(t_grid)), type='p') lines(t_grid, t_grid)
# Sample some data: y = rpois(n = 500, lambda = 5) # Empirical CDF transformation: g_np = g_cdf(y, distribution = 'np') # Grid for approximation: t_grid = seq(1, max(y), length.out = 100) # Approximate inverse: g_inv = g_inv_approx(g = g_np, t_grid = t_grid) # Check the approximation: plot(t_grid, g_inv(g_np(t_grid)), type='p') lines(t_grid, t_grid)
Evaluate the inverse Box-Cox transformation. Negative values are permitted.
g_inv_bc(s, lambda)
g_inv_bc(s, lambda)
s |
argument(s) at which to evaluate the function |
lambda |
Box-Cox parameter |
The evaluation(s) of the inverse Box-Cox function at the given input(s) s
.
Special cases include
the identity transformation (lambda = 1
),
the square-root transformation (lambda = 1/2
),
and the log transformation (lambda = 0
).
#' @examples # (Inverse) log-transformation: g_inv_bc(1:5, lambda = 0); exp(1:5)
# (Inverse) square-root transformation: note the shift and scaling g_inv_bc(1:5, lambda = 1/2); (1:5)^2
Compute the MLEs and log-likelihood for the Gradient Boosting Machines (GBM) STAR model.
The STAR model requires a *transformation* and an *estimation function* for the conditional mean
given observed data. The transformation can be known (e.g., log or sqrt) or unknown
(Box-Cox or estimated nonparametrically) for greater flexibility.
The estimator in this case is a GBM.
Standard function calls including fitted
and residuals
apply.
gbm_star( y, X, X.test = NULL, transformation = "np", y_max = Inf, sd_init = 10, tol = 10^-10, max_iters = 1000, n.trees = 100, interaction.depth = 1, shrinkage = 0.1, bag.fraction = 1 )
gbm_star( y, X, X.test = NULL, transformation = "np", y_max = Inf, sd_init = 10, tol = 10^-10, max_iters = 1000, n.trees = 100, interaction.depth = 1, shrinkage = 0.1, bag.fraction = 1 )
y |
|
X |
|
X.test |
|
transformation |
transformation to use for the latent data; must be one of
|
y_max |
a fixed and known upper bound for all observations; default is |
sd_init |
add random noise for EM algorithm initialization scaled by |
tol |
tolerance for stopping the EM algorithm; default is 10^-10; |
max_iters |
maximum number of EM iterations before stopping; default is 1000 |
n.trees |
Integer specifying the total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion. Default is 100. |
interaction.depth |
Integer specifying the maximum depth of each tree (i.e., the highest level of variable interactions allowed). A value of 1 implies an additive model, a value of 2 implies a model with up to 2-way interactions, etc. Default is 1. |
shrinkage |
a shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction; 0.001 to 0.1 usually work, but a smaller learning rate typically requires more trees. Default is 0.1. |
bag.fraction |
the fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomnesses into the model fit. If bag.fraction < 1 then running the same model twice will result in similar but different fits. Default is 1 (for a deterministic prediction). |
STAR defines a count-valued probability model by (1) specifying a Gaussian model for continuous *latent* data and (2) connecting the latent data to the observed data via a *transformation and rounding* operation. The Gaussian model in this case is a GBM.
a list with the following elements:
fitted.values
: the fitted values at the MLEs (training)
fitted.values.test
: the fitted values at the MLEs (testing)
g.hat
a function containing the (known or estimated) transformation
sigma.hat
the MLE of the standard deviation
mu.hat
the MLE of the conditional mean (on the transformed scale)
z.hat
the estimated latent data (on the transformed scale) at the MLEs
residuals
the Dunn-Smyth residuals (randomized)
residuals_rep
the Dunn-Smyth residuals (randomized) for 10 replicates
logLik
the log-likelihood at the MLEs
logLik0
the log-likelihood at the MLEs for the *unrounded* initialization
lambda
the Box-Cox nonlinear parameter
gbmObj
: the object returned by gbm() at the MLEs
and other parameters that (1) track the parameters across EM iterations and (2) record the model specifications
Infinite latent data values may occur when the transformed Gaussian model is highly inadequate. In that case, the function returns the *indices* of the data points with infinite latent values, which are significant outliers under the model. Deletion of these indices and re-running the model is one option, but care must be taken to ensure that (i) it is appropriate to treat these observations as outliers and (ii) the model is adequate for the remaining data points.
Kowal, D. R., & Wu, B. (2021). Semiparametric count data regression for self‐reported mental health. Biometrics. doi:10.1111/biom.13617
# Simulate data with count-valued response y: sim_dat = simulate_nb_friedman(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X # EM algorithm for STAR (using the log-link) fit_em = gbm_star(y = y, X = X, transformation = 'log') # Evaluate convergence: plot(fit_em$logLik_all, type='l', main = 'GBM-STAR-log', xlab = 'Iteration', ylab = 'log-lik') # Fitted values: y_hat = fitted(fit_em) plot(y_hat, y); # Residuals: plot(residuals(fit_em)) qqnorm(residuals(fit_em)); qqline(residuals(fit_em)) # Log-likelihood at MLEs: fit_em$logLik
# Simulate data with count-valued response y: sim_dat = simulate_nb_friedman(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X # EM algorithm for STAR (using the log-link) fit_em = gbm_star(y = y, X = X, transformation = 'log') # Evaluate convergence: plot(fit_em$logLik_all, type='l', main = 'GBM-STAR-log', xlab = 'Iteration', ylab = 'log-lik') # Fitted values: y_hat = fitted(fit_em) plot(y_hat, y); # Residuals: plot(residuals(fit_em)) qqnorm(residuals(fit_em)); qqline(residuals(fit_em)) # Log-likelihood at MLEs: fit_em$logLik
Compute MLEs and log-likelihood for a generalized STAR model. The STAR model requires
a *transformation* and an *estimation function* for the conditional mean
given observed data. The transformation can be known (e.g., log or sqrt) or unknown
(Box-Cox or estimated nonparametrically) for greater flexibility.
The estimator can be any least squares estimator, including nonlinear models.
Standard function calls including
coefficients()
, fitted()
, and residuals()
apply.
genEM_star( y, estimator, transformation = "np", y_max = Inf, sd_init = 10, tol = 10^-10, max_iters = 1000 )
genEM_star( y, estimator, transformation = "np", y_max = Inf, sd_init = 10, tol = 10^-10, max_iters = 1000 )
y |
|
estimator |
a function that inputs data
|
transformation |
transformation to use for the latent data; must be one of
|
y_max |
a fixed and known upper bound for all observations; default is |
sd_init |
add random noise for EM algorithm initialization scaled by |
tol |
tolerance for stopping the EM algorithm; default is 10^-10; |
max_iters |
maximum number of EM iterations before stopping; default is 1000 |
STAR defines a count-valued probability model by (1) specifying a Gaussian model for continuous *latent* data and (2) connecting the latent data to the observed data via a *transformation and rounding* operation.
The expectation-maximization (EM) algorithm is used to produce
maximum likelihood estimators (MLEs) for the parameters defined in the
estimator
function, such as linear regression coefficients,
which define the Gaussian model for the continuous latent data.
Fitted values (point predictions), residuals, and log-likelihood values
are also available. Inference for the estimators proceeds via classical maximum likelihood.
Initialization of the EM algorithm can be randomized to monitor convergence.
However, the log-likelihood is concave for all transformations (except 'box-cox'),
so global convergence is guaranteed.
There are several options for the transformation. First, the transformation
can belong to the *Box-Cox* family, which includes the known transformations
'identity', 'log', and 'sqrt', as well as a version in which the Box-Cox parameter
is estimated within the EM algorithm ('box-cox'). Second, the transformation
can be estimated (before model fitting) using the empirical distribution of the
data y
. Options in this case include the empirical cumulative
distribution function (CDF), which is fully nonparametric ('np'), or the parametric
alternatives based on Poisson ('pois') or Negative-Binomial ('neg-bin')
distributions. For the parametric distributions, the parameters of the distribution
are estimated using moments (means and variances) of y
.
a list with the following elements:
coefficients
the MLEs of the coefficients
fitted.values
the fitted values at the MLEs
g.hat
a function containing the (known or estimated) transformation
sigma.hat
the MLE of the standard deviation
mu.hat
the MLE of the conditional mean (on the transformed scale)
z.hat
the estimated latent data (on the transformed scale) at the MLEs
residuals
the Dunn-Smyth residuals (randomized)
residuals_rep
the Dunn-Smyth residuals (randomized) for 10 replicates
logLik
the log-likelihood at the MLEs
logLik0
the log-likelihood at the MLEs for the *unrounded* initialization
lambda
the Box-Cox nonlinear parameter
and other parameters that (1) track the parameters across EM iterations and (2) record the model specifications
Infinite latent data values may occur when the transformed Gaussian model is highly inadequate. In that case, the function returns the *indices* of the data points with infinite latent values, which are significant outliers under the model. Deletion of these indices and re-running the model is one option, but care must be taken to ensure that (i) it is appropriate to treat these observations as outliers and (ii) the model is adequate for the remaining data points.
Kowal, D. R., & Wu, B. (2021). Semiparametric count data regression for self‐reported mental health. Biometrics. doi:10.1111/biom.13617
# Simulate data with count-valued response y: sim_dat = simulate_nb_friedman(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X # Select a transformation: transformation = 'np' # Example using GAM as underlying estimator (for illustration purposes only) if(require("mgcv")){ fit_em = genEM_star(y = y, estimator = function(y) gam(y ~ s(X1)+s(X2), data=data.frame(y,X)), transformation = transformation) } # Fitted coefficients: coef(fit_em) # Fitted values: y_hat = fitted(fit_em) plot(y_hat, y); # Log-likelihood at MLEs: fit_em$logLik
# Simulate data with count-valued response y: sim_dat = simulate_nb_friedman(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X # Select a transformation: transformation = 'np' # Example using GAM as underlying estimator (for illustration purposes only) if(require("mgcv")){ fit_em = genEM_star(y = y, estimator = function(y) gam(y ~ s(X1)+s(X2), data=data.frame(y,X)), transformation = transformation) } # Fitted coefficients: coef(fit_em) # Fitted values: y_hat = fitted(fit_em) plot(y_hat, y); # Log-likelihood at MLEs: fit_em$logLik
Run the MCMC algorithm for STAR given
a function to initialize model parameters; and
a function to sample (i.e., update) model parameters.
The transformation can be known (e.g., log or sqrt) or unknown (Box-Cox or estimated nonparametrically) for greater flexibility.
genMCMC_star( y, sample_params, init_params, transformation = "np", y_max = Inf, nsave = 1000, nburn = 1000, nskip = 0, save_y_hat = FALSE, verbose = TRUE )
genMCMC_star( y, sample_params, init_params, transformation = "np", y_max = Inf, nsave = 1000, nburn = 1000, nskip = 0, save_y_hat = FALSE, verbose = TRUE )
y |
|
sample_params |
a function that inputs data
and outputs an updated list |
init_params |
an initializing function that inputs data |
transformation |
transformation to use for the latent data; must be one of
|
y_max |
a fixed and known upper bound for all observations; default is |
nsave |
number of MCMC iterations to save |
nburn |
number of MCMC iterations to discard |
nskip |
number of MCMC iterations to skip between saving iterations, i.e., save every (nskip + 1)th draw |
save_y_hat |
logical; if TRUE, compute and save the posterior draws of the expected counts, E(y), which may be slow to compute |
verbose |
logical; if TRUE, print time remaining |
STAR defines a count-valued probability model by (1) specifying a Gaussian model for continuous *latent* data and (2) connecting the latent data to the observed data via a *transformation and rounding* operation.
Posterior and predictive inference is obtained via a Gibbs sampler that combines (i) a latent data augmentation step (like in probit regression) and (ii) an existing sampler for a continuous data model.
There are several options for the transformation. First, the transformation
can belong to the *Box-Cox* family, which includes the known transformations
'identity', 'log', and 'sqrt', as well as a version in which the Box-Cox parameter
is inferred within the MCMC sampler ('box-cox'). Second, the transformation
can be estimated (before model fitting) using the empirical distribution of the
data y
. Options in this case include the empirical cumulative
distribution function (CDF), which is fully nonparametric ('np'), or the parametric
alternatives based on Poisson ('pois') or Negative-Binomial ('neg-bin')
distributions. For the parametric distributions, the parameters of the distribution
are estimated using moments (means and variances) of y
.
a list with at least the following elements:
post.pred
: draws from the posterior predictive distribution of y
post.sigma
: draws from the posterior distribution of sigma
post.log.like.point
: draws of the log-likelihood for each of the n
observations
WAIC
: Widely-Applicable/Watanabe-Akaike Information Criterion
p_waic
: Effective number of parameters based on WAIC
post.lambda
: draws from the posterior distribution of lambda
(NULL unless transformation='box-cox'
)
fitted.values
: the posterior mean of the conditional expectation of the counts y
(NULL
if save_y_hat=FALSE
)
post.fitted.values
: posterior draws of the conditional mean of the counts y
(NULL
if save_y_hat=FALSE
)
If the coefficients list from init_params
and sample_params
contains a named element beta
,
e.g. for linear regression, then the function output contains
coefficients
: the posterior mean of the beta coefficients
post.beta
: draws from the posterior distribution of beta
post.othercoefs
: draws from the posterior distribution of any other sampled coefficients, e.g. variance terms
If no beta
exists in the parameter coefficients, then the output list just contains
coefficients
: the posterior mean of all coefficients
post.beta
: draws from the posterior distribution of all coefficients
Additionally, if init_params
and sample_params
have output mu_test
, then the sampler will output
post.predtest
, which contains draws from the posterior predictive distribution at test points.
# Simulate data with count-valued response y: sim_dat = simulate_nb_lm(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X # STAR: log-transformation: fit_log = genMCMC_star(y = y, sample_params = function(y, params) sample_lm_gprior(y, X, params), init_params = function(y) init_lm_gprior(y, X), transformation = 'log') # What is included: names(fit_log) # Posterior mean of each coefficient: coef(fit_log) # WAIC for STAR-log: fit_log$WAIC # MCMC diagnostics: plot(as.ts(fit_log$post.beta[,1:3])) # Posterior predictive check: hist(apply(fit_log$post.pred, 1, function(x) mean(x==0)), main = 'Proportion of Zeros', xlab=''); abline(v = mean(y==0), lwd=4, col ='blue')
# Simulate data with count-valued response y: sim_dat = simulate_nb_lm(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X # STAR: log-transformation: fit_log = genMCMC_star(y = y, sample_params = function(y, params) sample_lm_gprior(y, X, params), init_params = function(y) init_lm_gprior(y, X), transformation = 'log') # What is included: names(fit_log) # Posterior mean of each coefficient: coef(fit_log) # WAIC for STAR-log: fit_log$WAIC # MCMC diagnostics: plot(as.ts(fit_log$post.beta[,1:3])) # Posterior predictive check: hist(apply(fit_log$post.pred, 1, function(x) mean(x==0)), main = 'Proportion of Zeros', xlab=''); abline(v = mean(y==0), lwd=4, col ='blue')
Compute the summary statistics for the effective sample size (ESS) across posterior samples for possibly many variables
getEffSize(postX)
getEffSize(postX)
postX |
An array of arbitrary dimension |
Table of summary statistics using the function summary()
.
# ESS for iid simulations: rand_iid = rnorm(n = 10^4) getEffSize(rand_iid) # ESS for several AR(1) simulations with coefficients 0.1, 0.2,...,0.9: rand_ar1 = sapply(seq(0.1, 0.9, by = 0.1), function(x) arima.sim(n = 10^4, list(ar = x))) getEffSize(rand_ar1)
# ESS for iid simulations: rand_iid = rnorm(n = 10^4) getEffSize(rand_iid) # ESS for several AR(1) simulations with coefficients 0.1, 0.2,...,0.9: rand_ar1 = sapply(seq(0.1, 0.9, by = 0.1), function(x) arima.sim(n = 10^4, list(ar = x))) getEffSize(rand_ar1)
Initialize the parameters for a linear regression model assuming a g-prior for the coefficients.
init_lm_gprior(y, X, X_test = NULL)
init_lm_gprior(y, X, X_test = NULL)
y |
|
X |
|
X_test |
|
a named list params
containing at least
mu
: vector of conditional means (fitted values)
sigma
: the conditional standard deviation
coefficients
: a named list of parameters that determine mu
Additionally, if X_test is not NULL, then the list includes an element
mu_test
, the vector of conditional means at the test points
The parameters in coefficients
are:
beta
: the p x 1
vector of regression coefficients
components of beta
# Simulate data for illustration: sim_dat = simulate_nb_lm(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X # Initialize: params = init_lm_gprior(y = y, X = X) names(params) names(params$coefficients)
# Simulate data for illustration: sim_dat = simulate_nb_lm(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X # Initialize: params = init_lm_gprior(y = y, X = X) names(params) names(params$coefficients)
Compute the MLEs and log-likelihood for the STAR linear model. The regression coefficients are estimated using least squares within an EM algorithm.
lm_star( formula, data = NULL, transformation = "np", y_max = Inf, sd_init = 10, tol = 10^-10, max_iters = 1000 )
lm_star( formula, data = NULL, transformation = "np", y_max = Inf, sd_init = 10, tol = 10^-10, max_iters = 1000 )
formula |
an object of class " |
data |
an optional data frame, list or environment (or object coercible by as.data.frame to a data frame)
containing the variables in the model; like |
transformation |
transformation to use for the latent data; must be one of
|
y_max |
a fixed and known upper bound for all observations; default is |
sd_init |
add random noise for EM algorithm initialization scaled by |
tol |
tolerance for stopping the EM algorithm; default is 10^-10; |
max_iters |
maximum number of EM iterations before stopping; default is 1000 |
Standard function calls including
coefficients
, fitted
, and residuals
apply. Fitted values are the expectation
at the MLEs, and as such are not necessarily count-valued.
an object of class
"lmstar", which is a list with the following elements:
coefficients
the MLEs of the coefficients
fitted.values
the fitted values at the MLEs
g.hat
a function containing the (known or estimated) transformation
ginv.hat
a function containing the inverse of the transformation
sigma.hat
the MLE of the standard deviation
mu.hat
the MLE of the conditional mean (on the transformed scale)
z.hat
the estimated latent data (on the transformed scale) at the MLEs
residuals
the Dunn-Smyth residuals (randomized)
residuals_rep
the Dunn-Smyth residuals (randomized) for 10 replicates
logLik
the log-likelihood at the MLEs
logLik0
the log-likelihood at the MLEs for the *unrounded* initialization
lambda
the Box-Cox nonlinear parameter
and other parameters that (1) track the parameters across EM iterations and (2) record the model specifications
Infinite latent data values may occur when the transformed Gaussian model is highly inadequate. In that case, the function returns the *indices* of the data points with infinite latent values, which are significant outliers under the model. Deletion of these indices and re-running the model is one option, but care must be taken to ensure that (i) it is appropriate to treat these observations as outliers and (ii) the model is adequate for the remaining data points.
Kowal, D. R., & Wu, B. (2021). Semiparametric count data regression for self‐reported mental health. Biometrics. doi:10.1111/biom.13617
# Simulate data with count-valued response y: sim_dat = simulate_nb_lm(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X[,-1] # remove intercept # Fit model fit_em = lm_star(y ~ X) # Fitted coefficients: coef(fit_em) # Fitted values: y_hat = fitted(fit_em) plot(y_hat, y); # Residuals: plot(residuals(fit_em)) qqnorm(residuals(fit_em)); qqline(residuals(fit_em))
# Simulate data with count-valued response y: sim_dat = simulate_nb_lm(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X[,-1] # remove intercept # Fit model fit_em = lm_star(y ~ X) # Fitted coefficients: coef(fit_em) # Fitted values: y_hat = fitted(fit_em) plot(y_hat, y); # Residuals: plot(residuals(fit_em)) qqnorm(residuals(fit_em)); qqline(residuals(fit_em))
Plot the estimated regression coefficients and credible intervals for the linear effects in up to two models.
plot_coef( post_coefficients_1, post_coefficients_2 = NULL, alpha = 0.05, labels = NULL )
plot_coef( post_coefficients_1, post_coefficients_2 = NULL, alpha = 0.05, labels = NULL )
post_coefficients_1 |
|
post_coefficients_2 |
|
alpha |
confidence level for the credible intervals |
labels |
|
A plot of regression coefficients and credible intervals for 1-2 models
Plot the fitted values, plus pointwise credible intervals, against the data. For simulations, one may use the true values in place of the data.
plot_fitted(y, post_y, y_hat = NULL, alpha = 0.05, ...)
plot_fitted(y, post_y, y_hat = NULL, alpha = 0.05, ...)
y |
|
post_y |
|
y_hat |
|
alpha |
confidence level for the credible intervals |
... |
other arguments for plotting |
A plot with the fitted values and the credible intervals against the data
Plot the empirical probability mass function, i.e., the proportion of
data values y
that equal j
for each j=0,1,...
,
together with the model-based estimate of the probability mass function
based on the posterior predictive distribution.
plot_pmf(y, post.pred, error.bars = FALSE, alpha = 0.05)
plot_pmf(y, post.pred, error.bars = FALSE, alpha = 0.05)
y |
|
post.pred |
|
error.bars |
logical; if TRUE, include errors bars on the model-based PMF |
alpha |
confidence level for the credible intervals |
A plot of the empirical PMF of y along with a PMF estimate from the model posterior predictive distribution
Outputs predicted values based on an lmstar fit and optionally prediction intervals based on the the (plug-in) predictive distribution for the STAR linear model
## S3 method for class 'lmstar' predict(object, newdata = NULL, interval = FALSE, level = 0.95, N = 1000, ...)
## S3 method for class 'lmstar' predict(object, newdata = NULL, interval = FALSE, level = 0.95, N = 1000, ...)
object |
Object of class "lmstar" as output by |
newdata |
An optional matrix/data frame in which to look for variables with which to predict. If omitted, the fitted values are used. |
interval |
logical; whether or not to include prediction intervals (default FALSE) |
level |
Level for prediction intervals |
N |
number of Monte Carlo samples from the posterior predictive distribution used to approximate intervals; default is 1000 |
... |
Ignored |
If interval=TRUE, then predict.lmstar
uses a Monte Carlo approach to estimating the (plug-in)
predictive distribution for the STAR linear model. The algorithm iteratively samples
(i) the latent data given the observed data, (ii) the latent predictive data given the latent data from (i),
and (iii) (inverse) transforms and rounds the latent predictive data to obtain a
draw from the integer-valued predictive distribution.
The appropriate quantiles of these Monte Carlo draws are computed and reported as the prediction interval.
Either a a vector of predictions (if interval=FALSE) or a matrix of predictions and bounds with column names fit, lwr, and upr
The “plug-in" predictive distribution is a crude approximation. Better
approaches are available using the Bayesian models, e.g. blm_star
, which provide samples
from the posterior predictive distribution.
For highly skewed responses, prediction intervals especially at lower levels may not include the predicted value itself, since the mean is often much larger than the median.
# Simulate data with count-valued response y: x = seq(0, 1, length.out = 100) y = rpois(n = length(x), lambda = exp(1.5 + 5*(x -.5)^2)) # Estimate model--assume a quadratic effect (better for illustration purposes) fit = lm_star(y~x+I(x^2), transformation = 'sqrt') #Compute the predictive draws for the test points (same as observed points here) #Also compute intervals using plug-in predictive distribution y_pred = predict(fit, interval=TRUE) # Plot the results plot(x, y, ylim = range(y, y_pred), main = 'STAR: Predictions and 95% PI') lines(x,y_pred[,"fit"], col='black', type='s', lwd=4) lines(x, y_pred[,"lwr"], col='darkgray', type='s', lwd=4) lines(x, y_pred[,"upr"], col='darkgray', type='s', lwd=4)
# Simulate data with count-valued response y: x = seq(0, 1, length.out = 100) y = rpois(n = length(x), lambda = exp(1.5 + 5*(x -.5)^2)) # Estimate model--assume a quadratic effect (better for illustration purposes) fit = lm_star(y~x+I(x^2), transformation = 'sqrt') #Compute the predictive draws for the test points (same as observed points here) #Also compute intervals using plug-in predictive distribution y_pred = predict(fit, interval=TRUE) # Plot the results plot(x, y, ylim = range(y, y_pred), main = 'STAR: Predictions and 95% PI') lines(x,y_pred[,"fit"], col='black', type='s', lwd=4) lines(x, y_pred[,"lwr"], col='darkgray', type='s', lwd=4) lines(x, y_pred[,"upr"], col='darkgray', type='s', lwd=4)
For a linear regression model within the STAR framework, compute p-values for regression coefficients using a likelihood ratio test. It also computes a p-value for excluding all predictors, akin to a (partial) F test.
pvals(object)
pvals(object)
object |
Object of class "lmstar" as output by |
a list of p+1 p-values, one for each predictor as well as the joint p-value excluding all predictors
# Simulate data with count-valued response y: sim_dat = simulate_nb_lm(n = 100, p = 2) y = sim_dat$y; X = sim_dat$X[,-1] # remove intercept # Select a transformation: transformation = 'np' #Estimate model fit = lm_star(y~X, transformation = transformation) #Compute p-values pvals(fit)
# Simulate data with count-valued response y: sim_dat = simulate_nb_lm(n = 100, p = 2) y = sim_dat$y; X = sim_dat$X[,-1] # remove intercept # Select a transformation: transformation = 'np' #Estimate model fit = lm_star(y~X, transformation = transformation) #Compute p-values pvals(fit)
Compute the MLEs and log-likelihood for the Random Forest STAR model.
The STAR model requires a *transformation* and an *estimation function* for the conditional mean
given observed data. The transformation can be known (e.g., log or sqrt) or unknown
(Box-Cox or estimated nonparametrically) for greater flexibility.
The estimator in this case is a random forest.
Standard function calls including fitted
and residuals
apply.
randomForest_star( y, X, X.test = NULL, transformation = "np", y_max = Inf, sd_init = 10, tol = 10^-10, max_iters = 1000, ntree = 500, mtry = max(floor(ncol(X)/3), 1), nodesize = 5 )
randomForest_star( y, X, X.test = NULL, transformation = "np", y_max = Inf, sd_init = 10, tol = 10^-10, max_iters = 1000, ntree = 500, mtry = max(floor(ncol(X)/3), 1), nodesize = 5 )
y |
|
X |
|
X.test |
|
transformation |
transformation to use for the latent data; must be one of
|
y_max |
a fixed and known upper bound for all observations; default is |
sd_init |
add random noise for EM algorithm initialization scaled by |
tol |
tolerance for stopping the EM algorithm; default is 10^-10; |
max_iters |
maximum number of EM iterations before stopping; default is 1000 |
ntree |
Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times. Default is 500. |
mtry |
Number of variables randomly sampled as candidates at each split. Default is p/3. |
nodesize |
Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time). Default is 5. |
STAR defines a count-valued probability model by (1) specifying a Gaussian model for continuous *latent* data and (2) connecting the latent data to the observed data via a *transformation and rounding* operation.
The expectation-maximization (EM) algorithm is used to produce maximum likelihood estimators (MLEs) for the parameters defined in the The fitted values are computed using out-of-bag samples. As a result, the log-likelihood is based on out-of-bag prediction, and it is similarly straightforward to compute out-of-bag squared and absolute errors.
a list with the following elements:
fitted.values
: the fitted values at the MLEs based on out-of-bag samples (training)
fitted.values.test
: the fitted values at the MLEs (testing)
g.hat
a function containing the (known or estimated) transformation
sigma.hat
the MLE of the standard deviation
mu.hat
the MLE of the conditional mean (on the transformed scale)
z.hat
the estimated latent data (on the transformed scale) at the MLEs
residuals
the Dunn-Smyth residuals (randomized)
residuals_rep
the Dunn-Smyth residuals (randomized) for 10 replicates
logLik
the log-likelihood at the MLEs
logLik0
the log-likelihood at the MLEs for the *unrounded* initialization
lambda
the Box-Cox nonlinear parameter
rfObj
: the object returned by randomForest() at the MLEs
and other parameters that (1) track the parameters across EM iterations and (2) record the model specifications
Since the random forest produces random predictions, the EM algorithm will never converge exactly.
Infinite latent data values may occur when the transformed Gaussian model is highly inadequate. In that case, the function returns the *indices* of the data points with infinite latent values, which are significant outliers under the model. Deletion of these indices and re-running the model is one option, but care must be taken to ensure that (i) it is appropriate to treat these observations as outliers and (ii) the model is adequate for the remaining data points.
Kowal, D. R., & Wu, B. (2021). Semiparametric count data regression for self‐reported mental health. Biometrics. doi:10.1111/biom.13617
# Simulate data with count-valued response y: sim_dat = simulate_nb_friedman(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X # EM algorithm for STAR (using the log-link) fit_em = randomForest_star(y = y, X = X, transformation = 'log', max_iters = 100) # Fitted values (out-of-bag) y_hat = fitted(fit_em) plot(y_hat, y); # Residuals: plot(residuals(fit_em)) qqnorm(residuals(fit_em)); qqline(residuals(fit_em)) # Log-likelihood at MLEs (out-of-bag): fit_em$logLik
# Simulate data with count-valued response y: sim_dat = simulate_nb_friedman(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X # EM algorithm for STAR (using the log-link) fit_em = randomForest_star(y = y, X = X, transformation = 'log', max_iters = 100) # Fitted values (out-of-bag) y_hat = fitted(fit_em) plot(y_hat, y); # Residuals: plot(residuals(fit_em)) qqnorm(residuals(fit_em)); qqline(residuals(fit_em)) # Log-likelihood at MLEs (out-of-bag): fit_em$logLik
Data on the efficacy of a pest management system at reducing the number of roaches in urban apartments.
roaches
roaches
## 'roaches' A data frame with 262 obs. of 6 variables:
Number of roaches caught
Pretreatment number of roaches
Treatment indicator
Indicator for only elderly residents in building
Number of days for which the roach traps were used
Gelman and Hill (2007); package 'rstanarm'
Define the rounding operator associated with the floor function. The function
also returns zero whenever the input is negative and caps the value at y_max
,
where y_max
is a known upper bound on the data y
(if specified).
round_floor(z, y_max = Inf)
round_floor(z, y_max = Inf)
z |
the real-valued input(s) |
y_max |
a fixed and known upper bound for all observations; default is |
The count-valued output(s) from the rounding function.
# Floor function: round_floor(1.5) round_floor(0.5) # Special treatmeant of negative numbers: round_floor(-1)
# Floor function: round_floor(1.5) round_floor(0.5) # Special treatmeant of negative numbers: round_floor(-1)
Sample the parameters for a linear regression model assuming a g-prior for the coefficients.
sample_lm_gprior(y, X, params, psi = NULL, XtX = NULL, X_test = NULL)
sample_lm_gprior(y, X, params, psi = NULL, XtX = NULL, X_test = NULL)
y |
|
X |
|
params |
the named list of parameters containing
|
psi |
the prior variance for the g-prior |
XtX |
the |
X_test |
matrix of predictors at test points (default is NULL) |
The updated named list params
with draws from the full conditional distributions
of sigma
and coefficients
(along with updated mu
and mu_test
if applicable).
The parameters in coefficients
are:
beta
: the p x 1
vector of regression coefficients
components of beta
# Simulate data for illustration: sim_dat = simulate_nb_lm(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X # Initialize: params = init_lm_gprior(y = y, X = X) # Sample: params = sample_lm_gprior(y = y, X = X, params = params) names(params) names(params$coefficients)
# Simulate data for illustration: sim_dat = simulate_nb_lm(n = 100, p = 5) y = sim_dat$y; X = sim_dat$X # Initialize: params = init_lm_gprior(y = y, X = X) # Sample: params = sample_lm_gprior(y = y, X = X, params = params) names(params) names(params$coefficients)
Compute simultaneous band scores (SimBaS) from Meyer et al. (2015, Biometrics). SimBaS uses MC(MC) simulations of a function of interest to compute the minimum alpha such that the joint credible bands at the alpha level do not include zero. This quantity is computed for each grid point (or observation point) in the domain of the function.
simBaS(sampFuns)
simBaS(sampFuns)
sampFuns |
|
m x 1
vector of simBaS
The input needs not be curves: the simBaS may be computed for vectors to achieve a multiplicity adjustment.
The minimum of the returned value, PsimBaS_t
,
over the domain t
is the Global Bayesian P-Value (GBPV) for testing
whether the function is zero everywhere.
Simulate data from a negative-binomial distribution with nonlinear mean function.
simulate_nb_friedman( n = 100, p = 10, r_nb = 1, b_int = log(1.5), b_sig = log(5), sigma_true = sqrt(2 * log(1)), seed = NULL )
simulate_nb_friedman( n = 100, p = 10, r_nb = 1, b_int = log(1.5), b_sig = log(5), sigma_true = sqrt(2 * log(1)), seed = NULL )
n |
number of observations |
p |
number of predictors |
r_nb |
the dispersion parameter of the Negative Binomial dispersion; smaller values imply greater overdispersion, while larger values approximate the Poisson distribution. |
b_int |
intercept; default is log(1.5). |
b_sig |
regression coefficients for true signals; default is log(5.0). |
sigma_true |
standard deviation of the Gaussian innovation; default is zero. |
seed |
optional integer to set the seed for reproducible simulation; default is NULL which results in a different dataset after each run |
The log-expected counts are modeled using the Friedman (1991) nonlinear function
with interactions, possibly
with additional Gaussian noise (on the log-scale). We assume that half of the predictors
are associated with the response, i.e., true signals. For sufficiently large dispersion
parameter r_nb
, the distribution will approximate a Poisson distribution.
Here, the predictor variables are simulated from independent uniform distributions.
A named list with the simulated count response y
, the simulated design matrix X
,
and the true expected counts Ey
.
Specifying sigma_true = sqrt(2*log(1 + a))
implies that the expected counts are
inflated by 100*a
% (relative to exp(X*beta)
), in addition to providing additional
overdispersion.
# Simulate and plot the count data: sim_dat = simulate_nb_friedman(n = 100, p = 10); plot(sim_dat$y)
# Simulate and plot the count data: sim_dat = simulate_nb_friedman(n = 100, p = 10); plot(sim_dat$y)
Simulate data from a negative-binomial distribution with linear mean function.
simulate_nb_lm( n = 100, p = 10, r_nb = 1, b_int = log(1.5), b_sig = log(2), sigma_true = sqrt(2 * log(1)), ar1 = 0, seed = NULL )
simulate_nb_lm( n = 100, p = 10, r_nb = 1, b_int = log(1.5), b_sig = log(2), sigma_true = sqrt(2 * log(1)), ar1 = 0, seed = NULL )
n |
number of observations |
p |
number of predictors (including the intercept) |
r_nb |
the dispersion parameter of the Negative Binomial dispersion; smaller values imply greater overdispersion, while larger values approximate the Poisson distribution. |
b_int |
intercept; default is log(1.5), which implies the expected count is 1.5 when all predictors are zero |
b_sig |
regression coefficients for true signals; default is log(2.0), which implies a twofold increase in the expected counts for a one unit increase in x |
sigma_true |
standard deviation of the Gaussian innovation; default is zero. |
ar1 |
the autoregressive coefficient among the columns of the X matrix; default is zero. |
seed |
optional integer to set the seed for reproducible simulation; default is NULL which results in a different dataset after each run |
The log-expected counts are modeled as a linear function of covariates, possibly
with additional Gaussian noise (on the log-scale). We assume that half of the predictors
are associated with the response, i.e., true signals. For sufficiently large dispersion
parameter r_nb
, the distribution will approximate a Poisson distribution.
Here, the predictor variables are simulated from independent standard normal distributions.
A named list with the simulated count response y
, the simulated design matrix X
(including an intercept), the true expected counts Ey
,
and the true regression coefficients beta_true
.
Specifying sigma_true = sqrt(2*log(1 + a))
implies that the expected counts are
inflated by 100*a
% (relative to exp(X*beta)
), in addition to providing additional
overdispersion.
# Simulate and plot the count data: sim_dat = simulate_nb_lm(n = 100, p = 10); plot(sim_dat$y)
# Simulate and plot the count data: sim_dat = simulate_nb_lm(n = 100, p = 10); plot(sim_dat$y)
Compute samples from the predictive distributions of a STAR spline regression model using either a Gibbs sampling approach or exact Monte Carlo sampling (default is Gibbs sampling which scales better for large n).
spline_star( y, tau = NULL, transformation = "np", y_max = Inf, psi = NULL, nsave = 1000, use_MCMC = TRUE, nburn = 1000, nskip = 0, verbose = TRUE )
spline_star( y, tau = NULL, transformation = "np", y_max = Inf, psi = NULL, nsave = 1000, use_MCMC = TRUE, nburn = 1000, nskip = 0, verbose = TRUE )
y |
|
tau |
|
transformation |
transformation to use for the latent data; must be one of
|
y_max |
a fixed and known upper bound for all observations; default is |
psi |
prior variance (1/smoothing parameter); if NULL, update in MCMC |
nsave |
number of MCMC iterations to save (or number of Monte Carlo simulations) |
use_MCMC |
logical; whether to run Gibbs sampler or Monte Carlo (default is TRUE) |
nburn |
number of MCMC iterations to discard |
nskip |
number of MCMC iterations to skip between saving iterations, i.e., save every (nskip + 1)th draw |
verbose |
logical; if TRUE, print time remaining |
STAR defines a count-valued probability model by (1) specifying a Gaussian model for continuous *latent* data and (2) connecting the latent data to the observed data via a *transformation and rounding* operation. Here, the continuous latent data model is a spline regression.
There are several options for the transformation. First, the transformation
can belong to the *Box-Cox* family, which includes the known transformations
'identity', 'log', and 'sqrt'. Second, the transformation
can be estimated (before model fitting) using the empirical distribution of the
data y
. Options in this case include the empirical cumulative
distribution function (CDF), which is fully nonparametric ('np'), or the parametric
alternatives based on Poisson ('pois') or Negative-Binomial ('neg-bin')
distributions. For the parametric distributions, the parameters of the distribution
are estimated using moments (means and variances) of y
. The distribution-based
transformations approximately preserve the mean and variance of the count data y
on the latent data scale, which lends interpretability to the model parameters.
Lastly, the transformation can be modeled using the Bayesian bootstrap ('bnp'),
which is a Bayesian nonparametric model and incorporates the uncertainty
about the transformation into posterior and predictive inference.
a list with the following elements:
post.pred
: nsave x n
samples
from the posterior predictive distribution at the observation points tau
marg_like
: the marginal likelihood (only if use_MCMC=FALSE
; otherwise NULL)
For the 'bnp' transformation
there are numerical stability issues when psi
is modeled as unknown.
In this case, it is better to fix psi
at some positive number.
# Simulate some data: n = 100 tau = seq(0,1, length.out = n) y = round_floor(exp(1 + rnorm(n)/4 + poly(tau, 4)%*%rnorm(n=4, sd = 4:1))) # Sample from the predictive distribution of a STAR spline model: fit = spline_star(y = y, tau = tau) # Compute 90% prediction intervals: pi_y = t(apply(fit$post.pred, 2, quantile, c(0.05, .95))) # Plot the results: intervals, median, and smoothed mean plot(tau, y, ylim = range(pi_y, y)) polygon(c(tau, rev(tau)),c(pi_y[,2], rev(pi_y[,1])),col='gray', border=NA) lines(tau, apply(fit$post.pred, 2, median), lwd=5, col ='black') lines(tau, smooth.spline(tau, apply(fit$post.pred, 2, mean))$y, lwd=5, col='blue') lines(tau, y, type='p')
# Simulate some data: n = 100 tau = seq(0,1, length.out = n) y = round_floor(exp(1 + rnorm(n)/4 + poly(tau, 4)%*%rnorm(n=4, sd = 4:1))) # Sample from the predictive distribution of a STAR spline model: fit = spline_star(y = y, tau = tau) # Compute 90% prediction intervals: pi_y = t(apply(fit$post.pred, 2, quantile, c(0.05, .95))) # Plot the results: intervals, median, and smoothed mean plot(tau, y, ylim = range(pi_y, y)) polygon(c(tau, rev(tau)),c(pi_y[,2], rev(pi_y[,1])),col='gray', border=NA) lines(tau, apply(fit$post.pred, 2, median), lwd=5, col ='black') lines(tau, smooth.spline(tau, apply(fit$post.pred, 2, mean))$y, lwd=5, col='blue') lines(tau, y, type='p')
This function outputs posterior quantities and forecasts from a univariate warpDLM model. Currently two latent DLM specifications are supported: local level and the local linear trend.
warpDLM( y, type = c("level", "trend"), transformation = c("np", "identity", "log", "sqrt", "pois", "neg-bin"), y_max = Inf, R0 = 10, nsave = 5000, nburn = 5000, nskip = 1, n.ahead = 1 )
warpDLM( y, type = c("level", "trend"), transformation = c("np", "identity", "log", "sqrt", "pois", "neg-bin"), y_max = Inf, R0 = 10, nsave = 5000, nburn = 5000, nskip = 1, n.ahead = 1 )
y |
the count-valued time series |
type |
the type of latent DLM (must be either level or trend) |
transformation |
transformation to use for the latent process (default is np); must be one of
|
y_max |
a fixed and known upper bound for all observations; default is |
R0 |
the variance for the initial state theta_0; default is 10 |
nsave |
number of MCMC iterations to save |
nburn |
number of MCMC iterations to discard |
nskip |
number of MCMC iterations to skip between saving iterations, i.e., save every (nskip + 1)th draw |
n.ahead |
number of steps to forecast ahead |
A list with the following elements:
V_post
: posterior draws of the observation variance
W_post
: posterior draws of the state update variance(s)
fc_post
: draws from the forecast distribution (of length n.ahead)
post_pred
: draws from the posterior predictive distribution of y
g_func
: transformation function
g_inv_func
: inverse transformation function
KFAS_mod
: the final KFAS model representing the latent DLM