The purpose of this R Markdown document is to demonstrate simple Bayesian linear model estimation in the context of treatment effects regression using simulated data. We first estimate classical models using ordinary least squares and compare them with ANOVA. We then estimate the equivalent Bayesian models with JAGS (demonstrating the syntax of JAGS code) and compare them with the deviance information criterion (DIC). Next, we use the package rstanarm to estimate the same models using syntax similar to classical modeling. Finally, we use the package loo to compare models using Pareto-smoothed importance sampling leave-one-out cross-validation (PSIS-LOO) and construct model stacking weights.
First, we clear memory and load required packages:
rm(list = ls())
library(tidyverse)
library(runjags)
library(rjags)
library(rstan)
library(rstanarm)
library(loo)
options(mc.cores = 4)
# Inverse-logit (logistic) helper: maps log-odds to probabilities
inv_logit = function(x) {
exp(x) / (1 + exp(x))
}
Next, we generate a simulated RCT dataset with \(N=500\). We have 3 slightly correlated normal covariates, \(X_1\), \(X_2\) and \(X_3\), and a randomly assigned treatment \(TX\). The outcome \(Y\) is normally distributed and is a function of the covariates, the treatment, and an interaction between treatment and the effect modifier \(X_1\). The true effects of covariates and treatment are known:
\(\beta_0 = 0\)
\(\beta_{X1} = -0.3\)
\(\beta_{X2} = 0.5\)
\(\beta_{X3} = -0.5\)
\(\beta_{TX} = 0.5\)
\(\beta_{X1:TX} = 0.25\)
Treatment is effective, increasing \(Y\) by 0.5 units on average. Treatment effectiveness increases by 0.25 per unit increase in \(X_1\). Note that this heterogeneous treatment effect implies that persons with \(X_1 < -2\) are harmed by treatment.
set.seed(446090)
N = 500
X = MASS::mvrnorm(n = N,
mu = c(0, 0, 0),
Sigma = matrix(
data = c(1, 0.3, -0.3,
0.3, 1, 0.2,
-0.3, 0.2, 1),
nrow = 3
))
TX = rbinom(n = N, size = 1, prob = .5)
mu_Y = -0.3 * X[, 1] + 0.5 * X[, 2] - 0.5 * X[, 3] + 0.5 * TX + .25 * TX *
X[, 1]
Y = rnorm(n = N, mean = mu_Y, sd = 1)
D = tibble(
Y = Y,
TX = TX,
X1 = X[, 1],
X2 = X[, 2],
X3 = X[, 3]
)
We begin by conducting treatment effects linear regression using 3 specifications:
lm.min = lm(Y ~ X1 + X2 + X3 + TX, data = D)
lm.full = lm(Y ~ X1 + X2 + X3 + TX + X1 * TX + X2 * TX + X3 * TX, data = D)
lm.true = lm(Y ~ X1 + X2 + X3 + TX + X1 * TX, data = D)
print(summary(lm.min))
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + TX, data = D)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.06862 -0.67609 -0.00567 0.57240 3.03158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04891 0.06057 -0.807 0.41981
## X1 -0.14089 0.05029 -2.801 0.00529 **
## X2 0.56581 0.04877 11.601 < 2e-16 ***
## X3 -0.51903 0.05115 -10.147 < 2e-16 ***
## TX 0.54773 0.08766 6.249 8.94e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9784 on 495 degrees of freedom
## Multiple R-squared: 0.3161, Adjusted R-squared: 0.3106
## F-statistic: 57.2 on 4 and 495 DF, p-value: < 2.2e-16
print(summary(lm.full))
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + TX + X1 * TX + X2 * TX + X3 *
## TX, data = D)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.87832 -0.67161 0.01362 0.60252 2.92360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.05156 0.06020 -0.857 0.39212
## X1 -0.22139 0.06833 -3.240 0.00128 **
## X2 0.50402 0.06658 7.570 1.86e-13 ***
## X3 -0.46512 0.06927 -6.714 5.22e-11 ***
## TX 0.54710 0.08716 6.277 7.57e-10 ***
## X1:TX 0.17801 0.10022 1.776 0.07633 .
## X2:TX 0.12226 0.09743 1.255 0.21015
## X3:TX -0.11133 0.10238 -1.087 0.27739
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9722 on 492 degrees of freedom
## Multiple R-squared: 0.3289, Adjusted R-squared: 0.3193
## F-statistic: 34.44 on 7 and 492 DF, p-value: < 2.2e-16
print(summary(lm.true))
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + TX + X1 * TX, data = D)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.95187 -0.67872 -0.00155 0.61929 2.94490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.05058 0.06019 -0.840 0.40114
## X1 -0.25225 0.06469 -3.899 0.00011 ***
## X2 0.56357 0.04847 11.628 < 2e-16 ***
## X3 -0.52000 0.05083 -10.231 < 2e-16 ***
## TX 0.54947 0.08710 6.308 6.28e-10 ***
## X1:TX 0.24207 0.08931 2.710 0.00695 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9722 on 494 degrees of freedom
## Multiple R-squared: 0.3261, Adjusted R-squared: 0.3193
## F-statistic: 47.82 on 5 and 494 DF, p-value: < 2.2e-16
In the minimal model, the treatment main effect is estimated close to its true value, but obviously the heterogeneity with respect to \(X_1\) is missed, and the estimated effect of \(X_1\) on \(Y\) is attenuated toward zero (it absorbs part of the omitted interaction). Adding all 3 interactions (when only one is truly present) reduces the bias in \(X_1\) and shows mild support for a treatment effect interaction with \(X_1\). In the true model (which, of course, would not be known in advance), the estimates for \(X_1\), \(TX\) and \(X_1:TX\) are close to their true values.
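To see the heterogeneity concretely, we can sketch the conditional treatment effect implied by lm.true, \(\widehat{TE}(X_1) = \hat{\beta}_{TX} + \hat{\beta}_{X1:TX} X_1\), over a few (arbitrary) values of \(X_1\):
x1.grid = seq(-3, 3, by = 1)
TE.hat = coef(lm.true)["TX"] + coef(lm.true)["X1:TX"] * x1.grid
round(data.frame(X1 = x1.grid, TE = TE.hat), 2)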
We can use analysis of variance to compare the 3 models:
anova(lm.min, lm.full, lm.true)
## Analysis of Variance Table
##
## Model 1: Y ~ X1 + X2 + X3 + TX
## Model 2: Y ~ X1 + X2 + X3 + TX + X1 * TX + X2 * TX + X3 * TX
## Model 3: Y ~ X1 + X2 + X3 + TX + X1 * TX
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 495 473.85
## 2 492 465.00 3 8.8457 3.1198 0.02578 *
## 3 494 466.91 -2 -1.9021 1.0063 0.36633
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If we had to choose among these three models, which would we choose? The full model has the lowest residual sum of squares and a statistically significantly better fit than the minimal model. Compared to the full model, the true model fits slightly worse overall (though it is less complex by two degrees of freedom). The choice is unclear. What we would really like to know is which of the three models has the highest probability of giving predictions for new observations that are closest to what would actually happen. For that, we will need a Bayesian approach.
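(Because we simulated the data, we can actually peek at the answer: generate a fresh dataset from the same data-generating process and compare out-of-sample prediction error for the three fits. The sketch below is purely illustrative and uses an arbitrary seed.)
set.seed(999)
X.new = MASS::mvrnorm(n = N, mu = c(0, 0, 0),
                      Sigma = matrix(c(1, 0.3, -0.3,
                                       0.3, 1, 0.2,
                                       -0.3, 0.2, 1), nrow = 3))
TX.new = rbinom(n = N, size = 1, prob = .5)
D.new = tibble(
  X1 = X.new[, 1],
  X2 = X.new[, 2],
  X3 = X.new[, 3],
  TX = TX.new,
  Y = rnorm(N, mean = -0.3 * X1 + 0.5 * X2 - 0.5 * X3 + 0.5 * TX + 0.25 * TX * X1, sd = 1)
)
# Out-of-sample root mean squared error for each specification
sapply(list(min = lm.min, full = lm.full, true = lm.true),
       function(m) sqrt(mean((D.new$Y - predict(m, newdata = D.new))^2)))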
Just Another Gibbs Sampler (JAGS) is a language based on the BUGS (Bayesian inference Using Gibbs Sampling) language. We can use JAGS/BUGS syntax to specify a likelihood and priors, then estimate our linear models using Markov chain Monte Carlo (MCMC) - a numerical approach to drawing simulated samples from the posterior distribution given the prior, the likelihood, and the data. Here is code for the minimal model:
JAGS.min =
"model{
#Individual likelihood
for (i in 1:N){
Y[i] ~ dnorm(mu.Y[i],tau.Y)
mu.Y[i] = b0 + b1*X[i,1] + b2*X[i,2] + b3*X[i,3] + b4*TX[i]
}
#Priors
b0 ~ dnorm(0,.01)
b1 ~ dnorm(0,.01)
b2 ~ dnorm(0,.01)
b3 ~ dnorm(0,.01)
b4 ~ dnorm(0,.01)
tau.Y ~ dgamma(.001,.001)
}"
Note that we have used minimally informative priors (slope coefficients centered on 0 with precision \(\tau = 0.01\)). Precision is the reciprocal of variance (\(\tau = 1/\sigma^2\)), so a precision of \(\tau = 0.01\) says our uncertainty about the effects of covariates and treatment on outcomes has a standard deviation \(\sigma = 1/\sqrt{\tau} = 10\), meaning plausible effects from roughly -20 to 20, with the most likely effects closer to zero. We also have to specify a parameter governing the conditional (residual) variance of \(Y\). Assuming we don't know beforehand how well our model will fit, we place a vague \(\Gamma(0.001, 0.001)\) prior on the residual precision \(\tau_Y\).
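As a quick sanity check on that prior scale:
prior.precision = 0.01
prior.sd = 1 / sqrt(prior.precision)   # = 10
c(-2, 2) * prior.sd                    # roughly -20 to 20 covers ~95% of the prior mass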
We use the package runjags to run the model with 4 chains, which can run in parallel on 4 CPU cores. But first, we have to assign starting values to the 4 chains. We want them to start at different values so that we can check that they have converged to the same posterior distribution.
numchains = 4
inits.min <-
replicate(
numchains,
list(
b0 = runif(n = 1, min = -3, max = 3),
b1 = runif(n = 1, min = -3, max = 3),
b2 = runif(n = 1, min = -3, max = 3),
b3 = runif(n = 1, min = -3, max = 3),
b4 = runif(n = 1, min = -3, max = 3),
tau.Y = runif(n = 1, min = .1, max = 5)
),
simplify = FALSE
)
Now, we run the first (minimal) model:
set.seed(11235)
results.JAGS.min = run.jags(
model = JAGS.min,
monitor = c("b0", "b1", "b2", "b3", "b4", "tau.Y"),
data = list(
N = N,
Y = Y,
X = X,
TX = TX
),
n.chains = numchains,
inits = inits.min,
burnin = 1000,
sample = 5000,
method = "rjparallel"
)
## Compiling rjags model...
## Note: the model did not require adaptation
## Starting 4 rjags simulations using a PSOCK cluster with 4 nodes on host
## 'localhost'
## Simulation complete
## Calculating summary statistics...
## Calculating the Gelman-Rubin statistic for 6 variables....
## Finished running the simulation
Let's visualize the posterior draws to make sure the MCMC algorithm behaved well. What we want to see are samples from each chain that are stable (no systematic upward or downward drift), rapidly mixing (low autocorrelation, suggesting efficient sampling from the posterior), and overlapping (regardless of the starting value, the samplers are drawing posterior values from the same region of parameter space). We also want to see nice, smooth, unimodal posterior draw histograms - indicating that we have drawn enough samples to recover the shape of the posterior distribution and that the parameter is well-identified.
plot(results.JAGS.min)
## Generating plots...
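Beyond the visual checks, we can compute numerical diagnostics by converting the runjags results to a coda mcmc.list (a quick sketch; these overlap with the SSeff and psrf columns in the summary below):
mcmc.min = coda::as.mcmc.list(results.JAGS.min)
coda::gelman.diag(mcmc.min)    # potential scale reduction factors; values near 1 are good
coda::effectiveSize(mcmc.min)  # effective sample size per parameter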
Taking summary statistics of the MCMC posterior draws, we obtain results similar to those from the classical regression:
summary(results.JAGS.min)
## Lower95 Median Upper95 Mean SD Mode
## b0 -0.1681250 -0.04849114 0.06507674 -0.04920064 0.06018526 NA
## b1 -0.2439049 -0.14032144 -0.04542164 -0.14059050 0.05071415 NA
## b2 0.4693117 0.56524980 0.66192694 0.56561046 0.04912695 NA
## b3 -0.6220105 -0.51853463 -0.42139446 -0.51883515 0.05144964 NA
## b4 0.3797907 0.54776890 0.72295024 0.54833224 0.08693745 NA
## tau.Y 0.9191435 1.04266080 1.17365882 1.04412326 0.06538558 NA
## MCerr MC%ofSD SSeff AC.10 psrf
## b0 0.0007204767 1.2 6978 0.008526455 1.000076
## b1 0.0004538411 0.9 12487 -0.011799306 1.000013
## b2 0.0004346465 0.9 12775 -0.015716707 1.000020
## b3 0.0004678172 0.9 12095 -0.005865949 1.000071
## b4 0.0010312769 1.2 7107 0.004061461 1.000148
## tau.Y 0.0004742463 0.7 19009 -0.008185708 1.000102
The posterior mean and standard deviation estimates are nearly identical to the least squares point estimates and standard errors for the minimal model. Bayes doesn’t magically solve the problem of model misspecification! We still have omitted variables bias here.
Next, we do the same for the full and true models (we’ll suppress the output here) - which also give similar results to the comparable OLS models.
JAGS.full =
"model{
#Individual likelihood
for (i in 1:N){
Y[i] ~ dnorm(mu.Y[i],tau.Y)
mu.Y[i] = b0 + b1*X[i,1] + b2*X[i,2] + b3*X[i,3] + b4*TX[i] + b5*X[i,1]*TX[i] + b6*X[i,2]*TX[i] + b7*X[i,3]*TX[i]
}
#Priors
b0 ~ dnorm(0,.01)
b1 ~ dnorm(0,.01)
b2 ~ dnorm(0,.01)
b3 ~ dnorm(0,.01)
b4 ~ dnorm(0,.01)
b5 ~ dnorm(0,.01)
b6 ~ dnorm(0,.01)
b7 ~ dnorm(0,.01)
tau.Y ~ dgamma(.001,.001)
}"
inits.full <-
replicate(
numchains,
list(
b0 = runif(n = 1, min = -3, max = 3),
b1 = runif(n = 1, min = -3, max = 3),
b2 = runif(n = 1, min = -3, max = 3),
b3 = runif(n = 1, min = -3, max = 3),
b4 = runif(n = 1, min = -3, max = 3),
b5 = runif(n = 1, min = -3, max = 3),
b6 = runif(n = 1, min = -3, max = 3),
b7 = runif(n = 1, min = -3, max = 3),
tau.Y = runif(n = 1, min = .1, max = 5)
),
simplify = FALSE
)
set.seed(11235)
results.JAGS.full = run.jags(
model = JAGS.full,
monitor = c("b0", "b1", "b2", "b3", "b4", "b5", "b6", "b7", "tau.Y"),
data = list(
N = N,
Y = Y,
X = X,
TX = TX
),
n.chains = numchains,
inits = inits.full,
burnin = 1000,
sample = 5000,
method = "rjparallel"
)
## Compiling rjags model...
## Note: the model did not require adaptation
## Starting 4 rjags simulations using a PSOCK cluster with 4 nodes on host
## 'localhost'
## Simulation complete
## Calculating summary statistics...
## Calculating the Gelman-Rubin statistic for 9 variables....
## Finished running the simulation
JAGS.true =
"model{
#Individual likelihood
for (i in 1:N){
Y[i] ~ dnorm(mu.Y[i],tau.Y)
mu.Y[i] = b0 + b1*X[i,1] + b2*X[i,2] + b3*X[i,3] + b4*TX[i] + b5*X[i,1]*TX[i]
}
#Priors
b0 ~ dnorm(0,.01)
b1 ~ dnorm(0,.01)
b2 ~ dnorm(0,.01)
b3 ~ dnorm(0,.01)
b4 ~ dnorm(0,.01)
b5 ~ dnorm(0,.01)
tau.Y ~ dgamma(.001,.001)
}"
inits.true <-
replicate(
numchains,
list(
b0 = runif(n = 1, min = -3, max = 3),
b1 = runif(n = 1, min = -3, max = 3),
b2 = runif(n = 1, min = -3, max = 3),
b3 = runif(n = 1, min = -3, max = 3),
b4 = runif(n = 1, min = -3, max = 3),
b5 = runif(n = 1, min = -3, max = 3),
tau.Y = runif(n = 1, min = .1, max = 5)
),
simplify = FALSE
)
set.seed(11235)
results.JAGS.true = run.jags(
model = JAGS.true,
monitor = c("b0", "b1", "b2", "b3", "b4", "b5", "tau.Y"),
data = list(
N = N,
Y = Y,
X = X,
TX = TX
),
n.chains = numchains,
inits = inits.true,
burnin = 1000,
sample = 5000,
method = "rjparallel"
)
## Compiling rjags model...
## Note: the model did not require adaptation
## Starting 4 rjags simulations using a PSOCK cluster with 4 nodes on host
## 'localhost'
## Simulation complete
## Calculating summary statistics...
## Calculating the Gelman-Rubin statistic for 7 variables....
## Finished running the simulation
We can use the deviance information criterion (DIC) to compare models. DIC is a function of the mean model deviance \(D(\theta_s) = -2 \log p(y \mid \theta_s)\), evaluated at each MCMC draw \(s\), and the effective number of parameters in the model. Lower values of DIC indicate better fit (higher likelihood) for a given level of model complexity (fewer effective parameters).
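In one common formulation, with \(\bar{D}\) the posterior mean deviance and \(D(\bar{\theta})\) the deviance evaluated at the posterior means,
\[
p_D = \bar{D} - D(\bar{\theta}), \qquad \mathrm{DIC} = \bar{D} + p_D.
\]
(The penalty reported by rjags below is computed somewhat differently, but it plays the same role.)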
dic.min = runjags::extract(results.JAGS.min,what = "dic")
## Compiling rjags model and adapting for 1000 iterations...
## Obtaining DIC samples from 5000 iterations...
print(dic.min)
## Mean deviance: 1398
## penalty 6.004
## Penalized deviance: 1404
dic.full = runjags::extract(results.JAGS.full,what = "dic")
## Compiling rjags model and adapting for 1000 iterations...
## Obtaining DIC samples from 5000 iterations...
dic.true = runjags::extract(results.JAGS.true,what = "dic")
## Compiling rjags model and adapting for 1000 iterations...
## Obtaining DIC samples from 5000 iterations...
diffdic(dic.full,dic.min)
## Difference: -3.371497
## Sample standard error: 5.768775
diffdic(dic.true,dic.full)
## Difference: -1.976078
## Sample standard error: 2.726292
diffdic(dic.true,dic.min)
## Difference: -5.347576
## Sample standard error: 5.160712
There is no absolute scale for interpreting DIC, but because the deviance contributions are calculated for each observation, the difference in DIC between two models has a sample standard error that gives us a sense of magnitude. In this case, the differences in DIC are not large relative to their sample standard errors, suggesting no strong preference for one model over another.
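For example, the largest difference above is only about one standard error:
abs(-5.347576) / 5.160712   # ~ 1.04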
Stan is currently one of the most actively developed software platforms for Bayesian estimation, and it integrates well with R. Like JAGS/BUGS, Stan has a programming syntax that allows for a vast array of customized models. But for the vast majority of applied modeling, helper packages like brms and rstanarm let you fit common models with familiar R formula syntax (rstanarm ships pre-compiled models, while brms generates and compiles Stan code for you). With the brms package, you can actually view the Stan code generated by a simple model (note: it ends up looking not so simple!). For example:
brms.min = brms::brm(formula = Y ~ X1 + X2 + X3 + TX,
data = D)
brms::stancode(brms.min)
## // generated with brms 2.16.1
## functions {
## }
## data {
## int<lower=1> N; // total number of observations
## vector[N] Y; // response variable
## int<lower=1> K; // number of population-level effects
## matrix[N, K] X; // population-level design matrix
## int prior_only; // should the likelihood be ignored?
## }
## transformed data {
## int Kc = K - 1;
## matrix[N, Kc] Xc; // centered version of X without an intercept
## vector[Kc] means_X; // column means of X before centering
## for (i in 2:K) {
## means_X[i - 1] = mean(X[, i]);
## Xc[, i - 1] = X[, i] - means_X[i - 1];
## }
## }
## parameters {
## vector[Kc] b; // population-level effects
## real Intercept; // temporary intercept for centered predictors
## real<lower=0> sigma; // dispersion parameter
## }
## transformed parameters {
## }
## model {
## // likelihood including constants
## if (!prior_only) {
## target += normal_id_glm_lpdf(Y | Xc, Intercept, b, sigma);
## }
## // priors including constants
## target += student_t_lpdf(Intercept | 3, 0.2, 2.5);
## target += student_t_lpdf(sigma | 3, 0, 2.5)
## - 1 * student_t_lccdf(0 | 3, 0, 2.5);
## }
## generated quantities {
## // actual population-level intercept
## real b_Intercept = Intercept - dot_product(means_X, b);
## }
Luckily, rstanarm (or brms) does the hard work for us. We will show how to estimate the above models using rstanarm and compare them using Pareto-smoothed importance sampling leave-one-out cross-validation (PSIS-LOO). As you can see from the code below, using Stan to estimate simple GLMs is as easy as with classical methods. Note that unless we specify more informative priors, rstanarm will use weakly informative defaults, so point estimates and posterior standard deviations should be very close to classical OLS/ML estimates and standard errors for most simple models.
stanfit.min = stan_glm(
Y ~ X1 + X2 + X3 + TX,
data = D,
chains = 4,
seed = 11235
)
stanfit.full = stan_glm(
Y ~ X1 + X2 + X3 + TX + X1 * TX + X2 * TX + X3 * TX,
data = D,
chains = 4,
seed = 11235
)
stanfit.true = stan_glm(
Y ~ X1 + X2 + X3 + TX + X1 * TX ,
data = D,
chains = 4,
seed = 11235
)
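As noted above, because we did not set priors explicitly, rstanarm used its weakly informative defaults. These can be inspected, and overridden through the prior arguments if we want something different (the normal(0, 10) values below are purely illustrative):
prior_summary(stanfit.min)
# To set priors explicitly, e.g.:
# stan_glm(Y ~ X1 + X2 + X3 + TX, data = D,
#          prior = normal(0, 10), prior_intercept = normal(0, 10),
#          chains = 4, seed = 11235)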
Just like with JAGS, we can summarize the posterior draws to obtain point estimates and credible intervals.
summary(stanfit.min,digits = 3)
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: Y ~ X1 + X2 + X3 + TX
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 500
## predictors: 5
##
## Estimates:
## mean sd 10% 50% 90%
## (Intercept) -0.049 0.061 -0.127 -0.049 0.031
## X1 -0.141 0.050 -0.206 -0.142 -0.077
## X2 0.566 0.048 0.505 0.565 0.627
## X3 -0.520 0.051 -0.585 -0.519 -0.456
## TX 0.547 0.089 0.433 0.547 0.660
## sigma 0.981 0.032 0.941 0.980 1.023
##
## Fit Diagnostics:
## mean sd 10% 50% 90%
## mean_PPD 0.223 0.063 0.142 0.222 0.303
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.001 1.000 6019
## X1 0.001 1.000 3532
## X2 0.001 1.001 3883
## X3 0.001 1.001 3639
## TX 0.001 1.000 6262
## sigma 0.000 1.000 5714
## mean_PPD 0.001 0.999 5052
## log-posterior 0.042 1.003 1710
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
summary(stanfit.full,digits = 3)
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: Y ~ X1 + X2 + X3 + TX + X1 * TX + X2 * TX + X3 * TX
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 500
## predictors: 8
##
## Estimates:
## mean sd 10% 50% 90%
## (Intercept) -0.051 0.062 -0.129 -0.051 0.027
## X1 -0.222 0.068 -0.309 -0.223 -0.132
## X2 0.503 0.069 0.414 0.503 0.592
## X3 -0.465 0.069 -0.555 -0.464 -0.378
## TX 0.547 0.089 0.429 0.548 0.662
## X1:TX 0.177 0.100 0.047 0.177 0.308
## X2:TX 0.122 0.099 -0.006 0.123 0.250
## X3:TX -0.112 0.103 -0.243 -0.111 0.016
## sigma 0.974 0.031 0.935 0.973 1.012
##
## Fit Diagnostics:
## mean sd 10% 50% 90%
## mean_PPD 0.223 0.063 0.144 0.223 0.303
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.001 1.001 4074
## X1 0.001 1.000 2786
## X2 0.001 1.001 2667
## X3 0.001 1.000 2969
## TX 0.002 1.000 3332
## X1:TX 0.002 1.000 2678
## X2:TX 0.002 1.002 2475
## X3:TX 0.002 1.001 2768
## sigma 0.000 0.999 4058
## mean_PPD 0.001 1.000 4331
## log-posterior 0.052 1.001 1722
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
summary(stanfit.true,digits = 3)
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: Y ~ X1 + X2 + X3 + TX + X1 * TX
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 500
## predictors: 6
##
## Estimates:
## mean sd 10% 50% 90%
## (Intercept) -0.050 0.060 -0.125 -0.051 0.026
## X1 -0.250 0.064 -0.330 -0.252 -0.166
## X2 0.565 0.049 0.502 0.565 0.626
## X3 -0.520 0.050 -0.584 -0.521 -0.456
## TX 0.548 0.086 0.439 0.548 0.660
## X1:TX 0.239 0.090 0.124 0.239 0.353
## sigma 0.974 0.031 0.935 0.972 1.013
##
## Fit Diagnostics:
## mean sd 10% 50% 90%
## mean_PPD 0.223 0.062 0.144 0.223 0.303
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.001 1.000 3982
## X1 0.001 1.000 2465
## X2 0.001 1.000 3422
## X3 0.001 1.000 3217
## TX 0.001 1.000 4217
## X1:TX 0.002 1.000 2788
## sigma 0.000 1.000 4153
## mean_PPD 0.001 0.999 3918
## log-posterior 0.040 1.000 2190
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
We can visualize our posterior draws using ggplot2 methods defined for stanreg objects (below is for the minimal model).
stan_trace(stanfit.min)
stan_dens(stanfit.min,separate_chains = TRUE)
stan_plot(stanfit.min, show_density = TRUE)
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)
Finally, we will compare models based on posterior prediction rather than in-sample fit. In theory, we'd like to judge a model's quality by how well it predicts outcomes for new observations. For true predictive validity, we need new data - which we don't really have access to. Instead, we use the principle of cross-validation, which mimics validation on external data by holding out a subset of the available data from the modeling process and then predicting those out-of-sample observations. Leave-one-out (LOO) cross-validation does this \(N\) times - leaving out each observation \(i = 1 \ldots N\) and predicting it using estimates obtained from the remaining \(N-1\) observations. This is a computationally intensive process because it requires fitting the model \(N\) times, but a good approximation can be obtained from a single fit using Pareto-smoothed importance sampling (PSIS-LOO).
So, rather than evaluating the likelihood of the same data used to fit the model, we estimate the 'elpd' - the expected log pointwise predictive density for held-out observations. Multiplying the elpd by -2 puts it on the deviance scale and gives the LOO information criterion (PSIS-LOO-IC); the penalty for model complexity (the effective number of parameters, p_loo) is already built into the elpd. PSIS-LOO-IC is similar to DIC but more robust for complex models. In this example, PSIS-LOO-IC and DIC are nearly identical because the models are very simple Gaussian linear regressions. Again, the comparison suggests that the true model is likely to predict future data best (it has the lowest PSIS-LOO-IC), adjusted for model complexity. But the strength of that preference is hard to judge.
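Concretely, writing \(y_{-i}\) for the data with observation \(i\) removed,
\[
\widehat{\mathrm{elpd}}_{\mathrm{loo}} = \sum_{i=1}^{N} \log p(y_i \mid y_{-i}), \qquad \mathrm{LOOIC} = -2\,\widehat{\mathrm{elpd}}_{\mathrm{loo}},
\]
where PSIS approximates each \(p(y_i \mid y_{-i})\) from a single model fit and p_loo (reported below) estimates the effective number of parameters.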
stanfit.min$loo = loo(stanfit.min, cores = 4)
stanfit.full$loo = loo(stanfit.full, cores = 4)
stanfit.true$loo = loo(stanfit.true, cores = 4)
stanfit.min$loo
##
## Computed from 4000 by 500 log-likelihood matrix
##
## Estimate SE
## elpd_loo -702.2 15.9
## p_loo 6.1 0.5
## looic 1404.3 31.9
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
stanfit.full$loo
##
## Computed from 4000 by 500 log-likelihood matrix
##
## Estimate SE
## elpd_loo -700.7 16.0
## p_loo 9.2 0.7
## looic 1401.3 31.9
## ------
## Monte Carlo SE of elpd_loo is 0.1.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
stanfit.true$loo
##
## Computed from 4000 by 500 log-likelihood matrix
##
## Estimate SE
## elpd_loo -699.3 15.9
## p_loo 6.9 0.5
## looic 1398.6 31.9
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
model_list <-
stanreg_list(
stanfit.min,
stanfit.full,
stanfit.true,
model_names = c("No Interactions", "Full Model", "True Model")
)
loo_compare(model_list, detail = TRUE)
## Model formulas:
## No Interactions: Y ~ X1 + X2 + X3 + TX
## Full Model: Y ~ X1 + X2 + X3 + TX + X1 * TX + X2 * TX + X3 * TX
## True Model: Y ~ X1 + X2 + X3 + TX + X1 * TX
##
## Model comparison based on LOO-CV:
## elpd_diff se_diff
## True Model 0.0 0.0
## Full Model -1.3 1.4
## No Interactions -2.8 2.6
But nothing says that we have to take the results from just one model! We can take the thought experiment further into the realm of Bayesian decision theory: because there is still uncertainty about which model is best, we might wish to average the predictions generated from the models we have estimated. The method of "stacking" solves for an optimal set of weights on each model's posterior predictive distribution - the weights that maximize the expected log predictive density of the combined (ensemble) prediction, again using PSIS-LOO. Loosely speaking, models that receive higher weight are those expected to contribute most to accurate prediction of new observations; a weight near zero means a model adds little predictive information beyond the others.
loo_model_weights(model_list,cores = 4, method = "stacking")
## Method: stacking
## ------
## weight
## No Interactions 0.086
## Full Model 0.000
## True Model 0.914
In this example, the true model (fortunately!) receives by far the largest stacking weight - 0.914 - and if we were to use an ensemble prediction, only the true model and the minimal model would be included, with the true model getting more than 10 times the weight of the minimal model.
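If we actually wanted that ensemble prediction, one simple approach (a sketch - the new covariate values below are arbitrary) is to weight each model's posterior-mean prediction by its stacking weight:
wts = loo_model_weights(model_list, cores = 4, method = "stacking")
D.pred = tibble(X1 = -1, X2 = 0, X3 = 0, TX = 1)   # hypothetical new observation
e.min  = colMeans(posterior_epred(stanfit.min,  newdata = D.pred))
e.full = colMeans(posterior_epred(stanfit.full, newdata = D.pred))
e.true = colMeans(posterior_epred(stanfit.true, newdata = D.pred))
w = as.numeric(wts)   # order: No Interactions, Full Model, True Model
w[1] * e.min + w[2] * e.full + w[3] * e.true
A fuller ensemble would mix draws from the three posterior predictive distributions in proportion to the weights, rather than averaging point predictions.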