The purpose of this R Markdown document is to demonstrate simple Bayesian linear model estimation in the context of treatment effects regression using simulated data. We first estimate classical models using ordinary least squares and compare them with ANOVA. We then estimate the equivalent Bayesian models with JAGS (demonstrating the syntax of JAGS code) and compare them with the deviance information criterion (DIC). Next, we use the package rstanarm to estimate the same models using syntax similar to classical modeling. Finally, we use the package loo to compare models using Pareto-smoothed importance sampling leave-one-out cross-validation (PSIS-LOO) and construct model stacking weights.
First, we clear memory and load required packages:
rm(list = ls())
library(tidyverse)
library(runjags)
library(rjags)
library(rstan)
library(rstanarm)
library(loo)
options(mc.cores = 4)
# Inverse-logit (logistic) helper: maps log-odds to probabilities
inv_logit = function(x) {
exp(x) / (1 + exp(x))
}
Next, we generate a simulated RCT dataset with \(N=500\). We have 3 slightly correlated normal covariates, \(X_1\), \(X_2\) and \(X_3\), and a randomly assigned treatment \(TX\). The outcome \(Y\) is normally distributed and is a function of the covariates, the treatment, and an interaction between treatment and the effect modifier \(X_1\). The true effects of covariates and treatment are known:
\(\beta_0 = 0\)
\(\beta_{X1} = -0.3\)
\(\beta_{X2} = 0.5\)
\(\beta_{X3} = -0.5\)
\(\beta_{TX} = 0.5\)
\(\beta_{X1:TX} = 0.25\)
Treatment is effective, increasing \(Y\) by 0.5 units on average. Treatment effectiveness increases by 0.25 per unit increase in \(X_1\). Note that this heterogeneous treatment effect implies that persons with \(X_1 < -2\) are harmed by treatment.
set.seed(446090)
N = 500
X = MASS::mvrnorm(n = N,
mu = c(0, 0, 0),
Sigma = matrix(
data = c(1, 0.3, -0.3,
0.3, 1, 0.2,
-0.3, 0.2, 1),
nrow = 3
))
TX = rbinom(n = N, size = 1, prob = .5)
mu_Y = -0.3 * X[, 1] + 0.5 * X[, 2] - 0.5 * X[, 3] + 0.5 * TX + .25 * TX *
X[, 1]
Y = rnorm(n = N, mean = mu_Y, sd = 1)
D = tibble(
Y = Y,
TX = TX,
X1 = X[, 1],
X2 = X[, 2],
X3 = X[, 3]
)
We begin by conducting treatment effects linear regression using 3 specifications:
lm.min = lm(Y ~ X1 + X2 + X3 + TX, data = D)
lm.full = lm(Y ~ X1 + X2 + X3 + TX + X1 * TX + X2 * TX + X3 * TX, data = D)
lm.true = lm(Y ~ X1 + X2 + X3 + TX + X1 * TX, data = D)
print(summary(lm.min))
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + TX, data = D)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.06862 -0.67609 -0.00567 0.57240 3.03158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04891 0.06057 -0.807 0.41981
## X1 -0.14089 0.05029 -2.801 0.00529 **
## X2 0.56581 0.04877 11.601 < 2e-16 ***
## X3 -0.51903 0.05115 -10.147 < 2e-16 ***
## TX 0.54773 0.08766 6.249 8.94e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9784 on 495 degrees of freedom
## Multiple R-squared: 0.3161, Adjusted R-squared: 0.3106
## F-statistic: 57.2 on 4 and 495 DF, p-value: < 2.2e-16
print(summary(lm.full))
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + TX + X1 * TX + X2 * TX + X3 *
## TX, data = D)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.87832 -0.67161 0.01362 0.60252 2.92360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.05156 0.06020 -0.857 0.39212
## X1 -0.22139 0.06833 -3.240 0.00128 **
## X2 0.50402 0.06658 7.570 1.86e-13 ***
## X3 -0.46512 0.06927 -6.714 5.22e-11 ***
## TX 0.54710 0.08716 6.277 7.57e-10 ***
## X1:TX 0.17801 0.10022 1.776 0.07633 .
## X2:TX 0.12226 0.09743 1.255 0.21015
## X3:TX -0.11133 0.10238 -1.087 0.27739
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9722 on 492 degrees of freedom
## Multiple R-squared: 0.3289, Adjusted R-squared: 0.3193
## F-statistic: 34.44 on 7 and 492 DF, p-value: < 2.2e-16
print(summary(lm.true))
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + TX + X1 * TX, data = D)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.95187 -0.67872 -0.00155 0.61929 2.94490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.05058 0.06019 -0.840 0.40114
## X1 -0.25225 0.06469 -3.899 0.00011 ***
## X2 0.56357 0.04847 11.628 < 2e-16 ***
## X3 -0.52000 0.05083 -10.231 < 2e-16 ***
## TX 0.54947 0.08710 6.308 6.28e-10 ***
## X1:TX 0.24207 0.08931 2.710 0.00695 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9722 on 494 degrees of freedom
## Multiple R-squared: 0.3261, Adjusted R-squared: 0.3193
## F-statistic: 47.82 on 5 and 494 DF, p-value: < 2.2e-16
In the minimal model, the treatment main effect is estimated close to its true value, but obviously the heterogeneity with respect to \(X_1\) is missed, and the estimated effect of \(X_1\) on \(Y\) is attenuated toward zero (it absorbs part of the omitted interaction). Adding all 3 interactions (when only one is truly present) reduces the bias in \(X_1\) and shows mild support for a treatment effect interaction with \(X_1\). In the true model (which, of course, would not be known in advance), the estimates for \(X_1\), \(TX\) and \(X_1:TX\) are close to their true values.
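To see the heterogeneity concretely, we can sketch the conditional treatment effect implied by lm.true, \(\widehat{TE}(X_1) = \hat{\beta}_{TX} + \hat{\beta}_{X1:TX} X_1\), over a few (arbitrary) values of \(X_1\):
x1.grid = seq(-3, 3, by = 1)
TE.hat = coef(lm.true)["TX"] + coef(lm.true)["X1:TX"] * x1.grid
round(data.frame(X1 = x1.grid, TE = TE.hat), 2)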
We can use analysis of variance to compare the 3 models:
anova(lm.min, lm.full, lm.true)
## Analysis of Variance Table
##
## Model 1: Y ~ X1 + X2 + X3 + TX
## Model 2: Y ~ X1 + X2 + X3 + TX + X1 * TX + X2 * TX + X3 * TX
## Model 3: Y ~ X1 + X2 + X3 + TX + X1 * TX
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 495 473.85
## 2 492 465.00 3 8.8457 3.1198 0.02578 *
## 3 494 466.91 -2 -1.9021 1.0063 0.36633
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If we had to choose among these three models, which would we choose? The full model has the lowest residual sum of squares and a statistically significantly better fit than the minimal model. Compared to the full model, the true model fits slightly worse overall (though it is less complex by two degrees of freedom). The choice is unclear. What we would really like to know is which of the three models has the highest probability of giving predictions for new observations that are closest to what would actually happen. For that, we will need a Bayesian approach.
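(Because we simulated the data, we can actually peek at the answer: generate a fresh dataset from the same data-generating process and compare out-of-sample prediction error for the three fits. The sketch below is purely illustrative and uses an arbitrary seed.)
set.seed(999)
X.new = MASS::mvrnorm(n = N, mu = c(0, 0, 0),
                      Sigma = matrix(c(1, 0.3, -0.3,
                                       0.3, 1, 0.2,
                                       -0.3, 0.2, 1), nrow = 3))
TX.new = rbinom(n = N, size = 1, prob = .5)
D.new = tibble(
  X1 = X.new[, 1],
  X2 = X.new[, 2],
  X3 = X.new[, 3],
  TX = TX.new,
  Y = rnorm(N, mean = -0.3 * X1 + 0.5 * X2 - 0.5 * X3 + 0.5 * TX + 0.25 * TX * X1, sd = 1)
)
# Out-of-sample root mean squared error for each specification
sapply(list(min = lm.min, full = lm.full, true = lm.true),
       function(m) sqrt(mean((D.new$Y - predict(m, newdata = D.new))^2)))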
Just Another Gibbs Sampler (JAGS) is a language based on the BUGS (Bayesian inference Using Gibbs Sampling) language. We can use JAGS/BUGS syntax to specify a likelihood and priors, then estimate our linear models using Markov chain Monte Carlo (MCMC) - a numerical approach to drawing simulated samples from the posterior distribution given the prior, the likelihood, and the data. Here is code for the minimal model:
JAGS.min =
"model{
#Individual likelihood
for (i in 1:N){
Y[i] ~ dnorm(mu.Y[i],tau.Y)
mu.Y[i] = b0 + b1*X[i,1] + b2*X[i,2] + b3*X[i,3] + b4*TX[i]
}
#Priors
b0 ~ dnorm(0,.01)
b1 ~ dnorm(0,.01)
b2 ~ dnorm(0,.01)
b3 ~ dnorm(0,.01)
b4 ~ dnorm(0,.01)
tau.Y ~ dgamma(.001,.001)
}"
Note that we have used minimally informative priors (slope coefficients centered on 0 with precision \(\tau = 0.01\)). Precision is the reciprocal of variance (\(\tau = 1/\sigma^2\)), so a precision of \(\tau = 0.01\) says our uncertainty about the effects of covariates and treatment on outcomes has a standard deviation \(\sigma = 1/\sqrt{\tau} = 10\), meaning plausible effects from roughly -20 to 20, with the most likely effects closer to zero. We also have to specify a parameter governing the conditional (residual) variance of \(Y\). Assuming we don't know beforehand how well our model will fit, we place a vague \(\Gamma(0.001, 0.001)\) prior on the residual precision \(\tau_Y\).
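As a quick sanity check on that prior scale:
prior.precision = 0.01
prior.sd = 1 / sqrt(prior.precision)   # = 10
c(-2, 2) * prior.sd                    # roughly -20 to 20 covers ~95% of the prior mass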
We use the package runjags to run the model with 4 chains, which can run in parallel on 4 CPU cores. But first, we have to assign starting values to the 4 chains. We want them to start at different values so that we can check that they have converged to the same posterior distribution.
numchains = 4
inits.min <-
replicate(
numchains,
list(
b0 = runif(n = 1, min = -3, max = 3),
b1 = runif(n = 1, min = -3, max = 3),
b2 = runif(n = 1, min = -3, max = 3),
b3 = runif(n = 1, min = -3, max = 3),
b4 = runif(n = 1, min = -3, max = 3),
tau.Y = runif(n = 1, min = .1, max = 5)
),
simplify = FALSE
)
Now, we run the first (minimal) model:
set.seed(11235)
results.JAGS.min = run.jags(
model = JAGS.min,
monitor = c("b0", "b1", "b2", "b3", "b4", "tau.Y"),
data = list(
N = N,
Y = Y,
X = X,
TX = TX
),
n.chains = numchains,
inits = inits.min,
burnin = 1000,
sample = 5000,
method = "rjparallel"
)
## Compiling rjags model...
## Note: the model did not require adaptation
## Starting 4 rjags simulations using a PSOCK cluster with 4 nodes on host
## 'localhost'
## Simulation complete
## Calculating summary statistics...
## Calculating the Gelman-Rubin statistic for 6 variables....
## Finished running the simulation
Let's visualize the posterior draws to make sure the MCMC algorithm behaved well. What we want to see are samples from each chain that are stable (no systematic upward or downward drift), rapidly mixing (low autocorrelation, suggesting efficient sampling from the posterior), and overlapping (regardless of the starting value, the samplers are drawing posterior values from the same region of parameter space). We also want to see nice, smooth, unimodal posterior draw histograms - indicating that we have drawn enough samples to recover the shape of the posterior distribution and that the parameter is well-identified.
plot(results.JAGS.min)
## Generating plots...
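Beyond the visual checks, we can compute numerical diagnostics by converting the runjags results to a coda mcmc.list (a quick sketch; these overlap with the SSeff and psrf columns in the summary below):
mcmc.min = coda::as.mcmc.list(results.JAGS.min)
coda::gelman.diag(mcmc.min)    # potential scale reduction factors; values near 1 are good
coda::effectiveSize(mcmc.min)  # effective sample size per parameter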
Taking summary statistics of the MCMC posterior draws, we obtain results similar to those from the classical regression:
summary(results.JAGS.min)
## Lower95 Median Upper95 Mean SD Mode
## b0 -0.1681250 -0.04849114 0.06507674 -0.04920064 0.06018526 NA
## b1 -0.2439049 -0.14032144 -0.04542164 -0.14059050 0.05071415 NA
## b2 0.4693117 0.56524980 0.66192694 0.56561046 0.04912695 NA
## b3 -0.6220105 -0.51853463 -0.42139446 -0.51883515 0.05144964 NA
## b4 0.3797907 0.54776890 0.72295024 0.54833224 0.08693745 NA
## tau.Y 0.9191435 1.04266080 1.17365882 1.04412326 0.06538558 NA
## MCerr MC%ofSD SSeff AC.10 psrf
## b0 0.0007204767 1.2 6978 0.008526455 1.000076
## b1 0.0004538411 0.9 12487 -0.011799306 1.000013
## b2 0.0004346465 0.9 12775 -0.015716707 1.000020
## b3 0.0004678172 0.9 12095 -0.005865949 1.000071
## b4 0.0010312769 1.2 7107 0.004061461 1.000148
## tau.Y 0.0004742463 0.7 19009 -0.008185708 1.000102
The posterior mean and standard deviation estimates are nearly identical to the least squares point estimates and standard errors for the minimal model. Bayes doesn’t magically solve the problem of model misspecification! We still have omitted variables bias here.
Next, we do the same for the full and true models (we’ll suppress the output here) - which also give similar results to the comparable OLS models.
JAGS.full =
"model{
#Individual likelihood
for (i in 1:N){
Y[i] ~ dnorm(mu.Y[i],tau.Y)
mu.Y[i] = b0 + b1*X[i,1] + b2*X[i,2] + b3*X[i,3] + b4*TX[i] + b5*X[i,1]*TX[i] + b6*X[i,2]*TX[i] + b7*X[i,3]*TX[i]
}
#Priors
b0 ~ dnorm(0,.01)
b1 ~ dnorm(0,.01)
b2 ~ dnorm(0,.01)
b3 ~ dnorm(0,.01)
b4 ~ dnorm(0,.01)
b5 ~ dnorm(0,.01)
b6 ~ dnorm(0,.01)
b7 ~ dnorm(0,.01)
tau.Y ~ dgamma(.001,.001)
}"
inits.full <-
replicate(
numchains,
list(
b0 = runif(n = 1, min = -3, max = 3),
b1 = runif(n = 1, min = -3, max = 3),
b2 = runif(n = 1, min = -3, max = 3),
b3 = runif(n = 1, min = -3, max = 3),
b4 = runif(n = 1, min = -3, max = 3),
b5 = runif(n = 1, min = -3, max = 3),
b6 = runif(n = 1, min = -3, max = 3),
b7 = runif(n = 1, min = -3, max = 3),
tau.Y = runif(n = 1, min = .1, max = 5)
),
simplify = FALSE
)
set.seed(11235)
results.JAGS.full = run.jags(
model = JAGS.full,
monitor = c("b0", "b1", "b2", "b3", "b4", "b5", "b6", "b7", "tau.Y"),
data = list(
N = N,
Y = Y,
X = X,
TX = TX
),
n.chains = numchains,
inits = inits.full,
burnin = 1000,
sample = 5000,
method = "rjparallel"
)
## Compiling rjags model...
## Note: the model did not require adaptation
## Starting 4 rjags simulations using a PSOCK cluster with 4 nodes on host
## 'localhost'
## Simulation complete
## Calculating summary statistics...
## Calculating the Gelman-Rubin statistic for 9 variables....
## Finished running the simulation
JAGS.true =
"model{
#Individual likelihood
for (i in 1:N){
Y[i] ~ dnorm(mu.Y[i],tau.Y)
mu.Y[i] = b0 + b1*X[i,1] + b2*X[i,2] + b3*X[i,3] + b4*TX[i] + b5*X[i,1]*TX[i]
}
#Priors
b0 ~ dnorm(0,.01)
b1 ~ dnorm(0,.01)
b2 ~ dnorm(0,.01)
b3 ~ dnorm(0,.01)
b4 ~ dnorm(0,.01)
b5 ~ dnorm(0,.01)
tau.Y ~ dgamma(.001,.001)
}"
inits.true <-
replicate(
numchains,
list(
b0 = runif(n = 1, min = -3, max = 3),
b1 = runif(n = 1, min = -3, max = 3),
b2 = runif(n = 1, min = -3, max = 3),
b3 = runif(n = 1, min = -3, max = 3),
b4 = runif(n = 1, min = -3, max = 3),
b5 = runif(n = 1, min = -3, max = 3),
tau.Y = runif(n = 1, min = .1, max = 5)
),
simplify = FALSE
)
set.seed(11235)
results.JAGS.true = run.jags(
model = JAGS.true,
monitor = c("b0", "b1", "b2", "b3", "b4", "b5", "tau.Y"),
data = list(
N = N,
Y = Y,
X = X,
TX = TX
),
n.chains = numchains,
inits = inits.true,
burnin = 1000,
sample = 5000,
method = "rjparallel"
)
## Compiling rjags model...
## Note: the model did not require adaptation
## Starting 4 rjags simulations using a PSOCK cluster with 4 nodes on host
## 'localhost'
## Simulation complete
## Calculating summary statistics...
## Calculating the Gelman-Rubin statistic for 7 variables....
## Finished running the simulation
We can use the deviance information criterion (DIC) to compare models. DIC is a function of the mean model deviance \(D(\theta_s) = -2 \log p(y \mid \theta_s)\), evaluated at each MCMC draw \(s\), and the effective number of parameters in the model. Lower values of DIC indicate better fit (higher likelihood) for a given level of model complexity (fewer effective parameters).
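In one common formulation, with \(\bar{D}\) the posterior mean deviance and \(D(\bar{\theta})\) the deviance evaluated at the posterior means,
\[
p_D = \bar{D} - D(\bar{\theta}), \qquad \mathrm{DIC} = \bar{D} + p_D.
\]
(The penalty reported by rjags below is computed somewhat differently, but it plays the same role.)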
dic.min = runjags::extract(results.JAGS.min,what = "dic")
## Compiling rjags model and adapting for 1000 iterations...
## Obtaining DIC samples from 5000 iterations...
print(dic.min)
## Mean deviance: 1398
## penalty 6.004
## Penalized deviance: 1404
dic.full = runjags::extract(results.JAGS.full,what = "dic")
## Compiling rjags model and adapting for 1000 iterations...
## Obtaining DIC samples from 5000 iterations...
dic.true = runjags::extract(results.JAGS.true,what = "dic")
## Compiling rjags model and adapting for 1000 iterations...
## Obtaining DIC samples from 5000 iterations...
diffdic(dic.full,dic.min)
## Difference: -3.371497
## Sample standard error: 5.768775
diffdic(dic.true,dic.full)
## Difference: -1.976078
## Sample standard error: 2.726292
diffdic(dic.true,dic.min)
## Difference: -5.347576
## Sample standard error: 5.160712
There is no absolute scale for interpreting DIC, but because the deviance contributions are calculated for each observation, the difference in DIC between two models has a sample standard error that gives us a sense of magnitude. In this case, the differences in DIC are not large relative to their sample standard errors, suggesting no strong preference for one model over another.
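For example, the largest difference above is only about one standard error:
abs(-5.347576) / 5.160712   # ~ 1.04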
Stan is currently one of the most actively developed software platforms for Bayesian estimation, and it integrates well with R. Like JAGS/BUGS, Stan has a programming syntax that allows for a vast array of customized models. But for the vast majority of applied modeling, helper packages like brms and rstanarm let you fit common models with familiar R formula syntax (rstanarm ships pre-compiled models, while brms generates and compiles Stan code for you). With the brms package, you can actually view the Stan code generated by a simple model (note: it ends up looking not so simple!). For example:
brms.min = brms::brm(formula = Y ~ X1 + X2 + X3 + TX,
data = D)
brms::stancode(brms.min)
## // generated with brms 2.16.1
## functions {
## }
## data {
## int<lower=1> N; // total number of observations
## vector[N] Y; // response variable
## int<lower=1> K; // number of population-level effects
## matrix[N, K] X; // population-level design matrix
## int prior_only; // should the likelihood be ignored?
## }
## transformed data {
## int Kc = K - 1;
## matrix[N, Kc] Xc; // centered version of X without an intercept
## vector[Kc] means_X; // column means of X before centering
## for (i in 2:K) {
## means_X[i - 1] = mean(X[, i]);
## Xc[, i - 1] = X[, i] - means_X[i - 1];
## }
## }
## parameters {
## vector[Kc] b; // population-level effects
## real Intercept; // temporary intercept for centered predictors
## real<lower=0> sigma; // dispersion parameter
## }
## transformed parameters {
## }
## model {
## // likelihood including constants
## if (!prior_only) {
## target += normal_id_glm_lpdf(Y | Xc, Intercept, b, sigma);
## }
## // priors including constants
## target += student_t_lpdf(Intercept | 3, 0.2, 2.5);
## target += student_t_lpdf(sigma | 3, 0, 2.5)
## - 1 * student_t_lccdf(0 | 3, 0, 2.5);
## }
## generated quantities {
## // actual population-level intercept
## real b_Intercept = Intercept - dot_product(means_X, b);
## }
Luckily, rstanarm (or brms) does the hard work for us. We will show how to estimate the above models using rstanarm and compare them using Pareto-smoothed importance sampling leave-one-out cross-validation (PSIS-LOO). As you can see from the code below, using Stan to estimate simple GLMs is as easy as with classical methods. Note that unless we specify more informative priors, rstanarm will use weakly informative defaults, so point estimates and posterior standard deviations should be very close to classical OLS/ML estimates and standard errors for most simple models.
stanfit.min = stan_glm(
Y ~ X1 + X2 + X3 + TX,
data = D,
chains = 4,
seed = 11235
)
stanfit.full = stan_glm(
Y ~ X1 + X2 + X3 + TX + X1 * TX + X2 * TX + X3 * TX,
data = D,
chains = 4,
seed = 11235
)
stanfit.true = stan_glm(
Y ~ X1 + X2 + X3 + TX + X1 * TX ,
data = D,
chains = 4,
seed = 11235
)
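As noted above, because we did not set priors explicitly, rstanarm used its weakly informative defaults. These can be inspected, and overridden through the prior arguments if we want something different (the normal(0, 10) values below are purely illustrative):
prior_summary(stanfit.min)
# To set priors explicitly, e.g.:
# stan_glm(Y ~ X1 + X2 + X3 + TX, data = D,
#          prior = normal(0, 10), prior_intercept = normal(0, 10),
#          chains = 4, seed = 11235)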
Just like with JAGS, we can summarize the posterior draws to obtain point estimates and credible intervals.
summary(stanfit.min,digits = 3)
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: Y ~ X1 + X2 + X3 + TX
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 500
## predictors: 5
##
## Estimates:
## mean sd 10% 50% 90%
## (Intercept) -0.049 0.061 -0.127 -0.049 0.031
## X1 -0.141 0.050 -0.206 -0.142 -0.077
## X2 0.566 0.048 0.505 0.565 0.627
## X3 -0.520 0.051 -0.585 -0.519 -0.456
## TX 0.547 0.089 0.433 0.547 0.660
## sigma 0.981 0.032 0.941 0.980 1.023
##
## Fit Diagnostics:
## mean sd 10% 50% 90%
## mean_PPD 0.223 0.063 0.142 0.222 0.303
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.001 1.000 6019
## X1 0.001 1.000 3532
## X2 0.001 1.001 3883
## X3 0.001 1.001 3639
## TX 0.001 1.000 6262
## sigma 0.000 1.000 5714
## mean_PPD 0.001 0.999 5052
## log-posterior 0.042 1.003 1710
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
summary(stanfit.full,digits = 3)
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: Y ~ X1 + X2 + X3 + TX + X1 * TX + X2 * TX + X3 * TX
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 500
## predictors: 8
##
## Estimates:
## mean sd 10% 50% 90%
## (Intercept) -0.051 0.062 -0.129 -0.051 0.027
## X1 -0.222 0.068 -0.309 -0.223 -0.132
## X2 0.503 0.069 0.414 0.503 0.592
## X3 -0.465 0.069 -0.555 -0.464 -0.378
## TX 0.547 0.089 0.429 0.548 0.662
## X1:TX 0.177 0.100 0.047 0.177 0.308
## X2:TX 0.122 0.099 -0.006 0.123 0.250
## X3:TX -0.112 0.103 -0.243 -0.111 0.016
## sigma 0.974 0.031 0.935 0.973 1.012
##
## Fit Diagnostics:
## mean sd 10% 50% 90%
## mean_PPD 0.223 0.063 0.144 0.223 0.303
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.001 1.001 4074
## X1 0.001 1.000 2786
## X2 0.001 1.001 2667
## X3 0.001 1.000 2969
## TX 0.002 1.000 3332
## X1:TX 0.002 1.000 2678
## X2:TX 0.002 1.002 2475
## X3:TX 0.002 1.001 2768
## sigma 0.000 0.999 4058
## mean_PPD 0.001 1.000 4331
## log-posterior 0.052 1.001 1722
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
summary(stanfit.true,digits = 3)
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: Y ~ X1 + X2 + X3 + TX + X1 * TX
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 500
## predictors: 6
##
## Estimates:
## mean sd 10% 50% 90%
## (Intercept) -0.050 0.060 -0.125 -0.051 0.026
## X1 -0.250 0.064 -0.330 -0.252 -0.166
## X2 0.565 0.049 0.502 0.565 0.626
## X3 -0.520 0.050 -0.584 -0.521 -0.456
## TX 0.548 0.086 0.439 0.548 0.660
## X1:TX 0.239 0.090 0.124 0.239 0.353
## sigma 0.974 0.031 0.935 0.972 1.013
##
## Fit Diagnostics:
## mean sd 10% 50% 90%
## mean_PPD 0.223 0.062 0.144 0.223 0.303
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.001 1.000 3982
## X1 0.001 1.000 2465
## X2 0.001 1.000 3422
## X3 0.001 1.000 3217
## TX 0.001 1.000 4217
## X1:TX 0.002 1.000 2788
## sigma 0.000 1.000 4153
## mean_PPD 0.001 0.999 3918
## log-posterior 0.040 1.000 2190
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
We can visualize our posterior draws using ggplot2 methods defined for stanreg objects (below is for the minimal model).
stan_trace(stanfit.min)
stan_dens(stanfit.min,separate_chains = TRUE)
stan_plot(stanfit.min, show_density = TRUE)
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)
Finally, we will compare models based on posterior prediction rather than in-sample fit. In theory, we'd like to judge a model's quality by how well it predicts outcomes for new observations. For true predictive validity, we need new data - which we don't really have access to. Instead, we use the principle of cross-validation, which mimics validation on external data by holding out a subset of the available data from the modeling process and then predicting those out-of-sample observations. Leave-one-out (LOO) cross-validation does this \(N\) times - leaving out each observation \(i = 1 \ldots N\) and predicting it using estimates obtained from the remaining \(N-1\) observations. This is a computationally intensive process because it requires fitting the model \(N\) times, but a good approximation can be obtained from a single fit using Pareto-smoothed importance sampling (PSIS-LOO).
So, rather than evaluating the likelihood of the same data used to fit the model, we estimate the 'elpd' - the expected log pointwise predictive density for held-out observations. Multiplying the elpd by -2 puts it on the deviance scale and gives the LOO information criterion (PSIS-LOO-IC); the penalty for model complexity (the effective number of parameters, p_loo) is already built into the elpd. PSIS-LOO-IC is similar to DIC but more robust for complex models. In this example, PSIS-LOO-IC and DIC are nearly identical because the models are very simple Gaussian linear regressions. Again, the comparison suggests that the true model is likely to predict future data best (it has the lowest PSIS-LOO-IC), adjusted for model complexity. But the strength of that preference is hard to judge.
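Concretely, writing \(y_{-i}\) for the data with observation \(i\) removed,
\[
\widehat{\mathrm{elpd}}_{\mathrm{loo}} = \sum_{i=1}^{N} \log p(y_i \mid y_{-i}), \qquad \mathrm{LOOIC} = -2\,\widehat{\mathrm{elpd}}_{\mathrm{loo}},
\]
where PSIS approximates each \(p(y_i \mid y_{-i})\) from a single model fit and p_loo (reported below) estimates the effective number of parameters.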
stanfit.min$loo = loo(stanfit.min, cores = 4)
stanfit.full$loo = loo(stanfit.full, cores = 4)
stanfit.true$loo = loo(stanfit.true, cores = 4)
stanfit.min$loo
##
## Computed from 4000 by 500 log-likelihood matrix
##
## Estimate SE
## elpd_loo -702.2 15.9
## p_loo 6.1 0.5
## looic 1404.3 31.9
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
stanfit.full$loo
##
## Computed from 4000 by 500 log-likelihood matrix
##
## Estimate SE
## elpd_loo -700.7 16.0
## p_loo 9.2 0.7
## looic 1401.3 31.9
## ------
## Monte Carlo SE of elpd_loo is 0.1.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
stanfit.true$loo
##
## Computed from 4000 by 500 log-likelihood matrix
##
## Estimate SE
## elpd_loo -699.3 15.9
## p_loo 6.9 0.5
## looic 1398.6 31.9
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
model_list <-
stanreg_list(
stanfit.min,
stanfit.full,
stanfit.true,
model_names = c("No Interactions", "Full Model", "True Model")
)
loo_compare(model_list, detail = TRUE)
## Model formulas:
## No Interactions: Y ~ X1 + X2 + X3 + TX
## Full Model: Y ~ X1 + X2 + X3 + TX + X1 * TX + X2 * TX + X3 * TX
## True Model: Y ~ X1 + X2 + X3 + TX + X1 * TX
##
## Model comparison based on LOO-CV:
## elpd_diff se_diff
## True Model 0.0 0.0
## Full Model -1.3 1.4
## No Interactions -2.8 2.6
But nothing says that we have to take the results from just one model! We can take the thought experiment further into the realm of Bayesian decision theory: because there is still uncertainty about which model is best, we might wish to average the predictions generated from the models we have estimated. The method of "stacking" solves for an optimal set of weights on each model's posterior predictive distribution - the weights that maximize the expected log predictive density of the combined (ensemble) prediction, again using PSIS-LOO. Loosely speaking, models that receive higher weight are those expected to contribute most to accurate prediction of new observations; a weight near zero means a model adds little predictive information beyond the others.
loo_model_weights(model_list,cores = 4, method = "stacking")
## Method: stacking
## ------
## weight
## No Interactions 0.086
## Full Model 0.000
## True Model 0.914
In this example, the true model (fortunately!) receives by far the largest stacking weight - 0.914 - and if we were to use an ensemble prediction, only the true model and the minimal model would be included, with the true model getting more than 10 times the weight of the minimal model.
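If we actually wanted that ensemble prediction, one simple approach (a sketch - the new covariate values below are arbitrary) is to weight each model's posterior-mean prediction by its stacking weight:
wts = loo_model_weights(model_list, cores = 4, method = "stacking")
D.pred = tibble(X1 = -1, X2 = 0, X3 = 0, TX = 1)   # hypothetical new observation
e.min  = colMeans(posterior_epred(stanfit.min,  newdata = D.pred))
e.full = colMeans(posterior_epred(stanfit.full, newdata = D.pred))
e.true = colMeans(posterior_epred(stanfit.true, newdata = D.pred))
w = as.numeric(wts)   # order: No Interactions, Full Model, True Model
w[1] * e.min + w[2] * e.full + w[3] * e.true
A fuller ensemble would mix draws from the three posterior predictive distributions in proportion to the weights, rather than averaging point predictions.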