Bayes Factors with brms

Here’s a short post on how to calculate Bayes Factors with the R package brms (Buerkner, 2016) using the Savage-Dickey density ratio method (Wagenmakers, Lodewyckx, Kuriyal, & Grasman, 2010).

To get up to speed with what the Savage-Dickey density ratio method is–or what Bayes Factors are–please read Wagenmakers et al. (2010). (The paper is available on the author’s webpage.) Here, I’ll only show the R & brms code to do the calculations that Wagenmakers et al. (2010) discuss. In their paper, they used WinBUGS, which requires quite a bit of code to sample from even a relatively simple model. brms, on the other hand, uses the familiar R formula syntax, making it easy to use. brms also does the MCMC sampling with Stan (Stan Development Team, 2016a & 2016b)–or rather creates Stan code from a specified R model formula by what can only be described as string processing magic–making the sampling very fast. Let’s get straight to the examples in Wagenmakers et al. (2010).

# We'll work in tidyverse; brms is loaded as well for the models below
library(tidyverse)
library(brms)

Example 0

Wagenmakers and colleagues begin with a simple example of 10 true/false questions: We observe a person answering 9 (s) out of 10 (k) questions correctly.

d <- data.frame(s = 9, k = 10)
##   s  k
## 1 9 10

We are interested in the person’s latent ability to answer similar questions correctly. This ability is represented by \(\theta\) (theta), which for us will be the probability parameter (sometimes also called the rate parameter) in a binomial distribution. See Wagenmakers et al. (2010) for formulas. The maximum likelihood (point) estimate for \(\theta\) is s/k, the proportion .9.

The first thing we’ll need to specify with respect to our statistical model is the prior probability distribution for \(\theta\). As in Wagenmakers et al. (2010), we specify a uniform prior, representing no prior information about the person’s ability to answer the questions. For the binomial probability parameter, \(Beta(\alpha = 1, \beta = 1)\) is a uniform prior.

pd <- tibble(
    x = seq(0, 1, by = .01),
    Prior = dbeta(x, 1, 1))

ggplot(pd, aes(x, Prior)) +
    geom_line() +
    coord_cartesian(xlim = 0:1, ylim = c(0, 6), expand = 0) +
    labs(y = "Density", x = bquote(theta))

The solid line represents the probability density assigned to values of \(\theta\) by this prior probability distribution. You can see that it is 1 for all possible parameter values: They are all equally likely a priori. For this simple illustration, we can easily calculate the posterior distribution by adding the number of correct and incorrect answers to the parameters of the prior Beta distribution.

pd$Posterior <- dbeta(pd$x, 10, 2)
pdw <- gather(pd, key=Type, value=density, Prior:Posterior)
ggplot(pdw, aes(x, density, col=Type)) +
    geom_line() +
    annotate("point", x=c(.5, .5), y = c(pdw$density[pdw$x==.5])) +
    annotate("label", x=c(.5, .5), 
             y = pdw$density[pdw$x==.5], 
             label = round(pdw$density[pdw$x==.5], 3),

The Savage-Dickey density ratio is calculated by dividing the posterior density by the prior density at a specific parameter value. Here, we are interested in .5, a “null hypothesis” value indicating that the person’s latent ability is .5, i.e. that they are simply guessing.

filter(pd, x == .5) %>% 
    mutate(BF01 = Posterior / Prior,
           BF10 = 1 / BF01)
## # A tibble: 1 x 5
##       x Prior Posterior  BF01  BF10
##   <dbl> <dbl>     <dbl> <dbl> <dbl>
## 1   0.5     1     0.107 0.107 9.309

OK, so in this example we were able to get the posterior simply by adding values to the parameters of the Beta distribution, but let’s now see how to approach this problem using brms.


First, here’s how we specify the model to brms:

s | trials(k) ~ 0 + intercept, 
family = binomial(link = "identity"),
data = d

Read the first line as “s successes from k trials regressed on intercept”. That’s a little clunky, but bear with it. If you are familiar with R’s modeling syntax, you’ll be wondering why we didn’t simply specify ~ 1–R’s default notation for an intercept. The reason is that brms by default uses a little trick in parameterizing the intercept which speeds up the MCMC sampling. In order to specify a prior for the intercept, you’ll have to take the default intercept out (0 +), and use the reserved string intercept to say that you mean the regular intercept. See ?brmsformula for details. (For this model, with only one parameter, this complication doesn’t matter, but I wanted to introduce it early on so that you’d be aware of it when estimating multi-parameter models.)

The next line specifies that the data model is binomial, and that we want to model its parameter through an identity link. Usually when you model proportions or binary data, you’d use a logistic (logistic regression!), probit, or other similar link function; in fact that’s what we’ll do in later examples. Finally, we’ll use the data frame d.

OK, then we’ll want to specify our priors. Priors are extremely important for Bayes Factors–and for probabilistic inference in general. To help set priors, we’ll first call get_prior() with the model information, which is basically like asking brms to tell us what the possible priors are, and how to specify them, given this model.

get_prior(s | trials(k) ~ 0 + intercept, 
          data = d)
##   prior class      coef group nlpar bound
## 1           b                            
## 2           b intercept

The first line says that there is only one class of parameters, b; think of class b as “betas” or “regression coefficients”. The second line says that the b class has only one parameter, the intercept. So we can set a prior for the intercept, and this prior can be any probability distribution in Stan language. We’ll create this prior using brms’ set_prior(), give it a text string representing the Beta(1, 1) prior for all parameters of class b (a shortcut; we could also specify it for the intercept specifically), and then set the upper and lower bounds (\(\theta\) must be between 0 and 1).

Prior <- set_prior("beta(1, 1)", class = "b", lb = 0, ub = 1)

Almost there. Now we’ll actually sample from the model using brm(), give it the model, priors, data, ask it to sample from priors (for the density ratio!), and set a few extra MCMC parameters.

m <- brm(s | trials(k) ~ 0 + intercept, 
         family = binomial(link="identity"),
         prior = Prior,
         data = d,
         sample_prior = TRUE,
         iter = 1e4,
         cores = 4)

We can get the estimated parameter by asking for the model summary:

summary(m)
##  Family: binomial(identity) 
## Formula: s | trials(k) ~ 0 + intercept 
##    Data: d (Number of observations: 1) 
## Samples: 4 chains, each with iter = 10000; warmup = 5000; thin = 1; 
##          total post-warmup samples = 20000
##    WAIC: Not computed
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## intercept     0.83       0.1     0.59     0.98      15654    1
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The Credible Interval matches exactly what’s reported in the paper. The point estimate differs slightly because here we see the posterior mean, whereas in the paper, Wagenmakers et al. report the posterior mode. I’ll draw a line at their posterior mode, below, to show that it matches.

samples <- posterior_samples(m, "b")
head(samples)
##   b_intercept   prior_b
## 1   0.9496061 0.4433162
## 2   0.6241450 0.6781995
## 3   0.9522157 0.1354565
## 4   0.7378991 0.4176384
## 5   0.7290172 0.9713203
## 6   0.7078843 0.9575486
gather(samples, Type, value) %>% 
    ggplot(aes(value, col=Type)) +
    geom_density() +
    labs(x = bquote(theta), y = "Density") +
    geom_vline(xintercept = .89)  # Vertical line at .89

We can already see the densities, so all that’s left is to obtain the exact densities at the value of interest (.5) and take the \(\frac{posterior}{prior}\) ratio. Instead of doing any of this by hand, we’ll use brms’ function hypothesis(), which allows us to test point hypotheses using the Savage-Dickey density ratio. For this function we’ll need to specify the point of interest, .5, as the point hypothesis to be tested.
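For intuition, here’s roughly what that by-hand calculation could look like, using kernel density estimates of the prior and posterior samples we drew above (a sketch; evaluating a kernel density at a single point is approximate):

# Evaluate each density at theta = .5 and take the ratio
post_dens  <- density(samples$b_intercept)
prior_dens <- density(samples$prior_b)
BF01 <- approx(post_dens$x, post_dens$y, xout = .5)$y /
    approx(prior_dens$x, prior_dens$y, xout = .5)$y
BF01  # Should be close to 0.107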

h <- hypothesis(m, "intercept = 0.5")
print(h, digits = 4)
## Hypothesis Tests for class b:
##                       Estimate Est.Error l-95% CI u-95% CI Evid.Ratio Star
## (intercept)-(0.5) = 0   0.3336    0.1032   0.0876   0.4761     0.1078    *
## ---
## '*': The expected value under the hypothesis lies outside the 95% CI.

The Evid.Ratio is our Bayes Factor BF01. Notice that it matches the value 0.107 pretty well! You can also plot this hypothesis:
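plot(h)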


However, I think the default plot isn’t fantastic (I’m not a fan of the axis adjustments or the title). Fortunately, the plot method for hypothesis objects returns a list of ggplot2 objects, which are easily customized.

p <- plot(h, plot = F, theme = theme_get())[[1]]
p + scale_x_continuous(breaks = seq(-.5, .5, by = .25),
                       labels = seq(-.5, .5, by = .25)+.5)

OK, so that was a lot of work for such a simple problem, but the real beauty of brms (and Stan) is the unparalleled scalability: We can easily solve a problem with one row of data and one parameter, and it won’t take much more to solve a problem with tens of thousands of rows of data, and hundreds of parameters. Let’s move on to the next example from Wagenmakers et al. (2010).

Example 1: Equality of Proportions

For context, please refer to the paper.

d <- data.frame(
    pledge = c("yes", "no"),
    s = c(424, 5416),
    n = c(777, 9072))
##   pledge    s    n
## 1    yes  424  777
## 2     no 5416 9072

They use Beta(1, 1) priors for both rate parameters, which we’ll do as well. Notice that usually a regression formula has an intercept and a coefficient (e.g., the effect of group). By taking the intercept out (0 +), we can instead define two pledger-group proportions and set priors on those. If we used an intercept + effect formula, we could set a prior on the effect itself.

get_prior(s | trials(n) ~ 0 + pledge, 
          data = d)
##   prior class      coef group nlpar bound
## 1           b                            
## 2           b  pledgeno                  
## 3           b pledgeyes

We can set the Beta prior for both groups’ rate with one line of code by setting the prior on the b class without specifying the coef.

Prior <- set_prior("beta(1, 1)", class = "b", lb = 0, ub = 1)

Like above, let’s estimate.

m1 <- brm(s | trials(n) ~ 0 + pledge, 
         family = binomial(link="identity"),
         prior = Prior,
         sample_prior = TRUE,
         iter = 1e4,
         data = d,
         cores = 4)

Our estimates match the MLEs reported in the paper:

print(m1, digits=3)
##  Family: binomial(identity) 
## Formula: s | trials(n) ~ 0 + pledge 
##    Data: d (Number of observations: 2) 
## Samples: 4 chains, each with iter = 10000; warmup = 5000; thin = 1; 
##          total post-warmup samples = 20000
##    WAIC: Not computed
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## pledgeno     0.597     0.005    0.587    0.607      18588    1
## pledgeyes    0.546     0.018    0.511    0.580      19173    1
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

To get the density ratio Bayes Factor, we’ll need to specify a text string as our hypothesis. Our hypothesis is that the rate parameters \(\theta_1\) and \(\theta_2\) are not different: \(\theta_1\) = \(\theta_2\). The alternative, then, is the notion that the parameter values differ.

h1 <- hypothesis(m1, "pledgeyes = pledgeno")
print(h1, digits = 3)
## Hypothesis Tests for class b:
##                          Estimate Est.Error l-95% CI u-95% CI Evid.Ratio
## (pledgeyes)-(pled... = 0   -0.051     0.018   -0.088   -0.016      0.388
##                          Star
## (pledgeyes)-(pled... = 0    *
## ---
## '*': The expected value under the hypothesis lies outside the 95% CI.

As noted in the paper, a difference of zero was about twice as well supported before seeing the data; in other words, the null hypothesis of no difference is about 2.6 times less likely after seeing the data:

1/h1$hypothesis$Evid.Ratio  # BF10
## [1] 2.577207

The paper reports BF01 = 0.47, so we’re getting essentially the same result (small discrepancies are expected, because the density ratio is estimated from MCMC samples). You can also compare this figure to what’s reported in the paper.

h1p1 <- plot(h1, theme = theme_get(), plot = F)[[1]] +
    theme(legend.position = "top")
h1p2 <- plot(h1, theme = theme_get(), plot = F)[[1]] + 
    coord_cartesian(xlim = c(-.05, .05), ylim = c(0, 5)) +
    theme(legend.position = "top")
gridExtra::grid.arrange(h1p1, h1p2, nrow = 1)

Moving right on to Example 2, skipping the section on “order restricted analysis”.

Example 2: Hierarchical Bayesian one-sample proportion test

The data for Example 2 are not available, but we’ll simulate similar data. The simulation assumes that the neither-primed condition’s average probability correct is 50%, and that the both-primed condition’s benefit is 5%. Obviously, the exact numbers won’t match the paper’s anymore, but the data reported in the paper have an average difference in proportions of about 4%.

d <- tibble(
    id = c(rep(1:74, each = 2)),
    primed = rep(c("neither", "both"), times = 74),
    prime = rep(c(0, 1), times = 74),  # Dummy coded
    n = 21,
    correct = rbinom(74*2, 21, .5 + prime * .05))
group_by(d, primed) %>% summarize(p = sum(correct)/sum(n))
## # A tibble: 2 x 2
##    primed         p
##     <chr>     <dbl>
## 1    both 0.5418275
## 2 neither 0.4993565

These data yield a similar t-value as in the paper.

t.test(correct/n ~ primed, paired = T, data = d)
##  Paired t-test
## data:  correct/n by primed
## t = 2.3045, df = 73, p-value = 0.02404
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.005741069 0.079201016
## sample estimates:
## mean of the differences 
##              0.04247104

Instead of doing a probit regression, I’m going to do logistic regression. Therefore we define the prior on the log-odds scale. The log odds for the expected probability of .5 is 0. I prefer log-odds because–although complicated–they make sense, unlike standardized effect sizes. Note that the probit scale would also be fine as they are very similar.

Let’s just get a quick intuition about effects in log-odds: The change in log odds from p = .5 to .55 is about 0.2.

tibble(rate = seq(0, 1, by = .01),
    logit = qlogis(rate)) %>%  # qlogis() is the logit function in base R
    ggplot(aes(rate, logit)) +
    geom_line(size=1) +
    geom_segment(x = 0, xend = 0.55, y = .2, yend = .2, size=.4) +
    geom_segment(x = 0, xend = 0.5, y = 0, yend = 0, size=.4) +
    coord_cartesian(ylim = c(-2, 2), expand=0)

We are cheating a little because we know these values, having simulated the data. However, log-odds are not straightforward (!), and this knowledge will allow us to specify better priors in this example. Let’s get the possible priors for this model by calling get_prior(). Notice that the model now includes id-varying “random” effects, and we model them from independent Gaussians by specifying || instead of | which would give a multivariate Gaussian on the varying effects.
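To make the distinction concrete, here are the two variants side by side (shown as bare formulas; only the second estimates correlations between the varying effects):

# Independent varying effects (what we use below):
correct | trials(n) ~ 0 + intercept + prime + (0 + intercept + prime || id)
# Correlated varying effects (multivariate Gaussian):
correct | trials(n) ~ 0 + intercept + prime + (0 + intercept + prime | id)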

get_prior(correct | trials(n) ~ 0 + intercept + prime + 
              (0 + intercept + prime || id), 
          data = d)
##                 prior class      coef group nlpar bound
## 1                         b                            
## 2                         b intercept                  
## 3                         b     prime                  
## 4 student_t(3, 0, 10)    sd                            
## 5                        sd              id            
## 6                        sd intercept    id            
## 7                        sd     prime    id

The leftmost column gives the pre-specified defaults used by brms. Here are the priors we’ll specify. The most important pertains to prime, which is going to be the effect size in log-odds. Our prior for the log-odds of the prime effect is a Gaussian centered on 0 with a standard deviation of .2; recalling the intuition above (a change from .5 to .55 is about 0.2 log-odds), this allows plausible effects of a few percentage points in either direction.

Prior <- c(
    set_prior("normal(0, 10)", class = "b", coef = "intercept"),
    set_prior("cauchy(0, 10)", class = "sd"),
    set_prior("normal(0, .2)", class = "b", coef = "prime"))

Then we estimate the model using the specified priors.

m2 <- brm(correct | trials(n) ~ 0 + intercept + prime + 
              (0 + intercept + prime || id), 
         family = binomial(link="logit"),
         prior = Prior,
         sample_prior = TRUE,
         iter = 1e4,
         data = d,
         cores = 4)

OK, so our results here will differ from the paper’s, because we didn’t put the prior on a standardized effect size: a) I don’t like standardized effect sizes, and b) using one would require playing around with the generated Stan code, and this post is about brms. Two good reasons not to use standardized effect sizes! Anyway, here are the estimated parameters:

summary(m2)
##  Family: binomial(logit) 
## Formula: correct | trials(n) ~ 0 + intercept + prime + (0 + intercept + prime || id) 
##    Data: d (Number of observations: 148) 
## Samples: 4 chains, each with iter = 10000; warmup = 5000; thin = 1; 
##          total post-warmup samples = 20000
##    WAIC: Not computed
## Group-Level Effects: 
## ~id (Number of levels: 74) 
##               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(intercept)     0.07      0.05     0.00     0.18       8862    1
## sd(prime)         0.13      0.08     0.01     0.30       6970    1
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## intercept     0.01      0.05    -0.09     0.10      20000    1
## prime         0.15      0.07     0.02     0.29      20000    1
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

And our null-hypothesis density ratio:

h2 <- hypothesis(m2, "prime = 0")
h2
## Hypothesis Tests for class b:
##             Estimate Est.Error l-95% CI u-95% CI Evid.Ratio Star
## (prime) = 0     0.15      0.07     0.02     0.29       0.27    *
## ---
## '*': The expected value under the hypothesis lies outside the 95% CI.

A priming effect of zero log-odds is about 4 times less likely after seeing the data:

1/h2$hypothesis$Evid.Ratio  # BF10
## [1] 3.681923

This is best illustrated by plotting the densities:
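plot(h2)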



Read the paper! Hopefully you’ll be able to use brms’ hypothesis() function to calculate Bayes Factors when needed.


Buerkner, P.-C. (2016). brms: Bayesian Regression Models using Stan. R package.
Stan Development Team. (2016a). RStan: the R interface to Stan.
Stan Development Team. (2016b). Stan: A C++ Library for Probability and Sampling, Version 2.14.1.
Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage–Dickey method. Cognitive Psychology, 60(3), 158–189.


brms produces exceptionally well-written Stan code. Here is the Stan code of the last example model, obtained with stancode(m2):

## // generated with brms 1.6.1
## functions { 
## } 
## data { 
##   int<lower=1> N;  // total number of observations 
##   int Y[N];  // response variable 
##   int<lower=1> K;  // number of population-level effects 
##   matrix[N, K] X;  // population-level design matrix 
##   // data for group-level effects of ID 1 
##   int<lower=1> J_1[N]; 
##   int<lower=1> N_1; 
##   int<lower=1> M_1; 
##   vector[N] Z_1_1; 
##   vector[N] Z_1_2; 
##   int trials[N];  // number of trials 
##   int prior_only;  // should the likelihood be ignored? 
## } 
## transformed data { 
## } 
## parameters { 
##   vector[K] b;  // population-level effects 
##   vector<lower=0>[M_1] sd_1;  // group-level standard deviations 
##   vector[N_1] z_1[M_1];  // unscaled group-level effects 
##   // parameters to store prior samples
##   real<lower=0> prior_sd_1;
## } 
## transformed parameters { 
##   // group-level effects 
##   vector[N_1] r_1_1; 
##   vector[N_1] r_1_2; 
##   r_1_1 = sd_1[1] * (z_1[1]); 
##   r_1_2 = sd_1[2] * (z_1[2]); 
## } 
## model { 
##   vector[N] mu; 
##   mu = X * b; 
##   for (n in 1:N) { 
##     mu[n] = mu[n] + (r_1_1[J_1[n]]) * Z_1_1[n] + (r_1_2[J_1[n]]) * Z_1_2[n]; 
##   } 
##   // prior specifications 
##   b[1] ~ normal(0, 10); 
##   b[2] ~ normal(0, .2); 
##   sd_1 ~ cauchy(0, 10); 
##   z_1[1] ~ normal(0, 1); 
##   z_1[2] ~ normal(0, 1); 
##   // likelihood contribution 
##   if (!prior_only) { 
##     Y ~ binomial_logit(trials, mu); 
##   } 
##   // additionally draw samples from priors
##   prior_sd_1 ~ cauchy(0,10);
## } 
## generated quantities { 
##   real prior_b_1; 
##   real prior_b_2; 
##   // additionally draw samples from priors 
##   prior_b_1 = normal_rng(0,10); 
##   prior_b_2 = normal_rng(0,.2); 
## }

Better forest plots from meta-analytic models estimated with brms

Hi all! After our previous discussion about how to estimate meta-analytic models with the brilliant brms R package, a few people asked me for the code to produce the forest plots. Here I’ll present a much better version of a function to produce forest plots from meta-analytic models estimated with brms. The function is implemented in ggplot2, and it is included in my vmisc package. To get the function, simply install vmisc from GitHub:
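# GitHub username assumed; requires the devtools package
devtools::install_github("mvuorre/vmisc")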


The function to draw forest plots from meta-analytic models estimated with brms is called brms_forest(), and you can learn more about it with ?brms_forest. The most important thing to know is that it requires that the model is estimated with brms with this exact formula:

yi | se(sei) ~ 1 + (1|study)

Where yi are the study-specific effect sizes (mean difference between groups, for example), sei are the study-specific standard errors, and study are unique integer labels for each study in the data.

Example data

We’ll illustrate the plot using example data from the metafor package. These data are “Results from 48 studies on the effectiveness of school-based writing-to-learn interventions on academic achievement.”

“In each of the studies included in this meta-analysis, an experimental group (i.e., a group of students that received instruction with increased emphasis on writing tasks) was compared against a control group (i.e., a group of students that received conventional instruction) with respect to some content-related measure of academic achievement (e.g., final grade, an exam/quiz/test score). The effect size measure for this meta-analysis was the standardized mean difference (with positive scores indicating a higher mean level of academic achievement in the intervention group).” (From the metafor help page ?dat.bangertdrowns2004.)

##   id   author year grade length minutes wic feedback info pers imag meta
## 1  1 Ashworth 1992     4     15      NA   1        1    1    1    0    1
## 2  2    Ayers 1993     2     10      NA   1       NA    1    1    1    0
## 3  3   Baisch 1990     2      2      NA   1        0    1    1    0    1
## 4  4    Baker 1994     4      9      10   1        1    1    0    0    0
## 5  5   Bauman 1992     1     14      10   1        1    1    1    0    1
## 6  6   Becker 1996     4      1      20   1        0    0    1    0    0
##         subject  ni     yi    vi
## 1       Nursing  60  0.650 0.070
## 2 Earth Science  34 -0.750 0.126
## 3          Math  95 -0.210 0.042
## 4       Algebra 209 -0.040 0.019
## 5          Math 182  0.230 0.022
## 6    Literature 462  0.030 0.009

We’ll only need a few of the columns, and with specific names, so in the following we’ll just select the relevant variables, rename id to study, and create labels for the plot by pasting together the study author and year columns. I’ll also subset the data to the first 15 studies, because the original data have 48 studies, which would make the plot very large (fine in itself, but simpler to start small).

d <- dat.bangertdrowns2004 %>%
    mutate(label = paste0(author, " (", year, ")"), sei = sqrt(vi)) %>%
    select(study = id, label, yi, sei) %>% 
    slice(1:15)  # Keep the first 15 studies
##    study              label    yi        sei
## 1      1    Ashworth (1992)  0.65 0.26457513
## 2      2       Ayers (1993) -0.75 0.35496479
## 3      3      Baisch (1990) -0.21 0.20493902
## 4      4       Baker (1994) -0.04 0.13784049
## 5      5      Bauman (1992)  0.23 0.14832397
## 6      6      Becker (1996)  0.03 0.09486833
## 7      7 Bell & Bell (1985)  0.26 0.32557641
## 8      8     Brodney (1994)  0.06 0.08366600
## 9      9      Burton (1986)  0.06 0.20000000
## 10    10   Davis, BH (1990)  0.12 0.22803509
## 11    11   Davis, JJ (1996)  0.77 0.32710854
## 12    12         Day (1994)  0.00 0.14491377
## 13    13     Dipillo (1994)  0.52 0.19235384
## 14    14     Ganguli (1989)  0.54 0.28809721
## 15    15  Giovinazzo (1996)  0.20 0.29325757

Fit meta-analytic model with brms

Fitting the meta-analytic model is easy with brms! The formula specifies the study-specific effect size and standard error, an overall intercept (1), and the “random studies” ((1|study)). I’ll use four cores for speed and increase the adapt_delta parameter to avoid divergent transitions.

mod <- brm(yi | se(sei) ~ 1 + (1|study), data = d, 
           cores = 4, control= list(adapt_delta=.99) )

The model summary shows a 95% CI for the average effect hugging 0, and reasonable between-study heterogeneity (sd(Intercept)):

summary(mod)
##  Family: gaussian(identity) 
## Formula: yi | se(sei) ~ 1 + (1 | study) 
##    Data: d (Number of observations: 15) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
## Group-Level Effects: 
## ~study (Number of levels: 15) 
##               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept)     0.17       0.1     0.01      0.4        850    1
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept     0.13      0.07    -0.01     0.29        999    1
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

Draw forest plots with brms_forest()

The user only needs to enter a data frame and a brms model:

brms_forest(data = d, model = mod)

The forest plot shows, on the left, the labels provided in d. On the very right are the effect sizes and the limits of the Credible Intervals. The CI limits are by default 95%, but users can control this by passing the argument level = .80, for 80% CIs, for example:
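brms_forest(data = d, model = mod, level = .80)  # 80% CIs instead of 95%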

In the middle are the posterior distributions of the estimated effect sizes as grey densities. The black circle indicates the posterior mean, and the arms extending from the point are the CI defined by level (here, 95% CI). The bottom row, ME, is the meta-analytic estimate.


brms_forest() has several arguments that impact the resulting plot; see ?brms_forest for details. For example:

brms_forest(
    data = d, 
    model = mod, 
    show_data = TRUE,  # Shows data means and SEs
    sort_estimates = TRUE,  # Sorts estimates based on their magnitude
    dens_fill = "dodgerblue")  # Fill densities with blue

Here, the plot also shows the means (black crosses) and 2 SEs (thin black lines) from the data, slightly below each estimate. This plot nicely shows how the random effects model shrinks the estimates toward the group mean, especially for studies that had wide SEs to begin with.

Customizing the plot

brms_forest() returns a ggplot2 object, which is customizable by regular ggplot2 functions (themes, scales…) Here, we’ll add limits to the x axis manually (we have to specify y axis, because the axes are internally flipped…), add a label to replace the default “mean”, and use another custom theme:

myplot <- brms_forest(data = d, 
                      model = mod, 
                      sort_estimates = TRUE)
myplot + 
    scale_y_continuous("Standardized ES", limits = c(-1.2, 1.2)) +


The plot uses a one-sided violin plot, which is not included in the default ggplot2 distribution; this code was written by David Robinson and Ben Marwick.


Look at how non-normal those posterior densities are! In this case, the posterior mean is quite a misleading point estimate, and actually showing the posterior distribution is a very good idea.

Hope you’ll find this function useful. If you’d like it to be improved, let me know. Thanks and have a good day!

How to Compare Two Groups with Robust Bayesian Estimation Using R, Stan and brms


Happy New Year 2017 everybody! 2017 will be the year when social scientists finally decide to diversify their applied statistics toolbox, and stop relying 100% on null hypothesis significance testing (NHST). We now recognize that different scientific questions may require different statistical tools, and are ready to adopt new and innovative methods. A very appealing alternative to NHST is Bayesian statistics, which in itself contains many approaches to statistical inference. In this post, I provide an introductory and practical tutorial to Bayesian parameter estimation in the context of comparing two independent groups’ data.

More specifically, we’ll focus on the t-test. Everyone knows about it, everyone uses it. Yet, there are (arguably!) better methods for drawing inferences from two independent groups’ metric data (Kruschke, 2013; Morey & Rouder, 2015). Let’s talk about how “Bayesian estimation supersedes the t-test” (Kruschke, 2013).

Kruschke (2013, p.573) writes:

“When data are interpreted in terms of meaningful parameters in a mathematical description, such as the difference of mean parameters in two groups, it is Bayesian analysis that provides complete information about the credible parameter values. Bayesian analysis is also more intuitive than traditional methods of null hypothesis significance testing (e.g., Dienes, 2011).”

In that article (Bayesian estimation supersedes the t-test) Kruschke (2013) provided clear and well-reasoned arguments favoring Bayesian parameter estimation over null hypothesis significance testing in the context of comparing two groups, a situation which is usually dealt with a t-test. It also introduced a robust model for comparing two groups, which modeled the data as t-distributed, instead of a Gaussian distribution. The article provided R code for running the estimation procedures, which could be downloaded from the author’s website or as an R package (Kruschke & Meredith, 2015).

The R code and programs work well for this specific application (estimating the robust model for one or two groups’ metric data). However, modifying the code to handle more complicated situations is not easy, and the underlying estimation algorithms don’t necessarily scale up to handle more complicated situations. Therefore, today I’ll introduce easy to use, free, open-source, state-of-the-art computer programs for Bayesian estimation, in the context of comparing two groups’ metric (continuous) data. The programs are available for the R programming language–so make sure you are familiar with R basics (e.g. Vuorre, 2016). I provide R code (it’s super easy, don’t worry!) for t-tests and Bayesian estimation in R using the R package brms (Buerkner, 2016), which uses the powerful Stan MCMC program (Stan Development Team, 2016) under the hood.

These programs supersede older Bayesian inference programs because they are easy to use (brms is an interface to Stan, which is actually a programming language in itself), fast, and are able to handle models with thousands of parameters. Learning to implement basic analyses such as t-tests, and Kruschke’s robust model, with these programs is very useful because (obviously) you’ll then be able to do Bayesian statistics in practice, and will be prepared to understand and implement more complex models.

Understanding the results of Bayesian estimation requires understanding some basics of Bayesian statistics, which I won’t describe here at all. If you are not familiar with Bayesian statistics, please read Kruschke’s excellent article (his book is also very good, Kruschke, 2014; see also McElreath, 2016). In fact, you should read the paper anyway, it’s very good.

First, I’ll introduce the basics of t-tests in some detail, and then focus on understanding them as specific instantiations of linear models. If that sounds familiar, skip ahead to Bayesian Estimation of the t-test, where I introduce the brms package for estimating models using Bayesian methods. Following that, we’ll use “distributional regression” to obtain Bayesian estimates of the unequal variances t-test model. Finally, we’ll learn how to estimate Kruschke’s (2013) BEST model using brms.

The t in a t-test

We’ll begin with t-tests, using example data from Kruschke’s paper (p. 577):

“Consider data from two groups of people who take an IQ test. Group 1 (N1=47) consumes a “smart drug,” and Group 2 (N2=42) is a control group that consumes a placebo.”

I’ve decided to call the control group “Group 0”, and the treatment group “Group 1”, because this coding makes it natural to think of the control group as a “reference group”, and any “effect” we’ll estimate will be associated with the treatment group. These data are visualized as histograms, below:


Figure 1: Histograms of the two groups’ IQ scores.

Equal variances t-test

These two groups’ IQ scores could be compared with a simple equal variances t-test (which you shouldn’t use; Lakens, 2015), also known as Student’s t-test. I have the two groups’ IQ scores in R as two vectors called group_0 and group_1, so doing a t-test is as easy as

t.test(group_0, group_1, var.equal=T)
##  Two Sample t-test
## data:  group_0 and group_1
## t = -1.5587, df = 87, p-value = 0.1227
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.544155  0.428653
## sample estimates:
## mean of x mean of y 
##  100.3571  101.9149

We interpret the t-test in terms of the observed t-value, and whether it exceeds the critical t-value. The critical t-value, in turn, is defined as the extreme \(\alpha / 2\) percentiles of a t-distribution with the given degrees of freedom. The current situation is illustrated below:


Figure 2: t distribution with 87 degrees of freedom, and observed t-value. The dashed vertical lines indicate the extreme 2.5 percentiles. We would reject the null hypothesis of no difference if the observed t-value exceeded these percentiles.

The test results in an observed t-value of 1.56, which is not far enough in the tails of a t-distribution with 87 degrees of freedom to warrant rejecting the null hypothesis (given that we are using \(\alpha\) = .05, which may or may not be an entirely brilliant idea (e.g. Rouder, Morey, Verhagen, Province, & Wagenmakers, 2016)). Note that R also reports a 95% CI for the estimated difference between the two groups.
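For reference, the critical values themselves can be computed with the t quantile function:

# Extreme 2.5th and 97.5th percentiles of a t-distribution with df = 87
qt(c(.025, .975), df = 87)  # approximately -1.99 and 1.99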

Unequal variances t-test

Next, we’ll run the more appropriate, unequal variances t-test (also known as Welch’s t-test), which R gives by default:

t.test(group_0, group_1)
##  Welch Two Sample t-test
## data:  group_0 and group_1
## t = -1.6222, df = 63.039, p-value = 0.1098
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.4766863  0.3611848
## sample estimates:
## mean of x mean of y 
##  100.3571  101.9149

Note that while R gives Welch’s t-test by default, SPSS gives both. If you’re using SPSS, make sure to report the Welch’s test results, instead of the equal variances test. Here, the conclusion with respect to rejecting the null hypothesis of equal means is the same. However, notice that the results are numerically different, as they should, because these two t-tests refer to different models!

As a side note, I recently learned that this problem (estimating and testing the difference between two means when the variances are not assumed equal) is unsolved: only approximate solutions are known.

It is of course up to you, as a researcher, to decide whether you assume equal variances or not. But note that we almost always allow the means to be different (that’s the whole point of the test, really), while many treatments may just as well have an effect on the standard deviations.

The first take-home message from today is that there are actually two t-tests, each associated with a different statistical model. And to make clear what the difference is, we must acquaint ourselves with the models.

Describing the model(s) underlying the t-test(s)

We don’t usually think of t-tests (and ANOVAs) as models, but it turns out that they are just linear models disguised as tests (see here and here). Recently, there has been a tremendous push for model/parameter estimation, instead of null hypothesis significance testing (e.g. Cumming, 2014; Kruschke, 2014; see also the brilliant commentary by Gigerenzer, 2004), so we will benefit from thinking about t-tests as linear models. Doing so will facilitate “[interpreting data] in terms of meaningful parameters in a mathematical description” (Kruschke, 2013), and seamlessly expanding our models to handle more complicated situations.

The equal variances t-test models metric data with three parameters: Mean for group A, mean for group B, and one shared standard deviation (i.e. the assumption that the standard deviations [we usually refer to variances, but whatever] are equal between the two groups.)

We call the metric data (IQ scores in our example) \(y_{ik}\), where \(i\) is a subscript indicating the \(i^{th}\) datum, and \(k\) indicates the \(k^{th}\) group. So \(y_{19, 1}\) would be the 19th datum, belonging to group 1. Then we specify that \(y_{ik}\) are Normally distributed, \(N(\mu_{ik}, \sigma)\), where \(\mu_{ik}\) indicates the mean of group \(k\), and \(\sigma\) the common standard deviation.

\[y_{ik} \sim N(\mu_{ik}, \sigma)\]

Read the formula as “Y is normally distributed with mean \(\mu_{ik}\) (mu), and standard deviation \(\sigma\) (sigma)”. Note that the standard deviation \(\sigma\) doesn’t have any subscripts: we assume it is the same for the \(k\) groups. Note also that you’ll often see the second parameter in the parentheses as \(\sigma^2\), referring to the variance.

The means for groups 0 and 1 are simply \(\mu_0\) and \(\mu_1\), respectively, and their difference (let’s call it \(d\)) is \(d = \mu_1 - \mu_0\). The 95% CI for \(d\) is given in the t-test output, and we can tell that it differs from the one given by Welch’s t-test.

It is unsurprising, then, that if we use a different model (the more appropriate unequal variances model; Lakens, 2015), our inferences may be different. Welch’s t-test is the same as Student’s, except that now we assume (and subsequently estimate) a unique standard deviation \(\sigma_{ik}\) for each group.

\[y_{ik} \sim N(\mu_{ik}, \sigma_{ik})\]

This model makes a lot of sense, because rarely are we in a situation to a priori decide that the variance of scores in Group A is equal to the variance of scores in Group B. If you use the equal variances t-test, you should be prepared to justify and defend this assumption. (Deciding between models–such as between these two t-tests–is one way in which our prior information enters and influences data analysis. This fact should make you less suspicious about priors in Bayesian analyses.)

Armed with this knowledge, we can now see that “conducting a t-test” can be understood as estimating one of these two models. By estimating the model, we obtain t-values, degrees of freedom, and consequently, p-values.

However, to focus on modeling and estimation, it is easier to think of the t-test as a specific type of the general linear model (aka linear regression). We can re-write the t-test as an equivalent linear model with a specific parameter for the difference in means. (For simplicity, I’ll only write the equal variances model):

\[y_{ik} \sim N(\mu_{ik}, \sigma)\] \[\mu_{ik} = \beta_0 + \beta_1 Group_{ik}\]

Here, \(\sigma\) is just as before, but we now model the mean with an intercept (Group 0’s mean, \(\beta_0\)) and the effect of Group 1 (\(\beta_1\)). To understand what’s going on, let’s look at the data. Group is an indicator variable: for each row of Group 0’s data Group is zero, and for each row of Group 1’s data Group is one.

##     Group  IQ
## 1       0  99
## 2       0 101
## 3       0 100
## 4       0 101
## ...   ... ...
## 86      1 101
## 87      1 104
## 88      1 100
## 89      1 101

With this model, \(\beta_1\) directly tells us the estimated difference between the two groups. And because it is a parameter in the model, it has an associated standard error, t-value, degrees of freedom, and a p-value. This linear model can be estimated in R with the following line of code:

olsmod <- lm(IQ ~ Group, data = d)

The key input here is a model formula, which in R is specified as outcome ~ predictor (DV ~ IV). Using the lm() function, we estimated a linear model predicting IQ from an intercept (automatically included) and a Group parameter Group, which is the effect of group 1. I called this object olsmod for Ordinary Least Squares Model.

R has its own model formula syntax, which is well worth learning. The formula in the previous model, IQ ~ Group, means that we want to regress IQ on an intercept (which is implicitly included), and group (Group). Besides the formula, we only need to provide the data, which is contained in the object I’ve conveniently called d.

You can verify that the results are identical to the equal variances t-test above.

summary(olsmod)
## Call:
## lm(formula = IQ ~ Group, data = d)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.9149  -0.9149   0.0851   1.0851  22.0851 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 100.3571     0.7263 138.184   <2e-16 ***
## Group         1.5578     0.9994   1.559    0.123    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 4.707 on 87 degrees of freedom
## Multiple R-squared:  0.02717,    Adjusted R-squared:  0.01599 
## F-statistic:  2.43 on 1 and 87 DF,  p-value: 0.1227

Focus on the Group row in the estimated coefficients. Estimate is the point estimate (best guess) of the difference in means (\(d = 101.9149 - 100.3571 = 1.5578\)). t value is the observed t-value (identical to what t.test() reported), and the p-value (Pr(>|t|)) matches as well. The (Intercept) row refers to \(\beta_0\), which is group 0’s mean.

This way of thinking about the model, where we have parameters for one group’s mean, and the effect of the other group, facilitates focusing on the important parameter, the difference, instead of individual means. However, you can of course compute the difference from the means, or the means from one mean and a difference.
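For example, Group 1’s mean can be recovered from the estimated coefficients:

# Group 1's mean = intercept (Group 0's mean) + Group effect
coef(olsmod)["(Intercept)"] + coef(olsmod)["Group"]  # 100.3571 + 1.5578 = 101.9149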

Bayesian estimation of the t-test

Equal variances model

Next, I’ll illustrate how to estimate the equal variances t-test using Bayesian methods. We use brms (Buerkner, 2016), and the familiar R formula syntax which we used with the OLS model.

Estimating this model with R, thanks to the Stan and brms teams (Stan Development Team, 2016; Buerkner, 2016), is as easy as the linear regression model we ran above. If you haven’t yet installed brms, you need to install it first by running install.packages("brms"). Then, to access its functions, load the brms package to the current R session.
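library(brms)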


The most important function in the brms package is brm(), for Bayesian Regression Model(ing). The user needs only to input a model formula, just as above, and a data frame that contains the variables specified in the formula. brm() then translates the model into Stan language, and asks Stan to compile the model into C++ and estimate it (see Kruschke, 2014; McElreath, 2016 for details about estimation). The result is an R object with the estimated results (and much more). We run the model and save the results to mod_eqvar for equal variances model:

mod_eqvar <- brm(IQ ~ Group, data = d)
summary(mod_eqvar)
##  Family: gaussian (identity) 
## Formula: IQ ~ Group 
##    Data: d (Number of observations: 89) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept   100.34      0.74    98.91   101.83       3803    1
## Group         1.59      1.01    -0.41     3.55       3958    1
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma     4.77      0.36     4.13     5.55       2653    1
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

Notice that the model contains three parameters, one of which is the shared standard deviation sigma. Compare the output of the Bayesian model to the one estimated with lm() (OLS):

Method  term         estimate  std.error
OLS     (Intercept)   100.357      0.726
OLS     Group           1.558      0.999
Bayes   b_Intercept   100.339      0.743
Bayes   b_Group         1.588      1.009

The point estimates (posterior means in the Bayesian model) and standard errors (SD of the respective posterior distribution) are pretty much identical.
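One way such a comparison table could be assembled from the two fitted objects (a sketch; the column names differ between the two summaries):

ols_est   <- coef(summary(olsmod))[, 1:2]  # OLS estimates and standard errors
bayes_est <- fixef(mod_eqvar)[, 1:2]       # Posterior means and SDs
round(rbind(ols_est, bayes_est), 3)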

We now know the models behind t-tests, and how to estimate the equal variances t-test using the t.test(), lm(), and brm() functions. We also know how to run Welch’s t-test using t.test(). However, estimating the general linear model version of the unequal variances t-test model is slightly more complicated, because it involves specifying predictors for \(\sigma\), the standard deviation parameter.

Unequal variances model

We only need a small adjustment to the equal variances model to specify the unequal variances model:

\[y_{ik} \sim N(\mu_{ik}, \sigma_{ik})\] \[\mu_{ik} = \beta_0 + \beta_1 Group_{ik}\]

Notice that we now have subscripts for \(\sigma\), denoting that it varies between groups. In fact, we’ll write out a linear model for the standard deviation parameter!

\[\sigma_{ik} = \gamma_0 + \gamma_1 Group_{ik}\]

The model now includes, instead of a common \(\sigma\), one parameter for Group 0’s standard deviation \(\gamma_0\) (gamma), and one for the effect of Group 1 on the standard deviation \(\gamma_1\), such that group 1’s standard deviation is \(\gamma_0 + \gamma_1\). Therefore, we have 4 free parameters, two means and two standard deviations. (The full specification would include prior distributions for all the parameters, but that topic is outside of the scope of this post.) Let’s estimate!

brm() takes more complicated models by wrapping them inside bf() (short for brmsformula()), which is subsequently entered as the first argument to brm().

uneq_var_frm <- bf(IQ ~ Group, sigma ~ Group)

You can see that the formula regresses IQ on Group, such that we’ll have an intercept (implicitly included), and an effect of Group 1. Remarkably, we are also able to model the standard deviation sigma, and we regress it on Group (it will also have an intercept and effect of group).

mod_uneqvar <- brm(uneq_var_frm, data = d, cores = 4)
summary(mod_uneqvar)
##  Family: gaussian (identity) 
## Formula: IQ ~ Group 
##          sigma ~ Group
##    Data: d (Number of observations: 89) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept         100.35      0.40    99.55   101.12       4000    1
## sigma_Intercept     0.94      0.11     0.72     1.17       3138    1
## Group               1.54      0.99    -0.38     3.47       2471    1
## sigma_Group         0.87      0.15     0.57     1.17       3222    1
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The model’s output contains our 4 parameters. Intercept is the mean for group 0, and Group is the “effect of group 1”. The sigma_Intercept is the standard deviation of Group 0, and sigma_Group is the effect of group 1 on the standard deviation (the SD of Group 1 is sigma_Intercept + sigma_Group). The sigmas are implicitly modeled through a log link (because they must be positive). To convert them back to the scale of the data, they need to be exponentiated. After taking the exponents of the sigmas, the results look like this:

parameter          Estimate  Est.Error  l-95% CI  u-95% CI
b_sigma_Group          2.41       0.37      1.76      3.23
b_Intercept          100.35       0.40     99.55    101.12
b_sigma_Intercept      1.75       0.85      0.76      3.09
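The conversion itself could be done on the posterior samples, for example like this (a sketch; note that Group 1’s SD uses the sum of the two coefficients):

post <- posterior_samples(mod_uneqvar, pars = "b_sigma")
sd_group0 <- exp(post$b_sigma_Intercept)                       # Group 0's SD
sd_group1 <- exp(post$b_sigma_Intercept + post$b_sigma_Group)  # Group 1's SD
c(group0 = mean(sd_group0), group1 = mean(sd_group1))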

For comparison, here is the “observed SD” of group 0:

round(sd(group_0), 2)
## [1] 2.52

Keep in mind that the parameters refer to Group 0’s mean (Intercept) and SD (sigma), and the difference between groups in those values (Group) and (sigma_Group). We now have fully Bayesian estimates of the 4 parameters of the unequal variances t-test model. Because p-values have no place in Bayesian inference, they are not reported in the output. However, you can calculate a quantity that is equivalent to a one-sided p-value from the posterior distribution: Take the proportion of posterior density (MCMC samples) above/below a reference value (0). This is definitely not the most useful thing you can do with a posterior distribution, but the fact that it numerically matches a one-sided p-value is quite interesting:

# Posterior distribution of Group effect
x <- posterior_samples(mod_uneqvar, pars = "b_Group")[,1]
# Proportion of MCMC samples below zero
round((sum(x < 0) / length(x)), 3)
## [1] 0.058
# One sided p-value from t-test
round(t.test(group_0, group_1, alternative = "less")$p.value, 3)
## [1] 0.055

I’m showing this remarkable fact (Marsman & Wagenmakers, no date) not to persuade you to stick with p-values, but to alleviate fears that these methods would always produce discrepant results.

Although this model is super easy to estimate with brm() (which, I should emphasize, uses Stan for the estimation procedures), the model seems, frankly speaking, strange. I am just not used to modeling variances, and I’ll bet a quarter that neither are you. Nevertheless, there it is!

Finally, let’s move on to Kruschke’s (2013) “Robust Bayesian Estimation” model.

Robust Bayesian Estimation

Kruschke’s robust model is a comparison of two groups, using five parameters: One mean for each group, one standard deviation for each group, just as in the unequal variances model above. The fifth parameter is a “normality” parameter, \(\nu\) (nu), which means that we are now using a t-distribution to model the data. Using a t-distribution to model the data, instead of a Gaussian, means that the model (and therefore our inferences) are less sensitive to extreme values (outliers). Here’s what the model looks like:

\[y_{ik} \sim T(\nu, \mu_{ik}, \sigma_{ik})\]

Read the above formula as “Y are random draws from a t-distribution with ‘normality’ parameter \(\nu\), mean \(\mu_{ik}\), and standard deviation \(\sigma_{ik}\)”. We have a linear model for the means and standard deviations:

\[\mu_{ik} = \beta_0 + \beta_1 Group_{ik}\]


\[\sigma_{ik} = \gamma_0 + \gamma_1 Group_{ik}\]

This model, as you can see, is almost identical to the unequal variances t-test, but instead uses a t-distribution (we assume data are t-distributed), and includes the normality parameter. Using brm() we can still use the unequal variances model, but have to specify the t-distribution. We do this by specifying the family argument as student (as in Student’s t):

mod_robust <- brm(bf(IQ ~ Group, sigma ~ Group),
                  family = student,
                  data = d, cores = 4)
summary(mod_robust)
##  Family: student (identity) 
## Formula: IQ ~ Group 
##          sigma ~ Group
##    Data: d (Number of observations: 89) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Eff.Sample  Rhat
## Intercept        100.524     0.208  100.125  100.940       4000 0.999
## sigma_Intercept    0.001     0.196   -0.389    0.395       4000 1.001
## Group              1.028     0.417    0.206    1.833       3100 1.000
## sigma_Group        0.671     0.258    0.163    1.165       4000 1.000
## Family Specific Parameters: 
##    Estimate Est.Error l-95% CI u-95% CI Eff.Sample  Rhat
## nu    1.853     0.482    1.135     2.99       3179 1.001
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The effect of Group is about one unit, with a 95% Credible Interval from 0.2 to 1.8.

Finally, let’s compare the results to those in Kruschke’s paper (2013, p. 578). Before we do this, I’ll convert the estimated parameters to means and standard deviations (instead of the “regression effects” produced by default). Recall that I recoded the group labels used by Kruschke: what he calls group 2 is group 0 (the control group) in our analyses, but group 1 is still group 1. In the following I transform the results and compute HDIs to obtain results most comparable with Kruschke’s:
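Here’s a sketch of how that transformation might look; the helper choices (coda for the HDIs) are mine, not necessarily the original post’s:

library(coda)
post <- posterior_samples(mod_robust)
draws <- data.frame(
    group0_mean  = post$b_Intercept,
    group1_mean  = post$b_Intercept + post$b_Group,
    diff_means   = post$b_Group,
    group0_sigma = exp(post$b_sigma_Intercept),
    group1_sigma = exp(post$b_sigma_Intercept + post$b_sigma_Group),
    nu           = post$nu
)
summarize_draws <- function(x) {
    hdi <- HPDinterval(as.mcmc(x))  # 95% HDI limits
    c(m = mean(x), HDIlwr = hdi[1], HDIupr = hdi[2])
}
round(t(sapply(draws, summarize_draws)), 2)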

key           m       HDIlwr  HDIupr
group0_mean   100.52  100.12  100.94
group1_mean   101.55  100.80  102.24
diff_means    1.03    0.19    1.80
group0_sigma  1.02    0.65    1.41
group1_sigma  2.00    1.22    2.80
diff_sigmas   2.02    1.13    3.10
nu            1.85    1.04    2.76
nulog10       0.25    0.04    0.46

Notice that Kruschke reports modes (2013, p. 578), but our point estimates are means. The results with respect to the group means are identical to two decimal points; the standard deviations are slightly more discrepant, because the paper reports modes, but we focus on posterior means.

Finally, here is how to estimate the model using the original code (Kruschke & Meredith, 2015):

library(BEST)  # Kruschke & Meredith's (2015) package
BEST <- BESTmcmc(group_0, group_1)
## Processing function input....... 
## Done. 
## Beginning parallel processing using 3 cores. Console output will be suppressed.
## Parallel processing completed.
## MCMC took 0.283 minutes.
## MCMC fit results for BEST analysis:
## 100002 simulations saved.
##           mean     sd  median    HDIlo   HDIup Rhat n.eff
## mu1    100.525 0.2149 100.524 100.1108 100.955    1 57740
## mu2    101.548 0.3800 101.551 100.7853 102.289    1 59733
## nu       1.840 0.4775   1.764   1.0380   2.753    1 23934
## sigma1   1.052 0.2069   1.032   0.6737   1.464    1 35619
## sigma2   2.063 0.4359   2.021   1.2598   2.927    1 29495
## 'HDIlo' and 'HDIup' are the limits of a 95% HDI credible interval.
## 'Rhat' is the potential scale reduction factor (at convergence, Rhat=1).
## 'n.eff' is a crude measure of effective sample size.

This output reports posterior means and HDI limits, which we report above. You can verify that they match very closely to each other. This BESTmcmc() function is great, but with brms you are able to estimate a vast variety of models.


Well, that ended up much longer than what I intended. The aim was both to illustrate the ease of Bayesian modeling in R using brms (Buerkner, 2016) and Stan (Stan Development Team, 2016), and highlight the fact that we can easily move from simple t-tests to more complex (and possibly better) models.

If you’ve followed through, you should be able to conduct Student’s (equal variances) and Welch’s (unequal variances) t-tests in R, and to think about those tests as instantiations of general linear models. Further, you should be able to estimate these models using Bayesian methods.

You should now also be familiar with Kruschke’s robust model for comparing two groups’ metric data, and be able to implement it in one line of R code. This model was able to find credible differences between two groups, although the frequentist t-tests and models reported p-values well above .05. That should be motivation enough to try robust (Bayesian) models on your own data.

Further reading

I didn’t take any space here to discuss the interpretation of Bayesian statistics. For that, I recommend Kruschke (2014) and McElreath (2016). See also Etz, Gronau, Dablander, Edelsbrunner, & Baribault (2016) for an annotated reading list on Bayesian statistics.


Buerkner, P.-C. (2016). brms: Bayesian Regression Models using Stan. Retrieved from
Cumming, G. (2014). The New Statistics: Why and How. Psychological Science, 25(1), 7–29.
Dienes, Z. (2011). Bayesian Versus Orthodox Statistics: Which Side Are You On? Perspectives on Psychological Science, 6(3), 274–290.
Etz, A., Gronau, Q. F., Dablander, F., Edelsbrunner, P. A., & Baribault, B. (2016). How to become a Bayesian in eight easy steps: An annotated reading list. ResearchGate. Retrieved from
Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603.
Kruschke, J. K. (2014). Doing Bayesian Data Analysis: A Tutorial Introduction with R (2nd Edition). Burlington, MA: Academic Press.
Lakens, D. (2015, January 26). The 20% Statistician: Always use Welch’s t-test instead of Student’s t-test. Retrieved from
Marsman, M., & Wagenmakers, E.-J. (n.d.). Three Insights from a Bayesian Interpretation of the One-Sided P Value. Retrieved from
McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
Morey, R. D., & Rouder, J. (2015). BayesFactor: Computation of Bayes Factors for Common Designs. Retrieved from
Rouder, J. N., Morey, R. D., Verhagen, J., Province, J. M., & Wagenmakers, E.-J. (2016). Is There a Free Lunch in Inference? Topics in Cognitive Science, 8(3), 520–547.
Stan Development Team. (2016). Stan: A C++ Library for Probability and Sampling, Version 2.14.1. Retrieved from
Vuorre, M. (2016, December 5). Introduction to Data Analysis using R. Retrieved from

Meta-analysis is a special case of Bayesian multilevel modeling


Hello everybody! Recently, there’s been a lot of talk about meta-analysis, and here I would just like to quickly show that Bayesian multilevel modeling nicely takes care of your meta-analysis needs, and that it is easy to do in R with the rstan and brms packages. As you’ll see, meta-analysis is a special case of Bayesian multilevel modeling when you are unable or unwilling to put a prior distribution on the meta-analytic effect size estimate.

The idea for this post came from Wolfgang Viechtbauer’s website, where he compared results for meta-analytic models fitted with his great (frequentist) package metafor and the Swiss army knife of multilevel modeling, lme4. It turns out that even though you can fit meta-analytic models with lme4, the results differ from traditional meta-analytic models, because the study-specific variances are treated slightly differently.

Here are the packages we’ll be using:
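Judging from the analyses below, at minimum these are needed:

library(metafor)    # classical meta-analysis
library(brms)       # Bayesian multilevel models via Stan
library(tidyverse)  # data manipulation and plotting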


The data

Here I’ll only focus on a simple random effects meta-analysis of effect sizes, and will use the same example data as in the aforementioned website. The data are included in the metafor package, and describe the relationship between conscientiousness and medication adherence. The effect sizes are r to z transformed correlations.
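These data ship with metafor, if I recall correctly as dat.molloy2014; here’s a sketch of how the yi, vi, and sei columns could be computed with escalc():

dat <- escalc(measure = "ZCOR", ri = ri, ni = ni, data = dat.molloy2014)  # adds yi (z) and vi
dat$sei <- sqrt(dat$vi)  # standard errors from the sampling variances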

study year ni ri yi vi sei
Axelsson et al. (2009) 2009 109 0.19 0.19 0.01 0.10
Axelsson et al. (2011) 2011 749 0.16 0.16 0.00 0.04
Bruce et al. (2010) 2010 55 0.34 0.35 0.02 0.14
Christensen et al. (1995) 1995 72 0.27 0.28 0.01 0.12
Christensen et al. (1999) 1999 107 0.32 0.33 0.01 0.10
Cohen et al. (2004) 2004 65 0.00 0.00 0.02 0.13
Dobbels et al. (2005) 2005 174 0.18 0.18 0.01 0.08
Ediger et al. (2007) 2007 326 0.05 0.05 0.00 0.06
Insel et al. (2006) 2006 58 0.26 0.27 0.02 0.13
Jerant et al. (2011) 2011 771 0.01 0.01 0.00 0.04
Moran et al. (1997) 1997 56 -0.09 -0.09 0.02 0.14
O’Cleirigh et al. (2007) 2007 91 0.37 0.39 0.01 0.11
Penedo et al. (2003) 2003 116 0.00 0.00 0.01 0.09
Quine et al. (2012) 2012 537 0.15 0.15 0.00 0.04
Stilley et al. (2004) 2004 158 0.24 0.24 0.01 0.08
Wiebe & Christensen (1997) 1997 65 0.04 0.04 0.02 0.13

Here’s what these data look like (point estimates +- 2 SEM):

ggplot(dat, aes(x = yi, y = study)) +
    geom_segment(aes(x = yi - sei*2, xend = yi + sei*2, y = study, yend = study)) +
    geom_point()  # point estimates

The model

We are going to fit a random-effects meta-analysis model to these observed effect sizes and their standard errors.

Here’s what this model looks like, loosely following notation from the metafor package (Viechtbauer, 2010) manual (p. 6):

\[y_i \sim N(\theta_i, \sigma_i^2)\]

where each recorded effect size, \(y_i\), is a draw from a normal distribution which is centered on that study’s “true” effect size \(\theta_i\) and has standard deviation equal to the study’s observed standard error \(\sigma_i\).

Our next set of assumptions is that the studies’ true effect sizes approximate some underlying effect size in the (hypothetical) population of all studies. We call this underlying population effect size \(\mu\), and its standard deviation \(\tau\), such that the true effect sizes are thus distributed:

\[\theta_i \sim N(\mu, \tau^2)\]

We now have two really interesting parameters: \(\mu\) tells us, all else being equal, what I may expect the “true” effect to be, in the population of similar studies. \(\tau\) tells us how much individual studies of this effect vary.

I think it is most straightforward to write this model as yet another mixed-effects model (metafor manual p.6):

\[y_i \sim N(\mu + \theta_i, \sigma^2_i)\]

where \(\theta_i \sim N(0, \tau^2)\): studies’ true effects are normally distributed around the average, with between-study heterogeneity \(\tau^2\). The reason this is a little confusing (to me at least) is that we know the \(\sigma_i\)s; this known measurement variance is the fact that separates meta-analysis from more common regression modeling.

Estimation with metafor

Super easy!

ma_out <- rma(data = dat, yi = yi, sei = sei, slab = dat$study)
# Random-Effects Model (k = 16; tau^2 estimator: REML)
#   logLik  deviance       AIC       BIC      AICc  
#   8.6096  -17.2191  -13.2191  -11.8030  -12.2191  
# tau^2 (estimated amount of total heterogeneity): 0.0081 (SE = 0.0055)
# tau (square root of estimated tau^2 value):      0.0901
# I^2 (total heterogeneity / total variability):   61.73%
# H^2 (total variability / sampling variability):  2.61
# Test for Heterogeneity: 
# Q(df = 15) = 38.1595, p-val = 0.0009
# Model Results:
# estimate       se     zval     pval    ci.ub
#   0.1499   0.0316   4.7501   <.0001   0.0881   0.2118      *** 
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The switch to Bayes

So far so good, we’re strictly in the realm of standard meta-analysis. But I would like to propose that instead of using custom meta-analysis software, we simply consider the above model as just another regression model, and fit it like we would any other (multilevel) regression model. That is, using Stan, usually through the brms interface. Going Bayesian allows us to assign prior distributions on the population-level parameters \(\mu\) and \(\tau\), and we would usually want to use some very mildly regularizing priors. Here, to make the results most comparable, I’ll use uniform (non-informative) priors:

\[\mu \sim U(-\infty, \infty)\]


\[\tau \sim U(0, 1000)\]

Estimation with brms

Here’s how to fit this model with brms:

brm_out <- brm(yi | se(sei) ~ 1 + (1|study), 
               prior = set_prior("uniform(0, 1000)", class = "sd"),
               data = dat, iter = 5000, warmup = 2000, cores = 4)
#  Family: gaussian(identity) 
# Formula: yi | se(sei) ~ 1 + (1 | study) 
#    Data: dat (Number of observations: 16) 
# Samples: 4 chains, each with iter = 5000; warmup = 2000; thin = 1; 
#          total post-warmup samples = 12000
#    WAIC: Not computed
# Group-Level Effects: 
# ~study (Number of levels: 16) 
#               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# sd(Intercept)      0.1      0.04     0.04     0.19       3121    1
# Population-Level Effects: 
#           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# Intercept     0.15      0.04     0.08     0.22       5612    1
# Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
# is a crude measure of effective sample size, and Rhat is the potential 
# scale reduction factor on split chains (at convergence, Rhat = 1).

These results match the ones obtained with metafor: the population-level Intercept (0.15) corresponds to metafor’s estimate of \(\mu\) (0.1499), and sd(Intercept) (0.10) corresponds to \(\tau\) (0.0901).
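You can pull the point estimates out programmatically to verify (a quick sketch):

ma_out$b        # metafor's estimate of mu
fixef(brm_out)  # brms population-level Intercept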

Comparing results

We can now compare the results of these two estimation methods. Of course, the Bayesian method has a tremendous advantage, because it results in an actual distribution of plausible values, whereas the frequentist method gives us point estimates and standard errors.

Figure 1. Histogram of 80k samples from the posterior distribution of \(\mu\) (top left) and \(\tau\) (top right). Bottom left displays the multivariate posterior distribution of \(\mu\) (x-axis) and \(\tau\) (y-axis), light colors indicating increased plausibility of values. For each plot, the dashed lines display the maximum likelihood point estimate, and 95% confidence limits (only the point estimate is displayed for the multivariate figure.)

We can see from the numeric output, and especially the figures, that these modes of inference yield the same numerical results. Keep in mind though, that the Bayesian estimates actually allow you to discuss probabilities, and generally the things that we’d like to discuss when talking about results.

For example, what is the probability that the average effect size is greater than 0.2?

avg_es <- posterior_samples(brm_out, pars = "b_")[,1]
cat(100 * mean(avg_es > 0.2), "%")  # proportion of posterior samples above 0.2
# 8.125 %

Forest plot

We may now draw a forest plot with a custom ggplot2 function (see the source for the code; it ended up too ugly to show here!):

fit_forest <- bayes_forest(data = dat, 
                           model = brm_out, 
                           title = "Conscientiousness & medication adherence", 
                           xlimits = c(-0.5, 1),
                           level = .95,
                           rma = ma_out)

This forest plot displays, as violin plots, the entire posterior distribution of each \(\theta_i\). The meta-analytic effect size \(\mu\) is also displayed amidst the varying \(\theta_i\)s. The mean and 95% CI limits of the posteriors are also displayed on the right in text form for all you precision fans.

The plot also shows each study’s observed mean effect size as a little x, and +- 2SEM as a thin line. These are shown slightly below each corresponding posterior distribution. Defying information overflow, the BLUPs from the model fitted with metafor are shown as points above the violins, with their corresponding 95% CI limits.
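The BLUPs and their interval limits can be extracted from the metafor fit with blup():

blup(ma_out)  # best linear unbiased predictions of each study's true effect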

Focus on Moran et al. (1997)’s observed effect size (the x): it is a really anomalous result compared to all the other studies. One might describe it as incredible, and that is indeed how the Bayesian estimation procedure has treated it: the resulting posterior distribution is no longer equivalent to the observed effect size, but is shrunken toward the average effect size. Now look at the table above: this study only had 56 participants, so we should be more skeptical of its observed ES, and adjust our beliefs about this study in the context of the other studies. Our best guess about this study’s effect size, given all the other studies, is therefore no longer the observed mean, but something closer to the average across the studies. Fantastic. Brilliant!

If this shrinkage business seems radical, consider Quine et al. (2012). This study had a much greater sample size (537), and therefore a smaller SE. It was also generally more in line with the average effect size estimate. Therefore, the observed mean ES and the mean of the posterior distribution are pretty much identical. This is also a fairly desirable feature.
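You can inspect the shrinkage numerically by comparing each study’s observed effect size to the mean of its posterior distribution; a sketch, assuming the first column of fitted() holds the posterior means:

theta <- fitted(brm_out)  # posterior summaries of each theta_i
round(cbind(observed = dat$yi, shrunken = theta[, 1]), 2)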


The way these different methods are presented (regression, meta-analysis, ANOVA, …), it is quite easy for a beginner, like me, to lose sight of the forest for the trees. I also feel that this is a general experience for students of applied statistics: Every experiment, situation, and question results in a different statistical method (or worse: “Which test should I use?”), and the student doesn’t see how the methods relate to each other. So I think focusing on the (regression) model is key, but often overlooked in favor of this sort of decision tree model of choosing statistical methods (McElreath, 2016, p.2).

Accordingly, I think we’ve ended up in a situation where meta-analysis, for example, is seen as somehow separate from all the other modeling we do, such as repeated measures t-tests. In fact I think applied statistics in Psychology may too often appear as an unconnected bunch of tricks and models, leading to confusion and inefficient implementation of appropriate methods.

Bayesian multilevel modeling

As I’ve been learning more about statistics, I’ve often noticed that some technique, applied in a specific set of situations, turns out to be a special case of a more general modeling approach. I’ll call this approach here Bayesian multilevel modeling, and won’t say much more than that it’s awesome (Gelman et al., 2013; McElreath, 2016). If you are forced to choose one statistical method to learn, it should be Bayesian multilevel modeling, because it allows you to do and understand most things, and allows you to see how similar all these methods are, under the hood.

Have a nice day.


Buerkner, P.-C. (2016). brms: Bayesian Regression Models using Stan. Retrieved from

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition. Boca Raton: Chapman and Hall/CRC.

McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.

Stan Development Team. (2016). Stan: A C++ Library for Probability and Sampling, Version 2.11.1. Retrieved from

Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48.

Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer Science & Business Media.

Multilevel Confidence

In this post, I address the following problem: How to obtain regression lines and their associated confidence intervals at the average and individual-specific levels, in a two-level multilevel linear regression.


Visualization is perhaps the most effective way of communicating the results of a statistical model. For regression models, two figures are commonly used: The coefficient plot shows the coefficients of a model graphically, and can be used to replace or augment a model summary table. The advantage over tables is that it is usually faster to understand the estimated parameters by looking at them in graphical form, but the downside is losing the numerical accuracy of the table. However, both of these model summaries become increasingly difficult to interpret as the number of coefficients increases, and especially when interaction terms are included.

An alternative visualization is the line plot, which shows what the model implies in terms of the data, such as the relationship between X and Y, and perhaps how that relationship is moderated by other variables. For a linear regression, this plot displays the regression line and its confidence interval. If a confidence interval is not shown, the plot is not complete because the viewer can’t visually assess the uncertainty in the regression line, and therefore a simple line without a confidence interval is of little inferential value. Obtaining the line and confidence interval for simple linear regression is very easy, but is not straightforward in a multilevel context, the topic of this post.

Most of my statistical analyses utilize multilevel modeling, where parameters (means, slopes) are treated as varying between individuals. Because common procedures for estimating these models return point estimates for the regression coefficients at all levels, drawing expected regression lines is easy. However, displaying the confidence limits for the regression lines is not as easily done. Various options exist, and some software packages provide these limits automatically, but in this post I want to highlight a completely general approach to obtaining and drawing confidence limits for regression lines at multiple levels of analysis, and where applicable, show how various packages deliver them automatically. The general approach relies on random samples of plausible parameter values, which can then be visualized as random samples of regression lines (for example, Kruschke, 2014, uses this approach effectively). Importantly, we can summarize the samples with an interval at each level of the predictor values, yielding the confidence interval for the regression line.

I will illustrate the procedure first with a maximum likelihood model fitting procedure, using the lme4 package. This procedure requires an additional step where plausible parameter values are simulated from the estimated model, using the arm package. Then, I’ll show how to obtain the limits from models estimated with Bayesian methods, using various R packages relying on the Stan inference engine (only brms is included at the moment).

The Data

I will use the sleepstudy data set from the lme4 package as an example:

“The average reaction time per day for subjects in a sleep deprivation study. On day 0 the subjects had their normal amount of sleep. Starting that night they were restricted to 3 hours of sleep per night. The observations represent the average reaction time on a series of tests given each day to each subject.”

The sleepstudy object is a data.frame, which I modify a little bit before getting started:

  • Change the variable names to lower case
  • Change the object from a data.frame to a data_frame, because the latter has better default behavior
  • Create a sequential id number for each individual
  • Reorder the variables to a logical order (id, predictor, outcome)
  • Include a subset of 16 individuals.
names(sleepstudy) <- tolower(names(sleepstudy))
sleepstudy <- as_data_frame(sleepstudy) %>%
    mutate(id = rep(1:18, each = 10)) %>%
    select(id, days, reaction) %>%
    filter(id <= 16)

The data is structured in a long format, where each row contains all variables at a single measurement instance:

id days reaction
1 0 249.6
1 1 258.7
1 2 250.8
1 3 321.4
1 4 356.9
1 5 414.7

Fixed Effects Models and CIs

Below, I show two kinds of scatterplots from the data. The left one represents a fixed effects regression, where information about individuals is discarded, and all that is left is a lonely band of inference in a sea of scattered observations. The right panel shows fixed effects regressions separately for each individual.

p1 <- ggplot(sleepstudy, aes(x = days, y = reaction)) +
    geom_point(shape = 1) +
    geom_smooth(method = "lm", fill = "dodgerblue", level = .95) +
    scale_x_continuous(breaks = c(0, 3, 6, 9)) +
    coord_cartesian(ylim = c(180, 500))
p2 <- p1 + facet_wrap(~id, nrow = 4)
grid.arrange(p1, p2, ncol = 2)

Obtaining confidence intervals for regression lines using ggplot2 is easy, but an alternative way is to explicitly use the predict() function (which ggplot2 uses under the hood). For more complicated or esoteric models, explicit prediction becomes necessary, either using predict() or custom code.
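For instance, the left-hand panel’s regression line and interval could be created manually with a simple linear model (a minimal sketch):

fit <- lm(reaction ~ days, data = sleepstudy)
newdat <- data.frame(days = 0:9)
cbind(newdat, predict(fit, newdata = newdat, interval = "confidence"))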

Multilevel model

The multilevel model I’ll fit to these data treats the intercept and the effect of days as average (fixed) effects that also vary between individuals:

\[\mathsf{reaction}_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma)\]

\[\mu_{ij} = \beta_{0j} + \beta_{1j} \ \mathsf{days}_{ij}\]

\[\begin{pmatrix}\beta_{0j}\\ \beta_{1j}\end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix}\gamma_{00}\\ \gamma_{10}\end{pmatrix},\ \begin{pmatrix}\tau_{00}^2 & \rho_{01}\tau_{00}\tau_{10}\\ \rho_{01}\tau_{00}\tau_{10} & \tau_{10}^2\end{pmatrix}\right)\]

In this post, and the above equations, I’ll omit the discussion of hyperpriors (priors on \(\gamma\), \(\tau\) and \(\rho\) parameters.)

If the above equation baffles the mind, or multilevel models are mysterious to you, Bolger and Laurenceau (2013) provide a great introduction, and a comprehensive treatment is given in Gelman and Hill (2007).

Maximum likelihood estimation: lme4

I’ll estimate the multilevel model using the lme4 package.

lmerfit <- lmer(reaction ~ days + (days | id), data = sleepstudy)
Multilevel model summary:

term                     estimate  std.error  group
(Intercept)                250.29       7.63  fixed
days                        10.50       1.74  fixed
sd_(Intercept).id           26.34         NA  id
sd_days.id                   6.34         NA  id
cor_(Intercept).days.id      0.04         NA  id
sd_Observation.Residual     26.26         NA  Residual

The key points here are the estimates and their associated standard errors, the latter of which are missing for the varying effects’ correlations and standard deviations.
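If you do need uncertainty estimates for the variance components from lme4, profile confidence intervals are one option (they can be slow to compute):

confint(lmerfit, method = "profile")  # CIs, including sd and cor parameters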

Working with point estimates

Using the model output, we can generate regression lines using the predict() function. Using this method, we can simply add a new column to the existing sleepstudy data frame, giving the fitted value for each row in the data. However, for visualization, it is very useful to generate the fitted values for specific combinations of predictor values, instead of generating a fitted value for every observation. To do this, I simply create dataframes with the relevant predictors, and feed these data frames as data to predict().

To get fitted values at the average level, when there is only one predictor, the data frame is simply a column with rows for each level of days. For the varying effects, I create a data frame where each individual has all levels of days, using the expand.grid() function. When there are many predictors, expand.grid() is your friend.

# Data frame to evaluate average effects predictions on
newavg <- data.frame(days = 0:9)
newavg$reaction <- predict(lmerfit, newdata = newavg, re.form = NA)
# Predictors for the varying effect's predictions
newvary <- expand.grid(days = 0:9, id = 1:16)
newvary$reaction <- predict(lmerfit, newvary)

I’ll show these predictions within the previous figures: On the left, a single fixed effects model versus the average regression line from the new multilevel model, and on the right the separate fixed effects models versus the varying regression lines from the multilevel model. Below, I use blue colors to indicate the fixed effects models’ predictions, and black for the multilevel model’s predictions.

grid.arrange(
    p1 + geom_line(data = newavg, col = "black", size = 1),
    p2 + geom_line(data = newvary, col = "black", size = 1),
    ncol = 2)

As you can probably tell, the fixed effects regression line (blue), and the multilevel model’s average regression line (black; left panel) are identical, because of the completely balanced design. However, interesting differences are apparent in the right panel: The varying effects’ regression lines are different from the separate fixed effects models’ regression lines. How? They are “shrunk” toward the average-level estimate. Focus on the 9th panel, an individual whose reaction times got faster with increased sleep deprivation:

p2 %+% filter(sleepstudy, id == 9) + 
    geom_line(data = filter(newvary, id == 9), col = "black", size = 1) +
    ggtitle("ID 9")

Estimating each participant’s data in their very own model (separate fixed effects models; isn’t that nice, a unique model for everyone!) resulted in a predicted line suggesting to us that this person’s cognitive performance is enhanced following sleep deprivation:

Fixed effects regression for ID 9:

term         estimate  std.error  statistic  p.value
(Intercept)   263.035      6.694     39.296    0.000
days           -2.881      1.254     -2.298    0.051

However, if we used a model where this individual was treated as a random draw from a population of individuals (the multilevel model; black line in the above figure), the story is different. The point estimate for the slope parameter, for this specific individual, from this model (-0.4232) tells us that the estimated decrease in reaction times is quite a bit smaller. But this is just a point estimate, and in order to draw inference, we’ll need standard errors, or some representation of the uncertainty, in the estimated parameters. The appropriate uncertainty representations will also allow us to draw the black lines with their associated confidence intervals. I’ll begin by obtaining a confidence interval for the average regression line.
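For reference, the individual-specific point estimates combine the fixed effects with each individual’s estimated deviations; lme4’s coef() returns the combination directly:

coef(lmerfit)$id["9", ]  # intercept and slope (days) for individual 9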

CIs using arm: Average level

The method I will illustrate in this post relies on random samples of plausible parameter values, from which we can then generate regression lines–or draw inferences about the parameters themselves. These regression lines can then be used as their own distribution with their own respective summaries, such as an X% interval. First, I’ll show a quick way for obtaining these samples for the lme4 model, using the arm package to generate simulated parameter values.

The important parts of this code are:

  1. Simulating plausible parameter values
  2. Saving the simulated samples (a faux posterior distribution) in a data frame
  3. Creating a predictor matrix
  4. Creating a matrix for the fitted values
  5. Calculating fitted values for each combination of the predictor values, for each plausible combination of the parameter values
  6. Calculating the desired quantiles of the fitted values
sims <- sim(lmerfit, n.sims = 1000)  # 1
fs <- fixef(sims)  # 2
newavg <- data.frame(days = 0:9)
Xmat <- model.matrix( ~ 1 + days, data = newavg)  # 3
fitmat <- matrix(ncol = nrow(fs), nrow = nrow(newavg))  # 4
for (i in 1:nrow(fs)) { fitmat[,i] <- Xmat %*% as.matrix(fs)[i,] }  # 5
newavg$lower <- apply(fitmat, 1, quantile, prob=0.05)  # 6
newavg$median <- apply(fitmat, 1, quantile, prob=0.5)  # 6
newavg$upper <- apply(fitmat, 1, quantile, prob=0.95)  # 6
p1 + geom_line(data = newavg, aes(y = median), size = 1) +
    geom_line(data = newavg, aes(y = lower), lty = 2) +
    geom_line(data = newavg, aes(y = upper), lty = 2)

Again, the average regression line and the fixed effect model’s regression line are identical, but the former has a wider confidence interval (black dashed lines.)

Note. The above code snippet generalizes well to be used with any two matrices where one contains predictor values (the combinations of predictor values on which you want to predict) and the other samples of parameter values, such as a posterior distribution from a Bayesian model, as we’ll see below.
This procedure is given in Korner-Nievergelt et al. (2015), who give a detailed explanation of the code and on drawing inference from the results.
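As a sketch of that generalization, the loop can be wrapped in a small function that takes any predictor matrix and any matrix of parameter draws (function and argument names are hypothetical):

fitted_quantiles <- function(Xmat, parmat, probs = c(.05, .5, .95)) {
    fitmat <- Xmat %*% t(parmat)  # one fitted curve per draw (column)
    t(apply(fitmat, 1, quantile, probs = probs))  # quantiles per predictor value
}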

CIs using arm: Individual level

The fitted() function in arm returns fitted values at the varying effects level automatically, so we can skip a few lines of code from above to obtain confidence intervals at the individual-level:

yhat <- fitted(sims, lmerfit)
sleepstudy$lower <- apply(yhat, 1, quantile, prob=0.025)
sleepstudy$median <- apply(yhat, 1, quantile, prob=0.5)
sleepstudy$upper <- apply(yhat, 1, quantile, prob=0.975)
p2 + geom_line(data = sleepstudy, aes(y = median), size = 1) +
    geom_line(data = sleepstudy, aes(y = lower), lty = 2) +
    geom_line(data = sleepstudy, aes(y = upper), lty = 2) 

A subset of individuals highlights the most interesting differences between the models:

tmp <- filter(sleepstudy, id %in% c(6, 9))
p2 %+% tmp +
    geom_line(data = tmp, aes(y = median), size = 1) +
    geom_line(data = tmp, aes(y = lower), lty = 2) +
    geom_line(data = tmp, aes(y = upper), lty = 2) 

In the top panel, the unique fixed effects model’s confidence band is much wider than the confidence band from the multilevel model, highlighting the pooling of information in the latter model. Similarly, the bottom panel (individual 9 discussed above) shows that 95% plausible regression lines for that individual now include lines that increase as a function of days of sleep deprivation, and indeed the expected regression line for this individual is nearly a flat line.

In the next sections, we’ll apply this method of obtaining regression line confidence intervals for multilevel models estimated with Bayesian methods.

Intervals from Bayesian models

Confidence intervals are commonly called credible intervals in the Bayesian context, but I’ll use these terms interchangeably. The reader should be aware that, unlike traditional confidence intervals, credible intervals actually allow statements about credibility. In fact, being allowed to say the things we usually mean when discussing confidence intervals is one of many good reasons for applying Bayesian statistics.


I use brms to specify the model and sample from the posterior distribution (brms uses Stan under the hood.)

options(mc.cores = parallel::detectCores())  # Run many chains simultaneously
brmfit <- brm(
    data = sleepstudy, 
    reaction ~ days + (days | id), 
    family = gaussian,
    iter = 2000,
    chains = 4)
Bayesian model estimates (brms):

                     Estimate  Est.Error  l-95% CI  u-95% CI
Intercept              250.55       8.55    233.64    267.86
days                    10.48       1.99      6.60     14.51
sd(Intercept)           28.94       7.71     16.25     46.29
sd(days)                 7.14       1.74      4.42     11.38
cor(Intercept,days)      0.06       0.31     -0.49      0.68

Note that now we also have values for the uncertainties associated with the varying effect parameters. Brilliant!

Average regression line & CI

brms has a function for obtaining fitted values (fitted()) and their associated upper and lower bounds, which together constitute the regression line and its confidence interval.

newavg <- data.frame(days = 0:9)
fitavg <- cbind(newavg, fitted(brmfit, newdata = newavg, re_formula = NA)[,-2])
names(fitavg) <- c("days", "reaction", "lower", "upper")
p3 <- p1 + geom_line(data = fitavg, col = "black", size = 1) +
    geom_line(data = fitavg, aes(y = lower), col = "black", lty = 2) +
    geom_line(data = fitavg, aes(y = upper), col = "black", lty = 2)

The average effects’ estimates in this model have higher uncertainty than in the lmerfit model above, explaining why the average regression line’s CI is also wider.

Alternative to CIs

Instead of showing summaries of the samples from the posterior distribution, one could also plot the entire distribution–at the risk of overplotting. Overplotting can be avoided by adjusting each regression line’s transparency with the alpha parameter, resulting in a visually attractive–maybe?–display of the uncertainty in the regression line:

pst <- posterior_samples(brmfit, "b")
ggplot(sleepstudy, aes(x = days, y = reaction)) +
    geom_point(shape = 1) +
    geom_abline(data = pst, alpha = .01, size = .4,
                aes(intercept = b_Intercept, slope = b_days))

Varying regression lines & CIs

The best part is, brms’s fitted() also gives regression lines with CIs at the individual level.

newvary <- expand.grid(id = 1:16, days = 0:9)
fitvary <- cbind(newvary, fitted(brmfit, newdata = newvary)[,-2])
names(fitvary) <- c("id", "days", "reaction", "lower", "upper")
p2 + geom_line(data = fitvary, aes(y = reaction), size = 1) +
    geom_line(data = fitvary, aes(y = lower), lty = 2) +
    geom_line(data = fitvary, aes(y = upper), lty = 2)

Working with brms makes it very easy to obtain CIs for regression lines at both levels of analysis.

An alternative visualization

It might be useful, especially for model checking purposes, to display not only the fitted values, but also what the model predicts. To display the 95% prediction interval, I use the same procedure, but replace fitted() with predict():

newavg <- data.frame(days = 0:9)
predavg <- cbind(newavg, predict(brmfit, newdata = newavg, re_formula = NA)[,-2])
names(predavg) <- c("days", "reaction", "lower", "upper")
p3 + geom_ribbon(data = predavg, aes(ymin = lower, ymax = upper), 
                 col = NA, alpha = .2)


Working with a matrix of plausible parameter values–a posterior distribution–makes it easier to draw regression lines with confidence intervals. Specifically, the brms package provides easy access to CIs in a multilevel modeling context. For more complex models, one can calculate the fitted values using a matrix of predictors, and a matrix of plausible parameter values.

Further Reading

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1–48.

Belenky, G., Wesensten, N. J., Thorne, D. R., Thomas, M. L., Sing, H. C., Redmond, D. P., … Balkin, T. J. (2003). Patterns of performance degradation and restoration during sleep restriction and subsequent recovery: A sleep dose-response study. Journal of Sleep Research, 12(1), 1–12.

Bolger, N., & Laurenceau, J.-P. (2013). Intensive Longitudinal Methods: An Introduction to Diary and Experience Sampling Research. Guilford Press.

Buerkner, P.-C. (2016). brms: Bayesian Regression Models using Stan. Retrieved from

Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

Gelman, A., & Su, Y.-S. (2015). arm: Data Analysis Using Regression and Multilevel/Hierarchical Models. Retrieved from

Korner-Nievergelt, F., Roth, T., Felten, S. von, Guélat, J., Almasi, B., & Korner-Nievergelt, P. (2015). Bayesian Data Analysis in Ecology Using Linear Models with R, BUGS, and Stan: Including Comparisons to Frequentist Statistics. Academic Press.

Kruschke, J. K. (2014). Doing Bayesian Data Analysis: A Tutorial Introduction with R (2nd Edition). Burlington, MA: Academic Press.

Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer Science & Business Media.
