On bayesian data analysis and bayes factors

There’s a lot of buzz around bayesian data analysis (BDA) in psychology blogs, social media, and journal articles. For instance, in 2015 the APS Observer ran three columns dedicated to BDA in consecutive issues of the journal (Gallistel, 2015a, b, & c), and browsing the latest issues of Psychonomic Bulletin and Review gives an impression of increased interest in the topic.

Bayesian data analysis is more than bayes factors

However, it appears that there is an imbalance in what many beginning bayesian data analysts think about BDA. From casual observation and discussions, I’ve noticed a tendency for people to equate bayesian methods with computing bayes factors; that is, testing (usually null) hypotheses using bayesian model comparison.

I don’t have good data on people’s impressions of what BDA is, but here’s another anecdote. At a recent conference on Bayesian statistics Mark Andrews summarized his experiences teaching BDA workshops for social scientists. The talk was quite interesting, and what particularly picked my curiosity was his comment that for many–if not most–workshop participants, bayesian data analysis meant hypothesis testing with bayes factors (at 20 minutes in the linked video). (As an aside, he also noted that Stan has now superseded JAGS and BUGS as the preferred choice for a probabilistic modeling language. Go Stan!)

This imbalance or conflation of bayes factors and bayesian data analysis (if it is real and not merely bias in my observations!) is quite disappointing because BDA is a vast field of awesome methods, and bayes factors (BF) are only one thing that you can do with it1. In fact, many textbooks on BDA mention BFs only in footnotes (Gelman et al., 2013; McElreath, 2016). I’ve also written about BDA on my blog about a half-dozen times, but only once about bayes factors (Vuorre, 2016).

Further, it is really only this one method that people bicker over on social media: the bayes vs. frequentism argument usually turns into a p-values vs. bayes factors argument. The tedium of this argument (there really aren’t good reasons to prefer p-values) may even give out the impression that BDA is tedious and limited to model comparison and hypothesis testing problems. It isn’t! Bayes has benefits over frequentism that reach far beyond the p-vs-bf issue.

Here’s a practical example: It is well known that estimating generalized linear mixed models is kind of difficult. In fact, maximum likelihood methods routinely fail, especially when data are sparse of parameters are plenty (you’ve heard of multilevel models not converging, right? That’s the issue.). However, bayesian methods (via MCMC for example) usually have no problem estimating these models in situations when maximum likelihood fails! This benefit of bayes over frequentism (only the first thing to come to mind) doesn’t usually appear in the tedious p-vs-bf arguments, although one could argue that its practical implications are greater.

Reasons for thinking that BDA is BF

I suspect that one factor contributing to the apparent conflation of BDA and BFs is that there are vocal groups of psychological scientists doing interesting and important work promoting the use of Bayes Factors for hypothesis testing, and bayesian methods more generally (eg. Dienes, 2015; Ly, Verhagen, & Wagenmakers, 2015; Morey, Romeijn, & Rouder, 2016; Rouder, Morey, Verhagen, Province, & Wagenmakers, 2016). A quick reading of some of these texts may give the (false) impression that bayes factors are a larger portion of BDA than what they actually are. I think the same goes with the APS Observer columns mentioned above. To be completely clear, these papers are not about, nor do they say, that BDA is BF. I’m merely pointing out that in the psychological literature, (important!) papers about bayesian methods focusing on BF vastly overshadow in number papers that discuss other features of BDA.

But the real root cause for this conflation is probably the fetish-like desire of hypothesis tests in psychological science.

A modest proposal

If now is the time to move toward a new statistical paradigm in psychology, we could take the opportunity to emphasize not only bayesian hypothesis testing, but the importance of modeling, estimation and bayesian methods more generally. As Dr. Andrews noted, BDA could instead be thought of as flexible probabilistic modeling.


Dienes, Z. (2015). How Bayes factors change scientific practice. Journal of Mathematical Psychology. https://doi.org/10.1016/j.jmp.2015.10.003
Gallistel, C. R. (2015a). Bayes for Beginners 2: The Prior. APS Observer, 28(8). Retrieved from https://www.psychologicalscience.org/observer/bayes-for-beginners-2-the-prior
Gallistel, C. R. (2015b). Bayes for Beginners 3: The Prior in Probabilistic Inference. APS Observer, 28(9). Retrieved from https://www.psychologicalscience.org/observer/bayes-for-beginners-3-the-prior-in-probabilistic-inference
Gallistel, C. R. (2015c). Bayes for Beginners: Probability and Likelihood. APS Observer, 28(7). Retrieved from https://www.psychologicalscience.org/observer/bayes-for-beginners-probability-and-likelihood
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition. Boca Raton: Chapman and Hall/CRC.
Ly, A., Verhagen, J., & Wagenmakers, E.-J. (2015). Harold Jeffreys’s default Bayes factor hypothesis tests: Explanation, extension, and application in psychology. Journal of Mathematical Psychology. https://doi.org/10.1016/j.jmp.2015.06.004
McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
Morey, R. D., Romeijn, J.-W., & Rouder, J. N. (2016). The philosophy of Bayes factors and the quantification of statistical evidence. Journal of Mathematical Psychology. https://doi.org/10.1016/j.jmp.2015.11.001
Rouder, J. N., Morey, R. D., Verhagen, J., Province, J. M., & Wagenmakers, E.-J. (2016). Is There a Free Lunch in Inference? Topics in Cognitive Science, 8(3), 520–547. https://doi.org/10.1111/tops.12214
Vuorre, M. (2016, August 23). Statistical inference: Prix fixe or à la carte? Retrieved 4 May 2017, from https://mvuorre.github.io/post/2016/2016-08-23-free-lunch-in-inference/
Vuorre, M., & Bolger, N. (2017). Within-subject mediation analysis for experimental data in cognitive psychology and neuroscience. OSF Preprint. https://doi.org/http://dx.doi.org/10.17605/OSF.IO/6JHPF

  1. To be sure, I think a bayes factor can be a truly great tool when the competing models are well specified. I haven’t yet implemented a bayes factor method for my bayesian multilevel mediation package (Vuorre & Bolger, 2017), but might include one in the future. One difficulty here is specifying the competing models in a meaningful way.

Bayes Factors with brms

Here’s a short post on how to calculate Bayes Factors with the R package brms (Buerkner, 2016) using the Savage-Dickey density ratio method (Wagenmakers, Lodewyckx, Kuriyal, & Grasman, 2010).

To get up to speed with what the Savage-Dickey density ratio method is–or what Bayes Factors are–please read Wagenmakers et al. 2010. (The paper is available on the author’s webpage.) Here, I’ll only show the R & brms code to do the calculations that Wagenmakers et al. (2010) discuss. In their paper, they used WinBUGS, which requires quite a bit of code to sample from even a relatively simple model. brms on the other hand uses the familiar R formula syntax, making it easy to use. brms also does the MCMC sampling with Stan (Stan Development Team, 2016a & 2016b), or rather creates Stan code from a specified R model formula by what can only be described as string processing magic, making the sampling very fast. Let’s get straight to the examples in Wagenmakers et al. (2010)

# We'll work in tidyverse

Example 0

Wagenmakers and colleagues begin with a simple example of 10 true/false questions: We observe a person answering 9 (s) out of 10 (k) questions correctly.

d <- data.frame(s = 9, k = 10)
##   s  k
## 1 9 10

We are interested in the person’s latent ability to answer similar questions correctly. This ability is represented by \(\theta\) (theta), which for us will be the probability parameter (sometimes also called the rate parameter) in a binomial distribution. See Wagenmakers et al. (2010) for formulas. The maximum likelihood (point) estimate for \(\theta\) is n/k, the proportion .9.

The first thing we’ll need to specify with respect to our statistical model is the prior probability distribution for \(\theta\). As in Wagenmakers et al. 2010, we specify a uniform prior, representing no prior information about the person’s ability to aswer the questions. For the binomial probability parameter, \(Beta(\alpha = 1, \beta = 1)\) is a uniform prior.

pd <- tibble(
    x = seq(0, 1, by = .01),
    Prior = dbeta(x, 1, 1)

ggplot(pd, aes(x, Prior)) +
    geom_line() +
    coord_cartesian(xlim = 0:1, ylim = c(0, 6), expand = 0) +
    labs(y = "Density", x = bquote(theta))

The solid line represents the probability density assigned to values of \(\theta\) by this prior probability distribution. You can see that it is 1 for all possible parameter values: They are all equally likely a priori. For this simple illustration, we can easily calculate the posterior distribution by adding the number of correct and incorrect answers to the parameters of the prior Beta distribution.

pd$Posterior <- dbeta(pd$x, 10, 2)
pdw <- gather(pd, key=Type, value=density, Prior:Posterior)
ggplot(pdw, aes(x, density, col=Type)) +
    geom_line() +
    annotate("point", x=c(.5, .5), y = c(pdw$density[pdw$x==.5])) +
    annotate("label", x=c(.5, .5), 
             y = pdw$density[pdw$x==.5], 
             label = round(pdw$density[pdw$x==.5], 3),

The Savage-Dickey density ratio is calculated by dividing the posterior density by the prior density at a specific parameter value. Here, we are interested in .5, a “null hypothesis” value indicating that the person’s latent ability is .5, i.e. that they are simply guessing.

filter(pd, x == .5) %>% 
    mutate(BF01 = Posterior/Prior,
           BF10 = 1/BF01) %>% 
## # A tibble: 1 x 5
##       x Prior Posterior  BF01  BF10
##   <dbl> <dbl>     <dbl> <dbl> <dbl>
## 1   0.5     1     0.107 0.107 9.309

OK, so in this example we are able to get to the posterior with simply adding values into the parameters of the Beta distribution, but let’s now see how to get to this problem using brms.


First, here’s the brms formula of the model:

s | trials(k) ~ 0 + intercept, 
data = d

Read the first line as “s successes from k trials regressed on intercept”. That’s a little clunky, but bear with it. If you are familiar with R’s modeling syntax, you’ll be wondering why we didn’t simply specify ~ 1–R’s default notation for an intercept. The reason is that brms by default uses a little trick in parameterizing the intercept which speeds up the MCMC sampling. In order to specify a prior for the intercept, you’ll have to take the default intercept out (0 +), and use the reserved string intercept to say that you mean the regular intercept. See ?brmsformula for details. (For this model, with only one parameter, this complication doesn’t matter, but I wanted to introduce it early on so that you’d be aware of it when estimating multi-parameter models.)

The next line specifies that the data model is binomial, and that we want to model it’s parameter through an identity link. Usually when you model proportions or binary data, you’d use a logistic (logistic regression!), probit or other similar link function. In fact this is what we’ll do for later examples. Finally, we’ll use the data frame d.

OK, then we’ll want to specify our priors. Priors are extremo important for Bayes Factors–and probabilistic inference in general. To help set priors, we’ll first call get_priors() with the model information, which is basically like asking brms to tell what are the possible priors, and how to specify then, given this model.

get_prior(s | trials(k) ~ 0 + intercept, 
          data = d)
##   prior class      coef group nlpar bound
## 1           b                            
## 2           b intercept

The first line says that there is only one class of parameters b, think of class b as “betas” or “regression coefficients”. The second line says that the b class has only one parameter, the intercept. So we can set a prior for the intercept, and this prior can be any probability distribution in Stan language. We’ll create this prior using brms’ set_prior(), give it a text string representing the Beta(1, 1) prior for all parameters of class b (shortcut, could also specify that we want it for the intercept specifically), and then say the upper and lower bounds (\(\theta\) must be between 0 and 1).

Prior <- set_prior("beta(1, 1)", class = "b", lb = 0, ub = 1)

Almost there. Now we’ll actually sample from the model using brm(), give it the model, priors, data, ask it to sample from priors (for the density ratio!), and set a few extra MCMC parameters.

m <- brm(s | trials(k) ~ 0 + intercept, 
         family = binomial(link="identity"),
         prior = Prior,
         data = d,
         sample_prior = TRUE,
         iter = 1e4,
         cores = 4)

We can get the estimated parameter by asking the model summary:

##  Family: binomial(identity) 
## Formula: s | trials(k) ~ 0 + intercept 
##    Data: d (Number of observations: 1) 
## Samples: 4 chains, each with iter = 10000; warmup = 5000; thin = 1; 
##          total post-warmup samples = 20000
##    WAIC: Not computed
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## intercept     0.83       0.1     0.59     0.98      15654    1
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The Credible Interval matches exactly what’s reported in the paper. The point estimate differs slightly because here we see the posterior mean, whereas in the paper, Wagenmakers et al. report the posterior mode. I’ll draw a line at their posterior mode, below, to show that it matches.

samples <- posterior_samples(m, "b")
##   b_intercept   prior_b
## 1   0.9496061 0.4433162
## 2   0.6241450 0.6781995
## 3   0.9522157 0.1354565
## 4   0.7378991 0.4176384
## 5   0.7290172 0.9713203
## 6   0.7078843 0.9575486
gather(samples, Type, value) %>% 
    ggplot(aes(value, col=Type)) +
    geom_density() +
    labs(x = bquote(theta), y = "Density") +
    geom_vline(xintercept = .89)  # Vertical line at .89

We can already see the densities, so all that’s left is to obtain the exact values at the value of interest (.5) and take the \(\frac{posterior}{prior}\) ratio. Instead of doing any of this by hand, we’ll use brms’ function hypothesis() that allows us to test point hypotheses using the Dickey Savage density ratio. For this function we’ll need to specify the point of interest, .5, as the point hypothesis to be tested.

h <- hypothesis(m, "intercept = 0.5")
print(h, digits = 4)
## Hypothesis Tests for class b:
##                       Estimate Est.Error l-95% CI u-95% CI Evid.Ratio Star
## (intercept)-(0.5) = 0   0.3336    0.1032   0.0876   0.4761     0.1078    *
## ---
## '*': The expected value under the hypothesis lies outside the 95% CI.

The Evid.Ratio is our Bayes Factor BF01. Notice that it matches the value 0.107 pretty well! You can also plot this hypothesis:


However, I think the default plot isn’t fantastic (not a fan of the axis adjustments, title). Fortunately, plot(hypothesis) returns a ggplot2 object, which is easily customized.

p <- plot(h, plot = F, theme = theme_get())[[1]]
p + scale_x_continuous(breaks = seq(-.5, .5, by = .25),
                       labels = seq(-.5, .5, by = .25)+.5)

OK, so that was a lot of work for such a simple problem, but the real beauty of brms (and Stan) is the unparalleled scalability: We can easily solve a problem with one row of data and one parameter, and it won’t take much more to solve a problem with tens of thousands of rows of data, and hundreds of parameters. Let’s move on to the next example from Wagenmakers et al. (2010).

Example 1: Equality of Proportions

For context, please refer to the paper.

d <- data.frame(
    pledge = c("yes", "no"),
    s = c(424, 5416),
    n = c(777, 9072)
##   pledge    s    n
## 1    yes  424  777
## 2     no 5416 9072

They use Beta(1, 1) priors for both rate parameters, which we’ll do as well. Notice that usually a regression formula has an intercept and a coefficient (e.g. effect of group.) By taking the intercept out (0 +) we can define two pledger-group proportions instead, and set priors on these. If we used an intercept + effect formula, we could set a prior on the effect itself.

get_prior(s | trials(n) ~ 0 + pledge, 
          data = d)
##   prior class      coef group nlpar bound
## 1           b                            
## 2           b  pledgeno                  
## 3           b pledgeyes

We can set the Beta prior for both groups’ rate with one line of code by setting the prior on the b class without specifying the coef.

Prior <- set_prior("beta(1, 1)", class = "b", lb = 0, ub = 1)

Like above, let’s estimate.

m1 <- brm(s | trials(n) ~ 0 + pledge, 
         family = binomial(link="identity"),
         prior = Prior,
         sample_prior = TRUE,
         iter = 1e4,
         data = d,
         cores = 4)

Our estimates match the MLEs reported in the paper:

print(m1, digits=3)
##  Family: binomial(identity) 
## Formula: s | trials(n) ~ 0 + pledge 
##    Data: d (Number of observations: 2) 
## Samples: 4 chains, each with iter = 10000; warmup = 5000; thin = 1; 
##          total post-warmup samples = 20000
##    WAIC: Not computed
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## pledgeno     0.597     0.005    0.587    0.607      18588    1
## pledgeyes    0.546     0.018    0.511    0.580      19173    1
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

To get the density ratio Bayes Factor, we’ll need to specify a text string as our hypothesis. Our hypothesis is that the rate parameters \(\theta_1\) and \(\theta_2\) are not different: \(\theta_1\) = \(\theta_2\). The alternative, then, is the notion that the parameter values differ.

h1 <- hypothesis(m1, "pledgeyes = pledgeno")
print(h1, digits = 3)
## Hypothesis Tests for class b:
##                          Estimate Est.Error l-95% CI u-95% CI Evid.Ratio
## (pledgeyes)-(pled... = 0   -0.051     0.018   -0.088   -0.016      0.388
##                          Star
## (pledgeyes)-(pled... = 0    *
## ---
## '*': The expected value under the hypothesis lies outside the 95% CI.

As noted in the paper, a difference value of 0 is about twice as well supported before seeing the data, i.e. the null hypothesis of no difference is twice less likely after seeing the data:

1/h1$hypothesis$Evid.Ratio  # BF10
## [1] 2.577207

The paper reports BF01 = 0.47, so we’re getting the same results (as we should.) You can also compare this figure to what’s reported in the paper.

h1p1 <- plot(h1, theme = theme_get(), plot = F)[[1]] +
    theme(legend.position = "top")
h1p2 <- plot(h1, theme = theme_get(), plot = F)[[1]] + 
    coord_cartesian(xlim = c(-.05, .05), ylim = c(0, 5)) +
    theme(legend.position = "top")
gridExtra::grid.arrange(h1p1, h1p2, nrow = 1)

Moving right on to Example 2, skipping the section on “order restricted analysis”.

Example 2: Hierarchical Bayesian one-sample proportion test

The data for example 2 is not available, but we’ll simulate similar data. The simulation assumes that the neither-primed condition average correct probability is 50%, and that the both-primed condition benefit is 5%. Obviously, the numbers here won’t match anymore, but the data reported in the paper has an average difference in proportions of about 4%.

d <- tibble(
    id = c(rep(1:74, each = 2)),
    primed = rep(c("neither", "both"), times = 74),
    prime = rep(c(0, 1), times = 74),  # Dummy coded
    n = 21,
    correct = rbinom(74*2, 21, .5 + prime * .05)
group_by(d, primed) %>% summarize(p = sum(correct)/sum(n))
## # A tibble: 2 x 2
##    primed         p
##     <chr>     <dbl>
## 1    both 0.5418275
## 2 neither 0.4993565

This data yields a similar t-value as in the paper.

t.test(correct/n ~ primed, paired = T, data = d)
##  Paired t-test
## data:  correct/n by primed
## t = 2.3045, df = 73, p-value = 0.02404
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.005741069 0.079201016
## sample estimates:
## mean of the differences 
##              0.04247104

Instead of doing a probit regression, I’m going to do logistic regression. Therefore we define the prior on the log-odds scale. The log odds for the expected probability of .5 is 0. I prefer log-odds because–although complicated–they make sense, unlike standardized effect sizes. Note that the probit scale would also be fine as they are very similar.

Let’s just get a quick intuition about effects in log-odds: The change in log odds from p = .5 to .55 is about 0.2.

tibble(rate = seq(0, 1, by = .01),
    logit = logit(rate)) %>%
    ggplot(aes(rate, logit)) +
    geom_line(size=1) +
    geom_segment(x = 0, xend = 0.55, y = .2, yend = .2, size=.4) +
    geom_segment(x = 0, xend = 0.5, y = 0, yend = 0, size=.4) +
    coord_cartesian(ylim = c(-2, 2), expand=0)

We are cheating a little because we know these values, having simulated the data. However, log-odds are not straightforward (!), and this knowledge will allow us to specify better priors in this example. Let’s get the possible priors for this model by calling get_prior(). Notice that the model now includes id-varying “random” effects, and we model them from independent Gaussians by specifying || instead of | which would give a multivariate Gaussian on the varying effects.

get_prior(correct | trials(n) ~ 0 + intercept + prime + 
              (0 + intercept + prime || id), 
          data = d)
##                 prior class      coef group nlpar bound
## 1                         b                            
## 2                         b intercept                  
## 3                         b     prime                  
## 4 student_t(3, 0, 10)    sd                            
## 5                        sd              id            
## 6                        sd intercept    id            
## 7                        sd     prime    id

The leftmost column gives the pre-specified defaults used by brms. Here are the priors we’ll specify. The most important pertains to prime, which is going to be the effect size in log-odds. Our prior for the log odds of the prime effect is going to be a Gaussian distribution centered on 0, with a standard deviation of .2, which is rather diffuse.

Prior <- c(
    set_prior("normal(0, 10)", class = "b", coef = "intercept"),
    set_prior("cauchy(0, 10)", class = "sd"),
    set_prior("normal(0, .2)", class = "b", coef = "prime")

Then we estimate the model using the specified priors.

m2 <- brm(correct | trials(n) ~ 0 + intercept + prime + 
              (0 + intercept + prime || id), 
         family = binomial(link="logit"),
         prior = Prior,
         sample_prior = TRUE,
         iter = 1e4,
         data = d,
         cores = 4)

OK, so our results here will be different because we didn’t parameterize the prior on a standardized effect size because a) I don’t like standardized effect sizes, and b) I would have to play around with the Stan code, and this post is about brms. Two good reasons not to use standardized effect sizes! Anyway, here are the estimated parameters:

##  Family: binomial(logit) 
## Formula: correct | trials(n) ~ 0 + intercept + prime + (0 + intercept + prime || id) 
##    Data: d (Number of observations: 148) 
## Samples: 4 chains, each with iter = 10000; warmup = 5000; thin = 1; 
##          total post-warmup samples = 20000
##    WAIC: Not computed
## Group-Level Effects: 
## ~id (Number of levels: 74) 
##               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(intercept)     0.07      0.05     0.00     0.18       8862    1
## sd(prime)         0.13      0.08     0.01     0.30       6970    1
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## intercept     0.01      0.05    -0.09     0.10      20000    1
## prime         0.15      0.07     0.02     0.29      20000    1
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

And our null-hypothesis density ratio:

h2 <- hypothesis(m2, "prime = 0")
## Hypothesis Tests for class b:
##             Estimate Est.Error l-95% CI u-95% CI Evid.Ratio Star
## (prime) = 0     0.15      0.07     0.02     0.29       0.27    *
## ---
## '*': The expected value under the hypothesis lies outside the 95% CI.

Priming effect of zero log-odds is 4 times less likely after seeing the data:

## [1] 3.681923

This is best illustrated by plotting the densities:



Read the paper! Hopefully you’ll be able to use brms’ hypothesis() function to calculate bayes factors when needed.


Buerkner, P.-C. (2016). brms: Bayesian Regression Models using Stan. Retrieved from http://CRAN.R-project.org/package=brms
Stan Development Team. (2016a). RStan: the R interface to Stan. Retrieved from http://mc-stan.org/
Stan Development Team. (2016b). Stan: A C++ Library for Probability and Sampling, Version 2.14.1. Retrieved from http://mc-stan.org/
Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage–Dickey method. Cognitive Psychology, 60(3), 158–189. https://doi.org/10.1016/j.cogpsych.2009.12.001


brms produces exceptionally well written Stan code. Stan code of the last example model:

## // generated with brms 1.6.1
## functions { 
## } 
## data { 
##   int<lower=1> N;  // total number of observations 
##   int Y[N];  // response variable 
##   int<lower=1> K;  // number of population-level effects 
##   matrix[N, K] X;  // population-level design matrix 
##   // data for group-level effects of ID 1 
##   int<lower=1> J_1[N]; 
##   int<lower=1> N_1; 
##   int<lower=1> M_1; 
##   vector[N] Z_1_1; 
##   vector[N] Z_1_2; 
##   int trials[N];  // number of trials 
##   int prior_only;  // should the likelihood be ignored? 
## } 
## transformed data { 
## } 
## parameters { 
##   vector[K] b;  // population-level effects 
##   vector<lower=0>[M_1] sd_1;  // group-level standard deviations 
##   vector[N_1] z_1[M_1];  // unscaled group-level effects 
##   // parameters to store prior samples
##   real<lower=0> prior_sd_1;
## } 
## transformed parameters { 
##   // group-level effects 
##   vector[N_1] r_1_1; 
##   vector[N_1] r_1_2; 
##   r_1_1 = sd_1[1] * (z_1[1]); 
##   r_1_2 = sd_1[2] * (z_1[2]); 
## } 
## model { 
##   vector[N] mu; 
##   mu = X * b; 
##   for (n in 1:N) { 
##     mu[n] = mu[n] + (r_1_1[J_1[n]]) * Z_1_1[n] + (r_1_2[J_1[n]]) * Z_1_2[n]; 
##   } 
##   // prior specifications 
##   b[1] ~ normal(0, 10); 
##   b[2] ~ normal(0, .2); 
##   sd_1 ~ cauchy(0, 10); 
##   z_1[1] ~ normal(0, 1); 
##   z_1[2] ~ normal(0, 1); 
##   // likelihood contribution 
##   if (!prior_only) { 
##     Y ~ binomial_logit(trials, mu); 
##   } 
##   // additionally draw samples from priors
##   prior_sd_1 ~ cauchy(0,10);
## } 
## generated quantities { 
##   real prior_b_1; 
##   real prior_b_2; 
##   // additionally draw samples from priors 
##   prior_b_1 = normal_rng(0,10); 
##   prior_b_2 = normal_rng(0,.2); 
## }

Better forest plots from meta-analytic models estimated with brms

Hi all! After our previous discussion about how to estimate meta-analytic models with the brilliant brms R package, a few people asked me for the code to produce the forest plots. Here I’ll present a much better version of a function to produce forest plots from meta-analytic models estimated with brms. The function is implemented in ggplot2, and it is included in my vmisc package. To get the function, simply install vmisc from GitHub:


The function to draw forest plots from meta-analytic models estimated with brms is called brms_forest(), and you can learn more about it with ?brms_forest. The most important thing to know is that it requires that the model is estimated with brms with this exact formula:

yi | se(sei) ~ 1 + (1|study)

Where yi are the study-specific effect sizes (mean difference between groups, for example), sei are the study-specific standard errors, and study are unique integer labels for each study in the data.

Example data

We’ll illustrate the plot using example data from the metafor package. This data are “Results from 48 studies on the effectiveness of school-based writing-to-learn interventions on academic achievement.”

“In each of the studies included in this meta-analysis, an experimental group (i.e., a group of students that received instruction with increased emphasis on writing tasks) was compared against a control group (i.e., a group of students that received conventional instruction) with respect to some content-related measure of academic achievement (e.g., final grade, an exam/quiz/test score). The effect size measure for this meta-analysis was the standardized mean difference (with positive scores indicating a higher mean level of academic achievement in the intervention group).” (From the metafor help page ?dat.bangertdrowns2004.)

##   id   author year grade length minutes wic feedback info pers imag meta
## 1  1 Ashworth 1992     4     15      NA   1        1    1    1    0    1
## 2  2    Ayers 1993     2     10      NA   1       NA    1    1    1    0
## 3  3   Baisch 1990     2      2      NA   1        0    1    1    0    1
## 4  4    Baker 1994     4      9      10   1        1    1    0    0    0
## 5  5   Bauman 1992     1     14      10   1        1    1    1    0    1
## 6  6   Becker 1996     4      1      20   1        0    0    1    0    0
##         subject  ni     yi    vi
## 1       Nursing  60  0.650 0.070
## 2 Earth Science  34 -0.750 0.126
## 3          Math  95 -0.210 0.042
## 4       Algebra 209 -0.040 0.019
## 5          Math 182  0.230 0.022
## 6    Literature 462  0.030 0.009

We’ll only need a few of the columns, and with specific names, so in the following we’ll just select the relevant variables and rename id to study, and create labels for the plot by pasting together the studies author and year columns. I’ll also subset the data to the first 15 studies, because the original data has 48 studies and that would make the plot very large (which is fine, but it’s simpler to start small.)

d <- dat.bangertdrowns2004 %>%
    mutate(label = paste0(author, " (", year, ")"), sei = sqrt(vi)) %>%
    select(study = id, label, yi, sei) %>% 
##    study              label    yi        sei
## 1      1    Ashworth (1992)  0.65 0.26457513
## 2      2       Ayers (1993) -0.75 0.35496479
## 3      3      Baisch (1990) -0.21 0.20493902
## 4      4       Baker (1994) -0.04 0.13784049
## 5      5      Bauman (1992)  0.23 0.14832397
## 6      6      Becker (1996)  0.03 0.09486833
## 7      7 Bell & Bell (1985)  0.26 0.32557641
## 8      8     Brodney (1994)  0.06 0.08366600
## 9      9      Burton (1986)  0.06 0.20000000
## 10    10   Davis, BH (1990)  0.12 0.22803509
## 11    11   Davis, JJ (1996)  0.77 0.32710854
## 12    12         Day (1994)  0.00 0.14491377
## 13    13     Dipillo (1994)  0.52 0.19235384
## 14    14     Ganguli (1989)  0.54 0.28809721
## 15    15  Giovinazzo (1996)  0.20 0.29325757

Fit meta-analytic model with brms

Fitting the meta-analytic model is easy with brms! The formula specifies the study-specific effect size and standard error, an overall intercept (1), and the “random studies” ((1|study)). I’ll use four cores for speed and increase the adapt_delta parameter to avoid divergent transitions.

mod <- brm(yi | se(sei) ~ 1 + (1|study), data = d, 
           cores = 4, control= list(adapt_delta=.99) )

The model summary shows a 95% CI for the average effect hugging 0, and reasonable between-study heterogeneity (sd(Intercept)):

##  Family: gaussian(identity) 
## Formula: yi | se(sei) ~ 1 + (1 | study) 
##    Data: d (Number of observations: 15) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
## Group-Level Effects: 
## ~study (Number of levels: 15) 
##               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept)     0.17       0.1     0.01      0.4        850    1
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept     0.13      0.07    -0.01     0.29        999    1
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

Draw forest plots with brms_forest()

The user only needs to enter a data frame and a brms model:

brms_forest(data = d, model = mod)

The forest plot shows, on the left, the labels provided in d. On the very right are the effect sizes, [and limits of the Credible Intervals]. The CI limits are by default 95%, but users can control this by passing the argument level = .80, for 80% CIs, for example.

In the middle are the posterior distributions of the estimated effect sizes as grey densities. The black circle indicates the posterior mean, and the arms extending from the point are the CI defined by level (here, 95% CI). The bottom row, ME, is the meta-analytic estimate.


brms_forest() has several arguments that impact the resulting plot, see ?brms_forest for details. For example

    data = d, 
    model = mod, 
    show_data = TRUE,  # Shows data means and SEs
    sort_estimates = TRUE,  # Sorts estimates based on their magnitude
    dens_fill = "dodgerblue")  # Fill densities with blue

Here, the plot also shows the means (black crosses) and 2 SEs (thin black lines) from the data, slightly below each estimate. This plot nicely shows how the random effects model shrinks the estimates toward the group mean, especially for studies that had wide SEs to begin with.

Customizing the plot

brms_forest() returns a ggplot2 object, which is customizable by regular ggplot2 functions (themes, scales…) Here, we’ll add limits to the x axis manually (we have to specify y axis, because the axes are internally flipped…), add a label to replace the default “mean”, and use another custom theme:

myplot <- brms_forest(data = d, 
                      model = mod, 
                      sort_estimates = TRUE)
myplot + 
    scale_y_continuous("Standardized ES", limits = c(-1.2, 1.2)) +


The plot uses a one-sided violin plot which is not included in the default ggplot2 distribution. This code was written by David Robinson https://github.com/dgrtwo: https://gist.github.com/dgrtwo/eb7750e74997891d7c20, and Ben Marwick https://github.com/benmarwick: https://gist.github.com/benmarwick/2a1bb0133ff568cbe28d.


Look at how non-normal those posterior densities look like! It seems that in this case, the posterior mean is quite a misleading point estimate, and actually showing the posterior distribution is a very good idea.

Hope you’ll find this function useful. If you’d like it to be improved, let me know. Thanks and have a good day!

How to create within-subject scatter plots in R with ggplot2

Today, we’ll take a look at creating a specific type of visualization for data from a within-subjects experiment (also known as repeated measures, but that can sometimes be a misleading label). You’ll often see within-subject data visualized as bar graphs (condition means, and maybe mean difference if you’re lucky.) But alternatives exist, and today we’ll take a look at within-subjects scatterplots.

For example, Ganis and Kievit (2015) asked 54 people to observe, on each trial, two 3-D shapes with various rotations and judge whether the two shapes were the same or not.

There were 4 angles (0, 50, 100, and 150 degree rotations), but for simplicity, today we’ll only look at items that were not rotated with respect to each other, and items rotated 50 degrees. The data are freely available (thanks!!) in Excel format, but to make them more easily available for readers, I’ve uploaded them in a .csv file, which we can load directly from an R script.

d <- read_csv("https://mvuorre.github.io/data/ganis-kievit-2015.csv")

First, let’s clean the data a little bit by selecting and renaming a subset of the variables, and then take only the trials with 0 or 50 degrees rotation.

d <- transmute(
    id = Subject,
    angle = angle,
    correct = correct.incorrect,
    rt = Time
    ) %>%
    filter(angle < 51)
## # A tibble: 2,592 x 4
##       id angle correct    rt
##    <int> <int>   <int> <int>
##  1     1     0       1  1355
##  2     1    50       1  1685
##  3     1    50       1  1237
##  4     1     0       1  1275
##  5     1    50       1  2238
##  6     1     0       1  1524
##  7     1     0       1   964
##  8     1    50       1  1226
##  9     1    50       1  2548
## 10     1    50       1  1588
## # ... with 2,582 more rows

We’ll focus on comparing the reaction times between the 0 degree and 50 degree rotation trials. We predict that people will take longer to respond when the items are rotated, and that this effect will be robust across people.

Subject means

Bar graph

For the first graph, we’ll only need the subject’s means in each condition.

subject_means <- group_by(d, id, angle) %>%
    summarize(rt = mean(rt, na.rm = T))
## Source: local data frame [108 x 3]
## Groups: id [?]
## # A tibble: 108 x 3
##       id angle    rt
##    <int> <int> <dbl>
##  1     1     0  1512
##  2     1    50  2039
##  3     2     0  1251
##  4     2    50  1768
##  5     3     0  1602
##  6     3    50  1862
##  7     4     0  1501
##  8     4    50  2023
##  9     5     0  2170
## 10     5    50  2512
## # ... with 98 more rows

The first plot is a simple bar graph showing the condition means, and every subject as a point. Note that the mean is visualized as a bar, using the stat_summary(geom="bar") function.

barplot <- ggplot(subject_means, aes(x = angle, y = rt)) +
    geom = "bar",
    fun.y = "mean",
    col = "black",
    fill = "gray70"
    ) +
    geom_point(position = position_jitter(h = 0, w = 5)) +
    scale_y_continuous(limits = c(0, max(d$rt, na.rm = T)),
    expand = c(0, 0))

This figure shows quite clearly that the mean reaction time in the 50 degree angle condition was higher than in the 0 degree angle condition, and the spread across individuals in each condition. However, we often are specifically interested in the within-subject effect of condition, which would be difficult to visually display in this image. We could draw lines to connect each point, and the effect would then be visible as a “spaghetti plot”, but while useful, these plots may sometimes be a little overwhelming especially if there’s too many people (spaghetti is great but nobody likes too much of it!)

Within-subject scatterplot

To draw a within-subjects scatterplot, we’ll need a slight reorganization of the data, such that it is in wide format with respect to the conditions:

## Source: local data frame [108 x 3]
## Groups: id [?]
## # A tibble: 108 x 3
##       id angle    rt
##    <int> <int> <dbl>
##  1     1     0  1512
##  2     1    50  2039
##  3     2     0  1251
##  4     2    50  1768
##  5     3     0  1602
##  6     3    50  1862
##  7     4     0  1501
##  8     4    50  2023
##  9     5     0  2170
## 10     5    50  2512
## # ... with 98 more rows
subject_means_wide <-
           key = angle,
           value = rt,
           sep = "_")
## Source: local data frame [54 x 3]
## Groups: id [54]
## # A tibble: 54 x 3
##       id angle_0 angle_50
##  * <int>   <dbl>    <dbl>
##  1     1    1512     2039
##  2     2    1251     1768
##  3     3    1602     1862
##  4     4    1501     2023
##  5     5    2170     2512
##  6     6    1302     1382
##  7     7    2212     3014
##  8     8    1452     1824
##  9     9    2012     2501
## 10    10    1939     3058
## # ... with 44 more rows

Then we can simply map the per-subject angle-means to the X and Y axes:

ggplot(subject_means_wide, aes(x = angle_0, y = angle_50)) +

But this graph needs a couple of fixes to be maximally informative. We need to:

  • Make the aspect ratio 1
  • Force the axes to be identically scaled (note the use of min() and max() to show the plots on the scale of the data)
  • Add an identity (diagonal) line
  • Modify the axis labels
lims <- c(min(d$rt, na.rm = T), max(d$rt, na.rm = T))
wsplot <-
    ggplot(subject_means_wide, aes(x = angle_0, y = angle_50)) +
    geom_point() +
    geom_abline() +
    scale_x_continuous("0 degrees", limits = lims) +
    scale_y_continuous("50 degrees", limits = lims) +
    theme(aspect.ratio = 1)

Great! This plot shows each person (mean) as a point, and the difference between conditions can be directly seen by how far from the diagonal line the points are. Points above the diagonal indicate that the person’s (mean) RT was greater in the 50 degrees condition. All of the points lie below the identity line, indicating that the effect was as we predicted, and robust across individuals.

This is a very useful diagnostic plot that simultaneously shows the population- (or group-) level trend (are the points, on average, below or above the identity line?) and the expectation (mean) for every person (roughly, how far apart the points are from each other?). The points are naturally connected by their location, unlike in a bar graph where they would be connected by lines. Maybe you think it’s an informative graph; it’s certainly very easy to do in R with ggplot2. Also, I think it is visually very convincing, and doesn’t necessarily lead one to focus unjustly just on the group means: I am both convinced and informed by the graph.

Within-subject scatterplot with SEs

Well, we didn’t measure everybody repeatedly for nothing. We know more than their means; we can use the spread of the individual level scores to calculate, say, a SE for everybody and add it to the graph.

subject_summaries <- group_by(d, id, angle) %>%
    summarize(mean = mean(rt, na.rm = T),
              se = sd(rt, na.rm = T) / sqrt(n()))
## Source: local data frame [108 x 4]
## Groups: id [?]
## # A tibble: 108 x 4
##       id angle  mean    se
##    <int> <int> <dbl> <dbl>
##  1     1     0  1512   146
##  2     1    50  2039   134
##  3     2     0  1251   125
##  4     2    50  1768   211
##  5     3     0  1602   162
##  6     3    50  1862   109
##  7     4     0  1501   112
##  8     4    50  2023   172
##  9     5     0  2170   242
## 10     5    50  2512   307
## # ... with 98 more rows

Now we simply need to reformat the data to wide with respect to both the means and SEs. The trick here is to use spread() with different values for the sep() (separate) argument. Then, when the means and SEs are joined into wide format, we can select the columns containing either the means or SEs by referring to their unique names

means <- select(subject_summaries, -se) %>%
    spread(key=angle, value=mean, sep = "_")
## Source: local data frame [54 x 3]
## Groups: id [54]
## # A tibble: 54 x 3
##       id angle_0 angle_50
##  * <int>   <dbl>    <dbl>
##  1     1    1512     2039
##  2     2    1251     1768
##  3     3    1602     1862
##  4     4    1501     2023
##  5     5    2170     2512
##  6     6    1302     1382
##  7     7    2212     3014
##  8     8    1452     1824
##  9     9    2012     2501
## 10    10    1939     3058
## # ... with 44 more rows
ses <- select(subject_summaries, -mean) %>%
    spread(key=angle, value=se, sep = "SE")
## Source: local data frame [54 x 3]
## Groups: id [54]
## # A tibble: 54 x 3
##       id angleSE0 angleSE50
##  * <int>    <dbl>     <dbl>
##  1     1      146       134
##  2     2      125       211
##  3     3      162       109
##  4     4      112       172
##  5     5      242       307
##  6     6      100        73
##  7     7      223       240
##  8     8      110       197
##  9     9      130       203
## 10    10      233       276
## # ... with 44 more rows
sums <- left_join(means, ses)
## Source: local data frame [54 x 5]
## Groups: id [?]
## # A tibble: 54 x 5
##       id angle_0 angle_50 angleSE0 angleSE50
##    <int>   <dbl>    <dbl>    <dbl>     <dbl>
##  1     1    1512     2039      146       134
##  2     2    1251     1768      125       211
##  3     3    1602     1862      162       109
##  4     4    1501     2023      112       172
##  5     5    2170     2512      242       307
##  6     6    1302     1382      100        73
##  7     7    2212     3014      223       240
##  8     8    1452     1824      110       197
##  9     9    2012     2501      130       203
## 10    10    1939     3058      233       276
## # ... with 44 more rows

The code for the plot is actually quite straightforward once the tricky part of data formatting is done (this is really the philosophy behind ggplot2). Use errorbar() to draw the vertical SE bars, and errorbarh() to draw the horizontal SE bars.

ggplot(sums, aes(x=angle_0, y=angle_50)) +
    geom_point() +
    geom_errorbar(aes(ymin=angle_50-angleSE50, ymax=angle_50+angleSE50)) +
    geom_errorbarh(aes(xmin=angle_0-angleSE0, xmax=angle_0+angleSE0)) +
    geom_abline() +
    scale_x_continuous("0 degrees", limits = lims) +
    scale_y_continuous("50 degrees", limits = lims) +

Cool, huh? This graph shows the mean and +-1 SEM for everybody’s reaction time in the 0 degrees (x axis) and 50 degrees (y axis) conditions. This graph could be a great visual inspection of the data before fitting any complex models, and requires only some slight reorganizing of the data in R. Hope you’ll find it helpful in your own work!


Within-subject scatter plots are pretty common in some fields (psychophysics), but underutilized in many fiels where they might have a positive impact on statistical inference. Why not try them out on your own data, especially when they’re this easy to do with R and ggplot2?

Recall that for real applications, it’s better to transform or model reaction times with a skewed distribution. Here we used normal distributions just for convenience.

Finally, this post was made possible by the Ganis and Kievit (2015) who generously have shared their data online. Big thanks!

Have a great day!


Ganis, G., & Kievit, R. (2015). A New Set of Three-Dimensional Shapes for Investigating Mental Rotation Processes: Validation Data and Stimulus Set. Journal of Open Psychology Data, 3(1). https://doi.org/10.5334/jopd.ai

How to Compare Two Groups with Robust Bayesian Estimation Using R, Stan and brms

## Warning: Installed Rcpp (0.12.11) different from Rcpp used to build dplyr (0.12.10).
## Please reinstall dplyr to avoid random crashes or undefined behavior.

Happy New Year 2017 everybody! 2017 will be the year when social scientists finally decided to diversify their applied statistics toolbox, and stop relying 100% on null hypothesis significance testing (NHST). We now recognize that different scientific questions may require different statistical tools, and are ready to adopt new and innovative methods. A very appealing alternative to NHST is Bayesian statistics, which in itself contains many approaches to statistical inference. In this post, I provide an introductory and practical tutorial to Bayesian parameter estimation in the context of comparing two independent groups’ data.

More specifically, we’ll focus on the t-test. Everyone knows about it, everyone uses it. Yet, there are (arguably!) better methods for drawing inferences from two independent groups’ metric data (Kruschke, 2013; Morey & Rouder, 2015). Let’s talk about how “Bayesian estimation supersedes the t-test” (Kruschke, 2013).

Kruschke (2013, p.573) writes:

“When data are interpreted in terms of meaningful parameters in a mathematical description, such as the difference of mean parameters in two groups, it is Bayesian analysis that provides complete information about the credible parameter values. Bayesian analysis is also more intuitive than traditional methods of null hypothesis significance testing (e.g., Dienes, 2011).”

In that article (Bayesian estimation supersedes the t-test) Kruschke (2013) provided clear and well-reasoned arguments favoring Bayesian parameter estimation over null hypothesis significance testing in the context of comparing two groups, a situation which is usually dealt with a t-test. It also introduced a robust model for comparing two groups, which modeled the data as t-distributed, instead of a Gaussian distribution. The article provided R code for running the estimation procedures, which could be downloaded from the author’s website or as an R package (Kruschke & Meredith, 2015).

The R code and programs work well for this specific application (estimating the robust model for one or two groups’ metric data). However, modifying the code to handle more complicated situations is not easy, and the underlying estimation algorithms don’t necessarily scale up to handle more complicated situations. Therefore, today I’ll introduce easy to use, free, open-source, state-of-the-art computer programs for Bayesian estimation, in the context of comparing two groups’ metric (continuous) data. The programs are available for the R programming language–so make sure you are familiar with R basics (e.g. Vuorre, 2016). I provide R code (it’s super easy, don’t worry!) for t-tests and Bayesian estimation in R using the R package brms (Buerkner, 2016), which uses the powerful Stan MCMC program (Stan Development Team, 2016) under the hood.

These programs supersede older Bayesian inference programs because they are easy to use (brms is an interface to Stan, which is actually a programming language in itself), fast, and are able to handle models with thousands of parameters. Learning to implement basic analyses such as t-tests, and Kruschke’s robust model, with these programs is very useful because (obviously) you’ll then be able to do Bayesian statistics in practice, and will be prepared to understand and implement more complex models.

Understanding the results of Bayesian estimation requires understanding some basics of Bayesian statistics, which I won’t describe here at all. If you are not familiar with Bayesian statistics, please read Kruschke’s excellent article (his book is also very good, Kruschke, 2014; see also McElreath, 2016). In fact, you should read the paper anyway, it’s very good.

First, I’ll introduce the basics of t-tests in some detail, and then focus on understanding them as specific instantiations of linear models. If that sounds familiar, skip ahead to Bayesian Estimation of the t-test, where I introduce the brms package for estimating models using Bayesian methods. Following that, we’ll use “distributional regression” to obtain Bayesian estimates of the unequal variances t-test model. Finally, we’ll learn how to estimate Kruschke’s (2013) BEST model using brms.

The t in a t-test

We’ll begin with t-tests, using example data from Kruschke’s paper (p. 577):

“Consider data from two groups of people who take an IQ test. Group 1 (N1=47) consumes a “smart drug,” and Group 2 (N2=42) is a control group that consumes a placebo.”

I’ve decided to call the control group “Group 0”, and the treatment group “Group 1”, because this coding makes it natural to think of the control group as a “reference group”, and any “effect” we’ll estimate will be associated with the treatment group. These data are visualized as histograms, below:

Histograms of the two groups' IQ scores.

Figure 1: Histograms of the two groups’ IQ scores.

Equal variances t-test

These two groups’ IQ scores could be compared with a simple equal variances t-test (which you shouldn’t use; Lakens, 2015), also known as Student’s t-test. I have the two groups’ IQ scores in R as two vectors called group_0 and group_1, so doing a t-test is as easy as

t.test(group_0, group_1, var.equal=T)
##  Two Sample t-test
## data:  group_0 and group_1
## t = -1.5587, df = 87, p-value = 0.1227
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.544155  0.428653
## sample estimates:
## mean of x mean of y 
##  100.3571  101.9149

We interpret the t-test in terms of the observed t-value, and whether it exceeds the critical t-value. The critical t-value, in turn, is defined as the extreme \(\alpha / 2\) percentiles of a t-distribution with the given degrees of freedom. The current situation is illustrated below:

t distribution with 87 degrees of freedom, and observed t-value. The dashed vertical lines indicate the extreme 2.5 percentiles. We would reject the null hypothesis of no difference if the observed t-value exceeded these percentiles.

Figure 2: t distribution with 87 degrees of freedom, and observed t-value. The dashed vertical lines indicate the extreme 2.5 percentiles. We would reject the null hypothesis of no difference if the observed t-value exceeded these percentiles.

The test results in an observed t-value of 1.56, which is not far enough in the tails of a t-distribution with 87 degrees of freedom to warrant rejecting the null hypothesis (given that we are using \(\alpha\) = .05, which may or may not be an entirely brilliant idea (e.g. Rouder, Morey, Verhagen, Province, & Wagenmakers, 2016)). Note that R also reports a 95% CI for the estimated difference between the two groups.

Unequal variances t-test

Next, we’ll run the more appropriate, unequal variances t-test (also known as Welch’s t-test), which R gives by default:

t.test(group_0, group_1)
##  Welch Two Sample t-test
## data:  group_0 and group_1
## t = -1.6222, df = 63.039, p-value = 0.1098
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.4766863  0.3611848
## sample estimates:
## mean of x mean of y 
##  100.3571  101.9149

Note that while R gives Welch’s t-test by default, SPSS gives both. If you’re using SPSS, make sure to report the Welch’s test results, instead of the equal variances test. Here, the conclusion with respect to rejecting the null hypothesis of equal means is the same. However, notice that the results are numerically different, as they should, because these two t-tests refer to different models!

As a side note, I recently learned that this problem (estimating and testing the difference between two means when the variances are not assumed equal) is unsolved: only approximate solutions are known.

It is of course up to you, as a researcher, to decide whether you assume equal variances or not. But note that we almost always allow the means to be different (that’s the whole point of the test, really), while many treatments may just as well have an effect on the standard deviations.

The first take-home message from today is that there are actually two t-tests, each associated with a different statistical model. And to make clear what the difference is, we must acquaint ourselves with the models.

Describing the model(s) underlying the t-test(s)

We don’t usually think of t-tests (and ANOVAs) as models, but it turns out that they are just linear models disguised as tests (see here and here). Recently, there has been a tremendous push for model/parameter estimation, instead of null hypothesis significance testing (e.g. Cumming, 2014; Kruschke, 2014; see also the brilliant commentary by Gigerenzer, 2004), so we will benefit from thinking about t-tests as linear models. Doing so will facilitate “[interpreting data] in terms of meaningful parameters in a mathematical description” (Kruschke, 2013), and seamlessly expanding our models to handle more complicated situations.

The equal variances t-test models metric data with three parameters: Mean for group A, mean for group B, and one shared standard deviation (i.e. the assumption that the standard deviations [we usually refer to variances, but whatever] are equal between the two groups.)

We call the metric data (IQ scores in our example) \(y_{ik}\), where \(i\) is a subscript indicating the \(i^{th}\) datum, and \(k\) indicates the \(k^{th}\) group. So \(y_{19, 1}\) would be the 19th datum, belonging to group 1. Then we specify that \(y_{ik}\) are Normally distributed, \(N(\mu_{ik}, \sigma)\), where \(\mu_{ik}\) indicates the mean of group \(k\), and \(\sigma\) the common standard deviation.

\[y_{ik} \sim N(\mu_{ik}, \sigma)\]

Read the formula as “Y is normally distributed with mean \(\mu_{ik}\) (mu), and standard deviation \(\sigma\) (sigma)”. Note that the standard deviation \(\sigma\) doesn’t have any subscripts: we assume it is the same for the \(k\) groups. Note also that you’ll often see the second parameter in the parentheses as \(\sigma^2\), referring to the variance.

The means for groups 0 and 1 are simply \(\mu_0\) and \(\mu_1\), respectively, and their difference (let’s call it \(d\)) is \(d = \mu_0 - \mu_1\). The 95% CI for \(d\) is given in the t-test output, and we can tell that it differs from the one given by Welch’s t-test.

It is unsurprising, then, that if we use a different model (the more appropriate unequal variances model; Lakens, 2015), our inferences may be different. Welch’s t-test is the same as Student’s, except that now we assume (and subsequently estimate) a unique standard deviation \(\sigma_{ik}\) for both groups.

\[y_{ik} \sim N(\mu_{ik}, \sigma_{ik})\]

This model makes a lot of sense, because rarely are we in a situation to a priori decide that the variance of scores in Group A is equal to the variance of scores in Group B. If you use the equal variances t-test, you should be prepared to justify and defend this assumption. (Deciding between models–such as between these two t-tests–is one way in which our prior information enters and influences data analysis. This fact should make you less suspicious about priors in Bayesian analyses.)

Armed with this knowledge, we can now see that “conducting a t-test” can be understood as estimating one of these two models. By estimating the model, we obtain t-values, degrees of freedom, and consequently, p-values.

However, to focus on modeling and estimation, it is easier to think of the t-test as a specific type of the general linear model, (aka linear regression). We can re-write the t-test in an equivalent way, but instead have a specific parameter for the difference in means by writing it as a linear model. (For simplicity, I’ll only write the equal variances model):

\[y_{ik} \sim N(\mu_{ik}, \sigma)\] \[\mu_{ik} = \beta_0 + \beta_1 Group_{ik}\]

Here, \(\sigma\) is just as before, but we now model the mean with an intercept (Group 0’s mean, \(\beta_0\)) and the effect of Group 1 (\(\beta_1\)). To understand whats going on, let’s look at the data, Group is an indicator variable in the data, for each row of Group 0’s data Group is zero, and for each row of Group 1’s data Group is one.

##     Group  IQ
## 1       0  99
## 2       0 101
## 3       0 100
## 4       0 101
## ...   ... ...
## 86      1 101
## 87      1 104
## 88      1 100
## 89      1 101

With this model, \(\beta_1\) directly tells us the estimated difference in the two groups. And because it is a parameter in the model, it has an associated standard error, t-value, degrees of freedom, and a p-value. This linear model and can be estimated in R with the following line of code:

olsmod <- lm(IQ ~ Group, data = d)

The key input here is a model formula, which in R is specified as outcome ~ predictor (DV ~ IV). Using the lm() function, we estimated a linear model predicting IQ from an intercept (automatically included) and a Group parameter Group, which is the effect of group 1. I called this object olsmod for Ordinary Least Squares Model.

R has it’s own model formula syntax, which is well worth learning. The formula in the previous model, IQ ~ Group means that we want to regress IQ on an intercept (which is implicitly included), and group (Group). Besides the formula, we only need to provide the data, which is contained in the object I’ve conveniently called d.

You can verify that the results are identical to the equal variances t-test above.

## Call:
## lm(formula = IQ ~ Group, data = d)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.9149  -0.9149   0.0851   1.0851  22.0851 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 100.3571     0.7263 138.184   <2e-16 ***
## Group         1.5578     0.9994   1.559    0.123    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 4.707 on 87 degrees of freedom
## Multiple R-squared:  0.02717,    Adjusted R-squared:  0.01599 
## F-statistic:  2.43 on 1 and 87 DF,  p-value: 0.1227

Focus on the Group row in the estimated coefficients. Estimate is the point estimate (best guess) of the difference in means (\(d = 101.9149 - 100.3571 = 1.5578\)). t value is the observed t-value (identical to what t.test() reported), and the p-value (Pr(>|t|)) matches as well. The (Intercept) row refers to \(\beta_0\), which is group 0’s mean.

This way of thinking about the model, where we have parameters for one group’s mean, and the effect of the other group, facilitates focusing on the important parameter, the difference, instead of individual means. However, you can of course compute the difference from the means, or the means from one mean and a difference.

Bayesian estimation of the t-test

Equal variances model

Next, I’ll illustrate how to estimate the equal variances t-test using Bayesian methods. We use brms (Buerkner, 2016), and the familiar R formula syntax which we used with the OLS model.

Estimating this model with R, thanks to the Stan and brms teams (Stan Development Team, 2016; Buerkner, 2016), is as easy as the linear regression model we ran above. If you haven’t yet installed brms, you need to install it first by running install.packages("brms"). Then, to access its functions, load the brms package to the current R session.


The most important function in the brms package is brm(), for Bayesian Regression Model(ing). The user needs only to input a model formula, just as above, and a data frame that contains the variables specified in the formula. brm() then translates the model into Stan language, and asks Stan to compile the model into C++ and estimate it (see Kruschke, 2014; McElreath, 2016 for details about estimation). The result is an R object with the estimated results (and much more). We run the model and save the results to mod_eqvar for equal variances model:

mod_eqvar <- brm(IQ ~ Group, data = d)
##  Family: gaussian (identity) 
## Formula: IQ ~ Group 
##    Data: d (Number of observations: 89) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept   100.34      0.74    98.91   101.83       3803    1
## Group         1.59      1.01    -0.41     3.55       3958    1
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma     4.77      0.36     4.13     5.55       2653    1
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

Notice that the model contains three parameters, one of which is the shared standard deviation sigma. Compare the output of the Bayesian model to the one estimated with lm() (OLS):

Method term estimate std.error
OLS (Intercept) 100.357 0.726
OLS Group 1.558 0.999
Bayes b_Intercept 100.339 0.743
Bayes b_Group 1.588 1.009

The point estimates (posterior means in the Bayesian model) and standard errors (SD of the respective posterior distribution) are pretty much identical.

We now know the models behind t-tests, and how to estimate the equal variances t-test using the t.test(), lm(), and brm() functions. We also know how to run Welch’s t-test using t.test(). However, estimating the general linear model version of the unequal variances t-test model is slightly more complicated, because it involves specifying predictors for \(\sigma\), the standard deviation parameter.

Unequal variances model

We only need a small adjustment to the equal variances model to specify the unequal variances model:

\[y_{ik} \sim N(\mu_{ik}, \sigma_{ik})\] \[\mu_{ik} = \beta_0 + \beta_1 Group_{ik}\]

Notice that we now have subscripts for \(\sigma\), denoting that it varies between groups. In fact, we’ll write out a linear model for the standard deviation parameter!

\[\sigma_{ik} = \gamma_0 + \gamma_1 Group_{ik}\]

The model now includes, instead of a common \(\sigma\), one parameter for Group 0’s standard deviation \(\gamma_0\) (gamma), and one for the effect of Group 1 on the standard deviation \(\gamma_1\), such that group 1’s standard deviation is \(\gamma_0 + \gamma_1\). Therefore, we have 4 free parameters, two means and two standard deviations. (The full specification would include prior distributions for all the parameters, but that topic is outside of the scope of this post.) Let’s estimate!

brm() takes more complicated models by wrapping them inside bf() (short for brmsformula()), which is subsequently entered as the first argument to brm().

uneq_var_frm <- bf(IQ ~ Group, sigma ~ Group)

You can see that the formula regresses IQ on Group, such that we’ll have an intercept (implicitly included), and an effect of Group 1. Remarkably, we are also able to model the standard deviation sigma, and we regress it on Group (it will also have an intercept and effect of group).

mod_uneqvar <- brm(uneq_var_frm, data = d, cores=4)
##  Family: gaussian (identity) 
## Formula: IQ ~ Group 
##          sigma ~ Group
##    Data: d (Number of observations: 89) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept         100.35      0.40    99.55   101.12       4000    1
## sigma_Intercept     0.94      0.11     0.72     1.17       3138    1
## Group               1.54      0.99    -0.38     3.47       2471    1
## sigma_Group         0.87      0.15     0.57     1.17       3222    1
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The model’s output contains our 4 parameters. Intercept is the mean for group 0, Group 1 is the “effect of group 1”. The sigma_Intercept is the standard deviation of Group 0, sigma_Group is the effect of group 1 on the standard deviation (the SD of Group 1 is sigma_Intercept + sigma_Group). The sigmas are implicitly modeled through a log-link (because they must be positive). To convert them back to the scale of the data, they need to be exponentiated. After taking the exponents of the sigmas, the results look like this:

parameter Estimate Est.Error l-95% CI u-95% CI
b_sigma_Group 2.41 0.37 1.76 3.23
b_Intercept 100.35 0.40 99.55 101.12
b_sigma_Intercept 1.75 0.85 0.76 3.09

For comparison, here is the “observed SD” of group 0:

## [1] 2.52

Keep in mind that the parameters refer to Group 0’s mean (Intercept) and SD (sigma), and the difference between groups in those values (Group) and (sigma_Group). We now have fully Bayesian estimates of the 4 parameters of the unequal variances t-test model. Because p-values have no place in Bayesian inference, they are not reported in the output. However, you can calculate a quantity that is equivalent to a one-sided p-value from the posterior distribution: Take the proportion of posterior density (MCMC samples) above/below a reference value (0). This is definitely not the most useful thing you can do with a posterior distribution, but the fact that it numerically matches a one-sided p-value is quite interesting:

# Posterior distribution of Group effect
x <- as.data.frame(mod_uneqvar, pars = "b_Group")[,1]
# Proportion of MCMC samples below zero
round((sum(x < 0) / length(x)), 3)
## [1] 0.058
# One sided p-value from t-test
round(t.test(group_0, group_1, data = d, alternative = "less")$p.value, 3)
## [1] 0.055

I’m showing this remarkable fact (Marsman & Wagenmakers, no date) not to persuade you to stick with p-values, but to alleviate fears that these methods would always produce discrepant results.

Although this model is super easy to estimate with brm() (which, I should emphasize, uses Stan for the estimation procedures), the model seems, frankly speaking, strange. I am just not used to modeling variances, and I’ll bet a quarter that neither are you. Nevertheless, there it is!

Finally, let’s move on to Kruschke’s (2013) “Robust Bayesian Estimation” model.

Robust Bayesian Estimation

Kruschke’s robust model is a comparison of two groups, using five parameters: One mean for each group, one standard deviation for each group, just as in the unequal variances model above. The fifth parameter is a “normality” parameter, \(\nu\) (nu), which means that we are now using a t-distribution to model the data. Using a t-distribution to model the data, instead of a Gaussian, means that the model (and therefore our inferences) are less sensitive to extreme values (outliers). Here’s what the model looks like:

\[y_{ik} \sim T(\nu, \mu_{ik}, \sigma_{ik})\]

Read the above formula as “Y are random draws from a t-distribution with ‘normality’ parameter \(\nu\), mean \(\mu_{ik}\), and standard deviation \(\sigma_{ik}\)”. We have a linear model for the means and standard deviations:

\[\mu_{ik} = \beta_0 + \beta_1 Group_{ik}\]


\[\sigma_{ik} = \gamma_0 + \gamma_1 Group_{ik}\]

This model, as you can see, is almost identical to the unequal variances t-test, but instead uses a t distribution (we assume data are t-distributed), and includes the normality parameter. Using brm() we can still use the unequal variances model, but have to specify the t-distribution. We do this by specifying the family argument to be student (as in Student’s t)

mod_robust <- brm(bf(IQ ~ Group, sigma ~ Group),
                  data = d, cores=4)
##  Family: student (identity) 
## Formula: IQ ~ Group 
##          sigma ~ Group
##    Data: d (Number of observations: 89) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Eff.Sample  Rhat
## Intercept        100.524     0.208  100.125  100.940       4000 0.999
## sigma_Intercept    0.001     0.196   -0.389    0.395       4000 1.001
## Group              1.028     0.417    0.206    1.833       3100 1.000
## sigma_Group        0.671     0.258    0.163    1.165       4000 1.000
## Family Specific Parameters: 
##    Estimate Est.Error l-95% CI u-95% CI Eff.Sample  Rhat
## nu    1.853     0.482    1.135     2.99       3179 1.001
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The effect of Group is about one unit, with a 95% Credible Interval from 0.2 to 1.8.

Finally, let’s compare the results to those in Kruschke’s paper (2013, p.578). Before we do this, I’ll convert the estimated parameters to means and standard deviations (instead of the “regression effects” produced by default.) Recall that I recoded the group labels used by Kruschke in the paper, what he calls group 2 is group 0 (control group) in our analyses, but group 1 is still group 1. In the following I transform the results and compute HDIs to obtain results most compatible with Kruschke:

key m HDIlwr HDIupr
group0_mean 100.52 100.12 100.94
group1_mean 101.55 100.80 102.24
diff_means 1.03 0.19 1.80
group0_sigma 1.02 0.65 1.41
group1_sigma 2.00 1.22 2.80
diff_sigmas 2.02 1.13 3.10
nu 1.85 1.04 2.76
nulog10 0.25 0.04 0.46

Notice that Kruschke reports modes (2013, p. 578), but our point estimates are means. The results with respect to the group means are identical to two decimal points; the standard deviations are slightly more discrepant, because the paper reports modes, but we focus on posterior means.

Finally, here is how to estimate the model using the original code (Kruschke & Meredith, 2015):

BEST <- BESTmcmc(group_0, group_1)
## Processing function input....... 
## Done. 
## Beginning parallel processing using 3 cores. Console output will be suppressed.
## Parallel processing completed.
## MCMC took 0.283 minutes.
## MCMC fit results for BEST analysis:
## 100002 simulations saved.
##           mean     sd  median    HDIlo   HDIup Rhat n.eff
## mu1    100.525 0.2149 100.524 100.1108 100.955    1 57740
## mu2    101.548 0.3800 101.551 100.7853 102.289    1 59733
## nu       1.840 0.4775   1.764   1.0380   2.753    1 23934
## sigma1   1.052 0.2069   1.032   0.6737   1.464    1 35619
## sigma2   2.063 0.4359   2.021   1.2598   2.927    1 29495
## 'HDIlo' and 'HDIup' are the limits of a 95% HDI credible interval.
## 'Rhat' is the potential scale reduction factor (at convergence, Rhat=1).
## 'n.eff' is a crude measure of effective sample size.

This output reports posterior means and HDI limits, which we report above. You can verify that they match very closely to each other. This BESTmcmc() function is great, but with brms you are able to estimate a vast variety of models.


Well, that ended up much longer than what I intended. The aim was both to illustrate the ease of Bayesian modeling in R using brms (Buerkner, 2016) and Stan (Stan Development Team, 2016), and highlight the fact that we can easily move from simple t-tests to more complex (and possibly better) models.

If you’ve followed through, you should be able to conduct Student’s (equal variances) and Welch’s (unequal variances) t-tests in R, and to think about those tests as instantiations of general linear models. Further, you should be able to estimate these models using Bayesian methods.

You should now also be familiar with Kruschke’s robust model for comparing two groups’ metric data, and be able to implement it in one line of R code. This model was able to find credible differences between two groups, although the frequentist t-tests and models reported p-values well above .05. That should be motivation enough to try robust (Bayesian) models on your own data.

Further reading

I didn’t take any space here to discuss the interpretation of Bayesian statistics. For this, I recommend Kruschke (2014), McElreath (2016). See also Etz, Gronau, Dablander, Edelsbrunner, & Baribault (2016) for an introduction to Bayesian statistics.


Buerkner, P.-C. (2016). brms: Bayesian Regression Models using Stan. Retrieved from http://CRAN.R-project.org/package=brms
Cumming, G. (2014). The New Statistics Why and How. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966
Dienes, Z. (2011). Bayesian Versus Orthodox Statistics: Which Side Are You On? Perspectives on Psychological Science, 6(3), 274–290. https://doi.org/10.1177/1745691611406920
Etz, A., Gronau, Q. F., Dablander, F., Edelsbrunner, P. A., & Baribault, B. (2016). How to become a Bayesian in eight easy steps: An annotated reading list. ResearchGate. Retrieved from https://www.researchgate.net/publication/301981861_How_to_become_a_Bayesian_in_eight_easy_steps_An_annotated_reading_list
Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603. https://doi.org/10.1037/a0029146
Kruschke, J. K. (2014). Doing Bayesian Data Analysis: A Tutorial Introduction with R (2nd Edition). Burlington, MA: Academic Press.
Lakens, D. (2015, January 26). The 20% Statistician: Always use Welch’s t-test instead of Student’s t-test. Retrieved from https://daniellakens.blogspot.com/2015/01/always-use-welchs-t-test-instead-of.html
Marsman, M., & Wagenmakers, E.-J. (no date). Three Insights from a Bayesian Interpretation of the One-Sided P Value. Retrieved from http://www.ejwagenmakers.com/inpress/MarsmanWagenmakersOneSidedPValue.pdf
McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
Morey, R. D., & Rouder, J. (2015). BayesFactor: Computation of Bayes Factors for Common Designs. Retrieved from https://CRAN.R-project.org/package=BayesFactor
Rouder, J. N., Morey, R. D., Verhagen, J., Province, J. M., & Wagenmakers, E.-J. (2016). Is There a Free Lunch in Inference? Topics in Cognitive Science, 8(3), 520–547. https://doi.org/10.1111/tops.12214
Stan Development Team. (2016). Stan: A C++ Library for Probability and Sampling, Version 2.14.1. Retrieved from http://mc-stan.org/
Vuorre, M. (2016, December 5). Introduction to Data Analysis using R. Retrieved from http://blog.efpsa.org/2016/12/05/introduction-to-data-analysis-using-r/

How to arrange ggplot2 panel plots

Panel plots are a common name for figures showing every person’s (or whatever your sampling unit is) data in their own little panel. This plot is sometimes also known as “small multiples”, although that more commonly refers to plots that illustrate interactions. Here, I’ll illustrate how to add information to a panel plot by arranging the panels according to some meaningful value.

Here’s an example of a panel plot, using the sleepstudy data set from the lme4 package. Notice that the subject-specific panels are created with facet_wrap(), as explained in an earlier blog post.

data(sleepstudy, package = "lme4")
##   Reaction Days Subject
## 1 249.5600    0     308
## 2 258.7047    1     308
## 3 250.8006    2     308
## 4 321.4398    3     308
## 5 356.8519    4     308
## 6 414.6901    5     308
ggplot(sleepstudy, aes(x=Days, y=Reaction)) +
    geom_point() +
    scale_x_continuous(breaks=0:9) +
    facet_wrap("Subject", labeller = label_both)

On the x-axis is days of sleep deprivation, and y-axis is an aggregate measure of reaction time across a number of cognitive tasks. Reaction time increases as a function of sleep deprivation. But the order of the panels is entirely uninformative, they are simply arranged in increasing order of subject ID number, from top left to bottom right. Subject ID numbers are rarely informative, and we would therefore like to order the panels according to some other fact about the individual participants.

Order panels on mean value

Let’s start by ordering the panels on the participants’ mean reaction time, with the “fastest” participant in the upper-left panel.

Step 1 is to add the required information to the data frame used in plotting. For a simple mean, we can actually use a shortcut in step 2, so this isn’t required.

Step 2: Convert the variable used to separate the panels into a factor, and order it based on the mean reaction time.

The key here is to use the reorder() function. You’ll first enter the variable that contains the groupings (i.e. the subject ID numbers), and then values that will be used to order the grouping variables. Finally, here you can use a shortcut to base the ordering on a function of the values, such as the mean, by entering it as the third argument.

sleepstudy <- mutate(sleepstudy,
                     Subject = reorder(Subject, Reaction, mean))

Now if we use Subject to create the subplots, they will be ordered on the mean reaction time. I’ll make the illustration clear by also drawing the person-means with small arrows.

ggplot(sleepstudy, aes(x=Days, y=Reaction)) +
    geom_point() +
    stat_summary(fun.y=mean, geom="segment",
                 aes(yend=..y.., x=0, xend=3),
                 arrow = arrow(ends = "first", length = unit(.1, "npc"))) +
    scale_x_continuous(breaks=0:9, expand = c(0, 0)) +
    facet_wrap("Subject", labeller = label_both)

Ordering panels on other parameters

It might also be useful to order the panels based on a value from a model, such as the slope of a linear regression. This is especially useful in making the heterogeneity in the sample easier to see. For this, you’ll need to fit a model, grab the subject-specific slopes, order the paneling factor, and plot. I’ll illustrate with a multilevel regression using lme4.

# Step 1: Add values to order on into the data frame
mod <- lmer(Reaction ~ Days + (Days|Subject), data = sleepstudy)
# Create a data frame with subject IDs and coefficients
coefs <- coef(mod)$Subject %>%
names(coefs) <- c("Subject", "Intercept", "Slope")
# Join to main data frame by Subject ID
sleepstudy <- left_join(sleepstudy, coefs, by="Subject")
# Step 2: Reorder the grouping factor
sleepstudy <- mutate(sleepstudy,
                     Subject = reorder(Subject, Slope))

Then, I’ll plot the data also showing the fitted lines from the multilevel model:

ggplot(sleepstudy, aes(x=Days, y=Reaction)) +
    geom_point() +
    geom_abline(aes(intercept = Intercept, slope = Slope)) +
    scale_x_continuous(breaks=0:9) +
    facet_wrap("Subject", labeller = label_both)

Hopefully you’ll find this helpful. Have a great day!

All Posts by Category or Tags.