On bayesian data analysis and bayes factors

There’s a lot of buzz around bayesian data analysis (BDA) in psychology blogs, social media, and journal articles. For instance, in 2015 the APS Observer ran three columns dedicated to BDA in consecutive issues of the journal (Gallistel, 2015a, b, & c), and browsing the latest issues of Psychonomic Bulletin and Review gives an impression of increased interest in the topic.

Bayesian data analysis is more than bayes factors

However, it appears that there is an imbalance in what many beginning bayesian data analysts think about BDA. From casual observation and discussions, I’ve noticed a tendency for people to equate bayesian methods with computing bayes factors; that is, testing (usually null) hypotheses using bayesian model comparison.

I don’t have good data on people’s impressions of what BDA is, but here’s another anecdote. At a recent conference on Bayesian statistics, Mark Andrews summarized his experiences teaching BDA workshops for social scientists. The talk was quite interesting, and what particularly piqued my curiosity was his comment that for many–if not most–workshop participants, bayesian data analysis meant hypothesis testing with bayes factors (at 20 minutes in the linked video). (As an aside, he also noted that Stan has now superseded JAGS and BUGS as the preferred choice for a probabilistic modeling language. Go Stan!)

This imbalance or conflation of bayes factors and bayesian data analysis (if it is real and not merely bias in my observations!) is quite disappointing because BDA is a vast field of awesome methods, and bayes factors (BF) are only one thing that you can do with it1. In fact, many textbooks on BDA mention BFs only in footnotes (Gelman et al., 2013; McElreath, 2016). I’ve also written about BDA on my blog about a half-dozen times, but only once about bayes factors (Vuorre, 2016).

Further, it is really only this one method that people bicker over on social media: the bayes vs. frequentism argument usually turns into a p-values vs. bayes factors argument. The tedium of this argument (there really aren’t good reasons to prefer p-values) may even give the impression that BDA is tedious and limited to model comparison and hypothesis testing problems. It isn’t! Bayes has benefits over frequentism that reach far beyond the p-vs-bf issue.

Here’s a practical example: It is well known that estimating generalized linear mixed models is kind of difficult. In fact, maximum likelihood methods routinely fail, especially when data are sparse or parameters are plentiful (you’ve heard of multilevel models not converging, right? That’s the issue.). However, bayesian methods (via MCMC, for example) usually have no problem estimating these models in situations where maximum likelihood fails! This benefit of bayes over frequentism (only the first thing to come to mind) doesn’t usually appear in the tedious p-vs-bf arguments, although one could argue that its practical implications are greater.
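Here’s a minimal sketch of what I mean, using simulated data purely for illustration (the variable names and simulation settings are made up; lme4’s glmer() stands in for maximum likelihood):

library(lme4)
library(brms)
set.seed(1)
# Simulated sparse binary data: 30 subjects with only 4 trials each
sim <- data.frame(
    subject = rep(1:30, each = 4),
    condition = rep(0:1, times = 60),
    correct = rbinom(120, 1, .6)
)
# Maximum likelihood: with data this sparse, glmer() often warns about
# convergence problems or singular fits
fit_ml <- glmer(correct ~ condition + (condition | subject),
                family = binomial, data = sim)
# The same model estimated with MCMC typically samples without such trouble
fit_bayes <- brm(correct ~ condition + (condition | subject),
                 family = bernoulli(), data = sim, cores = 4)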

Reasons for thinking that BDA is BF

I suspect that one factor contributing to the apparent conflation of BDA and BFs is that there are vocal groups of psychological scientists doing interesting and important work promoting the use of Bayes Factors for hypothesis testing, and bayesian methods more generally (e.g. Dienes, 2015; Ly, Verhagen, & Wagenmakers, 2015; Morey, Romeijn, & Rouder, 2016; Rouder, Morey, Verhagen, Province, & Wagenmakers, 2016). A quick reading of some of these texts may give the (false) impression that bayes factors are a larger portion of BDA than they actually are. I think the same goes for the APS Observer columns mentioned above. To be completely clear, these papers are not about, nor do they say, that BDA is BF. I’m merely pointing out that in the psychological literature, (important!) papers about bayesian methods that focus on BFs vastly outnumber papers that discuss other features of BDA.

But the real root cause for this conflation is probably the fetish-like desire for hypothesis tests in psychological science.

A modest proposal

If now is the time to move toward a new statistical paradigm in psychology, we could take the opportunity to emphasize not only bayesian hypothesis testing, but also the importance of modeling, estimation, and bayesian methods more generally. As Dr. Andrews noted, BDA could instead be thought of as flexible probabilistic modeling.

References

Dienes, Z. (2015). How Bayes factors change scientific practice. Journal of Mathematical Psychology. https://doi.org/10.1016/j.jmp.2015.10.003
Gallistel, C. R. (2015a). Bayes for Beginners 2: The Prior. APS Observer, 28(8). Retrieved from https://www.psychologicalscience.org/observer/bayes-for-beginners-2-the-prior
Gallistel, C. R. (2015b). Bayes for Beginners 3: The Prior in Probabilistic Inference. APS Observer, 28(9). Retrieved from https://www.psychologicalscience.org/observer/bayes-for-beginners-3-the-prior-in-probabilistic-inference
Gallistel, C. R. (2015c). Bayes for Beginners: Probability and Likelihood. APS Observer, 28(7). Retrieved from https://www.psychologicalscience.org/observer/bayes-for-beginners-probability-and-likelihood
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition. Boca Raton: Chapman and Hall/CRC.
Ly, A., Verhagen, J., & Wagenmakers, E.-J. (2015). Harold Jeffreys’s default Bayes factor hypothesis tests: Explanation, extension, and application in psychology. Journal of Mathematical Psychology. https://doi.org/10.1016/j.jmp.2015.06.004
McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
Morey, R. D., Romeijn, J.-W., & Rouder, J. N. (2016). The philosophy of Bayes factors and the quantification of statistical evidence. Journal of Mathematical Psychology. https://doi.org/10.1016/j.jmp.2015.11.001
Rouder, J. N., Morey, R. D., Verhagen, J., Province, J. M., & Wagenmakers, E.-J. (2016). Is There a Free Lunch in Inference? Topics in Cognitive Science, 8(3), 520–547. https://doi.org/10.1111/tops.12214
Vuorre, M. (2016, August 23). Statistical inference: Prix fixe or à la carte? Retrieved 4 May 2017, from https://mvuorre.github.io/post/2016/2016-08-23-free-lunch-in-inference/
Vuorre, M., & Bolger, N. (2017). Within-subject mediation analysis for experimental data in cognitive psychology and neuroscience. OSF Preprint. https://doi.org/10.17605/OSF.IO/6JHPF


  1. To be sure, I think a bayes factor can be a truly great tool when the competing models are well specified. I haven’t yet implemented a bayes factor method for my bayesian multilevel mediation package (Vuorre & Bolger, 2017), but might include one in the future. One difficulty here is specifying the competing models in a meaningful way.

Bayes Factors with brms

Here’s a short post on how to calculate Bayes Factors with the R package brms (Buerkner, 2016) using the Savage-Dickey density ratio method (Wagenmakers, Lodewyckx, Kuriyal, & Grasman, 2010).

To get up to speed with what the Savage-Dickey density ratio method is–or what Bayes Factors are–please read Wagenmakers et al. (2010). (The paper is available on the author’s webpage.) Here, I’ll only show the R & brms code to do the calculations that Wagenmakers et al. (2010) discuss. In their paper, they used WinBUGS, which requires quite a bit of code to sample from even a relatively simple model. brms, on the other hand, uses the familiar R formula syntax, making it easy to use. brms also does the MCMC sampling with Stan (Stan Development Team, 2016a & 2016b), or rather creates Stan code from a specified R model formula by what can only be described as string processing magic, making the sampling very fast. Let’s get straight to the examples in Wagenmakers et al. (2010).

# We'll work in tidyverse
library(tidyverse)

Example 0

Wagenmakers and colleagues begin with a simple example of 10 true/false questions: We observe a person answering 9 (s) out of 10 (k) questions correctly.

d <- data.frame(s = 9, k = 10)
d
##   s  k
## 1 9 10

We are interested in the person’s latent ability to answer similar questions correctly. This ability is represented by \(\theta\) (theta), which for us will be the probability parameter (sometimes also called the rate parameter) in a binomial distribution. See Wagenmakers et al. (2010) for formulas. The maximum likelihood (point) estimate for \(\theta\) is s/k, the proportion .9.
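We can compute it directly:

# Maximum likelihood estimate of theta
with(d, s / k)
## [1] 0.9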

The first thing we’ll need to specify with respect to our statistical model is the prior probability distribution for \(\theta\). As in Wagenmakers et al. (2010), we specify a uniform prior, representing no prior information about the person’s ability to answer the questions. For the binomial probability parameter, \(Beta(\alpha = 1, \beta = 1)\) is a uniform prior.

pd <- tibble(
    x = seq(0, 1, by = .01),
    Prior = dbeta(x, 1, 1)
    )

ggplot(pd, aes(x, Prior)) +
    geom_line() +
    coord_cartesian(xlim = 0:1, ylim = c(0, 6), expand = 0) +
    labs(y = "Density", x = bquote(theta))

The solid line represents the probability density assigned to values of \(\theta\) by this prior probability distribution. You can see that it is 1 for all possible parameter values: They are all equally likely a priori. For this simple illustration, we can easily calculate the posterior distribution by adding the number of correct and incorrect answers to the parameters of the prior Beta distribution.

pd$Posterior <- dbeta(pd$x, 10, 2)
pdw <- gather(pd, key=Type, value=density, Prior:Posterior)
ggplot(pdw, aes(x, density, col=Type)) +
    geom_line() +
    annotate("point", x=c(.5, .5), y = c(pdw$density[pdw$x==.5])) +
    annotate("label", x=c(.5, .5), 
             y = pdw$density[pdw$x==.5], 
             label = round(pdw$density[pdw$x==.5], 3),
             vjust=-.5)

The Savage-Dickey density ratio is calculated by dividing the posterior density by the prior density at a specific parameter value. Here, we are interested in .5, a “null hypothesis” value indicating that the person’s latent ability is .5, i.e. that they are simply guessing.

filter(pd, x == .5) %>% 
    mutate(BF01 = Posterior/Prior,
           BF10 = 1/BF01) %>% 
    round(3)
## # A tibble: 1 x 5
##       x Prior Posterior  BF01  BF10
##   <dbl> <dbl>     <dbl> <dbl> <dbl>
## 1   0.5     1     0.107 0.107 9.309
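The same ratio can also be computed directly from the analytic Beta densities at \(\theta\) = .5:

dbeta(.5, 10, 2) / dbeta(.5, 1, 1)
## [1] 0.1074219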

OK, so in this example we were able to get to the posterior by simply adding values to the parameters of the Beta distribution, but let’s now see how to approach the same problem using brms.

library(brms)

First, here’s the brms formula of the model:

s | trials(k) ~ 0 + intercept, 
family=binomial(link="identity"), 
data = d

Read the first line as “s successes from k trials regressed on intercept”. That’s a little clunky, but bear with it. If you are familiar with R’s modeling syntax, you’ll be wondering why we didn’t simply specify ~ 1–R’s default notation for an intercept. The reason is that brms by default uses a little trick in parameterizing the intercept which speeds up the MCMC sampling. In order to specify a prior for the intercept, you’ll have to take the default intercept out (0 +), and use the reserved string intercept to say that you mean the regular intercept. See ?brmsformula for details. (For this model, with only one parameter, this complication doesn’t matter, but I wanted to introduce it early on so that you’d be aware of it when estimating multi-parameter models.)

The next line specifies that the data model is binomial, and that we want to model its parameter through an identity link. Usually when you model proportions or binary data, you’d use a logistic (logistic regression!), probit, or other similar link function. In fact this is what we’ll do for later examples. Finally, we’ll use the data frame d.

OK, then we’ll want to specify our priors. Priors are extremely important for Bayes Factors–and probabilistic inference in general. To help set priors, we’ll first call get_prior() with the model information, which is basically like asking brms to tell us what the possible priors are, and how to specify them, given this model.

get_prior(s | trials(k) ~ 0 + intercept, 
          family=binomial(link="identity"),
          data = d)
##   prior class      coef group nlpar bound
## 1           b                            
## 2           b intercept

The first line says that there is only one class of parameters, b; think of class b as “betas” or “regression coefficients”. The second line says that the b class has only one parameter, the intercept. So we can set a prior for the intercept, and this prior can be any probability distribution in Stan language. We’ll create this prior using brms’ set_prior(), give it a text string representing the Beta(1, 1) prior for all parameters of class b (a shortcut; we could also specify that we want it for the intercept specifically), and then give the upper and lower bounds (\(\theta\) must be between 0 and 1).

Prior <- set_prior("beta(1, 1)", class = "b", lb = 0, ub = 1)

Almost there. Now we’ll actually sample from the model using brm(), give it the model, priors, data, ask it to sample from priors (for the density ratio!), and set a few extra MCMC parameters.

m <- brm(s | trials(k) ~ 0 + intercept, 
         family = binomial(link="identity"),
         prior = Prior,
         data = d,
         sample_prior = TRUE,
         iter = 1e4,
         cores = 4)

We can get the estimated parameter by asking the model summary:

summary(m)
##  Family: binomial(identity) 
## Formula: s | trials(k) ~ 0 + intercept 
##    Data: d (Number of observations: 1) 
## Samples: 4 chains, each with iter = 10000; warmup = 5000; thin = 1; 
##          total post-warmup samples = 20000
##    WAIC: Not computed
##  
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## intercept     0.83       0.1     0.59     0.98      15654    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The Credible Interval matches exactly what’s reported in the paper. The point estimate differs slightly because here we see the posterior mean, whereas in the paper, Wagenmakers et al. report the posterior mode. I’ll draw a line at their posterior mode, below, to show that it matches.

samples <- posterior_samples(m, "b")
head(samples)
##   b_intercept   prior_b
## 1   0.9496061 0.4433162
## 2   0.6241450 0.6781995
## 3   0.9522157 0.1354565
## 4   0.7378991 0.4176384
## 5   0.7290172 0.9713203
## 6   0.7078843 0.9575486
gather(samples, Type, value) %>% 
    ggplot(aes(value, col=Type)) +
    geom_density() +
    labs(x = bquote(theta), y = "Density") +
    geom_vline(xintercept = .89)  # Vertical line at .89
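As a rough check, we can approximate the posterior mode from the MCMC samples by finding the peak of a kernel density estimate:

# Kernel density peak as an approximation of the posterior mode
dens <- density(samples$b_intercept)
dens$x[which.max(dens$y)]  # should be close to the .89 reported in the paper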

We can already see the densities, so all that’s left is to obtain the exact values at the value of interest (.5) and take the \(\frac{posterior}{prior}\) ratio. Instead of doing any of this by hand, we’ll use brms’ function hypothesis() that allows us to test point hypotheses using the Savage-Dickey density ratio. For this function we’ll need to specify the point of interest, .5, as the point hypothesis to be tested.

h <- hypothesis(m, "intercept = 0.5")
print(h, digits = 4)
## Hypothesis Tests for class b:
##                       Estimate Est.Error l-95% CI u-95% CI Evid.Ratio Star
## (intercept)-(0.5) = 0   0.3336    0.1032   0.0876   0.4761     0.1078    *
## ---
## '*': The expected value under the hypothesis lies outside the 95% CI.

The Evid.Ratio is our Bayes Factor BF01. Notice that it matches the value 0.107 pretty well! You can also plot this hypothesis:

plot(h)

However, I think the default plot isn’t fantastic (not a fan of the axis adjustments, title). Fortunately, plot(hypothesis) returns a ggplot2 object, which is easily customized.

p <- plot(h, plot = F, theme = theme_get())[[1]]
p + scale_x_continuous(breaks = seq(-.5, .5, by = .25),
                       labels = seq(-.5, .5, by = .25)+.5)

OK, so that was a lot of work for such a simple problem, but the real beauty of brms (and Stan) is the unparalleled scalability: We can easily solve a problem with one row of data and one parameter, and it won’t take much more to solve a problem with tens of thousands of rows of data, and hundreds of parameters. Let’s move on to the next example from Wagenmakers et al. (2010).

Example 1: Equality of Proportions

For context, please refer to the paper.

d <- data.frame(
    pledge = c("yes", "no"),
    s = c(424, 5416),
    n = c(777, 9072)
)
d
##   pledge    s    n
## 1    yes  424  777
## 2     no 5416 9072

They use Beta(1, 1) priors for both rate parameters, which we’ll do as well. Notice that usually a regression formula has an intercept and a coefficient (e.g. effect of group.) By taking the intercept out (0 +) we can define two pledger-group proportions instead, and set priors on these. If we used an intercept + effect formula, we could set a prior on the effect itself.

get_prior(s | trials(n) ~ 0 + pledge, 
          family=binomial(link="identity"),
          data = d)
##   prior class      coef group nlpar bound
## 1           b                            
## 2           b  pledgeno                  
## 3           b pledgeyes

We can set the Beta prior for both groups’ rate with one line of code by setting the prior on the b class without specifying the coef.

Prior <- set_prior("beta(1, 1)", class = "b", lb = 0, ub = 1)

Like above, let’s estimate.

m1 <- brm(s | trials(n) ~ 0 + pledge, 
         family = binomial(link="identity"),
         prior = Prior,
         sample_prior = TRUE,
         iter = 1e4,
         data = d,
         cores = 4)

Our estimates match the MLEs reported in the paper:

print(m1, digits=3)
##  Family: binomial(identity) 
## Formula: s | trials(n) ~ 0 + pledge 
##    Data: d (Number of observations: 2) 
## Samples: 4 chains, each with iter = 10000; warmup = 5000; thin = 1; 
##          total post-warmup samples = 20000
##    WAIC: Not computed
##  
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## pledgeno     0.597     0.005    0.587    0.607      18588    1
## pledgeyes    0.546     0.018    0.511    0.580      19173    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

To get the density ratio Bayes Factor, we’ll need to specify a text string as our hypothesis. Our hypothesis is that the rate parameters \(\theta_1\) and \(\theta_2\) are not different: \(\theta_1\) = \(\theta_2\). The alternative, then, is the notion that the parameter values differ.

h1 <- hypothesis(m1, "pledgeyes = pledgeno")
print(h1, digits = 3)
## Hypothesis Tests for class b:
##                          Estimate Est.Error l-95% CI u-95% CI Evid.Ratio
## (pledgeyes)-(pled... = 0   -0.051     0.018   -0.088   -0.016      0.388
##                          Star
## (pledgeyes)-(pled... = 0    *
## ---
## '*': The expected value under the hypothesis lies outside the 95% CI.

As noted in the paper, a difference of 0 is about half as well supported after seeing the data as it was before; that is, the data provide roughly 2:1 evidence against the null hypothesis of no difference:

1/h1$hypothesis$Evid.Ratio  # BF10
## [1] 2.577207

The paper reports BF01 = 0.47, so we’re getting quite similar results (small differences are expected, because the density ratio is estimated from MCMC samples.) You can also compare this figure to what’s reported in the paper.

h1p1 <- plot(h1, theme = theme_get(), plot = F)[[1]] +
    theme(legend.position = "top")
h1p2 <- plot(h1, theme = theme_get(), plot = F)[[1]] + 
    coord_cartesian(xlim = c(-.05, .05), ylim = c(0, 5)) +
    theme(legend.position = "top")
gridExtra::grid.arrange(h1p1, h1p2, nrow = 1)

Moving right on to Example 2, skipping the section on “order restricted analysis”.

Example 2: Hierarchical Bayesian one-sample proportion test

The data for Example 2 are not available, but we’ll simulate similar data. The simulation assumes that the neither-primed condition’s average correct probability is 50%, and that the both-primed condition’s benefit is 5 percentage points. Obviously, the numbers here won’t match exactly, but the data reported in the paper have an average difference in proportions of about 4%.

set.seed(5)
d <- tibble(
    id = c(rep(1:74, each = 2)),
    primed = rep(c("neither", "both"), times = 74),
    prime = rep(c(0, 1), times = 74),  # Dummy coded
    n = 21,
    correct = rbinom(74*2, 21, .5 + prime * .05)
)
group_by(d, primed) %>% summarize(p = sum(correct)/sum(n))
## # A tibble: 2 x 2
##    primed         p
##     <chr>     <dbl>
## 1    both 0.5418275
## 2 neither 0.4993565

These data yield a t-value similar to the one in the paper.

t.test(correct/n ~ primed, paired = T, data = d)
## 
##  Paired t-test
## 
## data:  correct/n by primed
## t = 2.3045, df = 73, p-value = 0.02404
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.005741069 0.079201016
## sample estimates:
## mean of the differences 
##              0.04247104

Instead of doing a probit regression, I’m going to do logistic regression. Therefore we define the prior on the log-odds scale. The log odds for the expected probability of .5 is 0. I prefer log-odds because–although complicated–they make sense, unlike standardized effect sizes. Note that the probit scale would also be fine as they are very similar.

Let’s just get a quick intuition about effects in log-odds: The change in log odds from p = .5 to .55 is about 0.2.
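A quick numerical check with base R’s quantile function for the logistic distribution:

# Change in log-odds from p = .5 to p = .55
qlogis(.55) - qlogis(.5)
## [1] 0.2006707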

library(arm)
tibble(rate = seq(0, 1, by = .01),
    logit = logit(rate)) %>%
    ggplot(aes(rate, logit)) +
    geom_line(size=1) +
    geom_segment(x = 0, xend = 0.55, y = .2, yend = .2, size=.4) +
    geom_segment(x = 0, xend = 0.5, y = 0, yend = 0, size=.4) +
    coord_cartesian(ylim = c(-2, 2), expand=0)

We are cheating a little because we know these values, having simulated the data. However, log-odds are not straightforward (!), and this knowledge will allow us to specify better priors in this example. Let’s get the possible priors for this model by calling get_prior(). Notice that the model now includes id-varying (“random”) effects, which we model as draws from independent Gaussians by specifying || instead of |; the latter would model them with a multivariate Gaussian (i.e. also estimate their correlation).

get_prior(correct | trials(n) ~ 0 + intercept + prime + 
              (0 + intercept + prime || id), 
          family=binomial(link="logit"),
          data = d)
##                 prior class      coef group nlpar bound
## 1                         b                            
## 2                         b intercept                  
## 3                         b     prime                  
## 4 student_t(3, 0, 10)    sd                            
## 5                        sd              id            
## 6                        sd intercept    id            
## 7                        sd     prime    id

The leftmost column gives the pre-specified defaults used by brms. Here are the priors we’ll specify. The most important pertains to prime, which is going to be the effect size in log-odds. Our prior for the log odds of the prime effect is going to be a Gaussian distribution centered on 0, with a standard deviation of .2, which is rather diffuse relative to plausible priming effects: on the probability scale, it places most of its mass on effects smaller than about ten percentage points around chance.
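To see what that prior implies, we can convert its central 95% interval from log-odds to probabilities (around an intercept at chance):

# 95% of the normal(0, .2) prior mass, translated from log-odds to probabilities
plogis(qnorm(c(.025, .975), mean = 0, sd = .2))  # roughly 0.40 and 0.60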

Prior <- c(
    set_prior("normal(0, 10)", class = "b", coef = "intercept"),
    set_prior("cauchy(0, 10)", class = "sd"),
    set_prior("normal(0, .2)", class = "b", coef = "prime")
)

Then we estimate the model using the specified priors.

m2 <- brm(correct | trials(n) ~ 0 + intercept + prime + 
              (0 + intercept + prime || id), 
         family = binomial(link="logit"),
         prior = Prior,
         sample_prior = TRUE,
         iter = 1e4,
         data = d,
         cores = 4)

OK, so our results here will differ from the paper’s because we didn’t parameterize the prior in terms of a standardized effect size: a) I don’t like standardized effect sizes, and b) doing so would require playing around with the Stan code, and this post is about brms. Two good reasons not to use standardized effect sizes! Anyway, here are the estimated parameters:

summary(m2)
##  Family: binomial(logit) 
## Formula: correct | trials(n) ~ 0 + intercept + prime + (0 + intercept + prime || id) 
##    Data: d (Number of observations: 148) 
## Samples: 4 chains, each with iter = 10000; warmup = 5000; thin = 1; 
##          total post-warmup samples = 20000
##    WAIC: Not computed
##  
## Group-Level Effects: 
## ~id (Number of levels: 74) 
##               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(intercept)     0.07      0.05     0.00     0.18       8862    1
## sd(prime)         0.13      0.08     0.01     0.30       6970    1
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## intercept     0.01      0.05    -0.09     0.10      20000    1
## prime         0.15      0.07     0.02     0.29      20000    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

And our null-hypothesis density ratio:

h2 <- hypothesis(m2, "prime = 0")
h2
## Hypothesis Tests for class b:
##             Estimate Est.Error l-95% CI u-95% CI Evid.Ratio Star
## (prime) = 0     0.15      0.07     0.02     0.29       0.27    *
## ---
## '*': The expected value under the hypothesis lies outside the 95% CI.

A priming effect of zero log-odds is about 3.7 times less likely after seeing the data:

1/h2$hypothesis$Evid.Ratio
## [1] 3.681923

This is best illustrated by plotting the densities:

plot(h2)

Conclusion

Read the paper! Hopefully you’ll be able to use brms’ hypothesis() function to calculate bayes factors when needed.

References

Buerkner, P.-C. (2016). brms: Bayesian Regression Models using Stan. Retrieved from http://CRAN.R-project.org/package=brms
Stan Development Team. (2016a). RStan: the R interface to Stan. Retrieved from http://mc-stan.org/
Stan Development Team. (2016b). Stan: A C++ Library for Probability and Sampling, Version 2.14.1. Retrieved from http://mc-stan.org/
Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage–Dickey method. Cognitive Psychology, 60(3), 158–189. https://doi.org/10.1016/j.cogpsych.2009.12.001

Appendix

brms produces exceptionally well written Stan code. Stan code of the last example model:

cat(rstan::get_stancode(m2$fit))
## // generated with brms 1.6.1
## functions { 
## } 
## data { 
##   int<lower=1> N;  // total number of observations 
##   int Y[N];  // response variable 
##   int<lower=1> K;  // number of population-level effects 
##   matrix[N, K] X;  // population-level design matrix 
##   // data for group-level effects of ID 1 
##   int<lower=1> J_1[N]; 
##   int<lower=1> N_1; 
##   int<lower=1> M_1; 
##   vector[N] Z_1_1; 
##   vector[N] Z_1_2; 
##   int trials[N];  // number of trials 
##   int prior_only;  // should the likelihood be ignored? 
## } 
## transformed data { 
## } 
## parameters { 
##   vector[K] b;  // population-level effects 
##   vector<lower=0>[M_1] sd_1;  // group-level standard deviations 
##   vector[N_1] z_1[M_1];  // unscaled group-level effects 
##   // parameters to store prior samples
##   real<lower=0> prior_sd_1;
## } 
## transformed parameters { 
##   // group-level effects 
##   vector[N_1] r_1_1; 
##   vector[N_1] r_1_2; 
##   r_1_1 = sd_1[1] * (z_1[1]); 
##   r_1_2 = sd_1[2] * (z_1[2]); 
## } 
## model { 
##   vector[N] mu; 
##   mu = X * b; 
##   for (n in 1:N) { 
##     mu[n] = mu[n] + (r_1_1[J_1[n]]) * Z_1_1[n] + (r_1_2[J_1[n]]) * Z_1_2[n]; 
##   } 
##   // prior specifications 
##   b[1] ~ normal(0, 10); 
##   b[2] ~ normal(0, .2); 
##   sd_1 ~ cauchy(0, 10); 
##   z_1[1] ~ normal(0, 1); 
##   z_1[2] ~ normal(0, 1); 
##   // likelihood contribution 
##   if (!prior_only) { 
##     Y ~ binomial_logit(trials, mu); 
##   } 
##   // additionally draw samples from priors
##   prior_sd_1 ~ cauchy(0,10);
## } 
## generated quantities { 
##   real prior_b_1; 
##   real prior_b_2; 
##   // additionally draw samples from priors 
##   prior_b_1 = normal_rng(0,10); 
##   prior_b_2 = normal_rng(0,.2); 
## }

Better forest plots from meta-analytic models estimated with brms

Hi all! After our previous discussion about how to estimate meta-analytic models with the brilliant brms R package, a few people asked me for the code to produce the forest plots. Here I’ll present a much better version of a function to produce forest plots from meta-analytic models estimated with brms. The function is implemented in ggplot2, and it is included in my vmisc package. To get the function, simply install vmisc from GitHub:

devtools::install_github("mvuorre/vmisc")

The function to draw forest plots from meta-analytic models estimated with brms is called brms_forest(), and you can learn more about it with ?brms_forest. The most important thing to know is that it requires that the model is estimated with brms with this exact formula:

yi | se(sei) ~ 1 + (1|study)

Where yi are the study-specific effect sizes (mean difference between groups, for example), sei are the study-specific standard errors, and study are unique integer labels for each study in the data.

Example data

We’ll illustrate the plot using example data from the metafor package. These data are “Results from 48 studies on the effectiveness of school-based writing-to-learn interventions on academic achievement.”

“In each of the studies included in this meta-analysis, an experimental group (i.e., a group of students that received instruction with increased emphasis on writing tasks) was compared against a control group (i.e., a group of students that received conventional instruction) with respect to some content-related measure of academic achievement (e.g., final grade, an exam/quiz/test score). The effect size measure for this meta-analysis was the standardized mean difference (with positive scores indicating a higher mean level of academic achievement in the intervention group).” (From the metafor help page ?dat.bangertdrowns2004.)

library(metafor)
head(dat.bangertdrowns2004)
##   id   author year grade length minutes wic feedback info pers imag meta
## 1  1 Ashworth 1992     4     15      NA   1        1    1    1    0    1
## 2  2    Ayers 1993     2     10      NA   1       NA    1    1    1    0
## 3  3   Baisch 1990     2      2      NA   1        0    1    1    0    1
## 4  4    Baker 1994     4      9      10   1        1    1    0    0    0
## 5  5   Bauman 1992     1     14      10   1        1    1    1    0    1
## 6  6   Becker 1996     4      1      20   1        0    0    1    0    0
##         subject  ni     yi    vi
## 1       Nursing  60  0.650 0.070
## 2 Earth Science  34 -0.750 0.126
## 3          Math  95 -0.210 0.042
## 4       Algebra 209 -0.040 0.019
## 5          Math 182  0.230 0.022
## 6    Literature 462  0.030 0.009

We’ll only need a few of the columns, and with specific names, so in the following we’ll just select the relevant variables, rename id to study, and create labels for the plot by pasting together the studies’ author and year columns. I’ll also subset the data to the first 15 studies, because the original data has 48 studies and that would make the plot very large (which is fine, but it’s simpler to start small.)

library(dplyr)
d <- dat.bangertdrowns2004 %>%
    mutate(label = paste0(author, " (", year, ")"), sei = sqrt(vi)) %>%
    select(study = id, label, yi, sei) %>% 
    slice(1:15)
d
##    study              label    yi        sei
## 1      1    Ashworth (1992)  0.65 0.26457513
## 2      2       Ayers (1993) -0.75 0.35496479
## 3      3      Baisch (1990) -0.21 0.20493902
## 4      4       Baker (1994) -0.04 0.13784049
## 5      5      Bauman (1992)  0.23 0.14832397
## 6      6      Becker (1996)  0.03 0.09486833
## 7      7 Bell & Bell (1985)  0.26 0.32557641
## 8      8     Brodney (1994)  0.06 0.08366600
## 9      9      Burton (1986)  0.06 0.20000000
## 10    10   Davis, BH (1990)  0.12 0.22803509
## 11    11   Davis, JJ (1996)  0.77 0.32710854
## 12    12         Day (1994)  0.00 0.14491377
## 13    13     Dipillo (1994)  0.52 0.19235384
## 14    14     Ganguli (1989)  0.54 0.28809721
## 15    15  Giovinazzo (1996)  0.20 0.29325757

Fit meta-analytic model with brms

Fitting the meta-analytic model is easy with brms! The formula specifies the study-specific effect size and standard error, an overall intercept (1), and the “random studies” ((1|study)). I’ll use four cores for speed and increase the adapt_delta parameter to avoid divergent transitions.

library(brms)
mod <- brm(yi | se(sei) ~ 1 + (1|study), data = d, 
           cores = 4, control= list(adapt_delta=.99) )

The model summary shows a 95% CI for the average effect hugging 0, and reasonable between-study heterogeneity (sd(Intercept)):

summary(mod)
##  Family: gaussian(identity) 
## Formula: yi | se(sei) ~ 1 + (1 | study) 
##    Data: d (Number of observations: 15) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
##  
## Group-Level Effects: 
## ~study (Number of levels: 15) 
##               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept)     0.17       0.1     0.01      0.4        850    1
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept     0.13      0.07    -0.01     0.29        999    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

Draw forest plots with brms_forest()

The user only needs to enter a data frame and a brms model:

library(vmisc)
brms_forest(data = d, model = mod)

The forest plot shows, on the left, the labels provided in d. On the very right are the effect sizes and the limits of the Credible Intervals. The CI limits are by default 95%, but users can control this by passing the argument level = .80, for 80% CIs, for example.

In the middle are the posterior distributions of the estimated effect sizes as grey densities. The black circle indicates the posterior mean, and the arms extending from the point are the CI defined by level (here, 95% CI). The bottom row, ME, is the meta-analytic estimate.
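For example, to get 80% intervals (using the level argument described above):

brms_forest(data = d, model = mod, level = .80)  # 80% Credible Intervals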

Arguments

brms_forest() has several arguments that impact the resulting plot, see ?brms_forest for details. For example

brms_forest(
    data = d, 
    model = mod, 
    show_data = TRUE,  # Shows data means and SEs
    sort_estimates = TRUE,  # Sorts estimates based on their magnitude
    dens_fill = "dodgerblue")  # Fill densities with blue

Here, the plot also shows the means (black crosses) and 2 SEs (thin black lines) from the data, slightly below each estimate. This plot nicely shows how the random effects model shrinks the estimates toward the group mean, especially for studies that had wide SEs to begin with.

Customizing the plot

brms_forest() returns a ggplot2 object, which is customizable by regular ggplot2 functions (themes, scales, …). Here, we’ll add limits to the x axis manually (we have to specify the y axis, because the axes are internally flipped…), add a label to replace the default “mean”, and use another custom theme:

library(ggplot2)
myplot <- brms_forest(data = d, 
                      model = mod, 
                      sort_estimates = TRUE)
myplot + 
    scale_y_continuous("Standardized ES", limits = c(-1.2, 1.2)) +
    theme_blog()

Credits

The plot uses a one-sided violin plot which is not included in the default ggplot2 distribution. This code was written by David Robinson https://github.com/dgrtwo: https://gist.github.com/dgrtwo/eb7750e74997891d7c20, and Ben Marwick https://github.com/benmarwick: https://gist.github.com/benmarwick/2a1bb0133ff568cbe28d.

Conclusion

Look at how non-normal those posterior densities are! It seems that in this case, the posterior mean is quite a misleading point estimate, and actually showing the posterior distribution is a very good idea.

Hope you’ll find this function useful. If you’d like it to be improved, let me know. Thanks and have a good day!

How to Compare Two Groups with Robust Bayesian Estimation Using R, Stan and brms


Happy New Year 2017 everybody! 2017 will be the year when social scientists finally decide to diversify their applied statistics toolbox, and stop relying 100% on null hypothesis significance testing (NHST). We now recognize that different scientific questions may require different statistical tools, and are ready to adopt new and innovative methods. A very appealing alternative to NHST is Bayesian statistics, which in itself contains many approaches to statistical inference. In this post, I provide an introductory and practical tutorial to Bayesian parameter estimation in the context of comparing two independent groups’ data.

More specifically, we’ll focus on the t-test. Everyone knows about it, everyone uses it. Yet, there are (arguably!) better methods for drawing inferences from two independent groups’ metric data (Kruschke, 2013; Morey & Rouder, 2015). Let’s talk about how “Bayesian estimation supersedes the t-test” (Kruschke, 2013).

Kruschke (2013, p.573) writes:

“When data are interpreted in terms of meaningful parameters in a mathematical description, such as the difference of mean parameters in two groups, it is Bayesian analysis that provides complete information about the credible parameter values. Bayesian analysis is also more intuitive than traditional methods of null hypothesis significance testing (e.g., Dienes, 2011).”

In that article (Bayesian estimation supersedes the t-test), Kruschke (2013) provided clear and well-reasoned arguments favoring Bayesian parameter estimation over null hypothesis significance testing in the context of comparing two groups, a situation usually dealt with using a t-test. It also introduced a robust model for comparing two groups, which models the data as t-distributed instead of Gaussian. The article provided R code for running the estimation procedures, which could be downloaded from the author’s website or as an R package (Kruschke & Meredith, 2015).

The R code and programs work well for this specific application (estimating the robust model for one or two groups’ metric data). However, modifying the code to handle more complicated situations is not easy, and the underlying estimation algorithms don’t necessarily scale up to handle more complicated situations. Therefore, today I’ll introduce easy to use, free, open-source, state-of-the-art computer programs for Bayesian estimation, in the context of comparing two groups’ metric (continuous) data. The programs are available for the R programming language–so make sure you are familiar with R basics (e.g. Vuorre, 2016). I provide R code (it’s super easy, don’t worry!) for t-tests and Bayesian estimation in R using the R package brms (Buerkner, 2016), which uses the powerful Stan MCMC program (Stan Development Team, 2016) under the hood.

These programs supersede older Bayesian inference programs because they are easy to use (brms is an interface to Stan, which is actually a programming language in itself), fast, and are able to handle models with thousands of parameters. Learning to implement basic analyses such as t-tests, and Kruschke’s robust model, with these programs is very useful because (obviously) you’ll then be able to do Bayesian statistics in practice, and will be prepared to understand and implement more complex models.

Understanding the results of Bayesian estimation requires understanding some basics of Bayesian statistics, which I won’t describe here at all. If you are not familiar with Bayesian statistics, please read Kruschke’s excellent article (his book is also very good, Kruschke, 2014; see also McElreath, 2016). In fact, you should read the paper anyway, it’s very good.

First, I’ll introduce the basics of t-tests in some detail, and then focus on understanding them as specific instantiations of linear models. If that sounds familiar, skip ahead to Bayesian Estimation of the t-test, where I introduce the brms package for estimating models using Bayesian methods. Following that, we’ll use “distributional regression” to obtain Bayesian estimates of the unequal variances t-test model. Finally, we’ll learn how to estimate Kruschke’s (2013) BEST model using brms.

The t in a t-test

We’ll begin with t-tests, using example data from Kruschke’s paper (p. 577):

“Consider data from two groups of people who take an IQ test. Group 1 (N1=47) consumes a “smart drug,” and Group 2 (N2=42) is a control group that consumes a placebo.”

I’ve decided to call the control group “Group 0”, and the treatment group “Group 1”, because this coding makes it natural to think of the control group as a “reference group”, and any “effect” we’ll estimate will be associated with the treatment group. These data are visualized as histograms, below:


Figure 1: Histograms of the two groups’ IQ scores.

Equal variances t-test

These two groups’ IQ scores could be compared with a simple equal variances t-test (which you shouldn’t use; Lakens, 2015), also known as Student’s t-test. I have the two groups’ IQ scores in R as two vectors called group_0 and group_1, so doing a t-test is as easy as

t.test(group_0, group_1, var.equal=T)
## 
##  Two Sample t-test
## 
## data:  group_0 and group_1
## t = -1.5587, df = 87, p-value = 0.1227
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.544155  0.428653
## sample estimates:
## mean of x mean of y 
##  100.3571  101.9149

We interpret the t-test in terms of the observed t-value, and whether it exceeds the critical t-value. The critical t-value, in turn, is defined as the extreme \(\alpha / 2\) percentiles of a t-distribution with the given degrees of freedom. The current situation is illustrated below:


Figure 2: t distribution with 87 degrees of freedom, and observed t-value. The dashed vertical lines indicate the extreme 2.5 percentiles. We would reject the null hypothesis of no difference if the observed t-value exceeded these percentiles.

The test results in an observed t-value of 1.56, which is not far enough in the tails of a t-distribution with 87 degrees of freedom to warrant rejecting the null hypothesis (given that we are using \(\alpha\) = .05, which may or may not be an entirely brilliant idea (e.g. Rouder, Morey, Verhagen, Province, & Wagenmakers, 2016)). Note that R also reports a 95% CI for the estimated difference between the two groups.
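In R, the critical values can be obtained from the quantile function of the t-distribution and compared to the observed t-value:

# Extreme 2.5% and 97.5% percentiles of a t-distribution with 87 df
qt(c(.025, .975), df = 87)  # approximately -1.99 and 1.99
# The observed t-value, -1.56, does not exceed these, so we fail to reject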

Unequal variances t-test

Next, we’ll run the more appropriate, unequal variances t-test (also known as Welch’s t-test), which R gives by default:

t.test(group_0, group_1)
## 
##  Welch Two Sample t-test
## 
## data:  group_0 and group_1
## t = -1.6222, df = 63.039, p-value = 0.1098
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.4766863  0.3611848
## sample estimates:
## mean of x mean of y 
##  100.3571  101.9149

Note that while R gives Welch’s t-test by default, SPSS gives both. If you’re using SPSS, make sure to report the Welch’s test results, instead of the equal variances test. Here, the conclusion with respect to rejecting the null hypothesis of equal means is the same. However, notice that the results are numerically different, as they should, because these two t-tests refer to different models!

As a side note, I recently learned that this problem (estimating and testing the difference between two means when the variances are not assumed equal) is unsolved: only approximate solutions are known.

It is of course up to you, as a researcher, to decide whether you assume equal variances or not. But note that we almost always allow the means to be different (that’s the whole point of the test, really), while many treatments may just as well have an effect on the standard deviations.

The first take-home message from today is that there are actually two t-tests, each associated with a different statistical model. And to make clear what the difference is, we must acquaint ourselves with the models.

Describing the model(s) underlying the t-test(s)

We don’t usually think of t-tests (and ANOVAs) as models, but it turns out that they are just linear models disguised as tests (see here and here). Recently, there has been a tremendous push for model/parameter estimation, instead of null hypothesis significance testing (e.g. Cumming, 2014; Kruschke, 2014; see also the brilliant commentary by Gigerenzer, 2004), so we will benefit from thinking about t-tests as linear models. Doing so will facilitate “[interpreting data] in terms of meaningful parameters in a mathematical description” (Kruschke, 2013), and seamlessly expanding our models to handle more complicated situations.

The equal variances t-test models metric data with three parameters: Mean for group A, mean for group B, and one shared standard deviation (i.e. the assumption that the standard deviations [we usually refer to variances, but whatever] are equal between the two groups.)

We call the metric data (IQ scores in our example) \(y_{ik}\), where \(i\) is a subscript indicating the \(i^{th}\) datum, and \(k\) indicates the \(k^{th}\) group. So \(y_{19, 1}\) would be the 19th datum, belonging to group 1. Then we specify that \(y_{ik}\) are Normally distributed, \(N(\mu_{ik}, \sigma)\), where \(\mu_{ik}\) indicates the mean of group \(k\), and \(\sigma\) the common standard deviation.

\[y_{ik} \sim N(\mu_{ik}, \sigma)\]

Read the formula as “Y is normally distributed with mean \(\mu_{ik}\) (mu), and standard deviation \(\sigma\) (sigma)”. Note that the standard deviation \(\sigma\) doesn’t have any subscripts: we assume it is the same for the \(k\) groups. Note also that you’ll often see the second parameter in the parentheses as \(\sigma^2\), referring to the variance.

The means for groups 0 and 1 are simply \(\mu_0\) and \(\mu_1\), respectively, and their difference (let’s call it \(d\)) is \(d = \mu_0 - \mu_1\). The 95% CI for \(d\) is given in the t-test output, and we can tell that it differs from the one given by Welch’s t-test.

It is unsurprising, then, that if we use a different model (the more appropriate unequal variances model; Lakens, 2015), our inferences may be different. Welch’s t-test is the same as Student’s, except that now we assume (and subsequently estimate) a unique standard deviation \(\sigma_{ik}\) for each group.

\[y_{ik} \sim N(\mu_{ik}, \sigma_{ik})\]

This model makes a lot of sense, because rarely are we in a situation to a priori decide that the variance of scores in Group A is equal to the variance of scores in Group B. If you use the equal variances t-test, you should be prepared to justify and defend this assumption. (Deciding between models–such as between these two t-tests–is one way in which our prior information enters and influences data analysis. This fact should make you less suspicious about priors in Bayesian analyses.)

Armed with this knowledge, we can now see that “conducting a t-test” can be understood as estimating one of these two models. By estimating the model, we obtain t-values, degrees of freedom, and consequently, p-values.

However, to focus on modeling and estimation, it is easier to think of the t-test as a specific type of the general linear model (aka linear regression). We can re-write the t-test in an equivalent way, but instead have a specific parameter for the difference in means by writing it as a linear model. (For simplicity, I’ll only write the equal variances model):

\[y_{ik} \sim N(\mu_{ik}, \sigma)\] \[\mu_{ik} = \beta_0 + \beta_1 Group_{ik}\]

Here, \(\sigma\) is just as before, but we now model the mean with an intercept (Group 0’s mean, \(\beta_0\)) and the effect of Group 1 (\(\beta_1\)). To understand what’s going on, let’s look at the data: Group is an indicator variable in the data; for each row of Group 0’s data Group is zero, and for each row of Group 1’s data it is one.

##     Group  IQ
## 1       0  99
## 2       0 101
## 3       0 100
## 4       0 101
## ...   ... ...
## 86      1 101
## 87      1 104
## 88      1 100
## 89      1 101

With this model, \(\beta_1\) directly tells us the estimated difference between the two groups. And because it is a parameter in the model, it has an associated standard error, t-value, degrees of freedom, and a p-value. This linear model can be estimated in R with the following line of code:

olsmod <- lm(IQ ~ Group, data = d)

The key input here is a model formula, which in R is specified as outcome ~ predictor (DV ~ IV). Using the lm() function, we estimated a linear model predicting IQ from an intercept (automatically included) and a Group parameter Group, which is the effect of group 1. I called this object olsmod for Ordinary Least Squares Model.

R has its own model formula syntax, which is well worth learning. The formula in the previous model, IQ ~ Group, means that we want to regress IQ on an intercept (which is implicitly included), and group (Group). Besides the formula, we only need to provide the data, which is contained in the object I’ve conveniently called d.

You can verify that the results are identical to the equal variances t-test above.

summary(olsmod)
## 
## Call:
## lm(formula = IQ ~ Group, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.9149  -0.9149   0.0851   1.0851  22.0851 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 100.3571     0.7263 138.184   <2e-16 ***
## Group         1.5578     0.9994   1.559    0.123    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.707 on 87 degrees of freedom
## Multiple R-squared:  0.02717,    Adjusted R-squared:  0.01599 
## F-statistic:  2.43 on 1 and 87 DF,  p-value: 0.1227

Focus on the Group row in the estimated coefficients. Estimate is the point estimate (best guess) of the difference in means (\(d = 101.9149 - 100.3571 = 1.5578\)). t value is the observed t-value (identical to what t.test() reported), and the p-value (Pr(>|t|)) matches as well. The (Intercept) row refers to \(\beta_0\), which is group 0’s mean.

This way of thinking about the model, where we have parameters for one group’s mean, and the effect of the other group, facilitates focusing on the important parameter, the difference, instead of individual means. However, you can of course compute the difference from the means, or the means from one mean and a difference.
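For example, Group 1’s mean can be recovered from the OLS model’s coefficients:

# Group 1's mean = intercept + Group effect
coef(olsmod)["(Intercept)"] + coef(olsmod)["Group"]  # 100.357 + 1.558 = 101.915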

Bayesian estimation of the t-test

Equal variances model

Next, I’ll illustrate how to estimate the equal variances t-test using Bayesian methods. We use brms (Buerkner, 2016), and the familiar R formula syntax which we used with the OLS model.

Estimating this model with R, thanks to the Stan and brms teams (Stan Development Team, 2016; Buerkner, 2016), is as easy as the linear regression model we ran above. If you haven’t yet installed brms, you need to install it first by running install.packages("brms"). Then, to access its functions, load the brms package to the current R session.

library(brms)

The most important function in the brms package is brm(), for Bayesian Regression Model(ing). The user needs only to input a model formula, just as above, and a data frame that contains the variables specified in the formula. brm() then translates the model into Stan language, and asks Stan to compile the model into C++ and estimate it (see Kruschke, 2014; McElreath, 2016 for details about estimation). The result is an R object with the estimated results (and much more). We run the model and save the results to mod_eqvar for equal variances model:

mod_eqvar <- brm(IQ ~ Group, data = d)
##  Family: gaussian (identity) 
## Formula: IQ ~ Group 
##    Data: d (Number of observations: 89) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
##  
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept   100.34      0.74    98.91   101.83       3803    1
## Group         1.59      1.01    -0.41     3.55       3958    1
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma     4.77      0.36     4.13     5.55       2653    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

Notice that the model contains three parameters, one of which is the shared standard deviation sigma. Compare the output of the Bayesian model to the one estimated with lm() (OLS):

Method  term         estimate  std.error
OLS     (Intercept)   100.357      0.726
OLS     Group           1.558      0.999
Bayes   b_Intercept   100.339      0.743
Bayes   b_Group         1.588      1.009

The point estimates (posterior means in the Bayesian model) and standard errors (SD of the respective posterior distribution) are pretty much identical.
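Here’s one way such a comparison could be put together (a sketch; the exact columns returned by fixef() may differ across brms versions):

# OLS point estimates and standard errors next to posterior means and SDs
round(cbind(summary(olsmod)$coefficients[, 1:2],  # OLS Estimate, Std. Error
            fixef(mod_eqvar)[, 1:2]),             # Posterior mean, Est.Error
      3)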

We now know the models behind t-tests, and how to estimate the equal variances t-test using the t.test(), lm(), and brm() functions. We also know how to run Welch’s t-test using t.test(). However, estimating the general linear model version of the unequal variances t-test model is slightly more complicated, because it involves specifying predictors for \(\sigma\), the standard deviation parameter.

Unequal variances model

We only need a small adjustment to the equal variances model to specify the unequal variances model:

\[y_{ik} \sim N(\mu_{ik}, \sigma_{ik})\] \[\mu_{ik} = \beta_0 + \beta_1 Group_{ik}\]

Notice that we now have subscripts for \(\sigma\), denoting that it varies between groups. In fact, we’ll write out a linear model for the standard deviation parameter!

\[\sigma_{ik} = \gamma_0 + \gamma_1 Group_{ik}\]

The model now includes, instead of a common \(\sigma\), one parameter for Group 0’s standard deviation \(\gamma_0\) (gamma), and one for the effect of Group 1 on the standard deviation \(\gamma_1\), such that group 1’s standard deviation is \(\gamma_0 + \gamma_1\). Therefore, we have 4 free parameters, two means and two standard deviations. (The full specification would include prior distributions for all the parameters, but that topic is outside of the scope of this post.) Let’s estimate!

brm() takes more complicated models by wrapping them inside bf() (short for brmsformula()), which is subsequently entered as the first argument to brm().

uneq_var_frm <- bf(IQ ~ Group, sigma ~ Group)

You can see that the formula regresses IQ on Group, such that we’ll have an intercept (implicitly included), and an effect of Group 1. Remarkably, we are also able to model the standard deviation sigma, and we regress it on Group (it will also have an intercept and effect of group).

mod_uneqvar <- brm(uneq_var_frm, data = d, cores=4)
##  Family: gaussian (identity) 
## Formula: IQ ~ Group 
##          sigma ~ Group
##    Data: d (Number of observations: 89) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
##  
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept         100.35      0.40    99.55   101.12       4000    1
## sigma_Intercept     0.94      0.11     0.72     1.17       3138    1
## Group               1.54      0.99    -0.38     3.47       2471    1
## sigma_Group         0.87      0.15     0.57     1.17       3222    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The model’s output contains our 4 parameters. Intercept is the mean for Group 0, Group is the “effect of group 1”. The sigma_Intercept is the standard deviation of Group 0, sigma_Group is the effect of group 1 on the standard deviation (the SD of Group 1 is sigma_Intercept + sigma_Group). The sigmas are implicitly modeled through a log-link (because they must be positive). To convert them back to the scale of the data, they need to be exponentiated. After taking the exponents of the sigmas, the results look like this:

parameter          Estimate  Est.Error  l-95% CI  u-95% CI
b_Intercept          100.35       0.40     99.55    101.12
b_sigma_Intercept      1.75       0.85      0.76      3.09
b_sigma_Group          2.41       0.37      1.76      3.23
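For those curious, here is a minimal sketch of how such a summary could be computed from the posterior draws (assuming that the pars argument does partial matching, as in the as.data.frame() calls used later in this post):

# Posterior draws of the sigma coefficients, exponentiated back to the data scale
post_sigma <- exp(as.data.frame(mod_uneqvar, pars = "b_sigma"))
# Posterior mean, SD, and 95% interval for each exponentiated parameter
round(t(apply(post_sigma, 2, function(x)
  c(Estimate = mean(x), Est.Error = sd(x),
    Q2.5 = unname(quantile(x, .025)),
    Q97.5 = unname(quantile(x, .975))))), 2)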

For comparison, here is the “observed SD” of group 0:

## [1] 2.52

Keep in mind that the parameters refer to Group 0’s mean (Intercept) and SD (sigma), and the difference between groups in those values (Group) and (sigma_Group). We now have fully Bayesian estimates of the 4 parameters of the unequal variances t-test model. Because p-values have no place in Bayesian inference, they are not reported in the output. However, you can calculate a quantity that is equivalent to a one-sided p-value from the posterior distribution: Take the proportion of posterior density (MCMC samples) above/below a reference value (0). This is definitely not the most useful thing you can do with a posterior distribution, but the fact that it numerically matches a one-sided p-value is quite interesting:

# Posterior distribution of Group effect
x <- as.data.frame(mod_uneqvar, pars = "b_Group")[,1]
# Proportion of MCMC samples below zero
round((sum(x < 0) / length(x)), 3)
## [1] 0.058
# One sided p-value from t-test
round(t.test(group_0, group_1, alternative = "less")$p.value, 3)
## [1] 0.055

I’m showing this remarkable fact (Marsman & Wagenmakers, no date) not to persuade you to stick with p-values, but to alleviate fears that these methods would always produce discrepant results.

Although this model is super easy to estimate with brm() (which, I should emphasize, uses Stan for the estimation procedures), the model seems, frankly speaking, strange. I am just not used to modeling variances, and I’ll bet a quarter that neither are you. Nevertheless, there it is!

Finally, let’s move on to Kruschke’s (2013) “Robust Bayesian Estimation” model.

Robust Bayesian Estimation

Kruschke’s robust model is a comparison of two groups, using five parameters: one mean and one standard deviation for each group, just as in the unequal variances model above, plus a “normality” parameter, \(\nu\) (nu), which means that we are now using a t-distribution to model the data. Using a t-distribution instead of a Gaussian makes the model (and therefore our inferences) less sensitive to extreme values (outliers). Here’s what the model looks like:

\[y_{ik} \sim T(\nu, \mu_{ik}, \sigma_{ik})\]

Read the above formula as “Y are random draws from a t-distribution with ‘normality’ parameter \(\nu\), mean \(\mu_{ik}\), and standard deviation \(\sigma_{ik}\)”. We have a linear model for the means and standard deviations:

\[\mu_{ik} = \beta_0 + \beta_1 Group_{ik}\]

and

\[\sigma_{ik} = \gamma_0 + \gamma_1 Group_{ik}\]

This model, as you can see, is almost identical to the unequal variances t-test, but it assumes the data are t-distributed and therefore includes the normality parameter. With brm() we can keep the unequal variances formula, but we have to specify the t-distribution by setting the family argument to student (as in Student’s t).

mod_robust <- brm(bf(IQ ~ Group, sigma ~ Group),
                  family=student,
                  data = d, cores=4)
##  Family: student (identity) 
## Formula: IQ ~ Group 
##          sigma ~ Group
##    Data: d (Number of observations: 89) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
##  
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Eff.Sample  Rhat
## Intercept        100.524     0.208  100.125  100.940       4000 0.999
## sigma_Intercept    0.001     0.196   -0.389    0.395       4000 1.001
## Group              1.028     0.417    0.206    1.833       3100 1.000
## sigma_Group        0.671     0.258    0.163    1.165       4000 1.000
## 
## Family Specific Parameters: 
##    Estimate Est.Error l-95% CI u-95% CI Eff.Sample  Rhat
## nu    1.853     0.482    1.135     2.99       3179 1.001
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The effect of Group is about one unit, with a 95% Credible Interval from 0.2 to 1.8.

Finally, let’s compare the results to those in Kruschke’s paper (2013, p. 578). Before we do this, I’ll convert the estimated parameters to means and standard deviations (instead of the “regression effects” produced by default). Recall that I recoded the group labels used by Kruschke in the paper: what he calls group 2 is group 0 (the control group) in our analyses, while group 1 is still group 1. In the following I transform the results and compute HDIs to obtain results most comparable with Kruschke’s:
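This is not the exact code used for the table below, but a minimal sketch of the kind of transformation involved; the hdi() function is assumed to come from the HDInterval package:

library(HDInterval)  # assumed here; provides hdi() for highest-density intervals
post <- as.data.frame(mod_robust)
transformed <- with(post, data.frame(
  group0_mean  = b_Intercept,
  group1_mean  = b_Intercept + b_Group,
  diff_means   = b_Group,
  group0_sigma = exp(b_sigma_Intercept),                  # undo the log link
  group1_sigma = exp(b_sigma_Intercept + b_sigma_Group),
  nu           = nu))
round(t(sapply(transformed, function(x) c(m = mean(x), hdi(x)))), 2)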

key             m      HDIlwr  HDIupr
group0_mean    100.52  100.12  100.94
group1_mean    101.55  100.80  102.24
diff_means       1.03    0.19    1.80
group0_sigma     1.02    0.65    1.41
group1_sigma     2.00    1.22    2.80
diff_sigmas      2.02    1.13    3.10
nu               1.85    1.04    2.76
nulog10          0.25    0.04    0.46

Notice that Kruschke reports posterior modes (2013, p. 578), whereas our point estimates are posterior means. The results for the group means are identical to two decimal places; the standard deviations are slightly more discrepant precisely because the paper reports modes and we report means.
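If you want point estimates more directly comparable to Kruschke’s, you could compute a rough posterior mode from the MCMC draws, for example via a kernel density estimate (post_mode() is a hypothetical helper, not something used above):

# Rough posterior mode: location of the peak of a kernel density estimate
post_mode <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}
post_mode(as.data.frame(mod_robust, pars = "b_Group")[, 1])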

Finally, here is how to estimate the model using the original code (Kruschke & Meredith, 2015):

library(BEST)
BEST <- BESTmcmc(group_0, group_1)
## 
## Processing function input....... 
## 
## Done. 
##  
## Beginning parallel processing using 3 cores. Console output will be suppressed.
## 
## Parallel processing completed.
## 
## MCMC took 0.283 minutes.
BEST
## MCMC fit results for BEST analysis:
## 100002 simulations saved.
##           mean     sd  median    HDIlo   HDIup Rhat n.eff
## mu1    100.525 0.2149 100.524 100.1108 100.955    1 57740
## mu2    101.548 0.3800 101.551 100.7853 102.289    1 59733
## nu       1.840 0.4775   1.764   1.0380   2.753    1 23934
## sigma1   1.052 0.2069   1.032   0.6737   1.464    1 35619
## sigma2   2.063 0.4359   2.021   1.2598   2.927    1 29495
## 
## 'HDIlo' and 'HDIup' are the limits of a 95% HDI credible interval.
## 'Rhat' is the potential scale reduction factor (at convergence, Rhat=1).
## 'n.eff' is a crude measure of effective sample size.

This output reports posterior means, medians, and HDI limits; you can verify that they match the values we reported above very closely. BESTmcmc() is a great function, but with brms you are able to estimate a vast variety of models.

Conclusion

Well, that ended up much longer than I intended. The aim was both to illustrate the ease of Bayesian modeling in R using brms (Buerkner, 2016) and Stan (Stan Development Team, 2016), and to highlight the fact that we can easily move from simple t-tests to more complex (and possibly better) models.

If you’ve followed through, you should be able to conduct Student’s (equal variances) and Welch’s (unequal variances) t-tests in R, and to think about those tests as instantiations of general linear models. Further, you should be able to estimate these models using Bayesian methods.

You should now also be familiar with Kruschke’s robust model for comparing two groups’ metric data, and be able to implement it in one line of R code. This model was able to find credible differences between two groups, although the frequentist t-tests and models reported p-values well above .05. That should be motivation enough to try robust (Bayesian) models on your own data.

Further reading

I didn’t take any space here to discuss the interpretation of Bayesian statistics. For this, I recommend Kruschke (2014) and McElreath (2016). See also Etz, Gronau, Dablander, Edelsbrunner, & Baribault (2016) for an introduction to Bayesian statistics.

References

Buerkner, P.-C. (2016). brms: Bayesian Regression Models using Stan. Retrieved from http://CRAN.R-project.org/package=brms
Cumming, G. (2014). The New Statistics Why and How. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966
Dienes, Z. (2011). Bayesian Versus Orthodox Statistics: Which Side Are You On? Perspectives on Psychological Science, 6(3), 274–290. https://doi.org/10.1177/1745691611406920
Etz, A., Gronau, Q. F., Dablander, F., Edelsbrunner, P. A., & Baribault, B. (2016). How to become a Bayesian in eight easy steps: An annotated reading list. ResearchGate. Retrieved from https://www.researchgate.net/publication/301981861_How_to_become_a_Bayesian_in_eight_easy_steps_An_annotated_reading_list
Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603. https://doi.org/10.1037/a0029146
Kruschke, J. K. (2014). Doing Bayesian Data Analysis: A Tutorial Introduction with R (2nd Edition). Burlington, MA: Academic Press.
Kruschke, J. K., & Meredith, M. (2015). BEST: Bayesian Estimation Supersedes the t-Test. Retrieved from https://CRAN.R-project.org/package=BEST
Lakens, D. (2015, January 26). The 20% Statistician: Always use Welch’s t-test instead of Student’s t-test. Retrieved from https://daniellakens.blogspot.com/2015/01/always-use-welchs-t-test-instead-of.html
Marsman, M., & Wagenmakers, E.-J. (no date). Three Insights from a Bayesian Interpretation of the One-Sided P Value. Retrieved from http://www.ejwagenmakers.com/inpress/MarsmanWagenmakersOneSidedPValue.pdf
McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
Morey, R. D., & Rouder, J. (2015). BayesFactor: Computation of Bayes Factors for Common Designs. Retrieved from https://CRAN.R-project.org/package=BayesFactor
Rouder, J. N., Morey, R. D., Verhagen, J., Province, J. M., & Wagenmakers, E.-J. (2016). Is There a Free Lunch in Inference? Topics in Cognitive Science, 8(3), 520–547. https://doi.org/10.1111/tops.12214
Stan Development Team. (2016). Stan: A C++ Library for Probability and Sampling, Version 2.14.1. Retrieved from http://mc-stan.org/
Vuorre, M. (2016, December 5). Introduction to Data Analysis using R. Retrieved from http://blog.efpsa.org/2016/12/05/introduction-to-data-analysis-using-r/

Meta-analysis is a special case of Bayesian multilevel modeling

Introduction

Hello everybody! Recently, there’s been a lot of talk about meta-analysis, and here I would just like to quickly show that Bayesian multilevel modeling nicely takes care of your meta-analysis needs, and that it is easy to do in R with the rstan and brms packages. As you’ll see, meta-analysis is a special case of Bayesian multilevel modeling when you are unable or unwilling to put a prior distribution on the meta-analytic effect size estimate.

The idea for this post came from Wolfgang Viechtbauer’s website, where he compared results for meta-analytic models fitted with his great (frequentist) package metafor and the Swiss army knife of multilevel modeling, lme4. It turns out that even though you can fit meta-analytic models with lme4, the results are slightly different from traditional meta-analytic models, because the studies’ known sampling variances are treated slightly differently (I’ll sketch the lme4 version after the metafor example below).

Here are the packages we’ll be using:

library(metafor)
library(lme4)
library(brms)
library(tidyverse)

The data

Here I’ll only focus on a simple random-effects meta-analysis of effect sizes, and will use the same example data as on the aforementioned website. The data are included in the metafor package, and describe the relationship between conscientiousness and medication adherence. The effect sizes are r-to-z transformed correlations.
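The code for loading the data isn’t shown in this post; here is a sketch of how they could be prepared, assuming metafor’s dat.molloy2014 dataset (the one used on Viechtbauer’s site) and a study label built from the author and year columns:

library(metafor)
# r-to-z transform the correlations and compute their sampling variances
dat <- escalc(measure = "ZCOR", ri = ri, ni = ni, data = dat.molloy2014)
dat$sei <- sqrt(dat$vi)  # standard errors of the transformed correlations
# Assumed label column used for plotting and as the grouping factor below
dat$study <- paste0(dat$authors, " (", dat$year, ")")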

study                       year   ni     ri     yi    vi   sei
Axelsson et al. (2009)      2009  109   0.19   0.19  0.01  0.10
Axelsson et al. (2011)      2011  749   0.16   0.16  0.00  0.04
Bruce et al. (2010)         2010   55   0.34   0.35  0.02  0.14
Christensen et al. (1995)   1995   72   0.27   0.28  0.01  0.12
Christensen et al. (1999)   1999  107   0.32   0.33  0.01  0.10
Cohen et al. (2004)         2004   65   0.00   0.00  0.02  0.13
Dobbels et al. (2005)       2005  174   0.18   0.18  0.01  0.08
Ediger et al. (2007)        2007  326   0.05   0.05  0.00  0.06
Insel et al. (2006)         2006   58   0.26   0.27  0.02  0.13
Jerant et al. (2011)        2011  771   0.01   0.01  0.00  0.04
Moran et al. (1997)         1997   56  -0.09  -0.09  0.02  0.14
O’Cleirigh et al. (2007)    2007   91   0.37   0.39  0.01  0.11
Penedo et al. (2003)        2003  116   0.00   0.00  0.01  0.09
Quine et al. (2012)         2012  537   0.15   0.15  0.00  0.04
Stilley et al. (2004)       2004  158   0.24   0.24  0.01  0.08
Wiebe & Christensen (1997)  1997   65   0.04   0.04  0.02  0.13

Here’s what these data look like (point estimates +- 2 SEM):

library(ggplot2)
ggplot(dat, aes(x=yi, y=study)) +
    geom_segment(aes(x = yi-sei*2, xend = yi+sei*2, y=study, yend=study)) +
    geom_point()

The model

We are going to fit a random-effects meta-analysis model to these observed effect sizes and their standard errors.

Here’s what this model looks like, loosely following notation from the R package Metafor (Viechtbauer, 2010) manual (p.6):

\[y_i \sim N(\theta_i, \sigma_i^2)\]

where each recorded effect size, \(y_i\), is a draw from a normal distribution centered on that study’s “true” effect size \(\theta_i\), with standard deviation equal to the study’s observed standard error \(\sigma_i\).

Our next set of assumptions is that the studies’ true effect sizes approximate some underlying effect size in the (hypothetical) population of all studies. We call this underlying population effect size \(\mu\), and its standard deviation \(\tau\), such that the true effect sizes are thus distributed:

\[\theta_i \sim N(\mu, \tau^2)\]

We now have two really interesting parameters: \(\mu\) tells us, all else being equal, what we may expect the “true” effect to be in the population of similar studies, and \(\tau\) tells us how much individual studies of this effect vary.

I think it is most straightforward to write this model as yet another mixed-effects model (metafor manual p.6):

\[y_i \sim N(\mu + \theta_i, \sigma^2_i)\]

where \(\theta_i \sim N(0, \tau^2)\): the studies’ deviations from the population effect are normally distributed with between-study heterogeneity \(\tau^2\). The reason this is a little confusing (to me at least) is that we know the \(\sigma_i\)s; knowing the sampling standard errors is what separates meta-analysis from more common regression modeling.

Estimation with metafor

Super easy!

library(metafor)
ma_out <- rma(data = dat, yi = yi, sei = sei, slab = dat$study)
summary(ma_out)
# 
# Random-Effects Model (k = 16; tau^2 estimator: REML)
# 
#   logLik  deviance       AIC       BIC      AICc  
#   8.6096  -17.2191  -13.2191  -11.8030  -12.2191  
# 
# tau^2 (estimated amount of total heterogeneity): 0.0081 (SE = 0.0055)
# tau (square root of estimated tau^2 value):      0.0901
# I^2 (total heterogeneity / total variability):   61.73%
# H^2 (total variability / sampling variability):  2.61
# 
# Test for Heterogeneity: 
# Q(df = 15) = 38.1595, p-val = 0.0009
# 
# Model Results:
# 
# estimate       se     zval     pval    ci.lb    ci.ub          
#   0.1499   0.0316   4.7501   <.0001   0.0881   0.2118      *** 
# 
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
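As an aside, here is roughly what the lme4 version mentioned in the introduction could look like. This is a sketch, not code from the original comparison; the lmerControl() options are assumed to relax the checks that would otherwise stop a model with one observation per grouping level:

# Inverse-variance weights stand in for the known sampling variances, but
# lmer() still estimates a residual variance, which is one reason its results
# differ slightly from rma()'s
lmer_out <- lmer(yi ~ 1 + (1 | study), weights = 1/vi, data = dat,
                 control = lmerControl(check.nobs.vs.nlev = "ignore",
                                       check.nobs.vs.nRE = "ignore"))
summary(lmer_out)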

The switch to Bayes

So far so good, we’re strictly in the realm of standard meta-analysis. But I would like to propose that instead of using custom meta-analysis software, we simply consider the above model as just another regression model, and fit it like we would any other (multilevel) regression model. That is, using Stan, usually through the brms interface. Going Bayesian allows us to assign prior distributions on the population-level parameters \(\mu\) and \(\tau\), and we would usually want to use some very mildly regularizing priors. Here, to make the results most comparable, I’ll use uniform (non-informative) priors:

\[\mu \sim U(-\infty, \infty)\]

and

\[\tau \sim U(0, 1000)\]
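If you want to see which priors can be specified for this model (and what brms would use by default), get_prior() lists the available prior classes; a quick sketch using the same formula and data as below:

# Which parameter classes can be given priors in this model?
get_prior(yi | se(sei) ~ 1 + (1 | study), data = dat)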

Estimation with brms

Here’s how to fit this model with brms:

library(brms)
brm_out <- brm(yi | se(sei) ~ 1 + (1|study), 
               prior = set_prior("uniform(0, 1000)", class = "sd"),
               data = dat, iter = 5000, warmup = 2000, cores = 4)
#  Family: gaussian(identity) 
# Formula: yi | se(sei) ~ 1 + (1 | study) 
#    Data: dat (Number of observations: 16) 
# Samples: 4 chains, each with iter = 5000; warmup = 2000; thin = 1; 
#          total post-warmup samples = 12000
#    WAIC: Not computed
#  
# Group-Level Effects: 
# ~study (Number of levels: 16) 
#               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# sd(Intercept)      0.1      0.04     0.04     0.19       3121    1
# 
# Population-Level Effects: 
#           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# Intercept     0.15      0.04     0.08     0.22       5612    1
# 
# Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
# is a crude measure of effective sample size, and Rhat is the potential 
# scale reduction factor on split chains (at convergence, Rhat = 1).

These results are essentially the same as the ones obtained with metafor: the population-level Intercept corresponds to \(\mu\), and sd(Intercept) to \(\tau\).

Comparing results

We can now compare the results of these two estimation methods. Of course, the Bayesian method has a tremendous advantage, because it results in an actual distribution of plausible values, whereas the frequentist method gives us only point estimates and standard errors.

Figure 1. Histogram of 80k samples from the posterior distribution of \(\mu\) (top left) and \(\tau\) (top right). Bottom left displays the multivariate posterior distribution of \(\mu\) (x-axis) and \(\tau\) (y-axis), light colors indicating increased plausibility of values. For each plot, the dashed lines display the maximum likelihood point estimate, and 95% confidence limits (only the point estimate is displayed for the multivariate figure.)

We can see from the numeric output, and especially the figures, that these modes of inference yield the same numerical results. Keep in mind though, that the Bayesian estimates actually allow you to discuss probabilities, and generally the things that we’d like to discuss when talking about results.

For example, what is the probability that the average effect size is greater than 0.2?

avg_es <- as.data.frame(brm_out, pars = "b_")[,1]
cat( (sum(avg_es > 0.2) / length(avg_es))*100, "%")
# 8.125 %

Forest plot

We may now draw a forest plot with a custom ggplot2 function (see source for code, it ended up too ugly to show here!):

fit_forest <- bayes_forest(data = dat, 
                           model = brm_out, 
                           title = "Conscientiousness & medication adherence", 
                           xlimits = c(-0.5, 1),
                           level = .95,
                           rma = ma_out)
fit_forest

This forest plot displays, as violin plots, the entire posterior distribution of each \(\theta_i\). The meta-analytic effect size \(\mu\) is also displayed amidst the varying \(\theta_i\)s. The mean and 95% CI limits of the posteriors are also displayed on the right in text form for all you precision fans.

The plot also shows each study’s observed mean effect size as a little x, and +- 2SEM as a thin line. These are shown slightly below each corresponding posterior distribution. Defying information overflow, the BLUPs from the model fitted with metafor are shown as points above the violins, with their corresponding 95% CI limits.
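Here is a sketch of how the study-specific effects (the \(\theta_i\)s) underlying the forest plot can be summarised. With a recent version of brms (an assumption; older versions return a simpler structure), coef() combines the population-level intercept with each study’s deviation:

# Posterior summaries of each study's effect (mu + theta_i); the "Intercept"
# slice picks the study-specific intercepts from coef()'s summary array
study_effects <- coef(brm_out)$study[, , "Intercept"]
round(head(study_effects), 2)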

Focus on Moran et al. (1997)’s observed effect size (the x): this is a really anomalous result compared to all the other studies. One might describe it as incredible, and that is indeed how the bayesian estimation procedure has treated it: the resulting posterior distribution is no longer equivalent to the observed effect size, but is shrunken toward the average effect size. Now look at the table above: this study had only 56 participants, so we should be more skeptical of its observed ES and adjust our beliefs about it in the context of the other studies. Therefore, our best guess about this study’s effect size, given all the other studies, is no longer the observed mean, but something closer to the average across the studies. Fantastic. Brilliant!

If this shrinkage business seems radical, consider Quine et al. (2012). This study had a much greater sample size (537), and therefore a smaller SE. It was also generally more in line with the average effect size estimate. Therefore, the observed mean ES and the mean of the posterior distribution are pretty much identical. This is also a fairly desirable feature.

Discussion

The way these different methods are presented (regression, meta-analysis, ANOVA, …), it is quite easy for a beginner, like me, to lose sight of the forest for the trees. I also feel that this is a general experience for students of applied statistics: Every experiment, situation, and question results in a different statistical method (or worse: “Which test should I use?”), and the student doesn’t see how the methods relate to each other. So I think focusing on the (regression) model is key, but often overlooked in favor of this sort of decision tree model of choosing statistical methods (McElreath, 2016, p.2).

Accordingly, I think we’ve ended up in a situation where meta-analysis, for example, is seen as somehow separate from all the other modeling we do, such as repeated measures t-tests. In fact I think applied statistics in Psychology may too often appear as an unconnected bunch of tricks and models, leading to confusion and inefficient implementation of appropriate methods.

Bayesian multilevel modeling

As I’ve been learning more about statistics, I’ve often noticed that some technique, applied in a specific set of situations, turns out to be a special case of a more general modeling approach. I’ll call this approach here Bayesian multilevel modeling, and won’t say much more than that it’s awesome (Gelman et al., 2013; McElreath, 2016). If you are forced to choose one statistical method to learn, it should be Bayesian multilevel modeling, because it allows you to do and understand most things, and allows you to see how similar all these methods are, under the hood.

Have a nice day.

Bibliography

Buerkner, P.-C. (2016). brms: Bayesian Regression Models using Stan. Retrieved from http://CRAN.R-project.org/package=brms

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition. Boca Raton: Chapman and Hall/CRC.

McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.

Stan Development Team. (2016). Stan: A C++ Library for Probability and Sampling, Version 2.11.1. Retrieved from http://mc-stan.org/

Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48.

Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer Science & Business Media.

Statistical inference: Prix fixe or à la carte?

Experimental investigations commonly begin with a hypothesis, an expectation of what one might find: “We hypothesize that alcohol leads to slower reactions to events in a driving simulator.” Data are then collected and analyzed to specifically address this hypothesis. Almost always, the support for or against the hypothesis is statistical, not intraocular (Krantz, 1999). However, the prevailing statistical paradigm—null hypothesis significance testing (NHST)—never tests the researcher’s offered hypothesis, but instead the “null hypothesis”: that there is no relationship between alcohol consumption and reaction time. NHST results in a p-value, and values of p less than .05 are taken to indicate that the data are sufficiently surprising under the null hypothesis that we must now accept the alternative hypothesis.

The NHST practice has been examined with a critical eye for decades by many prominent authors (Cohen, 1994; Krantz, 1999; Gigerenzer, 2004), and has most recently been challenged by an uprising of a variety of Bayesian methods. The Bayesian uprising has been prominently championed by a Band of Bayesians1, who in a long string of important publications and software packages have made the Bayesian approach an appealing, understandable, and, perhaps most importantly, practical method for researchers (Morey, Hoekstra, Rouder, Lee, & Wagenmakers, 2015; Morey, Romeijn, & Rouder, 2016; Rouder, Morey, & Wagenmakers, 2016; Rouder, Speckman, Sun, Morey, & Iverson, 2009; Wagenmakers, 2007; Morey & Rouder, 2015; JASP Team, 2016). Here, I would like to provide a brief commentary on a recent article about hypothesis testing and Bayesian inference, and remind us about the variety of the (Bayesian) statistical menu.

In a recent article of high statistical significance, “Is There a Free Lunch in Inference?” (Rouder, Morey, Verhagen, Province, & Wagenmakers, 2016), the authors argue that valid hypothesis testing requires not just a null hypothesis, but also a well-specified alternative hypothesis. Here, after paraphrasing Rouder et al.’s core argument, I would like to draw attention to some of the negative aspects of how the argument is presented. Briefly, it seems that by naming “inference”, but actually almost exclusively discussing hypothesis testing, the authors might have inadvertently downplayed the variety of (Bayesian) approaches to inference, and therefore readers may get the picture that the Bayes factor is the one and only way to utilize Bayes’ theorem in data analysis.

No Free Lunch in Inference

Menu item              Price
Grilled salmon         $18.99
Statistical inference  One (or more) well-defined probability distribution(s)
Target article         Free

The starting point for Rouder et al.’s argument is that the classical paradigm of statistical inference in Psychology—accepting alternative hypotheses when a p-value is less than .05—is a flawed and unprincipled method of stating evidence (p. 522). One main cause for this state of affairs is that this paradigm, which they call the p < .05 rule, assumes that there are “free lunches”. That is, p < .05 is used as if it can provide evidence against the null hypothesis without requiring an alternative hypothesis (what the researcher actually expects / believes / wants to test). Unfortunately, there are no free lunches:

We may have the intuition that our conclusions are more valid, more objective, and more persuasive if it is not tied to a particular alternative. This intuition, however, is wrong. We argue here that statistical evidence may be properly understood and quantified only with reference to detailed alternatives. (Rouder et al., 2016, p. 523)

Rouder et al. then argue that psychologists ought to provide well-defined alternatives if they wish to test hypotheses, and that the hypothesis-testing should be performed by computing Bayes factors (BF). Indeed, Bayes factors are arguably superior to p-values: the BF provides an elegant method for comparing two or more specific hypotheses, if they can be represented as probability distributions.

Figure 1. A Bayesian. Would you buy lunch from him?

The authors then give a detailed explanation of Bayes’ rule (“The deep meaning of Bayes’ rule is that evidence is prediction”, p. 531), and Bayes factors (“[…] ratios of predictive densities are called Bayes factors.”, p. 532). The discussion is illuminating, and recommended reading for anyone not familiar with these (see also my favourite exposition in The philosophy of Bayes factors and the quantification of statistical evidence, Morey, Romeijn, & Rouder, 2016). Most importantly, the Bayes factor is a tool for comparing the predictions from two or more hypotheses (coded as probability distributions.) In summary, there are no free lunches in inference; the price of a hypothesis test is a specified alternative hypothesis, and Bayes factors are the best candidate method for testing specified hypotheses. I’m writing this commentary about the use of “inference” in the previous sentence.
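For concreteness, and purely as an illustration that is not part of the target article: a default two-sample Bayes factor of the kind the authors advocate can be computed with the BayesFactor package (Morey & Rouder, 2015). The data vectors x and y here are hypothetical:

library(BayesFactor)
set.seed(1)
x <- rnorm(30, 0, 1)    # hypothetical control group
y <- rnorm(30, 0.5, 1)  # hypothetical treatment group
# Compares a point-null hypothesis against a default Cauchy-distributed
# alternative on the standardized effect size
ttestBF(x = x, y = y)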

Statistical lunch: Price is variable

I worry that the argument relies on too restricted a view of statistical inference. The foregoing discussion (or rather, the one in the actual paper) in favor of the Bayes factor as a tool for hypothesis testing is brilliant, but the paper’s titular mention of “Inference” seems to miss the mark. Statistical inference may be done by testing hypotheses (go Bayes factor!), but does not have to be done so. Throughout the article, hypothesis testing is loudly advocated while estimation is quietly swept under the rug (there is a brief discussion of estimation vs. testing at the end of the article, however).

Figure 2. A table of “inferential stances” reproduced without permission from Kruschke & Liddell (2015; the original caption: “Two conceptual distinctions in the practice of data analysis. Rows show point-value hypothesis testing versus estimating magnitude with uncertainty. Columns show frequentist versus Bayesian methods”.)

The authors briefly discuss Confidence Intervals as a cure to the flawed p < .05 rule, and conclude that CIs are hardly an improvement, and that most researchers use CIs to test hypotheses anyway, by looking at whether critical values are outside the interval. This description (p. 528) seems quite accurate, but fails to mention that if the goal of CIs is estimation with uncertainty, not hypothesis testing (Figure 2), inference is quite cheap indeed. However, we can and should do better, with Bayesian Credible Intervals (Morey, Hoekstra, Rouder, Lee, & Wagenmakers, 2015). But Credible Intervals are discounted as well:

Inference with credible intervals does not satisfy Bayes’ rule. (Rouder et al., 2016, p. 535 [referring to Kruschke’s ROPE method, Kruschke, 2011; Kruschke, 2014])

However, because estimation is not inherently about hypothesis testing, it is unfair to claim that inferencing (that’s a word now) via estimation is inadequate or unsatisfying because it doesn’t satisfy the rules of hypothesis testing. In fact, estimation via Bayes’ theorem can be both cheap and accurate, and “valid” if one sticks with estimation. Here’s the gist: the “price of inference” is measured by the specificity of the alternative hypothesis, and (Bayesian, of course) parameter estimation can give you a good meal for a relatively small fee. However, I agree that paying more (in prior specificity) can take you from one-dollar pizza to lobster: the more an inferencer (that’s also a word now) is willing to pay, the more information she is able to inject into the posterior, thereby possibly improving inference. In fact, (strongly) regularizing priors (ones that place most of the density near zero) improve the out-of-sample predictive performance of statistical models, and some argue that this is the yardstick by which statistical inference and model comparison ought to be judged (Gelman et al., 2013; McElreath, 2016).
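As a small, hypothetical illustration of paying that modest price in brms (the package used in the earlier posts): a mildly regularizing prior on the regression coefficients places most of its density near zero.

library(brms)
# A mildly regularizing prior on regression coefficients
reg_prior <- set_prior("normal(0, 1)", class = "b")
# fit <- brm(y ~ x, data = somedata, prior = reg_prior)  # hypothetical model and data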

In summary, estimation is a valid Bayesian form of inference with a flexible pricing model, but is unfortunately not well recognized in Rouder et al. (2016). In fact, in many situations with very little prior information (a rare situation, I admit), Bayes factors may be too expensive, and we may be better off withholding a strict hypothesis test in favor of estimation.

Conclusion

Psychological researchers are increasingly learning and applying Bayesian methods in published research. This is great, but I hope that researchers would also consider when it is appropriate to test hypotheses, and when it is appropriate to estimate (and when it’s appropriate to call it a day and go home!). The Bayesian paradigm, with modern advances in MCMC methods (McElreath, 2016; Kruschke, 2014), allows estimation of a truly vast variety of models, leading to great flexibility in the kinds of questions one can ask of data and ultimately, psychological phenomena. If psychologists, in learning Bayesian methods, only focus on Bayes factors, I fear that they will miss this opportunity.

I’ll leave the final words to the Band of Bayesians, who have more eloquently described what I’ve here tried to convey, and what I feel Rouder et al. (2016) has unfortunately downplayed:

For psychological science to be a healthy science, both estimation and hypothesis testing are needed. Estimation is necessary in pretheoretical work before clear predictions can be made, and is also necessary in posttheoretical work for theory revision. But hypothesis testing, not estimation, is necessary for testing the quantitative predictions of theories. Neither hypothesis testing nor estimation is more informative than the other; rather, they answer different questions. Using estimation alone turns science into an exercise in keeping records about the effect sizes in diverse experiments, producing a massive catalogue devoid of theoretical content; using hypothesis testing alone may cause researchers to miss rich, meaningful structure in data. For researchers to obtain principled answers to the full range of questions they might ask, it is crucial for estimation and hypothesis testing to be advocated side by side. (Morey, Rouder, Verhagen, & Wagenmakers, 2014, p. 1290)

References

Cohen, J. (1994). The earth is round (p <. 05). American Psychologist, 49(12), 997.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition. Boca Raton: Chapman and Hall/CRC.

Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587–606. http://doi.org/10.1016/j.socec.2004.09.033

Krantz, D. H. (1999). The Null Hypothesis Testing Controversy in Psychology. Journal of the American Statistical Association, 94(448), 1372–1381. http://doi.org/10.2307/2669949

Kruschke, J. K. (2011). Bayesian Assessment of Null Values Via Parameter Estimation and Model Comparison. Perspectives on Psychological Science, 6(3), 299–312. http://doi.org/10.1177/1745691611406925

Kruschke, J. K. (2014). Doing Bayesian Data Analysis: A Tutorial Introduction with R (2nd Edition). Burlington, MA: Academic Press.

Kruschke, J. K., & Liddell, T. M. (2015). The Bayesian New Statistics: Two Historical Trends Converge. Available at SSRN 2606016. Retrieved from http://www.valuewalk.com/wp-content/uploads/2015/05/SSRN-id2606016.pdf

McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.

Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E.-J. (2015). The Fallacy of Placing Confidence in Confidence Intervals. Retrieved from http://andrewgelman.com/wp-content/uploads/2014/09/fundamentalError.pdf

Morey, R. D., Romeijn, J.-W., & Rouder, J. N. (2016). The philosophy of Bayes factors and the quantification of statistical evidence. Journal of Mathematical Psychology. http://doi.org/10.1016/j.jmp.2015.11.001

Morey, R. D., & Rouder, J. (2015). BayesFactor: Computation of Bayes Factors for Common Designs. Retrieved from https://CRAN.R-project.org/package=BayesFactor

Morey, R. D., Rouder, J. N., Verhagen, J., & Wagenmakers, E.-J. (2014). Why Hypothesis Tests Are Essential for Psychological Science A Comment on Cumming (2014). Psychological Science, 25(6), 1289–1290. http://doi.org/10.1177/0956797614525969

Rouder, J., Morey, R., & Wagenmakers, E.-J. (2016). The Interplay between Subjectivity, Statistical Practice, and Psychological Science. Collabra, 2(1). http://doi.org/10.1525/collabra.28

Rouder, J. N., Morey, R. D., Verhagen, J., Province, J. M., & Wagenmakers, E.-J. (2016). Is There a Free Lunch in Inference? Topics in Cognitive Science, 8(3), 520–547. http://doi.org/10.1111/tops.12214

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237. http://doi.org/10.3758/PBR.16.2.225

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804. http://doi.org/10.3758/BF03194105


  1. I use this term to refer to a group of authors in the most respectful way.
