Bayes Factors with brms

Here’s a short post on how to calculate Bayes Factors with the R package brms (Buerkner, 2016) using the Savage-Dickey density ratio method (Wagenmakers, Lodewyckx, Kuriyal, & Grasman, 2010).

To get up to speed with what the Savage-Dickey density ratio method is–or what Bayes Factors are–please read Wagenmakers et al. 2010. (The paper is available on the author’s webpage.) Here, I’ll only show the R & brms code to do the calculations that Wagenmakers et al. (2010) discuss. In their paper, they used WinBUGS, which requires quite a bit of code to sample from even a relatively simple model. brms on the other hand uses the familiar R formula syntax, making it easy to use. brms also does the MCMC sampling with Stan (Stan Development Team, 2016a & 2016b), or rather creates Stan code from a specified R model formula by what can only be described as string processing magic, making the sampling very fast. Let’s get straight to the examples in Wagenmakers et al. (2010)

# We'll work in tidyverse
library(tidyverse)

Example 0

Wagenmakers and colleagues begin with a simple example of 10 true/false questions: We observe a person answering 9 (s) out of 10 (k) questions correctly.

d <- data.frame(s = 9, k = 10)
d
##   s  k
## 1 9 10

We are interested in the person’s latent ability to answer similar questions correctly. This ability is represented by \(\theta\) (theta), which for us will be the probability parameter (sometimes also called the rate parameter) in a binomial distribution. See Wagenmakers et al. (2010) for formulas. The maximum likelihood (point) estimate for \(\theta\) is n/k, the proportion .9.

The first thing we’ll need to specify with respect to our statistical model is the prior probability distribution for \(\theta\). As in Wagenmakers et al. 2010, we specify a uniform prior, representing no prior information about the person’s ability to aswer the questions. For the binomial probability parameter, \(Beta(\alpha = 1, \beta = 1)\) is a uniform prior.

pd <- tibble(
    x = seq(0, 1, by = .01),
    Prior = dbeta(x, 1, 1)
    )

ggplot(pd, aes(x, Prior)) +
    geom_line() +
    coord_cartesian(xlim = 0:1, ylim = c(0, 6), expand = 0) +
    labs(y = "Density", x = bquote(theta))

The solid line represents the probability density assigned to values of \(\theta\) by this prior probability distribution. You can see that it is 1 for all possible parameter values: They are all equally likely a priori. For this simple illustration, we can easily calculate the posterior distribution by adding the number of correct and incorrect answers to the parameters of the prior Beta distribution.

pd$Posterior <- dbeta(pd$x, 10, 2)
pdw <- gather(pd, key=Type, value=density, Prior:Posterior)
ggplot(pdw, aes(x, density, col=Type)) +
    geom_line() +
    annotate("point", x=c(.5, .5), y = c(pdw$density[pdw$x==.5])) +
    annotate("label", x=c(.5, .5), 
             y = pdw$density[pdw$x==.5], 
             label = round(pdw$density[pdw$x==.5], 3),
             vjust=-.5)

The Savage-Dickey density ratio is calculated by dividing the posterior density by the prior density at a specific parameter value. Here, we are interested in .5, a “null hypothesis” value indicating that the person’s latent ability is .5, i.e. that they are simply guessing.

filter(pd, x == .5) %>% 
    mutate(BF01 = Posterior/Prior,
           BF10 = 1/BF01) %>% 
    round(3)
## # A tibble: 1 x 5
##       x Prior Posterior  BF01  BF10
##   <dbl> <dbl>     <dbl> <dbl> <dbl>
## 1   0.5     1     0.107 0.107 9.309

OK, so in this example we are able to get to the posterior with simply adding values into the parameters of the Beta distribution, but let’s now see how to get to this problem using brms.

library(brms)

First, here’s the brms formula of the model:

s | trials(k) ~ 0 + intercept, 
family=binomial(link="identity"), 
data = d

Read the first line as “s successes from k trials regressed on intercept”. That’s a little clunky, but bear with it. If you are familiar with R’s modeling syntax, you’ll be wondering why we didn’t simply specify ~ 1–R’s default notation for an intercept. The reason is that brms by default uses a little trick in parameterizing the intercept which speeds up the MCMC sampling. In order to specify a prior for the intercept, you’ll have to take the default intercept out (0 +), and use the reserved string intercept to say that you mean the regular intercept. See ?brmsformula for details. (For this model, with only one parameter, this complication doesn’t matter, but I wanted to introduce it early on so that you’d be aware of it when estimating multi-parameter models.)

The next line specifies that the data model is binomial, and that we want to model it’s parameter through an identity link. Usually when you model proportions or binary data, you’d use a logistic (logistic regression!), probit or other similar link function. In fact this is what we’ll do for later examples. Finally, we’ll use the data frame d.

OK, then we’ll want to specify our priors. Priors are extremo important for Bayes Factors–and probabilistic inference in general. To help set priors, we’ll first call get_priors() with the model information, which is basically like asking brms to tell what are the possible priors, and how to specify then, given this model.

get_prior(s | trials(k) ~ 0 + intercept, 
          family=binomial(link="identity"),
          data = d)
##   prior class      coef group nlpar bound
## 1           b                            
## 2           b intercept

The first line says that there is only one class of parameters b, think of class b as “betas” or “regression coefficients”. The second line says that the b class has only one parameter, the intercept. So we can set a prior for the intercept, and this prior can be any probability distribution in Stan language. We’ll create this prior using brms’ set_prior(), give it a text string representing the Beta(1, 1) prior for all parameters of class b (shortcut, could also specify that we want it for the intercept specifically), and then say the upper and lower bounds (\(\theta\) must be between 0 and 1).

Prior <- set_prior("beta(1, 1)", class = "b", lb = 0, ub = 1)

Almost there. Now we’ll actually sample from the model using brm(), give it the model, priors, data, ask it to sample from priors (for the density ratio!), and set a few extra MCMC parameters.

m <- brm(s | trials(k) ~ 0 + intercept, 
         family = binomial(link="identity"),
         prior = Prior,
         data = d,
         sample_prior = TRUE,
         iter = 1e4,
         cores = 4)

We can get the estimated parameter by asking the model summary:

summary(m)
##  Family: binomial(identity) 
## Formula: s | trials(k) ~ 0 + intercept 
##    Data: d (Number of observations: 1) 
## Samples: 4 chains, each with iter = 10000; warmup = 5000; thin = 1; 
##          total post-warmup samples = 20000
##    WAIC: Not computed
##  
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## intercept     0.83       0.1     0.59     0.98      15654    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The Credible Interval matches exactly what’s reported in the paper. The point estimate differs slightly because here we see the posterior mean, whereas in the paper, Wagenmakers et al. report the posterior mode. I’ll draw a line at their posterior mode, below, to show that it matches.

samples <- posterior_samples(m, "b")
head(samples)
##   b_intercept   prior_b
## 1   0.9496061 0.4433162
## 2   0.6241450 0.6781995
## 3   0.9522157 0.1354565
## 4   0.7378991 0.4176384
## 5   0.7290172 0.9713203
## 6   0.7078843 0.9575486
gather(samples, Type, value) %>% 
    ggplot(aes(value, col=Type)) +
    geom_density() +
    labs(x = bquote(theta), y = "Density") +
    geom_vline(xintercept = .89)  # Vertical line at .89

We can already see the densities, so all that’s left is to obtain the exact values at the value of interest (.5) and take the \(\frac{posterior}{prior}\) ratio. Instead of doing any of this by hand, we’ll use brms’ function hypothesis() that allows us to test point hypotheses using the Dickey Savage density ratio. For this function we’ll need to specify the point of interest, .5, as the point hypothesis to be tested.

h <- hypothesis(m, "intercept = 0.5")
print(h, digits = 4)
## Hypothesis Tests for class b:
##                       Estimate Est.Error l-95% CI u-95% CI Evid.Ratio Star
## (intercept)-(0.5) = 0   0.3336    0.1032   0.0876   0.4761     0.1078    *
## ---
## '*': The expected value under the hypothesis lies outside the 95% CI.

The Evid.Ratio is our Bayes Factor BF01. Notice that it matches the value 0.107 pretty well! You can also plot this hypothesis:

plot(h)

However, I think the default plot isn’t fantastic (not a fan of the axis adjustments, title). Fortunately, plot(hypothesis) returns a ggplot2 object, which is easily customized.

p <- plot(h, plot = F, theme = theme_get())[[1]]
p + scale_x_continuous(breaks = seq(-.5, .5, by = .25),
                       labels = seq(-.5, .5, by = .25)+.5)

OK, so that was a lot of work for such a simple problem, but the real beauty of brms (and Stan) is the unparalleled scalability: We can easily solve a problem with one row of data and one parameter, and it won’t take much more to solve a problem with tens of thousands of rows of data, and hundreds of parameters. Let’s move on to the next example from Wagenmakers et al. (2010).

Example 1: Equality of Proportions

For context, please refer to the paper.

d <- data.frame(
    pledge = c("yes", "no"),
    s = c(424, 5416),
    n = c(777, 9072)
)
d
##   pledge    s    n
## 1    yes  424  777
## 2     no 5416 9072

They use Beta(1, 1) priors for both rate parameters, which we’ll do as well. Notice that usually a regression formula has an intercept and a coefficient (e.g. effect of group.) By taking the intercept out (0 +) we can define two pledger-group proportions instead, and set priors on these. If we used an intercept + effect formula, we could set a prior on the effect itself.

get_prior(s | trials(n) ~ 0 + pledge, 
          family=binomial(link="identity"),
          data = d)
##   prior class      coef group nlpar bound
## 1           b                            
## 2           b  pledgeno                  
## 3           b pledgeyes

We can set the Beta prior for both groups’ rate with one line of code by setting the prior on the b class without specifying the coef.

Prior <- set_prior("beta(1, 1)", class = "b", lb = 0, ub = 1)

Like above, let’s estimate.

m1 <- brm(s | trials(n) ~ 0 + pledge, 
         family = binomial(link="identity"),
         prior = Prior,
         sample_prior = TRUE,
         iter = 1e4,
         data = d,
         cores = 4)

Our estimates match the MLEs reported in the paper:

print(m1, digits=3)
##  Family: binomial(identity) 
## Formula: s | trials(n) ~ 0 + pledge 
##    Data: d (Number of observations: 2) 
## Samples: 4 chains, each with iter = 10000; warmup = 5000; thin = 1; 
##          total post-warmup samples = 20000
##    WAIC: Not computed
##  
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## pledgeno     0.597     0.005    0.587    0.607      18588    1
## pledgeyes    0.546     0.018    0.511    0.580      19173    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

To get the density ratio Bayes Factor, we’ll need to specify a text string as our hypothesis. Our hypothesis is that the rate parameters \(\theta_1\) and \(\theta_2\) are not different: \(\theta_1\) = \(\theta_2\). The alternative, then, is the notion that the parameter values differ.

h1 <- hypothesis(m1, "pledgeyes = pledgeno")
print(h1, digits = 3)
## Hypothesis Tests for class b:
##                          Estimate Est.Error l-95% CI u-95% CI Evid.Ratio
## (pledgeyes)-(pled... = 0   -0.051     0.018   -0.088   -0.016      0.388
##                          Star
## (pledgeyes)-(pled... = 0    *
## ---
## '*': The expected value under the hypothesis lies outside the 95% CI.

As noted in the paper, a difference value of 0 is about twice as well supported before seeing the data, i.e. the null hypothesis of no difference is twice less likely after seeing the data:

1/h1$hypothesis$Evid.Ratio  # BF10
## [1] 2.577207

The paper reports BF01 = 0.47, so we’re getting the same results (as we should.) You can also compare this figure to what’s reported in the paper.

h1p1 <- plot(h1, theme = theme_get(), plot = F)[[1]] +
    theme(legend.position = "top")
h1p2 <- plot(h1, theme = theme_get(), plot = F)[[1]] + 
    coord_cartesian(xlim = c(-.05, .05), ylim = c(0, 5)) +
    theme(legend.position = "top")
gridExtra::grid.arrange(h1p1, h1p2, nrow = 1)

Moving right on to Example 2, skipping the section on “order restricted analysis”.

Example 2: Hierarchical Bayesian one-sample proportion test

The data for example 2 is not available, but we’ll simulate similar data. The simulation assumes that the neither-primed condition average correct probability is 50%, and that the both-primed condition benefit is 5%. Obviously, the numbers here won’t match anymore, but the data reported in the paper has an average difference in proportions of about 4%.

set.seed(5)
d <- tibble(
    id = c(rep(1:74, each = 2)),
    primed = rep(c("neither", "both"), times = 74),
    prime = rep(c(0, 1), times = 74),  # Dummy coded
    n = 21,
    correct = rbinom(74*2, 21, .5 + prime * .05)
)
group_by(d, primed) %>% summarize(p = sum(correct)/sum(n))
## # A tibble: 2 x 2
##    primed         p
##     <chr>     <dbl>
## 1    both 0.5418275
## 2 neither 0.4993565

This data yields a similar t-value as in the paper.

t.test(correct/n ~ primed, paired = T, data = d)
## 
##  Paired t-test
## 
## data:  correct/n by primed
## t = 2.3045, df = 73, p-value = 0.02404
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.005741069 0.079201016
## sample estimates:
## mean of the differences 
##              0.04247104

Instead of doing a probit regression, I’m going to do logistic regression. Therefore we define the prior on the log-odds scale. The log odds for the expected probability of .5 is 0. I prefer log-odds because–although complicated–they make sense, unlike standardized effect sizes. Note that the probit scale would also be fine as they are very similar.

Let’s just get a quick intuition about effects in log-odds: The change in log odds from p = .5 to .55 is about 0.2.

library(arm)
tibble(rate = seq(0, 1, by = .01),
    logit = logit(rate)) %>%
    ggplot(aes(rate, logit)) +
    geom_line(size=1) +
    geom_segment(x = 0, xend = 0.55, y = .2, yend = .2, size=.4) +
    geom_segment(x = 0, xend = 0.5, y = 0, yend = 0, size=.4) +
    coord_cartesian(ylim = c(-2, 2), expand=0)

We are cheating a little because we know these values, having simulated the data. However, log-odds are not straightforward (!), and this knowledge will allow us to specify better priors in this example. Let’s get the possible priors for this model by calling get_prior(). Notice that the model now includes id-varying “random” effects, and we model them from independent Gaussians by specifying || instead of | which would give a multivariate Gaussian on the varying effects.

get_prior(correct | trials(n) ~ 0 + intercept + prime + 
              (0 + intercept + prime || id), 
          family=binomial(link="logit"),
          data = d)
##                 prior class      coef group nlpar bound
## 1                         b                            
## 2                         b intercept                  
## 3                         b     prime                  
## 4 student_t(3, 0, 10)    sd                            
## 5                        sd              id            
## 6                        sd intercept    id            
## 7                        sd     prime    id

The leftmost column gives the pre-specified defaults used by brms. Here are the priors we’ll specify. The most important pertains to prime, which is going to be the effect size in log-odds. Our prior for the log odds of the prime effect is going to be a Gaussian distribution centered on 0, with a standard deviation of .2, which is rather diffuse.

Prior <- c(
    set_prior("normal(0, 10)", class = "b", coef = "intercept"),
    set_prior("cauchy(0, 10)", class = "sd"),
    set_prior("normal(0, .2)", class = "b", coef = "prime")
)

Then we estimate the model using the specified priors.

m2 <- brm(correct | trials(n) ~ 0 + intercept + prime + 
              (0 + intercept + prime || id), 
         family = binomial(link="logit"),
         prior = Prior,
         sample_prior = TRUE,
         iter = 1e4,
         data = d,
         cores = 4)

OK, so our results here will be different because we didn’t parameterize the prior on a standardized effect size because a) I don’t like standardized effect sizes, and b) I would have to play around with the Stan code, and this post is about brms. Two good reasons not to use standardized effect sizes! Anyway, here are the estimated parameters:

summary(m2)
##  Family: binomial(logit) 
## Formula: correct | trials(n) ~ 0 + intercept + prime + (0 + intercept + prime || id) 
##    Data: d (Number of observations: 148) 
## Samples: 4 chains, each with iter = 10000; warmup = 5000; thin = 1; 
##          total post-warmup samples = 20000
##    WAIC: Not computed
##  
## Group-Level Effects: 
## ~id (Number of levels: 74) 
##               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(intercept)     0.07      0.05     0.00     0.18       8862    1
## sd(prime)         0.13      0.08     0.01     0.30       6970    1
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## intercept     0.01      0.05    -0.09     0.10      20000    1
## prime         0.15      0.07     0.02     0.29      20000    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

And our null-hypothesis density ratio:

h2 <- hypothesis(m2, "prime = 0")
h2
## Hypothesis Tests for class b:
##             Estimate Est.Error l-95% CI u-95% CI Evid.Ratio Star
## (prime) = 0     0.15      0.07     0.02     0.29       0.27    *
## ---
## '*': The expected value under the hypothesis lies outside the 95% CI.

Priming effect of zero log-odds is 4 times less likely after seeing the data:

1/h2$hypothesis$Evid.Ratio
## [1] 3.681923

This is best illustrated by plotting the densities:

plot(h2)

Conclusion

Read the paper! Hopefully you’ll be able to use brms’ hypothesis() function to calculate bayes factors when needed.

References

Buerkner, P.-C. (2016). brms: Bayesian Regression Models using Stan. Retrieved from http://CRAN.R-project.org/package=brms
Stan Development Team. (2016a). RStan: the R interface to Stan. Retrieved from http://mc-stan.org/
Stan Development Team. (2016b). Stan: A C++ Library for Probability and Sampling, Version 2.14.1. Retrieved from http://mc-stan.org/
Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage–Dickey method. Cognitive Psychology, 60(3), 158–189. https://doi.org/10.1016/j.cogpsych.2009.12.001

Appendix

brms produces exceptionally well written Stan code. Stan code of the last example model:

cat(rstan::get_stancode(m2$fit))
## // generated with brms 1.6.1
## functions { 
## } 
## data { 
##   int<lower=1> N;  // total number of observations 
##   int Y[N];  // response variable 
##   int<lower=1> K;  // number of population-level effects 
##   matrix[N, K] X;  // population-level design matrix 
##   // data for group-level effects of ID 1 
##   int<lower=1> J_1[N]; 
##   int<lower=1> N_1; 
##   int<lower=1> M_1; 
##   vector[N] Z_1_1; 
##   vector[N] Z_1_2; 
##   int trials[N];  // number of trials 
##   int prior_only;  // should the likelihood be ignored? 
## } 
## transformed data { 
## } 
## parameters { 
##   vector[K] b;  // population-level effects 
##   vector<lower=0>[M_1] sd_1;  // group-level standard deviations 
##   vector[N_1] z_1[M_1];  // unscaled group-level effects 
##   // parameters to store prior samples
##   real<lower=0> prior_sd_1;
## } 
## transformed parameters { 
##   // group-level effects 
##   vector[N_1] r_1_1; 
##   vector[N_1] r_1_2; 
##   r_1_1 = sd_1[1] * (z_1[1]); 
##   r_1_2 = sd_1[2] * (z_1[2]); 
## } 
## model { 
##   vector[N] mu; 
##   mu = X * b; 
##   for (n in 1:N) { 
##     mu[n] = mu[n] + (r_1_1[J_1[n]]) * Z_1_1[n] + (r_1_2[J_1[n]]) * Z_1_2[n]; 
##   } 
##   // prior specifications 
##   b[1] ~ normal(0, 10); 
##   b[2] ~ normal(0, .2); 
##   sd_1 ~ cauchy(0, 10); 
##   z_1[1] ~ normal(0, 1); 
##   z_1[2] ~ normal(0, 1); 
##   // likelihood contribution 
##   if (!prior_only) { 
##     Y ~ binomial_logit(trials, mu); 
##   } 
##   // additionally draw samples from priors
##   prior_sd_1 ~ cauchy(0,10);
## } 
## generated quantities { 
##   real prior_b_1; 
##   real prior_b_2; 
##   // additionally draw samples from priors 
##   prior_b_1 = normal_rng(0,10); 
##   prior_b_2 = normal_rng(0,.2); 
## }
comments powered by Disqus