brms workshop >> 2020-10-22

Bayesian data analysis

What is it?

“Bayesian inference is reallocation of credibility across possibilities.” (Kruschke 2014, p. 15)

“Bayesian data analysis takes a question in the form of a model and uses logic to produce an answer in the form of probability distributions.” (McElreath 2020, p. 10)

“Bayesian inference is the process of fitting a probability model to a set of data and summarizing the result by a probability distribution on the parameters of the model and on unobserved quantities such as predictions for new observations.” (Gelman et al. 2013, p. 1)

What is it?

  • Bayesian inference consists of updating prior information with evidence from data to yield posterior information
  • Use probability distributions to express information (uncertainty)

How is it different from what I know?

  • Bayesian data analysis may not be that different from what you already know (i.e. orthodox statistics)
  • In the absence of strong prior information, and presence of large data, the same model evaluated in a Bayesian or orthodox framework will yield the same numerical answers
    • The interpretations of, and philosophies behind, the numbers are vastly different
  • In many ways, orthodox statistical methods can be thought of as approximations to Bayesian methods
  • Hypothesis tests are very different between the two frameworks
  • In practice, Bayesian statistics are an extremely flexible modelling framework

What can it do for me?

You can estimate models in the Bayesian context that might not otherwise be possible. My first Bayesian analysis was conducted out of necessity: the model I wanted to use did not converge to a solution when I attempted orthodox methods (maximum likelihood estimation). Around the same time, I heard about Stan. I wrote some Stan code, the model converged without problems, and I was able to use the model I wanted.

With Bayes, you can actually be confident in your Confidence Intervals. I have a difficult time understanding p-values and Confidence Intervals. It can be difficult to understand what the uncertainty estimates mean when hypothetical replications are difficult to imagine in a given context. With a posterior distribution at hand, the corresponding probability values have a direct interpretation as credibility, uncertainty, or plausibility.

Bayesian methods make it easy to carry (un)certainty forward to other quantities of interest. It can often be difficult to obtain uncertainty estimates for derived quantities with orthodox methods. For example, effect size metrics are often reported without error bars (they can be obtained, but the methods for doing so can be finicky and are rarely used).

To be sure, the Bayesian framework does not come for free. The methods might be difficult to communicate to others, at least until orthodox statistics are replaced in undergraduate applied statistics education. The necessity of complex computational algorithms makes it time-consuming—you will enjoy doing BDA more if you have a fast computer.

Bayesian inference

  • What are the plausible values of parameters \(\theta\) after observing data?
  • The posterior distribution \(p(\theta \vert Y)\) is the answer
  • Bayes’ theorem describes how to compute this distribution

\[ p(\theta \vert Y) = \frac{p(Y \vert \theta) p(\theta)}{p(Y)} \]

  • \(p(Y \vert \theta)\) is the likelihood function
    • Probability of data given specific values for the model’s parameters
  • \(p(\theta)\) is the prior probability distribution on the parameters
    • How plausibility is distributed across possibilities before seeing the data
  • \(p(Y)\) is the marginal likelihood of the data
    • A normalizing constant that does not depend on \(\theta\), so it can be ignored here

\[ p(\theta \vert Y) \propto p(Y \vert \theta) p(\theta). \]

Bayesian inference

\[ p(\theta \vert Y) \propto p(Y \vert \theta) p(\theta) \]

We need to specify how the likelihood of each data point contributes to the parameters’ overall probability. Assuming independent observations, the likelihood is a product over data points:

\[ p(\theta \vert Y) \propto p(\theta) \prod^N_{n=1} p(y_n \vert \theta) \]

In terms of programming, we work on the log scale and add up the log probabilities of the observations (a sketch follows below):

\[ \text{log}\ p(\theta \vert Y) \propto \text{log}\ p(\theta) + \sum^N_{n=1} \text{log}\ p(y_n \vert \theta) \]
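
To make this concrete, here is a minimal R sketch (not part of the workshop code; log_posterior is a hypothetical helper) that evaluates the unnormalized log posterior of the normal model used later, with illustrative priors:

log_posterior <- function(mu, sigma, y) {
  # sigma must be positive; outside the support the posterior density is 0
  if (sigma <= 0) return(-Inf)
  log_prior <- dnorm(mu, 250, 200, log = TRUE) +
    dnorm(sigma, 0, 200, log = TRUE)  # half-normal, up to an additive constant
  log_lik <- sum(dnorm(y, mu, sigma, log = TRUE))  # sum over observations
  log_prior + log_lik
}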

Why Stan?

  • Above, we showed how to evaluate the probability of any specific parameter values given data
  • We could try plugging in all the possible values of the parameters to approximate the distribution (sketched after this list)
  • Many interesting models have too many parameters to do this
  • Maybe we just search for the best combination of parameters and use those? (Maximum likelihood or Maximum a posteriori)
    • Not as informative, and we can get weird answers for some types of parameters (see literally every (g)lmer error)
  • Stan and other Markov Chain Monte Carlo (MCMC) techniques allow us to approximate high dimensional probability distributions without trying out every combination of parameters.
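
For a model with only two parameters, the brute-force grid approach is still feasible. A sketch, reusing the hypothetical log_posterior function above and assuming y is a vector of observed reaction times; with, say, 10 parameters and 100 grid points each, the same approach would need 100^10 evaluations, which is why MCMC is so useful:

# Evaluate the unnormalized posterior over a 100 x 100 grid
grid <- expand.grid(
  mu = seq(150, 500, length.out = 100),
  sigma = seq(1, 200, length.out = 100)
)
grid$log_post <- mapply(log_posterior, grid$mu, grid$sigma,
                        MoreArgs = list(y = y))
# Exponentiate after subtracting the maximum for numerical stability
grid$post <- exp(grid$log_post - max(grid$log_post))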

Stan & brms

library(rstan)
  • Stan uses Hamiltonian Monte Carlo, a variant of MCMC, to approximate \(p(\theta \vert Y)\)
  • We can write out (almost) any probabilistic model and get full probability distributions to express our uncertainty about model parameters
  • Higher-level interfaces allow us to avoid writing raw Stan code
library(brms)

  • Converts R modelling syntax to Stan language and extends it in interesting ways
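
For example, you can ask brms to show the Stan program it would generate for a given formula; a sketch, assuming df is a data frame with columns y and x:

# Inspect the Stan code brms generates from an R formula
make_stancode(y ~ x, data = df, family = gaussian())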

Modelling Workflow

A conceptual introduction

Modelling schmodelling

  • What is the role and goal of statistics in science?
    • …
    • We want to build models with parameters that inform our theories
    • We want to test differences between means
  • Bayes allows us to use probability to evaluate and express uncertainty about possible values of these parameters, and compare and criticize the models themselves

Bayesian workflow

To get started with BDA, it is useful to first informally define what a “Bayesian workflow” might look like. Following Kruschke (2014, p. 25), we identify five key data analysis steps:

  1. Identify data relevant to the research question.
  2. Define a descriptive model, whose parameters capture the research question.
  3. Specify prior probability distributions on parameters in the model.
  4. Update the prior to a posterior distribution using Bayesian inference.
  5. Check your model against data, and identify possible problems.

Identify relevant data

  1. (Research question!)
  2. Define the outcomes (dependent variables) and predictors (independent variables) of interest
  3. What are the scales? Were variables measured or manipulated? …

We collected data from a single person on the effects of sleep deprivation on cognitive performance, as measured by reaction time on a cognitive task. The data are 10 observations of reaction times across 10 days:

data(sleepstudy, package = "lme4")
# filter() and select() are from dplyr
sleepstudy <- filter(sleepstudy, Subject == 308) %>% 
  select(Days, Reaction) %>%  # keep one participant's Days and Reaction
  as_tibble()
sleepstudy
## # A tibble: 10 x 2
##     Days Reaction
##    <dbl>    <dbl>
##  1     0     250.
##  2     1     259.
##  3     2     251.
##  4     3     321.
##  5     4     357.
##  6     5     415.
##  7     6     382.
##  8     7     290.
##  9     8     431.
## 10     9     466.
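
Plotting the outcome against the predictor is a natural first look at these data; a minimal ggplot2 sketch:

library(ggplot2)
ggplot(sleepstudy, aes(Days, Reaction)) +
  geom_point() +
  labs(x = "Days of sleep deprivation", y = "Reaction time (ms)")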

Identify relevant data

  • The way in which we ran this experiment (we didn’t!) would dictate, to a large extent, the variables and their roles in our analysis.
  • However, there might be several other important variables to consider, such as how much a person typically sleeps, or whether they are trained on the cognitive task.
  • Some or all of those variables might not exist in our data, but might guide our thinking nevertheless.

Define a model

  • A creative process
  • All models are wrong, but that doesn’t mean you shouldn’t try to be less wrong
  • How are the outcomes distributed, conditional on the predictors?
  • Are there natural bounds in the data? Are the data collected on a continuous or categorical scale?
  • What are the relations between variables? Are they linear or more complicated?
  • We will build a series of increasingly complex & informative models for these data

Define a model

  • We assume that the reaction times \(y_n\), \(n = 1, \dots, N\), are normally distributed with mean \(\mu\) and standard deviation \(\sigma\)
  • This model does not include a parameter addressing our research question, but it is an instructive example to begin with

You have seen this model written as

\[ y_n = \mu + \epsilon_n \]

where

\[ \epsilon_n \sim N(0, \sigma^2) \]

But we prefer the following notation for its clarity and emphasis on data rather than errors

\[ y_n \sim N(\mu, \sigma^2) \]
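
The generative notation also tells us how to simulate data from the model; a sketch with illustrative (not estimated) parameter values:

# Simulate N observations from y ~ Normal(mu, sigma)
N <- 10
mu <- 300
sigma <- 50
y_sim <- rnorm(N, mean = mu, sd = sigma)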

Parameters’ prior distribution

  • A prior distribution is the distribution of plausible values a parameter can take, before the data are observed.
  • Critics sometimes point to the prior when claiming that Bayesian statistics are subjective and therefore useless.
  • The prior distribution is only one part of a model chosen by the analyst.
  • Specifying priors requires care, but a vague prior, or even one that is constant over the parameter values, can be a useful starting point.
  • In practice, we would be guided by expert knowledge of the topic and the design of the experiment

Parameters’ prior distribution

\[ \mu \sim N(250, 200) \\ \sigma \sim N^+(0, 200) \]
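
To get a feel for what these priors claim, we can plot their densities; a sketch (the factor of 2 makes the half-normal density for \(\sigma\) integrate to one over the positive values):

# Prior for mu: Normal(250, 200)
curve(dnorm(x, 250, 200), from = -500, to = 1000,
      xlab = expression(mu), ylab = "Prior density")
# Prior for sigma: half-Normal(0, 200), defined for sigma >= 0
curve(2 * dnorm(x, 0, 200), from = 0, to = 800,
      xlab = expression(sigma), ylab = "Prior density")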

Parameters’ prior distribution

But how?

  • If you wish to “let the data speak for themselves”, the prior can be set to a constant over the possible values of the parameter.
    • Whether such a noninformative or flat prior leads to a “Bayesian” analysis is, however, debatable.
  • Currently, so called weakly informative priors are popular, because they help prevent certain computational issues in sampling from the model’s posterior distribution, while remaining mostly uninformative about the parameter values.
  • Informative priors have a substantial impact on the posterior distribution
    • Useful when strong prior information is available
    • Required for hypothesis testing (e.g. Bayes factors)
  • It is OK to start with a noninformative prior, but further thought and simulation will often reveal how implausible such a starting point is (see the prior predictive sketch below)
  • Kruschke suggests that a prior should be chosen such that you could defend it in front of a sceptical audience
  • Choosing priors vs. likelihood functions
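
One concrete way to examine priors by simulation is a prior predictive check: brms can ignore the data and sample from the prior alone with sample_prior = "only". A sketch using the priors above:

fit_prior <- brm(
  Reaction ~ 1,
  family = gaussian(),
  prior = prior(normal(250, 200), class = "Intercept") +
    prior(normal(0, 200), class = "sigma"),
  data = sleepstudy,
  sample_prior = "only"  # sample from the prior; do not condition on data
)
pp_check(fit_prior, type = "hist", nsamples = 5)  # what data do the priors imply?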

Bayesian updating

\[ p(\theta \vert Y) \propto p(Y \vert \theta) p(\theta) \]

fit0 <- brm(
  formula = Reaction ~ 1,  # intercept-only model
  family = gaussian(), 
  prior = prior(normal(250, 200), class = "Intercept") +
    prior(normal(0, 200), class = "sigma"),
  data = sleepstudy,
  sample_prior = "yes",  # also draw samples from the priors
  iter = 2000,  # iterations per chain; the first half is warmup
  chains = 4,
  cores = 4,
  file = here("models/introduction-0")  # cache the fit; requires library(here)
)

Bayesian updating

It is important to check that the MCMC chains have converged to a common solution

plot(fit0, combo = c("dens_overlay", "trace"))

Bayesian updating

summary(fit0)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: Reaction ~ 1 
##    Data: sleepstudy (Number of observations: 10) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   340.80     29.84   282.13   400.51 1.00     2042     1879
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    91.69     25.93    56.95   156.82 1.00     1461     1606
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Posterior predictive check

Once a posterior distribution is obtained, it is prudent to check whether it makes reasonable predictions; if it “fits the data” well. This is sometimes called posterior predictive checking, because we use the posterior to generate predictions that are then checked against data. These checks can focus on the overall “fit” of the model…

pp_check(fit0, type = "hist", nsamples = 5, binwidth = 30)

Posterior predictive check

…or focus on particular aspects of the data, such as the minimum or maximum values
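
A sketch of such a check for fit0, using pp_check’s "stat_2d" type to compare the observed minimum and maximum to their posterior predictive distribution:

# Does the model reproduce the smallest and largest observations?
pp_check(fit0, type = "stat_2d", stat = c("min", "max"))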

Model 2

  • The previous model was a “null” model: it did not include a predictor
  • What is the effect of days of sleep deprivation on reaction time?

\[ y_n \sim N(\beta_0 + \beta_1 x_n, \sigma^2), \]

  • \(\beta_0\) is the intercept
  • \(\beta_1\) is the coefficient of days, \(x_n\).
  • \(\sigma\) is the residual standard deviation

Priors

\[ \beta_0 \sim N(250, 200) \\ \beta_1 \sim N(30, 40) \\ \sigma \sim N^+(0, 200) \]

Bayesian updating

fit1 <- brm(
  # center = FALSE keeps the intercept as a regular "b" coefficient,
  # so it can be given a prior with class = "b"
  bf(Reaction ~ Days, center = FALSE),
  prior = prior(normal(250, 200), class = "b", coef = "Intercept") +
    prior(normal(30, 40), class = "b", coef = "Days") +
    prior(normal(0, 200), class = "sigma"),
  sample_prior = "yes",
  data = sleepstudy, 
  file = here("models/introduction-1")
)

(Figure from https://www.tjmahr.com/bayes-theorem-in-three-panels/)

Bayesian updating

plot(fit1, combo = c("dens_overlay", "trace"))

Results

summary(fit1)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: Reaction ~ Days 
##    Data: sleepstudy (Number of observations: 10) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   244.71     32.94   180.46   311.52 1.00     1396     1463
## Days         21.54      6.26     8.70    33.75 1.00     1308     1308
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    56.47     18.54    33.35   102.84 1.00     1372     1459
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Summarising the posterior distribution

# gather_draws() is from tidybayes; stat_histinterval() is from ggdist
gather_draws(fit1, b_Intercept, b_Days, sigma) %>% 
  ggplot(aes(y = .variable, x = .value)) +
  stat_histinterval(breaks = 50) +
  scale_x_continuous("Parameter value") +
  theme(axis.title.y = element_blank())

Summarising the posterior distribution

  • Sometimes a posterior probability, here that the coefficient of Days exceeds 10, is a useful and concise summary:
hypothesis(fit1, "Days > 10")
## Hypothesis Tests for class b:
##        Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob
## 1 (Days)-(10) > 0    11.54      6.26     1.35    21.54      29.08      0.97
##   Star
## 1    *
## ---
## 'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
## '*': For one-sided hypotheses, the posterior probability exceeds 95%;
## for two-sided hypotheses, the value tested against lies outside the 95%-CI.
## Posterior probabilities of point hypotheses assume equal prior probabilities.
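
The same posterior probability can also be computed directly from the posterior samples:

# Proportion of posterior draws in which the Days coefficient exceeds 10
mean(posterior_samples(fit1)$b_Days > 10)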

Meet the S x P matrix

The posterior samples are stored as an S × P matrix: one row per posterior draw (S samples) and one column per parameter (P parameters):

post <- posterior_samples(fit1)[, 1:3] %>% as_tibble()  # keep b_Intercept, b_Days, sigma
head(post)
## # A tibble: 6 x 3
##   b_Intercept b_Days sigma
##         <dbl>  <dbl> <dbl>
## 1        218.   25.2  30.5
## 2        244.   23.6  36.0
## 3        212.   26.0  64.6
## 4        239.   24.9  60.2
## 5        250.   18.4  51.7
## 6        277.   14.2  64.6
post$qoi <- post$b_Days / post$sigma  # derived quantity of interest: effect in residual-SD units
posterior_summary(post)
##               Estimate  Est.Error        Q2.5       Q97.5
## b_Intercept 244.707978 32.9356146 180.4573433 311.5204073
## b_Days       21.540890  6.2578309   8.7024414  33.7477390
## sigma        56.471227 18.5422442  33.3541342 102.8432535
## qoi           0.415414  0.1553816   0.1199001   0.7304555
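
Probability statements about the derived quantity are equally direct; for example, the posterior probability that the effect of a day exceeds half a residual standard deviation:

mean(post$qoi > 0.5)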

Posterior predictive check

pp_check(fit1, nsamples = 300)

Posterior predictive check

x <- posterior_predict(fit1)  # posterior draws x observations matrix
ppc_stat_2d(sleepstudy$Reaction, x, c("min", "max"))  # ppc_stat_2d() is from bayesplot

Another model

plot(conditional_effects(fit1, "Days"), points = TRUE)

  • Perhaps we should also model the variance as a function of days?

Another model

\[ y_n \sim N(\beta_0 + \beta_1 x_n, \text{exp}(\gamma_0 + \gamma_1 x_n)^2), \]

# lf() adds a linear predictor for the distributional parameter sigma
new_model <- bf(Reaction ~ Days, center = FALSE) + lf(sigma ~ Days)
get_prior(new_model, sleepstudy)
##                 prior     class      coef group resp  dpar nlpar bound
##                (flat)         b                                       
##                (flat)         b      Days                             
##                (flat)         b Intercept                             
##                (flat)         b                      sigma            
##                (flat)         b      Days            sigma            
##  student_t(3, 0, 2.5) Intercept                      sigma            
##        source
##       default
##  (vectorized)
##  (vectorized)
##  (vectorized)
##  (vectorized)
##       default

Another model

fit2 <- brm(
  new_model,
  prior = prior(normal(250, 200), class = "b", coef = "Intercept") +
    prior(normal(30, 40), class = "b", coef = "Days"),
  # the sigma coefficients keep the default priors shown by get_prior() above
  sample_prior = "yes",
  data = sleepstudy,
  control = list(adapt_delta = .95),  # higher target acceptance rate; fewer divergences
  file = here("models/introduction-2")
)

Summarising the posterior distribution

summary(fit2)
##  Family: gaussian 
##   Links: mu = identity; sigma = log 
## Formula: Reaction ~ Days 
##          sigma ~ Days
##    Data: sleepstudy (Number of observations: 10) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_Intercept     2.96      0.60     1.95     4.27 1.00     1329     1492
## Intercept         242.57     18.04   206.45   281.16 1.00     1628     1147
## Days               22.16      5.28    11.67    32.76 1.00     1679     1441
## sigma_Days          0.18      0.11    -0.05     0.41 1.00     1266     1538
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Summarising the posterior distribution

mcmc_hist(fit2, pars = c("b_sigma_Days", "b_Days"))

Model comparison

options(loo.cores = parallel::detectCores(logical = FALSE))
looics <- loo(fit0, fit1, fit2)
looics
## Output of model 'fit0':
## 
## Computed from 4000 by 10 log-likelihood matrix
## 
##          Estimate  SE
## elpd_loo    -59.3 1.2
## p_loo         1.2 0.3
## looic       118.7 2.5
## ------
## Monte Carlo SE of elpd_loo is 0.0.
## 
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
## 
## Output of model 'fit1':
## 
## Computed from 4000 by 10 log-likelihood matrix
## 
##          Estimate  SE
## elpd_loo    -55.8 3.5
## p_loo         3.2 1.9
## looic       111.7 7.1
## ------
## Monte Carlo SE of elpd_loo is NA.
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     9     90.0%   1047      
##  (0.5, 0.7]   (ok)       0      0.0%   <NA>      
##    (0.7, 1]   (bad)      1     10.0%   21        
##    (1, Inf)   (very bad) 0      0.0%   <NA>      
## See help('pareto-k-diagnostic') for details.
## 
## Output of model 'fit2':
## 
## Computed from 4000 by 10 log-likelihood matrix
## 
##          Estimate  SE
## elpd_loo    -55.0 3.5
## p_loo         4.3 1.6
## looic       110.0 7.0
## ------
## Monte Carlo SE of elpd_loo is NA.
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     7     70.0%   980       
##  (0.5, 0.7]   (ok)       2     20.0%   444       
##    (0.7, 1]   (bad)      1     10.0%   22        
##    (1, Inf)   (very bad) 0      0.0%   <NA>      
## See help('pareto-k-diagnostic') for details.
## 
## Model comparisons:
##      elpd_diff se_diff
## fit2  0.0       0.0   
## fit1 -0.8       1.4   
## fit0 -4.3       3.7
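
The elpd difference between fit2 and fit1 (-0.8, SE 1.4) is small relative to its standard error, so these data do not clearly favor the distributional model; even fit0 trails by only about one standard error.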

Other summaries

bayes_R2(fit1) %>% kable

     Estimate  Est.Error      Q2.5      Q97.5
R2  0.6231391   0.154249  0.161916  0.7581443

bayes_R2(fit2) %>% kable

     Estimate  Est.Error      Q2.5      Q97.5
R2  0.6488731  0.1226077  0.2962559  0.7580441

Wrap-up

This introduction to Bayesian data analysis was necessarily brief, but hopefully has introduced the important concepts that we will keep encountering in the next sessions.

References

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. Boca Raton: Chapman & Hall/CRC.

Kruschke, John K. 2014. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. 2nd ed. Burlington, MA: Academic Press.

McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. 2nd ed. CRC Texts in Statistical Science. Boca Raton: CRC Press / Taylor & Francis.