Models

Code
# Packages
library(cmdstanr)
library(rstan)
library(scales)
library(knitr)
library(here)
library(janitor)
library(latex2exp)
library(distributional)
library(posterior)
library(patchwork)
library(tidybayes)
library(ggdist)
library(tidyverse)

# Some settings common to all modules
source(here("modules/_common.R"))

Bayesian inference

What is it?

  • “Bayesian inference is reallocation of credibility across possibilities.” (Kruschke 2014)
  • “Bayesian data analysis takes a question in the form of a model and uses logic to produce an answer in the form of probability distributions.” (McElreath 2020)
  • “Bayesian inference is the process of fitting a probability model to a set of data and summarizing the result by a probability distribution on the parameters of the model and on unobserved quantities such as predictions for new observations.” (Gelman et al. 2013)

What is it?

  • Bayesian inference consists of updating prior information into posterior information, using the evidence in the data
  • Use probability distributions to express information (uncertainty)
Code
tibble(
  shape1 = c(3, 10, 13),
  shape2 = c(6, 4, 10),
  beta = dist_beta(shape1, shape2),
  name = factor(
    c("Prior", "Likelihood", "Posterior"),
    levels = c("Prior", "Likelihood", "Posterior"),
    labels =
      TeX(c("$p(\\theta)$", "$p(Y | \\theta)$", "$p(\\theta | Y)$"))
  )
) |>
  ggplot() +
  scale_color_brewer(
    palette = "Set1",
    labels = ~unname(
      TeX(c("$p(\\theta)$", "$p(Y | \\theta)$", "$p(\\theta | Y)$"))
    ),
    aesthetics = c("color", "fill")
  ) +
  scale_y_continuous(
    "Probability density",
    expand = expansion(c(0, .1))
  ) +
  scale_x_continuous(
    "Parameter value",
    expand = expansion(c(0.01, 0.01))
  ) +
  stat_slab(
    aes(xdist = beta, color = name),
    fill = NA,
    slab_linewidth = 1.5
  ) +
  theme(
    legend.title = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    legend.position = "right"
  )
Figure 1: Bayesian inference combines prior information with data to produce a posterior distribution.

How is it different from what I already know?

  • Bayesian data analysis may not be that different from what you already know (i.e. orthodox / classical / frequentist statistics)
  • In the absence of strong prior information, and in the presence of ample data, the same model evaluated in a Bayesian or orthodox framework will yield essentially the same numerical answers
  • The interpretations of, and philosophies behind, the numbers are vastly different
  • In many ways, orthodox statistical methods can be thought of as approximations to Bayesian methods
  • Hypothesis tests are very different between the two frameworks
  • In practice, Bayesian statistics offer an extremely flexible modelling framework

How is it different from what I already know?

Figure 2: McElreath (2020): “Example decision tree, or flowchart, for selecting an appropriate statistical procedure. Beginning at the top, the user answers a series of questions about measurement and intent, arriving eventually at the name of a procedure. Many such decision trees are possible.”

What can it do for me?

You can estimate models in the Bayesian context that might not be otherwise possible. My first Bayesian analysis was conducted out of necessity. The model I wanted to use did not converge to a solution when I attempted to use orthodox methods (maximum likelihood estimation). Around the same time, I heard about Stan. I wrote some Stan code and the model converged without problems, and I was able to use the model that I wanted to.

With Bayes, you can actually be confident in your Confidence Intervals. I have a difficult time understanding p-values and Confidence Intervals. It can be difficult to understand what the uncertainty estimates mean when hypothetical replications are difficult to imagine in a given context. With a posterior distribution at hand, the corresponding probability values have a direct interpretation as credibility, uncertainty, or plausibility.

What can it do for me?

Bayesian methods allow easily carrying (un)certainty forward to other quantities of interest. It can often be difficult to obtain uncertainty estimates for various quantities when using orthodox methods. For example, effect size metrics are often reported without error bars (they can be obtained, but the methods for doing so can be finicky and are not often used).

To be sure, the Bayesian framework does not come for free. The methods might be difficult to communicate to others, at least until orthodox statistics are replaced in undergraduate applied statistics education. The necessity of complex computational algorithms makes it time-consuming—you will enjoy doing BDA more if you have a fast computer.

How can I do it?

In theory

  • What are the plausible values of parameters \(\theta\) after observing data?
  • The posterior distribution \(p(\theta \vert Y)\) is the answer
  • Bayes’ theorem describes how to compute this distribution

\[ p(\theta \vert Y) = \frac{p(Y \vert \theta) p(\theta)}{p(Y)} \]

  • \(p(Y \vert \theta)\): the likelihood function
    • Probability of the data given specific values for the model’s parameters
  • \(p(\theta)\): the prior probability distribution on the parameters
    • How plausibility is distributed across the possibilities before seeing the data
  • \(p(Y)\): the marginal likelihood of the data
    • A normalizing constant that does not depend on \(\theta\), so in practice we work with the unnormalized posterior

\[ p(\theta \vert Y) \propto p(Y \vert \theta) p(\theta). \]
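In special (conjugate) cases the posterior is available in closed form. A quick sketch, assuming a Beta prior on a proportion and binomial data (the values are made up for illustration):

Code
# Beta(a, b) prior with k successes in n trials gives a Beta(a + k, b + n - k)
# posterior; no sampling is needed in this conjugate case.
a <- 3; b <- 6   # prior shape parameters
k <- 7; n <- 10  # observed successes and trials
curve(dbeta(x, a, b), from = 0, to = 1, lty = 2, ylab = "Density")  # prior
curve(dbeta(x, a + k, b + n - k), add = TRUE)                       # posterior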

How can I do it?

In theory

\[ p(\theta \vert Y) \propto p(Y \vert \theta) p(\theta) \]

Need to specify how the likelihood of each data point contributes to the parameters’ overall probability:

\[ p(\theta \vert Y) \propto p(\theta) \prod^N_{i=1} p(y_i \vert \theta) \]

In terms of programming, we think of adding up the log probabilities of each observation (the log posterior is then known up to an additive constant):

\[ \text{log}\ p(\theta \vert Y) \propto \text{log}\ p(\theta) + \sum^N_{i=1} \text{log}\ p(y_i \vert \theta) \]
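To make this concrete, here is a minimal grid sketch (toy data, \(\sigma\) fixed at 50, and a N(300, 100) prior — all assumptions for illustration only) that adds the log prior to the summed log likelihoods:

Code
y <- c(250, 310, 280, 330)  # toy data
mu_grid <- seq(200, 400, by = 1)
log_post <- sapply(
  mu_grid,
  \(mu) dnorm(mu, 300, 100, log = TRUE) +  # log p(theta)
    sum(dnorm(y, mu, 50, log = TRUE))      # sum of log p(y_i | theta)
)
mu_grid[which.max(log_post)]  # posterior mode on the grid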

How can I do it?

Figure 4: Homo Bayesianis

How can I do it?

In practice

  • Target of inference is the posterior distribution
  • Many interesting models’ posterior distributions cannot be computed analytically
  • Markov chain Monte Carlo (MCMC) techniques allow us to approximate distributions by drawing random samples from them (see the sketch below)
  • BUGS, JAGS, PyMC, Stan
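A minimal random-walk Metropolis sampler (a sketch, not production code): it approximates a target distribution using only its unnormalized log density. Here the target is a standard normal, standing in for \(\text{log}\ p(\theta) + \text{log}\ p(Y \vert \theta)\):

Code
log_post <- function(theta) dnorm(theta, mean = 0, sd = 1, log = TRUE)

draws <- numeric(4000)
theta <- 0
for (s in seq_along(draws)) {
  proposal <- theta + rnorm(1, sd = 0.5)  # propose a nearby value
  # Accept with probability min(1, p(proposal) / p(current))
  if (log(runif(1)) < log_post(proposal) - log_post(theta)) theta <- proposal
  draws[s] <- theta                       # otherwise keep the current value
}
hist(draws)  # the histogram approximates the target distribution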

How can I do it?

In practice

Code
library(brms)
Figure 5: brms logo

Gaussian model

This section

  • Discuss a concise modeling workflow
  • Implement it in practice
  • A model with a Gaussian outcome and one continuous predictor
  • Establish notation following McElreath (2020)

Why models?

  • What is the role and goal of statistics in science?
  • We want to build models with parameters whose estimated magnitudes inform theories
  • We want to test hypothesized differences between means
  • Bayes allows us to use probability to quantify uncertainty about these parameters, and to compare and criticize the models themselves

Bayesian workflow

To get started with BDA, it is useful to first informally define what a “Bayesian workflow” might look like. Following Kruschke (2014, 25), we identify five key data analysis steps:

  1. Identify data relevant to the research question.
  2. Define a descriptive model, whose parameters capture the research question.
  3. Specify prior probability distributions on parameters in the model.
  4. Update the prior to a posterior distribution using Bayesian inference.
  5. Check your model against data, and identify possible problems.

Bayesian workflow

A more complete treatment is found in Gelman et al. (2020) (Figure 6).

Figure 6: Figure 1 from (Gelman et al. 2020): “Overview of the steps we currently consider in Bayesian workflow. Numbers in brackets refer to sections of this paper where the steps are discussed. The chart aims to show possible steps and paths an individual analysis may go through, with the understanding that any particular analysis will most likely not involve all of these steps. One of our goals in studying workflow is to understand how these ideas fit together so they can be applied more systematically.”

Identify relevant data

  1. (Research question, experimentation, measurement…)
  2. Define outcomes (DVs) and predictors (IVs)
  3. What are the scales? Were variables measured or manipulated? …

We collected data on the effects of sleep deprivation on cognitive performance, as measured by reaction time on a cognitive task. The data are observations of 18 individuals’ reaction times across 8 days of sleep deprivation (Table 1).

Code
dat <- tibble(lme4::sleepstudy) |>
  clean_names() |>
  # First two days (0 and 1) were an adaptation period
  filter(days >= 2) |>
  mutate(days = days - 2)
reaction  days  subject
     251     0      308
     321     1      308
     357     2      308
     415     3      308
     382     4      308
     290     5      308
Table 1: First six rows of the sleep study data.

Identify relevant data

  • The way in which we ran this experiment (we didn’t!) would dictate, to a large extent, the variables and their roles in our analysis
  • There might be several other important variables to consider, such as how much a person typically sleeps, or whether they are trained on the cognitive task
  • Some or all of those variables might not exist in our data, but might guide our thinking nevertheless

Define a model

  • A creative process
  • All models are wrong, but that doesn’t mean you shouldn’t try to be less wrong
  • How are the outcomes distributed conditional on the predictors?
  • Are there natural bounds in the data? Are the data collected on a continuous or categorical scale?
  • What are the relations between variables? Are they linear or more complicated?
  • We will build a series of increasingly complex & informative models for these data

Define a model

  • “Null” model (no predictors)
  • We assume that the reaction times \(y_i\) in \(1, \dots, N\) are normally distributed with mean \(\mu\) and standard deviation \(\sigma\)

\[ y_i = \mu + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2) \]

We prefer the following “distributional” notation for its conciseness and emphasis on data rather than errors

\[ y_i \sim N(\mu, \sigma^2) \]
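The two notations describe the same data-generating process; a quick simulation sketch (with made-up parameter values) illustrates the equivalence:

Code
n <- 1e4; mu <- 300; sigma <- 50
y1 <- mu + rnorm(n, 0, sigma)  # y_i = mu + epsilon_i
y2 <- rnorm(n, mu, sigma)      # y_i ~ N(mu, sigma^2)
# Both produce draws from the same distribution
c(mean(y1), mean(y2)); c(sd(y1), sd(y2))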

Code
bf0 <- bf(reaction ~ 1)

Prior distribution

  • A prior distribution is the distribution of plausible values a parameter can take before the data are observed.
  • Critics sometimes point to the prior when claiming that Bayesian statistics are subjective and therefore useless.
  • The prior distribution is, however, only one part of a model chosen by the analyst.
  • Specifying priors requires care; often a vague prior, or even one that is constant over the parameter values, can be a useful starting point.
  • In practice, we would be guided by our expert knowledge of the topic and the design of the experiment

Prior distribution

For our first example, we let {brms} set default priors. These are weakly informative and only serve to facilitate model convergence.

Code
get_prior(bf0, dat)[,-c(3:7, 9)]
#>                      prior     class lb  source
#>  student_t(3, 303.2, 65.5) Intercept    default
#>      student_t(3, 0, 65.5)     sigma  0 default
Code
get_prior(bf0, dat) |>
  parse_dist() |>
  ggplot(aes(y = class)) +
  scale_x_continuous(
    "Parameter value",
    breaks = extended_breaks(7)
  ) +
  stat_halfeye(
    aes(xdist = .dist_obj)
  ) +
  theme(
    axis.title.y = element_blank()
  )

Prior distribution

  • If you wish to “let the data speak for itself”, the prior can be set to a constant over the possible values of the parameter.
  • Whether such a noninformative or flat prior leads to a “Bayesian” analysis is, however, debatable.
  • Currently, so-called weakly informative priors are popular, because they help prevent certain computational issues when sampling from the model’s posterior distribution, while remaining mostly uninformative about the parameter values.
  • Informative priors have a substantial impact on the posterior distribution
    • Useful when strong prior information is available
    • Required for hypothesis testing (e.g. Bayes factors)
  • It is OK to start with a noninformative prior, but further thought & simulation will likely reveal how implausible such a starting point can be (see the sketch below).
  • Kruschke suggests choosing a prior you could defend in front of a sceptical audience
  • Choosing priors vs. likelihood functions
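As a quick prior simulation (a sketch using the default intercept prior printed above):

Code
# What mean reaction times does a student_t(3, 303.2, 65.5) prior consider
# plausible? Simulate draws as location + scale * standard t.
prior_mu <- 303.2 + 65.5 * rt(1e4, df = 3)
quantile(prior_mu, c(.025, .5, .975))
mean(prior_mu < 0)  # some prior mass on impossible negative reaction times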

Sampling from the posterior with brm()

Code
fit0 <- brm(
  formula = reaction ~ 1,
  family = gaussian(), 
  data = dat,
  file = here("models/introduction-0")
)

Sampling from the posterior with brm()

Use environment variables to separate settings from source code.

.Renviron
MAX_CORES = 8
BRMS_BACKEND = "cmdstanr"
BRMS_THREADS = 2

Refer to environment variables when setting default HMC sampler options

_common.R
options(
  brms.backend = Sys.getenv("BRMS_BACKEND", "rstan"),
  brms.threads = as.numeric(Sys.getenv("BRMS_THREADS", 1)),
  mc.cores = as.numeric(Sys.getenv("MAX_CORES", 4))
)

brm() uses options() for several arguments (see ?brm).
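A sketch of inspecting the resulting defaults (option names as set above):

Code
getOption("brms.backend")  # used when backend is not given to brm()
getOption("brms.threads")  # default for the threads argument
getOption("mc.cores")      # default number of cores for parallel chains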

Model checking

Estimation relies on computational algorithms that can fail to deliver. This is unlikely with Gaussian and other simple models, but checking should be done nevertheless.

Code
plot(fit0)
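Numerical summaries complement the trace plots; a quick sketch using brms’ built-in diagnostics:

Code
rhat(fit0)        # potential scale reduction factors; ~1.00 at convergence
neff_ratio(fit0)  # effective sample sizes relative to total draws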

Model checking

More checking, and interpreting quantities.

Code
summary(fit0)
#>  Family: gaussian 
#>   Links: mu = identity; sigma = identity 
#> Formula: reaction ~ 1 
#>    Data: dat (Number of observations: 144) 
#>   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup draws = 4000
#> 
#> Regression Coefficients:
#>           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept   307.86      4.85   298.39   317.67 1.00     3427     2112
#> 
#> Further Distributional Parameters:
#>       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma    57.33      3.36    51.14    64.35 1.00     3259     2616
#> 
#> Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).

Posterior predictive check

Once a posterior distribution is obtained, it is prudent to check whether it makes reasonable predictions; if it “fits the data” well. This is sometimes called posterior predictive checking, because we use the posterior to generate predictions that are then checked against data. These checks can focus on the overall “fit” of the model…

Code
pp_check(fit0, type = "hist", ndraws = 5) +
  scale_x_continuous("Reaction time (ms)")

Posterior predictive check

…or focus on particular aspects of the data, such as the mean and sd

Code
pp_check(fit0, type = "stat_2d", stat = c("mean", "sd"))

Posterior predictive check

…or some other summaries

Code
pp_check(fit0, type = "stat_2d", stat = c("min", "max"))

Gaussian model with predictor

Model 2

  • The previous model was a “null” model; it did not include predictors
  • What is the effect of one day of sleep deprivation on reaction time?

\[ y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2), \]

  • \(\beta_0\) is the intercept
  • \(\beta_1\) is the coefficient of days, \(x_i\).
  • \(\sigma\) is the residual standard deviation
Code
bf1 <- bf(reaction ~ days)

Priors

Code
get_prior(bf1, dat)[,-c(3:7, 9)]
#>                      prior     class lb  source
#>                     (flat)         b    default
#>                     (flat)         b    default
#>  student_t(3, 303.2, 65.5) Intercept    default
#>      student_t(3, 0, 65.5)     sigma  0 default
p <- prior(student_t(7, 300, 200), class = "Intercept") +
  prior(student_t(7, 0, 100), class = "b", coef = "days") +
  prior(student_t(7, 0, 50), class = "sigma", lb = 0)
p[,-c(3:7, 9)]
#>                   prior     class   lb source
#>  student_t(7, 300, 200) Intercept <NA>   user
#>    student_t(7, 0, 100)         b <NA>   user
#>     student_t(7, 0, 50)     sigma    0   user

Sample from the posterior

Code
fit1 <- brm(
  bf1,
  family = gaussian(),
  data = dat,
  prior = p,
  sample_prior = "yes",
  # file = here("models/introduction-1")
)
#> Running MCMC with 4 chains, at most 8 in parallel, with 2 thread(s) per chain...
#> 
#> Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
#> Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
#> Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
#> Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
#> Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
#> Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
#> Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
#> Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
#> Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
#> Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
#> Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
#> Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
#> Chain 2 Iteration:   1 / 1000 [  0%]  (Warmup) 
#> Chain 3 Iteration:   1 / 1000 [  0%]  (Warmup) 
#> Chain 3 Iteration: 100 / 1000 [ 10%]  (Warmup) 
#> Chain 3 Iteration: 200 / 1000 [ 20%]  (Warmup) 
#> Chain 3 Iteration: 300 / 1000 [ 30%]  (Warmup) 
#> Chain 3 Iteration: 400 / 1000 [ 40%]  (Warmup) 
#> Chain 3 Iteration: 500 / 1000 [ 50%]  (Warmup) 
#> Chain 3 Iteration: 501 / 1000 [ 50%]  (Sampling) 
#> Chain 3 Iteration: 600 / 1000 [ 60%]  (Sampling) 
#> Chain 3 Iteration: 700 / 1000 [ 70%]  (Sampling) 
#> Chain 3 Iteration: 800 / 1000 [ 80%]  (Sampling) 
#> Chain 3 Iteration: 900 / 1000 [ 90%]  (Sampling) 
#> Chain 3 Iteration: 1000 / 1000 [100%]  (Sampling) 
#> Chain 4 Iteration:   1 / 1000 [  0%]  (Warmup) 
#> Chain 1 finished in 0.1 seconds.
#> Chain 3 finished in 0.1 seconds.
#> Chain 2 Iteration: 100 / 1000 [ 10%]  (Warmup) 
#> Chain 2 Iteration: 200 / 1000 [ 20%]  (Warmup) 
#> Chain 2 Iteration: 300 / 1000 [ 30%]  (Warmup) 
#> Chain 2 Iteration: 400 / 1000 [ 40%]  (Warmup) 
#> Chain 2 Iteration: 500 / 1000 [ 50%]  (Warmup) 
#> Chain 2 Iteration: 501 / 1000 [ 50%]  (Sampling) 
#> Chain 2 Iteration: 600 / 1000 [ 60%]  (Sampling) 
#> Chain 2 Iteration: 700 / 1000 [ 70%]  (Sampling) 
#> Chain 2 Iteration: 800 / 1000 [ 80%]  (Sampling) 
#> Chain 2 Iteration: 900 / 1000 [ 90%]  (Sampling) 
#> Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
#> Chain 4 Iteration: 100 / 1000 [ 10%]  (Warmup) 
#> Chain 4 Iteration: 200 / 1000 [ 20%]  (Warmup) 
#> Chain 4 Iteration: 300 / 1000 [ 30%]  (Warmup) 
#> Chain 4 Iteration: 400 / 1000 [ 40%]  (Warmup) 
#> Chain 4 Iteration: 500 / 1000 [ 50%]  (Warmup) 
#> Chain 4 Iteration: 501 / 1000 [ 50%]  (Sampling) 
#> Chain 4 Iteration: 600 / 1000 [ 60%]  (Sampling) 
#> Chain 4 Iteration: 700 / 1000 [ 70%]  (Sampling) 
#> Chain 4 Iteration: 800 / 1000 [ 80%]  (Sampling) 
#> Chain 4 Iteration: 900 / 1000 [ 90%]  (Sampling) 
#> Chain 4 Iteration: 1000 / 1000 [100%]  (Sampling) 
#> Chain 2 finished in 0.2 seconds.
#> Chain 4 finished in 0.2 seconds.
#> 
#> All 4 chains finished successfully.
#> Mean chain execution time: 0.2 seconds.
#> Total execution time: 0.4 seconds.

Model checking

Code
summary(fit1)
#>  Family: gaussian 
#>   Links: mu = identity; sigma = identity 
#> Formula: reaction ~ days 
#>    Data: dat (Number of observations: 144) 
#>   Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
#>          total post-warmup draws = 2000
#> 
#> Regression Coefficients:
#>           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept   268.12      7.80   253.36   283.35 1.00     1942     1363
#> days         11.41      1.86     7.80    15.00 1.00     2063     1650
#> 
#> Further Distributional Parameters:
#>       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma    51.11      3.02    45.34    57.49 1.00     2036     1378
#> 
#> Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).

Model checking

Code
plot(fit1)

Posterior predictive check

Code
pp_check(fit1, type = "stat_2d", stat = c("min", "max"))

  • What is happening?
Code
pp_check(fit1)

Posterior predictive check

Code
set.seed(1)
p1 <- dat |>
  ggplot(aes(days, reaction)) +
  scale_x_continuous(breaks = pretty_breaks(9)) +
  scale_y_continuous(breaks = pretty_breaks(5)) +
  labs(
    x = "days of sleep deprivation", 
    y = "RT"
  )

tmp <- spread_draws(
  fit1,
  prior_Intercept, prior_b_days,
  ndraws = 30
) |>
  crossing(dat) |>
  rowwise() |>
  mutate(
    .value = prior_Intercept + prior_b_days * days
  )

p2 <- p1 %+%
  tmp +
  aes(y = .value, group = .draw) +
  geom_line(alpha = .05) +
  labs(
    x = "days of sleep deprivation", 
    y = "Prior predicted mean RT"
  )

p3 <- p2 %+%
  add_epred_draws(dat, fit1, ndraws = 30) +
  aes(y = .epred) +
  labs(
    x = "days of sleep deprivation", 
    y = "Posterior predicted mean RT"
  )

(p2 | p1 + geom_point() | p3) & 
  coord_cartesian(ylim = c(-500, 1000))

Summarising the posterior distribution

Code
gather_draws(fit1, b_Intercept, b_days, sigma) |>
  ggplot(aes(y = .variable, x = .value)) +
  stat_histinterval(breaks = 50) +
  scale_x_continuous("Parameter value") +
  theme(axis.title.y = element_blank())

Let’s write a function to report useful numbers

Code
sm <- \(
  x, 
  v = c("^b_", "^sd_", "^cor_", "^sigma"), 
  r = TRUE
) {
  as_draws_df(x, variable = v, regex = r) |> 
    summarise_draws(
      mean, sd, 
      ~quantile2(.x, c(0.025, 0.975)),
      # Probability of direction: proportion of draws sharing the median's sign
      pd = ~mean(sign(.x) == sign(median(.x))),
      rhat, ess_tail
    )
}
sm(fit1)
#> # A tibble: 3 × 8
#>   variable     mean    sd   q2.5 q97.5    pd  rhat ess_tail
#>   <chr>       <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 b_Intercept 268.   7.80 253.   283.      1  1.00    1363.
#> 2 b_days       11.4  1.86   7.80  15.0     1  1.00    1650.
#> 3 sigma        51.1  3.02  45.3   57.5     1  1.00    1378.
Code
sm(fit1) |> 
  mutate(
    `95%CI` = str_glue(
      "[{number(q2.5, .01)}, {number(q97.5, .01)}]"
    ),
    .after = 3
  ) |> 
  select(
    -starts_with("q")
  ) |> 
  kable()
variable     mean   sd  95%CI              pd  rhat  ess_tail
b_Intercept   268  7.8  [253.36, 283.35]    1     1      1363
b_days         11  1.9  [7.80, 15.00]       1     1      1650
sigma          51  3.0  [45.34, 57.49]      1     1      1378

Meet the S × P (samples × parameters) matrix

Code
post <- as_draws_df(fit1)
post[,1:6]
#> # A tibble: 2,000 × 6
#>    b_Intercept b_days sigma Intercept prior_Intercept prior_b_days
#>          <dbl>  <dbl> <dbl>     <dbl>           <dbl>        <dbl>
#>  1        266.  12.9   52.1      311.           365.        -72.5 
#>  2        259.  14.4   51.2      309.           307.        -32.7 
#>  3        264.  11.8   49.9      306.            49.2      -163.  
#>  4        268.   9.91  50.2      303.           777.         84.7 
#>  5        269.  10.9   52.1      307.           367.        -65.0 
#>  6        262.  13.0   47.2      307.           363.         90.3 
#>  7        267.  10.8   56.5      305.           196.       -258.  
#>  8        260.  14.0   49.4      309.           251.         -4.66
#>  9        265.  11.7   45.0      306.           143.        -17.1 
#> 10        275.  11.0   53.6      313.           419.        118.  
#> # ℹ 1,990 more rows
Code
post$qoi <- post$b_days / post$sigma
sm(post)
#> # A tibble: 10 × 8
#>    variable            mean       sd     q2.5    q97.5    pd  rhat ess_tail
#>    <chr>              <dbl>    <dbl>    <dbl>    <dbl> <dbl> <dbl>    <dbl>
#>  1 b_Intercept      268.      7.80    253.     283.    1      1.00    1363.
#>  2 b_days            11.4     1.86      7.80    15.0   1      1.00    1650.
#>  3 sigma             51.1     3.02     45.3     57.5   1      1.00    1378.
#>  4 Intercept        308.      4.17    300.     316.    1      1.00    1396.
#>  5 prior_Intercept  303.    229.     -145.     764.    0.914  1.00    2004.
#>  6 prior_b_days      -2.77  118.     -239.     242.    0.517  1.00    1889.
#>  7 prior_sigma       45.1    38.0       2.39   140.    1      1.00    1821.
#>  8 lprior           -16.6     0.0619  -16.7    -16.4   1      1.00    1397.
#>  9 lp__            -783.      1.23   -786.    -782.    1      1.00    1144.
#> 10 qoi                0.224   0.0385    0.149    0.298 1      1.00    1506.

Posterior predictive check

Code
pp_check(fit1, ndraws = 30)

Posterior predictive check

Code
pp_check(fit1, type = "stat_2d", stat = c("min", "max"))

Conditional effects

Code
plot(
  conditional_effects(fit1, "days"), 
  points = TRUE
)

  • Perhaps we should model the variance as a function of days?

Location-scale model

Location-scale model

\[ \begin{aligned} y_i &\sim N(\mu_i, \sigma_i^2), \\ \mu_i &= \beta_0 + \beta_1 x_i, \\ \sigma_i &= \text{exp}(\gamma_0 + \gamma_1 x_i). \end{aligned} \]

Code
bf2 <- bf(reaction ~ days) + lf(sigma ~ days)

get_prior(bf2, dat)[,c(1:3, 6, 10)]
#>                      prior     class coef  dpar  source
#>                     (flat)         b            default
#>                     (flat)         b days       default
#>  student_t(3, 303.2, 65.5) Intercept            default
#>                     (flat)         b      sigma default
#>                     (flat)         b days sigma default
#>       student_t(3, 0, 2.5) Intercept      sigma default

Location-scale model

Code
fit2 <- brm(
  bf2,
  data = dat,
  control = list(adapt_delta = .95),
  file = here("models/introduction-2")
)

By the way, this is equivalent to:

Code
fit2 <- brm(
  bf(reaction ~ days, sigma ~ days),
  data = dat
)

Summarising the posterior distribution

Code
summary(fit2)
#>  Family: gaussian 
#>   Links: mu = identity; sigma = log 
#> Formula: reaction ~ days 
#>          sigma ~ days
#>    Data: dat (Number of observations: 144) 
#>   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup draws = 4000
#> 
#> Regression Coefficients:
#>                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma_Intercept     3.54      0.12     3.32     3.77 1.00     2623     2686
#> Intercept         267.56      6.02   255.57   279.44 1.00     1836     2143
#> days               11.57      1.83     8.03    15.26 1.00     1907     2411
#> sigma_days          0.10      0.03     0.05     0.16 1.00     2638     2387
#> 
#> Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).

Summarising the posterior distribution

Code
fit2 |>
  gather_draws(b_days, b_sigma_days) |>
  ggplot(aes(.value, .variable)) +
  stat_histinterval()

Model comparison

Code
# Add LOO criteria to all models
fit0 <- add_criterion(fit0, "loo")
fit1 <- add_criterion(fit1, "loo")
fit2 <- add_criterion(fit2, "loo")
Code
loo(fit0)
#> 
#> Computed from 4000 by 144 log-likelihood matrix.
#> 
#>          Estimate   SE
#> elpd_loo   -788.2  8.3
#> p_loo         1.9  0.4
#> looic      1576.5 16.6
#> ------
#> MCSE of elpd_loo is 0.0.
#> MCSE and ESS estimates assume MCMC draws (r_eff in [0.7, 0.9]).
#> 
#> All Pareto k estimates are good (k < 0.7).
#> See help('pareto-k-diagnostic') for details.

elpd_loo is the Bayesian LOO estimate of the expected log pointwise predictive density (Eq 4 in VGG2017) and is a sum of N individual pointwise log predictive densities. Probability densities can be smaller or larger than 1, and thus log predictive densities can be negative or positive. For simplicity, the ELPD acronym is also used for expected log pointwise predictive probabilities for discrete models. Probabilities are always equal to or less than 1, and thus log predictive probabilities are 0 or negative.
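Concretely, elpd_loo is built from the pointwise log-likelihood matrix; a quick sketch of the pieces:

Code
ll <- log_lik(fit0)        # S x N matrix: log p(y_n | theta^(s))
dim(ll)                    # 4000 draws by 144 observations
head(loo(fit0)$pointwise)  # per-observation contributions to elpd_loo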

Code
loo_compare(
  fit0, fit1, fit2
)
#>      elpd_diff se_diff
#> fit2   0.0       0.0  
#> fit1  -5.9       3.3  
#> fit0 -22.0       6.1

elpd_diff is the difference in elpd_loo for two models. If more than two models are compared, the difference is computed relative to the model with highest elpd_loo.

As a quick rule: if the elpd difference (elpd_diff in the loo package) is less than 4, the difference is small (Sivula, Magnusson, and Vehtari 2020; McLatchie et al. 2023). If it is larger than 4, compare the difference to its standard error, se_diff (Sivula, Magnusson, and Vehtari 2020). Here, for example, fit2 improves on fit1 by 5.9 with a standard error of 3.3, i.e. the difference is less than two standard errors.

  • Sometimes Theory > model comparison

Non-Gaussian model

Why

Code
dat |> 
  ggplot(aes(days, reaction)) +
  geom_point() +
  geom_smooth(method = "loess")

Reaction time data

Code
fitrt1 <- brm(
  bf(reaction ~ days) + shifted_lognormal(),
  data = dat,
  control = list(adapt_delta = .95),
  file = here("models/introduction-sln-1")
)
fitrt1 <- add_criterion(fitrt1, "loo")
loo_compare(fit1, fitrt1)
#>        elpd_diff se_diff
#> fitrt1  0.0       0.0   
#> fit1   -1.0       2.6
Code
summary(fitrt1)
#>  Family: shifted_lognormal 
#>   Links: mu = identity; sigma = identity; ndt = identity 
#> Formula: reaction ~ days 
#>    Data: dat (Number of observations: 144) 
#>   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup draws = 4000
#> 
#> Regression Coefficients:
#>           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept     5.29      0.20     4.86     5.58 1.00      635     1043
#> days          0.05      0.01     0.03     0.07 1.00      951     1159
#> 
#> Further Distributional Parameters:
#>       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma     0.22      0.04     0.16     0.32 1.00      692     1169
#> ndt      64.79     37.38     3.50   134.98 1.01      642      948
#> 
#> Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
Code
plot(conditional_effects(fitrt1), points = TRUE)

Code
pp_check(fitrt1)

Model variance too

Code
fitrt2 <- brm(
  bf(reaction ~ days) + 
    lf(sigma ~ days) +
    shifted_lognormal(),
  data = dat,
  control = list(adapt_delta = .95),
  file = here("models/introduction-sln-2")
)
fitrt2 <- add_criterion(fitrt2, "loo")
loo_compare(fitrt1, fitrt2)
#>        elpd_diff se_diff
#> fitrt2  0.0       0.0   
#> fitrt1 -2.2       2.2
Code
summary(fitrt2)
#>  Family: shifted_lognormal 
#>   Links: mu = identity; sigma = log; ndt = identity 
#> Formula: reaction ~ days 
#>          sigma ~ days
#>    Data: dat (Number of observations: 144) 
#>   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup draws = 4000
#> 
#> Regression Coefficients:
#>                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept           5.39      0.16     5.00     5.59 1.00      741     1076
#> sigma_Intercept    -1.86      0.20    -2.19    -1.42 1.00      884     1110
#> days                0.04      0.01     0.03     0.07 1.00      980     1098
#> sigma_days          0.07      0.03     0.01     0.12 1.00     1904     2113
#> 
#> Further Distributional Parameters:
#>     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> ndt    45.91     32.10     2.39   115.89 1.00      740     1110
#> 
#> Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
Code
plot(conditional_effects(fitrt2), points = TRUE)

Code
pp_check(fitrt2)

Multilevel model

Revisiting the data

Code
pa <- dat |> 
  mutate(subject = fct_reorder(subject, reaction)) |> 
  ggplot(aes(days, reaction)) +
  geom_smooth(method = "lm", color = "black", linewidth = 0.5) +
  geom_point() +
  facet_wrap("subject")

pb <- dat |> 
  ggplot(aes(days, reaction, group = subject)) +
  geom_smooth(method = "lm", color = "black", linewidth = 0.5, se = FALSE)

(pa | pb) + plot_layout(widths = c(6, 4), axis_titles = "collect")
Figure 7: Scatterplots and a spaghetti plot of reaction times on days of sleep deprivation.

Notation

\[ \begin{aligned} y_{ij} &\sim N\left(\beta_0 + \gamma_{0j} + \left(\beta_1 + \gamma_{1j}\right)x_{ij}, \sigma^2\right), \\ \begin{bmatrix} \gamma_{0j} \\ \gamma_{1j} \end{bmatrix} &\sim MVN\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{pmatrix} \tau_0 & \\ \rho &\tau_1 \end{pmatrix} \right). \end{aligned} \tag{1}\]

Code
fitml1 <- brm(
  reaction ~ days + (days | subject),
  data = dat,
  control = list(adapt_delta = .95),
  file = here("models/introduction-ml-1")
)

Notation

\[ \begin{align*} y_{ij} &\sim N(\beta_{0i} + \beta_{1i}x_{ij}, \sigma^2), \\ \beta_{0i} &= \bar{\beta}_0 + u_{0i}, \\ \beta_{1i} &= \bar{\beta}_1 + u_{1i}, \\ \begin{bmatrix} u_{0i} \\ u_{1i} \end{bmatrix} &\sim MVN\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{pmatrix} \tau_0 & \\ \rho_{01} &\tau_1 \end{pmatrix} \right). \end{align*} \tag{2}\]

Code
fitml2 <- brm(
  bf(
    reaction ~ b0 + b1*days,
    b0 + b1 ~ 1 + (1 |s| subject),
    nl = TRUE
  ),
  data = dat,
  control = list(adapt_delta = .95),
  file = here("models/introduction-ml-2")
)

Interpretation

Code
kable(sm(fitml2))
variable                                   mean   sd   q2.5  q97.5    pd  rhat  ess_tail
b_b0_Intercept                           267.77  9.2  249.4  285.67  1.00     1      2165
b_b1_Intercept                            11.42  2.1    7.3   15.65  1.00     1      2747
sd_subject__b0_Intercept                  34.06  8.3   21.1   53.47  1.00     1      2383
sd_subject__b1_Intercept                   7.44  1.9    4.4   11.58  1.00     1      2592
cor_subject__b0_Intercept__b1_Intercept    0.18  0.3   -0.4    0.75  0.72     1      2021
sigma                                     25.98  1.8   22.8   29.97  1.00     1      2587

Priors

Code
get_prior(fitml2)[,-c(5, 6, 9)]
#>                  prior class      coef   group nlpar lb  source
#>                 (flat)     b                      b0    default
#>                 (flat)     b Intercept            b0    default
#>                 (flat)     b                      b1    default
#>                 (flat)     b Intercept            b1    default
#>   lkj_corr_cholesky(1)     L                            default
#>                 (flat)     L           subject          default
#>  student_t(3, 0, 65.5)    sd                      b0  0 default
#>  student_t(3, 0, 65.5)    sd                      b1  0 default
#>                 (flat)    sd           subject    b0    default
#>                 (flat)    sd Intercept subject    b0    default
#>                 (flat)    sd           subject    b1    default
#>                 (flat)    sd Intercept subject    b1    default
#>  student_t(3, 0, 65.5) sigma                          0 default
  • What do you think?
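For comparison, here is one possible set of user-defined priors for this non-linear multilevel model (a sketch; the specific distributions and scales are illustrative assumptions, not recommendations):

Code
p_ml <- prior(normal(300, 100), class = "b", nlpar = "b0") +
  prior(normal(0, 50), class = "b", nlpar = "b1") +
  prior(student_t(3, 0, 50), class = "sd", nlpar = "b0") +
  prior(student_t(3, 0, 20), class = "sd", nlpar = "b1") +
  prior(lkj(2), class = "cor") +
  prior(student_t(3, 0, 50), class = "sigma", lb = 0)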

References

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. Boca Raton: Chapman & Hall/CRC.
Gelman, Andrew, Aki Vehtari, Daniel Simpson, Charles C. Margossian, Bob Carpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, and Martin Modrák. 2020. “Bayesian Workflow.” November 3, 2020. https://doi.org/10.48550/arXiv.2011.01808.
Kruschke, John K. 2014. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. 2nd ed. Burlington, MA: Academic Press.
McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. 2nd ed. CRC Texts in Statistical Science. Boca Raton: Taylor & Francis, CRC Press.