brms workshop, 2020-10-22
“Bayesian inference is reallocation of credibility across possibilities.” (Kruschke 2014, p. 15)
“Bayesian data analysis takes a question in the form of a model and uses logic to produce an answer in the form of probability distributions.” (McElreath 2020, p. 10)
“Bayesian inference is the process of fitting a probability model to a set of data and summarizing the result by a probability distribution on the parameters of the model and on unobserved quantities such as predictions for new observations.” (Gelman et al. 2013, p. 1)
(Figure from McElreath, 2020; https://xcelab.net/rm/statistical-rethinking/)
You can estimate models in the Bayesian context that might not otherwise be possible. My first Bayesian analysis was conducted out of necessity: the model I wanted to use did not converge to a solution when I attempted orthodox methods (maximum likelihood estimation). Around the same time, I heard about Stan. I wrote some Stan code, the model converged without problems, and I was able to use the model I wanted.
With Bayes, you can actually be confident in your confidence intervals. I have a difficult time understanding p-values and confidence intervals: it can be hard to know what the uncertainty estimates mean when hypothetical replications are difficult to imagine in a given context. With a posterior distribution at hand, the corresponding probability values have a direct interpretation as credibility, uncertainty, or plausibility.
Bayesian methods make it easy to carry (un)certainty forward to other quantities of interest. With orthodox methods, it can be difficult to obtain uncertainty estimates for derived quantities. For example, effect size metrics are often reported without error bars (they can be obtained, but the methods for doing so are finicky and rarely used).
To be sure, the Bayesian framework does not come for free. The methods might be difficult to communicate to others, at least until orthodox statistics are replaced in undergraduate applied statistics education. The necessity of complex computational algorithms makes it time-consuming—you will enjoy doing BDA more if you have a fast computer.
\[ p(\theta \vert Y) = \frac{p(Y \vert \theta) p(\theta)}{p(Y)} \]
The denominator \(p(Y)\) is a constant that does not depend on the parameters, so in practice we work with the unnormalized posterior:
\[ p(\theta \vert Y) \propto p(Y \vert \theta) p(\theta) \]
We then need to specify how the likelihood of each data point contributes to the parameters’ overall probability. Assuming independent observations, the likelihood is the product of the data points’ individual contributions:
\[ p(\theta \vert Y) \propto p(\theta) \prod^N_{n=1} p(y_n \vert \theta) \]
In terms of programming, we think of adding up the log probabilities of each observation:
\[ \text{log}\ p(\theta \vert Y) \propto \text{log}\ p(\theta) + \sum^N_{n=1} \text{log}\ p(y_n \vert \theta) \]
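As a concrete illustration, here is a minimal sketch of evaluating an unnormalized log posterior in R (the toy data and distributions are assumptions for illustration, not workshop code):

```r
# Unnormalized log posterior for a mean theta, with a N(0, 10) prior
# and a normal likelihood with known sd = 1.
y <- c(-0.5, 0.2, 1.1)  # hypothetical data
log_posterior <- function(theta) {
  dnorm(theta, mean = 0, sd = 10, log = TRUE) +      # log p(theta)
    sum(dnorm(y, mean = theta, sd = 1, log = TRUE))  # sum of log p(y_n | theta)
}
log_posterior(0.3)
```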
brms lets us estimate these models in R using the familiar (g)lmer formula syntax from lme4, with Stan doing the sampling under the hood.

```r
library(here)   # project-relative paths for the saved model files below
library(rstan)
library(brms)
```
To get started with BDA, it is useful to first informally define what a “Bayesian workflow” might look like. Following Kruschke (2014, p. 25), we identify five key data analysis steps:

1. Identify the data relevant to the research question.
2. Define a descriptive model for the data.
3. Specify a prior distribution on the model’s parameters.
4. Use Bayesian inference to reallocate credibility across parameter values, i.e. compute the posterior distribution.
5. Check that the posterior predictions mimic the data with reasonable accuracy (posterior predictive check).
We collected data from a single person on the effects of sleep deprivation on cognitive performance, as measured by reaction time on a cognitive task. The data are 10 observations of reaction times across 10 days:
```r
library(tidyverse)
data(sleepstudy, package = "lme4")
sleepstudy <- sleepstudy %>%
  filter(Subject == 308) %>%
  select(Days, Reaction) %>%
  as_tibble()
sleepstudy
```

```
## # A tibble: 10 x 2
##     Days Reaction
##    <dbl>    <dbl>
##  1     0     250.
##  2     1     259.
##  3     2     251.
##  4     3     321.
##  5     4     357.
##  6     5     415.
##  7     6     382.
##  8     7     290.
##  9     8     431.
## 10     9     466.
```
You have seen this model written as
\[ y_n = \mu + \epsilon_n \]
where
\[ \epsilon_n \sim N(0, \sigma^2) \]
But we prefer the following notation for its clarity and emphasis on data rather than errors:
\[ y_n \sim N(\mu, \sigma^2) \]
We complete the model by assigning priors to the parameters:

\[ \mu \sim N(250, 200) \\ \sigma \sim N^+(0, 200) \]
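To see what these priors imply about the data before any observations are used, one can simulate from them. A minimal sketch (not part of the workshop code; the half-normal is drawn as the absolute value of a normal):

```r
# Simulate reaction times implied by the priors alone.
mu    <- rnorm(1e4, 250, 200)
sigma <- abs(rnorm(1e4, 0, 200))  # N+(0, 200)
y_sim <- rnorm(1e4, mu, sigma)
quantile(y_sim, c(.05, .5, .95))
```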
Bayes’ theorem then tells us how the likelihood and the priors combine into the posterior distribution:

\[ p(\theta \vert Y) \propto p(Y \vert \theta) p(\theta) \]
```r
fit0 <- brm(
  formula = Reaction ~ 1,
  family = gaussian(),
  prior = prior(normal(250, 200), class = "Intercept") +
    prior(normal(0, 200), class = "sigma"),
  data = sleepstudy,
  sample_prior = "yes",
  iter = 2000,
  chains = 4,
  cores = 4,
  file = here("models/introduction-0")
)
```
It is important to check that the MCMC chains have converged to a common solution:
```r
plot(fit0, combo = c("dens_overlay", "trace"))
```
```r
summary(fit0)
```

```
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: Reaction ~ 1 
##    Data: sleepstudy (Number of observations: 10) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   340.80     29.84   282.13   400.51 1.00     2042     1879
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    91.69     25.93    56.95   156.82 1.00     1461     1606
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```
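Convergence can also be checked numerically. A brief sketch, assuming the rhat() and neff_ratio() methods that brms provides for fitted models:

```r
# Potential scale reduction factors; values near 1 indicate convergence.
rhat(fit0)
# Effective sample sizes relative to the total number of draws.
neff_ratio(fit0)
```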
Once a posterior distribution is obtained, it is prudent to check whether the model makes reasonable predictions; that is, whether it “fits the data” well. This is sometimes called posterior predictive checking, because we use the posterior to generate predictions that are then checked against the data. These checks can focus on the overall “fit” of the model…
```r
pp_check(fit0, type = "hist", nsamples = 5, binwidth = 30)
```
…or focus on particular aspects of the data, such as the minimum or maximum values.
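For example, pp_check() forwards additional arguments to the underlying bayesplot function, so the joint distribution of the minimum and maximum can be checked directly (a quick sketch; we run the same check for the next model below):

```r
# Compare the observed (min, max) of Reaction to posterior predictive draws.
pp_check(fit0, type = "stat_2d", stat = c("min", "max"))
```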
Next, we include the number of days of sleep deprivation as a predictor of reaction time:

\[ y_n \sim N(\beta_0 + \beta_1 x_n, \sigma^2), \]

with priors
\[ \beta_0 \sim N(250, 200) \\ \beta_1 \sim N(30, 40) \\ \sigma \sim N^+(0, 200) \]
```r
fit1 <- brm(
  bf(Reaction ~ Days, center = FALSE),
  prior = prior(normal(250, 200), class = "b", coef = "Intercept") +
    prior(normal(30, 40), class = "b", coef = "Days") +
    prior(normal(0, 200), class = "sigma"),
  sample_prior = "yes",
  data = sleepstudy,
  file = here("models/introduction-1")
)
```
```r
plot(fit1, combo = c("dens_overlay", "trace"))
```
```r
summary(fit1)
```

```
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: Reaction ~ Days 
##    Data: sleepstudy (Number of observations: 10) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   244.71     32.94   180.46   311.52 1.00     1396     1463
## Days         21.54      6.26     8.70    33.75 1.00     1308     1308
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    56.47     18.54    33.35   102.84 1.00     1372     1459
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```
The posterior draws can be visualized with the tidybayes package:

```r
library(tidybayes)
gather_draws(fit1, b_Intercept, b_Days, sigma) %>%
  ggplot(aes(y = .variable, x = .value)) +
  stat_histinterval(breaks = 50) +
  scale_x_continuous("Parameter value") +
  theme(axis.title.y = element_blank())
```
brms’ hypothesis() method makes directed hypothesis tests easy; for example, is the effect of Days greater than 10 ms per day?

```r
hypothesis(fit1, "Days > 10")
```

```
## Hypothesis Tests for class b:
##        Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
## 1 (Days)-(10) > 0    11.54      6.26     1.35    21.54      29.08      0.97    *
## ---
## 'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
## '*': For one-sided hypotheses, the posterior probability exceeds 95%;
## for two-sided hypotheses, the value tested against lies outside the 95%-CI.
## Posterior probabilities of point hypotheses assume equal prior probabilities.
```
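The reported posterior probability is simply the proportion of posterior draws that satisfy the stated hypothesis, which we can verify by hand:

```r
# Proportion of posterior draws in which the Days effect exceeds 10.
mean(posterior_samples(fit1)$b_Days > 10)
```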
More generally, posterior draws can be extracted as a data frame and transformed into derived quantities, carrying their uncertainty along:

```r
post <- posterior_samples(fit1)[, 1:3] %>%
  as_tibble()
head(post)
```

```
## # A tibble: 6 x 3
##   b_Intercept b_Days sigma
##         <dbl>  <dbl> <dbl>
## 1        218.   25.2  30.5
## 2        244.   23.6  36.0
## 3        212.   26.0  64.6
## 4        239.   24.9  60.2
## 5        250.   18.4  51.7
## 6        277.   14.2  64.6
```
```r
# A derived quantity of interest: the Days effect in residual-SD units.
post$qoi <- post$b_Days / post$sigma
posterior_summary(post)
```

```
##               Estimate  Est.Error        Q2.5       Q97.5
## b_Intercept 244.707978 32.9356146 180.4573433 311.5204073
## b_Days       21.540890  6.2578309   8.7024414  33.7477390
## sigma        56.471227 18.5422442  33.3541342 102.8432535
## qoi           0.415414  0.1553816   0.1199001   0.7304555
```
```r
pp_check(fit1, nsamples = 300)
```
```r
library(bayesplot)
x <- posterior_predict(fit1)  # draws from the posterior predictive distribution
ppc_stat_2d(sleepstudy$Reaction, x, c("min", "max"))
```
```r
plot(conditional_effects(fit1, "Days"), points = TRUE)
```
So far we have assumed a constant residual standard deviation. In a distributional model, we can also predict sigma, modeled on the log scale so that it stays positive:

\[ y_n \sim N(\beta_0 + \beta_1 x_n, \text{exp}(\gamma_0 + \gamma_1 x_n)^2), \]
```r
new_model <- bf(Reaction ~ Days, center = FALSE) +
  lf(sigma ~ Days)
get_prior(new_model, sleepstudy)
```

```
##                  prior     class      coef group resp  dpar nlpar bound       source
##                 (flat)         b                                               default
##                 (flat)         b      Days                                (vectorized)
##                 (flat)         b Intercept                                (vectorized)
##                 (flat)         b                       sigma              (vectorized)
##                 (flat)         b      Days             sigma              (vectorized)
##   student_t(3, 0, 2.5) Intercept                       sigma                   default
```
```r
fit2 <- brm(
  new_model,
  prior = prior(normal(250, 200), class = "b", coef = "Intercept") +
    prior(normal(30, 40), class = "b", coef = "Days"),
  sample_prior = "yes",
  data = sleepstudy,
  control = list(adapt_delta = .95),
  file = here("models/introduction-2")
)
```
```r
summary(fit2)
```

```
##  Family: gaussian 
##   Links: mu = identity; sigma = log 
## Formula: Reaction ~ Days 
##          sigma ~ Days
##    Data: sleepstudy (Number of observations: 10) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_Intercept     2.96      0.60     1.95     4.27 1.00     1329     1492
## Intercept         242.57     18.04   206.45   281.16 1.00     1628     1147
## Days               22.16      5.28    11.67    32.76 1.00     1679     1441
## sigma_Days          0.18      0.11    -0.05     0.41 1.00     1266     1538
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```
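Because sigma is modeled on the log scale, its coefficients must be exponentiated to read them as standard deviations. A quick back-of-the-envelope sketch using the posterior means from the summary above:

```r
# Implied residual SDs at Day 0 and Day 9 (posterior means, log link):
exp(2.96 + 0.18 * c(0, 9))  # about 19 ms at Day 0, about 97 ms at Day 9
```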
```r
mcmc_hist(fit2, pars = c("b_sigma_Days", "b_Days"))
```
We compare the models’ estimated out-of-sample predictive performance with approximate leave-one-out cross-validation (PSIS-LOO):

```r
options(loo.cores = parallel::detectCores(logical = FALSE))
looics <- loo(fit0, fit1, fit2)
looics
```

```
## Output of model 'fit0':
## 
## Computed from 4000 by 10 log-likelihood matrix
## 
##          Estimate  SE
## elpd_loo    -59.3 1.2
## p_loo         1.2 0.3
## looic       118.7 2.5
## ------
## Monte Carlo SE of elpd_loo is 0.0.
## 
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
## 
## Output of model 'fit1':
## 
## Computed from 4000 by 10 log-likelihood matrix
## 
##          Estimate  SE
## elpd_loo    -55.8 3.5
## p_loo         3.2 1.9
## looic       111.7 7.1
## ------
## Monte Carlo SE of elpd_loo is NA.
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     9     90.0%   1047      
##  (0.5, 0.7]   (ok)       0      0.0%   <NA>      
##    (0.7, 1]   (bad)      1     10.0%   21        
##    (1, Inf)   (very bad) 0      0.0%   <NA>      
## See help('pareto-k-diagnostic') for details.
## 
## Output of model 'fit2':
## 
## Computed from 4000 by 10 log-likelihood matrix
## 
##          Estimate  SE
## elpd_loo    -55.0 3.5
## p_loo         4.3 1.6
## looic       110.0 7.0
## ------
## Monte Carlo SE of elpd_loo is NA.
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     7     70.0%   980       
##  (0.5, 0.7]   (ok)       2     20.0%   444       
##    (0.7, 1]   (bad)      1     10.0%   22        
##    (1, Inf)   (very bad) 0      0.0%   <NA>      
## See help('pareto-k-diagnostic') for details.
## 
## Model comparisons:
##      elpd_diff se_diff
## fit2  0.0       0.0   
## fit1 -0.8       1.4   
## fit0 -4.3       3.7
```
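A possible follow-up is to convert these comparisons into relative model weights; a sketch using brms’ model_weights() with its default (stacking) method:

```r
# Relative weights of the three models (stacking, the brms default).
model_weights(fit0, fit1, fit2)
```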
The proportion of variance explained, a Bayesian R², is also available with uncertainty intervals:

```r
library(knitr)
bayes_R2(fit1) %>% kable()
```

|    |  Estimate | Est.Error |     Q2.5 |     Q97.5 |
|:---|----------:|----------:|---------:|----------:|
| R2 | 0.6231391 |  0.154249 | 0.161916 | 0.7581443 |
```r
bayes_R2(fit2) %>% kable()
```

|    |  Estimate | Est.Error |      Q2.5 |     Q97.5 |
|:---|----------:|----------:|----------:|----------:|
| R2 | 0.6488731 | 0.1226077 | 0.2962559 | 0.7580441 |
This introduction to Bayesian data analysis was necessarily brief, but it has hopefully introduced the key concepts that we will keep encountering in the next sessions.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. Boca Raton: Chapman & Hall/CRC.
Kruschke, John K. 2014. Doing Bayesian Data Analysis: A Tutorial Introduction with R. 2nd ed. Burlington, MA: Academic Press.
McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. 2nd ed. CRC Texts in Statistical Science. Boca Raton: Taylor & Francis, CRC Press.