brms workshop >> 2020-10-22

Introduction

What are ordinal data?

  • Ordinal data are common in psychology
  • Most common are Likert items
  • But also e.g. item above; school grades; number of forks you own; discrete temporal data

Methods for analysis

  • Metric models
    • Models that assume outcomes have a continuous distribution, e.g. t-test
    • Overestimate information in data; common & “simple”
  • Nonparametric statistics
    • e.g. analyses of signed ranks (R: ?wilcox.test, etc.)
    • Underestimate information in data; don’t scale well
  • Ordinal models
    • A zoo of models that treat outcomes appropriately as ordered categories
    • Let’s learn more!

“Analyzing ordinal data with metric models: What could possibly go wrong?”

  • Liddell and Kruschke (2018) surveyed 68 psychology articles that analysed ordinal data, and found that every article used metric models
  • Metric models on ordinal data can lead to false alarms, failures to detect true effects, distorted effect size estimates, and inversions of effects
  • Three main shortcomings of metric models:
    • Response categories may not be (e.g. psychologically) equidistant
    • Responses can be non-normally distributed
    • Can treat differences in variances of underlying variable inappropriately
  • I don’t mean to be an alarmist, or ignore practical considerations. We don’t know the empirical rate of differences. But…

“Analyzing ordinal data with metric models: What could possibly go wrong?”

  • Welch’s t-test: Movie 10’s mean rating was significantly greater (Standardized difference = 0.14 [0.23, 0.04])
  • Cumulative probit model: Movie 10’s mean rating was significantly smaller (Difference = -0.2 [-0.31, -0.09])
  • I cherry-picked this example but it exists

Ordinal models

Ordinal models

  • There are many different ordinal models
  • We focus on the cumulative model (CM)
    • Generally the most useful / widely applicable model
  • IRT? SDT?
  • We introduce CM in the context of a study by Forstmann et al. (2020)

Cumulative model

  • 1,225 festivalgoers were asked about their mood and substance use, among other things
  • The mood rating item, \(Y\), had \(K + 1 = 6\) categories \(1, 2, ..., 6\)
#> # A tibble: 6 x 6
#>    Mood    TE Gender   Age Survey H24  
#>   <dbl> <dbl> <fct>  <dbl> <fct>  <fct>
#> 1     5     3 1         30 Event3 0    
#> 2     5     2 2         50 Event5 0    
#> 3     6     7 1         32 Event4 0    
#> 4     6     7 2         33 Event2 0    
#> 5     5     1 1         22 Event3 0    
#> 6     5     7 2         22 Event4 0

Cumulative model

  • CM assumes that the observed categorical variable \(Y\) is based on the categorization of an unobserved (“latent”) variable \(\tilde{Y}\) with \(K\) thresholds \(\tau = (\tau_1, \dots, \tau_K)\).

  • In this example, \(\tilde{Y}\) has a natural interpretation as current mood

  • We assume that \(\tilde{Y}\) has a normal distribution, but other choices are possible, such as logistic (common; default)

  • Describe the ordered distribution of responses using thresholds

    • \(Y = k \Leftrightarrow \tau_{k-1} < \tilde{Y} \leq \tau_k\)
  • These thresholds give the probability of each response category

    • \(Pr(Y = k) = \Phi(\tau_k) - \Phi(\tau_{k-1})\)
  • \(\tilde{Y}\) is amenable to regression (without intercept)

    • \(\tilde{Y} \sim N(\eta, \sigma = 1); \ \eta = b_1x_1 +...\)
    • \(Pr(Y = k \vert \eta) = \Phi(\tau_k - \eta) - \Phi(\tau_{k-1} - \eta)\)
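The two formulas above are easy to check numerically. A minimal sketch, using made-up thresholds (not estimates from these data):

```r
# Sketch: category probabilities implied by the cumulative probit model.
# tau and eta below are made-up illustration values, not estimates.
tau <- c(-2.6, -2.2, -1.8, -0.9, 0.7)  # K = 5 thresholds for K + 1 = 6 categories
eta <- 0                               # linear predictor

# Pr(Y = k) = Phi(tau_k - eta) - Phi(tau_{k-1} - eta),
# padding with tau_0 = -Inf and tau_{K+1} = Inf
p <- diff(pnorm(c(-Inf, tau, Inf) - eta))
round(p, 3)
sum(p)  # the six category probabilities sum to 1
```

Shifting eta up moves probability mass toward the higher categories; the thresholds alone describe the baseline distribution.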

Cumulative model

tab <- count(dat, Mood) %>% 
  mutate(
    p = n / sum(n), 
    cp = cumsum(p), 
    z = qnorm(cp)
    )
#> # A tibble: 6 x 5
#>    Mood     n       p      cp       z
#>   <dbl> <int>   <dbl>   <dbl>   <dbl>
#> 1     1     5 0.00431 0.00431  -2.63 
#> 2     2    10 0.00862 0.0129   -2.23 
#> 3     3    23 0.0198  0.0328   -1.84 
#> 4     4   151 0.130   0.163    -0.982
#> 5     5   661 0.570   0.733     0.621
#> 6     6   310 0.267   1       Inf

Ok, but how do I do that in practice?

library(brms)  # Bayesian, slower, more flexible
library(ordinal)  # Frequentist, fast, less flexible
  • So far we have described a weird link function + intercepts
  • Write your regressions in R (brms) modelling syntax
  • Effects on \(\tilde{Y}\) are directly interpretable

My first cumulative model

fit1 <- brm(
  Mood ~ H24,
  family = cumulative("probit"),
  data = dat,
  file = here("models/ordinal-1")
)
  • family = cumulative(): CM

  • "probit": \(\tilde{Y} \sim {N}(\eta, \sigma = 1)\)

  • Mood ~ H24: \(\eta = b_1\text{H24}\)

  • \(b_1\) is the degree to which mood is greater in people who used hallucinogens in the past 24 hours, compared to people who didn’t use

  • Effects are on the scale of the latent variable (standard deviations)
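To see what \(b_1\) means on the outcome scale, we can plug point estimates back into the CM formula by hand. A sketch using rounded values from summary(fit1); in practice you would use full posterior draws (e.g. via fitted(fit1)):

```r
# Rounded point estimates (thresholds and b1) from summary(fit1); a sketch,
# not a substitute for working with the full posterior.
tau <- c(-2.63, -2.20, -1.80, -0.93, 0.69)
b1  <- 0.39

p_no  <- diff(pnorm(c(-Inf, tau, Inf)))        # H24 = 0: eta = 0
p_yes <- diff(pnorm(c(-Inf, tau, Inf) - b1))   # H24 = 1: eta = b1
rbind(p_no, p_yes)  # implied Pr(Y = k) for each group
```

A positive \(b_1\) shifts probability mass toward the higher mood categories for the H24 = 1 group.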

Cumulative model

summary(fit1)
#>  Family: cumulative 
#>   Links: mu = probit; disc = identity 
#> Formula: Mood ~ H24 
#>    Data: dat (Number of observations: 1160) 
#> Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup samples = 4000
#> 
#> Population-Level Effects: 
#>              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept[1]    -2.63      0.15    -2.95    -2.36 1.00     2260     2188
#> Intercept[2]    -2.20      0.10    -2.40    -2.01 1.00     3723     2673
#> Intercept[3]    -1.80      0.07    -1.94    -1.65 1.00     4210     2968
#> Intercept[4]    -0.93      0.05    -1.02    -0.84 1.00     4155     3639
#> Intercept[5]     0.69      0.04     0.61     0.77 1.00     4199     2974
#> H241             0.39      0.09     0.20     0.57 1.00     4282     2893
#> 
#> Family Specific Parameters: 
#>      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> disc     1.00      0.00     1.00     1.00 1.00     4000     4000
#> 
#> Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).

Cumulative model

conditional_effects(fit1, categorical = TRUE)

Cumulative model

pp_check(fit1, type = "bars_grouped", group = "H24")

Cumulative model

  • It is standard practice not to assume equal variances when we run e.g. t-tests
  • Metric models can deal terribly with different variances in \(\tilde{Y}\)
  • Cannot simultaneously estimate thresholds and sigma, which is fixed to 1 for baseline group
  • However, we can predict the variance (without intercept)
  • \(Pr(Y = k \vert \eta, disc) = \Phi(disc \times (\tau_{k} - \eta)) - \Phi(disc \times (\tau_{k-1} - \eta))\)
  • disc?
    • IRT: Discrimination parameter (slope of response function)
    • Predicted on the log scale \(disc = exp(\eta_{disc})\)
    • \(\sigma\) = \(1/disc\)
  • \(\tilde{Y} \sim N(\eta, 1/exp(\eta_{disc})); \ \eta = b_1x_1 +...; \eta_{disc} = g_1x_2 + ...\)
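The role of disc can be sketched numerically too (made-up values, not posterior estimates; disc multiplies the standardized distances, so disc > 1 implies a smaller latent SD):

```r
# Sketch of the unequal-variances CM; tau, eta, and eta_disc are
# illustration values, not estimates from fit2.
tau      <- c(-2.6, -2.2, -1.8, -0.9, 0.7)
eta      <- 0.4   # latent mean shift for this group
eta_disc <- 0.1   # log discrimination for this group
disc     <- exp(eta_disc)

# Pr(Y = k | eta, disc) = Phi(disc*(tau_k - eta)) - Phi(disc*(tau_{k-1} - eta))
p <- diff(pnorm(disc * (c(-Inf, tau, Inf) - eta)))
round(p, 3)
1 / disc  # implied latent SD for this group
```

With eta_disc > 0 the latent distribution is narrower, concentrating responses in fewer categories than the baseline group.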

Cumulative model

fit2 <- brm(
  bf(Mood ~ H24) + 
    lf(disc ~ 0 + H24, cmc = FALSE),
  family = cumulative("probit"),
  data = dat,
  file = here("models/ordinal-2")
)

Cumulative model

#>  Family: cumulative 
#>   Links: mu = probit; disc = log 
#> Formula: Mood ~ H24 
#>          disc ~ 0 + H24
#>    Data: dat (Number of observations: 1160) 
#> Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup samples = 4000
#> 
#> Population-Level Effects: 
#>              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept[1]    -2.62      0.16    -2.97    -2.33 1.00     2896     2496
#> Intercept[2]    -2.19      0.10    -2.39    -1.99 1.00     4299     2861
#> Intercept[3]    -1.78      0.07    -1.93    -1.64 1.00     4475     2932
#> Intercept[4]    -0.92      0.05    -1.01    -0.83 1.00     4044     3285
#> Intercept[5]     0.68      0.04     0.60     0.77 1.00     3046     2812
#> H241             0.37      0.09     0.19     0.56 1.00     3973     2541
#> disc_H241        0.08      0.08    -0.09     0.23 1.00     3205     3076
#> 
#> Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).

Cumulative model

posterior_samples(fit2, pars = "H24") %>%
  mutate(sigma_h24 = 1 / exp(b_disc_H241)) %>% 
  mcmc_hist()

More ordinal models

Category specific effects

  • The CM cannot estimate category-specific effects
  • Adjacent category model: predict decisions between categories

  • One problem, however…
table(dat$H24, dat$Mood)
#>    
#>       1   2   3   4   5   6
#>   0   5  10  22 136 561 242
#>   1   0   0   1  15 100  68

Category specific effects

  • There are two cells with no observations
  • Those category-specific effects won’t be identified
  • If only there was a way to inject information to the model…
  • Bayes to the rescue!
weakly_informative_prior <- prior(normal(0, 1.5), class = "b")
fit3 <- brm(
  bf(Mood ~ cs(H24)),
  family = acat("probit"),
  prior = weakly_informative_prior,
  data = dat,
  control = list(adapt_delta = .95),
  file = here("models/ordinal-3")
)

Category specific effects

summary(fit3)
#>  Family: acat 
#>   Links: mu = probit; disc = identity 
#> Formula: Mood ~ cs(H24) 
#>    Data: dat (Number of observations: 1160) 
#> Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup samples = 4000
#> 
#> Population-Level Effects: 
#>              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept[1]    -0.44      0.32    -1.07     0.19 1.00     4155     3196
#> Intercept[2]    -0.51      0.23    -0.97    -0.08 1.00     3748     2832
#> Intercept[3]    -1.09      0.13    -1.33    -0.84 1.00     3555     2669
#> Intercept[4]    -0.86      0.05    -0.97    -0.76 1.00     3791     3331
#> Intercept[5]     0.52      0.05     0.43     0.61 1.00     4165     2713
#> H241[1]          0.40      1.31    -2.00     3.13 1.00     4561     2766
#> H241[2]          1.02      1.13    -0.98     3.27 1.00     3942     2846
#> H241[3]          0.62      0.51    -0.31     1.70 1.00     4129     2952
#> H241[4]          0.27      0.16    -0.02     0.60 1.00     3933     2751
#> H241[5]          0.28      0.11     0.06     0.49 1.00     4482     3081
#> 
#> Family Specific Parameters: 
#>      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> disc     1.00      0.00     1.00     1.00 1.00     4000     4000
#> 
#> Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).

Category specific effects

conditional_effects(fit3, categorical = TRUE)

Model comparison

fit4 <- brm(
  bf(Mood ~ H24),
  family = acat("probit"),
  prior = weakly_informative_prior,
  data = dat,
  file = here("models/ordinal-4")
)

Model comparison

loo(fit1, fit2, fit3, fit4)
#> Output of model 'fit1':
#> 
#> Computed from 4000 by 1160 log-likelihood matrix
#> 
#>          Estimate   SE
#> elpd_loo  -1250.6 27.6
#> p_loo         6.0  0.6
#> looic      2501.2 55.1
#> ------
#> Monte Carlo SE of elpd_loo is 0.0.
#> 
#> All Pareto k estimates are good (k < 0.5).
#> See help('pareto-k-diagnostic') for details.
#> 
#> Output of model 'fit2':
#> 
#> Computed from 4000 by 1160 log-likelihood matrix
#> 
#>          Estimate   SE
#> elpd_loo  -1251.0 27.5
#> p_loo         6.9  0.7
#> looic      2502.1 55.0
#> ------
#> Monte Carlo SE of elpd_loo is 0.0.
#> 
#> All Pareto k estimates are good (k < 0.5).
#> See help('pareto-k-diagnostic') for details.
#> 
#> Output of model 'fit3':
#> 
#> Computed from 4000 by 1160 log-likelihood matrix
#> 
#>          Estimate   SE
#> elpd_loo  -1252.1 27.7
#> p_loo         8.6  1.7
#> looic      2504.2 55.5
#> ------
#> Monte Carlo SE of elpd_loo is NA.
#> 
#> Pareto k diagnostic values:
#>                          Count Pct.    Min. n_eff
#> (-Inf, 0.5]   (good)     1159  99.9%   3530      
#>  (0.5, 0.7]   (ok)          0   0.0%   <NA>      
#>    (0.7, 1]   (bad)         1   0.1%   84        
#>    (1, Inf)   (very bad)    0   0.0%   <NA>      
#> See help('pareto-k-diagnostic') for details.
#> 
#> Output of model 'fit4':
#> 
#> Computed from 4000 by 1160 log-likelihood matrix
#> 
#>          Estimate   SE
#> elpd_loo  -1249.8 27.6
#> p_loo         6.0  0.6
#> looic      2499.6 55.1
#> ------
#> Monte Carlo SE of elpd_loo is 0.0.
#> 
#> All Pareto k estimates are good (k < 0.5).
#> See help('pareto-k-diagnostic') for details.
#> 
#> Model comparisons:
#>      elpd_diff se_diff
#> fit4  0.0       0.0   
#> fit1 -0.8       0.6   
#> fit2 -1.2       0.5   
#> fit3 -2.3       1.7

Objections and counterarguments

It’s too difficult

  • There are, of course, practical considerations
  • Yes, the weird link function and intercepts take some getting used to
  • Effects on latent variable are interpretable just like your betas in lm()

The results are the same anyway

  • How do you know?
  • Did you fit an ordinal model to confirm?
  • The prevalence of problems in metric models applied to ordinal data is an empirical question, and results probably vary greatly between types of data & measures
  • Fit ordinal models whenever you can
  • They afford a more nuanced interpretation of what’s going on in your data

References

Forstmann, Matthias, Daniel A. Yudkin, Annayah M. B. Prosser, S. Megan Heller, and Molly J. Crockett. 2020. “Transformative Experience and Social Connectedness Mediate the Mood-Enhancing Effects of Psychedelic Use in Naturalistic Settings.” Proceedings of the National Academy of Sciences 117 (5): 2338–46. https://doi.org/10.1073/pnas.1918477117.

Liddell, Torrin M., and John K. Kruschke. 2018. “Analyzing Ordinal Data with Metric Models: What Could Possibly Go Wrong?” Journal of Experimental Social Psychology 79 (November): 328–48. https://doi.org/10.1016/j.jesp.2018.08.009.