Quantitative literature review with R: Exploring Psychonomic Society Journals, Part II

(This post directly continues from part I of Quantitative literature review with R. Please read that first for context.) In this tutorial, I’ll show how to use R to quantitatively explore, analyze, and visualize a broad research literature, using all Psychonomic Society publications between 2004 and 2016. Part I focused on data cleaning and simple figures, but here we will look at relational data by visualizing some network structures in the data.

We’ll be working with R, so if you want to follow along on your own computer, fire up R and load up the required R packages:

library(tidyverse)
library(stringr)

Then you can load the cleaned data from https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda using R:

load(url("https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda"))

We’ve got data on quite a few variables which we could imagine are interconnected, such as Tags (Keywords), Authors, and the articles themselves (through the References variable).

glimpse(d)
## Observations: 7,614
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Cited_by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors_with_affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author_Keywords           <chr> "3D perception; Perception and actio...
## $ Index_Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence_Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated_Source_Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document_Type             <chr> "Article", "Article", "Article", "Ar...
## $ Pages                     <int> 7, 13, 19, NA, 12, 12, 32, 19, 6, 22...
## $ Publication               <chr> "Attention, Perception & Psychophysi...
## $ Pub_abbr                  <chr> "Atten. Percept. Psychophys.", "Atte...

Let’s begin by looking at the articles’ Keywords.

Tags

These are the keywords that you choose to describe your article when you submit it for publication. They are called Keywords in the data, but I’m going to refer to them as Tags for brevity. Also, I will only work with the Author_Keywords, because it seems like Scopus’s automatic index keywords are mostly nonsense or missing.

So our first operation is to rename the Author_Keywords variable to Tags, and get rid of the automatic keywords altogether.

d <- rename(d, Tags = Author_Keywords) %>% 
    select(-Index_Keywords)

Before moving on to networks, let’s take a look at what these tags are like: How many are there? Do people use them consistently?

Exploring univariate tags

Our first goal is to simply visualize the most common tags. However, recall that the Tags variable in the data is quite messy:

## # A tibble: 3 x 1
##                                                                          Tags
##                                                                         <chr>
## 1    3D perception; Perception and action; Scene perception; Space perception
## 2 Auditory perception; Cognitive performance; Dissonance; Distraction; Inharm
## 3             Attention; Automaticity; Cognitive control; Selective attention

Each value of Tags is a kind of a list, whose individual items are separated by semicolons. What we’d like to have is a variable (column in a data frame) where every item is a single tag, so we need to do some kind of a data unwrapping operation.

The unwrapping can be done by simply splitting the Tags variable on each semicolon (and space) found in each string, but this creates a strange variable:

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) 
## # A tibble: 7,614 x 1
##         Tags
##       <list>
##  1 <chr [4]>
##  2 <chr [6]>
##  3 <chr [4]>
##  4 <chr [5]>
##  5 <chr [5]>
##  6 <chr [3]>
##  7 <chr [3]>
##  8 <chr [6]>
##  9 <chr [3]>
## 10 <chr [3]>
## # ... with 7,604 more rows

This is where things get really interesting. A few lines above, we conceptualized each value (cell) of Tags as a sort of list. We have now operationalized that idea by creating a Tags variable that is an R list-column, meaning that each cell can contain a complex R object. Here, each value is a list of character values. To make this notion concrete, look at the first row (slice(1)) of the Tags variable (.$Tags, where . is a placeholder for the data frame being passed along by the %>% pipes) with the familiar str() function:

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    slice(1) %>% 
    .$Tags %>%  str()
## List of 1
##  $ : chr [1:4] "3D perception" "Perception and action" "Scene perception" "Space perception"

All the tags associated with the first article in our data frame are in a neat list of neat character strings. Brilliant. This data format is quite foreign to us who are used to working with strictly flat rectangular data, but has a certain recursive beauty to it. Now that our list is concrete, we can unnest it to create a Tags variable in which each row is one individual tag.

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags)
## # A tibble: 22,375 x 1
##                     Tags
##                    <chr>
##  1         3D perception
##  2 Perception and action
##  3      Scene perception
##  4      Space perception
##  5   Auditory perception
##  6 Cognitive performance
##  7            Dissonance
##  8           Distraction
##  9         Inharmonicity
## 10                 Music
## # ... with 22,365 more rows

Excellent! This data frame has one variable, Tags, where each observation is an individual tag. Now we can go ahead and count and summarise the tags. From here onwards, it’s all regular dplyr verbs:

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags)) %>%  # Remove missing observations
    group_by(Tags) %>%  # Group by unique Tags
    count() %>%  # And count them
    ungroup() %>% 
    top_n(40) %>%  # Choose n most numerous Tags
    mutate(Tags = reorder(Tags, n)) %>%  # {1}
    ggplot(aes(n, Tags)) +
    geom_point() +
    geom_segment(aes(xend = 0, yend=Tags)) +
    scale_x_continuous(limits = c(0, 400), expand = c(0, 0)) +
    labs(x="Number of articles", y="",
         subtitle = "Top 30 tags in Psychonomics Journals",
         caption = "Data for years 2004-2016 only")

{1} reorders the Tags factor by the n variable, so that the tags appear in order in the figure. You could also do a wordcloud with these data, but I think I’ll refrain this time. Instead, what I’ll do is explore popular tags by journal. The “pipeline” to get tags by journal is almost identical, except that we’ll keep the Publication variable in the data.

The following code is a little more complicated than is required for a simple plot, but I wanted to order the Tags inside each publication, which required a little bit of extra work as described by @drsimonj. Below, I’ll work with journal abbreviations Pub_abbr.

pd <- d %>% 
    select(Pub_abbr, Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags)) %>%
    group_by(Pub_abbr, Tags) %>%  # Group by unique Tags AND journals
    count() %>%  # And count them
    group_by(Pub_abbr) %>% 
    top_n(10, n) %>%  # Select top 10 tags within each journal
    ungroup() %>% 
    arrange(Pub_abbr, n) %>%  # Arrange by n within journals
    mutate(order = row_number())  # Will order Tags on this
pd
## # A tibble: 62 x 4
##                       Pub_abbr                  Tags     n order
##                          <chr>                 <chr> <int> <int>
##  1 Atten. Percept. Psychophys.     Attentional blink    33     1
##  2 Atten. Percept. Psychophys.        Working memory    36     2
##  3 Atten. Percept. Psychophys. Visual working memory    40     3
##  4 Atten. Percept. Psychophys.      Visual attention    41     4
##  5 Atten. Percept. Psychophys.   Selective attention    49     5
##  6 Atten. Percept. Psychophys.     Visual perception    51     6
##  7 Atten. Percept. Psychophys.   Attentional capture    52     7
##  8 Atten. Percept. Psychophys.         Eye movements    61     8
##  9 Atten. Percept. Psychophys.         Visual search   116     9
## 10 Atten. Percept. Psychophys.             Attention   180    10
## # ... with 52 more rows

That was more complicated than usual, so I wanted to print out the intermediate result as well. We can now use it to plot:

pd %>%     
    ggplot(aes(n, order)) +
    geom_point() +
    geom_segment(aes(xend = 0, yend=order)) +
    scale_y_continuous(breaks = pd$order, labels = pd$Tags) +
    scale_x_continuous(expand = c(0.02, 0)) +
    labs(x="Number of articles", y="",
         subtitle = "Top 10 tags within 6 Psychonomics Journals",
         caption = "Data for years 2004-2016 only") +
    facet_wrap("Pub_abbr", scales="free", ncol=2)

Finally, let’s look at the overall distribution of the tags as a histogram:
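
Here is a minimal sketch of one way to draw it, reusing the unnesting pipeline from above (the binwidth and axis limit are arbitrary choices on my part, not taken from the original figure):

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags)) %>% 
    count(Tags) %>%  # One row per unique tag, with its frequency n
    ggplot(aes(n)) +
    geom_histogram(binwidth = 1, col = "black") +
    coord_cartesian(xlim = c(0, 50)) +  # Most tags occur only a handful of times
    labs(x = "Number of articles using the tag", y = "Number of tags")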

There are about 8000 unique Tags, and about 7000 of them occur fewer than three times in the corpus. Now that we have some idea of what kinds of tags there are in the data, roughly what their distributions might look like, and how they vary across the journals, we are ready to look at networks of tags.

Network analysis of Tags

Network analysis (or graph analysis) is a vast and complicated area, and my graph analysis skills are a little rusty. So here we’ll simply illustrate how to format the data in an appropriate way, and draw a few example graphs. I referred to this website more than once while working on this (also thanks to @jalapic for tips).

The first graph will be about Tags. We’ll try to illustrate how and which tags co-occur within articles in the Psychonomic Society journals between 2004 and 2016.

90% of data analysis is data cleaning and manipulation, and network analysis is no exception. The first data manipulation step is to create an adjacency matrix where each row and column is a tag, where the values in the cells are the number of times those tags co-occurred (were present in the same article.) I found this stack overflow answer and this blogpost helpful when preparing the data.
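
To make the idea concrete before wrangling the real data, here is a toy sketch with two invented articles and three invented tags (not from the data): articles are rows, tags are columns, and the matrix cross product gives the tag-by-tag co-occurrence counts.

# Toy example: a 1 means the tag was present in that article
toy <- rbind(article1 = c(Attention = 1, Memory = 1, Aging = 0),
             article2 = c(Attention = 1, Memory = 0, Aging = 1))
t(toy) %*% toy  # Diagonal: tag frequencies; off-diagonal: co-occurrence counts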

Preparing the adjacency matrix

In order to create the adjacency matrix we need to solve a sequence of smaller problems. First, we’ll take two columns from the data, one for Tags and another one that uniquely identifies each article (doesn’t matter what it is, here we use DOI). Then we unnest the Tags into single tags per row as detailed above, and remove empty tags.

m <- d %>% 
    select(Tags, DOI) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags))
m
## # A tibble: 18,949 x 2
##                          DOI                  Tags
##                        <chr>                 <chr>
##  1 10.3758/s13414-015-1049-4         3D perception
##  2 10.3758/s13414-015-1049-4 Perception and action
##  3 10.3758/s13414-015-1049-4      Scene perception
##  4 10.3758/s13414-015-1049-4      Space perception
##  5 10.3758/s13414-015-1042-y   Auditory perception
##  6 10.3758/s13414-015-1042-y Cognitive performance
##  7 10.3758/s13414-015-1042-y            Dissonance
##  8 10.3758/s13414-015-1042-y           Distraction
##  9 10.3758/s13414-015-1042-y         Inharmonicity
## 10 10.3758/s13414-015-1042-y                 Music
## # ... with 18,939 more rows

Next, I remove from the data all tags that occurred only rarely (here, fewer than nine times in the whole corpus), because otherwise the data would be too large and messy.

m <- group_by(m, Tags) %>% 
    mutate(n = n()) %>% 
    filter(n > 8) %>% 
    ungroup() %>% 
    select(-n)

Next, we need a variable that identifies whether the tag occurred in the article or not:

m$occur <- 1
slice(m, 1:2)
## # A tibble: 2 x 3
##                         DOI                  Tags occur
##                       <chr>                 <chr> <dbl>
## 1 10.3758/s13414-015-1049-4 Perception and action     1
## 2 10.3758/s13414-015-1049-4      Scene perception     1

Note that occur is 1 for every row so far. To create the matrix we need all the article-tag pairs, even ones that didn’t occur; for these pairs, occur will be zero. The tidyr package in tidyverse provides the perfect helper function for this purpose:

m <- m %>% 
    complete(DOI, Tags, fill = list(occur=0))
m
## # A tibble: 960,500 x 3
##                          DOI               Tags occur
##                        <chr>              <chr> <dbl>
##  1 10.1007/s13420-010-0001-7     3-D perception     0
##  2 10.1007/s13420-010-0001-7             Action     0
##  3 10.1007/s13420-010-0001-7         Adaptation     0
##  4 10.1007/s13420-010-0001-7             Affect     0
##  5 10.1007/s13420-010-0001-7 Age of acquisition     0
##  6 10.1007/s13420-010-0001-7              Aging     0
##  7 10.1007/s13420-010-0001-7           Amygdala     0
##  8 10.1007/s13420-010-0001-7   Animal cognition     0
##  9 10.1007/s13420-010-0001-7       Anticipation     0
## 10 10.1007/s13420-010-0001-7            Anxiety     0
## # ... with 960,490 more rows

Now we have a data frame with 960,500 rows, each row signifying a possible article (DOI)-Tag pairing. Most values of occur will, of course, be zero.

Now we can create the co-occurrence matrix. We first spread the DOIs over the columns, such that each row is a unique tag, each column is a DOI, and the cell values are occurrences.

library(reshape2)
y <- m %>% 
    acast(Tags~DOI, value.var="occur", fun.aggregate = sum)
y[1:2, 1:2]
##                10.1007/s13420-010-0001-7 10.3758/s13414-010-0002-9
## 3-D perception                         0                         0
## Action                                 0                         0
y[y>1] <- 1  # Remove duplicates

All we need to do then is to make a symmetric matrix where both rows and columns are Tags, and the cell values are numbers of co-occurrences. Matrix-multiplying y by its transpose does exactly this: each cell (i, j) sums, over articles, the product of tag i’s and tag j’s occurrences, which is the number of articles in which both tags appeared:

co <- y %*% t(y)
co[1:3, 1:3]
##                3-D perception Action Adaptation
## 3-D perception             11      0          1
## Action                      0     20          0
## Adaptation                  1      0         13

Visualizing the adjacency matrix

It’s important to understand what co is; it’s a symmetric matrix whose rows and columns are tags, and their co-occurrences are numerically represented at their intersections. For example, Action and 3-D perception never co-occurred in the data, but Adaptation and 3-D perception co-occurred once. We are now in an excellent position to investigate which tags tend to co-occur.

First, let’s use igraph to create the graph object g, and take out some unnecessary information from it.

library(igraph)
g <- graph.adjacency(co, weighted=TRUE, mode ='undirected')
g <- simplify(g)

g is now an igraph R object that contains edges and vertices, and information about them. For example, we can use V(g)$name to look at the vertex names. Some of these are quite long, so I’ll create another variable for each vertex, called label, which is the name with each space replaced by a newline character. This will make some of the graphs prettier.

V(g)$label <- gsub(" ", "\n", V(g)$name)

Networks are very complicated visualizations, and if you simply call plot(g), you’ll get an ugly mess of a figure:

plot(g)

There are more parameters to the plot() command than can be listed here, but I tried out a few different options, and this is what I ended up with:

plot(g,
     layout = layout_on_sphere(g),
     edge.width = E(g)$weight/7,
     edge.color = adjustcolor("dodgerblue1", .7),
     vertex.color = adjustcolor("white", .8),
     vertex.frame.color=adjustcolor("dodgerblue4", .4),
     vertex.size=1.9,
     vertex.label.family = "Helvetica",
     vertex.label.font=1,
     vertex.label.cex=.7,
     vertex.label.color="black",
     vertex.label = ifelse(degree(g) > 25, V(g)$label, NA))

The Bayes Network

The next network will be fairly involved. Our goal is to visualize the prominence and connectedness of Psychonomic Society’s “Bayesians”. Who are these Bayesians, you might ask? Our definition here is simply: “An author who has published one or more articles in Psychonomic Journals (2004 - 2016) where the keywords contained the word ‘Bayes’.” This is obviously suboptimal because someone might have published an article called “Why I’m not a Bayesian”, yet they would be included in this graph, but we’ll go with this for now.

I’ll let the code speak for itself this time. If it is impossible to parse, I recommend downloading the data (see top of post) to your own computer and running the code yourself.

m <- d %>% 
    select(Authors, Tags, DOI) %>% 
    mutate(Authors = str_split(Authors, ", ")) %>% 
    unnest(Authors) %>% 
    filter(!is.na(Tags), !is.na(Authors)) %>% 
    filter(grepl("bayes", Tags, ignore.case = T))
m$occur <- 1
m <- m %>% 
    complete(DOI, Authors, fill = list(occur=0)) %>% 
    select(-Tags)
m
## # A tibble: 11,076 x 3
##                          DOI      Authors occur
##                        <chr>        <chr> <dbl>
##  1 10.3758/s13414-010-0073-7     Bååth R.     0
##  2 10.3758/s13414-010-0073-7    Bakker M.     0
##  3 10.3758/s13414-010-0073-7       Bar M.     0
##  4 10.3758/s13414-010-0073-7  Barker J.D.     0
##  5 10.3758/s13414-010-0073-7      Barr S.     0
##  6 10.3758/s13414-010-0073-7     Biele G.     0
##  7 10.3758/s13414-010-0073-7 Bittner J.L.     0
##  8 10.3758/s13414-010-0073-7  Borsboom D.     0
##  9 10.3758/s13414-010-0073-7    Bratch A.     0
## 10 10.3758/s13414-010-0073-7    Bröder A.     0
## # ... with 11,066 more rows

In these data, we have simply counted whether an author “occurred” in an article, and only included articles whose Tags included the word “bayes” in any shape or form. This network will therefore visualize the connectedness of the “Bayesians”, as in who has published papers with whom.

We can then re-run pretty much the same steps as in the previous network.

y <- m %>% 
    acast(Authors~DOI, value.var="occur", fun.aggregate = sum)
co <- y %*% t(y)
co[1:3, 1:3]
##           Bååth R. Bakker M. Bar M.
## Bååth R.         1         0      0
## Bakker M.        0         1      0
## Bar M.           0         0      1
g <- graph.adjacency(co, weighted=TRUE, mode ='undirected')
g <- simplify(g)
V(g)$label <- gsub(" ", "\n", V(g)$name)
deg <- degree(g)
plot(g,
     layout = layout_with_kk(g),
     edge.width = E(g)$weight,
     edge.color = adjustcolor("dodgerblue1", .7),
     vertex.color = adjustcolor("white", .8),
     vertex.frame.color=adjustcolor("dodgerblue4", .4),
     vertex.size=1.9,
     vertex.label.family = "Helvetica",
     vertex.label.font=1,
     vertex.label.cex = sqrt(5+deg)/4,
     vertex.label.color = "black",
     main = "The Bayesians of Psychonomic Society")

More to come later on…

Quantitative literature review with R: Exploring Psychonomic Society Journals, Part I

Literature reviews, both casual and formal (or qualitative / quantitative), are an important part of research. In this tutorial, I’ll show how to use R to quantitatively explore, analyze, and visualize a research literature, using Psychonomic Society’s publications between 2005 and 2016.

Commonly, literature reviews are rather informal; you read a review paper or 10, maybe suggested to you by experts in the field, and then delve deeper into the topics presented in those papers. A more formal version starts with a database search, where you try out various search terms, and collect a more exhaustive list of publications related to your research questions. Here, we are going to have a bit of fun and explore a large heterogeneous literature (kind of) quantitatively, focusing on data manipulation (not the bad kind!) and visualization.

I used Scopus to search for all journal articles published in the Psychonomic Society’s journals (not including CR:PI because it’s so new) between the years 2005 and 2016 (inclusive). For real applications, you would search more broadly across journals and databases, but for a general illustration I wanted to focus on these publications, because Psychonomic journals are closest to my research interests.

I limited the search to these years for two reasons: Metadata on articles before early 2000’s is (are?) not always very good, and if I had included everything, the database would have simply been too large. For the example here, these search terms resulted in about 7500 journal articles.

Download data from Scopus

The Scopus search interface is very straightforward, so I won’t talk about how to use it. Some things to keep in mind, though, are that Scopus does not index all possible sources of citations, and therefore when we discuss citations (and other similar info) below, we’ll only discuss citations that Scopus knows about. For example, Google Scholar often indexes more sources, but it doesn’t have the same search tools as Scopus. Scopus allows you to download the data in various formats, and I chose .csv files, which I have stored on my computer. When you download raw data from Scopus, make sure that you organize the files well and that they are available to the R session.

You can also download the raw or cleaned up data from my website (see below), if you’d like to work on the same data set, or replicate the code below.

Load and clean raw data from Scopus

Let’s first load all the relevant R packages that we’ll use to do our work for us:

library(tidyverse)
library(stringr)
library(tools)

Load the data from .csv files

When I first downloaded the files, I put them in a subdirectory of my project (my website). I downloaded the data such that one file contains the data of one journal. (The raw journal-specific .csv files are available at https://github.com/mvuorre/mvuorre.github.io/tree/master/data/scopus, but I recommend you download the combined R data file instead because it’s easier, see below.)

The first step is to collect these file names into a list of file names:

fl <- list.files("static/data/scopus/", pattern=".csv", full.names = T)

Then we can apply a function over the list of file names to read each file into an R object:

d <- lapply(fl, 
            read_csv, 
            col_types = cols(`Page start` = col_character(),
                             `Page end` = col_character(),
                             `Art. No.` = col_character()))

lapply applies the read_csv() function to the file list fl, and the col_types = cols() argument is there to ensure that those columns are read as character variable types (the original data is a little messy, and this ensures that all the files are read with identical column types). Next, we turn the list of data frames d into a single data frame by binding them together row-wise (think of stacking spreadsheets vertically on top of each other). Let’s also take a quick look at the result with glimpse():

d <- bind_rows(d)
glimpse(d)
## Observations: 7,697
## Variables: 25
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Source title              <chr> "Attention, Perception, and Psychoph...
## $ Volume                    <int> 78, 78, 78, 78, 78, 78, 78, 78, 78, ...
## $ Issue                     <int> 3, 3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, ...
## $ Art. No.                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Page start                <chr> "915", "946", "848", "938", "902", "...
## $ Page end                  <chr> "922", "959", "867", NA, "914", "735...
## $ Page count                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Cited by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Link                      <chr> "https://www.scopus.com/inward/recor...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors with affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author Keywords           <chr> "3D perception; Perception and actio...
## $ Index Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence Address    <chr> "Button, C.; School of Physical Educ...
## $ Publisher                 <chr> "Springer New York LLC", "Springer N...
## $ Abbreviated Source Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document Type             <chr> "Article", "Article", "Article", "Ar...
## $ Source                    <chr> "Scopus", "Scopus", "Scopus", "Scopu...
## $ EID                       <chr> "2-s2.0-84951928856", "2-s2.0-849519...

OK, looks pretty clean already! (I also tried to get data from ISI web of science, or whatever it’s now called, but the raw data was so messy that I’d have to get paid to even look at it.)

The combined raw data file is available as an R data file at https://mvuorre.github.io/data/scopus/scopus-psychonomics-combined-2004-2016.rda, and can be easily loaded to the R workspace with:

load(url("https://mvuorre.github.io/data/scopus/scopus-psychonomics-combined-2004-2016.rda"))

Clean the data

OK, now that you’ve loaded up the combined data file from my website into your R session with the above command, you can get to work cleaning the data. If you’d like to skip the data cleaning steps, scroll down to “Load cleaned data”.

Select relevant variables

There’s a couple of columns (variables) in the data that I’m just going to drop because I’m not interested in them at all. When you export data from Scopus, you can choose what variables to include, and it looks like I accidentally included some stuff that I don’t care about. I’ll use the select() function, and prepend each variable I want to drop with a minus sign. Variable names with spaces or strange characters are wrapped in backticks.

I think it’s always good to drop unnecessary variables to reduce clutter. Speaking of cognitive psychology, it’s probably less cognitively demanding to work with a data frame of 5 variables than with one of 50.

d <- select(d,
            -Volume,
            -Issue,
            -`Art. No.`,
            -`Page count`,
            -Link,
            -Publisher,
            -Source,
            -EID)

I dropped page count because it is empty for most cells, and I actually need to re-compute it from the start and end variables, later on. Next, we’re going to go through the columns, but leave the more complicated columns (such as Authors) for later. What do I mean by “complicated”? Some variables are not individual values, and their desired final shape depends on how you want to use them. Much later on, when we get to network and graph analyses, you’ll see that we want to “unlist” the Keywords columns, for example. That’s why we’ll do the cleaning and data manipulation closer to the actual analysis. But for “univariate” columns, we can get started right away.

Numerical columns: Year, pages, citations

Year is already an integer, and has no missing values, so we don’t need to do anything about it. We can, however, make the page number information numerical and combine it into one column.

glimpse(d)
## Observations: 7,697
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Source title              <chr> "Attention, Perception, and Psychoph...
## $ Page start                <chr> "915", "946", "848", "938", "902", "...
## $ Page end                  <chr> "922", "959", "867", NA, "914", "735...
## $ Cited by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors with affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author Keywords           <chr> "3D perception; Perception and actio...
## $ Index Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated Source Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document Type             <chr> "Article", "Article", "Article", "Ar...
d$Pages <- as.integer(d$`Page end`) - as.integer(d$`Page start`)

How do we know that the Pages variable is now correct? We can’t rely on the math itself because the data values themselves may be incorrect. It is good practice to check the data after each operation, and there are many ways of looking at the data. I like to arrange the data (either descending or ascending) on the newly created variable to see if there are any weird values. Here’s how to arrange the data on the Pages variable in descending order:

glimpse(arrange(d, desc(Pages)))
## Observations: 7,697
## Variables: 18
## $ Authors                   <chr> "Nishiyama R.", "Harrigan J., Rosent...
## $ Title                     <chr> "Active maintenance of semantic repr...
## $ Year                      <int> 2014, 2008, 2013, 2014, 2014, 2014, ...
## $ Source title              <chr> "Psychonomic Bulletin and Review", "...
## $ Page start                <chr> "583", "1", "1", "1", "1", "1", "1",...
## $ Page end                  <chr> "1589", "536", "497", "442", "371", ...
## $ Cited by                  <int> 1, 59, NA, NA, NA, NA, 79, 47, NA, N...
## $ DOI                       <chr> "10.3758/s13423-014-0618-1", "10.109...
## $ Affiliations              <chr> "Department of Psychological Science...
## $ Authors with affiliations <chr> "Nishiyama, R., Department of Psycho...
## $ Abstract                  <chr> "In research on verbal working memor...
## $ Author Keywords           <chr> "Active maintenance; Semantic repres...
## $ Index Keywords            <chr> "adult; attention; female; human; ma...
## $ References                <chr> "Allen, C.M., Martin, R.C., Martin, ...
## $ Correspondence Address    <chr> "Nishiyama, R.; Department of Psycho...
## $ Abbreviated Source Title  <chr> "Psychonom. Bull. Rev.", "The New Ha...
## $ Document Type             <chr> "Article", "Book", "Book", "Article"...
## $ Pages                     <int> 1006, 535, 496, 441, 370, 341, 325, ...

That code is quite difficult to read, however, because it contains nested function calls: arrange() is inside glimpse(). This is where we first really encounter the principles of R’s tidyverse philosophy: instead of nesting, we can pipe results from one function to another. To “pipe” the above code, we’d write it this way instead:

d %>% 
    arrange(desc(Pages)) %>% 
    glimpse()
## Observations: 7,697
## Variables: 18
## $ Authors                   <chr> "Nishiyama R.", "Harrigan J., Rosent...
## $ Title                     <chr> "Active maintenance of semantic repr...
## $ Year                      <int> 2014, 2008, 2013, 2014, 2014, 2014, ...
## $ Source title              <chr> "Psychonomic Bulletin and Review", "...
## $ Page start                <chr> "583", "1", "1", "1", "1", "1", "1",...
## $ Page end                  <chr> "1589", "536", "497", "442", "371", ...
## $ Cited by                  <int> 1, 59, NA, NA, NA, NA, 79, 47, NA, N...
## $ DOI                       <chr> "10.3758/s13423-014-0618-1", "10.109...
## $ Affiliations              <chr> "Department of Psychological Science...
## $ Authors with affiliations <chr> "Nishiyama, R., Department of Psycho...
## $ Abstract                  <chr> "In research on verbal working memor...
## $ Author Keywords           <chr> "Active maintenance; Semantic repres...
## $ Index Keywords            <chr> "adult; attention; female; human; ma...
## $ References                <chr> "Allen, C.M., Martin, R.C., Martin, ...
## $ Correspondence Address    <chr> "Nishiyama, R.; Department of Psycho...
## $ Abbreviated Source Title  <chr> "Psychonom. Bull. Rev.", "The New Ha...
## $ Document Type             <chr> "Article", "Book", "Book", "Article"...
## $ Pages                     <int> 1006, 535, 496, 441, 370, 341, 325, ...

Read the above code as “Take d, then pass the results to arrange(), then pass the results to glimpse()”. Each %>% is a “pipe”, and takes the results of whatever is being computed on its left-hand side, and passes the result to its right-hand side (or line below, it’s really helpful to use line breaks and put each statement on its own line.) I’ll be using pipes extensively in the code below. I also won’t inspect the code after every change to prevent printing unnecessary boring stuff all the time (although you should when you’re working with your own data.)

So, back to Pages. Looking at the data, it looks like there are some strange values, with some items having more than 1000 pages. For now we’ll leave those values in, but looking closely at the Source title column, we can see that there are publications in the data that shouldn’t be there. I guess I was a little too lazy with my initial Scopus search terms, but we can easily fix that problem here in R.

Publications

First, let’s list all the unique publications in the data:

publications <- unique(d$`Source title`)
publications
##  [1] "Attention, Perception, and Psychophysics"                                                
##  [2] "Attention, perception & psychophysics"                                                   
##  [3] "Attention, Perception, & Psychophysics"                                                  
##  [4] "Behavior Research Methods"                                                               
##  [5] "Behavior research methods"                                                               
##  [6] "Modern Research Methods for The Study of Behavior in Organizations"                      
##  [7] "Modern Research Methods for the Study of Behavior in Organizations"                      
##  [8] "The New Handbook of Methods in Nonverbal Behavior Research"                              
##  [9] "Cognitive, Affective and Behavioral Neuroscience"                                        
## [10] "Cognitive, affective & behavioral neuroscience"                                          
## [11] "Cognitive, Affective, & Behavioral Neuroscience"                                         
## [12] "Learning and Behavior"                                                                   
## [13] "Memory and Cognition"                                                                    
## [14] "Memory & cognition"                                                                      
## [15] "Exploring Implicit Cognition: Learning, Memory, and Social Cognitive Processes"          
## [16] "Memory & Cognition"                                                                      
## [17] "Imagery, Memory and Cognition: Essays in Honor of Allan Paivio"                          
## [18] "Imagery, Memory and Cognition (PLE: Memory):Essays in Honor of Allan Paivio"             
## [19] "Working Memory and Human Cognition"                                                      
## [20] "Grounding Cognition: The Role of Perception and Action in Memory, Language, and Thinking"
## [21] "Psychonomic Bulletin and Review"                                                         
## [22] "Psychonomic bulletin & review"                                                           
## [23] "Psychonomic Bulletin & Review"

Let’s remove the non-psychonomic journal ones. I’ll just subset the list of publications to ones that I want to keep, and then filter() the data such that Source title matches one of the correct journals.

psychonomic_publications <- publications[c(1:5, 9:14, 16, 21:23)]
d <- filter(d, `Source title` %in% psychonomic_publications)
unique(d$`Source title`)
##  [1] "Attention, Perception, and Psychophysics"        
##  [2] "Attention, perception & psychophysics"           
##  [3] "Attention, Perception, & Psychophysics"          
##  [4] "Behavior Research Methods"                       
##  [5] "Behavior research methods"                       
##  [6] "Cognitive, Affective and Behavioral Neuroscience"
##  [7] "Cognitive, affective & behavioral neuroscience"  
##  [8] "Cognitive, Affective, & Behavioral Neuroscience" 
##  [9] "Learning and Behavior"                           
## [10] "Memory and Cognition"                            
## [11] "Memory & cognition"                              
## [12] "Memory & Cognition"                              
## [13] "Psychonomic Bulletin and Review"                 
## [14] "Psychonomic bulletin & review"                   
## [15] "Psychonomic Bulletin & Review"

Much better. Next, we’ll have to make sure that there’s only one unique name per journal. To do this, we’ll use simple text manipulation tools to change all “and”s to “&”s, and convert the names to Title Case. Before applying these operations to the actual data, I’ll work on just the unique names R object psychonomic_publications. This makes it easier to troubleshoot the operations.

psychonomic_publications <- psychonomic_publications %>% 
    str_replace_all("and", "&") %>%  # Replace all "and"s with "&"s
    toTitleCase()  # Make everything Title Case
unique(psychonomic_publications)
## [1] "Attention, Perception, & Psychophysics"         
## [2] "Attention, Perception & Psychophysics"          
## [3] "Behavior Research Methods"                      
## [4] "Cognitive, Affective & Behavioral Neuroscience" 
## [5] "Cognitive, Affective, & Behavioral Neuroscience"
## [6] "Learning & Behavior"                            
## [7] "Memory & Cognition"                             
## [8] "Psychonomic Bulletin & Review"

Still not quite correct. For two journals, there’s two versions, one where “&” is preceded with a comma, one where it isn’t. I’ll take the commas out.

psychonomic_publications %>% 
    str_replace_all(", &", " &") %>% 
    unique()
## [1] "Attention, Perception & Psychophysics"         
## [2] "Behavior Research Methods"                     
## [3] "Cognitive, Affective & Behavioral Neuroscience"
## [4] "Learning & Behavior"                           
## [5] "Memory & Cognition"                            
## [6] "Psychonomic Bulletin & Review"

Great, now I’ll just apply those operations to the actual data.

d <- d %>% 
    mutate(Publication = str_replace_all(`Source title`, "and", "&"),
           Publication = toTitleCase(Publication),
           Publication = str_replace_all(Publication, ", &", " &"))

Here, I used the mutate() function, which works within a data frame. I passed d to mutate() with the pipe operator, and could then refer to variables without prepending them with d$. I called the product variable Publication because it’s easier to write than Source title.

Lastly, we’ll do the same for the abbreviated publication names. I don’t know what the official abbreviations are, but I can guess that the majority of the articles have it correct. The first step is then to count the occurrences of each unique abbreviation. To do this, I group the data by (group_by()) the abbreviation, and then count():

abbs <- group_by(d, `Abbreviated Source Title`) %>% 
    count()
abbs
## # A tibble: 11 x 2
##          `Abbreviated Source Title`     n
##                               <chr> <int>
##  1         Atten Percept Psychophys    26
##  2      Atten. Percept. Psychophys.  1558
##  3                Behav Res Methods    38
##  4              Behav. Res. Methods  1385
##  5       Cogn Affect Behav Neurosci     7
##  6 Cogn. Affective Behav. Neurosci.   691
##  7                    Learn. Behav.   455
##  8                       Mem Cognit    13
##  9                     Mem. Cognit.  1441
## 10                 Psychon Bull Rev     3
## 11            Psychonom. Bull. Rev.  1997

Then just figure out how to replace all the bad ones with the good ones. Unfortunately this has to be done manually because computers aren’t that smart. The following is ugly, but it works. I first take just the abbreviations into a variable, then replace the bad ones with the good ones inside the data frame (working with an abbreviated variable name).

abbrs <- abbs[[1]]
d <- mutate(d,
       pa = `Abbreviated Source Title`,
       pa = ifelse(pa == abbrs[1], abbrs[2], pa),
       pa = ifelse(pa == abbrs[3], abbrs[4], pa),
       pa = ifelse(pa == abbrs[5], abbrs[6], pa),
       pa = ifelse(pa == abbrs[8], abbrs[9], pa),
       pa = ifelse(pa == abbrs[10], abbrs[11], pa),
       Pub_abbr = pa)
unique(d$Pub_abbr)
## [1] "Atten. Percept. Psychophys."      "Behav. Res. Methods"             
## [3] "Mem. Cognit."                     "Cogn. Affective Behav. Neurosci."
## [5] "Learn. Behav."                    "Psychonom. Bull. Rev."

OK, we’re pretty much done with cleaning for now. The last thing to do is to remove the unnecessary variables we created while cleaning the data:

d <- select(d, 
            -`Source title`,
            -`Page start`, -`Page end`,
            -pa)

I don’t like spaces, so I’ll also take them out of the variable names:

names(d) <- str_replace_all(names(d), " ", "_")

Leaving us with this to work with:

glimpse(d)
## Observations: 7,614
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Cited_by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors_with_affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author_Keywords           <chr> "3D perception; Perception and actio...
## $ Index_Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence_Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated_Source_Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document_Type             <chr> "Article", "Article", "Article", "Ar...
## $ Pages                     <int> 7, 13, 19, NA, 12, 12, 32, 19, 6, 22...
## $ Publication               <chr> "Attention, Perception & Psychophysi...
## $ Pub_abbr                  <chr> "Atten. Percept. Psychophys.", "Atte...

Load cleaned data

The clean data file can be loaded from https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda.

load(url("https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda"))

Univariate figures

This data is very rich, but for this initial exploration, we’ll be focusing on simple uni- and bi-variate figures. In later parts of this tutorial (future blog posts), we’ll start visualizing more complex data, such as keywords, authors, and networks.

Let’s begin with some simple summaries of what’s been going on, over time, for each journal. We’ll make extensive use of the dplyr data manipulation verbs in the following plots. Take a look at the linked website if they are unfamiliar to you; although I will explain more complicated cases, I won’t bother with every detail.

Publication years

First, we’re interested in a simple question: How many articles has each journal published in each year? Are there temporal patterns, and do they differ between journals? The steps for creating this plot are commented in the code below. Roughly, in order of appearance, we first add grouping information to the data frame, then summarise the data on those groups, create a new column for a publication-level summary, and then order the publications on it. We then pass the results to ggplot2 and draw a line and point graph.

d %>% 
    group_by(Publication, Year) %>% 
    summarise(n = n()) %>%  # For each group (Publication, Year), count nobs
    mutate(nsum = sum(n)) %>%  # For each Publication, sum of nobs {1}
    ungroup() %>%  # Ungroup data frame
    mutate(Publication = reorder(Publication, nsum)) %>%  # {2}
    ggplot(aes(x=Year, y=n)) +
    geom_point() +
    geom_line() +
    labs(y = "Number of articles published") +
    facet_wrap("Publication", ncol=2)

The code in {1} works, because the summarise() command in the above line ungroups the last grouping factor assigned by group_by(). And since I called mutate() instead of summarise, the sum of number of observations was added to each row, instead of collapsing the data by group. {2} made the Publication variable into a factor that’s ordered by the sum of articles across years for that journal; this ordering is useful because when I used facet_wrap() below, the panels are nicely ordered (journals with fewer total papers in the upper left, increasing toward bottom right.)
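
As a tiny illustration of the summarise() vs. mutate() distinction on grouped data (a made-up toy data frame, not the article data):

ex <- tibble(g = c("a", "a", "b"), x = 1:3)
ex %>% group_by(g) %>% summarise(n = n())  # Collapses to one row per group
ex %>% group_by(g) %>% mutate(n = n())     # Keeps all rows; adds each group's size to every row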

If this code doesn’t make any sense, I strongly recommend loading the data into your own R workspace, and executing the code line by line.

I’m a little surprised at the relative paucity of articles in Learning and Behavior, and it looks like there might be some upward trajectory going on in PBR. Should I run some regressions? Probably not.

Citations

Let’s look at numbers of citations next. For these, simple histograms will do, and we expect to see some heavy tails. I will pre-emptively clip the x-axis at 100 citations, because there’s probably a few rogue publications with a truly impressive citation count, which might make the other, more common, values less visible.

d %>% 
    ggplot(aes(Cited_by)) +
    geom_histogram(binwidth=1, col="black") +
    coord_cartesian(xlim=c(0,100)) +
    facet_wrap("Publication", ncol=2)

Nice. But how many papers are there with more than 100 citations? What’s the proportion of those highly cited papers?

d %>%
    group_by(Publication) %>%
    mutate(total = n()) %>% 
    filter(Cited_by > 100) %>% 
    summarise(total = unique(total),
              over_100 = n(),
              ratio = signif(over_100/total, 1)) %>% 
    arrange(total)
## # A tibble: 6 x 4
##                                      Publication total over_100 ratio
##                                            <chr> <int>    <int> <dbl>
## 1                            Learning & Behavior   455        1 0.002
## 2 Cognitive, Affective & Behavioral Neuroscience   698       32 0.050
## 3                      Behavior Research Methods  1426       53 0.040
## 4                             Memory & Cognition  1452       17 0.010
## 5          Attention, Perception & Psychophysics  1583        8 0.005
## 6                  Psychonomic Bulletin & Review  2000       54 0.030

Excellent. We have now reviewed some of the powerful dplyr verbs and are ready for slightly more complex summaries of the data.

Bivariate figures

Publication year and citations

We can visualize citation counts over the years of publication, although this might not be the most informative graph (because recent articles naturally can’t have been cited as often as older ones). Here, I’ll show how to plot neat box plots (without outliers) with ggplot2:

d %>% 
    filter(Cited_by < 100) %>% 
    ggplot(aes(Year, Cited_by)) +
    geom_boxplot(aes(group = Year), outlier.color = NA) +
    facet_wrap("Publication", ncol=2, scales = "free_y")

Pages and citations

The last figure for today is a scatterplot of pages (how many pages the article spans) versus number of citations. For scatterplots I usually prefer empty points (shape=1).

d %>% 
    filter(Pages < 100, Cited_by < 100) %>% 
    ggplot(aes(Pages, Cited_by)) +
    geom_point(shape=1) +
    facet_wrap("Publication", ncol=2)

Summary

We have now scratched the surface of these data, and there is clearly a lot to explore:

glimpse(d)
## Observations: 7,614
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Cited_by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors_with_affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author_Keywords           <chr> "3D perception; Perception and actio...
## $ Index_Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence_Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated_Source_Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document_Type             <chr> "Article", "Article", "Article", "Ar...
## $ Pages                     <int> 7, 13, 19, NA, 12, 12, 32, 19, 6, 22...
## $ Publication               <chr> "Attention, Perception & Psychophysi...
## $ Pub_abbr                  <chr> "Atten. Percept. Psychophys.", "Atte...

In next posts on this data, we’ll look at networks and other more advanced topics. Stay tuned.

Bayes Factors with brms

Here’s a short post on how to calculate Bayes Factors with the R package brms (Buerkner, 2016) using the Savage-Dickey density ratio method (Wagenmakers, Lodewyckx, Kuriyal, & Grasman, 2010).

To get up to speed with what the Savage-Dickey density ratio method is–or what Bayes Factors are–please read Wagenmakers et al. 2010. (The paper is available on the author’s webpage.) Here, I’ll only show the R & brms code to do the calculations that Wagenmakers et al. (2010) discuss. In their paper, they used WinBUGS, which requires quite a bit of code to sample from even a relatively simple model. brms on the other hand uses the familiar R formula syntax, making it easy to use. brms also does the MCMC sampling with Stan (Stan Development Team, 2016a & 2016b), or rather creates Stan code from a specified R model formula by what can only be described as string processing magic, making the sampling very fast. Let’s get straight to the examples in Wagenmakers et al. (2010)

# We'll work in tidyverse
library(tidyverse)

Example 0

Wagenmakers and colleagues begin with a simple example of 10 true/false questions: We observe a person answering 9 (s) out of 10 (k) questions correctly.

d <- data.frame(s = 9, k = 10)
d
##   s  k
## 1 9 10

We are interested in the person’s latent ability to answer similar questions correctly. This ability is represented by \(\theta\) (theta), which for us will be the probability parameter (sometimes also called the rate parameter) in a binomial distribution. See Wagenmakers et al. (2010) for formulas. The maximum likelihood (point) estimate for \(\theta\) is s/k, the proportion .9.

The first thing we’ll need to specify with respect to our statistical model is the prior probability distribution for \(\theta\). As in Wagenmakers et al. 2010, we specify a uniform prior, representing no prior information about the person’s ability to answer the questions. For the binomial probability parameter, \(Beta(\alpha = 1, \beta = 1)\) is a uniform prior.

pd <- tibble(
    x = seq(0, 1, by = .01),
    Prior = dbeta(x, 1, 1)
    )

ggplot(pd, aes(x, Prior)) +
    geom_line() +
    coord_cartesian(xlim = 0:1, ylim = c(0, 6), expand = 0) +
    labs(y = "Density", x = bquote(theta))

The solid line represents the probability density assigned to values of \(\theta\) by this prior probability distribution. You can see that it is 1 for all possible parameter values: They are all equally likely a priori. For this simple illustration, we can easily calculate the posterior distribution by adding the number of correct and incorrect answers to the parameters of the prior Beta distribution.
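
Concretely, by Beta-binomial conjugacy (a standard result, not specific to this example), a \(Beta(\alpha, \beta)\) prior combined with \(s\) successes out of \(k\) trials yields the posterior

\[\theta \mid s, k \sim Beta(\alpha + s, \beta + k - s) = Beta(1 + 9, 1 + 1) = Beta(10, 2).\]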

pd$Posterior <- dbeta(pd$x, 10, 2)
pdw <- gather(pd, key=Type, value=density, Prior:Posterior)
ggplot(pdw, aes(x, density, col=Type)) +
    geom_line() +
    annotate("point", x=c(.5, .5), y = c(pdw$density[pdw$x==.5])) +
    annotate("label", x=c(.5, .5), 
             y = pdw$density[pdw$x==.5], 
             label = round(pdw$density[pdw$x==.5], 3),
             vjust=-.5)

The Savage-Dickey density ratio is calculated by dividing the posterior density by the prior density at a specific parameter value. Here, we are interested in .5, a “null hypothesis” value indicating that the person’s latent ability is .5, i.e. that they are simply guessing.
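
In symbols (following Wagenmakers et al., 2010), the Bayes factor in favor of the null value \(\theta_0 = .5\) is the ratio of the posterior to the prior density at that point:

\[BF_{01} = \frac{p(\theta = \theta_0 \mid \text{data})}{p(\theta = \theta_0)}.\]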

filter(pd, x == .5) %>% 
    mutate(BF01 = Posterior/Prior,
           BF10 = 1/BF01) %>% 
    round(3)
## # A tibble: 1 x 5
##       x Prior Posterior  BF01  BF10
##   <dbl> <dbl>     <dbl> <dbl> <dbl>
## 1   0.5     1     0.107 0.107 9.309

OK, so in this example we are able to get to the posterior with simply adding values into the parameters of the Beta distribution, but let’s now see how to get to this problem using brms.

library(brms)

First, here’s the brms formula of the model:

s | trials(k) ~ 0 + intercept, 
family=binomial(link="identity"), 
data = d

Read the first line as “s successes from k trials regressed on intercept”. That’s a little clunky, but bear with it. If you are familiar with R’s modeling syntax, you’ll be wondering why we didn’t simply specify ~ 1–R’s default notation for an intercept. The reason is that brms by default uses a little trick in parameterizing the intercept which speeds up the MCMC sampling. In order to specify a prior for the intercept, you’ll have to take the default intercept out (0 +), and use the reserved string intercept to say that you mean the regular intercept. See ?brmsformula for details. (For this model, with only one parameter, this complication doesn’t matter, but I wanted to introduce it early on so that you’d be aware of it when estimating multi-parameter models.)

The next line specifies that the data model is binomial, and that we want to model its parameter through an identity link. Usually when you model proportions or binary data, you’d use a logistic (logistic regression!), probit or other similar link function. In fact this is what we’ll do for later examples. Finally, we’ll use the data frame d.

OK, then we’ll want to specify our priors. Priors are extremely important for Bayes Factors, and for probabilistic inference in general. To help set priors, we’ll first call get_prior() with the model information, which is basically like asking brms to tell us which priors can be specified, and how, given this model.

get_prior(s | trials(k) ~ 0 + intercept, 
          family=binomial(link="identity"),
          data = d)
##   prior class      coef group nlpar bound
## 1           b                            
## 2           b intercept

The first line says that there is only one class of parameters b, think of class b as “betas” or “regression coefficients”. The second line says that the b class has only one parameter, the intercept. So we can set a prior for the intercept, and this prior can be any probability distribution in Stan language. We’ll create this prior using brms’ set_prior(), give it a text string representing the Beta(1, 1) prior for all parameters of class b (shortcut, could also specify that we want it for the intercept specifically), and then say the upper and lower bounds (\(\theta\) must be between 0 and 1).

Prior <- set_prior("beta(1, 1)", class = "b", lb = 0, ub = 1)

Almost there. Now we’ll actually sample from the model using brm(), give it the model, priors, data, ask it to sample from priors (for the density ratio!), and set a few extra MCMC parameters.

m <- brm(s | trials(k) ~ 0 + intercept, 
         family = binomial(link="identity"),
         prior = Prior,
         data = d,
         sample_prior = TRUE,
         iter = 1e4,
         cores = 4)

We can get the estimated parameter by asking for the model summary:

summary(m)
##  Family: binomial(identity) 
## Formula: s | trials(k) ~ 0 + intercept 
##    Data: d (Number of observations: 1) 
## Samples: 4 chains, each with iter = 10000; warmup = 5000; thin = 1; 
##          total post-warmup samples = 20000
##    WAIC: Not computed
##  
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## intercept     0.83       0.1     0.59     0.98      15654    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The Credible Interval matches exactly what’s reported in the paper. The point estimate differs slightly because here we see the posterior mean, whereas in the paper, Wagenmakers et al. report the posterior mode. I’ll draw a line at their posterior mode, below, to show that it matches.
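
If you want to locate that mode yourself, here is a rough sketch of approximating the posterior mode from the MCMC samples (the exact value depends a little on the kernel density bandwidth).

post <- posterior_samples(m, pars = "b_intercept")$b_intercept
dens <- density(post)        # kernel density estimate of the posterior
dens$x[which.max(dens$y)]    # approximate posterior mode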

samples <- posterior_samples(m, "b")
head(samples)
##   b_intercept   prior_b
## 1   0.9496061 0.4433162
## 2   0.6241450 0.6781995
## 3   0.9522157 0.1354565
## 4   0.7378991 0.4176384
## 5   0.7290172 0.9713203
## 6   0.7078843 0.9575486
gather(samples, Type, value) %>% 
    ggplot(aes(value, col=Type)) +
    geom_density() +
    labs(x = bquote(theta), y = "Density") +
    geom_vline(xintercept = .89)  # Vertical line at .89

We can already see the densities, so all that’s left is to obtain the exact values at the value of interest (.5) and take the \(\frac{posterior}{prior}\) ratio. Instead of doing any of this by hand, we’ll use brms’ hypothesis() function, which allows us to test point hypotheses using the Savage-Dickey density ratio. For this function we’ll need to specify the point of interest, .5, as the point hypothesis to be tested.

h <- hypothesis(m, "intercept = 0.5")
print(h, digits = 4)
## Hypothesis Tests for class b:
##                       Estimate Est.Error l-95% CI u-95% CI Evid.Ratio Star
## (intercept)-(0.5) = 0   0.3336    0.1032   0.0876   0.4761     0.1078    *
## ---
## '*': The expected value under the hypothesis lies outside the 95% CI.

The Evid.Ratio is our Bayes Factor BF01. Notice that it matches the value 0.107 pretty well! You can also plot this hypothesis:

plot(h)

However, I think the default plot isn’t fantastic (not a fan of the axis adjustments, title). Fortunately, plot(hypothesis) returns a ggplot2 object, which is easily customized.

p <- plot(h, plot = F, theme = theme_get())[[1]]
p + scale_x_continuous(breaks = seq(-.5, .5, by = .25),
                       labels = seq(-.5, .5, by = .25)+.5)

OK, so that was a lot of work for such a simple problem, but the real beauty of brms (and Stan) is the unparalleled scalability: We can easily solve a problem with one row of data and one parameter, and it won’t take much more to solve a problem with tens of thousands of rows of data, and hundreds of parameters. Let’s move on to the next example from Wagenmakers et al. (2010).

Example 1: Equality of Proportions

For context, please refer to the paper.

d <- data.frame(
    pledge = c("yes", "no"),
    s = c(424, 5416),
    n = c(777, 9072)
)
d
##   pledge    s    n
## 1    yes  424  777
## 2     no 5416 9072

They use Beta(1, 1) priors for both rate parameters, which we’ll do as well. Notice that usually a regression formula has an intercept and a coefficient (e.g. effect of group.) By taking the intercept out (0 +) we can define two pledger-group proportions instead, and set priors on these. If we used an intercept + effect formula, we could set a prior on the effect itself.
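
To see the difference between the two parameterizations concretely, here is a small sketch using base R’s model.matrix() (not part of the original analysis):

# "0 + pledge": one indicator column per group, so each coefficient is a group's rate
model.matrix(~ 0 + pledge, data = d)
# Default parameterization: an intercept plus a single group-difference column
model.matrix(~ pledge, data = d)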

get_prior(s | trials(n) ~ 0 + pledge, 
          family=binomial(link="identity"),
          data = d)
##   prior class      coef group nlpar bound
## 1           b                            
## 2           b  pledgeno                  
## 3           b pledgeyes

We can set the Beta prior for both groups’ rate with one line of code by setting the prior on the b class without specifying the coef.

Prior <- set_prior("beta(1, 1)", class = "b", lb = 0, ub = 1)

Like above, let’s estimate.

m1 <- brm(s | trials(n) ~ 0 + pledge, 
         family = binomial(link="identity"),
         prior = Prior,
         sample_prior = TRUE,
         iter = 1e4,
         data = d,
         cores = 4)

Our estimates match the MLEs reported in the paper:

print(m1, digits=3)
##  Family: binomial(identity) 
## Formula: s | trials(n) ~ 0 + pledge 
##    Data: d (Number of observations: 2) 
## Samples: 4 chains, each with iter = 10000; warmup = 5000; thin = 1; 
##          total post-warmup samples = 20000
##    WAIC: Not computed
##  
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## pledgeno     0.597     0.005    0.587    0.607      18588    1
## pledgeyes    0.546     0.018    0.511    0.580      19173    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

To get the density ratio Bayes Factor, we’ll need to specify a text string as our hypothesis. Our hypothesis is that the rate parameters \(\theta_1\) and \(\theta_2\) are not different: \(\theta_1\) = \(\theta_2\). The alternative, then, is the notion that the parameter values differ.

h1 <- hypothesis(m1, "pledgeyes = pledgeno")
print(h1, digits = 3)
## Hypothesis Tests for class b:
##                          Estimate Est.Error l-95% CI u-95% CI Evid.Ratio
## (pledgeyes)-(pled... = 0   -0.051     0.018   -0.088   -0.016      0.388
##                          Star
## (pledgeyes)-(pled... = 0    *
## ---
## '*': The expected value under the hypothesis lies outside the 95% CI.

As noted in the paper, a difference of 0 is about twice as well supported before seeing the data as after; in other words, the null hypothesis of no difference is roughly half as likely after seeing the data:

1/h1$hypothesis$Evid.Ratio  # BF10
## [1] 2.577207

The paper reports BF01 = 0.47, so we’re getting very similar results (as we should). You can also compare this figure to what’s reported in the paper.

h1p1 <- plot(h1, theme = theme_get(), plot = F)[[1]] +
    theme(legend.position = "top")
h1p2 <- plot(h1, theme = theme_get(), plot = F)[[1]] + 
    coord_cartesian(xlim = c(-.05, .05), ylim = c(0, 5)) +
    theme(legend.position = "top")
gridExtra::grid.arrange(h1p1, h1p2, nrow = 1)

Moving right on to Example 2, skipping the section on “order restricted analysis”.

Example 2: Hierarchical Bayesian one-sample proportion test

The data for Example 2 are not available, but we’ll simulate similar data. The simulation assumes that the average probability of a correct response is 50% in the neither-primed condition, and that the both-primed condition adds a 5% benefit. Obviously, the numbers here won’t match the paper’s exactly, but the data reported in the paper have an average difference in proportions of about 4%.

set.seed(5)
d <- tibble(
    id = c(rep(1:74, each = 2)),
    primed = rep(c("neither", "both"), times = 74),
    prime = rep(c(0, 1), times = 74),  # Dummy coded
    n = 21,
    correct = rbinom(74*2, 21, .5 + prime * .05)
)
group_by(d, primed) %>% summarize(p = sum(correct)/sum(n))
## # A tibble: 2 x 2
##    primed         p
##     <chr>     <dbl>
## 1    both 0.5418275
## 2 neither 0.4993565

These data yield a t-value similar to the one reported in the paper.

t.test(correct/n ~ primed, paired = T, data = d)
## 
##  Paired t-test
## 
## data:  correct/n by primed
## t = 2.3045, df = 73, p-value = 0.02404
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.005741069 0.079201016
## sample estimates:
## mean of the differences 
##              0.04247104

Instead of doing a probit regression, I’m going to do logistic regression; therefore we define the prior on the log-odds scale. The log-odds for the expected probability of .5 is 0. I prefer log-odds because, although complicated, they make sense, unlike standardized effect sizes. Note that the probit scale would also be fine, as the two are very similar.

Let’s just get a quick intuition about effects in log-odds: The change in log odds from p = .5 to .55 is about 0.2.
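
You can verify this with base R’s qlogis(), the logit function:

qlogis(.55) - qlogis(.5)  # change in log-odds from p = .5 to p = .55; about 0.2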

library(arm)
tibble(rate = seq(0, 1, by = .01),
    logit = logit(rate)) %>%
    ggplot(aes(rate, logit)) +
    geom_line(size=1) +
    geom_segment(x = 0, xend = 0.55, y = .2, yend = .2, size=.4) +
    geom_segment(x = 0, xend = 0.5, y = 0, yend = 0, size=.4) +
    coord_cartesian(ylim = c(-2, 2), expand=0)

We are cheating a little because we know these values, having simulated the data. However, log-odds are not straightforward (!), and this knowledge will allow us to specify better priors in this example. Let’s get the possible priors for this model by calling get_prior(). Notice that the model now includes id-varying “random” effects, which we model as draws from independent Gaussians by specifying || instead of |, which would instead give a multivariate Gaussian for the varying effects.

get_prior(correct | trials(n) ~ 0 + intercept + prime + 
              (0 + intercept + prime || id), 
          family=binomial(link="logit"),
          data = d)
##                 prior class      coef group nlpar bound
## 1                         b                            
## 2                         b intercept                  
## 3                         b     prime                  
## 4 student_t(3, 0, 10)    sd                            
## 5                        sd              id            
## 6                        sd intercept    id            
## 7                        sd     prime    id
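
As an aside, here is a sketch of the analogous call with a single |, which would model the id-varying intercept and prime effects as correlated (and get_prior() would then also list a prior for that correlation):

get_prior(correct | trials(n) ~ 0 + intercept + prime + 
              (0 + intercept + prime | id), 
          family = binomial(link = "logit"),
          data = d)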

The leftmost column in get_prior()’s output gives the pre-specified defaults used by brms. Here are the priors we’ll specify. The most important one pertains to prime, which is going to be the effect size in log-odds. Our prior for the log-odds of the prime effect is a Gaussian distribution centered on 0 with a standard deviation of .2, which is mildly informative: at two standard deviations it covers effects of roughly ±10 percentage points around chance.

Prior <- c(
    set_prior("normal(0, 10)", class = "b", coef = "intercept"),
    set_prior("cauchy(0, 10)", class = "sd"),
    set_prior("normal(0, .2)", class = "b", coef = "prime")
)

Then we estimate the model using the specified priors.

m2 <- brm(correct | trials(n) ~ 0 + intercept + prime + 
              (0 + intercept + prime || id), 
         family = binomial(link="logit"),
         prior = Prior,
         sample_prior = TRUE,
         iter = 1e4,
         data = d,
         cores = 4)

OK, so our results here will differ from the paper’s because we didn’t put the prior on a standardized effect size: a) I don’t like standardized effect sizes, and b) doing so would require playing around with the Stan code, and this post is about brms. Two good reasons not to use standardized effect sizes! Anyway, here are the estimated parameters:

summary(m2)
##  Family: binomial(logit) 
## Formula: correct | trials(n) ~ 0 + intercept + prime + (0 + intercept + prime || id) 
##    Data: d (Number of observations: 148) 
## Samples: 4 chains, each with iter = 10000; warmup = 5000; thin = 1; 
##          total post-warmup samples = 20000
##    WAIC: Not computed
##  
## Group-Level Effects: 
## ~id (Number of levels: 74) 
##               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(intercept)     0.07      0.05     0.00     0.18       8862    1
## sd(prime)         0.13      0.08     0.01     0.30       6970    1
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## intercept     0.01      0.05    -0.09     0.10      20000    1
## prime         0.15      0.07     0.02     0.29      20000    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

And our null-hypothesis density ratio:

h2 <- hypothesis(m2, "prime = 0")
h2
## Hypothesis Tests for class b:
##             Estimate Est.Error l-95% CI u-95% CI Evid.Ratio Star
## (prime) = 0     0.15      0.07     0.02     0.29       0.27    *
## ---
## '*': The expected value under the hypothesis lies outside the 95% CI.

A priming effect of zero log-odds is about 4 times less likely after seeing the data:

1/h2$hypothesis$Evid.Ratio
## [1] 3.681923

This is best illustrated by plotting the densities:

plot(h2)

Conclusion

Read the paper! Hopefully you’ll be able to use brms’ hypothesis() function to calculate Bayes factors when needed.

References

Buerkner, P.-C. (2016). brms: Bayesian Regression Models using Stan. Retrieved from http://CRAN.R-project.org/package=brms
Stan Development Team. (2016a). RStan: the R interface to Stan. Retrieved from http://mc-stan.org/
Stan Development Team. (2016b). Stan: A C++ Library for Probability and Sampling, Version 2.14.1. Retrieved from http://mc-stan.org/
Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage–Dickey method. Cognitive Psychology, 60(3), 158–189. https://doi.org/10.1016/j.cogpsych.2009.12.001

Appendix

brms produces exceptionally well-written Stan code. Here is the Stan code of the last example model:

cat(rstan::get_stancode(m2$fit))
## // generated with brms 1.6.1
## functions { 
## } 
## data { 
##   int<lower=1> N;  // total number of observations 
##   int Y[N];  // response variable 
##   int<lower=1> K;  // number of population-level effects 
##   matrix[N, K] X;  // population-level design matrix 
##   // data for group-level effects of ID 1 
##   int<lower=1> J_1[N]; 
##   int<lower=1> N_1; 
##   int<lower=1> M_1; 
##   vector[N] Z_1_1; 
##   vector[N] Z_1_2; 
##   int trials[N];  // number of trials 
##   int prior_only;  // should the likelihood be ignored? 
## } 
## transformed data { 
## } 
## parameters { 
##   vector[K] b;  // population-level effects 
##   vector<lower=0>[M_1] sd_1;  // group-level standard deviations 
##   vector[N_1] z_1[M_1];  // unscaled group-level effects 
##   // parameters to store prior samples
##   real<lower=0> prior_sd_1;
## } 
## transformed parameters { 
##   // group-level effects 
##   vector[N_1] r_1_1; 
##   vector[N_1] r_1_2; 
##   r_1_1 = sd_1[1] * (z_1[1]); 
##   r_1_2 = sd_1[2] * (z_1[2]); 
## } 
## model { 
##   vector[N] mu; 
##   mu = X * b; 
##   for (n in 1:N) { 
##     mu[n] = mu[n] + (r_1_1[J_1[n]]) * Z_1_1[n] + (r_1_2[J_1[n]]) * Z_1_2[n]; 
##   } 
##   // prior specifications 
##   b[1] ~ normal(0, 10); 
##   b[2] ~ normal(0, .2); 
##   sd_1 ~ cauchy(0, 10); 
##   z_1[1] ~ normal(0, 1); 
##   z_1[2] ~ normal(0, 1); 
##   // likelihood contribution 
##   if (!prior_only) { 
##     Y ~ binomial_logit(trials, mu); 
##   } 
##   // additionally draw samples from priors
##   prior_sd_1 ~ cauchy(0,10);
## } 
## generated quantities { 
##   real prior_b_1; 
##   real prior_b_2; 
##   // additionally draw samples from priors 
##   prior_b_1 = normal_rng(0,10); 
##   prior_b_2 = normal_rng(0,.2); 
## }

Better forest plots from meta-analytic models estimated with brms

Hi all! After our previous discussion about how to estimate meta-analytic models with the brilliant brms R package, a few people asked me for the code to produce the forest plots. Here I’ll present a much better version of a function to produce forest plots from meta-analytic models estimated with brms. The function is implemented in ggplot2, and it is included in my vmisc package. To get the function, simply install vmisc from GitHub:

devtools::install_github("mvuorre/vmisc")

The function to draw forest plots from meta-analytic models estimated with brms is called brms_forest(), and you can learn more about it with ?brms_forest. The most important thing to know is that it requires that the model is estimated with brms with this exact formula:

yi | se(sei) ~ 1 + (1|study)

where yi are the study-specific effect sizes (mean differences between groups, for example), sei are the study-specific standard errors, and study is a unique integer label for each study in the data.

Example data

We’ll illustrate the plot using example data from the metafor package. These data are “Results from 48 studies on the effectiveness of school-based writing-to-learn interventions on academic achievement.”

“In each of the studies included in this meta-analysis, an experimental group (i.e., a group of students that received instruction with increased emphasis on writing tasks) was compared against a control group (i.e., a group of students that received conventional instruction) with respect to some content-related measure of academic achievement (e.g., final grade, an exam/quiz/test score). The effect size measure for this meta-analysis was the standardized mean difference (with positive scores indicating a higher mean level of academic achievement in the intervention group).” (From the metafor help page ?dat.bangertdrowns2004.)

library(metafor)
head(dat.bangertdrowns2004)
##   id   author year grade length minutes wic feedback info pers imag meta
## 1  1 Ashworth 1992     4     15      NA   1        1    1    1    0    1
## 2  2    Ayers 1993     2     10      NA   1       NA    1    1    1    0
## 3  3   Baisch 1990     2      2      NA   1        0    1    1    0    1
## 4  4    Baker 1994     4      9      10   1        1    1    0    0    0
## 5  5   Bauman 1992     1     14      10   1        1    1    1    0    1
## 6  6   Becker 1996     4      1      20   1        0    0    1    0    0
##         subject  ni     yi    vi
## 1       Nursing  60  0.650 0.070
## 2 Earth Science  34 -0.750 0.126
## 3          Math  95 -0.210 0.042
## 4       Algebra 209 -0.040 0.019
## 5          Math 182  0.230 0.022
## 6    Literature 462  0.030 0.009

We’ll only need a few of the columns, with specific names, so in the following we’ll select the relevant variables, rename id to study, compute the standard errors from the sampling variances (sei = sqrt(vi)), and create labels for the plot by pasting together the studies’ author and year columns. I’ll also subset the data to the first 15 studies, because the original data has 48 studies and that would make the plot very large (which is fine, but it’s simpler to start small).

library(dplyr)
d <- dat.bangertdrowns2004 %>%
    mutate(label = paste0(author, " (", year, ")"), sei = sqrt(vi)) %>%
    select(study = id, label, yi, sei) %>% 
    slice(1:15)
d
##    study              label    yi        sei
## 1      1    Ashworth (1992)  0.65 0.26457513
## 2      2       Ayers (1993) -0.75 0.35496479
## 3      3      Baisch (1990) -0.21 0.20493902
## 4      4       Baker (1994) -0.04 0.13784049
## 5      5      Bauman (1992)  0.23 0.14832397
## 6      6      Becker (1996)  0.03 0.09486833
## 7      7 Bell & Bell (1985)  0.26 0.32557641
## 8      8     Brodney (1994)  0.06 0.08366600
## 9      9      Burton (1986)  0.06 0.20000000
## 10    10   Davis, BH (1990)  0.12 0.22803509
## 11    11   Davis, JJ (1996)  0.77 0.32710854
## 12    12         Day (1994)  0.00 0.14491377
## 13    13     Dipillo (1994)  0.52 0.19235384
## 14    14     Ganguli (1989)  0.54 0.28809721
## 15    15  Giovinazzo (1996)  0.20 0.29325757

Fit meta-analytic model with brms

Fitting the meta-analytic model is easy with brms! The formula specifies the study-specific effect size and standard error, an overall intercept (1), and the “random studies” ((1|study)). I’ll use four cores for speed and increase the adapt_delta parameter to avoid divergent transitions.

library(brms)
mod <- brm(yi | se(sei) ~ 1 + (1|study), data = d, 
           cores = 4, control= list(adapt_delta=.99) )

The model summary shows a 95% CI for the average effect hugging 0, and reasonable between-study heterogeneity (sd(Intercept)):

summary(mod)
##  Family: gaussian(identity) 
## Formula: yi | se(sei) ~ 1 + (1 | study) 
##    Data: d (Number of observations: 15) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
##  
## Group-Level Effects: 
## ~study (Number of levels: 15) 
##               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept)     0.17       0.1     0.01      0.4        850    1
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept     0.13      0.07    -0.01     0.29        999    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

Draw forest plots with brms_forest()

The user only needs to enter a data frame and a brms model:

library(vmisc)
brms_forest(data = d, model = mod)

The forest plot shows, on the left, the labels provided in d. On the very right are the effect sizes and the limits of the Credible Intervals. The CI limits are by default 95%, but users can control this by passing the argument level = .80, for 80% CIs, for example.

In the middle are the posterior distributions of the estimated effect sizes as grey densities. The black circle indicates the posterior mean, and the arms extending from the point are the CI defined by level (here, 95% CI). The bottom row, ME, is the meta-analytic estimate.
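
For example, a quick sketch of the level argument mentioned above:

brms_forest(data = d, model = mod, level = .80)  # 80% CIs instead of the default 95%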

Arguments

brms_forest() has several arguments that affect the resulting plot; see ?brms_forest for details. For example:

brms_forest(
    data = d, 
    model = mod, 
    show_data = TRUE,  # Shows data means and SEs
    sort_estimates = TRUE,  # Sorts estimates based on their magnitude
    dens_fill = "dodgerblue")  # Fill densities with blue

Here, the plot also shows the means (black crosses) and 2 SEs (thin black lines) from the data, slightly below each estimate. This plot nicely shows how the random effects model shrinks the estimates toward the group mean, especially for studies that had wide SEs to begin with.

Customizing the plot

brms_forest() returns a ggplot2 object, which is customizable with regular ggplot2 functions (themes, scales, and so on). Here, we’ll add limits to the x axis manually (we have to specify the y axis, because the axes are internally flipped), add a label to replace the default “mean”, and use another custom theme:

library(ggplot2)
myplot <- brms_forest(data = d, 
                      model = mod, 
                      sort_estimates = TRUE)
myplot + 
    scale_y_continuous("Standardized ES", limits = c(-1.2, 1.2)) +
    theme_blog()

Credits

The plot uses a one-sided violin plot which is not included in the default ggplot2 distribution. This code was written by David Robinson https://github.com/dgrtwo: https://gist.github.com/dgrtwo/eb7750e74997891d7c20, and Ben Marwick https://github.com/benmarwick: https://gist.github.com/benmarwick/2a1bb0133ff568cbe28d.

Conclusion

Look at how non-normal those posterior densities are! It seems that in this case the posterior mean is quite a misleading point estimate, and actually showing the posterior distribution is a very good idea.

Hope you’ll find this function useful. If you’d like it to be improved, let me know. Thanks and have a good day!

How to create within-subject scatter plots in R with ggplot2

Today, we’ll take a look at creating a specific type of visualization for data from a within-subjects experiment (also known as repeated measures, but that can sometimes be a misleading label). You’ll often see within-subject data visualized as bar graphs (condition means, and maybe mean difference if you’re lucky.) But alternatives exist, and today we’ll take a look at within-subjects scatterplots.

For example, Ganis and Kievit (2015) asked 54 people to observe, on each trial, two 3-D shapes with various rotations and judge whether the two shapes were the same or not.

There were 4 angles (0, 50, 100, and 150 degree rotations), but for simplicity, today we’ll only look at items that were not rotated with respect to each other, and items rotated 50 degrees. The data are freely available (thanks!!) in Excel format, but to make them more easily available for readers, I’ve uploaded them in a .csv file, which we can load directly from an R script.

library(tidyverse)  # read_csv(), the dplyr verbs, tidyr, and ggplot2 used below
d <- read_csv("https://mvuorre.github.io/data/ganis-kievit-2015.csv")

First, let’s clean the data a little bit by selecting and renaming a subset of the variables, and then take only the trials with 0 or 50 degrees rotation.

d <- transmute(d,
               id = Subject,
               angle = angle,
               correct = correct.incorrect,
               rt = Time) %>%
    filter(angle < 51)
d
## # A tibble: 2,592 x 4
##       id angle correct    rt
##    <int> <int>   <int> <int>
##  1     1     0       1  1355
##  2     1    50       1  1685
##  3     1    50       1  1237
##  4     1     0       1  1275
##  5     1    50       1  2238
##  6     1     0       1  1524
##  7     1     0       1   964
##  8     1    50       1  1226
##  9     1    50       1  2548
## 10     1    50       1  1588
## # ... with 2,582 more rows

We’ll focus on comparing the reaction times between the 0 degree and 50 degree rotation trials. We predict that people will take longer to respond when the items are rotated, and that this effect will be robust across people.

Subject means

Bar graph

For the first graph, we’ll only need the subject’s means in each condition.

subject_means <- group_by(d, id, angle) %>%
    summarize(rt = mean(rt, na.rm = T))
subject_means
## Source: local data frame [108 x 3]
## Groups: id [?]
## 
## # A tibble: 108 x 3
##       id angle    rt
##    <int> <int> <dbl>
##  1     1     0  1512
##  2     1    50  2039
##  3     2     0  1251
##  4     2    50  1768
##  5     3     0  1602
##  6     3    50  1862
##  7     4     0  1501
##  8     4    50  2023
##  9     5     0  2170
## 10     5    50  2512
## # ... with 98 more rows

The first plot is a simple bar graph showing the condition means, and every subject as a point. Note that the mean is visualized as a bar, using the stat_summary(geom="bar") function.

barplot <- ggplot(subject_means, aes(x = angle, y = rt)) +
    stat_summary(geom = "bar",
                 fun.y = "mean",
                 col = "black",
                 fill = "gray70") +
    geom_point(position = position_jitter(h = 0, w = 5)) +
    scale_y_continuous(limits = c(0, max(d$rt, na.rm = T)),
                       expand = c(0, 0))
barplot

This figure shows quite clearly that the mean reaction time in the 50 degree angle condition was higher than in the 0 degree angle condition, as well as the spread across individuals in each condition. However, we are often specifically interested in the within-subject effect of condition, which is difficult to display in this image. We could draw lines to connect each person’s points, and the effect would then be visible as a “spaghetti plot”, but while useful, these plots may sometimes be a little overwhelming, especially if there are too many people (spaghetti is great, but nobody likes too much of it!)
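
For reference, here is a quick sketch of that spaghetti plot, using the subject_means data frame from above:

ggplot(subject_means, aes(x = angle, y = rt, group = id)) +
    geom_line(alpha = .4) +
    geom_point()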

Within-subject scatterplot

To draw a within-subjects scatterplot, we’ll need a slight reorganization of the data, such that it is in wide format with respect to the conditions:

subject_means
## Source: local data frame [108 x 3]
## Groups: id [?]
## 
## # A tibble: 108 x 3
##       id angle    rt
##    <int> <int> <dbl>
##  1     1     0  1512
##  2     1    50  2039
##  3     2     0  1251
##  4     2    50  1768
##  5     3     0  1602
##  6     3    50  1862
##  7     4     0  1501
##  8     4    50  2023
##  9     5     0  2170
## 10     5    50  2512
## # ... with 98 more rows
subject_means_wide <-
    spread(subject_means,
           key = angle,
           value = rt,
           sep = "_")
subject_means_wide
## Source: local data frame [54 x 3]
## Groups: id [54]
## 
## # A tibble: 54 x 3
##       id angle_0 angle_50
##  * <int>   <dbl>    <dbl>
##  1     1    1512     2039
##  2     2    1251     1768
##  3     3    1602     1862
##  4     4    1501     2023
##  5     5    2170     2512
##  6     6    1302     1382
##  7     7    2212     3014
##  8     8    1452     1824
##  9     9    2012     2501
## 10    10    1939     3058
## # ... with 44 more rows

Then we can simply map the per-subject angle-means to the X and Y axes:

ggplot(subject_means_wide, aes(x = angle_0, y = angle_50)) +
    geom_point()

But this graph needs a couple of fixes to be maximally informative. We need to:

  • Make the aspect ratio 1
  • Force the axes to be identically scaled (note the use of min() and max() to show the plots on the scale of the data)
  • Add an identity (diagonal) line
  • Modify the axis labels

lims <- c(min(d$rt, na.rm = T), max(d$rt, na.rm = T))
wsplot <-
    ggplot(subject_means_wide, aes(x = angle_0, y = angle_50)) +
    geom_point() +
    geom_abline() +
    scale_x_continuous("0 degrees", limits = lims) +
    scale_y_continuous("50 degrees", limits = lims) +
    theme(aspect.ratio = 1)
wsplot

Great! This plot shows each person (mean) as a point, and the difference between conditions can be seen directly in how far the points are from the diagonal line. Points above the diagonal indicate that the person’s (mean) RT was greater in the 50 degrees condition. All of the points lie above the identity line, indicating that the effect was as we predicted, and robust across individuals.

This is a very useful diagnostic plot that simultaneously shows the population- (or group-) level trend (are the points, on average, above or below the identity line?) and the expectation (mean) for every person (roughly, how far apart are the points from each other?). The points are naturally connected by their location, unlike in a bar graph where they would have to be connected by lines. Maybe you think it’s an informative graph; it’s certainly very easy to make in R with ggplot2. Also, I think it is visually very convincing, and doesn’t lead one to focus unduly on the group means alone: I am both convinced and informed by the graph.

Within-subject scatterplot with SEs

Well, we didn’t measure everybody repeatedly for nothing. We know more than their means; we can use the spread of the individual-level scores to calculate, say, an SE for everybody and add it to the graph.

subject_summaries <- group_by(d, id, angle) %>%
    summarize(mean = mean(rt, na.rm = T),
              se = sd(rt, na.rm = T) / sqrt(n()))
subject_summaries
## Source: local data frame [108 x 4]
## Groups: id [?]
## 
## # A tibble: 108 x 4
##       id angle  mean    se
##    <int> <int> <dbl> <dbl>
##  1     1     0  1512   146
##  2     1    50  2039   134
##  3     2     0  1251   125
##  4     2    50  1768   211
##  5     3     0  1602   162
##  6     3    50  1862   109
##  7     4     0  1501   112
##  8     4    50  2023   172
##  9     5     0  2170   242
## 10     5    50  2512   307
## # ... with 98 more rows

Now we simply need to reformat the data to wide with respect to both the means and the SEs. The trick here is to call spread() twice, with different values for its sep (separator) argument. Then, when the means and SEs are joined into wide format, we can select the columns containing either the means or the SEs by referring to their unique names.

means <- select(subject_summaries, -se) %>%
    spread(key=angle, value=mean, sep = "_")
means
## Source: local data frame [54 x 3]
## Groups: id [54]
## 
## # A tibble: 54 x 3
##       id angle_0 angle_50
##  * <int>   <dbl>    <dbl>
##  1     1    1512     2039
##  2     2    1251     1768
##  3     3    1602     1862
##  4     4    1501     2023
##  5     5    2170     2512
##  6     6    1302     1382
##  7     7    2212     3014
##  8     8    1452     1824
##  9     9    2012     2501
## 10    10    1939     3058
## # ... with 44 more rows
ses <- select(subject_summaries, -mean) %>%
    spread(key=angle, value=se, sep = "SE")
ses
## Source: local data frame [54 x 3]
## Groups: id [54]
## 
## # A tibble: 54 x 3
##       id angleSE0 angleSE50
##  * <int>    <dbl>     <dbl>
##  1     1      146       134
##  2     2      125       211
##  3     3      162       109
##  4     4      112       172
##  5     5      242       307
##  6     6      100        73
##  7     7      223       240
##  8     8      110       197
##  9     9      130       203
## 10    10      233       276
## # ... with 44 more rows
sums <- left_join(means, ses)
sums 
## Source: local data frame [54 x 5]
## Groups: id [?]
## 
## # A tibble: 54 x 5
##       id angle_0 angle_50 angleSE0 angleSE50
##    <int>   <dbl>    <dbl>    <dbl>     <dbl>
##  1     1    1512     2039      146       134
##  2     2    1251     1768      125       211
##  3     3    1602     1862      162       109
##  4     4    1501     2023      112       172
##  5     5    2170     2512      242       307
##  6     6    1302     1382      100        73
##  7     7    2212     3014      223       240
##  8     8    1452     1824      110       197
##  9     9    2012     2501      130       203
## 10    10    1939     3058      233       276
## # ... with 44 more rows

The code for the plot is actually quite straightforward once the tricky part of data formatting is done (this is really the philosophy behind ggplot2). Use geom_errorbar() to draw the vertical SE bars, and geom_errorbarh() to draw the horizontal SE bars.

ggplot(sums, aes(x=angle_0, y=angle_50)) +
    geom_point() +
    geom_errorbar(aes(ymin=angle_50-angleSE50, ymax=angle_50+angleSE50)) +
    geom_errorbarh(aes(xmin=angle_0-angleSE0, xmax=angle_0+angleSE0)) +
    geom_abline() +
    scale_x_continuous("0 degrees", limits = lims) +
    scale_y_continuous("50 degrees", limits = lims) +
    theme(aspect.ratio=1)

Cool, huh? This graph shows the mean and +-1 SEM for everybody’s reaction time in the 0 degrees (x axis) and 50 degrees (y axis) conditions. This graph could be a great visual inspection of the data before fitting any complex models, and requires only some slight reorganizing of the data in R. Hope you’ll find it helpful in your own work!

Endnote

Within-subject scatter plots are pretty common in some fields (e.g. psychophysics), but underutilized in many fields where they might have a positive impact on statistical inference. Why not try them out on your own data, especially when they’re this easy to do with R and ggplot2?

Recall that for real applications, it’s better to transform or model reaction times with a skewed distribution. Here we used normal distributions just for convenience.
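
For instance, here is a minimal sketch of computing the subject means on log-transformed reaction times, which could then be plotted exactly as above:

subject_means_log <- group_by(d, id, angle) %>%
    summarize(rt = mean(log(rt), na.rm = TRUE))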

Finally, this post was made possible by Ganis and Kievit (2015), who have generously shared their data online. Big thanks!

Have a great day!

References

Ganis, G., & Kievit, R. (2015). A New Set of Three-Dimensional Shapes for Investigating Mental Rotation Processes: Validation Data and Stimulus Set. Journal of Open Psychology Data, 3(1). https://doi.org/10.5334/jopd.ai

How to Compare Two Groups with Robust Bayesian Estimation Using R, Stan and brms

Happy New Year 2017, everybody! 2017 will be the year when social scientists finally decide to diversify their applied statistics toolbox and stop relying 100% on null hypothesis significance testing (NHST). We now recognize that different scientific questions may require different statistical tools, and we are ready to adopt new and innovative methods. A very appealing alternative to NHST is Bayesian statistics, which in itself contains many approaches to statistical inference. In this post, I provide an introductory and practical tutorial to Bayesian parameter estimation in the context of comparing two independent groups’ data.

More specifically, we’ll focus on the t-test. Everyone knows about it, everyone uses it. Yet, there are (arguably!) better methods for drawing inferences from two independent groups’ metric data (Kruschke, 2013; Morey & Rouder, 2015). Let’s talk about how “Bayesian estimation supersedes the t-test” (Kruschke, 2013).

Kruschke (2013, p.573) writes:

“When data are interpreted in terms of meaningful parameters in a mathematical description, such as the difference of mean parameters in two groups, it is Bayesian analysis that provides complete information about the credible parameter values. Bayesian analysis is also more intuitive than traditional methods of null hypothesis significance testing (e.g., Dienes, 2011).”

In that article (Bayesian estimation supersedes the t-test), Kruschke (2013) provided clear and well-reasoned arguments favoring Bayesian parameter estimation over null hypothesis significance testing in the context of comparing two groups, a situation which is usually dealt with by a t-test. The article also introduced a robust model for comparing two groups, which models the data with a t-distribution instead of a Gaussian. It provided R code for running the estimation procedures, which could be downloaded from the author’s website or as an R package (Kruschke & Meredith, 2015).

The R code and programs work well for this specific application (estimating the robust model for one or two groups’ metric data). However, modifying the code to handle more complicated situations is not easy, and the underlying estimation algorithms don’t necessarily scale up to handle more complicated situations. Therefore, today I’ll introduce easy to use, free, open-source, state-of-the-art computer programs for Bayesian estimation, in the context of comparing two groups’ metric (continuous) data. The programs are available for the R programming language–so make sure you are familiar with R basics (e.g. Vuorre, 2016). I provide R code (it’s super easy, don’t worry!) for t-tests and Bayesian estimation in R using the R package brms (Buerkner, 2016), which uses the powerful Stan MCMC program (Stan Development Team, 2016) under the hood.

These programs supersede older Bayesian inference programs because they are easy to use (brms is an interface to Stan, which is actually a programming language in itself), fast, and are able to handle models with thousands of parameters. Learning to implement basic analyses such as t-tests, and Kruschke’s robust model, with these programs is very useful because (obviously) you’ll then be able to do Bayesian statistics in practice, and will be prepared to understand and implement more complex models.

Understanding the results of Bayesian estimation requires understanding some basics of Bayesian statistics, which I won’t describe here at all. If you are not familiar with Bayesian statistics, please read Kruschke’s excellent article (his book is also very good, Kruschke, 2014; see also McElreath, 2016). In fact, you should read the paper anyway, it’s very good.

First, I’ll introduce the basics of t-tests in some detail, and then focus on understanding them as specific instantiations of linear models. If that sounds familiar, skip ahead to Bayesian Estimation of the t-test, where I introduce the brms package for estimating models using Bayesian methods. Following that, we’ll use “distributional regression” to obtain Bayesian estimates of the unequal variances t-test model. Finally, we’ll learn how to estimate Kruschke’s (2013) BEST model using brms.

The t in a t-test

We’ll begin with t-tests, using example data from Kruschke’s paper (p. 577):

“Consider data from two groups of people who take an IQ test. Group 1 (N1=47) consumes a “smart drug,” and Group 2 (N2=42) is a control group that consumes a placebo.”

I’ve decided to call the control group “Group 0”, and the treatment group “Group 1”, because this coding makes it natural to think of the control group as a “reference group”, and any “effect” we’ll estimate will be associated with the treatment group. These data are visualized as histograms, below:

Figure 1: Histograms of the two groups’ IQ scores.

Equal variances t-test

These two groups’ IQ scores could be compared with a simple equal variances t-test (which you shouldn’t use; Lakens, 2015), also known as Student’s t-test. I have the two groups’ IQ scores in R as two vectors called group_0 and group_1, so doing a t-test is as easy as

t.test(group_0, group_1, var.equal=T)
## 
##  Two Sample t-test
## 
## data:  group_0 and group_1
## t = -1.5587, df = 87, p-value = 0.1227
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.544155  0.428653
## sample estimates:
## mean of x mean of y 
##  100.3571  101.9149

We interpret the t-test in terms of the observed t-value, and whether it exceeds the critical t-value. The critical t-value, in turn, is defined as the extreme \(\alpha / 2\) percentiles of a t-distribution with the given degrees of freedom. The current situation is illustrated below:

Figure 2: t distribution with 87 degrees of freedom, and observed t-value. The dashed vertical lines indicate the extreme 2.5 percentiles. We would reject the null hypothesis of no difference if the observed t-value exceeded these percentiles.

The test results in an observed t-value of 1.56, which is not far enough in the tails of a t-distribution with 87 degrees of freedom to warrant rejecting the null hypothesis (given that we are using \(\alpha\) = .05, which may or may not be an entirely brilliant idea (e.g. Rouder, Morey, Verhagen, Province, & Wagenmakers, 2016)). Note that R also reports a 95% CI for the estimated difference between the two groups.
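
You can check the critical values with base R’s quantile function for the t-distribution:

qt(c(.025, .975), df = 87)  # approximately -1.99 and 1.99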

Unequal variances t-test

Next, we’ll run the more appropriate, unequal variances t-test (also known as Welch’s t-test), which R gives by default:

t.test(group_0, group_1)
## 
##  Welch Two Sample t-test
## 
## data:  group_0 and group_1
## t = -1.6222, df = 63.039, p-value = 0.1098
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.4766863  0.3611848
## sample estimates:
## mean of x mean of y 
##  100.3571  101.9149

Note that while R gives Welch’s t-test by default, SPSS gives both. If you’re using SPSS, make sure to report the Welch’s test results instead of the equal variances test. Here, the conclusion with respect to rejecting the null hypothesis of equal means is the same. However, notice that the results are numerically different, as they should be, because these two t-tests refer to different models!

As a side note, I recently learned that this problem (estimating and testing the difference between two means when the variances are not assumed equal) is unsolved: only approximate solutions are known.

It is of course up to you, as a researcher, to decide whether you assume equal variances or not. But note that we almost always allow the means to be different (that’s the whole point of the test, really), while many treatments may just as well have an effect on the standard deviations.

The first take-home message from today is that there are actually two t-tests, each associated with a different statistical model. And to make clear what the difference is, we must acquaint ourselves with the models.

Describing the model(s) underlying the t-test(s)

We don’t usually think of t-tests (and ANOVAs) as models, but it turns out that they are just linear models disguised as tests (see here and here). Recently, there has been a tremendous push for model/parameter estimation, instead of null hypothesis significance testing (e.g. Cumming, 2014; Kruschke, 2014; see also the brilliant commentary by Gigerenzer, 2004), so we will benefit from thinking about t-tests as linear models. Doing so will facilitate “[interpreting data] in terms of meaningful parameters in a mathematical description” (Kruschke, 2013), and seamlessly expanding our models to handle more complicated situations.

The equal variances t-test models metric data with three parameters: Mean for group A, mean for group B, and one shared standard deviation (i.e. the assumption that the standard deviations [we usually refer to variances, but whatever] are equal between the two groups.)

We call the metric data (IQ scores in our example) \(y_{ik}\), where \(i\) is a subscript indicating the \(i^{th}\) datum, and \(k\) indicates the \(k^{th}\) group. So \(y_{19, 1}\) would be the 19th datum, belonging to group 1. Then we specify that \(y_{ik}\) are Normally distributed, \(N(\mu_{ik}, \sigma)\), where \(\mu_{ik}\) indicates the mean of group \(k\), and \(\sigma\) the common standard deviation.

\[y_{ik} \sim N(\mu_{ik}, \sigma)\]

Read the formula as “Y is normally distributed with mean \(\mu_{ik}\) (mu), and standard deviation \(\sigma\) (sigma)”. Note that the standard deviation \(\sigma\) doesn’t have any subscripts: we assume it is the same for the \(k\) groups. Note also that you’ll often see the second parameter in the parentheses as \(\sigma^2\), referring to the variance.

The means for groups 0 and 1 are simply \(\mu_0\) and \(\mu_1\), respectively, and their difference (let’s call it \(d\)) is \(d = \mu_0 - \mu_1\). The 95% CI for \(d\) is given in the t-test output, and we can tell that it differs from the one given by Welch’s t-test.

It is unsurprising, then, that if we use a different model (the more appropriate unequal variances model; Lakens, 2015), our inferences may be different. Welch’s t-test is the same as Student’s, except that now we assume (and subsequently estimate) a unique standard deviation \(\sigma_{ik}\) for each group.

\[y_{ik} \sim N(\mu_{ik}, \sigma_{ik})\]

This model makes a lot of sense, because rarely are we in a situation to a priori decide that the variance of scores in Group A is equal to the variance of scores in Group B. If you use the equal variances t-test, you should be prepared to justify and defend this assumption. (Deciding between models–such as between these two t-tests–is one way in which our prior information enters and influences data analysis. This fact should make you less suspicious about priors in Bayesian analyses.)

Armed with this knowledge, we can now see that “conducting a t-test” can be understood as estimating one of these two models. By estimating the model, we obtain t-values, degrees of freedom, and consequently, p-values.

However, to focus on modeling and estimation, it is easier to think of the t-test as a specific type of the general linear model (aka linear regression). We can re-write the t-test in an equivalent way that has a specific parameter for the difference in means, by writing it as a linear model. (For simplicity, I’ll only write the equal variances model):

\[y_{ik} \sim N(\mu_{ik}, \sigma)\] \[\mu_{ik} = \beta_0 + \beta_1 Group_{ik}\]

Here, \(\sigma\) is just as before, but we now model the mean with an intercept (Group 0’s mean, \(\beta_0\)) and the effect of Group 1 (\(\beta_1\)). To understand what’s going on, let’s look at the data. Group is an indicator variable in the data: for each row of Group 0’s data Group is zero, and for each row of Group 1’s data Group is one.

##     Group  IQ
## 1       0  99
## 2       0 101
## 3       0 100
## 4       0 101
## ...   ... ...
## 86      1 101
## 87      1 104
## 88      1 100
## 89      1 101

With this model, \(\beta_1\) directly tells us the estimated difference between the two groups. And because it is a parameter in the model, it has an associated standard error, t-value, degrees of freedom, and p-value. This linear model can be estimated in R with the following line of code:

olsmod <- lm(IQ ~ Group, data = d)

The key input here is a model formula, which in R is specified as outcome ~ predictor (DV ~ IV). Using the lm() function, we estimated a linear model predicting IQ from an intercept (automatically included) and the Group predictor, whose coefficient is the effect of group 1. I called this object olsmod for Ordinary Least Squares Model.

R has its own model formula syntax, which is well worth learning. The formula in the previous model, IQ ~ Group, means that we want to regress IQ on an intercept (which is implicitly included) and group (Group). Besides the formula, we only need to provide the data, which is contained in the object I’ve conveniently called d.

You can verify that the results are identical to the equal variances t-test above.

summary(olsmod)
## 
## Call:
## lm(formula = IQ ~ Group, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.9149  -0.9149   0.0851   1.0851  22.0851 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 100.3571     0.7263 138.184   <2e-16 ***
## Group         1.5578     0.9994   1.559    0.123    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.707 on 87 degrees of freedom
## Multiple R-squared:  0.02717,    Adjusted R-squared:  0.01599 
## F-statistic:  2.43 on 1 and 87 DF,  p-value: 0.1227

Focus on the Group row in the estimated coefficients. Estimate is the point estimate (best guess) of the difference in means (\(d = 101.9149 - 100.3571 = 1.5578\)). t value is the observed t-value (identical to what t.test() reported), and the p-value (Pr(>|t|)) matches as well. The (Intercept) row refers to \(\beta_0\), which is group 0’s mean.

This way of thinking about the model, where we have parameters for one group’s mean, and the effect of the other group, facilitates focusing on the important parameter, the difference, instead of individual means. However, you can of course compute the difference from the means, or the means from one mean and a difference.
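
For example, a small sketch of recovering the two group means from the fitted model’s coefficients:

coef(olsmod)["(Intercept)"]                          # Group 0 mean
coef(olsmod)["(Intercept)"] + coef(olsmod)["Group"]  # Group 1 mean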

Bayesian estimation of the t-test

Equal variances model

Next, I’ll illustrate how to estimate the equal variances t-test using Bayesian methods. We use brms (Buerkner, 2016), and the familiar R formula syntax which we used with the OLS model.

Estimating this model with R, thanks to the Stan and brms teams (Stan Development Team, 2016; Buerkner, 2016), is as easy as the linear regression model we ran above. If you haven’t yet installed brms, you need to install it first by running install.packages("brms"). Then, to access its functions, load the brms package to the current R session.

library(brms)

The most important function in the brms package is brm(), for Bayesian Regression Model(ing). The user needs only to input a model formula, just as above, and a data frame that contains the variables specified in the formula. brm() then translates the model into Stan language, and asks Stan to compile the model into C++ and estimate it (see Kruschke, 2014; McElreath, 2016 for details about estimation). The result is an R object with the estimated results (and much more). We run the model and save the results to mod_eqvar for equal variances model:

mod_eqvar <- brm(IQ ~ Group, data = d)
##  Family: gaussian (identity) 
## Formula: IQ ~ Group 
##    Data: d (Number of observations: 89) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
##  
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept   100.34      0.74    98.91   101.83       3803    1
## Group         1.59      1.01    -0.41     3.55       3958    1
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma     4.77      0.36     4.13     5.55       2653    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

Notice that the model contains three parameters, one of which is the shared standard deviation sigma. Compare the output of the Bayesian model to the one estimated with lm() (OLS):

Method  term         estimate  std.error
OLS     (Intercept)   100.357      0.726
OLS     Group           1.558      0.999
Bayes   b_Intercept   100.339      0.743
Bayes   b_Group         1.588      1.009

The point estimates (posterior means in the Bayesian model) and standard errors (SD of the respective posterior distribution) are pretty much identical.
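
Here is a sketch of how such a comparison could be assembled; coef(summary()) returns the OLS estimates and standard errors, and brms’ fixef() returns the posterior means and standard deviations of the population-level effects:

round(coef(summary(olsmod))[, 1:2], 3)  # OLS estimates and SEs
round(fixef(mod_eqvar), 3)              # Bayesian posterior summaries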

We now know the models behind t-tests, and how to estimate the equal variances t-test using the t.test(), lm(), and brm() functions. We also know how to run Welch’s t-test using t.test(). However, estimating the general linear model version of the unequal variances t-test model is slightly more complicated, because it involves specifying predictors for \(\sigma\), the standard deviation parameter.

Unequal variances model

We only need a small adjustment to the equal variances model to specify the unequal variances model:

\[y_{ik} \sim N(\mu_{ik}, \sigma_{ik})\] \[\mu_{ik} = \beta_0 + \beta_1 Group_{ik}\]

Notice that we now have subscripts for \(\sigma\), denoting that it varies between groups. In fact, we’ll write out a linear model for the standard deviation parameter!

\[\sigma_{ik} = \gamma_0 + \gamma_1 Group_{ik}\]

The model now includes, instead of a common \(\sigma\), one parameter for Group 0’s standard deviation \(\gamma_0\) (gamma), and one for the effect of Group 1 on the standard deviation \(\gamma_1\), such that group 1’s standard deviation is \(\gamma_0 + \gamma_1\). Therefore, we have 4 free parameters, two means and two standard deviations. (The full specification would include prior distributions for all the parameters, but that topic is outside of the scope of this post.) Let’s estimate!

brm() takes more complicated models by wrapping them inside bf() (short for brmsformula()), which is subsequently entered as the first argument to brm().

uneq_var_frm <- bf(IQ ~ Group, sigma ~ Group)

You can see that the formula regresses IQ on Group, such that we’ll have an intercept (implicitly included), and an effect of Group 1. Remarkably, we are also able to model the standard deviation sigma, and we regress it on Group (it will also have an intercept and effect of group).

mod_uneqvar <- brm(uneq_var_frm, data = d, cores=4)
##  Family: gaussian (identity) 
## Formula: IQ ~ Group 
##          sigma ~ Group
##    Data: d (Number of observations: 89) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
##  
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept         100.35      0.40    99.55   101.12       4000    1
## sigma_Intercept     0.94      0.11     0.72     1.17       3138    1
## Group               1.54      0.99    -0.38     3.47       2471    1
## sigma_Group         0.87      0.15     0.57     1.17       3222    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The model’s output contains our 4 parameters. Intercept is the mean for group 0, and Group is the “effect of group 1”. sigma_Intercept is the standard deviation of Group 0, and sigma_Group is the effect of group 1 on the standard deviation (the SD of Group 1 is sigma_Intercept + sigma_Group). The sigmas are implicitly modeled through a log-link (because they must be positive). To convert them back to the scale of the data, they need to be exponentiated. After taking the exponents of the sigmas, the results look like this:

parameter          Estimate  Est.Error  l-95% CI  u-95% CI
b_sigma_Group          2.41       0.37      1.76      3.23
b_Intercept          100.35       0.40     99.55    101.12
b_sigma_Intercept      1.75       0.85      0.76      3.09
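
Here is a sketch of how that conversion could be done from the posterior samples (assuming the fitted model object mod_uneqvar from above):

post <- posterior_samples(mod_uneqvar, pars = "b_sigma")
sd_group0 <- exp(post$b_sigma_Intercept)                       # Group 0 SD
sd_group1 <- exp(post$b_sigma_Intercept + post$b_sigma_Group)  # Group 1 SD
quantile(sd_group0, probs = c(.025, .5, .975))
quantile(sd_group1, probs = c(.025, .5, .975))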

For comparison, here is the “observed SD” of group 0:

## [1] 2.52

Keep in mind that the parameters refer to Group 0’s mean (Intercept) and SD (sigma), and the difference between groups in those values (Group) and (sigma_Group). We now have fully Bayesian estimates of the 4 parameters of the unequal variances t-test model. Because p-values have no place in Bayesian inference, they are not reported in the output. However, you can calculate a quantity that is equivalent to a one-sided p-value from the posterior distribution: Take the proportion of posterior density (MCMC samples) above/below a reference value (0). This is definitely not the most useful thing you can do with a posterior distribution, but the fact that it numerically matches a one-sided p-value is quite interesting:

# Posterior distribution of Group effect
x <- as.data.frame(mod_uneqvar, pars = "b_Group")[,1]
# Proportion of MCMC samples below zero
round((sum(x < 0) / length(x)), 3)
## [1] 0.058
# One sided p-value from t-test
round(t.test(group_0, group_1, alternative = "less")$p.value, 3)
## [1] 0.055

I’m showing this remarkable fact (Marsman & Wagenmakers, no date) not to persuade you to stick with p-values, but to alleviate fears that these methods would always produce discrepant results.

Although this model is super easy to estimate with brm() (which, I should emphasize, uses Stan for the estimation procedures), the model seems, frankly speaking, strange. I am just not used to modeling variances, and I’ll bet a quarter that neither are you. Nevertheless, there it is!

Finally, let’s move on to Kruschke’s (2013) “Robust Bayesian Estimation” model.

Robust Bayesian Estimation

Kruschke’s robust model compares two groups using five parameters: one mean for each group and one standard deviation for each group, just as in the unequal variances model above, plus a fifth “normality” parameter, \(\nu\) (nu), which means that we now use a t-distribution to model the data. Using a t-distribution instead of a Gaussian makes the model (and therefore our inferences) less sensitive to extreme values (outliers). Here’s what the model looks like:

\[y_{ik} \sim T(\nu, \mu_{ik}, \sigma_{ik})\]

Read the above formula as “Y are random draws from a t-distribution with ‘normality’ parameter \(\nu\), mean \(\mu_{ik}\), and standard deviation \(\sigma_{ik}\)”. We have a linear model for the means and standard deviations:

\[\mu_{ik} = \beta_0 + \beta_1 Group_{ik}\]

and

\[\sigma_{ik} = \gamma_0 + \gamma_1 Group_{ik}\]
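
Before fitting the model, a quick numerical aside on why the t-distribution buys robustness (my illustration, not part of the original analysis): compare the probability of a value falling more than three scale units from the center under a standard normal versus a t-distribution with few degrees of freedom.

# Tail probability beyond +/- 3 under a standard normal vs. a t with 2 df
2 * pnorm(-3)       # ~0.003
2 * pt(-3, df = 2)  # ~0.10

Extreme observations are far less surprising under the t-distribution, so they have much less leverage on the estimated means.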

This model, as you can see, is almost identical to the unequal variances model above, but it uses a t-distribution (we assume the data are t-distributed) and includes the normality parameter. With brm() we can keep the unequal variances formula; we only have to specify the t-distribution, which we do by setting the family argument to student (as in Student’s t):

mod_robust <- brm(bf(IQ ~ Group, sigma ~ Group),
                  family=student,
                  data = d, cores=4)
##  Family: student (identity) 
## Formula: IQ ~ Group 
##          sigma ~ Group
##    Data: d (Number of observations: 89) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
##  
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Eff.Sample  Rhat
## Intercept        100.524     0.208  100.125  100.940       4000 0.999
## sigma_Intercept    0.001     0.196   -0.389    0.395       4000 1.001
## Group              1.028     0.417    0.206    1.833       3100 1.000
## sigma_Group        0.671     0.258    0.163    1.165       4000 1.000
## 
## Family Specific Parameters: 
##    Estimate Est.Error l-95% CI u-95% CI Eff.Sample  Rhat
## nu    1.853     0.482    1.135     2.99       3179 1.001
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The effect of Group is about one unit, with a 95% Credible Interval from 0.2 to 1.8.

Finally, let’s compare the results to those in Kruschke’s paper (2013, p. 578). Before we do this, I’ll convert the estimated parameters to group means and standard deviations (instead of the “regression effects” produced by default). Recall that I recoded the group labels used by Kruschke: what he calls group 2 is group 0 (the control group) in our analyses, while group 1 is still group 1. In the following I transform the results and compute HDIs to obtain results most compatible with Kruschke’s (one way to do this is sketched after the table):

key              mean   HDIlwr   HDIupr
group0_mean    100.52   100.12   100.94
group1_mean    101.55   100.80   102.24
diff_means       1.03     0.19     1.80
group0_sigma     1.02     0.65     1.41
group1_sigma     2.00     1.22     2.80
diff_sigmas      2.02     1.13     3.10
nu               1.85     1.04     2.76
nulog10          0.25     0.04     0.46
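
Here is one way that transformation could be done: a sketch of my own, using posterior_samples() from brms and HPDinterval() from the coda package, not necessarily the exact code behind the table (the difference rows are omitted):

library(coda)
# Transform the posterior samples of mod_robust into group means and SDs,
# then summarise with posterior means and 95% HDIs.
post <- posterior_samples(mod_robust)
samples <- data.frame(
  group0_mean  = post$b_Intercept,
  group1_mean  = post$b_Intercept + post$b_Group,
  group0_sigma = exp(post$b_sigma_Intercept),                       # undo the log link
  group1_sigma = exp(post$b_sigma_Intercept + post$b_sigma_Group),
  nu           = post$nu,
  nulog10      = log10(post$nu)
)
cbind(mean = round(colMeans(samples), 2),
      round(HPDinterval(as.mcmc(as.matrix(samples))), 2))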

Notice that Kruschke (2013, p. 578) reports posterior modes, whereas our point estimates are posterior means. The group means nevertheless agree to two decimal places; the standard deviations are slightly more discrepant, which is expected given the different choice of point estimate.

Finally, here is how to estimate the model using Kruschke and Meredith’s original code, available in the BEST package (Kruschke & Meredith, 2015):

library(BEST)
BEST <- BESTmcmc(group_0, group_1)
## 
## Processing function input....... 
## 
## Done. 
##  
## Beginning parallel processing using 3 cores. Console output will be suppressed.
## 
## Parallel processing completed.
## 
## MCMC took 0.283 minutes.
BEST
## MCMC fit results for BEST analysis:
## 100002 simulations saved.
##           mean     sd  median    HDIlo   HDIup Rhat n.eff
## mu1    100.525 0.2149 100.524 100.1108 100.955    1 57740
## mu2    101.548 0.3800 101.551 100.7853 102.289    1 59733
## nu       1.840 0.4775   1.764   1.0380   2.753    1 23934
## sigma1   1.052 0.2069   1.032   0.6737   1.464    1 35619
## sigma2   2.063 0.4359   2.021   1.2598   2.927    1 29495
## 
## 'HDIlo' and 'HDIup' are the limits of a 95% HDI credible interval.
## 'Rhat' is the potential scale reduction factor (at convergence, Rhat=1).
## 'n.eff' is a crude measure of effective sample size.

This output reports posterior means and HDI limits, the same quantities we reported above, and you can verify that the two sets of results match each other very closely. The BESTmcmc() function is great, but with brms you can estimate a much wider variety of models.

Conclusion

Well, that ended up much longer than I intended. The aim was both to illustrate the ease of Bayesian modeling in R using brms (Buerkner, 2016) and Stan (Stan Development Team, 2016), and to highlight the fact that we can easily move from simple t-tests to more complex (and possibly better) models.

If you’ve followed through, you should be able to conduct Student’s (equal variances) and Welch’s (unequal variances) t-tests in R, and to think about those tests as instantiations of general linear models. Further, you should be able to estimate these models using Bayesian methods.

You should now also be familiar with Kruschke’s robust model for comparing two groups’ metric data, and be able to implement it in one line of R code. This model was able to identify credible differences between the two groups, even though the frequentist t-tests returned p-values above .05. That should be motivation enough to try robust (Bayesian) models on your own data.

Further reading

I didn’t take any space here to discuss the interpretation of Bayesian statistics. For that, I recommend Kruschke (2014) and McElreath (2016). See also Etz, Gronau, Dablander, Edelsbrunner, & Baribault (2016) for an annotated reading list that serves as an introduction to Bayesian statistics.

References

Buerkner, P.-C. (2016). brms: Bayesian Regression Models using Stan. Retrieved from http://CRAN.R-project.org/package=brms
Cumming, G. (2014). The New Statistics: Why and How. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966
Dienes, Z. (2011). Bayesian Versus Orthodox Statistics: Which Side Are You On? Perspectives on Psychological Science, 6(3), 274–290. https://doi.org/10.1177/1745691611406920
Etz, A., Gronau, Q. F., Dablander, F., Edelsbrunner, P. A., & Baribault, B. (2016). How to become a Bayesian in eight easy steps: An annotated reading list. ResearchGate. Retrieved from https://www.researchgate.net/publication/301981861_How_to_become_a_Bayesian_in_eight_easy_steps_An_annotated_reading_list
Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603. https://doi.org/10.1037/a0029146
Kruschke, J. K. (2014). Doing Bayesian Data Analysis: A Tutorial Introduction with R (2nd Edition). Burlington, MA: Academic Press.
Lakens, D. (2015, January 26). The 20% Statistician: Always use Welch’s t-test instead of Student’s t-test. Retrieved from https://daniellakens.blogspot.com/2015/01/always-use-welchs-t-test-instead-of.html
Marsman, M., & Wagenmakers, E.-J. (no date). Three Insights from a Bayesian Interpretation of the One-Sided P Value. Retrieved from http://www.ejwagenmakers.com/inpress/MarsmanWagenmakersOneSidedPValue.pdf
McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
Morey, R. D., & Rouder, J. (2015). BayesFactor: Computation of Bayes Factors for Common Designs. Retrieved from https://CRAN.R-project.org/package=BayesFactor
Rouder, J. N., Morey, R. D., Verhagen, J., Province, J. M., & Wagenmakers, E.-J. (2016). Is There a Free Lunch in Inference? Topics in Cognitive Science, 8(3), 520–547. https://doi.org/10.1111/tops.12214
Stan Development Team. (2016). Stan: A C++ Library for Probability and Sampling, Version 2.14.1. Retrieved from http://mc-stan.org/
Vuorre, M. (2016, December 5). Introduction to Data Analysis using R. Retrieved from http://blog.efpsa.org/2016/12/05/introduction-to-data-analysis-using-r/
