Quantitative literature review with R: Exploring Psychonomic Society Journals, Part II

(This post directly continues from part I of Quantitative literature review with R. Please read that first for context.) In this tutorial, I’ll show how to use R to quantitatively explore, analyze, and visualize a broad research literature, using all Psychonomic Society publications between 2004 and 2016. Part I focused on data cleaning and simple figures, but here we will look at relational data by visualizing some network structures in the data.

We’ll be working with R, so if you want to follow along on your own computer, fire up R and load up the required R packages:

library(tidyverse)
library(stringr)

Then you can load the cleaned data from https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda using R:

load(url("https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda"))

We’ve got data on quite a few variables which we could imagine are interconnected, such as Tags (Keywords), Authors, and the articles themselves (through the References variable).

glimpse(d)
## Observations: 7,614
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Cited_by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors_with_affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author_Keywords           <chr> "3D perception; Perception and actio...
## $ Index_Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence_Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated_Source_Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document_Type             <chr> "Article", "Article", "Article", "Ar...
## $ Pages                     <int> 7, 13, 19, NA, 12, 12, 32, 19, 6, 22...
## $ Publication               <chr> "Attention, Perception & Psychophysi...
## $ Pub_abbr                  <chr> "Atten. Percept. Psychophys.", "Atte...

Let’s begin by looking at the articles’ Keywords.

Tags

These are the things that you have to choose to describe your article when you submit it for publication. They are called Keywords in the data, but I’m going to refer to them as Tags for brevity. Also, I will only work with the Author_Keywords, because it seems like Scopus’s automatic index keywords are mostly nonsense or missing.

So our first operation is to rename the Author_Keywords variable to Tags, and get rid of the automatic keywords altogether.

d <- rename(d, Tags = Author_Keywords) %>% 
    select(-Index_Keywords)

Before moving on to networks, let’s take a look at what these tags are like: How many are there? Do people use them consistently?

Exploring univariate tags

Our first goal is to simply visualize the most common tags. However, recall that the Tags variable in the data is quite messy:
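
A quick peek at the first few rows shows what we’re dealing with (the exact call behind the printout below is an assumption; any way of looking at the first few values will do):

d %>% 
    select(Tags) %>% 
    slice(1:3)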

## # A tibble: 3 x 1
##                                                                          Tags
##                                                                         <chr>
## 1    3D perception; Perception and action; Scene perception; Space perception
## 2 Auditory perception; Cognitive performance; Dissonance; Distraction; Inharm
## 3             Attention; Automaticity; Cognitive control; Selective attention

Each value of Tags is a kind of a list, whose individual items are separated by semicolons. What we’d like to have is a variable (column in a data frame) where every item is a single tag, so we need to do some kind of a data unwrapping operation.

The unwrapping can be done by simply splitting the Tags variable on each semicolon (and space) found in each string, but this creates a strange variable:

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) 
## # A tibble: 7,614 x 1
##         Tags
##       <list>
##  1 <chr [4]>
##  2 <chr [6]>
##  3 <chr [4]>
##  4 <chr [5]>
##  5 <chr [5]>
##  6 <chr [3]>
##  7 <chr [3]>
##  8 <chr [6]>
##  9 <chr [3]>
## 10 <chr [3]>
## # ... with 7,604 more rows

This is where things get really interesting. A few lines above, we conceptualized each value (cell) of Tags as a sort of list. We have now operationalized that idea by creating a Tags variable that is an R list-column, meaning that each cell can contain a complex R object. Here, each value is a list of character values. To make this notion concrete, look at the first row (slice(1)) of the Tags variable (.$Tags, where . is a placeholder for the data being passed along by the %>% operator) by using the familiar str() function:

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    slice(1) %>% 
    .$Tags %>%  str()
## List of 1
##  $ : chr [1:4] "3D perception" "Perception and action" "Scene perception" "Space perception"

All the tags associated with the first article in our data frame are in a neat list of neat character strings. Brilliant. This data format is quite foreign to those of us used to working with strictly flat rectangular data, but it has a certain recursive beauty to it. Now that our list is concrete, we can unnest it to create a Tags variable in which each row is one individual tag.

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags)
## # A tibble: 22,375 x 1
##                     Tags
##                    <chr>
##  1         3D perception
##  2 Perception and action
##  3      Scene perception
##  4      Space perception
##  5   Auditory perception
##  6 Cognitive performance
##  7            Dissonance
##  8           Distraction
##  9         Inharmonicity
## 10                 Music
## # ... with 22,365 more rows

Excellent! This data frame has one variable, Tags, where each observation is an individual tag. Now we can go ahead and count and summarise the tags. From here onwards, it’s all regular dplyr verbs:

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags)) %>%  # Remove missing observations
    group_by(Tags) %>%  # Group by unique Tags
    count() %>%  # And count them
    ungroup() %>% 
    top_n(40) %>%  # Choose n most numerous Tags
    mutate(Tags = reorder(Tags, n)) %>%  # {1}
    ggplot(aes(n, Tags)) +
    geom_point() +
    geom_segment(aes(xend = 0, yend=Tags)) +
    scale_x_continuous(limits = c(0, 400), expand = c(0, 0)) +
    labs(x="Number of articles", y="",
         subtitle = "Top 40 tags in Psychonomics Journals",
         caption = "Data for years 2004-2016 only")

{1} reordered the Tags by the n variable so that they appear in a sensible order in the figure. You could also make a word cloud with these data, but I think I’ll refrain this time. Instead, what I’ll do is explore popular tags by journal. The “pipeline” to get tags by journal is almost identical, except that we’ll keep the journal information in the data.

The following code is a little more complicated than is required for a simple plot, but I wanted to order the Tags inside each publication, which required a little bit of extra work as described by @drsimonj. Below, I’ll work with journal abbreviations Pub_abbr.

pd <- d %>% 
    select(Pub_abbr, Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags)) %>%
    group_by(Pub_abbr, Tags) %>%  # Group by unique Tags AND journals
    count() %>%  # And count them
    group_by(Pub_abbr) %>% 
    top_n(10, n) %>%  # Select top 10 tags within each journal
    ungroup() %>% 
    arrange(Pub_abbr, n) %>%  # Arrange by n within journals
    mutate(order = row_number())  # Will order Tags on this
pd
## # A tibble: 62 x 4
##                       Pub_abbr                  Tags     n order
##                          <chr>                 <chr> <int> <int>
##  1 Atten. Percept. Psychophys.     Attentional blink    33     1
##  2 Atten. Percept. Psychophys.        Working memory    36     2
##  3 Atten. Percept. Psychophys. Visual working memory    40     3
##  4 Atten. Percept. Psychophys.      Visual attention    41     4
##  5 Atten. Percept. Psychophys.   Selective attention    49     5
##  6 Atten. Percept. Psychophys.     Visual perception    51     6
##  7 Atten. Percept. Psychophys.   Attentional capture    52     7
##  8 Atten. Percept. Psychophys.         Eye movements    61     8
##  9 Atten. Percept. Psychophys.         Visual search   116     9
## 10 Atten. Percept. Psychophys.             Attention   180    10
## # ... with 52 more rows

That was more complicated than usual, so I wanted to print out the intermediate result as well. We can now use it to plot:

pd %>%     
    ggplot(aes(n, order)) +
    geom_point() +
    geom_segment(aes(xend = 0, yend=order)) +
    scale_y_continuous(breaks = pd$order, labels = pd$Tags) +
    scale_x_continuous(expand = c(0.02, 0)) +
    labs(x="Number of articles", y="",
         subtitle = "Top 10 tags within 6 Psychonomics Journals",
         caption = "Data for years 2004-2016 only") +
    facet_wrap("Pub_abbr", scales="free", ncol=2)

Finally, let’s look at the overall distribution of the tags as a histogram:
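
A sketch of how such a histogram can be made, reusing the unnesting pipeline from above (the x-axis clipping here is an arbitrary choice to keep the long right tail from dominating):

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags)) %>% 
    count(Tags) %>%  # Number of articles per unique tag
    ggplot(aes(n)) +
    geom_histogram(binwidth = 1, col = "black") +
    coord_cartesian(xlim = c(0, 40)) +  # Clip the long right tail
    labs(x = "Number of occurrences", y = "Number of unique tags")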

There are about 8000 unique Tags, and about 7000 of them occur fewer than three times in the corpus. Now that we have some idea of what kinds of tags there are in the data, roughly what their distributions might look like, and how they vary across the journals, we are ready to look at networks of tags.

Network analysis of Tags

Network analysis (or graph analysis) is a vast and complicated area, and my graph analysis skills are a little rusty. So here we’ll simply illustrate how to format the data in an appropriate way, and draw a few example graphs. I referred to this website more than once while working on this (also thanks to @jalapic for tips).

The first graph will be about Tags. We’ll try to illustrate how and which tags co-occur within articles in the Psychonomic Society journals between 2004 and 2016.

90% of data analysis is data cleaning and manipulation, and network analysis is no exception. The first data manipulation step is to create an adjacency matrix in which each row and column is a tag, and the cell values are the number of times those tags co-occurred (i.e., were present in the same article). I found this Stack Overflow answer and this blog post helpful when preparing the data.

Preparing the adjacency matrix

In order to create the adjacency matrix we need to solve a sequence of smaller problems. First, we’ll take two columns from the data, one for Tags and another one that uniquely identifies each article (doesn’t matter what it is, here we use DOI). Then we unnest the Tags into single tags per row as detailed above, and remove empty tags.

m <- d %>% 
    select(Tags, DOI) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags))
m
## # A tibble: 18,949 x 2
##                          DOI                  Tags
##                        <chr>                 <chr>
##  1 10.3758/s13414-015-1049-4         3D perception
##  2 10.3758/s13414-015-1049-4 Perception and action
##  3 10.3758/s13414-015-1049-4      Scene perception
##  4 10.3758/s13414-015-1049-4      Space perception
##  5 10.3758/s13414-015-1042-y   Auditory perception
##  6 10.3758/s13414-015-1042-y Cognitive performance
##  7 10.3758/s13414-015-1042-y            Dissonance
##  8 10.3758/s13414-015-1042-y           Distraction
##  9 10.3758/s13414-015-1042-y         Inharmonicity
## 10 10.3758/s13414-015-1042-y                 Music
## # ... with 18,939 more rows

Next, I remove all tags that occurred only rarely in the data (here, eight times or fewer), because otherwise the data would be too large and messy.

m <- group_by(m, Tags) %>% 
    mutate(n = n()) %>% 
    filter(n > 8) %>% 
    ungroup() %>% 
    select(-n)

Next, we need a variable that identifies whether the tag occurred in the article or not:

m$occur <- 1
slice(m, 1:2)
## # A tibble: 2 x 3
##                         DOI                  Tags occur
##                       <chr>                 <chr> <dbl>
## 1 10.3758/s13414-015-1049-4 Perception and action     1
## 2 10.3758/s13414-015-1049-4      Scene perception     1

For these existing article–tag pairs, occur is always one. To create the matrix, however, we need all possible article–tag pairs, even ones that didn’t occur; for those, occur will be zero. The tidyr package in the tidyverse provides the perfect helper function for this purpose:

m <- m %>% 
    complete(DOI, Tags, fill = list(occur=0))
m
## # A tibble: 960,500 x 3
##                          DOI               Tags occur
##                        <chr>              <chr> <dbl>
##  1 10.1007/s13420-010-0001-7     3-D perception     0
##  2 10.1007/s13420-010-0001-7             Action     0
##  3 10.1007/s13420-010-0001-7         Adaptation     0
##  4 10.1007/s13420-010-0001-7             Affect     0
##  5 10.1007/s13420-010-0001-7 Age of acquisition     0
##  6 10.1007/s13420-010-0001-7              Aging     0
##  7 10.1007/s13420-010-0001-7           Amygdala     0
##  8 10.1007/s13420-010-0001-7   Animal cognition     0
##  9 10.1007/s13420-010-0001-7       Anticipation     0
## 10 10.1007/s13420-010-0001-7            Anxiety     0
## # ... with 960,490 more rows

Now we have a data frame with 960500 rows; each row signifying a possible article (DOI) - Tag co-occurrence. Most occurs will, of course, be zero.

Now we can create the co-occurrence matrix. We first cast the data into a wide matrix in which each row is a unique tag, each column is a DOI, and the cell values are occurrences.

library(reshape2)
y <- m %>% 
    acast(Tags~DOI, value.var="occur", fun.aggregate = sum)
y[1:2, 1:2]
##                10.1007/s13420-010-0001-7 10.3758/s13414-010-0002-9
## 3-D perception                         0                         0
## Action                                 0                         0
y[y>1] <- 1  # Remove duplicates

All we need to do then is to make a symmetric matrix where both rows and columns are Tags, and the cell values are numbers of co-occurrences. So we matrix multiply y with its transpose:

co <- y %*% t(y)
co[1:3, 1:3]
##                3-D perception Action Adaptation
## 3-D perception             11      0          1
## Action                      0     20          0
## Adaptation                  1      0         13

Visualizing the adjacency matrix

It’s important to understand what co is: it’s a symmetric matrix whose rows and columns are tags, and whose cell values represent the number of times the row and column tags co-occurred. For example, Action and 3-D perception never co-occurred in the data, but Adaptation and 3-D perception co-occurred once. We are now in an excellent position to investigate which tags tend to co-occur.

First, let’s use igraph to create the graph object g, and take out some unnecessary information from it.

library(igraph)
g <- graph.adjacency(co, weighted=TRUE, mode ='undirected')
g <- simplify(g)

g is now an igraph R object that contains edges and vertices, and information about them. For example, we can use V(g)$name to look at the vertex names. Some of these are quite long, so I’ll create another attribute for each vertex, called label, which is the name with each space replaced by a newline character. This will make some of the graphs prettier.

V(g)$label <- gsub(" ", "\n", V(g)$name)

Networks are very complicated visualizations, and if you simply call plot(g), you’ll get an ugly mess of a figure:

plot(g)

There are more parameters to the plot() command than can be listed here, but I tried out a few different options, and this is what I ended up with:

plot(g,
     layout = layout_on_sphere(g),
     edge.width = E(g)$weight/7,
     edge.color = adjustcolor("dodgerblue1", .7),
     vertex.color = adjustcolor("white", .8),
     vertex.frame.color=adjustcolor("dodgerblue4", .4),
     vertex.size=1.9,
     vertex.label.family = "Helvetica",
     vertex.label.font=1,
     vertex.label.cex=.7,
     vertex.label.color="black",
     vertex.label = ifelse(degree(g) > 25, V(g)$label, NA))

The Bayes Network

The next network will be fairly involved. Our goal is to visualize the prominence and connectedness of Psychonomic Society’s “Bayesians”. Who are these Bayesians, you might ask? Our definition here is simply: “An author who has published one or more articles in Psychonomic Journals (2004 - 2016) where the keywords contained the word ‘Bayes’.” This is obviously suboptimal because someone might have published an article called “Why I’m not a Bayesian”, yet they would be included in this graph, but we’ll go with this for now.

I’ll let the code speak for itself this time. If it is impossible to parse, I recommend downloading the data (see top of post) to your own computer and running the code yourself.

m <- d %>% 
    select(Authors, Tags, DOI) %>% 
    mutate(Authors = str_split(Authors, ", ")) %>% 
    unnest(Authors) %>% 
    filter(!is.na(Tags), !is.na(Authors)) %>% 
    filter(grepl("bayes", Tags, ignore.case = T))
m$occur <- 1
m <- m %>% 
    complete(DOI, Authors, fill = list(occur=0)) %>% 
    select(-Tags)
m
## # A tibble: 11,076 x 3
##                          DOI      Authors occur
##                        <chr>        <chr> <dbl>
##  1 10.3758/s13414-010-0073-7     Bååth R.     0
##  2 10.3758/s13414-010-0073-7    Bakker M.     0
##  3 10.3758/s13414-010-0073-7       Bar M.     0
##  4 10.3758/s13414-010-0073-7  Barker J.D.     0
##  5 10.3758/s13414-010-0073-7      Barr S.     0
##  6 10.3758/s13414-010-0073-7     Biele G.     0
##  7 10.3758/s13414-010-0073-7 Bittner J.L.     0
##  8 10.3758/s13414-010-0073-7  Borsboom D.     0
##  9 10.3758/s13414-010-0073-7    Bratch A.     0
## 10 10.3758/s13414-010-0073-7    Bröder A.     0
## # ... with 11,066 more rows

In these data, we have simply counted whether an author “occurred” in an article, and only included articles whose Tags contained the word “bayes” in any shape or form. This network will therefore visualize the connectedness of the “Bayesians”, as in who has published papers with whom.

We can then re-run pretty much the same steps as in the previous network.

y <- m %>% 
    acast(Authors~DOI, value.var="occur", fun.aggregate = sum)
co <- y %*% t(y)
co[1:3, 1:3]
##           Bååth R. Bakker M. Bar M.
## Bååth R.         1         0      0
## Bakker M.        0         1      0
## Bar M.           0         0      1
g <- graph.adjacency(co, weighted=TRUE, mode ='undirected')
g <- simplify(g)
V(g)$label <- gsub(" ", "\n", V(g)$name)
deg <- degree(g)
plot(g,
     layout = layout_with_kk(g),
     edge.width = E(g)$weight,
     edge.color = adjustcolor("dodgerblue1", .7),
     vertex.color = adjustcolor("white", .8),
     vertex.frame.color=adjustcolor("dodgerblue4", .4),
     vertex.size=1.9,
     vertex.label.family = "Helvetica",
     vertex.label.font=1,
     vertex.label.cex = sqrt(5+deg)/4,
     vertex.label.color = "black",
     main = "The Bayesians of Psychonomic Society")

More to come later on…

Quantitative literature review with R: Exploring Psychonomic Society Journals, Part I

Literature reviews, both casual and formal (or qualitative / quantitative), are an important part of research. In this tutorial, I’ll show how to use R to quantitatively explore, analyze, and visualize a research literature, using Psychonomic Society’s publications between 2005 and 2016.

Commonly, literature reviews are rather informal; you read a review paper or 10, maybe suggested to you by experts in the field, and then delve deeper into the topics presented in those papers. A more formal version starts with a database search, where you try out various search terms, and collect a more exhaustive list of publications related to your research questions. Here, we are going to have a bit of fun and explore a large heterogeneous literature (kind of) quantitatively, focusing on data manipulation (not the bad kind!) and visualization.

I used Scopus to search for all journal articles published in the Psychonomic Society’s journals (not including CR:PI because it’s so new) between the years 2005 and 2016 (inclusive). For a real literature review, you would search across journals, and probably across several databases, with specific search terms, but for a general illustration I wanted to focus on these publications, because the Psychonomic journals are closest to my research interests.

I limited the search to these years for two reasons: metadata on articles from before the early 2000s are not always very good, and if I had included everything, the database would simply have been too large. For the example here, the search returned about 7,500 journal articles.

Download data from Scopus

The Scopus search interface is very straightforward, so I won’t talk about how to use it. Some things to keep in mind, though, are that Scopus does not index all possible sources of citations, and therefore when we discuss citations (and other similar info) below, we’ll only discuss citations that Scopus knows about. For example, Google Scholar often indexes more sources, but it doesn’t have the same search tools as Scopus. Scopus allows you to download the data in various formats, and I chose .csv files, which I have stored on my computer. When you download raw data from Scopus, make sure that you organize the files well and that they are available to the R session.
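
As a quick sanity check, something along these lines will confirm that R can see the downloaded files (the path is just an example; use whatever folder you saved the exports to):

# List the Scopus .csv exports visible to this R session
list.files("data/scopus", pattern = "\\.csv$")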

You can also download the raw or cleaned up data from my website (see below), if you’d like to work on the same data set, or replicate the code below.

Load and clean raw data from Scopus

Let’s first load all the relevant R packages that we’ll use to do our work for us:

library(tidyverse)
library(stringr)
library(tools)

Load the data from .csv files

When I first downloaded the files, I put them in a subdirectory of my project (my website). I downloaded the data such that one file contains the data of one journal. (The raw journal-specific .csv files are available at https://github.com/mvuorre/mvuorre.github.io/tree/master/data/scopus, but I recommend you download the combined R data file instead because it’s easier, see below.)

The first thing is to combine these file names to a list of file names:

fl <- list.files("static/data/scopus/", pattern=".csv", full.names = T)

Then we can apply a function over the list of file names that reads all the files to an R object.

d <- lapply(fl, 
            read_csv, 
            col_types = cols(`Page start` = col_character(),
                             `Page end` = col_character(),
                             `Art. No.` = col_character()))

lapply() applies the read_csv() function to the file list fl, and the col_types = cols() argument ensures that those columns are read as character variables (the original data are a little messy, and this guarantees that all the files are read with identical column types). Next, we’ll turn the list of data frames d into a single data frame by binding them together row-wise (think stacking spreadsheets vertically on top of each other). Let’s also take a quick look at the result with glimpse():

d <- bind_rows(d)
glimpse(d)
## Observations: 7,697
## Variables: 25
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Source title              <chr> "Attention, Perception, and Psychoph...
## $ Volume                    <int> 78, 78, 78, 78, 78, 78, 78, 78, 78, ...
## $ Issue                     <int> 3, 3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, ...
## $ Art. No.                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Page start                <chr> "915", "946", "848", "938", "902", "...
## $ Page end                  <chr> "922", "959", "867", NA, "914", "735...
## $ Page count                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Cited by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Link                      <chr> "https://www.scopus.com/inward/recor...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors with affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author Keywords           <chr> "3D perception; Perception and actio...
## $ Index Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence Address    <chr> "Button, C.; School of Physical Educ...
## $ Publisher                 <chr> "Springer New York LLC", "Springer N...
## $ Abbreviated Source Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document Type             <chr> "Article", "Article", "Article", "Ar...
## $ Source                    <chr> "Scopus", "Scopus", "Scopus", "Scopu...
## $ EID                       <chr> "2-s2.0-84951928856", "2-s2.0-849519...

OK, looks pretty clean already! (I also tried to get data from ISI web of science, or whatever it’s now called, but the raw data was so messy that I’d have to get paid to even look at it.)

The combined raw data file is available as an R data file at https://mvuorre.github.io/data/scopus/scopus-psychonomics-combined-2004-2016.rda, and can be easily loaded to the R workspace with:

load(url("https://mvuorre.github.io/data/scopus/scopus-psychonomics-combined-2004-2016.rda"))

Clean the data

OK, now that you’ve loaded up the combined data file from my website into your R session with the above command, you can get to work cleaning the data. If you’d like to skip the data cleaning steps, scroll down to “Load cleaned data”.

Select relevant variables

There are a couple of columns (variables) in the data that I’m just going to drop because I’m not interested in them at all. When you export data from Scopus, you can choose which variables to include, and it looks like I accidentally included some stuff that I don’t care about. I’ll use the select() function, and prepend each variable I want to drop with a minus sign. Variable names with spaces or strange characters are wrapped in backticks.

I think it’s always good to drop unnecessary variables to reduce clutter. Speaking of cognitive psychology, it’s probably less cognitively demanding to work with a data frame of 5 variables than with one of 50.

d <- select(d,
            -Volume,
            -Issue,
            -`Art. No.`,
            -`Page count`,
            -Link,
            -Publisher,
            -Source,
            -EID)

I dropped page count because it is empty for most cells, and I actually need to re-compute it from the start and end variables, later on. Next, we’re going to go through the columns, but leave the more complicated columns (such as Authors) for later. What do I mean by “complicated”? Some variables are not individual values, and their desired final shape depends on how you want to use them. Much later on, when we get to network and graph analyses, you’ll see that we want to “unlist” the Keywords columns, for example. That’s why we’ll do the cleaning and data manipulation closer to the actual analysis. But for “univariate” columns, we can get started right away.

Numerical columns: Year, pages, citations

Year is already an integer, and has no missing values, so we don’t need to do anything about it. We can, however, make the page number information numerical and combine it into one column.
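
If you want to verify the no-missing-values claim yourself, a one-liner along these lines does it:

sum(is.na(d$Year))  # Should be 0 if Year really has no missing values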

glimpse(d)
## Observations: 7,697
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Source title              <chr> "Attention, Perception, and Psychoph...
## $ Page start                <chr> "915", "946", "848", "938", "902", "...
## $ Page end                  <chr> "922", "959", "867", NA, "914", "735...
## $ Cited by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors with affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author Keywords           <chr> "3D perception; Perception and actio...
## $ Index Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated Source Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document Type             <chr> "Article", "Article", "Article", "Ar...
d$Pages <- as.integer(d$`Page end`) - as.integer(d$`Page start`)

How do we know that the Pages variable is now correct? We can’t just trust the arithmetic, because the underlying data values may themselves be wrong. It is good practice to check the data after each operation, and there are many ways of looking at it. I like to arrange the data (either descending or ascending) on the newly created variable to reveal any weird values. Here’s how to arrange the data on the Pages variable in descending order:

glimpse(arrange(d, desc(Pages)))
## Observations: 7,697
## Variables: 18
## $ Authors                   <chr> "Nishiyama R.", "Harrigan J., Rosent...
## $ Title                     <chr> "Active maintenance of semantic repr...
## $ Year                      <int> 2014, 2008, 2013, 2014, 2014, 2014, ...
## $ Source title              <chr> "Psychonomic Bulletin and Review", "...
## $ Page start                <chr> "583", "1", "1", "1", "1", "1", "1",...
## $ Page end                  <chr> "1589", "536", "497", "442", "371", ...
## $ Cited by                  <int> 1, 59, NA, NA, NA, NA, 79, 47, NA, N...
## $ DOI                       <chr> "10.3758/s13423-014-0618-1", "10.109...
## $ Affiliations              <chr> "Department of Psychological Science...
## $ Authors with affiliations <chr> "Nishiyama, R., Department of Psycho...
## $ Abstract                  <chr> "In research on verbal working memor...
## $ Author Keywords           <chr> "Active maintenance; Semantic repres...
## $ Index Keywords            <chr> "adult; attention; female; human; ma...
## $ References                <chr> "Allen, C.M., Martin, R.C., Martin, ...
## $ Correspondence Address    <chr> "Nishiyama, R.; Department of Psycho...
## $ Abbreviated Source Title  <chr> "Psychonom. Bull. Rev.", "The New Ha...
## $ Document Type             <chr> "Article", "Book", "Book", "Article"...
## $ Pages                     <int> 1006, 535, 496, 441, 370, 341, 325, ...

That code is quite difficult to read, however, because it contains nested function calls: arrange() is inside glimpse(). This is where we first really encounter R’s tidyverse philosophy: instead of nesting, we can pipe results from one function to the next. To “pipe” the above code, we’d write it this way instead:

d %>% 
    arrange(desc(Pages)) %>% 
    glimpse()
## Observations: 7,697
## Variables: 18
## $ Authors                   <chr> "Nishiyama R.", "Harrigan J., Rosent...
## $ Title                     <chr> "Active maintenance of semantic repr...
## $ Year                      <int> 2014, 2008, 2013, 2014, 2014, 2014, ...
## $ Source title              <chr> "Psychonomic Bulletin and Review", "...
## $ Page start                <chr> "583", "1", "1", "1", "1", "1", "1",...
## $ Page end                  <chr> "1589", "536", "497", "442", "371", ...
## $ Cited by                  <int> 1, 59, NA, NA, NA, NA, 79, 47, NA, N...
## $ DOI                       <chr> "10.3758/s13423-014-0618-1", "10.109...
## $ Affiliations              <chr> "Department of Psychological Science...
## $ Authors with affiliations <chr> "Nishiyama, R., Department of Psycho...
## $ Abstract                  <chr> "In research on verbal working memor...
## $ Author Keywords           <chr> "Active maintenance; Semantic repres...
## $ Index Keywords            <chr> "adult; attention; female; human; ma...
## $ References                <chr> "Allen, C.M., Martin, R.C., Martin, ...
## $ Correspondence Address    <chr> "Nishiyama, R.; Department of Psycho...
## $ Abbreviated Source Title  <chr> "Psychonom. Bull. Rev.", "The New Ha...
## $ Document Type             <chr> "Article", "Book", "Book", "Article"...
## $ Pages                     <int> 1006, 535, 496, 441, 370, 341, 325, ...

Read the above code as “Take d, then pass the result to arrange(), then pass the result to glimpse()”. Each %>% is a “pipe”: it takes the result of whatever is computed on its left-hand side and passes it to its right-hand side (or to the line below; it’s really helpful to use line breaks and put each statement on its own line). I’ll be using pipes extensively in the code below. I also won’t inspect the data after every change, to avoid printing boring output all the time (although you should when working with your own data).

So, back to Pages. Looking at the data, there are some strange values, with some items spanning more than 1000 pages. For now we’ll leave those values in, but looking closely at the Source title column, we can see that there are publications in the data that shouldn’t be there. I guess I was a little too lazy with my initial Scopus search terms, but we can easily fix that problem here in R.

Publications

First, let’s list all the unique publications in the data:

publications <- unique(d$`Source title`)
publications
##  [1] "Attention, Perception, and Psychophysics"                                                
##  [2] "Attention, perception & psychophysics"                                                   
##  [3] "Attention, Perception, & Psychophysics"                                                  
##  [4] "Behavior Research Methods"                                                               
##  [5] "Behavior research methods"                                                               
##  [6] "Modern Research Methods for The Study of Behavior in Organizations"                      
##  [7] "Modern Research Methods for the Study of Behavior in Organizations"                      
##  [8] "The New Handbook of Methods in Nonverbal Behavior Research"                              
##  [9] "Cognitive, Affective and Behavioral Neuroscience"                                        
## [10] "Cognitive, affective & behavioral neuroscience"                                          
## [11] "Cognitive, Affective, & Behavioral Neuroscience"                                         
## [12] "Learning and Behavior"                                                                   
## [13] "Memory and Cognition"                                                                    
## [14] "Memory & cognition"                                                                      
## [15] "Exploring Implicit Cognition: Learning, Memory, and Social Cognitive Processes"          
## [16] "Memory & Cognition"                                                                      
## [17] "Imagery, Memory and Cognition: Essays in Honor of Allan Paivio"                          
## [18] "Imagery, Memory and Cognition (PLE: Memory):Essays in Honor of Allan Paivio"             
## [19] "Working Memory and Human Cognition"                                                      
## [20] "Grounding Cognition: The Role of Perception and Action in Memory, Language, and Thinking"
## [21] "Psychonomic Bulletin and Review"                                                         
## [22] "Psychonomic bulletin & review"                                                           
## [23] "Psychonomic Bulletin & Review"

Let’s remove the non-Psychonomic ones. I’ll just subset the list of publications to the ones I want to keep, and then filter() the data such that Source title matches one of the correct journals.

psychonomic_publications <- publications[c(1:5, 9:14, 16, 21:23)]
d <- filter(d, `Source title` %in% psychonomic_publications)
unique(d$`Source title`)
##  [1] "Attention, Perception, and Psychophysics"        
##  [2] "Attention, perception & psychophysics"           
##  [3] "Attention, Perception, & Psychophysics"          
##  [4] "Behavior Research Methods"                       
##  [5] "Behavior research methods"                       
##  [6] "Cognitive, Affective and Behavioral Neuroscience"
##  [7] "Cognitive, affective & behavioral neuroscience"  
##  [8] "Cognitive, Affective, & Behavioral Neuroscience" 
##  [9] "Learning and Behavior"                           
## [10] "Memory and Cognition"                            
## [11] "Memory & cognition"                              
## [12] "Memory & Cognition"                              
## [13] "Psychonomic Bulletin and Review"                 
## [14] "Psychonomic bulletin & review"                   
## [15] "Psychonomic Bulletin & Review"

Much better. Next, we’ll have to make sure that there’s only one unique name per journal. To do this, we’ll use simple text manipulation tools to change all “and”s to “&”s, and convert the names to Title Case. Before applying these operations to the actual data, I’ll work on just the unique names R object psychonomic_publications. This makes it easier to troubleshoot the operations.

psychonomic_publications <- psychonomic_publications %>% 
    str_replace_all("and", "&") %>%  # Replace all "and"s with "&"s
    toTitleCase()  # Make everything Title Case
unique(psychonomic_publications)
## [1] "Attention, Perception, & Psychophysics"         
## [2] "Attention, Perception & Psychophysics"          
## [3] "Behavior Research Methods"                      
## [4] "Cognitive, Affective & Behavioral Neuroscience" 
## [5] "Cognitive, Affective, & Behavioral Neuroscience"
## [6] "Learning & Behavior"                            
## [7] "Memory & Cognition"                             
## [8] "Psychonomic Bulletin & Review"

Still not quite correct. For two journals, there are two versions: one where “&” is preceded by a comma, and one where it isn’t. I’ll take the commas out.

psychonomic_publications %>% 
    str_replace_all(", &", " &") %>% 
    unique()
## [1] "Attention, Perception & Psychophysics"         
## [2] "Behavior Research Methods"                     
## [3] "Cognitive, Affective & Behavioral Neuroscience"
## [4] "Learning & Behavior"                           
## [5] "Memory & Cognition"                            
## [6] "Psychonomic Bulletin & Review"

Great, now I’ll just apply those operations to the actual data.

d <- d %>% 
    mutate(Publication = str_replace_all(`Source title`, "and", "&"),
           Publication = toTitleCase(Publication),
           Publication = str_replace_all(Publication, ", &", " &"))

Here, I used the mutate() function, which works within a data frame. I passed d to mutate() with the pipe operator, and could then refer to variables without prepending them with d$. I called the product variable Publication because it’s easier to write than Source title.

Lastly, we’ll do the same for the abbreviated publication names. I don’t know what the official abbreviations are, but I can guess that the majority of the articles have it correct. The first step is then to count the occurrences of each unique abbreviation. To do this, I group the data by (group_by()) the abbreviation, and then count():

abbs <- group_by(d, `Abbreviated Source Title`) %>% 
    count()
abbs
## # A tibble: 11 x 2
##          `Abbreviated Source Title`     n
##                               <chr> <int>
##  1         Atten Percept Psychophys    26
##  2      Atten. Percept. Psychophys.  1558
##  3                Behav Res Methods    38
##  4              Behav. Res. Methods  1385
##  5       Cogn Affect Behav Neurosci     7
##  6 Cogn. Affective Behav. Neurosci.   691
##  7                    Learn. Behav.   455
##  8                       Mem Cognit    13
##  9                     Mem. Cognit.  1441
## 10                 Psychon Bull Rev     3
## 11            Psychonom. Bull. Rev.  1997

Then we just need to replace all the bad ones with the good ones. Unfortunately this has to be done manually, because computers aren’t that smart. The following is ugly, but it works: I first take just the abbreviations into a variable, then replace the bad ones with the good ones inside the data frame (working with an abbreviated variable name).

abbrs <- abbs[[1]]
d <- mutate(d,
       pa = `Abbreviated Source Title`,
       pa = ifelse(pa == abbrs[1], abbrs[2], pa),
       pa = ifelse(pa == abbrs[3], abbrs[4], pa),
       pa = ifelse(pa == abbrs[5], abbrs[6], pa),
       pa = ifelse(pa == abbrs[8], abbrs[9], pa),
       pa = ifelse(pa == abbrs[10], abbrs[11], pa),
       Pub_abbr = pa)
unique(d$Pub_abbr)
## [1] "Atten. Percept. Psychophys."      "Behav. Res. Methods"             
## [3] "Mem. Cognit."                     "Cogn. Affective Behav. Neurosci."
## [5] "Learn. Behav."                    "Psychonom. Bull. Rev."
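
As an aside, the repeated ifelse() calls could be replaced with a named lookup vector and dplyr’s recode(); a sketch of that alternative, using the abbreviations printed above, might look like this:

# Map each non-standard abbreviation to its dotted counterpart
abbr_map <- c("Atten Percept Psychophys"   = "Atten. Percept. Psychophys.",
              "Behav Res Methods"          = "Behav. Res. Methods",
              "Cogn Affect Behav Neurosci" = "Cogn. Affective Behav. Neurosci.",
              "Mem Cognit"                 = "Mem. Cognit.",
              "Psychon Bull Rev"           = "Psychonom. Bull. Rev.")
d <- mutate(d, Pub_abbr = recode(`Abbreviated Source Title`, !!!abbr_map))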

OK, we’re pretty much done with cleaning for now. The last thing to do is to remove the unnecessary variables we created while cleaning the data:

d <- select(d, 
            -`Source title`,
            -`Page start`, -`Page end`,
            -pa)

I don’t like spaces, so I’ll also take them out of the variable names:

names(d) <- str_replace_all(names(d), " ", "_")

Leaving us with this to work with:

glimpse(d)
## Observations: 7,614
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Cited_by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors_with_affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author_Keywords           <chr> "3D perception; Perception and actio...
## $ Index_Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence_Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated_Source_Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document_Type             <chr> "Article", "Article", "Article", "Ar...
## $ Pages                     <int> 7, 13, 19, NA, 12, 12, 32, 19, 6, 22...
## $ Publication               <chr> "Attention, Perception & Psychophysi...
## $ Pub_abbr                  <chr> "Atten. Percept. Psychophys.", "Atte...

Load cleaned data

The clean data file can be loaded from https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda.

load(url("https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda"))

Univariate figures

This data is very rich, but for this initial exploration, we’ll be focusing on simple uni- and bi-variate figures. In later parts of this tutorial (future blog posts), we’ll start visualizing more complex data, such as keywords, authors, and networks.

Let’s begin with some simple summaries of what’s been going on, over time, for each journal. We’ll make extensive use of the dplyr data manipulation verbs in the following plots. Take a look at the linked website if they are unfamiliar to you; although I will explain more complicated cases, I won’t bother with every detail.

Publication years

First, we’re interested in a simple question: How many articles has each journal published in each year? Are there temporal patterns, and do they differ between journals? The steps for creating this plot are commented in the code below. Roughly, in order of appearance, we first add grouping information to the data frame, then summarise the data on those groups, create a new column for a publication-level summary, and then order the publications on it. We then pass the results to ggplot2 and draw a line and point graph.

d %>% 
    group_by(Publication, Year) %>% 
    summarise(n = n()) %>%  # For each group (Publication, Year), count nobs
    mutate(nsum = sum(n)) %>%  # For each Publication, sum of nobs {1}
    ungroup() %>%  # Ungroup data frame
    mutate(Publication = reorder(Publication, nsum)) %>%  # {2}
    ggplot(aes(x=Year, y=n)) +
    geom_point() +
    geom_line() +
    labs(y = "Number of articles published") +
    facet_wrap("Publication", ncol=2)

The code in {1} works because the summarise() call on the line above ungroups the last grouping factor assigned by group_by(). And since I called mutate() instead of summarise(), the sum of the number of observations was added to each row, instead of collapsing the data by group. {2} made the Publication variable into a factor ordered by each journal’s total number of articles across years; this ordering is useful because the facet_wrap() panels are then nicely ordered (journals with fewer total papers in the upper left, increasing toward the bottom right).

If this code doesn’t make any sense, I strongly recommend loading the data into your own R workspace, and executing the code line by line.

I’m a little surprised at the relative paucity of articles in Learning & Behavior, and it looks like there might be some upward trajectory going on in PBR. Should I run some regressions? Probably not.

Citations

Let’s look at numbers of citations next. For these, simple histograms will do, and we expect to see some heavy tails. I will pre-emptively clip the x-axis at 100 citations, because there’s probably a few rogue publications with a truly impressive citation count, which might make the other, more common, values less visible.

d %>% 
    ggplot(aes(Cited_by)) +
    geom_histogram(binwidth=1, col="black") +
    coord_cartesian(xlim=c(0,100)) +
    facet_wrap("Publication", ncol=2)

Nice. But how many papers are there with more than 100 citations? What’s the proportion of those highly cited papers?

d %>%
    group_by(Publication) %>%
    mutate(total = n()) %>% 
    filter(Cited_by > 100) %>% 
    summarise(total = unique(total),
              over_100 = n(),
              ratio = signif(over_100/total, 1)) %>% 
    arrange(total)
## # A tibble: 6 x 4
##                                      Publication total over_100 ratio
##                                            <chr> <int>    <int> <dbl>
## 1                            Learning & Behavior   455        1 0.002
## 2 Cognitive, Affective & Behavioral Neuroscience   698       32 0.050
## 3                      Behavior Research Methods  1426       53 0.040
## 4                             Memory & Cognition  1452       17 0.010
## 5          Attention, Perception & Psychophysics  1583        8 0.005
## 6                  Psychonomic Bulletin & Review  2000       54 0.030

Excellent. We have now reviewed some of the powerful dplyr verbs and are ready for slightly more complex summaries of the data.

Bivariate figures

Publication year and citations

We can visualize citation counts over years of publication, but this might not be the most informative graph (recent articles naturally can’t have been cited as often as older ones). Here, I’ll show how to plot neat box plots (without outliers) with ggplot2:

d %>% 
    filter(Cited_by < 100) %>% 
    ggplot(aes(Year, Cited_by)) +
    geom_boxplot(aes(group = Year), outlier.color = NA) +
    facet_wrap("Publication", ncol=2, scales = "free_y")

Pages and citations

The last figure for today is a scatterplot of pages (how many pages the article spans) versus number of citations. For scatterplots I usually prefer empty points (shape=1).

d %>% 
    filter(Pages < 100, Cited_by < 100) %>% 
    ggplot(aes(Pages, Cited_by)) +
    geom_point(shape=1) +
    facet_wrap("Publication", ncol=2)

Summary

We have now scratched the surface of these data, and there is clearly a lot to explore:

glimpse(d)
## Observations: 7,614
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Cited_by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors_with_affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author_Keywords           <chr> "3D perception; Perception and actio...
## $ Index_Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence_Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated_Source_Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document_Type             <chr> "Article", "Article", "Article", "Ar...
## $ Pages                     <int> 7, 13, 19, NA, 12, 12, 32, 19, 6, 22...
## $ Publication               <chr> "Attention, Perception & Psychophysi...
## $ Pub_abbr                  <chr> "Atten. Percept. Psychophys.", "Atte...

In next posts on this data, we’ll look at networks and other more advanced topics. Stay tuned.

Better forest plots from meta-analytic models estimated with brms

Hi all! After our previous discussion about how to estimate meta-analytic models with the brilliant brms R package, a few people asked me for the code to produce the forest plots. Here I’ll present a much better version of a function to produce forest plots from meta-analytic models estimated with brms. The function is implemented in ggplot2, and it is included in my vmisc package. To get the function, simply install vmisc from GitHub:

devtools::install_github("mvuorre/vmisc")

The function to draw forest plots from meta-analytic models estimated with brms is called brms_forest(), and you can learn more about it with ?brms_forest. The most important thing to know is that it requires that the model is estimated with brms with this exact formula:

yi | se(sei) ~ 1 + (1|study)

Where yi are the study-specific effect sizes (mean difference between groups, for example), sei are the study-specific standard errors, and study are unique integer labels for each study in the data.

Example data

We’ll illustrate the plot using example data from the metafor package. These data are “Results from 48 studies on the effectiveness of school-based writing-to-learn interventions on academic achievement.”

“In each of the studies included in this meta-analysis, an experimental group (i.e., a group of students that received instruction with increased emphasis on writing tasks) was compared against a control group (i.e., a group of students that received conventional instruction) with respect to some content-related measure of academic achievement (e.g., final grade, an exam/quiz/test score). The effect size measure for this meta-analysis was the standardized mean difference (with positive scores indicating a higher mean level of academic achievement in the intervention group).” (From the metafor help page ?dat.bangertdrowns2004.)

library(metafor)
head(dat.bangertdrowns2004)
##   id   author year grade length minutes wic feedback info pers imag meta
## 1  1 Ashworth 1992     4     15      NA   1        1    1    1    0    1
## 2  2    Ayers 1993     2     10      NA   1       NA    1    1    1    0
## 3  3   Baisch 1990     2      2      NA   1        0    1    1    0    1
## 4  4    Baker 1994     4      9      10   1        1    1    0    0    0
## 5  5   Bauman 1992     1     14      10   1        1    1    1    0    1
## 6  6   Becker 1996     4      1      20   1        0    0    1    0    0
##         subject  ni     yi    vi
## 1       Nursing  60  0.650 0.070
## 2 Earth Science  34 -0.750 0.126
## 3          Math  95 -0.210 0.042
## 4       Algebra 209 -0.040 0.019
## 5          Math 182  0.230 0.022
## 6    Literature 462  0.030 0.009

We’ll only need a few of the columns, with specific names, so in the following we’ll just select the relevant variables, rename id to study, and create labels for the plot by pasting together each study’s author and year columns. I’ll also subset the data to the first 15 studies, because the original data set has 48 studies and that would make the plot very large (which is fine, but it’s simpler to start small).

library(dplyr)
d <- dat.bangertdrowns2004 %>%
    mutate(label = paste0(author, " (", year, ")"), sei = sqrt(vi)) %>%
    select(study = id, label, yi, sei) %>% 
    slice(1:15)
d
##    study              label    yi        sei
## 1      1    Ashworth (1992)  0.65 0.26457513
## 2      2       Ayers (1993) -0.75 0.35496479
## 3      3      Baisch (1990) -0.21 0.20493902
## 4      4       Baker (1994) -0.04 0.13784049
## 5      5      Bauman (1992)  0.23 0.14832397
## 6      6      Becker (1996)  0.03 0.09486833
## 7      7 Bell & Bell (1985)  0.26 0.32557641
## 8      8     Brodney (1994)  0.06 0.08366600
## 9      9      Burton (1986)  0.06 0.20000000
## 10    10   Davis, BH (1990)  0.12 0.22803509
## 11    11   Davis, JJ (1996)  0.77 0.32710854
## 12    12         Day (1994)  0.00 0.14491377
## 13    13     Dipillo (1994)  0.52 0.19235384
## 14    14     Ganguli (1989)  0.54 0.28809721
## 15    15  Giovinazzo (1996)  0.20 0.29325757

Fit meta-analytic model with brms

Fitting the meta-analytic model is easy with brms! The formula specifies the study-specific effect size and standard error, an overall intercept (1), and the “random studies” ((1|study)). I’ll use four cores for speed and increase the adapt_delta parameter to avoid divergent transitions.

library(brms)
mod <- brm(yi | se(sei) ~ 1 + (1|study), data = d, 
           cores = 4, control= list(adapt_delta=.99) )

The model summary shows a 95% CI for the average effect hugging 0, and reasonable between-study heterogeneity (sd(Intercept)):

summary(mod)
##  Family: gaussian(identity) 
## Formula: yi | se(sei) ~ 1 + (1 | study) 
##    Data: d (Number of observations: 15) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
##  
## Group-Level Effects: 
## ~study (Number of levels: 15) 
##               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept)     0.17       0.1     0.01      0.4        850    1
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept     0.13      0.07    -0.01     0.29        999    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

Draw forest plots with brms_forest()

The user only needs to enter a data frame and a brms model:

library(vmisc)
brms_forest(data = d, model = mod)

The forest plot shows, on the left, the labels provided in d. On the very right are the effect sizes and the limits of the Credible Intervals. The CI limits are 95% by default, but users can control this by passing the argument level = .80 for 80% CIs, for example.

In the middle are the posterior distributions of the estimated effect sizes as grey densities. The black circle indicates the posterior mean, and the arms extending from the point are the CI defined by level (here, 95% CI). The bottom row, ME, is the meta-analytic estimate.
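
For example, to draw 80% intervals instead, using the same data and model objects as above:

brms_forest(data = d, model = mod, level = .80)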

Arguments

brms_forest() has several arguments that affect the resulting plot; see ?brms_forest for details. For example:

brms_forest(
    data = d, 
    model = mod, 
    show_data = TRUE,  # Shows data means and SEs
    sort_estimates = TRUE,  # Sorts estimates based on their magnitude
    dens_fill = "dodgerblue")  # Fill densities with blue

Here, the plot also shows the means (black crosses) and 2 SEs (thin black lines) from the data, slightly below each estimate. This plot nicely shows how the random effects model shrinks the estimates toward the group mean, especially for studies that had wide SEs to begin with.

Customizing the plot

brms_forest() returns a ggplot2 object, which can be customized with regular ggplot2 functions (themes, scales, …). Here, we’ll set the x-axis limits manually (we have to modify the y scale, because the axes are flipped internally), replace the default “mean” axis label, and use another custom theme:

library(ggplot2)
myplot <- brms_forest(data = d, 
                      model = mod, 
                      sort_estimates = TRUE)
myplot + 
    scale_y_continuous("Standardized ES", limits = c(-1.2, 1.2)) +
    theme_blog()

Credits

The plot uses a one-sided violin plot which is not included in the default ggplot2 distribution. This code was written by David Robinson https://github.com/dgrtwo: https://gist.github.com/dgrtwo/eb7750e74997891d7c20, and Ben Marwick https://github.com/benmarwick: https://gist.github.com/benmarwick/2a1bb0133ff568cbe28d.

Conclusion

Look at how non-normal those posterior densities are! In this case, the posterior mean is quite a misleading point estimate, and showing the entire posterior distribution is a much better idea.

Hope you’ll find this function useful. If you’d like it to be improved, let me know. Thanks and have a good day!

How to create within-subject scatter plots in R with ggplot2

Today, we’ll take a look at creating a specific type of visualization for data from a within-subjects experiment (also known as repeated measures, though that label can sometimes be misleading). You’ll often see within-subject data visualized as bar graphs (condition means, and maybe the mean difference if you’re lucky). But alternatives exist, and today we’ll take a look at within-subject scatterplots.

For example, Ganis and Kievit (2015) asked 54 people to observe, on each trial, two 3-D shapes with various rotations and judge whether the two shapes were the same or not.

There were 4 angles (0, 50, 100, and 150 degree rotations), but for simplicity we’ll only look at items that were not rotated with respect to each other, and items rotated 50 degrees. The data are freely available (thanks!!) in Excel format, but to make them more easily accessible for readers, I’ve uploaded them as a .csv file, which we can load directly into R:

d <- read_csv("https://mvuorre.github.io/data/ganis-kievit-2015.csv")

First, let’s clean the data a little bit by selecting and renaming a subset of the variables, and then take only the trials with 0 or 50 degrees rotation.

d <- transmute(
    d,
    id = Subject,
    angle = angle,
    correct = correct.incorrect,
    rt = Time
) %>%
    filter(angle < 51)
d
## # A tibble: 2,592 x 4
##       id angle correct    rt
##    <int> <int>   <int> <int>
##  1     1     0       1  1355
##  2     1    50       1  1685
##  3     1    50       1  1237
##  4     1     0       1  1275
##  5     1    50       1  2238
##  6     1     0       1  1524
##  7     1     0       1   964
##  8     1    50       1  1226
##  9     1    50       1  2548
## 10     1    50       1  1588
## # ... with 2,582 more rows

We’ll focus on comparing the reaction times between the 0 degree and 50 degree rotation trials. We predict that people will take longer to respond when the items are rotated, and that this effect will be robust across people.

Subject means

Bar graph

For the first graph, we’ll only need the subject’s means in each condition.

subject_means <- group_by(d, id, angle) %>%
    summarize(rt = mean(rt, na.rm = T))
subject_means
## Source: local data frame [108 x 3]
## Groups: id [?]
## 
## # A tibble: 108 x 3
##       id angle    rt
##    <int> <int> <dbl>
##  1     1     0  1512
##  2     1    50  2039
##  3     2     0  1251
##  4     2    50  1768
##  5     3     0  1602
##  6     3    50  1862
##  7     4     0  1501
##  8     4    50  2023
##  9     5     0  2170
## 10     5    50  2512
## # ... with 98 more rows

The first plot is a simple bar graph showing the condition means, with every subject drawn as a point. Note that the mean is visualized as a bar, using stat_summary() with geom = "bar".

barplot <- ggplot(subject_means, aes(x = angle, y = rt)) +
    stat_summary(
        geom = "bar",
        fun.y = "mean",
        col = "black",
        fill = "gray70"
    ) +
    geom_point(position = position_jitter(h = 0, w = 5)) +
    scale_y_continuous(limits = c(0, max(d$rt, na.rm = T)),
                       expand = c(0, 0))
barplot

This figure shows quite clearly that the mean reaction time in the 50 degree condition was higher than in the 0 degree condition, as well as the spread across individuals in each condition. However, we are often specifically interested in the within-subject effect of condition, which is difficult to see in this figure. We could draw lines to connect each point, and the effect would then be visible as a “spaghetti plot”, but while useful, such plots can be a little overwhelming when there are many people (spaghetti is great, but nobody likes too much of it!)
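
If you’re curious, here’s a minimal sketch of that spaghetti alternative (my addition, not a figure from this post), connecting each subject’s two condition means with a line:

# A sketch only: one line per subject connecting their 0 and 50 degree condition means
ggplot(subject_means, aes(x = angle, y = rt, group = id)) +
    geom_line(alpha = .4) +
    geom_point()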

Within-subject scatterplot

To draw a within-subjects scatterplot, we’ll need a slight reorganization of the data, such that it is in wide format with respect to the conditions:

subject_means
## Source: local data frame [108 x 3]
## Groups: id [?]
## 
## # A tibble: 108 x 3
##       id angle    rt
##    <int> <int> <dbl>
##  1     1     0  1512
##  2     1    50  2039
##  3     2     0  1251
##  4     2    50  1768
##  5     3     0  1602
##  6     3    50  1862
##  7     4     0  1501
##  8     4    50  2023
##  9     5     0  2170
## 10     5    50  2512
## # ... with 98 more rows
subject_means_wide <-
    spread(subject_means,
           key = angle,
           value = rt,
           sep = "_")
subject_means_wide
## Source: local data frame [54 x 3]
## Groups: id [54]
## 
## # A tibble: 54 x 3
##       id angle_0 angle_50
##  * <int>   <dbl>    <dbl>
##  1     1    1512     2039
##  2     2    1251     1768
##  3     3    1602     1862
##  4     4    1501     2023
##  5     5    2170     2512
##  6     6    1302     1382
##  7     7    2212     3014
##  8     8    1452     1824
##  9     9    2012     2501
## 10    10    1939     3058
## # ... with 44 more rows

Then we can simply map the per-subject angle-means to the X and Y axes:

ggplot(subject_means_wide, aes(x = angle_0, y = angle_50)) +
    geom_point()

But this graph needs a couple of fixes to be maximally informative. We need to:

  • Make the aspect ratio 1
  • Force the axes to be identically scaled (note the use of min() and max() to show the plots on the scale of the data)
  • Add an identity (diagonal) line
  • Modify the axis labels

lims <- c(min(d$rt, na.rm = T), max(d$rt, na.rm = T))
wsplot <-
    ggplot(subject_means_wide, aes(x = angle_0, y = angle_50)) +
    geom_point() +
    geom_abline() +
    scale_x_continuous("0 degrees", limits = lims) +
    scale_y_continuous("50 degrees", limits = lims) +
    theme(aspect.ratio = 1)
wsplot

Great! This plot shows each person’s mean as a point, and the difference between conditions can be seen directly as the distance of each point from the diagonal line. Points above the diagonal indicate that the person’s mean RT was greater in the 50 degree condition. All of the points lie above the identity line, indicating that the effect was in the predicted direction and robust across individuals.

This is a very useful diagnostic plot that simultaneously shows the population- (or group-) level trend (are the points, on average, above or below the identity line?) and every person’s mean (roughly, how far apart are the points from each other?). The points are naturally connected by their location, unlike in a bar graph, where they would have to be connected by lines. It’s an informative graph, and it’s certainly very easy to make in R with ggplot2. I also find it visually convincing: it doesn’t lead one to focus unduly on the group means alone, and I am both convinced and informed by it.

Within-subject scatterplot with SEs

Well, we didn’t measure everybody repeatedly for nothing. We know more than their means; we can use the spread of the individual-level scores to calculate, say, an SE for everybody and add it to the graph.

subject_summaries <- group_by(d, id, angle) %>%
    summarize(mean = mean(rt, na.rm = T),
              se = sd(rt, na.rm = T) / sqrt(n()))
subject_summaries
## Source: local data frame [108 x 4]
## Groups: id [?]
## 
## # A tibble: 108 x 4
##       id angle  mean    se
##    <int> <int> <dbl> <dbl>
##  1     1     0  1512   146
##  2     1    50  2039   134
##  3     2     0  1251   125
##  4     2    50  1768   211
##  5     3     0  1602   162
##  6     3    50  1862   109
##  7     4     0  1501   112
##  8     4    50  2023   172
##  9     5     0  2170   242
## 10     5    50  2512   307
## # ... with 98 more rows

Now we simply need to reformat the data to wide format with respect to both the means and the SEs. The trick is to call spread() twice, with different values for its sep (separator) argument. Then, once the means and SEs are joined into wide format, we can refer to the columns containing the means or the SEs by their unique names.

means <- select(subject_summaries, -se) %>%
    spread(key=angle, value=mean, sep = "_")
means
## Source: local data frame [54 x 3]
## Groups: id [54]
## 
## # A tibble: 54 x 3
##       id angle_0 angle_50
##  * <int>   <dbl>    <dbl>
##  1     1    1512     2039
##  2     2    1251     1768
##  3     3    1602     1862
##  4     4    1501     2023
##  5     5    2170     2512
##  6     6    1302     1382
##  7     7    2212     3014
##  8     8    1452     1824
##  9     9    2012     2501
## 10    10    1939     3058
## # ... with 44 more rows
ses <- select(subject_summaries, -mean) %>%
    spread(key=angle, value=se, sep = "SE")
ses
## Source: local data frame [54 x 3]
## Groups: id [54]
## 
## # A tibble: 54 x 3
##       id angleSE0 angleSE50
##  * <int>    <dbl>     <dbl>
##  1     1      146       134
##  2     2      125       211
##  3     3      162       109
##  4     4      112       172
##  5     5      242       307
##  6     6      100        73
##  7     7      223       240
##  8     8      110       197
##  9     9      130       203
## 10    10      233       276
## # ... with 44 more rows
sums <- left_join(means, ses)
sums 
## Source: local data frame [54 x 5]
## Groups: id [?]
## 
## # A tibble: 54 x 5
##       id angle_0 angle_50 angleSE0 angleSE50
##    <int>   <dbl>    <dbl>    <dbl>     <dbl>
##  1     1    1512     2039      146       134
##  2     2    1251     1768      125       211
##  3     3    1602     1862      162       109
##  4     4    1501     2023      112       172
##  5     5    2170     2512      242       307
##  6     6    1302     1382      100        73
##  7     7    2212     3014      223       240
##  8     8    1452     1824      110       197
##  9     9    2012     2501      130       203
## 10    10    1939     3058      233       276
## # ... with 44 more rows

The code for the plot is actually quite straightforward once the tricky data formatting is done (this is really the philosophy behind ggplot2). Use geom_errorbar() to draw the vertical SE bars, and geom_errorbarh() to draw the horizontal SE bars.

ggplot(sums, aes(x=angle_0, y=angle_50)) +
    geom_point() +
    geom_errorbar(aes(ymin=angle_50-angleSE50, ymax=angle_50+angleSE50)) +
    geom_errorbarh(aes(xmin=angle_0-angleSE0, xmax=angle_0+angleSE0)) +
    geom_abline() +
    scale_x_continuous("0 degrees", limits = lims) +
    scale_y_continuous("50 degrees", limits = lims) +
    theme(aspect.ratio=1)

Cool, huh? This graph shows the mean and ±1 SEM of everybody’s reaction time in the 0 degree (x axis) and 50 degree (y axis) conditions. It can serve as a great visual inspection of the data before fitting any complex models, and requires only some slight reorganizing of the data in R. Hope you’ll find it helpful in your own work!

Endnote

Within-subject scatterplots are pretty common in some fields (e.g. psychophysics), but underutilized in many fields where they might have a positive impact on statistical inference. Why not try them out on your own data, especially when they’re this easy to make with R and ggplot2?

Recall that for real applications it’s usually better to transform reaction times, or to model them with a skewed distribution; here we relied on raw means and normal-theory SEs just for convenience.
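
For example, here’s a minimal sketch (my addition, not part of the analysis above) of computing the per-subject summaries on log-transformed RTs instead:

# A sketch only: per-subject means and SEs on the log-RT scale,
# which often respects the skew of RT data better than raw means
log_summaries <- group_by(d, id, angle) %>%
    summarize(mean = mean(log(rt), na.rm = TRUE),
              se = sd(log(rt), na.rm = TRUE) / sqrt(n()))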

Finally, this post was made possible by Ganis and Kievit (2015), who have generously shared their data online. Big thanks!

Have a great day!

References

Ganis, G., & Kievit, R. (2015). A New Set of Three-Dimensional Shapes for Investigating Mental Rotation Processes: Validation Data and Stimulus Set. Journal of Open Psychology Data, 3(1). https://doi.org/10.5334/jopd.ai

Scraping Statistical Modeling, Causal Inference, and Social Science

Hi everybody!

Today I’ll share some tips on elementary web scraping with R. Our goal is to download and process an entire Wordpress blog into an R data frame, which can then be visualized and analyzed for fun and discovery.

We’ll scrape andrewgelman.com, the home of “Statistical Modeling, Causal Inference, and Social Science”. This is a very popular statistics and social science blog, whose main author, Andrew Gelman, is a famous statistician and political scientist, and the author of such classic holiday thrillers as Bayesian Data Analysis and Data Analysis Using Regression and Multilevel Models.

Why? Web scraping is a useful skill for anyone interested in information, broadly speaking, and is very easy and fun with R. We’re using this specific blog as a useful example, because Wordpress blogs are very popular, and many of the specific commands here will be useful without much editing, if you’re interested in scraping a Wordpress blog. Of course, using these general tools, you’ll be able to scrape any website you’d like!

After we’ve scraped the data from Wordpress, we’ll simply visualize the most popular (in terms of number of comments) blog posts from 2016.

Scraping from Wordpress

The first step is to scrape each Wordpress front page of the blog (20 blog posts per page) into a local variable. That variable can then be processed to collect the relevant information about each post into a row of a data frame. I put these steps into a simple function and ran it through a multicore pblapply() call to scrape the entire blog in under a minute.

The actual scraping is done by functions from the rvest package. The key steps are

1. Parse the HTML from a given URL.
2. Extract the required information from the raw HTML using “node tags” (browse a page’s HTML to identify the relevant tag names).
3. Apply the scraping function to all Wordpress pages (each containing 20 blog posts).

library(rvest)    # read_html(), html_nodes(), html_text(), html_attr()
library(pbapply)  # pblapply(): lapply() with a progress bar and a parallel option
N <- 353  # Number of WordPress pages (20 posts per page)
urls <- paste0("http://andrewgelman.com/page/", 1:N)
getsite <- function(url) {
    site <- read_html(url)  # (1.)
    d <- tibble(
        title = site %>% html_nodes(".posttitle") %>% html_text(),  # (2.)
        url = site %>% html_nodes(".posttitle") %>% 
            html_nodes("a") %>% html_attr("href"),
        author = site %>% html_nodes(".postauthor") %>% html_text(),
        date = site %>% html_nodes(".postdate") %>% html_text(),
        tags = site %>% html_nodes(".postcat") %>% html_text(),
        comments = site %>% html_nodes(".postcomment") %>% html_text(),
        text = site %>% html_nodes(".postentry") %>% html_text())
    return(d)
}
wp_pagelist <- pblapply(urls, getsite, cl = 8)  # (3.)
d <- bind_rows(wp_pagelist)
save(d, file = "smciss.rda", compress=T)

I’ve saved the resulting object as smciss.rda, which you can load directly from my website:

load(url("https://mvuorre.github.io/data/smciss.rda"))

Cleaning the data

Some of these variables need to be processed a bit.

First we’ll convert the date into an R date variable, and also create a time variable that contains the post’s full timestamp. The functions I use below add one NA at the end of the result: it seems that dmy_hm() collects its inputs through ..., so the format string is treated as one more string to parse, fails (hence the warning), and shows up as a trailing NA. That’s why I remove the last entry of each vector.

library(lubridate)
time <- dmy_hm(d$date, "%d %0m %Y, %I:%M %p")
## Warning: 1 failed to parse.
date <- as_date(dmy_hm(d$date, "%d %0m %Y, %I:%M %p"))
## Warning: 1 failed to parse.
length(date) - nrow(d)
## [1] 1
d$date <- date[-length(date)]
d$time <- time[-length(time)]
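
If you’d rather avoid the trailing NA altogether, here’s a hedged alternative sketch (to be run instead of the block above, and assuming the raw strings look like “27 December 2016, 1:39 pm”) that lets lubridate match the format directly:

# Sketch only: parse the raw date strings in one step, instead of passing a
# format string through dmy_hm()'s ... argument (assumed raw format noted above)
d$time <- parse_date_time(d$date, orders = "d b Y I M p")
d$date <- as_date(d$time)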

Then clean up the tags:

library(stringr)
select(d, tags)
## # A tibble: 7,041 x 1
##                                                          tags
##                                                         <chr>
##  1           Filed under Miscellaneous Statistics, Sociology.
##  2                                    Filed under Literature.
##  3                            Filed under Economics, Zombies.
##  4          Filed under Literature, Miscellaneous Statistics.
##  5                                 Filed under Uncategorized.
##  6         Filed under Political Science, Sociology, Zombies.
##  7 Filed under Decision Theory, Economics, Political Science.
##  8                    Filed under Decision Theory, Economics.
##  9                     Filed under Political Science, Sports.
## 10                                       Filed under Zombies.
## # ... with 7,031 more rows
d$tags <- str_replace_all(d$tags, "Filed under", "") %>%
    str_replace_all("[.]", "") %>%
    str_trim()

And comments:

d$comments <- as.integer( str_extract(d$comments, "[0-9]+") )

And the data is pretty clean:

select(d, date, time, tags, comments)
## # A tibble: 7,041 x 4
##          date                time
##        <date>              <dttm>
##  1 2016-12-27 2016-12-27 13:39:00
##  2 2016-12-27 2016-12-27 09:23:00
##  3 2016-12-26 2016-12-26 09:10:00
##  4 2016-12-25 2016-12-25 09:47:00
##  5 2016-12-24 2016-12-24 15:00:00
##  6 2016-12-24 2016-12-24 09:39:00
##  7 2016-12-23 2016-12-23 18:36:00
##  8 2016-12-23 2016-12-23 09:57:00
##  9 2016-12-23 2016-12-23 00:07:00
## 10 2016-12-22 2016-12-22 11:06:00
## # ... with 7,031 more rows, and 2 more variables: tags <chr>,
## #   comments <int>
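
To close the loop on the goal stated at the top of this post, here’s a minimal sketch (my own, with hypothetical choices for the cutoff and plot style) of finding and plotting the ten most-commented posts of 2016:

# A sketch only: the ten most-commented posts of 2016 from the cleaned data frame d
top2016 <- filter(d, year(date) == 2016) %>%
    arrange(desc(comments)) %>%
    slice(1:10)
ggplot(top2016, aes(x = comments, y = reorder(title, comments))) +
    geom_point() +
    labs(x = "Number of comments", y = "")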

How to arrange ggplot2 panel plots

Panel plots are a common name for figures showing every person’s (or whatever your sampling unit is) data in their own little panel. This plot is sometimes also known as “small multiples”, although that more commonly refers to plots that illustrate interactions. Here, I’ll illustrate how to add information to a panel plot by arranging the panels according to some meaningful value.

Here’s an example of a panel plot, using the sleepstudy data set from the lme4 package. Notice that the subject-specific panels are created with facet_wrap(), as explained in an earlier blog post.

data(sleepstudy, package = "lme4")
head(sleepstudy)
##   Reaction Days Subject
## 1 249.5600    0     308
## 2 258.7047    1     308
## 3 250.8006    2     308
## 4 321.4398    3     308
## 5 356.8519    4     308
## 6 414.6901    5     308
ggplot(sleepstudy, aes(x=Days, y=Reaction)) +
    geom_point() +
    scale_x_continuous(breaks=0:9) +
    facet_wrap("Subject", labeller = label_both)

On the x-axis is days of sleep deprivation, and on the y-axis is an aggregate measure of reaction time across a number of cognitive tasks. Reaction time increases as a function of sleep deprivation. But the order of the panels is entirely uninformative: they are simply arranged in increasing order of subject ID number, from top left to bottom right. Subject ID numbers are rarely informative, so we would like to order the panels according to some other fact about the individual participants.

Order panels on mean value

Let’s start by ordering the panels on the participants’ mean reaction time, with the “fastest” participant in the upper-left panel.

Step 1 is to add the required information to the data frame used in plotting. For a simple mean, we can actually use a shortcut in step 2, so this isn’t required.

Step 2: Convert the variable used to separate the panels into a factor, and order it based on the mean reaction time.

The key here is the reorder() function. You first enter the variable that contains the groupings (i.e. the subject ID numbers), and then the values that will be used to order the grouping variable. Finally, as a shortcut, you can base the ordering on a function of those values, such as the mean, by entering that function as the third argument.

sleepstudy <- mutate(sleepstudy,
                     Subject = reorder(Subject, Reaction, mean))

Now if we use Subject to create the subplots, they will be ordered on the mean reaction time. I’ll make the illustration clear by also drawing the person-means with small arrows.

ggplot(sleepstudy, aes(x=Days, y=Reaction)) +
    geom_point() +
    stat_summary(fun.y=mean, geom="segment",
                 aes(yend=..y.., x=0, xend=3),
                 arrow = arrow(ends = "first", length = unit(.1, "npc"))) +
    scale_x_continuous(breaks=0:9, expand = c(0, 0)) +
    facet_wrap("Subject", labeller = label_both)

Ordering panels on other parameters

It might also be useful to order the panels based on a value from a model, such as the slope of a linear regression. This is especially useful in making the heterogeneity in the sample easier to see. For this, you’ll need to fit a model, grab the subject-specific slopes, order the paneling factor, and plot. I’ll illustrate with a multilevel regression using lme4.

# Step 1: Add values to order on into the data frame
library(lme4)
mod <- lmer(Reaction ~ Days + (Days|Subject), data = sleepstudy)
# Create a data frame with subject IDs and coefficients
coefs <- coef(mod)$Subject %>%
    rownames_to_column("Subject")
names(coefs) <- c("Subject", "Intercept", "Slope")
# Join to main data frame by Subject ID
sleepstudy <- left_join(sleepstudy, coefs, by="Subject")
# Step 2: Reorder the grouping factor
sleepstudy <- mutate(sleepstudy,
                     Subject = reorder(Subject, Slope))

Then, I’ll plot the data also showing the fitted lines from the multilevel model:

ggplot(sleepstudy, aes(x=Days, y=Reaction)) +
    geom_point() +
    geom_abline(aes(intercept = Intercept, slope = Slope)) +
    scale_x_continuous(breaks=0:9) +
    facet_wrap("Subject", labeller = label_both)

Hopefully you’ll find this helpful. Have a great day!
