Quantitative literature review with R: Exploring Psychonomic Society Journals, Part II

(This post directly continues from part I of Quantitative literature review with R. Please read that first for context.) In this tutorial, I’ll show how to use R to quantitatively explore, analyze, and visualize a broad research literature, using all Psychonomic Society publications between 2004 and 2016. Part I focused on data cleaning and simple figures, but here we will look at relational data by visualizing some network structures in the data.

We’ll be working with R, so if you want to follow along on your own computer, fire up R and load up the required R packages:

library(tidyverse)
library(stringr)

Then you can load the cleaned data from https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda using R:

load(url("https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda"))

We’ve got data on quite a few variables which we could imagine are interconnected, such as Tags (Keywords), Authors, and the articles themselves (through the References variable).

glimpse(d)
## Observations: 7,614
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Cited_by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors_with_affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author_Keywords           <chr> "3D perception; Perception and actio...
## $ Index_Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence_Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated_Source_Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document_Type             <chr> "Article", "Article", "Article", "Ar...
## $ Pages                     <int> 7, 13, 19, NA, 12, 12, 32, 19, 6, 22...
## $ Publication               <chr> "Attention, Perception & Psychophysi...
## $ Pub_abbr                  <chr> "Atten. Percept. Psychophys.", "Atte...

Let’s begin by looking at the articles’ Keywords.

Tags

These are the things that you have to choose to describe your article when you submit it for publication. They are called Keywords in the data, but I’m going to refer to them as Tags for brevity. Also, I will only work with the Author_Keywords, because it seems like Scopus’s automatic index keywords are mostly nonsense or missing.

So our first operation is to rename the Author_Keywords variable to Tags, and get rid of the automatic keywords altogether.

d <- rename(d, Tags = Author_Keywords) %>% 
    select(-Index_Keywords)

Before moving on to networks, let’s take a look at what these tags are like: How many are there? Do people use them consistently?

Exploring univariate tags

Our first goal is to simply visualize the most common tags. However, recall that the Tags variable in the data is quite messy:

## # A tibble: 3 x 1
##                                                                          Tags
##                                                                         <chr>
## 1    3D perception; Perception and action; Scene perception; Space perception
## 2 Auditory perception; Cognitive performance; Dissonance; Distraction; Inharm
## 3             Attention; Automaticity; Cognitive control; Selective attention

Each value of Tags is a kind of list whose individual items are separated by semicolons. What we’d like is a variable (a column in a data frame) where every item is a single tag, so we need to do some kind of data unwrapping operation.

The unwrapping can be done by simply splitting the Tags variable on each semicolon (and space) found in each string, but this creates a strange variable:

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) 
## # A tibble: 7,614 x 1
##         Tags
##       <list>
##  1 <chr [4]>
##  2 <chr [6]>
##  3 <chr [4]>
##  4 <chr [5]>
##  5 <chr [5]>
##  6 <chr [3]>
##  7 <chr [3]>
##  8 <chr [6]>
##  9 <chr [3]>
## 10 <chr [3]>
## # ... with 7,604 more rows

This is where things get really interesting. A few lines above, we conceptualized each value (cell) of Tags as a sort of list. We have now operationalized that idea by creating a Tags variable that is an R list-column, meaning that each cell can contain a complex R object. Here, each value is a list of character strings. To make this notion concrete, look at the first row (slice(1)) of the Tags variable (.$Tags, where . is a placeholder for the data passed along the %>% pipeline) using the familiar str() function:

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    slice(1) %>% 
    .$Tags %>%  str()
## List of 1
##  $ : chr [1:4] "3D perception" "Perception and action" "Scene perception" "Space perception"

All the tags associated with the first article in our data frame are in a neat list of neat character strings. Brilliant. This data format is quite foreign to those of us used to working with strictly flat rectangular data, but it has a certain recursive beauty to it. Now that our list is concrete, we can unnest it (with tidyr’s unnest()) to create a Tags variable where each row is one individual tag.

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags)
## # A tibble: 22,375 x 1
##                     Tags
##                    <chr>
##  1         3D perception
##  2 Perception and action
##  3      Scene perception
##  4      Space perception
##  5   Auditory perception
##  6 Cognitive performance
##  7            Dissonance
##  8           Distraction
##  9         Inharmonicity
## 10                 Music
## # ... with 22,365 more rows

Excellent! This data frame has one variable, Tags, where each observation is an individual tag. Now we can go ahead and count and summarise the tags. From here onwards, it’s all regular dplyr verbs:

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags)) %>%  # Remove missing observations
    group_by(Tags) %>%  # Group by unique Tags
    count() %>%  # And count them
    ungroup() %>% 
    top_n(40) %>%  # Choose n most numerous Tags
    mutate(Tags = reorder(Tags, n)) %>%  # {1}
    ggplot(aes(n, Tags)) +
    geom_point() +
    geom_segment(aes(xend = 0, yend=Tags)) +
    scale_x_continuous(limits = c(0, 400), expand = c(0, 0)) +
    labs(x="Number of articles", y="",
         subtitle = "Top 40 tags in Psychonomics Journals",
         caption = "Data for years 2004-2016 only")

The line marked {1} reorders the Tags on the n variable so that they appear in order in the figure. You could also draw a word cloud with these data, but I think I’ll refrain this time. Instead, I’ll explore popular tags by journal. The “pipeline” to get tags by journal is almost identical, except that we’ll keep the Publication variable in the data.

The following code is a little more complicated than is required for a simple plot, but I wanted to order the Tags inside each publication, which required a little bit of extra work as described by @drsimonj. Below, I’ll work with journal abbreviations Pub_abbr.

pd <- d %>% 
    select(Pub_abbr, Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags)) %>%
    group_by(Pub_abbr, Tags) %>%  # Group by unique Tags AND journals
    count() %>%  # And count them
    group_by(Pub_abbr) %>% 
    top_n(10, n) %>%  # Select top 10 tags within each journal
    ungroup() %>% 
    arrange(Pub_abbr, n) %>%  # Arrange by n within journals
    mutate(order = row_number())  # Will order Tags on this
pd
## # A tibble: 62 x 4
##                       Pub_abbr                  Tags     n order
##                          <chr>                 <chr> <int> <int>
##  1 Atten. Percept. Psychophys.     Attentional blink    33     1
##  2 Atten. Percept. Psychophys.        Working memory    36     2
##  3 Atten. Percept. Psychophys. Visual working memory    40     3
##  4 Atten. Percept. Psychophys.      Visual attention    41     4
##  5 Atten. Percept. Psychophys.   Selective attention    49     5
##  6 Atten. Percept. Psychophys.     Visual perception    51     6
##  7 Atten. Percept. Psychophys.   Attentional capture    52     7
##  8 Atten. Percept. Psychophys.         Eye movements    61     8
##  9 Atten. Percept. Psychophys.         Visual search   116     9
## 10 Atten. Percept. Psychophys.             Attention   180    10
## # ... with 52 more rows

That was more complicated than usual, so I wanted to print out the intermediate result as well. We can now use it to plot:

pd %>%     
    ggplot(aes(n, order)) +
    geom_point() +
    geom_segment(aes(xend = 0, yend=order)) +
    scale_y_continuous(breaks = pd$order, labels = pd$Tags) +
    scale_x_continuous(expand = c(0.02, 0)) +
    labs(x="Number of articles", y="",
         subtitle = "Top 10 tags within 6 Psychonomics Journals",
         caption = "Data for years 2004-2016 only") +
    facet_wrap("Pub_abbr", scales="free", ncol=2)

Finally, let’s look at the overall distribution of the tags as a histogram:
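
The code for that histogram isn’t shown here; a minimal sketch, reusing the unnesting pipeline from above, could look like this:

```r
# Sketch: count how many articles each unique tag appears in,
# then plot the distribution of those counts.
d %>%
    select(Tags) %>%
    mutate(Tags = str_split(Tags, "; ")) %>%
    unnest(Tags) %>%
    filter(!is.na(Tags)) %>%
    count(Tags) %>%
    ggplot(aes(n)) +
    geom_histogram(binwidth = 1) +
    labs(x = "Number of articles with tag", y = "Number of unique tags")
```

Because most tags are used only once or twice, the distribution is heavily right-skewed.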

There are about 8000 unique Tags, and about 7000 of them occur fewer than three times in the corpus. Now that we have some idea of what kinds of tags there are in the data, roughly what their distributions might look like, and how they vary across the journals, we are ready to look at networks of tags.

Network analysis of Tags

Network analysis (or graph analysis) is a vast and complicated area, and my graph analysis skills are a little rusty. So here we’ll simply illustrate how to format the data in an appropriate way, and draw a few example graphs. I referred to this website more than once while working on this (also thanks to @jalapic for tips).

The first graph will be about Tags. We’ll try to illustrate how and which tags co-occur within articles in the Psychonomic Society journals between 2004 and 2016.

90% of data analysis is data cleaning and manipulation, and network analysis is no exception. The first data manipulation step is to create an adjacency matrix in which each row and each column is a tag, and the cell values are the number of times those tags co-occurred (were present in the same article). I found this Stack Overflow answer and this blog post helpful when preparing the data.
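
The matrix algebra behind this can be previewed with a toy example (made-up tags and article IDs): a binary tag-by-article occurrence matrix, multiplied by its own transpose, gives exactly such a co-occurrence matrix.

```r
# Toy data: three tags across two articles (matrix() fills column-wise).
occ <- matrix(c(1, 1, 0,   # "doi1" is tagged memory & attention
                1, 0, 1),  # "doi2" is tagged memory & vision
              nrow = 3,
              dimnames = list(c("memory", "attention", "vision"),
                              c("doi1", "doi2")))
occ %*% t(occ)  # diagonal: tag frequencies; off-diagonal: co-occurrences
##           memory attention vision
## memory         2         1      1
## attention      1         1      0
## vision         1         0      1
```

This is the same matrix-multiplication step we will run on the real data below.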

Preparing the adjacency matrix

In order to create the adjacency matrix we need to solve a sequence of smaller problems. First, we’ll take two columns from the data, one for Tags and another one that uniquely identifies each article (doesn’t matter what it is, here we use DOI). Then we unnest the Tags into single tags per row as detailed above, and remove empty tags.

m <- d %>% 
    select(Tags, DOI) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags))
m
## # A tibble: 18,949 x 2
##                          DOI                  Tags
##                        <chr>                 <chr>
##  1 10.3758/s13414-015-1049-4         3D perception
##  2 10.3758/s13414-015-1049-4 Perception and action
##  3 10.3758/s13414-015-1049-4      Scene perception
##  4 10.3758/s13414-015-1049-4      Space perception
##  5 10.3758/s13414-015-1042-y   Auditory perception
##  6 10.3758/s13414-015-1042-y Cognitive performance
##  7 10.3758/s13414-015-1042-y            Dissonance
##  8 10.3758/s13414-015-1042-y           Distraction
##  9 10.3758/s13414-015-1042-y         Inharmonicity
## 10 10.3758/s13414-015-1042-y                 Music
## # ... with 18,939 more rows

Next, I remove all tags from the data that occurred eight times or fewer, because otherwise the data would be too large and messy.

m <- group_by(m, Tags) %>% 
    mutate(n = n()) %>% 
    filter(n > 8) %>% 
    ungroup() %>% 
    select(-n)

Next, we need a variable that identifies whether the tag occurred in the article or not:

m$occur <- 1
slice(m, 1:2)
## # A tibble: 2 x 3
##                         DOI                  Tags occur
##                       <chr>                 <chr> <dbl>
## 1 10.3758/s13414-015-1049-4 Perception and action     1
## 2 10.3758/s13414-015-1049-4      Scene perception     1

Note that occur is 1 for every row at this point. To create the matrix we need all the article-tag pairs, even ones that didn’t occur; for those pairs, occur will be zero. The tidyr package in the tidyverse provides the perfect helper function for this purpose:

m <- m %>% 
    complete(DOI, Tags, fill = list(occur=0))
m
## # A tibble: 960,500 x 3
##                          DOI               Tags occur
##                        <chr>              <chr> <dbl>
##  1 10.1007/s13420-010-0001-7     3-D perception     0
##  2 10.1007/s13420-010-0001-7             Action     0
##  3 10.1007/s13420-010-0001-7         Adaptation     0
##  4 10.1007/s13420-010-0001-7             Affect     0
##  5 10.1007/s13420-010-0001-7 Age of acquisition     0
##  6 10.1007/s13420-010-0001-7              Aging     0
##  7 10.1007/s13420-010-0001-7           Amygdala     0
##  8 10.1007/s13420-010-0001-7   Animal cognition     0
##  9 10.1007/s13420-010-0001-7       Anticipation     0
## 10 10.1007/s13420-010-0001-7            Anxiety     0
## # ... with 960,490 more rows

Now we have a data frame with 960,500 rows; each row signifies a possible article (DOI)–tag co-occurrence. Most values of occur will, of course, be zero.
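
As a quick sanity check (a sketch; the exact counts depend on the rare-tag filter chosen above), we can tabulate the occur column:

```r
# How many of the possible article-tag pairs actually occurred?
m %>% count(occur)
```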

Now we can create the co-occurrence matrix. We first spread the data into a matrix in which each row is a unique tag, each column is a DOI, and the cell values are occurrences.

library(reshape2)
y <- m %>% 
    acast(Tags~DOI, value.var="occur", fun.aggregate = sum)
y[1:2, 1:2]
##                10.1007/s13420-010-0001-7 10.3758/s13414-010-0002-9
## 3-D perception                         0                         0
## Action                                 0                         0
y[y>1] <- 1  # Binarize: a tag either occurs in an article (1) or not (0)

All we need to do then is to make a symmetric matrix where both rows and columns are Tags, and the cell values are numbers of co-occurrences. So we matrix multiply y with its transpose:

co <- y %*% t(y)
co[1:3, 1:3]
##                3-D perception Action Adaptation
## 3-D perception             11      0          1
## Action                      0     20          0
## Adaptation                  1      0         13

Visualizing the adjacency matrix

It’s important to understand what co is: a symmetric matrix whose rows and columns are tags, with the number of co-occurrences at their intersections. For example, Action and 3-D perception never co-occurred in the data, but Adaptation and 3-D perception co-occurred once. We are now in an excellent position to investigate which tags tend to co-occur.
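
For instance, to list the most frequent tag pairs directly from co, one option is to melt the matrix back into long format (a sketch; reshape2 was already loaded above, and the cc copy just protects co from modification):

```r
# Sketch: find the most frequent tag pairs straight from the matrix.
cc <- co          # work on a copy so co itself stays intact
diag(cc) <- 0     # the diagonal holds each tag's total frequency; ignore it
pairs <- melt(cc, varnames = c("from", "to"), value.name = "n")
pairs <- subset(pairs, as.character(from) < as.character(to))  # each pair once
head(pairs[order(-pairs$n), ])
```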

First, let’s use igraph to create the graph object g, and take out some unnecessary information from it.

library(igraph)
g <- graph.adjacency(co, weighted=TRUE, mode ='undirected')
g <- simplify(g)

g is now an igraph R object that contains edges and vertices, and information about them. For example, we can use V(g)$name to look at the vertex names. Some of these are quite long, so I’ll create another attribute for each vertex, called label, which is the name with every space replaced by a newline character. This will make some of the graphs prettier.

V(g)$label <- gsub(" ", "\n", V(g)$name)

Networks are very complicated visualizations, and if you simply call plot(g), you’ll get an ugly mess of a figure:

plot(g)

There are more parameters to the plot() command than can be listed here, but I tried out a few different options, and this is what I ended up with:

plot(g,
     layout = layout_on_sphere(g),
     edge.width = E(g)$weight/7,
     edge.color = adjustcolor("dodgerblue1", .7),
     vertex.color = adjustcolor("white", .8),
     vertex.frame.color=adjustcolor("dodgerblue4", .4),
     vertex.size=1.9,
     vertex.label.family = "Helvetica",
     vertex.label.font=1,
     vertex.label.cex=.7,
     vertex.label.color="black",
     vertex.label = ifelse(degree(g) > 25, V(g)$label, NA))

The Bayes Network

The next network will be fairly involved. Our goal is to visualize the prominence and connectedness of Psychonomic Society’s “Bayesians”. Who are these Bayesians, you might ask? Our definition here is simply: “An author who has published one or more articles in Psychonomic Journals (2004 - 2016) where the keywords contained the word ‘Bayes’.” This is obviously suboptimal because someone might have published an article called “Why I’m not a Bayesian”, yet they would be included in this graph, but we’ll go with this for now.

I’ll let the code speak for itself this time. If it is impossible to parse, I recommend downloading the data (see top of post) to your own computer and running the code yourself.

m <- d %>% 
    select(Authors, Tags, DOI) %>% 
    mutate(Authors = str_split(Authors, ", ")) %>% 
    unnest(Authors) %>% 
    filter(!is.na(Tags), !is.na(Authors)) %>% 
    filter(grepl("bayes", Tags, ignore.case = T))
m$occur <- 1
m <- m %>% 
    complete(DOI, Authors, fill = list(occur=0)) %>% 
    select(-Tags)
m
## # A tibble: 11,076 x 3
##                          DOI      Authors occur
##                        <chr>        <chr> <dbl>
##  1 10.3758/s13414-010-0073-7     Bååth R.     0
##  2 10.3758/s13414-010-0073-7    Bakker M.     0
##  3 10.3758/s13414-010-0073-7       Bar M.     0
##  4 10.3758/s13414-010-0073-7  Barker J.D.     0
##  5 10.3758/s13414-010-0073-7      Barr S.     0
##  6 10.3758/s13414-010-0073-7     Biele G.     0
##  7 10.3758/s13414-010-0073-7 Bittner J.L.     0
##  8 10.3758/s13414-010-0073-7  Borsboom D.     0
##  9 10.3758/s13414-010-0073-7    Bratch A.     0
## 10 10.3758/s13414-010-0073-7    Bröder A.     0
## # ... with 11,066 more rows

In these data, we have simply counted whether an author “occurred” in an article, and only included articles whose Tags contained the word “bayes” in any shape or form. This network will therefore visualize the connectedness of the “Bayesians”, as in who has published papers with whom.

We can then re-run pretty much the same steps as in the previous network.

y <- m %>% 
    acast(Authors~DOI, value.var="occur", fun.aggregate = sum)
co <- y %*% t(y)
co[1:3, 1:3]
##           Bååth R. Bakker M. Bar M.
## Bååth R.         1         0      0
## Bakker M.        0         1      0
## Bar M.           0         0      1
g <- graph.adjacency(co, weighted=TRUE, mode ='undirected')
g <- simplify(g)
V(g)$label <- gsub(" ", "\n", V(g)$name)
deg <- degree(g)
plot(g,
     layout = layout_with_kk(g),
     edge.width = E(g)$weight,
     edge.color = adjustcolor("dodgerblue1", .7),
     vertex.color = adjustcolor("white", .8),
     vertex.frame.color=adjustcolor("dodgerblue4", .4),
     vertex.size=1.9,
     vertex.label.family = "Helvetica",
     vertex.label.font=1,
     vertex.label.cex = sqrt(5+deg)/4,
     vertex.label.color = "black",
     main = "The Bayesians of Psychonomic Society")

More to come later on…
