Quantitative literature review with R: Exploring Psychonomic Society Journals, Part II

(This post directly continues from part I of Quantitative literature review with R. Please read that first for context.) In this tutorial, I’ll show how to use R to quantitatively explore, analyze, and visualize a broad research literature, using all Psychonomic Society publications between 2004 and 2016. Part I focused on data cleaning and simple figures, but here we will look at relational data by visualizing some network structures in the data.

We’ll be working with R, so if you want to follow along on your own computer, fire up R and load up the required R packages:

library(tidyverse)
library(stringr)

Then you can load the cleaned data from https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda using R:

load(url("https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda"))

We’ve got data on quite a few variables which we could imagine are interconnected, such as Tags (Keywords), Authors, and the articles themselves (through the References variable).

glimpse(d)
## Observations: 7,614
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Cited_by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors_with_affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author_Keywords           <chr> "3D perception; Perception and actio...
## $ Index_Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence_Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated_Source_Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document_Type             <chr> "Article", "Article", "Article", "Ar...
## $ Pages                     <int> 7, 13, 19, NA, 12, 12, 32, 19, 6, 22...
## $ Publication               <chr> "Attention, Perception & Psychophysi...
## $ Pub_abbr                  <chr> "Atten. Percept. Psychophys.", "Atte...

Let’s begin by looking at the articles’ Keywords.

Tags

These are the things that you have to choose to describe your article when you submit it for publication. They are called Keywords in the data, but I’m going to refer to them as Tags for brevity. Also, I will only work with the Author_Keywords, because it seems like Scopus’s automatic index keywords are mostly nonsense or missing.

So our first operation is to rename the Author_Keywords variable to Tags, and get rid of the automatic keywords altogether.

d <- rename(d, Tags = Author_Keywords) %>% 
    select(-Index_Keywords)

Before moving on to networks, let’s take a look at what these tags are like: How many are there? Do people use them consistently?

Exploring univariate tags

Our first goal is to simply visualize the most common tags. However, recall that the Tags variable in the data is quite messy:

## # A tibble: 3 x 1
##                                                                          Tags
##                                                                         <chr>
## 1    3D perception; Perception and action; Scene perception; Space perception
## 2 Auditory perception; Cognitive performance; Dissonance; Distraction; Inharm
## 3             Attention; Automaticity; Cognitive control; Selective attention

Each value of Tags is a kind of a list, whose individual items are separated by semicolons. What we’d like to have is a variable (column in a data frame) where every item is a single tag, so we need to do some kind of a data unwrapping operation.

The unwrapping can be done by simply splitting the Tags variable on each semicolon (and space) found in each string, but this creates a strange variable:

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) 
## # A tibble: 7,614 x 1
##         Tags
##       <list>
##  1 <chr [4]>
##  2 <chr [6]>
##  3 <chr [4]>
##  4 <chr [5]>
##  5 <chr [5]>
##  6 <chr [3]>
##  7 <chr [3]>
##  8 <chr [6]>
##  9 <chr [3]>
## 10 <chr [3]>
## # ... with 7,604 more rows

This is where things get really interesting. A few lines above, we conceptualized each value (cell) of Tags as a sort of a list. We have now operationalized that idea by creating a Tags variable that is an R list-column, meaning that each cell can contain a complex R object. Here, each value is a list of character values. To make this notion concrete, look at the first row (slice(1)) of the Tags variable (.$Tags, where . is a placeholder for the data passed along by the %>% operator) by using the familiar str() function:

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    slice(1) %>% 
    .$Tags %>%  str()
## List of 1
##  $ : chr [1:4] "3D perception" "Perception and action" "Scene perception" "Space perception"

All the tags associated with the first article in our data frame are in a neat list of neat character strings. Brilliant. This data format is quite foreign to those of us used to working with strictly flat rectangular data, but it has a certain recursive beauty to it. Now that our list is concrete, we can unnest it (with tidyr’s unnest()) to create a Tags variable in which each row is one individual tag.

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags)
## # A tibble: 22,375 x 1
##                     Tags
##                    <chr>
##  1         3D perception
##  2 Perception and action
##  3      Scene perception
##  4      Space perception
##  5   Auditory perception
##  6 Cognitive performance
##  7            Dissonance
##  8           Distraction
##  9         Inharmonicity
## 10                 Music
## # ... with 22,365 more rows

Excellent! This data frame has one variable, Tags, where each observation is an individual tag. Now we can go ahead and count and summarise the tags. From here onwards, it’s all regular dplyr verbs:

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags)) %>%  # Remove missing observations
    group_by(Tags) %>%  # Group by unique Tags
    count() %>%  # And count them
    ungroup() %>% 
    top_n(40) %>%  # Choose n most numerous Tags
    mutate(Tags = reorder(Tags, n)) %>%  # {1}
    ggplot(aes(n, Tags)) +
    geom_point() +
    geom_segment(aes(xend = 0, yend=Tags)) +
    scale_x_continuous(limits = c(0, 400), expand = c(0, 0)) +
    labs(x="Number of articles", y="",
         subtitle = "Top 30 tags in Psychonomics Journals",
         caption = "Data for years 2004-2016 only")

{1} reordered the Tags by the n variable so that they appear in order in the figure. You could also do a wordcloud with these data, but I think I’ll refrain this time. Instead, what I’ll do is explore popular tags by journal. The “pipeline” to get tags by journal is almost identical, except that we’ll keep a journal identifier in the data.

The following code is a little more complicated than is required for a simple plot, but I wanted to order the Tags inside each publication, which required a little bit of extra work as described by @drsimonj. Below, I’ll work with journal abbreviations Pub_abbr.

pd <- d %>% 
    select(Pub_abbr, Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags)) %>%
    group_by(Pub_abbr, Tags) %>%  # Group by unique Tags AND journals
    count() %>%  # And count them
    group_by(Pub_abbr) %>% 
    top_n(10, n) %>%  # Select top 10 tags within each journal
    ungroup() %>% 
    arrange(Pub_abbr, n) %>%  # Arrange by n within journals
    mutate(order = row_number())  # Will order Tags on this
pd
## # A tibble: 62 x 4
##                       Pub_abbr                  Tags     n order
##                          <chr>                 <chr> <int> <int>
##  1 Atten. Percept. Psychophys.     Attentional blink    33     1
##  2 Atten. Percept. Psychophys.        Working memory    36     2
##  3 Atten. Percept. Psychophys. Visual working memory    40     3
##  4 Atten. Percept. Psychophys.      Visual attention    41     4
##  5 Atten. Percept. Psychophys.   Selective attention    49     5
##  6 Atten. Percept. Psychophys.     Visual perception    51     6
##  7 Atten. Percept. Psychophys.   Attentional capture    52     7
##  8 Atten. Percept. Psychophys.         Eye movements    61     8
##  9 Atten. Percept. Psychophys.         Visual search   116     9
## 10 Atten. Percept. Psychophys.             Attention   180    10
## # ... with 52 more rows

That was more complicated than usual, so I wanted to print out the intermediate result as well. We can now use it to plot:

pd %>%     
    ggplot(aes(n, order)) +
    geom_point() +
    geom_segment(aes(xend = 0, yend=order)) +
    scale_y_continuous(breaks = pd$order, labels = pd$Tags) +
    scale_x_continuous(expand = c(0.02, 0)) +
    labs(x="Number of articles", y="",
         subtitle = "Top 10 tags within 6 Psychonomics Journals",
         caption = "Data for years 2004-2016 only") +
    facet_wrap("Pub_abbr", scales="free", ncol=2)

Finally, let’s look at the overall distribution of the tags as a histogram:
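
A minimal sketch for such a histogram (the binwidth and the axis limit are arbitrary choices of mine, not necessarily how the original figure was made) could look like this:

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags)) %>% 
    count(Tags) %>%  # One row per unique tag, with its frequency n
    ggplot(aes(n)) +
    geom_histogram(binwidth = 1, col = "black") +
    coord_cartesian(xlim = c(0, 20)) +
    labs(x = "Number of articles using the tag", y = "Number of tags")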

There are about 8000 unique Tags, and about 7000 of them occur fewer than three times in the corpus. Now that we have some idea of what kinds of tags there are in the data, roughly what their distributions might look like, and how they vary across the journals, we are ready to look at networks of tags.

Network analysis of Tags

Network analysis (or graph analysis) is a vast and complicated area, and my graph analysis skills are a little rusty. So here we’ll simply illustrate how to format the data in an appropriate way, and draw a few example graphs. I referred to this website more than once while working on this (also thanks to @jalapic for tips).

The first graph will be about Tags. We’ll try to illustrate how and which tags co-occur within articles in the Psychonomic Society journals between 2004 and 2016.

90% of data analysis is data cleaning and manipulation, and network analysis is no exception. The first data manipulation step is to create an adjacency matrix in which each row and column is a tag, and the cell values are the number of times those tags co-occurred (were present in the same article). I found this stack overflow answer and this blogpost helpful when preparing the data.

Preparing the adjacency matrix

In order to create the adjacency matrix we need to solve a sequence of smaller problems. First, we’ll take two columns from the data, one for Tags and another one that uniquely identifies each article (doesn’t matter what it is, here we use DOI). Then we unnest the Tags into single tags per row as detailed above, and remove empty tags.

m <- d %>% 
    select(Tags, DOI) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags))
m
## # A tibble: 18,949 x 2
##                          DOI                  Tags
##                        <chr>                 <chr>
##  1 10.3758/s13414-015-1049-4         3D perception
##  2 10.3758/s13414-015-1049-4 Perception and action
##  3 10.3758/s13414-015-1049-4      Scene perception
##  4 10.3758/s13414-015-1049-4      Space perception
##  5 10.3758/s13414-015-1042-y   Auditory perception
##  6 10.3758/s13414-015-1042-y Cognitive performance
##  7 10.3758/s13414-015-1042-y            Dissonance
##  8 10.3758/s13414-015-1042-y           Distraction
##  9 10.3758/s13414-015-1042-y         Inharmonicity
## 10 10.3758/s13414-015-1042-y                 Music
## # ... with 18,939 more rows

Next, I remove all tags that occurred only rarely in the data (the code below keeps tags that occur more than eight times), because otherwise the data would be too large and messy.

m <- group_by(m, Tags) %>% 
    mutate(n = n()) %>% 
    filter(n > 8) %>% 
    ungroup() %>% 
    select(-n)

Next, we need a variable that identifies whether the tag occurred in the article or not:

m$occur <- 1
slice(m, 1:2)
## # A tibble: 2 x 3
##                         DOI                  Tags occur
##                       <chr>                 <chr> <dbl>
## 1 10.3758/s13414-015-1049-4 Perception and action     1
## 2 10.3758/s13414-015-1049-4      Scene perception     1

Note that at this point every value of occur is one. To create the matrix we need all the article-tag pairs, even ones that didn’t occur; for those pairs, occur will be zero. The tidyr package in the tidyverse provides the perfect helper function for this purpose, complete():

m <- m %>% 
    complete(DOI, Tags, fill = list(occur=0))
m
## # A tibble: 960,500 x 3
##                          DOI               Tags occur
##                        <chr>              <chr> <dbl>
##  1 10.1007/s13420-010-0001-7     3-D perception     0
##  2 10.1007/s13420-010-0001-7             Action     0
##  3 10.1007/s13420-010-0001-7         Adaptation     0
##  4 10.1007/s13420-010-0001-7             Affect     0
##  5 10.1007/s13420-010-0001-7 Age of acquisition     0
##  6 10.1007/s13420-010-0001-7              Aging     0
##  7 10.1007/s13420-010-0001-7           Amygdala     0
##  8 10.1007/s13420-010-0001-7   Animal cognition     0
##  9 10.1007/s13420-010-0001-7       Anticipation     0
## 10 10.1007/s13420-010-0001-7            Anxiety     0
## # ... with 960,490 more rows

Now we have a data frame with 960500 rows; each row signifying a possible article (DOI) - Tag co-occurrence. Most occurs will, of course, be zero.

Now we can create the co-occurrence matrix. We first cast the data into a wide matrix, such that each row is a unique tag, each column is a DOI, and the cell values are occurrences.

library(reshape2)
y <- m %>% 
    acast(Tags~DOI, value.var="occur", fun.aggregate = sum)
y[1:2, 1:2]
##                10.1007/s13420-010-0001-7 10.3758/s13414-010-0002-9
## 3-D perception                         0                         0
## Action                                 0                         0
y[y>1] <- 1  # Remove duplicates

All we need to do then is to make a symmetric matrix where both rows and columns are Tags, and the cell values are numbers of co-occurrences. So we matrix multiply y with its transpose:

co <- y %*% t(y)
co[1:3, 1:3]
##                3-D perception Action Adaptation
## 3-D perception             11      0          1
## Action                      0     20          0
## Adaptation                  1      0         13
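
As an aside (my own note, not part of the original workflow), base R’s tcrossprod() computes the same product in a single call, which is usually a bit faster for larger matrices:

co <- tcrossprod(y)  # Same result as y %*% t(y)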

Visualizing the adjacency matrix

It’s important to understand what co is; it’s a symmetric matrix whose rows and columns are tags, and their co-occurrences are numerically represented at their intersections. For example, Action and 3-D perception never co-occurred in the data, but Adaptation and 3-D perception co-occurred once. We are now in an excellent position to investigate what tags tend to co-occur.
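
As a quick sanity check (this snippet is my own addition, not from the original analysis), we can melt the co-occurrence matrix back into long format and list the strongest pairs; reshape2 is already loaded from the previous step:

library(reshape2)
co_long <- melt(co, varnames = c("Tag1", "Tag2"), value.name = "n")
co_long %>% 
    filter(as.character(Tag1) < as.character(Tag2)) %>%  # Drop the diagonal and mirrored pairs
    arrange(desc(n)) %>% 
    head(10)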

First, let’s use igraph to create the graph object g, and take out some unnecessary information from it.

library(igraph)
g <- graph.adjacency(co, weighted=TRUE, mode ='undirected')
g <- simplify(g)

g is now an igraph R object that contains edges and vertices, and information about them. For example, we can use V(g)$name to look at the vertex names. Some of these are quite long, so I’ll create another attribute for each vertex, called label, which is the name but with each space replaced by a newline character. This will make some of the graphs prettier.

V(g)$label <- gsub(" ", "\n", V(g)$name)

Networks are very complicated visualizations, and if you simply call plot(g), you’ll get an ugly mess of a figure:

plot(g)

There are more parameters to the plot() command than can be listed here, but I tried out a few different options, and this is what I ended up with:

plot(g,
     layout = layout_on_sphere(g),
     edge.width = E(g)$weight/7,
     edge.color = adjustcolor("dodgerblue1", .7),
     vertex.color = adjustcolor("white", .8),
     vertex.frame.color=adjustcolor("dodgerblue4", .4),
     vertex.size=1.9,
     vertex.label.family = "Helvetica",
     vertex.label.font=1,
     vertex.label.cex=.7,
     vertex.label.color="black",
     vertex.label = ifelse(degree(g) > 25, V(g)$label, NA))

The Bayes Network

The next network will be fairly involved. Our goal is to visualize the prominence and connectedness of Psychonomic Society’s “Bayesians”. Who are these Bayesians, you might ask? Our definition here is simply: “An author who has published one or more articles in Psychonomic Journals (2004 - 2016) where the keywords contained the word ‘Bayes’.” This is obviously suboptimal because someone might have published an article called “Why I’m not a Bayesian”, yet they would be included in this graph, but we’ll go with this for now.

I’ll let the code speak for itself this time. If it is impossible to parse, I recommend downloading the data (see top of post) to your own computer and running the code yourself.

m <- d %>% 
    select(Authors, Tags, DOI) %>% 
    mutate(Authors = str_split(Authors, ", ")) %>% 
    unnest(Authors) %>% 
    filter(!is.na(Tags), !is.na(Authors)) %>% 
    filter(grepl("bayes", Tags, ignore.case = T))
m$occur <- 1
m <- m %>% 
    complete(DOI, Authors, fill = list(occur=0)) %>% 
    select(-Tags)
m
## # A tibble: 11,076 x 3
##                          DOI      Authors occur
##                        <chr>        <chr> <dbl>
##  1 10.3758/s13414-010-0073-7     Bååth R.     0
##  2 10.3758/s13414-010-0073-7    Bakker M.     0
##  3 10.3758/s13414-010-0073-7       Bar M.     0
##  4 10.3758/s13414-010-0073-7  Barker J.D.     0
##  5 10.3758/s13414-010-0073-7      Barr S.     0
##  6 10.3758/s13414-010-0073-7     Biele G.     0
##  7 10.3758/s13414-010-0073-7 Bittner J.L.     0
##  8 10.3758/s13414-010-0073-7  Borsboom D.     0
##  9 10.3758/s13414-010-0073-7    Bratch A.     0
## 10 10.3758/s13414-010-0073-7    Bröder A.     0
## # ... with 11,066 more rows

In these data, we have simply counted whether an author “occurred” in an article, and only included articles whose Tags included the word “bayes” in any shape or form. This network will therefore visualize the connectedness of the “Bayesians”, as in who has published papers with whom.

We can then re-run pretty much the same steps as in the previous network.

y <- m %>% 
    acast(Authors~DOI, value.var="occur", fun.aggregate = sum)
co <- y %*% t(y)
co[1:3, 1:3]
##           Bååth R. Bakker M. Bar M.
## Bååth R.         1         0      0
## Bakker M.        0         1      0
## Bar M.           0         0      1
g <- graph.adjacency(co, weighted=TRUE, mode ='undirected')
g <- simplify(g)
V(g)$label <- gsub(" ", "\n", V(g)$name)
deg <- degree(g)
plot(g,
     layout = layout_with_kk(g),
     edge.width = E(g)$weight,
     edge.color = adjustcolor("dodgerblue1", .7),
     vertex.color = adjustcolor("white", .8),
     vertex.frame.color=adjustcolor("dodgerblue4", .4),
     vertex.size=1.9,
     vertex.label.family = "Helvetica",
     vertex.label.font=1,
     vertex.label.cex = sqrt(5+deg)/4,
     vertex.label.color = "black",
     main = "The Bayesians of Psychonomic Society")

More to come later on…

Quantitative literature review with R: Exploring Psychonomic Society Journals, Part I

Literature reviews, both casual and formal (or qualitative / quantitative), are an important part of research. In this tutorial, I’ll show how to use R to quantitatively explore, analyze, and visualize a research literature, using Psychonomic Society’s publications between 2005 and 2016.

Commonly, literature reviews are rather informal; you read a review paper or 10, maybe suggested to you by experts in the field, and then delve deeper into the topics presented in those papers. A more formal version starts with a database search, where you try out various search terms, and collect a more exhaustive list of publications related to your research questions. Here, we are going to have a bit of fun and explore a large heterogeneous literature (kind of) quantitatively, focusing on data manipulation (not the bad kind!) and visualization.

I used Scopus to search for all journal articles published in the Psychonomic Society’s journals (not including CR:PI because it’s so new) between the years 2005 and 2016 (inclusive). For real applications you would search more broadly, across multiple databases and journals, but for a general illustration I wanted to focus on these publications, because Psychonomic journals are closest to my research interests.

I limited the search to these years for two reasons: Metadata on articles before early 2000’s is (are?) not always very good, and if I had included everything, the database would have simply been too large. For the example here, these search terms resulted in about 7500 journal articles.

Download data from Scopus

The Scopus search interface is very straightforward, so I won’t talk about how to use it. Some things to keep in mind, though, are that Scopus does not index all possible sources of citations, and therefore when we discuss citations (and other similar info) below, we’ll only discuss citations that Scopus knows about. For example, Google Scholar often indexes more sources, but it doesn’t have the same search tools as Scopus. Scopus allows you to download the data in various formats, and I chose .csv files, which I have stored on my computer. When you download raw data from Scopus, make sure that you organize the files well and that they are available to the R session.

You can also download the raw or cleaned up data from my website (see below), if you’d like to work on the same data set, or replicate the code below.

Load and clean raw data from Scopus

Let’s first load all the relevant R packages that we’ll use to do our work for us:

library(tidyverse)
library(stringr)
library(tools)

Load the data from .csv files

When I first downloaded the files, I put them in a subdirectory of my project (my website). I downloaded the data such that one file contains the data of one journal. (The raw journal-specific .csv files are available at https://github.com/mvuorre/mvuorre.github.io/tree/master/data/scopus, but I recommend you download the combined R data file instead because it’s easier, see below.)

The first step is to collect the names of these files into a character vector:

fl <- list.files("static/data/scopus/", pattern=".csv", full.names = T)

Then we can apply a function over the list of file names that reads all the files to an R object.

d <- lapply(fl, 
            read_csv, 
            col_types = cols(`Page start` = col_character(),
                             `Page end` = col_character(),
                             `Art. No.` = col_character()))

lapply() applies the read_csv() function to each file name in fl, and the col_types = cols() argument is there to ensure that those columns are read as character variables (the original data is a little messy, and this ensures that all the files are read with identical column types). Next, we’ll turn the list of data frames d into a single data frame by binding them together row-wise (think of stacking spreadsheets vertically on top of each other). Let’s also take a quick look at the result with glimpse():

d <- bind_rows(d)
glimpse(d)
## Observations: 7,697
## Variables: 25
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Source title              <chr> "Attention, Perception, and Psychoph...
## $ Volume                    <int> 78, 78, 78, 78, 78, 78, 78, 78, 78, ...
## $ Issue                     <int> 3, 3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, ...
## $ Art. No.                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Page start                <chr> "915", "946", "848", "938", "902", "...
## $ Page end                  <chr> "922", "959", "867", NA, "914", "735...
## $ Page count                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Cited by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Link                      <chr> "https://www.scopus.com/inward/recor...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors with affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author Keywords           <chr> "3D perception; Perception and actio...
## $ Index Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence Address    <chr> "Button, C.; School of Physical Educ...
## $ Publisher                 <chr> "Springer New York LLC", "Springer N...
## $ Abbreviated Source Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document Type             <chr> "Article", "Article", "Article", "Ar...
## $ Source                    <chr> "Scopus", "Scopus", "Scopus", "Scopu...
## $ EID                       <chr> "2-s2.0-84951928856", "2-s2.0-849519...

OK, looks pretty clean already! (I also tried to get data from ISI web of science, or whatever it’s now called, but the raw data was so messy that I’d have to get paid to even look at it.)

The combined raw data file is available as an R data file at https://mvuorre.github.io/data/scopus/scopus-psychonomics-combined-2004-2016.rda, and can be easily loaded to the R workspace with:

load(url("https://mvuorre.github.io/data/scopus/scopus-psychonomics-combined-2004-2016.rda"))

Clean the data

OK, now that you’ve loaded up the combined data file from my website into your R session with the above command, you can get to work cleaning the data. If you’d like to skip the data cleaning steps, scroll down to “Load cleaned data”.

Select relevant variables

There’s a couple of columns (variables) in the data that I’m just going to drop because I’m not interested in them at all. When you export data from Scopus, you can choose what variables to include, and it looks like I accidentally included some stuff that I don’t care about. I’ll use the select() function, and prepend each variable I want to drop with a minus sign. Variable names with spaces or strange characters are wrapped in backticks.

I think it’s always good to drop unnecessary variables to remove unnecessary clutter. Talking about cognitive psychology, it’s probably less cognitively demanding to work with a data frame with 5 variables than it is to work with a data frame with 50 variables.

d <- select(d,
            -Volume,
            -Issue,
            -`Art. No.`,
            -`Page count`,
            -Link,
            -Publisher,
            -Source,
            -EID)

I dropped page count because it is empty for most cells, and I actually need to re-compute it from the start and end variables, later on. Next, we’re going to go through the columns, but leave the more complicated columns (such as Authors) for later. What do I mean by “complicated”? Some variables are not individual values, and their desired final shape depends on how you want to use them. Much later on, when we get to network and graph analyses, you’ll see that we want to “unlist” the Keywords columns, for example. That’s why we’ll do the cleaning and data manipulation closer to the actual analysis. But for “univariate” columns, we can get started right away.

Numerical columns: Year, pages, citations

Year is already an integer, and has no missing values, so we don’t need to do anything about it. We can, however, make the page number information numerical and combine it into one column.

glimpse(d)
## Observations: 7,697
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Source title              <chr> "Attention, Perception, and Psychoph...
## $ Page start                <chr> "915", "946", "848", "938", "902", "...
## $ Page end                  <chr> "922", "959", "867", NA, "914", "735...
## $ Cited by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors with affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author Keywords           <chr> "3D perception; Perception and actio...
## $ Index Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated Source Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document Type             <chr> "Article", "Article", "Article", "Ar...
d$Pages <- as.integer(d$`Page end`) - as.integer(d$`Page start`)

How do we know that the Pages variable is now correct? We can’t rely on math itself because the data values themselves may be incorrect. It is good practice to check the data after each operation, and there are many ways of doing so. I like to arrange the data (either descending or ascending) on the newly created variable to see if there are any weird values. Here’s how to arrange the data on the Pages variable in descending order:

glimpse(arrange(d, desc(Pages)))
## Observations: 7,697
## Variables: 18
## $ Authors                   <chr> "Nishiyama R.", "Harrigan J., Rosent...
## $ Title                     <chr> "Active maintenance of semantic repr...
## $ Year                      <int> 2014, 2008, 2013, 2014, 2014, 2014, ...
## $ Source title              <chr> "Psychonomic Bulletin and Review", "...
## $ Page start                <chr> "583", "1", "1", "1", "1", "1", "1",...
## $ Page end                  <chr> "1589", "536", "497", "442", "371", ...
## $ Cited by                  <int> 1, 59, NA, NA, NA, NA, 79, 47, NA, N...
## $ DOI                       <chr> "10.3758/s13423-014-0618-1", "10.109...
## $ Affiliations              <chr> "Department of Psychological Science...
## $ Authors with affiliations <chr> "Nishiyama, R., Department of Psycho...
## $ Abstract                  <chr> "In research on verbal working memor...
## $ Author Keywords           <chr> "Active maintenance; Semantic repres...
## $ Index Keywords            <chr> "adult; attention; female; human; ma...
## $ References                <chr> "Allen, C.M., Martin, R.C., Martin, ...
## $ Correspondence Address    <chr> "Nishiyama, R.; Department of Psycho...
## $ Abbreviated Source Title  <chr> "Psychonom. Bull. Rev.", "The New Ha...
## $ Document Type             <chr> "Article", "Book", "Book", "Article"...
## $ Pages                     <int> 1006, 535, 496, 441, 370, 341, 325, ...

That code is quite difficult to read, however, because it contains nested function calls: arrange() is inside glimpse(). This is where we first really encounter the principles of R’s tidyverse philosophy: instead of nesting, we can pipe results from one function to another. To “pipe” the above code, we’d write it this way instead:

d %>% 
    arrange(desc(Pages)) %>% 
    glimpse()
## Observations: 7,697
## Variables: 18
## $ Authors                   <chr> "Nishiyama R.", "Harrigan J., Rosent...
## $ Title                     <chr> "Active maintenance of semantic repr...
## $ Year                      <int> 2014, 2008, 2013, 2014, 2014, 2014, ...
## $ Source title              <chr> "Psychonomic Bulletin and Review", "...
## $ Page start                <chr> "583", "1", "1", "1", "1", "1", "1",...
## $ Page end                  <chr> "1589", "536", "497", "442", "371", ...
## $ Cited by                  <int> 1, 59, NA, NA, NA, NA, 79, 47, NA, N...
## $ DOI                       <chr> "10.3758/s13423-014-0618-1", "10.109...
## $ Affiliations              <chr> "Department of Psychological Science...
## $ Authors with affiliations <chr> "Nishiyama, R., Department of Psycho...
## $ Abstract                  <chr> "In research on verbal working memor...
## $ Author Keywords           <chr> "Active maintenance; Semantic repres...
## $ Index Keywords            <chr> "adult; attention; female; human; ma...
## $ References                <chr> "Allen, C.M., Martin, R.C., Martin, ...
## $ Correspondence Address    <chr> "Nishiyama, R.; Department of Psycho...
## $ Abbreviated Source Title  <chr> "Psychonom. Bull. Rev.", "The New Ha...
## $ Document Type             <chr> "Article", "Book", "Book", "Article"...
## $ Pages                     <int> 1006, 535, 496, 441, 370, 341, 325, ...

Read the above code as “Take d, then pass the results to arrange(), then pass the results to glimpse()”. Each %>% is a “pipe”, and takes the results of whatever is being computed on its left-hand side, and passes the result to its right-hand side (or line below, it’s really helpful to use line breaks and put each statement on its own line.) I’ll be using pipes extensively in the code below. I also won’t inspect the code after every change to prevent printing unnecessary boring stuff all the time (although you should when you’re working with your own data.)

So, back to Pages. Looking at the data, there are some strange values, with some items having more than 1000 pages. For now we’ll leave those values in, but looking closely at the Source title column, we can see that there are publications in the data that shouldn’t be there. I guess I was a little too lazy with my initial Scopus search terms, but we can easily fix that problem here in R.

Publications

First, let’s list all the unique publications in the data:

publications <- unique(d$`Source title`)
publications
##  [1] "Attention, Perception, and Psychophysics"                                                
##  [2] "Attention, perception & psychophysics"                                                   
##  [3] "Attention, Perception, & Psychophysics"                                                  
##  [4] "Behavior Research Methods"                                                               
##  [5] "Behavior research methods"                                                               
##  [6] "Modern Research Methods for The Study of Behavior in Organizations"                      
##  [7] "Modern Research Methods for the Study of Behavior in Organizations"                      
##  [8] "The New Handbook of Methods in Nonverbal Behavior Research"                              
##  [9] "Cognitive, Affective and Behavioral Neuroscience"                                        
## [10] "Cognitive, affective & behavioral neuroscience"                                          
## [11] "Cognitive, Affective, & Behavioral Neuroscience"                                         
## [12] "Learning and Behavior"                                                                   
## [13] "Memory and Cognition"                                                                    
## [14] "Memory & cognition"                                                                      
## [15] "Exploring Implicit Cognition: Learning, Memory, and Social Cognitive Processes"          
## [16] "Memory & Cognition"                                                                      
## [17] "Imagery, Memory and Cognition: Essays in Honor of Allan Paivio"                          
## [18] "Imagery, Memory and Cognition (PLE: Memory):Essays in Honor of Allan Paivio"             
## [19] "Working Memory and Human Cognition"                                                      
## [20] "Grounding Cognition: The Role of Perception and Action in Memory, Language, and Thinking"
## [21] "Psychonomic Bulletin and Review"                                                         
## [22] "Psychonomic bulletin & review"                                                           
## [23] "Psychonomic Bulletin & Review"

Let’s remove the non-psychonomic journal ones. I’ll just subset the list of publications to ones that I want to keep, and then filter() the data such that Source title matches one of the correct journals.

psychonomic_publications <- publications[c(1:5, 9:14, 16, 21:23)]
d <- filter(d, `Source title` %in% psychonomic_publications)
unique(d$`Source title`)
##  [1] "Attention, Perception, and Psychophysics"        
##  [2] "Attention, perception & psychophysics"           
##  [3] "Attention, Perception, & Psychophysics"          
##  [4] "Behavior Research Methods"                       
##  [5] "Behavior research methods"                       
##  [6] "Cognitive, Affective and Behavioral Neuroscience"
##  [7] "Cognitive, affective & behavioral neuroscience"  
##  [8] "Cognitive, Affective, & Behavioral Neuroscience" 
##  [9] "Learning and Behavior"                           
## [10] "Memory and Cognition"                            
## [11] "Memory & cognition"                              
## [12] "Memory & Cognition"                              
## [13] "Psychonomic Bulletin and Review"                 
## [14] "Psychonomic bulletin & review"                   
## [15] "Psychonomic Bulletin & Review"

Much better. Next, we’ll have to make sure that there’s only one unique name per journal. To do this, we’ll use simple text manipulation tools to change all “and”s to “&”s, and convert the names to Title Case. Before applying these operations to the actual data, I’ll work on just the unique names R object psychonomic_publications. This makes it easier to troubleshoot the operations.

psychonomic_publications <- psychonomic_publications %>% 
    str_replace_all("and", "&") %>%  # Replace all "and"s with "&"s
    toTitleCase()  # Make everything Title Case
unique(psychonomic_publications)
## [1] "Attention, Perception, & Psychophysics"         
## [2] "Attention, Perception & Psychophysics"          
## [3] "Behavior Research Methods"                      
## [4] "Cognitive, Affective & Behavioral Neuroscience" 
## [5] "Cognitive, Affective, & Behavioral Neuroscience"
## [6] "Learning & Behavior"                            
## [7] "Memory & Cognition"                             
## [8] "Psychonomic Bulletin & Review"

Still not quite correct. For two journals there are two versions, one where “&” is preceded by a comma and one where it isn’t. I’ll take the commas out.

psychonomic_publications %>% 
    str_replace_all(", &", " &") %>% 
    unique()
## [1] "Attention, Perception & Psychophysics"         
## [2] "Behavior Research Methods"                     
## [3] "Cognitive, Affective & Behavioral Neuroscience"
## [4] "Learning & Behavior"                           
## [5] "Memory & Cognition"                            
## [6] "Psychonomic Bulletin & Review"

Great, now I’ll just apply those operations to the actual data.

d <- d %>% 
    mutate(Publication = str_replace_all(`Source title`, "and", "&"),
           Publication = toTitleCase(Publication),
           Publication = str_replace_all(Publication, ", &", " &"))

Here, I used the mutate() function, which works within a data frame. I passed d to mutate() with the pipe operator, and could then refer to variables without prepending them with d$. I called the product variable Publication because it’s easier to write than Source title.

Lastly, we’ll do the same for the abbreviated publication names. I don’t know what the official abbreviations are, but I can guess that the majority of the articles have it correct. The first step is then to count the occurrences of each unique abbreviation. To do this, I group the data by (group_by()) the abbreviation, and then count():

abbs <- group_by(d, `Abbreviated Source Title`) %>% 
    count()
abbs
## # A tibble: 11 x 2
##          `Abbreviated Source Title`     n
##                               <chr> <int>
##  1         Atten Percept Psychophys    26
##  2      Atten. Percept. Psychophys.  1558
##  3                Behav Res Methods    38
##  4              Behav. Res. Methods  1385
##  5       Cogn Affect Behav Neurosci     7
##  6 Cogn. Affective Behav. Neurosci.   691
##  7                    Learn. Behav.   455
##  8                       Mem Cognit    13
##  9                     Mem. Cognit.  1441
## 10                 Psychon Bull Rev     3
## 11            Psychonom. Bull. Rev.  1997

Then just figure out how to replace all the bad ones with the good ones. Unfortunately this has to be done manually because computers aren’t that smart. The following is ugly, but it works. I first take just the abbreviations into a variable, then replace the bad ones with the good ones inside the data frame (working with an abbreviated variable name).

abbrs <- abbs[[1]]
d <- mutate(d,
       pa = `Abbreviated Source Title`,
       pa = ifelse(pa == abbrs[1], abbrs[2], pa),
       pa = ifelse(pa == abbrs[3], abbrs[4], pa),
       pa = ifelse(pa == abbrs[5], abbrs[6], pa),
       pa = ifelse(pa == abbrs[8], abbrs[9], pa),
       pa = ifelse(pa == abbrs[10], abbrs[11], pa),
       Pub_abbr = pa)
unique(d$Pub_abbr)
## [1] "Atten. Percept. Psychophys."      "Behav. Res. Methods"             
## [3] "Mem. Cognit."                     "Cogn. Affective Behav. Neurosci."
## [5] "Learn. Behav."                    "Psychonom. Bull. Rev."
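
If the chain of ifelse() calls feels too manual, a named lookup vector does the same job. This is a sketch of my own (not from the original post), and it assumes a dplyr version that supports splicing a named vector into recode() with !!!; values that are not in the lookup are left unchanged:

abbr_map <- c("Atten Percept Psychophys"   = "Atten. Percept. Psychophys.",
              "Behav Res Methods"          = "Behav. Res. Methods",
              "Cogn Affect Behav Neurosci" = "Cogn. Affective Behav. Neurosci.",
              "Mem Cognit"                 = "Mem. Cognit.",
              "Psychon Bull Rev"           = "Psychonom. Bull. Rev.")
d <- mutate(d, Pub_abbr = recode(`Abbreviated Source Title`, !!!abbr_map))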

OK, we’re pretty much done with cleaning for now. The last thing to do is to remove the unnecessary variables we created while cleaning the data:

d <- select(d, 
            -`Source title`,
            -`Page start`, -`Page end`,
            -pa)

I don’t like spaces, so I’ll also take them out of the variable names:

names(d) <- str_replace_all(names(d), " ", "_")

Leaving us with this to work with:

glimpse(d)
## Observations: 7,614
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Cited_by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors_with_affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author_Keywords           <chr> "3D perception; Perception and actio...
## $ Index_Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence_Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated_Source_Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document_Type             <chr> "Article", "Article", "Article", "Ar...
## $ Pages                     <int> 7, 13, 19, NA, 12, 12, 32, 19, 6, 22...
## $ Publication               <chr> "Attention, Perception & Psychophysi...
## $ Pub_abbr                  <chr> "Atten. Percept. Psychophys.", "Atte...

Load cleaned data

The clean data file can be loaded from https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda.

load(url("https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda"))

Univariate figures

This data is very rich, but for this initial exploration, we’ll be focusing on simple uni- and bi-variate figures. In later parts of this tutorial (future blog posts), we’ll start visualizing more complex data, such as keywords, authors, and networks.

Let’s begin with some simple summaries of what’s been going on, over time, for each journal. We’ll make extensive use of the dplyr data manipulation verbs in the following plots. Take a look at the linked website if they are unfamiliar to you; although I will explain more complicated cases, I won’t bother with every detail.

Publication years

First, we’re interested in a simple question: How many articles has each journal published in each year? Are there temporal patterns, and do they differ between journals? The steps for creating this plot are commented in the code below. Roughly, in order of appearance, we first add grouping information to the data frame, then summarise the data on those groups, create a new column for a publication-level summary, and then order the publications on it. We then pass the results to ggplot2 and draw a line and point graph.

d %>% 
    group_by(Publication, Year) %>% 
    summarise(n = n()) %>%  # For each group (Publication, Year), count nobs
    mutate(nsum = sum(n)) %>%  # For each Publication, sum of nobs {1}
    ungroup() %>%  # Ungroup data frame
    mutate(Publication = reorder(Publication, nsum)) %>%  # {2}
    ggplot(aes(x=Year, y=n)) +
    geom_point() +
    geom_line() +
    labs(y = "Number of articles published") +
    facet_wrap("Publication", ncol=2)

The code in {1} works because the summarise() command on the line above it drops the last grouping factor assigned by group_by(), so the data are still grouped by Publication. And since I called mutate() instead of summarise(), the sum of the number of observations was added to each row, instead of collapsing the data by group. {2} made the Publication variable into a factor that’s ordered by the sum of articles across years for that journal; this ordering is useful because when I use facet_wrap() below, the panels are nicely ordered (journals with fewer total papers in the upper left, increasing toward the bottom right.)
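
Here is a tiny toy example (my own, not from the original post) of that grouping behavior:

toy <- tibble(g1 = c("a", "a", "b"), g2 = c("x", "y", "x"), n = c(1, 2, 4))
toy %>% 
    group_by(g1, g2) %>% 
    summarise(n = sum(n)) %>%  # Result is still grouped by g1
    mutate(nsum = sum(n))      # Adds the per-g1 total to every row: 3, 3, 4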

If this code doesn’t make any sense, I strongly recommend loading the data into your own R workspace, and executing the code line by line.

I’m a little surprised at the relative paucity of articles in Learning and Behavior, and it looks like there might be some upward trajectory going on in PBR. Should I run some regressions? Probably not.

Citations

Let’s look at numbers of citations next. For these, simple histograms will do, and we expect to see some heavy tails. I will pre-emptively clip the x-axis at 100 citations, because there’s probably a few rogue publications with a truly impressive citation count, which might make the other, more common, values less visible.

d %>% 
    ggplot(aes(Cited_by)) +
    geom_histogram(binwidth=1, col="black") +
    coord_cartesian(xlim=c(0,100)) +
    facet_wrap("Publication", ncol=2)

Nice. But how many papers are there with more than 100 citations? What’s the proportion of those highly cited papers?

d %>%
    group_by(Publication) %>%
    mutate(total = n()) %>% 
    filter(Cited_by > 100) %>% 
    summarise(total = unique(total),
              over_100 = n(),
              ratio = signif(over_100/total, 1)) %>% 
    arrange(total)
## # A tibble: 6 x 4
##                                      Publication total over_100 ratio
##                                            <chr> <int>    <int> <dbl>
## 1                            Learning & Behavior   455        1 0.002
## 2 Cognitive, Affective & Behavioral Neuroscience   698       32 0.050
## 3                      Behavior Research Methods  1426       53 0.040
## 4                             Memory & Cognition  1452       17 0.010
## 5          Attention, Perception & Psychophysics  1583        8 0.005
## 6                  Psychonomic Bulletin & Review  2000       54 0.030

Excellent. We have now reviewed some of the powerful dplyr verbs and are ready for slightly more complex summaries of the data.

Bivariate figures

Publication year and citations

We can visualize citation counts over years of publication, but this might not be the most informative graph (because recent articles naturally can’t have been cited as often as older ones). Here, I’ll show how to plot neat box plots (without outliers) with ggplot2:

d %>% 
    filter(Cited_by < 100) %>% 
    ggplot(aes(Year, Cited_by)) +
    geom_boxplot(aes(group = Year), outlier.color = NA) +
    facet_wrap("Publication", ncol=2, scales = "free_y")

Pages and citations

The last figure for today is a scatterplot of pages (how many pages the article spans) versus number of citations. For scatterplots I usually prefer empty points (shape=1).

d %>% 
    filter(Pages < 100, Cited_by < 100) %>% 
    ggplot(aes(Pages, Cited_by)) +
    geom_point(shape=1) +
    facet_wrap("Publication", ncol=2)

Summary

We have now scratched the surface of these data, and there is clearly a lot to explore:

glimpse(d)
## Observations: 7,614
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Cited_by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors_with_affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author_Keywords           <chr> "3D perception; Perception and actio...
## $ Index_Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence_Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated_Source_Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document_Type             <chr> "Article", "Article", "Article", "Ar...
## $ Pages                     <int> 7, 13, 19, NA, 12, 12, 32, 19, 6, 22...
## $ Publication               <chr> "Attention, Perception & Psychophysi...
## $ Pub_abbr                  <chr> "Atten. Percept. Psychophys.", "Atte...

In next posts on this data, we’ll look at networks and other more advanced topics. Stay tuned.

Scraping Statistical Modeling, Causal Inference, and Social Science

Hi everybody!

Today I’ll share some tips on elementary web scraping with R. Our goal is to download and process an entire Wordpress blog into an R data frame, which can then be visualized and analyzed for fun and discovery.

We’ll scrape andrewgelman.com, the home of “Statistical Modeling, Causal Inference, and Social Science”. This is a very popular statistics and social science blog, whose main author, Andrew Gelman, is a famous statistician and political scientist, and the author of such classic holiday thrillers as Bayesian Data Analysis and Data Analysis Using Regression and Multilevel/Hierarchical Models.

Why? Web scraping is a useful skill for anyone interested in information, broadly speaking, and is very easy and fun with R. We’re using this specific blog as a useful example, because Wordpress blogs are very popular, and many of the specific commands here will be useful without much editing, if you’re interested in scraping a Wordpress blog. Of course, using these general tools, you’ll be able to scrape any website you’d like!

After we’ve scraped the data from Wordpress, we’ll simply visualize the most popular (in terms of number of comments) blog posts from 2016.

Scraping from Wordpress

The first step is to scrape each Wordpress index page of the blog (each page lists 20 blog posts) into a local variable. That variable can then be processed to extract the relevant information about each post into a row of a data frame. I put these steps into a simple function and ran it through a multicore pblapply() call to scrape the entire blog in under a minute.

The actual scraping is done by functions from the rvest package. The key steps are

1. Parse the HTML from a given URL.
2. Extract the required information from the raw HTML using “node tags” (browse a page’s HTML to identify the relevant tag names).
3. Apply the scraping function to all Wordpress pages (each containing 20 blog posts).

library(rvest)      # read_html(), html_nodes(), html_text(), html_attr()
library(pbapply)    # pblapply()
library(tidyverse)  # tibble(), bind_rows(), %>%

N <- 353  # Number of WordPress pages (20 posts per page)
urls <- paste0("http://andrewgelman.com/page/", 1:N)
getsite <- function(url) {
    site <- read_html(url)  # (1.)
    d <- tibble(
        title = site %>% html_nodes(".posttitle") %>% html_text(),  # (2.)
        url = site %>% html_nodes(".posttitle") %>% 
            html_nodes("a") %>% html_attr("href"),
        author = site %>% html_nodes(".postauthor") %>% html_text(),
        date = site %>% html_nodes(".postdate") %>% html_text(),
        tags = site %>% html_nodes(".postcat") %>% html_text(),
        comments = site %>% html_nodes(".postcomment") %>% html_text(),
        text = site %>% html_nodes(".postentry") %>% html_text())
    return(d)
}
wp_pagelist <- vector("list", N)
wp_pagelist <- pblapply(urls, getsite, cl=8)  # (3.)
d <- bind_rows(wp_pagelist)
save(d, file = "smciss.rda", compress=T)

I’ve saved the resulting object as smciss.rda, which you can download here.

load(url("https://mvuorre.github.io/data/smciss.rda"))

Cleaning the data

Some of these variables need to be processed a bit.

First we’ll convert the date into an R date variable, and include a time variable that also contains the timestamp of the post. The functions I use to convert the strings to times and dates add one NA at the end (most likely because the format string, passed as an extra unnamed argument, is itself treated as an input date and fails to parse, which also explains the warning below), so I have to remove the last entries.

library(lubridate)
time <- dmy_hm(d$date, "%d %0m %Y, %I:%M %p")
## Warning: 1 failed to parse.
date <- as_date(dmy_hm(d$date, "%d %0m %Y, %I:%M %p"))
## Warning: 1 failed to parse.
length(date) - nrow(d)
## [1] 1
d$date <- date[-length(date)]
d$time <- time[-length(time)]

Then clean up the tags:

library(stringr)
select(d, tags)
## # A tibble: 7,041 x 1
##                                                          tags
##                                                         <chr>
##  1           Filed under Miscellaneous Statistics, Sociology.
##  2                                    Filed under Literature.
##  3                            Filed under Economics, Zombies.
##  4          Filed under Literature, Miscellaneous Statistics.
##  5                                 Filed under Uncategorized.
##  6         Filed under Political Science, Sociology, Zombies.
##  7 Filed under Decision Theory, Economics, Political Science.
##  8                    Filed under Decision Theory, Economics.
##  9                     Filed under Political Science, Sports.
## 10                                       Filed under Zombies.
## # ... with 7,031 more rows
d$tags <- str_replace_all(d$tags, "Filed under", "") %>%
    str_replace_all("[.]", "") %>%
    str_trim()

And comments:

d$comments <- as.integer( str_extract(d$comments, "[0-9]+") )

And the data is pretty clean:

select(d, date, time, tags, comments)
## # A tibble: 7,041 x 4
##          date                time
##        <date>              <dttm>
##  1 2016-12-27 2016-12-27 13:39:00
##  2 2016-12-27 2016-12-27 09:23:00
##  3 2016-12-26 2016-12-26 09:10:00
##  4 2016-12-25 2016-12-25 09:47:00
##  5 2016-12-24 2016-12-24 15:00:00
##  6 2016-12-24 2016-12-24 09:39:00
##  7 2016-12-23 2016-12-23 18:36:00
##  8 2016-12-23 2016-12-23 09:57:00
##  9 2016-12-23 2016-12-23 00:07:00
## 10 2016-12-22 2016-12-22 11:06:00
## # ... with 7,031 more rows, and 2 more variables: tags <chr>,
## #   comments <int>
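
The post set out to visualize the most-commented posts of 2016; a quick sketch of such a figure (my own code, not necessarily how the original was drawn) could reuse the lollipop style from the plots above:

d %>% 
    filter(year(date) == 2016) %>% 
    top_n(20, comments) %>% 
    mutate(title = str_trunc(title, 60),  # Shorten long titles for the y-axis
           title = reorder(title, comments)) %>% 
    ggplot(aes(comments, title)) +
    geom_point() +
    geom_segment(aes(xend = 0, yend = title)) +
    labs(x = "Number of comments", y = "")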

GitHub-style waffle plots in R

I have a little python script on my work computer that tracks hours I’ve spent in the office (assuming that those hours begin with a computer login, and end in a computer shutdown–sadly this is almost always true.) I don’t really do anything with the information, and it is really a remnant of my time as a contract worker. However, for shared or billed projects, such automation of data gathering can be very valuable. For example, GitHub does a nice job of displaying a user’s contributions to code repositories over time:

This heatmap represents the activity (some function of number of contributions) of a user over time, plotting more activity as darker tones for each day, represented as little squares. Each light gray square indicates a day with no activity.

In this post, I’ll show how to recreate this “waffle” plot in R with the ggplot2 plotting package.

Simulate activity data

First, I’ll create a data frame for the simulated data, initializing the data types:

library(dplyr)
d <- data_frame(
    date = as.Date(1:813, origin = "2014-01-01"),
    year = format(date, "%Y"),
    week = as.integer(format(date, "%W")) + 1,  # Week starts at 1
    day = factor(weekdays(date, T), 
                 levels = rev(c("Mon", "Tue", "Wed", "Thu",
                                "Fri", "Sat", "Sun"))),
    hours = 0)

Then I’ll simulate hours worked for each date. I’ll do this separately for weekends and weekdays to make the resulting data a little more realistic, and also add missing values (that is, days when no work occurred).

set.seed(1)
# Simulate weekends
weekends <- filter(d, grepl("S(at|un)", day))
# Hours worked are (might be) poisson distributed
weekends$hours <- rpois(nrow(weekends), lambda = 4)
# Simulate missing days with probability .7
weekends$na <- rbinom(nrow(weekends), 1, 0.7)
weekends$hours <- ifelse(weekends$na, NA, weekends$hours)

# Simulate weekdays
weekdays <- filter(d, !grepl("S(at|un)", day))
weekdays$hours <- rpois(nrow(weekdays), lambda = 8)  # Greater lambda
weekdays$na <- rbinom(nrow(weekdays), 1, 0.1)  # Smaller p(missing)
weekdays$hours <- ifelse(weekdays$na, NA, weekdays$hours)

# Concatenate weekends and weekdays and arrange by date
d <- bind_rows(weekends, weekdays) %>% 
    arrange(date) %>%  # Arrange by date
    select(-na)  # Remove na column
d
# # A tibble: 813 x 5
#          date  year  week    day hours
#        <date> <chr> <dbl> <fctr> <int>
#  1 2014-01-02  2014     1    Thu     6
#  2 2014-01-03  2014     1    Fri    13
#  3 2014-01-04  2014     1    Sat    NA
#  4 2014-01-05  2014     1    Sun    NA
#  5 2014-01-06  2014     2    Mon     9
#  6 2014-01-07  2014     2    Tue     5
#  7 2014-01-08  2014     2    Wed    NA
#  8 2014-01-09  2014     2    Thu     9
#  9 2014-01-10  2014     2    Fri     8
# 10 2014-01-11  2014     2    Sat    NA
# # ... with 803 more rows

Waffle-plot function

Then I’ll create a function that draws the waffle plot. If you have similarly structured data, you can copy and paste the function and use it on your own data (but make sure the following packages are installed).

library(ggplot2)
library(viridis)  # Color palette
library(ggthemes)
gh_waffle <- function(data, pal = "D", dir = -1){
    
    p <- ggplot(data, aes(x = week, y = day, fill = hours)) +
        scale_fill_viridis(name="Hours", 
                           option = pal,  # Variable color palette
                           direction = dir,  # Variable color direction
                           na.value = "grey93",
                           limits = c(0, max(data$hours))) +
        geom_tile(color = "white", size = 0.4) +
        facet_wrap("year", ncol = 1) +
        scale_x_continuous(
            expand = c(0, 0),
            breaks = seq(1, 52, length = 12),
            labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                       "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +
        theme_tufte(base_family = "Helvetica") +
        theme(axis.title = element_blank(),
              axis.ticks = element_blank(),
              legend.position = "bottom",
              legend.key.width = unit(1, "cm"),
              strip.text = element_text(hjust = 0.01, face = "bold", size = 12))
    
    print(p)
}

Using the waffle plot function

gh_waffle() takes three arguments. The first, data, is a data frame with columns date (type: Date), year (number or character), week (number), day (an ordered factor, so that days run from top to bottom in the plot), and hours (number). The second, pal, specifies one of the four color palettes used by the viridis color scale and can be "A", "B", "C", or "D"; the default is "D", which gives the most GitHub-like result. The last, dir, specifies the direction of the color scale and can be either -1 or 1; the default, -1, reproduces the GitHub look, with darker colors for more hours.

Using gh_waffle() with the default settings, only providing the data frame d, gives the following result:

gh_waffle(d)
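
If you also want to save the figure, note that printing a ggplot object returns it invisibly, so you can capture the return value of gh_waffle() and pass it to ggsave() (a sketch; the file name and dimensions are made up):

library(ggplot2)
# Capture the (invisibly returned) plot object and write it to disk
p <- gh_waffle(d)
ggsave("waffle.png", plot = p, width = 10, height = 4)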

Here’s the same plot with the three other color palettes:

for (pal in c("A", "B", "C")) {
    gh_waffle(d, pal)
}

Unless you’re plotting some really awful events, I think it’s best to stick with the default color palette. That’s it for today, folks.

Quantifying my research interests

While attempting to describe my research interests in words, I got distracted (as I usually do) by the wonderful world of data visualization. One thing led to another, and I ended up creating a graphical depiction of them instead.

The following figure can be thought of as a visual depiction of my research interests, given a set of assumptions. Some of these assumptions are:

  • The journals a researcher reads are a reasonable indicator of the interests of that researcher
  • Authors of journal articles give some thought to choosing their keywords (those things that journals sometimes require you to list with a submission so people can tell what on earth your crazy article is about).
  • My Zotero library captures what I’ve been reading in the past few years to some reasonable extent
  • What I’ve been reading captures what I’ve been interested in!

To quantify my interests, I exported my Zotero library as a large .csv file and used R to count and list all the keywords, journals, and keyword-journal pairs. The library contained ~500 journal articles, yielding ~2000 journal-keyword pairs, each representing the occurrence of a keyword within a specific journal. I then plotted these data, with journal name on the x-axis and keyword on the y-axis. The following snippet shows the R code that produces the plot:

# Create bivariate journal-tag data frame
bivar <- dd %>% select(tags, journal)

# Loop to unlist nested tags
newd <- data.frame(pub = NULL, tag = NULL, stringsAsFactors = F)
for (i in 1:nrow(bivar)) {
    tag <- unlist(strsplit(bivar[i,"tags"], "; "))
    tag <- unlist(strsplit(tag, "/ "))
    tag <- unlist(strsplit(tag, ", "))
    tag <- gsub("^\\s+|\\s+$", "", tag) # Trim whitespace
    tl <- length(tag)
    pub <- bivar[i,"journal"]
    rep_pub <- rep(pub, tl)
    tmp <- data.frame(cbind(pub = rep_pub, tag = tag), stringsAsFactors = F)
    newd <- rbind(newd, tmp)
}
# Bivariate journal-tag plot
newd %>% group_by(pub) %>% 
    mutate(pn = n()) %>% 
    ungroup() %>%
    group_by(tag) %>% 
    mutate(tn = n()) %>%
    ungroup() %>%
    group_by(tag, pub) %>%
    mutate(nn = n()) %>%
    filter(nn >= 2) %>% 
    ggplot(aes(x=reorder(pub, desc(pn)), 
               y=reorder(tag, tn))) +
    scale_size_continuous(range=c(3, 5)) +
    scale_color_gradient(low="dodgerblue", high="tomato2") +
    geom_point(aes(size=nn, col=nn)) +
    labs(title="Figure 1. Lots of small text and some colored points") +
    theme_bw() + 
    theme(axis.title = element_blank(), 
          panel.grid.major = element_line(linetype=1),
          axis.text = element_text(size=12),
          axis.text.x = element_text(angle=45, hjust=1), 
          legend.position="none",
          plot.margin = unit(c(0,0,0,-1), "mm"))
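
(As an aside before looking at the figure: the unnesting loop above could be written more compactly with tidyr's separate_rows(), which splits the delimited tag strings into one row per tag in a single step. A sketch, assuming the same dd data frame with tags and journal columns:)

library(dplyr)
library(tidyr)
# One row per journal-tag pair: split on "; ", "/ " or ", ", then trim whitespace
newd <- dd %>%
    select(pub = journal, tag = tags) %>%
    separate_rows(tag, sep = "; |/ |, ") %>%
    mutate(tag = trimws(tag))

This also avoids growing a data frame with rbind() inside a loop, which gets slow as the number of rows increases.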

There are a few additional bits of information in this graph: each dot’s size indicates the number of occurrences of that keyword-journal pair (larger dots mean more frequent pairs; I only show pairs with more than one observation). Although graphs should never include redundant information, I totally included some, and colored the most frequent pairs red.

The data are sorted on the two axes as follows:

  • The y-axis is sorted from top to bottom in order of decreasing frequency of the keyword. I’ve invested quite some time in reading up on agency and time perception, so they are on top. The bottom three (action, arousal, and bayesian methods) are less frequently encountered in the data. This points out an important limitation of the method: I’m really into Bayesian methods (in statistics and cognitive modeling), but the concept gets split into several keywords (bayesian, bayes, bayes factor; see above).
  • The x-axis is sorted from left to right in order of decreasing frequency of the journal. Thus, my most-read journal is Consciousness and Cognition (a fine paper indeed), with newcomers in the Frontiers camp up there as well. I’ve given less attention (sorry) to the British Journal of Mathematical and Statistical Psychology and the Scandinavian Journal of Psychology.

Furthermore, you’ll notice a lot of empty space in the graph. This makes sense because not all journals publish articles with all these keywords. I take this feature of the graph to be a sign of my studious nature and wide scope of interests (p < .05).

Conclusion

  • Journals should have shorter names (looking at you, APA and Royal Society of London).
  • Authors should give at least some thought to their keywords.
  • It’s really difficult to scrape information from APA’s 1994-style website. Perhaps APA should consider joining us in the next millennium!
  • I’m really interested in action, agency, time, and Bayesian stuff. And public health / Gesundheitswesen, apparently.
  • I also made a wordcloud, but it stinks (like most wordclouds).

Packages used: ggplot2, dplyr, stringr

IMDB movie ratings

I really like movies. And R. Although IMDB gives some basic numbers on what a user has watched and when, here I’ll use the superpowers of R to plot my brief history of IMDB ratings.

Get my ratings from IMDB.com

The last time I did this, I scraped the “my ratings” page on IMDB, but now they have a button that lets you download the data as a .csv file.
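
Reading the exported file into R is then a single readr call (a sketch; "ratings.csv" is a made-up file name, and the column names in IMDB's export may differ from the rating, IMDB_rating, genres, type, title, and n_ratings columns used below, so some renaming is likely needed):

library(readr)
library(dplyr)
# Hypothetical file name; inspect the exported columns before renaming them
d <- read_csv("ratings.csv")
glimpse(d)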

Plot ratings

Average ratings by year

Ratings

Horror

library(stringr)
d$difference <- d$rating - d$IMDB_rating
filter(d, 
       type=="Feature Film",
       !is.na(str_match(genres, "horror"))) %>%
    ggplot(aes(y=reorder(title, rating))) +
    geom_point(aes(x=rating)) +
    geom_point(aes(x=IMDB_rating), shape=1) +
    geom_segment(aes(x=IMDB_rating, xend=rating,
                     y=title, yend=title), size=0.3, col="gray50") +
    labs(y="", x="Rating") 

This plot shows my ratings (filled circles) and how they differ from the average IMDB ratings (empty circles).

Drama

filter(d, 
       type=="Feature Film",
       !is.na(str_match(genres, "drama"))) %>%
    ggplot(aes(y=reorder(title, rating))) +
    geom_point(aes(x=rating)) +
    geom_point(aes(x=IMDB_rating), shape=1) +
    geom_segment(aes(x=IMDB_rating, xend=rating,
                     y=title, yend=title), size=0.3, col="gray50") +
    labs(y="", x="Rating") 

Movie popularity and ratings

Plot popularity versus ratings

arrange(d, n_ratings) %>% 
    filter(type!="TV Series") %>%
    select(title, n_ratings, rating, IMDB_rating) %>%
    ggplot(aes(y=n_ratings)) +
    geom_point(aes(x=rating)) +
    geom_smooth(aes(x=rating), method="lm", se=F) +
    geom_point(aes(x=IMDB_rating), shape=1) +
    geom_smooth(aes(x=IMDB_rating), method="lm", se=F, lty=2) +
    scale_y_log10(breaks=c(1000, 10000, 100000, 1000000),
                  labels=c("1k", "10k", "100k", "1m")) +
    labs(y="Number of ratings", x="Rating") +
    theme_classic()

The solid line shows the relationship between my rating and the number of ratings the movie has on IMDB, and the dashed line shows the equivalent relationship for the IMDB average ratings. It makes sense that the most popular movies also tend to receive the highest ratings.
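
To put a rough number on these relationships, you can correlate the (log) number of ratings with both rating columns (a quick sketch using the same data):

library(dplyr)
# Correlation of log10(popularity) with my ratings and with IMDB's averages
d %>%
    filter(type != "TV Series") %>%
    summarise(r_mine = cor(log10(n_ratings), rating, use = "pairwise.complete.obs"),
              r_imdb = cor(log10(n_ratings), IMDB_rating, use = "pairwise.complete.obs"))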

Next, it would be fantastic to obtain more data from IMDB, but that turns out to be a little trickier than downloading a single user’s ratings.

Updated August 2015

Packages used: ggplot2, dplyr, reshape2, knitr, ggthemes, XML
