## Quantitative literature review with R: Exploring Psychonomic Society Journals, Part II

(This post directly continues from part I of Quantitative literature review with R. Please read that first for context.) In this tutorial, I’ll show how to use R to quantitatively explore, analyze, and visualize a broad research literature, using all Psychonomic Society publications between 2004 and 2016. Part I focused on data cleaning and simple figures, but here we will look at relational data by visualizing some network structures in the data.

We’ll be working with R, so if you want to follow along on your own computer, fire up R and load up the required R packages:

library(tidyverse)
library(stringr)

Then you can load the cleaned data from https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda using R:

load(url("https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda"))

We’ve got data on quite a few variables which we could imagine are interconnected, such as Tags (Keywords), Authors, and the articles themselves (through the References variable).

glimpse(d)
## Observations: 7,614
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Cited_by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors_with_affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author_Keywords           <chr> "3D perception; Perception and actio...
## $ Index_Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence_Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated_Source_Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document_Type             <chr> "Article", "Article", "Article", "Ar...
## $ Pages                     <int> 7, 13, 19, NA, 12, 12, 32, 19, 6, 22...
## $ Publication               <chr> "Attention, Perception & Psychophysi...
## $ Pub_abbr                  <chr> "Atten. Percept. Psychophys.", "Atte...

Let’s begin by looking at the articles’ Keywords.

# Tags

These are the things that you have to choose to describe your article when you submit it for publication. They are called Keywords in the data, but I’m going to refer to them as Tags for brevity. Also, I will only work with the Author_Keywords, because it seems like Scopus’s automatic index keywords are mostly nonsense or missing. So our first operation is to rename the Author_Keywords variable to Tags, and get rid of the automatic keywords altogether.

d <- rename(d, Tags = Author_Keywords) %>%
    select(-Index_Keywords)

Before moving on to networks, let’s take a look at what these tags are like: How many are there? Do people use them consistently?

## Exploring univariate tags

Our first goal is to simply visualize the most common tags. However, recall that the Tags variable in the data is quite messy:

## # A tibble: 3 x 1
##   Tags
##   <chr>
## 1 3D perception; Perception and action; Scene perception; Space perception
## 2 Auditory perception; Cognitive performance; Dissonance; Distraction; Inharm
## 3 Attention; Automaticity; Cognitive control; Selective attention

Each value of Tags is a kind of a list, whose individual items are separated by semicolons. What we’d like to have is a variable (column in a data frame) where every item is a single tag, so we need to do some kind of data unwrapping operation.

The unwrapping can be done by simply splitting the Tags variable on each semicolon (and space) found in each string, but this creates a strange variable:

d %>%
    select(Tags) %>%
    mutate(Tags = str_split(Tags, "; "))
## # A tibble: 7,614 x 1
##        Tags
##      <list>
##  1 <chr [4]>
##  2 <chr [6]>
##  3 <chr [4]>
##  4 <chr [5]>
##  5 <chr [5]>
##  6 <chr [3]>
##  7 <chr [3]>
##  8 <chr [6]>
##  9 <chr [3]>
## 10 <chr [3]>
## # ... with 7,604 more rows

This is where things get really interesting.
A few lines above, we conceptualized each value (cell) of Tags as a sort of a list. We have now operationalized that idea by creating a Tags variable that is an R list-column, meaning that each cell can contain a complex R object. Here, each value is a list of character values. To make this notion concrete, look at the first row (slice(1)) of the Tags variable (.$Tags, . is a placeholder of the data being passed by %>% arrows) by using the familiar str() function:

d %>%
select(Tags) %>%
mutate(Tags = str_split(Tags, "; ")) %>%
slice(1) %>%
.$Tags %>% str()
## List of 1
##  $ : chr [1:4] "3D perception" "Perception and action" "Scene perception" "Space perception"

All the tags associated with the first article in our data frame are in a neat list of neat character strings. Brilliant. This data format is quite foreign to those of us who are used to working with strictly flat rectangular data, but it has a certain recursive beauty to it. Now that our list is concrete, we can unnest it to create a Tags variable where each row is one individual tag.

d %>%
select(Tags) %>%
mutate(Tags = str_split(Tags, "; ")) %>%
unnest(Tags)
## # A tibble: 22,375 x 1
##                     Tags
##                    <chr>
##  1         3D perception
##  2 Perception and action
##  3      Scene perception
##  4      Space perception
##  5   Auditory perception
##  6 Cognitive performance
##  7            Dissonance
##  8           Distraction
##  9         Inharmonicity
## 10                 Music
## # ... with 22,365 more rows

Excellent! This data frame has one variable, Tags, where each observation is an individual tag. Now we can go ahead and count and summarise the tags. From here onwards, it’s all regular dplyr verbs:

d %>%
select(Tags) %>%
mutate(Tags = str_split(Tags, "; ")) %>%
unnest(Tags) %>%
filter(!is.na(Tags)) %>%  # Remove missing observations
group_by(Tags) %>%  # Group by unique Tags
count() %>%  # And count them
ungroup() %>%
top_n(40) %>%  # Choose n most numerous Tags
mutate(Tags = reorder(Tags, n)) %>%  # {1}
ggplot(aes(n, Tags)) +
geom_point() +
geom_segment(aes(xend = 0, yend=Tags)) +
scale_x_continuous(limits = c(0, 400), expand = c(0, 0)) +
labs(x="Number of articles", y="",
subtitle = "Top 40 tags in Psychonomics Journals",
caption = "Data for years 2004-2016 only")

The line marked {1} reorders the Tags by the n variable, so that they appear in order in the figure. You could also make a word cloud with these data, but I think I’ll refrain this time. Instead, I’ll explore popular tags by journal. The “pipeline” to get tags by journal is almost identical, except that we’ll keep the Publication variable in the data.

The following code is a little more complicated than is required for a simple plot, but I wanted to order the Tags inside each publication, which required a little bit of extra work as described by @drsimonj. Below, I’ll work with journal abbreviations Pub_abbr.

pd <- d %>%
select(Pub_abbr, Tags) %>%
mutate(Tags = str_split(Tags, "; ")) %>%
unnest(Tags) %>%
filter(!is.na(Tags)) %>%
group_by(Pub_abbr, Tags) %>%  # Group by unique Tags AND journals
count() %>%  # And count them
group_by(Pub_abbr) %>%
top_n(10, n) %>%  # Select top 10 tags within each journal
ungroup() %>%
arrange(Pub_abbr, n) %>%  # Arrange by n within journals
mutate(order = row_number())  # Will order Tags on this
pd
## # A tibble: 62 x 4
##                       Pub_abbr                  Tags     n order
##                          <chr>                 <chr> <int> <int>
##  1 Atten. Percept. Psychophys.     Attentional blink    33     1
##  2 Atten. Percept. Psychophys.        Working memory    36     2
##  3 Atten. Percept. Psychophys. Visual working memory    40     3
##  4 Atten. Percept. Psychophys.      Visual attention    41     4
##  5 Atten. Percept. Psychophys.   Selective attention    49     5
##  6 Atten. Percept. Psychophys.     Visual perception    51     6
##  7 Atten. Percept. Psychophys.   Attentional capture    52     7
##  8 Atten. Percept. Psychophys.         Eye movements    61     8
##  9 Atten. Percept. Psychophys.         Visual search   116     9
## 10 Atten. Percept. Psychophys.             Attention   180    10
## # ... with 52 more rows

That was more complicated than usual, so I wanted to print out the intermediate result as well. We can now use it to plot:

pd %>%
ggplot(aes(n, order)) +
geom_point() +
geom_segment(aes(xend = 0, yend=order)) +
scale_y_continuous(breaks = pd$order, labels = pd$Tags) +
scale_x_continuous(expand = c(0.02, 0)) +
labs(x="Number of articles", y="",
subtitle = "Top 10 tags within 6 Psychonomics Journals",
caption = "Data for years 2004-2016 only") +
facet_wrap("Pub_abbr", scales="free", ncol=2)

Finally, let’s look at the overall distribution of the tags as a histogram:
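The code for that histogram isn’t shown above, so here is a sketch of how it could be produced. To keep the block self-contained and runnable it uses a tiny made-up Tags column; with the article data you would start from d instead, and the binning and log-axis choices are my own assumptions:

```r
library(tidyverse)
library(stringr)

# Hypothetical stand-in for d's Tags column (made-up tags)
toy <- tibble(Tags = c(
  "Attention; Working memory",
  "Attention",
  "Working memory; Eye movements",
  NA
))

# Same unnest pipeline as above, then count each unique tag
tag_counts <- toy %>%
  mutate(Tags = str_split(Tags, "; ")) %>%
  unnest(Tags) %>%
  filter(!is.na(Tags)) %>%
  count(Tags)

# Histogram of tag frequencies; a log x-axis helps because
# most tags occur only once or twice
ggplot(tag_counts, aes(n)) +
  geom_histogram(bins = 30) +
  scale_x_log10() +
  labs(x = "Tag frequency", y = "Number of unique tags")
```

With the real data the same pipeline applies; only the input data frame changes.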

There are about 8000 unique Tags, and about 7000 of them occur fewer than three times in the corpus. Now that we have some idea of what kinds of tags there are in the data, roughly what their distributions might look like, and how they vary across the journals, we are ready to look at networks of tags.

# Network analysis of Tags

Network analysis (or graph analysis) is a vast and complicated area, and my graph analysis skills are a little rusty. So here we’ll simply illustrate how to format the data in an appropriate way, and draw a few example graphs. I referred to this website more than once while working on this (also thanks to @jalapic for tips).

The first graph will be about Tags. We’ll try to illustrate how and which tags co-occur within articles in the Psychonomic Society journals between 2004 and 2016.

90% of data analysis is data cleaning and manipulation, and network analysis is no exception. The first data manipulation step is to create an adjacency matrix where each row and column is a tag, where the values in the cells are the number of times those tags co-occurred (were present in the same article.) I found this stack overflow answer and this blogpost helpful when preparing the data.

In order to create the adjacency matrix we need to solve a sequence of smaller problems. First, we’ll take two columns from the data, one for Tags and another one that uniquely identifies each article (doesn’t matter what it is, here we use DOI). Then we unnest the Tags into single tags per row as detailed above, and remove empty tags.

m <- d %>%
select(Tags, DOI) %>%
mutate(Tags = str_split(Tags, "; ")) %>%
unnest(Tags) %>%
filter(!is.na(Tags))
m
## # A tibble: 18,949 x 2
##                          DOI                  Tags
##                        <chr>                 <chr>
##  1 10.3758/s13414-015-1049-4         3D perception
##  2 10.3758/s13414-015-1049-4 Perception and action
##  3 10.3758/s13414-015-1049-4      Scene perception
##  4 10.3758/s13414-015-1049-4      Space perception
##  5 10.3758/s13414-015-1042-y   Auditory perception
##  6 10.3758/s13414-015-1042-y Cognitive performance
##  7 10.3758/s13414-015-1042-y            Dissonance
##  8 10.3758/s13414-015-1042-y           Distraction
##  9 10.3758/s13414-015-1042-y         Inharmonicity
## 10 10.3758/s13414-015-1042-y                 Music
## # ... with 18,939 more rows

Next, I remove from the data all tags that occurred only a handful of times (eight or fewer), because otherwise the data would be too large and messy.

m <- group_by(m, Tags) %>%
mutate(n = n()) %>%
filter(n > 8) %>%
ungroup() %>%
select(-n)

Next, we need a variable that identifies whether the tag occurred in the article or not:

m$occur <- 1
slice(m, 1:2)
## # A tibble: 2 x 3
##                         DOI                  Tags occur
##                       <chr>                 <chr> <dbl>
## 1 10.3758/s13414-015-1049-4 Perception and action     1
## 2 10.3758/s13414-015-1049-4      Scene perception     1

Note that occur is always one. To create the matrix we need all the article-tag pairs, even ones that didn’t occur; for those pairs, occur will be zero. The tidyr package in tidyverse provides the perfect helper function for this purpose:

m <- m %>%
    complete(DOI, Tags, fill = list(occur = 0))
m
## # A tibble: 960,500 x 3
##                          DOI               Tags occur
##                        <chr>              <chr> <dbl>
##  1 10.1007/s13420-010-0001-7     3-D perception     0
##  2 10.1007/s13420-010-0001-7             Action     0
##  3 10.1007/s13420-010-0001-7         Adaptation     0
##  4 10.1007/s13420-010-0001-7             Affect     0
##  5 10.1007/s13420-010-0001-7 Age of acquisition     0
##  6 10.1007/s13420-010-0001-7              Aging     0
##  7 10.1007/s13420-010-0001-7           Amygdala     0
##  8 10.1007/s13420-010-0001-7   Animal cognition     0
##  9 10.1007/s13420-010-0001-7       Anticipation     0
## 10 10.1007/s13420-010-0001-7            Anxiety     0
## # ... with 960,490 more rows

Now we have a data frame with 960,500 rows; each row signifies a possible article (DOI)-tag co-occurrence. Most occurs will, of course, be zero.

Now we can create the co-occurrence matrix. We first cast the data into a wide incidence matrix, such that each row is a unique tag, each column is a DOI, and the cell values are occurrences:

library(reshape2)
y <- m %>%
    acast(Tags~DOI, value.var = "occur", fun.aggregate = sum)
y[1:2, 1:2]
##                10.1007/s13420-010-0001-7 10.3758/s13414-010-0002-9
## 3-D perception                         0                         0
## Action                                 0                         0
y[y > 1] <- 1  # Remove duplicates

All we need to do then is to make a symmetric matrix where both rows and columns are Tags, and the cell values are numbers of co-occurrences. So we matrix multiply y with its transpose:

co <- y %*% t(y)
co[1:3, 1:3]
##                3-D perception Action Adaptation
## 3-D perception             11      0          1
## Action                      0     20          0
## Adaptation                  1      0         13

## Visualizing the adjacency matrix

It’s important to understand what co is: a symmetric matrix whose rows and columns are tags, with their co-occurrence counts at the intersections. For example, Action and 3-D perception never co-occurred in the data, but Adaptation and 3-D perception co-occurred once. We are now in an excellent position to investigate which tags tend to co-occur.

First, let’s use igraph to create the graph object g, and take out some unnecessary information from it:

library(igraph)
g <- graph.adjacency(co, weighted = TRUE, mode = "undirected")
g <- simplify(g)

g is now an igraph R object that contains edges and vertices, and information about them. For example, we can use V(g)$name to look at the vertex names. Some of these are quite long, so I’ll create another vertex attribute, label, which is the name with each space replaced by a newline character. This will make some of the graphs prettier.

V(g)$label <- gsub(" ", "\n", V(g)$name)

Networks are very complicated visualizations, and if you simply call plot(g), you’ll get an ugly mess of a figure:

plot(g)

There are more parameters to the plot() command than can be listed here, but I tried out a few different options, and this is what I ended up with:

plot(g,
layout = layout_on_sphere(g),
edge.width = E(g)$weight/7,
edge.color = adjustcolor("dodgerblue1", .7),
vertex.color = adjustcolor("white", .8),
vertex.frame.color = adjustcolor("dodgerblue4", .4),
vertex.size = 1.9,
vertex.label.family = "Helvetica",
vertex.label.font = 1,
vertex.label.cex = .7,
vertex.label.color = "black",
vertex.label = ifelse(degree(g) > 25, V(g)$label, NA))
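Before moving on, here is a quick self-contained sanity check of the co-occurrence construction used above, on a made-up 3-tag by 3-article incidence matrix (the tag names and DOI labels are hypothetical, not the article data):

```r
# Rows are tags, columns are articles; 1 = the tag occurred
# in that article (made-up values)
y <- matrix(c(1, 1, 0,
              1, 0, 1,
              0, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("Attention", "Memory", "Priming"),
                            c("doi1", "doi2", "doi3")))

# y %*% t(y): diagonal entries count articles per tag,
# off-diagonal entries count articles where both tags occurred
co <- y %*% t(y)
co
#           Attention Memory Priming
# Attention         2      1       1
# Memory            1      2       1
# Priming           1      1       2
```

This is exactly the operation applied to the full incidence matrix above; the toy output makes it easy to verify by hand that each off-diagonal cell counts shared articles.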

# The Bayes Network

The next network will be fairly involved. Our goal is to visualize the prominence and connectedness of Psychonomic Society’s “Bayesians”. Who are these Bayesians, you might ask? Our definition here is simply: “An author who has published one or more articles in Psychonomic Journals (2004 - 2016) where the keywords contained the word ‘Bayes’.” This is obviously suboptimal because someone might have published an article called “Why I’m not a Bayesian”, yet they would be included in this graph, but we’ll go with this for now.

I’ll let the code speak for itself this time. If it is impossible to parse, I recommend downloading the data (see top of post) to your own computer and running the code yourself.

m <- d %>%
select(Authors, Tags, DOI) %>%
mutate(Authors = str_split(Authors, ", ")) %>%
unnest(Authors) %>%
filter(!is.na(Tags), !is.na(Authors)) %>%
filter(grepl("bayes", Tags, ignore.case = T))
m$occur <- 1
m <- m %>%
    complete(DOI, Authors, fill = list(occur = 0)) %>%
    select(-Tags)
m
## # A tibble: 11,076 x 3
##                          DOI      Authors occur
##                        <chr>        <chr> <dbl>
##  1 10.3758/s13414-010-0073-7     Bååth R.     0
##  2 10.3758/s13414-010-0073-7    Bakker M.     0
##  3 10.3758/s13414-010-0073-7       Bar M.     0
##  4 10.3758/s13414-010-0073-7  Barker J.D.     0
##  5 10.3758/s13414-010-0073-7      Barr S.     0
##  6 10.3758/s13414-010-0073-7     Biele G.     0
##  7 10.3758/s13414-010-0073-7 Bittner J.L.     0
##  8 10.3758/s13414-010-0073-7  Borsboom D.     0
##  9 10.3758/s13414-010-0073-7    Bratch A.     0
## 10 10.3758/s13414-010-0073-7    Bröder A.     0
## # ... with 11,066 more rows

In these data, we have simply counted whether an author “occurred” in an article, and only included articles whose Tags included the word “bayes” in any shape or form. This network will therefore visualize the connectedness of “Bayesians”, as in who has published papers with whom. We can then re-run pretty much the same steps as in the previous network.

y <- m %>%
    acast(Authors~DOI, value.var = "occur", fun.aggregate = sum)
co <- y %*% t(y)
co[1:3, 1:3]
##           Bååth R. Bakker M. Bar M.
## Bååth R.         1         0      0
## Bakker M.        0         1      0
## Bar M.           0         0      1

g <- graph.adjacency(co, weighted = TRUE, mode = "undirected")
g <- simplify(g)
V(g)$label <- gsub(" ", "\n", V(g)$name)
deg <- degree(g)
plot(g,
    layout = layout_with_kk(g),
    edge.width = E(g)$weight,
vertex.size=1.9,
vertex.label.family = "Helvetica",
vertex.label.font=1,
vertex.label.cex = sqrt(5+deg)/4,
vertex.label.color = "black",
main = "The Bayesians of Psychonomic Society")

More to come later on…

## Quantitative literature review with R: Exploring Psychonomic Society Journals, Part I

Literature reviews, both casual and formal (or qualitative / quantitative), are an important part of research. In this tutorial, I’ll show how to use R to quantitatively explore, analyze, and visualize a research literature, using Psychonomic Society’s publications between 2005 and 2016.

Commonly, literature reviews are rather informal; you read a review paper or 10, maybe suggested to you by experts in the field, and then delve deeper into the topics presented in those papers. A more formal version starts with a database search, where you try out various search terms, and collect a more exhaustive list of publications related to your research questions. Here, we are going to have a bit of fun and explore a large heterogeneous literature (kind of) quantitatively, focusing on data manipulation (not the bad kind!) and visualization.

I used Scopus to search for all journal articles published in the Psychonomic Society’s journals (not including CR:PI because it’s so new) between the years 2005 and 2016 (inclusive). For real applications, you would use databases instead, and search across journals, but for a general illustration, I wanted to focus on these publications instead, because Psychonomic journals are closest to my research interests.

I limited the search to these years for two reasons: metadata on articles before the early 2000s are not always very good, and if I had included everything, the data set would simply have been too large. For the example here, these search terms resulted in about 7,500 journal articles.

The Scopus search interface is very straightforward, so I won’t talk about how to use it. Some things to keep in mind, though, are that Scopus does not index all possible sources of citations, and therefore when we discuss citations (and other similar info) below, we’ll only discuss citations that Scopus knows about. For example, Google Scholar often indexes more sources, but it doesn’t have the same search tools as Scopus. Scopus allows you to download the data in various formats, and I chose .csv files, which I have stored on my computer. When you download raw data from Scopus, make sure that you organize the files well and that they are available to the R session.

You can also download the raw or cleaned up data from my website (see below), if you’d like to work on the same data set, or replicate the code below.

# Load and clean raw data from Scopus

Let’s first load all the relevant R packages that we’ll use to do our work for us:

library(tidyverse)
library(stringr)
library(tools)

## Load the data from .csv files

When I first downloaded the files, I put them in a subdirectory of my project (my website). I downloaded the data such that one file contains the data of one journal. (The raw journal-specific .csv files are available at https://github.com/mvuorre/mvuorre.github.io/tree/master/data/scopus, but I recommend you download the combined R data file instead because it’s easier, see below.)

The first thing is to combine these file names to a list of file names:

fl <- list.files("static/data/scopus/", pattern=".csv", full.names = T)

Then we can apply a function over the list of file names that reads all the files to an R object.

d <- lapply(fl,
            read_csv,
            col_types = cols(`Page start` = col_character(),
                             `Page end` = col_character(),
                             `Art. No.` = col_character()))

lapply applies the read_csv() function to the file list fl, and the col_types = cols() argument is there to ensure that those columns are read as character variable types (the original data is a little messy, and this ensures that all the files are read with identical column types.) Next, we’ll then turn the list of data frames d into a single data frame by binding them together row-wise (think stacking spreadsheets vertically on top of each other). Let’s also take a quick look at the result with glimpse()

d <- bind_rows(d)
glimpse(d)
## Observations: 7,697
## Variables: 25
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Source title              <chr> "Attention, Perception, and Psychoph...
## $ Volume                    <int> 78, 78, 78, 78, 78, 78, 78, 78, 78, ...
## $ Issue                     <int> 3, 3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, ...
## $ Art. No.                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Page start                <chr> "915", "946", "848", "938", "902", "...
## $ Page end                  <chr> "922", "959", "867", NA, "914", "735...
## $ Page count                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Cited by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Link                      <chr> "https://www.scopus.com/inward/recor...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors with affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author Keywords           <chr> "3D perception; Perception and actio...
## $ Index Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence Address    <chr> "Button, C.; School of Physical Educ...
## $ Publisher                 <chr> "Springer New York LLC", "Springer N...
## $ Abbreviated Source Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document Type             <chr> "Article", "Article", "Article", "Ar...
## $ Source                    <chr> "Scopus", "Scopus", "Scopus", "Scopu...
## $ EID                       <chr> "2-s2.0-84951928856", "2-s2.0-849519...

OK, looks pretty clean already! (I also tried to get data from ISI Web of Science, or whatever it’s now called, but the raw data was so messy that I’d have to get paid to even look at it.)

The combined raw data file is available as an R data file at https://mvuorre.github.io/data/scopus/scopus-psychonomics-combined-2004-2016.rda, and can be easily loaded into the R workspace with:

load(url("https://mvuorre.github.io/data/scopus/scopus-psychonomics-combined-2004-2016.rda"))

## Clean the data

OK, now that you’ve loaded the combined data file from my website into your R session with the above command, you can get to work cleaning the data. If you’d like to skip the data cleaning steps, scroll down to “Load cleaned data”.

### Select relevant variables

There are a couple of columns (variables) in the data that I’m just going to drop because I’m not interested in them at all. When you export data from Scopus, you can choose which variables to include, and it looks like I accidentally included some that I don’t care about. I’ll use the select() function, and prepend each variable I want to drop with a minus sign. Variable names with spaces or strange characters are wrapped in backticks. It’s always good to drop unnecessary variables to remove clutter: talking about cognitive psychology, it’s probably less cognitively demanding to work with a data frame with 5 variables than with one with 50.

d <- select(d,
    -Volume, -Issue, -`Art. No.`, -`Page count`,
    -Link, -Publisher, -Source, -EID)

I dropped Page count because it is empty for most cells; I’ll re-compute it from the page start and end variables later on.

Next, we’re going to go through the columns, but leave the more complicated ones (such as Authors) for later. What do I mean by “complicated”?
Some variables are not individual values, and their desired final shape depends on how you want to use them. Much later on, when we get to network and graph analyses, you’ll see that we want to “unlist” the Keywords columns, for example. That’s why we’ll do that cleaning and data manipulation closer to the actual analysis. But for “univariate” columns, we can get started right away.

### Numerical columns: Year, pages, citations

Year is already an integer and has no missing values, so we don’t need to do anything about it. We can, however, make the page number information numerical and combine it into one column.

glimpse(d)
## Observations: 7,697
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Source title              <chr> "Attention, Perception, and Psychoph...
## $ Page start                <chr> "915", "946", "848", "938", "902", "...
## $ Page end                  <chr> "922", "959", "867", NA, "914", "735...
## $ Cited by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors with affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author Keywords           <chr> "3D perception; Perception and actio...
## $ Index Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated Source Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document Type             <chr> "Article", "Article", "Article", "Ar...
d$Pages <- as.integer(d$`Page end`) - as.integer(d$`Page start`)

How do we know that the Pages variable is now correct? We can’t rely on the math itself, because the data values themselves may be incorrect. It is good practice to check the data after each operation, and there are many ways of looking at the data. I like to arrange the data (either descending or ascending) on the newly created variable, which reveals any weird values. Here’s how to arrange the data on the Pages variable in descending order:

glimpse(arrange(d, desc(Pages)))
## Observations: 7,697
## Variables: 18
## $ Authors                   <chr> "Nishiyama R.", "Harrigan J., Rosent...
## $ Title                     <chr> "Active maintenance of semantic repr...
## $ Year                      <int> 2014, 2008, 2013, 2014, 2014, 2014, ...
## $ Source title              <chr> "Psychonomic Bulletin and Review", "...
## $ Page start                <chr> "583", "1", "1", "1", "1", "1", "1",...
## $ Page end                  <chr> "1589", "536", "497", "442", "371", ...
## $ Cited by                  <int> 1, 59, NA, NA, NA, NA, 79, 47, NA, N...
## $ DOI                       <chr> "10.3758/s13423-014-0618-1", "10.109...
## $ Affiliations              <chr> "Department of Psychological Science...
## $ Authors with affiliations <chr> "Nishiyama, R., Department of Psycho...
## $ Abstract                  <chr> "In research on verbal working memor...
## $ Author Keywords           <chr> "Active maintenance; Semantic repres...
## $ Index Keywords            <chr> "adult; attention; female; human; ma...
## $ References                <chr> "Allen, C.M., Martin, R.C., Martin, ...
## $ Correspondence Address    <chr> "Nishiyama, R.; Department of Psycho...
## $ Abbreviated Source Title  <chr> "Psychonom. Bull. Rev.", "The New Ha...
## $ Document Type             <chr> "Article", "Book", "Book", "Article"...
## $ Pages                     <int> 1006, 535, 496, 441, 370, 341, 325, ...

That code is quite difficult to read, however, because it contains nested function calls: arrange() is inside glimpse(). This is where we first really encounter the principles of R’s tidyverse philosophy: instead of nesting, we can pipe results from one function to another. To “pipe” the above code, we’d write it this way instead:

d %>%
    arrange(desc(Pages)) %>%
    glimpse()
## Observations: 7,697
## Variables: 18
## $ Authors                   <chr> "Nishiyama R.", "Harrigan J., Rosent...
## $ Title                     <chr> "Active maintenance of semantic repr...
## $ Year                      <int> 2014, 2008, 2013, 2014, 2014, 2014, ...
## $ Source title              <chr> "Psychonomic Bulletin and Review", "...
## $ Page start                <chr> "583", "1", "1", "1", "1", "1", "1",...
## $ Page end                  <chr> "1589", "536", "497", "442", "371", ...
## $ Cited by                  <int> 1, 59, NA, NA, NA, NA, 79, 47, NA, N...
## $ DOI                       <chr> "10.3758/s13423-014-0618-1", "10.109...
## $ Affiliations              <chr> "Department of Psychological Science...
## $ Authors with affiliations <chr> "Nishiyama, R., Department of Psycho...
## $ Abstract                  <chr> "In research on verbal working memor...
## $ Author Keywords           <chr> "Active maintenance; Semantic repres...
## $ Index Keywords            <chr> "adult; attention; female; human; ma...
## $ References                <chr> "Allen, C.M., Martin, R.C., Martin, ...
## $ Correspondence Address    <chr> "Nishiyama, R.; Department of Psycho...
## $ Abbreviated Source Title  <chr> "Psychonom. Bull. Rev.", "The New Ha...
## $ Document Type             <chr> "Article", "Book", "Book", "Article"...
## $ Pages                     <int> 1006, 535, 496, 441, 370, 341, 325, ...

Read the above code as “take d, then pass it to arrange(), then pass the result to glimpse()”. Each %>% is a “pipe”: it takes the result of whatever is computed on its left-hand side and passes it to its right-hand side (or the line below; it’s really helpful to use line breaks and put each statement on its own line). I’ll be using pipes extensively in the code below. I also won’t inspect the data after every change, to prevent printing unnecessary boring output all the time (although you should when you’re working with your own data).

So, back to Pages. Looking at the data, there are some strange values, with some items having more than 1,000 pages. For now we’ll leave those values in, but looking closely at the Source title column, we can see that there are publications in the data that shouldn’t be there. I guess I was a little too lazy with my initial Scopus search terms, but we can easily fix that problem here in R.

### Publications

First, let’s list all the unique publications in the data:

publications <- unique(d$`Source title`)
publications
##  [1] "Attention, Perception, and Psychophysics"
##  [2] "Attention, perception & psychophysics"
##  [3] "Attention, Perception, & Psychophysics"
##  [4] "Behavior Research Methods"
##  [5] "Behavior research methods"
##  [6] "Modern Research Methods for The Study of Behavior in Organizations"
##  [7] "Modern Research Methods for the Study of Behavior in Organizations"
##  [8] "The New Handbook of Methods in Nonverbal Behavior Research"
##  [9] "Cognitive, Affective and Behavioral Neuroscience"
## [10] "Cognitive, affective & behavioral neuroscience"
## [11] "Cognitive, Affective, & Behavioral Neuroscience"
## [12] "Learning and Behavior"
## [13] "Memory and Cognition"
## [14] "Memory & cognition"
## [15] "Exploring Implicit Cognition: Learning, Memory, and Social Cognitive Processes"
## [16] "Memory & Cognition"
## [17] "Imagery, Memory and Cognition: Essays in Honor of Allan Paivio"
## [18] "Imagery, Memory and Cognition (PLE: Memory):Essays in Honor of Allan Paivio"
## [19] "Working Memory and Human Cognition"
## [20] "Grounding Cognition: The Role of Perception and Action in Memory, Language, and Thinking"
## [21] "Psychonomic Bulletin and Review"
## [22] "Psychonomic bulletin & review"
## [23] "Psychonomic Bulletin & Review"

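As a quick aside, here is a minimal sketch of how the pipe works, on a toy character vector invented for illustration (not taken from the data):

```r
library(magrittr)  # provides %>%; the tidyverse packages re-export it

# These two calls are equivalent: the pipe passes its left-hand side
# as the first argument to the function on its right-hand side.
sort(unique(c("b", "a", "b")))
c("b", "a", "b") %>% unique() %>% sort()
## [1] "a" "b"
```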
Let's remove the publications that aren't Psychonomic Society journals. I'll subset the list of publications to the ones I want to keep, and then filter() the data such that Source title matches one of the correct journals.

psychonomic_publications <- publications[c(1:5, 9:14, 16, 21:23)]
d <- filter(d, `Source title` %in% psychonomic_publications)
unique(d$`Source title`)
##  [1] "Attention, Perception, and Psychophysics"
##  [2] "Attention, perception & psychophysics"
##  [3] "Attention, Perception, & Psychophysics"
##  [4] "Behavior Research Methods"
##  [5] "Behavior research methods"
##  [6] "Cognitive, Affective and Behavioral Neuroscience"
##  [7] "Cognitive, affective & behavioral neuroscience"
##  [8] "Cognitive, Affective, & Behavioral Neuroscience"
##  [9] "Learning and Behavior"
## [10] "Memory and Cognition"
## [11] "Memory & cognition"
## [12] "Memory & Cognition"
## [13] "Psychonomic Bulletin and Review"
## [14] "Psychonomic bulletin & review"
## [15] "Psychonomic Bulletin & Review"

Much better. Next, we'll have to make sure that there's only one unique name per journal. To do this, we'll use simple text manipulation tools to change all "and"s to "&"s, and convert the names to Title Case (toTitleCase() is in the tools package that ships with R). Before applying these operations to the actual data, I'll work on just the unique names in the psychonomic_publications object. This makes it easier to troubleshoot the operations.

library(tools)
psychonomic_publications <- psychonomic_publications %>%
    str_replace_all("and", "&") %>%  # Replace all "and"s with "&"s
    toTitleCase()  # Make everything Title Case
unique(psychonomic_publications)
## [1] "Attention, Perception, & Psychophysics"
## [2] "Attention, Perception & Psychophysics"
## [3] "Behavior Research Methods"
## [4] "Cognitive, Affective & Behavioral Neuroscience"
## [5] "Cognitive, Affective, & Behavioral Neuroscience"
## [6] "Learning & Behavior"
## [7] "Memory & Cognition"
## [8] "Psychonomic Bulletin & Review"

Still not quite correct. For two journals there are two versions: one where the "&" is preceded by a comma, and one where it isn't. I'll take the commas out.
psychonomic_publications %>%
    str_replace_all(", &", " &") %>%
    unique()
## [1] "Attention, Perception & Psychophysics"
## [2] "Behavior Research Methods"
## [3] "Cognitive, Affective & Behavioral Neuroscience"
## [4] "Learning & Behavior"
## [5] "Memory & Cognition"
## [6] "Psychonomic Bulletin & Review"

Great, now I'll just apply those operations to the actual data:

d <- d %>%
    mutate(Publication = str_replace_all(`Source title`, "and", "&"),
           Publication = toTitleCase(Publication),
           Publication = str_replace_all(Publication, ", &", " &"))

Here, I used the mutate() function, which works within a data frame. I passed d to mutate() with the pipe operator, and could then refer to variables without prepending them with d$. I called the product variable Publication because it's easier to write than Source title.
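To see those two string operations in isolation, here's a toy example on invented journal-name strings (not taken from the data):

```r
library(stringr)
library(tools)  # toTitleCase() lives in the tools package shipped with R

x <- c("Memory and cognition", "Memory & Cognition")
x <- str_replace_all(x, "and", "&")  # replace "and" with "&"
x <- toTitleCase(x)                  # convert to Title Case
unique(x)
## [1] "Memory & Cognition"
```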

Lastly, we’ll do the same for the abbreviated publication names. I don’t know what the official abbreviations are, but I can guess that the majority of the articles have it correct. The first step is then to count the occurrences of each unique abbreviation. To do this, I group the data by (group_by()) the abbreviation, and then count():

abbs <- group_by(d, `Abbreviated Source Title`) %>%
count()
abbs
## # A tibble: 11 x 2
##          Abbreviated Source Title     n
##                               <chr> <int>
##  1         Atten Percept Psychophys    26
##  2      Atten. Percept. Psychophys.  1558
##  3                Behav Res Methods    38
##  4              Behav. Res. Methods  1385
##  5       Cogn Affect Behav Neurosci     7
##  6 Cogn. Affective Behav. Neurosci.   691
##  7                    Learn. Behav.   455
##  8                       Mem Cognit    13
##  9                     Mem. Cognit.  1441
## 10                 Psychon Bull Rev     3
## 11            Psychonom. Bull. Rev.  1997

Then we just need to figure out how to replace the bad ones with the good ones. Unfortunately this has to be done manually, because computers aren't that smart. The following is ugly, but it works. I first take just the abbreviations into a variable, then replace the bad ones with the good ones inside the data frame (working with an abbreviated variable name).

abbrs <- abbs[[1]]
d <- mutate(d,
            pa = `Abbreviated Source Title`,
            pa = ifelse(pa == abbrs[1], abbrs[2], pa),
            pa = ifelse(pa == abbrs[3], abbrs[4], pa),
            pa = ifelse(pa == abbrs[5], abbrs[6], pa),
            pa = ifelse(pa == abbrs[8], abbrs[9], pa),
            pa = ifelse(pa == abbrs[10], abbrs[11], pa),
            Pub_abbr = pa)
unique(d$Pub_abbr)
## [1] "Atten. Percept. Psychophys."      "Behav. Res. Methods"
## [3] "Mem. Cognit."                     "Cogn. Affective Behav. Neurosci."
## [5] "Learn. Behav."                    "Psychonom. Bull. Rev."

OK, we're pretty much done with cleaning for now. The last thing to do is to remove the unnecessary variables we created while cleaning the data:

d <- select(d, -`Source title`, -`Page start`, -`Page end`, -pa)

I don't like spaces, so I'll also take them out of the variable names:

names(d) <- str_replace_all(names(d), " ", "_")

Leaving us with this to work with:

glimpse(d)
## Observations: 7,614
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Cited_by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors_with_affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author_Keywords           <chr> "3D perception; Perception and actio...
## $ Index_Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence_Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated_Source_Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document_Type             <chr> "Article", "Article", "Article", "Ar...
## $ Pages                     <int> 7, 13, 19, NA, 12, 12, 32, 19, 6, 22...
## $ Publication               <chr> "Attention, Perception & Psychophysi...
## $ Pub_abbr                  <chr> "Atten. Percept. Psychophys.", "Atte...

The clean data file can be loaded from https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda.

load(url("https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda"))

# Univariate figures

These data are very rich, but for this initial exploration we'll focus on simple univariate and bivariate figures. In later parts of this tutorial (future blog posts), we'll start visualizing more complex data, such as keywords, authors, and networks.

Let’s begin with some simple summaries of what’s been going on, over time, for each journal. We’ll make extensive use of the dplyr data manipulation verbs in the following plots. Take a look at the linked website if they are unfamiliar to you; although I will explain more complicated cases, I won’t bother with every detail.

## Publication years

First, we’re interested in a simple question: How many articles has each journal published in each year? Are there temporal patterns, and do they differ between journals? The steps for creating this plot are commented in the code below. Roughly, in order of appearance, we first add grouping information to the data frame, then summarise the data on those groups, create a new column for a publication-level summary, and then order the publications on it. We then pass the results to ggplot2 and draw a line and point graph.

d %>%
    group_by(Publication, Year) %>%
    summarise(n = n()) %>%  # For each group (Publication, Year), count nobs
    mutate(nsum = sum(n)) %>%  # For each Publication, sum of nobs {1}
    ungroup() %>%  # Ungroup data frame
    mutate(Publication = reorder(Publication, nsum)) %>%  # {2}
    ggplot(aes(x = Year, y = n)) +
    geom_point() +
    geom_line() +
    labs(y = "Number of articles published") +
    facet_wrap("Publication", ncol = 2)

The code in {1} works because the summarise() command on the line above it ungroups the last grouping factor assigned by group_by(). And because I then called mutate() instead of summarise(), the sum of the number of observations was added to each row, instead of collapsing the data by group. {2} made the Publication variable into a factor that's ordered by the sum of articles across years for that journal; this ordering is useful because when I use facet_wrap() below, the panels are nicely ordered (journals with fewer total papers in the upper left, increasing toward the bottom right.)

If this code doesn’t make any sense, I strongly recommend loading the data into your own R workspace, and executing the code line by line.
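Here's a self-contained toy example (with invented values) of the grouping behavior described in {1}: summarise() drops the last grouping level, so the following mutate() computes within the remaining groups:

```r
library(dplyr)

toy <- tibble(j = c("A", "A", "A", "B"), year = c(1, 1, 2, 1))
toy %>%
    group_by(j, year) %>%
    summarise(n = n()) %>%  # result is now grouped by j only
    mutate(nsum = sum(n))   # sums within each j: the A rows get 3, B gets 1
```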

I'm a little surprised at the relative paucity of articles in Learning & Behavior, and it looks like there might be some upward trajectory going on in PBR. Should I run some regressions? Probably not.

## Citations

Let’s look at numbers of citations next. For these, simple histograms will do, and we expect to see some heavy tails. I will pre-emptively clip the x-axis at 100 citations, because there’s probably a few rogue publications with a truly impressive citation count, which might make the other, more common, values less visible.

d %>%
    ggplot(aes(Cited_by)) +
    geom_histogram(binwidth = 1, col = "black") +
    coord_cartesian(xlim = c(0, 100)) +
    facet_wrap("Publication", ncol = 2)

Nice. But how many papers are there with more than 100 citations? What’s the proportion of those highly cited papers?

d %>%
    group_by(Publication) %>%
    mutate(total = n()) %>%
    filter(Cited_by > 100) %>%
    summarise(total = unique(total),
              over_100 = n(),
              ratio = signif(over_100 / total, 1)) %>%
    arrange(total)
## # A tibble: 6 x 4
##                                      Publication total over_100 ratio
##                                            <chr> <int>    <int> <dbl>
## 1                            Learning & Behavior   455        1 0.002
## 2 Cognitive, Affective & Behavioral Neuroscience   698       32 0.050
## 3                      Behavior Research Methods  1426       53 0.040
## 4                             Memory & Cognition  1452       17 0.010
## 5          Attention, Perception & Psychophysics  1583        8 0.005
## 6                  Psychonomic Bulletin & Review  2000       54 0.030

Excellent. We have now reviewed some of the powerful dplyr verbs and are ready for slightly more complex summaries of the data.

# Bivariate figures

## Publication year and citations

We can visualize citation counts over years of publication, although this might not be the most informative graph (because recent articles naturally can't have been cited as often as older ones.) Here, I'll show how to plot neat box plots (without outliers) with ggplot2:

d %>%
    filter(Cited_by < 100) %>%
    ggplot(aes(Year, Cited_by)) +
    geom_boxplot(aes(group = Year), outlier.color = NA) +
    facet_wrap("Publication", ncol = 2, scales = "free_y")

## Pages and citations

The last figure for today is a scatterplot of pages (how many pages the article spans) versus number of citations. For scatterplots I usually prefer empty points (shape=1).

d %>%
    filter(Pages < 100, Cited_by < 100) %>%
    ggplot(aes(Pages, Cited_by)) +
    geom_point(shape = 1) +
    facet_wrap("Publication", ncol = 2)

# Summary

We have now scratched the surface of these data, and there is clearly a lot to explore:

glimpse(d)
## Observations: 7,614
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Cited_by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors_with_affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author_Keywords           <chr> "3D perception; Perception and actio...
## $ Index_Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence_Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated_Source_Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document_Type             <chr> "Article", "Article", "Article", "Ar...
## $ Pages                     <int> 7, 13, 19, NA, 12, 12, 32, 19, 6, 22...
## $ Publication               <chr> "Attention, Perception & Psychophysi...
## $ Pub_abbr                  <chr> "Atten. Percept. Psychophys.", "Atte...

In the next posts on these data, we'll look at networks and other more advanced topics. Stay tuned.

## How to create within-subject scatter plots in R with ggplot2

Today, we'll take a look at creating a specific type of visualization for data from a within-subjects experiment (also known as repeated measures, although that label can sometimes be misleading). You'll often see within-subject data visualized as bar graphs (condition means, and maybe the mean difference if you're lucky), but alternatives exist, and today we'll take a look at within-subjects scatterplots.

For example, Ganis and Kievit (2015) asked 54 people to observe, on each trial, two 3-D shapes with various rotations and judge whether the two shapes were the same or not. There were 4 angles (0, 50, 100, and 150 degree rotations), but for simplicity, today we'll only look at items that were not rotated with respect to each other, and items rotated 50 degrees. The data are freely available (thanks!!) in Excel format, but to make them more easily available for readers, I've uploaded them in a .csv file, which we can load directly from an R script.

d <- read_csv("https://mvuorre.github.io/data/ganis-kievit-2015.csv")

First, let's clean the data a little bit by selecting and renaming a subset of the variables, and then take only the trials with 0 or 50 degrees rotation.

d <- transmute(
    d,
    id = Subject,
    angle = angle,
    correct = correct.incorrect,
    rt = Time
    ) %>%
    filter(angle < 51)
d
## # A tibble: 2,592 x 4
##       id angle correct    rt
##    <int> <int>   <int> <int>
##  1     1     0       1  1355
##  2     1    50       1  1685
##  3     1    50       1  1237
##  4     1     0       1  1275
##  5     1    50       1  2238
##  6     1     0       1  1524
##  7     1     0       1   964
##  8     1    50       1  1226
##  9     1    50       1  2548
## 10     1    50       1  1588
## # ... with 2,582 more rows

We'll focus on comparing the reaction times between the 0 degree and 50 degree rotation trials.
We predict that people will take longer to respond when the items are rotated, and that this effect will be robust across people.

# Subject means

## Bar graph

For the first graph, we'll only need the subjects' means in each condition.

subject_means <- group_by(d, id, angle) %>%
    summarize(rt = mean(rt, na.rm = T))
subject_means
## Source: local data frame [108 x 3]
## Groups: id [?]
##
## # A tibble: 108 x 3
##       id angle    rt
##    <int> <int> <dbl>
##  1     1     0  1512
##  2     1    50  2039
##  3     2     0  1251
##  4     2    50  1768
##  5     3     0  1602
##  6     3    50  1862
##  7     4     0  1501
##  8     4    50  2023
##  9     5     0  2170
## 10     5    50  2512
## # ... with 98 more rows

The first plot is a simple bar graph showing the condition means, and every subject as a point. Note that the mean is visualized as a bar, using the stat_summary(geom = "bar") function.

barplot <- ggplot(subject_means, aes(x = angle, y = rt)) +
    stat_summary(
        geom = "bar",
        fun.y = "mean",
        col = "black",
        fill = "gray70"
    ) +
    geom_point(position = position_jitter(h = 0, w = 5)) +
    scale_y_continuous(limits = c(0, max(d$rt, na.rm = T)),
                       expand = c(0, 0))
barplot

This figure shows quite clearly that the mean reaction time in the 50 degree angle condition was higher than in the 0 degree angle condition, and it shows the spread across individuals in each condition. However, we are often specifically interested in the within-subject effect of condition, which is difficult to see in this display. We could draw lines to connect each subject's points, and the effect would then be visible as a "spaghetti plot", but while useful, these plots can be a little overwhelming, especially if there are too many people (spaghetti is great, but nobody likes too much of it!)

## Within-subject scatterplot

To draw a within-subjects scatterplot, we’ll need a slight reorganization of the data, such that it is in wide format with respect to the conditions:

subject_means
## Source: local data frame [108 x 3]
## Groups: id [?]
##
## # A tibble: 108 x 3
##       id angle    rt
##    <int> <int> <dbl>
##  1     1     0  1512
##  2     1    50  2039
##  3     2     0  1251
##  4     2    50  1768
##  5     3     0  1602
##  6     3    50  1862
##  7     4     0  1501
##  8     4    50  2023
##  9     5     0  2170
## 10     5    50  2512
## # ... with 98 more rows
subject_means_wide <- spread(subject_means,
                             key = angle,
                             value = rt,
                             sep = "_")
subject_means_wide
## Source: local data frame [54 x 3]
## Groups: id [54]
##
## # A tibble: 54 x 3
##       id angle_0 angle_50
##  * <int>   <dbl>    <dbl>
##  1     1    1512     2039
##  2     2    1251     1768
##  3     3    1602     1862
##  4     4    1501     2023
##  5     5    2170     2512
##  6     6    1302     1382
##  7     7    2212     3014
##  8     8    1452     1824
##  9     9    2012     2501
## 10    10    1939     3058
## # ... with 44 more rows
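The reshaping was done with tidyr's spread(); here's a minimal sketch on invented data showing how the sep argument glues the key column's name and values together into the new column names:

```r
library(tidyr)

toy <- data.frame(id = c(1, 1, 2, 2),
                  angle = c(0, 50, 0, 50),
                  rt = c(10, 20, 30, 40))
spread(toy, key = angle, value = rt, sep = "_")
## columns: id, angle_0, angle_50 (one row per id)
```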

Then we can simply map the per-subject angle-means to the X and Y axes:

ggplot(subject_means_wide, aes(x = angle_0, y = angle_50)) +
geom_point()

But this graph needs a couple of fixes to be maximally informative. We need to:

• Make the aspect ratio 1
• Force the axes to be identically scaled (note the use of min() and max() to show the plots on the scale of the data)
• Add an identity (diagonal) line
• Modify the axis labels
lims <- c(min(d$rt, na.rm = T), max(d$rt, na.rm = T))
wsplot <-
    ggplot(subject_means_wide, aes(x = angle_0, y = angle_50)) +
    geom_point() +
    geom_abline() +
    scale_x_continuous("0 degrees", limits = lims) +
    scale_y_continuous("50 degrees", limits = lims) +
    theme(aspect.ratio = 1)
wsplot

Great! This plot shows each person('s mean) as a point, and the difference between conditions can be directly seen in how far from the diagonal line the points are. Points above the diagonal indicate that the person's (mean) RT was greater in the 50 degrees condition. All of the points lie above the identity line, indicating that the effect was as we predicted, and robust across individuals.

This is a very useful diagnostic plot that simultaneously shows the population- (or group-) level trend (are the points, on average, below or above the identity line?) and the expectation (mean) for every person (roughly, how far apart the points are from each other?). The points are naturally connected by their location, unlike in a bar graph where they would be connected by lines. Maybe you think it’s an informative graph; it’s certainly very easy to do in R with ggplot2. Also, I think it is visually very convincing, and doesn’t necessarily lead one to focus unjustly just on the group means: I am both convinced and informed by the graph.

# Within-subject scatterplot with SEs

Well, we didn’t measure everybody repeatedly for nothing. We know more than their means; we can use the spread of the individual level scores to calculate, say, a SE for everybody and add it to the graph.

subject_summaries <- group_by(d, id, angle) %>%
    summarize(mean = mean(rt, na.rm = T),
              se = sd(rt, na.rm = T) / sqrt(n()))
subject_summaries
## Source: local data frame [108 x 4]
## Groups: id [?]
##
## # A tibble: 108 x 4
##       id angle  mean    se
##    <int> <int> <dbl> <dbl>
##  1     1     0  1512   146
##  2     1    50  2039   134
##  3     2     0  1251   125
##  4     2    50  1768   211
##  5     3     0  1602   162
##  6     3    50  1862   109
##  7     4     0  1501   112
##  8     4    50  2023   172
##  9     5     0  2170   242
## 10     5    50  2512   307
## # ... with 98 more rows

Now we simply need to reformat the data to wide with respect to both the means and the SEs. The trick here is to call spread() twice, with different values for its sep (separator) argument. Then, when the means and SEs are joined into wide format, we can select the columns containing either the means or the SEs by referring to their unique names.

means <- select(subject_summaries, -se) %>%
    spread(key = angle, value = mean, sep = "_")
means
## Source: local data frame [54 x 3]
## Groups: id [54]
##
## # A tibble: 54 x 3
##       id angle_0 angle_50
##  * <int>   <dbl>    <dbl>
##  1     1    1512     2039
##  2     2    1251     1768
##  3     3    1602     1862
##  4     4    1501     2023
##  5     5    2170     2512
##  6     6    1302     1382
##  7     7    2212     3014
##  8     8    1452     1824
##  9     9    2012     2501
## 10    10    1939     3058
## # ... with 44 more rows
ses <- select(subject_summaries, -mean) %>%
    spread(key = angle, value = se, sep = "SE")
ses
## Source: local data frame [54 x 3]
## Groups: id [54]
##
## # A tibble: 54 x 3
##       id angleSE0 angleSE50
##  * <int>    <dbl>     <dbl>
##  1     1      146       134
##  2     2      125       211
##  3     3      162       109
##  4     4      112       172
##  5     5      242       307
##  6     6      100        73
##  7     7      223       240
##  8     8      110       197
##  9     9      130       203
## 10    10      233       276
## # ... with 44 more rows
sums <- left_join(means, ses)
sums 
## Source: local data frame [54 x 5]
## Groups: id [?]
##
## # A tibble: 54 x 5
##       id angle_0 angle_50 angleSE0 angleSE50
##    <int>   <dbl>    <dbl>    <dbl>     <dbl>
##  1     1    1512     2039      146       134
##  2     2    1251     1768      125       211
##  3     3    1602     1862      162       109
##  4     4    1501     2023      112       172
##  5     5    2170     2512      242       307
##  6     6    1302     1382      100        73
##  7     7    2212     3014      223       240
##  8     8    1452     1824      110       197
##  9     9    2012     2501      130       203
## 10    10    1939     3058      233       276
## # ... with 44 more rows

The code for the plot is actually quite straightforward once the tricky part of data formatting is done (this is really the philosophy behind ggplot2). Use geom_errorbar() to draw the vertical SE bars, and geom_errorbarh() to draw the horizontal SE bars.

ggplot(sums, aes(x = angle_0, y = angle_50)) +
    geom_point() +
    geom_errorbar(aes(ymin = angle_50 - angleSE50, ymax = angle_50 + angleSE50)) +
    geom_errorbarh(aes(xmin = angle_0 - angleSE0, xmax = angle_0 + angleSE0)) +
    geom_abline() +
    scale_x_continuous("0 degrees", limits = lims) +
    scale_y_continuous("50 degrees", limits = lims) +
    theme(aspect.ratio = 1)

Cool, huh? This graph shows the mean and ±1 SEM of everybody's reaction time in the 0 degrees (x axis) and 50 degrees (y axis) conditions. It is a great way to visually inspect the data before fitting any complex models, and requires only some slight reorganizing of the data in R. Hope you'll find it helpful in your own work!

# Endnote

Within-subject scatter plots are pretty common in some fields (e.g. psychophysics), but underutilized in many fields where they might have a positive impact on statistical inference. Why not try them out on your own data, especially when they're this easy to do with R and ggplot2?

Recall that for real applications, it’s better to transform or model reaction times with a skewed distribution. Here we used normal distributions just for convenience.

Finally, this post was made possible by Ganis and Kievit (2015), who have generously shared their data online. Big thanks!

Have a great day!

# References

Ganis, G., & Kievit, R. (2015). A New Set of Three-Dimensional Shapes for Investigating Mental Rotation Processes: Validation Data and Stimulus Set. Journal of Open Psychology Data, 3(1). https://doi.org/10.5334/jopd.ai

## Quantifying my research interests

While attempting to verbally describe my research interests, I got distracted (as I usually do) in the wonderful world of data visualization. One thing led to another, and I attempted to create a graphical depiction of my research interests.

The following figure can be thought of as a visual depiction of my research interests, given a set of assumptions. Some of these assumptions are:

• The journals a researcher reads are a reasonable indicator of the interests of that researcher
• Authors of journal articles give some thought to choosing their keywords (those things that journals sometimes require you to list with a submission so people can tell what on earth your crazy article is about.)
• My Zotero library captures what I’ve been reading in the past few years to some reasonable extent
• What I’ve been reading captures what I’ve been interested in!

To quantify my interests, I exported my Zotero library as a large .csv file, and used R to count and list all the keywords, journals, and keyword-journal pairs. This led to ~500 journal articles, with ~2000 journal-keyword pairs, each representing the occurrence of a keyword within a specific journal. I then plotted these data, with journal name on the x-axis, and keyword on the y-axis. The following code shows a snippet of what happened in R to produce the plot:

# Create bivariate journal-tag data frame
bivar <- dd %>% select(tags, journal)

# Loop to unlist nested tags
newd <- data.frame(pub = NULL, tag = NULL, stringsAsFactors = F)
for (i in 1:nrow(bivar)) {
    tag <- unlist(strsplit(bivar[i, "tags"], "; "))
    tag <- unlist(strsplit(tag, "/ "))
    tag <- unlist(strsplit(tag, ", "))
    tag <- gsub("^\\s+|\\s+$", "", tag)  # Trim whitespace
    tl <- length(tag)
    pub <- bivar[i, "journal"]
    rep_pub <- rep(pub, tl)
    tmp <- data.frame(cbind(pub = rep_pub, tag = tag), stringsAsFactors = F)
    newd <- rbind(newd, tmp)
}
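As an aside: growing a data frame with rbind() inside a loop works, but tidyr's separate_rows() can do the same unnesting in one step. A hedged sketch on toy data (invented journal names and tags, not my actual Zotero export):

```r
library(dplyr)
library(tidyr)
library(stringr)

bivar_toy <- data.frame(journal = c("Cognition", "Psychological Review"),
                        tags = c("memory; attention", "time/ agency, bayes"),
                        stringsAsFactors = FALSE)
bivar_toy %>%
    separate_rows(tags, sep = "; |/ |, ") %>%  # split on the same delimiters as the loop
    mutate(tags = str_trim(tags))              # trim surrounding whitespace
```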
# Bivariate journal-tag plot
newd %>%
    group_by(pub) %>%
    mutate(pn = n()) %>%
    ungroup() %>%
    group_by(tag) %>%
    mutate(tn = n()) %>%
    ungroup() %>%
    group_by(tag, pub) %>%
    mutate(nn = n()) %>%
    filter(nn >= 2) %>%
    ggplot(aes(x = reorder(pub, desc(pn)),
               y = reorder(tag, tn))) +
    scale_size_continuous(range = c(3, 5)) +
    geom_point(aes(size = nn, col = nn)) +
    labs(title = "Figure 1. Lots of small text and some colored points") +
    theme_bw() +
    theme(axis.title = element_blank(),
          panel.grid.major = element_line(linetype = 1),
          axis.text = element_text(size = 12),
          axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = "none",
          plot.margin = unit(c(0, 0, 0, -1), "mm"))

There are a few additional bits of information in this graph: each dot's size indicates the number of occurrences of that keyword-journal pair (larger dots indicate more frequent pairs; I only show pairs with more than one observation). Although graphs should never include redundant information, I totally included some, and colored the most frequent pairs red.

The data are sorted on the two axes as follows:

• The y-axis is sorted from top to bottom in order of decreasing frequency of the keyword. I've invested quite some time in reading up on agency and time perception, so they are on top. The bottom three, action, arousal, and bayesian methods, are less frequently encountered in the data. This points out an important limitation of the method: I'm really into bayesian methods (in statistics, and cognitive modeling), but the concept gets split into several keywords (bayesian, bayes, bayes factor; above).
• The x-axis is sorted from left to right in order of decreasing frequency of the journal. Thus, my most-read journal is Consciousness and Cognition (a fine paper indeed), with newcomers in the Frontiers camp up there as well. I've given less attention (sorry) to the British Journal of Mathematical and Statistical Psychology and the Scandinavian Journal of Psychology.

Furthermore, you’ll notice a lot of empty space in the graph. This makes sense because not all journals publish articles with all these keywords. I take this feature of the graph to be a sign of my studious nature and wide scope of interests (p < .05).

# Conclusion

• Journals should have shorter names (looking at you, APA and Royal Society of London).
• Authors should give at least some thought to their keywords.
• It’s really difficult to scrape information from APA’s 1994-style website. Perhaps APA should consider joining us in the next millennium!
• I’m really interested in action, agency, time, and bayesian stuff. And public health / gesundheitswesen, apparently.
• I also made a wordcloud, but it stinks (like most wordclouds.)

Packages used: ggplot2, dplyr, stringr
