Quantitative literature review with R: Exploring Psychonomic Society Journals, Part I

Literature reviews, both casual and formal (or qualitative / quantitative), are an important part of research. In this tutorial, I’ll show how to use R to quantitatively explore, analyze, and visualize a research literature, using Psychonomic Society’s publications between 2005 and 2016.

Commonly, literature reviews are rather informal; you read a review paper or 10, maybe suggested to you by experts in the field, and then delve deeper into the topics presented in those papers. A more formal version starts with a database search, where you try out various search terms, and collect a more exhaustive list of publications related to your research questions. Here, we are going to have a bit of fun and explore a large heterogeneous literature (kind of) quantitatively, focusing on data manipulation (not the bad kind!) and visualization.

I used Scopus to search for all journal articles published in the Psychonomic Society’s journals (not including CR:PI because it’s so new) between the years 2005 and 2016 (inclusive). For real applications, you would search across journals (and perhaps across several databases) rather than within one society’s publications, but for a general illustration, I wanted to focus on these publications, because Psychonomic journals are closest to my research interests.

I limited the search to these years for two reasons: Metadata on articles before early 2000’s is (are?) not always very good, and if I had included everything, the database would have simply been too large. For the example here, these search terms resulted in about 7500 journal articles.

Download data from Scopus

The Scopus search interface is very straightforward, so I won’t talk about how to use it. Some things to keep in mind, though, are that Scopus does not index all possible sources of citations, and therefore when we discuss citations (and other similar info) below, we’ll only discuss citations that Scopus knows about. For example, Google Scholar often indexes more sources, but it doesn’t have the same search tools as Scopus. Scopus allows you to download the data in various formats, and I chose .csv files, which I have stored on my computer. When you download raw data from Scopus, make sure that you organize the files well and that they are available to the R session.

You can also download the raw or cleaned up data from my website (see below), if you’d like to work on the same data set, or replicate the code below.

Load and clean raw data from Scopus

Let’s first load all the relevant R packages that we’ll use to do our work for us:

library(tidyverse)
library(stringr)
library(tools)

Load the data from .csv files

When I first downloaded the files, I put them in a subdirectory of my project (my website). I downloaded the data such that one file contains the data of one journal. (The raw journal-specific .csv files are available at https://github.com/mvuorre/mvuorre.github.io/tree/master/data/scopus, but I recommend you download the combined R data file instead because it’s easier, see below.)

The first step is to collect the names of these files into a character vector:

fl <- list.files("static/data/scopus/", pattern = "\\.csv$", full.names = TRUE)

Then we can apply a function over the list of file names that reads all the files to an R object.

d <- lapply(fl, 
            read_csv, 
            col_types = cols(`Page start` = col_character(),
                             `Page end` = col_character(),
                             `Art. No.` = col_character()))

lapply() applies the read_csv() function to each file in fl, and the col_types = cols() argument ensures that those three columns are read as character variables (the original data is a little messy, and this ensures that all the files are read with identical column types). Next, we’ll turn the list of data frames d into a single data frame by binding them together row-wise (think stacking spreadsheets vertically on top of each other). Let’s also take a quick look at the result with glimpse():

d <- bind_rows(d)
glimpse(d)
## Observations: 7,697
## Variables: 25
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Source title              <chr> "Attention, Perception, and Psychoph...
## $ Volume                    <int> 78, 78, 78, 78, 78, 78, 78, 78, 78, ...
## $ Issue                     <int> 3, 3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, ...
## $ Art. No.                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Page start                <chr> "915", "946", "848", "938", "902", "...
## $ Page end                  <chr> "922", "959", "867", NA, "914", "735...
## $ Page count                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Cited by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Link                      <chr> "https://www.scopus.com/inward/recor...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors with affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author Keywords           <chr> "3D perception; Perception and actio...
## $ Index Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence Address    <chr> "Button, C.; School of Physical Educ...
## $ Publisher                 <chr> "Springer New York LLC", "Springer N...
## $ Abbreviated Source Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document Type             <chr> "Article", "Article", "Article", "Ar...
## $ Source                    <chr> "Scopus", "Scopus", "Scopus", "Scopu...
## $ EID                       <chr> "2-s2.0-84951928856", "2-s2.0-849519...
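
If you’d like to try the read-and-stack pattern without the Scopus files, here is a minimal sketch on made-up data. The folder name, file names, and values are all hypothetical; only the pattern (list.files(), lapply() with read_csv(), then bind_rows()) mirrors the code above. It also hints at why the col_types argument matters: if the second file were read with guessed types, its Page start column would probably come out numeric, and the two data frames could no longer be stacked.

```r
library(readr)
library(dplyr)

# Two tiny made-up "journal" files in a temporary folder (hypothetical data)
tmp <- file.path(tempdir(), "scopus-demo")
dir.create(tmp, showWarnings = FALSE)
write_csv(tibble(Title = c("A", "B"), `Page start` = c("1", "e12")),
          file.path(tmp, "journal1.csv"))
write_csv(tibble(Title = "C", `Page start` = "7"),
          file.path(tmp, "journal2.csv"))

# Same pattern as above: list the files, read each one, stack the results
demo_files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
demo_list <- lapply(demo_files,
                    read_csv,
                    col_types = cols(`Page start` = col_character()))
demo <- bind_rows(demo_list)
demo
```

The result should have three rows, with Page start stored as text in all of them.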

OK, looks pretty clean already! (I also tried to get data from ISI web of science, or whatever it’s now called, but the raw data was so messy that I’d have to get paid to even look at it.)

The combined raw data file is available as an R data file at https://mvuorre.github.io/data/scopus/scopus-psychonomics-combined-2004-2016.rda, and can be easily loaded to the R workspace with:

load(url("https://mvuorre.github.io/data/scopus/scopus-psychonomics-combined-2004-2016.rda"))

Clean the data

OK, now that you’ve loaded up the combined data file from my website into your R session with the above command, you can get to work cleaning the data. If you’d like to skip the data cleaning steps, scroll down to “Load cleaned data”.

Select relevant variables

There’s a couple of columns (variables) in the data that I’m just going to drop because I’m not interested in them at all. When you export data from Scopus, you can choose what variables to include, and it looks like I accidentally included some stuff that I don’t care about. I’ll use the select() function, and prepend each variable I want to drop with a minus sign. Variable names with spaces or strange characters are wrapped in backticks.

I think it’s always good to drop unnecessary variables to remove unnecessary clutter. Talking about cognitive psychology, it’s probably less cognitively demanding to work with a data frame with 5 variables than it is to work with a data frame with 50 variables.

d <- select(d,
            -Volume,
            -Issue,
            -`Art. No.`,
            -`Page count`,
            -Link,
            -Publisher,
            -Source,
            -EID)

I dropped page count because it is empty for most cells, and I actually need to re-compute it from the start and end variables, later on. Next, we’re going to go through the columns, but leave the more complicated columns (such as Authors) for later. What do I mean by “complicated”? Some variables are not individual values, and their desired final shape depends on how you want to use them. Much later on, when we get to network and graph analyses, you’ll see that we want to “unlist” the Keywords columns, for example. That’s why we’ll do the cleaning and data manipulation closer to the actual analysis. But for “univariate” columns, we can get started right away.
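
To give a rough idea of what “unlisting” a keyword column could look like, here is a small sketch on made-up data. It uses tidyr’s separate_rows(), which is only one possible approach and not necessarily the one used in later parts; the toy titles and keywords are invented, and the "; " separator is assumed from the Author Keywords values printed above.

```r
library(dplyr)
library(tidyr)

# Made-up rows in the same shape as the Scopus keyword column
toy <- tibble(
    Title = c("Paper one", "Paper two"),
    `Author Keywords` = c("Attention; Memory", "Perception")
)

# One row per (paper, keyword) pair
toy_long <- separate_rows(toy, `Author Keywords`, sep = "; ")
toy_long
```

Each paper now appears once per keyword, which is the shape you’d want for counting keywords or building a network from them.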

Numerical columns: Year, pages, citations

Year is already an integer, and has no missing values, so we don’t need to do anything about it. We can, however, make the page number information numerical and combine it into one column.

glimpse(d)
## Observations: 7,697
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Source title              <chr> "Attention, Perception, and Psychoph...
## $ Page start                <chr> "915", "946", "848", "938", "902", "...
## $ Page end                  <chr> "922", "959", "867", NA, "914", "735...
## $ Cited by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors with affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author Keywords           <chr> "3D perception; Perception and actio...
## $ Index Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated Source Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document Type             <chr> "Article", "Article", "Article", "Ar...
d$Pages <- as.integer(d$`Page end`) - as.integer(d$`Page start`)
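
A side note on the arithmetic: subtracting the start page from the end page gives the span between them, which is one less than the number of pages an article occupies (an article running from page 915 to page 922 covers 8 pages, not 7). The sketch below, on made-up values, shows both quantities. The rest of this post keeps the simple difference, so the numbers printed below are spans.

```r
# Made-up page ranges; the first mirrors an article running from 915 to 922
page_start <- c("915", "1", NA)
page_end   <- c("922", "1", "10")

span  <- as.integer(page_end) - as.integer(page_start)  # as computed above
count <- span + 1L                                       # inclusive page count
data.frame(page_start, page_end, span, count)
```

Missing start or end pages simply propagate as NA in both versions.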

How do we know that the Pages variable is now correct? We can’t rely on math itself because the data values themselves may be incorrect. It is good practice to check the data after each operation, and there’s many ways of looking at the data. I like to arrange the data (either descending or ascending) on the newly created variable to show if there are any weird values. Here’s how to arrange the data on the Pages variable in a descending order:

glimpse(arrange(d, desc(Pages)))
## Observations: 7,697
## Variables: 18
## $ Authors                   <chr> "Nishiyama R.", "Harrigan J., Rosent...
## $ Title                     <chr> "Active maintenance of semantic repr...
## $ Year                      <int> 2014, 2008, 2013, 2014, 2014, 2014, ...
## $ Source title              <chr> "Psychonomic Bulletin and Review", "...
## $ Page start                <chr> "583", "1", "1", "1", "1", "1", "1",...
## $ Page end                  <chr> "1589", "536", "497", "442", "371", ...
## $ Cited by                  <int> 1, 59, NA, NA, NA, NA, 79, 47, NA, N...
## $ DOI                       <chr> "10.3758/s13423-014-0618-1", "10.109...
## $ Affiliations              <chr> "Department of Psychological Science...
## $ Authors with affiliations <chr> "Nishiyama, R., Department of Psycho...
## $ Abstract                  <chr> "In research on verbal working memor...
## $ Author Keywords           <chr> "Active maintenance; Semantic repres...
## $ Index Keywords            <chr> "adult; attention; female; human; ma...
## $ References                <chr> "Allen, C.M., Martin, R.C., Martin, ...
## $ Correspondence Address    <chr> "Nishiyama, R.; Department of Psycho...
## $ Abbreviated Source Title  <chr> "Psychonom. Bull. Rev.", "The New Ha...
## $ Document Type             <chr> "Article", "Book", "Book", "Article"...
## $ Pages                     <int> 1006, 535, 496, 441, 370, 341, 325, ...

That code is quite difficult to read, however, because it contains nested function calls: arrange() is inside glimpse(). This is where we first really encounter the principles of R’s tidyverse philosophy: Instead of nesting, we can pipe results from functions to one another. To “pipe” the above code, we’d write it this way instead:

d %>% 
    arrange(desc(Pages)) %>% 
    glimpse()
## Observations: 7,697
## Variables: 18
## $ Authors                   <chr> "Nishiyama R.", "Harrigan J., Rosent...
## $ Title                     <chr> "Active maintenance of semantic repr...
## $ Year                      <int> 2014, 2008, 2013, 2014, 2014, 2014, ...
## $ Source title              <chr> "Psychonomic Bulletin and Review", "...
## $ Page start                <chr> "583", "1", "1", "1", "1", "1", "1",...
## $ Page end                  <chr> "1589", "536", "497", "442", "371", ...
## $ Cited by                  <int> 1, 59, NA, NA, NA, NA, 79, 47, NA, N...
## $ DOI                       <chr> "10.3758/s13423-014-0618-1", "10.109...
## $ Affiliations              <chr> "Department of Psychological Science...
## $ Authors with affiliations <chr> "Nishiyama, R., Department of Psycho...
## $ Abstract                  <chr> "In research on verbal working memor...
## $ Author Keywords           <chr> "Active maintenance; Semantic repres...
## $ Index Keywords            <chr> "adult; attention; female; human; ma...
## $ References                <chr> "Allen, C.M., Martin, R.C., Martin, ...
## $ Correspondence Address    <chr> "Nishiyama, R.; Department of Psycho...
## $ Abbreviated Source Title  <chr> "Psychonom. Bull. Rev.", "The New Ha...
## $ Document Type             <chr> "Article", "Book", "Book", "Article"...
## $ Pages                     <int> 1006, 535, 496, 441, 370, 341, 325, ...

Read the above code as “Take d, then pass it to arrange(), then pass the results to glimpse()”. Each %>% is a “pipe”, and takes the results of whatever is being computed on its left-hand side, and passes the result to its right-hand side (or the line below; it’s really helpful to use line breaks and put each statement on its own line). I’ll be using pipes extensively in the code below. I also won’t inspect the data after every change to prevent printing unnecessary boring stuff all the time (although you should when you’re working with your own data).
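
If you want to convince yourself that the nested and piped versions really do the same thing, a quick check on a made-up data frame might look like this (the toy values are invented, and head() stands in for glimpse() so that there is a result to compare):

```r
library(dplyr)

toy <- tibble(id = 1:4, Pages = c(12L, 1006L, 7L, NA))

# Nested: read from the inside out
nested <- head(arrange(toy, desc(Pages)), 2)

# Piped: read from top to bottom
piped <- toy %>%
    arrange(desc(Pages)) %>%
    head(2)

identical(nested, piped)
```

Both versions return the same two rows, with the largest Pages value first.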

So, back to Pages. Looking at the data, there are some strange values, with one item apparently spanning more than 1000 pages. For now we’ll leave those values in, but looking closely at the Source title column, we can see that there are publications in the data that shouldn’t be there (books, for example). I guess I was a little too lazy with my initial Scopus search terms, but we can easily fix that problem here in R.

Publications

First, let’s list all the unique publications in the data:

publications <- unique(d$`Source title`)
publications
##  [1] "Attention, Perception, and Psychophysics"                                                
##  [2] "Attention, perception & psychophysics"                                                   
##  [3] "Attention, Perception, & Psychophysics"                                                  
##  [4] "Behavior Research Methods"                                                               
##  [5] "Behavior research methods"                                                               
##  [6] "Modern Research Methods for The Study of Behavior in Organizations"                      
##  [7] "Modern Research Methods for the Study of Behavior in Organizations"                      
##  [8] "The New Handbook of Methods in Nonverbal Behavior Research"                              
##  [9] "Cognitive, Affective and Behavioral Neuroscience"                                        
## [10] "Cognitive, affective & behavioral neuroscience"                                          
## [11] "Cognitive, Affective, & Behavioral Neuroscience"                                         
## [12] "Learning and Behavior"                                                                   
## [13] "Memory and Cognition"                                                                    
## [14] "Memory & cognition"                                                                      
## [15] "Exploring Implicit Cognition: Learning, Memory, and Social Cognitive Processes"          
## [16] "Memory & Cognition"                                                                      
## [17] "Imagery, Memory and Cognition: Essays in Honor of Allan Paivio"                          
## [18] "Imagery, Memory and Cognition (PLE: Memory):Essays in Honor of Allan Paivio"             
## [19] "Working Memory and Human Cognition"                                                      
## [20] "Grounding Cognition: The Role of Perception and Action in Memory, Language, and Thinking"
## [21] "Psychonomic Bulletin and Review"                                                         
## [22] "Psychonomic bulletin & review"                                                           
## [23] "Psychonomic Bulletin & Review"

Let’s remove the ones that aren’t Psychonomic journals. I’ll just subset the list of publications to ones that I want to keep, and then filter() the data such that Source title matches one of the correct journals.

psychonomic_publications <- publications[c(1:5, 9:14, 16, 21:23)]
d <- filter(d, `Source title` %in% psychonomic_publications)
unique(d$`Source title`)
##  [1] "Attention, Perception, and Psychophysics"        
##  [2] "Attention, perception & psychophysics"           
##  [3] "Attention, Perception, & Psychophysics"          
##  [4] "Behavior Research Methods"                       
##  [5] "Behavior research methods"                       
##  [6] "Cognitive, Affective and Behavioral Neuroscience"
##  [7] "Cognitive, affective & behavioral neuroscience"  
##  [8] "Cognitive, Affective, & Behavioral Neuroscience" 
##  [9] "Learning and Behavior"                           
## [10] "Memory and Cognition"                            
## [11] "Memory & cognition"                              
## [12] "Memory & Cognition"                              
## [13] "Psychonomic Bulletin and Review"                 
## [14] "Psychonomic bulletin & review"                   
## [15] "Psychonomic Bulletin & Review"

Much better. Next, we’ll have to make sure that there’s only one unique name per journal. To do this, we’ll use simple text manipulation tools to change all “and”s to “&”s, and convert the names to Title Case. Before applying these operations to the actual data, I’ll work on just the unique names R object psychonomic_publications. This makes it easier to troubleshoot the operations.

psychonomic_publications <- psychonomic_publications %>% 
    str_replace_all("and", "&") %>%  # Replace all "and"s with "&"s
    toTitleCase()  # Make everything Title Case
unique(psychonomic_publications)
## [1] "Attention, Perception, & Psychophysics"         
## [2] "Attention, Perception & Psychophysics"          
## [3] "Behavior Research Methods"                      
## [4] "Cognitive, Affective & Behavioral Neuroscience" 
## [5] "Cognitive, Affective, & Behavioral Neuroscience"
## [6] "Learning & Behavior"                            
## [7] "Memory & Cognition"                             
## [8] "Psychonomic Bulletin & Review"

Still not quite correct. For two journals, there’s two versions, one where “&” is preceded with a comma, one where it isn’t. I’ll take the commas out.

psychonomic_publications %>% 
    str_replace_all(", &", " &") %>% 
    unique()
## [1] "Attention, Perception & Psychophysics"         
## [2] "Behavior Research Methods"                     
## [3] "Cognitive, Affective & Behavioral Neuroscience"
## [4] "Learning & Behavior"                           
## [5] "Memory & Cognition"                            
## [6] "Psychonomic Bulletin & Review"

Great, now I’ll just apply those operations to the actual data.

d <- d %>% 
    mutate(Publication = str_replace_all(`Source title`, "and", "&"),
           Publication = toTitleCase(Publication),
           Publication = str_replace_all(Publication, ", &", " &"))

Here, I used the mutate() function, which works within a data frame. I passed d to mutate() with the pipe operator, and could then refer to variables without prepending them with d$. I called the product variable Publication because it’s easier to write than Source title.
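
One thing to keep in mind with the "and" pattern: it matches those three letters anywhere, including inside longer words. That causes no trouble for the journal names here, but it would for other titles. A slightly more defensive sketch, using regular-expression word boundaries (\\b) on two made-up titles:

```r
library(stringr)

# Hypothetical titles, chosen because they contain "and" inside longer words
titles <- c("Learning and Behavior", "Handbook and Standards")

# Plain pattern: also hits H-and-book and St-and-ards
loose <- str_replace_all(titles, "and", "&")

# Word boundaries: only the stand-alone word is replaced
strict <- str_replace_all(titles, "\\band\\b", "&")

loose
strict
```

The first version mangles “Handbook” and “Standards”, while the second only touches the stand-alone word.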

Lastly, we’ll do the same for the abbreviated publication names. I don’t know what the official abbreviations are, but I can guess that the majority of the articles have it correct. The first step is then to count the occurrences of each unique abbreviation. To do this, I group the data by (group_by()) the abbreviation, and then count():

abbs <- group_by(d, `Abbreviated Source Title`) %>% 
    count()
abbs
## # A tibble: 11 x 2
##          `Abbreviated Source Title`     n
##                               <chr> <int>
##  1         Atten Percept Psychophys    26
##  2      Atten. Percept. Psychophys.  1558
##  3                Behav Res Methods    38
##  4              Behav. Res. Methods  1385
##  5       Cogn Affect Behav Neurosci     7
##  6 Cogn. Affective Behav. Neurosci.   691
##  7                    Learn. Behav.   455
##  8                       Mem Cognit    13
##  9                     Mem. Cognit.  1441
## 10                 Psychon Bull Rev     3
## 11            Psychonom. Bull. Rev.  1997

Then just figure out how to replace all the bad ones with the good ones. Unfortunately this has to be done manually because computers aren’t that smart. The following is ugly, but it works. I first take just the abbreviations into a variable, then replace the bad ones with the good ones inside the data frame (working with an abbreviated variable name).

abbrs <- abbs[[1]]
d <- mutate(d,
       pa = `Abbreviated Source Title`,
       pa = ifelse(pa == abbrs[1], abbrs[2], pa),
       pa = ifelse(pa == abbrs[3], abbrs[4], pa),
       pa = ifelse(pa == abbrs[5], abbrs[6], pa),
       pa = ifelse(pa == abbrs[8], abbrs[9], pa),
       pa = ifelse(pa == abbrs[10], abbrs[11], pa),
       Pub_abbr = pa)
unique(d$Pub_abbr)
## [1] "Atten. Percept. Psychophys."      "Behav. Res. Methods"             
## [3] "Mem. Cognit."                     "Cogn. Affective Behav. Neurosci."
## [5] "Learn. Behav."                    "Psychonom. Bull. Rev."
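
The index-based replacement above depends on the order in which the abbreviations happen to be listed, so it can silently break if the data change. One alternative sketch is a hand-written lookup vector; the name fixes and the toy input are made up for illustration, and the target spellings are simply the most frequent variants from the table above.

```r
library(dplyr)

# Hand-written lookup: names are the variants to fix, values are the targets
fixes <- c("Atten Percept Psychophys"   = "Atten. Percept. Psychophys.",
           "Behav Res Methods"          = "Behav. Res. Methods",
           "Cogn Affect Behav Neurosci" = "Cogn. Affective Behav. Neurosci.",
           "Mem Cognit"                 = "Mem. Cognit.",
           "Psychon Bull Rev"           = "Psychonom. Bull. Rev.")

# Toy input (made up)
x <- c("Mem Cognit", "Mem. Cognit.", "Learn. Behav.", "Psychon Bull Rev")

# Look each value up; keep the original where the lookup has no entry
cleaned <- coalesce(unname(fixes[x]), x)
cleaned
```

Inside mutate(), the same expression could be applied to the Abbreviated Source Title column instead of the toy vector x.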

OK, we’re pretty much done with cleaning for now. The last thing to do is to remove the variables we no longer need, including the temporary one we created while cleaning the data:

d <- select(d, 
            -`Source title`,
            -`Page start`, -`Page end`,
            -pa)

I don’t like spaces, so I’ll also take them out of the variable names:

names(d) <- str_replace_all(names(d), " ", "_")

Leaving us with this to work with:

glimpse(d)
## Observations: 7,614
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Cited_by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors_with_affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author_Keywords           <chr> "3D perception; Perception and actio...
## $ Index_Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence_Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated_Source_Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document_Type             <chr> "Article", "Article", "Article", "Ar...
## $ Pages                     <int> 7, 13, 19, NA, 12, 12, 32, 19, 6, 22...
## $ Publication               <chr> "Attention, Perception & Psychophysi...
## $ Pub_abbr                  <chr> "Atten. Percept. Psychophys.", "Atte...

Load cleaned data

The clean data file can be loaded from https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda.

load(url("https://mvuorre.github.io/data/scopus/scopus-psychonomics-clean.rda"))

Univariate figures

This data is very rich, but for this initial exploration, we’ll be focusing on simple uni- and bi-variate figures. In later parts of this tutorial (future blog posts), we’ll start visualizing more complex data, such as keywords, authors, and networks.

Let’s begin with some simple summaries of what’s been going on, over time, for each journal. We’ll make extensive use of the dplyr data manipulation verbs in the following plots. Take a look at the dplyr documentation if they are unfamiliar to you; although I will explain more complicated cases, I won’t bother with every detail.

Publication years

First, we’re interested in a simple question: How many articles has each journal published in each year? Are there temporal patterns, and do they differ between journals? The steps for creating this plot are commented in the code below. Roughly, in order of appearance, we first add grouping information to the data frame, then summarise the data on those groups, create a new column for a publication-level summary, and then order the publications on it. We then pass the results to ggplot2 and draw a line and point graph.

d %>% 
    group_by(Publication, Year) %>% 
    summarise(n = n()) %>%  # For each group (Publication, Year), count nobs
    mutate(nsum = sum(n)) %>%  # For each Publication, sum of nobs {1}
    ungroup() %>%  # Ungroup data frame
    mutate(Publication = reorder(Publication, nsum)) %>%  # {2}
    ggplot(aes(x=Year, y=n)) +
    geom_point() +
    geom_line() +
    labs(y = "Number of articles published") +
    facet_wrap("Publication", ncol=2)

The code in {1} works because summarise() on the previous line drops the last grouping variable assigned by group_by() (Year), leaving the data grouped by Publication only. And since I called mutate() instead of summarise(), the per-journal sum was added to each row, instead of collapsing the data by group. {2} made the Publication variable into a factor that’s ordered by the sum of articles across years for that journal; this ordering is useful because when facet_wrap() draws the panels, they are nicely ordered (journals with fewer total papers in the upper left, increasing toward bottom right).
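
The grouping behaviour is easier to see on a tiny made-up data frame. In the sketch below, the .groups = "drop_last" argument only spells out what summarise() already does by default; I believe it requires dplyr 1.0.0 or later, so leave it out on older versions.

```r
library(dplyr)

toy <- tibble(Publication = c("A", "A", "A", "B", "B"),
              Year        = c(2015, 2015, 2016, 2015, 2016))

counts <- toy %>%
    group_by(Publication, Year) %>%
    summarise(n = n(), .groups = "drop_last") %>%  # still grouped by Publication
    mutate(nsum = sum(n))                          # per-Publication total on every row

group_vars(counts)
counts
```

After summarise(), only Publication remains as a grouping variable, so sum(n) inside mutate() adds up the yearly counts within each journal and repeats that total on every row.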

If this code doesn’t make any sense, I strongly recommend loading the data into your own R workspace, and executing the code line by line.

I’m a little surprised at the relative paucity of articles in Learning and Behavior, and it looks like there might be some upward trajectory going on in PBR. Should I run some regressions? Probably not.

Citations

Let’s look at numbers of citations next. For these, simple histograms will do, and we expect to see some heavy tails. I will pre-emptively clip the x-axis at 100 citations, because there’s probably a few rogue publications with a truly impressive citation count, which might make the other, more common, values less visible.

d %>% 
    ggplot(aes(Cited_by)) +
    geom_histogram(binwidth=1, col="black") +
    coord_cartesian(xlim=c(0,100)) +
    facet_wrap("Publication", ncol=2)
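
A note on why coord_cartesian() is used for the clipping: it zooms the view without removing any observations, whereas setting limits on the scale itself (for example with xlim()) drops out-of-range values before the histogram is computed and emits a warning. A small sketch on made-up citation counts, using ggplot_build() to peek at the binned data:

```r
library(ggplot2)

# Made-up citation counts, including two very large values
toy <- data.frame(Cited_by = c(0, 1, 1, 2, 5, 150, 900))

p <- ggplot(toy, aes(Cited_by)) +
    geom_histogram(binwidth = 1) +
    coord_cartesian(xlim = c(0, 100))

# All seven values are still counted in the bins; the zoom only hides them
binned <- ggplot_build(p)$data[[1]]
sum(binned$count)
```

For histograms the difference is mostly cosmetic, but for summaries such as smooths or box plots, dropping data and zooming can give visibly different figures.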

Nice. But how many papers are there with more than 100 citations? What’s the proportion of those highly cited papers?

d %>%
    group_by(Publication) %>%
    mutate(total = n()) %>% 
    filter(Cited_by > 100) %>% 
    summarise(total = unique(total),
              over_100 = n(),
              ratio = signif(over_100/total, 1)) %>% 
    arrange(total)
## # A tibble: 6 x 4
##                                      Publication total over_100 ratio
##                                            <chr> <int>    <int> <dbl>
## 1                            Learning & Behavior   455        1 0.002
## 2 Cognitive, Affective & Behavioral Neuroscience   698       32 0.050
## 3                      Behavior Research Methods  1426       53 0.040
## 4                             Memory & Cognition  1452       17 0.010
## 5          Attention, Perception & Psychophysics  1583        8 0.005
## 6                  Psychonomic Bulletin & Review  2000       54 0.030

Excellent. We have now reviewed some of the powerful dplyr verbs and are ready for slightly more complex summaries of the data.
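
One caveat about the summary above: because filter() runs before summarise(), a journal with no papers over the threshold would disappear from the table entirely. A possible alternative, sketched here on made-up data, counts the highly cited papers inside a single summarise() call. Missing citation counts are treated as “not over 100”, which matches what filter() does.

```r
library(dplyr)

toy <- tibble(Publication = c("A", "A", "A", "A", "B", "B"),
              Cited_by    = c(150L, 3L, NA, 101L, 10L, NA))

highly_cited <- toy %>%
    group_by(Publication) %>%
    summarise(total    = n(),
              over_100 = sum(Cited_by > 100, na.rm = TRUE),
              ratio    = over_100 / total)
highly_cited
```

Journal B has no papers above the threshold and still shows up, with a count and ratio of zero.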

Bivariate figures

Publication year and citations

We can visualize citation counts over years of publication, but this might not be the most informative graph (because recent articles naturally can’t be as often cited as older ones). Here, I’ll show how to plot neat box plots (without outliers) with ggplot2:

d %>% 
    filter(Cited_by < 100) %>% 
    ggplot(aes(Year, Cited_by)) +
    geom_boxplot(aes(group = Year), outlier.color = NA) +
    facet_wrap("Publication", ncol=2, scales = "free_y")

Pages and citations

The last figure for today is a scatterplot of pages (how many pages the article spans) versus number of citations. For scatterplots I usually prefer empty points (shape=1).

d %>% 
    filter(Pages < 100, Cited_by < 100) %>% 
    ggplot(aes(Pages, Cited_by)) +
    geom_point(shape=1) +
    facet_wrap("Publication", ncol=2)

Summary

We have now scratched the surface of these data, and there is clearly a lot to explore:

glimpse(d)
## Observations: 7,614
## Variables: 17
## $ Authors                   <chr> "Button C., Schofield M., Croft J.",...
## $ Title                     <chr> "Distance perception in an open wate...
## $ Year                      <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ Cited_by                  <int> NA, 1, 2, 2, 3, 2, 2, NA, NA, NA, 1,...
## $ DOI                       <chr> "10.3758/s13414-015-1049-4", "10.375...
## $ Affiliations              <chr> "School of Physical Education, Sport...
## $ Authors_with_affiliations <chr> "Button, C., School of Physical Educ...
## $ Abstract                  <chr> "We investigated whether distance es...
## $ Author_Keywords           <chr> "3D perception; Perception and actio...
## $ Index_Keywords            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ References                <chr> "Baird, R.W., Burkhart, S.M., (2000)...
## $ Correspondence_Address    <chr> "Button, C.; School of Physical Educ...
## $ Abbreviated_Source_Title  <chr> "Atten. Percept. Psychophys.", "Atte...
## $ Document_Type             <chr> "Article", "Article", "Article", "Ar...
## $ Pages                     <int> 7, 13, 19, NA, 12, 12, 32, 19, 6, 22...
## $ Publication               <chr> "Attention, Perception & Psychophysi...
## $ Pub_abbr                  <chr> "Atten. Percept. Psychophys.", "Atte...

In the next posts on these data, we’ll look at networks and other more advanced topics. Stay tuned.
