How to create within-subject scatter plots in R with ggplot2

Today, we’ll take a look at creating a specific type of visualization for data from a within-subjects experiment (also known as repeated measures, but that can sometimes be a misleading label). You’ll often see within-subject data visualized as bar graphs (condition means, and maybe mean difference if you’re lucky.) But alternatives exist, and today we’ll take a look at within-subjects scatterplots.

For example, Ganis and Kievit (2015) asked 54 people to observe, on each trial, two 3-D shapes with various rotations and judge whether the two shapes were the same or not.

There were 4 angles (0, 50, 100, and 150 degree rotations), but for simplicity, today we’ll only look at items that were not rotated with respect to each other, and items rotated 50 degrees. The data are freely available (thanks!!) in Excel format, but to make them more easily available for readers, I’ve uploaded them in a .csv file, which we can load directly from an R script.

d <- read_csv("https://mvuorre.github.io/data/ganis-kievit-2015.csv")

First, let’s clean the data a little bit by selecting and renaming a subset of the variables, and then take only the trials with 0 or 50 degrees rotation.

d <- transmute(
    d,
    id = Subject,
    angle = angle,
    correct = correct.incorrect,
    rt = Time
    ) %>%
    filter(angle < 51)
    d
## # A tibble: 2,592 x 4
##       id angle correct    rt
##    <int> <int>   <int> <int>
##  1     1     0       1  1355
##  2     1    50       1  1685
##  3     1    50       1  1237
##  4     1     0       1  1275
##  5     1    50       1  2238
##  6     1     0       1  1524
##  7     1     0       1   964
##  8     1    50       1  1226
##  9     1    50       1  2548
## 10     1    50       1  1588
## # ... with 2,582 more rows

We’ll focus on comparing the reaction times between the 0 degree and 50 degree rotation trials. We predict that people will take longer to respond when the items are rotated, and that this effect will be robust across people.

Subject means

Bar graph

For the first graph, we’ll only need the subject’s means in each condition.

subject_means <- group_by(d, id, angle) %>%
    summarize(rt = mean(rt, na.rm = T))
subject_means
## Source: local data frame [108 x 3]
## Groups: id [?]
## 
## # A tibble: 108 x 3
##       id angle    rt
##    <int> <int> <dbl>
##  1     1     0  1512
##  2     1    50  2039
##  3     2     0  1251
##  4     2    50  1768
##  5     3     0  1602
##  6     3    50  1862
##  7     4     0  1501
##  8     4    50  2023
##  9     5     0  2170
## 10     5    50  2512
## # ... with 98 more rows

The first plot is a simple bar graph showing the condition means, and every subject as a point. Note that the mean is visualized as a bar, using the stat_summary(geom="bar") function.

barplot <- ggplot(subject_means, aes(x = angle, y = rt)) +
    stat_summary(
    geom = "bar",
    fun.y = "mean",
    col = "black",
    fill = "gray70"
    ) +
    geom_point(position = position_jitter(h = 0, w = 5)) +
    scale_y_continuous(limits = c(0, max(d$rt, na.rm = T)),
    expand = c(0, 0))
    barplot

This figure shows quite clearly that the mean reaction time in the 50 degree angle condition was higher than in the 0 degree angle condition, and the spread across individuals in each condition. However, we often are specifically interested in the within-subject effect of condition, which would be difficult to visually display in this image. We could draw lines to connect each point, and the effect would then be visible as a “spaghetti plot”, but while useful, these plots may sometimes be a little overwhelming especially if there’s too many people (spaghetti is great but nobody likes too much of it!)

Within-subject scatterplot

To draw a within-subjects scatterplot, we’ll need a slight reorganization of the data, such that it is in wide format with respect to the conditions:

subject_means
## Source: local data frame [108 x 3]
## Groups: id [?]
## 
## # A tibble: 108 x 3
##       id angle    rt
##    <int> <int> <dbl>
##  1     1     0  1512
##  2     1    50  2039
##  3     2     0  1251
##  4     2    50  1768
##  5     3     0  1602
##  6     3    50  1862
##  7     4     0  1501
##  8     4    50  2023
##  9     5     0  2170
## 10     5    50  2512
## # ... with 98 more rows
subject_means_wide <-
    spread(subject_means,
           key = angle,
           value = rt,
           sep = "_")
subject_means_wide
## Source: local data frame [54 x 3]
## Groups: id [54]
## 
## # A tibble: 54 x 3
##       id angle_0 angle_50
##  * <int>   <dbl>    <dbl>
##  1     1    1512     2039
##  2     2    1251     1768
##  3     3    1602     1862
##  4     4    1501     2023
##  5     5    2170     2512
##  6     6    1302     1382
##  7     7    2212     3014
##  8     8    1452     1824
##  9     9    2012     2501
## 10    10    1939     3058
## # ... with 44 more rows

Then we can simply map the per-subject angle-means to the X and Y axes:

ggplot(subject_means_wide, aes(x = angle_0, y = angle_50)) +
    geom_point()

But this graph needs a couple of fixes to be maximally informative. We need to:

  • Make the aspect ratio 1
  • Force the axes to be identically scaled (note the use of min() and max() to show the plots on the scale of the data)
  • Add an identity (diagonal) line
  • Modify the axis labels
lims <- c(min(d$rt, na.rm = T), max(d$rt, na.rm = T))
wsplot <-
    ggplot(subject_means_wide, aes(x = angle_0, y = angle_50)) +
    geom_point() +
    geom_abline() +
    scale_x_continuous("0 degrees", limits = lims) +
    scale_y_continuous("50 degrees", limits = lims) +
    theme(aspect.ratio = 1)
wsplot

Great! This plot shows each person (mean) as a point, and the difference between conditions can be directly seen by how far from the diagonal line the points are. Points above the diagonal indicate that the person’s (mean) RT was greater in the 50 degrees condition. All of the points lie below the identity line, indicating that the effect was as we predicted, and robust across individuals.

This is a very useful diagnostic plot that simultaneously shows the population- (or group-) level trend (are the points, on average, below or above the identity line?) and the expectation (mean) for every person (roughly, how far apart the points are from each other?). The points are naturally connected by their location, unlike in a bar graph where they would be connected by lines. Maybe you think it’s an informative graph; it’s certainly very easy to do in R with ggplot2. Also, I think it is visually very convincing, and doesn’t necessarily lead one to focus unjustly just on the group means: I am both convinced and informed by the graph.

Within-subject scatterplot with SEs

Well, we didn’t measure everybody repeatedly for nothing. We know more than their means; we can use the spread of the individual level scores to calculate, say, a SE for everybody and add it to the graph.

subject_summaries <- group_by(d, id, angle) %>%
    summarize(mean = mean(rt, na.rm = T),
              se = sd(rt, na.rm = T) / sqrt(n()))
subject_summaries
## Source: local data frame [108 x 4]
## Groups: id [?]
## 
## # A tibble: 108 x 4
##       id angle  mean    se
##    <int> <int> <dbl> <dbl>
##  1     1     0  1512   146
##  2     1    50  2039   134
##  3     2     0  1251   125
##  4     2    50  1768   211
##  5     3     0  1602   162
##  6     3    50  1862   109
##  7     4     0  1501   112
##  8     4    50  2023   172
##  9     5     0  2170   242
## 10     5    50  2512   307
## # ... with 98 more rows

Now we simply need to reformat the data to wide with respect to both the means and SEs. The trick here is to use spread() with different values for the sep() (separate) argument. Then, when the means and SEs are joined into wide format, we can select the columns containing either the means or SEs by referring to their unique names

means <- select(subject_summaries, -se) %>%
    spread(key=angle, value=mean, sep = "_")
means
## Source: local data frame [54 x 3]
## Groups: id [54]
## 
## # A tibble: 54 x 3
##       id angle_0 angle_50
##  * <int>   <dbl>    <dbl>
##  1     1    1512     2039
##  2     2    1251     1768
##  3     3    1602     1862
##  4     4    1501     2023
##  5     5    2170     2512
##  6     6    1302     1382
##  7     7    2212     3014
##  8     8    1452     1824
##  9     9    2012     2501
## 10    10    1939     3058
## # ... with 44 more rows
ses <- select(subject_summaries, -mean) %>%
    spread(key=angle, value=se, sep = "SE")
ses
## Source: local data frame [54 x 3]
## Groups: id [54]
## 
## # A tibble: 54 x 3
##       id angleSE0 angleSE50
##  * <int>    <dbl>     <dbl>
##  1     1      146       134
##  2     2      125       211
##  3     3      162       109
##  4     4      112       172
##  5     5      242       307
##  6     6      100        73
##  7     7      223       240
##  8     8      110       197
##  9     9      130       203
## 10    10      233       276
## # ... with 44 more rows
sums <- left_join(means, ses)
sums 
## Source: local data frame [54 x 5]
## Groups: id [?]
## 
## # A tibble: 54 x 5
##       id angle_0 angle_50 angleSE0 angleSE50
##    <int>   <dbl>    <dbl>    <dbl>     <dbl>
##  1     1    1512     2039      146       134
##  2     2    1251     1768      125       211
##  3     3    1602     1862      162       109
##  4     4    1501     2023      112       172
##  5     5    2170     2512      242       307
##  6     6    1302     1382      100        73
##  7     7    2212     3014      223       240
##  8     8    1452     1824      110       197
##  9     9    2012     2501      130       203
## 10    10    1939     3058      233       276
## # ... with 44 more rows

The code for the plot is actually quite straightforward once the tricky part of data formatting is done (this is really the philosophy behind ggplot2). Use errorbar() to draw the vertical SE bars, and errorbarh() to draw the horizontal SE bars.

ggplot(sums, aes(x=angle_0, y=angle_50)) +
    geom_point() +
    geom_errorbar(aes(ymin=angle_50-angleSE50, ymax=angle_50+angleSE50)) +
    geom_errorbarh(aes(xmin=angle_0-angleSE0, xmax=angle_0+angleSE0)) +
    geom_abline() +
    scale_x_continuous("0 degrees", limits = lims) +
    scale_y_continuous("50 degrees", limits = lims) +
    theme(aspect.ratio=1)

Cool, huh? This graph shows the mean and +-1 SEM for everybody’s reaction time in the 0 degrees (x axis) and 50 degrees (y axis) conditions. This graph could be a great visual inspection of the data before fitting any complex models, and requires only some slight reorganizing of the data in R. Hope you’ll find it helpful in your own work!

Endnote

Within-subject scatter plots are pretty common in some fields (psychophysics), but underutilized in many fiels where they might have a positive impact on statistical inference. Why not try them out on your own data, especially when they’re this easy to do with R and ggplot2?

Recall that for real applications, it’s better to transform or model reaction times with a skewed distribution. Here we used normal distributions just for convenience.

Finally, this post was made possible by the Ganis and Kievit (2015) who generously have shared their data online. Big thanks!

Have a great day!

References

Ganis, G., & Kievit, R. (2015). A New Set of Three-Dimensional Shapes for Investigating Mental Rotation Processes: Validation Data and Stimulus Set. Journal of Open Psychology Data, 3(1). https://doi.org/10.5334/jopd.ai

Related