# How to create within-subject scatter plots in R with ggplot2

Today, we’ll take a look at creating a specific type of visualization for data from a within-subjects experiment (also known as repeated measures, but that can sometimes be a misleading label). You’ll often see within-subject data visualized as bar graphs (condition means, and maybe mean difference if you’re lucky.) But alternatives exist, and today we’ll take a look at within-subjects scatterplots.

For example, Ganis and Kievit (2015) asked 54 people to observe, on each trial, two 3-D shapes with various rotations and judge whether the two shapes were the same or not.

There were 4 angles (0, 50, 100, and 150 degree rotations), but for simplicity, today we’ll only look at items that were not rotated with respect to each other, and items rotated 50 degrees. The data are freely available (thanks!!) in Excel format, but to make them more easily available for readers, I’ve uploaded them in a .csv file, which we can load directly from an R script.

d <- read_csv("https://mvuorre.github.io/data/ganis-kievit-2015.csv")

First, let’s clean the data a little bit by selecting and renaming a subset of the variables, and then take only the trials with 0 or 50 degrees rotation.

d <- transmute(
d,
id = Subject,
angle = angle,
correct = correct.incorrect,
rt = Time
) %>%
filter(angle < 51)
d
## # A tibble: 2,592 x 4
##       id angle correct    rt
##    <int> <int>   <int> <int>
##  1     1     0       1  1355
##  2     1    50       1  1685
##  3     1    50       1  1237
##  4     1     0       1  1275
##  5     1    50       1  2238
##  6     1     0       1  1524
##  7     1     0       1   964
##  8     1    50       1  1226
##  9     1    50       1  2548
## 10     1    50       1  1588
## # ... with 2,582 more rows

We’ll focus on comparing the reaction times between the 0 degree and 50 degree rotation trials. We predict that people will take longer to respond when the items are rotated, and that this effect will be robust across people.

# Subject means

## Bar graph

For the first graph, we’ll only need the subject’s means in each condition.

subject_means <- group_by(d, id, angle) %>%
summarize(rt = mean(rt, na.rm = T))
subject_means
## # A tibble: 108 x 3
## # Groups:   id [?]
##       id angle    rt
##    <int> <int> <dbl>
##  1     1     0 1512.
##  2     1    50 2039.
##  3     2     0 1251.
##  4     2    50 1768.
##  5     3     0 1602
##  6     3    50 1862.
##  7     4     0 1501.
##  8     4    50 2023.
##  9     5     0 2170.
## 10     5    50 2512.
## # ... with 98 more rows

The first plot is a simple bar graph showing the condition means, and every subject as a point. Note that the mean is visualized as a bar, using the stat_summary(geom="bar") function.

barplot <- ggplot(subject_means, aes(x = angle, y = rt)) +
stat_summary(
geom = "bar",
fun.y = "mean",
col = "black",
fill = "gray70"
) +
geom_point(position = position_jitter(h = 0, w = 5)) +
scale_y_continuous(limits = c(0, max(d$rt, na.rm = T)), expand = c(0, 0)) barplot This figure shows quite clearly that the mean reaction time in the 50 degree angle condition was higher than in the 0 degree angle condition, and the spread across individuals in each condition. However, we often are specifically interested in the within-subject effect of condition, which would be difficult to visually display in this image. We could draw lines to connect each point, and the effect would then be visible as a “spaghetti plot”, but while useful, these plots may sometimes be a little overwhelming especially if there’s too many people (spaghetti is great but nobody likes too much of it!) ## Within-subject scatterplot To draw a within-subjects scatterplot, we’ll need a slight reorganization of the data, such that it is in wide format with respect to the conditions: subject_means ## # A tibble: 108 x 3 ## # Groups: id [54] ## id angle rt ## <int> <int> <dbl> ## 1 1 0 1512. ## 2 1 50 2039. ## 3 2 0 1251. ## 4 2 50 1768. ## 5 3 0 1602 ## 6 3 50 1862. ## 7 4 0 1501. ## 8 4 50 2023. ## 9 5 0 2170. ## 10 5 50 2512. ## # ... with 98 more rows subject_means_wide <- spread(subject_means, key = angle, value = rt, sep = "_") subject_means_wide ## # A tibble: 54 x 3 ## # Groups: id [54] ## id angle_0 angle_50 ## <int> <dbl> <dbl> ## 1 1 1512. 2039. ## 2 2 1251. 1768. ## 3 3 1602 1862. ## 4 4 1501. 2023. ## 5 5 2170. 2512. ## 6 6 1302. 1382. ## 7 7 2212. 3014. ## 8 8 1452. 1824. ## 9 9 2012. 2501 ## 10 10 1939. 3058. ## # ... with 44 more rows Then we can simply map the per-subject angle-means to the X and Y axes: ggplot(subject_means_wide, aes(x = angle_0, y = angle_50)) + geom_point() But this graph needs a couple of fixes to be maximally informative. We need to: • Make the aspect ratio 1 • Force the axes to be identically scaled (note the use of min() and max() to show the plots on the scale of the data) • Add an identity (diagonal) line • Modify the axis labels lims <- c(min(d$rt, na.rm = T), max(d\$rt, na.rm = T))
wsplot <-
ggplot(subject_means_wide, aes(x = angle_0, y = angle_50)) +
geom_point() +
geom_abline() +
scale_x_continuous("0 degrees", limits = lims) +
scale_y_continuous("50 degrees", limits = lims) +
theme(aspect.ratio = 1)
wsplot

Great! This plot shows each person (mean) as a point, and the difference between conditions can be directly seen by how far from the diagonal line the points are. Points above the diagonal indicate that the person’s (mean) RT was greater in the 50 degrees condition. All of the points lie below the identity line, indicating that the effect was as we predicted, and robust across individuals.

This is a very useful diagnostic plot that simultaneously shows the population- (or group-) level trend (are the points, on average, below or above the identity line?) and the expectation (mean) for every person (roughly, how far apart the points are from each other?). The points are naturally connected by their location, unlike in a bar graph where they would be connected by lines. Maybe you think it’s an informative graph; it’s certainly very easy to do in R with ggplot2. Also, I think it is visually very convincing, and doesn’t necessarily lead one to focus unjustly just on the group means: I am both convinced and informed by the graph.

# Within-subject scatterplot with SEs

Well, we didn’t measure everybody repeatedly for nothing. We know more than their means; we can use the spread of the individual level scores to calculate, say, a SE for everybody and add it to the graph.

subject_summaries <- group_by(d, id, angle) %>%
summarize(mean = mean(rt, na.rm = T),
se = sd(rt, na.rm = T) / sqrt(n()))
subject_summaries
## # A tibble: 108 x 4
## # Groups:   id [?]
##       id angle  mean    se
##    <int> <int> <dbl> <dbl>
##  1     1     0 1512.  146.
##  2     1    50 2039.  134.
##  3     2     0 1251.  125.
##  4     2    50 1768.  211.
##  5     3     0 1602   162.
##  6     3    50 1862.  109.
##  7     4     0 1501.  112.
##  8     4    50 2023.  172.
##  9     5     0 2170.  242.
## 10     5    50 2512.  307.
## # ... with 98 more rows

Now we simply need to reformat the data to wide with respect to both the means and SEs. The trick here is to use spread() with different values for the sep() (separate) argument. Then, when the means and SEs are joined into wide format, we can select the columns containing either the means or SEs by referring to their unique names

means <- select(subject_summaries, -se) %>%
spread(key=angle, value=mean, sep = "_")
means
## # A tibble: 54 x 3
## # Groups:   id [54]
##       id angle_0 angle_50
##    <int>   <dbl>    <dbl>
##  1     1   1512.    2039.
##  2     2   1251.    1768.
##  3     3   1602     1862.
##  4     4   1501.    2023.
##  5     5   2170.    2512.
##  6     6   1302.    1382.
##  7     7   2212.    3014.
##  8     8   1452.    1824.
##  9     9   2012.    2501
## 10    10   1939.    3058.
## # ... with 44 more rows
ses <- select(subject_summaries, -mean) %>%
spread(key=angle, value=se, sep = "SE")
ses
## # A tibble: 54 x 3
## # Groups:   id [54]
##       id angleSE0 angleSE50
##    <int>    <dbl>     <dbl>
##  1     1    146.      134.
##  2     2    125.      211.
##  3     3    162.      109.
##  4     4    112.      172.
##  5     5    242.      307.
##  6     6     99.7      72.5
##  7     7    223.      240.
##  8     8    110.      197.
##  9     9    130.      203.
## 10    10    233.      276.
## # ... with 44 more rows
sums <- left_join(means, ses)
sums 
## # A tibble: 54 x 5
## # Groups:   id [?]
##       id angle_0 angle_50 angleSE0 angleSE50
##    <int>   <dbl>    <dbl>    <dbl>     <dbl>
##  1     1   1512.    2039.    146.      134.
##  2     2   1251.    1768.    125.      211.
##  3     3   1602     1862.    162.      109.
##  4     4   1501.    2023.    112.      172.
##  5     5   2170.    2512.    242.      307.
##  6     6   1302.    1382.     99.7      72.5
##  7     7   2212.    3014.    223.      240.
##  8     8   1452.    1824.    110.      197.
##  9     9   2012.    2501     130.      203.
## 10    10   1939.    3058.    233.      276.
## # ... with 44 more rows

The code for the plot is actually quite straightforward once the tricky part of data formatting is done (this is really the philosophy behind ggplot2). Use errorbar() to draw the vertical SE bars, and errorbarh() to draw the horizontal SE bars.

ggplot(sums, aes(x=angle_0, y=angle_50)) +
geom_point() +
geom_errorbar(aes(ymin=angle_50-angleSE50, ymax=angle_50+angleSE50)) +
geom_errorbarh(aes(xmin=angle_0-angleSE0, xmax=angle_0+angleSE0)) +
geom_abline() +
scale_x_continuous("0 degrees", limits = lims) +
scale_y_continuous("50 degrees", limits = lims) +
theme(aspect.ratio=1)

Cool, huh? This graph shows the mean and +-1 SEM for everybody’s reaction time in the 0 degrees (x axis) and 50 degrees (y axis) conditions. This graph could be a great visual inspection of the data before fitting any complex models, and requires only some slight reorganizing of the data in R. Hope you’ll find it helpful in your own work!

# Endnote

Within-subject scatter plots are pretty common in some fields (psychophysics), but underutilized in many fiels where they might have a positive impact on statistical inference. Why not try them out on your own data, especially when they’re this easy to do with R and ggplot2?

Recall that for real applications, it’s better to transform or model reaction times with a skewed distribution. Here we used normal distributions just for convenience.

Finally, this post was made possible by the Ganis and Kievit (2015) who generously have shared their data online. Big thanks!

Have a great day!

# References

Ganis, G., & Kievit, R. (2015). A New Set of Three-Dimensional Shapes for Investigating Mental Rotation Processes: Validation Data and Stimulus Set. Journal of Open Psychology Data, 3(1). https://doi.org/10.5334/jopd.ai