A very common data wrangling task is to transform data between long and wide forms. It is often easier to enter data manually into a spreadsheet in the wide form, but computers prefer to read data in the long form. In this tutorial, I’ll show how to use R to transform data between these two forms with ease and style.

Wide and long forms

Although wide and long data formats are somewhat ill defined, they accurately capture two very common forms of data organization in psychological science. Here’s what they look like.

Wide form

When data is in wide format, each participant’s (or whatever your cluster of measurement is) data is in a single row, and each column represents a different variable. For example, the Big 5 questionnaire data is in wide form (only reading a subset of the data):

library(readr)
wd <- read_csv("data/big5/countries/FI.csv")[,c(2, 7:17)]
age country E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
49 FI 3 2 4 4 5 2 4 4 2 2
19 FI 1 5 2 5 2 2 1 5 1 4
19 FI 4 4 2 4 2 4 2 5 1 4
20 FI 2 2 3 4 3 1 1 3 2 5
17 FI 1 2 3 5 4 2 1 5 1 2
28 FI 2 4 3 5 3 2 2 5 1 4

Each row in this data set represents all the variables collected for a single individual. For each individual, we have age, country of residence, and multiple measures of the Extroversion scale (sub-items), labelled E1, E2, …, E10. In the wide form, the variables are next to each other, in separate columns, and cells identify values.

Long form

Data in the long format, on the other hand, has only one row per observation, and therefore multiple observations of a single individual are distributed to many rows. Each row, therefore, has a number of columns representing the context of the observation. For example, here’s some time-series data of a beaver’s body temperature in long form:

b1 <- head(beaver1)[,-4]
day time temp
346 840 36.33
346 850 36.34
346 900 36.35
346 910 36.42
346 920 36.55
346 930 36.69

Each row in this data set represents an observation, and we know of that observation it’s day, time and temp value.

The long form is very useful for plotting time-series and other repeated measure data. For example, to plot the beaver’s body temperature in long form, we only specify an x-y pair where x and y are two columns in the data:

plot(beaver1$time, beaver1$temp, type = "l", 
     xlab = "Time", ylab = "Beaver temperature")
Beaver temperature time-series.

Beaver temperature time-series.

What form?

Sometimes it may not be as straightforward to identify data as long or wide format, because it may depend on what specific variable(s) you are interested in. Still, a good heuristic is that:

Wide data has an individual’s multiple observations distributed across columns, and long data has an individual’s repeated observations distributed across rows.

Next, let’s transform data between the two forms.

Long to wide transform

It is rarely useful to transform long time-series data to wide form, because computer programs like to take one vector of values (a column) for plotting or analysis. However, in many situations the long -> wide transform is a useful operation. Here are two ways to do it in R, first using the tidyr package (Wickham, 2016):

library(tidyr)
wb <- spread(data = b1, 
             key = time,
             value = temp)
day 840 850 900 910 920 930
346 36.33 36.34 36.35 36.42 36.55 36.69

The spread() function takes as input a data frame (b1 where I’ve save the first few lines of the beaver data), a “key” variable, and a “value” variable. What spread() does, is it spreads the values of “value” across columns of the “key” variable. You can recognize the new column names as times of each observation. This data would less straightforward to plot as a time-series.

An alternative, but very popular, command is dcast(), for “casting” long form data to wide data, from the reshape2 package (Wickham, 2007). This function takes a data frame, a casting formula, and a value variable:

library(reshape2)
dcast(b1, day ~ time, value.var = "temp")

The day ~ time formula says that days identify rows (there’s only one day), and time should be distributed across columns, and cells should be filled with values of “temp”.

Next, let’s transform from wide to long data.

Wide to long transform

Think back to the Big 5 data set:

age country E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
49 FI 3 2 4 4 5 2 4 4 2 2
19 FI 1 5 2 5 2 2 1 5 1 4
19 FI 4 4 2 4 2 4 2 5 1 4
20 FI 2 2 3 4 3 1 1 3 2 5
17 FI 1 2 3 5 4 2 1 5 1 2
28 FI 2 4 3 5 3 2 2 5 1 4

Another way to think about this data is that there are only three variables per individual: age, country, and item. Each individual was repeatedly measured for different items. These items’ average could be considered as the individual’s Extroversion score. When repeated measures are collected into two columns, one identifying the variable (item, in this case, but could be, for example, measurement time), and another one identifying the value of the observation, the data is in long form, with respect to repeated measures of item. Here’s how to do it using the tidyr package (Wickham, 2016):

ld <- gather(data = wd, 
             key = item, 
             value = value, 
             -c(age, country))
age country item value
49 FI E1 3
19 FI E1 1
19 FI E1 4
20 FI E1 2
17 FI E1 1
28 FI E1 2

The gather() function makes reshaping data easy with only a few arguments: The user provides a data frame, names for the key (item in this case, which names the subscales) and value (the value of that observation) columns, and specifies which columns to gather into these key-value pairs. In this case, we selected all the E1, E2, …, E10 columns by taking out age and country. By doing this, gather() used all the columns to create key-value pairs, except those two. An equivalent call is to specify the E columns:

ld <- gather(data = wd, 
             key = item, 
             value = value, 
             E1:E10)
age country item value
49 FI E1 3
19 FI E1 1
19 FI E1 4
20 FI E1 2
17 FI E1 1
28 FI E1 2

Alternatively, in some cases it is easier to reshape the data by “melting” it with the melt() function, from the reshape2 package (Wickham, 2007):

ld2 <- melt(wd, 
            id.vars = c("age", "country"),
            variable.name = "item")
age country item value
49 FI E1 3
19 FI E1 1
19 FI E1 4
20 FI E1 2
17 FI E1 1
28 FI E1 2

The melt() function can sometimes be a little verbose, but perhaps easier to understand. The user needs to provide a data frame, id.vars which identify separate rows, and optionally names for the variable and value columns.

Conclusion

In many cases, data must be reshaped to suitable formats required for different analyses. Transforming between long and wide formats is a very common operation that can be achieved with ease in R. For longer discussions on tidying your data to appropriate formats, see the tidyr vignette or this paper on data cleaning (Wickham, 2014).

References

Wickham, H. (2007). Reshaping data with the reshape package. Journal of Statistical Software, 21(12), 1–20. Retrieved from http://www.jstatsoft.org/v21/i12/

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(1), 1–23. http://doi.org/10.18637/jss.v059.i10

Wickham, H. (2016). Tidyr: Easily tidy data with ‘. Retrieved from https://CRAN.R-project.org/package=tidyr


DOI