##About This Exercise Exercise Type: Guidance

Guidance documents explore how to practice R

Exercise Suitability: Intermediate

If you are getting comfortable running your own code, this exericse will begin to push you forward

These are some notes on changing the format of data from tall to wide and vice versa. Like everything R there are many ways to do this and I used to be a big fan of the reshape2 package until one day I had a Eureka moment about how gather and spread work.

I’m writing this out mainly because I found the online tutorials thought this was obvious and it wasn’t to me!

##Learning Outcomes By the end of this workshop you should be able to . . .

  • Reshape data at will
  • (Optional: Make preserve-related puns about spreadable data)

The R Environment

Here are the packages you’ll need for this exercise. (Remember, if you don’t have one of these packages you can install it with the command: install.packages("packagename")

library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.6
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ---------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

The Observation

As everyone online says - tidy data has one observation per row. My headaches came from dealing with the publicly available National Student Survey data - the format of which seems tidy to the uninitiated. The data is presented with each subject at each university as a row - that’s an observation, surely?

The Data

data <- tibble (group1 = c("A", "A", "A", "A", "B", "B", "B", "B"),
                group2 = c("X", "X", "Y", "Y", "X", "X", "Z", "Z"),
                q = c ("Q1", "Q2","Q1", "Q2", "Q1", "Q2", "Q1", "Q2"),
                disagree = c(0.8, 0.3, 0.8, 0.2, 0.7, 0.5, 0.6, 0.3),
                neutral = c(0.05, 0.4, 0.1, 0.3, 0.1, 0.4, 0.2, 0.3),
                agree = c(0.15, 0.3, 0.1, 0.5, 0.2, 0.1, 0.2, 0.4),
                n = c(121, 121, 140, 140, 50, 50, 57,57)) 


data
## # A tibble: 8 x 7
##   group1 group2 q     disagree neutral agree     n
##   <chr>  <chr>  <chr>    <dbl>   <dbl> <dbl> <dbl>
## 1 A      X      Q1         0.8    0.05  0.15   121
## 2 A      X      Q2         0.3    0.4   0.3    121
## 3 A      Y      Q1         0.8    0.1   0.1    140
## 4 A      Y      Q2         0.2    0.3   0.5    140
## 5 B      X      Q1         0.7    0.1   0.2     50
## 6 B      X      Q2         0.5    0.4   0.1     50
## 7 B      Z      Q1         0.6    0.2   0.2     57
## 8 B      Z      Q2         0.3    0.3   0.4     57

However, for a lot of what I was trying to do, I needed to know how many students responded at each level - how many agreed with Q1, Q2, etc, in each nested group. I wanted my data taller.

(I’m using this specific format not because it’s a nice format, but because it’s a format that exists in the world and that lots of people make big decisions on.)

Gather

The gather command from tidyverse is a quick way to smush this data into a tall format. The gather command creates two new columns, the key column which collects your old column names and your value column which collects the row values (fairly self-explanatory).

The trick with gather is that it will smush everything it can into those two columns, so you need to tell it which columns not to include. This is the step that eluded me for a whole afternoon once, so I’m stating it very obviously here.

talldata <- data %>%
  gather (key = LikertScale, value = PercRespondents,
          -group1,
          -group2,
          -q,
          -n)


talldata
## # A tibble: 24 x 6
##    group1 group2 q         n LikertScale PercRespondents
##    <chr>  <chr>  <chr> <dbl> <chr>                 <dbl>
##  1 A      X      Q1      121 disagree               0.8 
##  2 A      X      Q2      121 disagree               0.3 
##  3 A      Y      Q1      140 disagree               0.8 
##  4 A      Y      Q2      140 disagree               0.2 
##  5 B      X      Q1       50 disagree               0.7 
##  6 B      X      Q2       50 disagree               0.5 
##  7 B      Z      Q1       57 disagree               0.6 
##  8 B      Z      Q2       57 disagree               0.3 
##  9 A      X      Q1      121 neutral                0.05
## 10 A      X      Q2      121 neutral                0.4 
## # ... with 14 more rows

NB - Think About Your Variable Names

I once spent a whole afternoon unable to recreate an error message I was getting with gather - only to realise long after home time that I had sensibly called the key variable question which was a variable name that already existed in my dataset. R was re-writing the variable every time it gathered the data.

The take home? Make sure your key and value names are unique!

Spread

What if, after all that, you realise that you never wanted your data gathered at all? spread is here to rescue you.

Just as before, spread wants to know the key and the value, but this time, it will split those two columns into multiple columns. This time we want all that data to be spread out like marmalade on toast, so we don’t exclude any columns (in fact, try excluding the columns and see what spread says. )

widedata <- talldata %>%
  spread (key = LikertScale, value = PercRespondents)

widedata
## # A tibble: 8 x 7
##   group1 group2 q         n agree disagree neutral
##   <chr>  <chr>  <chr> <dbl> <dbl>    <dbl>   <dbl>
## 1 A      X      Q1      121  0.15      0.8    0.05
## 2 A      X      Q2      121  0.3       0.3    0.4 
## 3 A      Y      Q1      140  0.1       0.8    0.1 
## 4 A      Y      Q2      140  0.5       0.2    0.3 
## 5 B      X      Q1       50  0.2       0.7    0.1 
## 6 B      X      Q2       50  0.1       0.5    0.4 
## 7 B      Z      Q1       57  0.2       0.6    0.2 
## 8 B      Z      Q2       57  0.4       0.3    0.3