##About This Exercise Exercise Type: Guidance
Guidance documents explore how to practice R
Exercise Suitability: Intermediate
If you are getting comfortable running your own code, this exericse will begin to push you forward
These are some notes on changing the format of data from tall to wide and vice versa. Like everything R there are many ways to do this and I used to be a big fan of the reshape2
package until one day I had a Eureka moment about how gather
and spread
work.
I’m writing this out mainly because I found the online tutorials thought this was obvious and it wasn’t to me!
##Learning Outcomes By the end of this workshop you should be able to . . .
Here are the packages you’ll need for this exercise. (Remember, if you don’t have one of these packages you can install it with the command: install.packages("packagename")
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.6
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ---------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
As everyone online says - tidy data has one observation per row. My headaches came from dealing with the publicly available National Student Survey data - the format of which seems tidy to the uninitiated. The data is presented with each subject at each university as a row - that’s an observation, surely?
data <- tibble (group1 = c("A", "A", "A", "A", "B", "B", "B", "B"),
group2 = c("X", "X", "Y", "Y", "X", "X", "Z", "Z"),
q = c ("Q1", "Q2","Q1", "Q2", "Q1", "Q2", "Q1", "Q2"),
disagree = c(0.8, 0.3, 0.8, 0.2, 0.7, 0.5, 0.6, 0.3),
neutral = c(0.05, 0.4, 0.1, 0.3, 0.1, 0.4, 0.2, 0.3),
agree = c(0.15, 0.3, 0.1, 0.5, 0.2, 0.1, 0.2, 0.4),
n = c(121, 121, 140, 140, 50, 50, 57,57))
data
## # A tibble: 8 x 7
## group1 group2 q disagree neutral agree n
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A X Q1 0.8 0.05 0.15 121
## 2 A X Q2 0.3 0.4 0.3 121
## 3 A Y Q1 0.8 0.1 0.1 140
## 4 A Y Q2 0.2 0.3 0.5 140
## 5 B X Q1 0.7 0.1 0.2 50
## 6 B X Q2 0.5 0.4 0.1 50
## 7 B Z Q1 0.6 0.2 0.2 57
## 8 B Z Q2 0.3 0.3 0.4 57
However, for a lot of what I was trying to do, I needed to know how many students responded at each level - how many agreed with Q1, Q2, etc, in each nested group. I wanted my data taller.
(I’m using this specific format not because it’s a nice format, but because it’s a format that exists in the world and that lots of people make big decisions on.)
The gather
command from tidyverse
is a quick way to smush this data into a tall format. The gather
command creates two new columns, the key
column which collects your old column names and your value
column which collects the row values (fairly self-explanatory).
The trick with gather
is that it will smush everything it can into those two columns, so you need to tell it which columns not to include. This is the step that eluded me for a whole afternoon once, so I’m stating it very obviously here.
talldata <- data %>%
gather (key = LikertScale, value = PercRespondents,
-group1,
-group2,
-q,
-n)
talldata
## # A tibble: 24 x 6
## group1 group2 q n LikertScale PercRespondents
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 A X Q1 121 disagree 0.8
## 2 A X Q2 121 disagree 0.3
## 3 A Y Q1 140 disagree 0.8
## 4 A Y Q2 140 disagree 0.2
## 5 B X Q1 50 disagree 0.7
## 6 B X Q2 50 disagree 0.5
## 7 B Z Q1 57 disagree 0.6
## 8 B Z Q2 57 disagree 0.3
## 9 A X Q1 121 neutral 0.05
## 10 A X Q2 121 neutral 0.4
## # ... with 14 more rows
I once spent a whole afternoon unable to recreate an error message I was getting with gather
- only to realise long after home time that I had sensibly called the key
variable question which was a variable name that already existed in my dataset. R was re-writing the variable every time it gathered the data.
The take home? Make sure your key and value names are unique!
What if, after all that, you realise that you never wanted your data gathered at all? spread
is here to rescue you.
Just as before, spread
wants to know the key
and the value
, but this time, it will split those two columns into multiple columns. This time we want all that data to be spread out like marmalade on toast, so we don’t exclude any columns (in fact, try excluding the columns and see what spread says. )
widedata <- talldata %>%
spread (key = LikertScale, value = PercRespondents)
widedata
## # A tibble: 8 x 7
## group1 group2 q n agree disagree neutral
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A X Q1 121 0.15 0.8 0.05
## 2 A X Q2 121 0.3 0.3 0.4
## 3 A Y Q1 140 0.1 0.8 0.1
## 4 A Y Q2 140 0.5 0.2 0.3
## 5 B X Q1 50 0.2 0.7 0.1
## 6 B X Q2 50 0.1 0.5 0.4
## 7 B Z Q1 57 0.2 0.6 0.2
## 8 B Z Q2 57 0.4 0.3 0.3