Data Manipulation and Visualization in R

Matt Eldridge

What you will learn on this course

How to clean "messy" datasets to make them more amenable to exploratory data analysis
How to manipulate and transform tabular data in R using dplyr
How to visualize data using the popular ggplot2 package

Some of the Tidyverse collection of R packages designed for data science

Why not just use Excel?

Spreadsheets are a common entry point for many types of analysis and Excel is widely used, but it is

unwieldy and difficult to use with large amounts of data
error prone (e.g. gene symbols turning into dates)
tedious and time consuming when repeatedly processing multiple files
hard to reproduce: how can you, or someone else, repeat what you did several months or years down the line?

Aim of the course

The course aims to translate how we think of data in spreadsheets to a series of operations that can be performed and chained together in R

The problem with R

There are many hundreds (thousands!) of functions for us to choose from to achieve our goals and everyone has their own set of favourites

e.g. joining data from two tables (data frames) based on a common variable or key

# base R
merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

# data.table package
dt1 <- data.table(df1, key = "CustomerId")
dt2 <- data.table(df2, key = "CustomerId")
dt2[dt1]  # X[Y] keeps every row of Y, so this is the left join of dt1 with dt2

# plyr package
join(df1, df2, by = "CustomerId", type = "left")

# dplyr package
left_join(df1, df2, by = "CustomerId")
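
To make the comparison concrete, here is a minimal sketch of a left join on two small tables. The tables and the Product and State columns are made up for illustration; only the CustomerId key comes from the snippets above.

```r
library(dplyr)

# Hypothetical toy tables; everything except the CustomerId key is invented
df1 <- data.frame(CustomerId = c(1, 2, 3),
                  Product = c("Toaster", "Kettle", "Radio"))
df2 <- data.frame(CustomerId = c(1, 3, 4),
                  State = c("Ohio", "Texas", "Utah"))

# dplyr: keep every row of df1 and add the matching columns from df2;
# customers that are missing from df2 get NA in the State column
joined <- left_join(df1, df2, by = "CustomerId")

# base R: the same left join, with rows sorted by the key
base_joined <- merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

joined
```

Customer 4 only appears in df2, so a left join drops it, while customer 2 is kept with a missing State.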

You know what you want to do but how do you find the right function to use?

This course introduces an increasingly popular set of tools that can help us to explore data in a consistent and pipeline-able manner

→ the "tidyverse"

Tidyverse tools covered in this course

readr – reading tabular data into a data frame in R
tidyr – tools for creating tidy data frames
dplyr – a consistent set of verbs for solving most data manipulation challenges
ggplot2 – a system for declaratively creating plots based on the Grammar of Graphics
stringr – string matching, extraction, replacement and joining operations

Course outline

Time	Topic
9.30 — 10.00	Introduction
10.00 — 12.00	Visualization with ggplot2
12.00 — 12.30	Tidying and transforming data - tidyr intro and dplyr select, mutate
12.30 — 1.30	Lunch
1.30 — 2.30	Tidying and transforming data - continued
2.30 — 3.30	Workflows - piping and dplyr arrange, filter
3.30 — 4.30	Summarizing, grouping and combining data
4.30 — 5.30	Customizing plots

How we teach the course

"Live coding" in RStudio (no more slides!)
Exercises in R markdown documents combining narrative text and code chunks
Post-it notes
Feedback questionnaire
- Really does help us improve the course for next time

The Patients dataset

Some data manipulations we will perform

Cleaning and tidying the very messy original form of the patients dataset
Selecting a subset of columns to create a smaller data frame
Creating new columns (variables) from existing ones, e.g. calculating body mass index (BMI) from height and weight
Sorting by specified variables
Filtering rows (observations)
Chaining operations together in workflows
Grouping and summarizing observations, e.g. calculating mean BMI for smokers and non-smokers
Combining data from two or more tables
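
Several of the operations listed above can be sketched in a single dplyr workflow. This is only an illustration: the toy data frame, its values and the units (Height in cm, Weight in kg) are assumptions and are not taken from the real patients dataset.

```r
library(dplyr)

# Hypothetical stand-in for the patients data (all values are made up);
# Height is assumed to be in cm and Weight in kg
toy <- data.frame(
  Name   = c("Ann", "Bob", "Cat", "Dan"),
  Smokes = c(TRUE, FALSE, FALSE, TRUE),
  Height = c(160, 180, 170, 175),
  Weight = c(60, 90, 65, 80)
)

bmi_by_smoking <- toy %>%
  select(Name, Smokes, Height, Weight) %>%     # select a subset of columns
  mutate(BMI = Weight / (Height / 100)^2) %>%  # create a new column from existing ones
  filter(BMI > 20) %>%                         # keep rows that meet a condition
  arrange(desc(BMI)) %>%                       # sort by a variable
  group_by(Smokes) %>%                         # one group per smoking status
  summarise(MeanBMI = mean(BMI), N = n())      # one summary row per group

bmi_by_smoking
```

Each step takes a data frame and returns a data frame, which is what allows the steps to be chained together with the pipe.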

Some of the plots we will create
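
The plots themselves are not reproduced here. As a rough sketch of the general ggplot2 pattern (data, aesthetic mappings, then a geom layer), a scatter plot could be built as follows. The toy values and the choice of plot are assumptions for illustration only.

```r
library(ggplot2)

# Hypothetical values standing in for the patients data
toy <- data.frame(
  Height = c(160, 180, 170, 175),
  Weight = c(60, 90, 65, 80),
  Sex    = c("Female", "Male", "Female", "Male")
)

# Map columns to aesthetics, then add a layer of points and axis labels
p <- ggplot(toy, aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  labs(x = "Height (cm)", y = "Weight (kg)")

p  # printing the object draws the plot
```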

Getting started

Install the tidyverse packages

install.packages("tidyverse")

Load the core tidyverse packages

library(tidyverse)

## ── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.0       ✔ purrr   0.3.1  
## ✔ tibble  2.0.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0

## ── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
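
The conflicts message means that a bare filter() now refers to the dplyr version. Both functions remain available through their package prefix, as this small sketch with made-up data shows.

```r
library(dplyr)

df <- data.frame(x = 1:5)

# dplyr's filter(): keep the rows of a data frame that satisfy a condition
rows <- dplyr::filter(df, x > 3)

# stats' filter(): linear filtering of a series, here a 3-point moving average
smoothed <- stats::filter(1:5, rep(1/3, 3))
```

Writing the prefix explicitly is one way to stay unambiguous when a script relies on the masked stats functions.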

Reading the patients dataset into R

patients <- read_tsv("patient-data-cleaned.txt")

## Parsed with column specification:
## cols(
##   ID = col_character(),
##   Name = col_character(),
##   Sex = col_character(),
##   Smokes = col_character(),
##   Height = col_double(),
##   Weight = col_double(),
##   Birth = col_date(format = ""),
##   State = col_character(),
##   Grade = col_double(),
##   Died = col_logical(),
##   Count = col_double(),
##   Date.Entered.Study = col_date(format = ""),
##   Age = col_double(),
##   BMI = col_double(),
##   Overweight = col_logical()
## )
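
The column specification printed above was guessed by readr from the file contents. It can also be supplied explicitly through the col_types argument, which makes the import independent of the guessing. The sketch below uses a tiny made-up file with invented IDs and values so that it is self-contained; only the column names and types are borrowed from the output above.

```r
library(readr)

# Write a small made-up tab-separated file (IDs and values are invented)
tmp <- tempfile(fileext = ".txt")
writeLines(c("ID\tHeight\tBirth\tDied",
             "P001\t182.9\t1972-02-06\tFALSE",
             "P002\t179.1\t1972-06-15\tTRUE"), tmp)

# State the column types up front instead of letting readr guess them
toy <- read_tsv(tmp, col_types = cols(
  ID     = col_character(),
  Height = col_double(),
  Birth  = col_date(format = ""),
  Died   = col_logical()
))

toy
```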

Resources

Course website
http://tinyurl.com/cruk-tidyr
R for Data Science book
http://r4ds.had.co.nz
Tidyverse website
https://www.tidyverse.org
Cookbook for R - Graphs section
http://www.cookbook-r.com/Graphs
RStudio cheat sheets
https://www.rstudio.com/resources/cheatsheets