The first part of the exercise uses the patients dataset from previous sections of the course. After reading it into R, answer the following questions using the summarise, summarise_at, summarise_if and mutate_all functions.
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.3.1
## ✔ tibble 2.0.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
patients <- read_tsv("patient-data-cleaned.txt")
## Parsed with column specification:
## cols(
## ID = col_character(),
## Name = col_character(),
## Sex = col_character(),
## Smokes = col_character(),
## Height = col_double(),
## Weight = col_double(),
## Birth = col_date(format = ""),
## State = col_character(),
## Grade = col_double(),
## Died = col_logical(),
## Count = col_double(),
## Date.Entered.Study = col_date(format = ""),
## Age = col_double(),
## BMI = col_double(),
## Overweight = col_logical()
## )
patients
## # A tibble: 100 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 2 AC/A… Derek Male Non-S… 179. 80.4 1972-06-15 Colo… 2 FALSE
## 3 AC/A… Todd Male Non-S… 169. 75.5 1972-07-09 New … 2 FALSE
## 4 AC/A… Rona… Male Non-S… 176. 94.5 1972-08-17 Colo… 1 FALSE
## 5 AC/A… Chri… Fema… Non-S… 164. 71.8 1973-06-12 Geor… 2 TRUE
## 6 AC/A… Dana Fema… Smoker 158. 69.9 1973-07-01 Indi… 2 FALSE
## 7 AC/A… Erin Fema… Non-S… 162. 68.8 1972-03-26 New … 1 FALSE
## 8 AC/A… Rach… Fema… Non-S… 166. 70.4 1973-05-11 Colo… 1 FALSE
## 9 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 10 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
Compute the mean age, height and weight of patients in the patients dataset using summarise, and then try to do the same using summarise_at.
summarise(patients, mean(Age), mean(Height), mean(Weight))
## # A tibble: 1 x 3
## `mean(Age)` `mean(Height)` `mean(Weight)`
## <dbl> <dbl> <dbl>
## 1 43.1 168. 74.9
summarise_at(patients, vars(Age, Height, Weight), mean)
## # A tibble: 1 x 3
## Age Height Weight
## <dbl> <dbl> <dbl>
## 1 43.1 168. 74.9
patients %>%
  summarize_at(vars(Age, Height, Weight), funs(mean)) %>%
  mutate_all(funs(round(., digits = 1)))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
##
## # Before:
## funs(name = f(.)
##
## # After:
## list(name = ~f(.))
## This warning is displayed once per session.
## # A tibble: 1 x 3
## Age Height Weight
## <dbl> <dbl> <dbl>
## 1 43.1 168. 74.9
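The warning above arises because funs() was deprecated in dplyr 0.8.0. A minimal sketch of two ways to avoid it follows, using a small made-up tibble (toy, with invented values) rather than the real patients data. The first form passes the function or a formula-style lambda directly; the second assumes dplyr 1.0.0 or later, where across() supersedes the scoped _at, _if and _all verbs.

```r
library(dplyr)

# Hypothetical toy data standing in for the patients table (values are made up)
toy <- tibble(
  Age    = c(40.26, 45.11, 50.03),
  Height = c(160.42, 170.18, 180.77),
  Weight = c(60.55, 75.31, 90.12)
)

# dplyr >= 0.8.0: pass the function itself, or a ~ lambda, instead of funs()
rounded_at <- toy %>%
  summarise_at(vars(Age, Height, Weight), mean) %>%
  mutate_all(~ round(., digits = 1))

# dplyr >= 1.0.0 (assumed here): across() replaces the scoped verbs
rounded_across <- toy %>%
  summarise(across(c(Age, Height, Weight), mean)) %>%
  mutate(across(everything(), ~ round(.x, digits = 1)))

rounded_at
rounded_across
```

Both pipelines should give the same one-row result; which form to prefer depends on the dplyr version installed.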
Compute the means of all numeric columns
summarise_if(patients, is.numeric, mean)
## # A tibble: 1 x 6
## Height Weight Grade Count Age BMI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 168. 74.9 NA -0.107 43.1 26.5
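The NA reported for Grade most likely means that column contains at least one missing value, since mean() returns NA when any input is NA. A short sketch of how extra arguments such as na.rm = TRUE can be passed through summarise_if is shown below; the toy tibble and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: Grade has one missing value, Name is non-numeric
toy <- tibble(
  Height = c(160, 170, 180),
  Grade  = c(1, NA, 3),
  Name   = c("a", "b", "c")
)

# Without na.rm, the mean of Grade is NA
summarise_if(toy, is.numeric, mean)

# Additional arguments after the function are passed on to mean()
means <- summarise_if(toy, is.numeric, mean, na.rm = TRUE)
means
```

Name is skipped because it fails the is.numeric test, and the missing Grade is dropped before averaging.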
See what happens if you try to compute the mean of a logical (boolean) variable
patients %>% summarize(mean(Died))
## # A tibble: 1 x 1
## `mean(Died)`
## <dbl>
## 1 0.54
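When mean() is applied to a logical vector, TRUE is treated as 1 and FALSE as 0, so the result is the proportion of TRUE values. The 0.54 above is therefore the proportion of patients recorded as having died. A tiny illustration with a made-up vector:

```r
# Hypothetical logical vector: three of four values are TRUE
died <- c(TRUE, FALSE, TRUE, TRUE)

n_true    <- sum(died)   # counts the TRUE values
prop_true <- mean(died)  # proportion of TRUE values

n_true
prop_true
```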
The following questions require grouping of patients based on one or more attributes using the group_by function.
Compare the average height of males and females in this patient cohort.
Are smokers heavier or lighter on average than non-smokers in this dataset?
patients %>%
  group_by(Sex) %>%
  summarize(`Average height` = mean(Height))
## # A tibble: 2 x 2
## Sex `Average height`
## <chr> <dbl>
## 1 Female 162.
## 2 Male 175.
patients %>%
  group_by(Smokes) %>%
  summarize(`Average weight` = mean(Weight))
## # A tibble: 2 x 2
## Smokes `Average weight`
## <chr> <dbl>
## 1 Non-Smoker 75.1
## 2 Smoker 74.2
patients %>%
  group_by(Sex, Smokes) %>%
  summarize(`Average weight` = mean(Weight))
## # A tibble: 4 x 3
## # Groups: Sex [2]
## Sex Smokes `Average weight`
## <chr> <chr> <dbl>
## 1 Female Non-Smoker 68.9
## 2 Female Smoker 69.0
## 3 Male Non-Smoker 82.7
## 4 Male Smoker 80.3
The patients are all part of a diabetes study and have had their blood glucose concentration and diastolic blood pressure measured on several dates.
This part of the exercise combines grouping, summarisation and joining operations to connect the diabetes study data to the patients table we’ve already been working with.
diabetes <- read_tsv("diabetes.txt")
## Parsed with column specification:
## cols(
## ID = col_character(),
## Date = col_date(format = ""),
## Glucose = col_double(),
## BP = col_double()
## )
diabetes
## # A tibble: 1,316 x 4
## ID Date Glucose BP
## <chr> <date> <dbl> <dbl>
## 1 AC/AH/001 2011-03-07 100 98
## 2 AC/AH/001 2011-03-14 110 89
## 3 AC/AH/001 2011-03-24 94 88
## 4 AC/AH/001 2011-03-31 111 92
## 5 AC/AH/001 2011-04-03 94 83
## 6 AC/AH/001 2011-05-21 110 93
## 7 AC/AH/001 2011-06-24 105 79
## 8 AC/AH/001 2011-07-11 88 86
## 9 AC/AH/001 2011-07-11 101 92
## 10 AC/AH/001 2011-07-13 112 88
## # … with 1,306 more rows
The goal is to compare the blood pressure of smokers and non-smokers, similar to the comparison of the average weight we made in the previous part of the exercise.
First, calculate the average blood pressure for each individual in the diabetes data frame.
bp <- diabetes %>%
  group_by(ID) %>%
  summarize(BP = mean(BP))
bp
## # A tibble: 100 x 2
## ID BP
## <chr> <dbl>
## 1 AC/AH/001 88.8
## 2 AC/AH/017 86.2
## 3 AC/AH/020 71.1
## 4 AC/AH/022 90
## 5 AC/AH/029 65.1
## 6 AC/AH/033 83.5
## 7 AC/AH/037 87.7
## 8 AC/AH/044 79.2
## 9 AC/AH/045 79.4
## 10 AC/AH/048 75.2
## # … with 90 more rows
Now use one of the join functions to combine these average blood pressure measurements with the patients data frame containing information on whether the patient is a smoker.
combined <- left_join(bp, patients, by = "ID")
combined
## # A tibble: 100 x 16
## ID BP Name Sex Smokes Height Weight Birth State Grade
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl>
## 1 AC/A… 88.8 Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2
## 2 AC/A… 86.2 Derek Male Non-S… 179. 80.4 1972-06-15 Colo… 2
## 3 AC/A… 71.1 Todd Male Non-S… 169. 75.5 1972-07-09 New … 2
## 4 AC/A… 90 Rona… Male Non-S… 176. 94.5 1972-08-17 Colo… 1
## 5 AC/A… 65.1 Chri… Fema… Non-S… 164. 71.8 1973-06-12 Geor… 2
## 6 AC/A… 83.5 Dana Fema… Smoker 158. 69.9 1973-07-01 Indi… 2
## 7 AC/A… 87.7 Erin Fema… Non-S… 162. 68.8 1972-03-26 New … 1
## 8 AC/A… 79.2 Rach… Fema… Non-S… 166. 70.4 1973-05-11 Colo… 1
## 9 AC/A… 79.4 Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1
## 10 AC/A… 75.2 Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2
## # … with 90 more rows, and 6 more variables: Died <lgl>, Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
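Here the join returns 100 rows, one for each ID in bp. In general, left_join keeps every row of its first argument and fills in NA where no match is found in the second, so it can be worth checking for unmatched IDs. The sketch below uses made-up tables (bp_toy and patients_toy are hypothetical) and anti_join to list the rows without a match.

```r
library(dplyr)

# Hypothetical toy tables: P3 has a blood pressure value but no patient record
bp_toy       <- tibble(ID = c("P1", "P2", "P3"), BP = c(80, 90, 85))
patients_toy <- tibble(ID = c("P1", "P2"), Smokes = c("Smoker", "Non-Smoker"))

# left_join keeps all rows of bp_toy; Smokes is NA for the unmatched P3
joined <- left_join(bp_toy, patients_toy, by = "ID")

# anti_join returns the rows of bp_toy with no matching ID in patients_toy
unmatched <- anti_join(bp_toy, patients_toy, by = "ID")

joined
unmatched
```

If the unmatched table is empty, left_join and inner_join would give the same rows for these inputs.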
Finally, calculate the average blood pressure for smokers and non-smokers from the resulting combined data frame.
combined %>%
  group_by(Smokes) %>%
  summarize(`Average blood pressure` = mean(BP))
## # A tibble: 2 x 2
## Smokes `Average blood pressure`
## <chr> <dbl>
## 1 Non-Smoker 82.0
## 2 Smoker 84.6
Can you write these three steps as a single dplyr chain?
diabetes %>%
  group_by(ID) %>%
  summarize(BP = mean(BP)) %>%
  left_join(patients, by = "ID") %>%
  group_by(Smokes) %>%
  summarize(`Average blood pressure` = mean(BP))
## # A tibble: 2 x 2
## Smokes `Average blood pressure`
## <chr> <dbl>
## 1 Non-Smoker 82.0
## 2 Smoker 84.6