Summarising, Grouping and Joining Exercise

Summarising

The first part of the exercise uses the patients dataset from previous sections of the course. After reading it into R, answer the following questions using the summarise, summarise_at, summarise_if and mutate_all functions.

library(tidyverse)

## ── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.0       ✔ purrr   0.3.1  
## ✔ tibble  2.0.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0

## ── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

patients <- read_tsv("patient-data-cleaned.txt")

## Parsed with column specification:
## cols(
##   ID = col_character(),
##   Name = col_character(),
##   Sex = col_character(),
##   Smokes = col_character(),
##   Height = col_double(),
##   Weight = col_double(),
##   Birth = col_date(format = ""),
##   State = col_character(),
##   Grade = col_double(),
##   Died = col_logical(),
##   Count = col_double(),
##   Date.Entered.Study = col_date(format = ""),
##   Age = col_double(),
##   BMI = col_double(),
##   Overweight = col_logical()
## )

patients

## # A tibble: 100 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor…     2 FALSE
##  2 AC/A… Derek Male  Non-S…   179.   80.4 1972-06-15 Colo…     2 FALSE
##  3 AC/A… Todd  Male  Non-S…   169.   75.5 1972-07-09 New …     2 FALSE
##  4 AC/A… Rona… Male  Non-S…   176.   94.5 1972-08-17 Colo…     1 FALSE
##  5 AC/A… Chri… Fema… Non-S…   164.   71.8 1973-06-12 Geor…     2 TRUE 
##  6 AC/A… Dana  Fema… Smoker   158.   69.9 1973-07-01 Indi…     2 FALSE
##  7 AC/A… Erin  Fema… Non-S…   162.   68.8 1972-03-26 New …     1 FALSE
##  8 AC/A… Rach… Fema… Non-S…   166.   70.4 1973-05-11 Colo…     1 FALSE
##  9 AC/A… Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor…     1 FALSE
## 10 AC/A… Bryan Male  Non-S…   167.   79.1 1973-07-19 New …     2 FALSE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

Compute the mean age, height and weight of patients in the patients dataset

First compute the means using summarise and then try to do the same using summarise_at

summarise(patients, mean(Age), mean(Height), mean(Weight))

## # A tibble: 1 x 3
##   `mean(Age)` `mean(Height)` `mean(Weight)`
##         <dbl>          <dbl>          <dbl>
## 1        43.1           168.           74.9

summarise_at(patients, vars(Age, Height, Weight), mean)

## # A tibble: 1 x 3
##     Age Height Weight
##   <dbl>  <dbl>  <dbl>
## 1  43.1   168.   74.9
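
In dplyr 1.0.0 and later, scoped verbs such as summarise_at are superseded by across(). The sketch below assumes a newer dplyr than the 0.8.0.1 loaded in this exercise, and uses a small made-up tibble, toy_patients, as a stand-in for the real dataset.

```r
library(dplyr)

# Hypothetical stand-in for the patients data (values are made up)
toy_patients <- tibble(
  Age    = c(40, 44, 46),
  Height = c(160, 170, 180),
  Weight = c(60, 75, 90)
)

# across() selects the columns and applies mean to each (needs dplyr >= 1.0.0)
toy_means <- summarise(toy_patients, across(c(Age, Height, Weight), mean))
toy_means
```

As with summarise_at, the result keeps the original column names.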

Modify the output by adding a step to round to 1 decimal place

patients %>%
  summarize_at(vars(Age, Height, Weight), mean) %>%
  mutate_all(~ round(., digits = 1))

## # A tibble: 1 x 3
##     Age Height Weight
##   <dbl>  <dbl>  <dbl>
## 1  43.1   168.   74.9

Compute the means of all numeric columns

summarise_if(patients, is.numeric, mean)

## # A tibble: 1 x 6
##   Height Weight Grade  Count   Age   BMI
##    <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1   168.   74.9    NA -0.107  43.1  26.5
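
The NA for Grade most likely means that the column contains missing values, since mean() returns NA whenever any of its inputs is NA. A minimal sketch with a made-up tibble shows how na.rm = TRUE can be passed through summarise_if to drop them first. Whether ignoring missing grades is appropriate for the real data is an assumption to check.

```r
library(dplyr)

# Hypothetical data with one missing Grade (values are made up)
toy_patients <- tibble(
  Name   = c("A", "B", "C"),
  Height = c(160, 170, 180),
  Grade  = c(1, NA, 3)
)

# Without na.rm, the mean of Grade is NA
with_na <- summarise_if(toy_patients, is.numeric, mean)

# Extra arguments after the function are passed on to mean()
without_na <- summarise_if(toy_patients, is.numeric, mean, na.rm = TRUE)
without_na
```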

See what happens if you try to compute the mean of a logical (boolean) variable

What proportion of our patient cohort has died?

patients %>% summarize(mean(Died))

## # A tibble: 1 x 1
##   `mean(Died)`
##          <dbl>
## 1         0.54
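
This works because R treats TRUE as 1 and FALSE as 0 in arithmetic, so the mean of a logical vector is the proportion of TRUE values. A small sketch with a made-up vector, died, illustrates the idea:

```r
# Hypothetical outcomes for five patients (made up for illustration)
died <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

n_died    <- sum(died)   # number of TRUE values
prop_died <- mean(died)  # proportion of TRUE values: sum(died) / length(died)
prop_died
```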

Grouping

The following questions require grouping of patients based on one or more attributes using the group_by function.

Compare the average height of males and females in this patient cohort.

Are smokers heavier or lighter on average than non-smokers in this dataset?

patients %>%
  group_by(Sex) %>%
  summarize(`Average height` = mean(Height))

## # A tibble: 2 x 2
##   Sex    `Average height`
##   <chr>             <dbl>
## 1 Female             162.
## 2 Male               175.

patients %>%
  group_by(Smokes) %>%
  summarize(`Average weight` = mean(Weight))

## # A tibble: 2 x 2
##   Smokes     `Average weight`
##   <chr>                 <dbl>
## 1 Non-Smoker             75.1
## 2 Smoker                 74.2

patients %>%
  group_by(Sex, Smokes) %>%
  summarize(`Average weight` = mean(Weight))

## # A tibble: 4 x 3
## # Groups:   Sex [2]
##   Sex    Smokes     `Average weight`
##   <chr>  <chr>                 <dbl>
## 1 Female Non-Smoker             68.9
## 2 Female Smoker                 69.0
## 3 Male   Non-Smoker             82.7
## 4 Male   Smoker                 80.3
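
When comparing group averages it can help to see how many patients each average is based on. The sketch below, using a made-up tibble rather than the real dataset, adds a count with n() and calls ungroup() so that the result no longer carries the leftover Sex grouping shown in the output above.

```r
library(dplyr)

# Hypothetical stand-in for the patients data (values are made up)
toy_patients <- tibble(
  Sex    = c("Female", "Female", "Male", "Male", "Male"),
  Smokes = c("Smoker", "Non-Smoker", "Non-Smoker", "Non-Smoker", "Smoker"),
  Weight = c(65, 70, 80, 84, 79)
)

toy_summary <- toy_patients %>%
  group_by(Sex, Smokes) %>%
  # n() counts the rows in each Sex/Smokes combination
  summarise(N = n(), `Average weight` = mean(Weight)) %>%
  # drop the remaining grouping by Sex
  ungroup()
toy_summary
```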

Joining

The patients are all part of a diabetes study and have had their blood glucose concentration and diastolic blood pressure measured on several dates.

This part of the exercise combines grouping, summarisation and joining operations to connect the diabetes study data to the patients table we’ve already been working with.

diabetes <- read_tsv("diabetes.txt")

## Parsed with column specification:
## cols(
##   ID = col_character(),
##   Date = col_date(format = ""),
##   Glucose = col_double(),
##   BP = col_double()
## )

diabetes

## # A tibble: 1,316 x 4
##    ID        Date       Glucose    BP
##    <chr>     <date>       <dbl> <dbl>
##  1 AC/AH/001 2011-03-07     100    98
##  2 AC/AH/001 2011-03-14     110    89
##  3 AC/AH/001 2011-03-24      94    88
##  4 AC/AH/001 2011-03-31     111    92
##  5 AC/AH/001 2011-04-03      94    83
##  6 AC/AH/001 2011-05-21     110    93
##  7 AC/AH/001 2011-06-24     105    79
##  8 AC/AH/001 2011-07-11      88    86
##  9 AC/AH/001 2011-07-11     101    92
## 10 AC/AH/001 2011-07-13     112    88
## # … with 1,306 more rows

The goal is to compare the blood pressure of smokers and non-smokers, similar to the comparison of the average weight we made in the previous part of the exercise.

First, calculate the average blood pressure for each individual in the diabetes data frame.

bp <- diabetes %>%
  group_by(ID) %>%
  summarize(BP = mean(BP))
bp

## # A tibble: 100 x 2
##    ID           BP
##    <chr>     <dbl>
##  1 AC/AH/001  88.8
##  2 AC/AH/017  86.2
##  3 AC/AH/020  71.1
##  4 AC/AH/022  90  
##  5 AC/AH/029  65.1
##  6 AC/AH/033  83.5
##  7 AC/AH/037  87.7
##  8 AC/AH/044  79.2
##  9 AC/AH/045  79.4
## 10 AC/AH/048  75.2
## # … with 90 more rows

Now use one of the join functions to combine these average blood pressure measurements with the patients data frame containing information on whether the patient is a smoker.

combined <- left_join(bp, patients, by = "ID")
combined

## # A tibble: 100 x 16
##    ID       BP Name  Sex   Smokes Height Weight Birth      State Grade
##    <chr> <dbl> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl>
##  1 AC/A…  88.8 Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor…     2
##  2 AC/A…  86.2 Derek Male  Non-S…   179.   80.4 1972-06-15 Colo…     2
##  3 AC/A…  71.1 Todd  Male  Non-S…   169.   75.5 1972-07-09 New …     2
##  4 AC/A…  90   Rona… Male  Non-S…   176.   94.5 1972-08-17 Colo…     1
##  5 AC/A…  65.1 Chri… Fema… Non-S…   164.   71.8 1973-06-12 Geor…     2
##  6 AC/A…  83.5 Dana  Fema… Smoker   158.   69.9 1973-07-01 Indi…     2
##  7 AC/A…  87.7 Erin  Fema… Non-S…   162.   68.8 1972-03-26 New …     1
##  8 AC/A…  79.2 Rach… Fema… Non-S…   166.   70.4 1973-05-11 Colo…     1
##  9 AC/A…  79.4 Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor…     1
## 10 AC/A…  75.2 Bryan Male  Non-S…   167.   79.1 1973-07-19 New …     2
## # … with 90 more rows, and 6 more variables: Died <lgl>, Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
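
A left join keeps every row of the first table and fills in NA where the second table has no matching ID. The row counts above suggest every ID matched here, but a quick check with anti_join can confirm it. The sketch uses two made-up tables, toy_bp and toy_patients, with one deliberately unmatched ID.

```r
library(dplyr)

# Hypothetical tables (values are made up); P3 has no patient record
toy_bp       <- tibble(ID = c("P1", "P2", "P3"), BP = c(88, 76, 91))
toy_patients <- tibble(ID = c("P1", "P2"), Smokes = c("Smoker", "Non-Smoker"))

# left_join keeps all rows of toy_bp; Smokes is NA for the unmatched P3
toy_combined <- left_join(toy_bp, toy_patients, by = "ID")

# anti_join returns the rows of toy_bp with no matching ID in toy_patients
unmatched <- anti_join(toy_bp, toy_patients, by = "ID")
unmatched
```

An empty result from anti_join would indicate that every ID in the first table has a match in the second.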

Finally, calculate the average blood pressure for smokers and non-smokers in the resulting combined data frame.

combined %>%
  group_by(Smokes) %>%
  summarize(`Average blood pressure` = mean(BP))

## # A tibble: 2 x 2
##   Smokes     `Average blood pressure`
##   <chr>                         <dbl>
## 1 Non-Smoker                     82.0
## 2 Smoker                         84.6

Can you write these three steps as a single dplyr chain?

diabetes %>%
  group_by(ID) %>%
  summarize(BP = mean(BP)) %>%
  left_join(patients, by = "ID") %>%
  group_by(Smokes) %>%
  summarize(`Average blood pressure` = mean(BP))

## # A tibble: 2 x 2
##   Smokes     `Average blood pressure`
##   <chr>                         <dbl>
## 1 Non-Smoker                     82.0
## 2 Smoker                         84.6

Matt Eldridge

Last modified: 11 Mar 2019
