In this session we will introduce the concept of ‘piping’ to help with creating workflows from chains of manipulations on our data. We’ll also look at a couple of other useful dplyr
verbs.
If you haven’t already done so, or are working in a new session, you’ll need to load the core packages from the tidyverse.
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.3.1
## ✔ tibble 2.0.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
In the previous exercise we ended up making a series of manipulations to the patients dataset.
patients <- read_tsv("patient-data.txt")
patients <- mutate(patients, Smokes = Smokes %in% c("TRUE", "Yes"))
patients <- mutate(patients, Height = as.numeric(str_remove(Height, pattern = "cm$")))
patients <- mutate(patients, Weight = as.numeric(str_remove(Weight, pattern = "kg$")))
patients <- mutate(patients, BMI = Weight / (Height / 100) ** 2)
patients <- mutate(patients, Overweight = BMI > 25)
Each statement includes an assignment to overwrite the data frame on which we are operating. Surely this could be written in a more succinct and elegant manner.
The tidyverse imports a very useful operator, %>%
from the magrittr
package. This is the ‘pipe’ operator and works a bit like the Unix pipe operator allowing the output from one operation to be “piped” in as the input to another operation.
Let’s look at one of those cleaning operations on the patients dataset to see how piping works.
patients <- read_tsv("patient-data.txt")
## Parsed with column specification:
## cols(
## ID = col_character(),
## Name = col_character(),
## Sex = col_character(),
## Smokes = col_character(),
## Height = col_character(),
## Weight = col_character(),
## Birth = col_date(format = ""),
## State = col_character(),
## Grade_Level = col_double(),
## Died = col_logical(),
## Count = col_double(),
## Date.Entered.Study = col_date(format = "")
## )
mutate(patients, Height = as.numeric(str_remove(Height, pattern = "cm$")))
## # A tibble: 100 x 12
## ID Name Sex Smokes Height Weight Birth State Grade_Level
## <chr> <chr> <chr> <chr> <dbl> <chr> <date> <chr> <dbl>
## 1 AC/A… Mich… Male FALSE 183. 76.57… 1972-02-06 Geor… 2
## 2 AC/A… Derek Male FALSE 179. 80.43… 1972-06-15 Colo… 2
## 3 AC/A… Todd Male FALSE 169. 75.48… 1972-07-09 New … 2
## 4 AC/A… Rona… Male FALSE 176. 94.54… 1972-08-17 Colo… 1
## 5 AC/A… Chri… Fema… FALSE 164. 71.78… 1973-06-12 Geor… 2
## 6 AC/A… Dana Fema… TRUE 158. 69.9kg 1973-07-01 Indi… 2
## 7 AC/A… Erin Fema… FALSE 162. 68.85… 1972-03-26 New … 1
## 8 AC/A… Rach… Fema… No 166. 70.44… 1973-05-11 Colo… 1
## 9 AC/A… Rona… Male FALSE 181. 76.9kg 1971-12-31 Geor… 1
## 10 AC/A… Bryan Male FALSE 167. 79.06… 1973-07-19 New … 2
## # … with 90 more rows, and 3 more variables: Died <lgl>, Count <dbl>,
## # Date.Entered.Study <date>
Instead of passing the patients data frame into the mutate
function as it’s first argument we could use the %>%
operator as follows.
patients %>% mutate(Height = as.numeric(str_remove(Height, pattern = "cm$")))
## # A tibble: 100 x 12
## ID Name Sex Smokes Height Weight Birth State Grade_Level
## <chr> <chr> <chr> <chr> <dbl> <chr> <date> <chr> <dbl>
## 1 AC/A… Mich… Male FALSE 183. 76.57… 1972-02-06 Geor… 2
## 2 AC/A… Derek Male FALSE 179. 80.43… 1972-06-15 Colo… 2
## 3 AC/A… Todd Male FALSE 169. 75.48… 1972-07-09 New … 2
## 4 AC/A… Rona… Male FALSE 176. 94.54… 1972-08-17 Colo… 1
## 5 AC/A… Chri… Fema… FALSE 164. 71.78… 1973-06-12 Geor… 2
## 6 AC/A… Dana Fema… TRUE 158. 69.9kg 1973-07-01 Indi… 2
## 7 AC/A… Erin Fema… FALSE 162. 68.85… 1972-03-26 New … 1
## 8 AC/A… Rach… Fema… No 166. 70.44… 1973-05-11 Colo… 1
## 9 AC/A… Rona… Male FALSE 181. 76.9kg 1971-12-31 Geor… 1
## 10 AC/A… Bryan Male FALSE 167. 79.06… 1973-07-19 New … 2
## # … with 90 more rows, and 3 more variables: Died <lgl>, Count <dbl>,
## # Date.Entered.Study <date>
The basic form of a piped operation is:
x %>% f(y)
is equivalent to f(x, y)
Piping becomes really useful when there are a number of steps involved in transforming a dataset.
patients <- read.delim("patient-data.txt") %>%
as_tibble %>%
mutate(Sex = as_factor(str_trim(Sex))) %>%
mutate(Height = as.numeric(str_remove(Height, pattern = "cm$")))
The usual way of developing a workflow is to build it up one step at a time, testing the output produced at each stage.
See separate markdown document.
The filter
verb allows you to choose rows from a data frame that match some specified criteria. The criteria are based on values of variables and can make use of comparison operators such as ==
, >
, <
and !=
.
For example to filter the patients dataset so it only contains males.
patients <- read_tsv("patient-data-cleaned.txt")
## Parsed with column specification:
## cols(
## ID = col_character(),
## Name = col_character(),
## Sex = col_character(),
## Smokes = col_character(),
## Height = col_double(),
## Weight = col_double(),
## Birth = col_date(format = ""),
## State = col_character(),
## Grade = col_double(),
## Died = col_logical(),
## Count = col_double(),
## Date.Entered.Study = col_date(format = ""),
## Age = col_double(),
## BMI = col_double(),
## Overweight = col_logical()
## )
filter(patients, Sex == "Male")
## # A tibble: 45 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 2 AC/A… Derek Male Non-S… 179. 80.4 1972-06-15 Colo… 2 FALSE
## 3 AC/A… Todd Male Non-S… 169. 75.5 1972-07-09 New … 2 FALSE
## 4 AC/A… Rona… Male Non-S… 176. 94.5 1972-08-17 Colo… 1 FALSE
## 5 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 6 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## 7 AC/A… Kenn… Male Smoker 175. 92.2 1972-03-22 Colo… 3 FALSE
## 8 AC/A… Marc Male Non-S… 181. 72.3 1972-11-22 New … NA TRUE
## 9 AC/A… Bran… Male Non-S… 169. 73.3 1971-11-22 New … 3 FALSE
## 10 AC/A… Trav… Male Non-S… 176. 97.7 1973-04-14 New … 2 FALSE
## # … with 35 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
The equivalent in base R is much less intuitive.
patients[patients$Sex == "Male",]
## # A tibble: 45 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 2 AC/A… Derek Male Non-S… 179. 80.4 1972-06-15 Colo… 2 FALSE
## 3 AC/A… Todd Male Non-S… 169. 75.5 1972-07-09 New … 2 FALSE
## 4 AC/A… Rona… Male Non-S… 176. 94.5 1972-08-17 Colo… 1 FALSE
## 5 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 6 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## 7 AC/A… Kenn… Male Smoker 175. 92.2 1972-03-22 Colo… 3 FALSE
## 8 AC/A… Marc Male Non-S… 181. 72.3 1972-11-22 New … NA TRUE
## 9 AC/A… Bran… Male Non-S… 169. 73.3 1971-11-22 New … 3 FALSE
## 10 AC/A… Trav… Male Non-S… 176. 97.7 1973-04-14 New … 2 FALSE
## # … with 35 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
We can also use the !=
operator.
filter(patients, Sex != "Female")
## # A tibble: 45 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 2 AC/A… Derek Male Non-S… 179. 80.4 1972-06-15 Colo… 2 FALSE
## 3 AC/A… Todd Male Non-S… 169. 75.5 1972-07-09 New … 2 FALSE
## 4 AC/A… Rona… Male Non-S… 176. 94.5 1972-08-17 Colo… 1 FALSE
## 5 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 6 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## 7 AC/A… Kenn… Male Smoker 175. 92.2 1972-03-22 Colo… 3 FALSE
## 8 AC/A… Marc Male Non-S… 181. 72.3 1972-11-22 New … NA TRUE
## 9 AC/A… Bran… Male Non-S… 169. 73.3 1971-11-22 New … 3 FALSE
## 10 AC/A… Trav… Male Non-S… 176. 97.7 1973-04-14 New … 2 FALSE
## # … with 35 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
We can filter for a set of values using the %in%
operator.
filter(patients, State %in% c("Florida", "Georgia", "Illinois"))
## # A tibble: 16 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 2 AC/A… Chri… Fema… Non-S… 164. 71.8 1973-06-12 Geor… 2 TRUE
## 3 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 4 AC/A… Kimb… Fema… Non-S… 160. 72.4 1972-05-04 Geor… 2 TRUE
## 5 AC/A… Bran… Male Smoker 182. 79.5 1972-05-09 Geor… 1 TRUE
## 6 AC/A… Aaron Male Non-S… 166. 76.7 1972-07-15 Geor… 1 TRUE
## 7 AC/A… Mich… Fema… Non-S… 158. 67.4 1972-05-12 Geor… 3 TRUE
## 8 AC/A… Kris… Fema… Non-S… 163. 65.2 1973-05-23 Geor… 2 FALSE
## 9 AC/A… Linda Fema… Non-S… 165. 70.8 1972-02-07 Geor… NA TRUE
## 10 AC/S… Sher… Fema… Non-S… 159. 64.9 1973-02-04 Geor… 2 TRUE
## 11 AC/S… Jimmy Male Non-S… 171. 81.7 1973-08-11 Geor… 2 TRUE
## 12 AC/S… Sarah Fema… Non-S… 160. 68.2 1972-04-21 Geor… 3 TRUE
## 13 AC/S… Rich… Male Non-S… 173. 67.6 1972-02-10 Geor… 1 FALSE
## 14 AC/S… Suza… Fema… Non-S… 159. 70.4 1973-10-06 Geor… 2 TRUE
## 15 AC/S… Phil… Male Non-S… 177. 88.7 1971-11-30 Geor… 3 TRUE
## 16 AC/S… Wendy Fema… Non-S… 159. 67.0 1972-06-25 Geor… 1 TRUE
## # … with 5 more variables: Count <dbl>, Date.Entered.Study <date>,
## # Age <dbl>, BMI <dbl>, Overweight <lgl>
Partial matches can be made using the str_detect
function from the stringr
package. Note this is similar to the grepl
function in base R.
For example, let’s select all the patients whose name begins with a ‘B’.
filter(patients, str_detect(Name, "^B"))
## # A tibble: 13 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## 2 AC/A… Bran… Male Non-S… 169. 73.3 1971-11-22 New … 3 FALSE
## 3 AC/A… Bran… Male Smoker 182. 79.5 1972-05-09 Geor… 1 TRUE
## 4 AC/A… Brett Male Smoker 177. 74.0 1972-02-24 Indi… 2 FALSE
## 5 AC/A… Brent Male Non-S… 175. 83.6 1973-05-07 New … 3 TRUE
## 6 AC/A… Bran… Fema… Non-S… 160. 68.0 1972-12-08 New … 2 FALSE
## 7 AC/A… Barb… Fema… Non-S… 160. 65.9 1973-01-31 New … NA TRUE
## 8 AC/A… Bran… Male Non-S… 179. 97.0 1973-01-18 Indi… 1 TRUE
## 9 AC/S… Bran… Fema… Non-S… 157. 66.6 1972-04-11 Colo… 3 TRUE
## 10 AC/S… Barb… Fema… Smoker 162. 66.5 1972-02-21 New … 3 FALSE
## 11 AC/S… Benj… Male Non-S… 177. 90.8 1973-02-03 Indi… 3 FALSE
## 12 AC/S… Bren… Fema… Non-S… 158. 69.8 1972-05-21 Cali… 3 FALSE
## 13 AC/S… Barb… Fema… Smoker 163. 64.5 1973-06-25 Indi… 2 FALSE
## # … with 5 more variables: Count <dbl>, Date.Entered.Study <date>,
## # Age <dbl>, BMI <dbl>, Overweight <lgl>
Note that the str_detect
function returns a logical vector - this is important since the criterion for filtering must evaluate to TRUE
or FALSE
.
Also note that the second argument to str_detect
is a regular expression. An alternative function from stringr
we could have used in this case is str_starts
; with this we no longer need to ‘^’ symbol in our regular expression.
filter(patients, str_starts(Name, "B"))
## # A tibble: 13 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## 2 AC/A… Bran… Male Non-S… 169. 73.3 1971-11-22 New … 3 FALSE
## 3 AC/A… Bran… Male Smoker 182. 79.5 1972-05-09 Geor… 1 TRUE
## 4 AC/A… Brett Male Smoker 177. 74.0 1972-02-24 Indi… 2 FALSE
## 5 AC/A… Brent Male Non-S… 175. 83.6 1973-05-07 New … 3 TRUE
## 6 AC/A… Bran… Fema… Non-S… 160. 68.0 1972-12-08 New … 2 FALSE
## 7 AC/A… Barb… Fema… Non-S… 160. 65.9 1973-01-31 New … NA TRUE
## 8 AC/A… Bran… Male Non-S… 179. 97.0 1973-01-18 Indi… 1 TRUE
## 9 AC/S… Bran… Fema… Non-S… 157. 66.6 1972-04-11 Colo… 3 TRUE
## 10 AC/S… Barb… Fema… Smoker 162. 66.5 1972-02-21 New … 3 FALSE
## 11 AC/S… Benj… Male Non-S… 177. 90.8 1973-02-03 Indi… 3 FALSE
## 12 AC/S… Bren… Fema… Non-S… 158. 69.8 1972-05-21 Cali… 3 FALSE
## 13 AC/S… Barb… Fema… Smoker 163. 64.5 1973-06-25 Indi… 2 FALSE
## # … with 5 more variables: Count <dbl>, Date.Entered.Study <date>,
## # Age <dbl>, BMI <dbl>, Overweight <lgl>
We can filter on logical variables straightforwardly.
filter(patients, !Died)
## # A tibble: 46 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 2 AC/A… Derek Male Non-S… 179. 80.4 1972-06-15 Colo… 2 FALSE
## 3 AC/A… Todd Male Non-S… 169. 75.5 1972-07-09 New … 2 FALSE
## 4 AC/A… Rona… Male Non-S… 176. 94.5 1972-08-17 Colo… 1 FALSE
## 5 AC/A… Dana Fema… Smoker 158. 69.9 1973-07-01 Indi… 2 FALSE
## 6 AC/A… Erin Fema… Non-S… 162. 68.8 1972-03-26 New … 1 FALSE
## 7 AC/A… Rach… Fema… Non-S… 166. 70.4 1973-05-11 Colo… 1 FALSE
## 8 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 9 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## 10 AC/A… Pame… Fema… Non-S… 166. 67.3 1971-11-14 New … 1 FALSE
## # … with 36 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
Also we can add extra conditions, separating them with a ,
.
filter(patients, Sex == "Male", Died)
## # A tibble: 23 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Marc Male Non-S… 181. 72.3 1972-11-22 New … NA TRUE
## 2 AC/A… Jimmy Male Non-S… 179. 75.5 1972-03-17 New … 2 TRUE
## 3 AC/A… Thom… Male Smoker 170. 90.6 1972-04-18 New … 1 TRUE
## 4 AC/A… Geor… Male Non-S… 168. 82.1 1972-11-04 New … 1 TRUE
## 5 AC/A… Stev… Male Non-S… 181. 83.9 1973-10-19 New … 3 TRUE
## 6 AC/A… Bran… Male Smoker 182. 79.5 1972-05-09 Geor… 1 TRUE
## 7 AC/A… Aaron Male Non-S… 166. 76.7 1972-07-15 Geor… 1 TRUE
## 8 AC/A… Brent Male Non-S… 175. 83.6 1973-05-07 New … 3 TRUE
## 9 AC/A… Joel Male Non-S… 166. 76.8 1972-08-23 New … NA TRUE
## 10 AC/A… Aaron Male Non-S… 181. 83.6 1972-03-30 Cali… 3 TRUE
## # … with 13 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
In this example all males who have died are selected. The ,
is the equivalent of using the Boolean operator &
.
filter(patients, Sex == "Male" & Died)
## # A tibble: 23 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Marc Male Non-S… 181. 72.3 1972-11-22 New … NA TRUE
## 2 AC/A… Jimmy Male Non-S… 179. 75.5 1972-03-17 New … 2 TRUE
## 3 AC/A… Thom… Male Smoker 170. 90.6 1972-04-18 New … 1 TRUE
## 4 AC/A… Geor… Male Non-S… 168. 82.1 1972-11-04 New … 1 TRUE
## 5 AC/A… Stev… Male Non-S… 181. 83.9 1973-10-19 New … 3 TRUE
## 6 AC/A… Bran… Male Smoker 182. 79.5 1972-05-09 Geor… 1 TRUE
## 7 AC/A… Aaron Male Non-S… 166. 76.7 1972-07-15 Geor… 1 TRUE
## 8 AC/A… Brent Male Non-S… 175. 83.6 1973-05-07 New … 3 TRUE
## 9 AC/A… Joel Male Non-S… 166. 76.8 1972-08-23 New … NA TRUE
## 10 AC/A… Aaron Male Non-S… 181. 83.6 1972-03-30 Cali… 3 TRUE
## # … with 13 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
The equivalent in regular R is more verbose and less easy to read.
patients[patients$Sex == "Male" & patients$Died,]
## # A tibble: 23 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Marc Male Non-S… 181. 72.3 1972-11-22 New … NA TRUE
## 2 AC/A… Jimmy Male Non-S… 179. 75.5 1972-03-17 New … 2 TRUE
## 3 AC/A… Thom… Male Smoker 170. 90.6 1972-04-18 New … 1 TRUE
## 4 AC/A… Geor… Male Non-S… 168. 82.1 1972-11-04 New … 1 TRUE
## 5 AC/A… Stev… Male Non-S… 181. 83.9 1973-10-19 New … 3 TRUE
## 6 AC/A… Bran… Male Smoker 182. 79.5 1972-05-09 Geor… 1 TRUE
## 7 AC/A… Aaron Male Non-S… 166. 76.7 1972-07-15 Geor… 1 TRUE
## 8 AC/A… Brent Male Non-S… 175. 83.6 1973-05-07 New … 3 TRUE
## 9 AC/A… Joel Male Non-S… 166. 76.8 1972-08-23 New … NA TRUE
## 10 AC/A… Aaron Male Non-S… 181. 83.6 1972-03-30 Cali… 3 TRUE
## # … with 13 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
We can use the |
Boolean operator to select patients above a given weight or BMI.
filter(patients, Weight > 90 | BMI > 28)
## # A tibble: 22 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Rona… Male Non-S… 176. 94.5 1972-08-17 Colo… 1 FALSE
## 2 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## 3 AC/A… Kimb… Fema… Non-S… 160. 72.4 1972-05-04 Geor… 2 TRUE
## 4 AC/A… Kenn… Male Smoker 175. 92.2 1972-03-22 Colo… 3 FALSE
## 5 AC/A… Trav… Male Non-S… 176. 97.7 1973-04-14 New … 2 FALSE
## 6 AC/A… Thom… Male Smoker 170. 90.6 1972-04-18 New … 1 TRUE
## 7 AC/A… Geor… Male Non-S… 168. 82.1 1972-11-04 New … 1 TRUE
## 8 AC/A… Jeff… Male Non-S… 182. 96.9 1972-12-10 Colo… 2 FALSE
## 9 AC/A… Kris… Fema… Non-S… 160. 71.9 1973-09-28 New … 2 TRUE
## 10 AC/A… Sher… Fema… Non-S… 157. 69.6 1973-07-20 Cali… 2 TRUE
## # … with 12 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
Mixing both Boolean operators and ,
is also possible.
filter(patients, Weight > 90 | BMI > 28, Sex == "Female")
## # A tibble: 4 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died Count
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl> <dbl>
## 1 AC/A… Kimb… Fema… Non-S… 160. 72.4 1972-05-04 Geor… 2 TRUE -2.41
## 2 AC/A… Kris… Fema… Non-S… 160. 71.9 1973-09-28 New … 2 TRUE 0.74
## 3 AC/A… Sher… Fema… Non-S… 157. 69.6 1973-07-20 Cali… 2 TRUE -0.7
## 4 AC/A… April Fema… Non-S… 161. 73.6 1972-02-14 Indi… 3 FALSE -0.82
## # … with 4 more variables: Date.Entered.Study <date>, Age <dbl>,
## # BMI <dbl>, Overweight <lgl>
See separate markdown document.
Another dplyr
verb that works on rows in a table is arrange
. This is used to sort rows in a dataset based on one or more variables.
For example, let’s say we want to sort our patients by height.
arrange(patients, Height)
## # A tibble: 100 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/S… Bran… Fema… Non-S… 157. 66.6 1972-04-11 Colo… 3 TRUE
## 2 AC/A… Sher… Fema… Non-S… 157. 69.6 1973-07-20 Cali… 2 TRUE
## 3 AC/A… Mich… Fema… Non-S… 158. 67.4 1972-05-12 Geor… 3 TRUE
## 4 AC/S… Bren… Fema… Non-S… 158. 69.8 1972-05-21 Cali… 3 FALSE
## 5 AC/A… Dana Fema… Smoker 158. 69.9 1973-07-01 Indi… 2 FALSE
## 6 AC/S… Laura Fema… Non-S… 158. 69.7 1972-06-08 New … 3 TRUE
## 7 AC/S… Mary Fema… Non-S… 159. 65.1 1973-06-01 Colo… 1 FALSE
## 8 AC/S… Wendy Fema… Non-S… 159. 67.0 1972-06-25 Geor… 1 TRUE
## 9 AC/S… Erica Fema… Non-S… 159. 70.0 1973-10-04 Cali… 2 TRUE
## 10 AC/S… Sher… Fema… Non-S… 159. 64.9 1973-02-04 Geor… 2 TRUE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
This has arranged the rows in order of ascending height. What if we wanted descending order of height?
arrange(patients, desc(Height))
## # A tibble: 100 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Char… Male Non-S… 185. 87.2 1973-07-19 New … 1 FALSE
## 2 AC/S… Doug… Male Smoker 185. 73.6 1973-06-11 New … 3 FALSE
## 3 AC/S… Joel Male Non-S… 184. 90.4 1972-06-11 Cali… 3 TRUE
## 4 AC/A… Stev… Male Smoker 183. 83.4 1973-10-05 New … 2 FALSE
## 5 AC/S… Russ… Male Non-S… 183. 82.5 1972-10-30 New … 3 TRUE
## 6 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 7 AC/A… Jeff… Male Non-S… 182. 96.9 1972-12-10 Colo… 2 FALSE
## 8 AC/A… Bran… Male Smoker 182. 79.5 1972-05-09 Geor… 1 TRUE
## 9 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 10 AC/A… Stev… Male Non-S… 181. 83.9 1973-10-19 New … 3 TRUE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
We can use sort using multiple variables, e.g. first by Grade in descending order, then by Sex and then Smokes.
arrange(patients, desc(Grade), Sex, Smokes)
## # A tibble: 100 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Tere… Fema… Non-S… 163. 70.5 1973-04-22 Indi… 3 TRUE
## 2 AC/A… Mich… Fema… Non-S… 158. 67.4 1972-05-12 Geor… 3 TRUE
## 3 AC/A… Wendy Fema… Non-S… 164. 66.7 1971-12-29 New … 3 TRUE
## 4 AC/A… April Fema… Non-S… 161. 73.6 1972-02-14 Indi… 3 FALSE
## 5 AC/S… Vale… Fema… Non-S… 162. 70.0 1972-04-09 Cali… 3 TRUE
## 6 AC/S… Bran… Fema… Non-S… 157. 66.6 1972-04-11 Colo… 3 TRUE
## 7 AC/S… Sarah Fema… Non-S… 160. 68.2 1972-04-21 Geor… 3 TRUE
## 8 AC/S… Carr… Fema… Non-S… 164. 71.5 1973-03-18 New … 3 FALSE
## 9 AC/S… Bren… Fema… Non-S… 158. 69.8 1972-05-21 Cali… 3 FALSE
## 10 AC/S… Laura Fema… Non-S… 158. 69.7 1972-06-08 New … 3 TRUE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
Sorting is commonly used in workflows usually as one of the last steps before presentation or writing out the resulting table to a file.
The following concise and easy-to-read workflow includes steps that use all of the dplyr
verbs we have covered so far.
candidates <- patients %>%
filter(!Died) %>%
select(ID, Name, Sex, Smokes, Height, Weight) %>%
mutate(BMI = Weight / (Height / 100) ** 2) %>%
mutate(Overweight = BMI > 25) %>%
arrange(BMI)
candidates
## # A tibble: 46 x 8
## ID Name Sex Smokes Height Weight BMI Overweight
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <lgl>
## 1 AC/SG/193 Douglas Male Smoker 185. 73.6 21.4 FALSE
## 2 AC/SG/099 Richard Male Non-Smoker 173. 67.6 22.7 FALSE
## 3 AC/AH/001 Michael Male Non-Smoker 183. 76.6 22.9 FALSE
## 4 AC/AH/210 Alicia Female Smoker 170. 66.7 23.1 FALSE
## 5 AC/AH/086 Jeffrey Male Smoker 180. 75.7 23.3 FALSE
## 6 AC/AH/045 Ronald Male Non-Smoker 181. 76.9 23.4 FALSE
## 7 AC/AH/164 Brett Male Smoker 177. 74.0 23.6 FALSE
## 8 AC/AH/114 Mark Male Non-Smoker 178. 74.8 23.7 FALSE
## 9 AC/AH/077 Andrew Male Non-Smoker 174. 72.2 23.8 FALSE
## 10 AC/SG/173 Barbara Female Smoker 163. 64.5 24.2 FALSE
## # … with 36 more rows