We will now turn our attention to some of the more advanced operations you can perform on data frames using dplyr.
Especially I will be taking you through three important dplyr fucntions.
Before we start summarise and combine operations, let’s clean the environment and start fresh.
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.3.1
## ✔ tibble 2.0.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Read in the patients data frame using read_tsv
from the readr
package. read_tsv
imports tab-delimited files (tsv = tab-separated values).
patients <- read_tsv("patient-data-cleaned.txt")
## Parsed with column specification:
## cols(
## ID = col_character(),
## Name = col_character(),
## Sex = col_character(),
## Smokes = col_character(),
## Height = col_double(),
## Weight = col_double(),
## Birth = col_date(format = ""),
## State = col_character(),
## Grade = col_double(),
## Died = col_logical(),
## Count = col_double(),
## Date.Entered.Study = col_date(format = ""),
## Age = col_double(),
## BMI = col_double(),
## Overweight = col_logical()
## )
We can compute summary statistics for selected columns in our dataset using the summarise
verb. For example, we could use summarise
to calculate the average height of patients in our patients
data.
Let’s first remind ourselves what the patients
dataset looks like.
patients
## # A tibble: 100 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 2 AC/A… Derek Male Non-S… 179. 80.4 1972-06-15 Colo… 2 FALSE
## 3 AC/A… Todd Male Non-S… 169. 75.5 1972-07-09 New … 2 FALSE
## 4 AC/A… Rona… Male Non-S… 176. 94.5 1972-08-17 Colo… 1 FALSE
## 5 AC/A… Chri… Fema… Non-S… 164. 71.8 1973-06-12 Geor… 2 TRUE
## 6 AC/A… Dana Fema… Smoker 158. 69.9 1973-07-01 Indi… 2 FALSE
## 7 AC/A… Erin Fema… Non-S… 162. 68.8 1972-03-26 New … 1 FALSE
## 8 AC/A… Rach… Fema… Non-S… 166. 70.4 1973-05-11 Colo… 1 FALSE
## 9 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 10 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
Now let’s compute the mean height of a patient.
summarise(patients, mean(Height))
## # A tibble: 1 x 1
## `mean(Height)`
## <dbl>
## 1 168.
If you prefer Oxford spelling, in which -ize is preferred to -ise, you’re in luck as dplyr accommodates the alternative spelling.
Note that the result is a data frame consisting in this case of a single row and a single column, unlike the more usual way of calculating the mean value for a vector or column, which results in a single numeric value (actually in R this is numeric vector of length 1).
mean(patients$Height)
## [1] 167.9267
Returning a data frame might be quite useful, particularly if we’re summarising multiple columns or using more than one function, for example computing the average and standard deviation.
summarise(patients, average = mean(Height), standard_deviation = sd(Height))
## # A tibble: 1 x 2
## average standard_deviation
## <dbl> <dbl>
## 1 168. 8.04
summarise
collapses data into a single row of values. Notice how we also named the output columns in this last example.
summarise
can take any R function that takes a vector of values and returns a single value. Some of the more useful functions include:
min
minimum valuemax
maximum valuesum
sum of valuesmean
mean valuesd
standard deviationmedian
median valueIQR
the interquartile rangen_distinct
the number of distinct valuesn
the number of observations (Note: this is a special function that doesn’t take a vector argument, i.e. column)It is also possible to summarise using a function that takes more than one value, i.e. from multiple columns. For example, we could compute the correlation between patient height and Weight:
summarise(patients, correlation = cor(Height, Weight))
## # A tibble: 1 x 1
## correlation
## <dbl>
## 1 0.687
We can apply the same function to all columns using summarise_all
.
summarise_all(patients, mean)
## Warning in mean.default(ID): argument is not numeric or logical: returning
## NA
## Warning in mean.default(Name): argument is not numeric or logical:
## returning NA
## Warning in mean.default(Sex): argument is not numeric or logical: returning
## NA
## Warning in mean.default(Smokes): argument is not numeric or logical:
## returning NA
## Warning in mean.default(State): argument is not numeric or logical:
## returning NA
## # A tibble: 1 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <date> <dbl> <dbl> <dbl>
## 1 NA NA NA NA 168. 74.9 1972-10-09 NA NA 0.54
## # … with 5 more variables: Count <dbl>, Date.Entered.Study <date>,
## # Age <dbl>, BMI <dbl>, Overweight <dbl>
While this seems to work, there are several warnings about inputs to the mean
function not being numerical. Can you see why?
A look at the documentation for the summarise_all
function (type ‘?summarise_all’ at the command prompt or use the Help viewer in RStudio) shows that there are two related functions, summarise_at
and summarise_if
, either of which can be used to specify the columns for which we would like to calculate the average values.
summarise_at
allows us to select the columns on which to operate using an additional vars
argument.
summarise_at(patients, vars(Height, Weight), mean)
## # A tibble: 1 x 2
## Height Weight
## <dbl> <dbl>
## 1 168. 74.9
summarise_if
provides another option and works well in cases where the operation should be applied to all columns that meet certain criteria. In this case, we want to calculate the averages for all columns with numeric values.
summarise_if(patients, is.numeric, mean)
## # A tibble: 1 x 6
## Height Weight Grade Count Age BMI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 168. 74.9 NA -0.107 43.1 26.5
It is possible to summarise using more than one function in which case a list of functions needs to be provided.
summarise_at(patients, vars(Height, Weight), list(mean, sd))
## # A tibble: 1 x 4
## Height_fn1 Weight_fn1 Height_fn2 Weight_fn2
## <dbl> <dbl> <dbl> <dbl>
## 1 168. 74.9 8.04 8.56
of if you want some control over naming the output columns.
patients %>%
select(Height, Weight) %>%
summarise_all(list(average = mean, stdev = sd))
## # A tibble: 1 x 4
## Height_average Weight_average Height_stdev Weight_stdev
## <dbl> <dbl> <dbl> <dbl>
## 1 168. 74.9 8.04 8.56
Just like with the select
operation, we can instead specify those columns to exclude.
summarise_at(patients, vars(-ID:-Smokes, -State:-Died), mean)
## # A tibble: 1 x 8
## Height Weight Birth Count Date.Entered.Study Age BMI Overweight
## <dbl> <dbl> <date> <dbl> <date> <dbl> <dbl> <dbl>
## 1 168. 74.9 1972-10-09 -0.107 2016-05-09 43.1 26.5 0.77
Or we can use one of the helper functions to choose which columns to operate on.
summarise_at(patients, vars(starts_with('BM')), mean)
## # A tibble: 1 x 1
## BMI
## <dbl>
## 1 26.5
summarise_at(patients, vars(ends_with('ight')), mean)
## # A tibble: 1 x 3
## Height Weight Overweight
## <dbl> <dbl> <dbl>
## 1 168. 74.9 0.77
# use regular expression to select columns on which to operate
summarise_at(patients, vars(matches('[HW]eight')), mean)
## # A tibble: 1 x 3
## Height Weight Overweight
## <dbl> <dbl> <dbl>
## 1 168. 74.9 0.77
# use one_of if you have a vector of column names
columns <- c('Height', 'Weight', 'Age')
summarise_at(patients, vars(one_of(columns)), mean)
## # A tibble: 1 x 3
## Height Weight Age
## <dbl> <dbl> <dbl>
## 1 168. 74.9 43.1
# alternatively, and more concisely, just pass the vector of columns names
summarise_at(patients, vars(columns), mean)
## # A tibble: 1 x 3
## Height Weight Age
## <dbl> <dbl> <dbl>
## 1 168. 74.9 43.1
See separate R markdown document.
While the summarise
function is useful on its own, it becomes really powerful when applied to groups of observations within a dataset. For example, suppose we want to compute the mean height for patients from different states. We could take each state in turn and filter
the data frame to only contain rows for a given state, then apply summarise
, but that would be somewhat cumbersome. In dplyr
, the group_by
function allows this to be done in one simple step.
This works best in a workflow using the %>%
pipe symbol, so as quick reminder, the following are identical ways of computing the average height of patients.
summarise(patients, mean(Height))
## # A tibble: 1 x 1
## `mean(Height)`
## <dbl>
## 1 168.
patients %>% summarise(mean(Height))
## # A tibble: 1 x 1
## `mean(Height)`
## <dbl>
## 1 168.
Now let’s do the same thing but treating each group separately.
patients %>% group_by(State) %>% summarise(mean(Height))
## # A tibble: 6 x 2
## State `mean(Height)`
## <chr> <dbl>
## 1 California 168.
## 2 Colorado 169.
## 3 Georgia 168.
## 4 Indiana 166.
## 5 New Jersey 168.
## 6 New York 170.
As before, we can summarise multiple observations.
patients %>% group_by(State) %>% summarise(n(), mean(Height), sd(Height))
## # A tibble: 6 x 4
## State `n()` `mean(Height)` `sd(Height)`
## <chr> <int> <dbl> <dbl>
## 1 California 10 168. 9.43
## 2 Colorado 15 169. 7.36
## 3 Georgia 16 168. 8.93
## 4 Indiana 18 166. 6.77
## 5 New Jersey 20 168. 6.98
## 6 New York 21 170. 9.52
We can make the output more presentable by renaming the columns and using the round
function to round to a specified number of significant figures. Note the use of backticks (` ) for specifying column names that contain spaces.
patients %>%
group_by(State) %>%
summarise(
N = n(),
`Average height` = mean(Height),
`Stdev height` = sd(Height)
) %>%
mutate_at(vars(`Average height`, `Stdev height`), round, digits = 1)
## # A tibble: 6 x 4
## State N `Average height` `Stdev height`
## <chr> <int> <dbl> <dbl>
## 1 California 10 168. 9.4
## 2 Colorado 15 168. 7.4
## 3 Georgia 16 168. 8.9
## 4 Indiana 18 166. 6.8
## 5 New Jersey 20 168. 7
## 6 New York 21 170. 9.5
A shorthand for summarise(n()) for counting the number of observations of each group is available in the form of the count function.
patients %>% group_by(State) %>% summarise(N = n())
## # A tibble: 6 x 2
## State N
## <chr> <int>
## 1 California 10
## 2 Colorado 15
## 3 Georgia 16
## 4 Indiana 18
## 5 New Jersey 20
## 6 New York 21
patients %>% group_by(State) %>% count()
## # A tibble: 6 x 2
## # Groups: State [6]
## State n
## <chr> <int>
## 1 California 10
## 2 Colorado 15
## 3 Georgia 16
## 4 Indiana 18
## 5 New Jersey 20
## 6 New York 21
group_by
can also be used in conjunction with other dplyr
verbs.
patients %>% group_by(State) %>% summarise_at(vars(Height, Weight), mean)
## # A tibble: 6 x 3
## State Height Weight
## <chr> <dbl> <dbl>
## 1 California 168. 73.4
## 2 Colorado 169. 79.1
## 3 Georgia 168. 72.9
## 4 Indiana 166. 74.2
## 5 New Jersey 168. 74.4
## 6 New York 170. 75.2
patients %>%
group_by(State) %>%
summarise_at(vars(Height, Weight), mean) %>%
gather(measure, mean, -State) %>%
ggplot(aes(x = State, y = mean, color = measure, group = measure)) +
geom_point() +
geom_line() +
facet_wrap(~measure, scales = "free_y")
Returning to one of the earlier examples, we can also compute the correlation between Height and Weight on a per-group basis.
patients %>% group_by(State) %>% summarise(correlation = cor(Height, Weight))
## # A tibble: 6 x 2
## State correlation
## <chr> <dbl>
## 1 California 0.856
## 2 Colorado 0.837
## 3 Georgia 0.700
## 4 Indiana 0.725
## 5 New Jersey 0.678
## 6 New York 0.629
group_by
- the technical detailsYou might be wondering what’s going on under the hood with this group_by
function. The help page for group_by
is a little on the technical side but essentially tells us that the data frame we pass it gets converted into a grouped_df
data frame object. dplyr
functions that operate on a grouped_df
object know to treat this in a special way, operating on each group separately. The following sequence of R commands might help to make this a bit clearer.
First let’s take a look at the class of the patients data frame.
class(patients)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Now we’ll create a grouped version with group_by
and look at its class.
patients_grouped <- patients %>% group_by(State)
class(patients_grouped)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
The groups
function lets us see the groups.
groups(patients_grouped)
## [[1]]
## State
The ungroup
function removes the grouping.
patients_ungrouped <- ungroup(patients_grouped)
class(patients_ungrouped)
## [1] "tbl_df" "tbl" "data.frame"
groups(patients_ungrouped)
## NULL
ungroup
can be quite helpful in more complicated chains of dplyr
operations where the grouping is only required for some of the steps and would have unintended consequences for subsequent operations within the chain.
See separate R markdown document.
In many real life situations, data are spread across multiple tables or spreadsheets. Usually this occurs because different types of information about a subject, e.g. a patient, are collected from different sources. It may be desirable for some analyses to combine data from two or more tables into a single data frame based on a common column, for example, an attribute that uniquely identifies the subject.
dplyr
provides a set of join functions for combining two data frames based on matches within specified columns. These operations are very similar to carrying out join operations between tables in a relational database using SQL.
left_join
To illustrate join operations we’ll first consider the most common type, a “left join”. In the schematic below the two data frames share a common column, V1. We can combine the two data frames into a single data frame by matching rows in the first data frame with those in the second data frame that share the same value of V1.
left_join
returns all rows from the first data frame regardless of whether there is a match in the second data frame. Rows with no match are included in the resulting data frame but have NA
values in the additional columns coming from the second data frame.
Here’s an example in which details about members of the Beatles and Rolling Stones are contained in two tables, using data frames conveniently provided by dplyr
. The name column identifies each of the band members and is used for matching rows from the two tables.
band_members
## # A tibble: 3 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
band_instruments
## # A tibble: 3 x 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
left_join(band_members, band_instruments, by = "name")
## # A tibble: 3 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
right_join
is similar but returns all rows from the second data frame that have a match with rows in the first data frame based on the specified column.
right_join(band_members, band_instruments, by = "name")
## # A tibble: 3 x 3
## name band plays
## <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
## 3 Keith <NA> guitar
inner_join
It is also possible to return only those rows where matches could be made. The inner_join
function accomplishes this.
inner_join(band_members, band_instruments, by = "name")
## # A tibble: 2 x 3
## name band plays
## <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
full_join
We’ve seen how missing rows from one table can be retained in the joined data frame using left_join
or right_join
but sometimes data for a given subject may be missing from either of the tables and we still want that subject to appear in the combined table. A full_join
will return all rows and all columns from the two tables and where there are no matching values, NA
values are used to fill in the missing values.
full_join(band_members, band_instruments, by = "name")
## # A tibble: 4 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Keith <NA> guitar
The columns that are used for joining two tables do not have to have the same name, e.g.
band_instruments2
## # A tibble: 3 x 2
## artist plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
left_join(band_members, band_instruments2, by = c("name" = "artist"))
## # A tibble: 3 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
Where there are multiple rows in one or both of the two tables for the thing that is being joined, these will be represented in the combined table. All combinations of the matching rows will be included.
df1 <- data_frame(
name = c("Mick", "John", "Paul", "Paul"),
band = c("Stones", "Beatles", "Beatles", "Wings")
)
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
df2 <- data_frame(
name = c("John", "Paul", "Paul", "Keith"),
plays = c("guitar", "bass", "guitar", "guitar")
)
left_join(df1, df2, by = "name")
## # A tibble: 6 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Paul Beatles guitar
## 5 Paul Wings bass
## 6 Paul Wings guitar
df1 <- data_frame(
forename = c("Mick", "John", "Paul", "John"),
surname = c("Jagger", "Lennon", "McCartney", "Squire"),
band = c("Stones", "Beatles", "Beatles", "Roses")
)
df2 <- data_frame(
forename = c("John", "Paul", "Keith", "John"),
surname = c("Lennon", "McCartney", "Richards", "Squire"),
plays = c("guitar", "bass", "guitar", "guitar")
)
left_join(df1, df2, by = c("forename", "surname"))
## # A tibble: 4 x 4
## forename surname band plays
## <chr> <chr> <chr> <chr>
## 1 Mick Jagger Stones <NA>
## 2 John Lennon Beatles guitar
## 3 Paul McCartney Beatles bass
## 4 John Squire Roses guitar
A variation on the join operations we’ve considered are semi_join
and anti_join
that filter the rows in one table based on matches or lack of matches to rows in another table.
# semi_join returns all rows from the first table where there are matches in the other table
semi_join(band_members, band_instruments, by = "name")
## # A tibble: 2 x 2
## name band
## <chr> <chr>
## 1 John Beatles
## 2 Paul Beatles
# anti_join returns all rows where there is no match in the other table
anti_join(band_members, band_instruments, by = "name")
## # A tibble: 1 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
The exercise uses a more realistic dataset, building on the patients data frame we’ve already been working with.
The patients are all part of a diabetes study and have had their blood glucose concentration and diastolic blood pressure measured on several dates. The aim is to compare the average blood pressures of smokers and non-smokers.
See separate R markdown document.