In this session we will look at what makes a dataset ‘tidy’ and why this is important. We will also look at how to perform some basic manipulations of a tabular dataset including:
In this section we will use functions from a number of packages from the tidyverse, namely tidyr for tidying data, dplyr for manipulating data frames and stringr providing common string operations.
Load the core packages from the tidyverse.
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.3.1
## ✔ tibble 2.0.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
[Instructors note: draw attention to the packages that are loaded, relate these to this and other sections of the course.]
With the many hundreds of R packages available it is inevitable that there will be functions with the same name being used by more than one package.
For example, filter function in dplyr has masked filter function from the stats package that already loaded as part of base R.
?filter
Navigate to each help section and show how the signatures differ and explain how using the one you hadn’t intended will likely cause your code to fail.
Use R studio’s file completion to show that the filter function we get by default.
To use the filter
function from the stats
package we need to include the stats
namespace prefix as follows:
stats::filter(presidents, rep(1, 3))
In this course the data we will be working with will be structured within a type of R object called a ‘data frame’.
A data frame contains tabular data, like those we might be used to dealing with in spreadsheets, e.g. with Excel, where the data can be thought of as having rows and columns. The values in each column are all of the same type, e.g. all numbers (if quantitative) or all text strings (if qualitative).
Let’s have a look at an example data frame that was loaded as part of the tidyr
package that we loaded when we called library(tidyverse)
.
This dataset contains the number of TB cases documented by the World Health Organization in Afghanistan, Brazil and China in 1999 and 2000.
table1
If we view this in the R console it is displayed as follows and described as a tibble
.
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
class(table1)
## [1] "tbl_df" "tbl" "data.frame"
This is a special kind of data frame with the additional tbl_df
and tbl
types.
Tibbles have some additional behaviour over plain data frames that might be useful. The most obvious is that these get printed to the console in a more concise and useful way, particularly when there are very many rows and/or columns. You can see how the default behaviour of tibbles differs from normal data frames by looking at the help page for the tbl_df
class.
?tbl_df
The tidyr package also contains some different representations of the TB dataset. Each of these tables organizes the values in a different layout or format.
Let’s take a look at the different representations.
table2
contains separate rows for the number of cases and the total population.
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table3
has a column called rate
which contains multiple values, i.e. the number of cases and the total population. If we wanted to extract just the number of cases for each country and year this would require some effort.
Note that the rate as a percentage could easily be computed from table1
and much less straightforwardly from table2
.
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
table4a
only contains the numbers of TB cases and table4b
contains only the total populations. To compute the percentage of TB cases we would need to combine these tables, something we will look at later on in the course.
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
A dataset is a collection of values, usually either numbers (if quantitative) or text strings (if qualitative).
Every value belongs to a variable and an observation.
A variable contains values that measure the same underlying attribute, e.g. height, or temperature.
An observation contains values measured on the same unit, e.g. a person or a date. Note that there can be multiple measurements within an observation but these should be for multiple attributes measured on the same unit.
A tidy dataset is a data frame (or table) for which the following are true:
Question: which of the representations we’ve just been looking at is tidy?
table2
is tricky. If we consider the observational unit to be a country in a specific year then table2
is not tidy because observations are split across two rows. Also the count variable contains counts of what are essentially different things, the number of cases of TB and the total population. In tricky situations like this, a tell-tale sign that your data is not in a tidy format is when you want to visualize your data and you have to perform some kind of filtering to do so. In this case if we wanted to create a bar plot of for the population of each country we would have to first remove the rows corresponding to the number of TB cases.
Let’s look at another untidy dataset.
untidy_data <- read_tsv("tidyr-example.txt")
## Parsed with column specification:
## cols(
## Name = col_character(),
## treatmenta = col_character(),
## treatmentb = col_double()
## )
untidy_data
## # A tibble: 3 x 3
## Name treatmenta treatmentb
## <chr> <chr> <dbl>
## 1 John Smith - 2
## 2 Jane Doe 16 11
## 3 Mary Johnson 3 1
Questions
What makes this an untidy dataset?
What other problems can you see with this dataset?
Data in this format are quite familiar to us, but not easily interpretable by the computer. We need to think of the dataset in terms of variables and values.
What are the variables in this dataset?
Remember the guiding principles:
A tidy form of this data would have 3 columns: Person, Treatment and Result.
This is an example of a fairly common scenario in which there are columns that are not variables but instead are values of a variable.
The variable in question is Treatment
and the two columns treatmenta
and treatmentb
are values that variable can take. So a tidy form of this dataset would have a column called Treatment
.
gather
to create a tidy data frameThe gather
function takes multiple “value” columns and collapses these into key-value pairs.
Look at the help for the gather
function to see what we need to give it.
?gather
We have to specify which columns to work on and we have to tell gather
what to call the variable for which those columns will become values - this variable is the ‘key’.
The values within those ‘key’ columns are values but for another variable which we specify using the ‘value’ argument.
In our example the key would be ‘Treatment’ and the value would be ‘Result’.
tidy_data <- gather(untidy_data, key = "Treatment", value = "Result", treatmenta, treatmentb)
tidy_data
## # A tibble: 6 x 3
## Name Treatment Result
## <chr> <chr> <chr>
## 1 John Smith treatmenta -
## 2 Jane Doe treatmenta 16
## 3 Mary Johnson treatmenta 3
## 4 John Smith treatmentb 2
## 5 Jane Doe treatmentb 11
## 6 Mary Johnson treatmentb 1
The key and value variables can be given as strings or as symbols without quotes.
gather(untidy_data, Treatment, Result, treatmenta, treatmentb)
## # A tibble: 6 x 3
## Name Treatment Result
## <chr> <chr> <chr>
## 1 John Smith treatmenta -
## 2 Jane Doe treatmenta 16
## 3 Mary Johnson treatmenta 3
## 4 John Smith treatmentb 2
## 5 Jane Doe treatmentb 11
## 6 Mary Johnson treatmentb 1
If there were lots of treatment columns it would be quite laborious to type them all in. Instead you can tell gather
which columns not to include.
gather(untidy_data, Treatment, Result, -Name)
## # A tibble: 6 x 3
## Name Treatment Result
## <chr> <chr> <chr>
## 1 John Smith treatmenta -
## 2 Jane Doe treatmenta 16
## 3 Mary Johnson treatmenta 3
## 4 John Smith treatmentb 2
## 5 Jane Doe treatmentb 11
## 6 Mary Johnson treatmentb 1
You can also specify the columns as a range either using the column names
gather(untidy_data, Treatment, Result, treatmenta:treatmentb)
## # A tibble: 6 x 3
## Name Treatment Result
## <chr> <chr> <chr>
## 1 John Smith treatmenta -
## 2 Jane Doe treatmenta 16
## 3 Mary Johnson treatmenta 3
## 4 John Smith treatmentb 2
## 5 Jane Doe treatmentb 11
## 6 Mary Johnson treatmentb 1
or column indices
gather(untidy_data, Treatment, Result, 2:3)
## # A tibble: 6 x 3
## Name Treatment Result
## <chr> <chr> <chr>
## 1 John Smith treatmenta -
## 2 Jane Doe treatmenta 16
## 3 Mary Johnson treatmenta 3
## 4 John Smith treatmentb 2
## 5 Jane Doe treatmentb 11
## 6 Mary Johnson treatmentb 1
Note that gather
is very similar to the melt
function from the reshape
and reshape2
packages.
See separate R markdown document.
tidyr
functionsThere are some other useful functions within the tidyr
package. We’ll look at these using the dataset from the exercise.
clinical_data <- read_tsv("clinical-data.txt")
## Parsed with column specification:
## cols(
## Subject = col_character(),
## Placebo.1 = col_double(),
## Placebo.2 = col_double(),
## Drug1.1 = col_double(),
## Drug1.2 = col_double(),
## Drug2.1 = col_double(),
## Drug2.2 = col_double()
## )
clinical_data <- gather(clinical_data, Treatment, Value, -Subject)
clinical_data
## # A tibble: 60 x 3
## Subject Treatment Value
## <chr> <chr> <dbl>
## 1 Patient1 Placebo.1 49.8
## 2 Patient2 Placebo.1 46.8
## 3 Patient3 Placebo.1 48.7
## 4 Patient4 Placebo.1 51.7
## 5 Patient5 Placebo.1 48.9
## 6 Patient6 Placebo.1 53.5
## 7 Patient7 Placebo.1 53.6
## 8 Patient8 Placebo.1 46.2
## 9 Patient9 Placebo.1 50.5
## 10 Patient10 Placebo.1 47.0
## # … with 50 more rows
ggplot(clinical_data, mapping = aes(x = Treatment, y = Value)) +
geom_boxplot()
The suffix for each of the treatments, either ‘.1’ or ‘.2’, indicates that each drug or placebo was measured twice. Ideally we would extract this into a new variable called ‘Replicate’. We can use the separate
function to do so.
clinical_replicate_data <- separate(clinical_data, Treatment, into = c("Treatment", "Replicate"))
clinical_replicate_data
## # A tibble: 60 x 4
## Subject Treatment Replicate Value
## <chr> <chr> <chr> <dbl>
## 1 Patient1 Placebo 1 49.8
## 2 Patient2 Placebo 1 46.8
## 3 Patient3 Placebo 1 48.7
## 4 Patient4 Placebo 1 51.7
## 5 Patient5 Placebo 1 48.9
## 6 Patient6 Placebo 1 53.5
## 7 Patient7 Placebo 1 53.6
## 8 Patient8 Placebo 1 46.2
## 9 Patient9 Placebo 1 50.5
## 10 Patient10 Placebo 1 47.0
## # … with 50 more rows
Question: how did separate
know to split the values on the ‘.’ character?
Look at the help for the separate
function and find how you would specify the delimiter. In this case the default delimiter does the trick.
Now that we have a separate variable/column for the replicate we can create an improved box plot with faceting.
ggplot(clinical_replicate_data, mapping = aes(x = Replicate, y = Value)) +
geom_boxplot() +
facet_wrap(~Treatment)
The spread
function does the opposite to gather
and can be used to convert a tidy dataset back to one in which we have values in column headings. Sometimes this really is necessary, e.g. gene expression matrices where there are rows for each gene and columns for each sample, because some analysis functions require the matrix form of the data.
spread(clinical_data, key = Treatment, value = Value)
## # A tibble: 10 x 7
## Subject Drug1.1 Drug1.2 Drug2.1 Drug2.2 Placebo.1 Placebo.2
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Patient1 48.4 48.4 40.8 38.3 49.8 53.8
## 2 Patient10 44.9 50.1 39.0 36.2 47.0 44.8
## 3 Patient2 49.6 41.6 39.1 41.9 46.8 49.8
## 4 Patient3 40.5 49.2 40.3 35.1 48.7 48.1
## 5 Patient4 38.3 41.1 40.7 41.2 51.7 48.1
## 6 Patient5 43.1 39.4 43.3 34.9 48.9 48.3
## 7 Patient6 47.5 42.9 39.5 35.6 53.5 44.7
## 8 Patient7 49.2 46.4 37.4 38.8 53.6 47.0
## 9 Patient8 47.3 38.3 44.0 33.8 46.2 43.2
## 10 Patient9 43.4 48.6 41.6 34.4 50.5 56.2
It can also help fix table2
that we considered earlier.
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
spread(table2, key = type, value = count)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
We’re going to look now at some of the functions provided by the dplyr
package, another core component of the tidyverse. dplyr
provides a set of data manipulation functions that have a common syntax and work together in a consistent and uniform manner. These are referred to as verbs in the dplyr
documentation.
The most basic operation is to select a subset of columns.
First we’ll load a more complex dataset with several columns (variables).
patients <- read_tsv("patient-data.txt")
## Parsed with column specification:
## cols(
## ID = col_character(),
## Name = col_character(),
## Sex = col_character(),
## Smokes = col_character(),
## Height = col_character(),
## Weight = col_character(),
## Birth = col_date(format = ""),
## State = col_character(),
## Grade_Level = col_double(),
## Died = col_logical(),
## Count = col_double(),
## Date.Entered.Study = col_date(format = "")
## )
patients
## # A tibble: 100 x 12
## ID Name Sex Smokes Height Weight Birth State Grade_Level
## <chr> <chr> <chr> <chr> <chr> <chr> <date> <chr> <dbl>
## 1 AC/A… Mich… Male FALSE 182.8… 76.57… 1972-02-06 Geor… 2
## 2 AC/A… Derek Male FALSE 179.1… 80.43… 1972-06-15 Colo… 2
## 3 AC/A… Todd Male FALSE 169.1… 75.48… 1972-07-09 New … 2
## 4 AC/A… Rona… Male FALSE 175.6… 94.54… 1972-08-17 Colo… 1
## 5 AC/A… Chri… Fema… FALSE 164.4… 71.78… 1973-06-12 Geor… 2
## 6 AC/A… Dana Fema… TRUE 158.2… 69.9kg 1973-07-01 Indi… 2
## 7 AC/A… Erin Fema… FALSE 161.6… 68.85… 1972-03-26 New … 1
## 8 AC/A… Rach… Fema… No 165.8… 70.44… 1973-05-11 Colo… 1
## 9 AC/A… Rona… Male FALSE 181.3… 76.9kg 1971-12-31 Geor… 1
## 10 AC/A… Bryan Male FALSE 167.3… 79.06… 1973-07-19 New … 2
## # … with 90 more rows, and 3 more variables: Died <lgl>, Count <dbl>,
## # Date.Entered.Study <date>
View the patients dataset in RStudio by selecting it in the Environment tab (usually top-right) or as follows.
View(patients)
You should be familiar with how to access a specific column of a data frame using the $
operator.
patients$Name
## [1] "Michael" "Derek" "Todd" "Ronald" "Christine"
## [6] "Dana" "Erin" "Rachel" "Ronald" "Bryan"
## [11] "Kimberly" "Pamela" "Kenneth" "Stacy" "Elizabeth"
## [16] "Marc" "Brandon" "Travis" "Andrew" "Jeffrey"
## [21] "Jimmy" "Tara" "Thomas" "Paula" "Sandra"
## [26] "Mark" "Steven" "George" "Steven" "Brandon"
## [31] "Julie" "Aaron" "Jeffery" "Kristen" "Brett"
## [36] "Teresa" "Brent" "Tiffany" "Joel" "Michelle"
## [41] "Aaron" "Brandy" "Kristina" "Catherine" "Alicia"
## [46] "Sherry" "Tanya" "Nancy" "Charles" "Linda"
## [51] "Nicole" "Wendy" "April" "Barbara" "Brandon"
## [56] "Holly" "Stacy" "Kelly" "Shawn" "Anthony"
## [61] "Sherry" "Jeffery" "Valerie" "Jimmy" "Jeffrey"
## [66] "Tiffany" "Steven" "Brandy" "Jon" "Sarah"
## [71] "Barbara" "James" "Marc" "Mary" "Richard"
## [76] "Erica" "Benjamin" "Joel" "Carrie" "Alicia"
## [81] "Paula" "Stacy" "Todd" "Brenda" "Laura"
## [86] "Laura" "Suzanne" "Troy" "Barbara" "Russell"
## [91] "Philip" "Stephen" "Holly" "Douglas" "Christy"
## [96] "Jill" "Rhonda" "Wendy" "Sharon" "Cheryl"
The $
operator works well for single columns, but for multiple columns it quickly starts to get cumbersome as we need to use the []
operator for accessing specific rows of columns and c()
for combining the required columns. The column names also need quotation marks.
patients[,c("Name","Sex")]
## # A tibble: 100 x 2
## Name Sex
## <chr> <chr>
## 1 Michael Male
## 2 Derek Male
## 3 Todd Male
## 4 Ronald Male
## 5 Christine Female
## 6 Dana Female
## 7 Erin Female
## 8 Rachel Female
## 9 Ronald Male
## 10 Bryan Male
## # … with 90 more rows
Non R users probably find these commands a bit obtuse.
[ ]
?c
?[,...]
means display all rowsThe dplyr
verb select
is a lot more intuitive.
select(patients, Name)
## # A tibble: 100 x 1
## Name
## <chr>
## 1 Michael
## 2 Derek
## 3 Todd
## 4 Ronald
## 5 Christine
## 6 Dana
## 7 Erin
## 8 Rachel
## 9 Ronald
## 10 Bryan
## # … with 90 more rows
Note that we don’t need the quotation marks.
select(patients, Name, Sex)
## # A tibble: 100 x 2
## Name Sex
## <chr> <chr>
## 1 Michael Male
## 2 Derek Male
## 3 Todd Male
## 4 Ronald Male
## 5 Christine Female
## 6 Dana Female
## 7 Erin Female
## 8 Rachel Female
## 9 Ronald Male
## 10 Bryan Male
## # … with 90 more rows
We can exclude columns in the same way as we did earlier with the gather
function.
select(patients, -Name)
## # A tibble: 100 x 11
## ID Sex Smokes Height Weight Birth State Grade_Level Died
## <chr> <chr> <chr> <chr> <chr> <date> <chr> <dbl> <lgl>
## 1 AC/A… Male FALSE 182.8… 76.57… 1972-02-06 Geor… 2 FALSE
## 2 AC/A… Male FALSE 179.1… 80.43… 1972-06-15 Colo… 2 FALSE
## 3 AC/A… Male FALSE 169.1… 75.48… 1972-07-09 New … 2 FALSE
## 4 AC/A… Male FALSE 175.6… 94.54… 1972-08-17 Colo… 1 FALSE
## 5 AC/A… Fema… FALSE 164.4… 71.78… 1973-06-12 Geor… 2 TRUE
## 6 AC/A… Fema… TRUE 158.2… 69.9kg 1973-07-01 Indi… 2 FALSE
## 7 AC/A… Fema… FALSE 161.6… 68.85… 1972-03-26 New … 1 FALSE
## 8 AC/A… Fema… No 165.8… 70.44… 1973-05-11 Colo… 1 FALSE
## 9 AC/A… Male FALSE 181.3… 76.9kg 1971-12-31 Geor… 1 FALSE
## 10 AC/A… Male FALSE 167.3… 79.06… 1973-07-19 New … 2 FALSE
## # … with 90 more rows, and 2 more variables: Count <dbl>,
## # Date.Entered.Study <date>
Question: how would you do this with base R?
patients[,-2]
patients[,setdiff(colnames(patients), "Name")]
You can also select columns within a range using :
, again as we did before with gather
.
select(patients, Name:Sex)
## # A tibble: 100 x 2
## Name Sex
## <chr> <chr>
## 1 Michael Male
## 2 Derek Male
## 3 Todd Male
## 4 Ronald Male
## 5 Christine Female
## 6 Dana Female
## 7 Erin Female
## 8 Rachel Female
## 9 Ronald Male
## 10 Bryan Male
## # … with 90 more rows
Looking at the help for select
you will see that there are a number of useful functions that can be used when selecting columns.
select(patients, starts_with("Grade"))
## # A tibble: 100 x 1
## Grade_Level
## <dbl>
## 1 2
## 2 2
## 3 2
## 4 1
## 5 2
## 6 2
## 7 1
## 8 1
## 9 1
## 10 2
## # … with 90 more rows
You can combine explicit naming of columns, ranges and functions, e.g.
select(patients, Name, Sex:Birth, -Smokes)
## # A tibble: 100 x 5
## Name Sex Height Weight Birth
## <chr> <chr> <chr> <chr> <date>
## 1 Michael Male 182.87cm 76.57kg 1972-02-06
## 2 Derek Male 179.12cm 80.43kg 1972-06-15
## 3 Todd Male 169.15cm 75.48kg 1972-07-09
## 4 Ronald Male 175.66cm 94.54kg 1972-08-17
## 5 Christine Female 164.47cm 71.78kg 1973-06-12
## 6 Dana Female 158.27cm 69.9kg 1973-07-01
## 7 Erin Female 161.69cm 68.85kg 1972-03-26
## 8 Rachel Female 165.84cm 70.44kg 1973-05-11
## 9 Ronald Male 181.32cm 76.9kg 1971-12-31
## 10 Bryan Male 167.37cm 79.06kg 1973-07-19
## # … with 90 more rows
See separate R markdown document.
We’re now going to explore this dataset, taking a look at some of the variables and some potential problems that will hinder any analysis we want to carry out.
We’ll read the data in again but this time using the base R read.delim
function instead of the read_tsv
because the latter has in fact done some tidying of our data and for the purposes of what comes next it would be better if it hadn’t.
patients <- read.delim("patient-data.txt")
patients <- as_tibble(patients)
patients
## # A tibble: 100 x 12
## ID Name Sex Smokes Height Weight Birth State Grade_Level Died
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <int> <lgl>
## 1 AC/A… Mich… Male FALSE 182.8… 76.57… 1972… Geor… 2 FALSE
## 2 AC/A… Derek Male FALSE 179.1… 80.43… 1972… "Col… 2 FALSE
## 3 AC/A… Todd " Ma… FALSE 169.1… 75.48… 1972… New … 2 FALSE
## 4 AC/A… Rona… Male FALSE 175.6… 94.54… 1972… Colo… 1 FALSE
## 5 AC/A… Chri… Fema… FALSE 164.4… 71.78… 1973… Geor… 2 TRUE
## 6 AC/A… Dana Fema… TRUE 158.2… 69.9kg 1973… Indi… 2 FALSE
## 7 AC/A… Erin Fema… FALSE 161.6… 68.85… 1972… New … 1 FALSE
## 8 AC/A… Rach… Fema… No 165.8… 70.44… 1973… "Col… 1 FALSE
## 9 AC/A… Rona… Male FALSE 181.3… 76.9kg 1971… Geor… 1 FALSE
## 10 AC/A… Bryan " Ma… FALSE 167.3… 79.06… 1973… New … 2 FALSE
## # … with 90 more rows, and 2 more variables: Count <dbl>,
## # Date.Entered.Study <fct>
We’ll start by looking at the Sex column. Create a bar plot to show the numbers of male and female patients in our study.
ggplot(patients, mapping = aes(x = Sex)) +
geom_bar()
There appear to be more categories than expected - why is this?
Note that the Sex
variable has been read in as a factor. We can check the factor levels.
levels(patients$Sex)
## [1] " Female" " Male" "Female" "Female " "Male" "Male "
There are some extraneous spaces in the Sex column in the tabular file.
R has many functions for dealing with and manipulating text strings. However, the stringr
package that is part of the tidyverse aims to provide a clean and uniform interface to many common string operations.
Select the ‘Packages’ tab in RStudio and navigate to the stringr
package. Take a look at the various functions it offers and work out which one will help us remove the spaces at the beginning and/or end of our Sex
values.
str_trim(patients$Sex)
## [1] "Male" "Male" "Male" "Male" "Female" "Female" "Female"
## [8] "Female" "Male" "Male" "Female" "Female" "Male" "Female"
## [15] "Female" "Male" "Male" "Male" "Male" "Male" "Male"
## [22] "Female" "Male" "Female" "Female" "Male" "Male" "Male"
## [29] "Male" "Male" "Female" "Male" "Male" "Female" "Male"
## [36] "Female" "Male" "Female" "Male" "Female" "Male" "Female"
## [43] "Female" "Female" "Female" "Female" "Female" "Female" "Male"
## [50] "Female" "Female" "Female" "Female" "Female" "Male" "Female"
## [57] "Female" "Female" "Male" "Male" "Female" "Male" "Female"
## [64] "Male" "Male" "Female" "Male" "Female" "Male" "Female"
## [71] "Female" "Male" "Male" "Female" "Male" "Female" "Male"
## [78] "Male" "Female" "Female" "Female" "Female" "Male" "Female"
## [85] "Female" "Female" "Female" "Male" "Female" "Male" "Male"
## [92] "Male" "Female" "Male" "Female" "Female" "Female" "Female"
## [99] "Female" "Female"
str_trim
takes a vector of strings and returns another vector in which each string value has whitespace removed from the beginning and end of the string.
We now need a way to use this to modify the data frame. One way to do this would be:
patients$Sex <- str_trim(patients$Sex)
The dplyr
function for making changes to a data frame is mutate
.
mutate(patients, Sex = str_trim(Sex))
## # A tibble: 100 x 12
## ID Name Sex Smokes Height Weight Birth State Grade_Level Died
## <fct> <fct> <chr> <fct> <fct> <fct> <fct> <fct> <int> <lgl>
## 1 AC/A… Mich… Male FALSE 182.8… 76.57… 1972… Geor… 2 FALSE
## 2 AC/A… Derek Male FALSE 179.1… 80.43… 1972… "Col… 2 FALSE
## 3 AC/A… Todd Male FALSE 169.1… 75.48… 1972… New … 2 FALSE
## 4 AC/A… Rona… Male FALSE 175.6… 94.54… 1972… Colo… 1 FALSE
## 5 AC/A… Chri… Fema… FALSE 164.4… 71.78… 1973… Geor… 2 TRUE
## 6 AC/A… Dana Fema… TRUE 158.2… 69.9kg 1973… Indi… 2 FALSE
## 7 AC/A… Erin Fema… FALSE 161.6… 68.85… 1972… New … 1 FALSE
## 8 AC/A… Rach… Fema… No 165.8… 70.44… 1973… "Col… 1 FALSE
## 9 AC/A… Rona… Male FALSE 181.3… 76.9kg 1971… Geor… 1 FALSE
## 10 AC/A… Bryan Male FALSE 167.3… 79.06… 1973… New … 2 FALSE
## # … with 90 more rows, and 2 more variables: Count <dbl>,
## # Date.Entered.Study <fct>
Notice that we didn’t need to specify the Sex variable as patients$Sex
.
Usually for cleaning operations like this we would choose to overwrite the column in question but we could instead create a new variable, e.g.
mutate(patients, SexTrimmed = str_trim(Sex))
## # A tibble: 100 x 13
## ID Name Sex Smokes Height Weight Birth State Grade_Level Died
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <int> <lgl>
## 1 AC/A… Mich… Male FALSE 182.8… 76.57… 1972… Geor… 2 FALSE
## 2 AC/A… Derek Male FALSE 179.1… 80.43… 1972… "Col… 2 FALSE
## 3 AC/A… Todd " Ma… FALSE 169.1… 75.48… 1972… New … 2 FALSE
## 4 AC/A… Rona… Male FALSE 175.6… 94.54… 1972… Colo… 1 FALSE
## 5 AC/A… Chri… Fema… FALSE 164.4… 71.78… 1973… Geor… 2 TRUE
## 6 AC/A… Dana Fema… TRUE 158.2… 69.9kg 1973… Indi… 2 FALSE
## 7 AC/A… Erin Fema… FALSE 161.6… 68.85… 1972… New … 1 FALSE
## 8 AC/A… Rach… Fema… No 165.8… 70.44… 1973… "Col… 1 FALSE
## 9 AC/A… Rona… Male FALSE 181.3… 76.9kg 1971… Geor… 1 FALSE
## 10 AC/A… Bryan " Ma… FALSE 167.3… 79.06… 1973… New … 2 FALSE
## # … with 90 more rows, and 3 more variables: Count <dbl>,
## # Date.Entered.Study <fct>, SexTrimmed <chr>
Creating a new variable is more common where we compute a new value based on values from other columns.
mutate(table1, percent = 100 * cases / population)
## # A tibble: 6 x 5
## country year cases population percent
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1999 745 19987071 0.00373
## 2 Afghanistan 2000 2666 20595360 0.0129
## 3 Brazil 1999 37737 172006362 0.0219
## 4 Brazil 2000 80488 174504898 0.0461
## 5 China 1999 212258 1272915272 0.0167
## 6 China 2000 213766 1280428583 0.0167
We may want the Sex
column to remain as a factor. Note that in the following we assign the result of the mutate
operation to the patients
object, i.e. overwriting it.
patients <- mutate(patients, Sex = factor(str_trim(Sex)))
levels(patients$Sex)
## [1] "Female" "Male"
ggplot(patients, mapping = aes(x = Sex)) +
geom_bar()
Let’s turn our attention to another of the variables, Height
. Create a histogram of the heights of our patients.
ggplot(patients, mapping = aes(x = Height)) +
geom_histogram()
Error: StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?
The error is not very easy to understand but it does indicate that for a histogram we need a continuous variable. What type is our Height
variable? Why might this be?
select(patients, Height)
## # A tibble: 100 x 1
## Height
## <fct>
## 1 182.87cm
## 2 179.12cm
## 3 169.15cm
## 4 175.66cm
## 5 164.47cm
## 6 158.27cm
## 7 161.69cm
## 8 165.84cm
## 9 181.32cm
## 10 167.37cm
## # … with 90 more rows
To convert these values to numbers we need to remove the ‘cm’ unit that is appended to each value.
Again, take a look at the functions available in stringr
and see which might be able to carry out this operation.
str_remove(patients$Height, "cm")
## [1] "182.87" "179.12" "169.15" "175.66" "164.47" "158.27" "161.69"
## [8] "165.84" "181.32" "167.37" "160.06" "166.48" "175.39" "164.7"
## [15] "163.79" "181.13" "169.24" "176.22" "174.09" "180.11" "179.24"
## [22] "161.92" "169.85" "160.57" "168.24" "177.75" "183.21" "167.75"
## [29] "181.15" "181.56" "160.03" "165.62" "181.64" "159.67" "177.03"
## [36] "163.35" "175.21" "160.8" "166.46" "157.95" "180.61" "159.52"
## [43] "163.01" "165.8" "170.03" "157.16" "164.58" "163.47" "185.43"
## [50] "165.34" "163.45" "163.97" "161.38" "160.09" "178.64" "159.78"
## [57] "161.57" "161.83" "169.66" "166.84" "159.32" "170.51" "161.84"
## [64] "171.41" "166.75" "166.19" "169.16" "157.01" "167.51" "160.47"
## [71] "162.33" "175.67" "174.25" "158.94" "172.72" "159.23" "176.54"
## [78] "184.34" "163.94" "160.09" "162.32" "162.59" "171.94" "158.07"
## [85] "158.35" "162.18" "159.38" "171.45" "163.17" "183.1" "177.14"
## [92] "171.08" "159.33" "185.43" "162.65" "159.44" "164.11" "159.13"
## [99] "160.58" "164.88"
The values are still strings though.
as.numeric(str_remove(patients$Height, "cm"))
## [1] 182.87 179.12 169.15 175.66 164.47 158.27 161.69 165.84 181.32 167.37
## [11] 160.06 166.48 175.39 164.70 163.79 181.13 169.24 176.22 174.09 180.11
## [21] 179.24 161.92 169.85 160.57 168.24 177.75 183.21 167.75 181.15 181.56
## [31] 160.03 165.62 181.64 159.67 177.03 163.35 175.21 160.80 166.46 157.95
## [41] 180.61 159.52 163.01 165.80 170.03 157.16 164.58 163.47 185.43 165.34
## [51] 163.45 163.97 161.38 160.09 178.64 159.78 161.57 161.83 169.66 166.84
## [61] 159.32 170.51 161.84 171.41 166.75 166.19 169.16 157.01 167.51 160.47
## [71] 162.33 175.67 174.25 158.94 172.72 159.23 176.54 184.34 163.94 160.09
## [81] 162.32 162.59 171.94 158.07 158.35 162.18 159.38 171.45 163.17 183.10
## [91] 177.14 171.08 159.33 185.43 162.65 159.44 164.11 159.13 160.58 164.88
Now we can modify the dataset with mutate
.
patients <- mutate(patients, Height = as.numeric(str_remove(Height, "cm")))
patients
## # A tibble: 100 x 12
## ID Name Sex Smokes Height Weight Birth State Grade_Level Died
## <fct> <fct> <fct> <fct> <dbl> <fct> <fct> <fct> <int> <lgl>
## 1 AC/A… Mich… Male FALSE 183. 76.57… 1972… Geor… 2 FALSE
## 2 AC/A… Derek Male FALSE 179. 80.43… 1972… "Col… 2 FALSE
## 3 AC/A… Todd Male FALSE 169. 75.48… 1972… New … 2 FALSE
## 4 AC/A… Rona… Male FALSE 176. 94.54… 1972… Colo… 1 FALSE
## 5 AC/A… Chri… Fema… FALSE 164. 71.78… 1973… Geor… 2 TRUE
## 6 AC/A… Dana Fema… TRUE 158. 69.9kg 1973… Indi… 2 FALSE
## 7 AC/A… Erin Fema… FALSE 162. 68.85… 1972… New … 1 FALSE
## 8 AC/A… Rach… Fema… No 166. 70.44… 1973… "Col… 1 FALSE
## 9 AC/A… Rona… Male FALSE 181. 76.9kg 1971… Geor… 1 FALSE
## 10 AC/A… Bryan Male FALSE 167. 79.06… 1973… New … 2 FALSE
## # … with 90 more rows, and 2 more variables: Count <dbl>,
## # Date.Entered.Study <fct>
ggplot(patients, mapping = aes(x = Height)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You can modify multiple columns within a single mutate
function call.
patients <- read.delim("patient-data.txt")
patients <- as_tibble(patients)
patients <- mutate(patients,
Sex = as.factor(str_trim(Sex)),
Height = as.numeric(str_remove(Height, "cm")))
patients
## # A tibble: 100 x 12
## ID Name Sex Smokes Height Weight Birth State Grade_Level Died
## <fct> <fct> <fct> <fct> <dbl> <fct> <fct> <fct> <int> <lgl>
## 1 AC/A… Mich… Male FALSE 183. 76.57… 1972… Geor… 2 FALSE
## 2 AC/A… Derek Male FALSE 179. 80.43… 1972… "Col… 2 FALSE
## 3 AC/A… Todd Male FALSE 169. 75.48… 1972… New … 2 FALSE
## 4 AC/A… Rona… Male FALSE 176. 94.54… 1972… Colo… 1 FALSE
## 5 AC/A… Chri… Fema… FALSE 164. 71.78… 1973… Geor… 2 TRUE
## 6 AC/A… Dana Fema… TRUE 158. 69.9kg 1973… Indi… 2 FALSE
## 7 AC/A… Erin Fema… FALSE 162. 68.85… 1972… New … 1 FALSE
## 8 AC/A… Rach… Fema… No 166. 70.44… 1973… "Col… 1 FALSE
## 9 AC/A… Rona… Male FALSE 181. 76.9kg 1971… Geor… 1 FALSE
## 10 AC/A… Bryan Male FALSE 167. 79.06… 1973… New … 2 FALSE
## # … with 90 more rows, and 2 more variables: Count <dbl>,
## # Date.Entered.Study <fct>
See separate R markdown document.
If you look at the help documentation for mutate
you will notice that there are a set of related functions, mutate_all
, mutate_if
and mutate_at
. These are very useful functions for applying the same operation to several columns within a table in one go.
For example, let’s say we want to round each patient’s weight and height to 1 decimal place.
patients <- read_tsv("patient-data-cleaned.txt")
## Parsed with column specification:
## cols(
## ID = col_character(),
## Name = col_character(),
## Sex = col_character(),
## Smokes = col_character(),
## Height = col_double(),
## Weight = col_double(),
## Birth = col_date(format = ""),
## State = col_character(),
## Grade = col_double(),
## Died = col_logical(),
## Count = col_double(),
## Date.Entered.Study = col_date(format = ""),
## Age = col_double(),
## BMI = col_double(),
## Overweight = col_logical()
## )
mutate_at(patients, vars(Height, Weight), round, digits = 1)
## # A tibble: 100 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 2 AC/A… Derek Male Non-S… 179. 80.4 1972-06-15 Colo… 2 FALSE
## 3 AC/A… Todd Male Non-S… 169. 75.5 1972-07-09 New … 2 FALSE
## 4 AC/A… Rona… Male Non-S… 176. 94.5 1972-08-17 Colo… 1 FALSE
## 5 AC/A… Chri… Fema… Non-S… 164. 71.8 1973-06-12 Geor… 2 TRUE
## 6 AC/A… Dana Fema… Smoker 158. 69.9 1973-07-01 Indi… 2 FALSE
## 7 AC/A… Erin Fema… Non-S… 162. 68.8 1972-03-26 New … 1 FALSE
## 8 AC/A… Rach… Fema… Non-S… 166. 70.4 1973-05-11 Colo… 1 FALSE
## 9 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 10 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
Two things to note here. Firstly, we had to specify the variables that needed to be modified in a vars()
argument. Secondly, you can add any additional arguments that the function requires at the end.
mutate_if
allows us to run the same operation on all columns satisfying a specified criterion. For example we could choose to round all numeric columns to 1 decimal place.
patients <- read_tsv("patient-data-cleaned.txt")
## Parsed with column specification:
## cols(
## ID = col_character(),
## Name = col_character(),
## Sex = col_character(),
## Smokes = col_character(),
## Height = col_double(),
## Weight = col_double(),
## Birth = col_date(format = ""),
## State = col_character(),
## Grade = col_double(),
## Died = col_logical(),
## Count = col_double(),
## Date.Entered.Study = col_date(format = ""),
## Age = col_double(),
## BMI = col_double(),
## Overweight = col_logical()
## )
patients <- mutate_if(patients, is.numeric, round, digits = 1)
patients
## # A tibble: 100 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 2 AC/A… Derek Male Non-S… 179. 80.4 1972-06-15 Colo… 2 FALSE
## 3 AC/A… Todd Male Non-S… 169. 75.5 1972-07-09 New … 2 FALSE
## 4 AC/A… Rona… Male Non-S… 176. 94.5 1972-08-17 Colo… 1 FALSE
## 5 AC/A… Chri… Fema… Non-S… 164. 71.8 1973-06-12 Geor… 2 TRUE
## 6 AC/A… Dana Fema… Smoker 158. 69.9 1973-07-01 Indi… 2 FALSE
## 7 AC/A… Erin Fema… Non-S… 162. 68.8 1972-03-26 New … 1 FALSE
## 8 AC/A… Rach… Fema… Non-S… 166. 70.4 1973-05-11 Colo… 1 FALSE
## 9 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 10 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>