Introduction

In this session we will look at what makes a dataset ‘tidy’ and why this is important. We will also look at how to perform some basic manipulations of a tabular dataset including:

  • operations that will help clean up values within columns
  • selecting a subset of columns of interest
  • creating new variables (columns) based on existing variables

Load the tidyverse

In this section we will use functions from a number of packages from the tidyverse, namely tidyr for tidying data, dplyr for manipulating data frames and stringr providing common string operations.

Load the core packages from the tidyverse.

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0       ✔ purrr   0.3.1  
## ✔ tibble  2.0.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

[Instructors note: draw attention to the packages that are loaded, relate these to this and other sections of the course.]

Aside: function name conflicts (optional)

With the many hundreds of R packages available it is inevitable that there will be functions with the same name being used by more than one package.

For example, filter function in dplyr has masked filter function from the stats package that already loaded as part of base R.

?filter

Navigate to each help section and show how the signatures differ and explain how using the one you hadn’t intended will likely cause your code to fail.

Use R studio’s file completion to show that the filter function we get by default.

To use the filter function from the stats package we need to include the stats namespace prefix as follows:

stats::filter(presidents, rep(1, 3))

Data frames

In this course the data we will be working with will be structured within a type of R object called a ‘data frame’.

A data frame contains tabular data, like those we might be used to dealing with in spreadsheets, e.g. with Excel, where the data can be thought of as having rows and columns. The values in each column are all of the same type, e.g. all numbers (if quantitative) or all text strings (if qualitative).

Let’s have a look at an example data frame that was loaded as part of the tidyr package that we loaded when we called library(tidyverse).

This dataset contains the number of TB cases documented by the World Health Organization in Afghanistan, Brazil and China in 1999 and 2000.

table1

If we view this in the R console it is displayed as follows and described as a tibble.

# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
class(table1)
## [1] "tbl_df"     "tbl"        "data.frame"

This is a special kind of data frame with the additional tbl_df and tbl types.

Tibbles have some additional behaviour over plain data frames that might be useful. The most obvious is that these get printed to the console in a more concise and useful way, particularly when there are very many rows and/or columns. You can see how the default behaviour of tibbles differs from normal data frames by looking at the help page for the tbl_df class.

?tbl_df

Tidy and untidy data

The tidyr package also contains some different representations of the TB dataset. Each of these tables organizes the values in a different layout or format.

Let’s take a look at the different representations.

table2 contains separate rows for the number of cases and the total population.

table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

table3 has a column called rate which contains multiple values, i.e. the number of cases and the total population. If we wanted to extract just the number of cases for each country and year this would require some effort.

Note that the rate as a percentage could easily be computed from table1 and much less straightforwardly from table2.

table3
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

table4a only contains the numbers of TB cases and table4b contains only the total populations. To compute the percentage of TB cases we would need to combine these tables, something we will look at later on in the course.

table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4b
## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

A dataset is a collection of values, usually either numbers (if quantitative) or text strings (if qualitative).

Every value belongs to a variable and an observation.

A variable contains values that measure the same underlying attribute, e.g. height, or temperature.

An observation contains values measured on the same unit, e.g. a person or a date. Note that there can be multiple measurements within an observation but these should be for multiple attributes measured on the same unit.

A tidy dataset is a data frame (or table) for which the following are true:

  1. Each variable has its own column
  2. Each observation has its own row
  3. Each value has its own cell

Question: which of the representations we’ve just been looking at is tidy?

table2 is tricky. If we consider the observational unit to be a country in a specific year then table2 is not tidy because observations are split across two rows. Also the count variable contains counts of what are essentially different things, the number of cases of TB and the total population. In tricky situations like this, a tell-tale sign that your data is not in a tidy format is when you want to visualize your data and you have to perform some kind of filtering to do so. In this case if we wanted to create a bar plot of for the population of each country we would have to first remove the rows corresponding to the number of TB cases.

Let’s look at another untidy dataset.

untidy_data <- read_tsv("tidyr-example.txt")
## Parsed with column specification:
## cols(
##   Name = col_character(),
##   treatmenta = col_character(),
##   treatmentb = col_double()
## )
untidy_data
## # A tibble: 3 x 3
##   Name         treatmenta treatmentb
##   <chr>        <chr>           <dbl>
## 1 John Smith   -                   2
## 2 Jane Doe     16                 11
## 3 Mary Johnson 3                   1

Questions

  1. What makes this an untidy dataset?

  2. What other problems can you see with this dataset?

Data in this format are quite familiar to us, but not easily interpretable by the computer. We need to think of the dataset in terms of variables and values.

What are the variables in this dataset?

  • Person (John, Jane or Mary)
  • Treatment (A or B)
  • The effect or result of the treatment for that individual

Remember the guiding principles:

  1. Each variable has its own column
  2. Each observation has its own row
  3. Each value has its own cell and belongs to a variable and an observation

A tidy form of this data would have 3 columns: Person, Treatment and Result.

This is an example of a fairly common scenario in which there are columns that are not variables but instead are values of a variable.

The variable in question is Treatment and the two columns treatmenta and treatmentb are values that variable can take. So a tidy form of this dataset would have a column called Treatment.

Using gather to create a tidy data frame

The gather function takes multiple “value” columns and collapses these into key-value pairs.

Look at the help for the gather function to see what we need to give it.

?gather

We have to specify which columns to work on and we have to tell gather what to call the variable for which those columns will become values - this variable is the ‘key’.

The values within those ‘key’ columns are values but for another variable which we specify using the ‘value’ argument.

In our example the key would be ‘Treatment’ and the value would be ‘Result’.

tidy_data <- gather(untidy_data, key = "Treatment", value = "Result", treatmenta, treatmentb)
tidy_data
## # A tibble: 6 x 3
##   Name         Treatment  Result
##   <chr>        <chr>      <chr> 
## 1 John Smith   treatmenta -     
## 2 Jane Doe     treatmenta 16    
## 3 Mary Johnson treatmenta 3     
## 4 John Smith   treatmentb 2     
## 5 Jane Doe     treatmentb 11    
## 6 Mary Johnson treatmentb 1

The key and value variables can be given as strings or as symbols without quotes.

gather(untidy_data, Treatment, Result, treatmenta, treatmentb)
## # A tibble: 6 x 3
##   Name         Treatment  Result
##   <chr>        <chr>      <chr> 
## 1 John Smith   treatmenta -     
## 2 Jane Doe     treatmenta 16    
## 3 Mary Johnson treatmenta 3     
## 4 John Smith   treatmentb 2     
## 5 Jane Doe     treatmentb 11    
## 6 Mary Johnson treatmentb 1

If there were lots of treatment columns it would be quite laborious to type them all in. Instead you can tell gather which columns not to include.

gather(untidy_data, Treatment, Result, -Name)
## # A tibble: 6 x 3
##   Name         Treatment  Result
##   <chr>        <chr>      <chr> 
## 1 John Smith   treatmenta -     
## 2 Jane Doe     treatmenta 16    
## 3 Mary Johnson treatmenta 3     
## 4 John Smith   treatmentb 2     
## 5 Jane Doe     treatmentb 11    
## 6 Mary Johnson treatmentb 1

You can also specify the columns as a range either using the column names

gather(untidy_data, Treatment, Result, treatmenta:treatmentb)
## # A tibble: 6 x 3
##   Name         Treatment  Result
##   <chr>        <chr>      <chr> 
## 1 John Smith   treatmenta -     
## 2 Jane Doe     treatmenta 16    
## 3 Mary Johnson treatmenta 3     
## 4 John Smith   treatmentb 2     
## 5 Jane Doe     treatmentb 11    
## 6 Mary Johnson treatmentb 1

or column indices

gather(untidy_data, Treatment, Result, 2:3)
## # A tibble: 6 x 3
##   Name         Treatment  Result
##   <chr>        <chr>      <chr> 
## 1 John Smith   treatmenta -     
## 2 Jane Doe     treatmenta 16    
## 3 Mary Johnson treatmenta 3     
## 4 John Smith   treatmentb 2     
## 5 Jane Doe     treatmentb 11    
## 6 Mary Johnson treatmentb 1

Note that gather is very similar to the melt function from the reshape and reshape2 packages.

Exercise: tidying data

See separate R markdown document.

Other useful tidyr functions

There are some other useful functions within the tidyr package. We’ll look at these using the dataset from the exercise.

clinical_data <- read_tsv("clinical-data.txt")
## Parsed with column specification:
## cols(
##   Subject = col_character(),
##   Placebo.1 = col_double(),
##   Placebo.2 = col_double(),
##   Drug1.1 = col_double(),
##   Drug1.2 = col_double(),
##   Drug2.1 = col_double(),
##   Drug2.2 = col_double()
## )
clinical_data <- gather(clinical_data, Treatment, Value, -Subject)
clinical_data
## # A tibble: 60 x 3
##    Subject   Treatment Value
##    <chr>     <chr>     <dbl>
##  1 Patient1  Placebo.1  49.8
##  2 Patient2  Placebo.1  46.8
##  3 Patient3  Placebo.1  48.7
##  4 Patient4  Placebo.1  51.7
##  5 Patient5  Placebo.1  48.9
##  6 Patient6  Placebo.1  53.5
##  7 Patient7  Placebo.1  53.6
##  8 Patient8  Placebo.1  46.2
##  9 Patient9  Placebo.1  50.5
## 10 Patient10 Placebo.1  47.0
## # … with 50 more rows
ggplot(clinical_data, mapping = aes(x = Treatment, y = Value)) +
  geom_boxplot()

The suffix for each of the treatments, either ‘.1’ or ‘.2’, indicates that each drug or placebo was measured twice. Ideally we would extract this into a new variable called ‘Replicate’. We can use the separate function to do so.

clinical_replicate_data <- separate(clinical_data, Treatment, into = c("Treatment", "Replicate"))
clinical_replicate_data
## # A tibble: 60 x 4
##    Subject   Treatment Replicate Value
##    <chr>     <chr>     <chr>     <dbl>
##  1 Patient1  Placebo   1          49.8
##  2 Patient2  Placebo   1          46.8
##  3 Patient3  Placebo   1          48.7
##  4 Patient4  Placebo   1          51.7
##  5 Patient5  Placebo   1          48.9
##  6 Patient6  Placebo   1          53.5
##  7 Patient7  Placebo   1          53.6
##  8 Patient8  Placebo   1          46.2
##  9 Patient9  Placebo   1          50.5
## 10 Patient10 Placebo   1          47.0
## # … with 50 more rows

Question: how did separate know to split the values on the ‘.’ character?

Look at the help for the separate function and find how you would specify the delimiter. In this case the default delimiter does the trick.

Now that we have a separate variable/column for the replicate we can create an improved box plot with faceting.

ggplot(clinical_replicate_data, mapping = aes(x = Replicate, y = Value)) +
  geom_boxplot() +
  facet_wrap(~Treatment)

The spread function does the opposite to gather and can be used to convert a tidy dataset back to one in which we have values in column headings. Sometimes this really is necessary, e.g. gene expression matrices where there are rows for each gene and columns for each sample, because some analysis functions require the matrix form of the data.

spread(clinical_data, key = Treatment, value = Value)
## # A tibble: 10 x 7
##    Subject   Drug1.1 Drug1.2 Drug2.1 Drug2.2 Placebo.1 Placebo.2
##    <chr>       <dbl>   <dbl>   <dbl>   <dbl>     <dbl>     <dbl>
##  1 Patient1     48.4    48.4    40.8    38.3      49.8      53.8
##  2 Patient10    44.9    50.1    39.0    36.2      47.0      44.8
##  3 Patient2     49.6    41.6    39.1    41.9      46.8      49.8
##  4 Patient3     40.5    49.2    40.3    35.1      48.7      48.1
##  5 Patient4     38.3    41.1    40.7    41.2      51.7      48.1
##  6 Patient5     43.1    39.4    43.3    34.9      48.9      48.3
##  7 Patient6     47.5    42.9    39.5    35.6      53.5      44.7
##  8 Patient7     49.2    46.4    37.4    38.8      53.6      47.0
##  9 Patient8     47.3    38.3    44.0    33.8      46.2      43.2
## 10 Patient9     43.4    48.6    41.6    34.4      50.5      56.2

It can also help fix table2 that we considered earlier.

table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
spread(table2, key = type, value = count)
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Selecting columns

We’re going to look now at some of the functions provided by the dplyr package, another core component of the tidyverse. dplyr provides a set of data manipulation functions that have a common syntax and work together in a consistent and uniform manner. These are referred to as verbs in the dplyr documentation.

The most basic operation is to select a subset of columns.

First we’ll load a more complex dataset with several columns (variables).

patients <- read_tsv("patient-data.txt")
## Parsed with column specification:
## cols(
##   ID = col_character(),
##   Name = col_character(),
##   Sex = col_character(),
##   Smokes = col_character(),
##   Height = col_character(),
##   Weight = col_character(),
##   Birth = col_date(format = ""),
##   State = col_character(),
##   Grade_Level = col_double(),
##   Died = col_logical(),
##   Count = col_double(),
##   Date.Entered.Study = col_date(format = "")
## )
patients
## # A tibble: 100 x 12
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade_Level
##    <chr> <chr> <chr> <chr>  <chr>  <chr>  <date>     <chr>       <dbl>
##  1 AC/A… Mich… Male  FALSE  182.8… 76.57… 1972-02-06 Geor…           2
##  2 AC/A… Derek Male  FALSE  179.1… 80.43… 1972-06-15 Colo…           2
##  3 AC/A… Todd  Male  FALSE  169.1… 75.48… 1972-07-09 New …           2
##  4 AC/A… Rona… Male  FALSE  175.6… 94.54… 1972-08-17 Colo…           1
##  5 AC/A… Chri… Fema… FALSE  164.4… 71.78… 1973-06-12 Geor…           2
##  6 AC/A… Dana  Fema… TRUE   158.2… 69.9kg 1973-07-01 Indi…           2
##  7 AC/A… Erin  Fema… FALSE  161.6… 68.85… 1972-03-26 New …           1
##  8 AC/A… Rach… Fema… No     165.8… 70.44… 1973-05-11 Colo…           1
##  9 AC/A… Rona… Male  FALSE  181.3… 76.9kg 1971-12-31 Geor…           1
## 10 AC/A… Bryan Male  FALSE  167.3… 79.06… 1973-07-19 New …           2
## # … with 90 more rows, and 3 more variables: Died <lgl>, Count <dbl>,
## #   Date.Entered.Study <date>

View the patients dataset in RStudio by selecting it in the Environment tab (usually top-right) or as follows.

View(patients)

You should be familiar with how to access a specific column of a data frame using the $ operator.

patients$Name
##   [1] "Michael"   "Derek"     "Todd"      "Ronald"    "Christine"
##   [6] "Dana"      "Erin"      "Rachel"    "Ronald"    "Bryan"    
##  [11] "Kimberly"  "Pamela"    "Kenneth"   "Stacy"     "Elizabeth"
##  [16] "Marc"      "Brandon"   "Travis"    "Andrew"    "Jeffrey"  
##  [21] "Jimmy"     "Tara"      "Thomas"    "Paula"     "Sandra"   
##  [26] "Mark"      "Steven"    "George"    "Steven"    "Brandon"  
##  [31] "Julie"     "Aaron"     "Jeffery"   "Kristen"   "Brett"    
##  [36] "Teresa"    "Brent"     "Tiffany"   "Joel"      "Michelle" 
##  [41] "Aaron"     "Brandy"    "Kristina"  "Catherine" "Alicia"   
##  [46] "Sherry"    "Tanya"     "Nancy"     "Charles"   "Linda"    
##  [51] "Nicole"    "Wendy"     "April"     "Barbara"   "Brandon"  
##  [56] "Holly"     "Stacy"     "Kelly"     "Shawn"     "Anthony"  
##  [61] "Sherry"    "Jeffery"   "Valerie"   "Jimmy"     "Jeffrey"  
##  [66] "Tiffany"   "Steven"    "Brandy"    "Jon"       "Sarah"    
##  [71] "Barbara"   "James"     "Marc"      "Mary"      "Richard"  
##  [76] "Erica"     "Benjamin"  "Joel"      "Carrie"    "Alicia"   
##  [81] "Paula"     "Stacy"     "Todd"      "Brenda"    "Laura"    
##  [86] "Laura"     "Suzanne"   "Troy"      "Barbara"   "Russell"  
##  [91] "Philip"    "Stephen"   "Holly"     "Douglas"   "Christy"  
##  [96] "Jill"      "Rhonda"    "Wendy"     "Sharon"    "Cheryl"

The $ operator works well for single columns, but for multiple columns it quickly starts to get cumbersome as we need to use the [] operator for accessing specific rows of columns and c() for combining the required columns. The column names also need quotation marks.

patients[,c("Name","Sex")]
## # A tibble: 100 x 2
##    Name      Sex   
##    <chr>     <chr> 
##  1 Michael   Male  
##  2 Derek     Male  
##  3 Todd      Male  
##  4 Ronald    Male  
##  5 Christine Female
##  6 Dana      Female
##  7 Erin      Female
##  8 Rachel    Female
##  9 Ronald    Male  
## 10 Bryan     Male  
## # … with 90 more rows

Non R users probably find these commands a bit obtuse.

  • why the [ ]?
  • what is c?
  • need to remember the row and column index
  • [,...] means display all rows

The dplyr verb select is a lot more intuitive.

select(patients, Name)
## # A tibble: 100 x 1
##    Name     
##    <chr>    
##  1 Michael  
##  2 Derek    
##  3 Todd     
##  4 Ronald   
##  5 Christine
##  6 Dana     
##  7 Erin     
##  8 Rachel   
##  9 Ronald   
## 10 Bryan    
## # … with 90 more rows

Note that we don’t need the quotation marks.

select(patients, Name, Sex)
## # A tibble: 100 x 2
##    Name      Sex   
##    <chr>     <chr> 
##  1 Michael   Male  
##  2 Derek     Male  
##  3 Todd      Male  
##  4 Ronald    Male  
##  5 Christine Female
##  6 Dana      Female
##  7 Erin      Female
##  8 Rachel    Female
##  9 Ronald    Male  
## 10 Bryan     Male  
## # … with 90 more rows

We can exclude columns in the same way as we did earlier with the gather function.

select(patients, -Name)
## # A tibble: 100 x 11
##    ID    Sex   Smokes Height Weight Birth      State Grade_Level Died 
##    <chr> <chr> <chr>  <chr>  <chr>  <date>     <chr>       <dbl> <lgl>
##  1 AC/A… Male  FALSE  182.8… 76.57… 1972-02-06 Geor…           2 FALSE
##  2 AC/A… Male  FALSE  179.1… 80.43… 1972-06-15 Colo…           2 FALSE
##  3 AC/A… Male  FALSE  169.1… 75.48… 1972-07-09 New …           2 FALSE
##  4 AC/A… Male  FALSE  175.6… 94.54… 1972-08-17 Colo…           1 FALSE
##  5 AC/A… Fema… FALSE  164.4… 71.78… 1973-06-12 Geor…           2 TRUE 
##  6 AC/A… Fema… TRUE   158.2… 69.9kg 1973-07-01 Indi…           2 FALSE
##  7 AC/A… Fema… FALSE  161.6… 68.85… 1972-03-26 New …           1 FALSE
##  8 AC/A… Fema… No     165.8… 70.44… 1973-05-11 Colo…           1 FALSE
##  9 AC/A… Male  FALSE  181.3… 76.9kg 1971-12-31 Geor…           1 FALSE
## 10 AC/A… Male  FALSE  167.3… 79.06… 1973-07-19 New …           2 FALSE
## # … with 90 more rows, and 2 more variables: Count <dbl>,
## #   Date.Entered.Study <date>

Question: how would you do this with base R?

patients[,-2]
patients[,setdiff(colnames(patients), "Name")]

You can also select columns within a range using :, again as we did before with gather.

select(patients, Name:Sex)
## # A tibble: 100 x 2
##    Name      Sex   
##    <chr>     <chr> 
##  1 Michael   Male  
##  2 Derek     Male  
##  3 Todd      Male  
##  4 Ronald    Male  
##  5 Christine Female
##  6 Dana      Female
##  7 Erin      Female
##  8 Rachel    Female
##  9 Ronald    Male  
## 10 Bryan     Male  
## # … with 90 more rows

Looking at the help for select you will see that there are a number of useful functions that can be used when selecting columns.

select(patients, starts_with("Grade"))
## # A tibble: 100 x 1
##    Grade_Level
##          <dbl>
##  1           2
##  2           2
##  3           2
##  4           1
##  5           2
##  6           2
##  7           1
##  8           1
##  9           1
## 10           2
## # … with 90 more rows

You can combine explicit naming of columns, ranges and functions, e.g.

select(patients, Name, Sex:Birth, -Smokes)
## # A tibble: 100 x 5
##    Name      Sex    Height   Weight  Birth     
##    <chr>     <chr>  <chr>    <chr>   <date>    
##  1 Michael   Male   182.87cm 76.57kg 1972-02-06
##  2 Derek     Male   179.12cm 80.43kg 1972-06-15
##  3 Todd      Male   169.15cm 75.48kg 1972-07-09
##  4 Ronald    Male   175.66cm 94.54kg 1972-08-17
##  5 Christine Female 164.47cm 71.78kg 1973-06-12
##  6 Dana      Female 158.27cm 69.9kg  1973-07-01
##  7 Erin      Female 161.69cm 68.85kg 1972-03-26
##  8 Rachel    Female 165.84cm 70.44kg 1973-05-11
##  9 Ronald    Male   181.32cm 76.9kg  1971-12-31
## 10 Bryan     Male   167.37cm 79.06kg 1973-07-19
## # … with 90 more rows

Exercise: selecting columns

See separate R markdown document.

Transforming and cleaning the data

We’re now going to explore this dataset, taking a look at some of the variables and some potential problems that will hinder any analysis we want to carry out.

We’ll read the data in again but this time using the base R read.delim function instead of the read_tsv because the latter has in fact done some tidying of our data and for the purposes of what comes next it would be better if it hadn’t.

patients <- read.delim("patient-data.txt")
patients <- as_tibble(patients)
patients
## # A tibble: 100 x 12
##    ID    Name  Sex   Smokes Height Weight Birth State Grade_Level Died 
##    <fct> <fct> <fct> <fct>  <fct>  <fct>  <fct> <fct>       <int> <lgl>
##  1 AC/A… Mich… Male  FALSE  182.8… 76.57… 1972… Geor…           2 FALSE
##  2 AC/A… Derek Male  FALSE  179.1… 80.43… 1972… "Col…           2 FALSE
##  3 AC/A… Todd  " Ma… FALSE  169.1… 75.48… 1972… New …           2 FALSE
##  4 AC/A… Rona… Male  FALSE  175.6… 94.54… 1972… Colo…           1 FALSE
##  5 AC/A… Chri… Fema… FALSE  164.4… 71.78… 1973… Geor…           2 TRUE 
##  6 AC/A… Dana  Fema… TRUE   158.2… 69.9kg 1973… Indi…           2 FALSE
##  7 AC/A… Erin  Fema… FALSE  161.6… 68.85… 1972… New …           1 FALSE
##  8 AC/A… Rach… Fema… No     165.8… 70.44… 1973… "Col…           1 FALSE
##  9 AC/A… Rona… Male  FALSE  181.3… 76.9kg 1971… Geor…           1 FALSE
## 10 AC/A… Bryan " Ma… FALSE  167.3… 79.06… 1973… New …           2 FALSE
## # … with 90 more rows, and 2 more variables: Count <dbl>,
## #   Date.Entered.Study <fct>

We’ll start by looking at the Sex column. Create a bar plot to show the numbers of male and female patients in our study.

ggplot(patients, mapping = aes(x = Sex)) +
  geom_bar()

There appear to be more categories than expected - why is this?

Note that the Sex variable has been read in as a factor. We can check the factor levels.

levels(patients$Sex)
## [1] " Female" " Male"   "Female"  "Female " "Male"    "Male "

There are some extraneous spaces in the Sex column in the tabular file.

R has many functions for dealing with and manipulating text strings. However, the stringr package that is part of the tidyverse aims to provide a clean and uniform interface to many common string operations.

Select the ‘Packages’ tab in RStudio and navigate to the stringr package. Take a look at the various functions it offers and work out which one will help us remove the spaces at the beginning and/or end of our Sex values.

str_trim(patients$Sex)
##   [1] "Male"   "Male"   "Male"   "Male"   "Female" "Female" "Female"
##   [8] "Female" "Male"   "Male"   "Female" "Female" "Male"   "Female"
##  [15] "Female" "Male"   "Male"   "Male"   "Male"   "Male"   "Male"  
##  [22] "Female" "Male"   "Female" "Female" "Male"   "Male"   "Male"  
##  [29] "Male"   "Male"   "Female" "Male"   "Male"   "Female" "Male"  
##  [36] "Female" "Male"   "Female" "Male"   "Female" "Male"   "Female"
##  [43] "Female" "Female" "Female" "Female" "Female" "Female" "Male"  
##  [50] "Female" "Female" "Female" "Female" "Female" "Male"   "Female"
##  [57] "Female" "Female" "Male"   "Male"   "Female" "Male"   "Female"
##  [64] "Male"   "Male"   "Female" "Male"   "Female" "Male"   "Female"
##  [71] "Female" "Male"   "Male"   "Female" "Male"   "Female" "Male"  
##  [78] "Male"   "Female" "Female" "Female" "Female" "Male"   "Female"
##  [85] "Female" "Female" "Female" "Male"   "Female" "Male"   "Male"  
##  [92] "Male"   "Female" "Male"   "Female" "Female" "Female" "Female"
##  [99] "Female" "Female"

str_trim takes a vector of strings and returns another vector in which each string value has whitespace removed from the beginning and end of the string.

We now need a way to use this to modify the data frame. One way to do this would be:

patients$Sex <- str_trim(patients$Sex)

The dplyr function for making changes to a data frame is mutate.

mutate(patients, Sex = str_trim(Sex))
## # A tibble: 100 x 12
##    ID    Name  Sex   Smokes Height Weight Birth State Grade_Level Died 
##    <fct> <fct> <chr> <fct>  <fct>  <fct>  <fct> <fct>       <int> <lgl>
##  1 AC/A… Mich… Male  FALSE  182.8… 76.57… 1972… Geor…           2 FALSE
##  2 AC/A… Derek Male  FALSE  179.1… 80.43… 1972… "Col…           2 FALSE
##  3 AC/A… Todd  Male  FALSE  169.1… 75.48… 1972… New …           2 FALSE
##  4 AC/A… Rona… Male  FALSE  175.6… 94.54… 1972… Colo…           1 FALSE
##  5 AC/A… Chri… Fema… FALSE  164.4… 71.78… 1973… Geor…           2 TRUE 
##  6 AC/A… Dana  Fema… TRUE   158.2… 69.9kg 1973… Indi…           2 FALSE
##  7 AC/A… Erin  Fema… FALSE  161.6… 68.85… 1972… New …           1 FALSE
##  8 AC/A… Rach… Fema… No     165.8… 70.44… 1973… "Col…           1 FALSE
##  9 AC/A… Rona… Male  FALSE  181.3… 76.9kg 1971… Geor…           1 FALSE
## 10 AC/A… Bryan Male  FALSE  167.3… 79.06… 1973… New …           2 FALSE
## # … with 90 more rows, and 2 more variables: Count <dbl>,
## #   Date.Entered.Study <fct>

Notice that we didn’t need to specify the Sex variable as patients$Sex.

Usually for cleaning operations like this we would choose to overwrite the column in question but we could instead create a new variable, e.g.

mutate(patients, SexTrimmed = str_trim(Sex))
## # A tibble: 100 x 13
##    ID    Name  Sex   Smokes Height Weight Birth State Grade_Level Died 
##    <fct> <fct> <fct> <fct>  <fct>  <fct>  <fct> <fct>       <int> <lgl>
##  1 AC/A… Mich… Male  FALSE  182.8… 76.57… 1972… Geor…           2 FALSE
##  2 AC/A… Derek Male  FALSE  179.1… 80.43… 1972… "Col…           2 FALSE
##  3 AC/A… Todd  " Ma… FALSE  169.1… 75.48… 1972… New …           2 FALSE
##  4 AC/A… Rona… Male  FALSE  175.6… 94.54… 1972… Colo…           1 FALSE
##  5 AC/A… Chri… Fema… FALSE  164.4… 71.78… 1973… Geor…           2 TRUE 
##  6 AC/A… Dana  Fema… TRUE   158.2… 69.9kg 1973… Indi…           2 FALSE
##  7 AC/A… Erin  Fema… FALSE  161.6… 68.85… 1972… New …           1 FALSE
##  8 AC/A… Rach… Fema… No     165.8… 70.44… 1973… "Col…           1 FALSE
##  9 AC/A… Rona… Male  FALSE  181.3… 76.9kg 1971… Geor…           1 FALSE
## 10 AC/A… Bryan " Ma… FALSE  167.3… 79.06… 1973… New …           2 FALSE
## # … with 90 more rows, and 3 more variables: Count <dbl>,
## #   Date.Entered.Study <fct>, SexTrimmed <chr>

Creating a new variable is more common where we compute a new value based on values from other columns.

mutate(table1, percent = 100 * cases / population)
## # A tibble: 6 x 5
##   country      year  cases population percent
##   <chr>       <int>  <int>      <int>   <dbl>
## 1 Afghanistan  1999    745   19987071 0.00373
## 2 Afghanistan  2000   2666   20595360 0.0129 
## 3 Brazil       1999  37737  172006362 0.0219 
## 4 Brazil       2000  80488  174504898 0.0461 
## 5 China        1999 212258 1272915272 0.0167 
## 6 China        2000 213766 1280428583 0.0167

We may want the Sex column to remain as a factor. Note that in the following we assign the result of the mutate operation to the patients object, i.e. overwriting it.

patients <- mutate(patients, Sex = factor(str_trim(Sex)))
levels(patients$Sex)
## [1] "Female" "Male"
ggplot(patients, mapping = aes(x = Sex)) +
  geom_bar()

Let’s turn our attention to another of the variables, Height. Create a histogram of the heights of our patients.

ggplot(patients, mapping = aes(x = Height)) +
  geom_histogram()
Error: StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?

The error is not very easy to understand but it does indicate that for a histogram we need a continuous variable. What type is our Height variable? Why might this be?

select(patients, Height)
## # A tibble: 100 x 1
##    Height  
##    <fct>   
##  1 182.87cm
##  2 179.12cm
##  3 169.15cm
##  4 175.66cm
##  5 164.47cm
##  6 158.27cm
##  7 161.69cm
##  8 165.84cm
##  9 181.32cm
## 10 167.37cm
## # … with 90 more rows

To convert these values to numbers we need to remove the ‘cm’ unit that is appended to each value.

Again, take a look at the functions available in stringr and see which might be able to carry out this operation.

str_remove(patients$Height, "cm")
##   [1] "182.87" "179.12" "169.15" "175.66" "164.47" "158.27" "161.69"
##   [8] "165.84" "181.32" "167.37" "160.06" "166.48" "175.39" "164.7" 
##  [15] "163.79" "181.13" "169.24" "176.22" "174.09" "180.11" "179.24"
##  [22] "161.92" "169.85" "160.57" "168.24" "177.75" "183.21" "167.75"
##  [29] "181.15" "181.56" "160.03" "165.62" "181.64" "159.67" "177.03"
##  [36] "163.35" "175.21" "160.8"  "166.46" "157.95" "180.61" "159.52"
##  [43] "163.01" "165.8"  "170.03" "157.16" "164.58" "163.47" "185.43"
##  [50] "165.34" "163.45" "163.97" "161.38" "160.09" "178.64" "159.78"
##  [57] "161.57" "161.83" "169.66" "166.84" "159.32" "170.51" "161.84"
##  [64] "171.41" "166.75" "166.19" "169.16" "157.01" "167.51" "160.47"
##  [71] "162.33" "175.67" "174.25" "158.94" "172.72" "159.23" "176.54"
##  [78] "184.34" "163.94" "160.09" "162.32" "162.59" "171.94" "158.07"
##  [85] "158.35" "162.18" "159.38" "171.45" "163.17" "183.1"  "177.14"
##  [92] "171.08" "159.33" "185.43" "162.65" "159.44" "164.11" "159.13"
##  [99] "160.58" "164.88"

The values are still strings though.

as.numeric(str_remove(patients$Height, "cm"))
##   [1] 182.87 179.12 169.15 175.66 164.47 158.27 161.69 165.84 181.32 167.37
##  [11] 160.06 166.48 175.39 164.70 163.79 181.13 169.24 176.22 174.09 180.11
##  [21] 179.24 161.92 169.85 160.57 168.24 177.75 183.21 167.75 181.15 181.56
##  [31] 160.03 165.62 181.64 159.67 177.03 163.35 175.21 160.80 166.46 157.95
##  [41] 180.61 159.52 163.01 165.80 170.03 157.16 164.58 163.47 185.43 165.34
##  [51] 163.45 163.97 161.38 160.09 178.64 159.78 161.57 161.83 169.66 166.84
##  [61] 159.32 170.51 161.84 171.41 166.75 166.19 169.16 157.01 167.51 160.47
##  [71] 162.33 175.67 174.25 158.94 172.72 159.23 176.54 184.34 163.94 160.09
##  [81] 162.32 162.59 171.94 158.07 158.35 162.18 159.38 171.45 163.17 183.10
##  [91] 177.14 171.08 159.33 185.43 162.65 159.44 164.11 159.13 160.58 164.88

Now we can modify the dataset with mutate.

patients <- mutate(patients, Height = as.numeric(str_remove(Height, "cm")))
patients
## # A tibble: 100 x 12
##    ID    Name  Sex   Smokes Height Weight Birth State Grade_Level Died 
##    <fct> <fct> <fct> <fct>   <dbl> <fct>  <fct> <fct>       <int> <lgl>
##  1 AC/A… Mich… Male  FALSE    183. 76.57… 1972… Geor…           2 FALSE
##  2 AC/A… Derek Male  FALSE    179. 80.43… 1972… "Col…           2 FALSE
##  3 AC/A… Todd  Male  FALSE    169. 75.48… 1972… New …           2 FALSE
##  4 AC/A… Rona… Male  FALSE    176. 94.54… 1972… Colo…           1 FALSE
##  5 AC/A… Chri… Fema… FALSE    164. 71.78… 1973… Geor…           2 TRUE 
##  6 AC/A… Dana  Fema… TRUE     158. 69.9kg 1973… Indi…           2 FALSE
##  7 AC/A… Erin  Fema… FALSE    162. 68.85… 1972… New …           1 FALSE
##  8 AC/A… Rach… Fema… No       166. 70.44… 1973… "Col…           1 FALSE
##  9 AC/A… Rona… Male  FALSE    181. 76.9kg 1971… Geor…           1 FALSE
## 10 AC/A… Bryan Male  FALSE    167. 79.06… 1973… New …           2 FALSE
## # … with 90 more rows, and 2 more variables: Count <dbl>,
## #   Date.Entered.Study <fct>
ggplot(patients, mapping = aes(x = Height)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You can modify multiple columns within a single mutate function call.

patients <- read.delim("patient-data.txt")
patients <- as_tibble(patients)
patients <- mutate(patients,
                   Sex = as.factor(str_trim(Sex)),
                   Height = as.numeric(str_remove(Height, "cm")))
patients
## # A tibble: 100 x 12
##    ID    Name  Sex   Smokes Height Weight Birth State Grade_Level Died 
##    <fct> <fct> <fct> <fct>   <dbl> <fct>  <fct> <fct>       <int> <lgl>
##  1 AC/A… Mich… Male  FALSE    183. 76.57… 1972… Geor…           2 FALSE
##  2 AC/A… Derek Male  FALSE    179. 80.43… 1972… "Col…           2 FALSE
##  3 AC/A… Todd  Male  FALSE    169. 75.48… 1972… New …           2 FALSE
##  4 AC/A… Rona… Male  FALSE    176. 94.54… 1972… Colo…           1 FALSE
##  5 AC/A… Chri… Fema… FALSE    164. 71.78… 1973… Geor…           2 TRUE 
##  6 AC/A… Dana  Fema… TRUE     158. 69.9kg 1973… Indi…           2 FALSE
##  7 AC/A… Erin  Fema… FALSE    162. 68.85… 1972… New …           1 FALSE
##  8 AC/A… Rach… Fema… No       166. 70.44… 1973… "Col…           1 FALSE
##  9 AC/A… Rona… Male  FALSE    181. 76.9kg 1971… Geor…           1 FALSE
## 10 AC/A… Bryan Male  FALSE    167. 79.06… 1973… New …           2 FALSE
## # … with 90 more rows, and 2 more variables: Count <dbl>,
## #   Date.Entered.Study <fct>

Exercise: tidying data

See separate R markdown document.

Mutating multiple columns

If you look at the help documentation for mutate you will notice that there are a set of related functions, mutate_all, mutate_if and mutate_at. These are very useful functions for applying the same operation to several columns within a table in one go.

For example, let’s say we want to round each patient’s weight and height to 1 decimal place.

patients <- read_tsv("patient-data-cleaned.txt")
## Parsed with column specification:
## cols(
##   ID = col_character(),
##   Name = col_character(),
##   Sex = col_character(),
##   Smokes = col_character(),
##   Height = col_double(),
##   Weight = col_double(),
##   Birth = col_date(format = ""),
##   State = col_character(),
##   Grade = col_double(),
##   Died = col_logical(),
##   Count = col_double(),
##   Date.Entered.Study = col_date(format = ""),
##   Age = col_double(),
##   BMI = col_double(),
##   Overweight = col_logical()
## )
mutate_at(patients, vars(Height, Weight), round, digits = 1)
## # A tibble: 100 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor…     2 FALSE
##  2 AC/A… Derek Male  Non-S…   179.   80.4 1972-06-15 Colo…     2 FALSE
##  3 AC/A… Todd  Male  Non-S…   169.   75.5 1972-07-09 New …     2 FALSE
##  4 AC/A… Rona… Male  Non-S…   176.   94.5 1972-08-17 Colo…     1 FALSE
##  5 AC/A… Chri… Fema… Non-S…   164.   71.8 1973-06-12 Geor…     2 TRUE 
##  6 AC/A… Dana  Fema… Smoker   158.   69.9 1973-07-01 Indi…     2 FALSE
##  7 AC/A… Erin  Fema… Non-S…   162.   68.8 1972-03-26 New …     1 FALSE
##  8 AC/A… Rach… Fema… Non-S…   166.   70.4 1973-05-11 Colo…     1 FALSE
##  9 AC/A… Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor…     1 FALSE
## 10 AC/A… Bryan Male  Non-S…   167.   79.1 1973-07-19 New …     2 FALSE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

Two things to note here. Firstly, we had to specify the variables that needed to be modified in a vars() argument. Secondly, you can add any additional arguments that the function requires at the end.

mutate_if allows us to run the same operation on all columns satisfying a specified criterion. For example we could choose to round all numeric columns to 1 decimal place.

patients <- read_tsv("patient-data-cleaned.txt")
## Parsed with column specification:
## cols(
##   ID = col_character(),
##   Name = col_character(),
##   Sex = col_character(),
##   Smokes = col_character(),
##   Height = col_double(),
##   Weight = col_double(),
##   Birth = col_date(format = ""),
##   State = col_character(),
##   Grade = col_double(),
##   Died = col_logical(),
##   Count = col_double(),
##   Date.Entered.Study = col_date(format = ""),
##   Age = col_double(),
##   BMI = col_double(),
##   Overweight = col_logical()
## )
patients <- mutate_if(patients, is.numeric, round, digits = 1)
patients
## # A tibble: 100 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor…     2 FALSE
##  2 AC/A… Derek Male  Non-S…   179.   80.4 1972-06-15 Colo…     2 FALSE
##  3 AC/A… Todd  Male  Non-S…   169.   75.5 1972-07-09 New …     2 FALSE
##  4 AC/A… Rona… Male  Non-S…   176.   94.5 1972-08-17 Colo…     1 FALSE
##  5 AC/A… Chri… Fema… Non-S…   164.   71.8 1973-06-12 Geor…     2 TRUE 
##  6 AC/A… Dana  Fema… Smoker   158.   69.9 1973-07-01 Indi…     2 FALSE
##  7 AC/A… Erin  Fema… Non-S…   162.   68.8 1972-03-26 New …     1 FALSE
##  8 AC/A… Rach… Fema… Non-S…   166.   70.4 1973-05-11 Colo…     1 FALSE
##  9 AC/A… Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor…     1 FALSE
## 10 AC/A… Bryan Male  Non-S…   167.   79.1 1973-07-19 New …     2 FALSE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>