Workshop overview and materials

Workshop description
Prerequisites and Preparation
Example project overview
Problems with the data
Useful data manipulation packages

Iterating over a directory of files

Filtering strings using regular expressions

Reading all the files

Data cleanup

Fixing column names
Fixing all the names
Filtering rows
Selecting columns
Re-arranging into a single table

Exercise: Cleanup all the data

——–
Exercise prototype

Workshop overview and materials

Workshop description

Data scientists are known and celebrated for modeling and visually displaying information, but down in the data science engine room there is a lot of less glamorous work to be done. Before data can be used effectively it must often be cleaned, corrected, and reformatted. This workshop introduces the basic tools needed to make your data behave, including data reshaping, regular expressions and other text manipulation tools.

Prerequisites and Preparation

Prior to the workshop you should:

install R from https://cran.r-project.org/
install RStudio from https://www.rstudio.com/products/rstudio/download/#download
install the tidyverse package in R with install.packages("tidyverse")
download and extract the workshop materials from https://github.com/izahn/R-data-cleaning/archive/messy_data_v1.zip

The lesson notes are included in the download link above. You can also view the lesson notes at https://rawgit.com/izahn/R-data-cleaning/master/dataCleaning.html

A github repository containing the workshop materials is available https://github.com/izahn/R-data-cleaning.

This is an intermediate/advanced R course appropriate for those with basic knowledge of R. If you need a refresher we recommend the Software Carpentry Introductory R material.

Example project overview

It is common for data to be made available on a website somewhere, either by a government agency, research group, or other organizations and entities. Often the data you want is spread over many files, and retrieving it all one file at a time is tedious and time consuming. Such is the case with the baby names data we will be using today.

The UK Office for National Statistics provides yearly data on the most popular baby names going back to 1996. The data is provided separately for boys and girls and is stored in Excel spreadsheets.

I have downloaded all the excel files containing boys names data from https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/babynamesenglandandwalesbabynamesstatisticsboys. Our mission is to extract and graph the top 100 boys names in England and Wales for every year since 1996. There are several things that make this challenging.

Problems with the data

The worksheet containing the data of interest is in different positions and has different names from one year to the next. However, it always includes “Table 1” in the worksheet name.
The data does not start on row one. Headers are on row 7, followed by a blank line, followed by the actual data.
The data is stored in an inconvenient way, with ranks 1-50 in the first set of columns and ranks 51-100 in a separate set of columns.
Some years include columns for “changes in rank”, others do not.
There are notes below the data.

Useful data manipulation packages

As you can see, we have a lot of work to do. Let’s get started by attaching some useful R packages.

library(tidyverse) # data.frame manipulation
library(stringr) # string processing
library(readxl) # read excel files
library(purrr) # work with lists

Iterating over a directory of files

Our first task is to iterate over all the data files and read the appropriate sheet from each one. As noted above, the appropriate sheet differs from year to year, but always has “Table 1” in the sheet name.

The first step is to get a vector of file names.

boy.file.names <- list.files("babyNamesData/boys", full.names = TRUE)

Now we can iterate over the file names and get the names of each worksheet. We could use a for loop, or sapply, but the map family of functions from the purrr package gives us a more consistent alternative, so we’ll use that.

map(boy.file.names, excel_sheets)

## [[1]]
## [1] "Contents"                     "Table 1 - Top 100 boys, E&W" 
## [3] "Table 2-Top 10 boys by month" "Table 3 - Boys names - E&W"  
## 
## [[2]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[3]]
## [1] "Contents"                        "Table 1 - Top 100 boys' names"  
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[4]]
## [1] "Contents"                        "Table 1 - Top 100 boys' names"  
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[5]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[6]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[7]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[8]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[9]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[10]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[11]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[12]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[13]]
## [1] "Contents"                        "Table 1 - Top 100 boys' names"  
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[14]]
## [1] "Contents"                        "Table 1 - Top 100 boys' names"  
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[15]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[16]]
##  [1] "Contents"                        "Metadata"                       
##  [3] "Terms and Conditions"            "Table 1 - Top 100 boys, E&W"    
##  [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
##  [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
##  [9] "Table 6 - Boys names - E&W"      "Related Publications"           
## 
## [[17]]
##  [1] "Contents"                        "Metadata"                       
##  [3] "Terms and Conditions"            "Table 1 - Top 100 boys, E&W"    
##  [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
##  [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
##  [9] "Table 6 - Boys names - E&W"      "Related Publications"           
## 
## [[18]]
##  [1] "Contents"                        "Metadata"                       
##  [3] "Terms and Conditions"            "Table 1 - Top 100 boys, E&W"    
##  [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
##  [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
##  [9] "Table 6 - Boys names - E&W"      "Related Publications"           
## 
## [[19]]
##  [1] "Contents"                        "Metadata"                       
##  [3] "Terms and Conditions"            "Table 1 - Top 100 boys, E&W"    
##  [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
##  [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
##  [9] "Table 6 - Boys names - E&W"      "Related Publications"           
## 
## [[20]]
##  [1] "Contents"             "Metadata"             "Terms and Conditions"
##  [4] "Table 1"              "Table 2"              "Table 3"             
##  [7] "Table 4"              "Table 5"              "Table 6"             
## [10] "Related Publications"

Filtering strings using regular expressions

In order extract the correct worksheet names we will use functions for manipulating strings. Base R provides some string manipulation capabilities (see ?regex, ?sub and ?grep), but we will use the stringr package because it is more user-friendly.

The stringr package provides functions to detect, locate, extract, match, replace, combine and split strings (among other things).

Here we want to detect the pattern “Table 1”, and only return elements with this pattern. We can do that using the str_subset function. The first argument to str_subset is character vector we want to search in. The second argument is a regular expression matching the pattern we want to retain.

If you are not familiar with regular expressions, http://www.regexr.com/ is a good place to start.

Now that we know how to filter character vectors using str_subset we can identify the correct sheets for each year.

findTable1 <- function(x) {
  str_subset(excel_sheets(x), "Table 1")
}

map(boy.file.names, findTable1)

## [[1]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[2]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[3]]
## [1] "Table 1 - Top 100 boys' names"
## 
## [[4]]
## [1] "Table 1 - Top 100 boys' names"
## 
## [[5]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[6]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[7]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[8]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[9]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[10]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[11]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[12]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[13]]
## [1] "Table 1 - Top 100 boys' names"
## 
## [[14]]
## [1] "Table 1 - Top 100 boys' names"
## 
## [[15]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[16]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[17]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[18]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[19]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[20]]
## [1] "Table 1"

Reading all the files

Next we want to read the correct worksheet from each file. We already know how to iterate over a vector of file names with map, and we know how to identify the correct sheet. All we need to do next is read that sheet into R. We can do that using the read_excel function.

Recall that the actual data starts on row 7, so we want to skip the first 6 rows.

readTable1 <- function(file) {
  read_excel(file, sheet = findTable1(file), skip = 6)
}

boysNames <- map(boy.file.names, readTable1)
glimpse(boysNames[[1]])

## Observations: 59
## Variables: 12
## $ NA    <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11",...
## $ Name  <chr> "JACK ", "DANIEL ", "THOMAS ", "JAMES ", "JOSHUA ", "MAT...
## $ Count <dbl> 10779, 10338, 9603, 9385, 7887, 7426, 6496, 6193, 6161, ...
## $ NA    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NA    <dbl> 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, ...
## $ Name  <chr> "DOMINIC ", "NICHOLAS ", "BRANDON ", "RHYS ", "MARK ", "...
## $ Count <dbl> 1519, 1385, 1337, 1259, 1222, 1192, 1186, 1135, 1128, 11...
## $ NA    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NA    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NA    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NA    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NA    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

Data cleanup

Now that we’ve read in the data we still have some cleanup to do.

Specifically, we need to:

fix column names
get rid of blank row and the top and the notes at the bottom
get rid of extraneous “changes in rank” columns if they exist
transform the side-by-side tables layout to a single table.

In short, we want to go from this:

messy

to this:

tidy

There are many ways to do this kind of data manipulation in R. We’re going to use the dplyr and tidyr packages to make our lives easier. Both packages were attached along with the tidyverse package.

Fixing column names

The column names are in bad shape. In R we need column names to a) start with a letter, b) contain only letters, numbers, underscores and periods, and c) uniquely identify each column.

The actual column names look like this:

names(boysNames[[1]])

##  [1] NA      "Name"  "Count" NA      NA      "Name"  "Count" NA     
##  [9] NA      NA      NA      NA

So we need to a) make sure each column has a name, and b) distinguish between the first and second occurrences of “Name” and “Count”. We could do this step-by-step, but there is a handy function in R called make.names that will do it for us.

names(boysNames[[1]])

##  [1] NA      "Name"  "Count" NA      NA      "Name"  "Count" NA     
##  [9] NA      NA      NA      NA

make.names(names(boysNames[[1]]), unique = TRUE)

##  [1] "NA."     "Name"    "Count"   "NA..1"   "NA..2"   "Name.1"  "Count.1"
##  [8] "NA..3"   "NA..4"   "NA..5"   "NA..6"   "NA..7"

setNames(boysNames[[1]], make.names(names(boysNames[[1]]), unique = TRUE))

## # A tibble: 59 × 12
##      NA.     Name Count NA..1 NA..2    Name.1 Count.1 NA..3 NA..4 NA..5
##    <chr>    <chr> <dbl> <dbl> <dbl>     <chr>   <dbl> <dbl> <dbl> <dbl>
## 1      1    JACK  10779    NA    51  DOMINIC     1519    NA    NA    NA
## 2      2  DANIEL  10338    NA    52 NICHOLAS     1385    NA    NA    NA
## 3      3  THOMAS   9603    NA    53  BRANDON     1337    NA    NA    NA
## 4      4   JAMES   9385    NA    54     RHYS     1259    NA    NA    NA
## 5      5  JOSHUA   7887    NA    55     MARK     1222    NA    NA    NA
## 6      6 MATTHEW   7426    NA    56      MAX     1192    NA    NA    NA
## 7      7    RYAN   6496    NA    57    DYLAN     1186    NA    NA    NA
## 8      8  JOSEPH   6193    NA    58    HENRY     1135    NA    NA    NA
## 9      9  SAMUEL   6161    NA    59    PETER     1128    NA    NA    NA
## 10    10    LIAM   5802    NA    60  STEPHEN     1122    NA    NA    NA
## # ... with 49 more rows, and 2 more variables: NA..6 <dbl>, NA..7 <dbl>

names(boysNames[[1]])

##  [1] NA      "Name"  "Count" NA      NA      "Name"  "Count" NA     
##  [9] NA      NA      NA      NA

Fixing all the names

Of course we need to iterate over each data.frame in the boysNames list and to this for each one. Fortunately the map function makes this easy.

boysNames <- map(boysNames,
                 function(x) {
                     setNames(x, make.names(names(x), unique = TRUE))
                 })

Filtering rows

Next we want to remove blank rows and rows used for notes. An easy way to do that is to remove rows that don’t have a name. We can filter on some condition using the filter function, like this:

boysNames[[1]]

## # A tibble: 59 × 12
##      NA.     Name Count NA..1 NA..2    Name.1 Count.1 NA..3 NA..4 NA..5
##    <chr>    <chr> <dbl> <dbl> <dbl>     <chr>   <dbl> <dbl> <dbl> <dbl>
## 1      1    JACK  10779    NA    51  DOMINIC     1519    NA    NA    NA
## 2      2  DANIEL  10338    NA    52 NICHOLAS     1385    NA    NA    NA
## 3      3  THOMAS   9603    NA    53  BRANDON     1337    NA    NA    NA
## 4      4   JAMES   9385    NA    54     RHYS     1259    NA    NA    NA
## 5      5  JOSHUA   7887    NA    55     MARK     1222    NA    NA    NA
## 6      6 MATTHEW   7426    NA    56      MAX     1192    NA    NA    NA
## 7      7    RYAN   6496    NA    57    DYLAN     1186    NA    NA    NA
## 8      8  JOSEPH   6193    NA    58    HENRY     1135    NA    NA    NA
## 9      9  SAMUEL   6161    NA    59    PETER     1128    NA    NA    NA
## 10    10    LIAM   5802    NA    60  STEPHEN     1122    NA    NA    NA
## # ... with 49 more rows, and 2 more variables: NA..6 <dbl>, NA..7 <dbl>

boysNames[[1]] <- filter(boysNames[[1]], !is.na(Name))
boysNames[[1]]

## # A tibble: 50 × 12
##      NA.     Name Count NA..1 NA..2    Name.1 Count.1 NA..3 NA..4 NA..5
##    <chr>    <chr> <dbl> <dbl> <dbl>     <chr>   <dbl> <dbl> <dbl> <dbl>
## 1      1    JACK  10779    NA    51  DOMINIC     1519    NA    NA    NA
## 2      2  DANIEL  10338    NA    52 NICHOLAS     1385    NA    NA    NA
## 3      3  THOMAS   9603    NA    53  BRANDON     1337    NA    NA    NA
## 4      4   JAMES   9385    NA    54     RHYS     1259    NA    NA    NA
## 5      5  JOSHUA   7887    NA    55     MARK     1222    NA    NA    NA
## 6      6 MATTHEW   7426    NA    56      MAX     1192    NA    NA    NA
## 7      7    RYAN   6496    NA    57    DYLAN     1186    NA    NA    NA
## 8      8  JOSEPH   6193    NA    58    HENRY     1135    NA    NA    NA
## 9      9  SAMUEL   6161    NA    59    PETER     1128    NA    NA    NA
## 10    10    LIAM   5802    NA    60  STEPHEN     1122    NA    NA    NA
## # ... with 40 more rows, and 2 more variables: NA..6 <dbl>, NA..7 <dbl>

Of course we need to do that for every data set in the boysNames list, but I’ll leave that to you.

Selecting columns

Next we want to retain just the Name, Name.1 and Count, Count.1 columns. We can do that using the select function:

boysNames[[1]]

## # A tibble: 50 × 12
##      NA.     Name Count NA..1 NA..2    Name.1 Count.1 NA..3 NA..4 NA..5
##    <chr>    <chr> <dbl> <dbl> <dbl>     <chr>   <dbl> <dbl> <dbl> <dbl>
## 1      1    JACK  10779    NA    51  DOMINIC     1519    NA    NA    NA
## 2      2  DANIEL  10338    NA    52 NICHOLAS     1385    NA    NA    NA
## 3      3  THOMAS   9603    NA    53  BRANDON     1337    NA    NA    NA
## 4      4   JAMES   9385    NA    54     RHYS     1259    NA    NA    NA
## 5      5  JOSHUA   7887    NA    55     MARK     1222    NA    NA    NA
## 6      6 MATTHEW   7426    NA    56      MAX     1192    NA    NA    NA
## 7      7    RYAN   6496    NA    57    DYLAN     1186    NA    NA    NA
## 8      8  JOSEPH   6193    NA    58    HENRY     1135    NA    NA    NA
## 9      9  SAMUEL   6161    NA    59    PETER     1128    NA    NA    NA
## 10    10    LIAM   5802    NA    60  STEPHEN     1122    NA    NA    NA
## # ... with 40 more rows, and 2 more variables: NA..6 <dbl>, NA..7 <dbl>

boysNames[[1]] <- select(boysNames[[1]], Name, Name.1, Count, Count.1)
boysNames[[1]]

## # A tibble: 50 × 4
##        Name    Name.1 Count Count.1
##       <chr>     <chr> <dbl>   <dbl>
## 1     JACK   DOMINIC  10779    1519
## 2   DANIEL  NICHOLAS  10338    1385
## 3   THOMAS   BRANDON   9603    1337
## 4    JAMES      RHYS   9385    1259
## 5   JOSHUA      MARK   7887    1222
## 6  MATTHEW       MAX   7426    1192
## 7     RYAN     DYLAN   6496    1186
## 8   JOSEPH     HENRY   6193    1135
## 9   SAMUEL     PETER   6161    1128
## 10    LIAM   STEPHEN   5802    1122
## # ... with 40 more rows

Again we will want to do this for all the elements in boysNames, a task I leave to you.

Re-arranging into a single table

Our final task is to re-arrange to data so that it is all in a single table instead of in two side-by-side tables. For many similar tasks the gather function in the tidyr package is useful, but in this case we will be better off using a combination of select and bind_rows.

boysNames[[1]]

## # A tibble: 50 × 4
##        Name    Name.1 Count Count.1
##       <chr>     <chr> <dbl>   <dbl>
## 1     JACK   DOMINIC  10779    1519
## 2   DANIEL  NICHOLAS  10338    1385
## 3   THOMAS   BRANDON   9603    1337
## 4    JAMES      RHYS   9385    1259
## 5   JOSHUA      MARK   7887    1222
## 6  MATTHEW       MAX   7426    1192
## 7     RYAN     DYLAN   6496    1186
## 8   JOSEPH     HENRY   6193    1135
## 9   SAMUEL     PETER   6161    1128
## 10    LIAM   STEPHEN   5802    1122
## # ... with 40 more rows

bind_rows(select(boysNames[[1]], Name, Count),
          select(boysNames[[1]], Name = Name.1, Count = Count.1))

## # A tibble: 100 × 2
##        Name Count
##       <chr> <dbl>
## 1     JACK  10779
## 2   DANIEL  10338
## 3   THOMAS   9603
## 4    JAMES   9385
## 5   JOSHUA   7887
## 6  MATTHEW   7426
## 7     RYAN   6496
## 8   JOSEPH   6193
## 9   SAMUEL   6161
## 10    LIAM   5802
## # ... with 90 more rows

Exercise: Cleanup all the data

In the previous examples we learned how to drop empty rows with filter, select only relevant columns with select, and re-arrange our data with select and bind_rows. In each case we applied the changes only to the first element of our boysNames list.

Your task now is to use the map function to apply each of these transformations to all the elements in boysNames.

——–

Exercise prototype

There are different ways you can go about it. Here is one:

## write a function that does all the cleanup
cleanupNamesData <- function(x) {
    filtered <- filter(x, !is.na(Name)) # drop rows with no Name value
    selected <- select(filtered, Name, Count, Name.1, Count.1) # select just Name and Count columns
    bind_rows(select(selected, Name,  Count), # re-arrange into two columns
              select(selected, Name = Name.1, Count = Count.1))
}

## test it out on the second data.frame in the list
glimpse(boysNames[[2]]) # before cleanup

## Observations: 61
## Variables: 13
## $ NA.            <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "1...
## $ Name           <chr> "JACK ", "JAMES ", "THOMAS ", "DANIEL ", "JOSHU...
## $ Count          <dbl> 10145, 9853, 9479, 9047, 7698, 7443, 6367, 5809...
## $ X.since.1996   <chr> "-", "+2", "-", "-2", "-", "-", "+2", "-", "-2"...
## $ NA..1          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ NA..2          <dbl> 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,...
## $ Name.1         <chr> "SEAN ", "DYLAN ", "DOMINIC ", "LOUIS ", "RHYS ...
## $ Count.1        <dbl> 1388, 1380, 1359, 1325, 1291, 1274, 1244, 1241,...
## $ X.since.1996.1 <chr> "-2", "+5", "-2", "+7", "-1", "-4", "-1", "-", ...
## $ NA..3          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ NA..4          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ NA..5          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ NA..6          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

glimpse(cleanupNamesData(boysNames[[2]])) # after cleanup

## Observations: 100
## Variables: 2
## $ Name  <chr> "JACK ", "JAMES ", "THOMAS ", "DANIEL ", "JOSHUA ", "MAT...
## $ Count <dbl> 10145, 9853, 9479, 9047, 7698, 7443, 6367, 5809, 5631, 5...

## apply the cleanup function to all the data.frames in the list
boysNames <- map(boysNames, cleanupNamesData)

Cleaning up messy data

Ista Zahn, Daina Bouquin