Data scientists are known and celebrated for modeling and visually displaying information, but down in the data science engine room there is a lot of less glamorous work to be done. Before data can be used effectively it must often be cleaned, corrected, and reformatted. This workshop introduces the basic tools needed to make your data behave, including data reshaping, regular expressions and other text manipulation tools.
Prior to the workshop you should:
install.packages("tidyverse")
The lesson notes are included in the download link above. You can also view the lesson notes at https://rawgit.com/izahn/R-data-cleaning/master/dataCleaning.html
A github repository containing the workshop materials is available https://github.com/izahn/R-data-cleaning.
This is an intermediate/advanced R course appropriate for those with basic knowledge of R. If you need a refresher we recommend the Software Carpentry Introductory R material.
It is common for data to be made available on a website somewhere, either by a government agency, research group, or other organizations and entities. Often the data you want is spread over many files, and retrieving it all one file at a time is tedious and time consuming. Such is the case with the baby names data we will be using today.
The UK Office for National Statistics provides yearly data on the most popular baby names going back to 1996. The data is provided separately for boys and girls and is stored in Excel spreadsheets.
I have downloaded all the excel files containing boys names data from https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/babynamesenglandandwalesbabynamesstatisticsboys. Our mission is to extract and graph the top 100 boys names in England and Wales for every year since 1996. There are several things that make this challenging.
As you can see, we have a lot of work to do. Let’s get started by attaching some useful R packages.
library(tidyverse) # data.frame manipulation
library(stringr) # string processing
library(readxl) # read excel files
library(purrr) # work with lists
Our first task is to iterate over all the data files and read the appropriate sheet from each one. As noted above, the appropriate sheet differs from year to year, but always has “Table 1” in the sheet name.
The first step is to get a vector of file names.
boy.file.names <- list.files("babyNamesData/boys", full.names = TRUE)
Now we can iterate over the file names and get the names of each worksheet. We could use a for
loop, or sapply
, but the map
family of functions from the purrr package gives us a more consistent alternative, so we’ll use that.
map(boy.file.names, excel_sheets)
## [[1]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2-Top 10 boys by month" "Table 3 - Boys names - E&W"
##
## [[2]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[3]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[4]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[5]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[6]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[7]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[8]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[9]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[10]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[11]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[12]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[13]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[14]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[15]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[16]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[17]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[18]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[19]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[20]]
## [1] "Contents" "Metadata" "Terms and Conditions"
## [4] "Table 1" "Table 2" "Table 3"
## [7] "Table 4" "Table 5" "Table 6"
## [10] "Related Publications"
In order extract the correct worksheet names we will use functions for manipulating strings. Base R provides some string manipulation capabilities (see ?regex
, ?sub
and ?grep
), but we will use the stringr package because it is more user-friendly.
The stringr package provides functions to detect, locate, extract, match, replace, combine and split strings (among other things).
Here we want to detect the pattern “Table 1”, and only return elements with this pattern. We can do that using the str_subset
function. The first argument to str_subset
is character vector we want to search in. The second argument is a regular expression matching the pattern we want to retain.
If you are not familiar with regular expressions, http://www.regexr.com/ is a good place to start.
Now that we know how to filter character vectors using str_subset
we can identify the correct sheets for each year.
findTable1 <- function(x) {
str_subset(excel_sheets(x), "Table 1")
}
map(boy.file.names, findTable1)
## [[1]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[2]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[3]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[4]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[5]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[6]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[7]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[8]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[9]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[10]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[11]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[12]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[13]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[14]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[15]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[16]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[17]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[18]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[19]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[20]]
## [1] "Table 1"
Next we want to read the correct worksheet from each file. We already know how to iterate over a vector of file names with map
, and we know how to identify the correct sheet. All we need to do next is read that sheet into R. We can do that using the read_excel
function.
Recall that the actual data starts on row 7, so we want to skip the first 6 rows.
readTable1 <- function(file) {
read_excel(file, sheet = findTable1(file), skip = 6)
}
boysNames <- map(boy.file.names, readTable1)
glimpse(boysNames[[1]])
## Observations: 59
## Variables: 12
## $ NA <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11",...
## $ Name <chr> "JACK ", "DANIEL ", "THOMAS ", "JAMES ", "JOSHUA ", "MAT...
## $ Count <dbl> 10779, 10338, 9603, 9385, 7887, 7426, 6496, 6193, 6161, ...
## $ NA <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NA <dbl> 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, ...
## $ Name <chr> "DOMINIC ", "NICHOLAS ", "BRANDON ", "RHYS ", "MARK ", "...
## $ Count <dbl> 1519, 1385, 1337, 1259, 1222, 1192, 1186, 1135, 1128, 11...
## $ NA <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NA <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NA <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NA <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NA <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
Now that we’ve read in the data we still have some cleanup to do.
Specifically, we need to:
In short, we want to go from this:
to this:
There are many ways to do this kind of data manipulation in R. We’re going to use the dplyr and tidyr packages to make our lives easier. Both packages were attached along with the tidyverse package.
The column names are in bad shape. In R we need column names to a) start with a letter, b) contain only letters, numbers, underscores and periods, and c) uniquely identify each column.
The actual column names look like this:
names(boysNames[[1]])
## [1] NA "Name" "Count" NA NA "Name" "Count" NA
## [9] NA NA NA NA
So we need to a) make sure each column has a name, and b) distinguish between the first and second occurrences of “Name” and “Count”. We could do this step-by-step, but there is a handy function in R called make.names
that will do it for us.
names(boysNames[[1]])
## [1] NA "Name" "Count" NA NA "Name" "Count" NA
## [9] NA NA NA NA
make.names(names(boysNames[[1]]), unique = TRUE)
## [1] "NA." "Name" "Count" "NA..1" "NA..2" "Name.1" "Count.1"
## [8] "NA..3" "NA..4" "NA..5" "NA..6" "NA..7"
setNames(boysNames[[1]], make.names(names(boysNames[[1]]), unique = TRUE))
## # A tibble: 59 × 12
## NA. Name Count NA..1 NA..2 Name.1 Count.1 NA..3 NA..4 NA..5
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 JACK 10779 NA 51 DOMINIC 1519 NA NA NA
## 2 2 DANIEL 10338 NA 52 NICHOLAS 1385 NA NA NA
## 3 3 THOMAS 9603 NA 53 BRANDON 1337 NA NA NA
## 4 4 JAMES 9385 NA 54 RHYS 1259 NA NA NA
## 5 5 JOSHUA 7887 NA 55 MARK 1222 NA NA NA
## 6 6 MATTHEW 7426 NA 56 MAX 1192 NA NA NA
## 7 7 RYAN 6496 NA 57 DYLAN 1186 NA NA NA
## 8 8 JOSEPH 6193 NA 58 HENRY 1135 NA NA NA
## 9 9 SAMUEL 6161 NA 59 PETER 1128 NA NA NA
## 10 10 LIAM 5802 NA 60 STEPHEN 1122 NA NA NA
## # ... with 49 more rows, and 2 more variables: NA..6 <dbl>, NA..7 <dbl>
names(boysNames[[1]])
## [1] NA "Name" "Count" NA NA "Name" "Count" NA
## [9] NA NA NA NA
Of course we need to iterate over each data.frame in the boysNames
list and to this for each one. Fortunately the map
function makes this easy.
boysNames <- map(boysNames,
function(x) {
setNames(x, make.names(names(x), unique = TRUE))
})
Next we want to remove blank rows and rows used for notes. An easy way to do that is to remove rows that don’t have a name. We can filter on some condition using the filter
function, like this:
boysNames[[1]]
## # A tibble: 59 × 12
## NA. Name Count NA..1 NA..2 Name.1 Count.1 NA..3 NA..4 NA..5
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 JACK 10779 NA 51 DOMINIC 1519 NA NA NA
## 2 2 DANIEL 10338 NA 52 NICHOLAS 1385 NA NA NA
## 3 3 THOMAS 9603 NA 53 BRANDON 1337 NA NA NA
## 4 4 JAMES 9385 NA 54 RHYS 1259 NA NA NA
## 5 5 JOSHUA 7887 NA 55 MARK 1222 NA NA NA
## 6 6 MATTHEW 7426 NA 56 MAX 1192 NA NA NA
## 7 7 RYAN 6496 NA 57 DYLAN 1186 NA NA NA
## 8 8 JOSEPH 6193 NA 58 HENRY 1135 NA NA NA
## 9 9 SAMUEL 6161 NA 59 PETER 1128 NA NA NA
## 10 10 LIAM 5802 NA 60 STEPHEN 1122 NA NA NA
## # ... with 49 more rows, and 2 more variables: NA..6 <dbl>, NA..7 <dbl>
boysNames[[1]] <- filter(boysNames[[1]], !is.na(Name))
boysNames[[1]]
## # A tibble: 50 × 12
## NA. Name Count NA..1 NA..2 Name.1 Count.1 NA..3 NA..4 NA..5
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 JACK 10779 NA 51 DOMINIC 1519 NA NA NA
## 2 2 DANIEL 10338 NA 52 NICHOLAS 1385 NA NA NA
## 3 3 THOMAS 9603 NA 53 BRANDON 1337 NA NA NA
## 4 4 JAMES 9385 NA 54 RHYS 1259 NA NA NA
## 5 5 JOSHUA 7887 NA 55 MARK 1222 NA NA NA
## 6 6 MATTHEW 7426 NA 56 MAX 1192 NA NA NA
## 7 7 RYAN 6496 NA 57 DYLAN 1186 NA NA NA
## 8 8 JOSEPH 6193 NA 58 HENRY 1135 NA NA NA
## 9 9 SAMUEL 6161 NA 59 PETER 1128 NA NA NA
## 10 10 LIAM 5802 NA 60 STEPHEN 1122 NA NA NA
## # ... with 40 more rows, and 2 more variables: NA..6 <dbl>, NA..7 <dbl>
Of course we need to do that for every data set in the boysNames
list, but I’ll leave that to you.
Next we want to retain just the Name
, Name.1
and Count
, Count.1
columns. We can do that using the select
function:
boysNames[[1]]
## # A tibble: 50 × 12
## NA. Name Count NA..1 NA..2 Name.1 Count.1 NA..3 NA..4 NA..5
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 JACK 10779 NA 51 DOMINIC 1519 NA NA NA
## 2 2 DANIEL 10338 NA 52 NICHOLAS 1385 NA NA NA
## 3 3 THOMAS 9603 NA 53 BRANDON 1337 NA NA NA
## 4 4 JAMES 9385 NA 54 RHYS 1259 NA NA NA
## 5 5 JOSHUA 7887 NA 55 MARK 1222 NA NA NA
## 6 6 MATTHEW 7426 NA 56 MAX 1192 NA NA NA
## 7 7 RYAN 6496 NA 57 DYLAN 1186 NA NA NA
## 8 8 JOSEPH 6193 NA 58 HENRY 1135 NA NA NA
## 9 9 SAMUEL 6161 NA 59 PETER 1128 NA NA NA
## 10 10 LIAM 5802 NA 60 STEPHEN 1122 NA NA NA
## # ... with 40 more rows, and 2 more variables: NA..6 <dbl>, NA..7 <dbl>
boysNames[[1]] <- select(boysNames[[1]], Name, Name.1, Count, Count.1)
boysNames[[1]]
## # A tibble: 50 × 4
## Name Name.1 Count Count.1
## <chr> <chr> <dbl> <dbl>
## 1 JACK DOMINIC 10779 1519
## 2 DANIEL NICHOLAS 10338 1385
## 3 THOMAS BRANDON 9603 1337
## 4 JAMES RHYS 9385 1259
## 5 JOSHUA MARK 7887 1222
## 6 MATTHEW MAX 7426 1192
## 7 RYAN DYLAN 6496 1186
## 8 JOSEPH HENRY 6193 1135
## 9 SAMUEL PETER 6161 1128
## 10 LIAM STEPHEN 5802 1122
## # ... with 40 more rows
Again we will want to do this for all the elements in boysNames
, a task I leave to you.
Our final task is to re-arrange to data so that it is all in a single table instead of in two side-by-side tables. For many similar tasks the gather
function in the tidyr package is useful, but in this case we will be better off using a combination of select
and bind_rows
.
boysNames[[1]]
## # A tibble: 50 × 4
## Name Name.1 Count Count.1
## <chr> <chr> <dbl> <dbl>
## 1 JACK DOMINIC 10779 1519
## 2 DANIEL NICHOLAS 10338 1385
## 3 THOMAS BRANDON 9603 1337
## 4 JAMES RHYS 9385 1259
## 5 JOSHUA MARK 7887 1222
## 6 MATTHEW MAX 7426 1192
## 7 RYAN DYLAN 6496 1186
## 8 JOSEPH HENRY 6193 1135
## 9 SAMUEL PETER 6161 1128
## 10 LIAM STEPHEN 5802 1122
## # ... with 40 more rows
bind_rows(select(boysNames[[1]], Name, Count),
select(boysNames[[1]], Name = Name.1, Count = Count.1))
## # A tibble: 100 × 2
## Name Count
## <chr> <dbl>
## 1 JACK 10779
## 2 DANIEL 10338
## 3 THOMAS 9603
## 4 JAMES 9385
## 5 JOSHUA 7887
## 6 MATTHEW 7426
## 7 RYAN 6496
## 8 JOSEPH 6193
## 9 SAMUEL 6161
## 10 LIAM 5802
## # ... with 90 more rows
In the previous examples we learned how to drop empty rows with filter
, select only relevant columns with select
, and re-arrange our data with select
and bind_rows
. In each case we applied the changes only to the first element of our boysNames
list.
Your task now is to use the map
function to apply each of these transformations to all the elements in boysNames
.
There are different ways you can go about it. Here is one:
## write a function that does all the cleanup
cleanupNamesData <- function(x) {
filtered <- filter(x, !is.na(Name)) # drop rows with no Name value
selected <- select(filtered, Name, Count, Name.1, Count.1) # select just Name and Count columns
bind_rows(select(selected, Name, Count), # re-arrange into two columns
select(selected, Name = Name.1, Count = Count.1))
}
## test it out on the second data.frame in the list
glimpse(boysNames[[2]]) # before cleanup
## Observations: 61
## Variables: 13
## $ NA. <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "1...
## $ Name <chr> "JACK ", "JAMES ", "THOMAS ", "DANIEL ", "JOSHU...
## $ Count <dbl> 10145, 9853, 9479, 9047, 7698, 7443, 6367, 5809...
## $ X.since.1996 <chr> "-", "+2", "-", "-2", "-", "-", "+2", "-", "-2"...
## $ NA..1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ NA..2 <dbl> 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,...
## $ Name.1 <chr> "SEAN ", "DYLAN ", "DOMINIC ", "LOUIS ", "RHYS ...
## $ Count.1 <dbl> 1388, 1380, 1359, 1325, 1291, 1274, 1244, 1241,...
## $ X.since.1996.1 <chr> "-2", "+5", "-2", "+7", "-1", "-4", "-1", "-", ...
## $ NA..3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ NA..4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ NA..5 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ NA..6 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
glimpse(cleanupNamesData(boysNames[[2]])) # after cleanup
## Observations: 100
## Variables: 2
## $ Name <chr> "JACK ", "JAMES ", "THOMAS ", "DANIEL ", "JOSHUA ", "MAT...
## $ Count <dbl> 10145, 9853, 9479, 9047, 7698, 7443, 6367, 5809, 5631, 5...
## apply the cleanup function to all the data.frames in the list
boysNames <- map(boysNames, cleanupNamesData)