--- title: "Tidy data and iteration exercises" author: "Nicholas Horton (nhorton@amherst.edu)" date: "February 5, 2018" output: html_document: fig_height: 5 fig_width: 7 pdf_document: fig_height: 5 fig_width: 7 word_document: fig_height: 3 fig_width: 5 --- ```{r, setup, include=FALSE} library(mdsr) # Load additional packages here # Some customization. You can alter or delete as desired (if you know what you are doing). trellis.par.set(theme=theme.mosaic()) # change default color scheme for lattice knitr::opts_chunk$set( tidy=FALSE, # display code as typed size="small") # slightly smaller font for code ``` ## Introduction These exercises are taken from the tidy data and iteration chapter from **Modern Data Science with R**: http://mdsr-book.github.io. Other materials relevant for instructors (sample activities, overview video) for this chapter can be found there. ## Home runs Consider the number of home runs hit (`HR`) and home runs allowed (`HRA`) for the Chicago Cubs (CHN) baseball team. Reshape the `Teams` data from the `Lahman` package into *long* format and plot a time series conditioned on whether the HRs that involved the Cubs were hit by them or allowed by them. SOLUTION: ```{r} library(mdsr) library(Lahman) # solution goes here ``` ## Seasons Write a function called `count_seasons` that, when given a `teamID`, will count the number of seasons the team played in the `Teams` data frame from the `Lahman` package. SOLUTION: ```{r} library(mdsr) library(Lahman) # solution goes here ``` ## We'll always have Brooklyn The team IDs corresponding to Brooklyn baseball teams from the `Teams` data frame from the `Lahman` package are listed below. Use `sapply()` to find the number of seasons in which each of those teams played. SOLUTION: ```{r} library(mdsr) library(Lahman) bk_teams <- c("BR1", "BR2", "BR3", "BR4", "BRO", "BRP", "BRF") # solution goes here ``` ## Marriage In the `Marriage` data set included in `mosaicData`, the `appdate`, `ceremonydate`, and `dob` variables are encoded as factors, even though they are dates. Use the `lubridate` package to convert those three columns into a date format. SOLUTION: ```{r} library(mdsr) Marriage %>% select(appdate, ceremonydate, dob) %>% glimpse() # solution goes here ``` ## Coercion Consider the values returned by the `as.numeric()` and `readr::parse_number()` functions when applied to the following vectors. Describe the results and their implication. <<>>= x1 <- c("1900.45", "$1900.45", "1,900.45", "nearly $2000") x2 <- as.factor(x1) @ ## Brittle code An analyst wants to calculate the pairwise differences between the Treatment and Control values for a small data set from a crossover trial (all subjects received both treatments) that consists of the following observations. ```{r echo=FALSE} library(xtable) ds1 <- data_frame(id=c(1:3, 1:3), group=c(rep("T", 3), rep("C", 3)), vals=c(4,6,8,5,6,10)) ``` ```{r results="asis"} tab <- xtable(ds1) print(tab, type="html", floating=FALSE) ``` They use the following code to create the new `diff` variable. ```{r eval=FALSE} Treat <- filter(ds1, group=="T") Control <- filter(ds1, group=="C") all <- mutate(Treat, diff = Treat$vals - Control$vals) all ``` Verify that this code works for this example and generates the correct values of -1, 0, and -2. Describe two problems that might arise if the data set is not sorted in a particular order or if one of the observations is missing for one of the subjects. Provide an alternative approach to generate this variable that is more robust (hint: use `tidyr::spread()`). SOLUTION: ```{r message=FALSE} # solution goes here ``` ## Tall to wide Generate the code to convert the following data frame to wide format. ```{r echo=FALSE} library(mosaic) library(readr) library(tidyr) df <- data_frame( id=c(1,1,2,2,3,3,4,4,5,5), sex=c("M","M","F","F","F","F","M","M","M","M"), grp=rep(c("A","B"), 5), left=c(0.1,0.2, 0.3,0.4, 0.15,0.25, 0.6,0.65, 0.71,0.79), right=c(0.1,0.25, 0.3,0.35, 0.18,0.25, 0.6,0.6, 0.71,0.79)+0.1 ) sumdf <- df %>% group_by(grp, sex) %>% summarize(meanL=mean(left), sdL=sd(left), meanR=mean(right), sdR=sd(right)) %>% ungroup() ``` ```{r echo=FALSE, results='asis'} tab <- xtable(sumdf, floating=FALSE) print(tab, type="html") ``` The result should look like the following display. ```{r echo=FALSE, results="asis"} transformed <- sumdf %>% gather(key = "temp", value = "vals", meanL, meanR, sdL, sdR) %>% unite(col = "temp1", sex, temp, sep = ".") %>% spread(temp1, vals) tab <- xtable(data.frame(transformed)) print(tab, type="html") ``` Hint: use `gather()` in conjunction with `spread()` and `unite()`. ## Multiple models Use the `dplyr::do()` function and the `HELPrct` data frame from the `mosaicData` package to fit a regression model predicting `cesd` as a function of `age` separately for each of the levels of the `substance` variable. Generate a table of results (estimates and confidence intervals) for each level of the grouping variable. SOLUTION: ```{r} library(mdsr) # solution goes here ``` ## Baseball records Use the `dplyr::do()` function and the `Lahman` data to replicate one of these baseball records plots (http://tinyurl.com/nytimes-records) from the *The New York Times*. SOLUTION: ```{r} library(mdsr) library(Lahman) # solution goes here ``` ## FEC Use the `fec` package to download the Federal Election Commission data for 2012. Re-create Figures 2.1 and 2.2 using `ggplot2`. SOLUTION: ```{r} # solution goes here, after downloading the fec package. ``` ## More FEC Using the same FEC data as the previous exercise, re-create Figure 2.8. SOLUTION: ```{r} # solution goes here ``` ## Wikipedia Using the approach described in Section 5.5.4, find another table in Wikipedia that can be scraped and visualized. Be sure to interpret your graphical display. SOLUTION: ```{r message=FALSE} # earlier example (please delete this) library(rvest) library(methods) url <- "http://en.wikipedia.org/wiki/Mile_run_world_record_progression" tables <- url %>% read_html() %>% html_nodes("table") length(tables) Table3 <- html_table(tables[[3]]) head(Table3) ``` ## Wrangling with the FEC Replicate the wrangling to create the `house_elections` table in the `fec` package from the original Excel source file. SOLUTION: ```{r} library(mdsr) # solution goes here ``` ## Babynames Replicate the functionality of the `make_babynames_dist()` function from the `mdsr` package to wrangle the original tables from the `babynames` package. SOLUTION: ```{r} library(mdsr) # solution goes here ```