---
title: "Text as Data exercises"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "July 19, 2017"
output:
  html_document:
    fig_height: 5
    fig_width: 7
  pdf_document:
    fig_height: 5
    fig_width: 7
  word_document:
    fig_height: 3
    fig_width: 5
---

```{r, setup, include=FALSE}
library(mdsr)   # Load additional packages here
library(tidyr)
library(tm)
library(wordcloud)

# Some customization.  You can alter or delete as desired (if you know what you are doing).
trellis.par.set(theme=theme.mosaic()) # change default color scheme for lattice
knitr::opts_chunk$set(
  tidy=FALSE,     # display code as typed
  size="small")   # slightly smaller font for code
```

## Introduction

These exercises are taken from the text as data chapter from **Modern Data Science with R**: http://mdsr-book.github.io. Other materials relevant for instructors (sample activities, overview video) for this chapter can be found there.

## Speaking lines

Speaking lines in Shakespeare's plays are identified by a line that starts with two spaces, then a string of capital letters and spaces (the character's name) followed by a period. Use `grep()` to find all of the speaking lines in *Macbeth*. How many are there?

SOLUTION:

```{r}
library(mdsr)
library(tidyr)
library(tm)
library(wordcloud)
data(Macbeth_raw)
macbeth <- strsplit(Macbeth_raw, "\r\n")[[1]]
head(macbeth)
# solution goes here
```

## Hyphenated words

Find all the hyphenated words in one of Shakespeare's plays.

SOLUTION:

```{r}
# solution goes here
```

## Most popular names

Use the `babynames` data table from the `babynames` package to find the ten most popular:

1) Boys' names ending in a vowel.

SOLUTION:

```{r}
# solution goes here
```

2) Names ending with `joe`, `jo`, `Joe`, or `Jo` (e.g., `Billyjoe`).

SOLUTION:

```{r}
# solution goes here
```

## Adjectives

Find all of the adjectives in one of Shakespeare's plays that end in `more` or `less` (note change from original question 15.4).

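As a warm-up for this kind of end-of-string matching, here is a minimal sketch on a made-up word list (not Shakespeare's text). The vector `words` and the endings `ing` and `ed` are illustrative assumptions, chosen so that the exercise itself is left open. The sketch combines alternation (`|`) with the end-of-string anchor (`$`) in `grep()`.

```{r}
# Hypothetical toy data: a handful of words, not taken from any play
words <- c("walking", "talked", "run", "singer", "jumped", "ring")

# Match words that END in "ing" or "ed": alternation inside a group, then "$"
pattern <- "(ing|ed)$"

idx <- grep(pattern, words)                    # positions of the matching elements
matches <- grep(pattern, words, value = TRUE)  # the matching strings themselves
n_matches <- length(matches)                   # how many matched

matches
n_matches
```

Note that `"singer"` is not matched, because the anchor requires the ending to sit at the very end of the string.
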
SOLUTION:

```{r}
# solution goes here
```

## Stage directions

Find all of the lines containing the stage direction `Exit` or `Exeunt` in one of Shakespeare's plays (note change from original exercise 15.5).

SOLUTION:

```{r}
# solution goes here
```

## Regular expressions

Use regular expressions to determine the number of speaking lines from the *Complete Works of William Shakespeare* (http://www.gutenberg.org/cache/epub/100/pg100.txt). Here, we care only about how many times a character speaks, not what they say or for how long they speak.

SOLUTION:

```{r}
# solution goes here
```

## Top characters

Make a bar chart displaying the top 100 characters with the greatest number of lines. *Hint*: you may want to use either the `stringr::str_extract()` or `strsplit()` function here.

SOLUTION:

```{r}
# solution goes here
```

## Shakespeare Machine

In this problem, you will do much of the work to recreate Mark Hansen's *Shakespeare Machine*. Start by watching a video clip (http://vimeo.com/54858820) of the exhibit. Use *The Complete Works of William Shakespeare* (see earlier exercise) and regular expressions to find all of the hyphenated words in *Shakespeare Machine*. How many are there? Use `%in%` to verify that your list contains the following hyphenated words pictured at 00:46 of the clip.

SOLUTION:

```{r}
sm_words <- c("true-fix'd", "pale-hearted", "lean-fac'd",
  "hard-hearted", "best-regarded", "thick-ribbed",
  "both-sides", "sea-like.", "shrill-shrieking",
  "lust-stain'd", "tragical-historical,")
# solution goes here
```

## Wikipedia table

Find an interesting Wikipedia page with a table, scrape the data from it, and generate a figure that tells an interesting story. Include an interpretation of the figure.

SOLUTION:

```{r}
# solution goes here
```

## Stackexchange 1

The site stackexchange.com displays questions and answers on technical topics. The following code downloads the most recent R questions related to the `dplyr` package.

```{r message=FALSE}
library(httr)
# Find the most recent R questions on stackoverflow
getresult <- GET("http://api.stackexchange.com",
  path = "questions",
  query = list(site = "stackoverflow.com", tagged = "dplyr"))
stop_for_status(getresult)       # Ensure returned without error
questions <- content(getresult)  # Grab content
names(questions$items[[1]])      # What does the returned data look like?
length(questions$items)
substr(questions$items[[1]]$title, 1, 68)
substr(questions$items[[2]]$title, 1, 68)
substr(questions$items[[3]]$title, 1, 68)
```

How many questions were returned? Without using jargon, describe in words what is being displayed and how it might be used.

SOLUTION:

```{r}
# solution goes here
```

## Stackexchange 2

Repeat the process of downloading the content from stackexchange.com related to the `dplyr` package and summarize the results.

SOLUTION:

```{r}
# solution goes here
```
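
One possible starting point for the summary step is sketched below, using a small mock list in place of a live download so that it runs offline. The object `mock_items` and all of its values are made up for illustration, and the field names (`title`, `score`, `answer_count`, `is_answered`) are assumptions about what each item contains; check `names(questions$items[[1]])` against your own download before relying on them.

```{r}
# Mock data shaped like questions$items; every value here is invented
mock_items <- list(
  list(title = "Example question A", score = 3, answer_count = 2, is_answered = TRUE),
  list(title = "Example question B", score = 0, answer_count = 0, is_answered = FALSE),
  list(title = "Example question C", score = 5, answer_count = 1, is_answered = TRUE)
)

# Flatten the nested list into a data frame with one row per question
question_df <- data.frame(
  title        = vapply(mock_items, function(q) q$title, character(1)),
  score        = vapply(mock_items, function(q) q$score, numeric(1)),
  answer_count = vapply(mock_items, function(q) q$answer_count, numeric(1)),
  is_answered  = vapply(mock_items, function(q) q$is_answered, logical(1)),
  stringsAsFactors = FALSE
)

# Simple summaries: number of questions, share answered, mean score
n_questions    <- nrow(question_df)
share_answered <- mean(question_df$is_answered)
mean_score     <- mean(question_df$score)
```

With real data, replacing `mock_items` by `questions$items` should give the same structure, provided the assumed fields are present in every item.
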