--- title: "Text as Data examples" author: "Nicholas Horton (nhorton@amherst.edu)" date: "July 8, 2017" output: html_document: fig_height: 5 fig_width: 7 toc: true toc_depth: 2 pdf_document: fig_height: 5 fig_width: 7 word_document: fig_height: 3 fig_width: 5 --- ```{r, setup, include=FALSE} library(mdsr) # Load additional packages here library(tidyr) library(tm) library(wordcloud) # Some customization. You can alter or delete as desired (if you know what you are doing). trellis.par.set(theme=theme.mosaic()) # change default color scheme for lattice knitr::opts_chunk$set( tidy=FALSE, # display code as typed size="small") # slightly smaller font for code ``` ## Introduction This example builds on the text mining chapter from **Modern Data Science with R**: http://mdsr-book.github.io/. ## Data ingestation and processing ```{r} library(mdsr) library(tidyr) library(tm) library(wordcloud) data(Macbeth_raw) # strsplit returns a list: we only want the first element macbeth <- strsplit(Macbeth_raw, "\r\n")[[1]] length(macbeth) head(macbeth) ``` ```{r} macbeth[300:310] ``` ## Regular expressions and `grep()` The `grep()` function works using a *needle* in a *haystack* paradigm, wherein the first argument is the regular expression (or pattern) you want to find (i.e., the needle) and the second argument is the character vector in which you want to find patterns (i.e., the haystack). Note that unless the argument value is set to TRUE, `grep()` returns the *indices* of the haystack in which the needles were found. ```{r} macbeth_lines <- grep(" MACBETH", macbeth, value = TRUE) length(macbeth_lines) head(macbeth_lines) ``` ```{r} length(grep(" MACDUFF", macbeth)) ``` The `grepl` function uses the same syntax but returns a logical vector as long as the haystack. Thus, while the length of the vector returned by `grep` is the number of matches, the length of the vector returned by `grepl` is always the same as the length of the haystack vector. ```{r} length(grep(" MACBETH", macbeth)) length(grepl(" MACBETH", macbeth)) ``` However, both will subset the original vector in the same way, and thus in this respect they are functionally equivalent. ```{r} identical(macbeth[grep(" MACBETH", macbeth)], macbeth[grepl(" MACBETH", macbeth)]) ``` To extract the piece of each matching line that actually matched, use the `str_extract()` function from the `stringr` package. ```{r} library(stringr) pattern <- " MACBETH" grep(pattern, macbeth, value = TRUE) %>% str_extract(pattern) %>% head() ``` ```{r} head(grep("MAC.", macbeth, value = TRUE)) head(grep("MACBETH\\.", macbeth, value = TRUE)) head(grep("MAC[B-Z]", macbeth, value = TRUE)) head(grep("MAC(B|D)", macbeth, value = TRUE)) head(grep("^ MAC[B-Z]", macbeth, value = TRUE)) head(grep("^ ?MAC[B-Z]", macbeth, value = TRUE)) head(grep("^ *MAC[B-Z]", macbeth, value = TRUE)) head(grep("^ +MAC[B-Z]", macbeth, value = TRUE)) ``` ## Analysis of speaking lines We might learn something about the play by knowing when each character speaks as a function of the line number in the play. We can retrieve this information using `grepl()`. ```{r} Macbeth <- grepl(" MACBETH\\.", macbeth) LadyMacbeth <- grepl(" LADY MACBETH\\.", macbeth) Banquo <- grepl(" BANQUO\\.", macbeth) Duncan <- grepl(" DUNCAN\\.", macbeth) speaker_freq <- data.frame(Macbeth, LadyMacbeth, Banquo, Duncan) %>% mutate(line = 1:length(macbeth)) %>% gather(key = "character", value = "speak", -line) %>% mutate(speak = as.numeric(speak)) %>% filter(line > 218 & line < 3172) glimpse(speaker_freq) ``` Before we create the plot, we will gather some helpful contextual information about when each Act begins. ```{r} acts_idx <- grep("^ACT [I|V]+", macbeth) acts_labels <- str_extract(macbeth[acts_idx], "^ACT [I|V]+") acts <- data.frame(line = acts_idx, labels = acts_labels) ``` ```{r} ggplot(data = speaker_freq, aes(x = line, y = speak)) + geom_smooth(aes(color = character), method = "loess", se = 0, span = 0.4) + geom_vline(xintercept = acts_idx, color = "darkgray", lty = 3) + geom_text(data = acts, aes(y = 0.085, label = labels), hjust = "left", color = "darkgray") + ylim(c(0, NA)) + xlab("Line Number") + ylab("Proportion of Speeches") ``` ## Some word analyses ```{r} Corpus <- VCorpus(VectorSource(macbeth)) sampleline <- 300 Corpus[[sampleline]] %>% as.character() %>% strwrap() ``` ```{r} Corpus <- Corpus %>% tm_map(stripWhitespace) %>% tm_map(removeNumbers) %>% tm_map(removePunctuation) %>% tm_map(content_transformer(tolower)) %>% tm_map(removeWords, stopwords("english")) strwrap(as.character(Corpus[[sampleline]])) ``` ```{r fig.width=8, fig.height=8} wordcloud(Corpus, max.words = 30, scale = c(8, 1), colors = topo.colors(n = 30), random.color = TRUE) ``` ## Document term matrix analyses ```{r warning=FALSE} DTM <- DocumentTermMatrix(Corpus, control = list(weighting = weightTfIdf)) # DTM findFreqTerms(DTM, lowfreq = 50) DTM %>% as.matrix() %>% apply(MARGIN = 2, sum) %>% sort(decreasing = TRUE) %>% head(9) ``` ## Further resources Other useful resources include the CRAN Task View on Natural Language processing (https://cran.r-project.org/web/views/NaturalLanguageProcessing.html), the tm package (https://cran.r-project.org/web/packages/tm/index.html), the tidytext package (https://cran.r-project.org/web/packages/tidytext/index.html).