Corpus Analysis and Visualization

Wouter van Atteveldt
Glasgow Text Analysis, 2016-11-17

Course Overview

10:30 - 12:00

  • Recap: Frequency-Based Analysis and the DTM
  • Dictionary Analysis with AmCAT and R

13:30 - 15:00

  • Simple Natural Language Processing
  • Corpus Analysis and Visualization
  • Topic Modeling and Visualization

15:15 - 17:00

  • Sentiment Analysis with dictionaries
  • Sentiment Analysis with proximity

Simple NLP

  • Preprocess documents to get more information
  • Relatively fast and accurate
    • Lemmatizing
    • Part-of-Speech (POS) tagging
    • Named Entity Recognition
  • Unfortunately, not within R
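
For example, lemmatizing maps "was" to "be", POS tagging labels it as a past-tense verb (VBD), and named entity recognition tags "Trump" as a PERSON.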

NLPipe + nlpiper

  • nlpipe: simple NLP processing based on Stanford CoreNLP and other tools
docker run --name corenlp -dp 9000:9000 chilland/corenlp-docker

docker run --name nlpipe --link corenlp:corenlp -e "CORENLP_HOST=http://corenlp:9000" -dp 5001:5001 vanatteveldt/nlpipe
devtools::install_github("vanatteveldt/nlpiper")
library(nlpiper)
process("test_upper", "test")
[1] "TEST"

CoreNLP POS+lemma+NER

library(nlpiper)
text = "Donald Trump was elected president of the United States"
process("corenlp_lemmatize", text, format="csv")

NLPiper and US elections

  • NLPipe and especially the R library are very much a work in progress
  • Can only process one document at a time from R
  • Download tokens for US elections:
# choose one:
download.file("http://i.amcat.nl/tokens.rds", "tokens.rds")
download.file("http://i.amcat.nl/tokens_full.rds", "tokens.rds")
download.file("http://i.amcat.nl/tokens_sample.rds", "tokens.rds")
tokens = readRDS("tokens.rds")
head(tokens)
             id sentence offset         word        lemma POS POS1 ner
78757 162317736        1      0          New          New NNP    R   O
78758 162317736        1      4 Technologies Technologies NNP    R   O
78759 162317736        1     17         Give         give  VB    V   O
78760 162317736        1     22   Government   Government NNP    R   O
78761 162317736        1     33        Ample        Ample NNP    R   O
78762 162317736        1     39        Means        Means NNP    R   O
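
With the tokens loaded, base R suffices for a first look at the annotations:

# distribution of the simplified part-of-speech tags
table(tokens$POS1)
# most frequent lemmas tagged as persons
head(sort(table(tokens$lemma[tokens$ner == "PERSON"]), decreasing=TRUE))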

Corpus Analysis

Corpus Analysis

  • Exploratory Analysis
  • Term statistics
  • Corpus comparison

The corpustools package

  • Useful functions for corpus analysis
  • Not fully integrated with quanteda
    • (we're working on it :) )
devtools::install_github("kasperwelbers/corpus-tools")

DTM vs DFM

  • quanteda uses 'dfm' objects
    • document-feature matrix
  • tm uses 'dtm' objects
    • document-term matrix
  • Both are sparse matrices

DTM vs DFM

library(quanteda)
# create a quanteda dfm, then convert it to a tm-style dtm
dfm = dfm(c("a text", "and another text"))
dtm = convert(dfm, "tm")
class(dtm)
[1] "DocumentTermMatrix"    "simple_triplet_matrix"
library(corpustools)
# ...and convert back from dtm to dfm
dfm = dtm.to.dfm(dtm)
class(dfm)
[1] "dfmSparse"
attr(,"package")
[1] "quanteda"

Create DTM from tokens

# DTM of all lemmas
dtm = dtm.create(tokens$id, tokens$lemma)
# proper nouns only (POS1 == "R")
dtm.names = with(subset(tokens, POS1=="R"), 
                 dtm.create(id, lemma))
# adjectives only (POS1 == "G")
dtm.adj = with(subset(tokens, POS1=="G"), 
               dtm.create(id, lemma))
# lemmas tagged PERSON by named entity recognition
dtm.persons = with(subset(tokens, ner=="PERSON"), 
                   dtm.create(id, lemma))
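
A quick sanity check on the resulting matrices; this assumes dtm.create returns tm-style DocumentTermMatrix objects, as the convert() example above suggests:

dim(dtm)          # number of documents x number of terms
dim(dtm.persons)
tm::inspect(dtm.persons[1:5, 1:5])  # peek at a corner of the matrix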

Term statistics

stats = term.statistics(dtm.persons)
# sort by term frequency, most frequent first
stats = plyr::arrange(stats, -termfreq)
head(stats)
   term characters number nonalpha termfreq docfreq reldocfreq      tfidf
  Trump          5  FALSE    FALSE     5237     674  0.6884576 0.12922905
Clinton          7  FALSE    FALSE     3357     582  0.5944842 0.12456663
 Donald          6  FALSE    FALSE     1087     643  0.6567926 0.04239862
  Obama          5  FALSE    FALSE     1082     318  0.3248212 0.17415394
Sanders          7  FALSE    FALSE     1040     190  0.1940756 0.36697430
Hillary          7  FALSE    FALSE      892     496  0.5066394 0.06503282
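
Note that reldocfreq is simply docfreq divided by the number of documents (674 / 0.6884576 ≈ 979 documents here), and tfidf is the term's aggregate tf-idf score; see ?term.statistics for the exact definitions.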

Corpus Comparison

# meta.rds contains per-document metadata (id, medium, date)
meta = readRDS("meta.rds")
nyt = meta$medium == "The New York Times"
# split NYT articles into before / from August 2016
nyt1 = meta$id[nyt & meta$date < "2016-08-01"]
nyt2 = meta$id[nyt & meta$date >= "2016-08-01"]
# adjective DTMs for each period
dtm1 = with(subset(tokens, id %in% nyt1 & POS1=="G"), 
            dtm.create(id, lemma))
dtm2 = with(subset(tokens, id %in% nyt2 & POS1=="G"), 
            dtm.create(id, lemma))
cmp = corpora.compare(dtm1, dtm2)
cmp = plyr::arrange(cmp, -chi)
head(cmp)
       term termfreq.x termfreq.y termfreq relfreq.x relfreq.y       over      chi
presumptive         63          0       63 0.0023287 0.0000606 38.4190954 37.69604
    russian         33         66       99 0.0012371 0.0040611  0.3046290 36.30182
   hispanic         27         57       84 0.0010188 0.0035156  0.2897992 33.34656
  brazilian          0         17       17 0.0000364 0.0010910  0.0333499 28.46516
     racist          0         16       16 0.0000364 0.0010304  0.0353117 26.79011
  incumbent          0         15       15 0.0000364 0.0009698  0.0375186 25.11513
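
The over column is the ratio of the two relative frequencies, and chi the chi-squared value of the term's 2x2 frequency table. The relative frequencies include a smoothing constant, which is why terms with a raw frequency of 0 (e.g. brazilian) still have a nonzero relfreq; as a quick check, the ratio reproduces over:

head(cmp$relfreq.x / cmp$relfreq.y)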

Visualization

Visualization

dtm.wordcloud(dtm.persons, freq.fun = sqrt)

[plot: word cloud of the most frequent person names]
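
The freq.fun = sqrt argument dampens the frequency differences so the most frequent names (Trump, Clinton) do not completely drown out the rest of the cloud.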

Beyond (stupid) word clouds

Visualizing comparisons

library(scales)   # for rescale()
# hue: direction of over-representation, from log(over)
h = rescale(log(cmp$over), c(1, .6666))
# saturation and brightness: distinctiveness (chi-squared)
s = rescale(sqrt(cmp$chi), c(.25, 1))
cmp$col = hsv(h, s, .33 + .67*s)
cmp = plyr::arrange(cmp, -chi)
with(head(cmp, 75), 
     plotWords(x=log(over), words=term, wordfreq=chi, 
               random.y=T, col=col, scale=2))
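
In this scheme, hue encodes direction: since over = relfreq.x / relfreq.y, adjectives typical of the pre-August NYT coverage map to blue and those of the later months to red, while saturation and brightness grow with chi, so the most distinctive terms of either period are the most vivid.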