Corpus Analysis and Visualization

Wouter van Atteveldt
VU-HPC, 2017-11-05

Course Overview

10:30 - 12:00

  • Recap: Frequency Based Analysis and the DTM
  • Dictionary Analysis with AmCAT and R

13:30 - 15:00

  • Simple Natural Language Processing
  • Corpus Analysis and Visualization
  • Topic Modeling and Visualization

15:15 - 17:00

  • Sentiment Analysis with dictionaries
  • Sentiment Analysis with proximity

Simple NLP

  • Preprocess documents to get more information
  • Relatively fast and accurate
    • Lemmatizing
    • Part-of-Speech (POS) tagging
    • Named Entity Recognition
  • Unfortunately, not within R

NLPipe + nlpiper

  • nlpipe: simple NLP processing based on Stanford CoreNLP and other tools
# start a CoreNLP server in docker
docker run --name corenlp -dp 9000:9000 chilland/corenlp-docker

# start nlpipe, linked to the CoreNLP container
docker run --name nlpipe --link corenlp:corenlp -e "CORENLP_HOST=http://corenlp:9000" -dp 5001:5001 vanatteveldt/nlpipe
devtools::install_github("vanatteveldt/nlpiper")
library(nlpiper)
# run the 'test_upper' test module, which upper-cases the text;
# the result is named by the document's hash id
process("test_upper", "test")
0x098f6bcd4621d373cade4e832627b4f6 
                            "TEST" 

CoreNLP POS + lemma + NER

library(nlpiper)
text = "Donald trump was elected president of the United States"
process("corenlp_lemmatize", text, format="csv")

NLPiper and US elections

  • You can lemmatize a set of articles or an AmCAT set directly
  • But that can take a while...
  • Download tokens for US elections:
# choose one (regular, full, or sample token set):
download.file("http://i.amcat.nl/tokens.rds", "tokens.rds")
download.file("http://i.amcat.nl/tokens_full.rds", "tokens.rds")
download.file("http://i.amcat.nl/tokens_sample.rds", "tokens.rds")
meta = readRDS("meta.rds")     # document metadata (downloaded earlier in the course)
tokens = readRDS("tokens.rds")
head(tokens)
          id sentence offset        word       lemma POS POS1 ner
2  160816860        1      2      Tennis      tennis  NN    N   O
3  160816860        1      9 Trailblazer trailblazer  NN    N   O
4  160816860        1     21       Leads        lead VBZ    V   O
6  160816860        1     31      Charge      Charge NNP    R   O
8  160816860        1     42       Women       Women NNP    R   O
10 160816860        1     50      Soccer      Soccer NNP    R   O
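
Since the tokens are a plain data frame, base R works for quick exploration, for example tabulating the simplified POS and named-entity labels (a minimal check on the data loaded above):

table(tokens$POS1)   # distribution of simplified POS tags
table(tokens$ner)    # how often each named-entity label occurs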

Corpus Analysis

  • Exploratory Analysis
  • Term statistics
  • Corpus comparison

The corpustools package

  • Useful functions for corpus analysis
  • Works on a token list rather than a dfm/dtm
    • preserves word order
install.packages("corpustools")

Create TCorpus from tokens

library(corpustools)
tc = tokens_to_tcorpus(tokens, "id", sent_i_col = "sentence")
# subset to nouns only; copy = T returns a new tCorpus instead of modifying in place
tc_nouns = tc$subset(POS1 == "N", copy = T)

# export the tCorpus as a quanteda dfm
dfm = tc$dtm('lemma', form = 'quanteda_dfm')
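
As a quick sanity check on the exported dfm, quanteda's topfeatures lists the most frequent features (a minimal sketch, assuming the conversion above succeeded):

library(quanteda)
topfeatures(dfm, 10)   # the ten most frequent lemmas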

Corpus Comparison

# NYT articles before and after 1 August 2016
pre = subset(meta, medium == "The New York Times" & date < "2016-08-01")
post = subset(meta, medium == "The New York Times" & date >= "2016-08-01")

# tCorpora of adjectives (POS1 == "G") for each period
tc1 = tokens_to_tcorpus(subset(tokens, id %in% pre$id & POS1 == "G"), "id", sent_i_col = "sentence")
tc2 = tokens_to_tcorpus(subset(tokens, id %in% post$id & POS1 == "G"), "id", sent_i_col = "sentence")

cmp = tc1$compare_corpus(tc2, feature = 'lemma')
cmp = plyr::arrange(cmp, -chi2)   # sort by chi-squared, descending
head(cmp)
feature     freq.x freq.y freq       p.x       p.y      ratio     chi2
primary       2562    627 3189 0.0063271 0.0022859  2.7679110 568.5031
presumptive    749      5  754 0.0018499 0.0000186 99.5090224 493.4279
musical       1272    412 1684 0.0031415 0.0015022  2.0912766 176.4415
sexual         364    579  943 0.0008991 0.0021109  0.4259513 174.3048
medical        210    395  605 0.0005188 0.0014402  0.3602565 156.8345
nominating     320     35  355 0.0007905 0.0001279  6.1783342 136.9933
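
Here ratio is simply p.x / p.y: the relative frequency of the lemma in the first (pre-August) corpus divided by that in the second. The chi2 value is the chi-squared statistic of the 2x2 table of a term's frequency versus all other tokens in each corpus. A minimal sketch of that computation for the top term, reconstructing the total token counts from the freq and p columns (the exact value may differ slightly depending on the chi-squared variant corpustools uses):

term = head(cmp, 1)
n.x = round(term$freq.x / term$p.x)   # total tokens in corpus 1
n.y = round(term$freq.y / term$p.y)   # total tokens in corpus 2
tab = matrix(c(term$freq.x, n.x - term$freq.x,
               term$freq.y, n.y - term$freq.y), nrow = 2, byrow = TRUE)
chisq.test(tab, correct = FALSE)$statistic   # should be close to term$chi2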

Visualization

library(quanteda)
# word cloud of the 50 most frequent adjective lemmas in the post-August corpus
dfm = tc2$dtm('lemma', form='quanteda_dfm')
textplot_wordcloud(dfm, max.words = 50, scale = c(4, 0.5))

[Figure: word cloud of the 50 most frequent adjective lemmas]

Beyond (stupid) word clouds

Visualizing comparisons

library(scales)
# hue encodes the direction of the difference (log frequency ratio)
h = rescale(log(cmp$ratio), c(1, .6666))
# saturation and brightness encode the strength of the difference (chi-squared)
s = rescale(sqrt(cmp$chi2), c(.25, 1))
cmp$col = hsv(h, s, .33 + .67*s)
# plot the 75 most distinctive words, sized by chi-squared
with(head(cmp, 75), plot_words(x=log(ratio), words=feature, wordfreq=chi2, random.y = T, col=col, scale=1))
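
In the resulting plot, horizontal position and hue both encode the (log) frequency ratio between the two periods, while saturation, brightness, and word size (via the wordfreq argument) increase with chi-squared, so the most distinctive words stand out most.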