Corpus Analysis and Visualization

Wouter van Atteveldt
Glasgow Text Analysis, 2016-11-17

Course Overview

10:30 - 12:00

  • Recap: Frequency Based Analysis and the DTM
  • Dictionary Analysis with AmCAT and R

13:30 - 15:00

  • Simple Natural Language Processing
  • Corpus Analysis and Visualization
  • Topic Modeling and Visualization

15:15 - 17:00

  • Sentiment Analysis with dictionaries
  • Sentiment Analysis with proximity

Simple NLP

  • Preprocess documents to get more information
  • Relatively fast and accurate
    • Lemmatizing
    • Part-of-Speech (POS) tagging
    • Named Entity Recognition
  • Unfortunately, not within R
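For example, lemmatizing maps inflected forms to their dictionary form ("was" → "be"), POS tagging labels word classes, and NER marks names. A minimal illustration (hypothetical values, using CoreNLP tag conventions, not output from the course data):

data.frame(word  = c("Clinton", "was", "elected"),
           lemma = c("Clinton", "be",  "elect"),
           POS   = c("NNP",     "VBD", "VBN"),
           ner   = c("PERSON",  "O",   "O"))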

NLPipe + nlpiper

  • nlpipe: simple NLP processing based on Stanford CoreNLP and other tools
# start a Stanford CoreNLP server and an nlpipe worker that talks to it
docker run --name corenlp -dp 9000:9000 chilland/corenlp-docker

docker run --name nlpipe --link corenlp:corenlp -e "CORENLP_HOST=http://corenlp:9000" -dp 5001:5001 vanatteveldt/nlpipe
# install the R client and run a trivial test module
devtools::install_github("vanatteveldt/nlpiper")
library(nlpiper)
process("test_upper", "test")
[1] "TEST"

Corenlp POS+lemma+NER

library(nlpiper)
text = "Donald trump was elected president of the United States"
process("corenlp_lemmatize", text, format="csv")

NLPiper and US elections

  • NLPipe, and especially the R library, is very much a work in progress
  • Currently it can only process one document at a time from R
  • Download tokens for US elections:
# choose one:
download.file("http://i.amcat.nl/tokens.rds", "tokens.rds")
download.file("http://i.amcat.nl/tokens_full.rds", "tokens.rds")
download.file("http://i.amcat.nl/tokens_sample.rds", "tokens.rds")
tokens = readRDS("tokens.rds")
head(tokens)
             id sentence offset         word        lemma POS POS1 ner
78757 162317736        1      0          New          New NNP    R   O
78758 162317736        1      4 Technologies Technologies NNP    R   O
78759 162317736        1     17         Give         give  VB    V   O
78760 162317736        1     22   Government   Government NNP    R   O
78761 162317736        1     33        Ample        Ample NNP    R   O
78762 162317736        1     39        Means        Means NNP    R   O

Corpus Analysis

Corpus Analysis

  • Exploratory Analysis
  • Term statistics
  • Corpus comparison

The corpustools package

  • Useful functions for corpus analysis
  • Not fully integrated with quanteda
    • (we're working on it :) )
devtools::install_github("kasperwelbers/corpus-tools")

DTM vs DFM

  • quanteda uses 'dfm' objects
    • document-feature matrix
  • tm uses 'dtm' objects
    • document-term matrix
  • Both are sparse matrices

DTM vs DFM

library(quanteda)
# quanteda: create a document-feature matrix (dfm)
dfm = dfm(c("a text", "and another text"))
# convert it to a tm-style document-term matrix (dtm)
dtm = convert(dfm, "tm")
class(dtm)
[1] "DocumentTermMatrix"    "simple_triplet_matrix"
library(corpustools)
# ... and back to a quanteda dfm
dfm = dtm.to.dfm(dtm)
class(dfm)
[1] "dfmSparse"
attr(,"package")
[1] "quanteda"

Create DTM from tokens

# DTM of all lemmata
dtm = dtm.create(tokens$id, tokens$lemma)
# separate DTMs for proper nouns (POS1 "R"), adjectives (POS1 "G"), and person names (NER)
dtm.names = with(subset(tokens, POS1=="R"), 
                 dtm.create(id, lemma))
dtm.adj = with(subset(tokens, POS1=="G"), 
               dtm.create(id, lemma))
dtm.persons = with(subset(tokens, ner=="PERSON"), 
                   dtm.create(id, lemma))

Term statistics

# term statistics for the person-name DTM, sorted by term frequency
stats = term.statistics(dtm.persons)
stats = plyr::arrange(stats, -termfreq)
head(stats)
   term characters number nonalpha termfreq docfreq reldocfreq      tfidf
  Trump          5  FALSE    FALSE     5237     674  0.6884576 0.12922905
Clinton          7  FALSE    FALSE     3357     582  0.5944842 0.12456663
 Donald          6  FALSE    FALSE     1087     643  0.6567926 0.04239862
  Obama          5  FALSE    FALSE     1082     318  0.3248212 0.17415394
Sanders          7  FALSE    FALSE     1040     190  0.1940756 0.36697430
Hillary          7  FALSE    FALSE      892     496  0.5066394 0.06503282
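As a small variation (a sketch that just reorders the same table), you can rank by tf-idf instead of raw frequency to surface names that are concentrated in fewer documents:

# rank the same term statistics by tf-idf instead of raw term frequency
stats_tfidf = plyr::arrange(stats, -tfidf)
head(stats_tfidf)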

Corpus Comparison

meta = readRDS("meta.rds")
# compare New York Times adjectives before vs. after 1 August 2016
nyt = meta$medium == "The New York Times"
nyt1 = meta$id[nyt & meta$date < "2016-08-01"]
nyt2 = meta$id[nyt & meta$date >= "2016-08-01"]
dtm1 = with(subset(tokens, id %in% nyt1 & POS1=="G"), 
            dtm.create(id, lemma))
dtm2 = with(subset(tokens, id %in% nyt2 & POS1=="G"), 
            dtm.create(id, lemma))
cmp = corpora.compare(dtm1, dtm2)
cmp = plyr::arrange(cmp, -chi)
head(cmp)
       term termfreq.x termfreq.y termfreq relfreq.x relfreq.y       over      chi
presumptive         63          0       63 0.0023287 0.0000606 38.4190954 37.69604
    russian         33         66       99 0.0012371 0.0040611  0.3046290 36.30182
   hispanic         27         57       84 0.0010188 0.0035156  0.2897992 33.34656
  brazilian          0         17       17 0.0000364 0.0010910  0.0333499 28.46516
     racist          0         16       16 0.0000364 0.0010304  0.0353117 26.79011
  incumbent          0         15       15 0.0000364 0.0009698  0.0375186 25.11513
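Reading this table: over is (roughly) the ratio of the smoothed relative frequencies (e.g. for "russian", 0.0012 / 0.0041 ≈ 0.30), so values above 1 mark words typical of the earlier NYT corpus and values below 1 words typical of the later one; chi is the chi-squared value of the frequency difference, used here to sort by distinctiveness.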

Visualization

Visualization

# word cloud of person names; sqrt dampens the dominance of the most frequent names
dtm.wordcloud(dtm.persons, freq.fun = sqrt)

[Plot: word cloud of person names]

Beyond (stupid) word clouds

Visualizing comparisons

# hue encodes the direction of over-representation, saturation/brightness the chi-squared value
h = rescale(log(cmp$over), c(1, .6666))
s = rescale(sqrt(cmp$chi), c(.25,1))
cmp$col = hsv(h, s, .33 + .67*s)
cmp = plyr::arrange(cmp, -chi)
# x position: log over-representation; word size: chi-squared
with(head(cmp, 75), plotWords(x=log(over), words=term, wordfreq=chi, random.y = T, col=col, scale=2))
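In this plot the horizontal position is the log over-representation (one side typical of the earlier period, the other of the later one), word size reflects the chi-squared value, and the colour redundantly encodes both.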

[Plot: NYT adjectives before vs. after 1 August 2016]

Visualizing over time

terms = term.statistics(dtm.persons)

# long format (one row per document-term pair), merged with the document metadata
wordfreqs = dtm.to.df(dtm.persons)
wordfreqs = merge(meta, wordfreqs, by.x="id", by.y="doc")
# average publication date per term
dates = aggregate(wordfreqs["date"], by=wordfreqs["term"], FUN=mean)
terms = merge(terms, dates)
terms = plyr::arrange(terms, -docfreq)
# plot the 50 most widely used names at their average date
with(head(terms, 50), plotWords(words=term, x=date, wordfreq = termfreq))
axis(1)
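Each name is placed at the mean publication date of the documents it occurs in, so names written about mainly early or late in the campaign end up at the corresponding side of the time axis; size reflects overall frequency.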

[Plot: person names placed at their average publication date]

Topic Modeling

Topic Models

# DTM of adjectives, nouns, and verbs (POS1 G/N/V) occurring at least 10 times, minus stopwords
dtm = with(subset(tokens, POS1 %in% c("G", "N", "V")), 
                 dtm.create(id, lemma, minfreq = 10))
dtm = dtm[, !colnames(dtm) %in% tm::stopwords()]
set.seed(1234)
# fit an LDA topic model with 10 topics
m = lda.fit(dtm, K = 10, num.iterations = 50, alpha=0.5)
head(terms(m, 10))
Topic 1 Topic 2 Topic 3 Topic 4   Topic 5  Topic 6 Topic 7    Topic 8      Topic 9    Topic 10
make    say     year    debate    country  tax     show       say          voter      say
city    people  first   media     will     say     art        campaign     win        email
work    get     show    political people   year    include    presidential vote       official
can     woman   play    question  american will    exhibition republican   state      case
take    think   write   use       police   company work       convention   candidate  report
come    can     family  time      gun      job     artist     know         republican year
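Each column lists the highest-probability lemmas for one topic; the one-word labels used in the visualizations below ("city", "woman", …, "email") are manual interpretations of these columns.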

Visualizing Topic Models: LDAvis

install.packages(c("LDAvis", "servr"))
library(LDAvis)
# remove empty documents (row_sums from the slam package); LDAvis needs non-empty documents
dtm = dtm[row_sums(dtm) > 0,]
json = ldavis_json(m, dtm)
LDAvis::serVis(json)

Visualizing Topic Models: heat map

# manual labels for the 10 topics, based on the top terms above
topics = c("city", "woman", "play", "debate", "country", "tax", "art", "campaign", "votes", "email")
# correlations between the topics' word distributions (beta: per-topic log word probabilities)
cm = cor(t(m@beta))
colnames(cm) = rownames(cm) = topics
diag(cm) = 0
heatmap(cm, symm = T)

[Plot: heat map of topic correlations]

Visualizing Topic Models: word clouds

library(reshape2)  # for dcast

compare.topics <- function(m, cmp_topics) {
  # per-token topic assignments from the fitted model (triplet matrix: document, term, assigned topic)
  docs = factor(m@wordassignments$i, labels=m@documents)
  terms = factor(m@wordassignments$j, labels=m@terms)
  assignments = data.frame(doc=docs, term=terms, topic=m@wordassignments$v)
  # count how often each term is assigned to each topic, keep only the two topics to compare
  terms = dcast(assignments, term ~ topic, value.var = "doc", fun.aggregate = length)
  terms = terms[, c(1, cmp_topics+1)]
  terms$freq = rowSums(terms[-1])
  terms = terms[terms$freq > 0,]
  # proportion of assignments going to the first of the two topics, plus a colour encoding it
  terms$prop = terms[[2]] / terms$freq
  terms$col = hsv(rescale(terms$prop, c(1, .6666)), .5, .5)
  terms[order(-terms$freq), ]
}
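The result is one row per term, with the number of times it was assigned to each of the two chosen topics, the total over both, the proportion going to the first topic, and a colour encoding that proportion.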

Visualizing Topic Models: word clouds

terms = compare.topics(m, match(c("tax", "email"), topics))
with(head(terms, 100), plotWords(x=prop, wordfreq = freq, words = term, col=col, xaxt="none", random.y = T, scale=2))
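Words are positioned by the share of their topic assignments going to the first topic ("tax"), so vocabulary typical of the tax topic ends up on one side and email vocabulary on the other; size is the total frequency within the two topics.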

[Plot: word cloud contrasting the "tax" and "email" topics]

Analysing Topic Models

# topic proportions per document, merged with the document metadata
tpd = topics.per.document(m)
colnames(tpd)[-1] = topics
tpd = merge(meta, tpd)
head(tpd)
       id       date             medium       year      month       week      city     woman      play    debate   country       tax       art  campaign     votes     email
162317736 2016-02-01 The New York Times 2016-01-01 2016-02-01 2016-02-01 0.2146172 0.0916473 0.0313225 0.1125290 0.0336427 0.0452436 0.0290023 0.0243619 0.0243619 0.3932715
162317809 2016-01-25 The New York Times 2016-01-01 2016-01-01 2016-01-25 0.1926752 0.0843949 0.0684713 0.1066879 0.0302548 0.1385350 0.0270701 0.0175159 0.0143312 0.3200637
171932192 2016-07-04 The New York Times 2016-01-01 2016-07-01 2016-07-04 0.1936416 0.0028902 0.0838150 0.1011561 0.0722543 0.2341040 0.0202312 0.0028902 0.0260116 0.2630058
171932219 2016-07-03 The New York Times 2016-01-01 2016-07-01 2016-06-27 0.0538793 0.5107759 0.0969828 0.1099138 0.1185345 0.0323276 0.0237069 0.0280172 0.0193966 0.0064655
171932226 2016-07-03 The New York Times 2016-01-01 2016-07-01 2016-06-27 0.0454545 0.1363636 0.0454545 0.2272727 0.2272727 0.0454545 0.0454545 0.0454545 0.0454545 0.1363636
171932227 2016-07-03 The New York Times 2016-01-01 2016-07-01 2016-06-27 0.1089385 0.2597765 0.2877095 0.0418994 0.0083799 0.0027933 0.0586592 0.1480447 0.0363128 0.0474860
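As a sketch of what you can do with this table (assuming the month column is a Date, as it appears above), you can aggregate topic proportions over time, by medium, etc.:

# average topic proportions per month, then plot the trend of the 'email' topic
agg = aggregate(tpd[, topics], by = tpd["month"], FUN = mean)
plot(agg$month, agg$email, type = "l", xlab = "month", ylab = "mean topic proportion")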

Hands-on session II

  • Corpus analysis of the US election campaign
    • (or your own data…)
  • Which words, adjectives, verbs, etc. are frequent?
  • How do they differ over time, by medium, or by subcorpus?
  • What topics can we find?
  • Can we visualize topics, contrasts, etc.?