Text Analysis in R

Wouter van Atteveldt
Session 3: Querying and analysing text

Course Overview

Thursday: Introduction to R

Friday: Corpus Analysis & Topic Modeling

  • Querying Text with AmCAT & R
  • The Document-Term matrix
  • Comparing Corpora
  • Topic Modeling

Saturday: Machine Learning & Sentiment Analysis

Sunday: Semantic Networks & Grammatical Analysis

What is AmCAT

  • Open source text analysis platform
    • Queries, manual annotation
    • API
    • (Working on R plugins…)
  • Developed at VU Amsterdam
  • Free account at http://amcat.nl
    • (or install on your own server)

AmCAT and R

  • AmCAT for
    • Organizing large corpora
    • Central storage and access control
    • Fast search with Elasticsearch
    • Linguistic processing with nlpipe
  • R for flexible analysis
    • Corpus Analysis
    • Semantic network analysis
    • Visualizations
    • Reproducibility

Demo: AmCAT

Connecting to AmCAT from R

  • AmCAT API
    • (Create account at https://amcat.nl)
devtools::install_github("amcat/amcat-r")   # install the amcatr package (once)
library(amcatr)
amcat.save.password("https://amcat.nl", "user", "pwd")   # store credentials locally (once)
conn = amcat.connect("https://amcat.nl")

Querying AmCAT: aggregation

a = amcat.aggregate(conn, "mortgage*", sets=29454, axis1 = "year", axis2="medium")   # hit counts per date per medium
head(a)
count    medium       year     query
    1 The Times 2007-06-01 mortgage*
    2 The Times 2007-09-01 mortgage*
    3 The Times 2007-10-01 mortgage*
    1 The Times 2007-12-01 mortgage*
    1 The Times 2008-02-01 mortgage*
    1 The Times 2008-03-01 mortgage*

Querying AmCAT: raw counts

h = amcat.hits(conn, "mortgage*", sets=29454)   # one row per matching article
head(h)
count       id     query
    1 21794967 mortgage*
    1 21795537 mortgage*
    1 21795699 mortgage*
    1 21796592 mortgage*
    1 21798565 mortgage*
    1 21798673 mortgage*

Merging with metadata

meta = amcat.getarticlemeta(conn, 41, 29454, dateparts = TRUE)   # dateparts adds year/month/week columns
h = merge(meta, h)
peryear = aggregate(h["count"], h[c("year")], sum)   # total hits per year
library(ggplot2)
ggplot(peryear, aes(x=year, y=count)) + geom_line()

[Plot: line graph of mortgage* hit counts per year]

Uploading text to AmCAT

library(twitteR)   # requires authenticated Twitter API credentials
tweets = searchTwitteR("#bigdata", resultType="recent", n = 100)
tweets = plyr::ldply(tweets, as.data.frame)   # list of status objects -> data.frame
set = amcat.upload.articles(conn, project=1,
  articleset="twitter test", medium="twitter",
  text=tweets$text, headline=tweets$text,
  date=tweets$created, author=tweets$screenName)
head(amcat.getarticlemeta(conn, 1, set, columns=c('date', 'headline')))
       id       date headline
167538700 2016-06-02 RT @DKMatai: When #Blockchain hype in #FinTech #InsurTech normalises what will matter is #Risk #BigData #Analytics #Cognition #AI #DeepLear…
167538729 2016-06-02 RT @KirkDBorne: Yes, I focused briefly on the myths of “Small Data” here: https://t.co/FMsEhDTCBY CC: @JenniferChan7 @NetHope_org #BigDa…
167538736 2016-06-02 Video: The Scalable Modular Server DX2000 from @NEC powered by Intel offered #bigdata performance & compute density:https://t.co/74KK5SZFBv
167538659 2016-06-02 RT @StuJoanne2: RT @botbigdata “Google rolls out new features for BigQuery #bigdata” https://t.co/rZJQSFGxDa
167538666 2016-06-02 RT @bigdataconf: How Retailers Can Harness the True Potential of #BigData https://t.co/oZL1VELgSv #Analytics #Datascience #Hadoop #spark #N…
167538673 2016-06-02 RT @KirkDBorne: .@alex_woodie And here comes the #DataLake = the best thing since sliced (and diced) data: https://t.co/qazv1GoVGg #BigData…

Saving selection as article set

h = amcat.hits(conn, "data*", sets=set)
set2 = amcat.add.articles.to.set(conn, project=1, articles=h$id,
  articleset.name="Visualization", articleset.provenance="From R")
head(amcat.getarticlemeta(conn, 1, set2, columns=c('date', 'headline')))
       id       date headline
167538729 2016-06-02 RT @KirkDBorne: Yes, I focused briefly on the myths of “Small Data” here: https://t.co/FMsEhDTCBY CC: @JenniferChan7 @NetHope_org #BigDa…
167538666 2016-06-02 RT @bigdataconf: How Retailers Can Harness the True Potential of #BigData https://t.co/oZL1VELgSv #Analytics #Datascience #Hadoop #spark #N…
167538673 2016-06-02 RT @KirkDBorne: .@alex_woodie And here comes the #DataLake = the best thing since sliced (and diced) data: https://t.co/qazv1GoVGg #BigData…
167538680 2016-06-02 The latest A World of Data! https://t.co/sA2cbRDj9F Thanks to @mohammadamiri22 @OSCE_RFoM @amyewalter #bigdata #data
167538744 2016-06-02 Šta je Big Data? https://t.co/yiYni0y3tU #BigData
167538681 2016-06-02 RT @mthtechnews: RT @zabackj AOL Debuts a Startup Incubator to Avoid Becoming a Dinosaur https://t.co/kUHJ9CWwFT #database #CRM #BigData #t…

Interactive session 3a

Connecting to AmCAT

Course Overview

Thursday: Introduction to R

Friday: Corpus Analysis & Topic Modeling

  • Querying Text with AmCAT & R
  • The Document-Term matrix
  • Comparing Corpora
  • Topic Modeling

Saturday: Machine Learning & Sentiment Analysis

Sunday: Semantic Networks & Grammatical Analysis

Document-Term Matrix

  • Representation of word frequencies
    • Rows: Documents
    • Columns: Terms (words)
    • Cells: Frequency
  • Stored as a 'sparse' matrix
    • only non-zero values are stored
    • usually, >99% of cells are zero (see the sketch below)
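
A minimal sketch of that sparsity, assuming the tm package and its bundled 20-document crude example corpus (neither is used elsewhere in this course):

library(tm)                          # tm provides DocumentTermMatrix
data(crude)                          # 20 Reuters articles shipped with tm
dtm = DocumentTermMatrix(crude)
dim(dtm)                             # documents x terms
# only the non-zero cells are stored (as i, j, v triplets);
# the fraction of zero cells is
1 - length(dtm$v) / prod(dim(dtm))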

Document-Term Matrix

library(RTextTools)
m = create_matrix(c("I love data", "John loves data!"))
as.matrix(m)
    Terms
Docs data john love loves
   1    1    0    1     0
   2    1    1    0     1

Simple corpus analysis

library(corpustools)
head(term.statistics(m))
      term  characters number nonalpha termfreq docfreq reldocfreq     tfidf
data  data           4  FALSE    FALSE        2       2        1.0 0.0000000
john  john           4  FALSE    FALSE        1       1        0.5 0.3333333
love  love           4  FALSE    FALSE        1       1        0.5 0.5000000
loves loves          5  FALSE    FALSE        1       1        0.5 0.3333333

Preprocessing

  • Text contains a lot of noise:
    • Stop words (the, a, I, will)
    • Conjugations (love, loves)
    • Non-word terms (33$, !)
  • Simple preprocessing, e.g. in RTextTools (see the sketch below)
    • stemming
    • stop word removal
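
A minimal sketch of both steps, reusing the create_matrix example from above (removeStopwords and stemWords are arguments of RTextTools::create_matrix):

library(RTextTools)
m = create_matrix(c("I love data", "John loves data!"),
                  language="english", removeStopwords=TRUE, stemWords=TRUE)
as.matrix(m)   # 'loves' is stemmed to 'love'; stop words such as 'I' are dropped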

Linguistic Preprocessing

  • Lemmatizing (illustrated in the sketch below)
  • Part-of-Speech tagging
  • Coreference resolution
  • Disambiguation
  • Syntactic parsing
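
As a quick illustration of what lemmatizing buys you, a sketch using the sotu.tokens data from corpustools, introduced on the next slide:

library(corpustools)
data(sotu)
length(unique(sotu.tokens$word))    # distinct surface forms
length(unique(sotu.tokens$lemma))   # distinct lemmata: conjugations collapse, so this is smaller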

Tokens

  • One word per line (CONLL)
  • Linguistic information
data(sotu)
head(sotu.tokens)
      word sentence  pos      lemma offset       aid id pos1 freq
        It        1  PRP         it      0 111541965  1    O    1
        is        1  VBZ         be      3 111541965  2    V    1
       our        1 PRP$         we      6 111541965  3    O    1
unfinished        1   JJ unfinished     10 111541965  4    A    1
      task        1   NN       task     21 111541965  5    N    1
        to        1   TO         to     26 111541965  6    ?    1

Getting tokens from AmCAT

tokens = amcat.gettokens(conn, project=1, articleset=set)   # raw word tokens
tokens = amcat.gettokens(conn, project=1, articleset=set, module="corenlp_lemmatize")   # lemmata + POS via CoreNLP

DTM from Tokens

# build a dtm from the lemmata of proper names (pos1 == "M")
dtm = with(subset(sotu.tokens, pos1=="M"),
           dtm.create(aid, lemma))
dtm.wordcloud(dtm)

[Plot: word cloud of the most frequent proper-name lemmata]

Corpus Statistics

stats = term.statistics(dtm)
stats = plyr::arrange(stats, -termfreq)   # sort by descending term frequency
head(stats)
     term characters number nonalpha termfreq docfreq reldocfreq     tfidf
  America          7  FALSE    FALSE      409     346  0.3940774 0.6883991
Americans          9  FALSE    FALSE      179     158  0.1799544 1.4280099
 Congress          8  FALSE    FALSE      168     149  0.1697039 1.1398894
     Iraq          4  FALSE    FALSE      109      65  0.0740319 1.4157528
   States          6  FALSE    FALSE       99      89  0.1013667 0.9573274
   United          6  FALSE    FALSE       88      82  0.0933941 0.7817946

Interactive session 3b

Corpus Analysis

Hands-on session 3

Break

Handouts:

  • Text analysis with R and AmCAT
  • Corpus Analysis

Mini-project:

  • Upload your data to AmCAT, query,
  • Create a DTM, view term statistics, wordcloud