Text Analysis in R

Wouter van Atteveldt
Session 3: Querying and analysing text

Course Overview

Thursday: Introduction to R

Friday: Corpus Analysis & Topic Modeling

  • Querying Text with AmCAT & R
  • The Document-Term matrix
  • Comparing Corpora
  • Topic Modeling

Saturday: Machine Learning & Sentiment Analysis

Sunday: Semantic Networks & Grammatical Analysis

What is AmCAT

  • Open source text analysis platform
    • Queries, manual annotation
    • API
    • (Working on R plugins…)
  • Developed at VU Amsterdam
  • Free account at http://amcat.nl
    • (or install on your own server)

AmCAT and R

  • AmCAT for
    • Organizing large corpora
    • Central storage and access control
    • Fast search with Elasticsearch
    • Linguistic processing with nlpipe
  • R for flexible analysis
    • Corpus Analysis
    • Semantic network analysis
    • Visualizations
    • Reproducibility

Demo: AmCAT

Connecting to AmCAT from R

  • AmCAT API
    • (Create account at https://amcat.nl)
devtools::install_github("amcat/amcat-r")   # install the amcatr package (once)
library(amcatr)
amcat.save.password("https://amcat.nl", "user", "pwd")   # store credentials locally (once)
conn = amcat.connect("https://amcat.nl")

Querying AmCAT: aggregation

a = amcat.aggregate(conn, "mortgage*", sets=29454, axis1 = "year", axis2="medium")   # hit counts per date per medium
head(a)
count    medium       year     query
    1 The Times 2007-06-01 mortgage*
    2 The Times 2007-09-01 mortgage*
    3 The Times 2007-10-01 mortgage*
    1 The Times 2007-12-01 mortgage*
    1 The Times 2008-02-01 mortgage*
    1 The Times 2008-03-01 mortgage*

Querying AmCAT: raw counts

h = amcat.hits(conn, "mortgage*", sets=29454)   # one row per matching article
head(h)
count       id     query
    1 21794967 mortgage*
    1 21795537 mortgage*
    1 21795699 mortgage*
    1 21796592 mortgage*
    1 21798565 mortgage*
    1 21798673 mortgage*

Merging with metadata

meta = amcat.getarticlemeta(conn, 41, 29454, dateparts = TRUE)   # dateparts adds year/month/week columns
h = merge(meta, h)
peryear = aggregate(h["count"], h[c("year")], sum)   # total hits per year
library(ggplot2)
ggplot(peryear, aes(x=year, y=count)) + geom_line()

[Plot: line graph of mortgage* hit counts per year]

Uploading text to AmCAT

library(twitteR)   # requires authenticated Twitter API credentials
tweets = searchTwitteR("#bigdata", resultType="recent", n = 100)
tweets = plyr::ldply(tweets, as.data.frame)   # list of status objects -> data.frame
set = amcat.upload.articles(conn, project=1,
  articleset="twitter test", medium="twitter",
  text=tweets$text, headline=tweets$text,
  date=tweets$created, author=tweets$screenName)
head(amcat.getarticlemeta(conn, 1, set, columns=c('date', 'headline')))
       id       date headline
167538700 2016-06-02 RT @DKMatai: When #Blockchain hype in #FinTech #InsurTech normalises what will matter is #Risk #BigData #Analytics #Cognition #AI #DeepLear…
167538729 2016-06-02 RT @KirkDBorne: Yes, I focused briefly on the myths of “Small Data” here: https://t.co/FMsEhDTCBY CC: @JenniferChan7 @NetHope_org #BigDa…
167538736 2016-06-02 Video: The Scalable Modular Server DX2000 from @NEC powered by Intel offered #bigdata performance & compute density:https://t.co/74KK5SZFBv
167538659 2016-06-02 RT @StuJoanne2: RT @botbigdata “Google rolls out new features for BigQuery #bigdata” https://t.co/rZJQSFGxDa
167538666 2016-06-02 RT @bigdataconf: How Retailers Can Harness the True Potential of #BigData https://t.co/oZL1VELgSv #Analytics #Datascience #Hadoop #spark #N…
167538673 2016-06-02 RT @KirkDBorne: .@alex_woodie And here comes the #DataLake = the best thing since sliced (and diced) data: https://t.co/qazv1GoVGg #BigData…

Saving selection as article set

h = amcat.hits(conn, "data*", sets=set)
set2 = amcat.add.articles.to.set(conn, project=1, articles=h$id,
  articleset.name="Visualization", articleset.provenance="From R")
head(amcat.getarticlemeta(conn, 1, set2, columns=c('date', 'headline')))
       id       date headline
167538729 2016-06-02 RT @KirkDBorne: Yes, I focused briefly on the myths of “Small Data” here: https://t.co/FMsEhDTCBY CC: @JenniferChan7 @NetHope_org #BigDa…
167538666 2016-06-02 RT @bigdataconf: How Retailers Can Harness the True Potential of #BigData https://t.co/oZL1VELgSv #Analytics #Datascience #Hadoop #spark #N…
167538673 2016-06-02 RT @KirkDBorne: .@alex_woodie And here comes the #DataLake = the best thing since sliced (and diced) data: https://t.co/qazv1GoVGg #BigData…
167538680 2016-06-02 The latest A World of Data! https://t.co/sA2cbRDj9F Thanks to @mohammadamiri22 @OSCE_RFoM @amyewalter #bigdata #data
167538744 2016-06-02 Šta je Big Data? https://t.co/yiYni0y3tU #BigData
167538681 2016-06-02 RT @mthtechnews: RT @zabackj AOL Debuts a Startup Incubator to Avoid Becoming a Dinosaur https://t.co/kUHJ9CWwFT #database #CRM #BigData #t…

Interactive session 3a

Connecting to AmCAT

Course Overview

Thursday: Introduction to R

Friday: Corpus Analysis & Topic Modeling

  • Querying Text with AmCAT & R
  • The Document-Term matrix
  • Comparing Corpora
  • Topic Modeling

Saturday: Machine Learning & Sentiment Analysis

Sunday: Semantic Networks & Grammatical Analysis

Document-Term Matrix

  • Representation of word frequencies
    • Rows: Documents
    • Columns: Terms (words)
    • Cells: Frequency
  • Stored as a 'sparse' matrix
    • only non-zero values are stored
    • usually, >99% of cells are zero (see the sketch below)
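
A minimal sketch of that sparsity, assuming the tm package and its bundled 20-document crude example corpus (neither is used elsewhere in this course):

library(tm)                          # tm provides DocumentTermMatrix
data(crude)                          # 20 Reuters articles shipped with tm
dtm = DocumentTermMatrix(crude)
dim(dtm)                             # documents x terms
# only the non-zero cells are stored (as i, j, v triplets);
# the fraction of zero cells is
1 - length(dtm$v) / prod(dim(dtm))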

Document-Term Matrix

library(RTextTools)
m = create_matrix(c("I love data", "John loves data!"))
as.matrix(m)
    Terms
Docs data john love loves
   1    1    0    1     0
   2    1    1    0     1

Simple corpus analysis

library(corpustools)
head(term.statistics(m))
      term  characters number nonalpha termfreq docfreq reldocfreq     tfidf
data  data           4  FALSE    FALSE        2       2        1.0 0.0000000
john  john           4  FALSE    FALSE        1       1        0.5 0.3333333
love  love           4  FALSE    FALSE        1       1        0.5 0.5000000
loves loves          5  FALSE    FALSE        1       1        0.5 0.3333333

Preprocessing

  • Text contains a lot of noise:
    • Stop words (the, a, I, will)
    • Conjugations (love, loves)
    • Non-word terms (33$, !)
  • Simple preprocessing, e.g. in RTextTools (see the sketch below)
    • stemming
    • stop word removal
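
A minimal sketch of both steps, reusing the create_matrix example from above (removeStopwords and stemWords are arguments of RTextTools::create_matrix):

library(RTextTools)
m = create_matrix(c("I love data", "John loves data!"),
                  language="english", removeStopwords=TRUE, stemWords=TRUE)
as.matrix(m)   # 'loves' is stemmed to 'love'; stop words such as 'I' are dropped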

Linguistic Preprocessing

  • Lemmatizing (illustrated in the sketch below)
  • Part-of-Speech tagging
  • Coreference resolution
  • Disambiguation
  • Syntactic parsing
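
As a quick illustration of what lemmatizing buys you, a sketch using the sotu.tokens data from corpustools, introduced on the next slide:

library(corpustools)
data(sotu)
length(unique(sotu.tokens$word))    # distinct surface forms
length(unique(sotu.tokens$lemma))   # distinct lemmata: conjugations collapse, so this is smaller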

Tokens

  • One word per line (CONLL)
  • Linguistic information
data(sotu)
head(sotu.tokens)
      word sentence  pos      lemma offset       aid id pos1 freq
        It        1  PRP         it      0 111541965  1    O    1
        is        1  VBZ         be      3 111541965  2    V    1
       our        1 PRP$         we      6 111541965  3    O    1
unfinished        1   JJ unfinished     10 111541965  4    A    1
      task        1   NN       task     21 111541965  5    N    1
        to        1   TO         to     26 111541965  6    ?    1

Getting tokens from AmCAT

tokens = amcat.gettokens(conn, project=1, articleset=set)   # raw word tokens
tokens = amcat.gettokens(conn, project=1, articleset=set, module="corenlp_lemmatize")   # lemmata + POS via CoreNLP

DTM from Tokens

# build a dtm from the lemmata of proper names (pos1 == "M")
dtm = with(subset(sotu.tokens, pos1=="M"),
           dtm.create(aid, lemma))
dtm.wordcloud(dtm)

[Plot: word cloud of the most frequent proper-name lemmata]

Corpus Statistics

stats = term.statistics(dtm)
stats = plyr::arrange(stats, -termfreq)   # sort by descending term frequency
head(stats)
     term characters number nonalpha termfreq docfreq reldocfreq     tfidf
  America          7  FALSE    FALSE      409     346  0.3940774 0.6883991
Americans          9  FALSE    FALSE      179     158  0.1799544 1.4280099
 Congress          8  FALSE    FALSE      168     149  0.1697039 1.1398894
     Iraq          4  FALSE    FALSE      109      65  0.0740319 1.4157528
   States          6  FALSE    FALSE       99      89  0.1013667 0.9573274
   United          6  FALSE    FALSE       88      82  0.0933941 0.7817946

Interactive session 3b

Corpus Analysis

Hands-on session 3

Break

Handouts:

  • Text analysis with R and AmCAT
  • Corpus Analysis

Mini-project:

  • Upload your data to AmCAT, query,
  • Create a DTM, view term statistics, wordcloud