Wouter van Atteveldt
Session 3: Querying and analysing text
Thursday: Introduction to R
Friday: Corpus Analysis & Topic Modeling
Saturday: Machine Learning & Sentiment Analysis
Sunday: Semantic Networks & Grammatical Analysis
http://amcat.nl
nlpipe
https://amcat.nl
devtools::install_github("amcat/amcat-r")
amcat.save.password("https://amcat.nl", "user", "pwd")
library(amcatr)
conn = amcat.connect("https://amcat.nl")
a = amcat.aggregate(conn, "mortgage*", sets=29454, axis1 = "year", axis2="medium")
head(a)
count | medium | year | query |
---|---|---|---|
1 | The Times | 2007-06-01 | mortgage* |
2 | The Times | 2007-09-01 | mortgage* |
3 | The Times | 2007-10-01 | mortgage* |
1 | The Times | 2007-12-01 | mortgage* |
1 | The Times | 2008-02-01 | mortgage* |
1 | The Times | 2008-03-01 | mortgage* |
h = amcat.hits(conn, "mortgage*", sets=29454)
head(h)
count | id | query |
---|---|---|
1 | 21794967 | mortgage* |
1 | 21795537 | mortgage* |
1 | 21795699 | mortgage* |
1 | 21796592 | mortgage* |
1 | 21798565 | mortgage* |
1 | 21798673 | mortgage* |
meta = amcat.getarticlemeta(conn, 41, 29454, dateparts = TRUE)
h = merge(meta, h)
peryear = aggregate(h["count"], h[c("year")], sum)
library(ggplot2)
ggplot(peryear, aes(x=year, y=count)) + geom_line()
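The merge-and-aggregate step can be tried without an AmCAT connection. The sketch below uses made-up toy data frames (`hits` and `meta` are hypothetical stand-ins for the output of `amcat.hits` and `amcat.getarticlemeta`, and the values are invented) to show what `merge` and `aggregate` do here.

```r
# Toy stand-ins for the hits and metadata tables (made-up values)
hits <- data.frame(id = c(1, 2, 3, 4), count = c(1, 2, 1, 3))
meta <- data.frame(id = c(1, 2, 3, 4, 5),
                   year = c(2007, 2007, 2008, 2008, 2008))

# merge() joins on the shared column "id"; article 5 has no hits and is dropped
merged <- merge(meta, hits)

# Sum the hit counts within each year
peryear <- aggregate(merged["count"], merged["year"], sum)
peryear
```

Because `merge` defaults to an inner join, articles without hits disappear from the result; passing `all.x = TRUE` would keep them with `NA` counts.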
library(twitteR)  # provides searchTwitteR(); assumes Twitter API authentication is already set up
tweets = searchTwitteR("#bigdata", resultType="recent", n = 100)
tweets = plyr::ldply(tweets, as.data.frame)
set = amcat.upload.articles(conn, project=1,
articleset="twitter test", medium="twitter",
text=tweets$text, headline=tweets$text,
date=tweets$created, author=tweets$screenName)
head(amcat.getarticlemeta(conn, 1, set, columns=c('date', 'headline')))
id | date | headline |
---|---|---|
167538700 | 2016-06-02 | RT @DKMatai: When #Blockchain hype in #FinTech #InsurTech normalises what will matter is #Risk #BigData #Analytics #Cognition #AI #DeepLear… |
167538729 | 2016-06-02 | RT @KirkDBorne: Yes, I focused briefly on the myths of “Small Data” here: https://t.co/FMsEhDTCBY CC: @JenniferChan7 @NetHope_org #BigDa… |
167538736 | 2016-06-02 | Video: The Scalable Modular Server DX2000 from @NEC powered by Intel offered #bigdata performance & compute density:https://t.co/74KK5SZFBv |
167538659 | 2016-06-02 | RT @StuJoanne2: RT @botbigdata “Google rolls out new features for BigQuery #bigdata” https://t.co/rZJQSFGxDa |
167538666 | 2016-06-02 | RT @bigdataconf: How Retailers Can Harness the True Potential of #BigData https://t.co/oZL1VELgSv #Analytics #Datascience #Hadoop #spark #N… |
167538673 | 2016-06-02 | RT @KirkDBorne: .@alex_woodie And here comes the #DataLake = the best thing since sliced (and diced) data: https://t.co/qazv1GoVGg #BigData… |
h = amcat.hits(conn, "data*", sets=set)
set2 = amcat.add.articles.to.set(conn, project=1, articles=h$id,
articleset.name="Visualization", articleset.provenance="From R")
head(amcat.getarticlemeta(conn, 1, set2, columns=c('date', 'headline')))
id | date | headline |
---|---|---|
167538729 | 2016-06-02 | RT @KirkDBorne: Yes, I focused briefly on the myths of “Small Data” here: https://t.co/FMsEhDTCBY CC: @JenniferChan7 @NetHope_org #BigDa… |
167538666 | 2016-06-02 | RT @bigdataconf: How Retailers Can Harness the True Potential of #BigData https://t.co/oZL1VELgSv #Analytics #Datascience #Hadoop #spark #N… |
167538673 | 2016-06-02 | RT @KirkDBorne: .@alex_woodie And here comes the #DataLake = the best thing since sliced (and diced) data: https://t.co/qazv1GoVGg #BigData… |
167538680 | 2016-06-02 | The latest A World of Data! https://t.co/sA2cbRDj9F Thanks to @mohammadamiri22 @OSCE_RFoM @amyewalter #bigdata #data |
167538744 | 2016-06-02 | Šta je Big Data? https://t.co/yiYni0y3tU #BigData |
167538681 | 2016-06-02 | RT @mthtechnews: RT @zabackj AOL Debuts a Startup Incubator to Avoid Becoming a Dinosaur https://t.co/kUHJ9CWwFT #database #CRM #BigData #t… |
Connecting to AmCAT
library(RTextTools)
m = create_matrix(c("I love data", "John loves data!"))
as.matrix(m)
Terms
Docs data john love loves
1 1 0 1 0
2 1 1 0 1
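To see what a document-term matrix contains, the same two sentences can be counted by hand in base R. This is only an illustrative sketch, not how `create_matrix` is implemented: the `tokenize` helper is hypothetical, and dropping words shorter than three characters is an assumption made here so that "I" is excluded as in the output above.

```r
docs <- c("I love data", "John loves data!")

# Hypothetical helper: lowercase, strip punctuation, split on whitespace
tokenize <- function(x) {
  x <- tolower(gsub("[[:punct:]]", "", x))
  words <- strsplit(x, "\\s+")[[1]]
  words[nchar(words) >= 3]  # assumption: drop very short words such as "i"
}

tokens <- lapply(docs, tokenize)
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document: one row per document, one column per term
counts <- vapply(tokens,
                 function(w) as.integer(table(factor(w, levels = vocab))),
                 integer(length(vocab)))
dtm <- t(counts)
dimnames(dtm) <- list(Docs = as.character(seq_along(docs)), Terms = vocab)
dtm
```

Note that "love" and "loves" end up in separate columns, which is why stemming or lemmatization is usually applied before counting.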
library(corpustools)
head(term.statistics(m))
term | characters | number | nonalpha | termfreq | docfreq | reldocfreq | tfidf |
---|---|---|---|---|---|---|---|
data | 4 | FALSE | FALSE | 2 | 2 | 1.0 | 0.0000000 |
john | 4 | FALSE | FALSE | 1 | 1 | 0.5 | 0.3333333 |
love | 4 | FALSE | FALSE | 1 | 1 | 0.5 | 0.5000000 |
loves | 5 | FALSE | FALSE | 1 | 1 | 0.5 | 0.3333333 |
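The frequency columns in this table can be reproduced from a plain matrix in base R. The sketch below rebuilds the two-sentence matrix by hand and computes `termfreq`, `docfreq` and `reldocfreq`; the `tfidf` column is left out because the exact weighting used by `term.statistics` is not given here.

```r
# Document-term matrix for "I love data" and "John loves data!", typed in by hand
m <- matrix(c(1, 0, 1, 0,
              1, 1, 0, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(Docs = c("1", "2"),
                            Terms = c("data", "john", "love", "loves")))

stats <- data.frame(
  term = colnames(m),
  characters = nchar(colnames(m)),       # length of each term
  termfreq = colSums(m),                 # total occurrences across all documents
  docfreq = colSums(m > 0),              # number of documents containing the term
  reldocfreq = colSums(m > 0) / nrow(m), # share of documents containing the term
  row.names = NULL
)
stats
```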
RTextTools
data(sotu)
head(sotu.tokens)
word | sentence | pos | lemma | offset | aid | id | pos1 | freq |
---|---|---|---|---|---|---|---|---|
It | 1 | PRP | it | 0 | 111541965 | 1 | O | 1 |
is | 1 | VBZ | be | 3 | 111541965 | 2 | V | 1 |
our | 1 | PRP$ | we | 6 | 111541965 | 3 | O | 1 |
unfinished | 1 | JJ | unfinished | 10 | 111541965 | 4 | A | 1 |
task | 1 | NN | task | 21 | 111541965 | 5 | N | 1 |
to | 1 | TO | to | 26 | 111541965 | 6 | ? | 1 |
tokens = amcat.gettokens(conn, project=1, articleset=set)
tokens = amcat.gettokens(conn, project=1, articleset=set, module="corenlp_lemmatize")
dtm = with(subset(sotu.tokens, pos1=="M"),
dtm.create(aid, lemma))
dtm.wordcloud(dtm)
stats = term.statistics(dtm)
stats = plyr::arrange(stats, -termfreq)
head(stats)
term | characters | number | nonalpha | termfreq | docfreq | reldocfreq | tfidf |
---|---|---|---|---|---|---|---|
America | 7 | FALSE | FALSE | 409 | 346 | 0.3940774 | 0.6883991 |
Americans | 9 | FALSE | FALSE | 179 | 158 | 0.1799544 | 1.4280099 |
Congress | 8 | FALSE | FALSE | 168 | 149 | 0.1697039 | 1.1398894 |
Iraq | 4 | FALSE | FALSE | 109 | 65 | 0.0740319 | 1.4157528 |
States | 6 | FALSE | FALSE | 99 | 89 | 0.1013667 | 0.9573274 |
United | 6 | FALSE | FALSE | 88 | 82 | 0.0933941 | 0.7817946 |
Corpus Analysis
Break
Handouts:
Mini-project: