Sentiment Analysis

Wouter van Atteveldt
Glasgow Text Analysis, 2016-11-17

Sentiment Analysis

What is the tone of a text?
Techniques (e.g. Pang/Lee 2008, Liu 2012)
- Manual coding
- Dictionaries
- Machine Learning
- Crowdsourcing (Benoit ea 2015)
See e.g. https://www.cs.uic.edu/~liub/FBS/NLP-handbook-sentiment-analysis.pdf

Sentiment Analysis: problems

“The man who leaked cell-phone coverage of Saddam Hussein's execution was arrested”

Language is subjective, ambiguous, creative
What does positive/negative mean?
- e.g. Osgood ea 1957: evaluation, potency, activity
Who is positive/negative about what?
- Sentiment Attribution

Sentiment Analysis resources

Lexicon (dictionary)
Annotated texts
Tools / models

Lexical Sentiment Analysis

Get list of positive / negative terms
Count occurrences in text
Summarize to sentiment score
Possible improvements
- Word-window approach (tomorrow)
- Deal with negation, intensification

Lexical Sentiment Analysis in R

Nothing new here!
Directly count words in DTM:

lex = list(pos=c("good", "nice", "great"), neg=c("bad","stupid", "crooked"))
library(slam)
npos = row_sums(dtm[, colnames(dtm) %in% lex$pos])
nneg = row_sums(dtm[, colnames(dtm) %in% lex$neg])
sent = data.frame(id=rownames(dtm), npos=npos, nneg=nneg)
sent$subj = sent$npos + sent$nneg
sent$sent = ifelse(sent$subj == 0, 0, 
                   (sent$npos-sent$nneg) / (sent$subj))
sent = merge(meta, sent)
Head(sent)

id	date	medium	year	month	week	npos	subj	sent
162317736	2016-02-01	The New York Times	2016-01-01	2016-02-01	2016-02-01	0	0	0
162317809	2016-01-25	The New York Times	2016-01-01	2016-01-01	2016-01-25	0	0	0
171932192	2016-07-04	The New York Times	2016-01-01	2016-07-01	2016-07-04	0	0	0
171932219	2016-07-03	The New York Times	2016-01-01	2016-07-01	2016-06-27	1	1	1
171932226	2016-07-03	The New York Times	2016-01-01	2016-07-01	2016-06-27	0	0	0
171932227	2016-07-03	The New York Times	2016-01-01	2016-07-01	2016-06-27	0	0	0

Lexical Sentiment Analysis in R

a = aggregate(sent[c("sent", "subj")],
              sent[c("week", "medium")], sum)
library(ggplot2)
ggplot(a, aes(x=week, y=sent, colour=medium)) + 
  geom_line()

plot of chunk unnamed-chunk-5

Lexical Sentiment Analysis: Alternatives

Apply directly to tokenlist:

tokens$sent =0
tokens$sent[tokens$lemma %in% lex$pos] = 1
Head(tokens[tokens$sent > 0,])

	id	sentence	offset	word	lemma	POS	POS1	ner	sent
230726	171932219	50	3857	good	good	JJ	G	O	1
316450	171932322	2	193	good	good	JJ	G	O	1
431087	171932429	22	3577	Good	good	JJ	G	O	1
547654	171932557	17	1848	good	good	JJ	G	O	1
547812	171932557	25	2589	great	great	JJ	G	O	1
560047	171932568	13	2011	good	good	JJ	G	O	1

Lexical Sentiment Analysis: Quanteda

Use quanteda::apply

library(quanteda)
library(corpustools)
dfm = dtm.to.dfm(dtm)
dfm = applyDictionary(dfm, lex)
head(dfm)

Document-feature matrix of: 999 documents, 2 features (0% sparse).
(showing first 6 documents and first 2 features)
           features
docs        pos neg
  162317736   0   0
  162317809   0   0
  171932192   0   0
  171932219   1   0
  171932226   0   0
  171932227   0   0

Acquiring a lexicon

Many sentiment lexicons exist
Download, run, evaluate

Parsing a sentiment lexicon

lex = readRDS("lexicon.rds")
dict = list(
  pos = lex$word1[lex$priorpolarity == "positive"],
  neg = lex$word1[lex$priorpolarity == "negative"],
  trump = "Trump",
  clinton = c("Hillary", "Clinton"))

Proximity-based sentiment analysis

Political texts contain multiple statements
Apply words to actors in close proximity
- Within sentence/paragraph -> create dtm/dfm at sentence level
- Within N words -> use token list
However, “Trump calls Clinton crooked”
- clause anlaysis (http://vanatteveldt.com/2016-clause-analysis/)

Sentence-level sentiment:

tokens$doc = paste(tokens$id, tokens$sentence, sep="_")
dtm = dtm.create(tokens$doc, tokens$lemma, minfreq = 10)
x = sapply(dict, function(x) 
  row_sums(dtm[, colnames(dtm) %in% x]))
Head(x)

	pos	neg
162317736_1	2	2
162317736_2	2	4
162317736_3	0	1
162317736_4	1	2
162317736_5	3	1
162317736_6	0	4

Proximity-based sentiment: semnet

devtools::install_github("kasperwelbers/semnet")

library(semnet)
tokens$concept = NA
for(c in names(dict))
  tokens$concept[tokens$lemma %in% dict[[c]]] = c

hits = windowedCoOccurenceNetwork(location=tokens$offset, 
    term=tokens$concept, context=tokens$id,
    window.size=40, output.per.context = T)
hits = subset(hits, x %in% c("clinton", "trump") 
              & y %in% c("pos", "neg"))
hits$sent = ifelse(hits$y == "pos", 1, -1)
tapply(hits$sent, droplevels(hits$x), mean)

   clinton      trump 
0.10921228 0.02649982

Sentiment Analysis: difficulty

Liu: “Although necessary, having an opinion lexicon is far from sufficient for accurate sentiment analysis” … “sentiment analysis tasks are very challenging.”
“[sentistrength] has human-level accuracy for short social web texts in English, except political texts.”
Subjective language is
- Creative
- Ambiguous
- Subjective
- Content-sensitive
Political communication is (often) nuanced, complicated

Improving sentiment analysis

Domain adaptation
- Get term statistics
- Merge with lexicon
- Manually check top-X frequent words
Targeted sentiment anlaysis (next section)
Crowd sourcing (e.g. Haselmayer in press)
Machine learning (see handout)

Hands-on session III

Sentiment Analysis of election campaign
- (or your own data …)
What is the overall sentiment?
- Development over time, per medium
What is sentiment for different actors?
Try out different lexica
Try to adapt lexicon to domain

Conclusion

Frequency-based text analysis
- Corpus analysis and topic modeling
- Natural language procesessing
- Sentiment Analysis
Lots of resources out there
- See e.g. vanatteveldt.com/glasgow-r
Questions?