Text Analysis with R

Wouter van Atteveldt
Glasgow Text Analysis, 2016-11-17

Course Overview

10:30 - 12:00

  • Recap: Frequency Based Analysis and the DTM
  • Dictionary Analysis with AmCAT and R

13:30 - 15:00

  • Simple Natural Language Processing
  • Corpus Analysis and Visualization
  • Topic Modeling and Visualization

15:15 - 17:00

  • Sentiment Analysis with dictionaries
  • Sentiment Analysis with proximity

Frequency Based Analysis: The DTM

Frequency Based Analysis

  • Analysis based on word frequency only
    • “Bag of words” assumption
    • Ignore grammar, proximity, relations, …
  • Main data: Document-term matrix (dtm)
  • Can also use other features: document-feature matrix (dfm)
    • Bag of stems, lemmata, word pairs, …
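
To make the bag-of-words idea concrete, here is a minimal base R sketch that builds a small document-term matrix by hand. It does not use quanteda; the helper name tokenize_simple and the naive punctuation handling are assumptions made for illustration only.

```r
# Minimal bag-of-words sketch in base R (no quanteda required).
texts <- c("This is a test", "They tested a test", "I found a test!")

# Hypothetical helper: lowercase, strip punctuation, split on whitespace.
tokenize_simple <- function(x) {
  x <- tolower(gsub("[[:punct:]]", "", x))
  strsplit(x, "[[:space:]]+")
}
tokens <- tokenize_simple(texts)

# Vocabulary in order of first appearance.
vocab <- unique(unlist(tokens))

# One row per document, one column per term; each cell is a term count.
dtm <- t(sapply(tokens, function(tok) {
  tabulate(match(tok, vocab), nbins = length(vocab))
}))
dimnames(dtm) <- list(paste0("text", seq_along(texts)), vocab)
print(dtm)
```

Word order is discarded: any reordering of the words in a text yields the same row.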

Creating a DTM

  1. Text source
    • Text files
    • Data frames / vectors of text
    • External sources/APIs
  2. Preprocessing
    • Stemming, lowercasing, lemmatizing
    • Collocations
  3. Feature selection
    • Frequency
    • Stopwords

Creating a DTM from text

texts=c("This is a test", "They tested a test", "I found a test!")
dfm(texts)
Document-feature matrix of: 3 documents, 8 features (50% sparse).
3 x 8 sparse Matrix of class "dfmSparse"
docs    this is a test they tested i found
  text1    1  1 1    1    0      0 0     0
  text2    0  0 1    1    1      1 0     0
  text3    0  0 1    1    0      0 1     1

Preprocessing: stemming, stopword removal

dfm(texts, stem=T, ignoredFeatures=stopwords("english"))
Document-feature matrix of: 3 documents, 2 features (33.3% sparse).
3 x 2 sparse Matrix of class "dfmSparse"
docs    test found
  text1    1     0
  text2    2     0
  text3    1     1

Preprocessing: collocations

coll = collocations(texts)
   word1  word2 word3 count        G2
1:     a   test           3 14.045308
2:     i  found           1  7.050924
3:  they tested           1  7.050924
4:  this     is           1  7.050924
5: found      a           1  3.231839
6:    is      a           1  3.231839

Preprocessing: collocations

texts2 = phrasetotoken(texts, subset(coll, G2>10))
[1] "This is a_test"     "They tested a_test" "I found a_test!"   
dfm(texts2,  stem=T, ignoredFeatures=stopwords("english"))
Document-feature matrix of: 3 documents, 3 features (44.4% sparse).
3 x 3 sparse Matrix of class "dfmSparse"
docs    a_test test found
  text1      1    0     0
  text2      1    1     0
  text3      1    0     1

Feature selection

dfm = dfm(texts2, stem=T)
dfm = trim(dfm, minDoc = 2)
Document-feature matrix of: 3 documents, 1 feature (0% sparse).
3 x 1 sparse Matrix of class "dfmSparse"
docs    a_test
  text1      1
  text2      1
  text3      1
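
What document-frequency trimming does can be sketched in base R on a plain matrix. The toy matrix below is typed in by hand to mirror the stemmed example; it is not produced by quanteda, and min_docs is a hypothetical name.

```r
# Toy document-term matrix: rows are documents, columns are (stemmed) terms.
dtm <- rbind(
  text1 = c(this = 1, is = 1, a_test = 1, they = 0, test = 0, i = 0, found = 0),
  text2 = c(0, 0, 1, 1, 1, 0, 0),
  text3 = c(0, 0, 1, 0, 0, 1, 1)
)

# Document frequency: number of documents in which each term occurs.
doc_freq <- colSums(dtm > 0)

# Keep only terms that occur in at least min_docs documents.
min_docs <- 2
trimmed <- dtm[, doc_freq >= min_docs, drop = FALSE]
print(trimmed)
```

Only a_test occurs in two or more documents, so it is the only column that survives.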

More control: quanteda step-by-step

tokens = tokenize(texts2, removePunct = T)
tokens = toLower(tokens)
tokens = wordstem(tokens, "english")
dfm = dfm(tokens)
dfm = selectFeatures(dfm, stopwords("english"), "remove")
dfm = trim(dfm, minCount = 1)
Document-feature matrix of: 3 documents, 3 features (44.4% sparse).
3 x 3 sparse Matrix of class "dfmSparse"
docs    a_test test found
  text1      1    0     0
  text2      1    1     0
  text3      1    0     1

(De-)Motivational example: Dutch stemming

texts = c("De kippen eten", "De kip heeft gegeten")
dfm(texts, language="dutch", stem=T, ignoredFeatures=stopwords("dutch"))
Document-feature matrix of: 2 documents, 4 features (50% sparse).
2 x 4 sparse Matrix of class "dfmSparse"
docs    kipp eten kip geget
  text1    1    1   0     0
  text2    0    0   1     1

(We will cover lemmatizing and POS-tagging this afternoon!)

Dictionary-based analysis

Dictionary-based analysis

  • Use list of keywords to define a concept
  • (words, wildcards, boolean combinations, phrases, etc.)
  • Measure (co-)occurrence of these concepts
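
A rough base R sketch of wildcard keyword matching follows. The documents, the dictionary and the helper count_concept are invented for illustration; real tools (AmCAT queries, quanteda dictionaries) support richer syntax than this.

```r
# Hypothetical mini-corpus and dictionary, for illustration only.
docs <- c(d1 = "The economy grew while inflation fell",
          d2 = "Immigration dominated the debate",
          d3 = "Economic policy and immigrants")
dictionary <- list(economy = c("econ*", "inflation"),
                   immigration = c("immigr*"))

# Lowercase and split on anything that is not a letter.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Count the tokens in one document that match any pattern of a concept.
count_concept <- function(tok, patterns) {
  # glob2rx() turns wildcard patterns into anchored regular expressions.
  rx <- paste(utils::glob2rx(patterns), collapse = "|")
  sum(grepl(rx, tok))
}

# Documents in rows, concepts in columns.
counts <- sapply(dictionary, function(p) {
  sapply(tokens, count_concept, patterns = p)
})
print(counts)
```

Co-occurrence can then be read off the rows: d3 mentions both concepts, d1 and d2 only one each.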

Advantages of dictionaries?

  • Easy to explain
  • Easy to use
  • Control over operationalization

AmCAT

  • Free and Open Source text analysis infrastructure
  • Easy corpus management, keyword queries
  • Integrates with R / quanteda
  • Run your own server or use ours (amcat.nl)

AmCAT demo

Connecting to AmCAT from R

amcat.save.password("https://amcat.nl", username="...", password="...")
conn = amcat.connect("https://amcat.nl")
meta = amcat.articles(conn, project=1235, articleset=32114, dateparts = T)
table(meta$medium)
      The New York Times The New York Times Blogs                USA TODAY 
                    9551                     1660                     1993 
saveRDS(meta, "meta.rds")

Running AmCAT queries in R

a = amcat.aggregate(conn, sets=32139, queries = c("trump", "clinton"), axis1 = "week")
  count       week query
1    14 2015-12-28 trump
2    51 2016-01-04 trump
3    65 2016-01-11 trump
4    66 2016-01-18 trump
5    92 2016-01-25 trump
6    94 2016-02-01 trump

Running AmCAT queries in R

ggplot(data=a, mapping=aes(x=week, y=count, color=query)) + geom_line()

[Plot: weekly hit counts for the queries "trump" and "clinton", one line per query]

Getting AmCAT data into R

h = amcat.hits(conn, sets=32142, 
               queries=c("trump", "clinton"))
# assumption: articleset is the same set queried by amcat.hits above
meta = amcat.articles(conn, project=1235, articleset=32142)
h = merge(meta, h)
         id       date             medium count   query
1 162317450 2016-03-01 The New York Times     1 clinton
2 162317731 2016-02-02 The New York Times     7   trump
3 162317731 2016-02-02 The New York Times     4 clinton
4 162317809 2016-01-25 The New York Times     1   trump
5 162317820 2016-01-24 The New York Times     1 clinton
6 162317856 2016-01-18 The New York Times     1   trump

Getting AmCAT texts into R

articles = amcat.articles(conn, project=1235, 
  articleset=32142, dateparts=T,
  columns=c("date", "headline", "text"))
articles$text[1]
[1] "Every Friday, pop critics for The New York Times weigh in on the week's most\nnotable new songs and videos -- and anything else that strikes them as\nintriguing -- in the Playlist. You can listen to this playlist on Spotify here.\nLike this format? Let us know at theplaylist@nytimes.com\n\nThundercat, 'Bus in These Streets'\n\n[Video: Thundercat - \"Bus in These Streets\" Watch on YouTube.]\n\nHe's nobody's idea of a Luddite, but Thundercat -- the electric bass whiz,\nfalsetto soothsayer and underground prince of head-trippy future soul -- has a\nmessage for those who stumble through the day glued to smartphones. ''Bus in\nThese Streets,'' the irresistibly chirpy single he released this week, features\nproduction and programming by his comrade Flying Lotus, with a sound that\nrecalls the radiant, chiming side of Motown's '60s assembly line. (The live\ndrums and keyboards are the work of another close affiliate, Louis Cole.)\nUltimately, Thundercat -- who has recently been commingling with P-Funk's George\nClinton, and will perform on Saturday at the Afropunk Fest in Brooklyn --\ndelivers a mundane but powerful truth: ''It's O.K. to disconnect sometimes.''\nNATE CHINEN\n\nSharon Van Etten, 'Not Myself'\n\n''Not Myself,'' the single Sharon Van Etten just released as a memorial to the\nvictims of the Orlando dance-club shooting, benefits the Everytown for Gun\nSafety Support Fund. It's pure elegy with overtones of a spiritual: tolling\npiano chords, an austere drone of sustained strings and a call-and-response\nbetween a somber, humming choir and a lead vocal that Ms. Van Etten keeps on the\ndignified side of tears. As she mourns, she recognizes the murders as an attack\non gay identity. ''In the ashes of the aftermath, pray,'' she intones. ''It's\ntoo much to take/There's too much at stake/And I want you to be yourself around\nme.'' JON PARELES\n\nYoung M.A., 'OOOUUU'\n\n[Video: Young M.A. 
- \"OOOUUU\" (Official Video) Watch on YouTube.]\n\nThe open-car-window anthem of summer in Brooklyn is ''OOOUUU,'' a hypnotically\nchill, casually brooding boast by an upstart female rapper, Young M.A. She's an\nentrancingly calm stylist, full of brutish assonance: ''When it's time to pop\nthey a no-show/Yeah, I'm pretty but I'm loco/The loud got me moving slo-mo.''\nThe song, produced by NY Bangers, creeps with wonder and menace -- part of its\npower is that it's never trying too hard, and Young M.A. raps with the authority\nof someone who's doesn't have to sell herself.\n\nAnother confirmation of the easy effectiveness and ubiquity of ''OOOUUU'' is the\nsudden glut of remixes by New York's toughest-talking veterans. ''I kill 'em\nwith the slow flow,'' French Montana raps on his version, before breaking into a\nclever stutter flow: ''black Rollie, Barack/red beam on a op/sauce down to the\nsocks.'' On her stand-alone version, Remy Ma builds on Young M.A.'s pugnacity\nwith more of her own, recalling a lifetime of hooligan instincts (alongside\nTerror Squad partner Fat Joe): ''Quick to smack a ho, even Joe know/Been doing\nthat since he was spittin' 'Flow Joe.''' The pièce de résistance, though, may be\nthe version featuring Jadakiss and Uncle Murda, two of New York's grimiest. Both\nweave in numerous lyrical references to Young M.A.'s original verse, an\nimpressive display of dexterity. Uncle Murda's verse is hostile and hilarious,\nand Jadakiss's is filled with his typical relaxed sneers: ''Yeah, I think he\ndead, check his pulse though/Call his family, let his folks know/Tell 'em he\nain't make it, it was close though.'' JON CARAMANICA\n\nSad13, 'Get a Yes'\n\n[Video: Sad13 - \"Get A Yes\"  Watch on YouTube.]\n\nSad13 is a solo project for Sadie Dupuis, whose brash, intricate guitar parts\nand nervy lyrics drive the indie-rock band Speedy Ortiz. 
She uses pop tools\ninstead -- synthesizers, drum machines, cheerfully symmetrical melodies -- as\nSad13, with an album due Nov. 11. Yet behind the genial surface, she's no less\nhard-headed. ''Get a Yes'' starts with a giggle, but it's an unequivocal demand\nfor consent in sexual situations: ''I say yes if I want to/If you want to you've\ngot to get a yes,'' she sings. It's intended to do what a pop song can do:\nclarify a feeling and give listeners the words they need. J.P.\n\nJ Black (a.k.a. Kodak Black), 'Ambition'\n\n[Video: Kodak Black - \"Ambition\" Watch on YouTube.]\n\nEarlier this month, it appeared as if the promising 19-year-old Florida rapper\nKodak Black was set to be released from jail, his latest stint behind bars in a\nshort career frustratingly pockmarked with them. But after a plea deal had been\nreached that secured him house arrest and probation, two additional warrants\nwere discovered for the rapper, real name Dieuson Octave, including one for\ncriminal sexual conduct in South Carolina, a charge that can carry a penalty of\nup to 30 years in prison.\n\nKodak Black has been rapping (under the name J Black) since he was very young,\nand this week, some of his older music resurfaced online. The picture it paints,\nespecially in his remake of Wale's ''Ambition,'' is stark. Even at 14 years old,\nhe was an intense, narratively-adept rapper, with skill far beyond most rappers\na decade older. On his recent mixtapes, he's eased into his Southern drawl more,\nand is opting for more straight-ahead rhyme schemes. 
But ''Ambition'' and other\nold songs show him to be the most comfortable telling the most harrowing\nstories:\n\nThe only thing I worry 'bout is how my grandma doin'I'm doing good, I'm staying\nhealthy, and now I'm making musicI'm gonna strive to success and I'ma try my\nbestI'm 14 and already thinking about deathDamn, I was raised by the dead end\n\nJ.C.\n\nLiz Longley, 'Weightless'\n\nThe title song of the roots-rock songwriter Liz Longley's second album,\n''Weightless,'' turns the prosaic part of a breakup -- dividing up the\npossessions -- into a rite of passage. ''You can have the couch, the lamp, the\ndiamond ring/the books on the shelf, the dishes in the sink,'' she itemizes over\none repeated chord, and soon, with everything cleared out including the bad\nmemories, the electric guitar digs in, the drums kick harder and the melody\nascends to where she can be ''way up high beyond your gravity'' -- unencumbered\nand free. J.P.\n\nDe La Soul, 'Drawn'\n\n[Video: De La Soul - \"Drawn\" ft. Little Dragon  Watch on YouTube.]\n\nFan funding supported experimentation in ''Drawn,'' from the new album De La\nSoul releases Friday, ''and the Anonymous Nobody...'' A meditative, mysterious\nfive-minute song built on a jazzy bass-and-piano vamp and chamber-music strings,\nmost of ''Drawn'' features Yukimi Nagano from Little Dragon singing about\nregrets (''I never looked around/I'm wrecking rules and it's pulling us down'')\nand offering an entreaty: ''Won't you stay babe?'' In the final minute, Posdnuos\nof De La Soul raps, barely above a whisper, about the personal toll of a hip-hop\ncareer. 
''You can lose the love of your life to a lifetime of love on tour,'' he\nadmits, and concludes, ''My ways need laundering/Time's a-ticking, stop\nsquandering.'' J.P.\n\nDerrick Hodge, 'Clock Strike Zero'\n\n[Video: Derrick Hodge - \"Clock Strike Zero\" Watch on YouTube.]\n\nThe bassist Derrick Hodge has been a key catalyst in the recent convergence\nbetween jazz and R&B, notably as a member of the Robert Glasper Experiment, and\nas Maxwell's bandleader on tour. ''The Second,'' his new album, takes this\nhybridism as gospel, extending the premise in a shroud of self-possession. For\nmost of the album Mr. Hodge plays all the instruments -- electric bass, acoustic\npiano, synthesizers, drum programming -- with an emphasis on color and mood.\n''Clock Strike Zero'' sounds like a remixed instrumental from a lost Shuggie\nOtis session, with a ticktock shimmer in the background. It's not hard to\nimagine a vocalist taking on the track's sinuous melody, though Mr. Hodge does a\nfine job himself, running his bass through a gluey haze of distortion. N.C.\n\nFlock of Dimes, 'Everything Is Happening Today'\n\nJenn Wasner is best known as the lead singer and non-drumming half of Wye Oak, a\ndream-poppish indie duo originally from Baltimore. Wye Oak recently released\n''Tween,'' a collection of songs made between its third and fourth albums: in\nreductive terms, a document of transition from guitars to synthesizers. Ms.\nWasner now has a side project, Flock of Dimes, in which she is free to branch\nout at her leisure. ''If You See Me, Say Yes'' will be the first Flock of Dimes\nalbum, due on Sept. 23. Its lead single, ''Everything Is Happening Today,''\nopens in morning reverie, as the song's narrator wakes up and sees light\nfiltering through the room. It's a song of watchful possibility, set against\nsynth-pop grandeur -- though the ''synth'' part of that equation might be\nsemi-negotiable. Ms. 
Wasner has plans to perform a chamber version of the song\nat Le Poisson Rouge on Wednesday night, with the String Orchestra of Brooklyn.\nN.C.\n\n\n\n\nURL:\nhttp://www.nytimes.com/2016/08/27/arts/music/playlist-thundercat-sharon-van-ette\nn-sad13.html"

AmCAT and quanteda

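# assumption: stopwords removed, as in the earlier dfm() calls
d = dfm(articles$text, stem=T, ignoredFeatures=stopwords("english"))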
d = trim(d, minDoc=10)
d = weight(d, "tfidf")
topfeatures(d)
       mr       p.m    street     trump       mrs   theater      said 
1686.3507 1512.2909 1208.5387 1067.8813 1064.3875  998.5898  970.0412 
   museum       art   clinton 
 942.1813  852.4824  788.3125 
plot(d, max.words = 50, scale = c(4, 0.5))

[Plot: word cloud of up to 50 terms, sized by tf-idf weight]
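
To show what a tf-idf weight expresses, here is a small base R sketch using one common definition (raw term count times log10 of N over document frequency). The toy counts are invented, and quanteda's weighting may use a different variant, so treat the numbers as illustrative only.

```r
# Toy document-term matrix with invented counts.
dtm <- rbind(
  doc1 = c(trump = 3, said = 2, museum = 0),
  doc2 = c(1, 4, 0),
  doc3 = c(0, 1, 5)
)

n_docs   <- nrow(dtm)
doc_freq <- colSums(dtm > 0)          # documents containing each term
idf      <- log10(n_docs / doc_freq)  # rarer terms get a higher weight

# Multiply every column (term) by its idf value.
tfidf <- sweep(dtm, 2, idf, `*`)

# Terms ranked by total tf-idf weight across the corpus.
ranking <- sort(colSums(tfidf), decreasing = TRUE)
print(ranking)
```

A term that occurs in every document ("said" here) gets an idf of zero, which is why very common words drop out of a tf-idf ranking.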

AmCAT and quanteda (2)

c = quanteda.corpus(conn, project=1235, articleset=32142, dateparts=T)
d = dfm(c, ignoredFeatures=stopwords("english"))
Document-feature matrix of: 1,000 documents, 41,920 features (99.2% sparse).
(showing first 6 documents and first 6 features)
docs        every friday pop critics new york
  171933722     1      2   4       1   6    1
  171933835     0      0   0       0   2    1
  171937938     0      0   0       0   1    1
  171937818     0      0   0       0   0    0
  171935088     0      0   0       0  12   10
  171935925     1      0   0       0   5    1

Dictionaries within R

issues = list(economy=c("econ*", "inflation"), immigration=c("immigr*", "mexican*"))
d2 = applyDictionary(d, issues, exclusive=T)
Document-feature matrix of: 1,000 documents, 2 features (0% sparse).
(showing first 6 documents and first 2 features)
docs        economy immigration
  171933722       0           0
  171933835       1           0
  171937938       0           0
  171937818       0           0
  171935088       0           0
  171935925       1          17

Dictionaries within R

d2 = cbind(docvars(c), as.matrix(d2))
a = aggregate(d2[names(issues)], d2["week"], sum)
ggplot(a, aes(x=week)) +
  geom_line(aes(y = economy, color="economy"))  +
  geom_line(aes(y = immigration, color="immigration"))

[Plot: weekly counts for the economy and immigration dictionaries, one line per issue]
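
The aggregation step can be tried without AmCAT access. The sketch below uses an invented data frame in place of the real per-document counts and reshapes the weekly totals into long format, which is one way to get a line per issue with a single geom_line() call.

```r
# Hypothetical per-document dictionary counts with a week variable.
d2 <- data.frame(
  week        = as.Date(c("2016-01-04", "2016-01-04", "2016-01-11", "2016-01-11")),
  economy     = c(0, 1, 2, 0),
  immigration = c(3, 0, 1, 1)
)
issues <- c("economy", "immigration")

# Sum the counts per week.
a <- aggregate(d2[issues], d2["week"], sum)

# Stack the issue columns into long format: one row per week and issue.
long <- data.frame(
  week  = rep(a$week, times = length(issues)),
  issue = rep(issues, each = nrow(a)),
  count = unlist(a[issues], use.names = FALSE)
)
print(long)
```

With ggplot2 loaded, ggplot(long, aes(x = week, y = count, color = issue)) + geom_line() would then draw one line per issue and label the legend automatically.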

Where to get dictionaries?

  • Create your own
  • Create from corpora (next session)
  • Replication materials
  • WordStat, LIWC, …

Hands-on session I

  • Why did Trump win the (primary) election?
  • Operationalize a variable using search strings
    • candidates, issues, emotion, populism, …
    • download or create word list
  • Plot variable over time / co-occurring with either candidate
  • Use AmCAT GUI, AmCAT R, quanteda, …