18 April 2018
quanteda is an R package for quantitative text analysis developed by a team based at the LSE.
Quanteda Initiative CIC was founded to support the community of text analysts.
We want to discover theoretically interesting patterns in a corpus of texts from a social scientific point of view.
We usually need a large corpus to find interesting patterns.
Tokenized texts are non-rectangular
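The outputs below are consistent with a short example vector like the following; the definition of txt is not shown in the original slides, so this is an assumption (with library(quanteda) attached):

library(quanteda)
txt <- c("What is it?", "That is a dolphine.", "No, it is a killer whale!")
print(txt)
as.list(tokens(txt, remove_punct = TRUE))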
## [1] "What is it?" "That is a dolphine." ## [3] "No, it is a killer whale!"
## $text1 ## [1] "What" "is" "it" ## ## $text2 ## [1] "That" "is" "a" "dolphine" ## ## $text3 ## [1] "No" "it" "is" "a" "killer" "whale"
Document-feature matrix is sparse
## [1] "What is it?" "That is a dolphine." ## [3] "No, it is a killer whale!"
## features ## docs what is it that a dolphine no killer whale ## text1 1 1 1 0 0 0 0 0 0 ## text2 0 1 0 1 1 1 0 0 0 ## text3 0 1 1 0 1 0 1 1 1
We need very efficient tools to process large sparse matrices.
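As a rough illustration (not from the original slides), compare the memory footprint of a mostly-zero base matrix with its sparse counterpart:

library(Matrix)
dense <- matrix(0, nrow = 1000, ncol = 1000)  # one million cells, almost all zero
dense[1, 1] <- 1
sparse <- as(dense, "dgCMatrix")  # stores only the non-zero entry
object.size(dense)   # about 8 MB
object.size(sparse)  # a few KB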
- character is not memory efficient when vectors are short
- data.frame does not allow variables of different lengths (rectangular data)
- matrix records both zero and non-zero values (a dense matrix)

quanteda has tokens for tokenized texts.
toks <- tokens(txt, remove_punct = TRUE)
print(toks)
## tokens from 3 documents.
## text1 :
## [1] "What" "is"   "it"
##
## text2 :
## [1] "That"     "is"       "a"        "dolphine"
##
## text3 :
## [1] "No"     "it"     "is"     "a"      "killer" "whale"
quanteda has dfm for document-feature matrices.
mt <- dfm(toks)
print(mt)
## Document-feature matrix of: 3 documents, 9 features (51.9% sparse).
## 3 x 9 sparse Matrix of class "dfm"
##        features
## docs    what is it that a dolphine no killer whale
##   text1    1  1  1    0 0        0  0      0     0
##   text2    0  1  0    1 1        1  0      0     0
##   text3    0  1  1    0 1        0  1      1     1
quanteda has many specialized methods for tokens and dfm.

tokens

- tokens_select() selects tokens by patterns
- tokens_compound() compounds multiple tokens into a single token
- tokens_lookup() finds dictionary words

dfm

- dfm_select() selects features by patterns
- dfm_lookup() finds dictionary words
- dfm_group() groups multiple documents into a single document

A complete list of quanteda's functions is available at the documentation site (two of the tokens methods are sketched below).
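A minimal sketch, applying two of these methods to the toks object created above (outputs omitted):

tokens_select(toks, c("dolphine", "whale"))    # keep only the matching tokens
tokens_compound(toks, phrase("killer whale"))  # merge "killer" "whale" into "killer_whale"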
tokens is an extension of list (S3).
str(unclass(toks))
## List of 3
##  $ text1: int [1:3] 1 2 3
##  $ text2: int [1:4] 4 2 5 6
##  $ text3: int [1:6] 7 3 2 5 8 9
##  - attr(*, "types")= chr [1:9] "What" "is" "it" "That" ...
##  - attr(*, "padding")= logi FALSE
##  - attr(*, "what")= chr "word"
##  - attr(*, "ngrams")= int 1
##  - attr(*, "skip")= int 0
##  - attr(*, "concatenator")= chr "_"
##  - attr(*, "docvars")='data.frame': 3 obs. of 0 variables
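In other words, each document is stored as integer indices into the shared "types" attribute. As a quick check (a sketch, not in the original):

attr(toks, "types")[unclass(toks)$text2]  # ids 4 2 5 6 map back to the tokens of text2
## [1] "That"     "is"       "a"        "dolphine"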
dfm inherits Matrix::dgCMatrix (S4).
mt@Dim
## [1] 3 9
mt@Dimnames
## $docs
## [1] "text1" "text2" "text3"
##
## $features
## [1] "what"     "is"       "it"       "that"     "a"        "dolphine"
## [7] "no"       "killer"   "whale"
mt@i
## [1] 0 0 1 2 0 2 1 1 2 1 2 2 2
mt@p
## [1] 0 1 4 6 7 9 10 11 12 13
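In this compressed sparse column (CSC) layout, the non-zero rows of column j are stored in mt@i[(mt@p[j] + 1):mt@p[j + 1]] as 0-based row numbers. A quick check for the second column, "is", which occurs in all three documents (a sketch, not part of the original):

j <- 2
mt@i[(mt@p[j] + 1):mt@p[j + 1]] + 1  # 1-based row numbers of non-zeros in column "is"
## [1] 1 2 3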
as.tokens(list(c("I", "like", "dogs"), c("He", "likes", "cats")))
## tokens from 2 documents.
## text1 :
## [1] "I"    "like" "dogs"
##
## text2 :
## [1] "He"    "likes" "cats"
toks %>% as.list() %>% as.tokens() %>% class()
## [1] "tokens"
as.dfm(matrix(c(1, 0, 2, 1), nrow = 2,
              dimnames = list(c("doc1", "doc2"), c("dogs", "cats"))))
## Document-feature matrix of: 2 documents, 2 features (25% sparse).
## 2 x 2 sparse Matrix of class "dfm"
##       features
## docs   dogs cats
##   doc1    1    2
##   doc2    0    1
as.dfm(rbind("doc" = c(3, 4)))
## Document-feature matrix of: 1 document, 2 features (0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
##      features
## docs  feat1 feat2
##   doc     3     4
mt %>% as.matrix() %>% as.dfm() %>% class()
## [1] "dfm" ## attr(,"package") ## [1] "quanteda"
as(mt, "dgCMatrix")
## 3 x 9 sparse Matrix of class "dgCMatrix"
##        features
## docs    what is it that a dolphine no killer whale
##   text1    1  1  1    . .        .  .      .     .
##   text2    .  1  .    1 1        1  .      .     .
##   text3    .  1  1    . 1        .  1      1     1
as(mt, "dgTMatrix")
## 3 x 9 sparse Matrix of class "dgTMatrix"
##        features
## docs    what is it that a dolphine no killer whale
##   text1    1  1  1    . .        .  .      .     .
##   text2    .  1  .    1 1        1  .      .     .
##   text3    .  1  1    . 1        .  1      1     1
dgmt <- as(mt, "dgTMatrix")
dgmt@i
## [1] 0 0 1 2 0 2 1 1 2 1 2 2 2
dgmt@j
## [1] 0 1 1 1 2 2 3 4 4 5 6 7 8
dgmt@x
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1
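In this triplet (COO) layout, entry k simply records its 0-based row i[k], column j[k] and value x[k]. For example, the first triplet is the count of "what" in text1 (a sketch, not in the original):

dgmt[dgmt@i[1] + 1, dgmt@j[1] + 1]
## [1] 1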
types(toks)
## [1] "What" "is" "it" "That" "a" "dolphine" ## [7] "No" "killer" "whale"
pattern2id("dolphine", types(toks), "fixed", TRUE)
## [[1]]
## [1] 6
pattern2id(c("dolphine", "whale"), types(toks), "fixed", TRUE)
## [[1]]
## [1] 6
##
## [[2]]
## [1] 9
pattern2id(phrase("killer whale"), types(toks), "fixed", TRUE)
## [[1]]
## [1] 8 9
pattern2id("^wha.*", types(toks), "regex", TRUE)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 9
pattern2fixed("^wha.*", types(toks), "regex", TRUE)
## [[1]]
## [1] "What"
##
## [[2]]
## [1] "whale"
pattern2id("wha*", types(toks), "glob", TRUE)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 9
pattern2fixed("wha*", types(toks), "glob", TRUE)
## [[1]]
## [1] "What"
##
## [[2]]
## [1] "whale"
featnames(mt)
## [1] "what" "is" "it" "that" "a" "dolphine" ## [7] "no" "killer" "whale"
id <- pattern2id("wha*", featnames(mt), "glob", TRUE)
mt[, unlist(id)]
## Document-feature matrix of: 3 documents, 2 features (66.7% sparse).
## 3 x 2 sparse Matrix of class "dfm"
##        features
## docs    what whale
##   text1    1     0
##   text2    0     0
##   text3    0     1
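For comparison (a sketch, not in the original), the same selection is available through the higher-level API:

dfm_select(mt, "wha*", valuetype = "glob")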
newsmap is a semi-supervised model for geographical document classification, originally created for International Newsmap.
newsmap is a semi-supervised multinomial naive Bayes classifier. It searches tokens for place names to assign country labels (weak supervision).

data_dictionary_newsmap_en[["AMERICA"]]["NORTH"]
## Dictionary object with 1 primary key entry and 2 nested levels.
## - [NORTH]:
##   - [BM]:
##     - bermuda, bermudan*
##   - [CA]:
##     - canada, canadian*, ottawa, toronto, quebec
##   - [GL]:
##     - greenland, greenlander*, nuuk
##   - [PM]:
##     - saint pierre and miquelon, st pierre and miquelon, saint pierrais, miquelonnais, saint pierre
##   - [US]:
##     - united states, us, american*, washington, new york
data_dictionary_newsmap_de[["AMERICA"]]["NORTH"]
## Dictionary object with 1 primary key entry and 2 nested levels.
## - [NORTH]:
##   - [BM]:
##     - bermuda, bermudas
##   - [CA]:
##     - kanada, kanadas, kanadisch*, kanadier*, ottawa, toronto, quebec
##   - [GL]:
##     - grönland, grönlands, grönländisch*, nuuk
##   - [PM]:
##     - saint-pierre und miquelon, saint pierre
##   - [US]:
##     - vereinigte staaten von amerika, united states, vereinigte staaten, usa, us, amerikas, amerikanisch*, washington, new york
data_dictionary_newsmap_ja[["AMERICA"]]["NORTH"]
## Dictionary object with 1 primary key entry and 2 nested levels.
## - [NORTH]:
##   - [BM]:
##     - バミューダ*
##   - [CA]:
##     - カナダ*, オタワ, トロント, ケベック
##   - [GL]:
##     - グリーンランド*, ヌーク
##   - [PM]:
##     - サンピエール・ミクロン島*, サンピエール, ミクロン
##   - [US]:
##     - 米国*, アメリカ*, ワシントン, ニューヨーク
data <- readRDS("/home/kohei/Dropbox/Public/data_corpus_yahoonews.rds")
data$text <- paste0(data$head, ". ", data$body)
data$body <- NULL
corp_full <- corpus(data, text_field = 'text')
corp <- corpus_subset(corp_full, '2014-01-01' <= date & date <= '2014-12-31')
ndoc(corp)
## [1] 156980
month <- c("January", "February", "March", "April", "May", "June",
           "July", "August", "September", "October", "November", "December")
day <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
agency <- c("AP", "AFP", "Reuters")
toks <- tokens(corp) %>%
    tokens_remove(stopwords("en"), valuetype = "fixed", padding = TRUE) %>%
    tokens_remove(c(month, day, agency), valuetype = "fixed", padding = TRUE)
label_mt <- dfm(tokens_lookup(toks, data_dictionary_newsmap_en, levels = 3))
feat_mt <- dfm(toks, tolower = FALSE) %>%
    dfm_select(selection = "keep", "^[A-Z][A-Za-z1-2]+", valuetype = "regex",
               case_insensitive = FALSE) %>%
    dfm_trim(min_termfreq = 10)
newsmap <- textmodel_newsmap(feat_mt, label_mt)
summary(newsmap, n = 10)
## Classes:
##   bi, dj, er, et, ke, km, mg, mu, mw, mz ...
## Features:
##   French, Ivanovic, Safarova, PARIS, Former, Open, Ana, Lucie, Czech, Republic ...
## Documents:
##   text63, text68, text69, text73, text78, text79, text84, text85, text86, text92 ...
tb <- table(predict(newsmap))
barplot(head(sort(tb, decreasing = TRUE), 20))
newsmap uses quanteda APIs in both estimation and prediction.
LSS implements a semi-supervised document scaling model, built on quanteda, to perform theory-driven analysis at low cost.
LSS (latent semantic scaling) is an application of LSA (latent semantic analysis) to document scaling.
seedwords('pos-neg')
##        good        nice   excellent    positive   fortunate     correct
##           1           1           1           1           1           1
##    superior         bad       nasty        poor    negative unfortunate
##           1          -1          -1          -1          -1          -1
##       wrong    inferior
##          -1          -1
# concern
c("concern*", "worr*", "anxi*")
## [1] "concern*" "worr*" "anxi*"
# dysfunction
c("dysfunct*", "paralysi*", "stalemate", "standstill", "gridlock", "deadlock")
## [1] "dysfunct*" "paralysi*" "stalemate" "standstill" "gridlock" ## [6] "deadlock"
In LSS, documents are split into sentences to estimate semantic proximity from the immediate contexts of words. This makes the document-feature matrix extremely sparse.
corp <- readRDS("/home/kohei/Dropbox/Public/data_corpus_guardian2016-10k.rds")
sent_toks <- corp %>%
    corpus_reshape("sentences") %>%
    tokens(remove_punct = TRUE) %>%
    tokens_remove(stopwords("en")) %>%
    tokens_select("^[0-9a-zA-Z]+$", valuetype = "regex")
sent_mt <- sent_toks %>%
    dfm() %>%
    dfm_trim(min_termfreq = 5)
ndoc(corp)
## [1] 10000
ndoc(sent_mt)
## [1] 392697
sparsity(sent_mt)
## [1] 0.9996725
LSS performs feature selection based on collocations to construct domain-specific sentiment models.
eco <- head(char_keyness(sent_toks, 'econom*'), 500)
head(eco, 30)
## [1] "growth" "global" "outlook" "eurozone" "slowdown" ## [6] "slowing" "uncertainty" "recession" "markets" "uk" ## [11] "zew" "financial" "gdp" "markit" "world" ## [16] "brexit" "forecasts" "china" "rolling" "jobs" ## [21] "prospects" "quarter" "davos" "williamson" "emerging" ## [26] "ihs" "downturn" "risks" "tombs" "recovery"
lss <- textmodel_lss(sent_mt, seedwords('pos-neg'), features = eco, cache = TRUE)
head(coef(lss), 30)
##        good     positive     emerging        china    expecting  cooperation
##   0.05252521   0.03941849   0.03906230   0.03249657   0.03172653   0.03014628
##        drag          asia         prof    turbulent  sustainable  challenging
##   0.02910002   0.02873776   0.02849241   0.02824677   0.02765765   0.02682587
##     markets     investors       remain      beijing    prospects        stock
##   0.02644153   0.02637175   0.02629941   0.02572956   0.02570061   0.02538831
##      better     uncertain          hit      chinese    sentiment       mining
##   0.02479903   0.02358457   0.02302436   0.02291012   0.02269737   0.02230245
##      failed        threat       robust       argued  consultancy       oxford
##   0.02213793   0.02196979   0.02192322   0.02192176   0.02190114   0.02176545
tail(coef(lss), 30)
##       yellen        banks consequences       carney    inflation
##  -0.02971046  -0.03026749  -0.03048922  -0.03056076  -0.03126168
##    countries     downturn       macron     suggests     downbeat
##  -0.03194409  -0.03257441  -0.03258820  -0.03277442  -0.03309793
##         debt  assumptions         data policymakers    borrowing
##  -0.03386571  -0.03416451  -0.03524238  -0.03745570  -0.03815934
##   unbalanced       shrink unemployment          bad     pantheon
##  -0.03934074  -0.03944345  -0.03987688  -0.04047418  -0.04054125
##      cutting       shocks         hike        rates          mpc
##  -0.04082423  -0.04267978  -0.04370420  -0.04405703  -0.04427978
##       easing          rba         rate          cut     negative
##  -0.04577516  -0.04789902  -0.05417844  -0.05498620  -0.05697134
doc_mt <- dfm(corp)
data_pred <- as.data.frame(predict(lss, newdata = doc_mt, density = TRUE))
data_pred$date <- docvars(doc_mt, 'date')
data_pred <- subset(data_pred, density > quantile(density, 0.25))
head(data_pred)
##                    fit    density       date
## text141269  0.74545003 0.02946955 2016-04-07
## text151432 -0.01325452 0.08580343 2016-05-10
## text169265 -0.24909294 0.05876393 2016-09-01
## text134827 -0.04752371 0.04449649 2016-03-08
## text133324  1.80075143 0.05355191 2016-03-03
## text132839  0.55650416 0.04322034 2016-02-22
par(mar = c(4, 4, 1, 1))
plot(data_pred$date, data_pred$fit, pch = 16, col = rgb(0, 0, 0, 0.1),
     ylim = c(-0.5, 0.5), ylab = "Economic sentiment", xlab = "Time")
lines(lowess(data_pred$date, data_pred$fit, f = 0.1), col = 1)
abline(h = 0, v = as.Date("2016-06-23"), lty = c(1, 3))
The LSS package is implemented using RSpectra's SVD engine and quanteda's APIs.
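A minimal sketch of the kind of truncated SVD this involves (this is not the LSS source code, and k = 300 is an arbitrary choice here):

library(RSpectra)
svd_k <- svds(as(sent_mt, "dgCMatrix"), k = 300)  # only the 300 leading singular vectors
dim(svd_k$v)  # one row of word vectors per feature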
quanteda's APIs help you to quickly develop your own models.

- pattern2id() helps you handle patterns, multi-word expressions and Unicode characters (see the sketch at the end)
- tokens and dfm objects can be created using as.tokens() and as.dfm()
- tokens_*() and dfm_*() functions are optimized for large textual data
- textstat_*() functions are useful as packages' internal functions

You can also contribute to the development of quanteda.
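As an illustration of building on these APIs, here is a hypothetical helper; count_docs_with() is not part of quanteda:

# count documents containing any feature that matches a glob pattern
count_docs_with <- function(x, pattern) {
    ids <- unlist(pattern2id(pattern, featnames(x), "glob", TRUE))
    sum(rowSums(x[, ids]) > 0)
}
count_docs_with(mt, "wha*")  # text1 has "what", text3 has "whale"
## [1] 2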