Longitudinal analysis of economic news using Quanteda

Kohei Watanabe (Waseda, LSE)

6 June 2018

What is Quanteda?

Quanteda is an R package

quanteda is an R package for quantitative text analysis developed by a team based at the LSE.

  • quanteda stands for quantitative analysis of textual data
  • The package offers a set of functions for quantitative text analysis
  • Developed for high consistency and performance
    • Many of the core functions are written in C++ with multi-threading
    • Faster than other R packages (tm, tidytext) and even the Python package gensim
    • Works with Chinese, Japanese and Korean texts
  • Used by leading political scientists in North America, Europe and Asia
    • To analyze party manifestos, legislative speeches, news articles, social media, etc.

Quanteda team

What is quantitative text analysis?

In quantitative text analysis, we use the same technologies as natural language processing (NLP) but for different goals.

  • We try to discover theoretically interesting patterns from a social scientific point of view
    • Replication of manual reading of text is not the goal
    • Social scientists are interested in specific aspects of textual data
  • Analytic methods vary from simple frequency analysis to machine learning
    • Dictionary analysis is probably the most popular approach (see the toy example after this list)
    • Machine learning is becoming more popular these days
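
As a toy illustration of the dictionary approach (the two-category dictionary and example texts below are hypothetical, not from the talk):

library(quanteda)

# A made-up two-category dictionary, purely for illustration
dict <- dictionary(list(positive = c("good", "great"),
                        negative = c("bad", "awful")))
toks <- tokens(c(d1 = "A good day and a great result",
                 d2 = "A bad idea with awful timing"))
dfm_lookup(dfm(toks), dict)  # counts of dictionary matches per document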

Cost-control trade-off in machine learning

It is not easy to automate theoretically grounded analysis because of the trade-off between cost and control.

  • Complex models require a lot of data to learn
  • Supervised models can be theoretical but usually expensive
    • naive Bayes, Wordscores, random forest, SVM, neural network
  • Unsupervised models are inexpensive but often atheoretical
    • topic models, correspondence analysis, Wordfish
  • Semi-supervised models try to strike a balance between theory and cost
    • Newsmap, LSS

Examples

Sentiment analysis of news

  • Lexicoder Sentiment Dictionary (LSD)
    • Widely used in political communication research
    • Created by Young and Soroka to analyze political news in North America
    lengths(data_dictionary_LSD2015)
    ##     negative     positive neg_positive neg_negative 
    ##         2858         1709         1721         2860
  • Latent Semantic Scaling (LSS)
    • Parameters are estimated on the corpus to increase internal validity
    • Uses "seed words" to identify subject-specific sentiment words

Lexicoder Sentiment Dictionary

library(quanteda)

corp <- readRDS('/home/kohei/Dropbox/Public/data_corpus_guardian2016-10k.rds')
ndoc(corp)
## [1] 10000
range(docvars(corp, "date"))
## [1] "2016-01-01" "2016-12-31"

# Remove punctuation and stop words when constructing the dfm
mt <- dfm(corp, remove_punct = TRUE, remove = stopwords())
# Count negative and positive words in each article using the LSD
mt_dict <- dfm_lookup(mt, data_dictionary_LSD2015[1:2])
# Sentiment = (positive - negative) / document length; +1 avoids division by zero
sent <- (mt_dict[,2] - mt_dict[,1]) / (rowSums(mt) + 1)

data <- data.frame(date = docvars(mt_dict, "date"),
                   lsd = as.numeric(scale(sent)))

dim(mt)
## [1]  10000 121885
head(mt[1:6, 1:6])
## Document-feature matrix of: 6 documents, 6 features (80.6% sparse).
## 6 x 6 sparse Matrix of class "dfm"
##             features
## docs         70-year-old hermit spent entire life siberian
##   text120761           1      1     1      1    2        1
##   text174574           0      0     0      0    0        0
##   text141269           0      0     0      0    0        0
##   text151432           0      0     0      0    0        0
##   text169265           0      0     0      0    0        0
##   text134827           0      0     0      0    1        0

dim(mt_dict)
## [1] 10000     2
head(mt_dict)
## Document-feature matrix of: 6 documents, 2 features (0% sparse).
## 6 x 2 sparse Matrix of class "dfm"
##             features
## docs         negative positive
##   text120761       11        6
##   text174574       17       16
##   text141269       20       38
##   text151432       15       27
##   text169265       24       27
##   text134827       66       50

Sentiment of news around Brexit vote by LSD
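
The figure on this slide plots the daily LSD scores with a smoothed trend line. A minimal sketch of how it can be reproduced (the loess span and plotting details are assumptions, not from the talk):

# Scatter of scaled LSD sentiment by date with a loess smoother
ord <- order(data$date)
plot(data$date, data$lsd, pch = 16, col = rgb(0, 0, 0, 0.1),
     xlab = "", ylab = "LSD sentiment (scaled)")
lo <- loess(lsd ~ as.numeric(date), data = data, span = 0.1)  # assumed span
lines(data$date[ord], predict(lo)[ord], col = "red", lwd = 2)
abline(v = as.Date("2016-06-23"), lty = 2)  # EU referendum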

Latent Semantic Scaling

devtools::install_github("koheiw/LSS")
require(LSS)

# LSS estimates word polarity from sentence-level co-occurrence,
# so reshape the corpus into sentences before tokenizing
toks_sent <- corp %>% 
    corpus_reshape('sentences') %>% 
    tokens(remove_punct = TRUE)
mt_sent <- toks_sent %>% 
    dfm(remove = stopwords()) %>% 
    dfm_select('^[0-9a-zA-Z]+$', valuetype = 'regex') %>% 
    dfm_trim(min_termfreq = 5)

# Take the 500 words most strongly associated with 'econom*'
eco <- head(char_keyness(toks_sent, 'econom*', window = 10), 500)
# Estimate polarity of the economy-related words from the sentiment seed words
lss <- textmodel_lss(mt_sent, seedwords('pos-neg'), features = eco, cache = TRUE)

Sentiment seed words

seedwords('pos-neg')
##        good        nice   excellent    positive   fortunate     correct 
##           1           1           1           1           1           1 
##    superior         bad       nasty        poor    negative unfortunate 
##           1          -1          -1          -1          -1          -1 
##       wrong    inferior 
##          -1          -1

Economic sentiment words

head(coef(lss), 20) # most positive words
##    positive    emerging       china   expecting cooperation        drag 
##  0.03941849  0.03906230  0.03249657  0.03172653  0.03014628  0.02910002 
##        asia        prof sustainable    academic challenging     markets 
##  0.02873776  0.02849241  0.02765765  0.02732935  0.02682587  0.02644153 
##   investors   prospects       stock      better   uncertain   strategic 
##  0.02637175  0.02570061  0.02538831  0.02479903  0.02358457  0.02346433 
##         hit     chinese 
##  0.02302436  0.02291012

tail(coef(lss), 20) # most negative words
##     downturn       macron     suggests     downbeat         debt 
##  -0.03257441  -0.03258820  -0.03277442  -0.03309793  -0.03386571 
##         data policymakers   unbalanced       shrink unemployment 
##  -0.03524238  -0.03745570  -0.03934074  -0.03944345  -0.03987688 
##    suggested          bad     pantheon      cutting       shocks 
##  -0.04036920  -0.04047418  -0.04054125  -0.04082423  -0.04267978 
##        rates          rba         rate          cut     negative 
##  -0.04405703  -0.04789902  -0.05417844  -0.05498620  -0.05697134

Sentiment of news around Brexit vote by LSS
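
The corresponding figure scores the full articles with the fitted model. A sketch, assuming predict() is applied to the document-level dfm (as in the LSS package examples) and smoothed as before:

# Score each article with the fitted LSS model, scale, and smooth
data$lss <- as.numeric(scale(predict(lss, newdata = mt)))
lo_lss <- loess(lss ~ as.numeric(date), data = data, span = 0.1)
plot(data$date, data$lss, pch = 16, col = rgb(0, 0, 0, 0.1),
     xlab = "", ylab = "LSS sentiment (scaled)")
lines(data$date[ord], predict(lo_lss)[ord], col = "blue", lwd = 2)
abline(v = as.Date("2016-06-23"), lty = 2)  # EU referendum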

Compare LSD and LSS
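
One way to draw this comparison is to overlay the two smoothed series and report their correlation; a sketch that assumes the objects created in the previous snippets:

# Overlay the smoothed LSD and LSS series on a common scale
plot(data$date[ord], predict(lo)[ord], type = "l", col = "red", lwd = 2,
     ylim = range(predict(lo), predict(lo_lss)),
     xlab = "", ylab = "Sentiment (scaled)")
lines(data$date[ord], predict(lo_lss)[ord], col = "blue", lwd = 2)
legend("topright", legend = c("LSD", "LSS"), col = c("red", "blue"), lwd = 2)
cor(data$lsd, data$lss)  # document-level agreement between the two measures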

Seed words for different dimensions

The sentiment seed words are readily available, but you can also make your own seed words for other dimensions.

# concern
seed_concern <- c("concern*", "worr*", "anxi*")

# weakening
seed_weakening <- c("declin*", "weak*", "retreat*")

# indicator vs consequence
seed_ecoframe <- c('рост*' = 1, 'инфляци*' = 1, 'безработиц*' = 1,
                 # 'growth'     'inflation'   'unemployment'
                   'рубл*' = 1, 'бедност*' = -1, 'сокращени* доходов' = -1,
                 # 'currency'  'poverty'        'wage reduction'
                   'забастовк*' = -1, 'увольнени* работник*' = 1, 'потер*' = -1)
                 # 'strikes'          'layoff'                    'economic loss'
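
Fitting a model on such a custom dimension uses the same call as before. A sketch (not from the talk), assuming textmodel_lss() accepts a plain character vector of seed patterns and reusing the economy-related feature set:

# Scale news on the "concern" dimension with the custom seed words
lss_concern <- textmodel_lss(mt_sent, seed_concern,
                             features = eco, cache = TRUE)
head(coef(lss_concern), 10)  # words most strongly associated with concern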

Conclusions

You can locate documents on sentiment or any other dimension at low cost using LSS.

  • It is an ideal tool to generate time series data from news articles
  • It can be used in economic research along with macro-economic data

When you use LSS, please be aware that

  • It requires a large corpus of texts (usually 5,000 or more full-text articles)
  • It is affected by how texts are tokenized (punctuation, function words, etc.)
  • Its predictions for individual documents are less accurate than those of fully supervised models

More about Quanteda