Kohei Watanabe (Waseda, LSE)
6 June 2018
quanteda is an R package for quantitative text analysis developed by a team based at the LSE.
In quantitative text analysis, we use the same technologies as natural language processing (NLP) but for different goals.
It is not easy to automate theoretically grounded analysis because of the cost-control trade-off.
lengths(data_dictionary_LSD2015)
##     negative     positive neg_positive neg_negative 
##         2858         1709         1721         2860
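To see what the dictionary contains, you can inspect a few of the glob patterns under each key (a quick illustrative check, not part of the original slides):

```r
library(quanteda)
# convert the dictionary to a plain list and peek at the first patterns
head(as.list(data_dictionary_LSD2015)$negative)
head(as.list(data_dictionary_LSD2015)$positive)
```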
corp <- readRDS('/home/kohei/Dropbox/Public/data_corpus_guardian2016-10k.rds')
ndoc(corp)
## [1] 10000
range(docvars(corp, "date"))
## [1] "2016-01-01" "2016-12-31"
mt <- dfm(corp, remove_punct = TRUE, remove = stopwords())
mt_dict <- dfm_lookup(mt, data_dictionary_LSD2015[1:2])
sent <- (mt_dict[,2] - mt_dict[,1]) / (rowSums(mt) + 1)
data <- data.frame(date = docvars(mt_dict, "date"),
                   lsd = as.numeric(scale(sent)))
dim(mt)
## [1] 10000 121885
head(mt[1:6, 1:6])
## Document-feature matrix of: 6 documents, 6 features (80.6% sparse).
## 6 x 6 sparse Matrix of class "dfm"
##             features
## docs         70-year-old hermit spent entire life siberian
##   text120761           1      1     1      1    2        1
##   text174574           0      0     0      0    0        0
##   text141269           0      0     0      0    0        0
##   text151432           0      0     0      0    0        0
##   text169265           0      0     0      0    0        0
##   text134827           0      0     0      0    1        0
dim(mt_dict)
## [1] 10000 2
head(mt_dict)
## Document-feature matrix of: 6 documents, 2 features (0% sparse).
## 6 x 2 sparse Matrix of class "dfm"
##             features
## docs         negative positive
##   text120761       11        6
##   text174574       17       16
##   text141269       20       38
##   text151432       15       27
##   text169265       24       27
##   text134827       66       50
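The `data` frame built above pairs each article's date with its standardized dictionary sentiment, so it can be plotted as a time series. A minimal sketch using base graphics; the `lowess` smoothing span is an illustrative choice, not part of the original analysis:

```r
# scatter of daily article sentiment, with a smoothed trend line
plot(data$date, data$lsd, col = rgb(0, 0, 0, 0.1), pch = 16,
     xlab = "Date", ylab = "Sentiment (LSD, standardized)")
lines(lowess(data$date, data$lsd, f = 0.05), col = "red", lwd = 2)
abline(h = 0, lty = 2)  # neutral sentiment
```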
devtools::install_github("koheiw/LSS")
require(LSS)

toks_sent <- corp %>%
    corpus_reshape('sentences') %>%
    tokens(remove_punct = TRUE)
mt_sent <- toks_sent %>%
    dfm(remove = stopwords()) %>%
    dfm_select('^[0-9a-zA-Z]+$', valuetype = 'regex') %>%
    dfm_trim(min_termfreq = 5)

eco <- head(char_keyness(toks_sent, 'econom*', window = 10), 500)
lss <- textmodel_lss(mt_sent, seedwords('pos-neg'), features = eco, cache = TRUE)
seedwords('pos-neg')
##        good        nice   excellent    positive   fortunate     correct 
##           1           1           1           1           1           1 
##    superior         bad       nasty        poor    negative unfortunate 
##           1          -1          -1          -1          -1          -1 
##       wrong    inferior 
##          -1          -1
head(coef(lss), 20) # most positive words
##    positive    emerging       china   expecting cooperation        drag 
##  0.03941849  0.03906230  0.03249657  0.03172653  0.03014628  0.02910002 
##        asia        prof sustainable    academic challenging     markets 
##  0.02873776  0.02849241  0.02765765  0.02732935  0.02682587  0.02644153 
##   investors   prospects       stock      better   uncertain   strategic 
##  0.02637175  0.02570061  0.02538831  0.02479903  0.02358457  0.02346433 
##         hit     chinese 
##  0.02302436  0.02291012
tail(coef(lss), 20) # most negative words
##     downturn       macron     suggests     downbeat         debt 
##  -0.03257441  -0.03258820  -0.03277442  -0.03309793  -0.03386571 
##         data policymakers   unbalanced       shrink unemployment 
##  -0.03524238  -0.03745570  -0.03934074  -0.03944345  -0.03987688 
##    suggested          bad     pantheon      cutting       shocks 
##  -0.04036920  -0.04047418  -0.04054125  -0.04082423  -0.04267978 
##        rates          rba         rate          cut     negative 
##  -0.04405703  -0.04789902  -0.05417844  -0.05498620  -0.05697134
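Once the word polarities are estimated, the fitted model can score whole documents. A sketch using the package's `predict()` method on the document-level dfm from earlier; standardizing the scores with `scale()` is an assumption for comparability, not prescribed by the package:

```r
# score each article on the economy-specific sentiment dimension
pred <- predict(lss, newdata = mt)
data_lss <- data.frame(date = docvars(mt, "date"),
                       fit = as.numeric(scale(pred)))
head(data_lss)
```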
The sentiment seed words are already available, but you can also make your own seed words.
# concern
seed_concern <- c("concern*", "worr*", "anxi*")

# weakening
seed_weakening <- c("declin*", "weak*", "retreat*")

# indicator vs consequence
seed_ecoframe <- c('рост*' = 1, 'инфляци*' = 1, 'безработиц*' = 1,        # 'growth', 'inflation', 'unemployment'
                   'рубл*' = 1, 'бедност*' = -1, 'сокращени* доходов' = -1, # 'currency', 'poverty', 'wage reduction'
                   'забастовк*' = -1, 'увольнени* работник*' = 1, 'потер*' = -1) # 'strikes', 'layoff', 'economic loss'
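Custom seeds plug into `textmodel_lss()` the same way as the built-in ones; glob patterns are matched against the dfm's features, and an unweighted character vector defines a one-directional dimension. A hypothetical sketch reusing `mt_sent` and `eco` from above:

```r
# fit a "concern" dimension with the custom seed words defined above
seed_concern <- c("concern*", "worr*", "anxi*")
lss_concern <- textmodel_lss(mt_sent, seed_concern, features = eco, cache = TRUE)
head(coef(lss_concern), 10)  # words most associated with concern
```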
You can locate documents on sentiment or any other dimensions at low cost using LSS.
When you use LSS, please be aware that it estimates word polarity from co-occurrence patterns, so it requires a large corpus to produce reliable results.