18 April 2018
quanteda is an R package for quantitative text analysis developed by a team based at the LSE.
Quanteda Initiative CIC was founded to support the community of text analysts.
We want to discover theoretically interesting patterns in a corpus of texts from a social scientific point of view.
We usually need a large corpus to find interesting patterns.
Tokenized texts are non-rectangular
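The outputs below are consistent with a short example vector like the following; the definition of txt is not shown in the original slides, so this is an assumption (with library(quanteda) attached):

library(quanteda)
txt <- c("What is it?", "That is a dolphine.", "No, it is a killer whale!")
print(txt)
as.list(tokens(txt, remove_punct = TRUE))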
## [1] "What is it?" "That is a dolphine." ## [3] "No, it is a killer whale!"
## $text1 ## [1] "What" "is" "it" ## ## $text2 ## [1] "That" "is" "a" "dolphine" ## ## $text3 ## [1] "No" "it" "is" "a" "killer" "whale"
Document-feature matrix is sparse
## [1] "What is it?" "That is a dolphine." ## [3] "No, it is a killer whale!"
## features ## docs what is it that a dolphine no killer whale ## text1 1 1 1 0 0 0 0 0 0 ## text2 0 1 0 1 1 1 0 0 0 ## text3 0 1 1 0 1 0 1 1 1
We need very efficient tools to process large sparse matrices.
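As a rough illustration (not from the original slides), compare the memory footprint of a mostly-zero base matrix with its sparse counterpart:

library(Matrix)
dense <- matrix(0, nrow = 1000, ncol = 1000)  # one million cells, almost all zero
dense[1, 1] <- 1
sparse <- as(dense, "dgCMatrix")  # stores only the non-zero entry
object.size(dense)   # about 8 MB
object.size(sparse)  # a few KB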
- character is not memory efficient when vectors are short
- data.frame does not allow variables of different lengths (rectangular data)
- matrix records both zero and non-zero values (a dense matrix)

quanteda has tokens for tokenized texts.
toks <- tokens(txt, remove_punct = TRUE)
print(toks)
## tokens from 3 documents.
## text1 :
## [1] "What" "is"   "it"
##
## text2 :
## [1] "That"     "is"       "a"        "dolphine"
##
## text3 :
## [1] "No"     "it"     "is"     "a"      "killer" "whale"
quanteda has dfm for document-feature matrices.
mt <- dfm(toks)
print(mt)
## Document-feature matrix of: 3 documents, 9 features (51.9% sparse).
## 3 x 9 sparse Matrix of class "dfm"
##        features
## docs    what is it that a dolphine no killer whale
##   text1    1  1  1    0 0        0  0      0     0
##   text2    0  1  0    1 1        1  0      0     0
##   text3    0  1  1    0 1        0  1      1     1
quanteda has many specialized methods for tokens and dfm.

tokens

- tokens_select() selects tokens by patterns
- tokens_compound() compounds multiple tokens into a single token
- tokens_lookup() finds dictionary words

dfm

- dfm_select() selects features by patterns
- dfm_lookup() finds dictionary words
- dfm_group() groups multiple documents into a single document

A complete list of quanteda's functions is available at the documentation site (two of the tokens methods are sketched below).
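A minimal sketch, applying two of these methods to the toks object created above (outputs omitted):

tokens_select(toks, c("dolphine", "whale"))    # keep only the matching tokens
tokens_compound(toks, phrase("killer whale"))  # merge "killer" "whale" into "killer_whale"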
tokens is an extension of list (S3).
str(unclass(toks))
## List of 3
##  $ text1: int [1:3] 1 2 3
##  $ text2: int [1:4] 4 2 5 6
##  $ text3: int [1:6] 7 3 2 5 8 9
##  - attr(*, "types")= chr [1:9] "What" "is" "it" "That" ...
##  - attr(*, "padding")= logi FALSE
##  - attr(*, "what")= chr "word"
##  - attr(*, "ngrams")= int 1
##  - attr(*, "skip")= int 0
##  - attr(*, "concatenator")= chr "_"
##  - attr(*, "docvars")='data.frame': 3 obs. of 0 variables
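In other words, each document is stored as integer indices into the shared "types" attribute. As a quick check (a sketch, not in the original):

attr(toks, "types")[unclass(toks)$text2]  # ids 4 2 5 6 map back to the tokens of text2
## [1] "That"     "is"       "a"        "dolphine"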
dfm inherits Matrix::dgCMatrix (S4).
mt@Dim
## [1] 3 9
mt@Dimnames
## $docs
## [1] "text1" "text2" "text3"
##
## $features
## [1] "what"     "is"       "it"       "that"     "a"        "dolphine"
## [7] "no"       "killer"   "whale"
mt@i
## [1] 0 0 1 2 0 2 1 1 2 1 2 2 2
mt@p
## [1] 0 1 4 6 7 9 10 11 12 13
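In this compressed sparse column (CSC) layout, the non-zero rows of column j are stored in mt@i[(mt@p[j] + 1):mt@p[j + 1]] as 0-based row numbers. A quick check for the second column, "is", which occurs in all three documents (a sketch, not part of the original):

j <- 2
mt@i[(mt@p[j] + 1):mt@p[j + 1]] + 1  # 1-based row numbers of non-zeros in column "is"
## [1] 1 2 3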
as.tokens(list(c("I", "like", "dogs"), c("He", "likes", "cats")))
## tokens from 2 documents.
## text1 :
## [1] "I"    "like" "dogs"
##
## text2 :
## [1] "He"    "likes" "cats"
toks %>% as.list() %>% as.tokens() %>% class()
## [1] "tokens"
as.dfm(matrix(c(1, 0, 2, 1), nrow = 2,
              dimnames = list(c("doc1", "doc2"), c("dogs", "cats"))))
## Document-feature matrix of: 2 documents, 2 features (25% sparse).
## 2 x 2 sparse Matrix of class "dfm"
##       features
## docs   dogs cats
##   doc1    1    2
##   doc2    0    1
as.dfm(rbind("doc" = c(3, 4)))
## Document-feature matrix of: 1 document, 2 features (0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
##      features
## docs  feat1 feat2
##   doc     3     4
mt %>% as.matrix() %>% as.dfm() %>% class()
## [1] "dfm" ## attr(,"package") ## [1] "quanteda"
as(mt, "dgCMatrix")
## 3 x 9 sparse Matrix of class "dgCMatrix"
##        features
## docs    what is it that a dolphine no killer whale
##   text1    1  1  1    . .        .  .      .     .
##   text2    .  1  .    1 1        1  .      .     .
##   text3    .  1  1    . 1        .  1      1     1
as(mt, "dgTMatrix")
## 3 x 9 sparse Matrix of class "dgTMatrix"
##        features
## docs    what is it that a dolphine no killer whale
##   text1    1  1  1    . .        .  .      .     .
##   text2    .  1  .    1 1        1  .      .     .
##   text3    .  1  1    . 1        .  1      1     1
dgmt <- as(mt, "dgTMatrix")
dgmt@i
## [1] 0 0 1 2 0 2 1 1 2 1 2 2 2
dgmt@j
## [1] 0 1 1 1 2 2 3 4 4 5 6 7 8
dgmt@x
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1
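In this triplet (COO) layout, entry k simply records its 0-based row i[k], column j[k] and value x[k]. For example, the first triplet is the count of "what" in text1 (a sketch, not in the original):

dgmt[dgmt@i[1] + 1, dgmt@j[1] + 1]
## [1] 1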
types(toks)
## [1] "What" "is" "it" "That" "a" "dolphine" ## [7] "No" "killer" "whale"
pattern2id("dolphine", types(toks), "fixed", TRUE)
## [[1]]
## [1] 6
pattern2id(c("dolphine", "whale"), types(toks), "fixed", TRUE)
## [[1]]
## [1] 6
##
## [[2]]
## [1] 9
pattern2id(phrase("killer whale"), types(toks), "fixed", TRUE)
## [[1]]
## [1] 8 9
pattern2id("^wha.*", types(toks), "regex", TRUE)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 9
pattern2fixed("^wha.*", types(toks), "regex", TRUE)
## [[1]]
## [1] "What"
##
## [[2]]
## [1] "whale"
pattern2id("wha*", types(toks), "glob", TRUE)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 9
pattern2fixed("wha*", types(toks), "glob", TRUE)
## [[1]]
## [1] "What"
##
## [[2]]
## [1] "whale"
featnames(mt)
## [1] "what" "is" "it" "that" "a" "dolphine" ## [7] "no" "killer" "whale"
id <- pattern2id("wha*", featnames(mt), "glob", TRUE)
mt[, unlist(id)]
## Document-feature matrix of: 3 documents, 2 features (66.7% sparse).
## 3 x 2 sparse Matrix of class "dfm"
##        features
## docs    what whale
##   text1    1     0
##   text2    0     0
##   text3    0     1
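For comparison (a sketch, not in the original), the same selection is available through the higher-level API:

dfm_select(mt, "wha*", valuetype = "glob")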
newsmap is a semi-supervised model for geographical document classification, originally created for International Newsmap.
newsmap is a semi-supervised multinomial naive Bayes classifier. It searches tokens for place names to assign country labels (weak supervision).

data_dictionary_newsmap_en[["AMERICA"]]["NORTH"]
## Dictionary object with 1 primary key entry and 2 nested levels.
## - [NORTH]:
##   - [BM]:
##     - bermuda, bermudan*
##   - [CA]:
##     - canada, canadian*, ottawa, toronto, quebec
##   - [GL]:
##     - greenland, greenlander*, nuuk
##   - [PM]:
##     - saint pierre and miquelon, st pierre and miquelon, saint pierrais, miquelonnais, saint pierre
##   - [US]:
##     - united states, us, american*, washington, new york
data_dictionary_newsmap_de[["AMERICA"]]["NORTH"]
## Dictionary object with 1 primary key entry and 2 nested levels.
## - [NORTH]:
##   - [BM]:
##     - bermuda, bermudas
##   - [CA]:
##     - kanada, kanadas, kanadisch*, kanadier*, ottawa, toronto, quebec
##   - [GL]:
##     - grönland, grönlands, grönländisch*, nuuk
##   - [PM]:
##     - saint-pierre und miquelon, saint pierre
##   - [US]:
##     - vereinigte staaten von amerika, united states, vereinigte staaten, usa, us, amerikas, amerikanisch*, washington, new york
data_dictionary_newsmap_ja[["AMERICA"]]["NORTH"]
## Dictionary object with 1 primary key entry and 2 nested levels.
## - [NORTH]:
##   - [BM]:
##     - バミューダ*
##   - [CA]:
##     - カナダ*, オタワ, トロント, ケベック
##   - [GL]:
##     - グリーンランド*, ヌーク
##   - [PM]:
##     - サンピエール・ミクロン島*, サンピエール, ミクロン
##   - [US]:
##     - 米国*, アメリカ*, ワシントン, ニューヨーク
data <- readRDS("/home/kohei/Dropbox/Public/data_corpus_yahoonews.rds")
data$text <- paste0(data$head, ". ", data$body)
data$body <- NULL
corp_full <- corpus(data, text_field = 'text')
corp <- corpus_subset(corp_full, '2014-01-01' <= date & date <= '2014-12-31')
ndoc(corp)
## [1] 156980
month <- c("January", "February", "March", "April", "May", "June",
           "July", "August", "September", "October", "November", "December")
day <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
agency <- c("AP", "AFP", "Reuters")
toks <- tokens(corp) %>%
    tokens_remove(stopwords("en"), valuetype = "fixed", padding = TRUE) %>%
    tokens_remove(c(month, day, agency), valuetype = "fixed", padding = TRUE)
label_mt <- dfm(tokens_lookup(toks, data_dictionary_newsmap_en, levels = 3))
feat_mt <- dfm(toks, tolower = FALSE) %>%
    dfm_select(selection = "keep", "^[A-Z][A-Za-z1-2]+", valuetype = "regex",
               case_insensitive = FALSE) %>%
    dfm_trim(min_termfreq = 10)
newsmap <- textmodel_newsmap(feat_mt, label_mt)
summary(newsmap, n = 10)
## Classes:
##   bi, dj, er, et, ke, km, mg, mu, mw, mz ...
## Features:
##   French, Ivanovic, Safarova, PARIS, Former, Open, Ana, Lucie, Czech, Republic ...
## Documents:
##   text63, text68, text69, text73, text78, text79, text84, text85, text86, text92 ...
tb <- table(predict(newsmap))
barplot(head(sort(tb, decreasing = TRUE), 20))
newsmap uses quanteda APIs in both estimation and prediction.
LSS implements a semi-supervised document scaling model, built on quanteda, to perform theory-driven analysis at low cost.
LSS (latent semantic scaling) is an application of LSA (latent semantic analysis) to document scaling.
seedwords('pos-neg')
##        good        nice   excellent    positive   fortunate     correct
##           1           1           1           1           1           1
##    superior         bad       nasty        poor    negative unfortunate
##           1          -1          -1          -1          -1          -1
##       wrong    inferior
##          -1          -1
# concern
c("concern*", "worr*", "anxi*")
## [1] "concern*" "worr*" "anxi*"
# dysfunction
c("dysfunct*", "paralysi*", "stalemate", "standstill", "gridlock", "deadlock")
## [1] "dysfunct*" "paralysi*" "stalemate" "standstill" "gridlock" ## [6] "deadlock"
In LSS, documents are split into sentences to estimate semantic proximity from the immediate contexts of words. This makes the document-feature matrix extremely sparse.
corp <- readRDS("/home/kohei/Dropbox/Public/data_corpus_guardian2016-10k.rds")
sent_toks <- corp %>%
    corpus_reshape("sentences") %>%
    tokens(remove_punct = TRUE) %>%
    tokens_remove(stopwords("en")) %>%
    tokens_select("^[0-9a-zA-Z]+$", valuetype = "regex")
sent_mt <- sent_toks %>%
    dfm() %>%
    dfm_trim(min_termfreq = 5)
ndoc(corp)
## [1] 10000
ndoc(sent_mt)
## [1] 392697
sparsity(sent_mt)
## [1] 0.9996725
LSS performs feature selection based on collocations to construct domain-specific sentiment models.
eco <- head(char_keyness(sent_toks, 'econom*'), 500)
head(eco, 30)
## [1] "growth" "global" "outlook" "eurozone" "slowdown" ## [6] "slowing" "uncertainty" "recession" "markets" "uk" ## [11] "zew" "financial" "gdp" "markit" "world" ## [16] "brexit" "forecasts" "china" "rolling" "jobs" ## [21] "prospects" "quarter" "davos" "williamson" "emerging" ## [26] "ihs" "downturn" "risks" "tombs" "recovery"
lss <- textmodel_lss(sent_mt, seedwords('pos-neg'), features = eco, cache = TRUE)
head(coef(lss), 30)
##        good     positive     emerging        china    expecting  cooperation
##   0.05252521   0.03941849   0.03906230   0.03249657   0.03172653   0.03014628
##        drag          asia         prof    turbulent  sustainable  challenging
##   0.02910002   0.02873776   0.02849241   0.02824677   0.02765765   0.02682587
##     markets     investors       remain      beijing    prospects        stock
##   0.02644153   0.02637175   0.02629941   0.02572956   0.02570061   0.02538831
##      better     uncertain          hit      chinese    sentiment       mining
##   0.02479903   0.02358457   0.02302436   0.02291012   0.02269737   0.02230245
##      failed        threat       robust       argued  consultancy       oxford
##   0.02213793   0.02196979   0.02192322   0.02192176   0.02190114   0.02176545
tail(coef(lss), 30)
##       yellen        banks consequences       carney    inflation
##  -0.02971046  -0.03026749  -0.03048922  -0.03056076  -0.03126168
##    countries     downturn       macron     suggests     downbeat
##  -0.03194409  -0.03257441  -0.03258820  -0.03277442  -0.03309793
##         debt  assumptions         data policymakers    borrowing
##  -0.03386571  -0.03416451  -0.03524238  -0.03745570  -0.03815934
##   unbalanced       shrink unemployment          bad     pantheon
##  -0.03934074  -0.03944345  -0.03987688  -0.04047418  -0.04054125
##      cutting       shocks         hike        rates          mpc
##  -0.04082423  -0.04267978  -0.04370420  -0.04405703  -0.04427978
##       easing          rba         rate          cut     negative
##  -0.04577516  -0.04789902  -0.05417844  -0.05498620  -0.05697134
doc_mt <- dfm(corp)
data_pred <- as.data.frame(predict(lss, newdata = doc_mt, density = TRUE))
data_pred$date <- docvars(doc_mt, 'date')
data_pred <- subset(data_pred, density > quantile(density, 0.25))
head(data_pred)
##                    fit    density       date
## text141269  0.74545003 0.02946955 2016-04-07
## text151432 -0.01325452 0.08580343 2016-05-10
## text169265 -0.24909294 0.05876393 2016-09-01
## text134827 -0.04752371 0.04449649 2016-03-08
## text133324  1.80075143 0.05355191 2016-03-03
## text132839  0.55650416 0.04322034 2016-02-22
par(mar = c(4, 4, 1, 1))
plot(data_pred$date, data_pred$fit, pch = 16, col = rgb(0, 0, 0, 0.1),
     ylim = c(-0.5, 0.5), ylab = "Economic sentiment", xlab = "Time")
lines(lowess(data_pred$date, data_pred$fit, f = 0.1), col = 1)
abline(h = 0, v = as.Date("2016-06-23"), lty = c(1, 3))
The LSS package is implemented using RSpectra's SVD engine and quanteda's APIs.
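A minimal sketch of the kind of truncated SVD this involves (this is not the LSS source code, and k = 300 is an arbitrary choice here):

library(RSpectra)
svd_k <- svds(as(sent_mt, "dgCMatrix"), k = 300)  # only the 300 leading singular vectors
dim(svd_k$v)  # one row of word vectors per feature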
quanteda's APIs help you to quickly develop your own models.

- pattern2id() helps you handle patterns, multi-word expressions and Unicode characters (see the sketch at the end)
- tokens and dfm objects can be created using as.tokens() and as.dfm()
- tokens_*() and dfm_*() functions are optimized for large textual data
- textstat_*() functions are useful as packages' internal functions

You can also contribute to the development of quanteda.
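As an illustration of building on these APIs, here is a hypothetical helper; count_docs_with() is not part of quanteda:

# count documents containing any feature that matches a glob pattern
count_docs_with <- function(x, pattern) {
    ids <- unlist(pattern2id(pattern, featnames(x), "glob", TRUE))
    sum(rowSums(x[, ids]) > 0)
}
count_docs_with(mt, "wha*")  # text1 has "what", text3 has "whale"
## [1] 2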