=========================================
(C) 2015 Wouter van Atteveldt, license: [CC-BY-SA]
The most important object in frequency-based text analysis is the document term matrix. This matrix contains the documents in the rows and terms (words) in the columns, and each cell is the frequency of that term in that document.
In R, these matrices are provided by the tm (text mining) package. Although this package provides many functions for loading and manipulating these matrices, using them directly is relatively complicated.
Fortunately, the RTextTools package provides an easy function to create a document-term matrix from a data frame. To create a document-term matrix from a simple data frame with a ‘text’ column, use the create_matrix function (with removeStopwords=F to make sure all words are kept):
library(RTextTools)
input = data.frame(text=c("Chickens are birds", "The bird eats"))
m = create_matrix(input$text, removeStopwords=F)
We can inspect the resulting matrix m using the regular R functions to get e.g. the type of object and the dimensionality:
class(m)
## [1] "DocumentTermMatrix" "simple_triplet_matrix"
dim(m)
## [1] 2 6
m
## <<DocumentTermMatrix (documents: 2, terms: 6)>>
## Non-/sparse entries: 6/6
## Sparsity : 50%
## Maximal term length: 8
## Weighting : term frequency (tf)
So, m is a DocumentTermMatrix, which is derived from a simple_triplet_matrix as provided by the slam package. Internally, document-term matrices are stored as sparse matrices: with real data we can easily have hundreds of thousands of rows and columns, while the vast majority of cells will be zero (most words don’t occur in most documents). Storing this as a regular matrix would waste a lot of memory. In a sparse matrix, only the non-zero entries are stored, as ‘simple triplets’ of (document, term, frequency).
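As a rough illustration of the triplet idea, the following base R sketch builds the (document, term, frequency) triplets for the two example sentences by hand. It does not use tm or slam; the naive tokenization (lowercasing and splitting on whitespace) and all variable names are assumptions made for this example only.

```r
# Toy corpus: the same two sentences used above
texts <- c("Chickens are birds", "The bird eats")

# Naive tokenization (an assumption for this sketch): lowercase, split on whitespace
tokens <- strsplit(tolower(texts), "\\s+")

# One row per token, recording which document it came from
long <- data.frame(doc  = rep(seq_along(tokens), lengths(tokens)),
                   term = unlist(tokens))

# Cross-tabulate to get every (document, term) cell, including the zeros
cells <- as.data.frame(table(long), stringsAsFactors = FALSE)

# A sparse representation keeps only the non-zero cells as triplets
triplets <- cells[cells$Freq > 0, ]

nrow(cells)     # number of cells a regular (dense) matrix would hold
nrow(triplets)  # number of triplets a sparse matrix needs to store
```

Here 6 of the 12 cells are non-zero, which corresponds to the 50% sparsity reported in the output above.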
As seen in the output of dim, our matrix has only 2 rows (documents) and 6 columns (unique words). Since this is a fairly small matrix, we can visualize it using as.matrix, which converts the ‘sparse’ matrix into a regular matrix:
as.matrix(m)
## Terms
## Docs are bird birds chickens eats the
## 1 1 0 1 1 0 0
## 2 0 1 0 0 1 1
So, we can see that each word is kept as is. We can reduce the size of the matrix by dropping stop words and stemming, i.e. changing a word like ‘chickens’ to its base form or stem ‘chicken’ (see the create_matrix documentation for the full range of options):
m = create_matrix(input$text, removeStopwords=T, stemWords=T, language='english')
dim(m)
## [1] 2 3
as.matrix(m)
## Terms
## Docs bird chicken eat
## 1 1 1 0
## 2 1 0 1
As you can see, the stop words (the and are) are removed, the verb form eats is reduced to its stem eat, and the singular and plural forms bird and birds are joined together under the stem bird.
In RTextTools, the language for stemming and stop words can be given as a parameter, and the default is English. Note that stemming works relatively well for English, but is less useful for more highly inflected languages such as Dutch or German. An easy way to see the effects of the preprocessing is by looking at the colSums of a matrix, which gives the total frequency of each term:
colSums(as.matrix(m))
## bird chicken eat
## 2 1 1
For more richly inflected languages like Dutch, the result is less promising:
text = c("De kip eet", "De kippen hebben gegeten")
m = create_matrix(text, removeStopwords=T, stemWords=T, language="dutch")
colSums(as.matrix(m))
## eet geget kip kipp
## 1 1 1 1
As you can see, de and hebben are correctly recognized as stop words, but gegeten (eaten) and kippen (chickens) have a different stem than eet (eat) and kip (chicken). German gets similarly bad results.
If we want to move beyond stemming, one option is to use AmCAT to parse articles. Before we can proceed, we need to save our AmCAT password (only needed once; please don’t save the password in your file!) and log on to AmCAT:
library(amcatr)
amcat.save.password("https://amcat.nl", "username", "password")
conn = amcat.connect("https://amcat.nl")
## Loading required package: rjson
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: plyr
## Loading required package: httr
Now, we can upload articles from R using the amcat.upload.articles function, which we now demonstrate with a single article but which can also be used to upload many articles at once:
articles = data.frame(text = "John is a great fan of chickens, and so is Mary", date="2001-01-01", headline="test")
aset = amcat.upload.articles(conn, project = 1, articleset="Test Florence", medium="test",
text=articles$text, date=articles$date, headline=articles$headline)
## Created articleset 21997: Test Florence in project 1
## Uploading 1 articles to set 21997
And we can then lemmatize this article and download the results directly to R using amcat.gettokens:
amcat.gettokens(conn, project=1, articleset = aset, module = "corenlp_lemmatize")
## GET https://amcat.nl/api/v4/projects/1/articlesets/21997/tokens/?page=1&module=corenlp_lemmatize&page_size=1&format=csv
## GET https://amcat.nl/api/v4/projects/1/articlesets/21997/tokens/?page=2&module=corenlp_lemmatize&page_size=1&format=csv
## word sentence pos lemma offset aid id pos1
## 1 John 1 NNP John 0 114440106 1 M
## 2 is 1 VBZ be 5 114440106 2 V
## 3 a 1 DT a 8 114440106 3 D
## 4 great 1 JJ great 10 114440106 4 A
## 5 fan 1 NN fan 16 114440106 5 N
## 6 of 1 IN of 20 114440106 6 P
## 7 chickens 1 NNS chicken 23 114440106 7 N
## 8 , 1 , , 31 114440106 8 .
## 9 and 1 CC and 33 114440106 9 C
## 10 so 1 RB so 37 114440106 10 B
## 11 is 1 VBZ be 40 114440106 11 V
## 12 Mary 1 NNP Mary 43 114440106 12 M
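And we can see that, for example, the lemma “be” is given for “is”. In the output shown here, the tokens appear in their original order and the two occurrences of “is” are listed separately. If amcat.gettokens instead returns the words out of order with repeated words summed, this can be switched off by giving drop=NULL as an extra argument.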
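To make the benefit of lemmatizing concrete, the token table above can be counted with base R. The sketch below copies the aid, lemma and pos1 columns from that output into a data frame by hand; the object names are assumptions for this example, and the selection of the pos1 codes M and N (which mark John, Mary, fan and chickens in the output) is just one possible filter.

```r
# Lemma and coarse part-of-speech columns, copied from the output above
tokens <- data.frame(
  aid   = 114440106,
  lemma = c("John", "be", "a", "great", "fan", "of",
            "chicken", ",", "and", "so", "be", "Mary"),
  pos1  = c("M", "V", "D", "A", "N", "P",
            "N", ".", "C", "B", "V", "M"),
  stringsAsFactors = FALSE
)

# Frequency per lemma: both occurrences of "is" are counted under "be"
lemma_freq <- table(tokens$lemma)
lemma_freq[["be"]]

# Keep only the tokens tagged M or N, then count lemmas per document
content <- tokens[tokens$pos1 %in% c("M", "N"), ]
counts <- table(doc = content$aid, lemma = content$lemma)
counts
```

The resulting table has one row per document and one column per remaining lemma, which is the same shape as a document-term matrix.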
For a more serious application, we will use an existing article set: set 16017 in project 559, which contains the State of the Union speeches by Bush and Obama (each document is a single paragraph). The analysed tokens for this set can be downloaded with the following command:
sotu.tokens = amcat.gettokens(conn, project=559, articleset = 16017, module = "corenlp_lemmatize", page_size = 100)
Note: the command above might take a while to complete. This data is also available directly from the semnet package:
library(semnet)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: wordcloud
## Loading required package: RColorBrewer
## Loading required package: scales
## Loading required package: tm
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:httr':
##
## content
##
## Loading required package: slam
## Loading required package: igraph
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:base':
##
## crossprod, tcrossprod
data(sotu)
nrow(sotu.tokens)
## [1] 91473
head(sotu.tokens, n=20)
## word sentence pos lemma offset aid id pos1 freq
## 1 It 1 PRP it 0 111541965 1 O 1
## 2 is 1 VBZ be 3 111541965 2 V 1
## 3 our 1 PRP$ we 6 111541965 3 O 1
## 4 unfinished 1 JJ unfinished 10 111541965 4 A 1
## 5 task 1 NN task 21 111541965 5 N 1
## 6 to 1 TO to 26 111541965 6 ? 1
## 7 restore 1 VB restore 29 111541965 7 V 1
## 8 the 1 DT the 37 111541965 8 D 1
## 9 basic 1 JJ basic 41 111541965 9 A 1
## 10 bargain 1 NN bargain 47 111541965 10 N 1
## 11 that 1 WDT that 55 111541965 11 D 1
## 12 built 1 VBD build 60 111541965 12 V 1
## 13 this 1 DT this 66 111541965 13 D 1
## 14 country 1 NN country 71 111541965 14 N 1
## 15 : 1 : : 78 111541965 15 . 1
## 16 the 1 DT the 80 111541965 16 D 1
## 17 idea 1 NN idea 84 111541965 17 N 1
## 18 that 1 IN that 89 111541965 18 P 1
## 19 if 1 IN if 94 111541965 19 P 1
## 20 you 1 PRP you 97 111541965 20 O 1
As you can see, the result is similar to the ad-hoc lemmatized tokens, but now we have over 91 thousand tokens rather than 12. We can create a document-term matrix using the dtm.create command from corpustools:
library(corpustools)
## Loading required package: reshape2
## Loading required package: topicmodels
dtm = dtm.create(documents=sotu.tokens$aid, terms=sotu.tokens$lemma, filter=sotu.tokens$pos1 %in% c("M","N"))
## Ignoring words with frequency lower than 5
## Ignoring words with less than 3 characters
## Ignoring words that contain numbers of non-word characters
dtm
## <<DocumentTermMatrix (documents: 1089, terms: 907)>>
## Non-/sparse entries: 15896/971827
## Sparsity : 98%
## Maximal term length: 14
## Weighting : term frequency (tf)
So, we now have a “sparse” matrix of 1,089 documents by 907 terms. Sparse here means that only the 15,896 non-zero entries are kept in memory, because otherwise it would have to keep all 987,723 cells in memory (and this is a relatively small data set). Thus, it is usually not a good idea to convert such a matrix with as.matrix (for example in order to call colSums on it), since this turns the sparse matrix into a regular matrix. The next section investigates a number of useful functions to deal with (sparse) document-term matrices.
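The reason a conversion is unnecessary can be sketched in base R: column totals can be computed directly from the stored triplets. The list below is a hand-built stand-in for a sparse matrix holding the stemmed English example from earlier; it is not a slam object, and the function name sparse_col_sums is hypothetical.

```r
# Hand-built stand-in for a sparse matrix: parallel vectors of
# row index (i), column index (j) and value (v), plus the dimensions
sparse <- list(
  i = c(1L, 1L, 2L, 2L),   # document index of each stored entry
  j = c(1L, 2L, 1L, 3L),   # term index of each stored entry
  v = c(1, 1, 1, 1),       # frequency of each stored entry
  nrow = 2L, ncol = 3L,
  terms = c("bird", "chicken", "eat")
)

# Column totals computed from the triplets, without building a dense matrix
sparse_col_sums <- function(x) {
  totals <- numeric(x$ncol)            # start at zero for every term
  sums <- tapply(x$v, x$j, sum)        # add up the values per column index
  totals[as.integer(names(sums))] <- as.vector(sums)
  names(totals) <- x$terms
  totals
}

sparse_col_sums(sparse)

# Cells a dense copy would need versus entries actually stored
c(dense = sparse$nrow * sparse$ncol, stored = length(sparse$v))
```

The work done is proportional to the number of stored entries rather than to the number of cells, which is what makes this approach practical for large, mostly empty matrices.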
What are the most frequent words in the corpus? As shown above, we could use the built-in colSums function, but this requires first casting the sparse matrix to a regular matrix, which we want to avoid (even our relatively small data set would have almost a million entries, and real corpora are often far larger). However, we can use the col_sums function from the slam package, which provides the same functionality for sparse matrices:
library(slam)
freq = col_sums(dtm)
# sort the list by reverse frequency using built-in order function:
freq = freq[order(-freq)]
head(freq, n=10)
## America year people job country world tax
## 409 385 327 256 228 198 181
## Americans nation Congress
## 179 171 168
As can be seen, the most frequent terms are America and recurring issues like jobs and taxes. It can be useful to compute different metrics per term, such as term frequency, document frequency (in how many documents the term occurs), and tf.idf (term frequency * inverse document frequency, which gives a low weight to both rare and overly frequent terms). The function term.statistics from the corpustools package provides this functionality:
terms = term.statistics(dtm)
terms = terms[order(-terms$termfreq), ]
head(terms, 10)
## term characters number nonalpha termfreq docfreq reldocfreq
## America America 7 FALSE FALSE 409 346 0.31772268
## year year 4 FALSE FALSE 385 286 0.26262626
## people people 6 FALSE FALSE 327 277 0.25436180
## job job 3 FALSE FALSE 256 190 0.17447199
## country country 7 FALSE FALSE 228 202 0.18549128
## world world 5 FALSE FALSE 198 156 0.14325069
## tax tax 3 FALSE FALSE 181 102 0.09366391
## Americans Americans 9 FALSE FALSE 179 158 0.14508724
## nation nation 6 FALSE FALSE 171 150 0.13774105
## Congress Congress 8 FALSE FALSE 168 149 0.13682277
## tfidf
## America 0.1275801
## year 0.1540023
## people 0.1547707
## job 0.2102813
## country 0.1677862
## world 0.2127368
## tax 0.3308686
## Americans 0.2033421
## nation 0.1985566
## Congress 0.1917011
As you can see, for each word the total frequency and the relative document frequency are listed, as well as some basic information on the number of characters and the occurrence of numerals or non-alphanumeric characters. This allows us to create a ‘common sense’ filter to reduce the number of terms, for example removing all words containing a number or punctuation mark, and all short (characters<=2), infrequent (termfreq<25), and overly frequent (reldocfreq>=.25) words:
subset = terms[!terms$number & !terms$nonalpha & terms$characters>2 & terms$termfreq>=25 & terms$reldocfreq<.25, ]
nrow(subset)
## [1] 192
head(subset, n=10)
## term characters number nonalpha termfreq docfreq reldocfreq
## job job 3 FALSE FALSE 256 190 0.17447199
## country country 7 FALSE FALSE 228 202 0.18549128
## world world 5 FALSE FALSE 198 156 0.14325069
## tax tax 3 FALSE FALSE 181 102 0.09366391
## Americans Americans 9 FALSE FALSE 179 158 0.14508724
## nation nation 6 FALSE FALSE 171 150 0.13774105
## Congress Congress 8 FALSE FALSE 168 149 0.13682277
## time time 4 FALSE FALSE 166 137 0.12580349
## child child 5 FALSE FALSE 153 112 0.10284665
## economy economy 7 FALSE FALSE 150 122 0.11202938
## tfidf
## job 0.2102813
## country 0.1677862
## world 0.2127368
## tax 0.3308686
## Americans 0.2033421
## nation 0.1985566
## Congress 0.1917011
## time 0.2070660
## child 0.2254874
## economy 0.2582953
This seems to be a relatively useful set of words. We now have 192 terms left of the original 907. To create a new document-term matrix with only these terms, we can use the dtm.filter function, passing the terms we want to keep:
dtm_filtered = dtm.filter(dtm, terms=subset$term)
dim(dtm_filtered)
## [1] 1082 192
This yields a much more manageable dtm. As a bonus, we can use the dtm.wordcloud function in corpustools (which is a thin wrapper around the wordcloud package) to visualize the top words as a word cloud:
dtm.wordcloud(dtm_filtered)
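The term statistics and the filtering step from this section can be sketched with base R on a small dense matrix. The counts and thresholds below are made up for illustration, and the tf-idf line uses a common textbook variant (total term frequency times log2 of the inverse document frequency), which is an assumption and not necessarily the formula used by term.statistics.

```r
# Small example matrix (documents in rows, terms in columns);
# the counts are invented for this illustration
m <- matrix(c(2, 0, 1,
              1, 1, 0,
              3, 0, 0,
              1, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(docs  = paste0("d", 1:4),
                            terms = c("america", "tax", "zz")))

stats <- data.frame(
  term       = colnames(m),
  characters = nchar(colnames(m)),
  termfreq   = colSums(m),        # total number of occurrences
  docfreq    = colSums(m > 0),    # number of documents containing the term
  stringsAsFactors = FALSE
)
stats$reldocfreq <- stats$docfreq / nrow(m)

# Textbook tf-idf variant (assumption, see the note above)
stats$tfidf <- stats$termfreq * log2(nrow(m) / stats$docfreq)

# Common-sense filter: drop very short terms and terms found in most documents
keep <- stats$characters > 2 & stats$reldocfreq < 0.75
m_filtered <- m[, stats$term[keep], drop = FALSE]
colnames(m_filtered)
```

In this toy example "america" is dropped because it occurs in every document (its tf-idf is zero for the same reason) and "zz" is dropped for being too short, so only "tax" remains.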
Another useful thing we can do is compare two corpora: which words or names are mentioned more in, for example, Bush’s speeches than in Obama’s?
To do this, we split the dtm into separate dtms for Bush and Obama. For this, we select document ids using the headline column in the metadata from sotu.meta, and then use the dtm.filter function:
head(sotu.meta)
## id medium headline date
## 1 111541965 Speeches Barack Obama 2013-02-12
## 2 111541995 Speeches Barack Obama 2013-02-12
## 3 111542001 Speeches Barack Obama 2013-02-12
## 4 111542006 Speeches Barack Obama 2013-02-12
## 5 111542013 Speeches Barack Obama 2013-02-12
## 6 111542018 Speeches Barack Obama 2013-02-12
obama.docs = sotu.meta$id[sotu.meta$headline == "Barack Obama"]
dtm.obama = dtm.filter(dtm, documents=obama.docs)
bush.docs = sotu.meta$id[sotu.meta$headline == "George W. Bush"]
dtm.bush = dtm.filter(dtm, documents=bush.docs)
So how can we check which words are more frequent in Bush’s speeches than in Obama’s speeches? The function corpora.compare provides this functionality, given two document-term matrices:
cmp = corpora.compare(dtm.obama, dtm.bush)
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level,
## NA generated
cmp = cmp[order(cmp$over), ]
head(cmp)
## term termfreq.x termfreq.y relfreq.x relfreq.y over
## 729 terror 1 55 0.0001121957 0.005824420 0.1629729
## 301 freedom 8 79 0.0008975654 0.008365985 0.2026018
## 242 enemy 4 52 0.0004487827 0.005506725 0.2226593
## 731 terrorist 10 73 0.0011219567 0.007730594 0.2430484
## 390 Iraq 15 94 0.0016829350 0.009954464 0.2449171
## 225 drug 4 39 0.0004487827 0.004130043 0.2824114
## chi
## 729 49.81362
## 301 55.04010
## 242 39.11981
## 731 45.21642
## 390 54.06263
## 225 26.98837
For each term, this data frame contains the frequency in the ‘x’ and ‘y’ corpora (here, Obama and Bush). Also, it gives the relative frequency in these corpora (normalizing for total corpus size), the overrepresentation in the ‘x’ corpus, and the chi-squared value for that overrepresentation. So, Bush used the word terrorist 73 times, while Obama used it only 10 times. In relative terms Bush used it about seven times as often; the over score of 0.24 is somewhat less extreme than this raw ratio, but the difference is still highly significant.
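The columns of this table can be approximated with a few lines of base R. The sketch below uses the counts for two terms from the output above. The corpus sizes are inferred from termfreq divided by relfreq, and the smoothed ratio (adding 0.001 to both relative frequencies) is an assumption that happens to be consistent with the over values shown, not a documented formula. The chi-squared value is a plain Pearson test on a 2x2 table and need not reproduce the chi column exactly.

```r
# Counts for two terms, copied from the output above (x = Obama, y = Bush)
cmp_toy <- data.frame(
  term       = c("terror", "freedom"),
  termfreq.x = c(1, 8),
  termfreq.y = c(55, 79),
  stringsAsFactors = FALSE
)

# Corpus sizes inferred from termfreq / relfreq in the output (assumption)
total.x <- 8913
total.y <- 9443

# Relative frequencies: normalize by corpus size
cmp_toy$relfreq.x <- cmp_toy$termfreq.x / total.x
cmp_toy$relfreq.y <- cmp_toy$termfreq.y / total.y

# Smoothed overrepresentation in x relative to y (assumed formula)
cmp_toy$over <- (cmp_toy$relfreq.x + 0.001) / (cmp_toy$relfreq.y + 0.001)

# Pearson chi-squared on the 2x2 table (term vs. all other words, x vs. y)
cmp_toy$chi <- mapply(function(a, b) {
  tab <- matrix(c(a, total.x - a, b, total.y - b), nrow = 2)
  unname(chisq.test(tab, correct = FALSE)$statistic)
}, cmp_toy$termfreq.x, cmp_toy$termfreq.y)

cmp_toy
```

Values of over below 1 indicate terms that are relatively more frequent in the ‘y’ corpus, and values above 1 indicate terms that are relatively more frequent in the ‘x’ corpus. The smoothing constant keeps the ratio finite when a term is absent from one of the corpora.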
Which words did Obama use most compared to Bush?
cmp = cmp[order(cmp$over, decreasing=T), ]
head(cmp)
## term termfreq.x termfreq.y relfreq.x relfreq.y over
## 138 company 54 6 0.006058566 0.0006353913 4.316133
## 374 industry 32 1 0.003590261 0.0001058985 4.150708
## 132 college 55 9 0.006170762 0.0009530869 3.671502
## 124 class 26 1 0.002917087 0.0001058985 3.541995
## 396 job 200 56 0.022439134 0.0059303188 3.382115
## 190 deficit 56 11 0.006282957 0.0011648840 3.364133
## chi
## 138 40.79339
## 374 30.63331
## 132 35.35282
## 124 24.35851
## 396 89.09458
## 190 32.46339
So, while Bush talks about freedom, Iraq, and terror, Obama talks more about companies, industry, jobs, and college.
Let’s make a word cloud of Obama’s words, with size indicating chi-square overrepresentation:
obama = cmp[cmp$over > 1,]
dtm.wordcloud(terms = obama$term, freqs = obama$chi)
And Bush:
bush = cmp[cmp$over < 1,]
dtm.wordcloud(terms = bush$term, freqs = bush$chi)
Note that the warnings given by these commands are relatively harmless: they mean that some words are skipped because no good place could be found for them in the word cloud.