Corpus analysis: the document-term matrix
=========================================

The most important object in frequency-based text analysis is the document-term matrix. This matrix contains the documents in the rows and terms (words) in the columns, and each cell is the frequency of that term in that document.

In R, these matrices are provided by the tm (text mining) package. Although this package provides many functions for loading and manipulating these matrices, using them directly is relatively complicated.

Fortunately, the RTextTools package provides an easy function to create a document-term matrix from a data frame. To create a document-term matrix from a simple data frame with a ‘text’ column, use the create_matrix function (with removeStopwords=F to make sure all words are kept):

library(RTextTools)
input = data.frame(text=c("Chickens are birds", "The bird eats"))
m = create_matrix(input$text, removeStopwords=F)

We can inspect the resulting matrix m using the regular R functions to get e.g. the type of object and the dimensionality:

class(m)

## [1] "DocumentTermMatrix"    "simple_triplet_matrix"

dim(m)

## [1] 2 6

m

## <<DocumentTermMatrix (documents: 2, terms: 6)>>
## Non-/sparse entries: 6/6
## Sparsity           : 50%
## Maximal term length: 8
## Weighting          : term frequency (tf)

So, m is a DocumentTermMatrix, which is derived from a simple_triplet_matrix as provided by the slam package. Internally, document-term matrices are stored as sparse matrices: if we use real data, we can easily have hundreds of thousands of rows and columns, while the vast majority of cells will be zero (most words don’t occur in most documents). Storing this as a regular matrix would waste a lot of memory. In a sparse matrix, only the non-zero entries are stored, as ‘simple triplets’ of (document, term, frequency).
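
To make the triplet idea concrete, the following base-R sketch stores the counts for the two example sentences as (document, term, frequency) triplets and rebuilds the regular matrix from them. It only illustrates the storage scheme: the names dense, triplets and rebuilt are made up here, and the actual simple_triplet_matrix class in slam has its own internal layout.

```r
# Counts for the two example sentences as a regular (dense) matrix
terms <- c("are", "bird", "birds", "chickens", "eats", "the")
dense <- matrix(c(1, 0, 1, 1, 0, 0,
                  0, 1, 0, 0, 1, 1),
                nrow = 2, byrow = TRUE,
                dimnames = list(Docs = c("1", "2"), Terms = terms))

# Keep only the non-zero cells as (document, term, frequency) triplets
nz <- which(dense != 0, arr.ind = TRUE)
triplets <- data.frame(doc  = unname(nz[, 1]),
                       term = terms[nz[, 2]],
                       freq = dense[nz])
print(triplets)

# Rebuild the dense matrix from the triplets: no information was lost
rebuilt <- matrix(0, nrow = 2, ncol = length(terms), dimnames = dimnames(dense))
rebuilt[cbind(triplets$doc, match(triplets$term, terms))] <- triplets$freq
stopifnot(all(rebuilt == dense))
```

Here half of the 12 cells are stored; in a realistic corpus the stored fraction is typically far smaller.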

As seen in the output of dim, our matrix has only 2 rows (documents) and 6 columns (unique words). Since this is a fairly small matrix, we can visualize it using as.matrix, which converts the ‘sparse’ matrix into a regular matrix:

as.matrix(m)

##     Terms
## Docs are bird birds chickens eats the
##    1   1    0     1        1    0   0
##    2   0    1     0        0    1   1
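
The table above can also be reproduced by hand, which shows what a document-term matrix contains. The sketch below uses only base R and a deliberately naive tokenizer (lower-casing and splitting on non-letters). That tokenizer is an assumption made for illustration, not a description of what create_matrix does internally.

```r
docs <- c("Chickens are birds", "The bird eats")

# Naive tokenization: lower-case, then split on anything that is not a letter
tokens <- strsplit(tolower(docs), "[^a-z]+")

# One row per token occurrence, then cross-tabulate documents against terms
long <- data.frame(Docs  = rep(seq_along(tokens), lengths(tokens)),
                   Terms = unlist(tokens))
counts <- table(long)
print(counts)
```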

Stemming and stop word removal

As we can see, each word is kept as is. We can reduce the size of the matrix by dropping stop words and stemming, i.e. changing a word like ‘chickens’ to its base form or stem ‘chicken’ (see the create_matrix documentation for the full range of options):

m = create_matrix(input$text, removeStopwords=T, stemWords=T, language='english')
dim(m)

## [1] 2 3

as.matrix(m)

##     Terms
## Docs bird chicken eat
##    1    1       1   0
##    2    1       0   1

As you can see, the stop words (the and are) are removed, the singular and plural forms of bird are joined together, and eats is reduced to its stem eat.

In RTextTools, the language for stemming and stop words can be given as a parameter, and the default is English. Note that stemming works relatively well for English, but is less useful for more highly inflected languages such as Dutch or German. An easy way to see the effects of the preprocessing is by looking at the colSums of a matrix, which gives the total frequency of each term:

colSums(as.matrix(m))

##    bird chicken     eat 
##       2       1       1

For more richly inflected languages like Dutch, the result is less promising:

text = c("De kip eet", "De kippen hebben gegeten")
m = create_matrix(text, removeStopwords=T, stemWords=T, language="dutch")
colSums(as.matrix(m))

##   eet geget   kip  kipp 
##     1     1     1     1

As you can see, de and hebben are correctly recognized as stop words, but gegeten (eaten) and kippen (chickens) have a different stem than eet (eat) and kip (chicken). German gets similarly bad results.

Loading and analysing a larger dataset from AmCAT

If we want to move beyond stemming, one option is to use AmCAT to parse articles. Before we can proceed, we need to save our AmCAT password (this is only needed once; please don’t save the password in your file!) and log on to AmCAT:

library(amcatr)
amcat.save.password("https://amcat.nl", "username",  "password")
conn = amcat.connect("https://amcat.nl")

## Loading required package: rjson
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: plyr
## Loading required package: httr

Now, we can upload articles from R using the amcat.upload.articles function, which we now demonstrate with a single article but which can also be used to upload many articles at once:

articles = data.frame(text = "John is a great fan of chickens, and so is Mary", date="2001-01-01", headline="test")

aset = amcat.upload.articles(conn, project = 1, articleset="Test Florence", medium="test", 
                             text=articles$text, date=articles$date, headline=articles$headline)

## Created articleset 21997: Test Florence in project 1
## Uploading 1 articles to set 21997

And we can then lemmatize this article and download the results directly to R using amcat.gettokens:

amcat.gettokens(conn, project=1, articleset = aset, module = "corenlp_lemmatize")

## GET https://amcat.nl/api/v4/projects/1/articlesets/21997/tokens/?page=1&module=corenlp_lemmatize&page_size=1&format=csv
## GET https://amcat.nl/api/v4/projects/1/articlesets/21997/tokens/?page=2&module=corenlp_lemmatize&page_size=1&format=csv

##        word sentence pos   lemma offset       aid id pos1
## 1      John        1 NNP    John      0 114440106  1    M
## 2        is        1 VBZ      be      5 114440106  2    V
## 3         a        1  DT       a      8 114440106  3    D
## 4     great        1  JJ   great     10 114440106  4    A
## 5       fan        1  NN     fan     16 114440106  5    N
## 6        of        1  IN      of     20 114440106  6    P
## 7  chickens        1 NNS chicken     23 114440106  7    N
## 8         ,        1   ,       ,     31 114440106  8    .
## 9       and        1  CC     and     33 114440106  9    C
## 10       so        1  RB      so     37 114440106 10    B
## 11       is        1 VBZ      be     40 114440106 11    V
## 12     Mary        1 NNP    Mary     43 114440106 12    M

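We can see that, e.g., for “is” the lemma “be” is given. In the output above the tokens are listed in sentence order, and the two occurrences of “is” appear as separate rows (ids 2 and 11). If your version of amcat.gettokens instead returns the words out of order with repeated words summed, this can be switched off by giving drop=NULL as an extra argument.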
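
Lemmas are useful because different surface forms can then be counted together. As a minimal sketch, the data frame below retypes a few columns of the token output above by hand (it is not downloaded from AmCAT) and counts the lemmas with base R:

```r
# A few columns of the token list above, typed in by hand for illustration
tokens <- data.frame(
  word  = c("John", "is", "a", "great", "fan", "of",
            "chickens", ",", "and", "so", "is", "Mary"),
  lemma = c("John", "be", "a", "great", "fan", "of",
            "chicken", ",", "and", "so", "be", "Mary"),
  pos1  = c("M", "V", "D", "A", "N", "P", "N", ".", "C", "B", "V", "M"),
  stringsAsFactors = FALSE
)

# Frequency per lemma: both occurrences of "is" are counted under "be"
lemma.freq <- table(tokens$lemma)
lemma.freq["be"]

# Keep only names (M) and nouns (N) by filtering on the pos1 column
tokens$lemma[tokens$pos1 %in% c("M", "N")]
```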

For a more serious application, we will use an existing article set: set 16017 in project 559, which contains the State of the Union speeches by Bush and Obama (each document is a single paragraph). The analysed tokens for this set can be downloaded with the following command:

sotu.tokens = amcat.gettokens(conn, project=559, articleset = 16017, module = "corenlp_lemmatize", page_size = 100)

Note: the command above might take a while to complete. This data is also available directly from the semnet package:

library(semnet)

## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: wordcloud
## Loading required package: RColorBrewer
## Loading required package: scales
## Loading required package: tm
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:httr':
## 
##     content
## 
## Loading required package: slam
## Loading required package: igraph
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:base':
## 
##     crossprod, tcrossprod

data(sotu)
nrow(sotu.tokens)

## [1] 91473

head(sotu.tokens, n=20)

##          word sentence  pos      lemma offset       aid id pos1 freq
## 1          It        1  PRP         it      0 111541965  1    O    1
## 2          is        1  VBZ         be      3 111541965  2    V    1
## 3         our        1 PRP$         we      6 111541965  3    O    1
## 4  unfinished        1   JJ unfinished     10 111541965  4    A    1
## 5        task        1   NN       task     21 111541965  5    N    1
## 6          to        1   TO         to     26 111541965  6    ?    1
## 7     restore        1   VB    restore     29 111541965  7    V    1
## 8         the        1   DT        the     37 111541965  8    D    1
## 9       basic        1   JJ      basic     41 111541965  9    A    1
## 10    bargain        1   NN    bargain     47 111541965 10    N    1
## 11       that        1  WDT       that     55 111541965 11    D    1
## 12      built        1  VBD      build     60 111541965 12    V    1
## 13       this        1   DT       this     66 111541965 13    D    1
## 14    country        1   NN    country     71 111541965 14    N    1
## 15          :        1    :          :     78 111541965 15    .    1
## 16        the        1   DT        the     80 111541965 16    D    1
## 17       idea        1   NN       idea     84 111541965 17    N    1
## 18       that        1   IN       that     89 111541965 18    P    1
## 19         if        1   IN         if     94 111541965 19    P    1
## 20        you        1  PRP        you     97 111541965 20    O    1

As you can see, the result is similar to the ad-hoc lemmatized tokens, but now we have more than 90 thousand tokens rather than 12. We can create a document-term matrix using the dtm.create command from corpustools:

library(corpustools)

## Loading required package: reshape2
## Loading required package: topicmodels

dtm = dtm.create(documents=sotu.tokens$aid, terms=sotu.tokens$lemma, filter=sotu.tokens$pos1 %in% c("M","N"))

## Ignoring words with frequency lower than 5
## Ignoring words with less than 3 characters
## Ignoring words that contain numbers of non-word characters

dtm

## <<DocumentTermMatrix (documents: 1089, terms: 907)>>
## Non-/sparse entries: 15896/971827
## Sparsity           : 98%
## Maximal term length: 14
## Weighting          : term frequency (tf)

So, we now have a “sparse” matrix of 1,089 documents by 907 terms. Sparse here means that only the non-zero entries are kept in memory, because otherwise it would have to keep all of the roughly one million cells in memory (and this is a relatively small data set). Thus, for larger corpora it might not be a good idea to use functions like as.matrix or colSums on such a matrix, since these functions convert the sparse matrix into a regular matrix. The next section investigates a number of useful functions to deal with (sparse) document-term matrices.
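
The numbers in the printed summary of dtm can be checked with a little arithmetic. The sketch below assumes 8 bytes per cell for a regular numeric matrix (the size of a double in R); the actual memory use of either representation will differ somewhat because of overhead.

```r
n.docs  <- 1089
n.terms <- 907
nonzero <- 15896            # 'Non-/sparse entries' reports non-zero / zero cells

cells    <- n.docs * n.terms  # total number of cells: 987723
sparsity <- 1 - nonzero / cells
round(100 * sparsity)         # 98, matching the 'Sparsity' line

# Rough size of a regular (dense) double matrix, in megabytes
cells * 8 / 1e6
```

At this size a regular matrix of about 8 MB is still manageable, but the same calculation for 100,000 documents by 100,000 terms gives 80 GB.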

Corpus analysis: word frequency

What are the most frequent words in the corpus? As shown above, we could use the built-in colSums function, but this requires first casting the sparse matrix to a regular matrix, which we want to avoid (even our relatively small dataset would have about a million entries, and realistic corpora many more). However, we can use the col_sums function from the slam package, which provides the same functionality for sparse matrices:

library(slam)
freq = col_sums(dtm)
# sort the list by reverse frequency using built-in order function:
freq = freq[order(-freq)]
head(freq, n=10)

##   America      year    people       job   country     world       tax 
##       409       385       327       256       228       198       181 
## Americans    nation  Congress 
##       179       171       168
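
The triplet representation shows why column sums do not require a regular matrix: summing the stored frequencies per term index is enough, because every cell that is not stored is zero. The sketch below illustrates this in base R on hand-typed triplets for the two example sentences from the first section. It shows the idea only and is not how slam implements col_sums.

```r
terms <- c("are", "bird", "birds", "chickens", "eats", "the")

# (document, term index, frequency) triplets for the two example sentences
i <- c(1, 1, 1, 2, 2, 2)   # document
j <- c(1, 3, 4, 2, 5, 6)   # index into 'terms'
v <- c(1, 1, 1, 1, 1, 1)   # frequency

# Sum the stored values per term; terms without stored cells stay at 0
freq <- setNames(numeric(length(terms)), terms)
sums <- tapply(v, j, sum)
freq[as.integer(names(sums))] <- as.vector(sums)

# Sort by decreasing frequency, as in the text
freq[order(-freq)]
```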

As can be seen, the most frequent terms are America and recurring issues like jobs and taxes. It can be useful to compute different metrics per term, such as term frequency, document frequency (the number of documents in which it occurs), and tf.idf (term frequency * inverse document frequency, which gives a low score to both rare and overly frequent terms). The function term.statistics from the corpustools package provides this functionality:

terms = term.statistics(dtm)
terms = terms[order(-terms$termfreq), ]
head(terms, 10)

##                term characters number nonalpha termfreq docfreq reldocfreq
## America     America          7  FALSE    FALSE      409     346 0.31772268
## year           year          4  FALSE    FALSE      385     286 0.26262626
## people       people          6  FALSE    FALSE      327     277 0.25436180
## job             job          3  FALSE    FALSE      256     190 0.17447199
## country     country          7  FALSE    FALSE      228     202 0.18549128
## world         world          5  FALSE    FALSE      198     156 0.14325069
## tax             tax          3  FALSE    FALSE      181     102 0.09366391
## Americans Americans          9  FALSE    FALSE      179     158 0.14508724
## nation       nation          6  FALSE    FALSE      171     150 0.13774105
## Congress   Congress          8  FALSE    FALSE      168     149 0.13682277
##               tfidf
## America   0.1275801
## year      0.1540023
## people    0.1547707
## job       0.2102813
## country   0.1677862
## world     0.2127368
## tax       0.3308686
## Americans 0.2033421
## nation    0.1985566
## Congress  0.1917011
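
For reference, one common textbook form of tf-idf multiplies a term's frequency in a document by the logarithm of the inverse document frequency. The base-R sketch below applies that form to the stemmed counts of the two example sentences from the first section. The exact weighting and normalization used by term.statistics may differ, so these numbers are not comparable with the tfidf column above.

```r
# Stemmed counts for the two example sentences
m <- matrix(c(1, 1, 0,
              1, 0, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(Docs = c("1", "2"),
                            Terms = c("bird", "chicken", "eat")))

n.docs  <- nrow(m)
docfreq <- colSums(m > 0)           # number of documents containing each term
idf     <- log2(n.docs / docfreq)   # 0 for terms that occur in every document
tfidf   <- sweep(m, 2, idf, `*`)    # multiply each column by its idf
tfidf
```

Since bird occurs in both documents, its idf and therefore its tf-idf is 0: it does not help to tell the two documents apart.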

The term.statistics output lists, for each word, the total frequency and the relative document frequency, as well as some basic information on the number of characters and the occurrence of numerals or non-alphanumeric characters. This allows us to create a ‘common sense’ filter to reduce the number of terms, for example removing all words containing a number or punctuation mark, and all short (characters<=2), infrequent (termfreq<25) and overly frequent (reldocfreq>=.25) words:

subset = terms[!terms$number & !terms$nonalpha & terms$characters>2 & terms$termfreq>=25 & terms$reldocfreq<.25, ]
nrow(subset)

## [1] 192

head(subset, n=10)

##                term characters number nonalpha termfreq docfreq reldocfreq
## job             job          3  FALSE    FALSE      256     190 0.17447199
## country     country          7  FALSE    FALSE      228     202 0.18549128
## world         world          5  FALSE    FALSE      198     156 0.14325069
## tax             tax          3  FALSE    FALSE      181     102 0.09366391
## Americans Americans          9  FALSE    FALSE      179     158 0.14508724
## nation       nation          6  FALSE    FALSE      171     150 0.13774105
## Congress   Congress          8  FALSE    FALSE      168     149 0.13682277
## time           time          4  FALSE    FALSE      166     137 0.12580349
## child         child          5  FALSE    FALSE      153     112 0.10284665
## economy     economy          7  FALSE    FALSE      150     122 0.11202938
##               tfidf
## job       0.2102813
## country   0.1677862
## world     0.2127368
## tax       0.3308686
## Americans 0.2033421
## nation    0.1985566
## Congress  0.1917011
## time      0.2070660
## child     0.2254874
## economy   0.2582953

This seems to be a relatively useful set of words. We now have 192 terms left of the original 907. To create a new document-term matrix with only these terms, we can use the dtm.filter function from corpustools:

dtm_filtered = dtm.filter(dtm, terms=subset$term)
dim(dtm_filtered)

## [1] 1082  192

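This yields a much more manageable dtm (the number of documents also drops from 1,089 to 1,082, presumably because documents that contain none of the selected terms are dropped). As a bonus, we can use the dtm.wordcloud function in corpustools (which is a thin wrapper around the wordcloud package) to visualize the top words as a word cloud: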

dtm.wordcloud(dtm_filtered)

Comparing corpora

Another useful thing we can do is comparing two corpora: which words or names are mentioned more in, e.g., Bush’s speeches than in Obama’s?

To do this, we split the dtm into separate dtms for Bush and Obama. For this, we select document ids using the headline column in the metadata from sotu.meta, and then use the dtm.filter function:

head(sotu.meta)

##          id   medium     headline       date
## 1 111541965 Speeches Barack Obama 2013-02-12
## 2 111541995 Speeches Barack Obama 2013-02-12
## 3 111542001 Speeches Barack Obama 2013-02-12
## 4 111542006 Speeches Barack Obama 2013-02-12
## 5 111542013 Speeches Barack Obama 2013-02-12
## 6 111542018 Speeches Barack Obama 2013-02-12

obama.docs = sotu.meta$id[sotu.meta$headline == "Barack Obama"]
dtm.obama = dtm.filter(dtm, documents=obama.docs)
bush.docs = sotu.meta$id[sotu.meta$headline == "George W. Bush"]
dtm.bush = dtm.filter(dtm, documents=bush.docs)

So how can we check which words are more frequent in Bush’s speeches than in Obama’s speeches? The function corpora.compare provides this functionality, given two document-term matrices:

cmp = corpora.compare(dtm.obama, dtm.bush)

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level,
## NA generated

cmp = cmp[order(cmp$over), ]
head(cmp)

##          term termfreq.x termfreq.y    relfreq.x   relfreq.y      over
## 729    terror          1         55 0.0001121957 0.005824420 0.1629729
## 301   freedom          8         79 0.0008975654 0.008365985 0.2026018
## 242     enemy          4         52 0.0004487827 0.005506725 0.2226593
## 731 terrorist         10         73 0.0011219567 0.007730594 0.2430484
## 390      Iraq         15         94 0.0016829350 0.009954464 0.2449171
## 225      drug          4         39 0.0004487827 0.004130043 0.2824114
##          chi
## 729 49.81362
## 301 55.04010
## 242 39.11981
## 731 45.21642
## 390 54.06263
## 225 26.98837

For each term, this data frame contains the frequency in the ‘x’ and ‘y’ corpora (here, Obama and Bush). It also gives the relative frequency in these corpora (normalizing for total corpus size), the overrepresentation in the ‘x’ corpus (over), and the chi-squared value for that overrepresentation. So, Bush used the word terrorist 73 times, while Obama used it only 10 times. Its over score of 0.24 means that, on this measure, Bush used it about four times as often as Obama in relative terms, and the chi-squared value of 45.2 indicates that this difference is highly significant.
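
The over column in the output is consistent with a ratio of the two relative frequencies after adding a small constant (0.001) to each, which keeps the ratio finite for terms that are absent from one corpus. This constant is inferred from the printed values and is an assumption, not something taken from the corpora.compare source. A minimal check in base R, using relative frequencies copied from the output of this section:

```r
# Relative frequencies copied from the corpora.compare output (x = Obama, y = Bush)
relfreq.x <- c(terror = 0.0001121957, terrorist = 0.0011219567, company = 0.006058566)
relfreq.y <- c(terror = 0.005824420,  terrorist = 0.007730594,  company = 0.0006353913)

# Assumed smoothing constant, inferred from the printed 'over' values
smoothing <- 0.001

over <- (relfreq.x + smoothing) / (relfreq.y + smoothing)
round(over, 4)

# Printed 'over' values for the same terms, for comparison
printed <- c(terror = 0.1629729, terrorist = 0.2430484, company = 4.316133)
max(abs(over - printed))
```

Values below 1 indicate terms that are relatively more frequent in the ‘y’ corpus (Bush), values above 1 terms that are relatively more frequent in the ‘x’ corpus (Obama).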

Which words did Obama use most compared to Bush?

cmp = cmp[order(cmp$over, decreasing=T), ]
head(cmp)

##         term termfreq.x termfreq.y   relfreq.x    relfreq.y     over
## 138  company         54          6 0.006058566 0.0006353913 4.316133
## 374 industry         32          1 0.003590261 0.0001058985 4.150708
## 132  college         55          9 0.006170762 0.0009530869 3.671502
## 124    class         26          1 0.002917087 0.0001058985 3.541995
## 396      job        200         56 0.022439134 0.0059303188 3.382115
## 190  deficit         56         11 0.006282957 0.0011648840 3.364133
##          chi
## 138 40.79339
## 374 30.63331
## 132 35.35282
## 124 24.35851
## 396 89.09458
## 190 32.46339

So, while Bush talks about freedom, terror, and Iraq, Obama talks more about companies, industry, jobs and college.

Let’s make a word cloud of Obama’s words, with size indicating chi-square overrepresentation:

obama = cmp[cmp$over > 1,]
dtm.wordcloud(terms = obama$term, freqs = obama$chi)

And Bush:

bush = cmp[cmp$over < 1,]
dtm.wordcloud(terms = bush$term, freqs = bush$chi)

Note that the warnings given by these commands are relatively harmless: they mean that some words were skipped because no good place could be found for them in the word cloud.