Latent Dirichlet Allocation

Topic modelling techniques such as Latent Dirichlet Allocation (LDA) can be a useful tool for social scientists to analyze large amounts of natural language data. Algorithms for LDA are available in R, for instance in the topicmodels package. In this howto we demonstrate several functions in the corpustools package that facilitate the use of LDA via the topicmodels package.

As a starting point we use a document term matrix (dtm) in the DocumentTermMatrix format offered by the tm package. Note that we also offer a howto for creating the dtm.

library(corpustools)
## Loading required package: slam
## Loading required package: Matrix
## Loading required package: tm
## Loading required package: NLP
## Loading required package: plyr
## Loading required package: reshape2
## Loading required package: topicmodels
## Loading required package: RColorBrewer
## Loading required package: wordcloud
## Loading required package: igraph
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
## Loading required package: plotrix
## Loading required package: scales
## 
## Attaching package: 'scales'
## The following object is masked from 'package:plotrix':
## 
##     rescale
data(sotu) # state of the union speeches by Barack Obama and George H. Bush.
head(sotu.tokens)
##         word sentence  pos      lemma offset       aid id pos1 freq
## 1         It        1  PRP         it      0 111541965  1    O    1
## 2         is        1  VBZ         be      3 111541965  2    V    1
## 3        our        1 PRP$         we      6 111541965  3    O    1
## 4 unfinished        1   JJ unfinished     10 111541965  4    A    1
## 5       task        1   NN       task     21 111541965  5    N    1
## 6         to        1   TO         to     26 111541965  6    ?    1
dtm = dtm.create(documents=sotu.tokens$aid, terms=sotu.tokens$lemma, filter=sotu.tokens$pos1 %in%  c('N','M','A'))
## Ignoring words with frequency lower than 5
## Ignoring words with less than 3 characters
## Ignoring words that contain numbers of non-word characters
dtm
## <<DocumentTermMatrix (documents: 1090, terms: 1133)>>
## Non-/sparse entries: 20342/1214628
## Sparsity           : 98%
## Maximal term length: 14
## Weighting          : term frequency (tf)

Not all terms are equally informative of the underlying semantic structure of texts, and some terms are rather useless for this purpose. For interpretation and computational purposes it is worthwhile to delete some of the less useful words from the dtm before fitting the LDA model. As the messages above show, dtm.create automatically applies some term filtering, but it can be good to customize this for your research.
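You can also filter the dtm yourself before fitting. A minimal sketch using the tm and slam packages (both loaded as dependencies above); the thresholds are illustrative assumptions, not recommendations:

```r
library(tm)
library(slam)

# Drop terms that occur in fewer than 1% of documents
dtm_filtered = removeSparseTerms(dtm, sparse=0.99)

# Alternatively, inspect overall term frequencies and keep only
# terms that occur at least 10 times in the corpus
termfreq = col_sums(dtm)
dtm_filtered2 = dtm[, termfreq >= 10]
```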

Now we are ready to fit the model! We made a wrapper called lda.fit for the LDA function in the topicmodels package. This wrapper doesn't do anything interesting, except for deleting empty rows/columns from the dtm, which can occur after filtering out words.

The main input for lda.fit is:

- the document term matrix
- K: the number of topics (this has to be defined a priori)
- optionally, the number of iterations; increasing it takes more time, but improves performance
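For reference, fitting the same kind of model directly with topicmodels looks roughly as follows. This is a sketch; the exact control settings that lda.fit passes on are an assumption on our part.

```r
library(topicmodels)

# Fit an LDA model with Gibbs sampling directly via topicmodels.
# k and iter mirror the K and num.iterations arguments above.
m_direct = LDA(dtm, k=20, method="Gibbs",
               control=list(iter=1000, seed=12345))
terms(m_direct, 10)  # top 10 terms per topic
```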

set.seed(12345)
m = lda.fit(dtm, K=20, num.iterations=1000)
terms(m, 10)[,1:5] # show first 5 topics, with ten top words per topic
##       Topic 1     Topic 2     Topic 3     Topic 4       Topic 5    
##  [1,] "health"    "job"       "life"      "America"     "child"    
##  [2,] "care"      "new"       "time"      "nation"      "school"   
##  [3,] "plan"      "economy"   "same"      "great"       "education"
##  [4,] "Americans" "company"   "place"     "opportunity" "college"  
##  [5,] "reform"    "industry"  "community" "history"     "better"   
##  [6,] "cost"      "economic"  "thing"     "strong"      "student"  
##  [7,] "insurance" "worker"    "change"    "century"     "high"     
##  [8,] "Medicare"  "recession" "work"      "society"     "higher"   
##  [9,] "drug"      "today"     "reform"    "funding"     "parent"   
## [10,] "coverage"  "project"   "different" "difficult"   "standard"

We now have a fitted lda model. The terms function shows the most prominent words for each topic (we only selected the first 5 topics for convenience).
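Besides terms, the posterior function from topicmodels exposes the full topic-word and document-topic probability matrices, which you can inspect directly (a sketch):

```r
post = posterior(m)

# topic-word distribution: one row per topic, one column per term
dim(post$terms)

# top 10 terms of topic 1, ranked by probability
head(sort(post$terms[1,], decreasing=TRUE), 10)

# document-topic distribution: one row per document
head(post$topics)
```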

One of the things we can do with the LDA topics is analyze how much attention they get over time, and how much they are used by different sources (e.g., people, newspapers, organizations). To do so, we need the article metadata. We can order the metadata to the documents in the LDA model by matching it to the documents slot.

head(sotu.meta)
##          id   medium     headline       date
## 1 111541965 Speeches Barack Obama 2013-02-12
## 2 111541995 Speeches Barack Obama 2013-02-12
## 3 111542001 Speeches Barack Obama 2013-02-12
## 4 111542006 Speeches Barack Obama 2013-02-12
## 5 111542013 Speeches Barack Obama 2013-02-12
## 6 111542018 Speeches Barack Obama 2013-02-12
meta = sotu.meta[match(m@documents, sotu.meta$id),]
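After this step the rows of meta should line up with m@documents; it can be worth checking this explicitly. A small sketch (note that the documents slot holds ids as character strings):

```r
# Compare on the character scale, since m@documents is a character vector
stopifnot(all(as.character(meta$id) == m@documents))
```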

We can now do some plotting. First, we can make a wordcloud for a fancier (and actually quite informative and intuitive) representation of the top words of a topic.

lda.plot.wordcloud(m, topic_nr=1)

lda.plot.wordcloud(m, topic_nr=2)

With lda.plot.time and lda.plot.category, we can plot the salience of the topic over time and for a given categorical variable.

lda.plot.time(m, 1, meta$date, date_interval='year')

lda.plot.category(m, 1, meta$headline)

It can be useful to print all this information together. That is what the following function does.

lda.plot.topic(m, 1, meta$date, meta$headline, date_interval='year')

lda.plot.topic(m, 2, meta$date, meta$headline, date_interval='year')

For further substantive analysis, we can also create a data frame containing the topic proportion for each document:

docs = topics.per.document(m)
docs = merge(meta, docs)
head(docs)
##          id   medium     headline       date         X1         X2
## 1 111541965 Speeches Barack Obama 2013-02-12 0.04310345 0.04310345
## 2 111541995 Speeches Barack Obama 2013-02-12 0.04375000 0.04375000
## 3 111542001 Speeches Barack Obama 2013-02-12 0.03012048 0.04216867
## 4 111542006 Speeches Barack Obama 2013-02-12 0.03676471 0.06617647
## 5 111542013 Speeches Barack Obama 2013-02-12 0.03205128 0.07051282
## 6 111542018 Speeches Barack Obama 2013-02-12 0.04729730 0.06081081
##           X3         X4         X5         X6         X7         X8
## 1 0.06034483 0.06034483 0.04310345 0.06034483 0.04310345 0.04310345
## 2 0.05625000 0.03125000 0.06875000 0.04375000 0.06875000 0.10625000
## 3 0.05421687 0.03012048 0.18674699 0.03012048 0.06626506 0.04216867
## 4 0.03676471 0.03676471 0.13970588 0.06617647 0.05147059 0.05147059
## 5 0.07051282 0.03205128 0.05769231 0.09615385 0.05769231 0.04487179
## 6 0.04729730 0.03378378 0.06081081 0.03378378 0.03378378 0.06081081
##           X9        X10        X11        X12        X13        X14
## 1 0.07758621 0.06034483 0.06034483 0.06034483 0.04310345 0.04310345
## 2 0.05625000 0.05625000 0.05625000 0.05625000 0.09375000 0.03125000
## 3 0.03012048 0.06626506 0.04216867 0.05421687 0.03012048 0.03012048
## 4 0.05147059 0.05147059 0.03676471 0.05147059 0.05147059 0.03676471
## 5 0.04487179 0.04487179 0.03205128 0.05769231 0.04487179 0.03205128
## 6 0.04729730 0.03378378 0.03378378 0.04729730 0.08783784 0.06081081
##          X15        X16        X17        X18        X19        X20
## 1 0.04310345 0.04310345 0.04310345 0.04310345 0.04310345 0.04310345
## 2 0.03125000 0.03125000 0.03125000 0.03125000 0.03125000 0.03125000
## 3 0.03012048 0.04216867 0.05421687 0.05421687 0.05421687 0.03012048
## 4 0.03676471 0.03676471 0.03676471 0.03676471 0.05147059 0.03676471
## 5 0.03205128 0.05769231 0.03205128 0.07051282 0.05769231 0.03205128
## 6 0.07432432 0.03378378 0.06081081 0.04729730 0.06081081 0.03378378