In this exercise, we’re going to run topic modeling on Tweets.
Topic modeling is an unsupervised method that finds hidden word co-occurrence patterns (“topics”) within our data.
One problem with topic modeling on Twitter is that Tweets are so short that it is tough to find topics in such brief documents. One “quick fix” is to aggregate Tweets by user (e.g., combine all of one user’s Tweets and treat them as a single document).
A benefit of topic modeling is that you can assign document-level probabilities per topic, which lets you measure the mix of topics across documents. In this example, our documents “are” Twitter users. Therefore, we can analyze a user-topic probability matrix and get an idea of what each user Tweets about.
Let’s read in our Charlotte sample dataset.
library(tidyverse)
# remove one of the "." in the path below if you are running this as individual chunks
tweets <- read_csv('../data/CharlotteTweets20Sample.csv')
source('./functions.R')
#updated for quanteda version 0.9.9-50
library(quanteda)
twcorpus <- corpus(tweets$body)
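As a quick sanity check, quanteda’s summary() reports per-document token and type counts; here is a peek at the first few Tweets (a minimal sketch):
# summarize the first five documents in the corpus
summary(twcorpus, n = 5)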
Let’s load in the user ID number (originally a Tweet-level attribute) as a document variable. We’ll use the groups option of dfm() to aggregate our Tweets by user (actor.id).
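For intuition, here is a minimal dplyr sketch of the same aggregation done manually; the groups argument below handles this for us, so this chunk is illustrative only:
# manual equivalent of grouping: one concatenated document per user
user_docs <- tweets %>%
  group_by(actor.id) %>%
  summarise(text = paste(body, collapse = " "))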
Let’s create the DFM object and trim rare words, keeping only terms that appear in at least three documents (users).
docvars(twcorpus, "actor.id") <- as.character(tweets$actor.id)
twdfm <- dfm(twcorpus,
             groups = "actor.id",
             remove = c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"),
             remove_punct = TRUE,
             remove_numbers = TRUE,
             remove_symbols = TRUE,
             remove_url = TRUE,
             ngrams = 1L)
twdfm <- dfm_trim(twdfm, min_docfreq = 3)
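As a quick check on what trimming left us, we can inspect the DFM’s dimensions (rows are now users, not Tweets):
# rows = users (grouped documents), columns = retained terms
dim(twdfm)
ndoc(twdfm)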
Let’s look at the top 50 words.
topfeatures(twdfm, 50)
## charlotte nc just like love
## 4898 2833 2126 1475 1432
## get day carolina time now
## 1364 1322 1217 1201 1124
## one good north new go
## 1100 1097 1086 1067 1044
## today will see happy know
## 997 900 895 873 809
## great got game #keeppounding night
## 807 747 734 732 731
## #charlotte drinking tonight lol back
## 716 706 678 673 670
## come gt people best christmas
## 665 641 623 588 581
## going year panthers want first
## 578 576 572 571 564
## right need us last thanks
## 563 562 559 544 529
## make home w still much
## 526 505 502 500 495
No surprise: Charlotte-related terms are the most popular. Recall that this dataset contains only geolocated Tweets (non-geolocated Tweets were excluded).
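A simple bar chart is a quick complement to the raw counts; a minimal base-R sketch:
# bar chart of the 20 most frequent terms, with rotated labels
barplot(topfeatures(twdfm, 20), las = 2, cex.names = 0.8, main = "Top 20 terms")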
Let’s do some exploratory analysis to understand the content.
library(RColorBrewer)
textplot_wordcloud(twdfm,
                   scale = c(3.5, .75),
                   colors = brewer.pal(8, "Dark2"),
                   random.order = FALSE,
                   rot.per = 0.1,
                   max.words = 250)
We can also re-run the word cloud, weighting not by raw term frequency but by TF-IDF.
textplot_wordcloud(tfidf(twdfm),
                   scale = c(3.5, .75),
                   colors = brewer.pal(8, "Dark2"),
                   random.order = FALSE,
                   rot.per = 0.1,
                   max.words = 250)
We can also run a hierarchical clustering of the words…
numWords <- 35
wordDfm <- sort(weight(twdfm, "tfidf"))   # weight by TF-IDF and sort the features
wordDfm <- t(wordDfm)[1:numWords, ]       # keep the top numWords words
wordDistMat <- dist(wordDfm)              # distance matrix between word vectors
wordCluster <- hclust(wordDistMat)        # agglomerative hierarchical clustering
plot(wordCluster, xlab = "", main = "TF-IDF Frequency weighting")
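To turn the dendrogram into discrete word groups, we can cut the tree at a chosen number of clusters; a minimal sketch with five clusters:
# assign each of the 35 words to one of 5 clusters and list them
wordGroups <- cutree(wordCluster, k = 5)
split(names(wordGroups), wordGroups)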
Now, let’s run a simple 20-topic model. We’ll use the topicmodels package.
# install.packages("topicmodels")
library(topicmodels)
# we now export to a format that we can run the topic model with
dtm <- convert(twdfm, to="topicmodels")
# estimate LDA with K topics
K <- 20
lda <- LDA(dtm, k = K, method = "Gibbs",
           control = list(verbose = 25L, seed = 123, burnin = 100, iter = 500))
## K = 20; V = 9400; M = 9705
## Sampling 600 iterations!
## Iteration 25 ...
## Iteration 50 ...
## Iteration 75 ...
## ...
## Iteration 575 ...
## Iteration 600 ...
## Gibbs sampling completed!
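Before inspecting each topic’s top words, topicmodels’ topics() returns the single most likely topic per document (here, per user), which gives a quick sense of topic sizes:
# most likely topic per user, and how many users fall into each topic
user.topic <- topics(lda)
table(user.topic)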
Let’s explore the topics.
term <- terms(lda, 10)
colnames(term) <- paste("Topic",1:K)
term
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
## [1,] "love" "today" "#charlotte" "lol" "drinking" "gt"
## [2,] "much" "us" "#clt" "like" "#photo" "f"
## [3,] "always" "come" "rd" "shit" "beer" "wind"
## [4,] "hot" "day" "#traffic" "get" "hill" "charlotte"
## [5,] "one" "new" "w" "want" "chicken" "weather"
## [6,] "@midnight" "one" "accident" "got" "rock" "pressure"
## [7,] "got" "best" "n" "really" "coffee" "current"
## [8,] "will" "get" "south" "need" "ipa" "center"
## [9,] "please" "now" "serious" "fuck" "brewery" "kannapolis"
## [10,] "fun" "see" "st" "know" "company" "cloudy"
## Topic 7 Topic 8 Topic 9 Topic 10 Topic 11 Topic 12
## [1,] "carolina" "home" "go" "just" "like" "like"
## [2,] "north" "school" "year" "new" "just" "get"
## [3,] "charlotte" "great" "game" "posted" "people" "im"
## [4,] "la" "high" "good" "photo" "one" "omg"
## [5,] "de" "house" "get" "thank" "still" "go"
## [6,] "#tshirt" "family" "man" "god" "never" "just"
## [7,] "#sale" "new" "great" "make" "think" "guys"
## [8,] "#shop" "work" "team" "going" "say" "girl"
## [9,] "free" "time" "better" "#repost" "know" "love"
## [10,] "photo" "indian" "play" "working" "way" "someone"
## Topic 13 Topic 14 Topic 15 Topic 16 Topic 17
## [1,] "#keeppounding" "charlotte" "see" "happy" "christmas"
## [2,] "time" "nc" "#nc" "thanks" "last"
## [3,] "game" "airport" "#realestate" "birthday" "charlotte"
## [4,] "@panthers" "douglas" "looking" "america" "place"
## [5,] "panthers" "international" "check" "day" "know"
## [6,] "super" "clt" "drive" "bank" "week"
## [7,] "bowl" "first" "tour" "stadium" "hours"
## [8,] "arena" "@cltairport" "#listing" "go" "night"
## [9,] "#panthers" "w" "lake" "great" "monday"
## [10,] "win" "concord" "buyer" "best" "time"
## Topic 18 Topic 19 Topic 20
## [1,] "now" "good" "tonight"
## [2,] "panthers" "life" "night"
## [3,] "will" "right" "party"
## [4,] "weather" "day" "live"
## [5,] "good" "work" "city"
## [6,] "f" "morning" "#ciaa"
## [7,] "mph" "tomorrow" "one"
## [8,] "wind" "today" "weekend"
## [9,] "today" "now" "friday"
## [10,] "news" "back" "bar"
How can we interpret these topics?
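Under the hood, terms() simply ranks each row of the topic-word probability matrix, which we can recover from posterior(); a sketch for Topic 13 (the Panthers topic above):
# topic-word probabilities: a K x V matrix whose rows sum to 1
beta <- posterior(lda)$terms
sort(beta[13, ], decreasing = TRUE)[1:10]   # ten most probable words for Topic 13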
If you want to run an interactive LDA visualization (LDAvis), run this chunk:
#Create Json for LDAVis
library(LDAvis)
json <- topicmodels_json_ldavis(lda,twdfm,dtm)
new.order <- RJSONIO::fromJSON(json)$topic.order
# note: LDAvis reorders the topics, so realign our term matrix
term <- term[,new.order]
serVis(json, out.dir = 'charlotteLDA', open.browser = TRUE)
Just as topics are probability distributions over words, documents in LDA are probability distributions over topics. In our case, since documents are at the user level, each user’s combined Tweet history is modeled as a mixture of topics.
Accordingly, we can rank Twitter users by their probability for a given topic.
First, let’s extract the document-topic probability matrix.
# to get topic probabilities per actor ID (Twitter user)
postlist <- posterior(lda)
probtopics <- data.frame(postlist$topics)
#probtopics <- probtopics[,new.order]   # uncomment if you ran the LDAvis chunk above
colnames(probtopics) <- paste("Topic",1:K)
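Since each row is a probability distribution over topics, the rows should sum to one; a quick check:
# every user's topic proportions should sum to 1
summary(rowSums(probtopics))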
Next, let’s find the “expert” for Topic 8.
filter.topic <- "Topic 8"
row <- order(-probtopics[, filter.topic])    # rank users by this topic's probability (descending)
actorid <- rownames(probtopics[row[1], ])    # ID of the top-ranked user
filter.data <- subset(tweets, actor.id == actorid)
filter.data <- filter.data[order(filter.data$postedTime), ]   # sort the user's Tweets chronologically
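To see why this user ranks highest, we can scan their earliest Tweets:
# inspect the top-ranked user's first ten Tweets for this topic
head(filter.data$body, 10)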
Explore “experts” for other topics. Do they make sense?
Further analysis could incorporate user-level covariates (see the sketch after this list), for example:
Device (e.g. Foursquare vs Untappd vs Twitter iPhone, etc.)
Profile location description (e.g. Charlotte, Fort Mill, Gastonia, etc.)
Profile description keywords (e.g. husband, wife, student, etc.)
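As a rough, hedged sketch of the first idea: assuming the export includes a column identifying the posting client (generator below is a hypothetical name; check your dataset’s schema), we could compare average topic probabilities by device, e.g., for the “drinking” topic (Topic 5):
# hypothetical: 'generator' stands in for whatever column names the posting client
probtopics$actor.id <- rownames(probtopics)
devices <- tweets %>%
  mutate(actor.id = as.character(actor.id)) %>%   # match the rowname type
  group_by(actor.id) %>%
  summarise(device = first(generator))
left_join(devices, probtopics, by = "actor.id") %>%
  group_by(device) %>%
  summarise(mean.topic5 = mean(`Topic 5`))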
Social scientists have developed an extension of LDA, the structural topic model (STM), that provides a framework for assessing how covariates affect topic model results. For more details on running more advanced versions of LDA, such as CTM and STM, I created a newer workshop on topic modeling.