Unsupervised machine learning

In this exercise, we’re going to run topic modeling on Tweets.

Topic modeling is an unsupervised learning method that finds hidden word co-occurrence patterns (“topics”) within our data.

One problem with topic modeling on Twitter is that Tweets are so short that it’s tough to find topics given the limited document length. One “quick fix” is to aggregate Tweets by user (e.g. combine all of my Tweets and treat them as one document).
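In this exercise we’ll let quanteda do that aggregation for us (via the groups option in Step 1), but a minimal dplyr sketch of the same idea, assuming the tweets data frame read in below with its actor.id and body columns, would be:

# sketch: collapse each user's Tweets into a single "document" (one row per user)
library(dplyr)
user_docs <- tweets %>%
  group_by(actor.id) %>%
  summarise(text = paste(body, collapse = " "))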

A benefit of topic modeling is that you can assign document-level probabilities per topic. This way you can measure the mix of topics across documents. In this example, our documents “are” Twitter users. Therefore, we can analyze a user-topic probability matrix and get an idea of who is tweeting about what.

Step 1: Read in the data.

Let’s read in our Charlotte sample dataset.

library(tidyverse)
#remove one of the "." if you are running as chunks
tweets <- read_csv('../data/CharlotteTweets20Sample.csv')
source('./functions.R')

#updated for quanteda version 0.9.9-50
library(quanteda)
twcorpus <- corpus(tweets$body)

Let’s load in the user ID number as a document-level attribute (the corpus was originally built at the Tweet level). We’ll use the groups option in dfm to aggregate our Tweets by user (actor.id).

Let’s create the DFM object and trim out rare words (those appearing in fewer than three users’ documents).

# assign each Tweet's user ID as a document-level variable
docvars(twcorpus, "actor.id") <- as.character(tweets$actor.id) 
# build the DFM, grouping Tweets by user and removing stop words, punctuation,
# numbers, symbols, and URL fragments
twdfm <- dfm(twcorpus, 
             groups = "actor.id", 
             remove = c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), 
             remove_punct = TRUE, 
             remove_numbers = TRUE, 
             remove_symbols = TRUE,
             remove_url = TRUE,
             ngrams = 1L)
# drop words that appear in fewer than three documents (users)
twdfm <- dfm_trim(twdfm, min_docfreq = 3)

Let’s look at the top 50 words.

topfeatures(twdfm, 50)
##     charlotte            nc          just          like          love 
##          4898          2833          2126          1475          1432 
##           get           day      carolina          time           now 
##          1364          1322          1217          1201          1124 
##           one          good         north           new            go 
##          1100          1097          1086          1067          1044 
##         today          will           see         happy          know 
##           997           900           895           873           809 
##         great           got          game #keeppounding         night 
##           807           747           734           732           731 
##    #charlotte      drinking       tonight           lol          back 
##           716           706           678           673           670 
##          come            gt        people          best     christmas 
##           665           641           623           588           581 
##         going          year      panthers          want         first 
##           578           576           572           571           564 
##         right          need            us          last        thanks 
##           563           562           559           544           529 
##          make          home             w         still          much 
##           526           505           502           500           495

No surprise that Charlotte-related terms are the most popular. Recall that this dataset includes only geolocated Tweets (non-geolocated Tweets were excluded).

Step 2: Text exploratory analysis.

Let’s do some exploratory analysis to understand the content.

library(RColorBrewer)

textplot_wordcloud(twdfm, 
                   scale=c(3.5, .75), 
                   colors=brewer.pal(8, "Dark2"), 
                   random.order = F, 
                   rot.per=0.1, 
                   max.words=250)

We can also rerun the word cloud, this time weighting terms not by raw frequency but by their TF-IDF weights.

textplot_wordcloud(tfidf(twdfm), 
                   scale=c(3.5, .75), 
                   colors=brewer.pal(8, "Dark2"), 
                   random.order = F, 
                   rot.per=0.1, 
                   max.words=250)

We can also run a hierarchical clustering of the top words…

numWords <- 35

wordDfm <- sort(weight(twdfm, "tfidf"))
wordDfm <- t(wordDfm)[1:numWords,]  # keep the top numWords words
wordDistMat <- dist(wordDfm)
wordCluster <- hclust(wordDistMat)
plot(wordCluster, xlab="", main="TF-IDF Frequency weighting")
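If you want to cut the dendrogram into discrete word clusters rather than just eyeballing it, one option is base R’s cutree() (a sketch; the number of clusters here is arbitrary):

# cut the dendrogram into, say, 5 clusters and list the words in each
wordGroups <- cutree(wordCluster, k = 5)
split(names(wordGroups), wordGroups)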

Step 3: Run topic modeling.

Now, let’s run a simple 20-topic model.

We’ll use the package topicmodels.

# install.packages("topicmodels")
library(topicmodels)

# we now export to a format that we can run the topic model with
dtm <- convert(twdfm, to="topicmodels")

# estimate LDA with K topics
K <- 20
lda <- LDA(dtm, k = K, method = "Gibbs", 
                control = list(verbose=25L, seed = 123, burnin = 100, iter = 500))
## K = 20; V = 9400; M = 9705
## Sampling 600 iterations!
## Iteration 25 ...
## Iteration 50 ...
## Iteration 75 ...
## Iteration 100 ...
## Iteration 125 ...
## Iteration 150 ...
## Iteration 175 ...
## Iteration 200 ...
## Iteration 225 ...
## Iteration 250 ...
## Iteration 275 ...
## Iteration 300 ...
## Iteration 325 ...
## Iteration 350 ...
## Iteration 375 ...
## Iteration 400 ...
## Iteration 425 ...
## Iteration 450 ...
## Iteration 475 ...
## Iteration 500 ...
## Iteration 525 ...
## Iteration 550 ...
## Iteration 575 ...
## Iteration 600 ...
## Gibbs sampling completed!

Let’s explore the topics.

term <- terms(lda, 10)
colnames(term) <- paste("Topic",1:K)

term
##       Topic 1     Topic 2 Topic 3      Topic 4  Topic 5    Topic 6     
##  [1,] "love"      "today" "#charlotte" "lol"    "drinking" "gt"        
##  [2,] "much"      "us"    "#clt"       "like"   "#photo"   "f"         
##  [3,] "always"    "come"  "rd"         "shit"   "beer"     "wind"      
##  [4,] "hot"       "day"   "#traffic"   "get"    "hill"     "charlotte" 
##  [5,] "one"       "new"   "w"          "want"   "chicken"  "weather"   
##  [6,] "@midnight" "one"   "accident"   "got"    "rock"     "pressure"  
##  [7,] "got"       "best"  "n"          "really" "coffee"   "current"   
##  [8,] "will"      "get"   "south"      "need"   "ipa"      "center"    
##  [9,] "please"    "now"   "serious"    "fuck"   "brewery"  "kannapolis"
## [10,] "fun"       "see"   "st"         "know"   "company"  "cloudy"    
##       Topic 7     Topic 8  Topic 9  Topic 10  Topic 11 Topic 12 
##  [1,] "carolina"  "home"   "go"     "just"    "like"   "like"   
##  [2,] "north"     "school" "year"   "new"     "just"   "get"    
##  [3,] "charlotte" "great"  "game"   "posted"  "people" "im"     
##  [4,] "la"        "high"   "good"   "photo"   "one"    "omg"    
##  [5,] "de"        "house"  "get"    "thank"   "still"  "go"     
##  [6,] "#tshirt"   "family" "man"    "god"     "never"  "just"   
##  [7,] "#sale"     "new"    "great"  "make"    "think"  "guys"   
##  [8,] "#shop"     "work"   "team"   "going"   "say"    "girl"   
##  [9,] "free"      "time"   "better" "#repost" "know"   "love"   
## [10,] "photo"     "indian" "play"   "working" "way"    "someone"
##       Topic 13        Topic 14        Topic 15      Topic 16   Topic 17   
##  [1,] "#keeppounding" "charlotte"     "see"         "happy"    "christmas"
##  [2,] "time"          "nc"            "#nc"         "thanks"   "last"     
##  [3,] "game"          "airport"       "#realestate" "birthday" "charlotte"
##  [4,] "@panthers"     "douglas"       "looking"     "america"  "place"    
##  [5,] "panthers"      "international" "check"       "day"      "know"     
##  [6,] "super"         "clt"           "drive"       "bank"     "week"     
##  [7,] "bowl"          "first"         "tour"        "stadium"  "hours"    
##  [8,] "arena"         "@cltairport"   "#listing"    "go"       "night"    
##  [9,] "#panthers"     "w"             "lake"        "great"    "monday"   
## [10,] "win"           "concord"       "buyer"       "best"     "time"     
##       Topic 18   Topic 19   Topic 20 
##  [1,] "now"      "good"     "tonight"
##  [2,] "panthers" "life"     "night"  
##  [3,] "will"     "right"    "party"  
##  [4,] "weather"  "day"      "live"   
##  [5,] "good"     "work"     "city"   
##  [6,] "f"        "morning"  "#ciaa"  
##  [7,] "mph"      "tomorrow" "one"    
##  [8,] "wind"     "today"    "weekend"
##  [9,] "today"    "now"      "friday" 
## [10,] "news"     "back"     "bar"

How can we interpret these topics?
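One way to ground the interpretation is to inspect the topic-word probabilities directly, rather than just the top-10 term lists. A short sketch using topicmodels’ posterior(), shown here for Topic 13 (the Panthers topic above):

# topic-word probability matrix (rows = topics, columns = vocabulary terms)
probterms <- posterior(lda)$terms
# the 10 highest-probability words, with their probabilities, for Topic 13
round(sort(probterms[13, ], decreasing = TRUE)[1:10], 3)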

Step 4: Interactive topic visualization.

If you want to run an interactive LDA visualization (LDAvis), run this chunk:

#Create Json for LDAVis
library(LDAvis)
json <- topicmodels_json_ldavis(lda,twdfm,dtm)
new.order <- RJSONIO::fromJSON(json)$topic.order

# Topic #'s reordered!!
term <- term[,new.order]

serVis(json, out.dir = 'charlotteLDA', open.browser = TRUE)

Step 5: Finding Topic “Experts”

Just as topics are probability distributions over words, in LDA documents are probability distributions over topics. In our case, since documents are at the user level, each user’s combined Tweets are modeled as a mixture of topics.

Accordingly, we can rank Twitter users by their topic probabilities and find those who rank highest for a given topic.

First, let’s extract the document-topic probability matrix.

# to get topic probabilities per actor ID (Twitter user)
postlist <- posterior(lda)
probtopics <- data.frame(postlist$topics)
#probtopics <- probtopics[,new.order]
colnames(probtopics) <- paste("Topic",1:K)
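Each row of probtopics is one user’s distribution over the K topics, so the rows should sum to (approximately) one. A quick sanity check:

# each user's topic proportions should sum to roughly 1
summary(rowSums(probtopics))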

Next, let’s find the “expert” for Topic 8.

# choose a topic to inspect
filter.topic <- "Topic 8"

# rank users by their probability for that topic and take the highest-ranked user
row <- order(-probtopics[,filter.topic])
actorid <- rownames(probtopics[row[1],])
# pull that user's Tweets, sorted by time posted
filter.data <- subset(tweets, actor.id == actorid)
filter.data <- filter.data[order(filter.data$postedTime),]
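To check whether this user really does tweet about Topic 8’s themes (home, school, house, family, work), preview their Tweets:

# how many Tweets the "expert" posted, and a sample of the text
nrow(filter.data)
head(filter.data$body, 10)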

Explore “experts” for other topics. Do they make sense?
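One way to explore them all at once is to pull the highest-probability user for each topic in a single pass (a sketch):

# actor.id of the top-ranked user for each of the K topics
top.users <- sapply(1:K, function(k) rownames(probtopics)[order(-probtopics[, k])[1]])
names(top.users) <- colnames(probtopics)
top.users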

Step 6: Deeper LDA Analysis

Further analysis could incorporate covariates such as:

  1. Device (e.g. Foursquare vs Untappd vs Twitter iPhone, etc.)

  2. Profile location description (e.g. Charlotte, Fort Mill, Gastonia, etc.)

  3. Profile description keywords (e.g. husband, wife, student, etc.)

Social scientists have developed an extension of LDA, the structural topic model (STM), which provides a framework for measuring the impact of covariates on topic model results. For more details on running more advanced versions of LDA, like CTM and STM, I created a newer workshop on topic modeling.
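As a pointer in that direction, here is a minimal, hypothetical sketch of fitting an STM with a topic-prevalence covariate using the stm package. It assumes you have added a device docvar to the DFM (not done in this exercise), so treat it as illustration only:

# hypothetical sketch: topic prevalence as a function of posting device
library(stm)
out <- convert(twdfm, to = "stm")   # quanteda DFM -> stm inputs (documents, vocab, meta)
stm.fit <- stm(out$documents, out$vocab, K = 20,
               prevalence = ~ device,  # 'device' is an assumed docvar, not in this dataset
               data = out$meta)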