Reload Previous Image

Load the image

We’ll first reload the data that we saved at the end of part 1.

See the code from part 1 to understand how the data was created.

load("01-datacleaning-exploration.RData")

Topic Modeling

Baseline Model

First, we’ll need to call the stm package. If you do not have the package, run the command install.packages("stm") to install it.

Note that while we’re going to use the stm package, for this iteration we will not include covariates, which effectively reduces the model to standard topic modeling. Technically, the stm package uses the Correlated Topic Model rather than vanilla Latent Dirichlet Allocation (LDA), which, unlike LDA, allows correlations between the topics to be estimated.

Before running stm, we’ll need to convert the dfm (based on the quanteda package) to the data structure that stm uses through the convert function.

Next, we’ll examine what minimum frequency of word occurrences to use when removing sparse (very rare) terms. For now, we’ll only eliminate terms that appear in three or fewer documents (lower.thresh = 3). We may adjust this upward once we get more data.

Last, we’ll create a data structure, labeled out, that includes our documents, vocabulary (words), and metadata (covariates).

library(stm)
library(quanteda)

# use quanteda's convert function to convert our dfm
stmdfm <- convert(dfm, to = "stm", docvars = docvars(myCorpus))

plotRemoved(stmdfm$documents, lower.thresh = seq(1, 80, by = 20))

out <- prepDocuments(stmdfm$documents, stmdfm$vocab, stmdfm$meta, lower.thresh = 3)
## Removing 1175 of 2659 terms (2764 of 33277 tokens) due to frequency 
## Your corpus now has 1170 documents, 1484 terms and 30513 tokens.

Choosing the Number of Topics

STM offers a useful function, searchK, that runs topic modeling multiple times to determine the “best fitting” number of topics. Note – this step can take several minutes as it’s running topic modeling under different scenarios (different numbers of topics).

K <- c(5, 10, 20, 30, 50, 75, 100)
kresult <- searchK(out$documents, out$vocab, K, data = out$meta, max.em.its = 150, 
    init.type = "Spectral")
plot(kresult)

This result shows the performance of topic models run under different scenarios with different numbers of topics.

The held-out likelihood measures the predictive strength of each model on an out-of-sample (held-out) dataset. Values closer to zero (i.e. higher on the y-axis) are better than large negative values; by this standard, the plot shows that the 100-topic model is the best-fitting model from a prediction standpoint.

Alternatively, if our goal is interpretation, it may be better to consider the “Semantic Coherence,” which measures the average interpretability of the topics. Like the held-out likelihood, a model with an average value closer to zero (e.g. 10 topics) can be thought of as more interpretable than a model with lower values. Using this metric, we’d select the 10-topic model.

One note – in most topic models, the best-fitting model by held-out likelihood may not be the most interpretable (i.e. have the best Semantic Coherence values). See Chang et al. 2009.

Let’s assume our goal is more interpretation. From this perspective, either 20 or 30 topics could work well, offering high interpretability with adequate (although not necessarily maximum) predictive performance.
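Before settling on a value, it can also help to look at the searchK diagnostics as numbers rather than only as a plot. Here’s a minimal sketch, assuming searchK stores its per-K metrics in kresult$results (as in the stm version used here; the unlist calls guard against list-columns in other versions):

# inspect searchK's per-K diagnostics directly
diagnostics <- kresult$results
diagnostics[, c("K", "heldout", "semcoh")]

# trade-off view: semantic coherence (x) against held-out likelihood (y),
# labeling each point with its number of topics
plot(unlist(diagnostics$semcoh), unlist(diagnostics$heldout), type = "n", 
    xlab = "Semantic Coherence", ylab = "Held-out Likelihood")
text(unlist(diagnostics$semcoh), unlist(diagnostics$heldout), labels = unlist(diagnostics$K))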

Run the baseline model:

Next, we will run our baseline model without any covariates. First, we set k equal to 10 for our number of topics and then run our model.

k <- 10

stmFit <- stm(out$documents, out$vocab, K = k, max.em.its = 150, data = out$meta, 
    init.type = "Spectral", seed = 300)

After running the model, we’ve created a new data object (stmFit) that includes all of our model information.

We can then pass the model information to a plot function to summarise the results.

plot(stmFit, type = "summary", xlim = c(0, 0.7), ylim = c(0, 10.4), n = 5, main = "Survey Topics", 
    width = 10, text.cex = 1)

Next, let’s create a dataframe named topic that provides the topic number, name, and expected topic size (proportions).

We also saved the topic labels into an object called topicNames.

topic <- data.frame(topicnames = paste0("Topic ", 1:k), TopicNumber = 1:k, TopicProportions = colMeans(stmFit$theta))

topicNames <- labelTopics(stmFit)

Topic Labels

To help with labeling, we’ll plot the 20 words that best describe each topic. Each topic (each “row”) is plotted using two measures: the topic probability (left column) and FREX (right column).

The left column shows the words based on the typical topic model output: word-topic probabilities. The right column uses FREX, which reweights the word-topic probabilities to emphasize words that are more “frequent” and more “exclusive” to each topic (hence “FR” + “EX”). For more details on FREX, see page 5 of Roberts et al. (2013).

par(mfrow = c(3, 2), mar = c(1, 1, 2, 1))
for (i in 1:k) {
    plot(stmFit, type = "labels", n = 20, topics = i, main = "Raw Probabilities", 
        width = 40)
    plot(stmFit, type = "labels", n = 20, topics = i, main = "FREX Weights", labeltype = "frex", 
        width = 50)
}

Ultimately, our goal is to give each topic a name (label). We’ll then use those labels instead of calling each topic by its number (which is essentially meaningless). Typically, the first couple of words in the FREX scores provide ideal one- or two-word names for the topics. We’ll need to decide on the labels as a group, and (ideally) choose names that are consistent with any theoretical frameworks we’re aware of with respect to leadership.
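As a starting point for that discussion, we can pull the top FREX words programmatically. A minimal sketch (the three-word cutoff is just an assumption to tune):

# pull the top three FREX words per topic as rough candidate labels;
# labelTopics() returns the FREX words in its $frex matrix
frexWords <- labelTopics(stmFit, n = 3)$frex
data.frame(topic = 1:k, candidate = apply(frexWords, 1, paste, collapse = ", "))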

We can also manually output the top words for each topic using the word probabilities (“Highest Prob”), FREX and two other measurements (Lift and Score). Let’s consider topic 7.

labelTopics(stmFit, 7)
## Topic 7 Top Words:
##       Highest Prob: provide, coaching, supervisor, mentoring, concerns, work, development 
##       FREX: provide, different, coaching, mentoring, situations, calls, development 
##       Lift: hate, vast, anyway, unable, annoying, typically, upper 
##       Score: hate, provide, coaching, mentoring, development, supervisor, different

Another way of interpreting the topics is to find the most representative documents (in this case, responses) for each topic. Essentially, we can find the documents (their first 300 characters) that best exemplify the topic. Let’s again consider topic 7.

shortdoc <- substr(text, 1, 300)
findThoughts(stmFit, texts = shortdoc, n = 5, topics = 7)
## 
##  Topic 7: 
##       My supervisor does not provide a ton of support to accomplish most of my work objectives for the simple fact that while I report to her, my job falls outside of the traditional work my department would do, so she doesn't completely understand my role  She does listen to my concerns and trusts me to 
##      This is one area where there is definitely room for improvement  He will often listen to my concerns or suggestions and sometimes even praise them but then there is no follow through  As far as coaching and mentoring he doesn't do alot of that due to my position, he expects me to self train 
##      My manager tries to understand the work to be done, but mostly keep silent when difficult situations arise  Supervisor is ready to listen to the concerns and usually provide advice  My manager doesn't provide me coaching and mentoring personally but he make arrangements so that I can learn and devel
##      He provides a lot of support in the way of making introductions to people across the company to help accomplish objectives  He also is willing to actively throw support behind the things I am working on and enlist outside help for me to complete projects  He'll listen to concerns and provide insight
##      I really receive no coaching at all  My supervisor takes a very hands off approach  I think that I only get talked to if I do poorly or if I screw up  I think he may or may not respect me  He really only speaks to me when something goes wrong or when there is new information that he has to give me

In the case of topic 3, we can give the topic the label “Understands Problems,” since its major words seem to be “needs,” “understands,” “problems,” and “supports.”

Doing the same for each topic, we can assign labels to all of the topics.

topicNames <- labelTopics(stmFit)
topic <- data.frame(topicnames = c("Clients", "Progress/Advancement", "Training", 
    "Listens to my Problems", "Shows Respect", "Team Members", "Coaching & Mentoring", 
    "Manager Understands", "Boss' Help for Day-to-Day Problems", "Time Management"), 
    TopicNumber = 1:k, TopicProportions = colMeans(stmFit$theta), stringsAsFactors = FALSE)

Interpretability

We can also use the Semantic Coherence and Exclusivity to measure the interpretability of each topic.

topicQuality(stmFit, documents = out$documents)
##  [1] -103.31327 -117.73656 -111.54348  -66.69510  -80.53147 -107.93053
##  [7]  -84.94726  -91.60163  -81.37632  -99.53997
##  [1] 9.046028 8.941832 9.521456 9.536003 9.610015 8.789586 8.895845
##  [8] 8.775386 8.534439 8.891792

Essentially, topics with higher coherence (i.e. further right on the x-axis) tend to be more interpretable. On the other hand, topics on the left have lower semantic coherence and tend to be less interpretable. This can sometimes happen because a topic is a mixture of two or more sub-topics; such a topic may separate if we move to more topics. This is important when we run inference and need to interpret the results.

The exclusivity measure helps to identify which topics contain words that are more unique to that topic. Topics with a lower exclusivity (e.g. Topic 9) tend to include words that are used in a lot of other topics; hence, they tend to be more “generic” topics.

Typically, topics with a low exclusivity score are also less interpretable.

Alternatively, topics with a higher exclusivity have more distinctive words and tend to be easier to interpret (e.g. topic 5 was easy to interpret as “Shows Respect” from its words).

Using both measures, we can think of topics in the top right as the most interpretable: high exclusivity (unique words) and high semantic coherence (consistency of the words).
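If you want these numbers alongside our topic labels, both measures are also exported by stm as standalone functions. A minimal sketch (sorting by coherence is our choice, not a package default):

# compute both interpretability measures per topic, attach our labels,
# and sort so the most coherent topics come first
coherence <- semanticCoherence(stmFit, documents = out$documents)
exclus <- exclusivity(stmFit)
interp <- data.frame(label = topic$topicnames, coherence, exclusivity = exclus)
interp[order(-interp$coherence), ]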

Topic Comparisons

We can also examine the relationships between topic words visually using the plot.STM “perspectives” plots. The perspectives plot is a built-in function that allows you to compare the words between two topics: finding which words are similar (near the center/middle of the plot) and which words are different (near the left or right edges, depending on how strongly they belong to each topic).

For example, we can compare the words used for topic 1 and topic 3.

plot(stmFit, type = "perspectives", topics = c(1, 3))

Note that the words that are in the middle are shared by these topics whereas words near the left or right side are more unique to the topic (relative to the other topic). Please see the stm documentation for more details on the plots (e.g. type “?plot.STM” into the console).

Topic Correlations

Last, we can examine the topics as a network in which each node (dot) is a topic and each edge (line) indicates a significant correlation (shared words) between two topics. To draw this, we have to set a threshold for the minimum correlation required for an edge. There is no perfect way to set the minimum correlation value; trial and error may be the best strategy.
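As an aside, stm ships its own helper for this: topicCorr estimates the topic correlation graph given a cutoff, and its plot method draws a simple static network. A minimal sketch (the 0.1 cutoff mirrors the threshold we use below and is just a starting point):

# stm's built-in route: estimate the correlation graph and plot it,
# using our topic labels as the vertex labels
corrGraph <- topicCorr(stmFit, method = "simple", cutoff = 0.1)
plot(corrGraph, vlabels = topic$topicnames)

For an interactive version with more control over layout and styling, we’ll instead build the network by hand with igraph and visNetwork.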

library(igraph)
library(visNetwork)

threshold <- 0.1

cormat <- cor(stmFit$theta)
adjmat <- ifelse(abs(cormat) > threshold, 1, 0)

links2 <- as.matrix(adjmat)
net2 <- graph_from_adjacency_matrix(links2, mode = "undirected")
net2 <- igraph::simplify(net2, remove.multiple = FALSE, remove.loops = TRUE)

data <- toVisNetworkData(net2)

nodes <- data[[1]]
edges <- data[[2]]

We can now run an algorithm that automatically detects clusters (communities) within the network.

We’ll use those clusters for the node colors.

# Community Detection
clp <- cluster_label_prop(net2)
nodes$community <- clp$membership
qual_col_pals = RColorBrewer::brewer.pal.info[RColorBrewer::brewer.pal.info$category == 
    "qual", ]
col_vector = unlist(mapply(RColorBrewer::brewer.pal, qual_col_pals$maxcolors, rownames(qual_col_pals)))
col_vector <- c(col_vector, col_vector)

col <- col_vector[nodes$community + 1]

links <- igraph::as_data_frame(net2, what = "edges")
nodes <- igraph::as_data_frame(net2, what = "vertices")

Last, we can specify parameters for the network nodes such as shape, title, label, size, and border width.

For example, the node size is proportional to the topic proportion (scaled by the largest topic proportion) and multiplied by 40 to adjust to the visNetwork baseline settings. We can also specify other information about the network.

# save the topic proportions
TopicProportions = colMeans(stmFit$theta)

# visNetwork preferences
nodes$shape <- "dot"
nodes$shadow <- TRUE  # Nodes will drop shadow
nodes$title <- topic$topicnames  # Text on click
nodes$label <- topic$topicnames  # Node label
nodes$size <- (TopicProportions/max(TopicProportions)) * 40  # Node size
nodes$borderWidth <- 2  # Node border width

nodes$color.background <- col
nodes$color.border <- "black"
nodes$color.highlight.background <- "orange"
nodes$color.highlight.border <- "darkred"
nodes$id <- 1:nrow(nodes)

Finally, let’s run visNetwork to produce an interactive network.

visNetwork(nodes, links, width = "100%", height = "600px", main = "Topic Correlations")

Save Image & Libraries Used

save.image(file = "02-topicmodel-nocovariates.RData")
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] visNetwork_2.0.1 igraph_1.1.2     quanteda_0.99    stm_1.3.0       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13        RColorBrewer_1.1-2  compiler_3.4.1     
##  [4] formatR_1.5         plyr_1.8.4          tools_3.4.1        
##  [7] digest_0.6.12       lubridate_1.6.0     jsonlite_1.5       
## [10] evaluate_0.10.1     tibble_1.3.4        gtable_0.2.0       
## [13] lattice_0.20-34     pkgconfig_2.0.1     rlang_0.1.2        
## [16] Matrix_1.2-8        fastmatch_1.1-0     yaml_2.1.14        
## [19] stringr_1.2.0       knitr_1.17          htmlwidgets_0.9    
## [22] rprojroot_1.2       grid_3.4.1          data.table_1.10.4  
## [25] rmarkdown_1.6       ggplot2_2.2.1.9000  magrittr_1.5       
## [28] backports_1.1.0     scales_0.5.0.9000   htmltools_0.3.6    
## [31] matrixStats_0.52.2  colorspace_1.3-2    stringi_1.1.5      
## [34] RcppParallel_4.3.20 lazyeval_0.2.0      munsell_0.4.3      
## [37] slam_0.1-40