We’ll start by reloading the data we saved at the end of part 1.
See the code from part 1 to understand how the data were created.
load("01-datacleaning-exploration.RData")
Let’s run (structural) topic modeling using the stm package.
First, we’ll need to load the stm package. If you do not have the package, run install.packages("stm") to install it.
Before running stm, we’ll need to convert the dfm (based on the quanteda package) to the data structure that stm uses through the convert function.
Next, we’ll examine the minimum frequency of word occurrences to use for removing sparse (very rare) terms. For now, we’ll eliminate all terms that do not appear in more than three documents (i.e. lower.thresh = 3). We may adjust this upward once we get more data.
Last, we’ll create a data structure (out) that includes our data structure, the words and the metadata (covariates).
library(stm)
library(quanteda)
# use quanteda converter to convert our Dfm
stmdfm <- convert(dfm, to = "stm", docvars = docvars(myCorpus))
plotRemoved(stmdfm$documents, lower.thresh = seq(1, 80, by = 20))
out <- prepDocuments(stmdfm$documents, stmdfm$vocab, stmdfm$meta, lower.thresh = 3)
## Removing 1175 of 2659 terms (2764 of 33277 tokens) due to frequency
## Your corpus now has 1170 documents, 1484 terms and 30513 tokens.
For this example, we will include five topics for the sake of parsimony. Review part 2 to consider ways to run validations on the number of topics.
Let’s run the model with two covariates: country and occupation. We will pass them through the prevalence parameter using a “+” to link both values.
k <- 5
stmFit <- stm(out$documents, out$vocab, K = k, prevalence = ~Country + Occupation,
max.em.its = 150, data = out$meta, init.type = "Spectral", seed = 300)
We can then plot the topic summaries…
plot(stmFit, type = "summary", xlim = c(0, 0.8), ylim = c(0.4, k + 0.4), n = 5, main = "Survey Topics",
width = 10, text.cex = 1)
…as well as save information about the topics that will make it easier to identify them later on.
topicNames <- labelTopics(stmFit)
topic <- data.frame(topicnames = paste0("Topic ", 1:k), TopicNumber = 1:k, TopicProportions = colMeans(stmFit$theta))
To interpret the topics, we’ll plot the 20 words that best describe each topic. We’ll plot each topic (each “row”) using two measures: the topic probability (left column) and FREX (right column).
The left column shows the words based on the typical topic model output: the word-topic probabilities. The right column uses FREX, which reweights the word-topic probabilities to emphasize the words that are both more “frequent” and more “exclusive” to each topic (hence “FR” + “EX”). For more details on FREX, see page 5 of Roberts et al. (2013).
par(mfrow = c(3, 2), mar = c(1, 1, 2, 1))
for (i in 1:k) {
plot(stmFit, type = "labels", n = 20, topics = i, main = "Raw Probabilities",
width = 40)
plot(stmFit, type = "labels", n = 20, topics = i, main = "FREX Weights", labeltype = "frex",
width = 50)
}
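FREX combines a word’s frequency rank and exclusivity rank through a weighted harmonic mean. The following toy base-R sketch illustrates the idea only; the frex_score function, the equal weighting, and the example numbers are illustrative assumptions, not the stm internals:

```r
# Toy illustration of the FREX idea: score each word by the weighted harmonic
# mean of its frequency rank and its exclusivity rank (as ECDF values in (0, 1])
frex_score <- function(freq, excl, w = 0.5) {
  freq_rank <- rank(freq) / length(freq)  # ECDF rank of within-topic frequency
  excl_rank <- rank(excl) / length(excl)  # ECDF rank of exclusivity
  1 / (w / excl_rank + (1 - w) / freq_rank)
}

# The third "word" is both frequent and exclusive, so it scores highest
frex_score(freq = c(10, 50, 50), excl = c(0.9, 0.1, 0.9))
```

A word ranks high on FREX only when it is strong on both dimensions; being merely frequent (word 2) or merely exclusive (word 1) is not enough.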
Ultimately, our goal is to give each topic a name (label). We’ll then use those labels instead of referring to each topic by its number (which is essentially meaningless). Typically, the first couple of words in the FREX list provide ideal one- or two-word names for the topics. We’ll need to decide on the labels as a group, ideally choosing names consistent with any theoretical frameworks on leadership you’re aware of.
We can also manually output the top words for each topic using the word probabilities (“Highest Prob”), FREX and two other measurements (Lift and Score). Let’s consider topic 3.
labelTopics(stmFit, 3)
## Topic 3 Top Words:
## Highest Prob: work, leader, provides, provide, coaching, supervisor, help
## FREX: necessary, accomplish, look, tools, concerns, mentoring, mentors
## Lift: t, mentors, profession, zero, confused, unfortunately, outs
## Score: profession, mentoring, needs, concerns, support, necessary, provide
Another way of interpreting the topics is to find the most representative documents (in this case, responses) for each topic. Essentially, we can find the documents (their first 300 characters) that best exemplify the topic. Let’s again consider topic 3.
shortdoc <- substr(text, 1, 300)
findThoughts(stmFit, texts = shortdoc, n = 5, topics = 3)
##
## Topic 3:
## My leader provides appropriate and necessary training
## They always look up to me when any work is needed to be done I always look up t them when I am confused
## Yes we do have tenured mentors
## My leader provides appropriate and necessary training
## They always look up to me when any work is needed to
## My leader provides appropriate and necessary training
## They always look up to me when any work is needed to be done
## Both are same
## My leader provides appropriate and necessary training
## They always look up to me when any work is needed to be done
## Both are same
## My leader provides appropriate and n
## My supervisor tends to provide an adequate amount of support to me to accomplish my work objectives He listens to my concerns and gives appropriate advice He provides adequate coaching and mentoring most of the time My supervisor tends to provide an adequate amount of support to me to accomplish
## When I was hired in to my organization, I made it very clear to my leader that I had ambitions to go higher in the company However, in order for me to go beyond my current position, I would have to take my leader's current position My leader listens to my concerns as far as the productivity at wor
## He seems to really understand my problems and needs He gives me support when I need it, to help me accomplish my objectives in work He listens to my concerns and provides advice He gives me coaching and mentoring when necessary He gets stressed at times, but always seems to go out of his way to
In the case of topic 3, we can give this topic the label “Provides Support”, as the major words seem to be “provide(s)”, “coaching”, “mentors” and “help”.
We can give all the topics labels by following the same procedure for each. To rename the topics, use the following code.
topicNames <- labelTopics(stmFit)
topic <- data.frame(topicnames = c("Problem Solving", "Time Management", "Provides Support",
"Ask Questions", "Shows Respect"), TopicNumber = 1:k, TopicProportions = colMeans(stmFit$theta),
stringsAsFactors = F)
We can also use the Semantic Coherence and Exclusivity to measure the interpretability of each topic.
topicQuality(stmFit, documents = out$documents)
## [1] -89.33452 -103.31324 -66.73870 -82.30216 -68.31779
## [1] 8.917431 8.455931 9.179232 8.720536 8.759805
Essentially, topics with higher semantic coherence (i.e. further to the right on the x-axis) tend to be more interpretable, while topics on the left have lower semantic coherence. This can happen because a topic is a mixture of two or more sub-topics; such a topic might separate if we moved to six topics. This is important when we run inference and need to interpret the results.
The exclusivity measure helps identify which topics contain words that are unique to that topic. Topics with lower exclusivity (e.g. topic 3) tend to include words that are used in many other topics and hence tend to be more “generic”. Typically, topics with a low exclusivity score are also less interpretable. Topic 2 has the most distinctive words, given its high exclusivity.
We can also examine the relationships between topic words visually using plot.STM’s “perspectives” plots.
For example, we can compare the words used for topic 1 and topic 3.
plot(stmFit, type = "perspectives", topics = c(1, 3))
Note that the words that are in the middle are shared by these topics whereas words near the left or right side are more unique to the topic (relative to the other topic).
Next, let’s examine the effects country and occupation have on the topic proportions.
To analyze this, we’ll first need to run the estimateEffect function, which fits a series of regression models with the topic proportions as the outcome.
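The estimateEffect call itself isn’t echoed in this output; a minimal sketch of what it likely looks like, assuming the same two covariates used in the prevalence formula and the out$meta metadata from earlier:

```r
# Sketch (assumption): regress the proportions of all k topics on the two
# covariates; "Global" uncertainty propagates the model's uncertainty about
# the estimated topic proportions into the regression
prep <- estimateEffect(1:k ~ Country + Occupation, stmFit,
                       metadata = out$meta, uncertainty = "Global")
```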
With the regression results saved as the prep data structure, we can use the STM plot function to examine the differences.
Result <- plot(prep, "Country", method = "difference", cov.value1 = "Domestic", cov.value2 = "International",
verbose.labels = F, model = stmFit, labeltype = "custom", custom.labels = topic$topicnames,
ylab = "Exp Topic Difference", xlab = "International Not Significant Domestic",
main = "Effect of Country on Topic Prevalence", xlim = c(-0.5, 0.5), width = 40,
ci.level = 0.95)
Alternatively, we can reorder the coefficients by their rank to aid in interpreting the visualization.
# order based on Expected Topic Proportion
trank = order(unlist(Result$means))
temp.topic <- topic[trank, ]
x <- plot(prep, "Country", method = "difference", cov.value1 = "Domestic", cov.value2 = "International",
verbose.labels = F, topics = temp.topic$TopicNumber, model = stmFit, labeltype = "custom",
custom.labels = temp.topic$topicnames, ylab = "Exp Topic Difference", xlab = "International Not Significant Domestic",
main = "Effect of Country on Topic Prevalence", xlim = c(-0.5, 0.5), width = 40,
ci.level = 0.95)
We can interpret this as follows: international respondents discussed the problem-solving topic significantly more than domestic respondents, whereas domestic respondents discussed the “Provides Support”, “Shows Respect”, and “Ask Questions” topics more prevalently than international respondents.
We can also plot the effect of Occupation. Since this field has three categories, the model estimates two binary contrasts against a baseline; we’ll use “Entry Level” as the baseline.
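Under the hood this is ordinary treatment (dummy) coding of the factor; a quick base-R illustration (the three-row factor here is constructed purely for demonstration):

```r
# With "Entry Level" as the first factor level, model.matrix makes it the
# baseline (absorbed into the intercept) and adds one indicator column per
# remaining level
occ <- factor(c("Entry Level", "Analyst", "Management"),
              levels = c("Entry Level", "Analyst", "Management"))
model.matrix(~ occ)
```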
First, let’s examine the relationship between “Entry Level” and “Analyst”:
Result <- plot(prep, "Occupation", method = "difference", cov.value1 = "Analyst",
cov.value2 = "Entry Level", verbose.labels = F, model = stmFit, labeltype = "custom",
custom.labels = topic$topicnames, ylab = "Exp Topic Difference", xlab = "Entry Level Not Significant Analyst",
main = "Effect of Occupation on Topic Prevalence", xlim = c(-0.5, 0.5), width = 40,
ci.level = 0.95)
# order based on Expected Topic Proportion
orank = order(unlist(Result$means))
temp.topic <- topic[orank, ]
x <- plot(prep, "Occupation", method = "difference", cov.value1 = "Analyst", cov.value2 = "Entry Level",
verbose.labels = F, topics = temp.topic$TopicNumber, model = stmFit, labeltype = "custom",
custom.labels = temp.topic$topicnames, ylab = "Exp Topic Difference", xlab = "Entry Level Not Significant Analyst",
main = "Effect of Occupation on Topic Prevalence", xlim = c(-0.5, 0.5), width = 40,
ci.level = 0.95)
In this example, we do not find a significant difference in the topics discussed by either entry level or analyst-level participants.
Next, let’s consider “Entry Level” vs “Management”:
Result <- plot(prep, "Occupation", method = "difference", cov.value1 = "Management",
cov.value2 = "Entry Level", verbose.labels = F, model = stmFit, labeltype = "custom",
custom.labels = topic$topicnames, ylab = "Exp Topic Difference", xlab = "Entry Level Not Significant Management",
main = "Effect of Occupation on Topic Prevalence", xlim = c(-0.5, 0.5), width = 40,
ci.level = 0.95)
# order based on Expected Topic Proportion
orank = order(unlist(Result$means))
temp.topic <- topic[orank, ]
x <- plot(prep, "Occupation", method = "difference", cov.value1 = "Management", cov.value2 = "Entry Level",
verbose.labels = F, topics = temp.topic$TopicNumber, model = stmFit, labeltype = "custom",
custom.labels = temp.topic$topicnames, ylab = "Exp Topic Difference", xlab = "Entry Level Not Significant Management",
main = "Effect of Occupation on Topic Prevalence", xlim = c(-0.5, 0.5), width = 40,
ci.level = 0.95)
We find the same result for Entry Level versus Management. Essentially, we cannot find evidence that the role (occupation) had a significant effect on the topics discussed.
Let’s consider alternative covariates. In this example, we will include country (given we found it was significant for some topics) as well as the gender of the respondent (SelfGender) and of the manager (ManagerGender).
We’ll rerun the estimateEffect on our new variables.
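The estimateEffect call for the new covariates isn’t echoed either; a sketch, assuming SelfGender and ManagerGender are columns of out$meta and that we estimate effects against the existing fit:

```r
# Sketch (assumption): estimate the effects of respondent gender, manager
# gender, and country on the topic proportions of the already-fitted model
prep2 <- estimateEffect(1:k ~ Country + SelfGender + ManagerGender, stmFit,
                        metadata = out$meta, uncertainty = "Global")
```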
genderResult <- plot(prep2, "SelfGender", method = "difference", cov.value1 = "Female",
cov.value2 = "Male", verbose.labels = F, model = stmFit, labeltype = "custom",
custom.labels = topic$topicnames, ylab = "Exp Topic Difference", xlab = "Male Not Significant Female",
main = "Effect of Gender on Topic Prevalence", xlim = c(-0.5, 0.5), width = 40,
ci.level = 0.95)
mgrgenResult <- plot(prep2, "ManagerGender", method = "difference", cov.value1 = "Female",
cov.value2 = "Male", verbose.labels = F, model = stmFit, labeltype = "custom",
custom.labels = topic$topicnames, ylab = "Exp Topic Difference", xlab = "Male Not Significant Female",
main = "Effect of Manager's Gender on Topic Prevalence", xlim = c(-0.5, 0.5),
width = 40, ci.level = 0.95)
save.image(file = "03-topicmodel-covariates.RData")
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] quanteda_0.99 stm_1.3.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.13 knitr_1.17 magrittr_1.5
## [4] munsell_0.4.3 colorspace_1.3-2 lattice_0.20-35
## [7] rlang_0.1.2 fastmatch_1.1-0 plyr_1.8.4
## [10] stringr_1.2.0 tools_3.4.2 grid_3.4.2
## [13] data.table_1.10.4-3 gtable_0.2.0 matrixStats_0.52.2
## [16] htmltools_0.3.6 lazyeval_0.2.1 yaml_2.1.14
## [19] RcppParallel_4.3.20 rprojroot_1.2 digest_0.6.12
## [22] tibble_1.3.4 Matrix_1.2-11 ggplot2_2.2.1.9000
## [25] formatR_1.5 slam_0.1-40 evaluate_0.10.1
## [28] rmarkdown_1.6 stringi_1.1.5 compiler_3.4.2
## [31] scales_0.5.0.9000 backports_1.1.0 lubridate_1.6.0