Overview

This RMarkdown file contains annotated R code, combining written text with the code and its accompanying output. The file is rendered into an HTML document so users can view the code and output without having to run the code themselves.

Users without any experience in RMarkdown should consult RStudio’s RMarkdown tutorial lessons to learn how to run the code.

Users without any experience in quanteda should consider first reviewing its Getting Started tutorial. This will provide context for many of the terms and functions used in this document and is especially important for users without experience in computer-assisted text analysis or natural language processing.

Further, this resource references many functions from the tidyverse, a widely used collection of R packages, and users are expected to have some knowledge of its most critical components (e.g., dplyr, piping). Users without any experience should consider one of the many available tidyverse tutorials to learn the most basic functions.

Design & Data Collection

Load the data

First, let’s load the data. We’ll use the readr package, which is included in the tidyverse collection of packages.

library(tidyverse)  #install.packages('tidyverse') if you do not have the package

file <- "./TMM_FullDataset.csv"
responses <- read_csv(file)

This file includes 753 records and 52 columns.
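If you would like to verify this yourself, here’s a quick check:

dim(responses)  # 753 rows, 52 columns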

Invalid Records

Let’s remove invalid records from users who did not complete the survey.

For example, let’s explore the “Progress” field that shows the percentage of completion.

responses %>% group_by(Progress) %>% summarise(Count = n())
## # A tibble: 5 x 2
##   Progress Count
##      <int> <int>
## 1        7    15
## 2       29   147
## 3       40     4
## 4       43     2
## 5      100   585

So only 585 of the 753 records were completed (i.e., Progress = 100, meaning 100% complete).

Let’s use the filter function to keep only the completed responses (i.e., Progress == 100).

responses <- filter(responses, Progress == 100)
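As a quick sanity check, we can confirm that only the completed responses remain:

nrow(responses)  # should now be 585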

Text Quality

Let’s look at a few examples to determine the quality of the responses (e.g., misspellings, grammatical mistakes).

responses$Q4[1:3]
## [1] "I was at work and I asked my supervisor what to do about a particular problem.  She said not to worry about it and we'd get to it later.  I then got chewed out by the boss because I left the problem unfixed.  I felt bad because I should have stuck up for myself.  I just let that girl throw me under the bus and I didn't say anything.  I wish I would've done things differently and stood up for myself more because I let her dictate my reputation and that wasn't fair.  My boss doesn't always care what my opinions are.  I don't like that he has so many other political considerations to think about."                                                                                                                                                                                                         
## [2] "We have a great relationship. Whenever I need help with anything I can rely on his knowledge and expertise to help me out. He is available on call at any time. He shows me respect by letting me have freedom and autonomy. In other words he trusts me and does not micromanage. My relationship with him is similar to others in that we trust him and he trusts us. His door is always open to all of us. I am a personable and sociable type of person so maybe I am a bit closer to him than others just because I am good at small talk and chit chat. But thinking about it, I have been there over 6 years and never even had a cross word with him. It is a very professional relationship in that sense but there is also a personal aspect to it as well in that we all look up to him and he treats us with respect."
## [3] "I almost never speak with him. he does not know what i do and couldn't give me any coaching if he tried. but thankfully, he does not even try. My relationship with him is the same as with my previous manager - it is virtually nonexistent. That means that we barely ever interact other than at company holiday parties. I am basically on my own left to do what i think is best for my department and in my area using my skill set and experience. I guess I still have to come up with some more words to hit the minimum lol. wow and now i got an error saying that this response must be at least 600 characters. you are joking - this is for one measly dollar? not enough money to justify this level of work."

First, it looks like these responses are written in clear English – unlike, say, Tweets full of slang and abbreviations. Some responses do not have perfect grammar (e.g., sample 3: “him. he does not”), but this is a minor problem. We will standardize the responses by converting all letters to lower case and removing punctuation during pre-processing.
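As a minimal preview of that standardization (shown here with base R purely for illustration; the actual pre-processing happens later through quanteda):

# illustration only: lower-case a snippet and strip its punctuation
tolower(gsub("[[:punct:]]", "", "him. He does NOT know what I do."))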

quanteda also offers a helpful textstat_readability function that computes many different readability indexes (47 in total), such as Flesch-Kincaid. For more information, please see the Flesch-Kincaid readability test Wikipedia page.

library(quanteda)
readability <- textstat_readability(responses$Q4)

hist(readability$Flesch.Kincaid, xlim = c(0, 40), breaks = 200, xlab = "Flesch-Kincaid Score", 
    main = "Histogram of Flesch-Kincaid Scores")

Length of responses

Let’s explore the histogram of tokens (i.e., words). We will look at the total number of tokens. (Alternatively, you can count unique tokens by changing the function ntoken() to ntype() – see the sketch after the histograms below.)

hist(ntoken(responses$Q4), breaks = 20, main = "# of Words per Response: Relationship With Manager (Q4)", 
    xlab = "Number of Words")

hist(ntoken(responses$Q5), breaks = 20, main = "# of Words per Response: Manager Understands Needs (Q5)", 
    xlab = "Number of Words")

In our case, most responses had between 50 and 100 words.

hist(nchar(responses$Q4), breaks = 20, main = "# of Characters per Response: Relationship With Manager (Q4)", 
    xlab = "Number of Characters")

hist(nchar(responses$Q5), breaks = 20, main = "# of Characters per Response: Manager Understands Needs (Q5)", 
    xlab = "Number of Characters")

Let’s find outliers who responded with fewer than 10 words.

minWords <- 10

responses$Q5[which(ntoken(responses$Q5) < minWords)]
## [1] "He cannot understand anything always"

Covariates: Exploration & Extraction

Before running our analysis, we need to prepare any covariates that we’ll use in our analysis.

First, let’s look at how many responses we have by two attributes (covariates): the gender of the respondent and the gender of their manager. Our focus will be on how these two variables affect which topics are covered in the responses.

responses %>% group_by(Q8, Q9) %>% summarise(Count = n())
## # A tibble: 4 x 3
## # Groups:   Q8 [?]
##       Q8     Q9 Count
##    <chr>  <chr> <int>
## 1 Female Female   147
## 2 Female   Male    49
## 3   Male Female   136
## 4   Male   Male   253

Recall that Q8 = Manager (Leader) Gender and Q9 = Gender of Respondent. These two variables look good and do not need any additional data preparation.

Let’s also consider location and occupation. Both were open-ended questions in which respondents could write in any text, which led to many distinct values.

After examining this list, we collapsed it into two values: domestic (anywhere in the U.S.) and international (anywhere outside of the U.S.).

non.us <- c("Australia", "Canada", "Colombia", "Ecuador", "Finland", "Greece", "india", 
    "India", "INDIA", "Indonesia", "italy", "Nepal", "Phillipines", "Sri Lanka", 
    "UK", "Ukraine", "United Kingdom", "united kingdom", "venezuela", "Venezuela")

responses$country <- ifelse(responses$Q7 %in% non.us, "International", "Domestic")

table(responses$country)
## 
##      Domestic International 
##           468           117
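Because %in% is case-sensitive (hence the multiple capitalizations in non.us above), it is worth eyeballing the values that were classified as Domestic for any missed spellings:

# list the distinct Q7 values classified as Domestic, to spot misses
sort(unique(responses$Q7[responses$country == "Domestic"]))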

For occupation, we have many categories, some of which are sparsely populated.

Let’s group them into three categories: management, analyst, and entry level.

table(responses$Q13)
## 
##                         Analyst / Associate 
##                                         273 
## C level executive (CIO, CTO, COO, CMO, Etc) 
##                                           4 
##                                    Director 
##                                          18 
##                                 Entry Level 
##                                         110 
##                                      Intern 
##                                           3 
##                                     Manager 
##                                         149 
##                                       Owner 
##                                           2 
##                            President or CEO 
##                                           1 
##                              Senior Manager 
##                                          22 
##                       Senior Vice President 
##                                           2 
##                              Vice President 
##                                           1
# start with all values Management
responses$occupation <- "Management"
# replace Analyst level
responses$occupation[responses$Q13 == "Analyst / Associate"] <- "Analyst"
# replace Entry Level
responses$occupation[responses$Q13 %in% c("Entry Level", "Intern")] <- "Entry Level"

table(responses$occupation)
## 
##     Analyst Entry Level  Management 
##         273         113         199

Quanteda Text Analysis

Cleaning

Before running the analysis, we need to do two manual cleanups. These steps are specific to this dataset, but similar cleanups may be required for any dataset.

First, since we will combine both of our questions (Q4 and Q5), let’s create a character vector that contains our text, aptly named text.

Second, let’s clean up a few specific problem cases.

text <- c(responses$Q4, responses$Q5)

## Clean-up: replace periods with spaces so that missing spaces after
## sentences (e.g. man.he => 'man he') do not break tokenization
text <- gsub("[.]", " ", text)

# clean up specific texts
text[79]
## [1] "He does not do too much to help us he just sits in his office and we have to go to him with some questions  He then solves the problem usually and thats it  he tries to be manipulative but he is not very good at it  We all know he does not respect us much and only hires women because he likes looking at them  He even admitted the last girl he hired  was just hired because she was a model  So that goes to show how much he respects us  But hey a paycheck is life basically     the survey says 100 words but now it wants 600 characters? x x x x x x x x x x x x x x x x xx random worrds with letters to make characters yayayyayayayayayayayayayayayayayayaayyayaya "

To clean this record, we’ll truncate it just before the padding that begins with “the survey says 100 words…”. The second substr() call below applies the same fix to this respondent’s Q5 answer, which sits 585 positions later in the combined vector.

text[79] <- substr(text[79], 1, 474)  # truncate the padded Q4 response
text[79 + 585] <- substr(text[79 + 585], 1, 551)  # same respondent's Q5 response
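If you prefer not to hard-code offsets like 474, one alternative is to search for the phrase where the padding begins. This sketch uses regexpr() and is guarded so it does nothing once the record is already truncated:

# locate the padding by its opening phrase and truncate just before it
pad.start <- regexpr("the survey says", text[79], fixed = TRUE)
if (pad.start > 0) text[79] <- substr(text[79], 1, pad.start - 1)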

Another example involves a user who padded a valid response:

text[714]
## [1] "Most of them can not and will not do our jobs, thats part of why they promote to begin with  It seems the shittier you are at your job on the floor, the more they encourage you to move into supervision here  Its ass backwards  As a result none of us have respect for them                                                                                                                   \nno no no no no no no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no nono no no no no"

This user wrote a legitimate answer but then repeated “no” to reach the minimum length requirement.

We can similarly remove this text using the substr() function.

text[714] <- substr(text[714], 1, 271)  # clean the nononono record

Next, let’s create the corpus and add the covariates.

myCorpus <- corpus(text)

# add in the attributes (covariates) about the responses, starting with
# the gender of the manager and of the respondent
docvars(myCorpus, "ManagerGender") <- c(responses$Q8, responses$Q8)
docvars(myCorpus, "SelfGender") <- c(responses$Q9, responses$Q9)

docvars(myCorpus, "Question") <- c(rep("Q4", 585), rep("Q5", 585))
docvars(myCorpus, "Country") <- c(responses$country, responses$country)
docvars(myCorpus, "Occupation") <- c(responses$occupation, responses$occupation)

Let’s create our document-feature matrix with only basic preprocessing (standard stop words, unigrams, no stemming).

Recall that the document-feature matrix includes the document-term counts (the count of each term in each document) along with the covariates (features) we just attached above, like the country and occupation of the respondent.

Tokenization, Stemming, Uni/Bi/Tri-grams, Stop Words (dfm)

Let’s run text pre-processing.

dfm <- dfm(myCorpus, remove = c(stopwords("english")), ngrams = 1L, stem = F, remove_numbers = T, 
    remove_punct = T, remove_symbols = T, remove_hyphens = F)

topfeatures(dfm, 25)
##       leader         work       always           us         need 
##         1111         1023          586          567          491 
## relationship         good          can      manager         time 
##          476          458          445          420          413 
##     coaching          job         help   supervisor          get 
##          398          398          393          375          359 
##         also     problems         like     provides      respect 
##          358          329          295          293          290 
##         much         well          one          way      provide 
##          270          269          266          263          261

Let’s explore removing sparse terms.

In addition to removing the standard list of stop words (stopwords(“english”)), we’ve added a list of additional stop words.

extra.stop <- c("always", "will", "can", "job", "us", "get", "also", "much", "well", 
    "way", "like", "things", "one", "make", "really", "just", "take", "lot", "even", 
    "done", "something", "go", "sure", "makes", "every", "come", "say", "many", "often", 
    "see", "little", "want", "though", "without", "going", "takes", "someone", "however", 
    "comes", "usually", "may", "office", "thing", "making", "along", "since", "long", 
    "back", "similar", "goes", "put", "getting", "another", "keep", "related", "else", 
    "now", "seems", "co")

dfm <- dfm(myCorpus, remove = c(stopwords("english"), extra.stop), ngrams = 1L, stem = F, 
    remove_numbers = T, remove_punct = T, remove_symbols = T, remove_hyphens = F)

# remove sparse terms that appear in fewer than two documents
dfm <- dfm_trim(dfm, min_docfreq = 2)
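To see how much the trimming shrank the vocabulary, check the dimensions (documents x features):

dim(dfm)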

topfeatures(dfm, 25)
##       leader         work         need relationship         good 
##         1111         1023          491          476          458 
##      manager         time     coaching         help   supervisor 
##          420          413          398          393          375 
##     problems     provides      respect      provide         know 
##          329          293          290          261          255 
##        needs         team      company         feel         give 
##          252          249          227          217          216 
##      support     concerns         boss      working     training 
##          213          203          202          202          198

Can you see the differences in this second iteration, after we removed the additional stop words?

Let’s plot two word clouds – one for the corpus as a whole and one grouped by respondent country.

library(RColorBrewer)

textplot_wordcloud(dfm, scale = c(3.5, 0.75), colors = brewer.pal(8, "Dark2"), random.order = F, 
    rot.per = 0.1, max.words = 100)

cdfm <- dfm(myCorpus, groups = "Country", remove = c(stopwords("english"), extra.stop), 
    stem = F, remove_numbers = T, remove_punct = T, remove_symbols = T, remove_hyphens = F)

textplot_wordcloud(cdfm, comparison = T, scale = c(3.5, 0.75), colors = brewer.pal(8, 
    "Dark2"), random.order = F, rot.per = 0.1, max.words = 100)

This suggests that Domestic participants used “development” more often than International participants. Yet a problem with exploratory word clouds is that they do not measure the difference – especially not with statistical inference. Let’s keep this in mind for when we run topic modeling.
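As one way to actually quantify such group differences, quanteda offers the textstat_keyness() function, which computes keyness statistics (chi-squared by default) for a target group against the rest. A minimal sketch using the grouped dfm from above:

# terms most strongly associated with Domestic respondents (illustrative)
keyness <- textstat_keyness(cdfm, target = "Domestic")
head(keyness, 10)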

textplot_wordcloud(tfidf(dfm), scale = c(3.5, 0.75), colors = brewer.pal(8, "Dark2"), 
    random.order = F, rot.per = 0.1, max.words = 100)

We can use word clustering to identify words that frequently co-occur.

wordDfm <- dfm_sort(dfm_weight(dfm, "frequency"))
wordDfm <- t(wordDfm)[1:50, ]  # transpose so words are rows; keep the 50 most frequent
wordDistMat <- dist(wordDfm)
wordCluster <- hclust(wordDistMat)
plot(wordCluster, xlab = "", main = "Raw Frequency weighting")

Think of this plot as a crude way of identifying topics. Also, this plot (called a dendrogram) can help us identify “meaningless” words (e.g. also, like, really) that we could eliminate as stop words. For now, we will keep all words (except the most basic list of stop words) but note these words in case we decide we want to remove them later on.
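If you want those groupings as data rather than a picture, cutree() from base R’s stats package cuts the dendrogram into a chosen number of clusters (the k = 5 here is arbitrary, for illustration):

# assign each of the 50 words to one of 5 clusters (arbitrary k)
cutree(wordCluster, k = 5)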

Another interesting take on this plot is to use not the raw frequencies (word counts) but the TF-IDF weightings, which down-weight words that appear in most documents and therefore carry little discriminating information.

We can rerun the plot with TF-IDF by changing “frequency” to “tfidf” in the first line of the code.

wordDfm <- dfm_sort(dfm_weight(dfm, "tfidf"))
wordDfm <- t(wordDfm)[1:50, ]  # transpose so words are rows; keep the 50 most frequent
wordDistMat <- dist(wordDfm)
wordCluster <- hclust(wordDistMat)
plot(wordCluster, xlab = "", main = "TF-IDF weighting")

Save Image & Libraries Used

save.image(file = "01-datacleaning-exploration.RData")
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-2 quanteda_0.99      bindrcpp_0.2      
##  [4] dplyr_0.7.4        purrr_0.2.3        readr_1.1.1       
##  [7] tidyr_0.7.1        tibble_1.3.4       ggplot2_2.2.1.9000
## [10] tidyverse_1.1.1   
## 
## loaded via a namespace (and not attached):
##  [1] wordcloud_2.5       slam_0.1-40         reshape2_1.4.2     
##  [4] haven_1.1.0         lattice_0.20-34     colorspace_1.3-2   
##  [7] htmltools_0.3.6     yaml_2.1.14         rlang_0.1.2        
## [10] foreign_0.8-67      glue_1.1.1          modelr_0.1.1       
## [13] readxl_1.0.0        bindr_0.1           plyr_1.8.4         
## [16] stringr_1.2.0       munsell_0.4.3       gtable_0.2.0       
## [19] cellranger_1.1.0    rvest_0.3.2         psych_1.7.5        
## [22] evaluate_0.10.1     knitr_1.17          forcats_0.2.0      
## [25] parallel_3.4.1      broom_0.4.2         Rcpp_0.12.13       
## [28] scales_0.5.0.9000   backports_1.1.0     formatR_1.5        
## [31] RcppParallel_4.3.20 jsonlite_1.5        fastmatch_1.1-0    
## [34] mnormt_1.5-5        hms_0.3             digest_0.6.12      
## [37] stringi_1.1.5       grid_3.4.1          rprojroot_1.2      
## [40] tools_3.4.1         magrittr_1.5        lazyeval_0.2.0     
## [43] pkgconfig_2.0.1     Matrix_1.2-8        data.table_1.10.4  
## [46] xml2_1.1.1          lubridate_1.6.0     assertthat_0.2.0   
## [49] rmarkdown_1.6       httr_1.3.1          R6_2.2.2           
## [52] nlme_3.1-131        compiler_3.4.1