Word Embeddings - GloVe model

References

The first two references show demos of using the R text2vec package to run GloVe. The third reference, by the model's original authors, summarizes the details of the model.

Load the Data

First, let’s load the data. We’ll use the readr package, which is loaded as part of the tidyverse.

library(tidyverse) #install.packages("tidyverse") if you do not have the package

# one-column CSV of tweet text -- column: body

file <- "../data/pres_tweets.csv"
data <- read_csv(file)
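
A quick sanity check on what we just read in (the body column is the one used below):

# number of tweets and the column name(s) in the file
nrow(data)
names(data)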

Preprocessing

Let’s first run preprocessing to tokenize and create the vocabulary.

Notice that text2vec does not use the standard quanteda preprocessing framework, as it builds a term co-occurrence matrix (TCM) rather than a document-feature matrix (dfm).
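
To build intuition for what a co-occurrence matrix is, here is a toy base-R sketch (not part of the pipeline; the tokens are made up) that counts how often word pairs appear within a two-word window. Note that create_tcm() below typically also down-weights more distant pairs, which this sketch ignores.

# Toy example: count co-occurrences of each word pair within a 2-word window
toks <- c("make", "america", "great", "again", "make", "america", "proud")
terms <- unique(toks)
tcm_toy <- matrix(0, length(terms), length(terms), dimnames = list(terms, terms))
window <- 2
for (i in seq_len(length(toks) - 1)) {
  for (j in (i + 1):min(i + window, length(toks))) {
    tcm_toy[toks[i], toks[j]] <- tcm_toy[toks[i], toks[j]] + 1
  }
}
tcm_toy # "make"-"america" co-occurs twice; most pairs never co-occur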

library(text2vec) # install.packages("text2vec"); stringr and quanteda are also needed for two helper functions

#remove punctuation
text <- stringr::str_replace_all(data$body,"[[:punct:]]","")

# Tokenize: lowercase and split on whitespace
tokens <- regexp_tokenizer(quanteda::char_tolower(text), " ", TRUE)
# Create iterator over tokens
it <- itoken(tokens, progressbar = FALSE)
# Create vocabulary. Terms will be unigrams (simple words), with English stopwords removed.
vocab <- create_vocabulary(it, stopwords = c(quanteda::stopwords("english"),""))

Let’s remove sparse and very frequent terms.

# Keep words that appear in at least 0.5% and at most 99% of documents
vocab <- prune_vocabulary(vocab, doc_proportion_min = 0.005, doc_proportion_max = 0.99)

Let’s plot the terms by word count and document count.

library(scatterD3)

scatterD3(vocab$vocab$terms_counts, 
          vocab$vocab$doc_counts, 
          lab = vocab$vocab$terms,
          xlab = "Word Counts (# of times word is used in corpus)",
          ylab = "Document Counts  (# of docs word is in)")
[Interactive scatterD3 plot: word counts vs. document counts, with each point labeled by its term]

Let’s now create the co-occurrence matrix.

# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab, 
                               # don't vectorize input
                               grow_dtm = FALSE, 
                               # use window of 5 for context words
                               skip_grams_window = 5L)
tcm <- create_tcm(it, vectorizer)
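
A quick, optional look at what came back ("border" and "wall" are just an illustrative pair; any two vocabulary terms work):

# one row/column per vocabulary term
dim(tcm)
# co-occurrence weight for an example pair; take the max of both orientations
# since the TCM may be stored triangularly
max(tcm["border", "wall"], tcm["wall", "border"])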

Let’s use three out of my computer’s four cores and run the GloVe model.

RcppParallel::setThreadOptions(numThreads = 3)
GloVe = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10)
GloVe$fit(tcm, n_iter = 50)

Let’s extract the 50-dimensional word vectors, then use t-SNE to reduce them to two dimensions so we can plot each word…

word_vectors <- GloVe$get_word_vectors()
#write.csv(word_vectors, "./GloVe-features.csv", row.names = T)

# install.packages("Rtsne")
tsne <- Rtsne::Rtsne(word_vectors, dims = 2, verbose=TRUE, max_iter = 500)
## Read the 344 x 50 data matrix successfully!
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Normalizing input...
## Building tree...
##  - point 0 of 344
## Done in 0.04 seconds (sparsity = 0.392509)!
## Learning embedding...
## Iteration 50: error is 65.702049 (50 iterations in 0.15 seconds)
## Iteration 100: error is 66.831125 (50 iterations in 0.17 seconds)
## Iteration 150: error is 66.181232 (50 iterations in 0.17 seconds)
## Iteration 200: error is 66.601498 (50 iterations in 0.16 seconds)
## Iteration 250: error is 68.620399 (50 iterations in 0.18 seconds)
## Iteration 300: error is 1.709352 (50 iterations in 0.17 seconds)
## Iteration 350: error is 1.560578 (50 iterations in 0.12 seconds)
## Iteration 400: error is 1.526998 (50 iterations in 0.12 seconds)
## Iteration 450: error is 1.501500 (50 iterations in 0.11 seconds)
## Iteration 500: error is 1.481305 (50 iterations in 0.12 seconds)
## Fitting performed in 1.47 seconds.
# create data frame
df <- data.frame(X = tsne$Y[,1], Y = tsne$Y[,2])


scatterD3(df$X, 
          df$Y, 
          lab = row.names(word_vectors),
          xlab = "1st Vector",
          ylab = "2nd Vector")
[Interactive scatterD3 plot of the two t-SNE dimensions ("1st Vector" vs. "2nd Vector"), with each point labeled by its term]

This doesn’t look like much because we are compressing all 50 dimensions into a two-dimensional picture. In this sense, the method is like principal components analysis, in which we “reduced” the information in the co-occurrence counts down to 50 features per word. Therefore, rather than eyeballing a projection, we need a better way to analyze the similarity between words.

Therefore, let’s use cosine similarity, a similarity measure (ranging from -1 to 1) of how close two words are based on their 50-dimensional vectors.

We’ll compute a matrix of the cosine similarity between every pair of words.

cos_sim = sim2(x = word_vectors, y = word_vectors, method = "cosine", norm = "l2")
#write_csv(as.data.frame(cos_sim), "cos-sim.csv")
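
As a sanity check, here is cosine similarity computed by hand for one pair of words; it should match the corresponding entry of cos_sim:

# cosine similarity = dot product divided by the product of the L2 norms
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(word_vectors["border", ], word_vectors["wall", ])
cos_sim["border", "wall"]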

First, let’s build the raw (unthresholded) network from the similarity matrix.

library(visNetwork); library(igraph)

g <- igraph::graph.adjacency(cos_sim, mode="undirected", weighted=TRUE, diag=FALSE)
E(g)$width <- abs(E(g)$weight)

t <- merge(data.frame(name = as.character(V(g)$name), stringsAsFactors = F), 
      data.frame(name = as.character(vocab$vocab$terms), 
                 Count = vocab$vocab$terms_counts, stringsAsFactors = F),
       by = "name")

Next, using an arbitrary cosine-similarity threshold (which we can modify), we can create a network plot of how similar the words are under the GloVe model.

cutoff = 0.47

g <- igraph::graph.adjacency(cos_sim, mode="undirected", weighted=TRUE, diag=FALSE)
E(g)$width <- abs(E(g)$weight)*10

t <- merge(data.frame(name = as.character(V(g)$name), stringsAsFactors = F), 
      data.frame(name = as.character(vocab$vocab$terms), 
                 Count = vocab$vocab$terms_counts, stringsAsFactors = F),
       by = "name")


# size nodes by how often the word appears in the corpus
V(g)$size <- t$Count[match(V(g)$name, t$name)] / 20 + 5

# drop edges below the similarity threshold
g <- delete.edges(g, E(g)[ abs(weight) < cutoff])

# color edges by the sign of the similarity and use absolute weights
E(g)$color <- ifelse(E(g)$weight > 0, "blue","red")
E(g)$weight <- abs(E(g)$weight)
# remove words left with no connections
iso <- V(g)[degree(g)==0]
g <- delete.vertices(g, iso)

# detect communities with label propagation
clp <- igraph::cluster_label_prop(g)
class(clp)
## [1] "communities"
# color nodes by community membership
V(g)$color <- RColorBrewer::brewer.pal(12, "Set3")[as.factor(clp$membership)]

visIgraph(g) %>%   visOptions(highlightNearest = list(enabled = TRUE, algorithm = "hierarchical")) %>% 
  visInteraction(navigationButtons = TRUE) 

This shows the connections between words that are used in similar contexts (a five-word window). Think of this like a “localized” or “micro” topic model, where the document level is replaced by a rolling five-word context.
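
To see which words fell into each of these “micro topics”, we can list the community members directly (a short sketch using the clp and g objects from above):

# number of words in each label-propagation community
igraph::sizes(clp)
# the words assigned to each community
split(V(g)$name, igraph::membership(clp))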

Linear Substructures

First, given a word (say “border”), we can find which words were used most similarly (i.e., in the same contexts).

word <- word_vectors["border", , drop = FALSE] 
cos_sim_ex = sim2(x = word_vectors, y = word, method = "cosine", norm = "l2")
head(sort(cos_sim_ex[,1], decreasing = TRUE), 10)
##      border     illegal    security      defend        stop immigration 
##   1.0000000   0.5594225   0.5241929   0.5019102   0.4494898   0.4425788 
##          de      social         end          la 
##   0.4171149   0.3656305   0.3637431   0.3576855
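
Since we will repeat this lookup, here is a small convenience wrapper (the function name is my own, not part of text2vec) that returns the n nearest words for any term in the vocabulary:

# hypothetical helper: nearest neighbors of a term by cosine similarity
nearest_words <- function(term, n = 10) {
  target <- word_vectors[term, , drop = FALSE]
  sims <- sim2(x = word_vectors, y = target, method = "cosine", norm = "l2")
  head(sort(sims[, 1], decreasing = TRUE), n)
}

nearest_words("border")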

Linear substructures allow “algebra” on word vectors: addition (+) and subtraction (-). For example, we can see which words are closest to “trump” plus “president” minus “wall”.

word <- word_vectors["trump", , drop = FALSE] +
  word_vectors["president", , drop = FALSE] -
  word_vectors["wall", , drop = FALSE] 
cos_sim_ex = sim2(x = word_vectors, y = word, method = "cosine", norm = "l2")
head(sort(cos_sim_ex[,1], decreasing = TRUE), 10)
##       president           trump realdonaldtrump              mr 
##       0.7002425       0.6951676       0.6197352       0.5997823 
##          donald            best            next       candidate 
##       0.5490356       0.4573153       0.4281908       0.4118675 
##             man         running 
##       0.4047352       0.3906899

See the GloVe project website for a fuller explanation of linear substructures.