Word Embeddings - GloVe model

References

The first two references show demos of using the R text2vec package to run GloVe. The third reference, by the model's original authors, summarizes the details of the model.

Load the Data

First, let’s load the data. We’ll use the readr package, which is loaded as part of the tidyverse.

library(tidyverse) #install.packages("tidyverse") if you do not have the package

# one-column CSV of tweet text -- column: body

file <- "../data/pres_tweets.csv"
data <- read_csv(file)
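
A quick sanity check on what we just read in (the body column is the one used below):

# number of tweets and the column name(s) in the file
nrow(data)
names(data)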

Preprocessing

Let’s first run preprocessing to tokenize and create the vocabulary.

Notice that text2vec does not use the standard quanteda preprocessing framework, as it builds a term co-occurrence matrix (TCM) rather than a document-feature matrix (dfm).
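
To build intuition for what a co-occurrence matrix is, here is a toy base-R sketch (not part of the pipeline; the tokens are made up) that counts how often word pairs appear within a two-word window. Note that create_tcm() below typically also down-weights more distant pairs, which this sketch ignores.

# Toy example: count co-occurrences of each word pair within a 2-word window
toks <- c("make", "america", "great", "again", "make", "america", "proud")
terms <- unique(toks)
tcm_toy <- matrix(0, length(terms), length(terms), dimnames = list(terms, terms))
window <- 2
for (i in seq_len(length(toks) - 1)) {
  for (j in (i + 1):min(i + window, length(toks))) {
    tcm_toy[toks[i], toks[j]] <- tcm_toy[toks[i], toks[j]] + 1
  }
}
tcm_toy # "make"-"america" co-occurs twice; most pairs never co-occur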

library(text2vec) # install.packages("text2vec"); stringr and quanteda are also needed for two helper functions

#remove punctuation
text <- stringr::str_replace_all(data$body,"[[:punct:]]","")

# Tokenize: lowercase and split on whitespace
tokens <- regexp_tokenizer(quanteda::char_tolower(text), " ", TRUE)
# Create iterator over tokens
it <- itoken(tokens, progressbar = FALSE)
# Create vocabulary. Terms will be unigrams (simple words), with English stopwords removed.
vocab <- create_vocabulary(it, stopwords = c(quanteda::stopwords("english"),""))

Let’s remove sparse and very frequent terms.

# Keep words that appear in at least 0.5% and at most 99% of documents
vocab <- prune_vocabulary(vocab, doc_proportion_min = 0.005, doc_proportion_max = 0.99)

Let’s plot the terms by word count and document count.

library(scatterD3)

scatterD3(vocab$vocab$terms_counts, 
          vocab$vocab$doc_counts, 
          lab = vocab$vocab$terms,
          xlab = "Word Counts (# of times word is used in corpus)",
          ylab = "Document Counts  (# of docs word is in)")
[Interactive scatterD3 plot: word counts vs. document counts, with each point labeled by its term]

Let’s now create the co-occurrence matrix.

# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab, 
                               # don't vectorize input
                               grow_dtm = FALSE, 
                               # use window of 5 for context words
                               skip_grams_window = 5L)
tcm <- create_tcm(it, vectorizer)
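
A quick, optional look at what came back ("border" and "wall" are just an illustrative pair; any two vocabulary terms work):

# one row/column per vocabulary term
dim(tcm)
# co-occurrence weight for an example pair; take the max of both orientations
# since the TCM may be stored triangularly
max(tcm["border", "wall"], tcm["wall", "border"])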

Let’s use three out of my computer’s four cores and run the GloVe model.

RcppParallel::setThreadOptions(numThreads = 3)
GloVe = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10)
GloVe$fit(tcm, n_iter = 50)

Let’s extract the 50-dimensional word vectors, then use t-SNE to reduce them to two dimensions so we can plot each word…

word_vectors <- GloVe$get_word_vectors()
#write.csv(word_vectors, "./GloVe-features.csv", row.names = T)

# install.packages("Rtsne")
tsne <- Rtsne::Rtsne(word_vectors, dims = 2, verbose=TRUE, max_iter = 500)
## Read the 344 x 50 data matrix successfully!
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Normalizing input...
## Building tree...
##  - point 0 of 344
## Done in 0.04 seconds (sparsity = 0.392509)!
## Learning embedding...
## Iteration 50: error is 65.702049 (50 iterations in 0.15 seconds)
## Iteration 100: error is 66.831125 (50 iterations in 0.17 seconds)
## Iteration 150: error is 66.181232 (50 iterations in 0.17 seconds)
## Iteration 200: error is 66.601498 (50 iterations in 0.16 seconds)
## Iteration 250: error is 68.620399 (50 iterations in 0.18 seconds)
## Iteration 300: error is 1.709352 (50 iterations in 0.17 seconds)
## Iteration 350: error is 1.560578 (50 iterations in 0.12 seconds)
## Iteration 400: error is 1.526998 (50 iterations in 0.12 seconds)
## Iteration 450: error is 1.501500 (50 iterations in 0.11 seconds)
## Iteration 500: error is 1.481305 (50 iterations in 0.12 seconds)
## Fitting performed in 1.47 seconds.
# create data frame
df <- data.frame(X = tsne$Y[,1], Y = tsne$Y[,2])


scatterD3(df$X, 
          df$Y, 
          lab = row.names(word_vectors),
          xlab = "1st Vector",
          ylab = "2nd Vector")
[Interactive scatterD3 plot of the two t-SNE dimensions ("1st Vector" vs. "2nd Vector"), with each point labeled by its term]

This doesn’t look like much because we are compressing all 50 dimensions into a two-dimensional picture. In this sense, the method is like principal components analysis, in which we “reduced” the information in the co-occurrence counts down to 50 features per word. Therefore, rather than eyeballing a projection, we need a better way to analyze the similarity between words.

Therefore, let’s use cosine similarity, a similarity measure (ranging from -1 to 1) of how close two words are based on their 50-dimensional vectors.

We’ll compute a matrix of the cosine similarity between every pair of words.

cos_sim = sim2(x = word_vectors, y = word_vectors, method = "cosine", norm = "l2")
#write_csv(as.data.frame(cos_sim), "cos-sim.csv")
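
As a sanity check, here is cosine similarity computed by hand for one pair of words; it should match the corresponding entry of cos_sim:

# cosine similarity = dot product divided by the product of the L2 norms
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(word_vectors["border", ], word_vectors["wall", ])
cos_sim["border", "wall"]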

First, let’s build the raw (unthresholded) network from the similarity matrix.

library(visNetwork); library(igraph)

g <- igraph::graph.adjacency(cos_sim, mode="undirected", weighted=TRUE, diag=FALSE)
E(g)$width <- abs(E(g)$weight)

t <- merge(data.frame(name = as.character(V(g)$name), stringsAsFactors = F), 
      data.frame(name = as.character(vocab$vocab$terms), 
                 Count = vocab$vocab$terms_counts, stringsAsFactors = F),
       by = "name")

Next, using an arbitrary cosine-similarity threshold (which we can modify), we can create a network plot of how similar the words are under the GloVe model.

cutoff = 0.47

g <- igraph::graph.adjacency(cos_sim, mode="undirected", weighted=TRUE, diag=FALSE)
E(g)$width <- abs(E(g)$weight)*10

t <- merge(data.frame(name = as.character(V(g)$name), stringsAsFactors = F), 
      data.frame(name = as.character(vocab$vocab$terms), 
                 Count = vocab$vocab$terms_counts, stringsAsFactors = F),
       by = "name")


# size nodes by how often the word appears in the corpus
V(g)$size <- t$Count[match(V(g)$name, t$name)] / 20 + 5

# drop edges below the similarity threshold
g <- delete.edges(g, E(g)[ abs(weight) < cutoff])

# color edges by the sign of the similarity and use absolute weights
E(g)$color <- ifelse(E(g)$weight > 0, "blue","red")
E(g)$weight <- abs(E(g)$weight)
# remove words left with no connections
iso <- V(g)[degree(g)==0]
g <- delete.vertices(g, iso)

# detect communities with label propagation
clp <- igraph::cluster_label_prop(g)
class(clp)
## [1] "communities"
# color nodes by community membership
V(g)$color <- RColorBrewer::brewer.pal(12, "Set3")[as.factor(clp$membership)]

visIgraph(g) %>%   visOptions(highlightNearest = list(enabled = TRUE, algorithm = "hierarchical")) %>% 
  visInteraction(navigationButtons = TRUE) 

This shows the connections between words that are used in similar contexts (a five-word window). Think of this like a “localized” or “micro” topic model, where the document level is replaced by a rolling five-word context.
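
To see which words fell into each of these “micro topics”, we can list the community members directly (a short sketch using the clp and g objects from above):

# number of words in each label-propagation community
igraph::sizes(clp)
# the words assigned to each community
split(V(g)$name, igraph::membership(clp))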

Linear Substructures

First, given a word (say “border”), we can find which words were used most similarly (i.e., in the same contexts).

word <- word_vectors["border", , drop = FALSE] 
cos_sim_ex = sim2(x = word_vectors, y = word, method = "cosine", norm = "l2")
head(sort(cos_sim_ex[,1], decreasing = TRUE), 10)
##      border     illegal    security      defend        stop immigration 
##   1.0000000   0.5594225   0.5241929   0.5019102   0.4494898   0.4425788 
##          de      social         end          la 
##   0.4171149   0.3656305   0.3637431   0.3576855
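
Since we will repeat this lookup, here is a small convenience wrapper (the function name is my own, not part of text2vec) that returns the n nearest words for any term in the vocabulary:

# hypothetical helper: nearest neighbors of a term by cosine similarity
nearest_words <- function(term, n = 10) {
  target <- word_vectors[term, , drop = FALSE]
  sims <- sim2(x = word_vectors, y = target, method = "cosine", norm = "l2")
  head(sort(sims[, 1], decreasing = TRUE), n)
}

nearest_words("border")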

Linear substructures allow “algebra” on word vectors: addition (+) and subtraction (-). For example, we can see which words are closest to “trump” plus “president” minus “wall”.

word <- word_vectors["trump", , drop = FALSE] +
  word_vectors["president", , drop = FALSE] -
  word_vectors["wall", , drop = FALSE] 
cos_sim_ex = sim2(x = word_vectors, y = word, method = "cosine", norm = "l2")
head(sort(cos_sim_ex[,1], decreasing = TRUE), 10)
##       president           trump realdonaldtrump              mr 
##       0.7002425       0.6951676       0.6197352       0.5997823 
##          donald            best            next       candidate 
##       0.5490356       0.4573153       0.4281908       0.4118675 
##             man         running 
##       0.4047352       0.3906899

See the GloVe project website for a fuller explanation of linear substructures.