Analyzing Data from StreamR Package

In my previous material on using streamR (i.e., Twitter’s Streaming API), I pulled a dataset of sample tweets.

I’ve saved that dataset in the data folder. It contains 10 minutes of tweets that include the hashtag #gameofthrones.
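For reference, the capture itself would look something like this (a sketch, not run here; my_oauth stands in for your own OAuth credentials):

library(streamR)

# stream tweets matching the hashtag for 10 minutes (600 seconds);
# my_oauth is a placeholder for an OAuth token you set up beforehand
filterStream(file.name = "stream_got.json",
             track = "#gameofthrones",
             timeout = 600,
             oauth = my_oauth)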

Data Formatting

Loading JSON file

#install.packages("streamR")
library(streamR)

file <- "../data/stream/stream_got.json"
# file <- "~/Dropbox (UNC Charlotte)/summer-2017-social-media-workshop/data/stream/stream_got.json"

#?parseTweets
tweets <- parseTweets(tweets = file)
## 1342 tweets have been parsed.

Cleaning Text

Do you notice any weird characters? If so, you have an encoding issue. Not all computers will have this: I ran into it on my Mac but did not have the issue on my Ubuntu (Linux) machine.

This is an encoding problem, especially when handling emojis.

We can run a function to convert these characters.

Let’s run this on four text fields:

  1. the body of the tweet (“text”)

  2. the profile location (“location”)

  3. the profile summary (“description”)

  4. the handle name (“name”)

tweets$text <- iconv(tweets$text, from="UTF-8", to="ASCII", "byte")
tweets$location <- iconv(tweets$location, from="UTF-8", to="ASCII", "byte")
tweets$description <- iconv(tweets$description, from="UTF-8", to="ASCII", "byte")
tweets$name <- iconv(tweets$name, from="UTF-8", to="ASCII", "byte")

Notice that emojis are now coded as Unicode byte tags. You can look these codes up in a Unicode reference table.
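For instance, here’s the same conversion on a toy string (a minimal sketch, not run on the dataset; tags in the dataset may look different because Twitter encodes some emojis as surrogate pairs):

x <- "I love this show \U0001F525"  # fire emoji, U+1F525
iconv(x, from = "UTF-8", to = "ASCII", "byte")
# yields "I love this show <f0><9f><94><a5>", the emoji's UTF-8 bytes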

Special Character Cleanup

For now, we’re going to exclude all non-standard characters. However, I’ve created code for a simple emoji sentiment analysis.
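As a taste of what that looks like, here’s a hypothetical one-emoji count before we strip the tags ("<e2><9d><a4>" is the UTF-8 byte sequence for the red heart, U+2764):

# count tweets whose text contains the red-heart byte tag
sum(grepl("<e2><9d><a4>", tweets$text, fixed = TRUE))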

We can use regular expressions to find unique patterns.

library(stringr)

# removes urls, &amp, RT, etc.
tweets$cleanText <- str_replace_all(tweets$text, "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https", "")

# remove emoji (tags)
tweets$cleanText <- gsub("<.*?>", "", tweets$cleanText)

# remove next line character "\n"
tweets$cleanText <- gsub("\n", "", tweets$cleanText)

For example, let’s use a simple grep expression to find how many tweets mention “Daenerys”:

length(grep("Daenerys", tweets$cleanText, ignore.case=TRUE))
## [1] 109

Hashtag and Mention Counts

Alternatively, we can use a slightly more complicated regular expression to find all hashtags.

ht <- str_extract_all(tweets$cleanText, "#(\\d|\\w)+")
ht <- unlist(ht)
head(sort(table(ht), decreasing = TRUE), n = 10)
## ht
##   #GameofThrones   #GameOfThrones           #GoTS7   #gameofthrones 
##              781              540              140               54 
##    #WinterIsHere        #Concours       #EdSheeran              #39 
##               29               16               13               11 
## #GameOfThrones17            #SDCC 
##                8                8

We can also do the same thing for mentions (@):

mt <- str_extract_all(tweets$cleanText, "@(\\d|\\w)+")
mt <- unlist(mt)
head(sort(table(mt), decreasing = TRUE), n = 10)
## mt
##     @333903271 @GameOfThrones      @LordSnow @virginiakimba @Thrones_Memes 
##            156            111             88             85             60 
##      @Daenerys   @ItsGoTQuote @OriginalFunko        @ruhtyt          @9GAG 
##             44             31             28             12             11

Exploring the Data Attributes

In this dataset, we have 43 attributes at the tweet level.

One way to explore the data is to use the str function.

str(tweets)
## 'data.frame':    1342 obs. of  43 variables:
##  $ text                     : chr  "RT @ItsGoTQuote: Lyanna is the best <ed><a0><bd><ed><b8><ad> #GameofThrones https://t.co/G2SWQnqkuu" "RT @Daenerys: Season 7 Kill-Count:\n\nJon Snow - 0\nDaenerys - 0\nDragons - 0\nWhite Walkers - 0\nCersei - 0\nThe Mountain - 0\"| __truncated__ "Petit Sondage: Game Of Thrones est surcot<c3><a9>? <ed><a0><bd><ed><b8><95>\n#GameOfThrones" "RT @GoT_Tyrion: I don't like this song. #ASoIaF #GameOfThrones https://t.co/ROX3U1wcag" ...
##  $ retweet_count            : num  989 18356 0 749 3740 ...
##  $ favorited                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ truncated                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ id_str                   : chr  "887367552057266177" "887367552569036800" "887367554062266368" "887367554813108224" ...
##  $ in_reply_to_screen_name  : chr  NA NA NA NA ...
##  $ source                   : chr  "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>" "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>" "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" ...
##  $ retweeted                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ created_at               : chr  "Tue Jul 18 17:44:45 +0000 2017" "Tue Jul 18 17:44:45 +0000 2017" "Tue Jul 18 17:44:46 +0000 2017" "Tue Jul 18 17:44:46 +0000 2017" ...
##  $ in_reply_to_status_id_str: chr  NA NA NA NA ...
##  $ in_reply_to_user_id_str  : chr  NA NA NA NA ...
##  $ lang                     : chr  "en" "en" "fr" "en" ...
##  $ listed_count             : num  6 10 4 1 0 1 3 9 3 4 ...
##  $ verified                 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ location                 : chr  "<ed><a0><bc><ed><bc><8f>" "Hogwarts" "In the Z<c3><b8>ne" NA ...
##  $ user_id_str              : chr  "196119822" "1224196508" "830767782623059968" "3248934316" ...
##  $ description              : chr  "a girl born to go abroad. <ed><a0><bc><ed><bc><a8>" "#ALWAYS<e2><9d><a4>\n97' / Louisian / <ed><a0><bd><ed><b4><9c> RMT <ed><a0><bd><ed><b2><89><ed><a0><bd><ed><b2><8a><ed><a0><bd>"| __truncated__ "Etudiant <e2><82><aa> D<c3><a9>veloppeur <e2><82><aa> Anticonformiste. Je suis celui que tu crois ne pas <c3><aa>tre celui que "| __truncated__ "FTBL  <e2><9a><bd><e2><9a><bd><e2><9a><bd><e2><9a><bd> <e2><9a><bd>" ...
##  $ geo_enabled              : logi  TRUE TRUE TRUE FALSE TRUE TRUE ...
##  $ user_created_at          : chr  "Tue Sep 28 11:32:03 +0000 2010" "Wed Feb 27 09:55:12 +0000 2013" "Sun Feb 12 13:17:29 +0000 2017" "Tue May 12 21:49:36 +0000 2015" ...
##  $ statuses_count           : num  21429 7254 39265 3571 14256 ...
##  $ followers_count          : num  176 210 594 340 284 406 226 249 219 69 ...
##  $ favourites_count         : num  225 5078 3131 7772 2136 ...
##  $ protected                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ user_url                 : chr  NA NA "http://instagram.com/gabriel__the__code" NA ...
##  $ name                     : chr  "A girl has no name." "KENZIE<ed><a0><bd><ed><b2><89><ed><a0><bd><ed><b2><8a>" "Le mec placide <ed><a0><bd><ed><ba><b6><e2><98><af>" "Lucas Dmitruk" ...
##  $ time_zone                : chr  "Hanoi" "Beijing" "West Central Africa" NA ...
##  $ user_lang                : chr  "th" "en" "fr" "es" ...
##  $ utc_offset               : num  25200 28800 3600 NA -10800 -18000 -18000 -18000 -18000 18000 ...
##  $ friends_count            : num  277 949 243 205 219 187 210 209 298 44 ...
##  $ screen_name              : chr  "youknowsNOTHING" "kenzyaaaa" "gabriel_TheCode" "LucasDmitruk" ...
##  $ country_code             : chr  NA NA NA NA ...
##  $ country                  : chr  NA NA NA NA ...
##  $ place_type               : chr  NA NA NA NA ...
##  $ full_name                : chr  NA NA NA NA ...
##  $ place_name               : logi  NA NA NA NA NA NA ...
##  $ place_id                 : logi  NA NA NA NA NA NA ...
##  $ place_lat                : num  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
##  $ place_lon                : num  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
##  $ lat                      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ lon                      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ expanded_url             : chr  NA NA NA NA ...
##  $ url                      : chr  NA NA NA NA ...
##  $ cleanText                : chr  " @ItsGoTQuote: Lyanna is the best  #GameofThrones " " @Daenerys: Season 7 Kill-Count:Jon Snow - 0Daenerys - 0Dragons - 0White Walkers - 0Cersei - 0The Mountain - 0Arya Stark - 50+" "Petit Sondage: Game Of Thrones est surcot? #GameOfThrones" " @GoT_Tyrion: I don't like this song. #ASoIaF #GameOfThrones " ...

Geolocation & TimeZone

Recall from yesterday’s talk that there are three main types of geolocation data:

  1. Points (lat/long)

  2. Places/Polygons (lat/long bounding boxes)

  3. Profile Location Description

Let’s see how many points we have in our dataset.

sum(!is.na(tweets$lat))
## [1] 1

So we have only 1 tweet out of 1,342 that has a point lat/long.

What about place/polygon?

sum(!is.na(tweets$place_lat))
## [1] 19

We have 19. So in total, only about 1.49% of these tweets have geolocation – very similar to what we discussed yesterday (only 1-2.9% of tweets have geolocation).
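To verify that share directly (a quick check using the counts above):

# points plus places, as a fraction of all tweets
(sum(!is.na(tweets$lat)) + sum(!is.na(tweets$place_lat))) / nrow(tweets)
# 20 / 1342, i.e., about 1.49%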

What about profile location?

loc <- table(tweets$location)
head(loc[order(loc, decreasing = T)],n=10)
## 
##                          USA                        Idaho 
##                           22                           16 
##                         Utah                      Florida 
##                           16                           11 
##                    Wisconsin              Los Angeles, CA 
##                           11                           10 
##                     Maryland                          usa 
##                            9                            9 
##               Wisconsin, USA Any run, Any time, Any where 
##                            8                            7

What do you notice? This is an open-ended string (though it can be a Twitter place/polygon), so while some people provide clean values (e.g., Idaho; Los Angeles, CA), others are vague (e.g., USA, usa), and still others are meaningless (“Any run, Any time, Any where”).
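One small cleanup that helps (a sketch): lowercasing merges case variants like “USA” and “usa”, though it can’t fix vague or meaningless entries.

locLower <- tolower(tweets$location)
head(sort(table(locLower), decreasing = TRUE), n = 5)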

Just out of curiosity, let’s see how many of our tweets have a missing profile location (i.e., nothing at all).

sum(is.na(tweets$location))
## [1] 347

That’s 347 tweets, or roughly 26%.
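You can compute that share in one line:

# fraction of tweets with a missing profile location (347 / 1342)
mean(is.na(tweets$location))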

Advanced: If you’re interested in creating a large-scale machine learning algorithm to predict missing locations, check out my blog post on using PySpark to predict missing profile locations.

You may notice in the dataset that there’s also a time zone field. Let’s look at the top 10 time zones.

tz <- table(tweets$time_zone)
head(tz[order(tz, decreasing = T)],n=10)
## 
##  Pacific Time (US & Canada) Mountain Time (US & Canada) 
##                         422                          57 
##  Eastern Time (US & Canada)                   Amsterdam 
##                          53                          35 
##                      London  Central Time (US & Canada) 
##                          35                          30 
##                      Hawaii                   Greenland 
##                          20                          15 
##                       Paris                      Athens 
##                          15                          13

Also, there’s a field that shows whether users have enabled geolocation.

table(tweets$geo_enabled)
## 
## FALSE  TRUE 
##   826   516

This simply means that the user can provide geolocation. As you can see, most do not have this feature enabled.
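To see the shares directly (a quick check):

# proportions of the counts above: roughly 0.62 FALSE, 0.38 TRUE
round(prop.table(table(tweets$geo_enabled)), 2)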

User Level Attributes: Handle, Name and Profile Description

There are also several fields that are “snapshots” of the user at the time of the tweet. These fields can change at any time, so they only tell us what the user’s profile looked like when the tweet was sent.

One of the most important things to remember about users is that there are two ways to identify each one:

  1. By their user id (sometimes called actor.id)

  2. By their handle (screen_name)

An important note: the user id cannot change, while the handle can! In our dataset, given we only have a 10-minute sample, this won’t be a big deal. However, it’s important to keep in mind when combining large datasets that span long time ranges.
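A quick sanity check along those lines (a sketch): within a short window like ours, the number of unique ids and unique handles should match.

# each handle should map to exactly one user id in this 10-minute sample
length(unique(tweets$user_id_str))
length(unique(tweets$screen_name))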

Let’s explore which users have the most tweets using dplyr, which is part of the tidyverse.

library(tidyverse)

aggTweets <- tweets %>%
  group_by(screen_name, name, user_id_str) %>%
  summarise(Count=n()) %>%
  arrange(desc(Count))

head(aggTweets[,c("screen_name","Count")], n = 10)
## # A tibble: 10 x 2
##        screen_name Count
##              <chr> <int>
##  1       MookmixCz     8
##  2    AlitlayArisa     7
##  3 RunningForHarry     7
##  4         aunewse     6
##  5       caseyttha     6
##  6         linadbg     6
##  7  mandyfreebird1     6
##  8       heyyouapp     5
##  9   jack_son_five     5
## 10        kathmego     5

Let’s now clean the user profile summary.

For this, we’ll introduce quanteda, which is the best (in my humble opinion) text analysis package in R. (Sorry, tidytext, which wins for the easiest text analysis package.)

tweets$description <- gsub("<.*?>", "", tweets$description)

#install.packages("quanteda")
library(quanteda)

profileCorpus <- corpus(tweets$description)

mydfm <- dfm(profileCorpus, 
             remove = c("na","y","https","http","t.co","de","en","n",stopwords("english")),
             remove_punct = TRUE,
             remove_numbers = TRUE,
             remove_symbols = TRUE)

textplot_wordcloud(mydfm, 
                   min.freq = 6, 
                   random.order = FALSE,
                   rot.per = .25, 
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))

There seem to be some discrepancies, perhaps because many of the profiles are written in other languages.

Let’s see how many users have different language settings on their profile.

userLang <- table(tweets$user_lang)
userLang <- userLang[order(userLang, decreasing = TRUE)]
head(userLang, n = 10)
## 
##  en  es  fr  pt  th  it  ru  de  ar  pl 
## 925 103  84  49  44  25  24  22  16  11

Let’s keep only the top three languages: English (en), Spanish (es), French (fr).

We can then do a “comparison” plot by using quanteda’s group function.

profileCorpus <- corpus(tweets$description,
                        docvars = data.frame(user_lang = tweets$user_lang))

# keep only users who are in English, Spanish or French
profileCorpus <- corpus_subset(profileCorpus, 
                               user_lang %in% c("en","es","fr"))

mydfm <- dfm(profileCorpus, 
             groups = "user_lang",
             remove = c("na","y","https","http","t.co","de","en","n",
                        stopwords("english"), stopwords("french"), stopwords("spanish")),
             remove_punct = TRUE,
             remove_numbers = TRUE,
             remove_symbols = TRUE)

textplot_wordcloud(mydfm, 
                   comparison = TRUE,
                   min.freq = 6, 
                   random.order = FALSE,
                   rot.per = 0, 
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))

We’ll use quanteda again in another section as well as next Thursday for our Text-as-Data workshop.

Challenge: Analyzing the Tweet Text with quanteda

Now that we’ve introduced quanteda, reuse the code above but replace the profile description column with the tweet text (text) field.

You should be able to rerun the above analysis with only one small change. You can use the raw text (text) field, since dfm mimics a lot of the cleaning we did for the cleanText field.

If you’re interested in more pre-processing parameters, see ?tokens for options like stem or ngrams.

# write your response here
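One possible solution (a sketch reusing the earlier dfm settings; peek only after trying it yourself):

tweetCorpus <- corpus(tweets$text)

mydfm <- dfm(tweetCorpus,
             remove = c("na","y","https","http","t.co","de","en","n",stopwords("english")),
             remove_punct = TRUE,
             remove_numbers = TRUE,
             remove_symbols = TRUE)

textplot_wordcloud(mydfm,
                   min.freq = 6,
                   random.order = FALSE,
                   rot.per = .25,
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))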

User Level Numeric Attributes: Friends, Followers, Statuses, Favorites, and Lists

Last, there are five numeric attributes for each user.

Let’s use a neat HTMLWidget pairsD3 to run an interactive scatterplot matrix.

Honestly, this may not be the best way of representing this data (e.g., the values likely need to be log-scaled given their power-law qualities), but it’s a fun and easy way to do so.

col <- c("friends_count",
         "followers_count",
         "statuses_count",
         "favourites_count",
         "listed_count")

#install.packages("pairsD3")
library(pairsD3)

pairsD3(tweets[,col], 
        group = ifelse(tweets$user_lang=="en","English","Non-English"),
        tooltip = paste0(tweets$screen_name,"\n",tweets$user_lang))
[Interactive pairsD3 scatterplot matrix of friends_count, followers_count, statuses_count, favourites_count, and listed_count]
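If you want the log-scaled view mentioned above, a quick sketch (adding 1 before logging to avoid log of zero):

pairsD3(log10(tweets[,col] + 1),
        group = ifelse(tweets$user_lang=="en","English","Non-English"),
        tooltip = paste0(tweets$screen_name,"\n",tweets$user_lang))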

Time

One problem is the time (created_at) field. The original data comes in as GMT (UTC), but we can convert it to US Eastern time.

We can then truncate the timestamps to the minute and count tweets in each minute.

tweets$cleanTime <- strptime(tweets$created_at,"%a %b %d %H:%M:%S %z %Y", tz="America/New_York")

by.mins <- cut.POSIXt(tweets$cleanTime,"mins")
t <- as.data.frame(table(by.mins), stringsAsFactors = F)

t$by.mins <- as.POSIXct(t$by.mins)

Note that since I used the Streaming API, these tweets were captured across a narrow 10-minute window.

However, we can visualize them using the ggplot2 package.

ggplot(t, aes(x = by.mins, y = Freq)) + 
  geom_line() + 
  xlab("Minutes") +
  ylab("Tweet Count")

There are other great R time series visualization packages, like dygraphs or streamgraph.
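For instance, a minimal dygraphs version of the plot above (a sketch, assuming the dygraphs and xts packages are installed):

#install.packages(c("dygraphs","xts"))
library(dygraphs)
library(xts)

# dygraph wants a time-indexed object; xts supplies one
dygraph(xts(t$Freq, order.by = t$by.mins), ylab = "Tweet Count")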

I’ve created a visualization demo of these for a sample of Twitter data.