Twitter is a popular social media network in which users can send and receive short 140-character messages (“tweets”) on any topic they wish. We collected 5000 tweets from Twitter on 27 June 2016 by searching for the word “love”.

We will show how regular expressions and the package stringr can be used to analyze the text in this data set.

library(stringr)

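We will import the text file love.txt using the scan command. The argument what = "" tells scan to read the data as character strings. Note that scan splits unquoted input on whitespace by default; if your copy of the file comes back as single words rather than whole tweets, add sep = "\n" so that each line becomes one element.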

love <- scan("data/love.txt", what = "")
head(love)
## [1] "RT @StylishRentals: Love this!  Wings Neck Lighthouse - Lighthouses for Rent in Pocasset  @airbnb #Travel https://t.co/x2hjmg1HOY"                                 
## [2] "RT @pledis_17: [SEVENTEEN NEWS] \x91Love  Letter\x92 repackage album OFFICIAL PHOTO 04 #160704 #SCOUPS #SEVENTEEN #<U+C544><U+C8FC>NICE #VERY #NICE https://t.\x85"
## [3] "RT @chancetherapper: Black Women are soooo beautiful. I love your skin, I love your hair, I love your shape. nothin like it"                                       
## [4] "RT @ElNellaOFC: We love you JaDines!!! #BFYChasingDreams https://t.co/svs5UUwT7o"                                                                                  
## [5] "RT @WeNeedFeminlsm: Love Target for this https://t.co/HkLNTL3IYs"                                                                                                  
## [6] "RT @calumhood5sohs: Retweet if you love @Calum5SOS <U+FFFD><U+FFFD>#VeranoMTV2016 5 Seconds Of Summer https://t.co/O4xcEmB7yu"

How many of these tweets mention the word “heart”?

out <- str_detect(love, "heart")
head(out)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
head(love[out])
## [1] "RT @dinahjane97: I love you Lauren!! your passionate heart is so admirable! You are the definition of a WARRIOR ! NEVER STOP being YOU @Lau\x85"                                                     
## [2] "@iJesseWilliams this is Profound  heart felt<U+2764><U+FFFD><U+FFFD><U+263A>#blacklivesmatter  much love<U+FFFD><U+FFFD> #Americahasbeenbuiltonoursweattears <U+FFFD><U+FFFD>https://t.co/ZwCiMBIAZp"
## [3] "RT @GraysonDolan: 100% positive that Ethan and I have the best fans in the ENTIRE WORLD. Love you guys with all my heart <U+2764><U+FE0F><U+2764><U+FE0F>"                                           
## [4] "Sweetheart, do not love too long:I loved long and long,And grew to be out of fashionLike an old song."                                                                                               
## [5] "I love connor franta with all my heart"                                                                                                                                                              
## [6] "That happened twice?! He broke your heart twice?! I'm guessing you took him \x85 \x97 Yep, he did. I really did love him. https://t.co/t9v5aVx9f3"
sum(out)  
## [1] 114

Thus, 114 tweets contain the string “heart” (this includes longer words such as “Sweetheart”). But did we detect all tweets that had the word “heart”? What if the tweeter writes “Heart” or “HEART” or uses some other combination of upper and lower case letters?

out2 <- str_detect(love, "[Hh][Ee][Aa][Rr][Tt]")
sum(out2)
## [1] 120

So only 6 additional tweets contain “heart” with one or more of its letters capitalized.
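As a hedged alternative to spelling out every letter in both cases, stringr's regex() helper accepts an ignore_case argument. The sketch below uses a small made-up vector (toy is hypothetical, not part of love.txt) to check that both approaches agree, and shows how word boundaries (\\b) would exclude longer words such as “Sweetheart”.

```r
library(stringr)

# Hypothetical toy tweets (not taken from love.txt).
toy <- c("My HEART is full", "Sweetheart, hello", "no match here", "Heart to heart")

# Character-class version, as in the text.
by_class <- str_detect(toy, "[Hh][Ee][Aa][Rr][Tt]")

# Same idea using the ignore_case option of regex().
by_flag <- str_detect(toy, regex("heart", ignore_case = TRUE))

# Both approaches flag the same elements of this toy input.
identical(by_class, by_flag)

# Word boundaries restrict the match to "heart" as a whole word,
# so "Sweetheart" no longer counts.
whole_word <- str_detect(toy, regex("\\bheart\\b", ignore_case = TRUE))
sum(whole_word)
```

Whether “Sweetheart” should count as a mention of “heart” depends on the question being asked, so the word-boundary version is a choice rather than a correction.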

To specify an “or”, use the vertical bar. For instance, to find the number of tweets that mention either “boy” or “girl”:

out3 <- str_detect(love, "boy|girl")
sum(out3)
## [1] 198

The word “you” comes up frequently in tweets, often being mentioned more than once in a given tweet. Suppose we wish to count all occurrences of “you”.

out4 <- str_locate_all(love, "[Yy][Oo][Uu]")
head(out4)
## [[1]]
##      start end
## 
## [[2]]
##      start end
## 
## [[3]]
##      start end
## [1,]    62  64
## [2,]    80  82
## [3,]    98 100
## 
## [[4]]
##      start end
## [1,]    25  27
## 
## [[5]]
##      start end
## 
## [[6]]
##      start end
## [1,]    32  34
head(unlist(out4))
## [1]  62  80  98  64  82 100

The R object out4 is a list with one matrix of start and end positions per tweet. The unlist command creates a vector from the list. For instance, in the third tweet, “you” occurs in positions 62-64, 80-82, and 98-100. When you apply unlist to out4, these values appear in the vector as 62 80 98 64 82 100 (start positions first, followed by end positions).

So unlist(out4) holds both the start and the end position of every occurrence of “you”. To find the total number of occurrences, we divide the length of this vector by 2.

length(unlist(out4))/2
## [1] 3229

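So in the 5000 tweets in this file, the string “you” occurred 3229 times. Because the pattern matches anywhere in the text, this count includes occurrences inside longer words such as “your” (as in the third tweet above).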
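The same total can be obtained more directly with str_count, which returns the number of matches in each string. The following is a minimal sketch on a made-up vector (toy is hypothetical), comparing it with the str_locate_all approach used above.

```r
library(stringr)

# Hypothetical toy tweets (not taken from love.txt).
toy <- c("I love you, you know", "Your call", "nothing")

# Approach from the text: locate every match, then halve the
# number of start/end values.
locs <- str_locate_all(toy, "[Yy][Oo][Uu]")
total_from_locate <- length(unlist(locs)) / 2

# str_count() gives the number of matches per string directly.
per_tweet <- str_count(toy, "[Yy][Oo][Uu]")
total_from_count <- sum(per_tweet)

# Both totals agree on this toy input.
total_from_locate == total_from_count
```

The per-tweet counts are also useful on their own, for example to find the tweets that repeat “you” most often.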

Now, of the tweets in the file, how many were retweets? We need to count the number of tweets that start out with the letters “RT”.

out5 <- str_detect(love, "^RT")
sum(out5)
## [1] 2750

Suppose we wish to count the number of times hashtags are used in the tweets. We need to match the pattern “#” followed by some characters; in particular, the “#” cannot be followed by a space. The expression “\S” matches any character that is not whitespace (inside an R string the backslash must be doubled, giving "\\S"), and a “+” following it specifies one or more such characters:

out6 <- str_locate_all(love, "#\\S+")
length(unlist(out6))/2
## [1] 2141
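Beyond counting hashtags, the same pattern can pull them out so that they can be tabulated. This is a small sketch on a made-up vector (toy and its hashtags are hypothetical), using str_extract_all together with base R's table.

```r
library(stringr)

# Hypothetical toy tweets (not taken from love.txt).
toy <- c("Love this #Travel #beach", "no tags", "#NICE day #Travel")

# Extract every hashtag and flatten the list into one vector.
tags <- unlist(str_extract_all(toy, "#\\S+"))

# Tabulate and sort so the most frequent hashtag comes first.
tag_counts <- sort(table(tags), decreasing = TRUE)
tag_counts
```

Note that "\\S+" keeps any punctuation attached to a hashtag (for example a trailing comma), so real data may need a stricter pattern.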

Now, many tweets contain web addresses. The URLs start with “http” or “https”, followed immediately by “://” and then more characters. So, after “http” we need to allow either 0 or 1 occurrence of “s”, which is written “s?”. After the double forward slashes, “.*” matches zero or more occurrences of any character. Because “.*” is greedy, the match runs to the end of the tweet, so any text that follows a URL is captured along with it.

outURLS <- str_extract_all(love, "http(s?)(://).*")
head(outURLS)    #a list
## [[1]]
## [1] "https://t.co/x2hjmg1HOY"
## 
## [[2]]
## [1] "https://t.\x85"
## 
## [[3]]
## character(0)
## 
## [[4]]
## [1] "https://t.co/svs5UUwT7o"
## 
## [[5]]
## [1] "https://t.co/HkLNTL3IYs"
## 
## [[6]]
## [1] "https://t.co/O4xcEmB7yu"
head(unlist(outURLS))
## [1] "https://t.co/x2hjmg1HOY" "https://t.\x85"       
## [3] "https://t.co/svs5UUwT7o" "https://t.co/HkLNTL3IYs"
## [5] "https://t.co/O4xcEmB7yu" "https://t.co/SXsnZmBCTA"
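In the tweets shown above the URL happens to come last, so matching to the end of the line works. As a hedged refinement, replacing ".*" with "\\S+" stops each match at the first whitespace character. The sketch below contrasts the two patterns on a made-up vector (toy and its URLs are hypothetical).

```r
library(stringr)

# Hypothetical toy tweets (not taken from love.txt).
toy <- c("Look https://t.co/abc123 so cool",
         "plain text",
         "http://example.com/x and https://example.org/y")

# Pattern from the text: ".*" runs to the end of the string.
greedy <- str_extract_all(toy, "http(s?)(://).*")

# Alternative: "\\S+" stops at the first whitespace character.
tight <- str_extract_all(toy, "https?://\\S+")

greedy[[1]]  # includes the trailing words " so cool"
tight[[1]]   # just the URL
tight[[3]]   # two separate URLs instead of one long match
```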

Many users include special characters in their tweets, such as emojis and symbols from other writing systems. However, when these tweets are exported to a text file, these symbols are replaced by their Unicode code points (Unicode is a universal standard for encoding characters). For instance, take a look at the 12th and 195th tweets in this data file:

love[c(12, 195)]   
## [1] "Hate and love relationship with the morning shift  <U+2764><U+FE0F><U+FFFD><U+FFFD><U+FFFD><U+FFFD>"                       
## [2] "@Kobeasagaya <U+2606>**(<U+2661>*<U+25BD>`*<U+2661>)LOVE**<U+2606><U+FFFD><U+FFFD><U+FFFD><U+FFFD> https://t.co/rXtB23wjh2"

<U+2764> is the Unicode notation for a (heavy) heart while <U+2606> is the notation for a white star. You can type the code into any web search engine to see the actual symbol.

Suppose we wish to clean up the file by removing all the Unicode values. For each “<”, we need to find the matching “>”. The pattern “[^<]*” matches any character except “<”, zero or more times, so “<[^<]*>” matches a “<”, everything after it that is not another “<”, and then a closing “>”. We replace each match with a space.

out7 <- str_replace_all(love, "<[^<]*>", " ")
out7[c(12, 195)]
## [1] "Hate and love relationship with the morning shift        "  
## [2] "@Kobeasagaya  **( * `* )LOVE**      https://t.co/rXtB23wjh2"

To remove the *, (, ) and `, we enclose them in square brackets. Outside brackets, *, ( and ) are meta-characters with a special meaning; inside brackets they are treated as literal characters.

out8 <- str_replace_all(out7, "[*()`]", " ")
head(out8)
## [1] "RT @StylishRentals: Love this!  Wings Neck Lighthouse - Lighthouses for Rent in Pocasset  @airbnb #Travel https://t.co/x2hjmg1HOY"          
## [2] "RT @pledis_17: [SEVENTEEN NEWS] \x91Love  Letter\x92 repackage album OFFICIAL PHOTO 04 #160704 #SCOUPS #SEVENTEEN #  NICE #VERY #NICE https://t.\x85"
## [3] "RT @chancetherapper: Black Women are soooo beautiful. I love your skin, I love your hair, I love your shape. nothin like it"                
## [4] "RT @ElNellaOFC: We love you JaDines!!! #BFYChasingDreams https://t.co/svs5UUwT7o"                                                           
## [5] "RT @WeNeedFeminlsm: Love Target for this https://t.co/HkLNTL3IYs"                                                                           
## [6] "RT @calumhood5sohs: Retweet if you love @Calum5SOS   #VeranoMTV2016 5 Seconds Of Summer https://t.co/O4xcEmB7yu"
out8[c(12, 195)]
## [1] "Hate and love relationship with the morning shift        "  
## [2] "@Kobeasagaya            LOVE        https://t.co/rXtB23wjh2"

We can also remove extra whitespace in the text. We will replace each run of one or more whitespace characters (\\s+) with a single space.

out9 <- str_replace_all(out8, "\\s+", " ")
head(out9)
## [1] "RT @StylishRentals: Love this! Wings Neck Lighthouse - Lighthouses for Rent in Pocasset @airbnb #Travel https://t.co/x2hjmg1HOY"          
## [2] "RT @pledis_17: [SEVENTEEN NEWS] \x91Love Letter\x92 repackage album OFFICIAL PHOTO 04 #160704 #SCOUPS #SEVENTEEN # NICE #VERY #NICE https://t.\x85"
## [3] "RT @chancetherapper: Black Women are soooo beautiful. I love your skin, I love your hair, I love your shape. nothin like it"              
## [4] "RT @ElNellaOFC: We love you JaDines!!! #BFYChasingDreams https://t.co/svs5UUwT7o"                                                         
## [5] "RT @WeNeedFeminlsm: Love Target for this https://t.co/HkLNTL3IYs"                                                                         
## [6] "RT @calumhood5sohs: Retweet if you love @Calum5SOS #VeranoMTV2016 5 Seconds Of Summer https://t.co/O4xcEmB7yu"
out9[c(12, 195)]
## [1] "Hate and love relationship with the morning shift "
## [2] "@Kobeasagaya LOVE https://t.co/rXtB23wjh2"
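The three cleaning steps above can be chained into a single helper. The following is a minimal sketch, not part of the original analysis: the function name clean_tweet and the input toy are hypothetical. For the last step it uses stringr's str_squish, which collapses runs of whitespace like "\\s+" does but also trims leading and trailing whitespace, so its results differ slightly from out9 (no trailing space).

```r
library(stringr)

# Hypothetical helper (the name is illustrative) chaining the
# cleaning steps from the text.
clean_tweet <- function(x) {
  x <- str_replace_all(x, "<[^<]*>", " ")  # drop <U+XXXX> placeholders
  x <- str_replace_all(x, "[*()`]", " ")   # drop the listed symbols
  str_squish(x)                            # collapse whitespace and trim the ends
}

# Hypothetical toy tweet (not taken from love.txt).
toy <- "Hate and love  <U+2764><U+FE0F> **(LOVE)** "
clean_tweet(toy)
```

Because every stringr function used here is vectorised, the helper can be applied to a whole character vector such as love in one call.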

On your own

  1. How many tweets in love contain the phrase “love you”?

  2. How many mentions (of specific users) are there in the love tweets? Mentions are prefaced by the “at” symbol, for example, @username.

  3. The third installment of the movie franchise “Ghostbusters” was released on July 15, 2016. Import (via ‘scan’) the file **Ghostbusters.txt** that has 5000 tweets downloaded from Twitter on July 18, 2016, based on a search of the word “ghostbusters”.

Resources