Analysing Texts and Networks with R

Wouter van Atteveldt, Nel Ruigrok, Kasper Welbers
Session 1: R Intro & Accessing APIs

Motivational Example

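library(twitteR)
# requires prior authentication with setup_twitter_oauth(...)
tweets = searchTwitteR("#bigdata", resultType="recent", n = 100)
# convert the list of status objects into a single data frame
tweets = plyr::ldply(tweets, as.data.frame)
knitr::kable(head(tweets[c("id", "created", "text")]))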
id                  created              text
737606276188753921  2016-05-31 11:26:54  #BigData : comment s'enrichir en partageant #tribune @LesEchos https://t.co/6kbaQmd40J
737606250024689666  2016-05-31 11:26:48  RT @jamesturner247: Is Big Data Taking Us Closer to the Deeper Questions in Artificial Intelligence? https://t.co/Z7ZsI1mzLB #ArtificialInt…
737606227358711809  2016-05-31 11:26:43  RT @jamesturner247: Big Data and the Cloud: Uncover New #Insights Hiding in Your Data https://t.co/NM9BNukkXX #BigData #DataScience #Health…
737606216243761152  2016-05-31 11:26:40  momentum in today’s #BigData #data #analytics landscape. https://t.co/poK5ksaOTO https://t.co/Mmzlf6vJS1
737606192755675141  2016-05-31 11:26:35  Heather Knight is speaking at #smartcon2016 in Istanbul Marilyn Monrobot - Kurucu, Robotist #bigdata #IoT #Startup https://t.co/xuHZR6vORY
737606191333793792  2016-05-31 11:26:34  RT @Informatica: At @strataconf, learn how to turn #bigdata into big value! https://t.co/T2Jvn3JRqh #StrataHadoop https://t.co/tqneHPVTzk

Motivational Example

library(RTextTools)
library(corpustools)
dtm = create_matrix(tweets$text)
dtm.wordcloud(dtm, freq.fun = sqrt)

[Figure: word cloud of the terms in the #bigdata tweets]

Workshop Overview

Session 1

  • Organizing & Transforming data
  • Accessing APIs from R

Session 2

  • Corpus Analysis
  • Network Analysis

Introduction

  • Please introduce yourself
    • What is your research interest
    • What do you want to use R for?
    • Experience with R / text / programming

Course Components

  • Two 1.5-hour sessions
  • Lecture & Interactive sessions
    • Please interrupt me!
  • Hands-on sessions
  • http://vanatteveldt.com
    • Slides, hand-outs, data

What is R?

  • Programming language
  • Statistics Toolkit
  • Open Source
  • Community driven
    • Packages/libraries
    • Including many text analysis libraries

Cathedral and Bazaar

The R Ecosystem

  • R
  • RStudio
    • RMarkdown / RPresentation
  • Packages
    • CRAN
    • Github

Installing and using packages

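install.packages("plyr")   # install from CRAN (only needed once)
library(plyr)              # attach the package for this session
plyr::rename               # or refer to a function without attaching

devtools::install_github("amcat/amcat-r")   # install from GitHub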

Data types: vectors

x = 12
class(x)
[1] "numeric"
x = c(1, 2, 3)
class(x)
[1] "numeric"
x = "a text"
class(x)
[1] "character"

Data Frames

df = data.frame(id=1:3, age=c(14, 18, 24), 
          name=c("Mary", "John", "Luke"))
df
  id age name
1  1  14 Mary
2  2  18 John
3  3  24 Luke
class(df)
[1] "data.frame"

Selecting a column

df$age
[1] 14 18 24
df[["age"]]
[1] 14 18 24
class(df$age)
[1] "numeric"
class(df$name)
[1] "factor"

Useful functions

Data frames:

colnames(df)
head(df)
tail(df)
nrow(df)
ncol(df)
summary(df)

Vectors:

mean(df$age)
length(df$age)

Other data types

  • Data frame:
    • Rectangular data frame
    • Columns are vectors of the same length
    • (a vector always has one type)
  • List:
    • Can contain anything (incl. data frames, lists)
    • Elements can be of arbitrary type
  • Matrix:
    • Rectangular
    • All cells same (primitive) type
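
A small sketch of the containers described above. The object names (`v`, `l`, `m`) are made up for illustration:

```r
# A vector has a single type: mixing numbers and text coerces everything to character
v = c(1, "a", TRUE)
class(v)          # "character"

# A list can hold elements of any type, including data frames and other lists
l = list(number = 1, text = "a",
         frame = data.frame(id = 1:2), nested = list(1, "b"))
sapply(l, class)  # one class per element

# A matrix is rectangular and all cells share one primitive type
m = matrix(1:6, nrow = 2)
dim(m)            # 2 3
class(m[1, 1])    # "integer"
```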

Finding help (and packages)

  • Ask a friend!
  • Built-in documentation
    • CRAN package vignettes
  • Task views
  • Google (sorry…)

Organizing Data in R

Subsetting

Recoding & Renaming columns

Ordering

Subsetting

df[1:2, 1:2]
  id age
1  1  14
2  2  18
df[df$id %% 2 == 1, ]
  id age name
1  1  14 Mary
3  3  24 Luke
df[, c("id", "name")]
  id name
1  1 Mary
2  2 John
3  3 Luke

Subsetting: `subset` function

subset(df, id == 1)
  id age name
1  1  14 Mary
subset(df, id > 1 & age < 20)
  id age name
2  2  18 John

Recoding columns

df2 = df
df2$age2 = df2$age + df2$id
df2$age[df2$id == 1] = NA
df2$id = NULL
df2$old = df2$age > 20
df2$agecat = 
  ifelse(df2$age > 20, "Old", "Young")
df2
  age name age2   old agecat
1  NA Mary   15    NA   <NA>
2  18 John   20 FALSE  Young
3  24 Luke   27  TRUE    Old

Text columns

  • character vs factor
df2=df
df2$name = as.character(df2$name)
df2$name[df2$id != 1] = 
    paste("Mr.", df2$name[df2$id != 1])
df2$name = toupper(df2$name)
df2$name = gsub("\\.\\s*", "_", df2$name)
df2[grepl("mr", df2$name, ignore.case = T), ]
  id age    name
2  2  18 MR_JOHN
3  3  24 MR_LUKE
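
To illustrate the difference between character and factor: a factor stores integer codes plus a fixed set of labels, so it is usually safer to convert to character before editing text. Note that since R 4.0.0, `data.frame()` keeps strings as character unless `stringsAsFactors = TRUE` is given, so the conversion is only needed for older versions or explicit factors. The vector names below are illustrative:

```r
# A factor stores integer codes plus a set of labels (levels)
f = factor(c("Mary", "John", "Luke"))
levels(f)        # the distinct labels, sorted: "John" "Luke" "Mary"
as.integer(f)    # the underlying codes: 3 1 2

# Convert to character before editing the text
ch = as.character(f)
toupper(ch)      # "MARY" "JOHN" "LUKE"

# Assigning a value that is not an existing level gives NA (with a warning)
f2 = f
f2[1] = "Mr. Mary"
is.na(f2[1])     # TRUE
```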

Renaming columns

df2 = df
colnames(df2) = c("ID", "AGE", "NAME")
colnames(df2)[2] = "leeftijd"
df2 = plyr::rename(df2, c("NAME"="naam"))
df2
  ID leeftijd naam
1  1       14 Mary
2  2       18 John
3  3       24 Luke

Ordering

df[order(df$age), ]
  id age name
1  1  14 Mary
2  2  18 John
3  3  24 Luke
plyr::arrange(df, -age)
  id age name
1  3  24 Luke
2  2  18 John
3  1  14 Mary

Accessing elements

  • Data frame
    • Select one column: df$col, df[["col"]]
    • Select columns: df[c("col1", "col2")]
    • Subset: df[rows, columns]
  • List:
    • Select one element: l$el, l[["el"]], l[[1]]
    • Select several elements: l[1:3]
  • Matrix:
    • All cells same type
    • Subset: m[rows, columns]
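
The access patterns above, sketched on small made-up objects. Single brackets return a smaller container of the same kind, while double brackets (or `$`) return the element itself:

```r
df = data.frame(id = 1:3, age = c(14, 18, 24))
l = list(a = 1, b = "two", c = 3:5)
m = matrix(1:6, nrow = 2)

df[["age"]]            # one column, as a vector
df[c("id", "age")]     # several columns, still a data frame
df[df$age > 15, "id"]  # rows by condition, column by name: 2 3

l[["b"]]               # one element: "two"
l[1:2]                 # several elements: a shorter list
m[2, ]                 # second row of the matrix: 2 4 6
```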

Transforming data

Combining data

Reshaping data

Combining data

cbind(df, country=c("nl", "uk", "uk"))
  id age name country
1  1  14 Mary      nl
2  2  18 John      uk
3  3  24 Luke      uk
# add the new row as a data frame so the column types are preserved
rbind(df, data.frame(id=1, age=2, name="Mary"))
  id age name
1  1  14 Mary
2  2  18 John
3  3  24 Luke
4  1   2 Mary

Merging data

countries = data.frame(id=1:2, country=c("nl", "uk"))
merge(df, countries)
  id age name country
1  1  14 Mary      nl
2  2  18 John      uk
merge(df, countries, all=T)
  id age name country
1  1  14 Mary      nl
2  2  18 John      uk
3  3  24 Luke    <NA>

Merging data

merge(data1, data2)
merge(data1, data2, by="id")
merge(data1, data2, by.x="id", by.y="ID")
merge(data1, data2, by="id", all=T)
merge(data1, data2, by="id", all.x=T)
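
A runnable sketch of these variants. The contents of `data1` and `data2` are invented here purely to show which rows each call keeps:

```r
data1 = data.frame(id = 1:3, name = c("Mary", "John", "Luke"))
data2 = data.frame(ID = c(1, 2, 4), country = c("nl", "uk", "de"))

# The key columns have different names, so name them explicitly
inner = merge(data1, data2, by.x = "id", by.y = "ID")                # ids 1, 2
left  = merge(data1, data2, by.x = "id", by.y = "ID", all.x = TRUE)  # ids 1, 2, 3
full  = merge(data1, data2, by.x = "id", by.y = "ID", all = TRUE)    # ids 1, 2, 3, 4

nrow(inner)  # 2
nrow(left)   # 3
nrow(full)   # 4
```

Rows without a match in the other data frame get `NA` in the columns that come from it.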

Reshaping data

  • reshape2 package:
    • melt: wide to long
    • dcast: long to wide (pivot table)

Melting data

wide = data.frame(id=1:3, 
  group=c("a","a","b"), 
  width=c(100, 110, 120), 
  height=c(50, 100, 150))
wide
  id group width height
1  1     a   100     50
2  2     a   110    100
3  3     b   120    150

Melting data

library(reshape2)
long = melt(wide, id.vars=c("id", "group"))
long
  id group variable value
1  1     a    width   100
2  2     a    width   110
3  3     b    width   120
4  1     a   height    50
5  2     a   height   100
6  3     b   height   150

Casting data

dcast(long, id + group ~ variable, value.var="value")
  id group width height
1  1     a   100     50
2  2     a   110    100
3  3     b   120    150

Casting data: aggregation

dcast(long, group ~ variable, value.var = "value", fun.aggregate = max)
  group width height
1     a   110    100
2     b   120    150
dcast(long, id ~., value.var = "value", fun.aggregate = mean)
  id   .
1  1  75
2  2 105
3  3 135

Aggregation with `aggregate`

aggregate(long["value"], long["group"], max)
  group value
1     a   110
2     b   150

`aggregate` vs `dcast`

Aggregate

  • One aggregation function
  • Multiple value columns
  • Groups go in rows (long format)
  • Specify with column subsets

Cast

  • One aggregation function
  • One value column
  • Groups go in rows or columns
  • Specify with formula (rows ~ columns)
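
The contrast can be sketched with the `wide` data from the melting example (recreated here so the snippet stands alone). The `requireNamespace` guard is only there so that the sketch also runs where reshape2 is not installed:

```r
wide = data.frame(id = 1:3, group = c("a", "a", "b"),
                  width = c(100, 110, 120), height = c(50, 100, 150))

# aggregate: several value columns at once, one row per group
agg = aggregate(wide[c("width", "height")], wide["group"], max)
agg

# dcast: one value column, but the groups can be spread over the columns
if (requireNamespace("reshape2", quietly = TRUE)) {
  long = reshape2::melt(wide, id.vars = c("id", "group"))
  print(reshape2::dcast(long, variable ~ group,
                        value.var = "value", fun.aggregate = max))
}
```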

Simple statistics

Vector properties

mean(x)
sd(x)
sum(x)

Basic tests

# formula first, then data; each group needs at least two observations
t.test(width ~ group, data=wide)
t.test(wide$width, wide$height, paired=T)
cor.test(wide$width, wide$height)
m = lm(width ~ group + height, data=wide)
summary(m)
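
The three-row example data is too small for these tests to say much, so here is a sketch of the same calls on `mtcars`, a data set that ships with R. The choice of data set and variables is an assumption made for illustration, not part of the workshop material:

```r
# In mtcars, am is the transmission type (0 = automatic, 1 = manual)
tt = t.test(mpg ~ am, data = mtcars)   # two-sample (Welch) t-test by group
tt$p.value

ct = cor.test(mtcars$mpg, mtcars$wt)   # Pearson correlation test
ct$estimate

m = lm(mpg ~ wt + am, data = mtcars)   # linear model: formula first, then data
coef(m)
summary(m)$r.squared
```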

Workshop Overview

Session 1

  • Organizing & Transforming data
  • Accessing APIs from R

Session 2

  • Corpus Analysis
  • Network Analysis

What is an API?

  • Application Programming Interface
  • Computer-friendly web page
    • Standardized requests
    • Structured response
    • JSON / CSV
  • Access directly (HTTP call)
  • Client library for popular APIs
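
As a rough illustration of a structured response, the sketch below parses a made-up JSON string with the jsonlite package. The field names (`status`, `docs`, `headline`, `date`) are invented and do not come from any particular API:

```r
library(jsonlite)

# A made-up JSON response, as an API might return it
response = '{"status": "OK", "docs": [
  {"headline": "First article", "date": "2016-05-01"},
  {"headline": "Second article", "date": "2016-05-02"}]}'

parsed = fromJSON(response)
parsed$status          # "OK"
parsed$docs            # the array of objects is simplified to a data frame
parsed$docs$headline   # "First article" "Second article"
```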

Package twitteR

devtools::install_github("geoffjentry/twitteR")
library(twitteR)
setup_twitter_oauth(...)
tweets = searchTwitteR("#Trump2016", resultType="recent", n = 10)
tweets = plyr::ldply(tweets, as.data.frame)

Package Rfacebook

devtools::install_github("pablobarbera/Rfacebook", subdir="Rfacebook")
library(Rfacebook)
fb_token = fbOAuth(fb_app_id, fb_app_secret)
p = getPage(page="nytimes", token=fb_token)
post = getPost(p$id[1], token=fb_token)

Package rtimes

install.packages("rtimes")
library(rtimes)
options(nytimes_as_key = nyt_api_key)

res = as_search(q="trump", 
  begin_date = "20160101", 
  end_date = "20160501")

arts = plyr::ldply(res$data, 
  function(x) c(headline=x$headline$main, 
                date=x$pub_date))

APIs and rate limits

  • Most APIs have access limits
  • Log on with key or token
  • Response size (page) limited to n results
  • Requests limited to n per hour/day
  • Some clients deal with this, some don't
  • See API and client documentation
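
When a client does not handle paging and rate limits itself, the usual pattern is a loop that requests one page at a time and pauses between requests. A minimal sketch, in which `fetch_page` is a hypothetical stand-in for a real API call and the page size, total, and delay are arbitrary:

```r
# Hypothetical stand-in for a real API call: returns one "page" of results,
# or an empty vector once the data is exhausted
fetch_page = function(page, page_size = 10, total = 25) {
  start = (page - 1) * page_size + 1
  if (start > total) return(integer(0))
  start:min(start + page_size - 1, total)
}

# Keep requesting pages until an empty page comes back,
# pausing between requests to stay under the rate limit
fetch_all = function(delay = 0.1, max_pages = 100) {
  results = list()
  for (page in seq_len(max_pages)) {
    res = fetch_page(page)
    if (length(res) == 0) break
    results[[page]] = res
    Sys.sleep(delay)
  }
  unlist(results)
}

all_results = fetch_all()
length(all_results)  # 25
```

The appropriate delay and page size depend on the API, so check its documentation before settling on values.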

Directly accessing APIs

  • Make HTTP requests directly from R
    • package httr (or RCurl)
  • Can access any web data source
  • Need to figure out authentication, structure, etc.

Directly accessing APIs

library(httr)
domain = 'https://api.nytimes.com'
path = 'svc/search/v2/articlesearch.json'
url = paste(domain, path, sep='/')
query = list(`api-key`=key, q="clinton")   # key: your NYT API key
r = GET(url, query=query)
status_code(r)
result = content(r)
result$response$docs[[1]]$headline

Hands-on 1

Handouts: