Text Analysis in R

Wouter van Atteveldt
Session 1: Managing data in R

Motivational Example

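# Note: twitteR needs API credentials registered first via setup_twitter_oauth()
library(twitteR)
library(knitr)  # provides kable()
# Fetch the 100 most recent tweets mentioning #bigdata
tweets = searchTwitteR("#bigdata", resultType="recent", n = 100)
# Convert the list of status objects into a single data frame
tweets = plyr::ldply(tweets, as.data.frame)
kable(head(tweets[c("id", "created", "text")]))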
id                  created              text
737606276188753921  2016-05-31 11:26:54  #BigData : comment s'enrichir en partageant #tribune @LesEchos https://t.co/6kbaQmd40J
737606250024689666  2016-05-31 11:26:48  RT @jamesturner247: Is Big Data Taking Us Closer to the Deeper Questions in Artificial Intelligence? https://t.co/Z7ZsI1mzLB #ArtificialInt…
737606227358711809  2016-05-31 11:26:43  RT @jamesturner247: Big Data and the Cloud: Uncover New #Insights Hiding in Your Data https://t.co/NM9BNukkXX #BigData #DataScience #Health…
737606216243761152  2016-05-31 11:26:40  momentum in today’s #BigData #data #analytics landscape. https://t.co/poK5ksaOTO https://t.co/Mmzlf6vJS1
737606192755675141  2016-05-31 11:26:35  Heather Knight is speaking at #smartcon2016 in Istanbul Marilyn Monrobot - Kurucu, Robotist #bigdata #IoT #Startup https://t.co/xuHZR6vORY
737606191333793792  2016-05-31 11:26:34  RT @Informatica: At @strataconf, learn how to turn #bigdata into big value! https://t.co/T2Jvn3JRqh #StrataHadoop https://t.co/tqneHPVTzk

Motivational Example

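library(RTextTools)
library(corpustools)
# Build a document-term matrix: one row per tweet, one column per term
dtm = create_matrix(tweets$text)
# Draw a word cloud; sqrt dampens the size of the most frequent terms
dtm.wordcloud(dtm, freq.fun = sqrt)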

[Figure: word cloud of the terms in the #bigdata tweets]
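
To make the document-term matrix step less of a black box, here is a minimal base-R sketch of the idea behind it: one row per document, one column per term, and counts in the cells. The example texts and the helper name simple_dtm are made up for illustration and are not part of RTextTools or corpustools. create_matrix itself returns a sparse matrix and offers more preprocessing, so this is only an approximation of the concept, not its implementation.

```r
# Hypothetical helper (not from RTextTools/corpustools):
# build a dense document-term matrix from a character vector.
simple_dtm = function(texts) {
  # Lowercase, then split on anything that is not a letter, digit or '#'
  tokens = strsplit(tolower(texts), "[^a-z0-9#]+")
  # Drop empty strings left behind by leading separators
  tokens = lapply(tokens, function(x) x[x != ""])
  vocabulary = sort(unique(unlist(tokens)))
  # Count how often each vocabulary term occurs in each document
  counts = lapply(tokens, function(x)
    tabulate(match(x, vocabulary), nbins = length(vocabulary)))
  dtm = do.call(rbind, counts)
  rownames(dtm) = seq_along(texts)
  colnames(dtm) = vocabulary
  dtm
}

# Made-up example documents
texts = c("Big data is big", "Data science and #bigdata")
dtm = simple_dtm(texts)

# Overall term frequencies, most frequent first
term_freq = sort(colSums(dtm), decreasing = TRUE)
```

The column sums computed at the end are the term frequencies that a word cloud would use to scale its words.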

Course Overview

Thursday: Introduction to R

  • Intro & Organizing data
  • Transforming data
  • Accessing APIs from R

Friday: Corpus Analysis & Topic Modeling

Saturday: Machine Learning & Sentiment Analysis

Sunday: Semantic Networks & Grammatical Analysis

Introduction

  • Please introduce yourself
    • Background
    • What do you want to learn?
    • Experience with R / text / programming

Course Components

  • Each 3h session:
    • Lecture & Interactive sessions
      • Please interrupt me!
    • Break
    • Hands-on sessions
  • http://vanatteveldt.com
    • Slides, hand-outs, data

What is R?

  • Programming language
  • Statistics Toolkit
  • Open Source
  • Community driven
    • Packages/libraries
    • Including many text analysis libraries

Cathedral and Bazaar