Preliminaries: Installation

First, install quanteda. You can do this from inside RStudio via the Tools > Install Packages... menu, or by running:

install.packages("quanteda", dependencies = TRUE)
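If you rerun this file often, you may prefer to install only when the package is actually missing. The following is a minimal sketch in base R, not part of quanteda; the helper name `missing_packages` is hypothetical, and the `interactive()` guard is an assumption made here so that nothing is downloaded when the file is run as a batch script.

```r
# Return the subset of `pkgs` that cannot be loaded in this R session.
# `missing_packages` is a hypothetical helper name, not a quanteda function.
missing_packages <- function(pkgs) {
  is_available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_available]
}

to_install <- missing_packages("quanteda")

# Only download when a person is at the console (an assumption of this sketch).
if (interactive() && length(to_install) > 0) {
  install.packages(to_install, dependencies = TRUE)
}
```

Calling `missing_packages()` with several names reports every package that still needs installing in one pass.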

(Optional) You can install some additional corpus data from quantedaData using

## the devtools package is required to install quantedaData from GitHub
devtools::install_github("kbenoit/quantedaData")

Test your setup

Run the rest of this file to test your setup. quanteda must be installed for the next step to succeed.

require(quanteda)
## Loading required package: quanteda
## quanteda version 0.9.8.5
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:base':
## 
##     sample
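The output in this file was produced with quanteda 0.9.8.5, and object or argument names may differ in other releases, so it can help to check which version you have before comparing results. A small sketch using only base R and utils follows; `installed_version` is a hypothetical helper name introduced for illustration.

```r
# Return the installed version of `pkg` as a string, or NA if it is missing.
# `installed_version` is a hypothetical helper name used only for this sketch.
installed_version <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(NA_character_)
  }
  as.character(utils::packageVersion(pkg))
}

quanteda_version <- installed_version("quanteda")
if (is.na(quanteda_version)) {
  message("quanteda is not installed")
} else {
  message("quanteda version ", quanteda_version)
}
```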

Now summarize some texts in the Irish 2010 budget speech corpus:

summary(ie2010Corpus)
## Corpus consisting of 14 documents.
## 
##                                   Text Types Tokens Sentences year debate
##        2010_BUDGET_01_Brian_Lenihan_FF  1949   8733       374 2010 BUDGET
##       2010_BUDGET_02_Richard_Bruton_FG  1042   4478       217 2010 BUDGET
##         2010_BUDGET_03_Joan_Burton_LAB  1621   6429       307 2010 BUDGET
##        2010_BUDGET_04_Arthur_Morgan_SF  1589   7185       343 2010 BUDGET
##          2010_BUDGET_05_Brian_Cowen_FF  1618   6697       250 2010 BUDGET
##           2010_BUDGET_06_Enda_Kenny_FG  1151   4254       153 2010 BUDGET
##      2010_BUDGET_07_Kieran_ODonnell_FG   681   2309       133 2010 BUDGET
##       2010_BUDGET_08_Eamon_Gilmore_LAB  1183   4217       201 2010 BUDGET
##     2010_BUDGET_09_Michael_Higgins_LAB   490   1288        44 2010 BUDGET
##        2010_BUDGET_10_Ruairi_Quinn_LAB   442   1290        59 2010 BUDGET
##      2010_BUDGET_11_John_Gormley_Green   404   1036        49 2010 BUDGET
##        2010_BUDGET_12_Eamon_Ryan_Green   512   1651        90 2010 BUDGET
##      2010_BUDGET_13_Ciaran_Cuffe_Green   444   1248        45 2010 BUDGET
##  2010_BUDGET_14_Caoimhghin_OCaolain_SF  1188   4094       176 2010 BUDGET
##  number      foren     name party
##      01      Brian  Lenihan    FF
##      02    Richard   Bruton    FG
##      03       Joan   Burton   LAB
##      04     Arthur   Morgan    SF
##      05      Brian    Cowen    FF
##      06       Enda    Kenny    FG
##      07     Kieran ODonnell    FG
##      08      Eamon  Gilmore   LAB
##      09    Michael  Higgins   LAB
##      10     Ruairi    Quinn   LAB
##      11       John  Gormley Green
##      12      Eamon     Ryan Green
##      13     Ciaran    Cuffe Green
##      14 Caoimhghin OCaolain    SF
## 
## Source:  /home/paul/Dropbox/code/quantedaData/* on x86_64 by paul
## Created: Tue Sep 16 15:58:21 2014
## Notes:
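To make the Types, Tokens and Sentences columns concrete, here is a rough base R sketch on a made-up sentence pair. It assumes a naive whitespace tokenizer with punctuation stripped, which is not quanteda's tokenizer, so counts computed this way will not match the summary above exactly.

```r
# Toy text invented for illustration; not taken from the corpus.
toy_text <- "The budget is hard. The budget is fair."

# Naive tokenizer (assumption): drop punctuation, lowercase, split on whitespace.
tokens <- strsplit(tolower(gsub("[[:punct:]]", "", toy_text)), "\\s+")[[1]]

n_tokens <- length(tokens)           # every word occurrence counts
n_types  <- length(unique(tokens))   # each distinct word counts once

# Crude sentence count (assumption): runs of sentence-ending punctuation.
n_sentences <- lengths(regmatches(toy_text, gregexpr("[.!?]+", toy_text)))

cat("tokens:", n_tokens, "types:", n_types, "sentences:", n_sentences, "\n")
```

Here "the", "budget" and "is" each occur twice, so the text has more tokens than types.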

Create a document-feature matrix from this corpus, removing English stop words (plus "will") and stemming the remaining features:

ieDfm <- dfm(ie2010Corpus, ignoredFeatures = c(stopwords("english"), "will"), stem = TRUE)
## Creating a dfm from a corpus ...
## 
##    ... lowercasing
## 
##    ... tokenizing
## 
##    ... indexing documents: 14 documents
## 
##    ... indexing features:
## 4,881 feature types
## 
## ...
## removed 118 features, from 175 supplied (glob) feature types
## ... stemming features (English), trimmed 1510 feature variants
## 
##    ... created a 14 x 3253 sparse dfm
## 
##    ... complete.
## 
## Elapsed time: 0.346 seconds.
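For intuition about what a document-feature matrix holds, the sketch below builds a tiny dense one by hand in base R. The two documents, the two-word stop list and the whitespace tokenizer are all assumptions made for illustration; quanteda's dfm() uses its own tokenizer, a much longer stop word list, stemming, and a sparse matrix, none of which are reproduced here.

```r
# Two made-up documents (not from the corpus).
docs <- c(doc1 = "The budget cuts jobs",
          doc2 = "The budget protects jobs and the economy")

# Illustrative stop list; far shorter than stopwords("english").
toy_stopwords <- c("the", "and")

# Lowercase and split on whitespace, then drop the stop words.
tokens <- lapply(docs, function(x) strsplit(tolower(x), "\\s+")[[1]])
tokens <- lapply(tokens, function(tok) tok[!tok %in% toy_stopwords])

# The feature set is every distinct remaining word across all documents.
features <- sort(unique(unlist(tokens)))

# Count each feature per document: rows are documents, columns are features.
count_features <- function(tok) as.integer(table(factor(tok, levels = features)))
toy_dfm <- t(vapply(tokens, count_features, integer(length(features))))
colnames(toy_dfm) <- features

print(toy_dfm)
```

Each cell is the number of times a feature occurs in a document, which is the same layout as the 14 x 3253 matrix reported above, only small enough to read.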

Look at the most frequently occurring features:

topfeatures(ieDfm)
##  budget   peopl  govern    year  minist     tax  public economi     cut 
##     271     266     242     198     197     195     179     172     172 
##     job 
##     148
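Conceptually, a ranking like this amounts to summing each feature's counts over all documents and sorting. The sketch below shows that idea in base R on a small made-up count matrix; it is an illustration of the computation, not a claim about how topfeatures() is implemented.

```r
# Made-up counts: rows are documents, columns are features.
toy_counts <- matrix(c(3L, 1L, 0L,
                       2L, 0L, 4L),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("doc1", "doc2"),
                                     c("budget", "tax", "job")))

# Total occurrences of each feature across all documents.
feature_totals <- colSums(toy_counts)

# Keep the two most frequent features, highest first.
top_two <- head(sort(feature_totals, decreasing = TRUE), 2)
print(top_two)
```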

Make a word cloud:

plot(ieDfm, min.freq = 25, random.order = FALSE)

If you got this far, congratulations!