Preliminaries: Installation

First, install quanteda. You can do this from inside RStudio via the Tools > Install Packages… menu, or simply by running

install.packages("quanteda")

(Optional) You can install some additional corpus data from the quanteda.corpora package using

## the devtools package is required to install quanteda.corpora from GitHub
devtools::install_github("quanteda/quanteda.corpora")

If you are feeling adventurous, you can install the latest build of quanteda from its GitHub code page.
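
A minimal sketch of that GitHub install, assuming the development version lives in the quanteda/quanteda repository and that devtools is used as for the corpus data above:

```r
# Install devtools first if it is missing (it is needed for install_github()).
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Assumption: "quanteda/quanteda" is the GitHub repository of the development version.
devtools::install_github("quanteda/quanteda")
```

This builds the package from source, so it needs a working compiler toolchain.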

Note that installing from GitHub compiles the package from source, so on Windows it is also recommended that you install Rtools, and on macOS that you install Xcode from the App Store.

Test your setup

Run the rest of this file to test your setup. You must have quanteda installed in order for this next step to succeed.
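
One way to confirm that the package is present before loading it is sketched below. The helper name `is_installed` is hypothetical and not part of quanteda; it only wraps base R's `requireNamespace()`.

```r
# Hypothetical helper: TRUE if a package can be loaded, without attaching it.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Print a hint instead of failing later if quanteda is missing.
if (!is_installed("quanteda")) {
  message("quanteda is not installed; run install.packages(\"quanteda\") first.")
}
```

If no message appears, the steps that follow should be able to load the package.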

require(quanteda)
## Loading required package: quanteda
## Package version: 1.1.4
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

Now summarize some texts in the Irish 2010 budget speech corpus:

summary(data_corpus_irishbudget2010)
## Corpus consisting of 14 documents:
## 
##                                   Text Types Tokens Sentences year debate
##        2010_BUDGET_01_Brian_Lenihan_FF  1953   8641       374 2010 BUDGET
##       2010_BUDGET_02_Richard_Bruton_FG  1040   4446       217 2010 BUDGET
##         2010_BUDGET_03_Joan_Burton_LAB  1624   6393       307 2010 BUDGET
##        2010_BUDGET_04_Arthur_Morgan_SF  1595   7107       343 2010 BUDGET
##          2010_BUDGET_05_Brian_Cowen_FF  1629   6599       250 2010 BUDGET
##           2010_BUDGET_06_Enda_Kenny_FG  1148   4232       153 2010 BUDGET
##      2010_BUDGET_07_Kieran_ODonnell_FG   678   2297       133 2010 BUDGET
##       2010_BUDGET_08_Eamon_Gilmore_LAB  1181   4177       201 2010 BUDGET
##     2010_BUDGET_09_Michael_Higgins_LAB   488   1286        44 2010 BUDGET
##        2010_BUDGET_10_Ruairi_Quinn_LAB   439   1284        59 2010 BUDGET
##      2010_BUDGET_11_John_Gormley_Green   401   1030        49 2010 BUDGET
##        2010_BUDGET_12_Eamon_Ryan_Green   510   1643        90 2010 BUDGET
##      2010_BUDGET_13_Ciaran_Cuffe_Green   442   1240        45 2010 BUDGET
##  2010_BUDGET_14_Caoimhghin_OCaolain_SF  1188   4044       176 2010 BUDGET
##  number      foren     name party
##      01      Brian  Lenihan    FF
##      02    Richard   Bruton    FG
##      03       Joan   Burton   LAB
##      04     Arthur   Morgan    SF
##      05      Brian    Cowen    FF
##      06       Enda    Kenny    FG
##      07     Kieran ODonnell    FG
##      08      Eamon  Gilmore   LAB
##      09    Michael  Higgins   LAB
##      10     Ruairi    Quinn   LAB
##      11       John  Gormley Green
##      12      Eamon     Ryan Green
##      13     Ciaran    Cuffe Green
##      14 Caoimhghin OCaolain    SF
## 
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/* on x86_64 by kbenoit
## Created: Wed Jun 28 22:04:18 2017
## Notes:

Create a document-feature matrix from this corpus, removing English stop words (plus the word "will") and stemming the remaining terms:

ieDfm <- dfm(data_corpus_irishbudget2010, remove = c(stopwords("english"), "will"), 
             stem = TRUE)

Look at the most frequently occurring features:

topfeatures(ieDfm)
##      .      ,      €  peopl budget govern minist   year    tax public 
##   2371   1548    336    273    272    271    204    201    195    179
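
The two most frequent features above are punctuation marks. A possible variation, sketched here on the assumption that quanteda 1.x is installed as in the output above (later releases may differ), passes `remove_punct = TRUE` so that punctuation is dropped during tokenization. The object name `ieDfmNoPunct` is only illustrative.

```r
library(quanteda)

# Same call as above, but also dropping punctuation tokens.
# Assumes the quanteda 1.x dfm() interface and the bundled
# data_corpus_irishbudget2010 corpus.
ieDfmNoPunct <- dfm(data_corpus_irishbudget2010,
                    remove = c(stopwords("english"), "will"),
                    stem = TRUE,
                    remove_punct = TRUE)

# Show the 20 most frequent features instead of the default 10.
topfeatures(ieDfmNoPunct, n = 20)
```

The currency symbol may still appear, since it is treated as a symbol rather than as punctuation.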

Make a word cloud:

textplot_wordcloud(ieDfm, min_count = 25, random_order = FALSE)

If you got this far, congratulations!