First, you need to have quanteda installed. You can do this from inside RStudio, from the Tools…Install Packages menu, or simply by running:
install.packages("quanteda", dependencies = TRUE)
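If you would rather not reinstall a package that is already present, the installation step can be guarded by a check first. The following is a minimal sketch in base R; `missing_packages` is a hypothetical helper name introduced here for illustration and is not part of quanteda.

```r
# Return the subset of `pkgs` that cannot be loaded, i.e. is not installed.
# `missing_packages` is a hypothetical helper, not part of quanteda.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("quanteda"))
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # Uncomment the next line to actually install what is missing:
  # install.packages(to_install, dependencies = TRUE)
}
```

The install call is left commented out so that running the sketch only reports what is missing.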
(Optional) You can install some additional corpus data from quantedaData using:
## the devtools package is required to install quantedaData from GitHub
devtools::install_github("kbenoit/quantedaData")
Run the rest of this file to test your setup. You must have quanteda installed in order for this next step to succeed.
require(quanteda)
## Loading required package: quanteda
## quanteda version 0.9.8.5
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:base':
##
## sample
Now summarize some texts in the Irish 2010 budget speech corpus:
summary(ie2010Corpus)
## Corpus consisting of 14 documents.
##
## Text                                  Types Tokens Sentences year debate
## 2010_BUDGET_01_Brian_Lenihan_FF        1949   8733       374 2010 BUDGET
## 2010_BUDGET_02_Richard_Bruton_FG       1042   4478       217 2010 BUDGET
## 2010_BUDGET_03_Joan_Burton_LAB         1621   6429       307 2010 BUDGET
## 2010_BUDGET_04_Arthur_Morgan_SF        1589   7185       343 2010 BUDGET
## 2010_BUDGET_05_Brian_Cowen_FF          1618   6697       250 2010 BUDGET
## 2010_BUDGET_06_Enda_Kenny_FG           1151   4254       153 2010 BUDGET
## 2010_BUDGET_07_Kieran_ODonnell_FG       681   2309       133 2010 BUDGET
## 2010_BUDGET_08_Eamon_Gilmore_LAB       1183   4217       201 2010 BUDGET
## 2010_BUDGET_09_Michael_Higgins_LAB      490   1288        44 2010 BUDGET
## 2010_BUDGET_10_Ruairi_Quinn_LAB         442   1290        59 2010 BUDGET
## 2010_BUDGET_11_John_Gormley_Green       404   1036        49 2010 BUDGET
## 2010_BUDGET_12_Eamon_Ryan_Green         512   1651        90 2010 BUDGET
## 2010_BUDGET_13_Ciaran_Cuffe_Green       444   1248        45 2010 BUDGET
## 2010_BUDGET_14_Caoimhghin_OCaolain_SF  1188   4094       176 2010 BUDGET
## number foren      name     party
## 01     Brian      Lenihan  FF
## 02     Richard    Bruton   FG
## 03     Joan       Burton   LAB
## 04     Arthur     Morgan   SF
## 05     Brian      Cowen    FF
## 06     Enda       Kenny    FG
## 07     Kieran     ODonnell FG
## 08     Eamon      Gilmore  LAB
## 09     Michael    Higgins  LAB
## 10     Ruairi     Quinn    LAB
## 11     John       Gormley  Green
## 12     Eamon      Ryan     Green
## 13     Ciaran     Cuffe    Green
## 14     Caoimhghin OCaolain SF
##
## Source: /home/paul/Dropbox/code/quantedaData/* on x86_64 by paul
## Created: Tue Sep 16 15:58:21 2014
## Notes:
Create a document-feature matrix from this corpus, removing stop words:
ieDfm <- dfm(ie2010Corpus, ignoredFeatures = c(stopwords("english"), "will"), stem = TRUE)
## Creating a dfm from a corpus ...
##
## ... lowercasing
##
## ... tokenizing
##
## ... indexing documents: 14 documents
##
## ... indexing features: 4,881 feature types
##
## ... removed 118 features, from 175 supplied (glob) feature types
## ... stemming features (English)
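To make the steps reported in that log more concrete (lowercasing, tokenizing, removing stop words, indexing features), here is a rough sketch of the same idea in base R on two invented sentences. It is not how quanteda implements `dfm()`: the tokenizer is a simple split on non-letters, the stop word list is a tiny stand-in for `stopwords("english")`, and stemming is left out.

```r
# Toy documents (invented for illustration; not from the budget corpus).
texts <- c(doc1 = "The budget will cut taxes",
           doc2 = "The government will cut the budget")

# A tiny, hypothetical stop word list standing in for stopwords("english").
stops <- c("the", "will")

# Lowercase, then split on runs of non-letters to get tokens.
tokens <- strsplit(tolower(texts), "[^a-z]+")

# Drop stop words and empty strings from each document.
tokens <- lapply(tokens, function(tok) tok[nzchar(tok) & !(tok %in% stops)])

# Index the features: the sorted set of distinct remaining tokens.
features <- sort(unique(unlist(tokens)))

# Count each feature per document: one row per document, one column per feature.
toy_dfm <- t(vapply(tokens,
                    function(tok) tabulate(match(tok, features), nbins = length(features)),
                    integer(length(features))))
rownames(toy_dfm) <- names(texts)
colnames(toy_dfm) <- features

print(toy_dfm)
```

Each cell holds how often a feature occurs in a document, which is the structure the later steps work on.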
Look at the most frequently occurring features:
topfeatures(ieDfm)
##  budget   peopl  govern    year  minist     tax  public economi     cut
##     271     266     242     198     197     195     179     172     172
##     job
##     148
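Conceptually, a ranking like this comes from totalling each feature's count across all documents and sorting the totals. The sketch below shows that idea in base R on a small invented count matrix; it is an illustration under that assumption, not quanteda's own implementation of `topfeatures()`.

```r
# Toy count matrix (invented numbers): rows are documents, columns are features.
counts <- matrix(c(3, 0, 2,
                   1, 5, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), c("budget", "tax", "job")))

# Total each feature over all documents, then sort from most to least frequent.
totals <- sort(colSums(counts), decreasing = TRUE)

# Keep the n most frequent features.
n <- 2
print(head(totals, n))
```

On a real document-feature matrix the same logic applies, only with thousands of columns instead of three.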
Make a word cloud:
plot(ieDfm, min.freq=25, random.order=FALSE)
If you got this far, congratulations!