rvest and RSelenium
Web scraping is the metaphor used for the practice of getting data that weren’t designed to be programmatically consumed off the web (Dale 2016). Almost any repetitive web structure or pattern that you see in your browser can be scraped and turned into scientific data.
To do scraping, you need to be able to inspect web elements. In Firefox or Chrome you can hit Ctrl + Shift + C to invoke the element inspector. It looks like this:
The red arrow shows the button which allows you to identify the elements on any web page using locator schemes. The most important locator schemes are CSS selectors and XPaths.
You can right-click anywhere in the code, choose Copy and select CSS Selector. Now it’s in your clipboard, ready for R. In Google Chrome you can also copy the XPaths.
Alternatively, check out the SelectorGadget extension for Chrome.
An overview of what R can do with the web is on the web technologies CRAN Task View.
Two important web-scraping R packages are rvest and RSelenium. It also helps to know magrittr, lubridate or plyr. The main element locator schemes are CSS selectors and XPaths. For even more heavyweight scraping, look at Python’s package Scrapy.
rvest
library(rvest)
library(magrittr)
Note: It pays off to understand the pipe operator %>% when working with rvest.
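If the pipe is new to you, here is a minimal sketch of what it does:

```r
library(magrittr)

# The pipe passes its left-hand side as the first argument of the
# function on its right, so these two expressions are equivalent:
round(sqrt(10), digits = 2)          # 3.16
10 %>% sqrt() %>% round(digits = 2)  # 3.16
```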
Here we will use rvest to download a table with numbers of Nobel laureates in different countries from Wikipedia.
First, I’ll use the element locator in my browser (Ctrl + Shift + C) to get the CSS selector of the table. Then I’ll download the entire web page to R, parse it to XML using read_html, extract the table node using html_node, and convert it to a data.frame using html_table:
nobel.table <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_Nobel_laureates_per_capita") %>%
  html_node(css = "#mw-content-text > div > table:nth-child(9)") %>%
  html_table()
Check the table:
head(nobel.table)
## Rank Country Nobel\nlaureates[1] Population\n(2015)[2]
## 1 — Faroe Islands 1 48,199
## 2 — Saint Lucia 2 184,999
## 3 — Luxembourg 2 567,110
## 4 1 Sweden 31 9,779,426
## 5 2 Switzerland 26 8,298,663
## 6 3 Iceland 1 329,425
## Laureates/\n10 million
## 1 207.473
## 2 108.109
## 3 35.267
## 4 31.700
## 5 31.332
## 6 30.356
We may still need some string operations (gsub, grep, strsplit, …) to clean the data.
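For instance, the population column comes back as character strings with thousands separators; a sketch of a typical fix (the values here are copied from the output above):

```r
# Strip the thousands separators with gsub, then convert to numeric
pop <- c("48,199", "184,999", "9,779,426")
as.numeric(gsub(",", "", pop))
# [1]   48199  184999 9779426
```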
A data.frame from a web structure that isn’t a table
Here we will scrape a web structure that is not a table, but looks sufficiently regular (‘pattern-ish’) to be convertible to a table. It is a simple list of Nobel laureates by country.
We’ll need some more sophisticated work with the CSS selectors. Check out the element inspector in your browser first (Ctrl + Shift + C), and look for some common properties of the headers and the text.
First, download and parse the page with read_html, extract the country names with html_nodes using the CSS selector, and then convert the XML to text using html_text.
countries <- read_html("https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country") %>%
  html_nodes(css = "h2 > span:first-child") %>%
  html_text()
countries <- countries[2:(length(countries)-2)] # delete some non-country headings
countries <- append(countries, "Tibet", after=56) # we need to accommodate the 14th Dalai Lama
countries
## [1] "Argentina" "Australia"
## [3] "Austria" "Bangladesh"
## [5] "Belarus" "Belgium"
## [7] "Bosnia and Herzegovina" "Bulgaria"
## [9] "Canada" "Chile"
## [11] "China" "Colombia"
## [13] "Costa Rica" "Croatia"
## [15] "Czech Republic" "Denmark"
## [17] "East Timor" "Egypt"
## [19] "Faroe Islands" "Finland"
## [21] "France" "Germany"
## [23] "Ghana" "Greece"
## [25] "Guatemala" "Hong Kong"
## [27] "Hungary" "Iceland"
## [29] "India" "Iran"
## [31] "Ireland" "Israel"
## [33] "Italy" "Japan"
## [35] "Kenya" "Liberia"
## [37] "Lithuania" "Luxembourg"
## [39] "Mexico" "Myanmar (Burma)"
## [41] "Netherlands" "Nigeria"
## [43] "Norway" "Pakistan"
## [45] "Palestine" "Peru"
## [47] "Poland" "Portugal"
## [49] "Romania" "Russia and Soviet Union"
## [51] "Saint Lucia" "Slovenia"
## [53] "South Africa" "South Korea"
## [55] "Spain" "Sweden"
## [57] "Tibet" "Switzerland"
## [59] "Trinidad and Tobago" "Tunisia"
## [61] "Turkey" "Ukraine"
## [63] "United Kingdom" "United States"
## [65] "Venezuela" "Vietnam"
## [67] "Yemen"
Second, get the lists of laureates using exactly the same approach, but a different CSS selector:
laureates <- read_html("https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country") %>%
  html_nodes(css = "h2 ~ ol") %>%
  html_text() %>%
  strsplit(split="\n")
Finally, put the laureates and the countries together:
names(laureates) <- countries
laureates <- stack(laureates)
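As an aside, stack() concatenates the vectors of a named list into a single ‘values’ column and repeats the list names in an ‘ind’ column; a toy sketch:

```r
# A toy named list standing in for the laureates list above
x <- list(Argentina = c("Milstein", "Esquivel"), Chile = "Neruda")
stack(x)
#     values       ind
# 1 Milstein Argentina
# 2 Esquivel Argentina
# 3   Neruda     Chile
```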
Let’s check the resulting data.frame:
head(laureates)
## values ind
## 1 César Milstein, Physiology or Medicine, 1984 Argentina
## 2 Adolfo Pérez Esquivel, Peace, 1980 Argentina
## 3 Luis Federico Leloir, Chemistry, 1970 Argentina
## 4 Bernardo Houssay, Physiology or Medicine, 1947 Argentina
## 5 Carlos Saavedra Lamas, Peace, 1936 Argentina
## 6 Brian Schmidt, born in the United States, Physics, 2011 Australia
As in the previous example, we may need some string operations (gsub, grep, strsplit, …) to further clean the data.
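For example, since most rows follow the “Name, Prize, Year” pattern seen above, strsplit can break them apart (rows with extra commas, such as birth places, would need special care):

```r
# Split one laureate string on the comma-space separator
s <- "César Milstein, Physiology or Medicine, 1984"
strsplit(s, split = ", ")[[1]]
# [1] "César Milstein"         "Physiology or Medicine" "1984"
```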
RSelenium
Interactive elements on the web can be simple HTML forms, and these can be scraped with rvest. However, often there are clickable buttons, interactive graphics and fill-in forms based on JavaScript, and for these we need some niftier tools.
With RSelenium
you can create an R object virutally from any element of a webpage, and you can emulate actions such as mouse clicks, you can fill-in forms, etc. – you essentially give R your browser, your mouse and your keyboard, and you specify what R should do with these tools.
There is a really nice and comprehensive tutorial. You can also open it by typing:
vignette("RSelenium-basics", package = "RSelenium")
RSelenium: Docker and Selenium Server
Installation of the R package is easy:
install.packages("RSelenium")
However, to make RSelenium work you may need to fiddle a bit, since it won’t run on its own: you need to install some additional stuff on your computer, namely a Selenium Server. The most reliable way to do this is to use something called a Docker container. For instructions on how to set up Docker and Selenium Server on your operating system, type:
vignette("RSelenium-docker", package = "RSelenium")
It took me about 2 hours to figure it out.
Also, before using RSelenium, you need to start the Selenium Server. I am on Ubuntu Linux, so I do it in the terminal:
sudo docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.0
sudo docker ps
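To verify from R that the server is up, you can open a connection to the mapped port (this assumes the container above is running on the same machine):

```r
library(RSelenium)

# Connect to the Selenium Server on the port mapped above (4445)
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L)
remDr$open(silent = TRUE)
remDr$getStatus()  # should report the server as ready
remDr$close()
```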
In this example we will interact with the GlobalTreeSearch form: we will fill in the genus and species fields, search the database, and retrieve the results, all from within R.
First, explore the page with the element inspector in your browser (Ctrl + Shift + C).
To interact with the page, I’ve written a simple function get.tree which takes the genus and species arguments (character strings) and returns the countries in which the species occurs. The comments should make the idea quite obvious:
get.tree <- function(genus, species)
{
  require(RSelenium)
  # open the remote driver
  remDr <- remoteDriver(port = 4445L)
  remDr$open(silent = TRUE)

  # go to the webpage
  remDr$navigate("http://www.bgci.org/global_tree_search.php?sec=globaltreesearch")
  remDr$refresh() # refresh the page

  # create R objects from the website elements
  genusElem <- remDr$findElement(using = 'id', value = "genus-field")
  specElem <- remDr$findElement(using = 'id', value = "species-field")
  buttElem <- remDr$findElement(using = 'class', value = "btn_ohoDO")

  # fill in the forms with the genus and species names
  genusElem$sendKeysToElement(list(genus))
  specElem$sendKeysToElement(list(species))

  # click the search button
  buttElem$clickElement()

  # get the output
  out <- remDr$findElement(using = "css", value = "td.cell_1O3UaG:nth-child(4)")
  out <- out$getElementText()[[1]]        # extract the actual text string
  out <- strsplit(out, split = "; ")[[1]] # split the text into a character vector

  # close the remote driver
  remDr$close()
  return(out)
}
Let’s try it out:
get.tree("Abies","alba")
## Loading required package: RSelenium
## [1] "Albania"
## [2] "Andorra"
## [3] "Austria"
## [4] "Bulgaria"
## [5] "Croatia"
## [6] "Czech Republic"
## [7] "France"
## [8] "Germany"
## [9] "Greece"
## [10] "Hungary"
## [11] "Italy"
## [12] "Macedonia, the former Yugoslav Republic of"
## [13] "Montenegro"
## [14] "Poland"
## [15] "Romania"
## [16] "Serbia"
## [17] "Slovakia"
## [18] "Slovenia"
## [19] "Spain"
## [20] "Switzerland"
## [21] "Ukraine"
Chrome developer tools: there is more to them than just the Elements tab. Have a look at Network for live traffic monitoring and at the Console for direct interaction (JavaScript). Resource: http://discover-devtools.codeschool.com
tcpdump and Wireshark can inspect the actual network traffic (also outside of the browser).
wget and curl are good for fetching directly accessible data. They have switches for authentication and for the user agent (a string that the client supplies to the server to communicate its capabilities).
Knowledge of regular expressions also comes in handy.
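For example, a regular expression can pull the prize year out of one of the laureate strings scraped earlier:

```r
s <- "Carlos Saavedra Lamas, Peace, 1936"
# Capture the trailing four digits and keep only them
sub(".*([0-9]{4})$", "\\1", s)  # "1936"
grepl("Peace", s)               # TRUE
```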
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RSelenium_1.7.1 magrittr_1.5 rvest_0.3.2 xml2_1.1.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.12 knitr_1.17 R6_2.2.2 stringr_1.2.0
## [5] httr_1.3.1 caTools_1.17.1 tools_3.4.1 binman_0.1.0
## [9] semver_0.2.0 selectr_0.3-1 htmltools_0.3.6 assertthat_0.2.0
## [13] openssl_0.9.6 yaml_2.1.14 rprojroot_1.2 digest_0.6.12
## [17] bitops_1.0-6 curl_2.8.1 evaluate_0.10.1 wdman_0.2.2
## [21] rmarkdown_1.5 stringi_1.1.5 compiler_3.4.1 backports_1.0.5
## [25] XML_3.98-1.7 jsonlite_1.5