1 Web scraping?

Web scraping is the metaphor used for the practice of extracting data off the web that weren’t designed to be programmatically consumed (Dale 2016). Almost any repetitive web structure or pattern that you can see in your browser can be scraped and turned into scientific data.

2 Inspecting web elements

To do scraping, you need to be able to inspect web elements. In Firefox or Chrome you can hit Ctrl + Shift + C to invoke the element inspector.

The inspector’s picker button lets you identify the elements on any web page using locator schemes. The most important locator schemes are:

- CSS selectors
- XPath expressions

You can right-click any element in the inspector’s code view, then choose Copy and select CSS Selector. Now the selector is in your clipboard, ready for R. In Google Chrome you can also copy XPath expressions this way.

Alternatively, check out the SelectorGadget extension for Chrome.
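For example, here is how the two locator schemes look side by side, jumping ahead to the rvest package introduced below; the wikitable class used here is an assumption about the page’s markup, not something the page is guaranteed to use:

library(rvest)

page <- read_html("https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country")
# the same set of tables, located two ways:
tabs.css   <- html_nodes(page, css   = "table.wikitable")
tabs.xpath <- html_nodes(page, xpath = "//table[contains(@class, 'wikitable')]")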


3 R packages for scraping

An overview of what R can do with the web is in the Web Technologies CRAN Task View.

Two important web-scraping R packages are:

- rvest for scraping static pages
- RSelenium for scraping pages with dynamic, JavaScript-driven elements

For even more heavyweight scraping, look at Python’s Scrapy framework.


4 Simple scraping with rvest

library(rvest)
library(magrittr) 

Note: It pays to understand the pipe operator %>% when working with rvest.
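For illustration, the two calls below do the same thing; the pipe feeds the result on its left into the first argument of the function on its right. The URL is just a placeholder page with a table, and whether its first table parses cleanly is not guaranteed:

url <- "https://en.wikipedia.org/wiki/R_(programming_language)"

# nested function calls, read inside-out:
tab1 <- html_table(html_node(read_html(url), "table"))

# the same pipeline with %>%, read left to right:
tab2 <- read_html(url) %>% html_node("table") %>% html_table()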

4.1 Example: downloading a simple html table

Here we will use rvest to download a table with numbers of Nobel laureates in different countries from Wikipedia.

First, I’ll use the element locator in my browser (Ctrl + Shift + C) to get the CSS selector of the table. Then I’ll download the entire web page into R, parse it to XML using read_html, extract the table node using html_node, and convert it to a data.frame using html_table:

nobel.table <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_Nobel_laureates_per_capita") %>%
               html_node(css = "#mw-content-text > div > table:nth-child(9)") %>% 
               html_table()

Check the table:

head(nobel.table)
##   Rank       Country Nobel\nlaureates[1] Population\n(2015)[2]
## 1    — Faroe Islands                   1                48,199
## 2    —   Saint Lucia                   2               184,999
## 3    —    Luxembourg                   2               567,110
## 4    1        Sweden                  31             9,779,426
## 5    2   Switzerland                  26             8,298,663
## 6    3       Iceland                   1               329,425
##   Laureates/\n10 million
## 1                207.473
## 2                108.109
## 3                 35.267
## 4                 31.700
## 5                 31.332
## 6                 30.356
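As an aside, the nth-child selector copied from the browser is tied to the page’s current layout and tends to break when Wikipedia is edited. A sketch of a sturdier alternative, assuming the laureate table is the first one with the standard wikitable class:

nobel.table <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_Nobel_laureates_per_capita") %>%
               html_node(css = "table.wikitable") %>% # first matching table
               html_table()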

We may still need some string operations (gsub, grep, strsplit, …) to clean the data.
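For instance, a minimal sketch, assuming the five columns shown above, that gives the columns syntactic names and turns the comma-separated population figures into numbers:

names(nobel.table) <- c("rank", "country", "laureates", "population", "laureates.per.10M")
nobel.table$population <- as.numeric(gsub(",", "", nobel.table$population)) # "48,199" -> 48199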

4.2 Example: data.frame from a web structure that isn’t a table

Here we will scrape a web structure that is not a table, but looks sufficiently regular (‘pattern-ish’) to be convertible to a table. It is a simple list of Nobel laureates by country.

We’ll need some more sophisticated work with the CSS selectors. Check out the element inspector in your browser first (Ctrl + Shift + C), and look for some common properties of the headers and the text.

First, download and parse the page with read_html, extract the country names with html_nodes and a CSS selector, and then convert the XML nodes to text using html_text.

countries <- read_html("https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country") %>%
             html_nodes(css = "h2 > span:first-child") %>% 
             html_text() 

countries <- countries[2:(length(countries)-2)]   # delete some non-country headings
countries <- append(countries, "Tibet", after=56) # we need to accommodate the 14th Dalai Lama
countries
##  [1] "Argentina"               "Australia"              
##  [3] "Austria"                 "Bangladesh"             
##  [5] "Belarus"                 "Belgium"                
##  [7] "Bosnia and Herzegovina"  "Bulgaria"               
##  [9] "Canada"                  "Chile"                  
## [11] "China"                   "Colombia"               
## [13] "Costa Rica"              "Croatia"                
## [15] "Czech Republic"          "Denmark"                
## [17] "East Timor"              "Egypt"                  
## [19] "Faroe Islands"           "Finland"                
## [21] "France"                  "Germany"                
## [23] "Ghana"                   "Greece"                 
## [25] "Guatemala"               "Hong Kong"              
## [27] "Hungary"                 "Iceland"                
## [29] "India"                   "Iran"                   
## [31] "Ireland"                 "Israel"                 
## [33] "Italy"                   "Japan"                  
## [35] "Kenya"                   "Liberia"                
## [37] "Lithuania"               "Luxembourg"             
## [39] "Mexico"                  "Myanmar (Burma)"        
## [41] "Netherlands"             "Nigeria"                
## [43] "Norway"                  "Pakistan"               
## [45] "Palestine"               "Peru"                   
## [47] "Poland"                  "Portugal"               
## [49] "Romania"                 "Russia and Soviet Union"
## [51] "Saint Lucia"             "Slovenia"               
## [53] "South Africa"            "South Korea"            
## [55] "Spain"                   "Sweden"                 
## [57] "Tibet"                   "Switzerland"            
## [59] "Trinidad and Tobago"     "Tunisia"                
## [61] "Turkey"                  "Ukraine"                
## [63] "United Kingdom"          "United States"          
## [65] "Venezuela"               "Vietnam"                
## [67] "Yemen"

Second, get the lists of laureates using exactly the same approach, but a different CSS selector:

laureates <- read_html("https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country") %>%
             html_nodes(css = "h2 ~ ol") %>% 
             html_text() %>% 
             strsplit(split="\n")

Finally, put the laureates and the countries together:

names(laureates) <- countries # name each list element by its country
laureates <- stack(laureates) # convert the named list to a two-column data.frame

Let’s check the resulting data.frame:

head(laureates)
##                                                    values       ind
## 1            César Milstein, Physiology or Medicine, 1984 Argentina
## 2                      Adolfo Pérez Esquivel, Peace, 1980 Argentina
## 3                   Luis Federico Leloir, Chemistry, 1970 Argentina
## 4          Bernardo Houssay, Physiology or Medicine, 1947 Argentina
## 5                      Carlos Saavedra Lamas, Peace, 1936 Argentina
## 6 Brian Schmidt, born in the United States, Physics, 2011 Australia

As in the previous example, we may need some string operations (gsub, grep, strsplit, …) to further clean the data.
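For example, a sketch that renames the columns and pulls the prize year out of each string; anchoring the regular expression at the end of the string keeps entries with extra commas (such as birth-place notes) parsing correctly:

names(laureates) <- c("laureate", "country")
# the year is the last comma-separated field, e.g. "..., Peace, 1980"
laureates$year <- as.numeric(sub(".*,\\s*(\\d{4})$", "\\1", laureates$laureate))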


5 Advanced scraping with RSelenium

Interactive elements on the web can be simple HTML forms, and these can be scraped with rvest. However, there are often clickable buttons, interactive graphics and fill-in forms based on JavaScript, and for these we need some niftier tools.

With RSelenium you can create an R object from virtually any element of a web page, and you can emulate actions such as mouse clicks, filling in forms, etc. – you essentially give R your browser, your mouse and your keyboard, and you specify what R should do with these tools.

There is a really nice and comprehensive tutorial vignette, which you can open from R by typing:

vignette("RSelenium-basics", package = "RSelenium")

5.1 Installation of RSelenium: Docker and Selenium Server

Installation of the R package is easy:

install.packages("RSelenium")

However, to make RSelenium work you may need to fiddle a bit, since it won’t run on its own – you need to install and run a Selenium Server. The most reliable way to do this is to use a Docker container. For instructions on how to set up Docker and the Selenium Server on your operating system, type:

vignette("RSelenium-docker", package = "RSelenium")

It took me about 2 hours to figure it out.

Also, before using RSelenium, you need to start the Selenium Server. I am on Ubuntu Linux, so I do it in the terminal:

sudo docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.0 # start the server; host port 4445 maps to the container's 4444
sudo docker ps # check that the container is running
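Once the container is up, you can check the connection from R; a minimal sketch, assuming the server listens on localhost:4445 as set up above:

library(RSelenium)
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L)
remDr$open(silent = TRUE) # start a browser session on the server
remDr$getStatus()         # returns a list with details about the server
remDr$close()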

5.2 Example: Scraping biodiversity data from a page with interactive web elements

In this example we will interact with the GlobalTreeSearch form: we will fill in the genus and species fields, search the database, and retrieve the results, all from within R.

First, explore the page with the element inspector in your browser (Ctrl + Shift + C).

To interact with the page, I’ve written a simple function get.tree which takes genus and species arguments (character strings) and returns the countries in which the species occurs – the comments should make the idea quite obvious:

get.tree <- function(genus, species)
{
  require(RSelenium)
  
  # open the remote driver
  remDr <- remoteDriver(port = 4445L)
  remDr$open(silent = TRUE)

  # go to the webpage
  remDr$navigate("http://www.bgci.org/global_tree_search.php?sec=globaltreesearch")
  remDr$refresh() # refresh the page
  
  # create R objects from the website elements
  genusElem <- remDr$findElement(using = 'id', value = "genus-field")
  specElem <- remDr$findElement(using = 'id', value = "species-field")
  buttElem <- remDr$findElement(using = 'class', value = "btn_ohoDO")
  
  # fill in the forms with the genus and species names
  genusElem$sendKeysToElement(list(genus))
  specElem$sendKeysToElement(list(species))
  
  # click the search button
  buttElem$clickElement()
  
  # get the output
  out <- remDr$findElement(using = "css", value="td.cell_1O3UaG:nth-child(4)")
  out <- out$getElementText()[[1]] # extract the actual text string
  out <- strsplit(out, split="; ")[[1]] # split the text to a character vector
  
  # close the remote driver
  remDr$close()
  
  return(out)  
}

Let’s try it out:

get.tree("Abies","alba")
## Loading required package: RSelenium
##  [1] "Albania"                                   
##  [2] "Andorra"                                   
##  [3] "Austria"                                   
##  [4] "Bulgaria"                                  
##  [5] "Croatia"                                   
##  [6] "Czech Republic"                            
##  [7] "France"                                    
##  [8] "Germany"                                   
##  [9] "Greece"                                    
## [10] "Hungary"                                   
## [11] "Italy"                                     
## [12] "Macedonia, the former Yugoslav Republic of"
## [13] "Montenegro"                                
## [14] "Poland"                                    
## [15] "Romania"                                   
## [16] "Serbia"                                    
## [17] "Slovakia"                                  
## [18] "Slovenia"                                  
## [19] "Spain"                                     
## [20] "Switzerland"                               
## [21] "Ukraine"
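Since get.tree opens and closes its own connection, it can be called repeatedly, for example in a loop over several species (the species names here are just illustrative):

species <- list(c("Fagus", "sylvatica"),
                c("Quercus", "robur"))
ranges <- lapply(species, function(x) get.tree(x[1], x[2]))
names(ranges) <- sapply(species, paste, collapse = " ")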

6 Other ideas, notes, useful stuff


7 Session information

sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RSelenium_1.7.1 magrittr_1.5    rvest_0.3.2     xml2_1.1.1     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.12     knitr_1.17       R6_2.2.2         stringr_1.2.0   
##  [5] httr_1.3.1       caTools_1.17.1   tools_3.4.1      binman_0.1.0    
##  [9] semver_0.2.0     selectr_0.3-1    htmltools_0.3.6  assertthat_0.2.0
## [13] openssl_0.9.6    yaml_2.1.14      rprojroot_1.2    digest_0.6.12   
## [17] bitops_1.0-6     curl_2.8.1       evaluate_0.10.1  wdman_0.2.2     
## [21] rmarkdown_1.5    stringi_1.1.5    compiler_3.4.1   backports_1.0.5 
## [25] XML_3.98-1.7     jsonlite_1.5