Introduction to Data scraping with R

Scraping Data From the Web

Data on the web are often presented in tables. For instance, we can see a list of countries by population in 1900 on Wikipedia

Web pages are written in HTML (Hyper Text Markup Language) which uses tags to describe different aspects of document content. For example, a heading in a document is indicated by <h1>My Title</h1> whereas a paragraph would be indicated by <p>A paragraph of content...</p>.

In this tutorial, we will learn how to read data from a table on a web page into R. We will need the package rvest to get the data from the web page, and the stringr package to clean up the data.

library(rvest)
library(stringr)

1. Reading data into R with `rvest`

To get the population data on Wikipedia into R, we use the read_html command from the xml2 package (which is attached when rvest is called) to parse the page to obtain an HTML document.

We then use the html_nodes command that extracts all occurrences of the desired tag. We will be interested in scraping data presented in tables, so in the source code, we look for the table tag: <table> ... </table>.

Note: some of the rvest commands may be slow depending on your Internet connection and the complexity of the web page.

popParse <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_population_in_1900")

str(popParse)

## List of 2
##  $ node:<externalptr> 
##  $ doc :<externalptr> 
##  - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

The read_html command creates an R object, basically a list, that stores information about the web page.

To extract a table from this web page (this may take some time):

popNodes <- html_nodes(popParse, "table")
popNodes

## {xml_nodeset (4)}
## [1] <table class="plainlinks metadata ambox ambox-style ambox-More_footn ...
## [2] <table class="wikitable floatright"><tbody>\n<tr><th colspan="3" sty ...
## [3] <table class="wikitable" style="text-align:right"><tbody>\n<tr>\n<th ...
## [4] <table class="nowraplinks hlist collapsible autocollapse navbox-inne ...

There are several tables in this document. By inspecting the output of popNodes, we make a guess that we want the third table. (In other cases, trial and error may be required.) We select the third table by using double brackets:

pop <- html_table(popNodes, header = TRUE, fill = TRUE)[[3]]
str(pop)

## 'data.frame':    236 obs. of  4 variables:
##  $ Rank                           : chr  "—" "—" "1" "2" ...
##  $ Country/Territory              : chr  "World" "British Empire[a]" "Qing China[1]" "Indian Empire[b][1] (UK)" ...
##  $ Population c. 1900 estimate[1] : chr  "1,700,000,000" "436,259,084" "400,000,000" "330,567,600" ...
##  $ Percentage of
## World Population: chr  "-" "25.7%" "23.5%" "19.4%" ...

3. Cleaning the data frame

We now have a workable data frame that we can analyze. Notice that even though the first and third columns are numbers, they are classified as “character.” For Rank, that is because the first observation is the world population and it is not assigned a rank, but rather, the character “-”. The Population column is also a character because the numbers have commas in them, plus Morocco’s population is given as “8,000,000 [2][3]” to indicate some footnotes. We need to convert these columns to be numeric.

In the case of the Rank column, we will remove the world population.

We will also simplify the name of the third column to “Population.”

pop2 <- pop[-1, ]         #remove row 1
head(pop2)

ABCDEFGHIJ0123456789

	Rank <chr>	Country/Territory <chr>	Population c. 1900 estimate[1] <chr>
2	—	British Empire[a]	436,259,084
3	1	Qing China[1]	400,000,000
4	2	Indian Empire[b][1] (UK)	330,567,600
5	—	Russian Empire (including Finland and Poland)	136,305,900
6	3	Russian Empire (excluding Finland and Poland)	119,546,234
7	—	Russian Empire[1] (Russia proper only)	87,162,000

row.names(pop2) <- NULL    #reset row numbers to start at 1

pop2$Rank <- as.numeric(pop2$Rank) #coerce Rank to numeric

## Warning: NAs introduced by coercion

names(pop2)[3] <- "Population" #rename 3rd column

To remove the commas in the Population numbers, we will use str_replace_all from the stringr package.

pop2$Population <- str_replace_all(pop2$Population, ",", "")
head(pop2)

ABCDEFGHIJ0123456789

	Rank <dbl>	Country/Territory <chr>	Population <chr>
1	NA	British Empire[a]	436259084
2	1	Qing China[1]	400000000
3	2	Indian Empire[b][1] (UK)	330567600
4	NA	Russian Empire (including Finland and Poland)	136305900
5	3	Russian Empire (excluding Finland and Poland)	119546234
6	NA	Russian Empire[1] (Russia proper only)	87162000

We still have a problem with Morocco, country number 19. Since it is the only country that is a problem, we can fix this individually, but we will also show a more general way to do this in case you encounter more than one instance.

pop2$Population[19]

## [1] "17984000"

out <- pop2$Population
out[19] <- "8000000"
out <- as.numeric(out)

## Warning: NAs introduced by coercion

head(out, 20)

##  [1] 436259084 400000000 330567600 136305900 119546234  87162000  78790700
##  [8]  75994575  56367178  47295100  46605000  43847000  42746000  39875900
## [15]  38900000  32475000  31706300  18594000   8000000  17082000

out2 <- str_replace_all(pop2$Population, "\\[[^]]+\\]", "")
out2 <- as.numeric(out2)
head(out2, 20)

##  [1] 436259084 400000000 330567600 136305900 119546234  87162000  78790700
##  [8]  75994575  56367178  47295100  46605000  43847000  42746000  39875900
## [15]  38900000  32475000  31706300  18594000  17984000  17082000

identical(out, out2)

## [1] FALSE

pop2$Population <- out2

4. A movie box office example

The web site Box Office Mojo gives statistics on box office earnings of movies. In addition to daily earnings, the web site also maintains lists of yearly and all time record holders.

We will look at the movies in the top 100 in all time movie worldwide grosses in box office receipts. In particular, we will scrape the data from Box Office Mojo: All Time Box Office. The dollar amounts are in millions of dollars and the years marked with “^” indicate that the movie had multiple releases.

movieParse<- read_html("http://www.boxofficemojo.com/alltime/world/?pagenum=1")

movieTables <- html_nodes(movieParse, "table")
head(movieTables)

## {xml_nodeset (3)}
## [1] <table border="0" cellpadding="0" cellspacing="0" width="100%"><tr>< ...
## [2] <table border="0" cellspacing="0" cellpadding="0" width="100%"><tr>\ ...
## [3] <table border="0" cellpadding="5" cellspacing="1" width="98%">\n<tr  ...

movies <- html_table(movieTables, header = TRUE, fill = TRUE)[[3]]
str(movies)

## 'data.frame':    100 obs. of  9 variables:
##  $ Rank        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title       : chr  "Avatar" "Titanic" "Star Wars: The Force Awakens" "Avengers: Infinity War" ...
##  $ Studio      : chr  "Fox" "Par." "BV" "BV" ...
##  $ Worldwide   : chr  "$2,788.0" "$2,187.5" "$2,068.2" "$2,047.7" ...
##  $ Domestic / %: chr  "$760.5" "$659.4" "$936.7" "$678.8" ...
##  $ Domestic / %: chr  "27.3%" "30.1%" "45.3%" "33.2%" ...
##  $ Overseas / %: chr  "$2,027.5" "$1,528.1" "$1,131.6" "$1,368.9" ...
##  $ Overseas / %: chr  "72.7%" "69.9%" "54.7%" "66.8%" ...
##  $ Year^       : chr  "2009^" "1997^" "2015" "2018" ...

We will first clean up the column names and then convert the character variables into numeric variables.

names(movies)[5:9] <- c("DomesticDollars", "DomesticPercentage", "OverseasDollars", "OverseasPercentage", "Year")

For the grosses, we note that the numbers are either prefaced by a dollar sign or end with a percentage, both of which will need to be removed. We will also have to remove the commas.

out <- str_replace_all(movies$Worldwide, "\\$|,", "" )
head(out)

## [1] "2788.0" "2187.5" "2068.2" "2047.7" "1671.7" "1518.8"

movies$Worldwide <- as.numeric(out)

Exercise Clean up the remaining variables.

How many images are there on this web page? Images are coded as <img source = "NameOfImageFile.jpg">, so we search for the selector “img”.

moviesImg <- html_nodes(movieParse, "img")

moviesImg

## {xml_nodeset (10)}
##  [1] <img src="//bs.serving-sys.com/Serving/adServer.bs?c=8&amp;cn=displ ...
##  [2] <img src="/img/misc/bom_logo1.png" width="245" height="56" alt="Box ...
##  [3] <img src="/images/IMDbPro.png" alt="Get industry info at IMDbPro" h ...
##  [4] <img src="/images/glyphicons-social-32-twitter@2x.png" alt="Follow  ...
##  [5] <img src="/images/glyphicons-social-31-facebook@2x.png" alt="Follow ...
##  [6] <img src="/images/mojo_imdb_sm.png">
##  [7] <img src="/images/space.gif" border="0" width="1" height="5">
##  [8] <img src="/img/misc/bom_logo1.png" width="245" height="56" alt="Box ...
##  [9] <img src="/img/misc/IMDbSm.png" width="34" height="16" alt="IMDb" v ...
## [10] <img src="http://b.scorecardresearch.com/p?c1=2&amp;c2=6034961&amp; ...

4. A Billboard Top 100 example

The website billboard.com keeps track of top songs, albums and artists from the music industry.

One page lists the greatest hot women artists. In the source code, here is one artist’s listing: <div class="chart-list-item " data-rank="24" data-artist="" data-title="Dionne Warwick" data-has-content="false">.

(Note that the current number 1 entry, Madonna, is not listed in this format)

webParse <- read_html("https://www.billboard.com/charts/greatest-hot-100-women-artists")
str(webParse)

## List of 2
##  $ node:<externalptr> 
##  $ doc :<externalptr> 
##  - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

webNodes <- html_nodes(webParse,".chart-list-item  " ) 
webNodes

## {xml_nodeset (49)}
##  [1] <div class="chart-list-item  " data-rank="2" data-artist="" data-ti ...
##  [2] <div class="chart-list-item  " data-rank="3" data-artist="" data-ti ...
##  [3] <div class="chart-list-item  " data-rank="4" data-artist="" data-ti ...
##  [4] <div class="chart-list-item  " data-rank="5" data-artist="" data-ti ...
##  [5] <div class="chart-list-item  " data-rank="6" data-artist="" data-ti ...
##  [6] <div class="chart-list-item  " data-rank="7" data-artist="" data-ti ...
##  [7] <div class="chart-list-item  " data-rank="8" data-artist="" data-ti ...
##  [8] <div class="chart-list-item  " data-rank="9" data-artist="" data-ti ...
##  [9] <div class="chart-list-item  " data-rank="10" data-artist="" data-t ...
## [10] <div class="chart-list-item  " data-rank="11" data-artist="" data-t ...
## [11] <div class="chart-list-item  " data-rank="12" data-artist="" data-t ...
## [12] <div class="chart-list-item  " data-rank="13" data-artist="" data-t ...
## [13] <div class="chart-list-item  " data-rank="14" data-artist="" data-t ...
## [14] <div class="chart-list-item  " data-rank="15" data-artist="" data-t ...
## [15] <div class="chart-list-item  " data-rank="16" data-artist="" data-t ...
## [16] <div class="chart-list-item  " data-rank="17" data-artist="" data-t ...
## [17] <div class="chart-list-item  " data-rank="18" data-artist="" data-t ...
## [18] <div class="chart-list-item  " data-rank="19" data-artist="" data-t ...
## [19] <div class="chart-list-item  " data-rank="20" data-artist="" data-t ...
## [20] <div class="chart-list-item  " data-rank="21" data-artist="" data-t ...
## ...

We now need to extract the name of the artist, given in the data-title attribute:

webNodes2 <- html_attr(webNodes, "data-title")
webNodes2

##  [1] "Mariah Carey"               "Janet Jackson"             
##  [3] "Whitney Houston"            "Rihanna"                   
##  [5] "The Supremes"               "Olivia Newton-John"        
##  [7] "Aretha Franklin"            "Katy Perry"                
##  [9] "Taylor Swift"               "Diana Ross"                
## [11] "Donna Summer"               "Carpenters"                
## [13] "Connie Francis"             "Beyonce"                   
## [15] "Brenda Lee"                 "Barbra Streisand"          
## [17] "Cher"                       "P!nk"                      
## [19] "Gloria Estefan"             "Gladys Knight And The Pips"
## [21] "Lady Gaga"                  "TLC"                       
## [23] "Dionne Warwick"             "Heart"                     
## [25] "Paula Abdul"                "Alicia Keys"               
## [27] "Kelly Clarkson"             "Linda Ronstadt"            
## [29] "Destiny's Child"            "Celine Dion"               
## [31] "The Pointer Sisters"        "Captain & Tennille"        
## [33] "Britney Spears"             "Toni Braxton"              
## [35] "Jennifer Lopez"             "Christina Aguilera"        
## [37] "Adele"                      "Monica"                    
## [39] "Helen Reddy"                "Cyndi Lauper"              
## [41] "Roberta Flack"              "Taylor Dayne"              
## [43] "Nicki Minaj"                "Mary J. Blige"             
## [45] "Tina Turner"                "Debbie Gibson"             
## [47] "Sheena Easton"              "Kesha"                     
## [49] "Carly Simon"

As we noted earlier, Madonna’s entry was not listed in the same format as the others, so we will have to add her manually:

top50women <- data.frame(top = c("Madonna", webNodes2))
head(top50women, 5)

ABCDEFGHIJ0123456789

	top <fctr>
1	Madonna
2	Mariah Carey
3	Janet Jackson
4	Whitney Houston
5	Rihanna

On Your Own

The web site BikeRaceInfo has a table with data on past winners of the Tour de France. Create a cleaned-up data frame of this data.
The web site NY Times Best Sellers: Hardcover Fiction contains a list of best-selling fiction books. Scrape the names of these top books. The list of books are tagged via <h2>title</h2>.

Resources

HTML Tutorial

Introduction to Data scraping with R

Scraping Data From the Web

1. Reading data into R with rvest

3. Cleaning the data frame

4. A movie box office example

4. A Billboard Top 100 example

On Your Own

Resources

1. Reading data into R with `rvest`