Data on the web are often presented in tables. For instance, we can see a list of countries by population in 1900 on Wikipedia
Web pages are written in HTML (Hyper Text Markup Language) which uses tags to describe different aspects of document content. For example, a heading in a document is indicated by <h1>My Title</h1>
whereas a paragraph would be indicated by <p>A paragraph of content...</p>
.
In this tutorial, we will learn how to read data from a table on a web page into R. We will need the package rvest
to get the data from the web page, and the stringr
package to clean up the data.
library(rvest)
library(stringr)
rvest
To get the population data on Wikipedia into R, we use the read_html
command from the xml2
package (which is attached when rvest
is called) to parse the page to obtain an HTML document.
We then use the html_nodes
command that extracts all occurrences of the desired tag. We will be interested in scraping data presented in tables, so in the source code, we look for the table tag: <table> ... </table>
.
Note: some of the rvest
commands may be slow depending on your Internet connection and the complexity of the web page.
popParse <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_population_in_1900")
str(popParse)
## List of 2
## $ node:<externalptr>
## $ doc :<externalptr>
## - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
The read_html
command creates an R object, basically a list, that stores information about the web page.
To extract a table from this web page (this may take some time):
popNodes <- html_nodes(popParse, "table")
popNodes
## {xml_nodeset (4)}
## [1] <table class="plainlinks metadata ambox ambox-style ambox-More_footn ...
## [2] <table class="wikitable floatright"><tbody>\n<tr><th colspan="3" sty ...
## [3] <table class="wikitable" style="text-align:right"><tbody>\n<tr>\n<th ...
## [4] <table class="nowraplinks hlist collapsible autocollapse navbox-inne ...
There are several tables in this document. By inspecting the output of popNodes
, we make a guess that we want the third table. (In other cases, trial and error may be required.) We select the third table by using double brackets:
pop <- html_table(popNodes, header = TRUE, fill = TRUE)[[3]]
str(pop)
## 'data.frame': 236 obs. of 4 variables:
## $ Rank : chr "—" "—" "1" "2" ...
## $ Country/Territory : chr "World" "British Empire[a]" "Qing China[1]" "Indian Empire[b][1] (UK)" ...
## $ Population c. 1900 estimate[1] : chr "1,700,000,000" "436,259,084" "400,000,000" "330,567,600" ...
## $ Percentage of
## World Population: chr "-" "25.7%" "23.5%" "19.4%" ...
We now have a workable data frame that we can analyze. Notice that even though the first and third columns are numbers, they are classified as “character.” For Rank
, that is because the first observation is the world population and it is not assigned a rank, but rather, the character “-”. The Population
column is also a character because the numbers have commas in them, plus Morocco’s population is given as “8,000,000 [2][3]” to indicate some footnotes. We need to convert these columns to be numeric.
In the case of the Rank column, we will remove the world population.
We will also simplify the name of the third column to “Population.”
pop2 <- pop[-1, ] #remove row 1
head(pop2)
row.names(pop2) <- NULL #reset row numbers to start at 1
pop2$Rank <- as.numeric(pop2$Rank) #coerce Rank to numeric
## Warning: NAs introduced by coercion
names(pop2)[3] <- "Population" #rename 3rd column
To remove the commas in the Population numbers, we will use str_replace_all
from the stringr
package.
pop2$Population <- str_replace_all(pop2$Population, ",", "")
head(pop2)
We still have a problem with Morocco, country number 19. Since it is the only country that is a problem, we can fix this individually, but we will also show a more general way to do this in case you encounter more than one instance.
pop2$Population[19]
## [1] "17984000"
out <- pop2$Population
out[19] <- "8000000"
out <- as.numeric(out)
## Warning: NAs introduced by coercion
head(out, 20)
## [1] 436259084 400000000 330567600 136305900 119546234 87162000 78790700
## [8] 75994575 56367178 47295100 46605000 43847000 42746000 39875900
## [15] 38900000 32475000 31706300 18594000 8000000 17082000
out2 <- str_replace_all(pop2$Population, "\\[[^]]+\\]", "")
out2 <- as.numeric(out2)
head(out2, 20)
## [1] 436259084 400000000 330567600 136305900 119546234 87162000 78790700
## [8] 75994575 56367178 47295100 46605000 43847000 42746000 39875900
## [15] 38900000 32475000 31706300 18594000 17984000 17082000
identical(out, out2)
## [1] FALSE
pop2$Population <- out2
The web site Box Office Mojo gives statistics on box office earnings of movies. In addition to daily earnings, the web site also maintains lists of yearly and all time record holders.
We will look at the movies in the top 100 in all time movie worldwide grosses in box office receipts. In particular, we will scrape the data from Box Office Mojo: All Time Box Office. The dollar amounts are in millions of dollars and the years marked with “^” indicate that the movie had multiple releases.
movieParse<- read_html("http://www.boxofficemojo.com/alltime/world/?pagenum=1")
movieTables <- html_nodes(movieParse, "table")
head(movieTables)
## {xml_nodeset (3)}
## [1] <table border="0" cellpadding="0" cellspacing="0" width="100%"><tr>< ...
## [2] <table border="0" cellspacing="0" cellpadding="0" width="100%"><tr>\ ...
## [3] <table border="0" cellpadding="5" cellspacing="1" width="98%">\n<tr ...
movies <- html_table(movieTables, header = TRUE, fill = TRUE)[[3]]
str(movies)
## 'data.frame': 100 obs. of 9 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Title : chr "Avatar" "Titanic" "Star Wars: The Force Awakens" "Avengers: Infinity War" ...
## $ Studio : chr "Fox" "Par." "BV" "BV" ...
## $ Worldwide : chr "$2,788.0" "$2,187.5" "$2,068.2" "$2,047.7" ...
## $ Domestic / %: chr "$760.5" "$659.4" "$936.7" "$678.8" ...
## $ Domestic / %: chr "27.3%" "30.1%" "45.3%" "33.2%" ...
## $ Overseas / %: chr "$2,027.5" "$1,528.1" "$1,131.6" "$1,368.9" ...
## $ Overseas / %: chr "72.7%" "69.9%" "54.7%" "66.8%" ...
## $ Year^ : chr "2009^" "1997^" "2015" "2018" ...
We will first clean up the column names and then convert the character variables into numeric variables.
names(movies)[5:9] <- c("DomesticDollars", "DomesticPercentage", "OverseasDollars", "OverseasPercentage", "Year")
For the grosses, we note that the numbers are either prefaced by a dollar sign or end with a percentage, both of which will need to be removed. We will also have to remove the commas.
out <- str_replace_all(movies$Worldwide, "\\$|,", "" )
head(out)
## [1] "2788.0" "2187.5" "2068.2" "2047.7" "1671.7" "1518.8"
movies$Worldwide <- as.numeric(out)
Exercise Clean up the remaining variables.
How many images are there on this web page? Images are coded as <img source = "NameOfImageFile.jpg">
, so we search for the selector “img”.
moviesImg <- html_nodes(movieParse, "img")
moviesImg
## {xml_nodeset (10)}
## [1] <img src="//bs.serving-sys.com/Serving/adServer.bs?c=8&cn=displ ...
## [2] <img src="/img/misc/bom_logo1.png" width="245" height="56" alt="Box ...
## [3] <img src="/images/IMDbPro.png" alt="Get industry info at IMDbPro" h ...
## [4] <img src="/images/glyphicons-social-32-twitter@2x.png" alt="Follow ...
## [5] <img src="/images/glyphicons-social-31-facebook@2x.png" alt="Follow ...
## [6] <img src="/images/mojo_imdb_sm.png">
## [7] <img src="/images/space.gif" border="0" width="1" height="5">
## [8] <img src="/img/misc/bom_logo1.png" width="245" height="56" alt="Box ...
## [9] <img src="/img/misc/IMDbSm.png" width="34" height="16" alt="IMDb" v ...
## [10] <img src="http://b.scorecardresearch.com/p?c1=2&c2=6034961& ...
The website billboard.com keeps track of top songs, albums and artists from the music industry.
One page lists the greatest hot women artists. In the source code, here is one artist’s listing: <div class="chart-list-item " data-rank="24" data-artist="" data-title="Dionne Warwick" data-has-content="false">
.
(Note that the current number 1 entry, Madonna, is not listed in this format)
webParse <- read_html("https://www.billboard.com/charts/greatest-hot-100-women-artists")
str(webParse)
## List of 2
## $ node:<externalptr>
## $ doc :<externalptr>
## - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
webNodes <- html_nodes(webParse,".chart-list-item " )
webNodes
## {xml_nodeset (49)}
## [1] <div class="chart-list-item " data-rank="2" data-artist="" data-ti ...
## [2] <div class="chart-list-item " data-rank="3" data-artist="" data-ti ...
## [3] <div class="chart-list-item " data-rank="4" data-artist="" data-ti ...
## [4] <div class="chart-list-item " data-rank="5" data-artist="" data-ti ...
## [5] <div class="chart-list-item " data-rank="6" data-artist="" data-ti ...
## [6] <div class="chart-list-item " data-rank="7" data-artist="" data-ti ...
## [7] <div class="chart-list-item " data-rank="8" data-artist="" data-ti ...
## [8] <div class="chart-list-item " data-rank="9" data-artist="" data-ti ...
## [9] <div class="chart-list-item " data-rank="10" data-artist="" data-t ...
## [10] <div class="chart-list-item " data-rank="11" data-artist="" data-t ...
## [11] <div class="chart-list-item " data-rank="12" data-artist="" data-t ...
## [12] <div class="chart-list-item " data-rank="13" data-artist="" data-t ...
## [13] <div class="chart-list-item " data-rank="14" data-artist="" data-t ...
## [14] <div class="chart-list-item " data-rank="15" data-artist="" data-t ...
## [15] <div class="chart-list-item " data-rank="16" data-artist="" data-t ...
## [16] <div class="chart-list-item " data-rank="17" data-artist="" data-t ...
## [17] <div class="chart-list-item " data-rank="18" data-artist="" data-t ...
## [18] <div class="chart-list-item " data-rank="19" data-artist="" data-t ...
## [19] <div class="chart-list-item " data-rank="20" data-artist="" data-t ...
## [20] <div class="chart-list-item " data-rank="21" data-artist="" data-t ...
## ...
We now need to extract the name of the artist, given in the data-title
attribute:
webNodes2 <- html_attr(webNodes, "data-title")
webNodes2
## [1] "Mariah Carey" "Janet Jackson"
## [3] "Whitney Houston" "Rihanna"
## [5] "The Supremes" "Olivia Newton-John"
## [7] "Aretha Franklin" "Katy Perry"
## [9] "Taylor Swift" "Diana Ross"
## [11] "Donna Summer" "Carpenters"
## [13] "Connie Francis" "Beyonce"
## [15] "Brenda Lee" "Barbra Streisand"
## [17] "Cher" "P!nk"
## [19] "Gloria Estefan" "Gladys Knight And The Pips"
## [21] "Lady Gaga" "TLC"
## [23] "Dionne Warwick" "Heart"
## [25] "Paula Abdul" "Alicia Keys"
## [27] "Kelly Clarkson" "Linda Ronstadt"
## [29] "Destiny's Child" "Celine Dion"
## [31] "The Pointer Sisters" "Captain & Tennille"
## [33] "Britney Spears" "Toni Braxton"
## [35] "Jennifer Lopez" "Christina Aguilera"
## [37] "Adele" "Monica"
## [39] "Helen Reddy" "Cyndi Lauper"
## [41] "Roberta Flack" "Taylor Dayne"
## [43] "Nicki Minaj" "Mary J. Blige"
## [45] "Tina Turner" "Debbie Gibson"
## [47] "Sheena Easton" "Kesha"
## [49] "Carly Simon"
As we noted earlier, Madonna’s entry was not listed in the same format as the others, so we will have to add her manually:
top50women <- data.frame(top = c("Madonna", webNodes2))
head(top50women, 5)
The web site BikeRaceInfo has a table with data on past winners of the Tour de France. Create a cleaned-up data frame of this data.
The web site NY Times Best Sellers: Hardcover Fiction contains a list of best-selling fiction books. Scrape the names of these top books. The list of books are tagged via <h2>title</h2>
.