Wrangling Relational Data in R

Overview

For years there has been much discussion on whether Major League Baseball (MLB) needs a salary cap, as some professional baseball players are being signed to massive multiyear contracts—e.g., Clayton Kershaw will make $33M per year until 2020! In this case study we won’t discuss the need for, or lack of, a salary cap in MLB, rather we will investigate how team payroll is related to team performance. More specifically, is a team’s payroll related to their win-loss record, reaching the playoffs, or winning the World Series? Sean Lahman, a database journalist maintains a database containing performance metrics for MLB players since 1871, as well as supplementary information such as salaries and franchise information.

Lahman’s database is what we refer to as a relational database because it is organized in a way that makes the relationships between its data tables explicit. It is common practice to store relational data in a relational database management system (RDMS).There are numerous RDMSs and they all speak the structured query language (SQL). Luckily, we can use R and dplyr to interface with many RDMSs without having to learn SQL or get into the nitty-gritty details of the databases. In this case study Lahman’s baseball database is stored in the SQLite (a very popular and open-source RDMS) database lahman.sqlite.

To begin, load the necessary packages for this case study.

library(ggplot2)
library(dplyr)
library(tidyr)

Connecting to a database

In order to interface with the lahman.sqlite database you need to know the file path of the SQLite database (if the database is not in your working directory). Open the connection between R and the database using the src_sqlite function found in the dplyr package, specifying the file path of the database (which is simply the file name if you have changed your working directory to the folder containing the SQLite database) and the OK to create a table:

lahman_db <- src_sqlite("data/lahman.sqlite", create = TRUE)

Typing the name of the database into the console reveals the RDMS version (src) and what data tables are contained in the database (tbls).

lahman_db

## src:  sqlite 3.8.6 [data/lahman.sqlite]
## tbls: allstarfull, appearances, awardsmanagers, awardsplayers,
##   awardssharemanagers, awardsshareplayers, batting, battingpost,
##   collegeplaying, fielding, fieldingof, fieldingpost, halloffame,
##   homegames, managers, managershalf, master, parks, pitching,
##   pitchingpost, salaries, schools, seriespost, teams, teamsfranchises,
##   teamshalf

Now that we have an open connection to Lahman’s database we can proceed with our investigation.

Calculating team payroll

The first step in determining how team payroll and performance are related is to calculate the team payrolls for each season.

To select the salaries data table from the lahman_db database, use the tbl function found in the dplyr package:

salaries_tbl <- tbl(lahman_db, "salaries")
head(salaries_tbl)

## Source:   query [?? x 5]
## Database: sqlite 3.8.6 [data/lahman.sqlite]
## 
##    yearID teamID  lgID  playerID salary
##     <int>  <chr> <chr>     <chr>  <int>
## 1    1985    ATL    NL barkele01 870000
## 2    1985    ATL    NL bedrost01 550000
## 3    1985    ATL    NL benedbr01 545000
## 4    1985    ATL    NL  campri01 633333
## 5    1985    ATL    NL ceronri01 625000
## 6    1985    ATL    NL chambch01 800000
## ..    ...    ...   ...       ...    ...

Next, aggregate the player salaries by team (teamID) and year (yearID) to produce team payrolls for each season.

salaries_grouped <- group_by(salaries_tbl, teamID, yearID)
payroll <- summarize(salaries_grouped, payroll = sum(salary))
head(payroll)

## Source:   query [?? x 3]
## Database: sqlite 3.8.6 [data/lahman.sqlite]
## Groups: teamID
## 
##    teamID yearID  payroll
##     <chr>  <int>    <int>
## 1     ANA   1997 31135472
## 2     ANA   1998 41281000
## 3     ANA   1999 55388166
## 4     ANA   2000 51464167
## 5     ANA   2001 47535167
## 6     ANA   2002 61721667
## ..    ...    ...      ...

At this point it’s important to note a few technical details about how dplyr talks to a database:

dplyr does not actually pull data into R until you ask for it. This allows dplyr to perform all of the manipulations at once, which is more efficient. Consequently, commands like summary(payroll) and payroll$payroll do not work as expected.

The collect command tells dplyr to pull data into R, allowing us to create a data frame in R that we can work with.

payroll <- collect(payroll)
head(payroll)

teamID	yearID	payroll
ANA	1997	31135472
ANA	1998	41281000
ANA	1999	55388166
ANA	2000	51464167
ANA	2001	47535167
ANA	2002	61721667

Boxplots of team payroll over the years suggests that payroll is increasing over time, but we have not yet adjusted for inflation.

ggplot(data = payroll, mapping = aes(x = factor(yearID), y = payroll/1e6)) + 
  geom_boxplot() + 
  labs(x = "year", y = "payroll (millions)") + 
  theme(axis.text.x = element_text(angle = 90))

To convert all the payrolls to 2016 dollars, we load the inflation.csv data set which contains the necessary multipliers for each year. (This information is obtained from the Bureau of Labor Statistics.)

inflation <- read.csv("data/inflation.csv")

The easiest way to adjust payroll for inflation is to join the payroll and inflation data frames and calculate a new column as the product of payroll and multiplier.

payroll <- left_join(payroll, inflation, by = c("yearID" = "year"))
payroll <- mutate(payroll, adj.payroll = payroll * multiplier)

Replotting the data we still see the increase in payrolls over time, though they appear to somewhat level off in recent years. Additionally, both the adjusted and unadjusted plots reveal far greater variability in payrolls in the 2000s and 2010s than the 1980s and 1990s.

ggplot(data = payroll, mapping = aes(x = factor(yearID), y = adj.payroll/1e6)) + 
  geom_boxplot() + 
  labs(x = "year", y = "payroll (millions)") + 
  theme(axis.text.x = element_text(angle = 90))

Comparing payroll between leagues

So far we have looked at payroll for the entire MLB, which is broken down into two leagues: the American League (AL) and the National League (NL). Next, we investigate whether payroll is similar across the two leagues.

To begin, we must pull in data from the teams data table in the lahman_db database and join it with payroll. The teams data table contains information about a team’s record, playoff wins, and performance statistics. (See Lahman’s codebook for more details.)

teams_tbl <- collect(tbl(lahman_db, "teams"))
head(teams_tbl)

yearID	lgID	teamID	franchID	Rank	G	W	L	LgWin	R	AB	H	2B	3B	HR	BB	SO	SB	RA	ER	ERA	CG	SHO	SV	IPouts	HA	HRA	BBA	SOA	E	FP	name	park	BPF	PPF	teamIDBR	teamIDlahman45	teamIDretro
1871	NA	BS1	BNA	3	31	20	10	N	401	1372	426	70	37	3	60	19	73	303	109	3.55	22	1	3	828	367	2	42	23	225	0.83	Boston Red Stockings	South End Grounds I	103	98	BOS	BS1	BS1
1871	NA	CH1	CNA	2	28	19	9	N	302	1196	323	52	21	10	60	22	69	241	77	2.76	25	0	1	753	308	6	28	22	218	0.82	Chicago White Stockings	Union Base-Ball Grounds	104	102	CHI	CH1	CH1
1871	NA	CL1	CFC	8	29	10	19	N	249	1186	328	35	40	7	26	25	18	341	116	4.11	23	0	0	762	346	13	53	34	223	0.81	Cleveland Forest Citys	National Association Grounds	96	100	CLE	CL1	CL1
1871	NA	FW1	KEK	7	19	7	12	N	137	746	178	19	8	2	33	9	16	243	97	5.17	19	1	0	507	261	5	21	17	163	0.80	Fort Wayne Kekiongas	Hamilton Field	101	107	KEK	FW1	FW1
1871	NA	NY2	NNA	5	33	16	17	N	302	1404	403	43	21	1	33	15	46	313	121	3.72	32	1	0	879	373	7	42	22	227	0.83	New York Mutuals	Union Grounds (Brooklyn)	90	88	NYU	NY2	NY2
1871	NA	PH1	PNA	1	28	21	7	Y	376	1281	410	66	27	9	46	23	56	266	137	4.95	27	0	0	747	329	3	53	16	194	0.84	Philadelphia Athletics	Jefferson Street Grounds	102	98	ATH	PH1	PH1

payroll_teams <- left_join(payroll, teams_tbl, by = c("teamID", "yearID"))

Now that we have combined the payroll data with the team statistics, we can easily render side-by-side boxplots comparing team payroll between the AL and NL over the years.

ggplot(data = payroll_teams) + 
  geom_boxplot(mapping = aes(x = factor(yearID), y = adj.payroll/1e6, fill = lgID), alpha = 0.7) + 
  labs(x = "year", y = "payroll (millions)") +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_fill_manual("league", values = c("darkorange", "deepskyblue"))

The boxplots reveal no systematic differences between the leagues.

Payroll and wins

Now that we have a basic understanding of the distribution of team payroll over the years, we can investigate whether payroll helps explain a team’s Win-Loss record with a scatterplot with a linear smoother.

ggplot(data = payroll_teams, aes(x = adj.payroll/1e6, y = W)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  labs(x = "payroll (millions)", y = "wins")

The scatterplot reveals that teams with higher payrolls win more games, on average; however, there is a substantial amount of variability in the scatterplot!

To further explore this trend, we zoom into the individual seasons by faceting the scatterplot by season.

ggplot(data = payroll_teams, aes(x = adj.payroll/1e6, y = W)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  facet_wrap(~ yearID, ncol = 5) + 
  labs(x = "payroll (millions)", y = "wins")

The faceted scatterplot reveals that this positive linear trend is present in most seasons, but not all. In some seasons there appears to be no trend (1987, 1990, and 1992), and in many the trend is so slight that we need to be cautious with our interpretations (this is the case for recent years).

Payroll and the playoffs

While payroll may not be especially useful at predicting a team’s Win-Loss record, it could still be revealing with regard to playoff berths. To explore this we need to pull in postseason information, which is contained in seriespost.

postseason <- collect(tbl(lahman_db, "seriespost"))
head(postseason)

yearID	round	teamIDwinner	lgIDwinner	teamIDloser	lgIDloser	wins	losses	ties
1884	WS	PRO	NL	NYP	AA	3	0	0
1885	WS	CHC	NL	STL	AA	3	3	1
1886	WS	STL	AA	CHC	NL	4	2	0
1887	WS	DTN	NL	STL	AA	10	5	0
1888	WS	NYG	NL	STL	AA	6	4	0
1889	WS	NYG	NL	BRO	AA	6	3	0

The postseason data frame contains what team won and lost each postseason game in each season. While this is interesting, we need to create a data set containing the teams that made the playoffs in each season. To do this we’ll focus on four columns:

postseason <- select(postseason, yearID, teamIDwinner, teamIDloser)

Next, we use gather to collapse the teamIDwinner and teamIDloser columns into teamID and win columns.

postseason <- gather(postseason, key = win, value = teamID, teamIDwinner, teamIDloser)

Since we are not interested in whether a team won a specific series in the playoffs—this information is already in teams_tbl—we drop the win column and summarize the data set so that each team has only one entry per season.

postseason <- select(postseason,-win)
postseason_grouped <- group_by(postseason, yearID, teamID) 
playoffs <- summarize(postseason_grouped, playoff = "Y")
head(playoffs)

yearID	teamID	playoff
1884	NYP	Y
1884	PRO	Y
1885	CHC	Y
1885	STL	Y
1886	CHC	Y
1886	STL	Y

Next, we join payroll_teams and playoffs for plotting.

payroll_postseason <- left_join(payroll_teams, playoffs, by = c("yearID", "teamID"))

left_join introduces NAs to fill in these missing values in the playoff column (teams that did not make the playoffs). The following command replaces these NAs with N so that the column contains only Y and N values.

payroll_postseason <- 
  mutate(payroll_postseason, 
         playoff = ifelse(is.na(playoff), "N", "Y"))

Using the new playoffs data and the payroll_postseason data we can easily create a dotplot of the payroll by season, using color to indicate whether a team made the playoffs.

ggplot(data = payroll_postseason, aes(x = yearID, y = adj.payroll/1e6, color = playoff)) + 
  geom_point(data = filter(payroll_postseason, playoff == "N")) + 
  geom_point(data = filter(payroll_postseason, playoff == "Y")) + 
  labs(x = "payroll (millions)", y = "wins") + 
  scale_color_manual("Playoff berth?", values = c("gray80", "darkorange"))

The dotplot reveals that teams with high payrolls often make the playoffs, as do teams with moderate/low payrolls. Teams with the lowest payrolls, however, have not made the playoffs.

Payroll and the World Series

While payroll does not seem to be the best predictor of a playoff berth, perhaps it better predicts the winner of the World Series. The winner of the World Series is specified in the WSWin column.

ggplot(data = payroll_teams, aes(x = yearID, y = adj.payroll/1e6)) + 
  geom_point(aes(color = WSWin)) + 
  labs(x = "season", y = "payroll (millions)")

The above dotplot needs to be refined, but one thing to notice is that the WSWin column has blanks for the 1994 season. Why? A quick Wikipedia search reveals that there was a player strike that began on August 12, 1994, resulting in the cancellation of the rest of the season. Consequently, no teams made the playoffs in 1994 and the color in the plot needs to reflect this. To make this change, we first need to change the blanks to Ns.

payroll_postseason <- mutate(payroll_postseason, WSWin = ifelse(WSWin == "Y", "Y", "N"))

Next, we’ll tweak the colors and legend title for a publication quality plot.

ggplot(data = payroll_postseason, aes(x = yearID, y = adj.payroll/1e6, color = WSWin)) + 
  geom_point(data = filter(payroll_postseason, WSWin == "N")) + 
  geom_point(data = filter(payroll_postseason, WSWin == "Y")) + 
  labs(x = "season", y = "payroll (millions)") + 
  scale_color_manual("World Series\nChampion?", values = c("gray80", "darkorange"))

The dotplot reveals that while there have been many world series winners from teams with high payrolls, there have been plenty with moderate payrolls. There does, however, seem to be some indication that teams need to spend a bit of money to win the World Series, as very low payroll teams are not winning the world series.

Looking back, looking forward

In this case study we have investigated the distribution of team payroll in the MLB from 1985 to 2015, as well as its association with playoff berths and winning the world series. To do this we have:

extracted a data table from a SQLite database using the src_sqlite and tbl commands from the dplyr package.
manipulated and cleaned data sets using the dplyr and tidyr packages.
used graphics to explore distributions and answer questions.

There is more that can be done using SQLite, but it requires you to learn SQL. With the tools available in dplyr you will be able to tackle many problems you might encounter. It’s important to note that there are other RDMSs other than SQLite that dplyr can interface with, including PostgreSQL, MySQL, and MariaSQL (see this dplyr vignette for more information). The key difference is that src_sqlite is replaced by the appropriate src command.

On Your Own

The payroll_postseason data table created above contains the variables WCWin and DivWin which indicate whether a team won a wild card or divisional playoff game (Y) or not (N).

Use the mutate function in the dplyr package to fill in Ns for the blanks in the WCWin and DivWin columns. (Note: If a team did not reach the playoffs, then its entry for WCWin and DivWin will be blank.)
Create a dotplot of payroll by year, highlighting teams that won the wild card round. Briefly summarize your findings.
Create a dotplot of payroll by year, highlighting teams that won their division. Briefly summarize your findings.

Sources

Lahman, S. (2016) Lahman’s Baseball Database, 1871-2015, Main page, http://seanlahman.com/baseball-archive/statistics/

Lahman, S. (2016) Lahman’s Baseball Database, 1871-2015, v. 2016, Comma-delimited version, http://seanlahman.com/files/database/lahman2012-csv.zip

Hadley Wickham and Romain Francois (2016). dplyr: A Grammar of Data Manipulation. R package version 0.5.0. https://CRAN.R-project.org/package=dplyr