For years there has been much discussion on whether Major League Baseball (MLB) needs a salary cap, as some professional baseball players are being signed to massive multiyear contracts—e.g., Clayton Kershaw will make $33M per year until 2020! In this case study we won’t discuss the need for, or lack of, a salary cap in MLB, rather we will investigate how team payroll is related to team performance. More specifically, is a team’s payroll related to their win-loss record, reaching the playoffs, or winning the World Series? Sean Lahman, a database journalist maintains a database containing performance metrics for MLB players since 1871, as well as supplementary information such as salaries and franchise information.
Lahman’s database is what we refer to as a relational database because it is organized in a way that makes the relationships between its data tables explicit. It is common practice to store relational data in a relational database management system (RDMS).There are numerous RDMSs and they all speak the structured query language (SQL). Luckily, we can use R and dplyr
to interface with many RDMSs without having to learn SQL or get into the nitty-gritty details of the databases. In this case study Lahman’s baseball database is stored in the SQLite (a very popular and open-source RDMS) database lahman.sqlite.
To begin, load the necessary packages for this case study.
library(ggplot2)
library(dplyr)
library(tidyr)
In order to interface with the lahman.sqlite database you need to know the file path of the SQLite database (if the database is not in your working directory). Open the connection between R and the database using the src_sqlite
function found in the dplyr package, specifying the file path of the database (which is simply the file name if you have changed your working directory to the folder containing the SQLite database) and the OK to create a table:
lahman_db <- src_sqlite("data/lahman.sqlite", create = TRUE)
Typing the name of the database into the console reveals the RDMS version (src
) and what data tables are contained in the database (tbls
).
lahman_db
## src: sqlite 3.8.6 [data/lahman.sqlite]
## tbls: allstarfull, appearances, awardsmanagers, awardsplayers,
## awardssharemanagers, awardsshareplayers, batting, battingpost,
## collegeplaying, fielding, fieldingof, fieldingpost, halloffame,
## homegames, managers, managershalf, master, parks, pitching,
## pitchingpost, salaries, schools, seriespost, teams, teamsfranchises,
## teamshalf
Now that we have an open connection to Lahman’s database we can proceed with our investigation.
The first step in determining how team payroll and performance are related is to calculate the team payrolls for each season.
To select the salaries data table from the lahman_db database, use the tbl
function found in the dplyr package:
salaries_tbl <- tbl(lahman_db, "salaries")
head(salaries_tbl)
## Source: query [?? x 5]
## Database: sqlite 3.8.6 [data/lahman.sqlite]
##
## yearID teamID lgID playerID salary
## <int> <chr> <chr> <chr> <int>
## 1 1985 ATL NL barkele01 870000
## 2 1985 ATL NL bedrost01 550000
## 3 1985 ATL NL benedbr01 545000
## 4 1985 ATL NL campri01 633333
## 5 1985 ATL NL ceronri01 625000
## 6 1985 ATL NL chambch01 800000
## .. ... ... ... ... ...
Next, aggregate the player salaries by team (teamID
) and year (yearID
) to produce team payrolls for each season.
salaries_grouped <- group_by(salaries_tbl, teamID, yearID)
payroll <- summarize(salaries_grouped, payroll = sum(salary))
head(payroll)
## Source: query [?? x 3]
## Database: sqlite 3.8.6 [data/lahman.sqlite]
## Groups: teamID
##
## teamID yearID payroll
## <chr> <int> <int>
## 1 ANA 1997 31135472
## 2 ANA 1998 41281000
## 3 ANA 1999 55388166
## 4 ANA 2000 51464167
## 5 ANA 2001 47535167
## 6 ANA 2002 61721667
## .. ... ... ...
At this point it’s important to note a few technical details about how dplyr
talks to a database:
dplyr
does not actually pull data into R until you ask for it. This allows dplyr
to perform all of the manipulations at once, which is more efficient. Consequently, commands like summary(payroll)
and payroll$payroll
do not work as expected.
The collect
command tells dplyr
to pull data into R, allowing us to create a data frame in R that we can work with.
payroll <- collect(payroll)
head(payroll)
teamID | yearID | payroll |
---|---|---|
ANA | 1997 | 31135472 |
ANA | 1998 | 41281000 |
ANA | 1999 | 55388166 |
ANA | 2000 | 51464167 |
ANA | 2001 | 47535167 |
ANA | 2002 | 61721667 |
Boxplots of team payroll over the years suggests that payroll is increasing over time, but we have not yet adjusted for inflation.
ggplot(data = payroll, mapping = aes(x = factor(yearID), y = payroll/1e6)) +
geom_boxplot() +
labs(x = "year", y = "payroll (millions)") +
theme(axis.text.x = element_text(angle = 90))
To convert all the payrolls to 2016 dollars, we load the inflation.csv data set which contains the necessary multipliers for each year. (This information is obtained from the Bureau of Labor Statistics.)
inflation <- read.csv("data/inflation.csv")
The easiest way to adjust payroll for inflation is to join the payroll and inflation data frames and calculate a new column as the product of payroll
and multiplier
.
payroll <- left_join(payroll, inflation, by = c("yearID" = "year"))
payroll <- mutate(payroll, adj.payroll = payroll * multiplier)
Replotting the data we still see the increase in payrolls over time, though they appear to somewhat level off in recent years. Additionally, both the adjusted and unadjusted plots reveal far greater variability in payrolls in the 2000s and 2010s than the 1980s and 1990s.
ggplot(data = payroll, mapping = aes(x = factor(yearID), y = adj.payroll/1e6)) +
geom_boxplot() +
labs(x = "year", y = "payroll (millions)") +
theme(axis.text.x = element_text(angle = 90))
So far we have looked at payroll for the entire MLB, which is broken down into two leagues: the American League (AL) and the National League (NL). Next, we investigate whether payroll is similar across the two leagues.
To begin, we must pull in data from the teams data table in the lahman_db database and join it with payroll. The teams data table contains information about a team’s record, playoff wins, and performance statistics. (See Lahman’s codebook for more details.)
teams_tbl <- collect(tbl(lahman_db, "teams"))
head(teams_tbl)
yearID | lgID | teamID | franchID | divID | Rank | G | Ghome | W | L | DivWin | WCWin | LgWin | WSWin | R | AB | H | 2B | 3B | HR | BB | SO | SB | CS | HBP | SF | RA | ER | ERA | CG | SHO | SV | IPouts | HA | HRA | BBA | SOA | E | DP | FP | name | park | attendance | BPF | PPF | teamIDBR | teamIDlahman45 | teamIDretro |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1871 | NA | BS1 | BNA | 3 | 31 | 20 | 10 | N | 401 | 1372 | 426 | 70 | 37 | 3 | 60 | 19 | 73 | 303 | 109 | 3.55 | 22 | 1 | 3 | 828 | 367 | 2 | 42 | 23 | 225 | 0.83 | Boston Red Stockings | South End Grounds I | 103 | 98 | BOS | BS1 | BS1 | ||||||||||
1871 | NA | CH1 | CNA | 2 | 28 | 19 | 9 | N | 302 | 1196 | 323 | 52 | 21 | 10 | 60 | 22 | 69 | 241 | 77 | 2.76 | 25 | 0 | 1 | 753 | 308 | 6 | 28 | 22 | 218 | 0.82 | Chicago White Stockings | Union Base-Ball Grounds | 104 | 102 | CHI | CH1 | CH1 | ||||||||||
1871 | NA | CL1 | CFC | 8 | 29 | 10 | 19 | N | 249 | 1186 | 328 | 35 | 40 | 7 | 26 | 25 | 18 | 341 | 116 | 4.11 | 23 | 0 | 0 | 762 | 346 | 13 | 53 | 34 | 223 | 0.81 | Cleveland Forest Citys | National Association Grounds | 96 | 100 | CLE | CL1 | CL1 | ||||||||||
1871 | NA | FW1 | KEK | 7 | 19 | 7 | 12 | N | 137 | 746 | 178 | 19 | 8 | 2 | 33 | 9 | 16 | 243 | 97 | 5.17 | 19 | 1 | 0 | 507 | 261 | 5 | 21 | 17 | 163 | 0.80 | Fort Wayne Kekiongas | Hamilton Field | 101 | 107 | KEK | FW1 | FW1 | ||||||||||
1871 | NA | NY2 | NNA | 5 | 33 | 16 | 17 | N | 302 | 1404 | 403 | 43 | 21 | 1 | 33 | 15 | 46 | 313 | 121 | 3.72 | 32 | 1 | 0 | 879 | 373 | 7 | 42 | 22 | 227 | 0.83 | New York Mutuals | Union Grounds (Brooklyn) | 90 | 88 | NYU | NY2 | NY2 | ||||||||||
1871 | NA | PH1 | PNA | 1 | 28 | 21 | 7 | Y | 376 | 1281 | 410 | 66 | 27 | 9 | 46 | 23 | 56 | 266 | 137 | 4.95 | 27 | 0 | 0 | 747 | 329 | 3 | 53 | 16 | 194 | 0.84 | Philadelphia Athletics | Jefferson Street Grounds | 102 | 98 | ATH | PH1 | PH1 |
payroll_teams <- left_join(payroll, teams_tbl, by = c("teamID", "yearID"))
Now that we have combined the payroll data with the team statistics, we can easily render side-by-side boxplots comparing team payroll between the AL and NL over the years.
ggplot(data = payroll_teams) +
geom_boxplot(mapping = aes(x = factor(yearID), y = adj.payroll/1e6, fill = lgID), alpha = 0.7) +
labs(x = "year", y = "payroll (millions)") +
theme(axis.text.x = element_text(angle = 90)) +
scale_fill_manual("league", values = c("darkorange", "deepskyblue"))
The boxplots reveal no systematic differences between the leagues.
Now that we have a basic understanding of the distribution of team payroll over the years, we can investigate whether payroll helps explain a team’s Win-Loss record with a scatterplot with a linear smoother.
ggplot(data = payroll_teams, aes(x = adj.payroll/1e6, y = W)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "payroll (millions)", y = "wins")
The scatterplot reveals that teams with higher payrolls win more games, on average; however, there is a substantial amount of variability in the scatterplot!
To further explore this trend, we zoom into the individual seasons by faceting the scatterplot by season.
ggplot(data = payroll_teams, aes(x = adj.payroll/1e6, y = W)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~ yearID, ncol = 5) +
labs(x = "payroll (millions)", y = "wins")
The faceted scatterplot reveals that this positive linear trend is present in most seasons, but not all. In some seasons there appears to be no trend (1987, 1990, and 1992), and in many the trend is so slight that we need to be cautious with our interpretations (this is the case for recent years).
While payroll may not be especially useful at predicting a team’s Win-Loss record, it could still be revealing with regard to playoff berths. To explore this we need to pull in postseason information, which is contained in seriespost.
postseason <- collect(tbl(lahman_db, "seriespost"))
head(postseason)
yearID | round | teamIDwinner | lgIDwinner | teamIDloser | lgIDloser | wins | losses | ties |
---|---|---|---|---|---|---|---|---|
1884 | WS | PRO | NL | NYP | AA | 3 | 0 | 0 |
1885 | WS | CHC | NL | STL | AA | 3 | 3 | 1 |
1886 | WS | STL | AA | CHC | NL | 4 | 2 | 0 |
1887 | WS | DTN | NL | STL | AA | 10 | 5 | 0 |
1888 | WS | NYG | NL | STL | AA | 6 | 4 | 0 |
1889 | WS | NYG | NL | BRO | AA | 6 | 3 | 0 |
The postseason data frame contains what team won and lost each postseason game in each season. While this is interesting, we need to create a data set containing the teams that made the playoffs in each season. To do this we’ll focus on four columns:
postseason <- select(postseason, yearID, teamIDwinner, teamIDloser)
Next, we use gather
to collapse the teamIDwinner
and teamIDloser
columns into teamID
and win
columns.
postseason <- gather(postseason, key = win, value = teamID, teamIDwinner, teamIDloser)
Since we are not interested in whether a team won a specific series in the playoffs—this information is already in teams_tbl—we drop the win
column and summarize the data set so that each team has only one entry per season.
postseason <- select(postseason,-win)
postseason_grouped <- group_by(postseason, yearID, teamID)
playoffs <- summarize(postseason_grouped, playoff = "Y")
head(playoffs)
yearID | teamID | playoff |
---|---|---|
1884 | NYP | Y |
1884 | PRO | Y |
1885 | CHC | Y |
1885 | STL | Y |
1886 | CHC | Y |
1886 | STL | Y |
Next, we join payroll_teams and playoffs for plotting.
payroll_postseason <- left_join(payroll_teams, playoffs, by = c("yearID", "teamID"))
left_join
introduces NA
s to fill in these missing values in the playoff
column (teams that did not make the playoffs). The following command replaces these NA
s with N
so that the column contains only Y
and N
values.
payroll_postseason <-
mutate(payroll_postseason,
playoff = ifelse(is.na(playoff), "N", "Y"))
Using the new playoffs data and the payroll_postseason data we can easily create a dotplot of the payroll by season, using color to indicate whether a team made the playoffs.
ggplot(data = payroll_postseason, aes(x = yearID, y = adj.payroll/1e6, color = playoff)) +
geom_point(data = filter(payroll_postseason, playoff == "N")) +
geom_point(data = filter(payroll_postseason, playoff == "Y")) +
labs(x = "payroll (millions)", y = "wins") +
scale_color_manual("Playoff berth?", values = c("gray80", "darkorange"))
The dotplot reveals that teams with high payrolls often make the playoffs, as do teams with moderate/low payrolls. Teams with the lowest payrolls, however, have not made the playoffs.
While payroll does not seem to be the best predictor of a playoff berth, perhaps it better predicts the winner of the World Series. The winner of the World Series is specified in the WSWin
column.
ggplot(data = payroll_teams, aes(x = yearID, y = adj.payroll/1e6)) +
geom_point(aes(color = WSWin)) +
labs(x = "season", y = "payroll (millions)")
The above dotplot needs to be refined, but one thing to notice is that the WSWin
column has blanks for the 1994 season. Why? A quick Wikipedia search reveals that there was a player strike that began on August 12, 1994, resulting in the cancellation of the rest of the season. Consequently, no teams made the playoffs in 1994 and the color in the plot needs to reflect this. To make this change, we first need to change the blanks to N
s.
payroll_postseason <- mutate(payroll_postseason, WSWin = ifelse(WSWin == "Y", "Y", "N"))
Next, we’ll tweak the colors and legend title for a publication quality plot.
ggplot(data = payroll_postseason, aes(x = yearID, y = adj.payroll/1e6, color = WSWin)) +
geom_point(data = filter(payroll_postseason, WSWin == "N")) +
geom_point(data = filter(payroll_postseason, WSWin == "Y")) +
labs(x = "season", y = "payroll (millions)") +
scale_color_manual("World Series\nChampion?", values = c("gray80", "darkorange"))
The dotplot reveals that while there have been many world series winners from teams with high payrolls, there have been plenty with moderate payrolls. There does, however, seem to be some indication that teams need to spend a bit of money to win the World Series, as very low payroll teams are not winning the world series.
In this case study we have investigated the distribution of team payroll in the MLB from 1985 to 2015, as well as its association with playoff berths and winning the world series. To do this we have:
src_sqlite
and tbl
commands from the dplyr
package.dplyr
and tidyr
packages.There is more that can be done using SQLite, but it requires you to learn SQL. With the tools available in dplyr
you will be able to tackle many problems you might encounter. It’s important to note that there are other RDMSs other than SQLite that dplyr
can interface with, including PostgreSQL, MySQL, and MariaSQL (see this dplyr vignette for more information). The key difference is that src_sqlite
is replaced by the appropriate src
command.
The payroll_postseason data table created above contains the variables WCWin
and DivWin
which indicate whether a team won a wild card or divisional playoff game (Y
) or not (N
).
mutate
function in the dplyr
package to fill in N
s for the blanks in the WCWin
and DivWin
columns. (Note: If a team did not reach the playoffs, then its entry for WCWin
and DivWin
will be blank.)Lahman, S. (2016) Lahman’s Baseball Database, 1871-2015, Main page, http://seanlahman.com/baseball-archive/statistics/
Lahman, S. (2016) Lahman’s Baseball Database, 1871-2015, v. 2016, Comma-delimited version, http://seanlahman.com/files/database/lahman2012-csv.zip
Hadley Wickham and Romain Francois (2016). dplyr: A Grammar of Data Manipulation. R package version 0.5.0. https://CRAN.R-project.org/package=dplyr