Welcome to my analysis of League of Legends data!

Here I will take apart some data on the professional League of Legends North American League for the 2015 season. This season consists of a Spring and a Summer split. I opted to use it rather than the current season, which is still in progress, to ensure that I could collect enough data for a reasonable analysis. I also limited my analysis to one region in order to avoid the effect that conflicting play styles across regions would have.

Background on League of Legends as a Game


League of Legends is a Multiplayer Online Battle Arena (MOBA) game developed by Riot Games. Each game consists of two teams of 5 players. The teams start at opposite sides of the arena, each in a base that contains their Nexus. The goal of each game is to overcome obstacles such as minions, structures and enemy players in order to destroy the enemy Nexus. Each team generally consists of 5 standard positions divided across the map: Top, Middle, Jungle, Attack Damage Carry (ADC)/Marksman and Support. A more in-depth guide to the positions in League of Legends and the roles each position plays within a team can be found here. Each position can play a variety of roles and can choose among over a hundred different champions that perform in each role with varying degrees of success according to the design of the champion, how well the champion synergizes with a given team, how well it counters a given enemy team and the skill of the individual player. With so many variables at play within each game, no single variable can be said to determine the success or failure of any particular player.


Sidenote for anyone who cares to know about the data collection process

My initial hope was that I would be able to build a web scraper that pulled all of the data from the lol.esports.com website, but it turns out that Riot explicitly forbids it and I would have to use an API in order to get any data. When I attempted to use the lol.esports API, which isn’t meant for 3rd-party use, I ran into another issue: the gameIds necessary to grab game data weren’t arranged in a uniform manner. This meant that instead of using the scraper to pull the data I had to make the calls manually by finding the gameId from the match history on lol.esports and inserting it into a URL. While this did require a few hours of tedious work on my part, it was not without its silver linings. I was able to save JSON files locally and build the database in SQLite from there. This gave me the advantage of being able to make alterations to the script, re-run it and have the database recreated in seconds. This was particularly helpful for me as I’m not an experienced Python user, and this method left me free to make a lot more mistakes.


In order to give my analysis some focus I decided to look at whether I could scrounge up and synthesize data to aid fans who would like to participate in fantasy sports. At first my intention was just to collect information on players’ average performance and examine how far from their average players performed against each team. In this way I would be able to account for match-ups and predict higher or lower fantasy scores accordingly.

Well let’s get started! First up we’re gonna have some printouts that were created as I take my data from an SQLite database that I built using JSON files containing all the data available for each game played during the 2015 season. This includes creating averages for particular stats that are going to be significant in predicting fantasy performance. In order to determine which statistics were the most important I explored the data quite a bit, looking for variables that had significant correlations with fantasy scores.


## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Joining by: c("name", "role")
## Joining by: c("role", "opponent")

Before we go any further we can already look at some interesting information. Here we can see the average fantasy points scored by position. This already gives fantasy league players an edge in decision making. ADC players have the highest average fantasy scores, so it makes sense to prioritize drafting your ADC first, followed by Middle, Top, Jungle and Support in that order. The Jungle and Support averages aren’t too far apart, so you may be able to switch that priority if you identify a particularly high-value Support over a mediocre Jungle player.



If you’d like we can even take this approach a step further by looking at the averages for each player in a given role. Here’s a comparison of mean fantasy scores for players at the ADC position. Pick wisely!
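For anyone who wants to reproduce these two comparisons, here is a minimal sketch in base R. The `games` data frame and every number in it are made up for illustration; only the column names `role`, `name` and `fantasy` mirror the ones used later in this post.

```r
# Toy per-game records (hypothetical players and scores, not real 2015 data)
games <- data.frame(
  role    = rep(c("ADC", "Middle"), each = 4),
  name    = rep(c("adc_a", "adc_b", "mid_a", "mid_b"), each = 2),
  fantasy = c(30, 26, 22, 24, 27, 23, 20, 18),
  stringsAsFactors = FALSE
)

# Average fantasy score per role, highest first
role_avg <- aggregate(fantasy ~ role, data = games, FUN = mean)
role_avg <- role_avg[order(-role_avg$fantasy), ]

# Average fantasy score per player, then narrowed to one role
player_avg <- aggregate(fantasy ~ role + name, data = games, FUN = mean)
adc_avg <- player_avg[player_avg$role == "ADC", ]
adc_avg <- adc_avg[order(-adc_avg$fantasy), ]

print(role_avg)
print(adc_avg)
```

The same two steps (average by role, then average by player within a role) are all the bar charts in this section need as input.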



Choosing players with the highest averages for any given position is one strategy for getting the best fantasy team, but even this has limitations. In order to find out just how predictive averages are of performance in individual games let’s build a model using the arithmetic mean.


## 
## Call:
## lm(formula = fantasy ~ Player_mean_goldDif + Player_mean_kills + 
##     Player_mean_assists + Player_mean_win + Player_mean_largestKillingSpree + 
##     Player_mean_champLevelDif + Player_mean_deaths + Player_mean_fantasy + 
##     Player_mean_duration + Player_mean_goldEarned + Player_mean_TDT + 
##     Player_mean_TDDC + Player_mean_TDD + Player_mean_TDTDif + 
##     Player_mean_TDDCDif + Player_mean_TDDDif, data = dhd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.969  -9.918  -0.603   9.155  43.879 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)                     -1.567e-12  8.406e+00    0.00     1.00
## Player_mean_goldDif             -4.257e-16  1.523e-03    0.00     1.00
## Player_mean_kills                4.306e-13  3.227e+00    0.00     1.00
## Player_mean_assists              3.108e-13  2.120e+00    0.00     1.00
## Player_mean_win                  4.381e-13  4.657e+00    0.00     1.00
## Player_mean_largestKillingSpree -2.595e-14  1.590e+00    0.00     1.00
## Player_mean_champLevelDif        1.687e-14  1.486e+00    0.00     1.00
## Player_mean_deaths              -2.589e-13  1.406e+00    0.00     1.00
## Player_mean_fantasy              1.000e+00  1.163e+00    0.86     0.39
## Player_mean_duration            -1.080e-13  3.374e-01    0.00     1.00
## Player_mean_goldEarned           7.391e-16  2.034e-03    0.00     1.00
## Player_mean_TDT                  1.365e-17  8.493e-05    0.00     1.00
## Player_mean_TDDC                 4.969e-18  2.117e-04    0.00     1.00
## Player_mean_TDD                 -1.715e-17  4.676e-05    0.00     1.00
## Player_mean_TDTDif              -1.107e-18  1.723e-04    0.00     1.00
## Player_mean_TDDCDif              1.469e-17  1.977e-04    0.00     1.00
## Player_mean_TDDDif               1.003e-17  3.887e-05    0.00     1.00
## 
## Residual standard error: 12.35 on 2233 degrees of freedom
## Multiple R-squared:  0.1115, Adjusted R-squared:  0.1051 
## F-statistic: 17.51 on 16 and 2233 DF,  p-value: < 2.2e-16
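The coefficient of exactly 1 on Player_mean_fantasy, with every other coefficient effectively zero, is what least squares produces whenever a response is regressed on its own group mean: once a player’s mean fantasy score is in the model, the other season averages have nothing left to add. A toy sketch in base R (made-up numbers and a hypothetical `games` data frame, with a single predictor) reproduces the effect:

```r
# Toy per-game fantasy scores for three hypothetical players
games <- data.frame(
  name    = rep(c("a", "b", "c"), each = 4),
  fantasy = c(10, 20, 30, 40, 5, 15, 10, 30, 30, 26, 34, 38),
  stringsAsFactors = FALSE
)

# Season-long mean per player, repeated on each of that player's rows
games$player_mean <- ave(games$fantasy, games$name)

# Regress each game's score on the player's own season mean
fit   <- lm(fantasy ~ player_mean, data = games)
coefs <- coef(fit)
r2    <- summary(fit)$r.squared

print(coefs)  # intercept near 0, slope near 1
print(r2)     # share of game-to-game variance the averages explain
```

The R-squared is the number to watch here: it measures how much of the game-to-game variation the averages account for.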

As we can see, averages aren’t necessarily the best indicator of performance in each individual game: the model explains only about 11% of the variance in fantasy scores (R-squared of 0.1115). There are two obvious drawbacks to using this approach.

  1. It requires us to know the future in order to calculate a complete average.

  2. Simply using a player’s average doesn’t account for match ups that will change from week to week.

My solution to the first problem is to use cumulative means instead of taking an average of a player’s performance over the entire season. A cumulative mean only uses records up to and including the current record in its calculation, which better simulates looking at data from week to week, when the data available is more limited.


# cummean() follows row order, so the rows of ghg are assumed to be sorted chronologically
dhd <- ghg %>%
  group_by(name, role) %>%
  mutate(cumGoldDif= cummean(goldEarnedDif),
         cumGoldEarned= cummean(goldEarned),
         cumKills= cummean(kills),
         cumAssists= cummean(assists),
         cumWin= cummean(win),
         cumLKS= cummean(largestKillingSpree),
         cumCLD= cummean(champLevelDif),
         cumDeaths= cummean(deaths),
         cumFantasy= cummean(fantasy),
         cumDuration= cummean(duration),
         cumTDDC= cummean(totalDamageDealtToChampions),
         cumTDDCDif= cummean(totalDamageDealtToChampionsDif),
         cumTDD= cummean(totalDamageDealt),
         cumTDDDif= cummean(totalDamageDealtDif),
         cumTDT= cummean(totalDamageTaken),
         cumTDTDif= cummean(totalDamageTakenDif))
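To make the cumulative mean concrete, here is a toy sketch in base R with made-up scores. The helper `cum_mean()` is a stand-in written for this example, not the dplyr function used above. The second helper, `prior_cum_mean()`, is a stricter variant that only averages games played before the current one; it is an assumption of this sketch and not something the code above does.

```r
# Cumulative mean: the running average up to and including each record
cum_mean <- function(x) cumsum(x) / seq_along(x)

# Stricter variant (hypothetical): the average of earlier records only,
# so a player's first game has no prior average and gets NA
prior_cum_mean <- function(x) c(NA, head(cum_mean(x), -1))

games <- data.frame(
  name    = c("p1", "p1", "p1", "p2", "p2"),
  fantasy = c(10, 20, 30, 8, 12),
  stringsAsFactors = FALSE
)

# ave() applies the function within each player, keeping row order
games$cumFantasy   <- ave(games$fantasy, games$name, FUN = cum_mean)
games$priorFantasy <- ave(games$fantasy, games$name, FUN = prior_cum_mean)

print(games)
```

For player p1 the cumulative means are 10, 15 and 20, while the prior-only version is NA, 10 and 15.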

It may not be readily apparent why I chose these variables, so I’m going to provide you with a brief but thorough explanation of each variable’s definition and its importance to our analysis.

Definition of Key Variables


My solution for the second problem is to include opponent stats for each player in order to get some idea of how the match-up affects the player’s performance in each game. This is limited in the sense that I only included data from direct lane match-ups. That is to say, I’ll be comparing the stats of the opposing player in the corresponding position: Top laner vs. Top laner, ADC vs. ADC, etc. These stats are similar to the ones just created so I won’t define them again.


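# Opponent-side stats: group by opposing team and role, then take cumulative means.
# Each "Dif" column is negated to view the same match-up from the opponent's side.
dhd <- dhd %>%
  group_by(opponent, role) %>%
  mutate(cumOppGoldDif = cummean(-goldEarnedDif),
         cumOppDuration = cummean(duration),
         OppWin = 1 - win,
         cumOppWin = cummean(OppWin),
         # opponent's raw stat = own stat minus the difference
         OppGoldEarned = goldEarned - goldEarnedDif,
         cumOppGoldEarned = cummean(OppGoldEarned),
         # NOTE: despite the "cum" prefix this is not wrapped in cummean();
         # it is the negated gold difference of the current game
         cumOppGoldEarnedDif = -goldEarnedDif,
         OppTDDC = totalDamageDealtToChampions - totalDamageDealtToChampionsDif,
         cumOppTDDC = cummean(OppTDDC),
         cumOppTDDCDif = cummean(-totalDamageDealtToChampionsDif),
         OppTDD = totalDamageDealt - totalDamageDealtDif,
         cumOppTDD = cummean(OppTDD),
         # fixed: previously used totalDamageDealtToChampionsDif (copy-paste slip)
         cumOppTDDDif = cummean(-totalDamageDealtDif),
         OppTDT = totalDamageTaken - totalDamageTakenDif,
         cumOppTDT = cummean(OppTDT),
         cumOppTDTDif = cummean(-totalDamageTakenDif),
         cumOppCLD = cummean(-champLevelDif))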
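The arithmetic behind the opponent columns is easy to check by hand. The sketch below uses made-up numbers and assumes, as the code above implies, that `goldEarnedDif` is the player’s gold minus the lane opponent’s gold. The `cum_mean()` helper is a base R stand-in for the cumulative mean, written for this example.

```r
cum_mean <- function(x) cumsum(x) / seq_along(x)

# Three hypothetical Middle-lane games; the first two are against the same team
games <- data.frame(
  name          = c("p1", "p2", "p3"),
  opponent      = c("TeamB", "TeamB", "TeamC"),
  role          = c("Middle", "Middle", "Middle"),
  goldEarned    = c(12000, 10000, 11000),
  goldEarnedDif = c(1500, -1500, 200),
  stringsAsFactors = FALSE
)

# Opponent's gold = own gold minus the (own - opponent) difference
games$OppGoldEarned <- games$goldEarned - games$goldEarnedDif

# Running average of what each opposing team's laner has earned so far
games$cumOppGoldEarned <- ave(games$OppGoldEarned,
                              games$opponent, games$role,
                              FUN = cum_mean)

print(games)
```

In the first row the player earned 12000 gold with a lead of 1500, so the opposing laner earned 10500. By the second TeamB game the running opponent average is 11000.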

Let’s take a look and see how it goes!


## 
## Call:
## lm(formula = fantasy ~ cumOppGoldDif + cumOppWin + cumOppGoldEarnedDif + 
##     cumOppGoldEarnedDif + cumOppTDDC + cumOppTDDCDif + cumKills + 
##     cumOppTDT + cumGoldDif + cumGoldEarned + cumAssists + cumTDDC + 
##     cumTDD + cumTDT, data = dhd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.745  -5.352  -0.297   4.638  37.282 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.152e+01  3.137e+00  -3.673 0.000245 ***
## cumOppGoldDif        1.482e-03  2.495e-04   5.937 3.35e-09 ***
## cumOppWin           -5.499e+00  1.526e+00  -3.604 0.000320 ***
## cumOppGoldEarnedDif -3.368e-03  7.114e-05 -47.345  < 2e-16 ***
## cumOppTDDC           3.912e-04  6.800e-05   5.753 9.97e-09 ***
## cumOppTDDCDif       -2.836e-04  7.833e-05  -3.621 0.000300 ***
## cumKills             2.455e+00  2.344e-01  10.474  < 2e-16 ***
## cumOppTDT            2.579e-04  5.222e-05   4.939 8.41e-07 ***
## cumGoldDif          -2.792e-03  1.880e-04 -14.853  < 2e-16 ***
## cumGoldEarned        1.417e-03  3.699e-04   3.831 0.000131 ***
## cumAssists           1.814e+00  1.259e-01  14.412  < 2e-16 ***
## cumTDDC             -1.252e-04  7.468e-05  -1.677 0.093760 .  
## cumTDD              -3.462e-05  1.028e-05  -3.369 0.000768 ***
## cumTDT              -1.906e-04  5.321e-05  -3.582 0.000348 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.986 on 2236 degrees of freedom
## Multiple R-squared:  0.6281, Adjusted R-squared:  0.6259 
## F-statistic: 290.5 on 13 and 2236 DF,  p-value: < 2.2e-16

That’s a lot better! The adjusted R-squared has gone from 0.1051 to 0.6259. In the interest of saving you some time I’ve omitted the variables that weren’t particularly useful to the model, and voila, we’ve got a general model that can predict fantasy scores from week to week.

If you are as skeptical as I am you might ask, “With each role in a team being so different, how is it that you can use one model for them all?” Worry not! My curiosity led me to wonder if I might be able to create models that better predict scores for players by position. Let’s give it a whirl!

Let’s build a model specifically for Middle laners.


## 
## Call:
## lm(formula = fantasy ~ cumOppGoldDif + cumOppWin + cumOppGoldEarnedDif + 
##     cumOppGoldEarnedDif + cumOppTDT + cumOppTDTDif + cumFantasy + 
##     cumGoldDif, data = subset(dhd, role == "Middle"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.5423  -4.9516  -0.1378   4.9450  25.0627 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.117e+01  4.471e+00  -2.498  0.01285 *  
## cumOppGoldDif        1.155e-03  4.251e-04   2.717  0.00684 ** 
## cumOppWin           -4.921e+00  2.954e+00  -1.666  0.09647 .  
## cumOppGoldEarnedDif -3.369e-03  1.451e-04 -23.222  < 2e-16 ***
## cumOppTDT            1.004e-03  2.117e-04   4.741 2.87e-06 ***
## cumOppTDTDif        -4.610e-04  2.117e-04  -2.177  0.03001 *  
## cumFantasy           9.014e-01  1.088e-01   8.283 1.45e-15 ***
## cumGoldDif          -2.923e-03  4.178e-04  -6.995 9.85e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.728 on 442 degrees of freedom
## Multiple R-squared:  0.6618, Adjusted R-squared:  0.6564 
## F-statistic: 123.6 on 7 and 442 DF,  p-value: < 2.2e-16

And here’s a model for Top laners.


## 
## Call:
## lm(formula = fantasy ~ cumOppWin + cumOppGoldEarnedDif + cumOppGoldEarnedDif + 
##     cumOppTDDC + cumKills + cumOppTDT + cumGoldDif + cumGoldEarned + 
##     cumAssists + cumTDT, data = subset(dhd, role == "Top"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.755  -5.542  -0.151   4.734  32.750 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.218e+01  6.168e+00  -1.974   0.0490 *  
## cumOppWin           -3.037e+00  1.988e+00  -1.528   0.1272    
## cumOppGoldEarnedDif -3.143e-03  1.545e-04 -20.342  < 2e-16 ***
## cumOppTDDC           3.798e-04  1.864e-04   2.037   0.0422 *  
## cumKills             2.779e+00  5.672e-01   4.900 1.35e-06 ***
## cumOppTDT            3.039e-04  1.399e-04   2.171   0.0304 *  
## cumGoldDif          -2.587e-03  4.908e-04  -5.271 2.13e-07 ***
## cumGoldEarned        1.060e-03  5.151e-04   2.058   0.0401 *  
## cumAssists           1.953e+00  2.988e-01   6.537 1.74e-10 ***
## cumTDT              -4.548e-04  1.477e-04  -3.079   0.0022 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.003 on 440 degrees of freedom
## Multiple R-squared:  0.6093, Adjusted R-squared:  0.6013 
## F-statistic: 76.25 on 9 and 440 DF,  p-value: < 2.2e-16

ADC players


## 
## Call:
## lm(formula = fantasy ~ cumOppGoldDif + cumOppGoldEarnedDif + 
##     cumOppGoldEarnedDif + cumOppTDDC + cumOppTDDCDif + cumOppTDTDif + 
##     cumGoldDif + cumDeaths + cumKills + cumAssists, data = subset(dhd, 
##     role == "ADC"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.139  -5.954  -0.224   5.356  36.169 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.091e+01  3.915e+00  -2.788  0.00554 ** 
## cumOppGoldDif        1.213e-03  3.788e-04   3.200  0.00147 ** 
## cumOppGoldEarnedDif -3.170e-03  1.624e-04 -19.514  < 2e-16 ***
## cumOppTDDC           9.451e-04  1.929e-04   4.899 1.36e-06 ***
## cumOppTDDCDif       -7.067e-04  1.538e-04  -4.595 5.66e-06 ***
## cumOppTDTDif         5.235e-04  2.103e-04   2.489  0.01317 *  
## cumGoldDif          -2.599e-03  4.260e-04  -6.102 2.30e-09 ***
## cumDeaths           -1.359e+00  8.010e-01  -1.697  0.09041 .  
## cumKills             2.488e+00  4.695e-01   5.299 1.84e-07 ***
## cumAssists           1.945e+00  3.987e-01   4.878 1.50e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.786 on 440 degrees of freedom
## Multiple R-squared:  0.6013, Adjusted R-squared:  0.5931 
## F-statistic: 73.73 on 9 and 440 DF,  p-value: < 2.2e-16

Junglers


## 
## Call:
## lm(formula = fantasy ~ cumOppGoldEarnedDif + cumGoldDif + cumOppGoldEarnedDif + 
##     cumOppTDDC + cumOppTDD + cumFantasy + cumGoldDif + cumTDDC + 
##     cumDeaths, data = subset(dhd, role == "Jungle"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.803  -5.223  -0.457   4.578  31.642 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.438e+01  3.734e+00  -3.852 0.000135 ***
## cumOppGoldEarnedDif -3.239e-03  1.505e-04 -21.525  < 2e-16 ***
## cumGoldDif          -2.257e-03  4.324e-04  -5.220 2.76e-07 ***
## cumOppTDDC           6.624e-04  2.650e-04   2.500 0.012787 *  
## cumOppTDD            8.329e-05  2.104e-05   3.958 8.81e-05 ***
## cumFantasy           1.015e+00  1.289e-01   7.873 2.70e-14 ***
## cumTDDC             -8.320e-04  2.967e-04  -2.805 0.005258 ** 
## cumDeaths            1.328e+00  5.827e-01   2.279 0.023169 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.12 on 442 degrees of freedom
## Multiple R-squared:  0.5933, Adjusted R-squared:  0.5869 
## F-statistic: 92.12 on 7 and 442 DF,  p-value: < 2.2e-16

And finally Supports


## 
## Call:
## lm(formula = fantasy ~ cumOppWin + cumOppTDDC + cumOppGoldEarnedDif + 
##     cumOppGoldEarnedDif + cumOppTDDCDif + cumOppTDT + cumGoldDif + 
##     cumCLD + cumDeaths + cumKills + cumAssists + cumTDDDif, data = subset(dhd, 
##     role == "Support"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.835  -4.751  -0.803   4.118  29.891 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.335e+01  3.830e+00  -3.486 0.000540 ***
## cumOppWin            4.148e+00  2.123e+00   1.954 0.051372 .  
## cumOppTDDC           1.515e-03  6.221e-04   2.436 0.015257 *  
## cumOppGoldEarnedDif -4.218e-03  1.725e-04 -24.453  < 2e-16 ***
## cumOppTDDCDif       -8.383e-04  4.866e-04  -1.723 0.085601 .  
## cumOppTDT            5.020e-04  1.686e-04   2.977 0.003068 ** 
## cumGoldDif          -3.192e-03  8.094e-04  -3.943 9.36e-05 ***
## cumCLD              -2.882e+00  1.594e+00  -1.808 0.071240 .  
## cumDeaths           -2.010e+00  5.931e-01  -3.389 0.000764 ***
## cumKills             3.964e+00  1.034e+00   3.835 0.000144 ***
## cumAssists           1.991e+00  2.685e-01   7.416 6.28e-13 ***
## cumTDDDif            1.887e-04  8.037e-05   2.347 0.019358 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.667 on 438 degrees of freedom
## Multiple R-squared:  0.6897, Adjusted R-squared:  0.6819 
## F-statistic:  88.5 on 11 and 438 DF,  p-value: < 2.2e-16

It is worth asking why some of the models are more predictive than others. Remember that we are only looking at direct match-ups. The Jungle model has the lowest adjusted R-squared of the five (0.5869), which suggests that teams are generally less capable, whether by design of the game or by lack of ability, of using their own jungler’s advantage to punish the enemy jungler. On the opposite side of things, the Middle model (0.6564) is second only to Support (0.6819), which suggests that an advantage for a middle laner has a much stronger effect on the performance of the enemy middle laner. This also suggests that teams are actively recognizing match-up strengths and pressing them accordingly.
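Comparing the role models comes down to putting their adjusted R-squared values side by side. Here is a minimal sketch of one way to do that in base R: split the data by role, fit the same formula to each piece, and collect the statistic. The data is simulated, the two-predictor formula is a simplification, and the noise levels are arbitrary assumptions chosen so the two toy roles differ; none of it comes from the real 2015 data.

```r
set.seed(42)
n <- 60

# Simulated stand-in for dhd with two roles and two predictors
toy <- data.frame(
  role       = rep(c("Middle", "Jungle"), each = n),
  cumFantasy = runif(2 * n, 10, 30),
  cumOppWin  = runif(2 * n, 0, 1),
  stringsAsFactors = FALSE
)

# Assumption for illustration: Jungle scores are noisier than Middle scores
noise_sd <- ifelse(toy$role == "Middle", 2, 6)
toy$fantasy <- toy$cumFantasy - 5 * toy$cumOppWin + rnorm(2 * n, sd = noise_sd)

# One linear model per role, all with the same formula
fits <- lapply(split(toy, toy$role), function(d) {
  lm(fantasy ~ cumFantasy + cumOppWin, data = d)
})

# Adjusted R-squared for each role's model
adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
print(round(adj_r2, 3))
```

With real data each role would get its own formula, as in the models above, but the collection step stays the same.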

Using our general model we can take a look at the most valuable players for scoring fantasy points. I’m going to exclude any players who have played fewer than 12 games, which is about 80% of a split. This is just a precaution so that players with few games but high averages won’t distort our results too much. Now we can rank the best draft picks for the 2015 season using our predictions based on their performance.
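The ranking step can be sketched in base R as follows. The `preds` data frame, the player names and the predicted scores are all hypothetical; the output column names match the table in this section.

```r
# Hypothetical per-game predictions: two regulars and one 3-game substitute
preds <- data.frame(
  role = c(rep("ADC", 14), rep("Middle", 12), rep("Top", 3)),
  name = c(rep("adc_a", 14), rep("mid_a", 12), rep("top_a", 3)),
  prediction = c(rep(c(24, 28), 7), rep(c(27, 29), 6), rep(40, 3)),
  stringsAsFactors = FALSE
)

# Summarise each player: average prediction, total prediction, games played
by_player <- split(preds, list(preds$role, preds$name), drop = TRUE)
ranking <- do.call(rbind, lapply(by_player, function(d) {
  data.frame(role = d$role[1],
             name = d$name[1],
             Avg_prediction = mean(d$prediction),
             Sum_prediction = sum(d$prediction),
             n = nrow(d),
             stringsAsFactors = FALSE)
}))

# Drop players with fewer than 12 games, then rank by average prediction
ranking <- ranking[ranking$n >= 12, ]
ranking <- ranking[order(-ranking$Avg_prediction), ]
rownames(ranking) <- NULL

print(head(ranking, 10))
```

The substitute has the highest average in the toy data but is filtered out by the 12-game threshold, which is exactly the distortion the cutoff is meant to prevent.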

## Source: local data frame [10 x 5]
## 
##      role        name Avg_prediction Sum_prediction     n
##     (chr)       (chr)          (dbl)          (dbl) (int)
## 1  Middle    bjergsen       27.70374      1523.7056    55
## 2     ADC      apollo       27.43448      1618.6345    59
## 3     ADC  doublelift       26.90911      1237.8190    46
## 4     ADC  wildturtle       26.81259      1447.8801    54
## 5  Middle    pobelter       26.03618      1093.5195    42
## 6     ADC         cop       25.11190       552.4617    22
## 7     ADC       altec       24.76025       940.8895    38
## 8  Middle xiaoweixiao       24.04174      1105.9201    46
## 9  Jungle        rush       23.56674      1390.4376    59
## 10    Top      impact       23.45182      1336.7536    57

Let’s take a peek at the actual results just to ease our curiosity.

## Source: local data frame [10 x 5]
## 
##      role       name Avg_fantasy Sum_fantasy     n
##     (chr)      (chr)       (dbl)       (dbl) (int)
## 1     ADC doublelift    29.18804     1342.65    46
## 2  Middle   bjergsen    27.69145     1523.03    55
## 3     ADC wildturtle    26.13093     1411.07    54
## 4     ADC     apollo    26.10373     1540.12    59
## 5     ADC        cop    25.53773      561.83    22
## 6     ADC      altec    25.51211      969.46    38
## 7  Middle   pobelter    25.17595     1057.39    42
## 8  Middle      fenix    24.91930     1420.40    57
## 9     ADC     piglet    24.86353     1268.04    51
## 10    Top     impact    23.83860     1358.80    57

We could take this analysis a step further in the future by recording all the enemy stats in order to find out which position has the easiest time pushing an advantage. We can even take the data as it is and try to predict the chances of winning rather than the potential fantasy score, but that is an analysis for another time.

Even with the data available now we can gain some useful insights into team behavior. For example, let’s take a look at the total damage dealt to champions compared to the gold earned by role. We’re going to look at winning games because we’re interested in what makes a successful game!



We can see from our two pie charts that when it comes to dealing damage against champions, middle laners use their gold much more efficiently. They’re afforded a smaller gold percentage and manage to do more damage to champions than any other position (on average, of course). This information can be particularly useful when synthesized into a team strategy. If a team is centering their strategy around team-fighting then it makes sense to focus resources on their middle laner. By the same token, if a team believes they have an exceptional or superior match-up in middle lane then they should focus on team-fighting, as it may be the best strategy for them going forward.
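The numbers behind two pie charts like these are just each role’s share of the team’s gold and of its damage to champions in winning games. A rough sketch in base R, with a made-up `games` data frame (one win and one loss) and a hypothetical efficiency column defined as damage share divided by gold share:

```r
roles <- c("Top", "Jungle", "Middle", "ADC", "Support")

# Toy data: first five rows are a win, last five a loss (numbers are invented)
games <- data.frame(
  role = rep(roles, times = 2),
  win  = rep(c(1, 0), each = 5),
  goldEarned = c(11000, 9500, 12000, 13000, 7000,
                 9000, 8000, 10000, 10500, 6000),
  totalDamageDealtToChampions = c(15000, 9000, 22000, 20000, 5000,
                                  12000, 7000, 15000, 14000, 4000),
  stringsAsFactors = FALSE
)

# Keep winning games only, then total gold and damage per role
wins <- games[games$win == 1, ]
totals <- aggregate(cbind(goldEarned, totalDamageDealtToChampions) ~ role,
                    data = wins, FUN = sum)

# Each role's share of the totals (these are the pie slices)
totals$gold_share   <- totals$goldEarned / sum(totals$goldEarned)
totals$damage_share <- totals$totalDamageDealtToChampions /
  sum(totals$totalDamageDealtToChampions)

# Hypothetical efficiency measure: damage share per unit of gold share
totals$damage_per_gold_share <- totals$damage_share / totals$gold_share

print(totals[order(-totals$damage_per_gold_share), ])
```

A ratio above 1 means a role contributes a larger slice of the damage than it receives of the gold. Filtering `games` to a single team before the aggregation gives the per-team version of the same comparison.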

We can even do a similar analysis for a specific team to see how they compare to the average.



We can see TSM invests a higher gold percentage into their middle laner, but they also get an exceptionally high amount of damage output from him as well. Teams hoping to take a victory from TSM should focus on strategies to inhibit their middle laner, Bjergsen, as much as possible.

Some concluding thoughts

I hope that I’ll be able to look at more current data in the future, but this is a great demonstration of how data can power anyone from casual fantasy players, to Riot Games’ professional broadcasts, and even teams looking for ways to analyze the strengths and weaknesses of enemy teams and make informed decisions. If we were able to gain access to a large amount of data from ranked League of Legends gameplay, we could even use data analysis to improve matchmaking algorithms and game balance, to the benefit of everyone who plays League of Legends around the world.

Well that’s all for now, hope you enjoyed!