PCA and ML on League of Legends Data

I finished the fall semester, so I wanted to apply what I learned in my computing class to a dataset of my own.

I’m going to look at League of Legends data because I like playing the game and wanted to do some analysis on it. Specifically, I’ll look at pro player performance across the four main regions: North America (LCS), Europe (LEC), Korea (LCK), and China (LPL).

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
lolplayer <- read_csv("lol-player-summer-2020.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   Player = col_character(),
##   Team = col_character(),
##   POS = col_character(),
##   Region = col_character()
## )
## i Use `spec()` for the full column specifications.
head(lolplayer)
## # A tibble: 6 x 22
##   Player Team  POS       M     G     W     L     K     D     A   KDA    KP    KS
##   <chr>  <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Power~ FlyQ~ MID      18    18    12     6    83    31   104   6    70.8  31.4
## 2 Doubl~ Team~ ADC      18    18    12     6    76    33    62   4.2  64.5  35.5
## 3 Blaber Clou~ JNG      18    18    13     5    71    42   119   4.5  74.2  27.7
## 4 Tacti~ Team~ ADC      18    18    15     3    67    22    89   7.1  70.9  30.5
## 5 Zven   Clou~ ADC      18    18    13     5    67    22    92   7.2  62.1  26.2
## 6 Jensen Team~ MID      18    18    15     3    66    27    89   5.7  70.5  30  
## # ... with 9 more variables: CS <dbl>, CSM <dbl>, GLD <dbl>, DPM <dbl>,
## #   FB <dbl>, DTH <dbl>, WPM <dbl>, WCPM <dbl>, Region <chr>

We can see the player name, what team they’re on, how many matches they played, etc.

I’m going to remove players who have not played in at least 3 matches; I feel those players don’t have enough data for a meaningful analysis.

lolplayer2 <- lolplayer %>% filter(M >= 3) #players must have played in 3 matches

We now see that there are 287 players remaining of the original 303.
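We can quickly double-check the counts with nrow():

nrow(lolplayer)   # 303 players before filtering
nrow(lolplayer2)  # 287 players after requiring at least 3 matches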

Principal Component Analysis

If we look at the data, we see that of the 22 columns, most describe the player and how they performed: the region, the role they play, the games they’ve won, and so on.

Of course we could try to perform a linear regression and see how ‘x’ affects ‘y’. However, I’m going to perform dimension reduction instead.

We are essentially going to convert the data, 287 players with 14 numeric performance statistics each, into 2 dimensions. Why would we do this? In a way, this allows us to simplify our analysis while still capturing the main characteristics of the data.

Principal component analysis does this by combining our columns into new components, each a weighted mix of the original columns. Each component captures a share of the variance in the data: the first component captures the most, the second the next most, and so on. Usually we would keep the components that together capture about 90% of the variance, but here we’ll just look at the first 2 principal components.

Let’s go ahead and do that:

# PCA on the 14 performance columns (K through WCPM), centered and scaled
lolpca <- prcomp(lolplayer2[,8:21],
                 center = TRUE,
                 scale = TRUE)

# extract the first two principal components (overall analysis)

PC1 <- lolpca$x[, "PC1"]
PC2 <- lolpca$x[, "PC2"]

data_PCA <- tibble(PC1 = PC1, PC2 = PC2, 
                   player = lolplayer2$Player, 
                   position = lolplayer2$POS, 
                   region = lolplayer2$Region)

# proportion of the total variance explained by each principal component
variance <- lolpca$sdev^2 / sum(lolpca$sdev^2)
v1 <- paste0("variance: ",signif(variance[1] * 100,3), "%")
v2 <- paste0("variance: ",signif(variance[2] * 100,3), "%")
black.bold.text <- element_text(face = "bold", color = "black", size=20)
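
As a side note, if we wanted to keep enough components to hit that 90% mark rather than just two, we could look at the cumulative variance (a quick sketch using the variance vector computed above):

cumsum(variance)                   # cumulative proportion of variance, PC1 onward
which(cumsum(variance) >= 0.9)[1]  # number of components needed to reach 90%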

I’m going to now plot the 2 dimensional data, but color it by region:

data_PCA %>% 
  ggplot() +
  aes(x=PC1,y=PC2, color = region) + 
  geom_point() +
  labs(x = v1, y=v2) +
  theme_bw() + 
  theme(text = black.bold.text) 

We can see one interesting pattern: the LCK and LPL players tend to be clumped together, while the LCS and LEC players are clumped together. This may have to do with the playstyle and meta in these regions; LCK and LPL have similar playstyles, as do LCS and LEC.

Let’s look by role now:

data_PCA %>% 
  ggplot() +
  aes(x=PC1,y=PC2, color = position) + 
  geom_point() +
  labs(x = v1, y=v2) +
  theme_bw() + 
  theme(text = black.bold.text) 

Here we see the points have fairly clear cutoffs between roles. This suggests there are clearly defined ways each role is played. We see this especially with supports and junglers, which each occupy an area of their own.

On the other hand, we see that mid and ADC players are mixed together, due to having similar roles. Both try to get the most kills and the largest share of gold so that they can carry their team to victory. In a way, you could say that if you are a mid player looking for a similar but different role, ADC would not be a hard transition.
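
If you’re curious why the roles separate the way they do, the loadings from prcomp show how much each stat contributes to each component (a quick sketch using the lolpca object from above):

round(lolpca$rotation[, 1:2], 2)  # contribution of each stat to PC1 and PC2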

Machine Learning

Now I’m going to use machine learning to see if there’s a way to predict positions.

The process behind this should be straight-forward:

  1. Randomly divide the data into 2/3 and 1/3.
  2. Use the 2/3 as training data; in other words, this is what the machine uses to learn which characteristics define a role.
  3. Use the 1/3 as testing data, and predict the role of each player in it.
  4. Compare the predictions to the actual roles in the testing data.
  5. See the success rate.

We can look at one simulation of this:

library(rpart)
library(rpart.plot)
library(randomForest)
## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin
# drop identifying columns and the match/win counts, keeping the performance stats
lolplayer3 <- lolplayer2 %>% select(-c(Player, Team, M, G, W, L, Region))

# convert the outcome (POS) to a factor, and add Region back in as a factor predictor
lolplayer3$POS <- as.factor(lolplayer2$POS)
lolplayer3$Region <- as.factor(lolplayer2$Region)

set.seed(123)
# randomly assign 2/3 of the players to training and the remaining 1/3 to testing
trainIndex <- sample(nrow(lolplayer3), round(nrow(lolplayer3)*2/3))
testIndex <- setdiff(1:nrow(lolplayer3), trainIndex)

data_train <- lolplayer3[trainIndex,]
data_test <- lolplayer3[testIndex,]

cartfit <- rpart(POS ~ ., data = data_train)
rpart.plot(cartfit)

Here is the decision tree showing how the model divided the roles based on what it thought defined each role.

Now we can see how well that does:

predProb_cart <- as.data.frame(predict(object = cartfit, newdata = data_test))

atable <- table(
  predictLabel = names(predProb_cart)[which(predProb_cart > .5, arr.ind = TRUE)[, "col"]],
  trueLabel = data_test$POS
)

atable 
##             trueLabel
## predictLabel ADC JNG MID SUP TOP
##          ADC   7   5   8   3   5
##          JNG   6   6   2   3   2
##          MID   8   3   4   3   4
##          SUP   1   4   2   1   7
##          TOP   1   1   1   5   4
sum(diag(atable))/sum(atable)
## [1] 0.2291667

Pretty bad, right? It only predicted correctly about 23% of the time.
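
As an aside, rpart can return hard class labels directly via type = "class", which avoids thresholding the probability matrix by hand; a minimal sketch using the objects above (it may not reproduce the table exactly, since the which()-based construction walks the probability matrix column by column rather than row by row):

pred_cart <- predict(cartfit, newdata = data_test, type = "class")  # predicted role for each test player
table(predictLabel = pred_cart, trueLabel = data_test$POS)          # confusion table
mean(pred_cart == data_test$POS)                                    # overall accuracy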

But we don’t have to rely on just one decision tree. Instead we can grow many, many trees and look for the patterns they agree on. That is where Random Forest comes in.

rffit <- randomForest(POS ~ ., data = data_train)

predrf <- predict(rffit, data_test)

conf <- table(predrf, trueLabel = data_test$POS)
conf
##       trueLabel
## predrf ADC JNG MID SUP TOP
##    ADC  21   0   2   0   1
##    JNG   0  19   0   0   0
##    MID   1   0  13   0   3
##    SUP   0   0   0  15   0
##    TOP   1   0   2   0  18
sum(diag(conf))/sum(conf)
## [1] 0.8958333

Much better: we can see that after many, many decision trees, the model was able to predict correctly about 90% of the time. It guessed the jungler and support roles correctly 100% of the time, while its 10 mistakes were all mix-ups between ADC, mid, and top.
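
Since this is only one random 2/3 versus 1/3 split, a natural extension would be to repeat the split several times and average the accuracy. Here’s a rough sketch (not run here; the seed and the 20 repeats are arbitrary choices):

set.seed(456)  # arbitrary seed for the repeated splits
accs <- replicate(20, {
  idx  <- sample(nrow(lolplayer3), round(nrow(lolplayer3) * 2/3))
  fit  <- randomForest(POS ~ ., data = lolplayer3[idx, ])
  pred <- predict(fit, lolplayer3[-idx, ])
  mean(pred == lolplayer3$POS[-idx])
})
mean(accs)  # average test accuracy over the repeated splits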

Let’s look at what the model thought were the most important predictors:

imscore <- importance(rffit)  # mean decrease in Gini impurity for each predictor
imData <- data.frame(cov = rownames(imscore), importance = imscore[,1])

ggplot(imData) + aes(x = cov, y = importance, fill = cov) + 
  geom_bar(stat = 'identity') + theme_bw() + 
  labs(title = "Importance score of predictors", x = 'Predictor', y = 'Importance')

To the model, CSM (creep score per minute) was an important deciding factor, followed by gold percentage (GLD). This makes sense: the carry lanes generally have the highest CS per minute, and gold percentage mainly comes from CS, kills, and assists, all categories the carry lanes should lead in. WCPM is wards cleared per minute, a job usually handled by the jungler and the support, so its importance makes sense too.
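
As a side note, randomForest also has a built-in plot for this, varImpPlot(), if you’d rather not build the bar chart by hand:

varImpPlot(rffit)  # built-in dot chart of variable importance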

Take-away

I’m glad I was able to apply what I learned in my computing class to one of my hobbies. It makes data analysis more fun, and if I can find more data on League of Legends, hopefully I can do more with it.
