NBA Stats: Radar Charts and Regression

knitr::opts_chunk$set(message=FALSE, error=FALSE, warning=FALSE)

library(dplyr)
library(ggplot2)
library(hablar)
library(readxl)
library(fmsb)

It’s summer and I have four days where I need to keep myself busy, so instead of League of Legends data, I will look at NBA data. I considered doing this a year ago, but the sheer number of players looked a little intimidating.

I will first subset the data to only players with complete data.

nba <- read_xlsx("nba_stats.xlsx")

nba2 <- nba[complete.cases(nba),]

nba2 <- nba2 %>% convert(fct(team, position))

The number of players went from 716 to 662, so not much of a decrease.
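A quick way to confirm those counts, assuming the objects created above:

nrow(nba)   # 716 players in the raw data
nrow(nba2)  # 662 players with complete cases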

PCA

Complete cases

Let’s run PCA as usual. I’m interested in seeing whether there’s a noticeable difference between players by position, similar to the words data.

nba.pca <- prcomp(nba2[, 4:28],
                  center = TRUE,
                  scale = TRUE)

PC1 <- nba.pca$x[, "PC1"]
PC2 <- nba.pca$x[, "PC2"]

data_pca <- tibble(PC1=PC1, PC2=PC2, position = nba2$position)
variance <- nba.pca$sdev^2 / sum(nba.pca$sdev^2)
v1 <- paste0("variance: ",signif(variance[1] * 100,3), "%")
v2 <- paste0("variance: ",signif(variance[2] * 100,3), "%")

data_pca %>% 
  ggplot() +
  aes(x=PC1,y=PC2,color=position) + 
  geom_point() + 
  labs(x = v1, y=v2) +    
  theme_bw() +
  labs(title = "PCA of NBA season 2021-2022 (by position)")

Based on the plot, there seems to be a lot of overlap, so maybe position title doesn’t really matter, especially for centers and forwards, which seem mixed in together.

Players who played at least 10 games

There is a problem with the data set from earlier: some players literally only played one game. A player who completely went off in their only game would skew the data, and likewise for the other way around.

So this brings us to the question: “How many games should a player have played for their stats to count as data?” They say three data points are enough to see a trend, but I would consider three games to be too small.

There are 82 games in a season, so let’s require that a player has played at least 10 games, or roughly 12% of the season.

nba3 <- nba2 %>% filter(games >= 10)

Interestingly, the number of players dropped from 662 to 541, so over a hundred players are excluded.

Let’s run PCA again:

nba.pca <- prcomp(nba3[, 4:28],
                  center = TRUE,
                  scale = TRUE)

PC1 <- nba.pca$x[, "PC1"]
PC2 <- nba.pca$x[, "PC2"]

data_pca <- tibble(PC1=PC1, PC2=PC2, position = nba3$position)
variance <- nba.pca$sdev^2 / sum(nba.pca$sdev^2)
v1 <- paste0("variance: ",signif(variance[1] * 100,3), "%")
v2 <- paste0("variance: ",signif(variance[2] * 100,3), "%")

data_pca %>% 
  ggplot() +
  aes(x=PC1,y=PC2,color=position) + 
  geom_point() + 
  labs(x = v1, y=v2) +    
  theme_bw() +
  labs(title = "PCA of NBA season 2021-2022 (by position)")

The positions are a little more discernible, but centers seem to be mixed in with the guards and forwards.

Players who played half a season AND 30 minutes a game

Now we’re looking at players whose names you probably recognize at a glance.

nba4 <- nba2 %>% filter(games >= 40, mpg >= 30)

nba.pca <- prcomp(nba4[, 4:28],
                  center = TRUE,
                  scale = TRUE)

PC1 <- nba.pca$x[, "PC1"]
PC2 <- nba.pca$x[, "PC2"]

data_pca <- tibble(PC1=PC1, PC2=PC2, position = nba4$position)
variance <- nba.pca$sdev^2 / sum(nba.pca$sdev^2)
v1 <- paste0("variance: ",signif(variance[1] * 100,3), "%")
v2 <- paste0("variance: ",signif(variance[2] * 100,3), "%")

data_pca %>% 
  ggplot() +
  aes(x=PC1,y=PC2,color=position) + 
  geom_point() + 
  labs(x = v1, y=v2) +    
  theme_bw() +
  labs(title = "PCA of NBA season 2021-2022 (by position)")

Adding this extra condition dropped the number of players from 662 to only 86. We can see that there are not many centers, or even center-forwards, in this group.
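A quick tabulation backs that up (output omitted here):

# Count of players at each position after the games/minutes filter
table(nba4$position)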

Radar Charts

Offensive and Defensive

Continuing from what I tried last winter, let’s do some radar charts. I won’t do all 541 players; that seems a little ridiculous.

There’s an offensive and a defensive rating for each player in this dataset, so let’s chart the top 20 players by each of these two metrics. We’ll also keep the stricter filter from above: at least 40 games (roughly half the season) and at least 30 minutes per game.

For offensive players, let’s look at two-point percentage, three-point percentage, effective field goal percentage, assists per game, and versatility; for defensive players, effective field goal percentage, rebounds, steals, blocks per game, and versatility.

offensive <- nba4 %>%
  arrange(desc(offensive_rating)) %>% 
  select(name, "2P%", "3P%", "efg%", apg, versatility)

offensive <- as.data.frame(offensive)

rownames(offensive) <- offensive$name

# Standardize each stat, then record each column's max, min, and mean.
# fmsb::radarchart() expects the first two rows of its input to be the
# axis maxima and minima; the column means give an average-player reference.
offensive[,2:6] <- scale(offensive[,2:6])
col_max <- apply(offensive[,2:6], 2, max)
col_min <- apply(offensive[,2:6], 2, min)
col_mean <- apply(offensive[,2:6], 2, mean)

col_summary <- t(data.frame(Max = col_max, 
                            Min = col_min, 
                            average = col_mean))

offensive <- data.frame(rbind(col_summary, offensive[,2:6])) %>% 
  slice(1:23)

defensive <- nba4 %>%
  arrange(desc(defensive_rating)) %>%
  select(name, "efg%", rpg, spg, bpg, versatility)

defensive <- as.data.frame(defensive)

rownames(defensive) <- defensive$name

defensive[,2:6] <- scale(defensive[,2:6])
col_max <- apply(defensive[,2:6], 2, max)
col_min <- apply(defensive[,2:6], 2, min)
col_mean <- apply(defensive[,2:6], 2, mean)

col_summary <- t(data.frame(Max = col_max, 
                            Min = col_min, 
                            average = col_mean))

defensive <- data.frame(rbind(col_summary, defensive[,2:6])) %>% 
  slice(1:23)

opar <- par(no.readonly = TRUE)  # save current graphics settings

par(mar = rep(1, 4))   # small margins
par(mfrow = c(4, 5))   # 4 x 5 grid of charts


# Rows 1-3 are the max, min, and average rows; each player (row i) is drawn
# in red against the gray average polygon.
for (i in 4:nrow(offensive)) {
  radarchart(
    offensive[c(1:3, i), ],
    pfcol = c("#99999980",NA),
    pcol= c("gray","red"), plty = c(1,1), plwd = c(.5,2),
    pty = c(NA, NA),
    cglcol = "black", cglty=1, axislabcol="grey", cglwd=1,
    title = row.names(offensive)[i]
  )
}

par(opar)  # restore graphics settings

opar <- par(no.readonly = TRUE)

par(mar = rep(1,4))
par(mfrow = c(4,5))


for (i in 4:nrow(defensive)) {
  radarchart(
    defensive[c(1:3, i), ],
    pfcol = c("#99999980",NA),
    pcol= c("gray","red"), plty = c(1,1), plwd = c(.5,2),
    pty = c(NA, NA),
    cglcol = "black", cglty=1, axislabcol="grey", cglwd=1,
    title = row.names(defensive)[i]
  )
}

par(opar)

¿Por qué no los dos? (Why not both?)

Let’s consider the best of both worlds: players who rate well on both metrics, which arguably makes for a more well-rounded player.

both <- nba4 %>%
  mutate(losdos = (offensive_rating + defensive_rating)/2) %>%
  arrange(desc(losdos)) %>% 
  select(name, ppg, apg, rpg, bpg, 
         versatility, offensive_rating, defensive_rating)

both <- as.data.frame(both)

rownames(both) <- both$name

both[,2:8] <- scale(both[,2:8])
col_max <- apply(both[,2:8], 2, max)
col_min <- apply(both[,2:8], 2, min)
col_mean <- apply(both[,2:8], 2, mean)

col_summary <- t(data.frame(Max = col_max, 
                            Min = col_min, 
                            average = col_mean))

both <- data.frame(rbind(col_summary, both[,2:8])) %>% 
  slice(1:23)

opar <- par(no.readonly = TRUE)

par(mar = rep(1,4))
par(mfrow = c(4,5))


for (i in 4:nrow(both)) {
  radarchart(
    both[c(1:3, i), ],
    pfcol = c("#99999980",NA),
    pcol= c("gray","red"), plty = c(1,1), plwd = c(.5,2),
    pty = c(NA, NA),
    cglcol = "black", cglty=1, axislabcol="grey", cglwd=1,
    title = row.names(both)[i]
  )
}

par(opar)

Regression Analysis

I guess we can do some linear regression. But once again we should consider how many games a player needs to have played. Let’s just keep those who played at least roughly half the season.

nba5 <- nba2 %>% filter(games >= 40)

We have over 300 observations, so there is more than enough data to not worry too much about normality. But let’s check anyway.

The big question is: what should our outcome variable be? It should be something that encapsulates how good a player is both offensively and defensively. Let’s try using the versatility index included in the dataset. According to the data dictionary, the versatility index measures a player’s ability to produce in points, assists, and rebounds; the average player scores around a five on the index, while top players score above 10.
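As a quick sanity check on that description, we can look at the distribution of the index in our subset (output omitted here):

# Distribution of the versatility index among players with at least 40 games
summary(nba5$versatility)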

qqnorm(nba5$versatility)
qqline(nba5$versatility, col = "steelblue", lwd = 2)

shapiro.test(nba5$versatility)

## 
##  Shapiro-Wilk normality test
## 
## data:  nba5$versatility
## W = 0.94111, p-value = 3.231e-10

Uh oh, the data isn’t exactly normal. The Shapiro-Wilk test gives a tiny p-value, which also tells us the data isn’t normal. Looks like we need to do some adjusting. The go-to adjustment is a log transformation, so let’s check that out:

qqnorm(log(nba5$versatility))
qqline(log(nba5$versatility), col = "steelblue", lwd = 2)

shapiro.test(log(nba5$versatility))

## 
##  Shapiro-Wilk normality test
## 
## data:  log(nba5$versatility)
## W = 0.99459, p-value = 0.2926

Well, that seemed to solve the problem. The only catch with a log-transformed outcome is that interpreting the results isn’t as straightforward.

Theory-ish time

A typical linear regression looks like this:

\[y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\]

Here \(y\) is the outcome, \(\beta_0\) is the intercept, and \(\beta_1\) is the coefficient of predictor \(x_1\). These coefficients are our primary interest: we are checking whether each predictor has a significant effect on the outcome. In this situation, we can interpret a coefficient as “as \(x_1\) changes by 1 unit, \(y\) changes by \(\beta_1\) units.”

But now we have to consider a log-transformed outcome. Now our linear model looks like this:

\[ \begin{aligned} log(y) &= \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \\ y &= \exp \left(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \right) \\ &= \exp \left(\beta_0\right)\exp\left(\beta_1 x_1\right) \cdots \exp \left(\beta_k x_k\right) \\ \end{aligned} \]

Now we have to interpret an exponentiated \(\beta\). The interpretation becomes “as \(x_1\) increases by 1 unit, \(y\) changes by roughly \(100 \cdot \beta_1\) percent” (more precisely, \(y\) is multiplied by \(\exp(\beta_1)\)).
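The one-line reason it works out to a percent change, for small \(\beta_1\):

\[ \frac{y(x_1 + 1)}{y(x_1)} = \frac{\exp\left(\beta_0 + \beta_1 (x_1 + 1) + \cdots\right)}{\exp\left(\beta_0 + \beta_1 x_1 + \cdots\right)} = \exp(\beta_1) \approx 1 + \beta_1. \]

For larger coefficients the approximation breaks down, so it’s safer to report the exact multiplier \(\exp(\beta_1)\), which is what we’ll compute below.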

Back to regression

So now we will regress log-transformed versatility on certain predictors. In this context, we are trying to find out what makes a player more versatile. We can consider whether two-point or three-point percentage matters more, and which of points, assists, or rebounds matters more. There are many angles we can tackle.

So let’s just start chugging away:

lmod <- lm(log(versatility) ~ `2P%` + `3P%` + `efg%` + ppg + rpg + apg + spg + bpg, data = nba5)

summary(lmod)

## 
## Call:
## lm(formula = log(versatility) ~ `2P%` + `3P%` + `efg%` + ppg + 
##     rpg + apg + spg + bpg, data = nba5)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.43408 -0.09153 -0.00983  0.08937  0.47284 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.856398   0.088495  20.977  < 2e-16 ***
## `2P%`        0.124491   0.200452   0.621   0.5350    
## `3P%`       -0.098916   0.085270  -1.160   0.2469    
## `efg%`      -0.460742   0.240131  -1.919   0.0559 .  
## ppg          0.001457   0.002110   0.690   0.4904    
## rpg          0.052401   0.005154  10.167  < 2e-16 ***
## apg          0.090912   0.006600  13.774  < 2e-16 ***
## spg         -0.159777   0.028467  -5.613 4.29e-08 ***
## bpg         -0.034772   0.027587  -1.260   0.2084    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1394 on 323 degrees of freedom
## Multiple R-squared:  0.7178, Adjusted R-squared:  0.7108 
## F-statistic: 102.7 on 8 and 323 DF,  p-value: < 2.2e-16

I added 8 predictors, which might be too many, and as you can see, 5 of them are not significant. There might also be collinearity, in which predictors are strongly correlated with one another. We can use the vif function from the car package to check how much each coefficient’s variance is inflated by correlation with the other predictors. A good rule of thumb is that if a variance inflation factor is larger than 5, we should consider removing that predictor.

car::vif(lmod)

##    `2P%`    `3P%`   `efg%`      ppg      rpg      apg      spg      bpg 
## 3.327562 1.321488 3.135467 2.824866 2.739848 2.749907 1.741016 2.171163

None of them show a variance inflation factor greater than 5. Next, let’s consider reducing the number of predictors through a stepwise procedure.

rmod <- step(lmod)

## Start:  AIC=-1299.33
## log(versatility) ~ `2P%` + `3P%` + `efg%` + ppg + rpg + apg + 
##     spg + bpg
## 
##          Df Sum of Sq    RSS     AIC
## - `2P%`   1    0.0075 6.2869 -1300.9
## - ppg     1    0.0093 6.2886 -1300.8
## - `3P%`   1    0.0262 6.3055 -1300.0
## - bpg     1    0.0309 6.3103 -1299.7
## <none>                6.2794 -1299.3
## - `efg%`  1    0.0716 6.3509 -1297.6
## - spg     1    0.6124 6.8918 -1270.4
## - rpg     1    2.0096 8.2889 -1209.2
## - apg     1    3.6882 9.9676 -1147.9
## 
## Step:  AIC=-1300.94
## log(versatility) ~ `3P%` + `efg%` + ppg + rpg + apg + spg + bpg
## 
##          Df Sum of Sq    RSS     AIC
## - ppg     1    0.0074 6.2943 -1302.5
## - bpg     1    0.0281 6.3150 -1301.5
## - `3P%`   1    0.0356 6.3224 -1301.1
## <none>                6.2869 -1300.9
## - `efg%`  1    0.0959 6.3828 -1297.9
## - spg     1    0.6127 6.8996 -1272.1
## - rpg     1    2.0354 8.3223 -1209.8
## - apg     1    3.6932 9.9801 -1149.5
## 
## Step:  AIC=-1302.54
## log(versatility) ~ `3P%` + `efg%` + rpg + apg + spg + bpg
## 
##          Df Sum of Sq     RSS     AIC
## - bpg     1    0.0278  6.3221 -1303.1
## - `3P%`   1    0.0290  6.3233 -1303.0
## <none>                 6.2943 -1302.5
## - `efg%`  1    0.0946  6.3889 -1299.6
## - spg     1    0.6072  6.9015 -1274.0
## - rpg     1    2.5399  8.8342 -1192.0
## - apg     1    5.4408 11.7351 -1097.7
## 
## Step:  AIC=-1303.08
## log(versatility) ~ `3P%` + `efg%` + rpg + apg + spg
## 
##          Df Sum of Sq     RSS     AIC
## - `3P%`   1    0.0223  6.3444 -1303.9
## <none>                 6.3221 -1303.1
## - `efg%`  1    0.1224  6.4445 -1298.7
## - spg     1    0.6742  6.9963 -1271.4
## - rpg     1    3.4628  9.7849 -1160.1
## - apg     1    6.0574 12.3795 -1082.0
## 
## Step:  AIC=-1303.91
## log(versatility) ~ `efg%` + rpg + apg + spg
## 
##          Df Sum of Sq     RSS     AIC
## <none>                 6.3444 -1303.9
## - `efg%`  1    0.1266  6.4710 -1299.3
## - spg     1    0.6853  7.0297 -1271.8
## - rpg     1    3.9099 10.2543 -1146.5
## - apg     1    6.0749 12.4193 -1082.9

summary(rmod)

## 
## Call:
## lm(formula = log(versatility) ~ `efg%` + rpg + apg + spg, data = nba5)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42933 -0.09148 -0.00756  0.08753  0.45517 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.853114   0.082132  22.563  < 2e-16 ***
## `efg%`      -0.390287   0.152814  -2.554   0.0111 *  
## rpg          0.051640   0.003638  14.196  < 2e-16 ***
## apg          0.094080   0.005317  17.695  < 2e-16 ***
## spg         -0.166028   0.027935  -5.943 7.16e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1393 on 327 degrees of freedom
## Multiple R-squared:  0.7148, Adjusted R-squared:  0.7114 
## F-statistic: 204.9 on 4 and 327 DF,  p-value: < 2.2e-16

Notice that the stepwise procedure reduces the number of predictors from 8 to 4, and if we check the VIF for each predictor in the reduced model, all values remain small.
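A quick check of that claim (output omitted here):

# Variance inflation factors for the reduced model; values below 5 suggest
# collinearity is not a concern
car::vif(rmod)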

So the four significant predictors are effective field goal percentage, rebounds, assists, and steals per game.

Let’s interpret the coefficients now:

exp(coef(rmod))

## (Intercept)      `efg%`         rpg         apg         spg 
##   6.3796548   0.6768623   1.0529961   1.0986478   0.8470225

The interpretation is as follows:

- A one-unit increase in effective field goal percentage multiplies a player’s versatility by about 0.68, roughly a 32% decrease.
- Each additional rebound per game increases versatility by about 5%.
- Each additional assist per game increases versatility by about 10%.
- Each additional steal per game decreases versatility by about 15%.
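Those percentages come straight from the exponentiated coefficients; a quick way to compute them:

# Percent change in versatility per one-unit increase in each predictor
round((exp(coef(rmod)) - 1) * 100, 1)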

These are some weird results. Effective field goal percentage is a shooting measure in which made three-pointers count 50% more than made two-pointers, with the formula \[eFG\% = \frac{FGM + 0.5 \times 3PM}{FGA}.\]
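As a quick illustration of the formula, here’s a tiny helper applied to a made-up stat line (both the function and the numbers are hypothetical, not from the dataset):

# Effective field goal percentage: made threes get 50% extra weight
efg <- function(fgm, tpm, fga) (fgm + 0.5 * tpm) / fga

# Hypothetical line: 8 field goals made, 3 of them threes, on 17 attempts
efg(fgm = 8, tpm = 3, fga = 17)  # about 0.56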

Given how versatility is defined, you would think shooting efficiency would help, along with steals. But it looks like rebounds and assists, which are part of the definition of versatility, are what matter most.

Let’s see if this is true.

Back to Radar Charts

Let’s look at the 10 most versatile players in this subset.

versatile <- nba5 %>%
  arrange(desc(versatility)) %>% 
  select(name, "efg%", rpg, apg, spg)

versatile <- as.data.frame(versatile)

rownames(versatile) <- versatile$name

versatile[,2:5] <- scale(versatile[,2:5])
col_max <- apply(versatile[,2:5], 2, max)
col_min <- apply(versatile[,2:5], 2, min)
col_mean <- apply(versatile[,2:5], 2, mean)

col_summary <- t(data.frame(Max = col_max, 
                            Min = col_min, 
                            average = col_mean))

versatile  <- data.frame(rbind(col_summary, versatile[,2:5])) %>% 
  slice(1:13)

opar <- par(no.readonly = TRUE)

par(mar = rep(1,4))
par(mfrow = c(2,5))


for (i in 4:nrow(versatile)) {
  radarchart(
    versatile[c(1:3, i), ],
    pfcol = c("#99999980",NA),
    pcol= c("gray","red"), plty = c(1,1), plwd = c(.5,2),
    pty = c(NA, NA),
    cglcol = "black", cglty=1, axislabcol="grey", cglwd=1,
    title = row.names(versatile)[i]
  )
}

par(opar)

Look at that. We can see why the first four are the ones in the MVP running: they clearly excel in at least one of the areas. The remaining players are also clearly great, though not quite at the level of someone like Jokic or Giannis.
