What's in a Name: NBA Hall Of Fame

A data science project by Surmud Jamil, Sinaan Younus, & Ritesh Verma

Comparison is the foundation of sports. We do it all the time when talking about teams -- who has a better record, who has a better chance to win the championship, etc. Similarly, we do it all the time when talking about players -- who's the biggest, strongest, fastest; who scores the most, who dominates the paint, etc.

In basketball, specifically in the NBA, a common critique of comparing players across all time periods is that different eras emphasized different aspects of the game. Thus, some say that we should only compare players from within the same era, but we disagree. We want to explore this opinion head on -- are today's players truly so much better than the greats of decades past? Did Michael Jordan really go up against "mailmen" quality players who played basketball as a second job, or is that an oversimplification by LeBron fans?

We explore this grey area, and give quantifiable responses with respect to the quantifiable categories of the game. To do so, we pull from two datasets -- one that contains stats of active NBA players from the 2020-21 season, and one that contains stats of HOF inductees. We import and scrape these datasets, respectively -- with both requiring data cleaning/parsing. Both are stored in the Pandas DataFrame format, which gives us the ability to perform SQL-esque commands easily in a Python environment. We ask and answer (through data viz) multiple questions that compare the two eras, and equally the players within each era.

We end by using ML techniques to predict whether a given active NBA player will enter the HOF based on their most recent performance. We understand that this model may underfit/overfit certain players (who are at the start/end of their careers), but for the most part, it will give us another valuable metric for comparison between eras.

First, let's collect our data.

In this notebook, we will use three main sources of NBA data. Two are datasets obtained from Kaggle, and one is a data table scraped from basketball-reference.com.

Data Parsing

First we'll clean our dataset of current NBA stats

Data Collection

Next we'll collect and parse data from Basketball-Reference on NBA Hall of Famers.

By sorting the dataframe, we see that certain metrics were not recorded back in the day (STL, BLK, 3P%). This alone is a testament to the fact that the game has changed, and consequently, so have the factors that determine HOF worthiness. Let's move on to statistics that we do have, and explore how they vary through different eras.
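One way to make the gaps explicit is to count missing values per column. The sketch below runs on a made-up three-row table; the player names and numbers are placeholders, and the column layout is an assumption about what the scraped table looks like.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the scraped Hall of Fame table (illustrative values only).
hof = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "PTS": [15.0, 20.0, 25.0],
    "STL": [np.nan, np.nan, 1.5],
    "BLK": [np.nan, np.nan, 0.8],
    "3P%": [np.nan, np.nan, 0.33],
})

# Count missing values per column, most incomplete columns first.
missing = hof.isna().sum().sort_values(ascending=False)
print(missing)

# Columns that are recorded for every player (safe to compare across eras).
complete_cols = hof.dropna(axis=1).columns.tolist()
print(complete_cols)
```

Columns that survive `dropna(axis=1)` are the ones that can be compared across every era without special handling.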

How have the stats of Hall of Fame inductees changed over the years?

Below we'll display the points, rebounds and assists of hall of fame inductees over time.

From the plots above, it looks like Hall of Famers haven't varied much in average points or assists per game. However, it is interesting that the number of rebounds has decreased in the past few years. This might have to do with how the league has gotten smaller, with fewer dominant big men (the Steph Curry effect -- more spacing, more deep shooting, less interaction in the paint -> fewer rebounds).

A problem with these graphs: not everyone is inducted 5 years after they played. Some players inducted in the past few years played way back in the 1960s or 70s, and are only being inducted now. These graphs don't necessarily represent when the players actually played.

Let's focus on points for a second.

We'll create a linear regression model to see the overall trend in HOF PTS across the years. We predict that PPG will increase over the years, since a common claim is that players have become more skilled at scoring the ball, especially in the context of a less defensive game (due to rule changes) that lends itself to high-scoring results.
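As a minimal sketch of such a fit, assuming NumPy's `polyfit` as the least-squares routine (not necessarily the library used in this notebook) and five made-up (year, PPG) pairs:

```python
import numpy as np

# Illustrative values only: induction year and career PPG for five made-up players.
years = np.array([1960, 1975, 1990, 2005, 2020], dtype=float)
ppg = np.array([18.0, 17.0, 19.0, 18.0, 18.0])

# Fit ppg = slope * year + intercept by ordinary least squares.
slope, intercept = np.polyfit(years, ppg, deg=1)

print(f"slope: {slope:.4f} PPG per year")
print(f"predicted PPG in 2020: {slope * 2020 + intercept:.2f}")
```

A slope close to 0 means PPG barely moves from year to year; a clearly positive slope would support the hypothesis above.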

Our model shows us that the average PPG of HOF players has not changed much over the years, so our initial hypothesis was wrong. After seeing the distribution and the linear regression model, it makes sense why this is the case. While players have presumably become more skilled at scoring the ball, they have become more skilled defensively as well. In the old NBA, the skill gap between the very best players and the average player was large, which you can see in Wilt Chamberlain's career: he even scored 100 points in a 1962 game. So across time, PPG stayed balanced overall as skill rose on both the defensive and offensive ends.

The residuals also support a linear model: they show no pattern and are spread evenly, without a cone shape, on both sides of 0. The fitted trend is constant, linear, and almost 0 in terms of change per year. Again, this confirms that points have not dramatically skyrocketed overall.
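A residual check of this kind can be sketched as follows, again on made-up values and with `np.polyfit` as an assumed stand-in for the model. Comparing the spread of the residuals in the earlier and later halves is a rough numeric proxy for looking for a cone shape in the plot.

```python
import numpy as np

# Illustrative values only.
years = np.array([1960, 1970, 1980, 1990, 2000, 2010], dtype=float)
ppg = np.array([17.0, 19.0, 18.0, 18.0, 17.0, 19.0])

slope, intercept = np.polyfit(years, ppg, deg=1)
residuals = ppg - (slope * years + intercept)

# With an intercept in the model, least-squares residuals sum to zero by
# construction, so the informative check is whether their spread changes.
early = residuals[years < years.mean()]
late = residuals[years >= years.mean()]
print(f"early spread: {early.std():.3f}, late spread: {late.std():.3f}")
```

Similar spreads in both halves are consistent with an even, non-conical distribution; a much larger spread on one side would suggest the linear model is missing something.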

How do Hall of Famers compare with NBA players today?

Next we'll compare Hall of Famers with today's players. To do this, we will build a "footprint" for these two categories by showing the average Hall of Famer's stats and the average stats of today's players.
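Building such a footprint comes down to taking column means for each group and putting them side by side. The sketch below assumes both tables share `PTS`, `TRB`, and `AST` columns (an assumption about the column names) and uses made-up numbers.

```python
import pandas as pd

# Made-up per-game averages, two players per group, for illustration only.
hof = pd.DataFrame({"PTS": [20.0, 16.0], "TRB": [8.0, 6.0], "AST": [4.0, 4.0]})
current = pd.DataFrame({"PTS": [10.0, 8.0], "TRB": [4.0, 4.0], "AST": [2.0, 3.0]})

stats = ["PTS", "TRB", "AST"]

# One row per stat, one column per group.
footprint = pd.DataFrame({
    "HOF": hof[stats].mean(),
    "Current": current[stats].mean(),
})

# How many times larger the HOF average is than the current average.
footprint["ratio"] = footprint["HOF"] / footprint["Current"]
print(footprint)
```

The `ratio` column gives a single number per category, which is easier to compare across stats than bar heights alone.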

Your Average Hall of Famer vs. Your Average player today

Twice the Power

It's interesting how the ratio between a Hall of Famer's stats and an average player's stats is nearly the same across categories (2:1). The bars are almost perfectly aligned across the above graphs. Hall of Famers consistently put up twice the stats of regular players.

Creating functions

Below we'll create some helper functions to help out with some further exploratory analysis

How has the 3 point shot changed throughout the history of the NBA? Are today's players better shooters?

In the modern NBA, the "3 point shot" is a game changer. As the 3 point shot becomes more and more deadly, with more players attempting 3s as a means of scoring, we want to see how much the 3 point shot has evolved over time among HOF players. Most current HOF players did not play in the era of the 3 pointer, which we take to begin around 2013. We hypothesize that the average 3P% among HOF players won't be as high as the average 3P% among current players, because we believe that older pros did not care much about the 3-pointer in their time.
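Because 3P% is missing for players who retired before the statistic existed, the comparison has to skip those rows rather than count them as 0%. A minimal sketch with made-up values, assuming the column is named `3P%` and stored as a fraction:

```python
import numpy as np
import pandas as pd

# Illustrative values only; NaN marks a player with no recorded 3P%.
hof = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "3P%": [np.nan, 0.30, 0.34],
})
current = pd.DataFrame({
    "Player": ["Player X", "Player Y", "Player Z"],
    "3P%": [0.36, 0.40, 0.35],
})

# Series.mean() skips NaN by default, so Player A does not drag the average down.
hof_3p = hof["3P%"].mean()
current_3p = current["3P%"].mean()

# Report the gap in percentage points, not as a relative percent change.
gap_points = (current_3p - hof_3p) * 100
print(f"HOF: {hof_3p:.3f}, current: {current_3p:.3f}, gap: {gap_points:.1f} points")
```

Stating the gap in percentage points avoids ambiguity between "6 points higher" and "6% higher in relative terms".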

Today's players have more accuracy

Based on the results of our graph, our hypothesis that the average modern player's 3P% is higher than the average HOF player is correct. We see that the average modern player has about 6% higher 3P% than the average HOF player. As said earlier, this makes sense as players in today's era are more inclined to work on their 3-point shot as the 3 pointer's value is far higher than it used to be in the eras that most HOF players played in.

Are hall of famers more physical and defensive than today's players?

The era that many HOF players come from was also known as the "bully ball era", where players were very aggressive inside the arc and more shots were taken there. In the NBA, it is more common to get blocked inside the arc, since that's where layups occur. With that in mind, we hypothesize that the average HOF player would have more blocks than the average modern player: in the "bully ball" era, more layups and inside shots occurred than three pointers, so the probability of a block should be higher. Furthermore, as shown prior, three point shots have become more refined and more common among modern NBA players, and the 3 point shot is usually not blocked as often as an inside shot.

Hall of Famers are more physical

Based on the graph, we can conclude that our hypothesis is correct. The average HOF player has more blocks than the average modern NBA player. This makes sense, as more inside shots, which are more frequently blocked than outside shots, were taken in the "bully ball" era than in the modern era where the 3 point shot rules. The difference is a little more than double (~0.85 vs ~0.4), indicating that the "bully ball" era was rougher and that it was harder to score inside the paint at that time.

How selfish does a HOF player have to be?

Below we'll examine the relationship between assists and points between Hall of Famers and today's players. As the number of points a player scores increases, will the assists also increase?
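The strength of this relationship can be summarized with the Pearson correlation coefficient R. A minimal sketch on made-up values, using pandas' `Series.corr` (which computes Pearson's R by default):

```python
import pandas as pd

# Illustrative per-game averages for four made-up players.
players = pd.DataFrame({
    "PTS": [10.0, 15.0, 20.0, 25.0],
    "AST": [2.0, 3.0, 3.0, 6.0],
})

# Pearson correlation between points and assists, in the range [-1, 1].
r = players["PTS"].corr(players["AST"])
print(f"R = {r:.2f}")
```

Running the same calculation separately on the Hall of Fame table and on the current-player table gives one R per group, which is what the comparison needs.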

Sharing is caring (nowadays at least)

This graph shows the relationship between how much high-scoring HOF players scored and how selfish they were (not passing the ball), compared to the same relationship for high-scoring 2021 NBA players. The HOF line (black line) shows a weakly positive trend (R=.09) between PPG and APG. This indicates that high-scoring HOF players, who played during the "bully ball" era, did not share the ball as much. So yes, while they were scoring a lot of points, they were not getting their team as involved. When we look at the 2021 players' stats, we see a considerably stronger positive trend (R=.59) between PPG and APG. This means that current high-scoring NBA players still share the ball a lot and get their team involved while scoring a lot of points. Among the offensive powerhouses in both groups, the 2021 players are less selfish and more willing to share the ball. It can be said that they are more offensively complete, since they are both scoring points and helping their team score points.

Let's now zone in on a comparison entirely within the group of Hall of Fame players.

In honor of his hall of fame induction a few days ago, in this example we will be examining an NBA player who we universally deem an all-time great, Kobe Bryant. We will see how Kobe Bryant compares to the average Hall of Fame inductee. We've established that HOFers are some distance apart from today's average player, but there must be certain HOFers who are a distance apart from their own class. We think that Kobe will be, at least in the category that defines the scoreboard: points.
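A comparison like this can be wrapped in a small helper that lines up one player's stats against the group average. The function name `compare_to_average`, the column names, and all numbers below are hypothetical, chosen only to illustrate the idea.

```python
import pandas as pd

def compare_to_average(df, player, stats):
    """Return one player's stats, the group mean, and the difference per stat."""
    row = df.loc[df["Player"] == player, stats].iloc[0]
    avg = df[stats].mean()
    return pd.DataFrame({"player": row, "average": avg, "diff": row - avg})

# Made-up table for illustration only.
hof = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "PTS": [25.0, 15.0, 20.0],
    "TRB": [5.0, 9.0, 7.0],
})

result = compare_to_average(hof, "Player A", ["PTS", "TRB"])
print(result)
```

A positive `diff` means the player is above the group average in that category, and a negative one means below.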

How does Kobe Compare?

Based on the graphs comparing the average HOF player and Kobe Bryant in a variety of categories, we can see that Kobe Bryant was better offensively than the average HOF player, with higher PTS and AST. He was, however, less effective than the average HOF player in two main categories, both defensive: blocks and rebounds. Yet with nine NBA All-Defensive First Team selections, tied for the most of all time, Kobe was by no means lacking in that area. Even his steals barely reflect this, as they are only slightly above the HOF average. This shows that numbers do not tell the entire story, and that Kobe impacted the game in ways that may not be entirely quantifiable.

Machine Learning

Next we'll move on to the machine learning aspect of our project. To train a model for HOF predictions, we need 2 distinct categories -- HOF and non-HOF. We build those categories here.

Below, we will compile our data in a comparable format. Since the dataset we are using displays stats by season, we'll need to compute career averages ourselves, so that they can be compared with our data set of hall of famers.
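One way to turn per-season rows into career averages is to weight each season by games played, so that a short season does not count as much as a full one. The sketch below assumes columns named `player_name`, `gp`, `pts`, `reb`, and `ast`, and uses made-up rows. A plain `groupby("player_name").mean()` is the simpler, unweighted alternative.

```python
import pandas as pd

# Made-up per-season rows for illustration only.
seasons = pd.DataFrame({
    "player_name": ["Player A", "Player A", "Player B"],
    "gp": [80, 20, 50],
    "pts": [20.0, 10.0, 12.0],
    "reb": [5.0, 5.0, 8.0],
    "ast": [4.0, 9.0, 2.0],
})

stats = ["pts", "reb", "ast"]

# Convert per-game averages to season totals (average * games played).
totals = seasons[stats].mul(seasons["gp"], axis=0)
totals["player_name"] = seasons["player_name"]
totals["gp"] = seasons["gp"]

# Sum totals and games per player, then divide to get career per-game averages.
summed = totals.groupby("player_name").sum()
career = summed[stats].div(summed["gp"], axis=0)
print(career)
```

For Player A the weighted scoring average is 18.0, while an unweighted mean of the two seasons would give 15.0, so the choice matters for players with uneven seasons.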

How can we determine whether a player should be inducted?

Below, we create our functions for predicting whether a given player will be admitted into the hall of fame. Our criterion will be that a given player must beat the average hall of famer in 1 or more categories. Each category in which a player exceeds the average hall of famer will increase their "score".
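The scoring rule described above can be sketched in a few lines. The function names `hof_score` and `predict_hof`, the category keys, and the numbers are illustrative assumptions. The functions accept anything indexable by category name, such as a dict or a pandas Series.

```python
CATEGORIES = ("PTS", "TRB", "AST")

def hof_score(player, hof_avg, categories=CATEGORIES):
    """Count the categories in which the player strictly exceeds the HOF average."""
    return sum(player[c] > hof_avg[c] for c in categories)

def predict_hof(player, hof_avg, threshold=1, categories=CATEGORIES):
    """Classify as HOF when the score reaches the threshold."""
    return hof_score(player, hof_avg, categories) >= threshold

# Made-up numbers for illustration only.
hof_avg = {"PTS": 18.0, "TRB": 7.0, "AST": 4.0}
player = {"PTS": 20.0, "TRB": 5.0, "AST": 6.0}

print(hof_score(player, hof_avg))                 # beats the average in PTS and AST
print(predict_hof(player, hof_avg, threshold=1))
print(predict_hof(player, hof_avg, threshold=3))
```

Keeping the threshold as a parameter makes it easy to rerun the same classifier with a stricter or looser bar.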

Trying out the model

Now that we have our functions, we'll try to run our model and see how accurate it is for our dataset.

The cells above show that the accuracy tends to decrease for the hf_category, and increase for the non_hf category as the number for the threshold increases. Let's plot this below to see how it looks.
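The trade-off can be reproduced on a toy example by computing accuracy separately for each class at every threshold. Everything below (function names, players, numbers) is made up for illustration and only mirrors the shape of the experiment.

```python
CATEGORIES = ("PTS", "TRB", "AST")

def score(player, baseline):
    """Number of categories in which the player exceeds the baseline."""
    return sum(player[c] > baseline[c] for c in CATEGORIES)

def class_accuracy(players, baseline, threshold, is_hof):
    """Fraction of `players` whose prediction matches the known label `is_hof`."""
    correct = sum((score(p, baseline) >= threshold) == is_hof for p in players)
    return correct / len(players)

baseline = {"PTS": 18.0, "TRB": 7.0, "AST": 4.0}
hof_players = [
    {"PTS": 25.0, "TRB": 8.0, "AST": 5.0},  # beats the baseline in 3 categories
    {"PTS": 20.0, "TRB": 5.0, "AST": 3.0},  # beats the baseline in 1 category
]
non_hof_players = [
    {"PTS": 19.0, "TRB": 3.0, "AST": 2.0},  # beats the baseline in 1 category
    {"PTS": 8.0, "TRB": 4.0, "AST": 1.0},   # beats the baseline in 0 categories
]

# Map each threshold to (HOF accuracy, non-HOF accuracy).
results = {
    t: (class_accuracy(hof_players, baseline, t, True),
        class_accuracy(non_hof_players, baseline, t, False))
    for t in (1, 2, 3)
}
for t, (hof_acc, non_hof_acc) in results.items():
    print(f"threshold={t}: HOF accuracy={hof_acc:.2f}, non-HOF accuracy={non_hof_acc:.2f}")
```

Even in this tiny example, raising the threshold from 1 to 2 lowers the HOF accuracy and raises the non-HOF accuracy.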

Plotting the accuracy as the threshold parameter changes

Training Results

From these results, we can see that as we increase the number of categories required to classify a player as a hall of famer, the model overfits to non hall of famers: the non-HOF accuracy approaches 100%, but the HOF classification accuracy decreases drastically. One thing we could try in order to increase our training accuracy is using more categories, instead of just points, rebounds, and assists. However, our data set constrains us to using only these three, since steals and blocks were not recorded in the NBA prior to 1973. Maybe we can try using variables such as rebound usage percentage, assist percentage, etc. These are values that exist in our all_seasons dataset and could potentially be examined.

As of right now we can conclude that a 1 category threshold for admitting a player into the hall of fame is likely the best, since we have 92% accuracy in classifying people who don't belong, while also achieving an accuracy of 85% in people who do belong. You can see this from the chart above.

Sometimes players are inducted for honorary purposes, while not having sufficient stats or achievements to support them being in the hall of fame. Looking at the output below can help us examine these outlying cases too. For example, towards the end of the results below, you can see Bobby Jones was recently inducted into the hall of fame. Bobby Jones played in the 1970s, during a different era of the NBA, and as a result may have lower stats compared to hall of fame calibre players today, which is why he wasn't classified as HOF. Our model doesn't yet capture how a hall of fame classification is dependent upon the era the player played in.

Let's see which of the actual hall of famers are classified into the hall of fame, according to our model.

How did it do?

From our intuition, it seems as though it did a decent job of classifying the "no brainer" players into the hall of fame. Players like Michael Jordan, Shaquille O'Neal, Abdul-Jabbar, and many others who most people would say should go into the hall of fame without question, made it in.

The players that our classifier took out of the hall of fame are lesser known players such as Carl Braun, Chuck Cooper, and Bobby Jones.

Which of Today's NBA players will make it to the Hall of Fame?

Let's run our classifier on today's players and see who will make it into the hall of fame.

First let's run our predictive model using a parameter of 1.

From these results, it looks like our model classified a LOT of today's players into the hall of fame. More on this later.

Now let's try running it with a parameter of 2 (below).

Which list do you agree with?

Although a parameter of 1 gave us the best accuracy in training results, we looked at the two lists and realized that a parameter of 2 gave us more "realistic" results that met our intuition. The first list contains players who are all very good, but whom fans or experts would not necessarily call hall of fame calibre. For example, players like Jeff Teague, Marcus Smart, and Collin Sexton are solid players, but most people wouldn't say they belong in the Hall of Fame. The second list, however, gave us more "realistic" names that we could look at and agree are the best of the best, and therefore hall of famers. The problem that arises is that above, we concluded that a parameter of 1 was the more accurate parameter. So why is this happening?

The first explanation is that our predictive model is only for players who have retired. The data set we are using contains players who are very young, and still in their prime. These players have not had a chance to play in the "declining" stage of their career, so as a result, their career averages are unusually high, compared to where they will be when they retire.

Another reason could be that the standards for Hall of Famers have gone up for players who have played over the past few decades. We saw how players like Bobby Jones don't necessarily have high stats by today's NBA standards, but their stats were high relative to the time they played in. Therefore, to make a better prediction, it may be better to compute the averages of hall of famers who played within the last 1-2 decades as a basis for whether today's players would make it. As the game has evolved, it may be better to weight data from more recent eras more heavily than data from half a century ago. The standards for the Hall of Fame have definitely changed over time.
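An era-restricted baseline of this kind could be sketched as follows. The `To` column (final season played), the helper name `era_baseline`, and all values are assumptions for illustration; this is a possible refinement, not something the model above already does.

```python
import pandas as pd

# Made-up Hall of Fame rows for illustration only.
hof = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "To": [1965, 1985, 2005, 2015],
    "PTS": [14.0, 18.0, 22.0, 24.0],
})

def era_baseline(df, since, stats):
    """Average the given stats over players whose final season is `since` or later."""
    recent = df[df["To"] >= since]
    return recent[stats].mean()

all_time = hof[["PTS"]].mean()
recent = era_baseline(hof, since=2000, stats=["PTS"])
print(f"all-time baseline: {all_time['PTS']:.1f}, recent baseline: {recent['PTS']:.1f}")
```

Swapping the all-time averages for a recent-era baseline raises or lowers the bar depending on how the stats of recent inductees differ from older ones.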

Overall, our model did what we told it to do. We have a lot of ideas for improving the model by adding more categories, weighting the different eras differently, and perhaps creating separate models by the position the player plays. We also expect to receive better results when looking at players who have finished their careers and are waiting to be inducted into the hall of fame.

Exploratory Analysis: Is Jose Calderon a Hall of Famer?

Since our instructor is named Jose Calderon, we thought it would be funny to see if the other Jose Calderon, who was a point guard in the NBA from 2005 - 2018, would be classified as a Hall of Famer according to our model.

Unfortunately, Jose Calderon was not classified as a hall of famer according to our model. Below you can see his career averages.

Jose Calderon vs The Average Hall of Famer

Below we will plot Jose Calderon's stats to compare with the average hall of famer in the NBA.

Jose Calderon Findings

As you can see, Jose Calderon was actually a pretty average NBA player, as he averaged 8 points per game throughout his whole career, but he wasn't quite up to par with the average hall of fame level. It's possible that if he had been able to bring his career averages in assists or rebounds up to higher levels, he could have been classified as a hall of famer according to our model. Perhaps Jose Calderon is better suited to the software engineering industry :)

Conclusion

Thank you for reading through our data science tutorial. This project has allowed us to gain a new appreciation for the different eras of the NBA, and it has allowed us to see how certain statistics can have an influence on how a player's legacy could be regarded. We learned that the year a player has played in can influence how a player is regarded in the history of the NBA. We also gained a lot of valuable experience with machine learning, and we are very interested in continuing our learning in CMSC422.