Procedure
K-means Clustering
The variables examined for each position group were used to construct a comprehensive ranking algorithm for those groups. Once the data was cleaned, it was fed into the ranking algorithm, which used k-means clustering to split the players into groups and random forests to identify the variables that most strongly separated the cluster assignments. This analysis was run on all of the position groups mentioned, but for simplicity this report uses the pass catchers as an example.
The process started with principal component analysis (PCA) on the pass catching data set. PCA reduced the dimensionality of the data set so that the clusters could be visualized on a plot. Next, the optimal number of clusters was determined by calculating silhouette scores. The silhouette score graph shows the calculated score for each candidate number of clusters. The optimal number is the first local maximum, that is, the lowest cluster number at which the silhouette score peaks. In this case, the cluster number selected was 3, which gave the cluster plot that best fit the data.
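The selection rule described above can be sketched in a few lines of Python. This is only an illustration: the function name and the silhouette values are invented, and the report's actual scores are not reproduced here.

```python
def first_local_maximum(scores_by_k):
    """Return the smallest k whose silhouette score is a local maximum.

    scores_by_k maps each candidate cluster count to its silhouette score.
    The smallest and largest candidates are compared to one neighbor only.
    """
    if not scores_by_k:
        raise ValueError("no silhouette scores supplied")
    ks = sorted(scores_by_k)
    for i, k in enumerate(ks):
        higher_than_left = i == 0 or scores_by_k[k] > scores_by_k[ks[i - 1]]
        not_below_right = (i == len(ks) - 1
                           or scores_by_k[k] >= scores_by_k[ks[i + 1]])
        if higher_than_left and not_below_right:
            return k


# Hypothetical scores: k = 6 scores highest overall, but k = 3 peaks first.
example_scores = {2: 0.41, 3: 0.47, 4: 0.39, 5: 0.44, 6: 0.50}
chosen_k = first_local_maximum(example_scores)
print(chosen_k)
```

Preferring the first peak over the global maximum favors a smaller, easier to interpret set of clusters.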
After the number of clusters was chosen, a clustering model for that position group was produced. As stated before, the players were clustered into 3 groups based on similar performance inputs. Each cluster is distinguished by color (yellow, purple, or teal), with its centroid marked in light blue. Finally, the players were identified and grouped by their cluster assignment. Below is the cluster plot produced from the data for the pass catchers.
Unlike in the previous analysis, k-means clustering had to be run twice for every position group. Because the data sets were larger this time, since players rather than teams were being ranked, the initial k-means run split the data into two groups regardless of position. This occurred because one variable dominated the others in splitting the data. In the pass catcher example, the yards variable was responsible for the two clusters, meaning that the players were split purely by their yardage total for the season. Yards are obviously important to a pass catcher, but the previous model indicated that other variables besides yards were influential. Therefore, to reduce the bias of the yards variable, the players in the high-yardage cluster were kept while the players in the low-yardage cluster were “filtered out” of the data set. K-means was then run again on the reduced data set, and 3 clusters were produced.
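The two-pass idea can be illustrated with a small, dependency-free sketch. It is not the code used for this report. The k-means routine is a plain Lloyd's algorithm with a deterministic starting point chosen for reproducibility, the function names are hypothetical, and the six players and their (yards, touchdowns) values are invented.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; returns (labels, centroids).

    Starting centroids are k points spread evenly through the sorted input,
    a deterministic choice made only so this sketch is reproducible.
    """
    ordered = sorted(points)
    step = (len(ordered) - 1) / max(k - 1, 1)
    centroids = [tuple(ordered[round(i * step)]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids


def keep_high_yardage(players, yards_index=0):
    """First pass: split into two clusters and keep the higher-yardage one."""
    points = [p["stats"] for p in players]
    labels, centroids = kmeans(points, k=2)
    high = max(range(2), key=lambda c: centroids[c][yards_index])
    return [p for p, lab in zip(players, labels) if lab == high]


# Invented players; stats are (yards, touchdowns).
players = [
    {"name": "A", "stats": (1400, 10)},
    {"name": "B", "stats": (1250, 8)},
    {"name": "C", "stats": (1100, 7)},
    {"name": "D", "stats": (300, 2)},
    {"name": "E", "stats": (220, 1)},
    {"name": "F", "stats": (150, 1)},
]
kept = keep_high_yardage(players)

# Second pass: cluster only the retained players into 3 groups.
second_labels, second_centroids = kmeans([p["stats"] for p in kept], k=3)
print([p["name"] for p in kept], second_labels)
```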
For position groups where success is determined by having higher statistics, the first k-means run served as a “filter” that eliminated players whose statistics were insufficient compared to their peers. Besides the pass catchers, this applied to the quarterbacks, running backs, and defensive line. For position groups where lower statistics are better, the initial k-means run showed that the model favored players who played fewer games than their peers. Therefore, the data had to be filtered by snap count instead. The Pro Football Focus data sets include a tool that lets a user filter players by the number of snaps played. In this case, only players who played at least 75% as many snaps as the player with the highest snap count at their position were included. These positions were the offensive line, linebackers, and defensive backs.
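For readers without access to the Pro Football Focus tool, the snap-count rule can be approximated after the fact. The sketch below assumes the threshold means "at least 75% of the position's highest snap count". The function name, field names, and snap totals are invented for illustration.

```python
def filter_by_snap_share(players, min_share=0.75):
    """Keep players whose snaps reach min_share of the position's maximum."""
    max_snaps = max(p["snaps"] for p in players)
    return [p for p in players if p["snaps"] >= min_share * max_snaps]


# Invented snap counts for one position group.
linebackers = [
    {"name": "A", "snaps": 1000},
    {"name": "B", "snaps": 800},
    {"name": "C", "snaps": 750},
    {"name": "D", "snaps": 600},
]
qualified = filter_by_snap_share(linebackers)
print([p["name"] for p in qualified])
```

The filter is applied within a single position group, so the maximum is recomputed for each position.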
The Silhouette Plot
The transition between the raw cluster plot (left) and filtered cluster plot (right)
The Cluster Plot
Random Forests
After the clusters were obtained, a random forest was used to classify the players into their cluster assignments. Since this is a supervised machine learning method, the data was split into training and testing sets to build and evaluate the model. A 75-25 split of the data was applied to all position groups. The random forest was used to assess the importance of each variable and to aid in computing a final grade for each team. A confusion matrix was used to show the performance of the classification algorithm by reporting how successfully the players were classified into their tiers. An example for the pass catchers is shown below. The accuracy is the sum of the diagonal elements divided by the sum of all the elements in the matrix. For the training data the accuracy was 100%, while for the test data it was 90%.
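The split and the accuracy calculation can be sketched as follows. The helper names and the seed are assumptions. The confusion matrix counts are invented and were chosen only so that the result matches the 90% figure; they are not the report's matrix.

```python
import random


def split_75_25(rows, seed=0):
    """Shuffle the rows and return (training 75%, testing 25%)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.75 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def accuracy(confusion):
    """Sum of the diagonal divided by the sum of every cell."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


train_rows, test_rows = split_75_25(range(20))

# Invented 3-tier confusion matrix: rows are actual tiers, columns predicted.
example_matrix = [
    [4, 0, 0],
    [1, 3, 0],
    [0, 0, 2],
]
print(len(train_rows), len(test_rows), accuracy(example_matrix))
```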
The random forest also provided a variable importance plot, which shows the mean decrease in Gini for each variable. The greater the mean decrease in Gini, the more important the variable was to the model. Based on the importance results from the random forest, the variables for each position group were assigned weights. The higher the weight, the more important the variable when evaluating success.
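The report does not spell out how importances were turned into weights. One simple possibility, shown purely as an assumption, is to make each weight proportional to the variable's mean decrease in Gini so that the weights sum to one. The function name, the variable names other than yards, and all of the numbers below are invented.

```python
def importance_to_weights(mean_decrease_gini):
    """Normalize importances so they sum to 1 (an assumed weighting rule)."""
    total = sum(mean_decrease_gini.values())
    return {name: value / total for name, value in mean_decrease_gini.items()}


# Invented importance values for three pass-catcher variables.
example_importance = {"yards": 12.0, "touchdowns": 6.0, "receptions": 2.0}
example_weights = importance_to_weights(example_importance)
print(example_weights)
```

Any monotone mapping would preserve the ordering of the variables; normalizing simply keeps the weighted scores on a comparable scale across position groups.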
A snippet of the confusion matrix on the test data
The Weights Obtained from the Random Forest
Building the Ranks
Finally, a ranking was produced. To start, the pass catcher data table was rescaled using min-max scaling, which was applied to every column. Within each column, the highest value at the position was represented by a score of one, while the lowest value had a score of zero. Then, each column was multiplied by its corresponding weight from the random forest analysis. The rows were then summed and sorted from highest score to lowest. Finally, the weighted scores were scaled again so that the highest ranked player had a score of 99 and the lowest ranked player had a score of 65. This constituted the player rankings, and below are the top ten receivers by grade, rounded to the nearest whole number.
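The grading steps above (scale each column, apply the weights, sum the rows, rescale to 65 through 99, and sort) can be sketched as below. The function names, players, statistics, and weights are all invented. Mapping a constant column to the lower bound is an arbitrary convention of this sketch.

```python
def min_max(values, low=0.0, high=1.0):
    """Linearly rescale values so the minimum maps to low and the maximum to high."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: arbitrary choice made for this sketch
        return [low for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


def grade_players(table, weights, floor=65.0, ceiling=99.0):
    """table: {player: {variable: raw value}}; weights: {variable: weight}."""
    players = list(table)
    totals = [0.0] * len(players)
    for variable, weight in weights.items():
        scaled = min_max([table[p][variable] for p in players])
        totals = [t + weight * s for t, s in zip(totals, scaled)]
    grades = min_max(totals, floor, ceiling)
    ranked = sorted(zip(players, grades), key=lambda pair: pair[1], reverse=True)
    return [(player, round(grade)) for player, grade in ranked]


# Invented statistics and weights.
example_table = {
    "A": {"yards": 1400, "touchdowns": 10},
    "B": {"yards": 1000, "touchdowns": 12},
    "C": {"yards": 400, "touchdowns": 2},
}
example_weights = {"yards": 0.7, "touchdowns": 0.3}
ranking = grade_players(example_table, example_weights)
print(ranking)
```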
The Top 10 Individual Pass Catchers
The player rankings show not only who is the best at each position, but also the gaps in production between players for the 2023 season. However, the model needs to evaluate the team unit, not just the individual players. To obtain the team rankings for pass catchers, the grades of each team's top three players in the rankings were averaged. If a team had fewer than three ranked players, it received a grade of 65 for each player it was short. The resulting top ten receiving corps for the 2023 season are shown below.
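The team calculation amounts to a padded average. A minimal sketch follows, with a hypothetical function name and invented grades. The slots parameter stands in for the number of players averaged at a given position group.

```python
def team_grade(player_grades, slots=3, floor=65.0):
    """Average the top `slots` grades, filling any missing slots with `floor`."""
    top = sorted(player_grades, reverse=True)[:slots]
    top += [floor] * (slots - len(top))
    return sum(top) / slots


# Invented grades: a full receiving corps and one with only two ranked players.
full_corps = team_grade([99, 80, 70, 90])
short_corps = team_grade([90, 85])
print(full_corps, short_corps)
```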
The Top 10 Team Pass Catchers
Keep in mind that this is only the procedure for the pass catchers, and the process needs to be repeated for all positions and position groups. The player grades are calculated in exactly the same way for each position, but the computation of the team grades differs slightly for each group. The table below shows how many players at each position group need to be averaged together to produce the final ranking for teams. Remember that if a team does not have a player who qualified for a ranking, or does not have the position at all, it receives a 65 as a result.
Note: The quarterbacks were treated slightly differently this time. Because the clusters were not well defined, the random forest produced an indefensibly low accuracy. Therefore, the data set was split into the quarterbacks' passing statistics and rushing statistics, and the procedure was performed on each independently. After each ranking was produced, the passing score was scaled to range from 65 to 93 and the rushing score to range from 0 to 7. Finally, the two scores were added. These bounds were chosen to place more emphasis on a quarterback's passing ability while adding a “bonus” for being a quality runner as well.
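The combination step for quarterbacks can be sketched as follows, assuming the weighted passing and rushing scores have already been produced by the two independent runs. The function names and the three quarterbacks' scores are invented.

```python
def rescale(values, low, high):
    """Linearly map values so the minimum becomes low and the maximum high."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant input: arbitrary choice made for this sketch
        return [low for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


def quarterback_grades(passing, rushing):
    """Passing scaled to 65-93 plus a rushing bonus scaled to 0-7."""
    names = list(passing)
    pass_part = rescale([passing[n] for n in names], 65.0, 93.0)
    rush_part = rescale([rushing[n] for n in names], 0.0, 7.0)
    return {n: p + r for n, p, r in zip(names, pass_part, rush_part)}


# Invented weighted scores from the separate passing and rushing runs.
passing_scores = {"A": 0.9, "B": 0.5, "C": 0.1}
rushing_scores = {"A": 0.2, "B": 0.8, "C": 0.5}
qb_grades = quarterback_grades(passing_scores, rushing_scores)
print(qb_grades)
```

In this toy example the best passer is also the weakest runner, so the passing score alone determines that player's grade while the best runner gains the full seven-point bonus.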