Decision Trees
A Sample Decision Tree
In this section, decision trees will be used on the serving dataset. A decision tree is a learning method that predicts the value of a target variable based on other variables. A typical decision tree has three types of nodes: the root node, internal nodes, and terminal nodes. The root node is the starting node and has no incoming edges. An internal node has exactly one incoming edge and two or more outgoing edges. Finally, a terminal node is an end node with one incoming edge and no outgoing edges.

Take the example of deciding whether to buy a car. As the sample tree shows, it starts with an initial split on color choice, then branches on other aspects such as mileage and make. At the end are the terminal nodes, which hold the target values of either yes or no. In principle, an almost unlimited number of decision trees can be built from the same data, because there are many ways to prioritize the variables, choose split thresholds, and include or exclude variables. So how does a decision tree determine the best way to split the data? Let's look at GINI, entropy, and information gain, three common ways to evaluate a split.
GINI impurity measures the probability that a randomly chosen instance at a node would be misclassified if it were labeled according to the node's class distribution. The lower the GINI value, the better; for a two-class node it ranges from 0 to 0.5. Similarly, entropy measures the randomness of the observations at a node, and for a two-class node it ranges from 0 to 1. See the key equations figure for the formulas of the two. Finally, there is information gain, which measures how much a split improves on its parent node: the larger the information gain, the better the split.
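To make the three measures concrete, here is a minimal R sketch that computes them by hand for a made-up two-class split. The counts and helper functions are illustrative only; they are not taken from the tennis data or from any package.

```r
# Toy two-class split: the counts below are illustrative only
gini    <- function(p) 1 - sum(p^2)                      # GINI impurity
entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))   # entropy in bits

parent <- c(10, 10) / 20   # parent node: 10 observations of each class
left   <- c(9, 3) / 12     # left child after a candidate split
right  <- c(1, 7) / 8      # right child after the same split

# Information gain = parent entropy minus the weighted child entropies
info_gain <- entropy(parent) -
  (12/20) * entropy(left) - (8/20) * entropy(right)

gini(parent)   # 0.5 -- the worst case for two classes
info_gain      # about 0.30 -- the larger, the better the split
```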
As an example, consider the GINI coefficients in the flower tree below. Every node is labeled with a GINI coefficient, which gives the probability of misclassification at that node. Remember, the lower the probability, the better the split; the candidate split that produces the lowest GINI values is chosen at each node.
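As a rough illustration of how a tree library reports these node-level values, here is a sketch using R's rpart package on the built-in iris data; this may not be the exact data behind the flower figure, but the mechanics are the same.

```r
library(rpart)
library(rpart.plot)

# Fit a classification tree on the built-in iris data using GINI splits
iris_fit <- rpart(Species ~ ., data = iris,
                  method = "class", parms = list(split = "gini"))

# rpart.plot labels each node with its class proportions;
# printcp() shows the complexity table used later for pruning decisions
rpart.plot(iris_fit)
printcp(iris_fit)
```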
That was a quick write-up on how decision trees work and the math behind them. Now, let's prep the data for decision tree analysis.
Key Equations for GINI, Entropy, and Information Gain
A Decision Tree containing GINI probabilities
Data Prep
Decision tree analysis was done exclusively in R on the serving data set. The clustering and association rule mining analyses showed a linkage between the grass and clay courts, and the association rules built on the serving data were the strongest, so the focus here is on serving. For this analysis, the response variable was the surface, and the predictor variables were all of the serving percentage statistics, plus aces and double faults. The response has to be a factor, so the surface column was converted to a factor type. Finally, the data had to be split into training and testing sets: seventy percent of the rows went into the training set and the remaining thirty percent into the testing set. The training set is used to build the model, while the testing set is used to evaluate its predictions. It is important that these two data sets are disjoint, for two main reasons: so the model can be tested on data it has not seen, and to prevent overfitting. The labels are also removed from the test set, since the model is trying to predict them; they are held aside for comparison. The code can be found in the code tab.
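The exact code lives in the code tab; as a rough sketch of the prep described above, assuming the serving data sits in a data frame called serving with a surface column, the factor conversion and 70/30 split might look like this:

```r
set.seed(123)  # for a reproducible split

# Assumed data frame and column names; the real ones are in the code tab
serving$surface <- as.factor(serving$surface)

# 70/30 train/test split on disjoint row indices
train_idx <- sample(seq_len(nrow(serving)),
                    size = floor(0.7 * nrow(serving)))
train <- serving[train_idx, ]
test  <- serving[-train_idx, ]

# Hold the test labels aside and drop them from the test set itself
test_labels  <- test$surface
test$surface <- NULL
```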
A small snippet of the training set (left) and testing set (right). Note that no row appears in both tables and that the testing data has no labels.
Analysis and Conclusion
Model 1 Decision Tree
Model 2 Decision Tree
Model 3 Decision Tree
Model 2 CP Graph
Model 1 Confusion Matrix
Three decision trees were plotted to see which would fit best. The first model simply used all of the data in the table, with no manipulation or balancing, and the splits were determined by GINI. The label distribution was roughly 40% hard, 33% clay, and 27% grass. This decision tree has a lot of nodes and branches. For all trees, the lower values of a split always branch toward the left. The confusion matrix is a table that compares the predicted label to the actual label for each test entry; the diagonal counts the correctly classified observations for each surface. This model reached an accuracy of 78%, which is pretty good. In this tree, double faults and first serve return percentage were the initial splits.
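A sketch of how the first model could be fit and scored, continuing the hypothetical train/test objects from the data-prep sketch above; the real code is in the code tab.

```r
library(rpart)
library(rpart.plot)

# Model 1: all predictors, default settings, GINI splits
fit1 <- rpart(surface ~ ., data = train,
              method = "class", parms = list(split = "gini"))
rpart.plot(fit1)

# Confusion matrix: predicted surface vs. the held-out labels
pred1 <- predict(fit1, newdata = test, type = "class")
conf1 <- table(Predicted = pred1, Actual = test_labels)
conf1
sum(diag(conf1)) / sum(conf1)   # overall accuracy
```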
For the second model, the initial model was given a cp of 0.018 to see if a simpler decision tree could be produced. The value came from the CP graph: the point chosen was the last one before the horizontal cutoff, which minimizes the error without underfitting the model. This decision tree was much easier to read than the initial one. The accuracy reported from the confusion matrix was 77%, not far off from the first model, and the added simplicity at almost no cost in accuracy makes this model the better choice.
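A sketch of that pruning step, reading the CP plot and cutting the first tree back at cp = 0.018 (continuing from the hypothetical fit1 above); re-fitting with rpart.control(cp = 0.018) would give the same tree.

```r
# Inspect the complexity parameter plot to pick a cp value
plotcp(fit1)

# Model 2: prune the first tree at the chosen cp
fit2 <- prune(fit1, cp = 0.018)
rpart.plot(fit2)

pred2 <- predict(fit2, newdata = test, type = "class")
conf2 <- table(Predicted = pred2, Actual = test_labels)
sum(diag(conf2)) / sum(conf2)   # accuracy of the pruned tree
```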
The third and final model was the same as the first, except the splits were determined by information gain instead of GINI. The cp was determined to be 0.22. The resulting tree was even simpler than the first and second models, but its accuracy dropped to 68%, a notable fall-off from the previous two. The information-gain tree was too simple to classify the surfaces well, so I would still choose the second model.
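A sketch of the third model, switching the split criterion to information gain (entropy) and pruning at cp = 0.22, again using the hypothetical objects from the earlier sketches.

```r
# Model 3: same formula, but splits chosen by information gain
fit3 <- rpart(surface ~ ., data = train,
              method = "class", parms = list(split = "information"))
fit3 <- prune(fit3, cp = 0.22)
rpart.plot(fit3)

pred3 <- predict(fit3, newdata = test, type = "class")
conf3 <- table(Predicted = pred3, Actual = test_labels)
sum(diag(conf3)) / sum(conf3)   # accuracy with information-gain splits
```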
In conclusion, there seems to be a heavy emphasis on the serve on grass courts. All of the grass nodes resulted from a high first serve return percentage and low double faults, and a high first serve win percentage was also associated with an entry being on grass. Therefore, it is essential to have a good serve on a grass court. This makes sense given that grass is the fastest surface with the lowest bounce, which strongly favors the server. The strategy for clay appears to be consistency: a good return of serve combined with low double faults is the way to go on that surface. Since clay is the slowest surface with the highest bounce, consistency is key there. As for the hard court, the results were all over the place. Winning your first serve seems important on hard courts, but aces don't really matter. The hard-court nodes also had many edges branching toward the higher values, which suggests the hard court is the most mistake-prone surface. Overall, it was interesting to use a decision tree to show the common serving characteristics of each surface, and it will be cool to see whether the other procedures back this up.
Model 2 Confusion Matrix
Model 3 Confusion Matrix
Model 3 CP Graph