Association Rule Mining
This section delves into the application of Association Rule Mining (ARM). Association Rule Mining is an unsupervised machine learning technique that uses rules to discover interesting relationships between two or more items in large datasets. These relationships generally take the form of if-then statements. ARM most commonly uses the apriori algorithm to generate these rules. The algorithm uses prior knowledge of the dataset to determine how frequently combinations of items occur. Once the frequency of all the candidate rules has been determined, the apriori algorithm calculates the support, confidence, and lift of each rule. Support is how frequently the items appear together in the data, while confidence measures the likelihood of the consequent items appearing given that the antecedent items are already present. Lift compares the observed confidence to the confidence that would be expected if the two sides of the rule were independent; a lift greater than one indicates a positive association, while a lift below one indicates a negative one. The lift measure is particularly important, since it can be used to judge the strength and direction of the relationship between two items. Association rule mining will be used in this project to see whether there are any significant relationships between surfaces and key statistics. Since this is a surface analysis, the key rules to find are those that link specific surfaces to specific statistics.
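The three measures above can be illustrated with a short, self-contained sketch. This is not the project's R/arules code; the transactions and item names below are made up purely to show how support, confidence, and lift are computed for a single rule {A} → {B}.

```python
# Toy transactions (hypothetical, for illustration only): each set is one
# player-match "basket" of discretized items.
transactions = [
    {"hard", "first_serve_high"},
    {"hard", "first_serve_high"},
    {"hard", "first_serve_low"},
    {"clay", "first_serve_high"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # P(rhs | lhs): support of the combined itemset over support of the lhs.
    return support(lhs | rhs, transactions) / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    # Observed confidence divided by the confidence expected if lhs and rhs
    # were independent; lift > 1 suggests a positive association.
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)

lhs, rhs = {"hard"}, {"first_serve_high"}
print(support(lhs | rhs, transactions))    # 0.5
print(confidence(lhs, rhs, transactions))  # 2/3
print(lift(lhs, rhs, transactions))        # (2/3) / (3/4) = 8/9
```

Here the lift is below one, so in this toy data the hard court and a high first serve percentage would actually be negatively associated.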
The Definitions of Support, Confidence, and Lift.
Apriori Algorithm.
Data Preparation
The raw dataset.
The dataset prepared for apriori.
To set up association rule mining, the dataset needs to be converted into transaction basket data. Transaction data is data where each observation is captured as a single instance, and basket data means that each row is a transaction listing the items it contains. It is also important to note that association rule mining is an unsupervised method, so the labels cannot be included in the final dataset. Before transforming the data, all qualitative variables other than surface were removed. For this portion of the project, the main dataset was split into a serving dataset and a hitting dataset, so that serving statistics are compared with each other and non-serving statistics are compared with each other. Since both of these datasets are numerical, the variables needed to be discretized into categories. Each numerical variable was sorted into “Low”, “Normal”, and “High” categories, using the bin boundaries produced by the discretize function in R. Once the variables were discretized, the labels were removed, leaving the datasets ready for apriori.
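The discretization step can be sketched as follows. The project used R's arules::discretize; this Python version mimics equal-frequency binning (one of the methods discretize supports), and the first serve percentages below are invented values for illustration.

```python
# Hedged sketch of the discretization step: equal-frequency binning of a
# numeric column into "Low" / "Normal" / "High" categories.
def discretize(values, labels=("Low", "Normal", "High")):
    # Rank the values, then split the ranks into len(labels) bins of
    # (roughly) equal size and label each original value by its bin.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [None] * len(values)
    per_bin = len(values) / len(labels)
    for rank, i in enumerate(order):
        bins[i] = labels[min(int(rank / per_bin), len(labels) - 1)]
    return bins

# Hypothetical first serve percentages for six player-matches.
first_serve_pct = [52.1, 61.3, 64.8, 58.0, 70.2, 66.5]
print(discretize(first_serve_pct))
# → ['Low', 'Normal', 'Normal', 'Low', 'High', 'High']
```

Each numeric column is replaced by its category labels, so every row becomes a basket of items such as “FirstServe=Normal”, which is the form apriori expects.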
Conclusion: Serving Dataset
The top 15 rules for support.
The top 15 rules for confidence.
The top 15 rules for lift.
After running apriori on the serving dataset, we printed the top 30 rules so that enough rules containing a playing surface would appear. The hard court produced the most rules, clay produced a few, and the grass court had only one. What is interesting to note is that every rule from the serving table involved the “Normal” category. This indicates that a serving statistic does not have to be in the “High” category for a player to succeed on a given surface, but it certainly cannot be in the “Low” category. This shows the importance of having an effective serve: since the serve usually starts the point, each individual service game generally favors the server. This seems particularly important on the hard court, since all of the serving percentage statistics have an association with the hard surface. Therefore, a consistent, effective serve is essential on hard courts. For the clay and grass courts, the only rule relates to the first serve win percentage. This is somewhat surprising for grass, since that surface is considered the fastest and has the lowest bounce. It could be due to the smaller sample size for the grass variable, or to players on tour going for more on their serves, making the percentages more variable. For clay it makes sense: the court is the slowest and has the highest bounce, which means the serve can be returned much more easily, so simply making the first serve is more valuable. All of the lifts in this table are greater than one, so each of these relationships is a positive association. In the end, all of the serving percentages are associated with the hard court, while only the first serve win percentage is associated with the grass and clay courts.
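The rule-inspection step described above can be sketched in a few lines. The rules and lift values here are toy, made-up examples (not the project's actual output); the sketch only shows the pattern of keeping surface rules and ranking them by a metric, as was done for the top-rule tables.

```python
# Illustrative sketch: filter mined rules to those whose consequent is a
# playing surface, then sort by lift (descending), as in the rule tables.
rules = [
    {"lhs": "FirstServeWin=Normal", "rhs": "Surface=Hard",  "lift": 1.4},
    {"lhs": "Aces=Normal",          "rhs": "Surface=Clay",  "lift": 1.1},
    {"lhs": "DoubleFaults=Normal",  "rhs": "Aces=Normal",   "lift": 1.2},
    {"lhs": "FirstServeWin=Normal", "rhs": "Surface=Grass", "lift": 1.3},
]

surface_rules = [r for r in rules if r["rhs"].startswith("Surface=")]
top = sorted(surface_rules, key=lambda r: r["lift"], reverse=True)
for r in top:
    print(f'{r["lhs"]} -> {r["rhs"]} (lift {r["lift"]})')
```

Sorting the same rule list by support or confidence instead of lift yields the other two tables.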
Conclusion: Hitting Dataset
The top 15 rules for support (top), confidence (middle), and lift (bottom).
For the hitting dataset, the hard court is associated with the “High” category for both forced and unforced errors. This tells us that the hard court is very prone to errors, possibly due to its speed and bounce. A fast ball with a high bounce suggests that players are more aggressive on hard courts, which increases the likelihood of errors. Since clay courts are slower and balls on grass courts do not bounce very high, it makes sense that errors are most frequent on hard courts. The hard court was also associated with a “Normal” first serve return percentage, while, interestingly, the clay court was associated with a “High” first serve return percentage. This follows from the speeds of the surfaces, since it is easier to return a serve on a slower court. On both surfaces, the return of serve is very important: a higher return rate gives players a better chance to win every point. Finally, it was surprising that winners were not associated with anything. It seems that tennis is a game of mistakes, where consistency is valued more than hitting spectacular shots.