Linear Regression
Linear regression is a predictive model that estimates a response variable as a linear combination of predictor variables in a data set. The first step is to choose a response variable and the predictors it will be modeled against. The analysis then proceeds in three steps. First, a correlation analysis identifies the direction and strength of the relationship between the variables. Next, the model estimates the line of best fit. Finally, the model is evaluated for validity with an overall p-value.
Linear regression has some limitations. First, it relies on four model assumptions: linearity, normality of the residuals, independence, and constant variance. It is important to check the diagnostic plots to ensure that these assumptions hold; otherwise, the coefficient estimates and p-values may not be trustworthy. The other limitation is multicollinearity, which occurs when predictor variables are strongly correlated with each other. Correlated predictors carry overlapping information, so the model cannot cleanly separate their individual effects on the response, and a variable can look significant or insignificant for the wrong reason. This is why it is important to check for multicollinearity before interpreting the results.
In this section, the serving and rallying tables for each surface are used to fit linear regression models and see which variables are most significant. The response is the number of matches won by each observation in the data set, and all of the other variables serve as the predictors.
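One common way to run the multicollinearity check mentioned above is the variance inflation factor (VIF). It regresses each predictor on all of the others and reports 1 / (1 - R^2). The sketch below is only an illustration, not the procedure used for the models in this section. The data is randomly generated, the column names are hypothetical stand-ins, and the cutoff of 10 is a widely used rule of thumb.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return one VIF per column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on the remaining predictors plus an intercept.
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return vifs


# Synthetic stand-in data (NOT the tennis data): the third column is built
# to nearly duplicate the first, so both should show inflated VIFs.
rng = np.random.default_rng(0)
aces = rng.normal(size=200)
double_faults = rng.normal(size=200)
first_serve_won = aces + 0.1 * rng.normal(size=200)

X = np.column_stack([aces, double_faults, first_serve_won])
vifs = variance_inflation_factors(X)

for name, vif in zip(["aces", "double_faults", "first_serve_won"], vifs):
    flag = "check" if vif > 10 else "ok"
    print(f"{name}: VIF = {vif:.1f} ({flag})")
```

A predictor with a high VIF is largely explained by the other predictors, so its coefficient and p-value should be read with caution.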
Simple Linear Regression
Data Prep
A snippet of the hard court serving table
Linear regression is run on seven data tables. The first contains all surfaces and all variables. The remaining six are the serving and rallying tables, subsetted by surface, so each surface has its own serving table and rallying table. To prepare the data, all qualitative variables were removed, and the surface column was dropped once the tables had been subsetted, because the data has to be entirely numerical for the analysis.
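The preparation steps above could be sketched roughly as follows, assuming the data lives in a pandas DataFrame. The table contents, the column names (`player`, `surface`, `aces`, `double_faults`, `matches_won`), and the helper `prepare_by_surface` are all invented for illustration and are not taken from the actual data set.

```python
import pandas as pd

# Hypothetical miniature serving table; every name and number is made up.
serving = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E", "F"],
    "surface": ["Hard", "Clay", "Grass", "Hard", "Clay", "Grass"],
    "aces": [310, 120, 95, 280, 101, 88],
    "double_faults": [90, 60, 22, 75, 48, 30],
    "matches_won": [40, 25, 9, 35, 22, 7],
})


def prepare_by_surface(table, surface_col="surface"):
    """Split a table by surface, then keep only the numeric columns."""
    prepared = {}
    for surface, subset in table.groupby(surface_col):
        # Drop the surface label after subsetting, then drop any other
        # qualitative (non-numeric) columns such as player names.
        numeric = subset.drop(columns=[surface_col]).select_dtypes(include="number")
        prepared[surface] = numeric.reset_index(drop=True)
    return prepared


by_surface = prepare_by_surface(serving)

# The all-surfaces table only needs its qualitative columns removed.
all_surfaces = serving.select_dtypes(include="number")

for surface, table in by_surface.items():
    print(surface, list(table.columns), len(table))
```

Running the same helper on a rallying table would give the second set of three per-surface tables.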
Analysis and Conclusion
The serving table (left) and rallying table (right) model summaries for the clay court.
The serving table (left) and rallying table (right) model summaries for the hard court.
The full model, all variables and surfaces.
The serving table (left) and rallying table (right) model summaries for the grass court.
The scatterplot and line of best fit for double faults vs wins, all surfaces.
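A single-predictor fit such as the one in the scatterplot follows the correlate, fit, and evaluate steps described at the start of this section. The sketch below walks through those steps with `scipy.stats.linregress`. The data is randomly generated with a built-in positive slope purely so the example runs; the generic names `predictor` and `response` are placeholders, and nothing here reflects the direction or size of the actual double-fault relationship.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (NOT the tennis data).
rng = np.random.default_rng(1)
predictor = rng.uniform(0, 100, size=60)
response = 5.0 + 0.3 * predictor + rng.normal(scale=4.0, size=60)

# Step 1: correlation analysis gives the direction and strength.
r = np.corrcoef(predictor, response)[0, 1]

# Step 2: estimate the line of best fit by least squares.
fit = stats.linregress(predictor, response)
fitted_line = fit.intercept + fit.slope * predictor

# Step 3: evaluate the model. With one predictor, the p-value for the
# slope is also the overall p-value of the model.
is_significant = fit.pvalue < 0.05

print(f"r = {r:.3f}")
print(f"response = {fit.intercept:.2f} + {fit.slope:.3f} * predictor")
print(f"p-value = {fit.pvalue:.3g}, significant at 0.05: {is_significant}")
```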
The first notable finding is that the subsetted rallying tables produced only one significant variable at a p-value below 0.05: unforced errors on the clay court. This means that on clay, the number of unforced errors a player commits is associated with the number of matches won, which suggests that consistency is important to success on that surface. This makes sense, since clay is the slowest of the three surfaces, but a definitive conclusion cannot be drawn yet because none of the other variables were significant. The other two surfaces did not produce any significant rallying variables. Taken at face value, these models find no evidence that how hard a player hits, how consistently a player hits, or how good a player is at the net separates winners from losers at the professional level. That does not prove these skills are irrelevant, only that the models could not detect their effect. Therefore, the conclusion is that there is no significant evidence that the rallying table has an effect on winning matches.
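The significance screen used throughout this analysis, keeping variables whose coefficient p-value is below 0.05, can be sketched as follows. This is a minimal illustration and not the code behind the model summaries. The helper `ols_p_values`, the variable names `var_a` to `var_c`, and the data are all hypothetical, and only the first variable is built to affect the response.

```python
import numpy as np
from scipy import stats


def ols_p_values(X, y):
    """Fit y on X with an intercept; return slopes and two-sided p-values."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = n - design.shape[1]
    # Residual variance and the coefficient standard errors.
    sigma2 = float(resid @ resid) / dof
    cov = sigma2 * np.linalg.inv(design.T @ design)
    std_err = np.sqrt(np.diag(cov))
    t_stats = beta / std_err
    p_values = 2.0 * stats.t.sf(np.abs(t_stats), dof)
    # Skip the intercept so the results line up with the columns of X.
    return beta[1:], p_values[1:]


# Synthetic stand-in data (NOT the tennis data).
rng = np.random.default_rng(2)
n = 150
names = ["var_a", "var_b", "var_c"]
X = rng.normal(size=(n, 3))
y = 10.0 + 2.0 * X[:, 0] + rng.normal(size=n)

coefs, pvals = ols_p_values(X, y)
significant = [name for name, p in zip(names, pvals) if p < 0.05]

for name, coef, p in zip(names, coefs, pvals):
    print(f"{name}: coefficient = {coef:.3f}, p = {p:.3g}")
print("significant at 0.05:", significant)
```

A p-value above the cutoff only means the model did not detect an effect; it is not evidence that the effect is zero.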
The serving table produced many more significant variables than the rallying table. Across all surfaces, double faults were significant. A double fault occurs when the server misses both the first and second serve on the same point, which gives the point to the opponent, so this result is not surprising. For the clay court, the other significant variables were the first and second serve winning percentages. While there could be different reasons for winning points on serve, it seems that holding serve is key on clay. Since clay is the slowest surface with the highest bounce, the consistency idea from the rallying table could hold here as well. Regardless of whether a serve is accurate or hit hard, winning points on serve is important on the clay court. The hard court has the same significant variables as clay, plus aces. This suggests that more aggressive servers benefit from a faster surface. Whereas the clay court rewards consistency, the hard court may promote more aggressive play and more risk taking. Finally, for the grass court, only the second serve winning percentage and double faults were significant. This is a surprising result, since the fastest surface produces the fewest significant serving variables. It could mean that results on grass are less predictable, or that grass favors players who are good at a little bit of everything.
Finally, an overall regression was fit without regard for surface. In the initial model, the significant variables were aces, serve return percentage, and double faults. Aces and double faults are self-explanatory, since aces are like free points and double faults are like giving points away. The serve return percentage is more interesting. Since the previous models show that the serve is important on most surfaces, it makes sense that being able to neutralize the opponent's serve raises the number of matches a player wins. Therefore, the most success in terms of matches won comes from the serve and return game. Novak Djokovic, arguably the greatest tennis player of all time, is famous for his return of serve, which is consistent with this result. In the end, there is no indication that the rallying statistics are predictive of success, but there is evidence that the service game influences match victories.