Support Vector Machines

An SVM

Kernels in SVMs

Support Vector Machines (SVMs) are supervised machine learning algorithms that analyze data for classification. The objective of an SVM is to find the hyperplane in n-dimensional space that maximizes the margin between the classes, where n is the number of variables in the analysis. This objective makes the SVM a linear separator: the best line or hyperplane splits the data into exactly two classes, so a single SVM only works for two-class problems. If the classification has more than two classes, multiple SVMs are necessary, and this can be done in two different ways. The first, one-versus-one, compares every pair of classes. For example, with classes A, B, and C, this method would compare A and B, A and C, and B and C. The second, one-versus-rest, compares each class against everything that is not that class: A versus not A, B versus not B, and C versus not C. However, what would happen if the data is not linearly separable?
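The two multiclass strategies above can be sketched in a few lines. This is a minimal illustration assuming scikit-learn, which the source does not name; the toy dataset stands in for the three classes A, B, and C.

```python
# Sketch of the two multiclass strategies: one-vs-one and one-vs-rest.
# scikit-learn and the toy dataset are assumptions for illustration.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Toy three-class problem standing in for classes A, B, and C.
X, y = make_classification(n_samples=150, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

# One-vs-one: a binary SVM for every pair (A vs B, A vs C, B vs C).
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

# One-vs-rest: one SVM per class (A vs not A, B vs not B, C vs not C).
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ovo.estimators_))  # pairwise classifiers: 3 for 3 classes
print(len(ovr.estimators_))  # one-vs-rest classifiers: 3 for 3 classes
```

With three classes both strategies happen to fit three models; with k classes, one-vs-one fits k(k-1)/2 models while one-vs-rest fits k.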

If the data is not linearly separable, a kernel can be used to separate it. An SVM uses a kernel function to implicitly map the input data into a higher-dimensional space, which is intended to make nonlinear data more separable. A kernel computes the dot product of two vectors in that transformed space without ever constructing the transformed points explicitly, which is what makes working in the higher-dimensional space practical. The two kernels utilized in this project are the polynomial kernel and the radial kernel.
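The two kernels can be written out directly. This is a hedged sketch: the parameter names (degree, coef0, gamma) follow common convention and are assumptions, since the source does not state the exact forms it used.

```python
import numpy as np

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    # Polynomial kernel: K(x, y) = (x . y + coef0)^degree
    return (np.dot(x, y) + coef0) ** degree

def radial_kernel(x, y, gamma=0.5):
    # Radial (RBF) kernel: K(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
print(polynomial_kernel(x, y))  # (11 + 1)^2 = 144
print(radial_kernel(x, y))      # exp(-0.5 * 8) = exp(-4)
```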

To show how a kernel and the dot product transform the data, let's run an example with a polynomial kernel of degree 2 and cost 1. The proof can be found in the code tab, and it shows how central the dot product is to kernel analysis.


Image of the prepared data set

Data Prep

A snippet of the training and testing data split

In this module, the hitting dataset will be used to see whether the hard court differs from the other two surfaces, grass and clay. Since the grass and clay courts have already been shown to be linked and similar to each other, their labels are merged and renamed "NotHard"; the other label is "Hard". This project will evaluate whether the hard court is also distinct from the other two surfaces from a rallying perspective, as has already been shown with the serving dataset. To prep this dataset, the data has to be entirely numeric, including the labels, because kernel transformations rely on the dot product, which is not defined for qualitative data; any non-numeric data must be removed. The dataset is already numeric other than the classes, so the labels were rewritten as either 1 (Hard) or -1 (NotHard). Finally, the data was split into a training and testing set with a 75-25 split, with 75 percent of the data in the training set. It is important that these sets be disjoint, so that the model is tested on data it has never seen before. The code can be found in the code tab. The prep was done in R, and the analysis in Python.
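The prep steps above can be sketched as follows. The column names and the toy rows are hypothetical, since the source does not show the dataset; the relabeling and the disjoint 75-25 split follow the description.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the hitting dataset; column names are hypothetical.
df = pd.DataFrame({
    "surface": ["Hard", "Clay", "Grass", "Hard"] * 5,
    "forced_errors": range(20),
    "unforced_errors": range(20, 40),
})

# Relabel the classes numerically: 1 for Hard, -1 for NotHard.
df["label"] = df["surface"].apply(lambda s: 1 if s == "Hard" else -1)

# Drop the qualitative column so every remaining feature is numeric.
X = df.drop(columns=["surface", "label"])
y = df["label"]

# Disjoint 75-25 split: 75 percent of the rows train, 25 percent test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
print(len(X_train), len(X_test))  # 15 5
```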


Analysis and Conclusion

Linear Kernel Confusion Matrix

Cost 5

Polynomial Kernel Confusion Matrix

Degree 3 and Cost 5

Radial Kernel Confusion Matrix

Cost 5

Scatterplot of Forced vs Unforced Errors on the polynomial kernel with degree 3, cost 5

The model was trained and tested on the data 11 times: three linear kernels, five polynomial kernels, and three radial kernels. For each kernel type, the model was tuned until the best possible model with that kernel was found. For the linear kernel, three models were built with costs of 1, 5, and 10. The accuracies were 68% with cost 1 and 70% for the other two models. This is decent, but could be better. A polynomial kernel was used next. The first three models had degree 2, and the last two had degree 3, with the cost varying across models. The degree-2 models with costs 1, 5, and 10 reached accuracies of 69%, 80%, and 79%, respectively, and the degree-3 models with costs 1 and 5 reached 69% and 85%. This was a significant improvement over the linear kernels, and the model with degree 3 and cost 5 was the best fit. Finally, a radial kernel was used with costs of 1, 5, and 10, giving 70% accuracy for the cost-1 model and 79% for both the cost-5 and cost-10 models. From this analysis, the polynomial model with degree 3 and cost 5 was the most accurate. This can be verified by a scatter plot of two variables, forced and unforced errors: the model performed much better the farther a point was from zero, and had more difficulty classifying points closer to zero. It is also interesting to note that hard-court points were more likely to lie farther from zero. This could indicate that the hard court is more likely to produce errors in a match than the other two surfaces.
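The sweep of 11 configurations described above can be reproduced schematically. This is a hedged sketch assuming scikit-learn's SVC, where the C parameter plays the role of the cost; the data here is synthetic, since the hitting dataset is not reproduced in the source, so the printed accuracies will not match the reported ones.

```python
# Sketch of the 11-model sweep: 3 linear, 5 polynomial, 3 radial kernels.
# scikit-learn and the synthetic data are assumptions for illustration.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The 11 kernel/cost configurations tried in the write-up.
configs = ([("linear", {"C": c}) for c in (1, 5, 10)]
           + [("poly", {"C": c, "degree": 2}) for c in (1, 5, 10)]
           + [("poly", {"C": c, "degree": 3}) for c in (1, 5)]
           + [("rbf", {"C": c}) for c in (1, 5, 10)])

for kernel, params in configs:
    model = SVC(kernel=kernel, **params).fit(X_train, y_train)
    pred = model.predict(X_test)
    print(kernel, params, round(accuracy_score(y_test, pred), 2))

# Confusion matrix for the configuration the write-up found best
# (polynomial kernel, degree 3, cost 5).
best = SVC(kernel="poly", degree=3, C=5).fit(X_train, y_train)
print(confusion_matrix(y_test, best.predict(X_test)))
```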

In conclusion, support vector machines did very well with the hitting dataset: the best model had an accuracy of 85%, and no model fell below 68%. This suggests that the hard court is different from the other two surfaces from a rallying perspective. From a tennis standpoint, groundstrokes may have more of an impact on the hard court than on the other two surfaces. This could make sense, since the hard court sits between the other surfaces: the clay court is the slowest with the highest bounce, while the grass court is the fastest with the lowest bounce. Falling in the middle, the hard court is probably a blend of the other two surfaces, yet it remains the most distinct. To truly understand the importance of each variable, a linear regression should be used to confirm these suspicions. This is what will take place in the next part.