Naive Bayes

Naive Bayes is a machine learning algorithm that uses Bayes' theorem to classify data. There are several variants of naive Bayes, one being Bernoulli naive Bayes, in which each feature is encoded as a binary indicator of whether an entry has that feature or not. In this project, multinomial naive Bayes is used to compute the probability of each class label for a given row and predict the most likely one. Some important equations are below. Without further ado, let’s get started.


Bayes' Theorem
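In its standard form, Bayes' theorem relates the posterior probability of a class A given evidence B to the likelihood, the prior, and the evidence:

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}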

Naive Bayes Probability
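Naive Bayes applies this with the "naive" assumption that the features x_1, ..., x_n are conditionally independent given the class y, so the predicted label is the one that maximizes the prior times the product of the per-feature likelihoods:

\hat{y} = \arg\max_{y}\; P(y) \prod_{i=1}^{n} P(x_i \mid y)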

Data Prep

The data for the multinomial naive Bayes analysis was cleaned in R, but the analysis itself was performed in Python. The data cleaning was exactly the same as for the decision tree analysis. For this analysis, the response variable was the surface, and the predictor variables were all of the serving percentage statistics, plus aces and double faults. The response has to be a factor, so the surface column was converted to a factor type. Finally, the data has to be split into training and testing sets. Seventy percent of the data went into the training set, while the remaining thirty percent went into the testing set. The training set is used to build the model, while the testing set is used to evaluate its predictions. It is important that these two sets are disjoint, for two main reasons: the model can then be tested on data it has not seen, and this helps prevent overfitting. The test set must also have its labels removed, since the model is trying to predict the label. The code can be found in the code tab.
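As a rough sketch of that split (the file name, the surface column name, and the use of scikit-learn's split helper are assumptions here; the project's actual code lives in the code tab), the Python side might look like:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the cleaned data exported from R (file name is a placeholder)
df = pd.read_csv("tennis_cleaned.csv")

# Response: surface; predictors: serving percentage stats, aces, double faults
y = df["surface"].astype("category")
X = df.drop(columns=["surface"])

# 70/30 split into disjoint training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.70, random_state=1
)
# y_test is held out separately, so the testing set itself carries no labels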


Snippets of the training and testing data. Notice that no row appears in both data sets, and that the test data does not contain the labels.

Analysis and Conclusion

Three models were fitted with multinomial naive Bayes: the first on the whole data set, the second on balanced data, and the third with the hard court entries removed. In the first model, all of the data was used, with a breakdown of roughly 40% hard court, 33% clay court, and 27% grass court. When naive Bayes was run and the confusion matrix was created, the accuracy was 48%. This is poor compared to the decision tree analysis. In an attempt to produce a more accurate model, the data was then balanced so that each surface represented roughly 33% of the entire dataset. Once the data was balanced, a new confusion matrix was produced. The accuracy for this model was still 48%, so it was no better than the full, unbalanced data set.

Finally, I tried a model without the hard court. This was chosen because the previous two matrices showed that the hard court was responsible for the poor accuracy: in the third row, the diagonal element was not greater than the sum of the other two entries in that row, meaning hard court matches were misclassified at least as often as they were classified correctly. Also, the clustering models had shown an apparent connection between the grass court and the clay court. Running this model, the confusion matrix gave an accuracy of 65%, a significant improvement. This indicates to me that the hard court is distinct from the other two surfaces and no apparent connections can be made with it, which supports the belief that the clay court and grass court are connected.

The final conclusion is that decision trees are better for conducting this analysis: they did a better job of predicting the class of each surface, with consistently higher accuracy. Overall, naive Bayes can still be a useful tool with the right data.
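As a hedged sketch of how the first and third models were compared (the label value "Hard" and the exact scikit-learn calls are assumptions based on the description above, not the project's own code):

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score

# Model 1: fit on the full, unbalanced training data
model = MultinomialNB()
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(confusion_matrix(y_test, preds))   # rows/columns: clay, grass, hard
print(accuracy_score(y_test, preds))     # about 0.48 in this project

# Model 2 (balanced data) would resample the training set so each surface
# makes up about a third of the rows before refitting; omitted for brevity.

# Model 3: drop the hard court rows and refit on the two remaining surfaces
train_mask = y_train != "Hard"
test_mask = y_test != "Hard"
model3 = MultinomialNB().fit(X_train[train_mask], y_train[train_mask])
preds3 = model3.predict(X_test[test_mask])
print(accuracy_score(y_test[test_mask], preds3))  # about 0.65 in this project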


The confusion matrices for models 1, 2, and 3, from left to right. Clay is row and column 1, grass is row and column 2, and hard is row and column 3.