DataPrep_EDA — Ian Bircak

UTS Dataset Cleaning and Preparation

Figure 1: The raw dataset

Figure 2: The dataset with the Surface identifier. Here, it is the Clay column.

Figures 3 and 4: The partially merged dataset, with all three surfaces combined for a single statistic.

Since this project asks how playing surfaces affect success in tennis, it was important to find a source that breaks statistics down by surface. Ultimate Tennis Statistics not only provides career statistics for tennis players, but also breaks those statistics down by the court surface on which they were recorded. Downloading the data posed one fundamental problem: the site shows only one statistic on one surface at a time. Since the study analyzes 12 statistics on 3 court surfaces, 36 datasets were downloaded. The main goal of the preparation was to merge all 36 datasets into one large dataset.

To start, the active column was removed because it is irrelevant to the analysis. Next, each individual dataset was labeled with its surface: “Grass”, “Clay”, and “Hard” columns were added so that it would be easy to identify which surface a statistic was recorded on. Then, for each statistic, the respective grass, clay, and hard tables were combined into a single table, reducing the total number of datasets from 36 to 12. Further cleaning steps followed, such as renaming, removing, and reordering columns. The integer and percentage fields were also converted to numeric values so that the final dataset can be analyzed.
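The labeling and stacking steps can be sketched in base R as follows. This is a minimal illustration, not the project's actual code: the table contents, the column names (name, value, active), and the helper label_surface are hypothetical, and the sketch uses a single surface column rather than the separate “Grass”, “Clay”, and “Hard” columns described above.

```r
# Hypothetical per-surface downloads for one statistic (first serve percentage).
# All names and values here are made up for illustration.
clay  <- data.frame(name = c("Player A", "Player B"),
                    value = c("61.2%", "58.9%"),
                    active = c(TRUE, FALSE), stringsAsFactors = FALSE)
grass <- data.frame(name = c("Player A", "Player B"),
                    value = c("63.0%", "60.1%"),
                    active = c(TRUE, FALSE), stringsAsFactors = FALSE)
hard  <- data.frame(name = c("Player A", "Player B"),
                    value = c("62.4%", "59.5%"),
                    active = c(TRUE, FALSE), stringsAsFactors = FALSE)

# Hypothetical helper: drop the unused column and tag rows with their surface.
label_surface <- function(df, surface) {
  df$active  <- NULL
  df$surface <- surface
  df
}

# Stack the three surface tables for this statistic into one table.
first_serve <- rbind(
  label_surface(clay,  "Clay"),
  label_surface(grass, "Grass"),
  label_surface(hard,  "Hard")
)

# Strip the percent sign and convert the text to numeric values.
first_serve$value <- as.numeric(sub("%", "", first_serve$value, fixed = TRUE))

# Give the statistic column a descriptive name.
names(first_serve)[names(first_serve) == "value"] <- "first_serve_pct"
```

Repeating this for each of the 12 statistics would yield 12 long-format tables, one per statistic.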

Finally, the 12 per-statistic datasets were merged into one large dataset using the merge function in R.
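One way such a multi-table merge might look is sketched below, using Reduce to apply merge pairwise across a list of tables. The key columns (name, surface), the three small tables, and the choice of a full outer join (all = TRUE) are assumptions made for illustration; the source does not state which keys or join type were used.

```r
# Hypothetical per-statistic tables keyed by player name and surface.
aces <- data.frame(
  name    = c("Player A", "Player A", "Player B"),
  surface = c("Clay", "Hard", "Hard"),
  aces    = c(120, 950, 400),
  stringsAsFactors = FALSE
)
first_serve <- data.frame(
  name    = c("Player A", "Player A", "Player B"),
  surface = c("Clay", "Hard", "Hard"),
  first_serve_pct = c(61.2, 62.4, 59.5),
  stringsAsFactors = FALSE
)
winners <- data.frame(
  name    = c("Player A", "Player B", "Player B"),
  surface = c("Hard", "Hard", "Grass"),
  winners = c(3000, 1800, 250),
  stringsAsFactors = FALSE
)

stat_tables <- list(aces, first_serve, winners)

# Merge the tables one pair at a time. all = TRUE keeps player/surface
# combinations that are missing from some tables and fills them with NA.
merged <- Reduce(
  function(x, y) merge(x, y, by = c("name", "surface"), all = TRUE),
  stat_tables
)
```

With an outer join, a player who appears in only some tables is kept with NA in the missing columns, which makes gaps visible instead of silently dropping rows.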

Figure 5: The cleaned, merged dataset.

SportScore ATP Rankings Dataset

Figure 6: Code to obtain the data

An API was used to obtain data on the current ATP World Tour Rankings. The statistics include each player's ranking, points, and the number of tournaments played this season. Figure 6 shows the code used to obtain the data, Figure 7 shows the raw data, and Figure 8 shows the cleaned dataset.
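The step from raw, nested API output to a flat table can be sketched in base R as below. The record structure and every field name (ranking, points, tournaments, team$name) are hypothetical stand-ins, not the actual SportScore response format, and the sketch starts from an already parsed list instead of making a real API call.

```r
# Hypothetical parsed API response: a list of nested records.
# Field names are assumptions, not the real SportScore schema.
raw <- list(
  list(ranking = 1, points = 9000, tournaments = 5,
       team = list(name = "Player A")),
  list(ranking = 2, points = 8500, tournaments = 6,
       team = list(name = "Player B"))
)

# Hypothetical helper: pull the fields of interest out of one record.
flatten_record <- function(rec) {
  data.frame(
    name        = rec$team$name,
    ranking     = rec$ranking,
    points      = rec$points,
    tournaments = rec$tournaments,
    stringsAsFactors = FALSE
  )
}

# Build one row per record and bind the rows into a single data frame.
rankings <- do.call(rbind, lapply(raw, flatten_record))
```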

Figure 7: The raw, unstructured data

Figure 8: The cleaned dataset

Exploratory Data Analysis

Figure 9: Barplot of the top 10 aces

Figure 10: Boxplot of Clay Court Winners

Figure 11: Boxplot of Grass Court Winners

Figure 12: Boxplot of Hard Court Winners

Figure 13: Pairwise Scatterplots for Percentage Variables

The exploratory data analysis starts with a bar chart of the top 10 ace hitters in the dataset. This plot is informative about the structure of the data: every player in the chart reached these totals on hard courts. This makes sense in context, because the vast majority of the season is played on hard courts. The clay court season is only two months long, and the tour plays on grass courts for only five weeks. It would therefore be wise to account for the length of each surface's season, potentially by using ratios rather than raw totals, to get more meaningful comparisons.
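The ratio idea can be illustrated with a small sketch that divides ace totals by matches played. The matches column is an assumption (the source does not say the dataset includes match counts), and all values are invented for illustration.

```r
# Hypothetical ace totals with an assumed matches-played column.
ace_totals <- data.frame(
  name    = c("Player A", "Player A", "Player B"),
  surface = c("Hard", "Grass", "Clay"),
  aces    = c(9000, 1500, 2000),
  matches = c(900, 100, 400),
  stringsAsFactors = FALSE
)

# Normalize raw totals by the number of matches on that surface.
ace_totals$aces_per_match <- ace_totals$aces / ace_totals$matches

# Rank by raw total and by the per-match rate to compare the two views.
by_total <- ace_totals[order(-ace_totals$aces), ]
by_rate  <- ace_totals[order(-ace_totals$aces_per_match), ]
```

In this made-up example the hard court row leads on raw totals, while the grass court row leads once totals are divided by matches played, which is the kind of shift a ratio is meant to reveal.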

To further investigate the influence of the hard court category, boxplots of the winners variable were produced for each court surface. The grass court and clay court boxplots are nearly identical in distribution and shape, but the hard court boxplot shows much greater variability. Its upper whisker is approximately four times as long as in the other two boxplots, and its outliers appear much more extreme. Removing the outliers should be considered, but context matters: there are significantly more hard court matches than matches on the other two surfaces.
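The visual comparison can be backed by numbers using boxplot.stats, which applies the same 1.5 times hinge spread rule that boxplot uses to flag outliers. The data below is invented for illustration and the column names are assumptions.

```r
# Hypothetical winners counts per surface (values are made up).
winners <- data.frame(
  surface = rep(c("Clay", "Grass", "Hard"), each = 6),
  winners = c(10, 12, 14, 15, 16, 18,
               9, 11, 13, 15, 17, 19,
              20, 40, 60, 80, 100, 900)
)

# Summarize spread, whisker position, and outlier count for each surface.
summary_by_surface <- lapply(
  split(winners$winners, winners$surface),
  function(x) {
    s <- boxplot.stats(x)  # stats = whisker/hinge/median values, out = outliers
    list(
      iqr           = IQR(x),
      upper_whisker = s$stats[5],
      n_outliers    = length(s$out)
    )
  }
)
```

Comparing n_outliers and upper_whisker across surfaces gives a concrete basis for deciding whether the hard court outliers reflect data problems or simply the larger number of hard court matches.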

To evaluate the percentage variables, a pairwise scatterplot matrix was produced to look for potential correlations. The chart suggests a positive linear relationship among the serve return percentage, first serve percentage, and second serve percentage. This suggests that the return of serve variable may be worth examining as a factor in player success across surfaces. The other notable feature is what appears to be a flat line for the net points variable when plotted against every other variable. This would indicate that coming to the net during points has little or no linear correlation with the other variables, so it may be an uninformative variable in the analysis. The remaining scatterplots show no clear pattern, which suggests weak or no linear correlation among the rest of the variables.
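The impressions from the scatterplots can be checked numerically with a correlation matrix. The sketch below uses invented values and assumed column names; it only illustrates how cor would quantify a strong positive relationship and a near-flat one.

```r
# Hypothetical percentage variables (values are made up for illustration).
pct <- data.frame(
  first_serve_pct  = c(58, 60, 62, 64, 66),
  second_serve_pct = c(48, 51, 52, 55, 56),
  return_pct       = c(35, 37, 38, 41, 42),
  net_points_pct   = c(66, 64, 65, 64, 66)
)

# Pearson correlation for every pair of columns.
cor_matrix <- cor(pct, use = "pairwise.complete.obs")

# Rounded copy for easier reading.
cor_rounded <- round(cor_matrix, 2)
```

A value near 1 confirms a strong positive linear relationship, while a value near 0 matches the flat-line pattern. Note that a truly constant column would have zero variance, and cor would return NA for it.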

Data Sources and Links

Below are the sources used to obtain the data for this project.

https://www.ultimatetennisstatistics.com/statsLeaders

https://rapidapi.com/tipsters/api/sportscore1/