Clustering Analysis
In this section, k-means clustering techniques will be used to analyze the dataset. Clustering is an unsupervised machine learning method used to discover whether groups or categories exist in a numerical data set. In general, an ideal clustering algorithm minimizes the intra-cluster distances while maximizing the inter-cluster distances. Distance metrics determine which data points belong to which cluster; they can be calculated with a Euclidean or a non-Euclidean metric, such as Manhattan or cosine distance. In this project, cosine similarity will be used as the distance measure. The goal of clustering in this project is to use partitional and hierarchical clustering techniques to determine whether groups can be identified in the unlabeled data. Partitional clustering divides the data points into non-overlapping groups, while hierarchical clustering builds clusters as a hierarchy, either top-down or bottom-up. Once the clustering is complete, the results will be examined for any groupings or associations in the data set.
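To make the distance choice concrete, here is a minimal sketch of computing pairwise cosine distances with scikit-learn. The small feature matrix `X` is made up purely for illustration and is not part of the project data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical feature matrix: rows are observations, columns are numeric features.
X = np.array([[3.0, 1.0, 0.5],
              [2.8, 1.2, 0.4],
              [0.2, 5.0, 3.1]])

# Cosine distance = 1 - cosine similarity; values near 0 mean two rows
# point in nearly the same direction in feature space.
D = cosine_distances(X)
print(np.round(D, 3))
```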
Data Preparation
A typical clustering data set requires only unlabeled numerical data. This was an easy data set to prepare: all that was needed was to remove the labels and all qualitative data. Below is an example of the raw data set, followed by the data set prepared for clustering analysis.
The master data set before cleaning
The data set prepared for clustering
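A minimal sketch of this preparation step, assuming the raw data lives in a pandas DataFrame. The file name `tennis_master.csv` and the label column `surface` are placeholders for whatever the master data set actually uses.

```python
import pandas as pd

# Hypothetical path and label column; the actual master data set differs.
raw = pd.read_csv("tennis_master.csv")

# Drop the label column, then keep only quantitative columns.
unlabeled = raw.drop(columns=["surface"], errors="ignore")
numeric = unlabeled.select_dtypes(include="number")

print(numeric.head())
```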
Analysis and Conclusion
Silhouette Score graph, indicating that the optimal number of clusters here is 6.
The clusters for k=5.
Hierarchical clustering for k=6.
The clusters for k=6.
The clusters for k=7.
The cluster groupings for 10 observations of each surface.
To start the partitional clustering analysis, the dimensionality of the data was reduced to two using principal component analysis (PCA). This was necessary to produce visuals for analysis. To determine the optimal number of clusters, the silhouette method was used. The resulting graph shows that 6 is the optimal number of clusters, since that is where the first local maximum occurs. The k-means algorithm was then run for 20 iterations with k=6 to find the best centroids, and the figure below shows the centroids for each cluster. When comparing k=6 to k=5 and k=7, the centroids for 5 and 7 clusters are not as well aligned with the data, so k=6 is optimal. The resulting clusters were then analyzed one by one. Interestingly, the clay and grass court variables were grouped into either cluster 2 (pale blue) or cluster 5 (yellow), while all of the other clusters were dominated by the hard court variable. This indicates that the clay and grass court variables are somehow related, whereas the hard court variable does not appear to be related to the other two. This will be something to keep in mind when going through the other machine learning methods.
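The workflow described above could look roughly like the sketch below, assuming the prepared numeric table is called `numeric`. This uses scikit-learn's standard (Euclidean) k-means on the two PCA components and simply takes the highest silhouette score, whereas the original analysis read the first local maximum off the silhouette plot; the random seed and plotting details are illustrative only.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Project the numeric features onto two principal components for plotting.
coords = PCA(n_components=2).fit_transform(numeric)

# Silhouette scores for a range of k; the peak suggests the cluster count.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=20, random_state=0).fit_predict(coords)
    scores[k] = silhouette_score(coords, labels)

best_k = max(scores, key=scores.get)

# Refit k-means at the chosen k and plot the clusters and their centroids.
km = KMeans(n_clusters=best_k, n_init=20, random_state=0).fit(coords)
plt.scatter(coords[:, 0], coords[:, 1], c=km.labels_, cmap="tab10", s=20)
plt.scatter(*km.cluster_centers_.T, c="black", marker="x", s=100)
plt.title(f"k-means clusters for k={best_k}")
plt.show()
```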
The hierarchical clustering also gave some insight into the optimal number of clusters. Using Ward's method, k=6 is a viable option, and in the visual for k=6 the data appears to be divided evenly into six groups. Three clusters also seemed reasonable, so that was attempted with the partitional plot; upon observation, those clusters were not as well defined, so k=6 remains the optimal number. Overall, the clustering analysis does not tell us very much about the dataset: the surfaces within each cluster varied significantly, so no strongly indicative relationship emerges.
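A brief sketch of the hierarchical step with Ward's method, reusing the hypothetical `coords` array from the previous sketch. The dendrogram truncation and the cut into six flat clusters are illustrative choices, not the project's exact settings.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Ward's method merges clusters so as to minimize the increase in
# total within-cluster variance at each step.
Z = linkage(coords, method="ward")

# Dendrogram to inspect candidate numbers of clusters visually.
dendrogram(Z, truncate_mode="lastp", p=30)
plt.title("Ward hierarchical clustering")
plt.show()

# Cut the tree into 6 flat clusters for comparison with k-means.
labels = fcluster(Z, t=6, criterion="maxclust")
```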
Overall, the main takeaway here is that there are similarities between the grass court and the clay court. Although the true meaning is yet to be determined, the fact that the two variables appear in the same clusters indicates that something is going on here. It will be interesting to see how they are connected in the other modules.