Tuesday, November 6, 2018

Data Science 101: Average Silhouette Coefficient

In this short entry I will talk about the average silhouette coefficient (ASC) which is a popular internal cluster validation measure. To be precise, the ASC is the average of the silhouette of a given dataset. We will consider a very specific dataset in this entry, which we shall call the Mouse dataset:


We will next cluster this dataset into three clusters using k-means. Furthermore, we will evaluate both the clustering result from k-means and the groundtruth clustering (namely, one "head" and two "ears") by means of the ASC:


What we observe is quite interesting. First of all, it can be seen that k-means fails to detect the groundtruth clustering, even though the clusters are separated. (See also here; it is argued that k-means prefers clusters of similar size, where size is taken in a Euclidean sense and not in the sense of equal number of datapoints.) Second, and more important, it is shown that the ASC for the "wrong" solution is larger (i.e., better) than the groundtruth.

As a second experiment, we projected the Mouse dataset in three-dimensional space and evaluated the ASC for the groundtruth clustering:



As it can be seen, the ASC differs from the ASC of the same cluster assignment in two-dimensional space -- ASC depends on the dimension of the dataset.

All this of course makes sense by recognizing that the ASC is distance-dependent. Since distances change when a dataset is projected in some higher-dimensional space, it is not surprising that the ASC changes as well. Furthermore, since k-means is a distance-based clustering technique, it is not surprising that the ASC of a k-means clustering is high. And finally, ASC will be a good indicator of cluster validity if the clusters in the dataset are distance-based (and not, e.g., density-, model-, or graph-based).

Related to this, in "Understanding of Internal Clustering Validation Measures" it is shown that k-means performs worse than Chameleon (Figure 6) on a very similar dataset (Figure 5); at least using Chameleon, the ASC is maximized by the correct number of clusters. This paper and the short analysis presented in this entry lead to the following questions:

  • Based on what cluster assumptions (distance, density, etc.) are different internal validation measures defined?
  • Given any internal validation measure, can we find a synthetic dataset for which the groundtruth clustering has a bad value, while an "obviously wrong" clustering has an extremely good value? I.e., can we find pathological examples for which a given internal validation measure fails? (This entry shows that the answer is positive for ASC.)
  • Given these pathological examples, can we show that their properties are in contrast with the cluster assumptions inherent to the considered internal validation measure?
Answering these questions will improve our understanding of these internal cluster validity measures and will help us choose the correct validity measure.