
1 Introduction

There are still many open challenges in the clustering task. These challenges become even more pressing in the current big data era, where data is collected from many sources at high speed. This paper focuses on answering the question: how to decide on the number of clusters k? Being one of the oldest questions in the clustering literature, it has been tackled by hundreds of researchers, and many solutions have been proposed. Among these solutions, the gap statistic is one of the most modern approaches. It is backed by a rigorous theoretical foundation and has been shown to outperform many other heuristic-based approaches such as the elbow or silhouette methods. However, there are still several drawbacks to the original design of the gap statistic, which limit its applicability in real applications. This paper introduces a new technique to mitigate those limitations and improve the effectiveness of the gap statistic in multiple dimensions. The gap statistic that uses the newly proposed technique is called the “new gap” for short. The following subsections review the related literature.

The Elbow Approach

The oldest method, called the ‘elbow’ method, was proposed to determine the number of clusters for the k-means clustering algorithm [6]. This is a visual method. The idea of the elbow method is to run the clustering method on the dataset for a range of values of k (for example, from 1 to 10), and for each value of k compute the clusters and an internal index (it could be the sum of squared errors (SSE), the percentage of variance explained, etc.). Then plot a line chart of the internal index against k. At some value of k the internal index drops dramatically, after which it reaches a plateau as k increases further. This is the best value of k we can expect. Figure 1 illustrates how the elbow method works. In Fig. 1, the line chart drops rapidly as k increases from 1 to 2 and from 2 to 3, and reaches an elbow at k = 3. After that, it decreases very slowly. Looking at the chart, the right number of clusters appears to be three, because that is the elbow of the curve.

Fig. 1. Identification of the elbow point

However, the elbow method does not always work well. Sometimes there is more than one elbow, or no elbow at all.
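For concreteness, a minimal sketch of the elbow procedure is shown below, assuming a Python environment with scikit-learn; the use of k-means and the `elbow_curve` name are illustrative choices rather than part of the original method description.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_curve(X, k_max=10, random_state=0):
    """Sum of squared errors (SSE) for k = 1 .. k_max."""
    sse = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        sse.append(km.inertia_)  # within-cluster sum of squares around the centroids
    return np.array(sse)

# Plot sse against k and pick the "elbow" by eye, e.g. k = 3 in Fig. 1.
```

The subjective step of picking the elbow by eye is exactly the step that the methods discussed next try to replace.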

Average Silhouette Approach

The average silhouette method computes the average silhouette of observations for different values of k [2, 3]. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k [7]. Given a clustering result with k clusters (k > 1), we can estimate how well an observation i is clustered by calculating its silhouette statistic \( s^{k} \left( i \right) \). Let a(i) be the average distance from observation i to the other points in its cluster, and b(i) be the average distance from observation i to the points in its nearest neighboring cluster; then the silhouette statistic \( s^{k} \left( i \right) \) is calculated by:

$$ s^{k} \left( i \right) = \frac{b\left( i \right) - a\left( i \right)}{{\hbox{max} \left\{ {a\left( i \right),b\left( i \right)} \right\}}} $$

A point is well clustered if \( s^{k} \left( i \right) \) is large. The average silhouette score \( avgS\left( k \right) \) gives an estimation of the overall clustering quality when clustering the dataset into k clusters:

$$ avgS\left( k \right) = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} s^{k} \left( i \right), $$

where n is the number of data points.

Therefore, we select the k that maximizes the average silhouette score. However, the average silhouette is only a heuristic metric and can be shown to perform poorly in many cases. Note that avgS(k) is not defined at k = 1.
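As a rough illustration, the following Python sketch selects k by maximizing avgS(k), assuming scikit-learn's `silhouette_score` and k-means; the helper name is ours.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_max=10, random_state=0):
    """Return the k in [2, k_max] maximizing avgS(k), plus all scores."""
    scores = {}
    for k in range(2, k_max + 1):              # avgS(k) is not defined at k = 1
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)  # average s^k(i) over all points
    return max(scores, key=scores.get), scores
```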

Hartigan Statistic

Hartigan proposed the statistic [1]:

$$ H\left( k \right) = \left( {\frac{{W_{k} }}{{W_{k + 1} }} - 1} \right)\left( {n - k - 1} \right), $$

where \( W_{k} \) is the pooled within-cluster sum of squares around the cluster means. The formula to calculate \( W_{k} \) is given in the next section about the gap statistic.

The idea is to start with k = 1 and keep adding a cluster as long as H(k) is sufficiently large. Hartigan suggested 10 as the “sufficiently large” cut-off. Hence the estimated number of clusters is the smallest \( k \ge 1 \) such that \( H\left( k \right) \le 10 \).
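A minimal sketch of this rule is given below. It assumes the multiplicative form of Hartigan's statistic shown above and uses the k-means inertia as \( W_{k} \); the original proposal is not tied to a particular clustering algorithm, so these are illustrative choices.

```python
from sklearn.cluster import KMeans

def hartigan_best_k(X, k_max=10, threshold=10.0, random_state=0):
    """Smallest k with H(k) <= threshold; W_k is taken as the k-means inertia."""
    n = X.shape[0]
    W = [KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X).inertia_
         for k in range(1, k_max + 2)]              # W[0] = W_1, ..., W[k_max] = W_{k_max+1}
    for k in range(1, k_max + 1):
        H = (W[k - 1] / W[k] - 1.0) * (n - k - 1)   # Hartigan's statistic
        if H <= threshold:
            return k
    return k_max
```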

Gap Statistic

The gap statistic was introduced in 2001 by Tibshirani et al. [4] and is still a state-of-the-art method for estimating k. It has been shown to outperform the elbow, average silhouette, and Hartigan methods on both synthesized and real datasets [4, 5]. The method works by assuming a null reference distribution. It then compares the change in within-cluster dispersion with the change expected under the null distribution. If at k = K the within-cluster dispersion starts decreasing more slowly than the expected rate under the reference distribution, the gap statistic returns K as the estimated number of clusters. The formal definition of the gap statistic is given as follows:

Let \( d_{ij} = \left\| {x_{i} - x_{j} } \right\|^{2} \) denote the squared Euclidean distance between observations i and j, and let \( D_{r} \) be the sum of the pairwise distances over all points in a given cluster \( C_{r} \) containing \( n_{r} \) points.

$$ D_{r} = \mathop \sum \limits_{{i \in C_{r} }} \mathop \sum \limits_{{j \in C_{r} }} d_{ij} $$

Then the measure of cluster compactness \( W_{k} \) is the pooled within-cluster sum of squares around the cluster means:

$$ W_{k} = \mathop \sum \limits_{r = 1}^{k} \frac{1}{{2 n_{r} }}D_{r} $$

For a given k, the purpose of clustering is to find the partition that minimizes \( W_{k} \). As k increases, \( W_{k} \) decreases, but the rate at which \( W_{k} \) decreases also slows down. The idea of the elbow method is to choose the k corresponding to the “elbow”, i.e., the point beyond which adding clusters no longer gives a significant improvement in goodness-of-fit. The problems with the elbow method are that there is no reference clustering to compare against, and that the differences \( W_{k} - W_{k - 1} \) are not normalized for comparison.

The main idea of the gap statistic is to standardize the graph of \( \log \left( {W_{k} } \right) \) by comparing it with its expectation under an appropriate null reference distribution of the data. The estimate of the optimal number of clusters is then the value of k for which \( \log \left( {W_{k}^{data} } \right) \) falls the farthest below the reference curve \( \log \left( {W_{k}^{null} } \right) \):

$$ Gap_{n} \left( k \right) = E_{n}^{ *} \left\{ {\log \left( {W_{k}^{null} } \right)} \right\} - \log \left( {W_{k}^{data} } \right) $$

Here \( E_{n}^{ *} \) denotes the expectation under a sample of size n from the reference distribution. We estimate \( E_{n}^{ *} \left\{ {\log \left( {W_{k}^{null} } \right)} \right\} \) by an average of B copies of \( \log \left( {W_{k}^{null} } \right) \), each computed from a Monte Carlo sample drawn from the reference distribution. Each Monte Carlo sample b is clustered into k groups to obtain \( \log W_{kb}^{null} \), b = 1, 2, …, B, k = 1, 2, …, K. The (estimated) gap statistic is then computed as:

$$ Gap\left( k \right) = \frac{1}{B}\mathop \sum \limits_{b = 1}^{B} \log \left( {W_{kb}^{null} } \right) - \log \left( {W_{k}^{data} } \right) $$

The \( \log W_{kb}^{null} \) values from the B Monte Carlo replicates have a standard deviation \( sd\left( k \right) \) which, after accounting for the simulation error, is turned into the quantity

$$ s_{k} = \sqrt {1 + \frac{1}{B}} \cdot sd\left( k \right) $$

Finally, the optimal number of clusters K is the smallest k such that

$$ Gap\left( k \right) \ge Gap\left( {k + 1} \right) - s_{k + 1} $$

This rule for selecting k was presented in the original gap statistic paper and is called the “Tibs2001SEmax” rule in the R clustering implementation of the gap statistic. Since 2001, several alternatives to this rule have been proposed, such as the “firstSEmax” rule [8] and the “globalSEmax” rule [9]. In this study, the Tibs2001SEmax rule was used as the baseline approach in all experiments. In this paper, the term “gap statistic” refers to the function Gap(k) with Tibs2001SEmax as the k-selection rule.
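The sketch below illustrates the Gap(k) computation and the Tibs2001SEmax rule. For brevity it uses k-means instead of PAM and a uniform reference distribution over the range of each feature (one of the reference choices discussed by Tibshirani et al.); the function names and the 1-indexed array convention are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def _log_wk(X, k, random_state=0):
    """log of the within-cluster dispersion W_k (here: k-means inertia)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    return np.log(km.inertia_)

def gap_statistic(X, k_max=10, B=50, seed=0):
    """Return 1-indexed arrays gap[1..k_max] and s[1..k_max] (index 0 unused)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)          # uniform reference over feature ranges
    gap = np.zeros(k_max + 1)
    s = np.zeros(k_max + 1)
    for k in range(1, k_max + 1):
        log_wk_data = _log_wk(X, k)
        log_wk_null = np.array([_log_wk(rng.uniform(lo, hi, size=X.shape), k)
                                for _ in range(B)])
        gap[k] = log_wk_null.mean() - log_wk_data
        s[k] = np.sqrt(1.0 + 1.0 / B) * log_wk_null.std()   # s_k = sqrt(1 + 1/B) * sd(k)
    return gap, s

def tibs2001semax(gap, s, k_max):
    """Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}."""
    for k in range(1, k_max):
        if gap[k] >= gap[k + 1] - s[k + 1]:
            return k
    return k_max
```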

Figure 2 provides an example of how the gap statistic works. Figure 2a plots the example dataset with two well-separated clusters. Figure 2b shows the within sum of squares \( W_{k}^{data} \), which trends downward as the number of clusters k increases. Figure 2c shows the expected log dispersion \( \log \left( {W_{k}^{null} } \right) \) under an assumed null distribution (a uniform distribution in this case). Figure 2d shows the gap statistic, which is calculated by subtracting \( \log \left( {W_{k}^{data} } \right) \) from the expected \( \log \left( {W_{k}^{null} } \right) \). The optimal k is the smallest k such that there is a significant chance that Gap(k) is higher than Gap(k + 1), which is k = 2 in this case. Tibshirani used one standard deviation \( s_{k + 1} \) to determine when the chance is significant.

Fig. 2. How the gap statistic works on a dataset with two well-separated clusters

2 Methodology

Although backed by a rigorous theoretical foundation (unlike heuristic-based methods such as elbow or silhouette), the gap statistic still has several drawbacks that limit its applicability in practical applications. In this section, we conduct several experiments with synthesized datasets to demonstrate those limitations. Based on the insights learned from those experiments, we then introduce a new technique to improve the gap statistic.

2.1 The Gap Statistic Limitations

By design, the gap statistic can only work well when all the clusters in the dataset are well separated from each other. However, this is rarely the case in practice, where clusters usually overlap to a certain degree. This “non-overlapping” assumption is one of the main reasons that limit the effectiveness of the gap statistic in real applications. Figure 3 shows how the gap statistic fails to identify the correct K on simple synthesized datasets in which the clusters only barely overlap each other.

Fig. 3. Overlapping clusters problem with the gap statistic

  1. (a)

    the ovl2Gauss dataset: 400 data points in 2 dimensions, sampled equally from two 2D Gaussian distributions: \( {\mathcal{N}}\left( {\left[ {\begin{array}{*{20}c} 0 \\ 0 \\ \end{array} } \right],\left[ {\begin{array}{*{20}c} 1 & {0.7} \\ {0.7} & 1 \\ \end{array} } \right]} \right) \) and \( {\mathcal{N}}\left( {\left[ {\begin{array}{*{20}c} 4 \\ 0 \\ \end{array} } \right],\left[ {\begin{array}{*{20}c} 1 & { - 0.7} \\ { - 0.7} & 1 \\ \end{array} } \right]} \right) \).

  2. (b)

    the gap statistic with the Tibs2001SEmax rule suggests k = 3 instead of 2 for ovl2Gauss.

  3. (c)

    the ovl3Gauss dataset: 600 data points in 2 dimensions, sampled equally from three 2D Gaussian distributions: \( {\mathcal{N}}\left( {\left[ {\begin{array}{*{20}c} 0 \\ 0 \\ \end{array} } \right],\left[ {\begin{array}{*{20}c} 1 & {0.7} \\ {0.7} & 1 \\ \end{array} } \right]} \right) \), \( {\mathcal{N}}\left( {\left[ {\begin{array}{*{20}c} 0 \\ 8 \\ \end{array} } \right],\left[ {\begin{array}{*{20}c} 1 & {0.7} \\ {0.7} & 1 \\ \end{array} } \right]} \right) \), and \( {\mathcal{N}}\left( {\left[ {\begin{array}{*{20}c} 0 \\ 4 \\ \end{array} } \right],\left[ {\begin{array}{*{20}c} 1 & { - 0.7} \\ { - 0.7} & 1 \\ \end{array} } \right]} \right) \).

  4. (d)

    the gap statistic with the Tibs2001SEmax rule suggests k = 4 instead of 3 for ovl3Gauss.

However, clusters should not overlap each other too much; otherwise, the notion of a “cluster” becomes very fuzzy. This is because the data density in the overlapping area is the sum of the densities of the two clusters in that area, which can potentially make the overlapping area appear as another cluster. In some applications we indeed want to recognize that overlapping region as a cluster, while in other applications that behavior is unexpected. Figure 4 illustrates this confusion in the case of two strongly overlapping clusters.

Fig. 4. Two strongly overlapping clusters can be correctly seen as one, two, or three clusters.

Besides the non-overlapping assumption, the gap statistic also assumes that there is no hierarchical clustering structure in the dataset, i.e., no cluster consists of many smaller clusters. In addition, the gap statistic requires a lot of computing power to estimate the expected \( W_{k} \) under the null reference distribution, \( E_{n}^{ *} \left\{ {\log \left( {W_{k} } \right)} \right\} \). It has to sample the null reference distribution B times \( \left( {B \ge 50} \right) \), and for each sample b it runs the clustering algorithm. In this case the clustering algorithm is PAM, which takes \( O\left( {n^{2} } \right) \) time, where n is the number of data points. In total, the complexity of estimating \( E_{n}^{ *} \left\{ {\log \left( {W_{k} } \right)} \right\} \) is \( O\left( {Bn^{2} } \right) \). This makes it impractical to apply the gap statistic to datasets with more than a few thousand data points.

2.2 The New Gap

As described in the previous section, the gap statistic method has three main limitations. In this paper, we focus only on the overlapping issue to produce the new gap; the other limitations will be addressed in future research.

The 1stDaccSEmax Rule for Overlapping Clusters.

The Tibs2001SEmax rule returns the smallest k such that the gap at that point has a significant chance (one standard error) of being higher than the next gap. As shown in the previous section, this rule is very sensitive to overlapping clusters. In fact, when there are overlapping clusters in the dataset, the gap does not decrease but increases slightly after \( k = K \) (where K is the real number of clusters in the dataset). This results in an over-estimation of K.

Therefore, instead of using the gap statistic directly, we propose to use the deceleration of the gap statistic (Dacc statistic for short). The Dacc is calculated as follows:

$$ \begin{aligned} Dacc\left( k \right) & = \left[ {Gap\left( k \right) - Gap\left( {k - 1} \right)} \right] - \left[ {Gap\left( {k + 1} \right) - Gap\left( k \right)} \right] \\ & = 2Gap\left( k \right) - Gap\left( {k - 1} \right) - Gap\left( {k + 1} \right) \\ \end{aligned} $$
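In code, Dacc(k) is a direct translation of this definition; the sketch below builds on the 1-indexed gap array returned by the `gap_statistic` sketch above (that array convention is our assumption).

```python
import numpy as np

def dacc(gap):
    """Dacc(k) = 2*Gap(k) - Gap(k-1) - Gap(k+1), for k = 2 .. k_max - 1.

    `gap` is a 1-indexed array as returned by the gap_statistic sketch above."""
    d = np.full_like(gap, np.nan)                 # undefined entries stay NaN
    for k in range(2, len(gap) - 1):
        d[k] = 2.0 * gap[k] - gap[k - 1] - gap[k + 1]
    return d
```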

Figure 5 shows how the Dacc(k) statistic can be computed from the Gap(k) statistic.

Fig. 5. How to compute Dacc(k) from Gap(k)

We designed this statistic based on the insight that, as k goes from 1 to K, Gap(k) increases at a constant or accelerating rate up to the point where k = K. At that point, Gap(k) suddenly slows down its rate of increase or starts to decrease. Figure 6 illustrates what Dacc(k) looks like in different scenarios.

Fig. 6. The \( {\text{Dacc}}\left( k \right) \) value in different scenarios

Figure 6(a–c) Different cases where \( Dacc\left( k \right) < 0 \), \( Dacc\left( k \right) = 0 \), and \( Dacc\left( k \right) > 0 \).

Figure 6(d) In a dataset with K non-overlapping clusters: Gap(k) increases when \( k < K \), reaches its first local maximum at \( k = K \), and starts decreasing at \( k = K + 1 \). Therefore, \( k = K \) is also the first local maximum of Dacc(k).

Figure 6(e) In a dataset with K clusters where some clusters slightly overlap each other: Gap(k) still increases from k = K to k = K + 1, so Gap(k = K) is no longer the first local maximum. However, since the overlapping area is small (slightly-overlapping assumption), the increase from Gap(K) to Gap(K + 1) is significantly smaller than the increase from Gap(K − 1) to Gap(K), so the Dacc statistic still reaches its maximum at k = K. Therefore, Dacc(k) is more robust than Gap(k) on a dataset with slightly overlapping clusters.

Figure 6(f) In a dataset with K clusters where some clusters strongly overlap each other: the boundary between clusters becomes very fuzzy. Two strongly overlapping clusters can be correctly considered as one, two, or three clusters. Therefore, both the Dacc and the Gap statistics behave unpredictably in this case.

To take into account the sampling error that occurs when estimating the expected \( W_{k} \) under the null distribution, we incorporate the standard error \( s_{k} \) into \( Dacc\left( k \right) \) to obtain \( DaccSE\left( k \right) \) as follows:

$$ \begin{aligned} DaccSE\left( k \right) = & \left[ {\left( {Gap\left( k \right) - 0.5s_{k} } \right) - \left( {Gap\left( {k - 1} \right) + 0.5s_{k - 1} } \right)} \right] \\ & - \left[ {\left( {Gap\left( {k + 1} \right) + 0.5s_{k + 1} } \right) - \left( {Gap\left( k \right) - 0.5s_{k} } \right)} \right] \\ \end{aligned} $$
$$ DaccSE\left( k \right) = 2Gap\left( k \right) - Gap\left( {k - 1} \right) - Gap\left( {k + 1} \right) - 0.5s_{k - 1} - 0.5s_{k + 1} - s_{k} $$

As we can see, the higher the sampling errors at k − 1, k, or k + 1, the more DaccSE penalizes the Dacc estimate. Note that we used half a standard error in the DaccSE(k) formula; a different factor can be chosen depending on how “aggressive” or “conservative” we want DaccSE to behave. Figure 7 illustrates how DaccSE(k) is calculated. While Dacc is calculated from the green line, DaccSE is calculated from the dashed orange line. DaccSE penalizes the Gap(k − 1), Gap(k), and Gap(k + 1) estimates according to how big \( s_{k - 1} \), \( s_{k} \), and \( s_{k + 1} \) are.
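In code, DaccSE(k) is a small extension of the Dacc sketch above; the `se_factor` parameter below is our name for the standard-error factor (0.5 as discussed above) and is an illustrative knob, not part of the formal definition.

```python
import numpy as np

def dacc_se(gap, s, se_factor=0.5):
    """DaccSE(k): Dacc(k) penalized by the simulation errors at k-1, k, and k+1."""
    d = np.full_like(gap, np.nan)
    for k in range(2, len(gap) - 1):
        d[k] = (2.0 * gap[k] - gap[k - 1] - gap[k + 1]
                - se_factor * s[k - 1] - se_factor * s[k + 1]
                - 2.0 * se_factor * s[k])          # with se_factor = 0.5 this is s_k
    return d
```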

Fig. 7. How DaccSE(k) is derived from Gap(k) and \( s_{k} \).

The Gap(k) chart can have multiple peaks, especially when the dataset has a hierarchical clustering structure. Therefore, instead of selecting the k where DaccSE(k) reaches its global maximum, we select the k where DaccSE(k) reaches its first local maximum. This is similar to the first-local-maximum idea behind the Tibs2001SEmax rule introduced in the original gap paper. This new rule is called the 1stDaccSEmax rule. Generally, the 1stDaccSEmax rule keeps looking for the k with the highest positive DaccSE, with k running sequentially from k = 2 to k = kmax, and stops at the point where \( Gap\left( k \right) \) is higher than \( Gap\left( {k + 1} \right) - s_{k + 1} \). Figure 8 shows how the 1stDaccSEmax rule works in different situations.

Fig. 8. How the 1stDaccSEmax rule works for different kinds of Gap charts

Note that although the DaccSE(k) statistic is not defined at k = 1, the 1stDaccSEmax rule can still detect when there is no cluster structure in the dataset. This can happen in two situations, illustrated in Fig. 8. In Fig. 8b, Gap(1) > Gap(2) by a margin bigger than \( s_{2} \); therefore, we stop looking for k right at the beginning and return k = 1. In Fig. 8c, all the DaccSE values are negative (there is no k at which the gap slows down); therefore, we also return k = 1 in this case.
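Putting the pieces together, the following sketch implements the 1stDaccSEmax rule as we read it from the description above; the handling of ties and of the boundary cases in Fig. 8 are our assumptions, and the sketch reuses the `gap_statistic` and `dacc_se` helpers defined earlier.

```python
import numpy as np

def first_dacc_se_max(gap, s, k_max):
    """Scan k = 2 .. k_max - 1, keep the k with the highest positive DaccSE seen so
    far, and stop as soon as Gap(k) >= Gap(k+1) - s_{k+1}."""
    if gap[1] >= gap[2] - s[2]:               # the gap already drops after k = 1 (Fig. 8b)
        return 1
    dse = dacc_se(gap, s)                     # from the DaccSE sketch above
    best_k, best_val = 1, 0.0
    for k in range(2, k_max):
        if np.isfinite(dse[k]) and dse[k] > best_val:
            best_k, best_val = k, dse[k]
        if gap[k] >= gap[k + 1] - s[k + 1]:   # stopping condition
            break
    return best_k                             # stays at 1 if no positive DaccSE exists (Fig. 8c)
```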

Figure 9 shows the effectiveness of the 1stDaccSEmax rule on synthesized datasets with overlapping clusters.

Fig. 9. Applying the 1stDaccSEmax rule to synthesized datasets with overlapping clusters.

Figure 9(a) The ovl2Gauss dataset.

Figure 9(b) Tibs2001SEmax suggests k = 3 because Gap(k) still increases from Gap(2) to Gap(3) due to the overlapping. The 1stDaccSEmax rule correctly predicts k = 2, because the gap's increase at k = 3 is much smaller than its increase at k = 2, so DaccSE peaks at k = 2.

Figure 9(c) The ovl3Gauss dataset.

Figure 9(d) The Tibs2001SEmax rule wrongly predicts k = 4 due to the overlapping issue; the 1stDaccSEmax rule correctly predicts k = 3.

3 Conclusion

This study focuses on improving the gap statistic for the task of predicting the number of clusters k of a dataset. It identifies and demonstrates three main limitations of the gap statistic: the overlapping clusters problem, the hierarchical clustering structure problem, and the big dataset problem. Based on these insights, we proposed a new technique to tackle the overlapping problem: the 1stDaccSEmax rule. The performance of the new method is evaluated on several synthetic datasets, where it correctly recovers the number of clusters in cases where the original gap statistic fails; we expect the new gap method to also compare favorably with the other traditional approaches. Further numerical experiments will be carried out on several real datasets, together with new techniques to overcome the remaining limitations of the gap statistic.