Abstract
Most density-based clustering methods have difficulty detecting clusters of hugely different densities in a dataset. A recent density-based clustering method, CFSFDP, appears to have mitigated the issue. However, by formalising the condition under which it fails, we reveal that CFSFDP still has the same issue. To address it, we propose a new measure called Local Contrast, as an alternative to density, to find cluster centres and detect clusters. We then apply Local Contrast to CFSFDP to create a new clustering method called LC-CFSFDP, which is robust in the presence of varying densities. Our empirical evaluation shows that LC-CFSFDP outperforms CFSFDP and three other state-of-the-art variants of CFSFDP.
Introduction
Data clustering is a technique to group a dataset into a number of subsets based on a “natural” hidden data structure (Cherkassky and Mulier 2007). To capture the underlying data structures, traditional clustering techniques such as the Expectation–Maximization (EM) algorithm (Dempster et al. 1977) assume specific probability distributions as the source from which the dataset is generated. In comparison, density-based methods are attractive due to their nonparametric characteristic, which enables them to deal with arbitrarily shaped clusters (Jain 2010). They rely on spatially varying densities for the detection of clusters: high-density regions are identified as clusters, which are separated by regions of low density (Han and Kamber 2011).
However, most density-based methods have difficulty detecting all clusters when the clusters have large variations in density (Ertöz et al. 2003a; Zhu et al. 2016). For example, DBSCAN (Ester et al. 1996), which uses a global density threshold to discriminate cluster core points from noise, fails to identify all clusters in the presence of greatly varying densities (Zhu et al. 2016).
Many efforts have been devoted to solving the varying-density problem in DBSCAN-like algorithms. Shared-Nearest-Neighbours (SNN) (Jarvis and Patrick 1973; Ertöz et al. 2003a) is an effective technique to this end. It uses the number of shared nearest neighbours between two points as a similarity measure to replace distance in the clustering procedure. Yet, the performance of SNN is sensitive to the number of nearest neighbours used in its similarity calculations (Brito et al. 1997; Ertöz et al. 2003; Tan and Wang 2013). ReScale (Zhu et al. 2016) is a recently proposed approach to tackle the same problem. It rescales a dataset such that the estimated density of a rescaled data point approximates the density ratio of the corresponding point in the original dataset. However, ReScale does not perform well when clusters overlap significantly on some attributes (Zhu et al. 2016).
A more recent density-based clustering method called Clustering by Fast Search and Find of Density Peaks (CFSFDP) (Rodriguez and Laio 2014) employs a density-based approach different from DBSCAN. Instead of finding core points using a global threshold in the first step, it finds the density peak of every cluster and then links the neighbouring points of each peak to form a cluster. CFSFDP overcomes some issues of varying densities found in earlier density-based clustering algorithms (e.g., DBSCAN).
While the condition under which DBSCAN fails to detect all clusters has been formalised recently (Zhu et al. 2016), whether such a condition exists for the more recent density-based method CFSFDP is still unknown. We formalise a necessary condition for CFSFDP to identify all clusters in a dataset, and show that large variations in density are still problematic for CFSFDP.
We propose a new measure called Local Contrast (LC), as an alternative to density, to make density-based clustering algorithms robust against varying densities. The proposed LC is not too sensitive to its parameter setting, and is able to achieve high clustering performance with a default setting.
Though the proposed LC is built on top of a density estimator, it has the following unique theoretical properties:

The local modes and local minima of the LC distribution are also the local modes and local minima of the density distribution of the same dataset.

The local modes of LC have the same constant LC value, irrespective of the density values of the local modes.

The local minima of LC have zero LC value, irrespective of the density values of the local minima.
We utilise LC to create a new version of CFSFDP, named LC-CFSFDP. We show that the new clustering method LC-CFSFDP is more robust against varying densities than CFSFDP.
To benchmark the proposed LC-CFSFDP, we apply SNN and ReScale (which are existing remedies for the density variation issue) to CFSFDP, creating two improved variants called SNN-CFSFDP and ReScale-CFSFDP. Together with the original CFSFDP and its latest improvement called FKNN-DPC (Xie et al. 2016), these four methods are used as contestants against LC-CFSFDP in our experiments. Our empirical evaluation shows that LC-CFSFDP outperforms all four contestants on 18 benchmark datasets.
The rest of the paper is organised as follows. Section 2 presents the related work. Section 3 discusses the weakness of CFSFDP and how to use existing remedies to improve it. Section 4 proposes Local Contrast and shows its properties. Section 5 presents LC-CFSFDP. The empirical evaluation, discussion and conclusions are provided in the last three sections.
Related work
Density-based clustering methods such as DBSCAN (Ester et al. 1996) identify high-density (core) points using a global threshold and then link all neighbouring core points to form clusters. However, these methods are known to have one key issue: they have difficulty detecting all clusters when the clusters have large density variations (Ertöz et al. 2003a). Recent research has formalised a necessary condition for DBSCAN to detect all clusters in a dataset (Zhu et al. 2016): if the peak of some cluster has a density lower than that of a low-density region between clusters, then DBSCAN will fail to find all clusters. Many density-based clustering algorithms (Hinneburg and Gabriel 2007; Ram et al. 2009; Borah and Bhattacharyya 2008), like DBSCAN, use a global density threshold to define core points and link them to form clusters. All these algorithms have the same issue. The exact condition under which these density-based algorithms fail (Zhu et al. 2016) is provided in Appendix A for ease of reference.
Researchers have attempted to address this issue of density-based clustering using different approaches. For instance, Shared-Nearest-Neighbours (SNN) (Jarvis and Patrick 1973; Ertöz et al. 2003a) employs an alternative similarity measure to replace the distance measure in the clustering procedure. The similarity between two data points is either the number of their shared K-nearest-neighbours (if they have each other in their K-nearest-neighbour lists) or 0 otherwise. Since the SNN similarity measure takes into account the local distribution of the data points, it is less affected by the varying densities of different clusters. It has been shown that DBSCAN using SNN improves the clustering results of DBSCAN using the distance measure (Ertöz et al. 2003a). However, its performance is sensitive to the setting of the parameter K, and its time complexity is \(O(K^2 N^2)\), instead of the \(O(N^2)\) of many other density-based clustering methods such as DBSCAN, because an additional KNN search process is required (Zhu et al. 2016).
ReScale (Zhu et al. 2016) is another recently proposed technique to overcome the density variation problem in clustering. It is a preprocessing technique, originally designed for density-based clustering algorithms which use a global density threshold to identify clusters. ReScale enables existing density-based clustering algorithms to perform density-ratio-based clustering, i.e., clusters are defined as regions of locally high density separated by regions of locally low density. The aim is to rescale the data such that the estimated density of each rescaled point is approximately the estimated density-ratio of the corresponding point in the original space, where density-ratio is defined as the ratio of the density of a point to the average density over its \(\eta \)-neighbourhood. A point located at a local density maximum has a higher density-ratio value than a point located at a local density minimum. Thus, a density-based clustering algorithm can be applied without modification to the rescaled data, using a single threshold to identify all clusters of locally high densities. Two additional parameters are introduced: \(\eta \) is used to define the local neighbourhood; and \(\psi \) is used to control the precision of the \(\eta \)-neighbourhood density estimation.
A recent density-based clustering algorithm, CFSFDP (Rodriguez and Laio 2014), takes a different approach to reduce the effect of the above-mentioned issue. The idea is to find cluster centres which have higher density than their neighbours and are relatively distant from each other. CFSFDP mitigates the problem of varying densities in some situations because it finds cluster centres not only by high densities, but also by taking into account their distances from other centres. It can detect a low-density cluster centre if it is far from other clusters.
Fuzzy weighted K-Nearest-Neighbors Density Peak Clustering (FKNN-DPC) (Xie et al. 2016) is a recent effort to improve CFSFDP (Rodriguez and Laio 2014). It uses a similar procedure to CFSFDP, except in the density estimation phase and the cluster assignment phase. FKNN-DPC uses a KNN kernel estimator, instead of an \(\epsilon \)-neighbourhood estimator. The key difference lies in the cluster assignment phase: FKNN-DPC uses a complex assignment scheme consisting of two strategies based on a series of KNN searches. The heavy use of KNN searches makes the algorithm very sensitive to the K parameter. It does not address the root cause of the problem with clusters having hugely varying densities because its operation is still based on density, as mentioned in the last paragraph.
It is important to point out that the above improvement over CFSFDP (Xie et al. 2016) was made on procedural steps only (using a different density estimator and a different scheme to assign points to a cluster), without knowing the root cause.
In this paper, we focus on CFSFDP (Rodriguez and Laio 2014) because it is a powerful and state-of-the-art core method of density-based clustering (Xu and Tian 2015); and we want to identify the key weakness of CFSFDP and its root cause. To achieve this aim, we first formalise the condition under which CFSFDP fails to detect all clusters in a dataset, and reveal that large density variations in a dataset can still harm CFSFDP’s clustering performance significantly in some situations. Then, we propose a new measure called Local Contrast, in place of density, as the primary means to find clusters. We show that this can be done easily using almost the same procedure as CFSFDP; and that it overcomes CFSFDP’s weakness at the root cause.
Weakness of CFSFDP and current remedies
Here we first provide a necessary condition for CFSFDP to detect all clusters in a dataset in Sect. 3.1. Its violation will result in CFSFDP failing to detect all clusters. In Sect. 3.2, we create two variants of CFSFDP with existing remedies for the problem of cluster density variations: SNN and ReScale. We show the limitations of these remedies for CFSFDP in the last subsection.
A necessary condition for CFSFDP
Like most density-based methods, CFSFDP (Rodriguez and Laio 2014) employs a density estimator \(f(\mathbf x)\) to estimate densities for all \(\mathbf x\) in a dataset D. The density estimator is defined as follows:

$$ f(\mathbf x) = |\{\mathbf y \in D {\setminus } \{\mathbf x\} : d(\mathbf x, \mathbf y) < \epsilon \}| $$

where \(d(\cdot ,\cdot )\) is a distance measure, \(\epsilon \) is a cutoff distance, and \(|Q|\) is the cardinality of set Q.
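As an illustration, the \(\epsilon \)-neighbourhood estimator can be sketched as follows (a minimal sketch assuming Euclidean distance and excluding each point from its own count; the function name and toy data are ours, not the authors'):

```python
import numpy as np

def cutoff_density(D, eps):
    # f(x) = |{y in D, y != x : d(x, y) < eps}| with Euclidean d
    diff = D[:, None, :] - D[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise distance matrix
    within = dist < eps
    np.fill_diagonal(within, False)            # exclude the point itself
    return within.sum(axis=1)

# Three mutually close points and one distant point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
f = cutoff_density(X, eps=0.5)                 # densities: 2, 2, 2, 0
```

The distant point receives density 0, foreshadowing why sparse clusters are penalised under a single cutoff.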
Let \({\mathbf x}^m = \hbox {arg max}_{{\mathbf x} \in D} f(\mathbf x)\) denote the point with the global maximum density. CFSFDP (Rodriguez and Laio 2014) defines a distance function of \(\mathbf x\), \(\delta _f(\mathbf x)\), as follows:

$$ \delta _f(\mathbf x) = {\left\{ \begin{array}{ll} \min _{\mathbf y \in D :\, f(\mathbf y) > f(\mathbf x)} d(\mathbf x, \mathbf y) &{} \text{ if } \mathbf x \ne \mathbf x^m\\ \max _{\mathbf y \in D} d(\mathbf x, \mathbf y) &{} \text{ if } \mathbf x = \mathbf x^m \end{array}\right. } $$

In other words, \(\delta _f(\mathbf x)\) is the distance between \(\mathbf x\) and its nearest neighbour with a higher density; except that for the point with the global maximum density, \(\delta _f(\mathbf x)\) is the greatest distance between \(\mathbf x^m\) and any other point. This ensures that the point with the global maximum density is always ranked first in the list of \(f(\mathbf x)\delta _f(\mathbf x)\) sorted in descending order.
The user then selects the top M points from the ranked list of \(f(\mathbf x)\delta _f(\mathbf x)\) and labels them from 1 to M, as the centres of M clusters.
All points are then sorted in descending order of \(f(\mathbf x)\). One by one from the top to the bottom of the sorted list, each unlabeled point is assigned to the same cluster as its nearest neighbour with a higher density. The first column in Table 1 provides a summary of the key steps in the CFSFDP procedure.
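The steps above can be sketched end-to-end as follows (a toy illustration, not the authors' reference implementation; density ties are broken by point index, an assumption we add to keep the sketch deterministic):

```python
import numpy as np

def cfsfdp(D, eps, M):
    N = len(D)
    idx = np.arange(N)
    dist = np.sqrt(((D[:, None, :] - D[None, :, :]) ** 2).sum(-1))
    f = (dist < eps).sum(1) - 1                    # epsilon-neighbourhood density
    delta = np.empty(N)
    nn_higher = np.zeros(N, dtype=int)             # nearest neighbour with higher density
    for i in range(N):
        higher = np.where((f > f[i]) | ((f == f[i]) & (idx < i)))[0]
        if len(higher) == 0:                       # global density maximum
            delta[i] = dist[i].max()
        else:
            j = higher[np.argmin(dist[i, higher])]
            delta[i], nn_higher[i] = dist[i, j], j
    centres = np.argsort(-f * delta, kind="stable")[:M]   # top M of the decision graph
    label = np.full(N, -1)
    label[centres] = np.arange(M)
    for i in np.argsort(-f, kind="stable"):        # descending density
        if label[i] < 0:                           # inherit label of nearest higher-density point
            label[i] = label[nn_higher[i]]
    return label

# Two well-separated clusters of different densities.
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
              [5, 5], [5.1, 5], [5, 5.1]], dtype=float)
labels = cfsfdp(X, eps=0.5, M=2)                   # -> [0, 0, 0, 0, 1, 1, 1]
```

Because assignment proceeds in descending density, each point's higher-density neighbour is guaranteed to be labelled before the point itself.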
CFSFDP requires that the cluster modes (the density maxima of the clusters) be ranked at the top of the sorted list of \(f(\mathbf x)\delta _f(\mathbf x)\) if they are to be selected as cluster centres.
We now state the necessary condition for CFSFDP to identify all clusters of a dataset.
Theorem 1
Given a dataset D of M actual clusters, let \(\mathbb C = \{\mathbf c_m, m = 1,\ldots ,M\}\) denote the M cluster modes, i.e., the points with the maximum density in each cluster with respect to a density estimator \(f(\mathbf x)\). A necessary condition for CFSFDP to correctly identify all clusters is given as follows:

$$ \min _{\mathbf c \in \mathbb C} f(\mathbf c)\,\delta _f(\mathbf c) > \max _{\mathbf y \in D {\setminus } \mathbb C} f(\mathbf y)\,\delta _f(\mathbf y) \qquad (1) $$
Proof
A violation of Eq. (1) means that at least one point \(\mathbf z \in \mathbb C\) is not among the top M points in the sorted list of \(f(\mathbf x)\delta _f(\mathbf x)\). Then, one of the following three situations will occur:

(i)
If fewer than M points are selected as cluster representatives, then not all clusters are identified.

(ii)
If more than M points are selected as cluster representatives, then at least one cluster will be divided.

(iii)
If exactly M points are selected as cluster representatives, then point \(\mathbf z \in \mathbb C\) is not selected as a representative. As a result, \(\mathbf z\) will be assigned a label from a point with a higher density. Since \(\mathbf z\) is the density maximum in its own cluster, the point that \(\mathbf z\) links to cannot be from the same cluster. Hence, \(\mathbf z\) and its neighbouring points will be mislabelled as belonging to a different cluster.
In all the above cases, CFSFDP cannot correctly identify all clusters when Eq. (1) is violated. \(\square \)
Note that the condition provided in Theorem 1 is independent of the density estimator used.
The basic assumptions of CFSFDP are that (i) each cluster centre has the maximum density among all points within the cluster, and (ii) all cluster centres are well separated. While these two assumptions are usually true, the maximum densities of different clusters cannot be guaranteed to be the same, or even similar.
Because density cannot provide such a guarantee, the use of density becomes the root cause of CFSFDP’s weakness in detecting all clusters having hugely different densities. When clusters have significantly different densities, low-density centres without a sufficiently long distance \(\delta _f(\cdot )\) will be ranked low in the sorted list of \(f(\mathbf x)\delta _f(\mathbf x)\). As a result, the algorithm fails to correctly identify all clusters. An example is shown in Fig. 1. The top dense cluster has multiple peaks; and the centre of the sparse cluster has significantly lower density than these peaks. CFSFDP fails to detect the 4 clusters correctly because the mode of the sparse cluster has too low a density to be ranked in the top four of the sorted list of \(f(\mathbf x)\delta _f(\mathbf x)\), as shown in Fig. 1b.
To overcome this weakness, we provide an alternative to density which has the necessary properties to detect all clusters of different densities using exactly the same CFSFDP procedure. This alternative measure is introduced in Sect. 4; and our analysis in Sect. 5 shows that it is more robust than density in a dataset having clusters of different densities, using the same CFSFDP procedure.
Improving CFSFDP using existing methods for improving DBSCAN
SNN (Jarvis and Patrick 1973; Ertöz et al. 2003a) and ReScale (Zhu et al. 2016) are two existing methods designed to address the issue of DBSCAN-like clustering methods on datasets having huge density variations.
Either of these existing methods can be applied straightforwardly to improve CFSFDP. The following two subsections provide the details of the two modified versions of CFSFDP: SNN-CFSFDP and ReScale-CFSFDP.
SNN-CFSFDP
SNN-CFSFDP has the same procedure as CFSFDP except that the distance measure used in both \(f(\cdot )\) and \(\delta _{f}(\cdot )\) is replaced with the shared nearest neighbour dissimilarity measure (Ertöz et al. 2003a).
Let \(N_K(\mathbf x)\) denote the K nearest neighbours of \(\mathbf x\) in a dataset D, with respect to Euclidean distance. The shared nearest neighbour dissimilarity (SNN) of two points \(\mathbf x\) and \(\mathbf y\) is defined as

$$ { SNN}(\mathbf x, \mathbf y) = {\left\{ \begin{array}{ll} K - |N_K(\mathbf x) \cap N_K(\mathbf y)| &{} \text{ if } \mathbf x \in N_K(\mathbf y) \text{ and } \mathbf y \in N_K(\mathbf x)\\ K &{} \text{ otherwise } \end{array}\right. } $$
SNN-CFSFDP then calculates both the density \(f_{{ SNN}}(\mathbf x)\) and the nearest distance to a higher density point \(\delta _{f_{{ SNN}}}(\mathbf x)\) in terms of \({ SNN}\) dissimilarity as follows:

$$ f_{{ SNN}}(\mathbf x) = |\{\mathbf y \in D {\setminus } \{\mathbf x\} : { SNN}(\mathbf x, \mathbf y) < \epsilon \}| $$

and

$$ \delta _{f_{{ SNN}}}(\mathbf x) = {\left\{ \begin{array}{ll} \min _{\mathbf y \in D :\, f_{{ SNN}}(\mathbf y) > f_{{ SNN}}(\mathbf x)} { SNN}(\mathbf x, \mathbf y) &{} \text{ if } \mathbf x \ne \mathbf x^w\\ \max _{\mathbf y \in D} { SNN}(\mathbf x, \mathbf y) &{} \text{ if } \mathbf x = \mathbf x^w \end{array}\right. } $$
where \(\epsilon \) is the cutoff \({ SNN}\) dissimilarity and \(\mathbf x^w = \hbox {arg max}_{{\mathbf x} \in D} f_{{ SNN}}(\mathbf x)\).
Given \(f_{{ SNN}}(\mathbf x)\) and \(\delta _{f_{{ SNN}}}(\mathbf x)\), the rest of the procedure is the same as CFSFDP. The summary of the key steps is given in the second column of Table 2. Note that if the procedure is implemented with an input of a dissimilarity matrix, the \({ SNN}\) dissimilarity matrix can be computed in a preprocessing step. This is shown in step 0 in Table 2.
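Assuming the dissimilarity takes the form K minus the number of shared neighbours for mutual K-NN pairs (and K otherwise, consistent with the description in Sect. 2), the preprocessing step 0 in Table 2 can be sketched as:

```python
import numpy as np

def snn_dissimilarity(D, K):
    N = len(D)
    dist = np.sqrt(((D[:, None, :] - D[None, :, :]) ** 2).sum(-1))
    # K nearest neighbours of each point, excluding the point itself
    knn = np.argsort(dist, axis=1, kind="stable")[:, 1:K + 1]
    neigh = [set(row) for row in knn]
    S = np.full((N, N), float(K))                  # non-mutual pairs get the maximum K
    for i in range(N):
        S[i, i] = 0.0
        for j in range(i + 1, N):
            if j in neigh[i] and i in neigh[j]:    # mutual K-NN pair
                S[i, j] = S[j, i] = K - len(neigh[i] & neigh[j])
    return S

# Three points in a row plus a far-away point.
X = np.array([[0.0], [1.0], [2.0], [10.0]])
S = snn_dissimilarity(X, K=2)
```

The double loop makes the \(O(K^2 N^2)\) cost discussed earlier visible: every pair requires a set intersection of up to K elements.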
ReScale-CFSFDP
ReScale-CFSFDP preprocesses the dataset before applying the exact procedure of CFSFDP. ReScale first estimates the density distribution on each dimension of the dataset D, with an \(\eta \)-neighbourhood estimator and a resolution of \(\psi \). It then scales the dataset D along each dimension based on the cumulative distribution, to yield a new dataset \(D'\).
Let \(D_i\) and \(\mathbf x_i\) denote the ith attribute of dataset D and of data point \(\mathbf x\), respectively. For each attribute i, ReScale divides the range of \(D_i\) into \(\psi \) equal segments, yielding \(\psi + 1\) grid points \(s_j, j = 1,\ldots ,\psi +1\), with \(s_q > s_j\) for all \(q>j\). It then estimates the density at each \(s_j\) as follows:

$$ f_i(s_j) = \frac{1}{N}\,|\{\mathbf x \in D : |\mathbf x_i - s_j| \le \eta \}| $$
The value of the ith attribute of a transformed point \(\mathbf x'\) is then given by

$$ \mathbf x'_i = \sum _{j \,:\, s_j \le \mathbf x_i} f_i(s_j) $$
which is the cumulative marginal probability of \(\mathbf x\) on attribute i. After processing each attribute, ReScale normalises the transformed dataset \(D'\) to be in [0, 1]. The detailed algorithm can be found in Zhu et al. (2016).
Using \(D'\), the rest of the procedure is the same as CFSFDP. The key steps of ReScale-CFSFDP are given in the last column of Table 2.
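A rough per-attribute sketch of this transformation, with the grid handling and density normalisation as our assumptions (see Zhu et al. 2016 for the exact algorithm):

```python
import numpy as np

def rescale(D, eta=0.2, psi=100):
    D = np.asarray(D, dtype=float)
    out = np.empty_like(D)
    for i in range(D.shape[1]):
        x = D[:, i]
        lo, hi = x.min(), x.max()
        s = np.linspace(lo, hi, psi + 1)           # psi + 1 grid points
        width = eta * (hi - lo)                    # eta-neighbourhood half-width
        # marginal density estimate at each grid point
        dens = np.array([np.mean(np.abs(x - sj) <= width) for sj in s])
        cdf = np.cumsum(dens)                      # cumulative marginal distribution
        idx = np.clip(np.searchsorted(s, x, side="right") - 1, 0, psi)
        y = cdf[idx]                               # cumulative value at each data point
        out[:, i] = (y - y.min()) / (y.max() - y.min())   # normalise to [0, 1]
    return out

X = np.array([[0.0], [1.0], [2.0], [10.0]])
Y = rescale(X, eta=0.2, psi=10)                    # monotone map of each attribute
```

Because the map is the cumulative of a non-negative density, it is monotone on each attribute: the ordering of points is preserved while dense regions are stretched apart.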
Limitations of SNN-CFSFDP and ReScale-CFSFDP
We apply SNN-CFSFDP and ReScale-CFSFDP to the synthetic dataset shown in Fig. 1; their clustering results are given in Figs. 2 and 3, respectively. Though both methods improve the F-measure compared with the original CFSFDP, they still fail to correctly identify all clusters: SNN-CFSFDP splits the two dense clusters at the bottom into four clusters; ReScale-CFSFDP splits the top cluster into two clusters.
SNN has two weaknesses. First, it is sensitive to the K parameter (Brito et al. 1997; Ertöz et al. 2003; Tan and Wang 2013). Second, with a time complexity of \(O(K^2N^2)\), it is computationally expensive. In this example, the default setting of \(K=\sqrt{N}\) leads to an undesirable result, as shown in Fig. 2, in which the true peaks #5 and #6 in Fig. 2c cannot outrank the false peak #2, because the distance \(\delta \) based on SNN dissimilarity is not large enough. A proper K needs to be carefully tuned in order to produce the desired clustering outcome. We provide further analysis of this issue in Sect. 6.
The ReScale approach aims to transform the dataset to be uniformly distributed along each attribute. However, when clusters overlap significantly on some attribute(s), it becomes problematic, as exemplified in Fig. 3: because of the overlap of clusters along the x-axis, the projection onto the x-axis has abundant data points in the middle and fewer data points at each end. The ReScale approach therefore shifts data points from the middle to both ends, causing the upper cluster to have two dense regions at both ends after the transformation. A rotation of the dataset is proposed in Zhu et al. (2016) to remedy this weakness. However, without prior knowledge of the dataset, it is difficult to find an orientation that works well, if such an orientation exists.
Local contrast
We propose Local Contrast as a new remedy for the density variation problem in clustering. Unlike SNN or ReScale, it is not sensitive to the parameter K, nor does it need to rescale the dataset.
Here we provide the definition of Local Contrast and describe its properties which empower clustering algorithms to be more robust against varying densities.
Given a dataset D and a density estimator \(f(\cdot )\), we define Local Contrast as follows:
Definition 1
Local Contrast of an instance \(\mathbf x\) is defined as the number of times that \(\mathbf x\) has a higher density than its K nearest neighbours:

$$ LC(\mathbf x) = \sum _{\mathbf y \in N_K(\mathbf x)} I_{\{f(\mathbf x) > f(\mathbf y)\}} $$

where \(N_K(\mathbf x)\) is the set of K nearest neighbours of \(\mathbf x\) and \(I_{\{\cdot \}}\) is an indicator function.
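Definition 1 is straightforward to compute given any precomputed density vector; a minimal sketch assuming Euclidean K-NN (the function name and toy density values are ours):

```python
import numpy as np

def local_contrast(D, f, K):
    # LC(x): number of x's K nearest neighbours with strictly lower density
    dist = np.sqrt(((D[:, None, :] - D[None, :, :]) ** 2).sum(-1))
    knn = np.argsort(dist, axis=1, kind="stable")[:, 1:K + 1]
    return np.array([(f[i] > f[knn[i]]).sum() for i in range(len(D))])

# A single density peak at the middle point: LC is K at the mode (Property 2)
# and 0 at the density minima at both ends (Property 3).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
f = np.array([1, 3, 5, 3, 1])                      # toy density values
lc = local_contrast(X, f, K=2)                     # -> [0, 1, 2, 1, 0]
```

Note that the mode's LC equals K = 2 regardless of how large or small its absolute density is, which is exactly the property exploited later.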
Local Contrast has three properties.
Property 1
The local modes and the local minima of \(LC(\mathbf x)\) are also the local modes and the local minima of \(f(\mathbf x)\), with a proper choice of K.
Property 2
The local modes of \(LC(\mathbf x)\) that correspond to the local modes of \(f(\mathbf x)\) have \(LC(\mathbf x)= K\), irrespective of the density of \(f(\mathbf x)\).
Property 3
The local minima of \(LC(\mathbf x)\) that correspond to the local minima of \(f(\mathbf x)\) have \(LC(\mathbf x)=0\), irrespective of the density of \(f(\mathbf x)\).
Proof
of Properties 1, 2 and 3. Let \(\mathbf p\) and \(\mathbf q\) be a local density maximum and a local density minimum, respectively. Assume a proper choice of K exists such that for all \(\mathbf x \in N_K(\mathbf p)\), \(f(\mathbf p) > f(\mathbf x)\); and for all \(\mathbf x \in N_K(\mathbf q)\), \(f(\mathbf q) < f(\mathbf x)\).
Let \(G \subseteq N_K(\mathbf p)\) be the maximal subset of \(N_K(\mathbf p)\) such that for all \(\mathbf x \in G\), \(\mathbf p \in N_K(\mathbf x)\). In other words, \(\mathbf p\) is one of the K-nearest-neighbours of each member of G. Since G is a subset of \(N_K(\mathbf p)\), \(\mathbf p\) is also a local density maximum in the neighbourhood defined by G.
By Definition 1, we have

$$ LC(\mathbf p) = K $$

and for all \(\mathbf x \in G\), we have

$$ LC(\mathbf x) \le K - 1 < LC(\mathbf p) $$

since \(\mathbf p \in N_K(\mathbf x)\) and \(f(\mathbf p) > f(\mathbf x)\).
Thus, \(\mathbf p\) is also a local mode of \(LC(\mathbf x)\) in the neighbourhood defined by G.
Similarly, let \(V \subseteq N_K(\mathbf q)\) be the maximal subset of \(N_K(\mathbf q)\) such that for all \(\mathbf x \in V\), \(\mathbf q \in N_K(\mathbf x)\). In other words, \(\mathbf q\) is one of the K-nearest-neighbours of each member of V. \(\mathbf q\) is also the local density minimum in the neighbourhood defined by V.
A similar argument shows that \(LC(\mathbf q) = 0 < LC(\mathbf x)\) for all \(\mathbf x \in V\). \(\square \)
The properties of Local Contrast listed above depend on a proper choice of K for a given dataset. The range of K that can be used is usually large. In other words, Local Contrast is not too sensitive to the setting of K.
Figure 4 provides an illustration of the properties of LC. Note that when K is set anywhere within the range of 25 to 500, Properties 2 and 3 still hold; and Property 1 holds for all settings of K shown.
Throughout this paper, all experiments are done with the default setting \(K = \sqrt{N}\), the square root of the dataset size, as suggested by some researchers for K nearest neighbour procedures (Ferilli et al. 2008; Zitzler et al. 2004; Fukunaga 1990).
Improving CFSFDP with local contrast
We create a new version of CFSFDP, called LC-CFSFDP, by replacing density with LC in the clustering procedure. Given a dataset D and a density estimator \(f(\cdot )\), \(LC(\mathbf x)\) is calculated as defined in Definition 1 for all \(\mathbf x\) in D.
Given \(LC(\cdot )\), \(\delta _{LC}(\mathbf x)\) is defined as follows:

$$ \delta _{LC}(\mathbf x) = {\left\{ \begin{array}{ll} \min _{\mathbf y \in D :\, LC(\mathbf y) > LC(\mathbf x)} d(\mathbf x, \mathbf y) &{} \text{ if } \mathbf x \ne \mathbf x^\omega \\ \max _{\mathbf y \in D} d(\mathbf x, \mathbf y) &{} \text{ if } \mathbf x = \mathbf x^\omega \end{array}\right. } $$
where \(\mathbf x^\omega = \hbox {arg max}_{{\mathbf x} \in D} LC(\mathbf x)\) denotes the point with the global maximum LC; and \(d(\cdot ,\cdot )\) is the Euclidean distance.
In other words, \(\delta _{LC}(\mathbf x)\) is defined to be the distance between \(\mathbf x\) and its nearest neighbour with a higher LC, except when \(\mathbf x\) is the point with the maximum LC. In that case, \(\delta _{LC}(\mathbf x)\) is defined to be the maximum distance between \(\mathbf x\) and any point in D.
Here distance \(\delta _{LC}(\mathbf x)\) is analogous to the distance from a point’s nearest neighbour with a higher density \(\delta _f(\mathbf x)\) used in CFSFDP (Rodriguez and Laio 2014). Given \(LC(\mathbf x)\) and \(\delta _{LC}(\mathbf x)\), cluster centres are then chosen from a decision graph where all points are sorted in descending order of \(LC(\mathbf x) \times \delta _{LC}(\mathbf x)\).
Definition 2
Cluster centres are defined to be the top M points with the highest \(LC(\mathbf x) \times \delta _{LC}(\mathbf x)\) values, where M is a user input parameter.
After selecting the M cluster centres, all unlabeled data points are then assigned, one by one in descending order of LC, the same cluster label as their nearest neighbour with a higher LC. A contrast between the LC-CFSFDP and CFSFDP procedures is given in Table 1.
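The centre-selection step can be sketched as follows (LC ties are broken by point index, an assumption of this sketch, since Property 2 makes ties at LC = K common among cluster modes):

```python
import numpy as np

def lc_cfsfdp_centres(D, lc, M):
    N = len(D)
    idx = np.arange(N)
    dist = np.sqrt(((D[:, None, :] - D[None, :, :]) ** 2).sum(-1))
    delta = np.empty(N)
    for i in range(N):
        # points with higher LC (ties broken by index)
        higher = np.where((lc > lc[i]) | ((lc == lc[i]) & (idx < i)))[0]
        delta[i] = dist[i].max() if len(higher) == 0 else dist[i, higher].min()
    # centres: the top-M points by LC(x) * delta_LC(x)
    return np.argsort(-(lc * delta), kind="stable")[:M]

# Two clusters whose modes both reach the maximum LC value.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
lc = np.array([2, 0, 2, 0])                        # toy LC values (K = 2)
centres = lc_cfsfdp_centres(X, lc, M=2)
```

Because both modes share the same LC value, their ranking is driven almost entirely by \(\delta _{LC}\), so one centre per well-separated cluster is selected.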
As shown in Table 1, LC-CFSFDP follows the same procedure as CFSFDP. The key difference between the two is that LC-CFSFDP replaces density with LC. As a result, analogous to the condition stated in Eq. (1), a necessary condition for LC-CFSFDP to detect all clusters correctly can be written as

$$ \min _{\mathbf c \in \mathbb C} LC(\mathbf c)\,\delta _{LC}(\mathbf c) > \max _{\mathbf y \in D {\setminus } \mathbb C} LC(\mathbf y)\,\delta _{LC}(\mathbf y) $$
where \(\mathbb C\) here denotes the set of points with maximum LC in each cluster.
Let \(\check{\mathbf x} = \hbox {arg min}_{\mathbf x \in \mathbb C}LC(\mathbf x)\delta _{LC}(\mathbf x)\) and \(\hat{\mathbf y} = \hbox {arg max}_{\mathbf y \in D{\setminus } \mathbb C}LC(\mathbf y)\delta _{LC}(\mathbf y)\). The above condition can be rewritten as

$$ \frac{LC(\check{\mathbf x})\,\delta _{LC}(\check{\mathbf x})}{LC(\hat{\mathbf y})\,\delta _{LC}(\hat{\mathbf y})} > 1 \qquad (2) $$
The corresponding rewritten condition for the density-based Eq. (1) is given as follows:

$$ \frac{f(\acute{\mathbf x})\,\delta _{f}(\acute{\mathbf x})}{f(\grave{\mathbf y})\,\delta _{f}(\grave{\mathbf y})} > 1 \qquad (3) $$
where \(\acute{\mathbf x} = \hbox {arg min}_{\mathbf x \in \mathbb C}f(\mathbf x)\delta _{f}(\mathbf x)\) and \(\grave{\mathbf y} = \hbox {arg max}_{\mathbf y \in D{\setminus } \mathbb C}f(\mathbf y)\delta _{f}(\mathbf y)\).
Equation (2) is much easier to satisfy than Eq. (3) because the properties of LC ensure that every member of \(\mathbb C\) has the maximum LC value (i.e., Property 2 stated in Sect. 4), irrespective of the density distribution. This makes the ratio \(LC(\check{\mathbf x})/LC(\hat{\mathbf y})\) on the left side of Eq. (2) no less than 1, so Eq. (2) is harder to violate. In contrast, in a data distribution which has greatly varying densities between clusters, the left side of Eq. (3) could easily be smaller than 1 if \(\acute{\mathbf x}\) is from a cluster of low density, which makes a violation of Eq. (3) a lot easier.
To demonstrate this, we apply LC-CFSFDP to the same example dataset shown in Fig. 1. Figure 5 shows that CFSFDP ranks the centre of the sparse cluster (rank #7) lower than the multiple peaks in the elongated cluster (ranks #2, 4, 5 and 6) in the decision graph. By simply replacing density \(f(\cdot )\) with \(LC(\cdot )\), LC-CFSFDP allows the centre of the sparse cluster to be ranked in the top four. This difference in ranking is the key to improving the algorithm, because all peaks now have about the same LC values, by virtue of Properties 1 and 2 stated in the last section. As a result, the ranking of peaks due to \(LC(\mathbf x) \times \delta _{LC}(\mathbf x)\) is mainly influenced by \(\delta _{LC}(\mathbf x)\). Since multiple peaks in one cluster tend to have smaller \(\delta _{LC}(\mathbf x)\), the algorithm is more likely to select one peak from each cluster, which makes it more robust against significant density differences in the presence of multiple peaks in one cluster.
Figure 5a shows the top seven points on the synthetic dataset, as determined by CFSFDP using density. Figure 5b shows the normalised density and LC of these seven points. Figure 5c shows the rankings due to \(LC(\mathbf x) \times \delta _{LC}(\mathbf x)\) and \(f(\mathbf x) \times \delta _f(\mathbf x)\). This change has enabled the centre of the sparse cluster to move from rank #7 to rank #2.
The complete clustering result of the LC version of CFSFDP is shown in Fig. 6. Compared with the clustering result of CFSFDP shown in Fig. 1, LC-CFSFDP has much stronger detection power in the presence of varying densities and multiple density peaks in one cluster (as shown in the top cluster in Fig. 5a), improving the F-measure from 0.85 to 0.98 with the four clusters correctly identified.
Experiments
To show the power of Local Contrast, we conduct experiments using 18 benchmark datasets which have been used in the literature (Chang and Yeung 2008; Gionis et al. 2007; Jain and Law 2005; Lichman 2013; Müller et al. 2009).^{Footnote 1} Table 3 provides the characteristics of the datasets.
In all experiments, the performance is measured in terms of F-measure. Given a clustering result, we calculate the precision score \(p_{m}\) and the recall score \(r_{m}\) for each cluster \(C_{m}\) based on the confusion matrix. The F-measure of \(C_{m}\) is the harmonic mean of \(p_{m}\) and \(r_{m}\). We then use the Hungarian algorithm (Kuhn 1955) to search for the optimal match over all clusters. The overall F-measure is the weighted average over all clusters: F-measure \(=\sum _{m=1}^{M}\frac{|C_m|}{N} \times \frac{2p_{m}r_{m}}{p_{m}+r_{m}}\), where N is the dataset size. In the calculations of F-measure, points labeled as noise are not removed from the dataset, but they are not regarded as a cluster. In addition, we also evaluate the performance in terms of the Adjusted Rand Index (ARI) (Hubert and Arabie 1985). The outcome is similar to that of using F-measure. For clarity of presentation, we provide the results based on ARI in Appendix B.
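The evaluation measure can be sketched as follows; for simplicity this sketch brute-forces the optimal match over label permutations (which finds the same optimum as the Hungarian algorithm for small M) and omits noise handling:

```python
import numpy as np
from itertools import permutations

def f_measure(true, pred, M):
    true, pred = np.asarray(true), np.asarray(pred)
    N = len(true)
    conf = np.zeros((M, M))                        # confusion matrix: true x predicted
    for t, p in zip(true, pred):
        conf[t, p] += 1
    best = 0.0
    for perm in permutations(range(M)):            # perm[m]: predicted cluster matched to true cluster m
        score = 0.0
        for m in range(M):
            tp = conf[m, perm[m]]
            if tp == 0:
                continue
            prec = tp / conf[:, perm[m]].sum()
            rec = tp / conf[m, :].sum()
            # weight each cluster's F-measure by its relative size |C_m| / N
            score += (conf[m, :].sum() / N) * 2 * prec * rec / (prec + rec)
        best = max(best, score)
    return best

perfect = f_measure([0, 0, 1, 1], [1, 1, 0, 0], M=2)   # label names differ; match is perfect
mixed = f_measure([0, 0, 1, 1], [0, 1, 0, 1], M=2)
```

The matching step matters: in the first call the predicted label names are swapped relative to the ground truth, yet the optimal match still yields an F-measure of 1.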
All methods are searched in their parameter spaces and the best F-measure achieved is recorded. For all versions of CFSFDP, the value of the cutoff distance/dissimilarity \(\epsilon \) is set to the average distance between each point and its pth-percentile nearest neighbour. This percentile is searched within \([0.1\%,10\%]\), with a step increment of 0.1%. All methods automatically select the M points which rank at the top in their respective decision graphs to be the cluster centres, where M is searched within \(\{2,3,\ldots ,20\}\). For the K-nearest-neighbour search involved in LC-CFSFDP, SNN-CFSFDP and FKNN-DPC, the parameter K is set to the nearest integer to \(\sqrt{N}\), the square root of the dataset size. For ReScale-CFSFDP, the parameter \(\psi \) is set to 100, as suggested by Zhu et al. (2016), and \(\eta \) is determined in the following way: we searched \(\eta \) within \(\{0.05,0.1,\ldots ,0.5\}\) and found the value that yields the best average F-measure over the 18 datasets, which is 0.2. We set \(\eta \) to 0.2 for all the experiments. As a result, ReScale-CFSFDP has been given an additional advantage compared with the other methods. A summary of the parameter settings is provided in Table 4.
Comparing LC to SNN and ReScale
The results in Table 5 show that LC-CFSFDP has the best clustering performance among the four approaches with an average rank of 1.61, followed by ReScale-CFSFDP with rank 2.39, SNN-CFSFDP with rank 2.83 and CFSFDP with rank 3.00. In terms of win/draw/loss counts with respect to the base algorithm CFSFDP, LC-CFSFDP has 15 wins, 1 loss and 2 draws; SNN-CFSFDP has 11 wins and 7 losses; and ReScale-CFSFDP has 10 wins and 8 losses. The Friedman test results in Table 6 show that LC-CFSFDP outperforms CFSFDP and SNN-CFSFDP significantly at p-values < 0.02. When comparing LC-CFSFDP with ReScale-CFSFDP, LC-CFSFDP has 11 wins, 1 draw and 6 losses, although the difference is not significant. Note that ReScale-CFSFDP has an unfair advantage because its parameter \(\eta \) is set to the value which gives the best average F-measure over the 18 datasets, whereas SNN-CFSFDP and LC-CFSFDP have no such advantage.
In a nutshell, Local Contrast significantly improves the CFSFDP algorithm, and the resulting LC-CFSFDP is the best density-based clustering method among the current state of the art.
Comparing LC-CFSFDP to FKNN-DPC
As shown in Table 7, the performance of FKNN-DPC is poor with K fixed to \(\sqrt{N}\), owing to its sensitivity to K. Therefore, we also compare LC-CFSFDP to FKNN-DPC with the parameter K searched for an optimal result. When K is searched, FKNN-DPC improves significantly in terms of F-measure. Nevertheless, LC-CFSFDP still outperforms FKNN-DPC with 12 wins, 1 draw and 5 losses.
Runtime
The time complexities of all methods are \(O(N^2)\) in terms of the dataset size N; for those involving K-nearest-neighbour search, the complexities are given in terms of both N and K. Table 8 gives the runtimes of all methods on all datasets. SNN-CFSFDP runs at least an order of magnitude slower than the others.
K sensitivity test
A K sensitivity test is shown in Fig. 7. In this experiment, the parameter K of the K-nearest-neighbour search used in LC-CFSFDP, SNN-CFSFDP and FKNN-DPC is set to values ranging from 5 to 80, and the corresponding best F-measure is recorded for each setting. Three datasets with low, medium and high dimensionality are used for the test. In all three cases, LC-CFSFDP exhibits more stable clustering performance than SNN-CFSFDP and FKNN-DPC as K changes.
Discussion
Local Contrast can be applied using any density estimator; it is not limited to the \(\epsilon \)-neighbourhood density estimator employed in CFSFDP (Rodriguez and Laio 2014). For example, Local Contrast can be applied to DENCLUE (Hinneburg and Gabriel 2007), which employs a kernel density estimator in its operation.
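To illustrate this flexibility, the sketch below computes Local Contrast, understood here as the number of a point's K nearest neighbours whose estimated density is lower than its own, on top of a Gaussian kernel density estimate instead of \(\epsilon \)-neighbourhood counts. The function name, bandwidth choice, and test data are our own assumptions:

```python
import numpy as np

def local_contrast(X, K, bandwidth=1.0):
    """LC(x) = number of x's K nearest neighbours with lower density than x,
    here with densities from a Gaussian kernel estimator rather than the
    epsilon-neighbourhood counts used in CFSFDP."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    density = np.exp(-d2 / (2 * bandwidth ** 2)).sum(axis=1)  # KDE up to a constant
    order = np.argsort(d2, axis=1)  # column 0 is the point itself
    knn = order[:, 1:K + 1]         # indices of the K nearest neighbours
    return (density[:, None] > density[knn]).sum(axis=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
lc = local_contrast(X, K=10)
# LC is bounded by K regardless of the absolute density scale,
# which is what makes it insensitive to density variations across clusters
assert lc.max() <= 10
```

Swapping in a different `density` computation (e.g. an \(\epsilon \)-neighbourhood count) leaves the rest of the procedure unchanged.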
We chose CFSFDP, rather than other density-based methods such as DBSCAN, as the base algorithm because it is a more advanced method. This is confirmed by comparing DBSCAN with CFSFDP in clustering the 18 datasets: the result, provided in Appendix C, shows that CFSFDP outperforms DBSCAN on all but one dataset. To be fair and complete, we also compare LC-CFSFDP to the original SNN and ReScale approaches. The result, provided in Appendix D, shows that LC-CFSFDP outperforms both methods.
Choosing the parameter K in KNN-based methods is usually time-consuming since such methods are often sensitive to K. However, we have shown that LC is not as sensitive to K as SNN or FKNN-DPC. As a rule of thumb, setting \(K=\sqrt{N}\) has been empirically verified to be effective for LC.
As for the choice of the parameter M (the number of clusters), both LC-CFSFDP and the original CFSFDP have the same requirement. For a specific dataset, the proper choice of M is a user decision that could be made based on domain knowledge, visual inspection, or other means. In our experiments, M is simply searched to show the best capability of each method.
The original CFSFDP does not explicitly identify any data point as noise. Instead, after the clustering procedure, it takes an extra step to produce cluster halos, which can be considered noise. In our experiments, no noise points are produced because all variants of CFSFDP, as well as FKNN-DPC, assign every point in a dataset to a cluster. However, when handling noisy datasets, LC-CFSFDP can also produce cluster halos in the same way as CFSFDP.
Grid-based clustering approaches partition the space into a number of cells and use the cell density to identify clusters. For example, GRIDCLUS (Schikuta 1996) and NSGC (Ma and Chow 2004) rely on cell density to identify core cells and link neighbouring core cells together to form clusters. Instead of the current point-based definition, Local Contrast could be redefined using the densities of neighbouring cells and then employed in these algorithms to improve their performance.
Another possible application of LC is density-based subspace clustering, such as SUBCLU (Kailing et al. 2004) and DUSC (Assent et al. 2007). These methods use a density threshold to differentiate between cluster points and noise in different subspaces. Density is dimensionality-biased: when estimated using distance-based density estimators, the density of a data cloud tends to be lower in higher-dimensional spaces. Hence these methods suffer from density variation across subspaces of different dimensionalities: a low threshold detects high-dimensional clusters but has difficulty filtering out noise in low-dimensional subspaces, while a high threshold screens out noise well in low-dimensional subspaces but tends to overlook high-dimensional clusters (Zimek and Vreeken 2015). LC can possibly be an effective remedy for this issue in subspace clustering since LC is not dimensionality-biased.
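The dimensionality bias mentioned above is easy to observe empirically: with a fixed radius, \(\epsilon \)-neighbourhood counts of uniform data shrink as the dimensionality grows. A small demonstration (the radius, sample size, and dimensions are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.5
mean_counts = []
for d in (2, 5, 10):
    X = rng.uniform(size=(500, d))  # uniform cloud in the unit hypercube
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # epsilon-neighbourhood size, excluding the point itself
    counts = (np.sqrt(d2) <= eps).sum(axis=1) - 1
    mean_counts.append(counts.mean())

# the same radius captures ever fewer neighbours as dimensionality grows,
# so a single density threshold cannot serve all subspaces at once
assert mean_counts[0] > mean_counts[1] > mean_counts[2]
```

A rank-based quantity such as LC, which compares each point only against its own neighbours, is unaffected by this shrinkage of absolute density values.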
However, not all density-based methods can utilise Local Contrast readily, because some do not employ density directly in their operations. For example, instead of density, OPTICS (Ankerst et al. 1999) employs a “core distance” and a “reachability distance” to rank points in order to identify clusters. The reachability distance reflects density in that points with lower density normally have a higher reachability distance. It would be interesting to explore whether Local Contrast can be redefined using these distances rather than density.
Conclusions
In this paper, we identify the root cause of CFSFDP’s failure to detect all clusters in a dataset with hugely varying densities. As far as we know, this is the first work that overcomes this weakness of CFSFDP at its root cause.
We make the following three contributions:
First, we formalise a necessary condition for CFSFDP to correctly identify all clusters. We show that a violation of this condition leads to poor clustering performance. This explains why a density-based clustering algorithm such as CFSFDP is unable to correctly identify all clusters in datasets with large density variations.
Second, we propose a new measure called Local Contrast, as an alternative to density, to improve the capability of density-based clustering methods to detect clusters of hugely different densities in a dataset. We show that it has two unique properties that are critical to this capability: all cluster centres have the same constant value of Local Contrast, and so do all local minima of Local Contrast (which correspond to the local minima of the density distribution), regardless of the densities of these cluster centres and local minima. We show that these properties make density-based algorithms much more robust in the presence of large density variations.
Third, by incorporating Local Contrast into CFSFDP, we create a powerful method, LC-CFSFDP, which has much better detection power than the original method. Our empirical evaluation shows that LC-CFSFDP is the best performer compared with two state-of-the-art methods, SNN and ReScale, as well as FKNN-DPC, a recent improvement of CFSFDP.
Notes
 1.
 2. In this paper, we refer to SNN as a dissimilarity measure. To avoid confusion, we refer to the original SNN clustering method in Ertöz et al. (2003a) as SNN-DBSCAN, since it is based on the DBSCAN procedure.
References
Ankerst, M., Breunig, M. M., Kriegel, H. P., & Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD international conference on management of data (pp. 49–60). New York, NY: ACM.
Assent, I., Krieger, R., Müller, E., & Seidl, T. (2007). DUSC: Dimensionality unbiased subspace clustering. In Proceedings of the 7th international conference on data mining (pp. 409–414). IEEE.
Borah, B., & Bhattacharyya, D. (2008). DDSC: A density differentiated spatial clustering technique. Journal of Computers, 3(2), 72–79.
Brito, M., Chavez, E., Quiroz, A., & Yukich, J. (1997). Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection. Statistics & Probability Letters, 35(1), 33–42.
Chang, H., & Yeung, D. Y. (2008). Robust path-based spectral clustering. Pattern Recognition, 41(1), 191–203.
Cherkassky, V., & Mulier, F. M. (2007). Learning from data: Concepts, theory, and methods. Hoboken: Wiley.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1), 1–38.
Ertöz, L., Steinbach, M., & Kumar, V. (2003a). Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In Proceedings of the 2003 SIAM international conference on data mining (pp. 47–58).
Ertöz, L., Steinbach, M., & Kumar, V. (2003b). Finding topics in collections of documents: A shared nearest neighbor approach. Clustering and Information Retrieval, 11, 83–103.
Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd international conference on knowledge discovery and data mining (pp. 226–231).
Ferilli, S., Biba, M., Basile, T., Di Mauro, N., & Esposito, F. (2008). K-nearest neighbor classification on first-order logic descriptions. In Proceedings of the IEEE international conference on data mining workshops (pp. 202–210).
Fukunaga, K. (1990). Introduction to statistical pattern recognition (2nd ed.). San Diego, CA: Academic Press Professional Inc.
Gionis, A., Mannila, H., & Tsaparas, P. (2007). Clustering aggregation. ACM Transactions on Knowledge Discovery from Data, 1(1), 4.
Han, J., & Kamber, M. (2011). Data mining: Concepts and techniques (3rd ed.). Los Altos, CA: Morgan Kaufmann.
Hinneburg, A., & Gabriel, H. H. (2007). DENCLUE 2.0: Fast clustering based on kernel density estimation. In Advances in intelligent data analysis (Vol. VII, pp. 70–80). Springer.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8), 651–666.
Jain, A. K., & Law, M. H. (2005). Data clustering: A user’s dilemma. In Pattern recognition and machine intelligence (pp. 1–10). Springer.
Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, 100(11), 1025–1034.
Kailing, K., Kriegel, H. P., & Kröger, P. (2004). Density-connected subspace clustering for high-dimensional data. In Proceedings of the international conference on data mining (pp. 246–256). SIAM.
Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics, 2(1–2), 83–97.
Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 31 May 2017.
Ma, E. W., & Chow, T. W. (2004). A new shifting grid clustering algorithm. Pattern Recognition, 37(3), 503–514.
Müller, E., Günnemann, S., Assent, I., & Seidl, T. (2009). Evaluating clustering in subspace projections of high dimensional data. Proceedings of the VLDB Endowment, 2, 1270–1281.
Ram, A., Sharma, A., Jalal, A. S., Agrawal, A., & Singh, R. (2009). An enhanced density based spatial clustering of applications with noise. In Proceedings of the IEEE international advance computing conference (pp. 1475–1478).
Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492–1496.
Schikuta, E. (1996). Grid-clustering: An efficient hierarchical clustering method for very large data sets. In Proceedings of the 13th IEEE international conference on pattern recognition (Vol. 2, pp. 101–105).
Tan, J., & Wang, R. (2013). Smooth splicing: A robust SNN-based method for clustering high-dimensional data. Mathematical Problems in Engineering, 2013, 1–9.
Xie, J., Gao, H., Xie, W., Liu, X., & Grant, P. W. (2016). Robust clustering by detecting density peaks and assigning points based on fuzzy weighted k-nearest neighbors. Information Sciences, 354, 19–40.
Xu, D., & Tian, Y. (2015). A comprehensive survey of clustering algorithms. Annals of Data Science, 2(2), 165–193.
Zhu, Y., Ting, K. M., & Carman, M. J. (2016). Density-ratio based clustering for discovering clusters with varying densities. Pattern Recognition, 60, 983–997.
Zimek, A., & Vreeken, J. (2015). The blind men and the elephant: On meeting the problem of multiple truths in data from clustering and pattern mining perspectives. Machine Learning, 98(1–2), 121–155.
Zitzler, E., Laumanns, M., Bleuler, S. (2004). A tutorial on evolutionary multiobjective optimization. In Metaheuristics for multiobjective optimisation (pp. 3–37). Springer.
Acknowledgements
Bo Chen is supported by Monash Data61 Postgraduate Research Scholarship and Faculty of IT Tuition Fee Scholarship, Monash University.
Additional information
Editors: Kurt Driessens, Dragi Kocev, Marko Robnik-Šikonja, Myra Spiliopoulou.
Appendices
Appendix A: Condition under which DBSCAN-like clustering algorithms fail
This section provides a necessary condition for DBSCAN-like clustering algorithms, which employ a global threshold to identify core points, to successfully identify all clusters (Zhu et al. 2016).
Let \( D=\lbrace \mathbf x_{1}, \mathbf x_{2},\ldots , \mathbf x_{n} \rbrace \), \(\mathbf x_{i}\in R^{d}, \mathbf x_{i} \sim F\) denote a dataset of n points, each sampled independently from a distribution F. Let \(f(\mathbf x)\) denote the density estimate at point \(\mathbf x\) used (either explicitly or implicitly) by a particular density-based clustering algorithm. A set of clusters \(\{C_{1},\ldots , C_{\varsigma }\}\) is defined as non-empty and non-intersecting subsets: \(C_i\subset D, C_i\ne \emptyset , \forall _{i \ne j} \ C_{i} \cap C_{j}= \emptyset \). Let \(c_{i}=\arg \max _{\mathbf x\in C_{i}}f(\mathbf x)\) denote the mode (point of the highest estimated density) of cluster \(C_{i}\) and \(p_{i}=f(c_{i})\) denote the corresponding peak density value.
In addition, let \(N_\epsilon (\mathbf x)\) be the \(\epsilon \)-neighbourhood of \(\mathbf x\), \(N_\epsilon (\mathbf x)=\lbrace \mathbf y \in D \mid d(\mathbf x,\mathbf y) \leqslant \epsilon \rbrace \), where \(d(\cdot ,\cdot )\) is the dissimilarity function (\(d:{R}^{d}\times {R}^{d}\rightarrow {R}\)) used by the density-based clustering algorithm.
A non-cyclic path linking points \(\mathbf x_{i}\) and \(\mathbf x_{j}\), \(path(\mathbf x_{i}, \mathbf x_{j})\), is defined as a sequence of unique points starting with \(\mathbf x_{i}\) and ending with \(\mathbf x_{j}\), where adjacent points lie in each other’s neighbourhood: \((\mathbf x_{\pi (1)},\mathbf x_{\pi (2)},\ldots ,\mathbf x_{\pi (k)})\). Here \(\pi \) is a mapping \(\pi : \lbrace 1,\ldots ,k \rbrace \rightarrow \lbrace 1,\ldots ,n \rbrace \) such that \( \forall _{s \ne t \in \lbrace 1,\ldots ,k \rbrace } \ (\pi (s) \ne \pi (t)) \wedge (\pi (1)=i) \wedge (\pi (k)=j) \wedge (\forall _{v \in \lbrace 1,\ldots ,k-1 \rbrace } \ \mathbf x_{\pi (v+1)} \in N_\epsilon (\mathbf x_{\pi (v)}))\).
Let \(\mathcal{P}_{ij} = \{(\mathbf x_{\pi (1)},\mathbf x_{\pi (2)},\ldots ,\mathbf x_{\pi (k)}) \mid \mathbf x_{\pi (1)}=c_{i}\wedge \mathbf x_{\pi (k)}=c_{j}\}\) denote the set of all non-cyclic paths linking the modes of clusters \(C_{i}\) and \(C_{j}\), and let \(g_{ij}=\max _{path\in \mathcal{P}_{ij}} \min _{\mathbf x\in path}f(\mathbf x)\) be the largest of the minimum density values along any path in \(\mathcal{P}_{ij}\).
In order for a density-based clustering algorithm to be able to (reliably) identify and separate all clusters in a dataset \(D=\lbrace \mathbf x_{1},\ldots ,\mathbf x_{n} \rbrace \), let each cluster \(C_{i}\) be represented by its mode \(c_{i}\). The condition that the estimated density at the mode of any cluster is greater than the maximum of the minimum estimated densities along any path linking any two modes is given as
$$\begin{aligned} \min _{i} p_{i} > \max _{i \ne j} g_{ij}. \end{aligned}$$
(4)
This condition implies that there must exist a threshold \(\tau \) that can be used to break all paths between the modes by assigning regions with estimated density less than \(\tau \) to noise, i.e.,
$$\begin{aligned} \exists \tau : \quad \max _{i \ne j} g_{ij} < \tau \leqslant \min _{i} p_{i}. \end{aligned}$$
If a density-based clustering algorithm uses a global threshold on the estimated density to identify core points, and links neighbouring core points together to form clusters (e.g., DBSCAN (Ester et al. 1996)), then the requirement given in Eq. (4) on the density estimates and cluster definitions provides a necessary condition for the algorithm to successfully identify all clusters.
Therefore, the density-based clustering algorithm will fail to identify all clusters in a data distribution that violates the inequality in Eq. (4).
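The quantity \(g_{ij}\) defined above is a max-min (bottleneck) path value over the \(\epsilon \)-neighbourhood graph, and can be computed with a modified Dijkstra search that maximises the minimum density along the path. A minimal sketch under our own graph representation (adjacency lists over point indices; not code from the paper):

```python
import heapq

def bottleneck_density(adj, f, src, dst):
    """Largest, over all paths from src to dst in the neighbourhood graph
    `adj`, of the minimum density f along the path (the quantity g_ij)."""
    best = {src: f[src]}     # best bottleneck value found so far, per node
    heap = [(-f[src], src)]  # max-heap via negated values
    while heap:
        neg_b, u = heapq.heappop(heap)
        b = -neg_b
        if u == dst:
            return b                       # first pop of dst is optimal
        if b < best.get(u, float('-inf')):
            continue                       # stale heap entry
        for v in adj[u]:
            nb = min(b, f[v])              # bottleneck if the path is extended to v
            if nb > best.get(v, float('-inf')):
                best[v] = nb
                heapq.heappush(heap, (-nb, v))
    return float('-inf')                   # dst not reachable from src

# two routes between modes 0 and 3: 0-1-3 (bottleneck 1) and 0-2-3 (bottleneck 3)
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
f = [5.0, 1.0, 3.0, 6.0]
g = bottleneck_density(adj, f, 0, 3)  # g = 3.0; min(p_0, p_3) = 5 > 3 satisfies Eq. (4)
```

In this toy example a threshold \(\tau \) anywhere in \((3, 5]\) separates the two modes, exactly as the condition requires.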
Appendix B: Performance evaluation in terms of ARI
Table 9 provides the comparison of the four variants of CFSFDP in terms of ARI. Table 10 provides the comparison between LC-CFSFDP and FKNN-DPC in terms of ARI. The results are similar to those using F-measure in terms of average rank and win/draw/loss counts (as shown in Tables 5 and 7).
Appendix C: DBSCAN versus CFSFDP
A comparison between DBSCAN and CFSFDP is provided in Table 11. The result shows that CFSFDP performs significantly better than DBSCAN.
Appendix D: LC-CFSFDP versus SNN-DBSCAN and ReScale-DBSCAN
A comparison of LC-CFSFDP, SNN-DBSCAN\(^{2}\) (Ertöz et al. 2003a) and ReScale-DBSCAN (Zhu et al. 2016) is provided in Table 12. The result shows that LC-CFSFDP performs significantly better than SNN-DBSCAN and ReScale-DBSCAN.
Chen, B., Ting, K. M., Washio, T., et al. Local contrast as an effective means to robust clustering against varying densities. Machine Learning, 107, 1621–1645 (2018). https://doi.org/10.1007/s10994-017-5693-x
Keywords
 Local contrast
 Density-based clustering
 Varying densities