Local contrast as an effective means to robust clustering against varying densities
Abstract
Most density-based clustering methods have difficulty detecting clusters of hugely different densities in a dataset. A recent density-based clustering method, CFSFDP, appears to have mitigated the issue. However, by formalising the condition under which it fails, we reveal that CFSFDP still has the same issue. To address it, we propose a new measure called Local Contrast, as an alternative to density, to find cluster centres and detect clusters. We then apply Local Contrast to CFSFDP to create a new clustering method called LC-CFSFDP, which is robust in the presence of varying densities. Our empirical evaluation shows that LC-CFSFDP outperforms CFSFDP and three other state-of-the-art variants of CFSFDP.
Keywords
Local contrast · Density-based clustering · Varying densities
1 Introduction
Data clustering is a technique to group a dataset into a number of subsets based on a “natural” hidden data structure (Cherkassky and Mulier 2007). To capture the underlying data structures, traditional clustering techniques such as the Expectation–Maximization (EM) algorithm (Dempster et al. 1977) assume specific probability distributions as the source from which the dataset is generated. In comparison, density-based methods are attractive due to their nonparametric characteristic, which enables them to deal with arbitrarily shaped clusters (Jain 2010). They rely on spatially varying densities for the detection of clusters: high-density regions are identified as clusters, separated by regions of low density (Han and Kamber 2011).
However, most density-based methods have difficulty detecting all clusters when the clusters have large variations in density (Ertöz et al. 2003a; Zhu et al. 2016). For example, DBSCAN (Ester et al. 1996), which uses a global density threshold to discriminate cluster core points from noise, fails to identify all clusters in the presence of greatly varying densities (Zhu et al. 2016).
Many efforts have been devoted to solving the varying-density problem in DBSCAN-like algorithms. Shared-Nearest-Neighbours (SNN) (Jarvis and Patrick 1973; Ertöz et al. 2003a) is an effective technique to this end. It uses the number of shared nearest neighbours between two points as a similarity measure, replacing distance in the clustering procedure. Yet, the performance of SNN is sensitive to the number of nearest neighbours used in its similarity calculations (Brito et al. 1997; Ertöz et al. 2003; Tan and Wang 2013). ReScale (Zhu et al. 2016) is a recently proposed approach to tackle the same problem. It rescales a dataset such that the estimated density of a rescaled data point approximates the density ratio of the corresponding point in the original dataset. However, ReScale does not perform well when clusters overlap significantly on some attributes (Zhu et al. 2016).
A more recent density-based clustering method called Clustering by Fast Search and Find of Density Peaks (CFSFDP) (Rodriguez and Laio 2014) employs a density-based approach different from DBSCAN's. Instead of finding core points using a global threshold in the first step, it finds the density peak of every cluster and then links the neighbouring points of each peak to form a cluster. CFSFDP overcomes some of the varying-density issues of earlier density-based clustering algorithms (e.g., DBSCAN).
While the condition under which DBSCAN fails to detect all clusters has been formalised recently (Zhu et al. 2016), whether such a condition exists in the more recent density-based method CFSFDP is still unknown. We formalise a necessary condition for CFSFDP to identify all clusters in a dataset, and show that large variations in density are still problematic for CFSFDP.
We propose a new measure called Local Contrast (LC), as an alternative to density, to make density-based clustering algorithms robust against varying densities. The proposed LC is not too sensitive to its parameter setting, and is able to achieve high clustering performance with a default setting. LC has the following properties:

The local modes and local minima of the LC distribution are also the local modes and local minima of the density distribution of the same dataset.

The local modes of LC have the same constant LC value, irrespective of the density values at these modes.

The local minima of LC have zero LC value, irrespective of the density values at these minima.
To benchmark the proposed LC-CFSFDP, we apply SNN and ReScale (which are existing remedies for the density-variation issue) to CFSFDP, creating two improved variants called SNN-CFSFDP and ReScale-CFSFDP. Together with the original CFSFDP and its latest improvement called FKNN-DPC (Xie et al. 2016), these four methods are used as contestants against LC-CFSFDP in our experiments. Our empirical evaluation shows that LC-CFSFDP outperforms all four contestants on 18 benchmark datasets.
The rest of the paper is organised as follows. Section 2 presents the related work. Section 3 discusses the weakness of CFSFDP and how existing remedies can be used to improve it. Section 4 proposes Local Contrast and shows its properties. Section 5 presents LC-CFSFDP. The empirical evaluation, discussion and conclusions are provided in the last three sections.
2 Related work
Density-based clustering methods such as DBSCAN (Ester et al. 1996) identify high-density (core) points using a global threshold and then link all neighbouring core points to form clusters. However, these methods are known to have one key issue: they have difficulty detecting all clusters when the clusters have large density variations (Ertöz et al. 2003a). Recent research has formalised a necessary condition for DBSCAN to detect all clusters in a dataset (Zhu et al. 2016): if the peak of some cluster has a density lower than that of a low-density region between clusters, then DBSCAN will fail to find all clusters. Many density-based clustering algorithms (Hinneburg and Gabriel 2007; Ram et al. 2009; Borah and Bhattacharyya 2008), like DBSCAN, use a global density threshold to define core points and link them to form clusters. All these algorithms have the same issue. The exact condition under which these density-based algorithms fail (Zhu et al. 2016) is provided in Appendix A for ease of reference.
Researchers have attempted to address this issue of density-based clustering using different approaches. For instance, Shared-Nearest-Neighbours (SNN) (Jarvis and Patrick 1973; Ertöz et al. 2003a) employs an alternative similarity measure to replace the distance measure in the clustering procedure. The similarity between two data points is the number of their shared K nearest neighbours if they have each other in their K-nearest-neighbour lists, and 0 otherwise. Since the SNN similarity measure takes into account the local distribution of the data points, it is less affected by the varying densities of different clusters. It has been shown that DBSCAN using SNN improves on the clustering results of DBSCAN using the distance measure (Ertöz et al. 2003a). However, its performance is sensitive to the setting of parameter K, and its time complexity is \(O(K^2 N^2)\), instead of the \(O(N^2)\) of many other density-based clustering methods such as DBSCAN, because an additional KNN process is required (Zhu et al. 2016).
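The SNN similarity described above can be sketched as follows. This is an illustrative implementation (the function and variable names are ours, not from the cited work), assuming Euclidean distance is used to build the K-nearest-neighbour lists:

```python
import numpy as np

def snn_similarity(X, k):
    """Return an N x N matrix of SNN similarities for the rows of X.

    sim(x, y) = |kNN(x) & kNN(y)| if x and y appear in each other's
    k-nearest-neighbour lists, and 0 otherwise.
    """
    n = len(X)
    # Pairwise Euclidean distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude each point from its own list
    knn = np.argsort(d, axis=1)[:, :k]   # indices of each point's k nearest neighbours
    knn_sets = [set(row) for row in knn]
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if j in knn_sets[i] and i in knn_sets[j]:
                sim[i, j] = sim[j, i] = len(knn_sets[i] & knn_sets[j])
    return sim
```

The double loop makes the \(O(K^2 N^2)\)-style cost mentioned above easy to see: on top of the pairwise distance computation, every mutual-neighbour pair requires a set intersection.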
ReScale (Zhu et al. 2016) is another recently proposed technique to overcome the density-variation problem in clustering. It is a preprocessing technique, originally designed for density-based clustering algorithms which use a global density threshold to identify clusters. ReScale enables existing density-based clustering algorithms to perform density-ratio-based clustering, i.e., clusters are defined as regions of locally high density separated by regions of locally low density. The aim is to rescale the data such that the estimated density of each rescaled point is approximately the estimated density-ratio of the corresponding point in the original space, where density-ratio is defined as the ratio of the density of a point to the average density over its \(\eta \)-neighbourhood. A point located in an area of locally maximum density has a higher density-ratio value than a point located in an area of locally minimum density. Thus, a density-based clustering algorithm which uses a single threshold can be applied without modification to the rescaled data to identify all clusters of locally high density. Two additional parameters are introduced: \(\eta \) is used to define the local neighbourhood, and \(\psi \) is used to control the precision of the \(\eta \)-neighbourhood density estimation.
A recent density-based clustering algorithm, CFSFDP (Rodriguez and Laio 2014), takes a different approach to reduce the effect of the above-mentioned issue. The idea is to find cluster centres which have higher density than their neighbours and are relatively distant from each other. CFSFDP mitigates the problem of varying densities in some situations because it finds cluster centres not only by high density, but also by taking into account their distances from other centres. It can detect a low-density cluster centre if it is far from other clusters.
Fuzzy weighted K-Nearest-Neighbors Density Peak Clustering (FKNN-DPC) (Xie et al. 2016) is a recent effort to improve CFSFDP (Rodriguez and Laio 2014). It uses a similar procedure to CFSFDP, except in the density estimation phase and the cluster assignment phase. FKNN-DPC uses a KNN kernel estimator instead of an \(\epsilon \)-neighbourhood estimator. The key difference lies in the cluster assignment phase: FKNN-DPC uses a complex assignment scheme consisting of two strategies based on a series of KNN searches. The heavy use of KNN searches makes the algorithm very sensitive to the parameter K. It does not overcome the root cause of the problem with clusters of hugely varying densities because its operation is still based on density, as mentioned in the last paragraph.
It is important to point out that the above improvement over CFSFDP (Xie et al. 2016) was made to procedural steps only (using a different density estimator and a different scheme to assign points to clusters), without knowing the root cause.
In this paper, we focus on CFSFDP (Rodriguez and Laio 2014) because it is a powerful, state-of-the-art core method of density-based clustering (Xu and Tian 2015); and we want to identify the key weakness of CFSFDP and its root cause. To achieve this aim, we first formalise the condition under which CFSFDP fails to detect all clusters in a dataset, and reveal that large density variations in a dataset can still harm CFSFDP's clustering performance significantly in some situations. Then, we propose a new measure called Local Contrast, in place of density, as the primary means to find clusters. We show that this can be done using almost the same procedure as CFSFDP, and that it overcomes CFSFDP's weakness at its root cause.
3 Weakness of CFSFDP and current remedies
Here we first provide, in Sect. 3.1, a necessary condition for CFSFDP to detect all clusters in a dataset. Its violation will result in CFSFDP failing to detect all clusters. In Sect. 3.2, we create two variants of CFSFDP using existing remedies for the problem of cluster density variations: SNN and ReScale. We show the limitations of these remedies for CFSFDP in the last subsection.
3.1 A necessary condition for CFSFDP
The user then selects the top M points from the ranked list of \(f(\mathbf x)\delta _f(\mathbf x)\), and labels them from 1 to M, as the centres of M clusters.
CFSFDP versus LC-CFSFDP: key steps

| Step | CFSFDP | LC-CFSFDP |
| --- | --- | --- |
| 1 | Estimate density \(f(\mathbf x)\) and distance \(\delta _f(\mathbf x),\; \forall \mathbf x \in D\) | Estimate density \(f(\mathbf x)\), Local Contrast \(LC(\mathbf x)\) and distance \(\delta _{LC}(\mathbf x),\; \forall \mathbf x \in D\) |
| 2 | Select the top M points with the largest \(f(\mathbf x) \times \delta _f(\mathbf x)\) and label them as cluster centres of Clusters \(1,\ldots ,M\) | Select the top M points with the largest \(LC(\mathbf x) \times \delta _{LC}(\mathbf x)\) and label them as cluster centres of Clusters \(1,\ldots ,M\) |
| 3 | Order all points in descending order of \(f(\mathbf x)\). Following the order, assign each unlabelled point to the same cluster as its nearest neighbour with higher \(f(\mathbf x)\) | Order all points in descending order of \(LC(\mathbf x)\). Following the order, assign each unlabelled point to the same cluster as its nearest neighbour with higher \(LC(\mathbf x)\) |
CFSFDP requires these cluster modes to be ranked at the top of the sorted list of \(f(\mathbf x)\delta _f(\mathbf x)\) if they are to be selected as cluster centres.
We now state the necessary condition for CFSFDP to identify all clusters of a dataset.
Theorem 1
Proof
 (i)
If fewer than M points are selected as cluster representatives, then not all clusters are identified.
 (ii)
If more than M points are selected as cluster representatives, then some cluster will be divided.
 (iii)
If exactly M points are selected as cluster representatives, then point \(\mathbf z \in \mathbb C\) is not selected as a representative. As a result, \(\mathbf z\) will be assigned a label from a point with a higher density. Since \(\mathbf z\) is the density maximum in its own cluster, the point that \(\mathbf z\) links to cannot be from the same cluster. Hence, \(\mathbf z\) and its neighbouring points will be mislabelled as belonging to a different cluster. \(\square \)
The basic assumptions of CFSFDP are that (i) each cluster centre has the maximum density among all points within the cluster, and (ii) all cluster centres are well separated. While these two assumptions are usually true, the maximum densities of different clusters cannot be guaranteed to be the same, or even similar.
Because density cannot provide such a guarantee, the use of density is the root cause of CFSFDP's weakness in detecting all clusters of hugely different densities. When clusters have significantly different densities, low-density centres which do not have sufficiently long distances \(\delta _f(\cdot )\) will be ranked low in the sorted list of \(f(\mathbf x)\delta _f(\mathbf x)\). As a result, the algorithm fails to correctly identify all clusters. An example is shown in Fig. 1. The top dense cluster has multiple peaks, and the centre of the sparse cluster has significantly lower density than these peaks. CFSFDP fails to detect the 4 clusters correctly because the mode of the sparse cluster has a density which is too low for the mode to be ranked in the top four of the sorted list of \(f(\mathbf x)\delta _f(\mathbf x)\), as shown in Fig. 1b.
To overcome this weakness, we provide an alternative to density which has the necessary properties to detect all clusters of different densities using exactly the same CFSFDP procedure. This alternative measure is introduced in Sect. 4; and our analysis in Sect. 5 shows that it is more robust than density on a dataset having clusters of different densities.
3.2 Improving CFSFDP using existing methods of improving DBSCAN
SNN (Jarvis and Patrick 1973; Ertöz et al. 2003a) and ReScale (Zhu et al. 2016) are two existing methods designed to address the issue of DBSCAN-like clustering methods on datasets having huge density variations.
One can use either of these methods to improve CFSFDP, and they can be applied straightforwardly. The following two subsections provide the details of the two modified versions of CFSFDP: SNN-CFSFDP and ReScale-CFSFDP.
3.2.1 SNN-CFSFDP
SNN-CFSFDP has the same procedure as CFSFDP except that the distance measure used in both \(f(\cdot )\) and \(\delta _{f}(\cdot )\) is replaced with the shared-nearest-neighbour dissimilarity measure (Ertöz et al. 2003a).
SNN-CFSFDP versus ReScale-CFSFDP: key steps of the two algorithms

| Step | SNN-CFSFDP | ReScale-CFSFDP |
| --- | --- | --- |
| 0 | Calculate pairwise dissimilarity \({ SNN}(\mathbf x, \mathbf y), \forall \mathbf x, \mathbf y \in D\) | Preprocess \(D \rightarrow D'\) with the ReScale approach |
| 1 | Estimate density \(f_{{ SNN}}(\mathbf x)\) and distance \(\delta _{f_{{ SNN}}}(\mathbf x),\; \forall \mathbf x \in D\) | Estimate density \(f(\mathbf x')\) and distance \(\delta _f(\mathbf x'),\; \forall \mathbf x' \in D'\) |
| 2 | Select the top M points with the largest \(f_{{ SNN}}(\mathbf x) \times \delta _{f_{{ SNN}}}(\mathbf x)\) and label them as cluster centres of Clusters \(1,\ldots ,M\) | Select the top M points with the largest \(f(\mathbf x') \times \delta _f(\mathbf x')\) and label them as cluster centres of Clusters \(1,\ldots ,M\) |
| 3 | Order all points in descending order of \(f_{{ SNN}}(\mathbf x)\). Following the order, assign each unlabelled point to the same cluster as its nearest neighbour with higher \(f_{{ SNN}}(\mathbf x)\) | Order all points in descending order of \(f(\mathbf x')\). Following the order, assign each unlabelled point to the same cluster as its nearest neighbour with higher \(f(\mathbf x')\) |
3.2.2 ReScale-CFSFDP
ReScale-CFSFDP preprocesses the dataset before applying exactly the same procedure as CFSFDP. ReScale first estimates the density distribution on each dimension of the dataset D, with an \(\eta \)-neighbourhood estimator and a resolution of \(\psi \). It then scales the dataset D along each dimension based on the cumulative distribution, to yield a new dataset \(D'\).
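The per-dimension rescaling step can be sketched as follows. This is a simplified illustration of the idea rather than the authors' implementation: the \(\eta \)-neighbourhood density is estimated at \(\psi \) grid positions on each attribute, and each value is mapped to its normalised cumulative density. The parameter names `eta` and `psi` follow the text; everything else is our assumption:

```python
import numpy as np

def rescale_1d(x, eta=0.2, psi=100):
    """Rescale one attribute: each point's new value is the normalised
    cumulative density up to its position on that attribute."""
    lo, hi = x.min(), x.max()
    grid = np.linspace(lo, hi, psi)
    h = eta * (hi - lo)                        # eta-neighbourhood half-width
    # Density at each grid position: fraction of points within +/- h.
    dens = np.array([np.mean(np.abs(x - g) <= h) for g in grid])
    cdf = np.cumsum(dens)
    cdf = cdf / cdf[-1]                        # normalise to [0, 1]
    # Map every point to the cumulative density at its grid cell.
    idx = np.clip(np.searchsorted(grid, x), 0, psi - 1)
    return cdf[idx]

def rescale(X, eta=0.2, psi=100):
    """Apply the 1-D rescaling independently to every attribute of X."""
    return np.column_stack([rescale_1d(X[:, j], eta, psi)
                            for j in range(X.shape[1])])
```

Because the mapping is a (normalised) cumulative distribution, it is monotone on each attribute, so the order of points along each dimension is preserved while dense regions are stretched and sparse regions are compressed.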
Using \(D'\), the rest of the procedure is the same as CFSFDP. The key steps of ReScale-CFSFDP are given in the last column of Table 2.
3.2.3 Limitations of SNN-CFSFDP and ReScale-CFSFDP
We apply SNN-CFSFDP and ReScale-CFSFDP to the synthetic dataset shown in Fig. 1; their clustering results are given in Figs. 2 and 3, respectively. Though both methods improve the F-measure compared with the original CFSFDP, they still fail to correctly identify all clusters: SNN-CFSFDP splits the two dense clusters at the bottom into four clusters; ReScale-CFSFDP splits the top cluster into two clusters.
SNN has two weaknesses. First, it is sensitive to the K parameter (Brito et al. 1997; Ertöz et al. 2003; Tan and Wang 2013). Second, with a time complexity of \(O(K^2N^2)\), it is computationally expensive. In this example, the default setting of \(K=\sqrt{N}\) leads to an undesirable result, as shown in Fig. 2, in which the true peaks #5 and #6 in Fig. 2c cannot outrank the false peak #2, because the distance \(\delta \) based on SNN dissimilarity is not large enough. A proper K needs to be carefully tuned in order to produce the desired clustering outcome. We provide further analysis of this issue in Sect. 6.
4 Local contrast
We propose Local Contrast as a new remedy for the density variation problem in clustering. Unlike SNN or ReScale, it is not sensitive to the parameter K, nor does it need to rescale the dataset.
Here we provide the definition of Local Contrast and describe its properties which empower clustering algorithms to be more robust against varying densities.
Given a dataset D and a density estimator \(f(\cdot )\), we define Local Contrast as follows:
Definition 1
\(LC(\mathbf x) = \left| \{\mathbf y \in N_K(\mathbf x) : f(\mathbf y) < f(\mathbf x)\} \right|\), i.e., the Local Contrast of a point \(\mathbf x\) is the number of its K nearest neighbours \(N_K(\mathbf x)\) which have a density lower than \(f(\mathbf x)\).
Local Contrast has three properties.
Property 1
The local modes and the local minima of \(LC(\mathbf x)\) are also the local modes and the local minima of \(f(\mathbf x)\), with a proper choice of K.
Property 2
The local modes of \(LC(\mathbf x)\) that correspond to the local modes of \(f(\mathbf x)\) have \(LC(\mathbf x)= K\), irrespective of the density of \(f(\mathbf x)\).
Property 3
The local minima of \(LC(\mathbf x)\) that correspond to the local minima of \(f(\mathbf x)\) have \(LC(\mathbf x)=0\), irrespective of the density of \(f(\mathbf x)\).
Proof of Properties 1, 2 and 3
Let \(\mathbf p\) and \(\mathbf q\) be a local density maximum and a local density minimum, respectively. Assume that a proper choice of K exists such that for all \(\mathbf x \in N_K(\mathbf p)\), \(f(\mathbf p) > f(\mathbf x)\); and for all \(\mathbf x \in N_K(\mathbf q)\), \(f(\mathbf q) < f(\mathbf x)\).
Let \(G \subseteq N_K(\mathbf p)\) be the maximal subset of \(N_K(\mathbf p)\) such that for all \(\mathbf x \in G\), \(\mathbf p \in N_K(\mathbf x)\). In other words, \(\mathbf p\) is one of the K nearest neighbours of each member of G. Since G is a subset of \(N_K(\mathbf p)\), \(\mathbf p\) is also a local density maximum in the neighbourhood defined by G.
Similarly, let \(V \subseteq N_K(\mathbf q)\) be the maximal subset of \(N_K(\mathbf q)\) such that for all \(\mathbf x \in V\), \(\mathbf q \in N_K(\mathbf x)\). In other words, \(\mathbf q\) is one of the K nearest neighbours of each member of V; \(\mathbf q\) is also the local density minimum in the neighbourhood defined by V.
Since every point in \(N_K(\mathbf p)\) has a density lower than \(f(\mathbf p)\), we have \(LC(\mathbf p) = K \ge LC(\mathbf x)\) for all \(\mathbf x \in G\), so \(\mathbf p\) is a local maximum of LC with the constant value K. The same argument gives \(LC(\mathbf q) = 0 < LC(\mathbf x), \forall \mathbf x \in V\). \(\square \)
The properties of Local Contrast listed above depend on a proper choice of K for a given dataset. The range of K that can be used is usually large. In other words, Local Contrast is not too sensitive to the setting of K.
Throughout this paper, all experiments are done with the default setting \(K = \sqrt{N}\), the square root of the dataset size, as suggested by some researchers for K nearest neighbour procedures (Ferilli et al. 2008; Zitzler et al. 2004; Fukunaga 1990).
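Under the definition above (LC(x) counts how many of x's K nearest neighbours have a lower density), Local Contrast can be computed directly from any precomputed density estimate. The following sketch is illustrative; the function name and interface are our own choices:

```python
import numpy as np

def local_contrast(X, density, K=None):
    """Return LC values in {0, ..., K} for the rows of X.

    density[i] is a precomputed density estimate f(x_i); any density
    estimator can be plugged in. Defaults to K = sqrt(N).
    """
    n = len(X)
    if K is None:
        K = int(np.sqrt(n))                    # default K = sqrt(N)
    # Pairwise distances; exclude each point from its own neighbour list.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :K]         # K nearest neighbours of each point
    # LC(x) = number of x's K nearest neighbours with lower density.
    return np.array([np.sum(density[i] > density[knn[i]]) for i in range(n)])
```

Properties 2 and 3 are visible directly in the code: a density peak dominates all K of its neighbours (LC = K), while a density minimum dominates none (LC = 0), regardless of the absolute density values.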
5 Improving CFSFDP with local contrast
We create a version of CFSFDP, called LC-CFSFDP, by replacing density with LC in the clustering procedure. Given a dataset D and a density estimator \(f(\cdot )\), \(LC(\mathbf x)\) is calculated as defined in Definition 1 for all \(\mathbf x\) in D.
In other words, \(\delta _{LC}(\mathbf x)\) is defined to be the distance between \(\mathbf x\) and its nearest neighbour with a higher LC, except when \(\mathbf x\) is the point with the maximum LC. In that case, \(\delta _{LC}(\mathbf x)\) is defined to be the maximum distance between \(\mathbf x\) and any point in D.
Here the distance \(\delta _{LC}(\mathbf x)\) is analogous to \(\delta _f(\mathbf x)\), the distance to a point's nearest neighbour with a higher density, used in CFSFDP (Rodriguez and Laio 2014). Given \(LC(\mathbf x)\) and \(\delta _{LC}(\mathbf x)\), cluster centres are then chosen from a decision graph where all points are sorted in descending order of \(LC(\mathbf x) \times \delta _{LC}(\mathbf x)\).
Definition 2
Cluster centres are defined to be the top M points with the highest \(LC(\mathbf x) \times \delta _{LC}(\mathbf x)\) values, where M is a user input parameter.
After selecting the M cluster centres, all unlabelled data points are assigned, one by one in descending order of LC, the same cluster label as their nearest neighbour with a higher LC. A comparison of the LC-CFSFDP and CFSFDP procedures is given in Table 1.
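The three steps in Table 1 can be sketched as follows, given a pairwise distance matrix and precomputed LC values. This is an illustrative reading of the procedure, not the authors' code; in particular, it assumes the point with the globally maximum LC is among the selected centres (it receives the maximum possible \(\delta _{LC}\)):

```python
import numpy as np

def lc_cfsfdp(d, lc, M):
    """d: N x N pairwise distance matrix; lc: Local Contrast values;
    M: number of clusters. Returns a cluster label per point."""
    n = len(lc)
    order = np.argsort(-lc)                    # points in descending LC order
    delta = np.zeros(n)
    nn_higher = np.full(n, -1)
    for rank, i in enumerate(order):
        if rank == 0:
            # Global LC maximum: delta is the maximum distance to any point.
            delta[i] = d[i].max()
        else:
            # delta_LC(x): distance to the nearest neighbour with higher LC.
            higher = order[:rank]
            j = higher[np.argmin(d[i, higher])]
            delta[i] = d[i, j]
            nn_higher[i] = j
    # Step 2: the top-M points by LC(x) * delta_LC(x) become cluster centres.
    centres = np.argsort(-(lc * delta))[:M]
    label = np.full(n, -1)
    label[centres] = np.arange(M)
    # Step 3: in descending LC order, each remaining point inherits the
    # label of its nearest neighbour with higher LC (already labelled).
    for i in order:
        if label[i] == -1:
            label[i] = label[nn_higher[i]]
    return label
```

Replacing `lc` with a density vector in this sketch recovers the original CFSFDP column of Table 1; the assignment logic is otherwise identical.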
Equation (2) is much easier to satisfy than Eq. (3) because the properties of LC ensure that every member of \(\mathbb C\) has the maximum LC value (i.e., Property 2 stated in Sect. 4), irrespective of the density distribution. This makes the left side of Eq. (2) no less than 1; thus Eq. (2) is harder to violate. In contrast, in a data distribution which has greatly varying densities between clusters, the left side of Eq. (3) can easily be smaller than 1 if \(\acute{\mathbf x}\) is from a cluster of low density, which makes a violation of Eq. (3) a lot easier.
Figure 5a shows the top seven points of the synthetic dataset, as determined by CFSFDP using density. Figure 5b shows the normalised density and LC of these seven points. Figure 5c shows the rankings due to \(LC(\mathbf x) \times \delta _{LC}(\mathbf x)\) and \(f(\mathbf x) \times \delta _f(\mathbf x)\). This change has enabled the centre of the sparse cluster to move from rank #7 to rank #2.
6 Experiments
Characteristics of datasets used in the experiments, where N is the dataset size, d is the number of features and M is the number of classes

| Dataset | N | d | M | Dataset | N | d | M |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Aggregation | 788 | 2 | 7 | Libras | 360 | 90 | 15 |
| Banknote | 1372 | 4 | 2 | Pathbased | 300 | 2 | 3 |
| Breastd | 569 | 30 | 2 | Pendig | 10,992 | 16 | 10 |
| Breasto | 699 | 9 | 2 | Seeds | 210 | 7 | 3 |
| Control | 600 | 60 | 6 | Segment | 2310 | 19 | 7 |
| Diabetes | 768 | 8 | 2 | Shape | 160 | 17 | 9 |
| Haberman | 306 | 3 | 2 | Thyroid | 215 | 5 | 3 |
| Iris | 150 | 4 | 3 | Vowel | 990 | 10 | 11 |
| Jain | 373 | 2 | 2 | Wine | 178 | 13 | 3 |
In all experiments, performance is measured in terms of F-measure. Given a clustering result, we calculate the precision score \(p_{m}\) and the recall score \(r_{m}\) for each cluster \(C_{m}\) based on the confusion matrix. The F-measure of \(C_{m}\) is the harmonic mean of \(p_{m}\) and \(r_{m}\). We then use the Hungarian algorithm (Kuhn 1955) to search for the optimal match between clusters and classes. The overall F-measure is the weighted average over all clusters: F-measure \(=\sum _{m=1}^{M}\frac{|C_m|}{N} \times \frac{2p_{m}r_{m}}{p_{m}+r_{m}}\), where N is the dataset size and \(|C_m|\) is the size of cluster \(C_m\). In the calculation of F-measure, points labelled as noise are not removed from the dataset, but they are not regarded as a cluster. In addition, we also evaluate performance in terms of the Adjusted Rand Index (ARI) (Hubert and Arabie 1985). The outcome is similar to that of using F-measure. For clarity of presentation, we provide the results based on ARI in Appendix B.
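The overall F-measure computation described above can be sketched as follows. For brevity, this illustration brute-forces the optimal cluster-to-class matching, which is equivalent to the Hungarian matching used in the paper when the number of clusters is small, and it assumes the number of clusters equals the number of classes (as in our experiments, where M is set accordingly). All names are ours:

```python
import numpy as np
from itertools import permutations

def overall_f_measure(y_true, y_pred):
    """Weighted overall F-measure under the best cluster-to-class match.

    Points labelled -1 (noise) are kept in N but not treated as a cluster.
    Assumes the number of clusters equals the number of classes.
    """
    classes = list(np.unique(y_true))
    clusters = list(np.unique(y_pred[y_pred >= 0]))
    n = len(y_true)
    best = 0.0
    # Brute-force matching: try every assignment of classes to clusters.
    for perm in permutations(classes):
        total = 0.0
        for clu, cls in zip(clusters, perm):
            tp = np.sum((y_true == cls) & (y_pred == clu))
            if tp == 0:
                continue
            p = tp / np.sum(y_pred == clu)      # precision of this cluster
            r = tp / np.sum(y_true == cls)      # recall of this cluster
            # Weight each cluster's F-measure by its size |C_m| / N.
            total += (np.sum(y_pred == clu) / n) * (2 * p * r / (p + r))
        best = max(best, total)
    return best
```

For larger numbers of clusters, `scipy.optimize.linear_sum_assignment` performs the same optimal matching in polynomial time.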
Parameters and their search ranges

| Algorithm | Parameter | Search range or setting |
| --- | --- | --- |
| All versions of CFSFDP | \(\epsilon \) | \(\{0.1,0.2,\ldots ,10\% \}\) |
| | M | \(\{2,3,\ldots ,20\}\) |
| LC-CFSFDP | K | \(\sqrt{N}\) |
| SNN-CFSFDP | K | \(\sqrt{N}\) |
| ReScale-CFSFDP | \(\psi \) | 100 |
| | \(\eta \) | 0.2 |
| FKNN-DPC | K | \(\sqrt{N}\) |
6.1 Comparing LC to SNN and ReScale
Comparison of original and improved versions of CFSFDP in terms of F-measure

| Dataset | CFSFDP | LC-CFSFDP | SNN-CFSFDP | ReScale-CFSFDP |
| --- | --- | --- | --- | --- |
| Aggregation | 0.996 | 0.996 | 0.983 | 0.993 |
| Banknote | 0.991 | 0.975 | 0.782 | 0.972 |
| Breastd | 0.830 | 0.940 | 0.877 | 0.945 |
| Breasto | 0.917 | 0.966 | 0.704 | 0.967 |
| Control | 0.708 | 0.720 | 0.717 | 0.732 |
| Diabetes | 0.602 | 0.655 | 0.635 | 0.649 |
| Haberman | 0.616 | 0.671 | 0.641 | 0.605 |
| Iris | 0.967 | 0.967 | 0.892 | 0.960 |
| Jain | 0.972 | 1.000 | 0.975 | 1.000 |
| Libras | 0.480 | 0.535 | 0.519 | 0.500 |
| Pathbased | 0.828 | 0.832 | 0.891 | 0.775 |
| Pendig | 0.794 | 0.826 | 0.664 | 0.771 |
| Seeds | 0.909 | 0.919 | 0.866 | 0.904 |
| Segment | 0.785 | 0.791 | 0.522 | 0.763 |
| Shape | 0.699 | 0.751 | 0.802 | 0.798 |
| Thyroid | 0.707 | 0.850 | 0.917 | 0.900 |
| Vowel | 0.317 | 0.322 | 0.334 | 0.345 |
| Wine | 0.931 | 0.949 | 0.932 | 0.931 |
| Average rank | 3.00 | 1.61 | 2.83 | 2.39 |
| Win/draw/loss against CFSFDP | – | 15/2/1 | 11/0/7 | 10/0/8 |
| Win/draw/loss against LC-CFSFDP | 1/2/15 | – | 4/0/14 | 6/1/11 |
Pairwise Friedman tests: p values

| Friedman p values | LC-CFSFDP | SNN-CFSFDP | ReScale-CFSFDP |
| --- | --- | --- | --- |
| CFSFDP | 0.0005 | 0.3458 | 0.8084 |
| LC-CFSFDP | – | 0.0184 | 0.2253 |
| SNN-CFSFDP | – | – | 0.1573 |
In a nutshell, Local Contrast significantly improves the CFSFDP algorithm, and the resulting LC-CFSFDP is the best density-based clustering method among the current state-of-the-art.
6.2 Comparing LC-CFSFDP to FKNN-DPC
Comparison of LC-CFSFDP and FKNN-DPC in terms of F-measure

| Dataset | LC-CFSFDP (K searched) | FKNN-DPC (K searched) | LC-CFSFDP (\(K = \sqrt{N}\)) | FKNN-DPC (\(K = \sqrt{N}\)) |
| --- | --- | --- | --- | --- |
| Aggregation | 1.000 | 0.999 | 0.996 | 0.998 |
| Banknote | 0.981 | 0.969 | 0.975 | 0.486 |
| Breastd | 0.941 | 0.919 | 0.940 | 0.711 |
| Breasto | 0.972 | 0.921 | 0.966 | 0.804 |
| Control | 0.745 | 0.722 | 0.720 | 0.446 |
| Diabetes | 0.674 | 0.657 | 0.655 | 0.577 |
| Haberman | 0.692 | 0.692 | 0.671 | 0.683 |
| Iris | 0.967 | 0.973 | 0.967 | 0.854 |
| Jain | 1.000 | 0.982 | 1.000 | 0.963 |
| Libras | 0.547 | 0.500 | 0.535 | 0.425 |
| Pathbased | 0.906 | 0.990 | 0.832 | 0.548 |
| Pendig | 0.830 | 0.883 | 0.826 | 0.656 |
| Seeds | 0.933 | 0.923 | 0.919 | 0.631 |
| Segment | 0.804 | 0.768 | 0.791 | 0.573 |
| Shape | 0.760 | 0.717 | 0.751 | 0.633 |
| Thyroid | 0.894 | 0.909 | 0.850 | 0.848 |
| Vowel | 0.325 | 0.290 | 0.322 | 0.220 |
| Wine | 0.949 | 0.955 | 0.949 | 0.694 |
| Average rank | 1.28 | 1.67 | 1.11 | 1.89 |
| Win/draw/loss against LC-CFSFDP | – | 5/1/12 | – | 2/0/16 |
6.3 Runtime
Runtime in seconds

| Dataset | CFSFDP | LC-CFSFDP | SNN-CFSFDP | ReScale-CFSFDP | FKNN-DPC |
| --- | --- | --- | --- | --- | --- |
| Aggregation | 0.07 | 0.1 | 1.97 | 0.15 | 0.2 |
| Banknote | 0.24 | 0.34 | 4.78 | 0.41 | 0.51 |
| Breastd | 0.05 | 0.06 | 0.98 | 0.2 | 0.18 |
| Breasto | 0.05 | 0.09 | 1.34 | 0.18 | 0.15 |
| Control | 0.06 | 0.07 | 1.06 | 0.26 | 0.16 |
| Diabetes | 0.08 | 0.1 | 1.81 | 0.19 | 0.19 |
| Haberman | 0.02 | 0.02 | 0.38 | 0.1 | 0.05 |
| Iris | 0.01 | 0.01 | 0.19 | 0.09 | 0.02 |
| Jain | 0.02 | 0.03 | 0.55 | 0.09 | 0.07 |
| Libras | 0.03 | 0.04 | 0.46 | 0.23 | 0.12 |
| Pathbased | 0.02 | 0.02 | 0.4 | 0.09 | 0.04 |
| Pendig | 21.69 | 30.04 | 345.17 | 22.53 | 44.27 |
| Seeds | 0.01 | 0.01 | 0.32 | 0.01 | 0.02 |
| Segment | 0.82 | 1.13 | 16.59 | 0.96 | 1.5 |
| Shape | 0.01 | 0.01 | 0.26 | 0.02 | 0.02 |
| Thyroid | 0.01 | 0.01 | 0.28 | 0.02 | 0.03 |
| Vowel | 0.13 | 0.19 | 3.04 | 0.17 | 0.33 |
| Wine | 0.01 | 0.01 | 0.27 | 0.02 | 0.02 |
| Time complexity | \(O(N^2)\) | \(O(N^2+KN)\) | \(O(K^2N^2)\) | \(O(N^2)\) | \(O(N^2+K^2N)\) |
6.4 K sensitivity test
7 Discussion
Local Contrast can be applied with any density estimator, not only the \(\epsilon \)-neighbourhood density estimator which has been employed in CFSFDP (Rodriguez and Laio 2014). For example, Local Contrast can be applied to DENCLUE (Hinneburg and Gabriel 2007), which employs a kernel density estimator in its operation.
We chose CFSFDP over other density-based methods such as DBSCAN as the base algorithm because it is a more advanced method. This is confirmed by comparing DBSCAN with CFSFDP in clustering the 18 datasets. The result, provided in Appendix C, shows that CFSFDP outperforms DBSCAN on all but one dataset. To be fair and complete, we also compare LC-CFSFDP to the original SNN and ReScale approaches. The result, provided in Appendix D, shows that LC-CFSFDP outperforms both methods.
The choice of the parameter K in KNN-based methods is usually time-consuming since such methods are often sensitive to K. However, we have shown that LC is not as sensitive to K as SNN or FKNN-DPC. As a rule of thumb, setting \(K=\sqrt{N}\) has been empirically verified to be effective for LC.
As to the choice of parameter M (the number of clusters), both LCCFSFDP and the original CFSFDP have the same requirement. For a specific dataset, the proper choice of M is a user decision that could be made based on domain knowledge, visual inspection, or other means. In our experiments, M is simply searched to show the best capability of each method.
The original CFSFDP does not explicitly identify any data point as noise. Instead, after the clustering procedure, it takes an extra step to produce cluster halos, which can be considered noise. In our experiments, no noise points are produced: all variants of CFSFDP, as well as FKNN-DPC, cluster the whole dataset. However, when handling noisy datasets, LC-CFSFDP can also produce cluster halos in the same way as CFSFDP.
Grid-based clustering approaches partition the space into a number of cells and use the cell density to identify clusters. For example, GRIDCLUS (Schikuta 1996) and NSGC (Ma and Chow 2004) rely on cell density to identify core cells and link neighbouring core cells together to form clusters. Instead of the current point-based definition, Local Contrast could be redefined using the densities of neighbouring cells, and then employed in these algorithms to improve their performance.
Another possible application of LC is density-based subspace clustering, such as SUBCLU (Kailing et al. 2004) and DUSC (Assent et al. 2007). These methods use a density threshold to differentiate between cluster points and noise in different subspaces. However, density is dimensionality-biased: when estimated using distance-based density estimators, the density of a data cloud tends to be lower in higher-dimensional spaces. Hence these methods suffer from density variation across subspaces of different dimensionalities: low thresholds detect high-dimensional clusters but have difficulty filtering out noise in low-dimensional subspaces, while high thresholds screen out noise well in low-dimensional subspaces but tend to overlook high-dimensional clusters (Zimek and Vreeken 2015). LC can possibly be an effective remedy for this issue in subspace clustering since LC is not dimensionality-biased.
However, not all density-based methods can utilise Local Contrast readily, because some do not employ density directly in their operations. For example, instead of density, OPTICS (Ankerst et al. 1999) employs "core distance" and "reachability distance" to rank points in order to identify clusters. The "reachability distance" reflects the density in that points with a lower density normally have a higher "reachability distance". It would be interesting to explore whether Local Contrast can be redefined using these distances rather than density.
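For reference, the two quantities can be sketched as follows, following the standard definitions of Ankerst et al. (1999) but omitting the eps cut-off and the ordering bookkeeping of the full algorithm (function names and the toy distances are ours):

```python
def core_distance(neighbour_dists, min_pts):
    """Core distance of a point p: the distance to its MinPts-th nearest
    neighbour (the eps cut-off of the full algorithm is omitted here)."""
    return sorted(neighbour_dists)[min_pts - 1]

def reachability_distance(dist_o_p, core_dist_p):
    """Reachability distance of o w.r.t. p: never smaller than p's core
    distance, so points in dense regions get small, smoothed values."""
    return max(core_dist_p, dist_o_p)

# p's distances to its neighbours; two query points, one close and one far.
cd = core_distance([0.2, 0.5, 0.9, 3.0], min_pts=3)
near = reachability_distance(0.4, cd)  # clipped up to the core distance
far = reachability_distance(2.0, cd)   # the raw distance dominates
```

Because low-density points systematically receive larger reachability distances, a Local Contrast defined over these distances (rather than over raw density) would need to invert this relationship.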
8 Conclusions
In this paper, we identify the root cause of CFSFDP's failure to detect all clusters in a dataset with hugely varying densities. To the best of our knowledge, this is the first work that overcomes CFSFDP's weakness at its root cause.
We make the following three contributions:
First, we formalise a necessary condition for CFSFDP to correctly identify all clusters. We show that a violation of this condition leads to poor clustering performance. This explains why a density-based clustering algorithm such as CFSFDP is unable to correctly identify all clusters in datasets with large density variations.
Second, we propose a new measure called Local Contrast, as an alternative to density, to improve the capability of density-based clustering methods to detect clusters of hugely different densities in a dataset. We show that it has two unique properties that are critical to this capability: all cluster centres in the Local Contrast distribution have the same constant value, as do all local minima of Local Contrast, which correspond to the local minima of the density distribution, regardless of the densities of these cluster centres and local minima. We show that these properties make density-based algorithms much more robust in the presence of large density variations.
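To make the first property concrete, the following sketch computes Local Contrast on a toy one-dimensional dataset, assuming the kNN-based reading of the measure (the LC of a point is the number of its k nearest neighbours with lower density); the toy data and the crude density estimator are ours, chosen only for illustration:

```python
def local_contrast(densities, knn):
    """LC(x): how many of x's k nearest neighbours have lower density than x."""
    return [sum(densities[j] < densities[i] for j in knn[i])
            for i in range(len(densities))]

# Two 1-D clusters of very different densities.
points = [0.0, 0.1, 0.2, 5.0, 6.0, 7.0]
k = 2

# k-nearest-neighbour index lists, by absolute distance (the point itself excluded).
knn = [sorted(range(len(points)), key=lambda j: abs(points[j] - points[i]))[1:k + 1]
       for i in range(len(points))]

# Crude density estimate: inverse of the distance to the k-th neighbour.
densities = [1.0 / abs(points[knn[i][-1]] - points[i]) for i in range(len(points))]

lc = local_contrast(densities, knn)  # both cluster centres attain LC == k
```

Although the dense cluster's centre has roughly ten times the density of the sparse cluster's centre, both centres attain the same maximal value LC = k, which is the property that makes centre selection robust to varying densities.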
Third, by incorporating Local Contrast into CFSFDP, we create a powerful method, LC-CFSFDP, which has much better detection power than the original method. Our empirical evaluation shows that LC-CFSFDP is the best performer compared with two state-of-the-art methods, SNN and ReScale, as well as FKNN-DPC, a recent improvement of CFSFDP.
Acknowledgements
Bo Chen is supported by Monash Data61 Postgraduate Research Scholarship and Faculty of IT Tuition Fee Scholarship, Monash University.
References
Ankerst, M., Breunig, M. M., Kriegel, H. P., & Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD international conference on management of data (pp. 49–60). New York, NY: ACM.
Assent, I., Krieger, R., Müller, E., & Seidl, T. (2007). DUSC: Dimensionality unbiased subspace clustering. In Proceedings of the 7th international conference on data mining (pp. 409–414). IEEE.
Borah, B., & Bhattacharyya, D. (2008). DDSC: A density differentiated spatial clustering technique. Journal of Computers, 3(2), 72–79.
Brito, M., Chavez, E., Quiroz, A., & Yukich, J. (1997). Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection. Statistics & Probability Letters, 35(1), 33–42.
Chang, H., & Yeung, D. Y. (2008). Robust path-based spectral clustering. Pattern Recognition, 41(1), 191–203.
Cherkassky, V., & Mulier, F. M. (2007). Learning from data: Concepts, theory, and methods. Hoboken: Wiley.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B: Statistical Methodology, pp. 1–38.
Ertöz, L., Steinbach, M., & Kumar, V. (2003a). Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In Proceedings of the 2003 SIAM international conference on data mining (pp. 47–58).
Ertöz, L., Steinbach, M., & Kumar, V. (2003b). Finding topics in collections of documents: A shared nearest neighbor approach. Clustering and Information Retrieval, 11, 83–103.
Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd international conference on knowledge discovery and data mining (pp. 226–231).
Ferilli, S., Biba, M., Basile, T., Di Mauro, N., & Esposito, F. (2008). k-Nearest neighbor classification on first-order logic descriptions. In Proceedings of the IEEE international conference on data mining workshops (pp. 202–210).
Fukunaga, K. (1990). Introduction to statistical pattern recognition (2nd ed.). San Diego, CA: Academic Press Professional, Inc.
Gionis, A., Mannila, H., & Tsaparas, P. (2007). Clustering aggregation. ACM Transactions on Knowledge Discovery from Data, 1(1), 4.
Han, J., & Kamber, M. (2011). Data mining: Concepts and techniques (3rd ed.). Los Altos, CA: Morgan Kaufmann.
Hinneburg, A., & Gabriel, H. H. (2007). DENCLUE 2.0: Fast clustering based on kernel density estimation. In Advances in intelligent data analysis VII (pp. 70–80). Springer.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8), 651–666.
Jain, A. K., & Law, M. H. (2005). Data clustering: A user's dilemma. In Pattern recognition and machine intelligence (pp. 1–10). Springer.
Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, 100(11), 1025–1034.
Kailing, K., Kriegel, H. P., & Kröger, P. (2004). Density-connected subspace clustering for high-dimensional data. In Proceedings of the international conference on data mining (pp. 246–256). SIAM.
Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics, 2(1–2), 83–97.
Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 31 May 2017.
Ma, E. W., & Chow, T. W. (2004). A new shifting grid clustering algorithm. Pattern Recognition, 37(3), 503–514.
Müller, E., Günnemann, S., Assent, I., & Seidl, T. (2009). Evaluating clustering in subspace projections of high dimensional data. Proceedings of the VLDB Endowment, 2, 1270–1281.
Ram, A., Sharma, A., Jalal, A. S., Agrawal, A., & Singh, R. (2009). An enhanced density based spatial clustering of applications with noise. In Proceedings of the IEEE international advance computing conference (pp. 1475–1478).
Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492–1496.
Schikuta, E. (1996). Grid-clustering: An efficient hierarchical clustering method for very large data sets. In Proceedings of the 13th IEEE international conference on pattern recognition (Vol. 2, pp. 101–105).
Tan, J., & Wang, R. (2013). Smooth splicing: A robust SNN-based method for clustering high-dimensional data. Mathematical Problems in Engineering, 2013, 1–9.
Xie, J., Gao, H., Xie, W., Liu, X., & Grant, P. W. (2016). Robust clustering by detecting density peaks and assigning points based on fuzzy weighted k-nearest neighbors. Information Sciences, 354, 19–40.
Xu, D., & Tian, Y. (2015). A comprehensive survey of clustering algorithms. Annals of Data Science, 2(2), 165–193.
Zhu, Y., Ting, K. M., & Carman, M. J. (2016). Density-ratio based clustering for discovering clusters with varying densities. Pattern Recognition, 60, 983–997.
Zimek, A., & Vreeken, J. (2015). The blind men and the elephant: On meeting the problem of multiple truths in data from clustering and pattern mining perspectives. Machine Learning, 98(1–2), 121–155.
Zitzler, E., Laumanns, M., & Bleuler, S. (2004). A tutorial on evolutionary multiobjective optimization. In Metaheuristics for multiobjective optimisation (pp. 3–37). Springer.