A novel density peaks clustering algorithm based on K nearest neighbors with adaptive merging strategy

Recently, the density peaks clustering algorithm (DPC) has received considerable attention from researchers. DPC finds cluster centers and completes clustering tasks quickly, and it is suitable for many kinds of clustering tasks. However, choosing the cutoff distance $d_c$ depends largely on human experience, which greatly affects clustering results. In addition, the selection of cluster centers requires manual participation, which reduces the efficiency of the algorithm. To solve these problems, we propose a density peaks clustering algorithm based on K nearest neighbors with an adaptive merging strategy (KNN-ADPC). A cluster merging strategy is proposed to automatically aggregate over-segmented clusters, and the K nearest neighbors are adopted to divide data points more reasonably. KNN-ADPC has only one parameter, and the clustering task can be conducted automatically without human involvement. Experiment results on artificial and real-world datasets demonstrate the higher accuracy of KNN-ADPC compared with DBSCAN, K-means++, DPC, and DPC-KNN.


Introduction
Clustering is one of the most important machine learning techniques and has been widely applied in many fields, such as data mining and the chemical industry. The basic idea of clustering is that data points with high similarity should be divided into the same cluster, while points with low similarity should be divided into different clusters [1].
There are three main families of classic clustering algorithms: density-based clustering [2], partition-based clustering [3], and hierarchical clustering [4]. In recent years, many new clustering algorithms have been proposed, such as spectral clustering [5], multi-kernel clustering [6], multi-view clustering [7], subspace clustering [8], ensemble clustering [9], and deep embedded clustering [10]. The drawback of these newer algorithms is that both their complexity and computation costs are larger than those of the classical ones.
Among all clustering algorithms, K-means [11] and DBSCAN [12] are the most classic methods. K-means is one of the most famous partition-based methods. Its clustering process begins by selecting K initial center points and then iteratively assigns the remaining points to their nearest cluster. This compact idea allows K-means to complete clustering tasks quickly. However, the clustering result is vulnerable to the selection of the initial center points; K-means++ [13] can partially solve this problem. Besides, both K-means and K-means++ are inadequate for non-spherical clusters. DBSCAN is one of the most popular density-based clustering algorithms, and density-based clustering is suitable for non-spherical clustering tasks. The basic idea of DBSCAN is that clusters are decided according to density connection relationships.
In DBSCAN, points are divided into core objects and noise points, and core objects are aggregated into the same cluster if they are density reachable. Nevertheless, DBSCAN requires two predefined hyperparameters for screening core objects, and their optimal values are usually difficult to determine in practice.
The density peaks clustering algorithm (DPC) [14], proposed by Rodriguez and Laio in 2014, has attracted great attention from researchers. DPC can deal with clusters of different shapes. It is based on two assumptions: (1) a cluster center is surrounded by points of lower density; (2) a cluster center is far from other cluster centers. With these two assumptions, DPC finds cluster centers and completes the clustering task easily and quickly. The core idea of DPC is to calculate, for each point, the local density $\rho_i$ and the distance $\delta_i$ from higher-density points, and to draw a decision graph from which cluster centers can be selected. Each remaining point is then assigned to the cluster of its nearest higher-density point. Although DPC is concise and efficient, it still has shortcomings. For example, the selection of cluster centers depends on human experience, which greatly limits its autonomy. Furthermore, DPC cannot handle clustering tasks in which one cluster has more than one high-density center point. In addition, although the assignment rule of DPC is very efficient, a domino effect occurs once a point is misclassified during the process.
To address these drawbacks, many improved clustering algorithms based on DPC have been proposed. FKNN-DPC [15] improves two assignment strategies to overcome the weaknesses of the assignment rule of DPC; however, its selection of cluster centers is the same as in DPC and still requires manual participation. CFSFDP+A [16] accelerates the calculation of distances between points, but the clustering process itself remains unchanged. To overcome the disadvantage of the one-step allocation strategy in DPC, the shared-nearest-neighbor-based clustering algorithm SNN-DPC [17] uses a two-step allocation strategy to ensure the correct assignment of points; however, SNN-DPC is far more complex than DPC and also needs manual participation. DPC-KNN [18] proposes an allocation strategy using K nearest neighbors, which makes the calculation of the distance from higher-density points more reasonable, but it still lacks autonomy.
In this paper, we propose a density peaks clustering algorithm based on K nearest neighbors with an adaptive merging strategy (KNN-ADPC). The K nearest neighbors are used to calculate the distance $\delta_i$ and to assign data points. In addition, a novel adaptive merging strategy is proposed to solve the potential problem of over-segmentation. The main innovations of the KNN-ADPC algorithm are as follows:
1. The K nearest neighbors are introduced into the assignment rule, which is more reasonable than the original assignment rule of DPC.
2. KNN-ADPC has only one hyperparameter, which greatly reduces the effort of parameter tuning.
3. KNN-ADPC has a high degree of autonomy without losing accuracy: with no human involvement, the proposed adaptive merging strategy still performs well on non-spherical clustering tasks.
4. A creative and effective automatic cluster merging strategy is proposed to solve the over-segmentation problem and correct clustering results.
The rest of this paper is organized as follows. The details of DPC and DPC-KNN are described in Sect. 2. Section 3 introduces how the KNN-ADPC algorithm works. The experiment results are given in Sect. 4 and discussed in Sect. 5. Finally, the paper ends with conclusions and perspectives in Sect. 6.

Related works
In this section, we briefly review the original DPC and DPC-KNN algorithms.

DPC: density peaks clustering
DPC is a density peaks clustering algorithm that finds cluster centers quickly and adapts well to a variety of clustering tasks. For each point $p_i$ in dataset $X$, DPC computes the local density $\rho_i$ and the distance $\delta_i$ from higher-density points to build a decision graph for cluster center selection. The calculation of $\rho_i$ and $\delta_i$ is defined by Eqs. (1) and (2):

$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c) \quad (1)$$

$$\delta_i = \min_{j:\, \rho_j > \rho_i} (d_{ij}) \quad (2)$$
where $d_{ij}$ is the distance between two different points $p_i$ and $p_j$, and $d_c$ is the cutoff distance given by the user. $\chi(\cdot)$ is the indicator function, with $\chi(x) = 1$ if $x < 0$ and $\chi(x) = 0$ otherwise. For the point with the highest local density, $\delta_i$ is specified as the maximum distance between two points, i.e., $\delta_i = \max_j(d_{ij})$. However, the local density of Eq. (1) can be misleading when $d_c$ is not chosen properly. For instance, as shown in Fig. 1, the red points in case 1 and case 2 both have 10 points within the $d_c$ range. Following Eq. (1), we would conclude that their local densities are the same, yet the local density of the red point in case 1 is obviously higher than in case 2. Although this problem can be mitigated by repeatedly adjusting $d_c$, it still occurs with high probability.
To solve this problem, DPC also uses another local density, based on a Gaussian kernel, which is defined by Eq. (3):

$$\rho_i = \sum_{j \in I_S} \exp\!\left(-\frac{d_{ij}^2}{d_c^2}\right) \quad (3)$$

where $I_S$ is the set of all points whose distance from point $p_i$ is less than $d_c$.
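As an illustration, the following Python sketch computes $\rho_i$ with both the cutoff kernel of Eq. (1) and the Gaussian kernel of Eq. (3), and $\delta_i$ as in Eq. (2). The function name and the use of NumPy/SciPy are our own choices, not part of the original DPC implementation:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dpc_rho_delta(X, dc, gaussian=True):
    """Compute rho_i (Eq. (1) or (3)) and delta_i (Eq. (2)) for DPC."""
    d = squareform(pdist(X))                    # pairwise distances d_ij
    np.fill_diagonal(d, np.inf)                 # exclude self-distances
    if gaussian:
        # Gaussian kernel over the points within dc (Eq. (3))
        rho = np.where(d < dc, np.exp(-(d / dc) ** 2), 0.0).sum(axis=1)
    else:
        # cutoff kernel (Eq. (1)): count the points within dc
        rho = (d < dc).sum(axis=1).astype(float)
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]      # points with higher density
        if higher.size == 0:                    # densest point: delta = max_j d_ij
            delta[i] = d[i][np.isfinite(d[i])].max()
        else:
            delta[i] = d[i, higher].min()       # Eq. (2)
    return rho, delta
```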

DPC-KNN: density peaks clustering based on K nearest neighbors
To solve the misclassification of the original DPC on some circular clustering tasks, DPC-KNN adopts the K nearest neighbor method to compute the distance $\delta_i$, as in Eq. (4). Compared with the calculation of $\delta_i$ in DPC, Eq. (4) expands the scope of the distance calculation by treating the set of K nearest neighbors of each point as a whole, since the K nearest neighbors of a point represent its internal structure more comprehensively than a single point.
where $KNN_i$ is the set of K nearest neighbors of point $p_i$, defined by Eq. (5), and K is the number of nearest neighbors.
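The exact form of Eq. (4) is not reproduced here, so the sketch below rests on an assumption: that $\delta_i$ is the smallest average distance from $p_i$ to the K-nearest-neighbor set of any higher-density point, which matches the description of treating the KNN set "as a whole". The helper name `delta_knn` and the input layout are illustrative:

```python
import numpy as np

def delta_knn(d, rho, knn_idx):
    """Hedged sketch of a KNN-based delta_i in the spirit of Eq. (4).

    d:       (N, N) pairwise distance matrix
    rho:     (N,) local densities
    knn_idx: (N, K) indices of each point's K nearest neighbors

    Assumption: delta_i is the smallest *average* distance from p_i to the
    K-nearest-neighbor set of any higher-density point p_j.
    """
    N = len(rho)
    delta = np.empty(N)
    for i in range(N):
        higher = np.where(rho > rho[i])[0]
        if higher.size == 0:                    # densest point: use max distance
            delta[i] = d[i].max()
            continue
        # treat each higher-density point's KNN set "as a whole"
        delta[i] = min(d[i, knn_idx[j]].mean() for j in higher)
    return delta
```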
In DPC-KNN, the calculation of the local density $\rho_i$ is the same as in the original DPC. Moreover, the cluster center selection method and the point assignment strategy remain unchanged, which means that researchers using DPC-KNN still need to manually decide the parameter $d_c$ and select cluster centers.

Methods
Although many improved clustering algorithms based on DPC exist, drawbacks such as required manual participation and low clustering accuracy remain. To solve these problems, we propose KNN-ADPC, a density peaks clustering algorithm based on K nearest neighbors with an adaptive merging strategy. KNN-ADPC consists of three main steps: (1) cluster center selection; (2) remaining point assignment; (3) cluster merging.

Cluster center selection
As mentioned above, the choice of the parameter $d_c$ greatly affects clustering results. Aiming at this problem, we raise a new approach that calculates $d_c$ from the internal structure of the data [19]. As suggested in DPC, $d_c$ can be chosen so that the average number of neighbors of each point is approximately 1–2% of the whole dataset. Inspired by this, we choose $d_c$ according to the original data structure, in which the tightness around each point can be described by its K nearest neighbors. The calculation of $d_c$ introduces the concept of K nearest neighbors and is defined by Eq. (6).
where $N$ is the number of sample points in the data, $d_i^K$ is the distance between point $p_i$ and its Kth nearest neighbor as defined by Eq. (7), and $\bar{d}^K$ is the average of all $d_i^K$ as defined by Eq. (8).
In Eq. (6), $\bar{d}^K$ represents the average degree of dispersion of all points. The second part is similar to a standard deviation, reflecting the volatility of each $d_i^K$ around $\bar{d}^K$. Combining these two parts lets us estimate the structure of the whole dataset appropriately, so that a proper $d_c$ can be chosen for selecting cluster centers.
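A minimal sketch of this adaptive cutoff, under the assumption suggested by the text above: $d_c$ is the mean Kth-nearest-neighbor distance (Eqs. (7)–(8)) plus a standard-deviation-like spread term. The function name `adaptive_dc` is our own:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adaptive_dc(X, K):
    """Hedged sketch of the adaptive cutoff distance of Eq. (6): the mean
    Kth-nearest-neighbor distance plus a standard-deviation-like spread
    term (the exact form of Eq. (6) is an assumption here)."""
    nn = NearestNeighbors(n_neighbors=K + 1).fit(X)   # +1: column 0 is the point itself
    dists, _ = nn.kneighbors(X)
    dK = dists[:, K]                     # d_i^K, distance to the Kth neighbor (Eq. (7))
    dK_bar = dK.mean()                   # average dispersion (Eq. (8))
    spread = np.sqrt(((dK - dK_bar) ** 2).mean())     # volatility around the mean
    return dK_bar + spread
```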
After calculating the cutoff distance $d_c$, we compute the local density $\rho_i$ for every point based on Eq. (3). Considering that the calculation of the distance $\delta_i$ in the original DPC easily leads to chain misclassification, we also adopt the K nearest neighbors to calculate $\delta_i$ based on Eq. (4).
Since $d_c$ is mainly used to determine the local density of each point, we verify its rationality and effectiveness through comparative experiments with different local density calculation methods on Jain [20], a dataset with uneven density. The local density calculation of [21] is free from human effort and from setting $d_c$; it is defined as Eq. (9).
The clustering results of the proposed KNN-ADPC with the two density calculation methods are shown in Fig. 2. With Eq. (9), dense points are preferred as cluster centers, which is more likely to cause sparse clusters to be merged into dense ones. With our $d_c$, by contrast, the center points of both low-density and high-density clusters can be identified, because $d_c$ comprehensively considers the internal structure of the entire dataset.
To ensure that all cluster centers are captured, a looser selection strategy is implemented: we select the points whose $\delta_i$ is larger than the cutoff distance $d_c$ as initial cluster centers. In this step, neither drawing a decision graph nor manual participation is necessary.
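The screening rule itself is a one-liner; the sketch below simply keeps every point whose $\delta_i$ exceeds $d_c$ as an initial center:

```python
import numpy as np

def initial_centers(delta, dc):
    """Loose screening: every point whose delta exceeds the cutoff distance
    d_c becomes an initial cluster center (no decision graph needed)."""
    return np.where(delta > dc)[0]
```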

Remaining point assignment
Here we demonstrate how chain misclassification arises in the original DPC. As shown in Fig. 3, points 2 and 3 are higher-density points relative to point 1, and the distance between point 1 and point 3 is smaller than the distance between point 1 and point 2. If the original assignment strategy is adopted, point 1 is assigned to the cluster to which point 3 belongs, and chain misclassification is then likely to occur. Hence we compute $\delta_i$ with the same method as DPC-KNN, which guarantees that the K nearest neighbors are considered when assigning points. After selecting the initial cluster centers, each remaining point is assigned to the cluster of its nearest higher-density point according to $\delta_i$.
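A simplified sketch of the assignment step follows. For brevity it uses the plain nearest-higher-density-point rule of DPC rather than the KNN-based $\delta_i$ of Eq. (4); visiting points in decreasing-density order guarantees that the nearest higher-density point is already labeled:

```python
import numpy as np

def assign_remaining(d, rho, centers):
    """Assign non-center points by visiting them in decreasing-density order;
    each point inherits the label of its nearest higher-density point, which
    has necessarily been labeled already."""
    N = len(rho)
    labels = np.full(N, -1)
    for c, idx in enumerate(centers):
        labels[idx] = c
    for i in np.argsort(-rho):                  # dense points first
        if labels[i] != -1:
            continue
        higher = np.where(rho > rho[i])[0]
        if higher.size == 0:                    # globally densest point: should
            labels[i] = labels.max() + 1        # normally already be a center
            continue
        nearest = higher[np.argmin(d[i, higher])]
        labels[i] = labels[nearest]
    return labels
```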

Cluster merging
To avoid the over-segmentation caused by the loose initial center selection strategy, we put forward an adaptive merging strategy based on the density difference and the distance between two clusters. By analyzing cluster shapes, we find that two clusters that need to be merged generally have two characteristics: (1) the two clusters have closely neighboring borders; (2) there is usually little density difference between the two clusters. To further clarify the relationship between clusters, we first make some definitions.

Definition 1 (Adjacent regions between two clusters)
The adjacent region between two clusters, border(p, q), is denoted by Eq. (10).

Definition 4 (The density difference between two clusters)
Considering that the higher-density cluster generally absorbs the lower-density one, we define the density difference diff(p, q) between two clusters $S_p$ and $S_q$ as the difference between the boundary density of the high-density cluster and the local density of the low-density cluster; the calculation is denoted by Eq. (14).
For clusters that have no adjacent region, the density difference diff(p, q) is set to $+\infty$.
Definition 5 (Density directly-reachable) Two clusters are density directly-reachable if they satisfy two constraints: (1) there is at least one pair of adjacent points between the two clusters; (2) the density difference between the two clusters is less than the average density difference over all cluster pairs. These two rules for $S_p$ and $S_q$ are denoted by Eqs. (15) and (16).
Under these circumstances, we consider clusters $S_p$ and $S_q$ density directly-reachable. In addition, the density directly-reachable relationship is symmetric.

Definition 6 (Density reachable) We consider two clusters $S_p$ and $S_q$ density reachable if there exists a chain of clusters $S_1, S_2, \ldots, S_n$ with $S_1 = S_p$ and $S_n = S_q$ such that every adjacent pair $S_i$ and $S_{i+1}$ is density directly-reachable.
We follow an assumption similar to that of [22], namely that "points in the same high-density area or the same structure are likely to have the same label", and we take the average density of each cluster to represent its internal structure. On one hand, the prerequisite for merging is that the two clusters share border points. On the other hand, according to our observation, the densities of close points within the same cluster are generally continuous and smooth. Therefore, we propose a method to calculate the density difference between each pair of clusters whose local densities differ. Besides, setting the threshold on diff(p, q) to the average density difference over all cluster pairs gives the best performance in our experiments.
After the assignment of Sect. 3.2, a preliminary clustering result is available. First, we determine the adjacent region between each pair of clusters and the border region of each cluster according to Eqs. (10) and (11). Second, the local density and boundary density of each cluster are calculated with Eqs. (12) and (13). Third, the density difference between each pair of clusters is obtained by Eq. (14). Finally, the density-reachable relationships between clusters are judged, and cluster pairs that are density reachable are merged.
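A sketch of the merge step, assuming the density differences diff(p, q) of Eq. (14) have already been computed (with $+\infty$ for non-adjacent pairs, per Definition 4). Union-find makes the merge transitive, matching the density-reachable relation of Definition 6:

```python
import numpy as np

def merge_clusters(labels, diff):
    """Merge density-reachable clusters (Definition 6) with union-find.

    labels: (N,) preliminary cluster labels in {0, ..., C-1}.
    diff:   (C, C) density differences of Eq. (14); entries are +inf for
            cluster pairs that share no adjacent region (Definition 4).
    """
    C = diff.shape[0]
    finite = diff[np.isfinite(diff)]
    # Eq. (16)-style threshold: the average density difference (assumption)
    threshold = finite.mean() if finite.size else np.inf
    parent = list(range(C))

    def find(a):                      # union-find with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for p in range(C):
        for q in range(p + 1, C):
            # density directly-reachable: adjacent (finite diff) and below average
            if min(diff[p, q], diff[q, p]) < threshold:
                parent[find(p)] = find(q)

    roots = sorted({find(c) for c in range(C)})
    remap = {r: k for k, r in enumerate(roots)}
    return np.array([remap[find(l)] for l in labels])
```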

Algorithm flow
The algorithm flow of the proposed KNN-ADPC is shown in Table 1.

The time complexity analysis of KNN-ADPC
Assume that there are $N$ points in dataset $X$ and that $C$ denotes the number of clusters. The time complexity of KNN-ADPC is mainly determined by the following steps: (1) computing the distance matrix, $O(N^2)$; (2) sorting the distance vectors with quicksort, $O(N \log N)$; (3) computing the cutoff distance $d_c$, $O(N)$; (4) computing the local density $\rho_i$, $O(N)$; (5) computing the distance $\delta_i$ based on K nearest neighbors, $O(KN^2)$; since $K \ll N$, the complexity of this step is $O(N^2)$; (6) selecting the initial cluster centers and assigning the remaining points, $O(N^2)$; (7) determining the adjacent regions between clusters, $O(N^2)$; this cost peaks when the data is divided into clusters with a single point each (i.e., $N$ clusters); (8) computing the boundary density of each cluster, $O(N^2)$.
According to the above analysis, the overall time complexity of KNN-ADPC is $O(N^2)$, the same as DPC.

The space complexity of KNN-ADPC
For the proposed KNN-ADPC, several steps require storage space. Computing the local density $\rho_i$ and the distance $\delta_i$ of each point needs $2N$ spaces, and storing the K nearest neighbors of each point needs $KN$ spaces. During the merging step, space is required to store the adjacent regions between clusters; theoretically, $N$ points can generate at most $N^2$ pairs when there are $N$ clusters, although the number of border points is usually far less than $N$. The space complexity of this step is therefore at most $O(N^2)$.
In conclusion, the space complexity of the proposed KNN-ADPC is $O(N^2)$, the same as DPC.

Results
In this section, we conduct comparison experiments on artificial and real-world datasets to evaluate the effectiveness of KNN-ADPC. Experiments are conducted on a desktop computer with an Intel Core i5-4210U 1.7 GHz processor and 8 GB RAM running MATLAB R2016a. The performance of KNN-ADPC is compared with DBSCAN, K-means++, DPC, and DPC-KNN. The details of the datasets are shown in Table 2: the first eight are artificial datasets and the last two are real-world datasets.

Clustering evaluation metrics
We adopt clustering accuracy (ACC) to measure the performance of each algorithm. In addition, considering that some actual clustering tasks have no labels, performance can also be evaluated by whether similar points are clustered into the same cluster while dissimilar points are divided into different clusters. We therefore also use three evaluation metrics that are independent of the absolute label values: adjusted mutual information (AMI) [31], adjusted Rand index (ARI) [31], and the Fowlkes–Mallows index (FMI) [32]. All four metrics reach their upper limit of 1 for a perfect clustering, and larger values indicate higher clustering accuracy.
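The three label-independent metrics are available in scikit-learn. For ACC, a common convention (the paper does not spell out its exact variant) is to map predicted clusters to true labels with the Hungarian algorithm before counting matches; a sketch under that assumption:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (adjusted_mutual_info_score,
                             adjusted_rand_score, fowlkes_mallows_score)

def clustering_acc(y_true, y_pred):
    """ACC with an optimal cluster-to-label mapping (Hungarian algorithm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    t_labels, p_labels = np.unique(y_true), np.unique(y_pred)
    # negated overlap counts, so the assignment maximizes total agreement
    cost = np.zeros((len(p_labels), len(t_labels)))
    for i, p in enumerate(p_labels):
        for j, t in enumerate(t_labels):
            cost[i, j] = -np.sum((y_pred == p) & (y_true == t))
    rows, cols = linear_sum_assignment(cost)
    return -cost[rows, cols].sum() / len(y_true)

def evaluate(y_true, y_pred):
    return {"ACC": clustering_acc(y_true, y_pred),
            "AMI": adjusted_mutual_info_score(y_true, y_pred),
            "ARI": adjusted_rand_score(y_true, y_pred),
            "FMI": fowlkes_mallows_score(y_true, y_pred)}
```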

Experiments on artificial and real-world datasets
Before conducting the clustering tasks, we preprocess the real-world datasets. Missing values are replaced by the mean of all valid values in the same dimension. In addition, min-max normalization [33] is performed on each feature to eliminate differences in magnitude; it can be denoted as Eq. (17): $x_j' = \frac{x_j - \min(x_j)}{\max(x_j) - \min(x_j)}$, where $x_j$ represents one feature.
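A minimal sketch of this preprocessing, assuming missing values are encoded as NaN; the helper name `preprocess` is our own:

```python
import numpy as np

def preprocess(X):
    """Mean-impute missing values per feature, then min-max normalize each
    feature to [0, 1] as in Eq. (17)."""
    X = np.array(X, dtype=float)
    col_mean = np.nanmean(X, axis=0)        # mean of the valid values per dimension
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]          # replace missing values
    mn, mx = X.min(axis=0), X.max(axis=0)
    rng = np.where(mx > mn, mx - mn, 1.0)   # guard against constant features
    return (X - mn) / rng
```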
The clustering results are shown in detail in Tables 3, 4, 5 and 6. For each clustering algorithm, we carefully tuned the parameters to obtain its best performance; the numbers in bold represent the best result on each dataset. The results illustrate that the proposed KNN-ADPC obtains good results compared with the other clustering algorithms under the various evaluation indexes. Specifically, KNN-ADPC is superior to the classic algorithms DBSCAN and K-means++, and it also surpasses the original DPC and DPC-KNN on multiple artificial datasets. On the two real-world datasets, KNN-ADPC still achieves the best performance under most metrics. Moreover, KNN-ADPC needs no human involvement, which avoids the misclassification caused by insufficient human experience. In conclusion, the KNN-ADPC clustering algorithm achieves satisfactory performance on both artificial and real-world datasets.
The optimal parameters of each algorithm on each dataset are shown in Table 7. For DBSCAN, the neighborhood radius Eps and the minimum number of points MinPts need to be decided. For K-means++, the number of clusters is necessary. For DPC and DPC-KNN, we need to input the cutoff distance $d_c$ and manually select cluster centers; $d_c$ can be chosen so that the number of neighbors of each point is within a certain percentage of the total number of points, and in this paper we determine $d_c$ by giving the percentage parameter $p_c$. Besides, DPC-KNN requires the number of nearest neighbors K. For the proposed KNN-ADPC, the number of nearest neighbors K is the only parameter needed.

Discussion
The clustering results of each algorithm on the artificial datasets are visualized in Figs. 4, 5, 6, 7, 8, 9 and 10. As shown in Fig. 4, dataset Flame has two clusters that are very close to each other; in addition, the edge of one cluster is close to the other cluster, which is prone to causing misclassification. From the result we can see that the density-based DBSCAN fails because the distance between the two clusters is too small, and K-means++ can hardly handle the edge of the curved cluster, while the DPC-based clustering algorithms obtain correct results.
As shown in Fig. 5, the two clusters of dataset Jain are embedded in each other, which leads to misclassification by DBSCAN, K-means++, and the original DPC. Only DPC-KNN and the proposed KNN-ADPC properly handle this embedded dataset: they are able to assign points correctly because $\delta_i$ is computed with the K nearest neighbors.
As shown in Fig. 6, dataset Aggregation contains seven clusters with a few connections between some of them. The DPC-based clustering algorithms outperform both DBSCAN and K-means++, which are unsuitable for this type of clustering task.
As shown in Fig. 7, three curved clusters exist in dataset Spiral. Apart from the partition-based K-means++, which is unable to deal with non-spherical clustering tasks, all algorithms obtain correct results. As shown in Fig. 8, dataset R15 contains fifteen clusters. Misclassification occurs only with DBSCAN, because there are discrete points between clusters that lie very close together; the other algorithms assign points correctly.
For the discrete clusters in Fig. 9, all algorithms perform well except K-means++. As shown in Fig. 10, dataset D9 has four clusters, containing both spherical and curved ones, and many discrete points between the clusters easily lead to misclassification. Lacking a merging strategy, DPC-KNN cannot properly deal with the cluster with large curvature; only the proposed KNN-ADPC completes this kind of clustering task accurately.

Conclusion
In this paper, we propose KNN-ADPC, a novel density peaks clustering algorithm based on K nearest neighbors with an adaptive merging strategy. The K nearest neighbors are adopted to calculate the cutoff distance $d_c$, which solves the misclassification caused by setting $d_c$ unreasonably. In addition, the calculation of the distance $\delta_i$ takes into consideration how dispersed the K nearest neighbors of a point are. The adaptive merging strategy automatically merges over-segmented clusters. The proposed KNN-ADPC is free of human involvement, which greatly enhances the efficiency of the clustering process. Finally, the experiment results demonstrate the outstanding performance of KNN-ADPC compared with the other clustering algorithms: on the artificial datasets, KNN-ADPC achieves nearly 100% on accuracy as well as on the other evaluation metrics (ARI, AMI, and FMI), and on the higher-dimensional, more complex real-world datasets it still completes the clustering tasks automatically with excellent performance.
However, when implementing KNN-ADPC, the parameter K still has to be decided with expertise. In future work, efforts will be devoted to determining the parameter K automatically.