A minimum spanning tree based partitioning and merging technique for clustering heterogeneous data sets

Abstract

Clustering, an unsupervised learning technique, has been used extensively for knowledge discovery because of its limited dependence on domain knowledge. Many clustering techniques have been proposed in the literature to recognize clusters of different characteristics. Most of them become inadequate either because of their dependency on user-defined parameters or when they are applied to multi-scale datasets. Hybrid clustering techniques have been proposed to take advantage of both partitional and hierarchical techniques by first partitioning the dataset into several dense sub-clusters and then merging them into actual clusters. However, the generality of the partitioning and merging criteria is often insufficient to capture many characteristics of the clusters. The minimum spanning tree (MST) has been used extensively for clustering because it preserves the intrinsic nature of the dataset even after sparsification of the graph. In this paper, we propose a parameter-free, MST-based, efficient hybrid clustering algorithm for multi-scale datasets. In the first phase, we construct an MST of the dataset to capture the neighborhood information of the data points and apply the box-plot, an outlier detection technique, to the MST edge weights to effectively select the inconsistent edges and partition the data points into several dense sub-clusters. In the second phase, we propose a novel merging criterion to find the genuine clusters by iteratively merging only pairs of adjacent sub-clusters. The merging criterion combines dis-connectivity and intra-similarity using the topology of adjacent pairs, which helps to identify clusters of arbitrary shape and varying density. Experimental results on various synthetic and real-world datasets demonstrate the superior performance of the proposed technique over other popular clustering algorithms.

Introduction

Clustering is an unsupervised learning technique that is used extensively to discover the underlying patterns of data objects. Several clustering techniques have been proposed to effectively recognize clusters of different characteristics (Hartigan and Wong 1979; Jain and Dubes 1988; Ester et al. 1996; Koga et al. 2007; Schlitter et al. 2014; Limwattanapibool and Arch-int 2017; Kavitha and Tamilarasan 2019; Otoo et al. 2001). Popular approaches such as partitional, hierarchical, and density-based methods have been widely used for clustering; however, they lack the ability to identify intrinsic clusters due to the heterogeneous nature of datasets in terms of arbitrary shapes, different sizes, and varying densities (Jothi et al. 2016b; Zhong et al. 2011; Karypis et al. 1999; Jothi et al. 2016a; Du et al. 2019; Chung and Dai 2014). To overcome these issues, many graph-based clustering techniques have been developed (Jothi et al. 2016a; Zahn 1971). In graph-based clustering methods, a weighted undirected graph G = (V,E) is constructed in which the data objects are the vertices V and the edge set E encodes the similarity (or dissimilarity) between data objects according to some graph modeling technique. Popular amongst them are minimum spanning tree (MST) based clustering techniques, which show impressive performance in detecting clusters with irregular boundaries (Grygorash et al. 2006; Jothi et al. 2016a). However, it is often challenging to find the inconsistent edges in the MST, because removing the k − 1 longest edges may not necessarily preserve the structure of the clusters.

To reduce these limitations, various hybrid clustering methods have been introduced that combine the advantages of the traditional clustering techniques (Zhong et al. 2011; Karypis et al. 1999; Cheng et al. 2016b; Lin and Chen 2005; Kumar and Reddy 2016). Hybrid clustering techniques analyse the dataset in three stages: in the first stage, the dataset is modeled as a graph; in the second stage, the graph is divided into a large number of small sub-clusters using a suitable partitioning criterion; and in the third stage, the sub-clusters are merged into actual clusters based on both the inter-connectivity amongst the clusters and the similarity of data points within the clusters. Several hybrid clustering techniques have been proposed to effectively detect clusters of diverse shapes, sizes, and varying densities in multi-scale datasets (Karypis et al. 1999; Zhong et al. 2011; Mishra and Mohanty 2019; Li et al. 2018; Mishra and Mohanty 2020). The hierarchical clustering algorithm using dynamic modeling (CHAMELEON) (Karypis et al. 1999) improves existing hierarchical methods by considering both inter- and intra-similarity in the merging process. The minimum spanning tree based split-and-merge method (SAM) (Zhong et al. 2011) identifies the expected clusters using an MST neighborhood graph. A fast and parameter-free hybrid clustering technique using an MST is proposed in (Mishra and Mohanty 2019) to detect clusters using local nearest neighbors. It partitions the dataset using the dispersion level of points and merges the sub-clusters into the expected clusters on the basis of cohesion and intra-similarity.

In summary, traditional clustering techniques fail to identify clusters of different characteristics because they either depend on user-specified parameters or on the suitability of similarity measures, or fail to capture the intrinsic structure of clusters. Although hybrid clustering techniques are designed to identify clusters of different characteristics, their dependency on user-defined parameters for constructing the neighborhood graph is a challenge when the dataset contains clusters of complex structure with varying density. Designing partitioning and merging criteria that generalize across multi-scale datasets is another challenge for hybrid clustering algorithms.

In this paper, we propose a non-parametric partitioning and merging based clustering technique using an MST to effectively identify clusters of diverse shapes, sizes, and varying densities. We employ an MST to construct the similarity graph of the dataset. The proposed technique uses the box-plot method (Cheng et al. 2016a, b) to identify the inconsistent edges and partition the dataset into a number of sub-clusters without any user-defined parameters. We propose a merge index to determine the neighboring relationship of partitions, in which both the dis-connectivity and the intra-similarity of sub-clusters are considered. Dis-connectivity computes the proximity of data points using the weights of the inter-connecting MST edges between a pair of adjacent sub-clusters (sub-clusters connected by MST edges). Intra-similarity considers the closeness of data points within a sub-cluster as the average of the MST edge weights within that sub-cluster.

We show that the proposed algorithm produces proper partitions of datasets without any user-specified parameters. The computational complexity of our algorithm is quadratic in the size of the dataset. Experimental results illustrate the effectiveness of the proposed technique in identifying clusters of different shapes, sizes, and varying densities as compared to other competing algorithms.

The paper is organized as follows: Section 2 describes related work. Section 3 presents the proposed hybrid clustering algorithm. Section 4 reports the experimental analysis. Section 5 concludes the paper with some future research directions.

Related work

Most of the popular clustering methods can be broadly categorized into partitional, hierarchical, density-based, and graph-based clustering (Hartigan and Wong 1979; Karypis et al. 1999; Cheng et al. 2016b; Limwattanapibool and Arch-int 2017; Ester et al. 1996; Grygorash et al. 2006; Zhong et al. 2011; Wang et al. 2013; Jothi et al. 2016a). Partitional clustering methods decompose N data points into k groups by minimizing an objective function, e.g., the squared error function (Hartigan and Wong 1979; Zhong et al. 2011; Tong et al. 2019). They are popular due to their linear time complexity in N and k and their ability to identify spherical clusters. However, they fail to detect non-convex and irregularly sized clusters, and furthermore, the number of clusters k is required a priori (Karypis et al. 1999; Cheng et al. 2016b; Limwattanapibool and Arch-int 2017). Hierarchical clustering generates a nested sequence of clusters in the form of a binary tree called a dendrogram, where the root node consists of all the points and the leaf nodes are singleton clusters (Karypis et al. 1999; Koga et al. 2007; Hu and he Pan 2015; Kavitha and Tamilarasan 2019; Jiau et al. 2006). The computational complexity of most hierarchical algorithms is \(O(N^{2}\lg N)\). These approaches do not use information about the nature of the individual clusters while merging, and they are sensitive to the linkage criterion (Karypis et al. 1999).

Density-based clustering techniques identify different regions as clusters based upon the density of the points (Kriegel et al. 2011). Density-based spatial clustering of applications with noise (DBSCAN) (Ester et al. 1996) identifies clusters based on the density distribution; points that lie isolated in low-density regions are called outliers. The computational complexity of DBSCAN is \(O(N^{2})\). The cluster quality depends on two domain-specific parameters, the minimum number of neighbors (MinPts) and the distance threshold (Eps), and it is difficult to choose appropriate values for datasets with varying density.
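As a small illustration of this sensitivity, the sketch below (our own example, not taken from the cited works) runs scikit-learn's DBSCAN with a fixed MinPts and several Eps values on a synthetic dataset with two clusters of very different densities; the generated data and parameter values are purely illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two hypothetical clusters with very different densities.
dense, _ = make_blobs(n_samples=200, centers=[[0.0, 0.0]], cluster_std=0.3, random_state=0)
sparse, _ = make_blobs(n_samples=200, centers=[[5.0, 5.0]], cluster_std=1.5, random_state=1)
X = np.vstack([dense, sparse])

# With MinPts fixed, no single Eps works for both densities: a small Eps
# shatters the sparse cluster into noise, a large Eps risks merging clusters.
for eps in (0.3, 0.8, 1.5):
    labels = DBSCAN(eps=eps, min_samples=4).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```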

To reduce these limitations, several graph-based clustering techniques have been introduced (Jothi et al. 2016a; Zahn 1971) in which the similarity graph is first constructed from the dataset and then partitioned into components called clusters based on the topological properties of the graph. MST-based neighborhood graphs have been employed in recent years since the MST of a dataset captures the natural neighborhood information of data points (Zhong et al. 2011; Jothi et al. 2016a; 2018; Li et al. 2019; Chen 2013). One round of MST on the complete similarity graph may miss some of the intrinsic neighborhood information. Therefore, the union of multiple rounds (say k rounds) of edge-disjoint MSTs (called a k-MST) has been used to model neighborhood graphs, since it gives more insight into how well the points are connected with their neighbors (Zhong et al. 2011; Jothi et al. 2016a).
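The k-MST construction itself is straightforward: compute an MST, remove its edges from the graph, and repeat. A minimal sketch of this idea using NetworkX is given below; the function name and the use of a complete Euclidean graph are our own illustrative assumptions, not code from the cited works:

```python
import itertools

import networkx as nx
import numpy as np

def k_mst_graph(points, k=3):
    """Union of k rounds of edge-disjoint MSTs over the complete Euclidean graph."""
    n = len(points)
    full = nx.Graph()
    for i, j in itertools.combinations(range(n), 2):
        w = float(np.linalg.norm(np.asarray(points[i]) - np.asarray(points[j])))
        full.add_edge(i, j, weight=w)

    neighborhood = nx.Graph()
    neighborhood.add_nodes_from(range(n))
    for _ in range(k):
        mst = nx.minimum_spanning_tree(full, weight="weight")  # spanning forest if disconnected
        neighborhood.add_edges_from(mst.edges(data=True))
        full.remove_edges_from(list(mst.edges()))               # keep the next round edge-disjoint
    return neighborhood
```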

MST-based clustering algorithms have shown impressive performance in detecting clusters with irregular boundaries (Grygorash et al. 2006). Removing an inconsistent edge splits the graph into two sub-graphs (clusters). Finding inconsistent edges in a dataset containing chain structures, varying densities, or local outliers is a challenging problem (Wang et al. 2013). Most of the existing MST-based clustering algorithms construct the MST on a complete graph, which takes \(O(N^{2})\) time (Wang et al. 2013).

CHAMELEON (Karypis et al. 1999) is a hybrid clustering algorithm that first constructs a k-nearest neighbor graph and then partitions it into sub-clusters using a graph cut method. Inter-connecting edges and edge connectivity within the sub-clusters are used for merging the sub-clusters into actual clusters in \(O(N^{2})\) time (Karypis et al. 1999). Although it effectively identifies clusters of diverse shapes, sizes, and varying densities in two-dimensional space, it depends on the user-defined parameter k for constructing the k-nearest neighbor graph (Zhong et al. 2011; Cheng et al. 2016b).

SAM (Zhong et al. 2011) constructs the similarity graph using 3 rounds of MST, called the 3-MST graph, and partitions it into \(k^{\prime }=\sqrt N\) sub-clusters (each sub-cluster can be a disjoint forest) using the k-means algorithm, where the highest-degree nodes are selected as the initial prototypes. Finally, the partitioned clusters are merged into the expected clusters using an inter- and intra-similarity measure. The time complexity of SAM is \(O(N^{2})\) (Zhong et al. 2011). Although SAM identifies multi-scale clusters, it still requires a user-defined parameter, k, for constructing the neighborhood graph, and the generality of its merging step is insufficient for diverse-shape clusters as well (Cheng et al. 2016b). SAM proposes inter-connectivity and intra-similarity for selecting the best pair of sub-clusters to be merged. Inter-connectivity is a composite of the connectivity and the connection span; both these factors assign the maximum merging probability to a pair with a large connection span. For computing the intra-similarity, each sub-cluster is divided into two equal-sized components having the minimum inter-connecting edge weight between them. Fig. 1 shows a sample dataset on which this merging criterion is not effective. The dataset has four sub-clusters, {C1, C2, C3, C4}, where C4 is a well-separated cluster. The intra-similarity of (C1, C2) is low compared to that of (C2, C3). The inter-connectivity of (C1, C2) is the same as that of (C2, C3), while its connection span is greater. Hence, the merging approach falsely selects (C1, C2) for merging, whereas the pair (C2, C3) is a better choice. This shows that the merging approach may not work well for diverse-shape clusters.

Fig. 1

Dataset on which the merging approach of SAM fails. a An example dataset with four sub-clusters C = {C1, C2, C3, C4}, b 3-MST neighborhood graph on this dataset. The inter-connectivity edges between the pairs (C1, C2) and (C2, C3) are the same, but the common border of (C1, C2) is larger than that of (C2, C3); C4 is a well-separated cluster, c Sub-clusters obtained after applying the merging method of SAM.

Proposed algorithm

In this section, we describe the proposed non-parametric partitioning and merging hybrid clustering algorithm based on the MST neighborhood graph. In a hybrid clustering algorithm, the dataset is first partitioned into sub-clusters and then the sub-clusters are merged into actual clusters.

Partitioning of MST

We employ one round of MST on the complete similarity graph to construct the local neighborhood of the data points. A sample dataset and its MST are shown in Fig. 2a and b, respectively. After the MST has been constructed, the proposed algorithm partitions it into a number of dense sub-clusters (connected components). The MST of the dataset represents the similarity of data points with their nearest neighbors, and removing the longest edges (inconsistent edges) from the MST results in disjoint sub-graphs, each representing a sub-cluster. After removing the inconsistent edges, the MST is split into several connected components in which the data points within a component are more similar to each other. Several techniques have been proposed to find the inconsistent edges (Grygorash et al. 2006). Removing exactly the (k − 1) longest edges in the MST may not always give the proper clusters, as shown in Fig. 2c. The box-plot method has been used effectively for visualizing the distribution of sample data, highlighting potential outliers, and detecting anomalous observations (Walker and Chakraborti 2013; Wickham and Stryjewski 2011). It displays the distribution of a numerical dataset based on the quartiles: minimum (Q0), first quartile (Q1), median (Q2), third quartile (Q3), and maximum (Q4), as shown in Fig. 3. It divides the dataset into quartiles that demonstrate the degree of dispersion (spread) and skewness present in the data. Statistically, it is assumed that points within a cluster lie around some central value, e.g., the cluster center. The inter-quartile range (IQR) is the width of the box and indicates how spread out the middle values are. It can also be used to find values that are too far from the central values, called outliers (here, inconsistent edges). It has also been used in clustering techniques (Cheng et al. 2016b). The main advantage of this method is that it identifies the inconsistent data using the statistical distribution without any user-defined parameter. Therefore, our proposed algorithm selects the inconsistent edges using the box-plot method to partition the dataset into a large number of dense sub-clusters.

Fig. 2

The quality of partitions using the box-plot method. a An example dataset (300 data points) with three clusters, b Minimum spanning tree of the dataset, c Partition obtained after applying SMST, d 3-MST neighborhood graph, e Partition of the 3-MST neighborhood graph obtained after applying SAM using k-means as implemented in SAM, f Partition of the MST using the box-plot method.

Fig. 3

Illustration of a box-plot with outlier edges (inconsistent edges).

The inconsistent edges of the MST neighborhood graph are defined as follows:

Definition 1

An edge is called an inconsistent edge in the MST if its edge weight is greater than \(Q3+ 1.5\cdot IQR \cdot (\frac {Q3-Q2}{Q2-Q1})\) (Walker and Chakraborti 2013).

The inconsistent edges are the edges that are far away from the central value of the edge-weight set. The IQR measures the spread around the central value, and the terms (Q2 − Q1) and (Q3 − Q2) capture the spread of the lower and upper halves of the distribution, and hence its left and right skewness (Walker and Chakraborti 2013). Thus, the above definition includes the ratio (Q3 − Q2)/(Q2 − Q1) to adjust the upper fence and account for the underlying skewness in the data. The number of sub-clusters is set to \(\sqrt {N}\) (Bezdek and Pal 1998). The box-plot method is applied to the edge weights of the MST to find the inconsistent edge, which splits the dataset into two sub-clusters; it is then applied recursively to the sub-clusters until the number of sub-clusters equals \(\sqrt {N}\). Each partition contains the most similar data points, which belong to the same branch of the MST. Therefore, no extra overhead is required to adjust the partitions, as shown in Fig. 2f.
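A minimal sketch of this partitioning phase is given below. It assumes a NetworkX MST with `weight` attributes (e.g., `k_mst_graph(points, k=1)` from the earlier sketch), computes quartiles with NumPy, and stops early if no component contains an inconsistent edge; it is illustrative rather than the authors' implementation:

```python
import math

import networkx as nx
import numpy as np

def boxplot_threshold(weights):
    """Skewness-adjusted upper fence of Definition 1."""
    q1, q2, q3 = np.percentile(weights, [25, 50, 75])
    iqr = q3 - q1
    skew = (q3 - q2) / (q2 - q1) if q2 > q1 else 1.0   # guard against tied quartiles
    return q3 + 1.5 * iqr * skew

def partition_mst(mst, n_points):
    """Recursively remove inconsistent MST edges until ~sqrt(N) sub-clusters remain."""
    target = int(math.sqrt(n_points))
    forest = mst.copy()
    while nx.number_connected_components(forest) < target:
        removed_any = False
        for component in list(nx.connected_components(forest)):
            sub = forest.subgraph(component)
            if sub.number_of_edges() < 4:               # too few edges to estimate quartiles
                continue
            weights = [d["weight"] for _, _, d in sub.edges(data=True)]
            fence = boxplot_threshold(weights)
            bad = [(u, v) for u, v, d in sub.edges(data=True) if d["weight"] > fence]
            forest.remove_edges_from(bad)
            removed_any = removed_any or bool(bad)
        if not removed_any:                             # no outlier edge left anywhere
            break
    return [set(c) for c in nx.connected_components(forest)]
```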

Merging the partitions

We apply the merging procedure to merge the \(k^{\prime }~(=\sqrt {N})\) sub-clusters into the actual clusters. It is challenging to decide which pair of sub-clusters should be merged. Since points in any branch of the MST are more similar to each other, we consider merging only those partitions that are connected by at least one MST edge. We call such a pair of sub-clusters an adjacent pair.

One partition may have more than one adjacent partition, and these may belong to different clusters. We define a dis-connectivity index and an intra-similarity index to rank the adjacent pairs, as described below.

Dis-connectivity index

Definition 2

(Dis-connectivity: DIS) The dis-connectivity of an adjacent pair (Ci, Cj) is defined as a penalty for merging the two sub-clusters. Let data points \(p_{i}\in C_{i}\) and \(p_{j}\in C_{j}\) be adjacent MST points (i.e., connected by an MST edge) and let the weight of the edge between them be \(w_{ij}\). The dis-connectivity index between (Ci, Cj) is defined as:

$$ DIS(C_{i},C_{j})= \sum\limits_{p_{i}\in C_{i},~p_{j}\in C_{j},~(p_{i}, p_{j})\in MST}{1/ w_{ij}^{2} } $$
(1)

The dis-connectivity index reflects that the contribution of a straddling edge between two close regions is inversely proportional to the square of its weight. If the data points in the border region of two clusters are few or far apart compared to those of close regions, such pairs receive a small priority for merging. For example, in Fig. 4, for the three sub-clusters C1, C2, and C3, the edges (a,b), (c,d), and (e,f) are the inter-connecting edges between (C1, C3) and (C1, C2), respectively. So, the dis-connectivity values of the adjacent pairs are \(DIS(C_{1},C_{3})= ({1}/{w^{2}_{ab}}+ {1}/{w^{2}_{cd}})\) and \(DIS(C_{1},C_{2})= 1/w_{ef}^{2}\).
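A direct translation of Eq. 1, reusing the NetworkX MST and the vertex sets produced by the partitioning sketch above (illustrative code, not the authors' implementation):

```python
def dis_connectivity(mst, ci, cj):
    """DIS(Ci, Cj): sum of 1/w^2 over MST edges straddling the two sub-clusters (Eq. 1)."""
    total = 0.0
    for u, v, data in mst.edges(data=True):
        if (u in ci and v in cj) or (u in cj and v in ci):
            total += 1.0 / data["weight"] ** 2
    return total
```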

Fig. 4

Illustration of the inter-connecting edges between sub-clusters {C1, C2, C3} and the intra-similarity of the sub-clusters.

Intra-similarity index

Usually, the cut property of a graph is used to measure the similarity of sub-clusters, which may fail in the presence of outliers and noise (Zhong et al. 2011; Guha et al. 1998). We compute the similarity of a sub-cluster by considering the average distance of the MST edges within it. To compute the intra-similarity of an adjacent pair, techniques inspired by (Karypis et al. 1999; Das and Sil 2007; Mishra and Mohanty 2019) are employed.

Definition 3

(Similarity) Let \(D_{C}\) be the average of the MST edge weights in a sub-cluster C and S(C) be the similarity of C, defined as:

$$ \begin{array}{@{}rcl@{}} S(C) &=& {1}/{\sqrt{2\pi}}\cdot(\exp{(-{D_{C}^{2}} /{2} })) \\ D_{C} &=&\frac{1}{|C|}\sum\limits_{e_{i}\in C}^{}w_{i} \end{array} $$
(2)

For example, in Fig. 4, let the set of internal edges of sub-cluster C2 be {e1, e2, ..., e18}; then \(D_{C_{2}}=(w_{1}+w_{2}+\cdots +w_{18})/{18}\). Hence, the similarity index of the sub-cluster C2 is \(S(C_{2})= {1}/{\sqrt {2\pi }}\cdot \exp {(-{D_{C_{2}}^{2}}/2)}\). The similarity value of each sub-cluster is computed from the internal edges between its data points. The average weight of these edges describes the closeness of the data points within a sub-cluster.

Definition 4

(Intra-Similarity) The intra-similarity of an adjacent pair (Ci, Cj), denoted as IS(Ci, Cj), is defined as follows:

$$ \begin{array}{@{}rcl@{}} IS(C_{i},C_{j})&= & \frac{S(C_{i})\cdot S(C_{j})}{(|C_{i}|\cdot |C_{j}|)} \\ &= & \frac{1}{(|C_{i}|\cdot |C_{j}|)}\cdot \left( \frac{1}{{2\pi}}\cdot\exp{-(D_{C_{i}}^{2}+D_{C_{j}}^{2})/2}\right) \end{array} $$
(3)

The intra-similarity of a pair (Ci, Cj) is high when the average of the MST edge weights within each sub-cluster is small, i.e., when the points within a sub-cluster are close to each other. Since the sizes of the sub-clusters are also considered in computing the intra-similarity, smaller sub-clusters have a greater chance of being merged.
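Definitions 3 and 4 translate directly into code. The sketch below reuses the same NetworkX MST representation and the vertex-set sub-clusters from the earlier sketches; it is again only illustrative:

```python
import math

def similarity(mst, c):
    """S(C) from Definition 3: Gaussian of the mean internal MST edge weight D_C."""
    internal = [d["weight"] for u, v, d in mst.edges(data=True) if u in c and v in c]
    if not internal:                                   # singleton sub-cluster: D_C taken as 0
        return 1.0 / math.sqrt(2.0 * math.pi)
    d_c = sum(internal) / len(internal)
    return math.exp(-d_c ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

def intra_similarity(mst, ci, cj):
    """IS(Ci, Cj) from Definition 4."""
    return similarity(mst, ci) * similarity(mst, cj) / (len(ci) * len(cj))
```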

Definition 5

(Merge Index (MI)) Let (Ci, Cj) be a pair of adjacent sub-clusters. The merge index used to select the best pair is defined as:

$$ MI(C_{i},C_{j})=DIS(C_{i},C_{j})\cdot IS(C_{i},C_{j}) $$
(4)

The adjacent pair with the highest merge index value is merged first. After merging a pair, the MI values of the merged cluster and its adjacent pairs are recomputed. This step is continued until either the desired number of clusters is found or there is no significant change in the cluster quality index. By focusing on the improved dis-connectivity and intra-similarity between clusters, the proposed method overcomes the limitations of the existing algorithms discussed above. The proposed merging approach merges the nearest similar dense sub-clusters into a cluster. For instance, in the example shown in Fig. 1, the proposed algorithm correctly prefers to merge the pair (C2, C3) over the pair (C1, C2), because the similarity of the data points within the sub-clusters C2 and C3 is the same, so the intra-similarity of (C2, C3) is higher than that of (C1, C2). The proposed technique is described in Algorithm 1.
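Putting the pieces together, a simple version of the merging loop, stopping when a desired number of clusters k is reached and treating a zero dis-connectivity as "not adjacent" (both our own simplifications of the procedure described above), could look as follows:

```python
def merge_partitions(mst, sub_clusters, k):
    """Iteratively merge the adjacent pair with the highest merge index MI = DIS * IS (Eq. 4)."""
    clusters = [set(c) for c in sub_clusters]
    while len(clusters) > k:
        best_pair, best_mi = None, -1.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dis = dis_connectivity(mst, clusters[i], clusters[j])
                if dis == 0.0:                         # no straddling MST edge: not adjacent
                    continue
                mi = dis * intra_similarity(mst, clusters[i], clusters[j])
                if mi > best_mi:
                    best_pair, best_mi = (i, j), mi
        if best_pair is None:                          # no adjacent pairs remain
            break
        i, j = best_pair
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters
```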

Computational complexity

The total computational complexity of the proposed method depends on the time required to construct the MST neighborhood graph and to perform the partitioning and merging of the sub-clusters. Since the MST is constructed from the complete graph, this step takes \(O(N^{2})\) time. The dataset is divided recursively into \(\sqrt {N}\) sub-clusters in \(\lg N\) iterations, and each iteration takes O(N) time to identify the inconsistent edges and to find the connected components, so the complexity of partitioning the MST is \(O(N \lg N)\). Finally, the time required to merge the \(\sqrt {N}\) sub-clusters depends on the time required to calculate the merge index. Since the dis-connectivity and intra-similarity of a pair of sub-clusters are computed from the MST edges, their cost is proportional to the number of points in the sub-clusters, so each merge takes O(N) time to compute the merge index values. In the worst case, the merging step repeatedly chooses the same sub-cluster and merges it with another, iterating \(\sqrt {N}-k\) times to obtain the k expected clusters. Therefore, merging the \(\sqrt {N}\) sub-clusters into k clusters takes \(O(N\cdot(\sqrt {N}-k))\), which is \(O(N^{3/2})\) time.


The overall complexity of the proposed clustering algorithm is \(O(N^{2})\), which is dominated by the graph construction step. If we exclude the graph construction step, our algorithm takes \(O(N^{3/2})\) time.

Experimental results and analysis

The performance of the proposed algorithm is compared, on synthetic as well as real datasets, with traditional algorithms such as k-means (Hartigan and Wong 1979), hierarchical clustering (Jain and Dubes 1988), and DBSCAN (Ester et al. 1996), and with a hybrid clustering algorithm, SAM (Zhong et al. 2011).

Datasets

The performance of the proposed technique is evaluated on two-dimensional synthetic datasets, each describing a different kind of clustering problem. The selected datasets contain clusters that differ in terms of densities, sizes, and shapes. A detailed description of these datasets is given in Table 1. The first four datasets are taken from (Zhong et al. 2011; Cheng et al. 2016b; Pasi et al. 2015; Blake and Merz 1998) and the next two (DS8 and DS9) are taken from (Karypis et al. 1999; Hyde et al. 2015).

Table 1 Synthetic datasets description: number of data points (N), dimension (d), number of clusters (k).

Results on synthetic datasets

Spiral: This dataset consists of three spiral clusters. Fig. 5 illustrates the results of the different clustering techniques. The proposed algorithm, DBSCAN, and SAM recognize the expected clusters, whereas k-means and average linkage fail to identify them. Since k-means favors spherical clusters, it fails on the spiral dataset. Average linkage produces improper clusters because it computes the similarity as the average distance between the points of a pair of clusters. Although DBSCAN produces the proper clusters, it is challenging to set appropriate parameters to obtain the expected result.

Fig. 5

Illustration of clustering results on spiral dataset with k = 3.

Aggregation: The dataset consists of seven clusters of varied shapes, some of which are attached to each other. The clustering results are shown in Fig. 6. All algorithms except k-means produce the expected partitions.

Fig. 6

Illustration of clustering results on aggregation dataset with k = 7.

Varying density: This dataset consists of three clusters of different densities. The clustering results are depicted in Fig. 7. The proposed algorithm and SAM detect the actual clusters, whereas k-means, average linkage, and DBSCAN fail to detect the expected partitions. Since the dataset consists of clusters of varying densities, it is hard to fix appropriate parameter values for DBSCAN.

Fig. 7

Illustration of clustering results on varying density dataset with k = 3.

Neck type: This dataset consists of two clusters close to each other like a neck. Fig. 8 illustrates the results of the different clustering methods. The proposed algorithm as well as k-means detect the clusters properly, as the dataset contains spherical clusters, whereas the remaining methods fail to detect the actual clusters.

Fig. 8

Illustration of clustering results on neck type dataset with k = 2.

R15: This dataset contains fifteen Gaussian-distributed clusters arranged in two concentric circles. The clustering results are depicted in Fig. 9. The proposed algorithm and SAM recognize the expected clusters, whereas k-means, average linkage, and DBSCAN fail to discover the proper clusters.

Fig. 9

Illustration of clustering results on R15 dataset with k = 15.

D31: This dataset contains thirty-one Gaussian-distributed clusters. The results are shown in Fig. 10. The proposed algorithm identifies the actual clusters, whereas SAM, k-means, average linkage, and DBSCAN fail to recognize the proper clusters.

Fig. 10

Illustration of clustering results on D31 dataset with k = 31.

DS8: The dataset DS8 consists of eight clusters of diverse shapes, sizes, and densities, and it also contains random noise. An important characteristic of this dataset is that the clusters are very close to each other and have different densities. The results of the different clustering methods are depicted in Fig. 11. Only the proposed method achieves the expected partitions, whereas the remaining competing clustering techniques fail to identify the actual clusters.

Fig. 11

Comparison of clustering algorithms on the DS8 dataset with k = 8.

DS9: The results of the different clustering techniques are shown in Fig. 12. The proposed method detects the expected clusters, whereas k-means, hierarchical clustering, DBSCAN, and SAM fail to identify the proper partitions. SAM fails to recognize the expected clusters because the points are very close to each other and form vertical streaks of varying densities, which increase the connection span of a pair of clusters. Therefore, SAM keeps such clusters together, producing improper partitions.

Fig. 12

Illustration of clustering results on DS9 dataset with k = 9.

DBSCAN can find the exact clusters with proper selection of its parameters, which takes several iterations. Fig. 13 shows the clustering results found by DBSCAN on DS8 and DS9 for different values of the parameters. As recommended in (Ester et al. 1996), MinPts is fixed to 4 and the value of Eps is varied in these experiments. In Fig. 13a, when Eps = 10, DBSCAN puts the nearest clusters into a single cluster because the outliers connecting them are treated as elements of the cluster. When we decrease the value of Eps, the lower-density regions are split into a large number of smaller sub-clusters, as shown in Fig. 13b and c. However, by varying the value of Eps, DBSCAN correctly identifies the clusters of DS9. Fig. 13d-f demonstrates the sensitivity of DBSCAN to the parameter Eps.

Fig. 13

DBSCAN on the DS8 and DS9 datasets with different values of the Eps parameter.

Analyses of cluster quality

Cluster quality analyses are performed using two external quality measures: the Rand Index (RI) and the Adjusted Rand Index (ARI) (Wagner and Wagner 2007). Both of them consider the degree of agreement between the predicted labels and the ground-truth labels (Wagner and Wagner 2007; Halkidi et al. 2001). Let U and V denote the actual and predicted clusterings, respectively. Let P11 be the number of pairs of points that are in the same cluster in both U and V. Let P00 be the number of pairs that are in different clusters in both U and V. Let P10 be the number of pairs that are in the same cluster in U but in different clusters in V. Let P01 be the number of pairs that are in different clusters in U but in the same cluster in V. RI and ARI are defined in Equations 5 and 6, respectively.

$$ RI=\frac{(P_{00}+P_{11})}{(P_{00}+P_{01}+P_{10}+P_{11})} $$
(5)

The Rand index computes the fraction of correctly grouped pairs over the total number of pairs (Rand 1971; Wagner and Wagner 2007). RI takes a value between 0 and 1, where 1 indicates a perfect clustering result and 0 indicates a completely incorrect one.

$$ ARI=\frac{2\cdot (P_{11}\cdot P_{00}-P_{10}\cdot P_{01})}{(P_{11}+P_{10})\cdot (P_{10}+P_{00})+(P_{11}+P_{01})\cdot(P_{01}+P_{00})} $$
(6)

The chance-corrected version of the Rand index (Rand 1971; Wagner and Wagner 2007) is called the adjusted Rand index. ARI can yield negative values when the agreement is less than that expected by chance (Wagner and Wagner 2007). An ARI value of −1 indicates improper clustering and 1 indicates proper clustering.
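For completeness, a small pair-counting implementation of both indices is sketched below; it is only for illustration (scikit-learn's adjusted_rand_score computes the same ARI):

```python
from itertools import combinations

def pair_counts(true_labels, pred_labels):
    """Count P11, P00, P10, P01 over all point pairs."""
    p11 = p00 = p10 = p01 = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            p11 += 1
        elif not same_true and not same_pred:
            p00 += 1
        elif same_true:
            p10 += 1
        else:
            p01 += 1
    return p11, p00, p10, p01

def rand_index(true_labels, pred_labels):
    p11, p00, p10, p01 = pair_counts(true_labels, pred_labels)
    return (p11 + p00) / (p11 + p00 + p10 + p01)            # Eq. 5

def adjusted_rand_index(true_labels, pred_labels):
    p11, p00, p10, p01 = pair_counts(true_labels, pred_labels)
    numerator = 2.0 * (p11 * p00 - p10 * p01)                # Eq. 6
    denominator = (p11 + p10) * (p10 + p00) + (p11 + p01) * (p01 + p00)
    return numerator / denominator if denominator else 1.0
```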

The RI and ARI values of the various clustering techniques are shown in Tables 2 and 3 respectively. The results demonstrate that the proposed method outperforms the popular clustering techniques on all datasets.

Table 2 Comparison of Rand Index obtained by various clustering algorithms on synthetic datasets.
Table 3 Comparison of Adjusted Rand Index obtained by various clustering algorithms on synthetic datasets.

Experiments on real datasets

The performance of the proposed algorithm is compared with popular existing approaches, such as k-means, hierarchical clustering (average linkage), DBSCAN, and SAM, on eight real datasets.

Real datasets

The real-world datasets used in the experiments are all taken from the UCI Machine Learning Repository (Blake and Merz 1998). A detailed description of these datasets is given in Table 4.

Table 4 Real world datasets description: number of data points (N), dimension (d), number of clusters (k).

Clustering performance on real datasets

The performance on the eight real-world datasets is evaluated using the two common external clustering validity indices, RI and ARI. The DBSCAN parameters (MinPts, Eps) for the glass, breast, breast tissue (breastT), lung cancer (lungC), heart, hepatitis, liver disorder (liverD), and dermatology datasets are set through experiments to (MinPts = 9, Eps = 2), (MinPts = 14, Eps = 2), (MinPts = 9, Eps = 0.4), (MinPts = 10, Eps = 1), (MinPts = 9, Eps = 2), (MinPts = 22, Eps = 2), (MinPts = 4, Eps = 2), and (MinPts = 22, Eps = 0.4), respectively. Table 5 compares our proposed algorithm with k-means, hierarchical clustering, DBSCAN, and SAM based on the RI value. The highlighted number in each row indicates the algorithm with the best RI value. The clustering results of the proposed technique are better than those of the competing algorithms except on the heart and liverD datasets, for which hierarchical clustering and SAM perform better, respectively.

Table 5 Comparison of Rand Index obtained by various clustering algorithms on real datasets.

Table 6 shows the performance of the clustering algorithms based on ARI. The proposed clustering algorithm outperforms all the competing algorithms on all the datasets except the heart dataset, for which SAM performs better. SAM fails on most of the real datasets, likely because the points are very close to each other and the datasets contain clusters of diverse shapes and varying densities, which increase the connection span.

Table 6 Comparison of Adjusted Rand Index obtained by various clustering algorithms on real datasets.

Speed analysis

The performance of the proposed algorithm is compared against SAM in terms of execution time (in seconds). Both algorithms are run 5 times and the mean value over all runs is reported. Fig. 14 illustrates the run time of both algorithms on datasets of different sizes. SAM takes considerably longer than our algorithm because it uses 3 rounds of MST during the graph construction phase and consumes more time in the partitioning phase.

Fig. 14

Comparison of time (in seconds) taken by the proposed algorithm against SAM.

Scalability analysis

In this section, we perform a scalability analysis of the proposed method. We consider the Birch datasets (D1 to D5) taken from (Pasi et al. 2015), of sizes 2000 to 10000, with an increment of 2000 in each run. Table 7 shows the cluster quality in terms of ARI and the execution time. The proposed algorithm and SAM are run 5 times and the mean value over all runs is reported. The clustering results demonstrate that the proposed method maintains the cluster quality as the size of the dataset increases. Although SAM also maintains cluster quality as the dataset size increases, it takes longer than the proposed method because it applies 3 rounds of MST to construct the neighborhood graph. The experimental analysis shows that our technique is scalable with respect to cluster quality and execution time.

Table 7 Comparison of ARI and execution time (in seconds) taken by proposed algorithm against SAM.

Conclusion and future work

In this paper, a novel hybrid clustering algorithm is presented to detect clusters of different shapes, sizes, and densities. The proposed algorithm is free from any user-defined parameters. An efficient outlier detection technique is employed for selecting the inconsistent edges of the MST graph in the partitioning phase. In the merging phase, adjacent pairs of sub-clusters are identified based on the MST edges to avoid unnecessary computational overhead. The concepts of dis-connectivity and intra-similarity are introduced to determine the proximity of sub-clusters based on the inter-connecting edges between the sub-clusters and the internal edges within the sub-clusters. The experimental results on synthetic and real datasets illustrate that the proposed algorithm shows better performance in terms of cluster quality, scalability, and execution time. The \(O(N^{2})\) cost limits the application of the proposed algorithm to large datasets. One direction for future work is to construct an approximate MST directly from the dataset without constructing the complete graph, so that the computational complexity of the algorithm becomes sub-quadratic. Another important direction is to detect not only global outliers but also local outliers in the partitioning phase of our algorithm to improve its accuracy.

References

  1. Bezdek, J.C., & Pal, N.R. (1998). Some new indexes of cluster validity. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 28(3), 301–315.


  2. Blake, C., & Merz, C. (1998). Uci repository of machine learning databases [http://www.ics.uci.edu/mlearn/mlrepository.html], department of information and computer science, University of California, Irvine, CA, Vol. 55.

  3. Chen, X. (2013). Clustering based on a near neighbor graph and a grid cell graph. Journal of Intelligent Information Systems, 40(3), 529–554.


  4. Cheng, Q., Liu, Z., Huang, J., & Cheng, G. (2016a). Community detection in hypernetwork via density-ordered tree partition. Applied Mathematics and Computation, 276, 384–393.


  5. Cheng, Q., Lu, X., Liu, Z., Huang, J., & Cheng, G. (2016b). Spatial clustering with density-ordered tree. Physica A:, Statistical Mechanics and its Applications, 460, 188–200.


  6. Chung, C.H., & Dai, B.R. (2014). A fragment-based iterative consensus clustering algorithm with a robust similarity. Knowledge and information systems, 41(3), 591–609.


  7. Das, A.K., & Sil, J. (2007). Cluster validation using splitting and merging technique, International conference on computational intelligence and multimedia applications (ICCIMA 2007), vol. 2, pp. 56–60. IEEE.

  8. Du, M., Ding, S., Xue, Y., & Shi, Z. (2019). A novel density peaks clustering with sensitivity of local density and density-adaptive metric. Knowledge and Information Systems, 59(2), 285–309.


  9. Ester, M., Kriegel, H.P., Sander, J., Xu, X., & et al. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, vol. 96, pp. 226–231.

  10. Grygorash, O., Zhou, Y., & Jorgensen, Z. (2006). Minimum spanning tree based clustering algorithms. In 18Th IEEE international conference on tools with artificial intelligence (ICTAI’06), pp. 73–81. IEEE.

  11. Guha, S., Rastogi, R., & Shim, K. (1998). Cure: an efficient clustering algorithm for large databases. ACM Sigmod Record, 27(2), 73–84.


  12. Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques. Journal of intelligent information systems, 17(2-3), 107–145.


  13. Hartigan, J.A., & Wong, M.A. (1979). Algorithm as 136: a k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), 100–108.


  14. Hu, W., & he Pan, Q. (2015). Data clustering and analyzing techniques using hierarchical clustering method. Multimedia Tools and Applications, 74(19), 8495–8504.


  15. Hyde, R., & et al. (2015). Lancaster university clustering datasets. http://www.lancaster.ac.uk/pg/hyder/Downloads/downloads.html.

  16. Jain, A.K., & Dubes, R.C. (1988). Algorithms for clustering data, Prentice-Hall, Inc.

  17. Jiau, H.C., Su, Y.J., Lin, Y.M., & Tsai, S.R. (2006). Mpm: a hierarchical clustering algorithm using matrix partitioning method for non-numeric data. Journal of Intelligent Information Systems, 26(2), 185–207.


  18. Jothi, R., Mohanty, S.K., & Ojha, A. (2016). Functional grouping of similar genes using eigenanalysis on minimum spanning tree based neighborhood graph. Computers in biology and medicine, 71, 135–148.


  19. Jothi, R., Mohanty, S.K., & Ojha, A. (2016). On careful selection of initial centers for k-means algorithm. In Proceedings of 3rd International Conference on Advanced Computing, Networking and Informatics, pp. 435–445. Springer.

  20. Jothi, R., Mohanty, S.K., & Ojha, A. (2018). Fast approximate minimum spanning tree based clustering algorithm. Neurocomputing, 272, 542–557.


  21. Karypis, G., Han, E.H., & Kumar, V. (1999). Chameleon: Hierarchical clustering using dynamic modeling. Computer, 32(8), 68–75.


  22. Kavitha, E., & Tamilarasan, R. (2019). Agglo-hi clustering algorithm for gene expression micro array data using proximity measures. Multimedia Tools and Applications, 79, 9003–9017.


  23. Koga, H., Ishibashi, T., & Watanabe, T. (2007). Fast agglomerative hierarchical clustering algorithm using locality-sensitive hashing. Knowledge and Information Systems, 12(1), 25–53.


  24. Kriegel, H.P., Kröger, P., Sander, J., & Zimek, A. (2011). Density-based clustering. Wiley Interdisciplinary Reviews:, Data Mining and Knowledge Discovery, 1 (3), 231–240.


  25. Kumar, K.M., & Reddy, A.R.M. (2016). A fast dbscan clustering algorithm by accelerating neighbor searching using groups method. Pattern Recognition, 58, 39–48.


  26. Li, J., Wang, X., & Wang, X. (2019). A scaled-mst-based clustering algorithm and application on image segmentation, Journal of Intelligent Information Systems, pp 1–25. https://doi.org/10.1007/s10844-019-00572-x.

  27. Li, X., Kao, B., Luo, S., & Ester, M. (2018). Rosc: Robust spectral clustering on multi-scale data. In Proceedings of the 2018 World Wide Web Conference, pp. 157–166.

  28. Limwattanapibool, O., & Arch-int, S. (2017). Determination of the appropriate parameters for k-means clustering using selection of region clusters based on density dbscan (srcd-dbscan). Expert Systems, 34(3), 12204.


  29. Lin, C.R., & Chen, M.S. (2005). Combining partitional and hierarchical algorithms for robust and efficient data clustering with cohesion self-merging. IEEE Transactions on Knowledge and Data Engineering, 17(2), 145–159.


  30. Mishra, G., & Mohanty, S. (2020). Rdmn: a relative density measure based on mst neighborhood for clustering multi-scale datasets, IEEE Transactions on Knowledge and Data Engineering, pp 1–1, https://doi.org/10.1109/TKDE.2020.2982400.

  31. Mishra, G., & Mohanty, S.K. (2019). A fast hybrid clustering technique based on local nearest neighbor using minimum spanning tree. Expert Systems with Applications, 132, 28–43.


  32. Otoo, E.J., Shoshani, A., & Hwang, S.w. (2001). Clustering high dimensional massive scientific datasets. Journal of Intelligent Information Systems, 17(2-3), 147–168.


  33. Pasi, F., & et al. (2015). Clustering datasets. http://cs.uef.fi/sipu/datasets/.

  34. Rand, W.M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336), 846–850.


  35. Schlitter, N., Falkowski, T., & Lässig, J. (2014). Dengraph-ho: a density-based hierarchical graph clustering algorithm. Expert Systems, 31(5), 469–479.


  36. Tong, T., Zhu, X., & Du, T. (2019). Connected graph decomposition for spectral clustering. Multimedia Tools and Applications, 78(23), 33247–33259.


  37. Wagner, S., & Wagner, D. (2007). Comparing clusterings: an overview. Universität Karlsruhe: Fakultät für Informatik Karlsruhe.


  38. Walker, M., & Chakraborti, S. (2013). An asymmetrically modified boxplot for exploratory data analysis. The University of Alabama: Department of Information Systems Statistics, and Management Science.


  39. Wang, X., Wang, X.L., Chen, C., & Wilkes, D.M. (2013). Enhancing minimum spanning tree-based clustering by removing density-based outliers. Digital Signal Processing, 23(5), 1523–1538.


  40. Wickham, H., & Stryjewski, L. (2011). 40 years of boxplots. Am Statistician.

  41. Zahn, C.T. (1971). Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on computers, 100(1), 68–86.


  42. Zhong, C., Miao, D., & Fränti, P. (2011). Minimum spanning tree based split-and-merge: a hierarchical clustering method. Information Sciences, 181(16), 3397–3410.



Author information

Corresponding author

Correspondence to Gaurav Mishra.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Mishra, G., Mohanty, S.K. A minimum spanning tree based partitioning and merging technique for clustering heterogeneous data sets. J Intell Inf Syst 55, 587–606 (2020). https://doi.org/10.1007/s10844-020-00602-z


Keywords

  • Partitioning and merging approach
  • Minimum spanning tree based clustering
  • Box-plot method
  • Clustering multi-scale datasets