1 Introduction

As the volume of digital data grows rapidly, extracting meaningful patterns and information from it has become an important research topic [1]. Clustering algorithms offer valuable insight when extracting such patterns from ever-increasing amounts of data [2]. In this context, streaming data mining has become one of today's most active fields, concerned with processing stream data rapidly and obtaining meaningful information from it [3, 4]. Traditional data mining focuses on static data. However, the emergence of data streams alongside technological developments has significantly changed how data are stored and processed, creating the need to analyze data in real time and present results to the user instantly [5]. Stream clustering approaches are used to meet this need: they cluster streaming data quickly according to similarity criteria and update the clusters as the characteristics of the data evolve. Applications of these approaches include clickstream analysis, intrusion detection systems, social media, financial applications, scientific research, health research, mobile applications, the Internet of Things (IoT), and sensor networks [6].

Stream clustering approaches fall into five categories: density-, hierarchical-, model-, partitioning-, and grid-based methods [7,8,9]. In density-based methods, clusters grow toward regions where the data are dense, so cluster shapes reflect the underlying density. These clustering approaches can detect arbitrary-shaped clusters and outliers. DenStream [10], D-Stream [11], DBSTREAM [12], StreamSW [13], and KD-AR Stream [14] are some examples of density-based clustering approaches. In hierarchical methods, clusters are generated by combining data in a hierarchical structure according to the distances among them. Two types of approaches, agglomerative and divisive, are used to address such clustering problems [8]. BIRCH [15], ClusTree [16], and ODAC [17] are some examples of hierarchical stream clustering algorithms.

In partitioning-based stream clustering approaches, the dataset is partitioned according to centers selected by various techniques [18]. The goal is to optimize a target criterion such as variance. StreamKM++ [19], CluStream [20], SWClustering [21], and HPStream [22] are examples of partitioning-based stream clustering algorithms. Grid-based methods, on the other hand, divide the data space into equally sized grids and form clusters according to the number of data points falling into each grid. The grid structure partitions the data stream into regular cells, and the data in each cell are clustered using statistical or computational methods. DD-Stream [23] is one of these methods. Finally, model-based approaches assume that the dataset fits a mathematical model. EM (Expectation Maximization) [24] is an example of these methods that effectively handles noisy data and outliers.

In the stream clustering area, the cluster shapes that can be defined, the cluster generation process, effective handling of noisy data and outliers, the capability of processing high-dimensional data, and time complexity are all critical to clustering performance. In the literature, stream data are generally clustered into spherical [14, 16, 20, 25, 26] or arbitrary [10, 12, 13, 27, 28] shapes. Although spherical clustering approaches are successful, their performance is limited when the data distribution is arbitrary. Clustering approaches that partition the dataset into groups according to centroids tend to form spherical clusters. Since real-world data rarely follow a homogeneous distribution, the clusters they form may not be spherical, and the performance of spherical clustering approaches on such data may be limited. Therefore, defining arbitrary-shaped clusters in such data distributions can significantly improve clustering performance. Density-based approaches can define arbitrary-shaped clusters based on density without requiring the number of clusters to be set in advance. These approaches also successfully detect noise and outliers [8].

In stream clustering, there are fully online and online-offline approaches to clustering datasets [7, 29]. In fully online approaches [14, 19, 26, 27], clustering is performed for each newly arriving data point, and the current clustering result is maintained. The remaining approaches are two-phase (online-offline) methods [12, 13, 16, 20, 28]. In the online phase, newly arriving data are evaluated in real time, and summary statistics of the observations are captured as micro-clusters. In the offline phase, these summaries are used to form the final clusters [30]. In such approaches, algorithms such as k-means [31], k-median [32], DBSCAN [33], and minimum spanning tree (MST) methods are used in the offline phase to define the final clusters.

Two of the most crucial challenges in streaming data clustering are defining arbitrary-shaped clusters and processing high-dimensional data. Şenol and Karacan [14] used the KD-Tree structure to process high-dimensional data. Similarly, Şenol et al. [34] stated that tree data structures can benefit stream data clustering. However, the major limitation of these methods is their weakness in detecting arbitrary-shaped clusters. The MCMSTClustering algorithm [35] supports high dimensionality using the KD-Tree structure and is also very successful in defining arbitrary-shaped clusters using the minimum spanning tree algorithm. The motivation of this paper is to adapt the MCMSTClustering algorithm to stream data clustering problems, building on these capabilities. To summarize, the main contributions of the proposed algorithm are as follows:

  • Ability to define arbitrary-shaped clusters,

  • Robustness to outliers,

  • Capability of processing high-dimensional data,

  • High clustering quality in acceptable runtime.

The rest of the paper is organized as follows: Sect. 2 discusses the related work, while Sect. 3 provides the necessary background on the methods used in the study. In Sect. 4, the problem is defined and the objective of the study is explained. Then, Sect. 5 describes the proposed algorithm in detail, while Sect. 6 presents the experimental studies. Section 7 shares the results obtained from the experimental study, while Sect. 8 discusses them. Finally, Sect. 9 concludes the paper and outlines future work. For ease of reading, the descriptions of the mathematical symbols and acronyms used in this paper are summarized in Tables 1 and 2.

Table 1 Mathematical symbol interpretation used in this paper
Table 2 The acronyms used in this paper

2 Related works

Density-based clustering involves grouping data objects distributed in a contiguous region of the data space with high object density [1, 36]; such clusters are separated from one another by contiguous regions of low object density. Outliers and noisy data can dramatically reduce the clustering quality of stream clustering techniques, so robustness to these issues improves clustering performance. Furthermore, the ability to form non-spherical clusters allows algorithms to perform better. Density-based techniques are often highly effective at overcoming such issues.

DenStream [10] is a density-based stream clustering algorithm that can cluster data streams over fading windows and handle clusters with arbitrary shapes. DenStream introduces the core-micro-cluster concept to summarize clusters of arbitrary shape, together with potential core-micro-cluster and outlier-micro-cluster structures to maintain and distinguish potential clusters and outliers. When DenStream receives a clustering request, it uses DBSCAN to produce the final clustering results. DenStream can analyze high-dimensional data streams and find outliers in them. However, its ability to detect and report concept drift is limited, and it cannot predict the number of clusters, which can sometimes be an issue.

Chen and Tu presented D-Stream [11], a density- and grid-based clustering technique. D-Stream, like DenStream, has a two-phase clustering design. First, the online component puts each data point into grids. The offline component then computes grid densities and groups grids according to density. D-Stream can thus detect clusters of any shape. Furthermore, D-Stream can manage high-velocity, high-volume data streams, making it suited for real-time applications. However, D-Stream has some limitations; in particular, it may struggle with data streams of varying density.

Hahsler and Bolaños [12] presented DBSTREAM, a density-based approach that addresses the issue of ignoring the data density in the region shared between micro-clusters. This approach uses the data to directly estimate the density in the common region between micro-clusters. In the online phase, as in DenStream, density estimates are generated for micro-clusters rather than for the epsilon neighborhood around every single point, considerably decreasing processing costs. However, the parameters of DBSTREAM must be fine-tuned to achieve good results.

CEDAS [35] provides a two-stage, fully online method to group evolving data streams into arbitrarily shaped clusters. Hyperspherical micro-clusters are created in the first stage, and in the second stage they are combined into larger macro-clusters using a graph structure. The method is accurate, robust to noise, computationally and memory efficient, and can handle high-dimensional datasets. However, the method outputs only cluster assignments. Moreover, it cannot discover the densities of regions in the data space [27].

StreamSW [13] is a density-based clustering approach for streaming data over a sliding window. It uses a two-phase online-offline clustering framework that maintains a synopsis of the streaming data in p-micro-clusters, which are then reclustered by an enhanced DBSCAN algorithm in the offline component. StreamSW uses density-based micro-clustering and grid-based approaches to find high-quality arbitrary-shaped clusters with limited memory and execution time, and has shown promising results in experiments on real-world and synthetic datasets. However, it is unsuitable for high-dimensional streaming data due to the high computation time and degraded performance caused by the number of grid cells growing with the dimensionality of the space.

The MVStream clustering algorithm [26] integrates information from multiple incomplete views by employing summary statistics of past multi-view data objects together with a novel multi-view support vector domain description (MVSVDD) model, whose outputs are support vectors (SVs). Because the SVs occupy only a small number of data objects, the MVStream technique is efficient when computational resources are constrained. In benchmark testing, MVStream outperformed seven current single-view data stream clustering methods and two multi-view clustering techniques built for static large-scale multi-view data.

Şenol and Karacan [14] proposed the KD-AR Stream algorithm, which is suited to the dynamic structure of streaming data. The method is fully online and uses the KD-Tree data structure for cluster formation. An adaptive radius is used to adjust the cluster size, and time-based summarization consisting of a time window and a sliding window is implemented to avoid performance loss. However, the method suffers performance loss on high-dimensional data. In addition, it cannot detect arbitrary-shaped clusters since it only forms spherical ones.

Mousavi et al. [37] developed CVD-Stream (Clustering Varying Density Data Stream), a density-based online-offline approach. In the online phase, merging and pruning procedures are applied and outliers are effectively removed. A variable-density clustering technique is used to construct the final clusters during the offline phase. Ahmed et al. [25] suggested DGStream, a density- and grid-based two-phase technique that constructs arbitrary-shaped clusters and detects outliers quickly and accurately. DGStream was compared against DenStream, D-Stream, and ClusTree on synthetic and real-world data and improved on their performance. As mentioned, many data stream clustering algorithms have been proposed, but each has various issues. Table 3 compares some of the stream clustering algorithms in the literature.

Table 3 Comparison of streaming data clustering algorithms according to their problem-solving style and capabilities

3 Preliminaries

3.1 KD-tree data structure and range search

KD-Tree is a space-partitioning data structure with an average search complexity of O(log n) that can handle high-dimensional datasets. It is fast and supports range search, which makes it widely used. Range search locates the data within a given radius on the KD-Tree and has a runtime complexity of \(O(dn^{1 - \frac{1}{d}} + k)\). Examples of a KD-Tree and range searches are shown in Fig. 1.

Fig. 1 KD-Tree data structure and its decomposition
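For illustration, the following minimal sketch shows how such a range search can be performed in Python with scipy.spatial.KDTree; this is only an illustrative example under our own choice of data and radius, not the authors' implementation:

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(42)
data = rng.random((1000, 3))              # 1000 points in 3-D, in [0, 1]

tree = KDTree(data)                       # build the KD-Tree

# Range search: indices of all points within radius r of the query point
query, r = data[0], 0.1
neighbor_idx = tree.query_ball_point(query, r)
print(len(neighbor_idx), "points lie within radius", r, "of the query")
```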

3.2 Minimum spanning tree

A minimum spanning tree connects all nodes of a graph with the minimum total edge weight, as illustrated in Fig. 2. It is a highly efficient method for weighted undirected graphs. A naive implementation runs in O(n^2), but this can be reduced to O(E log n) when a priority queue is used. Prim's and Kruskal's algorithms are two well-known examples of such approaches.

Fig. 2 An example of a minimum spanning tree
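The sketch below illustrates Prim's algorithm over a dense distance matrix using a binary heap, one way to obtain the O(E log n) behavior mentioned above; the function name and matrix representation are our own illustration:

```python
import heapq
import numpy as np

def prim_mst(dist):
    """Prim's algorithm over a dense n x n distance matrix.
    Returns MST edges as (weight, u, v); O(E log n) with a binary heap."""
    n = len(dist)
    visited = [False] * n
    edges = []
    heap = [(0.0, 0, 0)]                      # (weight, parent, node); start at node 0
    while heap:
        w, u, v = heapq.heappop(heap)
        if visited[v]:
            continue
        visited[v] = True
        if u != v:                            # skip the artificial start edge
            edges.append((w, u, v))
        for nxt in range(n):
            if not visited[nxt]:
                heapq.heappush(heap, (float(dist[v][nxt]), v, nxt))
    return edges

# Toy usage on five random 2-D points
pts = np.random.default_rng(0).random((5, 2))
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(prim_mst(dist))
```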

4 Problem statement

Since streaming data arrive in real time and have an evolving structure, algorithms that can handle such datasets are needed. However, as explained in Sect. 2, streaming data pose many open problems. One of the most critical is the detection of non-spherical clusters, some examples of which can be seen in Fig. 3. The majority of proposed works in this area assume that clusters are spherical. However, very few real-life clusters are spherical.

Fig. 3 Some examples of arbitrary-shaped datasets with clustering challenges

Another problem in this area is outliers. Because streaming data evolve, outliers are not easy to detect: a data point identified as an outlier may be the first example of a group that later forms a cluster. A further demand in this area is that algorithms are expected to produce results quickly. Therefore, there is an ongoing need for algorithms that can detect non-spherical clusters, achieve high clustering success, handle outliers, and produce results quickly. In this paper, we propose a new stream clustering algorithm, inspired by MCMSTClustering [35], to overcome these problems; its details are described in Sect. 5.

5 Applying minimum spanning tree to KD-tree-based micro-clusters to define arbitrary-shaped clusters in streaming data

This section describes the proposed algorithm in detail.

5.1 Definitions


Euclidean distance is used as the distance measure in all sub-algorithms of the proposed algorithm. The following notions are used in the proposed algorithm:


Micro-cluster Micro-clusters are the essential building blocks of our algorithm, each consisting of a chunk of data. These data groups are spherical and are formed using the KD-Tree data structure and the range search operation. In our algorithm, the radii of all micro-clusters are constant and equal to r. For a micro-cluster to be formed, at least N data points must be grouped within a sphere of radius r.


MST (Minimum spanning tree) An MST is a candidate cluster that combines micro-clusters according to a distance threshold of 2r. The MST starts from a micro-cluster not yet assigned to any cluster; at each step, a micro-cluster is added to the MST if its distance to the closest micro-cluster already in the MST is 2r or less.


Macro-clusters Macro-clusters are the actual clusters, each consisting of at least n_micro micro-clusters. If a candidate MST contains n_micro or more micro-clusters, it is defined as a new macro-cluster.

In the proposed algorithm, some parameters need to be defined by the user. These parameters and their descriptions are as follows:


W (sliding window width) Since streaming data arrive too fast to be accumulated and processed in full, we process W data points at each step. This technique is known in the literature as the sliding window.


r (radius) It is a parameter used to create micro-clusters. It expresses the radius of the region searched in the KD-Tree while defining micro-clusters. It is also used to evaluate the distance between a candidate micro-cluster and the closest micro-cluster of the MST when deciding whether to include the candidate in the MST: the micro-cluster is included if the distance is 2r or less.


N It is the minimum number of data points needed to create a micro-cluster. If at least N data points lie within radius r and their center is sufficiently far from existing micro-clusters, this data group is defined as a micro-cluster.


n_micro This parameter is used to identify macro-clusters. If the number of micro-clusters connected by the MST is n_micro or more, this group of micro-clusters is defined as a macro-cluster.
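To make these definitions concrete, the following sketch shows one possible way to represent a micro-cluster and the user-defined parameters in Python; all field names and parameter values are illustrative placeholders, not the authors' implementation:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MicroCluster:
    """Spherical data summary (field names are illustrative, not the paper's)."""
    center: np.ndarray        # mean of the member points
    n_points: int             # must stay >= N for the micro-cluster to survive
    macro_id: int = -1        # -1 means not assigned to any macro-cluster yet

# Illustrative user-defined parameters (placeholder values)
W = 500        # sliding-window width
r = 0.05       # micro-cluster radius; macro-cluster merge threshold is 2 * r
N = 5          # minimum points within radius r to form a micro-cluster
n_micro = 3    # minimum micro-clusters connected by MST to form a macro-cluster
```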

5.2 The algorithm

In this paper, we propose a new algorithm for the streaming data clustering problem that can detect clusters with non-spherical shapes. Our algorithm identifies clusters by applying MST to KD-Tree-based micro-clusters. As illustrated in Fig. 4, it first defines micro-clusters using the KD-Tree data structure and then applies MST to these micro-clusters to define the final clusters. The proposed algorithm consists of five stages:

  • Generating KD-Tree-based micro-clusters,

  • Creating macro-clusters by applying MST to the micro-clusters,

  • Defining new micro-clusters as data arrive, and deleting micro-clusters whose data count falls below the threshold N as the lifetime of their data expires,

  • Assigning new micro-clusters to macro-clusters, and deleting macro-clusters when their number of micro-clusters falls below n_micro,

  • Updating the information of the whole system.

Fig. 4 Example of identifying macro-clusters over micro-clusters on the three spirals dataset

In light of this information, the proposed algorithm's basic steps are shown in Algorithm 1.

Algorithm 1 MCMSTStream

5.2.1 Defining micro-clusters

In the proposed algorithm, micro-clusters are used to detect non-spherical clusters and to improve performance. To define micro-clusters, the KD-Tree data structure and the range search operation are used: the data that do not yet belong to any micro-cluster are placed in a KD-Tree, and a range search checks whether at least N data points lie within a radius r. If so, these data are grouped into a new micro-cluster. The pseudo-code of the DefineMC sub-algorithm used to define micro-clusters is provided in Algorithm 2.

Algorithm 2 DefineMC
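A minimal sketch of the DefineMC idea follows, assuming scipy's KDTree for the range search; function and variable names are our own, and the actual sub-algorithm additionally checks the distance of the candidate center to existing micro-clusters:

```python
import numpy as np
from scipy.spatial import KDTree

def define_mc(unassigned, r, N):
    """Group unassigned points into new spherical micro-clusters: any point
    with at least N unassigned neighbors within radius r seeds a new
    micro-cluster. Returns centers and a label per point (-1 = unassigned)."""
    pts = np.asarray(unassigned)
    labels = np.full(len(pts), -1)
    tree = KDTree(pts)
    centers = []
    for i in range(len(pts)):
        if labels[i] != -1:
            continue
        idx = [j for j in tree.query_ball_point(pts[i], r) if labels[j] == -1]
        if len(idx) >= N:                      # enough points: new micro-cluster
            labels[idx] = len(centers)
            centers.append(pts[idx].mean(axis=0))
    return centers, labels

centers, labels = define_mc(np.random.default_rng(1).random((200, 2)), r=0.1, N=5)
```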

5.2.2 Assigning newly arriving data to a micro-cluster

By the nature of stream data, old data are deleted and new data arrive over time. If a newly arriving data point is close enough to a micro-cluster, it is assigned to it: specifically, if the distance between the data point and the center of the closest micro-cluster is r or less, the data point is assigned to that micro-cluster. The pseudo-code of the AddtoMC sub-algorithm used for this purpose is given in Algorithm 3.

Algorithm 3 AddtoMC
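A minimal sketch of the AddtoMC rule (nearest micro-cluster within radius r); all names are illustrative:

```python
import numpy as np

def add_to_mc(x, centers, r):
    """Return the index of the nearest micro-cluster if its center lies
    within radius r of the new point x, otherwise -1 (point stays idle)."""
    if len(centers) == 0:
        return -1
    d = np.linalg.norm(np.asarray(centers) - np.asarray(x), axis=1)
    i = int(np.argmin(d))
    return i if d[i] <= r else -1

print(add_to_mc([0.5, 0.5], [[0.48, 0.51], [0.9, 0.1]], r=0.05))   # -> 0
```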

5.2.3 Defining macro-clusters

Macro-clusters are defined over the existing micro-clusters using MST. To define a macro-cluster, at least n_micro micro-clusters must be merged by the MST. When deciding whether a micro-cluster should be included in the MST, its distance to the closest micro-cluster of the MST is evaluated; if the distance is less than or equal to 2r, the micro-cluster is included, as shown in Algorithm 5. Prim's algorithm is used to build the MST; it was chosen because its incremental way of connecting nodes suits cluster definition. The pseudo-code of the DefineMacroC sub-algorithm used to define macro-clusters is given in Algorithm 4.

Algorithm 4 DefineMacroC

Algorithm 5 Prim's algorithm
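Because MST edges are only followed up to the 2r cut-off, growing the tree amounts to collecting all micro-clusters reachable through edges of length at most 2r. The sketch below implements the grouping that way for brevity; it is our interpretation of DefineMacroC under that equivalence, not the authors' exact pseudo-code:

```python
import numpy as np

def define_macro_c(centers, r, n_micro):
    """Group micro-cluster centers into macro-clusters: starting from an
    unassigned micro-cluster, repeatedly pull in every micro-cluster whose
    distance to the growing group is <= 2r (the MST edge threshold).
    Groups with at least n_micro members become macro-clusters (label >= 0)."""
    C = np.asarray(centers)
    macro = np.full(len(C), -1)
    next_id = 0
    for seed in range(len(C)):
        if macro[seed] != -1:
            continue
        group, frontier, seen = [seed], [seed], {seed}
        while frontier:
            u = frontier.pop()
            d = np.linalg.norm(C - C[u], axis=1)
            for v in np.flatnonzero(d <= 2 * r):
                if v not in seen and macro[v] == -1:
                    seen.add(v)
                    group.append(v)
                    frontier.append(v)
        if len(group) >= n_micro:          # enough micro-clusters: macro-cluster
            macro[group] = next_id
            next_id += 1
    return macro
```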

5.2.4 Assigning micro-clusters to macro-clusters

The characteristics of the data change over time, and so do the defined micro-clusters and macro-clusters: existing micro-clusters might be deleted and new ones defined. Newly defined micro-clusters may therefore need to be assigned to existing macro-clusters, and our algorithm performs these operations. To decide whether a micro-cluster should be assigned to a macro-cluster, its distance to the closest micro-cluster of that macro-cluster is evaluated; if the distance is 2r or less, the micro-cluster is assigned to that macro-cluster. The pseudo-code of the AddMCtoMacroC sub-algorithm used to perform this process is presented in Algorithm 6.

Algorithm 6 AddMCtoMacroC
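A hedged sketch of this assignment rule; the argument names (assigned_centers, macro_ids) are our own illustration of the bookkeeping involved:

```python
import numpy as np

def add_mc_to_macro_c(center, assigned_centers, macro_ids, r):
    """Attach a new micro-cluster to an existing macro-cluster when its
    nearest already-assigned micro-cluster is within 2r; return that
    macro-cluster's id, or -1 if none is close enough."""
    if len(assigned_centers) == 0:
        return -1
    d = np.linalg.norm(np.asarray(assigned_centers) - np.asarray(center), axis=1)
    i = int(np.argmin(d))
    return macro_ids[i] if d[i] <= 2 * r else -1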

5.2.5 Updating defined micro-clusters

The centers of micro-clusters and the number of data points they contain change over time: whenever a micro-cluster gains a new data point or loses an expired one, its data count and center must be updated. By performing these updates, our algorithm adapts to the evolving structure of the streaming data. All these updates are performed as provided in Algorithm 7.

Algorithm 7 UpdateMC
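One standard way to realize such updates is the incremental mean, sketched below under the simplifying assumption (ours, not the paper's) that a micro-cluster stores only its center and point count:

```python
import numpy as np

def update_mc(center, n_points, x_new=None, x_old=None):
    """Keep a micro-cluster's center and point count consistent via
    incremental mean updates when a point arrives (x_new) or expires (x_old)."""
    center = np.asarray(center, dtype=float).copy()
    if x_new is not None:                   # a point joined the micro-cluster
        n_points += 1
        center += (np.asarray(x_new) - center) / n_points
    if x_old is not None and n_points > 1:  # an old point left the window
        n_points -= 1
        center -= (np.asarray(x_old) - center) / n_points
    return center, n_points
```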

5.2.6 Updating defined macro-clusters

Defined macro-clusters may lose or gain micro-clusters over time. If, as a result of micro-cluster deletion, the number of micro-clusters owned by a macro-cluster falls below the threshold n_micro, that macro-cluster is deleted; conversely, a macro-cluster may gain newly defined micro-clusters. This information is kept up to date. The pseudo-code of the UpdateMacroC sub-algorithm that performs the related operations is given in Algorithm 8.

Algorithm 8 UpdateMacroC

5.2.7 Deleting defined micro-clusters

Over time, a micro-cluster may lose data, and its data count may fall below the threshold N. In this case, the micro-cluster is deleted, and all its remaining data are marked as idle data belonging to no cluster. Such micro-clusters can be re-defined later if sufficient data are again grouped in the same region. The pseudo-code of the related sub-algorithm is presented in Algorithm 9.

Algorithm 9 KillMCs
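A hedged sketch of this deletion rule, using an illustrative dict-based micro-cluster representation of our own:

```python
def kill_mcs(micro_clusters, N):
    """Drop micro-clusters whose point count fell below N; their remaining
    points are returned so they can be treated as idle (unassigned) data.
    Each micro-cluster is an illustrative dict with 'n_points' and 'points'."""
    survivors, freed_points = [], []
    for mc in micro_clusters:
        if mc["n_points"] >= N:
            survivors.append(mc)
        else:
            freed_points.extend(mc["points"])   # back to idle data
    return survivors, freed_points
```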

5.2.8 Deleting defined macro-clusters

A macro-cluster may lose micro-clusters over time, and their number may fall below the threshold n_micro. In this case, the macro-cluster is deleted, and its remaining micro-clusters are marked as not belonging to any macro-cluster. As with micro-clusters, such macro-clusters can be re-defined over time. The pseudo-code for the sub-algorithm is given in Algorithm 10.

Algorithm 10 KillMacroCs

5.3 Runtime complexity

The proposed algorithm consists of several sub-algorithms. Their runtime complexities are presented in Table 4, where n is the number of processed data points, d is the number of features per data point, m is the number of micro-clusters, k is the number of defined macro-clusters, and E is the number of edges. Since the overall complexity is the sum of these sub-algorithms, our proposed algorithm has a total time complexity of approximately \(O(mn^2 \log n)\). Although this worst-case complexity may seem high, the runtime is low in practice.

Table 4 Runtime complexity for each function in the proposed algorithm

6 Experimental study

6.1 Experimental environment

To conduct the experimental study, the proposed algorithm was coded in the Python programming language in the Anaconda Spyder environment, using libraries such as scikit-learn, matplotlib, and river. The river library was used because it contains implementations of the DBSTREAM and DenStream algorithms. All experiments were performed on a computer with an Intel i7 processor, 16 GB of RAM, and the Windows 11 operating system. To measure the success of our algorithm, we compared it with the KD-AR Stream [14], DBSTREAM [12], and DenStream [10] algorithms in terms of both clustering success and runtime.

6.2 Datasets

To compare the clustering performance of the algorithms, we used 12 datasets, 5 of which are real datasets from two sources (UCI Machine Learning Repository [39] and Tomas Barton's repository [40]), as shown in Table 5. ExclaStar, Aggregation, Zelnik2, Zelnik4, Three Spirals, and ChainLink were used to measure our algorithm's ability on arbitrary-shaped clusters; MrData, which contains 10% noisy data, was used to measure its robustness to noisy data; Breast Cancer, Thyroid, MrData, and KDD were used to measure its effectiveness on real datasets; and the KDD dataset was used to measure its success on high-dimensional data.

Table 5 Characteristics of used datasets

6.3 Indices used to evaluate clustering quality

We compared the clustering success of the algorithms using the Purity and Adjusted Rand Index (ARI) metrics. Purity measures how pure the clusters are: it is the ratio of the number of points carrying the dominant label of their cluster to the total number of points. Let \(c_i\) be the set of points in cluster i, \(t_j\) the set of points with actual label j, N the number of data points, and k the number of clusters; the Purity value is calculated by Eq. (1).

$${\text{Purity}} = \frac{1}{N}\mathop \sum \limits_{i = 1}^k \max_j \left| {c_i \cap t_j } \right|$$
(1)
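Eq. (1) can be computed directly with NumPy, as in the following self-contained sketch:

```python
import numpy as np

def purity(y_true, y_pred):
    """Purity (Eq. 1): fraction of points that carry the dominant true
    label of their predicted cluster."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0
    for c in np.unique(y_pred):
        _, counts = np.unique(y_true[y_pred == c], return_counts=True)
        total += counts.max()
    return total / len(y_true)

print(purity([0, 0, 1, 1, 2], [0, 0, 0, 1, 1]))   # 0.6
```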

ARI is an external cluster quality measure that assesses clustering correctly even when the predicted and actual cluster labels differ. The ARI value is calculated by Eq. (2), where n is the number of data points and \(n_{ij}\), \(a_i\), and \(b_j\) are the values obtained from the contingency table.

$${\text{ARI}}(C^r ,C^m ) = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right] - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] \Big/ \binom{n}{2}}$$
(2)
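In practice, ARI need not be implemented by hand; scikit-learn's adjusted_rand_score computes Eq. (2), as the following usage example shows:

```python
from sklearn.metrics import adjusted_rand_score

# ARI is invariant to label permutations: identical partitions score 1.0,
# unrelated partitions score around 0 (and can be negative).
print(adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))   # 1.0
print(adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1]))   # -0.5
```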

6.4 Experimental procedure and parameter setting

In the experimental study, we compared the clustering quality of the algorithms using a random search over their parameters. To determine the best parameters, each algorithm was run 50 times on each dataset with parameters randomly selected from the intervals given in Table 6. To simplify parameter selection, the data were normalized using the min-max normalization given in Eq. (3).

$$x^{\prime}_{ij} = \frac{x_{ij} - \min x_j }{\max x_j - \min x_j }$$
(3)
Table 6 Parameter intervals of algorithms
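The following sketch illustrates Eq. (3) together with the random-search protocol; the parameter bounds are illustrative placeholders of our own, not the actual intervals of Table 6:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(10, 3, size=(1000, 4))         # stand-in for a raw dataset

# Min-max normalization per feature (Eq. 3): each column mapped to [0, 1]
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Random search: 50 trials with parameters drawn from illustrative intervals
for trial in range(50):
    params = {
        "W": int(rng.integers(100, 1000)),
        "r": float(rng.uniform(0.01, 0.2)),
        "N": int(rng.integers(2, 20)),
        "n_micro": int(rng.integers(2, 10)),
    }
    # ...run the algorithm on X_norm with `params`, record ARI and Purity...
```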

7 Results

7.1 Clustering quality results on real and synthetic datasets

After testing each algorithm on each dataset with parameters randomly selected from the intervals in Table 6, the best results for all algorithms are presented in Tables 7 and 8 (the bold values represent the highest results among the algorithms). The synthetic datasets used in the experimental study have arbitrary-shaped clusters, some of which are illustrated in Fig. 3. The results show that our algorithm achieves higher clustering success; in terms of ARI in particular, it is clearly superior to the other algorithms. Only on the Xclara dataset was it the second-best algorithm, and even there its result was very close to the best. On all other datasets, it achieves the highest ARI. On the Three Spirals and ChainLink datasets especially, it clearly outperforms the others, reaching ARI values of 1.0 while the other algorithms reach only about 0.30. Therefore, we can say that our algorithm is very successful in detecting arbitrary-shaped clusters.

Table 7 Purity performance comparison results of stream clustering algorithms on datasets
Table 8 ARI performance comparison results of stream clustering algorithms on datasets

On real datasets, our algorithm also achieved more successful clustering results. In the experimental study, we used the KDD, Breast Cancer, Occupancy, and Thyroid datasets as real datasets, and our algorithm reached significantly better ARI results than the other algorithms. On the KDD dataset, a high-dimensional dataset commonly used in the stream clustering area, our algorithm achieved an ARI value of 0.7509, while DenStream, DBSTREAM, and KD-AR Stream achieved 0.4962, 0.5679, and 0.6818, respectively. On the Thyroid dataset, our algorithm reaches an ARI value of 0.8943, while the other algorithms reach 0.6410, 0.8420, and 0.8670, respectively. Similarly, on another of the real datasets, the ARI result of our algorithm was 0.5910, while the results of DenStream, DBSTREAM, and KD-AR Stream were 0.5180, 0.2630, and 0.5890, respectively. Moreover, our algorithm was the best on the MrData dataset, with a clustering quality of 0.9812, the highest among the algorithms. MrData is important because it contains noisy data.

7.2 Clustering success on arbitrary-shaped clusters

As illustrated in Fig. 3, depending on the dataset, the cluster shape may be any geometric shape rather than circular, or it may not resemble any geometric shape at all. The majority of data stream clustering algorithms assume that cluster shapes are spherical. However, this holds for only a minority of real-world datasets, which raises the need for algorithms that detect arbitrary-shaped clusters. The algorithm proposed in this paper can easily define non-spherical clusters since it forms clusters based on the dataset density using MST over micro-clusters. The ExclaStar, Zelnik2, Zelnik4, Aggregation, ChainLink, and Three Spirals datasets used in the experimental study contain non-spherical clusters. According to the experimental results, our algorithm achieves better results on these datasets than the other algorithms, so we can say that it is the best on datasets containing arbitrary-shaped clusters.

7.3 Robustness of the proposed algorithm against outliers

Detecting outliers is a fundamental problem in stream clustering because outliers reduce clustering success. To evaluate the robustness of our algorithm against outliers, we tested it on the MrData dataset; as illustrated in Fig. 5, 10% of this dataset consists of outliers. On this dataset, the proposed algorithm achieved an ARI of 0.9812 and a Purity of 0.9916, while the ARI values of the compared algorithms were 0.9732 for DenStream, 0.9820 for DBSTREAM, and 0.9859 for KD-AR Stream. These results indicate that the proposed algorithm remains highly accurate in the presence of outliers.

Fig. 5 MrData dataset, including outliers

7.4 Runtime comparison

To demonstrate runtime performance, we compared our algorithm with the other algorithms on the KDD and MrData datasets, running each with the parameters that gave its highest success on the relevant dataset. The runtimes are presented in Fig. 6. The results show that the runtime performance of our algorithm is quite good: it is faster than KD-AR Stream on both datasets, and on the KDD dataset it is faster than DenStream and very close to DBSTREAM, the fastest algorithm.

Fig. 6 Runtime comparison of streaming data clustering algorithms: a KDD dataset, b MrData dataset

8 Discussion

In this study, the ARI and Purity indices are used to demonstrate the success of the proposed algorithm. However, since Purity is calculated as the ratio of points carrying the dominant label of their cluster to all data in the cluster, it can produce misleadingly high values. For example, if every data point is assigned to its own singleton cluster, the Purity value is 1.00, even though the dataset may actually contain ten or more true clusters. It is therefore incorrect to evaluate based on the Purity value alone, and for this reason the ARI index is used together with Purity in this study. To conclude that an algorithm is successful, the two indices are expected to agree, so both should be examined when assessing algorithms. From this point of view, although the Purity value of the DBSTREAM algorithm is very high on most datasets, its ARI value remains low. Evaluating the two metrics together thus reveals that DBSTREAM does not actually achieve high clustering success in the experiments.

The algorithm we proposed in this study achieved high clustering success on both synthetic and real datasets, and can define arbitrary-shaped clusters with high clustering quality. As shown in Tables 7 and 8, the proposed algorithm was the best on datasets with arbitrary-shaped clusters such as ExclaStar, Aggregation, Zelnik2, Zelnik4, Three Spirals, and ChainLink. Purity is frequently used in the literature to measure whether clustering algorithms assign data points to the correct cluster, and a high Purity value indicates clustering success. A comparison of Purity values with other current stream clustering approaches on the KDD dataset is presented in Table 9; the proposed method obtained a better Purity value than the existing methods. Another strength of our algorithm is clustering high-dimensional datasets with high quality. The KDD dataset is commonly used in the stream clustering area to evaluate algorithms, and our algorithm achieved the best result on it with an ARI value of 0.7509, while its closest competitor, the KD-AR Stream algorithm, achieved 0.6818. Therefore, our algorithm clusters high-dimensional datasets more successfully than the others.

Table 9 Purity value comparison of the proposed algorithm on the KDD dataset with other algorithms in the literature

Our algorithm achieves the highest clustering success, with an ARI of 0.9812, on the MrData dataset, which contains 10% outliers, while the second-best algorithm, KD-AR Stream, reached 0.9742. The DenStream and DBSTREAM algorithms, on the other hand, achieved low clustering success, with ARI values of 0.7596 and 0.5520, respectively. Therefore, it can be said that our proposed algorithm is quite successful in dealing with outliers.

In terms of runtime, our algorithm is fast. Analyzing the complexities of its sub-algorithms presented in Table 4, one may expect the runtime of the proposed algorithm to be high and its performance low. However, as seen in Fig. 6, our algorithm is relatively fast in practice because it uses a two-phase clustering process: first, the data are summarized in a micro-cluster structure instead of being processed in full; then, the MST-based clustering is performed on these micro-clusters, which increases performance.

Given the theoretical background explained in the previous sections, we expected our algorithm to outperform its competitors in clustering quality. Its strongest capability is defining arbitrarily shaped clusters, and the experimental studies support the theoretical expectations, as shown in the tables and figures. In addition, it can process high-dimensional datasets and is robust against outliers, as demonstrated in the experiments. However, processing high-dimensional datasets can degrade its runtime performance.

9 Conclusion and future works

In this study, we propose a new stream clustering algorithm that can detect arbitrary-shaped clusters, is robust to outliers, is fast, and achieves high clustering success. For this purpose, the KD-Tree is used to define micro-clusters, and MST is applied to the defined micro-clusters to form the final clusters. The proposed algorithm was subjected to various experimental studies to test its effectiveness in terms of clustering success and runtime, and was compared with the DenStream, DBSTREAM, and KD-AR Stream algorithms. According to the experimental studies, the proposed algorithm produced better results than its competitors in terms of clustering success and is also very fast. In the literature comparison, our algorithm achieved the highest clustering success, with an ARI value of 0.9812, on the MrData dataset containing 10% outliers. It also achieved the highest Purity value of 0.9780 and the second-highest ARI value of 0.7509 on the KDD dataset, which is a large dataset. In general, highly competitive results were obtained on all datasets. We conclude that the proposed algorithm is more successful than the compared algorithms in clustering quality, outlier detection, and defining arbitrary-shaped clusters, while keeping runtime very low. However, high-dimensional datasets may slightly degrade its runtime performance. Therefore, in future work, we plan to integrate feature selection or feature reduction methods to further improve the runtime of MCMSTStream.