1 Introduction

As the volume of digital data grows rapidly, extracting meaningful patterns and information from it has become an important research topic [1]. Clustering algorithms offer valuable insight when extracting such patterns from ever-increasing amounts of data [2]. In this context, streaming data mining has become one of today's most active fields, concerned with processing stream data rapidly and obtaining meaningful information from it [3, 4]. Traditional data mining focuses on static data. However, the emergence of data streams alongside technological developments has significantly changed how data are stored and processed, creating the need to analyze data in real time and present results to the user instantly [5]. Stream clustering approaches are used to meet this need: they cluster streaming data quickly according to similarity criteria and update the clusters as the characteristics of the data evolve. Applications of these approaches include clickstream analysis, intrusion detection systems, social media, financial applications, scientific research, health research, mobile applications, the Internet of Things (IoT), and sensor networks [6].

Stream clustering approaches fall into five categories: density-, hierarchical-, model-, partitioning-, and grid-based methods [7,8,9]. In density-based methods, clusters grow toward regions where the data are dense, so cluster shapes reflect the underlying density. These clustering approaches can detect arbitrary-shaped clusters and outliers. DenStream [10], D-Stream [11], DBSTREAM [12], StreamSW [13], and KD-AR Stream [14] are some examples of density-based clustering approaches. In hierarchical methods, clusters are generated by combining data in a hierarchical structure according to the distances among them. Two types of approaches, agglomerative and divisive, are used to address such clustering problems [8]. BIRCH [15], ClusTree [16], and ODAC [17] are some examples of hierarchical stream clustering algorithms.

In partitioning-based stream clustering approaches, the dataset is partitioned according to centers selected by various techniques [18]. The goal is to optimize a target criterion such as variance. StreamKM++ [19], CluStream [20], SWClustering [21], and HPStream [22] are examples of partitioning-based stream clustering algorithms. Grid-based methods, on the other hand, divide the data space into equally sized grids and form clusters according to the number of data points falling into each grid. The grid structure partitions the data stream into regular cells, and the data in each cell are clustered using statistical or computational methods. DD-Stream [23] is one of these methods. Finally, model-based approaches assume that the dataset fits a mathematical model. EM (Expectation Maximization) [24] is an example of these methods that effectively handles noisy data and outliers.

In the stream clustering area, the cluster shapes that can be defined, the cluster generation process, effective handling of noisy data and outliers, the capability of processing high-dimensional data, and time complexity are all critical to clustering performance. In the literature, stream data are generally clustered into spherical [14, 16, 20, 25, 26] or arbitrary [10, 12, 13, 27, 28] shapes. Although spherical clustering approaches are successful, their performance is limited when the data distribution is arbitrary. Clustering approaches that partition the dataset into groups according to centroids tend to form spherical clusters. Since real-world data rarely follow a homogeneous distribution, the clusters they form may not be spherical, and the performance of spherical clustering approaches on such data may be limited. Therefore, defining arbitrary-shaped clusters in such data distributions can significantly improve clustering performance. Density-based approaches can define arbitrary-shaped clusters based on density without requiring the number of clusters to be set in advance. These approaches also successfully detect noise and outliers [8].

In stream clustering, there are fully online and online-offline approaches to clustering datasets [7, 29]. In fully online approaches [14, 19, 26, 27], clustering is performed for each newly arriving data point, and the current clustering result is maintained. The remaining approaches are two-phase (online-offline) methods [12, 13, 16, 20, 28]. In the online phase, newly arriving data are evaluated in real time, and summary statistics of the observations are captured as micro-clusters. In the offline phase, these summaries are used to form the final clusters [30]. In such approaches, algorithms such as k-means [31], k-median [32], DBSCAN [33], and minimum spanning tree (MST) methods are used in the offline phase to define the final clusters.

Two of the most crucial challenges in streaming data clustering are defining arbitrary-shaped clusters and processing high-dimensional data. Şenol and Karacan [14] used the KD-Tree structure to process high-dimensional data. Similarly, Şenol et al. [34] stated that tree data structures can benefit stream data clustering. However, the major limitation of these methods is their weakness in detecting arbitrary-shaped clusters. The MCMSTClustering algorithm [35] supports high dimensionality using the KD-Tree structure and is also very successful in defining arbitrary-shaped clusters using the minimum spanning tree algorithm. The motivation of this paper is to adapt the MCMSTClustering algorithm to stream data clustering problems, building on these capabilities. To summarize, the main contributions of the proposed algorithm are as follows:

  • Ability to define arbitrary-shaped clusters,

  • Robustness to outliers,

  • Capability of processing high-dimensional data,

  • High clustering quality in acceptable runtime.

The rest of the paper is organized as follows: Sect. 2 discusses the related work, while Sect. 3 provides the necessary background on the methods used in the study. In Sect. 4, the problem is defined and the objective of the study is explained. Then, Sect. 5 describes the proposed algorithm in detail, while Sect. 6 presents the experimental studies. Section 7 shares the results obtained from the experimental study, while Sect. 8 discusses them. Finally, Sect. 9 concludes the paper and outlines future work. For ease of reading, the descriptions of the mathematical symbols and acronyms used in this paper are summarized in Tables 1 and 2.

Table 1 Mathematical symbol interpretation used in this paper
Table 2 The acronyms used in this paper

2 Related works

Density-based clustering involves grouping data objects distributed in a contiguous region of the data space with high object density [1, 36]; such clusters are separated from one another by contiguous regions of low object density. Outliers and noisy data can dramatically reduce the clustering quality of stream clustering techniques, so robustness to these issues improves clustering performance. Furthermore, the ability to form non-spherical clusters allows algorithms to perform better. Density-based techniques are often highly effective at overcoming such issues.

DenStream [10] is a density-based stream clustering algorithm that can cluster data streams over fading windows and handle clusters with arbitrary shapes. DenStream introduces the core-micro-cluster concept to summarize clusters of arbitrary shape, together with potential core-micro-cluster and outlier-micro-cluster structures to maintain and distinguish potential clusters and outliers. When DenStream receives a clustering request, it uses DBSCAN to produce the final clustering results. DenStream can analyze high-dimensional data streams and find outliers in them. However, its ability to detect and report concept drift is limited, and it cannot predict the number of clusters, which can sometimes be an issue.

Chen and Tu presented D-Stream [11], a density- and grid-based clustering technique. D-Stream, like DenStream, has a two-phase clustering design. First, the online component puts each data point into grids. The offline component then computes grid densities and groups grids according to density. D-Stream can thus detect clusters of any shape. Furthermore, D-Stream can manage high-velocity, high-volume data streams, making it suited for real-time applications. However, D-Stream has some limitations; in particular, it may struggle with data streams of varying density.

Hahsler and Bolaños [12] presented DBSTREAM, a density-based approach that addresses the issue of ignoring the data density in the region shared between micro-clusters. This approach uses the data to directly estimate the density in the common region between micro-clusters. In the online phase, as in DenStream, density estimates are generated for micro-clusters rather than for the epsilon neighborhood around every single point, considerably decreasing processing costs. However, the parameters of DBSTREAM must be fine-tuned to achieve good results.

CEDAS [35] provides a two-stage, fully online method to group evolving data streams into arbitrarily shaped clusters. Hyperspherical micro-clusters are created in the first stage, and in the second stage they are combined into larger macro-clusters using a graph structure. The method is accurate, robust to noise, computationally and memory efficient, and can handle high-dimensional datasets. However, the method outputs only cluster assignments. Moreover, it cannot discover the densities of regions in the data space [27].

StreamSW [13] is a density-based clustering approach for streaming data over a sliding window. It uses a two-phase online-offline clustering framework that maintains a synopsis of the streaming data in p-micro-clusters, which are then reclustered by an enhanced DBSCAN algorithm in the offline component. StreamSW uses density-based micro-clustering and grid-based approaches to find high-quality arbitrary-shaped clusters with limited memory and execution time, and has shown promising results in experiments on real-world and synthetic datasets. However, it is unsuitable for high-dimensional streaming data due to the high computation time and degraded performance caused by the number of grid cells growing with the dimensionality of the space.

The MVStream clustering algorithm [26] integrates information from multiple incomplete views by employing summary statistics of past multi-view data objects together with a novel multi-view support vector domain description (MVSVDD) model, whose outputs are support vectors (SVs). Because the SVs occupy only a small number of data objects, the MVStream technique is efficient when computational resources are constrained. In benchmark testing, MVStream outperformed seven current single-view data stream clustering methods and two multi-view clustering techniques built for static large-scale multi-view data.

Şenol and Karacan [14] proposed the KD-AR Stream algorithm, which is suited to the dynamic structure of streaming data. The method is fully online and uses the KD-Tree data structure for cluster formation. An adaptive radius is used to adjust the cluster size, and time-based summarization consisting of a time window and a sliding window is implemented to avoid performance loss. However, the method suffers performance loss on high-dimensional data. In addition, it cannot detect arbitrary-shaped clusters since it only forms spherical ones.

Mousavi et al. [37] developed CVD-Stream (Clustering Varying Density Data Stream), a density-based online-offline approach. In the online phase, merging and pruning procedures are applied and outliers are effectively removed. A variable-density clustering technique is used to construct the final clusters during the offline phase. Ahmed et al. [25] suggested DGStream, a density- and grid-based two-phase technique that constructs arbitrary-shaped clusters and detects outliers quickly and accurately. DGStream was compared against DenStream, D-Stream, and ClusTree on synthetic and real-world data and improved on their performance. As mentioned, many data stream clustering algorithms have been proposed, but each has various issues. Table 3 compares some of the stream clustering algorithms in the literature.

Table 3 Comparison of streaming data clustering algorithms according to their problem-solving style and capabilities

3 Preliminaries

3.1 KD-tree data structure and range search

KD-Tree is a space-partitioning data structure with an average search complexity of O(log n) that can handle high-dimensional datasets. It is fast and supports range search, which makes it widely used. Range search locates the data within a given radius on the KD-Tree and has a runtime complexity of \(O(dn^{1 - \frac{1}{d}} + k)\). Examples of a KD-Tree and range searches are shown in Fig. 1.

Fig. 1 KD-Tree data structure and its decomposition
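For illustration, the following minimal sketch shows how such a range search can be performed in Python with scipy.spatial.KDTree; this is only an illustrative example under our own choice of data and radius, not the authors' implementation:

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(42)
data = rng.random((1000, 3))              # 1000 points in 3-D, in [0, 1]

tree = KDTree(data)                       # build the KD-Tree

# Range search: indices of all points within radius r of the query point
query, r = data[0], 0.1
neighbor_idx = tree.query_ball_point(query, r)
print(len(neighbor_idx), "points lie within radius", r, "of the query")
```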

3.2 Minimum spanning tree

A minimum spanning tree connects all nodes of a graph with the minimum total edge weight, as illustrated in Fig. 2. It is a highly efficient method for weighted undirected graphs. A naive implementation runs in O(n^2), but this can be reduced to O(E log n) when a priority queue is used. Prim's and Kruskal's algorithms are two well-known examples of such approaches.

Fig. 2 An example of a minimum spanning tree
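The sketch below illustrates Prim's algorithm over a dense distance matrix using a binary heap, one way to obtain the O(E log n) behavior mentioned above; the function name and matrix representation are our own illustration:

```python
import heapq
import numpy as np

def prim_mst(dist):
    """Prim's algorithm over a dense n x n distance matrix.
    Returns MST edges as (weight, u, v); O(E log n) with a binary heap."""
    n = len(dist)
    visited = [False] * n
    edges = []
    heap = [(0.0, 0, 0)]                      # (weight, parent, node); start at node 0
    while heap:
        w, u, v = heapq.heappop(heap)
        if visited[v]:
            continue
        visited[v] = True
        if u != v:                            # skip the artificial start edge
            edges.append((w, u, v))
        for nxt in range(n):
            if not visited[nxt]:
                heapq.heappush(heap, (float(dist[v][nxt]), v, nxt))
    return edges

# Toy usage on five random 2-D points
pts = np.random.default_rng(0).random((5, 2))
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(prim_mst(dist))
```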

4 Problem statement

Since streaming data arrive in real time and have an evolving structure, algorithms that can handle such datasets are needed. However, as explained in Sect. 2, streaming data pose many open problems. One of the most critical is the detection of non-spherical clusters, some examples of which can be seen in Fig. 3. The majority of proposed works in this area assume that clusters are spherical. However, very few real-life clusters are spherical.

Fig. 3 Some examples of arbitrary-shaped datasets with clustering challenges

Another problem in this area is outliers. Because streaming data evolve, outliers are not easy to detect: a data point identified as an outlier may be the first example of a group that later forms a cluster. A further demand in this area is that algorithms are expected to produce results quickly. Therefore, there is an ongoing need for algorithms that can detect non-spherical clusters, achieve high clustering success, handle outliers, and produce results quickly. In this paper, we propose a new stream clustering algorithm, inspired by MCMSTClustering [35], to overcome these problems; its details are described in Sect. 5.

5 Applying minimum spanning tree to KD-tree-based micro-clusters to define arbitrary-shaped clusters in streaming data

This section describes the proposed algorithm in detail.

5.1 Definitions


Euclidean distance is used as the distance measure in all sub-algorithms of the proposed algorithm. The following notions are used in the proposed algorithm:


Micro-cluster Micro-clusters are the essential building blocks of our algorithm, each consisting of a chunk of data. These data groups are spherical and are formed using the KD-Tree data structure and the range search operation. In our algorithm, the radii of all micro-clusters are constant and equal to r. For a micro-cluster to be formed, at least N data points must be grouped within a sphere of radius r.


MST (Minimum spanning tree) An MST is a candidate cluster that combines micro-clusters according to a distance threshold of 2r. The MST starts from a micro-cluster not yet assigned to any cluster; at each step, a micro-cluster is added to the MST if its distance to the closest micro-cluster already in the MST is 2r or less.


Macro-clusters Macro-clusters are the actual clusters, each consisting of at least n_micro micro-clusters. If a candidate MST contains n_micro or more micro-clusters, it is defined as a new macro-cluster.

In the proposed algorithm, some parameters need to be defined by the user. These parameters and their descriptions are as follows:


W (sliding window width) Since streaming data arrive too fast to be accumulated and processed in full, we process W data points at each step. This technique is known in the literature as the sliding window.


r (radius) It is a parameter used to create micro-clusters. It expresses the radius of the region searched in the KD-Tree while defining micro-clusters. It is also used to evaluate the distance between a candidate micro-cluster and the closest micro-cluster of the MST when deciding whether to include the candidate in the MST: the micro-cluster is included if the distance is 2r or less.


N It is the minimum number of data points needed to create a micro-cluster. If at least N data points lie within radius r and their center is sufficiently far from existing micro-clusters, this data group is defined as a micro-cluster.


n_micro This parameter is used to identify macro-clusters. If the number of micro-clusters connected by the MST is n_micro or more, this group of micro-clusters is defined as a macro-cluster.
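To make these definitions concrete, the following sketch shows one possible way to represent a micro-cluster and the user-defined parameters in Python; all field names and parameter values are illustrative placeholders, not the authors' implementation:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MicroCluster:
    """Spherical data summary (field names are illustrative, not the paper's)."""
    center: np.ndarray        # mean of the member points
    n_points: int             # must stay >= N for the micro-cluster to survive
    macro_id: int = -1        # -1 means not assigned to any macro-cluster yet

# Illustrative user-defined parameters (placeholder values)
W = 500        # sliding-window width
r = 0.05       # micro-cluster radius; macro-cluster merge threshold is 2 * r
N = 5          # minimum points within radius r to form a micro-cluster
n_micro = 3    # minimum micro-clusters connected by MST to form a macro-cluster
```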

5.2 The algorithm

In this paper, we propose a new algorithm for the streaming data clustering problem that can detect clusters with non-spherical shapes. Our algorithm identifies clusters by applying MST to KD-Tree-based micro-clusters. As illustrated in Fig. 4, it first defines micro-clusters using the KD-Tree data structure and then applies MST to these micro-clusters to define the final clusters. The proposed algorithm consists of five stages:

  • Generating KD-Tree-based micro-clusters,

  • Creating macro-clusters by applying MST to the micro-clusters,

  • Defining new micro-clusters as data arrive, and deleting micro-clusters whose data count falls below the threshold N as the lifetime of their data expires,

  • Assigning new micro-clusters to macro-clusters, and deleting macro-clusters when their number of micro-clusters falls below n_micro,

  • Updating the information of the whole system.

Fig. 4 Example of identifying macro-clusters over micro-clusters on the three spirals dataset

In light of this information, the proposed algorithm's basic steps are shown in Algorithm 1.

Algorithm 1 MCMSTStream

5.2.1 Defining micro-clusters

In the proposed algorithm, micro-clusters are used to detect non-spherical clusters and to improve performance. To define micro-clusters, the KD-Tree data structure and the range search operation are used: the data that do not yet belong to any micro-cluster are placed in a KD-Tree, and a range search checks whether at least N data points lie within a radius r. If so, these data are grouped into a new micro-cluster. The pseudo-code of the DefineMC sub-algorithm used to define micro-clusters is provided in Algorithm 2.

Algorithm 2 DefineMC
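A minimal sketch of the DefineMC idea follows, assuming scipy's KDTree for the range search; function and variable names are our own, and the actual sub-algorithm additionally checks the distance of the candidate center to existing micro-clusters:

```python
import numpy as np
from scipy.spatial import KDTree

def define_mc(unassigned, r, N):
    """Group unassigned points into new spherical micro-clusters: any point
    with at least N unassigned neighbors within radius r seeds a new
    micro-cluster. Returns centers and a label per point (-1 = unassigned)."""
    pts = np.asarray(unassigned)
    labels = np.full(len(pts), -1)
    tree = KDTree(pts)
    centers = []
    for i in range(len(pts)):
        if labels[i] != -1:
            continue
        idx = [j for j in tree.query_ball_point(pts[i], r) if labels[j] == -1]
        if len(idx) >= N:                      # enough points: new micro-cluster
            labels[idx] = len(centers)
            centers.append(pts[idx].mean(axis=0))
    return centers, labels

centers, labels = define_mc(np.random.default_rng(1).random((200, 2)), r=0.1, N=5)
```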

5.2.2 Assigning newly arriving data to a micro-cluster

By the nature of stream data, old data are deleted and new data arrive over time. If a newly arriving data point is close enough to a micro-cluster, it is assigned to it: specifically, if the distance between the data point and the center of the closest micro-cluster is r or less, the data point is assigned to that micro-cluster. The pseudo-code of the AddtoMC sub-algorithm used for this purpose is given in Algorithm 3.

Algorithm 3 AddtoMC
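A minimal sketch of the AddtoMC rule (nearest micro-cluster within radius r); all names are illustrative:

```python
import numpy as np

def add_to_mc(x, centers, r):
    """Return the index of the nearest micro-cluster if its center lies
    within radius r of the new point x, otherwise -1 (point stays idle)."""
    if len(centers) == 0:
        return -1
    d = np.linalg.norm(np.asarray(centers) - np.asarray(x), axis=1)
    i = int(np.argmin(d))
    return i if d[i] <= r else -1

print(add_to_mc([0.5, 0.5], [[0.48, 0.51], [0.9, 0.1]], r=0.05))   # -> 0
```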

5.2.3 Defining macro-clusters

Macro-clusters are defined over the existing micro-clusters using MST. To define a macro-cluster, at least n_micro micro-clusters must be merged by the MST. When deciding whether a micro-cluster should be included in the MST, its distance to the closest micro-cluster of the MST is evaluated; if the distance is less than or equal to 2r, the micro-cluster is included, as shown in Algorithm 5. Prim's algorithm is used to build the MST; it was chosen because its incremental way of connecting nodes suits cluster definition. The pseudo-code of the DefineMacroC sub-algorithm used to define macro-clusters is given in Algorithm 4.

Algorithm 4 DefineMacroC

Algorithm 5 Prim's algorithm
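Because MST edges are only followed up to the 2r cut-off, growing the tree amounts to collecting all micro-clusters reachable through edges of length at most 2r. The sketch below implements the grouping that way for brevity; it is our interpretation of DefineMacroC under that equivalence, not the authors' exact pseudo-code:

```python
import numpy as np

def define_macro_c(centers, r, n_micro):
    """Group micro-cluster centers into macro-clusters: starting from an
    unassigned micro-cluster, repeatedly pull in every micro-cluster whose
    distance to the growing group is <= 2r (the MST edge threshold).
    Groups with at least n_micro members become macro-clusters (label >= 0)."""
    C = np.asarray(centers)
    macro = np.full(len(C), -1)
    next_id = 0
    for seed in range(len(C)):
        if macro[seed] != -1:
            continue
        group, frontier, seen = [seed], [seed], {seed}
        while frontier:
            u = frontier.pop()
            d = np.linalg.norm(C - C[u], axis=1)
            for v in np.flatnonzero(d <= 2 * r):
                if v not in seen and macro[v] == -1:
                    seen.add(v)
                    group.append(v)
                    frontier.append(v)
        if len(group) >= n_micro:          # enough micro-clusters: macro-cluster
            macro[group] = next_id
            next_id += 1
    return macro
```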

5.2.4 Assigning micro-clusters to macro-clusters

The characteristics of the data change over time, and so do the defined micro-clusters and macro-clusters: existing micro-clusters might be deleted and new ones defined. Newly defined micro-clusters may therefore need to be assigned to existing macro-clusters, and our algorithm performs these operations. To decide whether a micro-cluster should be assigned to a macro-cluster, its distance to the closest micro-cluster of that macro-cluster is evaluated; if the distance is 2r or less, the micro-cluster is assigned to that macro-cluster. The pseudo-code of the AddMCtoMacroC sub-algorithm used to perform this process is presented in Algorithm 6.

Algorithm 6 AddMCtoMacroC
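A hedged sketch of this assignment rule; the argument names (assigned_centers, macro_ids) are our own illustration of the bookkeeping involved:

```python
import numpy as np

def add_mc_to_macro_c(center, assigned_centers, macro_ids, r):
    """Attach a new micro-cluster to an existing macro-cluster when its
    nearest already-assigned micro-cluster is within 2r; return that
    macro-cluster's id, or -1 if none is close enough."""
    if len(assigned_centers) == 0:
        return -1
    d = np.linalg.norm(np.asarray(assigned_centers) - np.asarray(center), axis=1)
    i = int(np.argmin(d))
    return macro_ids[i] if d[i] <= 2 * r else -1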

5.2.5 Updating defined micro-clusters

The centers of micro-clusters and the number of data points they contain change over time: whenever a micro-cluster gains a new data point or loses an expired one, its data count and center must be updated. By performing these updates, our algorithm adapts to the evolving structure of the streaming data. All these updates are performed as provided in Algorithm 7.

Algorithm 7 UpdateMC
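One standard way to realize such updates is the incremental mean, sketched below under the simplifying assumption (ours, not the paper's) that a micro-cluster stores only its center and point count:

```python
import numpy as np

def update_mc(center, n_points, x_new=None, x_old=None):
    """Keep a micro-cluster's center and point count consistent via
    incremental mean updates when a point arrives (x_new) or expires (x_old)."""
    center = np.asarray(center, dtype=float).copy()
    if x_new is not None:                   # a point joined the micro-cluster
        n_points += 1
        center += (np.asarray(x_new) - center) / n_points
    if x_old is not None and n_points > 1:  # an old point left the window
        n_points -= 1
        center -= (np.asarray(x_old) - center) / n_points
    return center, n_points
```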

5.2.6 Updating defined macro-clusters

Defined macro-clusters may lose or gain micro-clusters over time. If, as a result of micro-cluster deletion, the number of micro-clusters owned by a macro-cluster falls below the threshold n_micro, that macro-cluster is deleted; conversely, a macro-cluster may gain newly defined micro-clusters. This information is kept up to date. The pseudo-code of the UpdateMacroC sub-algorithm that performs the related operations is given in Algorithm 8.

Algorithm 8 UpdateMacroC

5.2.7 Deleting defined micro-clusters

Over time, a micro-cluster may lose data, and its data count may fall below the threshold N. In this case, the micro-cluster is deleted, and all its remaining data are marked as idle data belonging to no cluster. Such micro-clusters can be re-defined later if sufficient data are again grouped in the same region. The pseudo-code of the related sub-algorithm is presented in Algorithm 9.

Algorithm 9 KillMCs
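A hedged sketch of this deletion rule, using an illustrative dict-based micro-cluster representation of our own:

```python
def kill_mcs(micro_clusters, N):
    """Drop micro-clusters whose point count fell below N; their remaining
    points are returned so they can be treated as idle (unassigned) data.
    Each micro-cluster is an illustrative dict with 'n_points' and 'points'."""
    survivors, freed_points = [], []
    for mc in micro_clusters:
        if mc["n_points"] >= N:
            survivors.append(mc)
        else:
            freed_points.extend(mc["points"])   # back to idle data
    return survivors, freed_points
```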

5.2.8 Deleting defined macro-clusters

A macro-cluster may lose micro-clusters over time, and their number may fall below the threshold n_micro. In this case, the macro-cluster is deleted, and its remaining micro-clusters are marked as not belonging to any macro-cluster. As with micro-clusters, such macro-clusters can be re-defined over time. The pseudo-code for the sub-algorithm is given in Algorithm 10.

Algorithm 10 KillMacroCs

5.3 Runtime complexity

The proposed algorithm consists of several sub-algorithms. Their runtime complexities are presented in Table 4, where n is the number of processed data points, d is the number of features per data point, m is the number of micro-clusters, k is the number of defined macro-clusters, and E is the number of edges. Since the overall complexity is the sum of these sub-algorithms, our proposed algorithm has a total time complexity of approximately \(O(mn^2 \log n)\). Although this worst-case complexity may seem high, the runtime is low in practice.

Table 4 Runtime complexity for each function in the proposed algorithm

6 Experimental study

6.1 Experimental environment

To conduct the experimental study, the proposed algorithm was coded in the Python programming language in the Anaconda Spyder environment, using libraries such as scikit-learn, matplotlib, and river. The river library was used because it contains implementations of the DBSTREAM and DenStream algorithms. All experiments were performed on a computer with an Intel i7 processor, 16 GB of RAM, and the Windows 11 operating system. To measure the success of our algorithm, we compared it with the KD-AR Stream [14], DBSTREAM [12], and DenStream [10] algorithms in terms of both clustering success and runtime.

6.2 Datasets

To compare the clustering performance of the algorithms, we used 12 datasets, 5 of which are real datasets from two sources (UCI Machine Learning Repository [39] and Tomas Barton's repository [40]), as shown in Table 5. ExclaStar, Aggregation, Zelnik2, Zelnik4, Three Spirals, and ChainLink were used to measure our algorithm's ability on arbitrary-shaped clusters; MrData, which contains 10% noisy data, was used to measure its robustness to noisy data; Breast Cancer, Thyroid, MrData, and KDD were used to measure its effectiveness on real datasets; and the KDD dataset was used to measure its success on high-dimensional data.

Table 5 Characteristics of used datasets

6.3 Indices used to evaluate clustering quality

We compared the clustering success of the algorithms using the Purity and Adjusted Rand Index (ARI) metrics. Purity measures how pure the clusters are: it is the ratio of the number of points carrying the dominant label of their cluster to the total number of points. Let \(c_i\) be the set of points in cluster i, \(t_j\) the set of points with actual label j, N the number of data points, and k the number of clusters; the Purity value is calculated by Eq. (1).

$${\text{Purity}} = \frac{1}{N}\mathop \sum \limits_{i = 1}^k \max_j \left| {c_i \cap t_j } \right|$$
(1)
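Eq. (1) can be computed directly with NumPy, as in the following self-contained sketch:

```python
import numpy as np

def purity(y_true, y_pred):
    """Purity (Eq. 1): fraction of points that carry the dominant true
    label of their predicted cluster."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0
    for c in np.unique(y_pred):
        _, counts = np.unique(y_true[y_pred == c], return_counts=True)
        total += counts.max()
    return total / len(y_true)

print(purity([0, 0, 1, 1, 2], [0, 0, 0, 1, 1]))   # 0.6
```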

ARI is an external cluster quality measure that assesses clustering correctly even when the predicted and actual cluster labels differ. The ARI value is calculated by Eq. (2), where n is the number of data points and \(n_{ij}\), \(a_i\), and \(b_j\) are the values obtained from the contingency table.

$${\text{ARI}}(C^r ,C^m ) = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right] - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] \Big/ \binom{n}{2}}$$
(2)
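In practice, ARI need not be implemented by hand; scikit-learn's adjusted_rand_score computes Eq. (2), as the following usage example shows:

```python
from sklearn.metrics import adjusted_rand_score

# ARI is invariant to label permutations: identical partitions score 1.0,
# unrelated partitions score around 0 (and can be negative).
print(adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))   # 1.0
print(adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1]))   # -0.5
```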

6.4 Experimental procedure and parameter setting

In the experimental study, we compared the clustering quality of the algorithms using a random search over their parameters. To determine the best parameters, each algorithm was run 50 times on each dataset with parameters randomly selected from the intervals given in Table 6. To simplify parameter selection, the data were normalized using the min-max normalization given in Eq. (3).

$$x^{\prime}_{ij} = \frac{x_{ij} - \min x_j }{\max x_j - \min x_j }$$
(3)
Table 6 Parameter intervals of algorithms
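The following sketch illustrates Eq. (3) together with the random-search protocol; the parameter bounds are illustrative placeholders of our own, not the actual intervals of Table 6:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(10, 3, size=(1000, 4))         # stand-in for a raw dataset

# Min-max normalization per feature (Eq. 3): each column mapped to [0, 1]
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Random search: 50 trials with parameters drawn from illustrative intervals
for trial in range(50):
    params = {
        "W": int(rng.integers(100, 1000)),
        "r": float(rng.uniform(0.01, 0.2)),
        "N": int(rng.integers(2, 20)),
        "n_micro": int(rng.integers(2, 10)),
    }
    # ...run the algorithm on X_norm with `params`, record ARI and Purity...
```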

7 Results

7.1 Clustering quality results on real and synthetic datasets

After testing each algorithm on each dataset with parameters randomly selected from the intervals in Table 6, the best results for all algorithms are presented in Tables 7 and 8 (the bold values represent the highest results among the algorithms). The synthetic datasets used in the experimental study have arbitrary-shaped clusters, some of which are illustrated in Fig. 3. The results show that our algorithm achieves higher clustering success; in terms of ARI in particular, it is clearly superior to the other algorithms. Only on the Xclara dataset was it the second-best algorithm, and even there its result was very close to the best. On all other datasets, it achieves the highest ARI. On the Three Spirals and ChainLink datasets especially, it clearly outperforms the others, reaching ARI values of 1.0 while the other algorithms reach only about 0.30. Therefore, we can say that our algorithm is very successful in detecting arbitrary-shaped clusters.

Table 7 Purity performance comparison results of stream clustering algorithms on datasets
Table 8 ARI performance comparison results of stream clustering algorithms on datasets

On real datasets, our algorithm also achieved more successful clustering results. In the experimental study, we used the KDD, Breast Cancer, Occupancy, and Thyroid datasets as real datasets, and our algorithm reached significantly better ARI results than the other algorithms. On the KDD dataset, a high-dimensional dataset commonly used in the stream clustering area, our algorithm achieved an ARI value of 0.7509, while DenStream, DBSTREAM, and KD-AR Stream achieved 0.4962, 0.5679, and 0.6818, respectively. On the Thyroid dataset, our algorithm reaches an ARI value of 0.8943, while the other algorithms reach 0.6410, 0.8420, and 0.8670, respectively. Similarly, on another of the real datasets, the ARI result of our algorithm was 0.5910, while the results of DenStream, DBSTREAM, and KD-AR Stream were 0.5180, 0.2630, and 0.5890, respectively. Moreover, our algorithm was the best on the MrData dataset, with a clustering quality of 0.9812, the highest among the algorithms. MrData is important because it contains noisy data.

7.2 Clustering success on arbitrary-shaped clusters

As illustrated in Fig. 3, depending on the dataset, the cluster shape may be any geometric shape rather than circular, or it may not resemble any geometric shape at all. The majority of data stream clustering algorithms assume that cluster shapes are spherical. However, this holds for only a minority of real-world datasets, which raises the need for algorithms that detect arbitrary-shaped clusters. The algorithm proposed in this paper can easily define non-spherical clusters since it forms clusters based on the dataset density using MST over micro-clusters. The ExclaStar, Zelnik2, Zelnik4, Aggregation, ChainLink, and Three Spirals datasets used in the experimental study contain non-spherical clusters. According to the experimental results, our algorithm achieves better results on these datasets than the other algorithms, so we can say that it is the best on datasets containing arbitrary-shaped clusters.

7.3 Robustness of the proposed algorithm against outliers

Detecting outliers is a fundamental problem in stream clustering because outliers reduce clustering success. To evaluate the robustness of our algorithm against outliers, we tested it on the MrData dataset; as illustrated in Fig. 5, 10% of this dataset consists of outliers. On this dataset, the proposed algorithm achieved an ARI of 0.9812 and a Purity of 0.9916, while the ARI values of the compared algorithms were 0.9732 for DenStream, 0.9820 for DBSTREAM, and 0.9859 for KD-AR Stream. These results indicate that the proposed algorithm remains highly accurate in the presence of outliers.

Fig. 5 MrData dataset, including outliers

7.4 Runtime comparison

To demonstrate runtime performance, we compared our algorithm with the other algorithms on the KDD and MrData datasets, running each with the parameters that gave its highest success on the relevant dataset. The runtimes are presented in Fig. 6. The results show that the runtime performance of our algorithm is quite good: it is faster than KD-AR Stream on both datasets, and on the KDD dataset it is faster than DenStream and very close to DBSTREAM, the fastest algorithm.

Fig. 6 Runtime comparison of streaming data clustering algorithms: a KDD dataset, b MrData dataset

8 Discussion

In this study, the ARI and Purity indices are used to demonstrate the success of the proposed algorithm. However, since Purity is calculated as the ratio of points carrying the dominant label of their cluster to all data in the cluster, it can produce misleadingly high values. For example, if every data point is assigned to its own singleton cluster, the Purity value is 1.00, even though the dataset may actually contain ten or more true clusters. It is therefore incorrect to evaluate based on the Purity value alone, and for this reason the ARI index is used together with Purity in this study. To conclude that an algorithm is successful, the two indices are expected to agree, so both should be examined when assessing algorithms. From this point of view, although the Purity value of the DBSTREAM algorithm is very high on most datasets, its ARI value remains low. Evaluating the two metrics together thus reveals that DBSTREAM does not actually achieve high clustering success in the experiments.

The algorithm we proposed in this study achieved high clustering success on both synthetic and real datasets, and can define arbitrary-shaped clusters with high clustering quality. As shown in Tables 7 and 8, the proposed algorithm was the best on datasets with arbitrary-shaped clusters such as ExclaStar, Aggregation, Zelnik2, Zelnik4, Three Spirals, and ChainLink. Purity is frequently used in the literature to measure whether clustering algorithms assign data points to the correct cluster, and a high Purity value indicates clustering success. A comparison of Purity values with other current stream clustering approaches on the KDD dataset is presented in Table 9; the proposed method obtained a better Purity value than the existing methods. Another strength of our algorithm is clustering high-dimensional datasets with high quality. The KDD dataset is commonly used in the stream clustering area to evaluate algorithms, and our algorithm achieved the best result on it with an ARI value of 0.7509, while its closest competitor, the KD-AR Stream algorithm, achieved 0.6818. Therefore, our algorithm clusters high-dimensional datasets more successfully than the others.

Table 9 Purity value comparison of the proposed algorithm on the KDD dataset with other algorithms in the literature

Our algorithm achieves the highest clustering success, with an ARI of 0.9812, on the MrData dataset, which contains 10% outliers, while the second-best algorithm, KD-AR Stream, reached 0.9742. The DenStream and DBSTREAM algorithms, on the other hand, achieved low clustering success, with ARI values of 0.7596 and 0.5520, respectively. Therefore, it can be said that our proposed algorithm is quite successful in dealing with outliers.

In terms of runtime, our algorithm is fast. Analyzing the complexities of its sub-algorithms presented in Table 4, one may expect the runtime of the proposed algorithm to be high and its performance low. However, as seen in Fig. 6, our algorithm is relatively fast in practice because it uses a two-phase clustering process: first, the data are summarized in a micro-cluster structure instead of being processed in full; then, the MST-based clustering is performed on these micro-clusters, which increases performance.

Given the theoretical background explained in the previous sections, we expected our algorithm to outperform its competitors in clustering quality. Its strongest capability is defining arbitrarily shaped clusters, and the experimental studies support the theoretical expectations, as shown in the tables and figures. In addition, it can process high-dimensional datasets and is robust against outliers, as demonstrated in the experiments. However, processing high-dimensional datasets can degrade its runtime performance.

9 Conclusion and future works

In this study, we propose a new stream clustering algorithm that can detect arbitrary-shaped clusters, is robust to outliers, is fast, and achieves high clustering success. For this purpose, the KD-Tree is used to define micro-clusters, and MST is applied to the defined micro-clusters to form the final clusters. The proposed algorithm was subjected to various experimental studies to test its effectiveness in terms of clustering success and runtime, and was compared with the DenStream, DBSTREAM, and KD-AR Stream algorithms. According to the experimental studies, the proposed algorithm produced better results than its competitors in terms of clustering success and is also very fast. In the literature comparison, our algorithm achieved the highest clustering success, with an ARI value of 0.9812, on the MrData dataset containing 10% outliers. It also achieved the highest Purity value of 0.9780 and the second-highest ARI value of 0.7509 on the KDD dataset, which is a large dataset. In general, highly competitive results were obtained on all datasets. We conclude that the proposed algorithm is more successful than the compared algorithms in clustering quality, outlier detection, and defining arbitrary-shaped clusters, while keeping runtime very low. However, high-dimensional datasets may slightly degrade its runtime performance. Therefore, in future work, we plan to integrate feature selection or feature reduction methods to further improve the runtime of MCMSTStream.