Introduction

Nowadays, data are generated continuously from different sources, such as sensors, web-browsing activities, network routers, etc. These continuously flowing data tend to be of very large volume, and the patterns prevailing in them keep changing with time. These patterns represent the behaviour of the underlying source from which the data are generated. This form of data, where new patterns keep evolving and the size of the data is ever increasing, is termed a data stream. Traditional data-mining techniques, which assume that the entire dataset is available for processing at once and that its behaviour is static, perform poorly on data streams. Therefore, designing data-mining techniques specifically for data streams is the need of the hour.

Clustering is a data-mining technique which groups data objects in such a way that intra-group similarity between objects is maximised and inter-group similarity is minimised. In the context of data streams, the clustering problem can be formulated using two different approaches. In the first approach, the data examples of a single data stream are clustered into different groups, whereas in the second approach, the data streams themselves are clustered into different groups. The first approach is referred to as clustering by example and the second as clustering by variable. Overall, in clustering by example, the main focus is on profiling the relationship between different data instances of a single data stream, whereas in clustering by variable, the main focus is on profiling the relationship between multiple data streams. The clustering by variable approach has its own set of real-world applications, for example, providing targeted ads for a group of customers based on their product purchase and browsing histories, grouping users based on their preferred genres of music to provide similar music suggestions, and grouping web-users based on their web-browsing behaviours for promotional, endorsement, or bandwagon advertisement.

Fig. 1: Clustering by example approach

Fig. 2: Clustering by variable approach

The clustering by example and clustering by variable approaches are shown diagrammatically in Figs. 1 and 2, respectively. In these figures, \(O_{i,j}\) = {\(a_{i,1}, a_{i,2}, \ldots , a_{i,n}\)} is a data object containing n attributes which arrived at the \(j\text {th}\) time instant in the \(i\text {th}\) data stream. In Fig. 1, two clusters have been created using the clustering by example approach, where cluster 1 = {\(O_{1,j-2}, O_{4,j+1}\)} and cluster 2 = {\(O_{2,j-1}\), \(O_{3,j}\), \(O_{5,j+2}\)}. In clustering by example, each data stream maintains its individuality in terms of clusters; an application may involve only one relevant data stream or many. In the clustering by variable approach, it can be seen from Fig. 2 that two clusters have been formed by grouping the m data streams (\(S_{i}\)’s), such that cluster 1 = {\(S_1\), \(\ldots \), \(S_m\)} and cluster 2 = {\(S_2\), \(\ldots \), \(S_{m-1}\)}.

On surveying the literature, it has been found that most of the existing works address the problem of clustering by example [1,2,3,4, 7, 8, 10, 16, 17, 19, 21, 23, 25, 27, 30, 33], and only a few attempts have been made towards the problem of clustering by variable [5, 6, 9, 11, 22, 28, 29]. Furthermore, these works are mainly suitable for numeric data streams. However, in several applications, other types of data also arrive in the form of data streams, such as nominal data, text, web-data, etc. Nominal attributes have no ordering between their values. Converting nominal attributes to numeric attributes, merely so that clustering approaches designed for numeric data can be used, introduces an artificial ordering between the values. Hence, operations performed on such converted attributes are not meaningful and lead to misleading results.

Based on the literature survey, no work has been presented for clustering multiple nominal data streams. Therefore, there is a scope and need to develop clustering techniques for multiple data streams of nominal and other types of data.

In the present work, a hierarchical clustering by variable technique for multiple nominal data streams has been proposed. It is an integrative technique in the sense that it employs cosine distance for measuring the dissimilarity between data streams and entropy for computing the degree of disparity within a cluster. A cluster's disparity is the extent to which the data observations in it show disorderedness or randomness, with higher values indicating higher disorderedness. To deal with the continuously flowing nature of the data streams, the proposed technique processes the data incrementally, where the increment interval is equal to the size of the sliding window. Furthermore, it adapts the hierarchical structure of clusters by splitting and/or merging clusters to incorporate the evolving behaviour of data streams, where new concepts keep arriving and old ones may fade out.

The performance of the proposed technique has been analysed on synthetic datasets as well as a real-world dataset and compared to Agglomerative Nesting (AGNES) clustering technique in terms of four clustering validation measures, viz., Dunn Index (DI) [24], Modified Hubert \(\varGamma \) statistic (MH\(\varGamma \)) [24], Cophenetic Correlation Coefficient (CPCC) [14], and Purity [31].

The main contributions of the proposed work have been given below:

  • A method has been proposed for clustering multiple nominal data streams using a hierarchical clustering by variable approach.

  • The proposed method is able to handle the concept drifts in the data streams through the merge/split operations of the nodes in the hierarchical clustering structure.

The remainder of the paper is organised as follows. In the section “Literature review”, the literature review is presented. In the section “Problem formulation and preliminaries”, the problem formulation and preliminaries are given, followed by the presentation of the proposed method in the section “Proposed technique”. In the section “Datasets and performance measures”, the datasets and performance measures are presented. In the section “Experimental results and analysis”, the experimental results are discussed. Finally, in “Conclusion”, concluding remarks are made.

Literature review

Over the years, many techniques for clustering data streams have been proposed. Among them, this section discusses the clustering by variable techniques, which are the focus of the current work.

Dai et al. [11] have presented a clustering on demand (COD) framework which works in two phases. The online phase stores statistics for the incoming data observations in terms of sliding windows, whereas the offline phase uses the stored statistics to generate clusters. The advantage of the COD framework lies in its ability to process the data observations from multiple data streams in a single pass and to offload the actual clustering process to the offline phase, thus staying true to the constraints of the data stream scenario. Balzanella et al. [5] have proposed a graph-based technique for clustering multiple data streams which collects data observations from the data streams in terms of sliding windows and creates summaries out of them. It maintains an undirected graph whose adjacency matrix stores the similarity between the data streams and is updated on every new window of data by applying the Dynamic Clustering Algorithm [12]. The final clustering structure of the data streams is obtained by applying a partition-based clustering technique on the summaries stored online.

Ling et al. have proposed a spectral component-based clustering technique for clustering multiple data streams called SPE-cluster [9]. Here, the data from the data streams are taken in sequential non-overlapping sliding windows, where in each window, the data sequences of the respective data streams are represented as the sum of spectral components. This technique accounts for the lag-correlation between the data streams while computing their similarity, which is ignored in other data stream clustering techniques that use Euclidean distance. This technique also works in online–offline phases. In the online phase, it calculates the spectral components of the data streams, while the offline phase employs dynamic k-means for clustering the most recent sliding window. In [29], the author has proposed a Kendall correlation-based clustering technique for multiple data streams. Here, the sliding window technique is used to gather data observations from the incoming data streams. For clustering the data streams, it uses a modified k-means algorithm which can adjust the number of clusters to reflect the evolving changes in the data streams.

Bones et al. [6] proposed a data stream clustering technique which clusters similar data streams based on the correlation of the attribute values. It uses a sliding window technique, and for each window, a fractal value is calculated in a fractal dimension, which is a reduced dimension of the original dimension of the data streams. This fractal value represents the correlation of the attribute values from the original dimension and is found to cluster the data streams better. Laurinec and Lucka [22] have proposed ClipStream. This technique consists of two phases, an online (data abstraction) phase and an offline phase. In the data abstraction phase, the data from the data streams are processed window-wise and a reduced feature vector called FeaClip is constructed from the original feature space, thus representing a clipped version of the data streams. The clipped representation captures two behaviours of a data stream: its global statistics and its local behaviour. In the offline phase, clustering is performed using the k-medoid clustering technique. Since offline clustering is time-consuming, the change detection module of ClipStream executes only when the data streams evolve.

Online Divisive-Agglomerative Clustering (ODAC) [28] was proposed by Rodrigues et al. It is a hierarchical clustering approach for multiple data streams which creates a hierarchy of tree nodes. In this technique, each node of the hierarchical tree comprises data streams, and the leaf nodes represent the clusters. For handling concept evolution, the nodes of the hierarchical tree are split and/or merged. The decision for either splitting or merging is made based on the diameter of the cluster and the Hoeffding bound [20]. This method is suitable only for numerical data streams, as the entire clustering process is based on Pearson’s correlation coefficient [26], which is used as the similarity measure.

It can be observed that over approximately 15 years, very few works have been presented under the clustering by variable category, and those mainly target numerical data streams. Moreover, nominal values have no inherent order and are not quantitative [18]. Therefore, converting nominal values to numeric values does not make sense; any attempt to perform mathematical operations on nominal attributes after converting them to numerical attributes will not be meaningful. For example, a nominal attribute colour may have values red, green, blue, etc. Assigning numerical values to them, for example, red=1, green=2, and blue=3, will not make any sense, since the values for colours are not quantitative. Hence, finding the mean, median, or any other statistic on such numerical representations of nominal values will not be meaningful. In the present work, we have proposed a hierarchical clustering technique for multiple nominal data streams. The main difference between ODAC [28] and the clustering technique proposed in the present work lies in the similarity measure used, its computation, and the type of data that each method can handle. The technique proposed in the current paper is targeted at multiple nominal data streams, whereas ODAC [28] is focused explicitly on numerical data streams and is not suitable for nominal data streams.

Problem formulation and preliminaries

In this section, the problem of clustering by variable for multiple data streams has been introduced. Furthermore, the processes for calculating the dissimilarity measures between the data streams and the entropy values for the clusters have been discussed. Also, the notations used throughout the paper have been described.

Problem formulation

A data stream consists of data observations produced at different time instances. Let DS = \(\{S_1,\) \(S_2,\) \(\ldots ,\) \(S_i,\) \(\ldots ,\) \(S_m\}\) represent the set of data streams, where \(S_i\) is the \(i\text {th}\) data stream in the set DS, which comprises m data streams in total. Each \(S_i\) = \(\{o_{i,1}\), \(o_{i,2}\), \(\ldots \), \(o_{i,j}\), \(\ldots \), \(o_{i,\infty }\}\), where \(o_{i,j}\) is a data observation observed at the \(j\text {th}\) time instance (\(t_j\)) belonging to the \(i\text {th}\) data stream (\(S_i\)). The clustering of multiple data streams using the clustering by variable approach aims to group together those data streams which produce similar observations over time. However, additional challenges need to be addressed in clustering data streams, such as the continuous arrival of data. Hence, in the proposed work, a snapshot of the data streams' data is taken to handle this ever-increasing size. A data snapshot is extracted using a sliding window technique, as shown in Fig. 3, and processed. In Fig. 3, \(W_{\bar{k}}\) is the \(\bar{k}\text {th}\) sliding window containing examples in the time frame \(t_{j-w+1}\) to \(t_{j}\) from m data streams, where w is the size of the sliding window. Furthermore, there may be concept evolution as new data observations keep being added to the data streams, which requires updating the clustering structure generated using the previous sliding window's data. In the proposed work, the update in the clustering structure is handled by allowing merge and/or split operations on clusters (for details, refer to “Merge sub-module” and “Split sub-module”).

Fig. 3: Example of a sliding window (\(W_{\bar{k}}\)) with window size (w) equal to 5 operating on data streams \(S_1\) to \(S_m\) through time \(t_{j-2}\) to time \(t_{j+2}\)

Hierarchical clustering requires no prior information on the number of clusters and maintains a hierarchical tree of clusters at different levels. Each node in the hierarchical tree represents a cluster. Except for the leaf nodes, every cluster is the union of its child clusters. The hierarchical tree can be cut at any level to obtain a set of clusters. Agglomerative and divisive are the two strategies for generating a hierarchical clustering. The first follows a bottom–up approach, starting with singleton clusters and iteratively merging them to form larger clusters. The second follows a top–down strategy, starting with a single cluster comprising all the data observations and iteratively splitting it into smaller clusters. The merge and split operations are final in the case of traditional hierarchical clustering.

In the context of clustering multiple data streams, the hierarchical clustering structure needs to be updated over time. Updating becomes necessary due to the evolution of new concepts in streaming data and may involve a combination of both split and merge operations based on certain cluster parameters.

In the present work, a hierarchical clustering technique for multiple nominal data streams has been proposed. The proposed technique integrates both agglomerative and divisive strategies for updating the hierarchical structure on the arrival of a window of new data observations based on the sliding window technique. Furthermore, it employs cluster entropy as a parameter for deciding whether to split/merge or not to split/merge the clusters. Under the proposed technique, the cluster results can be viewed and analysed at any time, depending on the user’s requirement.

Notations

This section describes the notations used throughout the paper.

  • Let \(DS = \{S_i, 1\le i \le m \}\) where

    • DS is the set of data streams.

    • \(S_i\) is the \(i\text {th}\) data stream in the set DS.

    • m is the number of data streams in the set DS.

  • Let \(N = \{C_i, 1\le i \le \mid N\mid \}\) where

    • N is the set of nodes in the hierarchical tree.

    • \(C_i\) is the \(i\text {th}\) node in the set N.

    • \(C^l_i\) is the \(i\text {th}\) leaf node in the hierarchical tree and \(C^l_i \in N\).

    • \(\mid N\mid \) is the number of nodes in the set N.

    • \(\mid K \mid \) is the number of leaf nodes in the set N.

  • \(\mid C^l_i \mid \) is the number of data streams in the leaf node \(C^l_i\).

  • \(C^p_i\) is the immediate parent node of \(C_i^l \) in the hierarchical tree and \(C^p_i \in N\).

  • \(D_r\) is the latest data snapshot.

  • w is the size of the sliding window.

  • \(d_{init}\) is the number of initial sliding windows.

  • \(E_r\) = {\(e^i_r, 1\le i \le \mid N\mid \)} where

    • \(e^i_r\) is the entropy of the \(i\text {th}\) node (\(C_i\)) in the hierarchical tree for the \(r\text {th}\) data snapshot.

Computation of dissimilarity between data streams

In the proposed technique, the dissimilarity between any two data streams is calculated by applying cosine distance over a data snapshot extracted corresponding to a sliding window, as discussed below.

Table 1 Selected data streams
Table 2 Frequency matrix for the selected data streams
  • Step 1: A data snapshot is extracted from the data streams. Next, two data streams are selected whose distance is to be calculated (say \(S_i\) and \(S_j\)), as shown in Table 1.

  • Step 2: Next, a frequency matrix (F) for \(S_i\) and \(S_j\) is created, as shown in Table 2. In this frequency matrix, \(f_{i,k}\) and \(f_{j,k}\) are the frequencies of occurrence of the value \(o_k\) in \(S_i\) and \(S_j\), respectively, and \(1\le k \le \bar{w}\), where \(\bar{w}\) is the number of unique values occurring in \(S_i\) and \(S_j\).

  • Step 3: Finally, the cosine distance is calculated as given in Eq. (1); a small illustrative sketch of Steps 1–3 follows this list

    $$\begin{aligned} \text {cosine}(S_i, S_j) = \dfrac{\sum _{k=1}^{\bar{w}} f_{i,k} \times f_{j,k}}{\sqrt{\sum _{k=1}^{\bar{w}} f_{i,k}^2} \times \sqrt{\sum _{k=1}^{\bar{w}} f_{j,k}^2}}. \end{aligned}$$
    (1)
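
The following is a minimal sketch of Steps 1–3 for two windows of nominal values; the function name, the use of collections.Counter, and the sample values are assumptions introduced here for illustration and are not part of the original specification.

```python
from collections import Counter
from math import sqrt

def cosine(stream_i, stream_j):
    """Cosine measure of Eq. (1) between two windows of nominal values.

    Each window is turned into a frequency vector over the union of unique
    values appearing in either window (the frequency matrix F of Step 2).
    """
    freq_i, freq_j = Counter(stream_i), Counter(stream_j)
    values = set(freq_i) | set(freq_j)            # unique values o_1 ... o_w-bar
    dot = sum(freq_i[v] * freq_j[v] for v in values)
    norm_i = sqrt(sum(freq_i[v] ** 2 for v in values))
    norm_j = sqrt(sum(freq_j[v] ** 2 for v in values))
    return dot / (norm_i * norm_j)

# Illustrative windows of website visits (nominal values)
s_i = ["news", "mail", "news", "shop"]
s_j = ["mail", "news", "news", "blog"]
print(cosine(s_i, s_j))
```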

Computing entropy of clusters

For calculating the entropy of a cluster, the following steps are followed.

  • Step 1: Let a cluster \(C_i\) comprise a data streams, where each data stream comprises b data observations. All these data observations are stored in a matrix \(A^i\) of dimension (\(a\times b\)).

  • Step 2: Next, the unique values in \(A^i\) are extracted and stored in a vector \(U^i\) whose size is equal to the number of unique values appearing in \(A^i\).

  • Step 3: For each unique value stored in \(U^i\), its corresponding count of occurrence in \(A^i\) is taken and stored in a vector \(\bar{U}^{i}\) of size \(\mid {U^i}\mid \).

  • Step 4: Finally, the entropy (\(e^i\)) for the \(i\text {th}\) cluster (\(C_i\)) is calculated as shown in Eq. (2)

    $$\begin{aligned} e^i = - \sum _{k=1}^{\mid \bar{U}^{i}\mid } \left( \dfrac{\bar{U}^{i}_k}{\sum _{j=1}^{\mid \bar{U}^{i}\mid }\bar{U}^{i}_j} \log \dfrac{\bar{U}^{i}_k}{\sum _{j=1}^{\mid \bar{U}^{i}\mid }\bar{U}^{i}_j}\right) . \end{aligned}$$
    (2)

For a node containing only a single data stream, the entropy value is set to zero. However, the entropy value for a node containing two or more data streams can range from zero to \(\log \mid U^i\mid \).
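
As an illustration, a minimal sketch of Steps 1–4 in Python is given below; the function name and the sample cluster are assumptions introduced for clarity.

```python
from collections import Counter
from math import log

def cluster_entropy(cluster_matrix):
    """Entropy of a cluster as in Eq. (2).

    cluster_matrix: one row per data stream in the cluster (the matrix A^i),
    each row holding the nominal observations currently assigned to the node.
    """
    counts = Counter(v for row in cluster_matrix for v in row)   # vector U-bar^i
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())

# A cluster (node) holding two data streams; by the paper's convention a node
# with a single data stream is simply assigned entropy zero.
cluster = [["news", "mail", "news"], ["mail", "mail", "shop"]]
print(cluster_entropy(cluster))   # value lies between 0 and log(|U^i|)
```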

Proposed technique

The overall working of the proposed method is shown in Fig. 4 with the help of a flowchart and explained in the subsequent sections. It comprises two main modules, viz., the initialisation module and the update module, the latter containing the accommodation, merge, and split sub-modules. These modules are highlighted in Fig. 4 with the help of dotted lines. The initialisation module executes only once to create the initial hierarchical clustering structure, whereas the update module keeps repeating as new data snapshots arrive. From Fig. 4, it can be seen that the proposed method first executes the sub-modules in the initialisation module, followed by the sub-modules in the update module. In the initialisation module, the proposed method acquires the initial data snapshot from the data streams. This initial data snapshot is then used to create an initial hierarchical clustering structure, after which the entropies for the nodes of the resulting structure are calculated. The proposed method then executes the update module, where the next data snapshot is incorporated into the existing hierarchical clustering structure. Again, the entropies are re-calculated for the nodes of the hierarchical tree using the newly acquired data snapshot. The hierarchical clustering structure is then tested for modification when the changes in the node entropies are significant. The changes to the hierarchical clustering structure are made through merge/split operations. The above process for the update module is then re-iterated for the subsequent data snapshots.

Fig. 4: Working of the proposed technique

Data snapshot

A data snapshot from the data streams is obtained using the sliding window technique, which extracts w data objects from each of the m data streams. In the proposed work, w is the size of the sliding window. Hence, a data snapshot is a matrix of size (\(m\times w\)) containing data objects belonging to the m data streams. An example of an \(i\text {th}\) data snapshot (\(D_i\)) is shown in Table 3. Each row in \(D_i\) represents a user generating a data stream of values, namely the websites visited by the respective user. The size of \(D_i\) is (\(5\times 4\)), as there are five users and four visited websites per user.

Table 3 Example of an \(i\text {th}\) data snapshot (\(D_i\))

The data stream processing in the proposed work has been done data snapshot wise. It helps in handling the ever-increasing size of the data streams as it is not possible to make the entire data streams available in one go for processing.
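
A minimal sketch of the snapshot extraction is given below; in practice the streams are unbounded and would be read incrementally, but finite lists are used here for illustration, and the names and values are assumptions.

```python
def data_snapshot(streams, start, w):
    """Return an (m x w) data snapshot: w observations from each of m streams,
    beginning at position `start` (the left edge of the sliding window)."""
    return [list(stream[start:start + w]) for stream in streams]

# Illustrative example mirroring Table 3: five users (streams), window size 4.
streams = [
    ["news", "mail", "news", "shop", "blog"],
    ["shop", "shop", "mail", "news", "news"],
    ["blog", "news", "blog", "blog", "mail"],
    ["mail", "mail", "news", "shop", "shop"],
    ["news", "blog", "shop", "mail", "news"],
]
D_i = data_snapshot(streams, start=0, w=4)   # a 5 x 4 matrix of nominal values
print(D_i)
```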

Initialisation module: initialisation of the hierarchical clustering structure

The main essence of the initialisation step is to capture the clustering structure prevailing over some initial data snapshots, so that it can be used as a foundation structure and updated as more data keep streaming in. For creating the initial hierarchical clustering structure, the data corresponding to \(d_{\text {init}}\) initial data snapshots have been used. Hence, in total, \(d_{\text {init}}\times w\) data observations from each of the m data streams are used for constructing the initial hierarchical clustering structure. The creation of the hierarchical tree structure using the data of \(d_{\text {init}}\) data snapshots is as given below:

  • Step 1: For the construction of the hierarchical clustering structure, Agglomerative Nesting (AGNES) has been used along with the average linkage method as a measure for merging clusters and cosine distance as a dissimilarity measure. In the hierarchical clustering structure, each node represents a cluster consisting of one or more data streams.

  • Step 2: Furthermore, the entropy corresponding to each of the node in the hierarchical clustering structure is computed as discussed in the section “Computing entropy of clusters”.

  • Step 3: Next, the entropy corresponding to each level of the hierarchical tree is calculated. The entropy of a level is defined as the average of the entropies of nodes available at a particular level as given in Eq. 3

    $$\begin{aligned} \bar{E}_q = \dfrac{1}{\mid q\mid }{\sum _{j=1}^{\mid q\mid }} e^{j}_{q}. \end{aligned}$$
    (3)

    In the above equation, \(\bar{E}_q\) represents the average entropy of the nodes at the \(q\text {th}\) level of the hierarchical tree, \(e^{j}_{q}\) represents the entropy of the \(j\text {th}\) node belonging to the \(q\text {th}\) level, and \(\mid q\mid \) represents the total number of nodes at the \(q\text {th}\) level of the hierarchical tree.

  • Step 4: The hierarchical tree is cut with the help of an elbow method depending upon the change in the entropy value from one level to another. If the maximum decrement in the entropy value occurs from the \((q-1)\text {th}\) to the \(q\text {th}\) level, then a cut is marked below the \(q\text {th}\) level, and the nodes at the \(q\text {th}\) level are considered as leaf nodes (a small sketch of this cut is given after this list). This step prunes out the part of the hierarchical tree where the changes in the entropy from level to level are only marginal.

  • Step 5: The hierarchical clustering structure after being cut is then used as a base hierarchical clustering structure for processing later incoming data snapshots as discussed in the next section.
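
A minimal sketch of the entropy-based elbow cut of Steps 3–4 follows; the function name and the level-entropy values are hypothetical and only illustrate the decision rule.

```python
def cut_level(level_entropies):
    """Choose the cut level of the initial hierarchical tree (Steps 3-4).

    level_entropies[q] is the average node entropy at level q (Eq. 3), with
    the root at q = 0.  The tree is cut below the level showing the largest
    decrease in average entropy relative to the previous level.
    """
    drops = [level_entropies[q - 1] - level_entropies[q]
             for q in range(1, len(level_entropies))]
    return drops.index(max(drops)) + 1   # nodes at this level become the leaves

# Hypothetical average entropies for the levels of an initial tree
print(cut_level([2.10, 1.90, 1.20, 1.10, 1.05]))   # -> 2
```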

Update module: updating the initial hierarchical clustering structure

In the case of data streams, concept evolution may occur over time. The clustering structure as per the concept evolution is updated through merge and split operations. After the initial hierarchical clustering structure has been created as explained in the initialisation module (refer to the section “Initialisation module: initialisation of the hierarchical clustering structure”), the update module processes the next incoming data snapshots one after another. The update module mainly intends to update the clustering structure with incoming data snapshots, so that the concept evolving nature of data streams can be reflected in clustering. The update module comprises three sub-modules, viz., accommodation sub-module, merge sub-module, and split sub-module. The functioning of these three sub-modules has been discussed next.

Accommodation sub-module

On receiving a new data snapshot, say the \(i\text {th}\) data snapshot (\(D_i\)), this sub-module assigns the data instances from the respective data streams of \(D_i\) to the leaf nodes (clusters) in which the corresponding data streams lie, and discards the data instances of the \(D_{i-1}\) data snapshot from the leaf nodes. However, it keeps the entropy value of each node \(C_i\) computed on the \(D_{i-1}\) data snapshot for further processing.

Merge sub-module

The merge operation in the proposed method handles two cases, i.e., merge case-I and merge case-II. In merge case-I, clusters containing a single data stream are merged, and in merge case-II, clusters containing at least two data streams are merged. The merging operation is performed based on the difference in the entropy of the parent and child node. However, for a cluster having a single data stream, the entropy of the cluster is zero, so it is not logical to decide on the merge operation based on the difference in the entropy of the parent and child node. Hence, to address this exceptional scenario, merge case-I is used. As per the proposed framework, merge case-I is tested first, followed by the second case (merge case-II); a small sketch of both merge tests is given after Fig. 6.

  • Merge case-I: For all the leaf nodes(clusters) containing a single data stream (\(C_{s}^l\)), the steps below are executed:

    • Step-1: Calculate the average entropy of the entire hierarchical clustering structure, denoted by \(\xi _r\). \(\xi _r\) is defined as the average of the entropies of all the nodes in the hierarchical tree for the \(r\text {th}\) data snapshot (\(D_r\)), as shown in Eq. (4)

      $$\begin{aligned} \xi _r = \dfrac{1}{\mid N \mid }\sum _{i=1}^{\mid N \mid } e_r^i . \end{aligned}$$
      (4)
    • Step-2: Using the average linkage method, calculate the distance between \(C_{s}^l\) and other leaf nodes (\(C_i^l\)) to identify the target leaf node (say \(C_{\text {target}}^l\)) which is closest to \(C_{s}^l\) in terms of distance.

    • Step-3: On finding \(C_{\text {target}}^l\), the sibling and intermediate nodes to \(C_{s}^l\) and \(C_{\text {target}}^l\) are iteratively merged starting from the bottom of the hierarchical tree until both \(C_{s}^l\) and \(C_{\text {target}}^l\) get merged into a common immediate parent node, as shown in Fig. 5.

    • Step-4: The average entropy of the modified hierarchical clustering structure (generated by step-3), denoted by \(\acute{\xi }_r\), is calculated and compared with the average entropy of the original (unmodified) hierarchical clustering structure, represented by \(\xi _r\), for the \(r\text {th}\) data snapshot (\(D_r\)), as shown in Eq. (5). If Eq. (5) is satisfied, then the modified hierarchical clustering structure is used for further processing; otherwise, the modification is considered null and void and the original hierarchical clustering structure is used for further processing

      $$\begin{aligned} \acute{\xi }_r < \xi _r . \end{aligned}$$
      (5)
  • Merge case-II: It focuses on all those leaf nodes that comprise more than one data stream. The steps below are executed for all such leaf nodes.

    • Step-1: Calculate entropy for the leaf node \(C_i^l\). Let the entropy of \(C_i^l\) for the \(r\text {th}\) data snapshot be represented by \({\dot{e}}^i_r\). Let \(C_i^p\) be the immediate parent node of the leaf node \(C_i^l\). Let the entropy of \(C_i^p\) for the \((r-1)\text {th}\) data snapshot be represented by \(\ddot{e}^i_{r-1}\).

    • Step-2: If the entropy of the leaf node (\({\dot{e}}^i_r\)) exceeds its immediate parent node’s entropy (\(\ddot{e}^i_{r-1}\)) by an amount greater than or equal to \(\epsilon \), as shown in Eq. (6), then the leaf node and its sibling nodes are merged into its immediate parent node, as shown in Fig. 6. Following the merging of \(C_i^l\) and its sibling nodes into their immediate parent node \(C_i^p\), the node \(C_i^p\) becomes a new leaf node

      $$\begin{aligned} ({\dot{e}}^{i}_{r} - \ddot{e}^{i}_{r-1}) \ge \epsilon . \end{aligned}$$
      (6)

      Step-2 is based on the fact that if Eq. (6) is satisfied, then the data streams in \(C_i^l\) exhibit a high variation relative to one another, as compared to the data streams in \(C_i^p\), due to the data instances in the \(r\text {th}\) data snapshot (\(D_r\)) versus those in the \((r-1)\text {th}\) data snapshot (\(D_{r-1}\)). High entropy of the leaf node (\(C_i^l\)) in comparison to its immediate parent node (\(C_i^p\)) represents a distorted hierarchical clustering structure requiring a merge operation for rectification. Moreover, in step-2, \(\epsilon \) is the threshold used to decide whether or not to merge the leaf node (\(C_i^l\)) and its sibling nodes into their parent node (\(C_i^p\)). The value of \(\epsilon \) has been decided based on the Hoeffding bound as discussed in [28] and further explained in the section “Threshold (\(\epsilon \))”.

Fig. 5: Example of merge case-I

Fig. 6: Example of merge case-II
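
The two merge tests can be summarised with a short sketch; the function names and the example entropy values below are illustrative assumptions, not values from the paper.

```python
def avg_tree_entropy(node_entropies):
    """Average entropy over all nodes of a hierarchical tree (xi_r, Eq. 4)."""
    return sum(node_entropies) / len(node_entropies)

def accept_case1_merge(original_entropies, modified_entropies):
    """Merge case-I acceptance test (Eq. 5): keep the modified tree only if
    its average entropy is strictly lower than that of the original tree."""
    return avg_tree_entropy(modified_entropies) < avg_tree_entropy(original_entropies)

def should_merge_case2(leaf_entropy_r, parent_entropy_prev, epsilon):
    """Merge case-II test (Eq. 6): merge the leaf and its siblings into the
    parent when the leaf's entropy on D_r exceeds the parent's entropy on
    D_{r-1} by at least epsilon."""
    return (leaf_entropy_r - parent_entropy_prev) >= epsilon

# Hypothetical node entropies before and after a tentative case-I merge
print(accept_case1_merge([0.0, 1.6, 1.7, 1.9], [1.1, 1.2, 1.3]))   # True
print(should_merge_case2(1.40, 1.05, epsilon=0.2))                 # True
```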

Split sub-module

The split operation is executed for the \(r\text {th}\) data snapshot (\(D_r\)) once the merge operation is completed for the same data snapshot. The split operation in the proposed method handles two cases, i.e., split case-I and split case-II. Split case-I is used only when the hierarchical clustering structure contains a single node following one or more merge operations; split case-II is executed in all other scenarios. Split case-I captures the exceptional scenario where a complete restructuring of the hierarchical clustering structure is required due to changes in the underlying concepts. A small sketch of the node-splitting procedure is given at the end of this sub-section.

  • Split case-I: For splitting a hierarchical clustering structure containing only a single cluster (say \(C_{s}^l\)), the following steps are taken:

    • Step-1: Calculate entropy for the single cluster (\(C_{s}^l\)) and let its entropy be represented by \({\dot{e}}^{s}_r\).

    • Step-2: Next, the two most dissimilar data streams in \(C_{s}^l\) in terms of cosine distance are found. Let the two most dissimilar data streams be denoted by \(S_a\) and \(S_b\), respectively.

    • Step-3: Create two child nodes for \(C_{s}^l\). In one of the two newly created child nodes \(S_a\) is added, whereas in the other child node, \(S_b\) is added.

    • Step-4: For each data stream (say \(S_i\)) in \(C_{s}^l\), excluding \(S_a\) and \(S_b\), the cosine distance between \(S_i\) and \(S_a\), and \(S_i\) and \(S_b\) is calculated. \(S_i\) is then added to the child node containing \(S_a\) if its cosine distance to \(S_a\) is the least, else \(S_i\) is added to the leaf node containing \(S_b\).

    • Step-5: Calculate average entropy for the entire hierarchical clustering structure obtained after step-4 which is defined as the average of the summation of entropies of all nodes in the hierarchical tree denoted by \(\xi _r\) for the \(r\text {th}\) data snapshot (\(D_r\)), as shown in Eq.  (7)

      $$\begin{aligned} \xi _r = \dfrac{1}{\mid N\mid }\sum _{i=1}^{\mid N \mid } e^i_r. \end{aligned}$$
      (7)
    • Step-6: Check the difference between the average entropy of the hierarchical clustering structure containing only the single cluster, \({\dot{e}}^s_r\), and that of the modified hierarchical clustering structure obtained after step-4, \(\xi _r\). If Eq. (8) is satisfied, then the modified hierarchical clustering structure is used for further processing; otherwise, the modification is considered null and void and the original hierarchical clustering structure is used for further processing

      $$\begin{aligned} (\xi _r - {\dot{e}}^s_r) \le \epsilon . \end{aligned}$$
      (8)
  • Split case-II: For splitting a hierarchical clustering structure containing two or more clusters, the following steps are taken:

      • Step-1: Calculate the entropy for the leaf node \(C_i^l\). Let the entropy of \(C_i^l\) for the \(r\text {th}\) data snapshot be represented by \({\dot{e}}^i_r\), and let \({\dot{e}}^i_{r-1}\) represent the entropy of \(C_i^l\) for the \((r-1)\text {th}\) data snapshot, which has already been calculated and stored as discussed in the section “Accommodation sub-module” and does not need to be calculated again.

      • Step-2: If \({\dot{e}}^i_r\) is greater than or equal to \({\dot{e}}^i_{r-1}\) by \(\epsilon \), as shown in Eq. (9), then the leaf node (\(C_i^l\)) is split, as shown in Fig. 7

      $$\begin{aligned} ({\dot{e}}^i_r - {\dot{e}}^i_{r-1}) \ge \epsilon . \end{aligned}$$
      (9)

      The steps for splitting \(C_i^l\) have been detailed below:

      • Step-2a: First, the two most dissimilar data streams in \(C_i^l\) in terms of cosine distance are found. Let the two most dissimilar data streams be denoted by \(S_a\) and \(S_b\), respectively.

      • Step-2b: Next, two child nodes for \(C_i^l\) are created. In one of the two newly created child nodes, \(S_a\) is added, whereas in the other child node, \(S_b\) is added.

      • Step-2c: For each data stream (say \(S_i\)) in \(C_i^l\), excluding \(S_a\) and \(S_b\), the cosine distance between \(S_i\) and \(S_a\), and \(S_i\) and \(S_b\) is calculated. \(S_i\) is then added to the child node containing \(S_a\) if its cosine distance to \(S_a\) is the least, else \(S_i\) is added to the leaf node containing \(S_b\).

      Fig. 7: Example of a node split

      The splitting of a leaf node (\(C_i^l\)) on satisfying Eq. (9), as described in step-2 above, implies that the data streams in \(C_i^l\) exhibit a high variation relative to one another due to the data observations in the \(r\text {th}\) data snapshot (\(D_r\)) as compared to when \(C^l_i\) held data observations from the \((r-1)\text {th}\) data snapshot (\(D_{r-1}\)). High entropy of the leaf node (\(C_i^l\)) for the \(r\text {th}\) data snapshot (\(D_r\)) indicates incompatibility between the data streams in \(C_i^l\), hence requiring a split operation for rectification.
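
Below is a minimal sketch of the splitting procedure (Steps 2a–2c); the dissimilarity measure is passed in as a parameter (cosine distance in the proposed technique), and the function names and toy data are assumptions for illustration.

```python
from itertools import combinations

def split_leaf(cluster, dissimilarity):
    """Split a leaf node into two child nodes (Steps 2a-2c).

    cluster: dict mapping stream ids to their current window of observations;
    dissimilarity: callable giving the distance between two windows.
    """
    # Step 2a: the two most dissimilar data streams become the seeds.
    a, b = max(combinations(cluster, 2),
               key=lambda p: dissimilarity(cluster[p[0]], cluster[p[1]]))
    child_a, child_b = {a: cluster[a]}, {b: cluster[b]}
    # Steps 2b-2c: every remaining stream joins the child of the closer seed.
    for sid, window in cluster.items():
        if sid in (a, b):
            continue
        if dissimilarity(window, cluster[a]) <= dissimilarity(window, cluster[b]):
            child_a[sid] = window
        else:
            child_b[sid] = window
    return child_a, child_b

# Toy usage with a Hamming-style dissimilarity standing in for cosine distance
toy = {"S1": ["a", "a"], "S2": ["a", "b"], "S3": ["c", "c"]}
hamming = lambda x, y: sum(u != v for u, v in zip(x, y))
print(split_leaf(toy, hamming))
```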

Threshold (\(\epsilon \))

The threshold (\(\epsilon \)) used in Eqs. (6), (8), and (9) is set using the Hoeffding bound [20]. The Hoeffding bound is a statistical bound which states that, after observing x observations of a random variable v having range R, the actual mean is at least (\(\bar{v}-\epsilon \)) with confidence (1-\(\delta \)), where \(\bar{v}\) is the mean calculated from the observed x observations. The advantage of the Hoeffding bound lies in the fact that it is unaffected by the distribution generating the observations, and it has been widely used for parameter setting [13, 15, 28, 32]. The equation for calculating the Hoeffding bound is given in Eq. (10)

$$\begin{aligned} \epsilon = \sqrt{\dfrac{R^2\ln {(1/\delta )}}{2x}}, \end{aligned}$$
(10)

where

  • \(\delta \) is the margin of error and \(0 < \delta \le 1\).

  • \(\epsilon \) is the threshold decided by the Hoeffding bound.

In the proposed technique, the entropy of a cluster represents whether the data instances within the cluster are in conformity with one another. Higher conformity leads to a lower cluster entropy, whereas lower conformity leads to a higher cluster entropy. A high cluster entropy satisfying the conditions discussed in the sections “Merge sub-module” and “Split sub-module” calls for restructuring of the clustering structure through merge and split operations. The parameters for calculating the threshold (\(\epsilon \)) as given in Eq. (10), for deciding on merging or splitting a cluster in the proposed technique, are as follows (a small sketch of the threshold computation is given at the end of this section):

  • Merge: For merging a cluster (say \(C^l_i\)) whose parent node is \(C^p_i\), the difference between the entropies of \(C^l_i\) and \(C^p_i\) for the \(r\text {th}\) and \((r-1)\text {th}\) data snapshots, i.e., \({\dot{e}}^i_r\) and \(\ddot{e}^i_{r-1}\), is considered as the random variable (v), whose range (R) is as given in Eq. (11)

    $$\begin{aligned} R^{'} = \max \{\ddot{e}^i_{r-1}, {\dot{e}}^i_r\}, \quad R = [-R^{'}, +R^{'}]. \end{aligned}$$
    (11)
  • Split: For splitting a cluster (say \(C^l_i\)), the difference between the entropies of \(C^l_i\) for the \((r-1)\text {th}\) and \(r\text {th}\) data snapshots, i.e., \({\dot{e}}^i_{r-1}\) and \({\dot{e}}^i_r\), is considered as the random variable (v), whose range (R) is as given in Eq. (12)

    $$\begin{aligned} R^{'} = \max \{{\dot{e}}^i_{r-1}, {\dot{e}}^i_r\}, \quad R = [-R^{'}, +R^{'}]. \end{aligned}$$
    (12)

For both the above cases, the size of the data snapshot is taken as the number of data observations (x).
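
A minimal sketch of the threshold computation is shown below. Whether R in Eq. (10) is taken as \(R'\) or as the full width \(2R'\) of the interval \([-R', +R']\) is an interpretation of the notation; this choice is marked as an assumption in the code, and the entropy values used are hypothetical.

```python
from math import log, sqrt

def hoeffding_threshold(R, x, delta=0.05):
    """Threshold epsilon from the Hoeffding bound (Eq. 10)."""
    return sqrt((R ** 2) * log(1.0 / delta) / (2 * x))

# Eqs. (11)/(12): the range is [-R', +R'], with R' the larger of the two
# entropies being compared.  Here R is taken as the width 2 * R' of that
# interval; this is one reading of the paper's notation (assumption).
r_prime = max(1.4, 1.1)            # hypothetical entropy values
print(hoeffding_threshold(R=2 * r_prime, x=100, delta=0.05))
```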

Algorithm for the proposed technique

The algorithm for the proposed technique has been given under Algorithm 1, while the algorithm for the construction of the initial hierarchical clustering structure has been given under Algorithm 2.

Algorithm 1: proposed technique

In line 2 of Algorithm 1, the MAIN procedure calls the INITIALISE procedure in Algorithm 2. The INITIALISE procedure creates the initial hierarchical clustering structure (“Initialisation module: initialisation of the hierarchical clustering structure”) and returns tree, leaves, and \(E_i\), which are the initial hierarchical clustering structure, the leaf nodes in the tree, and the set containing the entropy of each node in the tree, respectively. Next, lines 3–25 are repeated: newer data snapshots are processed, and the necessary actions are taken to handle concept changes. In lines 4–5 of Algorithm 1, the proposed method acquires the next data snapshot (\(D_i\)) from the data streams and places the data instances in \(D_i\) into the respective nodes of the tree (refer to the section “Accommodation sub-module”). In line 6, the entropy calculated for \(D_{i-1}\) is stored in \(E_{i-1}\). In line 7, the entropy of the tree's nodes for \(D_i\) is calculated by calling the getEntropy() method (refer to “Computing entropy of clusters”), which returns \(E_i\).

The algorithm in line 10 merges each cluster (refer to merge case-I of the section “Merge sub-module”) and returns a new set of leaf nodes (leaves) and a modified clustering structure (tree) on satisfying the condition specified in line 9. In lines 13–18, the clusters containing two or more data streams are tested for merging. In line 14, the threshold (\(\epsilon \)) for merging cluster \(C^l\) is calculated by calling the calculateThreshold() method (refer to the section “Threshold (\(\epsilon \))”). For the clusters with more than one data stream satisfying the condition in line 15, the doMerge() procedure is executed (refer to merge case-II of “Merge sub-module”). Similarly, the nodes in the hierarchical tree are tested for splitting. On satisfying the condition in line 21, the proposed method splits the clusters as detailed in the section “Split sub-module”. When the hierarchical clustering structure contains only the root node as a cluster (leaf node), split case-I is executed (refer to split case-I of “Split sub-module”); otherwise, split case-II is performed. The threshold (\(\epsilon \)) is calculated using the calculateThreshold() procedure (refer to the section “Threshold (\(\epsilon \))”). Lines 4–24 are repeated for the entire length of the data streams.


Algorithm 2: initialisation of the hierarchical clustering structure

The INITIALISE procedure under Algorithm 2 takes as input the size of the initial data snapshot to be used for constructing the hierarchical clustering structure, as discussed in the section “Initialisation module: initialisation of the hierarchical clustering structure”. In Algorithm 2, line 2 acquires the initial data snapshot, followed by the construction of the hierarchical clustering structure in line 3. In line 4, the entropy of each node of the hierarchical tree is calculated. In line 5, the hierarchical tree is cut at a level decided as in step-4 of the section “Initialisation module: initialisation of the hierarchical clustering structure”. In line 6, the INITIALISE procedure returns the hierarchical structure (tree), the leaf nodes (leaves) in the tree, and the set containing the entropy of each node in the tree (\(E_i\)).


Time and space complexity

The time and space complexity for the proposed technique has been discussed in the following two sub-sections.

Time complexity

The proposed method consists of two modules: the initialisation module and the update module. The initialisation module is executed only once with a limited number of data snapshots, irrespective of the size of the data streams. Therefore, its computational time has not been included in the further analysis, as it can be understood as a constant factor. The update module works on a single data snapshot at a time in an incremental manner; after processing the current data snapshot, it discards it, keeping only the summary statistics of the data snapshot. Therefore, the time complexity of the update module is computed as follows:

  • The entropy of all the nodes in the hierarchical tree is computed. In the worst case, the computation of the entropy of a node is (mw \(\log \) mw), where m is the number of data streams and w is the size of the data window. In a hierarchical tree, there can be at most (\(2m-1\)) nodes, so the computational time complexity is ((\(2m-1\))\(\times \)(mw \(\log \) mw)), i.e., of the order of (\(m^2w\) \(\log \) mw).

  • The doMerge() operation in line 10 of Algorithm 1 takes \(m^2\) computational time, and since this operation is executed for all the leaf nodes in the hierarchical clustering structure, lines 8–12 use a computational time of the order of (\((2m-1)\) \(m^2\)), i.e., \(m^3\). Lines 13–18 calculate the threshold and perform the doMerge() operation (merge case-II) for each leaf node. The calculateThreshold() and doMerge() methods are each of the order of m. Hence, lines 13–18 take (\((2m-1)\times \) \(m\times \) m), i.e., \(m^3\) computational time. Similarly, lines 19–24, which perform the split operations, take (mw \(\log \) mw) computational time.

Overall, the time complexity for lines 4–24 is ((\(m^2w\) \(\log \) mw) + \(m^3\) + \(m^3\) + (mw \(\log \) mw)). As the values of m and w are constant irrespective of the size of the data streams, this term can be represented as a constant \(\Phi \). If the length of the data stream is h, then the time complexity for processing it is ((h/mw)\(\times \) \(\Phi \)), which can be represented as \({\mathcal {O}}(h)\).

Space complexity

At any point in time, the proposed technique holds the summary statistics of the \(D_{i-1}\) data snapshot and the data objects of the data snapshot \(D_i\), which amount to a size of mw each, i.e., 2mw. The proposed technique also maintains a representation of the hierarchical clustering structure, which takes (\(2m-1\)) space. For making decisions during the merge and split operations, the entropy information is stored for the current and previous clustering structures constructed using \(D_i\) and \(D_{i-1}\), which requires (\(2m-1\)) space each. Overall, the total space required by the proposed technique is (2mw + (\(2m-1\)) + (\(2m-1\)) + (\(2m-1\))), i.e., (2mw + \(3(2m-1)\)), which is of the order of (\(mw+m\)). Since the values of m and w are constant irrespective of the data streams’ size, the term (\(mw+m\)) can be represented as a constant \(\phi \). Hence, the space complexity of the proposed technique can be represented in big-oh notation as \({\mathcal {O}}(\phi )\).

Datasets and performance measures

In this section, the synthetic and real-world datasets used for the performance analysis of the proposed technique have been presented. Furthermore, the different performance measures used to validate the proposed technique’s performance have also been discussed.

Datasets

For the performance analysis of the proposed work, two synthetic datasets and one real-world dataset that represents the browsing habits of different students have been taken. The characteristics of these three datasets have been discussed next.

Synthetic stationary dataset

This dataset is stationary in nature with no concept evolution in it. It aims to analyse the performance of the proposed technique in a scenario where the overall data pattern does not change over time. This dataset further comprises four sub-datasets, as shown in Table 5, viz., synthetic stationary 2C dataset, synthetic stationary 3C dataset, synthetic stationary 5C dataset, and synthetic stationary 7C dataset, where XC specifies the number of clusters (X) in the dataset. For generating data for a data stream falling under a specific cluster, the uniform distribution U, as given in Eq. (13), is used. In Eq. (13), the uniform distribution U is sampled h times to generate a data stream (say \(S_i\)) of length h

$$\begin{aligned} S_i = h \times (U(a, b)), \end{aligned}$$
(13)

where

  • a, b :  integers, \(b \ge a\) and \(n = b - a + 1\).

  • n :  number of discrete values.

  • h :  length of the data stream.

For creating data streams belonging to different clusters, different values for the parameters a and b of U have been taken, while the same value of the parameter h is used to generate data streams of equal length. These parameter values are presented in Table 4 (a small generation sketch is given at the end of this sub-section).

Table 4 Parameter values for uniform distribution (U) for generating data streams belonging to different clusters
Table 5 Synthetic stationary dataset

For generating the 20 two-cluster data streams, each data stream uses one of the parameter settings given in rows (1–2) of Table 4, and the streams are placed into the respective clusters: {(\(S_1\), \(S_2,\) \(S_3\), \(S_4\), \(S_5\), \(S_{11}\), \(S_{12}\), \(S_{13}\), \(S_{14}\), \(S_{15}\)), (\(S_6\), \(S_7\), \(S_8\), \(S_9\), \(S_{10}\), \(S_{16}\), \(S_{17}\), \(S_{18}\), \(S_{19}\), \(S_{20}\))}. Similarly, for the 20 three-cluster data streams, rows (1–3) of Table 4 are used and the streams are placed accordingly: {(\(S_0\), \(S_1\), \(S_2\)), (\(S_3\), \(S_4\), \(S_{10}\)), (\(S_{11}\), \(S_{12}\), \(S_{13}\), \(S_{14}\), \(S_5\), \(S_6\), \(S_7\), \(S_8\), \(S_9\), \(S_{15}\), \(S_{16}\), \(S_{17}\), \(S_{18}\), \(S_{19}\))}. For the 20 five-cluster data streams, rows (1–5) of Table 4 are used, leading to the following configuration: {(\(S_0\), \(S_1\), \(S_2\)), (\(S_3\), \(S_4\), \(S_{10}\)), (\(S_{11}\), \(S_{12}\), \(S_{13}\)), (\(S_{14}\), \(S_5\), \(S_6\)), (\(S_7\), \(S_8\), \(S_9\), \(S_{15}\), \(S_{16}\), \(S_{17}\), \(S_{18}\), \(S_{19}\))}. Finally, for the 20 seven-cluster data streams, rows (1–7) of Table 4 are used and the data streams are assigned as follows: {(\(S_0\), \(S_1\), \(S_2\)), (\(S_3\), \(S_4\), \(S_{10}\)), (\(S_{11}\), \(S_{12}\), \(S_{13}\)), (\(S_{14}\), \(S_5\), \(S_6\)), (\(S_7\), \(S_8\), \(S_9\)), (\(S_{15}\), \(S_{16}\), \(S_{17}\)), (\(S_{18}\), \(S_{19}\))}.
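
A minimal sketch of this generation process is given below; the parameter values for the two clusters are hypothetical stand-ins for the entries of Table 4.

```python
import random

def generate_stream(a, b, h, seed=None):
    """Generate one nominal data stream of length h by drawing h values
    uniformly from the discrete range {a, ..., b} (Eq. 13).  The integers
    act only as labels of nominal values, not as quantities."""
    rng = random.Random(seed)
    return [rng.randint(a, b) for _ in range(h)]

# Hypothetical parameter values for two clusters (the actual ones are in Table 4)
cluster_1 = [generate_stream(1, 5, 100_000, seed=i) for i in range(10)]
cluster_2 = [generate_stream(6, 10, 100_000, seed=100 + i) for i in range(10)]
```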

Synthetic concept evolving dataset

This dataset has been generated in such a way that new concepts keep evolving and older ones may fade out over time. It aims to test the performance of the proposed technique under a concept evolving scenario. This dataset also comprises 20 data streams, and each data stream contains 100,000 data observations. In this dataset, a concept evolution has been introduced after every 25,000 points, as given in Table 6, and the data corresponding to each cluster have been generated using a uniform distribution as discussed in the section “Synthetic stationary dataset”.

Table 6 Synthetic concept evolving dataset

For a proper representation of the effect of concept evolution on the performance of the proposed technique, the synthetic concept evolving dataset has been organised into four sub-datasets, viz., synthetic concept evolving H1 dataset, synthetic concept evolving H2 dataset, synthetic concept evolving H3 dataset, and synthetic concept evolving H4 dataset. The synthetic concept evolving H1 dataset comprises data instances 1 to 25,000 of the entire synthetic concept evolving H4 dataset. Similarly, the synthetic concept evolving H2 and H3 datasets consist of data instances 1 to 50,000 and 1 to 75,000, respectively, of the entire synthetic concept evolving H4 dataset. These variations introduced in the dataset are made to reflect concept evolution which may occur in a real-world scenario. The above information is also given in Table 7.

Table 7 Minimum and maximum number of concepts in the synthetic concept evolving dataset

Web browsing dataset

The real-world dataset represents the browsing behaviour of 20 students of the National Institute of Technology Meghalaya, India. Here, the data are generated from the browsing behaviour of each student, where each student corresponds to a data stream, so there are 20 data streams. Corresponding to each student, 10,000 observations have been recorded. Initially, from the \(1\text {st}\) to the \(35\text {th}\) data snapshot, the students were grouped into two groups. Furthermore, for collecting concept evolution related information, the students were divided into three groups from the \(36\text {th}\) to the \(45\text {th}\) data snapshot. Each group was asked to browse some similar websites and the data from this exercise were recorded. The students were then divided into four groups, five groups, and again into four groups for the further data snapshots, as shown in Table 8.

Table 8 Change in the number of groups of students over different data snapshots

The main reason for considering this grouping of students during the collection of data is to collect labelled data in terms of the number of concepts (clusters) prevailing over a set of data snapshots. This helps in an unbiased comparison of the proposed technique with AGNES.

Performance measures

The performance measures discussed below have been used to validate the proposed technique against AGNES.

Dunn index [24]: This index is based on the inter-cluster distance and intra-cluster distance of clusters. It is a ratio of the minimum distance between clusters to the maximum intra-cluster distance, as shown in Eq. (14). A high value of this index indicates good clustering, where the clusters formed are compact and well separated

$$\begin{aligned} \text {DI}&= \dfrac{\min _{1 \le i < j \le \mid K \mid } \varDelta _{\text {inter}}(C_i,C_j)}{\max _{1 \le k \le \mid K \mid } \varDelta _{\text {intra}}(C_k)} \\ \varDelta _{\text {intra}}(C_k)&= \sum \limits _{o\in C_k} \text {cosine}(o, \bar{C_k}) \\ \varDelta _{\text {inter}}(C_i, C_j)&= \dfrac{1}{\mid C_i\mid \mid C_j\mid } \sum \limits _{o_i\in C_i} \sum \limits _{o_j\in C_j} \text {cosine}(o_i, o_j). \end{aligned}$$
(14)

In Eq. (14), \(\varDelta _{\text {intra}}(C_k)\) represents the intra-cluster distance of the \(k\text {th}\) cluster and \(\bar{C_k}\) represents the centroid of \(C_k\). \(\varDelta _{\text {inter}}(C_i,C_j)\) represents the inter-cluster distance between two clusters \(C_i\) and \(C_j\).

Modified Hubert’s \(\varGamma \) Statistic [24]: This statistic is based on the proximity between the data objects in a dataset and the proximity between the cluster centres. This metric is used to describe the extent to which the clusters formed fit the data. It is calculated as given in Eq.  (15) and a high value of this statistic indicates good clustering

$$\begin{aligned} \text {MH}\varGamma = \dfrac{1}{m}\sum _{i=1}^{m-1}\sum _{j=i+1}^{m} (\text {cosine}(S_i, S_j)) (\varDelta _{\text {inter}}(C_i^l, C_j^l)). \qquad \end{aligned}$$
(15)

Cophenetic correlation coefficient [14]: This metric measures the correlation between the distance matrix obtained from the original data points and the distance matrix obtained after modelling the same data points into a dendrogram-based hierarchical structure. High values of this metric, as shown in Eq. (16), suggest well-formed clusters

$$\begin{aligned} \text {CPCC} = \dfrac{\sum _{i<j} (\varDelta _{\text {prox}}(S_i, S_j) - \mu _P)(\varDelta _{\text {den}}(S_i, S_j) - \mu _{\text {den}})}{\sqrt{\sum _{i<j} (\varDelta _{\text {prox}}(S_i, S_j) - \mu _P)^2 \sum _{i<j} (\varDelta _{\text {den}}(S_i, S_j) - \mu _{\text {den}})^2}}. \end{aligned}$$
(16)

In Eq. (16), \(\varDelta _{\text {prox}}(S_i, S_j)\) represents the distance between the \(i\text {th}\) and \(j\text {th}\) data streams \(S_i\) and \(S_j\), and \(\varDelta _{\text {den}}(S_i, S_j)\) represents the corresponding dendrogrammatic (cophenetic) distance. \(\mu _P\) represents the average pairwise distance between the m data streams and \(\mu _{\text {den}}\) represents the average pairwise inter-cluster distance between the K clusters.

Purity [31]: It represents the degree to which the clusters formed contain data objects from a single class. The more each cluster contains data objects from a single class, the higher is the purity value. Equation  (17) shows the process of purity calculation where L is the set of class labels

$$\begin{aligned} \text {Purity} = \dfrac{1}{m}\sum _{i=1}^{\mid K\mid } \max _{l \in L} \mid C_i^l \cap l \mid . \end{aligned}$$
(17)
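
A minimal sketch of the purity computation is given below; the cluster assignments and class labels are invented purely for illustration.

```python
from collections import Counter

def purity(clusters, labels):
    """Purity as in Eq. (17): the fraction of data streams that belong to the
    majority class of the cluster they were assigned to.

    clusters: list of clusters, each a list of stream identifiers;
    labels: dict mapping each stream identifier to its true class label.
    """
    m = sum(len(c) for c in clusters)
    majority = sum(Counter(labels[s] for s in c).most_common(1)[0][1]
                   for c in clusters)
    return majority / m

# Illustrative example: two clusters over five data streams
clusters = [["S1", "S2", "S3"], ["S4", "S5"]]
labels = {"S1": "A", "S2": "A", "S3": "B", "S4": "B", "S5": "B"}
print(purity(clusters, labels))   # 0.8
```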

Experimental results and analysis

The performance of the proposed technique has been analysed and compared to the AGNES technique on synthetic as well as real-world datasets in terms of the performance measures presented in the section “Performance measures”. AGNES is a hierarchical clustering technique, which has also been used for creating the initial hierarchical structure in the proposed technique. The proposed technique also follows the hierarchical clustering by variable approach; hence, AGNES has been preferred over other traditional clustering techniques for the performance comparison. Indeed, AGNES cannot deal with data streams, but it has been used as a baseline for analysing the performance of the proposed technique; the use of traditional clustering techniques as baselines is also found in the data stream literature [3, 22, 28, 33]. Most of the existing data stream clustering techniques follow the clustering by example approach; hence, comparing the proposed method, which follows a clustering by variable approach, to a method following a clustering by example approach would not be meaningful. Moreover, most of the existing works in the clustering by variable domain tailor their processing to a particular similarity or dissimilarity measure, which makes it hard to apply the computation of dissimilarity between data streams proposed in the current work within the setting of those existing works.

The AGNES technique requires the entire dataset to be available at once for processing, so this condition has been maintained while obtaining its results for comparison. As opposed to this, the proposed technique processes the data in terms of data chunks obtained from the data streams using a sliding window of fixed size. Different sliding window sizes have been tried for the experimental analysis, and the results corresponding to a window size of \(w = 100\) are presented in this section. For the initialisation of the clustering structure, \(D_{\text {init}}\) has been set to 20, which means that the initial 20 windows have been used. Furthermore, the margin of error (\(\delta \)) has been set to 0.05 for the calculation of the threshold (\(\epsilon \)).
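As an illustration of this setup, the following sketch chunks a single nominal stream into fixed-size windows with the parameter values quoted above. The sliding window is treated here as consecutive non-overlapping chunks, and the variable names are hypothetical; the initial windows would feed the AGNES-based initialisation and the remaining windows the incremental update.

```python
import numpy as np

W = 100        # window size used for the reported results
D_INIT = 20    # number of initial windows used to build the first hierarchy
DELTA = 0.05   # margin of error used to derive the threshold epsilon

def windows(stream, w=W):
    """Yield consecutive fixed-size chunks of w instances from one stream.

    `stream` is a 1-D sequence of nominal values; this is only an
    illustrative sketch of how each stream is chunked before clustering."""
    for start in range(0, len(stream) - w + 1, w):
        yield stream[start:start + w]

# Toy stream of 10,000 nominal instances -> 100 windows: 20 for
# initialisation, 80 processed incrementally afterwards.
stream = np.random.choice(['a', 'b', 'c'], size=10_000)
chunks = list(windows(stream))
init_chunks, remaining = chunks[:D_INIT], chunks[D_INIT:]
print(len(init_chunks), len(remaining))   # 20 80
```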

Results for synthetic stationary datasets

The performance of the proposed technique and AGNES on the synthetic stationary 2C, 3C, 5C, and 7C datasets in terms of DI, CPCC, MH\(\varGamma \), and Purity is given in Figs. 8, 9, 10, and 11, respectively.

Fig. 8
figure 8

DI score of the proposed technique and AGNES on four synthetic stationary datasets

Fig. 9
figure 9

CPCC score of the proposed technique and AGNES on four synthetic stationary datasets

Fig. 10
figure 10

MH\(\varGamma \) score of the proposed technique and AGNES on four synthetic stationary datasets

Fig. 11
figure 11

Purity score of the proposed technique and AGNES on four synthetic stationary datasets

It can be observed from Fig. 8 that the average of the DI scores obtained by the proposed technique on every window is better than the average DI score obtained by AGNES on all four synthetic stationary datasets, viz., the synthetic stationary 2C, 3C, 5C, and 7C datasets. The average DI score for AGNES is calculated by dividing the score achieved by AGNES by the number of windows processed by the proposed technique. The better performance of the proposed technique on all four synthetic stationary datasets can be attributed to the fact that the hierarchical clustering structure, which is incrementally modelled after every data snapshot, can reflect the local changes in the data streams better than AGNES.

Furthermore, from Fig. 9, it can be seen that the proposed technique performs almost identically to AGNES in terms of the CPCC values obtained on all four synthetic stationary datasets, while in Figs. 10 and 11, the performance of the proposed technique and AGNES is exactly the same on all four synthetic stationary datasets.

Overall, it can be said that for non-concept-evolving data streams, the performance of both techniques, viz., the proposed technique and AGNES, is approximately the same. The main advantage of the proposed technique is that it has performed comparably to AGNES while processing data in the form of windows instead of requiring the entire data to be available, which demonstrates its advantage over AGNES in terms of memory requirements.

Results on synthetic concept evolving datasets

The proposed technique has been evaluated and compared to AGNES on four synthetic concept evolving datasets, namely, the synthetic concept evolving H1, H2, H3, and H4 datasets, as described in Tables 6 and 7, to analyse their performance in the presence of multiple concepts in the data streams.

Fig. 12
figure 12

The movement of data streams from one cluster to another on evolution of concepts

The movement of the different data streams from one cluster to another cluster on the evolution of new concepts in the data streams is shown in Fig. 12 for the synthetic concept evolving H4 dataset. Altogether, there have been three changes in the concepts in the synthetic concept evolving H4 dataset, as detailed in the section “Synthetic concept evolving dataset”. Furthermore, the different operations performed by the proposed technique to handle the evolving concepts in the data streams are shown in Fig. 13a–c.

Initially, the first 20 windows (windows 1–20) were used by the proposed technique for creating the initial clustering structure, as shown in light colours (light blue, light green, and light red) in Fig. 12a. There are three clusters at the time of initialisation, where cluster 1 comprises the (\(S_1-S_4\)) data streams, cluster 2 comprises the (\(S_5-S_8\)) data streams, and cluster 3 comprises the (\(S_9-S_{20}\)) data streams; the corresponding tree structure is depicted in Fig. 12b. After the initialisation, from window 21 to 250, the initial assignment of the data streams into three clusters does not change as no concept evolution occurs in windows 21–250, as shown in Fig. 12a, and the same is also shown in Fig. 12c. However, on the 251\(\text {st}\) window, a concept change occurs and the clustering structure changes from three clusters to five clusters due to the split operation, as shown in Fig. 12d and e. Hence, some data streams, i.e., (\(S_{15}-S_{18}\)) and (\(S_{19}-S_{20}\)) in cluster 3, are assigned to two new clusters, viz., cluster 4 and cluster 5, respectively. Therefore, after the split operation on the 251\(\text {st}\) window, cluster 1, cluster 2, cluster 3, cluster 4, and cluster 5 comprise the (\(S_1-S_4\)), (\(S_5-S_8\)), (\(S_9-S_{14}\)), (\(S_{15}-S_{18}\)), and (\(S_{19}-S_{20}\)) data streams, respectively. This assignment stays the same until the 500\(\text {th}\) window. Next, concept evolution happens in the 501\(\text {st}\) and 502\(\text {nd}\) windows and the clustering structure changes from five clusters to two clusters. The different operations for changing the clustering structure from five clusters to two clusters are shown in Fig. 12f, where node \(n_3\) containing six data streams is split into two clusters named cluster 3 and cluster 6, as shown in Fig. 12a and f, each containing three data streams. In Fig. 12g, node \(n_5\) containing four data streams is split into two new clusters named cluster 4 and cluster 7, containing one data stream and three data streams, respectively. In Fig. 12h, the merging of cluster 4 and cluster 7 takes place, after which the clustering structure contains six clusters in total, as shown in Fig. 12i. This process of merging continues where, in Fig. 12j, cluster 4 and cluster 5 merge, followed by the merging of cluster 3, cluster 6, and cluster 4, as illustrated in Fig. 12k. Again, all three clusters, i.e., cluster 1, cluster 2, and cluster 3, merge into their parent node (denoted as \(n_1\)), as shown in Fig. 12l, to produce a single cluster, as depicted in Fig. 12m. Finally, the parent node (\(n_1\)) is split into two clusters, cluster 1 and cluster 2, as shown in Fig. 12n, containing the (\(S_1-S_4\), \(S_{12}-S_{15}\)) and (\(S_5-S_{11}\), \(S_{16}-S_{20}\)) data streams, respectively. The clustering structure in Fig. 12n is maintained till the 750\(\text {th}\) window, after which a concept evolution is encountered at the 751\(\text {st}\) window. Due to this concept evolution, node \(n_2\) is split into two clusters, as shown in Fig. 12o, followed by the splitting of node \(n_3\) into another two clusters, as illustrated in Fig. 12p. Again, node \(n_4\) is split into two clusters, as illustrated in Fig. 12q, followed by the splitting of node \(n_5\) into yet another two clusters, as illustrated in Fig. 12r. Finally, node \(n_6\) is split into two clusters, as depicted in Fig. 12s, where cluster 1, cluster 2, cluster 3, cluster 4, cluster 5, cluster 6, and cluster 7 contain the (\(S_1-S_4\)), (\(S_5-S_7\)), (\(S_8-S_9\)), (\(S_{10}-S_{11}\)), (\(S_{12}-S_{15}\)), (\(S_{16}-S_{18}\)), and (\(S_{19}-S_{20}\)) data streams, respectively. The clustering structure obtained after the split operations on the 751\(\text {st}\) window remains unchanged till the last window (i.e., the 1000\(\text {th}\) window). Overall, it can be observed that the different concepts in the synthetic concept evolving H4 dataset have been accurately captured in the clustering structure by the proposed technique.

The change in the average entropy of the clustering structure (tree) is shown in Fig. 13a–c for the first concept evolution at the 251\(\text {st}\) window, the second concept evolution at the 501\(\text {st}\) window, and the third concept evolution at the 751\(\text {st}\) window, respectively, corresponding to the synthetic concept evolving H4 dataset. It can be observed from Fig. 13a that the average entropy of the clustering structure at the time of the 250\(\text {th}\) window is 3.314, but on accommodating the data instances of the 251\(\text {st}\) window, the average entropy increases to 3.797, which represents an unstable clustering structure. This increase in entropy is because of the new concepts arriving in the data streams with the 251\(\text {st}\) window. To capture the new concepts in the clustering structure, different split operations are performed by the proposed technique, which decrease the average entropy of the clustering structure (tree) to 3.305 (the different split operations performed are shown in Fig. 12d and e). In Fig. 13b, the average entropy of the clustering structure at the time of the 500\(\text {th}\) window is 3.306. However, after accommodating the newer concepts, which are spread over the 501\(\text {st}\) and 502\(\text {nd}\) windows, the average entropy of the clustering structure increases to 3.663, representing an unstable clustering structure. Therefore, to stabilise the clustering structure, multiple split and merge operations on the data instances corresponding to the 501\(\text {st}\) and 502\(\text {nd}\) windows are executed, which lower the average entropy of the clustering structure to 3.299 (the different split and merge operations performed are shown in Fig. 12f–n). The third and final concept evolution occurs at the 751\(\text {st}\) window, where the average entropy of the clustering structure increases from 3.314 on the 700\(\text {th}\) window to 4.945 on accommodating the 751\(\text {st}\) window, as shown in Fig. 13c. However, after the split operations performed by the proposed technique on the data instances of the 751\(\text {st}\) window, the average entropy of the clustering structure reduces to 3.219 (the different split operations performed are shown in Fig. 12o–s).
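To illustrate why the average entropy behaves this way, the following minimal sketch assumes that a cluster's entropy is the Shannon entropy of its nominal attribute values averaged over the attributes (the paper's exact formulation may differ). Mixing instances of a new concept into an existing cluster raises the average entropy of the structure, and split operations that separate the concepts into purer clusters bring it back down.

```python
import numpy as np
from collections import Counter

def cluster_entropy(cluster):
    """Shannon entropy of a cluster of nominal rows, averaged over attributes.

    This is only an illustrative notion of per-cluster entropy."""
    n_attrs = len(cluster[0])
    total = 0.0
    for a in range(n_attrs):
        counts = Counter(row[a] for row in cluster)
        probs = np.array(list(counts.values()), dtype=float) / len(cluster)
        total += float(-(probs * np.log2(probs)).sum())
    return total / n_attrs

def average_tree_entropy(leaf_clusters):
    """Average entropy over the leaf clusters of the current structure."""
    return float(np.mean([cluster_entropy(c) for c in leaf_clusters]))

# A homogeneous cluster has low entropy; mixing in instances from a new
# concept raises it, which is the instability signal that triggers the
# split/merge operations described above.
old = [('a', 'x')] * 80 + [('b', 'x')] * 20
new_concept = [('c', 'y')] * 50
print(average_tree_entropy([old]))                 # lower (stable structure)
print(average_tree_entropy([old + new_concept]))   # higher (after concept change)
print(average_tree_entropy([old, new_concept]))    # lower again after a split
```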

The DI score, CPCC score, MH\(\varGamma \) score, and Purity score achieved by the proposed technique are shown in Figs. 14, 15, 16, and 17, respectively. All four scores have been plotted at an interval of two windows. It can be observed from Fig. 14 that some fluctuations occur in the DI score due to the changes in the values of the data instances across different data windows, whereas the changes in CPCC and MH\(\varGamma \) correspond only to those windows where concept changes occur, and the Purity score is 1 for all the windows. In totality, based on the DI, CPCC, MH\(\varGamma \), and Purity scores shown in Figs. 14, 15, 16, and 17, it can be said that the performance of the proposed technique is very promising for concept evolving datasets.

Fig. 13
figure 13

Change in the average entropy of clustering structure corresponding to different concept evolution

Fig. 14
figure 14

DI score of the proposed technique

Fig. 15
figure 15

CPCC score of the proposed technique

Fig. 16
figure 16

MH\(\varGamma \) score of the proposed technique

Fig. 17
figure 17

Purity score of the proposed technique

The comparative analysis of the proposed technique and AGNES on the synthetic concept evolving H1, H2, H3, and H4 datasets in terms of DI, CPCC, MH\(\varGamma \), and Purity is shown in Figs. 18, 19, 20, and 21, respectively. It can be observed from Fig. 18 that the average of the DI scores obtained by the proposed technique on every window and the average DI score achieved by AGNES are nearly the same for the synthetic concept evolving H1 dataset, since this dataset contains data instances from a single concept. However, the performance of the proposed technique is far better than that of AGNES on the synthetic concept evolving H2, H3, and H4 datasets, as these three datasets consist of data instances belonging to multiple concepts, which are correctly captured by the proposed technique but not by AGNES. Similarly, it can be observed from Figs. 19 and 20 that the performance of the proposed technique and AGNES is roughly the same for the synthetic concept evolving H1 dataset, but the proposed technique performs better than AGNES on the synthetic concept evolving H2, H3, and H4 datasets. This behaviour in terms of CPCC and MH\(\varGamma \) is again attributable to the capability of the proposed technique to address the change in concepts by adjusting the clustering structure via different split and merge operations, whereas AGNES is unable to do so. It can be seen from Fig. 21 that the value of Purity is 1 for both the proposed technique and AGNES. However, the Purity value for AGNES is misleading, since Fig. 22 shows that AGNES generates a very large number of clusters compared to the actual number of clusters. In the case of the proposed technique, the number of clusters generated and the actual number of clusters are the same, which shows that the proposed technique accurately captures the concept evolution in the clustering structure. Overall, the superior performance of the proposed technique indicates its ability to identify and reflect the concept changes in the data streams through proper updating of the clustering structure. The same cannot be said about AGNES, as it fails to recognise the evolving changes in the data streams.

Fig. 18
figure 18

DI score of the proposed technique and AGNES on four synthetic concept evolving datasets

Fig. 19
figure 19

CPCC score of the proposed technique and AGNES on four synthetic concept evolving datasets

Fig. 20
figure 20

MH\(\varGamma \) score of the proposed technique and AGNES on four synthetic concept evolving datasets

Fig. 21
figure 21

Purity score of the proposed technique and AGNES on four synthetic concept evolving datasets

Fig. 22
figure 22

Number of clusters generated by the proposed technique and AGNES vs. actual number of clusters for synthetic concept evolving datasets

Performance on web browsing dataset

The proposed technique has also been evaluated and compared to AGNES on the web browsing dataset, as described in Table 8, to analyse their performance in a real-world scenario.

The movement of the different data streams from one cluster to another cluster on the evolution of new concepts in the data streams is shown in Fig. 23 for the web browsing dataset. Overall, there have been four changes in the concepts in the web browsing dataset, as detailed in the section “Web browsing dataset”. Initially, the first 20 windows (windows 1–20) were used by the proposed technique for creating the initial clustering structure, as shown in light colours (light blue, light green, and light red) in Fig. 23a. There are two clusters at the time of initialisation, where cluster 1 comprises the (\(S_1-S_{10}\)) data streams and cluster 2 comprises the (\(S_{11}-S_{20}\)) data streams; the corresponding tree structure is depicted in Fig. 23b. After the initialisation, from window 21 to 35, the initial assignment of the data streams into two clusters does not change as no concept evolution occurs in windows 21–35, as shown in Fig. 23a, and the same is also shown in Fig. 23c. However, on the 36\(\text {th}\) window, a concept change occurs and the clustering structure changes from two clusters to three clusters due to the split operation, as shown in Fig. 23d. Hence, the data streams (\(S_{1}-S_{10}\)) in node \(n_2\) are assigned to two new clusters, i.e., the data streams (\(S_{1}-S_{5}\)) and (\(S_{6}-S_{10}\)) are assigned to cluster 1 and cluster 3, respectively. Therefore, after the split operation on the 36\(\text {th}\) window, cluster 1, cluster 2, and cluster 3 comprise the (\(S_1-S_5\)), (\(S_{11}-S_{20}\)), and (\(S_6-S_{10}\)) data streams, respectively. This assignment stays the same till the 45\(\text {th}\) window. Next, concept evolution happens in the 46\(\text {th}\) window and the clustering structure changes from three clusters to four clusters. The split operation for changing the clustering structure from three clusters to four clusters is shown in Fig. 23, where node \(n_3\) containing ten data streams is split into two clusters named cluster 2 and cluster 4, containing three and seven data streams, respectively, as shown in Fig. 23a and e. This new assignment of the data streams also remains the same till the 65\(\text {th}\) window. In Fig. 23f, node \(n_4\) containing seven data streams is split into two new clusters, viz., cluster 4 and cluster 5, containing three and four data streams, respectively, on processing the 66\(\text {th}\) window where another concept change occurs. The clustering structure so obtained after the split operation stays unchanged till the 82\(\text {nd}\) window, where cluster 1, cluster 2, cluster 3, cluster 4, and cluster 5 contain the (\(S_1-S_5\)), (\(S_{11}-S_{13}\)), (\(S_6-S_{10}\)), (\(S_{14}-S_{16}\)), and (\(S_{17}-S_{20}\)) data streams, respectively. Finally, on encountering the 83\(\text {rd}\) window, where yet another concept change was detected, cluster 4 and cluster 5 were merged into their parent cluster (node \(n_4\) as shown in Fig. 23f) by the proposed technique to produce a clustering structure accommodating four clusters, namely, cluster 1, cluster 2, cluster 3, and cluster 4, containing the (\(S_1-S_5\)), (\(S_{11}-S_{13}\)), (\(S_6-S_{10}\)), and (\(S_{14}-S_{20}\)) data streams, respectively. The clustering structure obtained after the merge operation on the 83\(\text {rd}\) window remains unchanged till the last window (i.e., the 100\(\text {th}\) window), as can be observed from Fig. 23a and g. Overall, it can be ascertained that the different concepts in the web browsing dataset have been accurately captured in the clustering structure by the proposed technique.

Fig. 23
figure 23

The movement of data streams from one cluster to another on evolution of concepts

The DI score, CPCC score, MH\(\varGamma \) score, and Purity score achieved by the proposed technique are shown in Figs. 24, 25, 26, and 27, respectively. All four scores have been plotted at an interval of two windows. It can be observed from Fig. 24 that some fluctuations occur in the DI score due to the changes in the values of the data instances across different data windows. On examining the CPCC score in Fig. 25, it can be seen that the changes in the CPCC score correspond to the concept changes in the 46\(\text {th}\) and 83\(\text {rd}\) windows, while the changes corresponding to the concept changes in the 36\(\text {th}\) and 66\(\text {th}\) windows are negligible. In contrast, the changes in MH\(\varGamma \) correspond only to those windows where concept changes occur, and the Purity score is 1 for all the windows. Hence, based on the DI, CPCC, MH\(\varGamma \), and Purity scores shown in Figs. 24, 25, 26, and 27, it can be said that the proposed technique shows promising performance on a dataset representing a real-world scenario.

The comparative analysis of the proposed technique and AGNES on the web browsing dataset in terms of DI, CPCC, MH\(\varGamma \), and Purity has been carried out on fractions of the dataset, i.e., windows 21–35, windows 21–45, windows 21–65, windows 21–82, and windows 21–100, containing from one to five concepts as discussed in the section “Web browsing dataset”, to observe the effect of multiple concepts on both techniques; the results are shown in Figs. 28, 29, 30, and 31, respectively.

It can be observed from Fig. 28 that the average of the DI scores obtained by the proposed technique on every window and the average DI score achieved by AGNES are nearly the same for windows 21–35, since these windows contain data instances from a single concept. However, the performance of the proposed technique is far better than that of AGNES on windows 21–45, windows 21–65, windows 21–82, and windows 21–100, as these windows consist of data instances belonging to multiple concepts, which are correctly captured by the proposed technique but not by AGNES.

Fig. 24
figure 24

DI score of the proposed technique

Fig. 25
figure 25

CPCC score of the proposed technique

Fig. 26
figure 26

MH\(\varGamma \) score of the proposed technique

Fig. 27
figure 27

Purity score of the proposed technique

Furthermore, it can be observed from Fig. 29 that the performance of the proposed technique and AGNES is roughly the same for windows 21–35 and windows 21–45. However, the performance of AGNES drops in comparison to the proposed technique as the number of concepts increases further in windows 21–65, windows 21–82, and windows 21–100. Similarly, in the case of MH\(\varGamma \), the performance of the proposed technique and AGNES is roughly the same for windows 21–35, but the performance of AGNES starts decreasing in comparison to the proposed technique for windows 21–45, windows 21–65, windows 21–82, and windows 21–100, as can be seen from Fig. 30. Again, this behaviour in terms of CPCC and MH\(\varGamma \) can be attributed to the capability of the proposed technique to address the changes in concepts by adjusting the clustering structure via different split and merge operations, whereas AGNES lacks such capabilities. It can be seen from Fig. 31 that the value of Purity is 1 for both the proposed technique and AGNES. However, the Purity value for AGNES is misleading, since Fig. 32 shows that AGNES generates a very large number of clusters compared to the actual number of clusters. In the case of the proposed technique, the number of clusters generated and the actual number of clusters are the same, which shows that the proposed technique accurately captures the concept evolution in the clustering structure. Overall, the performance of the proposed technique, validated using the validity indices, indicates that it can correctly group the students into an optimal number of clusters. Moreover, the clusters produced by the proposed technique are compact and well separated, as suggested by the scores obtained. The same cannot be said of AGNES, as it could not handle the concept changes in the data streams and consequently proceeded with a non-optimal assignment of students into clusters.

Fig. 28
figure 28

DI score of the proposed technique and AGNES on web browsing dataset

Fig. 29
figure 29

CPCC score of the proposed technique and AGNES on web browsing dataset

Fig. 30
figure 30

MH\(\varGamma \) score of the proposed technique and AGNES on web browsing dataset

Fig. 31
figure 31

Purity score of the proposed technique and AGNES on web browsing dataset

Fig. 32
figure 32

Number of clusters generated by the proposed technique and AGNES vs. actual number of clusters for web browsing dataset

Time comparison

The proposed technique has also been compared to AGNES based on the computation time required by both techniques for processing the synthetic stationary datasets, the synthetic concept evolving datasets, and the web browsing dataset, as shown in Figs. 33, 34, and 35, respectively.

From Fig. 33, it is clearly evident that the processing time required by the proposed technique on all the synthetic stationary datasets, namely, the synthetic stationary 2C, 3C, 5C, and 7C datasets, is significantly less than the processing time required by AGNES for the same datasets. Again, from Fig. 34, it can be observed that AGNES requires more processing time than the proposed technique as the number of data instances increases across the consecutive synthetic concept evolving datasets (i.e., the number of data instances in the synthetic concept evolving H1 dataset < H2 dataset < H3 dataset < H4 dataset). Similarly, from Fig. 35, in the case of the web browsing dataset, as the number of windows (i.e., the number of data instances) increases, so does the processing time required by AGNES, which is substantial in comparison to the processing time required by the proposed technique. Hence, from Figs. 33, 34, and 35, it can be said that the proposed technique requires significantly less computation time than AGNES, which satisfies the time constraints of processing data streams.

Fig. 33
figure 33

Time taken by the proposed technique and AGNES on synthetic stationary datasets

Fig. 34
figure 34

Time taken by the proposed technique and AGNES on synthetic concept evolving datasets

Fig. 35
figure 35

Time taken by the proposed technique and AGNES on web browsing dataset

Conclusion

In the present paper, a hierarchical clustering technique for multiple nominal data streams has been presented. It measures the quality of a cluster by calculating its entropy. The proposed technique is capable of addressing the concept evolving nature of the data streams by adapting the clustering structure with the help of merge and split operations. The experimental analysis has been performed on two groups of synthetic datasets and a real-world dataset, namely, the synthetic stationary datasets, the synthetic concept evolving datasets, and the web browsing dataset, in terms of the performance measures DI, CPCC, MH\(\varGamma \), and Purity. The proposed technique has achieved the highest DI score of 25.642, CPCC score of 0.974, MH\(\varGamma \) score of 0.704, and Purity score of 1 for the synthetic concept evolving H4 dataset. For the web browsing dataset, a DI score of 5.678, CPCC score of 0.993, MH\(\varGamma \) score of 0.723, and Purity score of 1 have been attained by the proposed technique. Overall, it can be concluded from the experimental results that the proposed technique has performed approximately the same as AGNES on the synthetic stationary datasets, while outperforming the AGNES technique on the synthetic concept evolving datasets as well as the web browsing dataset. Furthermore, it can be stated that the proposed technique requires much less computation time than AGNES.