1 Introduction

In recent years, the volume and arrival speed of data have increased tremendously. Data frequently arrive continuously in the form of data streams rather than forming a single static data set. Therefore, data stream learning, which is able to learn from incoming data upon arrival, has become an increasingly important approach to extracting knowledge from data. It has been widely used in real-world applications, such as credit card fraud detection (Dal Pozzolo et al., 2017), software defect prediction (Tabassum et al., 2020) and spam filtering (Delany et al., 2005). There are many types of tasks in data stream learning, for example, classification, regression, clustering and anomaly detection. This work focuses on binary classification.

Concept drift is a common challenge in data streams. It is a change in the underlying distribution of the problem. Such a change can deteriorate the predictive performance of the data stream learning algorithm because the predictive model built previously may no longer be valid. With respect to how they deal with concept drift, data stream learning algorithms can be categorised as explicit or implicit approaches (Ditzler et al., 2015; Zliobaite, 2010). Explicit approaches employ a concept drift detection method to detect whether a concept drift has occurred, and then adopt strategies to update the predictive model to cope with the drift (Ditzler et al., 2015; Zliobaite, 2010). Implicit approaches do not employ any concept drift detection method but continuously evolve themselves to reflect the current underlying concept, thus adapting to concept drifts (Ditzler et al., 2015; Zliobaite, 2010).

Data stream learning algorithms can also be categorised by their mode of operation: batch-based (chunk-based) learning and online learning (Gama et al., 2014; Ditzler et al., 2015). Batch-based learning refers to learning the data stream in batches of new training data. It has the advantage of having more data to learn from at a given time step, so the learning approach can better capture the current underlying concept (Gama et al., 2014; Ditzler et al., 2015). In contrast, online learning has a stricter requirement: the data stream learning approach must process each training example separately and then discard it (Gama et al., 2014; Ditzler et al., 2015), rendering it applicable to problems with stricter time and memory requirements. To deal with concept drifts in a timely fashion, online learning is usually preferable to batch-based learning, which needs to wait for whole batches of training examples to arrive. Moreover, batch-based learning assumes that all training data within the same batch are drawn from the same underlying concept, which does not always hold in real-world applications. Thus, this work focuses on online learning.

Another challenge frequently present in real-world data stream applications is that their class distribution is often skewed, an issue that is commonly referred to as class imbalance (Wang et al., 2018). For example, in credit card fraud detection, there are always more genuine transactions than fraudulent transactions. In software defect prediction, there are typically more clean than defective components. When class imbalance exists, data stream learning algorithms are likely to become biased towards the majority class and thus to misclassify minority class examples. Yet, the minority class is usually the class of interest in the classification task, meaning that misclassifying minority class examples could lead to a high cost. Since this work focuses on binary classification, there is one majority class and one minority class when the data stream is class imbalanced.

To deal with class imbalance, a category of oversampling strategies has been shown to be successful in data set learning (offline learning). They create synthetic examples to enrich the minority class, which causes less overfitting than reusing existing minority class examples (Chawla et al., 2002; Han et al., 2005; Lee et al., 2017). Some recent work has attempted to bring this successful idea into the field of data stream learning (Wang & Pineau, 2016; Bernardo et al., 2020). However, these approaches usually cache all the minority class examples seen so far in memory, which is impractical for data stream learning. Moreover, reusing all past minority class examples may prevent these approaches from dealing with concept drifts (changes in the underlying probability distribution, a.k.a. concept (Krawczyk et al., 2017)) affecting the minority class.

Dealing with the joint issue of concept drift and class imbalance is a challenging task. In particular, the relatively small number of minority class examples means that it may be more difficult to detect or adapt to concept drifts affecting the minority class. Many approaches have been proposed to deal with either class imbalance or concept drift. However, existing work dealing with their joint challenge remains scarce. Although a recent survey (Wang et al., 2018) showed that class imbalance is a more dominant factor than concept drift in affecting the predictive performance, the effectiveness of the existing class imbalance techniques for data stream learning could potentially be compromised by concept drift, as most of them are not prepared to deal with drifts. Recent work addresses this challenge by finding relevant past minority class examples for oversampling (Hoens & Chawla, 2012) or by performing synthetic minority class oversampling based on the statistics of the minority class after drift detection (Bernardo et al., 2021). These methods are not always ideal because relevant past minority class examples might not exist, while relying on drift detection to reset minority class statistics could be detrimental, especially when the drift is gradual.

In addition, the method of storing past examples as proposed in Hoens & Chawla (2012) may be impractical for data stream learning when there are strict space constraints. Similarly, synthesising new examples based on simple statistics of past examples as proposed in Bernardo et al. (2021) overlooks important data difficulty factors within the class. Specifically, this method does not consider the location of past examples in the feature space. These data difficulty factors include concept drifts involving different movements of the minority class sub-clusters, a changing class imbalance ratio, and changing ratios of different types of minority class examples. Existing work has shown that these factors are critical in learning from drifting class imbalanced data streams (Brzezinski et al., 2021).

Therefore, new approaches are needed to better address concept drifts with multiple data difficulty factors in class imbalanced data streams. To fill this research gap, this paper aims to answer the following research questions:

  • RQ1) How to produce minority class synthetic examples for oversampling so that we could explore the decision areas of the minority class to better consider data difficulty factors while adapting to concept drift?

  • RQ2) How does the proposed approach perform compared to existing approaches in different types of concept drift affecting the minority class? For which types does it perform the best and worst? Why?

  • RQ3) How does the proposed approach perform compared to existing approaches when applied to real-world data streams?

To answer RQ1, we propose a novel method to create synthetic minority class examples for oversampling based on stream clustering. The motivation is that stream clustering methods use a set of micro-clusters as the abstraction/compression of the examples they have seen so far. They usually retain micro-clusters by temporal order, which means old micro-clusters are forgotten. Therefore, the information they hold reflects the characteristics of the current underlying concept. Our novel method exploits this nature of stream clustering methods to track the current decision areas of the minority class. It then generates synthetic minority class samples for oversampling within the region where real minority class examples have been recently observed. With this strategy, the proposed method is less likely to produce noisy synthetic examples while being able to explore the decision areas of the minority class, better considering data difficulty factors when adapting to concept drift (RQ1).

The proposed approach, named SMOClust, is compared against five existing approaches through experiments on artificial data streams considering different data difficulty factors and class imbalance ratios, and on real-world data streams (RQ2, RQ3). The results show that SMOClust handled concept drifts involving different minority class sub-cluster movements better than existing approaches (RQ2, RQ3). It also performed better than existing approaches when the data stream was severely class imbalanced and presented high proportions of safe and borderline (Napierala & Stefanowski, 2015) minority class examples (RQ2, RQ3). Its major weakness lies in handling data streams presenting large proportions of rare and outlier (Napierala & Stefanowski, 2015) minority class examples (RQ2, RQ3).

The rest of this paper is organised as follows. Section 2 discusses related work on synthetic minority class oversampling techniques and state-of-the-art approaches in dealing with class imbalance and concept drift in data stream learning. Section 3 presents the proposed approach. Section 4 presents the experimental study and discusses the results. Section 5 concludes this study and discusses the future work.

2 Related work

This section first introduces class imbalance and existing resampling methods for class imbalanced learning in Sect. 2.1. Section 2.2 then discusses the state-of-the-art approaches to deal with class imbalance and concept drift in data stream learning. Table 1 summarises the main characteristics of the approaches discussed in this section. At the end of this table, we contrast these with SMOClust, the approach that we propose in Sect. 3 of this paper.

Table 1 Comparison of principal characteristics between related works and SMOClust

2.1 Resampling methods for class imbalance

Class imbalance refers to a data set or data stream having at least one under-represented class (minority class). In this situation, the machine learning model tends to misclassify minority class examples more frequently than majority class examples because very little information about the minority class is available.

Approaches to address class imbalance are mainly divided into three categories: algorithm-level approaches, ensemble approaches, and data-level approaches (Wang et al., 2018). Algorithm-level approaches are often called cost-sensitive approaches, as they place a higher cost on misclassifying minority class examples than on misclassifying majority class examples. Ensemble approaches create different class balanced training subsets to train each ensemble member. Data-level approaches modify the class distribution using a resampling method, such that standard machine learning models can learn from both classes with the same amount of information. They can be applied during the data pre-processing phase. Due to this generic nature, this work focuses on data-level approaches.

Undersampling and oversampling are the two main types of data-level approaches (Wang et al., 2018). Undersampling methods reduce the number of majority class examples for training, usually by removing noisy examples or examples that are deemed to have a low impact on the decision boundary. However, they risk losing important information. On the other hand, oversampling methods increase the number of minority class examples, by replication or synthesis. They do not cause any information loss, but they lengthen training time and are likely to cause overfitting when training on the same examples multiple times.

Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al., 2002) is a well-known oversampling technique in offline learning, which synthesises minority class examples for oversampling, thus balancing the data set. In practice, SMOTE first randomly chooses an existing minority class example from the data set, denoted as \(x_i\). It then randomly chooses one of the k-nearest neighbours of \(x_i\) in the minority class, denoted as \(x'_i\). After that, a difference vector between \(x_i\) and \(x'_i\) is calculated. To create a point along the line between \(x_i\) and \(x'_i\), each dimension of the difference vector is multiplied by a random number \(\theta\) (\(0< \theta < 1\)), then the resulting difference vector is added to \(x_i\) dimension-wise. SMOTE repeats this procedure until the target oversampling rate M is reached. This oversampling rate M and the k for k-nearest neighbours are the hyper-parameters of the algorithm.
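The interpolation step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the reference SMOTE implementation: the function names `smote_sample` and `smote`, the use of squared Euclidean distance, and the argument `n_synthetic` (standing in for the number of examples implied by the oversampling rate M) are assumptions made for this example. It also assumes numeric features and at least two minority class examples.

```python
import random


def smote_sample(minority, k, rng):
    """Create one synthetic example from a list of minority class points."""
    # Randomly choose an existing minority class example x_i.
    i = rng.randrange(len(minority))
    x_i = minority[i]
    # Rank the remaining minority examples by squared Euclidean distance to x_i.
    others = [x for j, x in enumerate(minority) if j != i]
    others.sort(key=lambda x: sum((a - b) ** 2 for a, b in zip(x, x_i)))
    # Randomly choose one of the k nearest minority class neighbours.
    x_n = rng.choice(others[:k])
    # Scale the difference vector by theta and add it to x_i dimension-wise.
    # rng.random() draws from [0, 1), a close stand-in for 0 < theta < 1.
    theta = rng.random()
    return [a + theta * (b - a) for a, b in zip(x_i, x_n)]


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points (hypothetical stand-in for rate M)."""
    rng = random.Random(seed)
    return [smote_sample(minority, k, rng) for _ in range(n_synthetic)]


# Four minority points on the corners of the unit square.
corners = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(corners, n_synthetic=5, k=2)
```

With k = 2, every synthetic point in this toy example lies on an edge of the square, because each corner's two nearest neighbours are the adjacent corners.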

Many variants of SMOTE have been proposed in the last two decades. For example, Borderline-SMOTE (Han et al., 2005) considers that examples close to the decision boundary are more difficult to learn, thus it synthesises minority class examples around this area. Gaussian-based SMOTE (G-SMOTE) (Lee et al., 2017) is a more recent approach which tends to synthesise examples very close to the existing minority class examples. Other well-known methods of synthetic minority oversampling include ADASYN (He et al., 2008), DBSMOTE (Bunkhumpornpat et al., 2011) and SWIM (Bellinger et al., 2020).

In addition to the class imbalance ratio, it has been pointed out that data difficulty factors also greatly impact the classification performance (Napierala & Stefanowski, 2015). These factors describe the characteristics of a given example (usually a minority class example) in the feature space:

  • Safe: Surrounded by examples from the same class.

  • Borderline: Located near the decision boundary.

  • Rare: Located deep inside the decision region of the opposite class, together with a handful of examples from the same class.

  • Outlier: Isolated and located deep inside the decision region of the opposite class.

The aforementioned methods can also be considered as taking the data difficulty factors into account. For example, Borderline-SMOTE synthesises minority class examples around the borderline region, while G-SMOTE can be considered as synthesising minority class examples in the safe region.
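To make the four types more concrete, the sketch below labels an example by counting how many of its k nearest neighbours share its class. The text above only gives qualitative definitions; the neighbourhood size k = 5 and the thresholds (4 to 5 same-class neighbours for safe, 2 to 3 for borderline, 1 for rare, 0 for outlier) are assumptions chosen for illustration, and the function name `difficulty_type` is hypothetical.

```python
def difficulty_type(index, data, k=5):
    """Label example `index` of `data`, a list of (features, label) pairs."""
    point, label = data[index]
    # Squared Euclidean distance to every other example, paired with its label.
    scored = [
        (sum((a - b) ** 2 for a, b in zip(point, x)), y)
        for j, (x, y) in enumerate(data)
        if j != index
    ]
    scored.sort(key=lambda pair: pair[0])
    # Count neighbours from the same class among the k nearest.
    same = sum(1 for _, y in scored[:k] if y == label)
    # Assumed thresholds, meaningful for k = 5 only.
    if same >= 4:
        return "safe"
    if same >= 2:
        return "borderline"
    if same == 1:
        return "rare"
    return "outlier"


# Label 1 is the minority class; label 0 is the majority class.
toy = [
    ([0.0, 0.0], 1), ([0.1, 0.0], 1), ([0.0, 0.1], 1),
    ([-0.1, 0.0], 1), ([0.0, -0.1], 1), ([0.1, 0.1], 1),
    ([10.0, 10.0], 1), ([10.1, 10.0], 0), ([10.0, 10.1], 0),
    ([9.9, 10.0], 0), ([10.0, 9.9], 0), ([10.1, 10.1], 0),
]
print(difficulty_type(0, toy))  # minority point inside a minority cluster
print(difficulty_type(6, toy))  # isolated minority point among majority points
```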

However, these synthetic minority oversampling methods cannot be applied to class imbalanced data stream learning directly as they cache the entire data set in memory, which is impractical in data stream learning. OnlineSMOTEBagging (Wang & Pineau, 2016) is one example that inherits this problem. It replaces simple oversampling with SMOTE in OnlineUnderOverBagging. In our preliminary experiments with the data streams used in this work, we attempted to run OnlineSMOTEBagging. However, OnlineSMOTEBagging consumed all the memory we had access to (8GB), resulting in failure to complete the run. Furthermore, the underlying concept of the data stream may change over time (concept drift). The cached examples may be from different concepts. Thus, minority class examples synthesised from them may not always follow the current underlying concept.

Additionally, one recent work in the field of software effort estimation is also quite inspiring (Song et al., 2018). It enlarges the software project data set by adding Gaussian noise to the existing examples. This method is particularly relevant to synthetic minority oversampling for class imbalanced data stream learning as it is memory efficient and fast to perform. The potential risk is that, if we apply it to the most recent minority class examples, it might cause overfitting to the region around those recent examples.
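The idea of enlarging a data set by adding Gaussian noise can be illustrated with the short sketch below. It is not the procedure of Song et al. (2018), whose details are not given here; the helper name `jitter`, the zero-mean noise, and the single noise scale `sigma` shared by all features are assumptions for this example, which also assumes purely numeric features.

```python
import random


def jitter(example, sigma, rng):
    """Return a noisy copy of a numeric example (sigma is an assumed scale)."""
    # Add independent zero-mean Gaussian noise to each feature.
    return [value + rng.gauss(0.0, sigma) for value in example]


rng = random.Random(42)
last_minority = [0.5, 1.2, 3.0]
# Three synthetic examples scattered around the most recent minority example.
synthetic = [jitter(last_minority, sigma=0.1, rng=rng) for _ in range(3)]
```

Only the example being perturbed needs to be kept, which is why the memory cost is small. Every synthetic point clusters around that single example, which is the overfitting risk noted above.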

2.2 Approaches for class imbalanced data stream learning in the presence of concept drift

This section discusses approaches that are closely related to the proposed approach. For a comprehensive survey on class imbalanced data stream learning, please refer to Aguiar et al. (2022).

Broadly speaking, existing approaches to deal with class imbalance and concept drift fall into two main categories: explicit approaches and implicit approaches.

2.2.1 Explicit approaches

Explicit approaches estimate whether a concept drift has happened, usually by employing a drift detector to monitor the predictive performance of the base learner/main ensemble. This drift detector can be any from the literature, ideally using a class imbalance insensitive metric, such as DDM-OCI (Wang et al., 2013), LFR (Wang & Abraham, 2015), PAUC-PH (Brzezinski & Stefanowski, 2014) etc.

Continuous-SMOTE (C-SMOTE) (Bernardo et al., 2020) is one of the first approaches to bring SMOTE to drifting class imbalanced data stream learning. It uses an Adaptive Window (ADWIN) (Bifet & Gavalda, 2007) to store the most recent examples and applies SMOTE to the minority class examples inside the ADWIN for oversampling. Upon drift detection, the old window of ADWIN is dropped as it is deemed to belong to the old concept. However, when no concept drift is detected, C-SMOTE keeps storing minority class examples, which can cause memory issues. Besides, SMOTE does not take decision boundaries and data difficulty factors into consideration, thus noisy examples may be generated.

Very Fast Continuous-SMOTE (VFC-SMOTE) (Bernardo et al., 2021) was proposed to solve the issues faced by C-SMOTE. It uses a dynamic summary data structure, called “sketch”, to summarise the statistics of past examples. It generates synthetic examples by Beta distribution sampling from a set of summaries in the sketch, where each summary has the information of one input feature of past examples. When generating synthetic minority class examples, VFC-SMOTE tends to choose summaries that represent more past examples, which means it tends to generate synthetic minority class examples in the dense area of minority class. Nevertheless, this method may generate considerably noisy synthetic examples because it samples each input feature individually and does not adopt mechanisms to try to respect decision boundaries.

SMOTE with Online Bagging (SMOTE-OB) (Bernardo & Valle, 2021) is another approach that is similar to VFC-SMOTE. It incorporates the strategy of generating synthetic minority class examples from VFC-SMOTE into OnlineUnderOverBagging (Wang & Pineau, 2016). With this design, SMOTE-OB combines three data-level re-balancing methods to combat class imbalance while training the base learners diversely (Bernardo & Valle, 2021). However, as SMOTE-OB uses the same synthetic minority class examples generating strategy as VFC-SMOTE, it faces the same disadvantages in terms of potentially generating considerably noisy synthetic examples.

Ensemble of Subset Online Sequential Extreme Learning Machine (ESOS-ELM) (Mirza et al., 2015) is another notable explicit approach for drifting class imbalanced data stream learning. It uses a sub-ensemble method to train each base learner with an approximately equal number of majority and minority class examples, thus dealing with class imbalance. To deal with concept drift, it uses a threshold-based strategy with hypothesis testing to detect any significant change in the predictive performance of the main ensemble, thus reporting concept drift. Meanwhile, it also uses a weighted majority vote system, based on G-Mean, to adapt to any potential concept drift that could not be detected by the aforementioned method. ESOS-ELM’s sub-ensemble method is time efficient in dealing with class imbalance as it does not replicate or synthesise any examples. However, it does not provide additional information to explore the decision areas of the minority class. Besides, ESOS-ELM is restrictive in terms of base learner type: it only allows ELMs to be used.

Cost-sensitive Adaptive Random Forest (CSARF) (Loezer et al., 2020) is an online, cost-sensitive sub-ensemble method designed to address the challenges of drifting class imbalanced data streams. It is a variant of the Adaptive Random Forest (ARF) (Gomes et al., 2017) algorithm. It incorporates a drift detector and a weighted majority ensemble to handle concept drift. To deal with class imbalance, CSARF utilises the Matthews Correlation Coefficient (MCC), a class imbalance insensitive metric, to assign weights to internal decision trees and ensure that all trees are trained with examples from the minority class (Loezer et al., 2020). While CSARF offers speed and memory efficiency due to its cost-sensitive approach, it fails to consider factors related to data difficulty. Additionally, CSARF is limited to using only the Hoeffding Tree (Domingos & Hulten, 2000) as base learners.

Robust Online Self-Adjusting Ensemble (ROSE) (Cano & Krawczyk, 2022) is a cost-sensitive ensemble method designed specifically for learning from drifting class imbalanced data streams. It employs ADWIN as a drift detector and uses a weighted majority ensemble to handle concept drift. To address class imbalance, ROSE employs self-adjusting \(\lambda\) bagging (where \(\lambda\) is determined based on estimated class sizes), and enforces the Hoeffding bound to improve predictive performance in the minority class. Furthermore, ROSE maintains sliding windows per class to store the most recent examples and to create a class balanced data set through undersampling. This class balanced data set is used to build new background base learners. However, similar to CSARF, ROSE does not consider data difficulty factors in its class imbalance adaptation strategy. Additionally, ROSE’s strategy for building new background base learners may be vulnerable to more extreme levels of class imbalance in non-stationary data streams: in such a scenario, very old minority class examples would have to be used to build new base learners, and the sliding window would initially take time to fill with minority class examples.

In summary, most existing explicit approaches to deal with class imbalance and concept drift do not explore the decision areas of the minority class. Whilst a few recent works (Bernardo et al., 2020, 2021; Bernardo & Valle, 2021) attempted to fill this research gap, they did not strictly take decision boundaries and data difficulty factors, which are crucial in data stream learning, into consideration.

2.2.2 Implicit approaches

Implicit approaches are usually ensemble learners. They do not actively detect concept drift but continuously update the voting weights of the base learners, thus adapting to any potential changes in the underlying concept. However, in class imbalanced data stream learning, the weighting strategy also needs to consider that the base learners may be biased towards the majority class. To address this issue, one can place a higher penalty on the weight of the base learners performing poorly in the minority class (cost-sensitive approach). Another method is to employ a resampling method to reduce the learning bias (data-level approach).

Oversampling-based and Undersampling-based Online Bagging (OOB and UOB) (Wang et al., 2015) are two pioneering data-level approaches for class imbalanced data streams. Their idea is to incorporate random oversampling or random undersampling into Online Bagging (OB) (Oza, 2005). They estimate the current class sizes using an exponential smoothing function with a fading factor \(\theta\). Whenever a new example \(s_t\) with a class label \(y_t\) arrives, it is first used to calculate the class imbalance ratio of class \(y_t\) to the majority class (OOB) or the minority class (UOB). This ratio is used as the parameter \(\lambda\) of the Poisson distribution in OB, thus deciding the number of times to train each ensemble member on \(s_t\). While OOB and UOB are effective in addressing class imbalance with simple resampling methods, they can only deal with concept drifts that affect the prior probability of the classes (P(Y)).
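The mechanism can be sketched as follows for the oversampling variant. The exact update rule is not spelled out in the text above, so this sketch assumes a standard exponential smoothing of the form \(w \leftarrow \theta w + (1-\theta)[y_t = c]\) and sets \(\lambda\) to the size of the largest class divided by the size of class \(y_t\), so that majority class examples get \(\lambda = 1\). The function names and the Poisson sampler (Knuth's multiplication method) are illustrative choices, not the authors' code.

```python
import math
import random


def update_sizes(sizes, label, theta):
    """Exponentially smoothed class sizes (assumed update rule)."""
    return {
        c: theta * w + (1.0 - theta) * (1.0 if c == label else 0.0)
        for c, w in sizes.items()
    }


def oob_lambda(sizes, label):
    """Poisson parameter for oversampling: largest class size / own size."""
    if sizes[label] <= 0.0:
        return 1.0
    return max(sizes.values()) / sizes[label]


def poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


rng = random.Random(0)
sizes = {0: 0.5, 1: 0.5}
for label in [0, 0, 0, 0, 1]:  # class 1 is rare in this toy stream
    sizes = update_sizes(sizes, label, theta=0.9)
    lam = oob_lambda(sizes, label)
    times = poisson(lam, rng)  # how often one ensemble member trains on s_t
```

Because \(\lambda\) exceeds 1 only for the currently smaller class, minority class examples are, on average, presented to each ensemble member more often than majority class examples.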

Learn++ for Concept Drift with SMOTE (Learn++.CDS) and Learn++ for Non-stationary and Imbalanced Environments (Learn++.NIE) (Ditzler & Polikar, 2013) are two pioneering batch-based approaches in this category. They are both based on the well-known approach Learn++ for Non-Stationary Environments (Learn++.NSE) (Elwell & Polikar, 2011). Learn++.CDS uses SMOTE to balance the most recent batch of training data, while Learn++.NIE is a sub-ensemble method which bootstraps the majority class in the most recent batch of training examples to create different class balanced training sets. They both use a weighted majority vote as a strategy to deal with concept drift, where ensemble members performing well in the minority class have a higher weight. While they are both effective methods to deal with class imbalance, they could struggle when the data stream is severely class imbalanced because some training batches may contain no minority class examples.

Dynamic Weighted Majority for Imbalance Learning (DWMIL) (Lu et al., 2017) brought the renowned Dynamic Weighted Majority (DWM) into class imbalanced data stream learning. In general, it changes the weighting metric from accuracy to a class imbalance insensitive metric, such as G-Mean, while adopting UnderBagging (Wang & Yao, 2009), which is an offline learning approach, as the base learner to deal with class imbalance.
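To see why G-Mean is preferred over accuracy here, recall that G-Mean is the geometric mean of the recalls of the two classes. The sketch below compares both metrics on a toy confusion matrix with 10 minority and 990 majority examples; the counts and function names are made up for illustration.

```python
import math


def accuracy(tp, fn, tn, fp):
    """Fraction of all examples classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def g_mean(tp, fn, tn, fp):
    """Geometric mean of minority class recall and majority class recall."""
    recall_minority = tp / (tp + fn) if tp + fn else 0.0
    recall_majority = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(recall_minority * recall_majority)


# Model A predicts the majority class for every example.
acc_a = accuracy(tp=0, fn=10, tn=990, fp=0)
gm_a = g_mean(tp=0, fn=10, tn=990, fp=0)

# Model B finds 8 of the 10 minority examples at the cost of 99 false alarms.
acc_b = accuracy(tp=8, fn=2, tn=891, fp=99)
gm_b = g_mean(tp=8, fn=2, tn=891, fp=99)
```

Model A has the higher accuracy even though it never detects the minority class, whereas its G-Mean is zero. A weighting scheme based on G-Mean would therefore favour model B.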

Heuristic Updatable Weighted Random Subspaces with Instance Propagation (HUWRS.IP) (Hoens & Chawla, 2012) is a batch-based learning approach to deal with drifting class imbalanced data streams. It is based on HUWRS (Hoens et al., 2011), which was proposed to learn class balanced data streams. The main novelty of HUWRS.IP is the example selection mechanism, called Instance Propagation (IP), which selects relevant past minority class examples for oversampling the most recent training batch. However, such relevant examples may not exist in memory.

In summary, existing implicit approaches to class imbalanced data stream learning either rely on sub-ensemble methods or reuse relevant past examples. These methods do not explore the decision areas of the minority class. They do not take data difficulty factors into account either. Besides, most of these approaches are batch-based, thus they are unlikely to react to concept drift swiftly due to the need to wait for whole batches to arrive.

3 Proposed approach

To answer RQ1 posed in Sect. 1, we propose a novel approach called Synthetic Minority Oversampling based on stream Clustering (SMOClust). The main novelty of this approach is to produce synthetic minority class examples for oversampling based on the information compressed by the stream clustering method. Most stream clustering methods represent this information in the form of micro-clusters, which summarise the statistics of past examples that are close together in the feature space. These statistics usually include the vectors of the dimension-wise cumulative sum and squared sum. Thus, stream clustering methods do not need to cache all the past examples in memory. Most importantly, this strategy could potentially deal with gradual drift involving different data difficulty factors because stream clustering methods continuously update themselves to reflect the characteristics of the current underlying concept.
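A micro-cluster of this kind can be sketched as a small object that keeps only a count and two running sums per dimension. The class below is a simplified illustration under that assumption: the name `MicroCluster` and its methods are hypothetical, and other statistics that a real stream clustering method may maintain (for example, timestamps used to forget old micro-clusters) are left out.

```python
import math


class MicroCluster:
    """Summary of nearby examples using only a count and running sums."""

    def __init__(self, dim):
        self.n = 0
        self.linear_sum = [0.0] * dim   # dimension-wise cumulative sum
        self.squared_sum = [0.0] * dim  # dimension-wise cumulative squared sum

    def add(self, x):
        """Absorb one example; the example itself is not stored."""
        self.n += 1
        for d, value in enumerate(x):
            self.linear_sum[d] += value
            self.squared_sum[d] += value * value

    def centre(self):
        """Mean of the absorbed examples in each dimension."""
        return [s / self.n for s in self.linear_sum]

    def std(self):
        """Per-dimension standard deviation, from E[x^2] - E[x]^2."""
        return [
            math.sqrt(max(sq / self.n - (s / self.n) ** 2, 0.0))
            for s, sq in zip(self.linear_sum, self.squared_sum)
        ]


mc = MicroCluster(dim=2)
mc.add([1.0, 2.0])
mc.add([3.0, 6.0])
```

The memory footprint depends on the number of dimensions, not on the number of examples absorbed, and the centre and spread of the cluster can still be recovered from the sums.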

SMOClust also employs a concept drift detector to monitor the predictive performance of the base learner, as a strategy to deal with abrupt drift. Thus, it is an explicit concept drift adaptation approach. Upon drift detection, the base learner will be reset. Although this strategy may not always be ideal (Chiu & Minku, 2018, 2022), this work focuses on investigating the effectiveness of the novel stream clustering based synthetic minority oversampling strategy in learning class imbalanced data streams with concept drift. Therefore, the other components of SMOClust are intentionally kept simple so that the characteristics of the proposed strategy can be analysed.

Algorithm 1 presents pseudo-code giving an overview of SMOClust. The details of its working mechanism are described and explained as follows.

Algorithm 1 Synthetic minority oversampling based on stream clustering (SMOClust)

SMOClust is a data stream learning algorithm that uses a base learner B to learn from new examples and to make predictions for them. This base learner B could be any single learner, such as Hoeffding Tree (Domingos & Hulten, 2000), or an ensemble learner, such as Online Bagging (Oza, 2005). SMOClust does not store past models. It uses stream clustering methods SC[] to manage sets of micro-clusters that compress the information of past examples. There is one stream clustering method for each class of the problem (line 1, Algorithm 1). Any stream clustering method from the literature can be used, such as Clustream (Aggarwal et al., 2003), StreamKM++ (Ackermann et al., 2012), DenStream (Cao et al., 2006) or Clustree (Kranen et al., 2011). In this work, Clustream was chosen because it is largely invariant to different types of concept drift, meaning that it can effectively adapt to concept drift without compromising the quality of its clustering results (Moulton et al., 2018). The strategy of synthesising minority class examples for oversampling based on micro-clusters is explained in Sect. 3.1.

The most recent example \(s_t\) is first used for concept drift detection (line 4, Algorithm 1). Any concept drift detection method from the literature can be used, such as DDM (Gama et al., 2004), DDM-OCI (Wang et al., 2013), PAUC-PH (Brzezinski & Stefanowski, 2014) or ADWIN (Bifet & Gavalda, 2007). Upon drift detection, the base learner B and the time decay class sizes are reset, but the stream clustering methods SC[] are not, because they are prepared to adapt to concept drifts (line 5, Algorithm 1). As a result, after concept drift detection, the stream clustering methods will still retain some knowledge belonging to the previous concept. This has two advantages: (1) In the case of false positive drift detection, SMOClust can exploit the knowledge stored in the micro-clusters to train the base learner. (2) Knowledge of the pre-drift concept could help to learn the post-drift concept, especially when the drift has low severity (Minku & Yao, 2012).

After that, SMOClust uses \(s_t\) to train B and to update the time decay class sizes (line 7, Algorithm 1). The time decay class sizes estimate the current minority class and thus determine the oversampling rate. Equation 1 presents the calculation of the normalised class size of class \(c_m\) at time step t (Wang et al., 2015):

$$classSize(c_m)^{(t)} = \begin{cases} \frac{1}{|M|}, & \text{if } t=f \\ \frac{[c_{s_t} = c_m]+\theta \times classSize(c_m)^{(t-1)} \times (t-f)}{t-f+1}, & \text{otherwise} \end{cases} \tag{1}$$

where \(m \in M\) and \(M = \{0,1\}\) for binary classification tasks, and \(\theta\) \((0<\theta <1)\) is a predefined time decay factor. \(c_{s_t}\) is the true class of \(s_t\). Thus, \([c_{s_t} = c_m]=1\) if the true class of \(s_t\) is \(c_m\), otherwise 0. f is the first time step used in the calculation. Note that, unlike OOB and UOB (Wang et al., 2015), which estimate the current class sizes of the data stream, SMOClust estimates the class imbalance degree of the information seen by the base learner rather than the class imbalance degree of the data stream. Thus, synthetic examples are also used to update the class sizes. The reason behind this design is discussed together with the strategy of training the base learner with synthetic examples.
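Equation 1 translates directly into code. The sketch below is a plain transcription for illustration; the function names `class_size` and `run`, the default \(\theta = 0.9\), and the toy label sequences are assumptions, and the reset on drift detection is not modelled.

```python
def class_size(prev_size, is_class, theta, t, f, num_classes=2):
    """One update of Equation 1 for a single class c_m."""
    if t == f:
        # First time step used in the calculation: 1 / |M|.
        return 1.0 / num_classes
    indicator = 1.0 if is_class else 0.0  # [c_{s_t} = c_m]
    return (indicator + theta * prev_size * (t - f)) / (t - f + 1)


def run(labels, theta=0.9, f=0):
    """Apply Equation 1 to both classes over a sequence of class labels."""
    sizes = {0: 0.0, 1: 0.0}
    for t, label in enumerate(labels, start=f):
        sizes = {
            c: class_size(sizes[c], label == c, theta, t, f) for c in sizes
        }
    return sizes


first = run([0])      # both classes start at 1/|M| = 0.5
second = run([0, 1])  # class 1: (1 + 0.9 * 0.5 * 1) / 2 = 0.725
```

In SMOClust the label sequence would contain both real and synthetic examples, since, as noted above, the class sizes track what the base learner has been trained on.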

SMOClust first records the most recent example from each class (line 8, Algorithm 1), then checks whether the base learner has learnt from both classes equally (line 10, Algorithm 1). If not, SMOClust will generate synthetic minority class examples for oversampling based on the micro-clusters of the minority class (lines 13-17, Algorithm 1), which is detailed in Sect. 3.1.

In the case that not all stream clustering methods can provide micro-clustering results and SMOClust has observed and recorded the most recent “real” example of the minority class (denoted as \(s_{last\_minority}\)), SMOClust will generate a synthetic minority class example by adding Gaussian noise to \(s_{last\_minority}\) for oversampling (line 21, Algorithm 1). This strategy follows the strategy proposed by Song et al. (2018), except that SMOClust treats ordinal attributes as categorical attributes due to a limitation in MOA (Bifet et al., 2010).

Regardless of whether the synthetic minority class example is generated based on micro-clusters or Gaussian noise, SMOClust will use it to train the base learner and the corresponding stream clustering method, and to update the class sizes immediately (lines 18-19 and 22-23, Algorithm 1). This strategy can prevent the base learner from becoming biased towards the majority class when no “real” minority class examples arrive for a long period, which is likely to happen when the data stream is extremely class imbalanced. Also, updating the class sizes with both “real” and synthetic examples allows us to estimate whether the base learner has learnt from both classes equally. If not, SMOClust will then create synthetic minority class examples to train the base learner immediately.

If none of the above conditions is satisfied, i.e., none of the conditions of the while-loop are satisfied (line 10, Algorithm 1), SMOClust will not perform any oversampling because this means that either oversampling is not needed or there is no information about the minority class from which SMOClust could generate synthetic examples. Lastly, a copy of the most recent example \(s_t\) is converted to a suitable format to train the stream clustering method corresponding to the class value of \(s_t\) (line 26, Algorithm 1).

3.1 Generating a synthetic minority class example for oversampling using micro-clusters

This section presents the overview of generating a synthetic minority class example for oversampling based on micro-clusters. The general idea is to create synthetic minority class examples in one of the dense areas of the minority class. In this way, we can consolidate the knowledge learnt in the existing minority class areas without being greatly affected by noise. In the case where a dense area does not exist, SMOClust will pick one of the past minority class areas to explore the decision boundary around it.

Algorithm 2 presents the pseudo-code of this method. The details of generating a synthetic minority class example using micro-clusters can be described as follows.

Algorithm 2 Generate synthetic instance with k-NN micro-clusters

First of all, SMOClust randomly takes one of the micro-clusters from the minority class as an anchor (denoted as \(mc_{anchor}^{minority}\)) (line 12, Algorithm 1). Micro-clusters that were created recently, or that are updated frequently and recently, have a higher chance of being chosen as this anchor. After that, SMOClust checks whether \(mc_{anchor}^{minority}\) is surrounded by micro-clusters from the same class (line 13, Algorithm 1). If this condition is satisfied, SMOClust considers this area dense enough to create synthetic minority examples for oversampling. It will then make a copy of \(mc_{anchor}^{minority}\) and combine it with its k-nearest micro-clusters (based on hull distance) in class \(class_{min}\) to form a temporary micro-cluster \(mc_{temp}\) (line 2, Algorithm 2). We denote this set of k-nearest micro-clusters as \(MC^{kNN,minority}\); thus, \(|MC^{kNN,minority}|=k\) and each k-nearest micro-cluster is denoted as \(mc_i^{kNN,minority}\in MC^{kNN,minority}\). The details of how to combine a set of micro-clusters into one are presented in Algorithm 3.

Algorithm 3 Combining a set of micro-clusters into one

To combine a set of micro-clusters into one, we first need to calculate the new centre \(c_{new}\) of the resulting micro-cluster \(mc_{temp}\). This can be achieved by taking the weighted average of the centres of the original set of micro-clusters, dimension by dimension (line 2, Algorithm 3). After that, we set the radius \(r_{new}\) of the resulting micro-cluster \(mc_{temp}\) to the distance from the new centre to the farthest hull (boundary) among all the original micro-clusters (lines 3-7, Algorithm 3). Figure 1 illustrates an example of combining \(mc_{anchor}^{minority}\) with its 3-nearest neighbours into one micro-cluster.
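The combination step can be sketched as follows. This is only an illustrative reading of the description above, not a transcription of Algorithm 3: the tuple layout `(centre, radius, weight)`, the use of the micro-cluster weight in the weighted average, and the interpretation of the farthest hull as "distance between centres plus radius" are assumptions of this sketch.

```python
import math


def combine_micro_clusters(clusters):
    """Merge micro-clusters given as (centre, radius, weight) tuples.

    Returns (new_centre, new_radius).
    """
    total_weight = sum(w for _, _, w in clusters)
    n_dims = len(clusters[0][0])
    # Weighted average of the centres, one dimension at a time.
    new_centre = [
        sum(w * centre[d] for centre, _, w in clusters) / total_weight
        for d in range(n_dims)
    ]
    # The farthest hull point of a sphere, seen from new_centre, lies at
    # (distance between the centres) + (that sphere's radius).
    new_radius = max(
        math.dist(new_centre, centre) + radius for centre, radius, _ in clusters
    )
    return new_centre, new_radius


# Hypothetical anchor and its 3 nearest neighbours in two dimensions.
anchor = ([0.0, 0.0], 1.0, 1.0)
neighbours = [([2.0, 0.0], 1.0, 1.0), ([0.0, 2.0], 0.5, 1.0), ([2.0, 2.0], 0.5, 1.0)]
centre, radius = combine_micro_clusters([anchor] + neighbours)
```

With equal weights, the new centre is the plain average of the four centres, and the resulting radius is large enough to enclose every original micro-cluster.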

Fig. 1 Illustration of Combining \(mc_{anchor}^{minority}\) with 3-nearest neighbours into one micro-cluster

A synthetic minority class example will then be generated by sampling from this resulting micro-cluster, with the highest chance near the centre of \(mc_{anchor}^{minority}\) (line 3, Algorithm 2). Figure 2 illustrates an example of sampling a synthetic minority class example from \(mc_{temp}\).

Fig. 2 Illustration of Sampling a Synthetic Minority Class Example from \(mc_{temp}\) (Color figure online)

In Fig. 2, the green circles are the micro-clusters belonging to the minority class while the blue circles are the micro-clusters belonging to the majority class. The black circle represents \(mc_{temp}\) and the red dashed lines are the contours of the probability density function used to sample a point. The closer a point is to the centre of \(mc_{anchor}^{minority}\), the higher the probability.

The reason for sampling a new synthetic minority class example close to \(mc_{anchor}^{minority}\) is that \(mc_{temp}\) could overlap with micro-clusters from the other class. If we just sampled from \(mc_{temp}\) uniformly at random or by a multivariate Gaussian distribution with its mean at \(c_{new}\), we would have a high chance of sampling a point that is close to the region of the majority class. Therefore, sampling points as synthetic minority class examples from \(mc_{temp}\) but close to the centre of \(mc_{anchor}^{minority}\) can reduce the risk of generating noisy examples while maintaining the ability to explore this dense region of the minority class.

Although Fig. 2 only illustrates an example in two-dimensional feature space, this idea can be applied to any multi-dimensional space. This sampling strategy is further detailed in Sect. 3.2.

If \(mc_{anchor}^{minority}\) is not surrounded by micro-clusters belonging to the same class, SMOClust will generate a synthetic minority class example by performing multivariate Gaussian sampling inside \(mc_{anchor}^{minority}\) (line 16, Algorithm 1). For example, this will be the case when \(mc_{(i+2)}\) (top right green circle in Fig. 2) is chosen as \(mc_{anchor}^{minority}\). The mean of the multivariate Gaussian distribution is the centre of \(mc_{anchor}^{minority}\) and the standard deviation is set to a third of the radius of \(mc_{anchor}^{minority}\) (radius/3). In other words, the boundary of \(mc_{anchor}^{minority}\) is set at three standard deviations (standard score = 3) from the centre. Therefore, along any single dimension, about 99.7% of the probability mass lies within the radius of \(mc_{anchor}^{minority}\), so sampled points concentrate inside it. A Gaussian distribution was chosen rather than a uniform distribution for sampling in \(mc_{anchor}^{minority}\) because \(mc_{anchor}^{minority}\) could partly overlap with the majority class region. Therefore, sampling a new point as a synthetic minority class example close to the centre of \(mc_{anchor}^{minority}\) is a safe strategy.
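The fallback sampling inside \(mc_{anchor}^{minority}\) can be sketched as below. The sketch assumes an isotropic Gaussian, i.e., independent dimensions that all share the standard deviation radius/3; the function name `sample_inside_anchor` and the centre and radius values are hypothetical.

```python
import random


def sample_inside_anchor(centre, radius, rng):
    """Draw one point from an isotropic Gaussian centred on the anchor."""
    sigma = radius / 3.0  # the hull sits three standard deviations from the centre
    return [rng.gauss(mu, sigma) for mu in centre]


rng = random.Random(0)  # fixed seed, for repeatability only
points = [sample_inside_anchor([0.5, -0.2], 0.3, rng) for _ in range(2000)]
```

Because the density decays quickly away from the centre, most points land well inside the micro-cluster, away from any overlap with the majority class region near its hull.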

3.2 Sampling from a micro-cluster with the highest probability at a designated location

This section presents the strategy for sampling points from the temporary micro-cluster \(mc_{temp}\), which is formed by combining \(mc_{anchor}^{minority}\) and \(mc_i^{kNN,minority}\in MC^{kNN,minority}\), with the highest probability at the centre of \(mc_{anchor}^{minority}\). The general idea is to sample random points that are inside \(mc_{temp}\) and are likely to be close to the centre of \(mc_{anchor}^{minority}\). The pseudocode of this sampling strategy is presented in Algorithm 4. Figure 3 illustrates the steps of this sampling strategy, which can be explained as follows.

Algorithm 4 Sampling from a hyper-sphere by skewed Gaussian with the maximum of the probability density function at a designated location

Fig. 3 Illustration of Sampling from \(mc_{temp}\) (Color figure online)

Let us first denote the micro-cluster \(mc_{temp}\) as \(HS_\beta\), which is a hyper-sphere with radius r centred at \(\beta =(\beta _1,\beta _2,\beta _3,\ldots ,\beta _n)\), where n is the number of dimensions of the input space of the problem. The equation of this hyper-sphere is:

$$\begin{aligned} \sum ^n_{i=1}(x_i-\beta _i)^2=r^2 \end{aligned}$$
(2)

Let us also denote the centre of \(mc_{anchor}^{minority}\) as \(\alpha ^{(1)}=(\alpha ^{(1)}_1,\alpha ^{(1)}_2,\alpha ^{(1)}_3,\ldots ,\alpha ^{(1)}_n)\) (the black dot in Fig. 3a), which should always be inside \(HS_\beta\). First of all, we need to pick a random direction from \(\alpha ^{(1)}\) (Fig. 3a). This can be achieved by picking a point uniformly at random from a unit hyper-sphere centred at \(\alpha ^{(1)}\), using Muller’s method (Muller, 1959). We then denote this point as \(\alpha ^{(2)}=(\alpha ^{(2)}_1,\alpha ^{(2)}_2,\alpha ^{(2)}_3,\ldots ,\alpha ^{(2)}_n)\) (the red dot in Fig. 3a) (line 7, Algorithm 4). Points \(\alpha ^{(1)}\) and \(\alpha ^{(2)}\) define an infinitely long straight line in n-dimensional space (the line d in Fig. 3b), whose parameterised equation is:

$$\begin{aligned} x_i=\alpha ^{(1)}_i+t(\alpha ^{(2)}_i-\alpha ^{(1)}_i) \end{aligned}$$
(3)

where t is a scalar and \((\alpha ^{(2)}_i-\alpha ^{(1)}_i)\) is the direction vector. To find the intersections of this infinitely long line with the hull of \(HS_\beta\) (the blue and green dots in Fig. 3b), we can substitute Eq. 3 into Eq. 2:

$$\begin{aligned} \sum ^n_{i=1}((\alpha ^{(2)}_i-\alpha ^{(1)}_i)t+(\alpha ^{(1)}_i-\beta _i))^2=r^2 \end{aligned}$$
(4)

Let us denote \(\delta _i =\alpha ^{(2)}_i-\alpha ^{(1)}_i\) and \(\gamma _i=\beta _i-\alpha ^{(1)}_i\) (lines 10 and 11, Algorithm 4), then Eq. 4 becomes:

$$\begin{aligned} \sum ^n_{i=1}(\delta _it-\gamma _i)^2=r^2 \end{aligned}$$
$$\begin{aligned} \left(\sum ^n_{i=1}\delta ^2_i\right)t^2-2\left(\sum ^n_{i=1}\delta _i\gamma _i\right)t+ \left(\sum ^n_{i=1}\gamma ^2_i\right)-r^2=0 \end{aligned}$$
(5)

Let us denote \(A=\sum ^n_{i=1}\delta ^2_i\) (line 12, Algorithm 4), \(B=-2(\sum ^n_{i=1}\delta _i\gamma _i)\) (lines 13 and 16, Algorithm 4) and \(C=(\sum ^n_{i=1}\gamma ^2_i)-r^2\) (lines 14 and 17, Algorithm 4). Eq. 5 can then be solved using Bhaskara’s equation (the quadratic formula):

$$\begin{aligned} t=\frac{-B\pm \sqrt{B^2-4AC}}{2A} \end{aligned}$$

Here, we take only the positive root of t because it “follows” the direction vector, whereas the negative root points in the opposite direction (the direction is denoted by the arrows on line d in Fig. 3b), i.e.

$$\begin{aligned} t_{intercept}=\frac{-B+\sqrt{B^2-4AC}}{2A} \end{aligned}$$
(6)
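As a quick numerical check of Eqs. 5 and 6, the intercept can be computed as follows. The function name `t_intercept` is an assumption of this sketch; the hyper-sphere centred at (0,0) with radius 10 and the target point (−7,0) are the values used for Fig. 3d, while the two directions are chosen by hand.

```python
import math


def t_intercept(alpha1, alpha2, beta, r):
    """Positive root of Eq. 5 for the line through alpha1 and alpha2."""
    delta = [a2 - a1 for a1, a2 in zip(alpha1, alpha2)]
    gamma = [b - a1 for a1, b in zip(alpha1, beta)]
    A = sum(d * d for d in delta)
    B = -2.0 * sum(d * g for d, g in zip(delta, gamma))
    C = sum(g * g for g in gamma) - r * r
    # C <= 0 whenever alpha1 lies inside the hyper-sphere, so the root is real.
    return (-B + math.sqrt(B * B - 4.0 * A * C)) / (2.0 * A)


beta, r = [0.0, 0.0], 10.0
alpha1 = [-7.0, 0.0]
t_right = t_intercept(alpha1, [-6.0, 0.0], beta, r)  # direction (+1, 0)
t_left = t_intercept(alpha1, [-8.0, 0.0], beta, r)   # direction (-1, 0)
```

Moving right from (−7,0), the hull is reached at (10,0), i.e., after 17 units; moving left, it is reached at (−10,0) after 3 units.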

Substituting \(t_{intercept}\) into Eq. 3 yields the intersection of the line and the hyper-sphere in the direction of the direction vector (the blue dot in Fig. 3b). Thus, to sample points within \(HS_\beta\), we can simply sample a scalar \(t_{sample}\) between 0 and \(t_{intercept}\) (Fig. 3c) and substitute it into Eq. 3 to obtain the sampled point. As we want to sample this point with the highest chance at the target point \(\alpha ^{(1)}\), we can sample \(t_{sample}\) using a Gaussian distribution with mean = 0 and standard deviation = \(\frac{t_{intercept}}{3}\), i.e.

$$\begin{aligned} g\sim N\left(0,\left(\frac{t_{intercept}}{3}\right)^2\right) \\ t_{sample}=|g| \end{aligned}$$

At last, we substitute \(t_{sample}\) into Eq. 3 to obtain the sample point.

The reason for setting the standard deviation to \(\frac{t_{intercept}}{3}\) is that we want the sampled point to be within the micro-cluster. Yet, the probability density function of the Gaussian distribution has no bounds. Thus, we set \(t_{intercept}\) at a standard score of 3 (z-score = 3), such that about 99.7% of the area under the probability density function curve of the Gaussian distribution lies between \(-t_{intercept}\) and \(+t_{intercept}\). Also, we want \(t_{sample}\) to “follow” the direction vector (i.e. we are only interested in the line segment between the black and the blue dots on d in Fig. 3b); thus, we only accept the positive value of \(t_{sample}\).

Figure 3d presents a two-dimensional example of using the aforementioned strategy to sample points in a hyper-sphere centred at (0,0) with a radius of 10. The points have the highest probability of being sampled at (−7,0).
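The whole procedure of this section can be put together in a short sketch that uses the setting of Fig. 3d. It is an illustration under stated assumptions rather than the code behind Algorithm 4: the function name `sample_near_target`, the seed and the number of draws are arbitrary, and the random direction is obtained by normalising a vector of standard normal draws, which is the usual form of Muller’s method.

```python
import math
import random


def sample_near_target(target, centre, radius, rng):
    """One draw from the hyper-sphere (centre, radius), densest at `target`."""
    # Random direction: normalised vector of standard normal draws.
    norm = 0.0
    while norm == 0.0:
        v = [rng.gauss(0.0, 1.0) for _ in target]
        norm = math.sqrt(sum(x * x for x in v))
    delta = [x / norm for x in v]
    gamma = [c - a for a, c in zip(target, centre)]
    # Coefficients of Eq. 5; A = 1 because delta has unit length.
    B = -2.0 * sum(d * g for d, g in zip(delta, gamma))
    C = sum(g * g for g in gamma) - radius * radius
    t_hit = (-B + math.sqrt(B * B - 4.0 * C)) / 2.0  # Eq. 6
    # Half-normal step length with the hull at three standard deviations.
    t_sample = abs(rng.gauss(0.0, t_hit / 3.0))
    return [a + t_sample * d for a, d in zip(target, delta)]  # Eq. 3


rng = random.Random(1)  # arbitrary seed
centre, radius, target = [0.0, 0.0], 10.0, [-7.0, 0.0]
samples = [sample_near_target(target, centre, radius, rng) for _ in range(2000)]
```

Since the Gaussian is unbounded, a small fraction of draws (roughly 0.3%) can still fall outside the hull; this sketch leaves them as they are.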

4 Experiments to evaluate the predictive performance of SMOClust

This section presents the design of the experiments to evaluate SMOClust. The predictive performance of SMOClust was first compared against five existing approaches from the literature on artificial data streams of different types of drifts. This is to investigate for which types of drift SMOClust will be advantageous and disadvantageous, answering RQ2. SMOClust was then compared against the same set of existing approaches on real-world data streams to obtain a general idea of its performance in practical situations, answering RQ3. Massive Online Analysis (MOA) (Bifet et al., 2010) was chosen to be the experimentation platform. Section 4.1 presents the details of artificial and real-world data streams used in the experiments. Section 4.2 presents the detailed setup of the experiments, including the procedure of hyper-parameter tuning and the evaluation method used in the experiments.

4.1 Data streams

As discussed in Sects. 1 and 2, data difficulty factors play a crucial role in class imbalanced data stream learning with concept drift. Therefore, it is important to evaluate class imbalanced data stream learning approaches based on data streams with different data difficulty factors. In line with that, the artificial data stream generator proposed by Brzezinski et al. (2021) was adopted because it allows us to simulate concept drifts that affect different data difficulty factors, including the class imbalance ratio, movement of the clusters in the minority class, and the proportion of safe, borderline and rare minority class examples. We have generated a large variety of artificial data streams to avoid any bias in the evaluation and to enable us to understand the conditions under which SMOClust performs well and the conditions under which it fails, as well as the reasons for such behaviour.

Table 2 presents a summary of the artificial data streams used in the experiments. Each of them has five numerical input attributes \(\{x_i \in (-1,1)\}^5_{i=1}\) and a class label \(y_i \in \{0,1\}\). They all consist of 200k examples where concept drift happens gradually from time step 70k to 100k. The continuous movement of minority class sub-clusters in gradual drift scenarios creates a complex and dynamic environment for evaluation. We created thirty artificial data streams of each type with different random seeds. Each of the thirty streams is used to evaluate the data stream learning approaches in a single run. The evaluation method is detailed in Sect. 4.2.

Following the default setting by Brzezinski et al. (2021), when the artificial data stream has no drift or no modifier specified, it is: (1) class balanced, (2) composed of a single cluster representing class 1, uniformly surrounded by the examples of class 0, and (3) examples only appear in safe regions. When the data stream is class imbalanced, class 1 is the minority class while class 0 is the majority class.

Table 2 Summary of artificial data streams

As shown in Table 2, we considered four groups of drift from the work of Brzezinski et al. (2021) in this study. The first group (Imbalance ratio drift) considers concept drift affecting the class imbalance ratio only. The second group (Single factor drift with static imbalance ratio) considers data streams with a static class imbalance ratio in which the concept drift takes the form of one of five factors discussed by Brzezinski et al. (2021): splitting, moving or merging clusters, and decreasing the ratio of safe examples while increasing the ratio of borderline or rare examples. In the third (Double factor drift) and the fourth (Complex factor drift) groups, we have chosen ten artificial data streams (five for each group) with concept drift affecting two factors and a group of factors, respectively. These artificial data streams were chosen evenly across the lists of data streams with double factor drift and complex factor drift in Brzezinski et al. (2021), respectively. These lists were sorted by the average performance of the compared data stream learning approaches in their work. Thus, picking data streams evenly from these lists means that we cover scenarios with different difficulty levels.

As the analysis presented in Sect. 4.3 shows that SMOClust performed well in severely imbalanced data streams, we performed additional experiments with the aforementioned single factor drift streams under more severe static class imbalance ratios to further evaluate SMOClust in extreme cases. These additional severely class imbalanced artificial data streams are summarised in Table 3. Note that, although we reused the static imbalance ratio of 1% minority class examples, we used another set of random seeds when performing these additional experiments.

Table 3 Summary of single factor drift artificial data streams with severe imbalance ratio

Apart from experiments with artificial data streams, we also performed experiments with different real-world data streams to evaluate SMOClust in practical applications. These real-world data streams are summarised in Table 4 and their details are as follows.

Table 4 Summary of real-world data streams

The Luxembourg stream (Zliobaite, 2011) was constructed from the European Social Survey from 2002 to 2007. The classification task is to predict whether internet usage is high or low. The NOAA stream (Elwell & Polikar, 2011) contains weather records collected over five decades (1949–1999). These records include temperature, pressure, wind speed, precipitation and other weather-related events. The classification task is to predict whether it will rain the next day. The Ozone stream (Zhang et al., 2006) consists of air measurements collected from 1998 to 2004. The task is to predict the ozone level eight hours ahead of time. The PAKDD2009 stream (Theeramunkong et al., 2009) consists of private label credit card application records and the task is to decide whether a given application should be approved. The Forest Covertype (Covtype) stream (Blackard & Dean, 1999) contains cartographic information about forest cells of 30 \(\times\) 30 metres and the task is to predict the cover type of a given cell. The Covtype stream is originally a multi-class classification problem with seven forest cover types. To make it suitable for this study, it has been converted into seven binary classification streams, each of which takes one of the forest cover types as one class and combines the other forest cover types into the other class. The INSECTS streams (Souza et al., 2020) were constructed using a smart trap with optical sensors to collect flight data of three different species of insects in a non-stationary environment for around three months. The temperature of the data collection environment was controlled to simulate concept drifts. The INSECTS streams originally have six classes: three species of mosquitoes with two genders. We converted them into binary classification tasks by combining the classes belonging to the species ae-albopictus as the minority class and the rest of the classes as the majority class. Note also that Souza et al. (2020) originally proposed seven INSECTS streams, but we adopted only the six that contain concept drifts and left the INSECT-out-of-control stream unused as it does not contain any concept drift. The Amazon stream (Blitzer et al., 2007) comprises reviews of books, DVDs, electronics, and kitchen appliances. Reviews with a rating greater than 3 were labelled as positive. The objective is to discern whether a review has a rating above 3. The Twitter stream (Nakov et al., 2016) consists of labelled tweets about popular topics. The goal is to predict whether the sentiment of a given tweet is positive or negative.

To facilitate the analysis of the predictive performance of SMOClust, we also analysed the characteristics of the minority class of the real-world data streams, including the potential number of clusters, and the ratios of safe, borderline, rare and outlier examples. Note that we only analysed the portion of the real-world data streams used in the actual experiments, which excludes the first 10% of each original real-world data stream that was used for hyper-parameter tuning (see Sect. 4.2 for details of the hyper-parameter tuning procedure). The procedure of this analysis follows the methodology proposed by Brzezinski et al. (2021) and is described as follows.

The characteristics of each real-world data stream are estimated in successive batches of examples. Following Brzezinski et al. (2021), we used a batch size of 2000 examples for all data streams except Luxembourg, NOAA, Amazon, and Twitter, where a batch size of 200 was used as these data streams have fewer than 10,000 examples. The class imbalance ratio and the ratios of each minority class example type are estimated for each batch. It is worth noting that we analysed only class 1 because it is the global minority class of all the real-world data streams (see Table 4), even though this class could become the majority during certain periods of a data stream, e.g., when a concept drift affecting P(Y) temporarily swaps the roles of the majority and minority classes. The types of minority class examples were estimated using the method proposed by Napierala and Stefanowski (2015). This method first finds the k-nearest neighbours of each minority class example. Based on the class ratios among these k-nearest neighbours, it then categorises each minority class example as safe, borderline, rare, or outlier. Here, we followed Napierala and Stefanowski (2015) in adopting \(k=5\).
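The neighbourhood-based categorisation can be sketched as follows. The thresholds are an assumption of this sketch, since they are not listed above: they are the ones commonly quoted for \(k=5\) (four or five same-class neighbours: safe; two or three: borderline; one: rare; none: outlier), and any additional conditions of the original method are omitted. The function name and the toy data are hypothetical.

```python
import math


def minority_example_types(X, y, minority=1, k=5):
    """Label each minority example by the classes of its k nearest neighbours."""
    types = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority:
            continue
        # Brute-force neighbour search; adequate for batches of a few thousand.
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:k]
        same = sum(1 for j in neighbours if y[j] == minority)
        if same >= 4:
            types.append("safe")
        elif same >= 2:
            types.append("borderline")
        elif same == 1:
            types.append("rare")
        else:
            types.append("outlier")
    return types


# Toy one-dimensional batch: a compact minority cluster, a majority cluster,
# and one minority example isolated inside the majority cluster.
X = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5],
     [10.0], [10.1], [10.2], [10.3], [10.4], [10.5],
     [10.25]]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
types = minority_example_types(X, y)
```

The six clustered minority examples come out as safe and the isolated one as an outlier; the per-batch ratios then follow by counting each type.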

Following the procedure of Brzezinski et al. (2021), we also estimated the number of minority class clusters in each batch, using the affinity propagation algorithm (Frey & Dueck, 2007) and removing clusters with fewer than six minority class examples. The affinity propagation algorithm was run thirty times with different random seeds for each batch. The average estimated number of minority class clusters was then recorded.

Lastly, we reported the ranges of the aforementioned characteristics across the different batches and their medians in Table 5. Note that we only analysed the types of minority class examples and the potential number of clusters on batches that contain at least six (\(k+1\)) minority class (class 1) examples. This is to prevent always categorising the minority class examples as rare cases or outliers when the total number of minority class examples in the batch is extremely low. The number of batches with fewer than six minority class examples is reported in brackets in the third column of Table 5.

Table 5 Characteristics of real-world data streams (Values in the brackets are the median)

As shown in Table 5, the PAKDD2009 and NOAA streams usually present the largest number of clusters of minority class examples, with medians of twenty-eight clusters, meaning that the minority class is split into several clusters in these data streams. The INSECTS streams, with medians ranging from thirteen to sixteen clusters, usually present fewer clusters of the minority class than the PAKDD2009 and NOAA streams. The Luxembourg, Ozone, Covtype, Amazon and Twitter streams usually present the fewest clusters of the minority class, with medians ranging from zero to six.

As for the types of minority class examples, Table 5 shows that the Ozone, PAKDD2009, INSECTS, Amazon, and Twitter streams mainly consist of borderline, rare, and outlier minority class examples. The Luxembourg and NOAA streams mainly consist of safe and borderline minority class examples. Most Covtype streams mainly consist of safe minority class examples. Regarding the minority ratios, most of the streams have a small range, indicating that the potential concept drifts only affect P(Y) with mild severity. In contrast, the Covtype\(_{(\textrm{c}_{1}=\{1-6\})}\), Covtype\(_{(\textrm{c}_{1}=1)}\) and Covtype\(_{(\textrm{c}_{1}=2)}\) streams have a very large range, indicating that they potentially present severe concept drifts affecting P(Y). In particular, Covtype\(_{(\textrm{c}_{1}=2)}\) presents a large range of minority class ratios with a very small median (1%). This may indicate that the severe concept drifts affecting P(Y) could potentially be abrupt.

4.2 Experiment setup

This section presents the procedure of hyper-parameter tuning and experiments. The following are the approaches from the literature that were considered in this study and the reason behind the choice. All of these approaches are strict online approaches, which do not require storage of any past data, so that the comparisons are fair.

  • OOB(d) and UOB(d) (Wang et al., 2015): Baseline approaches that use simple oversampling or undersampling to deal with class imbalance in data stream learning.

  • OnlineUnderOverBagging(d) (oUnderOverB(d)) (Wang & Pineau, 2016): A simple existing approach which combines simple undersampling and oversampling for class imbalanced data stream learning. We slightly modified it to use time decay class sizes with the “oversampling” equation from OOB to control the resampling rate. We chose to adopt the “oversampling” equation from OOB because Wang and Pineau (2016) explicitly state that the resampling rate for OnlineUnderOverBagging should be greater than 1. In contrast, the “undersampling” equation from UOB produces a fractional number, which is not suitable in this context.

  • VFC-SMOTE (Bernardo et al., 2021): An existing approach which addresses class imbalance by generating synthetic minority class examples using histogram-based summaries of past examples.

  • SMOTE-OB (Bernardo & Valle, 2021): An existing approach which incorporates the class imbalance adaptation strategy of VFC-SMOTE into OnlineUnderOverBagging (Wang & Pineau, 2016).

  • OnlineOversampling(d) (oOS(d)): A variant of the proposed approach which always uses the most recently seen minority class example for oversampling. This approach is used as a baseline to support the investigation of when the proposed strategy of creating synthetic minority class examples for oversampling is advantageous/disadvantageous.

  • SMOGauNoise: A variant of the proposed approach inspired by Song et al. (2018), who proposed a data augmentation method for software effort estimation. SMOGauNoise has the same learning and prediction strategies as the proposed approach, but it always creates synthetic minority class examples for oversampling by adding Gaussian noise to the most recent minority class example. Note that this is the first time that the data augmentation method of Song et al. (2018) has been investigated in the context of classification problems.

Approaches followed by “(d)” are those that were not originally designed to handle concept drift. We used a wrapper to enable them to use a concept drift detector, and they are reset upon concept drift detection.

For the evaluation method, we modified the periodic holdout test for the experiments with artificial data streams. This modified periodic holdout test takes the data difficulty factors into consideration, including the position of the minority class clusters, the class imbalance ratio, and the proportions of borderline and rare examples. During a single run, the data stream learning approach was tested on a holdout test set \(B^{test}_t\) of m examples after training on every n examples. Its predictive performance in G-Mean was then recorded. The holdout test sets are class balanced and follow the same underlying joint probability distribution (concept) as the stream at the evaluation time step t, where \(t \bmod n = 0\), i.e., \(B^{test}_t \sim P_t(X,Y)\). At the end of the run, we summarised the performance across the stream by averaging the G-Mean values obtained on the test sets.
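For reference, G-Mean on a single holdout test set can be computed as in the sketch below, taking G-Mean as the geometric mean of the per-class recalls. The function name `g_mean` and the toy labels are illustrative; in the procedure above, this value would be computed at each evaluation time step and averaged over the run.

```python
import math


def g_mean(y_true, y_pred, classes=(0, 1)):
    """Geometric mean of the per-class recalls."""
    recalls = []
    for c in classes:
        n_c = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / n_c if n_c else 0.0)
    return math.prod(recalls) ** (1.0 / len(recalls))


# Hypothetical class-balanced holdout set of m = 8 examples.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
score = g_mean(y_true, y_pred)  # recalls 0.75 and 0.5

# A model that always predicts the majority class scores 0.
always_majority = g_mean(y_true, [0] * len(y_true))
```

Because the recalls are multiplied, an approach cannot obtain a high G-Mean by favouring one class at the expense of the other.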

For hyper-parameter tuning purposes, an additional artificial data stream was created. It also consists of 200k examples where the concept drift happens from time step 70k to 100k, but the class imbalance ratio and the drift behaviour were randomly selected from the set of all combinations of drift factors used by Brzezinski et al. (2021). We denote this data stream as the “hyper-parameter tuning stream”. The set of hyper-parameter values of each approach that led to the best ten-run average G-Mean across this stream was then used in the experiments. In the experiments, we adopted thirty runs rather than ten runs to reduce the effect of randomness on the results.

Experiments with real-world data streams follow a similar procedure. The first 10% of each real-world data stream was used for hyper-parameter tuning. Prequential evaluation was used because the underlying concepts are unknown in advance. The set of hyper-parameter values of each approach that led to the best ten-run average G-Mean across the first 10% of each real-world data stream was then used in the experiment on the corresponding data stream, which consists of the remaining 90% of examples. The time decayed G-Mean performance was sampled every 500 time steps, except for the NOAA, Ozone, Amazon and Twitter streams, where it was sampled every fifty time steps, and the Luxembourg stream, where it was sampled every ten time steps, because these streams have far fewer examples than the other data streams. Sampling at shorter intervals allows us to see how the performance of the approaches changes throughout these relatively short data streams. We adopted a time decay factor of 0.999 to make past predictive performance less important at the current time step. We recorded the thirty-run average G-Mean performance across each stream for evaluation and comparative analysis.
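One way to maintain a time decayed G-Mean in a prequential (test-then-train) loop is sketched below. The update rule is an assumption of this sketch, since it is not spelled out above: the recall of the true class is updated as decay × previous recall + (1 − decay) × [prediction correct], and it is initialised with the first outcome observed for that class. The class name `TimeDecayedGMean` is hypothetical.

```python
import math


class TimeDecayedGMean:
    """Prequential G-Mean based on exponentially decayed per-class recalls."""

    def __init__(self, classes=(0, 1), decay=0.999):
        self.decay = decay
        # None until the first example of the class has been seen.
        self.recall = {c: None for c in classes}

    def update(self, y_true, y_pred):
        hit = 1.0 if y_true == y_pred else 0.0
        prev = self.recall[y_true]
        if prev is None:
            self.recall[y_true] = hit
        else:
            self.recall[y_true] = self.decay * prev + (1.0 - self.decay) * hit

    def value(self):
        if any(r is None for r in self.recall.values()):
            return 0.0  # undefined until every class has been seen
        return math.prod(self.recall.values()) ** (1.0 / len(self.recall))


metric = TimeDecayedGMean(decay=0.5)  # small decay factor, only to make the effect visible
for y_true, y_pred in [(0, 0), (1, 1), (1, 0)]:
    metric.update(y_true, y_pred)
```

In a prequential run, `update` would be called with each prediction before the example is used for training, and `value` would be read at the sampling interval.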

At the end of the experiments, the predictive performance of the approaches was compared by different concept drift data difficulty factors. The corresponding rankings in the groups were then presented. The Friedman test with a level of significance of 0.05 was applied to each group to check whether there is any statistically significant difference between the predictive performance of different approaches. If there is, the Nemenyi post-hoc test was used to determine which approaches performed significantly differently from the top-ranked approach. In the statistical tests, each group corresponds to a data stream learning approach while each observation within a group corresponds to the average predictive performance across a given data stream in a single run. The thirty-run average predictive performance of the approaches is also reported to help us analyse the margin of the performance difference.
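The ranking step can be illustrated with a small sketch that computes the average Friedman ranks and the Friedman chi-square statistic (without tie correction). The scores below are made up for illustration, the function names are hypothetical, and the Nemenyi post-hoc test is not included.

```python
def average_ranks(scores):
    """scores[i][j]: performance of approach j on observation i (higher is better).

    Returns the mean rank of each approach, rank 1 being the best; tied
    approaches share the average of the ranks they span.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2.0 + 1.0  # average rank of the tied block
            for j in order[pos:end + 1]:
                rank_sums[j] += shared
            pos = end + 1
    return [s / n for s in rank_sums]


def friedman_statistic(scores):
    """Friedman chi-square statistic with k - 1 degrees of freedom."""
    n, k = len(scores), len(scores[0])
    mean_ranks = average_ranks(scores)
    return 12.0 * n / (k * (k + 1)) * sum(r * r for r in mean_ranks) - 3.0 * n * (k + 1)


# Made-up G-Mean values: 4 observations (rows) for 3 approaches (columns).
scores = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.60],
    [0.70, 0.90, 0.60],
    [0.95, 0.90, 0.50],
]
ranks = average_ranks(scores)
statistic = friedman_statistic(scores)
```

For this toy input, the statistic exceeds 5.991, the chi-square critical value for two degrees of freedom at the 0.05 level, so the null hypothesis of equal performance would be rejected and a post-hoc test would follow.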

4.3 Results with artificial data streams

This section presents the analysis done to compare the predictive performance of SMOClust against existing approaches on artificial data streams which consider different drift difficulties in the minority class. General comparisons are first given based on the Friedman rankings of average G-Mean of the approaches grouped by different drift difficulty factors, presented in Table 6. It is then followed by a detailed analysis of the behaviour of SMOClust in representative cases where it performed better and worse than existing approaches in Sects. 4.3.1 and 4.3.2 respectively.

Table 6 Statistical (Friedman) Ranking of G-Mean on Artificial Streams Grouped by Factors

Table 6 shows that SMOClust was one of the top-ranked approaches when the data stream is extremely class imbalanced (minority class ratio: 1%), indicating that SMOClust handled extremely class imbalanced data stream better than most existing approaches, while it performed similarly to UOB and OnlineUnderOverBagging. However, SMOClust was one of the low-ranked approaches in the group of rare cases, indicating that it could not handle rare cases very well. For other groups, although SMOClust was not one of the top-ranked approaches, it usually performed similarly to mid-ranked approaches.

Friedman rankings only show the relative position of the approaches’ predictive performance; they do not provide any information about the margin of difference. To investigate how much better SMOClust performed in severely class imbalanced streams, and how much worse in other groups of factors, we further compared the approaches’ average G-Mean over thirty runs on each artificial data stream. The differences in average G-Mean are presented in the form of a heat-map in Fig. 4. Green cells indicate results favourable to SMOClust, whereas red cells indicate results favourable to the compared approach. For a comprehensive table of the predictive performance of the approaches, please refer to the supplementary document.
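As an illustration of the quantity behind each heat-map cell, the sketch below computes the G-Mean for binary classification (the geometric mean of the recalls of the two classes) and the difference in average G-Mean between two approaches. The labels, per-run values and the name OtherApproach are hypothetical, and the sign convention (positive values favour SMOClust) is an assumption made here for illustration.

```python
import math
from statistics import mean

def g_mean(y_true, y_pred):
    """Geometric mean of the recalls of class 0 and class 1.

    Assumes both classes appear at least once in y_true.
    """
    recalls = []
    for c in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return math.sqrt(recalls[0] * recalls[1])

# Toy predictions: recall on class 0 is 3/4, recall on class 1 is 1/2.
example = g_mean([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0])

# Predicting the majority class for every example gives a G-Mean of zero.
always_majority = g_mean([0, 0, 1], [0, 0, 0])

# Hypothetical per-run G-Mean values on one data stream (three runs only).
runs = {
    "SMOClust": [0.80, 0.82, 0.84],
    "OtherApproach": [0.78, 0.80, 0.79],
}

# Assumed convention: positive difference = favourable to SMOClust.
difference = mean(runs["SMOClust"]) - mean(runs["OtherApproach"])
```

Because the G-Mean multiplies the two recalls, an approach that ignores the minority class scores zero regardless of its accuracy on the majority class, which is why it is a suitable measure for class imbalanced streams.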

Fig. 4

Difference in Average G-Mean Against SMOClust on Class Imbalanced Artificial Data Streams Based on 30 Runs (Green cells indicate SMOClust performed better; Red cells indicate SMOClust performed worse; Grey horizontal lines separate different groups of data streams, i.e., StaticIm{30/10/1}, Imbalance Ratio Drift, Double Factor, and Complex Factor) (Color figure online)

Table 6 shows that SMOClust usually obtained lower rankings than other approaches in less severely class imbalanced data streams. However, Fig. 4 reveals that the margin of the under-performance was usually small, as saturated red cells are rare in the heat-map. In contrast, the high ranking achieved by SMOClust in the group of StaticIm1_{*} was supported by many saturated green cells in the StaticIm1 sector of Fig. 4, meaning that SMOClust performed much better than existing approaches in cases with a severe class imbalance ratio. Besides, Fig. 4 further confirms that SMOClust could not handle rare minority class examples very well, as cases involving Rare100 drift have many saturated red cells. In particular, OOB and OnlineUnderOverBagging handled rare minority class examples better than SMOClust.

One potential reason why SMOClust did not perform well in handling data streams with a large proportion of rare minority class examples is the conservative nature of the proposed synthetic example generation method, where most synthetic examples are generated in the dense area of the minority class. To address this, it might be helpful to generate synthetic examples in a more diverse manner. However, generating synthetic examples diversely can also introduce a significant amount of noise or even create artificial concept drifts. Moreover, it can be challenging to ensure that a certain area belongs to the minority class if there are no real minority class examples in that area. The proposed method is less prone to these risks and uncertainties, while overcoming the problems of existing work, which ignores data difficulty factors and relies on caching all (minority class) examples for synthetic minority class oversampling.

Comparing the predictive performance of SMOClust against UOB and OnlineUnderOverBagging in the group of StaticIm1_{*}, Table 6 shows that they performed similarly. Yet, the StaticIm1 sector of Fig. 4 reveals that SMOClust performed better than UOB by small margins (around 1–2% G-Mean, light green cell) in cases presenting concept drift of increasing rare minority class ratio, but worse than UOB by medium-small margins (around 3% G-Mean, light red cells) in cases presenting concept drift of moving and merging minority class clusters. SMOClust performed better than OnlineUnderOverBagging by medium-small margins (around 2–3% G-Mean, light green cells) in cases presenting a concept drift of splitting minority class clusters. However, it performed slightly worse than OnlineUnderOverBagging (around 1% G-Mean, light red cell) in cases presenting concept drift of merging minority class clusters. It also performed worse than OnlineUnderOverBagging by a large margin (around 7% G-Mean, saturated red cell) in the StaticIm1_Rare100 case. In short, SMOClust performed similarly to both UOB and OnlineUnderOverBagging in most StaticIm1 cases, except that OnlineUnderOverBagging performed much better in the StaticIm1_Rare100 case.

When comparing the predictive performance of SMOClust against two approaches that also summarise past knowledge to support the generation of synthetic examples (VFC-SMOTE and SMOTE-OB), Table 6 and Fig. 4 show that SMOClust performed better in most cases, especially in StaticIm1 cases. This indicates that the proposed synthetic minority oversampling strategy in SMOClust is superior.

Based on the aforementioned results, additional experiments were performed with the same set of single factor drift artificial data streams but enforced with extremely severe class imbalance ratios (minority class ratio \(0.3\%\) to \(5\%\), summarised in Table 3) to further evaluate if SMOClust can usually perform better than existing approaches in extremely class imbalanced data streams.

Table 7 presents the Friedman rankings of average G-Mean by groups of different drift difficulty factors on the severely class imbalanced artificial data streams. It shows that SMOClust can indeed achieve higher rankings when the class imbalance ratio is very severe (minority class ratio \(\le 1\%\)). Figure 5 presents the difference in average G-Mean (based on thirty runs) between the compared approaches and SMOClust on severely class imbalanced artificial data streams in the form of a heat-map with the same colour scheme as Fig. 4. Similarly, please refer to the supplementary document for a comprehensive table of the predictive performance of the approaches. Figure 5 supports the aforementioned deduction with many saturated green cells in the cases of minority class ratio \(\le 1\%\), indicating the superior performance of SMOClust. The exception here is the comparison against UOB, with the margin of under-performance increasing as the severity of the class imbalance ratio increases by case. When compared against OnlineUnderOverBagging, SMOClust generally performed better in cases other than Rare100 drift, with the margin of superior performance increasing as the severity of the class imbalance ratio increases by case.

Table 7 Statistical (Friedman) ranking of G-Mean on severely class imbalanced artificial streams grouped by factors

Figure 5 also confirms that SMOClust usually does not handle rare minority class examples very well, especially when compared against OOB, OnlineUnderOverBagging and SMOGauNoise. However, an extremely severe class imbalance ratio may give an advantage to SMOClust in dealing with Rare100 drift, as cases involving Rare100 present less saturated red cells when the class imbalance ratio is \(\le 1\%\). In particular, the case of StaticIm03_Rare100 presents a row of saturated green cells. Nevertheless, these results are consistent with previous results of experiments with less severely class imbalanced artificial data streams.

Fig. 5

Difference in Average G-Mean Against SMOClust on Severely Class Imbalanced Artificial Data Streams Based on 30 Runs (Green cells indicate SMOClust performed better; Red cells indicate SMOClust performed worse; Grey horizontal lines separate different groups of data streams, i.e., StaticIm{5/3/1/07/05/03}) (Color figure online)

Besides, Table 7 also shows that SMOClust could not achieve high rankings in the groups concerning minority class ratios of 5% and 3%. This may be due to the fact that the artificial data streams are long enough to contain quite a lot of minority class examples, even though the minority class ratios were low. Therefore, the advantage of SMOClust was not manifested. The sectors of StaticIm5 and StaticIm3 in Fig. 5 show that SMOClust usually performed slightly worse than most existing approaches, but it performed better than OnlineOversamplingd, VFC-SMOTE and SMOTE-OB.

Considering all cases in Fig. 5, we can see that, when the minority class ratio decreases, SMOClust usually had a smaller margin of performance reduction than other approaches, except UOB. This shows that the aggressive nature of undersampling may be generally more advantageous than oversampling when the number of minority class examples in the data stream is extremely low. Yet, we can still see from Fig. 5 that SMOClust performed better than UOB in most cases of Rare100 drift. This means that, when the minority class has an extremely low number of examples and is difficult to learn, SMOClust still has more advantage than undersampling. One reason could be the fact that the compared approaches focus on learning the most recent decision areas of both classes, whereas SMOClust was designed to reinforce its knowledge in past minority class decision areas. This means that SMOClust is likely to have a better generalisation on the sub-areas of the minority class than existing approaches.

In the following sections, representative cases were chosen to discuss why SMOClust performed better and worse than existing approaches respectively, providing a more detailed understanding of the results.

4.3.1 Cases where SMOClust performed better

This section discusses why SMOClust performed better than most other approaches in artificial data streams when the class imbalance ratio is extremely severe (minority class ratio \(\le 1\%\) throughout the stream). The StaticIm1_Move7 stream was chosen from Fig. 4 as the representative case to discuss the behaviour of SMOClust in detail.

As mentioned in Sect. 4.1, the artificial data streams have five input attributes and a class label. Therefore, it is difficult to visualise the learnt decision areas of the approaches and understand their behaviour in detail. Because of this, we created a version of the representative streams with two input attributes and a class label while preserving their characteristics, including the class imbalance ratio and the drift difficulty factors. Note that we only created a single copy of each two-dimensional representative stream, such that we can compare the data stream learning approaches with their median predictive performance in thirty runs on the same data stream. Also, the hyper-parameters of the approaches were tuned based on a separate random two-dimensional artificial data stream, following the procedure explained in Sect. 4.2.

Table 8 presents the approaches’ thirty-run average G-Mean on the two-dimensional version of the StaticIm1_Move7 stream. It shows that SMOClust performed the best. These results are slightly inconsistent with the results of the corresponding five-dimensional stream in Fig. 4, where SMOClust performed slightly worse than UOB but similarly to OnlineUnderOverBagging. Yet, in general, SMOClust still performed better than other approaches in both the two-dimensional and five-dimensional versions of the StaticIm1_Move7 stream. This may indicate that SMOClust tends to perform better in low-dimensional data streams. Nevertheless, the detailed analysis presented in the following paragraphs can still explain the characteristics of SMOClust and why it performed better than most other approaches in this representative case.

Table 8 30 Runs average G-Mean on two-dimensional version of representative artificial data streams where SMOClust performed better

Figure 6 presents the approaches’ predictive performance over the time steps of their median run (Footnote 4). To maintain readability, we omitted the predictive performance of OOBd, UOBd, oOSd, oUnderOverBd, VFC-SMOTE, and SMOTE-OB from Fig. 6, as their performance fluctuates significantly throughout the stream. For the comparison of SMOClust against these approaches, please refer to the supplementary document. Figure 6 shows that SMOClust performed the best in most time steps. In particular, SMOClust maintained a G-Mean of at least 50% on the class balanced holdout test sets during the concept drift (from 70k to 100k time steps) and recovered from the drift better than other approaches (the solid red line recovers rapidly from 100k time steps onwards). In contrast, other approaches usually dropped to around 0–20% G-Mean during the drift. This case shows the superior performance achieved by SMOClust in handling severely class imbalanced drifting data streams.

Fig. 6

Periodic Class Balanced Holdout Test G-Mean Against Time Steps in Two-Dimensional StaticIm1_Move7 (Color figure online)

Figures 7, 8 and 9 visualise the learnt decision areas of approaches at the time steps right before and after concept drift (70k and 100k time steps) and at the end (200k time steps) of the two-dimensional StaticIm1_Move7 stream respectively. The yellow and green regions represent their learnt decision areas of class 0 (majority class) and class 1 (minority class) respectively, while the red and blue dots are the class 0 (majority class) and class 1 (minority class) examples in the class balanced test set, corresponding to the time steps.
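For readers unfamiliar with this kind of plot, the following minimal sketch shows one way such decision areas can be obtained: the trained model is queried at the centre of every cell of a regular grid over the two-dimensional feature space, and each cell is then coloured by the predicted class. The function `toy_predict` is a hypothetical stand-in for a trained classifier, and the feature range [0, 1] and the grid resolution are assumptions; this is not the plotting code used for the figures in this work.

```python
def decision_grid(predict, lo=0.0, hi=1.0, steps=50):
    """Return a steps x steps grid of predicted labels over [lo, hi]^2.

    grid[row][col] holds the label predicted at the centre of the cell whose
    x-coordinate is indexed by col and whose y-coordinate is indexed by row.
    """
    width = (hi - lo) / steps
    centres = [lo + (i + 0.5) * width for i in range(steps)]
    return [[predict((x, y)) for x in centres] for y in centres]

def toy_predict(point):
    """Hypothetical model: predicts class 1 (minority) inside a small disc."""
    x, y = point
    return 1 if (x - 0.3) ** 2 + (y - 0.7) ** 2 <= 0.15 ** 2 else 0

grid = decision_grid(toy_predict)

# Share of the feature space that the model assigns to the minority class.
minority_share = sum(map(sum, grid)) / (len(grid) * len(grid[0]))
```

Overlaying the test examples on such a grid makes it possible to see at a glance which minority class sub-clusters fall inside the learnt minority class area and which are missed.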

Fig. 7

Decision Areas Against Class Balanced Test Set at 70k Time Steps (Before Drift) of Two-Dimensional StaticIm1_Move7 (Color figure online)

First of all, we compare the learnt decision areas of the approaches at the time steps right before concept drift (at 70k time steps). Figure 7 shows that OOB, OnlineOversampling, OnlineUnderOverBagging, SMOGauNoise and SMOClust had learnt decision areas which match the corresponding class balanced test set. This explains why they performed the best before the drift (0–70k time steps, Fig. 6). Figure 7i and l show that the learnt decision areas of SMOClust were similar to SMOGauNoise because they both have strategies to explore the minority class decision boundaries. The expansion by SMOClust was slightly more aggressive than SMOGauNoise, with some sub-areas linked together. Although the proposed synthetic minority oversampling strategy prioritises “safe” areas to generate synthetic minority class examples, the strategy of using synthetic examples to train the stream clustering methods may contribute to such aggressiveness in exploring the minority class decision boundaries.

Figure 7a and c show that OOB and OnlineOversampling learnt the most compact minority class decision areas because they reuse the existing minority class examples for oversampling. Figure 7d shows that the minority class decision areas of OnlineUnderOverBagging were slightly larger than those of OOB and OnlineOversampling. Particularly, there were two green areas linked together. This may be the result of using oversampling and undersampling together, which managed to cover the true minority class clusters while preserving some aggressiveness from undersampling. In contrast, Fig. 7b and f show that UOB and UOBd learnt a single cluster to aggressively cover most minority class areas, considering the small majority class areas in between as part of the minority class. This is likely to cost some predictive performance in the majority class. Thus, we can see that UOB and UOBd performed slightly worse than the other approaches before the concept drift (0–70k time steps, Fig. 6). However, Fig. 5 shows that UOB performed slightly better than SMOClust in the five-dimensional StaticIm1_Move7 stream, indicating that the aggressive nature of undersampling may be an advantage in learning the minority class when the feature space is sparse and presents very few minority class examples. When the feature space is more compact, the proposed strategy in SMOClust is more advantageous.

Considering OOBd, OnlineUnderOverBaggingd, VFC-SMOTE, and SMOTE-OB, Fig. 7e, h, j, and k show that their learnt minority decision areas were very small which only covered a small proportion of the true minority class areas. In the case of VFC-SMOTE, it predicted every example as majority class at 70k time steps. As previously mentioned, their predictive performance fluctuated a lot throughout the stream. So, it can be deduced that they were greatly affected by false-positive drift detections.

Over the next paragraphs, we compare the predictive performance and the decision boundaries of SMOClust against other approaches at the time steps right after concept drift (at 100k time steps) and at the end of the data stream (at 200k time steps), to understand how SMOClust handles concept drift of moving minority class sub-clusters when the data stream is severely class imbalanced.

Figure 6 shows that the predictive performance of SMOClust fluctuated during the drift (70k–100k time steps). Thus, it is likely that its base learner had been reset several times due to drift detection. Yet, it was the fastest approach to recover its predictive performance from the drift. Figure 8 presents the learnt decision boundaries right after the drift. It shows that SMOClust and SMOGauNoise made the best attempt at adapting to the drift. They were able to cover most minority class sub-clusters at the new position, especially SMOClust. The potential reason is that, although the base learner of SMOClust is reset upon drift detection, the stream clustering methods are not reset as they are expected to be drift adaptable. Therefore, SMOClust is more robust to incremental and gradual drifts than SMOGauNoise, explaining its rapid predictive performance recovery from the drift.

Fig. 8

Decision areas against class balanced test set at 100k time steps (After Drift) of two-dimensional StaticIm1_Move7 (Color figure online)

On the other hand, Fig. 8a, b, c, and d show that the learnt minority class decision areas of OOB, UOB, OnlineOversampling and OnlineUnderOverBagging mainly remained at the pre-drift position because they are not concept drift adaptable. Their concept drift adaptable counterparts, VFC-SMOTE, and SMOTE-OB did not handle the drift very well either. Figure 8e, f, g, h, j and k show that their learnt minority class decision areas only covered a few minority class sub-clusters at the post-drift position, which is likely because their base learners had been reset several times due to drift detection and they do not have any strategy to deal with incremental and gradual drifts. As a result, they struggled to recover their predictive performance from the drift, as shown in Fig. 6.

Lastly, we compare the learnt decision areas of the approaches at the end of the two-dimensional StaticIm1_Move7 stream. Figure 9 shows that OOBd, SMOGauNoise and SMOClust are the best approaches in converging to the post-drift position of minority class sub-clusters. In particular, a few green areas of SMOClust and SMOGauNoise were slightly less compact than OOBd, showing that SMOClust and SMOGauNoise had slightly better generalisation than OOBd.

Figure 9a, c, and d show that OOB, OnlineOversampling and OnlineUnderOverBagging managed to converge to the new concept after the drift. However, they also retained a small portion of green areas which correspond to the pre-drift position of the minority class. This shows that OOB, OnlineOversampling and OnlineUnderOverBagging can adapt to concept drift involving minority class sub-cluster movement. However, they required a longer period to adapt as they were hindered by the knowledge acquired pre-drift. Meanwhile, Fig. 9e, g, and h show that their concept drift adaptable counterparts adapted better, except OnlineUnderOverBaggingd. While resetting base learners helps to adapt to concept drift, the fact that OnlineUnderOverBaggingd partly uses undersampling in its strategy to deal with class imbalance led to some over-generalisation between the learnt minority class areas. UOB and UOBd use undersampling to deal with class imbalance, thus Fig. 9b and f show that they had the greatest over-generalisation due to the aggressive nature of undersampling. VFC-SMOTE and SMOTE-OB continued to struggle, as shown in Fig. 9j and k, because of frequent false-positive drift detections.

Fig. 9

Decision areas against class balanced test set at 200k time steps (End of Stream) of two-dimensional StaticIm1_Move7 (Color figure online)

Short Summary: Through the pre-drift analysis, the ability of SMOClust in handling stationary severely class imbalanced data streams presenting several minority class sub-clusters is validated. In particular, it shows that SMOClust was able to learn and explore the true decision boundaries even though the data stream presents very few minority class examples. The post-drift analysis shows that SMOClust was more robust than existing approaches in adapting to incremental and gradual drift involving minority class sub-cluster movement. Although most of the approaches converged to the new concept at the end of the data stream, SMOClust was the best and the fastest approach in recovering predictive performance from the drift. The inconsistent results between the two-dimensional and five-dimensional versions of this representative case indicate that SMOClust may be more advantageous in lower-dimensional data streams.

4.3.2 Cases where SMOClust performed worse

This section discusses the situations where SMOClust performed worse than other approaches, particularly in cases with concept drift leading to 100% rare minority examples. StaticIm10_Rare100 stream was chosen from Table 4 as the representative case to discuss the behaviour of SMOClust in detail. Following the method of analysis in Sect. 4.3.1, we also created a two-dimensional version of StaticIm10_Rare100 stream such that we can visualise and compare the learnt decision boundaries of the approaches to understand their behaviour.

Table 9 presents the approaches’ thirty-run average G-Mean on the two-dimensional StaticIm10_Rare100 stream. It shows that SMOClust performed better than most other approaches. Figure 10, showing the G-Mean of the approaches in their median run (Footnote 5) throughout the two-dimensional StaticIm10_Rare100 stream, also supports the results in Table 9. Note that, to improve readability, we have omitted the predictive performance of OOBd, UOBd, oOSd, oUnderOverBd, VFC-SMOTE and SMOTE-OB from Fig. 10, similar to Fig. 6, due to their values fluctuating significantly throughout the stream. For a comparison of SMOClust against these approaches, please refer to the supplementary document.

As these results are not consistent with the results of the five-dimensional StaticIm10_Rare100 stream, shown in Tables 6 and 7, we preliminarily checked whether using a different set of random seeds or picking another case that involves drift leading to 100% rare minority class examples would yield results that are consistent with Tables 6 and 7. Yet, the results still show that SMOClust performed similarly to or better than other approaches in the two-dimensional StaticIm10_Rare100 stream. Thus, in this analysis, we focus on why SMOClust can handle concept drift leading to 100% rare minority class examples better than other approaches when the data stream has only two dimensions, while attempting to deduce why it could not when the data stream has five dimensions.

Table 9 30 Runs average G-Mean on two-dimensional version of representative artificial data streams where SMOClust performed worse

Fig. 10

Periodic class balanced holdout test G-Mean against time steps in two-dimensional StaticIm10_Rare100 (Color figure online)

Figures 11, 12 and 13 visualise the learnt decision areas of the approaches at the time steps right before and after concept drift (70k and 100k time steps) and at the end (200k time steps) of the two-dimensional StaticIm10_Rare100 stream respectively. The yellow and green regions represent their learnt decision areas of class 0 (majority class) and class 1 (minority class) respectively, while the red and blue dots are the class 0 (majority class) and class 1 (minority class) examples in the class balanced test set which corresponds to the time steps.

Figure 10 shows that all approaches performed very well during the pre-drift period (0–70k time steps). Figure 11 reveals that this is because they learnt the decision boundary of the pre-drift concept very well, as the minority class was just a single cluster. While most approaches learnt an oval decision boundary, UOB, UOBd and SMOTE-OB learnt a rectangular one, which could be due to the use of undersampling. VFC-SMOTE learnt a peculiarly shaped decision boundary, which would cause more frequent false-positive drift detections. This may have been due to the considerable amount of noise in the minority class examples generated by VFC-SMOTE. Meanwhile, SMOTE-OB adopts the same strategy as VFC-SMOTE for generating synthetic minority class examples, while simultaneously incorporating undersampling to address class imbalance. This integration of undersampling might explain why SMOTE-OB more successfully circumvented the issue encountered by VFC-SMOTE.

Fig. 11

Decision areas against class balanced test set at 70k time steps (Before Drift) of two-dimensional StaticIm10_Rare100 (Color figure online)

Figure 10 shows that the predictive performance of the approaches dropped to below 60% G-Mean and started to differ once the concept drift began (70k time steps). While most approaches’ predictive performance fluctuated with large magnitude, SMOClust’s predictive performance was relatively steady, bouncing between 50% and 60% G-Mean. UOB performed poorly from the beginning of the drift at 70k time steps until the drift was close to finishing at 100k time steps, indicating that undersampling struggled in dealing with this drift without the help of a concept drift detector.

Figure 12 presents the learnt decision boundaries of the approaches right after the drift (100k time steps). It shows that OOB, OnlineOversampling, OnlineUnderOverBagging, OnlineOversamplingd, SMOGauNoise, SMOTE-OB and SMOClust learnt very complex decision areas, indicating that they made great efforts to learn all the areas that spawn rare minority class examples belonging to the post-drift concept. However, only approaches with a concept drift detector were able to forget the old area of the minority class at the top left corner. This shows that, although this drift was gradual, concept drift detection was important in helping the system to forget irrelevant past knowledge. In contrast, approaches without a drift detector retained the oval minority class cluster at the top left corner which belongs to the pre-drift concept. Most of them struggled to perform well since the drift started at 70k time steps, as shown in Fig. 10. OOB was an exception in terms of predictive performance. However, the fact that it retained the knowledge about the pre-drift minority class areas makes it disadvantageous in dealing with other types of drift, as discussed in Sect. 4.3.1.

Comparing the learnt decision areas of SMOClust against other approaches with a drift detector (OOBd, UOBd, OnlineOversamplingd, OnlineUnderOverBaggingd, VFC-SMOTE, SMOTE-OB and SMOGauNoise), it can be observed that the learnt minority class areas of SMOClust were complex and covered the feature space spawning minority class examples the most. While OnlineOversamplingd’s, SMOGauNoise’s and SMOTE-OB’s were also complex (see Fig. 12g, i and k), they either did not cover the feature space spawning minority class examples as much as SMOClust’s did or exhibited over-generalisation. The fact that OnlineOversamplingd only reuses the recently seen minority class example for oversampling likely leads to overfitting to the most recent area. SMOGauNoise also has a strategy to explore the decision boundaries of the minority class, but this strategy only explores the area around the recently seen minority class example. This could be disadvantageous when false-positive drift detections were triggered, resetting the base learner. SMOTE-OB’s over-generalisation could be explained by the use of undersampling and the noisy minority class examples generated. SMOClust, on the other hand, does not have this disadvantage because the stream clustering methods are not reset upon drift detection. This makes it more robust to false-positive drift detections than other approaches. As the drift was gradual, OOBd, UOBd and OnlineUnderOverBaggingd likely also suffered from multiple drift detections, as Fig. 12e, f and h show that they learnt a simple decision boundary right after the drift.

Fig. 12

Decision areas against class balanced test set at 100k time steps (After Drift) of two-dimensional StaticIm10_Rare100 (Color figure online)

Figure 13 presents the learnt decision boundaries of the approaches at the end of the two-dimensional StaticIm10_Rare100 stream (at 200k time steps). While most approaches continued to further improve their learnt decision boundaries since the drift had finished, Fig. 13h and l show that OnlineUnderOverBaggingd and SMOGauNoise did not improve as much as other approaches, meaning that they suffered from false-positive drift detections during the post-drift period. Besides, UOB, UOBd, and SMOTE-OB exhibited an extensive and predominantly continuous decision area for the minority class, demonstrating the aggressiveness of undersampling. However, in the case of SMOTE-OB, the approach’s synthetic minority class generation strategy exacerbates this aggressiveness.

Fig. 13

Decision areas against class balanced test set at 200k time steps (End of Stream) of two-dimensional StaticIm10_Rare100 (Color figure online)

From this analysis, it has been shown that SMOClust managed to forget the pre-drift concept and adapt to drift leading to 100% rare minority class examples, and was more robust to false-positive drift detections than other approaches in the two-dimensional StaticIm10_Rare100 stream. However, the experiment with the five-dimensional StaticIm10_Rare100 stream presents different results (Fig. 4). It shows that SMOClust only performed better than OnlineOversamplingd, but worse than or similarly to most other approaches. One potential reason is that two-dimensional space is more compact than five-dimensional space, so the rare minority class examples have much less space in which to randomly spawn, which means they are likely to spawn at locations that had already been learnt and covered by SMOClust using micro-clusters. Therefore, SMOClust can predict their class label correctly. However, five-dimensional space is sparser than two-dimensional space, meaning that new rare minority class examples are less likely to spawn at previous locations. Therefore, SMOClust struggled to make correct predictions for new rare minority class examples. Another potential reason is that the stream clustering method may be less effective in data streams with more dimensions. For example, it may create some minority class micro-clusters that overlap with the majority class region because of the sparsity of the feature space. Therefore, the aforementioned advantage of SMOClust in dealing with drift could not be manifested. In any case, future work is needed to further confirm whether SMOClust tends to perform better in data streams with fewer dimensions.
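The sparsity argument can be illustrated with a rough calculation. The sketch below computes the fraction of a unit hypercube covered by a single ball of radius r, using the standard volume formula for a d-dimensional ball. It assumes features scaled to [0, 1] and a hypothetical radius of 0.1, ignores boundary effects, and is not a model of the micro-clusters actually maintained by SMOClust.

```python
import math

def ball_volume(d, r):
    """Volume of a d-dimensional ball of radius r."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

# Hypothetical radius of an already-learnt region, in a feature space that is
# assumed to be scaled to the unit hypercube (whose volume is 1).
radius = 0.1

covered_2d = ball_volume(2, radius)  # fraction of the unit square covered
covered_5d = ball_volume(5, radius)  # fraction of the unit 5-cube covered
ratio = covered_2d / covered_5d      # how much rarer a hit is in 5-D
```

Under these assumptions, one ball covers about 3.1% of the two-dimensional space but only about 0.005% of the five-dimensional space, so a uniformly spawned example is roughly 600 times less likely to land inside a previously covered region.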

Short Summary: This analysis shows that SMOClust managed to adapt to concept drift leading to 100% rare minority class examples and was robust to multiple drift detection during gradual drift as well as false-positive drift detections when the data stream has only two dimensions. However, the experiments with the corresponding five-dimensional stream present a different set of results, as the stream clustering methods used by SMOClust might not perform well when the data stream has more dimensions.

4.3.3 Results with two-dimensional artificial data streams

To investigate whether SMOClust performs better in lower-dimensional data streams, we performed additional experiments on the same artificial data streams presented in Sect. 4.1, but with only two input features. We also created a randomised two-dimensional data stream for the purpose of hyper-parameter tuning, following the procedure described in Sect. 4.2.

Figure 14 presents the difference in average G-Mean (based on thirty runs) between compared approaches and SMOClust on two-dimensional artificial data streams in the form of a heat-map. Green cells indicate results favourable to SMOClust, whereas red cells indicate results favourable to the compared approach. For a comprehensive table of the predictive performance of the approaches, please refer to the supplementary document. Compared to Fig. 4, there are fewer red cells in this figure, indicating that SMOClust generally performed better in the lower-dimensional version of the same set of data streams. In particular, the sections of the heat-map corresponding to StaticIm30 and StaticIm10 data streams, which were mostly reddish in Fig. 4, are mostly greenish in Fig. 14.

Fig. 14

Difference in Average G-Mean Against SMOClust on Two-Dimensional Class Imbalanced Artificial Data Streams Based on 30 Runs (Green cells indicate SMOClust performed better; Red cells indicate SMOClust performed worse; Grey horizontal lines separate different groups of data streams, i.e., StaticIm{30/10/1}, Imbalance Ratio Drift, Double Factor, and Complex Factor) (Color figure online)

Figure 14 also confirms the trend shown in Fig. 4, showing that SMOClust tends to outperform other approaches in severely class-imbalanced data streams. To further validate this trend in lower-dimensional data streams, we performed further experiments on the same set of single factor drift artificial data streams, but with enforced extremely severe class imbalance ratios (minority class ratio \(0.3\%\) to \(5\%\), as summarised in Table 3). The results are presented in Fig. 15 in the form of a heat-map, using the same colour scheme as Fig. 14. Similarly, please refer to the supplementary document for a comprehensive table of the predictive performance of the approaches.

Fig. 15

Difference in Average G-Mean Against SMOClust on Two-Dimensional Severely Class Imbalanced Artificial Data Streams Based on 30 Runs (Green cells indicate SMOClust performed better; Red cells indicate SMOClust performed worse; Grey horizontal lines separate different groups of data streams, i.e., StaticIm{5/3/1/07/05/03}) (Color figure online)

Figure 15 presents more solid green cells than Fig. 14, indicating that SMOClust performed better than other approaches in extremely severe class-imbalanced data streams, even in the lower-dimensional case. Additionally, the fact that Fig. 15 has more green cells than Fig. 5 supports the conclusion that SMOClust tends to perform better in lower-dimensional data streams.

4.4 Results with real-world data streams

This section presents the analysis comparing the predictive performance of SMOClust against nine existing approaches in real-world data streams. Experiments with real-world data streams allow us to obtain a general idea of SMOClust’s predictive performance in practical applications, where the class imbalance ratio and the position and type of the concept drifts are unknown. Table 10 presents the Friedman rankings of the approaches’ G-Mean on real-world data streams, grouped by factors.

Table 10 Statistical (Friedman) Ranking of prequential G-Mean on Real-World Streams Grouped by Factors

Table 10 shows that the overall top-ranked approaches on real-world data streams are OOB, OOBd and SMOTE-OB, whereas SMOClust usually achieved low rankings. SMOClust achieved a relatively better ranking only in Covtype streams. Considering all real-world data streams, SMOClust performed similarly to OnlineOversamplingd and SMOGauNoise. Following the analysis method in Sect. 4.3, we also compared the thirty-run average prequential G-Mean of the approaches on each real-world data stream in Fig. 16 to further evaluate the predictive performance of SMOClust in real-world data streams.
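As a reminder of how Friedman-style rankings are obtained, the sketch below ranks the approaches on each stream (rank 1 for the highest G-Mean, ties sharing the average rank) and then averages the ranks over streams. It covers only the ranking step, not the Friedman test statistic or any post-hoc test; the function name `average_ranks`, the stream and approach labels, and the scores are hypothetical.

```python
def average_ranks(scores):
    """Average rank of each approach over streams (1 = best, ties averaged).

    `scores` maps stream name -> {approach name -> G-Mean}; every stream is
    assumed to report the same set of approaches.
    """
    approaches = list(next(iter(scores.values())))
    totals = {name: 0.0 for name in approaches}
    for per_stream in scores.values():
        ordered = sorted(approaches, key=lambda name: -per_stream[name])
        i = 0
        while i < len(ordered):
            # Extend j over the block of approaches tied with ordered[i].
            j = i
            while (j + 1 < len(ordered)
                   and per_stream[ordered[j + 1]] == per_stream[ordered[i]]):
                j += 1
            shared_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
            for name in ordered[i:j + 1]:
                totals[name] += shared_rank
            i = j + 1
    return {name: totals[name] / len(scores) for name in approaches}


if __name__ == "__main__":
    # Hypothetical G-Mean values for three placeholder approaches.
    demo = {
        "stream_1": {"A": 0.9, "B": 0.8, "C": 0.7},
        "stream_2": {"A": 0.6, "B": 0.6, "C": 0.9},
    }
    print(average_ranks(demo))
```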

Fig. 16

Difference in Average G-Mean Against SMOClust on Real-World Data Streams Based on 30 Runs (Green cells indicate SMOClust performed better; Red cells indicate SMOClust performed worse) (Color figure online)

Figure 16 shows that SMOClust usually performed similarly to or better than other approaches in NOAA and Covtype streams, while it performed worse than other approaches in Ozone, PAKDD2009, INSECTS, Amazon, and Twitter streams. Recalling the discussion in Sect. 4.1 on the estimated characteristics of real-world streams, NOAA and Covtype streams mainly consist of safe and borderline minority class examples with different movements of minority class clusters, and the minority class ratios throughout Covtype streams are usually very low (except Covtype\(_{(\textrm{c}_{1}=\{1-6\})}\) and Covtype\(_{(\textrm{c}_{1}=1)}\)). As discussed in Sect. 4.3, these are the characteristics of a data stream on which SMOClust is likely to perform similarly to or better than other approaches, especially when the class imbalance ratio is severe, as in the Covtype\(_{(\textrm{c}_{1}=3)}\) stream. Thus, we can see from Fig. 16 that the rows of NOAA and Covtype streams mainly consist of saturated green cells and pale red cells.

On the other hand, Table 5 shows that Ozone, PAKDD2009, INSECTS, Amazon, and Twitter streams consist of large proportions of rare and outlier minority class examples. Based on the discussion in Sect. 4.3.2, SMOClust could not handle rare and outlier minority class examples very well, except when the data stream was low-dimensional and its feature space was therefore compact. Thus, it is not surprising to see many red cells on these data streams.
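To make the four minority class example types concrete, the sketch below labels a minority example by counting how many of its k nearest neighbours also belong to the minority class. The thresholds used here (k = 5; 4 to 5 same-class neighbours is safe, 2 to 3 borderline, 1 rare, 0 outlier) follow a neighbourhood-based convention that is common in the class imbalance literature; they are an assumption of this sketch and may differ from the exact estimation procedure used to produce Table 5. The function name `minority_type` is hypothetical.

```python
def minority_type(point, other_minority, majority, k=5):
    """Label a minority example by the classes of its k nearest neighbours.

    Assumed thresholds for k = 5: >= 4 minority neighbours is "safe",
    2-3 is "borderline", 1 is "rare", 0 is "outlier".
    `other_minority` must not contain `point` itself.
    """
    def dist_sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Pair each neighbour's squared distance with 1 (minority) or 0 (majority).
    neighbours = [(dist_sq(point, q), 1) for q in other_minority]
    neighbours += [(dist_sq(point, q), 0) for q in majority]
    neighbours.sort(key=lambda pair: pair[0])
    same_class = sum(label for _, label in neighbours[:k])

    if same_class >= 4:
        return "safe"
    if same_class >= 2:
        return "borderline"
    if same_class == 1:
        return "rare"
    return "outlier"


if __name__ == "__main__":
    # Hypothetical two-dimensional points, not data from the paper.
    majority = [(0.2, 0.0), (0.3, 0.0), (0.4, 0.0), (0.5, 0.0), (0.6, 0.0)]
    print(minority_type((0.0, 0.0), [(0.1, 0.0)], majority))
```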

To summarise the results of the experiments with real-world data streams, the advantage of the proposed synthetic minority class oversampling strategy in SMOClust manifests in severely class imbalanced data streams that have high proportions of safe and borderline minority class examples and concept drifts involving different movements of minority class sub-clusters. On the downside, SMOClust could not handle rare and outlier minority class examples very well. These findings are consistent with the results of the experiments with artificial data streams.

5 Conclusion

The main contribution of this work is the proposed stream clustering based synthetic minority oversampling approach, called SMOClust (RQ1). This method helps the learning system to strategically explore different decision areas of the minority class and to be robust to false-positive drift detections (RQ1). To evaluate the predictive performance and the characteristics of SMOClust, experiments with artificial data streams covering different types of concept drift difficulties were performed. The results show that SMOClust performed particularly well in severely class imbalanced data streams with high proportions of safe and borderline minority class examples (RQ2). It also handled concept drifts involving different movements of minority class clusters better than other existing approaches (RQ2). However, when the data stream presents high proportions of rare and outlier minority class examples, SMOClust is at a disadvantage (RQ3).

To further understand the reasons behind the experimental results on artificial data streams, additional experiments with representative two-dimensional artificial data streams were performed. These experiments show that SMOClust managed to handle rare minority class examples better than other approaches in the two-dimensional cases. This indicates that SMOClust could not handle rare cases very well on the corresponding five-dimensional stream most likely because the stream clustering methods did not perform well in higher-dimensional space. In other words, SMOClust may be more advantageous when the dimensionality of the data stream is not high.

Lastly, we validated the performance of SMOClust on different real-world data streams. To facilitate the analysis of the experimental results of this part of the study, we estimated the characteristics of the real-world data streams, following the procedure adopted by Brzezinski et al. (2021). Based on the estimated characteristics and the experimental results, we concluded that SMOClust behaved on real-world data streams similarly to how it behaved in the experiments with artificial data streams (RQ3).

As for future work, one potential direction is to investigate new strategies to better handle large proportions of rare and outlier minority class examples. For example, strategies could be proposed to generate synthetic minority examples for oversampling in a more diverse manner without introducing a significant amount of noise or creating artificial concept drifts. Additionally, extending the idea of SMOClust to multi-class classification tasks could also be investigated. Furthermore, the synthetic minority oversampling strategy proposed in this work could easily be adapted for use with other complex data stream learning systems, as it is a drift-adaptable data-level method for addressing class imbalance in data stream learning. For example, it could be incorporated into an explicit drift handling approach which exploits relevant past knowledge to handle concept drifts (Chiu & Minku, 2018, 2022) or an ensemble approach which evolves itself to adapt to concept drifts (Kolter & Maloof, 2003; Brzezinski & Stefanowski, 2014). Apart from these, a comprehensive study comparing SMOClust against more approaches for learning drifting class imbalanced data streams (e.g., CSARF (Loezer et al., 2020), ROSE (Cano & Krawczyk, 2022) etc.) and with more data sets could also be potential future work.