1 Introduction

In dynamic data streaming environments, online learning algorithms consider incoming examples “on arrival”, without requiring persistent storage or multiple scans, while maintaining a model that reflects the current data. This type of learning is used in many real-world applications, such as network intrusion detection, spam filtering, fault diagnostics in manufacturing, and e-commerce [23]. Learning from such streaming data is challenging, especially in the presence of multiple skewed class distributions, also known as “multi-class imbalance”, where a large number of majority-class instances may lead to the minority classes being ignored. This problem is aggravated in an online learning setting because a steady arrival of minority instances cannot be guaranteed, and a minority class may become a majority concept and vice versa [21]. In addition, evolving streams are susceptible to concept drift, the phenomenon of unexpected changes in the underlying data distribution [35].

While previous studies have primarily focused on binary imbalanced data or stationary streams, only a limited number of studies have addressed the challenge of learning from evolving streams with multi-class imbalanced data [1]. For instance, [21] proposed modifications to resampling methods to adjust for multi-class imbalanced data. In addition, [14, 15] introduced ensemble-based approaches utilizing random feature subsets and resampling to address imbalances. However, further investigation is needed to explore classifier-agnostic approaches that can effectively utilize alternative evaluation metrics and resampling parameters to address the challenges of imbalanced datasets. Additionally, it is crucial to develop class-based drift detection methods to prevent performance deterioration.

This paper introduces an algorithm for online learning in a multi-class imbalance setting to address these shortcomings. Our dynamic queues (DynaQ) approach extends the online-multi-class-queue (OMCQ) [46] approach, which maintains separate queues for each class without any form of sampling. In addition, our DynaQ algorithm uses oversampling with replacement based on a rate parameter associated with the various classes. That is, we oversample the minority classes based on the recall, F1-score, \(\kappa _m\) statistic, or Euclidean distance.

Our DynaQ algorithm combines batch-incremental and instance-incremental ensemble learning. Initially, a batch of data containing the classes that have been seen so far is presented to the learner, which subsequently updates the model with new instances as they arrive. Each ensemble member learns from a sliding batch, and the results are combined using soft voting. In addition, we incorporate a class-specific concept drift detection mechanism into DynaQ. Our algorithm can thus dynamically adapt to changes in label arrivals and individual class performances, as monitored by the per-class recalls. Specifically, we do not make any assumptions regarding the frequency of classes, which implies that minority classes may become majority classes, and vice versa. Our experimental results confirm that our DynaQ algorithm is effective in terms of multi-class concept separation.

The main contributions of this paper are as follows.

  • Novel online queue-based ensemble architecture: DynaQ utilizes a batch-based resampling method that creates an instance queue for each class and incorporates an ensemble approach based on sliding windows and a soft voting scheme.

  • Self-handling of class imbalance: DynaQ dynamically oversamples minority classes based on a rate parameter associated with the various classes.

  • Class-based concept drift approach: DynaQ incorporates a drift detection mechanism to dynamically adapt to changes in label arrivals and individual class performances.

  • Multi-class imbalanced and drifting streams: DynaQ addresses multi-class imbalance and concept drift simultaneously, tackling the challenges of learning from dynamic data streaming environments in an online setting.

The paper is organized as follows. Section 2 presents related work, while Section 3 introduces the DynaQ algorithm. Section 4 describes the experimental evaluation, and Section 5 concludes the paper.

2 Background and related work

In online learning, a data-generating process provides at each time step t an example (\(x_t,y_t\)) drawn from an unknown probability distribution, where \(x_t\) is a vector of f qualitative or quantitative features, and \(y_t \in Y\) is the class label, with \(Y = \{c_1,c_2,\ldots ,c_S\}\) and S the number of classes. An online classifier is built incrementally: it receives an example \(x_t\) at time step t and produces a prediction \(\hat{y}_t\). In a supervised learning setting, the label \(y_t\) becomes available, and the performance of a learning algorithm is evaluated using a loss function \(l(y_t, \hat{y}_t)\), with the goal of finding the best predictor for future data at each step [23]. This paper focuses on online learning from multi-class imbalanced data, where \(S > 2\), from evolving streams.

2.1 Online class imbalance learning

In an online class imbalance learning setting, the main goal is to correctly classify minority examples, because the minority class is often of most interest. Existing online learning approaches that address the class imbalance problem may be categorized into data-level, algorithmic, and hybrid ensemble-based approaches. Data-level techniques provide several solutions to class imbalance problems, including resampling and feature selection [4]. Data-level modifications aim to balance the underlying dataset, making them classifier-agnostic approaches. Resampling is an effective data-level approach that proceeds independently of the learning algorithm; this method has been used for binary classification problems in the data stream setting. The major types of resampling are oversampling (increasing the number of minority-class examples), undersampling (reducing the number of majority-class examples), and hybrid sampling [23]. The Synthetic Minority Over-sampling Technique (SMOTE) [19] is the base idea for many resampling methods. SMOTE is an over-sampling technique that creates new instances of the minority class by interpolating between existing instances that are located close to each other. This method effectively reduces the degree of class imbalance compared to the original majority-to-minority class ratio. Recently, modifications of the SMOTE method have been introduced that enable it to work with streaming data [5,6,7]. The Selection-Based Resampling (SRE) technique [45] utilizes undersampling to iteratively eliminate safe instances from the majority class, without causing a reverse bias towards the minority class. However, undersampling methods may lead to crucial information being overlooked, whereas oversampling may introduce artificial instances that may be deemed unacceptable in real-world domains. Hybrid ensemble-based methods combine resampling and algorithmic approaches to manage class imbalance [13]. For example, in a study by [51], the authors integrated resampling into ensemble algorithms to define the oversampling online bagging (OOB) and undersampling online bagging (UOB) techniques for binary classification. This work extends bagging ensembles following a class-based ensemble approach that dynamically changes the learning rate by maintaining a base learner for each class and updating the base learners with new data to deal with binary class imbalance. Resampling is triggered to either increase the chance of training on minority-class examples (in OOB) or reduce the chance of training on majority-class examples (in UOB).

Online queue-based resampling [38] has also been proposed as a method for binary classification. The main idea of this algorithm is to selectively include in the training set a subset of the positive and negative examples that have appeared in the stream thus far. The examples are selected by maintaining, at any given time t, separate queues of equal length \(L\in {{\mathbb {Z}}^{+}}\), \(q_n^t = \{(x_i, y_i)\}_{i=1}^{L}\) and \(q_p^t = \{(x_i, y_i)\}_{i=1}^{L}\), based on the class labels received with the data; these contain the negative (majority) and positive (minority) examples, respectively. Once the queues are filled, the classifier is incrementally updated after combining the two queues into one training set [38]. The algorithm employs an interleaved test-then-train evaluation [8], in which each example is used to test the model before it is appended to the queues for training, thus implementing a sliding window method. Our work extends the notion of queues presented in [38] to the multi-class scenario and incorporates explicit concept drift detection during learning.
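To make the mechanics concrete, the following is a minimal sketch of this binary queue-based scheme, assuming a stream iterator yielding (x, y) pairs and a Scikit-Multiflow-style model exposing predict and partial_fit; all names are illustrative rather than taken from [38].

```python
from collections import deque

def queue_based_learning(stream, model, L=10):
    # One queue per class; deque(maxlen=L) drops the oldest entry on overflow.
    q_pos, q_neg = deque(maxlen=L), deque(maxlen=L)
    for x, y in stream:
        y_hat = model.predict([x])[0]            # test first ...
        (q_pos if y == 1 else q_neg).append((x, y))
        if len(q_pos) == L and len(q_neg) == L:  # ... train once both queues fill
            X, labels = zip(*(list(q_pos) + list(q_neg)))
            model.partial_fit(list(X), list(labels))
    return model
```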

2.2 Multi-class imbalanced learning

Multi-class classification problems are often considered more challenging than their binary counterparts, because multiple classes can increase the data complexity and aggravate the imbalanced distribution [1]. Current approaches in an online setting are primarily based on binary decomposition techniques, algorithmic-level modifications using misclassification costs, and resampling methods [1, 51]. Binary decomposition algorithms typically combine binarization techniques that transform the original multi-class data into binary subsets [30]. One of the most commonly used binarization strategies is the one-versus-one (OVO) decomposition [20], which first selects the subset of the original data containing the instances for each pair of classes and then trains a binary classifier for each pair. In contrast, the one-versus-all (OVA) approach [20] decomposes a multi-class data set into several binary problems and subsequently trains a single classifier for each class by considering that class versus the combination of all remaining classes. For instance, [28] uses adaptive online one-class Support Vector Machines to monitor changes in minority classes over time. Further, [17] propose integrating one-class classification with ensembles, using over-sampling and instance selection techniques to balance the class distribution of incoming data batches, which are then used to induce classifier ensembles. A disadvantage of binarization techniques is that the interactions between multiple classes cannot be considered simultaneously.

Algorithmic-level approaches adapt the training process to enhance the classifiers’ ability to deal with skewed distributions. These methods are often specific to particular learning models, making them more specialized but less flexible than data-level approaches. One of the most common algorithmic modifications for addressing class imbalance combines Hoeffding Trees with the Hellinger splitting criterion for imbalanced domains. In GHVFDT [36], Hoeffding Trees are used to construct a decision tree that can handle data streams, while the Hellinger distance serves as a splitting criterion that is less sensitive to class imbalance. Further, [31] introduced an approach that modifies the predictions made by a base classifier to address imbalanced data streams; this algorithm maps the prior class probabilities onto the statistics of the assigned classes.

A few research studies have considered multi-class online learning from evolving streams, focusing on resampling techniques combined with online ensemble learning [1, 29]. Online ensembles learn from each incoming training example separately, and the component classifiers are constructed from the corresponding instances; these approaches thus learn the data stream in one pass. Several approaches extend the well-known online bagging (OB) algorithm introduced by [42]. The OB algorithm adapts the original batch-based bagging method, which samples with replacement, to the online setting by drawing the value of k from a Poisson distribution. New instances are classified by a majority vote of the N base models. Specifically, in a recent study [52], the authors proposed two ensemble learning methods for multi-class online learning. These two algorithms, multi-class oversampling-based online bagging (MOOB) and multi-class under-sampling-based online bagging (MUOB), use resampling to overcome class imbalance within the OB framework introduced above [42]. These algorithms can process multiple classes directly, without class decomposition. However, their performance rests on underlying assumptions, such as that sampling based only on the size of the classes is efficient and does not introduce bias. Some approaches employ feature space modification to define which feature space input is used by the base classifiers. In [14], diverse feature subspaces of random sizes are used to improve the ensemble’s performance. The Kappa Updated Ensemble (KUE) method employs a combination of base classifiers that are updated dynamically based on the \(\kappa \) statistic, which measures the agreement between the base classifiers on a sliding window of recent data. KUE also incorporates an instance weighting scheme that prioritizes recent data over older data. Similarly, [15] introduced the Robust Online Self-Adjusting Ensemble (ROSE), an online ensemble-based algorithm that combines classifiers trained on variable-sized random subsets of features. The ROSE algorithm incorporates undersampling of the majority classes and employs so-called background classifiers, initiated once drift is detected. In essence, ROSE is based on self-adjusting bagging and balances the training sets by monitoring the product of the accuracy and the \(\kappa \) statistic. Specifically, each window provides a background classifier with a balanced data set of recent instances at the start of model construction. Background classifiers are dynamically added to the ensemble once their performance surpasses a threshold.
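As a brief illustration, the core of OB reduces to a few lines; the sketch below assumes scikit-learn-style learners with a partial_fit method and is only meant to show the Poisson(1) resampling step, not any particular published implementation.

```python
import numpy as np

rng = np.random.default_rng()

def ob_update(learners, x, y, classes):
    # Online bagging: each base learner sees the instance k ~ Poisson(1) times,
    # approximating bootstrap sampling without storing the stream.
    for clf in learners:
        for _ in range(rng.poisson(1.0)):
            clf.partial_fit([x], [y], classes=classes)
```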

The OMCQ framework maintains a separate queue for each class in the multi-class learning setting [46]. This algorithm learns directly from the original data, without resampling, and incorporates a drift detection mechanism that can adapt to the class sizes. While previous results were promising, the OMCQ approach may be extended to use dynamic sampling, as will be discussed in Section 3. The authors of [47] introduced a method for dealing with multi-class imbalanced data called improved online ensembles (IOE) for semi-supervised learning. In this technique, instances from the minority and majority classes are sampled based on the classifier’s performance, measured in terms of the recall of each class. Classes with lower-than-average recalls are oversampled, while classes with higher recalls are undersampled. The sampling in the IOE algorithm is controlled by setting the rate parameter of the Poisson distribution based on the recall score of each class, which determines the number of times each instance is used for training each base learner in the ensemble. At each time step, the N individual classifiers of the ensemble are trained with the oversampled or undersampled instances.

In cases where ensembles have limited access to labels, a set of algorithms is also available [48, 55]. One such approach is CALMID [32], which is a robust framework that deals with limited label access, concept drift, and class imbalance by dynamically inducing new base classifiers and weighting the most relevant instances. The method uses a variable threshold uncertainty strategy based on an asymmetric margin threshold matrix to address the problem of a given class being a majority to a given subset of classes while also being a minority to others. A novel sample weight formula is designed to consider the class imbalance ratio of the sample’s category and the prediction difficulty.

Next, we introduce our DynaQ framework.

3 DynaQ framework

Our DynaQ framework maintains a queue for each of our multiple classes. Initially, all queues will be empty. As instances arrive, they are added to the appropriate queue per their true label. Our sampling and training processes commence when the first queue has been filled. Figure 1 illustrates how our contributions fit together and operate in one iteration of an interleaved test-then-train loop. For each arriving instance, minority classes are oversampled to balance the different classes. As noted in Section 2, the sampling process is based on rate parameters of the classes, where classes with rates lower than the average are oversampled, while classes with higher rates use the original training instances. That is, our algorithm oversamples minority instances, while majority classes are not undersampled.

Fig. 1 High-level overview of DynaQ methodology

Fig. 2 Example of \(Queue_3\) resampling (adapted from [38])

The online learning phase of DynaQ thus incorporates four processes: evaluation, dynamic class balancing, ensemble learning, and concept drift detection.

Specifically, the evaluator is used to predict the class label of arriving instances, using soft voting, and to update the evaluation metrics. The concept drift detector captures changes in the data distributions by adapting the idea of the drift detection method (DDM) [50] and subsequently updating the instances in the queues. In the class balancing module, our algorithm creates a queue space to separate the instances from each class as they arrive within the stream. Subsequently, if the queue for a class is not full, or the rate parameter for that class is below a threshold, the queue is updated with the oversampled instances; otherwise, we use the instance only once by inserting it into the related queue. During online learning, a single learner is trained during each iteration of the interleaved test-then-train loop, in a cyclical manner. A soft voting-based ensemble of classifiers is used during training to incrementally update the model using sliding batches. This implies that each classifier in the ensemble is trained only once every N loop iterations, where N is the number of classifiers in the ensemble. A different learner from the ensemble is trained at the next loop iteration, where the sliding batch includes one new instance and overlaps with the previous batch. Next, we present the details of our DynaQ algorithm, as shown in Algorithm 1.

3.1 Online queue construction

In an offline supervised learning setting, a data set D is available with input signals \(x_i\) and outputs \(y_i\). The task is to infer a model \(M \approx p(y\mid x)\) from such data. In contrast, online learning refers to approaches where the full data set D is unavailable during learning. Here, examples arrive over time, and the task is to infer a reliable model \(M_t\) at time step t based on the newly arriving example \((x_t, y_t)\) and the previous model \(M_{t-1}\).

Incremental online learning is a sub-area of online learning that is additionally bounded by memory resources and requires the capability of continuous learning with limited data, compared to offline learning. In the literature, approaches to incremental learning can generally be categorized as batch-incremental and instance-incremental [44]. As the name suggests, batch-incremental methods employ batches of data to form hypotheses about the data. At every time step t, this form of training collects the k newest instances to form a batch; when a batch of data \(D_t\) is filled, a model \(m_t\) is learned [34]. This process continues, batch by batch. In our work, following [38], we consider a sequence of streaming data \(\{(x_1,y_1),\ldots ,(x_t,y_t)\} \in {\mathbb {R}}^f\times \{1,\ldots ,S\}\), where f is the data dimension, and S is the total number of classes. The key idea is to keep a fixed number of examples (the queue size, denoted by L) for each class in the stream to form the training set. In other words, each arriving sample (\(x_t,y_t\)) at any given time t is stored in a separate queue \(q_{c_s}^t\) of equal length L, where \(c_s\) is the class label received with the data. Together, the queues form a sliding batch \(B_t\). This method considers a given stream of data \(x_1,x_2,\ldots ,x_t\) and learns from a sequence of batches \(b_1,b_2,\ldots ,b_t\), where the batches are updated as instances arrive and, in turn, update the model.

Fig. 3 Illustration of the batch-instance incremental learning process

Figure 2 illustrates how the queues work for a queue size of \(L = 3\). The upper part shows the examples that arrive at each time step; for example, \(z^0\) and \(z^4\) arrive at \(t = 0\) and \(t = 4\), respectively. Assume that the data stream contains three classes \(Y = \{c_1,c_2,c_3\}\) and that each class has its own queue. The queues \(q_{c_1}^t\), \(q_{c_2}^t\), and \(q_{c_3}^t\) are of equal length \(L\in {{\mathbb {Z}}^{+}}\) and contain the samples of class \(c_1\), class \(c_2\), and class \(c_3\), respectively. After the instances have been separated based on their labels, an arriving sample of class \(c_s\) is placed at the front of \(queue_{c_s}\). When the queues fill, we combine the full queues into the batch, add it to the training set, and commence online learning. Here, the sliding batch explicitly employs a forgetting mechanism, where the oldest instance is removed from the head of the related queue.

Figure 3 shows that \(B_t\) is a batch of data comprising the full queues of the current classes; it defines the set of instances used to update the model. We construct an initial model with warm-up instances from the data stream as an initial training step. Whenever the first queue is full, we initiate our rebalancing process until we have created a full queue for each class, and we include those queues in our batch. Next, the model construction process is initiated, and this batch-incremental process continues until the end of the stream. For each arriving instance (\(x_t, y_t\)) at time t, the oldest sample is removed from the queue to which \(y_t\) belongs, and the new sample is added to batch \(B_t\). Meanwhile, at each point in time, only one learner is updated, in circular order, with the batch that includes the newly resampled instance. The learner uses batch \(B_t\) to update its model; the training process thus utilizes a balanced set consisting of the most recent data. The algorithm waits until it has enough instances from the classes, including the current minority classes, before updating its model. It follows that the sizes of the individual queues are highly domain-dependent; the queue size is set by inspection.
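The following is a minimal sketch of this queue space and the sliding batch \(B_t\), under our reading of this section; classes are discovered on the fly, and the names are illustrative rather than taken from our released implementation.

```python
from collections import deque

L = 50                   # queue size, set by inspection
queues = {}              # class label -> deque of (x, y) pairs

def insert(x, y):
    # Append the newest instance; deque(maxlen=L) drops the oldest automatically.
    queues.setdefault(y, deque(maxlen=L)).append((x, y))

def sliding_batch():
    # B_t is the union of all currently full queues; until the first queue
    # fills, we keep warming up and return no batch.
    full = [q for q in queues.values() if len(q) == L]
    if not full:
        return None
    return [pair for q in full for pair in q]
```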

3.2 Queue-based sampling

As noted above, our DynaQ methodology employs queue-based sampling of the original instances of each class to dynamically construct models over all classes, using an ensemble trained against a sliding batch. Next, we explain the process of sampling each instance in the queues. Recall that minority classes suffer from having too few examples relative to the majority classes. Following [47], we oversample each class’s recent instances based on the class’s rate parameter, while maintaining the majority instances without any form of undersampling. This is done dynamically during learning as the stream evolves. In this way, the learner has access to newer samples and concepts, and we can balance the number of instances across all classes. In addition, the oversampling of minority classes implies that the associated queues will fill faster. We employ a sliding batch of S queues, where S is the number of current classes, and N learners are updated one at a time, in a periodic order, with each arriving instance. The oversampling rate of our DynaQ algorithm is implemented considering four different metrics, as discussed below.

3.2.1 Recall-based sampling

In our first variant, the rate parameter is set based on the recalls of the classes [47]. That is, we employ the recall values to maintain a balance between the samples that we keep from the majority and minority classes, where we define:

$$\begin{aligned} recall = \frac{TP}{TP+FN} \end{aligned}$$
(1)

and where true positive (TP) refers to the instances of the actual class that are correctly classified, while false negative (FN) denotes those instances with incorrect predictions. As such, recall measures the ratio of correctly classified instances from the minority class (the true positive rate). When employing the recall scores for a multi-class data stream, we may assess how many of the examples of each class are correctly classified. A model with a high recall on each class successfully predicts the true labels in the data [52]. If the absolute difference between the recall score for a class and the mean recall is higher than a pre-defined threshold, then the examples of this class will be oversampled using the recalls, as follows.

$$\begin{aligned} k = Poisson(r_{avg\_excluding\_c} / r_c) \end{aligned}$$
(2)

where \(r_{avg\_excluding\_c}\) is the average recall over the classes excluding class c, and \(r_c\) is the recall for class c, calculated based on prequential evaluation, in which each instance is used to test the model before it is used for training [56]; from this, the recall can be incrementally updated. If the recall for a class is lower than the average recall, then the class will be oversampled. We set the maximum value of k equal to the defined queue size, which is determined through inspection.
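A hedged sketch of this rate computation follows; the incremental recall bookkeeping is assumed to be available, and the threshold and cap values are illustrative rather than prescribed.

```python
import numpy as np

rng = np.random.default_rng()

def oversample_count(recalls, c, threshold=0.05, k_max=50):
    # recalls: dict mapping class -> prequential recall; c: class of the
    # newly arrived instance. Returns k, the number of extra copies to queue.
    others = [r for cls, r in recalls.items() if cls != c]
    if not others:
        return 0
    r_avg = float(np.mean(others))
    r_c = max(recalls.get(c, 0.0), 1e-6)    # guard against division by zero
    if r_c < r_avg and abs(r_c - r_avg) > threshold:
        return min(int(rng.poisson(r_avg / r_c)), k_max)  # cap at queue size
    return 0
```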

3.2.2 F1-score-based sampling

The F-measure [21] is the weighted harmonic mean of two metrics, recall and precision, where the weighting depends on the value assigned to \(\alpha \). We use the balanced variant, referred to as the F1-score, obtained by setting \(\alpha = 1\), which implies that precision and recall carry equal weights in the metric.

$$\begin{aligned} F\text {-}measure = \frac{(1+\alpha ^{2}) \times precision \times recall}{\alpha ^{2} \times precision + recall} \end{aligned}$$
(3)

In our second sampling approach, we utilize the F1-scores of the classes as a measure to control the number of oversampled instances. That is, if the absolute difference between the F1-score for a class and the mean F1-score is higher than the pre-defined threshold, then the samples for this class will be oversampled based on the F1-score.

$$\begin{aligned} k = Poisson(F1\text {-}score_{avg\_excluding\_c} / F1\text {-}score_c) \end{aligned}$$
(4)

Here, \(F1\text {-}score_c\) refers to the score of class c calculated prequentially, and \(F1\text {-}score_{avg\_excluding\_c}\) denotes the average F1-score over all classes excluding class c. If the F1-score of a class is lower than the average, it will be oversampled k times.

3.2.3 \(\kappa _m\)-based sampling

In the third case, we utilize the \(\kappa _m\) measure to control oversampling. Bifet and Morales proposed the \(\kappa _m\) statistic for online learning in [11], where they confirmed that this measure has advantages over accuracy and the original \(\kappa \) statistic [8]. The \(\kappa _m\) statistic is mainly motivated by settings where data streams are evolving and classes are imbalanced; it is defined as:

$$\begin{aligned} \kappa _m = \frac{p_0 - p_m}{1-p_m} \end{aligned}$$
(5)

In (5), the quantity \(p_0\) refers to the prequential accuracy of the current algorithm X, while \(p_m\) is the prequential accuracy of a majority-class classifier, a baseline learner that predicts the label that has occurred most frequently so far [8]. If classifier X is always correct, then \(\kappa _m = 1\). If its predictions are correct as often as those of the majority-class classifier, then \(\kappa _m = 0\). Since the \(\kappa _m\) metric measures the performance agreement between the majority-class classifier and classifier X, we cannot calculate it for each class. However, it is a measure that sensitively detects changes in the class distribution while automatically compensating for such changes. It may thus be used to recognize when classifier X is underperforming the baseline majority-class learner, and hence to assess when classifier X would benefit from oversampling the minority classes. Specifically, the ratio of the mean of the \(\kappa _m\) values to the last calculated \(\kappa _m\) value is considered as the rate parameter. If this ratio surpasses a threshold, the class will be oversampled k times following:

$$\begin{aligned} k = Poisson(MajorClass_{Size} / NewInstanceClass_{Size}) \end{aligned}$$
(6)

In this equation, \(MajorClass_{Size}\) refers to the number of instances in the class with the most instances seen so far; we update each class size based on the class label of each arriving instance.
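For completeness, the sketch below tracks \(\kappa _m\) prequentially, following (5); the majority-class baseline is recomputed from the label counts seen so far, and the class and method names are ours.

```python
from collections import Counter

class KappaM:
    # Prequential kappa_m tracker: p0 is the running accuracy of classifier X,
    # p_m that of a baseline predicting the most frequent label seen so far.
    def __init__(self):
        self.n = self.correct = self.maj_correct = 0
        self.counts = Counter()

    def update(self, y_true, y_pred):
        self.n += 1
        self.correct += int(y_pred == y_true)
        if self.counts:
            majority = self.counts.most_common(1)[0][0]
            self.maj_correct += int(majority == y_true)
        self.counts[y_true] += 1

    def value(self):
        if self.n == 0:
            return 0.0
        p0, pm = self.correct / self.n, self.maj_correct / self.n
        return (p0 - pm) / (1 - pm) if pm < 1 else 0.0
```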

Fig. 4 Ensemble learning of sliding batches [22]

3.2.4 Euclidean distance-based sampling

The last oversampling variant monitors the ratio of the majority class size to that of the class of the newly arrived instance. This parameter, as per (6), defines the number of instances we need to oversample based on a distance criterion [39]. In this case, if k is greater than a defined threshold, we resample the k instances that are most similar to the current one. The Euclidean distance is used to determine the k most similar instances as:

$$\begin{aligned} EuclideanDistance(x_c, q_c) = \sqrt{(x_c-q_c)^2} \end{aligned}$$
(7)

Here, \(x_c\) refers to the most recently arrived instance with class label c, and \(q_c\) denotes the instances inside the queue of class c. After sorting the results by distance, the top k instances are oversampled and inserted into \(q_c\). In this way, \(q_c\) contains the instances that are most similar to \(x_c\), while the less similar ones are disregarded.
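A small sketch of this selection step follows, assuming the queue stores (x, y) pairs with numeric feature vectors; the function name is illustrative.

```python
import numpy as np

def k_most_similar(x_c, queue_c, k):
    # Rank the members of class c's queue by Euclidean distance to the new
    # instance x_c and return the k closest (x, y) pairs for re-insertion.
    dists = [np.linalg.norm(np.asarray(x) - np.asarray(x_c)) for x, _ in queue_c]
    order = np.argsort(dists)[:k]
    return [queue_c[i] for i in order]
```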

3.3 Ensemble learning

Recall that we utilize a sliding ensemble approach, where the base learners update their models against different, sliding batches, following [22]. In our algorithm, each of the base classifiers is trained independently [22]. We extended OB [42] so that, instead of training on each instance k times based on a Poisson distribution, we employ k to oversample the relevant queues. Figure 4 illustrates how we use sliding batches to update the base learners. That is, for each instance in the data stream, one of the N base learners is updated with the sliding batch. If we denote the N learners by \(\{p_1, p_2, \ldots , p_{N-1}, p_{N}\}\), then \(p_1\) is updated with the batch at time t, \(p_2\) with the batch at time \(t+1\), and \(p_N\) with the batch at time \(t+N-1\). Subsequently, the rotation restarts from \(p_1\). Note that the prediction at each time step is made over all the base learners. We employ a soft voting process to determine the ensemble prediction [27]. For each arriving instance, soft voting requires each of the learners in the ensemble to produce a confidence score (within the range [0, 1]) for each class value, or to output the probabilities that the instance belongs to a given class label. Consequently, a simple soft voting classifier without weighting factors, given an instance \(x_t\), calculates the average probability for each class label over the predictions of all classifiers and determines the most probable class C as in (8) [27]:

$$\begin{aligned} \hat{y}_t = \underset{C}{\arg \max }\ \frac{1}{N} \sum _{j=1}^{N} P_j(y_t = C \mid x_t), \quad y_t\in Y,\ t \in \{1,2,\ldots ,t^{'}\} \end{aligned}$$
(8)

The class with the largest average probability is selected as the winner through this process, where \(y_t\in Y\) and N denotes the number of combined classifiers.
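A minimal sketch of this soft vote is given below, assuming base learners that expose predict_proba over a shared class ordering, as in Scikit-Multiflow.

```python
import numpy as np

def soft_vote(learners, x_t, classes):
    # Average the class-probability vectors of all base learners and return
    # the class with the largest mean probability, as in (8).
    probs = np.mean([clf.predict_proba([x_t])[0] for clf in learners], axis=0)
    return classes[int(np.argmax(probs))]
```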

Our ensemble is designed so that the training of the N individual classifiers proceeds out of step, using a sliding batch for instance selection. As shown in Fig. 4, at every time step, we form a new updated batch and train a single classifier \(p_{n}\) on that batch; over the next \(N-1\) iterations, we train the remaining \(N-1\) classifiers, and so on. From the ensemble's point of view, this resembles training with sliding windows that are offset by N time steps; from the point of view of each base classifier, however, we are employing sliding batches to train. Figure 4 illustrates this process for N classifiers in the ensemble, where consecutive classifiers are offset by one time step, and each classifier learns from the batch of the same color. Starting with \(p_{1}\) on the blue batch, the next classifier is \(p_{2}\) on the orange batch at time \(t+1\). The next rotation starts after N steps, as shown by the next blue batch training \(p_{1}\) at time \(t+N\). Intuitively, this method ensures diversity in terms of the instances used by the individual classifiers when casting their votes. The round-robin update itself reduces to a one-line index computation, as sketched below.
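The sketch assumes the same learner API as above; names are illustrative.

```python
def train_step(learners, batch, t):
    # At loop iteration t, only learner t mod N is refreshed with the current
    # sliding batch; every learner is thus retrained once every N iterations.
    X, y = zip(*batch)
    learners[t % len(learners)].partial_fit(list(X), list(y))
```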

Algorithm 1 DynaQ

3.4 Concept drift detection

Recall that we also include a class-based concept drift detector to handle evolving streams. To this end, we adapted the idea of the DDM drift detection algorithm in our framework [50]. The main task of a drift detector is to prompt the learner to update the model after a drift occurs. The number of misclassified instances corresponding to each class is used as a drift indicator, based on the results so far. Following [50], we employ two counters for each class, where \(w_i\) denotes the warning level, and \(d_i\) denotes the drift detection threshold. That is, we continuously update \(w_i\) and \(d_i\), and if the number of misclassified instances reaches \(d_i\), then a drift is detected. Subsequently, a new model is induced using the examples stored between \(w_i\) and \(d_i\). Practically, this process aids in removing outdated samples and updates the queues with new instances. Our drift detection process is initiated once an instance is misclassified and then continues until it reaches a specified proportion of the queue (denoted by L/n). The rationale behind this approach is to find a trade-off between the learner's ability to adapt quickly and not testing for drift too often, to limit the overhead associated with detection. We aim to maintain only the smallest subset of data necessary to accurately flag a drift. Intuitively, if \(L = 1\), then the process corresponds to testing for drift as every instance arrives; that is, \(n = 1\). Figure 5 shows our results against the Gas Sensor [49] and LED [10] data streams, two of the repositories used in our experiments, where n was set to 2 by inspection. The reader will notice that, as expected, the drift detection threshold has a considerable influence on the predictive performance. In this setting, once a misclassification occurs, we signal a warning for a potential drift and start to collect all instances from this point in time into the drift detector queues. Next, we test for drift once we reach L/n instances and proceed accordingly: we either detect a drift (and reset the learner) or continue monitoring. If no drift has been detected, but the warning level remains, we proceed to collect and test with the next L/n instances. This process continues until the set of examples is equal to our queue size L. When a drift is detected, the learner is reset, and a new model is learned using a training set consisting of all the examples in the drift detection queues maintained since the warning was triggered. It follows that the values of n and L are domain-dependent and should be carefully selected to ensure the accuracy and efficiency of the drift detector. As shown in Fig. 5, the response to concept drift with thresholds \(\alpha = (1,L)\) is later than with the other two settings, which decreases the model performance. However, as should be expected, the reaction to the threshold values differs between the two data streams: the Gas Sensor data experience more fluctuation and performance decrements during the stream, whereas the LED stream is more tolerant of the late drift detection of \(\alpha = (1,L)\). In summary, the results for \(\alpha = (1,L/2)\) outperform the other thresholds for both streams.
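The sketch below captures our reading of this warning/drift logic; the per-class thresholds and the L/n cadence are set by inspection in practice, so the concrete values here are illustrative assumptions, not prescribed settings.

```python
class ClassDriftDetector:
    # Class-based warning/drift counters in the spirit of DDM [50]: count
    # per-class misclassifications, collect instances after a warning, and
    # test for drift every L/n collected instances.
    def __init__(self, L, n=2, warn_level=3, drift_level=6):
        self.step = max(L // n, 1)
        self.warn_level, self.drift_level = warn_level, drift_level
        self.errors = {}     # class label -> misclassification count
        self.buffer = []     # instances collected since the warning
        self.warning = False

    def update(self, x, y_true, y_pred):
        if y_pred != y_true:
            self.errors[y_true] = self.errors.get(y_true, 0) + 1
        if not self.warning and self.errors.get(y_true, 0) >= self.warn_level:
            self.warning = True
        if self.warning:
            self.buffer.append((x, y_true))
            if len(self.buffer) % self.step == 0 and \
                    self.errors.get(y_true, 0) >= self.drift_level:
                recovery, self.buffer = self.buffer, []
                self.errors.clear()
                self.warning = False
                return recovery  # caller resets the learner and retrains
        return None
```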

Fig. 5 G-mean value results with different concept drift thresholds against the Gas Sensor and LED streams

In summary, in our DynaQ algorithm, the rate parameter is calculated when a new instance arrives. Subsequently, for minority classes, the new instance is resampled k times and appended to the appropriate queue. As explained above, we use each instance to test the model before inserting it into the appropriate queue. The set of queues forms a batch, and each batch is used by a single learner to update the model. Ensemble learning involves sliding windows and soft voting. Next, we discuss our experimental evaluation.
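Tying the previous fragments together, one interleaved test-then-train iteration of DynaQ can be outlined as follows; this is a hedged composition of the sketches above (soft_vote, oversample_count, insert, sliding_batch, train_step, and ClassDriftDetector), not the authors' released code, and the drift recovery is simplified.

```python
def dynaq_step(x, y, learners, classes, recalls, detector, t):
    y_hat = soft_vote(learners, x, classes)   # 1. test the ensemble first
    recovery = detector.update(x, y, y_hat)   # 2. class-based drift check
    if recovery is not None:                  #    drift: retrain on the
        X, labels = zip(*recovery)            #    collected recovery set
        for clf in learners:                  #    (a full reset is elided)
            clf.partial_fit(list(X), list(labels))
    k = oversample_count(recalls, y)          # 3. rate parameter for class y
    for _ in range(1 + k):                    #    insert once, plus k copies
        insert(x, y)
    batch = sliding_batch()
    if batch is not None:
        train_step(learners, batch, t)        # 4. round-robin model update
```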

4 Experimental evaluation

All experiments were conducted on a MacBook Pro with a Dual-Core Intel Core i5 CPU @ 3.1 GHz and 8.0 GB RAM, running the macOS Catalina operating system, and on the Alliance Canada Cloud with 10 CPU cores [3]. Our code was implemented using the Scikit-Learn [43] and Scikit-Multiflow [40] packages in Python version 3.8.2. Our competitor methods run on the MOA [10] library, and we used MOA to generate the synthetic streams. The framework's implementation and all the code for the experiments will be made available on GitHub upon publication. The no-change and majority-class classifiers were used as our baselines. The no-change classifier assumes that the class label of instance \(x_i\) is the same as that of the last-seen instance \(x_{i-1}\), while the majority-class learner assigns the class seen most often so far to a new instance [40]. Additionally, we employed three base classifiers, namely the Hoeffding Adaptive Tree (HAT) [9], the Hoeffding Tree (HT) [18], and the self-adjusting memory (SAM) model for the K-nearest neighbour (KNN) classifier, denoted by SAMKNN [33], during our ensemble learning. HTs are incremental decision trees for data stream classification that use Hoeffding's bound to commence online learning. HAT is an extension of HT that adaptively learns from data streams that change over time, without needing a fixed-size sliding window. SAMKNN is an online implementation of KNN, and we set \(k=7\) by inspection. Following [47], we set the number of base learners in our ensemble to 10.

Table 1 Data streams and their properties

The estimation technique we use is prequential evaluation, which consists of executing a continuous loop in which the ensemble first predicts the label of each new instance (without seeing its label) and then updates its model with that instance and the correct label [56]. The performance measures we use are the F-measure, geometric mean (G-mean), and \(\kappa _m\) statistic. As mentioned, the F-measure [21] is macro-averaged over the sum of the F1-scores of all classes, which assigns equal weights to the existing classes. Additionally, we employ the G-mean [50], which is the geometric mean of the recall rates of the majority and minority classes in an imbalanced data set. The calculation method is shown in (9).

$$\begin{aligned} G\text {-}mean = \sqrt{\frac{TP}{TP+FN} \times \frac{TN}{FP+TN}} \end{aligned}$$
(9)

The G-mean value is high only when the classification accuracies of both the majority and the minority samples are high; therefore, the G-mean value can accurately reflect the classification performance on unbalanced data sets. In addition, we utilize the previously introduced \(\kappa _m\) metric to capture the performance agreement between the majority-class classifier and classifier X [11].
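In the multi-class case, the G-mean generalizes (9) as the geometric mean of the per-class recalls; a two-line sketch of this standard extension:

```python
import numpy as np

def g_mean(recalls):
    # recalls: iterable of per-class recall values.
    r = np.asarray(list(recalls), dtype=float)
    return float(np.prod(r) ** (1.0 / len(r)))
```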

We also compare DynaQ with six state-of-the-art online multi-class learning methods, namely OMCQ, IOE, ROSE, KUE, MOOB, and MUOB. We use inspection and grid search for hyper-parameter tuning to determine the optimal parameters for all methods. However, if a particular algorithm prescribes certain parameters, we respect that and use them accordingly. As introduced in Section 2.2, ROSE and KUE are driven by the \(\kappa \) metric. While KUE is a chunk-based general-purpose ensemble for drifting data streams [14], ROSE is an online ensemble that works with imbalanced data streams with dynamic imbalance ratios and concept drift, offering several features designed specifically for these challenges [15]. All ensembles are evaluated using HTs as base learners, with the same parameter settings and 10 base classifiers. These two methods employ a sliding window of 1000 instances. Recall that OMCQ uses a queue-based sampling strategy that keeps each class separated in its queue to maintain a balanced training set. The method employs a class-based DDM with a queue-based recovery process. Following [46], the queue sizes were set by inspection. As noted in Section 2.2, Vafaie et al. [47] introduced the IOE algorithm for handling multi-class imbalanced data. The approach samples with replacement and incorporates DDM-OCI [50]. DDM-OCI tracks the recall rates on the minority classes to actively locate concept drifts in imbalanced data streams; a significant drop in the recall of a class suggests a drift in that class. After a drift is detected, the model is reset and trained on the data received between the drift warning and the drift detection. Following [47], we set the forgetting factor for the recall rates to 0.9 and the threshold for the absolute difference between the class recall and the average recall to 0.05. Additionally, as mentioned in Section 2.2, [52] introduced the MOOB and MUOB algorithms, which oversample or undersample classes based on the probabilities of instances belonging to a class. That is, in MOOB, oversampling is used to increase the chance of learning minority-class examples based on the occurrence probability of examples belonging to each class, while in MUOB, undersampling is used to reduce the chance of learning majority-class examples. We incorporated the DDM-OCI [50] method into MOOB and MUOB to handle concept drift and thus conduct a fair comparison. We conducted five sets of experiments to assess our DynaQ algorithm. First, we studied the impact of the four different rate parameters used to conduct minority-class oversampling. Second, we considered the impact of the queue size on learning. Third, we explored the performance of DynaQ when utilizing different base classifiers in our ensemble. Fourth, we considered the impact of concept drift detection on DynaQ. Finally, we contrast DynaQ with the state of the art.

Table 2 Evaluation of different versions of DynaQ against data streams

4.1 Data streams

Our experimental study was based on the following multi-class data sets, depicted in Table 1: historical weather data obtained from Open Data Canada [24], the Shuttle data set from the KEEL repository [2], the LED data stream [10], the radial basis function (RBF) stream [10], the Gas Sensor stream [49], the Human Activity Recognition (HAR) stream [16], the Covertype data stream [12], and the Intel Berkeley Research Lab Sensor data stream [37]. The Weather repository contains data from probes located across Canada to detect adverse weather, with natural drifts. The Shuttle stream considers three classes and is used to predict when an auto-landing would be preferable to a manually controlled landing of a spacecraft. The LED data set comprises seven Boolean attributes and ten labels; the goal is to predict the digit displayed on a seven-segment LED display, where each attribute has a 10% noise level. We used a version of LED available through Scikit-Multiflow that includes gradual concept drifts in the stream by simply changing the attribute positions. The RBF generator creates a fixed number of random centroids, where each center has a random position, a single standard deviation (SD), a class label, and a weight. The generated RBF data sets have ten numerical attributes and 50 centers with four classes, and a change speed of 0.89 was chosen for the gradual drift in the data. To assess the impact of concept drifts on imbalanced streams, we created two RBF streams with comparable imbalance ratios but distinct concept drift patterns. The Gas Sensor stream contains 13,610 measurements from 16 chemical sensors utilized in simulations for drift compensation in a discrimination task of six gases at various concentration levels. The HAR data set contains uncalibrated accelerometer data from 15 participants performing seven activities. We combined the activities of three participants to create drift in the stream. Covertype is a benchmark data set for evaluating stream classifiers that originates from the UCI repository and contains cartographic attributes for predicting the forest cover type. This data set represents the forest cover type for \(30\times 30\) m cells, where each cover type is represented by one of seven classes. Concept drift may appear in this domain due to weather and climate change. The Intel lab data were collected from 54 sensors deployed in the Intel Berkeley Research lab between February 28th and April 5th, 2004. Mica2Dot sensors with weatherboards collected timestamped topology information, along with humidity, temperature, light, and voltage values, once every 31 seconds. Data was collected using the TinyDB in-network query processing system, built on the TinyOS platform. We used the data instances from 15 sensors to produce an imbalanced data stream.

4.2 Experimental results

In our first set of experiments, we investigate the four different DynaQ oversampling metrics introduced in Section 3.2, as depicted in Table 2. In this set of experiments, we employ the HT classifier and report the G-mean, F-measure, and \(\kappa _m\) evaluation metrics. Table 2 shows that the DynaQ variant using recall produced the highest values for four of the seven data streams. When considering the individual metrics, the reader will notice that the results for DynaQ with the F1-score are highest for the Weather and RBF streams. In the case of the CoverType data set, the DynaQ_\(\kappa _m\) variant yields the highest results in terms of the G-mean and \(\kappa _m\) metrics.

Next, we consider the statistical significance of the results using the Nemenyi post-hoc test with \(\alpha = 0.05\), as depicted in Fig. 6. Figure 6 illustrates that oversampling using the recall and the F1-score results in similar performance, with recall-based sampling ranking first. This result indicates that paying close attention to the true positive rates clearly benefits learning. As a result, we utilize the DynaQ variant with recall-based sampling in our subsequent experiments.

Fig. 6 Nemenyi graph ranking HT base classifier performance for various sampling methods

Second, we investigated the effect of the queue size on our DynaQ algorithm, to assess how the value of L affects the performance of queue-based learning. Figure 7 depicts the proposed method's behavior for different queue sizes \(L \in \{1,10,20,30,50\}\). As expected, the figure shows that the smaller the queue length, the faster the learning speed, although the results may differ with longer queues. For most data sets, this number is highly domain-dependent and should be set according to the characteristics of each data stream. In online learning, there is an obvious interplay between accuracy and learning time. Our results indicate that, in our experiments, a queue size of 10 provided a good trade-off between accuracy and speed for the Shuttle, Weather, and Gas Sensor data sets. A queue size of 20 produced good results for RBF and LED, while Covertype and HAR worked better with a queue size of 50. We subsequently report the results for a queue size of 50 against all data sets. Notably, our extensive experimentation shows that the optimal queue size does not depend on the size of the data set.

Fig. 7 Performance comparison against different queue sizes on DynaQ

Fig. 8 Results of the DynaQ algorithm for different base learners when compared with the baselines

Next, we focus on the third set of experiments, where we utilize various base learners in our DynaQ algorithm. Figure 8 depicts the G-mean results for our data sets when assessing the performance of the DynaQ technique compared to the two baseline algorithms. The results clearly show the benefit of our DynaQ algorithm over the majority-class and no-change learners, which could not learn the concepts within our multi-class imbalanced streams. The HT ensemble reaches higher performance (in Shuttle by \(1\%\), in LED by up to \(4\%\), in Covertype by \(6\%\), and in HAR by \(7\%\)) than the HAT or SAMKNN ensembles. However, for the Weather, RBF, and Gas Sensor data sets, the HAT ensemble presents up to \(4\%\) better results than the ensembles based on HT or SAMKNN. Recall that HT and HAT are both incremental tree learners. However, HAT is more adaptable to streaming data because it uses the adaptive sliding window (ADWIN) algorithm as a drift detector and error estimator, and it requires no parameters related to change control. We conclude that the data sets that work slightly better with HAT ensembles may include gradual or abrupt concept drift within the stream. The reader should notice that fast drift detection and the associated recovery process prevented a significant performance drop. Our results also indicate that the baseline no-change and majority-class learners produced low values across all data sets.

Fig. 9 Results with and without concept drift detection for DynaQ ensembles

In the next set of experiments, we investigate the effect of the queue-based concept drift detection method on the DynaQ learning process. As expected, Fig. 9 indicates that the G-mean is higher when we include the DDM mechanism during learning. This result suggests that, because the queue-based sampling keeps the recent concepts of the stream, the use of drift detection clearly benefits the learning process. The fluctuation and recovery periods differ for the various types of concept drift. In the related visualization graphs, the reader will notice that the Weather data experience abrupt drift, while the LED data show gradual drift. With drift detection, the performance against these two streams stays relatively high; the learner can recover and return to higher values after a drift happens. In this case, employing a DDM clearly prevents performance degradation. From Fig. 9, we find that RBF and HAR, which are susceptible to gradual drifts, do not suffer a drastic reduction in their total predictive accuracies. However, there are clear fluctuations related to the gradual concept drifts handled by the DDM. The Gas Sensor and Covertype data sets encounter sudden performance decrements and variation within the stream, caused by abrupt drifts. However, when utilizing the DDM, the learner recovered quickly from these changes without lasting harm to performance.

Finally, we present our results when contrasting the different online multi-class approaches. Specifically, Tables 3, 4, 5, and 6 present the results of our comparative study contrasting the DynaQ, ROSE, KUE, OMCQ, IOE, MOOB, and MUOB algorithms. Table 3 includes the results of all methods with HT as the base classifier. The remaining tables present results for the base-classifier-agnostic methods with three component classifiers (HT, HAT, and SAMKNN) and three evaluation metrics (G-mean, F-measure, and \(\kappa _m\)). Table 3 shows that DynaQ outperforms all approaches over the different measures. Although DynaQ and ROSE produce competitive results, DynaQ yields better results than ROSE in 25 out of 30 cases. However, ROSE performs slightly better on the Weather, RBF3, and Covertype streams. This result may be attributed to the abrupt drifts that these three data streams experience, where the background classifiers utilized by the ROSE algorithm facilitate the learning process. Table 4 shows that the DynaQ algorithm produced the highest G-mean values in 18 out of the 21 cases. The results also indicate that the three base learners produce comparable results in terms of the G-mean, while no single base learner consistently outperforms the other two in all settings. However, the reader will notice that there are often substantial differences in the G-mean, e.g., for LED and Covertype, when contrasting the queue-based algorithms with IOE, MOOB, and MUOB.

Table 3 HT results against data streams
Table 4 G-mean results against data streams
Table 5 F-measure results against data streams

A similar observation holds for the F-measure, depicted in Table 5, where DynaQ produced the highest results for all data streams and base learners. The design of DynaQ implies that there is no wait time for each queue before commencing the training of the model, and the queues are resampled with the most recent data based on the recall rates; thus, the algorithm allows for better generalization on the incoming stream. Recall that OMCQ maintains queues that only include original instances. In the case of highly imbalanced streams, the minority queues may still contain instances representing the concept prior to drift, thus leading to a degradation in performance. The IOE technique also provides results competitive with DynaQ, which suggests the benefit of balancing performance based on the recall parameter. However, the results suggest that the undersampling of majority instances does not benefit learning. The superior results of DynaQ are especially evident for the LED, RBF, Gas Sensor, and Covertype data streams, which contain severe abrupt or gradual concept drifts. The MOOB and MUOB algorithms struggled to obtain high values against such evolving streams.

Table 6 \(\kappa _m\) results against data streams

Regarding the \(\kappa _m\) results presented in Table 6, the DynaQ and IOE algorithms yielded comparable values, with DynaQ in first place and IOE in second place. In the case of the LED and RBF data sets, employing SAMKNN results in \(\kappa _m\) values up to \(10\%\) higher than HT and HAT. The reader will notice that recall balancing leads to improvements on the synthetic streams when using SAMKNN, while the performance decreases drastically on the HAR and Weather data sets. For the Covertype, Gas Sensor, and Weather data sets, even though they experience abrupt drifts that may change the behaviour of the major classes, our results are up to \(8\%\) higher than those of the other methods. In DynaQ, we do not make any assumptions about separating majority and minority classes, which helps DynaQ reach promising \(\kappa _m\) performance. In our overall analyses, DynaQ using HT as a base learner performs better than, or similarly to, DynaQ using HAT. Also, the comparison between MOOB and MUOB shows that oversampling benefits the method more than undersampling. This observation further reinforces our design choice of using an oversampling approach.

Next, we present the results of the Nemenyi post-hoc test [26], shown in Figs. 10, 11, 12, and 13, where \(\alpha \) is set to 0.05. This test highlights the contrasts between the algorithms across all data sets, where a lower rank means better predictive performance (G-mean, F-measure, and \(\kappa _m\)). In Fig. 10, DynaQ is ranked first, followed by ROSE, OMCQ, and IOE. While ROSE benefits from combining backup ensembles, DynaQ provides a stable backup in case of drifts by allocating small-sized queues to each class. For this reason, DynaQ delivers consistent performance and high ranks across all data streams.

Figures 11 and 12 show a critical difference between our DynaQ algorithm and the IOE, MOOB, and MUOB techniques for the G-mean and F-measure metrics. The reader will notice no statistically significant differences between the DynaQ and OMCQ methods; both benefit from their underlying queue-based learning processes. However, our DynaQ method ranks first and OMCQ second. This ranking indicates the strength of combining queue-based learning with minority-class oversampling and online ensemble learning.

Figure 13 indicates that DynaQ and IOE present similar \(\kappa _m\) values, with DynaQ ranked first, while both outperform OMCQ, MOOB, and MUOB. The DynaQ and IOE algorithms both utilize recall rates, which aid the learners in handling the changes in class labels caused by an evolving and skewed stream. The main difference between these approaches is that DynaQ employs only oversampling, while IOE combines oversampling and undersampling. Since DynaQ is ranked first, one may conclude that undersampling is unnecessary in most settings. The results further indicate the value of balancing recall rates to improve performance in a multi-class imbalanced setting when focusing on the \(\kappa _m\) metric.

Our experimental evaluations indicate the strength of combining a queue-based approach with dynamic minority class oversampling, concept drift detection, and online ensemble learning.

5 Conclusion

This paper addressed the challenge of online learning from evolving multi-class imbalanced data streams susceptible to concept drift. The DynaQ algorithm combines class-based queues, dynamic oversampling of minority classes, online ensembles based on sliding windows, and class-based concept drift detection. An advantage of the DynaQ method is that it operates independently of the base classifier, thus providing a general framework for dealing with evolving multi-class streams. Our experimental results showed the benefits of our approach, and we determined that the DynaQ method constructs accurate models.

Our DynaQ algorithm is highly suitable for scenarios where minority classes may become majority classes, and vice versa. Future work will specifically address highly skewed distributions, where minority instances may arrive in bursts. We will analyze the effect of the number of classes on the performance of each class and our method. We will also research other resampling methods, such as extending SCUT-DS [41], an approach for stationary streams that combines SMOTE-based oversampling with cluster-based undersampling. Another interesting idea is to extend the SOUP bagging ensemble-based algorithm that uses the notion of safe levels to resample data in an offline setting [25]. Moreover, we are interested in examining the impact of our proposed approach in the domain of privacy [53, 54]. We intend to extend DynaQ to incorporate privacy concerns and conduct experiments using time-series and text datasets to evaluate its effectiveness.

Fig. 10 Nemenyi graph ranking algorithms based on the HT base classifier

Fig. 11 Nemenyi graph ranking G-mean results among various algorithms

Fig. 12 Nemenyi graph ranking F-measure results among various algorithms

Fig. 13 Nemenyi graph ranking \(\kappa _m\) results among various algorithms