VFC-SMOTE: very fast continuous synthetic minority oversampling for evolving data streams

The world is constantly changing, and so is the massive amount of data it produces. However, only a few studies deal with online class imbalance learning, which combines the challenges of class-imbalanced data streams and concept drift. In this paper, we propose the very fast continuous synthetic minority oversampling technique (VFC-SMOTE). It is a novel meta-strategy to be prepended to any streaming machine learning classification algorithm, and it oversamples the minority class using new versions of Smote and Borderline-Smote inspired by Data Sketching. We benchmarked VFC-SMOTE pipelines on synthetic and real data streams containing different concept drifts, imbalance levels, and class distributions. We provide statistical evidence that VFC-SMOTE pipelines learn models whose minority class performance is better than that of state-of-the-art methods. Moreover, we analyze the time/memory consumption and the concept drift recovery speed.


Introduction
Data abound as a multitude of smart devices produce massive, continuous, unbounded and non-stationary flows of data, namely data streams. Data streams are different from the batches of data used to train traditional Machine Learning (ML) models. In the case of batches, since all the observations are known in advance, it is possible to iterate over them multiple times, to split them into training and testing sets, or to inspect their characteristics, e.g. the class imbalance ratio. In the case of data streams, instead, new samples arrive unceasingly over time as mini-batches or even one at a time. Therefore, it is impossible to iterate over data streams multiple times, to split them into training and testing sets, or to inspect their characteristics. Hence, traditional batch-oriented ML techniques cannot be used.
In the early 2000s (Domingos and Hulten 2000; Hulten et al. 2001), data streams were recognized as a challenge, and interest in them has been growing steadily since. ML practitioners often transform streams into sequences of batches and retrain models as new batches become available. However, frequently retraining a model may become expensive. In recent years, Streaming Machine Learning (SML) (Bifet et al. 2010) was introduced to address this challenge. SML can maintain models online, incorporating one sample at a time and continuously updating the model instead of retraining it anew. To cope with the unboundedness of the stream, each sample is discarded once it has been used. Moreover, streams can evolve and present forms of non-stationarity (namely, concept drift (Tsymbal 2004)). A concept drift occurs when the function which generates instances at time step t is not the same as at time step t + 1. Therefore, an SML model must be dynamic and adaptive. Streams can present class imbalance, too. Most streaming binary classification solutions neglect the minority class instances, preventing the discovery of any existing patterns in them (He and Garcia 2009). Moreover, when a concept drift occurs, classes may swap, i.e. all the samples labeled as minority (majority) class before the concept drift are labeled as majority (minority) class after it.
The combined problems of concept drift and class imbalance are found in many real-world applications. In social media mining, classifying news according to topics involves both concept drift (new topics frequently appear, outdated ones are forgotten with time, and old ones may become popular again) and class imbalance (some topics are more popular than others). Such phenomena can also be observed in product recommendations, since the interests of clients may change over time and some products are more popular, and therefore purchased more often, than others (Ma et al. 2007). Another application is credit card fraud detection (Pozzolo et al. 2018). The task is to classify whether a credit card transaction is fraudulent or not, and it involves both concept drift (customers' habits evolve and fraudsters change their strategies over time) and class imbalance (genuine transactions far outnumber frauds).
To address these problems, techniques to rebalance the training dataset were proposed in the batch scenario. Two of the most used (He and Garcia 2009) are the Synthetic Minority Over-sampling Technique (Smote) (Chawla et al. 2002) and the Borderline Synthetic Minority Over-sampling Technique (Borderline-Smote) (Han et al. 2005). They need the entire batch as input to enable rebalancing but, in the streaming approach, this batch is not available. The main contribution of this paper is the Very Fast Continuous Smote (VFC-SMOTE). It is a meta-strategy inspired by Smote, Borderline-Smote and Data Sketching (Cormode 2017) and, besides improving the classification performances w.r.t. the state-of-the-art methods, it focuses on reducing the time and memory consumption.
VFC-SMOTE can be prepended to any SML model as a sort of magnifying glass that enlarges the poorly represented portion of the stream (the minority class instances). Since we are proposing a meta-strategy that acts as a pre-processing step, we expect it to be data dependent, like some other meta-strategies, e.g. temporal augmentation. This is the reason why we are looking for at least one SML algorithm prepended by VFC-SMOTE that outperforms the state-of-the-art ones, and not for an algorithm able to outperform all the others.
In investigating VFC-SMOTE, we distinguish between SML algorithms that are natively able to rebalance streams in the presence of concept drift (say, SML+) and algorithms that are unable to do so (say, SML-), and we formulate the following research questions: Q1 Is there at least one SML- model that, prepended by VFC-SMOTE, outperforms the SML+ algorithms presented in the literature? Q2 Does VFC-SMOTE consume less time/memory than the SML+ techniques? Q3 Does VFC-SMOTE alter the recovery of SML- models from a concept drift?
In more detail, the main contributions of this paper are:
• VFC-SMOTE, a meta-strategy inspired by both the well-known Smote and Borderline-Smote techniques (applicable only to batches) and by Data Sketching, that can be prepended to any SML- classifier;
• Statistical evidence that VFC-SMOTE can outperform the SML+ methods in terms of F1, Recall and G-mean, obtained by analysing the results of five SML- classifiers prepended by VFC-SMOTE and four other SML+ strategies (i.e., positively answering Q1);
• A computational cost analysis of SML- algorithms prepended by VFC-SMOTE w.r.t. other SML+ strategies in the literature, showing that the latter take more time and consume more memory than the former (i.e., positively answering Q2 as well); and
• A discussion of the recovery speed from concept drift, showing that VFC-SMOTE does not alter the recovery speed of the SML- model to which it is prepended (i.e., negatively answering Q3).
The remainder of this paper is organized as follows. Section 2 introduces the imbalance problem, some rebalance techniques and some SML+ models. Section 3 describes VFC-SMOTE. Section 4 introduces the streams and SML- models used in the experiments and presents our research hypotheses. Section 5 shows and discusses the evaluation results. Section 6 discusses the limitations of VFC-SMOTE, while Sect. 7 presents the conclusions and some directions for future research.

Related work
The first part of this section introduces the class imbalance problem and some sampling techniques able to deal with it, while the second part deals with some state-of-the-art approaches able to learn from imbalanced data streams and to handle concept drift changes.
Sampling techniques for class imbalance An unequal distribution among classes characterizes imbalanced data. Since the instances contained in the minority class(es) rarely occur, the patterns for classifying these classes tend to be rare, undiscovered, or ignored. He and Garcia (2009) characterize the approaches to handle class imbalance as: sampling techniques, cost-sensitive learning, kernel-based methods, and active learning methods. This work focuses on sampling techniques because they allow creating a new meta-strategy that can be run during the pre-processing phase, regardless of the streaming method chosen. Sampling techniques change the data distribution so that standard algorithms focus on the cases that are more relevant to the user.
They are divided into oversampling and undersampling methods. Oversampling methods increase the number of minority class instances through the creation of synthetic instances, until the classes are balanced. After that, the minority class, which was originally underrepresented, may exert a greater influence on learning and future predictions. Undersampling methods, on the other hand, reduce the number of majority class instances by removing some of them. They often act by removing noisy instances, or simply by removing instances randomly or by means of some heuristics. Both kinds of methods introduce their own drawbacks, which can worsen the learning phase (He and Garcia 2009). In the case of undersampling, removing instances from the majority class may cause the loss of important concepts. In the case of oversampling, since data are replicated or synthetically generated, the drawback is that multiple iterations over the same instances can result in overfitting.
Smote (Chawla et al. 2002) is a popular oversampling technique (see Fig. 1a) that synthetically generates instances for the minority class to balance the training data. For each minority class sample x_i (orange triangle), Smote finds its K-nearest neighbours among the other minority class samples, randomly chooses one of them, x̂_i, and multiplies the difference between x̂_i and x_i by a random number δ ∈ [0, 1]. The resulting new sample x_n (black circle) is located between x_i and the selected neighbour x̂_i. In general, Smote has been shown to improve classification, but it may also suffer from drawbacks related to the way it creates synthetic samples. Specifically, Smote generates new samples without considering the neighbouring examples of the other class, which increases the occurrence of overlapping between classes.
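The interpolation step described above can be written as x_n = x_i + δ (x̂_i − x_i). The following is a minimal Python sketch of that step only, assuming Euclidean distance and a brute-force neighbour search; the function name smote_sample and its parameters are illustrative and do not come from the original Smote implementation.

```python
import math
import random

def smote_sample(x_i, minority, k=5, rng=random):
    """Generate one synthetic point from x_i and its k nearest minority neighbours."""
    # Candidate neighbours: every other minority sample, sorted by Euclidean distance.
    others = [x for x in minority if x is not x_i]
    neighbours = sorted(others, key=lambda x: math.dist(x_i, x))[:k]
    x_hat = rng.choice(neighbours)   # randomly chosen neighbour
    delta = rng.random()             # random number in [0, 1)
    # Interpolate along the segment: x_n = x_i + delta * (x_hat - x_i)
    return [a + delta * (b - a) for a, b in zip(x_i, x_hat)]
```

Because the search is restricted to minority samples, the new point always lies on a segment joining two minority samples, regardless of where the majority samples are.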
To this end, Borderline-Smote (Han et al. 2005) (see Fig. 1b), instead of using all the minority class samples to generate new instances, uses only the borderline samples. For each minority class sample x_i (orange triangle), Borderline-Smote finds its K-nearest neighbours among all the samples. If all of them are majority class samples (case A), x_i is considered noise. If there are more majority class neighbours than minority class ones (case B), x_i is considered easily misclassified and is put into a danger set. Otherwise, if there are more minority class neighbours than majority class ones (case C), x_i is considered safe. Only the samples in the danger set are the borderline ones, and they are used by Smote to generate new instances (black circles).
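The three cases above can be sketched as a small Python function. This is an illustration under stated assumptions, not the reference implementation: the name borderline_category is hypothetical, distances are Euclidean, and ties between the two neighbour counts (not covered by the description above) are treated as safe.

```python
import math

def borderline_category(x_i, samples, labels, k=5, minority_label=1):
    """Classify a minority sample as 'noise', 'danger' or 'safe'."""
    # k nearest neighbours among all samples (both classes), excluding x_i itself.
    others = [(math.dist(x_i, x), y) for x, y in zip(samples, labels) if x is not x_i]
    nearest = sorted(others, key=lambda t: t[0])[:k]
    n_majority = sum(1 for _, y in nearest if y != minority_label)
    n_minority = len(nearest) - n_majority
    if n_majority == len(nearest):
        return "noise"    # case A: every neighbour belongs to the majority class
    if n_majority > n_minority:
        return "danger"   # case B: borderline sample, used to generate new instances
    return "safe"         # case C (ties also end up here: an assumption of this sketch)
```

Only the samples labelled "danger" would then be passed to the Smote interpolation step.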
More than a hundred Smote variants have been proposed (Fernández et al. 2018) to overcome the overlapping between classes. The most well-known are ADASYN (He et al. 2008), DBSMOTE (Bunkhumpornpat et al. 2012), MDO (Abdi and Hashemi 2015), SWIM (Bellinger et al. 2020), and G-SMOTE (Douzas and Bação 2019). Due to space limitations, we provide their explanations in Appendix A.
For our proposed method we decided to use Smote because it is still considered the "de-facto" standard in the framework of learning from imbalanced data (Fernández et al. 2018). There is also the possibility to use Borderline-Smote, with the aim to use less data to generate new synthetic instances, thus reducing the time and memory consumed.
Still, another problem remains. As sampling techniques, Smote and Borderline-Smote cache the entire dataset in memory. This approach goes against the basic principle of the data stream paradigm, which inspects only one sample at a time, as fast as possible, and then discards it. In Sect. 3, we explain how we overcome this problem.
State-of-the-Art Approaches The first part of the section wraps up the various methods, while the remainder of the section describes the methods summarized in Table 1.
State-of-the-art approaches are commonly categorized into two major groups, passive and active, depending on whether an explicit drift detection mechanism is employed. Passive approaches train a model continuously without an explicit trigger reporting the drift, while active approaches determine whether a drift has occurred before taking any action. Examples of passive approaches are RLSACP, ONN, ESOS-ELM, an ensemble of neural networks (NN), OnlineUnderOverBagging, OnlineSMOTEBagging, OnlineAdaC2, OnlineCSB2, OnlineRUSBoost and OnlineSMOTEBoost, while ARF_RE, RebalanceStream, C-SMOTE, OOB, UOB, WEOB1 and WEOB2 are considered active approaches.
Passive approaches RLSACP (Ghazikhani et al. 2013b) is inspired by the recursive least square (RLS) filter error model. In the proposed error model, non-stationarity is handled including the forgetting factor (k) present in the RLS error model, while for handling class imbalance, two adaptive error weighting strategies are proposed. In the first one, error weights are adapted based on classifier results in different classes. In the second one, the number of instances in the minority and majority classes are counted from a fixed window of the most recent samples and the weights are assigned accordingly.
ONN (Ghazikhani et al. 2014) is a similar approach. It is an online Multi Layer Perceptron model composed of two parts. The first is a forgetting function for handling concept drift, while the second is an error weighting function for handling class imbalance. EONN (Ghazikhani et al. 2013a) is an online ensemble neural network model composed of two layers. The first layer is a cost-sensitive neural network for handling class imbalance, while the second layer contains a method for weighting the classifiers of the ensemble. Another passive technique is ESOS-ELM (Mirza et al. 2015). It is an ensemble approach that, to tackle class imbalance, resamples the data using fixed weights to train each classifier with an approximately equal number of minority and majority class samples. OnlineUnderOverBagging, OnlineSMOTEBagging, OnlineAdaC2, OnlineCSB2, OnlineRUSBoost and OnlineSMOTEBoost (Wang and Pineau 2016) are the online extensions of the popular batch cost-sensitive ensemble learning algorithms UnderOverBagging, SMOTEBagging, AdaC2, CSB2, RUSBoost and SMOTEBoost, respectively. The main challenge in adapting them to the online setting resides in finding a way to embed costs into online ensembles for boosting algorithms without having all the data. The authors reformulate the batch cost-sensitive boosting algorithms so as to avoid the normalization step at each iteration, and then incrementally estimate the quantities embedded with the cost setting in the online learning scenario. Whereas cost sensitivity in the batch setting is achieved by different resampling mechanisms, in the online ensembles it is achieved by manipulating the parameters of the Poisson distribution for the different classes.
Table 1 The main characteristics of the related works

Active approaches ARF_RE (Ferreira et al. 2019) is an extension of the ARF algorithm. ARF_RE resamples instances based on the current class label distribution: it adapts the weights of the Poisson distribution so as to simulate a balanced flow of instances to the base models of the forest. Thus, if a sample is part of the minority class, the weight it receives during the training phase is increased w.r.t. that of a sample from the majority class.
RebalanceStream (RB) (Bernardo et al. 2020a) uses Adwin (Bifet and Gavaldà 2007) to detect concept drift in the stream by training four models m 1 , m 2 , m 3 , and m 4 in parallel: m 1 is trained with the original samples in input; m 2 uses the samples collected and rebalanced using Smote from the beginning to a change (when the last concept drift occurred); m 3 uses the samples collected from a warning (when the most recent concept drift started to occur) to a change; m 4 uses the same data of m 3 but rebalanced using Smote. After a change, the best model among them is chosen and it is used to proceed with the execution.
C-SMOTE (Bernardo et al. 2020b) is a meta-strategy similar to VFC-SMOTE, designed to be prepended to any SML- technique. It is inspired by Smote. It uses Adwin (Bifet and Gavaldà 2007) to keep only the recently seen samples and applies Smote to the minority class samples stored in the Adwin window. Since C-SMOTE saves entire instances in a window, and not only a summary of them as VFC-SMOTE does, it focuses only on improving the classification performance and neglects the computational one.
Fig. 2 Architecture of the VFC-SMOTE meta-strategy prepended to an SML- model
Oversampling-based Online Bagging (OOB) and Undersampling-based Online Bagging (UOB) (Wang et al. 2013) are two further resampling-based ensemble methods. When class imbalance is detected, oversampling or undersampling embedded in Online Bagging (Oza 2005) is triggered to either increase the chance of training on minority class samples or reduce the chance of training on majority class samples. If the new sample (x, c_k) belongs to one of the minority classes (c_k ∈ Y_min), OOB will tune the parameter λ of the Poisson distribution to 1/w_k, which indirectly increases the number of copies of the current sample used for training. In other words, it combines oversampling with Online Bagging. If (x, c_k) belongs to one of the majority classes (c_k ∈ Y_maj), UOB will set λ to (1 − w_k), so that training samples from the majority class are undersampled. A performance analysis (Wang et al. 2015) of OOB and UOB shows that UOB is a better choice than OOB in terms of minority-class Recall and G-mean. However, it has some weaknesses when the majority class in the data stream turns into the minority class. OOB is more robust against such changes.
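As an illustration of how the Poisson parameter λ drives resampling in Online Bagging, the following Python sketch draws the number of training copies for a sample. It assumes that w_k is an estimate in (0, 1) of the proportion of class c_k in the stream (the text above does not define it); the function names poisson and bagging_lambda are hypothetical, and the Poisson sampler is Knuth's textbook method rather than anything taken from the cited works.

```python
import math
import random

def poisson(lam, rng=random):
    """Draw from Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def bagging_lambda(is_minority, w_k, mode):
    """Poisson parameter as described above; 1 is the plain Online Bagging value."""
    if mode == "OOB" and is_minority:
        return 1.0 / w_k    # oversample the minority class
    if mode == "UOB" and not is_minority:
        return 1.0 - w_k    # undersample the majority class
    return 1.0
```

Each base learner would then be trained poisson(bagging_lambda(...)) times on the incoming sample, so a minority sample with w_k = 0.1 is seen about ten times on average under OOB.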
To combine the strength of OOB and UOB, WEOB1 and WEOB2 (Wang et al. 2015) are proposed. They are based on the idea of ensembles, which train and maintain both OOB and UOB. A weight is maintained for each of them, adjusted adaptively according to their current G-mean performance. The weight evaluates the degree of inductive bias in terms of a ratio of positive and negative accuracy. Their combined weighted rating will decide the final prediction. WEOB1 and WEOB2 differ in the weight adjusting strategy that they use.

VFC-SMOTE
This section describes the proposed meta-strategy VFC-SMOTE. Its acronym stands for Very Fast Continuous Smote because new variants of Smote and Borderline-Smote, by which it is inspired, are applied continuously, trying to consume as little time and memory as possible. VFC-SMOTE is designed to rebalance an imbalanced data stream and it can be prepended to any SML- model. The section first introduces the concepts underlying VFC-SMOTE and then explains the algorithm in detail.
Algorithm concepts As said before, the real problem is the impossibility of accessing the entire data during the pre-processing (rebalancing) phase. Indeed, it is impossible to store every new sample in memory until the stream ends, for two reasons: 1) the streams are assumed to be infinite, and 2) this would be against the stream paradigm. Our solution (see Fig. 2) is inspired by Data Sketching (Cormode 2017). In the Data Stream Management System context, Frequency Based Sketches are used to summarize the observed frequency distribution of a dataset. From these sketches, accurate estimations of individual frequencies can be extracted (Cormode 2017). This concept of using statistical summaries for data streams is well-known in the stream clustering context, too (Kranen et al. 2011). Intuitively, our proposal uses a dynamic summary data structure, called sketch, to maintain an approximation of the attribute distribution seen so far on the underlying continuous stream, and then uses the sketch to generate new synthetic instances. VFC-SMOTE uses an incremental decision tree, in particular a Hoeffding Adaptive Tree (HAT) (Bifet and Gavaldà 2009), as the data structure that stores the sketch. HAT is based on the Hoeffding Tree (HT) (Domingos and Hulten 2000), an incremental very fast decision tree algorithm able to learn from non-evolving streaming data. To make HT able to adapt to concept drifts, HAT implements, at each node, an Adwin change detector (Bifet and Gavaldà 2007). Adwin keeps a variable-length window of recently seen items, with the property that the window has the maximal length statistically consistent with the hypothesis "there has been no change in the average value inside the window". More precisely, an older fragment of the window is dropped if and only if there is enough evidence that its average value differs from that of the rest of the window.
Moreover, since it automatically detects and adapts to the current rate of change, Adwin is parameter- and assumption-free in that respect. Its only parameter is the confidence bound δ, which indicates how confident we want to be in the algorithm's output and is inherent to all algorithms dealing with random processes. In this way, each node can monitor the performance of its sub-branches and, if a concept drift occurs, it can substitute them with new sub-branches. So, VFC-SMOTE is always applied to data that are consistent with the current concept, and the new synthetically generated samples will be consistent as well. We chose the HAT model as sketch since it is one of the most used models and it is a state-of-the-art method. As for Adwin, it is currently the most used change detector in MOA learners, it is cited in several papers (Grulich et al. 2018; Bifet et al. 2018; Gama et al. 2014) and it is already implemented inside the HAT model. Another reason is that Adwin has theoretical guarantees, and it extends such guarantees to the sketch. The VFC-SMOTE memory and time (per sample) asymptotic costs are O(log W), where W is the Adwin window length. If the SML- model prepended by VFC-SMOTE uses Adwin, then the total asymptotic costs do not change; otherwise they depend on whether the SML- model costs are higher or lower than the VFC-SMOTE ones. Appendix B shows the detailed cost analysis.
In particular, Fig. 3 shows a simple example of how the sketch works. Given a stream with two attributes, one nominal (x_1) and one numerical (x_2), the sketch incrementally grows a decision tree. Here, after having seen enough samples, the sketch decides to split on a value of the x_2 attribute, and in its leaves (summaries) it keeps the summary statistics of each attribute (summary). For example, in the left branch, the sketch keeps the summaries of all the samples whose x_2 value is less than or equal to the split value (all the samples under the dotted line). In the case of a nominal attribute (x_1 summary), it keeps the occurrence frequency of each attribute value divided by class, while in the case of a numerical attribute (x_2 summary), it keeps the mean (μ), variance (σ²), minimum (m) and maximum (M) values observed, divided by class. Moreover, for each new sample in input, the sketch updates its summaries.
Through the hyperparameter percCorrClass, users can decide how many instances to save into the sketch. Saving all the instances (correctly and incorrectly classified) means using all of them during the rebalancing phase and thus using Smote, while saving only the misclassified ones means taking into account only those that lie on the border between the instances of the two classes, and therefore using Borderline-Smote. Of course, saving fewer instances (percCorrClass = 0) into the sketch could speed up the process and save memory compared to saving all of them (percCorrClass = 1). In the end, the new samples generated by VFC-SMOTE are used by the pipelined SML- algorithm to update its model.
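To make the content of a leaf summary concrete, here is a minimal Python sketch of per-class statistics that can be updated one sample at a time. It is an assumption-laden illustration, not the HAT implementation used by VFC-SMOTE: the class names NumericSummary and LeafSummary are hypothetical, samples are assumed to be dictionaries mapping attribute names to values, and the running variance uses Welford's online update.

```python
from collections import defaultdict

class NumericSummary:
    """Running mean, variance, min and max of one numerical attribute for one class."""
    def __init__(self):
        self.n, self.mean, self._m2 = 0, 0.0, 0.0
        self.min, self.max = float("inf"), float("-inf")

    def update(self, value):
        # Welford's online update: no past values need to be stored.
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (value - self.mean)
        self.min, self.max = min(self.min, value), max(self.max, value)

    @property
    def variance(self):
        return self._m2 / self.n if self.n else 0.0   # population variance

class LeafSummary:
    """Per-class statistics kept in one leaf of the sketch."""
    def __init__(self):
        self.numeric = defaultdict(NumericSummary)             # (attribute, class) -> stats
        self.nominal = defaultdict(lambda: defaultdict(int))   # (attribute, class) -> value -> count

    def update(self, x, y):
        for attr, value in x.items():
            if isinstance(value, (int, float)):
                self.numeric[(attr, y)].update(value)
            else:
                self.nominal[(attr, y)][value] += 1
```

The memory used by such a leaf depends on the number of attributes, classes and distinct nominal values, not on the number of samples seen, which is the property the sketch relies on.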
Algorithm explanation Algorithm 1 presents the VFC-SMOTE pseudo-code. Given a stream S = {(X_1, y_1), (X_2, y_2), ...}, where X_i is a feature vector and y_i is the label, VFC-SMOTE sketches the samples into the variable sketch, initialized with a HAT model (Line 2). VFC-SMOTE also has four counters (Line 3): S_0 and S_1 count, respectively, the number of instances in each class; Ŝ_0 and Ŝ_1 count, respectively, the number of instances of each class generated by VFC-SMOTE. The reason why VFC-SMOTE keeps track of the instances of both classes is that, after a concept drift, the classes may swap. In this case, VFC-SMOTE will use the samples of the new minority class to introduce new synthetic samples. Every time a new sample (X, y) is available, the prequential evaluation (Gama et al. 2013) approach is applied (testing and then training the learner l) (Line 6). Subsequently, depending on how the instance (X, y) is classified, it can be added to sketch (Lines 7-13). If the sample is misclassified, then it is added directly to the sketch (Line 8). If it is not, it could be added anyway with a probability equal to the percCorrClass value (Lines 10-13). The function updateCounters() (Line 14) updates the relative counter S_0 or S_1, while the function checkDcInCounters() (Line 15) uses Adwin to check for a concept drift occurrence and to reset the S_0 and S_1 counters. The next step identifies the actual minority (Line 16) and majority (Line 17) classes and saves the corresponding classes and counters into, respectively, minority, S_min, Ŝ_min, majority, S_maj and Ŝ_maj. Then, Line 18 checks whether the number of minority instances is at least equal to the hyperparameter minSizeMinority specified by the user. If it is, the algorithm can proceed with the rebalance phase; otherwise it waits for another sample in input.
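The rule that decides whether a sample enters the sketch (Lines 7-13 as described above) can be sketched in Python as follows. The function name add_to_sketch and its arguments are illustrative only and are not taken from Algorithm 1.

```python
import random

def add_to_sketch(y_true, y_pred, perc_corr_class, rng=random):
    """Return True if the sample should be added to the sketch."""
    if y_pred != y_true:
        return True                          # misclassified: always added
    return rng.random() < perc_corr_class    # correctly classified: added with this probability
```

With perc_corr_class = 0 only misclassified samples are kept (the Borderline-Smote-like behaviour), while with perc_corr_class = 1 every sample is kept (the Smote-like behaviour).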
If minSizeMinority is equal to −1, no checks are performed on the minority class. If the imbalanceRatio is less than the balance ratio to achieve, t (imbalanced sketch), all the sketch summaries that have seen more than minSizeMinority instances are retrieved (Line 20) and they are used to generate new synthetic samples (X̂, ŷ) until the imbalanceRatio is equal to t (balanced sketch, Lines 21-25). In particular, the function newSample(), called at Line 22 and expanded in Algorithm 2, is responsible for generating a new synthetic sample. It first chooses among the selectedSummaries that saw more instances (Line 3) and then uses the per-attribute summaries they contain to generate a new instance (Lines 4-13). For each summary, a new value is generated for the new instance. In the case of a numerical summary, the new value is a sample drawn from a Beta distribution having the mean (μ) and variance (σ²) observed from the minority class instances seen so far by that summary, scaled by the minimum (m) and maximum (M) values seen. In this way, the new sample value will certainly lie between m and M (Lines 5-10). In the case of a nominal summary, the new value is the most frequently seen value among the minority class instances seen so far by that summary (Lines 11-12). In the end, Line 14 assigns the minority class value to the newly generated instance.
Back to Algorithm 1: at Line 23, Ŝ_min is updated, the new instance (X̂, ŷ) is used to train the learner l (Line 24) and the new imbalanceRatio is calculated (Line 25).
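One plausible way to draw a numerical value as described above is to rescale the observed mean and variance to the unit interval, derive the Beta shape parameters by the method of moments, and map the draw back to [m, M]. The Python sketch below follows that route; the parametrization, the uniform fallback for infeasible moments and the function names sample_numeric and sample_nominal are assumptions of this example, not details confirmed by Algorithm 2.

```python
import random

def sample_numeric(mu, var, m, M, rng=random):
    """Draw a value in [m, M] from a Beta distribution matched to (mu, var)."""
    if M <= m:
        return m                      # degenerate range: only one value observed
    # Rescale mean and variance to the unit interval.
    mu01 = (mu - m) / (M - m)
    var01 = var / (M - m) ** 2
    # The method of moments is valid only if 0 < var01 < mu01 * (1 - mu01).
    if not (0.0 < mu01 < 1.0) or not (0.0 < var01 < mu01 * (1.0 - mu01)):
        return rng.uniform(m, M)      # fallback chosen for this sketch only
    common = mu01 * (1.0 - mu01) / var01 - 1.0
    alpha, beta = mu01 * common, (1.0 - mu01) * common
    return m + (M - m) * rng.betavariate(alpha, beta)

def sample_nominal(counts):
    """Most frequent value among the minority instances (counts: value -> frequency)."""
    return max(counts, key=counts.get)
```

Since a Beta variate lies in [0, 1], the rescaled value cannot fall outside the observed range, which matches the guarantee stated in the text.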

Experimental settings
This section consists of six parts. At the beginning, we introduce the hypotheses to be tested. Then, the following four parts discuss the data streams (both synthetic and real), the SML- algorithms prepended by VFC-SMOTE to run the experiments, and the various experimental settings. The last part selects the best SML+ methods to be compared with VFC-SMOTE.
Research Hypotheses We formulate our hypotheses as follows:
- Hp. 1: Compared to the SML+ methods, VFC-SMOTE improves the minority class performance of at least one of the SML- algorithms to which it is prepended.
- Hp. 2: Since VFC-SMOTE keeps a summary of the data in a sketch and introduces a rebalance phase, the SML- models to which it is prepended consume less memory and have a lower latency per sample than the SML+ methods.
- Hp. 3: Since VFC-SMOTE is only a meta-strategy to be prepended to any SML- model, aiming at improving the minority class performance, it has no impact on the concept drift management and therefore on the recovery speed from a concept drift occurrence.
Artificial data stream We decided to synthetically generate some data streams containing both different types of concept drift (Tsymbal 2004) and different class imbalance levels. We decided to start studying the phenomena in a controllable way, using only a small number of features. We chose two of the most commonly used artificial data generators: SINE1 (Gama et al. 2004) and SEA (Street and Kim 2001). SINE1 generates points with two attributes (x_1, x_2), each uniformly distributed in [0, 1]. The class is determined by x_2 − sin x_1 < θ, where θ is a threshold value. In SEA, each sample has three attributes (x_1, x_2, x_3), each uniformly distributed in [0, 10]. Only the first two attributes are used to determine the label, while the third one is used as noise. The class label is determined by x_1 + x_2 ≤ θ, where θ is a threshold value. For convenience, in all the streams, the minority class is always class 1, while the majority one is class 0. For each type of drift, each generator produces two data streams with a different drift speed (Tsymbal 2004): sudden (in an instant of time the first concept finishes and the next one starts) and gradual (a smooth transition from the first concept to the next one). Streams with a gradual concept drift are denoted by the subscript g (SINE1_g and SEA_g). Every data stream has 100,000 instances and one concept drift, which in the sudden case occurs at time step 50,000. The concept drift in SINE1_g and SEA_g takes 10,000 time steps to complete: it starts at time step 45,000 and the new concept fully replaces the old one by time step 55,000. Moreover, each stream is generated four times, changing the imbalance ratio IR. We use IR = [1:9, 2:8, 3:7, 4:6]. We consider the following three types of concept drift.
p(y) Concept Drift. In this case, the class prior probability changes and, so, well calibrated classifiers can become miscalibrated or imbalance issues may occur.
Data streams SINE1 and SINE1_g have a significant class imbalance change, in which the minority (majority) class of the first half of the streams becomes the majority (minority) during the latter half. SEA and SEA_g have a less significant change, in which the streams are balanced during the first half and become imbalanced during the latter half. In the gradual drifting cases, p(y) is changed linearly during the concept transition period (time step 45,000 to time step 55,000). In the SINE1 streams, we use θ = 0, while in the SEA ones, we use θ = 7.
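The two generators and the linear change of the class prior during a gradual p(y) drift can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the label is set to 1 when the stated inequality holds (the text does not say which class the inequality maps to in every stream), and the mechanism used to enforce a given imbalance ratio is left out because it is not described here.

```python
import math
import random

def sine1_point(theta=0.0, rng=random):
    """One SINE1 sample: two attributes uniform in [0, 1]."""
    x1, x2 = rng.random(), rng.random()
    return (x1, x2), int(x2 - math.sin(x1) < theta)

def sea_point(theta=7.0, rng=random):
    """One SEA sample: three attributes uniform in [0, 10]; x3 is pure noise."""
    x = tuple(rng.uniform(0.0, 10.0) for _ in range(3))
    return x, int(x[0] + x[1] <= theta)

def prior_at(t, p_before, p_after, start=45_000, end=55_000):
    """Class prior p(y = 1) at time step t under a gradual, linear p(y) drift."""
    if t <= start:
        return p_before
    if t >= end:
        return p_after
    return p_before + (p_after - p_before) * (t - start) / (end - start)
```

For example, a stream whose minority prior moves from 0.1 to 0.9 passes through 0.5 at time step 50,000, the midpoint of the transition period.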
p(X|y) Concept Drift. In this case, the true decision boundary remains unaffected, so the drift does not create any particular problem for the classifiers. The data stream is constantly imbalanced. In particular, the class imbalance ratio in the four versions of each stream is, respectively, 1:9, 2:8, 3:7 and 4:6, both before and after the concept drift occurrence. The concept drift in each data stream is obtained by changing, for the negative class (0), the probability that x_1 is lower than a given value n. Before the drift occurrence, the probability is p(x_1 < n) = 0.9, while after it, it is p(x_1 < n) = 0.1. In the gradual drifting cases, it changes linearly during the concept transition period. In the SINE1 streams, we use θ = 0 and n = 0.5, while in the SEA ones, we use θ = 7 and n = 5.
p(y|X) Concept Drift. This is the most critical form of concept drift. The true boundary between classes changes after the drift, so that the previously learnt function does not apply any more. The data stream is constantly imbalanced. In particular, the class imbalance ratio in the four versions of each stream is, respectively, 1:9, 2:8, 3:7 and 4:6, both before and after the concept drift occurrence. The data distribution in SINE1 and SINE1_g involves a concept swap, i.e. class 0 changes from x_2 − sin x_1 ≥ θ to x_2 − sin x_1 < θ, while the data distribution in SEA and SEA_g undergoes a concept drift due to a change in the θ value: before the concept drift we use θ = 7, and after it θ = 13. The change in SEA and SEA_g is less critical than the change in SINE1 and SINE1_g, because some of the examples from the old concept are still valid under the new concept after the threshold has moved completely.
So, we have 4 streams for each stream generator and concept drift speed, for a total of 16 streams for each concept drift type and, in the end, a total of 48 data streams.
Real data stream After having studied the phenomena with some low-dimensional synthetic streams, we tested three real data streams with a higher number of features, often used as benchmarks for concept drift problems: PAKDD, Elec2, and KDD99. The minority class is always class 1, while the majority one is class 0.
PAKDD 2009 credit card data (PAKDD) (Linhart et al. 2009) are collected from the private brand credit card operation of a Brazilian retail chain. The task is to identify whether the client has a good or bad credit. The bad credit is the minority class (20%) of the provided modelling data. Because the data have been collected over a long period of time (one year), gradual market changes occur. It has 28 features, of which 15 numeric and 13 nominal.
In the Electricity data stream (Elec2) (Harries 1999), the task is to predict the direction of electricity price changes w.r.t. the moving average of the last 24 h in the Australian New South Wales Electricity Market. Input variables are recorded every 30 min from May 1996 to December 1998. The data are subject to concept drift due to changing consumption habits, unexpected events and seasonality. For instance, during the recording period, the electricity market was expanded to include adjacent areas, which allowed production surpluses from one region to be sold to another. The minority class appears in 42.45% of the cases and the stream has 8 attributes, of which 7 numeric and 1 nominal.
The KDD cup intrusion detection data stream (KDD99) (Dua and Graff 2017) records intrusions simulated in a military network environment. The task is to classify network traffic into normal (80.31% of the cases) or some kind of intrusion (19.69% of the cases) described by 41 features, of which 34 numeric and 7 nominal. The problem of temporal dependence is particularly evident here. Inspecting the raw stream confirms that there are time periods of intrusions rather than single instances of intrusions.
Algorithms As SML-models to be prepended by VFC-SMOTE, we tested the ARF, Naïve Bayes (NV), HAT (Bifet and Gavaldà 2009), K-Nearest Neighbor (KNN) and SWT (Bifet et al. 2013b) (with ARF as base learner) algorithms. Instead, as SML+ models, we tested the ARF RE (Ferreira et al. 2019), RB (Bernardo et al. 2020a), OOB and UOB (Wang et al. 2013) techniques. We did not test C-SMOTE because it mainly focuses on improving the classification performances, neglecting the computational ones. In fact, a preliminary analysis of computational costs showed that pipelines using C-SMOTE improve the classification performances w.r.t. the SML+ methods, but at a high computational cost. In some situations, C-SMOTE consumes 10× more memory and 1000× more time than the SML+ methods. All the experiments were performed using the MOA framework (Bifet et al. 2010) with default hyperparameter values for all the techniques involved. The only parameters that we set in VFC-SMOTE are the minSizeMinority, percCorrClass and Adwin δ values. In a hyper-parameter analysis, VFC-SMOTE with minSizeMinority = 100, percCorrClass = 1 and δ = 1e−5 obtained the best results among all the configurations tested. Using percCorrClass = 1 means that saving all the samples into the sketch is better than saving only the misclassified ones. In this particular case, the improvements in classification performance obtained with Smote outweigh the savings in time and RAM obtained with Borderline-Smote.
Settings All the tests were run in a machine that used 2 virtual CPUs Intel Skylake P-8175 at 2.5 GHz and 8 GiB of RAM and they took about two days for computation. We evaluated the predictive performances using the prequential evaluation approach (Gama et al. 2013) and for each data stream, we performed 10 runs. Following (He and Garcia 2009), we used metrics extracted from a confusion matrix. The model saves the instances that are correctly classified as numbers of True Positives (TP) and True Negatives (TN), while those that are incorrectly classified are saved as numbers of False Positives (FP) and False Negatives (FN). In particular, we used the Recall and F1-Measure metrics for each class: R[0], F1[0] for the majority class and R[1], F1[1] for the minority class. We also used the G-mean (GM) metric. Recall focuses only on the single class, allowing us to see when the recognition of that class drops. In contrast, G-mean captures the balance between recognition ratios of both classes. Therefore, by analyzing both measures it can be noticed whether one class was recognized more often at the cost of the other. Moreover, G-mean is skew-invariant, meaning that its interpretation remains the same for all possible class imbalance ratios, being particularly relevant for studying drifting imbalance ratios. For these experiments, we did not use the Accuracy and the Precision metrics for the following reasons: the former is not reliable in case of imbalanced streams because the impact of the least-represented, and possibly more important, examples is reduced when compared to that of the majority class, while the latter can be easily derived from the comparison between Recall and F1-Measure metrics. We took into account how many seconds and bytes of RAM each algorithm used, too. 
Since MOA runs in a Java Virtual Machine in which the RAM consumed does not correspond to the real amount consumed, each experiment is run in a Docker container and we tracked the container RAM consumption through InfluxDB.
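The confusion-matrix metrics described above can be sketched as follows. This is an illustrative computation, not the MOA implementation: the function names are hypothetical, class 1 (the minority class) is assumed to be the positive class, and the counts in the example are made up.

```python
import math


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def class_metrics(tp, fp, tn, fn):
    """Per-class Recall and F1, plus G-mean, from confusion-matrix counts."""
    recall_1 = safe_div(tp, tp + fn)        # minority class recall, R[1]
    recall_0 = safe_div(tn, tn + fp)        # majority class recall, R[0]
    precision_1 = safe_div(tp, tp + fp)
    precision_0 = safe_div(tn, tn + fn)
    f1_1 = safe_div(2 * precision_1 * recall_1, precision_1 + recall_1)
    f1_0 = safe_div(2 * precision_0 * recall_0, precision_0 + recall_0)
    return {
        "R[0]": recall_0, "F1[0]": f1_0,
        "R[1]": recall_1, "F1[1]": f1_1,
        "GM": math.sqrt(recall_0 * recall_1),
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
    }


def precision_from(recall, f1):
    """Recover Precision from Recall and F1: P = F1 * R / (2R - F1)."""
    return safe_div(f1 * recall, 2 * recall - f1)


# Made-up counts for a 1:9 stream on which the minority class is neglected.
m = class_metrics(tp=30, fp=20, tn=880, fn=70)
print({name: round(value, 3) for name, value in m.items()})
print(round(precision_from(m["R[1]"], m["F1[1]"]), 3))
```

With these made-up counts the accuracy is high although only 30% of the minority instances are recognised, whereas R[1] and G-mean expose the problem; the second function shows how Precision follows from Recall and F1.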
With synthetic data, knowing when the drift occurs, we calculated the metrics over all the time steps after the concept drift ended. For sudden drifts, we reset the metrics at time step 50,000, while, for gradual drifts, we reset them at time steps 45,000 (starting drift) and 55,000 (ending drift). Instead, without knowing the true concept drifts in real-world data, we calculated time-decayed metrics with decay factor 0.995, which means that the old performances were forgotten at the rate of 0.5%.
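One common way to obtain time-decayed metrics is to multiply every confusion-matrix count by the decay factor before adding the newest prediction. The sketch below assumes this form; the class name and its methods are hypothetical, and the exact decay scheme of the evaluator used in the experiments may differ.

```python
class FadedConfusionMatrix:
    """Confusion matrix for two classes whose counts decay over time."""

    def __init__(self, decay=0.995):
        self.decay = decay
        # Keys are (true_label, predicted_label) pairs.
        self.counts = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}

    def update(self, y_true, y_pred):
        # Forget 0.5% of the past (for decay = 0.995), then add the new outcome.
        for key in self.counts:
            self.counts[key] *= self.decay
        self.counts[(y_true, y_pred)] += 1.0

    def recall(self, label):
        hits = self.counts[(label, label)]
        misses = self.counts[(label, 1 - label)]
        total = hits + misses
        return hits / total if total else 0.0


matrix = FadedConfusionMatrix(decay=0.995)
for _ in range(1000):      # old concept: every minority instance is missed
    matrix.update(1, 0)
for _ in range(1000):      # after recovery: every minority instance is found
    matrix.update(1, 1)
print(round(matrix.recall(1), 4))
```

Without decay the minority recall in this toy run would be 0.5; with decay 0.995 the old errors have almost no weight left, so the metric reflects the recent behaviour of the model.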

Results and discussion
In this section, we compare the SML+ methods to the ARF, HAT, NV, KNN, and SWT algorithms prepended by the VFC-SMOTE meta-strategy (from now on called VFC-SMOTE*) in terms of statistical tests, time and memory consumed and recovery speed from concept drift occurrence.
Statistical tests We present, in tabular form, the measure values averaged over entire streams (mean performance values). Keeping synthetic and real streams separate, we carry out a Nemenyi test (Demsar 2006) with significance level α = 0.05 to compare the SML+ model performances with the performances achieved by VFC-SMOTE*. Under the Nemenyi test, x y indicates that algorithm x is statistically significantly better than y. In contrast, x y indicates that the former is better than the latter, but without any statistical significance, and we will refer to this case as x being statistically similar to y.
In all the tables, VFC-SMOTE ARF, VFC-SMOTE HAT, VFC-SMOTE KNN, VFC-SMOTE NV, and VFC-SMOTE SWT stand for, respectively, the VFC-SMOTE strategy used with ARF, HAT, KNN, NV, and SWT as base learners. In particular, Tables 2, 3 and 4 show the average performance results and the ranks using the artificial streams with, respectively, the p(y), p(X|y), and p(y|X) concept drift types. Table 5, instead, shows the results achieved with the real streams.
Tables 2 and 3 show that, with the p(y) and p(X|y) drift types (virtual drift), VFC-SMOTE is not the best algorithm in terms of average results. In terms of statistical significance, however, at least one SML-model prepended by VFC-SMOTE (ARF) is, in most cases, similar to the best one. In the case of the p(y|X) drift type, i.e., the most important one to address, Table 4 shows that, for the minority class performances and G-mean, ARF prepended by VFC-SMOTE performed better than all the others in terms of both average results and statistical significance. In the case of real streams, Table 5 shows that there are no statistically significant differences among the ranks. This is probably due to the limited number of real streams tested (3) w.r.t. the artificial ones (48) and the significance level used. However, we can conclude that there is at least one SML-model prepended by VFC-SMOTE (in general ARF or SWT) that outperforms the SML+ ones; therefore, Hp. 1 is verified.
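The effect of the number of streams on the Nemenyi test can be illustrated with the critical difference from Demsar (2006), CD = q_α · sqrt(k(k + 1) / (6N)), where k is the number of algorithms compared and N the number of streams. In the sketch below the function name is hypothetical, and k = 9 (four SML+ methods plus five VFC-SMOTE pipelines) is an assumption made only for the example.

```python
import math

# Critical values q_alpha of the two-tailed Nemenyi test for alpha = 0.05,
# indexed by the number of compared algorithms k (Demsar 2006).
Q_ALPHA_005 = {
    2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
    7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164,
}


def critical_difference(k, n_streams):
    """Minimum difference in average rank that is statistically significant."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6.0 * n_streams))


k = 9  # assumed number of compared algorithms, for illustration only
for n_streams in (48, 3):
    print(n_streams, round(critical_difference(k, n_streams), 2))
```

Since the critical difference scales with 1/sqrt(N), moving from 48 streams to 3 makes it four times larger, so with only three real streams the average ranks must differ by a very large margin before the test reports a significant difference.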
Time & memory consumption Before illustrating the synthesis of the time & memory comparison using all the SML+ algorithms, we discuss the comparison among the ARF RE, RB, OOB, and UOB models to select the best ones to compare with VFC-SMOTE*. We used the Nemenyi test (Demsar 2006) with significance level α = 0.05 to compare the average results, separately for synthetic and real streams, achieved by the ARF RE, RB, OOB, and UOB algorithms in each metric, and to find the best one. Figures 4 and 5 show the Nemenyi results. The axis represents the average rank achieved, while CD is the critical difference, i.e., the minimum difference that two ranks must have to be considered statistically different. From the former, we infer that, with synthetic streams, the best SML+ model is ARF RE, while the latter shows that, even if there is not any statistical evidence due to the limited number of streams tested, the best SML+ model to use with real streams is RB. Henceforth, regarding the synthetic data streams, we proceeded to compare the ARF RE technique, while, regarding the real data streams, we compared the RB technique (Figs. 7 and 8, respectively).
In these figures, each row represents the comparison between a particular algorithm prepended by VFC-SMOTE and the state-of-the-art technique, while each column represents a different data stream tested. In particular, Fig. 6 shows a single cell of Fig. 7. For each combination of algorithms tested and streams used, Fig. 6 shows a scatter plot that compares the ratio between the time used and the RAM consumed by a specific algorithm prepended by VFC-SMOTE and the state-of-the-art one. Red, yellow, and blue points, respectively, refer to the p(y), p(X|y), and p(y|X) concept drift types. If a point is on the bottom-left quadrant of the grid, it means that the algorithm prepended by VFC-SMOTE and tested on that stream takes less time and consumes less RAM than the state-of-the-art algorithm.
If a point is on the bottom-right quadrant of the grid, it means that the algorithm prepended by VFC-SMOTE takes less time than the state-of-the-art algorithm, but consumes more RAM. Instead, if a point is on the top-left quadrant of the grid, it means that the algorithm prepended by VFC-SMOTE consumes less RAM than the state-of-the-art algorithm, but it takes more time. Finally, if a point is on the top-right quadrant, it means that the algorithm prepended by VFC-SMOTE takes more time and consumes more RAM than the state-of-the-art one. In particular, if a point lies on the vertical dotted line, it means that the two algorithms consume the same RAM, while if a point lies on the horizontal dotted line, it means that the two algorithms take the same time. To enhance the data visualization, all the time and RAM ratios are scaled using the maximum time and RAM values achieved by the comparison of the VFC-SMOTE*, ARF RE, RB, OOB and UOB algorithms in case of both artificial and real streams.
A quick look at Figs. 7 and 8 is enough to say that the ratios change depending on the algorithms and the data streams tested, but there is at least one algorithm prepended by VFC-SMOTE that is faster and a better RAM saver than the state-of-the-art methods (green bordered cells). More specifically, Fig. 7 shows the ratios between time and memory consumed by VFC-SMOTE* and the ARF RE algorithm for each synthetic stream used and concept drift type. We can notice that both time and RAM used are pretty similar among all the methods involved, except for the case of the KNN model prepended by VFC-SMOTE, which always consumes less RAM than ARF RE. Moreover, there are 14 cases (green bordered cells) in which, in all the concept drift types, VFC-SMOTE* both takes less time and consumes less RAM than ARF RE. Figure 8 shows the ratios between the time and memory consumed by VFC-SMOTE* and the RB algorithm for each real stream.
We can notice that VFC-SMOTE* never consumes more RAM than RB and, in more than half of the cases (9), VFC-SMOTE* is also faster.
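The reading of the scatter plots can be summarised with a small helper that maps a pair of measurements to the quadrant it would fall in. The sketch assumes unscaled ratios, with the dotted lines at a ratio of 1, RAM on the horizontal axis and time on the vertical one; the function name and the measurements in the example are hypothetical.

```python
def quadrant(vfc_time, vfc_ram, sota_time, sota_ram):
    """Locate a VFC-SMOTE pipeline relative to a state-of-the-art method.

    Ratios below 1 mean the VFC-SMOTE pipeline is cheaper. The vertical
    position encodes time and the horizontal position encodes RAM.
    """
    time_ratio = vfc_time / sota_time
    ram_ratio = vfc_ram / sota_ram

    if time_ratio < 1:
        vertical = "bottom"
    elif time_ratio > 1:
        vertical = "top"
    else:
        vertical = "horizontal line"      # same time

    if ram_ratio < 1:
        horizontal = "left"
    elif ram_ratio > 1:
        horizontal = "right"
    else:
        horizontal = "vertical line"      # same RAM

    return vertical, horizontal


# Made-up measurements: seconds and bytes for the two methods.
print(quadrant(vfc_time=80.0, vfc_ram=3.0e8, sota_time=100.0, sota_ram=4.0e8))
print(quadrant(vfc_time=120.0, vfc_ram=3.0e8, sota_time=100.0, sota_ram=4.0e8))
```

A point classified as bottom and left corresponds to the favourable case in which the pipeline is both faster and lighter than the method it is compared with.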
We can conclude that, in the majority of cases, VFC-SMOTE* uses less time and consumes less RAM than the SML+ methods or, in the worst-case scenario, it uses the same amount of time and RAM. So, looking also at Appendix C, Hp. 2 is verified.
Recovery speed analysis VFC-SMOTE is just a meta-strategy focusing on improving the minority class performances. As Table 1 shows, it leaves the management of concept drift to the SML-model to which it is prepended. If the SML-model uses a concept drift detector, then the VFC-SMOTE pipeline will also be able to manage concept drifts; otherwise it will not. In Appendix D we compared, using different datasets and metrics, the performances of two SML-models with and without VFC-SMOTE. We can notice that, although the performances of the VFC-SMOTE pipelines are better than those of the single models, they all start reacting to the concept drift (red line) at the same time. This means that VFC-SMOTE does not have any impact on the concept drift recovery, and so Hp. 3 is verified. For this reason, we did not compare the recovery speed of VFC-SMOTE to that of the state-of-the-art methods.

Limitations
In this section we discuss the most important VFC-SMOTE limitations. The first one is the inability to know in advance which specific SML-model prepended with VFC-SMOTE can outperform the other SML+ methods. For this purpose, we need to test different SML-models to find out which is the best one in each situation. Another limitation is that, for the time being, VFC-SMOTE can only deal with binary classification problems. Our aim was to start from the easiest problem to determine whether our work could achieve good performances w.r.t. the state-of-the-art methods. Moreover, VFC-SMOTE uses the Smote, Borderline-Smote, Adwin and HAT techniques. In a future extension of this work, we will focus on both addressing multi-class problems and using different techniques. Another important aspect to take into consideration is the synthetic oversampling. The risk is that not all the synthetic generated instances are consistent with the underlying context. In our case, considering that we are using the most seen value with nominal attributes, and a Beta distribution scaled between the minimum and maximum values observed with numerical attributes, we have thus minimized the risk of generating non-coherent instances. This also depends on the stream class distribution. In case of a sharp boundary where the classes do not overlap, we avoided the problem, otherwise the risk would still remain. Another class distribution consequence is the presence of a sort of trade-off between improving the minority class performances and decreasing the majority class ones. In case of nonoverlapping classes, minority class performances can be improved without decreasing significantly the majority class ones, but, in case of overlapping classes, in light of a minority class performances improvement, a majority class performances decrease is unavoidable. Lastly, we chose a Beta distribution to generate the new synthetic samples. 
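The attribute generation described above, i.e. the most seen value for nominal attributes and a Beta distribution scaled between the observed minimum and maximum for numerical ones, can be sketched as follows. The function names and the shape parameters alpha = beta = 2 are assumptions made for illustration; this section does not state which Beta parameters VFC-SMOTE uses.

```python
import random
from collections import Counter


def synthetic_numeric(min_seen, max_seen, alpha=2.0, beta=2.0, rng=random):
    """Draw a numeric value from a Beta(alpha, beta) scaled to [min_seen, max_seen].

    alpha and beta are hypothetical shape parameters chosen for the example.
    """
    return min_seen + (max_seen - min_seen) * rng.betavariate(alpha, beta)


def synthetic_nominal(value_counts):
    """Return the most frequently seen value of a nominal attribute."""
    return value_counts.most_common(1)[0][0]


rng = random.Random(42)
seen_colours = Counter({"red": 7, "green": 2, "blue": 1})
sample = {
    "x1": synthetic_numeric(0.2, 0.9, rng=rng),
    "colour": synthetic_nominal(seen_colours),
}
print(sample)
```

Because the Beta distribution has bounded support, a generated numeric value can never leave the range observed for the minority class, which is the property that limits the generation of non-coherent instances.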
A sensitivity analysis could be performed comparing the results achieved using the Beta distribution to the results achieved using different distributions with bounded support, e.g. the truncated normal, logit-normal, Irwin-Hall or Bates distributions.
We tested VFC-SMOTE prepended to some SML- and SML+ algorithms on different cases of imbalance, concept drifts, and swapping between minority and majority classes. Furthermore, we measured the time and memory consumed.
The results summarized in Table 6 empirically demonstrate that there are at least two SML-models (ARF and SWT) that, prepended by VFC-SMOTE, improve both the minority and majority class performances in more than half of the cases if compared to the SML+ methods (Q1). Moreover, with reference to time and memory consumed, the SML-algorithms prepended by VFC-SMOTE are faster and less memory eager than the SML+ algorithms (Q2), while with reference to the recovery speed analysis, the SML-models behaviour is not influenced by VFC-SMOTE (Q3).
In future works, we would like to improve the VFC-SMOTE performances for both classes, not only for the minority class. Another aim is to investigate other meta-strategies based on different rebalancing techniques, different sketch structures and different concept drift detectors, and to compare them with VFC-SMOTE. In the long term, adapting VFC-SMOTE to multiclass and regression tasks could be interesting.

Appendices
Appendix A: Further sampling techniques for class imbalance
The ADASYN (He et al. 2008) main idea proceeds from the assumption of utilizing a weighted distribution depending on the type of minority examples according to their learning complexity. The quantity of synthetic data for each one is associated with the level of difficulty of each minority example. This difficulty estimation is based on the ratio of examples belonging to the majority class in the neighborhood. Then, a density distribution is computed using all the ratios of the minority instances, which will be used to compute the number of synthetic examples required to be generated for each minority example. DBSMOTE (Bunkhumpornpat et al. 2012) relies on a density-based approach to clustering called DBSCAN and performs oversampling by generating synthetic samples along the shortest path from each minority instance to a pseudocentroid of a minority class cluster. DBSMOTE was inspired by Borderline-Smote in the sense that it operates in an overlapping region, but unlike Borderline-Smote, it also tries to maintain both the minority and majority class accuracies. MDO (Abdi and Hashemi 2015) builds synthetic examples having the same Mahalanobis distance from each examined class mean as the other minority examples. Thus, the region of minority instances can be better learned by preserving the co-variance during the generation of synthetic examples along the probability contours. Also, the risk of overlapping between different class regions is reduced. SWIM (Bellinger et al. 2020) is based on the Mahalanobis distance, too. It generates synthetic minority training examples that are (1) near to their minority seed and (2) have the same Mahalanobis distance from the mean of the majority class as their seed. This ensures that the synthetic instances do not spread into denser regions of the majority class where there is no statistical evidence that they should be. The last technique proposed is G-SMOTE (Douzas and Bação 2019). 
It substitutes the Smote data generation mechanism by defining a flexible geometric region around each minority class instance. Then, synthetic instances are generated inside the boundaries of the region. At the most general choice of hyperparameters, this geometric region of the input space is a truncated hyper-spheroid.
The total memory complexity of the proposed solution is the sum of the space used to store the sketch and the SML-model to which VFC-SMOTE is prepended. Since the sketch is stored in a HAT model, its asymptotic complexity is O(log W). Thus, whether the SML-model uses Adwin or not, the total asymptotic memory complexity is still O(log W). Notice that, in the latter case, the asymptotic complexity worsens from O(1) to O(log W).
Instead, the time complexity per sample of the proposed solution is the sum of the time used to (1) save the element into the sketch, (2) retrieve all the summaries from the sketch, (3) sort the summaries by the number of instances seen, (4) generate new synthetic samples, (5) train the SML-model with the newly generated synthetic samples, and (6) (2), T is the number of tree nodes, L is the number of tree leaves, M is the number of buckets used by Adwin, I is the number of synthetic instances to generate to rebalance the stream in that moment and O(SML−) stands for the time complexity of the SML-model. Asymptotically, this cost also becomes O(log W) and so, as before, whether the SML-model uses Adwin or not, the total asymptotic time complexity is still O(log W).
Figure 9 shows the ratio between time & memory consumed by VFC-SMOTE* and, respectively, the OOB, UOB and RB algorithms with synthetic data streams, and the ARF RE, OOB and UOB algorithms with real data streams.
Figure 10 shows the performance comparison between the ARF and HAT techniques with and without VFC-SMOTE prepended. The red solid lines represent the concept drift occurrence, while the red dashed lines represent the start and the end of a gradual drift.
Table 8 shows the summary results of all the SML-methods prepended by VFC-SMOTE.