The impact of data difficulty factors on classification of imbalanced and concept drifting data streams


Class imbalance introduces additional challenges when learning classifiers from concept drifting data streams. Most existing work focuses on designing new algorithms for dealing with the global imbalance ratio and does not consider other data complexities. Independent research on static imbalanced data has highlighted the influential role of local data difficulty factors such as minority class decomposition and the presence of unsafe types of examples. Despite often being present in real-world data, the interactions between concept drifts and local data difficulty factors have not yet been investigated in concept drifting data streams. We thoroughly study the impact of such interactions on drifting imbalanced streams. For this purpose, we put forward a new categorization of concept drifts for class imbalanced problems. Through comprehensive experiments with synthetic and real data streams, we study the influence of concept drifts, global class imbalance, local data difficulty factors, and their combinations on the predictions of representative online classifiers. Experimental results reveal the strong influence of the newly considered factors and their local drifts, as well as differences in existing classifiers’ reactions to such factors. Combinations of multiple factors are the most challenging for classifiers. Although existing classifiers are partially capable of coping with global class imbalance, new approaches are needed to address the challenges posed by imbalanced data streams.


Classification methods have developed impressively over the last few decades. Although most of the research in the field is concerned with batch learning from static data repositories, recent years have seen more and more studies directed at the analysis of large data volumes dynamically generated in the form of data streams.

Compared to classifying static data, the task of learning from data streams introduces limits on computational resources [23] and forces classifiers to act in dynamic environments, where the data and target concepts change over time in a phenomenon called concept drift [79]. Examples of real-life data streams include spam categorization, weather predictions, and financial fraud detection [86]. Furthermore, many practical applications make learning classifiers from streams even more challenging by introducing additional data complexities. One such additional challenge is class imbalance, a situation where at least one of the target classes, called the minority class, is highly underrepresented in the data [34].

Separately, both concept drift and class imbalance have already received substantial research attention. Concept drift has been thoroughly analyzed in the last two decades, in particular in the context of non-stationary data streams [18, 26, 33, 40], resulting in drift taxonomies [79], detectors [26], evaluation techniques [84, 85], and adaptive streaming classifiers [8]. Research on class imbalance has also led to many novel methods, such as class resampling [21], specialized classification methods [34] or dedicated classifier performance measures [11, 12].

Furthermore, several recent works on static imbalanced data highlight the impact of local data difficulty factors on classification performance deterioration. These studies have shown the influence of factors such as the decomposition of the minority class concept into many sub-concepts [36, 37], class overlapping [29, 62], and the presence of isolated rare groups of few minority class examples located far from the class boundaries, deep inside the majority class region [55, 57]. Importantly, it has been shown that such data characteristics are common in real-world classification problems [20, 46, 56, 58, 69, 81] and can be more influential on the overall classifier performance than the global class imbalance ratio itself [37, 56].

Although class imbalance co-occurs with many practical data stream classification tasks, the number of specialized proposals for imbalanced streams is still relatively small, see e.g., [78]. Moreover, existing works on imbalanced stream classification mostly focus on re-balancing classes and reacting to changes affecting the global imbalance ratio. These works do not consider the aforementioned local difficulty factors manifested by changes in local class distributions and other local drifts. However, imbalanced streams can also be affected by these difficulty factors. For example, some minority classes can be fragmented into sub-concepts, which appear and disappear with time, or change their positions in the attribute space [38]. Moreover, drifting overlapping borderline regions between the minority and majority classes have also been observed in tweet streams [42], introducing additional difficulties for learning classifiers.

The conjunction of these data difficulty factors and concept drift may potentially be more challenging for classifiers than the impact of each factor separately, given that the classifier would need to adapt to local drifts based on very few minority class examples. However, the role of streaming local difficulty factors and local drifts has not been studied yet and is still an open research challenge. We claim that a better understanding of the interactions between concept drifts and various difficulty factors in class imbalanced streams will help diagnose and remedy more complex difficulties in learning from real-world data streams. These observations have led us to the following research questions:

  1. RQ1

    What is the impact of different types of single local data difficulty factors and single isolated drifts (i.e., drifts without the presence of other drifts in the data stream) on the predictive performance of online classifiers? Is it possible to identify which classifiers work better and when?

  2. RQ2

    What is the interaction between different types of local data difficulty factors and global class imbalance? Which local factors or drifts are the most demanding for classifiers at different imbalance ratios?

  3. RQ3

    Which complex scenarios integrating several data factors are the most harmful for classification performance?

In this paper, we answer the presented research questions by carrying out a comprehensive experimental study with five representative online classifiers applied to various synthetic and real-world data streams modeling the possible interactions between the global class imbalance ratio, local data difficulty factors, and concept drift. To the best of our knowledge, no such analysis has been previously done. To support our analysis and future work in the area, we also propose a categorization of concept drifts for imbalanced data streams and provide a data stream generator that takes into account all the presented factors. We pay special attention to controlled experiments on a wide collection of synthetic datasets, as these dissect the interactions between local data difficulties and classifier reactions to drifts in particular moments in time. Our analysis highlights the main not-yet-addressed challenges of imbalanced stream classification and discusses future directions of research in the field.

Problem formulation

A data stream can be defined as a sequence of labeled examples \(\mathcal {S} = \{\mathbf {x}_t, \mathbf {y}_t\}_{t=1}^T\), where \(\mathbf {x}_t \in \mathcal {X}\), \(\mathbf {y}_t \in \mathcal {Y}\), \(\mathcal {X}\) is the input space, \(\mathcal {Y}\) is the output space, \((\mathbf {x}_t, \mathbf {y}_t) \sim _{i.i.d.} p_t(\mathbf {x}, \mathbf {y})\), where \(p_t(\mathbf {x}, \mathbf {y})\) is the joint probability distribution underlying the data stream at time stamp t, and T may be infinite. In this paper, we will consider data streams where \(\mathcal {Y} = \{-,+\}\). Therefore, we will refer to \(\mathbf {y}\) as a binary label y.

We will investigate data streams produced in non-stationary environments, i.e., data streams that may suffer concept drift. Formally, it is said that a concept drift occurs if [23, 26, 40], for any two time stamps t and \(t+\varDelta \),

$$\begin{aligned} \exists \mathbf {x}: p_{t} (\mathbf {x},y) \ne p_{t + \varDelta }(\mathbf {x},y) . \end{aligned}$$

In particular, we will look into data streams which suffer from class imbalance at least during some period of time. A data stream is said to be imbalanced at a given time stamp t if \(p_t(-) \ll p_t(+)\) or \(p_t(+) \ll p_t(-)\). The class that contains fewer examples is called the minority class, whereas the remaining class is referred to as the majority class. In non-stationary environments, \(p_t(-)\) and \(p_t(+)\) may drift over time; therefore, the imbalance status of the problem may also change. To quantify the amount of class imbalance in a stream (or part of a stream), we will use the percentage of the cardinality of the minority class with respect to all examples in the stream \(|\mathcal {S}_{min}|/|\mathcal {S}| \cdot 100\%\).
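As a minimal illustration (not part of the original formulation), the percentage above can be computed directly from the labels observed in a stream fragment; the function name is ours:

```python
from collections import Counter

def minority_percentage(labels):
    """Percentage of examples belonging to the minority class,
    i.e., |S_min| / |S| * 100, for a binary stream (or stream fragment)."""
    counts = Counter(labels)
    return 100.0 * min(counts.values()) / len(labels)

# One '+' example among ten yields a 10% minority class.
stream = ['-'] * 9 + ['+']
minority_percentage(stream)  # 10.0
```

For a drifting stream, the same computation would be applied over a sliding window or with time-decayed counts, so that the imbalance status can be tracked as it changes over time.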

We consider classification problems where unlabeled examples \(\mathbf {x}_t\) are received and their labels must be predicted. After some time, the true labels \(y_t\) of such examples are received and can be used to compose training examples \((\mathbf {x}_t,y_t)\). Many classification problems operate in such scenarios [23], with examples including electricity price prediction, network intrusion detection [49], software defect prediction [13], and credit card assessment [65].

In order to solve such problems, we will adopt online supervised learning algorithms. These algorithms work as follows [51]: assume that, at time stamp t, a model \(f_{t-1}: \mathcal {X} \rightarrow \mathcal {Y}\) created based on past examples from \(\mathcal {S}\) is available, and a new labeled example \((\mathbf {x}_t,y_t)\) becomes available. Online learning aims at building an updated model \(f_t: \mathcal {X} \rightarrow \mathcal {Y}\) able to generalize to unseen examples of \(p_t(\mathbf {x},y)\), based on \(f_{t-1}\) and \((\mathbf {x}_t,y_t)\).

This online procedure differs from the so-called chunk learning algorithms [40] in that it learns each training example as soon as it is received, rather than waiting for a whole chunk (block) of examples to arrive before updating the predictive model or its evaluation. This avoids (1) potential delays in reacting to concept drift, as there is no need to wait for a whole chunk of data before starting to react to drifts, and (2) the need for choosing a chunk size, which is particularly challenging in non-stationary environments [49]. It is also more adequate for applications with high time and memory constraints than typical chunk-based approaches, as each training example is processed once and only once.
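The online procedure described above can be summarized as a test-then-train loop. The sketch below is illustrative only: the `predict`/`learn` interface and the toy baseline model are our assumptions, not part of any of the cited algorithms.

```python
def prequential_run(model, stream):
    """Test-then-train: each labeled example first tests the current
    model f_{t-1}, and is then immediately used to update it into f_t.
    Each example is processed once and only once."""
    correct = 0
    for x, y in stream:
        if model.predict(x) == y:   # test with the model built so far
            correct += 1
        model.learn(x, y)           # update the model with (x, y)
    return correct / len(stream)

class MajorityClassBaseline:
    """Toy online model: always predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

stream = [([0.1], '-'), ([0.2], '-'), ([0.9], '+'), ([0.3], '-')]
prequential_run(MajorityClassBaseline(), stream)  # 0.5 (2 of 4 predictions correct)
```

Because every example is used for testing before it is used for training, this loop also corresponds to the prequential evaluation commonly applied to data streams.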

Related work

Concept drifts

There are several different ways to categorize and characterize concept drifts in classification problems. A common way is to refer to the components of the joint probability distribution that are affected by the drift. The joint probability distribution can be written as \(p(\mathbf {x},y) = p(\mathbf {x}|y) p(y)\). Therefore, a concept drift affects one or both of the following components [26, 78]:

  • prior probabilities of the classes p(y) and

  • class conditional probabilities \(p(\mathbf {x}|y)\).

A joint probability distribution can also be written as \(p(\mathbf {x},y) = p(y|\mathbf {x}) p(\mathbf {x})\). Therefore, a concept drift also affects one or both of the following components [18, 26]:

  • unconditional probability distribution \(p(\mathbf {x})\) and

  • posterior probabilities of the classes \(p(y|\mathbf {x})\).

A concept drift affecting \(p(\mathbf {x}|y)\) or \(p(y|\mathbf {x})\) thus affects the relationship between the input and output of the problem. A concept drift affecting only p(y) or \(p(\mathbf {x})\) does not affect this relationship. In particular, it does not affect the true underlying decision boundaries of the problem.

In terms of the joint probability distribution, drifts can be further characterized by their severity or magnitude. Severity and magnitude are equivalent criteria that have been defined as the amount of changes in the joint probability distribution caused by a drift [50], and as the degree of difference between two points of time [79], respectively. Several different distance measures can be used to characterize drifts according to their severity or magnitude [32, 50, 64, 80], such as Kullback–Leibler Divergence, Hellinger Distance and Total Variation Distance. The Hellinger Distance and the Total Variation Distance have the advantage of being symmetric. As the Total Variation Distance is more efficient to compute, it has been favored in the literature [80]. It is defined as follows, where Z is a vector of discrete random variables and dom(Z) is the domain of Z [44, 80]:

$$\begin{aligned} \sigma _{t,u}(Z) = \frac{1}{2} \sum _{z \in dom(Z)} |P_t(z) - P_u(z)|. \end{aligned}$$

However, calculating the Hellinger Distance or the Total Variation Distance for continuous random variables requires strong assumptions about the probability distributions, so discretization of continuous random variables is recommended when using these metrics [80].
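For discrete (or discretized) random variables, the Total Variation Distance above reduces to a simple sum over the domain; a possible implementation:

```python
def total_variation_distance(p_t, p_u):
    """Total Variation Distance between two discrete distributions,
    given as dicts mapping values z in dom(Z) to probabilities:
        sigma_{t,u}(Z) = 1/2 * sum_z |P_t(z) - P_u(z)|."""
    domain = set(p_t) | set(p_u)
    return 0.5 * sum(abs(p_t.get(z, 0.0) - p_u.get(z, 0.0)) for z in domain)

# Identical distributions have distance 0; disjoint supports give 1.
before = {'a': 0.9, 'b': 0.1}
after  = {'a': 0.5, 'b': 0.5}
total_variation_distance(before, after)  # 0.4
```

Swapping the two arguments yields the same value, reflecting the symmetry mentioned above.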

Drifts affecting p(y|x) have been referred to as severe [50] or full-concept [79] drifts if they change the p(y|x) of the whole input attribute space, whereas they have been referred to as intersected [50] or sub-concept [79] drifts if they change the p(y|x) of only part of the attribute space.

Another common way to categorize drifts is based on their rate of change [18], also known as speed [50]. Typically, drifts are categorized as abrupt (also called sudden) if they cause sudden changes, or gradual if their underlying joint probability distribution slowly evolves over time [18, 64]. Slow evolution can refer to a period of time where two distinct distributions co-exist in the problem, with the new distribution slowly taking over the old one [26, 50]. It can also refer to a distribution that continuously and incrementally changes by limited amounts over a period of time [64].
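As an illustration of the first notion of gradual drift (two distributions co-existing while the new one takes over), the sketch below samples from the new concept with a probability that follows a sigmoid schedule, similar to the blending used by popular stream generators; all function names and parameters here are illustrative assumptions:

```python
import math
import random

def gradual_drift_stream(old_concept, new_concept, n, center, width, seed=0):
    """Generate n examples in which a gradual drift replaces old_concept
    with new_concept. At time t, the new concept is sampled with a
    probability following a sigmoid centered at `center` with transition
    `width`, so the two distributions co-exist while the new one slowly
    takes over. `old_concept`/`new_concept` are no-argument callables."""
    rng = random.Random(seed)
    for t in range(n):
        p_new = 1.0 / (1.0 + math.exp(-4.0 * (t - center) / width))
        yield new_concept() if rng.random() < p_new else old_concept()

# Toy concepts: the class prior of '+' drifts from 10% to 50%.
stream = list(gradual_drift_stream(
    old_concept=lambda: '+' if random.random() < 0.1 else '-',
    new_concept=lambda: '+' if random.random() < 0.5 else '-',
    n=2000, center=1000, width=400))
```

With `center=1000` and `width=400`, the stream is dominated by the old concept early on, mixes both distributions around time stamp 1000, and is dominated by the new concept afterwards.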

Other criteria such as recurrence, frequency, predictability, and cyclical behavior have also been used to categorize and characterize sequences of concept drifts [18, 26, 50], rather than single concept drifts. It is worth noting that several criteria can be used together to provide a richer description of concept drifts. For example, a drift may be categorized as affecting the posterior probabilities of the classes, being severe and gradual at the same time. However, none of the existing criteria address the problem of changes concerning local and global data difficulty factors of imbalanced streams.

Class imbalance and data difficulty factors

Research on class imbalance has resulted in numerous data preprocessing or algorithmic methods for static data; see reviews in [6, 20, 34]. Moreover, recent studies have identified key data properties which make imbalanced learning difficult. One such property is the global imbalance ratio, defined as the percentage of examples in the dataset that belong to the minority class (see Sect. 2). However, it has also been observed that strong class imbalance does not always deteriorate classification performance. This has led researchers to identify other data characteristics, called data difficulty factors or data complexities, as sources of classifier deterioration. Following the literature on static imbalanced data [20, 29, 37, 46, 56, 69], these difficulty factors include:

  1. 1.

    the decomposition of the minority class concept into several sub-concepts,

  2. 2.

    the presence of small, isolated groups of minority examples located deep inside the majority class region;

  3. 3.

    the effect of strong overlapping between the classes,

The second factor above corresponds to so-called rare cases, which are defined as isolated, very small groups of minority examples (e.g., containing 1–3 examples) located more deeply inside regions of the majority class [55]. This is different from the first factor, which refers to the decomposition of the minority class into several larger sub-clusters containing more examples than rare cases and corresponding to sub-concepts of the minority class [36]. Both of these may or may not involve overlapping between the classes, i.e., the third factor is not mutually exclusive with the first two. In particular, rare examples may be very near the examples of the majority class or mixed with them, presenting an overlap, or there might be no overlap. Similarly, minority class sub-concepts may or may not present overlaps near their decision boundaries [20].

Note that such difficulty factors deteriorate classification performance also in standard, balanced classification tasks. However, when these data complexity factors occur together with class imbalance, the deterioration of classifier performance is amplified and affects mostly (or even only) the minority class [57].

Experimental studies on the role of the aforementioned factors have shown that data complexities occur in most static imbalanced datasets, and may play a key role in explaining the differences between the performance of various classifiers [56, 68]. In particular, the proper handling of overlapping between classes and of unsafe types of minority examples has inspired recent proposals of new informed pre-processing [4], rule induction [54] and bagging ensemble generalizations [3], which outperformed many well-known methods.

However, research on these local data difficulty factors has concerned mostly static data, and there are few attempts to study them in data streams. In [76], data streams with different types of examples were considered, but only in scenarios with stationary underlying distributions. Other research concentrated either on classifier modifications [39, 78] or evaluation measures [10]. To the best of our knowledge, no existing work performed a comprehensive study of how concept drifts affecting local data distributions impact the predictive performance of various online class imbalance learning algorithms. There is also a lack of categorization and characterization of drifts that take such local data distributions into account, making systematic studies of the impact of such drifts difficult.

Below we further discuss the aforementioned difficulty factors and briefly describe their characteristics.

Decomposition of the minority class into sub-concepts

Experimental studies with various real-world imbalanced datasets show that the minority class usually does not form a homogeneous region (single concept) in the attribute space, but is often scattered into smaller sub-clusters (sub-concepts) spread over the space (Fig. 1b) [20, 34, 56, 81]. Japkowicz et al. named this phenomenon within-class imbalance, in contrast to the global imbalance between classes. These authors initiated studies on the impact of class decomposition on the performance of popular classifiers over several synthetic datasets [36, 37]. In their experiments, the cluster forming the minority class distribution, initially surrounded by majority class examples, was successively split into several sub-clusters separated by majority class regions. Their results demonstrated that increasing the split of the minority class was more influential than changing the global imbalance ratio, in particular for smaller numbers of examples. This was also independently studied for other, more complex class distributions [62, 68]. According to Japkowicz, this may be associated with so-called small disjuncts, which originally referred to classification rules covering too few examples and contributing to classification errors more than rules covering more examples. The impact of small disjuncts is particularly pronounced for the minority class. The identification of minority class decomposition in real-world static data has motivated several studies on its discovery, as well as improved resampling techniques and algorithm modifications, e.g., cluster-based over-sampling techniques [36, 69].

Fig. 1

Minority class distributions with different difficulty factors. Minority examples depicted as filled (teal) circles, majority examples as hollow (white) circles

Presence of rare minority examples

The occurrence of isolated minority examples located deeper inside regions of the majority class has been identified as another difficulty factor. These are small “islets” of few examples (usually pairs or triplets) located quite far from the class boundaries (Fig. 1d). Following [55], they are called rare cases and, due to their rarity, they are different from the larger sub-concepts discussed in Sect. 3.2.1. As the minority class is under-represented, such examples cannot be treated as noise [41]. Moreover, studies such as [56] have shown that, if handled appropriately, e.g., by informed pre-processing, rare examples can be used to improve the recognition of the minority class.

Note that single minority examples can also be located deep inside the majority class region or in empty space, playing the role of outliers [55]. However, their correct recognition is extremely difficult, and most existing specialized methods treat outliers as noise. Therefore, in further experiments with imbalanced streams we focus our interest on rare cases only.

Overlapping between minority and majority classes

Other researchers considered difficulties linked to high overlapping between regions of minority and majority class examples in the attribute space (e.g., Fig. 1c). This difficulty factor is present in many real-life datasets and has already been identified in balanced classification problems. However, experimental studies have shown that its role is even more influential for imbalanced data, and for the minority class in particular [29, 68]. For instance, the authors of [29] have shown that increasing the overlapping region degraded the performance of six popular classifiers much more than changing the imbalance ratio. Furthermore, they observed a higher impact of the local imbalance ratio in the overlapping region. Other researchers have also shown that the usefulness of various resampling methods depends on the amount of overlapping, which has to be dealt with appropriately when constructing new methods [46, 56].

Types of minority examples with respect to their local characteristics

A related view on data difficulty factors leads to distinguishing different types of minority examples based on the number of minority and majority class examples near them [41, 43, 55]. One of the approaches to distinguishing types of minority examples [55, 56] analyzes the class labels of examples in their local neighborhood, defined either by k-nearest neighbors or by kernels. In this approach, an example of a given class is labeled as safe if most of its neighbors also belong to this class. If the numbers of labels from both classes are similar, the example is labeled as a borderline one. If all of its neighbors are from the opposite class, the example is treated as an outlier; otherwise, it is a rare case. The advantage of adopting this definition of safe, borderline, outlier and rare examples is that it is possible to determine the proportion of minority class examples of each of these types in real-world datasets [55, 56].
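The neighborhood analysis described above can be sketched as follows for the k-nearest-neighbor variant. The thresholds follow the commonly used k = 5 setting (4–5 minority neighbors: safe, 2–3: borderline, 1: rare, 0: outlier); treat this as a simplified illustration rather than the exact procedure of [55, 56]:

```python
def label_example_types(X, y, minority, k=5):
    """Label each minority example as safe / borderline / rare / outlier
    based on the class composition of its k nearest neighbors (Euclidean).
    Returns a dict mapping the index of each minority example to its type."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    labels = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority:
            continue
        neigh = sorted((j for j in range(len(X)) if j != i),
                       key=lambda j: dist2(xi, X[j]))[:k]
        same = sum(1 for j in neigh if y[j] == minority)
        labels[i] = ('safe' if same >= 4 else
                     'borderline' if same >= 2 else
                     'rare' if same == 1 else 'outlier')
    return labels

# Six clustered minority examples are "safe"; a lone minority example
# surrounded by the majority class is an "outlier".
X = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (0.05, 0.05), (0.02, 0.08),
     (5, 5), (5.1, 5), (5, 5.1), (5.2, 5.2), (5.3, 5), (4.9, 5.1),
     (5.05, 5.05)]
y = ['+'] * 6 + ['-'] * 6 + ['+']
types = label_example_types(X, y, minority='+')
# types[0] == 'safe', types[12] == 'outlier'
```

Applied to a stream, the same analysis could be recomputed over a sliding window to observe how the proportions of example types drift over time.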

Cluster transitions

Existing work on data stream clustering has attempted to model and monitor the evolution of clusters (cluster transitions) in data streams [66, 67]. These transitions can be external or internal transitions from a given time step \(t_i\) to another time step \(t_j\), where \(j>i\). An external transition of a given cluster can be categorized as a split into multiple clusters, cluster disappearance, emergence or absorption by another cluster. An internal transition of a given cluster can be a size transition where the cluster shrinks or expands, a compactness transition where the cluster becomes more compact or diffuse, or a local transition where the center or the distribution of the cluster changes. Even though the work presented in [66, 67] is not about concept drift in classification problems, it can potentially inspire new work on class imbalanced data streams, in view of the data difficulty factors discussed in Sect. 3.2. In particular, concept drifts such as decomposition of the minority class concept into several sub-concepts could potentially be monitored and modeled from a cluster perspective, where a given sub-concept could potentially be modeled by a cluster of minority class examples.

Classifiers for imbalanced data streams

Compared to the number of machine learning approaches designed to tackle balanced data streams [18, 26, 33, 40], there are relatively few approaches aimed at tackling class imbalanced data streams [78]. In both settings, different types of concept drift require different strategies to be effectively and efficiently handled. Active approaches, which rely on a concept drift detector to trigger adaptation, tend to work well for abrupt drifts, but may struggle with gradual drifts. Passive approaches, which continuously update the learning system with incoming data in an attempt to tackle concept drift, are typically well suited for gradual drifts. In this section, we discuss classifiers for imbalanced data streams from these two perspectives. We give emphasis to approaches that perform online learning, which are the focus of this work, and only briefly discuss chunk learning algorithms.

Recursive Least Square Adaptive Cost Perceptron (RLSACP) [30] is a perceptron-based online passive approach. It uses a loss function that is both cost-sensitive and time-decayed, to tackle both class imbalance and concept drifts affecting p(y|x), respectively. The cost associated with each class can vary over time, enabling the approach to also cope with concept drifts affecting p(y). The Online Neural Network (ONN) [31] is a similar approach; however, it assumes a fixed imbalance ratio and cannot cope with concept drifts affecting p(y). Finally, Ensemble of Subset Online Sequential Extreme Learning Machine (ESOS-ELM) [52] is a perceptron-based passive-active learning approach that can operate in chunk or online mode, so long as an initialization chunk is provided. It trains an ensemble of ELMs based on a resampling technique to deal with class imbalance. Each ELM is associated with a weight that is updated over time. ESOS-ELM uses G-mean, which is an appropriate performance metric for class imbalanced problems, to update the weights over time. The ensemble is reset upon a threshold-based concept drift detection. ESOS-ELM also maintains a memory of batch models. Old models can be recovered from the memory to deal with recurrent concepts.

Oversampling (OOB) and Undersampling (UOB) Online Bagging [76] are two online passive resampling-based ensemble approaches that can be trained with any type of online base classifier. They are extensions of the Online Bagging [60] algorithm, which uses each incoming training example to update each classifier k times. The value k is drawn from a Poisson(1) distribution for each classifier independently. The use of the Poisson(1) distribution enables Online Bagging to be a good approximation of Breiman’s Bagging [7] algorithm. However, Bagging is not prepared to handle class imbalance. Instead of using \({\hbox {Poisson}}(1)\), OOB and UOB use \({\hbox {Poisson}}(\lambda )\), where \(\lambda \) is set as \(\hat{p}_t^{{\hbox {maj}}}/\hat{p}_t^{{\hbox {min}}}\) for OOB and \(\hat{p}_t^{{\hbox {min}}}/\hat{p}_t^{{\hbox {maj}}}\) for UOB, where \(\hat{p}_{t}^{{\hbox {maj}}}\) and \(\hat{p}_t^{{\hbox {min}}}\) are the estimated prior probabilities of the class considered to be a majority and a minority at time t, respectively. This will result in oversampling being applied to tackle class imbalance in OOB, and undersampling in UOB. The probabilities are estimated based on a time-decayed function. This enables OOB and UOB not only to deal with class imbalance, but also to deal with concept drifts affecting p(y). These approaches can be generalized to multi-class settings [77] and combined with drift detectors to become active approaches that tackle concept drifts that affect p(y|x) [78].
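A minimal sketch of the OOB update step may help clarify the resampling mechanism. The `learn(x, y)` interface of the base learners and the initial prior values are our assumptions; UOB would differ only in how \(\lambda \) is set:

```python
import math
import random

def poisson(lam, rng):
    """Sample from Poisson(lam) with Knuth's method (adequate for moderate lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

class OOB:
    """Sketch of Oversampling Online Bagging: every base classifier is
    updated k ~ Poisson(lambda) times per training example, with
    lambda = p_maj / p_min for minority examples and 1 otherwise.
    Class priors are tracked with a time-decayed estimate."""
    def __init__(self, learners, decay=0.99, seed=0):
        self.learners = learners
        self.decay = decay
        self.p = {'+': 0.5, '-': 0.5}  # time-decayed prior estimates
        self.rng = random.Random(seed)

    def learn(self, x, y):
        for c in self.p:  # time-decayed update of the class priors
            self.p[c] = self.decay * self.p[c] + (1 - self.decay) * (c == y)
        minority = min(self.p, key=self.p.get)
        if y == minority:
            lam = self.p[max(self.p, key=self.p.get)] / self.p[minority]
        else:
            lam = 1.0  # UOB would instead undersample with lam = p_min / p_maj
        for model in self.learners:
            for _ in range(poisson(lam, self.rng)):
                model.learn(x, y)
```

Because \(\lambda > 1\) for minority examples, each base learner sees them several times on average, which implements oversampling without storing past data.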

A few different concept drift detection methods have also been proposed for class imbalanced data streams. They monitor performance metrics suitable for class imbalanced problems and trigger when the metric significantly deteriorates. Drift Detection Method for Online Class Imbalance learning (DDM-OCI) monitors the time-decayed Recall on the minority class [75], Linear Four Rates (LFR) monitors a time-decayed confusion matrix, whereas Prequential Area Under the Curve Page Hinkley (PAUC-PH) monitors the area under the ROC curve.

Several chunk-based approaches have been proposed based on the storage of old minority class examples to help overcoming class imbalance [1, 14, 15, 27, 28, 35, 45, 82]. Some of these approaches train components by combining all minority class examples seen so far with majority examples from the most recent chunk [27, 28, 45, 82]. Other approaches filter out past minority class examples based on their similarity to the minority class examples of the current data chunk [14, 15], or based on their age [35]. Some other chunk-based approaches tackle class imbalance by using standard offline re-sampling techniques to create a new offline learning classifier for each new data chunk [17, 47]. Yet another recent proposal promotes a reinforcement mechanism to increase the weights of the base classifiers that perform better on the minority class and decrease the weights of the classifiers that perform worse [83].

In contrast to the aforementioned algorithms, which focus on global class imbalance, only a few data stream mining studies have attempted (explicitly or implicitly) to tackle local difficulty factors. Some works have proposed approaches that attempt to deal with local data distributions; however, none of them took into account drifting difficulty factors. For example, Lyon et al. [48] discussed the problem of minority class decomposition into smaller sub-clusters, and claimed that Hellinger Distance Trees may classify such data better than standard VFDT. The role of class decomposition was also considered in a chunk-based ensemble, SCUT-DS [59]. Moreover, Ren et al. [63] proposed a clustering-based oversampling technique to reduce the risk of increasing class overlapping. Finally, an attempt to handle several types of difficulty factors was recently presented in [39], where the difficulty of incoming examples is exploited in the online learning of a cost-sensitive tree. As more difficult examples are presented more times during training, the trees shift towards concentrating on more difficult examples.

Proposed categorization of concept drifts for class imbalanced data

As discussed in Sect. 3.2, local data difficulty factors strongly influence the ability of classifiers to cope with static class imbalanced data. In particular, depending on the underlying distribution of the minority examples, standard classifiers with no strategies to deal with class imbalance can perform either quite well or very poorly. Therefore, it is likely that the extent to which the class imbalance and concept drift issues are exacerbated when combined in non-stationary environments also depends on the underlying distribution of the minority class and how it changes over time. The conjunction of local data difficulty factors and concept drift may potentially be more challenging for classifiers than the impact of each of these factors separately when dealing with non-stationary data streams.

However, existing literature on categorization and characterization of concept drifts in streaming classification problems does not take these local data difficulties into account. Recall that most research about imbalanced streams is concerned with the role of the global imbalance ratio only, either for static or changing class proportions. Given the major role that local data difficulty factors may play in class imbalanced data streams, existing concept drift categorizations may not be enough for evaluating class imbalance learning in non-stationary environments. They may not capture all aspects that make a concept drift particularly challenging or easy to solve. Therefore, we propose an extended concept drift categorization that takes local data difficulty factors into account. This extended categorization should open up the path for further research on classifiers for class imbalanced data streams in non-stationary environments, enabling systematic studies under different relevant drifting conditions.

We will concentrate on extending the criteria for categorizing and characterizing single concept drifts. It is worth reiterating that any drift can be categorized and characterized using several different criteria at the same time. In particular, both existing criteria (e.g., those mentioned in Sect. 3.1) and our proposed criteria (Sects. 4.1 and 4.2) can be used together to describe drifts. The proposed categorization is graphically depicted in Fig. 2.

Fig. 2

Imbalanced stream drift categorization. Each criterion (left column) is accompanied by a list of possible drifts (right column). The minority class is depicted as filled (colored) circles, the majority class as hollow (white) circles (color figure online)

Proposed drift criteria for general problems

Locality of drift region

We start the proposed categorization by distinguishing the locality of the region affected by a drift. For that, we need a definition of a “local region,” which in turn makes use of the definition of a “partitioning” given below:

Definition 1

(Partitioning) Consider a bounded d-dimensional attribute space \(\mathcal {X}\), an output space \(\mathcal {Y}\), and a given posterior probability distribution p(y|x), where \(x \in \mathcal {X}\) and \(y \in \mathcal {Y}\). A partitioning \(\mathcal {P}\) of \(\mathcal {X}\) is a set of bounded regions \(\mathcal {R}_1,\ldots ,\mathcal {R}_n\) such that (a) \(\forall i \ne j: \mathcal {R}_i \cap \mathcal {R}_j = \emptyset \); (b) \(\cup _{i=1}^n \mathcal {R}_i = \mathcal {X}\); (c) \(\forall y \in \mathcal {Y}, \ \forall x, x' \in \mathcal {R}_i: |p(y|x) - p(y|x')| < \tau \), where \(\tau \in \mathbb {R}\) is a threshold; and (d) all examples that can possibly be sampled from a given region \(\mathcal {R}_i\) are connected to each other through a d-dimensional grid, whose granularity is infinitely small for real-valued attributes and discrete for discrete attributes.

For example, in Fig. 2, the four plots under the heading “Drift region” are composed of two regions each, whereas the right-side plot under the heading “Class composition” is composed of four regions. Criterion (d) ensures that a region is not formed by disconnected components. For instance, the three blue regions in the right-side plot under the heading “Class composition” are three different regions, rather than a single region.

The definition above has been partly inspired by the definition of clustering in [66]. However, it differs in several aspects. First, our partitioning is a partitioning of the attribute space, whereas clustering concerns clusters of examples sampled from a given unconditional probability distribution. Second, we include criterion (c), which is absent from the clustering definition. This criterion is important because we are dealing with classification problems, and so a given region must have the vast majority of its examples associated with a given class. Third, we consider a grid that binds together all examples that can possibly be sampled from the region. This grid means that a given region cannot be composed of several disconnected areas of the attribute space. Any neighborhood operator can be used to connect points in a grid; for instance, for a 2-dimensional space, the classic Moore or Von Neumann neighborhoods [73] could be adopted.
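To make criterion (c) concrete, the following minimal Python sketch checks it on a finite sample of points from a candidate region. The posterior function, sample points, and threshold value are illustrative assumptions, not part of the definition itself.

```python
def satisfies_posterior_criterion(region_points, posterior, classes, tau):
    """Criterion (c) on a finite sample: for every class y, the posterior
    p(y|x) must vary by less than tau between any two points of the region."""
    for y in classes:
        values = [posterior(x, y) for x in region_points]
        if max(values) - min(values) >= tau:
            return False
    return True

# Toy posterior on a 1-D attribute space: class 1 grows more likely to the right.
posterior = lambda x, y: x[0] if y == 1 else 1.0 - x[0]

# Points from a narrow slice keep p(y|x) nearly constant ...
assert satisfies_posterior_criterion([(0.80,), (0.82,)], posterior, [0, 1], tau=0.1)
# ... while points spanning both halves of the space violate criterion (c).
assert not satisfies_posterior_criterion([(0.1,), (0.9,)], posterior, [0, 1], tau=0.1)
```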

Definition 2

(Local Regions) Given a partitioning \(\mathcal {P}\), a region \(\mathcal {R}_i \in \mathcal {P}\) is a local region if \(V(\mathcal {R}_i) \ll V(\mathcal {X})\), where \(V(\cdot)\) is the geometric volume.

For example, in Fig. 2, the orange regions in the two plots under the heading “Drift region: Global” are not local regions, whereas the orange regions in the two plots under the heading “Drift region: Local” are local regions.
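For axis-aligned hyperrectangular regions, the condition \(V(\mathcal {R}_i) \ll V(\mathcal {X})\) can be operationalized by comparing volumes against a small fraction of the space volume; the cut-off `eps` in the sketch below is an arbitrary illustrative choice, not part of the definition.

```python
from math import prod

def volume(box):
    """Geometric volume of an axis-aligned hyperrectangle, given as (low, high) pairs."""
    return prod(hi - lo for lo, hi in box)

def is_local(region_box, space_box, eps=0.05):
    """Treat a region as local when its volume is a negligible fraction of the
    attribute space volume, i.e., V(R) << V(X); eps is an arbitrary cut-off."""
    return volume(region_box) < eps * volume(space_box)

space = [(0.0, 1.0), (0.0, 1.0)]                       # the unit square X
assert is_local([(0.4, 0.5), (0.4, 0.5)], space)       # 1% of the space: local
assert not is_local([(0.0, 0.8), (0.0, 0.8)], space)   # 64% of the space: not local
```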

Even though existing literature [24, 38, 50, 79] has considered concept drifts that affect the probability distributions associated with the whole problem space or only part of it, it has not taken locality into account. Specifically, when drifts affect only part of the problem space, existing studies have not considered whether they affect a local region of the space or not. Locality is a general criterion, but it is likely to be particularly important for characterizing drifts in class imbalanced data streams. For instance, drifts that affect small portions of the space associated with the minority class may be very difficult to track.

Based on the locality of the region affected by the drift, a concept drift can be categorized as a local region drift, a global drift, or a global-local region drift (Fig. 2). A local region drift is a concept drift that affects the joint probability distribution \(p(x, y)\) of one or more local regions. If the drift affects regions of the space that are not local regions, it is referred to as a global drift. An example of a local region drift in a real-world data stream can be found in Fig. 6 of Sun et al.’s work [71], whereas several synthetic data streams widely adopted by the data stream mining community (e.g., Hyperplane [70]) present global drifts. A global-local region drift affects both specific local regions and non-local regions of the input attribute space.

When a drift is a local region drift, other criteria for characterizing and categorizing drifts, such as severity and rate of change, can be used either to describe the changes suffered by each local region separately or to describe the changes as a whole. For example, the severity of the change caused by a local drift to a given local region can be characterized by the volume of the intersection between the old and the new position of this local region: a smaller/larger intersection means higher/lower severity. According to the rate of change, a certain local region may be changing slowly while others change fast, or all regions may be changing fast. The combination of the locality criterion with the rate of change for a specific local region may be a particularly useful combination of criteria to describe drifts. For instance, a given local region may change incrementally in such a way that this region “moves” through the attribute space. Local region drifts can also be further characterized by the number and size of the affected local regions.
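As one possible operationalization of this severity notion, the sketch below measures the intersection of the old and new positions of an axis-aligned region. Normalizing by the old region's volume is our illustrative choice; the criterion itself only states that a smaller intersection means higher severity.

```python
def intersection_volume(box_a, box_b):
    """Volume of the overlap of two axis-aligned hyperrectangles
    (zero if they are disjoint in any dimension)."""
    vol = 1.0
    for (lo_a, hi_a), (lo_b, hi_b) in zip(box_a, box_b):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0
        vol *= overlap
    return vol

def drift_severity(old_region, new_region):
    """Severity of a local region drift: the smaller the intersection of the
    old and new positions of the region, the higher the severity (0 to 1)."""
    inter = intersection_volume(old_region, new_region)
    return 1.0 - inter / intersection_volume(old_region, old_region)

old = [(0.0, 0.2), (0.0, 0.2)]
shifted = [(0.1, 0.3), (0.1, 0.3)]                         # region "moved" halfway
assert abs(drift_severity(old, shifted) - 0.75) < 1e-9
assert drift_severity(old, [(0.5, 0.7), (0.5, 0.7)]) == 1.0  # disjoint: maximal severity
```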

Class composition

Class composition refers to how examples of each class are spread over the input attribute space. Specifically, it corresponds to checking whether the class examples belong to a homogeneous region or not. A homogeneous class composition means that examples of the class are concentrated in a single local region (see the earlier discussion on single minority concepts in Sect. 3.2.3). A heterogeneous class composition means that examples of the class are spread over multiple local regions. Note that here we refer to local regions that are sub-concepts larger than regions of rare minority examples (rare example types are considered in a subsequent criterion). Up until now, class composition has been mainly considered in the literature on static imbalanced data, where it referred to the split of the minority class into several rare sub-concepts [37, 69].

Such minority class decomposition was found to have a greater effect on classification performance than the global imbalance ratio [36, 37, 69, 81]. Given the impact that class (de-)composition can have on the difficulty of class imbalanced problems, it should also be considered in data streams. This was reflected in recent studies, which demonstrate that the detection of sub-concepts in streaming chunks can improve classifier performance [63]. Furthermore, class composition may also change over time in real-world problems. For example, in trend topic data streams several distinct groups of opinions on the same topic may appear as time goes on [38]. Changes of example clusters have also been observed in unsupervised streams [66, 67]; see earlier discussion in Sect. 3.3.

We propose to use class composition as a general criterion to describe concept drifts. We refer to drifts that break down homogeneous classes into heterogeneous ones as class split drifts, drifts that merge several different local regions together as class merge drifts, and drifts that both break down and merge local regions as class split-merge drifts (Fig. 2). Such drifts can be characterized by a list of local regions with their corresponding p(y|x) before and after the drift. Evidence of class merge drifts in real-world problems can be found in [71], suggesting that class split type of drifts may also occur in real-world problems.

Proposed drift criteria for class imbalanced problems

Imbalance ratio

This criterion corresponds to the prior probabilities of classes, representing the proportion of examples expected to be received from each class. Previous studies [76, 78] have taken this criterion into account when designing experiments to evaluate learning approaches for class imbalanced data streams. However, this criterion has not been explicitly emphasized as part of a concept drift taxonomy before.

We consider two situations: static imbalance ratio if the class proportions do not change over time, and dynamic imbalance ratio if the imbalance ratio changes. For example, it is known that the imbalance ratio in problems such as software defect prediction [13] and tweet topic classification [71] can vary over time, requiring specific learning algorithms able to cope with such variations. It is even possible for the role of the minority and majority classes to swap [13, 71], i.e., for the majority to become a minority and vice-versa (Fig. 2).

The above-mentioned drifts can be characterized by a vector containing prior probabilities of each class before and after the drift. Such a description can be further generalized for the case of multi-class classification.
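For a two-class stream, such a characterization can be sketched as a pair of prior vectors together with an interpolation rule for the drift period. The linear interpolation below is an illustrative assumption, not the actual mechanism of any particular generator.

```python
def priors_during_drift(priors_before, priors_after, t, t_start, t_end):
    """Class priors at time t for a gradual imbalance-ratio drift described
    by prior vectors before and after the drift (linear interpolation
    during the drift window; works for any number of classes)."""
    if t <= t_start:
        return list(priors_before)
    if t >= t_end:
        return list(priors_after)
    alpha = (t - t_start) / (t_end - t_start)
    return [(1 - alpha) * b + alpha * a
            for b, a in zip(priors_before, priors_after)]

# Minority ratio drifting from 10% to 1% between examples 70,000 and 100,000.
before, after = [0.10, 0.90], [0.01, 0.99]
assert priors_during_drift(before, after, 50_000, 70_000, 100_000) == [0.10, 0.90]
mid = priors_during_drift(before, after, 85_000, 70_000, 100_000)
assert abs(mid[0] - 0.055) < 1e-9          # halfway through the drift
assert priors_during_drift(before, after, 150_000, 70_000, 100_000) == [0.01, 0.99]
```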

Distribution of minority class examples with respect to their types

As discussed in Sect. 3, the distribution of different types of minority class examples can influence classifiers learning from imbalanced data. Following the method from [55, 56], the type of a minority example can be identified based on the analysis of class labels of other examples in its local neighborhood. Depending on the number of examples from the opposite class, minority examples are labeled as safe, borderline, rare or outlying [56], as explained in Sect. 3.2.4. In the case of streams, this analysis could be done with neighbors inside a sliding window.
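A minimal sketch of this neighborhood analysis, assuming the k = 5 thresholds commonly used with the method of [55, 56] (5–4 same-class neighbors: safe, 3–2: borderline, 1: rare, 0: outlier) and a brute-force nearest-neighbor search over a sliding window:

```python
from collections import deque

def label_type(same_class_neighbors):
    """Map the number of same-class examples among k = 5 nearest neighbors
    to an example type (assumed thresholds: 5-4 safe, 3-2 borderline,
    1 rare, 0 outlier)."""
    return {5: "safe", 4: "safe", 3: "borderline",
            2: "borderline", 1: "rare", 0: "outlier"}[same_class_neighbors]

def type_of_example(x, y, window, k=5):
    """Label example (x, y) using its k nearest neighbors (squared Euclidean
    distance) inside a sliding window of recent (attributes, class) pairs."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    neighbors = sorted(window, key=lambda item: dist(item[0], x))[:k]
    same = sum(1 for _, ny in neighbors if ny == y)
    return label_type(same)

# A sliding window dominated by majority (class 0) examples.
window = deque([((0.0, 0.0), 0), ((0.1, 0.0), 0), ((0.0, 0.1), 0),
                ((0.1, 0.1), 0), ((0.2, 0.0), 0), ((0.9, 0.8), 1)], maxlen=1000)
# A minority example whose 5 nearest neighbors are all majority is an outlier.
assert type_of_example((0.05, 0.05), 1, window) == "outlier"
```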

Although types of examples have been used to improve classifiers for static data, they were rarely considered in data streams [39, 42]. However, it has been shown that the number of minority class examples of particular types may vary over time in real-world problems, such as the analysis of tweet streams [42]; more examples of challenging and drifting class distributions in real-world streams will be presented in Sect. 5.

Following the above discussion, streams can be categorized as ones with static or dynamically changing example-type distributions.

Minority class example-type distribution drifts can be characterized through a vector containing the proportion of the minority examples belonging to safe, borderline, rare and outlier examples before and after the concept drift. It is worth noting that these drifts cover changes of single example types, such as increasing the proportion of borderline examples at the cost of safe examples, as well as changes of many types at once (Fig. 2).

Experimental evaluation of the influence of data factors and drifts on online classifiers

Experimental aims and setup

We experimentally study the impact of the discussed data difficulty factors and drifts on the performance of selected online stream classifiers. Compared to earlier studies on imbalanced streams, where only global factors like changes of the imbalance ratio were considered, here we pay more attention to local data characteristics and local drifts.

Our series of experiments is organized along the following research questions, which were preliminarily introduced in Sect. 1:

  1. RQ1

    What is the impact of different types of single data difficulty factors and isolated, single drifts on the predictive performance of selected online classifiers? Is it possible to identify which classifiers work better and when?

  2. RQ2

    What is the interaction between the different types of local data factors and global class imbalance? Which local factors or drifts are the most demanding for classifiers at different imbalance ratios?

  3. RQ3

    Which complex scenarios integrating several data factors and drifts are the most harmful for classification performance? Is it possible to determine which components (single factors or drifts) in these scenarios are the most influential?

In order to examine these issues, we carried out most of the experiments in a controlled framework based on synthetically generated data. Using a synthetic data stream generator, each data factor can be modeled and parameterized according to different planned scenarios. This is important because it allows us to obtain a detailed understanding of when and under what circumstances classifiers work well or fail. Moreover, we also supplement this study with experiments on real-world imbalanced streams, in which most of the considered data difficulty factors or drifts were identified.

For the purposes of the synthetic data experiments, we implemented an imbalanced data stream generator. A detailed description of the generator is given in Section A of the supplementary materials.Footnote 1 The source code of a MOA-compatible [2] implementation of the generator is available at:

The synthetic data stream characteristics are controlled by modifying parameters referring to the criteria proposed in this paper. The parameter values (stream elements) used to create streams for this study are listed in Table 1. The generated streams were named according to a naming convention based on their elements (Split, Move, Merge, Borderline, Rare, Im, StaticIm), introduced in Section B of the supplement.

Table 1 Elements with their parameters used to generate streams

To answer research questions RQ1–RQ3, a comprehensive set of data streams was generated as follows:

  1. 1.

    Streams with single factors or drifts (RQ1, discussed in Sect. 5.2)

    • First we modeled the occurrence of a single element (data factor or drift) in isolation from other data stream aspects. All single values of elements from lists CD, TD, ID or SI were used to generate data streams. If drifts were modeled (CD, TD, ID), then the stream starts as fully balanced, with only safe examples, and one cluster of ‘minority’ examples surrounded by examples from the ‘majority’ class.

    • Moreover, we considered imbalance ratio changes, where we started from a static imbalanced stream and generated streams corresponding to all combinations of ratio values from the product SI \(\times \) ID.

  2. 2.

    Streams with pairs of elements (RQ2, Sect. 5.3)

    • Streams with global imbalance were generated as pairs of elements from (SI’ or ID’) \(\times \) (TD’ or CD’)

    • Combinations of values from the product TD’ \(\times \) CD’

  3. 3.

    Complex scenarios consisting of triples or more elements (RQ3, Sect. 5.4)

    • The set of streams was created as a Cartesian product: SI’ \(\times \) ID’ \(\times \) CD’ \(\times \) TD’, excluding streams with mutually exclusive characteristics such as StaticIm1 and Im1, and streams where the global share of minority class was increasing with time.

The diagram summarizing these generation scenarios, their relations to research factors and corresponding categories is presented in Fig. 3.

Fig. 3

Schematic of the experimental setup

The scenarios described above resulted in the generation of 385 synthetic streams (224 static and 161 drifting). For their full list, see the online supplementary materials. All the generated data streams consisted of 200,000 examples. If a concept drift is present in a stream, it spans from example number 70,000 to 100,000. In the case of combined drifts, they all occur together in this time period. If the imbalance ratio is not directly stated, the stream is balanced. If no minority class composition or example-type details are given, the minority concept is defined as a single cluster of safe-type examples surrounded by majority class examples uniformly distributed in the attribute space.

In addition to synthetic streams, four real-world datasets [5, 53, 72, 74] with various imbalance ratios and local data difficulty factors were used. The real streams range from 8,000 to 50,000 examples and were evaluated as difficult for learning classifiers in previous studies [9, 42]. Reproducible experimental scripts are available at:

The experiments compare the predictive performance of five online classifiers frequently used in the related literature: Oversampling Online Bagging (OOB) [76], Undersampling Online Bagging (UOB) [76], ESOS-ELM (ESOS) [52], Online Bagging (OB) [61], and the Hoeffding Tree (VFDT) [19]. OOB, UOB and ESOS were selected as representative specialized classifiers for imbalanced streams, which use different strategies and are based on different methods. OB represents a non-specialized data stream ensemble exploiting online bagging (which was an inspiration for OOB and UOB). VFDT serves as a reference single classifier.

All ensembles used 15 Hoeffding Trees as component classifiers. For OOB, UOB, OB and VFDT we used MOA implementations with default parameter values suggested by the algorithms’ authors. Note that we used the standard version of VFDT classifierFootnote 2 without any drift-related mechanisms. ESOS was implemented for this study and, as proposed by the algorithm’s authors, utilized artificial neural networks with 70 hidden neurons. The size of the initialization batch for the OS-ELM classifiers in ESOS was set to 100 examples with an evaluation period of 1000 examples, and the drift detection threshold coefficient set to 0.9, as recommended in [52]. With the exception of ESOS-ELM, which by design employs a basic performance-tracking drift detector, the other considered classifiers are not integrated with any drift detectors as the ones provided by MOA are not suitable for dealing with class imbalance and are thus outside the scope of our experimental study.

The classifiers were evaluated using two performance measures—Recall and G-mean. Recall (also called Sensitivity or True Positive Rate—TPR) is the correct recognition rate for the minority class. G-mean is the geometric mean of recall (TPR) and specificity (True Negative Rate—TNR), defined as \(G_{\hbox {mean}} = \sqrt{\hbox {TPR} \times \hbox {TNR}}\). Recall and G-mean were selected from a larger list of measures [11, 34] mainly due to their complementary nature and easy interpretation. Recall focuses only on the minority class, allowing us to see when the recognition of the minority class drops. In contrast, G-mean captures the balance between recognition ratios of both classes. Therefore, by analyzing both measures it can be noticed whether one class was recognized more often at the cost of the other. For example, if Recall improves but G-mean deteriorates, this means that the recognition of the minority class has improved at the cost of the recognition rate of the majority class. Moreover, G-mean is skew-invariant, meaning that G-mean’s interpretation remains the same for all possible class imbalance ratios [11, 12], being particularly relevant for studying drifting imbalance ratios.

Both Recall and G-mean were calculated prequentially [25] using a sliding window of 1000 examples. Each analyzed stream was visualized as a line plot with the number of processed examples on the x-axis, value of Recall/G-mean on the y-axis, and the drift region plotted over a gray background. To increase readability, the line plots were smoothed using a moving average of 20 data points.
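The windowed prequential computation of these measures can be sketched as follows, with the minority class encoded as the positive class 1. This is an illustrative re-implementation, not the evaluation code used in the experiments.

```python
from collections import deque
from math import sqrt

class PrequentialGMean:
    """Prequential Recall and G-mean over a sliding window of the last
    `window` (true label, predicted label) pairs; the minority class is
    encoded as 1 (positive) and the majority class as 0."""
    def __init__(self, window=1000):
        self.pairs = deque(maxlen=window)

    def add(self, y_true, y_pred):
        self.pairs.append((y_true, y_pred))

    def _rate(self, cls):
        tot = sum(1 for y, _ in self.pairs if y == cls)
        hit = sum(1 for y, p in self.pairs if y == cls and p == cls)
        return hit / tot if tot else 0.0

    def recall(self):          # TPR on the minority class
        return self._rate(1)

    def g_mean(self):          # sqrt(TPR * TNR)
        return sqrt(self._rate(1) * self._rate(0))

ev = PrequentialGMean(window=1000)
for y, p in [(1, 1), (1, 0), (0, 0), (0, 0), (0, 1)]:
    ev.add(y, p)
assert abs(ev.recall() - 0.5) < 1e-9                 # 1 of 2 minority correct
assert abs(ev.g_mean() - sqrt(0.5 * 2 / 3)) < 1e-9   # TNR = 2/3
```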

Moreover, measure values averaged over entire streams (mean performance values) are presented in tabular form, which constituted the basis for carrying out the ranked Friedman test [16] to compare classifier performance. Due to the large number of experiments, here we only present the most representative plots; the reader is referred to the online supplementary material and code repository for additional results.
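For illustration, mean Friedman ranks like those reported in Table 2 can be computed with a short pure-Python sketch; the test statistic itself and the Nemenyi post hoc comparison, available in standard statistical packages, are omitted, and the scores below are made-up values.

```python
def mean_friedman_ranks(scores_by_classifier):
    """Mean Friedman ranks: on each stream, rank classifiers by score
    (rank 1 = best, ties receive averaged ranks), then average the
    ranks of each classifier over all streams."""
    names = list(scores_by_classifier)
    n_streams = len(next(iter(scores_by_classifier.values())))
    ranks = {name: 0.0 for name in names}
    for j in range(n_streams):
        column = sorted(names, key=lambda n: -scores_by_classifier[n][j])
        i = 0
        while i < len(column):
            k = i
            while (k + 1 < len(column) and
                   scores_by_classifier[column[k + 1]][j] ==
                   scores_by_classifier[column[i]][j]):
                k += 1                      # extend over tied scores
            avg = (i + k) / 2 + 1           # average rank for the tie group
            for name in column[i:k + 1]:
                ranks[name] += avg
            i = k + 1
    return {name: r / n_streams for name, r in ranks.items()}

# Hypothetical G-mean values averaged over two streams.
scores = {"OOB": [0.95, 0.93], "UOB": [0.94, 0.95], "VFDT": [0.55, 0.50]}
assert mean_friedman_ranks(scores) == {"OOB": 1.5, "UOB": 1.5, "VFDT": 3.0}
```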

Experiments with single drifts or data difficulty factors

Static imbalance between classes

The fully balanced stationary stream without any difficulty factors was quite easy to learn, with most classifiers achieving, on average, 0.99 G-mean and Recall (for detailed values see Supplementary Tables S3 and S4). The average performance of classifiers on stationary imbalanced streams with minority class ratios of 10%, 20%, 30% and 40% was nearly the same (Fig. 4). For all the aforementioned minority class ratios, classifier performance plots looked very similar: performance values rise fairly quickly up to a certain level and remain stable until the end of the stream.

Fig. 4

Comparison of the effect of static imbalance (the same minority class ratio over the entire stream) on average classifier performance; G-mean and Recall averaged over entire streams for a given classifier on a given stream

The situation changes for higher class imbalance (i.e., a minority class ratio \(\le \) 5%). With 3% or 5% of minority class examples, OOB, UOB, and ESOS achieved G-mean values similar to those obtained for higher minority class ratios (i.e., the aforementioned less imbalanced streams). However, OB’s and VFDT’s G-mean was worse (at most 0.830–0.850). An even clearer distinction between classifiers can be seen for the two highest tested class imbalance ratios (1% and 2% minority examples in the stream). Here, the non-specialized OB and VFDT classifiers performed dramatically worse: their G-mean dropped down to \(\sim \) 0.100, whereas minority class Recall was as low as 0.020 for OB. The remaining (specialized) classifiers still achieved fairly high G-mean values when the minority ratio was 2% (0.947, 0.956, 0.876 for OOB, UOB, and ESOS, respectively). However, for the minority ratio of 1%, the recognition of the minority class (Recall) was worse even for the specialized classifiers (e.g., OOB’s average Recall was 0.815). Figure 4 shows how G-mean and Recall averaged over entire streams changed between data streams with different static imbalance ratios.

The statistical comparison of classifier performance on all static imbalanced streams, using G-mean and Recall averaged over entire streams, was carried out with the Friedman test followed by the Nemenyi post hoc test (Table 2). The test resulted in the following rankings: OOB \(\succ \) UOB \(\succ \) OB \(\succ \) VFDT \(\succ \) ESOS (G-mean) and UOB \(\succ \) OOB \(\succ \) OB \(\succ \) ESOS \(\succ \) VFDT (Recall). Importantly, OOB is significantly better than all the remaining classifiers except UOB on G-mean, whereas UOB is significantly better than all but OOB.

Table 2 Mean Friedman test ranks for comparing classifier performance using G-mean and Recall averaged over entire streams (see them in Supplementary Tables S3 and S4) on different sets of data streams involving single data difficulty factors

Drifts of imbalance ratios

Next, we analyzed streams that were initially balanced and in which a single global imbalance ratio drift was then modeled. If the minority class ratio after the drift remained \(\ge \) 10%, then all the analyzed classifiers achieved results similar to those obtained for the stationary imbalanced streams discussed in the previous section. For lower minority class ratios, the performance of OB and VFDT dropped to slightly lower values (in particular for 1% and 2% minority ratios).

On the other hand, scenarios in which the stream initially had \(\le \) 10% minority examples, whose ratio then went down to 1–2%, were more demanding. In such scenarios, UOB, OOB and even ESOS still performed quite well, whereas VFDT and OB performed poorly after the drift. Figure 5 shows such a case, with the minority class ratio changing from 10 to 1%. Moreover, Fig. 6 presents the performance of classifiers (x-axis) on different minority ratio drifts (y-axis).

Fig. 5

G-mean (left) and Recall (right) of the analyzed classifiers on stream StatIm10+Im1, where there is a class ratio drift changing the minority class ratio from 10% to 1%

Fig. 6

Comparison of the effect of decreasing minority class ratios (left) and class swaps (right) on average classifier performance; G-mean averaged over entire streams for a given classifier on a given stream

Interestingly, scenarios which involved class swapping (e.g., 1% or 2% \(\rightarrow \) 60%) were less demanding. Specialized classifiers consistently and fairly quickly improved their performance from the beginning of the stream, whereas OB and VFDT started with close-to-zero G-mean and Recall values, only to improve after the drift to values similar to those of OOB or UOB.

Looking at Fig. 6, one can notice that streams with the minority class ratio over 5% are sufficiently well recognized by all classifiers. It is also clear that non-specialized classifiers (OB and VFDT) were much more susceptible to global class imbalance than specialized learners (OOB, UOB, ESOS). Finally, it seems that the difficulty of imbalance ratio drifts depends on the number of minority examples available before the drift. If a classifier had enough examples to learn the minority concept prior to the drift, the effect of the drift was smaller.

The observed relations were confirmed by the statistical analysis of the averaged values of both measures done with the Friedman and Nemenyi tests at significance level \(\alpha = 0.05\). The classifier rankings (Table 2) were the same as for the stationary imbalanced streams, but the difference between average ranks of the best performing OOB/UOB and the remaining classifiers was much higher.

Changing class composition

In this group of experiments, we considered four balanced scenarios with changes in class composition: (1) stationary streams with a class decomposed into 3, 5 or 7 sub-clusters, (2) a drift of sub-clusters gradually moving in the attribute space to new random positionsFootnote 3, (3) a drift splitting one concept into multiple sub-clusters, and (4) several sub-clusters merging into one concept. Note that, even though such class compositions have been discussed in the context of class imbalanced data streams in Sect. 4, they are also applicable to a given balanced class, as done in the current section. Such balanced scenarios constitute baseline cases: a classifier that is successful in dealing with class imbalance should, when faced with the corresponding imbalanced scenarios, present predictive performance as good as the one obtained in these balanced scenarios. In Sect. 5.3, we will discuss these data difficulty factors in imbalanced data streams.

Stationary streams with a class decomposed into sub-clusters were difficult for the analyzed classifiers. All classifiers incrementally improved G-mean (and Recall) to rather high values of \(\sim \) 0.982, except for ESOS, which reached a slightly lower G-mean of 0.943. However, to achieve this level of predictive performance, the classifiers required more examples than in the scenarios with the global imbalance ratio from Sect. 5.2.2. Moreover, the more sub-clusters, the worse the observed classifier performance.

Scenarios involving a single drift of sub-clusters gradually moving in the attribute space were more demanding, and the decrease of the performance measures was clearly visible. However, although G-mean initially dropped from 0.975 down to 0.940, the classifiers recovered quite well after the drift. The effects of moving sub-clusters were only slightly stronger when the number of sub-clusters was larger.

The impact of splitting one cluster into 3, 5, or 7 sub-clusters was similar to cluster movement. For example, splitting a class into 5 sub-clusters resulted in a decrease of G-mean of OOB/UOB from 0.989 before the drift to 0.919/0.901 after the drift, and from 0.963 down to only 0.750, respectively, for ESOS (Fig. 7). It is worth noting that all the analyzed classifiers only partly recovered from this type of drift.

Fig. 7

G-mean (left) and Recall (right) of the analyzed classifiers on stream Split5, in which the minority class is gradually split into 5 sub-clusters

Finally, a drift merging many sub-clusters did not pose a significant challenge for the analyzed classifiers. The plots showed that classifiers started off with slightly lower performance values and underwent a small performance decrease during the drift, but then recovered quite quickly and achieved values of G-mean and Recall comparable to those on stationary streams.

Fig. 8

Comparison of the effect of class composition drifts on average classifier performance in balanced data streams; G-mean averaged over entire streams for a given classifier on a given stream

Figure 8 compares the effect of different class composition drifts on G-mean averaged over entire streams (for detailed values see the Section C of the supplement). It can be noticed that all the class composition drifts have similar, rather low, impacts on G-mean since the performance drops were only temporary.

The rankings of classifiers according to the Friedman and Nemenyi tests performed over averaged values were slightly different compared to previous scenarios (Table 2). In particular, UOB was better than OOB on Recall, and also partially on G-mean, and ESOS performed generally better than OB and VFDT.

Distribution of example types

Similarly to Sect. 5.2.3, in this group of experiments we analyzed balanced stationary streams with a given percentage of either borderline or rare types of examples for a given class. Outlier examples were discarded from our analysis, as they had been shown to be extremely difficult already for static data, and rare cases were already very difficult for the considered stream classifiers.

Classifier performance plots showed that the presence of both types of examples always decreases G-mean and Recall. However, the effect depends on the type and proportion of difficult examples (Table 3). Increasing the proportion of borderline examples reduces G-mean values in comparison to a stream with only safe minority examples, but the deterioration is relatively small (e.g., for OOB, G-mean drops from 0.982 with 0% borderline examples to 0.963 with 100% borderline examples). On the other hand, the proportion of rare examples proved to be clearly more influential: proportions of rare examples as small as 20% led to G-mean decreasing to \(\sim \) 0.88 for all classifiers, whereas streams with 100% rare examples exacerbated this effect, leading to G-mean values of \(\sim \) 0.55. Similar trends were observed for Recall.

Table 3 The impact of borderline and rare-type minority examples in balanced stationary streams on G-mean values of classifiers; G-mean averaged over entire streams for a given classifier on a given stream
Table 4 The impact of type of examples on G-mean of classifiers

Similar trends were observed when the difficult examples were not present from the start of the stream, but appeared as part of a concept drift. To show whether the classifiers recovered from such drifts, in Table 4 we present the values of G-mean directly after the drift (post) and at the end of the stream (end). For drifts introducing borderline examples, one can notice that G-mean decreases slightly more than in the analogous static streams and that all classifiers slightly recovered after the drift. In contrast, drifts which introduced rare examples were definitely more influential, and the classifiers did not recover after them.

Figure 9 compares the G-mean of classifiers on streams with the number of borderline (left) and rare (right) examples increasing from 0 to 40%. Introducing borderline examples had a visible yet limited effect on the performance of all the analyzed algorithms, with classifiers such as UOB and OOB reacting slightly better to these drifts than other methods. However, the proportion of rare examples proved to be clearly more influential than the number of borderline examples: in this case we can notice a much stronger deterioration of performance, with nearly no recovery. Similar observations were made for Recall.

Fig. 9

Classifier G-mean on streams a Borderline40 and b Rare40, which, respectively, have ratios of borderline and rare examples within the minority class increasing from 0 to 40%

We have also analyzed the values of G-mean and Recall averaged over entire streams and grouped in sub-categories corresponding to different types of examples (for detailed values see Supplementary Tables S3 and S4). The results of Friedman and Nemenyi tests (presented in Table 2) confirm our observations, ranking OOB/UOB highest in terms of G-mean. Interestingly, UOB performed slightly better on borderline scenarios, whereas OOB was marginally better on data streams with rare examples. In both cases, however, the differences between OOB and UOB were not statistically significant. In terms of Recall, ESOS achieved highest Friedman ranks on both rare and borderline data streams, followed by UOB. It is also worth noting that, on data streams with rare examples, all classifiers had fairly similar ranks and were mostly not significantly different from each other according to the Nemenyi test, which might suggest they all had similar problems with classifying these streams.
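The Friedman and Nemenyi procedures used for these comparisons can be reproduced with scipy and the critical-difference formula from Demsar [16]. A sketch on hypothetical stream-averaged scores (the score matrix below is made-up illustration, not the paper's data):

```python
import numpy as np
from scipy.stats import friedmanchisquare

# rows: data streams, columns: classifiers (stream-averaged G-mean);
# illustrative values only
scores = np.array([
    [0.95, 0.93, 0.90, 0.80, 0.78],
    [0.88, 0.90, 0.85, 0.70, 0.69],
    [0.75, 0.77, 0.74, 0.55, 0.52],
    [0.96, 0.94, 0.92, 0.83, 0.80],
])
stat, p = friedmanchisquare(*scores.T)

# average ranks per classifier (rank 1 = best; assumes no ties)
ranks = (-scores).argsort(axis=1).argsort(axis=1) + 1
avg_ranks = ranks.mean(axis=0)

# Nemenyi critical difference (Demsar 2006): CD = q_alpha * sqrt(k(k+1)/(6N))
k, N = scores.shape[1], scores.shape[0]
q_alpha = 2.728  # studentized range value for alpha = 0.05, k = 5
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))
```

Two classifiers are deemed significantly different when their average ranks differ by more than the critical difference `cd`.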

Finally, Fig. 10 shows the comparison of the impact of different example-type distributions with respect to G-mean averaged over entire streams. This comparison confirms that the proportion of rare examples has the biggest impact on classifier performance. It can also be noticed that static difficulty factors are slightly more demanding than drifting ones, since their effect is not temporary.

Fig. 10

Comparison of the effect of minority example-type distribution on average classifier performance; G-mean averaged over entire streams for a given classifier on a given stream


Impact of pairs of elements in the stream

To study the impact of different types of data difficulties and class imbalance (RQ2), we have analyzed pairs of elements in the generated streams. We have studied various difficulty factors paired with moderate and high class imbalance ratios, and considered combinations of factors referring to the minority class split and the distributions of minority example types.

Interactions between local data factors and the global class imbalance ratio

We have examined how different global imbalance ratios interact with the most influential drifts identified in the previous subsection. As explained in Sect. 5.1, two representative imbalance ratios were chosen: 1% (the most influential one) and 10% (having a moderate impact). Firstly, these static global imbalance ratios were combined with a drift of either the class composition or the distribution of minority example types.

The general observation from analyzing the plots of classifier performance is that class imbalance amplifies the effect of other difficulty factors. This amplification is especially visible on streams involving high static imbalance.

For instance, the combination of a static 1% minority class ratio with a minority class split into 5 sub-clusters (the StaticIm1 + Split5 stream, Fig. 11) resulted in three widely different post-drift G-mean values for the specialized classifiers—UOB: 0.519, OOB: 0.271, ESOS: 0.741. In comparison, when the split was performed on a balanced stream, these classifiers achieved a G-mean of \(\sim \) 0.900 (Fig. 7). Note the worse recovery after the stronger drifts, and even no recovery for OB and VFDT. Similar trends were observed for pairs combining high static imbalance ratios with a moving sub-cluster, as well as with rare and borderline types of minority examples.

Fig. 11

G-mean (left) and Recall (right) of the analyzed classifiers on the StaticIm1 + Split5 stream, which has a 1% minority class ratio and a minority class split into 5 sub-clusters

In contrast, pairs involving drifting class imbalance ratios, i.e., streams that started off as balanced and in which class imbalance appeared in conjunction with a drift, were slightly less demanding. For instance, let us consider UOB in the stream Im1 + Split5. Its G-mean values changed as follows: 0.988 (the moment before the drift) \(\rightarrow \) 0.892 (after the drift) \(\rightarrow \) 0.919 (at the end of the stream). For comparison, G-mean values for UOB in StaticIm1 + Split5 were: 0.978 \(\rightarrow \) 0.519 \(\rightarrow \) 0.895. Similar differences between the effects of static and dynamic imbalance were observed for OOB. More results showing G-mean values at characteristic moments in the streams are presented in Section D of the supplement.

The classifier performance plots for pairs with the imbalance drifts were more similar to those obtained for scenarios containing only single difficulty factors. We hypothesize that, compared to static imbalance, in scenarios involving class ratio drifts the classifiers were capable of learning much more before the drift (on the balanced portion of the stream). The main observation is that having a constantly low proportion of minority class examples is much more challenging than having a varying imbalance ratio with periods of relative class balance.
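This difference between static and drifting imbalance can be emulated with a simple label generator in which the minority-class probability follows a sigmoid drift, as is common in stream generators. A sketch under that assumption (the function name and parameterization are illustrative, not the paper's generator):

```python
import numpy as np

def drifting_labels(n, p_start=0.5, p_end=0.01, drift_center=None,
                    width=500, seed=0):
    """Binary labels whose minority-class probability drifts from
    p_start to p_end along a sigmoid centered at drift_center."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    c = drift_center if drift_center is not None else n // 2
    # sigmoid transition of the minority-class prior over time
    p_min = p_start + (p_end - p_start) / (1 + np.exp(-4 * (t - c) / width))
    labels = (rng.random(n) < p_min).astype(int)
    return labels, p_min
```

With `p_start=0.5` the first half of the stream is balanced, giving classifiers a well-represented minority class to learn from before the ratio drops; a static 1% stream offers no such period.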

Interactions of class split with various types of examples

In this subset of experiments, drifts of a selected class split (5 sub-clusters) with drifts of example-type proportions were considered in balanced data streams. The use of balanced streams here offers a baseline for further analyses on more complex scenarios including multiple drifts and data difficulty factors presented in Sect. 5.4.
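A minority class split of this kind can be emulated by drawing minority examples from several Gaussian sub-clusters. An illustrative sketch (a stand-in for the paper's actual generator; names and parameters are assumptions):

```python
import numpy as np

def minority_with_subclusters(n_min, n_clusters, dim=2, spread=0.5, seed=0):
    """Minority-class examples drawn from n_clusters Gaussian sub-clusters
    with randomly placed, non-calibrated centers."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(-5, 5, size=(n_clusters, dim))
    assign = rng.integers(0, n_clusters, size=n_min)  # cluster membership
    X = centers[assign] + rng.normal(scale=spread, size=(n_min, dim))
    return X, assign
```

In the paper's generator the sub-cluster positions are additionally constrained not to overlap (see the supplement); this sketch omits that check for brevity.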

The general observation is that the most difficult scenarios combine the minority class split with rare examples. These scenarios always led to stronger deterioration than pairs with the same percentage of borderline examples. For instance, consider OOB and its G-mean values after the drift and at the end of the recovery period: in Split5 + Rare20 the values at these moments are 0.853 and 0.861, while in Split4 + Borderline20 they are 0.897 and 0.927, respectively. Moreover, the larger the proportion of rare examples after the drift, the larger the performance drop. The classifier plot for Split5 + Rare40 (Fig. 12) shows that all algorithms behave in a similar way and are unable to recover after the drift. Moreover, when compared against plots depicting scenarios containing only the split (Fig. 7) or only rare examples (Fig. 9b), it can be noticed that it is probably the proportion of rare examples that has the bigger influence on the final performance of the classifiers. Similarly, combinations of a minority class split and borderline examples were observed to be more difficult than scenarios containing the respective single difficulty factors.

Fig. 12

G-mean (left) and Recall (right) of the analyzed classifiers on balanced stream Split5 + Rare40, which has a drift introducing 40% rare-type examples and a minority class split into 5 sub-clusters

The global comparison of pairs of difficulty factors

Similarly to the analysis made in Sect. 5.2, we performed a global comparison of the impact of pairs of difficulty factors on classifier G-mean and Recall calculated over entire streams. The most influential pairs are presented in Fig. 13.

Fig. 13

Comparison of the effect of pairs of factors on classifier G-mean and Recall. Scenarios ranked according to the median performance of all classifiers on a given scenario. Borderline[N]/Rare[N]: a drift to N% borderline/rare examples, StaticIm[N]: N% minority examples before the drift, Im[N]: drift to N% minority ratio, Split[N]: splitting into N sub-clusters, +: combination of factors

Combinations with rare examples clearly lead to the largest decrease of both measures for all classifiers (with OB and VFDT being the most affected). One explanation is that high rarity corresponds to a very scattered distribution of small islands of few minority examples. Together with high imbalance, this makes the data less safe and more difficult to learn. On the other hand, combinations of rare examples with minority class splits were modeled with balanced streams; therefore, the impact of the same percentage of rare examples was visibly smaller. As noticed in the earlier subsections, combinations of a class imbalance drift with borderline examples are less influential. For some of these borderline pairs, even OB and VFDT performed on par with OOB, UOB, or ESOS.

Finally, the Friedman test over stream-averaged values (Table 5) was performed to compare the classifiers. The test indicates that UOB should be the preferred classifier for streams involving combinations of high imbalance and an additional difficulty factor. However, it should be noted that the difference between UOB and OOB in terms of G-mean is not statistically significant. Interestingly, this statistical analysis also shows that, contrary to most previous analyses, in the case of combined split and rare examples UOB replaces OOB in the first position of the G-mean ranking, whereas ESOS takes the better rank for Recall.

Table 5 Mean Friedman test ranks for comparing classifier performance using the averaged G-mean on different data stream sets involving pairs of changes and multiple data difficulty factors

Experiments with complex scenarios including multiple drifts

In the case of complex scenarios involving multiple difficulty factors, the main general observation is that starting the learning process from a highly imbalanced stream (minority class ratio \(\le \) 10%) is more difficult than starting from a balanced stream. For example, the decreases in predictive performance in streams StaticIm10 + Split5 + Im1 + Rare[N] are higher than in Split5 + Im1 + Rare[N], for all percentages of rare examples N. Moreover, while analyzing particular plots, one can notice that the decrease in predictive performance is stronger than in scenarios involving single or paired difficulty factors. For instance, consider a stream combining a \(10\% \rightarrow 1\%\) drift of the minority class ratio with a minority class split into 5 sub-clusters and a \(0\% \rightarrow 40\%\) increase in the proportion of rare examples (Fig. 14). Note that OB and VFDT are practically incapable of recognizing the minority class after the drift.

Fig. 14

G-mean (left) and Recall (right) for stream StaticIm10 + Split5 + Im1 + Rare40, which combines a \(10\% \rightarrow 1\%\) drift of the minority class ratio with a minority class split into 5 sub-clusters and a \(0\% \rightarrow 40\%\) increase of the proportion of rare examples

Furthermore, we compared complex drifting scenarios (where difficulties appear during a drift) to corresponding static streams (where all the difficulties appear from the beginning of the stream). A sample of this comparison is given in Table 6. One can easily see that G-mean values for static streams are much lower than for the corresponding drifting streams. Moreover, there are quite strong differences in classifier performance. Again, OB and VFDT failed to handle multiple difficulty factors, whereas UOB and ESOS worked better. Similar observations were made for other proportions of rare/borderline examples.

Table 6 G-mean values of classifiers averaged over entire data streams with high imbalance, minority class split, and rare/borderline examples

We have also analyzed averaged values of G-mean and Recall of the classifiers for various combinations of multiple drifts (Fig. 15). One can notice that, for all classifiers, scenarios involving various proportions of rare examples are at the top of the ranking. Moreover, the impact of multiple combinations of factors was more severe than that of pairs of factors (Fig. 13). Complex scenarios also intensified the impact of minority class composition and resulted in larger performance differences between classifiers. Whereas pairs of difficulty factors differentiated classifiers mostly at the \(\sim \) 0.050 G-mean level, complex scenarios show differences of up to almost 0.400.

Moreover, the results of the Friedman test (Table 5) calculated on stream-averaged performance values show that when multiple difficulty factors are present in the stream, UOB is the best performing classifier in terms of both G-mean and Recall.

Fig. 15

Comparison of the effect of complex scenarios on G-mean and Recall of the analyzed classifiers. Scenarios ranked according to the median performance of all classifiers on a given scenario. Borderline[N]/Rare[N]: a drift to N% borderline/rare examples, StaticIm[N]: N% minority examples before the drift, Im[N]: drift to N% minority ratio, Split[N]: splitting into N sub-clusters, +: combination of factors


Real-world imbalanced data streams

The final set of experiments involved four real-world imbalanced streams (Table 7). Although synthetic streams are better suited for performing controlled experiments and analyzing the impact of particular difficulty factors, real-world datasets are also useful to verify which of these difficulties and classifier reactions can be expected in practical applications. As there are no benchmark imbalanced data streams, we looked for datasets which at least partly contained the considered data difficulty factors. Finally, we have selected four binary classification datasets, which cover a spectrum of applications involving tweets (Twitter [53]), reviews (Tripadvisor [74]), product descriptions (Amazon [5]), and credit scoring data (PAKDD [72]).

Table 7 Real-world datasets’ characteristics

The occurrence of minority example types and the imbalance ratio were estimated in successive blocks of examples in the streams. Types of minority examples were estimated with the method proposed in [56], which analyzes the labels of the neighbors of examples (see the method's brief description in Sect. 3.2.4). We have also estimated the number of sub-clusters and their positions in the attribute space over time. To achieve this, we clustered minority class examples in each successive block using the affinity propagation algorithm (parameterized as proposed in [22]) and removed clusters smaller than six examples. Then, we visualized the relative positions of cluster exemplars using PCA precomputed on the entire dataset and applied to minority cluster representatives in each block.
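The neighborhood-based method of [56] assigns a type to each minority example according to the class make-up of its k = 5 nearest neighbours (roughly: 5-4 same-class neighbours give safe, 3-2 borderline, 1 rare, 0 outlier). A simplified sketch of this labelling with scikit-learn (the original analysis of rare cases is somewhat more involved, so treat this as an approximation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def minority_example_types(X, y, minority=1, k=5):
    """Label minority examples safe/borderline/rare/outlier based on the
    number of same-class examples among their k nearest neighbours."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[y == minority])
    # column 0 is the example itself; count same-class neighbours
    same = (y[idx[:, 1:]] == minority).sum(axis=1)
    bins = {5: "safe", 4: "safe", 3: "borderline", 2: "borderline",
            1: "rare", 0: "outlier"}
    return np.array([bins[int(s)] for s in same])
```

In the paper this analysis is run per block of the stream, so the resulting type proportions can themselves be tracked over time.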

As Table 7 shows, the selected streams all showcase multiple data difficulties. Interestingly, all datasets contain all types of minority examples, with a relatively high number of outliers. More precisely, Amazon, PAKDD, and Twitter have very low proportions of safe examples, with minority examples mostly attributed to the rare and outlier example types. Tripadvisor has a higher proportion of safe and borderline examples. Importantly, an analysis of the minority class over subsequent blocks of examples has shown that the global imbalance ratio, example-type proportions, and minority class composition vary over the course of the streams. This is reflected by the minimum and maximum ratios shown in Table 7, and by the plots presented in Fig. 16c–f (see footnote 4). Although these real-world streams are partly similar to the scenarios with multiple data factors discussed in the previous section, they should be considered even more difficult, due to the very low number of safe examples. Furthermore, the relatively high number of discovered clusters may indicate a strong decomposition of classes, which also changes over the streams (Table 7) and constitutes an additional difficulty for learning classifiers.

Fig. 16

PAKDD a G-mean, b Recall, c minority ratio, d example-type proportions, e sub-cluster count, f PCA-visualized sub-cluster positions with color representing time and point sizes representing cluster sizes (color figure online)

Figure 16 shows the G-mean and Recall of the analyzed classifiers on the PAKDD dataset. There is a very clear (\(\sim \) 0.500) difference in G-mean between the specialized (OOB, UOB, ESOS) and non-specialized (OB, VFDT) classifiers. However, the G-mean values are generally much lower than those seen on synthetic streams, even though the stream is not that highly imbalanced. We could hypothesize that the difficulty of classifying real-world imbalanced data does not lie in the global imbalance ratio alone. Indeed, the PAKDD dataset has a drifting global imbalance ratio (Fig. 16c), a very low proportion of safe minority examples (Fig. 16d), and minority sub-clusters that appear and disappear over time (Fig. 16e–f). It is also worth noting that ESOS performs particularly well compared to previous experiments on single factors, even in the presence of a drift visible around example 15k; ESOS is indeed the only classifier equipped with a drift detector. Interestingly, similar classifier reactions could be observed for the most complex synthetic streams (e.g., Fig. 14).

Plots for the remaining real-world datasets showed similar patterns. Due to the much lower number of safe examples, classifier G-mean values were generally lower than those observed for synthetic streams. Moreover, the gap between specialized and non-specialized classifiers was always substantial (Fig. 17). The remaining real-world datasets also suffered from drifting global class imbalance (Supplementary Fig. S6), changing example-type proportions (Supplementary Fig. S7), and changing minority sub-clusters (Supplementary Fig. S8). On the Tripadvisor dataset, the classifiers obtained higher measure values than on the remaining real-world streams. This is consistent with the fact that Tripadvisor contains a higher proportion of safe examples and is less imbalanced (Table 7). Amazon proved to be the most difficult stream. Poor classifier performance could be explained by a very unsafe distribution of the minority class types: it is the only dataset with nearly no safe examples and the highest number of rare examples and outliers. It is also the dataset with the highest class imbalance ratio, and the estimated minority sub-clusters show strong movement over time (Supplementary Fig. S8). As a result, VFDT and OB were unable to learn the classes from this relatively short stream. Furthermore, all the analyzed real-world streams were affected by more drifts than were considered in the synthetic streams. Finally, it is also interesting to notice that, even though ESOS needed more time to achieve its final performance level, it performed better than OOB and UOB (or came a close second) on all of the real streams.

Fig. 17

Classifier G-mean for the Tripadvisor, Amazon, and Twitter datasets

Table 8 presents averaged values of performance measures calculated over entire streams. For both G-mean and Recall, UOB ranked first, followed by ESOS. Due to the small number of datasets, the statistical significance of these differences cannot be evaluated. However, the difference between mean performances on each dataset (effect size) is large when one compares the specialized (OOB, UOB, ESOS) and non-specialized (OB, VFDT) classifiers.

Table 8 Average classifier performance values on the real-world data streams


The main goal of this paper was to highlight the importance of difficulty factors other than the global imbalance ratio in drifting imbalanced data streams. To this end, we have proposed a new categorization of concept drifts for imbalanced streams, which includes the criteria of imbalance ratio, locality, class composition, and minority example distribution. To the best of our knowledge, the three latter criteria were not previously adopted to systematically characterize concept drifts. Moreover, we have developed and made publicly available a data stream generator that follows the proposed drift categorization. Finally, using 385 generated synthetic streams and four real-world datasets, we have carried out a comprehensive set of experiments to study the impact of drifts and difficulty factors on the performance of online stream classifiers.

The obtained experimental results led to the following main observations:

  • Specialized imbalanced stream classifiers (OOB, UOB, ESOS-ELM) coped well with global class imbalance (except the highest ones), whereas non-specialized classifiers (OB, VFDT) performed worse.

  • Among the single data difficulty factors, complex distributions of minority examples in balanced data streams were more challenging than static imbalanced data streams. In particular, experiments with rare-type examples were the only scenarios from which the classifiers did not recover in any way. Interestingly, all classifiers were fairly robust to larger proportions of borderline examples in balanced data streams.

  • High imbalance ratios amplified the detrimental effect of other difficulty factors.

  • All classifiers were able to successfully benefit from temporary periods of less extreme class imbalance or less challenging data difficulty factors, improving overall predictive performance compared to cases where high levels of class imbalance or challenging data difficulty factors were present from the beginning.

  • Rare examples were the most influential stream characteristic when combined with other factors. Out of the analyzed classifiers, UOB was the best choice for complex streams with multiple difficulty factors.

  • Real-world datasets combine multiple data difficulty factors and are more difficult than synthetic streams. In particular, the proportion of safe minority examples in the analyzed datasets was very low, and the minority ratio and class composition drifted quite frequently over time. Real-world streams distinguished classifiers similarly to complex synthetic scenarios with multiple elements in streams, albeit with a larger performance gap between specialized and non-specialized classifiers. UOB and ESOS would be the classifiers of choice for real-world datasets.

Future research for imbalanced data stream classifiers should focus on other difficulty factors than just global class imbalance. None of the analyzed classifiers were able to cope with rare examples and outliers combined with minority class splits. This calls for new online methods that take into account the composition and the types of examples of the minority class. The need for research in this area is further emphasized by the fact that real-world data streams are likely to involve more complex scenarios, as demonstrated in this study.

The methodological similarity between the proposed class composition criterion and stream cluster transition types [66, 67] suggests that stream clustering methods could be instrumental in tracking minority class evolution. The characteristics of different minority example types could be taken into account by differentiating resampling methods based on whether examples are safe, borderline, rare, or outlying. This could be achieved by separate minority example memories, similar to the minority class buffers known from block-based classifiers [14, 15]. We also note that current drift detectors monitor only classifier performance, whereas they could potentially track minority class composition and changes in example-type proportions. Finally, another challenge is to generalize the proposed categorization and design classifiers for multiple imbalanced classes.


  1. Supplementary materials at:

  2. Trees were induced with the moa.classifiers.trees.HoeffdingTree class from MOA.

  3. The new random positions were generated in such a way as to ensure that they would not result in the sub-clusters overlapping with each other—see details in the description of the generator in the supplement.

  4. Similar plots for other real-world data streams are provided in Section E of the supplement.


  1. 1.

    Ancy S, Paulraj D (2020) Handling imbalanced data with concept drift by applying dynamic sampling and ensemble classification model. Comput Commun 153:553–560

    Google Scholar 

  2. 2.

    Bifet A, Holmes G, Kirkby R, Pfahringer B (2010) MOA: massive online analysis. J Mach Learn Res 11:1601–1604

    Google Scholar 

  3. 3.

    Błaszczyński J, Stefanowski J (2015) Neighbourhood sampling in bagging for imbalanced data. Neurocomputing 150 A:184–203

  4. 4.

    Błaszczyński J, Stefanowski J (2018) Local data characteristics in learning classifiers from imbalanced data. In: Kacprzyk J, Rutkowski L, Gaweda A, Yen G (eds) Advances in data analysis with computational intelligence methods. Studies in computational intelligence, Springer, New York, pp 51–85

    Google Scholar 

  5. 5.

    Blitzer J, Dredze M, Pereira F (2007) Biographies, bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. In: Proceedings of the 45th annual meeting of the association for computational linguistics, June 23–30 (2007) ACL. Czech Republic, Prague

    Google Scholar 

  6. 6.

    Branco P, Torgo L, Ribeiro R (2016) A survey of predictive modeling under imbalanced distributions. ACM Comput Surv 49(2):31

    Google Scholar 

  7. 7.

    Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140

    MATH  Google Scholar 

  8. 8.

    Brzezinski D, Stefanowski J (2014) Reacting to different types of concept drift: the accuracy updated ensemble algorithm. IEEE Trans Neural Netw Learn Syst 25(1):81–94

    Google Scholar 

  9. 9.

    Brzezinski D, Stefanowski J (2017) Prequential auc: properties of the area under the roc curve for data streams with concept drift. Knowl Inf Syst 52(2):531–562

    Google Scholar 

  10. 10.

    Brzezinski D, Stefanowski J (2018) Ensemble classifiers for imbalanced and evolving data streams. World Scientific, Singapore, pp 44–68.

  11. 11.

    Brzezinski D, Stefanowski J, Susmaga R, Szczech I (2018) Visual-based analysis of classification measures and their properties for class imbalanced problems. Inf Sci 462:242–261

    MathSciNet  Google Scholar 

  12. 12.

    Brzezinski D, Stefanowski J, Susmaga R, Szczech I (2019) On the dynamics of classification measures for imbalanced and streaming data. IEEE Trans Neural Netw Learn Syst 31(8):2868–2878

  13. 13.

    Cabral G, Minku L, Shihab E, Mujahid S (2019) Class imbalance evolution and verification latency in just-in-time software defect prediction. In: Proceedings of the international conference on software engineering (ICSE)

  14. 14.

    Chen S, He H (2009) Sera: selectively recursive approach towards nonstationary imbalanced stream data mining. In: Proceedings of the 2009 international joint conference on neural networks, pp 522–529

  15. 15.

    Chen S, He H (2011) Towards incremental learning of nonstationary imbalanced data stream: a multiple selectively recursive approach. Evol Syst 2:35–50

  16. 16.

    Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

    MathSciNet  MATH  Google Scholar 

  17. 17.

    Ditzler G, Polikar R (2013) Incremental learning of concept drift from streaming imbalanced data. IEEE Trans Knowl Data Eng 25(10):2283–2301

    Google Scholar 

  18. 18.

    Ditzler G, Roveri M, Alippi C, Polikar R (2015) Learning in nonstationary environments: a survey. IEEE Comp Int Mag 10(4):12–25

    Google Scholar 

  19. 19.

    Domingos P, Hulten G (2000) Mining high-speed data streams. In: Proceedings of the 6th ACM sigkdd international conference on knowledge discovery and data mining, pp 71–80

  20. 20.

    Fernández A, García S, Galar M, Prati RC, Krawczyk B, Herrera F (2018a) Learning from imbalanced data sets. Springer, New York

    Google Scholar 

  21. 21.

    Fernández A, García S, Herrera F, Chawla NV (2018b) SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J Artif Intell Res 61:863–905

    MathSciNet  MATH  Google Scholar 

  22. 22.

    Frey BJ, Dueck D (2007) Clustering by passing messages between data points. Science 315(5814):972–976

    MathSciNet  MATH  Google Scholar 

  23. 23.

    Gama J (2010) Knowledge discovery from data streams. Chapman and Hall, London

    MATH  Google Scholar 

  24. 24.

    Gama J, Castillo G (2006) Learning with local drift detection. In: International conference on advanced data mining and applications, pp 42–55

  25. 25.

    Gama J, Sebastião R, Rodrigues PP (2013) On evaluating stream learning algorithms. Mach Learn 90(3):317–346

    MathSciNet  MATH  Google Scholar 

  26. 26.

    Gama J, Zliobaite I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv 46(4):44:1-44:37

  27. 27.

    Gao J, Fan W, Han J, Yu PS (2007) A general framework for mining concept-drifting data streams with skewed distributions. In: Proceedings of the 2007 SIAM international conference on data mining, pp 3–14

  28. 28.

    Gao J, Ding B, Han J, Fan W, Yu PS (2008) Classifying data streams with skewed class distributions and concept drifts. IEEE Internet Comput 12(6):37–49

    Google Scholar 

  29. 29.

    Garcia V, Sanchez J, Mollineda R (2007) An empirical study of the behaviour of classifiers on imbalanced and overlapped data sets. In: Proceeding of progress in pattern recognition, image analysis and applications, LNCS, vol 4756. Springer, pp 397–406

  30. 30.

    Ghazikhani A, Monsefi R, Yazdi H (2013) Recursive least square perceptron model for non-stationary and imbalanced data stream classification. Evol Syst 4(2):119–131

    Google Scholar 

  31. 31.

    Ghazikhani A, Monsefi R, Yazdi H (2014) Online neural network model for non-stationary and imbalanced data stream classification. Int J Mach Learn Cybern 5(1):51–62

    Google Scholar 

  32. 32.

    Goldenberg I, Webb G (2019) Survey of distance measures for quantifying concept drift and shift in numeric data. Knowl Inf Syst 60:591–615

    Google Scholar 

  33. 33.

    Gomes H, Barddal J, Enembreck F, Bifet A (2017) A survey on ensemble learning for data stream classification. ACM Comput Surv 50(2):23:1–36

  34. 34.

    He H, Ma Y (eds) (2013) Imbalanced learning: foundations, algorithms, and applications. Wiley, New York

    MATH  Google Scholar 

  35. 35.

    Hoens T, Chawla V (2013) Learning in non-stationary environments with class imbalance. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 168–176

  36. 36.

    Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal 6(5):429–449

    MATH  Google Scholar 

  37. 37.

    Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. SIGKDD Explor 6(1):40–49

    Google Scholar 

  38. 38.

    Khamassi I, Sayed-Mouchaweh M, Hammami M, Ghédira K (2018) Discussion and review on evolving data streams and concept drift adapting. Evol Syst 9(1):1–23

    Google Scholar 

  39. 39.

    Krawczyk B, Skryjomski P (2017) Cost-sensitive perceptron decision trees for imbalanced drifting data streams. In: Machine learning and knowledge discovery in databases: European conference, ECML PKDD 2017, Skopje, Macedonia, September 18–22, 2017, Proceedings, Part II, pp 512–527

  40. 40.

    Krawczyk B, Minku L, Gama J, Stefanowski J, Woźniak M (2017) Ensemble learning for data stream analysis: a survey. Inf Fusion 37:132–156

    Google Scholar 

  41. 41.

    Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-side selection. In: Proceedings of the 14th international conference on machine learning ICML-97, pp 179–186

  42. 42.

    Lango M (2019) Tackling the problem of class imbalance in multi-class sentiment classification: an experimental study. Found Comput Decis Sci 44(2):151–178

    Google Scholar 

  43. 43.

    Laurikkala J (2001) Improving identification of difficult small classes by balancing class distribution. Technical report A-2001-2, University of Tampere,

  44. 44.

    Levin D, Peres Y, Wilmer E (2008) Markov chains and mixing times, 2nd edn. American Mathematical Society, Rhode Island

  45. 45.

    Lichtenwalter R, Chawla N (2010) Adaptive methods for classification in arbitrarily imbalanced and drifting data streams. In: New frontiers in applied data mining. Lecture notes in computer science, vol 5669, pp 53–75.

  46. Lopez V, Fernandez A, Garcia S, Palade V, Herrera F (2014) An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf Sci 257:113–141

  47. Lu Y, Cheung YM, Tang Y (2017) Dynamic weighted majority for incremental learning of imbalanced data streams with concept drift. In: International joint conference on artificial intelligence, pp 53–75

  48. Lyon RJ, Brooke JM, Knowles JD, Stappers BW (2014) Hellinger distance trees for imbalanced streams. CoRR arXiv:1405.2278

  49. Minku L, Yao X (2012) DDD: a new ensemble approach for dealing with concept drift. IEEE Trans Knowl Data Eng 24(4):619–633

  50. Minku L, White A, Yao X (2010) The impact of diversity on on-line ensemble learning in the presence of concept drift. IEEE Trans Knowl Data Eng 22:730–742

  51. Minku LL (2019) Transfer learning in non-stationary environments. In: Sayed-Mouchaweh M (ed) Learning from data streams in evolving environments: methods and applications. Springer, Cham, pp 13–37

  52. Mirza B, Lin Z, Liu N (2015) Ensemble of subset online sequential extreme learning machine for class imbalance and concept drift. Neurocomputing 149:316–329

  53. Nakov P, Ritter A, Rosenthal S, Sebastiani F, Stoyanov V (2016) SemEval-2016 task 4: sentiment analysis in Twitter. In: Proceedings of the 10th international workshop on semantic evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, June 16–17, 2016, pp 1–18

  54. Napierała K, Stefanowski J (2012a) BRACID: a comprehensive approach to learning rules from imbalanced data. J Intell Inf Syst 39:335–373

  55. Napierała K, Stefanowski J (2012b) The influence of minority class distribution on learning from imbalanced data. In: Proceedings of the 7th international conference on HAIS 2012, pp 139–150

  56. Napierała K, Stefanowski J (2016) Types of minority class examples and their influence on learning classifiers from imbalanced data. J Intell Inf Syst 46(3):563–597

  57. Napierała K, Stefanowski J, Wilk S (2010) Learning from imbalanced data in presence of noisy and borderline examples. In: Proceedings of the 7th international conference RSCTC 2010, LNAI, vol 6086. Springer, pp 158–167

  58. Nickerson A, Japkowicz N, Milios E (2001) Using unsupervised learning to guide re-sampling in imbalanced data sets. In: Proceedings of the eighth international workshop on AI and statistics

  59. Olaitan OM, Viktor HL (2018) SCUT-DS: learning from multi-class imbalanced Canadian weather data. In: Foundations of intelligent systems—24th international symposium, ISMIS 2018, Limassol, Cyprus, October 29–31, 2018, Proceedings, pp 291–301

  60. Oza NC, Russell S (2001a) Online bagging and boosting. In: Jaakkola T, Richardson T (eds) Eighth international workshop on artificial intelligence and statistics. Morgan Kaufmann, Key West, Florida, pp 105–112

  61. Oza NC, Russell SJ (2001b) Experimental comparisons of online and batch versions of bagging and boosting. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, San Francisco, CA, USA, August 26–29, 2001, pp 359–364

  62. Prati R, Batista G, Monard M (2004) Class imbalance versus class overlapping: an analysis of a learning system behavior. In: Proceedings of the 3rd Mexican international conference on artificial intelligence, pp 312–321

  63. Ren S, Liao B, Zhu W, Li Z, Liu W, Li K (2018) The gradual resampling ensemble for mining imbalanced data streams with concept drift. Neurocomputing 286:150–166

  64. Sarnelle J, Sanchez A, Capo R, Haas J, Polikar R (2015) Quantifying the limited and gradual concept drift assumption. In: Proceedings of the 2015 international joint conference on neural networks (IJCNN)

  65. Sousa MR, Gama J, Brandao E (2016) A new dynamic modeling framework for credit risk assessment. Expert Syst Appl 45:341–351

  66. Spiliopoulou M, Ntoutsi E, Theodoridis Y, Schult R (2006) MONIC: modeling and monitoring cluster transitions. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 706–711

  67. Spiliopoulou M, Ntoutsi E, Theodoridis Y, Schult R (2013) MONIC and followups on modeling and monitoring cluster transitions. In: European conference on machine learning and principles and practice of knowledge discovery in databases (ECML/PKDD), pp 622–626

  68. Stefanowski J (2013) Overlapping, rare examples and class decomposition in learning classifiers from imbalanced data. In: Ramanna S, Jain LC, Howlett RJ (eds) Emerging paradigms in machine learning, vol 13. Springer, New York, pp 277–306

  69. Stefanowski J (2016) Dealing with data difficulty factors while learning from imbalanced data. In: Matwin S, Mielniczuk J (eds) Challenges in computational statistics and data mining. Springer, New York, pp 333–363

  70. Street WN, Kim Y (2001) A streaming ensemble algorithm (SEA) for large-scale classification. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining, pp 377–382

  71. Sun Y, Tang K, Minku L, Wang S, Yao X (2016) Online ensemble learning of data streams with gradually evolved classes. IEEE Trans Knowl Data Eng 28(6):1532–1545

  72. Theeramunkong T, Kijsirikul B, Cercone N, Ho TB (2009) PAKDD data mining competition

  73. Toffoli T, Margolus N (1987) Cellular automata machines: a new environment for modeling. MIT Press, Cambridge

  74. Wang H, Lu Y, Zhai C (2010) Latent aspect rating analysis on review text data: a rating regression approach. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, Washington, DC, USA, July 25–28, 2010, pp 783–792

  75. Wang S, Minku L, Yao X (2013) Concept drift detection for online class imbalance learning. In: Proceedings of the 2013 international joint conference on neural networks (IJCNN’13), pp 1–8

  76. Wang S, Minku LL, Yao X (2015) Resampling-based ensemble methods for online class imbalance learning. IEEE Trans Knowl Data Eng 27(5):1356–1368

  77. Wang S, Minku L, Yao X (2016) Dealing with multiple classes in online class imbalance learning. In: Proceedings of the twenty-fifth international joint conference on artificial intelligence (IJCAI-16), pp 2118–2124

  78. Wang S, Minku LL, Yao X (2018) A systematic study of online class imbalance learning with concept drift. IEEE Trans Neural Netw Learn Syst 29(10):4802–4821

  79. Webb GI, Hyde R, Cao H, Nguyen H, Petitjean F (2016) Characterizing concept drift. Data Min Knowl Discov 30(4):964–994

  80. Webb GI, Lee LK, Goethals B, Petitjean F (2018) Analyzing concept drift and shift from sample data. Data Min Knowl Discov 32(5):1179–1199

  81. Weiss GM (2010) The impact of small disjuncts on classifier learning. In: Stahlbock R, Crone S, Lessmann S (eds) Data mining. Annals of information systems, vol 8. Springer, Boston, pp 193–226

  82. Wu K, Edwards A, Fan W, Gao J, Zhang K (2014) Classifying imbalanced data streams via dynamic feature group weighting with importance sampling. In: Proceedings of the 2014 SIAM international conference on data mining, pp 722–730

  83. Zhang H, Liu W, Wang S, Shan J, Liu Q (2019) Resample-based ensemble framework for drifting imbalanced data streams. IEEE Access 7:65103–65115

  84. Zliobaite I (2014) Controlled permutations for testing adaptive learning models. Knowl Inf Syst 39:565–578

  85. Zliobaite I, Budka M, Stahl F (2015a) Towards cost-sensitive adaptation: when is it worth updating your predictive model? Neurocomputing 150:240–249

  86. Zliobaite I, Pechenizkiy M, Gama J (2015b) An overview of concept drift applications. In: Japkowicz N, Stefanowski J (eds) Big data analysis: new algorithms for a new society. Springer, New York

Acknowledgements


This work was partly supported by PUT Institute of Computing Science Statutory Funds. Leandro Minku was funded by EPSRC Grant Nos. EP/R006660/1 and EP/R006660/2. Moreover, the research of J. Stefanowski was partially supported by TAILOR, a project funded by the EU Horizon 2020 research and innovation program under GA No 952215.

Author information



Corresponding author

Correspondence to Dariusz Brzezinski.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 719 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit


About this article


Cite this article

Brzezinski, D., Minku, L.L., Pewinski, T. et al. The impact of data difficulty factors on classification of imbalanced and concept drifting data streams. Knowl Inf Syst 63, 1429–1469 (2021).

Keywords


  • Class imbalance
  • Concept drift
  • Data difficulty factors
  • Drift categorization
  • Stream classification