1 Introduction

In recent years, a sharp rise in the number of products and systems that use machine learning to enhance their performance has been observed. Many applications, such as predicting user behavior on social platforms like Twitter or client activity in online stores, fall into the category of imbalanced data stream classification [24]. When designing methods for data stream classification, one has to take the characteristics of a data stream into account: the data arrive sequentially, the order of the arriving samples cannot be controlled, and the stream is potentially infinite. Consequently, it is impossible to process the incoming data in multiple passes, and each sample can be processed only once [26]. Furthermore, one has to consider the rapid rate at which the data arrive, while ensuring that the stream is processed in a timely fashion so that the delay introduced by the algorithm is minimal. Data streams can also exhibit a change in data and target concepts over time (so-called non-stationary data streams) [16, 26]. This phenomenon is called concept drift [12] and is quite common, e.g., the change of popular topics on Twitter. Due to concept drift, the performance of a classifier can degrade over time, so the classifier has to be trained incrementally to accommodate the changing concepts of non-stationary data streams.

Moreover, the proportion between classes is often skewed, with one class being over-represented. When such imbalance is present, traditional accuracy-driven methods are not applicable, especially when misclassification of minority class examples is much more costly, as is often the case, e.g., in fraud detection [24]. It is worth mentioning that the imbalance ratio (IR) is not the only factor that can influence the performance of a classifier. Some examples can be easy to classify even when the IR is high: if the classes are well separated, the decision boundary can be determined with ease. However, it has been observed that instances of the minority class tend to form small clusters sparsely spread throughout the feature space, often surrounded by majority class examples [4]. The presence of noise and outliers is another difficulty factor that needs to be addressed; the authors of [3, 15] developed preprocessing methods with these issues in mind.
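As a toy illustration of why plain accuracy misleads on skewed data (the stream size and 1:99 class ratio below are assumed purely for the example), a classifier that always predicts the majority class can look nearly perfect while never detecting a single minority instance:

```python
# Toy illustration: accuracy rewards a classifier that ignores the
# minority class entirely, while F-score exposes the failure.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1] * 10 + [0] * 990   # 1% minority class (e.g. fraud cases)
y_pred = [0] * 1000             # "classifier" always predicting majority

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks excellent
print(f1_score(y_true, y_pred))        # 0.0  -- no minority case detected
```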

Data streams may be processed either in blocks or one instance at a time. One of the most important issues in learning from a data stream is when to update the classifier [22]. Most researchers distinguish between two approaches: active and passive. In the former, the update is performed only if drift is detected, while the latter updates the classifier continuously, regardless of whether drift was detected or not [9]. In order to satisfy the time and memory requirements, a forgetting or data management mechanism must be used. One of the most popular approaches to forgetting is the use of sliding windows, which can be either sequence-based, where the size of the window is defined by a number of instances, or timestamp-based, where the window is defined by a certain time duration. In the simplest case, sliding windows are of fixed size and include only the most recent examples; the oldest samples in a window are discarded in favor of new ones. Some methods implement sliding windows of varying size, depending on the response from drift detectors [2].
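A minimal sketch of a fixed-size, sequence-based sliding window, assuming samples arrive one at a time (the window capacity is an illustrative choice, not a value from this paper):

```python
# Fixed-size, sequence-based sliding window: a deque capped at
# `window_size` keeps the most recent samples and silently drops the
# oldest one whenever a new sample arrives.
from collections import deque

window_size = 500                  # illustrative capacity, tuned per stream
window = deque(maxlen=window_size)

def receive(sample):
    window.append(sample)          # oldest sample discarded automatically
    # the classifier can then be updated on the current window contents
```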

The main contributions of this work are as follows:

  • The proposal of two novel imbalanced data stream classifiers (DSC-R and DSC-S), which employ under- and oversampling techniques for balancing data.

  • An experimental evaluation of the proposed algorithms and their comparison with state-of-the-art methods.

The article is organized as follows. Sections 1 and 2 present a brief introduction to the problem of imbalanced data stream classification and a quick overview of the state-of-the-art algorithms dedicated to it. Section 3 offers an in-depth explanation of the proposed solution. Section 4 showcases the results of the computer experiments, comparing the proposed algorithms to other techniques for imbalanced data classification and demonstrating the usefulness of the developed methods. Section 5 presents the conclusions and describes possible future improvements to the proposed method.

2 Related Works

Studies over the years have presented many algorithms dedicated to data stream analysis. The Very Fast Decision Tree (VFDT) proposed by Domingos and Hulten [13] was among the first methods for stream analysis and, to this day, serves as a basis for many modifications. VFDT utilizes the Hoeffding bound to calculate the number of examples needed to select a split node (a sketch of the bound is given at the end of this section). The algorithm incrementally builds a tree from a data stream; once examples have been used to update the tree, they are no longer needed and can be removed. The aforementioned modifications include ideas such as pruning mechanisms or the use of sliding windows and drift detectors to improve the algorithms on non-stationary streams [10].

Several methods using ensembles of classifiers are also worth noting. The Weighted Majority Algorithm [18] adjusts the weights of the classifiers in the ensemble so that the weight of an expert that misclassifies an instance is decreased according to a user-specified value. A modification of this method, called Dynamic Weighted Majority (DWM), with an added procedure that introduces new classifiers to the ensemble when the overall performance is unsatisfactory, was presented in [14]. In the Accuracy Weighted Ensemble (AWE), a new classifier is added only if the ensemble's size limit is not exceeded [25], while in Learn++.NSE [8] such a constraint is not applied. In Learn++.CDS, Ditzler and Polikar combined their previous work, Learn++.NSE, with SMOTE sampling in order to better address data imbalance, later replacing SMOTE with an original bagging-based method of data balancing [7]. In SEA [23], a new candidate classifier is evaluated to determine whether it is worth including in the ensemble at the cost of replacing another classifier already in it.

Other approaches, such as OUSEnsemble (Over Under Sampling Ensemble) [11], make use of sampling techniques. The stream is divided into blocks consisting of examples from both the majority and the minority class. The idea is to propagate all the minority class instances from the previous block and under-sample the majority examples in the current block so that the desired imbalance ratio is achieved. Afterwards, the resultant subset is used to create the datasets on which the component classifiers of the ensemble are built: all minority class instances are propagated to each dataset, while each majority class example is propagated to only one of them. The Selectively Recursive Approach (SERA) [5], proposed by Chen and He, uses the Mahalanobis distance to determine which minority class examples from the previous block are most similar to the minority examples in the current block; based on that, a limited number of minority class examples is selected and added to the majority class examples in the current block. Chen and He later designed the Recursive Ensemble Approach (REA) [6], in which minority class examples from the previous block that are nearest neighbors of minority class examples from the current block are added in order to balance the given training block. Both REA and SERA proved to make more accurate predictions than the method proposed in [19]. A chunk-based ensemble approach proposed by Wang et al., called KMeanClustering [25], utilizes k-means clustering to under-sample the majority class, using the centroids created in the clustering process to resample the majority instances.
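For reference, the following is a sketch of the Hoeffding bound that VFDT relies on, under the usual assumptions: R is the range of the split criterion (e.g., log2 of the number of classes for information gain) and delta is the acceptable probability of choosing a wrong split attribute.

```python
# Hoeffding bound: with probability 1 - delta, the true mean of a random
# variable with range R lies within epsilon of the mean observed over n
# examples. VFDT splits a node once the observed advantage of the best
# split attribute over the runner-up exceeds epsilon.
import math

def hoeffding_bound(R: float, delta: float, n: int) -> float:
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

# e.g. if best_gain - second_best_gain > hoeffding_bound(R, 1e-7, n),
# the node is split on the best attribute.
```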

3 The Deterministic Sampling Classifier

The proposed method for data stream classification, called the Deterministic Sampling Classifier (DSC), processes the incoming data in chunks. Each chunk is used in two operations. Firstly, the instances of the majority class present in the currently processed block are under-sampled in order to produce a balanced class representation in the data chunk (Fig. 1).

Fig. 1. Proposed method flow diagram

The resulting data (referred to in Fig. 1 as NEW STORED DATA) is then stored in a memory buffer (DATA STORAGE). Secondly, that same block of data is combined with part of the data from the buffer, called OLD STORED DATA, using GET NEW CHUNK, which copies the data from the currently processed block, and the GET DATA method, which takes OLD STORED DATA from the DATA STORAGE buffer. OLD STORED DATA consists of all the previously accumulated under-sampled chunks. When a new chunk of data arrives, the data from NEW STORED DATA is moved to the OLD STORED DATA part of the buffer. The DATA STORAGE is of fixed size; when the buffer is full, the oldest examples are removed from it. Afterward, the imbalance ratio of the data block created as a result of GET NEW CHUNK and GET DATA is calculated, and if the value is lower than 0.45, the minority class is oversampled before a classification model is trained. Otherwise, the algorithm accepts the chunk as properly balanced and uses it to train the model right away. The implementation allows the user to choose the sampling algorithms. In this paper, the authors created two versions of the method: DSC-S (Deterministic Sampling Classifier-SMOTE) and DSC-R (Deterministic Sampling Classifier-Random). For the DSC-R method, the chosen sampling methods were random over- and under-sampling; for DSC-S, SMOTE and NearMiss (implementations from the imbalanced-learn library [17]) were used for over- and under-sampling, respectively.
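The following is a minimal sketch of the DSC-R processing loop under stated assumptions: `chunks` is a placeholder iterator yielding (X, y) blocks with binary 0/1 labels, the buffer is a simple FIFO of under-sampled chunks capped at an illustrative size, and a fresh k-NN model is fitted per chunk. The actual implementation may differ in these details.

```python
# Sketch of the DSC-R chunk-processing loop: train on the new chunk
# combined with OLD STORED DATA, oversampling if the imbalance ratio is
# below 0.45, then under-sample the chunk into the DATA STORAGE buffer.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

buffer_X, buffer_y = [], []       # DATA STORAGE (FIFO of stored chunks)
buffer_capacity = 2000            # illustrative fixed buffer size

for X, y in chunks:               # `chunks` yields (features, labels) blocks
    # GET NEW CHUNK + GET DATA: current block plus all OLD STORED DATA
    X_train = np.vstack([X] + buffer_X)
    y_train = np.hstack([y] + buffer_y)

    # oversample the minority class if the block is still too skewed
    counts = np.bincount(y_train.astype(int))
    if counts.min() / counts.max() < 0.45:
        X_train, y_train = RandomOverSampler().fit_resample(X_train, y_train)

    model = KNeighborsClassifier().fit(X_train, y_train)

    # NEW STORED DATA: under-sample the current chunk into the buffer
    X_new, y_new = RandomUnderSampler().fit_resample(X, y)
    buffer_X.append(X_new)
    buffer_y.append(y_new)
    while sum(len(b) for b in buffer_y) > buffer_capacity:
        buffer_X.pop(0)           # drop the oldest stored chunk first
        buffer_y.pop(0)
```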

4 Experimental Evaluation

The quality of the proposed algorithms was evaluated on the basis of computer experiments, using 26 real and 60 synthetic data streams. The predictive performance of the data stream classifiers was assessed by interleaving testing with training (test-then-train) [16]: each block is first used to test the classifier and afterward used for training. For comparison, the following methods were used: OUSEnsemble, KMeanClustering, REA, Learn++.CDS, Learn++.NIE and MLPClassifier (Multi-layer Perceptron classifier), with k-NN as the base classifier. The algorithms were implemented in Python using the scikit-learn [20] and imbalanced-learn [17] libraries. The selected real streams were downloaded from the KEEL [1] and PROMISE Software Engineering [21] repositories. The chosen datasets consist of multidimensional binary classification problems with imbalance ratios ranging from 1 to 39; they are described in Table 1.

The results were analyzed using the KEEL software evaluation tool [1]. Non-parametric statistical tests were performed, namely the Friedman test and Nemenyi's post-hoc procedure. The chosen metrics were F-score, Gmean and AUC. Tables 2, 3 and 4 present the obtained results as the mean value of each metric, together with, placed directly below each score, the indices of the methods that performed statistically worse than the method named in the column. For instance, on the abalone-17-vs-7-8-9-10 dataset, the DSC-R algorithm performed statistically better than the 3rd, 5th, 7th and 8th algorithms in the table (read from left to right). The obtained results confirm the usefulness of the proposed algorithms. For the F-score, the proposed DSC-R and DSC-S algorithms, along with the REA algorithm, achieve the best results. Interestingly, the MLPC algorithm consistently performed the worst. For Gmean, the results are similar: the methods introduced in this paper perform favorably in comparison with the other algorithms, while Learn++.CDS, Learn++.NIE and REA have results comparable to DSC-R and DSC-S. Lastly, the results in Table 4, for the AUC score, indicate that the proposed algorithms obtained satisfactory results, with only the Learn++.CDS algorithm performing marginally better. It is worth mentioning that the created methods are robust enough that the imbalance ratio (whether low or high) does not negatively impact their performance.
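A sketch of the test-then-train loop used in the evaluation, with assumed placeholders: `stream_chunks` stands for the block iterator and `clf` for any of the compared classifiers; refitting the model on each chunk is a simplification for illustration. Gmean is computed with imbalanced-learn's geometric_mean_score.

```python
# Test-then-train: every chunk is first used to score the current model
# and only afterwards used to update (here: refit) it.
from sklearn.metrics import f1_score, roc_auc_score
from imblearn.metrics import geometric_mean_score

results = []
fitted = False
for X, y in stream_chunks:        # placeholder iterator over data blocks
    if fitted:                    # test first ...
        y_pred = clf.predict(X)
        results.append({
            "f1": f1_score(y, y_pred),
            "gmean": geometric_mean_score(y, y_pred),
            "auc": roc_auc_score(y, y_pred),
        })
    clf.fit(X, y)                 # ... then train on the same chunk
    fitted = True
```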

Table 1. Overview of datasets selected for experimental evaluation (source: KEEL and PROMISE Software Engineering Repository).
Table 2. Overview of the results for F-score.
Table 3. Overview of the results for Gmean score.
Table 4. Overview of the results for AUC score.

5 Conclusions and Future Directions

The methods for imbalanced stream classification proposed in this paper, DSC-R and DSC-S, performed favorably in comparison with other dedicated algorithms. The predictive abilities of the techniques were evaluated on the basis of computer experiments. The obtained results were analyzed using statistical tests, and for all the chosen metrics (F-score, Gmean and AUC), the proposed methods obtained satisfactory results, comparable to algorithms such as REA, Learn++.CDS or Learn++.NIE. The algorithm utilizes a memory buffer in order to propagate the instances from previous blocks that were chosen as representatives. Since the buffer is of fixed size, once it is full some instances must be removed from it; in the current implementation, the oldest examples are deleted. A more advanced forgetting mechanism, which could favor instances from the minority class and keep only the best-representing instances of the majority class, could be introduced in order to further improve the performance of the classifier. Additionally, testing other under- and over-sampling methods may prove to produce better results.