1 Introduction

The analysis of huge volumes of fast-arriving data has recently become the focus of intense research, because such methods can give a company a competitive advantage. One useful approach is data stream classification, which is employed to solve problems related to discovering changes in client preferences, spam filtering, fraud detection, and medical diagnosis, to enumerate only a few.

However, most of the traditional classifier design methods do not take into consideration that:

  • The statistical dependencies between the observations of the given objects and their classifications may change over time.

  • Data can arrive so quickly that labeling all records is impossible.

This section focuses on the first problem, called concept drift [23], which comes in many forms depending on the type of change. The appearance of concept drift may cause a significant deterioration in the accuracy of the exploited classifier. Therefore, developing methods which can effectively deal with this phenomenon has become an increasingly important issue. In general, the following approaches may be considered to deal with the above problem.

  • Frequently rebuilding the model whenever new data becomes available. This is very expensive and impractical, especially when the concept drift occurs rapidly.

  • Detecting concept changes in new data and, if these changes are sufficiently significant, rebuilding the classifier.

  • Adopting an incremental learning algorithm for the classification model.

Let us first briefly characterize the probabilistic model of the classification task [4]. It assumes that the attributes \(x\in \mathcal {X}\subseteq \mathcal {R}^{d}\) and the class label \(j\in \mathcal {M}=\{1,2,...,M\}\) are the observed values of a pair of random variables \((\mathbf {X},\mathbf {J})\). Their distribution is characterized by the prior probability

$$\begin{aligned} p_{j}=P(\mathbf {J}=j),\; j\in \mathcal {M} \end{aligned}$$
(1)

and conditional probability density functionsFootnote 1

$$\begin{aligned} f_j(x)=f(x|j), \,x\in \mathcal {X}, \,j\in \mathcal {M}. \end{aligned}$$
(2)

The classifier \(\varPsi \) maps feature space \(\mathcal {X}\) to the set of the class labels \(\mathcal {M}\)

$$\begin{aligned} \varPsi \, :\, \mathcal {X} \rightarrow \mathcal {M}. \end{aligned}$$
(3)

The optimal Bayes classifier \(\varPsi ^{*}\) minimizes the probability of misclassification according to the following classification ruleFootnote 2:

$$\begin{aligned} \varPsi ^{*}(x)=i\,\, \text {if} \,\, p_i(x)=\max _{k\in \mathcal {M}}\,p_k(x), \end{aligned}$$
(4)

where \(p_i(x)\) stands for the posterior probability

$$\begin{aligned} p_i(x)=\frac{p_if_i(x)}{f(x)} \end{aligned}$$
(5)

and f(x) is the unconditional density function

$$\begin{aligned} f(x)=\sum _{k=1}^{M}p_k\;f_k(x) \end{aligned}$$
(6)
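To make the rule (4)–(6) concrete, the following sketch applies it to a hypothetical one-dimensional, two-class problem with Gaussian class-conditional densities; the class parameters and priors below are illustrative choices, not taken from the paper.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Class-conditional density f_j(x), here assumed Gaussian (illustrative)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_classify(x, priors, densities):
    """Optimal Bayes rule (4): pick the class maximising p_j * f_j(x).

    The common denominator f(x) from (5)-(6) does not change the argmax,
    so it is omitted.
    """
    scores = {j: priors[j] * f(x) for j, f in densities.items()}
    return max(scores, key=scores.get)

# Hypothetical two-class problem: class 1 ~ N(0,1), class 2 ~ N(3,1),
# equal priors p_1 = p_2 = 0.5.
priors = {1: 0.5, 2: 0.5}
densities = {1: lambda x: gaussian_pdf(x, 0.0, 1.0),
             2: lambda x: gaussian_pdf(x, 3.0, 1.0)}

print(bayes_classify(0.2, priors, densities))  # near the class-1 mean
print(bayes_classify(2.9, priors, densities))  # near the class-2 mean
```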

Returning to the concept drift problem, we may distinguish two types of drift according to their influence on the probabilistic characteristics of the classification task [5]:

  • virtual concept drift means that the changes do not affect the decision boundaries (some works report that they do not affect the posterior probabilities, but this is disputable), but they change the unconditional density functions [22].

  • real concept drift means that the changes do affect the decision boundaries, i.e., the posterior probabilities [19, 23].

Considering the classification task, real concept drift is the most important, but detecting the virtual one could be useful as well: if we are able not only to detect the drift but also to distinguish between virtual and real concept drift, then we may decide whether rebuilding the classifier’s model is necessary. Another taxonomy of concept drift is based on its impetuosity. We may distinguish:

  • Slow changes (gradual or incremental drift). Under gradual drift, examples from different models may appear in the stream concurrently for a given period of time, while under incremental drift the model’s parameters change smoothly.

  • Rapid changes (sudden drift, concept shift).

A plethora of solutions has been proposed to deal with this phenomenon. Basically, we may divide these algorithms into four main groups:

  1. Online learners [24]

  2. Instance based solutions (also called sliding window based solutions)

  3. Ensemble approaches [11, 13, 16]

  4. Drift detection algorithms

In this work we focus on drift detectors, which are responsible for determining whether two successive data chunks were generated by the same distribution [9]. If a change is detected, the decision to collect new data and rebuild the model can be made.

2 Concept Drift Detection

As mentioned above, the appearance of concept drift may cause a significant deterioration in the accuracy of the exploited classifier [15], as shown in Fig. 1.

Fig. 1.
figure 1

Exemplary deterioration of classifier accuracy for sudden drift (left) and incremental one (right).

A concept drift detector is an algorithm which, on the basis of incoming information about new examples and their correct classification, or about a classifier’s performance (such as accuracy), can signal that the data stream distributions are changing. Currently used detectors return either a signal that drift has been detected, which usually requires a prompt update of the classifier’s model, or a signal that the warning level has been reached, which is usually treated as the moment when a new learning set should be gathered (i.e., all new incoming examples should be labeled). The new learning set is used to update the model if drift is detected. The idea of drift detection is presented in Fig. 2.

Fig. 2.
figure 2

Idea of drift detection.

Drift detection could be treated as a simple classification task, but from a practical point of view detectors do not use the classical classification model. Detection is hard because, on the one hand, we require quick drift detection in order to replace an outdated model promptly and reduce the so-called restoration time. On the other hand, we do not accept false alarms [8], i.e., cases where the detector signals a change that did not occur. Therefore, the following metrics are usually used to measure the performance of concept drift detectors:

  • Number of correctly detected drifts.

  • Number of false alarms.

  • Time between real drift appearance and its detection.

In some works, such as [2], aggregated measures taking the above metrics into consideration are proposed, but we decided not to use them because aggregated measures do not allow us to precisely analyse the behavior of the considered detectors.
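As an illustration of these three metrics, the following sketch scores a detector run against known drift positions; the tolerance window, the matching scheme, and the function name are assumptions introduced here for illustration only.

```python
def evaluate_detector(true_drifts, detections, tolerance=250):
    """Score a detector run: a detection within `tolerance` examples after a
    true drift counts as correct; every remaining detection is a false alarm.
    `true_drifts` and `detections` are example indices in the stream.
    (Illustrative scoring; the paper does not fix a tolerance value.)
    """
    correct, delays = 0, []
    remaining = list(detections)
    for drift in true_drifts:
        hit = next((d for d in remaining if drift <= d <= drift + tolerance), None)
        if hit is not None:
            correct += 1
            delays.append(hit - drift)
            remaining.remove(hit)
    return {"correct": correct,
            "false_alarms": len(remaining),
            "mean_delay": sum(delays) / len(delays) if delays else None}

# One true drift at example 5000; the detector fired at 5120 and (spuriously) at 7300.
print(evaluate_detector(true_drifts=[5000], detections=[5120, 7300]))
```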

3 Description of the Chosen Drift Detectors

Let us briefly describe the selected methods of drift detection, which will later be used to produce the combined detectors.

3.1 DDM (Drift Detection Method)

This algorithm [6] analyzes changes in the probability distribution of successive examples for the different classes. On the basis of this analysis, DDM estimates the classifier error, which (assuming the convergence of the classifier training method) has to decrease as long as the probability distribution does not change [18]. If the classifier error increases with the number of training examples, this suggests a change of the probability distribution, and the current model should be rebuilt. DDM estimates the classification error and its standard deviation, and stores their minimal values. If the estimated error is greater than the stored minimal error plus twice its standard deviation, DDM signals that the warning level has been reached. If the estimated error is greater than the stored minimal error plus three times its standard deviation, DDM signals that drift has been detected.
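The warning/drift logic described above can be sketched as follows; this is a minimal illustration of the two-threshold rule, not the authors' implementation, and the warm-up parameter is an assumption.

```python
import math

class DDM:
    """Sketch of the Drift Detection Method described above: track the
    streaming error rate p_t and its std s_t = sqrt(p_t(1-p_t)/t), remember
    the values at the minimum of p_t + s_t, and signal on p_t + s_t crossing
    p_min + 2*s_min (warning) or p_min + 3*s_min (drift)."""

    def __init__(self, min_examples=30):
        self.min_examples = min_examples  # warm-up before signalling (assumed)
        self.reset()

    def reset(self):
        self.t = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, misclassified):
        """Feed one prediction outcome; return 'ok', 'warning' or 'drift'."""
        self.t += 1
        self.errors += int(misclassified)
        p = self.errors / self.t
        s = math.sqrt(p * (1 - p) / self.t)
        if self.t < self.min_examples:
            return "ok"
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s >= self.p_min + 3 * self.s_min:
            self.reset()  # the model should be rebuilt from here on
            return "drift"
        if p + s >= self.p_min + 2 * self.s_min:
            return "warning"
        return "ok"

# Stable phase (10% error), then a sudden accuracy collapse: drift is signalled.
detector = DDM()
stable = [detector.update(i % 10 == 0) for i in range(300)]
after = [detector.update(True) for _ in range(100)]
print("drift" in stable, "drift" in after)
```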

3.2 CUSUM (CUmulative SUM)

CUSUM [17] detects a change of a given parameter of a probability distribution and indicates when the change is significant. The expected value of the classification error, which may be estimated on the basis of the labels of incoming objects from the data stream, can be considered as the parameter. The detection condition is as follows

$$\begin{aligned} g_t=\max (0, g_{t-1}+\epsilon _t-\xi )>threshold \end{aligned}$$
(7)

where \(g_0=0\) and \(\epsilon _t\) stands for the observed value of the parameter at time t (e.g., the classification error mentioned above). \(\xi \) describes the rate of change. The value of the threshold is set by the user and is responsible for the detector’s sensitivity: a low value allows drift to be detected quickly, but may produce a fairly high number of false alarms [2, 5].
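A minimal sketch of this detection condition, assuming \(\epsilon _t\) is the 0/1 error of the classifier on example t; the parameter values mirror the experimental set-up in Sect. 5.2, and the restart-after-alarm behavior is an illustrative choice.

```python
def cusum_detector(errors, xi=0.005, threshold=50):
    """One-sided CUSUM from Eq. (7): g_t = max(0, g_{t-1} + eps_t - xi);
    signal a drift when g_t exceeds `threshold`, then restart accumulation.
    `errors` is the 0/1 stream of classification mistakes; xi plays the role
    of the rate parameter and threshold that of lambda in Sect. 5.2."""
    g, alarms = 0.0, []
    for t, eps in enumerate(errors):
        g = max(0.0, g + eps - xi)
        if g > threshold:
            alarms.append(t)  # 0-based index of the alarm in the stream
            g = 0.0
    return alarms

# Hypothetical stream: 1000 error-free examples, then constant errors.
stream = [0] * 1000 + [1] * 200
print(cusum_detector(stream))
```

Note how the low rate parameter \(\xi\) makes the statistic insensitive to isolated errors, while a sustained error burst pushes \(g_t\) over the threshold within a few dozen examples.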

3.3 The Page-Hinkley Test

It is a modification of the CUSUM algorithm in which the cumulative difference between the observed classifier error and its average is taken into consideration [20].
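A sketch of this idea, assuming the usual Page-Hinkley bookkeeping (cumulative deviation from the running mean, compared against its minimum); the parameter values mirror Sect. 5.2, and the restart-after-alarm step is an illustrative choice.

```python
def page_hinkley(errors, delta=0.005, lam=50, alpha=1.0):
    """Page-Hinkley sketch: accumulate the difference between each error
    observation and its running mean (minus a tolerance `delta`), keep the
    minimum of that cumulative sum, and alarm when the current sum rises
    more than `lam` above the minimum. `alpha` is a forgetting factor on
    the cumulative sum (alpha = 1 means no forgetting, as in Sect. 5.2)."""
    mean, m, m_min, alarms = 0.0, 0.0, float("inf"), []
    for t, eps in enumerate(errors, start=1):
        mean += (eps - mean) / t              # running mean of the error
        m = alpha * m + (eps - mean - delta)  # cumulative deviation
        m_min = min(m_min, m)
        if m - m_min > lam:
            alarms.append(t - 1)              # 0-based alarm index
            m, m_min = 0.0, float("inf")      # restart after an alarm
    return alarms

# Hypothetical stream: a long error-free phase, then persistent errors.
print(page_hinkley([0] * 1000 + [1] * 500))
```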

3.4 Detectors Based on Hoeffding’s and McDiarmid’s Inequalities

Interesting drift detectors based on non-parametric estimation of the classifier error, employing Hoeffding’s and McDiarmid’s inequalities, were proposed in [3].

4 Combined Concept Drift Detectors

Let us assume that we have a pool of n drift detectors

$$\begin{aligned} \varPi =\{D_1,D_2,...,D_n\} \end{aligned}$$
(8)

Each of them returns a signal that the warning level has been reached or that concept drift has been detected, i.e.,

$$\begin{aligned} D_i=\left\{ \begin{array}{ll} 0 &{}if \;drift \;is \;not \; detected\\ 1 &{}if \;warning \; level \; is \; achieved\\ 2 &{}if\; drift \; is \; detected \end{array} \right. \end{aligned}$$
(9)

So far, not many papers have dealt with combined drift detectors. Bifet et al. [2] proposed simple combination rules based on the drift signals alone, ignoring the warning-level signals. The idea of a combined drift detector is presented in Fig. 3.

Fig. 3.
figure 3

Idea of combined drift detector.

Let us present three combination rules which allow the concept drift detector outputs to be combined. Because this work focuses on an ensemble of heterogeneous detectors, only combination rules using the drift appearance signal are taken into consideration.

4.1 ALO (At Least One Detects Drift)

The committee of detectors makes a decision about drift if at least one individual detector decides that drift has appeared (the remaining detectors may return at most a warning signal).

$$\begin{aligned} ALO(\varPi ,x)=\left\{ \begin{array}{cl} 0 &{} \sum \nolimits _{i=1}^n [D_i(x)=2]=0\\ 1 &{} \sum \nolimits _{i=1}^n [D_i(x)=2]\geqslant 1 \end{array} \right. \end{aligned}$$
(10)

where \([\cdot ]\) denotes the Iverson bracket.

4.2 ALHD (At Least Half of the Detectors Detect Drift)

The committee of detectors makes a decision about drift if at least half of the individual detectors decide that drift has appeared.

$$\begin{aligned} ALHD(\varPi ,x)=\left\{ \begin{array}{cl} 0 &{} \sum \nolimits _{i=1}^n [D_i(x)=2]<\frac{n}{2}\\ 1 &{} \sum \nolimits _{i=1}^n [D_i(x)=2]\geqslant \frac{n}{2} \end{array} \right. \end{aligned}$$
(11)

4.3 AD (All Detectors Detect Drift)

The committee of detectors makes a decision about drift only if every individual detector decides that drift has appeared.

$$\begin{aligned} AD(\varPi ,x)=\left\{ \begin{array}{cl} 0 &{} \sum \nolimits _{i=1}^n [D_i(x)=2]<n\\ 1 &{} \sum \nolimits _{i=1}^n [D_i(x)=2]\geqslant n \end{array} \right. \end{aligned}$$
(12)
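The three rules reduce to simple threshold tests on the number of individual drift signals, as the following sketch shows; the encoding of detector outputs follows Eq. (9), and the function name is introduced here for illustration.

```python
def combine(decisions, rule):
    """Combine individual detector outputs (0 = no drift, 1 = warning,
    2 = drift) with the three rules of Sect. 4: 'ALO' signals drift if at
    least one detector does, 'ALHD' if at least half do, 'AD' only if all
    do. Returns 1 for drift, 0 otherwise (cf. Eqs. (10)-(12))."""
    n = len(decisions)
    drifts = sum(1 for d in decisions if d == 2)  # Iverson bracket [D_i(x) = 2]
    if rule == "ALO":
        return int(drifts >= 1)
    if rule == "ALHD":
        return int(drifts >= n / 2)
    if rule == "AD":
        return int(drifts >= n)
    raise ValueError("unknown rule: %s" % rule)

# A pool of five detector outputs: two signal drift, one warning, two nothing.
pool = [2, 2, 1, 0, 0]
print(combine(pool, "ALO"), combine(pool, "ALHD"), combine(pool, "AD"))
```

Note that warning signals (value 1) are ignored by all three rules, in line with the restriction to drift appearance signals stated above.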

Let us notice that the proposed combined detectors are de facto classifier ensembles [25] which use deterministic (untrained) combination rules. They do not take into consideration any additional information, such as the individual drift detectors’ qualities. In this work, we also do not focus on the very important problem of how to choose a valuable pool of individual detectors. For classifier ensembles such a process (called ensemble selection or ensemble pruning) uses diversity measures [14], but the measures used so far cannot be applied to combined detectors because of the different nature of the decision task.

5 Experimental Research

5.1 Goals

The main objective of the experimental study was to evaluate the proposed combined concept drift detectors and to compare them with the well-known simple methods. To ensure appropriate diversity of the pool of detectors, we decided to produce an ensemble on the basis of five detectors employing the different models presented in the previous section. For each experiment we estimated the detector sensitivity, the number of false alarms, and the computational complexity of the model, i.e., the cumulative running time.

5.2 Set-Up

We used the following models of individual detectors:

  • DDM with a window of 30 examples.

  • HDDM_A - detector based on McDiarmid’s inequality analysis with the following parameters: drift confidence 0.001, warning confidence 0.005, and one-sided t-test.

  • HDDM_W - detector based on Hoeffding’s inequality analysis with the following parameters: drift confidence 0.001, warning confidence 0.005, and one-sided t-test.

  • CUSUM with a window of 30 examples, \(\delta = 0.005\) and \(\lambda = 50\).

  • Page-Hinkley test with a window of 30 examples, \(\delta = 0.005\), \(\lambda = 50\), and \(\alpha =1\).

As mentioned above, we used the individual detectors to build three combined ones based on the ALO, ALHD, and AD combination rules.

All experiments were carried out using MOA (Massive Online Analysis)Footnote 3 and our own software written in Java according to MOA’s requirements [1].

In each experiment 3 computer-generated data streams were used, each consisting of 10 000 examples:

  • Data stream with sudden drift appearing after every 5 000 examples.

  • Data stream with gradual drift, where 2 concepts appear.

  • Data stream without drift.

The stationary data stream (without drift) was chosen because we wanted to evaluate the sensitivity of the detectors, i.e., the number of false alarms.

5.3 Results

The results of the experiments are presented in Tables 1, 2 and 3.Footnote 4

Table 1. Results of experiment for data stream with sudden drift
Table 2. Results of experiment for data stream with gradual drift
Table 3. Results of experiment for data stream without drift

5.4 Discussion

Firstly, we have to emphasize that we realize the scope of the experiments was limited, and therefore drawing general conclusions is very risky. In each experiment the detector based on Hoeffding’s inequality (HDDM_W) [3] outperformed the other models. The combined detectors did not behave well for sudden and frequent drift (see Table 1), because some of them had too low a sensitivity (as ALO), or such a high sensitivity (as AD) that it caused a high number of false alarms. The computation time of the AD detector was similar to that of HDDM_A or HDDM_W, but, as mentioned before, its number of false alarms was not acceptable. For gradual drift the combined detectors ALHD and AD behaved similarly to the individual detectors. For the stationary stream they did not exhibit such high sensitivity (similarly to the individual detectors). The unimpressive results of the combined detectors were probably caused by the fact that we nominated the individual models arbitrarily and did not check how adding or removing a detector from the pool could impact the quality of the combined detector. Additionally, the HDDM_W and HDDM_A detectors strongly outperformed the other models, which could cause the decision of the detector ensemble to be the same as that of the dominant models. A similar observation has been reported for classifier ensembles using deterministic combination rules.

On the basis of the experiments we cannot say definitively whether combined concept drift detectors are a promising direction. Nevertheless, we decided to continue work on such models, especially on pools of homogeneous detectors and on methods which are able to prune the detector ensemble. We also have to note the main drawback of the proposed models: they are more complex than the single ones. On the other hand, the proposed method of parallel interconnection is easy to parallelize and could be run in a distributed computing environment.

6 Final Remarks

In this work, three simple combination rules for combined concept drift detectors were discussed and evaluated on the basis of computer experiments with different data streams. They seem to be an interesting proposition for solving the problem of concept drift detection; nevertheless, the results of the experiments do not confirm their high quality, although, as mentioned above, this was probably caused by the very naive choice of the individual models for the detector ensemble.

It is worth noticing that we assume continuous access to class labels. Unfortunately, from a practical point of view it is hard to guarantee that labels are always and immediately available; e.g., for a credit approval task the true label is usually available ca. 2 years after the decision, and such a labeled example could then be worthless, as it comes from an outdated model. Therefore, when constructing concept drift detectors we have to take into consideration the cost of data labeling, which is usually passed over. It seems very interesting to design detectors on the basis of a partially labeled set of examples (so-called active learning) [7] or of unlabeled examples. Unfortunately, unsupervised drift detectors can detect virtual drift only, and it is easy to show that without access to class labels (when we can analyse only the unconditional probability distribution f(x)) real drift could remain undetected [21].

Let us propose future research directions related to the combined concept drift detectors:

  • Developing methods of choosing the individual detectors for a committee; maybe a dedicated diversity measure should be proposed. In this work the diversity of the detectors was ensured by choosing different detector models, but we may also use the same model of detector with different parameters, e.g., different drift and warning confidences for the detector based on McDiarmid’s inequality.

  • Proposing semi-supervised and unsupervised methods of combined detector training.

  • Proposing a trained combination rule to establish the final decision of the combined detector and to fully exploit the strengths of the individual detectors.

  • Developing combined local drift detectors, probably employing the clustering and selection approach [10, 12], because many changes have a local nature and affect only a selected area of the feature space or selected features.

  • Employing other interconnections of detectors (in this work only the parallel architecture was considered), including a serial one.