1 Introduction

The last decade has seen a rapid increase in the amount of massive, rapidly evolving data streams. Today’s decision makers require new solutions to comprehend these fast-evolving knowledge sources in near real-time. That is, they need near-instant models to aid them in detecting traffic congestion, analyzing smartphone usage patterns, tracking trends in the online sales of merchandise, performing mobile crowd sensing, or tracing the spread of ideas, opinions and movements in social networks. This has resulted in a surge of interest in adaptive learning algorithms for data stream classification that are able to learn incrementally and to rapidly adapt to changes in the data (so-called concept drift) (Žliobaite et al. 2016). Such approaches take into consideration that the learning environment is non-stationary and, as a result, build models that evolve over time, as the data arrive.

A number of incremental learners, such as the Hoeffding Tree (HT) (Domingos and Hulten 2000), Naive Bayes, Perceptron (Bifet et al. 2010; Freund and Schapire 1999) and K-Nearest Neighbors (K-NN) methods have been developed in order to learn from data streams. Intuitively, the performance of an individual classifier may vary as a stream evolves. Also, the difference in learning styles may cause a specific classifier to excel against one stream, while failing to accurately model another. Further, no single concept drift detection technique outperforms others in all settings. Rather, the current ‘best’ pair of classifier and drift detector also changes, as the stream evolves.

Based on these observations, we introduce the Tornado framework. In our framework, a reservoir of classifiers with diverse learning styles co-exists with a number of different drift detection algorithms. These classifiers learn incrementally, in parallel. The Tornado framework operates as follows. Each of the classifiers in the reservoir incrementally accepts the incoming instances, one at a time, and proceeds to build a model. Each classifier is combined with each one of the drift detectors. That is, classifier \(C_1\) is combined with drift detectors \(D_1, D_2, \ldots , D_m\), to form pairs (\(C_1, D_1\)), (\(C_1, D_2\)), ..., (\(C_1, D_m\)), classifier \(C_2\) is combined with drift detectors \(D_1, D_2, \ldots , D_m\), to form pairs (\(C_2, D_1\)), (\(C_2, D_2\)), ..., (\(C_2, D_m\)), and so on. Our CAR measure, as introduced in Sect. 3, is used in order to rank the current best performing (classifier, detector) pair. The CAR measure not only considers the classification error-rate, but also takes into consideration the memory usage and runtime, as well as the drift detection delay and the numbers of false positives and false negatives.
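
To make the pairing concrete, the following minimal Python sketch forms the \(n \times m\) (classifier, detector) combinations; the names are placeholders for illustration, not the actual Tornado classes.

```python
from itertools import product

# Placeholder identifiers standing in for classifier and detector instances.
classifiers = ['NB', 'DS', 'HT', 'PR', 'KNN']                  # n classifiers
detectors = ['CUSUM', 'PH', 'DDM', 'EDDM', 'ADWIN', 'FHDDMS']  # m detectors

# Every classifier is paired with every detector: n * m pairs in total,
# e.g. ('NB', 'CUSUM'), ('NB', 'PH'), ..., ('KNN', 'FHDDMS').
pairs = list(product(classifiers, detectors))
```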

In addition, we also incorporate two new drift detection methods into the Tornado framework. They extend the FHDDM algorithm, as introduced in (Pesaranghader and Viktor 2016), in two ways. Firstly, we introduce the FHDDMS algorithm that creates a so-called “stack” of sliding windows of different sizes. The windows monitor the streams using bitmaps and alarm for concept drift using threshold values. The intuition behind this approach is that, by utilizing the windows of various sizes to monitor the stream, concept drift is detected faster and more accurately. In addition, we present \(\hbox {FHDDMS}_{\mathrm{add}}\), a variant of FHDDMS, that employs data summaries, instead of bitwise operations.

Our experimental evaluation against synthetic and real-world data streams confirms that the current best (classifier, drift detector) pair evolves as the characteristics of the stream change. In addition, our FHDDMS methods detect changes faster and more accurately, with shorter delays and fewer false positives and false negatives, when compared to the state-of-the-art.

This paper is organized as follows. Section 2 discusses related work. In Sect. 3, we introduce our CAR measure. This is followed, in Sect. 4, by an overview of the Tornado framework. Section 5 presents the FHDDMS and \(\hbox {FHDDMS}_{\mathrm{add}}\) algorithms. In Sect. 6, we detail our experimental setup and results. Section 7 provides a detailed discussion regarding our experiments. Section 8 concludes the paper and highlights future work.

2 Related work

This section discusses related work on performance measures for data stream mining, by focusing on classification and adaptation measures in a streaming setting, which we use as a foundation for defining the CAR measure in Sect. 3. In addition, we review the state-of-the-art in terms of drift detection algorithms.

2.1 Performance measures for adaptive online learning

Researchers agree that the evaluation of data stream algorithms is a complex task. This fact is due to many challenges, including the presence of concept drift, limited processing time in real-world applications and the need for time-oriented evaluation, amongst others (Gama et al. 2004). The error-rate (or accuracy) is most often used as the defining measure of classification performance when evaluating learning algorithms in streaming studies (Hulten et al. 2001; Gama et al. 2004, 2006; Bifet and Gavalda 2007; Huang et al. 2015; Baena-García et al. 2006). The error-rate is calculated incrementally using either the prequential or hold-out evaluation procedures (Bifet and Kirkby 2009). The interplay between the error-rate and other factors, such as memory usage and runtime considerations, has received limited attention. Bifet et al. (2009) considered the memory, time and accuracy measures separately, in order to compare the performances of ensembles of classifiers. Bifet et al. (2010) further introduced the RAM-Hour measure, where every RAM-Hour equals 1 GB of RAM occupied for one hour, to compare the performances of three versions of perceptron-based Hoeffding Trees. Pesaranghader et al. (2016) introduced the EMR measure, which combines error-rate, memory usage and runtime for evaluating and ranking learning algorithms.

Žliobaite et al. (2015a) introduced the return on investment (ROI) measure to determine whether the adaptation of a learning algorithm is beneficial. They concluded that adaptation should only take place if the expected gain in performance, measured by accuracy, exceeds the cost of other resources (e.g. memory and time) required for adaptation. In their work, the ROI measure was used to indicate whether an adaptation to a concept drift is beneficial, over time. Olorunnimbe et al. (2015) extended the above-mentioned ROI measure, in order to dynamically adapt the number of base learners in online bagging ensembles. Pesaranghader and Viktor (2016) proposed an approach to count the true positives (TP), false positives (FP), and false negatives (FN) of drift detection, in order to evaluate the performances of concept drift detectors. They introduced the acceptable delay length notion as a threshold that determines how far a detected drift may be from the real location of the drift while still being considered a true positive.

As explained above, the performance measures of classification and adaptation have been often used, separately, to evaluate adaptive learning algorithms against evolving data streams. To date, no single measure that considers classification, adaptation, and resource consumption together, has been developed. Such a measure would allow one to assess the “big picture”, in terms of the costs and benefits of a specific learning and adaptation strategy. In Sect. 3, we introduce the CAR measure in order to address this deficiency.

2.2 Drift detection methods

Gama et al. (2014) categorized concept drift detectors into three general groups, as follows:

  1. Sequential Analysis based Methods sequentially evaluate prediction results as they become available, and alarm for drifts when a pre-defined threshold is met. The Cumulative Sum (CUSUM) and its variant Page-Hinkley (PH) (Page 1954), as well as Geometric Moving Average (GMA) (Roberts 2000) are members of this group.

  2. Statistical based Approaches probe the statistical parameters such as mean and standard deviation of prediction results to detect drifts in a stream. The Drift Detection Method (DDM) (Gama et al. 2004), Early Drift Detection Method (EDDM) (Baena-García et al. 2006) and Exponentially Weighted Moving Average (EWMA) (Ross et al. 2012) are members of this group.

  3. Windows based Methods usually use a fixed reference window summarizing the past information and a sliding window summarizing the most recent information. A significant difference between the distributions of these two windows suggests the occurrence of a drift. Statistical tests or mathematical inequalities, with the null-hypothesis indicating that the distributions are equal, are thus employed. Kifer’s (Kifer et al. 2004), Nishida’s (Nishida and Yamauchi 2007), Bach’s (Bach and Maloof 2008), the Adaptive Windowing (ADWIN) (Bifet and Gavalda 2007), SeqDrift detectors (Sakthithasan et al. 2013; Pears et al. 2014), Drift Detection Methods based on Hoeffding’s Bound (\(\hbox {HDDM}_{\mathrm{A{-}test}}\) and \(\hbox {HDDM}_{\mathrm{W{-}test}}\)) (Frías-Blanco et al. 2015), and Adaptive Cumulative Windows Model (ACWM) (Sebastião et al. 2017) are members of this family.

CUSUM and its variant Page-Hinkley (PH) are among the pioneering methods in the field. DDM, EDDM, and ADWIN have frequently been considered as benchmarks in the literature (Huang et al. 2015; Frías-Blanco et al. 2015; Baena-García et al. 2006; Nishida and Yamauchi 2007; Bifet and Gavalda 2007; Pesaranghader and Viktor 2016). SeqDrift2 and the HDDMs are more recently proposed methods that have shown results comparable to these benchmarks. We therefore consider all of these methods in our experimental evaluation, and briefly describe them below.

CUSUM:

Cumulative Sum—CUSUM, by Page (1954), is a sequential analysis technique that alarms for a change when the mean of the input data significantly deviates from zero. The input of CUSUM can be any filter residual; for instance, the prediction error from a Kalman filter (Gama et al. 2014). The CUSUM test is in the form of \(g_t = max(0, g_{t-1} + (x_t - \delta ))\), and it alarms for a concept drift when \(g_t > \lambda \). In this test, \(x_t\) is the currently observed value, \(\delta \) specifies the magnitude of changes that are allowed, while \(g_0 = 0\) and \(\lambda \) is a user-defined threshold. The accuracy of CUSUM depends on the values of parameters \(\delta \) and \(\lambda \). Lower values of \(\delta \) result in faster detection, at the cost of an increased number of false alarms.
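
A minimal Python sketch of this test follows; the parameter defaults are illustrative choices, not prescribed values.

```python
class CUSUM:
    """Sketch of the one-sided CUSUM test described above."""
    def __init__(self, delta=0.005, lam=50.0):
        self.delta = delta   # allowed magnitude of change
        self.lam = lam       # user-defined threshold (lambda)
        self.g = 0.0         # cumulative sum, g_0 = 0

    def add(self, x):
        """Feed one residual x_t; returns True when a drift is flagged."""
        # g_t = max(0, g_{t-1} + (x_t - delta)); alarm when g_t > lambda
        self.g = max(0.0, self.g + (x - self.delta))
        if self.g > self.lam:
            self.g = 0.0     # restart the test after an alarm
            return True
        return False
```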

PH:

Page-Hinkley—PH, by Page (1954), is a variant of CUSUM typically used for change detection in signal processing applications (Gama et al. 2014). The test variable \(m_T\) is defined as a cumulative difference between the observed values and their mean until the current time T; and calculated by \(m_T = \sum _{t=1}^{T}(x_t - {\bar{x}}_T - \delta )\), where \({\bar{x}} = \frac{1}{T}\sum _{t = 1}^{T}x_t\) and \(\delta \) defines the allowed magnitude of changes. The PH method also updates the minimum \(m_T\), denoted as \(M_T\), using \(M_T = min(m_t, t= 1 \ldots T)\). A significant difference between \(m_T\) and \(M_T\), i.e. \(PH_T:m_T - M_T > \lambda \) where \(\lambda \) is a user-defined threshold, implies a concept drift. A large value of \(\lambda \) typically causes fewer false alarms, but it may increase false negative rate.
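
A sketch of the test in its usual incremental form is shown below; the running mean is updated online, and the defaults are again illustrative.

```python
class PageHinkley:
    """Sketch of the Page-Hinkley test in incremental form."""
    def __init__(self, delta=0.005, lam=50.0):
        self.delta, self.lam = delta, lam
        self.t = 0                    # number of observations T
        self.x_mean = 0.0             # running mean of x_1 .. x_T
        self.m = 0.0                  # cumulative difference m_T
        self.m_min = float('inf')     # minimum M_T observed so far

    def add(self, x):
        """Feed one observation x_t; returns True when PH_T > lambda."""
        self.t += 1
        self.x_mean += (x - self.x_mean) / self.t
        self.m += x - self.x_mean - self.delta
        self.m_min = min(self.m_min, self.m)
        return (self.m - self.m_min) > self.lam
```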

DDM:

Drift Detection Method—DDM, by Gama et al. (2004), monitors the error-rate of the classification model to detect drifts. On the basis of the PAC learning model (Mitchell 1997), the method considers that the error-rate of a classifier decreases or stays constant as the number of instances increases. Otherwise, it suggests the occurrence of a drift. Consider \(p_t\) as the error-rate of the classifier with a standard deviation of \(s_t = \sqrt{(p_t(1 - p_t) / t)}\) at time t. As instances are processed, DDM updates two variables \(p_{min}\) and \(s_{min}\) when \(p_t + s_t < p_{min} + s_{min}\). DDM warns for a drift when \(p_t + s_t \ge p_{min} + 2 * s_{min}\), and it detects a drift when \(p_t + s_t \ge p_{min} + 3 * s_{min}\). The \(p_{min}\) and \(s_{min}\) are reset when a drift is detected.
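
The warning and drift levels translate into a short sketch such as the one below (the warm-up conditions of the original method are omitted for brevity).

```python
import math

class DDM:
    """Sketch of the Drift Detection Method on a stream of 0/1 errors."""
    def __init__(self):
        self.t = 0
        self.p = 1.0                          # running error-rate p_t
        self.p_min = self.s_min = float('inf')

    def add(self, error):
        """error = 1 if the prediction was wrong, 0 otherwise."""
        self.t += 1
        self.p += (error - self.p) / self.t
        s = math.sqrt(self.p * (1 - self.p) / self.t)
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s >= self.p_min + 3 * self.s_min:      # drift level
            self.p_min = self.s_min = float('inf')
            return 'drift'
        if self.p + s >= self.p_min + 2 * self.s_min:      # warning level
            return 'warning'
        return 'stable'
```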

EDDM:

Early Drift Detection Method—EDDM, by Baena-García et al. (2006), evaluates the distances between wrong predictions to detect concept drifts. The algorithm is based on the observation that a drift is more likely to occur when the distances between errors are smaller. EDDM calculates the average distance between two recent errors, i.e. \(p'_t\), with its standard deviation \(s'_t\) at time t. It updates two variables \(p'_{max}\) and \(s'_{max}\) when \(p'_t +2*s'_t > p'_{max} + 2*s'_{max}\). The method warns for a drift when \((p'_t + 2 * s'_t) / (p'_{max} + 2*s'_{max}) < \alpha \), and indicates that a drift occurred when \((p'_t + 2 * s'_t) / (p'_{max} + 2*s'_{max}) < \beta \). The authors set \(\alpha \) and \(\beta \) to 0.95 and 0.90, respectively. The \(p'_{max}\) and \(s'_{max}\) are reset only when a drift is detected.
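
A sketch of this idea, tracking the distance between consecutive errors with a running mean and variance, could look as follows (warm-up details are again omitted).

```python
import math

class EDDM:
    """Sketch of the Early Drift Detection Method."""
    ALPHA, BETA = 0.95, 0.90

    def __init__(self):
        self.t = self.n_errors = self.last_error_at = 0
        self.p = self.m2 = 0.0               # mean / sum of squared deviations
        self.p_max = self.s_max = 0.0

    def add(self, error):
        """error = 1 if the prediction was wrong, 0 otherwise."""
        self.t += 1
        if not error:
            return 'stable'
        self.n_errors += 1
        dist = self.t - self.last_error_at   # distance between two errors
        self.last_error_at = self.t
        old_p = self.p
        self.p += (dist - self.p) / self.n_errors
        self.m2 += (dist - old_p) * (dist - self.p)
        s = math.sqrt(self.m2 / self.n_errors)
        if self.p + 2 * s > self.p_max + 2 * self.s_max:
            self.p_max, self.s_max = self.p, s
        denom = self.p_max + 2 * self.s_max
        if denom == 0:
            return 'stable'
        ratio = (self.p + 2 * s) / denom
        if ratio < self.BETA:
            self.p_max = self.s_max = 0.0
            return 'drift'
        return 'warning' if ratio < self.ALPHA else 'stable'
```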

HDDMs:

\(\hbox {HDDM}_{\mathrm{A{-}test}}\) and \(\hbox {HDDM}_{\mathrm{W{-}test}}\) are proposed by Frías-Blanco et al. (2015). The former compares the moving averages to detect drifts. The latter uses the EWMA forgetting scheme (Ross et al. 2012) to weight the moving averages. Then, weighted moving averages are compared to detect concept drifts. For both cases, the Hoeffding inequality (Hoeffding 1963) is used to set an upper bound to the level of difference between averages. The authors noted that the first and the second methods are ideal for detecting abrupt and gradual drifts, respectively.

ADWIN:

Adaptive Sliding Window—ADWIN, by Bifet and Gavalda (2007), slides a window w as the predictions become available, in order to detect drifts. The method examines two sub-windows of sufficient width, i.e. \(w_0\) with size \(n_0\) and \(w_1\) with size \(n_1\), of w, where \(w_0 \cdot w_1 = w\). A significant difference between the means of two sub-windows indicates a concept drift, i.e. \(|{\hat{\mu }}_{w_0} - {\hat{\mu }}_{w_1}| \ge \varepsilon \) where \(\varepsilon = \sqrt{\frac{1}{2m}\ln {\frac{4}{\delta '}}}\), m is the harmonic mean of \(n_0\) and \(n_1\), \(\delta ' = \delta /n\). Here \(\delta \) is the confidence level while n is the size of window w. After a drift is detected, elements are removed from the tail of the window until no significant difference is seen.

SeqDrift2:

SeqDrift2, by Pears et al. (2014), uses the reservoir sampling method (Vitter 1985), as an adaptive sampling strategy, for random sampling from input data. SeqDrift2 stores entries into two repositories called left and right. As entries are processed over time, the left repository contains a combination of older and new entries by applying the reservoir sampling strategy. The right repository collects the newly arriving entries. SeqDrift2 subsequently finds an upper bound for the difference between the means of the two repositories, i.e. \({\hat{\mu }}_l\) for the left repository and \({\hat{\mu }}_r\) for the right repository, using the Bernstein inequality (Bernstein 1946). Finally, a significant difference between the two means suggests a concept drift.

Discussion—CUSUM and Page-Hinkley (PH) detect concept drift by calculating the difference of observed values from the mean and alarm for a drift when this value is larger than a user-defined threshold. These algorithms are sensitive to the parameter values, resulting in a tradeoff between false alarms and detecting true drifts. Recall that DDM, EDDM and HDDM maintain sets of variables, in order to monitor a stream for concept drift. The ADWIN and SeqDrift2 methods, on the other hand, maintain more than one subset of the stream, either using windowing or repositories. DDM and EDDM have lower memory footprints as they only maintain a small number of variables (Gama et al. 2014). These two approaches also require less execution runtime to update the values of the variables for drift detection. However, EDDM may frequently alarm for concept drift during the early stages of learning, since the distance between wrong predictions is small. HDDM employs the Hoeffding inequality in order to detect concept drift. ADWIN and SeqDrift2 generally require more memory for storing prediction results, as maintained within sliding windows or repositories. They are also computationally more expensive, due to the sub-window compression or reservoir sampling procedures. Recall that the SeqDrift2 algorithm of Pears et al. (2014) employs the Bernstein inequality in order to detect concept drift. SeqDrift2 uses the sample variance, and assumes that the sampled data follow a normal distribution. It follows that this assumption may be too restrictive, in real-world domains. Further, the Bernstein inequality is conservative and requires a variance parameter, in contrast to, for instance, the Hoeffding inequality. These shortcomings may lead to longer detection delays and a potential loss of accuracy. In summary, our preliminary experimentation confirmed that the aforementioned methods may cause long detection delays, as well as high false positive and false negative rates.

In Sect. 5, we will introduce our new Stacking Fast Hoeffding Drift Detection Method (FHDDMS), which extends our earlier Fast Hoeffding Drift Detection Method (FHDDM) (Pesaranghader and Viktor 2016). FHDDM slides a window over the stream, in order to detect concept drift. We maintain two variables, namely the mean of elements inside the window at the current time and the maximum mean observed so far. FHDDM subsequently employs the Hoeffding inequality to detect drifts. Our approach thus differs from HDDM, in that we use a sliding window and only maintain two variables.

3 The CAR performance measure

This section presents our CAR measure, which is employed in order to balance classification, adaptation and resource utilization requirements. The motivation for introducing this measure is as follows. Intuitively, as illustrated in Fig. 1, a change in the data distribution, as caused by a concept drift, may result in an increase or decrease of the error-rates for different types of classifiers (Olorunnimbe et al. 2015). We simulated a number of drift points and showed that, as a concept drift occurs, the classifier with the lowest error-rate changes. Consequently, it follows that a learning system where different types of classifiers co-exist and where the model, from the current “best” learner is provided to the users, may hold much value.

Fig. 1 Illustration of error-rate and distributional change interplay

However, following an “error-rate-only” approach is not beneficial in all settings. For instance, in an emergency response scenario, the response time, i.e. the time required to present a model to the users, may be the most important criterion. That is, users may be willing to sacrifice accuracy for speed and partial information. Further, consider the area of pocket (or mobile) data mining, which has many applications in areas such as defense and environmental impact assessment (Gaber et al. 2014). Here, the memory resources may be limited, due to connection issues, and thus reducing the memory footprint is also of importance (Olorunnimbe et al. 2015).

To address this challenge, the EMR measure was proposed for evaluating the overall performance of learning algorithms based on Error-rate, Memory usage, and Runtime (Pesaranghader et al. 2016). In this approach, classifiers receive scores based on their current EMR values, which are then used to rank them. However, the EMR measure does not fully reflect the performance of learning methods in an environment with concept drift. For instance, consider medical applications, where an adaptive learning algorithm must handle concept drifts very rapidly. An airplane on autopilot is another application in which adaptive algorithms that frequently raise false alarms for concept drift are unsuitable. In such contexts, the EMR measure does not effectively represent the overall performance of adaptive learning algorithms, because the detection delay as well as the number of false alarms are both ignored. Consequently, it is beneficial to represent the performances of the learners, their adaptability, as well as their resource usage statistics, by a single measure.

The CAR Measure—We introduce the CAR measure, which not only considers the classification error-rates, memory usages, and runtimes, but also the drift detection delays, false positives and false negatives. The CAR measure, as defined in Eq. (3.1), consists of three components, namely Classification, Adaptation, and Resource Consumption. The classification part consists of the error-rate (\(\hbox {E}_{\mathrm{C}}\)) of classifier C, the adaptation part represents the detection delay (\(\hbox {D}_{\mathrm{D}}\)), false positives (\(\hbox {FP}_{\mathrm{D}}\)) and false negatives (\(\hbox {FN}_{\mathrm{D}}\)) of drift detector D, while the resource consumption part is associated with the memory consumption (\(\hbox {M}_{\mathrm{(C,D)}}\)) and runtime (\(\hbox {R}_{\mathrm{(C,D)}}\)) of the (classifier, detector) pair. Please note that, in Eq. (3.1), the \(\diamond \) and \(\oplus \) symbols only represent the combination of the three components.

$$\begin{aligned} \text{ CAR }_{\mathrm{(C, D)}}&:= \text{ Classification }_{\mathrm{C}} \diamond \text{ Adaptation }_{\mathrm{D}} \diamond \text{ Resource } \text{ Use }_{(C, D)} \nonumber \\&:= {E}_{\mathrm{C}} \diamond ({D}_{\mathrm{D}} \oplus {FP}_{\mathrm{D}} \oplus {FN}_{\mathrm{D}}) \diamond (M_{(C, D)} \oplus R_{(C, D)}) \end{aligned}$$
(3.1)

The score associated with the pair (C, D) is obtained from Eq. (3.2). The equation implies that a pair with a high CAR value has a low score, i.e.

$$\begin{aligned} \text{ Score }_{\mathrm{(C, D)}} := 1 - \text{ CAR }_{(C, D)} \end{aligned}$$
(3.2)

In order to compute the CAR measure, a matrix containing the classification, adaptation, and resource consumption metrics of all (classifier, detector) pairs is created as the first instance arrives. Note that the current value of the CAR measure is continuously re-calculated during online learning, i.e. every time a new instance is processed. There are n classifiers and m drift detectors, which means that \(n \times m\) pairs are considered concurrently. The measures associated with each pair for the tth instance are placed, row-by-row, into a matrix \(M^t\), as shown in Eq. (3.3). This matrix is defined as follows:

$$\begin{aligned} M^t = \begin{bmatrix} E_{C_1}&D_{D_1}&FP_{D_1}&FN_{D_1}&M_{(C_1, D_1)}&R_{(C_1, D_1)} \\ E_{C_1}&D_{D_2}&FP_{D_2}&FN_{D_2}&M_{(C_1, D_2)}&R_{(C_1, D_2)} \\ \vdots &\vdots &\vdots &\vdots &\vdots &\vdots \\ E_{C_n}&D_{D_m}&FP_{D_m}&FN_{D_m}&M_{(C_n, D_m)}&R_{(C_n, D_m)} \end{bmatrix} \end{aligned}$$
(3.3)

Subsequently, the elements of the matrix are normalized, column-by-column, using the ‘min-max’ scaling approach. The resulting normalized matrix, \(\overline{M^t}\), is defined in Eq. (3.4):

$$\begin{aligned} \overline{M^t} = \begin{bmatrix} \overline{E}_{C_1}&\overline{D}_{D_1}&\overline{FP}_{D_1}&\overline{FN}_{D_1}&\overline{M}_{(C_1, D_1)}&\overline{R}_{(C_1, D_1)} \\ \vdots &\vdots &\vdots &\vdots &\vdots &\vdots \\ \overline{E}_{C_n}&\overline{D}_{D_m}&\overline{FP}_{D_m}&\overline{FN}_{D_m}&\overline{M}_{(C_n, D_m)}&\overline{R}_{(C_n, D_m)} \end{bmatrix} \end{aligned}$$
(3.4)

A weight is assigned to each measure. The weights are combined into a weight vector \(\overrightarrow{w}\) which is defined in Eq. (3.5). The elements of the weight vector \(\overrightarrow{w}\) are the weights associated with the classification error-rate, the detection delay, the false positive rate, the false negative rate, the memory usage, and the runtime. Each weight emphasizes the importance of a particular measure in the evaluation process.

Note that, since each column of the CAR measure is normalized, as illustrated by Eq. (3.4), the corresponding components, such as the memory consumption and the runtime, are also normalized. This implies that the two measures are scale invariant and remain unaffected by change of units, e.g. a memory unit change from kilobytes to megabytes. In our experimental evaluation, we not only show the CAR scores, as a holistic measure, but also include the error rates, memory usages and runtimes. These measures are explicitly shown, to enable the reader to fully evaluate the performance of (classifier, detector) pairs.

$$\begin{aligned} \overrightarrow{w} = \begin{bmatrix} w_e&w_d&w_{fp}&w_{fn}&w_{m}&w_{r} \end{bmatrix}^T \end{aligned}$$
(3.5)

The CAR measures and scores for the \(n \times m\) pairs are evaluated with Eqs. (3.6) and (3.7), respectively. Please note that \(J_{n\times m, 1}\) and \(J_{1, 6}\) are matrices whose entries are all ones.

$$\begin{aligned}&\displaystyle \text{ CAR }_{n\times m, 1}^t = \frac{\overline{M^t} \cdot \overrightarrow{w}}{J_{1, 6} \cdot \overrightarrow{w}} \end{aligned}$$
(3.6)
$$\begin{aligned}&\displaystyle \text{ Score }_{n\times m, 1}^t = J_{n\times m, 1} - \text{ CAR }_{n\times m, 1}^t \end{aligned}$$
(3.7)

The index of the classifier that is recommended at time t, \(\hbox {Index}_{\mathrm{opt}}\), is defined by Eq. (3.8), where \(imax(\cdot )\) returns the index of the largest element:

$$\begin{aligned} \hbox {Index}_{\mathrm{opt}} = imax(\text{ Score }_{n\times m, 1}^t) \end{aligned}$$
(3.8)

In summary, the CAR measures and scores are calculated as follows:

  1. Initialize the matrix M and the weight vector \(\overrightarrow{w}\),

  2. Replace the elements of matrix M row-by-row (i.e. for every pair) as the next instance arrives,

  3. Normalize the elements of matrix M column-by-column using ‘min-max’ scaling,

  4. Calculate the CAR measures and the scores of the pairs with Eqs. (3.6) and (3.7),

  5. Rank the pairs based on their scores,

  6. Recommend the pair with the highest score to the user,

  7. Repeat steps 2 to 6.
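
The following Python sketch illustrates one pass of this procedure for a toy matrix; the function and variable names, the example numbers, and the equal weights are illustrative assumptions rather than the Tornado implementation.

```python
import numpy as np

def car_scores(M, w):
    """M: (n*m) x 6 matrix of [error, delay, FP, FN, memory, runtime] per pair;
    w: weight vector [w_e, w_d, w_fp, w_fn, w_m, w_r]. Returns (CAR, scores)."""
    col_min, col_max = M.min(axis=0), M.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # guard against /0
    M_norm = (M - col_min) / span            # Eq. (3.4): column-wise min-max scaling
    car = (M_norm @ w) / w.sum()             # Eq. (3.6): weighted, normalized combination
    return car, 1.0 - car                    # Eq. (3.7): score = 1 - CAR

# Toy example: three (classifier, detector) pairs with made-up measurements.
M = np.array([[0.10, 120.0, 1, 0, 2048, 35.0],
              [0.15,  80.0, 3, 1, 1024, 22.0],
              [0.08, 200.0, 0, 2, 4096, 60.0]])
w = np.ones(6)                               # equal weights, purely for illustration
car, scores = car_scores(M, w)
best_pair = int(np.argmax(scores))           # Eq. (3.8): index of the recommended pair
```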

One should notice that the weights are application dependent. For instance, if the memory resources are limited, such as in the case of a pocket data mining scenario (Gaber et al. 2014), the value of \(w_m\) should be set to a higher value. On the other hand, if memory is abundant, but accuracy and speed of model construction are important, the \(w_m\) value may be decreased (or even set to zero). In medical applications, where reacting rapidly to concept drifts is critical, \(w_d\) may be the dominant weight.

4 Tornado: A reservoir of diverse learning strategies

In this section, we introduce the Tornado framework which is outlined in Fig. 2. We implemented our framework using the Python programming language. Recall that, in our framework, a number of distinct pairs of classifiers and drift detectors are executed in parallel, against the same data stream. In this figure, \(C_n\) and \(D_m\) represent the nth classifier and the mth detector, respectively. A number of classifiers with different learning styles are implemented. Currently, the Naive Bayes (NB), Decision Stump (DS), Hoeffding Tree (HT) (Domingos and Hulten 2000), Perceptron (PR) (Bifet et al. 2010), and K-Nearest Neighbors (K-NN) learning algorithms are available. Furthermore, various concept drift detection methods are provided. Specifically, we have implemented Cumulative Sum (CUSUM) and its variant Page-Hinkley (PH) (Page 1954), Drift Detection Method (DDM) (Gama et al. 2004), Early Drift Detection Method (EDDM) (Baena-García et al. 2006), Hoeffding’s bound based Drift Detection Methods (\(\hbox {HDDM}_\mathrm{A{-}test}\) and \(\hbox {HDDM}_\mathrm{W{-}test}\)) (Frías-Blanco et al. 2015), Adaptive Windowing (ADWIN) (Bifet and Gavalda 2007), SeqDrift2 (Pears et al. 2014), Fast Hoeffding Drift Detection Method (FHDDM) (Pesaranghader and Viktor 2016), and our new Stacking Fast Hoeffding Drift Detection Methods (FHDDMS), which is introduced in Sect. 5.

Fig. 2 The Tornado framework

As shown in Fig. 2, the Stream Reader, the Classifiers and Detectors, the Pairs of Classifiers and Detectors, and the CAR Calculator are the main components of the framework. The input consists of a stream, the pairs of classifiers and detectors, and a weight vector. Our framework follows the prequential approach, where instances are first tested and then used for training (Gama et al. 2013, 2014; Pesaranghader and Viktor 2016).

The data flow may be described as follows: the (classifier, detector) pairs are constructed as shown in Fig. 2, prior to the learning process. The Stream Reader reads instances from the stream and sends them one-by-one to the (classifier, detector) pairs for model construction. Each learner builds an incremental model, prequentially. That is, each instance is first used for testing and then for training. Simultaneously, the Classifiers send their statistics, e.g. error-rates or the current prediction results, to their corresponding Drift Detectors in order to detect potential occurrences of concept drifts. Subsequently, the CAR Calculator determines the score of each (classifier, detector) pair by considering the classification error-rate, detection delay, detection false positive rate, detection false negative rate, total memory usage and runtime. Finally, the model with the highest score is presented to the user. This model may change as a result of incremental learning and concept drift. This process continues until either a predefined condition is met or all the instances in the stream are processed.

Figure 3 illustrates that, while various pairs are executed concurrently, the one with the best score is recommended at each time interval. An interval is the time difference between two consecutive concept drifts. This notion is also referred to as a context in the literature (Gama et al. 2004; Pesaranghader and Viktor 2016). As illustrated by the figure, during the interval \(\tau _0\) to \(\tau _n\), the pair \((C_1, D_3)\) has the best score. Suddenly, at time \(\tau _n\), the data distribution is altered, resulting in pair \((C_3, D_2)\) being recommended to the user. In the illustrative example, another drift occurs at \(\tau _{2n}\), resulting in pair \((C_2, D_1)\) having the highest score.

Fig. 3 Recommendation of (classifier, detector) pairs over time

The following observation is noteworthy. Recall that the continuous output of our Tornado framework is the current best performing (classifier, detector) pair, which may change over time. Consequently, our work should not be confused with a hybrid ensemble of classifiers setting (Hsu 2017; Min and Cho 2011; Verikas et al. 2010; Salgado et al. 2006). Typically, a hybrid ensemble contains a number of diverse classifiers that form a committee, which aims at increasing the predictive accuracy by utilizing the diversity of the members of the ensemble. In contrast, within the Tornado framework, the individual learners proceed independently to construct their models. Recall that the rationale behind our design is that we aim to utilize diverse learning strategies that potentially address concept drifts more efficiently. However, future work may also involve incorporating ensembles, as one of our classifiers, into the Tornado framework.

5 Stacking fast Hoeffding drift detection methods

In this section, we briefly review the Fast Hoeffding Drift Detection Method (FHDDM). We subsequently introduce our Stacking Fast Hoeffding Drift Detection Method (FHDDMS) and Additive FHDDMS (\(\hbox {FHDDMS}_{\mathrm{add}}\)) algorithm.

5.1 Fast Hoeffding drift detection method (FHDDM)

Recently, Pesaranghader and Viktor (2016) introduced the Fast Hoeffding Drift Detection Method (FHDDM) which is based on a sliding window mechanism and the Hoeffding inequality. The FHDDM algorithm slides a window of size n over the classification results. A 1 is inserted in the window if a particular prediction is correct, while a 0 is inserted otherwise. As the instances are processed, the mean associated with a particular sliding window at time t, \(\mu ^t\), is evaluated while the maximum mean observed so far, \(\mu ^m\), is updated if the mean of the current sliding window is higher.

On the basis of the probably approximately correct (PAC) learning model (Mitchell 1997), the classification accuracy either increases or remains constant as the number of instances increases (Gama et al. 2004). Should this not be the case, the probability of a concept drift increases. As a result, the value of \(\mu ^m\) either increases or remains constant as instances are processed. Therefore, a concept drift is more likely if the value of \(\mu ^m\) remains approximately constant while the value of \(\mu ^t\) decreases over time. As demonstrated by Pesaranghader and Viktor (2016), if the difference in between the maximum and the current mean is greater than a certain threshold \(\varepsilon _d\), it may be safely assumed that a concept drift has occurred. The threshold is evaluated with the Hoeffding inequality.

Theorem I

Hoeffding’s Inequality—Let \(X_1, X_2, \ldots , X_n\) be n independent random variables bounded by the interval [0, 1], then with a probability of at most \(\delta \), the difference between the empirical mean of these variables \({\overline{X}} = \frac{1}{n}\sum _{i=1}^{n}X_i\) and its expected value \(E[{\overline{X}}]\) is at least \(\varepsilon _H\), i.e. \(Pr(|{\overline{X}} - E[{\overline{X}}]| \ge \varepsilon _H) \le \delta \), where:

$$\begin{aligned} \varepsilon _H = \sqrt{\frac{1}{2n}\ln {\frac{2}{\delta }}} \end{aligned}$$
(5.1)

and \(\delta \) is the upper bound for the probability.

Corollary I

FHDDM test—In a stream setting, assume \(\mu ^t\) is the mean of a sequence of n random entries, where the prediction status of each instance is represented by a value in the set \(\{0, 1\}\), at time t. Let \(\mu ^m\) be the maximum mean observed so far. Let \(\varDelta \mu = \mu ^m - \mu ^t \ge 0\) be the difference between the two means. Given the desired \(\delta \), Hoeffding’s inequality implies that a drift has occurred if \(\varDelta \mu \ge \varepsilon _d\), where:

$$\begin{aligned} \varepsilon _{d} = \sqrt{\frac{1}{2n}\ln {\frac{1}{\delta }}} \end{aligned}$$
(5.2)

Figure 4 illustrates the FHDDM algorithm. In this example, n and \(\delta \) are set to 10 and 0.2, respectively. Using Corollary I, the value of \(\varepsilon _d\) is equal to 0.28. Suppose that a real drift occurs right after the 12th instance. The values of \(\mu ^t\) and \(\mu ^m\) are set to null and zero until 10 elements are inserted into the window. We have seven 1s in the window after reading the first 10 elements. Thus \(\mu ^t\) is equal to 0.7 and the value of \(\mu ^m\) is also set to 0.7. The 1st element is removed from the window before the 11th prediction status is inserted. Since the value of prediction status is 0, the value of \(\mu ^t\) decreases to 0.6 while the value of \(\mu ^m\) remains the same. This process continues until the 18th instance is inserted. At this point in time, the difference between \(\mu ^m\) and \(\mu ^t\) exceeds \(\varepsilon _d\). As a result, the FHDDM algorithm alarms for a drift.

Fig. 4 An example of the FHDDM algorithm
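
A minimal Python sketch of this procedure, written from the description above, is given below; the class and parameter names are ours, and details such as the warm-up behaviour are simplified.

```python
import math
from collections import deque

class FHDDM:
    """Sketch of FHDDM: slides a window of prediction statuses (1 = correct)."""
    def __init__(self, n=100, delta=1e-7):
        self.n = n
        self.eps_d = math.sqrt(math.log(1.0 / delta) / (2.0 * n))  # Eq. (5.2)
        self.window = deque(maxlen=n)
        self.mu_max = 0.0          # maximum window mean observed so far

    def add(self, correct):
        """Feed one prediction status; returns True if a drift is flagged."""
        self.window.append(1 if correct else 0)
        if len(self.window) < self.n:
            return False           # wait until the window is full
        mu_t = sum(self.window) / self.n
        self.mu_max = max(self.mu_max, mu_t)
        if self.mu_max - mu_t >= self.eps_d:   # significant drop in accuracy
            self.window.clear()
            self.mu_max = 0.0
            return True
        return False
```

With n = 10 and \(\delta = 0.2\), this sketch yields \(\varepsilon _d \approx 0.28\), matching the toy example of Fig. 4.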

5.2 Sensitivity of FHDDM’s parameters

In this section, we investigate the impact of the change in parameters \(\delta \) and n on the value of \(\varepsilon _d\). We also study the effects of varying these values on the detection delay, the false positive rate, the memory usage, and total runtime. To this end, we conducted a number of experiments with the values of n in {25, 100, 200, 300, 400, 500} and \(\delta \) in {0.001, 0.0001, 0.00001, 0.000001, 0.0000001}.

All results shown are for the Sine1 synthetic data stream, which contains abrupt concept drifts. The classification function is the classic \(y = sin(x)\) and the class labels are reversed at drift points (more details are provided in Sect. 6.1.1). Our explorative results are summarized in Tables 1, 2, and 3, as well as in Fig. 5. Note that the reported runtimes are the average of execution runtimes over five contexts. Recall that a context refers to the duration between two consecutive concept drifts. Table 1 shows that, as the value of n increases, the value of \(\varepsilon _d\) decreases. This implies that, since we have more observations, a more optimistic error bound may be used. For a constant n, there is an inverse relationship between \(\delta \) and \(\varepsilon _d\). That is, as the value of \(\delta \) decreases, the \(\varepsilon _d\) value increases (i.e. the bound becomes more conservative). Further, Table 2 illustrates that, for a constant value of \(\delta = 10^{-7}\), the detection delay increases as we increase the value of n. Intuitively, memory usage and runtime also increase as the window size grows. Table 3 lists the results for a constant \(n = 100\). The table shows that, as we decrease the value of \(\delta \), the detection delay increases but the false positive rate decreases.

Finally, in Fig. 5, we contrast the memory usage of FHDDM with that of ADWIN, which also employs a windowing scheme. The reader should notice that FHDDM consistently outperforms ADWIN (indicated in red), in terms of memory usage.

Table 1 Values of FHDDM’s \(\varepsilon _d\) for different n and \(\delta \)
Table 2 Behavior of FHDDM for different values of n, and \(\delta = 10^{-7}\)
Table 3 Behavior of FHDDM for \(n=100\), and different values of \(\delta \)
Fig. 5 FHDDM’s memory usage and runtime for different window sizes: a memory usage, b runtime

Our experiments, as reported in Pesaranghader and Viktor (2016), indicated that FHDDM outperformed the state-of-the-art both in terms of adaptation and classification results. As illustrated above, the size of the sliding window is a crucial parameter. Our experimental results further indicated that a longer window implies a longer detection delay against abrupt concept drifts. On the other hand, a shorter window may cause higher false negative rates against gradual concept drifts. Based on this observation, we extended our approach as explained in the next section.

5.3 Stacking fast Hoeffding drift detection method (FHDDMS)

The Stacking Fast Hoeffding Drift Detection Method (FHDDMS) extends the FHDDM method by maintaining windows of different sizes. That is, a short and a long sliding window are superimposed, as shown in Fig. 6. The rationale behind this approach is to reduce the detection delay and the false negative rate. Intuitively, a short window should detect abrupt drifts faster, while a long window should detect gradual drifts with a lower false negative rate.

Fig. 6 Stacking fast Hoeffding drift detection method (FHDDMS)

Following FHDDM, the FHDDMS algorithm inserts a 1 into both the short and the long windows when the prediction result is correct, whereas a 0 is inserted otherwise. In Fig. 6, which illustrates our approach, the sizes of the long and the short sliding windows are set to 20 and 5, respectively.

As instances are processed, FHDDMS calculates the means of the elements inside the long and short sliding windows at time t, i.e. \(\mu _l^t\) and \(\mu _s^t\). We define \(\mu _l^m\) and \(\mu _s^m\) as the maximum means observed so far for the long and the short windows, respectively:

$$\begin{aligned} \begin{aligned} \mu _l^m< \mu _l^t \Rightarrow \mu _l^m = \mu _l^t \\ \mu _s^m < \mu _s^t \Rightarrow \mu _s^m = \mu _s^t \end{aligned} \end{aligned}$$
(5.3)

Recall that, as based on the probably approximately correct (PAC) learning model (Mitchell 1997), the classification accuracy increases or remains constant as the number of instances increases. Otherwise, the possibility of facing concept drift increases (Gama et al. 2004; Pesaranghader and Viktor 2016). Thus, both the values of \(\mu _l^m\) and \(\mu _s^m\) should increase or remain constant as we process instances. Alternatively, the probability of a drift increases if the values of \(\mu _l^m\) and \(\mu _s^m\) do not change while the values of \(\mu _l^t\) and \(\mu _s^t\) decrease over time. As shown in Eq. (5.4), a significant difference between the current means and their maximums indicates the occurrence of a drift in the stream.

$$\begin{aligned}&\varDelta \mu _l = \mu _l^m - \mu _l^t \ge \varepsilon _l \Rightarrow \psi _l = True \nonumber \\&\varDelta \mu _s = \mu _s^m - \mu _s^t \ge \varepsilon _s \Rightarrow \psi _s = True \nonumber \\&\hbox {if } (\psi _l = True) \text{ or } (\psi _s = True) \Rightarrow \text{ alarm } \text{ for } \text{ a } \text{ drift } \end{aligned}$$
(5.4)

Here \(\psi _l\) and \(\psi _s\) denote the status of concept drift detection as observed by the long and short windows, respectively.

Following (Pesaranghader and Viktor 2016), we use the previously introduced Hoeffding’s inequality (Hoeffding 1963) and Corollary I to define the values of \(\varepsilon _l\) and \(\varepsilon _s\):

$$\begin{aligned} \begin{aligned} \varepsilon _{l} = \sqrt{\frac{1}{2 |W_l|}\ln {\frac{1}{\delta }}} \text{ and } \varepsilon _{s} = \sqrt{\frac{1}{2 |W_s|}\ln {\frac{1}{\delta }}} \end{aligned}, \end{aligned}$$
(5.5)

where \(|W_l|\) and \(|W_s|\) are the sizes of the long and the short windows, respectively.

The pseudocode for the FHDDMS algorithm is presented in Algorithm 1. The Initialize function initializes the parameters for the stacking windows. Subsequently, while data stream instances are prequentially processed, the Detect function analyses the prediction results in order to determine if a concept drift has occurred (lines 24–27). A drift is either detected when \(( \mu _{l}^{m} - \mu _{l}^{t} ) \ge \varepsilon _l\) or when \(( \mu _{s}^{m} - \mu _{s}^{t} ) \ge \varepsilon _s\).

Algorithm 1 The FHDDMS algorithm
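
Since the pseudocode itself is not reproduced here, the following Python sketch, written from the description above, shows the core of the test; the class and parameter names are ours, and the window bookkeeping is simplified.

```python
import math
from collections import deque

class FHDDMS:
    """Sketch of FHDDMS: a short and a long window monitored simultaneously."""
    def __init__(self, n_short=25, n_long=100, delta=1e-7):
        self.n_s, self.n_l = n_short, n_long
        self.eps_s = math.sqrt(math.log(1.0 / delta) / (2.0 * n_short))  # Eq. (5.5)
        self.eps_l = math.sqrt(math.log(1.0 / delta) / (2.0 * n_long))
        self._reset()

    def _reset(self):
        self.window = deque(maxlen=self.n_l)   # long window; short = last n_s entries
        self.mu_s_max, self.mu_l_max = 0.0, 0.0

    def add(self, correct):
        """Feed one prediction status (1 = correct); returns True on a drift alarm."""
        self.window.append(1 if correct else 0)
        drift = False
        if len(self.window) >= self.n_s:
            mu_s = sum(list(self.window)[-self.n_s:]) / self.n_s
            self.mu_s_max = max(self.mu_s_max, mu_s)
            drift = drift or (self.mu_s_max - mu_s) >= self.eps_s
        if len(self.window) == self.n_l:
            mu_l = sum(self.window) / self.n_l
            self.mu_l_max = max(self.mu_l_max, mu_l)
            drift = drift or (self.mu_l_max - mu_l) >= self.eps_l
        if drift:
            self._reset()
        return drift                           # Eq. (5.4)
```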

5.4 Additive FHDDMS (\(\hbox {FHDDMS}_{\mathrm{add}}\))

We introduce another version of the FHDDMS method, called Additive FHDDMS and denoted as \(\hbox {FHDDMS}_{\mathrm{add}}\). In this approach, the binary indicators are substituted by their summations. As shown in Fig. 7, the short and the long windows are characterized by the sums of their respective most recent prediction results. In this example, the short window holds a single summation of the 5 most recent bits, while the long window holds four summations for the 20 most recent bits seen so far. For each window, the maximum values observed so far, i.e. \(\mu _s^m\) and \(\mu _l^m\), are updated as required. In this example, the mean values \(\mu _l^t\) and \(\mu _s^t\) are 0.8 and 0.6, respectively. As in the FHDDMS algorithm, a concept drift is flagged if the difference between the current values and the maximum values exceeds a certain threshold, as determined by Hoeffding’s inequality.

Note that, intuitively, \(\hbox {FHDDMS}_{\mathrm{add}}\) should require less memory and exhibit a faster execution time when compared to FHDDMS, since the data structure is more concise. Nevertheless, this may lead to a longer drift detection delay, since the algorithm must ensure that the short window has accumulated \(|W_s|\) new predictions, after an element has been removed from the long window’s tail. This is further confirmed by our experimental results in Sect. 6.3.
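
As a rough illustration of this idea, the sketch below replaces the bitmap with block sums of \(|W_s|\) predictions; this is one possible reading of the description above, with our own names and simplifications, not the authors' exact implementation.

```python
import math
from collections import deque

class FHDDMSadd:
    """Sketch of one possible FHDDMS_add: block sums instead of individual bits."""
    def __init__(self, n_short=25, n_long=100, delta=1e-7):
        self.n_s, self.n_l = n_short, n_long
        self.eps_s = math.sqrt(math.log(1.0 / delta) / (2.0 * n_short))
        self.eps_l = math.sqrt(math.log(1.0 / delta) / (2.0 * n_long))
        self._reset()

    def _reset(self):
        self.block_sum, self.block_len = 0, 0             # current (short) block
        self.blocks = deque(maxlen=self.n_l // self.n_s)  # block sums forming the long window
        self.mu_s_max, self.mu_l_max = 0.0, 0.0

    def add(self, correct):
        self.block_sum += 1 if correct else 0
        self.block_len += 1
        if self.block_len < self.n_s:
            return False                      # wait until a full block of n_s predictions
        mu_s = self.block_sum / self.n_s
        self.blocks.append(self.block_sum)    # oldest block sum is dropped when full
        self.block_sum, self.block_len = 0, 0
        self.mu_s_max = max(self.mu_s_max, mu_s)
        drift = (self.mu_s_max - mu_s) >= self.eps_s
        if len(self.blocks) == self.blocks.maxlen:
            mu_l = sum(self.blocks) / self.n_l
            self.mu_l_max = max(self.mu_l_max, mu_l)
            drift = drift or (self.mu_l_max - mu_l) >= self.eps_l
        if drift:
            self._reset()
        return drift
```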

Fig. 7 Additive stacking windows (\(\hbox {FHDDMS}_{\mathrm{add}}\)) approach

Figure 8 illustrates how the FHDDMS and \(\hbox {FHDDMS}_{\mathrm{add}}\) algorithms proceed. In this toy example, the sizes of the long and short windows are set to 20 and 5, respectively. We also set \(\delta \) to 0.002. Using Eq. (5.5), we have \(\varepsilon _l\) equal to 0.394, and \(\varepsilon _s\) to 0.788. Recall that FHDDMS considers the predictions bit-by-bit; whereas, \(\hbox {FHDDMS}_{\mathrm{add}}\) calculates the summary statistics of every 5 predictions. Throughout this incremental learning process, the values of \(\mu _l\), \(\mu _l^m\), \(\varDelta \mu _l\) , \(\mu _s\), \(\mu _s^m\), \(\varDelta \mu _s\) are continuously updated, as indicated in the right-side of the illustration. The reader may notice that both algorithms alarm for concept drift when \(\varDelta \mu _s\) exceeds 0.8, i.e. has a value greater than \(\varepsilon _s\).

Fig. 8 Example of the FHDDMS and \(\hbox {FHDDMS}_{\mathrm{add}}\) algorithms

Finally, the theoretical proofs on the bounds of false positive and false negative for the Fast Hoeffding Drift Detection Methods, including FHDDM and FHDDMS, are available in Electronic supplementary material.

6 Experimental evaluation

In this section, we evaluate our new drift detection methods, FHDDMS and \(\hbox {FHDDMS}_{\mathrm{add}}\), by comparing them against the state-of-the-art. Subsequently, we perform various experiments utilizing the Tornado framework. We generated synthetic data streams and also considered real-world data for our experiments. We describe the synthetic and real-world data streams as well as the experimental setup in Sects. 6.1 and 6.2, respectively. We evaluate our drift detection methods and the Tornado framework in Sects. 6.3 and 6.5. Our experiments were performed on an Intel Core i5 processor @ 2 \(\times \) 2.30 GHz with 16 GB of RAM.

Table 4 Summary of synthetic data streams

6.1 Data streams used in experimentation

6.1.1 Synthetic data streams

We have selected the previously introduced Sine1 data stream, as well as the Sine2, Mixed, Stagger, Circles and LED streams, which are frequently applied in the data stream mining literature (Kubat and Widmer 1995; Nishida and Yamauchi 2007; Bifet and Gavalda 2007; Frías-Blanco et al. 2015; Gama et al. 2004; Olorunnimbe et al. 2015; Pesaranghader et al. 2016; Pesaranghader and Viktor 2016; Barros et al. 2017, 2018), as the synthetic data streams for our experiments. Each data stream contains 100,000 instances. Sine1, Sine2, Mixed, Stagger, and Circles have only two class labels, whereas LED has 10 class labels. Following the convention, we have placed drift points at every 20,000 instances in Sine1, Sine2, and Mixed, and at every 33,333 instances in Stagger, with a transition length of \(\zeta =50\) to simulate abrupt concept drift. In addition, we have put drift points at every 25,000 instances in the Circles and LED data streams, with a transition length of \(\zeta =500\) to simulate gradual concept drift. We have also added 10% noise to each data stream, to observe how robust the drift detectors are against noisy data streams by asserting their ability to distinguish between concept drift and noise. Table 4 summarizes the synthetic data streams. They may be described as follows:

  • Sine1 \(\cdot \) with abrupt drift: Recall that the stream consists of two attributes x and y uniformly distributed in the interval [0, 1]. The classification function is \(y = sin(x)\). Instances are classified as positive if they are under the curve; otherwise they are classified as negative. At a drift point, the class labels are reversed (a toy generator sketch is given after this list).

  • Sine2 \(\cdot \) with abrupt concept drift: It holds two attributes x and y which are uniformly distributed between 0 and 1. The classification function is \(y = 0.5 + 0.3 * sin(3 \pi x)\). Instances under the curve are classified as positive while the remaining instances are classified as negative. At a drift point, the classification scheme is inverted.

  • Mixed \(\cdot \) with abrupt drift: The dataset has two numeric attributes x and y distributed in the interval [0, 1] as well as two boolean attributes v and w. The instances are classified as positive if at least two of the following three conditions are satisfied: \(v, w, y < 0.5 + 0.3 * sin(3\pi x)\). The classification is reversed when drift points occur.

  • Stagger \(\cdot \) with abrupt concept drift: This dataset contains three nominal attributes, namely size {small, medium, large}, color {red, green} and shape {circular, non-circular}. Before the first drift point, instances are labeled positive if \((color = red) \wedge (size = small)\). After this point and before the second drift, instances are classified positive if \((color = green) \vee (shape = circular)\), and finally after the second drift point, instances are classified positive only if \((size = medium) \vee (size = large)\).

  • Circles \(\cdot \) with gradual drift: This dataset contains two attributes x and y which are uniformly distributed in the interval [0, 1]. The classification function is \((x - x_c)^2 + (y - y_c)^2 = r_c^2\) where \((x_c, y_c)\) is its center and \(r_c\) is the radius. Instances inside the circle are classified as positive. A drift happens whenever the classification function, i.e. the circle function, changes.

  • LED \(\cdot \) with gradual drift: The objective of this dataset is to predict the digit on a seven-segment display, where each digit has a 10% chance of being displayed. The dataset has 7 attributes related to the class, and 17 irrelevant ones. Concept drift is simulated by interchanging relevant attributes (Frías-Blanco et al. 2015).
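
The following toy generator, with illustrative parameter names, sketches how a stream such as Sine1 can be produced: the label is reversed after each drift point and 10% class noise is injected (the sigmoidal transition of length \(\zeta \), described in Sect. 6.2, is omitted here for brevity).

```python
import math
import random

def sine1_stream(n_instances=100_000, drift_every=20_000, noise=0.10, seed=1):
    """Toy Sine1-like generator: yields ((x, y), label) with abrupt drifts."""
    rng = random.Random(seed)
    for t in range(n_instances):
        x, y = rng.random(), rng.random()
        label = 1 if y < math.sin(x) else 0      # positive if under the curve
        if (t // drift_every) % 2 == 1:
            label = 1 - label                    # class labels reversed after a drift
        if rng.random() < noise:
            label = 1 - label                    # 10% class noise
        yield (x, y), label
```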

6.1.2 Real-world data streams

We further conducted experiments using the following real-world data streams, which are frequently used in the online learning and adaptive learning literature (Gama et al. 2004; Baena-García et al. 2006; Bifet et al. 2009; Frías-Blanco et al. 2015). These three data streams were used in our comparative evaluation of drift detectors.

  • Electricity contains 45,312 instances, with 8 input attributes, recorded every half an hour over two years from the Australian New South Wales Electricity Market. The classification task is to predict a rise (Up) or a fall (Down) in the electricity price. Concept drift may happen because of changes in consumption habits, unexpected events, and seasonality (Žliobaite 2013).

  • Forest CoverType has 54 attributes with 581,012 instances describing 7 forest cover types for \(30 \times 30\) meter cells obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data, for four wilderness areas located in the Roosevelt National Forest of northern Colorado (Blackard and Dean 1999).

  • Poker hand consists of 1,000,000 instances, where each instance is an example of a hand consisting of five playing cards drawn from a standard deck of 52. Each card is described by two attributes (suit and rank), for ten predictive attributes in total. The class predicts the poker hand (Olorunnimbe et al. 2015).

To the best of our knowledge, there are no real-world datasets publicly available wherein the locations of concept drifts are clearly identified. For instance, there is consensus among researchers that the location and/or presence of concept drift in the Electricity, Forest CoverType, and Poker hand data streams is unknown (Huang et al. 2015; Bifet and Gavalda 2007; Pesaranghader and Viktor 2016; Frías-Blanco et al. 2015; Bifet et al. 2009). Therefore, in addition to the data streams mentioned above, we also used static datasets, publicly available from the UCI machine learning repository (Bache and Lichman 2013), and we simulated concept drift by switching labels at drift points. We considered the Adult (Kohavi 1996), Nursery (Zupan et al. 1997), and Shuttle (Catlett 2002) datasets for our study. We describe the original datasets as well as the preprocessing steps adopted to generate the corresponding data streams below:

  • Adult: The original dataset has six numeric and eight nominal attributes, two class labels, and 48,842 instances, out of which 32,561 instances are used for building the classification models. The dataset was used to predict whether a person earns an annual income greater than $50,000 (Kohavi 1996).

    \(\triangleright \) Preprocessing: The training dataset is imbalanced, with 24,720 instances for class \(\le 50K\) as opposed to 7841 instances for class \(> 50K\). We first undersampled the data, leading to 8200 instances for class \(\le 50K\) and 7800 instances for class \(> 50K\). We subsequently increased the number of instances to 20,000 by bootstrapping.

  • Nursery: The dataset holds eight nominal attributes, five class labels, and 12,960 instances. It was designed to predict whether applications for nursery schools in Ljubljana, Slovenia should be rejected or accepted (Zupan et al. 1997).

    \(\triangleright \) Preprocessing: The dataset consists of 5 classes labelled as ‘no_recom’, ‘recommend’, ‘very_recom’, ‘priority’, and ‘spec_priority’. The occurrences of the third and fourth classes are infrequent and we removed them from the dataset, resulting in a dataset consisting of 20,000 instances.

  • Shuttle: The original dataset contains nine attributes, seven class labels, and 58,000 instances, and was designed to predict suspicious states during a NASA shuttle mission (Catlett 2002).

    \(\triangleright \) Preprocessing: This dataset is also highly imbalanced. Firstly, instances from the four minority classes were filtered out; undersampling and bootstrapping were then performed, in order to create a dataset of 20,000 instances.

Finally, we simulated concept drift by shifting the class labels after drift points, with a transition length of \(\zeta = 50\), for the new context. Note that we use the term ‘context’ to refer to the interval between two consecutive concept drifts. The final data streams have five contexts, each including 20,000 instances, for 100,000 instances in total.

6.2 Experimental setting

Following Bifet et al. (2009), we used the sigmoid function to simulate abrupt and gradual concept drifts. The function determines the probability of belonging to the new context during the transition between two contexts. The transition length \(\zeta \) allows us to simulate abrupt or gradual concept drifts. It is set to 50 for abrupt concept drifts, and to 500 for gradual concept drifts in all our experiments.
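
As an illustration, the probability of drawing an instance from the new context during such a transition is commonly modelled with a sigmoid of the following form; the slope \(4/\zeta \) and the function names below are our assumptions, written from the description above.

```python
import math
import random

def new_context_probability(t, drift_point, zeta):
    """Probability that instance t belongs to the new context (sigmoid transition)."""
    return 1.0 / (1.0 + math.exp(-4.0 * (t - drift_point) / zeta))

def sample_context(t, drift_point, zeta, rng=random):
    """Pick the context of instance t; zeta = 50 (abrupt) or 500 (gradual)."""
    return 'new' if rng.random() < new_context_probability(t, drift_point, zeta) else 'old'
```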

Pesaranghader and Viktor (2016) proposed an approach to evaluate drift detection methods. They introduced the acceptable delay length notion to count the true positive (TP), false positive (FP), and false negative (FN) numbers. The acceptable delay length \(\varDelta \) is a threshold that determines how far the detected drift may be from its true location for the drift to be considered a true positive (Pesaranghader and Viktor 2016; Krawczyk et al. 2017). That is, we maintain three variables to count the numbers of true positives, false negatives and false positives. These variables are initially set to zero. We increment the number of true positives when the drift detector alarms within the acceptable delay range. Otherwise, we increment the number of false negatives, since the alarm has occurred too late. In addition, the false positive count is incremented when a false alarm occurs outside of the acceptable delay range. Following this approach, we set \(\varDelta \) to 250 for the Sine1, Sine2, Mixed, Stagger, and the real-world data streams, and to 1000 for the Circles and LED data streams. A longer \(\varDelta \) should be considered for data streams with gradual drifts in order to avoid an increase in false negatives (Pesaranghader and Viktor 2016).
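
The bookkeeping described above can be sketched as follows; the function and variable names are illustrative, and ties between overlapping delay ranges are resolved naively.

```python
def evaluate_detections(alarms, true_drifts, delta):
    """Count TP, FP and FN given alarm positions, true drift positions and the
    acceptable delay length delta (all positions are instance indices)."""
    tp = fp = 0
    detected = set()
    for a in sorted(alarms):
        # a true positive: the alarm falls within [d, d + delta] of a drift
        # that has not been credited yet; anything else is a false alarm
        hit = next((d for d in true_drifts
                    if d <= a <= d + delta and d not in detected), None)
        if hit is None:
            fp += 1
        else:
            tp += 1
            detected.add(hit)
    fn = len(true_drifts) - len(detected)   # drifts never detected in time
    return tp, fp, fn
```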

Finally, for \(\hbox {FHDDMS}_{\mathrm{add}}\) and FHDDMS, the sizes of the long window and the short window are set to 100 and 25, respectively. The parameter \(\delta \) is set to \(10^{-7}\) in all cases. Recall that, for the other drift detectors, all parameters were set to their default values.

6.3 Evaluation of FHDDMS and \(\hbox {FHDDMS}_{\mathrm{add}}\) drift detectors

In these experiments, our aim is to assess the capabilities of FHDDMS and \(\hbox {FHDDMS}_{\mathrm{add}}\) to detect concept drifts in synthetic as well as in real-world data streams. Firstly, our goal is to determine whether our algorithms are able to detect change fast, while maintaining high accuracies, as measured when considering true positive, false positive, and false negative rates. Secondly, we compare our FHDDMS and \(\hbox {FHDDMS}_{\mathrm{add}}\) methods to the state-of-the-art. Thirdly, we assess the abilities of the detection algorithms to detect drifts without applying a classification algorithm.

6.3.1 Experimental evaluation on synthetic data streams

We compare the performances of FHDDMS and \(\hbox {FHDDMS}_{\mathrm{add}}\) against DDM, EDDM, HDDMs, CUSUM, Page-Hinkley, ADWIN, SeqDrift2 and FHDDM. We considered Naive Bayes (NB) and Hoeffding Tree (HT) as our incremental learners. We ran each classifier-detector pair 100 times. Recall that we maintained true positive, false positive, and false negative counters for each run, by considering the corresponding acceptable delay length \(\varDelta \) of each data stream. We captured the memory usage of the drift detectors after each alarm, and then averaged them once all instances were processed. The overall detection runtime of the drift detectors as well as the overall error-rates of the classifiers were computed for each run. Please note that the memory usage and the runtime are recorded in bytes and milliseconds, respectively. Further, we averaged the detection delays, true positives, false positives, false negatives, total detection runtimes, and memory usage of the drift detectors, as well as the error-rates of the classifiers, over all iterations.

I. Drift Detection with Naive Bayes—Tables 5, 6 and 7 summarize the experimental results for Naive Bayes (NB) with all drift detectors. As indicated in Tables 5 and 6, \(\hbox {HDDM}_{W{-}test}\) and FHDDMS have the shortest detection delays, followed by \(\hbox {FHDDM}_{100}\) and \(\hbox {FHDDMS}_{\mathrm{add}}\). On the other hand, EDDM yields the longest delay before detection, followed by PageHinkley and SeqDrift2 for Sine1 and Mixed, as well as DDM for Circles. While \(\hbox {HDDM}_{W{-}test}\) has shorter detection delays than FHDDMS against abrupt concept drifts, they yield similar detection delays for the Circles and LED data streams containing gradual concept drifts (Table 6). Overall, \(\hbox {FHDDMS}_{\mathrm{add}}\), FHDDMS, \(\hbox {FHDDM}_{25}\), \(\hbox {FHDDM}_{100}\), and CUSUM have the lowest false positive rates. ADWIN and SeqDrift2 show large numbers of false positives when they are used in conjunction with the LED data stream. Since EDDM does not find concept drifts within the acceptable delay lengths, it resulted in the highest false negative numbers. \(\hbox {FHDDMS}_{\mathrm{add}}\), FHDDMS, FHDDMs, HDDMs, and CUSUM have the lowest false negative rates. \(\hbox {FHDDMS}_{\mathrm{add}}\) outperforms FHDDMS and \(\hbox {FHDDM}_{100}\) in terms of memory consumption and detection runtime. ADWIN and SeqDrift2 require more memory than the other approaches.

Table 5 Naive Bayes classifier and drift detectors against Synthetic data streams with abrupt concept drifts
Table 6 Naive Bayes classifier and drift detectors against Synthetic data streams with gradual concept drifts

Finally, as shown in Table 7, we obtained the lowest classification error-rates with FHDDMS and FHDDMs. In general, the classification error-rates are comparable for all data streams, except for LED, where ADWIN and SeqDrift2 have much higher error-rates.

Table 7 Naive Bayes error-rates against all Synthetic data streams

II. Drift Detection with Hoeffding Trees—The experimental results when using Hoeffding Trees with drift detectors are shown in Tables A.1–A.3. Considering Tables A.1 and A.2, we observe that FHDDMS, FHDDMs, and \(\hbox {HDDM}_{\mathrm{W{-}test}}\) have the shortest drift detection delays. \(\hbox {HDDM}_{\mathrm{W{-}test}}\) has the shortest detection delays when the concept drifts are abrupt, whereas FHDDMS has the shortest detection delays when the concept drifts are gradual. FHDDMS, FHDDMs, CUSUM, and DDM produced the fewest false positives of all drift detectors. The false positive numbers of HDDMs are consistently higher than those of FHDDMS and FHDDMs. EDDM, again, resulted in the highest false negative rates.

One should notice that false positives are more common with the Hoeffding Tree than with Naive Bayes. This indicates that the Hoeffding Tree may not represent decision boundaries adequately, which misleads drift detection methods and consequently causes more false alarms. As Table A.3 shows, FHDDMS and FHDDMs led to the lowest classification error-rates. In general, the classification error-rates are similar for all data streams, except for LED, where ADWIN and SeqDrift2 again resulted in higher error-rates.

III. Drift Detection Assessment—Next, we investigate the capabilities of the FHDDMS and \(\hbox {FHDDMS}_{\mathrm{add}}\) algorithms to detect drifts. Here, our aim is to determine whether the detection algorithms are able to detect change in an accurate manner, without the assistance of a classification algorithm. To this end, we present the results against synthetic data streams containing 50,000 bits, following (Bifet and Gavalda 2007; Huang et al. 2015). In this setting, each stream contains 5 segments, where each segment holds 10,000 bits. We simulated abrupt and gradual concept drifts at the center of each segment. That is, in a stream, there exist 5 drifts at locations 5000, 15,000, 25,000, 35,000, and 45,000, respectively. The probability of observing a 1 as a bit before and after a drift is \(80\%\) and \(20\%\), respectively. To simulate gradual drifts, we considered two values, 0.005 and 0.001, for the change ratio r. Here, the change ratio defines the rate of shift from one context to another. The acceptable delay length \(\varDelta \) was set to 5000 for measuring the detection delay, as well as the false positive (FP) and false negative (FN) numbers. We conducted a set of preliminary experiments to find unbiased values for the r and \(\varDelta \) parameters. Table 8 summarizes the results of these experiments in terms of average detection delay, false positive (FP), and false negative (FN) numbers. The reader should also note that we generated 100 streams for each experiment and averaged the results for each method. We discuss the results as follows.
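
A sketch of how such a bit stream can be generated is given below; the probabilities and drift locations follow the description above, while the linear interpretation of the change ratio r for gradual drifts is an assumption on our part.

```python
import random

def bit_stream(n_segments=5, seg_len=10_000, p_before=0.8, p_after=0.2, r=None):
    """Generate a synthetic bit stream with one drift at the centre of each
    segment. If `r` is None the drift is abrupt; otherwise the probability of
    observing a 1 moves from p_before to p_after at rate `r` per instance
    (one plausible reading of the change ratio)."""
    bits = []
    for _ in range(n_segments):
        centre = seg_len // 2
        p = p_before
        for i in range(seg_len):
            if r is None:
                p = p_before if i < centre else p_after   # abrupt switch
            elif i >= centre and p > p_after:
                p = max(p_after, p - r)                   # gradual shift
            bits.append(1 if random.random() < p else 0)
    return bits

stream = bit_stream(r=0.005)   # gradual drifts centred at 5000, 15000, 25000, ...
```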

Table 8 Experiments with synthetic bit streams

Table 8 shows that, for the case of abrupt drifts, FHDDMS, \(\hbox {FHDDM}_{25}\), and \(\hbox {HDDM}_{\mathrm{W{-}test}}\) had the shortest detection delays. On the other hand, PageHinkley and EDDM resulted in the longest detection delays. FHDDMS and FHDDMs led to the lowest false positive numbers. \(\hbox {HDDM}_{\mathrm{W{-}test}}\) produced more false positives than FHDDMS, FHDDM, and \(\hbox {HDDM}_{\mathrm{A{-}test}}\). Similar to the previous experiments, EDDM showed the highest number of false positives. Although PageHinkley produced few false positives, it had the greatest number of false negatives among all methods. EDDM and PageHinkley also showed long detection delays.

As to gradual drifts, we obtained the shortest detection delays with FHDDMS, FHDDMs, and \(\hbox {HDDM}_{\mathrm{W{-}test}}\); nevertheless, \(\hbox {HDDM}_{\mathrm{W{-}test}}\) produced more false positives. \(\hbox {FHDDMS}_{\mathrm{add}}\) and \(\hbox {FHDDM}_{100}\) resulted in lower false positive numbers when \(r=0.005\), whereas CUSUM and DDM were more accurate when \(r=0.001\). Similar to the case of abrupt drifts, EDDM and PageHinkley had the highest false positive and false negative numbers, respectively. They also detected drifts with lengthy delays. Further, the reader may notice that false positives increased as the change ratio decreased from 0.005 to 0.001. Indeed, a slower shift makes it harder for all methods to accurately detect drifts, which explains this abundance of false positives.

In summary, our results confirm the abilities of the FHDDM family of drift detection algorithms to detect drift, in the absence of a classification algorithm.

IV. Conclusion—Recall that the aim of these experiments was threefold. Firstly, we assessed the performance of the FHDDM family of drift detection algorithms. We observed that FHDDMS yields better performance than \(\hbox {FHDDM}_{25}\) and \(\hbox {FHDDM}_{100}\) against data streams containing both abrupt and gradual concept drifts. That is, the stacking of sliding windows helped to detect concept drifts with shorter detection delays and fewer false negatives. Recall that this method slides a short window as well as a long window over the prediction results. The short window finds abrupt drifts with shorter delays, while the long window detects gradual drifts with fewer false negatives. \(\hbox {FHDDMS}_{\mathrm{add}}\) had fewer false positives, lower memory usage, and shorter runtime compared to FHDDMS. Secondly, our results confirm that our algorithms perform well when compared to the state-of-the-art. Although \(\hbox {HDDM}_{W{-}test}\) had detection delays similar to those of FHDDMS and FHDDMs, it resulted in higher false positive rates. Finally, our experimental evaluation shows the correctness of our approaches when detecting drift without the aid of a classification algorithm.
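
To make the stacked-window idea concrete, the following minimal sketch monitors the stream of binary prediction results with a short and a long window, using Hoeffding-style thresholds derived from the window sizes and \(\delta \) reported above. The bookkeeping is deliberately simplified and should not be read as the exact published algorithm.

```python
import math
from collections import deque

class StackedWindowDetector:
    """Sketch of the stacked-window idea: a short and a long sliding window
    over the binary prediction results, each with its own threshold."""
    def __init__(self, short=25, long=100, delta=1e-7):
        self.windows = [deque(maxlen=short), deque(maxlen=long)]
        self.eps = [math.sqrt(math.log(1.0 / delta) / (2 * n)) for n in (short, long)]
        self.mu_max = [0.0, 0.0]

    def reset(self):
        for w in self.windows:
            w.clear()
        self.mu_max = [0.0, 0.0]

    def detect(self, correct):
        drift = False
        for k, w in enumerate(self.windows):
            w.append(1 if correct else 0)
            if len(w) == w.maxlen:                     # window is full
                mu = sum(w) / len(w)
                self.mu_max[k] = max(self.mu_max[k], mu)
                if self.mu_max[k] - mu >= self.eps[k]:
                    drift = True                       # significant accuracy drop
        return drift
```

The short window reacts quickly to sharp drops in accuracy (abrupt drifts), while the long window accumulates enough evidence to catch slow degradations (gradual drifts).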

6.4 Experiments on real-world data streams

In this section, we present the results of our experiments on the Electricity, Forest CoverType, and Poker hand data streams, as introduced in Sect. 6.1.2, using Naive Bayes (NB) and Hoeffding Tree (HT) as the incremental learners. Again we stress that, as pointed out by Huang et al. (2015) and Bifet et al. (2009), the locations of concept drifts are not known in these data streams. We therefore follow the work of Huang et al. (2015) and establish our evaluations based on the number of alarms for concept drift and the classification error-rates. Our experimental results are summarized in Table 9 as well as Tables A.4 and A.5.

Since we are not aware of the exact drift locations, we do not draw strong conclusions about the drift detection performance of the algorithms. However, when considering the Electricity data stream, the reader will notice that our FHDDMS algorithms and the HDDM algorithms obtained the lowest error-rates, when combined with both classifiers. The HDDMs and EDDM algorithms alarmed most often for concept drift, while the FHDDMS algorithms signaled drifts more often than the remaining techniques. Moreover, the memory usage and runtimes of our methods are comparable to the state-of-the-art. Similar observations hold for the Forest Covertype and Poker Hand data streams, as represented in Tables A.4 and A.5.

Overall, these results indicate that no single drift detector outperforms the others in all settings. The reader should recall that, based on this observation, we introduced the Tornado framework, which constantly recommends the currently best performing (classifier, detector) pair to the user. Next, we discuss our experimental evaluation of the Tornado framework.

Table 9 Naive Bayes and Hoeff. Tree against Electricity Data Stream

6.5 Experimental evaluation of the Tornado framework

This section presents the experimental results for the Tornado framework for synthetic and real-world data streams. In these experiments, our aim is to illustrate the advantages of employing multiple classifiers and drift detectors in parallel. Our goal is to confirm that the best (classifier, detector) pair varies, as the characteristics of a stream change over time. Further, our objective is to confirm that the best pair is highly domain dependent; and we explore whether this observation holds against synthetic and real-world streams.

Five learning algorithms were evaluated, namely incremental Naive Bayes (NB), Decision Stump (DS), Hoeffding Tree (HT), Perceptron (PR), and 5 Nearest Neighbors (5-NN) learners. In addition, the drift detection methods previously introduced in Sect. 2.2 were paired with the classifiers. As a result, we have a total of 60 (classifier, detector) pairs. The scores associated with the (classifier, detector) pairs are calculated utilizing the CAR measure, as defined in Eq. (3.7), as instances are prequentially processed over time. The experimental results as well as the recommended (classifier, detector) pairs against various data streams are presented in the following subsections.
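
As an illustration of how the pairs are ranked, the following sketch computes a weighted score over the six measured quantities (error-rate, detection delay, false positives, false negatives, memory, runtime) and orders the pairs accordingly; the min-max normalization shown here is illustrative and does not reproduce the exact form of Eq. (3.7).

```python
import numpy as np

# Measured quantities per (classifier, detector) pair, in this fixed order:
# error-rate, detection delay, false positives, false negatives, memory, runtime.
# Lower raw values are better for every entry.
MEASURES = ["error", "delay", "fp", "fn", "memory", "runtime"]

def rank_pairs(stats, weights):
    """Illustrative CAR-style ranking: min-max normalize each measure across
    all pairs, reward low values, and sort by the weighted sum (higher = better)."""
    names = list(stats)
    m = np.array([stats[name] for name in names], dtype=float)
    lo, hi = m.min(axis=0), m.max(axis=0)
    norm = (m - lo) / np.where(hi > lo, hi - lo, 1.0)       # 0 = best, 1 = worst
    scores = (1.0 - norm) @ np.asarray(weights, dtype=float)
    return sorted(zip(names, scores), key=lambda p: p[1], reverse=True)

# Made-up illustrative numbers for two pairs, with equal weights for all measures.
stats = {"NB+FHDDMS": [0.12, 40, 1, 0, 500, 90],
         "HT+ADWIN":  [0.15, 120, 6, 1, 4000, 260]}
print(rank_pairs(stats, weights=[1, 1, 1, 1, 1, 1]))
```

Setting an entry of the weight vector to zero removes the corresponding measure from the ranking, which mirrors the weight variations discussed below.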

Fig. 9 Recommended Classifier+Detector Pairs out of 60 Pairs against Sine1 Data Stream with Abrupt Concept Drifts

6.5.1 Synthetic data streams

This subsection summarizes our results against the synthetic data streams, when all the (classifier, detector) pairs are executed in parallel, and are ranked over time. First, we present the top ten (10) pairs, ranked by the highest average scores, for the Sine1 data stream. Then, we illustrate the recommended (classifier, detector) pairs for all synthetic data streams in Fig. 9, B.1, and B.2. Note that the results for the other synthetic data streams are listed in the Electronic supplementary material.

Top 10 (Classifier, Detector) Pairs—The pairs with the highest average scores against the Sine1 data stream are shown in Table 10. The table lists, from left to right, the classification error-rate, drift detection delay, drift detection false positive and false negative rates, total memory usage, and runtime. Please note that the memory usage and the runtime are recorded in kilobytes and milliseconds, respectively. For this case, Perceptron (PR) and Naive Bayes (NB) paired with \(\hbox {FHDDM}_{25}\), \(\hbox {FHDDM}_{100}\), FHDDMS, \(\hbox {HDDM}_{\mathrm{A{-}test}}\), and \(\hbox {HDDM}_{\mathrm{W{-}test}}\) obtained the highest average scores. This is due to the fact that they had lower error-rates, shorter detection delays, and reasonable resource usage compared to the other pairs. In general, pairs with Naive Bayes had higher memory usage than pairs that utilized Perceptron; however, they were faster in terms of runtime.

Table 10 Top 10 pairs with highest average scores against Sine1

Illustration of Pair Recommendation—The pairs recommended over time by the Tornado framework against the synthetic data streams with abrupt and gradual concept drifts are indicated in Fig. 9 and B.1 and Fig. B.2, respectively. We utilize different colors, where each square represents one time unit, in order to show the recommended pairs over time. The pairs that use members of the FHDDM family as drift detectors are indicated in shades of blue, while the HDDM variants are displayed in shades of green. The other drift detector pairs are displayed in shades of yellow (for CUSUM and PageHinkley), orange (for DDM and EDDM), red (for ADWIN), and pink (for SeqDrift2). In our figures, the time line begins at the top-left corner and then unfolds, line by line, from left to right. The locations of the drift points in the data streams are marked by a dedicated symbol, which may thus be used to locate the drift points in the diagrams.

We set all weights to 1 for the experiments against the Sine1, Sine2, Mixed, and Stagger data streams. Recall that, when all weights are set to 1, it is assumed that the corresponding quantities are of equal importance. The results confirm that no pair outperforms the others, and that no drift detector or classifier dominates. Initially, there are larger fluctuations in recommended pairs. In summary, Fig. 9 and B.1 show that, in general, the pairs with the FHDDM/S and HDDM variants are often recommended for having higher scores. The results also show that the Naive Bayes and Perceptron classifiers are often preferred, since they are light, fast, and accurate, particularly when the weights are equal. This is, however, not the case when the weights are distinct, for instance, as shown in Fig. B.2.

Figure B.2a illustrates the impact of the weight vector against the Circles and LED data streams, which contain gradual drifts. The pairs NB+PageHinkley, NB+\(\hbox {FHDDM}_{100}\), and NB+\(\hbox {HDDM}_{\mathrm{W{-}test}}\) outperform the others for the Circles data stream when all the weights are set to one. In contrast, when \(\overrightarrow{w} = [1.5\;1\;2\;1.5\;0\;0.5]^T\), the pairs HT+\(\hbox {HDDM}_{\mathrm{W{-}test}}\) and HT+\(\hbox {FHDDMS}_{\mathrm{add}}\) dominate the others for the second half of the stream. In this case, the memory and runtime become less important, and the resulting best pair reflects this change. That is, the entries were chosen so that the pairs with lower error-rates, shorter detection delays, and fewer false positives and false negatives obtain higher scores. Memory consumption was not taken into account, as \(w_m = 0\). Finally, Fig. B.2b shows that the pairs PR+\(\hbox {FHDDM}_{25}\), NB+FHDDMS, and NB+\(\hbox {FHDDM}_{100}\) are recommended over time against the LED data stream when all the weights are equal to one. Alternatively, when the weight entries are set to \(\overrightarrow{w} = [3\;0\;1.5\;1\;2\;2]^T\), the pairs PR+\(\hbox {FHDDM}_{25}\), PR+\(\hbox {FHDDM}_{100}\), PR+PageHinkley, and also PR+DDM are recommended. In this case, memory usage and runtime are the two preferred measures, and consequently the Perceptron classifier outperformed the Naive Bayes one.

6.5.2 Real-world data streams

Our experimental results for the real-world data streams are reported in Tables A.6–A.8 as well as in Figs. B.3 and B.4. Recall that the pairs that use members of the FHDDM family as drift detectors are indicated in shades of blue, while the HDDM variants are displayed in shades of green. The other drift detector pairs are displayed in shades of yellow (for CUSUM and PageHinkley), orange (for DDM and EDDM), red (for ADWIN), and pink (for SeqDrift2).

Tables A.6 to A.8 list the pairs with the highest average scores for the Adult, Nursery, and Shuttle data streams. It is often seen that the pairs of Naive Bayes (NB), Decision Stump (DS), and Perceptron (PR) with FHDDMS, FHDDMs, HDDMs, and ADWIN are among them. Although we observe a few false positives (FP) for the pairs with ADWIN, their low error-rates and short drift detection delays help them to rank among the top 10.

Figures B.3 and B.4 show the recommended pairs as the input data are processed. The reader should notice that no single pair outperforms the others in all cases, and that the best pair changes over time. The figures also indicate that the best pairs change rapidly at the beginning of the stream, due to the lack of adequate information about their performance, while the optimal pairs remain more stable towards the end of the streams. It is also worth noting that varying the weights impacts the recommendations, as may be seen when comparing the left and right sides of Figs. B.3 and B.4. Again, the best pair is highly dependent on the weights, especially when we vary the weights of memory and runtime versus error-rate considerations.

In summary, our results confirm that the best (classifier, detector) pairs vary as the stream evolves, and that the choice is highly domain dependent.

7 Discussion

We introduced the CAR measure in order to monitor the overall performance of adaptive classification models against evolving data streams. We also presented Tornado, a framework that runs heterogeneous pairs of classifiers and drift detectors in parallel against data streams, while continuously recommending the best performing pair to the user. This recommendation is based on the weights assigned to the error-rate, drift detection sensitivity, runtime, and memory consumption.

In addition, we extended our earlier work and detailed FHDDMS as well as its variant \(\hbox {FHDDMS}_{\mathrm{add}}\), in order to detect abrupt concept drifts with shorter delays and to reduce the number of false negatives when a gradual drift is present. FHDDMS slides a long and a short window, stacked on top of each other, to detect concept drifts. The longer window reduces the number of false negatives, while the shorter one detects drifts faster. In this study, we restricted ourselves to two windows, though more windows could be employed. During the evaluation of drift detection methods, we observed that \(\hbox {HDDM}_{\mathrm{W{-}test}}\) and our FHDDMS and FHDDM algorithms are comparable in terms of the various performance measures. \(\hbox {HDDM}_{\mathrm{W{-}test}}\) outperformed FHDDMS in terms of detection delay for abrupt concept drifts. On the other hand, FHDDMS was better suited to detect gradual concept drifts. In either case, \(\hbox {HDDM}_{\mathrm{W{-}test}}\) showed higher false positive rates than FHDDMS and FHDDM.

We conducted experiments using the Tornado framework against synthetic and real-world data streams. Our experimental setup consisted of 60 pairs of learners and detectors, each of which was evaluated in parallel against various data streams. The experimental results clearly show, as expected, that no specific pair dominates in all cases. In the vast majority of cases, the pairs that contain the Naive Bayes or Perceptron classifiers yielded the best results when all the weights are equal. These two algorithms provide a balance between memory usage, runtime, and error-rate. This stands in contrast to the Hoeffding Tree, Decision Stump, and other learners. The Hoeffding Tree algorithm is generally expensive in terms of memory consumption, as the tree grows. In addition, its runtime may increase when branching decisions become difficult. The K-NN algorithm is a lazy learner, meaning that it is demanding in terms of both memory and runtime. These two methods are therefore more suitable when runtime and memory considerations are of less importance.

Overall, as represented in Figs. 9, B.1, and B.2, the pairs with \(\hbox {HDDM}_{\mathrm{W{-}test}}\) are ranked higher than those with FHDDM for data streams containing abrupt drifts, e.g. Sine2 and Stagger; whereas the pairs with FHDDMS and FHDDM were recommended for the data streams with gradual concept drifts, e.g. the LED data stream. It is worth mentioning that the pairs with the other drift detectors ranked lower, because of their longer drift detection delays and higher false positive rates.

8 Conclusion and future work

Increasingly, there is a need for near real-time adaptive learning methods to explore dynamically evolving data streams. Such algorithms should provide decision makers with realistic, just-in-time models for short-term and mid-term decision making against today’s vast streams of data. These models should not only be timely, but also be accurate and able to swiftly adapt to changes in the data. Intuitively, no adaptive learning strategy outperforms others in all settings. Similarly, the effectiveness of drift detection methods is determined by the data characteristics, the types of drifts and the rates of true positives and true negatives, among others.

Based on these observations, we created a reservoir of diverse adaptive learners and drift detection algorithms, as implemented in our Tornado framework. In our work, we considered all (classifier, detector) pairs and then utilized them to construct models in parallel. Continuously, the current ‘best’ model was selected and provided to the user. Further, two new drift detection methods, namely the FHDDMS and \(\hbox {FHDDMS}_{\mathrm{add}}\) algorithms, were introduced in this paper. Our extensive experimental results confirmed that the current best (classifier, detector) pair varies over time and is sensitive to concept drift. Further, we showed that the two FHDDMS variants outperformed the state-of-the-art when evaluated in terms of the holistic CAR measure.

We encountered several interesting avenues for future work. We plan to compare our Tornado framework to existing ensembles of classifiers, and the incorporation of ensembles into the framework also merits consideration. In our current research, we implemented our Tornado framework on a single machine. We are now designing a hybrid environment where we utilize Cloud services together with mobile devices, such as tablets. This research is motivated by our observation that, in many settings, such as environmental impact studies and emergency response, domain experts require the ability not only to receive up-to-date models on their mobile devices, but also to build their own models locally. In this case, we foresee that the heavy ‘bulk’ analytics would be performed in the Cloud, while the mobile devices would contain personalized, lightweight algorithms. Domain experts are often eager to include their own expertise, and thus an active learning component might prove useful. In addition, we believe that, in such a scenario, the idea of combining lightweight data analytics design with hardware-driven design (Žliobaite et al. 2015b) may lead to even more efficient algorithms.

In our current work, we in essence simplified a multi-objective optimization function through linear scalarization. We are interested in extending our work to explore whether Pareto optimization could be performed in real time. If no particular application domain is targeted, the optimization could be performed directly. If the optimization is application-domain dependent, the multi-objective function should be supplemented with constraints. For instance, if the memory resources are limited (as in mobile applications) and false positives must be avoided (as in medical applications), such a constrained multi-objective optimization may be performed using the Karush-Kuhn-Tucker (KKT) conditions. The best way to define the multi-objective function will be a subject of our future investigation.