On evaluating stream learning algorithms
Abstract
Most streaming decision models evolve continuously over time, run in resource-aware environments, and detect and react to changes in the environment generating the data. One important issue, not yet convincingly addressed, is the design of experimental work to evaluate and compare decision models that evolve over time. This paper proposes a general framework for assessing predictive stream learning algorithms. We defend the use of the prequential error with forgetting mechanisms to provide reliable error estimators. We prove that, on stationary data and for consistent learning algorithms, the holdout estimator, the prequential error, and the prequential error estimated over a sliding window or using fading factors all converge to the Bayes error. The use of the prequential error with forgetting mechanisms proves advantageous in assessing performance and in comparing stream learning algorithms. The proposed methods are also worthwhile for hypothesis testing and for change detection. In a set of experiments in drift scenarios, we evaluate the ability of a standard change detection algorithm to detect change using three prequential error estimators. These experiments show that forgetting mechanisms (sliding windows or fading factors) are required for fast and efficient change detection. In comparison to sliding windows, fading factors are faster and memoryless, both important requirements for streaming applications. Overall, this paper is a contribution to a discussion on best practice for performance assessment when learning is a continuous process and the decision models are dynamic and evolve over time.
Keywords
Data streams · Evaluation design · Prequential analysis · Concept drift
1 Introduction
The last twenty years or so have witnessed large progress in machine learning and in its capability to handle real-world applications. Nevertheless, machine learning has so far mostly centered on one-shot data analysis from homogeneous and stationary data, and on centralized algorithms. Most machine learning and data mining approaches assume that examples are independent, identically distributed and generated from a stationary distribution. A large number of learning algorithms assume that computational resources are unlimited, for example, that data fits in main memory. In that context, standard data mining techniques use finite training sets and generate static models. Nowadays we are faced with a tremendous amount of distributed data generated by an ever-increasing number of smart devices. In most cases, this data is transient and may not even be stored permanently. Our ability to collect data is changing dramatically: computers and small devices send data to other computers, and we face distributed sources of detailed data. Data continuously flows, eventually at high speed, generated from non-stationary processes. Examples of data mining applications faced with this scenario include sensor networks, social networks, user modeling, radio frequency identification, web mining, scientific data, financial data, etc.
For illustrative purposes, consider a sensor network. Sensors are geographically distributed and produce high-speed distributed data streams. They measure some quantity of interest, and we are interested in predicting that quantity for different time horizons. Assume that at time t our predictive model makes a prediction \(\hat{y}_{t+k}\) for time t+k, where k is the desired forecast horizon. Later on, at time t+k, the sensor measures the quantity of interest y _{ t+k }. With a delay of k we can estimate the loss of our prediction, \({L}(\hat {y}_{t+k},y_{t+k})\).
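This delayed-feedback loop can be sketched in a few lines of Python. The sketch is ours, not the paper's: `predict` and `update` are hypothetical callables standing in for any incremental model, and a 0-1 loss is assumed.

```python
from collections import deque

def delayed_prequential(stream, predict, update, k):
    """Prequential 0-1 losses with a forecast horizon k.

    The forecast made at time t targets the quantity observed at time t+k,
    so it can only be scored k steps later, when its true value arrives.
    `stream` yields (x, y) pairs; `predict`/`update` are hypothetical hooks
    into an incremental model.
    """
    pending = deque()              # forecasts waiting for their true value
    losses = []
    for x, y in stream:
        if len(pending) == k:      # forecast made k steps ago is due now
            losses.append(0 if pending.popleft() == y else 1)
        pending.append(predict(x))
        update(x, y)               # the model keeps learning as data arrives
    return losses
```

The first k forecasts remain unscored until their target values arrive, which is exactly the limited-feedback situation the prequential framework accommodates.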
Recent learning algorithms such as Cormode et al. (2007), Babcock et al. (2003), Rodrigues et al. (2008) for clustering; Domingos and Hulten (2000), Hulten et al. (2001), Gama et al. (2003), Liang et al. (2010) for decision trees; Ferrer-Troyano et al. (2004), Gama and Kosina (2011) for decision rules; Li et al. (2010), Katakis et al. (2010) in change detection; Giannella et al. (2003), Chi et al. (2006) in frequent pattern mining, etc., maintain a model that continuously evolves over time, taking into account that the environment is non-stationary and computational resources are limited. Examples of publicly available software for learning from data streams include: the VFML toolkit (Hulten and Domingos 2003) for mining high-speed time-changing data streams, the MOA system (Bifet et al. 2010a) for learning from massive data sets, RapidMiner (Mierswa et al. 2006) a data mining system with a plug-in for stream processing, etc.
In the streaming setting, the assumptions of batch learning no longer hold:
we have a continuous flow of data instead of a fixed sample of i.i.d. examples;

decision models evolve over time instead of being static;

data is generated by non-stationary distributions instead of a stationary sample.
The design of experimental studies is of paramount importance (Japkowicz and Shah 2011). It is a necessary condition, although not sufficient, for reproducibility, that is, the ability of an experiment or study to be accurately replicated by someone else working independently. Dietterich (1996) proposes a straightforward technique to evaluate learning algorithms when data is abundant: “learn a classifier from a large enough training set and apply the classifier to a large enough test set.” Data streams are open-ended. This could facilitate the evaluation methodologies, because we have train and test sets as large as desired. The problem we address in this work is: Is this sampling strategy viable in the streaming scenario?
Table 1 Differences between batch and streaming learning that may affect the way evaluation is performed. While batch learners build static models from finite, static, i.i.d. data sets, stream learners need to build models that evolve over time, and are therefore dependent on the order of the examples, which are generated from a continuous non-stationary flow of non-i.i.d. data
                        Batch            Stream
Data size               Finite data set  Continuous flow
Data distribution       i.i.d.           Non-i.i.d.
Data evolution          Static           Non-stationary
Model building          Batch            Incremental
Model stability         Static           Evolving
Order of observations   Independent      Dependent
To cope with this stream evaluation scenario, the approach we propose is based on sequential analysis. Sequential analysis refers to the body of statistical theory and methods where the sample size may depend in a random manner on the accumulating data (Ghosh and Sen 1991). This paper is a substantial extension of a previous one (Gama et al. 2009), published in a top-level data mining conference, where these issues were first enunciated.^{1} More than the technical contribution, this paper is a contribution to a discussion on best practice for performance assessment and differences in performance when learning dynamic models that evolve over time.
The main contributions of this paper are the following:
We prove that, for consistent learning algorithms and an infinite number of examples generated by a stationary distribution, the holdout estimator, the prequential error and the prequential error estimated over a sliding window or using fading factors, converge to the Bayes error;

We propose a faster and memoryless approach, using fading factors, that does not require storing the statistics of a window in memory;

We propose the Q statistic, a fast and incremental statistic to continuously compare the performance of two classifiers;

We show that the use of fading factors in the McNemar test provides results similar to sliding windows;

We show that the proposed error estimates can be used for concept drift detection;

We show that performing the PageHinkley test over the proposed error estimators improves the detection delay, although at the risk of increasing false alarms.
The paper is organized as follows. The next section presents an overview of the main lines presented in the literature in learning from data streams and the most common strategies for performance assessment. In Sect. 3 we discuss the advantages and disadvantages of the prequential statistics in relation to the holdout sampling strategy. We propose two new prequential error rate estimates: prequential sliding windows and prequential fading factors, and present convergence properties of these estimators. In Sect. 4 we discuss methods to compare the performance of two learning algorithms. Section 5 presents a new algorithm for concept drift detection based on sliding windows and fading factors error estimators. The last section concludes the exposition, presenting the lessons learned.
2 Learning from data streams
A stream learning algorithm should:
require small constant time per data example;

use a fixed amount of main memory, irrespective of the total number of examples;

build a decision model using a single scan over the training data;

generate an anytime model, independent of the order of the examples;

ability to deal with concept drift. For stationary data, ability to produce decision models that are nearly identical to the ones we would obtain using a batch learner.
Three resource dimensions constrain the learning process:
space—the available memory is fixed;

learning time—process incoming examples at the rate they arrive; and

generalization power—how effective the model is at capturing the true underlying concept.
The fact that decision models evolve over time has strong implications for the evaluation techniques used to assess the effectiveness of the learning process. Another relevant aspect is resilience to overfitting, since each example is processed only once.
3 Design of evaluation experiments
Designing an evaluation experiment requires answering several questions:
What are the goals of the learning task?

Which are the evaluation metrics?

How to design the experiments to estimate the evaluation metrics?

How to design the experiments to compare different approaches?
3.1 Experimental setup
For predictive learning tasks (classification and regression) the learning goal is to induce a function \(\hat{y} = {\hat{f}}(\mathbf{x})\) that approximates f with arbitrary precision. The most relevant dimension is the generalization error. It is an estimator of the difference between \(\hat{f}\) and the unknown f, and an estimate of the loss that can be expected when applying the model to future examples.
One aspect in the design of experiments that has not been convincingly addressed is that learning algorithms run in computational devices that have limited computational power. For example, many existing learning algorithms assume that data fits in memory; a prohibitive assumption in the presence of open-ended streams. This issue becomes much more relevant when data analysis must be done in situ. An illustrative example is the case of sensor networks, where data flows at high speed and computational resources are quite limited.
The memory available ranges over several orders of magnitude, depending on the environment:
Sensor environment: hundreds of Kbytes;

Handheld computers: tens of Mbytes;

Server computers: several Gbytes.
In batch learning using finite training sets, cross-validation and variants (leave-one-out, bootstrap) are the standard methods to evaluate learning systems. Cross-validation is appropriate for restricted-size datasets, generated by stationary distributions, and assuming that examples are independent. In data stream contexts, where data is potentially infinite and both the distribution generating examples and the decision models evolve over time, cross-validation and other sampling strategies are not applicable. Research communities and users need other evaluation strategies.
3.2 Evaluation metrics
Suppose a sequence of examples in the form of pairs (x _{ i }, y _{ i }). For each example, the current decision model predicts \(\hat{y}_{i}\), which can be either True (\(\hat{y}_{i} = y_{i}\)) or False (\(\hat{y}_{i} \neq y_{i}\)). For each point i in the sequence, the error rate is the probability p _{ i } of observing False. For a set of examples, the error (e _{ i }) is a random variable from Bernoulli trials. The Binomial distribution gives the general form of the probability of the random variable that represents the number of errors in a sample of examples: \(e_{i} \sim\mathrm{Bernoulli}(p_{i}) \Leftrightarrow \operatorname{Prob}(e_{i} = \mathit{False}) = p_{i}\).
Definition
For example, Duda and Hart (1973) prove that the knearest neighbor algorithm is guaranteed to yield an error rate no worse than the Bayes error rate as data approaches infinity. Bishop (1995) presents similar proofs for feedforward neural networks, and Mitchell (1997) for decision trees. The following mathematical proofs that appear in this paper apply to consistent learners.
In the holdout evaluation, the current decision model is applied to a test set at regular time intervals (or set of examples). For a large enough holdout, the loss estimated in the holdout is an unbiased estimator.
Definition
Theorem
(Limit of the holdout error)
For consistent learning algorithms, and for large enough holdout sets, the limit of the loss estimated in the holdout is the Bayes error: lim_{ i→∞} H _{ e }(i)=B.
The proof is in Appendix B.1.
In predictive sequential (or prequential) (Dawid 1984) the error of a model is computed from the sequence of examples. For each example in the stream, the actual model makes a prediction based only on the example attributevalues. We should point out that, in the prequential framework, we do not need to know the true value y _{ i } for all points in the stream. The framework can be used in situations of limited feedback by computing the loss function and S _{ i } for points where y _{ i } is known.
Definition
Theorem
(Limit of the prequential error)
For consistent learning algorithms, the limit of the prequential error is the Bayes error: lim_{ i→∞} P _{ e }(i)=B.
The proof is in Appendix B.2.
From the Hoeffding bound (Hoeffding 1963), one can ensure that, with a probability of at least 95 %, the prequential error estimated over the whole stream converges to its average μ within an error of 1 %, for a sample of at least 18444 observations. The proof can be found in Appendix A.
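The sample size above follows directly from the two-sided Hoeffding bound, n ≥ ln(2/δ)/(2ε²). A minimal sketch (the function name is ours) reproduces the figure quoted in the text:

```python
import math

def hoeffding_sample_size(eps, delta):
    """Sample size n such that, by the two-sided Hoeffding bound, the
    empirical mean deviates from mu by more than eps with probability
    at most delta: n >= ln(2 / delta) / (2 * eps**2)."""
    return math.log(2.0 / delta) / (2.0 * eps ** 2)

# 95 % confidence (delta = 0.05) and a 1 % error (eps = 0.01):
n = hoeffding_sample_size(eps=0.01, delta=0.05)
print(int(n))   # 18444, the number of observations stated in the text
```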
3.2.1 Error estimators using a forgetting mechanism
Prequential evaluation provides a learning curve that monitors the evolution of learning as a process. Using holdout evaluation, we can obtain a similar curve by applying, at regular time intervals, the current model to the holdout set. Both estimates can be affected by the order of the examples. Moreover, it is known that the prequential estimator is pessimistic: under the same conditions it will report somewhat higher errors (see Fig. 2). The prequential error estimated over the entire stream might be strongly influenced by the first part of the error sequence, when few examples have been used to train the classifier. Incremental decision models evolve over time, improving their performance. The decision model used to classify the first example is different from the one used to classify the hundredth instance. This observation leads to the following hypothesis: compute the prequential error using a forgetting mechanism. This might be achieved either using a time window of the most recently observed errors or using fading factors. Considering a sliding window of infinite size or a fading factor equal to 1, these forgetting estimators equal the prequential estimator.
Error estimators using sliding windows
Sliding windows are one of the most used forgetting strategies. They are used to compute statistics over the most recent past.
Definition
Theorem
(Limit of the prequential error computed over a sliding window)
For consistent learning algorithms, the limit of the prediction error computed over a sliding window of (large enough ^{2}) size w is the Bayes error: lim_{ i→∞} P _{ w }(i)=B.
The proof is in Appendix B.3.
Lemma
The prequential error estimator, P _{ e }(i), is greater than or equal to the prequential error computed over a sliding window, P _{ w }(i), considering a sliding window of large enough size w≪i: P _{ e }(i)≥P _{ w }(i).
The proof is in Appendix B.3.
For example, using the Hoeffding bound, with a probability of 95 %, P _{ e }(i) converges to its average μ with an error of 1 % for i=18444. Taking into account the previous lemma, one can work out that at observation i=18444, P _{ w }(i) will be smaller than or equal to P _{ e }(i). Hence, P _{ w }(i) approximates the average μ with an even smaller error.
Error estimators using fading factors
This way, at each observation i, N _{ α }(i) gives an approximate value for the weight given to the recent observations used in the fading sum.
Definition
Theorem
(Limit of the prequential error computed with fading factors)
For consistent learning algorithms, the limit of the prequential error computed with fading factors is approximately the Bayes error: lim_{ i→∞} P _{ α }(i)≈B.
The proof is in Appendix B.4.
Furthermore, the prequential estimator computed using fading factors, P _{ α }(i), will be lower than the prequential error estimator, P _{ e }(i). In Sect. 3.3 we present the pseudocode for the three feasible alternatives to compute the prequential error previously introduced.
The proofs of the previous theorems assume the stationarity of the data and the independence of the training examples. The main lesson is that all of these estimators converge, for an infinite number of examples, to the Bayes error. All these estimators can be used in any experimental study. However, these results are even more relevant when dealing with data with concept drift. Indeed, the above theorems and the memoryless advantage support the use of the prequential error estimated using fading factors to assess the performance of stream learning algorithms in the presence of non-stationary data. We address this topic in Sect. 5.
Illustrative example
The objective of this experiment is to illustrate the demonstrated convergence properties of the error estimates using the strategies described above. The learning algorithm is VFDT as implemented in MOA (Bifet et al. 2010a). The experimental work has been done using the Waveform and the LED (Asuncion and Newman 2007) datasets, because the Bayes error is known: 14 % and 26 %, respectively. We also use the RandomRBF (RBF) and Random Tree (RT) datasets available in MOA. The Waveform stream is a three-class problem defined by 21 numerical attributes, the LED stream is a ten-class problem defined by 7 Boolean attributes, and the RBF and the RT streams are two-class problems defined by 10 attributes.
3.3 Pseudocode of algorithms for prequential estimation
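Since the original algorithm listings are not reproduced in this excerpt, the following Python sketch shows one way to implement the three estimators; the fading recursions follow the definitions of S _{ α }(i) and N _{ α }(i) in Sect. 3.2.1, while the class and method names are ours and 0-1 losses are assumed.

```python
from collections import deque

class Prequential:
    """Landmark prequential error P_e(i): mean of all losses so far."""
    def __init__(self):
        self.s, self.n = 0.0, 0
    def add(self, loss):
        self.s += loss
        self.n += 1
        return self.s / self.n

class PrequentialWindow:
    """Prequential error P_w(i) over the w most recent losses."""
    def __init__(self, w):
        self.window = deque(maxlen=w)   # old losses drop out automatically
    def add(self, loss):
        self.window.append(loss)
        return sum(self.window) / len(self.window)

class PrequentialFading:
    """Memoryless prequential error P_alpha(i) with fading factor alpha:
    S_alpha(i) = loss_i + alpha * S_alpha(i-1)
    N_alpha(i) = 1 + alpha * N_alpha(i-1)
    P_alpha(i) = S_alpha(i) / N_alpha(i)."""
    def __init__(self, alpha):
        self.alpha, self.s, self.n = alpha, 0.0, 0.0
    def add(self, loss):
        self.s = loss + self.alpha * self.s
        self.n = 1.0 + self.alpha * self.n
        return self.s / self.n
```

As noted in Sect. 3.2.1, with an infinite window or α=1 the forgetting estimators coincide with the landmark prequential estimator; only the fading version needs constant memory.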
3.4 Discussion
All the error estimators discussed so far apply to stationary processes. Under this constraint, the first relevant observation is: The prequential error estimated over all the history is pessimistic. It is a pessimistic estimate because the decision model is not constant and evolves over time. The decision model used to classify the nth instance is not the same used to classify the first instance. Under the stationarity assumption, the performance of the decision model improves with more labeled data. The second relevant observation is that the proposed forgetting mechanisms (sliding windows and fading factors) preserve convergence properties similar to the holdout error estimator. The use of these error estimators in non-stationary and evolving data is discussed in Sect. 5.
Error estimators using sliding windows are additive; the contribution of a term is constant while it is inside the window, and forgetting (when the term leaves the window) is abrupt. The key difficulty is how to select the appropriate window size. A small window can assure fast adaptability in phases with concept changes, while large windows produce lower-variance estimators in stable phases but cannot react quickly to concept changes. The fading factors are multiplicative, corresponding to exponential forgetting. A term always contributes to the error, smoothly decreasing as more terms become available.
Error estimates using sliding windows and fading factors require user-defined parameters: the window size and the fading factor. It is known that estimating unknown parameters over larger sample sizes (in stationary environments) generally leads to increased precision. Nevertheless, beyond some sample size the gain in precision is minimal, or even non-existent. The first question we need to discuss is: What are the admissible values for the window size? The estimator of the error rate, a proportion, is given by \(\hat {p}=\frac{\#e}{n}\), where #e is the number of errors. When the observations are independent, this estimator has a (scaled) Binomial distribution. The maximum variance of this distribution is 0.25/n, which occurs when the true parameter is p=0.5. For large n, the distribution of \(\hat{p}\) can be approximated by a normal distribution. For a confidence level of 95 %, the true p lies inside the interval \(\{ \hat{p} - 2\sqrt{0.25/n}; \hat{p} + 2\sqrt{0.25/n} \}\). If the half-width of this interval cannot be larger than ϵ, we can solve the equation \(2\sqrt{0.25/n} = \epsilon\) for n, giving n=1/ϵ ^{2}. For example, a window of n=1000 implies ϵ≈3 %, while ϵ=1 % requires a sample size of n=10000. With respect to fading factors, the equivalent question is: What are the admissible values for the fading factor? At timestamp t the weight of the example observed at time t−k is α ^{ k }. For example, using α=0.995 the weight associated with the first term after 3000 examples is 2.9E−7. Assuming that we can ignore examples with weights less than ϵ, an upper bound for k (the set of “important” examples) is log(ϵ)/log(α). The fading factors are memoryless, an important property in streaming scenarios. This is a strong advantage over sliding windows, which require maintaining in memory all the observations inside the window.
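The two parameter rules above are easy to check numerically. A small sketch (function names are ours), assuming the 95 % normal approximation and the weight threshold reasoning of the paragraph:

```python
import math

def window_for_precision(eps):
    """Window size whose 95 % CI half-width on an error-rate estimate
    is at most eps: solve 2*sqrt(0.25/n) = eps, i.e. n = 1/eps**2."""
    return math.ceil(1.0 / eps ** 2)

def horizon_for_fading(alpha, eps):
    """Number of recent examples whose weight alpha**k still exceeds eps:
    k < log(eps) / log(alpha)."""
    return math.log(eps) / math.log(alpha)

print(window_for_precision(0.01))        # 10000, as in the text
print(f"{0.995 ** 3000:.1e}")            # 2.9e-07: weight of the first term
```

For ϵ=3 % the formula gives n≈1111, which the text rounds to the n=1000 example.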
4 Comparative assessment
In this section we discuss methods to compare the performance of two algorithms (A and B) in a stream. Our goal is to distinguish between random and nonrandom differences in experimental results.
4.1 The 0–1 loss function
For classification problems, one of the most used tests is the McNemar test.^{3} McNemar’s test is a nonparametric method used on nominal data. It assesses the significance of the difference between two correlated proportions, where the two proportions are based on the same sample. It has been observed that this test has acceptable type I error (Dietterich 1996; Japkowicz and Shah 2011).
To be able to apply this test we only need to compute two quantities n _{ i,j }: n _{0,1} denotes the number of examples misclassified by A and not by B, whereas n _{1,0} denotes the number of examples misclassified by B and not by A. The contingency table can be updated on the fly, which is a desirable property in mining highspeed data streams. The statistic \(M=\operatorname{sign}(n_{0,1}n_{1,0}) \times\frac {(n_{0,1}n_{1,0})^{2}}{n_{0,1}+n_{1,0}}\) has a χ ^{2} distribution with 1 degree of freedom. For a confidence level of 0.99, the null hypothesis is rejected if the statistic is greater than 6.635 (Dietterich 1996).
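The on-the-fly update of the contingency counts and the statistic above can be sketched as follows (the class is ours; the statistic and the 6.635 critical value are those given in the text):

```python
class McNemarStream:
    """Incremental McNemar test over a stream:
    M = sign(n01 - n10) * (n01 - n10)**2 / (n01 + n10)."""
    CRITICAL = 6.635           # chi-square, 1 d.o.f., confidence level 0.99

    def __init__(self):
        self.n01 = 0           # examples misclassified by A and not by B
        self.n10 = 0           # examples misclassified by B and not by A

    def add(self, a_correct, b_correct):
        """Update the counts with one example from the stream."""
        if not a_correct and b_correct:
            self.n01 += 1
        elif a_correct and not b_correct:
            self.n10 += 1

    def statistic(self):
        d = self.n01 - self.n10
        total = self.n01 + self.n10
        if total == 0:
            return 0.0
        sign = 1 if d > 0 else -1 if d < 0 else 0
        return sign * d * d / total

    def significant(self):
        """Reject the null hypothesis of equal performance at 0.99."""
        return abs(self.statistic()) > self.CRITICAL
```

Each example costs O(1) time and O(1) memory, which is what makes the test usable on high-speed streams.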
4.2 Illustrative example
5 Evaluation methodology in nonstationary environments
An additional problem of the holdout method comes from the non-stationary properties of data streams. Non-stationarity or concept drift means that the concept about which data is obtained may shift from time to time, each time after some minimum permanence. The permanence of a concept is designated as context and is defined as a set of consecutive examples from the stream where the underlying distribution is stationary. Learning time-evolving concepts is infeasible if no restrictions are imposed on the type of admissible changes. For example, Kuh et al. (1990) determine a maximal rate of drift that is acceptable for any learner. We restrict this work to methods for explicit change detection because they are informative about the dynamics of the process generating data. We focus on change detectors that monitor the evolution of the learning performance, a common strategy for drift detection (Widmer and Kubat 1996; Street and Kim 2001; Klinkenberg 2004). In this section we study the use of forgetting mechanisms, based on sliding windows or fading factors, in error estimates for drift detection.
Change detection algorithms are evaluated along three dimensions:
Probability of True detection: capacity to detect drift when it occurs;

Probability of False alarms: resilience to false alarms when there is no drift, that is, not signalling drift when there is no change in the target concept;

Delay in detection: the number of examples required to detect a change after the occurrence of a change.
5.1 The PageHinkley algorithm
The minimum value of this variable is also computed: M _{ T }=min(m _{ t },t=1…T). As a final step, the test monitors the difference between M _{ T } and m _{ T }: PH _{ T }=m _{ T }−M _{ T }. When this difference is greater than a given threshold (λ) we signal a change in the distribution. The threshold λ depends on the admissible false alarm rate. Increasing λ will entail fewer false alarms, but might miss or delay change detection.
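A minimal Python sketch of this test follows; the class name is ours, and it assumes the monitored value's deviations from its running mean are accumulated with the tolerance δ, as described above.

```python
class PageHinkley:
    """Sketch of the Page-Hinkley test: m_T accumulates deviations of the
    monitored value from its running mean (minus a tolerance delta);
    a change is signalled when PH_T = m_T - M_T exceeds lambda."""
    def __init__(self, delta=0.1, lam=100.0):
        self.delta, self.lam = delta, lam
        self.mean, self.t = 0.0, 0
        self.m, self.m_min = 0.0, 0.0    # m_T and its minimum M_T

    def add(self, x):
        self.t += 1
        self.mean += (x - self.mean) / self.t   # incremental running mean
        self.m += x - self.mean - self.delta
        self.m_min = min(self.m_min, self.m)
        return self.m - self.m_min > self.lam   # True -> change signalled
```

While the monitored value stays near its mean, m_T drifts downward and M_T follows it, so PH_T stays near zero; after an upward shift, m_T climbs away from M_T until the threshold λ is crossed.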
5.2 Monitoring drift using prequential error estimates
To assess the proposed error estimates in drift scenarios, we monitor the evolution of the different prequential error estimates using the PH test. The data stream generators are Waveform, LED, Random Tree (RT) and RBF as implemented in MOA. The data was generated by emulating an abrupt concept drift event as a combination of two distributions. For the LED, Waveform and RBF datasets the first distribution was generated with the LEDGenerator, the WaveformGenerator and the RandomRBFGenerator (respectively) and the second distribution was generated with the LEDGeneratorDrift, the WaveformGeneratorDrift and the RandomRBFGeneratorDrift (respectively). For the second stream of the LED and Waveform datasets we set the number of attributes with drift to 7 and 21, respectively. The second RBF stream was generated setting the seed for the random generation of the model to 10 and adding speed drift to the centroids of the model (0.01). For the RT dataset, both distributions were generated with the RandomTreeGenerator, varying the seed of the second concept. For the LED data stream the change occurs at example 128k and for the other streams the change occurs at example 32k.
The learning algorithms are VFDTMC and VFDTNBAdaptive as implemented in MOA. The PH test parameters are δ=0.1 and λ=100. For the prequential error estimates we use sliding windows of size 1k, 2k, 3k, 4k and 5k and fading factors of 0.9970, 0.9985, 0.9990, 0.9993 and 0.9994.
Table 2 Detection delay time using the PH test over different error estimates. The learning algorithms are VFDT majority class (MC) and VFDT Naive-Bayes adaptive (NBa). We report the mean and standard deviation of 10 runs. In parentheses, the number of runs, if any, where the PH test missed the detection. The PH test using the prequential error over the entire stream misses the detection in all streams, except for Waveform with VFDT-NBa. In these experiments the delay increases with the window size and with the fading factor
                 LED             RBF             RT              Waveform
                 MC      NBa     MC      NBa     MC      NBa     MC      NBa
Prequential      Miss    Miss    Miss    Miss    Miss    Miss    Miss    19698 ± 3935
Fading factors  
α=0.9970  2155 ± 370  486 ± 10  3632 ± 4413  1416 ± 342  911 ± 132  800 ± 84  1456 ± 326  836 ± 490  
α=0.9985  3391 ± 797  693 ± 15  4288 ± 4413  1992 ± 430  1317 ± 177  1163 ± 121  2072 ± 437  1129 ± 565  
α=0.9990  4279 ± 947  872 ± 19  (1)  6474 ± 4995  2454 ± 486  1676 ± 212  1481 ± 153  2678 ± 541  1420 ± 740 
α=0.9993  5433 ± 1130  1076 ± 24  (1)  9241 ± 7769  3023 ± 571  2122 ± 255  1861 ± 190  3452 ± 656  1766 ± 952 
α=0.9994  6103 ± 1331  1183 ± 26  (1)  9818 ± 7587  3369 ± 644  2365 ± 279  2068 ± 208  3874 ± 718  1931 ± 1002 
Sliding windows  
w=1000  2542 ± 512  725 ± 14  3765 ± 4444  1617 ± 321  1157 ± 120  1054 ± 79  1637 ± 290  1069 ± 444  
w=2000  3484 ± 619  1146 ± 30  4631 ± 4430  2397 ± 338  1866 ± 156  1718 ± 121  2491 ± 353  1622 ± 523  
w=3000  4454 ± 619  1503 ± 36  (1)  7085 ± 5205  3151 ± 344  2549 ± 206  2331 ± 164  3366 ± 400  2158 ± 713 
w=4000  5436 ± 804  1836 ± 49  (1)  8315 ± 5320  3987 ± 384  3209 ± 249  2925 ± 203  4246 ± 433  2654 ± 824 
w=5000  6419 ± 860  2162 ± 55  (2)  9198 ± 5683  4899 ± 466  3859 ± 295  3507 ± 237  5152 ± 463  3133 ± 918 
5.3 Monitoring drift with the ratio of prequential error estimates
Learning in timeevolving streams requires a tradeoff between memory and forgetting. A common approach to detect changes (Bach and Maloof 2008; Bifet and Gavaldà 2007) consists of using two sliding windows: a short window containing the most recent information and a large window, used as reference, containing a larger set of recent data including the data in the short window. The rationale behind this approach is that the short window is more reactive while the large window is more conservative. When a change occurs, statistics computed in the short window will capture the event faster than using the statistics in the larger window. Similarly, using fading factors, a smooth forgetting mechanism, a smaller fading factor will detect drifts earlier than larger ones.
Based on this assumption, we propose a new online approach to detect concept drift: performing the PH test on the ratio between two error estimates, a long-term estimate (using a large window or a fading factor close to one) and a short-term estimate (using a short window or a smaller fading factor). If the short-term error estimator is significantly greater than the long-term one, we signal a drift alarm. In the case of fading factors, we consider α _{1} and α _{2} (with 0≪α _{2}<α _{1}<1) and compute the fading averages \(M_{\alpha_{1}}(i)\) and \(M_{\alpha_{2}}(i)\) at observation i using Algorithm 3. The ratio between them is referred to as \(R(i)= M_{\alpha _{2}}(i)/M_{\alpha_{1}}(i)\). The PH test monitors the evolution of R(i) and signals a drift when a significant increase of this variable is observed. The pseudocode is presented in Algorithm 4. With sliding windows, the procedure is similar: considering two sliding windows of different sizes, SW _{1}={e _{ j } | j∈]i−w _{1},i]} and SW _{2}={e _{ j } | j∈]i−w _{2},i]} (with w _{2}<w _{1}), we compute the moving averages over both, \(M_{w_{1}}(i) = \frac{1}{w_{1}} \sum_{j=i-w_{1}+1}^{i} e_{j}\) and \(M_{w_{2}}(i) = \frac{1}{w_{2}} \sum_{j=i-w_{2}+1}^{i} e_{j}\), at observation i using Algorithm 2, and the PH test monitors the ratio between both moving averages.
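Since the referenced algorithm listings are not reproduced here, the fading-factor variant of this ratio detector can be sketched as follows. The class name and parameter values are ours and purely illustrative, not those used in the experiments; the fading averages and the PH test follow the definitions given in the text.

```python
class RatioDriftDetector:
    """Sketch of drift detection via the PH test on the ratio R(i) of a
    short-term fading average (alpha2) over a long-term one (alpha1),
    with 0 << alpha2 < alpha1 < 1."""
    def __init__(self, alpha1=0.999, alpha2=0.997, delta=0.005, lam=1.0):
        self.a1, self.a2 = alpha1, alpha2
        self.s1 = self.s2 = self.n1 = self.n2 = 0.0   # fading sums/counts
        self.delta, self.lam = delta, lam
        self.mean, self.t = 0.0, 0                    # PH running mean
        self.m, self.m_min = 0.0, 0.0                 # m_T and M_T

    def add(self, loss):
        # fading averages M_alpha1(i) and M_alpha2(i)
        self.s1 = loss + self.a1 * self.s1
        self.n1 = 1.0 + self.a1 * self.n1
        self.s2 = loss + self.a2 * self.s2
        self.n2 = 1.0 + self.a2 * self.n2
        r = (self.s2 / self.n2) / (self.s1 / self.n1)   # R(i): short / long
        # PH test on the evolution of R(i)
        self.t += 1
        self.mean += (r - self.mean) / self.t
        self.m += r - self.mean - self.delta
        self.m_min = min(self.m_min, self.m)
        return self.m - self.m_min > self.lam           # True -> drift alarm
```

In a stable phase R(i) stays close to 1; after a change the short-term average rises first, R(i) exceeds 1, and the PH statistic accumulates until the alarm fires.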
Experimental evaluation
Table 3 Detection delay time, average and standard deviation, using the PH test monitoring the ratio of error estimates. The learning algorithms are VFDT majority class (MC) and VFDT Naive-Bayes adaptive (NBa). We report the mean of 10 runs. In parentheses, the number of runs, if any, where the PH test missed the detection or signalled a false alarm, in the form (Miss; False Alarm). For the ratio using different fading factors, the value of the second fading factor is set to 0.9970 and the value of the first one varies from 0.9990 to 0.9994. For the ratio using different sliding windows, the length of the second window is set to 1000 and the length of the first one varies from 3k to 5k
                 LED             RBF             RT              Waveform
                 MC      NBa     MC      NBa     MC      NBa     MC      NBa
Fading factors  
α=0.9990  Miss  349 ± 11  787 ± 190  384 ± 38  325 ± 39  262 ± 27  780 ± 171  457 ± 161 
α=0.9993  969 ± 255 (4;0)  281 ± 8  563 ± 100  309 ± 29  261 ± 33  211 ± 25  541 ± 65  348 ± 89 
α=0.9994  856 ± 186 (3;0)  264 ± 7  522 ± 92  291 ± 28  244 ± 32  198 ± 24  496 ± 57 (2;0)  324 ± 77 (1;0) 
Sliding windows  
w=3000  1254 ± 355 (1;0)  473 ± 27  779 ± 94  489 ± 48  429 ± 93 (0;1)  350 ± 51 (0;1)  762 ± 62  551 ± 102 
w=4000  1089 ± 261  423 ± 26  710 ± 100 (0;1)  465 ± 41 (0;1)  397 ± 79 (0;2)  312 ± 52 (0;1)  673 ± 57  501 ± 102 
w=5000  1005 ± 216  397 ± 25  679 ± 98 (0;2)  440 ± 45 (0;2)  384 ± 67 (0;3)  299 ± 37 (0;3)  662 ± 60 (0;5)  468 ± 93 
The order of magnitude of the detection delays presented in Table 2 is thousands of examples, while in Table 3 it is hundreds of examples. Nevertheless, while monitoring the ratio of error estimates allows much faster detection, it is riskier: we observed false alarms and missed detections, mostly with VFDT-MC. In a set of experiments not reported here, we observed that the choice of the fading factor and of the window size is critical. Once more, in these experiments drift detection based on the ratio of fading estimates is somewhat faster than with sliding windows.
6 Conclusions
The design of scientific experiments is at the core of the scientific method. The definition of the metrics and procedures used in any experimental study is a necessary, albeit not sufficient, condition for reproducibility, that is, the ability of an experiment or study to be accurately replicated by someone else working independently. The main difficulty in evaluation when learning from dynamic and time-changing data streams is monitoring the evolution of the learning process. In this work we defend the use of predictive sequential (prequential) error estimates, computed over sliding windows or with fading factors, to assess the performance of stream learning algorithms in the presence of non-stationary data. The prequential method is a general methodology for evaluating learning algorithms in streaming scenarios. In applications where the observed target value becomes available later in time, the prequential estimator can be implemented inside the learning algorithm. For stationary data and consistent learners, we proved that the prequential error estimated with memoryless forgetting mechanisms is a good estimate of the error of stream learning algorithms. The convergence properties of these forgetting mechanisms can also be exploited for concept drift detection. Thus, the use of a forgetting prequential error is recommended for continuously assessing the quality of stream learning algorithms. We presented applications of the proposed method in performance comparison, hypothesis testing and drift detection, and observed that drift detection is much more efficient using prequential error estimates with forgetting mechanisms.
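The three prequential estimators discussed in the paper can be contrasted in a few lines. This is a minimal sketch in our own naming (not the paper's code), assuming a 0/1 loss per example: the plain prequential error averages over the whole stream, the sliding-window variant keeps the last w losses, and the fading-factor variant is memoryless.

```python
from collections import deque


class Prequential:
    """Plain prequential (predictive sequential) error over the whole stream."""

    def __init__(self):
        self.errors = 0
        self.n = 0

    def update(self, loss):
        self.errors += loss
        self.n += 1
        return self.errors / self.n


class WindowPrequential:
    """Prequential error over a sliding window of the last w losses."""

    def __init__(self, w=1000):
        self.window = deque()
        self.w = w
        self.total = 0.0

    def update(self, loss):
        self.window.append(loss)
        self.total += loss
        if len(self.window) > self.w:
            self.total -= self.window.popleft()  # forget the oldest loss
        return self.total / len(self.window)


class FadingPrequential:
    """Prequential error with fading factor alpha: O(1) time and memory."""

    def __init__(self, alpha=0.999):
        self.alpha = alpha
        self.num = 0.0  # faded sum of losses
        self.den = 0.0  # faded count

    def update(self, loss):
        self.num = loss + self.alpha * self.num
        self.den = 1.0 + self.alpha * self.den
        return self.num / self.den
```

On a stream whose error rate drops, the plain estimate lags behind, while the window and fading estimates track the recent error; only the fading variant needs no buffer, which is the memoryless property argued for above.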
The research presented in this work opens interesting opportunities: a system could monitor the evolution of its own learning process and self-diagnose that evolution. We are currently implementing change detection mechanisms and the corresponding adaptation strategies inside VFDT-like algorithms. Other direct applications are tracking the best expert (Herbster and Warmuth 1998) and dynamic weighted majority (Kolter and Maloof 2007) algorithms, which continuously track the evolving performance of their components.
In this work we focused on loss as the performance criterion. Nevertheless, other criteria, imposed by the characteristics of data streams, must be taken into account. Learning algorithms run on devices with fixed memory, so they need to manage the available memory, eventually discarding parts of the required statistics or parts of the decision model. We need to evaluate memory usage over time and the impact on accuracy of operating within the available memory. Another aspect is that algorithms must process examples as fast as (if not faster than) they arrive; whenever the arrival rate exceeds the processing speed, some form of sampling is required. The number of examples processed per second and the impact of sampling on performance are therefore additional evaluation criteria. Overall, this work is a further step in the discussion of good practices for performance assessment of stream learning algorithms.
Footnotes
 1.
Nowadays, the software MOA (Bifet et al. 2010a) implements most of the evaluation strategies discussed here.
 2.
It is necessary to consider a window with large enough size in order to achieve an unbiased estimator of the average.
 3.
We do not argue that this is the most appropriate test for comparing classifiers. An in-depth analysis of statistical tests for comparing classifiers in the batch scenario appears in Demsar (2006).
 4.
It is necessary to consider a window with large enough size in order to achieve a feasible computation of an average.
 5.
From PAC learning theory, the error rate of the learning algorithm decreases with the number of examples, so the average of errors can be seen as a decreasing function; hence \(N_{1} < N_{2} \Rightarrow \bar{e}_{N_{1}} > \bar{e}_{N_{2}}\).
Notes
Acknowledgements
This work is funded by the ERDF through the Programme COMPETE and by the Portuguese Government through FCT – Foundation for Science and Technology, project refs. PEstC/SAU/UI0753/2011 and PTDC/EIA/098355/2008, Knowledge Discovery from Ubiquitous Data Streams, and through POPH and FCT by a PhD grant ref. SFRH/BD/41569/2007.
References
 Asuncion, A., & Newman, D. (2007). UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html.
 Babcock, B., Datar, M., Motwani, R., & O’Callaghan, L. (2003). Maintaining variance and k-medians over data stream windows. In T. Milo (Ed.), Proceedings of the 22nd symposium on principles of database systems, San Diego, USA (pp. 234–243). New York: ACM.
 Bach, S. H., & Maloof, M. A. (2008). Paired learners for concept drift. In ICDM (pp. 23–32). Los Alamitos: IEEE Computer Society.
 Basseville, M., & Nikiforov, I. (1993). Detection of abrupt changes: theory and applications. New York: Prentice Hall.
 Bifet, A., & Gavaldà, R. (2007). Learning from time-changing data with adaptive windowing. In Proceedings of the SIAM international conference on data mining, Minneapolis, USA (pp. 443–448). Philadelphia: SIAM.
 Bifet, A., Holmes, G., Kirkby, R., & Pfahringer, B. (2010a). MOA: massive online analysis. Journal of Machine Learning Research, 11, 1601–1604.
 Bifet, A., Holmes, G., Pfahringer, B., & Frank, E. (2010b). Fast perceptron decision tree learning from evolving data streams. In Advances in knowledge discovery and data mining, 14th Pacific-Asia conference (pp. 299–310).
 Bishop, C. (1995). Neural networks for pattern recognition. London: Oxford University Press.
 Chi, Y., Wang, H., Yu, P. S., & Muntz, R. R. (2006). Catch the moment: maintaining closed frequent itemsets over a data stream sliding window. Knowledge and Information Systems, 10(3), 265–294.
 Cormode, G., Muthukrishnan, S., & Zhuang, W. (2007). Conquering the divide: continuous clustering of distributed data streams. In ICDE: proceedings of the international conference on data engineering, Istanbul, Turkey (pp. 1036–1045).
 Datar, M., Gionis, A., Indyk, P., & Motwani, R. (2002). Maintaining stream statistics over sliding windows. SIAM Journal on Computing, 31(6), 1794–1813.
 Dawid, A. P. (1984). Statistical theory: the prequential approach. Journal of the Royal Statistical Society. Series A, 147, 278–292.
 Demsar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
 Dietterich, T. (1996). Approximate statistical tests for comparing supervised classification learning algorithms. Technical report nr. 97.331, Oregon State University, Corvallis.
 Domingos, P., & Hulten, G. (2000). Mining high-speed data streams. In I. Parsa, R. Ramakrishnan, & S. Stolfo (Eds.), Proceedings of the ACM sixth international conference on knowledge discovery and data mining, Boston, USA (pp. 71–80). New York: ACM.
 Duda, R., & Hart, P. (1973). Pattern classification and scene analysis. New York: Wiley.
 Ferrer-Troyano, F., Aguilar-Ruiz, J. S., & Riquelme, J. C. (2004). Discovering decision rules from numerical data streams. In Proceedings of the ACM symposium on applied computing, Nicosia, Cyprus (pp. 649–653). New York: ACM Press.
 Gama, J., & Kosina, P. (2011). Learning decision rules from data streams. In Proceedings of the 22nd international joint conference on artificial intelligence, IJCAI (pp. 1255–1260).
 Gama, J., Medas, P., Castillo, G., & Rodrigues, P. (2004). Learning with drift detection. In A. L. C. Bazzan & S. Labidi (Eds.), Lecture notes in computer science: Vol. 3171. Advances in artificial intelligence—SBIA 2004, São Luis, Brasil (pp. 286–295). Berlin: Springer.
 Gama, J., Rocha, R., & Medas, P. (2003). Accurate decision trees for mining high-speed data streams. In Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, Washington, DC, USA (pp. 523–528). New York: ACM.
 Gama, J., Sebastião, R., & Rodrigues, P. P. (2009). Issues in evaluation of stream learning algorithms. In Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, Paris, France (pp. 329–338). New York: ACM.
 Ghosh, B., & Sen, P. (1991). Handbook of sequential analysis. New York: Dekker.
 Giannella, C., Han, J., Pei, J., Yan, X., & Yu, P. (2003). Mining frequent patterns in data streams at multiple time granularities. In H. Kargupta, A. Joshi, K. Sivakumar, & Y. Yesha (Eds.), Next generation data mining. Menlo Park/Cambridge: AAAI Press/MIT Press.
 Hartl, C., Baskiotis, N., Gelly, S., & Sebag, M. (2007). Change point detection and meta-bandits for online learning in dynamic environments. In Conférence Francophone sur l’apprentissage automatique, Cepadues (pp. 237–250).
 Herbster, M., & Warmuth, M. (1998). Tracking the best expert. Machine Learning, 32(2), 151–178.
 Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30.
 Hulten, G., & Domingos, P. (2001). Catching up with the data: research issues in mining data streams. In Proceedings of the workshop on research issues in data mining and knowledge discovery, Santa Barbara, USA.
 Hulten, G., & Domingos, P. (2003). VFML—a toolkit for mining high-speed time-changing data streams. Technical report, University of Washington. http://www.cs.washington.edu/dm/vfml/
 Hulten, G., Spencer, L., & Domingos, P. (2001). Mining time-changing data streams. In Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining, San Francisco, California (pp. 97–106). New York: ACM.
 Japkowicz, N., & Shah, M. (Eds.) (2011). Evaluating learning algorithms: a classification perspective. Cambridge: Cambridge University Press.
 Katakis, I., Tsoumakas, G., & Vlahavas, I. (2010). Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowledge and Information Systems, 22, 371–391.
 Kearns, M., & Vazirani, U. (1994). An introduction to computational learning theory. Cambridge: MIT Press.
 Kifer, D., Ben-David, S., & Gehrke, J. (2004). Detecting change in data streams. In Proceedings of the international conference on very large data bases, Toronto, Canada (pp. 180–191). San Mateo: Morgan Kaufmann.
 Kirkby, R. (2008). Improving Hoeffding trees. Ph.D. thesis, University of Waikato, New Zealand.
 Klinkenberg, R. (2004). Learning drifting concepts: example selection vs. example weighting. Intelligent Data Analysis, 8(3), 281–300.
 Kolter, J. Z., & Maloof, M. A. (2007). Dynamic weighted majority: an ensemble method for drifting concepts. Journal of Machine Learning Research, 8, 2755–2790.
 Koychev, I. (2000). Gradual forgetting for adaptation to concept drift. In Proceedings of the ECAI workshop on current issues in spatio-temporal reasoning, Berlin, Germany (pp. 101–106). Leipzig: ECAI Press.
 Kuh, A., Petsche, T., & Rivest, R. (1990). Learning time-varying concepts. In Proceedings of advances in neural information processing (pp. 183–189). San Mateo: Morgan Kaufmann.
 Li, P., Wu, X., & Hu, X. (2010). Mining recurring concept drifts with limited labeled streaming data. Journal of Machine Learning Research—Proceedings Track, 13, 241–252.
 Liang, C., Zhang, Y., & Song, Q. (2010). Decision tree for dynamic and uncertain data streams. Journal of Machine Learning Research—Proceedings Track, 13, 209–224.
 Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., & Euler, T. (2006). Yale: rapid prototyping for complex data mining tasks. In ACM SIGKDD international conference on knowledge discovery and data mining (pp. 935–940). New York: ACM Press.
 Mitchell, T. (1997). Machine learning. New York: McGraw-Hill.
 Mouss, H., Mouss, D., Mouss, N., & Sefouhi, L. (2004). Test of Page-Hinkley, an approach for fault detection in an agro-alimentary production system. In Proceedings of the Asian control conference (Vol. 2, pp. 815–818).
 Page, E. S. (1954). Continuous inspection schemes. Biometrika, 41(1/2), 100–115.
 Rodrigues, P. P., Gama, J., & Pedroso, J. P. (2008). Hierarchical clustering of time series data streams. IEEE Transactions on Knowledge and Data Engineering, 20(5), 615–627.
 Street, W. N., & Kim, Y. (2001). A streaming ensemble algorithm SEA for large-scale classification. In Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining, San Francisco, California (pp. 377–382). New York: ACM Press.
 Widmer, G., & Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23, 69–101.