Adaptive windowing is a technique for the online analysis of data streams that manages changes in the distribution of the data. It uses the standard idea of a sliding window over the data but, unlike other approaches, the size of the window is not fixed a priori; instead, it changes dynamically as a function of the data. The window is kept at all times at the maximum length consistent with the assumption that there has been no change in the data it contains.
Many modern sources of data are best viewed as data streams: a potentially infinite sequence of data items that arrive one at a time, usually at high and uncontrollable speed. One wants to perform various analysis tasks on the stream in an online, rather than batch, fashion. Among these tasks, many consist of building models such as creating a predictor, forming clusters, or discovering frequent patterns. The source of data may evolve over time, that is, its statistical properties may vary, and often one is interested in keeping the model accurate with respect to the current distribution of the data, rather than that of the past data.
The ability of model-building methods to handle evolving data streams is one of the distinctive concerns of data stream mining, compared to batch data mining (Gama et al. 2014; Ditzler et al. 2015; Bifet et al. 2018). Most of the approaches to this problem in the literature fall into one of the following three patterns or a combination thereof.
One, the algorithm keeps a sliding window over the stream that stores a certain quantity of the most recently seen items. The algorithm then is in charge of keeping the model accurate with respect to the items in the window. Sliding means that every newly arrived item is added to the front of the window and that the oldest elements are dropped from the tail of the window. Dropping policies may vary. To keep a window of a constant size (denoted W hereafter), one stores the first W elements and then drops exactly one element for each one that is added.
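The constant-size policy just described can be sketched in a few lines of Python. This is only an illustration: the class name `FixedWindow` and its methods are hypothetical, and the running mean stands in for whatever model the window feeds.

```python
from collections import deque


class FixedWindow:
    """Keep the W most recent items together with their running mean."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        self.size = size          # the constant W
        self.items = deque()      # oldest item on the left, newest on the right
        self.total = 0.0

    def add(self, x):
        # The newly arrived item enters the window.
        self.items.append(x)
        self.total += x
        # Once more than W items are stored, drop exactly one (the oldest).
        if len(self.items) > self.size:
            self.total -= self.items.popleft()

    def mean(self):
        return self.total / len(self.items) if self.items else 0.0


window = FixedWindow(3)
for value in [1, 2, 3, 4, 5]:
    window.add(value)
print(list(window.items), window.mean())
```

Memory is linear in W, since every item in the window is stored explicitly.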
Two, each item seen so far is associated with a weight that changes over time. The model-building algorithm takes into account the weight of each element in maintaining its model, so that elements with higher weight have more influence on the model's behavior. One can, for example, fix a constant λ < 1 called the decay factor and establish that the importance of every item gets decreased (multiplied) by λ at each time step; this implies that the importance of the item that arrived t time steps ago is \(\lambda^t\) times its initial one, that is, weights decrease exponentially fast. This policy is the basis of the EWMA (Exponentially Weighted Moving Average) estimator for the average of some statistic of the stream.
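A minimal sketch of the decay policy, assuming the estimate is the decayed sum of the items divided by the decayed sum of their weights. The class name `DecayedAverage` is hypothetical; other normalizations of EWMA exist and this is just one of them.

```python
class DecayedAverage:
    """Exponentially weighted average: an item seen t steps ago has weight decay**t."""

    def __init__(self, decay):
        if not 0.0 < decay < 1.0:
            raise ValueError("decay must lie strictly between 0 and 1")
        self.decay = decay
        self.weighted_sum = 0.0   # sum of decay**t * x over all items
        self.weight_total = 0.0   # sum of decay**t over all items

    def add(self, x):
        # Multiplying the accumulators by the decay factor ages every past item by one step.
        self.weighted_sum = self.decay * self.weighted_sum + x
        self.weight_total = self.decay * self.weight_total + 1.0

    def mean(self):
        return self.weighted_sum / self.weight_total if self.weight_total else 0.0


estimator = DecayedAverage(decay=0.9)
for value in [0.0] * 50 + [1.0] * 50:
    estimator.add(value)
print(round(estimator.mean(), 3))  # close to 1 because old zeros have decayed away
```

Only two real values are stored regardless of the length of the stream.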
Three, the model builder monitors the stream with a change detection algorithm that raises a flag when it finds evidence that the distribution of the stream items has changed. When this happens, the model builder revises the current model or discards the model and builds a new one with fresh data. Usually, change detection algorithms monitor one statistic or a small number of statistics of the stream, so they will not detect every possible kind of change, which is in general computationally unfeasible. Two easy-to-implement change detection algorithms for streams of real values are the CUSUM (Cumulative Sum) and Page-Hinkley methods (Basseville and Nikiforov 1993; Gama et al. 2014). Roughly speaking, both methods monitor the average of the items in the stream; when the recent average differs from the historical average by some threshold related to the standard deviation, they declare change.
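As an illustration of this third pattern, the following is a sketch of a one-sided Page-Hinkley test that looks for an increase in the mean. The parameter names `delta` (tolerated drift) and `threshold`, and their default values, are assumptions chosen for the example; in practice they would be tuned, for instance in relation to the standard deviation of the stream.

```python
class PageHinkley:
    """One-sided Page-Hinkley test for an increase in the mean of a real-valued stream."""

    def __init__(self, delta=0.005, threshold=50.0):
        self.delta = delta            # magnitude of change that is tolerated
        self.threshold = threshold    # alarm level for the test statistic
        self.reset()

    def reset(self):
        self.n = 0
        self.mean = 0.0       # historical average of the items seen so far
        self.cum = 0.0        # cumulative deviation from the historical average
        self.min_cum = 0.0    # smallest cumulative deviation seen so far

    def add(self, x):
        """Process one item; return True when a change is declared."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        if self.cum - self.min_cum > self.threshold:
            self.reset()      # start monitoring afresh after the change
            return True
        return False


detector = PageHinkley(delta=0.005, threshold=5.0)
stream = [0.0] * 100 + [1.0] * 20
alarms = [i for i, x in enumerate(stream) if detector.add(x)]
print(alarms)
```

The detector stores four numbers in total, which is the constant-memory behavior that distinguishes it from window-based methods.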
Methods such as EWMA, CUSUM, and Page-Hinkley store a constant number of real values, while window-based methods require, if implemented naively, memory linear in W. On the other hand, keeping a window provides more information usable by the model builder, namely, the instances themselves.
A disadvantage of all three methods as described is that they require the user to provide parameters that encode assumptions about the magnitude or frequency of changes. Fixing the window size to W means that the user expects the last W items to be relevant, so that there is little change within them, but suspects that older items are irrelevant. Fixing a parameter λ in the EWMA estimator to a value close to 1 indicates that change is expected to be rare or slow, while a value closer to 0 suggests that change may be frequent or abrupt. A similar assumption can be found in the choice of parameters for the CUSUM and Page-Hinkley tests.
In general, these methods face a trade-off between reaction time and variance in the data: the user would like them to react quickly to changes (which happens with, e.g., smaller values of W and λ) but also to have a low number of false positives when no change occurs (which is achieved with larger values of W and λ). In the case of sliding windows, in general one wants to have larger values of W when no change occurs, because models built from more data tend to be more accurate, but smaller values of W when the data is changing, so that the model ignores obsolete data. These trade-offs are investigated by Kuncheva and Žliobaitė (2009).
Adaptive windowing schemes use sliding windows whose size increases or decreases in response to the change observed in the data. They intend to free the user from having to guess expected rates of change in the data, which may lead to poor performance if the guess is incorrect or if the rate of change is different at different moments. Three methods are reviewed in this section, particularly the ADWIN method.
The Drift Detection Method (DDM) proposed by Gama et al. (2004) applies to the construction of two-class predictive models. It is based on the theoretical and practical observation that the empirical error of a predictive model should decrease or remain stable as the model is built with more and more data from a stationary distribution, assuming one controls overfitting. Therefore, when the empirical error instead increases, this is evidence that the distribution in the data has changed.
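More precisely, let \(p_t\) denote the error rate of the predictor observed up to time t and \(s_t = \sqrt{p_t (1 - p_t)/t}\) its standard deviation, and let \(p_{\min}\) and \(s_{\min}\) be the values of \(p_t\) and \(s_t\) recorded at the point where \(p_t + s_t\) was smallest so far. DDM then performs the following checks:
- If \(p_t + s_t \geq p_{\min} + 2 \cdot s_{\min}\), DDM declares a warning. It starts storing examples in anticipation of a possible declaration of change.
- If \(p_t + s_t \geq p_{\min} + 3 \cdot s_{\min}\), DDM declares a change. The current predictor is discarded and a new one is built using the stored examples. The values for \(p_{\min}\) and \(s_{\min}\) are reset as well.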
This approach is generic and fast enough for use in the streaming setting, but it has the drawback that it may be too slow in responding to changes. Indeed, since \(p_t\) is computed on the basis of all examples since the last change, it may take many observations after the change to make \(p_t\) significantly larger than \(p_{\min}\). Also, for slow change, the number of examples retained in memory may become large.
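The warning and change checks of DDM can be sketched as follows. Several details are assumptions made for the example rather than taken from the text above: the standard deviation is computed with the binomial formula sqrt(p(1 - p)/t), the minimum is tracked on the sum of error rate and standard deviation, a hypothetical `min_samples` parameter delays the checks until some data has been seen, and the storing of examples during the warning phase is left to the caller.

```python
import math


class DDM:
    """Sketch of the DDM checks on a stream of 0/1 prediction errors."""

    def __init__(self, min_samples=30):
        self.min_samples = min_samples   # assumed warm-up length
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = math.inf
        self.s_min = math.inf

    def add(self, error):
        """error is 1 for a misclassified item, 0 otherwise.

        Returns "stable", "warning", or "change".
        """
        self.n += 1
        self.errors += error
        p = self.errors / self.n                  # error rate since the last change
        s = math.sqrt(p * (1.0 - p) / self.n)     # its standard deviation (assumed binomial)
        if self.n < self.min_samples:
            return "stable"
        if p + s <= self.p_min + self.s_min:
            # A new minimum cannot be a warning or a change.
            self.p_min, self.s_min = p, s
            return "stable"
        if p + s >= self.p_min + 3.0 * self.s_min:
            self.reset()                          # p_min and s_min are reset as well
            return "change"
        if p + s >= self.p_min + 2.0 * self.s_min:
            return "warning"
        return "stable"


ddm = DDM()
# 10% errors for 200 items, then 50% errors.
stream = [1 if i % 10 == 9 else 0 for i in range(200)] + [i % 2 for i in range(100)]
states = [ddm.add(e) for e in stream]
print(states.index("change"))
```

In this toy stream the change is declared only some tens of items after the error rate jumps, which illustrates the slow reaction discussed above: the error rate is averaged over all items since the last reset.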
An evolution of this method that uses EWMA to estimate the errors is presented and thoroughly analyzed by Ross et al. (2012).
The OLIN method due to Last (2002) and Cohen et al. (2008) also dynamically adjusts the size of the sliding window used to update a predictive model, in order to adapt it to the rate of change in nonstationary data streams. OLIN uses the statistical significance of the difference between the training and the validation accuracy of the current model as an indicator of data stability. Higher stability means that the window can be enlarged to use more data to build a predictor, and lower stability implies shrinking the window to discard stale data. Although described for one specific type of predictor (“Information Networks”) in Last (2002) and Cohen et al. (2008), the technique should apply to many other types.
The ADWIN (ADaptive WINdowing) algorithm is due to Bifet and Gavaldà (2007) and Bifet (2010). Its purpose is to be a self-contained module that can be used in the design of data stream algorithms (for prediction or classification, but also for other tasks) to detect and manage change in a well-specified way. In particular, it aims to resolve the trade-off between fast reaction to change and few false alarms without relying on the user to guess an ad hoc parameter. Intuitively, ADWIN resolves this trade-off by checking for change at many scales simultaneously, in effect trying many sliding window sizes at once. It should be used when the scale of the rate of change is unknown and this is problematic enough to justify a moderate increase in computational effort.
More precisely, ADWIN maintains a sliding window of real numbers that are derived from the data stream. For example, elements in the window could be W bits indicating whether the current predictive model was correct on the last W stream items; the window then can be used to estimate the current error rate of the predictor. In a clustering task, it could instead keep track of the fraction of outliers or cluster quality measures, and in a frequent pattern mining task, the number of frequent patterns that appear in the window. Significant variation inside the window of any of these measures indicates distribution change in the stream. ADWIN is parameterized by a confidence parameter \(\delta \in (0, 1)\) and a statistical test \(T(W_0, W_1, \delta)\); here \(W_0\) and \(W_1\) are two windows, and T decides whether they are likely to come from the same distribution. A good test should satisfy the following criteria:
- If \(W_0\) and \(W_1\) were generated from the same distribution (no change), then with probability at least \(1 - \delta\) the test says “no change.”
- If \(W_0\) and \(W_1\) were generated from two different distributions whose average differs by more than some quantity \(\epsilon(W_0, W_1, \delta)\), then with probability at least \(1 - \delta\) the test says “change.”
When there has been change in the average but its magnitude is less than \(\epsilon(W_0, W_1, \delta)\), no claims can be made on the validity of the test’s answer. Observe that in reasonable tests, \(\epsilon\) decreases as the sizes of \(W_0\) and \(W_1\) increase, that is, as the test sees larger samples. ADWIN applies the test to a number of partitions of its sliding window into two parts, \(W_0\) containing the oldest elements and \(W_1\) containing the newer ones. Whenever \(T(W_0, W_1, \delta)\) returns “change”, ADWIN drops \(W_0\), so the sliding window becomes \(W_1\). In this way, at all times, the window is kept at the maximum length such that there is no proof of change within it. In times without change, the window can keep growing indefinitely (up to a maximum size, if desired).
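The windowing logic just described can be sketched naively by storing the window explicitly and testing every split. Everything specific in this sketch is an assumption: the class name `NaiveAdwin`, the requirement that values lie in [0, 1], and the particular Hoeffding-style threshold in `hoeffding_test`, which is one plausible choice of test T and is not claimed to be the exact bound of the original papers. The sketch costs time linear in the window length per item, so it is for illustration only.

```python
import math
from collections import deque


def hoeffding_test(n0, mean0, n1, mean1, delta):
    """Return True ("change") if the sub-window means differ by more than epsilon.

    Assumed threshold: Hoeffding-style, with the harmonic mean of the two
    sizes and the confidence split over the n possible cut points.
    """
    n = n0 + n1
    m = 1.0 / (1.0 / n0 + 1.0 / n1)
    epsilon = math.sqrt(math.log(4.0 * n / delta) / (2.0 * m))
    return abs(mean0 - mean1) > epsilon


class NaiveAdwin:
    """Adaptive window over values in [0, 1], stored explicitly (no compression)."""

    def __init__(self, delta=0.01, test=hoeffding_test):
        self.delta = delta
        self.test = test
        self.window = deque()   # oldest value on the left

    def add(self, x):
        """Add one value; return True if the window was shrunk."""
        self.window.append(x)
        shrunk = False
        cut = self._find_cut()
        while cut:
            for _ in range(cut):        # drop W0, the oldest part
                self.window.popleft()
            shrunk = True
            cut = self._find_cut()      # re-test what remains
        return shrunk

    def _find_cut(self):
        """Return the size of the first W0 for which the test reports change, else 0."""
        n = len(self.window)
        total = sum(self.window)
        head = 0.0
        for i, x in enumerate(self.window, start=1):
            if i == n:
                break
            head += x
            if self.test(i, head / i, n - i, (total - head) / (n - i), self.delta):
                return i
        return 0


adwin = NaiveAdwin(delta=0.01)
for value in [0.0] * 200 + [1.0] * 200:
    adwin.add(value)
print(len(adwin.window), sum(adwin.window) / len(adwin.window))
```

After the shift from zeros to ones, the old part of the window is dropped and the remaining window again grows with the new data.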
In order to be efficient in time and memory, ADWIN represents its sliding window in a compact way, using the Exponential Histogram data structure due to Datar et al. (2002). This structure maintains a summary of a window by means of a chain of buckets. Older bits are summarized and compressed in coarser buckets with less resolution. A window of length W is stored in only \(O(k \log W)\) buckets, each using a constant number of memory words, and yet the histogram returns an approximation of the average of the window values that is correct up to a factor of 1/k. ADWIN does not check all partitions of its window into pairs \((W_0, W_1)\), but only those at bucket boundaries. Therefore, it performs \(O(k \log W)\) tests on the sliding window for each stream item. The standard implementation of ADWIN uses k = 5 and may add the rule that checks are performed only every t items for efficiency, at the price of a delay of up to t time steps in detecting a change.
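The idea of a chain of buckets whose resolution decreases with age can be sketched as below. This is a simplified scheme in the spirit of the structure described above, not the exact data structure of Datar et al. (2002): the class name `BucketHistogram`, the rule of merging the two oldest buckets once more than k buckets share a size, and the method names are all assumptions of the sketch.

```python
class BucketHistogram:
    """Window summary made of buckets holding 1, 2, 4, ... items each."""

    def __init__(self, k=5):
        self.k = k
        # rows[i] lists the sums of the buckets that hold 2**i items, oldest first.
        self.rows = [[]]
        self.width = 0      # number of items summarized
        self.total = 0.0    # sum of all items summarized

    def add(self, x):
        self.rows[0].append(float(x))
        self.width += 1
        self.total += x
        self._compress()

    def _compress(self):
        i = 0
        # While a row holds more than k buckets, merge its two oldest
        # buckets into one bucket of twice the size in the next row.
        while len(self.rows[i]) > self.k:
            if i + 1 == len(self.rows):
                self.rows.append([])
            merged = self.rows[i].pop(0) + self.rows[i].pop(0)
            self.rows[i + 1].append(merged)
            i += 1

    def buckets(self):
        """Yield (size, sum) pairs from the oldest bucket to the newest."""
        for i in range(len(self.rows) - 1, -1, -1):
            for bucket_sum in self.rows[i]:
                yield 2 ** i, bucket_sum

    def drop_oldest(self):
        """Remove the oldest bucket, as done when the window shrinks."""
        for i in range(len(self.rows) - 1, -1, -1):
            if self.rows[i]:
                bucket_sum = self.rows[i].pop(0)
                self.width -= 2 ** i
                self.total -= bucket_sum
                return 2 ** i, bucket_sum
        raise IndexError("histogram is empty")


hist = BucketHistogram(k=5)
for _ in range(1000):
    hist.add(1.0)
print(hist.width, len(list(hist.buckets())))
```

Candidate cut points between old and new data are then restricted to the boundaries between consecutive buckets, so the number of tests per item grows with the number of buckets rather than with the window length.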
In Bifet and Gavaldà (2007) and Bifet (2010), a test based on the so-called Hoeffding bound is proposed, which can be rigorously proved to satisfy the conditions above for a “good test.” Based on this, rigorous guarantees on the false positive rate and false negative rate of ADWIN are proved in Bifet and Gavaldà (2007) and Bifet (2010). However, this test is quite conservative and will be slow to detect change. In practice, tests based on the normal approximation of a binomial distribution should be used, obtaining faster reaction time for a desired false positive rate.
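To make the idea of a normal-approximation test concrete, here is a sketch of a generic two-sample z-test on proportions for windows of bits. It is a textbook test used as a stand-in and is not claimed to be the specific test used in ADWIN implementations; the function name and its signature are hypothetical.

```python
import math
from statistics import NormalDist


def normal_approx_test(n0, mean0, n1, mean1, delta):
    """Return True ("change") if two windows of bits differ significantly.

    n0, n1 are the window sizes and mean0, mean1 the fractions of ones.
    Uses the normal approximation of the binomial with a pooled variance.
    """
    pooled = (n0 * mean0 + n1 * mean1) / (n0 + n1)
    variance = pooled * (1.0 - pooled) * (1.0 / n0 + 1.0 / n1)
    if variance <= 0.0:
        # All bits are equal in both windows, so there is nothing to detect.
        return False
    z = abs(mean0 - mean1) / math.sqrt(variance)
    # Two-sided critical value for false positive rate delta.
    return z > NormalDist().inv_cdf(1.0 - delta / 2.0)


print(normal_approx_test(100, 0.10, 100, 0.50, 0.01))  # large difference in error rate
print(normal_approx_test(100, 0.10, 100, 0.12, 0.01))  # small difference in error rate
```

Because the threshold scales with the estimated variance rather than with a worst-case bound, such a test typically needs fewer items than a Hoeffding-based one to flag the same difference, at the cost of guarantees that hold only approximately.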
Algorithms for mining data streams will probably store their own sliding window of examples to revise/rebuild their models. One or several instances of ADWIN can be used to inform the algorithm of the occurrence of change and the optimal window size it should use. The time and memory overhead is moderate (logarithmic in the size of the window) and often negligible compared with the cost of the main algorithm itself.
Several change detection methods for streams were evaluated by Gama et al. (2009). The conclusions were that Page-Hinkley and ADWIN were the most appropriate, although Page-Hinkley exhibited a high rate of false alarms and ADWIN used more resources, as expected.
Examples of Application
ADWIN has been applied in the design of streaming algorithms and in applications that need to deal with nonstationary streams.
At the level of algorithm design, it was used by Bifet and Gavaldà (2009) to give a more adaptive version of the well-known CVFDT algorithm for building decision trees from streams due to Hulten et al. (2001); ADWIN improves it by replacing hard-coded constants in CVFDT for the sizes of sliding windows and the duration of testing and training phases with data-adaptive conditions. A similar approach was used by Bifet et al. (2010b) for regression trees using perceptrons at the leaves. In Bifet et al. (2009a,b, 2010a, 2012), ADWIN was used in the context of ensemble classifiers to detect when a member of the ensemble is underperforming and needs to be replaced. In the context of pattern mining, ADWIN was used to detect change and maintain the appropriate sliding window size in algorithms that extract frequent graphs and frequent trees from streams of graphs and XML trees (Bifet et al. 2011b; Bifet and Gavaldà 2011). In the context of process mining, ADWIN is used by Carmona and Gavaldà (2012) to propose a mechanism that helps in detecting changes in the process, localizing and characterizing the change once it has occurred, and unraveling process evolution.
At the level of applications, ADWIN has been used by:
- Bakker et al. (2011) in an application to detect stress situations in data from wearable sensors
- Bifet et al. (2011a) in an application to detect sentiment change in Twitter streaming data
- Pechenizkiy et al. (2009) as the basic detection mechanism in a system to control the stability and efficiency of industrial fuel boilers
- Talavera et al. (2015) in an application to the segmentation of video streams
A main research problem continues to be the efficient detection of change in multidimensional data. Algorithms as described above (OLIN, DDM, and ADWIN) deal, strictly speaking, with the detection of change in a unidimensional stream of real values; it is assumed that this stream of real values, derived from the real stream, will change significantly when there is significant change in the multidimensional stream.
Several instances of these detectors can be created to monitor different parts of the data space or to monitor different summaries or projections thereof, as in, e.g., Carmona and Gavaldà (2012). However, there is no guarantee that all real changes will be detected in this way. Papapetrou et al. (2015) and Muthukrishnan et al. (2007), among others, have proposed efficient schemes to directly monitor change in multidimensional data. However, the problem in its full generality is difficult to scale to high dimensions and arbitrary change, and research into more efficient mechanisms usable in streaming scenarios is highly desirable.
- Bakker J, Pechenizkiy M, Sidorova N (2011) What’s your current stress level? Detection of stress patterns from GSR sensor data. In: 2011 IEEE 11th international conference on data mining workshops (ICDMW), Vancouver, 11 Dec 2011, pp 573–580. https://doi.org/10.1109/ICDMW.2011.178
- Basseville M, Nikiforov IV (1993) Detection of abrupt changes: theory and application. Prentice-Hall, Upper Saddle River. http://people.irisa.fr/Michele.Basseville/kniga/. Accessed 21 May 2017
- Bifet A, Gavaldà R (2007) Learning from time-changing data with adaptive windowing. In: Proceedings of the seventh SIAM international conference on data mining, 26–28 Apr 2007, Minneapolis, pp 443–448. https://doi.org/10.1137/1.9781611972771.42
- Bifet A, Gavaldà R (2009) Adaptive learning from evolving data streams. In: Advances in intelligent data analysis VIII, proceedings of the 8th international symposium on intelligent data analysis, IDA 2009, Lyon, Aug 31–Sept 2 2009, pp 249–260. https://doi.org/10.1007/978-3-642-03915-7_22
- Bifet A, Holmes G, Pfahringer B, Gavaldà R (2009a) Improving adaptive bagging methods for evolving data streams. In: Advances in machine learning, proceedings of the first Asian conference on machine learning, ACML 2009, Nanjing, 2–4 Nov 2009, pp 23–37. https://doi.org/10.1007/978-3-642-05224-8_4
- Bifet A, Holmes G, Pfahringer B, Kirkby R, Gavaldà R (2009b) New ensemble methods for evolving data streams. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’09. ACM, New York, pp 139–148. http://doi.acm.org/10.1145/1557019.1557041
- Bifet A, Holmes G, Pfahringer B (2010a) Leveraging bagging for evolving data streams. In: European conference on machine learning and knowledge discovery in databases, proceedings, part I of the ECML PKDD 2010, Barcelona, 20–24 Sept 2010, pp 135–150. https://doi.org/10.1007/978-3-642-15880-3_15
- Bifet A, Holmes G, Pfahringer B, Frank E (2010b) Fast perceptron decision tree learning from evolving data streams. In: Advances in knowledge discovery and data mining, proceedings, part II of the 14th Pacific-Asia conference, PAKDD 2010, Hyderabad, 21–24 June 2010, pp 299–310. https://doi.org/10.1007/978-3-642-13672-6_30
- Bifet A, Holmes G, Pfahringer B, Gavaldà R (2011a) Detecting sentiment change in Twitter streaming data. In: Proceedings of the second workshop on applications of pattern analysis, WAPA 2011, Castro Urdiales, 19–21 Oct 2011, pp 5–11. http://jmlr.csail.mit.edu/proceedings/papers/v17/bifet11a/bifet11a.pdf
- Bifet A, Holmes G, Pfahringer B, Gavaldà R (2011b) Mining frequent closed graphs on evolving data streams. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’11. ACM, New York, pp 591–599. http://doi.acm.org/10.1145/2020408.2020501
- Bifet A, Frank E, Holmes G, Pfahringer B (2012) Ensembles of restricted Hoeffding trees. ACM TIST 3(2):30:1–30:20. http://doi.acm.org/10.1145/2089094.2089106
- Bifet A, Gavaldà R, Holmes G, Pfahringer B (2018) Machine learning for data streams, with practical examples in MOA. MIT Press, Cambridge. https://mitpress.mit.edu/books/machine-learning-data-streams
- Carmona J, Gavaldà R (2012) Online techniques for dealing with concept drift in process mining. In: Advances in intelligent data analysis XI – proceedings of the 11th international symposium, IDA 2012, Helsinki, 25–27 Oct 2012, pp 90–102. https://doi.org/10.1007/978-3-642-34156-4_10
- Gama J, Medas P, Castillo G, Rodrigues PP (2004) Learning with drift detection. In: Advances in artificial intelligence – SBIA 2004, proceedings of the 17th Brazilian symposium on artificial intelligence, São Luis, Sept 29–Oct 1 2004, pp 286–295. https://doi.org/10.1007/978-3-540-28645-5_29
- Gama J, Sebastião R, Rodrigues PP (2009) Issues in evaluation of stream learning algorithms. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining (KDD 09), Paris, June 28–July 1 2009, pp 329–338. http://doi.acm.org/10.1145/1557019.1557060
- Gama J, Žliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv 46(4):44:1–44:37. http://doi.acm.org/10.1145/2523813
- Hulten G, Spencer L, Domingos PM (2001) Mining time-changing data streams. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining (KDD 01), San Francisco, 26–29 Aug 2001, pp 97–106. http://portal.acm.org/citation.cfm?id=502512.502529
- Kuncheva LI, Žliobaitė I (2009) On the window size for classification in changing environments. Intell Data Anal 13(6):861–872. https://doi.org/10.3233/IDA-2009-0397
- Last M (2002) Online classification of nonstationary data streams. Intell Data Anal 6(2):129–147. http://content.iospress.com/articles/intelligent-data-analysis/ida00083
- Muthukrishnan S, van den Berg E, Wu Y (2007) Sequential change detection on data streams. In: Workshops proceedings of the 7th IEEE international conference on data mining (ICDM 2007), 28–31 Oct 2007, Omaha, pp 551–550. http://dx.doi.org/10.1109/ICDMW.2007.89
- Pechenizkiy M, Bakker J, Žliobaitė I, Ivannikov A, Kärkkäinen T (2009) Online mass flow prediction in CFB boilers with explicit detection of sudden concept drift. SIGKDD Explor 11(2):109–116. http://doi.acm.org/10.1145/1809400.1809423
- Ross GJ, Adams NM, Tasoulis DK, Hand DJ (2012) Exponentially weighted moving average charts for detecting concept drift. Pattern Recogn Lett 33(2):191–198. http://dx.doi.org/10.1016/j.patrec.2011.08.019, erratum in Pattern Recogn Lett 33(16):2261 (2012)
- Talavera E, Dimiccoli M, Bolaños M, Aghaei M, Radeva P (2015) R-clustering for egocentric video segmentation. In: Pattern recognition and image analysis – 7th Iberian conference, proceedings of the IbPRIA 2015, Santiago de Compostela, 17–19 June 2015, pp 327–336. https://doi.org/10.1007/978-3-319-19390-8_37