1 Introduction

In many data mining applications, e.g., in sensor networks, banking, energy management, or telecommunication, the need for processing rapid data streams is becoming more and more common [50]. Such demands have led to the development of classification algorithms that are capable of processing instances one by one, while using limited memory and time. Furthermore, due to the non-stationary characteristics of streaming data, classifiers are often additionally required to react to concept drifts, i.e., changes in definitions of target classes over time [20]. To fulfill these requirements, several data stream classification algorithms have been proposed in recent years; for their review see, e.g., [7, 14, 20].

An important issue when learning classifiers from streaming environments is the way classifiers are evaluated. Traditionally, the predictive performance of a static classifier is measured on a separately curated set of testing examples [31]. However, batch techniques are not applied in streaming scenarios, where simpler and less costly incremental procedures are used, such as interleaving testing with training [33]. Moreover, since data streams can evolve over time, predictive abilities of classifiers are usually calculated with forgetting, i.e., sequentially on the most recent examples [21]. As a result, stream classifiers are mostly evaluated using the least computationally demanding measures, such as accuracy or error rate.

Nevertheless, simple evaluation measures are not sufficient for assessing classifiers when data streams are affected by additional complexity factors. In particular, this problem concerns class imbalance, i.e., situations in which one of the target classes is represented by far fewer instances than the other classes [27]. Class imbalance is an obstacle even for learning from static data, as classifiers are biased toward the majority classes and tend to misclassify minority class examples [10]. In such cases, simple performance metrics, such as accuracy, give overly optimistic estimates of classifier performance [41]. That is why, for static imbalanced data, researchers have proposed several more suitable measures such as precision/recall, sensitivity/specificity, G-mean, and, in particular, the area under the ROC curve [26, 27].

The area under the ROC curve, or simply AUC, summarizes the relationship between the true and false positive rates of a binary classifier for different decision thresholds [18]. Several authors have shown that AUC is preferable to total accuracy for classifier evaluation [29, 30, 41], making it one of the most popular metrics for static imbalanced data. However, in order to calculate AUC one needs to sort a given dataset and iterate through each example. This means that AUC cannot be directly computed on large data streams, as this would require scanning through the entire stream after each example. That is why the use of AUC for data streams has been limited to estimations on periodical holdout sets [28, 44] or entire streams [13, 36], making it either potentially biased or computationally infeasible for practical applications.

To address these limitations, in an earlier paper [9] we introduced a new approach for calculating AUC. We presented an efficient algorithm, which incorporates a sorted tree structure with a sliding window as a forgetting mechanism, making it both computationally feasible and appropriate for concept-drifting streams. As a result, we have introduced a new evaluation measure, called prequential AUC, suitable for assessing classifiers on evolving data streams.

To the best of our knowledge, prequential AUC is the first proposal for efficiently calculating AUC on drifting data streams. Nonetheless, the properties of this new evaluation measure have not been thoroughly investigated in our previous work [9]. Thus, the aim of this paper is to carry out a detailed study of basic properties of prequential AUC, which in our view include the following issues.

First, we will analyze whether prequential AUC leads to consistent results with traditional AUC calculated on stationary data streams. For this purpose, we will use the definitions of consistency and discriminancy proposed by Huang and Ling [30]. Although the computation of traditional batch AUC is infeasible for real streams, it nevertheless serves as the best AUC estimate for data without concept changes. Moreover, the proposed measure will be compared with earlier attempts at using AUC to assess stream classifiers, in particular those based on block processing [28, 44]. Additionally, we perform a series of experiments, which analyze the use of different AUC calculation procedures for visualizing concept changes over time and assess the processing speed of the proposed measure. Finally, we compare and characterize classifier evaluations made by AUC and other measures commonly used for imbalanced data: G-mean, recall, and the Kappa statistic. For this purpose, we design a series of synthetic datasets that simulate various difficulty factors, such as high class imbalance ratios, sudden and gradual ratio changes, minority class fragmentation (small disjuncts), appearing minority class subconcepts, and minority–majority class swaps.

In summary, the main contributions of this paper are as follows:

  • Section 4.1 discusses the applicability of prequential AUC to performance visualization over time, for stationary and drifting streams;

  • Section 4.2 analyzes the consistency and discriminancy of prequential AUC compared to AUC computed traditionally on an entire stationary dataset;

  • Section 5.3 summarizes experiments performed to evaluate the speed of the proposed measure;

  • Section 5.4 analyzes parameter sensitivity of prequential AUC;

  • Section 5.5 compares prequential AUC to other class imbalance evaluation measures on a series of streams involving various difficulty factors.

Section 2 provides basic background for the conducted study and covers related work. To make the paper self-contained, in Sect. 3 we recall our algorithm [9] for computing AUC online with forgetting. Sections 5.1 and 5.2 summarize the experimental setup and used datasets. Section 6 concludes the paper and discusses potential lines of future research. All of the algorithms described in the paper are implemented in Java as part of the MOA framework [3] and are publicly available.

2 Related work

2.1 Stream classifier evaluation

Requirements posed by data streams make processing time, memory usage, adaptability, and predictive performance key classifier evaluation criteria [7, 22]. In this paper, we focus only on that last criterion.

The predictive performance of stream classifiers is usually assessed using evaluation measures known from static supervised classification, such as accuracy or error rate. However, due to the size and speed of data streams, re-sampling techniques such as cross-validation are deemed too expensive and simpler error estimation procedures are used [33]. In particular, stream classifiers are evaluated either by using a holdout test set or by interleaving testing with training, one instance after another or using blocks of examples [20]. More recently, Gama et al. [21] advocated the use of prequential procedures for calculating performance measures in evolving data streams. The term prequential (blend of predictive and sequential) stems from online learning [39] and is used in data stream mining literature to denote algorithms that base their functioning only on the most recent data, rather than the entire stream.

Prequential accuracy was the first measure for evaluating stream classifiers that followed the aforementioned predictive-sequential procedure. The authors of [21] have shown that computing accuracy using only the most recent examples, instead of the entire stream, is more appropriate for continuous assessment and drift detection. However, it is worth noting that prequential accuracy inherits the weaknesses of traditional accuracy, that is, variance with respect to class distribution and promoting majority class predictions in imbalanced data. Recent works have also introduced the use of prequential recall [46], i.e., class-specific accuracy; however, such a measure must be recorded for each single class. A popular way of combining information about many imbalanced classes is the use of the geometric mean (G-mean) of all class-specific accuracies [35], which has also recently been recognized in online learning [47]. Finally, for imbalanced streams Bifet and Frank [2] proposed to prequentially calculate the Kappa statistic with a sliding window. The Kappa statistic (\(\kappa \)) is an alternative to accuracy that normalizes a classifier's accuracy against that of a chance (baseline) classifier. Furthermore, this metric has recently been extended to take into account class skew (\(\kappa _\mathrm{m}\)) [4] and temporal dependence (\(\kappa _{\mathrm {per}}\)) [49].
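For reference, the basic Kappa statistic and its skew-aware variant \(\kappa _\mathrm{m}\) can be written as follows (a standard formulation; the notation below is ours):

\[
\kappa = \frac{p_0 - p_c}{1 - p_c}, \qquad \kappa _\mathrm{m} = \frac{p_0 - p_\mathrm{m}}{1 - p_\mathrm{m}},
\]

where \(p_0\) is the observed accuracy of the evaluated classifier, \(p_c\) the expected accuracy of a chance classifier that predicts according to the observed class distribution, and \(p_\mathrm{m}\) the accuracy of a classifier that always predicts the majority class. Both statistics equal 1 for a perfect classifier and 0 when the classifier is no better than its respective baseline.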

2.2 Area under the ROC curve

The receiver operating characteristics (ROC) curve is a graphical plot that visualizes the relation between the true positive rate and the false positive rate for a classifier under varying decision thresholds [18]. It was first used in signal detection to represent the trade-off between hit and false alarm rates [15]; however, it has also been extensively applied in medical diagnosis [37] and, more recently, to evaluate machine learning algorithms [17, 41]. Unlike single point metrics, the ROC curve compares classifier performance across the entire range of class distributions, offering an evaluation encompassing a wide range of operating conditions.

The area under the ROC curve, or AUC, is one of the most popular classifier evaluation metrics for static imbalanced data [31]. Its basic interpretation and calculation procedure can be explained by analyzing the way in which ROC curves are created. If we denote examples of two classes distinguished by a binary classifier as positive and negative, the ROC curve is created by plotting the proportion of positives correctly classified (true positive rate) against the proportion of negatives incorrectly classified (false positive rate). If a classifier outputs a score proportional to its belief that an instance belongs to the positive class, decreasing the classifier’s decision threshold (i.e., the score above which an instance is deemed to belong to the positive class) will increase both true and false positive rates. Varying the decision threshold results in a piecewise linear curve, called the ROC curve, which is presented in Fig. 1. AUC summarizes the plotted relationship in a scalar metric by calculating the area under the ROC curve.

Fig. 1 Example ROC curve

The popularity of AUC for static data comes from the fact that it is invariant to changes in class distribution. If the class distribution changes in a test set, but the underlying class-conditional distributions from which the data are drawn stay the same, the ROC curve will not change [48]. Moreover, for scoring classifiers it has a very useful statistical interpretation as the expectation that a randomly drawn positive example receives a higher score than a random negative example. Therefore, it can be used to assess machine learning algorithms that rank cases, e.g., in credit scoring, customer targeting, or churn prediction. Furthermore, AUC is equivalent to the Wilcoxon–Mann–Whitney (WMW) U statistic test of ranks [24]. It is worth mentioning that this statistical interpretation led to the development of algorithms that are capable of computing AUC without building the ROC curve itself, by counting the number of positive–negative example misorderings in the ranking produced by classifier scores [48]. Additionally, several researchers have shown that for static data AUC is a preferable classifier evaluation measure compared to total accuracy [30].
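As an illustration of this pairwise interpretation, the following sketch (ours, not code from [24] or [48]) computes AUC directly from scores and labels by comparing every positive–negative pair and counting ties as half; the linear-scan method from [48], used later in the paper, produces the same value more efficiently once the scores are sorted.

```java
// Minimal sketch (ours) of the pairwise interpretation of AUC: the fraction of
// positive-negative pairs in which the positive example receives the higher score,
// with ties counted as half. Quadratic in the number of examples and intended only
// as an illustration of the statistic.
public final class PairwiseAuc {
    public static double auc(double[] scores, boolean[] isPositive) {
        double correct = 0.0;
        long pairs = 0;
        for (int i = 0; i < scores.length; i++) {
            if (!isPositive[i]) continue;                 // i ranges over positives
            for (int j = 0; j < scores.length; j++) {
                if (isPositive[j]) continue;              // j ranges over negatives
                pairs++;
                if (scores[i] > scores[j]) correct += 1.0;
                else if (scores[i] == scores[j]) correct += 0.5;
            }
        }
        return pairs == 0 ? 1.0 : correct / pairs;        // single-class data: no pairs to misorder
    }

    public static void main(String[] args) {
        double[] s = {0.9, 0.8, 0.7, 0.4};
        boolean[] y = {true, false, true, false};
        System.out.println(auc(s, y));                    // prints 0.75
    }
}
```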

Nonetheless, the use of AUC for evaluating classifier performance has been challenged in a publication by Hand et al. [23]. Therein, the authors derive a linear relationship between a classifier's expected minimum loss and AUC under a classifier-dependent distribution over cost proportions. In other words, the authors have shown that two classifiers having equal AUC does not necessarily imply that they have equal expected minimum loss. To amend this deficiency, Hand et al. [23] proposed an alternative measure allowing fairer comparisons: the H-measure. However, this criticism concerns using AUC to assess classification performance and does not affect its usefulness for evaluating ranking performance. Moreover, the relationship between AUC and expected loss is classifier-dependent only if one takes into account solely optimal thresholds [19]. This is especially relevant in the context of data stream mining, as due to concept drift a threshold that is optimal for one part of the stream will be suboptimal after a change occurs. Therefore, if one anticipates concept or data distribution changes, it might be better to consider classification performance for all thresholds created by a classifier, which is exactly what AUC measures [19]. Despite this controversy, AUC remains one of the most widely used measures in imbalanced domains; for a broader discussion and extensions of AUC see [6].

AUC has also been used for data streams, albeit in a very limited way. Some researchers chose to calculate AUC using entire streams [13, 36], while others used periodical holdout sets [28, 44]. Nevertheless, it was noticed that periodical holdout sets may not fully capture the temporal dimension of the data [33], whereas evaluation using entire streams is neither feasible for large datasets nor suitable for drift detection. An algorithm for computing AUC incrementally has also been proposed [5]; however, it calculates AUC from all available examples and is, therefore, not applicable to evolving data streams.

Although the cited works show that AUC is recognized as a measure which should be used to evaluate data stream classifiers, until now it has been computed the same way as for static data. In the following section, we present a simple and efficient algorithm for calculating AUC incrementally with forgetting, which we previously introduced in [9]. Later, we investigate the properties of the resulting evaluation measure with respect to classifiers for evolving imbalanced data streams.

3 Prequential AUC

As AUC is calculated on ranked examples, we will consider scoring classifiers, i.e., classifiers that for each predicted class label additionally return a numeric value (score) indicating the extent to which an instance is predicted to be positive or negative. Furthermore, we will limit our analysis to binary classification. It is worth mentioning that most classifiers can produce scores, and many of those that only predict class labels can be converted to scoring classifiers. For example, decision trees can produce class-membership probabilities by using Naive Bayes leaves or averaging predictions using bagging [40]. Similarly, rule-based classifiers can be modified to produce instance scores indicating the likelihood that an instance belongs to a given class [16].

We propose to compute AUC incrementally after each example using a special sorted structure combined with a sliding window forgetting mechanism. It is worth noting that, since the calculation of AUC requires sorting examples with respect to their classification scores, it cannot be computed on an entire stream or using fading factors without remembering the entire stream. Therefore, for AUC to be computationally feasible and applicable to evolving concepts, it must be calculated using a sliding window.

A sliding window of scores limits the analysis to the most recent data, but to calculate AUC, scores have to be sorted. To efficiently maintain a sorted set of scores, we propose to use the red-black tree data structure [1]. A red-black tree is a self-balancing binary search tree that is capable of adding and removing elements in logarithmic time while requiring minimal memory. With these two structures, we can efficiently calculate AUC sequentially on the most recent examples. Algorithm 1 lists the pseudo-code for calculating prequential AUC. Contrary to [9], here we present an extended version of the algorithm that deals with score ties.

Algorithm 1 Pseudo-code for calculating prequential AUC

For each incoming labeled example, the score assigned to this example by the classifier is inserted into the window (line 15) as well as into the red-black tree (line 10), and if the window size has been exceeded, the oldest score is removed (lines 5 and 15). The red-black tree is sorted in descending order according to scores, in case of score ties with positives before negatives, and in ascending order according to arrival time. This way, we maintain a structure that facilitates the calculation of AUC and ensures that the oldest score in the sliding window can be quickly found in the red-black tree. After the sliding window and tree have been updated, AUC is calculated by summing the number of positive examples occurring before each negative example (lines 18–28) and normalizing that value by the number of all possible positive–negative pairs \(pn\) (line 29), where p is the number of positives and n is the number of negatives in the window. This method of calculating AUC, proposed in [48], is equivalent to summing the areas of trapezoids for each pair of consecutive points on the ROC curve, but is more suitable for our purposes, as it requires very little computation given a sorted collection of scores. We note that in line 26 we take into account score ties between positive and negative examples by reducing the increment of AUC.
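To make this procedure concrete, below is a simplified Java sketch of the same idea (ours; it is not the exact Algorithm 1 shipped with MOA). A FIFO window is paired with a java.util.TreeSet, which is backed by a red-black tree, ordered by descending score, positives before negatives on ties, and arrival time; AUC is then obtained by a single scan of the sorted structure. For brevity, a tied positive–negative pair counts as correctly ordered here, whereas Algorithm 1 halves the contribution of such ties.

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.Deque;
import java.util.TreeSet;

// Simplified sketch (ours) of prequential AUC with a sliding window of d examples.
// TreeSet insertions/removals cost O(log d); the AUC scan visits the d window
// elements, so the cost per example is constant with respect to the stream length.
public final class PrequentialAucSketch {
    private record Scored(double score, boolean positive, long time) {}

    private final int d;                                    // window size
    private final Deque<Scored> window = new ArrayDeque<>();
    private final TreeSet<Scored> sorted = new TreeSet<>(
            Comparator.comparingDouble(Scored::score).reversed()       // scores descending
                      .thenComparing((Scored s) -> !s.positive())      // ties: positives first
                      .thenComparingLong(Scored::time));               // then arrival order
    private long clock = 0;

    public PrequentialAucSketch(int d) { this.d = d; }

    /** Processes the score of a newly labeled example and returns the current AUC. */
    public double update(double score, boolean positive) {
        if (window.size() == d) {
            sorted.remove(window.pollFirst());              // forget the oldest example
        }
        Scored s = new Scored(score, positive, clock++);
        window.addLast(s);
        sorted.add(s);

        long positivesSeen = 0, positivesAboveNegatives = 0, p = 0, n = 0;
        for (Scored e : sorted) {                           // scan in descending score order
            if (e.positive()) { positivesSeen++; p++; }
            else { positivesAboveNegatives += positivesSeen; n++; }
        }
        // Convention used later in the paper: a single-class window yields AUC = 1.
        return (p == 0 || n == 0) ? 1.0 : (double) positivesAboveNegatives / (p * n);
    }
}
```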

An example of using a sliding window and a red-black tree is presented in Fig. 2. Window W contains six examples, all of which are already inserted into the red-black tree. As mentioned earlier, examples in the tree are sorted (with respect to in-order traversal) in descending order according to scores s, positives before negatives, and in ascending order according to arrival time t. When a new instance is scored by the classifier (t: 7, l: \(+\), s: 0.80), the oldest instance (t: 1) is removed from the window and the tree. After the new scored example is inserted, AUC is calculated by traversing the tree in-order and counting labels as presented in lines 17–29 of Algorithm 1. In this example, the resulting AUC would be 0.875.

Although prequential AUC requires a sliding window and is not fully incremental (cannot be computed based solely on its previous value), in [9], we have shown that the presented algorithm requires O(1) time and memory per example and is, therefore, suitable for data stream processing.

Prequential AUC aims at extending the list of available evaluation measures, particularly for assessing classifiers and detecting drifts in streams with evolving class distributions. However, we must verify if this measure is suitable for visualizing performance changes over time. Furthermore, we are also interested in how averaged prequential AUC relates to AUC calculated periodically on blocks or once over the entire stream. Finally, we must evaluate the newly proposed measure on real-world and synthetic data to verify its processing speed and applicability to large streaming data with different types of drift and imbalance ratios. In the following sections, we examine the aforementioned characteristics of prequential AUC and present the main contributions of this study.

4 Properties of prequential AUC

We start the study of properties of prequential AUC by comparing it against other attempts at computing AUC on data streams. First, we will compare the differences in visualizations of classifier performance on stationary and drifting streams. In the second part of this section, we will examine the consistency of prequential and block AUC with batch calculations on stationary data.

Fig. 2 Red-black tree of examples from window W, where l is the example's true label, s its assigned score, and t its timestamp (\(AUC_{(a)}\): 0.833, \(AUC_{(b)}\): 0.875). a Before adding a new instance. b After adding a new instance

4.1 AUC visualizations over time

As mentioned in Sect. 2, there have already been attempts to use AUC as an evaluation measure for data stream classifiers. Some researchers [28, 44] calculated AUC on periodical holdout sets, i.e., consecutive blocks of examples. Others [13, 36], for experimental purposes, treated small data streams as a single batch of examples and calculated AUC traditionally. Furthermore, there has been a proposal of an algorithm for computing AUC incrementally, instance after instance [5]. Together with the proposed prequential estimation, this gives four ways of evaluating data stream classifiers using AUC: batch, block-based, incremental, and prequential.

Fig. 3 Batch, incremental, block-based, and prequential AUC on a data stream with no drifts (\(\texttt {RBF}_{20\mathrm{k}}\))

Fig. 4 Batch, incremental, block-based, and prequential AUC on a data stream with a sudden drift after 10 k examples (\(\texttt {RBF}_{20\mathrm{kSD}}\))

Figures 3 and 4 present visualizations of the aforementioned four AUC calculation procedures. More precisely, both plots present the performance of a single Hoeffding Tree classifier [20] on a dataset with 20 k examples created using the RBF generator (the dataset will be discussed in more detail in Sect. 5.2). The first dataset contained no drifts, whereas in the second dataset a sudden drift was added after 10 k examples. Prequential and block AUC were calculated using a window of \(d = 1000\) examples (the impact of the d parameter will be analyzed in Sect. 5.4). It is worth noting that similar comparisons were previously done for incremental and prequential accuracy [21, 33].

On the stream without any drifts, presented in Fig. 3, we can see that AUC calculated in blocks or prequentially is less pessimistic than AUC calculated incrementally. This is due to the fact that the true performance of an algorithm at a given point in time is obscured when calculated incrementally—algorithms are punished for early mistakes regardless of the level of performance they are eventually capable of, although this effect diminishes over time. This property has also been noticed by other researchers when visualizing classification accuracy over time [21, 33]. Therefore, if one is interested in the performance of a classifier at a given moment in time, prequential and block AUC give less pessimistic estimates, with prequential calculation producing a much smoother curve.

Figure 4 presents the difference between all four AUC calculation methods in the presence of concept drift. As one can see, after a sudden drift occurring after 10 k examples, the change in performance is most visible when looking at prequential AUC over time. AUC calculated on blocks of examples also depicts this change, but delayed according to the block size. However, the most relevant observation is that AUC calculated incrementally is not capable of depicting drifts due to its long “memory” of predictions. For this reason, prequential evaluations should be favored over incremental and block-based assessment in drifting environments where class labels are available after each example [21].

In summary, on streams without drift prequential AUC provides a smoother and less pessimistic classifier evaluation than block-based and incremental calculations. Furthermore, prequential AUC is best at depicting sudden changes over time. It is also worth emphasizing that incremental and batch AUC calculations are presented here only for reference, as they do not fulfill computational requirements of data stream processing. Based on the presented analysis, we believe that similarly as prequential accuracy should be favored over batch, incremental, and block-based accuracy [21], prequential AUC should be preferred to batch, block, and incremental AUC when monitoring classifier performance online on drifting data streams.

4.2 Prequential AUC averaged over entire streams

4.2.1 Motivation

The above analysis shows that prequential AUC has several advantages when monitored over time, especially in environments with possible concept drifts. However, in many situations, particularly when comparing classifiers over multiple datasets, it is easier to examine simple numeric values rather than entire performance plots. In such cases, researchers are more interested in performance values averaged over entire streams.

As it was shown in Fig. 4, in the presence of concept drift, prequential AUC is the most appropriate calculation procedure for showcasing reactions to changes, even when averaged over the entire stream. However, if no drifts are expected, AUC calculated traditionally in a batch manner over all examples should give the best performance estimate. This observation stems from the fact that in a stationary stream all examples represent a stationary set of concepts, and therefore, all predictions can be simultaneously taken into account during evaluation. Although the computation of batch AUC is not feasible for large data streams, we are interested in how prequential and block calculations averaged over the entire stream compare to AUC calculated once using all predictions.

First, it is worth noticing that if we simultaneously take all examples into account, their order of appearance does not affect the final batch AUC estimation, as long as each example receives the same score regardless of its position in the stream. This is contrary to AUC calculated prequentially or in blocks, where the order of examples affects the final averaged performance values. Let us analyze two examples that demonstrate this issue. In each of the presented examples, the arrival time of each instance will be denoted by t and scores will be sorted from highest to lowest, left to right. We assume that classifiers are trained to output higher scores for positive examples and lower scores for negative examples.

Table 1 Classifiers with the same batch AUC but different prequential and block AUC (for a window of \(d = 2\))
Table 2 An example in which one classifier has higher batch AUC but lower prequential and block AUC (for \(d = 2\))

Table 1 presents a stream where two classifiers (\(C_1, C_2\)) give the highest score to negative instances (−) and the lowest score to positive instances (\(+\)). As a result, batch AUC calculated on the entire dataset at once will be 0.00, regardless of the arrival order of examples. However, if positive and negative instances are clearly separated (e.g., all negative instances appear before all positive, as in \(C_1\)), averaged AUC calculated in blocks or prequentially with a window of \(d = 2\) instances will give an estimate of 1.00 and 0.93, respectively. Such high estimates are due to the fact that for most window positions all instances have the same class label. On the other hand, if the same examples arrive in a different order (\(C_2\)), prequential or block AUC with the same window of \(d = 2\) when averaged over the stream will be 0.00 just as batch AUC.

Table 2 presents an additional issue. For \(C_3\), batch AUC is 0.57, averaged block AUC is 1.00, and averaged prequential AUC is 0.53. For \(C_4\), on the other hand, batch AUC is 0.00, block AUC 1.00, and prequential AUC 0.93. This shows that prequential AUC does not necessarily have to be higher than batch AUC. It is worth noting that if the sequence of interleaved positives and negatives for \(C_4\) started with a negative example, block AUC would be 0.00.
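The order-dependence discussed above can be reproduced with a few lines of code. The sketch below (ours) uses hypothetical scores loosely following the description of Table 1: every negative example is scored above every positive one, so batch AUC is 0.0 regardless of the arrival order, while block AUC with \(d = 2\) is 1.0 when the classes arrive separated and 0.0 when they arrive interleaved.

```java
import java.util.Arrays;

// Sketch (ours) contrasting batch AUC with block-averaged AUC on the same scores
// in two hypothetical arrival orders: batch AUC ignores order, the block average
// does not.
public final class OrderDependence {

    // AUC over one set of scored examples; single-class sets count as 1.0,
    // matching the convention used for single-class windows later in the paper.
    static double auc(double[] s, boolean[] pos) {
        double correct = 0; long pairs = 0;
        for (int i = 0; i < s.length; i++) {
            if (!pos[i]) continue;
            for (int j = 0; j < s.length; j++) {
                if (pos[j]) continue;
                pairs++;
                if (s[i] > s[j]) correct += 1; else if (s[i] == s[j]) correct += 0.5;
            }
        }
        return pairs == 0 ? 1.0 : correct / pairs;
    }

    // Average of AUC computed on consecutive, non-overlapping blocks of size d.
    static double blockAuc(double[] s, boolean[] pos, int d) {
        double sum = 0; int blocks = 0;
        for (int start = 0; start + d <= s.length; start += d, blocks++) {
            sum += auc(Arrays.copyOfRange(s, start, start + d),
                       Arrays.copyOfRange(pos, start, start + d));
        }
        return sum / blocks;
    }

    public static void main(String[] args) {
        // separated arrival: all negatives first; interleaved arrival: -,+,-,+,...
        double[]  sA = {0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1};
        boolean[] yA = {false, false, false, false, true, true, true, true};
        double[]  sB = {0.9, 0.4, 0.8, 0.3, 0.7, 0.2, 0.6, 0.1};
        boolean[] yB = {false, true, false, true, false, true, false, true};

        System.out.println(auc(sA, yA) + " " + auc(sB, yB));                 // 0.0 0.0
        System.out.println(blockAuc(sA, yA, 2) + " " + blockAuc(sB, yB, 2)); // 1.0 0.0
    }
}
```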

The above examples show that depending on the calculation procedure, one can obtain AUC estimates that are far away from each other. This might be beneficial for concept-drifting streams, but not for stationary distributions. Since we cannot ensure equal prequential, block-based, and batch AUC estimates, we are interested in how different these estimates are on average. More importantly, we are interested in how these differences affect classifier evaluation. To answer these questions, we will use criteria for comparing evaluation measures proposed by Huang and Ling [30].

4.2.2 Theoretical background

When discussing two different measures f and g used for evaluating two learning algorithms A and B, we want f and g to be consistent with each other. That is, when f shows that algorithm A is better than B, then g will not say B is better than A [30]. Furthermore, if f is more discriminating than g, we expect to see cases where f can tell the difference between A and B but g cannot, but not vice versa. These intuitive descriptions of consistency and discriminancy were made precise by the following definitions [30].

Definition 1

For two measures f, g and two classifier outputs a, b on a domain \(\varPsi \), f and g are strictly consistent if there exists no \(a, b \in \varPsi \), such that \(f(a) > f(b)\) and \(g(a) < g(b)\).

Definition 2

For two measures f, g and two classifier outputs a, b on a domain \(\varPsi \), f is strictly more discriminating than g if there exists \(a, b \in \varPsi \) such that \(f(a) \ne f(b)\) and \(g(a) = g(b)\), and there exists no \(a, b \in \varPsi \), such that \(g(a) \ne g(b)\) and \(f(a) = f(b)\).

As we have already shown in Tables 1 and 2, counterexamples to strict consistency and discriminancy do exist for average AUC calculated on an entire stream and using blocks or sliding windows. Therefore, it is impossible to prove strict consistency and discriminancy between batch and prequential or block AUC based on Definitions 1 and 2. In this case, we rather have to consider the degree of being consistent and the degree of being more discriminating. This leads to the definitions of the degree of consistency and degree of discriminancy [30].

Definition 3

For two measures f, g and two classifier outputs a, b on a domain \(\varPsi \), let \(R = \{ (a,b) | a,b \in \varPsi , f(a)> f(b), g(a) > g(b)\}\), \(S = \{ (a,b) | a,b \in \varPsi , f(a) > f(b), g(a) < g(b)\}\). The degree of consistency of f and g is \({\mathbf {C}}\) (\(0 \le {\mathbf {C}} \le 1\)), where \({\mathbf {C}} = \frac{|R|}{|R| + |S|}\).

Definition 4

For two measures f, g and two classifier outputs a, b on a domain \(\varPsi \), let \(P = \{ (a,b) | a,b \in \varPsi , f(a) > f(b), g(a) = g(b)\}\), \(Q = \{ (a,b) | a,b \in \varPsi , g(a) > g(b), f(a) = f(b)\}\). The degree of discriminancy for f over g is \({\mathbf {D}} = \frac{|P|}{|Q|}\).

As suggested by Huang and Ling, two measures should agree on the majority of classifier evaluations to be comparable. Therefore, we require prequential AUC averaged over the entire stream to be consistent with batch AUC to a degree \({\mathbf {C}} > 0.5\). Furthermore, the degree of discriminancy \({\mathbf {D}}\) shows how many times more likely it is that one measure can tell the difference between two algorithms when the other measure cannot. In the following paragraphs, we will say that measures f and g are statistically consistent if \({\mathbf {C}} > 0.5\), and, treating batch AUC as the base measure f, that the compared measure g is more discriminating than f if the degree of discriminancy of f over g satisfies \({\mathbf {D}} < 1\) [30].
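A direct transcription of Definitions 3 and 4 into code might look as follows (a sketch under our own naming, where the arrays f and g hold the values the two measures assign to the same classifier outputs; in the simulations of Sect. 4.2.3, f corresponds to batch AUC and g to prequential or block AUC). Counting ordered pairs double-counts each unordered pair, which does not change either ratio.

```java
// Sketch (ours) of the degree of consistency C = |R| / (|R| + |S|) and the degree
// of discriminancy D = |P| / |Q| of measure f over measure g (Definitions 3 and 4).
public final class ConsistencyDiscriminancy {
    /** Returns {C, D} for measure values f and g over the same classifier outputs. */
    public static double[] degrees(double[] f, double[] g) {
        long r = 0, s = 0, p = 0, q = 0;
        for (int a = 0; a < f.length; a++) {
            for (int b = 0; b < f.length; b++) {
                if (a == b) continue;
                if (f[a] > f[b] && g[a] > g[b]) r++;   // R: both measures agree
                if (f[a] > f[b] && g[a] < g[b]) s++;   // S: the measures disagree
                if (f[a] > f[b] && g[a] == g[b]) p++;  // P: f discriminates, g does not
                if (g[a] > g[b] && f[a] == f[b]) q++;  // Q: g discriminates, f does not
            }
        }
        double c = (r + s) == 0 ? Double.NaN : (double) r / (r + s);
        double d = q == 0 ? Double.POSITIVE_INFINITY : (double) p / q;
        return new double[] { c, d };
    }
}
```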

Since prequential and block AUC are dependent on the ranking of examples, their order in the stream, and the used window size d, it is difficult to formally prove whether they are statistically consistent (\({\mathbf {C}} > 0.5\)) and more discriminating (\({\mathbf {D}} < 1\)) than AUC calculated using the entire stream at once. Therefore, we will use simulations to verify statistical consistency with batch AUC and whether we can decide which calculation procedure is most discriminating. More importantly, empirical evaluations on artificial datasets will give us an insight into the practical degree of consistency and discriminancy for different class imbalance ratios and window sizes.

4.2.3 Simulations

We generate datasets with 4, 6, 8, and 10 examples, which could be ranked by two hypothetical classifiers. For each number of examples, we enumerate all possible example orderings for all possible pairs of classifier-produced rankings with 50, 34, and 14% minority class examples, and calculate prequential (and block) AUC for all possible window sizes (\(1< d < n\)). For a dataset with \(n_p\) positive examples and n examples in total, there are \(\left( {\begin{array}{c}\left( {\begin{array}{c}n\\ n_p\end{array}}\right) + 2 - 1\\ 2\end{array}}\right) \cdot n!\) such non-repeating pairs of ranked lists with different example orderings. Due to this large number of ordering/ranking combinations, we were only able to exhaustively test datasets of up to 10 instances. The three class imbalance ratios were chosen to show performance on a balanced dataset (50%), a 1:2 imbalance ratio (34%), and the extreme case of only one positive instance regardless of the dataset size (14%).

We exhaustively compare all orderings of examples for all possible example rankings to verify the degree of consistency and discriminancy for different window sizes d. To obtain the degree of consistency, we count the number of pairs for which “\(AUC(a) < AUC(b)\) and \(pAUC(a) < pAUC(b)\)” and the number of pairs for which “\(AUC(a) < AUC(b)\) and \(pAUC(a) > pAUC(b)\),” where \(AUC(\cdot )\) denotes batch-calculated AUC and \(pAUC(\cdot )\) prequential AUC averaged over the entire stream. To obtain the degree of discriminancy, we count the number of pairs which satisfy “\(AUC(a) < AUC(b)\) and \(pAUC(a) = pAUC(b)\)” and the number of pairs which satisfy “\(AUC(a) = AUC(b)\) and \(pAUC(a) < pAUC(b)\)”. Similar computations were done for block AUC, denoted as \(bAUC(\cdot )\). Tables 3, 4, 5, 6, 7 and 8 show the results of these experiments, with prequential AUC always shown on the left side and block AUC on the right side. In the table headers, AUC, prequential AUC, and block-based AUC are denoted as \(A(\cdot )\), \(pA(\cdot )\), \(bA(\cdot )\), respectively.

Table 3 Degree of consistency with batch-calculated AUC for balanced datasets (50% both classes)
Table 4 Degree of consistency with batch-calculated AUC for imbalanced datasets (34% minority class)
Table 5 Degree of consistency with batch-calculated AUC for imbalanced datasets (14% minority class)
Table 6 Degree of discriminancy compared to batch-calculated AUC for balanced datasets (50% both classes)
Table 7 Degree of discriminancy compared to batch-calculated AUC for imbalanced datasets (34% minority class)
Table 8 Degree of discriminancy compared to batch-calculated AUC for imbalanced datasets (14% minority class)

Regarding consistency (Tables 3, 4, 5), the results show that both block-based and prequential AUC have a high percentage of decisions consistent with batch AUC. More precisely, both estimations usually achieve a degree of consistency between 0.80 and 0.90, which is much larger than the required 0.50. However, it is also worth noticing that, for all class imbalance ratios and most window sizes, prequential AUC is more consistent with batch AUC than its block-calculated competitor. What is even more important, this difference is most apparent for smaller window sizes, where prequential AUC usually has a 0.05 higher degree of consistency. Finally, it is worth noticing that larger windows consistently yield estimates that are more consistent with batch AUC.

In terms of the degree of discriminancy (Tables 6, 7, 8), the results vary. For very small window sizes, batch AUC seems to be better at differentiating rankings (\(D>1\)), whereas windows of sizes \(d > 3\) invert this relation (\(D<1\)). However, once again it is worth noticing that prequential AUC is usually more discriminant (has a smaller D) than block AUC, regardless of the window size or class imbalance ratio. Moreover, higher class imbalance ratios appear to make the differentiation more difficult for prequential and block estimations, especially for smaller window sizes. This is understandable, as with such small datasets only a single positive example is available for the higher class imbalance ratios. If a window is too small compared to the class imbalance ratio, it may contain no positive examples. This situation is especially visible in the datasets with 14% minority class examples, where for all dataset sizes n the number of positive examples is \(n_p = 1\).

A direct conclusion can be drawn from this observation: window sizes used for prequential (or block) AUC estimation should be large enough to always contain at least one positive example, to ensure higher discriminancy. Nevertheless, apart from window sizes that are very small compared to the class distribution, prequential AUC is comparably discriminant to AUC calculated on the entire stream. Although not presented in this manuscript, the authors also analyzed the measures' indifference (the ratio of cases where neither of the measures can differentiate two distinct rankings [30]), which remained very small throughout all window sizes and class imbalance ratios.

Apart from the number of ranking pairs where prequential or block estimations are consistent or more discriminating than batch AUC, one may be interested in the absolute difference between the values of these estimations. To verify these differences, we have calculated and plotted the values of \(pAUC(a){-}AUC(a)\) and \(bAUC(a){-}AUC(a)\) for the three analyzed class imbalance ratios. Due to space limitations, Figs. 5 and 6 present these differences only for the balanced datasets.

The left-hand side of each figure presents a three-dimensional plot, where the x-axis denotes the difference between prequential (or block) AUC and batch AUC, the y-axis describes window sizes, and the z-axis shows the number of rankings for which a given difference was observed. The right-hand side of each figure shows a two-dimensional top view of the same plot. The left plots are intended to demonstrate the dominating difference values and their variation for each window size; these plots usually exhibit peaks around the 0.0 difference. The right plots, on the other hand, clearly show the range of possible differences for each window size.

Fig. 5 Differences between prequential and batch AUC for different window sizes on the largest balanced dataset (50% examples of both classes). a Three-dimensional plot. b Two-dimensional top view

Fig. 6 Differences between block and batch AUC for different window sizes on the largest balanced dataset (50% examples of both classes). a Three-dimensional plot. b Two-dimensional top view

As Fig. 5 shows, most prequential estimates of AUC are very close to AUC calculated on the entire dataset. As observed for all three class imbalance ratios, single points rise above the bell-shaped distribution directly above the zero difference value, showing that the most common difference between batch and prequential AUC is zero. This is not so obvious for block-based estimates, presented in Fig. 6. When compared with prequential AUC, block estimates have much “wider” bell curves without such strong peaks around zero.

Looking at the two-dimensional plots, it is worth noticing that for small window sizes (small compared to the class imbalance ratio) prequential AUC gives a more optimistic estimate than batch AUC. This issue is related to that of lower discriminancy for small windows or high class imbalance ratios (plots for the 14% minority class dataset have far fewer distinct points). When there are no positive examples in the window, AUC for that window is equal to 1, which can lead to overestimating AUC over time. However, as the plots show, larger windows clearly mitigate this problem. The two-dimensional plots also show that, compared to prequential computation, block-based calculations are more prone to over- and underestimation of AUC.

4.2.4 Summary of results

The above analyses have shown that prequential AUC averaged over time is highly consistent and comparably discriminant to AUC calculated on the entire stream. We have also seen that the absolute difference between these two measures is small for most example orderings. Furthermore, we have noticed that the window size used for prequential calculation should be large enough to contain a sufficient number of positive examples at all times, to avoid overly optimistic AUC estimates and, as a result, low discriminancy. Finally, prequential AUC proved to be more consistent, more discriminant, and closer in terms of absolute values to batch AUC than block-calculated AUC.

5 Processing speed, parameter sensitivity, and classifier evaluation

In this section, we experimentally evaluate prequential AUC on real and synthetic data streams. The proposed measure is first analyzed in terms of its processing time and compared with the Kappa statistic (\(\kappa \)) [2]. Next, we check the sensitivity of the proposed measure with respect to the window size d. Finally, we conduct experiments that compare how prequential AUC, accuracy, recall, G-mean, \(\kappa \), and \(\kappa _\mathrm{m}\) [4] evaluate classifiers on imbalanced data streams with various types of drift.

5.1 Experimental setup

For experiments comparing prequential AUC with other evaluation measures, we used six commonly analyzed data stream classifiers [11, 14, 20]:

  • Naive Bayes (NB),

  • Hoeffding Tree (HT),

  • Hoeffding Tree with ADWIN (HT\(_\mathrm{ADWIN}\)),

  • Online Bagging (Bag),

  • Accuracy Updated Ensemble (AUE),

  • Selectively Recursive Approach (SERA).

Naive Bayes was chosen as a probabilistic classifier without any forgetting mechanism. The Hoeffding Tree was selected as a single stream classifier without (HT) and with a drift detector (HT\(_\mathrm{ADWIN}\)). The last three algorithms are ensemble classifiers representing: an online approach (Bag), a block-based approach with forgetting (AUE), and a dynamic block-based oversampling method designed for imbalanced streams (SERA).

All the algorithms and evaluation methods were implemented in Java as part of the MOA framework [3]. The experiments were conducted on a machine equipped with a dual-core Intel i7-2640M 2.8 GHz processor and 16 GB of RAM. For all the ensemble methods (Bag, AUE, SERA) we used 10 Hoeffding Trees as base learners, each with a grace period \(n_\mathrm{min} = 100\), split confidence \(\delta = 0.01\), and tie-threshold \(\psi = 0.05\) [20]. AUE and SERA were set to create new components every \(d=1000\) examples.

5.2 Datasets

In experiments comparing prequential AUC with accuracy, recall, G-mean, and the Kappa statistic, we used 4 real and 13 synthetic datasets. For the real-world datasets, we chose four data streams commonly used as benchmarks [8, 25, 34, 43]. More precisely, we chose Airlines (Air) and Electricity (Elec) as examples of fairly balanced datasets, and KDDCup and PAKDD as examples of moderately imbalanced datasets. It is worth noting that we used the smaller version of the KDDCup dataset and transformed it into a binary classification problem by combining every class other than “NORMAL” into one “ATTACK” class.

Synthetic datasets were created using custom stream generators prepared for this study. The Ratio datasets are designed to test classifier performance under different imbalance ratios without drift. Examples from the minority class form a five-dimensional sphere, whereas majority class examples are uniformly distributed outside that sphere. The Dis datasets are created in a similar manner, but the minority class is fragmented into spherical subclusters (playing the role of small disjuncts [32]). Datasets \(\texttt {Dis}_{2}\), \(\texttt {Dis}_{3}\), \(\texttt {Dis}_{5}\) have 2, 3, and 5 clusters, respectively. In the AppDis datasets, every stream begins with a single well-defined cluster, and additional clusters are added as the stream progresses. New disjuncts appear suddenly in negative class space after 40 k (\(\texttt {AppDis}_{2,3,5}\)), 50 k (\(\texttt {AppDis}_{3,5}\)), 60 and 70 k (\(\texttt {AppDis}_{5}\)) examples. In static data mining, the problem of small disjuncts is known to be more problematic than class imbalance per se [32]; however, to the best of our knowledge, this issue has not been analyzed in stream classification. Datasets MinMaj, \(\texttt {Gradual}_\mathrm{RC}\), \(\texttt {Sudden}_\mathrm{RC}\) contain class ratio changes over time. In MinMaj the negative class abruptly becomes the minority; such a virtual drift relates to a problem recently discussed in [46]. \(\texttt {Sudden}_\mathrm{RC}\) was created using a modified version of the SEA generator [42] and contained three sudden class ratio changes (1:1/1:100/1:10/1:1) appearing every 25 k examples. Analogously, \(\texttt {Gradual}_\mathrm{RC}\) uses a modified Hyperplane generator [45] and simulates a continuous ratio change from 1:1 to 1:100 throughout the entire stream. We note that Ratio, Dis, AppDis, and MinMaj are unevenly imbalanced, i.e., for a class ratio of 1:99 a positive example can occur less (or more) frequently than every 100 examples.
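As an illustration of the Ratio family described above, the sketch below (ours) generates one example at a time: the class is drawn independently with probability equal to the minority fraction, which produces the “uneven” imbalance mentioned above, minority examples fall inside a five-dimensional sphere, and majority examples are uniform outside it. The sphere center, its radius, the attribute ranges, and the use of rejection sampling are assumptions made purely for illustration; the actual generators used in the experiments may differ.

```java
import java.util.Random;

// Illustrative sketch (ours) of a Ratio-style stream: minority examples inside a
// five-dimensional sphere, majority examples uniformly outside it, class drawn
// independently per example. Rejection sampling is simple but wasteful; it is
// sufficient for a sketch.
public final class RatioStreamSketch {
    private static final int DIM = 5;
    private final Random rnd = new Random(1);
    private final double minorityFraction;          // e.g., 0.01 for a 1:99 ratio
    private final double radius = 0.4;              // assumed sphere radius
    // the sphere is assumed to be centered at (0.5, ..., 0.5) in the unit hypercube

    public RatioStreamSketch(double minorityFraction) {
        this.minorityFraction = minorityFraction;
    }

    /** Returns one example: DIM attribute values followed by the label (1 = minority). */
    public double[] next() {
        boolean minority = rnd.nextDouble() < minorityFraction;
        double[] x;
        do {                                         // resample until the point matches the class
            x = new double[DIM];
            for (int i = 0; i < DIM; i++) x[i] = rnd.nextDouble();
        } while (insideSphere(x) != minority);
        double[] example = new double[DIM + 1];
        System.arraycopy(x, 0, example, 0, DIM);
        example[DIM] = minority ? 1.0 : 0.0;
        return example;
    }

    private boolean insideSphere(double[] x) {
        double dist2 = 0;
        for (double v : x) dist2 += (v - 0.5) * (v - 0.5);
        return dist2 <= radius * radius;
    }
}
```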

Additionally, two small data streams created using the RBF generator [3] (\(\texttt {RBF}_{20\mathrm{k}}\), \(\texttt {RBF}_{20\mathrm{kSD}}\)) were used to showcase the evaluation speed of prequential AUC. The characteristics of all the datasets are given in Table 9.

Table 9 Characteristics of datasets

5.3 Prequential AUC evaluation time

Some researchers have argued that AUC is too computationally expensive to be used for evaluating data stream classifiers. In particular, the inability to calculate AUC efficiently made authors suggest, for example, that “the Kappa statistic is more appropriate for data streams than a measure such as the area under the ROC curve” [2]. Therefore, to verify whether AUC calculated prequentially overcomes these limitations, we compare its processing time with that of the Kappa statistic. We have already noted in Sect. 3 that the time required per example to calculate prequential AUC is constant; however, we also want to measure the evaluation time on concrete data streams.

We examine the average evaluation time required per example for two small data streams: one without drift (\(\texttt {RBF}_{20\mathrm{k}}\)) and one with a single sudden drift (\(\texttt {RBF}_{20\mathrm{kSD}}\)). We use the MOA framework to compare the time required to evaluate a single Hoeffding Tree using both measures. The calculation of the measures does not depend on the number of attributes, and therefore, we evaluate their running time only on two datasets averaged over ten runs. Table 10 presents evaluation time per example using prequential AUC and prequential Kappa (\(\kappa \)) on a window of \(d = 1000\) examples (an identical window size was used in [2]).

Table 10 Evaluation time per example (ms) using prequential AUC and \(\kappa \) on a window of \(d = 1000\) examples (averaged over 10 runs ± standard deviation)

As the results in Table 10 show, evaluation using prequential AUC is only slightly slower than using the Kappa statistic on both datasets. The difference is very small; proportionally, AUC is 0.57% slower for \(\texttt {RBF}_{20\mathrm{k}}\) and 0.79% slower for \(\texttt {RBF}_{20\mathrm{kSD}}\). This difference could be larger for a larger window size; however, due to the properties of red-black trees, it would grow logarithmically. It is also worth noting that in these experiments the difference between prequential AUC and the Kappa statistic may be obscured by the time required to train and test the classifier. However, this only shows that the evaluation time for both measures is comparably small, even when related to the classification time of a single Hoeffding Tree.

As the described experiments show, prequential AUC offers similar evaluation time compared to the Kappa statistic for a window size of \(d = 1000\) examples. Furthermore, this difference can grow only logarithmically for larger window sizes. Thus, prequential AUC should not be deemed too computationally expensive for evaluating data stream classifiers. This stands in contrast to AUC calculated on entire streams and is one of the key features making prequential AUC applicable to real-world data streams.

5.4 Parameter sensitivity

In many data stream mining algorithms, the sliding window size is a parameter which strongly influences the performance of the algorithm [20, 45]. Therefore, we decided to verify the impact of using different window sizes d for calculating prequential AUC. Table 11 presents the average prequential AUC of a Hoeffding Tree on different datasets while using \(d \in [100;5000]\).

Table 11 Average prequential AUC of a Hoeffding Tree using different window sizes d
Fig. 7 Box plot of average prequential AUC for window sizes \(d \in [100; 5000]\). The depicted values are the percentage deviation from the mean on each dataset

Additionally, in Fig. 7 we present a box plot summarizing the differences in averaged AUC estimates for different window sizes. The plot was created by first calculating the average prequential AUC on each dataset over all window sizes, and then calculating and plotting the differences between this overall mean and the value obtained for a given d. For example, for the KDDCup dataset the mean value of average prequential AUC over all window sizes \(d\in [100;5000]\) is 0.994. Therefore, the deviation for \(d=100\), where prequential AUC for KDDCup equals 0.997, is: \(\frac{0.997-0.994}{0.994} \times 100\% \approx 0.3\%\).

Analyzing the values in Table 11, one can see that the differences in each row are very small. The box plot in Fig. 7 confirms this observation and shows that all but one value are within 3% of the mean value on each dataset. The biggest deviations were obtained for window sizes \(d \in \{100, 200\}\) on the \(\texttt {Ratio}_{0199}\) and PAKDD datasets. PAKDD is the hardest of the analyzed datasets and requires larger windows for better AUC estimates. \(\texttt {Ratio}_{0199}\), on the other hand, is highly skewed and unevenly imbalanced; therefore, in parts of the stream minority examples appear less often than once every 100 or 200 examples (the largest positive class “gap” in this dataset is 700 examples). It is worth noting that this observation is in accordance with the analysis performed in Sect. 4, where we noted that the window size should be large enough to incorporate at least one positive example at all times. In terms of other dependencies on d, as discussed in previous sections, the memory required to calculate AUC grows linearly with d, whereas the processing time per example grows logarithmically with d.

Due to the small dependence of AUC estimates on d, in the remaining experiments we decided to use \(d=1000\) as a value with low variance, relatively low time and memory consumption, and one that has often been proposed in other studies [2, 21, 45]. However, as the presented results show, using any window size from the analyzed range would yield very similar results.

5.5 Comparison of imbalanced stream evaluation measures

We experimentally compared classifier evaluations performed using prequentially calculated accuracy (Acc.), AUC, G-mean (\(G_\mu \)), Cohen's Kappa (\(\kappa \)), Kappa M (\(\kappa _\mathrm{m}\)), and positive class recall (Rec.). For this purpose, we tested six classifiers: NB, Bag, HT, HT\(_\mathrm{ADWIN}\), AUE, and SERA. The results were obtained using a sliding window of \(d = 1000\) examples [2, 21] and by sampling performance values every 100 examples. Tables 12, 13 and 14 present average results for all six measures.

Table 12 Average prequential performance values for Naive Bayes (NB) and Online Bagging (Bag)
Table 13 Average prequential performance values for a Hoeffding Tree without (HT) and with a drift detector (HT\(_\mathrm{ADWIN}\))
Table 14 Average prequential performance values for AUE and SERA

Comparing performance values on all datasets, one can notice that all classifiers showcase high accuracy. This is caused mainly by the fact that practically all datasets are highly imbalanced, and even a simple classifier that always predicts the majority class would yield high accuracy. The remaining five measures are better at differentiating classifier performance and showcasing difficulties in the datasets.

Another observation worth noting is that G-mean, \(\kappa \), \(\kappa _\mathrm{m}\), and recall show similar values for consecutive datasets. For example, with growing class imbalance between the Ratio datasets (from 1:1 to 1:99), values of all four aforementioned measures systematically drop. AUC, on the other hand, performs differently. This is the result of AUC being a measure that assesses the ranking capabilities of a classifier. On datasets where the classifiers rank most examples correctly (Ratio, Dis, AppDis, MinMaj) AUC reports values similar to accuracy. On the other hand, on datasets where the classifiers are unable to correctly rank examples (\(\texttt {Gradual}_\mathrm{RC}\), \(\texttt {Sudden}_\mathrm{RC}\), PAKDD) AUC values are lower than accuracy. This phenomenon is known from static data mining, where it was reported that AUC and accuracy are consistent measures, with AUC being more discriminating on some datasets [30].

To obtain a broader view of how each of the analyzed measures assessed the classifiers, we performed a Friedman test with Nemenyi post-hoc analysis [12]. The resulting mean ranks are presented in Table 15, whereas critical distance plots are available in Appendix B in the supplementary materials.

Table 15 Average algorithm ranks in the Friedman test

The null hypothesis that all the classifiers perform similarly was rejected for each measure with \(p < 0.001\). Results based on G-mean and \(\kappa \) showcase SERA as the best classifier. Similarly, SERA is second best according to accuracy, AUC, \(\kappa _\mathrm{m}\), and recall, outperformed only three times by Bag and HT\(_\mathrm{ADWIN}\), respectively. Indeed, SERA is the only tested algorithm designed specifically for imbalanced streams, one which oversamples positive examples from previous data chunks. However, Figure S5 in the supplementary materials shows that the groups of classifiers which are not significantly different (at \(\alpha = 0.05\)) are distinct for each measure. We can see that AUE is ranked fairly high according to accuracy and \(\kappa _\mathrm{m}\), but not so according to the remaining measures. We also note that HT\(_\mathrm{ADWIN}\), which is the only algorithm with active drift detection, achieved the highest mean rank according to recall, but was ranked much lower by the other measures.

Fig. 8 Comparison of prequential performance over time (dataset: measure(s)). a \(\texttt {Ratio}_{5050}\): HT, all measures. b \(\texttt {Ratio}_{0595}\): HT, all measures. c \(\texttt {AppDis}_{2}\): Recall. d \(\texttt {AppDis}_{2}\): SERA, all measures. e \(\texttt {Sudden}_\mathrm{RC}\): Bag, all measures. f MinMaj: NB, all measures

Differences between the analyzed measures were also visible on charts depicting performance over time. Figure 8 presents selected plots that best characterize the differences between all the analyzed measures. We note that the presented line charts were smoothed with Bézier curves to make the plots less noisy.

Figure 8a, b compares evaluations of HT on a balanced and imbalanced stream. We can see that the biggest difference involves recall, which changes from almost perfect on the balanced dataset to the second lowest statistic on the imbalanced one. Moreover, because the concept generating the examples did not change, rankings produced by classifiers remained similar and AUC remained at a high level, whereas measures that count crisp misclassifications changed their absolute values.

Looking at Fig. 8c, one can notice that the Naive Bayes algorithm has difficulties in adapting to new minority class subclusters that appear after 40 k examples. SERA has a similar problem because it oversamples positive examples by selecting the most similar ones from previous data chunks, which become outdated when the minority class is split. This shows that recursive oversampling algorithms proposed for data streams could be improved to take into account appearing subconcepts. Figure 8d additionally exemplifies that an algorithm can adapt its ranking model fairly quickly (AUC), but due to static decision thresholds requires more examples to offer better minority class predictions (\(\kappa , \kappa _\mathrm{m}\), G-mean, recall). In other words, according to AUC, SERA recovered from the drift very quickly, but the threshold converting example scores into crisp classifications ruled out many correct minority class predictions. Therefore, it would be interesting to consider decision thresholds dynamically tuned in the event of concept drift.

Figure 8e shows that all measures except accuracy are capable of depicting sudden class ratio changes over time. Similar observations were made for gradual ratio changes (\(\texttt {Gradual}_\mathrm{RC}\)). Finally, Fig. 8f shows that accuracy, \(\kappa \), and \(\kappa _\mathrm{m}\) are sensitive to minority-majority class swaps. Conversely, recall and G-mean report the improving positive class predictions, whereas AUC shows that ranking performance, and thus probably the class boundaries, did not change. Figure 8f constitutes an example of the complementary properties of the analyzed measures.

6 Conclusions and outlook

In case of static data, AUC is one of the most popular measures for evaluating classifiers, both on balanced and imbalanced classes. However, due to its computational costs, until now AUC has not been successfully adapted to data stream mining. To overcome these limitations, in [9] we proposed an efficient algorithm which calculates AUC with forgetting on evolving data streams, by using a sorted tree structure with a sliding window. In this paper, we investigated the basic properties of the resulting evaluation measure, called prequential AUC, in the context of data stream classification.

Below, we summarize the main findings of this study:

  • Prequential AUC was compared with earlier, limited attempts to calculate AUC for data streams. As a result, prequential AUC was found to give less pessimistic estimations and, therefore, to be preferable to batch, incremental, and block AUC estimations for streams with concept drifts.

  • The proposed measure was found to be statistically consistent and comparably discriminant with AUC calculated on stationary data.

  • Prequential AUC was also positively evaluated in terms of processing speed. This stands in contrast to batch-calculated AUC, and is one of the key features making prequential AUC applicable to real-world streams.

  • The window size used for calculating AUC with forgetting was found to have negligible impact on the resulting AUC estimation.

  • We performed a comparative study of prequential accuracy, AUC, Kappa, Kappa M, G-mean, and recall. On a set of imbalanced data streams with various difficulty factors, the analyzed measures reacted differently to drifts and showcased characteristic properties concerning: focus on one or both classes, sensitivity to class ratio changes or minority–majority class swaps, ranking and crisp classification assessment.

The conclusions drawn in this paper can have important implications for evaluating and designing data stream classifiers. Prequential AUC can be considered an additional tool for comparing online learning algorithms on evolving data, especially ranking classifiers. This can help open new lines of research, as the topics of ranking and probability calibration have still not been thoroughly tackled in the context of evolving streams. Furthermore, the adaptation of AUC for data streams opens questions about the possibility of introducing other ranking measures, such as the H-measure, or in-depth ROC analysis to data stream mining. Our future studies, already in progress, will also deal with the impact of additional class imbalance difficulty factors [38] on stream classifiers. By analyzing ratio changes, small disjuncts, majority class swaps, borderline examples, rare cases, outliers, and possible drifts between these classes, we aim to create a complex testbed for algorithms learning from imbalanced streams.