1 Introduction

Measuring the performance of a classifier is a vital task in machine learning. The running time of the algorithm that computes the measure plays only a small role in an offline setting, for example, when the classifier is being developed by a researcher. However, the running time becomes crucial if our goal is to monitor the performance of a classifier over time, where new data points may arrive at a high rate.

For example, consider the task of monitoring abnormal behaviour in IT systems based on event logs. Here, the main problem is the gargantuan volume of event logs, which makes manual monitoring impossible. One approach is to use a classifier that monitors for abnormal events and alerts analysts for closer inspection. Here, monitoring should be done continuously so that abnormalities are noticed rapidly. Moreover, the performance of the classifier should also be monitored continuously, as the underlying distribution, and potentially the performance of the classifier, may change due to changes in the IT system.

In order to detect recent changes in the performance, we are often interested in the performance over the last n data points. More generally, we are interested in maintaining the measure under addition or deletion of data points.

We study algorithms for maintaining two measures. The first measure is the area under the ROC curve (AUC), a classic technique for measuring the performance of a classifier based on its ROC curve. We also study the H-measure, an alternative measure proposed by Hand (2009). Roughly speaking, the measure is based on the minimum weighted loss, averaged over the cost ratio. A practical advantage of the H-measure over AUC is that it allows a natural way of weighting classification errors.

Both measures can be computed in \(\mathcal {O} \mathopen {}\left( n \log n\right)\) time from scratch, or in \(\mathcal {O} \mathopen {}\left( n\right)\) time if the data points are already sorted. In this paper we present three algorithms that allow us to maintain the measures in polylogarithmic time.

The first algorithm maintains AUC under addition or deletion of data points. The approach is straightforward: we maintain the data points sorted in a self-balancing search tree. In order to update AUC we need to know the ROC coordinates of the data point that we are changing. Luckily, this can be done by augmenting the search tree so that it maintains the cumulative counts of the labels in each subtree. Consequently, we can obtain the coordinates in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time, which leads to a total of \(\mathcal {O} \mathopen {}\left( \log n\right)\) maintenance time.

Our next two algorithms involve maintaining the H-measure. Computing the H-measure involves finding the convex hull of the ROC curve, and enumerating over the hull. First we show that we can use a classic dynamic convex hull algorithm with some minor modifications to maintain the convex hull of the ROC curve. The modifications are required as we do not have the ROC coordinates of individual data points, but we can use the same trick as when computing AUC to obtain the needed coordinates.

Then we show that if we estimate the class priors from the test data, we can decompose the H-measure into a sum over the points in the convex hull such that the ith term depends only on the difference between the ith and the \((i - 1)\)st points of the hull. This decomposition allows us to maintain the H-measure in \(\mathcal {O} \mathopen {}\left( \log ^2 n\right)\) time.

If the class priors are not estimated from the test data, then we propose an approximation algorithm. Here the idea is to group together points of the convex hull that are close to each other, and to use only one point from each such group. The grouping is done in such a way that we maintain an \(\epsilon\)-approximation in \(\mathcal {O} \mathopen {}\left( (\log n + \epsilon ^{-1}) \log n\right)\) time.

Structure The rest of the paper is organized as follows. We present preliminary definitions in Sect. 2 and discuss related work in Sect. 3. In Sect. 4 we demonstrate how to maintain AUC, and in Sects. 5 and 6 we demonstrate how to maintain the H-measure. We present the experimental evaluation in Sect. 7, and conclude the paper with a discussion in Sect. 8.

2 Preliminaries

Assume that we are given a multiset of n data points Z. Each data point \(z = (s, \ell )\) consists of a score \(s \in {\mathbb {R}}\) and a true label \(\ell \in \left\{ 1, 2\right\}\). The score is typically obtained by applying a classifier, with high values implying that z should be classified as class 2. To simplify the notation, given \(z = (s, \ell )\) we define \(d \mathopen {}\left( z\right) = (1, 0)\) if \(\ell = 1\), and \(d \mathopen {}\left( z\right) = (0, 1)\) if \(\ell = 2\). We can now write

$$\begin{aligned} (n_1, n_2) = \sum _{z \in Z} d \mathopen {}\left( z\right), \end{aligned}$$

that is, \(n_j\) is the number of points having the label equal to j. Here we used the convention that the sum of two tuples, say (a, b) and (c, d), is \((a + c, b + d)\). Note that \(n = n_1 + n_2\).

Let \(S = \left( s_1 ,\ldots , s_n\right)\) be the list of all scores, ordered from the smallest to the largest. Let us write

$$\begin{aligned} r_i = \sum _{z \in Z, s(z)\le s_i} d \mathopen {}\left( z\right), \end{aligned}$$
(1)

that is, \(r_i\) are the label counts of points having a score less than or equal to \(s_i\).

We obtain the ROC curve by normalizing \(r_i\) in Eq. 1, that is, the ROC curve is a list of \(n + 1\) points \(X = (x_0, x_1, \ldots , x_n)\), where

$$\begin{aligned} x_i = (r_{i1} / n_1, r_{i2} / n_2) \end{aligned}$$

and \(x_0 = (0, 0)\). Note that not all points in X are necessarily unique. The points in X are confined to the unit square \([0, 1] \times [0, 1]\). See Fig. 1 for an illustration.

Fig. 1 Example of a ROC curve and AUC. If we consider label 1 as a true label and label 2 as a false label, then the vertical axis is the true positive rate (TPR) while the horizontal axis is the false positive rate (FPR)
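To make the definitions concrete, the following minimal Python sketch (ours, not part of the algorithms developed later) builds the ROC points of Eq. 1 directly from the scored data; it assumes that both labels occur in Z.

```python
from itertools import groupby

def roc_points(Z):
    """Build the ROC points x_0, ..., x_n of Eq. 1 from (score, label) pairs, labels in {1, 2}."""
    n1 = sum(1 for _, label in Z if label == 1)
    n2 = len(Z) - n1
    points = [(0.0, 0.0)]                       # x_0 = (0, 0)
    r1 = r2 = 0                                 # cumulative label counts r_i
    for _, group in groupby(sorted(Z), key=lambda z: z[0]):
        block = list(group)
        r1 += sum(1 for _, label in block if label == 1)
        r2 += sum(1 for _, label in block if label == 2)
        # tied scores share the same r_i, so the corresponding ROC point repeats
        points.extend([(r1 / n1, r2 / n2)] * len(block))
    return points
```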

The area under the curve, \(auc \mathopen {}\left( Z\right)\) is the area below the ROC curve. If there is a threshold \(\sigma\) such that all data points with a score smaller than \(\sigma\) belong to class 1 and all data points with a score larger than \(\sigma\) belong to class 2, then \(auc \mathopen {}\left( Z\right) = 1\). If the scores are independent of the true labels, then the expected value of \(auc \mathopen {}\left( Z\right)\) is 1/2.

Instead of defining \(auc \mathopen {}\left( Z\right)\) using the ROC curve, we can also define it directly with the Mann–Whitney U statistic (Mann and Whitney 1947). Assume that we are given a multiset of points Z. Let \(S_1 = \left\{ s \mid (s, \ell ) \in Z, \ell = 1\right\}\) be the multiset of scores whose corresponding labels are equal to 1, and define \(S_2\) similarly. The Mann–Whitney U statistic is equal to

$$\begin{aligned} U = \sum _{s \in S_1} \sum _{t \in S_2} f(s, t), \quad \text {where}\quad f(s, t) = {\left\{ \begin{array}{ll} 1 &{} \text { if } s < t, \\ 0.5 &{} \text { if } s = t, \\ 0 &{} \text { if } s > t. \\ \end{array}\right. } \end{aligned}$$
(2)

We obtain \(auc \mathopen {}\left( Z\right)\) by normalizing U, that is, \(auc \mathopen {}\left( Z\right) = \frac{1}{{\left| S_1\right| }{\left| S_2\right| }}U\).

AUC can be computed naively using the U statistic in \(\mathcal {O} \mathopen {}\left( n^2\right)\) time. However, we can easily speed up the computation to \(\mathcal {O} \mathopen {}\left( n \log n\right)\) time using Algorithm 1. To see the correctness, note that in Eq. 2 each \(t \in S_2\) contributes to U with

$$\begin{aligned} \sum _{s \in S_1} f(s, t) = {\left| \left\{ s \in S_1 \mid s < t\right\} \right| } + \frac{1}{2}{\left| \left\{ s \in S_1 \mid s = t\right\} \right| }. \end{aligned}$$

Algorithm 1 achieves its running time by maintaining the first term (in a variable h) as it loops over sorted scores. Note that if Z is already sorted, then the running time reduces to linear.

Algorithm 1
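As a minimal sketch of this idea (with our own variable names; the exact pseudo-code is Algorithm 1 above), the computation can be written as follows. The loop groups tied scores so that the contribution of Eq. 2 is counted correctly.

```python
from itertools import groupby

def auc(Z):
    """AUC via the Mann-Whitney U statistic (Eq. 2); Z holds (score, label) pairs, labels in {1, 2}."""
    Z = sorted(Z)                        # O(n log n); this step is unnecessary if Z is already sorted
    n1 = sum(1 for _, label in Z if label == 1)
    n2 = len(Z) - n1
    U = 0.0
    h = 0                                # number of class-1 scores seen so far (strictly smaller ones)
    for _, group in groupby(Z, key=lambda z: z[0]):
        block = list(group)
        c1 = sum(1 for _, label in block if label == 1)
        c2 = len(block) - c1
        U += c2 * (h + c1 / 2)           # each class-2 score t adds |{s < t}| + |{s = t}| / 2
        h += c1
    return U / (n1 * n2)
```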

Our first goal is to show that we can maintain AUC in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time under addition or removal of data points.

Our second contribution is a procedure for maintaining H-measure.

The H-measure is an alternative measure proposed by Hand (2009). The main idea is as follows: consider minimizing the weighted loss,

$$\begin{aligned} \begin{aligned} Q(c, \sigma )&= c p(s(z)> \sigma, \ell (z) = 1) + (1 - c) p(s(z)\le \sigma, \ell (z) = 2) \\&= c \pi _1 p(s(z)> \sigma \mid \ell (z) = 1) + (1 - c) \pi _2 p(s(z)\le \sigma \mid \ell (z) = 2), \\ \end{aligned} \end{aligned}$$

where c is a cost ratio, \(\sigma\) is a threshold, z is a random data point, and \(\pi _k = p(\ell (z) = k)\) are the class priors. Let us write \(\sigma (c)\) for the threshold minimizing \(Q(c, \sigma )\) for a given c. Increasing c will decrease \(\sigma (c)\); in other words, by varying c we vary the threshold. As pointed out by Flach et al. (2011), the curve \(Q(c, \sigma (c))\) is a variant of a cost curve [see Drummond and Holte (2006)],

$$\begin{aligned} c p(s(z)> \sigma \mid \ell (z) = 1) + (1 - c) p(s(z)\le \sigma \mid \ell (z) = 2). \end{aligned}$$

Here the difference is that \(Q(c, \sigma (c))\) uses class priors \(\pi _k\) whereas the cost curve omits them.

Since not all values of c may be sensible, we assume that we are given a weight function u(c). We are interested in measuring the weighted minimum loss as we vary c,

$$\begin{aligned} L = \int Q(c, \sigma (c)) u(c) dc. \end{aligned}$$
(3)

Here, small values of L indicate a strong relationship between the labels and the scores.

The H-measure is a normalized version of L,

$$\begin{aligned} H = 1 - L / L_{\textit{max}}. \end{aligned}$$

Here, \(L_{\textit{max}}\) is the largest possible value of L over all possible ROC curves. The negation is done so that the values of H are consistent with the AUC scores: values close to 1 represent good performance.

We will see that a convenient choice for u is a beta distribution, as suggested by Hand (2009), since it allows us to express the integrals in a closed form.

Computing the empirical H-measure in practice starts with an ROC curve X. The following computations assume that the ROC curve is convex. If it is not, then the first step is to compute the convex hull of X, which we will denote by \(Y = \left( y_0 ,\ldots , y_m\right)\). Taking the convex hull inflates the performance of the underlying classifier; however, it is possible to modify the underlying classifier [see Hand (2009) for more details] so that its ROC curve is convex.

We then define

$$\begin{aligned} c_i = \frac{\pi _2(y_{i2} - y_{(i - 1)2})}{\pi _2(y_{i2} - y_{(i - 1)2}) + \pi _1(y_{i1} - y_{(i - 1)1})}, \end{aligned}$$
(4)

where, recall that, \(\pi _k = p(\ell (z) = k)\) are the class priors and \(\left( y_0 ,\ldots , y_m\right)\) is the convex hull. The probabilities \(\pi _k\) can be estimated either from Z or by some other means. In the former case, we show that we can maintain the H-measure exactly; in the latter case, we need to approximate the measure in order to achieve a sublinear maintenance time.

We also set \(c_0 = 0\) and \(c_{m + 1} = 1\). Note that \(c_i\) is a monotonically decreasing function of the slope of the convex hull. This guarantees that \(c_i \le c_{i + 1}\). We can show [see Hand (2009)] that if \(c_i< c < c_{i + 1}\), then the minimum loss is equal to

$$\begin{aligned} Q(c, \sigma (c)) = c\pi _1(1 - y_{i1}) + (1 - c)\pi _2y_{i2}. \end{aligned}$$

We can now write Eq. 3 as

$$\begin{aligned} L = \sum _{i = 0}^m \pi _1(1 - y_{i1}) \int _{c_i}^{c_{i + 1}} c u(c) dc + \pi _2y_{i2} \int _{c_i}^{c_{i + 1}} (1 - c)u(c)dc, \end{aligned}$$
(5)

and if we use the beta distribution with parameters \((\alpha , \beta )\) as u(c), we have

$$\begin{aligned} \begin{aligned} L = \frac{1}{B(1, \alpha , \beta )}\sum _{i = 0}^m&\pi _1(1 - y_{i1}) \left( B(c_{i + 1}; \alpha + 1, \beta ) - B(c_i; \alpha + 1, \beta )\right) \\&+ \pi _2y_{i2} \left( B(c_{i + 1}; \alpha , \beta + 1) - B(c_i; \alpha , \beta + 1)\right), \end{aligned} \end{aligned}$$
(6)

where \(B(\cdot \,; \alpha , \beta )\) is the incomplete beta function.

Finally, we can show that the normalization constant is equal to

$$\begin{aligned} L_{\textit{max}} = \frac{\pi _1 B(\pi _2; \alpha + 1, \beta ) + \pi _2 B(1; \alpha , \beta + 1) - \pi _2 B(\pi _2; \alpha , \beta + 1)}{B(1, \alpha , \beta )}. \end{aligned}$$

Given an ROC curve X, computing the convex hull Y, and subsequent steps, can be done in \(\mathcal {O} \mathopen {}\left( n\right)\) time. We will show in Sect. 5 that we can maintain the H-measure in \(\mathcal {O} \mathopen {}\left( \log ^2 n\right)\) time if \(\pi _k\) are estimated from Z. Otherwise we will show in Sect. 6 that we can approximate the H-measure in \(\mathcal {O} \mathopen {}\left( (\epsilon ^{-1} + \log n)\log n\right)\) time.

As pointed out earlier, \(Q(c, \sigma (c))\) can be viewed as a variant of a cost curve. If we were to replace Q with the cost curve and use the uniform distribution for u, then, as pointed out by Flach et al. (2011), L is equivalent to the area under the cost curve. Interestingly enough, we cannot use the algorithm given in Sect. 5 to compute the area under the cost curve, as the presence of the priors is needed to decompose the measure. However, we can use the algorithm in Sect. 6 to estimate the area under the cost curve.

Interestingly enough, \(Q(c, \sigma )\) can be linked to AUC. If, instead of using the optimal threshold \(\sigma (c)\), we average Q over a carefully selected distribution for \(\sigma\) and also use the uniform distribution for c, then the resulting integral is a linear transformation of AUC (Flach et al. 2011).

Fig. 2 An example of left rotation in a search tree. Left figure: before rotation, right figure: after rotation. Note that only u and v have different children after the rotation

Self-balancing search trees In this paper we make significant use of self-balancing search trees such as AVL trees or red-black trees. Such trees are binary trees where each node, say u, has a key, say k. The left subtree of u contains nodes with keys smaller than k and the right subtree of u contains nodes with keys larger than k. Maintaining this invariant allows for efficient queries as long as the height of the tree is kept in check. Self-balancing trees such as AVL trees or red-black trees keep the height of the tree in \(\mathcal {O} \mathopen {}\left( \log n\right)\). The balancing is done with \(\mathcal {O} \mathopen {}\left( \log n\right)\) left or right rotations whenever the tree is modified (see Fig. 2). Searching for nodes with specific keys, inserting new nodes, and deleting existing nodes can be done in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time. Moreover, splitting a search tree into two search trees or combining two trees into one can also be done in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time.

We assume that we can compare and manipulate integers of size \(\mathcal {O} \mathopen {}\left( n\right)\) and real numbers in constant time. We do this because it is reasonable to assume that the bit-length of integers in modern computer architectures is sufficient for any practical application, so we do not need to resort to custom big-integer implementations. If needed, however, the running times must be multiplied by an additional \(\mathcal {O} \mathopen {}\left( \log n\right)\) factor.

3 Related work

Several works have studied maintaining AUC in a sliding window. Brzezinski and Stefanowski (2017) maintained the order of n data points using a red-black tree but computed AUC from scratch, resulting in a running time of \(\mathcal {O} \mathopen {}\left( n + \log n\right)\) per update. Tatti (2018) proposed an algorithm yielding an \(\epsilon\)-approximation of AUC in \(\mathcal {O} \mathopen {}\left( (1 + \epsilon ^{-1})\log n\right)\) time per update. Here the approach bins the ROC space into a small number of bins, and the bins are selected so that the AUC estimate is accurate enough. Bouckaert (2006) proposed estimating AUC by binning and only maintaining counters for individual bins. In this work, on the other hand, we do not need to resort to binning; instead we maintain the exact AUC by maintaining a search tree structure in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time per update.

We should point out that AUC and the H-measure are defined over the whole ROC curve, and are useful when we do not want to commit to a specific classification threshold. On the other hand, if we do have the threshold, then we can easily maintain a confusion matrix, and consequently maintain many classic metrics, for example, accuracy, recall, F1-measure (Gama et al. 2013; Gama 2010), and Kappa-statistic (Bifet and Frank 2010; Žliobaitė et al. 2015).

In related work, Ataman et al. (2006), Ferri et al. (2002), Brefeld and Scheffer (2005), and Herschtal and Raskutti (2004) proposed methods where AUC is optimized as part of training a classifier. Note that this setting differs from ours: changing the classifier parameters most likely changes the scores of all data points, and may change the order of the data points significantly. On the other hand, we rely on the fact that we can maintain the order using a search tree. Interestingly, Calders and Jaroszewicz (2007) estimated AUC using a continuous function, which then allowed optimizing the classifier parameters with gradient descent.

Our approaches are useful if we are working in a sliding-window setting, that is, when we want to compute the relevant statistic using only the last n data points. In other words, we abruptly forget the \((n + 1)\)th data point. An alternative option would be to gradually downplay the importance of older data points. A convenient option is to use exponential decay, see for example the survey by Gama et al. (2014). Maintaining the confusion matrix is trivial when using exponential decay, but, to our knowledge, there are no methods for maintaining AUC or the H-measure under exponential decay.

4 Maintaining AUC

In this section we present a simple approach to maintain AUC in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time. We accomplish this by showing that the change in AUC can be computed in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time whenever a new point is added or an existing point is deleted. We rely on the following two propositions that express how AUC changes when adding or deleting a data point. We then show that the quantities occurring in the propositions, namely, the weights \((u_1, u_2)\) and \((v_1, v_2)\) can be obtained in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time.

Proposition 1

(Addition) Let Z be a set of data points with \((n_1, n_2)\) label counts. Let Y be a set of points having the same score \(\sigma\). Write \((w_1, w_2) = \sum _{y \in Y} d \mathopen {}\left( y\right)\). Define also

$$\begin{aligned} (u_1, u_2) = \sum _{\begin{array}{c} z \in Z \\ s(z)< \sigma \end{array}} d \mathopen {}\left( z\right) \quad \text {and}\quad (v_1, v_2) = \sum _{\begin{array}{c} z \in Z \\ s(z)= \sigma \end{array}} d \mathopen {}\left( z\right).\end{aligned}$$

Write \(U = n_1n_2 \times auc \mathopen {}\left( Z\right)\) and \(U' = (n_1 + w_1)(n_2 + w_2) \times auc \mathopen {}\left( Z \cup Y\right)\). Then

$$\begin{aligned} U' = U + w_2\left( u_1 + \frac{v_1}{2}\right) + w_1\left( n_2 - u_2 - \frac{v_2}{2}\right) + \frac{w_1w_2}{2}. \end{aligned}$$

Proof

We will use the Mann–Whitney U statistic, given in Eq. 2, to prove the claim. Let us write \(Z' = Z \cup Y\) and define

$$\begin{aligned} S_i = \left\{ s \mid (s, \ell ) \in Z, \ell = i\right\} \quad \text {and}\quad S_i' = \left\{ s \mid (s, \ell ) \in Z', \ell = i\right\} , \quad \text {for}\quad i = 1, 2. \end{aligned}$$

Equation 2 states that

$$\begin{aligned} \begin{aligned} U'&= \sum _{s \in S_1'} \sum _{t \in S_2'} f(s, t) \\&= w_1 \sum _{t \in S_2'} f(\sigma , t) + w_2 \sum _{s \in S_1} f(s, \sigma ) + \sum _{s \in S_1} \sum _{t \in S_2} f(s, t) \\&= w_1 \sum _{t \in S_2'} f(\sigma , t) + w_2 \sum _{s \in S_1} f(s, \sigma ) + U \\&= w_1 \left( n_2 - u_2 - v_2 + \frac{v_2 + w_2}{2}\right) + w_2 \left( u_1 + \frac{v_1}{2}\right) + U. \\ \end{aligned} \end{aligned}$$

We obtain the claim by rearranging the terms.\(\square\)

Proposition 2

(Deletion) Let Z be a set of data points with \((n_1, n_2)\) label counts. Let \(Y \subseteq Z\) be a set of points having the same score \(\sigma\). Write \((w_1, w_2) = \sum _{y \in Y} d \mathopen {}\left( y\right)\). Define also

$$\begin{aligned} (u_1, u_2) = \sum _{\begin{array}{c} z \in Z \\ s(z)< \sigma \end{array}} d \mathopen {}\left( z\right) \quad \text {and}\quad (v_1, v_2) = \sum _{\begin{array}{c} z \in Z \\ s(z)= \sigma \end{array}} d \mathopen {}\left( z\right).\end{aligned}$$

Write \(U = n_1n_2 \times auc \mathopen {}\left( Z\right)\) and \(U' = (n_1 - w_1)(n_2 - w_2) \times auc \mathopen {}\left( Z \setminus Y\right)\). Then

$$\begin{aligned} U' = U - w_2\left( u_1 + \frac{v_1}{2}\right) - w_1\left( n_2 - u_2 - \frac{v_2}{2}\right) + \frac{w_1w_2}{2}. \end{aligned}$$

Note that the sign of the last term is the same for both addition and deletion.

Proof

We will use the Mann–Whitney U statistic, given in Eq. 2, to prove the claim. Let us write \(Z' = Z \setminus Y\) and define

$$\begin{aligned} S_i = \left\{ s \mid (s, \ell ) \in Z, \ell = i\right\} \quad \text {and}\quad S_i' = \left\{ s \mid (s, \ell ) \in Z', \ell = i\right\} , \quad \text {for}\quad i = 1, 2. \end{aligned}$$

Equation 2 states that

$$\begin{aligned} \begin{aligned} U&= \sum _{s \in S_1} \sum _{t \in S_2} f(s, t) \\&= w_1 \sum _{t \in S_2} f(\sigma , t) + w_2 \sum _{s \in S_1'} f(s, \sigma ) + \sum _{s \in S_1'} \sum _{t \in S_2'} f(s, t) \\&= w_1 \sum _{t \in S_2} f(\sigma , t) + w_2 \sum _{s \in S_1'} f(s, \sigma ) + U' \\&= w_1 \left( n_2 - u_2 - v_2 + \frac{v_2}{2}\right) + w_2 \left( u_1 + \frac{v_1 - w_1}{2}\right) + U'. \\ \end{aligned} \end{aligned}$$

We obtain the claim by rearranging the terms.\(\square\)

Note that normally we would be adding or deleting a single data point, that is, \(Y = \left\{ y\right\}\). However, the propositions also allow us to modify multiple points with the same score.
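For illustration, once the counts are known the update itself is a constant-time arithmetic step; the following hedged Python sketch (function and argument names are ours) applies the formulas of the propositions.

```python
def update_U(U, n2, u, v, w, add=True):
    """Update the non-normalized statistic U = n1 * n2 * auc(Z) per Propositions 1 and 2.
    u = (u1, u2): label counts of the points with a score strictly below sigma,
    v = (v1, v2): label counts of the points with score equal to sigma (taken in Z, so for
                  a deletion v already includes the removed batch),
    w = (w1, w2): label counts of the added or deleted batch Y,
    n2: number of label-2 points in Z."""
    delta = w[1] * (u[0] + v[0] / 2) + w[0] * (n2 - u[1] - v[1] / 2)
    sign = 1 if add else -1
    # the sign of the last term is the same for both addition and deletion
    return U + sign * delta + w[0] * w[1] / 2

# AUC of the modified set is obtained by dividing the result by (n1 +/- w1) * (n2 +/- w2).
```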

These two propositions allow us to maintain AUC as long as we can compute \((u_1, u_2)\) and \((v_1, v_2)\). To compute these quantities we will use a balanced search tree T, such as a red-black tree or an AVL tree. Let S be the set of unique scores of Z. Each score \(s \in S\) is assigned a node in T.

Moreover, for each node x with a score of s, we will store the total label counts having the same score, \(d \mathopen {}\left( x\right) = \sum _{s(z)= s} d \mathopen {}\left( z\right)\). The counts \(d \mathopen {}\left( x\right)\) will give us immediately \((v_1, v_2)\).

In addition, we will store \(cd \mathopen {}\left( x\right)\), cumulative label counts of all descendants of x, including x itself. We need to maintain these counts whenever we add or remove nodes from T, change the counts of nodes, or when T needs to be rebalanced. Luckily, since

$$\begin{aligned} cd(x) = cd(left(x)) + cd(right(x)) + d(x) \end{aligned}$$

we can compute \(cd \mathopen {}\left( x\right)\) in constant time as long as we have the cumulative counts of the children of x. Whenever a node x is changed, only the counts of its ancestors change, so the cumulative counts can be updated in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time. The balancing in a red-black tree or an AVL tree is done using left or right rotations. Only two nodes are changed per rotation (see Fig. 2), and we can recompute the cumulative counts for these nodes in constant time. There are at most \(\mathcal {O} \mathopen {}\left( \log n\right)\) rotations, so the running time is not increased.
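A minimal sketch of this bookkeeping, assuming that each node stores the per-score counts d and the cumulative counts cd as tuples (field and function names are ours):

```python
def recompute_cd(x):
    """Restore the invariant cd(x) = cd(left(x)) + cd(right(x)) + d(x)."""
    l = x.left.cd if x.left is not None else (0, 0)
    r = x.right.cd if x.right is not None else (0, 0)
    x.cd = (l[0] + r[0] + x.d[0], l[1] + r[1] + x.d[1])

def rotate_left(u):
    """Left rotation around u (Fig. 2); only u and its right child v need their cd recomputed."""
    v = u.right
    u.right, v.left = v.left, u
    recompute_cd(u)          # u is now a child of v, so it is updated first
    recompute_cd(v)
    return v                 # v takes u's place in the tree
```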

Given a tree T and a score threshold \(\sigma\), let us define \(lcount \mathopen {}\left( \sigma , T\right) = \sum _{x \in T, s(x) < \sigma } d \mathopen {}\left( x\right)\) to be the total label counts of the nodes with scores smaller than \(\sigma\). Computing \(lcount \mathopen {}\left( \sigma , T\right)\) gives us the \((u_1, u_2)\) used by Propositions 1 and 2.

In order to compute \(lcount \mathopen {}\left( \sigma , T\right)\) we will use the procedure given in Algorithm 2. Here, we use a binary search over the tree, summing the cumulative counts of the left branches along the way. To see the correctness of the algorithm, observe that during the while-loop Algorithm 2 maintains the invariant that \(u + cd(left(x))\) is equal to \(lcount \mathopen {}\left( s(x), T\right)\). We should point out that similar queries were considered by Tatti (2018). However, they were not combined with Propositions 1 and 2.

Algorithm 2

Since T is balanced, the running time of Algorithm 2 is \(\mathcal {O} \mathopen {}\left( \log n\right)\).
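A hedged Python sketch of the query (the rebalancing code is omitted; the node layout, with fields score, d, cd, left, and right, is the one described above):

```python
def counts(x):
    """Cumulative label counts of a subtree; (0, 0) for an empty subtree."""
    return x.cd if x is not None else (0, 0)

def lcount(root, sigma):
    """Total label counts (u1, u2) of the nodes with a score strictly smaller than sigma."""
    u1 = u2 = 0
    x = root
    while x is not None:
        if sigma <= x.score:
            x = x.left                     # x and its right subtree have scores >= sigma
        else:
            l1, l2 = counts(x.left)
            u1 += l1 + x.d[0]              # the left subtree and x itself have scores < sigma
            u2 += l2 + x.d[1]
            x = x.right
    return (u1, u2)
```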

In summary, we can maintain T in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time, and we can obtain \((u_1, u_2)\) and \((v_1, v_2)\) using T in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time. These quantities allow us to maintain AUC in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time.

5 Maintaining H-measure

If we were to compute the H-measure from scratch, we would first need to compute the convex hull, and then compute the H-measure from the hull. In order to maintain the H-measure, we will therefore first address maintaining the convex hull, and then explain how we maintain the actual measure.

5.1 Divide-and-conquer approach for maintaining a convex hull

Maintaining a convex hull under point additions or deletions is a well-studied topic in computational geometry. A classic approach by Overmars and Van Leeuwen (1981) maintains the hull in \(\mathcal {O} \mathopen {}\left( \log ^2 n\right)\) time. Luckily, the same approach with some modifications will work for us.

Before we continue, we should stress two important differences between our setting and a traditional setting of maintaining a convex hull.

First, in the standard setting, points are added to or removed from the plane, while the remaining points do not change over time. In our case, a data point consists of a classifier score and a label, and a modification shifts the ROC coordinates of the remaining points. As a concrete example, in the traditional setting adding a point cannot reveal already existing points, whereas adding a new data point can shift the ROC curve enough so that some existing points become included in the convex hull.

Secondly, we do not have the coordinates for all the points. However, it turns out that we can compute the needed coordinates at no additional cost.

We should point out that the approach by Overmars and Van Leeuwen (1981) is not the fastest for maintaining the hull: for example an algorithm by Brodal and Jacob (2002) can maintain the hull in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time. However, due to the aforementioned differences adapting this algorithm to our setting is non-trivial, and possibly infeasible.

We will explain next the main idea behind the algorithm by Overmars and Van Leeuwen (1981), and then modify it to our needs.

The overall idea behind the algorithm is as follows. A generic convex hull can be viewed as a union of the lower convex hull and the upper convex hull. We only need to compute the upper convex hull, and for simplicity, we will refer to the upper convex hull as the convex hull.

Fig. 3 Left figure: an example of combining two partial convex hulls into one by finding a bridge segment. Right figure: a stylized data structure for maintaining the convex hull. Each node corresponds to a partial convex hull (that are stored in separate search trees); a parent hull is obtained from the child hulls by finding the bridge segment. Leaf nodes containing individual data points are not shown

In order to compute the convex hull C of a point set P we can use a divide-and-conquer technique. Assume that we have ordered the points by the x-coordinate, and split the points roughly in half, say into sets R and Q. Then assume that we have computed the convex hulls, say \(H = \left\{ h_{i}\right\}\) and \(G = \left\{ g_{i}\right\}\), of R and Q independently.

A key result by Overmars and Van Leeuwen (1981) states that the convex hull C of P is equal to \(\left\{ h_{1}, \ldots , h_{u}, g_{v}, g_{v + 1}, \ldots \right\}\), that is, C starts with H and ends with G. See Fig. 3 for an illustration. The segment between \(h_{u}\) and \(g_{v}\) is often referred to as a bridge.

We can find the indices u and v in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time using a binary search over H and G. In order to perform the binary search we will store the hulls H and G in balanced search trees (red-black tree or AVL tree). Then the binary search amounts to traversing these trees.

Note that the concatenation and splitting of a search tree can be done in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time. In other words, we can obtain C from the partial convex hulls H and G in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time.

In order to maintain the hull we will store the original points in a balanced search tree T; only the leaves store the actual points. Each node \(u \in T\) represents the set of points stored in the descendant leaves of u. See Fig. 3a for an illustration.

Let us write H(u) to be the convex hull of these points: we can obtain H(u) from \(H( left \mathopen {}\left( u\right) )\) and \(H( right \mathopen {}\left( u\right) )\) in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time. So whenever we modify T by adding or removing a leaf v, we only need to update the ancestors of v, and possibly some additional nodes due to the rebalancing. All in all, we only need to update \(\mathcal {O} \mathopen {}\left( \log n\right)\) nodes, which brings the running time to \(\mathcal {O} \mathopen {}\left( \log ^2 n\right)\).

An additional complication is that whenever we compute H(u) we also destroy \(H( left \mathopen {}\left( u\right) )\) and \(H( right \mathopen {}\left( u\right) )\) in the process, trees that we may need in the future. However, we can rectify this by storing the remains of the partial hulls, and then reversing the join if we were to modify a leaf of u. This reversal can be done in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time.

5.2 Maintaining the convex hull of a ROC curve

Our next step is to adapt the existing algorithm to our setting so that we can maintain the hull of an ROC curve X.

First of all, adding or removing data points shifts the remaining points. To partially rectify this issue, we will use the non-normalized coordinates \(R = \left( r_0 ,\ldots , r_n\right)\) given in Eq. 1. We can do this because scaling does not change the convex hull.

Consider adding or removing a data point z which is represented by a leaf \(u \in T\). The points in R associated with smaller scores than \(s(z)\) will not shift, and the points in R associated with larger scores than \(s(z)\) will shift by the same amount. Consequently, the only partial hulls that are affected are the ancestors of u. This allows us to use the update algorithm of Overmars and Van Leeuwen (1981) for our setting as long as we can obtain the coordinates of the points.

Our second issue is that we do not have access to the coordinates \(r_i\). We approach the problem with the same strategy as when we were computing AUC.

Let U be the search tree of a convex hull H. Let \(u \in U\) be a node with coordinates \(r_{i}\). We will define and store \(d \mathopen {}\left( u\right)\) as the coordinate difference \(r_i - r_{i - 1}\). Let \(s_i\) be the score corresponding to \(r_i\). Then Eq. 1 implies that \(d \mathopen {}\left( u\right) = \sum _{s_{i - 1} < s(z)\le s_i} d \mathopen {}\left( z\right)\).

In addition, we will store \(cd \mathopen {}\left( u\right)\), the total sum of the coordinate differences of descendants of u, including u itself.

Let u be the root of U. The coordinates, say p, of u in U are \(cd(left(u)) + d \mathopen {}\left( u\right)\). Moreover, the coordinates of the left child of u are

$$\begin{aligned} p - d \mathopen {}\left( u\right) - cd(right(left(u))), \end{aligned}$$

and the coordinates of the right child of u are

$$\begin{aligned} p + d(right(u)) + cd(left(right(u))). \end{aligned}$$

In other words, we can compute the coordinates of children in U in constant time if we know the coordinates of a parent.
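Following the formulas above, the coordinates can be propagated during a descent with a few additions. A sketch, assuming nodes with the fields d, cd, left, and right of the hull tree (helper names are ours):

```python
def diffs(x):
    """Cumulative coordinate differences of a subtree; (0, 0) for an empty one."""
    return x.cd if x is not None else (0, 0)

def root_coords(u):
    """Coordinates of the root of a hull tree: cd(left(u)) + d(u)."""
    l = diffs(u.left)
    return (l[0] + u.d[0], l[1] + u.d[1])

def left_coords(p, u):
    """Coordinates of left(u), given the coordinates p of u."""
    r = diffs(u.left.right)
    return (p[0] - u.d[0] - r[0], p[1] - u.d[1] - r[1])

def right_coords(p, u):
    """Coordinates of right(u), given the coordinates p of u."""
    l = diffs(u.right.left)
    return (p[0] + u.right.d[0] + l[0], p[1] + u.right.d[1] + l[1])
```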

When combining two hulls, the binary search needed to find the bridge is based on descending U from the root to the correct node. During the binary search the algorithm needs to know the coordinates of a node, which we can now obtain from the coordinates of its parent. In summary, we can do the binary search in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time, which allows us to maintain the hull of a ROC curve in \(\mathcal {O} \mathopen {}\left( \log ^2 n\right)\) time.

For completeness we present the pseudo-code for the binary search in the Appendix.

5.3 Maintaining H-measure

Now that we have means to maintain the convex hull, our next step is to maintain the H-measure. Note that the only non-trivial part is L given in Eq. 5.

Assume that we have n data points Z with \(n_k\) data points having class k. Let \(Y = \left( y_0 ,\ldots , y_m\right)\) be the convex hull of the ROC curve computed from Z. Let \((d_1, \ldots , d_m)\) be the non-normalized differences between the neighboring points, that is,

$$\begin{aligned} d_{i1} = n_1(y_{i1} - y_{(i - 1)1}) \quad \text {and}\quad d_{i2} = n_2(y_{i2} - y_{(i - 1)2}). \end{aligned}$$

We will now assume that \(\pi _k\) occurring in Eq. 5 are computed from the same data as the ROC curve, that is, \(\pi _k = n_k / n\). We can rewrite the first term in Eq. 5 as

$$\begin{aligned} \begin{aligned} \sum _{i = 0}^m \pi _1 (1 - y_{i1}) \int _{c_i}^{c_{i + 1}} cu(c)dc&= \frac{1}{n} \sum _{i = 0}^m \sum _{j = i + 1}^m d_{j1} \int _{c_i}^{c_{i + 1}} cu(c)dc \\&= \frac{1}{n} \sum _{j = 1}^m d_{j1} \sum _{i = 0}^{j - 1} \int _{c_i}^{c_{i + 1}} cu(c)dc \\&= \frac{1}{n} \sum _{j = 1}^m d_{j1} \int _{0}^{c_{j}} cu(c)dc. \\ \end{aligned} \end{aligned}$$

Similarly, we can express the second term of Eq. 5 as

$$\begin{aligned} \begin{aligned} \sum _{i = 0}^m \pi _2 y_{i2} \int _{c_i}^{c_{i + 1}} (1 - c)u(c)dc&= \frac{1}{n} \sum _{i = 0}^m \sum _{j = 1}^i d_{j2} \int _{c_i}^{c_{i + 1}} (1 - c)u(c)dc \\&= \frac{1}{n} \sum _{j = 1}^m d_{j2} \sum _{i = j}^m \int _{c_i}^{c_{i + 1}} (1 - c)u(c)dc \\&= \frac{1}{n} \sum _{j = 1}^m d_{j2} \int _{c_j}^{1} (1 - c)u(c)dc.\\ \end{aligned} \end{aligned}$$

If we use the beta distribution for u, Eq. 6 reduces to

$$\begin{aligned} L = \frac{1}{nB(1, \alpha , \beta )} \sum _{j = 1}^m d_{j1} B(c_j, \alpha + 1, \beta ) + d_{j2} (B(1, \alpha , \beta + 1) - B(c_j, \alpha , \beta + 1)). \end{aligned}$$
(7)

Let us now consider values \(c_j\). Because we assume that \(\pi _k\) are estimated from the testing data, we have \(\pi _k = n_k / n\), so the values \(c_j\), given in Eq. 4, reduce to

$$\begin{aligned} c_j = \frac{\pi _2(y_{j2} - y_{(j - 1)2})}{\pi _2(y_{j2} - y_{(j - 1)2}) + \pi _1(y_{j1} - y_{(j - 1)1})} = \frac{\pi _2 d_{j2} / n_2}{\pi _1 d_{j1} / n_1 + \pi _2 d_{j2} / n_2} = \frac{d_{j2}}{d_{j1} + d_{j2}}. \end{aligned}$$

In summary, the terms of the sum in Eq. 7 depend only on the coordinate differences \(d_j\). We should stress that this is only possible if we assume that \(\pi _k\) are computed from the same data as the ROC curve. Otherwise, the terms \(n_k\) will not cancel out when computing \(c_j\).

Let T be a binary tree representing a convex hull. The sole dependency on \(d_j\) allows us to use T to maintain the H-measure. In order to do that, let \(v \in T\) be a node with the coordinate difference \((d_1, d_2) = d \mathopen {}\left( v\right)\). Let \(c = d_2/(d_1 + d_2)\). We define

$$\begin{aligned} h \mathopen {}\left( v\right) = d_1 B(c, \alpha + 1, \beta ) + d_{2} (B(1, \alpha , \beta + 1) - B(c, \alpha , \beta + 1)). \end{aligned}$$

We also maintain \(ch \mathopen {}\left( v\right)\) to be the sum of \(h \mathopen {}\left( u\right)\) of all descendants u of v, including v. Note that maintaining \(ch \mathopen {}\left( v\right)\) can be done in a similar fashion as \(cd \mathopen {}\left( v\right)\).

Finally, Eq. 7 implies that \(L = \frac{ch(root(T)) }{n B(1, \alpha , \beta )}\), allowing us to maintain the H-measure in \(\mathcal {O} \mathopen {}\left( \log ^2 n\right)\) time.
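As an illustration of the decomposition, the per-segment term \(h \mathopen {}\left( v\right)\) and the resulting L of Eq. 7 can be computed as follows. This is a from-scratch sketch over the list of hull differences (in the dynamic algorithm the sum is the maintained value ch(root(T))); it assumes SciPy, whose betainc is the regularized incomplete beta function.

```python
from scipy.special import beta as beta_fn, betainc

def inc_beta(x, a, b):
    """Non-regularized incomplete beta B(x; a, b)."""
    return betainc(a, b, x) * beta_fn(a, b)

def h_term(d1, d2, alpha, beta):
    """Term h(v) of Eq. 7 for a hull segment with non-normalized differences (d1, d2)."""
    c = d2 / (d1 + d2)
    return (d1 * inc_beta(c, alpha + 1, beta)
            + d2 * (inc_beta(1.0, alpha, beta + 1) - inc_beta(c, alpha, beta + 1)))

def loss_L(d_list, n, alpha, beta):
    """L of Eq. 7 computed from scratch from the hull differences d_1, ..., d_m."""
    total = sum(h_term(d1, d2, alpha, beta) for d1, d2 in d_list)
    return total / (n * beta_fn(alpha, beta))
```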

6 Approximating H-measure

In our final contribution we consider the case where \(\pi _k\) are not computed from the same dataset as the ROC curve. The consequence is that we can no longer simplify \(c_j\) so that it depends only on \(d_j\), and we cannot express L as a sum over the nodes of the tree representing the convex hull.

We will approach the task differently. We still maintain the convex hull H. We then select a subset of points from H, from which we compute the H-measure from scratch. This subset is selected carefully: on one hand, it yields an \(\epsilon\)-approximation; on the other hand, it is small enough that we still obtain a polylogarithmic running time.

We start by rewriting Eq. 5. Given a function \({x}:{[0, 1]} \rightarrow {{\mathbb {R}}^+}\), let us define

$$\begin{aligned} L_1(x) = \int _0^1 \pi _1 x(c) c u(c) dc, \quad \text {and}\quad L_2(x) = \int _0^1 \pi _2 x(c) (1 - c) u(c) dc. \end{aligned}$$

Consider the values \(\left\{ y_i\right\}\) and \(\left\{ c_i\right\}\) as used in Eq. 5. We define two functions \({f, g}:{[0, 1]} \rightarrow {{\mathbb {R}}^+}\) as

$$\begin{aligned} \begin{aligned} g(c)&= y_{i2}, \quad \text {where}\quad c_i \le c< c_{i + 1}, \quad \text {and}\quad g(1) = 1, \\ f(c)&= 1 - y_{i1}, \quad \text {where}\quad c_i \le c < c_{i + 1}, \quad \text {and}\quad f(1) = 0. \\ \end{aligned} \end{aligned}$$
(8)

We can now write Eq. 5 as \(L = L_1(f) + L_2(g)\).

We say that a function \(x'\) is an \(\epsilon\)-approximation of a function x if \({\left| x(c) - x'(c)\right| } \le \epsilon x(c)\) for all c. The following two propositions are immediate.

Proposition 3

Let \(x'\) be an \(\epsilon\)-approximation of x, then

$$\begin{aligned} {\left| L_1(x) - L_1(x')\right| } \le \epsilon L_1(x) \quad \text {and}\quad {\left| L_2(x) - L_2(x')\right| } \le \epsilon L_2(x). \end{aligned}$$

Proposition 4

Let f and g be defined as in Eq. 8, and let \(f'\) and \(g'\) be respective \(\epsilon\)-approximations. Define

$$\begin{aligned} H = 1 - \frac{L_1(f) + L_2(g)}{L_{max}} \quad \text {and}\quad H' = 1 - \frac{L_1(f') + L_2(g')}{L_{max}}. \end{aligned}$$

Then \({\left| H - H'\right| } \le \epsilon (1 - H)\).

In other words, if we can approximate f and g, we can also approximate the H-measure. Note that the guarantee is \(\epsilon (1 - H)\); the approximation is more accurate when H is closer to 1, that is, when the classifier is accurate.

Next we will focus on estimating g.

Proposition 5

Assume \(\epsilon > 0\). Let Y be the convex hull of an ROC curve. Let Q be a subset of Y such that for each \(y_i\), there is \(q_j \in Q\) such that

$$\begin{aligned} q_j = y_i \quad \text {or}\quad q_{j2} \le y_{i2} \le q_{(j + 1)2} \le (1 + \epsilon ) q_{j2}. \end{aligned}$$
(9)

Let g be the function constructed from Y as given by Eq. 8, and let \(g'\) be a function constructed similarly from Q. Then \(g'\) is an \(\epsilon\)-approximation of g.

Proof

Let \((c_i)\) be the slope values computed from Y using Eq. 4, and let \((c'_i)\) be the slope values computed from Q.

Due to convexity of Y, the slope values have a specific property that we will use several times: fix index j, and let i be the index such that \(y_i = q_j\). Then

$$\begin{aligned} c'_j \le c_i \quad \text {and}\quad c'_{j + 1} \ge c_{i + 1}. \end{aligned}$$
(10)

Assume \(0< c < 1\). Let i be an index such that \(c_i \le c < c_{i + 1}\), consequently \(g(c) = y_{i2}\). Similarly, let j be an index such that \(c'_j \le c < c'_{j + 1}\), so that \(g'(c) = q_{j2}\). Let a be an index such that \(q_j = y_a\).

If \(g(c) = g'(c)\), then we have nothing to prove. Assume \(g(c) < g'(c) = q_{j2}\).

Assume \(q_{j2} > (1 + \epsilon )q_{(j - 1)2}\). Then Eq. 9 implies that \(y_{a - 1} = q_{j - 1}\), and so \(c_a = c'_j \le c < c_{i + 1}\). Thus, \(i \ge a\), and \(g(c) = g(c_i) \ge g(c_a) = g'(c'_j) = g'(c)\), which is a contradiction.

Assume \(q_{j2} \le (1 + \epsilon )q_{(j - 1)2}\). Let b be an index such that \(y_{b} = q_{j - 1}\). Then Eq. 10 implies

$$\begin{aligned} c_{b + 1} \le c_j' \le c < c_{i + 1}. \end{aligned}$$

Thus, \(b < i\) and so \(q_{(j - 1)2} = y_{b2} = g(c_b) \le g(c_i) = g(c)\). This leads to

$$\begin{aligned} {\left| g'(c) - g(c)\right| } = q_{j2} - g(c) \le (1 + \epsilon )q_{(j - 1)2} - g(c) \le (1 + \epsilon ) g(c) - g(c) = \epsilon g(c), \end{aligned}$$

proving the proposition.

Now, assume \(g(c) > g'(c) = q_{j2}\).

Assume \(q_{(j + 1)2} > (1 + \epsilon )q_{j2}\). If \(y_{a + 1} \notin Q\), then Eq. 9 leads to a contradiction. Thus \(y_{a + 1} = q_{j + 1}\) and so \(c'_j \le c < c'_{j + 1} = c_{a + 1}\). Thus, \(i \le a\), and \(g(c) = g(c_i) \le g(c_a) = g'(c'_j) = g'(c)\), which is a contradiction.

Assume \(q_{(j + 1)2} \le (1 + \epsilon )q_{j2}\). Let b be an index such that \(y_{b} = q_{j + 1}\). Then Eq. 10 implies

$$\begin{aligned} c_i \le c < c'_{j + 1} \le c_b. \end{aligned}$$

Thus, \(i < b\), and so \(g(c) = y_{i2} \le y_{b2} = q_{(j + 1)2}\). This leads to

$$\begin{aligned} {\left| g(c) - g'(c)\right| } \le q_{(j + 1)2} - q_{j2} \le (1 + \epsilon )q_{j2} - q_{j2} = \epsilon q_{j2} = \epsilon g'(c) < \epsilon g(c), \end{aligned}$$

proving the proposition.\(\square\)

A similar result also holds for \(L_1(f)\). We omit the proof as it is very similar to the proof of Proposition 5.

Proposition 6

Assume \(\epsilon > 0\). Let Y be a convex hull of a ROC curve. Let Q be a subset of Y such that for each \(y_i\), there is \(q_j \in Q\) such that

$$\begin{aligned} q_j = y_i \quad \text {or}\quad 1 - q_{(j + 1)1} \le 1 - y_{i1} \le 1 - q_{j1} \le (1 + \epsilon ) (1 - q_{(j + 1)1}). \end{aligned}$$

Let f be the function constructed from Y as given by Eq. 8, and let \(f'\) be a function constructed similarly from Q. Then \(f'\) is an \(\epsilon\)-approximation of f.

The above propositions lead to the following strategy: use only a subset of the points of the convex hull to compute the H-measure; if we select the points carefully, then the relative error will be at most \(\epsilon\).

Let us now focus on estimating \(L_2(g)\). Assume that we have the convex hull \(Y = \left\{ y_0 ,\ldots , y_m\right\}\) of a ROC curve stored in a search tree T. Consider the algorithm given in Algorithm 3, which we call Subset.

Algorithm 3

The pseudo-code traverses T, and maintains two variables p and q that bound the points of the current subtree. If \(q_2 \le (1 + \epsilon )p_2\), then we can safely ignore the current subtree; otherwise we output the current root and recurse on both children. It is easy to see that \(Q = \left\{ y_0, y_m\right\} \cup \text{S}\textsc {ubset}(r, 0, cd \mathopen {}\left( r\right) )\) satisfies the conditions of Proposition 5.
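A hedged Python sketch of the recursion (names are ours, and the hull tree nodes are assumed to store the coordinate differences d and subtree sums cd of Sect. 5.2; unlike the pseudo-code, we carry both coordinates so that the reported points can be used directly):

```python
def diffs(x):
    """Cumulative coordinate differences of a subtree; (0, 0) for an empty one."""
    return x.cd if x is not None else (0, 0)

def subset(node, p, q, eps, out):
    """Report hull points so that Eq. 9 holds. Here p and q are the coordinates of the
    reported (or extreme) points bracketing the current subtree; the 2nd coordinates
    are used for pruning."""
    if node is None or q[1] <= (1 + eps) * p[1]:
        return                                   # the skipped points are sandwiched tightly enough
    l = diffs(node.left)
    mid = (p[0] + l[0] + node.d[0], p[1] + l[1] + node.d[1])   # this node's hull point
    out.append(mid)
    subset(node.left, p, mid, eps, out)
    subset(node.right, mid, q, eps, out)

# Q consists of y_0, y_m and the points collected by subset(root, (0, 0), diffs(root), eps, out).
```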

A similar traversal can also be done in order to estimate \(L_1(f)\). However, we can estimate both values with the same subset by replacing the if-condition with \(q_2> (1 + \epsilon ) p_2 \ \mathbf{or}\ 1 - q_1 > (1 + \epsilon )(1 - p_1)\).

Proposition 7

Subset runs in \(\mathcal {O} \mathopen {}\left( (1 + \epsilon ^{-1})\log ^2 n\right)\) time.

Proof

Given a node v, let us write \(T_v\) to mean the subtree rooted at v. Write \(p_v\) and \(q_v\) to be the values of p and q when processing v.

Let V be the set of nodes reported by Subset. Let \(W \subseteq V\) be the set of m nodes that have two reported children. Let \(\left\{ h_1 ,\ldots , h_m\right\}\) be the non-normalized 2nd coordinates of the nodes in W, ordered from smallest to largest.

Fix i and let u and v be the nodes corresponding to \(h_i\) and \(h_{i + 1}\). Assume that \(v \notin T_u\). Let \(r = right \mathopen {}\left( u\right)\) be the right child of u. Then \(T_r \cap W = \emptyset\) as otherwise \(h_i\) and \(h_{i + 1}\) would not be consecutive. We have \(h_{i + 1} \ge q_{r2} > (1 + \epsilon )p_{r2} = (1 + \epsilon ) h_i\).

Assume that \(v \in T_u\), which immediately implies that \(u \notin T_v\). Let \(r = left \mathopen {}\left( v\right)\) be the left child of v. Then \(T_r \cap W = \emptyset\), and we have \(h_{i + 1} = q_{r2} > (1 + \epsilon )p_{r2} \ge (1 + \epsilon ) h_i\).

In summary, \(h_{i + 1} > (1 + \epsilon ) h_i\). Since \(\left\{ h_i\right\}\) are integers, we have \(h_2 \ge 1\). In addition, \(h_m \le n\) since the original data points (from which the ROC curve is computed) do not have weights.

Consequently, \(n \ge h_m \ge (1 + \epsilon )^{m - 2}\). Solving for m leads to \(m \in \mathcal {O} \mathopen {}\left( \log _{1 + \epsilon } n\right) \subseteq \mathcal {O} \mathopen {}\left( (1 + \epsilon ^{-1})\log n\right)\).

Given \(v \in W\), define k(v) to be the number of nodes in \(V \setminus W\) that have v as their youngest ancestor in W. The nodes contributing to k(v) form at most two paths starting from v. Since the height of the search tree is in \(\mathcal {O} \mathopen {}\left( \log n\right)\), we have \(k(v) \in \mathcal {O} \mathopen {}\left( \log n\right)\).

Finally, we can bound \({\left| V\right| }\) by

$$\begin{aligned} {\left| V\right| } = \sum _{v \in W} 1 + k(v) \in \mathcal {O} \mathopen {}\left( m \log n \right) \subseteq \mathcal {O} \mathopen {}\left( (1 + \epsilon ^{-1})\log ^2 n\right), \end{aligned}$$

concluding the proof.\(\square\)

6.1 Speed-up

It is possible to reduce the running time of Subset to \(\mathcal {O} \mathopen {}\left( \log ^2 n + \epsilon ^{-1}\log n\right)\). We should point out that in practice Subset is probably the faster approach, as the theoretical improvement is relatively modest while the overheads increase.

There are several ways to approach the speed-up. Note that the source of the additional \(\log n\) term is that in the proof of Proposition 7 we have \(k(v) \in \mathcal {O} \mathopen {}\left( \log n\right)\). The loose bound is due to the fact that we are traversing a search tree balanced on tree height. We will modify the search procedure so that we can show that \(k(v) \in \mathcal {O} \mathopen {}\left( 1\right)\), which gives us the desired outcome. More specifically, we would like to traverse the hull using a search tree balanced by the 2nd coordinate.

The best candidate to replace the search tree for storing the convex hull is a weight-balanced tree (Nievergelt and Reingold 1973). Here, the subtrees are (roughly) balanced based on the number of nodes they contain. The problem is that this tree, despite its name, does not allow weights on the nodes. Moreover, the rebalancing algorithm relies on the fact that the nodes have no weights.

It is possible to extend weight-balanced trees to handle such weights, but the modification is not trivial. Instead, we demonstrate an alternative approach that uses only stock search structures.

We will do this by modifying the search tree T in which the nodes correspond to the partial hulls, see Fig. 3b.

Let Z be the current set of points and let \(P = \left\{ (s, \ell ) \in Z \mid \ell = 2\right\}\) be the points with label equal to 2. Set \(N = Z \setminus P\). We store P in a tree T of bounded balance; the points are only stored in the leaves. Each leaf, say u, also stores all points in N that immediately follow u. These points are stored in a standard search tree, say \(L_u\), so that we can join two trees or split them when needed. Any points in N without a preceding point in P are handled and stored separately.

Note that \(L_u\) corresponds to a vertical line when drawing the ROC curve. Consequently, a point in the convex hull will always be the last point in \(L_u\) for some u. This allows us to define the weight \(d \mathopen {}\left( u\right)\) of a leaf u in T as (m, 1), where m is the number of nodes in \(L_u\). We now apply the convex hull maintenance algorithm on T. As always, we maintain the cumulative weights \(cd \mathopen {}\left( u\right)\) for the non-leaf nodes.

In order to approximate the H-measure we will use a variant of Subset, except that we will traverse T instead of traversing the hull. The pseudo-code is given in Algorithm 4. At each node we output the bridge, if it is included in the final convex hull. The condition is easy to test: we just need to make sure that it does not overlap with the previously reported bridges. Since we output both points of the bridge, this may lead to duplicate points, but we can prune them in a post-processing step. Finally, we truncate the traversal if the subtree is sandwiched between two bridges that are close enough to each other. It is easy to see that the output of SubsetAlt satisfies the conditions of Proposition 5, so we can use the output to estimate \(L_2(g)\). In order to estimate \(L_1(f)\) we duplicate the procedure, except that we swap the labels and negate the scores, which leads to a mirrored ROC curve.

Algorithm 4

Proposition 8

SubsetAlt runs in \(\mathcal {O} \mathopen {}\left( \log ^2 n + \epsilon ^{-1}\log n\right)\) time.

Proof

Let T be the tree traversed by SubsetAlt. Let us write \(T_v\) to be the subtree rooted at v.

Let n(v) be the number of nodes in \(T_v\), and let \(\ell (v)\) be the number of leaves in \(T_v\). Note that \(n(v) = 2\ell (v) + 1\).

Let v be a child of u. Since T is a weight-balanced tree (Nievergelt and Reingold 1973), we have

$$\begin{aligned} \alpha \le \frac{1 + \ell (v)}{1 + \ell (u)} = \frac{1 + 1+ 2\ell (v)}{1 + 1 + 2\ell (u)} = \frac{1 + n(v)}{1 + n(u)} \le 1 - \alpha , \quad \text {where}\quad \alpha = 1 - \frac{\sqrt{2}}{2}. \end{aligned}$$
(11)

Let us write o(v) to be the 2nd origin coordinate of \(T_v\). Note that o(v) corresponds to the variable \(o_2\) in SubsetAlt when v is processed.

Let V be the set of nodes whose bridges we output, and let U be the set of nodes in T for which \(\ell (u) > \epsilon o(u)\).

We will prove the claim by showing that \(V \subseteq U\) and \({\left| U\right| } \in \mathcal {O} \mathopen {}\left( \log ^2 n + \epsilon ^{-1}\log n\right)\).

To prove the first claim, let \(v \in V\). Let p and q match the variables of SubsetAlt when v is visited. The points p and q correspond to the two leaves of \(T_v\). In other words, \(q_{2} - p_{2} \le \ell (v)\), and \(o(v) \le p_{2}\). Thus,

$$\begin{aligned} \ell (v) \ge q_{2} - p_{2} > \epsilon p_{2} \ge \epsilon o(v). \end{aligned}$$

This proves that \(v \in U\).

To bound \({\left| U\right| }\), let \(W \subseteq U\) be a set of m nodes that have two children in U.

Define \(\left( h_1 ,\ldots , h_m\right) = \left( o( right \mathopen {}\left( v\right) ) \mid v \in W\right)\) to be the sequence of the (non-normalized) 2nd coordinates of the right children of nodes in W, ordered from the smallest to the largest.

Fix i. Let \(u \in W\) be the node for which \(o( right \mathopen {}\left( u\right) ) = h_i\), and let \(v \in W\) be the node for which \(o( right \mathopen {}\left( v\right) ) = h_{i + 1}\).

Assume that \(h_i \le o(v)\). Since \(v \in W\), we have

$$\begin{aligned} h_{i + 1} = o( right \mathopen {}\left( v\right) ) = o(v) + \ell ( left \mathopen {}\left( v\right) ) > o(v) + \epsilon o( left \mathopen {}\left( v\right) ) = (1 + \epsilon ) o(v) \ge (1 + \epsilon )h_i. \end{aligned}$$

Assume that \(h_i > o(v)\). Then \(u \in T_{ left \mathopen {}\left( v\right) }\), and consequently \(v \notin T_{ right \mathopen {}\left( u\right) }\). Thus, \(T_{ right \mathopen {}\left( u\right) } \cap W = \emptyset\) as otherwise \(h_i\) and \(h_{i + 1}\) are not consecutive. Since \(right \mathopen {}\left( u\right) \in U\), we have

$$\begin{aligned} h_{i + 1} \ge o( right \mathopen {}\left( u\right) ) + \ell ( right \mathopen {}\left( u\right) ) \ge (1 + \epsilon )o( right \mathopen {}\left( u\right) ) = (1 + \epsilon )h_i. \end{aligned}$$

In summary, we have \(h_{i + 1} > (1 + \epsilon ) h_i\). Note that \(h_1 \ge 1\). In addition, \(h_m \le n\) since the original data points (from which the ROC curve is computed) do not have weights.

Consequently, \(n \ge h_m \ge (1 + \epsilon )^{m - 1}\). Solving for m leads to

$$\begin{aligned} m \in \mathcal {O} \mathopen {}\left( \log _{1 + \epsilon } n\right) \subseteq \mathcal {O} \mathopen {}\left( (1 + \epsilon ^{-1})\log n\right). \end{aligned}$$

Given \(v \in W\), define k(v) to be the number of nodes in \(V \setminus W\) that have v as their youngest ancestor in W. The nodes contributing to k(v) form at most two paths starting from v. Since the height of the search tree is in \(\mathcal {O} \mathopen {}\left( \log n\right)\), we have \(k(v) \in \mathcal {O} \mathopen {}\left( \log n\right)\).

Assume that \(\epsilon > \alpha / 2\) (recall that \(\alpha = 1 - \sqrt{2}/2\)). Then

$$\begin{aligned} {\left| V\right| } = \sum _{v \in W} 1 + k(v) \in \mathcal {O} \mathopen {}\left( m \log n \right) \subseteq \mathcal {O} \mathopen {}\left( (1 + \epsilon ^{-1})\log ^2 n\right) \subseteq \mathcal {O} \mathopen {}\left( \log ^2 n\right), \end{aligned}$$

proving the proposition.

Assume that \(\epsilon \le \alpha / 2\). Let \(v \in W\) with \(k(v) > 0\). Recall that the nodes corresponding to k(v) form at most two paths. Let \(u_1, \ldots , u_j\) be such a path.

Let w be a child of \(u_1\) for which \(w \notin U\). We have

$$\begin{aligned} 1 + \ell (u_1)&\le \alpha ^{-1}(1 + \ell (w))&\text {(Eq.~11)}\\&\le \alpha ^{-1}(1 + \epsilon o(w))&{(\text {w} \notin \text {U})}\\&\le \alpha ^{-1}(1 + \epsilon (o(u_1) + \ell (u_1)))&{(w\, \text {is a child of}\, u_1)}\\&\le \alpha ^{-1}(1 + \epsilon o(u_1)) + \ell (u_1)/2,&{(\epsilon \le \alpha /2)}\\ \end{aligned}$$

which in turn implies \(1 + \ell (u_1) \le 2 \alpha ^{-1}(1 + \epsilon o(u_1))\).

Applying Eq. 11 iteratively and the fact that \(u_j \in U\), we see that

$$\begin{aligned} 1 + \epsilon o(u_1)&\le 1 + \epsilon o(u_j)&{(u_j \text { is a descendant of }u_1)}\\&< 1 + \ell (u_j)&{(u_j \in U)}\\&\le (1 - \alpha )^{j - 1}( 1 + \ell (u_1))&\text {(Eq.~11 applied } j - 1 \text { times)}\\&\le (1 - \alpha )^{j - 1} 2 \alpha ^{-1}(1 + \epsilon o(u_1)). \end{aligned}$$

Solving for j leads to

$$\begin{aligned} j \le 1 + \log _{1 - \alpha } \alpha / 2 \in O(1), \end{aligned}$$

and consequently \(k(v) \in O(1)\). We conclude that

$$\begin{aligned} {\left| V\right| } = \sum _{v \in W} 1 + k(v) \in \mathcal {O} \mathopen {}\left( m\right) \subseteq \mathcal {O} \mathopen {}\left( (1 + \epsilon ^{-1})\log n\right), \end{aligned}$$

proving the proposition. \(\square\)

7 Experimental evaluation

In this section we present our experimental evaluation. Our primary focus is computational time. We implemented our algorithms in C++. For convenience, we refer to our algorithms as DynAuc, Hexact, and Happrox.

We used three datasets obtained from the UCI repository: APS contains APS failures in Scania trucks; Diabetes contains medical information of diabetes patients, where the label indicates whether the patient has been readmitted to a hospital; Dota2 describes the character selection and the outcome of a popular competitive online computer game.

We imputed the missing values with the corresponding means, and encoded the categorical features as binary features. We then trained a logistic regressor using 1/10th of the data, and used the remaining data as test data. When computing the H-measure we used the beta distribution with \(\alpha = \beta = 2\).

In our first experiment, we tested maintaining AUC as opposed to keeping the points sorted and computing AUC from the sorted list (Brzezinski and Stefanowski 2017). Given a sequence \(z_1, \ldots , z_n\) of scores and labels, we compute AUC for \(z_1, \ldots , z_i\) for every i. In the dynamic algorithm, this is done by simply adding the latest point to the existing structure. We record the elapsed time after every 1000 additions.

Fig. 4 Running time for computing AUC 1000 times as a function of the number of data points. Left figure: our approach. Right figure: baseline method computing AUC from the maintained, sorted data points. Note that the time units are different

Fig. 5 Running time for computing AUC 10,000 times in a sliding window as a function of the size of the sliding window. Left figure: our approach. Right figure: baseline method computing AUC from the maintained, sorted data points. Note that the time units are different

From the results shown in Fig. 4 we see that DynAuc is about \(10^4\) times faster, though we should point out that the exact ratio depends heavily on the implementation. More importantly, the needed time increases logarithmically for DynAuc and linearly for the baseline. The spikes in the running time of DynAuc are due to the self-balancing search trees.

Next, we compare the running time of computing AUC in a sliding window. We use the same baseline as in the previous experiment, and record the running time after sliding a window for \(10\,000\) steps. From the results shown in Fig. 5 we see that DynAuc is faster than the baseline by several orders of magnitude with the needed time increasing logarithmically for DynAuc and linearly for the baseline.

Fig. 6 Running time for computing the H-measure 1000 times as a function of the number of data points. Left figure: our approach. Right figure: baseline method computing from sorted data points. Note that the time units are different

Fig. 7 Running time for computing the H-measure \(10\,000\) times in a sliding window as a function of the size of the sliding window. Left figure: our approach. Right figure: baseline method computing from sorted data points. Note that the time units are different

We repeat the same experiments, but now we compare maintaining the H-measure against computing it from scratch from the sorted data points. From the results shown in Figs. 6 and 7 we see that Hexact is about 10–\(10^2\) times faster, and the time grows polylogarithmically for Hexact and linearly for the baseline. Similarly, the spikes in the running time of Hexact are due to the self-balancing search trees. Interestingly, Hexact is faster for APS than for the other datasets. This is probably due to the imbalanced labels, which make the ROC curve relatively skewed and the convex hull small.

Fig. 8 Approximative H-measure as a function of approximation guarantee \(\epsilon\). Left figure: running time. Right figure: absolute difference to the correct value

In our final experiment we use the approximate H-measure (Happrox), without the speed-up described in Sect. 6.1. Here, we measure the total time to compute the H-measure for \(z_1, \ldots , z_i\) for every i as a function of \(\epsilon\). Figure 8 shows the running time as well as the difference to the correct score computed using the whole data.

Computing the H-measure from scratch required roughly 1 minute for APS, and 2.5 minutes for Diabetes and Dota2. On the other hand, we need only 10 seconds to obtain an accurate result, and as we increase \(\epsilon\), the running time decreases further. As we increase \(\epsilon\), the error grows, but only modestly (up to 3%), with Happrox underestimating the exact value.

8 Conclusions

In this paper we considered maintaining AUC and the H-measure under addition and deletion of data points. More specifically, we showed that we can maintain AUC in \(\mathcal {O} \mathopen {}\left( \log n\right)\) time, and the H-measure in \(\mathcal {O} \mathopen {}\left( \log ^2 n\right)\) time, assuming that the class priors are obtained from the testing data. We also considered the case where the class priors are not obtained from the testing data. Here, we can approximate the H-measure in \(\mathcal {O} \mathopen {}\left( (\log n + \epsilon ^{-1}) \log n\right)\) time.

We demonstrated empirically that our algorithms, DynAuc and Hexact, provide a significant speed-up over the natural baselines that compute the score from the maintained, sorted data points.

When computing the H-measure, the biggest time-saving factor is maintaining the convex hull, as the hull is typically much smaller than the set of data points used for creating the ROC curve. Because of the smaller size of the hull, the tricks employed by Happrox provide less of a speed-up. Still, for larger values of \(\epsilon\), the speed-up can be almost 50%.