1 Introduction

The area under the ROC curve (AUC) [16] measures how well a mapping h of the instance space to the reals respects the partial order defined by some “ideal” score function s; in the special case of bipartite ranking, s is simply a 0–1 valued function. As such, it has important applications in bioinformatics, information retrieval, anomaly detection, and many other areas.

Maximizing the AUC requires an approach different from maximizing the accuracy, even though there are some connections between the two [3, 5, 11]. Over the last decade, several approaches have been proposed and analyzed, guaranteeing consistency [9] and even optimal learning rates in some restricted cases [19]. Subsequently, [22], followed by [13, 21], considered AUC maximization in an online setting, while [14] introduced a one-pass AUC maximization framework.

In this paper we first point out two important shortcomings of the existing methods proposed for online and one-pass AUC optimization:

  (A) None of them guarantees an optimal solution (not even asymptotically).

  (B) They all need to store the whole data, because they require parameter tuning and hence multiple passes over the data.

In contrast to (A), the k nearest neighbor method (k-NN), as we show, is guaranteed to converge to the optimum. This superiority of k-NN is also supported by the results of our empirical investigations. What is more, even though it clearly requires storing the whole data, it is no more demanding in terms of space complexity than the previous algorithms, in view of (B). Finally, one could argue that k-NN must perform poorly in terms of running time. This is not the case, however: efficient implementations exist and, in fact, our experiments suggest that k-NN is competitive in this regard too. Additionally, dimensionality-related issues can be addressed using PCA or related methods.

The rest of the paper is structured as follows. First we introduce the formal framework and the definitions, then we show (A) formally, provide the theoretical justification for the k-NN method, present our experimental results, sum up the most important results from the literature, and finally we conclude with a short discussion.

2 Formal Setup

Given a set of n samples \((x_1,y_1), \dots , (x_n,y_n) \in \mathcal{X}\times \mathcal{Y}\), where \(\mathcal{X}\subseteq \mathbb {R}^d\) for some positive integer d and \(\mathcal{Y}= \{-1,1\}\), and given some mapping \(h:\mathcal{X}\rightarrow \mathbb {R}\), the area under the ROC curve (AUC) [16] is the empirical mean

$$\begin{aligned} \text {AUC}(h;\mathcal{X}^+,\mathcal{X}^-) = \sum _{x^+ \in \mathcal{X}^+}\sum _{x^- \in \mathcal{X}^-} \left( \tfrac{{\mathbb I}\left[ h(x^+) > h(x^-) \right] }{T_+ T_-} + \tfrac{{\mathbb I}\left[ h(x^+) = h(x^-) \right] }{2 T_+ T_-}\right) , \end{aligned}$$

where \(\mathcal{X}^+ = \{ x_t: y_t = 1, 1 \le t \le n\}\), \(\mathcal{X}^- = \{ x_t: y_t = -1, 1 \le t \le n\}\), \(T_+ = |\mathcal{X}^+|\), \(T_-= |\mathcal{X}^-|\), and where \({\mathbb I}\left[ \cdot \right] \) denotes the indicator function; i.e., \({\mathbb I}\left[ E \right] =1\) when event E holds and \({\mathbb I}\left[ E \right] =0\) otherwise. The regret of a hypothesis h with respect to some hypothesis set \(\mathcal{H}\subseteq \mathbb {R}^\mathcal{X}\) is defined as

$$\begin{aligned} \mathrm {Regret}^\mathcal{H}(h;\mathcal{X}^+,\mathcal{X}^-) = \sup _{h' \in \mathcal{H}}\text {AUC}(h';\mathcal{X}^+,\mathcal{X}^-) - \text {AUC}(h;\mathcal{X}^+,\mathcal{X}^-) \end{aligned}$$
(1)

We also denote by \(\text {AUC}^*(\mathcal{X}^+,\mathcal{X}^-)\) the supremum of \(\text {AUC}(h';\mathcal{X}^+,\mathcal{X}^-)\) over the set of all measurable functions \(h'\), and introduce the notation

$$\mathrm {Regret}(h;\mathcal{X}^+,\mathcal{X}^-) = \text {AUC}^*(\mathcal{X}^+,\mathcal{X}^-) - \text {AUC}(h;\mathcal{X}^+,\mathcal{X}^-).$$

Maximizing the AUC is also equivalent to minimizing the empirical risk

$$\begin{aligned} \mathrm {Risk}(h;\mathcal{X}^+,\mathcal{X}^-) = 1 - \text {AUC}(h;\mathcal{X}^+,\mathcal{X}^-) = \sum _{t,t'=1}^n \tfrac{\ell ^{\text {AUC}}(h;(x_t,y_t),(x_{t'},y_{t'}))}{2T_+T_-} \end{aligned}$$
(2)

of the loss function

$$ \ell ^{\text {AUC}}(h;(x,y),(x',y')) = {\mathbb I}\left[ (h(x) - h(x'))(y - y')<0 \right] + {\mathbb I}\left[ h(x) = h(x'), y \ne y' \right] . $$
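To make these definitions concrete, here is a minimal Python sketch (our own illustration; the function name `empirical_auc` is not from the paper) that computes the empirical AUC exactly as defined above, dividing by \(T_+ T_-\) and counting ties between a positive and a negative as 1/2.

```python
import numpy as np

def empirical_auc(scores, labels):
    """Empirical AUC of real-valued scores against {-1,+1} labels,
    counting ties between a positive and a negative as 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]   # h(x) for x in X^+
    neg = scores[labels == -1]  # h(x) for x in X^-
    # Pairwise comparisons between every positive and every negative.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum()
    ties = (diff == 0).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Tiny usage example: a perfect ranker gets AUC 1.
print(empirical_auc([0.9, 0.8, 0.1], [1, 1, -1]))  # -> 1.0
```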

One notorious problem with \(\text {AUC}\) is that it is non-convex and non-continuous, which makes it hard to work with. This is especially problematic in the online and one-pass settings, where a low (typically constant or logarithmic, but at least sublinear) per-round running time is essential. To resolve this issue, papers that aim at maximizing the AUC online [13, 14, 21, 22] replace \(\ell ^{\text {AUC}}\) in (2) by some surrogate loss function \(\ell : \mathcal{H}\times (\mathcal{X}\times \mathcal{Y}) \rightarrow \mathbb {R}\), and instead of maximizing the AUC, they minimize the surrogate risk

$$ \mathrm {Risk}^{\ell }(h;\mathcal{X}^+, \mathcal{X}^-) = \sum _{t,t'=1}^n\tfrac{\ell (h;(x_t,y_t),(x_{t'},y_{t'}))}{2T_+T_-} , $$

and derive bounds for \(\mathrm {Regret}^{\ell ,\mathcal{H}}(h;\mathcal{X}^+, \mathcal{X}^-)\), which is obtained by replacing \(\text {AUC}\) in (1) by \(1-\mathrm {Risk}^{\ell }\).

If \((x_1,y_1),(x_2,y_2),\dots \) are i.i.d. samples from some probability distribution \(\mathbf {P}\) over \(\mathcal{X}\times \mathcal{Y}\), then one can define

$$\begin{aligned} \mathrm {Risk}(h) = \mathbf {E}\left[ \ell ^{\text {AUC}}(h;(X,Y),(X',Y'))\,\big |\,Y>Y'\right] \end{aligned}$$
(3)

and \(\text {AUC}(h) = 1-\mathrm {Risk}(h)\). As above, replacing \(\ell ^{\text {AUC}}\) in (3) by some other loss function \(\ell \) one obtains the surrogate measures \(\mathrm {Risk}^\ell (h)\) and \(\mathrm {Regret}^{\ell ,\mathcal{H}}(h)\). Along the same analogy, we also use the notation \(\text {AUC}^*\) and \(\mathrm {Regret}(h)\).

One-Pass and Online Setting

The one-pass and the online settings both have the same underlying protocol: in each round t, the learner proposes some hypothesis \(h_t: \mathcal{X}\rightarrow \mathbb {R}\) based on its previous experience, and then it observes the sample \((x_t,y_t)\). The two frameworks only differ in their objectives (a code sketch of the common protocol follows the list):

  • In the online setting we are concerned with the evolution of the empirical AUC—that is, with

    $$ \text {AUC}_t = \text {AUC}(h_t; \mathcal{X}_t^+,\mathcal{X}^-_t) $$

    for \(t = 1, \dots , n\), where \(\mathcal{X}_t^+ = \{ x_{i}: y_{i} = 1, 1 \le i \le t\}\) and \(\mathcal{X}_{t}^- = \{ x_{i}: y_{i} = -1, 1 \le i \le t\}\).

  • In the one-pass setting the generalization ability of the learner is tested after the whole data has been processed. More precisely, the measure of performance is \(\text {AUC}(h_n)\).
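In code, this common protocol can be sketched as follows. This is a schematic illustration only: `run_online`, `learner.score`, and `learner.update` are hypothetical interface names, and `empirical_auc` refers to the helper sketched earlier in this section.

```python
def run_online(learner, stream, auc=empirical_auc):
    """Online protocol: in round t the learner proposes h_t, then
    observes (x_t, y_t). Returns the trajectory of AUC_t values,
    i.e., AUC(h_t; X_t^+, X_t^-)."""
    xs, ys, trajectory = [], [], []
    for x, y in stream:
        xs.append(x)
        ys.append(y)
        # h_t was proposed before seeing (x_t, y_t); evaluate it on X_t.
        scores = [learner.score(xi) for xi in xs]
        if 1 in ys and -1 in ys:      # AUC needs both classes present
            trajectory.append(auc(scores, ys))
        learner.update(x, y)          # only now reveal (x_t, y_t)
    return trajectory
```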

3 Surrogate Measures and Restricted Classes

Existing results for the online and one-pass AUC maximization problem optimize some surrogate risk instead of working with the AUC directly. In particular, many of them work with the square loss \(\ell _2\) (see Example 1). This approach is also supported by consistency results, i.e., results showing that \(\mathrm {Regret}(h_t)\) converges to 0 as \(t \rightarrow \infty \) whenever \(\mathrm {Regret}^\ell (h_t)\) does, for any sequence \(h_1, h_2, \dots \) of functions. (See more about this in the section on related work.)

These important results require, however, careful interpretation. This is the starting point of our investigations: we claim that appealing to consistency is only legitimate when the hypothesis class \(\mathcal{H}\) of interest contains a global optimizer of the surrogate loss; that is, when \(\sup _h\text {AUC}^\ell (h) = \sup _{h \in \mathcal{H}} \text {AUC}^\ell (h)\). When working with the set \(\mathcal{H}_{\mathrm {lin}}=\{ h^w(x) = w^\top x: w \in \mathbb {R}^d \}\) of linear functions (which is the case for all existing results on online and one-pass AUC maximization), this criterion is not fulfilled. Indeed, even though the square loss is consistent (see [14]), in Example 1 below it holds that \(\text {AUC}(h') \ll \sup _{h \in \mathcal{H}_{\mathrm {lin}}} \text {AUC}(h)\) for any \(h' \in {\mathop {\hbox {argmin}}\nolimits _{h \in \mathcal{H}_{\mathrm {lin}}}} \mathrm {Risk}^{\ell _2}(h)\).

The hinge loss \(\ell ^{\gamma }(h;(x,y),(x',y'))={\mathbb I}\left[ y \ne y' \right] [\gamma - \tfrac{1}{2}(y-y')(h(x)-h(x'))]_+\) has also been used in algorithmic solutions, but it does not even satisfy consistency [15].

3.1 Square Loss with Linear Hypotheses

This section presents an example demonstrating that existing methods can fail completely at maximizing the actual AUC, even in a very simple case.

Example 1

Consider the setting when \(\mathcal{X}= \mathbb {R}^2\), and \(\mathbf {P}[X=(-\epsilon ,-1+\epsilon )|Y=1] = 1\) and \(\mathbf {P}[X=(0,-1-\epsilon ) | Y=-1] = \mathbf {P}[X=(0,1)| Y=-1] = \mathbf {P}[X=(1,0)| Y=-1] = 1/3\) for some small \(\epsilon > 0\).

[14] first shows that the square loss \(\ell _2(h;(x,y),(x',y')) = \big (1 - \tfrac{y-y'}{2}(h(x) - h(x'))\big )^2\) is consistent with \(\mathrm {AUC}\), and then uses \(\ell _2\) as a surrogate loss to find the best linear score function \(h^w(x) = w^\top x\). However, in the case above, \(\mathrm {AUC}\) is maximized at \(h^{w^*}\) with \(w^* = (-1,0)\) (up to a small freedom depending on the size of \(\epsilon \)), where it takes the value 1, and thus \(\mathrm {Risk}(h^{w^*}) = 0\) for the corresponding linear function \(h^{w^*}\). On the other hand, the surrogate measure for \(h^{w^*}\) is

$$\begin{aligned}&\mathrm {Risk}^{\ell _2}(h^{w^*}) = \mathbf {E}[\ell _2(h^{w^*};X,X')|Y>Y'] \\&= \tfrac{1}{3}\big (1 + ({w^*})^\top (\epsilon ,2\epsilon )\big )^2 + \tfrac{1}{3}\big (1 + ({w^*})^\top (\epsilon ,2-\epsilon )\big )^2 + \tfrac{1}{3}\big (1 + ({w^*})^\top (1+\epsilon ,1-\epsilon )\big )^2 \end{aligned}$$

which evaluates to approximately 2/3. At the same time, the actual optimizer \(w'\) of this surrogate measure is around \((-1/2, -1/2)\), where the surrogate takes the value:

$$ \mathrm {Risk}^{\ell _2}(h^{w'}) = \mathbf {E}[\ell _2(h^{w'};X,X')|Y>Y'] \approx 1/3 , $$

whereas \(\text {AUC}(h^{w'}) \approx 2/3\), which is very far from the true optimum.

That is, when \(\mathcal{H}\) consists of the linear hypotheses (as is the case in [13, 14]), then

$$ \mathrm {Regret}^{\ell _2,\mathcal{H}}(h^{w'}) = 0 , $$

implying that

$$ \mathrm {Regret}^{\mathcal{H}}(h^{w'}) = \mathrm {Regret}(h^{(-1/2,-1/2)}) \approx 1/3 . $$

Furthermore, adding a regularization term \(\Vert w\Vert ^2\) to the surrogate measure does not change this.

Note that this does not contradict the consistency of the square loss: consistency requires \(\mathrm {Regret}\) to vanish as \(\mathrm {Regret}^\ell \) approaches 0, but since \(\mathrm {Regret}^\ell \) is large for all linear hypotheses, this imposes no restriction on how \(\mathrm {Regret}\) behaves in this example.
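The numbers in Example 1 are easy to verify numerically. The following sketch (our own illustration, not part of the original analysis) evaluates the true risk and the \(\ell _2\)-surrogate risk of \(h^{w^*}\) and \(h^{w'}\) for a small \(\epsilon \).

```python
import numpy as np

eps = 0.01
x_pos = np.array([-eps, -1 + eps])            # P[X = x_pos | Y = +1] = 1
x_negs = [np.array([0.0, -1 - eps]),          # each with prob. 1/3 given Y = -1
          np.array([0.0, 1.0]),
          np.array([1.0, 0.0])]

def surrogate_risk(w):
    # Risk^{l2}(h^w) = E[(1 - w^T(X - X'))^2 | Y > Y']
    return np.mean([(1 - w @ (x_pos - xn)) ** 2 for xn in x_negs])

def true_risk(w):
    # Risk(h^w): fraction of (positive, negative) pairs ranked wrongly,
    # with ties counted as 1/2.
    s_pos = w @ x_pos
    return np.mean([1.0 if s_pos < w @ xn else (0.5 if s_pos == w @ xn else 0.0)
                    for xn in x_negs])

w_star = np.array([-1.0, 0.0])     # maximizes the AUC: true risk 0
w_prime = np.array([-0.5, -0.5])   # (near-)minimizer of the surrogate risk

print(true_risk(w_star), surrogate_risk(w_star))    # ~0.0, ~2/3
print(true_risk(w_prime), surrogate_risk(w_prime))  # ~1/3, ~1/3
```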

4 Conditional Probability as Rank Function

In this section we start the investigation of finding alternative algorithmic solutions. With that in mind, we reach back to the fundamentals of AUC, and show that good estimates of the conditional probability function \(\eta (x) = \mathbf {P}[Y=1|X=x]\) perform well at AUC maximization too.

The particular estimates that we consider here are of the form \(\widehat{\eta }: \mathcal{X}\times \mathcal{Z}\rightarrow \mathbb {R}\) for some domain \(\mathcal{Z}\). Here \(\mathcal{Z}\) is the domain of a variable that is used to encode prior information (e.g., random samples) and internal randomization (used e.g., for tie breaking) of the learner. (In accordance with that, in some cases it will be more convenient to use the notation \(\widehat{\eta }_z\) for \(\widehat{\eta }(\,\cdot \,,z)\).) For example, given some series \(\{ k_n \}_n\) of stepsizes, the \(k_n\mathrm {-NN}\) estimate of [12] makes use of some i.i.d. samples \(U_1, \dots , U_n, U\) drawn from the uniform distribution over [0, 1]. Putting \(Z_n = (X_1,Y_1, \dots , X_n,Y_n, U_1, \dots , U_n, U)\), their \(k_n\mathrm {-NN}\) estimate maps an instance x to

$$\begin{aligned} \widehat{\eta }_{Z_n}^{\text {DGKL}}(x) = \frac{1}{k_n}\sum _{i=1}^{k_n} Y_{\sigma (Z_n,x,i)}, \end{aligned}$$
(4)

where \(\sigma (Z_n,x, \,\cdot \,)\) is the permutation for which \((\Vert X_{\sigma (Z_n,x,1)} - x\Vert ,\Vert U_{\sigma (Z_n,x,1)}-U\Vert ), \dots , (\Vert X_{\sigma (Z_n,x,n)} - x\Vert ,\Vert U_{\sigma (Z_n,x,n)}-U\Vert )\) is in lexicographic order.
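A brute-force implementation of the estimate (4) takes only a few lines; the following Python sketch is our own illustration and favors clarity over the efficient data structures discussed in Sect. 5.

```python
import numpy as np

def dgkl_knn_estimate(X, Y, U, x, u, k):
    """k_n-NN estimate (4): average the labels of the k samples closest
    to x, breaking distance ties lexicographically via the uniform
    random numbers U (for the samples) and u (for the query)."""
    d_x = np.linalg.norm(X - x, axis=1)   # ||X_i - x||
    d_u = np.abs(U - u)                   # |U_i - u|, used only to break ties
    order = np.lexsort((d_u, d_x))        # lexicographic: distance first
    return Y[order[:k]].mean()

# Usage sketch with random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = np.where(X[:, 0] + rng.normal(size=100) > 0, 1, -1)
U = rng.uniform(size=100)
print(dgkl_knn_estimate(X, Y, U, np.zeros(2), rng.uniform(), k=7))
```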

Given such an estimate, we show the following result (for the proof see Appendix A).

Theorem 2

Let Z be some random variable over some domain \(\mathcal{Z}\), and let \(\widehat{\eta }: \mathcal{X}\times \mathcal{Z}\rightarrow \mathbb {R}\) be an estimate of the conditional expectation function \(\eta (x) := \mathbf {E}[Y|X=x]\) as described above. Then \( \mathbf {E}_Z\left[ \mathrm {Regret}(\widehat{\eta }_Z)\right] \le \frac{3\sqrt{\epsilon }}{\mathbf {P}[Y=1]\mathbf {P}[Y=-1]},\) where \(\epsilon = \mathbf {E}_{X,Z}[|\widehat{\eta }(X,Z) - \eta (X)|]\).

A similar result has appeared in [1, 10, 19]. The estimator considered here, however, requires some small but essential changes in the analysis. Most importantly, kernel estimators are completely determined by the samples, whereas \(k_n\mathrm {-NN}\) needs tie breaking; this requires additional randomness and complicates the analysis slightly.

5 AUC Maximization Using k-NN

In the previous section we have shown guarantees for the AUC performance of estimators of the conditional probability function \(\eta \). In this section we review some of the results on estimating \(\eta \) using \(k_n\mathrm {-NN}\), and show what they yield when combined with Theorem 2.

First of all, Devroye et al. [12] have shown that the \(k_n\mathrm {-NN}\) version presented as Algorithm 1 converges under any distribution, assuming some standard restrictions on \(k_n\).

Theorem 3

(Theorem 1 in [12]). If the stepsize \(k_n\) satisfies \(\lim _{n \rightarrow \infty } k_n = \infty \) and \(\lim _{n \rightarrow \infty } k_n/n = 0\), then \(\mathbf {E}\left[ \left| \widehat{\eta }_{Z_n}^{\text {DGKL}}(X)-\eta (X)\right| \right] \rightarrow 0\), where \(\widehat{\eta }_{Z_n}^{\text {DGKL}}\) is defined as in (4).

Plugging this into Theorem 2 we immediately obtain the following result on \(\text {AUC}(\widehat{\eta }_{Z_n}^{\text {DGKL}})\).

Corollary 4

Let \(\widehat{\eta }_{Z_n}^{\text {DGKL}}\) be defined as in (4). Then \(\mathbf {E}_{Z_n}\left[ \mathrm {Regret}(\widehat{\eta }_{Z_n}^{\text {DGKL}})\right] \rightarrow 0\) whenever \(\lim _{n \rightarrow \infty } k_n = \infty \) and \(\lim _{n \rightarrow \infty } k_n/n = 0\).

KNNOAM (the k-NN based online AUC maximization algorithm given as Algorithm 1) is thus guaranteed to converge in the case of i.i.d. samples.

One can, in fact, also derive rates of convergence based on the work of Chaudhuri and Dasgupta [8]. These would not hold uniformly though, only for suitably restricted distributions.

Algorithm 1. KNNOAM: online AUC maximization based on the \(k_n\mathrm {-NN}\) estimate (4).
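Since the algorithm box is not reproduced here, the following Python sketch conveys its structure. It is our reconstruction from the text, not the authors' code (which is linked in Sect. 7): store all samples, score a query by the \(k_t\)-NN estimate (4) with random tie breaking, and use the schedule \(k_t = 2\log _2 t\) adopted in our experiments; a brute-force neighbor search stands in for the cover tree discussed next.

```python
import numpy as np

class KNNOAM:
    """Sketch of KNNOAM: store every sample and rank a query point by
    the k_t-NN estimate (4). Neighbor search is brute force here; a
    cover tree [6] would give O(log t) insertion and O(k log t) queries."""

    def __init__(self, seed=0):
        self.X, self.Y, self.U = [], [], []
        self.rng = np.random.default_rng(seed)

    def score(self, x):
        t = len(self.X)
        if t == 0:
            return 0.0
        k = max(1, min(t, int(2 * np.log2(t + 1))))   # k_t = 2 log2 t, clipped
        d_x = np.linalg.norm(np.asarray(self.X) - x, axis=1)
        d_u = np.abs(np.asarray(self.U) - self.rng.uniform())
        order = np.lexsort((d_u, d_x))                # random tie breaking
        return float(np.asarray(self.Y)[order[:k]].mean())

    def update(self, x, y):
        self.X.append(np.asarray(x, dtype=float))
        self.Y.append(y)
        self.U.append(self.rng.uniform())
```

Combined with the protocol sketch from Sect. 2, this is the kind of learner evaluated in Sect. 7.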

Efficient Implementation

An important feature of k-NN methods is that they can be implemented efficiently. For example, the Cover Tree structure [6] makes it possible to insert a new instance into an existing cover tree, or to remove an old one from it, in time \(O(\log t)\), and to find the \(k_n\) nearest neighbors of an arbitrary point in time \(O(k_n \log t)\).

Choosing \(k\)

Choosing the right k for k-NN is a hard question. For pointwise convergence, \(k > \log \log n\) is recommended (see Remark 1 in [4]), but the common practice is to use \(\log n< k < n^{1/2}\). One can also consider a k that changes with the context; i.e., that depends on the particular instance being queried. See [4] for further details.

For a given dataset, one can also use cross-validation or some Bayesian approach to find the best k. All the linear methods mentioned in the Introduction rely on such tuning, but in a truly online setting this is not applicable.

6 Dimensionality Reduction

In general, any learning method that approximates \(\eta \) with arbitrary accuracy can be applied. Note, however, that all these methods, including k-NN, suffer from dimensionality issues. One way to deal with this is to first apply dimensionality reduction. More specifically, the idea is to feed the incoming samples into some online PCA algorithm (like SGA or CCIPCA; see [7] for a thorough discussion), and then use its output as input for k-NN, Parzen-Rosenblatt kernels, etc. This way one maintains the good AUC performance guaranteed by the learning algorithms, but prevents dimensionality-originated run-time issues thanks to the guarantees of the PCA methods.

To preserve the good AUC performance guarantees of k-NN, this task requires a kind of stability from the aforementioned techniques. More precisely, denoting by \(\varPhi _t\) the mapping they induce from the data of the first t rounds, it should fulfill the property

$$ \Vert \varPhi _t(x_t) - \varPhi _T(x_t)\Vert \le \epsilon (t) \qquad \quad \forall \ T \ge t $$

for some \(\epsilon (t)\) converging to 0 as t goes to infinity. Maintaining the convergence of k-NN does not seem possible otherwise.
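As an illustration of this pipeline, the sketch below pushes a stream through an incremental PCA step before a k-NN scorer. We use scikit-learn's IncrementalPCA as a convenient stand-in for the SGA/CCIPCA algorithms of [7]; its projection is only approximately stable in the above sense (older stored points are not re-projected when \(\varPhi _t\) changes), so this is a heuristic illustration rather than a method with the stated guarantee.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
n, d, d_red, batch = 2000, 50, 5, 32

# Synthetic high-dimensional stream whose signal lives in d_red directions.
W = rng.normal(size=(d_red, d))
Z = rng.normal(size=(n, d_red))
X_stream = Z @ W + 0.01 * rng.normal(size=(n, d))
y_stream = np.where(Z[:, 0] > 0, 1, -1)

pca = IncrementalPCA(n_components=d_red)
X_buf, Y_buf = [], []                    # reduced samples stored for k-NN

for start in range(0, n, batch):
    xb = X_stream[start:start + batch]
    pca.partial_fit(xb)                  # update the projection Phi_t online
    X_buf.extend(pca.transform(xb))      # earlier points keep Phi_s, s < t
    Y_buf.extend(y_stream[start:start + batch])

# Score one query by a plain k-NN estimate in the reduced space.
q = pca.transform(X_stream[:1])[0]
dist = np.linalg.norm(np.asarray(X_buf) - q, axis=1)
k = int(2 * np.log2(len(X_buf)))
print(np.asarray(Y_buf)[np.argsort(dist)[:k]].mean())
```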

7 Experimental Results

In this section, we evaluate the empirical performance of the proposed KNN Online AUC Maximization (KNNOAM) algorithm on benchmark datasets.

Compared Algorithms

We compare the proposed KNNOAM algorithm with state-of-the-art online AUC optimization algorithms. Specifically, the compared algorithms in our experiments include:

  • OAM\(_{\mathrm {seq}}\): the OAM algorithm with reservoir sampling and sequential updating [22];

  • OAM\(_{\mathrm {gra}}\): the OAM algorithm with reservoir sampling and online gradient updating [22];

  • OPAUC: the one-pass AUC optimization algorithm proposed in [14];

  • AdaOAM: the adaptive gradient AUC optimization algorithm proposed in [13];

  • KNNOAM: our proposed k-NN based algorithm.

General Experimental Setup

We conduct our experiments on sixteen benchmark datasets that have been used in previous studies on AUC optimization. The details of the datasets are summarized in Table 1. All these datasets can be downloaded from LIBSVM and the UCI Machine Learning Repository. Note that several of the datasets (segment, satimage, vowel, letter, poker, usps, connect-4, acoustic and vehicle) are originally multi-class. These have been converted into class-imbalanced binary datasets by choosing one class, setting its label to +1, and setting the rest to \(-1\). The chosen class is one of minimal cardinality among those keeping the ratio \(T_-/T_+\) below 50. (The ratio is kept below 50 to obtain conclusive results.) If two or more classes have the same size, one of them has been chosen at random. Previous studies use similar conversion methods. In addition, the features have been rescaled linearly to \([-1,\,1]\) for all datasets. All experiments are performed with Matlab on a computer workstation with a 3.40 GHz CPU and 32 GB memory.
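For concreteness, the conversion just described can be sketched in a few lines (our own rendering of the procedure; function and variable names are hypothetical):

```python
import numpy as np

def to_imbalanced_binary(y, max_ratio=50, rng=None):
    """Pick the positive class as described in the text: among classes
    whose T-/T+ ratio stays below max_ratio, take one of minimal size
    (breaking size ties at random); relabel it +1 and the rest -1."""
    rng = rng or np.random.default_rng()
    classes, counts = np.unique(y, return_counts=True)
    ok = (len(y) - counts) / counts < max_ratio   # ratio T-/T+ below 50
    candidates, sizes = classes[ok], counts[ok]
    smallest = candidates[sizes == sizes.min()]
    pos = rng.choice(smallest)                    # random tie breaking
    return np.where(y == pos, 1, -1)
```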

Table 1. Details of the benchmark datasets used in the experiments. \(T_+ = |\mathcal{X}^+|\) and \(T_- = |\mathcal{X}^-|\).

Experimental Setup of the Online Setting

Our main goal is to compare the performance of the above algorithms in the online setting. In this setting, each algorithm receives a random sample from the dataset, suffers loss, and updates its classifier according to this sample. Note that our proposed KNNOAM does not require any parameter tuning, as opposed to the other four algorithms. Parameter tuning requires multiple passes over the data, which is inconsistent with this setting: while KNNOAM must store at time t all the samples up to time \(t-1\), the other algorithms must store all of the samples in advance for the parameter tuning. Although this gives the other algorithms an unfair advantage, we follow the parameter tuning procedures suggested in [13, 14, 22] and use the best obtained parameters for each dataset and algorithm. We apply five-fold cross-validation on the training set to find the best learning rate \(\eta \in 2^{[-10, 10]}\) and regularization parameter \(\lambda \in 2^{[-10,2]}\) for both OPAUC and AdaOAM. For OAM\(_{\mathrm {seq}}\) and OAM\(_{\mathrm {gra}}\), we apply five-fold cross-validation to tune the penalty parameter \(C \in 2^{[-10,10]}\), and fix the buffer size at 100 as suggested in [22]. For KNNOAM we only had to choose a \(k_n\) that goes to infinity and is of order o(n); we chose \(k_n = 2 \log _2 n\). For every dataset, we average over 20 runs and plot the empirical AUC loss as a function of the number of samples; this lets us examine the evolution of the performance of the algorithms. The results of this experiment are presented in Appendix B. The code is available at https://bitbucket.org/snir/auc2017.

Experimental Setup of the One-Pass Setting

We also compare the performance of the algorithms in the one-pass setting, following the experimental setup of previous studies [13, 14, 22]. The performance of the compared algorithms is evaluated by four trials of five-fold cross-validation, using the parameters obtained by the tuning procedure described above for the online setting. The reported AUC values are averages over these 20 runs. The results are summarized in Fig. 1.

7.1 Evaluation on Benchmark Datasets

In the online setting, KNNOAM outperforms the four other, state-of-the-art online AUC algorithms considered in our experiments on 12 out of 16 datasets. What is more, on 10 of those 12 datasets, the improvement achieved by KNNOAM is significant. For example, on the ijcnn1 and acoustic datasets, KNNOAM converges to a much lower empirical AUC loss than the rest of the algorithms. These are outstanding results for this setting, especially recalling that the compared algorithms have been tuned before running the experiments. Our graphs also demonstrate that KNNOAM has a much smoother convergence. The performance of the other algorithms is unstable: after receiving new samples it may improve, but it may also deteriorate significantly, which is a highly undesirable property.

Fig. 1. Summary of the one-pass AUC performance of the compared algorithms.

In the german, svmguide3 and vehicle datasets, KNNOAM does not outperform all the compared algorithms. However, these datasets are very small and therefore the results are inconclusive.

On the a9a dataset, OPAUC and AdaOAM outperform KNNOAM. Although this dataset cannot be considered small, KNNOAM still behaves as if it were, because the samples are sparse and the dimensionality is relatively high. To strengthen this claim, we can examine the performance of KNNOAM on the connect-4 dataset: it has roughly the same dimensionality as a9a and similarly sparse samples, but more than twice as many of them, and on it KNNOAM significantly outperforms the rest of the algorithms.

Many of the graphs show that \(\text {OAM}_{\text {seq}}\) and \(\text {OAM}_{\text {gra}}\) are not guaranteed to converge at all. Some of them also show that OPAUC and AdaOAM do not converge smoothly. This behavior is observed when the learning rate \(\eta \) is too high, so that a single sample can change the classifier drastically and cause a significant local deterioration of the performance, as can be seen, for example, from the performance of OPAUC on the svmguide1 dataset. On the other hand, if the chosen learning rate \(\eta \) is too small, the performance might be poor, as can be seen from the performance of AdaOAM on the same dataset.

The results for the one-pass setting are similar to those for the online setting: KNNOAM is better than, or at least competitive with, its competitors. For this setting we also present the running times (see Appendix C): KNNOAM is usually the fastest algorithm and is never the slowest among the compared algorithms.

8 Related Work

Reducing Ranking to Binary Classification

[3, 5] consider the problem of ranking a finite random sample, and reduce it to a related binary classification problem. In particular, they bound the risk of the ranking (which is closely related to the AUC of the ranker over this sample) in terms of the classification performance. However, the risk and the classification performance they use are only representative of that particular sample, and they do not consider prediction or generalization bounds. ([5] comments on the case when the rankings are drawn from some distribution, but this does not imply any result in our setting.)

Offline Algorithmic Solutions

[9] aims to minimize (3), and actually obtains asymptotic optimality. Their algorithm constructs a tree in an iterative fashion by solving a sequence of challenging optimization problems.

[19] uses kernel estimates of the conditional probability function \(\eta (x) = \mathbf {E}[Y|X=x]\) based on the Parzen-Rosenblatt kernel, and shows fast and superfast convergence rates under Tsybakov-style noise assumptions. They also complement their results with lower bounds on the best achievable convergence rate in some situations.

Consistency of Surrogate Measures

The investigation of consistency with respect to \(\text {AUC}\) was initiated by [18], who showed the consistency of a balanced version of the exponential and the logistic loss. Later on, [1, 14, 15, 20] investigated the consistency of other loss functions, such as the exponential, logistic, squared, and q-normed hinge loss, and variants thereof. Finally, [15] shows that the hinge loss is not consistent.

One-Pass and Online Solutions

As mentioned, [22] was the first to analyze online AUC maximization. They defined the setup, presented algorithmic solutions optimizing the hinge loss, and provided regret bounds. [21] also uses the hinge loss, with a perceptron-like algorithm which, in round t, achieves regret \(O(1/\sqrt{t})\). [17] works in the same setting as [21], and achieves several improvements in terms of different parameters.

[14] uses the square loss in the one-pass setting, and obtains a convergence rate of order \(O(1/t)\) in the linearly separable case and \(O(\sqrt{1/t})\) in the general one. [13] obtains similar results, but using the Adaptive Gradient method.

All these papers work with the set of linear hypotheses.

Uniform Convergence Bounds

Uniform convergence bounds like the ones in [2] show how fast the empirical AUC-risk converges to the actual one, uniformly over a given class of hypotheses. However, they do not provide any practical guidance on how one could acquire a hypothesis with small risk.

9 Concluding Remarks

We have shown that existing methodology for maximizing AUC in an online or one-pass setting can fail already in very simple situations. To remedy this, we have proposed to reach back to the fundamentals of AUC, and suggested an algorithmic solution based on the celebrated k-NN-estimate of the conditional probability function. This has guarantees in the stochastic setting, has efficient implementations, and outperforms previous methods on several real datasets. The latter is even more surprising in view of the fact that, unlike its competitors, it requires no parameter tuning.

Nevertheless, we feel that this should not be considered as an ultimate solution, but rather as an encouragement for future research to explore further alternative solutions. To mention a few:

  • Combining KNNOAM with metric learning arises naturally, and could extend its applicability to more exotic domains.

  • Maximizing the objective function \(\text {AUC}_n\) in the adversarial setting is another important question, about which existing results say nothing.