Abstract
The area under the ROC curve (AUC) is a widely used measure for evaluating classification performance on heavily imbalanced data. Kernelized AUC maximization machines generalize better than linear AUC machines because they can model the complex nonlinear structures underlying most real-world data. However, their high training complexity renders kernelized AUC machines infeasible for large-scale data. In this paper, we present two nonlinear AUC maximization algorithms that optimize linear classifiers over a finite-dimensional feature space constructed via the k-means Nyström approximation. Our first algorithm maximizes the AUC metric by optimizing a pairwise squared hinge loss function using the truncated Newton method. However, this second-order batch method becomes expensive for extremely massive datasets. This motivates us to develop a first-order stochastic AUC maximization algorithm that incorporates a scheduled regularization update and scheduled averaging to accelerate the convergence of the classifier. Experiments on several benchmark datasets demonstrate that the proposed AUC classifiers are more efficient than kernelized AUC machines while surpassing or at least matching their AUC performance. We also show experimentally that the proposed stochastic AUC classifier is able to reach the optimal solution, while other state-of-the-art online and stochastic AUC maximization methods are prone to suboptimal convergence. Code related to this paper is available at: https://sites.google.com/view/majdikhalid/.
1 Introduction
The area under the ROC curve (AUC) [11] has a wide range of applications in machine learning and data mining such as recommender systems, information retrieval, bioinformatics, and anomaly detection [1, 5, 22, 25, 26]. Unlike error rate, the AUC metric does not consider the class distribution when assessing the performance of classifiers. This property renders the AUC a reliable measure to evaluate classification performance on heavily imbalanced datasets [7], which are not uncommon in real-world applications.
The optimization of the AUC metric aims to learn a score function that scores a random positive instance higher than a random negative instance. Therefore, the AUC metric is a threshold-independent measure. In fact, it evaluates a classifier over all possible thresholds, hence eliminating the effect of imbalanced class distribution. The objective function maximizing the AUC metric optimizes a sum of pairwise losses. This objective function can be solved by learning a binary classifier on pairs of positive and negative instances that constitute the difference space. Intuitively, the complexity of such algorithms increases linearly with respect to the number of pairs. However, linear ranking algorithms like RankSVM [4, 21], which can optimize the AUC directly, have shown a learning complexity independent of the number of pairs.
The kernelized versions of RankSVM [4, 13, 20], however, are superior to linear ranking machines in terms of AUC classification accuracy. This is due to their ability to model the complex nonlinear structures that underlie most real-world data. Analogous to kernel SVM, the kernelized RankSVM machines entail computing and storing a kernel matrix, which grows quadratically with the number of instances. This hinders the efficiency of kernelized RankSVM machines for learning on large datasets.
Recent approaches attempt to scale up the learning for AUC maximization from different perspectives. The first approach adopts online learning techniques to optimize the AUC on large datasets [9, 10, 17, 18, 32]. However, online methods result in inferior classification accuracy compared to batch learning algorithms. The authors of [15] develop a sparse batch nonlinear AUC maximization algorithm, which can scale to large datasets, to overcome the low generalization capability of online AUC maximization methods. However, sparse algorithms are prone to underfitting due to the sparsity of the model, especially for large datasets. The work in [28] attributes the low generalization capability of online AUC maximization methods to the optimization of the surrogate loss function on a limited hypothesis space. Therefore, it devises a nonparametric algorithm to maximize the real AUC loss function. However, learning such a nonparametric algorithm on a high-dimensional space is not reliable.
In this paper, we address the inefficiency of learning nonlinear kernel machines for AUC maximization. We propose two learning algorithms that learn linear classifiers on a feature space constructed via the k-means Nyström approximation [31]. The first algorithm employs a linear batch classifier [4] that optimizes the AUC metric. The batch classifier is a Newton-based algorithm that requires the computation of all gradients and the Hessian-vector product in each iteration. While this learning algorithm is applicable for large datasets, it becomes expensive for training enormous datasets embedded in a high-dimensional feature space. This motivates us to develop a first-order stochastic learning algorithm that incorporates the scheduled regularization update [3] and scheduled averaging [23] to accelerate the convergence of the classifier. The integration of these acceleration techniques allows the proposed stochastic method to enjoy the low complexity of classical first-order stochastic gradient algorithms and the fast convergence rate of second-order batch methods.
The remainder of this paper is organized as follows. We begin by reviewing closely related work in Sect. 2. In Sect. 3, we define the AUC problem and present related background. The proposed methods are presented in Sect. 4. The experimental results are shown in Sect. 5. Finally, we conclude the paper and point out future work in Sect. 6.
2 Related Work
The maximization of the AUC metric is a bipartite ranking problem, a special type of ranking problem. Hence, most ranking algorithms can be used to solve the AUC maximization problem. The large-scale kernel RankSVM is proposed in [20] to address the high complexity of learning kernel ranking machines. However, this method still depends quadratically on the number of instances, which hampers its efficiency. Linear RankSVM [2, 4, 14, 21, 27] is more amenable to scaling up than the kernelized variations. However, linear methods are limited to linearly separable problems. A recent study [6] explores the Nyström approximation to speed up the training of the nonlinear kernel ranking function. This work does not address the AUC maximization problem. It also does not consider the k-means Nyström method and only uses a batch ranking algorithm. Another method [15] attempts to speed up the training of nonlinear AUC classifiers by learning a sparse model constructed incrementally based on chosen criteria [16]. However, the sparsity can deteriorate the generalization ability of the classifier.
Another class of research proposes using online learning methods to reduce the training time required to optimize the AUC objective function [9, 10, 17, 18, 32]. The work in [32] addresses the complexity of pairwise learning by deploying a first-order online algorithm that maintains a buffer of fixed size for positive and negative instances. The work in [17] proposes a second-order online AUC maximization algorithm with a fixed-sized buffer. The work in [10] maintains the first-order and second-order statistics for each instance instead of the buffering mechanism. Recently, the work in [30] formulates the AUC maximization problem as a convex-concave saddle point problem. The proposed algorithm in [30] solves a pairwise squared hinge loss function without the need to access the buffered instances or the second-order information. Therefore, it shows linear space and time complexities per iteration with respect to the number of features.
The work in [12] proposes a budget online kernel method for nonlinear AUC maximization. For massive datasets, however, the size of the budget needs to be large to reduce the variance of the model and to achieve an acceptable accuracy, which in turn increases the training time complexity. The work in [8] attempts to address the scalability problem of kernelized online AUC maximization by learning a mini-batch linear classifier on an embedded feature space. The authors explore both the Nyström approximation and random Fourier features to construct an embedding in an online setting. Despite their superior efficiency, online linear and nonlinear AUC maximization algorithms are susceptible to suboptimal convergence, which leads to inferior AUC classification accuracy.
Instead of maximizing a surrogate loss function, the authors of [28] attempt to optimize the real AUC loss function using a nonparametric learning algorithm. However, learning the nonparametric algorithm on high dimensional datasets is not reliable.
3 Preliminaries and Background
3.1 Problem Setting
We are given a training dataset \(\mathcal{S} = \{(x_{i},y_{i})\}_{i=1}^{n}\) with \(x_{i} \in \mathbb{R}^{d}\), where n denotes the number of instances and d refers to the dimension of the data, generated from an unknown distribution \(\mathcal{D}\). The label of the data is a binary class label \(y \in \{-1,+1\}\). We use \(n^{+}\) and \(n^{-}\) to denote the number of positive and negative instances, respectively. The maximization of the AUC metric is equivalent to the minimization of the following loss function:
\[ L(f;\mathcal{S}) = \frac{1}{n^{+}n^{-}} \sum_{i=1}^{n^{+}} \sum_{j=1}^{n^{-}} I\big(f(x^{+}_{i}) \le f(x^{-}_{j})\big) \tag{1} \]
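for a linear classifier \(f(x) = w^{T}x\), where \(I(\cdot)\) is an indicator function that outputs 1 if its argument is true, and 0 otherwise. The discontinuous nature of the indicator function makes the pairwise minimization problem (1) hard to optimize. It is common to replace the indicator function with its convex surrogate function as follows,
\[ \hat{L}(f;\mathcal{S}) = \frac{1}{n^{+}n^{-}} \sum_{i=1}^{n^{+}} \sum_{j=1}^{n^{-}} \ell\big(f(x^{+}_{i}) - f(x^{-}_{j})\big), \qquad \ell(z) = \max(0,\, 1-z)^{p} \tag{2} \]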
This pairwise loss function \(\ell(f(x^{+}_{i}) - f(x^{-}_{j}))\) is convex in w, and it upper bounds the indicator function. The pairwise loss function is defined as hinge loss when \(p=1\), and is defined as squared hinge loss when \(p=2\). The optimal linear classifier w for maximizing the AUC metric can be obtained by minimizing the following objective function:
\[ \min_{w} \; \frac{1}{2}\Vert w\Vert^{2} + C \sum_{i=1}^{n^{+}} \sum_{j=1}^{n^{-}} \max\big(0,\, 1 - w^{T}(x^{+}_{i} - x^{-}_{j})\big)^{p} \tag{3} \]
where \(\Vert w\Vert\) is the Euclidean norm and C is the regularization hyperparameter. Notice that the weight vector w is trained on the pairs of instances \((x^{+} - x^{-})\) that form the difference space. This linear classifier is efficient in dealing with large-scale applications, but its modeling capability is limited to a linear decision boundary.
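The relationship between the AUC loss and its convex surrogate can be checked numerically. The following is a minimal sketch, assuming NumPy and a toy dataset invented for illustration; the function name `pairwise_auc_losses` is hypothetical and not part of the authors' code.

```python
import numpy as np

def pairwise_auc_losses(w, X_pos, X_neg, p=1):
    """Return (AUC loss, surrogate loss) for the linear scorer f(x) = w^T x."""
    s_pos = X_pos @ w                       # scores of the positive instances
    s_neg = X_neg @ w                       # scores of the negative instances
    # Score differences f(x_i^+) - f(x_j^-) for every positive/negative pair.
    diff = s_pos[:, None] - s_neg[None, :]
    auc_loss = np.mean(diff <= 0)           # fraction of mis-ordered or tied pairs
    surrogate = np.mean(np.maximum(0.0, 1.0 - diff) ** p)
    return auc_loss, surrogate

# Toy data (illustrative assumption): 2 positives, 3 negatives.
X_pos = np.array([[2.0, 0.0], [1.0, 1.0]])
X_neg = np.array([[0.0, 0.0], [3.0, 0.0], [0.5, 0.5]])
w = np.array([1.0, 1.0])

auc_loss, surrogate = pairwise_auc_losses(w, X_pos, X_neg, p=1)
auc = 1.0 - auc_loss                        # empirical AUC of the scorer
```

On this toy data two of the six pairs are mis-ordered, so the AUC loss is 1/3 while the hinge surrogate is 2/3, which illustrates that the surrogate upper bounds the indicator-based loss.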
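The kernelized AUC maximization can also be formulated as an unconstrained objective function over a coefficient vector \(\beta\) [4, 20]:
\[ \min_{\beta \in \mathbb{R}^{n}} \; \frac{1}{2}\beta^{T}K\beta + C \sum_{(i,j) \in A} \max\big(0,\, 1 - ((K\beta)_{i} - (K\beta)_{j})\big)^{p} \]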
where K is the kernel matrix, and A is a sparse matrix that contains all possible pairs \(A \equiv \{(i,j) \mid y_{i} > y_{j}\}\). In the batch setting, the computation of the kernel costs \(\mathcal{O}(n^{2}d)\) operations, while storing the kernel matrix requires \(\mathcal{O}(n^{2})\) memory. Moreover, the summation over pairs costs \(\mathcal{O}(n \log n)\) [20]. These complexities make kernel machines costly to train compared to the linear model, which has linear complexity with respect to the number of instances.
3.2 Nyström Approximation
The Nyström approximation [19, 31] is a popular approach to approximate the feature maps of linear and nonlinear kernels. Given a kernel function \(\kappa(\cdot,\cdot)\) and landmark points \(\{u_{l}\}^{v}_{l=1}\) generated or randomly chosen from the input space \(\mathcal{S}\), the Nyström method approximates a kernel matrix G as follows,
\[ G \approx \tilde{G} = E W^{-1} E^{T} \]
where \(W_{ij} = \kappa(u_{i},u_{j})\) is a kernel matrix computed on landmark points and \(W^{-1}\) is its pseudoinverse. The matrix \(E_{ij} = \kappa(x_{i},u_{j})\) is a kernel matrix holding the kernel evaluations between the input instances and the landmark points. The matrix W is factorized using singular value decomposition or eigenvalue decomposition as follows: \(W = U \varSigma U^{T}\), where the columns of the matrix U hold the orthonormal eigenvectors while the diagonal matrix \(\varSigma\) holds the eigenvalues of W in descending order. The Nyström approximation can be utilized to transform the kernel machines into linear machines by nonlinearly embedding the input space in a finite-dimensional feature space. The nonlinear embedding for an instance x is defined as follows,
\[ \varphi(x) = \varSigma_{r}^{-1/2} U_{r}^{T} \phi^{T}(x) \]
where \(\phi(x) = [\kappa(x,u_{1}),\dots,\kappa(x,u_{v})]\), the diagonal matrix \(\varSigma_{r}\) holds the top r eigenvalues, and \(U_{r}\) holds the corresponding eigenvectors. The rank-r matrix \(U_{r}\varSigma_{r}U_{r}^{T}\), with \(r \le v\), is the best rank-r approximation of W. We use the k-means algorithm to generate the landmark points [31]. This method has shown a low approximation error compared to the standard method, which selects the landmark points based on uniform sampling without replacement from the input space. The complexity of the k-means algorithm is linear, \(\mathcal{O}(nvd)\), while the complexity of singular value decomposition or eigenvalue decomposition is \(\mathcal{O}(v^{3})\). Therefore, the complexity of the k-means Nyström approximation is linear in the number of input instances.
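The embedding pipeline described in this subsection can be sketched as follows. This is an illustrative sketch only, assuming NumPy, a Gaussian kernel, and a plain Lloyd-style k-means written inline; the function names, the random data, and the values of `gamma`, `v`, and `r` are hypothetical choices, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    """Gaussian kernel matrix with entries exp(-gamma * ||x - y||^2)."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kmeans_landmarks(X, v, n_iter=20, seed=0):
    """Plain Lloyd iterations; returns v cluster centres used as landmarks."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=v, replace=False)].copy()
    for _ in range(n_iter):
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(v):
            members = X[labels == k]
            if len(members) > 0:            # keep the old centre if a cluster empties
                centres[k] = members.mean(0)
    return centres

def nystrom_embedding(X, landmarks, gamma, r):
    """Map each row x of X to Sigma_r^{-1/2} U_r^T phi(x)^T (an r-dimensional vector)."""
    W = gaussian_kernel(landmarks, landmarks, gamma)
    eigvals, eigvecs = np.linalg.eigh(W)    # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:r]     # indices of the r largest eigenvalues
    sigma_r = np.maximum(eigvals[top], 1e-12)   # guard against tiny eigenvalues
    U_r = eigvecs[:, top]
    E = gaussian_kernel(X, landmarks, gamma)    # n x v cross-kernel matrix
    return E @ U_r / np.sqrt(sigma_r)           # n x r embedded data

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))               # synthetic data, for illustration only
landmarks = kmeans_landmarks(X, v=20)
Z = nystrom_embedding(X, landmarks, gamma=0.1, r=20)

# With r = v, Z Z^T reproduces E W^{-1} E^T up to numerical error.
approx = Z @ Z.T
exact = gaussian_kernel(X, X, gamma=0.1)
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
```

After the embedding, any linear learner can be trained on the rows of `Z`; the relative error `rel_err` gives a rough sense of how well the landmarks capture the full kernel matrix.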
4 Nonlinear AUC Maximization
In this section, we present the two nonlinear algorithms that maximize the AUC metric over a finite-dimensional feature space constructed using the k-means Nyström approximation [31]. First, we solve the pairwise squared hinge loss function in a batch learning mode using the truncated Newton solver [4]. For the second method, we present a stochastic learning algorithm that minimizes the pairwise hinge loss function.
The main steps of the proposed nonlinear AUC maximization methods are shown in Algorithm 1. In the embedding steps, we construct the nonlinear mapping (embedding) based on a given kernel function and landmark points. The landmark points are computed by the k-means clustering algorithm applied to the input space. Once the landmark points are obtained, the matrix W and its decomposition are computed. The original input space is then mapped nonlinearly to a finite-dimensional feature space in which the nonlinear problem can be solved using linear machines.
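The AUC optimization (3) can be solved for w in the embedded space as follows,
\[ \min_{w} \; \frac{1}{2}\Vert w\Vert^{2} + C \sum_{i=1}^{n^{+}} \sum_{j=1}^{n^{-}} \max\big(0,\, 1 - w^{T}(\varphi(x^{+}_{i}) - \varphi(x^{-}_{j}))\big)^{p} \tag{5} \]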
where \(\varphi(x)\) is a nonlinear feature mapping for x. The minimization of (5) can be solved using truncated Newton methods [4] as shown in Algorithm 2. The matrix A in Algorithm 2 is a sparse matrix of size \(r \times n\), where r here denotes the number of pairs. The matrix A holds all possible pairs, and each row of A has only two nonzero values. That is, for each pair \((i,j)\) with \(y_{i} > y_{j}\), the matrix A has a kth row such that \(A_{ki}=1, A_{kj}=-1\). However, the complexity of this Newton batch learning is dependent on the number of pairs. The authors of [4] also proposed the PRSVM+ algorithm, which avoids the direct computation of pairs by reformulating the pairwise loss function in such a way that the calculations of the gradient and the Hessian-vector product are accelerated.
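To make the structure of the pair matrix A concrete, the sketch below builds it for a tiny label vector. It is only an illustration under stated assumptions: a dense NumPy array stands in for the sparse matrix, and the function name `pair_matrix`, the labels, and the scores are hypothetical.

```python
import numpy as np

def pair_matrix(y):
    """Dense stand-in for the sparse pair matrix: one row per pair (i, j) with y_i > y_j."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == -1)
    A = np.zeros((len(pos) * len(neg), len(y)))
    k = 0
    for i in pos:
        for j in neg:
            A[k, i] = 1.0     # positive instance of the pair
            A[k, j] = -1.0    # negative instance of the pair
            k += 1
    return A

y = np.array([1, -1, 1, -1, -1])
scores = np.array([0.9, 0.2, 0.4, 0.5, 0.1])   # hypothetical classifier outputs
A = pair_matrix(y)
margins = A @ scores          # f(x_i) - f(x_j) for every positive/negative pair
```

Multiplying A by the score vector yields all pairwise margins at once, which is why the number of rows (pairs) drives the cost of the naive Newton solver.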
Nevertheless, the optimization of PRSVM+ to maximize the AUC metric still requires \(O(n\hat{d}+2n+\hat{d})\) operations to compute each of the gradient and the Hessian-vector product in each iteration, where \(\hat{d}\) is the dimension of the embedded space. This makes the training of PRSVM+ expensive for massive datasets embedded using a large number of landmark points. A large set of landmark points is desirable to improve the approximation of the feature maps, hence boosting the generalization ability of the involved classifier.
To address this complexity, we present a first-order stochastic method to maximize the AUC metric on the embedded space. Specifically, we optimize a pairwise hinge loss function using stochastic gradient descent accelerated by scheduling both the regularization update and averaging techniques. The proposed stochastic algorithm can be seen as an averaging variant of the SVMSGD2 method proposed in [3]. Algorithm 3 describes the proposed stochastic AUC maximization method. The algorithm randomly selects a positive and negative instance and updates the model in each iteration as follows,
\[ w_{t+1} = w_{t} + \frac{1}{\lambda(t+t_{0})}\, \ell^{\prime}(w_{t}^{T}x_{t})\, x_{t} \]
where \(\ell^{\prime}(z) = \max(0,1-z)\) is a hinge loss function, the vector \(x_{t}\) holds the difference \(\varphi(x^{+}_{i}) - \varphi(x^{-}_{j})\), \(w_{t}\) is the solution after t iterations, and \(\frac{1}{\lambda(t+t_{0})}\) is the learning rate, which decreases in each iteration. The hyperparameter \(\lambda\) can be tuned on a validation set. The positive constant \(t_{0}\) is set experimentally, and it is utilized to prevent large steps in the first few iterations [3]. The model is regularized every rskip iterations to accelerate its convergence. We also foster the acceleration of the model by implementing an averaging technique [23, 29]. The intuitive idea behind the averaging step is to reduce the variance of the model that stems from its stochastic nature. We regulate the regularization update and averaging steps to be performed every rskip and askip iterations, respectively, as follows,
\[ w_{t+1} = w_{t+1} - \frac{rskip}{t+t_{0}}\, w_{t+1}, \qquad \tilde{w}_{q+1} = \frac{q\,\tilde{w}_{q} + w_{t+1}}{q+1} \]
where \(\tilde{w}\) is the averaged solution after q iterations with respect to the askip. The advantage of regulating the averaging step is to reduce the per iteration complexity, while effectively accelerating the convergence.
The presented first-order stochastic AUC maximization requires \(\mathcal{O}(\hat{d}a)\) operations per iteration in addition to the \(\mathcal{O}(\hat{d})\) operations needed for each of the regularization update and averaging steps that occur per rskip and askip iterations respectively, where a denotes the average number of nonzero coordinates in the embedded difference vector \(x_{t}\).
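The stochastic procedure described above can be sketched in a few lines. This is a rough sketch rather than the authors' Algorithm 3: it assumes NumPy, uses the standard hinge subgradient for the pairwise step, borrows the scheduled shrinkage of SVMSGD2 [3] for the regularization update, and keeps a running mean for the averaging step. The function name `nsauc_sketch`, the default hyperparameter values, and the synthetic data are all hypothetical.

```python
import numpy as np

def nsauc_sketch(Z_pos, Z_neg, lam=1e-3, t0=1000, rskip=16, askip=16,
                 n_iter=20000, seed=0):
    """Pairwise hinge-loss SGD with scheduled regularization and averaging."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Z_pos.shape[1])
    w_avg = np.zeros_like(w)
    q = 0                                       # number of averaging steps so far
    for t in range(1, n_iter + 1):
        # Difference vector of one random positive/negative pair.
        x_t = Z_pos[rng.integers(len(Z_pos))] - Z_neg[rng.integers(len(Z_neg))]
        eta = 1.0 / (lam * (t + t0))            # decreasing learning rate
        if w @ x_t < 1.0:                       # hinge subgradient is active
            w += eta * x_t
        if t % rskip == 0:                      # scheduled regularization update
            w *= 1.0 - rskip / (t + t0)
        if t % askip == 0:                      # scheduled running average
            w_avg = (q * w_avg + w) / (q + 1)
            q += 1
    return w_avg if q > 0 else w

# Synthetic stand-in for embedded data (illustrative assumption).
rng = np.random.default_rng(1)
Z_pos = rng.normal(loc=1.0, size=(300, 2))
Z_neg = rng.normal(loc=-1.0, size=(600, 2))

w = nsauc_sketch(Z_pos, Z_neg)
auc = np.mean((Z_pos @ w)[:, None] > (Z_neg @ w)[None, :])
```

Performing the shrinkage and the averaging only every `rskip` and `askip` iterations keeps the common-case iteration down to one sparse difference and one dot product, which is the per-iteration saving the complexity analysis refers to.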
5 Experiments
In this section, we evaluate the proposed methods on several benchmark datasets and compare them with a kernelized AUC algorithm and other state-of-the-art online AUC maximization algorithms. The experiments are implemented in MATLAB, while the learning algorithms are written in C via MEX files. The experiments were performed on a computer equipped with an Intel 4 GHz processor and 32 GB of RAM.
5.1 Benchmark Datasets
The datasets we use in our experiments can be downloaded from the LibSVM website or UCI. For the datasets that are not already split into training and test sets (i.e., spambase, magic04, connect4, skin, and covtype), we randomly divide them into 80\(\%\)–20\(\%\) for training and testing. The features of each dataset are standardized to have zero mean and unit variance. The multi-class datasets (e.g., covtype and usps) are converted into class-imbalanced binary data by grouping the instances into two sets, where each set has the same number of class labels. To speed up the experiments that include the kernelized AUC algorithm, we train all the compared methods on 80k instances, randomly selected from the training set. The other experiments are performed on the entire training data. The characteristics of the datasets along with their imbalance ratios are shown in Table 1.
5.2 Compared Methods and Model Selection
We compare the proposed methods with kernel RankSVM and linear RankSVM, which can be used to solve the AUC maximization problem. We also include two state-of-the-art online AUC maximization algorithms. In addition, we include a random Fourier feature method that approximates the kernel function, with the resulting classifier trained by linear RankSVM.

1. RBFRankSVM: This is the nonlinear kernel RankSVM [20]. We use the Gaussian kernel \(K(x,y) = \exp(-\gamma \Vert x-y\Vert^{2})\) to model the nonlinearity of the data. The best width of the kernel \(\gamma\) is chosen by 3-fold cross validation on the training set via searching in \(\{2^{-6},\dots,2^{-1}\}\). The regularization hyperparameter C is also tuned by 3-fold cross validation by searching in the grid \(\{2^{-5},\dots,2^{5}\}\). The searching grids are selected based on [20]. We also train the RBFRankSVM on 1/5 subsamples, selected randomly.

2. Linear RankSVM (PRSVM+): This is the linear RankSVM that optimizes the squared hinge loss function using truncated Newton [4]. The best regularization hyperparameter C is chosen from the grid \(\{2^{-15},\dots,2^{10}\}\) via 3-fold cross validation.

3. RFAUC: This uses the random Fourier features [24] to approximate the kernel function. We use PRSVM+ to solve the AUC maximization problem on the projected space. The hyperparameters C and \(\gamma\) are selected via 3-fold cross validation by searching on the grids \(\{2^{-15},\dots,2^{10}\}\) and \(\{1,10,100\}\), respectively.

4. NOAM: This is the sequential variant of online AUC maximization [32] trained on a feature space constructed via the k-means Nyström approximation. The hyperparameters are chosen as suggested by [32] via 3-fold cross validation. The sizes of the positive and negative buffers are set to 100.

5. NSOLAM: This is the stochastic online AUC maximization [30] trained on a feature space constructed via the k-means Nyström approximation. The hyperparameters of the algorithm (i.e., the learning rate and the bound on the weight vector) are selected via 3-fold cross validation by searching in the grids \(\{1:9:100\}\) and \(\{10^{-1},\dots,10^{5}\}\), respectively. The number of epochs is set to 15.

6. NBAUC: This is the proposed batch AUC maximization algorithm trained on the embedded space. We solve it using the PRSVM+ algorithm [4]. The hyperparameter C is tuned in the same way as for the linear RankSVM (PRSVM+).

7. NSAUC: This is the proposed stochastic AUC maximization algorithm trained on the embedded space. The hyperparameter \(\lambda\) is chosen from the grid \(\{10^{-10},\dots,10^{-7}\}\) via 3-fold cross validation.
For those algorithms that involve the k-means Nyström approximation (i.e., our proposed methods, NOAM, and NSOLAM), we compute 1600 landmark points using the k-means clustering algorithm, which is implemented in C. We select a Gaussian kernel function to be used with the k-means Nyström approximation. The bandwidth of the Gaussian function is set to be the average squared distance between the first 80k instances and the mean computed over these 80k instances. For a fair comparison, we also set the number of random Fourier features to 1600.
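The bandwidth heuristic above amounts to a single pass over the data. A minimal sketch, assuming NumPy; the function name and the toy array are hypothetical, and the way the bandwidth enters the kernel (as a divisor of the squared distance) is an assumption, since the text does not spell it out.

```python
import numpy as np

def average_squared_distance_to_mean(X, m=80000):
    """Average squared Euclidean distance between the first m rows and their mean."""
    head = X[:m]
    centre = head.mean(axis=0)
    return float(((head - centre) ** 2).sum(axis=1).mean())

# Toy data (illustrative): four corners of a square with side 2.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
bandwidth = average_squared_distance_to_mean(X)

# Assumption: the kernel is exp(-||x - y||^2 / bandwidth), i.e. gamma = 1 / bandwidth.
gamma = 1.0 / bandwidth
```

For the toy square every point lies at squared distance 2 from the centre, so the heuristic returns a bandwidth of 2.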
5.3 Results for Batch Methods
The comparison of batch AUC maximization methods in terms of AUC classification accuracy on the test set is shown in Table 2, while Table 3 compares these batch methods in terms of training time. For the connect4 dataset, the results of RBFRankSVM are not reported because the training ran for more than five days.
We observe that the proposed NBAUC outperforms the competing batch methods in terms of AUC classification accuracy. The AUC performance of RBFRankSVM might be improved for some datasets if the best hyperparameters are selected on a more restricted grid of values. Nevertheless, the training of NBAUC is several orders of magnitude faster than RBFRankSVM. The fast training of NBAUC is clearly demonstrated on the large datasets.
The proposed NBAUC shows a robust AUC performance compared to RFAUC on most datasets. This can be attributed to the robust capability of the k-means Nyström method in approximating complex nonlinear structures. It also indicates that a better generalization can be attained by capitalizing on the data to construct the feature maps, which is the main characteristic of the Nyström approximation, while the random Fourier features are oblivious to the data.
We also observe that both RBFRankSVM and its variant applied to random subsamples outperform the linear RankSVM in terms of AUC, except on the protein dataset. However, RBFRankSVM methods require longer training, especially for large datasets. We see that the linear RankSVM performs better than the kernel AUC machines on the protein dataset. This suggests that the protein dataset is close to linearly separable. However, the AUC performance of the proposed method NBAUC is even better than linear RankSVM on this dataset.
5.4 Results for Stochastic Methods
We now compare our stochastic algorithm NSAUC with the state-of-the-art online AUC maximization methods, NOAM and NSOLAM. We also include the results of the proposed batch algorithm NBAUC for reference. The k-means Nyström approximation is implemented separately for each algorithm as introduced in Sect. 4. We experiment on the following large datasets: ijcnn1, connect4, acoustic, skin, codrna, and covtype. Table 4 shows the comparison of the proposed methods with the online AUC maximization algorithms. Notice that the training time reported in Table 4 covers only the learning steps and excludes the embedding steps.
We can see that the proposed NSAUC achieves a competitive AUC performance compared to the proposed NBAUC, but with less training time. On the largest dataset covtype, the AUC performance of NSAUC is on par with NBAUC, while it only requires 49.17 s for training compared to more than 18 min required by NBAUC. In contrast to the online methods, the proposed NSAUC is able to converge to the optimal solution obtained by the batch method NBAUC. We attribute the robust performance of NSAUC to the effectiveness of scheduling both the regularization update and averaging.
We observe that the proposed NSAUC requires longer training time on some datasets (e.g., connect4 and acoustic) compared to the online methods; however, the difference in the training time is not significant. In addition, we see that NSOLAM performs better than NOAM in terms of AUC classification accuracy. This implies the advantage of optimizing the pairwise squared hinge loss function, performed by NSOLAM, over the pairwise hinge loss function, carried out by NOAM, for one-pass AUC maximization.
5.5 Study on the Convergence Rate
We investigate the convergence of NSAUC and its counterpart NSOLAM with respect to the number of epochs. We also include the NSVMSGD2 algorithm [3], which minimizes the pairwise hinge loss function on a feature space constructed via the k-means Nyström approximation, described in Sect. 4. The algorithm NSVMSGD2 is analogous to the proposed algorithm NSAUC, but with no averaging step. The AUC performances of these stochastic methods upon varying the number of epochs are depicted in Fig. 1. We vary the number of epochs according to the grid \(\{1,2,3,4,5,10,20,50,100,200,300,400\}\), and run the stochastic algorithms using the same setup described in the previous subsection. In all subfigures, the x-axis represents the number of epochs, while the y-axis is the AUC classification accuracy on the test data.
The results show that the proposed NSAUC converges to the optimal solution on all datasets. We can also see that NSAUC outperforms its non-averaging variant NSVMSGD2 in terms of AUC on four datasets (i.e., ijcnn1, codrna, acoustic, and connect4), while its training time is on par with that of NSVMSGD2. This indicates the effectiveness of incorporating the scheduled averaging technique. Furthermore, the AUC performance of NSAUC does not fluctuate as the number of epochs varies on any dataset. This implies that choosing the best number of epochs would be easy.
In addition, we can observe that the AUC performance of NSOLAM does not show significant improvement after the first epoch. The reason is that NSOLAM reaches a local minimum (i.e., a saddle point) in a single pass and gets stuck there.
6 Conclusion and Future Work
In this paper, we have proposed scalable batch and stochastic nonlinear AUC maximization algorithms. The proposed algorithms optimize linear classifiers on a finite-dimensional feature space constructed via the k-means Nyström approximation. We solve the proposed batch AUC maximization algorithm using truncated Newton optimization, which minimizes the pairwise squared hinge loss function. The proposed stochastic AUC maximization algorithm is solved using a first-order gradient descent that implements scheduled regularization update and scheduled averaging to accelerate the convergence of the classifier. We show via experiments on several benchmark datasets that the proposed AUC maximization algorithms are more efficient than the nonlinear kernel AUC machines, while their AUC performances are comparable to or even better than those of the nonlinear kernel AUC machines. Moreover, we show experimentally that the proposed stochastic AUC maximization algorithm outperforms the state-of-the-art online AUC maximization methods in terms of AUC classification accuracy with a marginal increase in the training time for some datasets. We demonstrate empirically that the proposed stochastic AUC algorithm converges to the optimal solution in a few epochs, while other online AUC maximization algorithms are susceptible to suboptimal convergence. In the future, we plan to use the proposed algorithms in solving large-scale multiple-instance learning.
References
1. Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., Roth, D.: Generalization bounds for the area under the ROC curve. J. Mach. Learn. Res. 6(Apr), 393–425 (2005)
2. Airola, A., Pahikkala, T., Salakoski, T.: Training linear ranking SVMs in linearithmic time using red-black trees. Pattern Recogn. Lett. 32(9), 1328–1336 (2011)
3. Bordes, A., Bottou, L., Gallinari, P.: SGD-QN: careful quasi-Newton stochastic gradient descent. J. Mach. Learn. Res. 10(Jul), 1737–1754 (2009)
4. Chapelle, O., Keerthi, S.S.: Efficient algorithms for ranking with SVMs. Inf. Retrieval 13(3), 201–215 (2010)
5. Chaudhuri, S., Theocharous, G., Ghavamzadeh, M.: Recommending advertisements using ranking functions. US Patent App. 14/997,987, 18 Jan 2016
6. Chen, K., Li, R., Dou, Y., Liang, Z., Lv, Q.: Ranking support vector machine with kernel approximation. Comput. Intell. Neurosci. 2017, 4629534 (2017)
7. Cortes, C., Mohri, M.: AUC optimization vs. error rate minimization. Adv. Neural Inf. Process. Syst. 16(16), 313–320 (2004)
8. Ding, Y., Liu, C., Zhao, P., Hoi, S.C.: Large scale kernel methods for online AUC maximization. In: 2017 IEEE International Conference on Data Mining (ICDM), pp. 91–100. IEEE (2017)
9. Ding, Y., Zhao, P., Hoi, S.C., Ong, Y.S.: An adaptive gradient method for online AUC maximization. In: AAAI, pp. 2568–2574 (2015)
10. Gao, W., Jin, R., Zhu, S., Zhou, Z.H.: One-pass AUC optimization. In: ICML, vol. 3, pp. 906–914 (2013)
11. Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1), 29–36 (1982)
12. Hu, J., Yang, H., King, I., Lyu, M.R., So, A.M.C.: Kernelized online imbalanced learning with fixed budgets. In: AAAI, pp. 2666–2672 (2015)
13. Joachims, T.: A support vector method for multivariate performance measures. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 377–384. ACM (2005)
14. Joachims, T.: Training linear SVMs in linear time. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 217–226. ACM (2006)
15. Kakkar, V., Shevade, S., Sundararajan, S., Garg, D.: A sparse nonlinear classifier design using AUC optimization. In: Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 291–299. SIAM (2017)
16. Keerthi, S.S., Chapelle, O., DeCoste, D.: Building support vector machines with reduced classifier complexity. J. Mach. Learn. Res. 7(Jul), 1493–1515 (2006)
17. Khalid, M., Ray, I., Chitsaz, H.: Confidence-weighted bipartite ranking. In: Li, J., Li, X., Wang, S., Li, J., Sheng, Q.Z. (eds.) ADMA 2016. LNCS (LNAI), vol. 10086, pp. 35–49. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49586-6_3
18. Kotlowski, W., Dembczynski, K.J., Huellermeier, E.: Bipartite ranking through minimization of univariate loss. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1113–1120 (2011)
19. Kumar, S., Mohri, M., Talwalkar, A.: Ensemble Nystrom method. In: Advances in Neural Information Processing Systems, pp. 1060–1068 (2009)
20. Kuo, T.M., Lee, C.P., Lin, C.J.: Large-scale kernel RankSVM. In: Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 812–820. SIAM (2014)
21. Lee, C.P., Lin, C.J.: Large-scale linear RankSVM. Neural Comput. 26(4), 781–817 (2014)
22. Liu, T.Y.: Learning to rank for information retrieval. Found. Trends Inf. Retrieval 3(3), 225–331 (2009)
23. Polyak, B.T., Juditsky, A.B.: Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4), 838–855 (1992)
24. Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, pp. 1177–1184 (2008)
25. Rendle, S., Balby Marinho, L., Nanopoulos, A., Schmidt-Thieme, L.: Learning optimal ranking with tensor factorization for tag recommendation. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 727–736. ACM (2009)
26. Root, J., Qian, J., Saligrama, V.: Learning efficient anomaly detectors from KNN graphs. In: Artificial Intelligence and Statistics, pp. 790–799 (2015)
27. Sculley, D.: Large scale learning to rank. In: NIPS Workshop on Advances in Ranking, pp. 58–63 (2009)
28. Szörényi, B., Cohen, S., Mannor, S.: Nonparametric online AUC maximization. In: Ceci, M., Hollmén, J., Todorovski, L., Vens, C., Džeroski, S. (eds.) ECML PKDD 2017. LNCS (LNAI), vol. 10535, pp. 575–590. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-71246-8_35
29. Xu, W.: Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv preprint arXiv:1107.2490 (2011)
30. Ying, Y., Wen, L., Lyu, S.: Stochastic online AUC maximization. In: Advances in Neural Information Processing Systems, pp. 451–459 (2016)
31. Zhang, K., Tsang, I.W., Kwok, J.T.: Improved Nyström low-rank approximation and error analysis. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1232–1239. ACM (2008)
32. Zhao, P., Jin, R., Yang, T., Hoi, S.C.: Online AUC maximization. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 233–240 (2011)
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Khalid, M., Ray, I., Chitsaz, H. (2019). Scalable Nonlinear AUC Maximization Methods. In: Berlingerio, M., Bonchi, F., Gärtner, T., Hurley, N., Ifrim, G. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2018. Lecture Notes in Computer Science, vol 11052. Springer, Cham. https://doi.org/10.1007/978-3-030-10928-8_18
DOI: https://doi.org/10.1007/978-3-030-10928-8_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-10927-1
Online ISBN: 978-3-030-10928-8
eBook Packages: Computer Science, Computer Science (R0)