Scalable Nonlinear AUC Maximization Methods

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11052)


The area under the ROC curve (AUC) is a widely used measure of classification performance on heavily imbalanced data. Kernelized AUC maximization machines generalize better than linear AUC machines because they can model the complex nonlinear structure underlying most real-world data. However, their high training complexity makes kernelized AUC machines infeasible for large-scale data. In this paper, we present two nonlinear AUC maximization algorithms that optimize linear classifiers over a finite-dimensional feature space constructed via the k-means Nyström approximation. Our first algorithm maximizes AUC by minimizing a pairwise squared hinge loss with the truncated Newton method. Because this second-order batch method becomes expensive on extremely large datasets, we also develop a first-order stochastic AUC maximization algorithm that incorporates a scheduled regularization update and scheduled averaging to accelerate convergence. Experiments on several benchmark datasets demonstrate that the proposed classifiers are more efficient than kernelized AUC machines while matching or surpassing their AUC performance. We also show experimentally that the proposed stochastic classifier reaches the optimal solution, whereas other state-of-the-art online and stochastic AUC maximization methods are prone to suboptimal convergence. Code related to this paper is available at:
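The abstract names a pairwise squared hinge surrogate for AUC. A standard form of that objective (our notation, not copied from the paper, with φ the Nyström feature map, P and N the positive and negative index sets, and λ a regularization weight) is:

    \min_{w}\; \frac{\lambda}{2}\,\lVert w\rVert^{2}
      + \frac{1}{|P|\,|N|} \sum_{i \in P} \sum_{j \in N}
        \max\bigl(0,\; 1 - w^{\top}\bigl(\phi(x_i) - \phi(x_j)\bigr)\bigr)^{2}

Driving every pairwise margin above 1 pushes positive scores above negative scores, which is exactly what AUC counts, and the squared hinge keeps the objective differentiable, which is what makes a (truncated) Newton method applicable.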

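For concreteness, here is a minimal sketch of how a k-means Nyström feature map can be built. The function name kmeans_nystrom_features, the RBF kernel choice, and all parameter values are our illustrative assumptions, not the authors' code:

    import numpy as np
    from scipy.spatial.distance import cdist
    from sklearn.cluster import KMeans

    def kmeans_nystrom_features(X, n_landmarks=200, gamma=1.0, seed=0):
        """Finite-dimensional feature map via the k-means Nystrom
        approximation of an RBF kernel (illustrative sketch)."""
        # 1. Landmarks = k-means centroids rather than uniformly
        #    sampled points (the 'k-means Nystrom' idea).
        km = KMeans(n_clusters=n_landmarks, n_init=10, random_state=seed).fit(X)
        L = km.cluster_centers_

        # 2. Kernel blocks: W between landmarks, C between data and landmarks.
        W = np.exp(-gamma * cdist(L, L, 'sqeuclidean'))
        C = np.exp(-gamma * cdist(X, L, 'sqeuclidean'))

        # 3. phi(X) = C W^{-1/2}, computed via an eigendecomposition of W;
        #    clipping guards against numerically negative eigenvalues.
        vals, vecs = np.linalg.eigh(W)
        vals = np.maximum(vals, 1e-12)
        return C @ (vecs * vals ** -0.5) @ vecs.T   # (n_samples, n_landmarks)

A linear AUC classifier trained on these features approximates its kernelized counterpart at a fraction of the cost, since the feature dimension equals the number of landmarks rather than the number of training points.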

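The first-order method can be sketched in the same way. The pair-sampling scheme, step-size schedule, and the step at which averaging begins are our assumptions for illustration; the paper's scheduled regularization update and scheduled averaging may differ in detail:

    import numpy as np

    def stochastic_auc_sgd(Phi_pos, Phi_neg, lam=1e-4, n_steps=100000,
                           eta0=0.1, avg_start=1000, seed=0):
        """First-order stochastic AUC maximization over precomputed
        features, with Polyak-style iterate averaging (illustrative sketch)."""
        rng = np.random.default_rng(seed)
        w = np.zeros(Phi_pos.shape[1])
        w_avg, n_avg = np.zeros_like(w), 0
        for t in range(1, n_steps + 1):
            # Sample one positive/negative pair per step.
            diff = (Phi_pos[rng.integers(len(Phi_pos))]
                    - Phi_neg[rng.integers(len(Phi_neg))])
            margin = w @ diff
            eta = eta0 / (1.0 + lam * eta0 * t)      # decaying step size
            grad = lam * w                           # L2 regularization term
            if margin < 1.0:                         # pair violates the margin
                grad -= 2.0 * (1.0 - margin) * diff  # squared-hinge gradient
            w -= eta * grad
            if t >= avg_start:                       # averaging starts late
                n_avg += 1
                w_avg += (w - w_avg) / n_avg         # running mean of iterates
        return w_avg if n_avg else w

Returning the averaged iterate rather than the last one is the Polyak–Ruppert idea behind scheduled averaging; starting the average only at avg_start keeps the noisy early iterates from polluting the estimate.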

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Computer Science Department, Colorado State University, Fort Collins, USA
