Scalable Nonlinear AUC Maximization Methods
Abstract
The area under the ROC curve (AUC) is a widely used measure for evaluating classification performance on heavily imbalanced data. Kernelized AUC maximization machines generalize better than linear AUC machines because they can model the complex nonlinear structures underlying most real-world data. However, their high training complexity makes them infeasible for large-scale data. In this paper, we present two nonlinear AUC maximization algorithms that optimize linear classifiers over a finite-dimensional feature space constructed via the k-means Nyström approximation. Our first algorithm maximizes the AUC metric by optimizing a pairwise squared hinge loss function using the truncated Newton method. However, this second-order batch method becomes expensive for very large datasets. This motivates us to develop a first-order stochastic AUC maximization algorithm that incorporates a scheduled regularization update and scheduled averaging to accelerate the convergence of the classifier. Experiments on several benchmark datasets demonstrate that the proposed AUC classifiers are more efficient than kernelized AUC machines while matching or surpassing their AUC performance. We also show experimentally that the proposed stochastic AUC classifier reaches the optimal solution, whereas other state-of-the-art online and stochastic AUC maximization methods are prone to suboptimal convergence. Code related to this paper is available at: https://sites.google.com/view/majdikhalid/.
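To make the pairwise objective in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It computes the AUC as the fraction of correctly ranked positive-negative pairs, evaluates a regularized pairwise squared hinge loss for a linear scorer, and runs a plain averaged stochastic gradient loop on randomly sampled pairs. The function names (`auc`, `pairwise_squared_hinge`, `averaged_sgd`), the hyperparameters, and the toy data are all illustrative assumptions. The input vectors stand in for features that would, in the paper's setting, come from the k-means Nyström embedding, and the sketch regularizes and averages at every step rather than on the schedules the abstract mentions, since those schedules are not specified here.

```python
import random


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    correct = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                correct += 1.0
            elif sp == sn:
                correct += 0.5
    return correct / (len(scores_pos) * len(scores_neg))


def pairwise_squared_hinge(w, pos, neg, lam):
    """lam/2 * ||w||^2 + mean over pairs of max(0, 1 - w.(xp - xn))^2."""
    total = 0.0
    for xp in pos:
        for xn in neg:
            margin = 1.0 - (dot(w, xp) - dot(w, xn))
            if margin > 0.0:
                total += margin * margin
    return 0.5 * lam * dot(w, w) + total / (len(pos) * len(neg))


def averaged_sgd(pos, neg, lam=1e-3, lr=0.05, steps=2000, seed=0):
    """Stochastic gradient descent on sampled pairs, returning the averaged iterate.

    Assumption: regularization and averaging are applied at every step;
    the paper's scheduled variants are not reproduced here.
    """
    rng = random.Random(seed)
    d = len(pos[0])
    w = [0.0] * d
    w_avg = [0.0] * d
    for t in range(1, steps + 1):
        xp, xn = rng.choice(pos), rng.choice(neg)
        diff = [a - b for a, b in zip(xp, xn)]
        margin = 1.0 - dot(w, diff)
        # Gradient of lam/2*||w||^2 + max(0, margin)^2 is lam*w - 2*margin*diff.
        coef = 2.0 * margin if margin > 0.0 else 0.0
        w = [wi - lr * (lam * wi - coef * di) for wi, di in zip(w, diff)]
        # Running mean of the iterates.
        w_avg = [a + (wi - a) / t for a, wi in zip(w_avg, w)]
    return w_avg


# Toy, linearly separable data (hypothetical stand-in for Nystrom features).
POS = [[2.0, 0.1], [1.5, -0.2], [2.5, 0.3]]
NEG = [[0.0, 0.2], [0.5, -0.1], [-0.5, 0.0]]

w_hat = averaged_sgd(POS, NEG)
train_auc = auc([dot(w_hat, x) for x in POS], [dot(w_hat, x) for x in NEG])
```

The squared hinge is differentiable, which is what makes both a Newton-type batch solver and a simple gradient step applicable to the same objective; the sketch only exercises the first-order route.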