Machine Learning, Volume 107, Issue 8–10, pp. 1597–1620

Optimizing non-decomposable measures with deep networks

  • Amartya Sanyal
  • Pawan Kumar
  • Purushottam Kar
  • Sanjay Chawla
  • Fabrizio Sebastiani
Part of the following topical collections:
  1. Special Issue of the ECML PKDD 2018 Journal Track


Abstract

We present a class of algorithms capable of directly training deep neural networks with respect to popular families of task-specific performance measures for binary classification, such as the F-measure, QMean and the Kullback–Leibler divergence, that are structured and non-decomposable. Our goal is to address tasks such as label-imbalanced learning and quantification. Our techniques present a departure from standard deep learning techniques that typically use squared or cross-entropy loss functions (which are decomposable) to train neural networks. We demonstrate that directly training with task-specific loss functions yields faster and more stable convergence across problems and datasets. Our proposed algorithms and implementations offer several advantages, including (i) the use of fewer training samples to achieve a desired level of convergence, (ii) a substantial reduction in training time, (iii) a seamless integration of our implementation into existing symbolic gradient frameworks, and (iv) assurance of convergence to first-order stationary points. It is noteworthy that the algorithms achieve this, especially point (iv), despite being asked to optimize complex objective functions. We implement our techniques on a variety of deep architectures, including multi-layer perceptrons and recurrent neural networks, and show that on a variety of benchmark and real data sets, our algorithms outperform traditional approaches to training deep networks, as well as popular techniques used to handle label imbalance.
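The paper's own training algorithms are not reproduced in this page, but the core idea the abstract describes, replacing a decomposable loss such as cross-entropy with a differentiable surrogate for a task-specific measure, can be sketched as follows. The "soft count" relaxation of F1 and the binary KL quantification loss below are illustrative assumptions on my part (the function names are invented), not the authors' exact construction:

```python
import numpy as np

def soft_f1_loss(probs, labels, eps=1e-8):
    # F1 is non-decomposable: it cannot be written as a sum of per-example
    # losses. A common relaxation replaces hard 0/1 predictions with the
    # network's probabilities, giving "soft" confusion-matrix counts that
    # are differentiable in the model outputs.
    tp = np.sum(probs * labels)          # soft true positives
    fp = np.sum(probs * (1.0 - labels))  # soft false positives
    fn = np.sum((1.0 - probs) * labels)  # soft false negatives
    soft_f1 = 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - soft_f1                 # minimize 1 - F1

def kld_quantification_loss(probs, labels, eps=1e-8):
    # Quantification cares about class prevalence, not per-example accuracy:
    # compare the true positive-class prevalence with the prevalence
    # estimated from the soft predictions, via binary KL divergence.
    p_true = np.mean(labels)
    p_pred = np.clip(np.mean(probs), eps, 1.0 - eps)
    return (p_true * np.log((p_true + eps) / p_pred)
            + (1.0 - p_true) * np.log((1.0 - p_true + eps) / (1.0 - p_pred)))
```

Because both losses are smooth functions of the whole batch of probabilities, they can be dropped into any symbolic gradient framework (point (iii) of the abstract) in place of a per-example loss; confident correct predictions drive `soft_f1_loss` toward 0, and matching prevalences drive the KL loss toward 0.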


Keywords: Optimization · Deep learning · F-measure · Task-specific training



Acknowledgements

A.S. did this work while a student at IIT Kanpur and acknowledges support from The Alan Turing Institute under the Turing Doctoral Studentship grant TU/C/000023. P. Kar is supported by the Deep Singh and Daljeet Kaur Faculty Fellowship and the Research-I Foundation at IIT Kanpur, and thanks Microsoft Research India and Tower Research for research grants.



Copyright information

© The Author(s) 2018

Authors and Affiliations

  1. The University of Oxford, Oxford, UK
  2. The Alan Turing Institute, London, UK
  3. Indian Institute of Technology Kanpur, Kanpur, India
  4. Qatar Computing Research Institute, Doha, Qatar
  5. Istituto di Scienza e Tecnologia dell’Informazione, Pisa, Italy