# Optimizing non-decomposable measures with deep networks

## Abstract

We present a class of algorithms that can directly train deep neural networks with respect to popular families of task-specific performance measures for binary classification, such as the F-measure, QMean, and the Kullback–Leibler divergence, which are structured and non-decomposable. Our goal is to address tasks such as label-imbalanced learning and quantification. Our techniques depart from standard deep learning practice, which typically trains neural networks with decomposable loss functions such as the squared or cross-entropy loss. We demonstrate that training directly with task-specific loss functions yields faster and more stable convergence across problems and datasets. Our proposed algorithms and implementations offer several advantages, including (i) the use of fewer training samples to achieve a desired level of convergence, (ii) a substantial reduction in training time, (iii) seamless integration into existing symbolic gradient frameworks, and (iv) guaranteed convergence to first-order stationary points. Notably, the algorithms achieve this, in particular point (iv), despite optimizing complex objective functions. We implement our techniques on a variety of deep architectures, including multi-layer perceptrons and recurrent neural networks, and show on a variety of benchmark and real datasets that our algorithms outperform traditional approaches to training deep networks, as well as popular techniques used to handle label imbalance.
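The central difficulty the abstract alludes to is that measures like the F-measure are defined on hard prediction counts and therefore do not decompose into per-example losses. A common way to make such a measure trainable by gradient methods is to replace hard 0/1 predictions with probabilities, yielding a smooth surrogate. The sketch below illustrates this idea on a toy label-imbalanced problem with a "soft F1" surrogate, a linear scorer, and numerical gradients; it is a minimal illustration of the general principle, not the algorithms proposed in the paper.

```python
import numpy as np

def soft_f1(p, y):
    """Differentiable surrogate for the F1-measure: hard 0/1 predictions
    are replaced by probabilities p, so the positive/negative counts
    become smooth functions of the model output."""
    tp = np.sum(p * y)            # soft true positives
    fp = np.sum(p * (1.0 - y))    # soft false positives
    fn = np.sum((1.0 - p) * y)    # soft false negatives
    return 2.0 * tp / (2.0 * tp + fp + fn)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny label-imbalanced toy problem: linear scorer with a bias feature.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])
y = (X[:, 0] + 0.2 * rng.normal(size=200) > 1.0).astype(float)  # rare positives

w = np.zeros(3)
eps, lr = 1e-5, 0.5
for _ in range(300):
    # Central-difference gradient of the loss (1 - soft F1) w.r.t. w.
    grad = np.zeros_like(w)
    for i in range(w.size):
        d = np.zeros_like(w)
        d[i] = eps
        f_hi = soft_f1(sigmoid(X @ (w + d)), y)
        f_lo = soft_f1(sigmoid(X @ (w - d)), y)
        grad[i] = -(f_hi - f_lo) / (2.0 * eps)
    w -= lr * grad

print(f"soft F1 after training: {soft_f1(sigmoid(X @ w), y):.3f}")
```

In a deep-learning framework the numerical gradient would of course be replaced by automatic differentiation of the surrogate, which is what allows such objectives to be integrated into symbolic gradient frameworks.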

## Keywords

Optimization · Deep learning · F-measure · Task-specific training

## Notes

### Acknowledgements

A.S. did this work while he was a student at IIT Kanpur and acknowledges support from The Alan Turing Institute under the Turing Doctoral Studentship grant TU/C/000023. P. Kar is supported by the Deep Singh and Daljeet Kaur Faculty Fellowship and the Research-I foundation at IIT Kanpur, and thanks Microsoft Research India and Tower Research for research grants.
