Machine Learning, Volume 76, Issue 2–3, pp 271–285

Cost-sensitive learning based on Bregman divergences

  • Raúl Santos-Rodríguez
  • Alicia Guerrero-Curieses
  • Rocío Alaiz-Rodríguez
  • Jesús Cid-Sueiro


This paper analyzes the application of a particular class of Bregman divergences to design cost-sensitive classifiers for multiclass problems. We show that these divergence measures can be used to estimate posterior probabilities with maximal accuracy for the probability values that are close to the decision boundaries. Asymptotically, the proposed divergence measures provide classifiers minimizing the sum of decision costs in non-separable problems, and maximizing a margin in separable MAP problems.
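As a rough illustration of the two ingredients named in the abstract, the sketch below computes a generic Bregman divergence, D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>, and then makes a minimum-expected-cost decision from posterior estimates. It uses the textbook definition with two standard generators (squared norm and negative Shannon entropy), not the particular class of divergences proposed in the paper. All function names, the probability vectors, and the cost matrix are hypothetical and chosen only for the example.

```python
import math

def bregman_divergence(phi, grad_phi, p, q):
    """Generic Bregman divergence: phi(p) - phi(q) - <grad phi(q), p - q>."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_phi(q), p, q))
    return phi(p) - phi(q) - inner

# Generator 1: squared Euclidean norm, which yields the squared Euclidean distance.
def sq_norm(v):
    return sum(x * x for x in v)

def sq_norm_grad(v):
    return [2.0 * x for x in v]

# Generator 2: negative Shannon entropy, which yields the KL divergence
# when p and q are strictly positive probability vectors.
def neg_entropy(v):
    return sum(x * math.log(x) for x in v)

def neg_entropy_grad(v):
    return [math.log(x) + 1.0 for x in v]

def min_cost_decision(posteriors, cost_matrix):
    """Pick the decision d minimizing sum_c cost_matrix[d][c] * posteriors[c].

    Assumed convention: cost_matrix[d][c] is the cost of deciding d
    when the true class is c.
    """
    expected = [sum(c * p for c, p in zip(row, posteriors)) for row in cost_matrix]
    return min(range(len(expected)), key=expected.__getitem__)

# Hypothetical true posteriors p and estimates q for a 3-class problem.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(bregman_divergence(sq_norm, sq_norm_grad, p, q))          # squared distance
print(bregman_divergence(neg_entropy, neg_entropy_grad, p, q))  # KL(p || q)

# Hypothetical costs: missing class 2 is ten times worse than other errors.
costs = [[0, 1, 10],
         [1, 0, 10],
         [1, 1, 0]]
print(min_cost_decision(p, costs))  # 2, although class 0 is the most probable
```

With these illustrative costs, the decision differs from the most probable class, so the decision boundaries no longer coincide with the MAP boundaries. This is the setting in which the accuracy of posterior estimates near the cost-dependent boundaries matters.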


Keywords: Cost sensitive learning; Bregman divergence; Posterior class probabilities; Maximum margin



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Raúl Santos-Rodríguez (1)
  • Alicia Guerrero-Curieses (2)
  • Rocío Alaiz-Rodríguez (3)
  • Jesús Cid-Sueiro (1)

  1. Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Leganés (Madrid), Spain
  2. Department of Signal Theory and Communications, Universidad Rey Juan Carlos, Fuenlabrada (Madrid), Spain
  3. Department of Electrical and Electronic Engineering, Universidad de León, León, Spain
