
Optimizing different loss functions in multilabel classifications

Regular Paper · Progress in Artificial Intelligence

Abstract

Multilabel classification (ML) aims to assign a set of labels to each instance. This generalization of multiclass classification leads to the redefinition of loss functions, and the learning tasks become harder. The objective of this paper is to gain insights into the relations between optimization aims and some of the most popular performance measures: subset (or 0/1), Hamming, and the example-based F-measure. To make a fair comparison, we implemented three ML learners, each explicitly optimizing one of these measures, in a common framework. This can be done by considering a subset of labels as a structured output. We then use structured output support vector machines tailored to optimize a given loss function. The paper includes an exhaustive experimental comparison. The conclusion is that, in most cases, the optimization of the Hamming loss produces the best or competitive scores. This is a practical result, since the Hamming loss can be minimized using a set of binary classifiers, one for each label separately; it is therefore a scalable and fast method for learning ML tasks. Additionally, we observe that in noise-free learning tasks optimizing the subset loss is the best option, but the differences are very small. We have also noticed that the biggest room for improvement is found when the goal is to optimize an F-measure in noisy learning tasks.
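For concreteness, the three example-based performance measures compared in the paper can be computed from 0/1 label matrices as in the following sketch (pure Python; the function names are ours, chosen for illustration, not taken from the paper):

```python
from typing import List

Labels = List[List[int]]  # rows: examples, columns: labels (0/1)

def subset_01_loss(Y: Labels, Yp: Labels) -> float:
    # Fraction of examples whose predicted label set is not exactly correct
    return sum(y != yp for y, yp in zip(Y, Yp)) / len(Y)

def hamming_loss(Y: Labels, Yp: Labels) -> float:
    # Fraction of individual label decisions that are wrong,
    # averaged over all examples and labels
    n, m = len(Y), len(Y[0])
    return sum(a != b for y, yp in zip(Y, Yp)
               for a, b in zip(y, yp)) / (n * m)

def example_f1(Y: Labels, Yp: Labels) -> float:
    # Example-based F-measure: F1 computed per example, then averaged
    scores = []
    for y, yp in zip(Y, Yp):
        tp = sum(a and b for a, b in zip(y, yp))
        denom = sum(y) + sum(yp)
        scores.append(2 * tp / denom if denom else 1.0)  # both empty: perfect
    return sum(scores) / len(scores)
```

Because the Hamming loss decomposes over labels, it can be minimized by one independent binary classifier per label, which is what makes the binary relevance approach mentioned above scalable.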


Notes

  1. http://www.aic.uniovi.es/ml_generator/.

Table 1 Cardinality and density statistics of the 48 noise-free datasets


Acknowledgments

The research reported here is supported in part under Grant TIN2011-23558 from the MINECO (Ministerio de Economía y Competitividad, Spain), partially supported with FEDER funds. We would also like to acknowledge all those people who generously shared the datasets and software used in this paper.

Author information

Corresponding author

Correspondence to Jorge Díez.

Appendix

In this section, we report the results obtained on the whole collection of datasets. Since we have two different ways to introduce noise in a learning task (see Sect. 6.1), to represent all datasets at once we define, for each dataset and each loss or score function, the similarity between its noise-free and noisy versions. For a loss function \(\Delta\), the similarity is the complement of the loss of the noisy version with respect to the noise-free output on the test set,

$$\begin{aligned} \mathrm{Sim}(\Delta, \mathbf{Y}, \mathrm{noise}(\mathbf{Y})) = 1 - \Delta(\mathbf{Y}, \mathrm{noise}(\mathbf{Y})) \end{aligned}$$
(9)

where \(\mathbf{Y}\) represents the matrix of actual labels. For \(F_1\), which is a score rather than a loss, the similarity does not take the complement:

$$\begin{aligned} \mathrm{Sim}(F_1, \mathbf{Y}, \mathrm{noise}(\mathbf{Y})) = F_1(\mathbf{Y}, \mathrm{noise}(\mathbf{Y})). \end{aligned}$$
(10)
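Equations (9) and (10) can be sketched in code as follows (a minimal illustration in pure Python; the measure implementations and the `add_noise` helper are ours, and the flip-probability noise model is only one of the noise schemes a dataset generator might use):

```python
import random

def hamming_loss(Y, Yp):
    # Fraction of individual label decisions that differ
    n, m = len(Y), len(Y[0])
    return sum(a != b for y, yp in zip(Y, Yp)
               for a, b in zip(y, yp)) / (n * m)

def example_f1(Y, Yp):
    # Example-based F1: computed per example, then averaged
    scores = []
    for y, yp in zip(Y, Yp):
        tp = sum(a and b for a, b in zip(y, yp))
        denom = sum(y) + sum(yp)
        scores.append(2 * tp / denom if denom else 1.0)
    return sum(scores) / len(scores)

def add_noise(Y, p, rng):
    # Flip each label independently with probability p
    return [[1 - a if rng.random() < p else a for a in y] for y in Y]

def similarity(measure, Y, Y_noisy, is_score=False):
    # Eq. (9): Sim = 1 - loss, for a loss function such as Hamming;
    # Eq. (10): Sim = the measure itself, when it is a score such as F1
    v = measure(Y, Y_noisy)
    return v if is_score else 1.0 - v
```

With no noise the similarity is 1 under either convention; as more labels are flipped, the similarity decreases, which is exactly the quantity plotted on the vertical axes of Figs. 3, 4, and 5.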

In Fig. 3, each dataset is represented in a two-dimensional space. The horizontal axis represents the \(F_1\) score achieved by Struct(\(F_1\)) on the noise-free version of the dataset. The vertical axis represents the similarity of the dataset (measured with \(F_1\)). Thus, points near the top of the figure correspond to noise-free datasets or datasets with low noise. On the other hand, points near the left side represent harder datasets, in the sense that the noise-free versions achieve lower \(F_1\) scores. Finally, each point is labeled with the name of the learner that achieved the best \(F_1\) score.

Fig. 3

Learners winning in \(F_1\) score in the 528 datasets. The horizontal axis represents the \(F_1\) score of the noise-free versions of the datasets; thus, points at the right-hand side stand for easier datasets, typically those with fewer labels. The vertical axis represents the similarity of the label sets to the noise-free version, using \(F_1\) similarity (10). The higher the points in the figure, the lower the noise in the datasets

Here, we observe that Struct(\(F_1\)) outperforms the other learners in terms of \(F_1\) when tackling harder learning tasks (bottom-left corner of Fig. 3). In easier tasks, mainly those with \(10\) or \(25\) labels, the procedure (Algorithm 2) seems to require more evidence to estimate the optimal expected \(F_1\). In any case, the differences in the easiest learning tasks are small.

To complete the discussion of the results, we produced figures analogous to Fig. 3 using the subset \(0/1\) loss (Fig. 4) and the Hamming loss (Fig. 5).

Fig. 4

Learners winning in \(0/1\) loss in the 528 datasets. The horizontal axis represents the \(0/1\) loss of the noise-free versions of the datasets; thus, points at the left-hand side stand for easier datasets, typically those with fewer labels. The vertical axis represents the similarity of the label sets to the noise-free version, using \(0/1\) similarity (9). The higher the points in the figure, the lower the noise in the datasets

Fig. 5

Learners winning in Hamming loss in the 528 datasets. The horizontal axis represents the Hamming loss of the noise-free versions of the datasets; thus, points at the left-hand side stand for easier datasets, typically those with fewer labels. The vertical axis represents the similarity of the label sets to the noise-free version, using Hamming loss similarity (9). The higher the points in the figure, the lower the noise in the datasets

About this article


Cite this article

Díez, J., Luaces, O., del Coz, J.J. et al. Optimizing different loss functions in multilabel classifications. Prog Artif Intell 3, 107–118 (2015). https://doi.org/10.1007/s13748-014-0060-7
