On the Impact of Dataset Complexity and Sampling Strategy in Multilabel Classifiers Performance

  • Francisco Charte
  • Antonio Rivera
  • María José del Jesus
  • Francisco Herrera
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9648)


Multilabel classification (MLC) is an increasingly widespread data mining technique. Its goal is to categorize patterns into several non-exclusive groups, and it is applied in fields such as news categorization, image labeling and music classification. Comparatively speaking, MLC is a more complex task than multiclass and binary classification, since the classifier must learn the presence of several outputs at once from the same set of predictive variables. The very nature of the data the classifier has to deal with implies a certain degree of complexity. Measuring this complexity level strictly from the data characteristics would be a worthwhile objective. At the same time, the strategy used to partition the data influences the sample of patterns the algorithm has at its disposal to train the classifier. In MLC, random sampling is commonly used to accomplish this task.
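As a minimal illustration of characterizing multilabel data from its own traits, two widely used measures, label cardinality (average number of labels per instance) and label density (cardinality normalized by the size of the label set), can be computed directly from the binary label matrix. This is a generic sketch of standard characterization metrics, not the TCS score introduced by the paper:

```python
def label_cardinality(y):
    """Mean number of active labels per instance.
    y: list of rows, each a 0/1 indicator vector over the label set."""
    return sum(sum(row) for row in y) / len(y)

def label_density(y):
    """Cardinality normalized by the total number of labels."""
    return label_cardinality(y) / len(y[0])

# Toy dataset: 4 instances, 3 non-exclusive labels.
Y = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
]

print(label_cardinality(Y))  # 1.75
print(label_density(Y))      # 0.5833...
```

Higher cardinality and lower density tend to indicate harder datasets, which is the kind of intrinsic trait a complexity score must capture.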

This paper introduces TCS (Theoretical Complexity Score), a new characterization metric aimed at assessing the intrinsic complexity of a multilabel dataset, as well as a novel stratified sampling method specifically designed to fit the traits of multilabeled data. A detailed description of both proposals is provided, along with empirical results supporting their suitability for their respective tasks.
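The paper's exact sampling algorithm is described in the full text; as a generic illustration of why label-aware stratification differs from random sampling, the following sketch greedily assigns each instance to the fold that currently holds the fewest examples of its rarest label, in the spirit of iterative stratification. The function name and tie-breaking rule are illustrative assumptions, not the paper's method:

```python
from collections import Counter

def stratified_folds(labelsets, k):
    """Greedy label-aware partitioning of a multilabel dataset.
    labelsets: one set of labels per instance. Returns k folds of indices."""
    freq = Counter(l for ls in labelsets for l in ls)
    folds = [[] for _ in range(k)]
    counts = [Counter() for _ in range(k)]
    # Place instances carrying the globally rarest labels first, so
    # scarce labels get spread across folds before the folds fill up.
    order = sorted(range(len(labelsets)),
                   key=lambda i: min((freq[l] for l in labelsets[i]),
                                     default=len(labelsets)))
    for i in order:
        ls = labelsets[i]
        rarest = min(ls, key=freq.__getitem__) if ls else None
        # Prefer the fold with the fewest copies of the rarest label,
        # breaking ties by overall fold size.
        f = min(range(k), key=lambda j: (counts[j][rarest], len(folds[j])))
        folds[f].append(i)
        for l in ls:
            counts[f][l] += 1
    return folds

# Toy dataset: 6 instances; label 'C' appears only twice.
data = [{'A'}, {'A', 'B'}, {'B'}, {'A', 'C'}, {'B', 'C'}, {'A'}]
folds = stratified_folds(data, 2)
```

With purely random sampling, both instances of the rare label 'C' can easily land in the same fold, leaving the other fold with no examples of it; the greedy assignment above places one in each.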


Keywords: Multilabel classification · Complexity · Metrics · Partitioning



This work was partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P and TIN2012-33856, and the Andalusian regional projects P10-TIC-06858 and P11-TIC-7765.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Francisco Charte (1)
  • Antonio Rivera (2)
  • María José del Jesus (2)
  • Francisco Herrera (1)

  1. Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
  2. Department of Computer Science, University of Jaén, Jaén, Spain
