Resampling Multilabel Datasets by Decoupling Highly Imbalanced Labels

  • Francisco Charte
  • Antonio Rivera
  • María José del Jesus
  • Francisco Herrera
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9121)

Abstract

Multilabel classification is a task that has been broadly studied in late years. However, how to face learning from imbalanced multilabel datasets (MLDs) has only been addressed latterly. In this regard, a few proposals can be found in the literature, most of them based on resampling techniques adapted from the traditional classification field. The success of these methods varies extraordinarily depending on the traits of the chosen MLDs.

One of the characteristics which significantly influences the behavior of multilabel resampling algorithms is the joint appearance of minority and majority labels in the same instances. It was demonstrated that MLDs with a high level of concurrence among imbalanced labels could hardly benefit from resampling methods. This paper proposes an original resampling algorithm, called REMEDIAL, which is not based on removing majority instances nor creating minority ones, but on a procedure to decouple highly imbalanced labels. As will be experimentally demonstrated, this is an interesting approach for certain MLDs.

Keywords

Multilabel classification Imbalanced learning Resampling Label concurrence 

References

  1. 1.
    Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, Ch. 34, pp. 667–685. Springer, Boston (2010). doi:10.1007/978-0-387-09823-4_34 Google Scholar
  2. 2.
    Klimt, B., Yang, Y.: The enron corpus: a new dataset for email classification research. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004. LNCS (LNAI), vol. 3201, pp. 217–226. Springer, Heidelberg (2004). doi:10.1007/978-3-540-30115-8_22 CrossRefGoogle Scholar
  3. 3.
    Turnbull, D., Barrington, L., Torres, D., Lanckriet, G.: Semantic annotation and retrieval of music and sound effects. IEEE Audio Speech Lang. Process. 16(2), 467–476 (2008). doi:10.1109/TASL.2007.913750 Google Scholar
  4. 4.
    Duygulu, P., Barnard, K., de Freitas, J.F.G., Forsyth, D.: Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part IV. LNCS, vol. 2353, pp. 97–112. Springer, Heidelberg (2002). doi:10.1007/3-540-47979-1_7 CrossRefGoogle Scholar
  5. 5.
    Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor. Newsl. 6(1), 1–6 (2004). doi:10.1145/1007730.1007733 Google Scholar
  6. 6.
    García, V., Sánchez, J., Mollineda, R.: On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowl. Based Syst. 25(1), 13–21 (2012). http://dx.doi.org/10.1016/j.knosys.2011.06.013 Google Scholar
  7. 7.
    Charte, F., Rivera, A., del Jesus, M.J., Herrera, F.: A first approach to deal with imbalance in multi-label datasets. In: Pan, J.-S., Polycarpou, M.M., Woźniak, M., de Carvalho, A.C.P.L.F., Quintián, H., Corchado, E. (eds.) HAIS 2013. LNCS, vol. 8073, pp. 150–160. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40846-5_16 CrossRefGoogle Scholar
  8. 8.
    Giraldo-Forero, A.F., Jaramillo-Garzón, J.A., Ruiz-Muñoz, J.F., Castellanos-Domínguez, C.G.: Managing imbalanced data sets in multi-label problems: a case study with the SMOTE algorithm. In: Ruiz-Shulcloper, J., Sanniti di Baja, G. (eds.) CIARP 2013, Part I. LNCS, vol. 8258, pp. 334–342. Springer, Heidelberg (2013). doi:10.1007/978-3-642-41822-8_42 CrossRefGoogle Scholar
  9. 9.
    Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: Addressing imbalance in multilabel classification: Measures and random resampling algorithms, Neurocomputing to be publishedGoogle Scholar
  10. 10.
    Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: MLeNN: a first approach to heuristic multilabel undersampling. In: Corchado, E., Lozano, J.A., Quintián, H., Yin, H. (eds.) IDEAL 2014. LNCS, vol. 8669, pp. 1–9. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10840-7_1 CrossRefGoogle Scholar
  11. 11.
    Tahir, M.A., Kittler, J., Yan, F.: Inverse random under sampling for class imbalance problem and its application to multi-label classification. Pattern Recogn. 45(10), 3738–3750 (2012). doi:10.1016/j.patcog.2012.03.014 Google Scholar
  12. 12.
    Tahir, M.A., Kittler, J., Bouridane, A.: Multilabel classification using heterogeneous ensemble of multi-label classifiers. Pattern Recogn. Lett. 33(5), 513–523 (2012). doi:10.1016/j.patrec.2011.10.019 Google Scholar
  13. 13.
    Charte, F., Rivera, A., del Jesus, M.J., Herrera, F.: Concurrence among imbalanced labels and its influence on multilabel resampling algorithms. In: Polycarpou, M., de Carvalho, A.C.P.L.F., Pan, J.-S., Woźniak, M., Quintian, H., Corchado, E. (eds.) HAIS 2014. LNCS, vol. 8480, pp. 110–121. Springer, Heidelberg (2014) CrossRefGoogle Scholar
  14. 14.
    Diplaris, S., Tsoumakas, G., Mitkas, P.A., Vlahavas, I.P.: Protein classification with multiple algorithms. In: Bozanis, P., Houstis, E.N. (eds.) PCI 2005. LNCS, vol. 3746, pp. 448–456. Springer, Heidelberg (2005). doi:10.1007/11573036_42 CrossRefGoogle Scholar
  15. 15.
    Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. In: Dietterich, G., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems 14, vol. 14, pp. 681–687. MIT Press, Cambridge (2001)Google Scholar
  16. 16.
    Crammer, K., Dredze, M., Ganchev, K., Talukdar, P.P., Carroll, S.: Automatic code assignment to medical text. In: Proceedings of the Workshop on Biological, Translational, and Clinical Language Processing, BioNLP 2007. Prague, Czech Republic, pp. 129–136 (2007)Google Scholar
  17. 17.
    Godbole, S., Sarawagi, S.: Discriminative methods for multi-labeled classification. In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS (LNAI), vol. 3056, pp. 22–30. Springer, Heidelberg (2004). doi:10.1007/978-3-540-24775-3_5 CrossRefGoogle Scholar
  18. 18.
    Boutell, M., Luo, J., Shen, X., Brown, C.: Learning multi-label scene classification. Pattern Recogn. 37(9), 1757–1771 (2004). doi:10.1016/j.patcog.2004.03.009 Google Scholar
  19. 19.
    Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. Mach. Learn. 85, 333–359 (2011). doi:10.1007/s10994-011-5256-5 MathSciNetGoogle Scholar
  20. 20.
    Tsoumakas, G., Katakis, I., Vlahavas, I.: Effective and efficient multilabel classification in domains with large number of labels. In: Proceedings of the ECML/PKDD Workshop on Mining Multidimensional Data, MMD 2008. Antwerp, Belgium, pp. 30–44 (2008)Google Scholar
  21. 21.
    Zhang, M., Zhou, Z.: ML-KNN: a lazy learning approach to multi-label learning. Pattern Recogn. 40(7), 2038–2048 (2007). doi:10.1016/j.patcog.2006.12.019 MATHGoogle Scholar
  22. 22.
    Clare, A.J., King, R.D.: Knowledge discovery in multi-label phenotype data. In: Siebes, A., De Raedt, L. (eds.) PKDD 2001. LNCS (LNAI), vol. 2168, p. 42. Springer, Heidelberg (2001). doi:10.1007/3-540-44794-6_4 CrossRefGoogle Scholar
  23. 23.
    Zhang, M.-L.: Multilabel neural networks with applications to functional genomics and text categorization. IEEE Trans. Knowl. Data Eng. 18(10), 1338–1351 (2006). doi:10.1109/TKDE.2006.162 Google Scholar
  24. 24.
    Zhang, M., Zhou, Z.: A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 26(8), 1819–1837 (2014). doi:10.1109/TKDE.2013.39 Google Scholar
  25. 25.
    Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002). doi:10.1613/jair.953 MATHGoogle Scholar
  26. 26.
    Kotsiantis, S.B., Pintelas, P.E.: Mixture of expert agents for handling imbalanced data sets. Ann. Math. Comput. Teleinformatics 1, 46–55 (2003)Google Scholar
  27. 27.
    López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013). doi:10.1016/j.ins.2013.07.007 Google Scholar
  28. 28.
    Provost, F., Fawcett, T.: Robust classification for imprecise environments. Mach. Learn. 42, 203–231 (2001). doi:10.1023/A:1007601015854 MATHGoogle Scholar
  29. 29.
    He, J., Gu, H., Liu, W.: Imbalanced multi-modal multi-label learning for subcellular localization prediction of human proteins with both single and multiple sites. PloS one 7(6), 7155 (2012). doi:10.1371/journal.pone.0037155 Google Scholar
  30. 30.
    Li, C., Shi, G.: Improvement of learning algorithm for the multi-instance multi-label rbf neural networks trained with imbalanced samples. J. Inf. Sci. Eng. 29(4), 765–776 (2013)Google Scholar
  31. 31.
    Tepvorachai, G., Papachristou, C.: Multi-label imbalanced data enrichment process in neural net classifier training. In: IEEE International Joint Conference on Neural Networks, IJCNN 2008, pp. 1301–1307 (2008). doi:10.1109/IJCNN.2008.4633966
  32. 32.
    Charte, F., Charte, F.D.: How to work with multilabel datasets in R using the mldr package. doi:10.6084/m9.figshare.1356035
  33. 33.
    Cheng, W., Hüllermeier, E.: Combining instance-based learning and logistic regression for multilabel classification. Mach. Learn. 76(2–3), 211–225 (2009). doi:10.1007/s10994-009-5127-5 Google Scholar

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Francisco Charte
    • 1
  • Antonio Rivera
    • 2
  • María José del Jesus
    • 2
  • Francisco Herrera
    • 1
  1. 1.Department of Computer Science and Artificial IntelligenceUniversity of GranadaGranadaSpain
  2. 2.Department of Computer ScienceUniversity of JaénJaénSpain

Personalised recommendations