
Natural-neighborhood based, label-specific undersampling for imbalanced, multi-label data

  • Regular Article
  • Published:
Advances in Data Analysis and Classification

Abstract

This work presents a novel undersampling scheme, NaNUML, to tackle the class-imbalance problem in multi-label datasets. We follow a label-specific undersampling paradigm built on the principle of the natural nearest neighborhood, which is parameter-free: the key neighborhood size 'k' is determined without invoking any parameter optimization. The class-imbalance problem is particularly challenging in a multi-label context because the imbalance ratio and the majority–minority distributions vary from label to label; consequently, the majority–minority class overlaps also vary across labels. Addressing this aspect, we propose a framework in which a single natural-neighbor search suffices to identify all the label-specific overlaps. The natural-neighbor information is also used to find the key lattices of the majority class, which we do not undersample. An empirical study involving twelve real-world multi-label datasets, seven competing methods, and four evaluation metrics shows that NaNUML mitigates the class-imbalance issue in multi-label datasets to a considerable extent, and that it achieves a statistically superior performance over the competing methods in several cases.
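The parameter-free natural-neighbor search that the abstract builds on (Zhu et al. 2016) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function name, the brute-force distance computation, and the exact stopping rule (stop when the set of points with no reverse neighbor is empty or stops shrinking) are assumptions made for the sketch. The returned radius `r` plays the role of the data-driven neighborhood size 'k' mentioned in the abstract.

```python
import numpy as np

def natural_neighbor_eigenvalue(X, max_r=None):
    """Sketch of the natural-neighbor search (after Zhu et al., 2016).

    Grow the neighborhood radius r until every point has at least one
    reverse neighbor, or until the count of 'orphan' points (those with
    no reverse neighbor) stops shrinking. The final r is the natural
    eigenvalue, usable as a parameter-free neighborhood size k.
    """
    n = len(X)
    if max_r is None:
        max_r = n - 1
    # Brute-force pairwise Euclidean distances; mask self-distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    # order[i] lists i's neighbors, nearest first (stable tie-breaking).
    order = np.argsort(d, axis=1, kind="stable")
    reverse_count = np.zeros(n, dtype=int)  # reverse-neighbor tally per point
    prev_orphans = n + 1
    for r in range(1, max_r + 1):
        for i in range(n):
            # i's r-th nearest neighbor gains i as a reverse neighbor.
            reverse_count[order[i, r - 1]] += 1
        orphans = int(np.sum(reverse_count == 0))
        if orphans == 0 or orphans == prev_orphans:
            return r, reverse_count
        prev_orphans = orphans
    return max_r, reverse_count
```

On evenly spaced 1-D points, for example, the search terminates after two rounds, since only the last point lacks a reverse neighbor at r = 1. The reverse-neighbor counts this search produces are what a label-specific scheme like NaNUML can reuse across all labels from a single search.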


Notes

  1. http://mulan.sourceforge.net/datasets-mlc.html.


Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Payel Sadhukhan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sadhukhan, P., Palit, S. Natural-neighborhood based, label-specific undersampling for imbalanced, multi-label data. Adv Data Anal Classif (2024). https://doi.org/10.1007/s11634-024-00589-3

