Abstract
This work presents a novel undersampling scheme, NaNUML, to tackle the class-imbalance problem in multi-label datasets. The scheme follows a label-specific undersampling paradigm and builds on the natural nearest neighborhood, a parameter-free principle: the key factor, the neighborhood size 'k', is determined by the natural neighbor search itself, without invoking any parameter optimization. The novelty of the scheme lies in exploiting this property. Class imbalance is particularly challenging in a multi-label context because the imbalance ratio and the majority–minority distributions vary from label to label; consequently, the majority–minority class overlaps also vary across labels. To address this, we propose a framework in which a single natural neighbor search is sufficient to identify all the label-specific overlaps. Natural neighbor information is also used to find the key lattices of the majority class, which we do not undersample. An empirical study involving twelve real-world multi-label datasets, seven competing methods, and four evaluation metrics shows that NaNUML mitigates the class-imbalance issue in multi-label datasets to a considerable extent, and in several cases its performance is statistically superior to that of the competing methods.
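To make the idea of a parameter-free neighborhood concrete, the following is a minimal sketch and not the authors' implementation. It assumes a common formulation of natural neighbor search: the neighborhood rank grows one step at a time until every point has at least one reverse neighbor, or until the number of points without one stops changing. The function names `natural_neighbor_search` and `overlapping_majority`, the Euclidean distance, and the overlap rule (a majority instance with at least one minority natural neighbor) are illustrative assumptions; the abstract does not specify them.

```python
import numpy as np

def natural_neighbor_search(X):
    """Return (k, nan_sets) for data X of shape (n, d).

    k is the neighborhood size reached by the search; nan_sets[i] is the set
    of mutual (natural) neighbors of point i. Assumed formulation, see lead-in.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances; a point is never its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    order = np.argsort(dist, axis=1)  # order[i, r] is the (r+1)-th nearest neighbor of i

    knn = [set() for _ in range(n)]  # neighbors found so far
    rnn = [set() for _ in range(n)]  # reverse neighbors found so far
    prev_orphans = -1
    k = 0
    for r in range(n - 1):
        for i in range(n):
            j = int(order[i, r])
            knn[i].add(j)
            rnn[j].add(i)
        k = r + 1
        # Orphans are points that nobody has yet picked as a neighbor.
        orphans = sum(1 for s in rnn if not s)
        if orphans == 0 or orphans == prev_orphans:
            break
        prev_orphans = orphans

    # Natural neighbors are mutual: j is in i's neighbors and i is in j's.
    nan_sets = [knn[i] & rnn[i] for i in range(n)]
    return k, nan_sets

def overlapping_majority(nan_sets, y):
    """Indices of majority instances with a minority natural neighbor.

    y is the binary vector of one label. Hypothetical overlap rule, used only
    to show how one search result can be reused for every label.
    """
    y = np.asarray(y)
    minority = 1 if 2 * y.sum() <= len(y) else 0
    return [
        i for i in range(len(y))
        if y[i] != minority and any(y[j] == minority for j in nan_sets[i])
    ]
```

In this sketch the search runs once on the feature matrix, and `overlapping_majority` is then called once per label column with the same `nan_sets`, which mirrors the claim that a single natural neighbor search serves all labels. How overlapped instances are actually removed, and how the key lattices of the majority class are protected, is left out because the abstract does not describe those steps.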
About this article
Cite this article
Sadhukhan, P., Palit, S. Natural-neighborhood based, label-specific undersampling for imbalanced, multi-label data. Adv Data Anal Classif (2024). https://doi.org/10.1007/s11634-024-00589-3
DOI: https://doi.org/10.1007/s11634-024-00589-3