
Natural-neighborhood based, label-specific undersampling for imbalanced, multi-label data

  • Regular Article
  • Published:
Advances in Data Analysis and Classification

Abstract

This work presents a novel undersampling scheme, NaNUML, to tackle the class-imbalance problem in multi-label datasets. We follow a label-specific undersampling paradigm built on the principle of the natural nearest neighborhood, which is parameter-free: the key neighborhood size 'k' is determined without invoking any parameter optimization. The class-imbalance problem is particularly challenging in a multi-label context because the imbalance ratio and the majority–minority distributions vary from label to label; consequently, the majority–minority class overlaps also vary across labels. Addressing this aspect, we propose a framework in which a single natural-neighbor search suffices to identify all the label-specific overlaps. The natural-neighbor information is also used to find the key lattices of the majority class, which we do not undersample. An empirical study involving twelve real-world multi-label datasets, seven competing methods, and four evaluation metrics shows that NaNUML mitigates the class-imbalance issue in multi-label datasets to a considerable extent, and that it achieves a statistically superior performance over the competing methods in several cases.
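The parameter-free natural-neighbor search that the abstract builds on (Zhu et al. 2016) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function name, the brute-force distance computation, and the exact stopping rule (stop when the set of points with no reverse neighbor is empty or stops shrinking) are assumptions made for the sketch. The returned radius `r` plays the role of the data-driven neighborhood size 'k' mentioned in the abstract.

```python
import numpy as np

def natural_neighbor_eigenvalue(X, max_r=None):
    """Sketch of the natural-neighbor search (after Zhu et al., 2016).

    Grow the neighborhood radius r until every point has at least one
    reverse neighbor, or until the count of 'orphan' points (those with
    no reverse neighbor) stops shrinking. The final r is the natural
    eigenvalue, usable as a parameter-free neighborhood size k.
    """
    n = len(X)
    if max_r is None:
        max_r = n - 1
    # Brute-force pairwise Euclidean distances; mask self-distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    # order[i] lists i's neighbors, nearest first (stable tie-breaking).
    order = np.argsort(d, axis=1, kind="stable")
    reverse_count = np.zeros(n, dtype=int)  # reverse-neighbor tally per point
    prev_orphans = n + 1
    for r in range(1, max_r + 1):
        for i in range(n):
            # i's r-th nearest neighbor gains i as a reverse neighbor.
            reverse_count[order[i, r - 1]] += 1
        orphans = int(np.sum(reverse_count == 0))
        if orphans == 0 or orphans == prev_orphans:
            return r, reverse_count
        prev_orphans = orphans
    return max_r, reverse_count
```

On evenly spaced 1-D points, for example, the search terminates after two rounds, since only the last point lacks a reverse neighbor at r = 1. The reverse-neighbor counts this search produces are what a label-specific scheme like NaNUML can reuse across all labels from a single search.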


Notes

  1. http://mulan.sourceforge.net/datasets-mlc.html.


Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Payel Sadhukhan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sadhukhan, P., Palit, S. Natural-neighborhood based, label-specific undersampling for imbalanced, multi-label data. Adv Data Anal Classif (2024). https://doi.org/10.1007/s11634-024-00589-3

