Abstract
Imbalanced data arise in sociology, economics, engineering, and many other fields: the data distribution is often governed by known prior probabilities, and the events of interest occur with small probability. Existing over- and under-sampling methods are widely used for imbalanced classification, but over-sampling can lead to overfitting, while under-sampling discards useful information. We propose a new sampling design, the neighbor grid of boundary mixed-sampling (NGBM) algorithm, which focuses on boundary information. The algorithm obtains classification boundary information through grid boundary domain identification and uses it to determine the importance of each sample. On this basis, the synthetic minority oversampling technique (SMOTE) is applied to the boundary grids, and random under-sampling is applied to the remaining grids. This mixed-sampling strategy extracts more of the important classification boundary information, particularly for identifying positive samples. Numerical simulations and real data analyses are used to discuss parameter-setting strategies for the NGBM and to illustrate its advantages on imbalanced data and in practical applications.
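The mixed-sampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their code is on GitHub); it is a simplified illustration under assumed conventions: labels are 0 (majority) and 1 (minority), each feature axis is cut into `n_bins` uniform intervals, a grid cell containing both classes is treated as a boundary cell (all points kept, plus SMOTE-style interpolated minority points), and every other cell is randomly under-sampled by `keep_frac`. The function name `ngbm_resample` and all parameter names are hypothetical.

```python
import numpy as np
from collections import defaultdict

def ngbm_resample(X, y, n_bins=4, keep_frac=0.5, seed=0):
    """Grid-based mixed sampling sketch: SMOTE-style oversampling in
    boundary cells, random under-sampling elsewhere (hypothetical API)."""
    rng = np.random.default_rng(seed)

    # Bin each feature into n_bins uniform intervals to form grid cells.
    lo, hi = X.min(axis=0), X.max(axis=0)
    bins = ((X - lo) / (hi - lo + 1e-12) * n_bins).astype(int)
    bins = np.clip(bins, 0, n_bins - 1)

    # Group sample indices by their grid cell.
    groups = defaultdict(list)
    for i, cell in enumerate(map(tuple, bins)):
        groups[cell].append(i)

    keep_X, keep_y = [], []
    for idx in groups.values():
        idx = np.array(idx)
        labels = y[idx]
        if labels.min() != labels.max():
            # Boundary cell (both classes present): keep everything and
            # add one interpolated point per minority point in the cell.
            keep_X.append(X[idx])
            keep_y.append(y[idx])
            minority = idx[labels == 1]
            if len(minority) >= 2:
                for _ in range(len(minority)):
                    a, b = rng.choice(minority, 2, replace=False)
                    lam = rng.random()  # interpolation weight, as in SMOTE
                    keep_X.append((X[a] + lam * (X[b] - X[a]))[None, :])
                    keep_y.append(np.array([1]))
        else:
            # Single-class cell: random under-sampling.
            k = max(1, int(keep_frac * len(idx)))
            sel = rng.choice(idx, k, replace=False)
            keep_X.append(X[sel])
            keep_y.append(y[sel])
    return np.vstack(keep_X), np.concatenate(keep_y)
```

On a toy set with one pure-majority cell and one mixed cell, the pure cell shrinks while the mixed cell gains synthetic minority points, so the class ratio moves toward balance near the boundary.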
Availability of data and material
The codrna dataset comes from the LIBSVM data repository (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The poker and HTRU2 datasets come from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). The Diabetes0vs1 and Diabetes0vs2 datasets come from Kaggle (https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset). Code for the NGBM is available on GitHub (https://github.com/hehanji/NGBM).
Funding
This work was supported by the Major National Statistical Science Research Projects of China (grant number 2020LD02).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest in the publication of this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
He, H., He, J. & Zhang, L. Imbalanced data sampling design based on grid boundary domain for big data. Comput Stat (2024). https://doi.org/10.1007/s00180-024-01471-8