Abstract
Imbalanced data arise in sociology, economics, engineering, and many other fields: the data distribution is often governed by known prior probabilities, and the events of interest occur with small probability. Existing over- and under-sampling methods are widely used for imbalanced classification, but over-sampling can lead to overfitting, while under-sampling discards useful information. We propose a new sampling design, the neighbor grid of boundary mixed-sampling (NGBM) algorithm, which focuses on boundary information. The algorithm obtains classification boundary information through grid boundary domain identification and uses it to determine the importance of each sample. On this basis, the synthetic minority oversampling technique (SMOTE) is applied to the boundary grids, and random under-sampling is applied to the remaining grids. This mixed-sampling strategy extracts more of the important classification boundary information, particularly for identifying positive samples. Numerical simulations and real data analyses are used to discuss parameter-setting strategies for the NGBM and to illustrate its advantages on imbalanced data and in practical applications.
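The mixed-sampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their code is on GitHub); it is a simplified illustration under assumed conventions: labels are 0 (majority) and 1 (minority), each feature axis is cut into `n_bins` uniform intervals, a grid cell containing both classes is treated as a boundary cell (all points kept, plus SMOTE-style interpolated minority points), and every other cell is randomly under-sampled by `keep_frac`. The function name `ngbm_resample` and all parameter names are hypothetical.

```python
import numpy as np
from collections import defaultdict

def ngbm_resample(X, y, n_bins=4, keep_frac=0.5, seed=0):
    """Grid-based mixed sampling sketch: SMOTE-style oversampling in
    boundary cells, random under-sampling elsewhere (hypothetical API)."""
    rng = np.random.default_rng(seed)

    # Bin each feature into n_bins uniform intervals to form grid cells.
    lo, hi = X.min(axis=0), X.max(axis=0)
    bins = ((X - lo) / (hi - lo + 1e-12) * n_bins).astype(int)
    bins = np.clip(bins, 0, n_bins - 1)

    # Group sample indices by their grid cell.
    groups = defaultdict(list)
    for i, cell in enumerate(map(tuple, bins)):
        groups[cell].append(i)

    keep_X, keep_y = [], []
    for idx in groups.values():
        idx = np.array(idx)
        labels = y[idx]
        if labels.min() != labels.max():
            # Boundary cell (both classes present): keep everything and
            # add one interpolated point per minority point in the cell.
            keep_X.append(X[idx])
            keep_y.append(y[idx])
            minority = idx[labels == 1]
            if len(minority) >= 2:
                for _ in range(len(minority)):
                    a, b = rng.choice(minority, 2, replace=False)
                    lam = rng.random()  # interpolation weight, as in SMOTE
                    keep_X.append((X[a] + lam * (X[b] - X[a]))[None, :])
                    keep_y.append(np.array([1]))
        else:
            # Single-class cell: random under-sampling.
            k = max(1, int(keep_frac * len(idx)))
            sel = rng.choice(idx, k, replace=False)
            keep_X.append(X[sel])
            keep_y.append(y[sel])
    return np.vstack(keep_X), np.concatenate(keep_y)
```

On a toy set with one pure-majority cell and one mixed cell, the pure cell shrinks while the mixed cell gains synthetic minority points, so the class ratio moves toward balance near the boundary.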
Availability of data and material
The codrna dataset comes from the LIBSVM data repository (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The poker and HTRU2 datasets come from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). The Diabetes0vs1 and Diabetes0vs2 datasets come from Kaggle (https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset). Code for the NGBM is available on GitHub (https://github.com/hehanji/NGBM).
Funding
This work was supported by the Major National Statistical Science Research Projects of China (grant number 2020LD02).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest in the publication of this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
He, H., He, J. & Zhang, L. Imbalanced data sampling design based on grid boundary domain for big data. Comput Stat (2024). https://doi.org/10.1007/s00180-024-01471-8