Imbalanced data sampling design based on grid boundary domain for big data

  • Original Paper
Computational Statistics

Abstract

Data distributions are often associated with an a priori known probability, and the events of interest occur with small probability, so large amounts of imbalanced data arise in sociology, economics, engineering, and many other fields. Existing over- and under-sampling methods are widely used in imbalanced data classification, but over-sampling leads to overfitting, and under-sampling discards useful information. We propose a new sampling design algorithm, the neighbor grid of boundary mixed-sampling (NGBM), which focuses on boundary information. The method obtains classification boundary information through grid boundary domain identification and uses it to determine the importance of the samples. On this basis, the synthetic minority oversampling technique (SMOTE) is applied to the boundary grids, and random under-sampling is applied to the other grids. This mixed sampling strategy extracts more of the important classification boundary information, especially the information needed to identify positive samples. Numerical simulations and real data analysis are used to discuss the parameter-setting strategy of the NGBM and to illustrate its advantages on imbalanced data and in practical applications.
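The mixed sampling idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' NGBM implementation (see their GitHub repository for that). Several details are assumptions made only for illustration: the uniform grid with `n_bins` cells per feature, the rule that a cell counts as a boundary cell when it contains both classes, the `keep_frac` under-sampling rate, and all function names. The interpolation step is a simplified SMOTE-style routine, not a library call.

```python
import numpy as np

def grid_cells(X, n_bins):
    """Map each row of X to an integer grid-cell index per feature."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    cells = np.floor((X - lo) / span * n_bins).astype(int)
    return np.clip(cells, 0, n_bins - 1)

def boundary_mask(cells, y):
    """Assumption: a cell is a boundary cell if it holds both classes."""
    _, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    n_cells = inverse.max() + 1
    has_pos = np.bincount(inverse, weights=(y == 1).astype(float), minlength=n_cells) > 0
    has_neg = np.bincount(inverse, weights=(y == 0).astype(float), minlength=n_cells) > 0
    return (has_pos & has_neg)[inverse]  # one flag per sample

def smote_like(P, n_new, k, rng):
    """SMOTE-style interpolation between minority points and one of
    their k nearest minority neighbours (O(n^2) distances, sketch only)."""
    if len(P) < 2 or n_new <= 0:
        return np.empty((0, P.shape[1]))
    k = min(k, len(P) - 1)
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(P), size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return P[base] + gap * (P[neigh] - P[base])

def mixed_sample(X, y, n_bins=10, k=5, keep_frac=0.5, seed=0):
    """Over-sample the minority class (label 1) in boundary cells and
    randomly under-sample the majority class (label 0) elsewhere."""
    rng = np.random.default_rng(seed)
    on_boundary = boundary_mask(grid_cells(X, n_bins), y)

    # Over-sampling: synthesise points from boundary-cell minority samples.
    P = X[on_boundary & (y == 1)]
    synth = smote_like(P, n_new=len(P), k=k, rng=rng)

    # Under-sampling: drop a random share of majority points outside boundary cells.
    drop_pool = np.flatnonzero(~on_boundary & (y == 0))
    n_drop = int(len(drop_pool) * (1 - keep_frac))
    dropped = rng.permutation(drop_pool)[:n_drop]
    keep = np.ones(len(y), dtype=bool)
    keep[dropped] = False

    X_out = np.vstack([X[keep], synth])
    y_out = np.concatenate([y[keep], np.ones(len(synth), dtype=y.dtype)])
    return X_out, y_out
```

Calling `mixed_sample(X, y)` on a feature matrix `X` and a 0/1 label vector `y` returns a resampled pair. In this sketch all original minority samples and all majority samples inside boundary cells are kept, which is one possible reading of "focusing on boundary information" rather than a confirmed design detail of the paper.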

Availability of data and material

The codrna data come from the LIBSVM data source (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The poker and HTRU2 datasets come from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). The Diabetes0vs1 and Diabetes0vs2 datasets come from Kaggle (https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset). Code for the NGBM is available on GitHub (https://github.com/hehanji/NGBM).

Funding

This work was supported by the Major National Statistical Science Research Projects of China (grant number 2020LD02).

Author information

Corresponding author

Correspondence to Jianfeng He.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest in the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

He, H., He, J. & Zhang, L. Imbalanced data sampling design based on grid boundary domain for big data. Comput Stat (2024). https://doi.org/10.1007/s00180-024-01471-8

  • DOI: https://doi.org/10.1007/s00180-024-01471-8
