
Using Voronoi diagrams to improve classification performances when modeling imbalanced datasets

  • Original Article
  • Neural Computing and Applications

Abstract

An over-sampling technique called V-synth is proposed and compared to borderline SMOTE (bSMOTE), a common methodology used to balance an imbalanced dataset for classification purposes. V-synth is a machine learning methodology that generates synthetic minority points based on the properties of a Voronoi diagram. A Voronoi diagram is a collection of geometric regions, each enclosing a single generating point, such that every location within a region is closer to that region's generating point than to any other generating point. Because of properties inherent to Voronoi diagrams, V-synth identifies exclusive regions of feature space where it is ideal to create synthetic minority samples. To test the generalization and application of V-synth, six datasets from various problem domains were selected from the University of California Irvine's Machine Learning Repository. Although the random nature of synthetic over-sampling means improvement is not guaranteed in every run, significant evidence is presented supporting the hypothesis that V-synth more consistently produces accurate, well-balanced classification models than bSMOTE when the classification complexity of a dataset is high.
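
The geometric idea can be made concrete. The Voronoi cell of a point p_i is V(p_i) = { x : ||x − p_i|| ≤ ||x − p_j|| for all j ≠ i }. The abstract does not reproduce the paper's exact V-synth rules for selecting regions, so the Python sketch below only illustrates the underlying principle: because Voronoi cells are convex, a synthetic point placed on the segment between a minority sample and a vertex of that sample's own cell necessarily remains closer to that minority sample than to any other training point. The function name voronoi_minority_oversample and its selection heuristic are illustrative assumptions, not the authors' algorithm.

```python
# Minimal sketch of Voronoi-guided minority over-sampling.
# NOT the paper's V-synth; a hypothetical illustration of the geometry only.
import numpy as np
from scipy.spatial import Voronoi

def voronoi_minority_oversample(X, y, minority_label, n_synthetic, seed=0):
    """Place synthetic points inside the Voronoi cells of minority samples."""
    rng = np.random.default_rng(seed)
    vor = Voronoi(X)                              # diagram over all training points
    minority_idx = np.flatnonzero(y == minority_label)

    synthetic = []
    while len(synthetic) < n_synthetic:
        i = rng.choice(minority_idx)
        cell = vor.regions[vor.point_region[i]]   # vertex indices of point i's cell
        finite = [v for v in cell if v != -1]     # -1 marks the vertex at infinity
        if not finite:
            continue                              # skip degenerate unbounded cells
        vertex = vor.vertices[rng.choice(finite)]
        alpha = rng.random()                      # alpha in [0, 1)
        # Voronoi cells are convex and contain their generating point, so any
        # point on the segment from X[i] to a vertex of X[i]'s own cell stays
        # closer to X[i] than to every other training point.
        synthetic.append(X[i] + alpha * (vertex - X[i]))
    return np.asarray(synthetic)
```

On a toy two-dimensional imbalanced dataset, the returned array can simply be stacked onto X (with minority labels) before training a classifier; the convexity argument above is what guarantees the new points fall in minority-dominated feature space rather than across a class boundary.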


Author information

Correspondence to William A. Young II.


About this article


Cite this article

Young, W.A., Nykl, S.L., Weckman, G.R. et al. Using Voronoi diagrams to improve classification performances when modeling imbalanced datasets. Neural Comput & Applic 26, 1041–1054 (2015). https://doi.org/10.1007/s00521-014-1780-0
