
Using Voronoi diagrams to improve classification performances when modeling imbalanced datasets

  • Original Article
  • Neural Computing and Applications

Abstract

An over-sampling technique called V-synth is proposed and compared to borderline SMOTE (bSMOTE), a common methodology used to balance an imbalanced dataset for classification purposes. V-synth is a machine learning methodology that generates synthetic minority points based on the properties of a Voronoi diagram. A Voronoi diagram is a collection of geometric regions, each enclosing a single generating point, such that every location within a region is closer to that region's generating point than to any other generating point. Because of properties inherent to Voronoi diagrams, V-synth identifies exclusive regions of feature space where it is ideal to create synthetic minority samples. To test the generalization and application of V-synth, six datasets from various problem domains were selected from the University of California Irvine's Machine Learning Repository. Although the random nature of synthetic over-sampling means improvement is not guaranteed in every run, significant evidence is presented supporting the hypothesis that V-synth more consistently produces accurate, well-balanced classification models than bSMOTE when the classification complexity of a dataset is high.
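
The geometric idea can be made concrete. The Voronoi cell of a point p_i is V(p_i) = { x : ||x − p_i|| ≤ ||x − p_j|| for all j ≠ i }. The abstract does not reproduce the paper's exact V-synth rules for selecting regions, so the Python sketch below only illustrates the underlying principle: because Voronoi cells are convex, a synthetic point placed on the segment between a minority sample and a vertex of that sample's own cell necessarily remains closer to that minority sample than to any other training point. The function name voronoi_minority_oversample and its selection heuristic are illustrative assumptions, not the authors' algorithm.

```python
# Minimal sketch of Voronoi-guided minority over-sampling.
# NOT the paper's V-synth; a hypothetical illustration of the geometry only.
import numpy as np
from scipy.spatial import Voronoi

def voronoi_minority_oversample(X, y, minority_label, n_synthetic, seed=0):
    """Place synthetic points inside the Voronoi cells of minority samples."""
    rng = np.random.default_rng(seed)
    vor = Voronoi(X)                              # diagram over all training points
    minority_idx = np.flatnonzero(y == minority_label)

    synthetic = []
    while len(synthetic) < n_synthetic:
        i = rng.choice(minority_idx)
        cell = vor.regions[vor.point_region[i]]   # vertex indices of point i's cell
        finite = [v for v in cell if v != -1]     # -1 marks the vertex at infinity
        if not finite:
            continue                              # skip degenerate unbounded cells
        vertex = vor.vertices[rng.choice(finite)]
        alpha = rng.random()                      # alpha in [0, 1)
        # Voronoi cells are convex and contain their generating point, so any
        # point on the segment from X[i] to a vertex of X[i]'s own cell stays
        # closer to X[i] than to every other training point.
        synthetic.append(X[i] + alpha * (vertex - X[i]))
    return np.asarray(synthetic)
```

On a toy two-dimensional imbalanced dataset, the returned array can simply be stacked onto X (with minority labels) before training a classifier; the convexity argument above is what guarantees the new points fall in minority-dominated feature space rather than across a class boundary.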


Author information

Correspondence to William A. Young II.


About this article


Cite this article

Young, W.A., Nykl, S.L., Weckman, G.R. et al. Using Voronoi diagrams to improve classification performances when modeling imbalanced datasets. Neural Comput & Applic 26, 1041–1054 (2015). https://doi.org/10.1007/s00521-014-1780-0
