NEATER: filtering of over-sampled data using non-cooperative game theory

Abstract

In this paper, we present a method for the filteriNg of ovEr-sampled dAta using non-cooperaTive gamE theoRy (NEATER) to address the imbalanced data problem. Specifically, the problem is formulated as a non-cooperative game in which every data instance is a player and the goal is to label, uniformly and consistently, all of the synthetic data created by any over-sampling technique. The proposed algorithm requires no prior assumptions and selects representative synthetic instances while generating very few noisy instances. We present extensive experimental results over a large collection of datasets, using three different classifiers, to demonstrate the advantages of our method.
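The game formulation in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the payoff function, the update rule, or the acceptance threshold, so the Gaussian similarity kernel, the discrete replicator-dynamics update, the 0.5 cut-off, and the names `gaussian_similarity` and `filter_synthetic` are all assumptions made for illustration. Original instances keep their known label as a fixed strategy, synthetic instances start undecided, and a synthetic instance is kept only if its final strategy favours the minority class.

```python
import math


def gaussian_similarity(a, b, sigma=1.0):
    """Similarity weight between two feature vectors (an assumed payoff kernel)."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def filter_synthetic(originals, labels, synthetic, n_classes=2,
                     minority=1, iterations=50, sigma=1.0):
    """Return the synthetic points whose final strategy favours the minority class."""
    points = list(originals) + list(synthetic)
    n_orig = len(originals)
    n = len(points)

    # Original instances play a fixed pure strategy (their known label);
    # synthetic instances start from a uniform mixed strategy.
    strategies = []
    for i in range(n):
        if i < n_orig:
            strategies.append([1.0 if c == labels[i] else 0.0
                               for c in range(n_classes)])
        else:
            strategies.append([1.0 / n_classes] * n_classes)

    # Pairwise payoff weights; a player gains nothing from playing against itself.
    weights = [[0.0 if i == j else gaussian_similarity(points[i], points[j], sigma)
                for j in range(n)] for i in range(n)]

    for _ in range(iterations):
        updated = [row[:] for row in strategies]
        for i in range(n_orig, n):  # only synthetic players revise their strategy
            # Expected payoff of each label: similarity-weighted support from the others.
            payoff = [sum(weights[i][j] * strategies[j][c] for j in range(n))
                      for c in range(n_classes)]
            average = sum(strategies[i][c] * payoff[c] for c in range(n_classes))
            if average > 0.0:
                # Discrete replicator update (assumed rule, not taken from the paper).
                updated[i] = [strategies[i][c] * payoff[c] / average
                              for c in range(n_classes)]
        strategies = updated

    return [points[i] for i in range(n_orig, n)
            if strategies[i][minority] >= 0.5]


# Toy data: class 0 (majority) near the origin, class 1 (minority) near (5, 5).
originals = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 0, 1, 1]
synthetic = [(5.2, 5.4), (0.3, 0.4)]  # the second point falls in majority territory
kept = filter_synthetic(originals, labels, synthetic)
print(kept)  # [(5.2, 5.4)]
```

In this toy run, the synthetic point lying among the majority instances converges to the majority label and is discarded, while the point near the minority cluster is retained. Any over-sampling technique could supply the `synthetic` list.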

Acknowledgments

This research was funded in part by the US Department of Education (P200A070377 and P200A100119) with cost sharing provided by the University of Houston (UH) and in part by UH Hugh Roy and Lillie Cranz Cullen Endowment Fund.

Author information

Corresponding author

Correspondence to I. A. Kakadiaris.

Additional information

Communicated by V. Loia.

Appendix

This appendix provides seven tables with the detailed results of the experimental analysis carried out in the present work. Table 12 contains the AUC values achieved by all algorithms on all datasets with the C4.5 classifier, Table 13 presents the corresponding results for the random forest classifier, and Table 14 those for the SVM classifier. Tables 15, 16 and 17 contain the AUC values on all high-dimensional datasets for the same three classifiers. The best results are highlighted in boldface. Table 18 contains the full description of the datasets used for the experimental analysis.

Table 12 AUC results for C4.5 classifier
Table 13 AUC results for random forest classifier
Table 14 AUC results for SVM classifier
Table 15 AUC results on high-dimensional data for C4.5 classifier
Table 16 AUC results on high-dimensional data for random forest classifier
Table 17 AUC results on high-dimensional data for SVM classifier
Table 18 Description of datasets used for the experimental analysis

Cite this article

Almogahed, B.A., Kakadiaris, I.A. NEATER: filtering of over-sampled data using non-cooperative game theory. Soft Comput 19, 3301–3322 (2015). https://doi.org/10.1007/s00500-014-1484-5

Keywords

  • Imbalanced data
  • Classification
  • Sampling