Sampling Methods in Genetic Programming for Classification with Unbalanced Data
This work investigates the use of sampling methods in Genetic Programming (GP) to improve classification accuracy in binary classification problems where the dataset exhibits class imbalance, i.e., one class contains substantially more instances than the other. As a consequence of this imbalance, when the overall classification rate is used as the fitness function, as in standard GP approaches, the evolved classifiers are often biased towards the majority class at the expense of poor minority-class accuracy. We establish that the variation in training performance introduced by sampling examples from the training set is no worse than the already-accepted variation between GP runs. Results also show that using sampling methods during training can improve minority-class accuracy and the robustness of the evolved classifiers, yielding test-set performance better than that of the classifiers making up the training-set Pareto front.
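The two ideas in the abstract can be illustrated with a short sketch: under-sampling the majority class so each GP training subset is balanced, and a fitness function that averages per-class accuracy instead of the overall classification rate. This is a minimal, hypothetical illustration; the function names (`balanced_subsample`, `balanced_accuracy`) and the specific sampling scheme are assumptions for exposition and do not reproduce the paper's exact methods.

```python
import random


def balanced_subsample(X, y, rng=None):
    """Randomly under-sample the majority class so both classes
    contribute equally to a GP training subset.

    Illustrative only: the paper evaluates several sampling methods,
    not necessarily this one.
    """
    rng = rng or random.Random(0)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n_min = min(len(items) for items in by_class.values())
    sample_X, sample_y = [], []
    for label, items in by_class.items():
        for xi in rng.sample(items, n_min):
            sample_X.append(xi)
            sample_y.append(label)
    return sample_X, sample_y


def balanced_accuracy(predictions, y):
    """Average of per-class accuracies; unlike the overall
    classification rate, this fitness does not reward a classifier
    that simply predicts the majority class."""
    correct = {}
    total = {}
    for pred, yi in zip(predictions, y):
        total[yi] = total.get(yi, 0) + 1
        correct[yi] = correct.get(yi, 0) + (pred == yi)
    return sum(correct[c] / total[c] for c in total) / len(total)
```

For example, on a 90/10 imbalanced set, a majority-class predictor scores 0.9 overall accuracy but only 0.5 balanced accuracy, which is why the standard fitness biases evolution towards the majority class.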
Keywords: Genetic Programming · Pareto Front · Majority Class · Minority Class · Class Imbalance