Abstract
The problem of evolving binary classification models under increasingly unbalanced data sets is approached through a strategy with two components: sub-sampling and 'robust' fitness function design. Recent work in the wider machine learning literature has recognized that maintaining the original distribution of exemplars during training is often inappropriate for designing classifiers that are robust to degenerate classifier behavior. To this end we propose a 'Simple Active Learning Heuristic' (SALH), in which a subset of exemplars is sampled with uniform probability under a class-balance-enforcing rule for fitness evaluation. In addition, an efficient estimator of the Area Under the Curve (AUC) performance metric is adopted in the form of a modified Wilcoxon-Mann-Whitney (WMW) statistic. Performance is evaluated on six representative UCI data sets and benchmarked against canonical GP, SALH-based GP, SALH with the modified WMW statistic, and deterministic classifiers (Naive Bayes and C4.5). The resulting SALH-WMW model is demonstrated to be both efficient and effective at providing solutions that maximize performance assessed in terms of AUC.
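The two components named in the abstract can be illustrated with a minimal sketch. Note this is an assumption-laden illustration, not the paper's implementation: the function names are invented here, the class-balance rule is reduced to "equal counts per class", and the WMW statistic shown is the standard pairwise-ranking estimator of AUC rather than the paper's modified form.

```python
import random

def balanced_subsample(exemplars, labels, size, rng=random):
    """SALH-style sketch: draw up to `size` exemplars uniformly at random
    while enforcing an equal count from each class (hypothetical rule)."""
    pos = [x for x, y in zip(exemplars, labels) if y == 1]
    neg = [x for x, y in zip(exemplars, labels) if y == 0]
    half = size // 2
    # Truncate to the minority-class count when a class is too small.
    return (rng.sample(pos, min(half, len(pos))) +
            rng.sample(neg, min(half, len(neg))))

def wmw_auc(pos_scores, neg_scores):
    """Wilcoxon-Mann-Whitney estimate of AUC: the fraction of
    (positive, negative) score pairs ranked correctly, ties counting half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

In a GP setting, `wmw_auc` would be evaluated over the classifier's raw outputs on the balanced subsample and used directly as the fitness score, avoiding a full ROC-curve construction per individual.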
© 2008 Springer-Verlag Berlin Heidelberg
Doucette, J., Heywood, M.I. (2008). GP Classification under Imbalanced Data sets: Active Sub-sampling and AUC Approximation. In: O’Neill, M., et al. Genetic Programming. EuroGP 2008. Lecture Notes in Computer Science, vol 4971. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78671-9_23
Print ISBN: 978-3-540-78670-2
Online ISBN: 978-3-540-78671-9