Abstract
Classification on high-dimensional data with thousands to tens of thousands of dimensions is a challenging task due to the high dimensionality and the quality of the feature set. The problem can be addressed by using feature selection to choose only informative features or feature construction to create new high-level features. Genetic programming (GP) using a tree-based representation can be used for both feature construction and implicit feature selection. This work presents a comprehensive study investigating the use of GP for feature construction and selection on high-dimensional classification problems. Different combinations of the constructed and/or selected features are tested and compared on seven high-dimensional gene expression problems, and different classification algorithms are used to evaluate their performance. The results show that the constructed and/or selected feature sets can significantly reduce the dimensionality and maintain or even increase the classification accuracy in most cases. The cases where overfitting occurred are analysed via the distribution of features. Further analysis is also performed to show why the constructed feature can achieve promising classification performance.
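To make the idea of tree-based construction with implicit selection concrete, the following is a minimal sketch, not the authors' implementation. It assumes a GP individual is a nested tuple whose internal nodes are arithmetic operators and whose leaves are indices of original features. The function set, the example tree, the instance values, and the names `FUNCTIONS`, `evaluate`, and `selected_features` are all hypothetical and chosen only for illustration.

```python
import operator

# Hypothetical function set; the abstract does not list the operators used.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tree, x):
    """Return the constructed feature value of `tree` for one instance `x`."""
    if isinstance(tree, int):  # leaf: index of an original feature
        return x[tree]
    op, left, right = tree
    return FUNCTIONS[op](evaluate(left, x), evaluate(right, x))


def selected_features(tree):
    """Return the sorted indices of the original features used at the leaves."""
    if isinstance(tree, int):
        return [tree]
    _, left, right = tree
    return sorted(set(selected_features(left) + selected_features(right)))


# Illustrative tree encoding (x[3] * x[7]) - x[3]
tree = ("-", ("*", 3, 7), 3)

# Made-up instance with ten original features
instance = [0.5, 1.0, 2.0, 4.0, 0.0, 1.5, 3.0, 2.5, 9.0, 6.0]

constructed = evaluate(tree, instance)  # one new high-level feature
selected = selected_features(tree)      # features implicitly selected by the tree
```

In this sketch the tree's output serves as a single constructed feature, while the original features appearing at its leaves form the implicitly selected subset; either set, or their combination, could then be passed to a classifier.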
Cite this article
Tran, B., Xue, B. & Zhang, M. Genetic programming for feature construction and selection in classification on high-dimensional data. Memetic Comp. 8, 3–15 (2016). https://doi.org/10.1007/s12293-015-0173-y
DOI: https://doi.org/10.1007/s12293-015-0173-y