Genetic programming for feature construction and selection in classification on high-dimensional data

Abstract

Classification on high-dimensional data with thousands to tens of thousands of dimensions is a challenging task, owing both to the high dimensionality and to the quality of the feature set. The problem can be addressed by using feature selection to choose only informative features, or feature construction to create new high-level features. Genetic programming (GP) using a tree-based representation can be used for both feature construction and implicit feature selection. This work presents a comprehensive study investigating the use of GP for feature construction and selection on high-dimensional classification problems. Different combinations of the constructed and/or selected features are tested and compared on seven high-dimensional gene expression problems, and different classification algorithms are used to evaluate their performance. The results show that the constructed and/or selected feature sets can significantly reduce the dimensionality and maintain or even increase the classification accuracy in most cases. The cases in which overfitting occurred are analysed via the distribution of features. Further analysis is also performed to show why the constructed feature can achieve promising classification performance.
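The idea that a single GP tree both constructs a feature and implicitly selects features can be illustrated with a minimal sketch. It is not the authors' implementation: the nested-tuple tree encoding, the function set (including the protected division), the names `evaluate` and `selected_features`, and the toy data are all assumptions made for illustration. The root of the tree yields the constructed feature, and the original features appearing at its leaves form the implicitly selected subset.

```python
import operator


def protected_div(a, b):
    """Division that returns 1.0 when the divisor is zero (an assumed convention)."""
    return a / b if b != 0 else 1.0


# Assumed function set for internal tree nodes.
FUNCTIONS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": protected_div,
}


def evaluate(tree, row):
    """Evaluate a tree on one instance; the result is the constructed feature value."""
    if tree[0] == "x":  # terminal node: ("x", feature_index)
        return row[tree[1]]
    op, left, right = tree  # internal node: (function_name, left_subtree, right_subtree)
    return FUNCTIONS[op](evaluate(left, row), evaluate(right, row))


def selected_features(tree):
    """Collect the indices of the original features used at the leaves of the tree."""
    if tree[0] == "x":
        return {tree[1]}
    _, left, right = tree
    return selected_features(left) | selected_features(right)


# Hypothetical evolved individual: (x0 * x3) + (x3 - x5).
tree = ("add", ("mul", ("x", 0), ("x", 3)), ("sub", ("x", 3), ("x", 5)))

# Toy data: two instances with six original features each.
data = [
    [1.0, 9.0, 9.0, 2.0, 9.0, 0.5],
    [2.0, 0.0, 0.0, 4.0, 0.0, 1.0],
]

constructed = [evaluate(tree, row) for row in data]          # one new high-level feature
selected = sorted(selected_features(tree))                   # implicitly selected indices
reduced = [[row[i] for i in selected] for row in data]       # data restricted to that subset
```

Here the six original features are reduced either to the single constructed feature, to the three features used by the tree, or to a combination of both, which mirrors the kinds of feature sets the abstract describes comparing.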

Notes

  1. http://www.gems-system.org, http://csse.szu.edu.cn/staff/zhuzx/Datasets.html.

Author information

Corresponding author

Correspondence to Binh Tran.

About this article

Cite this article

Tran, B., Xue, B. & Zhang, M. Genetic programming for feature construction and selection in classification on high-dimensional data. Memetic Comp. 8, 3–15 (2016). https://doi.org/10.1007/s12293-015-0173-y

  • DOI: https://doi.org/10.1007/s12293-015-0173-y

Keywords

  • Genetic programming
  • Feature construction
  • Feature selection
  • Classification
  • High-dimensional data