Genetic Programming-Based Simultaneous Feature Selection and Imputation for Symbolic Regression with Incomplete Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12047)

Abstract

Symbolic regression via genetic programming has been used successfully for empirical modeling from given data sets. However, real-world data sets might contain missing values. Although there are different approaches to dealing with incomplete data sets for classification, symbolic regression with missing values has rarely been investigated. Similarly, only a few studies have been conducted on feature selection for symbolic regression, and none of them addresses incomplete data. In this work, a genetic programming-based method for simultaneous imputation and feature selection is developed. For each incomplete feature, the method selects its predictive features while constructing its imputation model. Such models are designed to be suitable for data sets with mixed numerical and categorical features. The performance of the proposed method is compared with state-of-the-art, widely used imputation methods from three aspects: the imputation accuracy, the feature selection effectiveness, and the symbolic regression performance.
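
The core idea in the abstract, that an imputation model built as an expression also selects its own predictors, can be illustrated with a small sketch. The Python below is not the authors' method: it leaves out the evolutionary search entirely and uses a hand-written expression tree, hypothetical feature names (`x0`, `x1`, `x2`), and toy numerical data. It only shows how the feature terminals that appear in a tree act as the selected features while the tree itself serves as the imputation model.

```python
import operator

# Function set for the expression trees (an assumed, minimal choice).
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def evaluate(tree, row):
    """Evaluate a tree on one row. Tuples are (op, left, right),
    strings are feature names, anything else is a numeric constant."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, row), evaluate(right, row))
    if isinstance(tree, str):
        return row[tree]
    return tree

def used_features(tree):
    """Features appearing as terminals are the implicitly selected ones."""
    if isinstance(tree, tuple):
        return used_features(tree[1]) | used_features(tree[2])
    return {tree} if isinstance(tree, str) else set()

def mse(tree, rows, target):
    """Fitness of a candidate model: error on rows where the target is observed."""
    complete = [r for r in rows if r[target] is not None]
    return sum((evaluate(tree, r) - r[target]) ** 2 for r in complete) / len(complete)

def impute(tree, rows, target):
    """Return copies of the rows with missing target values filled by the model."""
    return [
        dict(r, **{target: evaluate(tree, r)}) if r[target] is None else dict(r)
        for r in rows
    ]

# Hypothetical toy data: x2 is the incomplete feature, None marks a missing value.
rows = [
    {"x0": 1.0, "x1": 5.0, "x2": 3.0},
    {"x0": 2.0, "x1": 1.0, "x2": 5.0},
    {"x0": 3.0, "x1": 4.0, "x2": None},
    {"x0": 4.0, "x1": 2.0, "x2": None},
]

# A candidate imputation model for x2, written by hand: x2 = 2 * x0 + 1.
candidate = ("add", ("mul", 2.0, "x0"), 1.0)

selected = used_features(candidate)   # only x0 appears, so x1 is not selected
error = mse(candidate, rows, "x2")    # error on the two complete rows
filled = impute(candidate, rows, "x2")
```

In a genetic programming setting, many such trees would be generated and varied, with a fitness such as `mse` guiding the search. Which features survive in the best tree is then a by-product of that search rather than a separate selection step.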

Author information

Corresponding authors

Correspondence to Baligh Al-Helali, Qi Chen, Bing Xue or Mengjie Zhang.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Al-Helali, B., Chen, Q., Xue, B., Zhang, M. (2020). Genetic Programming-Based Simultaneous Feature Selection and Imputation for Symbolic Regression with Incomplete Data. In: Palaiahnakote, S., Sanniti di Baja, G., Wang, L., Yan, W. (eds) Pattern Recognition. ACPR 2019. Lecture Notes in Computer Science, vol 12047. Springer, Cham. https://doi.org/10.1007/978-3-030-41299-9_44

  • DOI: https://doi.org/10.1007/978-3-030-41299-9_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-41298-2

  • Online ISBN: 978-3-030-41299-9

  • eBook Packages: Computer Science, Computer Science (R0)
