Mining Feature Relationships in Data

  • Conference paper
Genetic Programming (EuroGP 2021)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12691)

Abstract

When faced with a new dataset, most practitioners begin by performing exploratory data analysis to discover interesting patterns and characteristics within data. Techniques such as association rule mining are commonly applied to uncover relationships between features (attributes) of the data. However, association rules are primarily designed for use on binary or categorical data, due to their use of rule-based machine learning. A large proportion of real-world data is continuous in nature, and discretisation of such data leads to inaccurate and less informative association rules. In this paper, we propose an alternative approach called feature relationship mining (FRM), which uses a genetic programming approach to automatically discover symbolic relationships between continuous or categorical features in data. To the best of our knowledge, our proposed approach is the first such symbolic approach with the goal of explicitly discovering relationships between features. Empirical testing on a variety of real-world datasets shows the proposed method is able to find high-quality, simple feature relationships which can be easily interpreted and which provide clear and non-trivial insight into data.

Notes

  1. For example, mutating the 0.71 node of \(x = f_1 \times (f_0 + 0.71)\) using a traditional mutation would give a new value in U[0, 1]. While local-search approaches can be used to optimise constants more cautiously, it is best if they can be avoided completely.

  2. Two features are defined to be linearly correlated if they have an absolute Pearson’s correlation greater than 0.95.

  3. Note that \(\text{Fitness} = \text{Cost} + \alpha \times \text{Nodes}\), but we also list the fitness separately for completeness.

  4. Only the top five FRs are considered to make the plots easier to analyse.
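The definitions in notes 2 and 3 can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the example data, and the value of alpha are hypothetical, and the cost term is taken as a given number because its definition does not appear in this excerpt.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / math.sqrt(var_x * var_y)


def linearly_correlated(xs, ys, threshold=0.95):
    """Note 2: absolute Pearson's correlation greater than 0.95."""
    return abs(pearson(xs, ys)) > threshold


def fitness(cost, nodes, alpha):
    """Note 3: Fitness = Cost + alpha * Nodes.

    `cost` is assumed to be computed elsewhere; `alpha` weights tree size.
    """
    return cost + alpha * nodes


# Hypothetical data: f1 is an exact negative multiple of f0, f2 is not
# linearly related to f0.
f0 = [1.0, 2.0, 3.0, 4.0]
f1 = [-2.0, -4.0, -6.0, -8.0]
f2 = [1.0, -1.0, 1.0, -1.0]

print(linearly_correlated(f0, f1))  # strong negative correlation still counts
print(linearly_correlated(f0, f2))
print(fitness(cost=0.2, nodes=5, alpha=0.01))  # alpha chosen only for illustration
```

Because the threshold is applied to the absolute value, strongly negative correlations are treated the same as strongly positive ones. The size penalty in the fitness grows linearly with the number of nodes, so of two relationships with equal cost the smaller tree receives the lower fitness value.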

Author information

Corresponding author

Correspondence to Andrew Lensen.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Lensen, A. (2021). Mining Feature Relationships in Data. In: Hu, T., Lourenço, N., Medvet, E. (eds) Genetic Programming. EuroGP 2021. Lecture Notes in Computer Science, vol 12691. Springer, Cham. https://doi.org/10.1007/978-3-030-72812-0_16

  • DOI: https://doi.org/10.1007/978-3-030-72812-0_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-72811-3

  • Online ISBN: 978-3-030-72812-0

  • eBook Packages: Computer Science, Computer Science (R0)
