Genome-Wide Genetic Analysis Using Genetic Programming: The Critical Need for Expert Knowledge

Part of the Genetic and Evolutionary Computation book series (GEVO)


Human genetics is undergoing an information explosion. The availability of chip-based technology facilitates the measurement of thousands of DNA sequence variations from across the human genome. The challenge is to sift through these high-dimensional datasets to identify combinations of interacting DNA sequence variations that are predictive of common diseases. The goal of this study is to develop and evaluate a genetic programming (GP) approach to attribute selection and classification in this domain. We simulated genetic datasets of varying size in which the disease model consists of two interacting DNA sequence variations that exhibit no independent effects on class (i.e., epistasis). We show that GP is no better than a simple random search when classification accuracy is used as the fitness function. We then show that including pre-processed estimates of attribute quality, computed with Tuned ReliefF (TuRF), in a multi-objective fitness function that also includes accuracy significantly improves the performance of GP over that of random search. This study demonstrates that GP may be a useful computational discovery tool in this domain. It also raises important questions about the general utility of GP for these types of problems, the importance of data preprocessing, the ideal functional form of the fitness function, and the importance of expert knowledge. We anticipate that this study will provide an important baseline for future studies investigating the usefulness of GP as a general computational discovery tool for large-scale genetic studies.
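The two ideas at the core of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The penetrance table, the genotype frequencies (allele frequency 0.5), the helper names, the weight `alpha`, and the weighted-sum form of the fitness are all assumptions chosen for illustration; the abstract itself leaves the functional form of the fitness function open. The first function shows how a two-locus model can carry no independent (marginal) effects, and the second shows one plausible way to combine accuracy with pre-computed attribute quality scores such as those from TuRF.

```python
# Genotype frequencies for one SNP with allele frequency 0.5
# (an assumption for this sketch): AA, Aa, aa.
GENOTYPE_FREQ = [0.25, 0.5, 0.25]

# Hypothetical penetrance table: P(disease | genotype at SNP A, genotype at SNP B).
# Rows index SNP A (AA, Aa, aa); columns index SNP B (BB, Bb, bb).
PENETRANCE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]


def marginal_penetrance(table, freqs, axis):
    """Disease probability per genotype of one SNP, averaged over the other SNP.

    axis=0 gives the marginals for SNP A (rows), axis=1 for SNP B (columns).
    """
    if axis == 0:
        return [sum(f * p for f, p in zip(freqs, row)) for row in table]
    columns = list(zip(*table))
    return [sum(f * p for f, p in zip(freqs, col)) for col in columns]


def combined_fitness(accuracy, quality_scores, alpha=0.5):
    """Weighted sum of accuracy and mean attribute quality.

    This form and the default alpha are assumptions, not the paper's definition.
    quality_scores holds pre-computed scores (e.g. from TuRF, scaled to [0, 1])
    for the attributes used by a candidate model.
    """
    mean_quality = sum(quality_scores) / len(quality_scores)
    return accuracy + alpha * mean_quality


if __name__ == "__main__":
    # Every genotype of either SNP has the same disease risk on its own,
    # so neither SNP is detectable without the other.
    print(marginal_penetrance(PENETRANCE, GENOTYPE_FREQ, axis=0))
    print(marginal_penetrance(PENETRANCE, GENOTYPE_FREQ, axis=1))

    # Two candidate models with identical (chance-level) accuracy:
    # accuracy alone cannot rank them, the quality term can.
    print(combined_fitness(0.5, [0.9, 0.8]))
    print(combined_fitness(0.5, [0.1, 0.2]))
```

With this table every marginal penetrance equals 0.05, so a search guided by accuracy alone gets no signal from a model that contains only one of the two functional variations. The quality term gives such partial solutions a gradient to follow, which is the role the abstract assigns to expert knowledge.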


Keywords: genetic programming, human genetics, expert knowledge, epistasis, multifactor dimensionality reduction





Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  1. Computational Genetics Laboratory, Department of Genetics, Dartmouth Medical School, Dartmouth
