Genome-Wide Genetic Analysis Using Genetic Programming: The Critical Need for Expert Knowledge

Moore, Jason H.; White, Bill C.

doi:10.1007/978-0-387-49650-4_2


Part of the book series: Genetic and Evolutionary Computation (GEVO)


Abstract

Human genetics is undergoing an information explosion. The availability of chip-based technology facilitates the measurement of thousands of DNA sequence variations from across the human genome. The challenge is to sift through these high-dimensional datasets to identify combinations of interacting DNA sequence variations that are predictive of common diseases. The goal of this study is to develop and evaluate a genetic programming (GP) approach to attribute selection and classification in this domain. We simulated genetic datasets of varying size in which the disease model consists of two interacting DNA sequence variations that exhibit no independent effects on class (i.e., epistasis). We show that GP is no better than a simple random search when classification accuracy is used as the fitness function. We then show that a multi-objective fitness function that combines accuracy with pre-processed estimates of attribute quality from Tuned ReliefF (TuRF) significantly improves the performance of GP over that of random search. This study demonstrates that GP may be a useful computational discovery tool in this domain. It also raises important questions about the general utility of GP for these types of problems, the importance of data preprocessing, the ideal functional form of the fitness function, and the importance of expert knowledge. We anticipate that this study will provide an important baseline for future studies investigating the usefulness of GP as a general computational discovery tool for large-scale genetic studies.
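
The contrast described in the abstract can be illustrated with a small sketch. It is not the chapter's implementation: the toy XOR-style disease model, the lookup-table classifier, the weighted-sum form of the fitness function, the `alpha` weight, and the `QUALITY` scores (hypothetical stand-ins for pre-computed TuRF estimates, not actual TuRF output) are all assumptions made for illustration. The sketch shows why accuracy alone gives a search no gradient under pure epistasis, and how an attribute-quality term can supply one.

```python
from collections import Counter, defaultdict
from itertools import product

# Genotype codes 0, 1, 2 with the heterozygote (1) listed twice, so the
# enumerated frequencies are 1/4, 1/2, 1/4 (allele frequency 0.5).
GENOTYPES = [0, 1, 1, 2]

# Hypothetical attribute-quality scores, one per SNP. These stand in for
# pre-processed TuRF estimates; they are NOT computed by TuRF here.
QUALITY = [0.8, 0.7, 0.1]


def make_dataset():
    """Enumerate three SNPs. Class is 1 when exactly one of SNP0 and SNP1
    is heterozygous (an XOR-like epistatic model); SNP2 is pure noise."""
    return [((g0, g1, g2), int((g0 == 1) != (g1 == 1)))
            for g0, g1, g2 in product(GENOTYPES, repeat=3)]


def accuracy(data, subset):
    """Accuracy of a majority-vote lookup table built on the chosen SNPs."""
    cells = defaultdict(Counter)
    for genotypes, label in data:
        key = tuple(genotypes[i] for i in subset)
        cells[key][label] += 1
    # Each genotype cell predicts its majority class.
    correct = sum(max(counts.values()) for counts in cells.values())
    return correct / len(data)


def fitness(data, subset, quality, alpha=0.5):
    """Assumed weighted sum of accuracy and mean attribute quality."""
    mean_quality = sum(quality[i] for i in subset) / len(subset)
    return alpha * accuracy(data, subset) + (1 - alpha) * mean_quality


data = make_dataset()
for subset in [(0,), (2,), (0, 2), (0, 1)]:
    print(subset, accuracy(data, subset), round(fitness(data, subset, QUALITY), 3))
```

In this toy setting, every subset that lacks one of the two interacting SNPs scores an accuracy of 0.5, so a model containing one functional SNP looks no better than one containing only noise. With the quality term added, the subset `(0,)` scores higher than `(2,)`, which rewards partial solutions that a purely accuracy-driven search cannot distinguish.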

Author information

Authors and Affiliations

Computational Genetics Laboratory, Department of Genetics, Dartmouth Medical School, Dartmouth
Jason H. Moore & Bill C. White

Editor information

Editors and Affiliations

Center for the Study of Complex Systems, University of Michigan, USA
Rick Riolo
University of Idaho, USA
Terence Soule
Genetics Squared, Inc., USA
Bill Worzel

About this chapter

Cite this chapter

Moore, J.H., White, B.C. (2007). Genome-Wide Genetic Analysis Using Genetic Programming: The Critical Need for Expert Knowledge. In: Riolo, R., Soule, T., Worzel, B. (eds) Genetic Programming Theory and Practice IV. Genetic and Evolutionary Computation. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-49650-4_2

DOI: https://doi.org/10.1007/978-0-387-49650-4_2
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-33375-5
Online ISBN: 978-0-387-49650-4
eBook Packages: Computer Science; Computer Science (R0)
