Genome-Wide Genetic Analysis Using Genetic Programming: The Critical Need for Expert Knowledge
Human genetics is undergoing an information explosion. The availability of chip-based technology facilitates the measurement of thousands of DNA sequence variation from across the human genome. The challenge is to sift through these high-dimensional datasets to identify combinations of interacting DNA sequence variations that are predictive of common diseases. The goal of this study is to develop and evaluate a genetic programming (GP) approach to attribute selection and classification in this domain. We simulated genetic datasets of varying size in which the disease model consists of two interacting DNA sequence variations that exhibit no independent effects on class (i.e. epistasis). We show that GP is no better than a simple random search when classification accuracy is used as the fitness function. We then show that including pre-processed estimates of attribute quality using Tuned ReliefF (TuRF) in a multi-objective fitness function that also includes accuracy significantly improves the performance of GP over that of random search. This study demonstrates that GP may be a useful computational discovery tool in this domain. This study raises important questions about the general utility of GP for these types of problems, the importance of data preprocessing, the ideal functional form of the fitness function, and the importance of expert knowledge. We anticipate this study will provide an important baseline for future studies investigating the usefulness of GP as a general computational discovery tool for large-scale genetic studies.
Keywordsgenetic programming human genetics expert knowledge epistasis multifactor dimensionality reduction
Unable to display preview. Download preview PDF.