Mining Epistatic Interactions from High-Dimensional Data Sets

  • Xia Jiang
  • Shyam Visweswaran
  • Richard E. Neapolitan
Part of the Intelligent Systems Reference Library book series (ISRL, volume 25)


Genetic epidemiologists strive to determine the genetic profile of diseases. Mendelian diseases such as cystic fibrosis are associated with variation at a single genetic locus. In contrast, two or more genes can interact to have a causal effect on disease even when little or no such effect can be observed statistically for some or even all of the genes individually. This gene-gene interaction is called epistasis. To uncover this "dark matter" of genetic risk, it is pivotal to be able to discover epistatic relationships from data. The recent availability of high-dimensional data sets affords an unprecedented opportunity to make headway on this problem. However, two central barriers stand in the way of identifying genetic interactions from such data sets. First, epistatic interactions are difficult to detect using parametric statistical methods such as logistic regression, owing to the sparseness of the data and the non-linearity of the relationships. Second, the number of candidate models in a high-dimensional data set is forbiddingly large. This paper describes recent research addressing these two barriers. To address the first barrier, the first author and colleagues developed a specialized Bayesian network model for representing the relationship between features and disease, and a Bayesian network scoring criterion tailored to this model; this research is summarized in Section 2. To address the second barrier, the first author and colleagues developed an enhancement of Greedy Equivalence Search; this research is discussed in Section 3. Background is provided in Section 1.
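
The two barriers named in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' model or data: it assumes a hypothetical pair of binary loci with an XOR-style penetrance (disease occurs only when exactly one locus carries the variant) and a hypothetical data set of 500,000 loci. Under those assumptions, neither locus shows any marginal effect, and the count of one- and two-locus candidate models is already very large.

```python
from itertools import product
from math import comb


def marginal_and_joint_risk(p_variant=0.5):
    """Toy XOR model (an illustrative assumption, not taken from the chapter)."""
    # joint maps genotype (a, b) -> (P(genotype), P(disease | genotype)).
    joint = {}
    for a, b in product((0, 1), repeat=2):
        p_a = p_variant if a else 1 - p_variant
        p_b = p_variant if b else 1 - p_variant
        joint[(a, b)] = (p_a * p_b, float(a != b))

    def risk_given(locus, value):
        # P(disease | one locus fixed), averaging over the other locus.
        num = sum(p * r for g, (p, r) in joint.items() if g[locus] == value)
        den = sum(p for g, (p, _) in joint.items() if g[locus] == value)
        return num / den

    marginal = {(locus, v): risk_given(locus, v) for locus in (0, 1) for v in (0, 1)}
    penetrance = {g: r for g, (_, r) in joint.items()}
    return marginal, penetrance


def candidate_models(n_loci, max_order):
    """Number of locus subsets of size 1..max_order that could be scored."""
    return sum(comb(n_loci, k) for k in range(1, max_order + 1))


if __name__ == "__main__":
    marginal, penetrance = marginal_and_joint_risk()
    print("risk given a single locus:", marginal)   # identical for every value
    print("risk given both loci:", penetrance)      # fully determined by the pair
    # 500,000 loci is a hypothetical figure chosen only for illustration.
    print("one- and two-locus models:", candidate_models(500_000, 2))
```

In this toy setting each single-locus risk equals the overall risk, so a one-locus-at-a-time analysis would see nothing, while the two-locus genotype determines disease status completely. Exhaustively scoring higher-order combinations grows combinatorially, which is why a heuristic search is needed.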


Keywords: Bayesian Network; Directed Acyclic Graph; Epistatic Interaction; Multifactor Dimensionality Reduction; APOE Gene





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Xia Jiang (1)
  • Shyam Visweswaran (1)
  • Richard E. Neapolitan (2)
  1. Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, USA
  2. Department of Computer Science, Northeastern Illinois University, Chicago, USA
