Feature Selection for Detecting Gene-Gene Interactions in Genome-Wide Association Studies

  • Faramarz Dorani
  • Ting Hu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10784)


Disease association studies aim to find the genetic variations underlying complex human diseases in order to better understand disease etiology and to improve diagnosis, treatment, and even prevention. Non-linear interactions among multiple genetic factors play an important role in finding those genetic variations, but they have not always been taken fully into account. The reason is that searching combinations of interacting genetic factors becomes prohibitive, because the search space grows exponentially with the size of the data. The search is especially challenging for genome-wide association studies (GWAS), where typically more than a million single-nucleotide polymorphisms (SNPs) are under consideration. Dimensionality reduction is thus needed so that only the subset of genetic attributes most likely to have interaction effects is investigated. In this article, we conduct a comprehensive study of six widely used machine learning feature selection methods for filtering interacting SNPs, rather than SNPs with strong individual main effects. The six methods are chi-square, logistic regression, odds ratio, and three Relief-based algorithms. Applying all six methods to both a simulated and a real GWAS dataset, we report that the Relief-based methods perform best at filtering SNPs that are associated with a disease through strong interaction effects.
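
The contrast between a main-effect filter and a Relief-style filter can be illustrated with a minimal sketch. It is not the authors' implementation or data. It assumes a toy dataset of four binary attributes (real SNP genotypes usually take three values), a hypothetical disease label defined as the XOR of the first two attributes, and a simplified Relief variant that uses Hamming distance and averages over tied nearest neighbours. The names `chi_square`, `relief`, `rows`, and `labels` are illustrative only.

```python
from itertools import product


def chi_square(feature, labels):
    """Pearson chi-square statistic for one categorical feature vs. class."""
    n = len(labels)
    stat = 0.0
    for f in set(feature):
        for c in set(labels):
            observed = sum(1 for x, y in zip(feature, labels) if x == f and y == c)
            expected = feature.count(f) * labels.count(c) / n
            stat += (observed - expected) ** 2 / expected
    return stat


def relief(rows, labels):
    """Simplified Relief: Hamming distance, ties among nearest neighbours averaged."""
    n, p = len(rows), len(rows[0])
    weights = [0.0] * p

    def dist(a, b):
        return sum(x != y for x, y in zip(a, b))

    def nearest(i, same_class):
        # All other instances of the same (hits) or the opposite (misses) class.
        cands = [j for j in range(n)
                 if j != i and (labels[j] == labels[i]) == same_class]
        d_min = min(dist(rows[i], rows[j]) for j in cands)
        return [j for j in cands if dist(rows[i], rows[j]) == d_min]

    for i in range(n):
        hits, misses = nearest(i, True), nearest(i, False)
        for f in range(p):
            diff_hit = sum(rows[i][f] != rows[j][f] for j in hits) / len(hits)
            diff_miss = sum(rows[i][f] != rows[j][f] for j in misses) / len(misses)
            # Reward attributes that separate classes, penalise those that
            # differ between neighbours of the same class.
            weights[f] += (diff_miss - diff_hit) / n
    return weights


# Assumed toy data: every combination of four binary attributes, with the
# label set to the XOR of attributes 0 and 1 (a pure interaction effect).
rows = [list(r) for r in product([0, 1], repeat=4)]
labels = [r[0] ^ r[1] for r in rows]

chi2_scores = [chi_square([r[f] for r in rows], labels) for f in range(4)]
relief_scores = relief(rows, labels)
print("chi-square:", chi2_scores)
print("Relief:    ", relief_scores)
```

On this balanced toy dataset the chi-square statistic is zero for every attribute, since neither interacting attribute has a main effect, while the Relief weights rank attributes 0 and 1 above the two irrelevant attributes. A real GWAS dataset would be noisy and far larger, so such a clean separation should not be expected.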


Feature selection, Relief algorithms, Information gain, Gene-gene interactions, Genome-wide association studies



This research was supported by the Newfoundland and Labrador Research and Development Corporation (RDC) Ignite Grant 5404.1942.101 and the Natural Sciences and Engineering Research Council (NSERC) of Canada Discovery Grant RGPIN-2016-04699 to TH.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science, Memorial University, St. John’s, Canada
