Evaluation of Phenotype Classification Methods for Obesity Using Direct to Consumer Genetic Data

  • Casimiro Aday Curbelo Montañez
  • Paul Fergus
  • Abir Hussain
  • Dhiya Al-Jumeily
  • Mehmet Tevfik Dorak
  • Rosni Abdullah
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10362)


Direct-to-consumer genetic testing services are becoming increasingly common. Consumers of these services share their genetic and clinical information with the research community to facilitate the extraction of knowledge about different conditions. In this paper, we build on these services to analyse the genetic data of people with different BMI levels and determine the immediate and long-term risk factors associated with obesity. Using web scraping techniques, we created a dataset containing publicly available information about 230 participants from the Personal Genome Project. We then analysed the dataset to identify genetic variants associated with high BMI levels, following standard quality control and association analysis protocols for genome-wide association analysis. We applied a Random Forest based feature selection algorithm (Boruta) in combination with a Support Vector Machine with a Radial Basis Function kernel to the filtered dataset. Following a robust data science methodology, our approach identified obesity-related genetic variants to be used as features when predicting individual obesity susceptibility. The results reveal that the subset of features obtained through the Random Forest based algorithm improves classifier performance compared with the top statistically significant genetic variants identified by logistic regression. The Support Vector Machine showed the best results, with sensitivity = 81%, specificity = 83% and area under the curve = 92%, when the model was trained with the top fifteen features selected by Boruta.
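The three figures reported for the classifier (sensitivity, specificity and area under the curve) can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The function names `sensitivity_specificity` and `auc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and none of the numbers come from the paper. Cases are coded as 1 and controls as 0.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int],
                            y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels (1 = case, 0 = control)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Sensitivity: fraction of cases detected; specificity: fraction of controls cleared.
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not taken from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

sens, spec = sensitivity_specificity(y_true, y_pred)
roc_auc = auc(y_true, scores)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} AUC={roc_auc:.4f}")
```

Sensitivity and specificity depend on the chosen decision threshold, whereas the area under the curve summarises ranking quality across all thresholds. This is why a model can report an AUC noticeably higher than either of the other two figures.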


Keywords: Bioinformatics, Data science, Machine learning, Feature selection, Genetics, Obesity, SNPs



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Casimiro Aday Curbelo Montañez (1)
  • Paul Fergus (1)
  • Abir Hussain (1)
  • Dhiya Al-Jumeily (1)
  • Mehmet Tevfik Dorak (2)
  • Rosni Abdullah (3)

  1. Liverpool John Moores University, Liverpool, UK
  2. Liverpool Hope University, Liverpool, UK
  3. Universiti Sains Malaysia, George Town, Malaysia
