Abstract
Direct-to-consumer genetic testing services are becoming increasingly common. Consumers of these services are sharing their genetic and clinical information with the research community to help extract knowledge about different conditions. In this paper, we build on these services to analyse the genetic data of people with different BMI levels and determine the immediate and long-term risk factors associated with obesity. Using web scraping techniques, we created a dataset containing publicly available information about 230 participants from the Personal Genome Project. We then analysed the dataset to identify genetic variants associated with high BMI levels, following standard quality control and association analysis protocols for genome-wide association analysis. We applied a Random Forest based feature selection algorithm (Boruta) together with a Support Vector Machine with a Radial Basis Function kernel to the filtered dataset. This approach identified obesity-related genetic variants to be used as features when predicting individual obesity susceptibility. The results show that the subset of features obtained through the Random Forest based algorithm improves classifier performance compared with the top statistically significant genetic variants identified by logistic regression. The Support Vector Machine gave the best results, with sensitivity = 81%, specificity = 83% and area under the curve = 92%, when the model was trained with the top fifteen features selected by Boruta.
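As an illustration of how the three performance figures reported in the abstract are defined, the following minimal sketch computes sensitivity, specificity and area under the ROC curve for a binary classifier. It is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores are all hypothetical and chosen only to make the definitions concrete.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity); label 1 = case, label 0 = control."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls wrongly flagged
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 4 cases and 4 controls with classifier scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
# Assumed decision threshold of 0.5 for turning scores into class labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} auc={auc:.3f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas the area under the curve summarises ranking quality across all thresholds, which is why a model can report an AUC noticeably higher than either of the other two figures.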
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Curbelo Montañez, C.A., Fergus, P., Hussain, A., Al-Jumeily, D., Dorak, M.T., Abdullah, R. (2017). Evaluation of Phenotype Classification Methods for Obesity Using Direct to Consumer Genetic Data. In: Huang, DS., Jo, KH., Figueroa-García, J. (eds) Intelligent Computing Theories and Application. ICIC 2017. Lecture Notes in Computer Science, vol 10362. Springer, Cham. https://doi.org/10.1007/978-3-319-63312-1_31
DOI: https://doi.org/10.1007/978-3-319-63312-1_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63311-4
Online ISBN: 978-3-319-63312-1
eBook Packages: Computer Science, Computer Science (R0)