A Simulation Study Comparing SNP Based Prediction Models of Drug Response

  • Wencan Zhang (email author)
  • Pingye Zhang
  • Feng Gao
  • Yonghong Zhu
  • Ray Liu
Conference paper
Part of the Springer Proceedings in Mathematics & Statistics book series (PROMS, volume 218)


Lack of replication of findings and missing heritability are two of the major challenges in pharmacogenetics (PGx) studies. Recently developed statistical methods for genome-wide association studies offer greater power both to identify relevant genetic markers and to predict drug response or phenotype from these markers. However, the relative performance of these methods has not been thoroughly studied. Here, we present three simulation studies that compare the performance of these analysis methods. In the first simulation, we compared five approaches: Elastic Net (EN), Genome-wide Association Study (GWAS)+EN, Principal Component Regression (PCR), Random Forest (RF) and Support Vector Machine (SVM). EN had the smallest test mean squared error (MSE) and the highest proportion of causal SNPs among the identified SNPs. In the second simulation, we compared three approaches: GWAS+EN, GWAS+RF and GWAS+SVM. GWAS+RF had the smallest test MSE and the highest percentage of causal SNPs among the identified SNPs. In the third simulation, we compared two cross validation procedures: GWAS+EN versus GWAS+EN with a modified learn-and-confirm cross validation. The latter approach demonstrated better prediction accuracy at the expense of greatly increased computational time.
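To make the two evaluation criteria named in the abstract concrete (test MSE and the share of causal SNPs among the identified SNPs), the following is a minimal sketch of a GWAS+EN style pipeline on simulated data. It is not the authors' simulation design: the sample size, number of SNPs, minor allele frequency, effect size, screening size `k`, the penalty settings `lam` and `alpha`, and all function names are assumptions chosen for illustration. The "GWAS" step is approximated here by ranking SNPs on their squared marginal correlation with the response, and the elastic net is fitted with a plain coordinate descent loop.

```python
import random

def simulate(n, p, causal, effect, maf, rng):
    """Additive genotypes (0/1/2) and a continuous response with N(0, 1) noise."""
    X = [[(rng.random() < maf) + (rng.random() < maf) for _ in range(p)]
         for _ in range(n)]
    y = [effect * sum(row[j] for j in causal) + rng.gauss(0.0, 1.0) for row in X]
    return X, y

def gwas_screen(X, y, k):
    """Rank SNPs by squared marginal correlation with y and keep the top k."""
    n = len(y)
    my = sum(y) / n
    syy = sum((v - my) ** 2 for v in y)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mx = sum(col) / n
        sxx = sum((v - mx) ** 2 for v in col)
        sxy = sum((a - mx) * (b - my) for a, b in zip(col, y))
        r2 = sxy * sxy / (sxx * syy) if sxx > 0 and syy > 0 else 0.0
        scores.append((r2, j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

def soft_threshold(z, t):
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net(X, y, cols, lam, alpha, n_iter=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2).

    Assumes lam * (1 - alpha) > 0 so the update denominator is never zero.
    """
    n = len(y)
    means = [sum(row[j] for row in X) / n for j in cols]
    Z = [[row[j] - m for row in X] for j, m in zip(cols, means)]  # centered columns
    my = sum(y) / n
    resid = [v - my for v in y]
    beta = [0.0] * len(cols)
    norms = [sum(v * v for v in z) / n for z in Z]
    for _ in range(n_iter):
        for k, z in enumerate(Z):
            rho = sum(a * r for a, r in zip(z, resid)) / n + norms[k] * beta[k]
            new = soft_threshold(rho, lam * alpha) / (norms[k] + lam * (1 - alpha))
            if new != beta[k]:
                delta = new - beta[k]
                resid = [r - delta * a for r, a in zip(resid, z)]
                beta[k] = new
    intercept = my - sum(b * m for b, m in zip(beta, means))
    return intercept, beta

def predict(X, cols, intercept, beta):
    return [intercept + sum(b * row[j] for b, j in zip(beta, cols)) for row in X]

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

rng = random.Random(0)
causal = [0, 1, 2, 3, 4]                      # hypothetical causal SNP indices
X, y = simulate(300, 200, causal, 1.0, 0.3, rng)
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

selected = gwas_screen(X_train, y_train, 20)  # screening uses training data only
intercept, beta = elastic_net(X_train, y_train, selected, lam=0.05, alpha=0.5)
kept = [j for j, b in zip(selected, beta) if b != 0.0]

test_mse = mse(y_test, predict(X_test, selected, intercept, beta))
baseline_mse = mse(y_test, [sum(y_train) / len(y_train)] * len(y_test))
causal_share = len(set(kept) & set(causal)) / len(kept) if kept else 0.0

print("test MSE:", round(test_mse, 3), "baseline MSE:", round(baseline_mse, 3))
print("share of causal SNPs among identified SNPs:", round(causal_share, 3))
```

In this sketch the screening step sees only the training rows, so the test MSE is not contaminated by selection on the held-out data. In a cross validation setting the same idea would mean repeating the screening inside every fold, which is one plausible reason a more careful procedure costs more computation.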


Genomics, GWAS, Predictive modeling, Machine learning, Cross validation



Useful discussions with Dr. Zheng Zha and reviews by Dr. Yu-chen Su at Takeda Pharmaceutical Develop Center are highly appreciated.

Conflict of Interest

The project was carried out while Dr. Pingye Zhang was a summer intern at the Takeda Develop Center in Deerfield, IL, USA. All other authors were Takeda employees at the time. Because the research compares statistical methodologies and cross validation procedures, there is no conflict of interest.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Wencan Zhang (1), email author
  • Pingye Zhang (2)
  • Feng Gao (3)
  • Yonghong Zhu (4)
  • Ray Liu (5)
  1. Takeda Develop Center, B3 4202A, One Takeda PKWY, Deerfield, USA
  2. Merck, Rahway, USA
  3. Biogen, Cambridge, USA
  4. Shanghai Henlius Biotech Inc, Shanghai, China
  5. Takeda, Cambridge, USA
