Leveraging Big Data to Transform Drug Discovery

Part of the Methods in Molecular Biology book series (MIMB, volume 1939)


The surge in publicly available disease- and drug-related data has facilitated the application of computational methodologies to transform drug discovery. In this chapter, we outline the resources and tools one can leverage to perform such analyses. We then describe in depth the in silico workflows of two recent studies that identified possible novel indications for existing drugs. Lastly, we discuss the caveats and considerations of this process to enable other researchers to perform rigorous computational drug discovery experiments of their own.

Key words

Systems pharmacology; Drug discovery; Big data; Electronic medical records; Clinical informatics; Bioinformatics; Drug repurposing; Drug repositioning; Gene expression data; Pharmacogenomics



The research is supported by R21 TR001743, U24 DK116214, and K01 ES028047 (to BC). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Bakar Computational Health Sciences Institute, University of California, San Francisco, USA
  2. Department of Genetics and Genomic Sciences, Institute of Next Generation Healthcare, Icahn School of Medicine at Mount Sinai, New York, USA
  3. Sema4, A Mount Sinai Venture, Stamford, USA
  4. Department of Pediatrics and Human Development, Michigan State University, Grand Rapids, USA
  5. Department of Pharmacology and Toxicology, Michigan State University, Grand Rapids, USA