How to Prepare a Compound Collection Prior to Virtual Screening

  • Cristian G. Bologa
  • Oleg Ursu
  • Tudor I. Oprea
Part of the Methods in Molecular Biology book series (MIMB, volume 1939)


Virtual screening is a well-established technique that has proven successful in identifying novel biologically active molecules and in drug repurposing. Whether the screen is ligand-based or structure-based, a chemical collection must be properly processed before in silico evaluation. Here we describe our step-by-step procedure for handling very large collections (up to billions of compounds) prior to virtual screening.

Key words

Cheminformatics; Drug discovery; Online services; Property filtering; Unwanted structures; Virtual screening



This work was supported, in part, by NIH grants R21GM095952, U54MH084690, and U24CA224370. We thank Jeremy Yang for useful discussions.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Cristian G. Bologa (2)
  • Oleg Ursu (1, 2)
  • Tudor I. Oprea (2)
  1. Merck Research Laboratories, Boston, USA
  2. Division of Translational Informatics, Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, USA
