Automated Inference of Chemical Discriminants of Biological Activity

  • Protocol

Part of the book series: Methods in Molecular Biology (MIMB, volume 1762)

Abstract

Ligand-based virtual screening has become a standard technique for the efficient discovery of bioactive small molecules. Once the compounds selected by virtual screening have been assayed for activity, or whenever dozens to thousands of molecules have been tested by other approaches, machine learning techniques make it straightforward to discover the patterns of chemical groups that correlate with the desired biological activity. Defining the chemical features that confer activity can guide the selection of molecules for subsequent rounds of screening and assaying, and can help design new, more active molecules for organic synthesis.

The quantitative structure–activity relationship machine learning protocols we describe here, using decision trees, random forests, and sequential feature selection, take as input the chemical structure of a single, known active small molecule (e.g., an inhibitor, agonist, or substrate) for comparison with the structure of each tested molecule. Knowledge of the atomic structure of the protein target and its interactions with the active compound is not required. These protocols can be modified and applied to any data set that consists of a series of measured structural, chemical, or other features for each tested molecule, along with the experimentally measured value of the response variable you would like to predict or optimize for your project, for instance, inhibitory activity in a biological assay or the ΔG of binding. To illustrate the use of different machine learning algorithms, we step through the analysis of a dataset of inhibitor candidates from virtual screening that were tested recently for their ability to inhibit GPCR-mediated signaling in a vertebrate.
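As a rough illustration of the kind of input these protocols expect (one row of features per tested molecule plus a measured response), the following minimal sketch runs sequential backward selection on a tiny synthetic table. Everything in it is an assumption made for illustration: the feature names (`sulfate_match`, `keto_match`, `ring_overlap`), the values, and the activity labels are invented, and a simple 1-nearest-neighbor classifier scored by leave-one-out accuracy stands in for the decision tree and random forest models named above. It is not the chapter's implementation.

```python
# Minimal sketch of sequential backward selection (SBS) on synthetic data.
# Feature names, values, and labels are hypothetical, not taken from the chapter.

def loo_accuracy(X, y, cols):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier using only `cols`."""
    hits = 0
    for i, xi in enumerate(X):
        # Nearest other molecule by squared Euclidean distance over `cols`.
        _, nearest = min(
            (sum((xi[c] - xj[c]) ** 2 for c in cols), j)
            for j, xj in enumerate(X)
            if j != i
        )
        hits += y[nearest] == y[i]
    return hits / len(X)


def backward_selection(X, y, names, k):
    """Greedily drop one feature at a time until only k features remain."""
    selected = list(range(len(names)))
    score = loo_accuracy(X, y, selected)
    while len(selected) > k:
        # Every subset obtained by removing exactly one remaining feature.
        candidates = [[c for c in selected if c != drop] for drop in selected]
        # Keep the subset with the best score (first one wins on ties).
        selected = max(candidates, key=lambda cols: loo_accuracy(X, y, cols))
        score = loo_accuracy(X, y, selected)
    return [names[c] for c in selected], score


# One row per tested molecule; columns follow `names` (all hypothetical).
names = ["sulfate_match", "keto_match", "ring_overlap"]
X = [
    [1.0, 0.2, 0.4],
    [1.0, 0.5, 0.1],
    [1.0, 0.3, 0.3],
    [1.0, 0.1, 0.5],
    [0.0, 0.2, 0.5],
    [0.0, 0.4, 0.1],
    [0.0, 0.3, 0.4],
    [0.0, 0.1, 0.2],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = active in the assay, 0 = inactive

selected, score = backward_selection(X, y, names, k=1)
print(selected, score)  # ['sulfate_match'] 1.0
```

In this toy table the activity label depends only on the first column, so the selection retains that feature without losing accuracy. With real assay data, a tree-based model and a library implementation of sequential feature selection (for example, the `SequentialFeatureSelector` classes in scikit-learn or mlxtend) would normally replace these hand-written helpers.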

Abbreviations

2D: Two-dimensional
3D: Three-dimensional
3kPZS: 3-keto petromyzonol sulfate
CAS: Chemical Abstracts Service Registry
CSD: Cambridge Structural Database
DKPES: 3,12-diketo-4,6-petromyzonene-24-sulfate
EOG: Electro-olfactogram
GPCR: G protein-coupled receptor
QSAR: Quantitative structure–activity relationship
SBS: Sequential backward selection
SFS: Sequential feature selection
VS: Virtual screening
ZINC12: Zinc Is Not Commercial database, version 12

Acknowledgments

This research was supported by funding from the Great Lakes Fishery Commission from 2012 to 2017 (Project ID: 2015_KUH_54031). We gratefully acknowledge OpenEye Scientific Software (Santa Fe, NM) for providing academic licenses for the use of their ROCS, Omega, QUACPAC (molcharge), and OEChem toolkit software. We also wish to express our special appreciation to the open source community for developing and sharing the freely accessible Python libraries for data processing, machine learning, and plotting that were used for the data analysis presented in this chapter.

Author information

Corresponding author

Correspondence to Leslie A. Kuhn.

Copyright information

© 2018 Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Raschka, S., Scott, A.M., Huertas, M., Li, W., Kuhn, L.A. (2018). Automated Inference of Chemical Discriminants of Biological Activity. In: Gore, M., Jagtap, U. (eds) Computational Drug Discovery and Design. Methods in Molecular Biology, vol 1762. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-7756-7_16

  • DOI: https://doi.org/10.1007/978-1-4939-7756-7_16

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-7755-0

  • Online ISBN: 978-1-4939-7756-7

  • eBook Packages: Springer Protocols
