Automated Inference of Chemical Discriminants of Biological Activity

Raschka, Sebastian; Scott, Anne M.; Huertas, Mar; Li, Weiming; Kuhn, Leslie A.

doi:10.1007/978-1-4939-7756-7_16

Protocol
First Online: 29 March 2018

Part of the book series: Methods in Molecular Biology (MIMB, volume 1762)

Abstract

Ligand-based virtual screening has become a standard technique for the efficient discovery of bioactive small molecules. Following assays to determine the activity of compounds selected by virtual screening, or other approaches in which dozens to thousands of molecules have been tested, machine learning techniques make it straightforward to discover the patterns of chemical groups that correlate with the desired biological activity. Defining the chemical features that generate activity can be used to guide the selection of molecules for subsequent rounds of screening and assaying, as well as help design new, more active molecules for organic synthesis.

The quantitative structure–activity relationship machine learning protocols we describe here, using decision trees, random forests, and sequential feature selection, take as input the chemical structure of a single, known active small molecule (e.g., an inhibitor, agonist, or substrate) for comparison with the structure of each tested molecule. Knowledge of the atomic structure of the protein target and its interactions with the active compound is not required. These protocols can be modified and applied to any dataset that consists of a series of measured structural, chemical, or other features for each tested molecule, along with the experimentally measured value of the response variable you would like to predict or optimize for your project, for instance, inhibitory activity in a biological assay or ΔG_binding. To illustrate the use of different machine learning algorithms, we step through the analysis of a dataset of inhibitor candidates from virtual screening that were tested recently for their ability to inhibit GPCR-mediated signaling in a vertebrate.
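
As a rough illustration of the sequential feature selection idea named above, the following dependency-free Python sketch runs sequential backward selection on a tiny dataset. It is not the chapter's implementation: the six "molecules", their three feature columns, the active/inactive labels, and the function names are all invented for this example, and a leave-one-out 1-nearest-neighbor classifier is assumed as the scoring model purely to keep the sketch short. In each round, the feature whose removal hurts the score least is dropped.

```python
"""Toy sketch of sequential backward selection (SBS) on activity data.

All data and names below are hypothetical and exist only for illustration.
"""


def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier that
    only looks at the feature columns listed in `features`."""
    correct = 0
    for i, xi in enumerate(X):
        nearest, nearest_dist = None, float("inf")
        for j, xj in enumerate(X):
            if j == i:
                continue  # hold out the molecule being predicted
            dist = sum((xi[f] - xj[f]) ** 2 for f in features)
            if dist < nearest_dist:
                nearest, nearest_dist = j, dist
        correct += y[nearest] == y[i]
    return correct / len(X)


def sequential_backward_selection(X, y, k_features, score=loo_1nn_accuracy):
    """Start from all features and greedily drop one per round until
    `k_features` remain. Returns a list of (feature subset, score)."""
    features = tuple(range(len(X[0])))
    history = [(features, score(X, y, features))]
    while len(features) > k_features:
        # Every subset obtained by removing exactly one current feature.
        candidates = [tuple(f for f in features if f != drop)
                      for drop in features]
        scored = [(score(X, y, c), c) for c in candidates]
        # Keep the subset with the best score (first one wins on ties).
        best_score, features = max(scored, key=lambda pair: pair[0])
        history.append((features, best_score))
    return history


# Hypothetical feature matrix: one row per tested molecule. Column 0
# tracks activity by construction; columns 1 and 2 are uninformative.
X = [
    [1.0, 0.2, 0.9],
    [0.9, 0.8, 0.1],
    [0.8, 0.5, 0.5],
    [0.1, 0.3, 0.8],
    [0.2, 0.7, 0.2],
    [0.0, 0.5, 0.4],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = active, 0 = inactive (invented labels)

history = sequential_backward_selection(X, y, k_features=1)
for subset, accuracy in history:
    print(subset, round(accuracy, 2))
```

On this toy data the search ends with column 0 as the only retained feature, which is the column that was built to separate actives from inactives. With real assay data, a held-out test set or cross-validation would be needed before trusting any selected subset.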

Abbreviations

2D: Two-dimensional
3D: Three-dimensional
3kPZS: 3-keto petromyzonol sulfate
CAS: Chemical Abstracts Service Registry
CSD: Cambridge Structural Database
DKPES: 3,12-diketo-4,6-petromyzonene-24-sulfate
EOG: Electro-olfactogram
GPCR: G protein-coupled receptor
QSAR: Quantitative structure–activity relationship
SBS: Sequential backward selection
SFS: Sequential feature selection
VS: Virtual screening
ZINC12: Zinc Is Not Commercial database, version 12

Acknowledgments

This research was supported by funding from the Great Lakes Fishery Commission from 2012 to 2017 (Project ID: 2015_KUH_54031). We gratefully acknowledge OpenEye Scientific Software (Santa Fe, NM) for providing academic licenses for the use of their ROCS, Omega, QUACPAC (molcharge), and OEChem toolkit software. We also wish to express our special appreciation to the open source community for developing and sharing the freely accessible Python libraries for data processing, machine learning, and plotting that were used for the data analysis presented in this chapter.

Author information

Mar Huertas
Present address: Department of Biology, Texas State University, San Marcos, TX, USA

Authors and Affiliations

Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
Sebastian Raschka & Leslie A. Kuhn
Department of Fisheries and Wildlife, Michigan State University, East Lansing, MI, USA
Anne M. Scott, Mar Huertas, Weiming Li & Leslie A. Kuhn
Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
Leslie A. Kuhn

Corresponding author

Correspondence to Leslie A. Kuhn.

Editor information

Editors and Affiliations

Department of Basic and Applied Sciences, Dayananda Sagar University, Bangalore, KA, India
Mohini Gore
Department of Biotechnology, Shivaji University, Kolhapur, MH, India
Umesh B. Jagtap

About this protocol

Cite this protocol

Raschka, S., Scott, A.M., Huertas, M., Li, W., Kuhn, L.A. (2018). Automated Inference of Chemical Discriminants of Biological Activity. In: Gore, M., Jagtap, U. (eds) Computational Drug Discovery and Design. Methods in Molecular Biology, vol 1762. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-7756-7_16

DOI: https://doi.org/10.1007/978-1-4939-7756-7_16
Published: 29 March 2018
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-7755-0
Online ISBN: 978-1-4939-7756-7
eBook Packages: Springer Protocols
