De Novo Molecular Design with Chemical Language Models

Grisoni, Francesca; Schneider, Gisbert

doi:10.1007/978-1-0716-1787-8_9

Part of the book series: Methods in Molecular Biology (MIMB, volume 2390)

Abstract

Artificial intelligence (AI) offers new possibilities for hit and lead finding in medicinal chemistry. Several instances of AI have been used for prospective de novo drug design. Among these, chemical language models have been shown to perform well in various experimental scenarios. In this study, we provide a hands-on introduction to chemical language modeling. A technique based on recurrent neural networks is discussed in detail, together with a step-by-step guide to applying this AI method for focused compound library design. The program code is freely available at github.com/ETHmodlab/de_novo_design_RNN.

Author information

Authors and Affiliations

ETH Zurich, Department of Chemistry and Applied Biosciences, RETHINK, Zurich, Switzerland
Francesca Grisoni & Gisbert Schneider
Eindhoven University of Technology, Department of Biomedical Engineering, Eindhoven, Netherlands
Francesca Grisoni

Corresponding author

Correspondence to Francesca Grisoni.

Editor information

Editors and Affiliations

Computational Drug Discovery, Evotec (UK) Ltd., Abingdon, Oxfordshire, UK
Alexander Heifetz

About this protocol

Cite this protocol

Grisoni, F., Schneider, G. (2022). De Novo Molecular Design with Chemical Language Models. In: Heifetz, A. (eds) Artificial Intelligence in Drug Design. Methods in Molecular Biology, vol 2390. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1787-8_9

DOI: https://doi.org/10.1007/978-1-0716-1787-8_9
Published: 04 November 2021
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-1786-1
Online ISBN: 978-1-0716-1787-8
eBook Packages: Springer Protocols
