Abstract
Artificial intelligence (AI) offers new possibilities for hit and lead finding in medicinal chemistry. Several AI approaches have been applied prospectively to de novo drug design. Among these, chemical language models have been shown to perform well in various experimental scenarios. In this study, we provide a hands-on introduction to chemical language modeling. A technique based on recurrent neural networks is discussed in detail, together with a step-by-step guide to applying this AI method to focused compound library design. The program code is freely available at github.com/ETHmodlab/de_novo_design_RNN.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Grisoni, F., Schneider, G. (2022). De Novo Molecular Design with Chemical Language Models. In: Heifetz, A. (eds) Artificial Intelligence in Drug Design. Methods in Molecular Biology, vol 2390. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1787-8_9
DOI: https://doi.org/10.1007/978-1-0716-1787-8_9
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-1786-1
Online ISBN: 978-1-0716-1787-8
eBook Packages: Springer Protocols