De Novo Molecular Design with Chemical Language Models

Artificial Intelligence in Drug Design

Part of the book series: Methods in Molecular Biology ((MIMB,volume 2390))

Abstract

Artificial intelligence (AI) offers new possibilities for hit and lead finding in medicinal chemistry. Several AI methods have been applied to prospective de novo drug design. Among these, chemical language models have been shown to perform well in various experimental scenarios. In this chapter, we provide a hands-on introduction to chemical language modeling. A technique based on recurrent neural networks is discussed in detail, together with a step-by-step guide to applying this AI method to focused compound library design. The program code is freely available at github.com/ETHmodlab/de_novo_design_RNN.

Corresponding author

Correspondence to Francesca Grisoni.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Grisoni, F., Schneider, G. (2022). De Novo Molecular Design with Chemical Language Models. In: Heifetz, A. (eds) Artificial Intelligence in Drug Design. Methods in Molecular Biology, vol 2390. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1787-8_9

  • DOI: https://doi.org/10.1007/978-1-0716-1787-8_9

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-1786-1

  • Online ISBN: 978-1-0716-1787-8

  • eBook Packages: Springer Protocols
