On the ability of machine learning methods to discover novel scaffolds

  • Original Paper
  • Journal of Molecular Modeling

Abstract

Recent advances in the application of machine learning to drug discovery have made it a ‘hot topic’ for research, with hundreds of academic groups and companies integrating machine learning into their drug discovery projects. Nevertheless, there remains great uncertainty about the most appropriate ways to evaluate the performance of these powerful methods relative to more traditional cheminformatics approaches, and many pitfalls remain for the unwary. In 2020, researchers at MIT (Stokes et al., Cell 180(4), 688–702, 2020) reported the discovery of halicin, a new compound with antibacterial activity, through the use of a neural network machine learning method. A robust ability to identify new active chemotypes through computational methods would be very useful. In this study, we have used the Stokes et al. dataset to compare the performance of this method with two other approaches, Mapping of Activity Through Dichotomic Scores (MADS) by Todeschini et al. (J Chemom 32(4):e2994, 2018) and Random Matrix Theory (RMT) by Lee et al. (Proc Natl Acad Sci 116(9):3373–3378, 2019). Our results demonstrate that all three methods are capable of predicting halicin as an active antibacterial compound, but that this result depends on the dataset composition, the pre-processing and the molecular fingerprint used. We have further assessed overall performance using several performance metrics. We also investigated the scaffold hopping potential of the methods by removing the β-lactam and fluoroquinolone chemotypes from the dataset. MADS and RMT are able to identify actives in the test set that contain these substructures. This ability arises because the withheld chemotypes share high-scoring fragments with other active antibiotic classes. Interestingly, MADS performs better than the other two methods in terms of general predictive performance.
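
The chemotype-withholding experiment summarised in the abstract can be illustrated with a minimal Python sketch. It drops every training compound tagged with a withheld chemotype and then measures what fraction of the withheld-chemotype actives in the test set a scoring function recovers. The `Compound` record, the pre-assigned `chemotype` tags, the toy scores and the 0.5 threshold are all illustrative assumptions, not the paper's data or code; in practice the chemotypes would be assigned by substructure matching and the scores would come from a model trained on the reduced training set.

```python
from dataclasses import dataclass


# Hypothetical record layout; the real dataset stores structures and activity labels.
@dataclass(frozen=True)
class Compound:
    name: str
    chemotype: str  # assumed pre-assigned tag, e.g. from substructure matching
    active: bool


WITHHELD = frozenset({"beta-lactam", "fluoroquinolone"})


def remove_chemotypes(train, withheld=WITHHELD):
    """Return the training set without any compound of a withheld chemotype."""
    return [c for c in train if c.chemotype not in withheld]


def withheld_recall(test, score, withheld=WITHHELD, threshold=0.5):
    """Fraction of withheld-chemotype test actives scored at or above the threshold."""
    targets = [c for c in test if c.active and c.chemotype in withheld]
    if not targets:
        return 0.0
    hits = sum(score(c) >= threshold for c in targets)
    return hits / len(targets)


train = [
    Compound("ampicillin", "beta-lactam", True),
    Compound("ciprofloxacin", "fluoroquinolone", True),
    Compound("tetracycline", "tetracycline", True),
    Compound("caffeine", "other", False),
]
test = [
    Compound("cefoxitin", "beta-lactam", True),
    Compound("moxifloxacin", "fluoroquinolone", True),
    Compound("glucose", "other", False),
]

reduced_train = remove_chemotypes(train)

# Toy scores standing in for predictions of a model trained on reduced_train only.
toy_scores = {"cefoxitin": 0.7, "moxifloxacin": 0.3, "glucose": 0.1}
recall = withheld_recall(test, lambda c: toy_scores.get(c.name, 0.0))
print(f"Recall on withheld chemotypes: {recall:.2f}")
```

A non-zero recall in such a setup indicates that the model has scored compounds from chemotypes it never saw during training, which is the sense of scaffold hopping examined here.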

References

  1. Yanling J, Xin L, Zhiyuan L (2013) The antibacterial drug discovery. Drug Discovery, pp 289–307

  2. Aminov RI (2010) A brief history of the antibiotic era: lessons learned and challenges for the future. Front Microbiol 1:134


  3. Laxminarayan R, Duse A, Wattal C et al (2013) Antibiotic resistance—the need for global solutions. Lancet Infect Dis 13(12):1057–1098


  4. Goh GB, Hodas NO, Vishnu A (2017) Deep learning for computational chemistry. J Comput Chem 38(16):1291–1307

  5. Scarselli F, Gori M, Tsoi AC et al (2008) The graph neural network model. IEEE Trans Neural Netw 20(1):61–80

  6. Baskin II, Winkler D, Tetko IV (2016) A renaissance of neural networks in drug discovery. Expert Opin Drug Discov 11(8):785–795

  7. Salt DW, Yildiz N, Livingstone DJ et al (1992) The use of artificial neural networks in QSAR. Pestic Sci 36(2):161–170

  8. Ghasemi F, Mehridehnavi A, Perez-Garrido A et al (2018) Neural network and deep-learning algorithms used in QSAR studies: merits and drawbacks. Drug Discov Today 23(10):1784–1790

  9. Staszak M, Staszak K, Wieszczycka K et al (2021) Machine learning in drug design: use of artificial intelligence to explore the chemical structure–biological activity relationship. Wiley Interdisciplinary Reviews: Computational Molecular Science, e1568

  10. Mayr A, Klambauer G, Unterthiner T et al (2018) Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem Sci 9(24):5441–5451

  11. Lenselink EB, Ten Dijke N, Bongers B et al (2017) Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set. J Cheminformatics 9(1):1–14

  12. Gaulton A, Bellis LJ, Bento AP et al (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40(D1):D1100–D1107

  13. Truchon JF, Bayly CI (2007) Evaluating virtual screening methods: good and bad metrics for the “early recognition” problem. J Chem Inf Model 47(2):488–508

  14. Koutsoukas A, Monaghan KJ, Li X et al (2017) Deep-learning: investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data. J Cheminformatics 9(1):1–13

  15. Duvenaud D, Maclaurin D, Aguilera-Iparraguirre J et al (2015) Convolutional networks on graphs for learning molecular fingerprints. arXiv:1509.09292

  16. Withnall M, Lindelöf E, Engkvist O et al (2020) Building attention and edge message passing neural networks for bioactivity and physical–chemical property prediction. J Cheminformatics 12(1):1–18

  17. Jiang D, Wu Z, Hsieh CY et al (2021) Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. J Cheminformatics 13(1):1–23

  18. Robinson MC, Glen RC et al (2020) Validating the validation: reanalyzing a large-scale comparison of deep learning and machine learning models for bioactivity prediction. J Comput Aided Mol Des, pp 1–14

  19. Pérez-Sianes J, Pérez-Sánchez H, Díaz F (2016) Virtual screening: a challenge for deep learning. In: International Conference on Practical Applications of Computational Biology & Bioinformatics. Springer, pp 13–22

  20. Bajorath J (2017) Computational scaffold hopping: cornerstone for the future of drug design?

  21. Schneider G, Neidhart W, Giller T et al (1999) “Scaffold-hopping” by topological pharmacophore search: a contribution to virtual screening. Angew Chem Int Ed 38(19):2894–2896

  22. Vainio MJ, Kogej T, Raubacher F et al (2013) Scaffold hopping by fragment replacement

  23. Saluste G, Albarran MI, Alvarez RM et al (2012) Fragment-hopping-based discovery of a novel chemical series of proto-oncogene PIM-1 kinase inhibitors. PLoS One 7(10):e45964

  24. Ertl P (2012) Database of bioactive ring systems with calculated properties and its use in bioisosteric design and scaffold hopping. Bioorg Med Chem 20(18):5436–5442

  25. Stokes JM, Yang K, Swanson K et al (2020) A deep learning approach to antibiotic discovery. Cell 180(4):688–702

  26. Todeschini R, Consonni V, Ballabio D et al (2018) Mapping of activity through dichotomic scores (MADS): a new chemoinformatic approach to detect activity-rich structural regions. J Chemom 32(4):e2994

  27. Lee AA, Yang Q, Bassyouni A et al (2019) Ligand biological activity predicted by cleaning positive and negative chemical correlations. Proc Natl Acad Sci 116(9):3373–3378

  28. Chemical Computing Group Inc (2019) Molecular Operating Environment (MOE)

  29. Corsello SM, Bittker JA, Liu Z et al (2017) The drug repurposing hub: a next-generation drug library and information resource. Nat Med 23(4):405–408

  30. Cereto-Massagué A, Ojeda MJ, Valls C et al (2015) Molecular fingerprint similarity search in virtual screening. Methods 71:58–63

  31. Willett P (2006) Similarity-based virtual screening using 2D fingerprints. Drug Discov Today 11(23–24):1046–1053

  32. Muegge I, Mukherjee P (2016) An overview of molecular fingerprint similarity search in virtual screening. Expert Opin Drug Discov 11(2):137–148

  33. Riniker S, Landrum GA (2013) Open-source platform to benchmark fingerprints for ligand-based virtual screening. J Cheminformatics 5(1):1–17

  34. Wale N, Watson IA, Karypis G (2008) Comparison of descriptor spaces for chemical compound retrieval and classification. Knowl Inf Syst 14(3):347–375

  35. Russo DP, Zorn KM, Clark AM et al (2018) Comparing multiple machine learning algorithms and metrics for estrogen receptor binding prediction. Mol Pharm 15(10):4361–4370

  36. Kensert A, Alvarsson J, Norinder U et al (2018) Evaluating parameters for ligand-based modeling with random forest on sparse data sets. J Cheminformatics 10(1):1–10

  37. Chen B, Harrison RF, Papadatos G et al (2007) Evaluation of machine-learning methods for ligand-based virtual screening. J Comput Aided Mol Des 21(1):53–62

  38. MDL Information Systems Inc (1984) MACCS keys. San Leandro, CA

  39. Nilakantan R, Bauman N, Dixon JS et al (1987) Topological torsion: a new molecular descriptor for SAR applications. Comparison with other descriptors. J Chem Inf Comput Sci 27(2):82–85

  40. Landrum G (2013) RDKit documentation. Release 1(1-79):4

  41. Lee AA, Brenner MP, Colwell LJ (2016) Predicting protein–ligand affinity with a random matrix framework. Proc Natl Acad Sci 113:13564–13569

  42. Bajusz D, Rácz A, Héberger K (2015) Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminformatics 7(1):1–13

  43. Hussin SK, Abdelmageid SM, Alkhalil A et al (2021) Handling imbalance classification virtual screening big data using machine learning algorithms. Complexity 2021

  44. Branco P, Torgo L, Ribeiro RP (2017) Relevance-based evaluation metrics for multi-class imbalanced domains. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, pp 698–710

  45. Ballabio D, Grisoni F, Todeschini R (2018) Multivariate comparison of classification performance measures. Chemometr Intell Lab Syst 174:33–44

  46. Schubert S, Dalhoff A (2012) Activity of moxifloxacin, imipenem, and ertapenem against Escherichia coli, Enterobacter cloacae, Enterococcus faecalis, and Bacteroides fragilis in monocultures and mixed cultures in an in vitro pharmacokinetic/pharmacodynamic model simulating concentrations in the human pancreas. Antimicrob Agents Chemother 56(12):6434–6436

  47. Marie MAM, Krishnappa LG, Lory S (2016) In vitro activity and the efficacy of arbekacin, cefminox, fosfomycin, biapenem against gram-negative organisms: new treatment options? Proceedings of the National Academy of Sciences, India Section B: Biological Sciences 86(3):749–755

  48. Goto S, Sakamoto H, Ogawa M et al (1982) Bactericidal activity of cefazolin, cefoxitin, and cefmetazole against Escherichia coli and Klebsiella pneumoniae. Chemotherapy 28(1):18–25

  49. Russell DG (2001) Mycobacterium tuberculosis: here today, and here tomorrow. Nat Rev Mol Cell Biol 2(8):569–578

  50. Brenner DJ, Farmer JJ III (2015) Enterobacteriaceae. Bergey’s Manual of Systematics of Archaea and Bacteria, pp 1–24

  51. Fourches D, Muratov E, Tropsha A (2010) Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research. J Chem Inf Model 50(7):1189

  52. Williams AJ, Ekins S, Tkachenko V (2012) Towards a gold standard: regarding quality in public domain chemistry databases and approaches to improving the situation. Drug Discov Today 17(13–14):685–701

  53. Richter MF, Drown BS, Riley AP et al (2017) Predictive compound accumulation rules yield a broad-spectrum antibiotic. Nature 545(7654):299–304

  54. Ebejer JP, Charlton MH, Finn PW (2016) Are the physicochemical properties of antibacterial compounds really different from other drugs? J Cheminformatics 8(1):1–9


Acknowledgements

The authors would like to thank Dr. Jean Paul Ebejer, University of Malta, for his valuable suggestions to improve the manuscript.

Author information

Corresponding author

Correspondence to Rishi Jagdev.

Ethics declarations

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Conflict of interest

The authors declare no competing interests.

Additional information

Supplementary information

The online version contains supplementary material available at https://doi.org/10.1007/s00894-022-05359-6.

Author contribution

All authors contributed to the study conception and design. RJ performed the experimental analysis and wrote the first draft of the main manuscript. TM contributed to the mathematical interpretation of all the ML methods and reviewed the manuscript. PF contributed substantially to the conception of the experiments and critically reviewed and revised the manuscript. All authors read and approved the final manuscript.

Data availability statement

The training and test datasets are available as supplementary information with the publication by Stokes et al. [25, Supplementary Tables S2A, S2B]. The code for all three methods discussed in the paper is publicly available at the following links:

  • Chemprop [25]: https://github.com/swansonk14/chemprop
  • Mapping of Activity through Dichotomic Scores (MADS) [26]: https://michem.unimib.it/download/matlab-toolboxes/virtual-screening-toolbox-for-matlab/
  • Random Matrix Theory (RMT) [27]: https://github.com/alphaleegroup/RandomMatrixDiscriminant

Institutional review board statement

Not applicable.

Informed consent statement

Not applicable.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Thomas Bruun Madsen and Paul W. Finn contributed equally to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jagdev, R., Madsen, T.B. & Finn, P.W. On the ability of machine learning methods to discover novel scaffolds. J Mol Model 29, 22 (2023). https://doi.org/10.1007/s00894-022-05359-6

  • DOI: https://doi.org/10.1007/s00894-022-05359-6
