Abstract
Recent advances in applying machine learning to drug discovery have made it a ‘hot topic’ for research, with hundreds of academic groups and companies integrating machine learning into their drug discovery projects. Nevertheless, there remains great uncertainty about how best to evaluate the performance of these powerful methods relative to more traditional cheminformatics approaches, and many pitfalls remain for the unwary. In 2020, researchers at MIT (Stokes et al., Cell 180(4):688–702, 2020) reported the discovery of halicin, a new compound with antibacterial activity, using a neural network machine learning method. A robust ability to identify new active chemotypes computationally would be very useful. In this study, we used the Stokes et al. dataset to compare the performance of this method with two other approaches, Mapping of Activity Through Dichotomic Scores (MADS) by Todeschini et al. (J Chemom 32(4):e2994, 2018) and Random Matrix Theory (RMT) by Lee et al. (Proc Natl Acad Sci 116(9):3373–3378, 2019). Our results demonstrate that all three methods can predict halicin as an active antibacterial compound, but that this result depends on the dataset composition, the pre-processing and the molecular fingerprint used. We further assessed overall performance using several performance metrics. We also investigated the scaffold hopping potential of the methods by modifying the dataset to remove the β-lactam and fluoroquinolone chemotypes. MADS and RMT are able to identify actives in the test set that contain these substructures. This ability arises because high-scoring fragments of the withheld chemotypes are shared with other active antibiotic classes. Interestingly, MADS performs relatively better than the other two methods in terms of general predictive performance.
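The scaffold hopping experiment described above can be illustrated with a minimal sketch. It assumes that compounds of the withheld chemotypes are removed from the training set and kept in the test set, which is one reading of the abstract rather than the authors' exact protocol. The records, the hand-assigned `chemotype` labels, the function names and the model output are all hypothetical; in practice chemotype membership would be determined by substructure matching on the chemical structures.

```python
from typing import Dict, List, Set, Tuple

# Toy records; chemotype labels are hand-assigned for illustration only.
COMPOUNDS: List[Dict[str, object]] = [
    {"name": "ampicillin", "chemotype": "beta-lactam", "active": True},
    {"name": "ciprofloxacin", "chemotype": "fluoroquinolone", "active": True},
    {"name": "tetracycline", "chemotype": "tetracycline", "active": True},
    {"name": "ibuprofen", "chemotype": "other", "active": False},
]

WITHHELD: Set[str] = {"beta-lactam", "fluoroquinolone"}


def withhold_chemotypes(
    compounds: List[Dict[str, object]], withheld: Set[str]
) -> Tuple[List[Dict[str, object]], List[Dict[str, object]]]:
    """Split so that no withheld chemotype appears in the training set."""
    train = [c for c in compounds if c["chemotype"] not in withheld]
    test = [c for c in compounds if c["chemotype"] in withheld]
    return train, test


def recovered_actives(
    test: List[Dict[str, object]], predicted_active: Set[str]
) -> List[str]:
    """Names of withheld-chemotype actives that a model flagged as active."""
    return [c["name"] for c in test if c["active"] and c["name"] in predicted_active]


train, test = withhold_chemotypes(COMPOUNDS, WITHHELD)
# Hypothetical model output: the set of test compounds predicted active.
hits = recovered_actives(test, {"ampicillin"})
```

A method shows scaffold hopping potential in this setting when `hits` is non-empty even though the model never saw the withheld chemotypes during training.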
References
Yanling J, Xin L, Zhiyuan L (2013) The antibacterial drug discovery. Drug Discovery, pp 289–307
Aminov RI (2010) A brief history of the antibiotic era: lessons learned and challenges for the future. Front Microbiol 1:134
Laxminarayan R, Duse A, Wattal C et al (2013) Antibiotic resistance—the need for global solutions. Lancet Infect Dis 13(12):1057–1098
Goh GB, Hodas NO, Vishnu A (2017) Deep learning for computational chemistry. J Comput Chem 38(16):1291–1307
Scarselli F, Gori M, Tsoi AC et al (2008) The graph neural network model. IEEE Trans Neural Netw 20(1):61–80
Baskin II, Winkler D, Tetko IV (2016) A renaissance of neural networks in drug discovery. Expert Opin Drug Discov 11(8):785–795
Salt DW, Yildiz N, Livingstone DJ et al (1992) The use of artificial neural networks in QSAR. Pestic Sci 36(2):161–170
Ghasemi F, Mehridehnavi A, Perez-Garrido A et al (2018) Neural network and deep-learning algorithms used in QSAR studies: merits and drawbacks. Drug Discov Today 23(10):1784–1790
Staszak M, Staszak K, Wieszczycka K et al (2021) Machine learning in drug design: use of artificial intelligence to explore the chemical structure–biological activity relationship. Wiley Interdisciplinary Reviews: Computational Molecular Science, e1568
Mayr A, Klambauer G, Unterthiner T et al (2018) Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem Sci 9(24):5441–5451
Lenselink EB, Ten Dijke N, Bongers B et al (2017) Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set. J Cheminformatics 9(1):1–14
Gaulton A, Bellis LJ, Bento AP et al (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40(D1):D1100–D1107
Truchon JF, Bayly CI (2007) Evaluating virtual screening methods: good and bad metrics for the “early recognition” problem. J Chem Inf Model 47(2):488–508
Koutsoukas A, Monaghan KJ, Li X et al (2017) Deep-learning: investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data. J Cheminformatics 9(1):1–13
Duvenaud D, Maclaurin D, Aguilera-Iparraguirre J et al (2015) Convolutional networks on graphs for learning molecular fingerprints. arXiv:1509.09292
Withnall M, Lindelöf E, Engkvist O et al (2020) Building attention and edge message passing neural networks for bioactivity and physical–chemical property prediction. J Cheminformatics 12(1):1–18
Jiang D, Wu Z, Hsieh CY et al (2021) Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. J Cheminformatics 13(1):1–23
Robinson MC, Glen RC et al (2020) Validating the validation: reanalyzing a large-scale comparison of deep learning and machine learning models for bioactivity prediction. J Comput Aided Mol Des, pp 1–14
Pérez-Sianes J, Pérez-Sánchez H, Díaz F (2016) Virtual screening: a challenge for deep learning. In: International Conference on Practical Applications of Computational Biology & Bioinformatics. Springer, pp 13–22
Bajorath J (2017) Computational scaffold hopping: cornerstone for the future of drug design?
Schneider G, Neidhart W, Giller T et al (1999) “Scaffold-hopping” by topological pharmacophore search: a contribution to virtual screening. Angew Chem Int Ed 38(19):2894–2896
Vainio MJ, Kogej T, Raubacher F et al (2013) Scaffold hopping by fragment replacement
Saluste G, Albarran MI, Alvarez RM et al (2012) Fragment-hopping-based discovery of a novel chemical series of proto-oncogene PIM-1 kinase inhibitors. PLoS One 7(10):e45964
Ertl P (2012) Database of bioactive ring systems with calculated properties and its use in bioisosteric design and scaffold hopping. Bioorg Med Chem 20(18):5436–5442
Stokes JM, Yang K, Swanson K et al (2020) A deep learning approach to antibiotic discovery. Cell 180(4):688–702
Todeschini R, Consonni V, Ballabio D et al (2018) Mapping of activity through dichotomic scores (MADS): a new chemoinformatic approach to detect activity-rich structural regions. J Chemom 32(4):e2994
Lee AA, Yang Q, Bassyouni A et al (2019) Ligand biological activity predicted by cleaning positive and negative chemical correlations. Proc Natl Acad Sci 116(9):3373–3378
Chemical Computing Group Inc (2019) Molecular Operating Environment (MOE)
Corsello SM, Bittker JA, Liu Z et al (2017) The drug repurposing hub: a next-generation drug library and information resource. Nat Med 23(4):405–408
Cereto-Massagué A, Ojeda MJ, Valls C et al (2015) Molecular fingerprint similarity search in virtual screening. Methods 71:58–63
Willett P (2006) Similarity-based virtual screening using 2D fingerprints. Drug Discov Today 11(23–24):1046–1053
Muegge I, Mukherjee P (2016) An overview of molecular fingerprint similarity search in virtual screening. Expert Opin Drug Discov 11(2):137–148
Riniker S, Landrum GA (2013) Open-source platform to benchmark fingerprints for ligand-based virtual screening. J Cheminformatics 5(1):1–17
Wale N, Watson IA, Karypis G (2008) Comparison of descriptor spaces for chemical compound retrieval and classification. Knowl Inf Syst 14(3):347–375
Russo DP, Zorn KM, Clark AM et al (2018) Comparing multiple machine learning algorithms and metrics for estrogen receptor binding prediction. Mol Pharm 15(10):4361–4370
Kensert A, Alvarsson J, Norinder U et al (2018) Evaluating parameters for ligand-based modeling with random forest on sparse data sets. J Cheminformatics 10(1):1–10
Chen B, Harrison RF, Papadatos G et al (2007) Evaluation of machine-learning methods for ligand-based virtual screening. J Comput Aided Mol Des 21(1):53–62
MDL Information Systems Inc (1984) MACCS keys. San Leandro, CA
Nilakantan R, Bauman N, Dixon JS et al (1987) Topological torsion: a new molecular descriptor for SAR applications. Comparison with other descriptors. J Chem Inf Comput Sci 27(2):82–85
Landrum G (2013) RDKit documentation. Release 1(1-79):4
Lee AA, Brenner MP, Colwell LJ (2016) Predicting protein–ligand affinity with a random matrix framework. Proc Natl Acad Sci 113:13564–13569
Bajusz D, Rácz A, Héberger K (2015) Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminformatics 7(1):1–13
Hussin SK, Abdelmageid SM, Alkhalil A et al (2021) Handling imbalance classification virtual screening big data using machine learning algorithms. Complexity 2021
Branco P, Torgo L, Ribeiro RP (2017) Relevance-based evaluation metrics for multi-class imbalanced domains. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, pp 698–710
Ballabio D, Grisoni F, Todeschini R (2018) Multivariate comparison of classification performance measures. Chemometr Intell Lab Syst 174:33–44
Schubert S, Dalhoff A (2012) Activity of moxifloxacin, imipenem, and ertapenem against Escherichia coli, Enterobacter cloacae, Enterococcus faecalis, and Bacteroides fragilis in monocultures and mixed cultures in an in vitro pharmacokinetic/pharmacodynamic model simulating concentrations in the human pancreas. Antimicrob Agents Chemother 56(12):6434–6436
Marie MAM, Krishnappa LG, Lory S (2016) In vitro activity and the efficacy of arbekacin, cefminox, fosfomycin, biapenem against Gram-negative organisms: new treatment options? Proceedings of the National Academy of Sciences, India Section B: Biological Sciences 86(3):749–755
Goto S, Sakamoto H, Ogawa M et al (1982) Bactericidal activity of cefazolin, cefoxitin, and cefmetazole against Escherichia coli and Klebsiella pneumoniae. Chemotherapy 28(1):18–25
Russell DG (2001) Mycobacterium tuberculosis: here today, and here tomorrow. Nat Rev Mol Cell Biol 2(8):569–578
Brenner DJ, Farmer JJ III (2015) Enterobacteriaceae. Bergey’s Manual of Systematics of Archaea and Bacteria, pp 1–24
Fourches D, Muratov E, Tropsha A (2010) Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research. J Chem Inf Model 50(7):1189
Williams AJ, Ekins S, Tkachenko V (2012) Towards a gold standard: regarding quality in public domain chemistry databases and approaches to improving the situation. Drug Discov Today 17(13–14):685–701
Richter MF, Drown BS, Riley AP et al (2017) Predictive compound accumulation rules yield a broad-spectrum antibiotic. Nature 545(7654):299–304
Ebejer JP, Charlton MH, Finn PW (2016) Are the physicochemical properties of antibacterial compounds really different from other drugs? J Cheminformatics 8(1):1–9
Acknowledgements
The authors would like to thank Dr. Jean Paul Ebejer, University of Malta, for his valuable suggestions to improve the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Conflict of interest
The authors declare no competing interests.
Additional information
Supplementary information
The online version contains supplementary material available at https://doi.org/10.1007/s00894-022-05359-6.
Author contribution
All authors contributed to the study conception and design. RJ did the experimental analysis and wrote the first draft of the main manuscript. TM contributed to the mathematical interpretation of all the ML methods and reviewed the manuscript. PF substantially contributed to the conception of the experiments, critically reviewed and revised the manuscript. All authors read and approved the final manuscript.
Data availability statement
The training and test datasets are available as supplementary information with the publication by Stokes et al. [25, Supplementary Tables S2A, S2B]. The code for all three methods discussed in the paper is publicly available at the following links:
Chemprop [25]: https://github.com/swansonk14/chemprop
Mapping of Activity through Dichotomic Scores (MADS) [26]: https://michem.unimib.it/download/matlab-toolboxes/virtual-screening-toolbox-for-matlab/
Random Matrix Theory (RMT) [27]: https://github.com/alphaleegroup/RandomMatrixDiscriminant
Institutional review board statement
Not applicable.
Informed consent statement
Not applicable.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Thomas Bruun Madsen and Paul W. Finn contributed equally to this work.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jagdev, R., Madsen, T.B. & Finn, P.W. On the ability of machine learning methods to discover novel scaffolds. J Mol Model 29, 22 (2023). https://doi.org/10.1007/s00894-022-05359-6
DOI: https://doi.org/10.1007/s00894-022-05359-6