Neural Network Guided Tree-Search Policies for Synthesis Planning
Developments in, and the accessibility of, computational methods within machine learning and deep learning have led to a resurgence of methods for computer assisted synthesis planning (CASP). In this paper we introduce our viewpoints on the analysis of reaction data, model building and evaluation. We show how the models’ performance is affected by the specificity of the extracted reaction rules (templates) and outline the direction of research within our group.
Keywords: Reaction informatics · Neural networks · Classification · Synthesis prediction
With the increasing availability of reaction data, developments in and accessibility of computational methods, and a drive to further automate design, make, test, analyze (DMTA) cycles within drug discovery, computer assisted synthesis planning (CASP) has seen renewed interest. This has been spurred by recent achievements in the application of neural networks combined with search algorithms [3, 4], which build on breakthroughs in their application to games such as chess and Go.
Comparison of search spaces in games vs retrosynthetic analysis.
Recent studies have shown that neural network policies framed as multi-class classification problems can identify likely reactions from a noisy knowledge base [3, 4]. However, we have found that they are heavily weighted towards frequently occurring reactions, owing to imbalanced datasets, and thus miss less frequent yet feasible alternatives. In the present study, we explore and tune neural network architectures with the aim of maximizing the number of synthetically feasible options at each step. This is supplemented by curation and analysis of the underlying knowledge base, which is extracted from available reaction datasets; the number of such datasets is limited when only publicly available data is considered.
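To illustrate how a multi-class policy can concentrate on frequent reactions, the following is a minimal sketch in plain Python. It is not the network used in this work: the logits are hypothetical values chosen for illustration, and `softmax` and `top_k_templates` are names introduced here. It only shows how template scores are turned into probabilities and how the most probable templates are selected.

```python
import math


def softmax(logits):
    """Convert raw template scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_templates(logits, k):
    """Return (template_index, probability) pairs for the k most probable templates."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical logits for five templates (assumption, not real model output).
# Template 0 stands in for a very frequently occurring reaction class.
example_logits = [4.0, 1.0, 0.5, 0.2, -1.0]

for idx, p in top_k_templates(example_logits, k=3):
    print(f"template {idx}: p = {p:.3f}")
```

In this toy case most of the probability mass falls on a single template, so any selection based on a probability cut-off would leave few alternatives, which is the behaviour described above for imbalanced data.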
The US patent office extracts are a set of text-mined reactions from the patent literature. Given the reaction SMILES, an extension of the SMILES notation used to represent molecular structures, we used a modified version of the algorithm of Coley and coworkers to extract reaction templates, that is, the transformations required to convert the reactants into the products. These templates form the core of our knowledge base, on which we can train a policy to enumerate retrosynthetic pathways in the form of a tree. To evaluate performance on a state-of-the-art model, we have opted to reimplement a variant of the policy used by Segler and Waller.
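As a small illustration of the reaction SMILES format mentioned above, the sketch below splits a reaction SMILES string of the form `reactants>agents>products` into its components, with molecules within each field separated by `.`. It is not the template extraction algorithm itself; `Reaction` and `parse_reaction_smiles` are names introduced here, and the sketch assumes the input carries no trailing annotations after the SMILES string.

```python
from typing import List, NamedTuple


class Reaction(NamedTuple):
    reactants: List[str]
    agents: List[str]
    products: List[str]


def parse_reaction_smiles(rxn: str) -> Reaction:
    """Split 'reactants>agents>products' into lists of molecule SMILES."""
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators, got: {rxn!r}")

    def split_molecules(field: str) -> List[str]:
        # Molecules inside one field are separated by '.'; drop empty entries.
        return [smiles for smiles in field.split(".") if smiles]

    return Reaction(*(split_molecules(p) for p in parts))


# Illustrative example: acetic acid + ethanol -> ethyl acetate + water,
# with an acid written in the agents field.
rxn = parse_reaction_smiles("CC(=O)O.OCC>[H+]>CC(=O)OCC.O")
print(rxn.reactants, rxn.agents, rxn.products)
```

A template extraction step would then operate on the parsed reactants and products to identify the atoms and bonds that change; that step requires a cheminformatics toolkit and is not sketched here.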
Preliminary evaluation of the models was performed on a random selection of 10,000 compounds from each of ChEMBL and FDB17. This enabled assessment of both the models’ predictive ability and the validity of the templates across a range of druglike and novel scaffolds. Using these assessment criteria, we aim to maximize the number of options available to our policy at each step in the subsequent tree search, thereby enabling the later prediction of full synthetic pathways, which is necessary for accelerating automated DMTA cycles.
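One way to quantify "the number of options available to the policy" is sketched below. This is an assumption about how such a summary could be computed, not the metric used in this work: for each compound, a list of booleans records whether each suggested template could actually be applied, and the function reports the mean number of applicable templates and the fraction of compounds with at least one. The compound names and flags are hypothetical.

```python
from statistics import mean


def summarize_options(applicable_by_compound):
    """Summarize how many suggested templates were applicable per compound.

    applicable_by_compound maps a compound identifier to a list of booleans,
    one per suggested template (True if the template could be applied).
    """
    counts = {name: sum(flags) for name, flags in applicable_by_compound.items()}
    return {
        "mean_options": mean(counts.values()),
        "fraction_with_option": sum(c > 0 for c in counts.values()) / len(counts),
    }


# Hypothetical results for four compounds and three suggested templates each.
example = {
    "compound_a": [True, True, False],
    "compound_b": [False, False, False],
    "compound_c": [True, False, False],
    "compound_d": [True, True, True],
}

summary = summarize_options(example)
print(summary)
```

Reporting both numbers separates two failure modes: a policy that offers nothing for some compounds, and a policy that offers only a single route for most of them.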
Whilst our viewpoint on the performance of the models has shed new light on how a model may be evaluated, much detail remains to be investigated. This paper introduces preliminary results for a template-based synthesis planning methodology; a more rigorous study of the problems faced in computer assisted synthesis planning is currently underway. It will encompass larger datasets, template design, data curation, the network architecture, and the implementation of appropriate metrics.
Amol Thakkar is supported financially by the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie Grant Agreement No. 676434, “Big Data in Chemistry” (“BIGCHEM,” http://bigchem.eu).
6. Daniel, L.: Extraction of chemical structures and reactions from the literature. Doctoral thesis, University of Cambridge (2012)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.