Structure generation by the Prior
After the initial training, 94% of the sequences generated by the Prior, as described in the “Generating new samples” section, corresponded to valid molecular structures according to RDKit [27] parsing, and 90% of these were novel structures outside the training set. A set of randomly chosen structures generated by the Prior, as well as by Agents trained in the subsequent examples, is shown in Additional file 2. The process of generating a SMILES by the Prior is illustrated in Fig. 5, which displays, for every token in the generated SMILES sequence, the conditional probability distribution over the vocabulary at that step according to the Prior. For the first step, when no information other than the initial GO token is present, the distribution approximates the distribution of first tokens for the SMILES in the ChEMBL training set. In this case “O” was sampled, but “C”, “N”, and the halogens were all likely as well. The corresponding log likelihoods were −0.3 for “C”, −2.7 for “N”, −1.8 for “O”, and −5.0 for each of “F” and “Cl”.
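To put the quoted log likelihoods on a more intuitive scale, they can be exponentiated into probabilities. The short sketch below does this for the five first-step tokens mentioned above. It assumes the values are natural logarithms (as the \(\log _e\) notation used elsewhere in the text suggests), and the variable names are purely illustrative.

```python
import math

# First-step log likelihoods quoted in the text (assumed to be natural logs).
first_token_log_likelihood = {"C": -0.3, "N": -2.7, "O": -1.8, "F": -5.0, "Cl": -5.0}

# Exponentiate to recover the probability the Prior assigns to each token.
first_token_probability = {
    token: math.exp(log_p) for token, log_p in first_token_log_likelihood.items()
}

# Print the tokens from most to least probable.
for token, p in sorted(first_token_probability.items(), key=lambda kv: -kv[1]):
    print(f"{token:>2}: {p:.3f}")

# Probability mass covered by these five tokens; the remainder belongs to the
# rest of the vocabulary.
covered_mass = sum(first_token_probability.values())
print(f"covered probability mass: {covered_mass:.3f}")
```

Under this reading, “C” takes roughly three quarters of the probability mass and the sampled “O” about one sixth, with the five tokens together covering nearly all of it. Because the quoted log likelihoods are rounded, these figures are approximate.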
A few (unsurprising) observations:
- Once the aromatic “n” has been sampled, the model has come to expect a ring opening (i.e. a number), since aromatic moieties by definition are cyclic.
- Once an aromatic ring has been opened, the aromatic atoms “c”, “n”, “o”, and “s” become probable, until 5 or 6 steps later when the model thinks it is time to close the ring.
- The model has learnt the RDKit canonicalized SMILES format of increasing ring numbers, and expects the first ring to be numbered “1”. Ring numbers can be reused, as in the first two rings in this example. Only once “1” has been sampled does it expect a ring to be numbered “2” and so on.
Learning to avoid sulphur
As a proof of principle the Agent was first trained to generate molecules which do not contain sulphur. The method described in “The Agent network” is compared with three other policy gradient based methods. The first alternative method is the same as the Agent method, with the only difference that the loss is defined on an action basis rather than on an episodic one, resulting in the cost function:
$$J(\Theta ) = \left[ \sum _{t=0}^T{(\log \pi _{Prior}(a_t, s_t) - \log \pi _{\Theta }(a_t, s_t))} + \sigma S(A)\right] ^2$$
We refer to this method as ‘Action basis’. The second alternative is a REINFORCE algorithm with a reward of S(A) given at the last step. This method is similar to the one used by Silver et al. to train the policy network in AlphaGo [41], as well as the method used by Yu et al. [15]. We refer to this method as ‘REINFORCE’. The corresponding cost function can be written as:
$$J(\Theta ) = S(A)\sum _{t=0}^T \log \pi _{\Theta }(a_t, s_t)$$
A variation of this method that considers prior likelihood is defined by changing the reward from S(A) to \(\sigma S(A)+ \log P(A)_{Prior}\). This method is referred to as ‘REINFORCE + Prior’, with the cost function:
$$J(\Theta ) = [\log P(A)_{Prior} + \sigma S(A)]\sum _{t=0}^T \log \pi _{\Theta }(a_t, s_t)$$
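As a rough illustration, the three alternative cost functions can be evaluated for a single sampled sequence from its per-step log probabilities. The sketch below simply computes the expressions as written above; it says nothing about gradients or about the direction of optimisation. The function names and the toy numbers are hypothetical, and \(\log P(A)_{Prior}\) is taken to be the sum of the Prior's per-step log probabilities.

```python
from typing import Sequence


def action_basis_cost(prior_logps: Sequence[float], agent_logps: Sequence[float],
                      score: float, sigma: float) -> float:
    """'Action basis' cost: squared (summed stepwise log-ratio + sigma * S(A))."""
    stepwise = sum(p - a for p, a in zip(prior_logps, agent_logps))
    return (stepwise + sigma * score) ** 2


def reinforce_cost(agent_logps: Sequence[float], score: float) -> float:
    """'REINFORCE' cost: S(A) times the Agent log likelihood of the sequence."""
    return score * sum(agent_logps)


def reinforce_prior_cost(prior_logps: Sequence[float], agent_logps: Sequence[float],
                         score: float, sigma: float) -> float:
    """'REINFORCE + Prior' cost: (log P_Prior + sigma * S(A)) times the Agent log likelihood."""
    return (sum(prior_logps) + sigma * score) * sum(agent_logps)


# Toy per-step log probabilities for a three-token sequence (made-up values).
prior_logps = [-0.5, -1.5, -1.0]
agent_logps = [-0.4, -1.0, -0.6]

print(action_basis_cost(prior_logps, agent_logps, score=1.0, sigma=2.0))
print(reinforce_cost(agent_logps, score=1.0))
print(reinforce_prior_cost(prior_logps, agent_logps, score=1.0, sigma=2.0))
```

In an actual training loop these quantities would be computed on differentiable tensors rather than plain floats.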
Note that the last method by nature strives to generate only the putative sequence with the highest reward. In contrast to the Agent, the optimal policy for this method is not stochastic. This tendency could be restrained by introducing a regularizing policy entropy term. However, it was found that such regularization undermined the model's ability to produce valid SMILES. This method is therefore dependent on training only long enough for the model to reach a point where highly scored sequences are generated, without settling in a local minimum. The experiment aims to answer the following questions:
- Can the models achieve the task of generating valid SMILES corresponding to structures that do not contain sulphur?
- Will the models exploit the reward function by converging on naïve solutions such as ‘C’ if no handwritten rules are imposed?
- Are the distributions of physicochemical properties for the generated structures similar to those of sulphur-free structures generated by the Prior?
The task is defined by the following scoring function:
$$\begin{aligned} S(A) =\left\{ \begin{array}{ll} 1 &{}\quad \text {if valid and no S} \\ 0 &{}\quad \text {if not valid} \\ -1 &{}\quad \text {if contains S} \end{array}\right. \end{aligned}$$
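A minimal sketch of this scoring function is given below. To stay self-contained it does not parse SMILES itself: the argument `atom_symbols` is a hypothetical interface holding the element symbols of the parsed molecule, or `None` when parsing failed (in practice a toolkit such as RDKit would supply this). Checking validity before sulphur content is an assumption about how the three cases are ordered.

```python
from typing import Optional, Sequence


def sulphur_score(atom_symbols: Optional[Sequence[str]]) -> int:
    """Return S(A) for the sulphur-avoidance task.

    `atom_symbols` is a hypothetical input: the element symbols of the parsed
    molecule, or None when the sequence is not a valid SMILES.
    """
    if atom_symbols is None:
        return 0   # not valid
    if any(symbol == "S" for symbol in atom_symbols):
        return -1  # valid, but contains sulphur
    return 1       # valid and sulphur free


print(sulphur_score(["C", "C", "O"]))   # ethanol-like, no sulphur
print(sulphur_score(["C", "S", "C"]))   # contains sulphur
print(sulphur_score(None))              # failed to parse
```

Comparing whole element symbols rather than characters avoids mistaking elements such as “Si” or “Se” for sulphur.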
All the models were trained for 1000 steps starting from the Prior, and 12,800 SMILES sequences were sampled from each of the models as well as from the Prior. A learning rate of 0.0005 was used for the Agent and ‘Action basis’ methods, and 0.0001 for the two REINFORCE methods. The values of \(\sigma\) used were 2 for the Agent and ‘REINFORCE + Prior’, and 8 for ‘Action basis’. To explore what effect the training has on the structures generated, relevant properties of the sulphur-free structures generated by the Prior and by the other models were compared. The molecular weight, cLogP, the number of rotatable bonds, and the number of aromatic rings were all calculated using RDKit. The experiment was repeated three times with different random seeds. The results are shown in Table 1, and randomly selected SMILES generated by the Prior and the different models can be seen in Table 2.
For the ‘REINFORCE’ method, where the sole aim is to generate valid SMILES that do not contain sulphur, the model quickly learns to exploit the reward function by generating sequences containing predominantly ‘C’. This is not surprising, since any sequence consisting only of this token always gets rewarded. For the ‘REINFORCE + Prior’ method, the inclusion of the prior likelihood in the reward function means that this is no longer a viable strategy (the sequences would be given a low prior probability). The model instead tries to find the structure with the best combination of score and prior likelihood, but as is evident from the SMILES generated and the statistics shown in Table 1, this results in small, simplistic structures being generated. Thus, both REINFORCE algorithms managed to achieve high scores according to the scoring function, but represented the Prior poorly.
Both the Agent and the ‘Action basis’ methods have explicitly specified target policies. For the ‘Action basis’ method the policy is specified exactly on a stepwise level, while for the Agent the target policy is only specified at the level of the likelihoods of entire sequences. Although the ‘Action basis’ method generates structures that are more similar to the Prior than the two REINFORCE methods, it performed worse than the Agent despite the higher value of \(\sigma\) used, while also being slower to learn. This may be due to the less restricted target policy of the Agent, which could facilitate optimization. The Agent achieved the same fraction of sulphur-free structures as the REINFORCE algorithms, while seemingly doing a much better job of representing the Prior. This is indicated by the similarity of the properties of the generated structures shown in Table 1 as well as the SMILES themselves shown in Table 2.
Table 1 Comparison of model performance and properties for non-sulphur containing structures generated by the two models
Table 2 Randomly selected SMILES generated by the different models
Similarity guided structure generation
The second task investigated was that of generating structures similar to a query structure. The Jaccard index [37] \(J_{i, j}\) of the RDKit implementation of FCFP4 [38] fingerprints was used as a similarity measure between molecules i and j. Compared to the DRD2 activity model (“The DRD2 activity model” section), the feature-invariant version of the fingerprints and the smaller diameter of 4 were used in order to obtain a fuzzier similarity measure. The scoring function was defined as:
$$\begin{aligned} S(A) = -1 + 2 \times \frac{\min \{ J_{i, j}, k \}}{k} \end{aligned}$$
This definition means that an increase in similarity is only rewarded up to the threshold \(k\in [0, 1]\), and that the reward is scaled from \(-1\) (no overlap in the fingerprints between query and generated structure) to 1 (at least k degree of overlap). Celecoxib was chosen as our query structure, and we first investigated whether Celecoxib itself could be generated by using the high values of \(k=1\) and \(\sigma =15\). The Agent was trained for 1000 steps. After 100 training steps the Agent starts to generate Celecoxib, and after 200 steps it predominantly generates this structure (Fig. 6).
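The capped similarity reward can be sketched in a few lines. In the example below fingerprints are represented as plain Python sets of on-bit indices, which is a simplification standing in for the RDKit FCFP4 fingerprints; the function names and the toy bit sets are hypothetical. Although the text allows \(k\in [0, 1]\), the sketch requires \(k > 0\) to avoid a division by zero.

```python
from typing import Set


def jaccard_index(bits_i: Set[int], bits_j: Set[int]) -> float:
    """Jaccard index of two fingerprints given as sets of on-bit indices."""
    union = bits_i | bits_j
    if not union:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(bits_i & bits_j) / len(union)


def similarity_score(jaccard: float, k: float) -> float:
    """S(A) = -1 + 2 * min(J, k) / k, i.e. similarity is rewarded only up to k."""
    if not 0.0 < k <= 1.0:
        raise ValueError("k must lie in (0, 1]")
    return -1.0 + 2.0 * min(jaccard, k) / k


# Toy fingerprints (made-up bit indices): 3 shared bits out of 7 in the union.
query_bits = {1, 2, 3, 4, 5}
generated_bits = {3, 4, 5, 6, 7}

j = jaccard_index(query_bits, generated_bits)
print(f"J = {j:.3f}, S(A) with k = 0.7: {similarity_score(j, 0.7):.3f}")
print(f"S(A) for J = 0.9 and k = 0.7: {similarity_score(0.9, 0.7):.3f}")
```

Any similarity at or above k receives the maximum score of 1, so with \(k=0.7\) the Agent gains nothing from pushing the similarity beyond 0.7.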
Celecoxib itself, as well as many other similar structures, appears in the ChEMBL training set used to build the Prior. An interesting question is whether the Agent could succeed in generating Celecoxib when these structures are not part of the chemical space covered by the Prior. To investigate this, all structures with a similarity to Celecoxib higher than 0.5 (corresponding to 1804 molecules) were removed from the training set and a new reduced Prior was trained. The prior likelihood of Celecoxib under the canonical and reduced Priors was compared, as well as the ability of the models to generate structures similar to Celecoxib. As expected, removing similar compounds from the training set decreased the prior probability of Celecoxib from \(\log _e P = -12.7\) to \(\log _e P = -19.2\), a reduction in likelihood by a factor of about 700. An Agent was then trained using the same hyperparameters as before, but on the reduced Prior. After 400 steps, the Agent again managed to find Celecoxib, although it required more time to do so. After 1000 steps, Celecoxib was the most commonly generated structure (about a third of the generated structures), followed by demethylated Celecoxib (also about a third). The SMILES of the latter is more likely according to the Prior, with \(\log _e P = -15.2\), but it has a lower similarity (\(J = 0.87\)), resulting in an augmented likelihood roughly equal to that of Celecoxib.
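The numbers in this paragraph can be checked with a short calculation. The sketch below assumes that the augmented likelihood is the prior log likelihood plus \(\sigma S(A)\), as defined in “The Agent network” section (the same bracketed term that appears in the ‘REINFORCE + Prior’ cost function), and uses the similarity scoring function with \(k=1\) and \(\sigma =15\). The function names are illustrative.

```python
import math

SIGMA = 15.0  # sigma used for the Celecoxib experiments with k = 1
K = 1.0


def similarity_score(jaccard: float, k: float) -> float:
    """S(A) = -1 + 2 * min(J, k) / k."""
    return -1.0 + 2.0 * min(jaccard, k) / k


def augmented_log_likelihood(prior_log_p: float, score: float, sigma: float) -> float:
    """Assumed definition: log P_Prior(A) + sigma * S(A)."""
    return prior_log_p + sigma * score


# Values under the reduced Prior, as quoted in the text.
celecoxib = augmented_log_likelihood(-19.2, similarity_score(1.0, K), SIGMA)
demethylated = augmented_log_likelihood(-15.2, similarity_score(0.87, K), SIGMA)
print(f"Celecoxib:              {celecoxib:.2f}")
print(f"Demethylated Celecoxib: {demethylated:.2f}")

# Drop in prior likelihood of Celecoxib between the canonical and reduced Priors.
likelihood_ratio = math.exp(-12.7 - (-19.2))
print(f"likelihood reduction factor: {likelihood_ratio:.0f}")
```

The two augmented log likelihoods come out at roughly −4.2 and −4.1, which is consistent with the two structures being generated about equally often. The reduction factor evaluates to about 665, in line with the rounded figure of 700.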
These experiments demonstrate that the Agent can be optimized using fingerprint-based Jaccard similarity as the objective, but making copies of the query structure is hardly useful. A more useful example is that of generating structures that are moderately similar to the query structure. The Agent was therefore trained for 3000 steps, starting from both the canonical and the reduced Prior, using \(k = 0.7\) and \(\sigma = 12\). The Agents based on the canonical Prior quickly converged to their targets, while the Agents based on the reduced Prior converged more slowly. For the Agent based on the reduced Prior where \(k=1\), the fact that Celecoxib and demethylated Celecoxib are given similar augmented likelihoods means that the average similarity converges to around 0.9 rather than 1.0. For the Agent based on the reduced Prior where \(k=0.7\), the lower prior likelihood of compounds similar to Celecoxib translates to a lower augmented likelihood, which lowers the average similarity of the structures generated by the Agent.
To explore whether this reduced prior likelihood could be offset with a higher value of \(\sigma\), an Agent starting from the reduced Prior was trained using \(\sigma =15\). Though taking slightly more time to converge than the Agent based on the canonical Prior, this Agent too could converge to the target similarity. The learning curves for the different models are shown in Fig. 6.
An illustration of how the type of structures generated by the Agent evolves during training is shown in Fig. 7. For the Agent based on the reduced Prior with \(k=0.7\) and \(\sigma =15\), three structures were randomly sampled every 100 training steps from step 0 up to step 400. At first, the structures are not similar to Celecoxib. After 200 steps, some features from Celecoxib have started to emerge, and after 300 steps the model generates mostly close analogues of Celecoxib.
We have investigated how various factors affect the learning behaviour of the Agent. In real drug discovery applications, we might be more interested in finding structures with modest similarity to our query molecules rather than very close analogues. For example, one of the structures sampled after 200 steps shown in Fig. 7 displays a type of scaffold hopping where the sulphur functional group on one of the outer aromatic rings has been fused to the central pyrazole. The similarity to Celecoxib of this structure is 0.4, which may be a more interesting solution for scaffold-hopping purposes. One can choose hyperparameters and a similarity criterion tailored to the desired output. Other types of similarity measures such as pharmacophoric fingerprints [42], Tversky substructure similarity [43], or presence/absence of certain pharmacophores could also be explored.
Target activity guided structure generation
The third example, perhaps the one most interesting and relevant for drug discovery, is to optimize the Agent towards generating structures with predicted biological activity. This can be seen as a form of inverse QSAR, where the Agent is used to implicitly map high predicted probability of activity to molecular structure. DRD2 was chosen as the biological target. The clustering split of the DRD2 activity dataset as described in “The DRD2 activity model” section resulted in 1405, 1287, and 4526 actives in the test, validation, and training sets respectively. The average similarity to the nearest neighbour in the training set for the test set compounds was 0.53. For a random split of actives into sets of the same sizes this similarity was 0.69, indicating that the clustering had significantly decreased the similarity between the training and test sets. This mimics the hit-finding practice in drug discovery of identifying hits that are diverse with respect to the training set. Most of the DRD2 actives are also included in the ChEMBL dataset used to train the Prior. To explore the effect of not having the known actives included in the Prior, a reduced Prior was trained on a reduced subset of the ChEMBL training set where all DRD2 actives had been removed.
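The nearest-neighbour statistic quoted above can be illustrated with a small sketch: for each test compound, take its highest similarity to any training compound, then average over the test set. Fingerprints are again represented as sets of on-bit indices, Jaccard similarity is assumed as the measure, and the function names and toy data are hypothetical.

```python
from typing import Sequence, Set


def jaccard_index(bits_i: Set[int], bits_j: Set[int]) -> float:
    """Jaccard index of two fingerprints given as sets of on-bit indices."""
    union = bits_i | bits_j
    return len(bits_i & bits_j) / len(union) if union else 0.0


def mean_nearest_neighbour_similarity(test_fps: Sequence[Set[int]],
                                      train_fps: Sequence[Set[int]]) -> float:
    """Average, over test compounds, of the highest similarity to any training compound."""
    if not test_fps or not train_fps:
        raise ValueError("both fingerprint collections must be non-empty")
    nearest = [max(jaccard_index(t, r) for r in train_fps) for t in test_fps]
    return sum(nearest) / len(nearest)


# Toy fingerprints (made-up bit indices).
train_fps = [{1, 2, 3, 4}, {10, 11, 12}]
test_fps = [{1, 2, 3}, {10, 20}]

print(mean_nearest_neighbour_similarity(test_fps, train_fps))
```

A lower value indicates that the test compounds sit further from the training set, which is the effect the clustering split is intended to produce.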
The optimal hyperparameters found for the SVM activity model were \(C=2^{7}, \gamma =2^{-6}\), resulting in a model whose performance is shown in Table 3. The good performance in general can be explained by the apparent difference between actives and inactive compounds as seen during the clustering, and the better performance on the test set compared to the validation set could be due to slightly higher nearest neighbour similarity to the training actives (0.53 for test actives and 0.48 for validation actives).
Table 3 Performance of the DRD2 activity model
The output of the DRD2 model for a given structure is an uncalibrated predicted probability of being active \(P_{active}\). This value is used to formulate the following scoring function:
$$\begin{aligned} S(A) = -1 + 2 \times P_{active} \end{aligned}$$
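As a small sketch, the mapping from predicted probability to score can be written as follows. The function name is hypothetical, and `None` is used here as an assumed marker for sequences that are not valid SMILES, which receive a score of \(-1\).

```python
from typing import Optional


def activity_score(p_active: Optional[float]) -> float:
    """S(A) = -1 + 2 * P_active; None marks an invalid SMILES and scores -1."""
    if p_active is None:
        return -1.0
    if not 0.0 <= p_active <= 1.0:
        raise ValueError("p_active must be a probability in [0, 1]")
    return -1.0 + 2.0 * p_active


print(activity_score(0.96))  # confidently predicted active
print(activity_score(0.5))   # undecided prediction maps to a neutral score
print(activity_score(None))  # invalid sequence
```

The linear rescaling puts the score on the same \([-1, 1]\) range as the similarity scoring function, so that values of \(\sigma\) are comparable across tasks.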
The model was trained for 3000 steps using \(\sigma = 7\). After training, the fraction of predicted actives according to the DRD2 model increased from 0.02 for structures generated by the reduced Prior to 0.96 for structures generated by the corresponding Agent network (Table 4). To see how well the structure-activity relationship learnt by the activity model is transferred to the type of structures generated by the Agent RNN, the fraction of compounds with an ECFP6 Jaccard similarity greater than 0.4 to any active in the training and test sets was calculated.
Table 4 Comparison of properties for structures generated by the canonical Prior, the reduced Prior, and corresponding Agents
In some cases, the model recovered exact matches from the training and test sets (cf. Segler et al. [13]). The fractions of test actives recovered by the canonical and reduced Priors were 1.3% and 0.3% respectively. The Agent derived from the canonical Prior managed to recover 13% of the test actives; the Agent derived from the reduced Prior recovered 7%. For the Agent derived from the reduced Prior, where the DRD2 actives were excluded from the Prior training set, this means that the model has learnt to generate “novel” structures that have been seen neither by the DRD2 activity model nor the Prior, and are experimentally confirmed actives. We can formalize this observation by calculating the probability of a given generated sequence belonging to the set of test actives. For the canonical and reduced Priors, this probability was \(0.17\times 10^{-3}\) and \(0.05\times 10^{-3}\) respectively. Removing the actives from the Prior thus resulted in a threefold reduction in the probability of generating a structure from the set of test actives. For the Agents, the probabilities rose to \(40.2\times 10^{-3}\) and \(15.0\times 10^{-3}\) respectively, corresponding to an enrichment of a factor of roughly 250 over the Prior models. Again the consequence of removing the actives from the Prior was a threefold reduction in the probability of generating a test set active: the difference between the two Priors is directly mirrored by their corresponding Agents. Apart from generating a higher fraction of structures that are predicted to be active, both Agents also generate a significantly higher fraction of valid SMILES (Table 4). Sequences that are not valid SMILES receive a score of \(-1\), which means that the scoring function naturally encourages valid SMILES.
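The enrichment and reduction factors can be reproduced from the four quoted probabilities. The sketch below assumes that the larger Agent probability (\(40.2\times 10^{-3}\)) belongs to the Agent derived from the canonical Prior, which is the pairing that the stated threefold reduction requires. The variable names are illustrative.

```python
# Probability that a generated sequence belongs to the set of test actives.
prior_probability = {"canonical": 0.17e-3, "reduced": 0.05e-3}
# Assumed pairing: the larger Agent value goes with the canonical Prior.
agent_probability = {"canonical": 40.2e-3, "reduced": 15.0e-3}

# Enrichment of each Agent over the Prior it was derived from.
enrichment = {
    name: agent_probability[name] / prior_probability[name]
    for name in prior_probability
}

# Effect of removing the known actives from the Prior training set.
prior_ratio = prior_probability["canonical"] / prior_probability["reduced"]
agent_ratio = agent_probability["canonical"] / agent_probability["reduced"]

for name, factor in enrichment.items():
    print(f"{name} Agent enrichment over its Prior: {factor:.0f}x")
print(f"canonical / reduced, Priors: {prior_ratio:.1f}x")
print(f"canonical / reduced, Agents: {agent_ratio:.1f}x")
```

The two enrichment factors bracket the quoted figure of about 250, and both canonical-to-reduced ratios are close to three.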
A few of the test set actives generated by the Agent based on the reduced Prior along with a few randomly selected generated structures are shown together with their predicted probability of activity in Fig. 8. Encouragingly, the recovered test set actives vary considerably in their structure, which would not have been the case had the Agent converged to generating only a certain type of very similar predicted active compounds.
Removing the known actives from the training set of the Prior resulted in an Agent which shows a decrease in all metrics measuring the overlap between the known actives and the structures generated, compared to the Agent derived from the canonical Prior. Interestingly, the fraction of predicted actives did not change significantly. This indicates that the Agent derived from the reduced Prior has managed to find a similar chemical space to that of the canonical Agent, with structures that are equally likely to be predicted as active, but are less similar to the known actives. Whether or not these compounds are active will be dependent on the accuracy of the target activity model. Ideally, any predictive model to be used in conjunction with the generative model should cover a broad chemical space within its domain of applicability, since it initially has to assess representative structures of the dataset used to build the Prior [13].
Figure 9 shows a comparison of the conditional probability distributions for the reduced Prior and its corresponding Agent when a molecule from the set of test actives is generated. It can be seen that the changes are not drastic, with most of the trends learnt by the Prior being carried over to the Agent. However, a big change in the probability distribution at even a single step can have a large impact on the likelihood of the sequence and could significantly alter the type of structures generated.
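The last point follows from the sequence likelihood being the product of the per-step conditional probabilities. The toy calculation below uses made-up probabilities (not values read from Fig. 9) in which the two models agree at every step except one.

```python
import math
from typing import Sequence


def sequence_log_likelihood(step_probabilities: Sequence[float]) -> float:
    """Log likelihood of a sequence as the sum of per-step log probabilities."""
    return sum(math.log(p) for p in step_probabilities)


# Hypothetical probabilities of the sampled token at each of five steps.
prior_steps = [0.9, 0.8, 0.01, 0.9, 0.95]
agent_steps = [0.9, 0.8, 0.50, 0.9, 0.95]  # differs from the Prior only at step 3

ratio = math.exp(
    sequence_log_likelihood(agent_steps) - sequence_log_likelihood(prior_steps)
)
print(f"the sequence is {ratio:.0f} times more likely under the Agent")
```

In this example, raising a single token probability from 0.01 to 0.5 makes the whole sequence fifty times more likely, even though every other step is unchanged.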