Introduction

The design and optimization of compounds towards potential drug candidates is crucial in drug discovery. The main challenges include the large chemical search space [1] and the requirement to optimize towards multiple properties, e.g. physicochemical properties, safety, synthetic feasibility and potency against the target. To accelerate the molecular design and optimization process, various deep neural networks have been explored as molecular generative models, e.g. recurrent neural networks (RNNs) [2,3,4], variational autoencoders (VAEs) [5,6,7,8,9,10], transformers [11,12,13,14], generative adversarial networks (GANs) [15,16,17,18], graph neural networks (GNNs) [19,20,21,22] and diffusion-based models [23,24,25]. Early work focused on de novo molecular design, which generates molecules from scratch without a starting compound, while increasing attention is being paid to conditional compound generation and optimization from a specific starting structure that already shows promise, e.g. compounds [12, 26,27,28,29,30], scaffolds [14, 22, 31,32,33,34] and fragments [25, 35,36,37,38]. In this work, we focus on compounds as starting points. In previous publications [12, 29, 30], we treated the molecular optimization problem as a machine translation task and trained the transformer model [39] on pairs of similar molecules extracted based on different similarity criteria, e.g. Tanimoto similarity on fingerprints, matched molecular pairs and shared scaffolds. The model learns to generate molecules similar to a given input molecule. To generate compounds with desired properties, property change tokens are prepended to the simplified molecular-input line-entry system (SMILES) [40] tokens in order to steer the model towards the chemical space of interest. However, such a model is limited during optimization to the set of properties preselected at training time.

Reinforcement learning (RL) [41,42,43] has been used to guide generative models to explore the chemical space of interest defined by a set of user-defined properties. It provides the flexibility to optimize molecules towards various user-specified desired properties. Here, we integrate the transformer models [30, 44] trained for generating similar molecules into the REINVENT framework [42] and evaluate the effect of reinforcement learning. Specifically, the evaluation is conducted on two tasks, i.e. molecular optimization and scaffold discovery. Each task includes four example starting molecules with varying levels of optimization challenge. The transformer model generates molecules similar to a given starting molecule, and reinforcement learning is applied to enforce multi-parameter optimization of the starting molecule. The integration of the transformer model, which has learned the chemical space surrounding input molecules, with RL has potential applicability for constrained optimization of a starting molecule, e.g. molecular optimization and scaffold discovery.

Methods

Transformer based molecular generator

We focus on the transformer models trained on sets of similar molecular pairs. The molecules are represented as SMILES, and the SMILES are tokenized to construct a vocabulary containing all possible tokens. After training, the models can generate molecules similar to a given input molecule. In particular, two models trained on different amounts of training data are examined in REINVENT: the transformer model [30] trained on around 6.5 million molecular pairs extracted from ChEMBL and the transformer model [44] trained on over 200 billion molecular pairs from PubChem. Molecular pairs with a Tanimoto similarity \(\ge\) 0.5 based on RDKit Morgan fingerprints (radius = 2, with counts) are selected. To generate multiple molecules, non-deterministic multinomial sampling is used: at each time step, a token is randomly selected according to the probability distribution over the vocabulary.
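As a concrete illustration of the pair-selection criterion, the sketch below computes the Tanimoto similarity between two molecules with count-based Morgan fingerprints (radius 2) using RDKit; the example SMILES and the helper name are illustrative, only the 0.5 threshold and the fingerprint settings come from the text.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Count-based Morgan fingerprints with radius 2, as used for pair selection
_fp_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2)

def tanimoto(smiles_a: str, smiles_b: str) -> float:
    """Tanimoto similarity between two SMILES on Morgan count fingerprints."""
    fp_a = _fp_gen.GetCountFingerprint(Chem.MolFromSmiles(smiles_a))
    fp_b = _fp_gen.GetCountFingerprint(Chem.MolFromSmiles(smiles_b))
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)

# A candidate (source, target) pair is kept only if it meets the similarity criterion
source, target = "CCOc1ccccc1", "CCOc1ccccc1C"   # illustrative molecules
if tanimoto(source, target) >= 0.5:
    print("keep pair:", source, target)
```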

REINVENT

REINVENT [42, 45] is an AI-based tool for molecular design and optimization. It contains three main components: a molecular generative model; a scoring function, which scores the generated molecules based on a set of user-specified scoring criteria and produces a combined score as reward; and RL as a search algorithm to steer the generative model towards the chemical space with high reward. Additionally, to reduce the risk of mode collapse and encourage diversity of the generated molecules, REINVENT uses a molecular memory system called the diversity filter (DF), with several implemented strategies. The DF penalizes compounds that have been generated too often, either as identical structures or as compounds sharing the same scaffold. The generative model acts as the agent and describes the joint probability of generating a molecule represented by a token sequence \(T = t_1, t_2, \ldots , t_l\) given an input molecule token sequence X as

$$\begin{aligned} \textrm{P} (T\vert X;\varvec{\uptheta }) = \prod _{i=1}^{l}\textrm{P}\left( t_{i} \vert t_1,\ldots ,t_{i-1},X;\varvec{\uptheta }\right) , \end{aligned}$$
(1)

where \(\varvec{\uptheta }\) represents the model parameters, \(t_{i}\) represents the i-th token of T, and l represents the length of T. Accordingly, the negative log likelihood (NLL) is defined as

$$\begin{aligned} \textrm{NLL} (T\vert X;\varvec{\uptheta }) = -\sum _{i=1}^{l} \log \textrm{P}\left( t_{i} \vert t_1,\ldots ,t_{i-1},X;\varvec{\uptheta }\right) \end{aligned}$$
(2)
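For concreteness, Eq. 2 can be written in a few lines of PyTorch, assuming the decoder has already produced per-position logits over the vocabulary for the target sequence T conditioned on X; the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def sequence_nll(logits: torch.Tensor, target_tokens: torch.Tensor) -> torch.Tensor:
    """
    logits: (l, vocab_size) unnormalized scores for positions 1..l of T, conditioned on X
    target_tokens: (l,) integer indices of the tokens t_1, ..., t_l
    Returns NLL(T | X) as in Eq. 2.
    """
    log_probs = F.log_softmax(logits, dim=-1)                 # log P(t_i | t_1..t_{i-1}, X)
    picked = log_probs.gather(1, target_tokens.unsqueeze(1))  # log-probability of each target token
    return -picked.sum()
```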
Fig. 1
figure 1

General RL workflow. The agent is initialized (0) by a transformer prior that has learned to generate molecules similar to a given input molecule. The RL loop starts with sampling (1) a batch of molecules represented as SMILES, which are then scored based on the set of user-specified scoring components (2). The loss is computed by combining the score and the negative log likelihood of the generated molecules, and finally the agent is updated (3) to minimize the loss

Fig. 2
figure 2

Input starting molecules. P(active): predicted probability to be active according to the DRD2 activity model

Table 1 Overview of model configuration

Figure 1 shows the general RL workflow. The agent is initialized using the transformer prior, which generates molecules similar to an input molecule. The reinforcement learning loop then tunes the agent's focus towards a narrower chemical space of interest, defined by a set of user-specified scoring components. Specifically, in each RL step, a batch of molecules (batch size = 128) is sampled from the agent given the input molecule and then evaluated with the scoring function. The evaluated score is combined with the prior's and the agent's negative log likelihoods to compute the loss. The loss is defined in Eq. 3, following [42].

$$\begin{aligned} \mathcal {L} (\varvec{\uptheta }) = \left( \textrm{NLL}_{\text {aug}} (T\vert X ) - \textrm{NLL} (T\vert X; \varvec{\uptheta } ) \right) ^2. \end{aligned}$$
(3)

\(\textrm{NLL}_{\text {aug}}\) represents the augmented negative log likelihood defined as

$$\begin{aligned} \textrm{NLL}_{\text {aug}} (T\vert X) = \textrm{NLL} (T\vert X; \varvec{\uptheta }_{\text {prior}}) - \sigma S(T) \end{aligned}$$
(4)

where \(S(T) \in\) [0, 1] is a scoring function whose value represents the evaluated desirability of the molecule sequence T. It is an aggregation of multiple scoring components; more details of S can be found in [42]. \(\sigma > 0\) is a scalar coefficient balancing the desirability against the prior likelihood of a sequence, and \(\varvec{\uptheta }_{\text {prior}}\) are the parameters of the prior. The agent is updated to minimize Eq. 3, as demonstrated previously [32, 42], which encourages increasing the evaluated score while keeping the agent close to the prior, which has learned to produce valid and similar molecules. Note that at the beginning of training \(\varvec{\uptheta } = \varvec{\uptheta }_{\text {prior}}\); during training, \(\varvec{\uptheta }_{\text {prior}}\) is kept fixed while \(\varvec{\uptheta }\) is updated.
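A minimal sketch of the update in Eqs. 3 and 4, assuming the agent NLLs, the (fixed) prior NLLs and the aggregated scores of a sampled batch are already available as tensors; the names, the batch averaging and the use of PyTorch are illustrative assumptions rather than the REINVENT implementation.

```python
import torch

sigma = 120.0  # balancing coefficient; 120 is the default value used in this work

def reinvent_loss(agent_nll: torch.Tensor,
                  prior_nll: torch.Tensor,
                  scores: torch.Tensor) -> torch.Tensor:
    """
    agent_nll: (batch,) NLL(T|X; theta) under the agent (carries gradients)
    prior_nll: (batch,) NLL(T|X; theta_prior) under the fixed prior
    scores:    (batch,) aggregated desirability S(T) in [0, 1]
    """
    augmented_nll = prior_nll.detach() - sigma * scores    # Eq. 4
    return ((augmented_nll - agent_nll) ** 2).mean()       # Eq. 3, averaged over the batch

# In each RL step: loss = reinvent_loss(...); loss.backward(); optimizer.step()
# Only the agent parameters are updated; the prior stays fixed.
```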

Experimental setup

The computational experiments aim to evaluate whether RL can improve the performance of transformer-based generative models in generating molecules with desired properties. The evaluation is conducted for two application scenarios:

  1. Scaffold discovery: generate new scaffold ideas that are active against the dopamine receptor type 2 (DRD2) target.

  2. Molecular optimization: generate close analogues of the input molecule with improved activity against the DRD2 target.

As a proxy for biological activity, we use the DRD2 activity model from Olivecrona et al. [41], which was trained on data extracted from ExCAPE-DB [46]. The output of the model is the predicted probability of a given molecule being active (pIC50\(\ge\)5). For both scaffold discovery and molecular optimization, it is common to start with compounds that have already shown reasonable potency. Four compounds were selected from the DRD2 active compounds with pIC50\(\ge\)5 in ExCAPE-DB as input starting molecules for thorough investigation. Figure 2 shows the input compounds, their pIC50 values, their predicted probabilities of being active P(active) and their Quantitative Estimate of Drug-likeness (QED) [47] scores. These compounds were selected to simulate different challenges with respect to the input starting structure and property score. Additionally, as a supplementary analysis, 100 compounds were selected from the DRD2 actives as input compounds, each with P(active)>0.5 and chosen randomly from the top 100 most frequent unique scaffolds.

Baseline and REINVENT configuration

The goal is to evaluate whether RL can help steer the transformer-based generative model towards a desirable chemical and physical property space. Therefore, the transformer models trained on molecular pairs but without RL serve as baselines. For the main experiments, we use our most recent transformer model [44], which was trained on the PubChem database. For RL, different REINVENT configurations are used; see Table 1.

Table 2 Evaluation metrics

Scoring components: Since we are interested in generating compounds that are active against the DRD2 target, the DRD2 activity model is added to the scoring function. Additionally, QED is included to prevent the model from generating molecules that have a high predicted probability of being active but are not drug-like. For the molecular optimization task, an extra scoring component, Tanimoto similarity based on RDKit Morgan fingerprints (radius = 2, with counts), is added to encourage generating molecules that are similar to the corresponding input compound.
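As a hedged sketch of how these components could be combined into a single score in [0, 1]: the DRD2 activity model is represented by a placeholder `predict_p_active` function, and the geometric-mean aggregation is an illustrative choice, not necessarily the exact REINVENT aggregation.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import QED, rdFingerprintGenerator

_fp_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2)

def predict_p_active(mol) -> float:
    """Placeholder for the DRD2 activity model returning P(active) in [0, 1]."""
    raise NotImplementedError

def score(smiles: str, input_smiles: str, use_similarity: bool = False) -> float:
    """Aggregate P(active), QED and (optionally) similarity to the input compound."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return 0.0                                  # invalid SMILES get the lowest score
    components = [predict_p_active(mol), QED.qed(mol)]
    if use_similarity:                              # only for the molecular optimization task
        ref = Chem.MolFromSmiles(input_smiles)
        components.append(DataStructs.TanimotoSimilarity(
            _fp_gen.GetCountFingerprint(mol), _fp_gen.GetCountFingerprint(ref)))
    prod = 1.0
    for c in components:
        prod *= c
    return prod ** (1.0 / len(components))          # geometric mean of the components
```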

Fig. 3
figure 3

Scaffold discovery task: mean values and ±1 standard deviation over 10 runs for the number of unique compounds, scaffolds and generic scaffolds that show P(active)>0.6 and QED>0.6. RL generally outperforms No RL, and RL_DF(scaffold) performs best at finding the most unique scaffolds and generic scaffolds with desirable properties

Diversity filter: Different diversity filter strategies are used. DF(cmp) penalizes the same compound being generated frequently, while DF(scaffold) penalizes compounds sharing the same Murcko-type scaffold. For the molecular optimization task, the DF(scaffold) option is not used since the goal is to generate molecules that are highly similar to the input compound. For comparison, we also include noDF, which corresponds to no diversity filter being applied.
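As an illustration, a scaffold-based filter in this spirit could keep a counter per Murcko scaffold and zero out the score once the scaffold has been generated more than a fixed number of times; the bucket size and the hard zeroing are illustrative assumptions, not the exact REINVENT strategy.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

class ScaffoldDiversityFilter:
    """Penalize compounds whose Murcko scaffold has already been generated too often."""

    def __init__(self, bucket_size: int = 25):
        self.bucket_size = bucket_size   # how many compounds per scaffold still receive reward
        self.counts = defaultdict(int)

    def apply(self, smiles: str, raw_score: float) -> float:
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles)
        self.counts[scaffold] += 1
        # once the scaffold bucket is full, further compounds receive no reward
        return raw_score if self.counts[scaffold] <= self.bucket_size else 0.0
```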

The RL loop is run for 1000 steps, with each step generating 128 molecules, resulting in 128,000 molecules in total. For the baseline model without RL, we therefore sample 128,000 molecules for comparison. Since multinomial sampling is non-deterministic, we run the experiments ten times and report the averaged results with ± one standard deviation for the input starting molecules in Fig. 2.

Evaluation metrics

In general, we are interested in understanding whether RL can help generate additional, diverse high-scoring compounds. Table 2 shows the evaluation metrics used for the scaffold discovery and molecular optimization tasks. For scaffold discovery, the focus is on finding novel scaffolds with a high predicted chance of being active and a good QED score (i.e. P(active)>0.6 and QED>0.6). Additionally, the improvement in predicted activity and QED over the input compounds is examined in a secondary analysis. For molecular optimization, it is favourable to obtain close analogues of the input molecule with an improved predicted probability of being active and improved QED. Here, “scaffold” refers to the RDKit Murcko scaffold, which removes the side chains, and “generic scaffold” is the Murcko scaffold with all atom types converted to carbon and all bonds to single bonds.
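The scaffold and generic scaffold used in these metrics map directly onto RDKit calls; a minimal sketch with an illustrative molecule:

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(N2CCN(C)CC2)cc1")   # illustrative molecule

scaffold = MurckoScaffold.GetScaffoldForMol(mol)        # Murcko scaffold: side chains removed
generic = MurckoScaffold.MakeScaffoldGeneric(scaffold)  # all atoms to carbon, all bonds to single

print(Chem.MolToSmiles(scaffold))   # ring systems plus linkers of the input molecule
print(Chem.MolToSmiles(generic))    # same topology with only carbons and single bonds
```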

Results and discussion

RL vs No RL for the scaffold discovery task

Fig. 4
figure 4

Scaffold discovery task: mean values and ±1 standard deviation over 10 runs for the number of unique compounds, scaffolds and generic scaffolds that show improved P(active) and QED compared to corresponding input molecule

Fig. 5
figure 5

Molecular optimization task: mean values and ±1 standard deviation over 10 runs for the number of unique compounds that show improved P(active) and QED compared to corresponding input molecule (“Compounds” in Figure) and additionally Tanimoto similarity above 0.7 (“Similarity>0.7” in Figure). RL generally outperforms No RL, and RL_DF(cmp)_Sim performs best in generating compounds with improved properties and Tanimoto similarity above 0.7 compared with corresponding input compound

Most RL settings perform better than No RL in terms of generating molecules with a predicted probability to be active >0.6 and QED>0.6, across all evaluation metrics and for all input compounds (Fig. 3). RL_DF(cmp) generates more unique compounds than RL_noDF, which confirms the advantage of penalizing compounds that have been generated frequently to improve diversity. RL_DF(scaffold) generates more unique scaffolds and generic scaffolds than RL_noDF and RL_DF(cmp), which suggests the benefit of penalizing frequently generated scaffolds. Especially for scaffold discovery efforts, this can be a useful strategy to increase scaffold diversity.

Furthermore, as a secondary analysis, we examine how often higher P(active) and QED than the input molecules are achieved. A similar trend is found: most RL settings perform better than No RL for all input compounds except compound 4 (Fig. 4). The reason why RL struggles with compound 4 might be that the predicted activity of the starting molecule is already very high, which makes it difficult to identify even more potent compounds with RL. Additionally, for compound 2 and compound 4, RL_DF(scaffold) performs worse than RL_noDF and RL_DF(cmp), which indicates that changing the scaffold of these compounds does not improve activity and/or QED. One possible explanation is that the scoring function is not explicitly set up to improve predicted activity and QED over the input; it aims to generate molecules with high scores, but not necessarily higher than those of the input molecules. This might also contribute to the observed high standard deviation across different runs. Overall, depending on the starting molecules’ properties and structural complexity, it is not unexpected to observe different behaviors. For example, compound 1 with P(active)=0.61 appears easier to improve when exploring diverse scaffolds, while compound 4, which already has P(active)=0.94, is difficult to improve while changing the scaffold.

RL vs No RL for the molecular optimization task

Fig. 6
figure 6

Molecular optimization task: Tanimoto similarity to input compound per RL step. Results are mean values and ±1 standard deviation over 10 runs

Fig. 7
figure 7

Scaffold discovery task: effect of pre-trained priors. Results are mean values and ±1 standard deviation over 10 runs for the number of unique scaffolds that show P(active)>0.6 and QED>0.6

Fig. 8
figure 8

Molecular optimization task: effect of pre-trained priors. Results are mean values and ±1 standard deviation over 10 runs for the number of unique compounds that show improved P(active) and QED over corresponding input molecule

Figure 5 shows the results for the molecular optimization task in terms of molecules achieving higher P(active) and QED scores. Most RL settings perform better than No RL on all evaluation metrics except for compound 4. RL_DF(cmp) generally generates more compounds with improved properties, but these are less similar to the input molecule, as can be seen from the lower number of compounds with Tanimoto similarity>0.7 compared with RL_DF(cmp)_Sim. This indicates that adding Tanimoto similarity to the scoring function helps generate molecules that are more similar to the input compound, which is useful for local molecular optimization, i.e. exploration of the chemical space close to an input compound.

The improvement of RL over No RL is not as large as in the scaffold discovery task (Fig. 3). This may be because there are more possibilities for discovering compounds with diverse scaffolds and relatively favorable properties than for identifying molecules that closely resemble the input molecule while exhibiting improved properties.

Notably, for compound 4, which has P(active)=0.94, RL (i.e. RL_DF(cmp)_Sim) shows a slight improvement over No RL, unlike in the scaffold discovery task (Fig. 4). This might be because the Tanimoto similarity scoring component helps the model generate compounds more similar to compound 4, which are also more likely to be highly active.

Figure 6 shows how the Tanimoto similarity to the corresponding input compound changes as the RL steps progress. RL_noDF mostly maintains a decent similarity as the RL steps increase, while RL_DF(cmp) exhibits declining similarity, mostly remaining above 0.5. RL_DF(cmp) encourages the generation of more unique compounds with improved properties than RL_noDF (as shown in Fig. 5) at the expense of reduced similarity. RL_DF(cmp)_Sim increases similarity compared to RL_DF(cmp) (Fig. 6) and leads to more unique compounds with similarity above 0.7 and improved properties than RL_noDF and RL_DF(cmp), as shown in Fig. 5.

Effect of pre-trained priors

Here, we evaluate the effect of priors trained on different amounts of training data, in particular the transformer models trained on ChEMBL [30] and PubChem [44]. Figure 7 shows the number of unique scaffolds with P(active)>0.6 and QED>0.6 for the scaffold discovery task. Without RL, the PubChem prior already yields more compounds of interest than the ChEMBL prior. This could be because the PubChem prior was trained on a much larger dataset (200 billion vs 6.5 million pairs). Most RL configurations improve performance for both priors. The PubChem prior consistently outperforms the ChEMBL prior, with the exception of RL_DF(scaffold), where the ChEMBL prior shows comparable performance. This might be because the PubChem prior has learned the chemical space closer to an input molecule than the ChEMBL prior, resulting in slower adaptation towards generating diverse scaffolds. Figure 8 shows the results for the molecular optimization task. Similarly, the PubChem prior generates more compounds with desirable properties than the ChEMBL prior. In general, RL facilitates the generation of more compounds with desirable properties for both priors, with the PubChem prior typically outperforming the ChEMBL prior in the evaluated tasks.

Fig. 9
figure 9

Scaffold discovery task: effect of learning steps on the number of unique scaffolds with P(active)>0.6 and QED>0.6. Results are mean values and ±1 standard deviation over 10 runs. RL_DF(scaffold) consistently generates more unique scaffolds with desirable properties as the number of steps increases

Fig. 10
figure 10

Molecular optimization task: effect of learning steps on the number of unique compounds with improved P(active) and QED score, and a Tanimoto similarity >0.7 compared to corresponding input molecule. Results are mean values and ±1 standard deviation over 10 runs. RL_DF(cmp)_Sim generally produces more unique compounds with improved properties that are similar to input molecule as the number of steps increases

Effect of learning steps

Here, we evaluate the effect of varying the number of RL learning steps, i.e. 100, 1000 and 2000 steps. Figure 9 shows the number of unique scaffolds with P(active)>0.6 and QED>0.6 for the scaffold discovery task when varying the number of learning steps. For simplicity, we only show RL_DF(scaffold). RL_DF(scaffold) exhibits a consistent trend of generating more unique scaffolds with desirable properties as the number of steps increases, while No RL shows limited improvement. This is expected because without RL the same area of chemical space is searched at every step, whereas RL allows the agent to update at each step and explore different regions. A similar trend can be found for the molecular optimization task in Fig. 10. With more steps, RL_DF(cmp)_Sim tends to generate more unique compounds with improved properties that are similar to the input molecule. Overall, these findings suggest that increasing the number of learning steps typically leads to the discovery of more compounds of interest.

Fig. 11
figure 11

Scaffold discovery task: effect of learning rates on the number of unique scaffolds with P(active)>0.6 and QED>0.6 when using RL_DF(scaffold). Results are mean values and ±1 standard deviation over 10 runs. Learning rate=0 is equivalent to No RL. Increasing the learning rate (up to 1e-4) tends to explore the desired chemical space more efficiently while introducing higher variance between different runs. A learning rate of 1e-3 leads to a dramatic decrease in model performance

Fig. 12
figure 12

Scaffold discovery task: overlap of three runs with varying learning rates (lr) on the unique compounds (a-d) and unique scaffolds with P(active)>0.6 and QED>0.6 (e-h) produced by RL_DF(scaffold) for compound 1. Generally, a higher learning rate (up to 1e-4) results in less overlap in chemical space between different runs and exploration of a larger chemical space

Effect of learning rates

Here, we evaluate the effect of different learning rates, i.e. 0, 1e-5, 1e-4 (default) and 1e-3. Notably, a learning rate of 0 is equivalent to No RL. Figure 11 shows the number of unique scaffolds with P(active)>0.6 and QED>0.6 for the scaffold discovery task with increasing learning rate. For simplicity, we focus on RL_DF(scaffold). As the learning rate increases up to 1e-4, more scaffolds with desirable properties are found, indicating that the model is guided more efficiently towards the desired chemical space. At the same time, the variance between different runs increases. This may be because with a higher learning rate each update to the model parameters is larger, shifting the model's focus towards more distinct regions of the chemical space in different runs. A learning rate that is too high, i.e. 1e-3 in this study, results in noisy and unstable updates. Figure 12a–d shows the overlap of three runs for the unique compounds generated by RL_DF(scaffold) for compound 1. It shows a tendency towards reduced overlap as the learning rate increases, indicating that each run tends to explore different parts of the chemical space. Additionally, a larger chemical space is explored with a higher learning rate of up to 1e-4. These factors might contribute to the greater variance in the number of unique scaffolds with desirable properties between different runs (Figs. 11 and 12e–g). The results for the molecular optimization task can be found in Supplementary Figs. S1 and S2, where similar trends are observed.

Fig. 13
figure 13

Percentage of valid molecules generated per RL step produced by RL_DF(scaffold) for compound 1. A learning rate of 1e-3 is too high and results in dramatic instability

Fig. 14
figure 14

Left: the prior NLL distribution of unique compounds generated by RL_DF(scaffold) for compound 1 with varying learning rate. Right: the prior and agent NLL distributions of unique scaffolds with P(active)>0.6 and QED>0.6 generated by RL_DF(scaffold) for compound 1 at learning rate 1e-4. The lower the NLL of a molecule, the higher the chance of it being generated

Fig. 15
figure 15

Scaffold discovery task: example of generated compounds with P(active)>0.6 and QED>0.6

Fig. 16
figure 16

Molecular optimization task: example of generated compounds with improved P(active) and QED, and Tanimoto similarity >0.7 compared with corresponding input compound

A learning rate of 1e-3 is too high and results in very large differences between runs (Fig. 12d), and much fewer scaffolds of interest are found (Fig. 12h). Figure 13 compares the percentage of valid molecules generated by RL_DF(scaffold) for compound 1 at learning rates of 1e-4 and 1e-3. A learning rate of 1e-4 produces stable output and a high percentage of valid molecules across different runs, whereas large variance is observed at 1e-3.

Figure 14 examines the prior NLL distribution of unique compounds generated by RL_DF(scaffold) for compound 1 with varying learning rates. With a higher learning rate (up to 1e-4), a larger chemical space, deviating from the prior, is explored. This is because RL steers the agent towards the chemical space with favourable properties, potentially directing it away from the prior. Consequently, the agent has a higher chance (lower NLL) of generating molecules with desirable properties than the prior (Fig. 14, right).

Figures 15 and 16 show example molecules generated for the scaffold discovery and molecular optimization tasks, respectively. These molecules are typically more likely (lower NLL) to be generated by the agent trained with RL than by the prior.

Effect of balancing factor \(\sigma\)

Here, we evaluate the effect of \(\sigma\) in Eq. 4, which balances the desirability of a molecule (enforced by the scoring objective) against the likelihood of that molecule under the prior. The default value is 120. Figure 17 shows the results for the molecular optimization task. A lower \(\sigma\) generally results in more unique similar molecules (i.e. Tanimoto similarity > 0.7) for all three RL settings, as shown in Fig. 17a. This is because a lower \(\sigma\) keeps the agent closer to the prior, which is trained to generate similar molecules. Meanwhile, a higher \(\sigma\) generally helps generate more unique compounds with improved properties (Fig. 17b). Ultimately, there is no clear trend in the number of unique compounds that are both similar and show improved properties when varying \(\sigma\) (Fig. 17c). Among the three RL settings, RL_DF(cmp)_Sim generates the most unique similar compounds while RL_DF(cmp) generates the most unique compounds with improved properties. This is because without Tanimoto similarity in the scoring objective, the agent can explore the chemical space more freely in search of high-scoring compounds, potentially deviating from the prior. In general, RL_DF(cmp)_Sim performs best at finding compounds that are both similar and have improved properties. RL_DF(cmp) generates more unique similar compounds than RL_noDF, as shown in Fig. 17a, indicating the benefit of the diversity filter for improving uniqueness. At \(\sigma\)=120, however, RL_DF(cmp) mostly performs worse than RL_noDF. This might be because the high \(\sigma\) shifts the agent away from the prior.

Overall, for local molecular optimization, the goal is to generate (1) unique molecules that are (2) highly similar (i.e. Tanimoto similarity > 0.7) to the input molecule while also showing (3) desirable properties (improved properties in this case). Achieving all of these criteria is important and challenging since they can be conflicting. The prior, the scoring objectives and the diversity filter have a direct impact on similarity, desirable properties and uniqueness, respectively. Lowering \(\sigma\) brings the agent closer to the prior, thus producing more similar molecules but also finding fewer compounds with improved properties. The scoring objectives guide the agent towards the chemical space of high-scoring compounds, but this can also lead to deviations from the prior. The diversity filter helps explore more unique compounds but can also lead to less similar compounds when \(\sigma\) is high (e.g. 120). Therefore, it is crucial to understand and consider the impact of these factors.

Fig. 17
figure 17

Molecular optimization task: effect of \(\sigma\) on the number of (a) unique compounds that have Tanimoto similarity above 0.7 relative to corresponding input compound, (b) unique compounds that show improved P(active) and QED compared to corresponding input compound and (c) unique compounds that show both improved P(active) and QED and Tanimoto similarity above 0.7. Results are mean values and ±1 standard deviation over 10 runs

Supplementary comparison of RL and No RL

Here, we examine the effect of RL on a larger scale, specifically with 100 input starting molecules. A single run is conducted for each configuration. Figure 18 shows the results for the scaffold discovery task, with each point representing the performance for one input starting molecule. Clearly, all three RL settings generate more unique compounds, scaffolds and generic scaffolds with P(active)>0.6 and QED>0.6 than No RL, with RL_DF(scaffold) performing best, followed by RL_DF(cmp) and RL_noDF. This confirms the advantage of RL over No RL, and of the diversity filter, which penalizes frequently generated compounds or scaffolds to improve diversity.

Figure 19 shows the results for the molecular optimization task. All three RL settings generate more unique compounds with improved P(active) and QED than No RL. However, for generating unique similar (i.e. Tanimoto similarity > 0.7) molecules, RL_noDF and RL_DF(cmp) perform worse than No RL. RL_DF(cmp) generates the most unique compounds with improved properties but the fewest unique similar compounds. The reason could be that the agent was guided to focus on improving properties (enforced by the scoring objectives) and on unique molecules (enforced by the diversity filter), which might cause it to deviate from the prior that generates similar molecules. RL_DF(cmp)_Sim, which includes Tanimoto similarity as an additional scoring objective, helps generate more similar compounds, as shown in Fig. 19b, and moves towards the end goal of both improved properties and similarity (Fig. 19c).

Fig. 18
figure 18

Scaffold discovery task: comparison of No RL and RL with 100 input starting molecules in terms of generating unique compounds, scaffolds and generic scaffolds that show P(active)>0.6 and QED>0.6. Each point represents the performance for each input starting molecule

Fig. 19
figure 19

Molecular optimization task: comparison of No RL and RL with 100 input starting molecules in terms of generating (a) unique compounds that show improved P(active) and QED compared to corresponding input compound, (b) unique compounds that have Tanimoto similarity above 0.7 relative to corresponding input compound and (c) unique compounds that show both improved P(active) and QED and Tanimoto similarity above 0.7. Each point represents the performance for each input starting molecule

Conclusions

We have evaluated the effect of RL on a transformer-based molecular generative model trained to generate molecules similar to a given input molecule. The generative model serves as a pre-trained model with knowledge of the chemical space surrounding the input molecule, and reinforcement learning acts as a fine-tuning phase to focus the model on the desirable chemical space defined by a set of user-specified property objectives. This provides the flexibility to optimize molecules towards task-specific property profiles. The evaluation was performed on two application scenarios, scaffold discovery and molecular optimization. Additionally, the effect of pre-trained priors, learning steps, learning rates and the balancing factor \(\sigma\) was examined. The results show that

  (i) RL generally helps generate more molecules with desired properties compared to No RL for both the scaffold discovery and molecular optimization tasks. Different behaviors can be expected depending on the starting molecule's structure and properties, e.g. it can be challenging for RL to find molecules with improved activity if the starting molecule is already highly active.

  (ii) RL consistently helps generate more compounds with desirable properties for priors trained on both ChEMBL and PubChem, and the PubChem prior generally outperforms the ChEMBL prior.

  (iii) Increasing the number of learning steps typically results in the discovery of more compounds of interest.

  (iv) Increasing the learning rate (up to a point) tends to explore a larger chemical space and sample the chemical space of interest more efficiently; at the same time, a higher learning rate leads to higher variance between runs. A learning rate that is too high can have a dramatic negative impact on performance.

  (v) For the molecular optimization task, a lower \(\sigma\) typically results in more unique similar molecules, whereas a higher \(\sigma\) tends to produce more unique compounds with improved properties. There is no clear trend in the number of unique compounds that are both similar and show improved properties when varying \(\sigma\).

As an example of optimizing towards user-specified desired properties, we have evaluated how well we can find more active compounds against DRD2 compared to a given starting molecule. However, any property can be optimized in the RL framework as long as it can be expressed as a scoring function. Notably, the accuracy and generalizability of a predictive model play an important role in practice.

Our evaluation has been conducted on the tasks of scaffold discovery and molecular optimization. However, the approach is not limited to these tasks and can be applied to molecular generation tasks such as scaffold decoration or fragment linking by adding substructure-matching scoring components.