Utilizing Reinforcement Learning for de novo Drug Design

Deep learning-based approaches for generating novel drug molecules with specific properties have gained a lot of interest in the last few years. Recent studies have demonstrated promising performance for string-based generation of novel molecules utilizing reinforcement learning. In this paper, we develop a unified framework for using reinforcement learning for de novo drug design, wherein we systematically study various on- and off-policy reinforcement learning algorithms and replay buffers to learn an RNN-based policy to generate novel molecules predicted to be active against the dopamine receptor DRD2. Our findings suggest that it is advantageous to use at least both top-scoring and low-scoring molecules for updating the policy when structural diversity is essential. Using all generated molecules at an iteration seems to enhance performance stability for on-policy algorithms. In addition, when replaying high, intermediate, and low-scoring molecules, off-policy algorithms display the potential of improving the structural diversity and number of active molecules generated, but possibly at the cost of a longer exploration phase. Our work provides an open-source framework enabling researchers to investigate various reinforcement learning methods for de novo drug design.


Introduction
In recent years, there has been an increased interest in using machine learning for drug discovery.It has been applied to a large range of different tasks, including virtual screening, synthesis prediction, property prediction, and computer-assisted molecular design [1][2][3].Machine learning has obtained an important position in de novo drug design -the design of novel chemical entities that fit certain constraints.De novo drug design is an iterative optimization problem whose navigation in the optimization landscape relies on finding local optima of molecular structures, which does not necessarily lead to identifying the global optimum [4].Therefore, it is of interest to find a diverse set of local optima, meaning structurally different molecules with a high probability of being active against a desired target, i.e., with high activity.
Numerous deep learning-based methods have been developed for de novo drug design, including approaches based on reinforcement learning [5][6][7][8][9][10][11][12] and variational autoencoders [13][14][15][16].These approaches use several different ways to encode molecules into something that the model can learn, such as fingerprint-, string-and graph-based encodings.The string-based simplified molecular-input line-entry system (SMILES) [17] is a popular way to encode the 2D structure of molecules.Previous work has demonstrated that graphbased and string-based generative models show equivalent performance in terms of chemical space coverage [18].It is expected that the conclusions of this work are independent of how the molecules are represented.Moreover, recent evaluations of sample efficiency of de novo molecular generation methods have concluded good performance when using reinforcement learning (RL) for learning a recurrent neural network (RNN) [19] to generate SMILES strings representing high-scoring molecules [20,21].The objective is to learn a policy that can sample sequences of tokens to generate SMILES strings.Hence, policy optimization algorithms might have a significant impact on this task.
To further improve the sample efficiency of RL, it has been proposed to combine RL with a Hill-climb algorithm, which learns on the k top-scoring sequences [12,22,23].This method focuses the training on good samples from the current round of sequences.This can be interpreted as an off-policy algorithm with a replay buffer, filtering out low reward sequences and initializing the buffer memory between learning rounds.
The use of replay buffers is crucial in off-policy algorithms and is known to improve the sample efficiency of these algorithms [24].However, to our knowledge, no previous work in de novo drug design has investigated off-policy algorithms with replay buffers utilizing past sequences.This work builds upon previous work in de novo drug design using a reward-based replay mechanism.Many of the state-of-the-art replay mechanisms used in reinforcement learning are often based on the temporal difference (TD) error [25].However, since the TD error is not necessarily computable in a policy-based algorithm, a more general mechanism is needed for a fair comparison of algorithms.This paper focuses on reward-based replay mechanisms.
In this paper, we explore in a systematic way different on-policy and offpolicy policy optimization reinforcement learning algorithms, in combination with several ways of replaying previous sequences or restricting the learning to a subset of the sequences sampled in the current episode.The objective is to investigate how large fraction of the generated molecules are predicted, with high probability, to be active to a desired target, and how structurally diverse these predicted active molecules are.Our work can be used as an open-source framework for researchers to investigate various RL methods for de novo drug design 1 .

Problem Setup
The first step of de novo drug design using RL involves training a pre-trained policy (and/or encoding certain structures into the policy), as illustrated in Fig. 1.Subsequently, a batch of molecules is sampled by the policy, e.g., by the policy choosing a sequence of characters in a SMILES string.In the next step, the sampled molecules are scored by an unknown "black box" objective function, i.e., the objective function can be evaluated at any point of its domain but its full expression is unknown.The molecules and corresponding scores are both stored for final inspection and replay (optional).The current molecules and corresponding scores are also fed into the RL algorithm, where the molecular sampling policy is updated.Depending on the use of the replay buffer, current and/or previous samples are provided for the learning step.Using the updated policy, a new batch of molecules is sampled.This continues until a stopping criterion has been reached, such as a pre-defined budget of samples.
Fig. 1 Schematic illustration of the de novo drug design process using reinforcement learning (RL).

Problem Definition
Molecular de novo design using RNNs for optimizing molecules encoded as SMILES strings, can be formulated as an RL problem.The agent interacts with the environment over discrete time steps by adding tokens to a SMILES string.The environment is episodic, where a SMILES string is provided between a start and stop token, and the episode's length depends on the SMILES string's length, which is terminated when the stop token is added to the string.At time step t = 0, the start token is added to the string and defines the first state s 1 .At time step t = 1, . . ., T the agent observes a n s -dimensional state vector s t ∈ S ⊆ R ns , chooses an action a t ∈ A according to a policy π(a t |s t ).
The episode ends at time step T + 1, for terminal state s T +1 , when the stop token is chosen as action a T .Moreover, at the end of the episode, a reward signal R(a 1:T ) ∈ [0, 1] is observed for a sequence of actions a 1:T .The scoring function provides this reward signal by scoring each valid SMILES string.
The state vector s t is given by the output states of the RNN at step t − 1 and encodes information about the actions taken in previous steps.A discrete action space A = {0, . . ., 33} which tokenizes the feasible characters in the SMILES string, including start and stop tokens, is considered.Under this setup, we explore various policy optimization RL methods, where the goal is to learn the policy directly parameterized by θ, π θ (a t |s t ).The output gates of the RNN are fed into a fully connected layer to provide either the probability (utilizing a softmax layer) or values of each action.

Policy Optimization Algorithms for de novo Drug Design
In this paper, we explore the following policy optimization algorithms for de novo drug design, to generate diverse molecules with high scores: (

Offpolicy
Fig. 2 Taxonomy of the reinforcement learning (RL) algorithms explored in this work.

Regularized Maximum Likelihood Estimation
The regularized maximum likelihood estimation (Reg.MLE) algorithm is currently used in REINVENT [26].Recent evaluations by both [20] and [21] have concluded good performance compared to both RL-based and non-RL-based approaches for de novo drug design.The idea is that the likelihood of the agent's policy should stay close to that of the pre-trained policy (See Sec.4.2) while still focusing on the high-scoring sequences.It minimizes the following policy loss where π prior (a 1:T ) = π prior (a 1 |s 1 ) • • • π prior (a T |s T ) is the likelihood of the pre-trained policy for a sequence of length T (excluding start token), and is the corresponding likelihood of the policy that is optimized.The policy has the same network architecture and is initialized as the pre-trained policy.σ is a hyperparameter [26] that determines the importance of the reward signal.Note that it uses a margin guard, which resets the agent to the prior and adjusts sigma if the margin between the augmented likelihood and the agent likelihood log π prior (a 1:T )+σR(a 1:T )−log π θ (a 1:T ) is below a pre-defined threshold.Table A1 in Appendix A displays a list of hyperparameters and the values employed in this paper.

Proximal Policy Optimization
Proximal Policy Optimization (PPO) [27] uses a clipping loss function defined by where r(θ) = π θ (at|st) π θ old (at|st) is the probability ratio and advantage Â(s t ) = γ T −t R(a 1:T ) − V ϕ (s t ) is used, where V ϕ (s t ) is the value function.This corresponds to Monte-Carlo (MC) samples where the reward R(a 1:T ) is only given at time step T .γ is the discount factor and ϵ is a hyperparameter determining the clipping range.We adapt PPO to the setting where the state vectors are represented by the RNN outputs of the latest recurrent node.The actor is a neural network with parameters θ, providing probabilities π θ (a|s) for an action a at state vector s using a softmax layer.It is initialized as the pre-trained policy (see 4.2) and the parameters are updated using the loss function in Eq. ( 2).The value function V ϕ (s t ) is a neural network with (non-shared) parameters ϕ.It has the same network architecture as the actor, but the output layer only consists of one output, i.e., the value, and uses no softmax layer.The output layer is reset at initialization, while the initial embedding and long short-term memory (LSTM) [28] layer are the same as the pre-trained policy network.The value network is trained by minimizing the following mean squared error loss For each batch of sequences, the actor and critic loss are minimized over 4 epochs, each doing minibatch updates where sequences are shuffled into 4 mini-batches.It is possible for the actor to diverge from the pre-trained actor (policy), which has learned how to sample a valid SMILES string.This rarely happens but should be properly handled when happening.Therefore, if the fraction of valid SMILES strings (out of the 128 sampled in each episode) is less than 0.8 for more than 10 consecutive episodes, the parameters of the algorithm will be reset to that of the pre-trained model.Table A2 in Appendix A displays a list of hyperparameters and the values employed in this paper.

Advantage Actor-Critic
Advantage Actor-Critic (A2C) is a synchronous version of the A3C algorithm [29], wherein the following definition of the advantage is used where V ϕ (s t ) is the value network with (non-shared) parameters ϕ and discount factor γ. We adapt A2C to the setting where the state vectors are represented by the RNN outputs of the latest recurrent node.The actor is a neural network with parameters θ, providing probabilities π θ (a|s) for an action a and state vector s using a softmax layer.It is initialized as the pre-trained policy (see 4.2) and the parameters are updated using the policy gradient with the advantage.A discount factor slightly smaller than 1 is used to slightly favor small molecules.The value network has the same network architecture as the actor, but the output layer only consists of one output, i.e., the value, and uses no softmax layer.The output layer is reset at initialization, while the initial embedding and LSTM layer are the same as the pre-trained policy network.The value network is trained by minimizing the following mean squared error loss If the actor generates a large fraction of invalid SMILES strings, the algorithm is reset in the same way as for PPO (see Sec. 3.2).Table A3 in Appendix A displays a list of hyperparameters and the values employed in this paper.

Actor-Critic with Experience Replay
Actor-Critic with Experience Replay (ACER) [30] is an off-policy actor-critic algorithm with experience replay.ACER is the off-policy counterpart of the A3C algorithm [29] where the aim is to stabilize the off-policy estimator, e.g, by applying a trust region policy optimization (TRPO) method.The algorithm performs one on-policy update and r ∼ Pois(λ) off-policy updates using replay, where each replay samples 128 sequences.We adapt ACER to the setting where the state vectors are represented by the RNN outputs of the latest recurrent node.It uses a shared network utilizing the same architecture as the pretrained policy network (see Sec. 4.2) but with an additional value head, i.e., a parallel fully connected layer with output dimension 1.The value head is randomly initialized, while the other weights are the same as the pre-trained policy network, including the policy head.We add an entropy term to the loss with weight 0.001, slightly favoring sequences with larger cumulative entropy.
We use retrace Q-value estimation, as proposed in the original algorithm [30].Using retrace Q-value estimations, instead of Monte-Carlo samples, does slightly improve the stability.For each sequence, a 1:T , reward R(a 1:T ) is only given at time step T for action a T , when the stop token is chosen as action.A reward signal of −1 is given to invalid SMILES strings, which are by default given a reward of 0 for the on-policy algorithms.The penalty for invalid SMILES bias the algorithm strongly toward valid SMILES.This seems to be crucial, especially when performing many off-policy updates using the replay memory.Furthermore, 10 initial steps without updating the policy are performed, only using the pre-trained policy to store initial sequences in the replay memory.Table A4 in Appendix A displays a list of hyperparameters and the values employed in this paper.

Soft Actor-Critic
Soft Actor-Critic (SAC) [31] is an off-policy algorithm that incorporates the entropy of the policy into the reward signal to encourage a stochastic policy with more randomness, while still fulfilling the task.It is based on the maximum entropy objective, with the aim of optimal policy π * to maximize both its reward and entropy at each visited state.
It uses an automatic entropy adjustment to control the temperature parameter α that determines the relative importance of the entropy term.The Algorithm 1 Discrete Soft Actor-Critic for de novo drug design Input: ϕ, θ, initial episodes K init , total budget of episodes K E , Init: Sample a batch T of M sequences using pre-trained policy π θ

3:
Score each sequence in T 4: Add unique, valid sequences to replay memory D 5: end for 6: for each episode K init + 1, . . ., K E do 7: Sample a batch T of M sequences using current policy π θ 8: Score each sequence in T 9: Add unique, valid sequences to replay memory D 10: ▷ On-policy update of policy parameters 12: α ← α − λ α ∇α J α (α|T ) ▷ On-policy update of temperature 13: for each off-policy update do 16: end for 22: end for original algorithm is formulated for a continuous action space, but [32] has extended it to discrete action spaces.The soft actor-critic algorithm for discrete action spaces is utilized in this work, with some adaptions discussed below.We adapt SAC to the setting where the state vectors are represented by the RNN outputs of the latest recurrent node.The actor is a neural network with parameters θ, providing probabilities π θ (a|s) for an action a and state vector s using a softmax layer.It is initialized as the pre-trained policy (see 4.2) and the parameters are updated by minimizing the loss in Eq. (10).The value function Q ϕ (a, s) is given by a neural network that has the same network architecture as the actor, where each output corresponds to the action-state value of an action a at current state s but uses no softmax layer.We initialize the parameters ϕ to that of the parameters of the pre-trained policy network.We perform no updates of parameters during the K init = 10 first episodes, where only experiences are sampled to the replay buffer.Since a budget of 2000 episodes is considered in the following experiments, there are 1990 episodes left for learning to generate high-scoring molecules.When updating parameters, we utilize a reward of −1 for invalid SMILES.Moreover, we perform one on-policy update, using all current sequences, before doing any off-policy update.Four off-policy updates are performed, where 64 sequences are sampled from the replay memory, for each episode.We observed no significant difference in performance when using the larger replay size of 128 as for ACER (seer Sec.3.4).
The following entropy-augmented reward is defined where p is the state transition probability of the environment.Full sequences are used for updating, i.e., Monte-Carlo (MC) samples, instead of one-step update in [31,32].The reward signal R(a 1:T ) is given at the end of the episode, at time step T , and uses no discount, i.e., discount factor γ = 1 is utilized.This gives the following target for each soft Q-value update where y s T +1 = 0, and to keep the pre-trained high probabilities of stop tokens at certain states.This gives each action an equal contribution from the reward signal of the full sequence and an additional cumulative entropy term that favors actions where future states have high entropy.
To improve the stability, [33] has proposed to include more regularization in the actor and critic losses.The proposed clipping of the critic loss with target critic(s) is used, which is found to improve the stability, giving the following loss of the critic network where Π is a set of sequences (either current sequences or replay memory), T is a sequence, Q ϕ is the estimate of the critic network, and Q ϕ ′ is the estimate of the target-critic network.When Π contains the current sequences, all sequences are sampled; otherwise, 64 sequences are sampled when using a full replay memory.The weights of the target-critic are updated using the moving average between the current weights of the target-critic and critic, where τ is the smoothing coefficient, determining how much of the updated critic network that will be transferred to the target-critic network.We only use one critic and target-critic network, since we observe that using the pretrained model as the initial critic network is advantageous, instead of using randomly initialized weights.This yields more stable learning in terms of the validity of generated SMILES strings and biases the learning towards what the pre-trained model knows.It could possibly be useful to use two critic networks if their updates utilize different replay experiences.
For regularizing the actor loss, we include the Kullback-Leibler (KL) divergence term between the (current) actor and the average policy network.The average policy network π θ ′ is initialized as the actor and whose parameters θ ′ is updated using moving average (as in Eq. ( 9)) with τ = 0.99.Adding this KL divergence term yields the following actor loss The temperature α is updated by minimizing the following objective, extending the objective proposed by [32], where H is the target entropy.Algorithm 1 illustrates all steps in the (discrete) soft actor-critic algorithm used for de novo design.Table A5 in Appendix A displays a list of hyperparameters and the values employed in this paper.

de novo drug design using RNNs for generating SMILES generation
This section describes the experimental setup, including the replay buffers used to investigate the on-and off-policy policy optimization algorithms.A batch of M = 128 sequences is sampled in each episode, generating 128 SMILES strings in each episode.Any duplicates of SMILES strings are removed afterward, possibly yielding a list of fewer unique molecules.At the end of each episode, using on-policy or/and off-policy batches, the policies are updated by the full roll-out of each sequence, as described in Sec. 3. Technical details of the experiments are discussed in Appendix B.

Molecular and Topological Scaffolds
Molecular scaffold Topological scaffold c1ccc(Oc2ccccn2)cc1 Fig. 3 The structural formula and SMILES strings for an arbitrary molecule, and its corresponding molecular and topological scaffold.
The scaffold of a molecule is defined as its core structure.This is a common structure characterizing a group of molecules.This provides a basis for a systematic investigation of molecular cores and building blocks.This assists in finding structurally distinct molecules having similar activity, providing several structural alternatives when optimizing the properties of potential drug candidates [34].Hence, scaffolds provide a diversity measure of the identified active molecules.
The molecular scaffold defined by [35], also known as the Bemis-Murcko scaffold, is used in this work to generate scaffolds.The topological scaffold is defined as the generic molecular scaffold where all atom types are converted into carbon atoms and all bonds are converted into single bonds, as illustrated in Fig. 3.All scaffolds in this work are generated using [36].

Sampling
We use the pre-trained policy network provided by [26].The model is trained on a data set derived from the ChEMBL databse [37] and is capable of generating molecules in terms of SMILES strings.The parameters of the policy network are sequentially updated using reinforcement learning (see Sec. 3).The policy consists of an embedding layer, an LSTM layer, and a fully connected output layer.The embedding layer consists of 256-dimensional embedding vectors.The LSTM layer has an input size of 256 and an output size of 512 and consists of 3 recurrent layers, i.e., three LSTMs stacked together.The LSTM output is fed to an output layer of output dimension 34.Each output entity corresponds to a token in the vocabulary, including start and stop tokens, defined by [26].
When sampling a sequence, the output layer is fed through a softmax function to obtain estimates of the probabilities of each token (action).Multimodal sampling, using the estimated probabilities from the softmax function, is performed to select the next action in a sequence.Small molecules are of interest and, therefore, the length of a sequence is limited to 256.If a sequence reaches this length, the sampling is stopped, returning the sequence sampled so far.

Scoring
A scoring function provides the rewards signal R(a 1:T ) of each sequence.The scoring function is given by a random forest model with 1300 trees and a maximum depth of 300 for each tree.Class weights, which determine the sample probability during bootstrapping, are inversely proportional to the class frequencies.The random forest model is trained to predict the binary activity of a molecule to the Dopamine receptor D2 (DRD2), using the activity data in ExCAPE-DB [38].2048-bit Morgan-like fingerprints, computed by RDKit [36], with radius 2, utilizing features and counts are used as feature vectors.Each SMILES string is encoded into a such feature vector for scoring.Class probabilities, of the binary activity, are given by the fraction of trees predicting the corresponding class.The reward of a sequence is defined as the probability of predicting a positive label for the corresponding sequence.A sequence (or SMILES string) is defined to be valid if the corresponding SMILES string can be constructed into a Mol object by RDKit [36], which is done when computing the fingerprints for scoring.When constructing a Mol object, RDKit first performs a grammar check and then applies basic chemical rules.An invalid SMILES string is given a reward of 0 and −1 for the on-and off-policy algorithms, respectively.The score can also be modified by the diversity filter, see Sec. 4

.4.
A molecule is defined to be active if the corresponding sequence has a reward greater than or equal to 0.7, and a scaffold is defined as active if it contains at least one active molecule.These definitions are used to compare the policy optimization algorithm utilizing different replay buffers.

Diversity Filter
A diversity filter is a memory-assisted approach to improve the diversity of generated molecules.It keeps track of the molecules with similar structures, e.g., scaffolds.We use the diversity filter based on identical molecular scaffolds, proposed by [39].This diversity filter consists of a scaffold memory that stores molecules and their corresponding molecular scaffold.A molecule is saved into the scaffold memory if the corresponding sequence reaches at least a reward of 0.4.No molecules with the same canonical SMILES are allowed in the scaffold memory, hence only storing unique molecules.If the number of saved molecules with the same molecular scaffold reaches 25, all future molecules with the same molecular scaffold are given a reward of 0, and consequently not saved in the scaffold memory.This changes the reward function used for learning when a certain number of molecules with the same molecular scaffold have been generated.

Replay Buffers
In light of the current use of the Hill-climb algorithm [12,22,23] for training, we study different approaches using both current and previous sequences.All of these approaches are collected under the term replay buffers.
In each episode, a batch of M = 128 sequences are sampled.For on-policy algorithms with replay buffers considering historical data, the current batch of sequences plus k = 64 sequences from history, not including the current sequences, are used for learning.Moreover, for replay buffers only using current data, k = 64 sequences from the current batch of sequences are used for learning, except for All current where the entire current batch is used for training.For the off-policy algorithms, only replay buffers using historical data are considered, since they are defined to always do one on-policy update with the current sequences.For each off-policy update, SAC and ACER replay k = 64 and k = 128 sequences, respectively, from the buffer.Opposite of what is done for the on-policy algorithms, the sequences of the current episode are immediately stored in the replay memory, i.e., sequences from the current batch can be sampled from the replay memory when performing off-policy updates in the current episode.Below follows descriptions of each replay buffer investigated in this paper, seven in total.

All current (AC)
For the All current (AC) replay buffer, the entire batch of sampled sequences in the current episode is used during learning.No sequences from previous episodes are utilized.In practice, this corresponds to performing a full onpolicy update.

Bin history (BH)
The Bin history (BH) replay buffer sorts sequences into bins with respect to their reward.It consists of the following fixed binds with respect to rewards: [0, 0.1], (0.1, 0.2], . . ., (0.9, 1].For the off-policy algorithms, the buffer also includes a bin storing invalid SMILES strings, i.e., SMILES string with a score −1.Each bin has a maximum size of 1000 sequences.First in, first out (FIFO) is applied if the bin is full.To the extent possible, it does, without replacement, sample an equal number of sequences from each bin and otherwise uniformly samples, without replacement, from the bins with elements that have not been sampled until k sequences have been acquired.

Bin current (BC)
Bin current (BC) replay buffer sorts the current batch of sampled sequences into bins with respect to their rewards.It consists of the following fixed binds with respect to rewards: [0, 0.1], (0.1, 0.2], . . ., (0.9, 1].It does, to the extent possible, sample an equal number of sequences from each bin without replacement.If k sequences have not been acquired after this, it uniformly samples from the bins with unsampled sequences (i.e., bins whose sequences are not yet in the set used for update) until k sequences have been sampled in total.

Top-Bottom history (TBH)
When using Top-Bottom history (TBH) replay buffer with an on-policy algorithm, the previous k/2 highest and k/2 lowest rewarding sequences, and the current sequences, are used for updating the policy.It prioritizes storing the newest sequence(s) if several sequences have an equal reward.No duplicates with the same canonical SMILES string are allowed, keeping the newest sequence with the lowest reward.
For off-policy algorithms, it consists of three sub-buffers, each with a memory size of 1000 sequences.It consists of one sub-buffer with the highest rewarding sequences and two sub-buffers with low rewarding sequences.One of these low-reward sub-buffers stores only sequences with −1 reward, i.e., invalid SMILES strings, and the other one stores the lowest-scoring sequences that correspond to a valid molecule (having a reward of at least zero).It uniformly samples, without replacement, from a buffer consisting of all three sub-buffers.FIFO is utilized for each sub-buffer, where the oldest sequences are removed when a sub-buffer is full.No duplicates with the same canonical SMILES string are allowed, where the newest sequence with the lowest score is kept.

Top-Bottom current (TBC)
For the Top-Bottom current (TBC) replay buffer, k/2 highest and k/2 lowest rewarding sequences of the current batch are used for the update.

Top history (TH)
When using the Top history (TH) replay buffer with an on-policy algorithm, the top-k rewarding sequences from previous episodes and sequences from the current episode are used for updating the actor and critic.Hence, it only needs to store the top-k sequences from previous episodes.Any duplicate with the same canonical SMILES string as another stored sequence is removed.
For off-policy algorithms, it consists of a buffer with the highest rewarding sequences.It has a memory size of 1000 sequences and does not allow any sequences with the same canonical SMILES string, keeping the newest sequence with the lowest score.It uniformly samples k sequences, without replacement, from the buffer to use for the update of the parameters.

Top current (TC)
For the Top current (TC) replay buffer, the top-k rewarding sequences of the current batch is utilized for updating.This is similar to what is utilized in the Hill-climb algorithm.

Results
In this section, we investigate various policy optimization algorithms, described in Sec. 3, for the de novo drug design setup defined in Sec. 4. To compare all algorithms under the same budget constraint, the generation is limited to 2000 episodes, giving a budget of 256 000 possible SMILES strings in total.We investigate both the use of the diversity filter based on identical molecular scaffolds (see Sec. 4.4) and the use of no diversity filter.The results are divided into on-and off-policy algorithms.Fig. 4 displays the different combinations of replay buffer, policy optimization algorithm and diversity filter that are investigated.Figs.C1 to C10 in Appendix C display the ten top-scoring molecules, each from a unique topological scaffold, of one representative run for each combination.The generated molecules look like what has been published in previous work targeting DRD2 [5,22].On-policy

Replay Buffer
Policy Optimization Algorithm Diversity Filter Fig. 4 Illustration of the different combinations of replay buffer, policy optimization algorithm and diversity filter that are investigated in this paper.

On-policy Algorithms
Firstly, we study de novo drug design utilizing on-policy algorithms.We investigate the algorithms A2C, Regularized MLE and PPO, when either using the identical molecular scaffold diversity filter or no diversity filter.

With Diversity Filter
Fig. 5 shows box plots of the number of (unique) active molecules and active scaffolds over 11 runs for the on-policy algorithms utilizing different replay buffers, and identical molecular scaffolds diversity filter.As illustrated in Fig. 5a, for most replay buffers, both A2C and PPO sample a significantly larger number of active molecules compared to Regularized MLE; whereas there is relatively little difference between A2C and PPO in terms of the number of active molecules generated.Regularized MLE only generates notably more unique active molecules when using the top current replay buffer, compared to A2C and PPO using the same replay buffer.When using PPO and A2C, the replay buffers All current and Bin current yield the largest number of active molecules.Moreover, when using Regularized MLE, utilizing All current generates the largest number, whereas Top current is the second-best replay buffer.
As seen in Fig. 5b, Regularized MLE utilizing All current show the largest median of (unique) active scaffolds with low variability, compared to A2C and PPO.This combination also shows the largest median of generated (unique) active topological scaffolds, as seen in Fig. 5c.Regularized MLE generates a significantly larger number of active topological scaffolds with most replay buffers, compared to A2C and PPO.However, A2C with Top-Bottom history performs on par, in terms of active scaffolds, with Regularized MLE but has a larger variability.In fact, A2C with Top-Bottom history displays the best runs in terms of the number of active scaffolds.Moreover, in most cases, A2C generates a significantly larger number of topological scaffolds than PPO; while comparable numbers of active molecular scaffolds are often generated.

(c) Number of active topological scaffolds
Fig. 5 Box plots of the number of unique active molecules and active scaffolds for the on-policy algorithms A2C, Regularized MLE, and PPO (higher is better) when utilizing identical molecular scaffold filter.It shows mean and standard deviation over 11 runs for each policy and replay buffer.128 SMILES strings are sampled in each episode, with a budget of 2000 episodes in total.
Fig. 6 displays means and standard deviations, over 11 repeated runs, of the average episodic rewards of sampled molecules for all combinations of onpolicy algorithms and replay buffers over 2000 episodes of batch size 128.The average episodic rewards are displayed using a moving average with a window size of 50.All runs use the identical molecular scaffold filter.For both PPO and A2C, Top current gives the lowest average episodic reward, as shown in Fig. 6a and Fig. 6b, respectively.Furthermore, PPO utilizing Bin current and All current performs on par, giving moving averages between 0.8 and 0.7 after roughly 125 episodes.It seems that most replay buffers converge after around 125 episodes.Moreover, for A2C, All current, Top-Bottom current and Bin current perform among the best in terms of the average episodic reward, when using a diversity filter.Compared to PPO, it takes slightly more episodes for the rewards to converge.For Regularized MLE, utilizing All current gives the largest average episodic reward, converging to a reward between 0.7 and 0.8 with low variance.It is the only replay buffer for Regularized MLE that can reach an average episodic reward above 0.7.Utilizing Top history yields the lowest episodic reward.(c) Reg.MLE Fig. 6 Average episodic reward computed over a batch of sequences, for the on-policy algorithms A2C, Regularized MLE and PPO (higher is better) when utilizing identical molecular scaffold filter.It shows mean and standard deviation of moving average, of window size 50, over 11 runs for each policy and replay buffer.128 SMILES strings are sampled in each episode, with a budget of 2000 episodes in total.

Without Diversity filter
Fig. 7 shows boxplots of the number of (unique) active molecules and active scaffolds over 11 runs for the on-policy algorithms utilizing different replay buffers.No diversity filter is used.In this setting, Regularized MLE using either Bin history, Bin current or All current consistently generates a larger number of active molecules, molecular scaffolds and topological scaffolds, compared to all other combinations of on-policy algorithm and replay buffer.Bin current yields the largest number for all these metrics.For all replay buffers, Regularized MLE gives the largest number of active molecules and scaffolds, while A2C gives a higher number than PPO.Fig. 8 displays means and standard deviations, over 11 repeated runs, of the average episodic rewards of sampled molecules for all combinations of onpolicy algorithms and replay buffers using no diversity filter.A budget of 2000 episodes of batch size 128 is investigated.For visualization purposes, the average episodic rewards correspond to moving averages using a window size of 50.PPO seems to require the least number of episodes to converge but, on the other hand, shows a higher variance.
As seen in Fig. 8a, PPO with Top-Bottom current, All current and Top-Bottom history reaches an average episodic reward 1 approximately.Using Top-bottom history shows a substantially lower episodic reward but has a larger variance.
For A2C, displayed in Fig. 8b, using replay buffers with data samples from previous episodes, except for Top current, gives slightly lower episodic reward compared to using only data sampled in the current episode.All replay buffers only using immediate samples give an average episodic reward close to 1.One should note that the short drop of reward for All current occurs due to the resetting of the parameters of one run to that of the pre-trained model, since the policy samples less than 80% valid molecules for more than 10 consecutive episodes.After restarting, it quickly gets back on track.This run is kept to highlight that it is possible for the networks to diverge from the pre-trained model and forget how to generate valid SMILES strings.When this happens, it can quickly find its way back by restarting from the pre-trained model.
For Regularized MLE, there is generally a lower variance for the episodic reward compared to PPO and A2C.When using either Top-Bottom history, Top current or Top history, the average episodic reward converges to 1, as illustrated in Fig. 8c.Note that neither PPO nor A2C consistently reaches such high episodic reward.Also, for regularized MLE, All current is not among the best ones, which is the case for both PPO and A2C.

Off-policy Algorithms
To further investigate the benefits of a replay buffer, two off-policy algorithms have been explored: (1) Soft Actor-Critic (SAC); (2) Actor-Critic with Experience replay (ACER).The off-policy algorithms perform one step of on-policy update (using all sampled data in the current episode) and several off-policy updates using sequences from both current and previous episodes.The off-policy updates use replays from either Bin history, Top history or Top-Bottom history.Opposite to the on-policy algorithms, the sequences of the current episode are stored in the replay memory before use in the current episode, i.e., it is possible to utilize sequences from the current batch for off-policy updates in the current episode.

With Diversity Filter
Fig. 9 displays box plots of the number of active molecules and scaffolds for 11 repeated runs of each combination of off-policy algorithm and replay buffer.128 molecules are generated in each episode, not necessarily valid and/or unique molecules, with a total budget of 2000 episodes.ACER using Bin history generates the largest number of active molecules and scaffold.It is able to yield numbers close to or better than the best on-policy results in Sec.5.1.1,which the other off-policy combinations are not able to.However, in general, ACER shows a larger variability compared to SAC, where ACER using Top-Bottom history shows the largest variability over the repeated runs.

(c) Number of active topological scaffolds
Fig. 9 Box plots of the number of unique active molecules and active scaffolds for the off-policy algorithms ACER and SAC (higher is better) when utilizing identical molecular scaffold filter.The box plot is computed over 11 repeated runs for each combination of policy and replay buffer.128 SMILES strings are sampled in each episode, with a budget of 2000 episodes in total.
Fig. 10 shows means and standard deviations, over 11 repeated runs, of the average episodic rewards of sampled molecules for all combinations of off-policy algorithms and replay buffers using the identical molecular scaffold filter.A budget of 2000 episodes, each sampling a batch of 128 molecules, is investigated.For visualization purposes, each average episodic reward corresponds to the moving average using a window size of 50.Note that, in this figure, invalid SMILES are displayed with a reward of 0 but are given a reward of −1 during training.
When using SAC, both Bin history and Top-Bottom history reach an average episodic reward of 0.5 approximately; while Top history reaches an average episodic reward of 0.4 approximately, which is the minimum reward for a sequence to be saved in the diversity filter.For ACER, Bin history reaches an episodic reward of around 0.75; while using the other two replay buffers yields substantially lower episodic rewards and higher variances, in particular for Top-Bottom history.Top history reaches an episodic reward below 0.3, not reaching above the diversity filter threshold.ACER with Bin history is the only combination that is able to obtain an average episodic reward comparable to the best on-policy algorithms, but with a slower increase in average episodic reward.SAC displays a significantly slower increase in average episodic reward than both ACER and the on-policy algorithms.

Without Diversity filter
Fig. 11 displays box plots of the number of active molecules and scaffolds, over 11 repeated runs for each combination of off-policy algorithms and replay buffers.Note that invalid SMILES are displayed again with a reward of 0 in this figure but are given a reward of −1 during training.No diversity filter is used, i.e., the generation of similar molecules is not penalized between episodes.
It is observed that SAC using Top history generates the largest number of (unique) active molecules and scaffolds, on par with the best on-policy without a diversity filter, Regularized MLE with Bin current.For all replay buffers, SAC seems to generate a significantly larger number of active molecules and scaffolds compared to ACER, except for topological scaffolds when using Topbottom history where they display similar performance.Fig. 12 displays means and standard deviations, over 11 repeated runs, of the average episodic rewards of sampled molecules for all combinations of offpolicy algorithms and replay buffers using no diversity filter.For SAC, using Top-Bottom history gives an average episodic reward of over 0.8; while Bin history converges to an average episodic reward of around 0.6 and Top history keeps improving, almost reaching an average episodic reward of 0.8.For ACER, Top-Bottom history reaches an average episodic reward close to 1.When using Bin history, a significantly faster improvement in the first 500 episodes, compared to Top-Bottom history, is observed but then diverges.A possible explanation for this is that it finds a mode with higher entropy, but that gives a lower score, it accidentally forgets too much of the pre-trained model.For Top history, ACER shows a large variability and does not consistently reach an average episodic reward above 0.4.

Discussion
In this section, we discuss the de novo drug design of investigated on-and offpolicy algorithms and replay buffers.Both the use of a diversity filter and the use of no diversity filter are discussed.We consider a diversity filter that penalizes molecules with identical molecular scaffolds.As a result, there is a natural increase in diversity regarding molecular scaffolds, but not necessarily topological scaffolds since different molecular scaffolds can have the same topological scaffold.When using no diversity filter, a large number of molecular scaffolds seems to lead to a large number of topological scaffolds.However, this is not as evident when using the diversity filter.A reason for this can be that there is a substantially larger number of molecular scaffolds generated when using a diversity filter which naturally leads to more topological scaffolds being the same.
Molecules are generated using a model pre-trained on a data set derived from the ChEMBL database which will teach the agent to generate ChEMBLlike molecules.The DRD2 data set, which is used to create the scoring function, comprises data from both the ChEMBL and PubChem databases [40].Hence, this experimental setup only considers small molecules and the agent is not anticipated to have the ability to produce molecules that differ significantly from those present in the ChEMBL database.Previous work suggests that there are close to 500 000 molecular scaffolds in total in the ChEMBL database [41].On the other hand, all scaffolds do not necessarily contain molecules active toward DRD2.The budget of 2000 episodes used in this work also limits the search for new scaffolds.Typically, we are able to identify scaffolds consisting of SMILES with a score of 1 and hence successfully traverse the reward landscape.

On-policy Algorithms
This section discusses the results of the on-policy algorithms and replay buffers; both with and without a diversity filter.

With Diversity Filter
Both A2C and PPO generate more active molecules but fewer active scaffolds, compared to Regularized MLE.One possible explanation for this could be that A2C and PPO stay for a longer time close to a penalized scaffold before moving to the next scaffold.Regularized MLE uses a fixed pre-trained actor in the loss and can, therefore, easily jump between scaffolds without forgetting how to generate valid SMILES strings.This comes at the cost of its likelihood staying close to that of the pre-trained model.However, we do not observe that this notably limits its performance.We observe that on-policy algorithms can enhance diversity by not only relying on high-scoring molecules, particularly regarding the diversity of topological scaffolds.
A2C and PPO achieve high average episodic rewards by either using all onpolicy data or a subset comprising both high-and low-scoring molecules.For some combinations of algorithm and replay mechanism is it advantageous to use historical samples instead of only current samples.However, there seems to be no significant added performance gain in using experiences from previous iterations, particularly compared to using all current samples.One possible explanation for this could be that the diversity filter can impose a significant change in the reward landscape after each episode by initiating penalization of frequently generated scaffolds and therefore possibly making previously acquired knowledge outdated, which may need to be accounted for.The investigated diversity filter uses a discrete threshold to determine when molecules of a scaffold should obtain a zero reward.A diversity filter that in a stepwise or continuous manner penalizes molecules from the same scaffold could have other effects on the reward landscape which possibly could be more suitable for some replay buffers.

Without Diversity Filter
As expected, the number of actives molecules and scaffolds is significantly lower compared to when using a diversity filter.There is an evident relationship between the number of active molecules, molecular scaffolds, and topological scaffolds.This is not necessarily evident when utilizing diversity filter.
When using no diversity filter, Regularized MLE generates a substantially larger number of active molecules and scaffolds, compared to A2C and PPO, especially when used in combination with the replay buffers Bin current, Bin history and All current.There is no significant difference in using a replay mechanism utilizing historical or current samples, except for the bin-based replay mechanisms in combination with Regularized MLE.Bin current in combination with Regularized MLE displays the overall largest number of active molecules and scaffolds while displaying among the lowest average episodic reward.This replay buffer seems to be able to enhance the exploration of Regularized MLE, possibly because it does not only exploit high-scoring molecules.Hence, it is evidently important to use a diverse set, in terms of reward, for learning without a diversity filter.
Why Regularized MLE performs better, in general, is likely because it is heavily regularized to stay close to the pre-trained model in terms of likelihood and, therefore, inherently can jump between scaffolds without using the initial knowledge.Without a diversity filter, PPO generates the least number of active molecules and scaffolds, while still displaying a high average episodic reward.This must be because of an early mode collapse, leading to PPO generating more or less only the same molecules through all episodes.

Off-policy Algorithms
In this section, the results of the off-policy algorithms and replay buffers are discussed; both with and without a diversity filter.

With Diversity Filter
For the best off-policy algorithm and replay buffer combination, i.e., ACER with Bin history, the number of active molecules is on par with the best onpolicy combinations.On the other hand, several off-policy results struggle to reach an episodic reward significantly above 0.4, which is the minimum reward for a generated SMILES string to be stored in the diversity filter.Only ACER with Bin history can consistently reach an average episodic reward above 0.7.With the help of the diversity filter, this leads to the highest number of active molecules and scaffolds.Hence, the slower convergence rate can possibly be explained by a more elaborate exploration phase.However, ACER shows a large variability.Even though SAC does generally achieve lower average episodic rewards, it is more stable and shows a larger median of active molecules and scaffolds compared to ACER utilizing Top history and Topbottom history.It appears that the average episodic reward of SAC with Bin history has not converged yet, indicating that it is still in the learning phase.It is possible that this specific combination requires over 2000 episodes to conduct adequate exploration.

Without Diversity Filter
For the off-policy algorithms with no diversity filter, it is observed that SAC using Top history generates the largest number of unique active molecules and scaffolds.However, it does not reach the highest average episodic reward in this setting, which is achieved by ACER.ACER with Top-bottom history converges to the highest average episodic reward, among the off-policy algorithms without diversity filter, but is still outperformed by SAC for the generation of active molecules and scaffolds.This is because it generates many duplicates of the same high-scoring molecules.Hence, SAC will generate more unique molecules when no diversity filter is used and, consequently, improve the exploration.This behavior yields an enhancement in the diversity of the active molecules, especially when the top-scoring molecules are used for off-policy updates.Overall, this highlights the positive impact an appropriate replay buffer can have on the generation of diverse molecules.It also emphasizes the positive effect that the usage of a diversity filter has and that inherent exploration of the algorithms is not necessarily enough to generate a structurally diverse set of active molecules.

Conclusions
We explore on-and off-policy RL algorithms for SMILES-based molecular de novo drug design using recurrent neural networks.The investigation has focused on how well the algorithms sample structurally diverse and highrewarding molecules for different replay buffers.This has been done by studying their behaviors both with and without using a diversity filter that penalizes the generation of similar molecules between episodes.
For on-policy algorithms, we observe that it is often favorable to use all generated molecules from the current batch for learning.Regularized MLE utilizing the full batch for learning, in combination with a diversity filter, leads to the overall best performance in terms of both the reward and diversity.However, it is possible to obtain similar performance by learning from fewer samples if the training data includes at least both high-rewarding and lowrewarding data points.For these on-policy algorithms applied with no diversity filter, it is also important to use intermediate-rewarding samples, either from the current batch of sampled molecules or previously sampled molecules.
There is a potential performance gain in using off-policy algorithms with a suitable replay buffer.When using no diversity filter, we observe that SAC provides good exploration, leading to a more structurally diverse generation.Hence, when no diversity filter is used, the policy must keep its randomness to avoid mode collapse.Interestingly, when using a diversity filter ACER yields better performance, displaying the potential to be on par with Regularized MLE or even better.
We release the source code of the methods as an open-source framework, to enable further exploration of reinforcement learning algorithms and replay buffers mechanisms.

Fig. 7
Fig.7Box plots of the number of unique active molecules and active scaffolds for the on-policy algorithms A2C, Regularized MLE and PPO (higher is better) when utilizing no diversity filter.128 SMILES strings are sampled in each episode, with a budget of 2000 episodes in total.

Fig. 8
Fig.8Average episodic reward computed over a batch of sequences, for the on-policy algorithms A2C, Regularized MLE and PPO (higher is better) when utilizing no diversity filter.It shows mean and standard deviation of moving average, of window size 50, over 11 runs for each policy and replay buffer.128 SMILES strings are sampled in each episode, with a budget of 2000 episodes in total.

Fig. 10
Fig.10Average episodic reward computed over a batch of sequences, for the off-policy algorithms SAC and ACER (higher is better) when utilizing identical molecular scaffold filter.It shows mean and standard deviation of moving average, of window size 50, over 11 runs for each policy and replay buffer.128 SMILES strings are sampled in each episode, with a budget of 2000 episodes in total.Note that invalid SMILES are displayed with a reward of 0 in this figure, but are given a reward of −1 during training.

Fig. 11
Fig.11Box plots of the number of unique active and active scaffolds for the off-policy algorithms ACER and SAC (higher is better) when utilizing no diversity filter.It shows mean and standard deviation over 11 runs for each policy and replay buffer.128 SMILES strings are sampled in each episode, with a budget of 2000 episodes in total.

Fig. 12
Fig.12Average episodic reward computed over a batch of sequences, for the off-policy algorithms SAC and ACER (higher is better) when utilizing no diversity filter.It shows mean and standard deviation of moving average, of window size 50, over 11 runs for each policy and replay buffer.128 SMILES strings are sampled in each episode, with a budget of 2000 episodes in total.Note that invalid SMILES are displayed with a reward of 0 in this figure, but are given a reward of −1 during training.

( a )
Fig. C1 A2C with diversity filter penalizing the generation of SMILES with the same molecular scaffold.If no score is displayed, all scores are at least 0.9995.Otherwise, scores are rounded upwards to 3 decimals.

( a )
Fig. C2 PPO with diversity filter penalizing the generation of SMILES with the same molecular scaffold.If no score is displayed, all scores are at least 0.9995.Otherwise, scores are rounded upwards to 3 decimals.
Fig. C3 Regularized MLE with diversity filter penalizing the generation of SMILES with the same molecular scaffold.If no score is displayed, all scores are at least 0.9995.Otherwise, scores are rounded upwards to 3 decimals.

( a )
Fig. C5 PPO without diversity filter.If no score is displayed, all scores are at least 0.9995.
Fig. C6 Regularized MLE without diversity filter.If no score is displayed, all scores are at least 0.9995.

( a )
Fig. C8 SAC with diversity filter penalizing the generation of SMILES with the same molecular scaffold.If no score is displayed, all scores are at least 0.9995.Otherwise, scores are rounded upwards to 3 decimals.