1 Introduction

Molecular deep generative models have emerged as a powerful tool to generate chemical space [6] and obtain optimised compounds [2, 5]. Models trained on a set of drug-like molecules can generate molecules that are similar but not identical to those in the training set, thus spanning a larger chemical space than that of the training data. The most popular architecture uses Recurrent Neural Networks (RNNs) and the SMILES syntax [7] to represent molecules. Nevertheless, a recent publication [1] shows that this architecture introduces bias into the generated chemical space. To demonstrate this, models were trained on a subset of GDB-13 [4], a database that holds most drug-like molecules of up to 13 heavy atoms, and sampled with replacement 2 billion times. At most, only 68% of GDB-13 could be obtained, against a theoretical maximum of 87%, which is the coverage expected from a sample of the same size drawn from an ideal model that assigns a uniform probability to each molecule in GDB-13.
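The 87% ceiling can be checked with a short calculation. When k samples are drawn uniformly with replacement from N molecules, the expected fraction seen at least once is 1 - (1 - 1/N)^k. The sketch below is only an illustration: the GDB-13 size of roughly 975 million molecules is an assumption (it is not stated in this text), and the function name is hypothetical.

```python
import math


def expected_coverage(n_molecules: int, n_samples: int) -> float:
    """Expected fraction of distinct molecules seen at least once when
    drawing n_samples times, with replacement, from a uniform
    distribution over n_molecules."""
    # P(a given molecule is never drawn) = (1 - 1/N)^k.
    # log1p keeps precision when 1/N is tiny.
    p_missed = math.exp(n_samples * math.log1p(-1.0 / n_molecules))
    return 1.0 - p_missed


# Assumption: approximate size of GDB-13 (not given in the text).
GDB13_SIZE = 975_000_000
SAMPLE_SIZE = 2_000_000_000

coverage = expected_coverage(GDB13_SIZE, SAMPLE_SIZE)
print(f"Expected coverage of an ideal uniform model: {coverage:.3f}")
```

Under this size assumption the result is close to 0.87, in line with the theoretical maximum quoted above.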

This study uses the previous research as a starting point and focuses on benchmarking RNNs trained on SMILES from subsets of GDB-13 of different sizes (1 million and 1000 molecules) and with different variants of the SMILES notation. One of those variants, randomized SMILES, can be used as a data amplification technique and has been shown to generate more diversity [3]. When the right data representation and hyperparameter combination are chosen, models generate more diversity and generalise the training set information better.

2 Methods

The model architecture is similar to the one used in [1, 5]. The training set sequences are pre-processed, and for each training epoch the entire training set is shuffled and subdivided into batches. The encoded SMILES strings of each batch are input token by token to an embedding layer, followed by several layers of RNN cells. Dropout layers can be placed between the inner RNN layers. The output from the cells is then projected to the vocabulary size by a linear layer, and a softmax is applied to obtain the probabilities of sampling each token in the next position. This is repeated for each token in the entire sequence.
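The last step of that pipeline, turning the linear layer's output into next-token probabilities and drawing a token from them, can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the toy vocabulary, the logit values and the function names are all assumptions.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert unnormalised scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow in exp().
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits: Sequence[float],
                      vocabulary: Sequence[str],
                      rng: random.Random) -> str:
    """Draw one token according to the softmax probabilities."""
    probabilities = softmax(logits)
    return rng.choices(vocabulary, weights=probabilities, k=1)[0]


# Hypothetical vocabulary and logits, for illustration only.
vocabulary = ["C", "c", "O", "N", "(", ")", "1", "="]
logits = [2.0, 1.5, 0.3, 0.1, -0.5, -1.0, 0.0, -0.2]

rng = random.Random(0)
print([round(p, 3) for p in softmax(logits)])
print(sample_next_token(logits, vocabulary, rng))
```

In a full model the logits would come from the linear layer at each position, and the sampled token would be fed back as the next input.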

Table 1. Hyperparameter combinations for both the 1M model and the 1K model. Note that the 1K model also optimises the network topology; this was possible due to its shorter training times.

The models were optimised over the hyperparameter combinations shown in Table 1. In addition, training sets were set up with both canonical SMILES and randomized SMILES. In the case of the randomized SMILES, each training epoch used a different permutation. For each combination of hyperparameters a model was trained and a sample with replacement of 2 billion SMILES strings was drawn (Fig. 1). Then, three ratios were calculated from the percentages obtained; they characterise the three main properties that the output domain should have: uniformity (even posterior probability for each molecule), completeness (all molecules from GDB-13 are generated) and closeness (no molecules outside of GDB-13 are generated). Lastly, the UCC, a ratio obtained from the other three, was used as the sorting criterion for all the models.
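As a rough illustration of the raw percentages that such ratios start from, the sketch below tallies a sample against a reference set. It does not reproduce the uniformity, completeness, closeness or UCC formulas, which are given in Fig. 1. The function name, the dictionary keys and the toy data are hypothetical, and the sampled strings are assumed to be already canonicalised so that string equality means molecule identity.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def sample_percentages(sampled: Iterable[str],
                       reference: Set[str]) -> Dict[str, float]:
    """Tally a sample (with repeats) against a reference set of molecules."""
    counts = Counter(sampled)
    total = sum(counts.values())
    # Distinct sampled molecules that belong to the reference set.
    unique_in = {s for s in counts if s in reference}
    # Number of draws (counting repeats) that landed in the reference set.
    draws_in = sum(counts[s] for s in unique_in)
    return {
        "sampled_in_reference": draws_in / total,
        "sampled_outside_reference": (total - draws_in) / total,
        "reference_covered": len(unique_in) / len(reference),
    }


# Toy stand-ins for GDB-13 and for a model sample.
reference = {"CCO", "CCN", "CCC", "C1CC1"}
sampled = ["CCO", "CCO", "CCN", "CCCl"]

for name, value in sample_percentages(sampled, reference).items():
    print(f"{name}: {value:.2f}")
```

At the scale of 2 billion samples the counting would need to be done out of core; the in-memory `Counter` is only suitable for small examples.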

Fig. 1. Training and sampling process used for each model in the benchmark and the formulas for the ratios calculated from the sample.

3 Results

Table 2 shows the results for the models with the highest UCC score for each training set size and each SMILES variant. 1M models trained with randomized SMILES are better overall than those trained with canonical SMILES. This might be due to the additional information the model obtains from the molecules in the training set when they are input as different randomized SMILES each epoch. Notice especially that the completeness is 0.95, which indicates that the model is theoretically able to reproduce almost all of GDB-13 given enough sampling. On the other hand, models trained with 1000 SMILES perform much worse, as there is not enough information in the training sets to generalise to the entire database. Even so, the randomized SMILES model performs markedly better than the canonical SMILES one: a model trained with canonical SMILES reaches only 52% valid molecules, whereas the randomized SMILES model learns much better (82%). This shows that randomized SMILES add more information to the model and effectively increase its learning capability without requiring additional data in the training set.

Table 2. Results for the best canonical and randomized SMILES models for both the 1M and 1K training set benchmarks.