Conformational Oversampling as Data Augmentation for Molecules

Abstract
Toxicological datasets tend to be small and imbalanced, which quickly causes models to overfit and to disregard the minority class. To address this issue we generate multiple conformations of each molecule, which allows us to balance datasets as well as increase their size. Applying this approach to the Tox21 Challenge data, we found conformational oversampling to be a viable training strategy, increasing the balanced accuracy of the resulting models.


Introduction
Safety evaluation during drug development would greatly benefit from models that reliably predict specific toxicities. Such models could reduce time, cost, and animal testing by indicating hazards in the first stage of drug design. The Merck Kaggle competition and the Tox21 Challenge demonstrated the superiority of neural networks over traditional machine learning approaches for biological activity prediction [7,8,11,12]. Hence, neural networks are a viable tool for building predictive toxicity models. However, toxicological datasets are often small and imbalanced. This frequently results in models that overfit or neglect the minority class, predicting almost all compounds as the majority class. Especially for toxicity prediction it is important that models exhibit both high sensitivity and high specificity, that is, they identify hazards without flagging every single compound.
With the aim of solving both imbalance and overfitting, we developed a method named COVER (Conformational OVERsampling). The method is derived from data augmentation as used in image recognition, where images are transformed to increase generalization and dataset size (e.g. [3,5,10,13]). Instead of transforming an image, we "transform" molecule representations by generating multiple conformations together with a 3D-based description of the molecules. This enables us to balance datasets for neural network training, as well as to increase the dataset size.

Method
For model training we assembled a dataset using the endpoint p53 activation ("SR-p53") from the Tox21 Challenge data, which contains 5843 compounds with binary annotation. The endpoint has an imbalance ratio of 1:16, i.e. the dataset contains 16 times more inactive than active molecules. Next, we generated conformations of the molecules using the ETKDG algorithm with energy minimisation as implemented in RDKit [6,9]. In total, we generated six datasets (see Table 1). For each dataset, m conformations were generated for the inactives and n conformations for the actives, yielding the label "m-n dataset"; to obtain m conformations, the conformation generation was simply run m times, without any further processing. The datasets were prepared with different goals: firstly, to evaluate training on the original dataset ("1-1 dataset"); secondly, to evaluate training on a balanced dataset ("1-16 dataset"); and thirdly, to evaluate whether oversampling alone is beneficial ("2-2" and "5-5 datasets") or whether balancing is mandatory ("2-32" and "5-80 datasets"). After conformer generation, 1145 3D descriptors were calculated and used as input for the models. In addition, we used the two test datasets from the Tox21 Challenge as a combined external validation set containing 643 compounds.
To rule out increased predictivity due to sophisticated network architectures, we only trained multilayer feed-forward neural networks with two to four hidden layers. Training was conducted using nested cross-validation [1]: the inner cross-validation is used for a hyperparameter grid search, whereas the outer loop validates the model on an external validation fold. In addition, the splits of the dataset are not chosen randomly but determined via affinity propagation clustering [4,7]. The generated clusters are randomly distributed over the five folds, such that molecules of the same cluster end up in the same fold.
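The cluster-to-fold assignment could be sketched as follows. This is a minimal illustration using scikit-learn's AffinityPropagation on random toy descriptors; the variable names and the toy data are ours, not part of the original implementation:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
# Toy stand-in for molecular descriptors (100 molecules, 32 features).
X = rng.random((100, 32))

# Cluster molecules by similarity; each cluster will live in exactly one fold.
clusters = AffinityPropagation(random_state=0).fit_predict(X)

# Randomly distribute whole clusters over five cross-validation folds.
n_folds = 5
cluster_ids = np.unique(clusters)
fold_of_cluster = {c: f for c, f in zip(rng.permutation(cluster_ids),
                                        np.arange(len(cluster_ids)) % n_folds)}
folds = np.array([fold_of_cluster[c] for c in clusters])
```

Because whole clusters are assigned to folds, structurally similar molecules never end up on both sides of a split.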
The resulting low similarity between folds reduces model bias. To prevent bias from the increased number of conformations per molecule, conformer generation was performed after clustering and splitting. Each conformation was then assigned the same cluster number, and thus the same cross-validation fold, as its parent molecule. This prevents leakage of information between the training and test datasets and between the cross-validation folds.
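A minimal sketch of this oversampling step, assuming RDKit's Python API; the function name and the fold-inheritance bookkeeping are our illustration, not the authors' code:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def oversample_conformers(smiles, n_confs, fold):
    """Generate n_confs ETKDG conformers with MMFF energy minimisation;
    every conformer inherits the parent molecule's cross-validation fold."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    params = AllChem.ETKDGv3()
    params.randomSeed = 42
    AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, params=params)
    AllChem.MMFFOptimizeMoleculeConfs(mol)  # energy minimisation
    # Each conformer becomes one training sample carrying the parent's fold.
    return [(mol, conf_id, fold) for conf_id in range(mol.GetNumConformers())]

# Ethyl benzoate, 5 conformers, parent molecule assigned to fold 2.
samples = oversample_conformers("CCOC(=O)c1ccccc1", n_confs=5, fold=2)
```

Each returned tuple would then be descriptor-calculated and fed to the network as an independent sample with its parent's label and fold.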
Since the focus of model training was suitability for toxicity prediction, balanced accuracy was used to choose the best hyperparameter set. Balanced accuracy is calculated as the arithmetic mean of sensitivity and specificity.
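As a small illustration of why this metric penalizes majority-class predictors (the helper function is our own, not part of the original pipeline):

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Arithmetic mean of sensitivity (recall on actives) and
    specificity (recall on inactives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# A model predicting everything as the majority (inactive) class gets
# perfect specificity but zero sensitivity:
print(balanced_accuracy(tp=0, fn=50, tn=800, fp=0))  # 0.5
```

Plain accuracy would score such a degenerate model at 800/850 ≈ 0.94, whereas balanced accuracy stays at chance level.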

Results and Discussion
For all datasets we could generate the necessary number of conformations. For rigid molecules, which constituted about 10% of the dataset, we obtained very similar or identical conformations. Nevertheless, the conformers of a molecule had an average root mean square deviation (RMSD) of about 1.7-1.9 Å.
Using the dataset with one conformation per molecule (1-1 dataset), we observed that training yielded good results with respect to the area under the receiver operating characteristic curve (AUC) but, in most cases, balanced accuracy was lower than 0.6. In these cases the models have high specificity but lack sensitivity; evaluating the predictions showed that nearly all molecules were classified as the majority class. Applying COVER, we observed considerable changes: trained models gained sensitivity with only a slight loss of specificity and no loss in AUC, indicating that the models are able to predict both classes. Figure 1A illustrates that oversampling by itself does not increase the predictivity of models trained on the 2-2 or 5-5 dataset. In contrast, models trained on the 1-16, 2-32 and 5-80 datasets show an increase in sensitivity, and thus predictivity.
The results for the external fold of the cross-validation can be seen in Fig. 1B. They show that models trained on balanced datasets do not suffer from low sensitivity.
To ascertain that the models also work on an external dataset, we used the test set of the Tox21 Challenge. No model showed a decrease in predictivity, as depicted in Fig. 1C.
Our approach demonstrates that creating multiple conformations of a molecule facilitates the training of neural networks. This is achieved using only information inherent in the dataset, without having to create artificial samples as is often done in traditional machine learning (e.g. SMOTE [2]). Since our idea originated from image augmentation, we do not view the conformations as biologically relevant; therefore, all conformations carry the same label as the parent molecule. Rather, we hypothesize that the enlarged training space allows the model to generalize better. This is similar to augmented images helping a network to generalize: rather than being meaningful representations in themselves, they present the variety of the real world to the model. Consequently, no conformation selection process is implemented. In the future it would be interesting to investigate the influence of a more sophisticated conformer selection on model training; however, this makes balancing harder, as there might not be enough distinct conformations per molecule.
In general, we observed that COVER is only beneficial when the dataset is also balanced. Oversampling increases the training space of the network, facilitating training; on a balanced dataset, models can then no longer disregard the minority class. Our final validation showed that models trained on datasets balanced with COVER have a much higher sensitivity than those trained on unbalanced datasets.