Augmentation Is What You Need!

We investigate the effect of SMILES augmentation on the performance of convolutional neural network models by extending the results of our previous study [1] to new methods and augmentation scenarios. We demonstrate that augmentation significantly increases performance and that this effect is consistent across the investigated methods. The convolutional neural network models developed with augmented data on average provided better performance than those developed using calculated molecular descriptors, for both regression and classification tasks.


Introduction
The renaissance of neural network methods [2,3] has brought to chemistry a wide variety of neural network types, including methods capable of analyzing chemical structures represented as graphs or text (e.g., SMILES). These methods are particularly interesting since their internal representations, latent variables, in principle do not lose information about the chemical structures, in contrast to methods based on chemical descriptors. These latent variables can be used to decode chemical structures and thus address the problem of inverse Quantitative Structure Activity Relationship (QSAR) [4], which has been a challenge for chemoinformatics since the first QSAR models. However, the practical question arises whether the predictive power of such methods is similar to that of descriptor-based ones. In our previous study [1] we demonstrated that the Convolutional Neural Fingerprint (CNF), which is based on the text-processing ideas originally proposed by [5], provided performance similar to models developed using three descriptor sets. The high prediction accuracy of the CNF was achieved thanks to the augmentation technique, which was originally proposed in computer vision and was recently introduced to QSAR studies [6]. It is worth mentioning that a related technique, enhancing the accuracy of QSAR models by considering the symmetry of molecules, was proposed more than 20 years ago [7]. Typically, so-called canonical SMILES, produced according to a rule-defined enumeration of atoms, are used to train models. The augmentation procedure instead generates a number of distinct SMILES for the same molecule, e.g., by starting the enumeration of atoms from a random atom and/or traversing the molecular graph in random order.
Augmentation is employed during both training and inference steps: during training, augmentation increases the dataset size by providing for each structure a number n of distinct SMILES; during inference, the prediction for a given structure is averaged over predictions of m distinct SMILES of that structure.
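This train/inference scheme can be sketched as follows. The `randomized_smiles` generator and the `model_predict` callable are hypothetical placeholders, not the paper's code; in practice, RDKit's random SMILES enumeration (e.g., `MolToSmiles(mol, doRandom=True)`) would generate the variants:

```python
import random

def randomized_smiles(smiles: str, n: int, seed: int = 0) -> list:
    """Placeholder for an RDKit-style random-SMILES generator that returns
    n distinct SMILES strings encoding the same molecule."""
    rng = random.Random(seed)
    # In practice RDKit's MolToSmiles(mol, doRandom=True) would be used here.
    return [f"{smiles}#variant{rng.randint(0, 10**6)}" for _ in range(n)]

def augment_training_set(data, n=10):
    """Expand each (smiles, label) pair into n distinct SMILES with the same label."""
    return [(s_aug, y) for s, y in data for s_aug in randomized_smiles(s, n)]

def predict_with_augmentation(model_predict, smiles, m=10):
    """Average the model's predictions over m distinct SMILES of one structure."""
    variants = randomized_smiles(smiles, m)
    preds = [model_predict(s) for s in variants]
    return sum(preds) / len(preds)
```

The key point is that the label is copied unchanged to every SMILES variant during training, while at inference the per-variant predictions are averaged into a single value per structure.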
The goal of this study was to clarify whether the performance of augmented models is similar to that of models developed using a large set of descriptors typically used in QSAR studies. We also include a result for TextCNN [8], the DeepChem implementation of the Char-CNN [5], which uses a different architecture.

Methods
The CNF method from our previous study [1] as well as the TextCNN method as implemented in DeepChem [8] were used as convolutional methods. Associative Neural Networks (ASNN) [9], a shallow neural network method, was used as a traditional approach to develop models based on descriptors. Early stopping was used to prevent overfitting [10] for all three analyzed methods.
Augmentation. The augmentation used for the convolutional methods was generated with RDKit and included three protocols: (1) no augmentation (only canonical SMILES were used), (2) off-line augmentation, with n = 10 SMILES generated before the neural network training, and (3) on-line augmentation, in which new SMILES were generated for each training epoch (only for CNF). Augmentation with n = 10 SMILES (selected based on the results in [1]) was also applied during the prediction step, and the average value was used as the final model prediction. It was previously shown that model performance decreased when a model developed with canonical SMILES was used to predict augmented data [1]. Therefore, for models developed with the first protocol, only canonical SMILES were used during the prediction step.
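The difference between the three protocols can be illustrated with a minimal training-loop sketch. Here `enumerate_smiles` is a hypothetical stand-in for RDKit's random SMILES enumeration and `train_one_epoch` is supplied by the caller; neither is the paper's actual implementation:

```python
import random

def enumerate_smiles(smiles, n, rng):
    # Placeholder: tag the canonical SMILES with a random suffix to mimic
    # n distinct SMILES strings of the same molecule.
    return [f"{smiles}|{rng.random():.6f}" for _ in range(n)]

def train(train_one_epoch, data, protocol="online", n=10, epochs=3, seed=42):
    rng = random.Random(seed)
    if protocol == "none":
        batch = list(data)  # protocol (1): canonical SMILES only
    elif protocol == "offline":
        # Protocol (2): augment once before training; the same n variants
        # per molecule are reused in every epoch.
        batch = [(v, y) for s, y in data for v in enumerate_smiles(s, n, rng)]
    for _ in range(epochs):
        if protocol == "online":
            # Protocol (3): fresh SMILES variants for every training epoch.
            batch = [(v, y) for s, y in data for v in enumerate_smiles(s, n, rng)]
        train_one_epoch(batch)
```

Off-line augmentation fixes the expanded dataset up front, while on-line augmentation effectively shows the network a different textual encoding of each molecule in every epoch.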
Hyperparameter Optimization. The methods were used with their default parameters as available on the On-line Chemical Database and Modeling Environment (http://ochem.eu). For CNF we used the same parameters as in [1], with the exception of the convolutional filter size, which was increased to 5. We tried to optimize neural network parameters such as the number of neurons, architecture, activation function, etc., but did not see improvement over the defaults. A full optimization run required more than 12 h on six GPU cards (GeForce RTX 2070 and 1070, Quadro P6000, Titan Xp and V), and thus an exhaustive investigation of all options was impossible. Since training with augmented data took about 10 times longer per epoch, the number of epochs for on-line training was correspondingly increased 10 times to allow the networks to use approximately the same number of training steps.
Descriptors. In total, 16 sets of descriptors, namely ALogPS + OEstate, CDK2, ChemaxonDescriptors, Dragon7, Fragmentor, GSFrag, InductiveDescriptors, JPlogP, Mera + Mersy, PyDescriptor, QNPR, RDKIT, SIRMS, Spectrophores, StructuralAlerts and alvaDesc, were used with their default settings. The descriptors are described on the OCHEM web site and were used in multiple previous studies. Many of the descriptors have their own hyperparameters, e.g., the size of fragments for fragmental descriptors, which could also be optimized. We did not perform such optimization in this study but instead used the default hyperparameters, which were found to be optimal in previous studies.
Model Validation. Five-fold cross-validation was used to test performance of all models.
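For concreteness, a generic k-fold splitter can be sketched as below (this is a standard construction, not necessarily OCHEM's exact implementation): indices are shuffled once and split into k folds, and each fold serves as the test set exactly once while the remaining folds form the training set.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Split the shuffled indices into k roughly equal folds.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```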
Datasets. The same 9 regression and 9 classification sets from our previous study [1] were used.
Statistical Parameters. The regression models were compared using the coefficient of determination, R² = 1 − Σᵢ (yᵢ − fᵢ)² / Σᵢ (yᵢ − ȳ)², where ȳ is the average target value across all samples, while fᵢ and yᵢ are the predicted and target values for sample i, respectively. The classification results were compared using the Area Under the Curve (AUC).
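Pure-Python versions of the two comparison metrics, given here only to make the definitions concrete (any standard library implementation would do):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - sum((y_i - f_i)^2) / sum((y_i - y_bar)^2)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney formulation: the
    probability that a random positive is scored above a random negative
    (ties count as 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```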

Results
Augmentation dramatically improved the results for the CNF method and also contributed better models for TextCNN, for both regression and classification datasets, as shown in Figs. 1 and 2. To compare the performance of models developed using descriptors with that of the convolutional neural networks, we counted the number of datasets for which ASNN (using any descriptor set) or one of the augmented convolutional models provided better results. Such a comparison was biased towards ASNN, since the best result for this method was selected from 16 models corresponding to the used descriptor sets, versus only three (two off-line and one on-line) models for the SMILES-based approaches. For three datasets the best models of both approaches had the same performance. For the remaining datasets, the SMILES-based approaches contributed better models in 11 cases, while the descriptor-based approaches did so for 4 datasets.

Conclusions
We showed that convolutional neural networks trained with augmented data provide better performance than models developed with state-of-the-art descriptor representations of molecules, for both regression and classification problems.