1 Introduction

Deep learning [24] has been successfully applied in a number of problem domains, from natural language processing [39] and medical imaging analysis [32] to finance [26]. Deep learning architectures are also successfully used for many predictive tasks in the chemistry and biology domains [55]. One application of deep learning in the chemistry domain is to predict important properties of chemical compounds. This allows chemical compounds to be assessed before committing to an expensive synthesis process [7, 8, 20, 43].

Deep learning theory is based on deep neural networks (DNN) which consist of many layers. Each layer comprises a number of neurons. Higher levels of the DNN represent more complex concepts. To improve network performance, layers are often implemented using different methodologies. The overall topology of a neural network is selected based on the problem to be solved and is often tuned during an experimental phase.

In this paper, the DNN learns to represent compounds by their chemical descriptors. This opens a wide range of opportunities to build sophisticated machine learning applications for predicting different properties of chemical compounds.

This research study investigates how a pretrained autoencoder can be used to build classification and regression models for predicting the LogD property and target binding of chemical compounds, using data obtained from different sources.

The remainder of the paper is outlined as follows. Section 2 provides background information which includes a description of predicting properties of chemical compounds and other machine learning models used for their prediction. Section 3 introduces the mathematics behind the developed autoencoder together with visualizations of the learning mechanisms. Section 4 provides detail on the data sets used and experiments carried out, while Sect. 5 describes the obtained results, followed by discussion and conclusions in Sect. 6.

2 Background

2.1 Predicting properties

2.1.1 LogD

Lipophilicity is possibly one of the most important physicochemical properties of a potential drug. It plays a role in solubility, absorption, membrane penetration, plasma protein binding, distribution, CNS penetration and partitioning into other tissues or organs such as the liver, and has an impact on the routes of clearance. It is important in ligand recognition, not only to the target protein but also in CYP450 interactions, HERG binding, and PXR-mediated enzyme induction. Most drugs entering the market are designed for oral administration. The absorption of drugs can be either via passive diffusion across membranes or via carrier-mediated transport. Carrier-mediated transport is energy dependent and requires a specific transporter protein. In contrast, passive diffusion does not require the presence of a specific carrier transporter protein and is less structure specific than carrier-mediated transport; there is a general dependence on lipophilicity for structurally diverse compounds. However, the relationship with LogD is non-linear, with an optimum of LogD 1-2.

Measurement of LogP can be undertaken in a variety of ways; the most common is the shake-flask method, which consists of dissolving some of the solute in question in a volume of octanol and water, shaking for a period of time, and then measuring the concentration of the solute in each solvent. This can be time-consuming, particularly if there is no quick spectroscopic method to measure the concentration of the molecule in the phases. A faster method of LogP determination makes use of high-performance liquid chromatography.

However, the majority of known drugs contain ionizable groups (Fig. 1 shows the distribution of small molecule drugs within DrugBank [53]) and are likely to be charged at physiological pH, whereas LogP only correctly describes the partition coefficient of neutral (uncharged) molecules. LogD, the distribution coefficient, is a better descriptor of the lipophilicity of a molecule. It can be determined in a similar manner to LogP, but instead of using water, the aqueous phase is adjusted to a specific pH using a buffer. LogD is thus pH dependent, hence one must specify the pH at which it was measured. Of particular interest is the LogD at pH = 7.4 (the physiological pH of blood serum).

Fig. 1
figure 1

Distribution of small molecule drugs within DrugBank

Usually, it is not practical to determine the LogD of every compound experimentally (and it may be of interest to calculate LogD prior to synthesis), so calculated values are used.

2.1.2 Binding

The majority of drug-like small molecules are specifically designed to bind to protein targets involved in disease-related pathways. The activity of molecules in a biological assay may be captured by a variety of different measures (IC50, EC50, Ki, % inhibition, etc.), but most are, in some manner, a measure of a binding event. The quality of biological results varies considerably, from single-point high-throughput screening (HTS) data [15] to full dose-response curves. Much of these data are captured in ChEMBL, a database of bioactive drug-like small molecules and abstracted bioactivities.

2.2 Models for comparison

The developed DNN model provides an accurate prediction of LogD and binding properties. To gauge its performance, a total of ten machine learning (ML) techniques were used. Taking into consideration the large variety of experimental setups, model implementations and evaluation metrics, the authors tried to summarize results from a number of sources and provide a ‘single’ performance metric to compare each of the techniques.Footnote 1

  • Logistic regression (LR) [41] is a straightforward and popular classification model, which has a long and successful history of being used in many ML applications and in statistical modelling. It uses a logistic function to weight a linear combination of input parameters. LR models have been used in a large number of publications involving chemical data.

  • Kernel ridge regression (KRR) [57] uses a modified approach to find a regression function by adding a bias, which causes a drop in variance. In other words, a better prediction can be achieved by accepting a slightly less well-fitted model during the training process.

  • Another very popular and well-recognized model is random forests (RF) [56]. It can be applied to both classification and regression problems and uses an ensemble of decision trees trained on different subsets of the original data.

  • Extreme gradient boosting (XGB) [14] is, like RF, an ensemble method. During each training step it constructs a new tree model which, in combination with the previous models, minimizes the overall prediction error. It is a very popular approach in the ML community and has been successfully applied to many problems, consistently providing accurate results.

  • A multitask neural network (MNN) [49] is a modified neural network architecture for solving multiple problems simultaneously. There are a number of different designs of these neural networks; however, they are all constructed on the principle that some fully-connected layers are shared between tasks. In this way the training process for one task can influence other tasks and vice versa.

  • Graph convolutional (GC) [18] models are designed for molecules that can be transformed into undirected graphs, where atoms are represented as nodes and bonds as edges. A convolution is applied to expand the feature space by creating multiple filters representing graph substructures. The aggregation of these substructures is performed via multiple convolution layers.

  • The message passing neural network (MPNN) [23] model is based on a mathematical framework which generalizes a number of graph-based neural network designs. It performs computation in two phases: a message passing phase, which repeatedly updates a hidden state, and a read-out phase, which uses the final hidden state to make a prediction.

  • The directed acyclic graph (DAG) [37] model is another popular approach for creating classification and regression models. It is based on the graph structure of chemical compounds. The previously reviewed graph-based models use undirected graphs (molecular bonds between atoms naturally have no direction). The DAG design, however, artificially introduces an additional directional feature into the graph-based molecular representation: it identifies a central atom and builds a directed structure of the chemical compound from this central point. Generating these additional features may improve prediction accuracy in comparison to other graph-based models.

  • WEAVE [31] is a graph-based model that utilizes both kinds of chemical compound properties: nodes (atoms) and edges (bonds). The constructed feature matrix is processed by convolution-like filters. Similar to convolutional neural networks for images, WEAVE provides more informative representations of chemical compound structures.

  • Influence relevance voting (IRV) [51] is not the most common approach, but it has certain important advantages over other ML models. Its predictions can be easily interpreted, in a similar fashion to k-nearest neighbours (kNN). IRV identifies the k nearest neighbours and uses a neural network to compute a more complex similarity function.

The majority of these models rely on two key factors: the availability of large volumes of training data and knowledge about the physical structure of chemical compounds (which can be used to convert them to graphs or fingerprints). Despite the large collections of stock chemical compounds offered by many research and commercial entities, the number of compounds in an individual assay investigating a specific problem is relatively small. This may significantly impact the performance of many ML models. In addition, if a model relies on the physical structure of chemical compounds, the training process can become very computationally expensive.

This study investigates transfer learning using the variational autoencoder, which reduces the reliance on a large volume of training data from the specific assays.Footnote 2 This model also does not require any external knowledge about the physical structure of chemical compounds; all knowledge required for making accurate predictions derives from the SMILES representation of chemical compounds.

3 Methodology

The proposed methodology uses a pretrained autoencoder to build classification and regression models for predicting properties of chemical compounds. A high-level overview of the proposed approach is presented in Fig. 2.

Fig. 2
figure 2

A summary of the proposed approach where ChEMBL data are used to train the autoencoder and screening data are used to build and evaluate a model for predicting desirable properties of chemical compounds

The majority of the proposed workflow is the same for predicting LogD and binding properties. It starts with selecting a collection of chemical compounds from the ChEMBLv23 database [11]. These compounds are used for training the variational autoencoder. Then, a specially designed process isolates the encoder layers of the variational autoencoder. These layers are combined with an additional sub-network for performing classification or regression tasks (depending on the problem being solved). The constructed regression or classification neural networks are trained using screening data derived from HTS. It is important to mention that the encoder layers remain frozen throughout this training. The obtained model is used for evaluation, and if the assessment is successful, the model can be deployed in a production environment.

The rest of the methodology section is split into three parts. Section 3.1 describes a variational autoencoder. Section 3.2 focuses on classification and regression models for predicting desirable properties of chemical compounds. Section 3.3 explains a learning process using a simple and intuitive example.

3.1 Variational autoencoder

An autoencoder is a neural network which can be used to address representation learning problems [30]. It learns to reconstruct the original input through an informational bottleneck. Such neural networks have been successfully applied in various domains such as noise filtering in audio-video content, translation, and compression. The proposed approach focuses on the autoencoder for reconstruction of chemical compounds. It takes simplified molecular-input line-entry system (SMILES) [52] strings and tries to reproduce them using a latent space. SMILES are ASCII strings describing the structure of chemical species. They can be imported by most molecule editors and converted into two-dimensional or three-dimensional structures of the molecules. For example, Aspirin \(C_{9}H_{8}O_{4}\) is represented by the following SMILES: CC(=O)OC1=CC=CC=C1C(=O)O, whose 2D structure is shown in Fig. 3.

Fig. 3
figure 3

Aspirin 2D structure
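SMILES strings such as the one above can be parsed and manipulated with standard cheminformatics toolkits. The following minimal sketch, assuming the open-source RDKit library is available (it is used here purely for illustration and is not part of the software described in this paper), parses the Aspirin SMILES and recovers a canonical form of it:

```python
# A minimal RDKit sketch: parse the Aspirin SMILES quoted above and print a
# canonical SMILES string and the heavy-atom count.
from rdkit import Chem

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"   # Aspirin, C9H8O4
mol = Chem.MolFromSmiles(smiles)       # parse the ASCII string into a molecule object
print(Chem.MolToSmiles(mol))           # canonical SMILES (e.g. CC(=O)Oc1ccccc1C(=O)O)
print(mol.GetNumAtoms())               # 13 heavy atoms (9 carbons and 4 oxygens)
```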

A generic autoencoder is trained to minimize the reconstruction error \(\mathcal {L}\) defined by Eq. 1:

$$\begin{aligned} \min \{\mathcal {L}(x, \widehat{x}) + \mathcal {R}\}, \end{aligned}$$
(1)

where x is the original input, \(\hat{x}\) is the reconstructed output and \(\mathcal {R}\) is a regularizer. The regularizer penalizes a model for using large weights, to prevent memorization and overfitting [17]. Ideally, a well-trained autoencoder should accurately reconstruct an input SMILES x such as the already mentioned CC(=O)OC1=CC=CC=C1C(=O)O.

In this paper, we focus on the variational autoencoder [22], a special type of autoencoder which uses a probability distribution to reconstruct the original input x. Suppose z represents the hidden variables (from the latent space) used to reconstruct the original input:

$$\begin{aligned} p(z|x)=\frac{p(x|z)p(z)}{p(x)}=\frac{p(z,x)}{p(x)}. \end{aligned}$$
(2)

The computation of the marginal distribution p(x) is very complex, since the following integral is intractable (in the majority of cases):

$$\begin{aligned} p(x)=\int p(x|z)p(z)dz. \end{aligned}$$
(3)

There are two main approaches to tackle this problem: Monte Carlo methods [38] and variational inference (used to build the variational autoencoder) [27]. Let us approximate p(z|x) with another distribution q(z|x), where q can be chosen as a tractable distribution (such as a Gaussian) [29]. Then, it is possible to find distribution parameters for which q becomes close enough to p by minimizing the Kullback–Leibler divergence [47], which measures the amount of information lost by the chosen approximation:

$$\begin{aligned} KL(q(z|x)||p(z|x)) = -\sum {q(z|x)\log {\frac{p(z|x)}{q(z|x)}}}. \end{aligned}$$
(4)

By replacing p(z|x) in Eq. 4 it is possible to derive the following equation:

$$\begin{aligned} KL(q(z|x)||p(z|x))= -\sum q(z|x) \log {\frac{p(z,x)}{q(z|x)}} + \log {p(x)}, \end{aligned}$$
(5)

and express \(\log {p(x)}\) as:

$$\begin{aligned} \log {p(x)} = KL(q(z|x)||p(z|x)) + \sum q(z|x){\log {\frac{p(z,x)}{q(z|x)}}}, \end{aligned}$$
(6)

where the second term is called the variational lower bound:

$$\begin{aligned} \mathcal {L} = \sum {q(z|x)\log {\frac{p(z,x)}{q(z|x)}}}. \end{aligned}$$
(7)

This allows us to rewrite Eq. 6 as:

$$\begin{aligned} \log {p(x)} = KL(q(z|x)||p(z|x)) + \mathcal {L}, \end{aligned}$$
(8)

where \(\log {p(x)}\) in Eq. 8 can be considered a constant, since x is given (as the original input). The rest of the equation is a sum of two quantities, of which the KL-divergence needs to be minimized. The minimization of the KL-divergence is effectively a maximization of the variational lower bound \(\mathcal {L}\) defined by Eq. 7. By substituting \(p(z,x)=p(x|z)p(z)\) in Eq. 7 it is possible to derive the following equation:

$$\begin{aligned} \mathcal {L} = \sum {q(z|x)\log {p(x|z)}} + \sum {q(z|x)\log {\frac{p(z)}{q(z|x)}}}, \end{aligned}$$
(9)

where the first term is the expectation of \(\log {p(x|z)}\) with respect to q(z|x) and can be written as \(E_{q(z|x)}\log {p(x|z)}\), and the second term is \(-KL(q(z|x)||p(z))\), the negative KL-divergence between the approximate posterior and the prior. This allows us to rewrite Eq. 9 as follows:

$$\begin{aligned} \mathcal {L} = E_{q(z|x)}\log {p(x|z)} - KL(q(z|x)||p(z)). \end{aligned}$$
(10)

Let us build the variational autoencoder based on the variational lower bound in Eq. 9. The distribution q is a function mapping x to z, which should approximate another distribution p. The distribution p is a function mapping z to \(\hat{x}\), where p can be chosen. Both q and p are implemented as neural networks and in the following are called the encoder and decoder respectively. A visualization of the variational autoencoder model is shown in Fig. 4.

Fig. 4
figure 4

A schema of variational autoencoder model

Let us select p to be a Gaussian distribution. This requires the distribution q (in the latent layer) also to be approximately Gaussian. The cost function can then be expressed as:

$$\begin{aligned} \min {|x-\hat{x}|^2-KL(q(z|x)||N(\mu ,\sigma ))}, \end{aligned}$$
(11)

where N is a normal distribution defined by two parameters: \(\mu \) (mean) and \(\sigma \) (variance). The term \(|x-\hat{x}|^2\) in Eq. 11 is derived from the definition of the reconstruction error for the Gaussian distribution \(p(x|\hat{x})=e^{-|x-\hat{x}|^2}\).
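To make the cost function in Eq. 11 concrete, the sketch below shows how it is commonly implemented, under the usual assumptions that the encoder outputs \(\mu \) and \(\log \sigma ^2\) per latent dimension and that the prior on z is a standard normal. This is an illustration of the standard closed-form KL term for two Gaussians, not necessarily the exact loss code used in this work; sign conventions also differ between presentations.

```python
# A sketch of the VAE cost function (Eq. 11): squared reconstruction error plus
# the closed-form KL divergence between the diagonal Gaussian q(z|x) and a
# standard normal prior. Assumes the encoder produces z_mean and z_log_var.
import tensorflow as tf

def vae_loss(x, x_hat, z_mean, z_log_var):
    # |x - x_hat|^2 summed over the 150x78 one-hot matrix of each sample
    recon = tf.reduce_sum(tf.square(x - x_hat), axis=[1, 2])
    # KL(q(z|x) || N(0, I)) for a diagonal Gaussian, in closed form
    kl = -0.5 * tf.reduce_sum(1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1)
    # minimising this sum is equivalent to maximising the variational lower bound
    return tf.reduce_mean(recon + kl)
```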

As shown in Fig. 5, the encoder learns to represent the original input x as a set of attributes z in the latent space, where each attribute is defined as the probability distribution (with parameters \(\mu \) and \(\sigma \)). The decoder learns to reconstruct \(\hat{x}\) close to the original input x using a set of attributes z from the latent space.

Fig. 5
figure 5

The mechanism of the input reconstruction using latent space

A set of attributes z in the latent space represents a chemical compound fingerprint and can be used for building classification and regression models. A typical approach for building such models is to join together the trained encoder and problem-dependent prediction layers, as shown in Fig. 6.

Fig. 6
figure 6

A schema of classification and regression models built based on the variational autoencoder

The training process in such an architecture learns some function f(y|z) which predicts y (a category value for a classification problem or a real value for a regression problem) using the chemical compound fingerprints z generated by the encoder. In the classical approach the encoder neural network q(z|x) only generates z and does not take part in the training process (so its weights remain frozen throughout the training cycle).Footnote 3

3.2 Architecture

The variational autoencoder was implemented using a convolutional neural network [6] in combination with a few layers supporting the variational training process. Its neural network topology is presented in Fig. 7.

Fig. 7
figure 7

A neural network topology of variational autoencoder

It consists of two joined neural networks: the encoder and the decoder. The encoder neural network consists of nine layers. The Input layer takes SMILES transformed into a one-hot 150x78 matrix representation (each row represents a SMILES character and each column its encoding). A visualization of the SMILES transformation is shown in Fig. 8.

Fig. 8
figure 8

SMILES one-hot encoding

This figure demonstrates the process of one-hot encoding for the Aspirin SMILES introduced earlier. If the input SMILES is shorter than 150 characters, the original sequence is padded with empty spaces on the right. Once the input sequence is adjusted, the transformation process creates a zero matrix with 150 rows (equal to the maximum number of characters in the input sequence) and 78 columns (equal to the SMILES vocabulary size). A transformation loop selects each character in the padded sequence and uses a dictionary to identify the character code. The position of the selected character and its code are used to set the corresponding element of the one-hot matrix to 1.Footnote 4
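The following sketch illustrates this transformation. The vocabulary shown is a small illustrative subset rather than the full 78-character vocabulary used in the paper, and the helper name one_hot_encode is introduced here only for illustration.

```python
# A sketch of the one-hot encoding described above; VOCAB is an illustrative
# subset, not the paper's full 78-character SMILES vocabulary.
import numpy as np

MAX_LEN = 150
VOCAB = [" ", "C", "O", "N", "F", "c", "1", "2", "(", ")", "=", "#", "@", "+", "-"]
CHAR_TO_IDX = {ch: i for i, ch in enumerate(VOCAB)}

def one_hot_encode(smiles: str) -> np.ndarray:
    """Pad the SMILES to MAX_LEN with spaces and build a MAX_LEN x len(VOCAB) one-hot matrix."""
    padded = smiles.ljust(MAX_LEN)                      # right-pad with empty spaces
    matrix = np.zeros((MAX_LEN, len(VOCAB)), dtype=np.float32)
    for position, char in enumerate(padded):
        matrix[position, CHAR_TO_IDX[char]] = 1.0       # row = position, column = character code
    return matrix

x = one_hot_encode("CC(=O)OC1=CC=CC=C1C(=O)O")          # Aspirin from Fig. 8
print(x.shape)                                           # (150, 15) with this toy vocabulary
```

With the full 78-character vocabulary the resulting matrix has the 150x78 shape expected by the encoder input layer described below.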

The input layer is followed by three 1D-convolution layers (Conv1D-1, Conv1D-2 and Conv1D-3) with 512, 256 and 128 filters and kernel sizes of 7, 5 and 3 respectively. These convolution layers perform a very similar role to 2D-convolution layers in image processing. They identify specific patterns of elements (atoms and bonds) in chemical compounds and aggregate them into bigger substructures at each consequent layer. A useful insight into how these layers work is presented in the discussion, Sect. 6.

The Flatten layer vectorizes the convolution weights so that they can be processed by the following Dense-1 layer. This layer has 1024 neurons, which is exactly the same size as the output latent vector z. The size of this layer was defined via a hyper-parameter tuning procedure. The Dense-2 and Dense-3 layers implement variational learning, computing \(\mu \) and \(\sigma \) respectively. The final Lambda layer combines \(\mu \) and \(\sigma \) into a single latent vector (consisting of 1024 real values). Effectively this latent vector represents a chemical compound fingerprint.

The decoder consists of five layers. The input layer receives the latent vector from the encoder and passes it through two dense layers: Dense-4 and Dense-5. Dense-5 scales the dimensionality up from 1024 to 11,700. This step is needed so that the Reshape layer can transform the 1D vector into a 2D matrix. This layer passes the 2D matrix straight to the output. The final output of all these layers is a 150x78 matrix, which can be reversed to the SMILES sequence.
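A minimal Keras-style sketch of this encoder-decoder topology is given below. The layer sizes quoted in the text (convolution filters of 512, 256 and 128 with kernel sizes 7, 5 and 3, a 1024-element latent vector, and a Dense-5 layer of 11,700 = 150x78 units) follow Fig. 7; activations and other details not stated in the text are assumptions.

```python
# A minimal Keras-style sketch of the encoder/decoder topology in Fig. 7.
# Quoted layer sizes follow the text; activations and other details are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, VOCAB_SIZE, LATENT_DIM = 150, 78, 1024

def sampling(args):
    """Reparameterisation trick: z = mu + sigma * epsilon, epsilon ~ N(0, 1)."""
    z_mean, z_log_var = args
    eps = tf.random.normal(shape=tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

# --- encoder (nine layers, Fig. 7) ---
enc_in = layers.Input(shape=(MAX_LEN, VOCAB_SIZE))            # one-hot SMILES matrix
h = layers.Conv1D(512, 7, activation="relu")(enc_in)          # Conv1D-1
h = layers.Conv1D(256, 5, activation="relu")(h)               # Conv1D-2
h = layers.Conv1D(128, 3, activation="relu")(h)               # Conv1D-3
h = layers.Flatten()(h)                                       # Flatten
h = layers.Dense(LATENT_DIM, activation="relu")(h)            # Dense-1
z_mean = layers.Dense(LATENT_DIM)(h)                          # Dense-2: mu
z_log_var = layers.Dense(LATENT_DIM)(h)                       # Dense-3: log(sigma^2)
z = layers.Lambda(sampling)([z_mean, z_log_var])              # Lambda: 1024-d fingerprint
encoder = Model(enc_in, [z_mean, z_log_var, z], name="encoder")

# --- decoder (five layers, Fig. 7) ---
dec_in = layers.Input(shape=(LATENT_DIM,))                    # latent vector from the encoder
d = layers.Dense(LATENT_DIM, activation="relu")(dec_in)       # Dense-4
d = layers.Dense(MAX_LEN * VOCAB_SIZE, activation="relu")(d)  # Dense-5: 11700 units
dec_out = layers.Reshape((MAX_LEN, VOCAB_SIZE))(d)            # Reshape: back to a 150x78 matrix
decoder = Model(dec_in, dec_out, name="decoder")
```

In a full implementation the decoder output would typically be normalized per character position and trained with a loss of the form sketched in Sect. 3.1, but those details are omitted here.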

A neural network topology of classification and regression models constructed based on pretrained encoder is shown in Fig. 9.

Fig. 9
figure 9

A neural network topology of classification and regression models

The encoder in the classification and regression models has exactly the same topology as already described for the variational autoencoder. However, the attached layers are designed to perform classification or regression tasks. The Dense-1 layer receives the chemical compound fingerprint and passes it to the next layer, Dense-2. There is no difference in the neural network topology whether it is applied to a classification or a regression problem; the only difference is in the activation function of the Dense-2 layer.Footnote 5 For the regression problem the activation function is relu, and for the classification problem the activation function is sigmoid.
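A sketch of how such a prediction network can be assembled from the pretrained encoder of the previous sketch is shown below. The frozen encoder and the relu/sigmoid output activations follow the text; the hidden layer width and the optimizer and loss choices are assumptions.

```python
# A sketch of the prediction network in Fig. 9: the pretrained encoder (from the
# previous sketch) is frozen and a small task-dependent head is attached.
from tensorflow.keras import layers, Model

def build_predictor(encoder, task="classification", hidden_units=512):
    """Attach a prediction head to a frozen, pretrained encoder; hidden_units is an assumption."""
    encoder.trainable = False                               # encoder layers remain frozen during training
    x_in = layers.Input(shape=encoder.input_shape[1:])      # same 150x78 one-hot SMILES input
    _, _, z = encoder(x_in)                                 # 1024-d latent vector (fingerprint)
    h = layers.Dense(hidden_units, activation="relu")(z)    # Dense-1 of the prediction sub-network
    out_activation = "sigmoid" if task == "classification" else "relu"
    y = layers.Dense(1, activation=out_activation)(h)       # Dense-2: task-dependent activation
    model = Model(x_in, y)
    model.compile(optimizer="adam",
                  loss="binary_crossentropy" if task == "classification" else "mse")
    return model
```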

Due to the high level of complexity of the methodology and architecture defined above, it is helpful to review a simple example which illustrates the prediction process.

3.3 Prediction example

An example of a neural network classifier built based on an encoder is shown in Fig. 10.

Fig. 10
figure 10

Classification example using the latent space

This high-level visualization is intended to demonstrate some key concepts described above. Let us assume we have three types of compounds: red, green and blue. The red and green compounds have a very similar rectangular shape (although the green compound has slightly rounded edges). The blue compound has a triangular shape. A constructed neural network should assign one of two classes, rectangle or triangle, to the input compound.

According to the variational autoencoder methodology, the trained encoder generates a latent vector representation for each input compound. This latent vector is a ‘compact’ representationFootnote 6 of the original input and in the chemistry domain can also be referred to as a chemical compound fingerprint. It is likely that similar inputs will have a similar latent vector representation, which also means that similar compounds should be closely located in the latent space. The latent space is an abstract concept which can be very useful for visualizing the distribution of encoded compounds. As can be seen from Fig. 10, the red and green crosses, representing the locations of ‘rectangular’ instances, lie in close proximity to each other. The blue cross represents the ‘triangular’ sample accordingly. The dashed lines represent areas of the latent space occupied by compounds with similar shapes.

An MLP neural network can provide an efficient architecture to learn the distribution of compounds in the latent space. Since the latent vector is much smaller than the original input, the MLP does not require a complex topology. One or two hidden layers can be sufficient to handle the prediction of properties of chemical compounds in the majority of cases.

The next section describes a series of experiments for evaluating the proposed solutions.

4 Data and experiment design

In order to evaluate the proposed methodology three studies are presented in this paper:

  • An evaluation of the variational autoencoder for reconstruction of SMILES (Sect. 4.1).

  • An evaluation of the regression model for predicting the LogD property of chemical compounds (Sect. 4.2).

  • An evaluation of the classification model for drug-target predictions (Sect. 4.3).

4.1 Experimental data and setup for SMILES reconstruction problem

ChEMBL is a chemical database of bio-active molecules with drug-like properties [16]. It is maintained by the European Bioinformatics Institute (EBI) of the European Molecular Biology Laboratory (EMBL), located at the Wellcome Trust Genome Campus in Hinxton, UK. ChEMBL data are widely used by pharmaceutical companies and research organizations around the world for creating screening libraries in drug discovery. ChEMBLv23 (version 23) was selected for the current study. It includes approximately 1.7M chemical compounds.

The initial study [21] has already shown accurate SMILES reconstruction using a variational autoencoder neural network. This work focuses on optimizing the training process. Training a neural network on all chemical compounds contained in ChEMBL already requires a powerful architecture; processing collections such as ZINC [28] or Enamine [50], with hundreds of millions of chemical compounds, requires a much more sophisticated approach.

This experiment tries to determine the optimal size of a training data set for accurate reconstruction of SMILES. This will help to scale down the training data set while preserving reconstruction accuracy. The design of the proposed experiments is shown in Fig. 11.

Fig. 11
figure 11

Design of experiments for autoencoder model

All data taking part in the experiment were normalized and filtered using the MolVS Open Source software [4]. SMILES exceeding 150 characters were removed. This had no significant impact on the overall quality of the experimental results, since only a very small percentage of compounds were discarded. After filtering, the data set comprised 1,688,073 samples.Footnote 7 This data set was randomly split into training and evaluation partitions of 75% and 25% of samples respectively. The evaluation data (422,019 samples) remained unchanged throughout all experiments, so that all produced models were scored against the same benchmark. The size of the training data varied from 10% to 100% of the original size (1,266,054 samples) depending on the experiment configuration. Generators were developed to feed data to a training model during the fitting process; these generators streamed data directly from files using fixed-size batches (of 1024 samples), preventing any memory overflow.
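A sketch of such a generator is given below, assuming one SMILES string per line in a plain-text file and the one_hot_encode helper sketched earlier; the file format is an assumption.

```python
# A sketch of a file-backed generator: it streams one-hot encoded SMILES in
# fixed-size batches of 1024 so that the full training set is never held in memory.
import numpy as np

BATCH_SIZE = 1024

def smiles_batch_generator(path, one_hot_encode, max_len=150):
    """Yield (batch, batch) pairs for autoencoder training, reading SMILES line by line."""
    while True:                                   # Keras-style generators loop indefinitely
        batch = []
        with open(path) as handle:
            for line in handle:
                smiles = line.strip()
                if not smiles or len(smiles) > max_len:
                    continue                      # skip blanks and over-length SMILES
                batch.append(one_hot_encode(smiles))
                if len(batch) == BATCH_SIZE:
                    x = np.stack(batch)
                    yield x, x                    # input and target are the same for an autoencoder
                    batch = []                    # any trailing partial batch is simply dropped
```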

Ten groups of experiments were carried out. Five tests were performed in each group, except for the last one.Footnote 8 Each test randomly selected a certain percentage of samples from the internal training partition. Then, these samples were divided into fitting and validation subsets in proportions of 75% and 25% respectively. The validation subset was used in each training epoch to assess model performance. This assessment was necessary to control the learning rate, checkpoints and early stopping mechanisms. An approximate number of chemical compounds selected for each test is shown in Table 1.

Table 1 ChEMBL data set split

The best setup (experiment 5), shown in Table 2, was selected to build a production autoencoder model (which was used in the further tests for building classification and regression models). An explanation for the selected data split (defined by experiment 5) is provided in Sect. 5.1 and discussed in Sect. 6.

Table 2 ChEMBL data set production split

Autoencoder model efficacy was evaluated by reconstruction accuracy and the Hamming [44] and Levenshtein [42] editing distances.
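The sketch below illustrates these three metrics on pairs of original and reconstructed SMILES strings; the Levenshtein distance is a textbook dynamic-programming implementation and not necessarily the one used to produce the reported numbers.

```python
# A sketch of the evaluation metrics: exact-match reconstruction accuracy,
# Hamming distance (differing positions in equal-length, padded strings) and
# Levenshtein distance (insertions, deletions and substitutions).
def hamming(a: str, b: str) -> int:
    """Number of differing positions; assumes both strings are padded to the same length."""
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a: str, b: str) -> int:
    """Minimum number of edits (insert, delete, substitute) turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[len(b)]

def reconstruction_scores(originals, reconstructions):
    """Aggregate the three metrics over a list of (original, reconstruction) pairs."""
    n = len(originals)
    exact = sum(o == r for o, r in zip(originals, reconstructions))
    return {
        "accuracy": exact / n,
        "mean_hamming": sum(hamming(o, r) for o, r in zip(originals, reconstructions)) / n,
        "mean_levenshtein": sum(levenshtein(o, r) for o, r in zip(originals, reconstructions)) / n,
    }
```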

4.2 Experimental data and setup for regression problem

LogD values for training and evaluation of a regression model were obtained from the ChEMBL database. To prevent outliers influencing the overall model, only values in the interval \(LogD\in [-20, +20]\) were considered. Chemical compounds with normalization issues were also excluded during the pre-processing step. The total number of chemical compounds taking part in testing was 1,669,058.

A data set containing these values was obtained from ChEMBL using the following steps (a sketch of the pre-processing steps is given after the list):

  1. 1.

    Data fields smi and val were retrieved from ChEMBL using the SQL statement, where ’smi’ is a chemical compound SMILES and ’val’ a LogD value.

  2. 2.

    MolVS Open Source software [4] was used to normalize each SMILES.

  3. 3.

    Records with \(length(\textit{smi})>150\) were removed;

  4. 4.

    Records with LogD values outside the specified interval \(\textit{val} \in [-20, +20]\) were removed;

  5. 5.

    All LogD values were normalized between 0 and 1.
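The sketch below summarizes steps 2-5 for records already retrieved in step 1, assuming the standardize_smiles helper from MolVS and (smiles, LogD) pairs as input; this is an illustration, not the exact pre-processing code used in this work.

```python
# A sketch of steps 2-5 above; the ChEMBL SQL retrieval in step 1 is omitted.
from molvs import standardize_smiles   # MolVS Open Source software [4]

def preprocess_logd_records(records, max_len=150, lo=-20.0, hi=20.0):
    """Normalize SMILES, drop over-length and out-of-range records, scale LogD to [0, 1]."""
    cleaned = []
    for smi, val in records:
        try:
            smi = standardize_smiles(smi)               # step 2: normalisation
        except Exception:
            continue                                    # step 2: drop compounds with normalisation issues
        if len(smi) > max_len:                          # step 3: length filter
            continue
        if not (lo <= val <= hi):                       # step 4: LogD interval filter
            continue
        cleaned.append((smi, (val - lo) / (hi - lo)))   # step 5: scale LogD to [0, 1]
    return cleaned
```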

10-fold cross-validation was conducted to assess the model performance. The design is presented in Fig. 12.

Fig. 12
figure 12

Design of experiments for predicting LogD using a regression model and drug-target binding using a classification model

The data selected for cross-validation were split into 10 folds, where nine folds (90% of the data) were used for training and one fold (10% of the data) for evaluation. The data set allocated to training was also randomly split into fitting/validation partitions in a 75%/25% proportion. The validation partition was used at each training epoch for assessing model quality. The R2 score [10] was recorded for each fold and later generalized into the final result.

An additional 10-fold cross-validation test was introduced to evaluate the regression model based on experimental LogD data. Data were downloaded from the latest ChEMBL version (on 30 Sept 2018) and pre-processed by industry experts in drug discovery.

  • The data was curated to remove results obtained with solvents other than octanol/aqueous buffer.Footnote 9

  • Results derived from HPLC retention times were also removed.Footnote 10

  • Results obtained from experiments conducted at a pH other than 7.4 were also removed (both high and low pH).Footnote 11

  • Duplicates were removed using InChIKeys.Footnote 12

After cleaning and pre-processing, this data set consisted of 12,413 samples whose LogD property is defined in the interval \([-12.0, +12.0]\).

The second data set, ‘Lipophilicity’, was obtained from the ML resource described in [55]. It consists of 4200 chemical compounds whose LogD property is defined in the interval \([-1.5, +4.5]\). This data set was used to gauge the performance of the developed system against other ML models.

4.3 Experimental data and setup for classification problem

ChEMBLFootnote 13 contains approximately 13.5M bio-activity measurements, where approximately 1.1M assays are assigned to approximately 11K targets. The majority of available bio-activity data are highly unbalanced. More than 50% of assays have just a single measurement, while others contain tens of thousands. On the other hand, many targets belong to only a single assay, while others belong to hundreds. A large proportion of these data contain duplicate records. Such heterogeneity of the data prevents clear identification of which measurements can be considered active or inactive. A special protocol proposed in [40] helps to generate benchmark data sets for binary classification. It consists of the following steps (a sketch of the labelling rule is given after the list):

  1. 1.

    Data fields smi, typ, val, com, unt and rel were retrieved from ChEMBL using an SQL statement, where ’smi’ represents a SMILES, ’typ’ a type of measurement, ’val’ a measurement value, ’com’ a measurement comment, ’unt’ a measurement unit and ’rel’ a measurement relation. These field abbreviations are used throughout this study for referencing. All identified assays belong to the ’B’-type; these data are measures of compound binding to a molecular target, e.g. Ki, IC50, Kd.

  2. 2.

    MolVS Open Source software [4] was used to normalize each of the SMILES.

  3. 3.

    Records with \(length(\textit{smi})>150\) were removed;

  4. 4.

    A measurement was considered active if \(\textit{com} \in \mathbb {A}\) and inactive if \(\textit{com} \in \mathbb {I}\). The sets (\(\mathbb {A}\) and \(\mathbb {I}\)) of comment-field strings are defined below. If a record is identified as inactive, all further steps are skipped. \(\mathbb {A}\) = (’active’, ’note: corresponding ic50 reported as active’); \(\mathbb {I}\) = (’inconclusive’, ’not active’, ’inactive’, ’not active (inhibition < 50% @ 10 um and thus dose-response curve)’)

  5. 5.

    All records passing the previous step where \(\textit{val} =\emptyset \mid \textit{unt} \ne '\text {nM}' \mid \textit{rel} \notin \{'>', '\ge ', '<', '\le ', '=', '\sim '\}\) were removed.

  6. 6.

    Labels were assigned to each record according to the threshold defined in Eq. 12:

    $$\begin{aligned} \text {label} = \left\{ \begin{array}{lr} 1 : val\ge 5.5\\ 0 : val<5.5 \end{array} \right. \end{aligned}$$
    (12)
  7. 7.

    All records with duplicate or contradictory measurements remaining after the previous steps were discarded.
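The sketch below illustrates, in simplified form, the comment-based override of step 4 and the threshold rule of Eq. 12; the comment string sets are abbreviated here, and the unit, relation and duplicate filtering of steps 5 and 7 is omitted.

```python
# A simplified sketch of the activity labelling rule (step 4 and Eq. 12).
ACTIVE_COMMENTS = {"active", "note: corresponding ic50 reported as active"}
INACTIVE_COMMENTS = {"inconclusive", "not active", "inactive"}   # abbreviated set

def label_record(com, val, threshold=5.5):
    """Return 1 (active), 0 (inactive) or None (record discarded)."""
    if com is not None:
        if com.lower() in INACTIVE_COMMENTS:
            return 0                      # step 4: inactive by comment, later steps skipped
        if com.lower() in ACTIVE_COMMENTS:
            return 1                      # step 4: active by comment (simplification)
    if val is None:
        return None                       # step 5: records without a value are removed
    return 1 if val >= threshold else 0   # step 6: threshold from Eq. 12
```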

56 assays were identified for this study. Each assay reflects in vitro measurements obtained during HTS. A breakdown of active and inactive chemical compounds, together with the associated target, is presented in Tables 3 and 4.

Table 3 Classification ChEMBL data set statistics (Part I)
Table 4 Classification ChEMBL data set statistics (Part II)

Three-fold cross-validation experiments were carried out for each data set. The design of these experiments is shown in Fig. 12. This number of folds was chosen due to the relatively small assay sizes. As can be seen from Tables 3 and 4, a large proportion of assays have approximately 800 compounds. Three folds provide a fair representation of active and inactive chemical compounds across the training, validation and evaluation partitions.

The Maximum Unbiased Validation (MUV) data set [48] is another benchmark, selected from PubChem BioAssay by applying a refined nearest neighbour analysis. The MUV data set contains 17 challenging tasks over around 90 thousand compounds and is specifically designed for validation of virtual screening techniques. The detailed breakdown of active and inactive compounds is shown in Table 5.

Table 5 Classification MUV data set statistics

For comparing prediction models, the Receiver Operating Characteristic - Area Under the Curve (ROC-AUC) [12] was used. This metric is widely accepted by the ML community for assessing the performance of classification models.

5 Results

Following the experimental setup defined in Sect. 4, results are presented in three sub-sections. Section 5.1 shows results for the variational autoencoder, Sect. 5.2 for the prediction of LogD (regression problem) and Sect. 5.3 for the prediction of compound-target binding (binary classification problem). The main rationale behind these experiments is to validate the versatility of the latent vector based fingerprint, in other words, to show that it works well regardless of the selected problem (classification or regression).

5.1 SMILES reconstruction

Forty-six tests (9x5+1; for more details see Sect. 4.1) were conducted to assess the accuracy of SMILES reconstruction on different portions of the training data. The results obtained are detailed in Table 6.

Table 6 SMILES reconstruction on different portion of training data

Changes in accuracy and editing distance for different sizes of training data set are presented in Figs. 13 and 14. As can be seen from Fig. 13, the reconstruction accuracy increases from 0.247 \(\pm \) 0.027 for 10% of randomly selected samples to 0.877 \(\pm \) 0.009 for 50% of samples. From 60% onwards accuracy stays around 0.8 on average, with slight fluctuations. This is an expected result; however, the slight variations in accuracy from 60% of samples onwards need to be addressed. It is difficult to identify the exact reason for these changes in accuracy. One possible explanation is early signs of overfitting: there was no clear evidence of this phenomenon during training, but it is possible that, as the number of samples increases, the model starts to memorize the input. This could be addressed by a better sampling algorithm; for example, Butina clustering [13] could be a useful technique for designing one.

In addition to accuracy, model performance was measured using the Hamming and Levenshtein distances. Both belong to the family of editing distances and give a different perspective on the results obtained. They show a similar trend to accuracy. The Hamming distance decreased from 4.374 \(\pm \) 0.817 to 0.663 \(\pm \) 0.071 for the 10% and 50% sample cases respectively. The Levenshtein distance decreased from 4.299 \(\pm \) 0.127 to 0.648 \(\pm \) 0.031 for the same percentages of samples. Both show slight fluctuations in editing distance from 60% onwards.

Fig. 13
figure 13

SMILES reconstruction accuracy for different percentages of training samples

Fig. 14
figure 14

SMILES reconstruction editing distances for different percentages of training samples

According to the experiments carried out, the best result (accuracy 0.877 \(\pm \) 0.009) was obtained for 50% of randomly selected samples. This configuration was selected for building the production variational autoencoder, which was later used for training classification and regression models with SMILES input. Training was conducted over 40 epochs and produced a model with an accuracy of 0.872. The editing distances for the production model were 0.682 and 0.677 for Hamming and Levenshtein respectively.

5.2 LogD prediction

Two cross-validation experiments described in Sect. 4.2 were carried out, on the ChEMBL and Lipophilicity data sets. The results of these experiments are presented in Table 7.

Table 7 Cross-validation results for prediction of the LogD property on the ChEMBL and Lipophilicity data sets

As can be seen from the obtained results, the best performance is achieved on the ChEMBL data set, with an average coefficient of determination of 0.907 \(\pm \) 0.008. A scatter plot comparing true and predicted values is shown in Fig. 15. The training process was carried out with LogD values normalized from the interval \([-20,+20]\)Footnote 14 and all training cycles took 20 epochs.

The high prediction accuracy is very much an expected result, since the majority of LogD data points in ChEMBL are computed using the ACD/Labs software [1]. In this scenario the regression model is simply learning to predict the outcome of another computational algorithm (such as ACD/Labs), which makes the learning task much more straightforward. This assumption is vindicated by results obtained in other studies; for example, experiments with the ChEMBL data set described in [1] also show high accuracy using an SVM [19] model.

Fig. 15
figure 15

Scatter plot of predicted LogD values on ChEMBL data

Much more interesting results were obtained for the Lipophilicity data set. The average R2 score is 0.542 \(\pm \) 0.021, which is noticeably lower in comparison to the ChEMBL data. The training process was carried out with LogD normalized from the interval \([-1.5,+4.5]\) and all training cycles took 30 epochs. The longer training cycle reflects the complexity of building a predictive model on real experimental data. A scatter plot comparing true and predicted values for the Lipophilicity data set is shown in Fig. 16.

Fig. 16
figure 16

Scatter plot of predicted LogD values on Lipophilicity data

Eight ML models were investigated on these data. A comparison chart for all these models is presented in Fig. 17. The highest score, 0.697, is obtained by MPNN, which uses a generalized model [23]. It is very well suited to processing graph-structured data, which makes it efficient at predicting properties of chemical compounds (since chemical compounds can easily be represented as undirected graphs). MPNN is closely followed by the GC model [31], with an R2 score of 0.662. GC utilizes the principles of circular fingerprints described in [35], representing molecular structures by atom neighbourhoods. Similar to GC, Weave [31] (0.636) is another graph-based model which processes chemical compounds as undirected graphs using a convolution approach. In contrast to the previous three models, XGB [14] provides a different approach to making predictions: it is an ensemble method which combines the predictions of individual decision trees. The XGB coefficient of determination for Lipophilicity is 0.577 and is closely followed by the model developed within this study (0.542 \(\pm \) 0.021). Directed acyclic graph (DAG), kernel ridge regression (KRR) and random forests (RF) are the lowest performing in this evaluation, with R2 scores of 0.507, 0.496 and 0.483 respectively.

Fig. 17
figure 17

A comparison chart of ML models for predicting Lipophilicity

The key challenge in developing ML models for predicting the properties of chemical compounds is to encode molecules into a fixed-length string or vector representation [54]. Although SMILES provide unique representations of molecules, the majority of ML models also rely on additional information, such as electronic or topological profiles of chemical compounds. To derive these features, the models we are gauging against apply different featurizations: Extended Connectivity Fingerprints (ECFP), Coulomb matrices, grid features, etc. These approaches are computationally expensive, and there will always be a trade-off between speed, accuracy and expense. Because of this, it is not surprising that some models provide better performance compared to our approach. As has already been mentioned, our approach is purely data-driven and inspired by - et al. [25]. It should provide an alternative, replacing crafted featurization methods with the learning ability of a DNN. In future development, we plan to improve the encoding method and directly compare it to existing featurization approaches.

Open-source research in AI always provides a solid benchmark for assessing in-house models. However, it is also interesting to compare the developed model against a commercial application. The following results compare predictions made by the ChemAxon software [5] against the developed regression model. This work was carried out in collaboration with Cambridge MedChem Consulting [2]. Data were selected and pre-processed by industry experts in drug discovery. All steps undertaken in preparing this experiment are described in Sect. 4.2.

10-fold cross-validation was performed on the selected data set to compare the models.Footnote 15 The obtained results are shown in Table 8. Results for the developed regression model are shown in the ‘ARM’ column and for the commercial software in the ‘ChemAxon’ column.

Table 8 Cross-validation for predicting LogD using developed regression model and ChemAxon software

The average R2 score for our developed regression model is 0.695 \(\pm \) 0.013, which is nearly twice that obtained by ChemAxon, with an R2 of 0.338 \(\pm \) 0.034. It is hard to comment on the underlying ChemAxon algorithm without available source code; however, from the description presented on the company website [5], it is possible to assume that it is a deterministic algorithm which maps a chemical compound structure to a specific property value. The two scatter plots in Figs. 18 and 19 compare true and predicted values for our developed regression model and ChemAxon respectively.Footnote 16

Fig. 18
figure 18

Scatter plot of predicted LogD values on in-vivo data using our model

Fig. 19
figure 19

Scatter plot of predicted LogD values on in-vivo data using ChemAxon model

The experiments clearly demonstrate the validity of our proposed model for predicting the LogD property of chemical compounds. However, a large proportion of tasks require simple classification, for example whether a compound binds to a specified target. An evaluation of the classification model constructed based on the variational autoencoder is presented in the next section, Sect. 5.3.

5.3 Binding prediction

The cross-validation experiments described in Sect. 4.3 were carried out on the 56 ChEMBL data sets. The results of these experiments are presented in Table 9.

Table 9 ChEMBL binding-assays with testing results

Considering the large volume of obtained results, they were split into five groups based on the ROC-AUC metric (see Fig. 20). The first group combines the 10.7% of assays with the least accurate predictions (ROC-AUC in the (0.0, 0.6) interval). It is very closely followed by the next group of 12.5% of assays, which showed results in the [0.6, 0.7) interval. A slightly bigger group, 17.9% of assays, demonstrated ROC-AUC in the interval [0.7, 0.8); in many cases such accuracy of in-silico prediction on HTS data can already be considered a very good result. The largest group, combining 50% of assays, showed ROC-AUC scores in the interval [0.8, 0.9). Such accuracy can have a significant impact on the planning and execution of HTS experiments, with the majority of assays filtered by the in-silico approach. The remaining group combines the 8.9% of assays with the highest scores, located in the interval [0.9, 1.0).

Fig. 20
figure 20

Binding prediction results split into five groups based on the ROC-AUC metric

Despite the competent results obtained on the vast majority of tested assays, the authors carried out an additional investigation to rank the developed classifier against other ML algorithms. Similar to the regression problem, the main objective here is not a direct comparison of different ML algorithms, since that would require a different experimental setup. This study projects the prediction accuracy observed for the developed model onto results already described by - et al. [55]. The ROC-AUC characteristics for 17 binding assays are presented in Table 10.

Table 10 MUV binding-assays with testing results

The average ROC-AUC of a 10-fold cross-validation experiment was recorded for each assay. The summary field represents the average ROC-AUC across all 17 experiments. It was used for ranking the developed classifier against six ML algorithms. A visual representation of this ranking is shown in Fig. 21.

Fig. 21
figure 21

Ranking the developed classifier against six ML models using binding prediction

Similar to the regression problem, GC, a graph-based model, scored 0.775, the best result for binding classification.Footnote 17 It is closely followed by BYPASS, representing a multitask neural network, and LOGREG, representing a logistic regression model, scoring 0.764 and 0.749 respectively. The XGB model showed a 0.720 ROC-AUC score, closely followed by ARM with 0.696. Influence relevance voting (IRV) and random forests (RF) are the poorest performing, with ROC-AUC scores of 0.693 and 0.693 respectively. Despite variations in accuracy, all techniques showed very consistent performance. Considering the hugely unbalanced data sets, the measured metric can be significantly shifted by producing one extra true positive prediction.

With such a variety of models capable of producing equally accurate results for regression and classification problems, why another approach? This question is addressed below.

6 Discussion and conclusion

During the last few years, deep neural networks have de facto become an industry standard for creating sophisticated AI models. A significant amount of effort has also been devoted to improving classical ML algorithms; often XGB and RF produce even better results than DNNs. Such a variety of approaches makes choosing the right technique very difficult. This has also had a huge impact on scientific publications: researchers are forced to show rigorous testing, with comparison of every proposed technique against the recognized leaders. The developed technique is expected to outperform the others, which in practice leads to compromised experiment designs and a cherry-picking phenomenon.

As can be seen from our comparison of published models, the majority show very similar performance. However, it is also important to score each model in terms of practical application. For example, a model can deliver impressive accuracy, but when it needs to be deployed in a production environment, poor scalability and efficiency diminish all the advantages gained in perfecting the quality of predictions.

This work has been inspired by research efforts that use a variational autoencoder to generate chemical compound fingerprints and then use these fingerprints for predicting specific properties of chemical compounds. Considering the high complexity of chemical compounds, training a good-quality variational autoencoder requires a large data set of SMILES. A typical HTS assay consists of several hundred, maybe a few thousand, chemical compounds. Such a volume of data samples does not deliver a sufficient variety of chemical compound structures, which makes it impractical to train a variational autoencoder on assay data alone.

A decision was made to use ChEMBL data to obtain a large representation of chemical compound structures. This worked very well, with the developed variational autoencoder model reconstructing nearly 90% of SMILES using a 1024-element latent vector. Taking into account the Hamming and Levenshtein editing distances, the final model on average has one misplaced or incorrect atom or bond. Obviously such an error is unforgivable in chemistry, but the primary objective of the variational autoencoder is to produce a latent space where similar structures are clustered together. Let us demonstrate this with a simple example.

Assume that the selected target chemical compound is NC1=NNC(=C1)C1=CC(F)=CC(F)=C1, whose structure is shown in Fig. 22.

Fig. 22
figure 22

2D Structure of chemical compound NC1=NNC(=C1)C1=CC(F)=CC(F)=C1

The search space for similar chemical compounds was reduced to 10,000 samples (out of 674,040 initially allocated for evaluation; for more detail see Sect. 4.1) to make the computation more efficient. K-nearest neighbours [36] identified the five closest chemical compounds, defined in Table 11 and illustrated in Fig. 23.

Table 11 Definitions of five closest chemical compounds retrieved for the specified NC1=NNC(=C1)C1=CC(F)=CC(F)=C1 target
Fig. 23
figure 23

Structures of five closest chemical compounds retrieved for the specified NC1=NNC(=C1)C1=CC(F)=CC(F)=C1 target (the structure number corresponds to the compound number in Table 11)

The last column in Table 11 represents the Tanimoto similarity score [9]. It is widely used in the chemistry domain to assess the similarity between two chemical compounds. Two compounds can be considered similar if the Tanimoto score is greater than 0.85 (for Daylight fingerprints). As can be seen from the results in Table 11, three of the compounds retrieved using the latent space are also similar according to the Tanimoto metric, while the other two follow closely.

This example demonstrates that latent vector based fingerprints can be used to define similarity between two chemical compounds. It also clearly shows that similar chemical compounds are closely located in the latent space.
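The sketch below outlines how such a check can be performed: nearest neighbours are retrieved in the latent (fingerprint) space, and Tanimoto similarity is computed independently from RDKit Morgan fingerprints as a reference (the 0.85 rule quoted above refers to Daylight fingerprints, so Morgan values are only indicative). The helper encode_to_latent, which maps a SMILES string to its 1024-element latent vector, is an assumed stand-in for the trained encoder.

```python
# A sketch of the similarity check described above: k-nearest neighbours in the
# latent space plus Tanimoto similarity over Morgan fingerprints as a reference.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def nearest_in_latent_space(target_smiles, library_smiles, encode_to_latent, k=5):
    """Return the k library compounds whose latent fingerprints are closest to the target."""
    latent = np.stack([encode_to_latent(s) for s in library_smiles])
    nn = NearestNeighbors(n_neighbors=k).fit(latent)
    _, idx = nn.kneighbors(encode_to_latent(target_smiles).reshape(1, -1))
    return [library_smiles[i] for i in idx[0]]

def tanimoto(smiles_a, smiles_b):
    """Tanimoto similarity between Morgan fingerprints (radius 2, 2048 bits) of two SMILES."""
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
           for s in (smiles_a, smiles_b)]
    return DataStructs.TanimotoSimilarity(fps[0], fps[1])
```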

The close grouping of similar compounds in the latent space gives the developed model the potential to be applied to generating new chemical compounds and forecasting their desired properties. Such ML models have become a hot topic in the pharmaceutical domain, as witnessed by the increasing number of high-quality research publications in this space [25, 46]. In this paper, the main focus was on transfer learning, using the trained encoder as a base for classification or regression networks which can predict properties of chemical compounds. However, the decoder can potentially be used for generating novel chemical compounds: by introducing a small modification into the latent representation of a target chemical compound, it is possible to generate a novel structure. Despite this simple idea, the implementation of such a model is very complex and outside the scope of this publication.

The trained variational autoencoder forms a solid base for creating different classification and regression models. An interesting pattern was observed during the training process. The majority of trained models converge to a stable state (where no further improvement is observed) within 2-3 epochs. However, when the same neural network topology was trained without the pretrained variational autoencoder, the training process continued for up to 100 epochs. Longer training is not a problem for a relatively small data set, where a full learning cycle can be completed in a matter of hours; however, for a collection the size of ChEMBL, training may go on for days. Also, the experiments did not reveal any degradation in accuracy from shortening the training cycle.

The question is: why are classification and regression models built based on the variational autoencoder so efficient in the training process? To answer this question, let us come back to the methodology described in Sect. 3. The constructed classification and regression models consist of two parts. The first part is an encoder isolated from a trained variational autoencoder. The second part is an MLP, which performs the actual predictions. Effectively, the MLP is trained to make predictions based on chemical compound fingerprints. Since the ‘hard work’ has already been done by the variational autoencoder, only a few cycles are required to learn the differentiation rules (to solve the classification or regression problem).

A further observation, which we call ‘latent space drift’, was also noted. A number of publications using a similar approach do not clearly reveal their mechanism for using an encoder: it can be used with frozen or unfrozen layers. Using an encoder with frozen layers makes a lot of sense; if it is already trained to encode SMILES, then the attached layers only have to be trained to utilize the obtained chemical compound fingerprints. However, a preliminary study showed that if an encoder is left unfrozen, the training result is generally better. An initial investigation found that the fingerprint points in the latent space become adjusted according to the target (predicted) property. This process is explained in Fig. 24.

Fig. 24
figure 24

An example of latent space drift. The diagrams show the distributions of training points of the two binary classes before and after the drift

All this internal analysis of neural network behaviour became possible thanks to the specialized AUROMIND software, which provides a set of tools for ‘debugging’ a training process. The authors plan to describe some of the core principles behind the developed tools in upcoming publications.