Affinity prediction using deep learning based on SMILES input for D3R grand challenge 4

Lim, Sangrak; Lee, Yong Oh; Yoon, Juyong; Kim, Young Jun

doi:10.1007/s10822-022-00448-3


Published: 22 March 2022

Volume 36, pages 225–235, (2022)

Journal of Computer-Aided Molecular Design


Sangrak Lim ORCID: orcid.org/0000-0001-5112-7907¹,
Yong Oh Lee^1,2,
Juyong Yoon¹ &
…
Young Jun Kim¹


Abstract

Modern molecular docking comprises the prediction of pose and affinity. Prediction of docking poses is required for affinity prediction when three-dimensional coordinates of the ligand have not been provided. However, existing methods require a large amount of feature engineering. In addition, because predicted ligand positions deviate probabilistically from the true pose, a robust model is needed for the sequential combination of pose and affinity prediction. We propose a pipeline using a bipartite graph neural network and transfer learning trained on a re-docking dataset. We evaluated our model on the released data from the Drug Design Data Resource Grand Challenge 4 (D3R GC4). The datasets for the two target proteins provided by the challenge have different patterns. The model outperformed the best participant by 9% on the BACE target protein from stage 2. Further, our model showed competitive performance on the CatS target protein.

Introduction

Molecular docking is an essential component of the modern drug development toolkit to identify promising small molecules that bind to a target protein. Molecular docking based virtual screening protocols predict and rank the binding affinities of a large pool of small molecules represented in a simplified molecular-input line-entry system (SMILES) format [1,2,3].

Binding pose prediction is the first step of molecular docking, followed by the affinity prediction step. In detail, binding pose prediction is a search process that extracts the optimal structure from the ligand’s conformational space within the binding pocket. The subsequent affinity prediction, also called a scoring function, takes the pose of a small molecule and predicts its binding affinity to a macromolecular target. However, combining the pose and the subsequent affinity prediction is more complicated than tackling each component independently. For instance, flexible and dynamic protein residues often lead to errors in the pose prediction, which affect the results of the affinity step. To improve molecular docking performance, it is necessary to address both components together.

A divide-and-conquer method is often used to deal with complex problems such as molecular docking. The method combines libraries that each perform well on a small task into a large pipeline. However, there are two obstacles associated with the divide-and-conquer method: statistical fluctuation and feature engineering. First, the success of affinity prediction is highly dependent upon the accuracy of pose prediction. The input of the affinity prediction is the set of ligand conformations estimated in the pose prediction, whose atoms deviate probabilistically from the atoms of the crystal pose. A common evaluation metric for the pose prediction step is the root mean square deviation (RMSD), and RMSD values less than 2.0 Å have been considered an acceptable level of accuracy during the last two decades [4, 5]. With advancements in pose prediction techniques, the differences between experimentally measured and predicted poses have decreased. However, it remains challenging to predict the co-crystallized pose precisely. Second, the established approaches require more than five components, with complex feature engineering. In grand challenge 3, even when the same docking method was implemented, performance depended closely on the choice of hyperparameters [6]. For instance, existing approaches for partial charge calculations yield highly variable outputs, and the user must select among them. It can also be challenging to reproduce a result when many attributes must be decided. Therefore, it is important to organize a concise structure that reduces complex feature engineering and high variability.
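The RMSD criterion mentioned above can be illustrated with a minimal sketch. It assumes that the predicted and crystal poses list the same atoms in the same order, and it performs no alignment or symmetry correction; the function name and the toy coordinates are hypothetical.

```python
import math

def rmsd(pred, ref):
    """Root mean square deviation between two equally ordered lists of (x, y, z) atoms."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("poses must be non-empty and contain the same number of atoms")
    # Sum of squared coordinate differences over all atoms and all three axes.
    squared = sum((a - b) ** 2 for p, r in zip(pred, ref) for a, b in zip(p, r))
    return math.sqrt(squared / len(pred))

# Toy example: every atom of the docked pose is shifted by 1.0 Å along x.
crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
docked = [(1.0, 0.0, 0.0), (2.5, 0.0, 0.0), (2.5, 1.5, 0.0)]
value = rmsd(docked, crystal)
acceptable = value < 2.0  # the 2.0 Å threshold cited in the text
```

In practice, dedicated tools also handle atom matching and molecular symmetry, which this sketch deliberately omits.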

Recently, deep learning has been applied to pose prediction [7, 8] and affinity prediction [7, 9,10,11]. We focused on the aspect of deep learning, which can reduce feature engineering in complex problems and is resistant to data noise. One such example is an end-to-end model that connects input data and target values through deep learning [12]. The advantage of an end-to-end pipeline is derived from the minimization of feature engineering or handcrafted heuristics. Such an abstraction for an intricate problem might help create a concise model; however, hardware capacity is a barrier for directly applying an end-to-end model to molecular docking. A binding pocket site usually comprises 100 or more three-dimensional coordinates. While it is a common approach to process multiple training instances simultaneously in one batch for robust performance, training multiple instances is considerably expensive for the pose prediction step. In experiments, we have shown that using deep learning can be more effective than conventional machine learning methods, although our model is not end-to-end.

One of the critical challenges for deep learning based pose and affinity prediction is data scarcity. The protein data bank (PDB) file format is used for three-dimensional structural data, and the PDB is the main data source of structural bioinformatics. Although thousands of PDB entries are available for ligand–protein pairs, they are still insufficient compared to the number of potential bioactive chemicals. Re-docking involves separating a ligand from the macromolecule that it was originally bound to, followed by the reproduction of the binding state, to ensure the docked pose geometrically matches the original position of the co-crystallized ligand. Cross-docking refers to a docking simulation in which ligands are exchanged between different PDB files of the same protein receptor. Affinity prediction requires cross-docking when the bound structure between the ligand and the protein receptor is not provided. We created multiple poses through re-docking to augment the training data for affinity prediction. We provide a detailed description of the data augmentation in the Supplementary Material, Data augmentation section.

When a pre-training dataset with various protein receptors lacks data points for a certain receptor, transfer learning is a feasible method to improve performance with additional data. A transfer learning approach involves re-training a pre-trained model for a similar but different task [13]. The prediction of ligand binding results such as affinity or toxicity requires a large amount of learning data for the target receptor. Accordingly, in the case of a specific protein with few co-crystallized pose data, the prediction performance of a model tends to degrade. We provide a more detailed description of the transfer learning in the Supplementary Material, transfer learning section.

We propose a pipeline for affinity prediction using a bipartite graph neural network (BGNN) and transfer learning trained on a re-docking dataset. Since current pose prediction is subject to probabilistic deviation, a robust and scalable deep learning model can exploit the larger amount of data generated by re-docking between the pose prediction and the affinity prediction. We evaluated our pipeline on D3R GC4, which offered challenges for binding pose prediction and affinity ranking between several small molecules and two target proteins: beta secretase 1 (BACE) and Cathepsin S (CatS) [14]. The challenge organizers provide a shared dataset, without co-crystal structures of the ligand–protein pairs, so that multiple prediction methods can be compared on an equal footing.

Successful molecular docking approaches could be utilized in the early drug discovery process, for example to automate novel drug design or drug re-positioning [15]. However, there are several difficulties in practical applications. We have described the probabilistic deviation of atoms and the feature engineering issues involved in pose prediction. Another challenge in molecular docking is the limited number of public data instances that provide both three-dimensional structural data and the corresponding binding affinity. Considering the above challenges, our contributions are summarized as follows:

Using data augmentation, we mitigate both the probabilistic deviation of atoms arising from pose prediction and data scarcity.
We use deep learning to build a scalable model with minimal feature engineering.
Our method demonstrated state-of-the-art performance in the target protein BACE dataset from stage 2 of D3R GC4. Our method also showed competitive performance in the CatS dataset. We demonstrated the robustness of our model by evaluating two challenge datasets with different patterns.

Methodology

Pose prediction

Molecular docking can be roughly divided into two steps: ligand binding pose prediction and affinity prediction. To cover the entire process of molecular docking, it is important to consider the association between the two steps, as the second step uses the results of the first. In general, the pose prediction process is itself separated into two sub-processes. A generation process produces several three-dimensional conformations of small molecules bound to a rigid macromolecular target, and a ranking process selects the cognate structure of a co-crystallized molecule among the generated conformations. However, some existing studies [16, 17] demonstrated that extending the training set with generated conformations helped optimize a deep learning model. We therefore use all the conformations created in the first sub-process.

The PDBbind dataset was used as the raw data source for pre-training. PDBbind separates the protein and the ligand for each PDB entry, and we also stored the ligands in a SMILES format. The test data, in contrast, contain the ligands in a SMILES format only. In Fig. 1, the data section is contained within a gray dotted box. We augmented the raw training data by re-docking the PDBbind data and extended the test data by cross-docking.

Our pose prediction method mostly followed the same procedure as Ragoza et al. [7]. Further, people from the same lab participated in the D3R GC4 using the same method. The difference between our pose prediction and that of Ragoza et al. is that we omitted the parts of their method that are not described here, such as the ranking of generated poses. We selected the existing method that uses the fewest components for pose prediction so that we could utilize all the data produced by re-docking. In short, we generate conformations from SMILES and superimpose the conformations on a previously found reference ligand structure.

Figure 1a, b: The SMILES input is passed through the RDKit library, which creates conformations with 3D coordinates. RDKit is an open-source cheminformatics toolkit that uses the ETKDG algorithm to generate conformations [18]. As it is difficult to accurately predict where the ligand coordinates will be located in the actual binding process, we generated up to 30 conformations. Depending on the ligand, RDKit may fail to convert a SMILES string into a conformation, for example when the valence of a particular atom is abnormal. Therefore, the amount of generated data differs from ligand to ligand.

Figure 1c: Our pose prediction requires an existing co-crystal structure with the same protein receptor to find a reference ligand. We searched for a reference ligand whose bond substructure was similar to that of the ligand given in SMILES form. The reference ligand should be selected from the existing PDBs and must share the same target protein. In the case of the re-docking process, the reference ligand is known. SMILES arbitrary target specification (SMARTS) specifies substructural patterns in molecules. We use SMARTS to designate the substructure shared by the generated conformation and the reference ligand, which we find with the FindMCS function of the RDKit library. The shared SMARTS pattern is passed to the next part, where the co-crystallized pose of the reference ligand determines the position of the generated conformations.

Figure 1d: The next part is the superimposition process. The superimposition, or structural alignment, aligns each generated conformer with the reference ligand based on their shared SMARTS pattern. We used the obfit function from the Open Babel library, passing it the reference ligand and the newly generated conformation. As a result, the conformation is located around the reference small molecule in the existing PDB.

Figure 1e: The final process of pose prediction saves the data for use in the next affinity prediction step. The resulting ligand–protein pair has a maximum of 30 pose combinations. We grouped the data so that generated data from the same ligand–protein pair were not split when processed in a mini-batch.
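The grouping constraint described above can be sketched as follows. This is an illustrative batching routine, not the authors' code: the `(pair_id, pose)` tuple layout, the function name, and the greedy packing strategy are all assumptions.

```python
from collections import defaultdict

def grouped_batches(samples, max_batch_size):
    """Yield mini-batches in which all poses of one ligand-protein pair stay together.

    `samples` is a list of (pair_id, pose) tuples (hypothetical layout).
    A group larger than `max_batch_size` is emitted as its own oversized batch.
    """
    groups = defaultdict(list)
    for pair_id, pose in samples:
        groups[pair_id].append((pair_id, pose))

    batch = []
    for members in groups.values():
        # Close the current batch if adding this whole group would overflow it.
        if batch and len(batch) + len(members) > max_batch_size:
            yield batch
            batch = []
        batch.extend(members)
    if batch:
        yield batch

samples = [("a", 0), ("b", 0), ("a", 1), ("c", 0), ("b", 1), ("c", 1)]
batches = list(grouped_batches(samples, max_batch_size=4))
```

Because each ligand–protein pair has at most 30 poses, choosing `max_batch_size` of at least 30 avoids oversized batches in this sketch.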

The Supplementary Material provides a step-by-step explanation of the pose prediction phase with additional visual representations.

Affinity prediction

For prediction at the molecular level, it is important to transform the input into more suitable forms called molecular fingerprints. The extended connectivity interaction features (ECIF) assign each atom to an atom type based on the atom environment concept, which was originally presented in the extended connectivity fingerprints [19, 20]. We implemented ECIF pre-processing for each atom.

Among structural models, the three-dimensional coordinates of atoms are often converted into a meta-structure owing to the cost of the complex model and the usefulness of abstraction [21, 22]. Our model is inspired by the simple structure of ECIF, which exhibited state-of-the-art performance in the existing affinity prediction task. In addition to the ECIF descriptors, the authors of ECIF defined 1540 possible interactions between a pair of ligand and protein atoms, and the number of occurrences of each interaction was counted as an input feature for a gradient boosting machine (GBM). ECIF thus uses only the simple input feature of protein-ligand atom pair counts for prediction. However, we consider that additional features such as atom-level embeddings or an adjacency matrix help the scoring function while keeping the model concise. We provide a detailed description of the GBM in the Supplementary Material, GBM model section.
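The pair-count feature described above can be sketched in a few lines. This is an illustration only: the atom-type labels are made up, and the 6.0 Å distance cutoff is an assumption rather than a value stated in this paper.

```python
import math
from collections import Counter

def pair_counts(ligand, protein, cutoff=6.0):
    """Count (protein_type, ligand_type) pairs whose atoms lie within `cutoff` Å.

    Each atom is (atom_type, (x, y, z)). The type labels and the default
    cutoff are illustrative assumptions, not values taken from this paper.
    """
    counts = Counter()
    for l_type, l_xyz in ligand:
        for p_type, p_xyz in protein:
            if math.dist(l_xyz, p_xyz) <= cutoff:
                counts[(p_type, l_type)] += 1
    return counts

# Hypothetical atom types and coordinates.
ligand = [("C.lig", (0.0, 0.0, 0.0)), ("O.lig", (1.2, 0.0, 0.0))]
protein = [("N.prot", (3.0, 0.0, 0.0)), ("C.prot", (10.0, 0.0, 0.0))]
features = pair_counts(ligand, protein)
```

A fixed-length feature vector would then be obtained by reading these counts in a fixed order over all defined pair types, with zero for pairs that never occur.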

Representing a character as an embedding vector is a frequently used method in natural language processing (NLP). After learning the embedding vector through deep learning, character embeddings of the semantically similar items are located near each other in vector spaces [23]. Despite the obvious differences, the NLP and the biomedical domains have statistical similarities, especially for the chemical compound and the natural language sentence [24]. Some of the existing studies used character embedding on the SMILES input to represent chemical attributes [25, 26]. In the affinity prediction model, we apply character embedding after ECIF pre-processing for each atom.

After the pose prediction step, a ligand–protein instance has an average of 10 generated poses. First, we pre-process the list of atom data in the form of an ECIF fingerprint and apply character embedding. Every atom becomes a node and is expressed by the following equation:

$$\begin{aligned} V_l&= \{l_j \in C; j=1,2,\ldots,M\} \end{aligned}$$

(1)

$$\begin{aligned} V_p&= \{p_k \in C; k=1,2,\ldots,N\} \end{aligned}$$

(2)

$$\begin{aligned} adj&= \{e_{jk} \in \mathbb {R}^{M\times N}\} \end{aligned}$$

(3)

Our model has three types of inputs: a list of ligand atoms, a list of protein atoms of the binding pocket located close to the ligand, and the adjacency matrix that contains distances between the ligand and the protein atoms. $V_l$ is the set of ligand nodes and $l_j$ is the embedded vector of the jth ligand atom. Likewise, $V_p$ is the set of protein nodes and $p_k$ is the embedded vector of the kth protein atom. $C$ denotes the pool of atom embeddings. In Eq. 3, $e_{jk}$ is the edge weight between the jth ligand atom and the kth protein atom. The edge weights are constant values between 0 and 1 that are inversely related to the distance between the two atoms. These inputs have the characteristics of a bipartite graph. The inputs are depicted in part (a) of Fig. 2.
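To make the adjacency input concrete, the sketch below builds an $M\times N$ weight matrix from ligand and pocket coordinates. The mapping $1/(1+d)$ is an assumption chosen only because it yields values between 0 and 1 that shrink with distance; the paper does not state the exact formula, and the function name is hypothetical.

```python
import math

def edge_weights(ligand_xyz, protein_xyz):
    """Return an M x N matrix of weights in (0, 1] that shrink as distance grows.

    The mapping 1 / (1 + d) is an assumption made for illustration; the text
    only states that the weights lie between 0 and 1.
    """
    return [
        [1.0 / (1.0 + math.dist(l, p)) for p in protein_xyz]
        for l in ligand_xyz
    ]

ligand_xyz = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]                      # M = 2
protein_xyz = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]    # N = 3
adj = edge_weights(ligand_xyz, protein_xyz)
```

Rows index ligand atoms and columns index protein atoms, matching the $e_{jk}$ convention of Eq. 3.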

$$\begin{aligned} In_l&= softmax(FC(l_j)FC(l_j)^T)FC(l_j) \end{aligned}$$

(4)

The transformer is well known for its performance, and its efficacy has been demonstrated on various datasets [27]. The transformer model introduced self-attention, which allows the extraction of different aspects from a sentence [28]. One of the advantages of the transformer model is that it learns long-range dependencies in the input, which makes the model capable of handling lengthy information. The self-attention module has been applied to drug-target interaction (DTI) [29, 30]. The DTI, or drug-protein interaction, task has a similar input structure to molecular docking. Since the self-attention module is applicable to long inputs, it is viable to consider applying the module to SMILES data. We implemented simplified self-attention, which predicts the outcome using only the ligand input. The process is described in part (b-1) of Fig. 2.

Our self-attention module is represented in Eq. 4. A fully connected (FC) layer wraps the $l_j$ atom embedding. We provide a detailed description of our self-attention module in the Supplementary Material, Difference with the original self-attention section. The main objective of this self-attention module is to support a transfer learning approach. Thus, we excluded some functionalities of the original transformer module, such as multi-head attention and position embedding. The purpose of transfer learning is to optimize a subset of the feature space. Therefore, it is common to freeze most of the model layers and only train certain layers [13]. We froze most of the parts, except the self-attention layer, for the transfer learning phase.
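Eq. 4 can be traced step by step with a small, dependency-free sketch. It assumes that a single FC layer is shared by all three occurrences of $FC(l_j)$; the paper's supplementary material may use separate layers, and the helper names here are hypothetical.

```python
import math

def linear(x, weight, bias):
    """Apply one FC layer to every row of x; weight has shape (d_in, d_out)."""
    return [[sum(xi * w for xi, w in zip(row, col)) + b
             for col, b in zip(zip(*weight), bias)] for row in x]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def simple_self_attention(ligand, weight, bias):
    """In_l = softmax(FC(l) FC(l)^T) FC(l), with one shared FC layer (assumption)."""
    h = linear(ligand, weight, bias)                    # FC(l_j), shape M x d
    scores = matmul(h, [list(c) for c in zip(*h)])      # M x M atom-atom scores
    return matmul(softmax_rows(scores), h)              # attention-weighted rows

# Two ligand atoms with 2-dimensional embeddings and an identity FC layer.
ligand = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
in_l = simple_self_attention(ligand, identity, [0.0, 0.0])
```

With the identity layer, each output row is a convex combination of the two input embeddings, which makes the attention weights easy to inspect by hand.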

$$\begin{aligned} l_j^{i + 1}&= l_j^{i} \odot \sum _{k=0}^N e_{jk} * p_k^{i} \end{aligned}$$

(5)

$$\begin{aligned} p_k^{i + 1}&= p_k^{i} \odot \sum _{j=0}^M e_{jk} * l_j^{i} \end{aligned}$$

(6)

$$\begin{aligned} In_{lp}&= adj \odot V_lV_p^T \end{aligned}$$

(7)

$$\begin{aligned} cat&= [sum(In_{lp}), FC_{dr}(In_{lp}), sum(In_l), FC_{dr}(In_l)] \end{aligned}$$

(8)

$$\begin{aligned} \hat{y}&= Wcat + b \end{aligned}$$

(9)

In Eq. 5, each atom in a ligand updates its node using the information of the connected atoms in a protein pocket. Similarly, in Eq. 6, each atom in the protein pocket is updated using the connected atoms of the ligand. The update process was repeated three times. The processes are depicted in part (b-2) of Fig. 2.
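The update rule of Eqs. 5 and 6 can be written out as a short sketch on plain lists. It follows the equations literally, with both updates reading the embeddings of iteration $i$, and omits any learned transformation or normalization that the actual model may apply; the function names are hypothetical.

```python
def update_step(ligand, protein, adj):
    """One synchronous round of Eqs. 5 and 6 on lists of embedding vectors."""
    dim = len(ligand[0])
    new_ligand = []
    for j, l in enumerate(ligand):
        # Distance-weighted sum of pocket embeddings seen by ligand atom j.
        msg = [sum(adj[j][k] * protein[k][d] for k in range(len(protein)))
               for d in range(dim)]
        new_ligand.append([l[d] * msg[d] for d in range(dim)])  # element-wise product
    new_protein = []
    for k, p in enumerate(protein):
        # Distance-weighted sum of ligand embeddings seen by pocket atom k.
        msg = [sum(adj[j][k] * ligand[j][d] for j in range(len(ligand)))
               for d in range(dim)]
        new_protein.append([p[d] * msg[d] for d in range(dim)])
    return new_ligand, new_protein

def propagate(ligand, protein, adj, rounds=3):
    """Repeat the update; three rounds matches the number stated in the text."""
    for _ in range(rounds):
        ligand, protein = update_step(ligand, protein, adj)
    return ligand, protein

ligand = [[1.0, 2.0]]                 # M = 1 ligand atom
protein = [[3.0, 4.0], [5.0, 6.0]]    # N = 2 pocket atoms
adj = [[0.5, 0.1]]                    # e_jk, shape M x N
l1, p1 = update_step(ligand, protein, adj)
```

Because the updates are multiplicative, embedding magnitudes can grow or shrink quickly over repeated rounds, which is one reason a real implementation would likely add some form of normalization.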

Finally, the ligand and protein pocket atoms are combined, followed by element-wise multiplication with the adjacency matrix (Eq. 7). $In_{lp}$ is an intermediate matrix that aggregates ligand and protein. To produce the result, we used fully connected layers and the summed features for dimension reduction of the concatenated variables (Eq. 8). $FC_{dr}$ is an FC layer with dropout. W and b are the weight matrix and bias for the final FC layer, respectively. $\hat{y}$ in Eq. 9 represents the predicted result. Depending on the training set, the result could be $IC_{50}$, Ki, or Kd. As the affinity prediction is a regression task, we used the mean squared error loss function. The process is illustrated in part (c) of Fig. 2.

Results

The D3R GC4 requires participants to predict the affinity rankings of given small molecules against two target proteins, BACE and CatS. The numbers of given small molecules for the two datasets were 154 and 459, respectively. The main evaluation metrics are Kendall’s $\tau$ and Spearman’s $\rho$, as the primary purpose is to predict the rank correlation. There were several sub-challenges in D3R GC4, and we focused on rank prediction because that task is specifically about affinity prediction. In the case of the CatS protein, the previous grand challenge 3 released data that were measured in an experimental environment similar to that of GC4. Hence, transfer learning is straightforward to apply.
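For readers unfamiliar with the two rank metrics, a minimal reference sketch follows. It computes Kendall's tau-a, which has no tie correction, and Spearman's $\rho$ as the Pearson correlation of average ranks; the official challenge evaluation may use a different tie-handling variant, and the example values are invented.

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)

def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

measured = [1.0, 2.0, 3.0, 4.0]    # invented experimental values
predicted = [1.0, 3.0, 2.0, 4.0]   # invented predictions with one swapped pair
tau = kendall_tau(measured, predicted)
rho = spearman_rho(measured, predicted)
```

Both metrics depend only on the ordering of the predictions, so a model can score well even when its absolute affinity values are offset from the measured ones.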

Because we utilized the CatS data from GC3, we additionally experimented on the GC3 CatS data. The number of given small molecules is 136. The experiment settings remain the same as the GC4 experiments.

Data statistics

As summarized in Table 1, we used the PDBbind (v2019) dataset mainly for training and validation. It is common to preprocess the PDBbind dataset to filter out invalid data for training [19]. We determined a normal data range from the ChEMBL dataset [31, 32], which has a comment column indicating overly high or unreasonably low $IC_{50}$ values. After filtering target scores outside the normal range, the number of valid PDB instances is 5936 for Ki or Kd and 4475 for $IC_{50}$. Because we used the generated conformations for each instance, the augmented training set was on average about ten times larger than the raw data. The test set D3R GC4 was based on the $IC_{50}$ values. The D3R GC4 converted the $IC_{50}$ values to kcal/mol before evaluation and released the code used for the conversion. Therefore, we also evaluated the predicted values in kcal/mol and shifted them to have a mean of 0 for learning. The number of atom types is the number of unique atom types defined with the ECIF pre-processing. It was determined on the training data and did not change for the test set.
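The unit conversion and mean-centering mentioned above can be sketched as follows. The sketch assumes the common approximation $\Delta G \approx RT\ln(IC_{50})$ with $IC_{50}$ in mol/L and a temperature of 298.15 K; the conversion script released by D3R may use different constants, and the function names are hypothetical.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def ic50_to_kcal(ic50_molar, temperature=298.15):
    """Approximate binding free energy in kcal/mol from an IC50 given in mol/L.

    Uses dG ~ RT ln(IC50); the temperature is an assumed default.
    """
    return R_KCAL * temperature * math.log(ic50_molar)

def center(values):
    """Shift values so that their mean is 0, as done for the training targets."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

energies = [ic50_to_kcal(c) for c in (1e-9, 1e-7, 1e-6)]  # 1 nM, 100 nM, 1 uM
targets = center(energies)
```

More potent compounds (smaller $IC_{50}$) map to more negative energies, so the ranking is preserved by the conversion.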

The PDBbind database provides experimentally measured affinities for interactions between proteins and ligands with the corresponding PDB. However, owing to the nature of handling atomic units, many instances of PDBbind data are of inferior quality. For that reason, PDBbind provides two distinct datasets: the structurally ordered “refined set” and the lower-quality “general set.” Unfortunately, in the case of the $IC_{50}$ dataset, almost all instances fall into the “general set.” We used the PDBbind dataset as the raw data source for pre-training.

Table 1 Dataset statistics


Result on D3R GC4: target protein BACE

Other models from leading participant groups of GC4 BACE

Since our model uses the challenge data as a test set, we introduce the top group models in D3R GC4. We sorted the results by Spearman correlation and then selected the top group. Combined models of Skeledock and Kdeep achieved the best performance in the BACE test set [9, 33]. Skeledock performed pose prediction using template-query mapping and Kdeep performed affinity prediction using a convolutional neural network (CNN). The CACTVS chemo-informatics toolkit is a combination of several molecular docking techniques located in the leading group of the BACE test set [34]. Lastly, the GNINA model used an ensemble of CNNs for affinity prediction, and Monte Carlo chain sampling for pose prediction [35]. In some cases of the GNINA model’s submission, pose prediction was performed using only conformation superimposition. We followed the same method for pose prediction, as the process is straightforward and a large amount of data can be obtained.

In the after challenge section in Table 2, we also experimented with ECIF and GBM under the same conditions as our model, such as re-docking [19]. Since the pre-processing is almost identical, we passed the same pose prediction data to the GBM model and reported its performance.

The comparative assessment of scoring functions-2016 (CASF-2016) is a benchmark dataset for affinity prediction (Ki or Kd) given a structural dataset [36]. On co-crystallized data, the ECIF and GBM models showed outstanding performance in CASF-2016, which deals with Ki or Kd. The GBM model and similar boosting based machine learning algorithms are robust against small noise in the dataset [37]. However, machine learning techniques should be carefully applied to data with high disturbance in the input features. When the input data require cross-docking, the noise inside the data is a significant issue, and there is room for the application of deep learning. One strength of deep learning is that it can obtain generalizable results when trained on data that involve a certain amount of noise.

Result analysis

Table 2 Evaluation results on D3R GC4: BACE


Given a test set from challenges, there are several ways to select a model that can be applied to the test set. After dividing the pre-training data into a training set and a validation set, we saved a model that surpassed the previously saved model at validation for each training iteration. After training, we loaded and ran the model on the test set.

We applied transfer learning. Since we do not have data similar to the test data, we gathered chemicals and $IC_{50}$ values for the BACE target protein from ChEMBL. The test data were released after the challenge in 2018. To avoid a situation where test data are included in the training set, we excluded data from 2018, 2019, and 2020. Since ChEMBL data instances do not contain three-dimensional structure data, we used the same cross-docking method as we did in creating the test data. When deciding the number of epochs for transfer learning, two cases should be considered. If additional learning is performed with almost the same data distribution as the test set, we do not need to be concerned about overfitting. However, if the distribution is different, as in the case of the target protein BACE, it is better to finish transfer learning before convergence, that is, before the training loss stops changing [38]. We performed transfer learning for 50 epochs, half the number of epochs used for pre-training.

Table 2 shows the performance of our model alongside the top group results in the D3R GC4 challenge on the BACE target protein data. A participant could make multiple submissions to the challenge. For concise presentation, we show only the best performance of each participant. Our model outperformed the best baseline result by 9%.

The ECIF + GBM model has lower performance than our BGNN; the GBM model has difficulty with noisy data. In the case of BACE data, we retrieved additional chemical interaction data from ChEMBL and used it to train the GBM model. However, this did not narrow the gap between the BACE test set and the training set.

Result on D3R GC4: target protein CatS

Other models from leading participant groups of GC4 CatS

Here, we introduce the top group models of the CatS target protein. ICM-dock achieved the best performance in the CatS test set [39]. Unlike other models in the leading group, the ICM-dock model hardly used deep learning. The ICM-dock model faced the problem of selecting the optimal conformer among multiple ligand/protein conformations, as in our case. The authors of ICM claimed that using most conformers through an ensemble technique might address the protein flexibility problem in molecular docking. MathDL was located in the leading group in the CatS test set [22]. MathDL converted the data into a bipartite graph between two atomic types using geometric relations, and it uses low-dimensional, translationally invariant graph features. deepscaffopt achieved its best result in the CatS dataset, and it was the only ligand-based model in the leading group; however, this model has not yet been published. The CNDO model was ranked fourth in the CatS test set [40]. The CNDO model can calculate molecular orbital energies from the ligand geometry.

Table 3 Evaluation results on D3R GC4: CatS


Result analysis

For this CatS dataset, the previous GC3 released data on non-overlapping ligand compounds for the same target protein. We trained our model on the PDBbind dataset and applied transfer learning using GC3 CatS data. We prepared two models for the CatS data: one that freezes all modules except the self-attention in the transfer learning phase, and one that does not freeze any module. Freezing the modules other than self-attention in transfer learning aims to prevent overfitting and to adapt the self-attention module to the additional data. In contrast, the model without freezing is presented for the CatS data because overfitting can help if the test data and the additional data for transfer learning are similar. We assumed that it is feasible to prepare such an overfitted model in the CatS case, since the data distribution can be predicted to some extent in advance.

Table 3 presents the performance of our model and the top group results in the D3R GC4 challenge on the CatS target protein data. The model that applied freezing in transfer learning performed more poorly than the model without freezing. The performance of our model without freezing does not reach that of the top three participants; however, it improves on the BACE case in Table 2. The difference between the BACE and CatS results depends on whether the data used for transfer learning have a similar distribution to the test set.

The ECIF + GBM model showed performance similar to that of the basic BGNN model. The GBM model performs better here than in the other experiments because the quality of the training data is guaranteed to some extent. In the CASF-2016 benchmark, in which the training data was the crystal pose itself, the GBM model showed a significant level of performance [19].

Result on D3R GC3: target protein CatS

Other models from leading participant groups of GC3 CatS

Next, we introduce the top group models in stage 2 of D3R GC3 for the CatS protein. ICM-dock and ScaffOpt are the same models that appeared in the leading participant group of GC4 CatS. Dock_close [41, 42] used several open-source libraries such as Smina, Openbabel, and Omega2. They found the closest compound among known bound ligands using Babel 2.3.2, and the affinity score was derived from the best Smina score. Amber method 1 [43] used the Antechamber tool of AmberTools16 [44], which utilizes the general Amber force field (GAFF). AmberTools [45] is a package of molecular simulation libraries, and GAFF is a molecular mechanical force field for the simulation of biomolecules.

Table 4 Evaluation results on D3R GC3: CatS


Result analysis

Compared to the GC4 CatS dataset, the number of instances of GC3 CatS is small. For this dataset, we used the same approach as for BACE. We trained our model on the PDBbind dataset and applied transfer learning using ChEMBL data for the CatS target protein. For a fair comparison, we excluded data from 2017 onward from the additional data for transfer learning.

Table 4 shows the performance of our model alongside the top group results in the D3R GC3 challenge on the CatS target protein data. Our model ranks third by Kendall’s $\tau$ and fourth by Spearman’s $\rho$. The overall performance of the participants is similar to that for GC4 BACE.

In this experiment, the ECIF + GBM model performed poorly. For the GBM model, the additional ChEMBL set was not helpful in predicting the test set. The GC3 CatS data appear to have a slightly more complicated pattern than GC4 BACE, as suggested by the fact that our BGNN performed worse on GC3 CatS.

Discussion

We selected the top three participants for each of the two target proteins. We observed that the BACE and CatS data have quite different patterns, as none of the top participants were the same for both targets. A characteristic of the $IC_{50}$ value is that it is significantly influenced by the environment in which it is measured. With the help of transfer learning, we can predict the test set better when we have data similar to the test set.

The results for CatS are higher than those for BACE because GC3 data from experiments in the same environment had been previously released. When predicting new data with a model trained only on PDBbind data, it is difficult to expect the same performance as in the CatS case. Nevertheless, if the amount of data is sufficiently large, a model can tolerate a certain degree of statistical fluctuation. When estimating the $IC_{50}$ values of a pool of compounds for a specific protein, transfer learning can be a good method. We gathered experimental $IC_{50}$ values for the BACE target protein from ChEMBL and used these data for transfer learning to obtain good performance.

Analysis of the results showed that our model performance was slightly lower than that of the top group for the CatS target protein. We regard the pose prediction process as the cause of the performance degradation. This is consistent with the observation of the authors of the MathDL study, who pointed out that relying on existing pose prediction libraries has limitations [22]. The authors of the MathDL study attempted to predict poses with their custom deep learning model in D3R GC4 and created docking poses using a generative adversarial network. In our case, using the superimposed pose likely produced inaccurate docking poses. As shown in Fig. 3, there is a significant difference in positions when there are many generated conformations.

In Fig. 4, we present a fitting graph for the CatS data. Square dots represent the test set data (D3R GC4), and triangle dots represent the transfer learning data (D3R GC3). Because the fitting graph compares the actual values with the predicted values, it is a more rigorous evaluation than the rank prediction used as the evaluation metric in the challenge. We included the transfer learning data in the graph to emphasize that these data do not have the same distribution as the test set, even though the model had already been trained on them.
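The difference between value-based and rank-based evaluation can be illustrated with a small sketch. It uses made-up numbers, not data from Fig. 4, and the helper names are hypothetical: predictions shifted by a constant keep a perfect ranking while still showing a large absolute error.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted values."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def same_ordering(actual, predicted):
    """True when sorting by either list yields the same order of items."""
    by_actual = sorted(range(len(actual)), key=lambda i: actual[i])
    by_predicted = sorted(range(len(predicted)), key=lambda i: predicted[i])
    return by_actual == by_predicted


# Placeholder values: every prediction is off by the same constant.
actual = [4.0, 5.0, 6.0, 7.0]
shifted = [a + 2.0 for a in actual]

print(same_ordering(actual, shifted))  # ranking is preserved
print(rmse(actual, shifted))           # absolute error is still large
```

A rank metric would score the shifted predictions as perfect, whereas a fitting graph would show every point displaced from the diagonal.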

The D3R GC4 challenge was held in 2018, whereas we used the PDBbind dataset v2019. We also experimented with PDBbind v2017 as the pre-training dataset; the results are given in the Supplementary Material, in the section "GC4 CatS and BACE result based on pre-trained model with PDBbind dataset v2017". The results using PDBbind v2017 are not much different from those using v2019.

Hyperparameter

Table 5 summarizes the hyperparameter search range. We implemented our models using CUDA-enabled PyTorch. Our model used the Adam optimizer with a learning rate of 0.0005. We applied dropout in the input, middle, and last layers; all dropout layers use the same value. We tried several different numbers of epochs when training the model on the PDBbind dataset. As the data distribution of PDBbind differs from that of the test set, we stopped training when the loss on the validation set had roughly converged. The final mini-batch size was set to 20. Since we grouped up to 30 conformations from the same ligand–protein pair, when every entry of a mini-batch is filled with 30 conformers, our model trains on 600 instances in a batch. We used the Nvidia Titan RTX graphics card for 20 GB of VRAM.
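The batch arithmetic above can be checked with a short sketch. The function names are hypothetical and the grouping logic is an assumption about how ligand–protein pairs might be batched; only the mini-batch size of 20 and the cap of 30 conformers per pair come from the text.

```python
def instances_per_batch(conformers_per_pair, max_conformers=30):
    """Total conformers in one mini-batch, capping each pair at max_conformers."""
    return sum(min(n, max_conformers) for n in conformers_per_pair)


def make_batches(pairs, batch_size=20):
    """Split a list of ligand-protein pairs into consecutive mini-batches."""
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]


# Upper bound from the text: 20 pairs, each with 30 conformers.
full_batch = [30] * 20
print(instances_per_batch(full_batch))  # 20 * 30 = 600

# A batch where half of the pairs have only 5 conformers is much smaller.
mixed_batch = [30] * 10 + [5] * 10
print(instances_per_batch(mixed_batch))  # 300 + 50 = 350
```

Because the number of conformers varies per pair, 600 is an upper bound on the instances per batch rather than a fixed size, which matters when estimating peak GPU memory.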

Performance change of our model through excluding features

Table 5 Hyperparameter search range

Table 6 Ablation study on D3R GC4: target protein CatS

We performed an ablation study to evaluate the effectiveness of several features of our model. We removed the features individually to track any changes in the performance. Each reported result is the average of five runs of the model on the CatS target protein dataset from D3R GC4. Table 6 shows the results of the ablation study.

Effectiveness of self-attention. The first ablation experiment evaluated the model without the self-attention module. The Spearman correlation was 0.54 and Kendall’s $\tau$ was 0.47. The ligand-based prediction module exhibited good performance, as in the case of deepscaffopt.

Effectiveness of transfer learning. For the next experiment, we skipped transfer learning. The Spearman correlation was 0.17 and Kendall’s $\tau$ was 0.11. The low performance is attributed to the data distribution of the CatS set, which is quite different from that of PDBbind. Moreover, the number of CatS instances in PDBbind was insufficient. If the training data contain too few instances of the target protein, it is important to perform transfer learning with additional data generated by cross-docking or, as in this case, with other experimental data obtained in a similar environment.

Effectiveness of BGNN. Without the BGNN, the main module of our model, we are left with a ligand-based self-attention module. The Spearman correlation was 0.57 and Kendall’s $\tau$ was 0.39. The performance of the ligand-based model was slightly lower than that of the model using an additional graph neural network. Our BGNN model utilizes protein-ligand relationship information, as even the same ligand can exhibit entirely different $IC_{50}$ values depending on the target protein to which it binds.

Conclusions

AutoDock Vina is an open-source library capable of pose prediction, and it is one of the most popular libraries among the GC4 participants. When we examined the mean RMSD in the evaluation results of GC4 pose prediction, the RMSD of the participants who used AutoDock Vina ranged from 0.76 to 44.07. If we remove some outliers, the maximum RMSD with AutoDock Vina becomes 10.34. Pose prediction libraries such as AutoDock Vina require a certain level of human intervention or feature engineering to determine the set of parameters for different data inputs. The complex feature engineering in the pose prediction framework implies that the overall performance of molecular docking depends on specific hyperparameters. Moreover, it can be challenging to reproduce the same pose prediction, since some of the features are not deterministic.
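As a reminder of what the RMSD values above measure, the following is a minimal sketch of the root-mean-square deviation between a predicted pose and a reference pose. It assumes that both poses list the same atoms in the same order and applies no alignment or symmetry correction, which an official challenge evaluation may handle differently; the coordinates are made-up placeholders.

```python
from math import sqrt


def rmsd(coords_a, coords_b):
    """RMSD between two poses given as equal-length lists of (x, y, z) tuples.

    Assumes identical atom ordering; no superposition or symmetry handling.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return sqrt(squared / len(coords_a))


# Placeholder coordinates: the predicted pose is the reference shifted along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
predicted = [(x, y, z + 1.0) for (x, y, z) in reference]
print(rmsd(reference, predicted))
```

A rigid shift of every atom by one unit yields an RMSD of one unit, so large values such as those reported for some submissions indicate poses far from the reference binding site.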

We aimed to reduce feature engineering in molecular docking. We used a straightforward pose prediction method without complex features. The drawback was that the accuracy of this simple method was not outstanding: after our pose prediction, the predicted positions of the atoms contain probabilistic deviation around the reference ligand. Table 2 shows that the existing ML model can be unstable when handling noisy data. We used deep learning with minimal feature engineering to utilize the noisy data, since noisy data can help prevent overfitting in deep learning. To make our approach reproducible, we released our entire code as open source.

Existing cross-docking approaches suffer from probabilistic deviation in pose prediction and from data scarcity. We proposed a straightforward pipeline for affinity prediction using BGNN and transfer learning based on a re-docking dataset. We showed that our BGNN model achieved competitive results on the D3R GC4 dataset. For the BACE set, our model outperformed the best participant in GC4 by 9%. For the CatS set, our model performed competitively. We demonstrated the robustness of our model by evaluating it on two challenge datasets with different patterns.

Data availability

All data are publicly available. Please refer to the code availability section for details.

Code availability

The model code and data are available at https://github.com/arwhirang/affinity_prediction_BGNN.

References

1. Seifert MH, Wolf K, Vitt D (2003) Virtual high-throughput in silico screening. Biosilico 1(4):143–149
2. Braga R, Alves V, Silva A, Nascimento M, Silva F, Liao L, Andrade C (2014) Virtual screening strategies in medicinal chemistry: the state of the art and current challenges. Curr Top Med Chem 14(16):1899–1912
3. Gimeno A, Ojeda-Montes MJ, Tomás-Hernández S, Cereto-Massagué A, Beltrán-Debón R, Mulero M, Garcia-Vallvé S (2019) The light and dark sides of virtual screening: what is there to know? Int J Mol Sci 20(6):1375
4. Friesner RA, Banks JL, Murphy RB, Halgren TA, Klicic JJ, Mainz DT, Shenkin PS (2004) Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem 47(7):1739–1749
5. Warren GL, Andrews CW, Capelli AM, Clarke B, LaLonde J, Lambert MH, Head MS (2006) A critical assessment of docking programs and scoring functions. J Med Chem 49(20):5912–5931
6. Gaieb Z, Parks CD, Chiu M, Yang H, Shao C, Walters WP, Gilson MK (2019) D3R Grand Challenge 3: blind prediction of protein-ligand poses and affinity rankings. J Comput Aided Mol Des 33(1):1–18
7. Ragoza M, Hochuli J, Idrobo E, Sunseri J, Koes DR (2017) Protein-ligand scoring with convolutional neural networks. J Chem Inf Model 57(4):942–957
8. Morrone JA, Weber JK, Huynh T, Luo H, Cornell WD (2020) Combining docking pose rank and structure with deep learning improves protein-ligand binding mode prediction over a baseline docking approach. J Chem Inf Model 60(9):4170–4179
9. Jiménez J, Skalic M, Martinez-Rosell G, De Fabritiis G (2018) Kdeep: protein-ligand absolute binding affinity prediction via 3d-convolutional neural networks. J Chem Inf Model 58(2):287–296
10. Zheng L, Fan J, Mu Y (2019) Onionnet: a multiple-layer intermolecular-contact-based convolutional neural network for protein-ligand binding affinity prediction. ACS Omega 4(14):15956–15965
11. Yang L, Yang G, Chen X, Yang Q, Yao X, Bing Z, Yang L (2021) Deep scoring neural network replacing the scoring function components to improve the performance of structure-based molecular docking. ACS Chem Neurosci 12:2133
12. Muller U, Ben J, Cosatto E, Flepp B, Cun YL (2006) Off-road obstacle avoidance through end-to-end learning. Adv Neural Inf Process Syst 739–746
13. Oquab M, Bottou L, Laptev I, Sivic J (2014) Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1717–1724
14. Parks CD, Gaieb Z, Chiu M, Yang H, Shao C, Walters WP, Gilson MK (2020) D3R grand challenge 4: blind prediction of protein-ligand poses, affinity rankings, and relative binding free energies. J Comput Aided Mol Des 34(2):99–119
15. Nguyen D, Gao K, Chen J, Wang R, Wei G (2020) Potentially highly potent drugs for 2019-nCoV. BioRxiv
16. Ragoza M, Turner L, Koes DR (2017) Ligand pose optimization with atomic grid-based convolutional neural networks. arXiv:1710.07400
17. Francoeur PG, Masuda T, Sunseri J, Jia A, Iovanisci RB, Snyder I, Koes DR (2020) Three-dimensional convolutional neural networks and a cross-docked data set for structure-based drug design. J Chem Inf Model 60(9):4200–4215
18. Riniker S, Landrum GA (2015) Better informed distance geometry: using what we know to improve conformation generation. J Chem Inf Model 55(12):2562–2574
19. Sánchez-Cruz N, Medina-Franco JL, Mestres J, Barril X (2020) Extended connectivity interaction features: improving binding affinity prediction through chemical description. Bioinformatics
20. Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50(5):742–754
21. Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Pande V (2018) MoleculeNet: a benchmark for molecular machine learning. Chem Sci 9(2):513–530
22. Nguyen DD, Gao K, Wang M, Wei GW (2018) MathDL: mathematical deep learning for D3R grand challenge 4. J Comput Aided Mol Des 34 (2020):131–147
23. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781
24. Cadeddu A, Wylie EK, Jurczak J, Wampler-Doty M, Grzybowski BA (2014) Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses. Angew Chem Int Ed 53(31):8108–8112
25. Hirohara M, Saito Y, Koda Y, Sato K, Sakakibara Y (2018) Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. BMC Bioinform 19(19):83–94
26. Goh GB, Hodas NO, Siegel C, Vishnu A (2017) Smiles2vec: an interpretable general-purpose deep neural network for predicting chemical properties. arXiv:1712.02034
27. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Polosukhin I (2017) Attention is all you need. arXiv:1706.03762
28. Lin Z, Feng M, Santos CN, Yu M, Xiang B, Zhou B, Bengio Y (2017) A structured self-attentive sentence embedding. arXiv:1703.03130
29. Shin B, Park S, Kang K, Ho JC (2019) Self-attention based molecule representation for predicting drug–target interaction. In: Machine learning for healthcare conference. Proceedings of Machine Learning Research (PMLR), pp. 230–248
30. Zheng S, Li Y, Chen S, Xu J, Yang Y (2020) Predicting drug-protein interaction using quasi-visual question answering system. Nat Mach Intell 2(2):134–140
31. Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Overington JP (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40:1100–1107
32. Liu Z, Li Y, Han L, Li J, Liu J, Zhao Z, Wang R (2015) PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics 31(3):405–412
33. Varela-Rial A, Majewski M, Cuzzolin A, Martínez-Rosell G, De Fabritiis G (2020) SkeleDock: a web application for scaffold docking in play molecule. J Chem Inf Model 60(6):2673–2677
34. Ihlenfeldt WD, Takahashi Y, Abe H, Sasaki SI (1994) Computation and management of chemical properties in CACTVS: an extensible networked approach toward modularity and compatibility. J Chem Inf Comput Sci 34(1):109–116
35. McNutt A, Francoeur P, Aggarwal R, Masuda T, Meli R, Ragoza M, Koes D (2021) GNINA 1.0: molecular docking with deep learning
36. Su M, Yang Q, Du Y, Feng G, Liu Z, Li Y, Wang R (2018) Comparative assessment of scoring functions: the CASF-2016 update. J Chem Inf Model 59(2):895–913
37. Li AH, Bradic J (2018) Boosting in the presence of outliers: adaptive classification with nonconvex loss functions. J Am Stat Assoc 113(522):660–674
38. Prechelt L (1998) Early stopping-but when? Neural networks: tricks of the trade. Springer, Berlin, pp 55–69
39. Lam PC, Abagyan R, Totrov M (2018) Hybrid receptor structure/ligand-based docking and activity prediction in ICM: development and evaluation in D3R Grand Challenge 3. J Comput Aided Mol Des 33(1):35–46
40. Sahu S, Shukla A (2009) Fortran 90 implementation of the Hartree–Fock approach within the CNDO/2 and INDO models. Comput Phys Commun 180(5):724–734
41. Wingert BM, Oerlemans R, Camacho CJ (2018) Optimal affinity ranking for automated virtual screening validated in prospective D3R grand challenges. J Comput Aided Mol Des 32(1):287–297
42. Ye Z, Baumgartner MP, Wingert BM, Camacho CJ (2016) Optimal strategies for virtual screening of induced-fit and flexible target in the 2015 D3R Grand Challenge. J Comput Aided Mol Des 30(9):695–706
43. He X, Man VH, Ji B, Xie XQ, Wang J (2019) Calculate protein-ligand binding affinities with the extended linear interaction energy method: application on the Cathepsin S set in the D3R Grand Challenge 3. J Comput Aided Mol Des 33(1):105–117
44. Wang J, Wang W, Kollman PA, Case DA (2006) Automatic atom type and bond type perception in molecular mechanical calculations. J Mol Graph Model 25(2):247–260
45. Salomon-Ferrer R, Case DA, Walker RC (2013) An overview of the Amber biomolecular simulation package. Wiley Interdiscip Rev 3(2):198–210

Funding

The study was supported by a National Research Council of Science & Technology (NST) grant from the Korea government (MSIP) (No. CAP-17-01-KIST Europe).

Author information

Authors and Affiliations

KIST Europe, Campus E7 1, 66123 Saarbrücken, Germany
Sangrak Lim, Yong Oh Lee, Juyong Yoon & Young Jun Kim
Industrial and Data Engineering Department of Hongik University, Seoul, Republic of Korea
Yong Oh Lee

Contributions

The study was designed by SL and YL. SL wrote the code and performed the analysis. The original manuscript was written by SL and YL. All authors (SL, YL, JY, and YK) reviewed and edited the manuscript. YL and YK acquired the funding. All authors have given approval to the final version of the manuscript.

Corresponding author

Correspondence to Sangrak Lim.

Ethics declarations

Conflict of interest

We declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 2774 KB)

About this article

Cite this article

Lim, S., Lee, Y.O., Yoon, J. et al. Affinity prediction using deep learning based on SMILES input for D3R grand challenge 4. J Comput Aided Mol Des 36, 225–235 (2022). https://doi.org/10.1007/s10822-022-00448-3

Received: 14 June 2021
Accepted: 08 March 2022
Published: 22 March 2022
Issue Date: March 2022
DOI: https://doi.org/10.1007/s10822-022-00448-3
