Introduction

Molecular docking is an essential component of the modern drug development toolkit to identify promising small molecules that bind to a target protein. Molecular docking based virtual screening protocols predict and rank the binding affinities of a large pool of small molecules represented in a simplified molecular-input line-entry system (SMILES) format [1,2,3].

Binding pose prediction is the first step of molecular docking, followed by affinity prediction. Binding pose prediction is a search process that extracts the optimal structure from the ligand’s conformational space within the binding pocket. The subsequent affinity prediction step, also called a scoring function, predicts the binding affinity of a small molecule to a macromolecular target given the pose of the small molecule. However, combining pose prediction with the subsequent affinity prediction is more complicated than tackling each component independently. For instance, flexible and dynamic protein residues often lead to errors in pose prediction, which in turn affect the results of the affinity step. To improve molecular docking performance, it is necessary to address both components together.

A divide-and-conquer method is often used to deal with complex problems such as molecular docking. The divide-and-conquer method combines libraries that each perform well on a small task into a large pipeline. However, there are two obstacles associated with the divide-and-conquer method: statistical fluctuation and feature engineering. First, the success of affinity prediction is highly dependent upon the accuracy of pose prediction. The input of the affinity prediction is the set of ligand conformations estimated in the pose prediction, which contain probabilistic deviations of the ligand atoms compared to the atoms of the crystal pose. A common evaluation metric for the pose prediction step is the root mean square deviation (RMSD), and RMSD values less than 2.0 Å have been considered an acceptable level of accuracy over the last two decades [4, 5]. With advancements in pose prediction techniques, the differences between experimentally measured and predicted poses have decreased. However, it remains challenging to predict the co-crystallized pose precisely. Second, the established approaches require more than five components, with complex feature engineering. In Grand Challenge 3, even when the same docking method was implemented, the performance differed considerably depending on the hyperparameters [6]. For instance, existing approaches for partial charge calculations offer a high level of variability, since one must select from diverse partial charge outputs. In addition, reproducing the same result can be challenging when many attributes must be chosen. Therefore, it is important to organize a concise structure that reduces complex feature engineering and high variability.
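As an illustration of the RMSD criterion, the following minimal Python sketch computes the RMSD between a predicted pose and a crystal pose and checks it against the 2.0 Å threshold. It assumes that both poses are given as equally ordered lists of coordinates in the same reference frame; the function names and the sample coordinates are hypothetical and are not taken from any docking library.

```python
import math

def rmsd(predicted, reference):
    """RMSD in Å between two equally ordered lists of (x, y, z) atom coordinates."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("poses must be non-empty and contain the same number of atoms")
    squared = 0.0
    for (px, py, pz), (rx, ry, rz) in zip(predicted, reference):
        squared += (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
    return math.sqrt(squared / len(predicted))

def is_acceptable(predicted, reference, threshold=2.0):
    """Apply the commonly used 2.0 Å acceptance threshold."""
    return rmsd(predicted, reference) < threshold

# Hypothetical three-atom poses: the docked pose is shifted by 1 Å along x.
crystal_pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
docked_pose = [(1.0, 0.0, 0.0), (2.5, 0.0, 0.0), (2.5, 1.5, 0.0)]
print(rmsd(docked_pose, crystal_pose), is_acceptable(docked_pose, crystal_pose))
```

In practice the atom correspondence between the two poses must be established first, for example by accounting for molecular symmetry, which this sketch does not attempt.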

Recently, deep learning has been applied to pose prediction [7, 8] and affinity prediction [7, 9,10,11]. We focused on the aspect of deep learning, which can reduce feature engineering in complex problems and is resistant to data noise. One such example is an end-to-end model that connects input data and target values through deep learning [12]. The advantage of an end-to-end pipeline is derived from the minimization of feature engineering or handcrafted heuristics. Such an abstraction for an intricate problem might help create a concise model; however, hardware capacity is a barrier for directly applying an end-to-end model to molecular docking. A binding pocket site usually comprises 100 or more three-dimensional coordinates. While it is a common approach to process multiple training instances simultaneously in one batch for robust performance, training multiple instances is considerably expensive for the pose prediction step. In experiments, we have shown that using deep learning can be more effective than conventional machine learning methods, although our model is not end-to-end.

One of the critical challenges for deep learning based pose and affinity prediction is data scarcity. The protein data bank (PDB) file format is used for three-dimensional structural data, and the PDB is the main data source of structural bioinformatics. Although thousands of PDB entries are available for ligand–protein pairs, they are still insufficient compared to the number of potentially bioactive chemicals. Re-docking involves separating a ligand from the macromolecule to which it was originally bound and then reproducing the binding state, to check that the docked pose geometrically matches the original position of the co-crystallized ligand. Cross-docking refers to a docking simulation in which ligands are exchanged between different PDB files of the same protein receptor. Affinity prediction should involve cross-docking because the bound structure between the ligand and the protein receptor is not provided. We created multiple poses through re-docking to augment the training data for affinity prediction. We provide a detailed description of the data augmentation in the Supplementary Material, Data augmentation section.

When a pre-training dataset with various protein receptors lacks data points for a certain receptor, transfer learning is a feasible method to improve performance with additional data. A transfer learning approach involves re-training a pre-trained model for a similar but different task [13]. The prediction of ligand binding results such as affinity or toxicity requires a large amount of learning data for the target receptor. Accordingly, in the case of a specific protein with few co-crystallized pose data, the prediction performance of a model tends to degrade. We provide a more detailed description of transfer learning in the Supplementary Material, transfer learning section.

We propose a pipeline for affinity prediction using a bipartite graph neural network (BGNN) and transfer learning, trained on a re-docking dataset. Since current pose prediction is subject to probabilistic deviation, a robust and scalable deep learning model can exploit the larger amount of data generated by re-docking between the pose prediction and the affinity prediction steps. We evaluated our pipeline on D3R GC4, which offered challenges for binding pose prediction and affinity ranking between several small molecules and two target proteins: beta secretase 1 (BACE) and Cathepsin S (CatS) [14]. Challenge organizers provide a shared dataset, for which co-crystal structures of the ligand–protein pairs are not given, to compare multiple prediction methods on an equal footing.

Successful molecular docking approaches could be utilized in the early drug discovery process, for example to automate novel drug design or drug repositioning [15]. However, there are several difficulties in practical applications. We have described the probabilistic deviation of atoms and the feature engineering issues involved in pose prediction. Another challenge in molecular docking is the limited number of public data instances that provide the three-dimensional structural data and corresponding binding affinity. Considering the above challenges, our contributions are summarized as follows:

  • We mitigate the probabilistic deviation of atoms arising from pose prediction, as well as data scarcity, using data augmentation.

  • We use deep learning to build a scalable model with minimal feature engineering.

  • Our method demonstrated state-of-the-art performance in the target protein BACE dataset from stage 2 of D3R GC4. Our method also showed competitive performance in the CatS dataset. We demonstrated the robustness of our model by evaluating two challenge datasets with different patterns.

Methodology

Pose prediction

Molecular docking can be roughly divided into two steps: ligand binding pose prediction and affinity prediction. To cover the entire process of molecular docking, it is important to consider the association between the two steps, as the second step uses the results of the first. In general, the pose prediction process is itself separated into two sub-processes: a generation process that produces several three-dimensional conformations of small molecules bound to a rigid macromolecular target, and a ranking process that selects the cognate structure of a co-crystallized molecule among the generated conformations. However, some existing studies [16, 17] demonstrated that extending the training set with generated conformations helped optimize a deep learning model. We therefore use all the compounds created in the first sub-process.

The PDBbind dataset was used as the raw data source for pre-training. PDBbind separates the proteins and ligands of each PDB entry, and we also stored the ligands in SMILES format. The test data, in contrast, contain the ligands in SMILES format only. In Fig. 1, the data section is contained within a gray dotted box. We augmented the raw training data by re-docking the PDBbind data and extended the test data by cross-docking.

Fig. 1

Binding pose prediction protocol: target values are \(IC_{50}\), Ki, or Kd. The re-docking/cross-docking process is as follows: (a) The original ligand files contain 3D coordinates and are converted to SMILES through RDKit. (b) Conformers are generated through RDKit. (c) The closest reference ligand to the given ligand in SMILES form is found. (d) The conformer is superimposed through obfit, whose structural alignment superposes the conformer on the template molecule based on the shared SMARTS

Our pose prediction method mostly follows the procedure of Ragoza et al. [7]. Further, members of the same lab participated in D3R GC4 using the same method. The difference is that we omitted the parts of their method that are not described in our pose prediction, such as the ranking of generated poses. We selected an existing method that uses a minimal number of components for pose prediction so that we could utilize all of the data produced by re-docking. In short, we generate conformations from SMILES and superimpose them on the previously found reference ligand structure.

Figure 1a, b: The SMILES input is passed to the RDKit library, which creates conformations with 3D coordinates. RDKit is an open-source cheminformatics toolkit that uses the ETKDG algorithm to generate conformations [18]. As it is difficult to accurately predict how the ligand coordinates will be located in the actual binding process, we generated up to 30 conformations. Depending on the ligand’s environment, RDKit may fail to generate conformations for a SMILES string, for example when the valence of a particular atom is abnormal; therefore, the amount of generated data differs each time.

Figure 1c: Our pose prediction requires an existing co-crystal structure with the same protein receptor to find a reference ligand. We looked for a reference ligand whose bond substructure was similar to that of the ligand given in SMILES form. The reference ligand should be selected from the existing PDBs and share the same target protein. In the case of the re-docking process, the reference ligand is known. SMILES arbitrary target specification (SMARTS) specifies substructural patterns in molecules. We use SMARTS to designate a similar substructure between the generated conformation and the reference ligand, which we find with the FindMCS function of the RDKit library. The shared SMARTS pattern is passed to the next part. We can determine the position of the generated conformations through the co-crystallized pose of the reference ligand.

Figure 1d: The next part is the superimposition process. The structural alignment performed by obfit superimposes the conformer on the template ligand based on their shared SMARTS pattern. As a result, the conformation is located around the reference small molecule in the existing PDB. We used the obfit tool from the Open Babel library, giving it the reference ligand and the newly generated conformation as input.

Figure 1e: The final process of pose prediction saves the data for use in the next affinity prediction step. The resulting ligand–protein pair has a maximum of 30 pose combinations. We grouped the data so that generated data from the same ligand–protein pair were not split when processed in a mini-batch.
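The grouping constraint described above can be sketched as follows. This is a minimal illustration rather than our actual data loader: the function name, the `(pair_id, pose)` sample layout, and the batch size are hypothetical, and a group larger than the batch size is simply emitted as one oversized batch.

```python
from collections import defaultdict

def group_batches(samples, max_batch_size):
    """Yield mini-batches in which all poses of a ligand-protein pair stay together.

    samples: iterable of (pair_id, pose) tuples (hypothetical layout).
    """
    groups = defaultdict(list)
    for pair_id, pose in samples:
        groups[pair_id].append((pair_id, pose))

    batch = []
    for group in groups.values():  # dicts preserve insertion order
        # Start a new batch if adding the whole group would exceed the limit.
        if batch and len(batch) + len(group) > max_batch_size:
            yield batch
            batch = []
        batch.extend(group)
    if batch:
        yield batch

# Hypothetical poses: pair "A" has 2 poses, "B" has 3, "C" has 1.
samples = [("A", 0), ("A", 1), ("B", 0), ("B", 1), ("B", 2), ("C", 0)]
batches = list(group_batches(samples, max_batch_size=4))
```

With these sample values, the two poses of pair "A" form the first batch, and the poses of "B" and "C" share the second one.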

We provide a detailed, step-by-step description of the pose prediction phase, with more visual representations, in the Supplementary Material.

Affinity prediction

Fig. 2

Affinity prediction protocol: (a) We applied ECIF pre-processing before applying character embedding on the ligand and the protein pocket atoms. An empty box inside the adjacency matrix denotes a filtered entry, since the corresponding ligand–protein atom pair is farther apart than a cutoff distance. (b-1) Self-attention module for ligand input. (b-2) An atom node in the ligand updates its embedding from linked protein atoms. (c) Dimension reduction using fully-connected layer and sum

For prediction at the molecular level, it is important to transform the input into more suitable forms called molecular fingerprints. The extended connectivity interaction features (ECIF) assigns each atom to the corresponding atom types considering the atom environment concept, which was originally presented in the extended connectivity fingerprints [19, 20]. We implemented ECIF pre-processing for each atom.

Among structural models, the three-dimensional coordinates of atoms are often converted into a meta-structure owing to the cost of the complex model and the usefulness of abstraction [21, 22]. Our model is inspired by the simple structure of ECIF. ECIF exhibited the state-of-the-art performance in the existing affinity prediction task. In addition to the ECIF descriptors, the authors defined 1540 possible interactions between a pair of ligand and protein atoms. Further, the number of corresponding relationships was counted as an input feature for gradient boosting machine (GBM). ECIF uses only the simple input feature of protein-ligand atom pair counts for prediction. However, we consider that additional features such as atom-level embedding or adjacency matrix help the scoring function while remaining a concise model. We made a detailed description of the GBM in the Supplementary Material, GBM model section.

Representing a character as an embedding vector is a frequently used method in natural language processing (NLP). After learning the embedding vector through deep learning, character embeddings of the semantically similar items are located near each other in vector spaces [23]. Despite the obvious differences, the NLP and the biomedical domains have statistical similarities, especially for the chemical compound and the natural language sentence [24]. Some of the existing studies used character embedding on the SMILES input to represent chemical attributes [25, 26]. In the affinity prediction model, we apply character embedding after ECIF pre-processing for each atom.

After the pose prediction step, a ligand–protein instance has an average of 10 generated poses. First, we pre-process the list of atom data in the form of an ECIF fingerprint and apply character embedding. Every atom becomes a node and is expressed by the following equation:

$$\begin{aligned} V_l&= \{l_j \in C; j=1,2,\ldots,M\} \end{aligned}$$
(1)
$$\begin{aligned} V_p&= \{p_k \in C; k=1,2,\ldots,N\} \end{aligned}$$
(2)
$$\begin{aligned} adj&= \{e_{jk} \in \mathbb {R}^{M\times N}\} \end{aligned}$$
(3)

Our model has three types of inputs: a list of ligand atoms, a list of protein atoms of the binding pocket located close to the ligand, and the adjacency matrix that contains distances between the ligand and protein atoms. \(V_l\) is the set of ligand nodes and \(l_j\) is the embedded vector of the jth ligand atom. Likewise, \(V_p\) is the set of protein nodes and \(p_k\) is the embedded vector of the kth protein atom. C denotes the pool of atom embeddings. In Eq. 3, \(e_{jk}\) is the edge weight between the jth ligand atom and the kth protein atom. The edge weights are constant values between 0 and 1 obtained from the inverse of the distance between the two atoms. These inputs have the characteristics of a bipartite graph. The inputs are depicted in part (a) of Fig. 2.
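To make the adjacency input of Eq. 3 concrete, the following sketch builds an \(M\times N\) matrix of edge weights from ligand and pocket coordinates. The exact weighting function and distance cutoff are not specified in this section, so the form \(1/(1+d)\) and the 6.0 Å cutoff used here are assumptions chosen only because they yield weights between 0 and 1 that decrease with distance; the function name is hypothetical.

```python
import math

def adjacency_matrix(ligand_coords, pocket_coords, cutoff=6.0):
    """Return an M×N matrix of edge weights between ligand and pocket atoms.

    Assumed weighting: 1 / (1 + d) for distance d in Å, and 0.0 for pairs
    beyond the (assumed) cutoff, mirroring the filtered entries of Fig. 2a.
    """
    adj = []
    for lig in ligand_coords:
        row = []
        for prot in pocket_coords:
            d = math.dist(lig, prot)
            row.append(1.0 / (1.0 + d) if d <= cutoff else 0.0)
        adj.append(row)
    return adj

# Hypothetical coordinates: one ligand atom, two pocket atoms.
adj = adjacency_matrix([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0), (10.0, 0.0, 0.0)])
```

The first pocket atom lies 1 Å from the ligand atom and receives a weight of 0.5, whereas the second lies beyond the cutoff and is filtered to 0.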

$$\begin{aligned} In_l&= softmax(FC(l_j)FC(l_j)^T)FC(l_j) \end{aligned}$$
(4)

The transformer is well known for its performance, and its efficacy has been demonstrated on various datasets [27]. The transformer model introduced self-attention, which allows the extraction of different aspects from a sentence [28]. One of the advantages of the transformer model is that it learns long-range dependencies in the input, which makes the model capable of handling lengthy information. The self-attention module has been applied to drug-target interaction (DTI) [29, 30]. The DTI, or drug-protein interaction, task has an input structure similar to that of molecular docking. Since the self-attention module is applicable to long input, it is viable to apply the module to SMILES data. We implemented a simplified self-attention that predicts the outcome using only the ligand input. The process is described in part (b-1) of Fig. 2.

Our self-attention module is represented in Eq. 4. Fully connected (FC) layer wraps \(l_j\) atom embedding. We provide a detailed description of our self-attention module in the Supplementary Material, Difference with the original self-attention section. The main objective of this self-attention module is to support a transfer learning approach. Thus, we excluded some functionalities such as the multi-head or position embedding from the original transformer module. The purpose of transfer learning is to optimize a subset of the feature space. Therefore, it is common to freeze most of the model layers and only train certain layers [13]. We froze most of the parts, except the self-attention layer, for the transfer learning phase.
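A minimal pure-Python sketch of Eq. 4 is given below. It assumes that the three occurrences of FC in Eq. 4 denote one shared, bias-free linear layer, which is our reading of the notation and not necessarily the exact layer configuration; the helper names and sample values are hypothetical.

```python
import math

def linear(rows, weight):
    """Bias-free fully connected layer: rows (M×d) times weight (d×h)."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns] for row in rows]

def softmax(values):
    """Numerically stable softmax over one row of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def simplified_self_attention(ligand, weight):
    """Eq. 4: softmax(FC(l) FC(l)^T) FC(l), with a single shared FC (assumption)."""
    hidden = linear(ligand, weight)
    scores = [[sum(a * b for a, b in zip(h_i, h_j)) for h_j in hidden] for h_i in hidden]
    attention = [softmax(row) for row in scores]
    width = len(hidden[0])
    return [
        [sum(a * h_j[c] for a, h_j in zip(att_row, hidden)) for c in range(width)]
        for att_row in attention
    ]

# Hypothetical two-atom ligand with 2-dimensional embeddings and identity weights.
ligand = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
attended = simplified_self_attention(ligand, identity)
```

Because there is no multi-head projection or position embedding, the module has few parameters, which is convenient when it is the only part left trainable during transfer learning.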

$$\begin{aligned} l_j^{i + 1}&= l_j^{i} \odot \sum _{k=0}^N e_{jk} * p_k^{i} \end{aligned}$$
(5)
$$\begin{aligned} p_k^{i + 1}&= p_k^{i} \odot \sum _{j=0}^M e_{jk} * l_j^{i} \end{aligned}$$
(6)
$$\begin{aligned} In_{lp}&= adj \odot V_lV_p^T \end{aligned}$$
(7)
$$\begin{aligned} cat&= [sum(In_{lp}), FC_{dr}(In_{lp}), sum(In_l), FC_{dr}(In_l)] \end{aligned}$$
(8)
$$\begin{aligned} \hat{y}&= Wcat + b \end{aligned}$$
(9)

In Eq. 5, each atom in the ligand updates its node using the information of the connected atoms in the protein pocket. Similarly, in Eq. 6, each atom in the protein pocket is updated using the connected atoms of the ligand. The update process is repeated three times. The processes are depicted in part (b-2) of Fig. 2.
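The update rules of Eqs. 5 and 6 can be sketched in a few lines of Python. The sketch assumes that both updates in one iteration read the embeddings of the previous iteration, as the superscripts \(i\) and \(i+1\) suggest; the function names and the toy values are hypothetical.

```python
def weighted_sum(weights, vectors):
    """Sum of vectors scaled by their edge weights."""
    dim = len(vectors[0])
    return [sum(w * v[c] for w, v in zip(weights, vectors)) for c in range(dim)]

def update_step(ligand, protein, adj):
    """One application of Eqs. 5 and 6 using the embeddings of iteration i."""
    new_ligand = [
        [x * m for x, m in zip(l_j, weighted_sum(adj[j], protein))]
        for j, l_j in enumerate(ligand)
    ]
    adj_t = [list(col) for col in zip(*adj)]  # N×M view for the protein side
    new_protein = [
        [x * m for x, m in zip(p_k, weighted_sum(adj_t[k], ligand))]
        for k, p_k in enumerate(protein)
    ]
    return new_ligand, new_protein

def propagate(ligand, protein, adj, steps=3):
    """Repeat the bipartite update; three steps as stated in the text."""
    for _ in range(steps):
        ligand, protein = update_step(ligand, protein, adj)
    return ligand, protein

# Hypothetical toy graph: one ligand atom, two protein atoms.
ligand = [[1.0, 2.0]]
protein = [[3.0, 4.0], [5.0, 6.0]]
adj = [[0.5, 0.1]]
ligand_1, protein_1 = update_step(ligand, protein, adj)
```

An edge weight of 0 removes the corresponding protein atom from a ligand atom's neighbourhood, so filtered pairs do not contribute to the update.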

Finally, the ligand and protein pocket atoms are combined, followed by element-wise multiplication with the adjacency matrix (Eq. 7). \(In_{lp}\) is an intermediate matrix that aggregates ligand and protein information. To produce the result, we used fully connected layers and summation for dimension reduction and concatenated the resulting variables (Eq. 8). \(FC_{dr}\) is an FC layer with dropout. W and b are the weight matrix and bias of the final FC layer, respectively. \(\hat{y}\) in Eq. 9 represents the predicted result. Depending on the training set, the result could be \(IC_{50}\), Ki, or Kd. As the affinity prediction is a regression task, we used the mean squared error loss function. The process is illustrated in part (c) of Fig. 2.
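The following sketch illustrates Eq. 7, the final linear layer of Eq. 9, and the mean squared error loss on toy values. For brevity it uses only the summed feature of \(In_{lp}\) and omits the \(FC_{dr}\) terms and the self-attention terms of Eq. 8, so it is a simplified reading and not the full readout; all names and numbers are hypothetical.

```python
def interaction_matrix(ligand, protein, adj):
    """Eq. 7: adjacency matrix multiplied element-wise with V_l V_p^T."""
    return [
        [adj[j][k] * sum(a * b for a, b in zip(l_j, p_k)) for k, p_k in enumerate(protein)]
        for j, l_j in enumerate(ligand)
    ]

def matrix_sum(matrix):
    """Summed feature used for dimension reduction."""
    return sum(sum(row) for row in matrix)

def predict(features, weights, bias):
    """Eq. 9 for a single scalar output: W · cat + b."""
    return sum(w * f for w, f in zip(weights, features)) + bias

def mse(predictions, targets):
    """Mean squared error loss for the regression task."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical toy values: one ligand atom, two protein atoms, one filtered edge.
in_lp = interaction_matrix([[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]], [[0.5, 0.0]])
y_hat = predict([matrix_sum(in_lp)], weights=[2.0], bias=1.0)
loss = mse([y_hat], [10.0])
```

The summed feature is independent of the number of atoms, which lets complexes of different sizes share the same final layer.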

Results

In D3R GC4, participants are required to predict the affinity rankings of given small molecules against two target proteins, BACE and CatS. The numbers of given small molecules for the two datasets were 154 and 459, respectively. The main evaluation metrics are Kendall’s \(\tau\) and Spearman’s \(\rho\), as the primary purpose is to predict the rank correlation. There were several sub-challenges in D3R GC4, and we focused on rank prediction because this task is specifically about affinity prediction. In the case of the CatS protein, the previous Grand Challenge 3 (GC3) released data from experiments conducted in an environment similar to that of GC4. Hence, transfer learning is straightforward to apply.
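For reference, the two rank correlation metrics can be computed as in the following self-contained sketch. It implements Kendall's \(\tau\) in its tie-corrected form (tau-b) and Spearman's \(\rho\) as the Pearson correlation of average ranks; whether the challenge evaluation uses exactly these tie conventions is an assumption, and the sample affinities are hypothetical.

```python
import math

def kendall_tau(x, y):
    """Kendall's tau-b: concordant minus discordant pairs, corrected for ties."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x) * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var) if var else float("nan")

# Hypothetical experimental and predicted affinities for four ligands.
experimental = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 3.0, 2.0, 4.0]
tau, rho = kendall_tau(experimental, predicted), spearman_rho(experimental, predicted)
```

Both metrics depend only on the ordering of the predictions, so any monotonic rescaling of the predicted affinities leaves them unchanged.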

Because we utilized the CatS data from GC3, we additionally experimented on the GC3 CatS data. The number of given small molecules is 136. The experiment settings remain the same as the GC4 experiments.

Data statistics

As shown in Table 1, we used the PDBbind (v2019) dataset mainly for training and validation. It is common to preprocess the PDBbind dataset to filter out invalid data for training [19]. We derived a normal data range from the ChEMBL dataset [31, 32], which has a comment column indicating overly high or unreasonably low \(IC_{50}\) values. After filtering out target scores outside the normal range, the number of valid PDB instances is 5936 for Ki or Kd and 4475 for \(IC_{50}\). Because we used the generated conformations for each instance, the augmented training set contained on average ten times as many instances as the raw data. The D3R GC4 test set was based on \(IC_{50}\) values. D3R GC4 converted the \(IC_{50}\) values to kcal/mol before evaluation and released the code used for the conversion. Therefore, we also evaluated the predicted values in kcal/mol and centered them to a mean of 0 for training. The number of atom types is the number of unique atom types defined by the ECIF pre-processing. It was determined on the training data and did not change for the test set.
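The conversion and centering steps can be sketched as follows. The sketch uses the common approximation \(\Delta G \approx RT\ln(IC_{50})\) with \(IC_{50}\) in mol/L; the temperature of 298.15 K and the treatment of \(IC_{50}\) as a proxy for a dissociation constant are assumptions, and the constants of the released D3R conversion code are not reproduced here. The function names and sample values are hypothetical.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol·K)

def ic50_to_kcal_per_mol(ic50_molar, temperature=298.15):
    """Approximate binding free energy from an IC50 given in mol/L (assumed form)."""
    return GAS_CONSTANT * temperature * math.log(ic50_molar)

def center(values):
    """Shift the targets so that their mean is 0."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

# Hypothetical IC50 values of 1 nM, 100 nM, and 10 µM, expressed in mol/L.
energies = [ic50_to_kcal_per_mol(v) for v in (1e-9, 1e-7, 1e-5)]
centered = center(energies)
```

A lower \(IC_{50}\) maps to a more negative energy, so the ranking of compounds is preserved by the conversion.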

The PDBbind database provides experimentally measured affinities for interactions between proteins and ligands with the corresponding PDB. However, owing to the nature of handling atomic units, many instances of PDBbind data are of inferior quality. For that reason, PDBbind provides two distinguished datasets; structurally ordered “refined sets” and low-quality “general sets.” Unfortunately, in the case of the \(IC_{50}\) dataset, almost all instances fall into the category of the “general set.” We used the PDBbind dataset as the raw data source for pre-training.

Table 1 Dataset statistics

Result on D3R GC4: target protein BACE

Other models from leading participant groups of GC4 BACE

Since our model uses the challenge data as a test set, we introduce the top group models in D3R GC4. We sorted the results by Spearman correlation and then selected the top group. Combined models of Skeledock and Kdeep achieved the best performance in the BACE test set [9, 33]. Skeledock performed pose prediction using template-query mapping and Kdeep performed affinity prediction using a convolutional neural network (CNN). The CACTVS chemo-informatics toolkit is a combination of several molecular docking techniques located in the leading group of the BACE test set [34]. Lastly, the GNINA model used an ensemble of CNNs for affinity prediction, and Monte Carlo chain sampling for pose prediction [35]. In some cases of the GNINA model’s submission, pose prediction was performed using only conformation superimposition. We followed the same method for pose prediction, as the process is straightforward and a large amount of data can be obtained.

In the after challenge section in Table  2, we also experimented with ECIF and GBM under the same conditions as our model, such as re-docking [19]. Since the pre-processing is almost identical, we passed the same pose prediction data to the GBM model and reported its performance.

The comparative assessment of scoring functions-2016 (CASF-2016) is a benchmark dataset for affinity prediction (Ki or Kd) given a structural dataset [36]. Concerning the co-crystallized data, ECIF and GBM models showed outstanding performance in CASF-16 dealing with Ki or Kd. The GBM model and similar boosting based machine learning algorithms are robust against small noise in the dataset [37]. However, machine learning techniques should be carefully applied to data with high disturbance in the input features. Regarding cases where input data requires cross-docking, the noise inside data is a significant issue, and there is room for the application of deep learning. One of the deep learning features is to obtain generalizable results using data that involves a certain amount of noise for training.

Result analysis

Table 2 Evaluation results on D3R GC4: BACE

Given a test set from challenges, there are several ways to select a model that can be applied to the test set. After dividing the pre-training data into a training set and a validation set, we saved a model that surpassed the previously saved model at validation for each training iteration. After training, we loaded and ran the model on the test set.

We applied transfer learning. Since we do not have data similar to the test data, we gathered chemicals and \(IC_{50}\) values for the BACE target protein from ChEMBL. The test data were released after the challenge in 2018. To avoid a situation where test data are included in the training set, we excluded data from 2018, 2019, and 2020. Since ChEMBL data instances do not contain three-dimensional structure data, we used the same cross-docking method as we did in creating the test data. When deciding the number of epochs for transfer learning, two cases should be considered. If additional learning is performed with almost the same data distribution as the test set, we do not need to be concerned about overfitting. However, if the distribution is different, as in the case of the target protein BACE, it is better to finish transfer learning before convergence, that is, before the training loss stops changing [38]. We performed transfer learning for 50 epochs, half the number of epochs used for pre-training.

Table 2 shows the performance of our model alongside the top group results in the D3R GC4 challenge for the BACE target protein. A participant could make multiple submissions to the challenge. We show only the best performance of each participant for concise presentation. Our model outperformed the best baseline result by 9%.

The ECIF + GBM model has lower performance than our BGNN; the GBM model has difficulties with noisy data. In the case of the BACE data, we retrieved additional chemical interaction data from ChEMBL and used it to train the GBM model. However, this process did not narrow the gap between the BACE test set and the training set.

Result on D3R GC4: target protein CatS

Other models from leading participant groups of GC4 CatS

Here, we introduce the top group models of the CatS target protein. ICM-dock achieved the best performance in the CatS test set [39]. Unlike other models in the leading group, the ICM-dock model hardly used deep learning. The ICM-dock model faced the problem of selecting the optimal conformer among multiple ligand/protein conformations, as in our case. The authors of ICM claimed that using most conformers through an ensemble technique might address the protein flexibility problem in molecular docking. MathDL was located in the leading group in the CatS test set [22]. MathDL converted the data into a bipartite graph between two atomic types using geometric relations. MathDL uses low-dimensional, and translationally invariant graph features. deepscaffopt achieved its best result in the CatS dataset, and it was the only ligand–based model in the leading group; however, this model has not yet been published. The CNDO model was ranked fourth in the CatS test set [40]. The CNDO model can calculate molecular orbital energies from the ligand geometry.

Table 3 Evaluation results on D3R GC4: CatS

Result analysis

For this CatS dataset, the previous GC3 released data on non-overlapping ligand compounds for the same target protein. We trained our model on the PDBbind dataset and applied transfer learning using GC3 CatS data. We prepared two models for the CatS data: one that freezes all modules except self-attention in the transfer learning phase, and one that does not freeze any module. Freezing the modules other than self-attention during transfer learning aims to prevent overfitting and to adapt the self-attention module to the additional data. In contrast, the model without freezing is presented because overfitting can help if the test data and the additional data for transfer learning are similar. We assumed that preparing an overfitted model is feasible in cases such as CatS, where the data distribution can be predicted to some extent in advance.

Table 3 presents the performance of our model and the top group results in the D3R GC4 challenge for the CatS target protein. The model that froze modules during transfer learning performed worse than the model without freezing. The performance of our model without freezing does not match that of the top three participants; however, it improves on the BACE case in Table 2. The difference between the BACE and CatS results depends on whether the data used for transfer learning have a distribution similar to that of the test set.

The ECIF + GBM model performed almost the same as the basic BGNN model. The GBM model performs better here than in the other experiments because the quality of the training data is guaranteed to some extent. In the CASF-16 challenge, in which the training data was the crystal pose itself, the GBM model performed strongly [19].

Result on D3R GC3: target protein CatS

Other models from leading participant groups of GC3 CatS

Next, we introduce the top group models in D3R GC3, stage 2, for the CatS protein. ICM-dock and ScaffOpt are the same models that appeared in the leading participant group of GC4 CatS. Dock\(\_\)close [41, 42] used several open-source libraries such as Smina, Open Babel, and Omega2. Its authors found the closest compound among known bound ligands using Babel 2.3.2 and derived the affinity score from the best Smina score. Amber method 1 [43] used the Antechamber tool of AmberTools16 [44], which utilizes the general Amber force field (GAFF). AmberTools [45] is a package of molecular simulation libraries, and GAFF is a molecular mechanical force field for the simulation of biomolecules.

Table 4 Evaluation results on D3R GC3: CatS

Result analysis

Compared to the GC4 CatS dataset, the number of instances of GC3 CatS is small. For this dataset, we used the same approach as the BACE. We trained our model on the PDBbind dataset and applied transfer learning using a CatS target protein from ChEMBL data. For fair comparison, we excluded data starting from 2017 in the additional data for transfer learning.

Table 4 shows the performance of our model alongside the top group results in the D3R GC3 challenge for the CatS target protein. Our model ranks third by Kendall’s \(\tau\) and fourth by Spearman’s \(\rho\). The overall performances of participants are similar to those in GC4 BACE.

In this experiment, the ECIF + GBM model did not perform well. For the GBM model, the additional ChEMBL set was not helpful in predicting the test set. The GC3 CatS data has a slightly more complicated pattern than GC4 BACE, as can be seen from the fact that our BGNN performed worse on GC3 CatS.

Fig. 3

Predicted structure overlay of the docked ligands in CatS_206: The upper panel is the SMILES visualization of the CatS_206 ligand. The lower panel shows three generated conformers of the ligand–protein instance CatS_206. Our pose prediction generates up to 30 conformers for a ligand–protein instance. We use a data augmentation method that induces probabilistic deviation of atoms rather than minimizing the noise during the pose prediction step. Although our method increased the degree of randomness, the process also made it possible to train on extended data, which is helpful for robust performance through deep learning. Figure generated with RDKit and PyMOL

Discussion

We selected the top three participants for each of the two target proteins. The BACE and CatS data have quite different patterns, as no participant appears in the top three for both targets. By its nature, an \(IC_{50}\) value is strongly influenced by the experimental environment. Therefore, when data similar to the test set are available, transfer learning helps us predict the test set more accurately.

The results for CatS are higher than those for BACE because GC3 data, obtained from experiments in the same environment, had been released previously. When predicting new data with a model trained only on PDBbind data, it is difficult to expect the same performance as in the CatS case. Nevertheless, if the amount of data is sufficiently large, a model can tolerate a certain degree of statistical fluctuation. When estimating \(IC_{50}\) values for a pool of compounds against a specific protein, transfer learning can be a good method. We gathered experimental \(IC_{50}\) values for the BACE target protein from ChEMBL and used these data for transfer learning, which yielded good performance.

Analysis of the results showed that our model performed slightly worse than the top group for the CatS target protein. We regard the pose prediction process as the cause of this degradation. This is consistent with the observation of the authors of the MathDL study, who pointed out that relying on existing pose prediction libraries has limitations [22]. They attempted to predict poses with their own deep learning model in D3R GC4 and created docking poses using a generative adversarial network. In our case, using the superimposed pose likely produced inaccurate docking poses. As shown in Fig. 3, the atom positions differ significantly when many conformations are generated.

In Fig. 4, we plot a fitting graph for the CatS data. Square dots represent the test set data (D3R GC4) and triangle dots represent the transfer learning data (D3R GC3). Because the fitting graph compares actual and predicted values directly, it is a more rigorous evaluation than the rank prediction used as the evaluation metric in the challenge. We include the transfer learning data in the graph to emphasize that they do not follow the same distribution as the test set, even though the model has already been trained on them.
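The difference between value-level fitting and rank-based evaluation can be made concrete with a small sketch. The values below are placeholders, and the closed-form least-squares fit is a generic stand-in for the trendlines in Fig. 4, not the plotting code used in the study. Predictions shifted by a constant keep a perfect ranking while still carrying a non-zero error in the actual values.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def same_ordering(a, b):
    """True when both sequences order their items identically (no ties assumed)."""
    order_a = sorted(range(len(a)), key=a.__getitem__)
    order_b = sorted(range(len(b)), key=b.__getitem__)
    return order_a == order_b


# Placeholder affinities, not data from the study.
actual = [5.0, 6.0, 7.0, 8.0]
predicted = [a + 1.0 for a in actual]  # constant offset keeps the ranking intact

slope, intercept = fit_line(actual, predicted)
print(same_ordering(actual, predicted))
print(rmse(actual, predicted))
print(slope, intercept)
```

A rank metric would score these predictions as perfect, whereas the trendline intercept and the error in the values expose the offset.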

The D3R GC4 challenge was held in 2018, whereas we used the PDBbind dataset v2019. We also experimented with PDBbind v2017 as the pre-training dataset; the results are in the Supplementary Material, in the section "GC4 CatS and BACE result based on pre-trained model with PDBbind dataset v2017". The results using PDBbind v2017 are not much different from those using v2019.

Fig. 4

CatS affinity fitting graph: Square dots represent the test set data (D3R GC4) and triangle dots represent the transfer learning data (D3R GC3). The lines are linear regression trendlines for each dataset. The two datasets come from experiments in a similar environment, but as the graph shows, their distributions differ slightly

Hyperparameter

Table 5 summarizes the hyperparameter search range. We implemented our models using CUDA-enabled PyTorch. Our model used the Adam optimizer with a learning rate of 0.0005. We applied dropout in the input, middle, and last layers, and all dropout layers use the same rate. We tried several different numbers of epochs when training the model on the PDBbind dataset. Because the data distribution of PDBbind differs from that of the test set, we stopped training when the loss on the validation set had roughly converged. The final mini-batch size was set to 20. Since we group up to 30 conformations from the same ligand–protein pair, a mini-batch in which every pair has 30 conformers contains 600 instances. We trained on an Nvidia Titan RTX graphics card, using 20 GB of VRAM.
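The batch-size arithmetic above can be checked with a small sketch. It assumes, hypothetically, that a mini-batch is built from 20 ligand–protein pairs whose conformers are flattened into individual training instances, and that conformers beyond the cap of 30 are dropped. The function name and data layout are illustrative; the data loader in the released code may be organized differently.

```python
MAX_CONFORMERS = 30   # cap per ligand-protein pair, as stated in the text
PAIRS_PER_BATCH = 20  # mini-batch size, as stated in the text


def make_batches(pairs, pairs_per_batch=PAIRS_PER_BATCH, max_conformers=MAX_CONFORMERS):
    """Group pairs into mini-batches and flatten their conformers.

    `pairs` maps a pair identifier to a list of conformers (any objects).
    Each yielded batch is a list of (pair_id, conformer) tuples.
    """
    ids = list(pairs)
    for start in range(0, len(ids), pairs_per_batch):
        batch = []
        for pair_id in ids[start:start + pairs_per_batch]:
            for conformer in pairs[pair_id][:max_conformers]:
                batch.append((pair_id, conformer))
        yield batch


# Hypothetical input: 45 pairs; the first 20 have 30 conformers, the rest have 12.
pairs = {f"pair_{i}": list(range(30 if i < 20 else 12)) for i in range(45)}
batch_sizes = [len(batch) for batch in make_batches(pairs)]
print(batch_sizes)
print(MAX_CONFORMERS * PAIRS_PER_BATCH)  # upper bound on instances per batch
```

Because the number of conformers varies between pairs, the number of instances per mini-batch varies as well, with 600 as the upper bound.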

Performance change of our model through excluding features

Table 5 Hyperparameter search range
Table 6 Ablation study on D3R GC4: target protein CatS

We performed an ablation study to evaluate the effectiveness of several features of our model. We removed the features one at a time to track changes in performance. Each reported result is the average of five runs on the CatS target protein dataset from D3R GC4. Table 6 shows the results of the ablation study.

Effectiveness of self-attention The first of the ablation studies was the evaluation without a self-attention module. The Spearman correlation was 0.54 and Kendall’s \(\tau\) was 0.47. The ligand-based prediction module exhibited good performance, as in the case of deepscaffopt.

Effectiveness of transfer learning For the next experiment, we skipped transfer learning. The Spearman correlation was 0.17 and Kendall’s \(\tau\) was 0.11. The low performance is attributed to the data distribution of the CatS set, which is quite different from that of PDBbind. Moreover, the number of CatS instances in PDBbind was insufficient. If the target protein data are insufficient in the training data, it is important to perform transfer learning with additional data generated by cross-docking or with other experimental data obtained in a similar environment as in this case.

Effectiveness of BGNN Without the BGNN main module of our model, we are left with a ligand-based, self-attention module. The Spearman correlation was 0.57 and Kendall’s \(\tau\) was 0.39. The performance of the ligand-based model was slightly lower than that of the model using an additional graph neural net. Our BGNN model utilizes protein-ligand relationship information, as even the same ligand can exhibit entirely different \(IC_{50}\) values depending on the target protein the ligand binds to.

Conclusions

AutoDock Vina is an open-source library capable of pose prediction, and it is one of the most widely used libraries among the GC4 participants. In the evaluation results of GC4 pose prediction, the mean RMSD of the participants who used AutoDock Vina ranges from 0.76 to 44.07 Å. If we remove some outliers, the maximum RMSD with AutoDock Vina becomes 10.34 Å. Pose prediction libraries such as AutoDock Vina require a certain level of human intervention or feature engineering to determine the set of parameters for different data inputs. The complex feature engineering in the pose prediction framework implies that the overall performance of molecular docking depends on specific hyperparameters. Moreover, it can be challenging to reproduce the same pose prediction because some of the features are not deterministic.
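For reference, the RMSD figures quoted above compare predicted and experimental atom positions. The following is a minimal sketch that assumes the atoms of the two poses are already matched one-to-one and share the same coordinate frame; it performs no alignment or symmetry correction, which dedicated evaluation tools usually handle. The coordinates are placeholders.

```python
from math import sqrt

ACCEPTABLE_RMSD = 2.0  # Å, the commonly used success threshold


def rmsd(predicted, reference):
    """Root mean square deviation between two equally ordered coordinate lists."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("poses must contain the same non-zero number of atoms")
    squared = sum(
        (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
        for (px, py, pz), (rx, ry, rz) in zip(predicted, reference)
    )
    return sqrt(squared / len(predicted))


# Placeholder coordinates in Å, not taken from any structure.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (1.5, 1.5, 1.0)]  # shifted 1 Å along z

value = rmsd(predicted, reference)
print(value, value < ACCEPTABLE_RMSD)
```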

We aimed to reduce feature engineering in molecular docking. We used a straightforward pose prediction method without complex features. The drawback was that the accuracy of this simple method was not outstanding. After our pose prediction, the predicted positions of atoms contain probabilistic deviation around the reference ligand. Table 2 shows that the existing ML model could be unstable when handling noisy data. We used deep learning with minimal feature engineering to exploit the noisy data, since noisy data can help prevent overfitting in deep learning. To make our approach reproducible, we released our entire code as open source.

Existing cross-docking approaches suffer from the probabilistic deviation introduced by pose prediction and from data scarcity. We proposed a straightforward pipeline for affinity prediction using BGNN and transfer learning based on a re-docking dataset. We showed that our BGNN model achieved competitive results on the D3R GC4 dataset. For the BACE set, our model outperformed the best participant in GC4 by 9%. For the CatS set, our model performed competitively. We demonstrated the robustness of our model by evaluating it on two challenge datasets with different patterns.