Mining flexible-receptor docking experiments to select promising protein receptor snapshots

Machado, Karina S; Winck, Ana T; Ruiz, Duncan DA; Norberto de Souza, Osmar

doi:10.1186/1471-2164-11-S5-S6

Mining flexible-receptor docking experiments to select promising protein receptor snapshots

Proceedings
Open access
Published: 22 December 2010

Volume 11, article number S6, (2010)
Cite this article

Download PDF

You have full access to this open access article

BMC Genomics Aims and scope Submit manuscript

Mining flexible-receptor docking experiments to select promising protein receptor snapshots

Download PDF

Karina S Machado¹,
Ana T Winck²,
Duncan DA Ruiz² &
…
Osmar Norberto de Souza¹

3471 Accesses
19 Citations
Explore all metrics

Abstract

Background

Molecular docking simulation is the Rational Drug Design (RDD) step that investigates the affinity between protein receptors and ligands. Typically, molecular docking algorithms consider receptors as rigid bodies. Receptors are, however, intrinsically flexible in the cellular environment. The use of a time series of receptor conformations is an approach to explore its flexibility in molecular docking computer simulations, but it is extensively time-consuming. Hence, selection of the most promising conformations can accelerate docking experiments and, consequently, the RDD efforts.

Results

We previously docked four ligands (NADH, TCL, PIF and ETH) to 3,100 conformations of the InhA receptor from M. tuberculosis. Based on the receptor residues-ligand distances we preprocessed all docking results to generate appropriate input to mine data. Data preprocessing was done by calculating the shortest interatomic distances between the ligand and the receptor’s residues for each docking result. They were the predictive attributes. The target attribute was the estimated free-energy of binding (FEB) value calculated by the AutodDock3.0.5 software. The mining inputs were submitted to the M5P model tree algorithm. It resulted in short and understandable trees. On the basis of the correlation values, for NADH, TCL and PIF we obtained more than 95% correlation while for ETH, only about 60%. Post processing the generated model trees for each of its linear models (LMs), we calculated the average FEB for their associated instances. From these values we considered a LM as representative if its average FEB was smaller than or equal the average FEB of the test set. The instances in the selected LMs were considered the most promising snapshots. It totalized 1,521, 1,780, 2,085 and 902 snapshots, for NADH, TCL, PIF and ETH respectively.

Conclusions

By post processing the generated model trees we were able to propose a criterion of selection of linear models which, in turn, is capable of selecting a set of promising receptor conformations. As future work we intend to go further and use these results to elaborate a strategy to preprocess the receptors 3-D spatial conformation in order to predict FEB values. Besides, we intend to select other compounds, among the million catalogued, that may be promising as new drug candidates for our particular protein receptor target.

Background

The pharmaceutical industry is under pressure to increase the rate with which it delivers new drugs to the market [1]. At present, the time to place a new drug into the market is between 10 to 15 years and the costs involved are estimated in 800 million dollars [2]. Due to these reasons there are current efforts towards changing these figures, for instance, by reducing the timeline and costs, and increasing the quality of the candidate drugs.

Advances in molecular biology and in computer modelling and simulation tools have had a direct impact in the drug discovery process, making viable the rational drug design (RDD) [3] approach. In-silico based RDD is a four-step cycle that combines structural information and computational efforts [4] based on a detailed understanding of the target protein (or receptor) and ligand interactions. In this sense, molecular docking algorithms are applied to evaluate and find the best ligand position and conformation inside the receptor binding site.

Nowadays, the majority of molecular docking algorithms consider only the ligand as flexible while the receptor remains rigid since it has far more atoms and consequently has a much greater number of degrees of freedom. It is computationally very expensive to consider the receptor flexibility [5] in molecular docking. Conversely, biological macromolecules like protein receptors are intrinsically flexible in their cellular environment. Therefore, it is very important to consider the receptor flexibility during molecular docking and, consequently, during RDD [6] because frequently the receptor can modify its shape upon ligand binding, moulding itself to be complementary to its ligand, increasing favourable contacts and reducing adverse interactions, thus minimizing the total free energy of binding (FEB) [7].

There are a number of alternative ways to incorporate at least part of the receptor flexibility. These have been reviewed by Teodoro and Kavraki [8], Totrov and Abagyan [9], Cozzini et al. [6], Huang and Zou [5], Wong [10], Alonso et al. [11] and Chandrika et al. [12]. Among these methods there are the approaches that consider one receptor conformation like the soft docking [13], the approach presented by Apostolakis et al. [14] and the methods that permit some mobility of the side-chains of the receptor binding site as devised by Leach [15] and the ones that use a rotameric library [16, 17]. There are a large number of approaches that consider a set of receptor conformations. Some of these approaches combine the structures on a grid like the methods proposed by Knegtel et al.[18] and Österberg et al.[19]. Other approaches perform a series of docking experiments considering in each one a different receptor conformation. According to Teodoro and Kavraki [8] the first use of multiple structures derived from a molecular dynamics (MD) simulation was by Pang and Kozikowski [20]. Lin et al. [21, 22] developed the relaxed complex scheme (RCS) to accommodate receptor flexibility in the search for correct receptor-ligand conformation. More recently, Amaro et al. [23] presented extensions of the RCS method which improves computational efficiency by reducing the receptor ensemble to a set of representative configurations.

In this work, we chose to model the explicit receptor flexibility by performing a series of molecular docking experiments considering in each one a different receptor snapshot derived from a MD simulation [24].

Target receptor and ligands

Our target protein receptor is the InhA enzyme from Mycobacterium tuberculosis (MTB) [25]. This enzyme represents an important target to tuberculosis control [26]. Data from WHO [27] reports that about 9 million people will develop tuberculosis (TB) each year in the world and , at the same time, this disease will cause almost 2 million deaths. Furthermore, one third of the world’s population is infected with MTB [27, 28]. More alarming is the growth of TB cases resistant to isoniazid and other anti-TB drugs [29]. In summary, these problems make it paramount to find alternative inhibitors for this enzyme.

To illustrate the receptor flexibility, the 2.2 Å 3-D crystal structure (PDB ID: 1ENY) of InhA obtained from the Protein Data Bank (PDB) [30] can be viewed in Figure 1, together with four averaged conformations or snapshots extracted from different regions of the InhA 3,100 ps MD simulation trajectory [31]. Although simple, this example serves only to illustrates how flexible, by adopting different conformations, is the InhA receptor.

In this work we considered four different ligands, TCL [32], PIF [33], ETH [34] and NADH [25], which are summarized in Table 1. The ligands 3-D structures are illustrated in Figure 2. These structures were obtained either from the PDB [30] and ZINC [35], or generated by ab initio quantum mechanical methods [26].

Table 1 Names, abbreviations and the number of atoms of the ligands considered in this work.

Full size table

The 3,100 InhA receptor conformations (or snapshots) were obtained from a MD simulation trajectory as described in [31]. Considering this set of snapshots we performed molecular docking experiments [24] for each of the four ligands described. After the execution of over 3,000 docking experiments, for each ligand, as a result we have a large amount of data that need to be dissected to produce useful information about the receptor-ligands interactions. Then, we preprocessed all docking results and snapshots from the MD simulation and stored them into a proper repository developed and introduced in Winck et al, 2009 [36].

Our contribution

In this article we propose a methodology to mine data from fully flexible-receptor molecular docking experiments, looking for receptor snapshots to which a particular ligand, amongst the four investigated here, binds more favourably.

We expect that our strategy can lead to the use of a significantly smaller number of snapshots in docking experiments with other ligands obtained from virtual libraries of small molecules. This advance is expected in the future to accelerate the molecular docking step of the drug discovery process while maintaining the level of detail with which we treat the receptor flexibility. To the best of our knowledge such an approach has never been explored. Amaro et al. [23] has also investigated snapshot selection from a MD trajectory. However in their approach they use QR factorization which measures the structural similarity between all pairs of Cα atoms among all the MD snapshots aligned.

To achieve our goal of snapshots selection we systematically preprocessed our molecular docking results and submitted them to the M5P model tree machine learning algorithm [37]. The model trees generated by this method were further post-processed. From these results we propose a criterion of selection of linear models (LMs) capable of picking out a set of conformations which are the most promising receptor conformations from the initial 3,100 snapshots for each ligand.

Methods

Full receptor flexibility from molecular dynamics simulation

The MD simulation of the InhA-NADH complex was performed for 3,100 ps as described in Schroeder et al. [31]. In this simulation the conformations were saved at every 0.5 ps which resulted in 6,200 (3,100 divided by 0.5) snapshots. However, for our work we took conformations at every 1.0 ps which resulted in 3,100 snapshots. This set of instantaneous receptor structures or snapshots was used to represent the full receptor explicit flexibility during the flexible-receptor docking procedure [24].

Performing molecular docking experiments

For every ligand we submitted 3,100 docking experiments, with 10 runs each, using the scientific workflow proposed by Machado et al [24] using the simulated annealing (SA) protocol in the docking software Autodock3.0.5 [38]. In our laboratory we executed different docking experiments considering both flexible and rigid ligands. After evaluating the docking results, for the four ligands considered here, we did not observe differences that justified the use of flexible ligands in the docking experiments of the fully-flexible receptor. These findings were interesting because they allowed us to concentrate only in flexibility aspects of the receptor which is the most computationally demanding task in this type of receptor-ligand docking simulations. In summary, for the work presented here we used 3,100 snapshots of the MTB InhA enzyme receptor to represent its full explicit flexibility and all four ligands had their conformations kept rigid. With 10 runs per docking simulation we ended up with 31,000 results (receptor-ligand complexes and their estimated FEB) per experiment per ligand.

The M5P model tree algorithm

As our entire attribute values in our mining input data are numeric and according to Han & Kamber [39] the most widely approach for numeric prediction is regression, we decided to explore this machine learning task.

Despite the lack of consensus in the data mining literature about the most understandable task result there is a reasonable agreement that representations such as decision trees and rule sets are better understood than black box representations such as Support Vector Machines or Neural Networks [40]. Decision trees have the advantage of being graphical representation of discovered knowledge and the tree hierarchical structure can point to information about the importance of the attributes used for prediction [40].

For prediction there are two main types of trees: regression trees and model trees. The main difference between these trees is in the content of the leaves. Each leaf in a regression tree stores a continuous-valued prediction that corresponds to the average value of the predicted attribute for the training tuples that get to the leaf. By contrast, in model trees, each leaf holds a regression model – a multivariate linear equation for the predicted value [39]. Owing to the multivariate linear equations generated by the model trees we decide to apply this algorithm to our input data. The model tree algorithm used in this work is the M5P [41] available in the WEKA package [42]. M5P handles tasks with very high dimensionality [41]. Indeed, our data mining input files have more than one hundred numeric attributes (shortest distances to residues) including the target attribute (the estimated FEB ). In our work, the M5P results can be especially useful because they could present an equation that properly weights the predictive attributes.

In model trees the input space is recursively partitioned until the data at the leaf nodes constitute subsets relatively homogeneous, so that a linear model can explain their variability [37]. Then, these linear models can be used to quantify the contribution of each predictive attribute to predict the target attribute.

Application Tools

• AutoDock3.0.5 [38] is a suite of computer programs to perform automated molecular docking. It was developed to predict how small molecules bind to a receptor 3-D structure active site.

• AMBER 6.0 [43] is a suite of programs to energy-minimize and perform MD simulations of bio-molecules. It consists of a substructure database, a force field parameter file and also of a variety of utility programs. Ptraj, one of its modules, processes the trajectory files generated by the MD simulations.

• WEKA [42] is a collection of machine learning algorithms for data mining tasks. WEKA contains tools for data pre-processing and analyses by classification, regression, clustering, and association rules, as well as for their visualization.

• In-house developed Python scripts are used to pre- and post-process the data. For instance, to process the molecular docking outputs and to generate the proper mining inputs.

Results

The molecular docking outputs

The output of a molecular docking simulation by AutoDock3.0.5 is a complex text file at the end of which the final docking results are summarized. An example of such a summary is shown in Figure 3. Each run result is mainly composed of three values highlighted by rectangular boxes in Figure 3: (a) Root Mean-Squared Deviation (RMSD), which indicates how distant the final ligand position is from its initial position; (b) the values of the estimated FEB and its corresponding inhibition constant (Ki); (c) the final 3-D coordinates of the ligand atoms.

Since we have 3,100 receptor snapshots and each docking simulation was set up with 10 runs, the maximum number of different ligand conformations inside the InhA active site is 31,000. However, during the docking simulations some of the runs either did not converge or had a positive FEB value. These data were considered outliers and were left out of the preprocessing of the docking results. Table 2 summarizes the docking results: for each ligand we have two lines; one with the results of all successful runs (maximum of 31,000) and another with only the best FEB of each docking simulation (maximum of 3,100). These are called the total number of valid results in the third column of Table 2.

Table 2 Results of the molecular docking simulations.

Full size table

As can be seen from Table 2, some docking simulations did not converge or had positive FEB values for PIF, TCL, and ETH, while many did not converge or had positive FEB for NADH, except for the NADH best FEB.

Preprocessing inputs to data mining experiments

As exemplified at the end of the Background section, a virtual screening using a standard docking simulation of one receptor protein with 13 millions of compounds in the ZINC [35] library would take a long and an unacceptable time to complete. Adding full flexibility to the receptor, as we propose in this work, this time would be extraordinarily much bigger, turning virtual screening with this type of receptor flexibility model literally impossible and an efficient RDD process impractical. Our main interest is in developing ways to analyze and explore the data presented above in order to design some efficient strategy to speed up docking experiments with fully-flexible receptor. Our working hypothesis is: it is possible to perform in silico docking simulations for a given receptor-ligand pair, employing a reduced number of receptor snapshots, but still maintaining its full flexibility model. Consequently, we hope to answer the following question: “How to select a subset of snapshots, of the fully flexible-receptor model, which are most relevant to indicate whether a given ligand is a promising compound?”

The proposed approach to address our hypothesis and to start to answer the above question is to manage the large amount of data involved in docking simulations with a fully-flexible receptor model: the ligands, snapshots from MD simulations of the receptor and the results of their docking experiments. With this in mind, we will explore which data mining task can help us discover relationships in the receptor-ligands complexes. In this paper, we concentrate efforts on a regression mining algorithm to attempt to answer our working question. Our main contribution is in developing strategies focused on both preprocessing of data and post processing of mining results, aiming at obtaining predictive models to improve snapshot selection.

Handling molecular docking simulation outputs and receptor’s snapshots

The FReDD repository [36] stores all features about the receptor snapshots, the ligands, and the docking simulation results used in this work. This repository allows easy retrieval of its information to produce comprehensive data to be mined.

The fully-flexible receptor model of the MTB InhA enzyme contains 3,100 snapshots, each with 4,008 atoms. This gives a total of 12,424,800 records for the atomic coordinates of the receptor.

The same calculation done for the InhA receptor is done for the four ligands used in this work. After the valid docking simulations (see Table 2) of the fully-flexible receptor with each ligand we end up with 568,768 records for NADH, 510,660 for TCL, 730,080 for PIF, and 395,590 for ETH (Table 3).

Table 3 Data size for preprocessing.

Full size table

In summary, we have 12,424,800 and 2,223,098 records for the receptor and ligands, respectively.

Generating the mining input data

Predictive data mining algorithms concern with building predictive models that present the best relationships among a set of attributes, called predictive attributes, and a given attribute, called target attribute [44]. Regression tasks describe and distinguish the target attribute, which must be numeric, such that the resulting models can be used to predict the linear model (LM) to which the predictive attributes belong. In this work we use as predictive attribute the shortest distance between all atoms in the ligand and in the receptor’s residues, measured in ångströms (Å), and for each docking result. Figure 4 illustrates this concept. Its shows some distances between the PIF ligand and the receptor residue GLY95. From all of the calculated distances we consider only the shortest one. In this example, the shortest distance is 2.72 Å.

Therefore, for each snapshot there will be 268 such attributes which is equivalent to its number of residues. This procedure was repeated for the other three ligands. To generate the mining input data we needed to combine the 12,424,800 coordinate records of the receptor with the 2,223,098 records of all four ligands’ coordinates. It means that, applying Definition 1 below, which looks for the shortest distances between the receptor’s residues and ligand atoms, we have about 7 trillions of records for NADH, 6 trillion of records for TCL; 9 trillion of records for PIF, and 5 trillions of records for ETH.

Definition 1

The input data to mine for a specific ligand l ∈ {NADH, PIF, TCL, ETH} is a set S of tuples <ssn, lc, SD, feb>, S⊆SSN ×LC ×Powerset(SD)×F, such that:

• ssn is the receptor snapshot number,

• lc is the ligand l conformation,

• feb is the corresponding free energy of binding obtained from the docking experiment performed between snapshot ssn and ligand lc,

• SD is a list of 268 distance values sd_i, sd_i ∈ SD, between atoms from the snapshot ssn and ligand lc, and:

○ 1 ≤ i ≤ 268 corresponds to the i^th protein receptor residue,

○ ssn_a_ij is the j^th atom of the i^th protein receptor residue, and ssn_a_ij.x, ssn_a_ij.y, ssn_a_ij.z are its corresponding spatial coordinates in ssn,

○ lc_a_k is the k^th atom of lc, and lc_a_k.x, lc_a_k.y, lc_a_k.z are its corresponding spatial coordinates in lc,

○ d_ijk = SQRT((ssn_a_ij.x - lc_a_k.x)² + (ssn_a_ij.y - lc_a_k.y)² + (ssn_a_ij.z - lc_a_k.z)²) is the Euclidian distance between ssn_a_ij and lc_a_k ,

○ sd_i = MIN(d_ijk), where i is the i^th protein residue, ij corresponds to the j^th atom of the i^th protein receptor residue, and k corresponds to the k^th atom of lc ligand.

Having these results, the next step is to arrange them into an appropriate input file or table which is composed of 271 attributes. Table 4 describes part of such an input file for PIF. The first column contains the number of the receptor snapshot in the same order as it appears in the MD simulation trajectory. The second column contains the ligand conformation for each of the 10 runs in a docking simulation. The next 268 columns hold the shortest distance found for each receptor residue, and, finally, the last column contains the estimated FEB value which is our target attribute.

Table 4 Example of part of a mining input data file for PIF. See text for details.

Full size table

Improving the input file

The largest distance value that allows a biologically meaningful contact between receptor and ligand atoms is 4.0 Å [45, 46]. Hence, distances higher than 4.0 Å means that the corresponding receptor residue does not establish a direct contact with some atom of the ligand.

To improve the quality of the models, we removed all attributes (residues) to which the shortest distance to any ligand is bigger than 5.0 Å (recall Definition 1): S’ ⊆S is a set of tuples s’ = <ssn, lc, SD’, feb>, where

• ssn, lc and feb have the same meaning as in Definition 1,

• SD’ is a list of n values, n ≤268, and

• ∀sd_i ∈ SD, sd_i ∈ SD’ ↔ ∃s_m ∈ S | s_m = <ssn_m, lc_m, SD_m, feb_m> ^ sd_j ∈ SD_m ^i = j ^ sd_j≤ 5.0 Å.

We chose 5.0 Å in order to consider a 1.0 Å margin of risk. After this feature selection, instead of the 268 original residues for each ligand, we ended up with 106, 122, 121 and 128 receptor residues, respectively, for NADH, PIF, TCL and ETH.

Applying the M5P model tree algorithm

To obtain linear models to select the most promising snapshots having the described input data we need to apply a data mining algorithm that would be capable to predict the FEB value based on the shortest receptor residues-ligand distances, establishing a relationship between them. We choose the M5P model tree algorithm [41].

We performed one experiment applying M5P for each ligand, based on the instances in the preprocessed input files. In addition, we removed their first two columns. Among the parameters available in M5P, we concentrated in calibrating parameters related to legibility and accuracy of the generated model trees. The M5P minimum number of instances parameter is related to the size of the resulting model tree and the number of linear models (LM) generated by the algorithm. Accordingly, we set this parameter to 1,000 for all mining experiments.

To exemplify the result of M5P we present in Figure 5 the model tree for the NADH ligand that has 10 nodes with 11 LMs. This tree is in the format of the M5P algorithm output using WEKA. Each node corresponds to a receptor residue and the decisions in the tree are related to the residues distances to the NADH ligand.

Each leaf node is a LM in the form described by Equation (1) that corresponds to the LM1 in the NADH model tree described in Figure 5.

FEB = 0.0003*PHE22 - 0.0001*THR38 + 0.005*ILE46 + 0.0012*LEU62 + 0.0041*ASP63 + 0.0001*VAL91 - 0.0002*HIE92 + 0.0006*THR100 - 0.0007*GLY101 + 0.0013*LEU167 + 0.0061*VAL188 - 0.0002*GLY191 + 0.0039*ILE193 + 0.0036*ILE214 + 0.0122*TRP229 - 0.0051*ALA259 - 10.5158

The results for all ligands are summarized in Table 5. In this table we show, for each ligand, the number of instances and the number of attributes submitted to the algorithm; the number of nodes in the tree and its depth; the total number of LMs produced followed by correlation. On the basis of the correlation values, we were able to generate very satisfactory models. These models produced short and understandable trees. We can see that for NADH, TCL and PIF we obtained more than 95% correlation. ETH, however, showed the worst correlation, which was about 60%. The reason why ETH correlation is so low is because it is pro-drug. It binds to the InhA receptor protein active site as an adduct with NADH (ETH-NADH). In the docking experiments explored here we considered only the ETH molecule. Alone, ETH explored a larger area of the receptor active site that is actually not available to it in its inhibitory form (ETH-NADH adduct). Despite the large difference in Correlation as compared to the other three ligands, its 60% correlation is a very satisfactory value. We know from visual inspection that ETH alone preferred to bind to or near the active site region of InhA where it binds as ETH-NADH inhibitory adduct.

Table 5 Results of the mining experiments with M5P.

Full size table

Despite the overall high quality of the correlation values, we still need to apply the model trees on a test set to evaluate whether they can really help us to achieve proper snapshot selections. In face of this, from now on we focus our attention in formulating a strategy to identify the LMs capable of predicting, and thus selecting, the most promising receptor snapshots.

Post processing and evaluating the models results

As our objective is to select snapshots we need to establish a selection criterion of LMs. That is, we have to identify which are the best LMs in order to traverse the tree and select the snapshots that belong to the selected LMs. Therefore, the snapshots in the instances classified in the selected LMs indicate the most promising receptor snapshots to be considered in future docking simulations. That evaluation phase is composed of three steps as follows:

1. We traverse the produced trees with our test set to identify which instances belong to which LM.

2. We carefully establish a criterion to select representative LMs;

3. We evaluate whether the selected snapshots are indeed promising.

Selecting representative LMs

Each ligand test set is composed of the receptor snapshot collection considering only the docking results with best FEBs (Lines 2, 4, 6, and 8 of Table 2). We chose to have one test set for each ligand because, with exception of NADH, the other three ligands are chemically distinct, but with similar biological role: they are inhibitors of the InhA receptor function.

We started by implementing Python scripts to map the instances according to the LMs of each model tree. These scripts verify to which LMs belongs each one of the instances of the test set. Having completed this mapping, we are ready to indicate which LMs are representatives, and thus use them to select the snapshots. We have as a premise that good snapshots are those ones that present low FEB values. However, such values vary from one snapshot to another. We take as a starting point the average FEB (column 2 of Table 2) from each test set. For each LM, we calculate the average FEB for their associated instances. From these values we set up the following LM selection strategy: a given LM is considered representative if its average FEB is smaller (more negative) than or equal to the average FEB of the test set.

To exemplify our methodology of LMs selection we consider the results for the PIF ligand. Table 6 describes in the columns 1 and 4 the LMs; columns 2 and 5 the total number of instances for each LM while columns 3 and 6 contain the average FEB values for each LM. Based on our strategy, as the average FEB of the PIF test set is -9.9 Kcal/mol (Table 2) the correspond LMs selected for this ligand are LM1, LM2, LM3, LM5 and LM7 (highlighted in bold in Table 6).

Table 6 Analysis of the LMs generated for the PIF ligand.

Full size table

Tables 7, 8 and 9 have the same format of Table 6 and describe all the results obtained for all the other three ligands: NADH, TCL and ETH. For example, for NADH (Table 7) the average FEB is -12.90 kcal/mol. We can observe that only LM11 has an average FEB smaller than this value. Consequently, LM11 is selected for NADH. All the selected LMs for each ligand are highlighted in the tables.

Table 7 Analysis of the LMs generated for the NADH ligand.

Full size table

Table 8 Analysis of the LMs generated for the TCL ligand.

Full size table

Table 9 Analysis of the LMs generated for the ETH ligand.

Full size table

Discussion

To verify which snapshots were selected and whether these were the best ones we carefully evaluated their related FEB values. In doing so we organized the instances of each test set according to the FEB in an ascending order (FEB list). Then, we investigated if the selected snapshots were at the top of this list (consequently, with the most negative FEB values). As a result we obtained the data described in Table 10. In column 1 we have the ligand names; in columns 2, 3 and 4 the total number of selected snapshots that are in the top 10 FEB list, top 100 FEB list and top 1,000 FEB list, respectively. Column 5 shows the total number of selected snapshots and the total number of snapshots for each ligand test set.

Table 10 Results of the analyzes of the LMs for all four ligands in this study.

Full size table

Based on the data described in Table 10 we can notice that our snapshot selection strategy worked well for all four ligands. For NADH and PIF, from the 10, 100 and 1,000 best FEBs, our method worked best, selecting almost 100% of the best snapshots. For ETH, our method selected the 10 best ones, 92% of the 100 best ones and 617 of the 1,000 best ones. However, for this ligand the method selected less snapshots (617 out of 902). Nonetheless this represents almost 70% of the 1,000 best FEBs. The worst results were obtained for TCL. Only 60% of the top 1,000 FEB list were selected.

In this paper our main contribution is the snapshot selection strategy which was capable of picking up the most promising receptor snapshots based on their conformations. Even if some of the selected snapshots did not have the best FEB values, it is important to test them for they have may have a promising conformation. Besides the analysis about the most promising snapshots, the model trees can indicate the most important residues to predict the good and bad values of FEB. With the results described in this paper we are able to discuss, for instance, about the residue THR100 of NADH model tree (Figure 5). As we show in the results section, for this ligand our methodology selected just one LM, the LM11 (Table 7). If you observe in the model tree, all the snapshots in which the distance of the residue THR100 to the NADH ligand is bigger than 11.49 Å are considered promising snapshots. The discussion about the consequences of this result is beyond the scope of this work but it is a important particularity that must be investigated by a domain specialist and is part of our future work.

Conclusions

Molecular docking experiments that consider fully flexible-receptor involve typically several types of data describing receptor and ligand conformations and generate a huge amount of data. We analysed docking experiments of the highly flexible InhA enzyme receptor from M. tuberculosis with four distinct ligands: NADH, PIF, TCL, and ETH. These experiments were conceived for snapshots obtained from a 3,100 ps MD simulation trajectory. Currently, we have a total of 100,504 valid docking results, 12,424,800 receptor coordinates, and 2,223,098 ligand coordinates. Having these docking results, our goal was to propose a methodology to knowledge discovery, that is, to select the most promising receptor conformations, by identifying snapshots characteristics for their selection. To achieve this goal we used the M5P model tree algorithm for data mining, aiming at identifying representative LM for snapshot selection.

We systematically preprocessed our molecular docking results, mined, and further post processed them to select a set of the most promising receptor conformations from the initial 3,100 snapshots. Preprocessing was done by calculating the shortest interatomic distance between the ligand and the receptor’s residues, for each docking result. They were the predictive attributes. The target attribute was the FEB value. We proposed a strategy to select LMs that can represent the most promising snapshots: the ones to be selected. Our results showed that the inferred model trees were able to select snapshots properly with high correlation values, except for ETH. As future work we intend to further the use of the most promising selected snapshots to perform virtual screening in small molecule libraries using a reduced set of snapshots, that still represents a fully-flexible model of the InhA protein receptor, thus reducing the time to find new druggable compounds candidates for new treatment against tuberculosis.

References

Lyne PD: Structure-based Virtual Screening: an Overview. Drug Discov Today. 2002, 7: 1047-1055. 10.1016/S1359-6446(02)02483-2.
CAS PubMed Google Scholar
Caskey CT: The Drug Development Crisis: Efficiency and Safety. Annu Rev Med. 2007, 58: 1-16. 10.1146/annurev.med.58.042705.124037.
CAS PubMed Google Scholar
Lybrand TP: Ligand-protein Docking and Rational Drug Design. Curr Opin Struct Biol. 1995, 5: 224-228. 10.1016/0959-440X(95)80080-8.
CAS PubMed Google Scholar
Kuntz ID: Structure-based Strategies for Drug Design and Discovery. Science. 1992, 257: 1078-1082. 10.1126/science.257.5073.1078.
CAS PubMed Google Scholar
Huang S, Zou X: Emsemble Docking of Multiple Protein Structures: Considering Protein Structural Variations in Molecular Docking. Proteins. 2007, 66: 399-421.
CAS PubMed Google Scholar
Cozzini P, Kellogg GE, Spyrakis F, Abraham DJ, Constantino G, Emerson A, Fanelli F, Gohlke H, Kuhn LA, Morris GM, Orozco M, Pertinhez TA, Rizzi M, Sotriffer CA: Target Flexibility: An Emerging Consideration in Drug Discovery and Design. J Med Chem. 2008, 51: 6237-6255. 10.1021/jm800562d.
CAS PubMed PubMed Central Google Scholar
Verkhivker GM, Bouzida D, Gehlhaar DK, Rejto PA, Freer ST, Rose PW: Complexity and Simplicity of Ligand-macromolecule Interactions: The Energy Landscape Perspective. Curr Opin Struct Biol. 2002, 12: 197-203. 10.1016/S0959-440X(02)00310-X.
CAS PubMed Google Scholar
Teodoro ML, Kavraki LE: Conformational Flexibility Models for the Receptor in Structure Based Drug Design. Curr Pharm Des. 2003, 9: 1419-1431. 10.2174/1381612033454685.
Google Scholar
Totrov M, Abagyan R: Flexible Ligand Docking to Multiple Receptor Conformations: a Practical Alternative. Curr Opin Struct Biol. 2008, 18: 178-184. 10.1016/j.sbi.2008.01.004.
CAS PubMed PubMed Central Google Scholar
Wong CF: Flexible Ligand–Flexible Protein Docking in Protein Kinase Systems. Biochim Biophys Acta. 2008, 1784: 244-251. 10.1016/j.bbapap.2007.10.005.
CAS PubMed Google Scholar
Alonso H, Bliznyuk AA, Gready JE: Combining Docking and Molecular Dynamic Simulations in Drug Design. Med Res Rev. 2006, 26: 531-568. 10.1002/med.20067.
CAS PubMed Google Scholar
Chandrika B, Subramanian J, Sharma SD: Managing Protein Flexibility in Docking and its Applications. Drug Discov Today. 2009, 14: 394-400. 10.1016/j.drudis.2009.01.003.
Google Scholar
Jiang F, Kim S-H: Soft Docking: Matching of Molecular Surface Cubes. J Mol Biol. 1991, 219: 79-102. 10.1016/0022-2836(91)90859-5.
CAS PubMed Google Scholar
Apostolakis J, Plückthun A, Caflisch A: Docking Small Ligands in Flexible Binding Sites. J Comput Chem. 1998, 19: 21-37. 10.1002/(SICI)1096-987X(19980115)19:1<21::AID-JCC2>3.0.CO;2-0.
CAS Google Scholar
Leach AR: Ligand Docking to Proteins with Discrete Side-chain Flexibility. J Mol Biol. 1994, 235: 345-356. 10.1016/S0022-2836(05)80038-5.
CAS PubMed Google Scholar
Althus E, Kohlbacher O, Lenhof HP, Muller P: A Combinatorial Approach to Protein Docking with Flexible Side Chains. J Comput Biol. 2002, 9: 597-612. 10.1089/106652702760277336.
Google Scholar
Jones G, Willett P, Glen RC, Leach AR, Taylor R: Development and Validation of a Genetic Algorithm for Flexible Docking. J Mol Biol. 1997, 267: 727-748. 10.1006/jmbi.1996.0897.
CAS PubMed Google Scholar
Knegtel RMA, Kuntz ID, Oshiro CM: Molecular Docking to Ensembles of Protein Structures. J Mol Biol. 1997, 266: 424-440. 10.1006/jmbi.1996.0776.
CAS PubMed Google Scholar
Österberg F, Morris GM, Sanner MF, Olson AJ, Goodsell DS: Automated Docking to Multiple Target Structures: Incorporation of Protein Mobility and Structural Water Heterogeneity in AutoDock. Proteins. 2002, 46: 34-40. 10.1002/prot.10028.
PubMed Google Scholar
Pang Y-P, Kozikowski AP: Prediction of the Binding Sites of Huperzine A in Acetylcholinesterase by Docking Studies. J Comput Aided Mol Des. 1994, 8: 669-681. 10.1007/BF00124014.
CAS PubMed Google Scholar
Lin J-H, Perryman AL, Schames JR, McCammon JA: Computational Drug Design Accommodating Receptor Flexibility: the Relaxed Complex Scheme. J Am Chem Soc. 2002, 124: 5632-5633. 10.1021/ja0260162.
CAS PubMed Google Scholar
Lin J-H, Perryman AL, Schames JR, McCammon JA: The Relaxed Complex Method: Accommodating Receptor Flexibility for Drug Design with an Improved Scoring Scheme. Biopolymers. 2003, 68: 47-62. 10.1002/bip.10218.
CAS PubMed Google Scholar
Amaro RE, Baron R, McCammon JA: An Improved Relaxed Complex Scheme for Receptor Flexibility in Computer-aided Drug Design. J Comput Aided Mol Des. 2008, 22: 693-705. 10.1007/s10822-007-9159-2.
CAS PubMed PubMed Central Google Scholar
Machado KS, Schroeder EK, Ruiz DD, Norberto de Souza Norberto O: Automating Molecular Docking with Explicit Receptor Flexibility Using Scientific Workflows. Lect Notes Comput Sci. 2007, 4643: 1-11. 10.1007/978-3-540-73731-5_1.
Google Scholar
Dessen A, Quémard A, Blanchard JS, Jacobs WR, Sacchettini JC: Crystal Structure and Function of the Isoniazid Target of Mycobacterium tuberculosis. Science. 1995, 267: 1638-1641. 10.1126/science.7886450.
CAS PubMed Google Scholar
Oliveira JS, Sousa ESH, Norberto de Souza O, Moreira IS, Santos DS, Basso LA: Slow-onset Inhibition of 2-trans-Enoyl ACP (CoA) Reductase from Mycobacterium tuberculosis by an Inorganic Complex. Curr Pharm Des. 2006, 12: 2409-2424. 10.2174/138161206777698927.
CAS PubMed Google Scholar
World Health Organization: Gender and Tuberculosis. WHO Report 2002. 2002
Google Scholar
World Health Organization: Global Tuberculosis Control: Surveillance. Planning. Financing. 2009, WHO Report 2009
Google Scholar
Kruh NA, Rawat R, Ruzsicska BP, Tonge PJ: Probing Mechanisms of Resistance to the Tuberculosis Drug Isoniazid: Conformational Changes Caused by Inhibition of InhA, the Enoyl Reductase from Mycobacterium tuberculosis. Protein Sci. 2007, 16: 1617-1627. 10.1110/ps.062749007.
CAS PubMed PubMed Central Google Scholar
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.
CAS PubMed PubMed Central Google Scholar
Schroeder EK, Basso LA, Santos DS, Norberto de Souza O: Molecular Dynamics Simulation Studies of the Wild-Type, I21V, and I16T Mutants of Isoniazid-Resistant Mycobacterium tuberculosis Enoyl Reductase (InhA) in Complex with NADH: Toward the Understanding of NADH-InhA Different Affinities. Biophys J. 2005, 89: 876-884. 10.1529/biophysj.104.053512.
CAS PubMed PubMed Central Google Scholar
Kuo MR, Morbidoni HR, Alland D, Sneddon SF, Gourlie BB, Staveski MM, Leonard M, Gregory JS, Janjigian AD, Yee C, Musser JM, Kreiswirth B, Iwamoto H, Perozzo R, Jacobs WR, Sacchettini JC, Fidock DA: Targeting Tuberculosis and Malaria Through Inhibition of Enoyl Reductase: Compound Activity and Structural Data. J Biol Chem. 2003, 278: 20851-20859. 10.1074/jbc.M211968200.
CAS PubMed Google Scholar
Oliveira JS, Sousa EHS, Basso LA, Palaci M, Dietze R, Santos DS, Moreira IS: An Inorganic Iron Complex that Inhibits Wild-type and an Isoniazid-Resistant Mutant 2-trans-enoyl-ACP (CoA) Reductase from Mycobacterium tuberculosis. Chem Commun. 2004, 3: 312-313.
Google Scholar
Wang F, Langley R, Gulten G, Dover LG, Besra GS, Jacobs WR, Sacchettini JC: Mechanism of Thioamide Drug Action Against Tuberculosis and Leprosy. J Exp Med. 2007, 204: 73-78. 10.1084/jem.20062100.
CAS PubMed PubMed Central Google Scholar
Irwin JJ, Shoichet B: ZINC - A Free Database of Commercially Available Compounds for Virtual Screening. J Chem Inf Model. 2005, 45: 177-182. 10.1021/ci049714+.
CAS PubMed PubMed Central Google Scholar
Winck AT, Machado KS, Norberto de Souza O, Ruiz DD: FReDD: Supporting Mining Strategies Through a Flexible-receptor Docking Database. Lect Notes Comput Sci. 2009, 5676: 143-146. 10.1007/978-3-642-03223-3_13.
CAS Google Scholar
Quinlan JR: Learning with Continuous Classes. Proceedings of the Fifith Australian Joint Conference on Artificial Intelligence; Singapore: World Scientific. 1992, 343-348.
Google Scholar
Morris GM, Goodsell DS, Halliday RS, Huey R, Hart WE, Belew RK, Olson AJ: Automated Docking Using a Lamarckian Genetic Algorithm and Empirical Binding Free Energy Function. J Comput Chem. 1998, 19: 1639-1662. 10.1002/(SICI)1096-987X(19981115)19:14<1639::AID-JCC10>3.0.CO;2-B.
CAS Google Scholar
Han J, Kamber M: Data mining: concepts and techniques. 2006, San Francisco: Morgan & Kaufmann
Google Scholar
Freitas AA, Wieser DC, Apweiler R: On the Importance of Comprehensible Classification Models for Protein Function Prediction. Proceedings of the IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2010, IEEE/ACM Trans Comput Biol Bioinform, 99: 172-182.
Google Scholar
Wang Y, Witten IH: Inducing Model Trees for Continuous Classes. Proceedings of the Ninth European Conference on Machine Learning. 1997, 128-137.
Google Scholar
Witten IH, Frank E: Data Mining: Practical machine learning tools and techniques. 2005, San Francisco: Morgan & Kaufmann
Google Scholar
Case DA, Cheatham-III TE, Darden T, Gohlke H, Luo R, Merz KM, Onufriev A, Simmerling N, Wang B, Woods R: The AMBER Biomolecular Simulation Programs. J Comput Chem. 2005, 26: 1668-1688. 10.1002/jcc.20290.
CAS PubMed PubMed Central Google Scholar
Tan P-N, Steinbach M, Kumar V: Introduction to data mining. 2006, Boston: Addison Wesley
Google Scholar
Jeffrey GA: An introduction to hydrogen bonding. . 1997, UK: Oxford University Press
Google Scholar
da Silveira CH, Pires DEV, Minardi RC, Ribeiro C, Veloso CJM, Lopes JCD, Meira W, Neshich G, Ramos CHI, Habesch R, Santoro MM: Protein Cutoff Scanning: A comparative Analysis of Cutoff Dependent and Cutoff Free Methods for Prospecting Contacts in Proteins. Proteins. 2009, 74: 727-743. 10.1002/prot.22187.
CAS PubMed Google Scholar

Download references

Acknowledgements

This work was supported by grants from MCT/CNPq (14/2008 deliberative council) to DDR and in part by grants 410505/2006-4 and 312027/2006-0 to ONS. ONS is a CNPq Research Fellow. ATW is supported by CT-INFO/CNPq (17/2007 deliberative council) PhD scholarship. KSM is supported by a CAPES PhD scholarship.

This article has been published as part of BMC Genomics Volume 11 Supplement 5, 2010: Proceedings of the 5th International Conference of the Brazilian Association for Bioinformatics and Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/11?issue=S5.

Author information

Authors and Affiliations

LABIO - Laboratório de Bioinformática, Modelagem e Simulação de Biossistemas. PPGCC, Faculdade de Informática, PUCRS, Av. Ipiranga, 6681 – Prédio 32, sala 602, 90619-900, Porto Alegre, RS, Brazil
Karina S Machado & Osmar Norberto de Souza
GPIN - Grupo de Pesquisa em Inteligência de Negócio. PPGCC, Faculdade de Informática, PUCRS, Av. Ipiranga, 6681 – Prédio 32, sala 628, 90619-900, Porto Alegre, RS, Brazil
Ana T Winck & Duncan DA Ruiz

Authors

Karina S Machado
View author publications
You can also search for this author in PubMed Google Scholar
Ana T Winck
View author publications
You can also search for this author in PubMed Google Scholar
Duncan DA Ruiz
View author publications
You can also search for this author in PubMed Google Scholar
Osmar Norberto de Souza
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Osmar Norberto de Souza.

Additional information

Competing interests statement

The authors declare that they have no competing interests.

Authors' contributions

KSM and ATW executed the preprocessing for the data mining experiments, performed all the data mining experiments and post processing, evaluated the models results and wrote the first draft of the article. DDAR helped to conceive the test cases, to generate different data mining inputs, to perform the data mining experiments and to evaluate the models. ONS helped to define the problem, to analyze the results and to write the final version of the article. All authors read and approved the final manuscript.

Karina S Machado, Ana T Winck contributed equally to this work.

Rights and permissions

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Machado, K.S., Winck, A.T., Ruiz, D.D. et al. Mining flexible-receptor docking experiments to select promising protein receptor snapshots. BMC Genomics 11 (Suppl 5), S6 (2010). https://doi.org/10.1186/1471-2164-11-S5-S6

Download citation

Published: 22 December 2010
DOI: https://doi.org/10.1186/1471-2164-11-S5-S6

Mining flexible-receptor docking experiments to select promising protein receptor snapshots