Using desolvation energies of structural domains to predict stability of protein complexes

Original Article

Abstract

Employing domain knowledge for prediction of particular types of protein–protein interactions (PPIs) is a problem that has become increasingly important in the past few years, due to the fundamental role of domains in protein function. We propose a model to predict obligate and non-obligate protein interaction types using desolvation energies of structural domains that are present in the interfaces of protein complexes, which are extracted from the CATH database. The prediction is performed using several state-of-the-art classification techniques, including linear dimensionality reduction, a support vector machine based on sequential minimal optimization, naive Bayes, and k-nearest neighbour. Our results on two well-known datasets demonstrate that (a) domain-based features of higher levels of CATH, especially level 2, are more powerful and discriminative than features of other levels, and (b) properties taken from different levels of the CATH hierarchy yield higher accuracies than properties taken from each level of the hierarchy separately. Furthermore, analysis of structural properties suggests that domain–domain interactions that have at least a mainly-beta secondary structure in one sub-unit are more informative for predicting obligate and non-obligate PPIs.

Keywords

Protein–protein interaction; Domain–domain interaction; CATH; Complex type prediction

1 Introduction

Domains can be considered to be the minimal and fundamental units of proteins, which have a clear biological role and act as basic functional units within cells (Chen et al. 2009; Dwivedi et al. 2013). Recent studies focus on employing domain knowledge to predict protein–protein interactions (Maleki et al. 2012; Hall et al. 2012; Zaki 2009; Zaki et al. 2009; Singhal and Resat 2007; Akutsu and Hayashida 2009; Chandrasekaran et al. 2013). This is based on claims that only a few highly conserved residues are crucial for protein–protein interactions (De et al. 2005; Eichborn et al. 2010), and that most domains and domain–domain interactions (DDIs) are evolutionarily conserved (Singh et al. 2013). As a consequence, it has been observed that proteins interact if a domain in one protein interacts with a domain in the other protein (Caffrey et al. 2004; Park and Bolser 2001). There are a number of domain family resources that can be applied for this purpose, such as Pfam (Finn et al. 2010) and CATH (class, architecture, topology, and homologous superfamily) databases (Cuff et al. 2009).

An important problem surrounding PPIs is the identification and prediction of different types of complexes, which are characterized by properties such as similarities between subunits (homo/hetero-oligomers), number of subunits involved in the interaction (dimers, trimers, etc.), duration of the interaction (transient vs. permanent), and stability of the interaction (non-obligate vs. obligate), among others. We focus on the prediction of obligate and non-obligate complexes. It is important to be able to distinguish between obligate and non-obligate complexes, since non-obligate interactions are more difficult to study and understand due to their instability and short life, while obligate interactions are more stable (Jones and Thornton 1996).

Choosing relevant features is very important for successful prediction. Features are the observed properties of each sample that are used for prediction. Some studies on PPIs consider a wide range of features for predicting obligate and non-obligate complexes, including solvent accessibility (Shanahan and Thornton 2005; Zhu et al. 2006), geometry (Lawrence and Colman 1993), hydrophobicity (Young 1994; Glaser et al. 2001), sequence-based features (Mintseris and Weng 2003), desolvation energy (Maleki et al. 2012; Hall et al. 2012; Rueda et al. 2010a, b; Aziz et al. 2011) and, more recently, electrostatic energies (Vasudev and Rueda 2012). In this study, we use desolvation energies, which have been shown to be very efficient for PPI prediction (Rueda et al. 2010a, b).

To study the behavior of obligate and non-obligate interactions using domain knowledge, De et al. (2005) used interactions between residues to find obligate and non-obligate residue contacts of PPIs. The study concluded that non-obligate interfaces occupy less than 2 % of the area of the domain surfaces, while the area occupied by obligate interfaces ranges from 0 to 6 %. In Eichborn et al. (2010), the interfaces of 750 transient DDIs (interactions between domains that are part of different proteins) and 2,000 obligate DDIs were studied. The interactions between domains of one amino acid chain were analyzed to obtain a better understanding of molecular recognition and to identify frequent amino acids in the interfaces and on the surfaces of protein complexes. Also, in Park et al. (2009), the domain information from protein complexes was used to predict four different types of interactions, including transient enzyme inhibitor/non enzyme inhibitor and permanent homo/hetero obligate complexes. In this way, the physical interaction between proteins can be better analyzed in terms of interactions among their structural domains. In Maleki et al. (2011), a prediction model was proposed in which Pfam domains were used to predict obligate and non-obligate PPIs. The results demonstrated that desolvation energies are more efficient and powerful than interface area and composition properties for prediction. Moreover, a visual and numerical analysis of the DDIs present in these two types of complexes showed that different pairs of DDIs can be identified in obligate and non-obligate complexes and highlighted that homo-DDIs are more likely to be present in obligate interactions.

In one of our recent works (Maleki et al. 2012), a domain-based model to predict obligate and non-obligate PPIs was presented in which structural domains from the CATH database were taken as the input features. That model used desolvation energies of amino acid pairs present in the interface of DDIs as features for prediction. The results show that DDIs at higher levels in the CATH hierarchy, especially those at the architecture and topology levels (levels 2 and 3), provide the best prediction performance. Whereas our previous efforts in Maleki et al. (2012) focused on the case in which all DDIs were taken from the same level of the CATH hierarchy, we extended our approach in Hall et al. (2012) to cover the more general case in which each domain can be represented at one of a number of possible levels. We restricted our efforts to levels 2 and 3 of the CATH hierarchy, which have been shown to be very efficient for prediction.

This work extends the work presented in Hall et al. (2012) by incorporating a wider range of classification techniques, namely LDR, SVM-SMO, NB, and k-NN, and a numerical analysis focused on selecting relevant structural properties. The results on two pre-classified datasets from Zhu et al. (2006) and Mintseris and Weng (2005), obtained with different classification methods, confirm that using DDIs from different levels of the CATH hierarchy as prediction properties yields better performance than using DDIs of individual levels to predict obligate and non-obligate PPIs. Furthermore, by grouping the DDI feature vectors of the second level of the CATH hierarchy based on their secondary structures, it is shown that most of the interactions are between domains that have mainly-beta structures. Also, the prediction results for each group of DDIs from level 2 demonstrate that DDIs composed of mainly-beta structural elements, especially DDIs of mainly-beta with some alpha–beta, are the most discriminative properties for predicting obligate and non-obligate PPIs using SVM-SMO and k-NN.

2 Datasets and prediction properties

Two pre-classified datasets of obligate and non-obligate protein complexes were obtained from the studies of Zhu et al. (2006) and Mintseris and Weng (2005). The first dataset contains 75 permanent (obligate) and 62 non-obligate interactions, while the second contains 115 obligate and 212 non-obligate interactions. These datasets were obtained from the literature and manually curated by the authors of Mintseris and Weng (2005) and Zhu et al. (2006) by removing inconsistent complex types and homologous protein sequences.

2.1 Desolvation energy

In this study, desolvation energies are used as the prediction properties, which have been shown to be very efficient for the prediction of obligate and non-obligate complexes (Rueda et al. 2010a; Maleki et al. 2011). Desolvation energy is defined as knowledge-based contact potential (accounting for hydrophobic interactions), self-energy change upon desolvation of charged and polar atom groups, and side-chain entropy loss. As in Camacho and Zhang (2005), the binding free energy \(\Updelta G_{\rm bind}\) is defined as follows:
$$\Updelta G_{{\rm bind}} = \Updelta E_{{\rm elec}}+\Updelta G_{{\rm des}} \, ,$$
(1)
where \(\Updelta E_{\rm elec}\) is the total electrostatic energy and \(\Updelta G_{\rm des}\) is the total desolvation energy. For a protein, \(\Updelta G_{\rm des}\) is defined as follows:
$$\Updelta G_{\rm des} = g(r)\Upsigma \Upsigma e_{ij} \, .$$
(2)

If we consider the interaction between the ith atom of a ligand and the jth atom of a receptor, then \(e_{ij}\) is the atomic contact potential between them and g(r) is a smooth function of their distance r (Zhang et al. 1997). For simplicity, we take the smooth function to be linear. For an interaction to contribute, the two atoms must be within 7 Å of each other: for atoms that are less than 5 Å apart, the value of g(r) is 1, and between 5 and 7 Å, g(r) decreases smoothly from 1 to 0 (Camacho and Zhang 2005).
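The distance weighting and the double sum of Eq. (2) can be illustrated with a minimal Python sketch. It assumes that g(r) is applied to each atom pair individually (it depends on the pair distance), that it falls linearly from 1 at 5 Å to 0 at 7 Å, and that atoms are given as (type, coordinates) tuples; the data layout, the function names and the contact-potential table are hypothetical and are not taken from FastContact.

```python
import math

def g(r, inner=5.0, outer=7.0):
    """Linear smoothing: 1 up to `inner` angstroms, 0 from `outer` angstroms on."""
    if r <= inner:
        return 1.0
    if r >= outer:
        return 0.0
    return (outer - r) / (outer - inner)

def desolvation_energy(ligand_atoms, receptor_atoms, contact_potential):
    """Sum g(r) * e_ij over all ligand-receptor atom pairs.

    Atoms are (atom_type, (x, y, z)) tuples and contact_potential maps
    (type_i, type_j) to e_ij; both layouts are assumptions of this sketch.
    """
    total = 0.0
    for type_i, pos_i in ligand_atoms:
        for type_j, pos_j in receptor_atoms:
            r = math.dist(pos_i, pos_j)  # Euclidean distance (Python 3.8+)
            total += g(r) * contact_potential[(type_i, type_j)]
    return total
```

Pairs farther apart than 7 Å receive weight 0 and therefore drop out of the sum, which matches the cutoff described above.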

2.2 Domain-based properties

We consider structural CATH domains (Cuff et al. 2009) in this study. The CATH database is organized in a hierarchical fashion, which can be visualized as a tree with levels numbered from 1 to 8, hereafter referred to as L1–L8 (Cuff et al. 2009). Domains at upper levels of the tree represent more general classes of structure than those at lower levels. For example, domains at level 1 represent mainly-alpha (c1), mainly-beta (c2), mixed alpha–beta (c3) and few secondary structures (c4), whereas those at level 2 represent more specific structures. As shown in Fig. 1, roll, beta barrel, and two-layer sandwich are three different sample architectures of domains in class c3. Domains at level 3 are even more specific, and so on.
Fig. 1

Four levels of the CATH hierarchy (class, architecture, topology and homologous superfamily)

To extract domain-based properties, we first collected the 3D structures of each complex in our datasets from the Protein Data Bank (Berman et al. 2000). Then, we collected the domain information for each complex from the CATH database and added this information to each atom present in the chain. Complexes that did not have domain information in at least one of their subunits were discarded. We refer to these two new datasets as the MW and ZH datasets. The new MW dataset contains 100 permanent (obligate) and 161 non-obligate interactions, while the new ZH dataset contains 72 obligate and 55 non-obligate interactions.

After identifying all unique domains present in the interface of at least one complex in the datasets, the desolvation energies for all pairs of domains (DDIs) were calculated using Eq. (2). For each ligand–receptor (protein–protein) pair, if we found any duplicate DDIs during calculation, we simply computed the cumulative desolvation energy across all occurrences of that DDI. A domain is considered to be in the interface if it has at least one residue interacting with a domain in the other chain. In this study, two types of domain-based properties are considered.

2.2.1 Domain-based properties at the individual level

Since the CATH database is organized in a hierarchical scheme, in Maleki et al. (2012), a separate dataset of feature vectors was created for each level of the CATH hierarchy. After calculating the desolvation energies for all DDIs in level 8, the desolvation energy of each DDI at a higher level was calculated as the sum of the desolvation energies of the corresponding DDIs at the next lower level. After pre-processing the datasets, all zero-columns, which represent DDIs that were not present in any complex, were removed. More details about the generation of domain-based feature vectors for each level are given in Maleki et al. (2012).
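The level-by-level summation can be sketched in a few lines of Python. The sketch assumes that domains are identified by dotted CATH codes (for example `3.30.70.100`), that a DDI is an unordered pair of such codes, and that truncating a code to its first fields gives its ancestor at that level; the function names are hypothetical. Summing directly from the lowest level gives the same totals as summing one level at a time.

```python
from collections import defaultdict

def truncate(cath_code, level):
    """Keep the first `level` fields of a dotted CATH code: '3.30.70.100' -> '3.30'."""
    return ".".join(cath_code.split(".")[:level])

def roll_up(ddi_energies, level):
    """Sum DDI energies after truncating both domains to `level`.

    ddi_energies maps (domain_a, domain_b) -> energy for one complex.
    Pairs are stored in sorted order so that (a, b) and (b, a) coincide,
    which is an assumption of this sketch.
    """
    summed = defaultdict(float)
    for (dom_a, dom_b), energy in ddi_energies.items():
        key = tuple(sorted((truncate(dom_a, level), truncate(dom_b, level))))
        summed[key] += energy
    return dict(summed)
```

Applying `roll_up` to every complex and aligning the resulting keys as columns would give one feature matrix per level, from which all-zero columns can then be dropped.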

Each of these subsets of features was used for classification separately to determine the predictive power of a specific level in the CATH hierarchy. In Maleki et al. (2012), it was shown that domain-based features taken from level 2 (L2) and level 3 (L3) of CATH are more predictive than the features of other levels.

2.2.2 Domain-based properties at the combined level

To generate these types of feature vectors, instead of considering each level in the CATH hierarchy separately, we consider combinations of levels. Thus, we do not obtain only one set of feature vectors per level. Indeed, by allowing arbitrary combinations of nodes, the total number of feature vectors would be exponential, with each feature vector corresponding to a sequence of nodes chosen to represent the domains found in the dataset. In order to maintain computational tractability and eliminate any redundancy in the feature vectors, the following constraints have been imposed: (a) there can be no overlap between nodes. That is, there cannot exist a pair of nodes in a sequence such that one node is an ancestor of the other; (b) only combinations of nodes taken from levels 2 and 3 of the hierarchy have been considered. Based on the results of our previous study (Maleki et al. 2012), it is reasonable to expect that the optimal combination of nodes lies somewhere between these two levels; (c) nodes at level 3 which are the sole child of their parent node at level 2 have been discarded.
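Constraints (a) and (c) can be made concrete with a short Python sketch. It assumes that nodes are written as dotted CATH codes, so that ancestry reduces to a prefix test, and it reads constraint (c) as "a level-2 node with a single level-3 child is never expanded"; the function names and the `children` mapping are hypothetical.

```python
def is_ancestor(node_a, node_b):
    """True if node_a is a proper ancestor of node_b in dotted CATH notation."""
    return node_b.startswith(node_a + ".")

def has_overlap(nodes):
    """Constraint (a): report whether any node is an ancestor of another."""
    return any(is_ancestor(a, b) for a in nodes for b in nodes if a != b)

def expand(nodes, parent, children):
    """Replace a level-2 node by its level-3 children.

    Constraint (c), as read in this sketch: a parent with fewer than two
    children is left as it is, so sole children never enter a sequence.
    """
    kids = children.get(parent, [])
    if parent not in nodes or len(kids) < 2:
        return list(nodes)
    return [n for n in nodes if n != parent] + list(kids)
```

Starting from the set of all level-2 nodes and applying `expand` to chosen nodes always yields a sequence without overlap, because a parent is removed in the same step in which its children are added.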

However, the number of node sequences to be evaluated is still exponential with respect to the number of nodes at level 2. Though an exhaustive enumeration of the entire search space is still computationally tractable given the size of our datasets, this would be a poor choice in general. Accordingly, a method based on sequential floating forward search (SFFS) (Pudil et al. 1994) has been implemented to find a reasonable approximation to the best combination of nodes between levels 2 and 3. For this, SFFS was initialized at the sequence of nodes consisting of the set of all nodes at level 2, as this sequence showed the greatest promise in our previous study (Maleki et al. 2012). Then, the search proceeded downward through the CATH tree towards the sequence of nodes corresponding to level 3.

A complete list of domain-based features in the individual and combined levels for ZH and MW datasets is shown in Tables S1 and S2 of the supplementary material, respectively.

3 Prediction methods

After finding the domain-based features of the complexes of the MW and ZH datasets, we applied several prediction methods to them. In this work, the prediction is performed via commonly used classification methods, including LDR, SVM-SMO, NB and k-NN. More detailed explanation of each prediction method is given below.

3.1 Linear dimensionality reduction

The basic idea of LDR, which has become popular in pattern recognition due to its relatively easy implementation and high classification speed, is to map an object of dimension n onto a lower-dimensional vector of dimension d by means of a linear transformation. Each class, obligate or non-obligate, is represented by a random vector \(\mathbf{x}_1 \sim N(\mathbf{\mu}_1, \mathbf{S}_1)\) or \(\mathbf{x}_2 \sim N(\mathbf{\mu}_2, \mathbf{S}_2),\) respectively, with \(p_1\) or \(p_2\) as a priori probabilities. Each random vector is distributed normally with its mean \(\mathbf{\mu}\) and covariance \(\mathbf{S}.\) The aim of LDR is to find a linear transformation matrix \(\mathbf{A}\) in such a way that the new classes \(\mathbf{y}_i = \mathbf{A}\mathbf{x}_i\) are as separable as possible.

In this work, we use a generalization of the Chernoff discriminant analysis (CDA) criterion proposed in Rueda and Herrera (2008), by relaxing the constraint that \(p_1 = \beta\) and \(p_2 = 1 - \beta\). The generalized formula for the CDA criterion can be stated starting from the Chernoff distance as given in Duda et al. (2001). As in Rueda and Herrera (2008), we take the trace of the resulting matrix in the transformed space as follows:
$$\begin{aligned} J_{\rm CDA} ({\mathbf{A}}) = {\rm tr}&\{ p_1 p_2 ({\mathbf{A}} {\mathbf{S}}_{\rm W} {\mathbf{A}}^t)^{-1} {\mathbf{A}} {\mathbf{S}}_{\rm E} {\mathbf{A}}^t + \log({\mathbf{A}} {\mathbf{S}}_{\rm W} {\mathbf{A}}^t) \\- &p_1 \log({\mathbf{A}} {\mathbf{S}}_1 {\mathbf{A}}^t)- p_2 \log({\mathbf{A}}{\mathbf{S}}_2 {\mathbf{A}}^t) \}\, , \end{aligned}$$
(3)
where \(\mathbf{S}_{\rm E} = (\mathbf{\mu}_1 - \mathbf{\mu}_2)(\mathbf{\mu}_1 - \mathbf{\mu}_2)^t\) and \(\mathbf{S}_{\rm W} = p_1 \mathbf{S}_1 + p_2 \mathbf{S}_2.\) Also, the most accurate error bound given in Duda et al. (2001) is for a value of β (\(\beta \in [0,1]\)) that maximizes the Chernoff distance.
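To see how the terms of Eq. (3) interact, it helps to evaluate the criterion in the simplest case d = 1, where \(\mathbf{A}\) is a single row vector and every product \(\mathbf{A}\mathbf{S}\mathbf{A}^t\) is a scalar, so that the matrix logarithm and the trace reduce to ordinary scalar operations. The Python sketch below covers only this special case; the function names are hypothetical, and the gradient-based optimization of Rueda and Herrera (2008) is not reproduced.

```python
import math

def quad_form(a, S):
    """Return a S a^t for a row vector a and a square matrix S (nested lists)."""
    n = len(a)
    return sum(a[i] * S[i][j] * a[j] for i in range(n) for j in range(n))

def j_cda_1d(a, mu1, mu2, S1, S2, p1, p2):
    """Eq. (3) for a 1 x n transformation a, where each A S A^t is a scalar."""
    n = len(a)
    # Within-class scatter S_W = p1 * S1 + p2 * S2.
    S_W = [[p1 * S1[i][j] + p2 * S2[i][j] for j in range(n)] for i in range(n)]
    s_w = quad_form(a, S_W)
    s_1 = quad_form(a, S1)
    s_2 = quad_form(a, S2)
    # a S_E a^t = (a . (mu1 - mu2))^2 because S_E is an outer product.
    s_e = sum(a[i] * (mu1[i] - mu2[i]) for i in range(n)) ** 2
    return (p1 * p2 * s_e / s_w + math.log(s_w)
            - p1 * math.log(s_1) - p2 * math.log(s_2))
```

With identical covariances the logarithmic terms cancel and only the mean-separation term remains, so a direction along \(\mathbf{\mu}_1 - \mathbf{\mu}_2\) scores higher than an orthogonal one; with equal means, only the difference between the covariances contributes.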

The aim of the CDA approach is to maximize the above equation. To solve this problem, a gradient-based algorithm is used (Rueda and Herrera 2008). This iterative algorithm needs a learning rate, \(\alpha_k\), which is maximized using the secant method to ensure that the gradient algorithm converges. The initialization of the matrix \(\mathbf{A}\) is also an important issue in the gradient-based algorithm.

In this study, ten different initializations were performed and the solution for \(\mathbf{A}\) that yielded the maximum Chernoff distance in the transformed space was selected. Since the best value of β is unknown in advance, an exhaustive search over all possible values of β, ranging from 0 to 1 with steps of 0.05, is applied in this study. This search gives a more accurate bound for the classification error, and hence we expect higher classification accuracy than other LDR methods. Note that the optimization of β is performed over all ten cross-validation folds, to avoid any bias in selecting the parameter for a particular fold. The resulting vectors \(\mathbf{y}_i\) are then input to a quadratic Bayesian (QB) classifier and a linear Bayesian (LB) classifier, which is obtained by deriving a Bayesian classifier with a common covariance matrix. The maximum of the average classification accuracies from these classifiers is reported. More details about the CDA approach and LDR methods can be found in Rueda and Herrera (2008).

3.2 Support vector machines based on SMO

The aim of the SVM is to find the support vectors and derive a linear classifier that ideally separates the space into two regions. Classification using a linear classifier is not possible when the data are not linearly separable, and hence kernels are used to map the data into a higher dimensional space in which the classification boundary can be found much more efficiently. Sequential minimal optimization (SMO) is a fast learning algorithm which is widely applied in the training phase of an SVM classifier as one possible way to solve the underlying quadratic programming problem. In this study, the SMO module of the Waikato Environment for Knowledge Analysis (WEKA) with a normalized polynomial kernel, default parameter settings, and 10-fold cross-validation is used (Hall et al. 2009).

3.3 k-Nearest neighbor

k-NN is one of the simplest classification methods, in which the class of each test sample is found by a majority vote of the class labels of its neighbors. To achieve this, after computing and sorting the distances between the test sample and each training sample, the most frequent class label among the first k training samples (nearest neighbors) is assigned as the class of the test sample. Determining the appropriate number of neighbors is one of the challenges of this method. In this study, the IBk module of WEKA with Euclidean distance, default parameter settings, and 10-fold cross-validation is used (Hall et al. 2009).
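The voting rule just described fits in a few lines of Python. This is a minimal illustration of the rule itself and not of WEKA's implementation; the function name, the default value of k, and the tie-breaking behaviour (the label seen first among the nearest neighbors wins) are choices made for this sketch.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=1):
    """Majority vote among the k training samples closest to x (Euclidean)."""
    # Indices of the training samples, sorted by distance to the test sample.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

In the setting of this paper, each row of `train_X` would be the DDI desolvation-energy vector of one complex and each label would be obligate or non-obligate.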

3.4 Naive Bayes

One of the simplest probabilistic classifiers is naive Bayes. Assuming independence of features, the class of each test sample can be found by applying Bayes’ theorem. The basic mechanism of NB is rather simple. The reader is referred to Theodoridis and Koutroumbas (2008) for more details. In this study, the NaiveBayes module of WEKA with kernel estimator, default parameters, and 10-fold cross-validation is used (Hall et al. 2009).

4 Results and discussion

To test our proposed method and perform an in-depth analysis of the domain-based prediction properties, the four classification methods outlined above have been used. The performance of these prediction methods is compared in terms of their classification accuracies, which are computed as follows: acc = (TP + TN)/N, where TP and TN are the total numbers of true positive (true obligate) and true negative (true non-obligate) predictions over the ten cross-validation folds, respectively, and N is the total number of complexes in the dataset.
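As a small illustration of this definition, the sketch below pools the correct predictions over all folds before dividing by N, instead of averaging per-fold accuracies. The function name and the input layout (one pair of label lists per fold) are assumptions of this sketch; for a two-class problem, the number of correct predictions equals TP + TN.

```python
def pooled_accuracy(fold_results):
    """acc = (TP + TN) / N, with counts pooled over all folds.

    fold_results is a list of (true_labels, predicted_labels) pairs,
    one pair per cross-validation fold.
    """
    correct = 0
    total = 0
    for true_labels, predicted in fold_results:
        # A correct prediction is either a true positive or a true negative.
        correct += sum(t == p for t, p in zip(true_labels, predicted))
        total += len(true_labels)
    return correct / total
```

Pooling and per-fold averaging coincide only when all folds have the same size, which is why the distinction matters for datasets whose size is not a multiple of ten.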

4.1 Analysis of the prediction properties

The prediction results of the LDR, NB, SVM-SMO and k-NN classifiers with individual and combined domain-based features for the MW and ZH datasets are shown in Table 1. The numbers in bold represent the highest accuracy for the corresponding subset of features.
Table 1

Prediction accuracies (%) of SVM-SMO, NB, k-NN and LDR for all domain-based subsets of features of the ZH and MW datasets

| Subset name | # Features | LDR | SVM-SMO | k-NN | NB |
|---|---|---|---|---|---|
| MW-L2 | 96 | 70.01 | 71.65 | 68.96 | 69.39 |
| MW-L3 | 291 | 67.05 | 68.97 | 67.43 | 67.05 |
| MW-L2+L3 | 133 | 70.11 | 73.56 | 69.73 | 70.15 |
| ZH-L2 | 64 | 74.80 | 77.17 | 66.14 | 71.65 |
| ZH-L3 | 150 | 66.14 | 58.27 | 59.04 | 56.70 |
| ZH-L2+L3 | 70 | 75.59 | 78.74 | 66.93 | 72.44 |

For the domain-based subsets of features at each level, extracted from the MW dataset, the MW-L2 subset achieves the best classification accuracy of 71.65 % with SVM-SMO, while for MW-L3 the best obtained performance with SVM-SMO is 68.97 %. However, by combining the feature vectors from levels 2 and 3 of MW (MW-L2+L3), the prediction accuracy improves to 73.56 %, which is 1.91 percentage points above MW-L2 and 4.59 percentage points above MW-L3. This trend can be seen for all of the applied classifiers, which shows that using domain-based features by combining levels is better than using only features of level 2 and much better than using features of level 3 of the CATH hierarchy for prediction. Also, by comparing the classification accuracies, we can see that for all subsets of features extracted from the MW dataset, SVM-SMO performs better than the other classifiers.

Similarly, the best accuracy for the ZH dataset, 78.74 %, is obtained using combined domain-based properties (ZH-L2+L3) with SVM-SMO, compared with the best accuracies of 77.17 % for ZH-L2 with the SVM-SMO classifier and 66.14 % for ZH-L3 with LDR. Also, the performances of other classifiers for all subsets of features of the ZH dataset show the same trend: using the feature vector generated by combining features from levels 2 and 3 (ZH-L2+L3) is more efficient than using features from individual levels. Moreover, from the results, it is clear that after ZH-L2+L3, domain-based features of level 2 (ZH-L2) are more powerful for prediction than domain-based features of level 3.

Generally, it can be concluded for both the MW and ZH datasets that (a) domain-based properties at the combined level yield higher accuracies than domain-based properties on the individual levels; (b) domain-based features related to level 2 of CATH are more powerful than the features from level 3; (c) SVM-SMO is the most accurate classifier for every subset of features except ZH-L3, for which LDR performs best; (d) the SVM-SMO, LDR, NB and k-NN classifiers nevertheless show a similar trend. For all classifiers, DDIs from L2 are better than those of L3, while DDIs from a combination of L2 and L3 are better than those of L2 and L3 individually.

The receiver operating characteristic (ROC) curves for the MW and ZH datasets using different DDI properties for prediction are shown in Fig. 2a, b, respectively. These ROC curves plot the true positive rate (TPR), also known as "sensitivity", against the false positive rate (FPR), or "1-specificity", at various threshold settings. To generate the ROC curves, the sensitivity and specificity of each subset of features were determined for different values of d and β in the CDA classifier. Then, by applying a simple algorithm, the FPR and TPR points were filtered as follows: (a) for the same FPR values, the greatest TPR value (top point) was chosen, and (b) for the same TPR values, the smallest FPR value (left point) was chosen. A polynomial function of degree 2 was then fitted to the selected points. From the ROC curves, it is clear that for both datasets, the prediction performances of LDR using DDI properties on the combined level (ZH-L2+L3 and MW-L2+L3) are better than those using DDI properties of level 2 (ZH-L2 and MW-L2) and much better than those of level 3 (ZH-L3 and MW-L3).
Fig. 2

ROC curves and AUC values for all subsets of features of a MW and b ZH datasets

In addition, the area under the curve (AUC) is computed for each of the above ROC curves using the trapezoid rule. The AUC values are also shown in Fig. 2. The AUC for ZH-L2+L3 is 0.68, which is greater than the AUC of both ZH-L2 (0.66) and ZH-L3 (0.63). Similarly, the AUC for the MW dataset using DDI properties on the combined level (MW-L2+L3) is 0.65, while that for MW-L2 is 0.60. Also, the AUC of MW-L2 is greater than that of MW-L3. Generally, by comparing the AUC values, it can be concluded that DDI properties from the combined levels show better predictive power than DDI properties from the individual levels.
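The point filtering and the trapezoid rule used for the ROC analysis can be sketched as follows. The sketch applies the two filtering rules in sequence and integrates a piecewise-linear curve through the surviving points; it does not reproduce the degree-2 polynomial fit, and the function names and the (FPR, TPR) tuple layout are hypothetical.

```python
def filter_roc_points(points):
    """Keep the top point per FPR value, then the leftmost point per TPR value."""
    best_tpr = {}
    for fpr, tpr in points:
        # Rule (a): for equal FPR, keep the greatest TPR.
        best_tpr[fpr] = max(tpr, best_tpr.get(fpr, 0.0))
    best_fpr = {}
    for fpr, tpr in best_tpr.items():
        # Rule (b): for equal TPR, keep the smallest FPR.
        best_fpr[tpr] = min(fpr, best_fpr.get(tpr, 1.0))
    return sorted((fpr, tpr) for tpr, fpr in best_fpr.items())

def trapezoid_auc(points):
    """Area under a piecewise-linear curve through (FPR, TPR) points sorted by FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

For the area to cover the whole FPR range, the point list should include the end points (0, 0) and (1, 1); whether the original analysis added them explicitly is not stated, so this is an assumption of the sketch.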

4.2 Analysis of structural properties

As discussed earlier, in level 1 of the CATH hierarchy, the "class" of each domain is defined. The four classes of CATH, which are determined based on secondary structure composition, are mainly-alpha (c1), mainly-beta (c2), mixed alpha–beta (c3) and few secondary structures (c4) (Cuff et al. 2009). A summary of the number of DDIs present in both the ZH and MW datasets, categorized by class type, c1–c4, is shown in Table 2. The bold numbers indicate the highest accuracy for each classifier for a particular pair of interacting domains. From the table, it is clear that most of the DDIs are between domains of c2 and other classes, with c2:c2 and c2:c3 ranking highest. In contrast, domains of c4 either have no interactions (with c1 and c4) or have the fewest interactions with domains of other classes. This suggests that DDIs taken from c4 are less important and could be ignored to achieve a faster, yet still accurate, prediction, whereas DDIs taken from c2 may be more powerful for prediction.
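Table 2

A summary of the number of CATH DDIs from level 2 present in the ZH and MW datasets, categorized by their class types

| Domain 1 | Domain 2 | MW-L2 #DDIs | MW-L2 SVM-SMO (%) | MW-L2 k-NN (%) | ZH-L2 #DDIs | ZH-L2 SVM-SMO (%) | ZH-L2 k-NN (%) |
|---|---|---|---|---|---|---|---|
| c1 | c1 | 5 | 63.98 | 63.68 | 3 | 56.69 | 56.69 |
| c1 | c2 | 9 | 63.98 | 63.68 | 10 | 56.69 | 58.27 |
| c1 | c3 | 5 | 62.07 | 61.68 | 5 | 56.69 | 56.69 |
| c1 | c4 | 0 | 0 | 0 | 0 | 0 | 0 |
| c2 | c2 | 24 | 63.98 | 62.07 | 18 | 60.63 | 61.42 |
| c2 | c3 | 32 | 67.43 | 64.75 | 17 | 62.99 | 61.42 |
| c2 | c4 | 6 | 64.75 | 63.98 | 2 | 56.69 | 56.69 |
| c3 | c3 | 13 | 59 | 61.68 | 7 | 56.69 | 56.69 |
| c3 | c4 | 2 | 55.17 | 61.3 | 2 | 55.9 | 55.9 |
| c4 | c4 | 0 | 0 | 0 | 0 | 0 | 0 |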

To investigate this hypothesis, a structural feature selection scheme has been applied on the MW-L2 and ZH-L2 datasets. For this, DDI feature vectors from level 2 have been grouped based on their class (secondary structure) type interactions such as c1–c1, c1–c2, and so on. Then, each group of features was classified with SVM-SMO and k-NN classifiers, individually. The classification results are shown in Table 2.
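The grouping step can be pictured with a short Python sketch that assigns each level-2 DDI feature to its class pair and collects the column indices of each group. It assumes that level-2 nodes are written as dotted CATH codes whose first field is the class number and that DDIs are unordered pairs; the function names are hypothetical.

```python
from collections import defaultdict

def class_pair(ddi):
    """Map a level-2 DDI such as ('2.40', '3.30') to its class pair 'c2-c3'."""
    classes = sorted(code.split(".")[0] for code in ddi)
    return "c{}-c{}".format(*classes)

def group_by_class_pair(ddi_names):
    """Return {class pair: [column indices]} for a list of level-2 DDI names."""
    groups = defaultdict(list)
    for column, ddi in enumerate(ddi_names):
        groups[class_pair(ddi)].append(column)
    return dict(groups)
```

Each list of column indices selects the sub-matrix of features that would be passed to a classifier on its own.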

For the MW-L2 subset, the feature vector of c2–c3 achieves the best prediction with 67.43 and 64.75 % accuracies by SVM-SMO and k-NN, respectively. The other DDIs from c2 also achieve better performance than DDIs of other classes. The most notable feature vectors are c2–c4 and c1–c1, because they achieve acceptable prediction accuracies with fewer features. As expected, the worst prediction results were achieved using DDI feature vectors involving c4, namely c3–c4 (c1–c4 and c4–c4 contain no DDIs).

Similarly, for the ZH-L2 subset, it is clear that while the most discriminative feature vector for prediction is c2–c3, obtaining accuracies of 62.99 % by SVM-SMO and 61.42 % by k-NN, the worst feature vectors are DDIs taken from c4 (c1–c4, c3–c4 and c4–c4). Moreover, the feature vector of c2–c2 is the second most powerful for prediction. All other subsets of features yield almost the same performance. Some notable DDIs are c2–c4 and c1–c1, as they achieve reasonable performance with fewer features.

Furthermore, using structural feature selection, a decrease of 4–8 % in prediction accuracy compared with the original subset of features from MW-L2 and ZH-L2 (Table 1) is observed. However, these decreases in performance can be acceptable given that there are fewer features than in the original feature vectors, leading to a reduction in time and space requirements.

5 Conclusion

The idea of employing a structural domain-based approach for predicting obligate and non-obligate protein complexes, which was presented in our previous studies, is extended in this paper. Different interface properties, including domain-based properties on the individual levels and on the combined levels of the CATH hierarchy, are used for prediction. The classification is performed using various techniques, including LDR, SVM-SMO, k-NN, and NB, for two well-known datasets of pre-classified protein complexes.

The prediction results demonstrate a significant improvement by combining nodes from different levels in the CATH hierarchy, rather than considering DDI features of each level separately. Also, it has been shown that DDIs at upper levels are more powerful than those at lower levels for prediction. The plotted ROC curves and calculated AUC values corroborate the prediction results.

Furthermore, a numerical analysis shows that while there are fewer interactions between domains of c4 and domains of other classes, most of the interactions are between domains of c2 and domains of other classes of level 2 of the CATH hierarchy.

Also, the prediction results on the structurally selected features of the MW-L2 and ZH-L2 datasets confirm that DDIs taken from the mainly-beta class (c2), especially DDIs between the mainly-beta and alpha–beta classes (c2–c3), are the best properties for predicting obligate and non-obligate PPIs.

Notes

Acknowledgments

This research work has been partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada, Grant No. RGPIN 261360, and the University of Windsor, Internal Start-up and VP Research Equipment grants. The authors would like to thank the anonymous reviewers for their valuable feedback on the paper.

Supplementary material

13721_2013_43_MOESM1_ESM.pdf (22 kb)

References

  1. Akutsu T, Hayashida M (2009) Domain-based prediction and analysis of protein–protein interactions (chapter 3). Biol Data Min Protein Interact Netw Med Inf Sci Ref 29–44
  2. Aziz MM, Maleki M, Rueda L, Raza M, Banerjee S (2011) Prediction of biological protein–protein interactions using atom-type and amino acid properties. Proteomics 11(19):3802–3810
  3. Berman H, Westbrook J, Feng Z, Gilliland G, Bhat T, Weissig H, Shindyalov I, Bourne P (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242
  4. Caffrey D, Somaroo S, Hughes J, Mintseris J, Huang E (2004) Are protein–protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci 13(1):190–202
  5. Camacho C, Zhang C (2005) FastContact: rapid estimate of contact and binding free energies. Bioinformatics 21(10):2534–2536
  6. Chandrasekaran P, Doss C, Nisha J, Sethumadhavan R, Shanthi V, Ramanathan K, Rajasekaran R (2013) In silico analysis of detrimental mutations in ADD domain of chromatin remodeling protein ATRX that cause ATR-X syndrome: X-linked disorder. Netw Model Anal Health Inform Bioinforma 2(3):123–135
  7. Chen L, Wang R, Zhang X (2009) Biomolecular networks: methods and applications in systems biology. Wiley, New York
  8. Cuff A, Sillitoe I, Lewis T, Redfern O, Garratt R, Thornton J, Orengo C (2009) The CATH classification revisited-architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res 37:310–314
  9. De S, Krishnadev O, Srinivasan N, Rekha N (2005) Interaction preferences across protein–protein interfaces of obligatory and non-obligatory components are different. BMC Struct Biol 5(15). doi:10.1186/1472-6807-5-15
  10. Duda RO, Stork DG, Hart PE (2001) Pattern classification, 2nd edn. Wiley-Interscience, New York
  11. Dwivedi VD, Arora S, Pandey A (2013) Computational analysis of physico-chemical properties and homology modeling of carbonic anhydrase from Cordyceps militaris. Netw Model Anal Health Inform Bioinforma 1–4. doi:10.1007/s13721-013-0036-8
  12. Eichborn JV, Günther S, Preissner R (2010) Structural features and evolution of protein–protein interactions. Int Conf Genome Inform 22:1–10
  13. Finn R, Mistry J, Tate J, Coggill P, Heger A, Pollington J, Gavin O, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer E, Eddy S, Bateman A (2010) The Pfam protein families database. Nucleic Acids Res 38:211–222
  14. Glaser F, Steinberg DM, Vakser IA, Ben-Tal N (2001) Residue frequencies and pairing preferences at protein–protein interfaces. Proteins 43(2):89–102
  15. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor 11(1):10–18
  16. Hall M, Maleki M, Rueda L (2012) Multi-level structural domain–domain interactions for prediction of obligate and non-obligate protein–protein interactions. In: Proceedings of ACM conference on bioinformatics, computational biology and biomedicine (ACM-BCB), Florida, pp 518–520
  17. Jones S, Thornton JM (1996) Principles of protein–protein interactions. Proc Natl Acad Sci U S A 93(1):13–20
  18. Lawrence MC, Colman PM (1993) Shape complementarity at protein/protein interfaces. J Mol Biol 234(4):946–950
  19. Maleki M, Aziz MM, Rueda L (2011) Analysis of obligate and non-obligate complexes using desolvation energies in domain–domain interactions. In: Proceedings of the 10th international workshop on Data mining in bioinformatics (BIOKDD 2011) in conjunction with ACM SIGKDD 2011, San Diego, pp 21–26
  20. Maleki M, Hall M, Rueda L (2012) Using structural domain to predict obligate and non-obligate protein–protein interactions. In: Proceedings of the IEEE symposium on computational intelligence in bioinformatics and computational biology (CIBCB2012), San Diego, pp 9–15
  21. Mintseris J, Weng Z (2003) Atomic contact vectors in protein–protein recognition. Proteins Struct Funct Genet 53:629–639
  22. Mintseris J, Weng Z (2005) Structure, function, and evolution of transient and obligate protein–protein interactions. Proc Natl Acad Sci U S A 102(31):10930–10935
  23. Park J, Bolser D (2001) Conservation of protein interaction network in evolution. Genome Inform 12:135–140
  24. Park SH, Reyes J, Gilbert D, Kim JW, Kim S (2009) Prediction of protein–protein interaction types using association rule based classification. BMC Bioinforma 10(36). doi:10.1186/1471-2105-10-36
  25. Pudil P, Ferri FJ, Novovicova J, Kittler J (1994) Floating search methods for feature selection with nonmonotonic criterion functions. In: Proceedings of the 12th international conference on pattern recognition, vol 2, pp 279–283
  26. Rueda L, Herrera M (2008) Linear dimensionality reduction by maximizing the Chernoff distance in the transformed space. Pattern Recognit 41(10):3138–3152
  27. Rueda L, Banerjee S, Aziz MM, Raza M (2010a) Protein–protein interaction prediction using desolvation energies and interface properties. In: Proceedings of the 2nd IEEE international conference on bioinformatics and biomedicine (BIBM 2010), Hong Kong, pp 17–22
  28. Rueda L, Garate C, Banerjee S, Aziz MM (2010b) Biological protein–protein interaction prediction using binding free energies and linear dimensionality reduction. In: Proceedings of the 5th IAPR international conference on pattern recognition in bioinformatics (PRIB 2010), pp 383–394
  29. Shanahan H, Thornton J (2005) Amino acid architecture and the distribution of polar atoms on the surfaces of proteins. Biopolymers 78(6):318–328
  30. Singh DB, Gupta MK, Kesharwani RK, Misra K (2013) Comparative docking and ADMET study of some curcumin derivatives and herbal congeners targeting amyloid. Netw Model Anal Health Inform Bioinforma 2(1):13–27. doi:10.1007/s13721-012-0021-7
  31. Singhal M, Resat H (2007) A domain-based approach to predict protein–protein interactions. BMC Bioinforma 8(199). doi:10.1186/1471-2105-8-199
  32. Theodoridis S, Koutroumbas K (2008) Pattern recognition, 4th edn. Elsevier Academic Press, Burlington, California, USA, London, UK
  33. Vasudev G, Rueda L (2012) A model to predict and analyze protein–protein interaction types using electrostatic energies. In: Proceedings of the 5th IEEE international conference on bioinformatics and biomedicine (BIBM 2012), Philadelphia, pp 543–547
  34. Young J (1994) A role for surface hydrophobicity in protein–protein recognition. Protein Sci 3:717–729
  35. Zaki N (2009) Protein–protein interaction prediction using homology and inter-domain linker region information. Adv Electr Eng Comput Sci Springer 39:635–645
  36. Zaki N, Lazarova-Molnar S, El-Hajj W, Campbell P (2009) Protein–protein interaction based on pairwise similarity. BMC Bioinforma 10(150). doi:10.1186/1471-2105-10-150
  37. Zhang C, Vasmatzis G, Cornette JL, DeLisi C (1997) Determination of atomic desolvation energies from the structures of crystallized proteins. J Mol Biol 267:707–726
  38. Zhu H, Domingues F, Sommer I, Lengauer T (2006) NOXclass: prediction of protein–protein interaction types. BMC Bioinforma 7(27). doi:10.1186/1471-2105-7-27

Copyright information

© Springer-Verlag Wien 2013

Authors and Affiliations

  1. School of Computer Science, University of Windsor, Windsor, Canada
