Background

Protein-protein interactions (PPIs) regulate a wide range of biological processes, involved in almost every cellular function. Majority of the proteins in living cells interact with partner proteins and form a complex to regulate proper functions. PPI employs transport mechanisms, muscle contractions, regulations of gene expression and signal transductions [1, 2]. PPIs are classified into different types based on their functional and structural characteristics. According to their stability, interaction surface and involvement, PPIs are classified into obligate and non-obligate, homo and hetero, or permanent and transient [3].

Binding affinity defines the strength of PPIs, and is represented by a dissociation constant (K d ). Binding affinity is crucial in drug developments and therapeutics, and thus, many approaches have been developed to measure the binding affinity. Generally, these approaches are categorized into two groups. The first group identifies the binding affinity using scoring functions and two hybrid systems, surface plasmon resonance and forster resonance energy transfer [4]. These experimental methods for estimating the binding affinity are costly and time consuming. The second group uses computational methods to predict protein binding affinity, such as binding site prediction studies [57], empirical scoring function, knowledge based and quantitative structural methods [810]. Machine learning models have been developed with structure- and sequence-based features to predict and classify the binding affinities. Yugandhar et al. using sequence descriptors to develop a prediction method SMO using support vector machines (SVM) to discriminate high and low binding affinity of heterodimeric protein complexes [11]. Additionally, the works [12, 13] used support vector regression (SVR) models with structure-based features to predict binding affinities for different sets of protein complexes. Alternatively, the work [14] used functional features with a SVR to represent the strength of interactions and observed physicochemical and conformational changes. For existing studies of predicting binding affinities, the prediction models work using small datasets. Only few sequence based studies on predicting the binding affinities. This work aims to predict the binding affinities of heterodimeric complexes and characterize the used sequence-based features.

Nearly 4,000 PPIs exist and the growth of PPIs in size increases speedily. It is a challenging task to accurately predict the binding affinities of PPIs based on sequence information only. This work proposes a SVM-based binding affinity classifier, called SVM-BAC, to classify heterodimeric protein complexes by predicting their binding affinity. SVM-BAC using SVM with an optimal feature selection method, an inheritable bi-objective combinatorial genetic algorithm (IBCGA) [15], can identify a small set of features to determine the binding affinity of protein complexes from 580 sequence descriptors including 531 physicochemical properties from the AAindex database [16] and 49 selected physicochemical, energetic and conformational properties of the 20 amino acids from the literature [17]. A dataset with 216 heterodimeric protein-protein complexes is established from the work [11, 18]. SVM-BAC identified 14 sequence descriptors to classify the high and low binding affinity of protein complexes and obtained 10-fold cross validation and independent test accuracies of 85.80% and 83.33%, respectively. Using these 14 features selected by SVM-BAC with SVR, we estimated the binding affinity in terms of dissociation constant (Pkd) for 200 heterodimeric protein complexes and obtained correlation coefficient of 0.34 and a mean absolute error of 1.4. Contribution analysis of prediction has been used to select top-ranked features. The top-two physicochemical properties apparent partition energy [19] and principal component analysis IV [20], and an important secondary structure based feature, i.e. normalized frequency of beta turn [21] are effective in predicting the binding affinity of heterodimeric protein complexes.

Results and Discussion

Prediction performance of SVM-BAC

We have classified heterodimeric protein complexes by predicting their binding affinities. A dataset consisting of 108 and 108 complexes with high and low binding affinity was used, respectively. All the sequences were encoded into 580 sequence descriptors. SVM-BAC incorporating with the optimal feature selection algorithm IBCGA selected a set of 14 informative sequence descriptors to discriminate the high and low binding affinity complexes.

SVM-BAC achieved the training (10-fold cross validation), test accuracies and Matthews correlation coefficient (MCC) of 85.80%, 83.33% and 0.71 respectively, slightly better than the SMO method [11] with 76.1%, 83.3% and 0.66, shown in Table 1. SVM-BAC predicted high and low binding affinity complexes with training sensitivity and specificity of 0.89 and 0.83, and test sensitivity and specificity of 0.89 and 0.78, respectively. To avoid the biased results due to the fix partition of training and test datasets, we also evaluated the performance of SVM-BAC using the whole dataset of 216 complexes in terms of 10-fold and 5-fold cross validations (10-CV and 5-CV). The sensitivity, specificity, and accuracy of 10-CV were 0.759, 0.842, and 80.09%, respectively. The sensitivity, specificity, and accuracy of 5-CV were 0.777, 0.842, and 81.01%, respectively. The accuracies of 10-CV and 5-CV using 216 complexes were slightly smaller than the test accuracy (83.33%) on 54 complexes.

Table 1 Performance results of SVM-BAC using 162 training and 54 test complexes.

Classifier performance of using the ROC curve is shown in Figure 1. The areas under the ROC curve (AUC) were 0.86 and 0.76 for SVM-BAC and the SMO method, respectively. The SMO method is better than several machine learning algorithms such as Bayesian logistic regression, Naïve Bayes, Multilayer perception, K-nearest neighbors, J48 decision tree, and random forest [11]. The 14 sequence descriptors identified by SVM-BAC are given in Table 2. The results suggest that the 14 features selected by the optimization method IBCGA were effective in predicting the binding affinity of complexes.

Figure 1
figure 1

ROC curve for the SVM-BAC performance evaluation. The area under the ROC curve (AUC) is 0.86 using the training dataset.

Table 2 The 14 physicochemical properties identified by SVM-BAC.

Furthermore, we evaluated individual effect of these 14 features on prediction accuracy using knock-out analysis. Removing of an informative feature makes a significant decrease between 8 and 18% in terms of prediction accuracy, shown in Figure 2. These results suggest that the 14 features selected by IBCGA have substantial effects on discriminating high and low binding affinity of protein complexes.

Figure 2
figure 2

Difference accuracies of individual physicochemical properties using knock-out analysis.

Estimating binding affinities

Binding affinity of a heterodimeric protein-protein complex is estimated in terms of dissociation constant (PKd). The binding affinity dissociation constant depends on many factors, such as structural features, interface properties and physiological factors, which are not easily obtained from primary sequences only. We made an attempt to estimate the binding affinities using the promising features of amino acids that were used to predict high and low binding affinity complexes. There were 200 heterodimeric protein complexes used to estimate the binding affinities, which covered various ranges of binding affinity values (Pkd) and functions. Support vector regression (SVR) was used as a prediction model to estimate binding affinities. Our model was trained using the 14 sequence descriptors and the PKd value. The proposed sequence based model using SVR yielded the correlation coefficient of 0.34 and mean absolute error of 1.4 (Table 3). The correlation result between estimated binding affinities and actual binding ones is shown in Figure 3. Mean absolute error for 200 heterodimeric complexes is shown in Figure 4. The 200 protein-protein complexes and their respective Pkd values were reported in Additional file 1: Table S1.

Table 3 Estimation performance of SVR using Jackknife test on 200 heterodimeric complexes
Figure 3
figure 3

Estimation performance of jackknife test using the SVR-based method for 200 heterodimeric complexes.

Figure 4
figure 4

Mean absolute error of Pkd (binding affinity dissociation constant) for 200 heterodimeric complexes.

Although we have used an effective set of sequence features, the estimation result of binding affinity was not good enough for the whole dataset, irrespective of their function types. The result was consistent with the recently published prediction method of binding affinity using amino acid sequence feature [22] that they predicted the binding affinity using 642 sequence-based features and obtained poor performance in terms of correlation coefficient on 135 protein complexes. It is noted that protein-protein binding affinities also rely on their function types.

Interestingly, when we observed the estimation error of 200 complexes, we found nearly 150 out of 200 complexes that the mean absolute error was 0.87. The result revealed that amino acid properties are also influential factors to estimate the binding affinities for heterodimeric complexes with specific functional types. However, considering all the 200 complexes with various functional types, the estimation performance was not satisfactory, consistent with the work [14] for the prediction of binding affinity on diverse protein-protein interactions.

Although the promising properties of amino acids can predict high and low binding affinity complexes with satisfactory results, yet they cannot be used to accurately estimate the actual binding affinity dissociation constant. To advance the estimation ability, structural features, interface properties, physiological factors, and partner residues information are useful which are not available from the primary sequences themselves. Partner-aware prediction of interacting residues in protein-protein complexes from sequence information has significance in characterizing the interaction [23].

Physicochemical property analysis

The top-two physicochemical properties according to the main effect difference (MED) are apparent partition energies calculated from Chothia index (GUYH850105) [19] and principal component IV (SNEP660104) [20]. Large value of MED means the great contribution to prediction performance. An influential secondary structure related property, normalized frequency of beta turn (CHOP780101) [21] was at rank 9. The three physicochemical properties are further analyzed and discussed below. Table 4 presents the values of 20 amino acids for the three physicochemical properties, the amino acid compositions in high and low binding affinity complexes, and amino acid compositional difference between the two classes.

Table 4 Amino acid composition (AAC) of high binding affinity (HBA) and low binding affinity (LBA) complexes and three physicochemical properties.

The property of apparent partition energies

The property of GUYH850105 is described as "Apparent partition energies calculated from Chothia index [19]". Chothia index is based on calculating the ratio of buried molar fractions for each amino acid in globular proteins. Guy proposed Salvation energies calculated from vapour pressure of side chain analogues R (ΔSE) which are highly correlated (R = 0.86) with Chothia apparent transfer energy scale [19]. This property describes the buried hydrophobicity in proteins.

The buried hydrophobicity nature of protein-protein interactions has been extensively studied. Protein-protein binding directly correlates with total buried hydrophobic surface area and the binding energy increases with the increment of interfacial buried surface area [18]. Mutational studies on free energy change on mutants Δ (ΔG0) correlated with hydrophobic buried area. Upon adding hydrophobic buried surface at their interface leads to gaining of free energy Δ (ΔG0) = -15 ± 1.2 cal/molA2. Statistical and experimental estimations state that the increase of hydrophobic buried surface enhances the protein binding affinity [24, 25]. We thus calculated apparent partition energies for hydrophobic amino acids in our dataset according to the property [19]. We found that the average apparent partition energies for high binding affinity complexes are slightly larger than those for low binding affinity complexes that mean of apparent partition energies obtained for high and low binding affinities were -55.10 and -60.87, respectively. This property analysis declared the importance of hydrophobic amino acid residues at buried region which is one of the major influential factors to increase the binding strength of an interaction. Hydrophobic core in high binding affinity complex PDB_ID: 1MAH as an example is shown in Figure 5. The analysis results are consistent with the previous studies of binding affinity.

Figure 5
figure 5

Surface hydrophobicity of 1MAH. The color of the surfaces represents the level of hydrophobicity. The blue, white, and brown colors represent low, mediate, and high hydrophobicity, respectively. (a) Secondary structures of 1MAH (b) Surface hydrophobicity of 1MAH. Protein structures are drawn using Discovery studio 4.0.

The property of principal component analysis IV

The property of SNEP660104 was described as "Relations between chemical structure and biological activity in principal component analysis IV". Sneath calculated the correlations of amino acids for the use in principal component analysis. Four vectors (Vectors I, II, III and IV) were derived from the 20 amino acid correlations [20]. These four vectors were interpreted as different properties, in which Vector IV represents hydroxythiolation. Hydroxythiolation property has an ability to form hydrogen bonds.

Hydrogen bonds and salt bridges are one of the major contributors to protein-protein interactions. Polar and non-polar side chains significantly contribute to stabilization of the complexes. Polar side chains stabilize the protein complexes through hydrogen bonds. In general, protein interfaces are more hydrophilic than interior residues and form more hydrogen bonds at interfaces [26]. In trypsin-pancreatic trypsin inhibitor, insulin dimer and hemoglobin alpha beta dimer complexes, most of the hydrogen bonds are charged; opposite charges are more favorable to hydrogen bond formation, and 86% of buried polar atoms are favorable to form hydrogen bonds. Chothia et al. reported mean of hydrogen bonds per 100 A2 ΔASA, and maximum and minimum numbers of hydrogen bonds in heterodimeric complexes were 1.89 and 0.29, respectively. Xu-et al analyzed hydrogen bond and salt bridge specificity, and charge distribution at protein-protein interfaces [27].

We measured the numbers of hydrogen bonds at protein-protein interfaces in high and low binding affinity complexes. The average numbers of hydrogen bonds in high and low binding affinity complexes were 22.83 ± 19.70 and 19.42 ± 17.91, respectively. The hydrogen bonds at their interfaces were more enriched in high binding affinity complexes than in low binding affinity complexes. Contribution of these hydrogen bonds in overall protein-protein interactions is various and crucial to pinpoint. The numbers of hydrogen bonds at interfaces in high and low binding affinity complexes are shown in Figure 6.

Figure 6
figure 6

The numbers of hydrogen bonds at interfaces in heterodimeric complexes. X-axis denotes the identification numbers of high and low binding affinity complexes. Y-axis denotes the number of hydrogen bonds at interfaces in a protein complex.

The property of normalized frequencies of beta turns

The property of CHOP780101 is described as "Normalized frequencies of beta turn". Beta bends/ turn is formed by the polypeptide chain folds back on itself by 180 degrees. Beta turns shows three conformations based on their phi, psi values, and two major types exists i.e., types 1 and 2 [28]. Amino acid preferences are different in each type. In type 2 beta turns Gly possess a major preferences at position i+2 and i+3. Usually, beta turns promote antiparallel beta sheets, which can stabilize the secondary structure and these beta sheets are involved in protein interactions. Beta sheet interactions are involved in the binding of Ras oncoproteins to their receptors, significant part occurred in cell signaling pathway [29, 30], immune system, and HIV-1 proteases and inhibitors [31]. Non-regular structures such as turns, helix and loops at interfaces are large groups in heterodimeric complexes and also have large percentages of interface residues at protein-protein complexes belonging to non-regular regions only [32].

To examine the beta turn participation in heterodimeric complexes, we calculated the number of beta turns in the used dataset. Totally, 4,528 beta turns participate in the 216 heterodimeric complexes. On average, every high and low binding affinity complex possesses 23.78 ± 16.89 and 18.27 ± 12.42 beta turns, respectively. Notably, the mean number of beta turns in high binding affinity complexes is significantly larger than that in low binding affinity complexes where the p-value of student's t-test is 0.003. The difference accuracy of the beta turns property CHOP780101 was 12.35% using the knock-out analysis (Figure 2). Though, there are several factors influencing the protein binding affinities, beta turn is one of the most important factors in binding affinity prediction. A better insight into beta turn would have the potential to improve our current protein structure analysis and prediction methods. Beta turn formation in the example complex PROMMP-2/TIMP-2 is shown in Figure 7.

Figure 7
figure 7

Structure of PROMMP-2/TIMP-2 COMPLEX (PDB code 1GXD). Left: View of the enzyme-inhibitor complex complete structure. Right: A close-up view of type-2 beta turn from the whole complex structure and arrangement of amino acids shown as balls-and-stick model.

All 14 physicochemical properties and their amino acid composition preferences were calculated for high and low binding affinity complexes, reported in Additional file 2: Table S2.

Significance of H-bonds and beta turn properties in predicting high and low binding affinity complexes

We estimated the individual effects of H-bonds and beta turn properties in the binding affinity classification by knocking out one of the two corresponding properties, SNEP660104 and CHOP780101. The difference accuracies for the H-bonds and beta turn properties are 12.96% and 12.35%, respectively (Figure 2). Elimination of these two features decreases the overall prediction accuracy of 14.81%. The result suggests that the H-bond and beta turn properties equally contribute to predict high and low binding affinity complexes.

Conclusions

Characterizing the physicochemical properties influencing the protein binding affinity has a significant role in protein-protein interaction studies. We developed amino acid based predictor named as SVM-BAC to classify the high and low binding affinity complexes by identifying 14 informative properties. Moreover, the SVR-based prediction method using the 14 features was investigated to examine the ability of predicting the binding affinities for the whole set of complexes of various functional types. Our model estimated the binding affinities (Pkd) of 200 heterodimeric complexes with mean absolute error of 1.4 and it can be further refined by considering the categorization of functional types. Further physicochemical analysis revealed that buried hydrophobicity, beta turns, and hydrogen bonds are influential factors in protein binding affinity. The property analysis would be helpful to understand the underlying mechanism in the protein binding affinities. Though, protein binding affinities depend on various factors, we attempted to find out the contribution of sequence properties in binding affinity prediction.

Methods

Datasets

We compiled a dataset of 262 high and low binding affinity complexes from previous literature [11, 18]. Protein complexes possess diverse functions and molecular weights. Protein complexes with Kd < 10-8 M are regarded as high binding affinity class and those with Kd ≥ 10-8 are considered as low binding affinity class. We extracted the protein sequences from the PDB database [33]. After removing uncertain entries and deleting the sequences if sequence length is less than 50 amino acids, a balanced dataset contains 216 heterodimeric protein complexes including 108 positive (high binding affinity) and 108 negative (low binding affinity) samples. Since each amino acid is significant and possesses the ability to change binding free energies [34], we did not apply the redundancy criterion to decrease the sequence identity and thus considered all the 216 complexes. We randomly selected 162 samples (75%) as training and 54 samples (25%) as test sets. We utilized 531 amino acid sequence based features from AAindex and 49 properties from literature [16, 17].

For estimating the binding affinity value (Pkd), we used the same 216 heterodimeric complexes which were used to predict the high and low binding affinity complexes. There were 16 complexes removed from the 216 complexes because there is no absolute value of binding affinity available. Finally, 200 protein complexes have been considered for the estimation experiment. Binding affinity values were collected from different sources [35, 36]. These binding affinity values are in a diverse range from μM to nM, i.e. from 10-3 to 10-15 M. The 200 complexes possess diverse functional groups such as antibody/antigen, enzyme-inhibitor, enzyme-substrate, G-protein containing, receptor-containing and other enzymes.

Physicochemical properties

Kawashima and Kanehisa developed the AAindex database which collects numerical indices representing physicochemical and biochemical properties of amino acids [16]. In our work we used 531 physicochemical properties from AAindex and 49 additional features from protein folding related sequence descriptors [17] as candidate features to construct a SVM-based classifier for the discrimination of high and low affinity binding proteins. The original sequences of the selected datasets were transformed into the numerical indices according to the each feature's corresponding values of amino acids. The numerical values of the features were normalized to the scale [-1, 1] for using SVM. When translating into machine learning lexicon for predicting protein functions, variable-length sequences may arise the problem of encoding the feature vector. It is important for classification that the feature vector is formulated into a feature vector with constant length by feature generation. In this work, we used 580 sequence descriptors for encapsulating the global information about proteins of variable length in a fixed length formation. Each of the 580 features was derived from the averaged value of a specific property of amino acids, which was independent of the sequence order of two proteins. The aims of this work are to identify the informative properties of amino acids and then predict the binding affinity in heterodimeric complexes. So, order-dependent sequence features were not used in the proposed method.

The procedure of feature representation for the 580 physicochemical properties is described as follows:

Step 1: Collect the high and low binding affinity sequences from the training dataset.

Step 2: Calculate the composition w(a i ) of a complex for the ith amino acid ai of 20 amino acids to encode the protein sequence of variable length into the feature vector of length 580.

Step 3: Calculate the feature value of the pth physicochemical property, TPCP(p), of a protein complex, where p = 1, 2, ..., 580.

TPCP ( p ) = i = 1 20 w a i .PC P p ( a i )
(1)

where PCP p (a i ) is the value of the a i amino acid of the pth physicochemical property.

Inheritable bi-objective combinatorial genetic algorithm (IBCGA)

In this work, the inheritable bi-objective combinatorial optimization genetic algorithm (IBCGA) [15] is used for the feature selection. The feature selection is a combinatorial optimization problem C(n, m). IBCGA selects a small set of m features from a large number of n candidate features while optimizing the prediction performance. IBCGA is an efficient global optimization algorithm comprising an intelligent evolutionary algorithm which uses orthogonal array crossover to efficiently solve large parameter optimization problems. The inheritable mechanism can conserve the features that can improve the predication accuracy in the searching procedure.

In using IBCGA, the parameter setting of SVM and feature selection were encoded into binary genes to be optimized simultaneously. In this work, the commonly used genetic algorithm (GA) terms are gene and chromosome, represent as GA-gene and GA-chromosome for the discrimination. The GA-chromosome consists of n = 580 binary genes b i for selecting informative features and two 4-bit GA-genes for tuning the parameters C and γ of SVM. The ith property is excluded from the SVM classifier if b i = 0, otherwise it will be included. This method can encode the 16 values of γ and C ∈ {2-7, 2-6, ..., 28}. In the SVM classifier, digitalized and normalized protein sequences in the training data set were used as input. In this work, the range of the size of candidate feature set selected by IBCGA is from r start = 10 and r end = 20. The feature selection algorithm IBCGA is described as follows.

(Initialization) randomly generate an initial population of individuals.

Step 1: (Evaluation) Evaluate the fitness value of all individuals using the fitness function that is the prediction accuracy in terms of 10-fold cross validation.

Step 2: (Selection) Use a conventional method of tournament selection that selects the winner from two randomly selected individuals to generate a mating pool.

Step 3: (Crossover) Select two parents from the mating pool to perform orthogonal array crossover operation.

Step 4: (Mutation) Apply a conventional mutation operator to the randomly selected individuals in the new population. Mutation is not applied to the best individuals to prevent the best fitness value from deterioration.

Step 5: (Termination test) If the stopping condition (reaching a prespecified number of generations) for obtaining the solution is satisfied, then output the best individual as the solution. Otherwise, go to Step 2.

Step 5: (Inheritance) If r < r end, randomly change one bit in the binary GA-genes for each individual from 0 to 1; increase the number r by one, and go to Step 2. Otherwise, stop the algorithm.

Binding affinity prediction method SVM-BAC

After the feature selection (m features) and parameter settings (γ and C) of SVM are done by using IBCGA, the binding affinity prediction method SVM-BAC can be implemented. SVM is an effective method used in the two-class classification and regression problems [37]. SVM works implicitly in the feature space by only computing the corresponding kernel K(x i , x j ) between any two objects x i and x j :

K x i , x j =Φ x i T Φ x j
(2)

where Φ x is used as a mapping function. Support vector regression (SVR) has an ability to interpret the property values from a number of samples in high dimensional space. Due to its effective regression abilities, SVR has been used for many biological prediction problems. This work used the following equations to measure the performance evaluation.

Accuracy= T P + T N T P + T N + F P + F N
(3)
Sensitivity= T P T P + F N
(4)
Specificity= T N T N + F P
(5)
MCC= T P × T N - F P × F N T P + F P T P + F N T N + F P T N + F N
(6)

Where TP is true positive; TN is true negative; FP is false positive; FN is false negative; MCC is Matthews Correlation Coefficient.

Calculation of H-bonds and beta turns

Hydrogen bonds and beta turns were calculated using the PDB sum database [38] and DSSP webserver [39].