Background

Identification of protein-protein interactions (PPIs) is important to elucidate protein functions and identify biological processes in a cell. The knowledge of PPIs can help people better understand disease mechanisms and drug designs. In the past several years, a large number of technologies have been developed for the large-scale analysis of PPIs. In general, there are three categories of methods for detecting PPIs: methods based on the information of evolution, methods based on natural language processing, and methods based on features of amino acid sequence.

A large number of past studies have made clear that the protein-protein interaction has a co-evolution trend [1]. The evolution information is extracted from multiple sequence alignment of homologous proteins. Tree similarity is used as a simple linear correlation between distance matrices of two protein families, as a proxy of their phylogenetic trees [2]. MirrorTree [3–5] evaluates the relationship between tree similarities and physical or functional interactions. It is possible to predict PPIs on a genomic scale, with higher correlations indicating a higher probability of protein-protein interaction. Carlo et al. [6] presented a log-likelihood score for protein-protein interaction. Direct Coupling Analysis (DCA) has been used to predict response regulator (RR) interaction partners for orphan histidine sensor kinase (SK) proteins in bacterial two-component signal transduction systems [7]. They also presented a protein-protein interaction score, which is based on a more efficient multivariate Gaussian approach [8]. However, since these methods need a large number of homologous proteins and interaction marks of protein partners, they are very difficult to compute and their applications are limited.

Many methods have been developed to find the evidence for PPIs from PubMed abstracts based on Natural Language Processing (NLP) [9]. According to a certain semantic model, these methods automatically extract relevant pieces of information from texts, since a large number of known PPIs are stored in the scientific literature of biology and medicine. Daraselia et al. [10] used a method, called MedScan, to extract more than one million pieces of data from PubMed. They obtained accuracy rates of up to 91 %, compared with the BIND and DIP databases [11]. The problem of this approach is that some PPIs information may be missing from literature, thus the prediction may not be complete.

It might be possible to predict PPIs accurately by using only protein sequence information with methods based on machine learning algorithms and features of amino acids. To use machine learning methods in this task, one of the most important computational challenges is to extract useful features from protein sequences. Generally, there are several kinds of feature representation methods, including Auto Covariance (AC) [12], Auto Cross Covariance (ACC) [12], Conjoint Triad (CT) [13], Local Protein Sequence Descriptors (LD) [14, 15], Multi-scale Continuous and Discontinuous feature set (MCD) [16], Physicochemical Property Response Matrix combined with Local Phase Quantization descriptor (PR-LPQ) [17], Multi-scale Local Feature Descriptors (MLD) [18], as well as Substitution Matrix Representation (SMR) [19].

AC and ACC [12] use seven physicochemical properties of amino acids to reflect their interaction modes whenever possible. After being represented by these seven descriptors, a pair of proteins can be converted into a 420-dimensional vector by AC, and a 2940-dimensional vector by ACC. CT [13] considers the properties of each amino acid and its vicinal neighbors and regards three contiguous amino acids as a unit. The 20 amino acids are clustered into seven groups according to dipoles and volumes of side chains. The PPIs information of protein sequences can then be projected into a homogeneous vector space by counting the frequency of each type of unit. The descriptors of the two proteins are concatenated into a 686-dimensional vector by CT.

Similar to CT, LD [14, 15] clusters the twenty standard amino acids into seven functional groups. It splits the protein sequence into ten local regions of varying length to describe multiple overlapping continuous and discontinuous interaction patterns within a protein sequence. For each local region, three local descriptors–composition (C), transition (T) and distribution (D)–are calculated. A 1260-dimensional vector is constructed to represent each protein pair by LD. MLD [18] uses a multi-scale decomposition technique to divide the protein sequence into multiple sequence segments of varying length to describe overlapping local regions. A binary coding scheme is then adopted to construct a set of continuous regions on the basis of the above partition. A 1134-dimensional vector is constructed to represent each protein pair by MLD. MCD [16] is similar to MLD, except that it constructs a 1764-dimensional vector for each protein pair. Indeed, LD, MCD and MLD can be categorized as the same type of method.

PR-LPQ [17] adopts the physicochemical property response matrix method to transform the amino acid sequence into a matrix and then employs the local phase quantization-based texture descriptor to extract local phase information in the matrix. SMR is based on BLOSUM62, which is considered to be powerful for detecting weak protein similarities. Huang et al. [19] used BLOSUM62 to construct a new matrix representation from a protein sequence. Then, the matrix is lossily compressed by the Discrete Cosine Transform (DCT) and a 400-dimensional feature vector is extracted from the compressed matrix. Each pair of protein sequences forms an 800-dimensional feature vector, which is fed into the Weighted Sparse Representation based Classifier (WSRC) for predicting PPIs.

In this paper, we propose a novel sequence-based approach with a k-gram feature representation calculated as Multivariate Mutual Information (MMI). Combined with normalized Moreau-Broto Autocorrelation (NMBAC), we predict PPIs via Random Forest (RF), which is an ensemble learning method for classification, regression and other tasks. For the performance evaluation, our method is applied to the S.cerevisiae PPIs dataset, where it achieves 95.01 % accuracy and 92.67 % sensitivity. Compared with the best existing method, the accuracy is increased by 0.29 %. To further demonstrate the effectiveness of our method, we also test it on the H.pylori PPIs dataset, where it achieves 87.59 % accuracy and 86.81 % sensitivity. On the human 8161 PPIs dataset, our method achieves 97.56 % accuracy and 96.57 % sensitivity. In addition, we use the S.cerevisiae PPIs dataset to construct a model to predict PPIs in five other independent species datasets. Compared with the state-of-the-art methods, the accuracy is increased by 2.42 % on average. We also test our method on two special PPIs datasets [20]. On the yeast dataset, our method achieves 82, 82, 62 and 61 % AUROC on four different test classes (typical Cross-Validated (CV) and distinct test classes C1, C2 and C3). On the human dataset, our method achieves 82, 82, 60 and 57 % AUROC on the four test classes. Finally, we test our method on three important PPIs networks: the one-core network (CD9) [21], the multiple-core network (Ras-Raf-Mek-Erk-Elk-Srf pathway) [22], and the crossover network (Wnt-related Network) [23]. Compared to the Conjoint Triad (CT) method [13], accuracies of our method are increased by 6.25, 2.06 and 18.75 %, respectively.

Methods

In our method for predicting protein-protein interaction based on protein sequence information, we first extract features from the protein sequences. Each feature vector represents the characteristics of one pair of proteins. We use a k-gram feature representation calculated as Multivariate Mutual Information (MMI) and extract additional features from protein sequences with normalized Moreau-Broto Autocorrelation (NMBAC). These two approaches are employed to transform the protein sequences into feature vectors. Then, we feed the feature vectors into a classifier that separates interacting pairs from non-interacting pairs.

Multivariate mutual information

Inspired by previous work [13, 24, 25] for extracting features from protein sequences, we propose a novel method to fully describe key information of protein-protein interaction. There exist many technologies using the k-gram feature representation, which is commonly used for protein sequence classification [26, 27]. Here k represents the number of conjoint amino acids. For example, CT [13] used the 3-gram feature representation. Shen et al. [13] indicated that methods without considering local environment are usually not reliable and robust, so they produced a conjoint triad method to consider properties of amino acids and their proximate amino acids.

To continue the usage of k-gram feature representation and to enhance classification accuracy, we utilize MMI [28] for deeply extracting conjoint information of amino acids in protein sequences.

Classifying amino acids

The protein-protein interaction can be dominated by dipoles and volumes of diverse amino acids, which reflect electrostatic and hydrophobic properties. All 20 standard amino acid types are assigned to seven functional groups [13], as shown in Table 1. For each pair of proteins, we extract conjoint information based on these amino acid categories.

Table 1 Division of 20 amino acid types, based on dipoles and volumes of side chains

Calculating multivariate mutual information

Considering the neighbours of each amino acid, we regard any three contiguous amino acids as a unit. We use a sliding window of length 3 to parse the protein sequence. For each window, the categories of the three amino acids are used to label the type of this unit. Instead of considering the order of the three amino acids, we only consider the composition of the unit. We define different types of 3-gram feature representation, such as 'C_0,C_0,C_0', 'C_0,C_0,C_1', …, 'C_6,C_6,C_6'. Similarly, we also define different types of 2-gram feature representation, such as 'C_0,C_0', 'C_0,C_1', …, 'C_6,C_6'. We count each type of 3-gram feature and 2-gram feature on one protein sequence by a sliding window, as shown in Fig. 1.

Fig. 1 3-gram or 2-gram feature representation
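The sliding-window counting described above can be sketched as follows. This is a minimal illustration rather than the original implementation: the seven groups follow the classification of Shen et al. [13], but the index given to each group, the names `GROUPS`, `CATEGORY` and `count_kgrams`, and the example sequence are assumptions made here because Table 1 is not reproduced in the text.

```python
from collections import Counter
from itertools import combinations_with_replacement

# Seven amino-acid groups of Shen et al. [13]. The index assigned to each
# group is an assumption, since Table 1 is not reproduced in the text.
GROUPS = ["AGV", "C", "MSTY", "FILP", "HNQW", "KR", "DE"]
CATEGORY = {aa: idx for idx, members in enumerate(GROUPS) for aa in members}


def count_kgrams(sequence, k):
    """Count unordered k-grams of categories with a sliding window of length k."""
    cats = [CATEGORY[aa] for aa in sequence]
    counts = Counter()
    for start in range(len(cats) - k + 1):
        # Sorting the window discards residue order and keeps only composition.
        counts[tuple(sorted(cats[start:start + k]))] += 1
    return counts


# Unordered types over seven categories: multisets of size 3 and size 2.
three_gram_types = list(combinations_with_replacement(range(7), 3))
two_gram_types = list(combinations_with_replacement(range(7), 2))
print(len(three_gram_types), len(two_gram_types))  # 84 28

# Hypothetical toy sequence, used only to show the call.
print(count_kgrams("MKVLA", 3))
print(count_kgrams("MKVLA", 2))
```

Because order inside a window is ignored, the number of distinct unit types equals the number of multisets over seven categories, which is where the 84 and 28 feature counts used later come from.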

Throughout the following discussion of mutual information, logarithms are taken to base e. In contrast to the standard approach, in which mutual information and entropy are defined over all possible events, our mutual information and entropy refer to a single event on one protein sequence. We calculate the multivariate mutual information for each type of 3-gram feature, defined as follows:

$$ I(a,b,c)= I(a,b)- I(a,b|c) $$
(1)

where a,b and c are categories of three conjoint amino acids in one unit.

We then define the mutual information for one type of 2-gram feature as I(a,b), which can be counted by a 2-length sliding window:

$$ I(a,b) = f(a,b)ln\left(\frac{f(a,b)}{f(a)f(b)}\right) $$
(2)

where f(a,b) is the frequency of categories a and b appearing together in a 2-gram feature on a protein, and f(a) is the frequency of category a appearing on that protein.

In addition, we define the conditional mutual information as I(a,b|c).

$$ I(a,b|c)=H(a|c)-H(a|b,c) $$
(3)

where H(a|c) and H(a|b,c) are the conditional entropy as follows.

$$ H(a|c) = -f(a|c)ln(f(a|c)) $$
(4)

and

$$ H(a|b,c)=-f(a|b,c)ln(f(a|b,c)) $$
(5)

where f(a|c) is the frequency of category a appearing while category c exists in 2-gram feature on a protein, and f(a|b,c) is the frequency of category a appearing while categories b and c exist in 3-gram feature on a protein.

H(a|c) and H(a|b,c) can be approximately calculated as follows:

$$ H(a|c) = -\frac{f(a,c)}{f(c)}ln\left(\frac{f(a,c)}{f(c)}\right) $$
(6)

and

$$ H(a|b,c)=-\frac{f(a,b,c)}{f(b,c)}ln\left(\frac{f(a,b,c)}{f(b,c)}\right) $$
(7)

where f(a,b,c) is the frequency of categories a,b and c appearing in 3-gram feature on a protein.

To avoid the values of I(a,b,c) and I(a,b) being infinity, we calculate the frequency as follows:

$$ f(a) = \frac{n_{a} + 1}{L+1} $$
(8)

where n_a is the number of occurrences of category a on a protein and L is the length of this protein sequence. We also use similar formulas to calculate f(a,b) and f(a,b,c).

We can get 84 multivariate mutual information values of I(a,b,c) (3-tuples MI) and 28 mutual information values of I(a,b) (2-tuples MI) from one protein. We also compute the frequency of the seven amino acid categories appearing on this protein. A protein sequence is represented as 84+28+7=119 features. Finally, we combine the descriptors of two proteins to build a 238-dimensional vector for representing each pair of proteins.
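As a rough sketch of how Eqs. (1)–(8) combine into the 119 features per protein, the following code may help. Several details are assumptions made for illustration and are not stated in the text: the group index order, applying the same (n+1)/(L+1) form of Eq. (8) to 2-grams and 3-grams, taking the sorted order of a unit as (a, b, c), and using the smoothed f(a) as the seven category frequencies. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement

# Groups of Shen et al. [13]; the index order is an assumption.
GROUPS = ["AGV", "C", "MSTY", "FILP", "HNQW", "KR", "DE"]
CATEGORY = {aa: idx for idx, members in enumerate(GROUPS) for aa in members}


def mmi_features(sequence):
    """Return 84 3-tuple MI + 28 2-tuple MI + 7 frequencies for one protein."""
    cats = [CATEGORY[aa] for aa in sequence]
    length = len(cats)
    counts = Counter()
    for k in (1, 2, 3):
        for start in range(length - k + 1):
            counts[tuple(sorted(cats[start:start + k]))] += 1

    def f(*key):
        # Eq. (8); reusing the same form for 2- and 3-grams is an assumption.
        return (counts[tuple(sorted(key))] + 1) / (length + 1)

    def mi2(a, b):
        # Eq. (2)
        return f(a, b) * math.log(f(a, b) / (f(a) * f(b)))

    def cond_entropy(joint, given):
        # Eqs. (6) and (7): -(joint/given) * ln(joint/given)
        ratio = joint / given
        return -ratio * math.log(ratio)

    def mi3(a, b, c):
        # Eq. (3), then Eq. (1)
        cond_mi = cond_entropy(f(a, c), f(c)) - cond_entropy(f(a, b, c), f(b, c))
        return mi2(a, b) - cond_mi

    features = [mi3(*t) for t in combinations_with_replacement(range(7), 3)]
    features += [mi2(*t) for t in combinations_with_replacement(range(7), 2)]
    features += [f(a) for a in range(7)]
    return features


def pair_features(seq_a, seq_b):
    """Concatenate the descriptors of two proteins (2 x 119 = 238 values)."""
    return mmi_features(seq_a) + mmi_features(seq_b)


# Hypothetical toy sequences, used only to show the vector sizes.
vector = pair_features("MKVLAGDECHW", "ACDEFGHIKLMNPQRSTVWY")
print(len(vector))  # 238
```

The add-one smoothing keeps every frequency strictly positive, so all logarithms are finite even for units that never occur in a sequence.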

Normalized moreau-broto autocorrelation

It is well known that PPIs include four interaction modes, usually expressed as electrostatic interaction, hydrophobic interaction, steric interaction and hydrogen bond. Feng et al. [29] introduced an autocorrelation function combining physicochemical properties of amino acids to propose a feature representation method, which is used to predict the types of membrane proteins. Inspired by this method, we use the NMBAC to extract features from protein sequences.

Six physicochemical properties of amino acid

The physicochemical properties we consider are hydrophobicity (H), volumes of side chains of amino acids (VSC), polarity (P1), polarizability (P2), solvent-accessible surface area (SASA) and net charge index of side chains (NCISC) of amino acid.

Values of these six physicochemical properties for each amino acid are listed in Table 2 [30]. They are first normalized to zero mean and unit standard deviation (SD) as follows:

$$ P_{i,j}^{'} = \frac{P_{i,j}-P_{j}}{S_{j}}(i=1,2,\ldots,20;j=1,2,\ldots,6.) $$
(9)

where P_{i,j} is the value of descriptor j for amino acid type i, P_j is the mean of descriptor j over the 20 amino acids, and S_j is the corresponding SD.

Table 2 Original values of six physicochemical properties of 20 amino acid types

Each protein can be translated into six vectors with each amino acid represented by normalized values of six descriptors. So, NMBAC [29] can be computed as follows:

$$ AC_{lag,j} = \frac{1}{n-lag}\sum\limits_{i=1}^{n-lag}\left(X_{i,j} \times X_{i+lag,j}\right) \quad (i=1,2,\ldots,n-lag;\ j=1,2,\ldots,6) $$
(10)

where j indexes the six descriptors, i is the position in the protein sequence X, X_{i,j} is the normalized value of descriptor j for the residue at position i, n is the length of the protein sequence, and lag is the sequential distance between one residue and another residue a certain number of positions away (lag=1,2,…,lg). The parameter lg is determined by an optimization procedure to be described.

Following AC [12], we let lag range from 1 to lg=30, which gives a 30×6=180-dimensional vector. We also compute the frequency of the 20 amino acids appearing on this sequence. As a result, a protein sequence is represented by 30×6+20=200 features. Finally, we combine the descriptors of two proteins and build a 400-dimensional vector to represent each pair of proteins by NMBAC.
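Eqs. (9) and (10) can be illustrated with the short sketch below. It is not the original implementation: the function names are hypothetical, the population SD is assumed in Eq. (9), and the property table `TOY_TABLE` is a made-up three-letter example that stands in for Table 2, whose values are not reproduced here.

```python
import math


def normalize_properties(table):
    """Eq. (9): scale each descriptor to zero mean and unit SD over residue types."""
    n_desc = len(next(iter(table.values())))
    normalized = {aa: [] for aa in table}
    for j in range(n_desc):
        column = [values[j] for values in table.values()]
        mean = sum(column) / len(column)
        # Population SD is an assumption; the text does not say which SD is used.
        sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
        for aa, values in table.items():
            normalized[aa].append((values[j] - mean) / sd)
    return normalized


def nmbac(sequence, table, max_lag=30):
    """Eq. (10): one autocorrelation value per (descriptor, lag) pair."""
    props = normalize_properties(table)
    n = len(sequence)  # must exceed max_lag
    n_desc = len(next(iter(props.values())))
    features = []
    for j in range(n_desc):
        x = [props[aa][j] for aa in sequence]
        for lag in range(1, max_lag + 1):
            total = sum(x[i] * x[i + lag] for i in range(n - lag))
            features.append(total / (n - lag))
    return features


def composition(sequence, alphabet):
    """Frequency of each residue type in the sequence."""
    return [sequence.count(aa) / len(sequence) for aa in alphabet]


# Made-up values for a toy alphabet; NOT the values of Table 2.
TOY_TABLE = {"A": [1.0, 0.0], "B": [2.0, 1.0], "C": [3.0, 5.0]}
toy_sequence = "ABCABCAB"
toy_features = nmbac(toy_sequence, TOY_TABLE, max_lag=3) + composition(toy_sequence, "ABC")
print(len(toy_features))  # 2 descriptors x 3 lags + 3 residue types = 9
```

With the six descriptors of Table 2, the 20 standard amino acids and max_lag=30, the same functions would yield 6×30+20=200 values per protein, matching the dimension stated above.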

Random forest classifier

RF is an algorithm for classification developed by Leo Breiman [31], which uses an ensemble of classification trees. Each classification tree is built by using a bootstrap sample of training data, and each split candidate set is a random subset of variables. RF uses both bagging (bootstrap aggregation) and random variable selection for tree building. Each classification tree is unpruned to obtain low-bias trees. The bagging and random variable selection can cause low correlation of individual trees. Therefore, RF has excellent performance in classification tasks.

In this paper, the feature space of each pair of proteins is composed of MMI and NMBAC. In total, 238+400=638 features are encoded to represent each pair of proteins. We define a 638-dimensional feature vector F=(x_1,x_2,…,x_638) as the input of the RF model. The class label t of an interacting pair or a non-interacting pair is set to 1 or −1, respectively. If the number of cases in the training set is N, a sample is built by randomly choosing N cases from the original data with replacement. This sample is the training set for growing one tree. If there are M input variables, a number m ≪ M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant while the forest is grown. Each tree is grown to the largest extent possible without pruning. For a new test sample, the classification result is obtained by voting over these trees.
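The three randomized ingredients of this procedure (bootstrap sampling of cases, random selection of m candidate variables per node, and voting over trees) can be sketched in isolation. The snippet deliberately omits tree growing itself, all function names are hypothetical, and the value m=25 is taken here to be the split setting reported in the experimental environment, which is an interpretation rather than something the text states explicitly.

```python
import random
from collections import Counter


def bootstrap_indices(n_cases, rng):
    """Draw N case indices with replacement: the training sample for one tree."""
    return [rng.randrange(n_cases) for _ in range(n_cases)]


def candidate_variables(n_variables, m, rng):
    """Pick m of the M input variables at random as split candidates for a node."""
    return rng.sample(range(n_variables), m)


def majority_vote(tree_predictions):
    """Combine per-tree labels (1 or -1) into the forest prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
sample = bootstrap_indices(10, rng)               # N = 10 toy cases
split_candidates = candidate_variables(638, 25, rng)  # M = 638, m = 25
print(len(sample), len(split_candidates))
print(majority_vote([1, 1, -1, 1, -1]))  # 1
```

Because each tree sees a different bootstrap sample and each node a different variable subset, the trees are only weakly correlated, which is what makes the vote more reliable than any single tree.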

Results

We test our method on several different PPIs datasets to evaluate the performance of our proposed approach, including the S.cerevisiae, H.pylori, human 8161, C.elegans, E.coli, human 1412 and M.musculus datasets. First, we independently analyze the performance of the two protein representations, namely MMI and NMBAC. Second, we compare our method with some outstanding methods on the S.cerevisiae, H.pylori and human 8161 datasets. Then, we use the S.cerevisiae PPIs dataset to construct a model to predict PPIs in five other independent species datasets. Our proposed method achieves a high performance on the S.cerevisiae, H.pylori and human 8161 datasets, so we evaluate the prediction performance of our model on five independent testing datasets. Our experiments suggest that experimentally identified interactions in one organism are able to predict interactions in other organisms. We also test our method on two special yeast and human PPIs datasets. In addition, we test our method on three important PPIs networks, and compare it with the state-of-the-art methods. We use our primary experimental information to predict real PPIs networks, which are assembled from pairwise PPIs data.

PPIs datasets

The first PPIs dataset, described by You et al. [16], is downloaded from the yeast S.cerevisiae core subset in the Database of Interacting Proteins (DIP) [11]. Proteins with fewer than 50 residues or with more than 40 percent sequence identity are removed, and the remaining 5594 pairs of proteins form the golden standard positive dataset (GSP). Non-interacting pairs are selected uniformly at random from the set of all protein pairs that are not known to interact. Pairs whose proteins share the same subcellular localization are then excluded. Finally, the golden standard negative dataset (GSN) consists of 5594 protein pairs whose subcellular localizations are different. The GSP and GSN datasets contain a total of 11188 protein pairs (half from the positive dataset and half from the negative dataset).

The second PPIs dataset, described by Martin et al. [32], is composed of 2916 H.pylori protein pairs (1458 interacting pairs and 1458 non-interacting pairs). The third PPIs dataset is collected from the Human Protein Reference Database (HPRD) as described by Huang et al. [19]. Huang et al. constructed the human 8161 dataset from 8161 protein pairs (3899 interacting pairs and 4262 non-interacting pairs).

The C.elegans (4013 interacting pairs), E.coli (6954 interacting pairs), human 1412 (1412 interacting pairs), M.musculus (313 interacting pairs), and H.pylori (1420 interacting pairs) datasets are mentioned by Zhou et al. [14]. These species-specific PPIs datasets are employed in our experiment to verify the effectiveness of our proposed method.

Evaluation measurements

To test the robustness of our method, we repeat the process of random selection of the training and test sets, model building and model evaluation, using five-fold cross validation. We report seven measures: overall prediction accuracy (ACC), sensitivity (SN), specificity (Spec), positive predictive value (PPV), negative predictive value (NPV), F score (the harmonic mean of PPV and sensitivity), and Matthews correlation coefficient (MCC). These measures are defined as follows:

$$\begin{array}{*{20}l} ACC&=\frac{TP+TN}{TP+FP+TN+FN} \end{array} $$
(11a)
$$\begin{array}{*{20}l} SN&=\frac{TP}{TP+FN} \end{array} $$
(11b)
$$\begin{array}{*{20}l} Spec&=\frac{TN}{TN+FP} \end{array} $$
(11c)
$$\begin{array}{*{20}l} PPV&=\frac{TP}{TP+FP} \end{array} $$
(11d)
$$\begin{array}{*{20}l} NPV&=\frac{TN}{TN+FN} \end{array} $$
(11e)
$$\begin{array}{*{20}l} F_{score}&=2 \times \frac{SN \times PPV}{SN + PPV} \end{array} $$
(11f)
$${} \begin{aligned} MCC&=\frac{TP \times TN - FP \times FN}{\sqrt{(TP+FN)\times (TN+FP)\times (TP+FP) \times (TN+FN)}} \end{aligned} $$
(11g)

where true positive (TP) is the number of true PPIs that are predicted correctly; false negative (FN) is the number of true PPIs that are predicted to be non-interacting pairs; false positive(FP) is the number of true non-interacting pairs that are predicted to be PPIs, and true negative(TN) is the number of true non-interacting pairs that are predicted correctly.
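For reference, Eqs. (11a)–(11g) translate directly into code. The function name `ppi_metrics` and the confusion counts in the usage lines are hypothetical and serve only to illustrate the definitions; they are not results from this study.

```python
import math


def ppi_metrics(tp, tn, fp, fn):
    """Compute the seven evaluation measures of Eqs. (11a)-(11g)."""
    sn = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "SN": sn,
        "Spec": tn / (tn + fp),
        "PPV": ppv,
        "NPV": tn / (tn + fn),
        "F_score": 2 * sn * ppv / (sn + ppv),
        "MCC": (tp * tn - fp * fn)
        / math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)),
    }


# Made-up confusion counts, used only to show the call.
for name, value in ppi_metrics(tp=90, tn=80, fp=20, fn=10).items():
    print(f"{name}: {value:.4f}")
```

Note that MCC is undefined when any of the four marginal sums is zero; a production implementation would need to guard against that case.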

Experimental environment

In this paper, our proposed sequence-based PPIs predictor is implemented using C++ and MATLAB. All experiments are carried out on a computer with a 2.5 GHz 6-core CPU, 32 GB memory and the Windows operating system. Two RF parameters, the number of decision trees and the number of variables tried at each split, are set to 500 and 25, respectively.

Performance of PPIs prediction

We use eight different datasets to evaluate the performance of our proposed method. The proposed approach is compared with other methods on the S.cerevisiae, H.pylori and human 8161 datasets. Then, we test our method on the human 1412, M.musculus, H.pylori, C.elegans, and E.coli datasets for PPIs prediction.

S.cerevisiae dataset

We use the first PPIs dataset used in You et al. [16] to evaluate the performance of our model.

Analyzing 2-tuples and 3-tuples MI

We analyze the performance of the 2-tuples and 3-tuples MI features on the S.cerevisiae dataset. The prediction results for the 2-tuples and 3-tuples MI are shown in Table 3. The accuracies for 2-tuples MI, 3-tuples MI and MMI are 93.56, 93.88 and 94.23 %, respectively. The combined MMI representation thus achieves better performance than either 2-tuples MI or 3-tuples MI alone.

Table 3 Analyze the performance of 2-tuples and 3-tuples MI on S.cerevisiae dataset

Selecting optimal lag

A larger value of lg yields more variables, which account for contacts between residues that are farther apart in the sequence. The maximal possible lg is bounded by the length of the shortest sequence (50 amino acids) in the dataset. To obtain the best lg, we test nine different values of lg (lg=5, 10, 15, 20, 25, 30, 35, 40, 45). The results of these nine values of lg on the S.cerevisiae dataset are shown in Fig. 2. As seen from the curve, the prediction accuracy increases when lg increases from 5 to 30. However, it slightly declines when lg increases from 30 up to 45. The best prediction accuracy is 92.76 %, obtained when lg is 30 amino acids. NMBAC with lg less than 30 would lose some useful features of protein sequences, and larger values could introduce noise instead of improving the prediction performance. So, we set lg to 30 in our study.

Fig. 2 Accuracy of our method with NMBAC on different values of lag

Analyzing MMI and NMBAC

In order to understand the contribution of different feature representations, we evaluate the performance of MMI and NMBAC for PPIs prediction. We use the S.cerevisiae dataset, which is randomly partitioned into training and independent testing sets via five-fold cross validation. Each of the five subsets acts as an independent holdout testing dataset for the model trained with the remaining four subsets. Cross validation can minimize the impact of data dependency and improve the reliability of experimental results. The prediction result is shown in Table 4. The accuracies for MMI, NMBAC and the ensemble representation are 94.23, 92.76 and 95.01 %, respectively. MMI performs better than NMBAC, and using the ensemble representation raises the accuracy by a further 0.78 %.
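The five-fold partition described above can be sketched as follows. The function name, the fixed seed and the interleaved assignment of shuffled indices to folds are illustrative assumptions; the text only specifies that the dataset is split at random into five subsets, each used once as the holdout set.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each fold of a random partition."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal subsets.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
        yield train, test


# 11188 is the number of protein pairs in the S.cerevisiae dataset.
for train, test in five_fold_splits(11188):
    print(len(train), len(test))
```

A model would be trained on `train` and scored on `test` in each iteration, and the reported accuracy and standard deviation are then taken over the five folds.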

Table 4 Analyze the performance of MMI and NMBAC on S.cerevisiae dataset by RF Classifier

To account for the asymmetry of the pair representation, we build a forward vector for each PPI by concatenating the descriptors of the two interacting proteins in the order (protein A, protein B), and a backward vector in the reverse order (protein B, protein A). Accuracies on forward and backward vectors for PPIs prediction are 95.01 and 94.90 %, so the prediction result changes little.

5-fold cross-validation

The prediction result of our method on the S.cerevisiae dataset is shown in Table 5. We predict PPIs of the S.cerevisiae dataset, and obtain accuracy, precision, sensitivity, and MCC of 95.01, 97.31, 92.67, and 90.1 %, respectively. Standard deviations of these criteria are 0.46, 0.61, 0.5, and 0.92 %, respectively. The high accuracies and low standard deviations show that our proposed model is effective and stable for predicting PPIs.

Table 5 5-fold cross-validation result obtained by using our proposed method on S.cerevisiae dataset

Comparison with existing methods

We compare the prediction performance of our proposed method with other existing methods on the S.cerevisiae dataset, as shown in Table 6. It can be observed that a high prediction accuracy of 95.01 % is obtained by our proposed model. We use the same S.cerevisiae PPIs dataset, and compare our experimental result with methods proposed by You et al. [16, 18, 30], Wong et al. [17], Guo et al. [12], Zhou et al. [14] and Yang et al. [15], where Random Forest (RF), Ensemble Extreme Learning Machines (EELM), Support Vector Machine (SVM), Rotation Forest, Support Vector Machine (SVM), or k-Nearest Neighbor (KNN) is performed with the MLD, AC+CT+LD+MAC, MCD, PR-LPQ, AC, ACC, or LD scheme as input feature vectors, respectively. Their prediction accuracies are 94.72±0.43, 87.00±0.29, 91.36±0.36, 93.92±0.36, 89.33±2.67, 87.36±1.38, 88.56±0.33, and 86.15±1.17 %, respectively, whereas our prediction accuracy is 95.01±0.46 %. Our method has the highest prediction accuracy on the S.cerevisiae PPIs dataset, compared to all the above methods. Our method has the best performance in other criteria as well. The sensitivity is 92.67±0.5 %, and the Matthews correlation coefficient is 90.1±0.92 % in our result. On the S.cerevisiae dataset, the MCC of our method is better than that of the other existing methods.

Table 6 Comparison of the prediction performance between our proposed method and other state-of-the-art works on S.cerevisiae dataset

H.pylori dataset

In order to highlight the advantage of our method, we also test it on the H.pylori dataset, which is described by Martin et al. [32]. We compare the prediction performance of our proposed method with previous works including AC+CT+LD+MAC [30], MCD [16], DCT+SMR [19], phylogenetic bootstrap [33], signature products [32], HKNN [24], ensemble of HKNN [25] and boosting. In Table 7, we can see that the average accuracy, sensitivity, PPV and MCC of our method are 87.59, 86.81, 88.23 and 75.24 %, respectively. On the H.pylori dataset, the accuracy of our method is better than that of all other methods tested. This suggests that our method effectively extracts the contiguous amino acid information from protein sequences. Furthermore, combining MMI and NMBAC increases the prediction performance. The accuracies for MMI, NMBAC and the ensemble representation are 85.42, 85.59 and 87.59 %, respectively, so the accuracy is increased by at least 2.00 % on the H.pylori dataset.

Table 7 Comparison of the prediction performance between our proposed method and other different methods on H.pylori dataset

human 8161 dataset

We also test our method on the human 8161 dataset, which is used by Huang et al. [19]. We compare the prediction performance between our proposed method and Huang's work [19] on this dataset, as shown in Table 8. Our method achieves 97.56 % accuracy, 96.57 % sensitivity and 95.13 % MCC. In comparison, Huang's work achieved 96.30 % accuracy, 92.63 % sensitivity and 92.82 % MCC. Our method thus obtains a better prediction result than Huang's work on the human 8161 dataset. In particular, the ensemble representation reaches 97.56 % accuracy, compared with 96.08 and 95.59 % for the two single representations, so combining them raises the accuracy by 1.48 % on the human 8161 dataset.

Table 8 Comparison of the prediction performance between our proposed method and other different methods on human 8161 dataset

PPIs identification on independent across species dataset

If a large number of physically interacting proteins in one organism have co-evolved, their respective orthologs in other organisms are likely to interact as well. In this section, we use all 11,188 samples of the S.cerevisiae dataset as the training set and the other species datasets (E.coli, C.elegans, human 1412, H.pylori and M.musculus) as the test sets. The performance of these five experiments is summarized in Table 9. The accuracies are 92.80, 92.16, 94.33, 91.13, and 95.85 % on the E.coli, C.elegans, human 1412, H.pylori and M.musculus datasets, respectively. The result of our method is better than those of other methods [14, 18, 19]. Overall, the accuracy of the ensemble representation is 2.79 % higher than that of a single representation (MMI or NMBAC) on these five independent species.

Table 9 Prediction results on five independent species by our proposed method, based on S.cerevisiae dataset as the training set

Two special PPIs datasets

Yungki Park and Edward M. Marcotte [20] proposed two PPIs datasets, one for yeast and one for human, to evaluate pair-input computational predictions. We compare the performance of our method with seven methods (M_1–M_7) of pair-input computational predictions on the two PPIs datasets: M_1, a signature products-based method proposed by Martin et al. [32] and classified by SVM; M_2, in which a protein sequence is described as in M_1, and the feature vector for a protein pair is formed by applying the metric learning pairwise kernel and classified by SVM; M_3, the SVM-based method of CT feature developed by Shen et al. [13]; M_4, the SVM-based method of AC feature developed by Guo et al. [12]; M_5, in which the PPIs feature is the same as in M_4 and the classifier is the random forest; M_6, a method developed by Pitre et al. [34]; M_7, a method originally developed for protein-RNA interaction prediction [35]. We report the typical cross-validated (CV) predictive performance and the performance on three distinct test classes (C1, C2, C3). The performance of each method is summarized as the average area under the receiver operating characteristic curve (AUROC) ± its standard deviation and the corresponding average area under the precision-recall curve (AUPRC) ± its standard deviation.

Prediction results are shown in Tables 10 and 11. On the yeast PPIs dataset, our method achieves 0.82, 0.82, 0.62 and 0.61 AUROC values on CV, C1, C2, and C3, respectively. Moreover, AUROC values on CV, C1, C2, and C3 are 0.82, 0.82, 0.60 and 0.57 on the human dataset, respectively. Our method obtains better prediction results than M_1–M_7 on the yeast and human datasets.

Table 10 Comparison of prediction performance between our proposed method and other seven methods on the yeast dataset
Table 11 Comparison of prediction performance between our proposed method and other seven methods on the human dataset

Yungki Park and Edward M. Marcotte [20] also constructed new yeast and human PPIs datasets by suppressing the representational bias-driven learning. Prediction results are shown in Table 12 and Table 13. On the new yeast PPIs dataset, our method achieves 0.65, 0.66, 0.60 and 0.55 AUROC on CV, C1, C2, and C3, respectively. On average, our method obtains better prediction results than M_1–M_7 on the new yeast dataset. On the new human dataset, our proposed method achieves 0.61, 0.62, 0.57 and 0.53 AUROC on CV, C1, C2, and C3, respectively. On average, our result is also better than M_2–M_7, but does not outperform M_1 on the new human dataset.

Table 12 Comparison of prediction performance between our proposed method and the seven other methods on the new yeast dataset, suppressing representational bias-driven learning
Table 13 Comparison of prediction performance between our proposed method and the seven other methods on the new human dataset, suppressing representational bias-driven learning

PPIs networks prediction

A useful application of a PPIs prediction method is the prediction of PPIs networks. We apply our method to three important PPIs networks assembled from pairwise PPIs. The simplest is the one-core network of CD9, an important tetraspanin protein [21]. Our method identifies 14 of the 16 PPIs in this network, an accuracy of 87.50 %. Compared with Shen’s work [13], the accuracy is 6.25 percentage points higher. The results are shown in Fig. 3, where dark blue lines denote correct predictions and red lines denote incorrect predictions.

Fig. 3 A one-core network for the CD9 network

The Ras-Raf-Mek-Erk-Elk-Srf pathway is a multiple-core network that has been implicated in a variety of cellular processes [22]. This network contains 189 PPIs, of which 174 are predicted correctly by our method (92.06 % accuracy). Compared with Shen’s work, the accuracy is 2.06 percentage points higher. The prediction results for the Ras-Raf-Mek-Erk-Elk-Srf pathway are shown in Fig. 4, where dark blue lines denote correct predictions and red lines denote incorrect predictions.

Fig. 4 A multiple-core network for the Ras-Raf-Mek-Erk-Elk-Srf pathway

The Wnt-related network is a typical crossover network, and its related pathway is essential in signal transduction. Ulrich et al. [23] have demonstrated the protein interaction topology of the Wnt-related network, and Shen et al. [13] have tested their method on it. Of the 96 PPIs in this network, 73 are predicted correctly by their method, an accuracy of 76.04 %. Our method correctly predicts 91 of the 96 PPIs, an accuracy of 94.79 %, which is better than that of Shen’s method [13]. The prediction results for the Wnt-related network are shown in Fig. 5, where dark blue lines denote correct predictions and red lines denote incorrect predictions.

Fig. 5 A crossover network for the Wnt-related pathway
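The accuracies quoted for the three networks follow directly from the counts of correctly predicted PPIs given in the text. The short sketch below reproduces that arithmetic; the helper name `accuracy_pct` is introduced here only for illustration, and the gain over the CT method is expressed in percentage points.

```python
def accuracy_pct(correct, total):
    """Percentage of a network's PPIs that were predicted correctly."""
    return 100.0 * correct / total


# (correctly predicted PPIs, total PPIs) as reported in the text.
networks = {
    "CD9 (one-core)": (14, 16),
    "Ras-Raf-Mek-Erk-Elk-Srf (multiple-core)": (174, 189),
    "Wnt-related (crossover)": (91, 96),
}
for name, (correct, total) in networks.items():
    print(f"{name}: {accuracy_pct(correct, total):.2f} %")

# Gain over the CT method on the Wnt-related network (73 of 96 correct for CT).
wnt_gain = accuracy_pct(91, 96) - accuracy_pct(73, 96)
print(f"Wnt-related gain over CT: {wnt_gain:.2f} percentage points")
```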

Discussion

Although many computational methods have been used to predict PPIs, the effectiveness of previous prediction models can still be improved. Existing methods that fail to take local amino acid environments into account are neither reliable nor robust. We therefore propose a Conjoint Triad method that accounts for the properties of each amino acid together with those of its two vicinal amino acids.
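The idea of describing each residue together with its two neighbours can be illustrated with a basic conjoint triad count. The sketch below is not the feature extraction used in this work. It assumes the seven-class amino acid grouping commonly used in the conjoint triad literature, and the names `CLASSES` and `triad_counts` are hypothetical. Each window of three consecutive residues is mapped to a triple of class indices, giving a 7 × 7 × 7 = 343-dimensional count vector per protein.

```python
from itertools import product

# Assumed seven-class grouping of the 20 standard amino acids
# (the grouping commonly used with conjoint triad features).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: i for i, group in enumerate(CLASSES) for aa in group}

# All 343 possible class triples, each with a fixed position in the vector.
TRIADS = list(product(range(len(CLASSES)), repeat=3))
TRIAD_INDEX = {triad: i for i, triad in enumerate(TRIADS)}


def triad_counts(sequence):
    """Count how often each class triple occurs in a protein sequence."""
    counts = [0] * len(TRIADS)
    classes = [AA_TO_CLASS.get(aa) for aa in sequence.upper()]
    # Slide a window of three consecutive residues along the sequence.
    for window in zip(classes, classes[1:], classes[2:]):
        if None not in window:  # skip windows containing unknown residues
            counts[TRIAD_INDEX[window]] += 1
    return counts


# Hypothetical toy sequence: windows ACD, CDE and DEF are counted.
features = triad_counts("ACDEF")
print(len(features), sum(features))
```

A feature vector for a protein pair could then be formed, for example, by concatenating the vectors of the two proteins; that choice is likewise an assumption of this sketch.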

We use one PPIs dataset to construct a model and then use it to predict PPIs in five independent datasets from other species. The results indicate that the proposed model can be successfully applied to other species for which experimental PPIs data are not available. It should be noted that the biological hypothesis behind mapping PPIs from one species to another is that large numbers of physically interacting proteins in one organism have co-evolved.

The most useful application of a PPIs prediction method is the prediction of PPIs networks, and doing so accurately is the most important issue for such methods. We extend our method to predict three real, important PPIs networks: a one-core network, a multiple-core network, and a crossover network. PPIs networks are generally crossover networks, so our method is useful in practical applications. These results suggest that our proposed method is a promising and useful support tool for future proteomics research. The main improvements of the proposed method come from adopting an effective feature extraction method that captures useful protein sequence information. In future work, we will extend our method to predict other important PPIs networks.

Conclusions

In this paper, we develop a new method for predicting PPIs from the primary sequences of two proteins. The prediction model is built on random forest and an ensemble feature representation scheme. In addition, we use MMI to improve the performance of PPIs prediction. For performance evaluation, our method is applied to the S. cerevisiae PPIs dataset, where it achieves 95.01 % accuracy and 92.67 % sensitivity. To further demonstrate the effectiveness of our method, we also use the H. pylori PPIs dataset, on which it achieves 87.59 % accuracy and 86.81 % sensitivity. On the human 8161 dataset, our method achieves 97.56 % accuracy and 96.57 % sensitivity. We also use the S. cerevisiae PPIs dataset to construct a model that predicts PPIs in five independent datasets from other species. Our proposed method achieves accuracies of 92.80, 92.16, 94.33, 91.13, and 95.85 % on the E. coli, C. elegans, human 1412, H. pylori, and M. musculus datasets, respectively. Finally, we extend our method to predict three real, important PPIs networks, where its accuracy is 6.25, 2.06, and 18.75 percentage points higher than that of CT. The prediction ability of our approach is better than that of the other existing PPIs prediction methods considered here.