Large-scale protein-protein interactions (PPIs) have been experimentally identified in several eukaryotic model organisms, such as Drosophila melanogaster [1–3], Caenorhabditis elegans [4, 5], and Saccharomyces cerevisiae [6–9]. Moreover, thousands of PPIs have been collected in web databases including BIND [10], CYGD [11], DIP [12], BioGRID [13], IntAct [14], and MINT [15]. Although mammalian interactions have been published in MPPI [16], no human data set of comparable scale has been described. A large-scale set of human protein interactions is still hard to determine directly.

Many computational methods have been developed to predict protein-protein interactions. The phylogenetic profile method [17] describes the presence or absence of proteins among different organisms with sequenced genomes; functional links can be detected between proteins that have similar phylogenetic profiles. The gene or domain fusion method [18, 19] describes a pair of proteins encoded as separate genes in one organism and fused into a single protein in another organism. Such a pair of proteins can be inferred to be functionally linked, particularly within metabolic pathways. In the gene neighbor or gene order method [20–22], genes that encode two proteins and lie close together on the chromosome in several organisms are likely to be functionally linked. However, this method exploits the prevalence of operons in prokaryotes, and operons appear to be uncommon in eukaryotes such as humans. Predictions using interologs [5] are based on the theory that proteins interacting in one organism co-evolve such that their respective orthologs maintain the ability to interact in another organism. The interolog concept has been applied to predict human protein interactions [23–29]. Some bioinformatics models [30, 31] have also been developed to detect interactions among proteins by probabilistic and machine-learning methods, as have literature text-mining approaches [32–34] based on natural language processing. Bader et al. developed a logistic regression approach [35] that employs statistical and topological descriptors to predict the biological relevance of PPIs obtained from high-throughput screening in yeast. Other sources of information, such as mRNA expression, genetic interactions and database annotations, are subsequently used to validate the model predictions. Lu et al. used a simple Naive Bayes classifier to integrate diverse sources of genomic evidence, ranging from co-expression relationships to phylogenetic profiling similarity [36].

The greatest challenge in predicting human PPIs using the interolog-based method is that the high-throughput interactions generate too many false positives when applied to phylogenetically distant organisms or lower eukaryotes [37], and some researchers have suggested that only 50% of yeast two-hybrid interactions are reliable [38]. Therefore, other filtering examinations of features and scoring schema should be further considered in order to increase the confidence in the prediction of human interactions performed by the interolog-based method. This study constructs human PPI maps from six eukaryotes, namely rat, mouse, fly, worm, thale cress and baker's yeast. The quasi-clique is analyzed and determined as a relative conservation score from the protein interaction networks in each organism. The other feature scores further drawn from spatial proximity (sub-cellular localization and tissue-specificity), temporal synchronicity (cell-cycle stage) and domain-domain combinations are also inspected, to obtain human PPI networks with confidence scores.

Results and discussion

Predicted human protein interactions

All protein accession codes, such as NCBI GI numbers or RefSeq IDs, were converted into non-redundant UniProt IDs. Table 1 shows the non-redundant (nr) total set of the originally predicted human protein-protein interactions (interologs) derived from six reference organisms. One-to-many mappings exist across species in the InParanoid-predicted data set, and are applied to identify protein orthologs. A total of 90,871 human PPIs were predicted by the proposed method without applying a confidence score (CS) cutoff (see Additional File 1).

Table 1 Number and sources of predicted interactions inferred from each reference organism.

The known human interactions (indicated as KNOWN) were downloaded from the external databases BIND, BioGRID, DIP, HPRD, IntAct, MINT and MPPI. The KNOWN2 data set was derived from KNOWN with the addition of two recently published experimental data sets of human PPIs [39, 40]. With a threshold of CS ≥ 0, the proposed method predicted 2,572 (2.83%) true positives (TP0) and 88,299 (97.17%) putatives (PU0). A threshold of CS ≥ 4 yielded 1,467 (7.46%) true positives (TP4) and 18,192 (92.53%) putatives (PU4). Figure 1 summarizes the relationship among these data sets. The following evaluation compares the functional annotations among the KNOWN2, TP4, TP0, PU4, PU0, and random interaction data sets (RANDOMS).

Figure 1

Schematic illustration of interaction data. Schematic illustration of sets of known (KNOWN2), predicted true positive (TP0) and predicted putative (PU0) interaction data. The confidence score (CS = 4 herein) can be used to identify interaction sets (TP4 and PU4) quantitatively and filter out the predicted interactions with lower confidence.


Experimentally determined human PPIs, and hence the standard benchmark, are limited to those recorded in well-known databases, and the complete set of interactions is far from known. Therefore, the absence of an interaction between two proteins from the experimental databases does not indicate that the proteins do not interact. Given this limited knowledge, functional keyword annotation and GO term matching were used to assess the accuracy of the various interaction data sets.

Testing for true positives

Table 2 presents the successfully predicted human PPIs (true positives) from different reference organisms in the first evaluation. The accuracy of combining the predicted human interactions from various reference organisms was found to exceed that of a single reference organism. Although large-scale protein interaction maps of rat and mouse have not yet been completed, these two mammalian model organisms can be used to identify a higher proportion of predicted true positives, $\frac{317}{476} = 66.60\%$ and $\frac{474}{1212} = 39.11\%$, respectively (Table 1 and Table 2). Therefore, human interactions can be confidently predicted from multiple mammalian organisms and higher eukaryotes.

Table 2 Number of human interactions (true positives) successfully predicted from each reference organism in different experimental databases.

Testing scoring method

Each feature score of each data set was evaluated to determine whether the proposed scoring method was associated with more accurate predictions of interactions. The data sets predicted by the BLAST search method (BTP and BPU denote the true positive and putative data sets, respectively) were also compared with our predicted data sets. In Figure 2, each feature score is the original raw score without normalization. The data sets predicted by our approach (TP4 and PU4) have similar or higher feature scores than the known interaction data sets (KNOWN2) and the randomly generated data sets (RANDOMS). The distributions of the various components of the confidence metrics and the ANOVA tests between these interaction data sets are listed in Additional File 2. The differences between these data sets are statistically significant.

Figure 2

Each feature score for all data sets. Each feature score for all data sets; x-axis is the feature type, and y-axis is the corresponding raw feature score (mean value). The predicted data sets with confidence score (CS = 4) (TP4 and PU4) have similar or higher feature scores than the known interaction data sets with two recently published experimental data sets (KNOWN2), data sets for true positive and putative predicted from BLAST mapping method (BTP and BPU) and randomly generated data sets (RANDOMS).

Testing functional annotation

Interacting proteins commonly have similar functions. Additionally, researchers should be able to validate the functions of predicted protein pairs. The interactions predicted by the proposed method were evaluated in terms of UniProt functional keyword annotations, GO 'molecular function' (MF) and GO 'biological process' (BP). GO terms such as 'molecular function unknown', 'obsolete molecular function', 'biological process unknown' and 'obsolete biological process' were discarded.

Equations (1), (2), and (3) define the Jaccard coefficient of the UniProt keyword, and the deepest depth of common ancestor GO terms in MF and BP categories, UK, GMF and GBP, respectively.

$$UK = \frac{K_a^{T} \ast K_b}{K_a^{T} \ast K_a + K_b^{T} \ast K_b - K_a^{T} \ast K_b} \qquad (1)$$

$$GMF = \sum_{i=1} i \ast (\%\ \text{of PPI share ancestor GO term at depth } i \text{ in MF}) \qquad (2)$$

$$GBP = \sum_{i=1} i \ast (\%\ \text{of PPI share ancestor GO term at depth } i \text{ in BP}) \qquad (3)$$

where $K_a$ and $K_b$ are the keyword vectors of the interacting proteins a and b, respectively. For example, in $K_a = [1, 0, 1, 0, 1]$, the presence or absence of a keyword is represented as 1 or 0, respectively. Protein self-interactions or homo-dimers always share the same functional annotations and therefore tend to have high scores. Hence, these interactions were removed from the predicted pairs to avoid bias in the results.
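As a brief illustration of equation (1), the sketch below computes UK for two binary keyword vectors. The function name `keyword_jaccard` and the second vector are hypothetical and chosen only for this example; the first vector is the one given in the text. Returning 0.0 for two proteins without any keyword is an assumed convention, not something stated in the study.

```python
def keyword_jaccard(k_a, k_b):
    """Jaccard coefficient of two equal-length 0/1 keyword vectors (equation 1)."""
    if len(k_a) != len(k_b):
        raise ValueError("keyword vectors must have the same length")
    shared = sum(a * b for a, b in zip(k_a, k_b))  # K_a^T * K_b
    # K_a^T * K_a + K_b^T * K_b - K_a^T * K_b
    total = sum(a * a for a in k_a) + sum(b * b for b in k_b) - shared
    # Assumed convention: two proteins without keywords score 0.0.
    return shared / total if total else 0.0


k_a = [1, 0, 1, 0, 1]  # example vector from the text
k_b = [1, 1, 0, 0, 1]  # hypothetical second protein
print(keyword_jaccard(k_a, k_b))
```

With these vectors the two proteins share two keywords out of four distinct ones, so UK = 2 / (3 + 3 − 2) = 0.5.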

First, the number of interaction pairs sharing at least one UniProt functional keyword was determined to verify the accuracy of the predicted interactions. Second, the number of interaction pairs sharing common GO annotations at a particular depth in the GO 'molecular function' and 'biological process' hierarchies was analyzed to confirm that the shared annotation was not merely a general, high-level GO term. Comparisons were made among the KNOWN2, TP4, TP0, BTP, PU4, PU0, BPU and RANDOMS data sets (Figure 3).

Figure 3

Testing of functional annotation. Testing of functional annotation between all data sets. (a) Mean depth level of common ancestor GO term in 'molecular function' (MF) or 'biological process' (BP) categories. (b) Mean of Jaccard coefficient of UniProt keyword matching.
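Equations (2) and (3) can be read as a depth-weighted sum. The sketch below assumes that the percentage is expressed as a fraction of all pairs and that each pair has already been assigned the depth of its deepest common ancestor GO term; under these assumptions the sum equals the mean depth shown in panel (a). The function name and the depth list are hypothetical.

```python
from collections import Counter


def depth_weighted_score(depths):
    """Sum over i >= 1 of i * (fraction of pairs whose deepest common ancestor is at depth i)."""
    if not depths:
        return 0.0
    counts = Counter(depths)
    n_pairs = len(depths)
    # Pairs at depth 0 (only the root in common) add nothing to the sum
    # but still count towards the total number of pairs.
    return sum(i * c / n_pairs for i, c in counts.items() if i >= 1)


example_depths = [2, 2, 3, 5]  # hypothetical deepest-common-ancestor depths
print(depth_weighted_score(example_depths))
```

For the four hypothetical pairs the score is 2 × 0.5 + 3 × 0.25 + 5 × 0.25 = 3.0.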

Finally, the probability that two proteins share the same UniProt functional keyword by chance is determined through the hypergeometric distribution [41]. The p-value is obtained by the following equation:

$$p = \sum_{x}^{n} \frac{\binom{M}{x} \binom{N-M}{n-x}}{\binom{N}{n}}$$

where N and M denote the total number of proteins in the population and the number of proteins that have a particular functional keyword, respectively, and n and x denote the total number of proteins in the set and the number of proteins in the set annotated with that functional keyword, respectively. Since a pair of proteins is observed, both n and x are equal to 2. A protein pair is treated as enriched for a UniProt functional keyword if the corrected p-value is ≤ 0.05. All 90,871 predicted interactions are listed with this p-value in Additional File 1.
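A minimal sketch of this p-value using only the Python standard library is given below. The function name and the population sizes are hypothetical, and the multiple-testing correction mentioned above is not shown because the text does not specify which correction was applied.

```python
from math import comb


def shared_keyword_pvalue(N, M, n=2, x=2):
    """Hypergeometric tail: probability that at least x of n sampled proteins carry the keyword."""
    if not (0 <= M <= N and 0 <= x <= n <= N):
        raise ValueError("require 0 <= M <= N and 0 <= x <= n <= N")
    # math.comb(a, b) returns 0 when b > a, so impossible terms vanish.
    tail = sum(comb(M, k) * comb(N - M, n - k) for k in range(x, n + 1))
    return tail / comb(N, n)


# Hypothetical population: 20,000 proteins, 200 of which carry the keyword.
print(shared_keyword_pvalue(N=20000, M=200))
```

For a protein pair (n = x = 2) the sum has a single term, so the p-value reduces to M(M − 1) / (N(N − 1)): rare keywords give small p-values and common keywords give large ones.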

Testing conservation score (C) and interolog score (I)

Table 3 and Table 4 present the effectiveness of the conservation (C) and interolog (I) scores based on the quasi-clique of protein networks. The raw conservation and interolog scores, with their corresponding standard errors, were evaluated for the true positive and putative interaction data sets at InParanoid scores ranging from 0.0 to 1.0. The results reveal that the conservation and interolog scores in the true positive data set were higher than those in the putative data set.

Table 3 Mean and standard error of Conservation score (C) among the different InParanoid score (IP) interaction data sets.
Table 4 Mean and standard error of Interolog score (I) among the different InParanoid score (IP) interaction data sets.


Comparison with cut-off scores

Table 5 indicates that InParanoid can predict 1,918 (5.27%) and 2,572 (2.83%) true positive interologs for one-to-one mapping and one-to-many mapping, respectively. The table also shows the precision, given by TP/(TP+FP), and the recall, given by TP/(TP+FN), where TP, FP and FN denote the numbers of true positive, false positive and false negative interactions in the predicted data sets, respectively. True positives are the overlaps between the predicted positive data set and all known human interactions (KNOWN2); false negatives are the overlaps between the predicted negative data set and all known human interactions (KNOWN2); and false positives are the predicted positives that are absent from the true positives (i.e. the putatives in this case).
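The definitions above can be checked against the counts reported earlier in this section. In the sketch below the function names are illustrative, and the putatives are treated as false positives, as the text specifies. Recall is defined but not evaluated, because the number of false negatives is not given in this passage.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)


# Counts reported in the text: true positives and putatives at two CS thresholds.
print(f"CS >= 0: {precision(2572, 88299):.2%}")
print(f"CS >= 4: {precision(1467, 18192):.2%}")
```

The two printed values reproduce the 2.83% and 7.46% true positive ratios quoted for the CS ≥ 0 and CS ≥ 4 thresholds.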

Table 5 Number of human interactions (true positives) predicted from BLAST with minimum E-value and InParanoid.

A cut-off threshold on the confidence score (CS), equation (8), was identified to increase the true positive ratio and to show the relationship between the number of predicted interactions and the coverage of known interactions. The maximum precision was obtained with a threshold of CS ≥ 4. Table 6 shows the relationship between the cut-off threshold and the data sets predicted from the 175,085 known interactions in the six reference organisms.

Table 6 Relationship between cut-off threshold and predicted human interactions (true positives).

Comparison with BLAST data sets

All of the 175,085 known interactions (Table 1) from the six reference organisms were used in the orthology search by BLAST with a minimum E-value (E ≤ 0.005 was configured in the BLAST tool). The protein sequences were downloaded from UniProt. The InParanoid one-to-one mapping (InParanoid score = 1.0) and one-to-many mappings (InParanoid score ≥ 0.0) were also compared, as were the InParanoid data sets with threshold CS = 4. Table 5 shows the results of these predictions. Although BLAST predicts more true positive interologs than the InParanoid method, it also produces a higher putative ratio. The predicted and true positive ratios reveal that InParanoid can distinguish potential true orthologs. BTP and BPU are the true positive and putative data sets, respectively, predicted by the BLAST mapping method. The scoring method testing results are also presented (see Figure 2 and Additional File 2).

Comparison with experimental data sets

All of the proteins were mapped to UniProt entry IDs, and proteins (and their interactions) that could not be confidently mapped were eliminated. Figure 4 presents the overlap among various interaction data sets, including two human experimental networks [39, 40] and our predicted interologs from six reference organisms (Huang et al.). Surprisingly, the results of the proposed interolog-based approach and the experimental high-throughput method did not overlap significantly, revealing that the methods applied to detect interactions have different biases. Therefore, the two methods (interolog-based and experimental) may reveal different and partial sub-networks of the whole human protein interaction network. The proposed method is based on evolutionarily conserved interologs, and cannot identify the species-specific interactions in the two experimental data sets.

Figure 4

Comparisons among experimental data sets. Comparisons among two experimental data sets (Rual's and Stelzl's data sets), known databases (KNOWN) and our results (Huang).

Comparison with interolog-based approach

The proposed method was compared with other interolog-based methods for predicting human PPIs, namely HomoMINT [28], HPID [25], IPPRED [24], the method of Lehner et al. [27], OPHID [23], POINT [26] and the method of Rhodes et al. [29].

The properties of the ortholog identification methods and other features are as follows.

  • The ortholog identification method determines the orthologs between model organisms. With a BLAST search (B) or a BLAST search with an E-value cutoff (BE), orthologs between organisms are not restricted to a one-to-one relationship; one-to-many and many-to-many mappings also exist. The InParanoid clustering algorithm distinguishes potential true orthologs from paralogs according to the InParanoid score (IP). Although similar structures typically share similar biological functions, identifying structural similarities by structural classification at the protein superfamily level (SS) is not trivial for human proteins on a large scale.

  • Other features indicate that some other factors affecting their interactions are considered. The quasi-clique with maximal conservation score (C), domain-domain combinations (D), sub-cellular localization (L), cell-cycle phase (P) and tissue-specificity (T) were also carefully examined in this study. Other existing methods apply the 'biological process' (BP) and 'molecular function' (MF) annotations in the GO hierarchy.

The brief comparisons in Table 7 reveal that the proposed method predicts results based on the relative conservation score and the other feature scores to obtain human PPI networks through confidence scores. A confidence score allows researchers to identify interactions qualitatively from objective and biologically reasonable judgement, rather than using a large quantity of interacting data without prioritized selection.

Table 7 Comparisons with other interolog-based approaches for predicting human PPIs.

Biological significance

Many predicted pairs have been identified in existing known human PPI databases (KNOWN) and the two human experimental PPI data sets (as shown in Figure 4). The top 20 predictions of the proposed prediction system that are not present in the existing databases are listed in Additional File 3, and some of these top predicted pairs reflect potential physical interactions. For example, for the top 1 interaction between PLK1 and STK6, PLK1 (polo-like kinase 1) was recently reported to interact with Aurora-B in playing critical roles in the regulation of chromosomal dynamics [42]. STK6 is also known as Aurora-A. The kinase domains of Aurora-A and Aurora-B share more than 70% of their sequence. Most importantly, in 3D structure, they are likely to share partially similar surface features [43]. Therefore, the interaction of Aurora-A (i.e., STK6) with PLK1 (top 1 interaction) is not surprising. ORC1, the origin recognition complex protein, binds specifically to origins of replication, and serves as a platform for the assembly of additional initiation factors including MCM and CDC6 proteins. MCM proteins form a hexameric complex with six subunits, namely MCM2, MCM3, MCM4, MCM5, MCM6 and MCM7 [44]. To date, ORC1 has been confirmed to interact with MCM2 and MCM7. ORC1 can also be reasonably expected to interact with MCM4 (top 2 interaction) and MCM6 (top 5 interaction), because they are all localized in a complex or at the origin recognition site. Furthermore, since MCM proteins form a hexamer, MCM5 can reasonably be expected to interact with MCM6 (top 3 interaction) and with MCM4 (top 4 interaction). These findings reveal that constructing a protein-protein interaction network allows novel interacting proteins to be identified. All proteins of the predicted pairs are linked to a human disease in the OMIM database [45] whenever possible (see Additional File 3). Therefore, the interaction network can be further extended through these annotated disease-associated proteins. Moreover, these predicted interactions have high conservation (C) and interolog (I) scores (Table 3 and Table 4, respectively), revealing that these interactions are evolutionarily conserved across species.


Important high-throughput approaches such as yeast two-hybrid have recently been applied to systematically identify PPIs in humans (Figure 4). Surprisingly, the results of the proposed method and those of the high-throughput experimental methods did not overlap significantly, indicating that the approaches applied to detect interactions have different biases. Hence, the two kinds of method (interolog-based and experimental) may reveal different, partial sub-networks of the complete human protein interaction network.

The accuracy of the predicted interactions depends mainly on the quality and completeness of the reference model organism interaction data sets. Although only a subset of the known interactions in the human interaction network can currently be accurately predicted (Table 2), the accuracy can be improved by large-scale protein interaction data in 'higher' eukaryotic reference model organisms in the future. The orthologous relationship between sequence and function is difficult to evaluate, because no clear measurement of functional similarity between any pair of proteins is made. Many one-to-many and many-to-many mappings exist across species, and can be used to identify protein orthologs. The InParanoid algorithm was applied because several proteins from so-called 'lower' eukaryotes have many co-orthologs in humans, and can be identified using InParanoid, but not with a simple one-to-one sequence similarity search based on BLAST or structural classification at the protein superfamily level.

The Interolog [5] concept was previously proposed to predict C. elegans PPIs from yeast. This study presents 'Interolog' as a concrete method for predicting human PPIs from those of six 'lower' eukaryotes. However, high-throughput interactions with false positives and false negatives have been noted in some eukaryotes [37]. This study utilized other features and scoring schema to derive the confidence with which human interactions are predicted using the interolog-based method. Computational analysis can be applied to determine conservation scores and other feature scores, and is readily extensible to any newly sequenced genomes. Users can construct many genome-wide PPI networks with high confidence using interolog mapping and the proposed scoring method. This concept can also be applied to discover transcription networks, such as simultaneous protein-DNA and protein-protein interaction networks [46].


The evolution of PPIs from the relative conservation score is comprehensively assessed by finding a quasi-clique from protein networks. However, PPIs in biological organisms are complex, and do not depend only on a single feature, such as protein structural complementarity, gene proximity or co-evolution.

Moreover, some other protein interaction features, including sub-cellular localization, tissue specificity, cell-cycle stage and domain-domain combinations, are also critical factors to be considered. This study describes a scoring method based on integrating these heterogeneous but significant biological resources to prioritize human protein-protein interaction networks. The analytical results indicate that the proposed method can predict potential human PPIs with higher confidence than the other methods studied (Figure 2). The analytical results also reveal that some correlations exist between the true positive data set and the data set produced by the proposed method (Figure 3). Furthermore, the conservation score of the true positive interaction data set is higher than that of the putative interaction data set (Table 3). Additionally, the proposed method allows researchers to identify quantitatively, rather than simply qualitatively, how (functional domain), when (cell-cycle stage) and where (cellular compartment and tissue specificity) the two proteins interact, using a confidence score.


Several studies have addressed both experimentally derived and in silico predicted PPIs. Examples of the evidence examined include domain-domain co-occurrence [31, 47, 48], gene co-expression as shown by microarrays [49–52] and co-localization to the same sub-cellular compartment using Gene Ontology cellular component terms [35, 38, 53, 54]. The combination of such evidence can support a broader range of PPIs than the predictions from any single feature.

Protein-protein interactions can be represented as a network graph whose vertices are proteins. These vertices are linked by edges if the corresponding proteins interact. In this study, the maximal quasi-clique determines a conservation score (C) from reference to target organism, and the interolog score (I) from the orthologous scores (IP) and (C). The other features of the protein interaction, such as spatial proximity (sub-cellular localization (L) and tissue-specificity (T)), temporal synchronicity (cell-cycle phase (P)) and domain-domain combinations (D) are also considered. Each score is normalized, and then these scores are summed into the final confidence score (CS). Figure 5 shows schematically the proposed scoring method.

Figure 5

Schematic illustration of scoring method for human PPIs determined from interologs. The protein pair (a, b) is a known interaction in the reference organism, and the corresponding orthologous protein pair (A, B) can be inferred to interact in the target organism. The five-tuple score (I, D, T, L, P) is normalized to obtain a confidence score (CS).
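The scoring step in Figure 5 can be sketched as follows. The text states only that each score is normalized and that the normalized scores are summed; min-max scaling to [0, 1] over the whole prediction set is an assumption made here for illustration, as are the function names and the toy feature values.

```python
FEATURES = ("I", "D", "T", "L", "P")  # the five-tuple named in the text


def min_max_normalize(values):
    """Scale values to [0, 1]; an assumed normalization, not taken from the study."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def confidence_scores(raw_scores):
    """raw_scores: one dict per predicted pair, mapping feature name to raw score."""
    columns = {f: min_max_normalize([row[f] for row in raw_scores]) for f in FEATURES}
    return [sum(columns[f][k] for f in FEATURES) for k in range(len(raw_scores))]


# Hypothetical raw feature scores for three predicted pairs.
toy = [
    {"I": 8.1, "D": 3, "T": 1, "L": 2, "P": 1},
    {"I": 3.0, "D": 0, "T": 0, "L": 2, "P": 0},
    {"I": 0.5, "D": 1, "T": 1, "L": 0, "P": 0},
]
cs = confidence_scores(toy)
kept = [k for k, score in enumerate(cs) if score >= 4]  # CS >= 4 threshold from the text
print(cs, kept)
```

Under this assumed normalization each feature contributes at most 1, so CS ranges from 0 to 5 and only the first toy pair passes the CS ≥ 4 cut-off.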

InParanoid score (IP)

The InParanoid [55] algorithm was designed to distinguish potential true orthologs from co-orthologs (paralogs) based on the best pairwise protein sequence similarity between organisms. The orthologous score, IP, denotes the InParanoid score; the main orthologs always receive a score of 1.0, and the other paralogs receive scores from 0.0 to 1.0. Table 8 shows the predicted interologs and true positives mapped using only InParanoid data without considering other features. A lower IP score yields more true positives in absolute number, which reflects the one-to-many and many-to-many ortholog mappings across species. However, the true positive ratio shows no corresponding improvement in quality. The other features must therefore be considered in order to filter out the predicted interactions that have low confidence scores.

Table 8 Number of interologs and true positives predicted by InParanoid score (IP) without other feature scores.

Quasi-clique and conservation score (C)

Let G = (V, E) denote a graph, where V is the set of vertices and E is the set of edges in graph G. A graph is γ-dense if $\gamma = \frac{2|E|}{|V|(|V| - 1)}$. For a subset $S \subseteq V$, $G_S$ is the sub-graph induced by S. A quasi-clique, also called a γ-clique, is a subset S of V such that the induced graph $G_S$ is connected and γ-dense. The original maximum clique problem is to find a 1-clique, that is, a complete sub-graph (γ = 1) with the maximum number of vertices in graph G.

A quasi-clique in PPI networks is a group of proteins that tend to interact with each other, but a complete sub-graph (γ = 1) is not always biologically significant. Hence, C = γ |E| is defined as the protein complex conservation score, where |E| is the number of functional links in the protein complex.

Some recent studies have concluded that motif modules and their constituents in a specific functional protein network are highly conserved across species [56, 57]. Evolutionary rate analysis [58] has indicated that the connectivity of well-conserved proteins in the network is negatively correlated with their rate of evolution. More connected proteins in an interaction network evolve at a lower rate, because they are subject to a higher pressure to co-evolve with other interacting proteins. This study searches for a quasi-clique with maximal relative conservation score C in a protein complex. Figure 6 illustrates an example of such a quasi-clique.

Figure 6

Relationship among γ, |E| and C. Relationship among γ, |E| and C. (a) Three proteins interacting as a complex with three functional links; (b) five proteins interacting as a complex with nine functional links. Although the protein complex in (a) has a higher γ = 1.0 than the protein complex in (b), that in (b) is more biologically significant. Therefore, C = γ |E| is taken as the relative conservation score for a protein complex.
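The two complexes in Figure 6 make a convenient worked example of the density γ and the conservation score C = γ |E|. The sketch below uses only the vertex and edge counts given in the caption; the function names are illustrative.

```python
def density(n_vertices, n_edges):
    """gamma = 2|E| / (|V| (|V| - 1)) for a simple undirected graph."""
    if n_vertices < 2:
        raise ValueError("need at least two vertices")
    return 2 * n_edges / (n_vertices * (n_vertices - 1))


def conservation_score(n_vertices, n_edges):
    """C = gamma * |E|."""
    return density(n_vertices, n_edges) * n_edges


# (a) three proteins with three links; (b) five proteins with nine links.
for label, v, e in (("a", 3, 3), ("b", 5, 9)):
    print(label, density(v, e), conservation_score(v, e))
```

Complex (a) is a complete sub-graph with γ = 1.0 and C = 3, whereas complex (b) has γ = 0.9 but C = 8.1, which is why C rather than γ alone is used to rank quasi-cliques.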

Figure 7

Protein-protein interolog score. Protein-protein interolog score, where A-a and B-b are orthologs between the two organisms. The orthologous protein pair $(A_i, B_j)$ can be inferred to interact in a target organism if the protein pair (a, b) interacts in a reference organism. $G_{ab}$ is the sub-graph of proteins that interact with both a and b; $C_{ab}$ is the quasi-clique with maximal conservation score in $G_{ab}$, and $IP_{A_i}$ and $IP_{B_j}$ are the InParanoid scores of paralogs i and j of orthologs A and B, respectively, in the target organism.

Interolog score (I)

The protein interaction data used for mapping human protein interaction networks came from six eukaryotic reference organisms, namely Rattus norvegicus, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana and Saccharomyces cerevisiae. These data were obtained from AfCS-Nature [59], BIND, BioGRID, CYGD, the CORE subset of DIP, IntAct, MINT and MPPI. Table 9 lists the numbers of distinct interactions in each data set.

Table 9 Number and sources of model organism interaction data sets.

The interolog concept states that proteins that interact in one organism co-evolve so that their respective orthologs maintain the ability to interact in another organism. For example, as shown in Figure 7, if two proteins (a, b) interact in the reference organism, then the corresponding pairs of orthologs and paralogs (A1, B1), (A1, B2), (A1, B3), (A2, B1), (A2, B2) and (A2, B3) can be inferred to interact in a target organism, and the interolog score (I) can be determined as follows.

Figure 8

Example of GO cellular component hierarchy from depth levels 0 to 8. A protein pair (A, B) has the GO cellular component annotations 'cell' and 'spindle' at depths 2 and 8, respectively. The common GO terms among their ancestor terms (including the original terms) are 'Gene Ontology', 'cellular component' and 'cell'. The deepest common term is 'cell', at a depth of 2.

$$ I_{ij} = w_{ec} \ast \min(IP_{A_i}, IP_{B_j}) \ast C_{ab} $$

The weight of evolutionary conservation ($w_{ec}$) is defined such that a higher $w_{ec}$ value indicates a reference organism that is genetically closer to humans. The following $w_{ec}$ values were used: $w_{rat}$ = 1.0, $w_{mouse}$ = 1.0, $w_{fly}$ = 0.75, $w_{worm}$ = 0.75, $w_{thalecress}$ = 0.5 and $w_{yeast}$ = 0.25 for rat, mouse, fly, worm, thale cress and baker's yeast, respectively. Because rat and mouse are both mammals, and are thus genetically closest to human, they were assigned the highest value of 1.0. Drosophila and C. elegans are two animal models that are widely studied to understand human disease genes and development, and are ranked second closest to humans among the organisms studied. Finally, thale cress is ranked higher than yeast, since it is a multi-cellular organism, while yeast is a single-cell species. If a human protein-protein interaction is derived from two or more reference model organisms, then only the highest interolog score is used to generate non-redundant (nr) human protein-protein interactions.
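The interolog score and the non-redundancy rule can be sketched as follows. This is an illustrative sketch rather than the code used in this study: the function names and the numeric inputs (InParanoid scores and conservation scores) are hypothetical, and only the weights are taken from the text.

```python
# Evolutionary-conservation weights w_ec as listed in the text.
W_EC = {
    "rat": 1.0, "mouse": 1.0, "fly": 0.75,
    "worm": 0.75, "thale_cress": 0.5, "yeast": 0.25,
}


def interolog_score(organism: str, ip_a: float, ip_b: float, c_ab: float) -> float:
    """I_ij = w_ec * min(IP_Ai, IP_Bj) * C_ab for one reference organism."""
    return W_EC[organism] * min(ip_a, ip_b) * c_ab


def best_interolog_score(evidence) -> float:
    """Keep only the highest score when several organisms support one pair.

    `evidence` is an iterable of (organism, ip_a, ip_b, c_ab) tuples.
    """
    return max(interolog_score(*item) for item in evidence)


# Hypothetical evidence for a single human protein pair (made-up numbers).
evidence = [
    ("yeast", 1.0, 0.8, 8.1),   # 0.25 * 0.8 * 8.1
    ("mouse", 0.9, 1.0, 3.0),   # 1.0 * 0.9 * 3.0
]
print(best_interolog_score(evidence))
```

In this made-up case the mouse-derived score is retained even though the yeast complex has the larger conservation score, because the weight $w_{ec}$ down-weights the more distant organism.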

Domain-domain combination score (D)

A probabilistic framework [31] has been presented to predict the interaction probability of proteins, and an interaction possibility ranking method has been developed for multiple protein pairs using the Potentially Interacting Domain Combination Pair (PIDC). This study adopted the PIDC concept and accumulated all domain combinations from the known interactions in the experimental databases. Consider a pair of interacting proteins A and B with multiple domains. For example, a domain set is $D_d = \{d_1, d_2, d_3, \ldots, d_m\}$, and its power set without the empty set is $PD_d = \{\{d_1\}, \{d_2\}, \{d_3\}, \ldots, \{d_1, d_2, d_3, \ldots, d_m\}\}$, which contains $2^m - 1$ domain combinations. The protein domain information was downloaded from the Pfam [60] domain annotation database. The domain-domain combination score, D, was calculated by summing the appearance probability as follows:

$$ D = \sum_{j=1}^{2^m-1} \sum_{i=1}^{2^m-1} \frac{N'(pd_i, pd_j)}{N(pd_i, pd_j)} \quad \text{if } pd_i \in PD_d,\ pd_j \in PD_d $$

where $pd_i$ and $pd_j$ are sets i and j in the power set $PD_d$, respectively, and $N'(pd_i, pd_j)$ and $N(pd_i, pd_j)$ are the number of interacting protein pairs and the total number of protein pairs that contain ($pd_i$, $pd_j$) in known interactions, respectively.
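The summation over domain combinations can be illustrated with a short sketch. Several points here are assumptions rather than details given in the text: the domain combinations of protein A are paired with those of protein B, the counts N' and N are supplied as dictionaries keyed by an unordered pair of combinations, combination pairs absent from the known data contribute nothing, and the domain names and counts are hypothetical.

```python
from itertools import combinations


def nonempty_subsets(domains):
    """All non-empty subsets of a domain set (2^m - 1 of them), as frozensets."""
    items = sorted(domains)
    return [frozenset(combo)
            for size in range(1, len(items) + 1)
            for combo in combinations(items, size)]


def pair_key(pd_i, pd_j):
    """Unordered key for a pair of domain combinations."""
    return frozenset((frozenset(pd_i), frozenset(pd_j)))


def domain_combination_score(domains_a, domains_b, n_interacting, n_total):
    """Sum N'(pd_i, pd_j) / N(pd_i, pd_j) over all domain-combination pairs."""
    score = 0.0
    for pd_i in nonempty_subsets(domains_a):
        for pd_j in nonempty_subsets(domains_b):
            key = pair_key(pd_i, pd_j)
            total = n_total.get(key, 0)
            if total > 0:  # unseen combination pairs add nothing (assumption)
                score += n_interacting.get(key, 0) / total
    return score


# Hypothetical counts: N (all protein pairs) and N' (interacting pairs).
n_total = {pair_key({"d1"}, {"d3"}): 4, pair_key({"d1", "d2"}, {"d3"}): 2}
n_inter = {pair_key({"d1"}, {"d3"}): 1, pair_key({"d1", "d2"}, {"d3"}): 2}
print(domain_combination_score({"d1", "d2"}, {"d3"}, n_inter, n_total))
```

With these made-up counts the score is 1/4 + 2/2 = 1.25; the combination ({d2}, {d3}) is not present in the counts and therefore adds nothing.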

Tissue specificity score (T)

Tissue specificity is another spatial proximity feature to be considered. Two proteins that are active at the same sub-cellular location and co-expressed in the same tissue are likely to interact with each other. This information can be used to discover tissue-specific PPIs associated with human diseases for biomedical research. Tissue-specific gene expression information was extracted from the GeneAtlas Affymetrix data set, which includes 44,775 human probe sets (30,694 proteins) from 79 normal human tissue samples [61].

The tissue specificity score T is the number of tissues in which both proteins show at least 2-fold up-regulated expression (log2 expression ratio ≥ 1) relative to their respective mean expression values across the tissue samples.

$$ T = \sum_{i=1}^{79} 1 \quad \text{if } \log_2 \frac{e_{A_i}}{\overline{e_A}} \ge 1 \text{ and } \log_2 \frac{e_{B_i}}{\overline{e_B}} \ge 1 $$

where $e_{A_i}$ and $e_{B_i}$ are the normalized expression values of proteins A and B, respectively, in tissue sample i, and $\overline{e_A} = \frac{1}{79} \sum_{i=1}^{79} e_{A_i}$ and $\overline{e_B} = \frac{1}{79} \sum_{i=1}^{79} e_{B_i}$ are the mean expression values of proteins A and B, respectively, over the 79 tissue samples.
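The counting rule for T can be sketched as below. This is a minimal illustration, assuming each protein is represented by a list of non-negative normalized expression values with one entry per tissue sample; the function name and the eight-tissue example profiles are hypothetical (the study uses 79 samples).

```python
import math


def tissue_specificity_score(expr_a, expr_b, min_log2_ratio=1.0):
    """Count tissues where both proteins are >= 2-fold above their own mean."""
    if len(expr_a) != len(expr_b) or not expr_a:
        raise ValueError("profiles must be non-empty and of equal length")
    mean_a = sum(expr_a) / len(expr_a)
    mean_b = sum(expr_b) / len(expr_b)
    if mean_a <= 0 or mean_b <= 0:
        return 0  # no usable expression signal
    count = 0
    for e_a, e_b in zip(expr_a, expr_b):
        if e_a <= 0 or e_b <= 0:
            continue  # log2 is undefined here; treat as not up-regulated
        if (math.log2(e_a / mean_a) >= min_log2_ratio
                and math.log2(e_b / mean_b) >= min_log2_ratio):
            count += 1
    return count


# Hypothetical profiles over eight tissues.
protein_a = [100, 100, 5, 5, 5, 5, 5, 5]   # up-regulated in tissues 0 and 1
protein_b = [80, 2, 80, 2, 2, 2, 2, 2]     # up-regulated in tissues 0 and 2
print(tissue_specificity_score(protein_a, protein_b))
```

Only tissue 0 passes the threshold for both proteins in this example, so T = 1.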

Sub-cellular localization score (L)

A physical PPI requires contact between two proteins at a particular cellular location. Hence, this study used the depth of Gene Ontology (GO) [62] annotations in the 'Cellular Component' (CC) hierarchy, discarding uninformative GO terms such as 'cellular component unknown' and 'obsolete cellular component'.

If two interacting proteins share a common ancestor GO term, then the sub-cellular localization score L is the depth of the deepest GO term that is common to the ancestor terms (including the annotated terms themselves) of both proteins in the GO hierarchy. For example, a protein pair (A, B) has the GO cellular component annotations 'GO:0005623 cell' and 'GO:0005819 spindle' at depths of 2 and 8, respectively. The sub-cellular localization score is L = 2, since the deepest common GO term among the ancestors is at a depth of 2 in the GO hierarchy. Figure 8 shows the detailed hierarchy.
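The deepest-common-ancestor rule can be sketched on a toy hierarchy. The sketch assumes a tree in which each term has a single parent, whereas the real GO hierarchy is a directed acyclic graph with multiple parents; the terms between 'cell' and 'spindle' are hypothetical placeholders, since only the depths are given in the text.

```python
def ancestors_with_depth(term, parent):
    """Map each term on the path from `term` to the root to its depth.

    `parent` maps a term to its single parent; the root maps to None.
    """
    path = []
    while term is not None:
        path.append(term)
        term = parent[term]
    # The root is last on the path and receives depth 0.
    return {t: depth for depth, t in enumerate(reversed(path))}


def localization_score(term_a, term_b, parent):
    """Depth of the deepest term shared by both ancestor paths (terms included)."""
    depths_a = ancestors_with_depth(term_a, parent)
    depths_b = ancestors_with_depth(term_b, parent)
    common = depths_a.keys() & depths_b.keys()
    return max(depths_a[t] for t in common) if common else 0


# Toy hierarchy: 'cell' at depth 2 and 'spindle' at depth 8, as in the text.
parent = {
    "Gene Ontology": None,
    "cellular component": "Gene Ontology",
    "cell": "cellular component",
}
previous = "cell"
for depth in range(3, 8):
    name = f"placeholder_{depth}"  # hypothetical intermediate term
    parent[name] = previous
    previous = name
parent["spindle"] = previous

print(localization_score("cell", "spindle", parent))
```

On this toy hierarchy the shared terms are 'Gene Ontology', 'cellular component' and 'cell', so the score is 2, in agreement with the worked example.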

Cell-cycle stage score (P)

Human cell cycle cDNA microarray analysis [63] reveals cell cycle-regulated genes. Table 10 lists the numbers of non-redundant (nr) proteins mapped from the original 1,134 expressed clones at different cell cycle phases. The cell-cycle stage score P is the number of cell cycle phases shared by two interacting proteins.
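Computing P amounts to a set intersection. The sketch below assumes that each protein is annotated with a collection of phase labels; the function name and the labels used in the example are illustrative and are not taken from Table 10.

```python
def cell_cycle_stage_score(phases_a, phases_b) -> int:
    """P = number of cell cycle phases in which both proteins are expressed."""
    return len(set(phases_a) & set(phases_b))


# Illustrative phase annotations for two hypothetical proteins.
protein_a_phases = {"G1/S", "S", "G2"}
protein_b_phases = {"S", "G2", "G2/M"}
print(cell_cycle_stage_score(protein_a_phases, protein_b_phases))
```

The two hypothetical proteins share the phases 'S' and 'G2', giving P = 2; proteins without any shared phase receive P = 0.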

Table 10 Number of human cell cycle-regulated proteins at different phases.

Confidence score (CS)

The five feature scores (I, D, T, L, P) are combined into an overall confidence score according to equation (8), where $\overline{D_K}$, $\overline{L_K}$, $\overline{P_K}$ and $\overline{T_K}$ are the mean values of the respective feature scores in the known human interaction data sets (KNOWN2), and $\overline{I_R}$ is the mean interolog score in one reference organism.

$$ CS = w_I \ast \frac{I}{\overline{I_R}} + w_D \ast \frac{D}{\overline{D_K}} + w_T \ast \frac{T}{\overline{T_K}} + w_L \ast \frac{L}{\overline{L_K}} + w_P \ast \frac{P}{\overline{P_K}} \tag{8} $$

In this scoring scheme, all data sources are weighted equally: $w_I$ = 1, $w_D$ = 1, $w_T$ = 1, $w_L$ = 1 and $w_P$ = 1. Moreover, the confidence score cutoff was set to CS = 4, as derived from a recall ratio ≥ 50% (Table 6).
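The confidence score is a weighted sum of mean-normalized feature scores, which can be sketched as follows. The function name, the feature values and the mean values are hypothetical; the actual means come from KNOWN2 and from the reference organism and are not reproduced here. Treating CS = 4 as a minimum cutoff is likewise an assumption of this sketch.

```python
FEATURES = ("I", "D", "T", "L", "P")


def confidence_score(scores, means, weights=None) -> float:
    """CS = sum over features f of w_f * (score_f / mean_f)."""
    if weights is None:
        weights = {f: 1.0 for f in FEATURES}  # equal weighting, as in the text
    return sum(weights[f] * scores[f] / means[f] for f in FEATURES)


# Hypothetical feature scores for one predicted interaction.
scores = {"I": 2.0, "D": 1.0, "T": 3.0, "L": 4.0, "P": 1.0}
# Hypothetical mean values (I uses the reference-organism mean).
means = {"I": 2.0, "D": 0.5, "T": 6.0, "L": 4.0, "P": 2.0}

cs = confidence_score(scores, means)
CUTOFF = 4.0  # assumed to act as a minimum threshold
print(cs, cs >= CUTOFF)
```

Dividing each feature by its mean puts the five scores on a comparable scale, so a feature that is exactly average contributes 1 to CS. The made-up example sums to 1 + 2 + 0.5 + 1 + 0.5 = 5.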