Background

In the post-genomic era, the study of networks has attracted unprecedented attention, and network-based analyses have played fundamental roles in biological research. Indeed, most genes and proteins function through complex networks of interactions rather than on their own [1]. Recently, advances in high-throughput experimental technologies have made an ever-increasing amount of data on protein interaction networks (PINs) available. PINs provide a novel perspective for studying the principles that drive the evolution of living organisms.

In the study of the evolution of PINs, one of the most basic and important problems is how the PIN originated and grew. Many researchers have tried to answer this question using multiple approaches. Through theoretical modeling, several evolutionary models of PINs have been established [2–10]. Through analyses of real PINs, several interesting possible mechanisms have been uncovered [11–16]. Based on the finding that proteins with similar phylogenetic profiles tend to interact with each other, Qin et al. were the first to present the hypothesis that the evolution of PINs has involved the addition of clustered nodes [12].

Previous studies on the evolution of PINs focus on the individual protein level [11, 17–27], the interaction level [11, 14, 28–30], the functional module level [9, 15, 31–37] or the whole network level [2–8, 10, 13, 16]. Few study the evolution of PINs from the perspective of network motifs [38, 39]. Network motifs are recurring interconnected patterns of specific topology in complex networks, and may represent the simplest building blocks of cellular machines [38, 40]. Motifs have also been found to be evolutionarily conserved topological units of cellular networks, which suggests that they are of biological significance [38]. Further, in contrast to functional modules [41], motifs have a precise definition, so they can be explicitly identified and enumerated in various cellular networks [40].

Considering these advantages, in this paper we explore the evolution of PINs from the perspective of network motifs, and try to provide further evidence for the hypothesis that the evolution of PINs has involved the addition of clustered interacting proteins. First, we classify proteins by their time of origin, and analyze the tendency of proteins of the same/different age classes to form motifs in the PIN. We then investigate whether the co-origin of motif constituents is affected by motif topology and biological function. Next, we focus on age-homogeneous motifs, whose constituents are all of the same age class, and analyze the evolution and functions of their members. Finally, we discuss how our findings support the hypothesis of clustered additions and what the underlying driving force of the clustered additions may be.

Results

The tendency between proteins of the same/different age classes to form motifs

To understand the evolutionary history of PINs from the network motif perspective, we first analyze the tendency between proteins of the same/different age classes to form motifs in the PIN.

We classify proteins based on their ages of origin. In our work, we use orthologous groups from orthoMCL [42] to construct the phylogenetic profile of each protein and then to assess its age of origin. Each orthoMCL orthologous group is composed of orthologs and only "recent paralogs", whose sequences are similar and whose functions are thus likely to remain similar. "Ancient paralogs", whose sequences have diverged and whose functions are thus likely to have diverged, are assigned to different orthologous groups, and their ages are therefore assessed separately. Using this method, we can crudely assign the age of origin of a protein to the time when it obtained its present-day function. There is in fact no single, optimal method to define the age of origin of a protein, especially for a protein derived from duplication, which is a major source of new genes [43, 44]. On the one hand, even though we can crudely assess the time when a duplication event happened, in most cases it does not make sense to distinguish which copy is the ancestral one and which is the one created by the duplication [45]. Therefore, it seems improper to assign the age of origin of one or both of the duplicates to the time of the duplication event. On the other hand, for research on the growth of PINs, it is also improper to assign the age of origin of all proteins derived by direct or indirect duplication from a common, earliest traceable ancestral protein to the time when that ancestor emerged, because new proteins descending directly or indirectly from the ancestor are continuously produced at various stages of PIN evolution after the ancestor was created, and these present-day descendants are likely to have diverged significantly in function from each other and from the ancestor. Therefore, in our work, we try to define the origin of a protein by taking both phylogeny and (sequence and) function as reference.
In particular, a protein derived from duplication is considered new once it has evolved a sequence and function significantly divergent from those of its ancestor. This definition of the age of origin simply takes sequences and functions as reference, which not only avoids the troublesome reconstruction of the origin and evolutionary history of proteins, especially proteins derived from duplication, but also gives us the opportunity to infer the evolutionary process of today's PINs from a functional perspective.

As shown in Figure 1, we classify the yeast proteins into 5 age classes based on taxonomy [46]. The most ancient yeast proteins, with age 5, are those that originated in the common ancestor of the three domains of the tree of life (Eukaryota, Bacteria and Archaea) (cellular organisms class: node Cellular organisms). Proteins of the second class, with age 4, are those whose traced ancestors appeared before the radiation of Eukaryota (and after the radiation of the common ancestor of life) (eukaryota class: node Eukaryota). Those with age 3 emerged before the split of fungi from the rest of the fungi/metazoa group (fungi/metazoa class: node Fungi/Metazoa group). Those of the fourth class evolved before the split of S. cerevisiae from other fungi (fungi class: node Fungi, node Dikarya, node Ascomycota, node Saccharomyceta, node Saccharomycetales and node Saccharomycetaceae). The youngest class contains proteins found only in S. cerevisiae (yeast class).
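
The classification above can be sketched as a simple lookup from the taxonomy node at which a protein originated to its age class. This is only an illustration: the text states ages 5, 4 and 3 explicitly, while the values 2 (fungi class) and 1 (yeast class) are assumed here from the ordering of the classes, and the leaf label "Saccharomyces cerevisiae" and the names `NODE_TO_AGE` and `age_class` are hypothetical.

```python
# Taxonomy nodes on the path from the root to S. cerevisiae (Figure 1),
# mapped to age classes. Ages 2 and 1 are assumed from the class ordering.
NODE_TO_AGE = {
    "Cellular organisms": 5,       # cellular organisms class
    "Eukaryota": 4,                # eukaryota class
    "Fungi/Metazoa group": 3,      # fungi/metazoa class
    "Fungi": 2,                    # fungi class (six nodes)
    "Dikarya": 2,
    "Ascomycota": 2,
    "Saccharomyceta": 2,
    "Saccharomycetales": 2,
    "Saccharomycetaceae": 2,
    "Saccharomyces cerevisiae": 1, # yeast class (hypothetical leaf label)
}


def age_class(origin_node):
    """Return the age class for the node at which a protein originated."""
    return NODE_TO_AGE[origin_node]
```

The ten keys correspond to the ten representative time points that are grouped into five age classes.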

Figure 1

Schematic representation of the age classification of proteins. We classify the yeast proteins into 5 age classes based on the phylogenetic relationship of 138 species [46]. Inner nodes on the evolutionary tree represent ancestral organisms and inner nodes on the path from root to S. cerevisiae indicate representative time points when the yeast proteins originated during evolution. The path that leads to S. cerevisiae is highlighted in bold and 5 age classes are labeled with different colors. The inset table shows the age class distribution of the yeast proteins in the PIN of DIP_YEAST_CORE. The inner nodes on the path from root to H. sapiens are also labeled. For the age classification of human proteins, please refer to Supplementary Methods and Results.

To study the interconnection tendency between proteins of the same/different age classes, we define, based on network motifs, "evolutionary motif modes" to characterize particular interconnected patterns of proteins of the same/different age classes (Figure 2). We compute an empirical P-value for each kind of evolutionary motif mode with specific topology to check the statistical significance of its enrichment or depletion in the real PIN (see Methods). Based on the reliable yeast PIN of DIP_YEAST_CORE [47], we find that for motifs of a specific topology, the evolutionary motif modes shift from enrichment to depletion as their constituents change from proteins of the same age class to proteins of different age classes (Table 1). The results indicate that in the PIN, proteins of the same age class tend to interact with each other and further to cluster into motifs, while proteins of different age classes tend to avoid interacting with each other and thus to avoid forming motifs.

Figure 2

Network motifs and evolutionary motif modes. There are two interconnected patterns for 3-motifs and six for 4-motifs. Evolutionary motif modes of a 3-motif and a 4-motif of specific topology are shown, different node colors indicating different protein age classes. For example, for each 4-motif of specific topology, in total there are five possible evolutionary motif modes which are marked as #4, #3-1, #2-2, #2-1-1 and #1-1-1-1. The label for an evolutionary motif mode indicates the number of nodes of different age classes within the motif mode. For example, #4 indicates that all the four proteins within the motif mode are of the same age class, and #2-2 indicates that two of the four proteins within the motif mode are of one age class, while the other two are of another age class.
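
The labelling scheme in this caption can be illustrated with a short sketch. It assumes that a label lists the sizes of the age-class groups within a motif in descending order, which matches the examples #4, #3-1, #2-2, #2-1-1 and #1-1-1-1. The enumeration routine is a brute-force stand-in for vertex-induced 3-node subgraphs, not the RAND-ESU algorithm implemented in FANMOD; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def motif_mode_label(ages):
    """Label a motif by the sizes of its age-class groups, largest first."""
    counts = sorted(Counter(ages).values(), reverse=True)
    return "#" + "-".join(str(c) for c in counts)


def induced_3_subgraphs(edges):
    """Yield (nodes, n_edges) for every connected vertex-induced 3-node subgraph.

    Brute force over all node triples; only suitable for small networks.
    """
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    for trio in combinations(sorted(adj), 3):
        n_edges = sum(1 for u, v in combinations(trio, 2) if v in adj[u])
        if n_edges >= 2:  # 2 edges: open path; 3 edges: triangle
            yield trio, n_edges
```

For example, `motif_mode_label([5, 5, 3, 3])` gives `"#2-2"`. Because each node triple is visited once and its topology is read from all edges among its members, the same vertex set can never be reported with two different topologies.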

Table 1 Interconnection tendency of proteins of the same/different age classes in the PIN of DIP_YEAST_CORE

We obtain similar results on other PIN datasets, namely YEAST_HC [10], HPRD_HUMAN_HIGH [48], DIP_YEAST [47] and HPRD_HUMAN_ALL [48] (see additional file 1: Table S2, S3, S4, S5, S6, S7, S8 and S9); the last two datasets are not well quality-controlled and are thus of relatively low quality. The similar results across datasets indicate that the conclusion above is robust to differences in data quality and even to differences between organisms.

Here we group ten representative time points into five age classes for yeast based on taxonomy (Figure 1). All the conclusions in this paper remain unchanged under different classifications of age groups (see additional file 1: Supplementary Results and Table S17, S18, S19, S20, S21, S22, S23, S24, S25, S26, S27, S28, S29, S30, S31, S32). In addition, many ribosomal proteins are known to be evolutionarily conserved and old, so the ribosomal proteins in the PIN might influence our results. We find that when the ribosomal proteins annotated by FunCat [49] are removed from the PIN of DIP_YEAST_CORE, all the results in the paper still hold (see additional file 1: Table S33, S34, S35, S36, S37, S38, S39 and S40).

The influence of topologies and biological functions on co-origins of motif constituents

Proteins of the same age class tend to form motifs, while those of different age classes tend to avoid forming motifs. This finding means that in the PIN, the age homogeneity of motif constituents is higher than random expectation. In this part we further analyze whether the age homogeneity of motif constituents differs between classes of motifs with specific topology and/or function in the real PIN. For this purpose, we introduce the "age homogeneity rate" and the "age homogeneity ratio". The "age homogeneity rate" is the fraction of motifs whose constituents are all of the same age class among a class of motifs with specific topology and/or function. The "age homogeneity ratio" is the ratio of the age homogeneity rate in the real network to its random expectation; it measures the extent to which a class of motifs with specific topology and/or function constrains the co-origin of its constituents.
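
The two quantities can be written down directly from their definitions. The sketch below is illustrative only: it assumes that a motif is given as a tuple of protein identifiers, that `age` maps each protein to its age class, and that the random expectation is taken as the mean rate over the randomized networks; all names are hypothetical.

```python
def age_homogeneity_rate(motifs, age):
    """Fraction of motifs whose constituents all belong to one age class."""
    if not motifs:
        raise ValueError("at least one motif is required")
    homogeneous = sum(1 for m in motifs if len({age[p] for p in m}) == 1)
    return homogeneous / len(motifs)


def age_homogeneity_ratio(real_rate, random_rates):
    """Ratio of the real rate to the mean rate over randomized networks."""
    expected = sum(random_rates) / len(random_rates)
    return real_rate / expected
```

With `age = {"A": 5, "B": 5, "C": 5, "D": 1}` and the motifs `("A", "B", "C")` and `("A", "B", "D")`, the rate is 0.5; if the randomized networks give a mean rate of 0.25, the ratio is 2.0, i.e. age-homogeneous motifs are twice as frequent as expected by chance.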

We observe that in the PIN of DIP_YEAST_CORE, motifs with different topologies indeed have different age homogeneity rates (chi-square test, P < 10⁻⁴ for 3-, 4- and 5-motifs), while this phenomenon is absent in random networks (Table 2). In particular, among motifs with a given number of nodes, the age homogeneity rates seem to be correlated with topological saturation (Table 2). To quantify this relationship, we test the correlation between the topological saturation of motifs (measured simply by the number of edges within the motifs) and their age homogeneity (see additional file 1: Table S11), and the correlation, for single proteins, between the clustering coefficient and the age homogeneity (defined as the fraction of a protein's interaction partners that are of the same age class as the protein) (see additional file 1: Figure S1). In both cases we observe weak but significant positive correlations. Furthermore, by analyzing the age homogeneity ratio, we find that the constraint that motifs place on the co-origin of their constituents seems to strengthen as the number of nodes and edges increases.

Table 2 Constraints of topologies on the co-origins of motif constituents

To find out whether the biological functions of the yeast proteins within the motifs affect their age homogeneity, we consider only those motifs whose constituents share at least one common functional category, and assign such motifs to the common functional class. First, we find that the conclusion that the age homogeneity of motif constituents is higher than random expectation holds for most classes of motifs with specific function (Table 3). Further, we find that different biological functions have different age homogeneity rates (chi-square test, P < 10⁻⁴ for 3- and 4-motifs) and age homogeneity ratios: motifs belonging to the functional classes of protein fate, protein synthesis and transcription tend to have high age homogeneity ratios, while those belonging to the functional classes of energy, signal transduction and metabolism have low ratios, that is, weak co-origin constraints.

Table 3 Constraints of functions on the co-origins of motif constituents

Finally, we also check the joint impact of motif topologies and functions on the co-origin of motif constituents (see additional file 1: Table S13). We find that the conclusion that the age homogeneity of motif constituents is higher than random expectation also holds for most classes of motifs with specific function and topology. Judging by their age homogeneity ratios, different combinations of biological function and topology impose different joint constraints on the co-origin of motif constituents.

Evolutionary rates and functions of the proteins within motifs whose constituents are of the same age class

To further analyze the evolutionary history of the PIN from network motifs, we focus on those age-homogeneous motifs whose constituents are of the same age class and analyze them from the following aspects.

First, by computing the evolutionary rates, we find that the proteins within the age-homogeneous motifs co-evolve to a significantly higher degree than those participating in the other motifs (Figure 3A, B). Second, we observe that the constituents of the age-homogeneous motifs tend to share the same biological functions (Table 4). Conversely, the proteins within motifs whose members share at least one common functional category tend to be of the same age class, compared with those within the other motifs (see additional file 1: Table S14). Further, compared with the other motifs, the age-homogeneous motifs tend to lie within protein complexes (see additional file 1: Table S15). Finally, we find that these motifs also tend to be densely interconnected (see additional file 1: Table S16), which is consistent with the finding that motifs of high topological saturation tend to be of high age homogeneity (Table 2 and Table S11).

Figure 3

Distributions of the evolutionary rate difference of protein pairs within the age-homogeneous motifs and the other motifs. The probability (y-axis) is calculated as the percentage of protein pairs whose evolutionary rate difference falls in the interval shown on the x-axis. (A) 3-motifs. The average evolutionary rate difference is 5.8 × 10⁻² for 3-motifs whose constituents are of the same age class and 7.9 × 10⁻² for the other 3-motifs. Rank sum test, P < 10⁻⁴. (B) 4-motifs. The average evolutionary rate difference is 6.0 × 10⁻² and 8.0 × 10⁻² for the two 4-motif classes. Rank sum test, P < 10⁻⁴. Protein pairs common to the two motif classes are removed in the analyses. The results are based on the PIN of DIP_YEAST_CORE.

Table 4 Functional homogeneity rates of the age-homogeneous motifs and the other motifs

In 2003, Wuchty et al. found that in yeast, proteins that participate in motifs are more conserved than those that do not [38]. Here we further find that, compared with the other motif constituents, proteins participating in age-homogeneous motifs show a significant tendency to co-evolve, share the same functions and be densely interconnected, and that these motifs tend to lie within protein complexes.

Discussion

Evidence for the hypothesis of the clustered additions from network motifs

In 2003, based on the finding that proteins with similar phylogenetic profiles tend to interact with each other, Qin et al. first presented the hypothesis that the evolution of PINs has involved the addition of clustered nodes [12]. Here we find that proteins of the same age class not only tend to interact but also tend to form motifs (Table 1), which provides more direct support for the hypothesis of clustered additions. By "the addition of clustered interacting proteins during the evolution of PINs" we mean that several proteins, along with the interactions between them, originated and joined the PIN during a relatively short period of time.

We further explore the possibility of clustered additions by discussing two alternative scenarios that could have led to the formation of today's age-homogeneous motifs. In the first scenario, the proteins formed motifs during almost the same period of time in which they originated, that is, they were added as a cluster during this period. In the second, the interactions between the constituents appeared gradually over a long period of time after the constituents originated, and ultimately formed today's motifs from separate nodes. On intuitive and parsimonious grounds, we support the former. Protein interactions are frequently conserved across multiple organisms [50, 51], which is also the theoretical basis for predicting protein interactions from orthologs [52–56]. In our study, proteins within the age-homogeneous motifs show a significant tendency to share similar phylogenetic profiles (see additional file 1: Figure S2), which means that these proteins significantly co-occur in different genomes. We already know that they form motifs in yeast. Based on the conservation of interactions, we can therefore speculate that their co-occurring orthologs are likely to form motifs in other species. When a motif exists in multiple species, the most parsimonious explanation is that the motif existed in the ancestral species rather than forming gradually and independently in each descendant species. This suggests that the proteins within today's age-homogeneous motifs formed motifs during almost the same period of time in which they originated, that is, they are much more likely to have been added to the PIN as clusters during evolution.

Meanwhile, the co-evolution (Figure 3A, B) and functional homogeneity (Table 4; Table S14 and Table S15 in additional file 1) of the constituents of the age-homogeneous motifs are consistent with their clustered additions. It is likely that after the traced ancestors of these proteins were added to the PIN as a cluster (perhaps as a result of functional needs), they together played a functionally important role, and thus experienced similar internal and external pressures and co-evolved, maintaining a stable motif structure that "guarantees" their biological functions.

Our results from network motifs suggest that the proteins within age-homogeneous motifs tend to have been added as clusters during a (short) period of time. However, such tendencies toward clustered addition are affected by topology and biological function. Motifs with specific functions and dense topology were more likely to be added to the PIN as clusters (Table 2 and 3).

The impact of "recent paralogs" on the issue of the clustered additions

In our work, the recent paralogs in an orthologous group, which are likely to retain similar functions, are traced to the same origin and thus assigned the same age of origin. This results in some age-homogeneous motifs in which some members are ("recent") paralogs of other members. The members of such age-homogeneous motifs may not have been added to the network as a cluster during the (short) period of time when they originated, because at that time there was only one ancestor of the paralogous members, and the ultimate formation of such motifs depended on the later (recent) duplication event. However, we find that such motifs with recent paralog pairs make up only a small fraction of all age-homogeneous motifs: 2.4% for 3-motifs and 2.7% for 4-motifs.

Evidence for the hypothesis of the clustered additions from protein complexes

Further evidence for the addition of clustered interacting nodes comes from the analysis of yeast protein complexes [57]. We find significantly more age-homogeneous complexes, whose constituents are all of the same age class, than expected at random, based on 1000 randomizations of the correspondence between proteins in the yeast genome and their ages. Further, among the remaining age-heterogeneous complexes, there are also significantly more complexes that are significantly enriched with members from a particular age class (the corresponding upper-tailed P-value of the hypergeometric cumulative distribution [58] is less than 0.05) than expected at random (Figure 4A). These results still hold when only protein complexes without recent paralog pairs are considered (see the second part of the Discussion for details) (Figure 4B).
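
The enrichment test for a single complex and age class can be sketched as an upper-tailed hypergeometric probability. The parameterization below is an assumption made for illustration: N proteins in the background, K of them in the age class of interest, n members in the complex, and k complex members in that age class. The function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) when n items are drawn without replacement from N items,
    K of which belong to the class of interest."""
    total = comb(N, n)
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total
```

Under these assumptions, a complex would be called enriched for an age class when the returned value is below 0.05.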

Figure 4

The number of yeast protein complexes and their random expectation. We consider two kinds of protein complexes: those whose members are all of the same age class, and those that are significantly enriched with members from a particular age class. The random expectation is the average over 1000 randomizations of the correspondence between proteins in the yeast genome and their ages. The empirical P-values are all less than 10⁻³. (A) Results obtained considering all yeast protein complexes. (B) Results obtained considering only yeast protein complexes without recent paralog pairs (see the Discussion for details).

Functional constraints as the possible driving force of the clustered additions

Qin et al. used natural selection to explain the addition of clustered nodes [12]. They proposed that a new function is likely to require a group of interacting new proteins and that the growth of PINs is under functional constraints. Indeed, we find co-evolution (Figure 3A, B) of the constituents of the age-homogeneous motifs, which suggests functional significance for a cluster of interacting proteins. We also find that proteins within the age-homogeneous motifs tend to share the same biological functions (Table 4) and that these motifs tend to lie within known protein complexes (see additional file 1: Table S15). All these results indicate that the age-homogeneous motifs tend to be functionally significant. Moreover, protein complexes are well-defined functional modules in the PIN, and the results of their analysis (Figure 4) provide strong evidence for functional constraints as the driving force of the addition of clustered interacting nodes.

Conclusions

In the PIN, proteins of the same age class tend to form motifs while those of different age classes tend to avoid forming motifs. The constituents of motifs with specific functions or dense topology tend to be under strong co-origin constraints. Further, the proteins participating in motifs whose members are of the same age class tend to be densely interconnected, share the same functions and evolve at similar rates, and these motifs tend to lie within protein complexes. These results suggest that the age-homogeneous motifs, especially those with dense topology and specific functions, tend to have been added to the PIN as clusters, providing evidence from the network motif perspective, for the first time, for the hypothesis of the addition of clustered interacting nodes. Our results also suggest that functional constraints may be the underlying driving force for such clustered additions.

Methods

Protein-protein interactions

For yeast, we use two protein-protein interaction datasets. One is from the Database of Interacting Proteins (DIP), which catalogs experimentally determined protein interactions from a variety of sources (Version 20080114) [47]. After removing self-interactions, we obtain 15410 yeast protein interactions between 4551 proteins (DIP_YEAST). In particular, DIP provides a reliable core subset of DIP_YEAST, denoted DIP_YEAST_CORE (Version 20071007). This core subset contains protein interactions that have been computationally verified or observed in more than one large-scale experiment, or that come from small-scale experiments [26]. After self-interactions are removed, DIP_YEAST_CORE contains 5611 interactions between 2545 proteins. To validate the generality of our results, we use a second yeast protein interaction dataset, which contains 12051 non-self interactions between 3264 proteins. This dataset, denoted YEAST_HC, is from Kim and Marcotte [10] and is a reliable subset of the literature-curated yeast protein interaction data in BioGrid [59].
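
The preprocessing step mentioned above, removing self-interactions, can be sketched as follows. Treating interactions as undirected and collapsing duplicate pairs are assumptions of this sketch that the text does not state, and the function name is hypothetical.

```python
def clean_interactions(pairs):
    """Drop self-interactions; return the undirected edge set and its proteins.

    Assumption: (a, b) and (b, a) denote the same interaction and are merged.
    """
    edges = set()
    for a, b in pairs:
        if a != b:
            edges.add(frozenset((a, b)))
    proteins = set().union(*edges) if edges else set()
    return edges, proteins
```

Proteins that appear only in self-interactions are not counted among the network's proteins in this sketch.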

In addition, to test whether the interconnection tendency between proteins of the same/different age classes also holds for PINs of other organisms, we analyze two human PINs, denoted HPRD_HUMAN_ALL (high-throughput and low-throughput experimental interactions, 22545 non-self interactions, 6919 proteins) and HPRD_HUMAN_HIGH (low-throughput experimental interactions, 17156 non-self interactions, 5704 proteins), which are downloaded from the Human Protein Reference Database (HPRD) (Release 7) [48].

Yeast protein complexes

We use the re-annotated, manually curated MIPS yeast protein complexes provided by de Lichtenberg et al., which contain 199 complexes and 966 proteins [57]. Compared with the original MIPS complexes [60], the re-annotated data reflect known dynamic expression information of proteins and thus better represent real complexes in vivo. For example, in vivo Cdc28p can only interact with a single cyclin at a time, whereas in MIPS Cdc28p and all 9 of its interacting cyclins are organized as a single complex. To correct this, de Lichtenberg et al. annotated 9 complexes instead.

Age assessment of proteins

We use the GeneTrace algorithm with default parameters to assess each protein's age of origin [61]. GeneTrace is an efficient algorithm that reconstructs the most likely evolutionary scenario of an individual protein, including its time of origin, given a phylogenetic profile of the protein and an evolutionary tree that includes all organisms involved. Compared with the simple method of finding orthologs in representative species [62–64], the GeneTrace algorithm takes gene loss and horizontal transfer events into account to a certain extent, and is thus more precise in assessing protein ages. The phylogenetic profile of a protein is defined as a binary vector based on the presence (1) or absence (0) of its orthologous hits in the reference genomes. Here we use orthologous groups from orthoMCL (Version 4.0) [42] to construct the phylogenetic profiles. Each orthologous group from orthoMCL consists of orthologs and only "recent paralogs", which are derived from recent gene duplication, retain similar sequences and are likely to retain similar functions. "Ancient paralogs" from ancient duplication events, which are likely to exhibit divergent functions, are assigned to different orthologous groups of orthoMCL [42]. In total, the orthologous group data of orthoMCL involve 50 prokaryotic and 88 eukaryotic genomes, and thus the phylogenetic profile here is a 138-dimensional binary vector. The phylogenetic tree including these 138 species is from the NCBI Taxonomy common tree system (Version 2010 Aug) [46] (Figure 1).
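
As a small illustration of the phylogenetic profile defined above, the sketch below builds the binary vector for one orthologous group from the set of genomes in which it has a member. It does not reproduce GeneTrace, which additionally reconstructs gain, loss and transfer events on the tree; the function name and the genome identifiers in the usage note are hypothetical.

```python
def phylogenetic_profile(genomes_with_ortholog, reference_genomes):
    """Binary presence (1) / absence (0) vector over the reference genomes.

    reference_genomes fixes the order of the vector's dimensions.
    """
    present = set(genomes_with_ortholog)
    return [1 if genome in present else 0 for genome in reference_genomes]
```

For instance, with reference genomes `["sce", "hsa", "eco"]` and an orthologous group present in `{"sce", "eco"}`, the profile is `[1, 0, 1]`. With the 138 genomes used here, each profile would have 138 entries.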

Network motifs and evolutionary motif modes

"Network motifs" are recurring, topologically distinct interconnected patterns of nodes in complex networks [38, 40]. Based on network motifs, we define "evolutionary motif modes" as network motifs which characterize particular interconnected patterns of proteins of the same/different age classes (Figure 2). We use FANMOD software [65] to detect network motifs and then Perl programs to obtain evolutionary motif modes. FANMOD software implements RAND-ESU algorithm to enumerate and sample the vertex-induced motifs [66]. For a given subset of the vertices of network G, the vertex-induced motif is unique. Therefore, there are not motifs with the same vertices but with different topologies. This algorithm is orders of magnitude faster than any other existing algorithms for this task [67].

Random age assignment and empirical P-value

If the ages of proteins do not affect the interconnected patterns of proteins of the same/different age classes in the PIN, a random age assignment should give interconnected patterns similar to those seen in the real PIN. To analyze the interconnection tendency of proteins of the same/different age classes, we first generate 1000 random networks by randomizing the correspondence between proteins and their ages in the real network. We then use the empirical P-value to evaluate the statistical significance of the enrichment/depletion of each kind of evolutionary motif mode in the real network [68, 69]. For each kind of motif mode of specific topology, the empirical P-value is calculated as the fraction of random networks in which its count is not smaller than (upper tail) or not larger than (lower tail) that in the real network. An evolutionary motif mode is significantly enriched/depleted in the real network when the upper-tailed/lower-tailed P-value is less than 0.05.
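
The randomization and the two tails of the empirical P-value can be sketched as follows. The network topology is left untouched and only the protein-to-age correspondence is permuted; the function names and the use of a seeded `random.Random` instance are choices made for this illustration.

```python
import random


def shuffle_ages(age, rng):
    """Permute the age labels among proteins, keeping the age distribution."""
    proteins = list(age)
    values = [age[p] for p in proteins]
    rng.shuffle(values)
    return dict(zip(proteins, values))


def empirical_p_values(real_count, random_counts):
    """Return (upper-tailed, lower-tailed) empirical P-values.

    upper: fraction of random networks with count >= real_count (enrichment)
    lower: fraction of random networks with count <= real_count (depletion)
    """
    n = len(random_counts)
    upper = sum(1 for c in random_counts if c >= real_count) / n
    lower = sum(1 for c in random_counts if c <= real_count) / n
    return upper, lower
```

In use, `shuffle_ages(age, random.Random(seed))` would be called 1000 times, the motif mode of interest counted under each shuffled assignment, and the resulting counts passed to `empirical_p_values` together with the count from the real network.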

Functional annotation of yeast proteins

The molecular functions of yeast proteins are based on Functional Catalogue (FunCat) annotations [49] from MIPS/CYGD database [60]. FunCat is a hierarchically structured functional classification system, and each FunCat term can be traced to different annotation levels in the hierarchies. Here we only focus on the first level (see additional file 1: Table S12).

Yeast protein evolutionary rates

The evolutionary rate of a protein is defined as the ratio between the number of non-synonymous substitutions per non-synonymous site (dN) and the number of synonymous substitutions per synonymous site (dS). To compute the evolutionary rates of S. cerevisiae proteins, we adopt S. paradoxus as the reference species, since it is the species most closely related to S. cerevisiae among all completely sequenced organisms. Amino acid sequences and corresponding coding sequences (CDS) of the proteins of the two species are from the Saccharomyces Genome Database (SGD) (for S. cerevisiae, Version 20-Feb-2009, and for S. paradoxus, Version 14-Dec-2004) [70]. S. cerevisiae-S. paradoxus orthologs are obtained using the Inparanoid program [71]. Pairs of orthologous proteins are aligned using the ClustalW program [72] and dN/dS values are calculated using the PAML program [73].
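
Given dN and dS estimates, the rate and the pairwise rate differences used in Figure 3 can be sketched as below. Taking the absolute difference between the rates of two proteins as the "evolutionary rate difference" is an assumption of this sketch, as are the function names; the estimation of dN and dS itself is not shown.

```python
from itertools import combinations


def evolutionary_rate(dn, ds):
    """dN/dS ratio; undefined when dS is not positive."""
    if ds <= 0:
        raise ValueError("dS must be positive")
    return dn / ds


def rate_differences(motif, rate):
    """Absolute rate difference for every protein pair within a motif."""
    return [abs(rate[a] - rate[b]) for a, b in combinations(motif, 2)]
```

A 3-motif yields three pairwise differences and a 4-motif yields six.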