Background

Knowledge of constraints governing systems provides a means to predict events and mechanisms that cause their breakdown. It may also allow one to speculate how such systems might have evolved. One class of biological systems that have captured much interest involves protein-protein interactions and protein complexes. Protein complexes are groups of two or more proteins that physically interact. Such interaction serves to spatially join, modify or create novel functional capability from component proteins. Once in a complex, proteins can achieve greater structural stabilization and protection from proteases, which result in significantly longer half-lives [1, 2]. A breakdown in complex assembly has been associated with a number of diseases [35].

What forces have shaped the formation of complexes and what is their effect? The formation of quaternary structure is associated with greater constraint in the evolution of proteins [68]. Recently, a large collection of manually annotated mammalian complexes has become available in the CORUM database [9]. The combination of data from this resource and complexes derived from HPRD [10] and BIND [11] allows for one of the largest investigations of complex-specific protein constraints in mammalian species to date.

Recently, complexes have been investigated in the context of intrinsic disorder [12] and aggregation propensity of the subunits [13]. In this study, we focus on two characteristics of complexes: the complexity of a complex as represented by the number of unique proteins in a complex and the number of complexes a particular protein participates in. The number of protein subunits that bind together fluctuates in the context of different conditions. Complexes extracted from cells can be viewed as snapshot samples of proteins that have come together with adequate stability to be isolated. We subsequently derive 'complex participation' for individual proteins by counting the number of complexes that they belong to, in our large combined collection of protein complex samples.

Although these characteristics of complexes are dynamic, their relation to properties of component proteins derived in their isolated state can be studied. By deriving these relations, we hope to gain insight into certain constraints governing the organization of complexes. Here, we find evidence that protein length, predicted secondary structure and isoelectric point, as well as the nonsynonymous/synonymous substitution ratio of genes are associated with our measures of complex complexity and participation.

Available data on protein-protein interactions have been used extensively to predict complexes [1425]. A strong discriminator between interacting and non-interacting protein pairs appears to be the presence of domains known to interact [26]. We assess the utility of known binary protein-protein and domain-domain interactions in predicting the composition and interactions in the mammalian complexes that we have collected.

Methods

Complex and Interaction data

1732 experimentally verified protein complexes annotated at MIPS [27] and stored in the CORUM database [9] were combined with 631 complexes retrieved from HPRD [10] and 538 complexes from BIND [11]. After removal of redundancy, there were 2706 different complexes containing 4543 different proteins from human, mouse, rat, dog, rabbit, cow, pig and other mammals. Of these 2706 unique complexes, 665 are subcomplexes of other complexes in the data. A more detailed description of the complexes appears in the additional files [see Additional file 1].

The focus of this study was on mammalian complexes but we also analyzed the MIPS yeast complexes stored in CYGD [28]. We use a set of 1142 unique yeast complexes consisting of 2755 proteins. A much larger set of yeast complexes consisting of 5370 proteins in 2025 complexes was produced by merging the MIPS yeast complexes with the 893 complexes predicted by Friedel et al. [25]. We will refer to this set of complexes as the 'extended yeast complex set'. Possible estimates of the completeness of the human and yeast complex data appear in the additional files [see Additional file 1].

We assembled 94906 pairwise eukaryotic protein-protein interactions from BIND, DIP [29], HPRD, MINT [30], Mpact [31] and MPPI [32]. 29074 of those interactions were between different proteins from human, mouse or rat and were used in this study. We also retrieved 127514 binary protein interactions by over 36500 proteins from IntAct [33]. We will refer to interaction data from IntAct in the text specifically with the word 'IntAct' to distinguish from the interaction set we collected.

We refer as domain-domain interactions in the text as those derived from known 3D structures of proteins, obtained from the iPfam [34] and 3did [35] database. There are 3654 such interactions between protein domains in total. Pfam domains of protein sequences were taken from Uniprot [36]. We assume that peptide chains in PDB structures with C-beta atoms within 8 Angstroms of each other interact.

We generated random complexes using two different random models to evaluate significance of trends associated with non-stochastic protein complex organization. For the random model in the main text (Model 1), we generated the same number of random complexes as found in annotated complexes by picking the same number of proteins as found in each of the annotated complexes from a pool of all proteins collected from these complexes. We picked proteins randomly with replacement throughout the process of random complex generation to simulate neutrality of protein reuse in the complexes. A second random model (Model 2) of the complexes which uses the complex complexity and participation preserving rewiring method, first introduced by Maslov and Sneppen [37] was also tested [see Additional file 1]. In certain parts of this paper, we used sets of randomly generated complexes (each set containing the same number of complexes as found in the annotated complexes) in Monte-Carlo simulations (denoted MC-test) to derive p-values for significance statements. P-values of events were estimated by the fraction of random simulations supporting the null hypothesis.

Computed features of genes and proteins

Secondary structure was predicted with PSIPRED [38]. Our conclusions did not change when we repeated our experiments using SSPRO [39]. The performance of PSIPRED in terms of secondary structure content prediction is benchmarked [see Additional file 2]. We measured human protein evolutionary rates by non-synonymous to synonymous substitution (dN/dS) ratios computed from coding sequences of ortholog pairs from Ensembl [40]. Yeast (S. cerevisiae) gene dN/dS ratios were computed using orthologs [41] from S. mikatae by PAL2NAL [42] with default parameters and Codeml from the PAML package [43]. The smallest dN/dS ratio was recorded for each gene when more than one potential ortholog was identified (our conclusions did not change if we chose the larger one). The values of dN/dS depend on the calculation methodology but they should be comparable within the same species pairs chosen. Note that only dN/dS ratios less than 1 (computed from human-mouse orthologs) were associated with the annotated human complexes. The mean dN/dS ratio of individual genes encoding a given protein complex was taken as a measure of selection associated with the protein complex. We focused on a measure of evolutionary rate reflecting more on changes in protein sequence. We considered evolution at synonymous sites [44, 45] to be neutral.

We refer to "pI" in the text as values obtained from BioPerl-based [46] isoelectric point prediction on protein sequences. Essential genes for S. cerevisiae were taken from CYGD [28]. Essential genes for human were gathered by merging those from [47] and Table S6 of [48].

Statistics of trends in 2-D plots

We assessed trends concerning 2-dimensional protein complex data by first applying standard linear regression to see if there is a noticeable slope different from a horizontal line (observed when the y values do not depend on the x values). We considered a given trend to be significantly associated with non-stochastic protein complex organization if the p-value reported by the t-test on the slope of the regression line (the null hypothesis being a slope = 0) was less than the maximum value observed when 1000 sets of random human or yeast complexes were generated. For example, when random complexes were generated from entire sets of annotated complexes, p-values associated with plots between mean dN/dS values and complex complexity for these random complexes were always above 0.003; so we set this value as a threshold of significance for such trends. However, the use of the t-test by itself is potentially inadequate because it could be significantly affected by biases in our snap-shot protein complex samples. So in addition, we repeated the same plots using sets of complexes likely to have different biases. These include the MIPS yeast complexes, the extended yeast complex set, sets of complexes where subcomplexes were removed and sets of complexes in different subcellular compartments. Unless otherwise stated, only those trends in which all assessments (from the regression line analysis to the analysis with other complex datasets) are considered significant are reported. Analysis was aided by the PROMPT tool [49].

Results and discussion

Complex Complexity and Participation

We first examined the distribution of complex complexity, the number of unique proteins in a complex. Amongst all mammalian complexes collected, the number of unique proteins ranges from one for homo-oligomers to 142 (U2-type spliceosome [CORUM:351]). When the number of unique proteins increases past 4, complexes become increasingly rare following a power law-like distribution (Figure 1A). When we considered yeast complexes, we saw similar trends [see Additional file 3].

Figure 1
figure 1

Composition of mammalian complexes. A) The number of complexes (y-axis) with a particular number of unique proteins (x-axis) is plotted. B) The number of proteins (y-axis) participating in a particular number of complexes (x-axis) is plotted.

Not all proteins are exclusive to a single complex. Having defined complex participation as the number of complexes that a particular protein is found in, we find that it also follows a power law-like distribution (Figure 1B, [Additional file 3]). Some of the proteins currently with the highest complex participation include human integrin beta-1 [Uniprot:P05556], histone deacetylase 1 [Uniprot:Q13547] and RING-box protein 1 [Uniprot:P62877] which were annotated to be in 54, 39 and 24 complexes, respectively so far.

The decrease in the number of complexes as the number of subunits increase might be a reflection of the increased difficulty in assembling and thus evolving larger beneficial complexes. Although this trend seems to follow a power law, as commonly reported for biological networks [50], we could not ascertain if this truly is the case. Similarly, we could not ascertain whether complex participation follows a power law. Our observed complex complexity and participation distributions may be the result of our complexes being samples of mammalian proteins [51, 52] derived under a variety of conditions [Additional file 1]. The number of unique proteins in yeast complexes was also observed to follow a similar-shaped distribution and an exponential model of the data has been proposed [53, 54]. In both mammalian and yeast complexes, most proteins belong to few complexes, thus supporting the idea of modularity [55] between complexes in the context of the protein-protein interaction network that have been examined.

The length of proteins in complexes

One of the simplest protein properties to study is the length of the protein in terms of the number of amino acids. The distribution of human protein lengths is skewed towards smaller values and subsequently when we sample uniformly from this distribution to generate random complexes, the distribution of the protein lengths remains skewed (Figure 2A). The mean length of proteins in this distribution is ~500 amino acids (Figure 2A).

Figure 2
figure 2

Protein length in human complexes. A) Length distribution of proteins in human complexes generated randomly from the entire known human proteome. Mean lengths of proteins versus the number of unique subunits in: B) model 1 random complexes generated from the entire proteome C) annotated human complexes D) model 1 random human complexes generated from data we collected.

By picking proteins with replacement from the distribution of proteins in Figure 2A, we generated random complexes and plotted the complexity of the complexes in terms of the number of unique subunits against the mean length of these subunits (Figure 2B). As complex complexity increases the upper bound for mean lengths of proteins drops (before 20 unique proteins) and levels out at ~510 amino acids for more complex complexes. Because the mean length of proteins in the human proteome which we sample from is ~500 amino acids, as one adds more random proteins to complexes, the mean length of proteins in the complexes tends to stabilize near 500 amino acids. This observation was similarly made in annotated human complexes (Figure 2C) and random complexes generated from this annotated data (Figure 2D). A similar asymptotic bound is thus formed in all our plots. For example, mean lengths of 2000 amino acids were absent in complexes of more than 20 subunits in both the random and the complexes that we have collected. Like for the random human complexes, the mean protein length distribution for the real human complexes (Figure 2C) was also skewed towards smaller protein lengths. Similar plots were observed for all mammalian complexes that we collected and Model 2 random complexes [Additional file 4]. These results suggest that the mean protein length in complexes reflects to an observable degree the length distribution of proteins encoded in the proteome (Figure 2A). Similar observations were made when the median length of proteins was studied in the context of complex complexity [Additional file 5].

In contrast to randomly generated complexes, the complexes that we have collected had much more variable mean lengths, especially for complexes with large numbers of subunits. We observe this as an increased scattering of points on the right side of the plot (Figure 2C). This phenomenon was similarly observed in yeast complexes [Additional file 6]. In particular, when complex complexity increases past 20 different subunits, complexes with mean protein lengths below 250 amino acids are rarely generated randomly (Random Model 1: P < 10-6) but are present in annotated mammalian and yeast complexes. For example, the 81 subunits of the human ribosomal complex [CORUM:306] have a mean length of 169 amino acids and the 37 subunits of the mouse mitochondrial NADH dehydrogenase complex [CORUM:381] have a mean length of 234 amino acids.

Although the complexes analyzed are samples of what is in cells, it is a large set that has been manually annotated. Assuming that the plots in Figure 2 capture length constraint bounds, they can provide rules of thumb for spot-checking errors in newly isolated complexes. For example, if an isolated human complex of 10 subunits had a mean subunit length of 2000 amino acids, our plots suggest that large numbers of additional missing subunits of length 2000 or more are unlikely. If such a complex did exist, then it would be striking given both the annotated complex data (Figure 2C) and data generated from random complexes (Figure 2B, Figure 2D; [Additional file 4]). Unusual property distributions associated with complexes provide initial suggestions of errors in the composition of the complexes or unusual constraints associated with these complexes. For example, the average length of a large sample of human mitochondrial proteins appears to be much smaller than the average length of all human proteins (means ± standard deviation/medians are: ~320 ± 304/235aa vs. ~500 ± 551/364aa, respectively; MW, KS-test (Mann-Whitney, Kolmogorov-Smirnov tests): P < 10-57) in the Eukaryotic Subcellular Localization Database [56] [Additional file 7]. Similar results were obtained when we used the reference set from MitoP2 [57] as a source of mitochondrial proteins. Despite large amounts of protein turnover since the mitochondria prokaryotic origin [58], most (> 97%) human mitochondrial proteins have lengths less than 1000 amino acids. These results suggest unusual size constraints associated with mitochondrial proteins. Mitochondria-related functions, the need for proteins to import into the mitochondria [59], the presence of reactive oxygen species which can impose difficulties for protein folding [60] are possible reasons which may have helped limit the incorporation of very large proteins into mitochondria. The scarcity of large proteins in the mitochondria, in turn, will likely limit mean protein lengths of its complexes. Although many (300/1962 = 15%) human complexes in our data have mean protein lengths over 1000 amino acids (Figure 2C), none of the 28 mitochondrial human complexes we annotated so far exceed this limit. If mean protein lengths in complexes reflect protein length distributions, we expect hetero-oligomeric mitochondria human complexes with mean protein lengths over 1000 amino acids to be extremely rare.

The relative lack of complexes with many (>20) different large proteins (> 1500aa) in both mammals and yeast suggests that evolving such complexes is also extremely difficult. Smaller subunits may be easier to fold, transport [61], require less conformation sampling [2] and undergo smaller entropic loss [[62] – Supplement] during complex assembly. These factors may have contributed to the absence of highly complex complexes with many large proteins in our data.

Natural selection and complexes

To understand observations from a more evolutionary point of view, we analyzed non-synonymous to synonymous substitution ratios (dN/dS) derived from human coding sequences and their alignments with orthologs from mouse (Figure 3). One way of comparing selection on coding sequences associated with different complexes is by their mean dN/dS ratios derived from the same species pairs.

Figure 3
figure 3

Complexes and selection. dN/dS distribution of genes associated with A) all human proteins and B) proteins in the human complex data. D) The mean dN/dS ratio for human-mouse orthologs is plotted against the number of unique proteins in the complex. Complexes with more unique proteins have a significantly smaller mean dN/dS ratios than those with less unique proteins (t-test: P < 3.1 × 10-7). F) The dN/dS ratio for human-mouse orthologs is plotted against the number of complexes they participate in. Proteins participating in more complexes have significantly lower dN/dS ratios than those with less complex participation (t-test: P < 7.4 × 10-9). C, E) Both trends are rarely observed for model 1 random complexes.

For example, proteins from the human MBD1-MCAF complex [CORUM:2759] are encoded by two genes with a mean dN/dS ratio of 0.37, computed from mouse orthologs. The sequences coding for this complex seem to be under much less selective constraint than ones coding the eIF3 complex [CORUM:742] which is associated with a mean dN/dS ratio of 0.04 (Stdv. = 0.03, median = 0.02).

It has been noted previously that easily alignable eukaryotic mRNAs are over-represented for multi-subunit complexes [63] due to increased sequence conservation. In terms of trends, we find that as the complexity of the complex increases, the mean dN/dS ratio of human genes of associated constituent proteins tends to decrease suggesting that complex complexity negatively correlates with purifying selective pressure on genes (Figure 3D). The trend is not extremely strong but such a trend was not observed with random complexes (Figure 3C; [Additional file 8]). In particular, complexes with more than 15 proteins and mean dN/dS ratios below 0.07 are absent in our random complex set. For example, the 20 genes from the PA700 proteosome regulator complex [CORUM:32] have a mean dN/dS ratio of 0.03 but such complexes are not likely generated randomly (Random model 1: P < 0.05). A contributing reason can be derived by examining the distribution of dN/dS ratios. The distribution of human-mouse dN/dS ratios is skewed towards small values considering all known human genes with over 40% of the values < 0.1 (Figure 3A). The distribution of dN/dS ratios associated with our sample of annotated complexes (Figure 3B) is likewise skewed with 60% of the values < 0.1. In both cases, the mean dN/dS ratio is close to ~0.1. As more randomly-selected proteins are added to form random complexes, the mean dN/dS ratio for genes associated with the complexes tends to stabilize also at 0.1. Thus, highly complex complexes with mean dN/dS ratios substantially below 0.1 are rarely generated randomly.

Knowing that many proteins belong to more than one complex, we also asked how this fact was related to selective pressure. We find that as the number of complexes a protein participates in increases, the dN/dS ratio of corresponding genes also tends to decrease (Figure 3F), suggesting that complex participation also correlates negatively with the dN/dS ratio. This trend is not observed when complexes were generated randomly by picking proteins with replacement (Figure 3E). From such randomly generated complexes, proteins participating in more than 10 complexes rarely occur, but they are clearly present in our annotated data. Binned average plots of Figures 3D and 3F appear in the additional files [see Additional file 9].

Because such results could potentially be sensitive to the ortholog pairs compared, the quality of the gene models, and peculiarities of human/mouse evolution, we also repeated such analyses using a variety of other mammals [see Additional file 10]. For human-dog, human-chimp and human-rat orthologs, we observed similar trends although in the closely related human-chimp case in which complex complexity was plotted against mean dN/dS, the trend was not statistically significant. We also had similar observations with yeast complexes using S. cerevisiaeS. mikatae, S. cerevisiaeS. paradoxus orthologs. We did not observe such trends from dN/dS ratios derived from the more distant pairing of S. cerevisiaeS. castellii orthologs (data not shown). For both mammalian and yeast complexes, the trends appear to be conserved using dN/dS ratios generated from a variety of ortholog-pairs with some exceptions.

Related to dN/dS ratios is the gene conservation of orthologs of the proteins in complexes. Compared to dN/dS ratios, statistically significant differences in gene conservation are even more difficult to detect in closely related genomes. However, over large evolutionary distances we noticed that highly complex complexes tend to have conserved their subunits. For example, between yeast and human, yeast complexes with more than 60 subunits had more than 80% of their subunits also encoded in human. This was not observed in random complexes [Additional file 11]. Yeast complex proteins conserved in human also participated in a significantly higher number of complexes than those without a human ortholog (mean ± standard deviation/median complex participation of 8.0 ± 6.8/6.0 vs. 4.2 ± 4.3/3.0; MW, KS-test: P < 10-70). A significant difference in complex participation was not observed for model 1 randomly generated complexes. These results concerning dN/dS ratios and gene conservation portrays similar relationships between gene evolution, complex complexity and participation at different time and constraint scales.

Because our conclusions are based on a limited sample of complexes, we sought for extra datasets of protein complexes. The combination of the annotated MIPS yeast complexes with those that were predicted [25] allowed us to test our hypotheses on a large set of complexes covering 5370 (88%) of the proteins encoded in the yeast proteome. We continued to observe a significant negative correlation between mean dN/dS with complex complexity and dN/dS with complex participation using such an extended complex dataset [Additional file 12].

Often in a cell, complexes exist along side their subcomplexes. In our data, subcomplexes of complexes are present. If we repeated our analysis with subcomplexes removed, we still observed a significant negative correlation between mean dN/dS with complex complexity and dN/dS with complex participation [Additional file 12]. Similar trends were also observed when we considered nuclear localized and non-localized complexes separately [Additional file 12].

As complexes are assembled or disassembled, the mean dN/dS ratio should fluctuate as proteins are added or subtracted from the complex. Nevertheless, there also seems to be an asymptotic bound for mean dN/dS ratios as complex complexity increases and dN/dS ratios as complex participation increases. The conserved trends we observe may be a reflection of these bounds. We find that genes associated with our human complexes which are considered essential (see Methods) tend to have lower dN/dS ratios (computed by human-mouse orthologs) and encode proteins that participate in more complexes (MW, KS-test: P < 0.01) compared with all genes in the human complexes. We see the same significant trend for yeast (dN/dS ratios computed with S. cerevisiaeS. paradoxus orthologs). These results suggest that for proteins participating in increasing number of complexes, lower dN/dS ratios tend to be associated with purifying selection to maintain organism fitness. Essential S. cerevisiae complex proteins tend to be found in complexes with more unique subunits (MW, KS-test: P < 0.05) compared to other proteins in the yeast complex data. However, such a trend was not observed for human possibly because a significant number of proteins which contribute to complex complexity are not known to be essential. We conclude that there is strong evidence of non-random association between dN/dS ratios of genes and complexes of their protein products.

The finding that the number of protein-protein interactions positively correlates with the level of conservation of orthologs between species [64, 65] is similar to our finding that highly complex complexes tend to conserve their orthologs. Previously, increased gene conservation has been correlated with lower evolutionary rate [66]. We have found that highly complex complexes tend to conserve their orthologs and have lower dN/dS ratios, in agreement with these results.

In summary, we observed that the mean dN/dS ratios of proteins in the collected complexes tend to decrease with increasing complex complexity. There seem to be at least two major phenomena associated with this observation:

  1. 1)

    The skewed distribution of dN/dS ratios amongst genes as depicted in Figure 3B. In particular, as complex complexity increases, the mean dN/dS ratios of proteins in complexes tend to stabilize towards the mean of this skewed distribution.

  2. 2)

    increased selective pressure on proteins involved in the organization and function of more complex complexes

The median of the mouse dN/dS values (Figure 3B) is 0.08 which is less than the mean of 0.1. When we plotted Figure 3D and related supplements using median dN/dS values instead of the mean, we also found the median dN/dS decreased significantly with increasing number of unique subunits. Such a trend was consistently found when human-mouse, human-dog, human-rat, S. cerevisiaeS. paradoxus and S. cerevisiaeS. mikatae dN/dS median values were used (for all plots, t-test: P < 7.1 × 10-7; selected plots are shown in the additional files [Additional file 13]).

Thus, for the complex data available, we observed that both the mean and median dN/dS ratios of genes negatively correlates with the complex complexity.

Homogeneity of Protein and Gene Properties

Next, we examined similarities between proteins within mammalian protein complexes (Table 1). Although not necessarily similar to physiological pI, our predicted pI values reflect sequence properties which may be of some utility. We ask whether subunits in complexes tend to have greater homogeneity than expected in terms of predicted pI values. Considering complexes of 3 or more unique proteins, we found that the standard deviations in protein pI are significantly lower in the annotated complexes compared to those in model 1 random complexes. Thus, the pI of a protein can be a feature which may be useful in predicting whether the protein belongs to a particular complex or not. The need for subunits to co-localize [67] and the presence of highly sequence similar paralogs in complexes [68] should be contributing factors to our observation of lower than expected pI deviation between complex proteins.

Table 1 Homogeneity of protein properties in mammalian complexes of 3 or more proteins.

The common presence of paralogs in complexes and the correlation between secondary structure and localization [69] also contributes to explaining why we observed significantly lower than expected deviations of secondary structure content (percent helix, sheet and coil) in proteins belonging to the annotated complexes versus those in random complexes.

Similar to protein properties, we also found that the dN/dS ratio of genes associated with annotated human complexes are significantly more homogeneous compared with those from randomly generated complexes. Some examples of complexes with very low deviation of dN/dS values include Ubiquitin E3 ligase [CORUM:386] and the MKK4-ARRB2-ASK1 complex [CORUM:1297] with a standard deviation of 0.0007 and 0.001, respectively. The complex with the highest observed dN/dS deviation of 0.34 is the DNA ligase IV-XRCC4-XLF complex [CORUM:359]. Significant differences between random and annotated complexes were also observed when the dN/dS ratio was computed with dog, rat, or chimp orthologs (data not shown). These observations agree with previous results providing evidence that yeast proteins belonging to the same functional module tend to have more similar evolutionary rates than those belonging to different modules [70]. We observed similar results when we examined yeast complexes [Additional file 14] and when we generated model 2 random mammalian and yeast complexes (data not shown).

To conclude, we suggest that pI, secondary structure and dN/dS ratios can be used to help predict the probability that a protein belongs to a particular complex. We did not observe a significant difference in standard deviations of protein length between proteins belonging to annotated versus Model 1 random mammalian complexes. It may still prove useful for the separation between real and erroneous complexes when combined with other properties.

Comparison with pairwise protein interactions

The overlap between binary protein-protein and predicted domain-domain interactions with proteins in our complexes was assessed (Figure 4). On average, 57%, 46% and 71% of the proteins found in the complexes (of 3 or more proteins) can be explained by known binary protein-protein, predicted domain-domain or either type of interactions (see Methods), respectively. The entire annotated protein complement of 24%, 10%, and 29% of these complexes can be explained by binary, domain-domain or either type of interactions, respectively. In addition we find that 14% of binary protein-protein interactions can be explained by domain-domain interactions which is in the range of 4–19% identified by Schuster-Bockler & Bateman [71].

Figure 4
figure 4

Coverage by binary domain and protein interactions. Coverage is indicated by values in the Venn diagrams. Example: A) Protein content coverage. 71% of the proteins in our complexes with 3 or more unique proteins can be explained by either domain-domain or binary protein interactions. The full protein content of 29% of these complexes can be explained by either interaction-type. 14% of binary ppi can be explained by predicted domain-domain interactions. B) Interaction coverage. 5% of IntAct binary interactions could be explained from predicted domain-domain interactions. 77% of solved protein-protein contacts in our complexes of known structure with 3 or more unique proteins can be explained by either domain-domain or binary protein interactions. All solved contacts in 52% of these complexes can likewise be explained by these interactions.

192 PDB structures are available for our complexes (103 of which consist of at least 3 different proteins). A large proportion of these structures are solved with protein fragments rather than the full-length proteins. Many of the interactions between peptides in these structures (see Methods) are supported by binary protein-protein interactions. The IntAct interaction set (excluding interactions derived only from NMR or X-ray diffraction) covered on average ~40% of detected interactions in a given complex containing PDB structure. Interactions in 34 (18%) PDB structures were fully explained by these IntAct binary interactions. A similar coverage (Figure 4B: 45% mean interaction coverage; 13% of complexes with all interactions fully explained) is provided by these IntAct binary interactions even if we consider only complexes with 3 or more distinct interacting peptides. In contrast, on average 53% of interactions in complexes with 3 or more distinct interacting proteins are explained by predicted domain-domain interactions. Predicted domain-domain interactions explain all interactions in about 25% of such complexes. Our results suggest that known protein-protein and domain-domain interactions can aid substantially in predicting interactions in our complex data. Many of these binary and domain-domain interaction defined protein pairs may be subcomplex precursors of the mammalian complexes which form during the course of assembly. Some of these subcomplex precursors might be functional complexes in current mammalian and ancestral organisms [62].

Considering binary protein-protein interaction pairs in IntAct, those found in our complexes, and those derived from contacts in PDB structures associated with our mammalian complexes, we found that larger proteins tend to have larger length differences with their interacting partners (Figure 5, [Additional file 15]). For example, a large protein of length 5000 amino acids was more likely to be found interacting with a protein of ~1000 amino acids (a 4000 amino acid difference) rather than another protein of 5000 amino acids. However, a protein of length 300 amino acids often was found interacting with another protein of similar size. A number of reasons can explain this observation. First, the length distribution of the proteins is skewed such that relatively long proteins tend to be rare (Figure 2A). Second, on average shorter genes appear to be more abundantly expressed [72, 73] than longer genes. Thus, relatively large proteins may have a greater chance to encounter shorter proteins rather than longer proteins. The observation that many complex complexes tend to be composed of proteins with a shorter mean length may also be related to this phenomenon since the evolution of large complexes may depend on how often and easy it is to bring together undamaged components. The relatively low dN/dS ratios of genes from highly complex complexes may also be associated with this phenomenon (Figure 3D). Interestingly, similar observations (fitting even better to the straight line) were made with random protein-protein interactions (generated by picking proteins in each interaction pair randomly with replacement) from the IntAct data (Figure 5B). The relative low abundance of relatively large proteins in both the real and random interaction data can explain the trends in Figure 5A–C. The differences between real (Figure 5A, C) and random (Figure 5B) plots might be attributed to historical, selective constraints or experimental error on certain protein-protein interactions. Binned plots of Figure 5A are found in [Additional file 15].

Figure 5
figure 5

Large proteins tend to interact with relatively smaller partners. For each protein pair of different proteins, the length of the larger protein (protein1) is plotted on the x-axis. A,B,C) The difference in length with its partner (protein2) is plotted on y-axis. Larger proteins tend to have a larger length difference with their partners compared to proteins of smaller size. We see this trend amongst A) IntAct binary protein-protein interactions and B) Random binary interactions generated from the IntAct data C) binary interactions mapped onto all collected mammalian complexes. For all plots A-C, the trends are significant (t-test: P < 0.001).

Explaining evolutionary rate

In this work, we have hinted at various explanations such as the ease of complex assembly to reason why more complex complexity and participation negatively correlates with our measures of evolutionary rate. We believe that these explanations are contributing factors but clearly a systems approach [74] is warranted. The ability of a protein to change sequence is related to the mutability of its encoding gene, its subsequent expression, the consequences of errors in the gene's products which is related to the magnitude of expression and finally the degradation of those products. Fitness effects for the individual and the population under historical environmental constraints needs to be taken into account to completely understand why certain sequences have been transmitted for millions of years of evolution. A challenge is to uncover which factors are more important than others in explaining evolutionary rate [7577]. Factors associated with proteins early in their expression such as transcription and translation [78] error rate or more general factors such as protein abundance would affect the evolutionary rate of all proteins. However, factors which are related to later events in a protein's life cycle such as functional molecular interactions will likely explain evolutionary rate differently for different types of proteins. For example, transcription factors which bind DNA will have sequences constrained by the need to bind DNA. Such constraints would be absent for those proteins that do not bind DNA. If shorter proteins are more abundantly expressed, and more complex complexes tend to contain smaller proteins, higher expression of smaller proteins in more complex complexes could help explain our observations of lower dN/dS ratios. While expression appears to be important, the functional and non-functional [79] interactions of the protein and its precursors with other proteins [7, 80] or molecules most likely play a major role in defining the protein's relation to organism fitness. It can be very challenging relating evolutionary rate to these different factors, especially when some of these factors are measured with substantial noise or are unknown [81, 82]. While the processes leading up to complex assembly can help explain protein evolutionary rate, there may be factors not directly related to complex formation yet to be fully accounted for. Some progress has been made in teasing apart contributions to evolutionary rate [83].

Part of the reason why deriving mechanisms explaining evolutionary rate has captured so much interest is because such data is readily available from comparative genomics. However, without knowing mechanisms explaining evolutionary rate values, its significance and the utility of associated information becomes limited. In particular, its relation to protein function and disease needs to be worked out. By combining other information such as those derived from complexes, one might gain a deeper insight. For example, just knowing whether a protein complex contains a certain number of proteins or whether a protein belongs to a certain number of complexes may significantly help us predict its relation to certain fitness effects and future diseases, given information on its evolutionary rate. It has also been observed by many that proteins related to genetic diseases in OMIM [84] tend to be relatively long [85, 86] and it might be interesting to work out how this relates to specific sequences and structural features in the complexes.

Conclusion

In conclusion, we have found relations and trends between the dN/dS ratio, primary sequence properties, secondary structure, complex complexity, participation and localization using large samples of protein complexes derived under certain distributions of conditions. There is a large evolutionary distance between yeast and mammals. Protein complexes and their subunits are not necessarily conserved between distant species [87]. The different environments which mammals and yeast occupy may have produced different evolutionary pressure on the protein complexes. Despite these differences, many of our observations appear to be conserved for the mammalian and yeast complex data that we have. We suggest that our observations have been significantly influenced by constraints during the evolution of the complexes. Because of these constraints, numerical boundaries were discovered when we related properties such as protein length to complex complexity. Proteins at these property value boundaries are interesting because they are exceptional and possibly point to unusual pressures or homeostatic adaptations that allow for their presence in cells. For example, by studying highly complex protein complexes composed of very large proteins, one might uncover new mechanisms that allow for their assembly and the assembly of later evolved proteins. We also observed that random phenomena (as in neutral drift) from skewed origins can appear highly stable when averaged over large numbers (Ex. Figure 2B, D; Figure 5B). Compared to randomly generated phenomena, biologically-derived characteristics can appear less structured despite these circumstances (Ex. Figure 2C, Figure 5A). Our focus in this study is mainly on information about the presence of certain protein complexes. Once data on the abundance, stoichiometry and dynamics of mammalian protein and protein complexes become available on a large scale, one can obtain other views.