Background

Low-complexity regions (LCRs) in protein sequences are regions containing little diversity in their amino acid composition. The degree of diversity they exhibit may vary, ranging from regions comprising few different amino acids to those comprising just one, the amino acid positions within these regions being either loosely clustered, irregularly spaced, or periodic [1]. This work defines an LCR computationally as an amino acid sequence with low information content (see Methods). Therefore, simple repetitive sequences such as tandem amino acid repeats form part of the LCR dataset discussed here.

LCRs are common in protein sequences, but precise measures of their abundance are difficult to ascertain. One of the problems is that the degrees of stringency applied by different detection methods differ, leading to different estimates of the numbers of LCRs in the same dataset. Just as importantly, our knowledge of the protein universe has changed dramatically during the last 15 years, as protein sequence repositories have become engorged with the outputs of high-throughput sequencing projects. Protein sequence databases have thus grown enormously (both in terms of the numbers of sequences they contain and in terms of the numbers of organisms represented), and estimates of the numbers of LCRs they contain have changed accordingly: e.g., the proportion of proteins in the Swiss-Prot database that contain LCRs has changed from 56%, in 1993 (V-26.0) [2], to 12% in the current version of UniProt (V-54.0) [3]. Notwithstanding their abundance in protein sequences, LCRs are largely under-represented in the Protein Data Bank (PDB) [4, 5], presumably because most of the proteins containing LCRs do not readily crystallise. Despite this lack of structural information, LCRs are believed to play pivotal roles across a wide range of biological functions [6-8], some of whose mechanisms have been extensively documented, although the proposed functional models remain unverified [8-10].

Low-complexity regions evolve rapidly through recombination events

LCRs are known to evolve rapidly, sometimes via mitotic replication slippage, or, more often, via meiotic recombination events [11]. Highly dynamic diversification of these regions, and high levels of inter-species variation and polymorphism, suggest that newly generated and expanded LCRs are, in most cases, structurally and functionally neutral, with a high probability of fixation [12], thus generating novel material that could enable rapid functional expansions. Moxon and co-workers suggested that repeat formation is a common source of genetic variation among prokaryotes to generate novel surface antigens and adapt to fast evolving environments [7, 13]. This source of variability may also compensate for longer generation times in eukaryotes, which have higher proportions of LCRs [11] and it has been suggested that expansions and contractions of tandem repeats constitute a large source of phenotypic variation [6].

Hub proteins contain more LCRs than non-hub proteins

While some LCRs are known to play important structural roles by acquiring strong static conformations [14], others have been associated with intrinsically unstructured proteins [15, 16]. The flexible nature of regions lacking well-defined folding structures is thought to be responsible for their versatile binding capabilities; this flexibility could allow these regions to bind several different targets [17]. In their recent study on yeast protein-protein interactions (PPIs), Ekman and co-workers noted that the highly connected 'hub' proteins contain an increased fraction with LCRs compared to non-hub proteins [12]. They suggested that disordered regions are particularly important for flexible binding and could act as flexible linkers between globular protein domains. Here, we set out to investigate whether proteins with LCRs tend to have larger numbers of binding partners across a range of high confidence PPI datasets. We then examined whether proteins with LCRs positioned at their sequence extremities show differences in connectivity compared to proteins with LCRs positioned in central regions, and if the number of protein binding partners is related to LCR length. Finally, we functionally categorised both terminal-LCR and central-LCR groups using Gene Ontology [18] (GO)-term enrichment analysis.

Results and Discussion

In this study, we used data from the yeast Saccharomyces cerevisiae, as this was the most comprehensive for our purposes. We used four PPI datasets (Table 1): three high-confidence datasets (FYI [19], HC [20], and DIP-verified (DIPv) [21]), where each interaction is confirmed by more than one detection method, and a lower-confidence but more extensive dataset (BioGrid [22]) containing all interactions reported to date.

Table 1 Nodes and edges in each PPI dataset

The FYI [19] is generated as the union of: yeast two-hybrid experiments [23-25], datasets produced from affinity purification and mass spectrometry screens [26, 27], one dataset produced from in silico computational prediction methods [28], the physical protein-protein interactions, excluding interactions from genome-scale experiments, from the Munich Information Center for Protein Sequences (MIPS) [29] Comprehensive Yeast Genome Database (CYGD) dataset [30], and finally, the CYGD protein complexes published in the literature (called LC for Literature Curated data). The resulting union is then filtered, keeping only interactions observed at least twice by different detection methods.

The HC PPI dataset [20] is also a join of multiple interaction datasets, where the minimal criterion for inclusion is that relevant interactions must be independently reported at least twice. This differs from the FYI in that two independent reports can come from two datasets using identical detection methods. HC uses LC data from five major PPI databases (BIND [31], BioGrid [22], DIP [32], MINT [33] and MIPS [29]), and interactions detected from affinity purification and mass spectrometry screens [34, 35]. The DIPv dataset [21] is a computationally verified core of the DIP dataset [32], which is a database of experimentally verified interactions determined by several techniques (such as genome-wide two-hybrid screens, including results from [23] and [24], immunoprecipitation, affinity binding, and antibody blockage).
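The difference between the FYI and HC inclusion criteria can be illustrated with a minimal sketch. The record layout, the protein labels and the function name `supported_pairs` are hypothetical and chosen only for illustration; this is not the pipeline used to build either dataset.

```python
from collections import defaultdict

# Hypothetical toy reports: (protein_a, protein_b, detection_method, source_dataset).
reports = [
    ("A", "B", "two-hybrid", "screen_1"),
    ("B", "A", "affinity-MS", "screen_2"),
    ("A", "C", "two-hybrid", "screen_1"),
    ("A", "C", "two-hybrid", "screen_3"),
    ("D", "E", "affinity-MS", "screen_2"),
]

def supported_pairs(reports, by):
    """Return unordered pairs supported by at least two distinct values of `by`.

    by="method" mimics an FYI-style rule (two different detection methods);
    by="source" mimics an HC-style rule (two independent reports, any method).
    """
    index = {"method": 2, "source": 3}[by]
    support = defaultdict(set)
    for report in reports:
        pair = frozenset(report[:2])  # A-B and B-A are the same interaction
        support[pair].add(report[index])
    return {pair for pair, values in support.items() if len(values) >= 2}

fyi_like = supported_pairs(reports, by="method")
hc_like = supported_pairs(reports, by="source")
```

In this toy example the A-C pair passes the HC-style rule (two independent screens) but not the FYI-style rule, because both reports rely on the same detection method.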

The DIPv core was computed using two methods: the Expression Profile Reliability (EPR) index, and the Paralogous Verification Method (PVM). EPR compares RNA expression profiles of potentially interacting proteins against expression profiles of known interacting and non-interacting pairs of proteins. PVM measures the likelihood that two proteins interact by measuring interactions between their paralogues. We refer to this dataset as DIP-verified (DIPv).

S. cerevisiae is also amongst the most well-annotated genomes, making it ideal for functional analysis using the Gene Ontology [18]. In agreement with previous estimates [36], our LCR-detection method (see Methods) found that of the 6,165 S. cerevisiae proteins documented in UniProt, 1,306 contained LCRs. Of these, 929 contain a single LCR; to simplify the analyses presented, this study deals only with proteins containing a single LCR.

Proteins containing LCRs tend to have more interactions than those without

We considered two subsets of yeast proteins: those with one LCR and those without LCRs. The degree (i.e., connectivity) distributions of both subsets were computed for the four PPI network datasets used in this study. By way of illustration, the degree distributions in the BioGrid network are shown in Figure 1.

Figure 1

Degree distributions comparison between proteins with and without LCRs. Degree distributions of proteins with and without LCRs in the BioGrid dataset show proteins with LCRs have more connections than proteins without LCRs. See Table 2 for Wilcoxon-Mann-Whitney p-values for this and the other datasets.

Comparing the degree distributions using the Wilcoxon-Mann-Whitney test shows that proteins containing LCRs appear to have more protein interactions than proteins without LCRs in all four PPI datasets (all networks having p < 0.05, see Table 2).
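As an illustration of the comparison described above, the following sketch computes protein degrees from an undirected edge list and the Mann-Whitney U statistic for two groups of degrees. The edge list and group membership are made-up toy data, and the function names are hypothetical. The p-values reported here were obtained with the R statistics package (see Methods), which this sketch does not reproduce.

```python
def degrees(edges):
    """Count distinct interaction partners per protein in an undirected edge list."""
    partners = {}
    for a, b in edges:
        if a == b:
            continue  # self-interactions are ignored in this sketch (an assumption)
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    return {protein: len(p) for protein, p in partners.items()}

def mann_whitney_u(xs, ys):
    """U statistic for xs: number of (x, y) pairs with x > y, ties counting 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)

# Hypothetical toy network and LCR annotation.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
with_lcr = {"A", "B"}

degree = degrees(edges)
lcr_degrees = [d for p, d in degree.items() if p in with_lcr]
other_degrees = [d for p, d in degree.items() if p not in with_lcr]

u = mann_whitney_u(lcr_degrees, other_degrees)
# u / (n * m) estimates the probability that a protein with an LCR
# has a higher degree than a protein without one.
effect = u / (len(lcr_degrees) * len(other_degrees))
```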

Table 2 Comparison of degree distributions between proteins with and without LCRs.

LCR locations are biased towards protein sequence extremities

To investigate whether LCR locations are positionally significant, we examined whether LCRs occur randomly within protein sequences. We located the centre positions of LCRs on a continuous scale ranging from the centre to the extremities of the protein sequence by recording their normalised centre positions and folding the resulting distribution in half. We compared the actual distribution of their centres to an empirical null distribution derived from a random model (see Figure 2 and Additional file 1: Figure S1). This null distribution was constructed by removing the LCR from each protein sequence, then repeatedly re-inserting it at random start positions (see Additional file 2: Figure S2). The empirical null distribution is approximately uniform near the centre of the protein sequence and decreases sharply near the sequence extremities. By contrast, the observed frequency of real LCRs increases steadily from the centre to the near extremities (Figure 2(a)). The Kolmogorov-Smirnov test confirms that natural LCR positions do not follow our computed random distribution (p-value = 7.6 × 10⁻⁶), implying that the position of the LCR within the protein sequence may be of relevance to its function.
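The folding step and the random re-insertion model can be sketched as follows. This is a minimal illustration under stated assumptions: start positions are 0-based, insertion points are drawn uniformly, and the function names are hypothetical. The Kolmogorov-Smirnov comparison itself is not shown.

```python
import random

def folded_centre(start, lcr_len, protein_len):
    """Normalised LCR centre, folded so 0 is the protein centre and 0.5 a terminus."""
    centre = (start + lcr_len / 2) / protein_len  # 0..1 along the sequence
    return abs(centre - 0.5)

def null_folded_centres(lcr_len, protein_len, n_draws, rng):
    """Re-insert an LCR at uniformly random start positions (assumed null model)."""
    remainder = protein_len - lcr_len  # sequence left once the LCR is removed
    draws = []
    for _ in range(n_draws):
        start = rng.randint(0, remainder)  # insertion point, bounds inclusive
        draws.append(folded_centre(start, lcr_len, protein_len))
    return draws

# Hypothetical example: a 20-residue LCR in a 100-residue protein.
observed = folded_centre(start=0, lcr_len=20, protein_len=100)
null = null_folded_centres(20, 100, n_draws=1000, rng=random.Random(0))
```

Because an LCR centre cannot lie closer than half the LCR length to a terminus, the null values are bounded away from 0.5, which is consistent with the sharp decrease of the null distribution near the sequence extremities.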

Figure 2

Distribution of folded LCR centre positions. Comparison of normalised and randomly re-arranged LCR centre positions in S. cerevisiae. The Kolmogorov-Smirnov test confirms that these two distributions are significantly different (p-value = 7.6 × 10⁻⁶).

Terminal LCRs are more connected than central LCRs and show length-connectivity dependence

To further characterise the properties of LCRs in our study, we tested whether protein connectivity is related to LCR position within the sequence. We defined two sub-populations of LCRs: terminal LCRs (t-LCRs), occurring near the sequence extremities, and central LCRs (c-LCRs), positioned far from the sequence extremities. To ensure that t-LCRs are truly positioned at the sequence termini, they were defined as regions starting or ending at no more than 25 amino acids from either sequence extremity; c-LCRs, on the other hand, were defined as regions positioned at least 50 amino acids from either sequence extremity. The numbers of c-LCRs and t-LCRs found in the different PPI datasets are shown in Table 3.

To investigate the properties of our two LCR populations, we first compared the degree distributions of proteins with t-LCRs, proteins with c-LCRs and non-LCR proteins. Results presented in Figure 3 show that proteins with t-LCRs are more connected than proteins with c-LCRs in three out of four networks (Table 4). Proteins with t-LCRs clearly tend to be more connected than non-LCR proteins, with significant differences across all four networks. Proteins with c-LCRs also appear to have higher degrees than non-LCR proteins, with p < 0.05 in three out of four networks.

We then examined whether LCR length is related to protein degree in each population. Figure 4 shows that the length of t-LCRs is positively correlated with protein degree, while there is no sign of such a correlation amongst the population of c-LCRs. The r² values are small owing to the large scatter in protein degrees, which is presumably caused by a combination of the uncertainties in PPI network data and the fact that proteins may also bind via interfaces that are independent of LCRs. Notwithstanding these effects, the p-values associated with each linear regression line show that proteins with t-LCRs have significant correlations between LCR length and degree across all four PPI networks studied (Table 5).
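The length-versus-degree relationship is assessed by linear regression. A minimal ordinary least-squares sketch is given below; the function name and the toy data are hypothetical, and the regression p-value (computed in this study with the R statistics package) is not reproduced.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, with r squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r_squared

# Hypothetical toy data: LCR lengths and degrees of the proteins carrying them.
lcr_lengths = [16, 24, 40, 60]
protein_degrees = [2, 3, 7, 8]
slope, intercept, r_squared = linear_fit(lcr_lengths, protein_degrees)
```

A small r² alongside a significant slope, as reported here for t-LCRs, indicates a real but noisy trend: length explains only a small part of the variance in degree.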

Figure 3

Degree distribution comparisons. Boxplot representations comparing degree distributions of t-LCRs, c-LCRs, and proteins without LCRs. Table 4 shows Wilcoxon-Mann-Whitney p-values resulting from comparing their degree distributions.

Figure 4

LCR length versus protein degree. Scatterplots show the relationship between length and protein degree for t-LCRs (in black) and c-LCRs (in gray) in four different PPI networks. The associated p-values and r² values for linear regression are shown in Table 5.

Table 3 Number of t-LCRs and c-LCRs found across the four PPI datasets.
Table 4 Comparison of degree distributions between proteins with c-LCRs, proteins with t-LCRs, and proteins without LCRs.
Table 5 Correlation results (LCR length versus protein degree).

GO analysis shows that terminal and central LCRs have different biological roles

We then performed GO-term enrichment analyses for the set of all LCR proteins, and for the c-LCR and t-LCR subsets, in order to gain insights into their respective functions. Results show that the set of proteins with LCRs is enriched for functions related to the regulation of gene expression. Furthermore, the analysis suggests that t-LCRs and c-LCRs have distinct cellular roles. The first analysis compared all proteins with LCRs against the entire S. cerevisiae proteome as background, and showed enrichments for ten GO terms at a false-discovery rate (q-value) threshold of 0.01. Table 6 gives a detailed description of these terms, their frequencies, p-values and q-values. This ensemble of GO term enrichments suggests that LCRs have a tendency to find roles in transcription, transcription regulation and translation. Interestingly, the term 'nucleic acid binding' suggests that the binding capabilities of LCR proteins may not be restricted to protein-protein interactions. The same analysis was performed with t-LCRs and c-LCRs separately, and revealed t-LCR enrichments for 32 GO terms and c-LCR enrichments for 22 GO terms under the same q-value threshold (Table 7). Proteins with t-LCRs are important to stress response, translation and transport processes and are enriched in protein complexes, while proteins with c-LCRs are important in transcription and transcription regulation processes and are enriched for kinase functions. Although these groups share common and functionally related GO terms, the fact that our somewhat arbitrary division of LCRs into central and terminal subsets results in lower q-values (and hence more significant GO term enrichments) than in the complete LCR population supports the hypothesis that LCR location is directly implicated in protein function.

Table 6 GO term enrichments for all LCRs.
Table 7 GO term enrichments for central and terminal LCRs.

Conclusions

Our results show that LCRs are preferentially located towards sequence extremities, and that proteins with LCRs in their sequence extremities have more protein binding partners than proteins with LCRs in their central regions. Furthermore, we have shown the length of LCRs to be positively correlated with the number of binding partners, but only in the sequence extremities. While t-LCRs can extend free from the rest of the protein structure, c-LCRs are likely to be surrounded by protein globular domains, thus limiting their flexibility and accessibility, and therefore the number of different proteins to which they can mediate binding. By contrast, if t-LCRs themselves tend to act as promiscuous interfaces for protein binding, this would explain our observation that proteins with longer t-LCR regions have a tendency towards a higher number of protein binding partners. Examining the list of over-represented GO terms in Table 7, we hypothesise that t-LCRs play major roles in low-specificity biological events that involve large protein complexes. Protein chaperones, for example, which play a major role in stress response, have low-specificity binding properties due to the large variety of partners they bind to assist conformational search towards global energy minima [37, 38]. Translation and translation elongation are also events requiring low-specificity interactions, involving a crowded protein machinery that operates on the entire proteome. Finally, molecular transport could also be considered to fall within this category, with large protein complexes moving a wide variety of cargos across the cell.

Although some c-LCRs might still be expected to act as flexible linkers, there is evidence that they may also act as direct binding interfaces, albeit with more restricted promiscuity than t-LCRs. Kim and co-workers [39] found that disordered regions could function as interfaces with a limited number of binding partners, particularly in the context of phosphorylation cascades in signalling pathways, where proteins tend to contain both a structured kinase domain and an unstructured kinase-binding domain. Indeed, regions of protein disorder are already known to be implicated in signalling as phosphorylation sites [40]. Our GO analysis finds protein kinase functions to be over-represented only for the set of central LCRs, and not those located at the termini, hence could be considered to be consistent with the existence of a specific set of binding partners for each signalling protein. The set of c-LCR proteins is also enriched with other biological processes that, although still 'promiscuous' in the sense that they have multiple binding partners, need to be much more specific than the translation, folding, and transport processes observed for the t-LCRs. Transcription regulation events, for example, limit the number of proteins present simultaneously [41]. Binding events in polyadenylation processes are also relatively specific and do not involve crowded protein machineries.

In their recent study on protein-protein interactions, Ekman and co-workers noted that hub proteins (those with a large number of interacting partners) are more often multi-domain proteins and contain more disordered regions compared to non-hubs. These observations led them to stress that the disordered regions serve as linkers between domains, in addition to their more commonly reported role in flexible or rapidly reversible binding [12]. Our proteome-wide results show that these two LCR functional roles are distinct and depend on the location of the LCRs within the protein sequence: their role in flexible and rapidly reversible binding is preferentially mediated by LCRs located in the terminal regions of proteins, while their role as linkers between protein domains is preferentially mediated by centrally located LCRs.

These results, together with the other differences in GO enrichment discussed above, suggest that the functions of the low-complexity regions of a protein are related in a fundamental manner to their positions within the sequence.

Methods

Implementation of the LCRs detection algorithm

We used Shannon's entropy, H, as the measure to detect LCRs, as it is the most well-accepted measure of complexity in biological sequences [36]:

$$H = -\sum_{i} P_i \log P_i \tag{1}$$

where P_i represents the fraction of positions within the string of interest that are occupied by amino acid i. The difficulty is that LCRs vary widely in length and position, and it is not reasonable to use the same complexity threshold for every sequence length. Therefore, we scanned the whole proteome for window lengths, varying from 16 to 300 amino acids, to compute the distributions of entropy values (1012 measurements). This provided a background to test whether a single entropy value would be sufficiently extreme to be considered an LCR. For each window length, w, the frequency density of the calculated Shannon entropy values is represented by a histogram f_w(H). Let A_w be the area underneath this histogram:

$$A_w = \int_{0}^{\infty} f_w(H)\, dH \tag{2}$$

Given (2), a low-complexity threshold value, t_w, is calculated for every window length, w, as the entropy value below which 0.5% of this area lies, such that:

$$\int_{0}^{t_w} f_w(H)\, dH = 0.005\, A_w \tag{3}$$

We define a low-complexity region as any window of length w with an entropy value smaller than t_w. Entropy distributions for every window length are highly skewed, with a bell-shaped curve at high entropy values and a very long and thin tail extending toward the low entropy values where LCRs are located (see Additional file 3: Figure S3). Given that the entropy distributions for all window lengths have a similar shape, a single cut-off point selects the same proportion of low-entropy regions, which are enriched in LCRs, regardless of window length.

A very conservative threshold was chosen to give a stringent cut-off and exclude non-LCRs. Visual inspection determined that a threshold corresponding to 0.5% of the area under the distribution curve included only the portion of the curve where the flat tail, containing the LCRs, was located.
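The detection procedure described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in this study: the function names are hypothetical, the logarithm base (2, giving bits) is an assumption, and the threshold is taken as a simple empirical quantile of the background entropies for one window length.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (in bits, an assumed base) of a window's residue composition."""
    counts = Counter(window)
    n = len(window)
    # -p * log2(p) written as p * log2(1 / p), with p = c / n
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def window_entropies(sequence, w):
    """Entropy of every window of length w along one sequence."""
    return [shannon_entropy(sequence[i:i + w]) for i in range(len(sequence) - w + 1)]

def low_complexity_threshold(entropies, fraction=0.005):
    """Entropy value with roughly `fraction` of the background strictly below it."""
    ordered = sorted(entropies)
    k = min(int(fraction * len(ordered)), len(ordered) - 1)
    return ordered[k]

# Hypothetical usage on a made-up sequence with a single window length.
background = window_entropies("MKTAYIAKQRQISFVKSHFSRQAAAAAAAAAAAAAAAAAQLEERLG", 16)
t_16 = low_complexity_threshold(background)
candidate_windows = [i for i, h in enumerate(background) if h < t_16]
```

In practice the background would be pooled over the whole proteome for each window length, so that the 0.5% cut-off reflects the proteome-wide entropy distribution rather than a single sequence.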

Selecting LCRs in protein sequences

Entropy values from different window lengths have comparable distribution shapes (Additional files: Figure S3 and S4), and are therefore standardised for comparison. Entropy value distributions from longer regions have smaller standard deviations and greater means. By contrast, distributions from shorter regions have greater standard deviations and smaller means. Overlapping LCRs are common during the detection process; in order to compare entropy scores from LCRs of different length, the implemented algorithm computes a standardised Z-score for each detected LCR.

$$Z = \frac{H - \mu_w}{\sigma_w} \tag{4}$$

where H is the entropy, μ_w the mean, and σ_w the standard deviation of f_w(H). If multiple LCRs overlap, only the region with the highest Z-score is retained. All detected regions can be accessed and queried through the UTOPIA User Interface [42].
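The standardisation and overlap-resolution steps might be sketched as follows. Several points are assumptions rather than details given here: the population standard deviation is used, regions are half-open intervals (start, end), overlaps are resolved greedily, and the ranking score is passed in as given, since the text does not specify whether "highest" refers to the signed or the absolute Z-score. All function names are hypothetical.

```python
from statistics import mean, pstdev

def z_score(entropy, background):
    """Standardise an entropy value against the background for its window length."""
    return (entropy - mean(background)) / pstdev(background)

def overlaps(a, b):
    """True if half-open intervals a = (start, end, ...) and b share any position."""
    return a[0] < b[1] and b[0] < a[1]

def resolve_overlaps(regions):
    """Greedy selection: among overlapping candidates keep the highest score.

    Each region is (start, end, score); the greedy strategy is an assumption.
    """
    kept = []
    for region in sorted(regions, key=lambda r: r[2], reverse=True):
        if not any(overlaps(region, other) for other in kept):
            kept.append(region)
    return sorted(kept)

# Hypothetical candidates: the first two overlap, the third stands alone.
candidates = [(0, 20, 1.0), (10, 30, 3.0), (50, 70, 2.0)]
selected = resolve_overlaps(candidates)
```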

PPI datasets

Analyses were cross-validated over four PPI datasets: three high-confidence datasets (HC [20], DIPv [21] and FYI [19]) and one potentially lower-confidence but much larger set of interactions (BioGrid [22]). Although the comparison of the three different high-confidence PPI datasets, FYI, HC and DIPv, showed a much greater overlap than previous datasets [43], there were still large numbers of differences between them (Additional file 4: Figure S5). Therefore, inter-study validation using the three high-confidence and the BioGrid PPI datasets was performed to ensure robust results. To ensure that only information relevant to protein-protein interactions was obtained from the BioGrid network, it was first stripped of all non-physical interactions, as described in [44]. To determine whether LCRs are equally distributed across PPI datasets, the study also investigated the distribution of LCRs within the different PPI datasets. Results showed that the three high-confidence networks were similarly enriched in LCRs (approximately 19% of their entries contain LCRs, see Additional file 5: Table S1). These enrichments in the high-confidence networks support the idea that these regions are highly interactive.

Measurements of region positions in protein sequences, correlations, and comparison of degree distributions

We defined the position of an LCR as the coordinate of the LCR's centre within the protein sequence in which it occurs. We then divided this coordinate by the length of the protein to express it on a normalised scale between 0 and 1. The result is an LCR position metric comparable across LCRs of varying lengths within proteins of varying lengths. t-LCRs were defined as regions starting or ending at no more than 25 amino acids from either sequence extremity, c-LCRs as regions starting and ending at least 50 amino acids from either sequence extremity. Correlation p-values and regression lines were computed using the linear model function implemented in the R statistics package. Degree distributions were compared using the Wilcoxon-Mann-Whitney test, also implemented in the R statistics package.
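The position metric and the terminal/central definitions can be expressed compactly. In this sketch the coordinates are assumed to be 1-based and inclusive, the function names are hypothetical, and the label "intermediate" for LCRs that meet neither definition is introduced here only for illustration.

```python
def normalised_centre(start, end, protein_len):
    """LCR centre coordinate divided by protein length (scale 0 to 1)."""
    return ((start + end) / 2) / protein_len

def classify_lcr(start, end, protein_len, terminal_max=25, central_min=50):
    """Label an LCR 'terminal', 'central' or 'intermediate'.

    start and end are assumed to be 1-based inclusive residue coordinates.
    """
    gap_n = start - 1            # residues between the N-terminus and the LCR
    gap_c = protein_len - end    # residues between the LCR and the C-terminus
    nearest = min(gap_n, gap_c)
    if nearest <= terminal_max:
        return "terminal"        # starts or ends within 25 residues of a terminus
    if nearest >= central_min:
        return "central"         # at least 50 residues from both termini
    return "intermediate"        # falls in neither population

# Hypothetical examples for a 500-residue protein.
labels = [classify_lcr(s, e, 500) for s, e in [(10, 40), (100, 150), (31, 60)]]
```

Leaving a gap between the two cut-offs (25 and 50 residues) keeps the terminal and central populations clearly separated, at the cost of excluding LCRs that fall between them.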

GO-term enrichment analyses

GO-term enrichment p-values were calculated using Fisher's exact test [45], and transformed to q-values using Benjamini and Hochberg's multiple testing correction method [46], as implemented in the R statistics package, version 2.7.
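For illustration, the two statistical steps can be sketched in pure Python. The analyses in this study were run in R; the sketch below assumes a one-sided (over-representation) form of Fisher's exact test, which is not stated explicitly above, and the function names are hypothetical.

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact p-value for over-representation (assumed form).

    k of the n study proteins carry the term; K of the N background proteins do.
    This is the hypergeometric probability of observing k or more.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

# Hypothetical p-values for four GO terms.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [i for i, value in enumerate(q) if value <= 0.01]
```

The backward pass with a running minimum enforces monotonic q-values, so a term can never receive a smaller q-value than a term with a smaller p-value.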