Amyloidogenesis involving domains that contain glutamine and/or asparagine ((Q+N)-rich domains) is linked to prion phenomena in budding yeast, as well as a number of neurological disorders in humans, including Huntington's disease.

A prion is an alternative conformation for a protein that can direct its own propagation [1, 2]. In budding yeast, there are currently four identified prions: [PSI+], [URE3], [RNQ+] and [NU+] [3-7]. [PSI+] arises from the propagation of an alternatively folded amyloid-like form of Sup35p [3]. Sup35p is part of the complex in budding yeast that controls translation termination and nonsense-codon readthrough [8, 9]. [URE3] is caused by an alternatively folded form of Ure2p, a protein involved in nitrogen metabolism [4, 10]. The determinant sequences of [PSI+] and [URE3] are characterized chiefly by bias for glutamine (Q) and asparagine (N) residues (Table 1). [RNQ+] and [NU+] arise from alternative propagatable forms of parts of the Rnq1p and New1p sequences, and were found by searches for further sequences with Q/N compositional bias (Table 1) [7, 11]. Prokaryotic and eukaryotic proteomes were assessed for yeast-prion-like domains that comprised a total of 30 or more glutamines and asparagines in an 80-amino-acid stretch [12]. [PIN+] is a non-Mendelian inherited trait that is required for the de novo appearance of [PSI+] [7, 13]. Eight candidate sequences for [PIN+], which tend to have a Q- and/or N-rich segment, were identified using a genetic screen and remain to be verified [13].
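The fixed-window criterion of [12] (30 or more Q and N residues in an 80-residue stretch) can be illustrated with a short sliding-window count. This is only a sketch under our own assumptions: the function name `qn_rich_windows` is hypothetical, and treating a sequence shorter than 80 residues as a single window is our choice, not something stated in [12].

```python
def qn_rich_windows(seq, window=80, min_qn=30):
    """Return start positions of windows holding at least min_qn Q/N residues.

    Assumption: a sequence shorter than `window` is scored as one window.
    """
    is_qn = [aa in "QN" for aa in seq]
    span = min(window, len(seq))
    if span == 0:
        return []
    # Count Q/N in the first window, then slide one residue at a time.
    count = sum(is_qn[:span])
    starts = [0] if count >= min_qn else []
    for i in range(1, len(seq) - span + 1):
        count += is_qn[i + span - 1] - is_qn[i - 1]
        if count >= min_qn:
            starts.append(i)
    return starts


# Toy sequence: a 30-residue Q/N tract flanked by alanines.
toy = "A" * 50 + "QN" * 15 + "A" * 50
print(len(qn_rich_windows(toy)), "qualifying windows")
```

A fixed window and count of this kind does not weigh how improbable a region is for its length, which is what the probabilistic approach described below is meant to address.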

Table 1 The four prion sequences*

Expanded polyglutamine repeats underlie the pathology of neurodegenerative disorders in humans, the most common of which is Huntington's disease [14]. This disorder is caused by inherited expansions, to a length equal to or greater than 39 amino-acid residues, of the polyglutamine region of the protein huntingtin [15]. (Q+N)-rich regions, polyglutamine and polyasparagine are thought to oligomerize or polymerize through a 'polar zipper' of hydrogen bonds between the side chains [15, 16].

Here, we have derived a method for identifying biased regions that relies on defining the lowest-probability subsequences (LPSs) for a given amino-acid composition. For six eukaryotic proteomes (budding yeast, fission yeast, nematode worm, fruit fly, human and Arabidopsis), we have used this formalism to analyze the prevalence of Q- and N-rich regions in the context of other biases. In general, N-rich regions are rarer than Q-rich regions in the eukaryotic proteomes, most notably so in the human proteome. We use the biases for the four known prions of budding yeast to survey comprehensively for (Q+N)-rich domains, and examine the diversity of their subsidiary amino-acid compositions, their functions and their cellular compartments. We find up to around 170 (Q+N)-rich regions in budding yeast, and a relative dearth of such regions in fission yeast. In addition, to provide more context, we discuss some overarching observations on biased regions of any sort.

Results and discussion

Our analysis can be broken up as follows. First we analyze the Q, N, and (Q+N) biases in the four known prion sequences of budding yeast, as well as other subsidiary biases for and against certain residue types. We discuss how this relates to prion-determinant domains (that is, regions of the prion sequences that are necessary for the prion phenomenon). Our analysis is performed using a simple algorithm to find the lowest probability subsequences (LPSs) for a given residue bias (see Materials and methods).

Second, we ask how prevalent Q and N biases are in eukaryotes. Motivated by the fact that prion-determinant sequences correspond to LPSs, we examine LPSs for Q and N biases in the context of single-residue biases for all residue types, in all six proteomes. Focusing on the budding-yeast proteome, we also compare single-residue biases observed for known and hypothetical proteins, and for conceptually translated intergenic DNA (igDNA) in all six potential reading frames.

Then, on the basis of biases for Q and N in combination, we examine the abundance and diversity of (Q+N)-rich regions in the six eukaryotic proteomes. We survey their subsidiary biases for and against certain residues and groups of residues, and their sizes, functional classes and cellular compartments. Finally, to provide more general context, we discuss some overarching perspectives on compositional bias in the eukaryotic proteomes.

Analysis of the identified yeast prion sequences for their biases

Four identified prion protein sequences of budding yeast are Sup35p, Ure2p, Rnq1p and New1p, which form the prions [PSI+], [URE3], [RNQ+] and [NU+] respectively. We extracted the domains that are determinant for prion formation from each of these protein sequences, which have been found previously by experimental study (Table 1) [3, 4, 6, 7]. Using the formalism described in Materials and methods, we determined the main biases for each prion-determinant domain and an associated probability for each bias. Results are shown for single-residue biases, with examples for the sets of residues {QN}, {DERK} and {VILM} (single-letter amino-acid code; Table 1). The groupings {DERK} and {VILM} are 'charged residues' and 'major hydrophobics' respectively [17]. Charged residues and the major hydrophobics appear to be disfavored for the yeast prions (Table 1) [18]; mutation of Q or N to charged residues can lead to loss of prion-forming capability [19]. We also derived the LPSs for a given bias for the whole protein sequences (not just the prion-determinant domains). In addition to the well-documented Q and N biases, we also note that three of the four budding-yeast prions have subsidiary biases for tyrosine, glycine and/or serine (Table 1). The mild bias for tyrosine is conserved for homologs of the Sup35 prion determinant in other fungi, although it is not clear how this is related to the prion phenomenon [18]. It has been suggested that π-stacking of aromatic groups, as in tyrosine and phenylalanine, may play a part in stabilizing amyloid conformations [20].

Interestingly, the prion-determinant domains for three of the prion sequences (Sup35p, Ure2p, Rnq1p) are congruent with the top-ranking single-residue LPSs for the whole sequences (these are the underlined sequence regions in Table 1). That is, the most biased regions coincide with the experimentally derived prion-determinant domains. These are either for Q or N biases (Table 1). However, for the fourth prion sequence (New1p), the prion-determinant domain is comparatively poorly biased for N or Q or {QN} (for example, for N, Pbias = 1.4 × 10⁻⁶, where Pbias is the probability of bias). Its LPS for N bias does, however, coincide with a region derived from a repeat of the amino-acid triplet NYN that has been shown to be necessary for [NU+] prion propagation [7].

In the next section, we show how single-residue biases for Q and N rank in terms of their relative abundance in eukaryotic proteomes. After that, we use the Q- and N-bias levels of the LPSs of the four prions in combination to derive a refined set of (Q+N)-rich domains in the six eukaryotic proteomes (see below).

Abundance of Q and N biases in a proteomic context, for budding yeast and five other eukaryotic proteomes

How abundant are the biases for Q and N observed for the yeast prion domains compared to biases for all the other residue types? Are they noticeably more or less prevalent in budding yeast compared to other eukaryotic proteomes? We examined the most prevalent single-residue biases for the six eukaryotic proteomes at Pbias values corresponding to the LPSs observed in the prion-determinant domains (Table 1). This data gives us a perspective on the relative abundance of such biases (arrayed in Table 2; the exact threshold used to make this table is Pbias < 1 × 10⁻¹³).

Table 2 Abundance of biased regions that have biases at the same level as the Q and N biases in the four budding-yeast prions

It is clear that biases for Q and N are relatively more prevalent in the budding yeast proteome than in the other eukaryotic proteomes. Both Q and N are among the top six biases for this organism at this bias level (Table 2). This observation is the same regardless of whether the biases are ranked in terms of the total number of bias residues, or the total number of biased regions (Table 3), or as a weighted count in which the number of bias residues is multiplied by a factor derived from the amino-acid composition of the proteome (see Additional data file 1 or Supplementary Table A at [21]). For all the proteomes, N biases are always less prevalent than Q biases, being most disfavored in the human proteome, where they are up to 12 times rarer than Q-rich regions (Table 2c). The small number of N-rich regions in human sequences is intriguing, and may be due to a cellular toxicity of such regions in higher eukaryotes.

Table 3 Comparison of prevalent compositionally biased regions for the whole proteome, translated intergenic DNA, known proteins, hypothetical proteins and dORFs in budding yeast

Interestingly, as noted in Table 2, there are eight examples of predicted coiled-coil domains [22] that are in our list of (Q+N)-rich domains. Coiled coils are alpha-helical, whereas the prions form beta-sheet-rich aggregates; this may be an artifact of the coiled-coil prediction program [22], although there are some known viral coiled coils that have short runs of up to five Q residues, and a mild overall Q bias over their whole sequence [23].

The prevalent biases in the budding-yeast proteome were broken down into those for hypothetical and known proteins, and compared at three bias levels (Table 3). Known proteins are those in open reading frame (ORF) classes 1 through 3 in the MIPS database [24] (these are either characterized proteins or sequences that have homology to a characterized sequence). 'Hypothetical' proteins are the remaining annotations (ORF classes 4 through 6 in the MIPS database). There is little difference in the rankings for biases for the whole proteome, the set of known proteins and the set of hypothetical proteins (Table 3a,c,d). Surprisingly, however, total amounts of biased regions are substantially higher for known proteins (Table 3); for example, at Pbias < 1 × 10⁻⁹ they are eight times as prevalent. Q and N biases both remain high-ranking in the 'known' and 'hypothetical' protein lists, and are lowly ranked for conceptually translated igDNA (Table 3a,b,e). In general, the prevalent biases observed for conceptually translated igDNA are very different from those for the annotated proteome (Table 3a,b,e). Notably, there is also very little implied bias for negatively charged residues (aspartic acid (D) and glutamic acid (E) combined), relative to positively charged residues (lysine (K) and arginine (R) combined) in the translated igDNA biases. This suggests that negatively charged bias regions in protein-coding sequences would take longer to evolve or need much greater selective pressure than those for positively charged biased regions, and that underlying replication 'slippage' tendencies [25] and mutation biases for the formation of cryptically simple sequences [26] may disfavor such regions.

(Q+N)-rich domains

We derived a list of (Q+N)-rich domains using Q, N, and {Q+N} compositional bias in combination (Table 4, see footnotes for details of Pbias thresholds used). The longest LPS was chosen to define the domain where any of the three LPSs overlap substantially (a threshold of 15 residues was found to be suitable). There are up to approximately 170 such (Q+N)-rich domains in budding yeast. Most strikingly, we note that (Q+N)-rich domains are relatively rarer in fission yeast, with a comparatively large number in fruit fly (Table 4). The four known budding-yeast prions have biases against the major hydrophobics {VILM} and charged residues {DERK} (Table 1). When these negative biases are accounted for, the number of (Q+N)-rich domains in budding yeast reduces by half to around 100 (Table 4). This may be due to selection against amyloidogenesis mechanisms, where such bias is used for a different reason (perhaps in some cases as part of a coiled coil, see above). Subsidiary biases for glycine (G), tyrosine (Y) and serine (S) occur for three of the four yeast prions (Table 1). When these are accounted for, a substantial number (30) still remains (Table 4). The thresholds used in Table 4 are derived from the highest Pbias values for the LPSs of any yeast prion sequence (rounded up to two significant figures) (Table 4). These observations on subsidiary biases demonstrate the diversity of (Q+N)-rich domains in eukaryotes, showing that about half of them have other biases that are predicted to be incompatible with prion-like amyloidogenesis mechanisms (Table 4).
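The way overlapping LPSs for Q, N and {QN} are reduced to one domain can be sketched as follows. The text states only that the longest LPS defines the domain where LPSs overlap substantially, with a threshold of 15 residues; reading this as 'an overlap of at least 15 residues', resolved greedily from the longest region downwards, is our assumption, as are the names `overlap` and `merge_lps` and the (start, end, bias) tuple layout with an exclusive end.

```python
def overlap(a, b):
    """Number of residues shared by two (start, end, ...) regions, end exclusive."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def merge_lps(regions, min_overlap=15):
    """Keep the longest region out of each group that overlaps substantially.

    Regions are visited from longest to shortest; a region is dropped if it
    shares at least `min_overlap` residues with a region already kept.
    """
    kept = []
    for region in sorted(regions, key=lambda r: r[1] - r[0], reverse=True):
        if all(overlap(region, k) < min_overlap for k in kept):
            kept.append(region)
    return sorted(kept)


# Hypothetical LPS coordinates for one protein.
lps = [(10, 60, "Q"), (40, 120, "QN"), (200, 230, "N")]
print(merge_lps(lps))
```

In this toy case the Q and {QN} regions share 20 residues, so only the longer {QN} region is retained, while the separate N region is kept as its own domain.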

Table 4 Numbers of (Q+N)-rich domains for the six proteomes

[PIN+] is a non-Mendelian inherited trait required for the de novo appearance of the [PSI+] prion in budding yeast [13]. A recent study derived a list of nine candidate genes responsible for the [PIN+] phenomenon [13]. Seven of these nine are found in the (Q+N)-rich domain list here. With regard to the other two, one (PIN2, YOR104W) has a notable subsidiary bias for Y (11 in 51 residues, Pbias = 9.0 × 10⁻⁷), and the other (STE18, YJR086W) has a very short Q-rich region (12 in 25 residues, Pbias = 3.9 × 10⁻¹¹).

To characterize the (Q+N)-rich domains further, we examined their lengths (Figure 1), and also the prevalent Gene Ontology (GO) annotations [27] for the proteins that contain them (Table 5), focusing on budding yeast, fruit fly and human. The GO annotations can be considered as 'keywords' that give an indication of the biological role of the (Q+N)-rich domains (Table 5). The distribution of lengths for the regions with (Q+N) bias varies markedly from organism to organism, with humans having the largest proportion of very long regions with (Q+N) bias (44% > 275 residues; see Figure 1 legend). The fly (Q+N)-rich regions tend to be short, like those in budding yeast (see Figure 1 legend). They have a large proportion (around 18%) that localize to the nucleus, with some of these appearing to be related to transcription (Table 5). In budding yeast, the distribution of GO compartment annotations for proteins with (Q+N)-rich domains shows that these sequences occur most often in the nucleus (23 annotations), in preference to the cytoplasm (16), and the plasma membrane (9). Those that are placed in the nucleus tend to be transcription factors (see function categories in Table 5). Along with transcription, the preferred processes for proteins with (Q+N)-rich domains are 'endocytosis', 'pseudohyphal growth' and 'nuclear pore organization'.

Figure 1

Histogram of the lengths of the (Q+N)-rich domains for budding yeast, fruit fly and human. The distributions of sequence lengths for the (Q+N)-rich domains are shown for budding yeast (top panel), fruit fly (middle panel) and human (bottom panel). The y-axis is the number of regions per bin, and the x-axis is for bins with labels x such that each bin contains all sequences with length x to x + 24 inclusive. The mean and median lengths for each of these distributions are as follows (organism, mean (± SD), median): budding yeast, 209 ± 209, 116; fruit fly, 236 ± 389, 89; human, 553 ± 730, 268. Only the distributions up to bin x = 275 are shown; a sizeable proportion of each distribution is longer than 275 residues (budding yeast 30% of sequences, fruit fly 22% and human 44%).
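The binning used in the legend (a bin labeled x holds lengths x to x + 24) can be reproduced with a few lines. This is a sketch only: the function name `length_histogram` is hypothetical, bin labels are assumed to be multiples of 25, and lengths falling beyond the last displayed bin are simply tallied as overflow, which may differ slightly from how the percentages in the legend were counted.

```python
from collections import Counter


def length_histogram(lengths, bin_width=25, last_bin=275):
    """Bin domain lengths so that label x covers x .. x + bin_width - 1.

    Returns (bins, overflow), where overflow counts lengths past `last_bin`.
    """
    bins = Counter()
    overflow = 0
    for length in lengths:
        label = (length // bin_width) * bin_width
        if label > last_bin:
            overflow += 1
        else:
            bins[label] += 1
    return dict(sorted(bins.items())), overflow


# Hypothetical domain lengths, for illustration only.
bins, overflow = length_histogram([25, 49, 50, 116, 299, 300, 1000])
print(bins, overflow)
```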

Table 5 Functional categories for the (Q+N)-rich domains for budding yeast, fruit fly and human

Some overarching perspectives on biased regions

To put our case study of Q- or N-rich regions in a general context, we will discuss some overarching perspectives on compositional bias in the eukaryotic proteomes. The behavior of all 20 single-residue biases as a function of decreasing Pbias in the proteomes of budding yeast, fission yeast, nematode worm, fruit fly, Arabidopsis and human was examined. The curves for seven selected residues are shown for the budding yeast, fission yeast, fruit fly and human proteomes (Figure 2). Each eukaryotic proteome has a characteristic profile of bias proportions (Figure 2). For budding yeast, serine (S) is an abundant bias regardless of the Pbias threshold (Figure 2). For lower Pbias values, less than 1 × 10⁻¹⁵, these biases arise mainly from serine-rich mannoproteins that are involved in the cell wall (for example, FLO8 [28]). N and Q, however, are prevalent biases only for the lowest Pbias levels (Pbias ≤ 1 × 10⁻¹³). In all the proteomes, biases for individual hydrophobic residues (for example, isoleucine (I) and leucine (L)) fall off at much milder levels of probability, although less so for leucine because of its involvement in coiled-coil regions (Figure 2). There are no I or tryptophan (W) biases at Pbias = 1 × 10⁻¹⁰ or lower for each of the eukaryotes. It is noticeable that cysteine (C) bias is maintained at relative abundance in the human proteome (Figure 2) to much lower Pbias levels than in the other eukaryotes studied; this arises from the occurrence of large tandem arrays of cysteine-rich domains that are disulfide-bridged (for example, epidermal growth factor-like domains [29] and/or metal-binding proteins (such as the zinc finger)).

Figure 2

Each proteome has a characteristic distribution of biases. The proportion of bias residues (y-axis) counted up for each of the following seven residues (S, Q, N, L, I, D, C) is shown as a function of the bias probability (x-axis). The x-axis comprises bins labeled with -log(P) such that all regions with probabilities from -log(P) to 3.0 -log(P) are included. The end (right-most) bin includes all regions with log probability greater than -log(P). From left to right, the first set of panels is for budding yeast, the second set for fission yeast, the third set for fruit fly and the fourth for human. The rows of panels are labeled at the far right with the appropriate one-letter amino-acid symbol (S, Q, N, L, I, D and C).


We have carried out an analysis of (Q+N)-rich domains in the complete proteomes of six eukaryotes, using a simple formalism based on finding the LPSs for a given set of amino acids within a protein sequence. We were motivated to use LPSs by the fact that the four known (Q+N)-rich prion-determinant sequences in budding yeast (found previously by experiment) each correspond to an LPS.

Analysis of budding-yeast prion sequences

We have examined the characteristic biases of the four known budding-yeast prion sequences. Supplementary to the well-documented Q and N biases, there are also mild biases for Y, G and/or S in three of the prion-determinant sequences. A substantial fraction (30/172) of the (Q+N)-rich domains found in this survey for budding yeast have such a subsidiary bias. In particular, for Sup35p, the bias for Y is conserved in homologs in other fungi [18], and could potentially contribute to amyloid formation through π-stacking of aromatic groups [20]. It is likely that some Q- or N-rich regions may have subsidiary compositions that are there to decrease the likelihood of prion-like amyloidogenesis in higher eukaryotes; this would explain the large number of (Q+N)-rich domains that are deleted when a mild bias against charged and major hydrophobic residues is considered. Interestingly, the prion-determinant domains of three of the four prions correspond closely with the LPSs for the single most abundant residue types. For the fourth, New1p, the LPS corresponds to a triplet repeat (NYN)n that appears necessary for prion propagation [13].

Relative abundance of Q and N bias in eukaryotic proteomes

We examined the relative abundance of biases for all 20 residue types for the six different eukaryotic proteomes. When biases that are at least at the level of the Q and N biases observed for the yeast prion sequences are considered, regions with N bias are always less common than those with Q bias, and become substantially less favored for the human proteome (being 12 times rarer than Q-biased regions). Disfavoring of N-rich regions in the human proteome (and other mammalian proteomes) has also been observed for homopolymeric runs of sequence [30, 31].

Occurrence of (Q+N)-rich regions

As a suitable standard, we determined a refined list of domains that are at least as biased as the budding-yeast prion domains (either in terms of Q, N, or {Q+N} compositional bias). Fission yeast appears to have rather fewer (Q+N)-rich domains than budding yeast, which may indicate a relative intolerance to Q/N-based 'polar zipper' oligomerization/polymerization [16]. In the fruit fly, the large number of apparent (Q+N)-rich domains tend to be as short as those in budding yeast, with about a fifth (around 18%) of them localizing to the nucleus, some of which are annotated as involved in transcription (by GO classification [27]).

The analysis of Q/N can be of use to those studying the prion phenomenon in budding yeast and aggregation/amyloidogenesis in the eukaryotic cell. The data for (Q+N)-rich domains is available [21]. We are not suggesting that these regions are indeed prion-like; on the contrary, we have shown the diversity and abundance of these domains in a genomic context, and that they can have a variety of functions, compartments, and importantly, that they very often have subsidiary biases which would be disruptive to prion-like amyloidogenesis. The main results here on Q/N bias are robust to the underlying probability model, as a uniform frequency expectation for amino acids (all f_x = 0.05 in Equation 2, see Materials and methods) produces essentially the same trends (see, for example, Additional data file 2 or Supplementary Table B at [21]).

Some general perspectives on compositional bias

To put this 'case study' of Q/N bias in context, we have also presented some results that offer a more general perspective on the phenomenon of compositional bias. We found that the prevalence of different biases as a function of Pbias is characteristic for each proteome (Figure 2); however, there are some common trends, such as a disfavoring of regions with pronounced biases for I or W. For budding yeast, compositional biases extracted from conceptually translated igDNA, and for disabled ORFs (so-called dORFs) are very different from those for the annotated proteome. Also, there is surprisingly little difference between the prevalent biases observed for the approximately 2,000 annotated 'hypothetical' proteins and the approximately 4,000 known proteins in the budding-yeast proteome. However, biased regions are substantially more common (at some bias levels more than eight times as common) for known proteins than for hypothetical proteins. These observations may be applicable to gene prediction and verification.

The algorithm presented here can be developed for other investigations of compositional bias, for structural genomics, and for topics in protein folding and design, and is also readily applicable to nucleic acid sequences.

Materials and methods

Calculating regions that are Q+N-rich or have other biases

Six complete eukaryotic proteomes were downloaded from the web: budding yeast (Saccharomyces cerevisiae from the SGD [32]), fission yeast (Schizosaccharomyces pombe), nematode worm (Caenorhabditis elegans Wormpep25 [33]), fruit fly (Drosophila melanogaster [34]), mustard weed (Arabidopsis thaliana [35]) and human (the Ensembl data set [36]). In each protein sequence of these proteomes, we searched for biased regions for each of the 20 amino-acid types as follows. For each individual amino-acid type x, and for the range of window sizes (w) from 25 residues to 2,500 residues, we searched each protein sequence for segments that have compositional bias of the lowest probability (Pbias,min):

$$P_{\mathrm{bias,min}} = \min_{i,\,w} P(i,w) \quad \text{for a given residue type } x \tag{1}$$

where i is each possible start position for a window w in the sequence. The probability P(i,w) is given by a binomial distribution:

$$P(i,w) = \frac{w!}{n!\,(w-n)!}\; f_x^{\,n}\,(1-f_x)^{\,w-n} \tag{2}$$

where f_x is the proportion of amino-acid type x in all of the sequences of the proteome taken together (or a uniform expected proportion for each amino acid = 0.05), and n is the count of x in the window of length w starting at position i. Segments with Pbias,min are termed LPSs. Once an initial LPS is found in a protein sequence, the remainder of the sequence is resubmitted to the procedure until no further LPSs can be found. This is somewhat similar to the procedure in the program SEG for assignment of low-complexity or compositionally biased regions (which is based on the calculation of sequence information entropy), and which also determines an LPS [37]. To save computational time, an initial filtering is applied (before using the procedure described above) using pre-computed threshold tables for each window length for all residue types for a fixed, relatively high probability value (Pbias = 0.001 was found to be suitable).
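The search described by Equations (1) and (2) can be sketched in a few functions. This is a minimal brute-force illustration under stated assumptions, not the implementation used for the results reported here: the function names, the use of natural-log probabilities (to avoid numerical underflow for long windows), the value f = 0.04 in the toy example, and the form of the pre-filter entry (the smallest enriched count whose probability falls at or below 0.001) are all our own choices for illustration.

```python
from math import lgamma, log


def log_binomial_prob(w, n, f):
    """Natural log of Equation (2): exactly n target residues in a window of w."""
    log_comb = lgamma(w + 1) - lgamma(n + 1) - lgamma(w - n + 1)
    return log_comb + n * log(f) + (w - n) * log(1.0 - f)


def min_count_for_threshold(w, f, p_max=0.001):
    """Hypothetical pre-filter entry: smallest enriched count with P <= p_max."""
    for n in range(int(w * f) + 1, w + 1):
        if log_binomial_prob(w, n, f) <= log(p_max):
            return n
    return None  # no count reaches the threshold for this window length


def find_lps(seq, residues, f, min_w=25, max_w=2500):
    """Return (log_p, start, end, n) for the lowest-probability window, or None.

    `end` is exclusive. All window sizes and start positions are tried.
    """
    # Prefix sums give the target-residue count of any window in O(1).
    prefix = [0]
    for aa in seq:
        prefix.append(prefix[-1] + (aa in residues))
    best = None
    for w in range(min_w, min(max_w, len(seq)) + 1):
        for i in range(len(seq) - w + 1):
            n = prefix[i + w] - prefix[i]
            log_p = log_binomial_prob(w, n, f)
            if best is None or log_p < best[0]:
                best = (log_p, i, i + w, n)
    return best


def find_all_lps(seq, residues, f, log_p_max, offset=0, **kw):
    """Collect LPSs with log P <= log_p_max, re-searching both flanks."""
    hit = find_lps(seq, residues, f, **kw)
    if hit is None or hit[0] > log_p_max:
        return []
    log_p, start, end, n = hit
    left = find_all_lps(seq[:start], residues, f, log_p_max, offset, **kw)
    right = find_all_lps(seq[end:], residues, f, log_p_max, offset + end, **kw)
    return left + [(log_p, offset + start, offset + end, n)] + right


# Toy sequence with one Q-rich tract; f = 0.04 is an assumed Q frequency.
toy = "MSDE" * 10 + "QQNQQ" * 8 + "AKLV" * 10
hits = find_all_lps(toy, "Q", f=0.04, log_p_max=log(1e-13))
for log_p, start, end, n in hits:
    print(start, end, n, round(log_p, 2))
```

Because Equation (2) is a point probability, a window that is strongly depleted in a residue type also scores a low probability, so the returned count n should be compared with the expectation w × f_x to tell a bias for a residue from a bias against it.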

This procedure differs from those previously reported as it allows for calculation of biases both for and against amino-acid types, while allowing calculation of subsidiary biases for any predefined sequence or subsequence [37-39]. We applied this formalism because we noted that prion-determinant domains for the budding-yeast prions correspond closely to LPSs. The results and trends for Q/N biases reported in this paper use f_x values derived from the eukaryotic proteomes, but do not differ substantially if a uniform probability model for residue bias is used (all residues having f_x = 0.05; see [21]).

Calculating biases for any set of amino acids

Equation (2) can be generalized to calculate a bias for any set of amino acids {xyz...}, by summing up the number of residues over the whole set. This is studied in particular for the sets {QN}, {DERK} (charged residues) and {VILM} (major hydrophobics). As for single-residue biases, the LPSs for a sequence are identified.
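The generalization to residue sets can be illustrated by pooling both the expected frequency and the observed count over the set before applying Equation (2). The sketch below assumes the uniform model (every residue at 0.05) for simplicity; the names `UNIFORM` and `set_bias_log_prob` are hypothetical, and natural-log probabilities are our choice.

```python
from math import lgamma, log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Uniform background model: each residue type is expected at 0.05.
UNIFORM = {aa: 0.05 for aa in AMINO_ACIDS}


def set_bias_log_prob(window, residue_set, freqs):
    """Log of Equation (2) for a set of residues treated as one class.

    Returns (log_p, n), where n is the number of set members in the window.
    """
    f = sum(freqs[aa] for aa in residue_set)          # pooled expectation
    n = sum(aa in residue_set for aa in window)       # pooled count
    w = len(window)
    log_comb = lgamma(w + 1) - lgamma(n + 1) - lgamma(w - n + 1)
    return log_comb + n * log(f) + (w - n) * log(1.0 - f), n


# Toy 30-residue segment made only of Q and N.
segment = "QNQNQ" * 6
for name in ("QN", "DERK", "VILM"):
    log_p, n = set_bias_log_prob(segment, name, UNIFORM)
    print(name, n, round(log_p, 2))
```

For this toy segment the {QN} count is far above expectation (a bias for the set), whereas the {DERK} and {VILM} counts are zero, which corresponds to a mild bias against those sets.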

GO annotations

Annotations for GO categories [27] for the eukaryotic proteomes were downloaded from the Gene Ontology website [40] and counted up as lists of keywords indicative of biological role.

Additional data files

The abundance of biases counted up in different ways for different bias probability thresholds is available in Additional data file 1. A table showing the number of biased regions for all the eukaryotes (for a uniform probability model) is available in Additional data file 2. The coordinates of (gln+asn)-rich domains are available for the following organisms: S. cerevisiae (Additional data file 3), S. pombe (Additional data file 4), C. elegans (Additional data file 5), Arabidopsis (Additional data file 6), Drosophila (Additional data file 7) and human (Additional data file 8). The format for each of these files is as follows: field #1 = name, field #2 = sequence length, field #3 = bias (Q or N or {QN}), field #4 = number of bias residues, field #5 = start of QN-rich region, field #6 = end of QN-rich region, field #7 = probability of bias (see manuscript for details).

The sequences of the proteomes can be found in S. cerevisiae (Additional data file 9), S. pombe (Additional data file 10), C. elegans (Additional data file 11), Arabidopsis (Additional data file 12), Drosophila (Additional data file 13) and human (Additional data file 14). All Additional data files are also available at [21].