Comparative Genomics Reveals Long, Evolutionarily Conserved, Low-Complexity Islands in Yeast Proteins

Journal of Molecular Evolution

Abstract

Eukaryotic proteomes abound in low-complexity sequences, including tandem repeats and regions with significantly biased amino acid compositions. We assessed the functional importance of compositionally biased sequences in the yeast proteome using an evolutionary analysis of 2838 orthologous open reading frame (ORF) families from three Saccharomyces species (S. cerevisiae, S. bayanus, and S. paradoxus). Sequence conservation was measured by the amino acid sequence variability and by the ratio of nonsynonymous-to-synonymous nucleotide substitutions (Ka/Ks) between pairs of orthologous ORFs. A total of 1033 ORF families contained one or more long (at least 45 residues), low-complexity islands as defined by a measure based on the Shannon information index. Low-complexity islands were generally less conserved than ORFs as a whole; on average they were 50% more variable in amino acid sequences and 50% higher in Ka/Ks ratios. Fast-evolving low-complexity sequences outnumbered conserved low-complexity sequences by a ratio of 10 to 1. Sequence differences between orthologous ORFs fit well to a selectively neutral Poisson model of sequence divergence. We therefore used the Poisson model to identify conserved low-complexity sequences. ORFs containing the 33 most conserved low-complexity sequences were overrepresented by those encoding nucleic acid binding proteins, cytoskeleton components, and intracellular transporters. While a few conserved low-complexity islands were known functional domains (e.g., DNA/RNA-binding domains), most were uncharacterized. We discuss how comparative genomics of closely related species can be employed further to distinguish functionally important, shorter, low-complexity sequences from the vast majority of such sequences likely maintained by neutral processes.
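
The two quantitative ideas in the abstract, a Shannon-based complexity measure and a Poisson model of divergence, can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the entropy cutoff `max_entropy`, and the use of the 45-residue minimum as a sliding-window size are assumptions made here for illustration, and the Poisson test shown is only one plausible reading of "used the Poisson model to identify conserved low-complexity sequences".

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits per residue) of a segment's amino acid composition."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_islands(seq: str, window: int = 45, max_entropy: float = 2.5):
    """Return merged (start, end) spans covered by low-entropy windows.

    `window` and `max_entropy` are illustrative assumptions, not the
    thresholds used in the paper. Spans are 0-based, end-exclusive.
    """
    islands = []
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= max_entropy:
            end = start + window
            if islands and start <= islands[-1][1]:
                # Overlaps or touches the previous island: extend it.
                islands[-1] = (islands[-1][0], end)
            else:
                islands.append((start, end))
    return islands


def poisson_lower_tail(observed_diffs: int, length: int, rate: float) -> float:
    """P(X <= observed_diffs) for X ~ Poisson(length * rate).

    `rate` is the expected number of differences per residue under neutral
    divergence (hypothetically estimated from the whole ORF). A small value
    means the region has fewer differences than expected, i.e. it looks
    conserved.
    """
    lam = length * rate
    return sum(math.exp(-lam) * lam ** k / math.factorial(k)
               for k in range(observed_diffs + 1))
```

For example, a 45-residue island with no amino acid differences between two orthologs, in an ORF pair that differs at 10% of sites overall, would give `poisson_lower_tail(0, 45, 0.1)`, which equals `exp(-4.5)`, or about 0.011.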

Acknowledgments

We thank Anton Vysoiskiy for computer system administration. We thank two anonymous reviewers for critical comments and valuable suggestions. This investigation was supported by awards (RR-03037 and GM-60654) from the National Institutes of Health.

Author information

Corresponding author

Correspondence to Wei-Gang Qiu.

Additional information

[Reviewing Editor: Dr. Stuart Newfeld]

About this article

Cite this article

Romov, P.A., Li, F., Lipke, P.N. et al. Comparative Genomics Reveals Long, Evolutionarily Conserved, Low-Complexity Islands in Yeast Proteins. J Mol Evol 63, 415–425 (2006). https://doi.org/10.1007/s00239-005-0291-0

  • DOI: https://doi.org/10.1007/s00239-005-0291-0
