Comparative Genomics Reveals Long, Evolutionarily Conserved, Low-Complexity Islands in Yeast Proteins

Romov, Philip A.; Li, Fubin; Lipke, Peter N.; Epstein, Susan L.; Qiu, Wei-Gang

doi:10.1007/s00239-005-0291-0

Published: 21 August 2006

Volume 63, pages 415–425, (2006)

Journal of Molecular Evolution

Abstract

Eukaryotic proteomes abound in low-complexity sequences, including tandem repeats and regions with significantly biased amino acid compositions. We assessed the functional importance of compositionally biased sequences in the yeast proteome using an evolutionary analysis of 2838 orthologous open reading frame (ORF) families from three Saccharomyces species (S. cerevisiae, S. bayanus, and S. paradoxus). Sequence conservation was measured by the amino acid sequence variability and by the ratio of nonsynonymous-to-synonymous nucleotide substitutions (K_a/K_s) between pairs of orthologous ORFs. A total of 1033 ORF families contained one or more long (at least 45 residues), low-complexity islands as defined by a measure based on the Shannon information index. Low-complexity islands were generally less conserved than ORFs as a whole; on average they were 50% more variable in amino acid sequences and 50% higher in K_a/K_s ratios. Fast-evolving low-complexity sequences outnumbered conserved low-complexity sequences by a ratio of 10 to 1. Sequence differences between orthologous ORFs fit well to a selectively neutral Poisson model of sequence divergence. We therefore used the Poisson model to identify conserved low-complexity sequences. ORFs containing the 33 most conserved low-complexity sequences were overrepresented by those encoding nucleic acid binding proteins, cytoskeleton components, and intracellular transporters. While a few conserved low-complexity islands were known functional domains (e.g., DNA/RNA-binding domains), most were uncharacterized. We discuss how comparative genomics of closely related species can be employed further to distinguish functionally important, shorter, low-complexity sequences from the vast majority of such sequences likely maintained by neutral processes.
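
The two quantitative ideas named in the abstract, a Shannon-based complexity measure and a Poisson model of divergence, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact complexity measure, its threshold, or how the Poisson model was applied. The 45-residue window follows the stated minimum island length; the 3.0-bit entropy cutoff, the function names, and the use of a lower-tail Poisson probability to flag unusually conserved segments are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits per residue) of a segment's amino acid composition."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_islands(seq: str, window: int = 45, max_entropy: float = 3.0):
    """Return merged half-open (start, end) spans covered by low-entropy windows.

    window=45 mirrors the abstract's minimum island length; max_entropy=3.0
    is an arbitrary illustrative cutoff, not a value from the paper.
    """
    islands = []
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= max_entropy:
            end = start + window
            if islands and start <= islands[-1][1]:
                # Window overlaps or touches the previous island: extend it.
                islands[-1] = (islands[-1][0], end)
            else:
                islands.append((start, end))
    return islands


def poisson_lower_tail(observed: int, expected: float) -> float:
    """P(X <= observed) for X ~ Poisson(expected).

    Assumed use: a small value suggests a segment has fewer differences
    than a neutral expectation (for example, one derived from the ORF-wide
    rate) would predict.
    """
    term = math.exp(-expected)  # P(X = 0)
    total = term
    for k in range(1, observed + 1):
        term *= expected / k  # P(X = k) from P(X = k - 1)
        total += term
    return total
```

For example, `low_complexity_islands(protein)` would return the spans of a protein string whose 45-residue windows fall at or below the assumed entropy cutoff, and `poisson_lower_tail(2, 10.0)` would give the probability of seeing two or fewer differences when ten are expected.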

Acknowledgments

We thank Anton Vysoiskiy for computer system administration. We thank two anonymous reviewers for critical comments and valuable suggestions. This investigation was supported by awards (RR-03037 and GM-60654) from the National Institutes of Health.

Author information

Authors and Affiliations

Department of Computer Science, Hunter College, City University of New York, 695 Park Avenue, New York, New York, 10021, USA
Philip A. Romov & Susan L. Epstein
Department of Biological Sciences, Hunter College, City University of New York, 695 Park Avenue, New York, New York, 10021, USA
Fubin Li, Peter N. Lipke & Wei-Gang Qiu

Corresponding author

Correspondence to Wei-Gang Qiu.

Additional information

[Reviewing Editor: Dr. Stuart Newfeld]

About this article

Cite this article

Romov, P.A., Li, F., Lipke, P.N. et al. Comparative Genomics Reveals Long, Evolutionarily Conserved, Low-Complexity Islands in Yeast Proteins. J Mol Evol 63, 415–425 (2006). https://doi.org/10.1007/s00239-005-0291-0

Received: 30 November 2005
Accepted: 27 April 2006
Published: 21 August 2006
Issue Date: September 2006
DOI: https://doi.org/10.1007/s00239-005-0291-0
