
“Word” Preference in the Genomic Text and Genome Evolution: Different Modes of n-tuplet Usage in Coding and Noncoding Sequences

Journal of Molecular Evolution

Abstract

Extensive work on n-tuplet occurrence in genomic sequences has revealed that n-tuplet usage correlates with sequence origin. In parallel, coding and noncoding sequences are subject to different restrictions on nucleotide composition, which may result in distinct modes of n-tuplet usage. The relatively simple approaches described herein focus on such differences. They are based on simple summation measures of n-tuplet frequencies, computed after filtering out the background nucleotide composition. One of the main aims of this work is to draw conclusions on the qualitative differences in the composition of genomic sequences depending on their functionality. Moreover, an evolutionary model is formulated that includes simple forms of ubiquitous events of genome dynamics: genomic fusions, genome shuffling due to transpositions, replication slippage, and point mutations. This model is shown to be able to reproduce all the statistical features of genomic sequences discussed herein.
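To make the idea of a summation measure over background-filtered n-tuplet frequencies concrete, here is a minimal Python sketch. It is an illustration only and not the authors' actual measure, which the abstract does not specify. It assumes that "filtering the background nucleotide composition" can be approximated by subtracting, from each observed n-tuplet frequency, the frequency expected if nucleotides were drawn independently with the sequence's own mononucleotide composition. The function names (ntuplet_frequencies, background_filtered, summed_deviation) and the choice of summing absolute deviations are hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod

ALPHABET = "ACGT"


def ntuplet_frequencies(seq, n):
    """Relative frequencies of overlapping n-tuplets made only of A, C, G, T."""
    seq = seq.upper()
    windows = (seq[i:i + n] for i in range(len(seq) - n + 1))
    counts = Counter(w for w in windows if set(w) <= set(ALPHABET))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def background_filtered(seq, n):
    """Observed minus expected frequency for every possible n-tuplet.

    Assumption: the expected frequency is the product of the mononucleotide
    frequencies of the same sequence (an independence background model).
    """
    mono = ntuplet_frequencies(seq, 1)
    observed = ntuplet_frequencies(seq, n)
    deviations = {}
    for letters in product(ALPHABET, repeat=n):
        word = "".join(letters)
        expected = prod(mono.get(base, 0.0) for base in word)
        deviations[word] = observed.get(word, 0.0) - expected
    return deviations


def summed_deviation(seq, n):
    """Hypothetical summation measure: total absolute deviation from background."""
    return sum(abs(d) for d in background_filtered(seq, n).values())


if __name__ == "__main__":
    example = "ACGT" * 50  # toy, strongly patterned sequence
    for n in (1, 2, 3):
        print(n, round(summed_deviation(example, n), 4))
```

Under this assumption, a sequence whose n-tuplet usage is fully explained by its nucleotide composition gives a value near zero, while a sequence with strong "word" preferences gives a larger value. Comparing such values between coding and noncoding sequences is the kind of contrast the abstract describes, although the measures used in the article itself may differ.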



Author information

Corresponding author

Correspondence to Yannis Almirantis.

Additional information

[Reviewing Editor: Dr. Brian Morton]


About this article

Cite this article

Nikolaou, C., Almirantis, Y. “Word” Preference in the Genomic Text and Genome Evolution: Different Modes of n-tuplet Usage in Coding and Noncoding Sequences. J Mol Evol 61, 23–35 (2005). https://doi.org/10.1007/s00239-004-0209-2


  • DOI: https://doi.org/10.1007/s00239-004-0209-2
