
Sublinear growth of information in DNA sequences

Bulletin of Mathematical Biology

Abstract

We introduce a novel method for analysing complete genomes and recognising some of their distinctive features by means of an adaptive compression algorithm based on the Lempel-Ziv scheme; the algorithm is general-purpose rather than DNA-oriented. We study the Information Content as a function of the number of symbols encoded by the algorithm, and we analyse the dictionary that the algorithm builds. We present preliminary results on regions whose information grows sublinearly, a behaviour closely connected to the presence of highly repetitive subregions that may have a regulatory function within the genome.
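To make the idea of information growth concrete, the following is a minimal sketch and not the algorithm used in the article. The abstract says only that the compressor is based on the Lempel-Ziv scheme, so the sketch assumes a plain LZ78 parse as a stand-in. The function name `lz78_information_curve`, the four-letter alphabet, and the cost model (a pointer of log2(dictionary size) bits plus one raw symbol per new phrase) are all illustrative assumptions.

```python
import math


def lz78_information_curve(sequence, alphabet_size=4):
    """Parse `sequence` with LZ78 (an assumed stand-in compressor).

    Returns (curve, dictionary), where curve is a list of
    (symbols_encoded, cumulative_bits) pairs, one per parsed phrase.
    """
    dictionary = {"": 0}  # phrase -> index; the empty phrase has index 0
    curve = []
    bits = 0.0
    phrase = ""
    for position, symbol in enumerate(sequence, start=1):
        candidate = phrase + symbol
        if candidate in dictionary:
            phrase = candidate  # keep extending a phrase already seen
            continue
        # New phrase: pay for a pointer to its known prefix plus one raw symbol
        # (hypothetical cost model, chosen for simplicity).
        bits += math.log2(len(dictionary)) + math.log2(alphabet_size)
        dictionary[candidate] = len(dictionary)
        curve.append((position, bits))
        phrase = ""
    if phrase:
        # The tail is an already-known phrase: pay for the pointer only.
        bits += math.log2(len(dictionary))
        curve.append((len(sequence), bits))
    return curve, dictionary
```

Under these assumptions, plotting `cumulative_bits` against `symbols_encoded` for a highly repetitive stretch such as `"ACGT" * 1250` gives a curve that flattens, because the parsed phrases keep getting longer, whereas a sequence with no repeated structure stays much closer to linear. The returned `dictionary` can be inspected for its longest phrases, which in a repetitive region are copies of the repeated unit.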




Cite this article

Menconi, G. Sublinear growth of information in DNA sequences. Bull. Math. Biol. 67, 737–759 (2005). https://doi.org/10.1016/j.bulm.2004.10.005


  • DOI: https://doi.org/10.1016/j.bulm.2004.10.005
