Abstract
We introduce a novel method for analysing complete genomes and recognising some of their distinctive features by means of an adaptive compression algorithm that is based on the Lempel-Ziv scheme and is not DNA-oriented. We study the Information Content as a function of the number of symbols encoded by the algorithm, and we analyse the dictionary that the algorithm creates. We present preliminary results on regions whose information grows sublinearly, a behaviour strictly connected to the presence of highly repetitive subregions that may have a regulatory function within the genome.
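As a rough illustration of what "Information Content as a function of the number of symbols encoded" can look like, the sketch below runs a plain LZ78 parse over a sequence and records the cumulative code length each time a new phrase enters the dictionary. This is not the algorithm used in the paper, which the abstract describes only as adaptive and based on the Lempel-Ziv scheme. The function name `lz78_information_curve`, the cost model (a dictionary pointer plus one literal symbol per phrase) and the two toy sequences are all assumptions made for the example.

```python
import math
import random


def lz78_information_curve(sequence, alphabet_size=4):
    """Parse `sequence` with LZ78 and track cumulative code length.

    Returns (curve, dictionary), where curve is a list of
    (symbols_encoded, cumulative_bits) pairs, one per emitted phrase.
    Assumed cost model: each new phrase costs log2(dictionary size) bits
    for the pointer to its longest known prefix, plus log2(alphabet_size)
    bits for the one new symbol.
    """
    symbol_bits = math.log2(alphabet_size)
    dictionary = {""}  # phrases seen so far; starts with the empty phrase
    phrase = ""
    bits = 0.0
    curve = []
    for position, symbol in enumerate(sequence, start=1):
        candidate = phrase + symbol
        if candidate in dictionary:
            # Still inside a known phrase: keep extending it.
            phrase = candidate
        else:
            # New phrase: pay for pointer + literal, then store it.
            bits += math.log2(len(dictionary)) + symbol_bits
            dictionary.add(candidate)
            phrase = ""
            curve.append((position, bits))
    if phrase:
        # Trailing phrase already in the dictionary: pointer only.
        bits += math.log2(len(dictionary))
        curve.append((len(sequence), bits))
    return curve, dictionary


# Toy inputs (hypothetical): a highly repetitive sequence and a
# pseudo-random one of the same length over the same alphabet.
repetitive = "ACGT" * 500
rng = random.Random(0)
shuffled = "".join(rng.choice("ACGT") for _ in range(len(repetitive)))

rep_curve, rep_dict = lz78_information_curve(repetitive)
rnd_curve, rnd_dict = lz78_information_curve(shuffled)

print("repetitive:", len(rep_dict) - 1, "phrases,", round(rep_curve[-1][1]), "bits")
print("random:    ", len(rnd_dict) - 1, "phrases,", round(rnd_curve[-1][1]), "bits")
```

Plotting the second component of each curve against the first would show the repetitive sequence accumulating information far more slowly than the pseudo-random one. Inspecting `rep_dict` is a toy counterpart of analysing the dictionary: for the repetitive input it contains only stretches of the repeated unit.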
Cite this article
Menconi, G. Sublinear growth of information in DNA sequences. Bull. Math. Biol. 67, 737–759 (2005). https://doi.org/10.1016/j.bulm.2004.10.005