Complexity Profiles of DNA Sequences Using Finite-Context Models

  • Armando J. Pinho
  • Diogo Pratas
  • Sara P. Garcia
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7058)


Every data compression method assumes a certain model of the information source that produces the data. When we improve a data compression method, we are also improving the model of the source. This happens because, when the probability distribution of the assumed source model is closer to the true probability distribution of the source, a smaller relative entropy results and, therefore, fewer redundancy bits are required. This is why the importance of data compression goes beyond the usual goal of reducing the storage space or the transmission time of the information. In fact, in some situations, seeking better models is the main aim. In our view, this is the case for DNA sequence data. In this paper, we give hints on how finite-context (Markov) modeling may be used for DNA sequence analysis, through the construction of complexity profiles of the sequences. These profiles are able to unveil structures of the DNA, some of them with potential biological relevance.
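As an illustration of the idea, the following minimal sketch builds a complexity profile with a single adaptive finite-context model. It is not taken from the paper: the function name `complexity_profile`, the parameters `order` and `alpha`, and the additive-smoothing estimator are assumptions chosen for the example. For each base, the model estimates its probability from counts of what followed the same context earlier in the sequence, and records the information content, -log2 of that probability, in bits. Low values mark predictable (for example, repetitive) regions; values near 2 bits mark regions the model cannot predict.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: sequence contains only these four symbols


def complexity_profile(seq, order=2, alpha=1.0):
    """Per-base information content (bits) under an adaptive order-`order`
    finite-context model with additive smoothing `alpha` (illustrative names)."""
    index = {s: i for i, s in enumerate(ALPHABET)}
    # For each context string, count how often each symbol followed it so far.
    counts = defaultdict(lambda: [0] * len(ALPHABET))
    profile = []
    for pos, base in enumerate(seq):
        # The context is the `order` preceding symbols (shorter near the start).
        context = seq[max(0, pos - order):pos]
        ctx_counts = counts[context]
        total = sum(ctx_counts)
        # Smoothed probability estimate using only past symbols (causal).
        p = (ctx_counts[index[base]] + alpha) / (total + alpha * len(ALPHABET))
        profile.append(-math.log2(p))
        # Update the model after coding the symbol, as a decoder could also do.
        ctx_counts[index[base]] += 1
    return profile


repetitive = "ACGT" * 50
profile = complexity_profile(repetitive, order=2)
print(f"first base: {profile[0]:.2f} bits, last base: {profile[-1]:.2f} bits")
```

In practice the raw per-base values are noisy, so a profile of this kind would typically be smoothed (for example, with a moving average) before plotting. The choices of context order and smoothing parameter here are arbitrary and would need tuning for real sequences.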


Complexity Measure; Data Compression; Kolmogorov Complexity; Arithmetic Code; Cyanidioschyzon Merolae
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Armando J. Pinho (1)
  • Diogo Pratas (1)
  • Sara P. Garcia (1)
  1. Signal Processing Lab, IEETA / DETI, University of Aveiro, Aveiro, Portugal
