Compression of Whole Genome Alignments Using a Mixture of Finite-Context Models

  • Luís M. O. Matos
  • Diogo Pratas
  • Armando J. Pinho
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7324)


In recent years, advances in DNA sequencing technology have caused enormous growth in the amount of available genomic sequence data. One such type of data set results from multiple sequence alignments (MSA). In this paper, we propose a method for compressing these data sets using a mixture of finite-context models and arithmetic coding. The method relies on image compression concepts. It was tested on the multiz28way data set and attained a compression rate of around 0.93 bits per symbol on the sequence data, better than the ≈ 1 bit per symbol attained by a recently proposed method.
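To make the idea of a mixture of finite-context models concrete, the following is a minimal sketch and not the authors' implementation. It assumes a plain four-letter alphabet (real alignments also contain gaps and other symbols), additive smoothing with a constant `alpha`, and a common performance-based reweighting rule with a forgetting factor `gamma`. The model orders, both constants, and all names are illustrative assumptions. Instead of running an arithmetic coder, it sums the ideal code length, -log2(p), which an arithmetic coder approaches closely.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: no gaps or unknown bases


class FiniteContextModel:
    """Order-k model: counts of each symbol seen after each k-symbol context."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha  # additive smoothing constant (assumed value)
        self.counts = defaultdict(lambda: [0] * len(ALPHABET))

    def _context(self, history):
        # history[-0:] would return the whole string, so order 0 is handled apart.
        return history[-self.order:] if self.order > 0 else ""

    def probabilities(self, history):
        c = self.counts[self._context(history)]
        total = sum(c) + self.alpha * len(ALPHABET)
        return [(n + self.alpha) / total for n in c]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1


def mixture_bits_per_symbol(sequence, orders=(1, 4, 8), gamma=0.9):
    """Average ideal code length (bits per symbol) under a weighted mixture."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    models = [FiniteContextModel(k) for k in orders]
    weights = [1.0 / len(models)] * len(models)
    max_order = max(orders)
    total_bits = 0.0
    for i, symbol in enumerate(sequence):
        history = sequence[max(0, i - max_order):i]
        s = ALPHABET.index(symbol)
        per_model = [m.probabilities(history) for m in models]
        # Mixture probability of the symbol that actually occurred.
        p = sum(w * pm[s] for w, pm in zip(weights, per_model))
        total_bits += -math.log2(p)
        # Reward models that predicted this symbol well, then renormalise.
        weights = [(w ** gamma) * pm[s] for w, pm in zip(weights, per_model)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        for m in models:
            m.update(history, symbol)
    return total_bits / len(sequence)
```

Calling `mixture_bits_per_symbol("ACGT" * 200)` returns the average cost of a highly repetitive toy sequence. A value below 2 bits per symbol means the mixture does better than a fixed two-bit encoding of the four bases.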


Multiple Sequence Alignment · Genome Alignment · Arithmetic Code · Normalize Maximum Likelihood · Frequent Symbol
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Luís M. O. Matos (1)
  • Diogo Pratas (1)
  • Armando J. Pinho (1)
  1. Signal Processing Laboratory, IEETA/DETI, University of Aveiro, Aveiro, Portugal
