A Biological Compression Model and Its Applications

  • Minh Duc Cao
  • Trevor I. Dix
  • Lloyd Allison
Part of the Advances in Experimental Medicine and Biology book series (AEMB, volume 696)


A biological compression model, the expert model, is presented that outperforms existing compression algorithms in both compression performance and speed. The model is able to compress whole eukaryotic genomes. Most importantly, it provides a framework for knowledge discovery from biological data: it can be used for repeat element discovery, sequence alignment, and phylogenetic analysis. We demonstrate that the model can handle statistically biased sequences and distantly related sequences, where conventional knowledge discovery tools often fail.


Keywords: Mutual Information; Information Content; Compression Algorithm; Repeat Element; Biological Sequence
These keywords were added by machine and not by the authors.



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  1. Clayton School of Information Technology, Monash University, Clayton, Australia
