Non-repetitive DNA Sequence Compression Using Memoization

  • K. G. Srinivasa
  • M. Jagadish
  • K. R. Venugopal
  • L. M. Patnaik
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4345)


With an increasing number of DNA sequences being discovered, the problem of storing and using genomic databases has become vital. Since DNA sequences consist of only four letters, two bits are sufficient to store each base. Many algorithms proposed in the recent past push the bits/base limit further. The subtle patterns in DNA, along with statistical inferences, have been exploited to increase the compression ratio. From a compression perspective, an entire DNA sequence can be considered to consist of two types of subsequences: repetitive and non-repetitive. The repetitive parts are compressed using dictionary-based schemes, and the non-repetitive parts are usually compressed using general text compression schemes. In this paper, we present a memoization-based encoding scheme for non-repeat DNA sequences. This scheme is incorporated into a DNA-specific compression algorithm, DNAPack, which is used for compression of DNA sequences. The results show that our method performs noticeably better than other techniques of its kind.
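The two-bits-per-base baseline mentioned above can be illustrated with a minimal sketch. It shows only the fixed-width packing that DNA-specific compressors try to improve on, not the memoization-based scheme of the paper. The names `CODE`, `BASES`, `pack`, and `unpack` and the A/C/G/T-to-bit assignment are assumptions chosen for illustration.

```python
# Illustrative sketch of the 2 bits/base baseline only; this is NOT the
# paper's memoization scheme. All names and the bit assignment are assumed.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpack can read from the high bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    # The sequence length n must be stored separately to drop padding.
    return "".join(bases[:n])
```

Under these assumptions, a sequence of n bases occupies ceil(n / 4) bytes, which is the 2 bits/base figure that the abstract treats as the starting point.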


DNA Compression, Memoization, Text Compression





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • K. G. Srinivasa (1)
  • M. Jagadish (2)
  • K. R. Venugopal (3)
  • L. M. Patnaik (4)

  1. Data Mining Laboratory, M S Ramaiah Institute of Technology, Bangalore
  2. Software Engineer, MindTree Consulting, Bangalore
  3. Professor, University of Visvesvaraya College of Engineering, Bangalore University, Bangalore
  4. Professor, Microprocessor Application Laboratory, Indian Institute of Science, Bangalore
