Abstract
This paper introduces a novel algorithm for DNA sequence compression that applies a transformation to the sequence and then exploits statistical properties of the transformed sequence. The resulting algorithm is both efficient and effective for DNA sequence compression. Because it is a statistical compression method, it can search for patterns directly in the compressed text, which is useful in knowledge discovery. Experiments show that the algorithm outperforms existing compressors on typical DNA sequence datasets.
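To make the idea of "a transformation followed by statistical modelling" concrete, the following is a minimal sketch and not the authors' method: the abstract does not specify the transformation, so grouping bases into fixed-length, non-overlapping blocks is used here purely as a stand-in. The function name `bits_per_base` and the parameter `k` are hypothetical. The sketch estimates how many bits per base an ideal statistical coder would need on the transformed sequence, which can be compared with the 2 bits per base of a plain fixed-length encoding of A, C, G and T.

```python
import math
from collections import Counter


def bits_per_base(seq: str, k: int = 1) -> float:
    """Empirical entropy of non-overlapping k-mers, in bits per base.

    Assumption: splitting into k-mers is only a stand-in for the paper's
    (unspecified) transformation.
    """
    # Transformation step: split into non-overlapping blocks of length k.
    # A trailing partial block, if any, is ignored.
    blocks = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
    if not blocks:
        raise ValueError("sequence is shorter than k")

    # Statistical step: empirical distribution of the blocks.
    counts = Counter(blocks)
    total = len(blocks)
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())

    # Normalise from bits per block to bits per base.
    return entropy / k


if __name__ == "__main__":
    sample = "ACGT" * 4
    print(bits_per_base(sample, k=1))  # single bases are uniformly distributed
    print(bits_per_base(sample, k=4))  # every block is identical
```

On this toy input the single-base estimate is the full 2 bits per base, while the block-level estimate drops to zero because every block is the same. This illustrates why modelling a transformed sequence can expose redundancy that a symbol-by-symbol view misses. Real DNA is far less regular, and the estimate ignores the cost of transmitting the model itself.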
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gupta, A., Rishiwal, V., Agarwal, S. (2010). Efficient Storage of Massive Biological Sequences in Compact Form. In: Ranka, S., et al. Contemporary Computing. IC3 2010. Communications in Computer and Information Science, vol 95. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14825-5_2
DOI: https://doi.org/10.1007/978-3-642-14825-5_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14824-8
Online ISBN: 978-3-642-14825-5
eBook Packages: Computer Science (R0)