Efficient Storage of Massive Biological Sequences in Compact Form

Gupta, Ashutosh; Rishiwal, Vinay; Agarwal, Suneeta

doi:10.1007/978-3-642-14825-5_2

Part of the book series: Communications in Computer and Information Science (CCIS, volume 95)

Included in the following conference series:

International Conference on Contemporary Computing

Abstract

This paper introduces a novel algorithm for DNA sequence compression that applies a transformation to the sequence and exploits statistical properties of the transformed sequence. The resulting algorithm is efficient and effective for DNA sequence compression. Because it is a statistical compression method, patterns can be searched for directly in the compressed text, which is useful in knowledge discovery. Experiments show that our algorithm outperforms existing compressors on typical DNA sequence datasets.

Author information

Authors and Affiliations

Institute of Engineering & Technology, M.J.P. Rohilkhand University, Bareilly, UP, India
Ashutosh Gupta & Vinay Rishiwal
Motilal Nehru National Institute of Technology, Allahabad, UP, 211004, India
Suneeta Agarwal

Editor information

Editors and Affiliations

Dept. of Computer Sciences, University of Florida, 32611, Gainesville, FL, USA
Sanjay Ranka
University of Florida, Gainesville, Fl, USA
Arunava Banerjee
Department of Computer Science and Engineering, Indian Institute of Technology, 110016, New Delhi, INDIA
Kanad Kishore Biswas
Computer Science, College of Engineering and Science, Louisiana Tech University, LA 71272, Ruston, USA
Sumeet Dua
University of Florida, Gainesville, FL, USA
Prabhat Mishra
Department of Computer Science & Engineering, Indian Institute of Technology Kanpur, 208016, India
Rajat Moona
National Tsing Hua University, Hsin-Chu, Taiwan, R.O.C.
Sheung-Hung Poon
Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong
Cho-Li Wang

About this paper

Cite this paper

Gupta, A., Rishiwal, V., Agarwal, S. (2010). Efficient Storage of Massive Biological Sequences in Compact Form. In: Ranka, S., et al. Contemporary Computing. IC3 2010. Communications in Computer and Information Science, vol 95. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14825-5_2

DOI: https://doi.org/10.1007/978-3-642-14825-5_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14824-8
Online ISBN: 978-3-642-14825-5
eBook Packages: Computer Science (R0)
