GenPress: A Novel Dictionary Based Method to Compress DNA Data of Various Species

  • Péter Lehotay-Kéry
  • Attila Kiss
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11432)

Abstract

A data boom may arrive in the near future, as cheaper sequencing methods make it possible for everyone to keep their own DNA on their own device or in a central medical cloud. As sequencing methods develop, we are able to obtain the sequences of more and more species. However, the human genome takes about 3 GB per person, and the genomes of other species can be larger.

The need for efficient compression of these data is growing, and general-purpose compressors cannot reach satisfying results because they are not aware of the special structure of these data. Several existing algorithms have already tried to reach smaller and smaller compression rates. In this paper, we present our new method for this task.
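
To illustrate what awareness of the structure of DNA data can mean, the sketch below packs a sequence over the four-letter alphabet A, C, G, T into 2 bits per base. It is only an illustration and not the GenPress method; the names `pack` and `unpack` are hypothetical, and the sketch assumes the input contains no other symbols (real sequence files may also contain N and similar codes). Stored at one byte per base, roughly 3 billion bases take about 3 GB, while 2 bits per base would take about 750 MB before any dictionary or statistical modelling is applied.

```python
# Illustrative sketch only: fixed 2-bit packing of the DNA alphabet.
# This is NOT the GenPress method; all names here are hypothetical.

CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string over {A, C, G, T} into 2 bits per base."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so decoding stays uniform.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from the packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])
```

The original length has to be stored alongside the packed bytes, because the last byte may be padded. A general-purpose compressor applied to the plain text file has no built-in knowledge of this four-symbol alphabet, which is one simple sense in which it is unaware of the data's structure.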

Keywords

Bioinformatics, Biology, Compression, Genetics

Notes

Acknowledgment

The project was supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.3-VEKOP-16-2017-00002).

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Faculty of Informatics, Department of Information Systems, ELTE Eötvös Loránd University, Budapest, Hungary
