GenPress: A Novel Dictionary Based Method to Compress DNA Data of Various Species
A data boom may be coming in the near future, as cheaper sequencing methods make it possible for everyone to keep their own DNA on their own device or in a central medical cloud. With the development of sequencing methods, we are able to obtain the sequences of more and more species. However, the human genome is about 3 GB in size for each person, and for other species it can be even larger.
The need for efficient compression of these data is growing, and general-purpose compressors cannot reach a satisfying result because they are not aware of the special structure of these data. Several existing algorithms have already tried to reach smaller and smaller compression rates. In this paper, we present our new method for accomplishing this task.
Keywords: Bioinformatics, Biology, Compression, Genetics
The project was supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.3-VEKOP-16-2017-00002).