Genome Compression: An Image-Based Approach
The advent of next-generation sequencing technologies has reduced the cost and time of genome sequencing. As a result, the number of genomes assembled daily has increased significantly, and this growth demands more efficient techniques for storing and transmitting genomic data. In this research, we address the lossless horizontal compression of genomic sequences using two image formats, WEBP and FLIF. To do so, the genomic sequence is transformed into a matrix of colored pixels, where each symbol of the alphabet {A, T, C, G} is assigned an RGB color at an x-y position. WEBP achieved a better data-rate saving (76.15%, SD = 0.84%) than FLIF. In addition, we compared the data-rate saving of WEBP with that of two specialized genomic data compression tools, DELIMINATE and MFCompress. The results show that WEBP is close to DELIMINATE (76.03%, SD = 2.54%) and MFCompress (76.97%, SD = 1.36%). Finally, we suggest using WEBP for genomic data compression.
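The symbol-to-pixel transformation described in the abstract can be illustrated with a minimal sketch. The color table, the roughly square row-by-row layout, the black padding pixel, and the function names below are all assumptions made for illustration; the abstract does not state which RGB values or image dimensions were used. The data-rate saving formula is likewise the common definition (percentage of the original size removed), assumed here rather than taken from the text. Encoding the pixel matrix as a WEBP or FLIF file is left out.

```python
import math

# Hypothetical color table: the source does not state which RGB values were used.
BASE_COLORS = {
    "A": (255, 0, 0),
    "T": (0, 255, 0),
    "C": (0, 0, 255),
    "G": (255, 255, 0),
}
PAD_COLOR = (0, 0, 0)  # assumed filler for unused pixels in the last row


def sequence_to_pixels(sequence, width=None):
    """Lay out a nucleotide sequence row by row as a matrix of RGB tuples."""
    if not sequence:
        return []
    if width is None:
        # Assumed layout: a roughly square image, width = ceil(sqrt(length)).
        width = math.isqrt(len(sequence) - 1) + 1
    pixels = [BASE_COLORS[base] for base in sequence.upper()]
    # Pad the final row so that every row has the same width.
    pixels += [PAD_COLOR] * (-len(pixels) % width)
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


def pixels_to_sequence(matrix, length):
    """Invert the mapping; a lossless image format must allow this round trip."""
    color_to_base = {color: base for base, color in BASE_COLORS.items()}
    flat = [pixel for row in matrix for pixel in row]
    return "".join(color_to_base[pixel] for pixel in flat[:length])


def data_rate_saving(original_bytes, compressed_bytes):
    """Percentage of the original size removed by compression (assumed definition)."""
    return (1 - compressed_bytes / original_bytes) * 100
```

Because the mapping is one-to-one, the original sequence can be recovered exactly from the pixel matrix as long as the sequence length is stored alongside the image, which is why a lossless image codec is sufficient for lossless genome compression under these assumptions.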
Keywords: Data compression; Genome compression; Assembled genomic sequence; Lossless compression; Image file format
We thank Biji Christopher Leela for her help in sharing with us the sequences that compose the dataset she created.
This work was partially supported by scholarships from CAPES (Brazilian Federal Agency for Support and Evaluation of Graduate Education), which provided a Master's fellowship to JVM, a Ph.D. fellowship to KVK, and a postdoctoral fellowship to OBD. The computational infrastructure for the data analysis in this manuscript was supported by Fundação Araucária (grant #CP09/2016) and the Graduate Program in Computer Science (PPGIa) at PUCPR.