Genome Compression: An Image-Based Approach

  • Kelvin Vieira Kredens (email author)
  • Juliano Vieira Martins
  • Osmar Betazzi Dordal
  • Edson Emilio Scalabrin
  • Roberto Hiroshi Herai
  • Bráulio Coelho Ávila
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10842)

Abstract

The advent of Next Generation Sequencing technologies has reduced the cost and time of genome sequencing. As a result, the demand for assembled genomes has grown significantly, and this demand requires more efficient techniques for storing and transmitting genomic data. In this research, we discuss the lossless horizontal compression of genomic sequences using two image formats, WEBP and FLIF. To this end, the genomic sequence is transformed into a matrix of colored pixels, where an RGB color is assigned to each symbol of the alphabet {A, T, C, G} at a position (x, y). WEBP showed a better data-rate saving (76.15%, SD = 0.84) than FLIF. In addition, we compared the data-rate saving of WEBP with that of two specialized genomic data compression tools, DELIMINATE and MFCompress. The results show that WEBP is close to DELIMINATE (76.03%, SD = 2.54%) and MFCompress (76.97%, SD = 1.36%). Finally, we suggest using WEBP for genomic data compression.
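The sequence-to-image step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the palette, the padding color, the row-by-row layout, and the function names are all assumptions made for illustration, and the data-rate saving is computed with the common definition (1 − compressed size / original size), which the abstract does not spell out.

```python
import math

# Hypothetical palette: the abstract does not say which RGB color
# is assigned to which base, so these values are illustrative only.
PALETTE = {
    "A": (255, 0, 0),
    "T": (0, 255, 0),
    "C": (0, 0, 255),
    "G": (255, 255, 0),
}
PAD = (0, 0, 0)  # assumed filler for the unused tail of the last row


def sequence_to_pixels(sequence, width):
    """Lay the sequence out row by row as a matrix of RGB tuples."""
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = [PALETTE[base] for base in sequence.upper()]
    height = math.ceil(len(pixels) / width)
    # Pad so that the last row has the same width as the others.
    pixels.extend([PAD] * (height * width - len(pixels)))
    return [pixels[row * width:(row + 1) * width] for row in range(height)]


def pixels_to_sequence(matrix, length):
    """Invert the mapping; `length` drops the padding pixels."""
    inverse = {color: base for base, color in PALETTE.items()}
    flat = [pixel for row in matrix for pixel in row]
    return "".join(inverse[pixel] for pixel in flat[:length])


def data_rate_saving(original_bytes, compressed_bytes):
    """Percentage of the original size removed by compression (assumed definition)."""
    return 100.0 * (1.0 - compressed_bytes / original_bytes)
```

For example, `sequence_to_pixels("ATCGA", 2)` yields a 3 × 2 matrix whose last pixel is padding, and `pixels_to_sequence` recovers the original string when given the original length. Writing the matrix to an image file is left out here; whichever encoder is used must run in a lossless mode for the round trip to hold.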

Keywords

Data compression; Genome compression; Assembled genomic sequence; Lossless compression; Image file format

Notes

Acknowledgments

We thank Biji Christopher Leela for her help in sharing with us the sequences that make up the dataset she created.

Funding.

This work was partially supported by scholarships from CAPES (Brazilian Federal Agency for Support and Evaluation of Graduate Education), which provided a Master's Fellowship to JVM, a Ph.D. Fellowship to KVK, and a Postdoctoral Fellowship to OBD. The computational infrastructure for the data analysis in this manuscript was supported by Fundação Araucária (grant #CP09/2016) and the Graduate Program in Computer Science (PPGIa) at PUCPR.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Kelvin Vieira Kredens (1), email author
  • Juliano Vieira Martins (1)
  • Osmar Betazzi Dordal (1)
  • Edson Emilio Scalabrin (1)
  • Roberto Hiroshi Herai (2)
  • Bráulio Coelho Ávila (1)

  1. Graduate Program in Computer Science – PPGIa, Pontifical Catholic University of Paraná – PUCPR, Curitiba, Brazil
  2. Graduate Program in Health Sciences – PPGCS, Pontifical Catholic University of Paraná – PUCPR, Curitiba, Brazil
