STrieGD: A Sampling Trie Indexed Compression Algorithm for Large-Scale Gene Data

  • Yanzhen Gao
  • Xiaozhen Bao
  • Jing Xing
  • Zheng Wei
  • Jie Ma
  • Peiheng Zhang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11276)

Abstract

The development of next-generation sequencing (NGS) technology presents a considerable challenge for data storage. To address this challenge, a number of compression algorithms have been developed. However, currently used algorithms fail to achieve a high compression ratio and a high compression speed at the same time. We propose STrieGD, an algorithm based on a trie index structure that improves the compression speed for FASTQ files. To reduce the size of the trie index structure, our approach adopts a sampling strategy followed by a filtering step that uses quality scores. Our experiments show that the compression ratio of our algorithm is approximately 50% higher than that of GZip and nearly equal to that of DSRC. Importantly, STrieGD compresses 3 to 6 times faster than GZip and about 55% faster than DSRC. Moreover, as the number of compressors increases, the compression ratio remains stable and the compression speed scales nearly linearly.
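The abstract gives no implementation details, so the following Python sketch only illustrates the general idea of a sampled, quality-filtered trie index over FASTQ reads. It is not the STrieGD implementation. The class and function names, the sampling step, the mean-quality threshold, and the prefix depth are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Maps a base (e.g. "A", "C", "G", "T", "N") to the child node.
    children: dict = field(default_factory=dict)


class SamplingTrie:
    """Trie built from a sampled, quality-filtered subset of reads.

    All parameters are illustrative assumptions, not values from the paper.
    """

    def __init__(self, sample_step=2, min_mean_quality=20, depth=8):
        self.root = TrieNode()
        self.sample_step = sample_step            # keep every n-th read
        self.min_mean_quality = min_mean_quality  # Phred threshold for the filter
        self.depth = depth                        # cap on the indexed prefix length

    @staticmethod
    def mean_quality(quality_line, offset=33):
        # Sanger-style FASTQ stores each Phred score as (ASCII code - 33).
        return sum(ord(c) - offset for c in quality_line) / len(quality_line)

    def build(self, records):
        """Index an iterable of (sequence, quality) pairs."""
        for i, (seq, qual) in enumerate(records):
            if i % self.sample_step != 0:
                continue  # sampling step: skip reads outside the sample
            if self.mean_quality(qual) < self.min_mean_quality:
                continue  # filtering step: drop low-quality sampled reads
            node = self.root
            for base in seq[: self.depth]:
                node = node.children.setdefault(base, TrieNode())

    def longest_prefix(self, seq):
        """Return the length of the longest prefix of seq found in the trie."""
        node, length = self.root, 0
        for base in seq:
            nxt = node.children.get(base)
            if nxt is None:
                break
            node, length = nxt, length + 1
        return length


def parse_fastq(text):
    """Parse four-line FASTQ records into (sequence, quality) pairs."""
    lines = text.strip().splitlines()
    return [(lines[i + 1], lines[i + 3]) for i in range(0, len(lines), 4)]
```

In a scheme of this kind, a compressor could replace the matched prefix of each read with a short reference into the trie and store only the remaining suffix. Sampling and filtering keep the index small, at the cost of shorter matches for reads that are not represented in the sample.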

Keywords

Sampling trie · FASTQ file · Data compression

Notes

Acknowledgments

We thank the anonymous reviewers for their insightful comments. We thank Xueqi Li for providing data sets and Torsten Juelich for his helpful advice on writing. We would also like to thank Hougui Liu and Huajie Zheng for the implementation of our deduplication systems. This work was supported by the National Natural Science Foundation of China (Grant No. 601502454).

Copyright information

© IFIP International Federation for Information Processing 2018

Authors and Affiliations

  • Yanzhen Gao (1, 2)
  • Xiaozhen Bao (1, 2)
  • Jing Xing (1)
  • Zheng Wei (1)
  • Jie Ma (1)
  • Peiheng Zhang (1)
  1. Institute of Computing Technology, Beijing, China
  2. University of Chinese Academy of Sciences, Beijing, China
