A new efficient referential genome compression technique for FastQ files

  • Original Article
Functional & Integrative Genomics

Abstract

Hospitals and medical laboratories generate enormous volumes of genome sequence data every day for use in research, surgery, and disease diagnosis. Compression is therefore essential for storing, managing, and distributing these data, and a dedicated compression technique is needed to reduce the time and cost of storage, transmission, and processing. General-purpose compression techniques perform poorly on genomic data because of its special characteristics: a large number of repeats (tandem and palindromic), a small alphabet, high similarity between sequences, and specific file formats. In this study, we present a method for compressing FastQ files that relies on a reference genome and does not sacrifice data quality. Each FastQ file is first split into three streams (identifier, sequence, and quality score), and each stream is compressed with its own technique. A novel, fast, and lightweight mapping mechanism is also introduced to compress the sequence stream effectively. Experiments show that the proposed method, RBFQC, outperforms other state-of-the-art genome compression methods on NGS data in both compression ratio and compression/decompression time. In comparison to GZIP, RBFQC can achieve a compression ratio of 80–140% for fixed-length datasets and 80–125% for variable-length datasets. Compared to domain-specific referential FastQ compression techniques, RBFQC improves total compression and decompression speed by 10–25%.
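
The stream separation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `split_fastq` and `join_fastq` and the sample reads are hypothetical, and the sketch assumes the common four-line FastQ record layout with a bare `+` separator line. It only shows how a file could be divided into identifier, sequence, and quality streams and then reassembled without loss; the per-stream compressors and the reference mapping are out of scope here.

```python
from typing import List, Tuple


def split_fastq(text: str) -> Tuple[List[str], List[str], List[str]]:
    """Split FastQ text into identifier, sequence, and quality streams.

    Assumes four lines per record: '@id', sequence, '+', quality.
    """
    lines = text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FastQ record")

    ids: List[str] = []
    seqs: List[str] = []
    quals: List[str] = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        ids.append(header[1:])   # drop the leading '@'
        seqs.append(seq)
        quals.append(qual)
    return ids, seqs, quals


def join_fastq(ids: List[str], seqs: List[str], quals: List[str]) -> str:
    """Rebuild FastQ text from the three streams (bare '+' separator assumed)."""
    records = [f"@{i}\n{s}\n+\n{q}" for i, s, q in zip(ids, seqs, quals)]
    return "\n".join(records) + "\n"


# Hypothetical sample with one fixed-length and one shorter read.
SAMPLE = "@read1\nACGTACGT\n+\nIIIIHHHH\n@read2\nTTGACC\n+\nFFFFEE\n"

ids, seqs, quals = split_fastq(SAMPLE)
# Round trip: joining the three streams reproduces the original text.
assert join_fastq(ids, seqs, quals) == SAMPLE
```

Separating the streams this way lets each one be handed to a compressor suited to its statistics; for example, the sequence stream draws on a small nucleotide alphabet, while identifiers are often highly repetitive text.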

Data availability statement

Data will be available on reasonable request and will be provided by the first author.

Author information

Contributions

Conceptualization: Sanjeev Kumar, Prabhishek Singh, and Manoj Diwakar; investigation: Anuj Kumar Jain and Manoj Diwakar; methodology: Sanjeev Kumar, Mukund Pratap, and Anuj Kumar Jain; software: Sanjeev Kumar and Mukund Pratap. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Soumya Ranjan Nayak.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kumar, S., Singh, M.P., Nayak, S.R. et al. A new efficient referential genome compression technique for FastQ files. Funct Integr Genomics 23, 333 (2023). https://doi.org/10.1007/s10142-023-01259-x

  • DOI: https://doi.org/10.1007/s10142-023-01259-x
