Abstract
Hospitals and medical laboratories generate a tremendous amount of genome sequence data every day for use in research, surgery, and illness diagnosis. Compression is therefore essential to keep the storage, monitoring, and distribution of these data manageable. A novel data compression technique is required to reduce both the time and the cost of storage, transmission, and data processing. General-purpose compression techniques perform poorly on these data because of their special features: a large number of repeats (tandem and palindromic), a small alphabet, high similarity between sequences, and specific file formats. In this study, we present RBFQC, a method for compressing FastQ files that uses a reference genome without sacrificing data quality. FastQ files are first split into three streams (identifier, sequence, and quality score), each of which is compressed with its own technique. A novel, fast, and lightweight mapping mechanism is also presented to compress the sequence stream effectively. Experiments show that both the compression ratio and the compression/decompression time of NGS data compressed with RBFQC are superior to those achieved by other state-of-the-art genome compression methods. In comparison to GZIP, RBFQC may achieve a compression ratio of 80–140% for fixed-length datasets and 80–125% for variable-length datasets. Compared to domain-specific referential compression techniques for FastQ files, RBFQC improves total compression and decompression speed by 10–25%.
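The stream-splitting step can be illustrated with a minimal Python sketch. It is not the RBFQC implementation: it assumes the common four-line FastQ record layout (no wrapped sequences), the function names `split_fastq_streams` and `gzip_size` are hypothetical, and GZIP appears only as the general-purpose baseline named in the abstract, not as the per-stream technique used by the authors.

```python
import gzip
import io


def split_fastq_streams(handle):
    """Split four-line FastQ records into identifier, sequence, and quality streams.

    Assumption: each record occupies exactly four lines
    (@identifier, sequence, +separator, quality).
    """
    identifiers, sequences, qualities = [], [], []
    while True:
        header = handle.readline()
        if not header:
            break  # end of input
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FastQ record near: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch in: " + header)
        identifiers.append(header[1:])  # drop the leading '@'
        sequences.append(sequence)
        qualities.append(quality)
    return identifiers, sequences, qualities


def gzip_size(lines):
    """Size in bytes of one stream after GZIP compression (baseline only)."""
    return len(gzip.compress("\n".join(lines).encode("ascii")))


sample = "@read1\nACGT\n+\nIIII\n@read2 desc\nGGCA\n+\n!!!!\n"
ids, seqs, quals = split_fastq_streams(io.StringIO(sample))
for name, stream in (("identifier", ids), ("sequence", seqs), ("quality", quals)):
    print(name, len(stream), "records,", gzip_size(stream), "bytes after gzip")
```

Separating the streams groups values with similar statistics, so a compressor chosen for each stream can exploit its regularities independently.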
Data availability statement
Data are available from the first author on reasonable request.
Author information
Contributions
Conceptualization: Sanjeev Kumar, Prabhishek Singh, and Manoj Diwakar; investigation: Anuj Kumar Jain and Manoj Diwakar; methodology: Sanjeev Kumar, Mukund Pratap, and Anuj Kumar Jain; software: Sanjeev Kumar and Mukund Pratap. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
About this article
Cite this article
Kumar, S., Singh, M.P., Nayak, S.R. et al. A new efficient referential genome compression technique for FastQ files. Funct Integr Genomics 23, 333 (2023). https://doi.org/10.1007/s10142-023-01259-x
DOI: https://doi.org/10.1007/s10142-023-01259-x