## Background

It has been more than a decade now since the first draft of the human genome was published [1]. The Human Genome Project, which required a significant collaborative effort of many scientists for more than 10 years, was completed using the Sanger sequencing technology. Just a decade later, many medium and small size laboratories achieve the task of sequencing complete mammalian genomes within a few weeks using the new Next Generation Sequencing (NGS) technologies. The cost per genome has decreased from \$100M in 2001 to \$10K in 2012 [2]. Current sequencers are capable of generating close to tera-base worth of data that needs to be stored and processed. Several recent studies, such as the cow rumen [3] and the MetaHit [4] metagenomic projects resulted in hundreds and hundreds of gigabase worth of datasets. As project scales continue to grow, it is expected that the bottleneck will move towards the computational aspects, in particular with respect to the analysis and storage of the data. For example, there is an explosive growth of data submitted to the freely available next-generation sequence data archive, the Sequence Read Archive (SRA) [5]. Thus compressing this data will facilitate its storage and dissemination.

Several methods are used by the different NGS technologies for the readout of the sequencing signal (known as base calling). This process may be affected by various factors, which may lead to a wrong readout of the sequencing signal. In order to assess the probability of base calling mistakes, the sequencers generate, in addition to the nucleotide sequences (reads), quality scores that reflect the level of confidence in the readout of each base. That is, the higher the quality score, the higher the reliability of the corresponding base, and vice versa. Specifically, the quality score Q is the integer mapping of P(the probability that the corresponding base call is incorrect) and can be represented in (at least) the following scales/standards:

• Sanger or Phred scale [6]: $Q=-10{log}_{10}P$.

• Solexa scale: $Q=-10{log}_{10}\frac{P}{1-P}$.

Different NGS technologies use different scales, Phred + 33, Phred + 64 and Solexa + 64 being the most common ones. For example, Phred + 33 corresponds to values of Q in the range [33:73].

Quality scores are important and very useful in many downstream applications such as trimming (used to remove untrusted regions) [7, 8], alignment [9-12] or Single Nucleotide Polymorphism (SNP) detection [13, 14], among others. However, they significantly increase the size of the files storing raw sequencing data.

The literature abounds in efforts to compress genomic data. Whole genome level compression without the aid of any external information has been the focus of [15-19] and references therein. More recent contributions show that further compression can be achieved by mapping the target genome to a reference genome, and encoding only the differences between the two [20-27]. Compression of raw sequencing data has also been studied in the literature. While the compression scheme proposed in [28] focuses only on the nucleotide sequences, most approaches consider the compression of the entire file, including the quality scores [29-37].

In this paper, we concentrate on the lossy compression of the quality scores, as they take up a significant chunk of the storage space. FASTQ files are widely accepted as a standard for storing sequencing data, and are the ones considered in this paper. FASTQ files consist of separate entries for each read, each consisting of four lines. The first one is for the header line, which begins with the ‘@’ character and is followed by a sequence identifier and an optional description. The second one contains the nucleotide sequence. The third one starts with the ‘+’ character and can be followed by the same sequence identifier and optional description as the first line. Finally, the fourth line is for the quality scores (encoded in ASCII) associated with the bases that appear in the nucleotide sequence of line two (both lines must contain the same number of symbols). The following is an example of a FASTQ file corresponding to a read of length 26 and quality scores in the scale Phred + 33.

@SRR001666.1 GATTTGGGGTTCAAAGCAGTGCAAGC + IIIHIIHABBBAA=2))!!!(!!!((

Unlike header lines and nucleotide sequences, quality scores are particularly difficult to compress, due to their higher entropy and larger alphabet. The importance of lossy compression in the context of reducing storage of sequenced data has been highlighted in [38]. Although discarding information may not always be welcome, several lossy compression schemes for the quality scores have been proposed recently in the literature as a partial solution to reduce the storage space of genomic data. It has been presented as a plausible idea in [5] and as a concrete algorithm as part of the SLIMGENE package in [32]. The SLIMGENE package considers the compression of both the nucleotide and quality score sequences, and it includes a module which performs lossy compression of quality scores based on fitting a fixed state markov model on adjacent quality scores as well as on reducing the alphabet. They use SNP variant calling as their performance metric and show that lossy compression has a minimal effect on performance. In [33], a metric called “quality-budget” is used to selectively discard the quality scores which match perfectly to the reference, with only quality scores with sufficient variation being retained. Cramtools [36] is a software built on this principle of reference-based compression that also allows lossy compression of the quality scores. Recently, [39] considered relative mapping accuracy of the reads as the metric and applied various lossy transform techniques to show that impressive compression can be achieved with a small loss in performance. The software fastqz [34] has the option of quantizing the quality scores for lossy compression. Finally, the SCALCE software [35] provides an optional controlled lossy transformation approach that reduces the alphabet size based on the observation that “neighboring” quality scores are similar in general.

Reducing the storage space of quality scores by performing lossy compression may affect the performance of downstream applications, and therefore minimizing the loss in performance is an important goal. However, due to the noisy nature of the several models/technologies that generate the quality scores and the varied use that different downstream applications make of them, it is difficult to design a compression scheme that achieves this goal independently of the technology and downstream application. Furthermore, it is not clear how to compare the different lossy compression schemes proposed so far in the literature, in the absence of a defined metric. In this work, we propose a scheme based on rate distortion theory that compresses the quality scores by allocating as many bits per quality score sequence as specified by the user, while minimizing a given distortion between the uncompressed (i.e., the original quality scores) and the reconstructed quality scores after the lossy compression. We choose a mathematical quantity for the distortion rather than a “physical distortion” or performance loss with respect to downstream applications, thus making the proposed scheme independent of the sequencing technology and the downstream analyzers.

Specifically, we use mean squared error (MSE) as the measure of performance for our lossy compression. We believe that reduced MSE translates to minimizing the incurred loss in the downstream applications. Our algorithm assumes that the quality scores are generated by a multivariate Gaussian. This is justified by the fact that, given a vector source with a particular covariance structure, the Gaussian multivariate source is the least compressible and, further, a code designed under the Gaussian assumption will perform at least as well on any other source of the same covariance [40]. Then, using the singular value decomposition (SVD) technique, we decorrelate the quality scores thereby getting a multivariate Gaussian characterized by a diagonal covariance matrix, or in other words, “independent” streams of quality scores. This allows us to use optimization techniques from rate distortion theory to optimally allocate bits to minimize the MSE. Because reads within a file may have very different qualities and the proposed method is based on the statistics of the quality score sequences, the proposed algorithm also allows the user to cluster the data prior to compression, to improve the statistics and hence the performance. We compare several lossy compression schemes based on this criterion, and see that our algorithm achieves much smaller MSE for small rates. Also, we show that our algorithm can work at rates not achieved by other lossy compression algorithms. We further show that, for a data set where the assembled genome is known and assumed to be correct, one can get performance in some downstream applications comparable to that achieved using the original (uncompressed) quality scores. Finally, our algorithm allows the user to specify the rate prior to compression, which as far as we know is the first implementation of lossy compression of quality scores that has this characteristic.

## Methods

### The compression method

We now formalize the problem of lossy compression of quality scores and describe the proposed method. As stated in previous sections, the FASTQ format is widely accepted as the standard to store sequencing data. Therefore, we consider the compression of quality scores presented in this format, and assume all the reads are of the same length n within a file. Denote the number of entries in the FASTQ file by N, where each entry is composed of four lines. The quality score sequences presented in a FASTQ file are denoted by ${\left\{{\mathbf{Q}}_{i}\right\}}_{i=1}^{N}$, where Q i =[Q i (1),…,Q i (n)]. Our goal is to design an encoder-decoder pair that describes the quality score vectors using only as many bits as specified by the user, while minimizing a given distortion D between the original vectors ${\left\{{\mathbf{Q}}_{i}\right\}}_{i=1}^{N}$ and the reconstructed vectors ${\left\{{\stackrel{̂}{\mathbf{Q}}}_{i}\right\}}_{i=1}^{N}$. More specifically, we consider that each Q i is compressed using at most n R bits, where R denotes the rate (bits per quality score), and that the distortion D is computed as the average distortion of each of the vectors, i.e., $D=\frac{1}{N}{\sum }_{i=1}^{N}D\left(i\right)$. Furthermore, we consider the MSE as our given distortion function $d:\left(\mathbf{Q},\stackrel{̂}{\mathbf{Q}}\right)\to {\mathcal{R}}^{+}$, which operates symbol by symbol (as opposed to block by block), so that $D\left(i\right)=d\left({\mathbf{Q}}_{i},{\stackrel{̂}{\mathbf{Q}}}_{i}\right)=\frac{1}{n}{\sum }_{j=1}^{n}d\left({Q}_{i}\left(j\right),{\stackrel{̂}{Q}}_{i}\left(j\right)\right)=\frac{1}{n}{\sum }_{j=1}^{n}{\left({Q}_{i}\left(j\right)-{\stackrel{̂}{Q}}_{i}\left(j\right)\right)}^{2}$. These assumptions allow us to model the encoder-decoder pair as a rate-distortion scheme of rate R, where the encoder is described by the mapping f n :Q i →{1,2,…,2nR}, which represents the compressed version of the vector Q i of length n using n R bits, and the decoder is described by the mapping ${g}_{n}:\left\{1,2,\dots ,{2}^{\mathit{\text{nR}}}\right\}\to {\stackrel{̂}{\mathbf{Q}}}_{i}$, where ${\stackrel{̂}{\mathbf{Q}}}_{i}={g}_{n}\left({f}_{n}\left({\mathbf{Q}}_{i}\right)\right)$ denotes the reconstructed sequence.

With this formulation of the problem we can use some results on rate distortion theory to guide the design of QualComp. For a detailed description on rate distortion theory and proofs, please refer to [41]. We are interested in the following result:

#### Theorem 1

For an i.i.d. Gaussian vector source $\mathbf{X}\sim \mathcal{N}\left(\mathbf{0},{\Sigma }_{\mathbf{X}}\right)$, with Σ X = diag$\left[{\sigma }_{1}^{2},\dots ,{\sigma }_{n}^{2}\right]$ (i.e., independent components), the optimal allocation of n R bits that minimizes the MSE is given as the solution to the following optimization problem:

$\begin{array}{l}\underset{\rho =\left[{\rho }_{1},\cdots \phantom{\rule{0.3em}{0ex}},{\rho }_{n}\right]}{\text{min}}\frac{1}{n}\sum _{j=1}^{n}{\sigma }_{j}^{2}{2}^{-2{\rho }_{j}}\phantom{\rule{2em}{0ex}}\end{array}$
(1)
$\begin{array}{l}\text{s.t.}\sum _{j=1}^{n}{\rho }_{j}\le \mathit{\text{nR}},\phantom{\rule{2em}{0ex}}\end{array}$
(2)

where ρ j denotes the number of bits allocated to the jth component of X.

Next we describe how we use this result in the design of QualComp. In real data, quality scores take integer values in a finite alphabet $\mathcal{Q}$, but for the purpose of modeling, we assume $\mathcal{Q}=\mathcal{R}$ (the set of real numbers). Although the quality scores of different reads may be correlated, we model correlations only within a read, and consider quality scores across different reads to be independent. Thus we assume that each quality score vector Q i is independent and identically distributed (i.i.d.) as P Q , which will be specified below.

To the best of our knowledge, there are no known statistics of the quality score vectors. However, given a vector source with a particular covariance matrix, the multivariate Gaussian is the least compressible. Furthermore, compression/coding schemes designed on the basis of Gaussian assumption, i.e., worst distribution for compression, will also be good for non-Gaussian sources, as long as the mean and the covariance matrix remain unchanged [40]. Guided by this observation, we model the quality scores as being jointly Gaussian with the same mean and covariance matrix, i.e., ${P}_{\mathbf{Q}}\sim \mathcal{N}\left({\mu }_{\mathbf{Q}},{\Sigma }_{\mathbf{Q}}\right)$, where μ Q and Σ Q are empirically computed from the set of vectors ${\left\{{\mathbf{Q}}_{i}\right\}}_{i=1}^{N}$. Due to the correlation of quality scores within a read, Σ Q is not in general a diagonal matrix. Thus to apply Theorem 1, we need to first decorrelate the quality score vectors.

In order to decorrelate the quality score vectors, we first perform the singular value decomposition (SVD) of the matrix Σ Q . This allows us to express Σ Q as Σ Q =V S VT, where V is a unitary matrix that satisfies V VT=I and S is a diagonal matrix whose diagonal entries s j j , for j∈[1:n], are known as the singular values of Σ Q . We then generate a new set of vectors ${\left\{{{Q}^{\prime }}_{i}\right\}}_{i=1}^{N}$ by performing the operation ${\mathbf{Q}}_{i}^{\prime }={V}^{T}\left({\mathbf{Q}}_{i}-{\mu }_{\mathbf{Q}}\right)$ for all i. This transformation, due to the Gaussian assumption on the quality score vectors, makes the components of each ${\mathbf{Q}}_{i}^{\prime }$ independent and distributed as N(0,s j j ), for j∈[1:n], since ${\mathbf{Q}}_{i}^{\prime }\sim \mathcal{N}\left(\mathbf{0},S\right)$. This property allows us to use the result of Theorem 1. The number of bits alloted per quality score vector, n R, is a user specified parameter. Thus we can formulate the bit allocation problem for minimizing the MSE as a convex optimization problem, and solve it exactly. That is, given a budget of n R bits per vector, we allocate the bits by first transforming each Q i into ${\mathbf{Q}}_{i}^{\prime }$, for i∈[1:N], and then allocating bits to the independent components of ${\mathbf{Q}}_{i}^{\prime }$. In order to minimize the MSE, we solve the following optimization problem:

$\begin{array}{l}\underset{\rho =\left[{\rho }_{1},\cdots \phantom{\rule{0.3em}{0ex}},{\rho }_{n}\right]}{\text{min}}\frac{1}{n}\sum _{j=1}^{n}{s}_{\mathit{\text{jj}}}{2}^{-2{\rho }_{j}}\end{array}$
(3)
$\begin{array}{l}\text{s.t.}\sum _{j=1}^{n}{\rho }_{j}\le \mathit{\text{nR}},\end{array}$
(4)

where ρ j represents the number of bits allocated to the jth position of ${\mathbf{Q}}_{i}^{\prime }$, for i∈[1:N], i.e., the allocation of bits is the same for all the quality score vectors and thus the optimization problem has to be solved only once. Ideally, this allocation should be done by vector quantization, i.e., by applying a vector quantizer with N ρ j bits to ${\mathbf{Q}}_{i}^{\prime }{\left(j\right)}_{i=1}^{N}$, for j∈[1:n]. However, due to ease of implementation and negligible performance loss, we use a scalar quantizer. Thus for all i∈[1:N], each component ${\mathbf{Q}}_{i}^{\prime }\left(j\right)$, for j∈[1:n], is normalized to a unit variance Gaussian and then it is mapped to decision regions representable in ρ j bits. For this we need ρ j to be an integer. However, this will not be the case in general, so we randomly map each ρ j to ${\rho }_{j}^{\prime }$, which is given by either the closest integer from below or from above, so that the average of ${\rho }_{j}^{\prime }$ and ρ j coincide. In order to ensure the decoder gets the same value of ${\rho }_{j}^{\prime }$, the same pseudorandom generator is used in both functions. The decision regions that minimize the MSE for different values of ρ and their representative values are found offline from a Lloyd Max procedure [42] on a scalar Gaussian distribution with mean zero and variance one. For example, for ρ=1 we have 21 decision regions, which correspond to values below zero (decision region 0) and above zero (decision region 1), with corresponding representative values −0.7978 and +0.7978. Therefore, if we were to encode the value −0.344 with one bit, we will encode it as a ‘0’, and the decoder will decode it as −0.7978. The decoder, to reconstruct the new quality scores ${\left\{{\stackrel{̂}{\mathbf{Q}}}_{i}\right\}}_{i=1}^{N}$, performs the operations complementary to that done by the encoder. The decoder constructs round (V Q+μ Q ) and replaces the quality scores corresponding to an unknown basepair (given by the character ‘N’), by the least reliable quality value score. The steps are summarized below.

Notice that the final size is given by n R N plus an overhead to specify the mean and covariance of the quality scores, the length n and the number of sequences N. This can be further reduced by performing a lossless compression using a standard universal entropy code.

Since the algorithm is based on the statistics of the quality scores, better statistics would give lower distortion. With that in mind, and to capture possible correlation between the reads, we allow the user to first cluster the quality score vectors, and then perform the lossy compression in each of the clusters separately. For that we use the standard k-means algorithm[43], which we explain below. Notice that the total size in this case is just increased by a small amount for each of the clusters, since we do not preserve the order of the sequences. Specifically, after decoding the quality scores, we create the corresponding FASTQ file by incorporating the remaining information, i.e., the first three lines of each entry (including the header and the nucleotide sequence), and sorting the entries in alphabetical order with respect to the headers. This guarantees that related FASTQ files with paired end reads will have the same ordering after applying the lossy compression algorithm.

Finally, notice that R=0 is not the same as discarding the quality scores, since the decoder will not assign the same value to all the reconstructed quality scores. Instead, the reconstructed quality score vectors within a cluster will be the same, and equal to the empirical mean of the original quality score vectors within the cluster, but each quality score within the vector will in general be different.

#### Clustering

The clustering is based on the k-means algorithm[43], and it is performed as follows. For each of the clusters, we initialize a mean vector V of length n, with the same value at each position. The values are chosen to be equally spaced between the minimum quality score and the maximum. For example, if the quality scores go from 33 to 73 and there are 3 clusters, the mean vectors will be initialized as all 33’s, all 53’s, and all 73’s. Then, each of the quality score vectors will be assigned to the cluster that minimizes the MSE with respect to its mean vector V, i.e., to the cluster that minimizes $\frac{1}{n}{\sum }_{i=1}^{n}{\left(Q\left(i\right)-V\left(i\right)\right)}^{2}$. After assigning each quality score vector to a cluster, the mean vectors are updated by computing the empirical mean of the quality score vectors assigned to the cluster. This process is repeated until none of the quality score vectors is assigned to a different cluster, or until a maximum number of iterations is reached.

## Results and discussion

Ideally, when testing a lossy compression scheme over quality scores, one would like to know how the compression affects the results of downstream algorithms which use the quality scores. However, lossy compression of quality scores has been introduced only recently, and has not been tested extensively in that respect. In general, even without compression, there is a limited amount of work on the behavior of downstream application with respect to quality scores. Further complicating the assessment is the fact that there are no standard benchmarks, and there are very few datasets for which results (e.g., assembled genome, SNPs) were validated.

When assessing the performance of our lossy compression algorithm we consider two aspects. First is the distortion rate tradeoff achieved. That is, the MSE vs. the compression rate. This measure is independent of the downstream application and provides a framework that can be used for an evaluation of different lossy-compression algorithms. Second, we use a few publicly available datasets to test the impact of our algorithm on two downstream applications: short-read alignment and SNP detection.

### Datasets

We consider the lossy compression of three different datasets. We use two of the datasets used in [39], both generated by Solexa technology. Specifically, we consider the lossy compression of the SRR032209 dataset [44] from a M. musculus species, with N=18828274, n=36 and quality scores in the range [33:67], and the SRR089526 dataset [45] from an H. sapiens with N=23892841, n=48 and quality scores in the range [33:73]. We also consider a dataset from the PhiX[46] virus used in the control lane in the Illumina technology, where the genome from which the reads are generated is assumed to be known. This data set contains N=13310768 quality score sequences of length n=100 and values in the range [66:98]. All data sets are available for download at [47].

### Timing and memory usage

We provide estimates of the time and memory used by QualComp to compress and decompress the quality scores presented in the FASTQ files of the PhiX, SRR03229 and SRR089526 data introduced above. Recall that the number of reads in each file is 13310768,18828274 and 23892841, respectively, and the length of each read is 100,36 and 48, respectively. Computing the statistics needed for compression took 20, 4 and 10 minutes, respectively. However, notice that the statistics need to be computed only once. The compression time increases with the rate, varying from 0 seconds with zero rate to 9 minutes with R=2.5 and PhiX data, 3 minutes with R=2 and SRR032209 data and 8 minutes with R=2.5 for SRR089526 data.

The decompression time of the algorithm increases with the rate as well, and varies from 3 minutes (R=0) to 7 minutes (R=2.5) for the PhiX data, from 1 minute (R=0) to 2 minutes (R=2) for SRR032209 data and from 2 minutes (R=0) to 4 minutes (R=2.5) for SRR089526 data.

Finally, the memory usage for all the datasets during the compression and the decompression was about 16 MB, making it suitable for execution on any personal computer.

### Simulation results

#### Performance in terms of the MSE

We apply QualComp to the data sets introduced above. For each of the data sets, we run the algorithm with several rates and different number of clusters. We then compute the MSE between the uncompressed quality scores (i.e., the original quality scores) and the quality scores reconstructed by QualComp. To evaluate the performance, we consider the lossy compression schemes proposed in [39], SCALCE [35] and fastqz [34]. To that end, we run the algorithm of [39] with the options for lossy compression LogBinning, UniBinning and Truncating, for different parameters, and SCALCE with different error thresholds. We choose the option gzip for the final compression method in both algorithms. We also run the algorithm fastqz with different quantization levels and command “c”, since it offers more compression than the command “e” while creating the same reconstructed quality scores. For a fair comparison, we compute the number of bits per quality score employed on average by each of the algorithms and plot it versus the MSE. We do not consider the scheme proposed in [32], because we did not succeed in running the lossy compressor of quality scores presented in the SLIMGENE package. Also, the schemes proposed in [33] and Cramtools [36] employ a reference for compression, so we did not consider them.

Notice that, in our scheme, there is a small overhead due to some extra information needed for the decoder to decompress the quality scores. As a result, the actual rate (bits per quality scores) after the compression, which we denote by R, may not be exactly equal to the parameter R we set when calling the program. The use of a general compression tool such as gzip after our compression could further decrease the size of the output file, but the reduction is very small so we do not include it in our final results.

1. 1.

PhiX data:

We run our lossy compression algorithm with rates R={0,0.2,0.5,1,2,2.5,3}, and 1,3 and 5 clusters. Table 1 shows the MSE obtained with the different parameters, where R and C denote the rate and the number of clusters specified when running QualComp. As can be observed from these results, increasing the rate for a given number of clusters decreases the MSE, especially for small values of R. Similarly, increasing the number of clusters for the same rate decreases the MSE, as expected. This can be clearly seen in Figure 1. The overhead incurred by setting the number of clusters to 1,3 and 5 is 100 KB, 292 KB and 488 KB, respectively, which represents an increase in rate of 6.15×10−4, 1.79×10−3 and 3.00×10−3.

The results obtained by running the schemes proposed in [39] and SCALCE [35] are shown in Table 2 and Table 3, respectively. No results are provided for the fastqz software [34], because it only accepts Sanger files with quality scores between 33 and 73. For ease of comparison, we plot the rate versus the MSE for all the schemes in Figure 2. As can be observed, QualComp presents a smaller MSE than that of [39] and SCALCE for the same rates in most of the cases. For rates above 2, both the LogBinning scheme proposed in [39] and SCALCE present a smaller MSE than QualComp with rate 3. This is due to our assumption that $\mathcal{Q}=\mathcal{R}$, instead of $\mathcal{Q}=\phantom{\rule{2.77626pt}{0ex}}\left[66:98\right]$.

As a result, QualComp may perform worse than other algorithms for rates close to those of lossless compression. However, in such a scenario (when the rate is high), the savings with lossy compression are marginal compared to a lossless compressor.

Finally, notice that in the LogBinning scheme proposed in [39] (as well as in SCALCE), there is no clear correlation between the number of quality scores per bin (the error threshold) and the corresponding rate and MSE. Moreover, both schemes present an MSE that is not monotonically decreasing with increasing rate. Notice also that SCALCE can not achieve rates smaller than 0.77 in this case, whereas QualComp can work at any rate. Also, the schemes UniBinning and Truncating are unable to get an MSE less than 100 with this data.

2. 2.

M. musculus data (SRR032209):

We run QualComp with rates R={0,0.33,0.66,1,2}, and 1,2 and 3 clusters. Table 4 shows the MSE obtained with the different parameters. R and C denote the rate and the number of clusters, and R denotes the actual rate obtained after applying the compression. The results we get are consistent with what we observed in the previous dataset. Note that the difference between R and R is due to the extra information (fixed) about the statistics and the number of clusters. This additional rate per read (i.e., R R) becomes negligible for a large number of reads N, as can be observed in Table 4.

With two clusters, 24.31% of the quality score sequences are in one cluster and the remaining ones in the other, with the small cluster corresponding to that of quality scores with lower mean. In the case of three clusters, the division is 21.17%, 15.01% and 63.81%. Finally, the overheads for one, two and three clusters are 20 KB, 40 KB and 52 KB, respectively.

In order to compare the performance of QualComp, we run the schemes proposed in [39]], SCALCE [[35] and fastqz [34], with different parameters. The corresponding rate and MSE for the different schemes are shown in Table 5 and Table 6, and in Figure 3.

As can be seen, QualComp performs well in terms of the MSE, especially for small values of R, as in the previous case. Notice that, as before, SCALCE can not work at small rates. Similarly, fastqz can not work below rate 0.05, and the MSE is not decreasing as a function of the rate. Among the schemes proposed in [39], LogBinning performs better for higher rates, and Truncating for smaller rates.

3. 3.

H. sapiens data (SRR089526):

For this dataset, we run QualComp with a choice of rates given by R={0,0.25,0.5,1,2} and 1,2 and 3 clusters. The obtained MSE for the different parameters is shown in Table 7. The overhead for the choice of 1,2 and 3 clusters is 28 KB, 56 KB and 84 KB, respectively, which represents an increment (i.e., R R) of 2×10−4, 4×10−4 and 6×10−4. As before, increasing the number of clusters decreases the MSE, and this difference is more noticeable for small values of R. For example, with R=0, we have a reduction of more than 77% in the MSE when dividing the data into three clusters instead of one, whereas the reduction for R=2 is around 43%.

Finally, Table 8 shows the rate and the MSE obtained when applying the schemes proposed in [39] for different parameters, and Table 9 shows the results for SCALCE and fastqz. We see from these results and Figure 4, that QualComp can work at rates not attainable by other algorithms and that it has the lowest MSE for small rates. This is again consistent with our observations from the previous datasets.

#### Impact on downstream applications

We test QualComp on two downstream applications: short-read alignment and SNP detection. For alignment, we use the Bowtie algorithm [10], and compare the results obtained with the original PhiX FASTQ file to those obtained after applying QualCompa. We chose this dataset because the virus PhiX has a known assembled genome [48], which we assume to be correct. This allows us to consider that a perfect mapping, i.e., a mapping with no mismatches, is correct and more valuable than a mapping that has mismatches between the read and the region where it maps in the genome. We present the alignment results in Table 10.

As this table shows, increasing the rate and the number of clusters results in a mapping which is closer to that of the original FASTQ file. Also, notice that the number of reads that map with zero mismatches is constant with all the tested parameters. The main differences between the original FASTQ file and the reconstructed ones (after applying QualComp) are for alignments with high number of mismatches. Even with 0 bits per quality score, the difference in mapping is about 1.23%, 0.38% and 0.28% for 1, 3 and 5 clusters, respectively, whereas the savings in size are significant. With 3 clusters, for example, the savings are 467 MB. With 2 bits per quality score and 3 clusters, the difference in mapping is 0.06% and in size 149 MB.

In order to see how lossy compression affects the SNP detection for this dataset, PhiX, we run the alignment algorithm BWA [11] followed by Samtools [49] for the original and the compressed files with one, three and five clusters and different rates. However, no SNPs were detected for any of the files (including the original FASTQ file), and therefore no results are provided. Therefore, in this case, the output of the SNP detection is unaffected by the use of QualComp.

We also perform alignment with BWA [11] followed by SNP detection with Samtools [49] for the M. musculus data. Specifically, we use as a reference chromosome one of the Mus Musculus reference genome release, known as MGSCv37. We compare the output of Samtools using the original FASTQ file as an input with the reconstructed ones after applying QualComp. We show the results of SNP detection in Table 11, with one, two and three clusters and rates R={0,0.20,0.33,0.66,1,2}. We did not report results for higher number of clusters because we observed no noticeable improvements. We also omit the alignment results since BWA does not use the quality scores for alignment.

As can be seen, both the selectivity and the sensitivity parameters (refer to Table 11 for definitions) increase with the rate, as expected. Also, since increasing the rate decreases the MSE, a lower MSE yields a performance closer to that of using the original FASTQ file. For example, with two clusters and 0.66 bits per quality score, we get a selectivity of 92.78% and a sensitivity of 97.25%. With three clusters and 1 bit per quality score, we increase both the selectivity and sensitivity to 94.23% and 97.37%, respectively.

Finally, we perform the same simulations with the H. sapiens data. We use as a reference chromosome one of the human reference genome release hg19, also known as GRCh37. We show the results of SNP detection in Table 12, with one, two and three clusters, and rates R = {0,0.2,0.25,0.5,1,2}. As in the previous case, lower MSE (and therefore higher rate) yields a performance closer to that of the original FASTQ file. For example, with two clusters and 0.25 bits per quality score, we get 92.25% selectivity and 97.64% sensitivity, and with three clusters and 0.5 bits per quality score, we get 93.06% selectivity and 97.82% sensitivity. Also, increasing the number of clusters yields a better selectivity. This, together with the previous results, supports the use of the MSE as a metric for evaluating any lossy compression algorithm.

### Setting the compression rate R

The performance of the algorithm in terms of the MSE and its effect on downstream applications is highly data dependent, and therefore each user should decide which rate to use based on their storage capabilities and the accuracy required for the reconstructed quality scores.

However, based on the results we obtained with the three datasets we experimented with, we can suggest some general guidelines on how to select R effectively. The trade-off between rate and performance for QualComp is best for small values of R, i.e., for the same increment in rate, the improvement in performance is higher for small rates (see Additional file 1: Figure S1). For example, although R=0 gives the most savings in terms of storage, with R=0.05 and two clusters, the sensitivity of SNP calling on the M. musculus dataset jumps from 92.77% to 96.42%, and from 84.41% to 96.30% for the H. sapiens data (see Additional file 1: Table S1 and Table S2). Furthermore, increasing the rate to 0.5 increases the sensitivity for the M. musculus data only slightly to 97.16%, but increases the storage space drastically (by around 900 percent) from 4.07 MB (R=0.05) to 40.44 MB (R=0.5). Similar observations hold for the H. sapiens data.

Looking at the MSE, we observe a similar behavior. Specifically, the slope of the MSE with respect to the rate is more pronounced for small values of R. For example, for the PhiX dataset (see Additional file 1: Table S3), we observe the biggest relative decrease from R=0 (MSE =18.62) to R=0.05(MSE =11.73), which is a 37 percent improvement. Then, for R=0.3 we get an MSE of 7.58, and for R=0.5 an MSE of 5.94. Similarly, for the H. sapiens dataset we observe a factor of two reduction in the MSE, from 25.21 with R=0 to 13.55 with R=0.05. For R=0.2 we get an MSE of 9.09 and further increasing the rate to 0.5 gives an MSE of 7.17.

We therefore propose to set R=0.2 as a starting point, since it offers better performance than R=0 while still offering major savings in size (see Additional file 1: Table S4). For example, while Gzip can compress the dataset PhiX to 592 MB, QualComp produces a file of size 32.09 MB for rate R=0.2 and one cluster (94% reduction). Such savings are very significant for large-scale metagenomics projects, which generate massive datasets. Similarly, Gzip reduces the H. sapiens dataset to 389 MB, while QualComp can compress the file to 27.42 MB (R=0.2 and 3 clusters), which represents a reduction in size of 93%.

As for the number of clusters, the results show a notable improvement when switching from one to three clusters. For example, with R=0.2 the alignment results of Bowtie on the PhiX dataset show 3.01%,2.78% and 2.78% unmapped reads with one, three and five clusters, respectively. Similar results are obtained for different choices of R. For the H. sapiens data and R=0.2, we get an MSE of 13.95 with one cluster, 9.09 with two clusters and 7.80 with three clusters. Therefore we suggest setting the number of clusters to 3.

QualComp does take longer to run compared to other programs (see Additional file 1: Table S4), however most of it is spent on the computation of the statistics and the clustering, which do not depend on the rate. For example, computing the statistics for the H. sapiens (SRR089526) and the M. musculus (SRR032209) datasets took approximately 10 and 4 minutes (both with 1 and 3 clusters), respectively. Clustering the H. sapiens and the M. musculus datasets to three clusters took about 24 and 20 minutes, respectively. However, notice that the clustering only affects the compression and not the decompression time, which is approximately a few minutes for the considered datasets (and therefore for FASTQ files with similar sizes). In general, the decompression time scales linearly with the number of reads.

Moreover, note that when compressing a dataset with different rates, the clustering process and the computation of the statistics need to be performed only once. Furthermore, for large sets of sequencing data, one needs to perform clustering and estimation of the statistics (mean and covariance) only on a small subset (can be thought of as the training set) of the data. One can then use those estimates to compress every subsequent read. This can be done by first assigning the read to a cluster and then using the proposed compression scheme. This will lead to a significant reduction in the compression time.

In summary, the proposed scheme can offer major savings in storage space while still enabling accurate reconstruction of the quality scores. We suggest the use of small rates (between 0.05 and 0.5) and 3 clusters. We stress here that our study of downstream applications assumes read sequences from a single organism. However, the scale of savings from our scheme becomes more significant for larger datasets such as those used for metagenomics studies [3, 4]. For the scale of data reported in [3] (268 gigabases) our methods can reduce the storage requirements for the quality values to 1.67 GB using a rate of 0.05 or 16.75 GB (R=0.5) (as opposed to around 83 GB achieved by a lossless compressor algorithm if we assume a rate of 2.5 bits per quality score). Similarly for the data sets in [4] (576.7 gigabases), we could do with 3.6 GB of storage for rate of 0.05 and 36 GB for rate 0.5, instead of 180 GB for lossless compression.

## Conclusions

To tackle the problem of storage and dissemination of genomic data, we have developed QualComp, a new algorithm for the lossy compression of the quality scores presented in a FASTQ file. One advantage of the proposed method with respect to other lossy compression algorithms is that it allows the user to specify the rate (number of bits per quality score) prior to compression. This choice should be made according to the storage availability of each user and the amount of accuracy required for the reconstructed quality scores. Given a model for the quality scores and using theoretical results on rate-distortion, QualComp optimally allocates the bits in order to minimize the MSE. We then compare several lossy compression schemes by looking at the MSE versus the rate, thus making the comparison independent of downstream applications that use quality scores. We show that our algorithm results in better MSE, compared to alternative schemes, for small rates. QualComp can also work at rates not attainable with other algorithms, and presents an MSE that decreases monotonically as a function of the rate.

We also test QualComp on two downstream applications (short-read alignment and SNP calling), showing that little is compromised in performance, while the file size is reduced significantly after compression. However, to better understand how the lossy compression algorithm affects the downstream applications, more simulations with several datasets and full sequence analysis workflow should be performed, once such benchmarks become available.

## Availability

Software name: QualComp