1 Introduction

Owing to the rapid and still ongoing development of next generation sequencing (NGS) technologies, it is now possible to obtain the nucleotide sequences of billions of DNA molecules from a sequencing library in a matter of hours on a single machine [1, 2]. Here, I focus on second generation sequencing, which generates reads with lengths ranging from 50 to 250 nucleotides, but the general concepts proposed here also apply to third generation sequencing, which produces much longer reads but currently at lower throughput [3, 4].

While sequencing libraries consist of DNA, they can also be generated from RNA using reverse transcription [5]. Sequencing RNA provides numerous opportunities to study dynamic processes that occur in living cells. Arguably, the most prominent application is called RNA-seq. The sequences in an RNA-seq library correspond to RNA fragments that have been randomly sampled from all mRNAs extracted from a biological sample. After sequencing, reads are mapped to a reference sequence such as the genome, and the number of reads per gene is determined. Such read counts from an RNA-seq experiment thus approximate the individual expression levels of all genes.

The fundamental principle of RNA-seq is to obtain estimates of quantitative biological parameters based on counting specific sequences. It is, however, not the only example: NGS has been used to quantitatively measure binding of transcription factors to their target sites [6], initiation and elongation rates of RNA polymerases [7], rates of splicing [8] and RNA export from the nucleus [9], translation rates [10] and RNA decay [11], the thermodynamic ensemble of RNA structures [12] and interactions among RNAs and RNA binding proteins [13] among many other applications [14]. All these examples of quantitative NGS have in common that due to the biochemical steps performed in the wet-lab, information on particular parameters is introduced into the sequencing library. Which parameters can be measured by sequencing is virtually only limited by the creativity of the researcher [14]. The purpose of data analysis then is to extract this information from the sequencing reads by employing statistical models.

Here, I differentiate between two kinds of statistical models for quantitative NGS, namely read count models (RCMs) and what I here introduce as read feature models (RFMs). In this article, after defining the different scopes of RCMs and RFMs, I will formally introduce RFMs. Our recently developed GRAND-SLAM [15] method for the analysis of nucleotide conversion RNA-seq data fits into this statistical framework of RFMs, and I use this example to discuss advantages and potential shortcomings of a two-step approach for parameter estimation for RFMs.

2 Read Count Models

There is ample literature about RCMs [16,17,18]. RCMs are concerned with modeling the number of reads per biological entity using replicated biological samples. A frequently used model is the negative binomial distribution for which the mean and dispersion parameters can in principle be estimated independently for each gene from a large enough number of replicates. Importantly, however, only two or three replicates are common practice, which would result in highly variable dispersion estimates. For that reason, shrinkage estimators that share information across genes are widely used under the assumption that overdispersion is the same [19] or at least similar [20] for genes with similar expression level.

A simple application of RCMs is testing for differential gene expression in a pairwise comparison between two conditions using RNA-seq, e.g. treatment \(T\) vs control \(C\). This can be accomplished by a hypothesis test asking whether \(\mu_T = \mu_C\). Generalized linear models are a convenient framework for such tests and can also be used to analyze more complex experimental scenarios such as multiple conditions or multifactorial designs [20].

The data generation process for RNA-seq is highly complex and consists of biochemical reactions taking place during fragmentation of RNA, reverse transcription, adapter ligation, amplification using polymerase chain reaction and sequencing [5, 14]. The negative binomial distribution is an appealing model not only because it can handle the overdispersion that is observed for such data, but also because it can be seen to resemble these complex steps of data generation in a coarse-grained manner: If we assume the RNA level for a gene among replicated experiments to be gamma distributed, and consider the generation of reads for this gene to be a random sampling process from this RNA level (in competition with the total levels of all other genes), then a gamma-Poisson mixture distribution emerges for the read count, which is the negative binomial distribution. Of note, the overdispersion likely also includes technical variance due to library preparation in addition to biological variability.
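The gamma-Poisson argument can be checked numerically. The following minimal sketch is not taken from any of the cited tools: it integrates a Poisson likelihood over a gamma-distributed rate and compares the result with the negative binomial probability mass function. The parameterization by mean `mu` and `size` (variance `mu + mu**2 / size`), the function names and the integration settings are assumptions made for illustration.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial pmf with mean mu and variance mu + mu**2 / size."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def gamma_poisson_pmf(k, mu, size, steps=20_000, upper_factor=20.0):
    """Integrate Poisson(k; lam) * Gamma(lam; shape=size, scale=mu/size) over lam.

    Uses the midpoint rule on (0, upper_factor * mu); the upper limit is an
    assumption that should cover essentially all of the gamma mass.
    """
    scale = mu / size
    h = upper_factor * mu / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        log_gamma_density = (
            (size - 1.0) * math.log(lam) - lam / scale
            - math.lgamma(size) - size * math.log(scale)
        )
        log_poisson = k * math.log(lam) - lam - math.lgamma(k + 1)
        total += math.exp(log_gamma_density + log_poisson)
    return total * h
```

For any read count `k`, `gamma_poisson_pmf(k, mu, size)` and `nb_pmf(k, mu, size)` should agree up to the integration error, e.g. for the hypothetical values `mu = 50` and `size = 5`.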

RCMs can also be used for applications other than RNA-seq, e.g. to compare transcription factor occupancy on binding sites using ChIP-seq [21] or the strength of translation using Ribo-seq [22]. There are also scenarios where the assignment of reads to biological entities is not unique. For instance, genes of higher eukaryotes have different transcript isoforms, which often share large parts of their sequences. For reads corresponding to such sequences it is a priori not clear from which transcript isoform they originate. Isoform level quantification can be performed by treating the assignment of reads to isoforms as a latent variable and using the EM algorithm or variational Bayes for inference [23, 24].

In summary, RCMs model read counts that belong to biological entities and are concerned with differences between biological conditions. However, many NGS applications generate patterns in the data that can be used to make more fine-grained inferences for each individual sample. This is where read feature models (RFMs) come into play, which are concerned with recognizing and exploiting these patterns.

3 Read Feature Models

NGS data derived from a single biological sample consists of short reads \(R\). Each read \(r \in R\) belongs to a biological entity, and we denote all reads belonging to the biological entity \(i\) as \(R_i\). Usually, only specific features \(s\left( {r_j } \right)\) of a read \(r_j\) are relevant and provide the sufficient statistics \(D_i = \left\{ {s\left( {r_j } \right){|}r_j \in R_i } \right\}\) for parameter estimation for a biological entity \(i\). An RFM consists of a parametric family \({\rm{\mathcal{F}}}\) and a parameter vector \(\theta = \left( {\theta_G ,\phi_1 , \ldots ,\phi_N } \right)\) involving global parameters \(\theta_G\) and parameters \(\phi_i\) for the \(N\) individual biological entities. Each \(d_j = s\left( {r_j } \right) \in D_i\) is modeled by a probability distribution from the parametric family \({\rm{\mathcal{F}}}\) with parameters \(\theta_G\) and \(\phi_i\), i.e. \(d_j \sim \ {\rm{\mathcal{F}}}\left( {\theta_G ,\phi_i } \right)\). Thus, each read, or at least its features relevant for parameter estimation, emerges from a probability distribution that depends on a set of global parameters and on a parameter specific to its biological entity (e.g. a gene), but is independent of the specific parameters of all other entities. Often, \(\phi_i\) is one-dimensional and represents an activity or abundance of some sort for biological entity \(i\), and is usually the biological parameter of interest. The global parameters \(\theta_G\) by contrast often represent the stochastic behavior of the biochemical procedures that are used to generate the sequencing library. Thus, like RCMs, RFMs do not only try to fit observed data, but can be considered to model the actual data generation process in a coarse-grained manner.

There are, however, many fundamental differences between RCMs and RFMs. RCMs are used to compare quantities such as RNA levels (RNA-seq) or occupancies (ChIP-seq) across replicates and conditions. Thus, RCMs are concerned with the number of reads for a biological entity across replicated experiments. By contrast, the purpose of RFMs is to extract qualitative or quantitative information introduced into the sequencing library by the biochemical steps taken to generate the library. RFMs therefore focus on a single biological sample and model the features \(D_i\) of all the reads mapped to a single biological entity \(i\) instead of their number. The function \(s\) might extract features such as the read length, its positioning within the entity or patterns of mismatched nucleotides.

An example of an RFM is implemented in our PRICE method [25]: PRICE aims to find stretches on RNAs called open reading frames (ORFs) that are translated by ribosomes into proteins based on data generated by a technique called Ribo-seq [10]. Due to the way RNA is prepared for sequencing, Ribo-seq reads corresponding to actively translating ribosomes have specific lengths and have a periodic pattern with regard to their positions along such stretches. PRICE learns the global parameters of an RFM using the ORFs of known proteins and can then be used to predict so far unknown translated ORFs. PRICE has been used by us and others to identify thousands of short ORFs in the human genome [26] and dozens to hundreds in clinically relevant viruses such as SARS-CoV-2 [27] and human cytomegalovirus [25, 28]. Moreover, PRICE enabled us to show that peptides originating from such short ORFs are presented via the major histocompatibility complex I (MHC-I) [29], defining a new class of antigens that might play a hitherto unknown role in the T cell mediated defense against infection and cancer [26].

A second use case for RFMs is PAR-CLIP, which is a quantitative NGS technique for the discovery of the binding sites on mRNAs of an important class of short regulatory RNAs called microRNAs [13]. A microRNA binding site can be as short as six consecutive nucleotides on an mRNA. PAR-CLIP generates clusters of reads at such binding sites with a specific pattern of start and end positions, and additionally induces specific mismatches close to the microRNA binding site. Our PARma method [30] utilizes an RFM to learn these patterns of positions and mismatches to precisely define the binding site within a cluster and, by sequence complementarity, also the microRNA that binds there. An analysis of several data sets including several PAR-CLIP experiments revealed that microRNA binding to an mRNA generally is a context-dependent phenomenon, adding an additional layer of complexity to the gene regulatory network [31].

These examples demonstrate that parameter estimation for RFMs can be done in a two-step process: First, the global parameters \(\theta_G\) are estimated using the pooled data across many or all biological entities. \(\theta_G\) is then considered constant for the estimation of gene-wise parameters in the second step. These examples also show that for building RFMs, a detailed understanding of the data generation process for a particular type of experiment is necessary.

4 An RFM for Nucleotide Conversion RNA-seq

Being able to quantify RNA that was synthesized during a defined period in addition to total RNA levels has many advantages over standard RNA-seq. For instance, it allows estimating parameters that describe the kinetics of gene expression (synthesis rates, degradation rates) [11, 32], and it reveals short-term regulatory changes of gene expression in much greater detail than normal RNA-seq [33]. The most widely used methods for quantifying newly synthesized RNA are based on metabolic RNA labeling.

Metabolic RNA labeling utilizes nucleoside analogs such as 4-thiouridine (4sU) that are supplied to a cell culture for a defined period (e.g. 2h). Cells take up the 4sU and incorporate it into newly synthesized RNA instead of normal uridine (U). After e.g. 2h, RNA is extracted and treated with compounds that result in 4sU being sequenced as cytosine (C) [11]. The reads are then mapped to the genome sequence, where the U found on RNA corresponds to thymine (T). Thus, the incorporation of 4sU and its conversion in the RNA gives rise to a T-to-C mismatch in the mapped reads. Such T-to-C mismatches therefore provide evidence for the read originating from an RNA molecule that was transcribed during the last 2h.

The parameter of interest is the gene-wise new-to-total RNA ratio (NTR). The NTR is the starting point to derive other, biologically relevant parameters. For instance, there is a 1-to-1 correspondence between the NTR and the kinetic rate of RNA degradation [15]. Estimating the NTR is non-trivial for two reasons: First, library preparation and sequencing can also introduce mismatches, including T-to-C. Thus, a mismatch in a read can either be due to such an error, or because of the conversion of an incorporated 4sU. Second, and more importantly, only a small and typically unknown percentage of U are substituted by 4sU during transcription. Consequently, many reads that indeed originate from a newly synthesized RNA might not cover any site of 4sU incorporation by chance. We estimated that for published data [11], more than 75% of all reads originating from a newly synthesized RNA do not exhibit any T-to-C mismatch [34]. Thus, the fraction of reads having T-to-C mismatches among all reads belonging to a gene is a biased estimator of the NTR: Due to sequencing errors, it might overestimate the NTR, and due to reads not covering 4sU sites by chance, it might also underestimate the NTR. We previously proposed our GRAND-SLAM approach to estimate NTRs in an unbiased manner [15].

To define the model behind GRAND-SLAM in the framework of RFMs, we denote the probabilities of a T-to-C mismatch for reads originating from a newly synthesized or a pre-existing RNA molecule by \(p_{new}\) and \(p_{old}\), respectively. Thus, \(p_{old}\) corresponds to the probability of a sequencing error or any other base substitution that can happen during library preparation. By contrast, \(p_{new} = p_{old} + p_{4sU}\), i.e. \(p_{new}\) includes the probability of errors and of the incorporation and conversion of a 4sU. Both \(p_{new}\) and \(p_{old}\) are global parameters and are the same for all genes. By contrast, the parameters \(\nu_1 , \ldots , \nu_N\) represent the gene specific NTRs for all genes.

The features extracted for a read are \(s\left( r \right) = k_r\), with \(k_r\) being the number of T-to-C mismatches observed for read \(r\). The parametric family of the RFM is a two-component binomial mixture model \(BinomMix\left( {p_{old} ,p_{new} ,\nu ,n} \right)\) with probability mass function

$$\begin{aligned} P\left( {k_r ;n,p_{old} ,p_{new} ,\nu } \right) & = \left( {1 - \nu } \right) \cdot Binom\left( {k_r ;n,p_{old} } \right) + \nu \cdot Binom\left( {k_r ;n,p_{new} } \right) \\ Binom\left( {k;n,p} \right) & = \left( {\begin{array}{*{20}c} n \\ k \\ \end{array} } \right)p^k \left( {1 - p} \right)^{n - k} \\ \end{aligned}$$

Thus, the global parameters of the RFM are \(\theta_G = \left( {p_{old} ,p_{new} } \right)\), the gene-wise parameters are the NTRs \(\left( {\nu_1 , \ldots ,\nu_N } \right)\), and the sufficient statistic is \(k_r \sim \ BinomMix\left( {p_{old} ,p_{new} ,\nu_i ,n_r } \right)\) which is distributed according to the parametric family defining the RFM. Note that \(n_r\) here is the number of T covered by the read \(r\) in the genome, i.e. the maximal number of possible T-to-C mismatches, which can be considered a constant.
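The probability mass function above translates directly into code. The sketch below is an illustration and not the GRAND-SLAM implementation; the function names are hypothetical. It also evaluates the expected fraction of reads with at least one T-to-C mismatch, i.e. the naive NTR estimate, to show how it deviates from the true \(\nu\).

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k mismatches among n covered T positions."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def binom_mix_pmf(k, n, p_old, p_new, ntr):
    """Two-component binomial mixture: pre-existing RNA (p_old) and new RNA (p_new)."""
    return (1.0 - ntr) * binom_pmf(k, n, p_old) + ntr * binom_pmf(k, n, p_new)


def expected_mismatch_read_fraction(n, p_old, p_new, ntr):
    """Expected fraction of reads with at least one T-to-C mismatch (naive NTR estimate)."""
    return 1.0 - binom_mix_pmf(0, n, p_old, p_new, ntr)
```

With the values used in the simulation study of this article (\(p_{old} = 4 \cdot 10^{-4}\), \(p_{new} = 0.02\)) and an assumed \(n = 25\), `binom_mix_pmf(0, 25, 4e-4, 0.02, 1.0)` is roughly 0.6, i.e. most reads from new RNA carry no mismatch, whereas `expected_mismatch_read_fraction(25, 4e-4, 0.02, 0.0)` is greater than zero although no RNA is new.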

5 RFM Parameter Estimation Using a Two-Step Approach

Computing maximum likelihood estimators (MLE) or the Bayesian posterior distribution for the high dimensional parameter \(\theta\) of an RFM is conceptually straightforward and could be done by numerical optimization to obtain the MLE or by Markov chain Monte Carlo (MCMC) sampling to approximate the posterior. Of note, \(N\) can be quite large, making numerical optimization or MCMC computationally challenging. However, the special structure of RFMs suggests a two-step parameter estimation procedure that is computationally much more efficient: First, by pooling data from all biological entities, the global parameters \(\theta_G\) are estimated and then considered constants. With that, the high-dimensional estimation problem decomposes into \(N\) independent low-dimensional problems.

For the nucleotide conversion RNA-seq RFM, \(p_{old}\) can be estimated from control samples that were not labeled with 4sU. Such control samples are usually included in experiments to test for 4sU induced effects on the biology of the cells. Since there is no 4sU, the mixture model reduces to a binomial distribution, making estimation of \(p_{old}\) straightforward [15]. \(p_{new}\) can be estimated by introducing the nuisance parameter \(\nu\), which is the global NTR, i.e. the fraction of labeled RNA across all genes. This two-dimensional estimation problem can efficiently be solved using numerical optimization [15]. Once point estimates for the parameters \(p_{old}\) and \(p_{new}\) are available, they are treated as constant and only the gene specific NTR \(\nu_i\) must be estimated for each gene. In GRAND-SLAM, the full posterior distribution of each \(\nu_i\) is computed by numerical integration.
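The two steps can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the GRAND-SLAM code: every read covers the same number `n` of T positions, \(p_{old}\) is taken as known, an EM iteration stands in for the numerical optimization of step 1, and the posterior of step 2 is evaluated on a grid with a uniform prior. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def binom_pmf(k, n, p):
    """Binomial probability of k mismatches among n T positions."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def simulate_counts(num_reads, n, p_old, p_new, ntr, seed=0):
    """Simulate T-to-C mismatch counts; every read covers n T positions."""
    rng = random.Random(seed)
    counts = []
    for _ in range(num_reads):
        p = p_new if rng.random() < ntr else p_old
        counts.append(sum(rng.random() < p for _ in range(n)))
    return counts


def estimate_global(counts, n, p_old, iterations=2000):
    """Step 1: estimate p_new and the global NTR from pooled reads (EM, p_old fixed)."""
    hist = Counter(counts)
    total = len(counts)
    p_new, ntr = 0.05, 0.5  # arbitrary starting values
    for _ in range(iterations):
        new_reads = 0.0       # expected number of reads from new RNA
        new_mismatches = 0.0  # expected number of mismatches on those reads
        for k, c in hist.items():
            w_new = ntr * binom_pmf(k, n, p_new)
            w_old = (1.0 - ntr) * binom_pmf(k, n, p_old)
            resp = w_new / (w_new + w_old)
            new_reads += c * resp
            new_mismatches += c * resp * k
        ntr = new_reads / total
        p_new = new_mismatches / (new_reads * n)
    return p_new, ntr


def ntr_posterior(counts, n, p_old, p_new, grid_size=1001):
    """Step 2: grid posterior of one gene's NTR (uniform prior), global parameters fixed."""
    hist = Counter(counts)
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    log_post = [
        sum(
            c * math.log((1.0 - nu) * binom_pmf(k, n, p_old) + nu * binom_pmf(k, n, p_new))
            for k, c in hist.items()
        )
        for nu in grid
    ]
    peak = max(log_post)  # subtract the maximum to avoid underflow
    weights = [math.exp(lp - peak) for lp in log_post]
    norm = sum(weights)
    return grid, [w / norm for w in weights]
```

A typical call sequence would be `p_new_hat, _ = estimate_global(pooled_counts, n, p_old)` on the reads of all genes, followed by `ntr_posterior(gene_counts, n, p_old, p_new_hat)` for each gene. Because step 2 is a one-dimensional problem per gene, the posterior can be normalized by simple summation over the grid.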

6 The Two-Step Approach Introduces Negligible Bias

Using point estimates for the global parameters \(\theta_G\) and considering them as constants for the second step comes with the danger of introducing bias into the estimator of the gene specific parameters. For instance, for the GRAND-SLAM RFM, if \(p_{new}\) is overestimated, the \(\nu_i\) are expected to be underestimated: Consider a gene with a true NTR of 1, i.e. all reads indeed originate from a labeled RNA molecule. The expected overall percentage of T-to-C mismatches for this gene therefore is equal to the true \(p_{new}\). If \(p_{new}\) is overestimated, i.e. \(\hat{p}_{new} > p_{new}\), the percentage of T-to-C mismatches required to achieve \(\nu_i = 1\) is \(\hat{p}_{new}\), which is greater than \(p_{new}\). Thus, an overestimated \(p_{new}\) biases the \(\nu_i\) towards 0. To investigate the magnitude of such bias empirically, I conducted simulation experiments.
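Before turning to the simulations, the direction and rough size of this bias can be illustrated with a method-of-moments approximation. This is a simplification for intuition only and not the posterior-based estimator used by GRAND-SLAM; the symbols \(m_i\), \(\tilde{\nu}_i\) and \(\varepsilon\) are introduced here for illustration. Let \(m_i\) be the expected T-to-C mismatch rate per covered T for gene \(i\), and let \(\hat{p}_{new} = \left( 1 + \varepsilon \right) p_{new}\) be the overestimated global parameter:

$$\begin{aligned} m_i & = \left( 1 - \nu_i \right) p_{old} + \nu_i \, p_{new} \\ \tilde{\nu}_i & = \frac{m_i - p_{old}}{\hat{p}_{new} - p_{old}} = \nu_i \cdot \frac{p_{new} - p_{old}}{\left( 1 + \varepsilon \right) p_{new} - p_{old}} \\ \end{aligned}$$

Under this approximation, every \(\nu_i\) is shrunk towards 0 by the same factor. For \(p_{old} = 4 \cdot 10^{-4}\) and \(p_{new} = 0.02\), the factor is approximately 0.995 for \(\varepsilon = 0.005\) and approximately 0.951 for \(\varepsilon = 0.05\).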

Data were simulated from a \(BinomMix\) model with a total of \(N = 2.5 \cdot 10^7\) reads, \(p_{old} = 4 \cdot 10^{ - 4}\), \(p_{new} = 0.02\) and \(\nu = 0.15\), all reflecting realistic values for the total number of reads for a single sample, sequencing errors, 4sU incorporation and typical RNA turnover in mammalian cells for 1h of labeling [32]. For each read the number of T positions \(n\) was drawn from a distribution reflecting a read length of 100. To mimic the estimation of \(p_{old}\) by an additional, 4sU naïve sample, it was treated as a constant. The joint posterior distributions indeed show anticorrelation of \(p_{new}\) and \(\nu\) (an example is shown in Fig. 1A), demonstrating that \(\nu\) is biased towards 0 if \(p_{new}\) is overestimated. The 95% credible interval (CI) computed from the marginal posterior of \(p_{new}\) for the example in Fig. 1A was approximately [0.01996, 0.02006], i.e. the relative uncertainty, defined as the size of the 95% CI divided by the true value 0.02, was in the range of 0.5%.

Fig. 1

A The joint posterior density of \(p_{new}\) and \(\nu\) for data simulated with \(p_{new} = 0.02\) and \(\nu = 0.15\) is shown. The true values are marked by dashed lines, and the 95% credible interval (CI) of the marginal posterior for \(p_{new}\) is indicated at the bottom. B The relative uncertainty, defined as the size of the 95% CI divided by the true value (\(p_{new} = 0.02\)), is shown for multiple simulations with total read counts \(N\) ranging from 300,000 to 50 million reads. Relative uncertainty cutoffs of 0.5% and 5% are indicated. C, D Half-lives simulated for individual genes (n = 1000) are scattered against their estimated half-lives with a \(p_{new}\) that is overestimated by 0.5% (C) or 5% (D). The main diagonals (dashed lines) representing no bias are indicated. For (D), a second dashed line above the main diagonal represents a log2 fold change of 0.1 between simulated and estimated half-lives

The accuracy of the point estimate for \(p_{new}\) mostly depends on the total number of reads \(N\). Thus, additional experiments with \(N\) ranging from 300,000 to 50 million reads were simulated, and the 95% CI of \(p_{new}\) and the relative uncertainty as defined above were computed (Fig. 1B). Relative uncertainties dropped steeply with increasing \(N\) and were below 1% with 6.3 million reads. More reads improved the uncertainty only marginally. Thus, based on these empirical analyses, \(p_{new}\) is estimated with high relative accuracy for standard experimental settings with > 20 million reads per sample.

To evaluate the effects of these uncertainties on the subsequent estimation of gene-wise RNA half-lives, which are biologically relevant parameters and have a 1-to-1 correspondence to \(\nu_i\) [15], the read simulator built into our grandR package [32] was used to generate data for individual genes with \(p_{new} = 0.02\). Then, the \(\nu_i\) were estimated based on an overestimated \(p_{new}\). To reduce the effect of variance in the estimates of \(\nu_i\), 10,000 reads were simulated for each gene. With a relative overestimation of 0.5%, no bias in the RNA half-life estimates was discernible, i.e. the effect of an overestimated \(p_{new}\) was much smaller than the variance in the \(\nu_i\) estimates even for genes with 10,000 reads (Fig. 1C). With a relative overestimation of 5%, however, especially short half-lives were clearly overestimated (Fig. 1D). The magnitude of the overestimation, however, was low, with most genes having a log2 fold change of estimated vs true RNA half-life below 0.1.

In summary, this simulation study indicates that inaccurate estimation of \(p_{new}\) in the first step has little to no effect on the estimates of \(\nu_i\) in the second step for realistic data sets.

7 Discussion

There are many applications of NGS that result in specific patterns of sequencing reads mapped to the biological entities of interest. RFMs focus on modeling these patterns to extract biologically meaningful information from sequencing data. GRAND-SLAM is an RFM for nucleotide conversion RNA-seq to estimate the gene-wise NTR, which provides information about the dynamics of gene expression. When gene expression is at steady-state, the NTR can be transformed into the RNA half-life [15]. With few reads for a gene, the NTR cannot be estimated accurately, and if the NTR is close to 0 or 1, the transformation into the RNA half-life inflates even slight inaccuracies substantially [15, 35]. It is therefore important to quantify the uncertainty in these parameters, e.g. using Bayesian posteriors. Even when gene expression is not at steady-state, RNA half-lives can be estimated if gene expression from an additional prior timepoint is known [32], which introduces another source of uncertainty in the estimation.
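The sensitivity of the half-life to the NTR near 0 and 1 can be made concrete with a small sketch. The formula below is an assumption for illustration: it follows from first-order RNA decay at steady state, where the fraction of pre-existing RNA after labeling time \(t\) is \(1 - \mathrm{NTR} = e^{-\delta t}\) with decay rate \(\delta\). The text itself only states that a 1-to-1 correspondence exists [15], and the function name is hypothetical.

```python
import math


def ntr_to_half_life(ntr, labeling_time):
    """Half-life (in units of labeling_time), assuming steady state and first-order decay."""
    if not 0.0 < ntr < 1.0:
        raise ValueError("ntr must lie strictly between 0 and 1")
    decay_rate = -math.log(1.0 - ntr) / labeling_time  # from 1 - NTR = exp(-decay_rate * t)
    return math.log(2.0) / decay_rate
```

Under this assumption, changing the NTR from 0.01 to 0.02 roughly halves the half-life, whereas changing it from 0.50 to 0.51 alters the half-life by only a few percent, which illustrates why uncertainty in the NTR matters most at the extremes.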

The special structure of RFMs greatly facilitates the estimation of posteriors, since in the two-step approach, the NTR is estimated per gene by solving a univariate parameter estimation problem in the second step. This enables GRAND-SLAM to efficiently compute exact posteriors without MCMC sampling. Here, I investigated whether inaccurate estimation of global parameters introduces bias into the estimation of the gene-wise parameters in this two-step process. The empirical analyses presented here indicate that for realistic data sets inaccuracies of the global parameters only have negligible effects on the estimates of the gene-wise parameters for nucleotide conversion RNA-seq.

In the definition of RFMs I explicitly made the assumption that each sequencing read is uniquely assigned to a single biological entity. A similar assumption has been made for the fundamental RCMs modeling RNA abundance [20]. There are scenarios where this is not the case: Typically, short RNA-seq reads map to a single gene but occur in multiple isoforms of this gene in higher eukaryotes. Thus, for estimating RNA abundances (using RCMs) for all individual transcript isoforms, methods have been developed that treat the assignment of reads to isoforms as a latent variable [23, 24]. The same approaches can also be implemented for RFMs, i.e. the latent variable can be integrated into the model, making the estimation slightly more complicated. Alternatively, the estimate of the latent variable could be used as a probabilistic but fixed assignment of reads to isoforms. Computing this probabilistic assignment as an additional prior step would make any RFM directly applicable to cases with non-unique reads. However, in contrast to integrating the latent variable into the RFM, this procedure would not make full use of the patterns that are modeled by the RFM for the probabilistic read assignment. It is an interesting future direction to integrate latent variables into specific RFMs and evaluate whether treating the assignment as a separate first step provides sufficiently accurate results.

While the simulation approach proposed here demonstrates minimal bias introduced by the two-step approach for the GRAND-SLAM model, it is important to note that this methodology might not be as robust for other RFMs. The two-step approach should not be employed if the global parameters estimated with it differ significantly from those obtained via a joint estimation. The susceptibility to this discrepancy is inherently associated with the specific RFM in use. Therefore, it is important to conduct a rigorous evaluation for specific RFMs to ascertain the reliability of the two-step approach. The simulation methodology presented here offers an empirical framework to probe such potential challenges.