A detailed description of our simulation approach can be found in the “Methods” section; a brief description is given here for convenience. Our approach for the first scenario, in which we simulate one isoform being expressed per gene per cell, is to first identify genes for which the expression of exactly four isoforms is detected in a real scRNA-seq dataset. In the second step, for the first of our genes in the first cell in our simulated dataset, we randomly select one isoform based on a plausible model of isoform choice. For our default model of isoform choice, we choose the isoform based on a model of alternative splicing described by Hu et al. [18]. Third, we simulate dropouts based on a Michaelis-Menten model described by Andrews and Hemberg [9]. Fourth, we simulate quantification errors using isoform detection error estimates from work by Westoby et al. [8]. We repeat these four steps for every four-isoform gene and cell in our simulated dataset, then calculate the mean number of isoforms detected for that gene per cell. The entire process described above is one complete simulation. We run 100 simulations for each of our four scenarios, where each scenario corresponds to one, two, three or four isoforms being expressed per gene per cell. We can then plot the distributions of the mean number of isoforms detected per gene per cell for each scenario. A schematic of our simulation approach is displayed in Fig. 1. Negative control models, in which our simulations are repeated but with no dropouts and/or no quantification errors simulated, can be found in Additional file 1: Figs S1–3.
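The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names, the isoform choice weights and the per-isoform dropout probabilities are hypothetical placeholders, and the choice weights are assumed to be strictly positive.

```python
import random

def simulate_cell(choice_weights, p_dropout, n_expressed, p_fp, p_fn, rng):
    """Simulate one gene in one cell; return (expressed, detected) isoform sets."""
    n_isoforms = len(choice_weights)
    pool = list(range(n_isoforms))
    weights = list(choice_weights)  # assumed strictly positive
    expressed = set()
    # Isoform choice: draw n_expressed distinct isoforms, weighted by the model.
    for _ in range(n_expressed):
        idx = rng.choices(range(len(pool)), weights=weights)[0]
        expressed.add(pool.pop(idx))
        weights.pop(idx)
    # Dropouts: an expressed isoform yields reads with probability 1 - p_dropout.
    has_reads = {i for i in expressed if rng.random() >= p_dropout[i]}
    # Quantification errors: false negatives act on isoforms with reads,
    # false positives act on isoforms without reads.
    detected = set()
    for i in range(n_isoforms):
        if i in has_reads:
            if rng.random() >= p_fn:
                detected.add(i)
        elif rng.random() < p_fp:
            detected.add(i)
    return expressed, detected

def mean_detected(choice_weights, p_dropout, n_expressed, p_fp, p_fn,
                  n_cells, seed=0):
    """Mean number of isoforms detected for one gene across n_cells cells."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_cells):
        _, detected = simulate_cell(choice_weights, p_dropout, n_expressed,
                                    p_fp, p_fn, rng)
        total += len(detected)
    return total / n_cells

# Hypothetical four-isoform gene with one isoform expressed per cell.
example = mean_detected([0.6, 0.2, 0.1, 0.1], [0.3, 0.5, 0.7, 0.9],
                        n_expressed=1, p_fp=0.01, p_fn=0.04, n_cells=1000)
```

Repeating `mean_detected` over every four-isoform gene gives one simulated distribution; changing `n_expressed` from 1 to 4 corresponds to the four scenarios.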
In Fig. 2, we apply our simulation approach to a dataset of H1 and H9 human embryonic stem cells (hESCs) [19, 20]. In this dataset, each cell’s cDNA was split into two groups and sequenced at two different sequencing depths, enabling us to directly compare our simulation results at different sequencing depths without biological confounders. One group was sequenced at approximately 1 million reads per cell and the other group at approximately 4 million reads per cell on average. Our simulation results for the two H1 groups are compared side by side in Fig. 2a. scRNA-seq experiments have been found to saturate in terms of the number of genes detected per cell at approximately 1 million reads per cell [21, 22]. However, we observe differences in the number of isoforms detected per gene per cell at 1 and 4 million reads per cell, indicating that the saturation depth may differ for gene- and isoform-level analyses. Next, we calculate the fraction of overlap between the isoforms expressed in the ground truth and the isoforms detected as expressed in our simulations. In Fig. 2b, we show the distributions of the mean fraction of overlap for each gene. We will refer to each gene’s mean fraction of overlap between isoforms expressed in the ground truth and isoforms detected as expressed as the ‘overlap fraction’ hereafter in the text. The mean overlap fraction is consistently higher at 4 million reads per cell compared to at 1 million reads per cell, indicating that our ability to accurately detect isoforms is improved at higher sequencing depths. Similar results were observed for the H9 hESC dataset in Additional file 1: Fig S4.
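As an illustration of the overlap fraction, the sketch below assumes it is the share of ground-truth isoforms that are also detected, which is consistent with reading an overlap fraction of 0.8 as failing to detect 20% of the expressed isoforms. The exact definition is given in the “Methods” section; the function names here are hypothetical.

```python
def overlap_fraction(expressed, detected):
    """Fraction of ground-truth isoforms that were also detected (assumed definition)."""
    expressed, detected = set(expressed), set(detected)
    if not expressed:
        raise ValueError("at least one isoform must be expressed in the ground truth")
    return len(expressed & detected) / len(expressed)

def mean_overlap_fraction(pairs):
    """Mean overlap fraction for one gene over (expressed, detected) pairs, one per cell."""
    fractions = [overlap_fraction(e, d) for e, d in pairs]
    return sum(fractions) / len(fractions)

# Two expressed isoforms, one of which is detected alongside a false positive.
example = overlap_fraction({0, 1}, {1, 2})
```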
Figure 2a and b illustrate some of the difficulties associated with splicing analysis in scRNA-seq. At both sequencing depths, the distributions of the observed mean number of isoforms per gene per cell are shifted to the left of their true value. In addition, the highest mean overlap fraction observed is less than 0.8, indicating that even in a best-case scenario, we fail to detect over 20% of the isoforms expressed in the ground truth. These effects are less extreme, but still present, for the group sequenced at approximately 4 million reads per cell compared to the group sequenced at 1 million reads per cell. This is consistent with the hypothesis that sequencing at higher depth reduces the extent to which isoform number is underestimated. However, even at approximately 4 million reads per cell, our simulations suggest that scRNA-seq substantially underestimates the mean number of isoforms per gene per cell for almost all genes. A naive analysis of these two datasets would most likely underestimate the number of isoforms expressed per gene per cell. This casts doubt on the biological relevance of previous observations suggesting only one isoform was typically produced per gene per cell, although admittedly the sequencing depth per cell was generally much greater than 4 million reads per cell in those studies (for example, Shalek et al. sequenced approximately 27 million reads per cell [14]).
One hypothesis for why our ability to detect isoforms increases with increased sequencing depth is that the rate of dropouts is reduced. In Fig. 3a, we investigate this hypothesis by plotting the distribution of the probabilities of dropout for each isoform (p(dropout)), as estimated using the Michaelis-Menten equation [9] (see the “Methods” section). We find that the distribution is skewed towards high probabilities of dropout for the group sequenced at around 1 million reads per cell. In contrast, the distribution for the group sequenced at around 4 million reads per cell is more skewed towards low probabilities of dropouts. This demonstrates that our estimated dropout probabilities are different at the two sequencing depths, as expected.
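For reference, the Michaelis-Menten relationship of Andrews and Hemberg [9] models the dropout probability as p(dropout) = 1 − S / (K + S), where S is the mean expression and K is the Michaelis constant. A minimal sketch follows; the function name and the example values of S and K are illustrative only, and how K is estimated is described in the “Methods” section.

```python
def michaelis_menten_dropout(mean_expression, k):
    """p(dropout) = 1 - S / (K + S) for mean expression S and Michaelis constant K."""
    if mean_expression < 0 or k <= 0:
        raise ValueError("mean_expression must be >= 0 and k must be > 0")
    return 1.0 - mean_expression / (k + mean_expression)

# Illustrative values: higher expression gives a lower dropout probability.
low_expression = michaelis_menten_dropout(mean_expression=1.0, k=10.0)
high_expression = michaelis_menten_dropout(mean_expression=100.0, k=10.0)
```

When S equals K the dropout probability is exactly 0.5, so K can be read as the expression level at which an isoform is dropped in half of the cells.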
Overall, the data in Figs. 2 and 3a support the hypothesis that when the rate of technical dropouts decreases, the accuracy of isoform number estimation increases. However, as our dataset was only sequenced at two depths, we only have two data points available to investigate our hypothesis. To extend our investigation, we assume that the distributions of dropout probabilities observed in Fig. 3a can be modelled as beta distributions. The beta distribution is parameterised by two values, α and β, and we find that it approximates our probability distributions well (see bottom panels of Fig. 3a). Therefore, we select five values of α and β that generate differently shaped dropout distributions, as shown in Fig. 3b. We then perform five further simulation experiments. In each simulation experiment, we sample our dropout probabilities from one of our beta distributions. The results of these experiments are shown in Fig. 3c and d.
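Sampling dropout probabilities from a beta distribution can be sketched with the Python standard library. The (α, β) pairs below are hypothetical and are not the values used in Fig. 3b; they only illustrate that, with β fixed, a smaller α gives a lower mean dropout probability, since the mean of a beta distribution is α / (α + β).

```python
import random

def sample_dropout_probabilities(alpha, beta, n, seed=0):
    """Draw n dropout probabilities from a Beta(alpha, beta) distribution."""
    rng = random.Random(seed)
    return [rng.betavariate(alpha, beta) for _ in range(n)]

# Hypothetical parameter pairs, ordered from high to low mean dropout probability.
hypothetical_shapes = [(5.0, 1.0), (2.0, 1.0), (1.0, 1.0), (0.5, 1.0), (0.2, 1.0)]
mean_dropout = {}
for alpha, beta in hypothetical_shapes:
    draws = sample_dropout_probabilities(alpha, beta, n=10000)
    mean_dropout[(alpha, beta)] = sum(draws) / len(draws)
```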
In Fig. 3c, we show the mean detected number of isoforms per gene per cell for the scenario where each gene produces one isoform per gene per cell. As we move from the top to the bottom of Fig. 3c, the value of α decreases, corresponding to scenarios where the probability of dropout is more frequently close to zero. As α decreases, the distributions of mean detected isoforms per gene per cell shift further to the right and closer to the true number of isoforms produced per cell. In Fig. 3d, we find that the mean overlap fraction increases as α decreases, corresponding to the mean probability of dropout decreasing. We conclude from Fig. 3c and d that reducing the dropout rate would likely improve the accuracy of splicing analyses performed using scRNA-seq. Similar results were observed for the H9 hESCs in Additional file 1: Fig. S5, lending further support to this conclusion.
Quantification errors are a relatively minor obstacle to studying alternative splicing
A benchmark of isoform quantification software tools in full-length coverage mouse scRNA-seq datasets found that the error rate of many tools was low and comparable to bulk RNA-seq [8]. This is encouraging; however, it should be noted that the error rate is likely to be substantially higher for non-model organisms with less well-annotated genomes than the mouse genome. As isoform quantification is a key step of many scRNA-seq alternative splicing analysis pipelines, it would be beneficial to understand how quantification errors impact our ability to study alternative splicing, both when the error rate is high and when the error rate is low.
As our interest in this study is the detected number of isoforms per gene per cell, we are only interested in quantification errors which lead to changes in the number of isoforms detected. We simulate two types of quantification errors, false positives and false negatives. In this context, a false positive occurs when an isoform is called as expressed by the quantification software when there are no reads from that isoform. Note that this means that if an isoform is expressed in a cell but no reads are captured from it (i.e. a dropout), but the quantification software calls it as expressed, we would define this as a false-positive event. A false negative occurs when an isoform is not called as expressed by the isoform quantification software when reads from that isoform are present. Based on our previous benchmark [8], we estimate that the probability of false-positive events (pFP) is around 1% and that the probability of false-negative events (pFN) is around 4% (see the “Methods” section). In our simulations in Fig. 4, we vary both of these probabilities in the range of 0 to 50%. Figure 4a shows how the distributions of the mean number of isoforms detected per gene per cell change as the probabilities of false positives and false negatives are varied when every gene expresses one isoform per cell. Importantly, even when the probabilities of false positives and false negatives are zero, there are many genes for which the mean number of detected isoforms per gene per cell is not equal to one, the true number of expressed isoforms. This indicates that even if a perfect, 100% accurate isoform quantification tool existed, there would still be substantial barriers to studying alternative splicing using scRNA-seq. We suspect that the reason a 100% accurate isoform quantification tool would underestimate the number of isoforms per gene per cell is that isoform quantification tools usually only quantify the reads that are present. Due to the high number of dropouts in scRNA-seq, many expressed isoforms do not generate reads and thus would be called as unexpressed by a 100% accurate isoform quantification tool, leading to an underestimate of the number of isoforms present.
Unsurprisingly, increasing the probability of false positives causes an increase in the mean number of detected isoforms, whilst increasing the probability of false negatives causes the mean number of detected isoforms to decrease, as shown in Fig. 4b. Somewhat counterintuitively, increasing the probability of false positives from 0.0 to 0.1 could be considered to ‘improve’ the accuracy of the mean number of isoforms detected by shifting the distribution to slightly higher values and away from zero. This is probably because slightly increasing the probability of false positives allows some isoforms lost to dropout to be called as expressed. In Additional file 1: Fig. S6, we investigate how the overlap fraction is affected by changes in the probability of false positives and negatives. We find that the overlap fraction increases as the probability of false positives increases, supporting the hypothesis that some dropout events are ‘rescued’ by false positive events. However, we note that in addition to ‘rescuing’ some dropouts, many unexpressed isoforms are also called as expressed, as indicated by the mean numbers of detected isoforms per gene per cell that are greater than one. Interestingly, when the probabilities of false positives and false negatives are increased equally (the diagonal of Fig. 4a), the mean number of detected isoforms increases, suggesting that the increased rate of false positives dominates over the increased rate of false negatives. This is likely because more isoforms are unexpressed than are expressed, and thus, there are more opportunities for false positive events than for false negative events. Overall, we find that high probabilities of false positives and false negatives decrease our ability to accurately detect expressed isoforms in scRNA-seq.
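The behaviour along the diagonal can be illustrated with a back-of-the-envelope expectation. The sketch below is not part of the simulations in this study: it assumes a single dropout probability shared by all isoforms of a gene and applies the false positive and false negative definitions given above. The function name is hypothetical.

```python
def expected_detected(n_isoforms, n_expressed, p_dropout, p_fp, p_fn):
    """Expected number of isoforms detected for one gene in one cell.

    Assumes one shared dropout probability. False negatives act on isoforms
    with reads; false positives act on isoforms without reads, whether
    unexpressed or lost to dropout.
    """
    with_reads = n_expressed * (1.0 - p_dropout)
    without_reads = n_isoforms - with_reads
    return with_reads * (1.0 - p_fn) + without_reads * p_fp

# One of four isoforms expressed, illustrative dropout probability of 0.5.
no_errors = expected_detected(4, 1, 0.5, p_fp=0.0, p_fn=0.0)
equal_errors = expected_detected(4, 1, 0.5, p_fp=0.1, p_fn=0.1)
```

Under these assumptions, with one of four isoforms expressed and dropout probability d, raising both error probabilities by the same amount p changes the expectation by (2 + 2d)p, which is always positive, because there are more isoforms without reads than with reads.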
In Fig. 4a, we showed that even when isoform quantification is 100% accurate, we underestimate the number of expressed isoforms for many genes. One hypothesis for why we are less able to detect isoforms in scRNA-seq data compared with bulk RNA-seq data is that the sequencing depth is typically lower. A lower sequencing depth could mean that for many expressed isoforms, there are too few or no reads that would allow the expressed isoform to be uniquely identified.
To investigate whether sequencing depth could explain the difference in our ability to detect isoforms in bulk and scRNA-seq, we first identified a matched bulk and scRNA-seq dataset. The dataset we selected was a mouse embryonic stem cell (mESC) dataset in which mESCs were cultured in 2i + LIF media [23, 24]. In the mESC dataset, each cell was sequenced to approximately 7 million reads on average, whilst the matched bulk data was sequenced to approximately 44 million reads.
To determine whether sequencing depth was responsible for the difference in our ability to detect isoforms in bulk and scRNA-seq, we randomly downsampled the bulk mESC RNA-seq dataset to 7 million reads 50 times. Using the original, un-downsampled bulk RNA-seq dataset as the ground truth, in Additional file 1: Fig. S7, we plotted the mean overlap fractions for each gene in the downsampled bulk RNA-seq dataset and the matched scRNA-seq dataset. We found that the mean overlap fraction was significantly higher (p < 2.2e−16, Welch two sample t test) for the downsampled bulk RNA-seq than for the matched scRNA-seq. This indicates that a lower sequencing depth does reduce our ability to detect isoforms, but that this does not fully explain the reduction in ability to detect isoforms between bulk and scRNA-seq. One explanation for the reduction in ability to detect isoforms in scRNA-seq, over and above the reduction expected due to reduced sequencing depth, is that there could be heterogeneous isoform expression between individual cells. If this were the case, using the isoforms detected in bulk RNA-seq as the ground truth would not be appropriate. There are also potential technical explanations for the reduced ability to detect isoforms using scRNA-seq. For example, the enzymatic reactions associated with library preparation may have reduced efficiency when there is a lower amount of starting material, as is the case for scRNA-seq. Determining to what extent heterogeneous isoform expression and technical factors are responsible for our reduced ability to detect isoforms in scRNA-seq will require further study of cellular isoform heterogeneity and the technical noise associated with scRNA-seq.
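For readers unfamiliar with the Welch test, the statistic and its degrees of freedom can be computed directly, as in the sketch below. The function name is hypothetical, and the example values are made up for illustration and are not the overlap fractions from this study. Obtaining a p value additionally requires the t distribution, for example via `scipy.stats.ttest_ind` with `equal_var=False`.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means (unequal variances allowed).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Made-up per-gene mean overlap fractions, for illustration only.
bulk_downsampled = [0.92, 0.88, 0.95, 0.90, 0.85, 0.93]
single_cell = [0.61, 0.70, 0.55, 0.66, 0.58, 0.72]
t_stat, dof = welch_t(bulk_downsampled, single_cell)
```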
Different models of isoform choice meaningfully change our simulation results
It is possible that different mechanisms of isoform choice at the cellular level could alter our ability to correctly detect which isoforms are present in scRNA-seq. Because there is uncertainty over the mechanism of isoform choice within single cells, we implement four different models of isoform choice in our simulations. We then ask whether different models of isoform choice alter the mean number of detected isoforms per gene per cell in our simulations.
We give a detailed description of how each of these models was implemented in the “Methods” section; here, we provide a brief description of each model and the rationale behind it. We first model the alternative splicing process as a type III Weibull distribution, using a model described by Hu et al. [18]. Based on observations about the molecular process of alternative splicing, Hu et al. suggested that the process could be well modelled by an extreme value distribution, and they found that a Weibull distribution best fit the expression levels of isoforms in bulk RNA-seq. In our second implemented model, we attempt to infer the probability of each isoform being ‘chosen’ to be expressed in a cell. We calculate the probability of an isoform being chosen based on the observed probability of the isoform being detected. Our third model is identical to the second except that we allow the probability of an isoform being ‘chosen’ to vary between cells. We achieve this by sampling the probability of an isoform being chosen from a beta distribution, using a similar approach as Velten et al. [4]. In our final model, we choose a random number between 0 and 1 for each isoform. The random number is assigned to be that isoform’s probability of being chosen, weighted against the probabilities of the gene’s other isoforms being chosen. For brevity, we will refer to these four models as the Weibull model, the inferred probabilities model, the cell variability model and the random model below.
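As a concrete illustration of the simplest of these, the random model can be sketched as follows. The sketch assumes that ‘weighted against the probabilities of the gene’s other isoforms’ means dividing each random number by the sum over the gene’s isoforms; the function name is hypothetical and the authoritative description is in the “Methods” section.

```python
import random

def random_model_probabilities(n_isoforms, rng):
    """Draw one uniform random number per isoform and normalise them to sum to one."""
    raw = [rng.random() for _ in range(n_isoforms)]
    total = sum(raw)
    return [value / total for value in raw]

rng = random.Random(0)
choice_probabilities = random_model_probabilities(4, rng)
# Pick one isoform index for a cell using the normalised probabilities.
chosen = rng.choices(range(4), weights=choice_probabilities)[0]
```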
Figure 5 shows the distributions of the mean number of detected isoforms when one, two, three or four isoforms are expressed per gene per cell for each model, using the H1 hESC dataset sequenced at 4 million reads; results for the other hESC datasets including distributions of overlap fractions can be found in Additional file 1: Figs. S8–14. Importantly, the distributions in Fig. 5 visibly differ between models. To quantitatively confirm this, we perform a K-sample Anderson-Darling test on each row of graphs in Fig. 5. We find that the distributions for 1, 2 and 3 isoforms significantly differ between the isoform choice models (p < 0.001, see Additional file 1: Supplementary Tables for details). In contrast, the distributions for 4 isoforms have a p value of 0.999999, consistent with these distributions originating from the same population. This is as expected, as in the 4 isoform simulations all of the isoforms are picked, and thus, we would not expect isoform choice to matter. Our qualitative and quantitative analyses indicate that different mechanisms of isoform choice alter our ability to detect splice isoforms in scRNA-seq. Therefore, a better understanding of the mechanism of isoform choice across the transcriptome could be key to enabling splicing analysis using scRNA-seq data. Without knowing how best to model isoform choice, our results suggest the presence of a substantial confounder.
Interestingly, our simulation results when using the inferred probability model compared with the cell variability model are almost identical. Given that the only difference between these models is whether or not isoform preference is allowed to vary between cells, this indicates that cellular heterogeneity in isoform preference does not change our ability to detect isoforms under the inferred probability model. We perform a K-sample Anderson-Darling test between the inferred probabilities and cell variability models for each row of Fig. 5, and we find that these distributions do not significantly differ (see Additional file 1: Supplementary Tables). We also note that the results of the random model of isoform choice look more like the inferred probability and cell variability models than the Weibull model. This could be because the Weibull model determines the probability of an isoform being chosen based on the rank of that isoform, whereas none of the other models uses a rank-based approach. These observations and the difficulty we have interpreting them illustrate the need for a better understanding of how best to model isoform choice.
We hypothesise that different models of isoform choice differ in ability to detect isoforms because some models of isoform choice preferentially pick isoforms with a low probability of dropout, whereas other models do not exhibit this preference. To investigate whether different models of isoform choice differ in their preference for picking isoforms with a low probability of dropout, in Additional file 1: Figs. S15–18, we plot the distributions of the probabilities of dropout for the isoforms chosen when one, two, three or four isoforms are picked using each of our four models. We would expect models with a preference for picking isoforms with a low probability of dropout to have distributions of dropout probabilities more skewed towards zero when small numbers of isoforms are chosen. When larger numbers of isoforms are chosen, we would expect to observe less skewed distributions, because the model is effectively forced to choose isoforms with higher probabilities of dropout due to a lack of alternatives. In contrast, if a model had no preference for picking isoforms with a low probability of dropout, we would expect the distributions of the probabilities of dropout to be identical regardless of whether one, two, three or four isoforms are chosen.
In Additional file 1: Figs. S15–18, we find that only the random model does not exhibit any preference for choosing isoforms with a low probability of dropout. Of the Weibull, inferred probability and cell variability models, the Weibull model has the dropout probability distribution most skewed towards zero when one isoform is picked, indicating that the Weibull model has the strongest preference for picking isoforms with a low probability of dropout. The Weibull model also detects the highest mean number of isoforms per gene per cell when one isoform is expressed in the ground truth, consistent with the hypothesis that the difference in the performance of the isoform choice models may be related to their preference for picking isoforms with a low probability of dropout.
If isoform detection ability of the isoform choice models is mainly determined by their preference for picking isoforms with a low probability of dropout, we would expect that if the probability of dropout was globally changed, it would alter the isoform choice models’ abilities to detect isoforms. We investigate this in Additional file 1: Fig. S19 by sampling dropout probabilities from the beta distributions shown in Fig. 3b. We find that more isoforms are detected by all isoform choice models when dropouts are sampled from distributions that are more skewed towards zero. This supports the hypothesis that choosing isoforms with a low probability of dropout improves the ability of isoform choice models to accurately detect isoforms.
Some models of isoform choice are more plausible than others
In the previous section, we observed that our simulation results for the inferred probability and cell variability models were extremely similar. To investigate how generally it holds that allowing isoform preference to vary between cells does not alter our simulation results, we developed three additional models of isoform choice. In the first model, the probability of selecting each isoform was sampled from a truncated normal distribution with a mean of 0.25 and a standard deviation of 0.06 in each cell. In the second model, we sample the probability of selecting each isoform from a Bernoulli distribution, in which the value 1 is chosen 25% of the time and the value 0 is chosen 75% of the time in each cell. In the final model, the probability of selecting each isoform is always 0.25 (the ‘p = 0.25’ model). The three models are illustrated in Fig. 6a, and additional details are given in the “Methods” section. Under the normal and the Bernoulli models, the probability of picking each isoform varies between cells, whereas the probability of picking each isoform is constant between cells under the p = 0.25 model. Importantly, although the distributions we are sampling isoforms from have very different shapes, the mean probability of picking each isoform is 0.25 for all three distributions.
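The three distributions can be sketched as samplers that share a mean of 0.25 but differ in shape. This is an illustration under assumptions rather than the implementation described in the “Methods” section: the truncation bounds of 0 and 1 for the normal model, the rejection-sampling approach and the function names are all hypothetical.

```python
import random

def truncated_normal_prob(rng, mean=0.25, sd=0.06):
    """Normal(mean, sd) restricted to [0, 1] by rejection sampling (assumed bounds)."""
    while True:
        p = rng.gauss(mean, sd)
        if 0.0 <= p <= 1.0:
            return p

def bernoulli_prob(rng, p_one=0.25):
    """Return 1.0 with probability p_one and 0.0 otherwise."""
    return 1.0 if rng.random() < p_one else 0.0

def constant_prob(rng, p=0.25):
    """The 'p = 0.25' model: the same probability in every cell."""
    return p

def empirical_mean(sampler, n=20000, seed=0):
    """Mean of n draws, one per simulated cell."""
    rng = random.Random(seed)
    return sum(sampler(rng) for _ in range(n)) / n

means = {sampler.__name__: empirical_mean(sampler)
         for sampler in (truncated_normal_prob, bernoulli_prob, constant_prob)}
```

All three empirical means should sit close to 0.25 even though the per-cell values range from almost constant (truncated normal) to all-or-nothing (Bernoulli).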
In the second row in Fig. 6a, we show the distribution of the mean number of isoforms detected per gene per cell when we simulate one isoform being expressed per gene per cell. There is no visible difference between our simulation results regardless of which model of isoform choice is used. This is supported by a non-significant result in a K-sample Anderson-Darling test (p = 0.998). These findings are consistent with the hypothesis that our simulation results are unchanged whether or not the model of isoform choice used allows cell variability in isoform choice. We suggest that this is because we are reporting the mean number of isoforms detected per gene per cell in our simulations. Across many cells and rounds of simulation, the mean probability of selecting isoforms seems to determine the shape of our simulation result distributions, whereas the higher moments of the isoform choice probability distribution are apparently unimportant. Thus, including cell variability in our isoform choice model appears to not matter. For future scRNA-seq studies in which the mean number of isoforms detected per gene per cell is an important metric, we conjecture that there is no need to model cellular variability in isoform choice, regardless of whether or not such variability exists in reality. Of course, if future studies are interested in precisely what isoforms are present in individual cells rather than a population mean, understanding whether or not cell variability in isoform choice exists is likely to be important.
We have established that our ability to detect isoforms using scRNA-seq is severely affected by the high rate of dropouts in scRNA-seq. Therefore, attempts to infer a biologically meaningful model of isoform choice from scRNA-seq data are likely to fail. However, we can make some general observations to help rule out certain models of isoform choice. In Fig. 6b, we have ranked isoforms by their mean expression relative to other isoforms from the same gene (so for example, an isoform with rank 1 has the highest mean expression, an isoform with rank 2 has the second highest mean expression and so on). Unsurprisingly, we find that the most highly ranked isoforms are substantially more highly expressed than lowly ranked isoforms. This is consistent with the finding that many genes appear to have a ‘major’, more highly expressed isoform, and one or more ‘minor’, less highly expressed isoforms [12, 13]. We suggest that this behaviour needs to be represented in some way in future models of isoform choice, and models that do not represent it (for example, our random, normal, Bernoulli and p = 0.25 models) are probably overly simplistic. In Fig. 6c, we rank isoforms by their probability of dropout, where the isoform with the lowest probability of dropout compared to other isoforms from the same gene has rank 1. We observe a very similar pattern in which highly ranked isoforms have a substantially lower probability of dropout relative to lowly ranked isoforms, further supporting the finding that ‘major’ and ‘minor’ isoforms exist for many genes. The results shown in Fig. 6 are for the H1 hESCs sequenced at 1 million reads per cell; equivalent plots and overlap fraction distributions for all of the hESC datasets can be found in Additional file 1: Figs. S20–24.
A mixture modelling approach suggests genes for which four isoforms are detected typically express around three isoforms per cell
We ask whether our simulation-based approach could shed any light on the biological question of how many isoforms are expressed per gene per cell. To do this, we simulate one, two, three and four isoforms being expressed per gene per cell and compare the mean isoforms detected distributions to the distribution of isoforms detected per gene per cell for genes for which four isoforms were detected in the real dataset (see Fig. 7a and b). We then approximate each distribution as a log normal distribution and take a mixture modelling approach to estimate the mixing fraction for each of our simulated distributions in the real distribution.
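The mixture modelling step can be illustrated with a short expectation maximisation routine in which the component distributions are held fixed and only the mixing fractions are updated. This is a sketch under assumptions, not the implementation used in this study: the function names and the log normal parameters are hypothetical, and the ‘real’ data are simulated from one component purely to show that the routine recovers it.

```python
import math
import random

def lognormal_pdf(x, mu, sigma):
    """Density of a log normal distribution with log-scale mean mu and sd sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def fit_mixing_fractions(data, components, n_iter=100):
    """Estimate mixing fractions by EM with fixed (mu, sigma) components."""
    k = len(components)
    fractions = [1.0 / k] * k
    # The component densities never change, so compute them once.
    densities = [[lognormal_pdf(x, mu, sigma) for mu, sigma in components]
                 for x in data]
    for _ in range(n_iter):
        totals = [0.0] * k
        used = 0
        for row in densities:
            weighted = [f * d for f, d in zip(fractions, row)]
            norm = sum(weighted)
            if norm == 0.0:
                continue  # skip points that no component can explain
            used += 1
            # E-step: responsibility of each component for this data point.
            for j in range(k):
                totals[j] += weighted[j] / norm
        if used == 0:
            raise ValueError("no data point has non-zero density")
        # M-step: new mixing fractions are the mean responsibilities.
        fractions = [t / used for t in totals]
    return fractions

# Hypothetical components standing in for the one to four isoform simulations.
components = [(-0.5, 0.3), (0.0, 0.3), (0.5, 0.3), (1.0, 0.3)]
rng = random.Random(0)
observed = [rng.lognormvariate(0.5, 0.3) for _ in range(500)]
fractions = fit_mixing_fractions(observed, components)
```

Because the observed values are drawn from the third component, the third mixing fraction should dominate; with real data, the fractions indicate how much of the observed distribution each simulated scenario accounts for.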
Figure 7c shows the mixing fractions found over 100 iterations of expectation maximisation for H1 hESCs sequenced at approximately 1 million reads per cell. In Fig. 7c, the mixing fraction for the distribution corresponding to four isoforms being expressed per gene per cell is over 90%. This suggests that genes detected to express four isoforms in this dataset typically express four isoforms per gene per cell. However, in Fig. 7d, after 100 iterations of expectation maximisation for H1 hESCs sequenced at 4 million reads per cell, the distribution with the largest mixing fraction is that corresponding to three isoforms per gene per cell. This suggests that genes detected to express four isoforms in this dataset most often express three isoforms per gene per cell. As the cDNA sequenced at 1 and 4 million reads per cell came from the same population of cells, it is unlikely that both of these statements are true. We propose several possible explanations for why we might observe this result.
First, we might be over-estimating the dropout rate at 1 million reads per cell. As there is less information with which to infer the dropout rate at 1 million reads per cell compared to at 4 million reads per cell, it is plausible that our estimates of the dropout rate are less accurate at 1 million reads per cell. Whether or not there is a systematic bias towards over-estimating the dropout rate at low sequencing depths is unknown and goes beyond the scope of this paper.
Second, we have established that the model of isoform choice influences the outcome of our simulations, but we do not know which model of isoform choice is correct. Therefore, we are (almost certainly) attempting to fit distributions that do not represent reality. Figure 7 shows our mixture modelling approach using the Weibull model of isoform choice. We note, however, that fitting our alternative models of isoform choice achieves a similar result, in that the largest mixing fraction goes to four isoforms at 1 million reads per cell and to three or fewer isoforms at 4 million reads per cell (see Additional file 1: Figs. S25–31).
Third, the genes detected to express four isoforms differ between the sequencing depths of 1 and 4 million reads. More genes are detected to express four isoforms at 4 million reads than at 1 million reads (1543 versus 1443 for the H1 cells, 1524 versus 1453 for the H9 cells). Whilst this is not a dramatic difference, it does mean that the mixing fractions between these two depths could genuinely differ, although this is unlikely to fully explain the observed difference.
Fourth, we assume all genes for which four isoforms are detected in the real data actually express four isoforms. Due to dropouts and quantification errors, this may not be accurate, and some genes for which four isoforms are detected may express a different number of isoforms in reality.
Fifth, our parameter estimation for quantification errors and isoform choice modelling is not 100% accurate. We cannot rule out that this could be confounding the results of our mixture modelling approach.
Our mixture modelling experiments broadly support the hypothesis that it might be common for a cell to produce more than one isoform per gene. However, there are clearly a lot of potential confounders in our approach, many of which relate to uncertainty about dropouts, quantification errors and isoform choice. We note that without having either a ground truth knowledge of how many isoforms are produced from given genes in given cells, or good estimates of dropout probabilities, quantification errors and isoform choice mechanism, it is hard to imagine how an accurate and reliable estimate of the number of isoforms produced per gene per cell could be obtained.