Introduction

Alternative splicing (AS) is an essential, ubiquitous regulatory mechanism in eukaryotes. Through AS, a single gene can yield multiple mRNA isoforms, greatly expanding the protein diversity encoded by eukaryotic genes [1]. Fine-tuned regulation of alternative splicing has a critical role in the development and function of a diversity of tissues and cell types, including muscles, neurons, and immune cells [2,3,4,5,6]. Splicing errors can also lead to an array of human diseases, such as neurodegenerative diseases, autoimmunity, and cancer [7,8,9].

Decades of research using bulk methods have shown that many AS events are tissue-regulated [10], yet cell type-specific splicing remains incompletely understood. Using single cell RNA-seq (scRNA-seq), cell types can be comprehensively identified based on their expression profile, paving the way for studying splicing patterns. Although most scRNA-seq studies use droplet-based technologies such as 10X Chromium, which only profiles one end of the transcript, there are full-length scRNA-seq technologies, such as Smart-seq2 [11] and VASA-seq [12], that provide coverage of the entire transcript. Full-length technologies make it possible to conduct a local, event-level splicing quantification per cell type. In an event-level AS quantification, transcripts can be split into non-overlapping exonic regions, referred to as splicing nodes [13,14,15,16]. Nodes are further classified based on their behavior during splicing, e.g., core exons (CEs) or alternative donors (ADs). Then, the percent spliced-in (PSI) value for splicing nodes can be calculated based on reads spanning node junctions [13, 15, 17]. PSI is an informative indicator of exon usage frequency, providing an intuitive and easily interpretable metric to describe complex splicing events.

There are plenty of computational methods for event-level splicing quantification in bulk RNA-seq, such as MISO [17], dSpliceType [18], rMATS [16], MAJIQ [15], and SUPPA2 [19], but they are poorly suited due to the high sparsity and large size of scRNA-seq datasets. To overcome these issues, several methods aiming to detect and quantify AS in single-cell data have been developed. They include SingleSplice [20] which compares biological variation and technical noise in a population of single cells to find genes with isoform usage differences. Expedition [21] is a suite of tools that can detect differences among the usage of splicing modalities. Huang and Sanguinetti have developed BRIE and BRIE2 [22, 23], which use Bayesian models for PSI estimation to overcome sparsity. SICILIAN [24] assigns probabilities to called splice junctions to improve precision for their detection, and SpliZ [25] generalizes PSI to enhance splicing quantification at the single-cell level. A recent software tool is MARVEL [26], which integrates splicing and gene expression analyses. However, MARVEL analysis is limited to splicing events involving a single exon and it can only detect differential splicing between pairs of cell types. None of the methods presented to date can leverage event-level splicing quantification to comprehensively characterize cell type-specific splicing patterns, involving either single or multiple exons, without using a parametric model or imputing missing values.

To facilitate comprehensive de novo detection of cell type-specific AS events, we developed scASfind [27], a flexible and intuitive method for mining complex AS patterns in large single-cell datasets. scASfind is an open-source R package which is freely available at https://github.com/hemberg-lab/scASfind. scASfind uses a similar data compression strategy as our previous work scfind [28] to transform the cell pool-to-node differential PSI matrix into an index. This efficient data structure enables rapid access to cell type-specific splicing events, making it possible to use an exhaustive approach when carrying out pattern searches across the entire dataset. Importantly, scASfind does not involve any imputation or model fitting, instead cells are pooled to avoid the challenges presented by sparse coverage. Moreover, there is no restriction on the number of exons, or the inclusion/exclusion events involved in the pattern of interest. Building on these fast searches, scASfind allows interactive searching of cell type specificity of splicing patterns, such as differential splicing, mutually exclusive exons, and coordinated splicing events. We applied scASfind to mouse primary visual cortex [29], mouse embryonic development [12], and human fetal liver [30] to characterize cell type-specific splicing patterns.

Results

Data compression enables fast searching of splicing patterns

scASfind takes full-length scRNA-seq data, such as Smart-seq2 [31], RamDA-seq [32], Smart-seq3 [33], VASA-seq [12], and FLASH-seq [34], as input for splicing quantification. Tag-based methods such as 10X Genomics Chromium are unsuitable since they only capture the transcript’s 3′ or 5′ end, and typically do not provide enough reads that span splice junctions. It is assumed that the data has been clustered and annotated so that each cell is assigned a cell type. Several cells of the same type are first combined into cell pools to provide sufficient reads for robust and accurate PSI quantification with Whippet [13] using the MicroExonator workflow [35].

The size of cell pools is an important hyperparameter in the analysis, which should be carefully determined on a per-case basis. Pooling single cells aims to reduce the technical sparsity of scRNA-seq data, which could be due to technology or tissue. In our examples, the dataset by VASA-seq (mouse embryo) was two orders of magnitude sparser than the two Smart-seq2 datasets (mouse cortex, human fetal liver); therefore, we employed a much larger pool size (Table 1). Choosing a pool size needs to balance two aspects: ensuring a proper coverage for PSI quantification, while preserving enough cell pools per cell type to faithfully represent the variation of cell abundance and allow effective statistics tests in scASfind. An examination of the impact of pool size on scASfind results in the mouse cortex data can be found in the “Methods” section.

Table 1 Datasets used in this study, parameters, on-disk size and the time of building scASfind index. Splicing nodes which had a PSI deviating from the dataset mean in at least one cell pool were encoded. The size of raw data includes the raw PSI matrix and relevant metadata objects, while the size of the scASfind index object involves the compressed PSI index and the same metadata objects (see "Methods"). Both sizes are on-disk file sizes. The time to build the scASfind index is calculated by running the index-building script in scASfind on a HPC cluster. UMI: unique molecular identifier; MB: megabyte; std: standard deviation

After obtaining a splicing node x cell pool matrix of PSI values (Fig. 1a), scASfind first centers each column of the matrix to obtain the deviation of PSI values from the dataset mean (default: |ΔPSI|> 0.2). This is to capture the biologically informative PSI variation across cells. The differential PSI matrix is further split into two to encode positive (spliced-in) and negative (spliced-out) PSI values. In both matrices, a non-zero value indicates that there is differential inclusion or exclusion of the splicing node in that particular cell pool. By ensuring that the matrices are sparse, we can achieve a high compression rate and fast pattern matching, even for large datasets.

Fig. 1
figure 1

Overview of scASfind. a Schematic of the scASfind workflow. Single-cell full-length transcriptome sequencing data such as Smart-seq2 or VASA-seq are suitable inputs for the scASfind workflow. Cells from the same cell type are pooled to increase the accuracy of splicing event detection (default 5 cells per pool) with MicroExonator [35]. The PSI value for each splicing node is calculated by Whippet [13] to obtain a splice node-by-cell pool PSI matrix, and we then build a scASfind index containing information about splicing events that are differentially spliced in or spliced out in each cell pool. Finally, we query the index to search for cell type-specific differential splicing events, mutually exclusive node pairs and consecutive nodes that are similarly spliced-in or coordinated splicing events. b The size of the file saved to disk containing either the raw PSI values and metadata objects or the scASfind index with metadata objects built with a two-bit quantization. c The elapsed time of searching all cells with increased inclusion, i.e., has a PSI no less than 0.2 higher than the dataset mean, in any of five randomly selected splicing nodes. The process is repeated 30 times. The bar in the boxplot shows the arithmetic mean, lower and upper hinges correspond to the first and third quartiles, whiskers extend from the hinge to the largest value no further than 1.5 * interquartile range from the hinge, and outliers beyond this range are plotted as individual data points. PSI, percent spliced-in

We adopted the indexing strategy in scfind [28] to compress the two sparse PSI matrices into two scASfind indexes, and we then combined them into a single meta-index object. The index efficiently stores the splicing nodes that have a PSI value deviated from the mean (see “Methods” for details about data compression into an index). In addition to providing efficient storage, the scASfind index also allows for rapid access to the raw PSI values associated with each splicing node. In particular, it allows us to use AND or OR queries to find the set of cell pools that match a set of inclusion/exclusion criteria, e.g., nodes 1, 2, and 3 need to be included well above mean while nodes 4 and 5 are excluded well below mean. By combining multiple queries, we can carry out more complex searches, and since each operation is fast it becomes possible to adopt an exhaustive approach to search the entire dataset.

In the following, we set out to demonstrate how the scASfind index allows for thorough identification and characterization of cell type-specific splicing events. We used scASfind to analyze three technically and biologically distinct datasets: a mouse cortex data profiled using Smart-seq2 (hereafter referred to as mouse cortex) [29], a mouse embryonic development dataset profiled using VASA-seq (hereafter referred to as mouse embryo) [12], and a human fetal liver dataset profiled using Smart-seq2 (hereafter referred to as human fetal liver) [30].

For all three datasets, the scASfind representation required two to three orders of magnitude less disk space (Fig. 1b) when using two bits for the quantizer. We also benchmarked the search times: compared to an implementation using only standard data structures, scASfind was hundreds or thousands of times faster (Fig. 1c). For example, for the mouse embryo dataset, finding all pools that have increased inclusion of any of 200 randomly selected nodes with scASfind took 0.24 s on average, compared to 112 s for the naive approach. scASfind was also highly robust to increased search size. One additional overhead for scASfind is the time to build the index, but this is relatively minor as none of the datasets took more than 30 mins (Table 1).

Splicing events are more precise markers of cell types

Cell types are typically associated with a set of marker genes, i.e., genes that are highly expressed compared to other cell types. Since the most widely used single cell protocols do not provide enough information to distinguish isoforms, transcripts are usually evaluated at the gene level. However, identifying a reliable set of marker genes can be challenging, especially for neuronal tissues with complex cell type taxonomy [36]. Since AS is known to be more prevalent in the brain [2, 37], we hypothesized that splicing events are more reliable for distinguishing cell types than gene expression. We refer to splicing nodes that are highly included or excluded in only one cell type as marker nodes in analogy with marker genes.

To identify cell type marker nodes, we used each cell type to query the scASfind index for nodes that have high or low inclusion. Benefiting from the speed at which these quantities can be extracted using the scASfind index, we carried out an exhaustive search to identify the best marker nodes for each cell type. Nodes were evaluated using the precision, recall, and F1 scores for their ability to detect the cell type of interest (see “Methods” for details). We used a similar procedure for marker genes using scfind, and we compared the quality of the markers by the precision, recall, and F1 scores. In the mouse cortex and the mouse embryo datasets, we observed higher F1 scores in splicing markers, compared with expression markers across the board (Fig. 2a, d). Interestingly, the higher F1 of splicing markers was largely driven by higher precision (Fig. 2b, e), suggesting that they yielded few false positives. The F1 and precision of splicing and expression markers showed comparative scores in the human fetal liver dataset (Fig. 2g, h). The lack of benefit in using splicing markers for the human fetal liver suggested that the splicing landscape in this dataset was less complex compared to the mouse cortex and embryo, possibly due to the tissue. We conjecture that the poor recall (Fig. 2c, f, i) for splicing markers could be due to the sparsity of splicing quantification leading to a high number of false negatives as there is insufficient information to accurately quantify splicing nodes in many pools.

Fig. 2
figure 2

Comparing gene expression and splicing as cell type markers. The top 20 expression and splicing markers, ranked by F1 scores, and their precision, recall, and F1 scores are calculated via scfind [28] or scASfind for ac mouse cortex, df mouse embryo, and gi human fetal liver. We compare the mean scores per cell type and consider a 0.2 difference between expression and splicing to indicate a better marker (gray lines in the figure). The dots represent the mean, while the whiskers indicate the minimum and maximum for the 20 markers. Cell types with either better splicing or expression markers are colored (blue for splicing, red for expression). For visual clarity, cell types have score differences of 0.5, 0.6, or 0.2 for precision; 0.2, 0.3, or 0.2 for recall; or 0.3, 0.4, or 0.2 for F1 in mouse cortex, mouse embryo, and human fetal liver data are labeled, respectively. These values are chosen based on the respective number of cell types with a better marker for each dataset

Moreover, the inclusion or exclusion of splicing nodes was independent of increased or decreased expression, suggesting that splicing markers were largely independent of the expression level (Fig. 3). For instance, in astrocytes, Dtna_27 was excluded while the cell type had higher expression of the Dtna gene. On the other hand, astrocytes had similar expression of the Hnrnpa2b1 gene with other cell types while it had higher inclusion of Hnrnpa2b1_32 (Fig. 3a, b). Another example is glutamatergic neuron subtype L4_Scnn1a (Fig. 3c, d), here we found both inclusion and exclusion splicing markers, while the expression levels of the corresponding genes could hardly distinguish this cell type from others. This was in line with observations by Wen et al. [26] that only a fraction of differentially spliced genes have expression changes in the same direction, indicating that differential splicing provides another layer of transcriptomics regulation that contributes to cell type heterogeneity.

Fig. 3
figure 3

PSI values of marker splicing nodes and expression levels of corresponding genes. a PSI values for astrocyte splicing markers from mouse cortex data, compared to the mean of all other cell types. b Expression levels for the same genes contribute to splicing markers in astrocytes. c PSI values for L4_Scnn1a neuron splicing markers from mouse cortex data. d Expression levels for the same genes in c. In all panels, each dot represents a cell pool, and the color scale shows the median PSI or gene expression level among the cell pools. PSI, percent spliced-in

While 67%, 57%, and 62% of the top 20 splicing markers were from different genes in mouse cortex, mouse embryo, and human fetal liver datasets, respectively, a single gene could contribute a large portion of marker nodes in some cases. We observed that the Ttn gene, which encodes titin—the largest protein in the genome [38], contributed most of the above splicing markers in the first heart field (Fig. 2f) for the mouse embryo dataset. These results were consistent with previous analyses of Ttn splicing profiles that showed 50–219 exons to be developmentally regulated [5]. Taken together, this result suggested that splicing events frequently show higher cell type specificity than gene expression, and the splicing marker events reported by scASfind can be superior in terms of distinguishing cell types.

We compared our analysis of cell type-specific splicing markers with differentially spliced nodes calculated by MARVEL [26], a recent method focusing a combined analysis of expression and splicing in scRNA-seq data. Since MARVEL was designed for pairwise comparison, we also ran scASfind pairwise to detect differentially spliced CEs between all pairs of cell types in the mouse cortex data (see details in “Methods”). We saw a remarkable degree of consistency between the two tools. Among the 1176 pairs of cell types, MARVEL returned a total of 441 SE markers in 318 pairs of cell types. Among the MARVEL SE markers, 232 were ranked top 1 in scASfind and 330 were among the top 5 (Additional file 1: Fig. S1). For those marker nodes less significant in MARVEL, they also had a lower ranking and smaller F1 score in scASfind (Additional file 1: Fig. S1). The comparison suggested that MARVEL focused on returning the strongest signal while scASfind provided more information, reporting events in a custom range of F1 scores. Both tools were highly consistent in capturing the highest specificity splicing nodes that maximally distinguish the pair of cell types. Nevertheless, scASfind could provide results for all cell type pairs while many of these pairs did not have significant differentially spliced SEs in MARVEL analysis.

We also evaluated the PSI values of scASfind node markers in the mouse cortex data in VastDB [10], an atlas of alternative splicing profiles based on bulk RNA-seq data. For some non-neuronal cell types, there were purified single cell type tissues in VastDB, making them directly comparable with scASfind marker nodes. We found that for microglia, oligodendrocytes, and OPCs, the top 10 scASfind marker nodes clearly exhibited increased PSI in the respective VastDB cell type sample (Additional file 1: Fig. S2). Meanwhile, we observed that cortical, cerebellar, and whole brain data, which encompass mixed neuronal and glial cell types, did not exhibit high PSI for these marker genes. Furthermore, a probabilistic principal component analysis using the VastDB PSI values for all cell type marker nodes in scASfind across all VastDB tissues showed that these nodes have neural specificity (Additional file 1: Fig. S3). In summary, scASfind results were well supported by the PSI quantification from purified cell types in VastDB. The analysis further highlights the enhanced resolution provided by single-cell results compared to bulk RNA-seq from mixed cell types.

Detecting mutually exclusive exon pairs

Mutually exclusive splicing event is a special type of AS event where only one of two consecutive exons is included in the final mRNA product [39, 40]. In the largest study of mutually exclusive exons (MXEs) in bulk RNA-seq data carried out to date, Hatje et al. identified 855 exon pairs from 515 datasets [37, 39]. MXE splicing is known to be regulated by different molecular mechanisms that enable tissue-specific patterns [40,41,42,43,44]. We hypothesized that some of the observed tissue specific splicing profiles arise from the cumulative effects of cell type specific MXE preferences within each tissue. Hence, we leveraged scASfind to systematically discover MXEs and explore their cell type specificity.

We performed an exhaustive search of all three datasets to observe cell type specificity for all exon pairs that could be MXEs (Fig. 4a). That is, for all consecutive exons, we identified cell types in which one of them is always included and the other excluded. To ensure high-quality results, several additional filters were employed. First, we required the pair to have mean PSI values summing to 1 ± 0.1, and that PSI standard deviation scores differ by less than 0.1 across all cell pools in the dataset. Second, we required at least one cell type to be significantly enriched for the pattern when one exon is included, and the other is excluded. Statistical significance was determined using hypergeometric tests. Third, we considered MXEs detected in over half of the cell pools and having a difference of cell type mean PSI value between the two exons ≥ 0.5 as highly confident pairs. The default criteria we have used were rather stringent to obtain a small amount of highly confident results.

Fig. 4
figure 4

Summary of 14 high-confidence MXE pairs in mouse cortex data. a Schematic overview of how MXE pairs are identified. b High confidence adjacent MXE pairs detected by scASfind. The heatmap color scale indicates the difference in raw PSI value (∆PSI) between the downstream exon and the upstream exon in the pairs of MXEs. Cell types are grouped by their broad type; genes are organized by the number of cell types in which the pair is significant (decrease from left to right). MXE, mutually exclusive exon; PSI, percent spliced-in; CE, core exon; AA, alternative acceptor; AD, alternative donor; RI, retained intron; signif., significant

We detected 63, 17, and 35 significant pairs of MXEs in mouse cortex, mouse embryo, and human fetal liver data, respectively. Among these 14, 2, and 2 were adjacent and highly confident. The high-confidence pairs in the mouse cortex dataset are summarized in Fig. 4b. Hierarchical clustering across cell types using the ∆PSI of these exons indicates that generally, cells of the same broad type have similar splicing patterns in these MXEs, with a few exceptions. This is concordant with the tissue-specific splicing observed in bulk RNA-seq. However, there are examples of cell types within each broad type that have distinctive MXE preferences, suggesting a more complex pattern.

A known example of MXEs can be found in the ionotropic glutamate receptor genes, AMPA 1/2/3 (Gria1, Gria2, Gria3), which have been studied extensively in mouse brain [45,46,47]. Reassuringly, the top candidates reported by scASfind include the three Gria genes. To the best of our knowledge, this is the first detailed study of cell type-specific MXE preferences for these genes (Fig. 5a). During development, some neurons switch from node 28 to using node 29, and this has important consequences for their responses to electric stimuli [48, 49]. We detected a significant preference for node 29 in glutamatergic neurons including L2 Ngb and L2/3 Ptgst, as well as L6a Car12, L6b Rgs 12 and L6b_Serpinb11, GABA-ergic neurons including Vip_Sncg, Pvalb_Obox3, Smad3, Igtp and Pvalb_Wt1, as well as in oligodendrocytes Oligo_96_Rik. By contrast, several glutamatergic and GABA-ergic neurons included node 28, suggesting that the cell type-specific pattern is complex. Another example was in the SNARE protein Snap25, whose MXE preference switches during mouse brain development [50, 51], and is related to regulating synaptic transmission and long-term synaptic plasticity [52, 53]. In our analysis, glutamatergic neurons L5_Chrna6, L6b_Rgs12, L6b_Serpinb11; GABA-ergic neuron Igtp, Ndnf_Car4, and oligodendrocyte progenitor cell OPC_Pdgfra showed a strong preference for node 9, while other cell types utilized node 10 (Fig. 5a). Taken together, our results recapitulate some of the complex cell type specific pattern of isoform switching for both the Gria genes and Snap25. In the mouse embryo data, we detected highly confident MXEs in Actn1 and Actn4. Actn1 has been found to have tissue-specific mutually exclusive splicing in adult mice. Compared to other tissues, muscle cells select an alternative exon which makes the protein’s EF-hand motif insensitive to Ca2 + , while brain cells include both exons [54, 55]. We were the first to describe the cell type preference of this MXE pair in the mouse embryo (Fig. 5b). For Actn4, we found primitive_heart_tube cells prefer Actn4_14 while first_heart_field and secondary_heart_field cells chose Actn4_13. Finally, for the P4HA2 gene in the human fetal liver dataset, Common_Prog_cycling and Dend_cell_cycling selected node 47 or 48, respectively (Fig. 5c).

Fig. 5
figure 5

PSI values of detected high confidence adjacent mutually exclusive exons. The mean PSI across cell pools in each cell type in a Gria2 (left) and Snap25 (right) in the mouse cortex dataset, b Actn1 in the mouse embryo dataset, and c P4HA2 in the human fetal liver dataset. PSI, percent spliced-in

Observing the cell type mean PSI of high-confidence MXEs, we found that the mutually exclusive pattern is not strictly followed in some cell types (Fig. 5). For example, Sst_Cbln4 cells showed comparable PSI for nodes 28 and 29 in Gria2, whereas astrocytes included both exons. This is in parallel with previous observations of MXEs only being mutually exclusive in specific tissues but not in all tissues [1, 40, 56]. For example, TCL6 only shows exclusive patterns in specific tissues while on the basis of all known transcripts, the pattern is lost [1, 40, 56]. Similarly, our results concurred that MXEs can show a mutually exclusive pattern only in some cell types.

Identification of coordinately spliced exon blocks

By definition, the splicing nodes considered in scASfind only involve a single event. Though a single event can lead to drastic shifts of gene function [57], splicing events are often coordinated, resulting in multiple consecutive exons being simultaneously included or excluded. Hereafter we refer to such coordinated groups of splicing nodes as node blocks. Coordinated events are more likely to have a more substantial impact on protein function as a larger proportion of coding sequences are affected, but when using a splicing node representation, they are difficult to detect as one must find a stretch of consecutive nodes. Given the large number of nodes across the transcriptome and the high noise level of splicing quantification, this search can be very time-consuming.

We used scASfind to detect node blocks. For each gene, we first identify consecutive nodes of type “core exon” with similar mean and standard deviation of their PSI values (the default is to require the absolute differences for both to be < 0.1 for all exons in the block). Next, the search is expanded to identify additional neighboring nodes to find cell type specific blocks. Events where a block of nodes is coordinately spliced-in (the “above” events) and spliced-out (the “below” events) are detected separately. To ensure high quality results, we only keep blocks composed of at least three nodes from different actual exons. Cell type specific node blocks that are detected in over half of the cell pools are reported as high confidence blocks.

Overall, we detected 263 node blocks with lengths ranging from 3 to 21 in the mouse cortex dataset, with 8 high-confidence ones (3–5 splicing nodes long). For the mouse embryo data, we found 306 blocks containing 3–26 nodes and 14 high-confidence ones with lengths ranging from 3 to 6. For the human fetal liver data, there were 526 node blocks ranging from 3 to 37 nodes, 19 of which were high confidence with lengths from 3 to 9 (Fig. 6a–f). For example, in the mouse cortex data, we found that node 5–7 in Haus7 is significantly spliced-out in L2_Ngb and L5_Pde1c while it is spliced-in in Sncg (Fig. 6d). Reassuringly, this is in line with two documented isoforms as shown in the GENCODE annotation [58, 59] (Additional file 1: Fig. S4). In the human fetal liver, we found a block of exons between chr2:95,895,399–95,901,206 in ANKRD36C that is spliced-in for dend_cell_cycling, also in concordance with known isoforms (Additional file 1: Fig. S5).

Fig. 6
figure 6

Cell type-specific coordinated splicing. Overall, a 263 node blocks were detected in the mouse cortex dataset, b 306 in the mouse embryo, and c 526 in the human fetal liver. The gene name of highly confident node blocks, which include at least 3 nodes from different exons and were detected in ≥ 50% cell pools, are shown in df for each dataset. In df, “above” means that the node block is significantly spliced-in in the respective cell type and “below” means significantly splice-out of the node block

We proceed to analyze whether the detected highly confident node blocks correspond to known isoforms, and what are their functional implications on protein domains in the mouse embryo data. We detected known coordinated events in Ttn in primitive_heart_tube and first_heart_field. This corresponds to a regulated isoform switching event during heart development, upon which the stiffness of the protein changes [60, 61]. Another key gene for heart development is Dst. Here we found a block including 5 exons and spanning 3 of the 7 major protein domains significantly spliced-in in myofibroblasts (Fig. 7A). This event is in line with the documented muscle-specific Dst-b isoform [62]. Mutation studies have shown that Dst-b is essential for strained muscle maintenance [63]. We have also found a node exclusion block in Myo18A, which is consistent with shorter annotated isoforms that are derived from an internal transcription start site [64, 65] (Fig. 7B). One of these shorter isoforms, known as Myo18Aγ, lacks the PDZ-containing N-terminus but includes an alternative N-terminal extension [66] and showed well-marked cell type specific profile associated to primitive_heart_tube. In general, the detected highly confident node blocks correspond well to known regulated isoforms. The analysis demonstrated the versatility of scASfind to detect diverse isoform switching events.

Fig. 7
figure 7

Coordinated inclusion of splicing nodes across mouse embryonic cell types. Sashimi plots showing the read coverage and splice site usage across A Dst and B Myo18a transcripts. Top schematics show domains annotated for these proteins and coordinated splice nodes are highlighted with a red segment (bottom). Different gray shades are meant to distinguish cell types. Mean ± SD of PSI values across all coordinated splice nodes are indicated in red. PSI, percent spliced-in; SD, standard deviation

Even though some blocks are detected in < 50% of cell pools, they often match documented isoforms in the annotation. For example, in the human fetal liver, 20% of NK_cell cell pools did not include chr6:42,632,552–42,655,723 (14 splicing nodes) in the UBR2 gene, in line with an isoform corresponding to early termination (Additional file 1: Fig. S6). We also found that 25% of endothelial cells express a shorter isoform of TBC1D19 (Additional file 1: Fig. S7). Moreover, we have detected several high confidence blocks that suggest undocumented isoforms. For example, four exons between chr7:44,864,913–44,865,509 in Ptov1 are shown to be coordinately included in all cell pools of Pro_Nephros in mouse embryos while there are no recorded isoforms of this gene in the GENCODE [59] annotation (Fig. 6f, Additional file 1: Figs. S8, 9). This could have relevance to human biology since PTOV1 has been associated with prostate cancer, and the splicing event is likely to have a disruptive impact on one of the two major domains involved in the interaction with multiple other genes [67].

Discussion

Splicing is a highly regulated process with a key role in cellular identity and function [5, 68]. Here we present scASfind, a toolkit for mining cell type-specific splicing patterns from large, single-cell, full-length transcriptomics datasets. It is challenging to analyze splicing in scRNA-seq data due to the vast number of splicing nodes and the high degree of sparsity. To overcome these challenges, we utilized cell pooling and data compression to build an index which can support efficient queries of the cell type pattern of splicing events. We demonstrate that an index for thousands of cells can be created in 20–30 min, resulting in compression by 2–3 orders of magnitude, while at the same time speeding up queries by hundreds of folds. Building on this data structure, we provide functions for discovering cell type-specific splicing events such as finding marker nodes, mutually exclusive exons, or coordinately spliced node blocks. Using mouse cortex, mouse embryonic development, and human fetal liver datasets, we demonstrated scASfind’s utility for carrying out tasks that would have been prohibitive without the tool.

Quantification of individual splicing events, compared with transcript-level analysis, is more tractable with short-read data and does not rely on complete annotation models [13, 15]. The PSI quantification used by scASfind comes from the event-level splicing quantification tool Whippet. Whippet is an established method which is efficient and showed high recall in a benchmark study [69]. However, the algorithm only detects and analyzes annotated AS events in its contiguous splice graphs index [13] and it is less effective on detecting events de novo [69]. Therefore, scASfind also only detected splicing patterns of nodes existing in the Whippet index.

Current single-cell platforms typically focus on sequencing only the 3′ or 5′ ends of transcripts, leaving alternative splicing largely unexplored at the single-cell level. Our analysis of the scASfind splicing node markers strongly suggests that single-node splicing patterns can provide higher cell type precision than gene expression. This is particularly relevant to tracing rare cell types. By leveraging rare cell types identified through specialized algorithms [70, 71], we can explore specific splicing patterns using scASfind, offering potential for higher precision and experimental validation. For instance, studies have demonstrated subtype-specific splicing in neurons [72,73,74], reinforcing the utility of splicing patterns in distinguishing neuron subtypes. Understanding AS events with strong cell population specificity is crucial for effectively studying cellular heterogeneity. Additionally, the systematic identification of MXEs and tissue-specific coordinated splicing events provides insights into cell type-specific AS regulation, enriching our understanding of the regulatory landscape.

The development of high throughput full-length protocols such as VASA-seq [12] will likely open opportunities for splicing analysis in a diversity of biological systems. Moreover, several studies have utilized long reads technologies for single-cell studies [75,76,77], allowing an entire transcript to be captured by a single read. We believe that these advances will allow single-cell studies to quantify alternative splicing events, but for this to become feasible novel computational methods are required. Given its efficient memory usage, low run times, and convenient search functionality, we believe that scASfind will be a valuable tool for researchers to decipher cell type-specific splicing using scRNA-seq data.

Conclusions

We provide scASfind, a freely available software for mining cell type-specific alternative splicing events in full-length scRNA-seq data. It utilizes an efficient data structure to detect marker splicing nodes and enables exhaustive searches of MXEs and node blocks.

Applying scASfind to three datasets from mouse and human demonstrated the high precision of marker splicing nodes compared to the more widely used marker genes. We also found known and novel MXEs and node blocks that show cell type specific splicing patterns. Splicing analysis with scASfind facilitates discovery of cell type-specific splicing events that may have functional implications.

Methods

AS quantification across cell types

To quantify AS events across cell types, we configured and ran MicroExonator’s single-cell module, as described in [78]. As part of this workflow, MicroExonator qualifies AS events using Whippet [13] across cell pools derived from annotated cell clusters. Using this protocol, we processed mouse scRNA-seq data from brain visual cortex [29] and whole embryos [12], as well as scRNA-seq data derived from human immunophenotypic blood cells from fetal liver and bone marrow [30]. We used genome assembly mm10 and GENCODE transcript annotation v16 to process mouse scRNA-seq data. For human scRNA-seq analyses, we used genome assembly hg38 and GENCODE transcript annotation v19.

Filter for confidently quantified events

Before encoding the PSI data, we first filter for confidently quantified events. The sparsity of scRNA-seq data often results in an insufficient number of reads spanning splicing junctions that can be used to calculate node PSI values. By default, we require at least 10 reads available for PSI quantification. This gives roughly a confidence interval of PSI < 0.5 from Whippet.

Create a scASfind index

scASfind builds four types of data structures from the input data, and together they form a queryable index.

  1. 1.

    The splicing node x cell pool differential PSI matrices

    For each node, we first calculate the mean PSI across all cell pools, then calculate the difference from the mean for all PSI values. A 0.2 deviation was used as the default threshold to select sufficiently deviated events. Secondly, we separate nodes with differential inclusion (the “above” events) and those with differential exclusion (the “below” events). Both metrics are then multiplied by 100 so that the value is in the range of 0, 100 for the compression.

    The splicing node x cell pool differential PSI (∆PSI) matrices are independently compressed using the strategy in scfind [28]. The compression is a two-step process: (1) storing the positions of non-zero values are compressed by Elias-Fano encoding, and (2) the actual differential PSI value is represented as quantiles of a log-normal distribution (Additional file 1: Fig. S10). The mean and variance of the log-normal distribution, along with quantiles of the original ∆PSI values, are stored. The first step is lossless while the second step is lossy. The approximation of actual ∆PSI in the index makes it possible to retrieve the approximate PSI value when giving the user the ability to tune the size of the storage based on the number of bits used for the quantization (default 2 bits).

  2. 2.

    The mean and standard deviation PSI values per node

    We store the mean and standard deviation for each dataset and node for retrieval of raw PSI values and for speeding up searches of MXEs and node blocks based on expected patterns in mean and standard deviation.

  3. 3.

    The mask for NA values from PSI quantification

    Since we only encode differential PSI values, cell pools with PSI values close to the dataset-wise mean are excluded. However, cell pools where the PSI values are unquantified (NA) or below the required number of reads for confident quantification are also excluded. Distinguishing these two circumstances is required to enable retrieval of raw PSI values from the index. For this purpose, we use a binary mask matrix. In this matrix, 1 indicates the cell pools with PSI value equal to the dataset-wise mean and 0 indicates unquantified. Typically, unquantified events are more frequent than equal mean events, resulting in a sparse matrix.

  4. 4.

    The annotation of nodes

    We use ENSEMBL [79] via the R package biomaRt (V2.46.3) to obtain annotations of all nodes included in the index to enable quick interpretation of results.

    In addition, metadata for each cell, providing information about its annotated cell type or state is required. The buildAltSpliceIndex function in scASfind takes the cell pool-by-splicing node differential PSI matrices and a table with the cell type annotation to build an index object. The “above” and “below” index objects are stored as two datasets, and the three other metrics are stored in the metadata slot in the scASfind object.

Benchmark of file size, index build time, and node search time

We benchmark the efficiency of the scASfind index, compared with a basic approach utilizing only R and Seurat functions. For file size, we take the sum of on-disk space taken by the raw PSI matrix (as.tsv files) and the three metadata objects (as.rds files) as “raw PSI,” and the complete scASfind index (as.rds object), including the same three metadata objects stored in the metadata slot as “scASfind index.” All file sizes are obtained with the “file.info” function in R. For index build time, we run the scASfind build index script on a high-performance computing cluster (Rocky Linux 8.5) for the three datasets with 4 cores and maximum 2 GB memory, 10 processes, and 200 threads. The time to build the index naturally depends on the computational resources available. For differential events search time, we randomly select 5, 10, 50, 100, and 200 splicing nodes, and search for cell pools with an above-mean PSI of any of the nodes using either the naive approach or the scASfind index. The elapsed times were measured in 30 repetitions per query length.

Cell type marker node search

We use a precision-recall framework to search for nodes that are specific to different cell types. For each node, we count the number of cell pools with the relative inclusion/exclusion of this node in a cell type of interest compared to all other cell types. We use precision, recall, and F1 scores to evaluate how well the node distinguishes the cell type of interest from all other cell types. A true positive (TP) is when a node is included or excluded in a cell pool from the cell type of interest for spliced-in and spliced-out events, respectively. False positives (FP) are when the same node is included/excluded in cell pools of other cell types, and false negatives (FN) are cell pools from the same cell type in which the node is not detected as included or excluded. The precision score is calculated by:

$$precision=\frac{TP}{TP+FP}$$
(1)

The recall score is calculated by:

$$recall=\frac{TP}{TP+FN}$$
(2)

The F1 score is the harmonic mean of precision and recall score:

$$F1=\frac{2 * precision * recall}{precision + recall}$$
(3)

By default, we use F1 scores as a balanced metric to rank all nodes for each cell type to indicate the best marker nodes for either inclusion or exclusion events.

Comparing PSI and gene expression in splicing marker nodes

We calculate the top 20 gene expression markers (ranked by F1 score) using scfind in the mouse cortex data. For the genes containing splicing marker nodes, we used the R package Seurat (V4.1.0) to obtain the scaled expression values. Then, we used ggplot2 (V3.3.3), viridislite (V0.4.0), and cowplot (V1.1.1) to create the violin plot of PSI and scaled expression values.

MARVEL analysis

We ran MARVEL [26] (v2.0.5) following the tutorial for plate-based sequencing methods (https://wenweixiong.github.io/MARVEL_Plate.html). The function CompareValues were used to perform pairwise differential splicing analysis between cell types using the “ad” algorithm. Only exon-skipping type events were used to compare with CEs in scASfind.

VastDB comparison

We downloaded the main PSI table of the mouse data in VastDB [10] (mm10) (https://vastdb.crg.eu/wiki/Downloads#AS_events_3). Splicing nodes that were cell type markers in scASfind analysis of the mouse cortex data were subtracted from this table and subjected to tissue specificity analysis. PPCA was performed using pcaMethods (v1.64.0).

Detect cell type-specific mutually exclusive exons

Detection of cell type-specific MXEs is based on a hypergeometric test with the “hyperQueryCellTypes” function in scASfind. The hypergeometric distribution models the probability of k success in n draws without replacement, from a finite population with N subjects and K of them contains the pattern. In our case, k is the number of cell pools in a cell type which have the splicing pattern, n is the number of cell pools in that cell type, K is the total number of cell pools in which the splicing pattern is detected, and N is the total number of cell pools. A pattern with a hypergeometric test P value ≤ 0.05 in a cell type is considered significant.

We use a mutually exclusive combination of splicing nodes (include one and exclude the other) as the pattern in the hypergeometric test to detect MXEs. The query is performed exhaustively for all pairs of exons in the dataset. To reduce the search space, we first filter all the possible node pairs by (1) having a mean PSI sum of 1 ± 0.1 and (2) having a < 0.1 difference in the PSI standard deviation. Then, we query for significant cell types for the potential pairs of MXEs. Pairs with at least one cell type significant in one of the two possible patterns are kept as candidates.

Further filters are applied for candidate MXE pairs. First, we require the MXE pattern to be found in ≥ 50% of the pools in the significant cell type. Then, we require the difference of absolute PSI value in the pair to be ≥ 0.5 to be considered high confidence.

Detecting cell type-specific coordinately spliced-in exons

We scan all genes from the 5′ of all annotated “core exon” nodes in the scASfind index to find coordinately spliced-in exons. We extend an exon block by requiring the next exon to have at most ± 0.1 difference with both the mean and standard deviation of the PSI of the previous block of exons. If this criterion is not fulfilled, we initiate a new block, and the previously constructed block is tested for cell type specificity using the hypergeometric test as described previously. Blocks significant in at least one cell type are kept. Finally, we use a node-to-exon mapping table from Whippet [13] to combine nodes belonging to the same actual exon. If the resulting block contains at least 3 exons, we propose it as a potential coordinated sliced-in exon block. We also require the block to be found in > 50% of the pools in the significant cell type for it to be highly confident.

The impact of pool size on PSI quantification and scASfind results

We ran Whippet, followed by scASfind, using a cell pool size of 5, 10, 15, and 20 on the mouse cortex data. This was specified in the config file with cells_pseudobulks in the MicroExonator workflow.

First, we examined the percentage of nodes quantified among all annotated nodes (Additional file 1: Fig. S11a). We found that despite a slight increase in the total percentage of nodes quantified as the pool size increased, the percentage of confidently quantified nodes (with ≥ 10 reads) remained stable and saw a slight decrease at pool size 20. This is further supported by the fact that the coverage distribution of confidently quantified nodes remained similar across the pool sizes (Additional file 1: Fig. S11b). Additionally, comparing the percentage of nodes quantified among all nodes in the three datasets suggested that the selected pool sizes gave comparable numbers of confidently quantified nodes: 20.47 ± 4.30, 30.63 ± 13.78, and 15.72 ± 5.63 (mean ± stddev).

In summary, using a pool size 5 for the mouse cortex data ensures a balance between achieving sufficient coverage for PSI quantification and retaining the abundance variation between cell types.