Main text

Droplet-based single-cell RNA-sequencing (scRNA-seq) methods utilize microfluidics to encapsulate individual cells in nanoliter droplet emulsions, a technique that has dramatically increased throughput compared to plate-based protocols [1]. While encapsulating cells, droplets also capture cell-free ambient RNA, a complex mixture of transcripts released from damaged, stressed, and dying cells, often exacerbated during dissociation of solid tissues. This ambient RNA creates challenges for downstream analyses and the biological interpretation of results as most analysis methods are based on the assumption that a droplet contains RNA from a single cell. To combat this problem, several methods have been developed to estimate and remove its contribution to gene expression [2,3,4].

High levels of ambient RNA also create challenges in accurately identifying cell-containing droplets. This is a particular problem for data generated from solid tissues, where more fragile cells are more likely to become damaged during dissociation and contribute to ambient RNA. We thus have three scenarios that need to be differentiated: empty droplets containing high concentrations of ambient RNA, droplets containing damaged cells, and droplets containing cells with limited ambient RNA. Using cut-offs based on the total number of RNA fragments assigned to each droplet, such as those originally proposed by Macosko et al. [5] and Zheng et al. [6], risks both including empty droplets and excluding small cells with below-average RNA content. The EmptyDrops method [7] addresses this issue through a more sophisticated approach, calculating the profile of the ambient RNA pool and testing each barcode for significant deviations from this profile. A favored alternative to simple UMI cut-offs, this method has been integrated as the default cell-calling algorithm in the widely used 10x Genomics Cell Ranger pipeline [6]. However, cell-free droplets with high ambient RNA concentration and damaged cells are still retained by this method.

Here, we present DropletQC, a new method that is able to simultaneously improve the detection of cell free droplets and droplets containing damaged cells. Taking advantage of the observation that unspliced and spliced mRNAs can be distinguished in common scRNA-seq protocols [8], we develop a novel metric: the nuclear fraction. The nuclear fraction quantifies, for each droplet, the proportion of RNA originating from unspliced pre-mRNA. Ambient RNA consists predominantly of mature cytoplasmic mRNA. This may arise as RNA is released from damaged cells in which the nuclear envelope remains intact, or capped and polyadenylated transcripts may be more stable in the extracellular environment (Fig. 1). Regardless, droplets that contain only ambient RNA have a low nuclear fraction compared to droplets containing cells (Additional file 1: Figure S1). In contrast, damaged cells due to the depletion of cytoplasmic RNA will have a higher nuclear fraction compared to intact cells. By using the nuclear fraction score in combination with UMIs per droplet, we are able to accurately distinguish between empty droplets, damaged cells, and intact cells.

Fig. 1
figure 1

Illustration of how the nuclear fraction, in combination with the library size of each droplet, can be used to separate the populations of empty droplets, intact cells, and damaged cells

To assess the ability of DropletQC to identify both empty droplets and droplets containing damaged cells, we applied it to four independent scRNA-seq datasets; embryonic mouse brain, glioblastoma tumor, peripheral blood mononuclear cells (PBMCs), and Hodgkin’s lymphoma tumor. To determine whether DropletQC could identify empty droplets missed by current methods, barcodes were filtered using 10x Genomics Cell Ranger 6.1.1 [6], CellBender [9], EmptyNN [10], or EmptyDrops as implemented in DropletUtils [7]. For all methods tested, DropletQC identified additional cell free droplets (Additional file 1: Figure S2). These droplets are identified by DropletQC using an automatic threshold based on the distinctly lower nuclear fraction scores exhibited compared to droplets containing cells, which contain a mixture of mature (cytoplasmic) and precursor (nuclear) mRNA (Fig. 2). To validate the identified droplets do not contain intact cells, the levels of two transcripts; MALAT1 and NEAT1 were quantified for each droplet. These abundant lncRNAs maintain structural roles in nuclear speckles and paraspeckles respectively and are retained exclusively within the cell nucleus [11]. Droplets identified by DropletQC as cell-free displayed low levels of both transcripts (Additional file 1: Figure S3), indicating the droplets identified as empty do not contain intact cells. Cell free droplets represent 9.5% of mouse brain, 6.0% of Hodgkin’s lymphoma, 4.0% of glioblastoma, and 0.4% of PBMCs retained following filtering with EmptyDrops (Additional file 1: Table S1). Cells from dissociated tissue (Fig. 2a–c) contained more empty droplets with high RNA content than PBMCs (Fig. 2d), suggesting ambient RNA may be released from cells damaged during sample preparation.

Fig. 2
figure 2

DropletQC identifies empty droplets and damaged cells in four heterogeneous scRNA-seq datasets. Total UMI counts (y-axis) and nuclear fraction scores (x-axis) are shown for each cell, with colors representing the status of each cell assigned by DropletQC. Empty droplets contain less RNA than cells and a higher fraction of cytoplasmic RNA (low nuclear fraction score). Damaged cells contain less RNA than intact cells and a higher proportion of unspliced RNA fragments (high nuclear fraction score)

Following identification of empty droplets, droplets containing damaged cells are identified using expectation maximization and a Gaussian mixture model to separate them from droplets containing intact cells. As both the total UMI count and nuclear fraction scores display distinct distributions for different cell types (Additional file 1: Figure S4), it is necessary to first annotate cell types, so that damaged cells may be identified separately for each. Cells were annotated for each sample using a combination of gene markers and supervised classification with scPred [12]. Of the remaining cells, 14.0% of mouse brain, 5.2% of Hodgkin’s lymphoma, 9.8% of glioblastoma tumor cells, and one PBMC cell were identified as damaged cells (Additional file 1: Table S1).

As an additional test of the ability of DropletQC to identify damaged cells, we applied the method to data from a recent investigation on the effects of cryopreservation on the transcriptomes of macaque microglia [13]. DropletQC revealed an increase in the proportion of damaged cells following cryopreservation from 4.1 to 13.8% (Additional file 1: Figure S5, Table S2). These findings have implications for the suitability of prospectively archiving samples for scRNA-seq studies through cryopreservation and demonstrates the utility of DropletQC for similar studies. To validate that the cells identified by DropletQC are indeed damaged, we applied the method to data from HEK293 cells stressed with staurosporine and captured in healthy, pro-, and late-apoptotic states [14]. DropletQC identified an increase in both the proportion of damaged cells and cell free droplets as cells progressed toward the late-apoptotic state (Additional file 1: Figure S6a-d). HEK293 cells identified as damaged were associated with a higher mitochdondrial gene content (Additional file 1: Figure S6e) a hallmark of damaged and dying cells [15]. Similarly, damaged cells for other datasets were associated with a higher mitochondrial gene content (Additional file 1: Figure S7).

To assess whether DropletQC is applicable to commonly used 5′ scRNA-seq protocols, we examined two additional scRNA-seq datasets; glioblastoma tumor 5′ v1 and mouse splenocytes 5′ v2. After filtering with EmptyDrops, DropletQC identified both empty droplets and damaged cells in both samples (Additional file 1: Figure S8, Table S3). As common protocols capture unspliced intronic sequences [8], we anticipate DropletQC will be applicable to most scRNA-seq datasets.

By default, the DropletQC software provides a flag for empty droplets and damaged cells, but does not remove them from the dataset. Depending on the biological analyses, damaged cells may retain useful information, and as such, it may be desirable to retain this metadata throughout downstream analyses. In addition, care should be taken that cells are accurately annotated, to avoid confounding distinct cell types as populations of damaged and intact cells. Similarly, cells such as erythrocytes, which contain small amounts of mature mRNA, may be misidentified as empty droplets and can be rescued downstream if desired. We note that intron and exon boundaries should be well-defined in the reference transcriptome for accurate estimation of the nuclear fraction. An option to calculate the nuclear fraction using a user-provided gene annotation file is provided, so that it may be easily recalculated as the quality of a species’ gene annotation is improved.

For samples with large percentages of ambient RNA, some damaged cells and empty droplets may be missed by DropletQC. However, these can be identified by their low RNA content (Fig. 2a) and may be easily flagged using a minimum UMI threshold. We recommend that DropletQC be used in tandem with a tool such as EmptyDrops to prune the majority of cell barcodes, before identifying any remaining cell free droplets or damaged cells. Calculation of the nuclear fraction, identification of empty droplets and damaged cells are implemented as separate functions within the DropletQC package.

In summary, we have shown that DropletQC is able to successfully identify both empty droplets and damaged cells in data from a range of tissue types.

Methods

Nuclear fraction calculation

The DropletQC method first calculates the nuclear fraction for each droplet, which is the proportion of RNA fragments that originate from intronic regions. It is calculated as:

$$ {NF}_i=\frac{\Sigma \left({IR}_i\right)}{\Sigma \left({IR}_i+{ER}_i\right)} $$

where NFi is the nuclear fraction for droplet i, IRi are the reads that map to intronic regions for droplet i, and ERi are those that map to exonic regions. We have implemented two methods to map reads to either intronic or exonic regions. The first takes advantage of region tags, such as those added by 10x Genomics’ Cell Ranger count analysis pipeline that identify the region type of each genome-aligned RNA fragment; exonic, intronic, or intergenic. These are efficiently counted using the nuclear_fraction_tags function to calculate a nuclear fraction score for each provided cell barcode. Alternatively, if region tags are missing, our second method assesses aligned reads for overlap with intronic regions using the nuclear_fraction_annotation function in combination with a user-provided gene annotation file. To speed up processing of indexed, coordinate-sorted alignment files, reads are split across a user-specified number of genomic regions to allow parallel computation. The four samples presented in the manuscript were processed with 8 CPUs and 16Gb of RAM with an average processing time of 106 s per 100 million reads using the nuclear_fraction_tags function and 132 s per 100 million reads using the nuclear_fraction_annotation function.

Identifying empty droplets and damaged cells

Empty droplets are classified as all barcodes that fall below a defined nuclear fraction threshold. To identify a suitable threshold, a kernel density estimate is calculated using the nuclear fraction scores. The first derivative of the estimate is then calculated to identify the local minimum immediately following the first peak, corresponding to the population of empty droplets. If the automatically selected cut-off misidentifies the empty droplet population, two user-adjustable parameters are provided; a nuclear fraction threshold and a total UMI threshold, above which all barcodes are marked as cells.

To identify droplets containing damaged cells, barcodes are assessed separately for each cell type. It is assumed damaged cells have both a lower UMI count and higher nuclear fraction score than undamaged cells. We therefore use a two component (k) gaussian mixture model, implemented with mclust [16], to classify droplets containing damaged cells:

$$ P\left(X\ \right|\mu, \sigma, \alpha \Big)={\alpha}_1N\left(X|{\mu}_1,{\sigma}_1^2\right)+{\alpha}_2N\left(X|{\mu}_2,{\sigma}_2^2\right) $$
(Eq. 1)

where X is a dataset with log10(UMI) and estimated nuclear fractions for 1-n droplets of a given cell type. μ and σ2are the mean and variance, and α represents the mixing weight of a given component. The initial model parameters are calculated as:

$$ {\mu}_k=\frac{\sum_i^{N_k}{x}_{i,k}}{N_k} $$
$$ {\sigma}_k^2=\frac{\sum_i^{N_k}{\left({x}_{i,k}-{\mu}_k\right)}^2}{N_k} $$
$$ {\alpha}_k=\frac{N_k}{N} $$

where Nkis the number of data points in the kth component. Following the initialization, we estimate parameters using expectation maximization by asking what is the posterior probability that a droplet (xi) belongs to component kj:

$$ P\left({x}_i\in {k}_j|{x}_i\right)=\frac{P\left({x}_i|{x}_i\in {k}_j\right)P\left({k}_j\right)}{P\left({x}_i\right)} $$
(Eq. 2)

where,

$$ P\left({x}_i|{x}_i\in {k}_j\right)=N\left({x}_i|{\mu}_{kj},{\sigma}_{kj}^2\right) $$
$$ P\left({k}_j\right)={\alpha}_{kj} $$
$$ P\left({x}_i\right)={\sum}_{k=1}^K{\alpha}_kN\left({x}_i|{\mu}_k,{\sigma}_k^2\right) $$

Nkin the initial component parameters are replaced with the posterior probability and recalculated, with these steps repeated until convergence determined using the Bayesian information criterion. This model identifies the minimum separation required between the identified distributions for a population of droplets to be marked as damaged. We then label droplets as containing a damaged cell based on a higher mean nuclear fraction and lower mean UMI than the cell population, a mean nuclear fraction greater than the cell population mean by a user-adjustable amount (default 0.15), and a mean UMI count lower than the cell population (default 50%).

Data

Cell filtering and annotation

For the mouse brain, Hodgkin’s lymphoma, glioblastoma, and PBMC samples presented in Fig. 2, prior to calculating the nuclear fraction score, all cell barcodes were assessed for a significant deviation from the ambient RNA expression pattern using the EmptyDrops method implemented in DropletUtils [7]. The lower bound on the total UMI count used to identify empty droplets was increased from 100 to 500 and all other parameters were left at their default values. Barcodes below a false discovery rate threshold of 1% were excluded. Remaining barcodes were additionally filtered for a maximum mitochondrial gene content of 15% to exclude low-quality cells, in line with current best practices when assessing common scRNA-seq quality control metrics. Mouse brain and PBMC cell types were annotated by supervised classification with the scPred [12] using annotated PBMC [17], mouse brain [18], and developing mouse brain [19] reference datasets. The glioblastoma sample cell types were identified using cell-type specific gene markers for oligodendrocytes (MAG, MOG, MBP), microglia/macrophages (C1QA, AIF1, LAPTM5), T cells (CD2, CD3D, CD3E), and endothelial cells (CD34, ESAM, APOLD1) [20,21,22,23]. Hodgkin’s lymphoma cell types were classified using marker genes for B cells (MS4A1), macrophages (CD68, IDO1), plasmacytoid dendritic cells (CLEC4C, NRP1), erythrocytes (HBB, HBA1, HBA2), cytotoxic T cells (GZMA, GZMK, IFNG), regulatory T cells (FOXP3, IL2RA, IKZF2), T helper cells (CXCL13, PDCD1, FABP5), naïve T cells (CCR7, IL7R, LEF1), progenitor (CD34), and mast cells (TPSAB1, TPSB2, KIT) [24, 25].

Method comparison

CellBender’s remove-background function was run with the number of epochs set to 150 and fpr of 0.01, as per the default parameters. The total-droplets-included argument was set to 20,000 and the expected-cells argument to 10,000 for the mouse brain and PBMC datasets and 5000 for the glioblastoma and Hodgkin’s lymphoma samples. EmptyNN was run with the number of k-folds set to 10, for 10 iterations. The UMI counts threshold was set to the default value of 100, as well as a value of 500. EmptyDrops was run with default parameters and the lower bound on the total UMI count set to the default value of 100, as well as 500. Barcodes below a false discovery rate threshold of 1% were excluded as empty droplets. To provide a fair comparison with existing practices for assessing QC metrics, cells were additionally filtered with a mitochondrial gene content threshold of 15% before being assessed with DropletQC.