A crucial step in any scRNA-Seq analysis is cell quality control. This step is intended to exclude low-quality cells and doublets that might impair downstream analyses and is typically based on three covariates: the total number of transcripts per cell (count depth), the total number of detected genes per cell, and the fraction of transcripts from mitochondrial genes. In addition, PCA is a basic and broadly applicable approach to identify outlier cells, but applying it properly requires general bioinformatics knowledge. Moreover, many groups rely on ERCC RNA spike-ins and compare the ratio of reads mapped to spike-ins against the total number of mapped reads to detect endogenous transcript loss. However, 10× Genomics, for instance, does not recommend this approach for its assays [14], leaving the experimentalist with the three standard parameters mentioned above.
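As a minimal sketch of how the three standard covariates can be derived from a count matrix, the following Python example computes count depth, number of detected genes, and mitochondrial fraction per cell. The function name `qc_covariates`, the toy matrix, and the `mt-` gene-name prefix (a common convention in mouse annotations) are assumptions made for illustration and are not taken from any specific software package.

```python
def qc_covariates(counts, gene_names, mito_prefix="mt-"):
    """Return per-cell count depth, detected genes, and mitochondrial fraction.

    counts: list of cells, each a list of integer transcript counts per gene.
    gene_names: gene symbols, one per column of `counts`.
    mito_prefix: assumed naming convention for mtDNA-encoded genes.
    """
    is_mito = [name.startswith(mito_prefix) for name in gene_names]
    metrics = []
    for cell in counts:
        depth = sum(cell)                                   # count depth
        n_genes = sum(1 for c in cell if c > 0)             # detected genes
        mito = sum(c for c, m in zip(cell, is_mito) if m)   # mitochondrial counts
        fraction = mito / depth if depth > 0 else 0.0
        metrics.append(
            {"count_depth": depth, "n_genes": n_genes, "mito_fraction": fraction}
        )
    return metrics


# Invented toy data: two cells, four genes (two of them mtDNA-encoded).
genes = ["Actb", "Hcn4", "mt-Co1", "mt-Nd1"]
toy_counts = [
    [10, 0, 1, 1],
    [4, 6, 20, 10],
]

for cell_metrics in qc_covariates(toy_counts, genes):
    print(cell_metrics)
```

In this toy example the second cell carries a high mitochondrial fraction while also expressing more genes than the first, which is the kind of cell a fixed mitochondrial filter would discard regardless of its other metrics.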
Applying the respective filters demands special caution, since aberrant values can have biological explanations. For example, low transcript or gene numbers may be characteristic of quiescent cell populations, and high counts may arise from large cells. Accordingly, thresholds are usually defined by the user for each experiment individually, based on specific guidelines [10, 15]. For low-count filtering, the transcripts per cell are visualized and a threshold is applied at the point where count depths start to decrease rapidly. For high-count filtering, it is recommended that the proportion of filtered cells should not exceed the expected doublet rate. To filter out genes that are expressed in only a few cells and are therefore considered uninformative, the threshold is adjusted to the minimum cell cluster size of interest plus some leeway for dropout effects. Notably, there are no such recommendations for adjusting the threshold for mtDNA-encoded genes, despite an awareness that a high fraction of mitochondrial transcripts can have biological explanations, such as involvement in respiratory processes.
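Two of these guidelines lend themselves to a short illustration. The Python sketch below chooses a high-count cutoff so that the removed proportion of cells does not exceed an expected doublet rate, and keeps only genes detected in a minimum number of cells. The function names, the toy values, and the doublet rate are hypothetical; the actual rate depends on the platform and the number of loaded cells.

```python
import math


def high_count_threshold(count_depths, expected_doublet_rate):
    """Return an upper count-depth cutoff; cells above it are removed.

    The cutoff is chosen so that the number of removed cells never exceeds
    floor(expected_doublet_rate * number_of_cells).
    """
    if not 0.0 <= expected_doublet_rate < 1.0:
        raise ValueError("expected_doublet_rate must be in [0, 1)")
    n = len(count_depths)
    max_removed = math.floor(expected_doublet_rate * n)
    ordered = sorted(count_depths)
    # Keep cells with depth <= cutoff; ties at the cutoff are retained.
    return ordered[n - max_removed - 1]


def genes_to_keep(counts, gene_names, min_cells):
    """Return names of genes detected (count > 0) in at least `min_cells` cells."""
    kept = []
    for j, name in enumerate(gene_names):
        n_expressing = sum(1 for cell in counts if cell[j] > 0)
        if n_expressing >= min_cells:
            kept.append(name)
    return kept


# Invented toy data: ten cells with increasing count depth.
depths = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
cutoff = high_count_threshold(depths, expected_doublet_rate=0.2)
n_removed = sum(1 for d in depths if d > cutoff)
print(cutoff, n_removed)

toy_counts = [[3, 0, 1], [2, 0, 0], [5, 1, 0]]
print(genes_to_keep(toy_counts, ["GeneA", "GeneB", "GeneC"], min_cells=2))
```

Setting `min_cells` slightly below the smallest cluster size of interest corresponds to the leeway for dropout effects mentioned above.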
A fixed threshold of 5% mitochondrial transcripts has become established as the standard and is set as the default in several software packages for scRNA-Seq analysis [11]. Moreover, although aware that the whole heart in general and cardiomyocytes in particular show an average fraction of mitochondrial transcripts significantly higher than 5%, Osorio et al. still concluded in their systematic meta-analysis that 5% mtRNA is an appropriate threshold for murine tissues and that omitting this filter may lead to erroneous biological interpretations of scRNA-Seq data [16].
In contrast, we found that for murine cardiac tissue, adhering to the 5% threshold causes biased results, as distinct cell types are affected by this filter to varying degrees. For example, a large proportion of cardiomyocytes of the SAN region had fractions of mitochondrial transcripts above the threshold, while only very few fibroblasts exceeded this limit. Moreover, we demonstrated here that a high fraction of transcripts from mitochondrial genes also represents a marker for pacemaker cells and that applying the 5% mtRNA filter eliminates this population from the dataset.
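The cell-type-specific effect of a fixed threshold can be checked with a simple tabulation. The following Python sketch computes, for each annotated cell type, the proportion of cells that a given mitochondrial cutoff would remove. The function name and all numerical values are invented for illustration and do not reproduce the data of this study.

```python
from collections import defaultdict


def removed_fraction_by_type(mito_fractions, cell_types, threshold=0.05):
    """Return, per cell type, the proportion of cells above `threshold`."""
    total = defaultdict(int)
    removed = defaultdict(int)
    for fraction, cell_type in zip(mito_fractions, cell_types):
        total[cell_type] += 1
        if fraction > threshold:
            removed[cell_type] += 1
    return {cell_type: removed[cell_type] / total[cell_type] for cell_type in total}


# Invented toy data: mitochondrial fractions with cell-type annotations.
fractions = [0.30, 0.12, 0.04, 0.02, 0.03, 0.06, 0.01]
types = [
    "cardiomyocyte", "cardiomyocyte", "cardiomyocyte",
    "fibroblast", "fibroblast", "fibroblast", "fibroblast",
]

for cell_type, share in removed_fraction_by_type(fractions, types).items():
    print(f"{cell_type}: {share:.0%} removed at a 5% threshold")
```

A strongly unequal removal rate across cell types, as in this toy example, indicates that the filter acts on biology rather than on quality alone.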
Among other cardiac cell populations, a small number of white blood cells showed a relatively high fraction of mitochondrial transcripts. Notably, mitochondrial biogenesis has been shown to be functionally connected with the immune response. In particular, rapid changes, including an increase in the number and mtDNA content of mitochondria, have been observed upon T cell activation [17]. Such effects might be the underlying reason why some white blood cells exceed the 5% limit, and they support the idea that increased metabolic activity results in higher fractions of mitochondrial transcripts. Inflammatory events, for example in the context of myocardial infarction, are of high interest in the cardiovascular field. In summary, the results of this study point to limitations of the standard threshold for mtDNA-encoded genes in investigations of the heart.
More specifically, we have demonstrated for the first time that scRNA-Seq data from pacemaker cells, which are naturally rare, are particularly affected by a lack of proper adaptation of quality control measures. This implies that, at least for cardiovascular research, it will be essential to determine the best-suited mtRNA threshold empirically for each analysis to avoid introducing biases. Going even further, we recommend omitting the fraction of mitochondrial transcripts as a default quality control parameter whenever possible. However, the feasibility of omitting this filter completely depends strongly on the overall quality of the biological samples and needs to be evaluated for each individual experiment.
The raw data set of Goodyer et al., which we have re-analyzed, shows that it is possible to omit the mtRNA threshold completely without negative impacts on the analysis outcome. Yet, a high fraction of mitochondrial reads can complicate cluster annotation and hamper some downstream analyses. To avoid these negative effects, mitochondrial genes can be excluded from the count matrix for later steps. In a recent study on the adult human heart, Wang et al. retained all cells with mitochondrial transcripts < 72% but subsequently removed the respective mitochondrial genes from the count matrix [9]. Unfortunately, the authors provide only this manipulated count matrix instead of the raw data. Thus, it was not possible to verify our findings on the mitochondrial fractions in pacemaker cells with this human data set. In this context, we advocate that providing actual raw data formats becomes common practice, so that the quality control can be customized for each analysis.
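Excluding mitochondrial genes while retaining all cells amounts to dropping columns from the count matrix. The following Python sketch illustrates the idea; the function name, the toy matrix, and the `mt-` prefix are assumptions for illustration and do not represent the procedure of Wang et al.

```python
def drop_mito_genes(counts, gene_names, mito_prefix="mt-"):
    """Return (filtered_counts, kept_gene_names) without mtDNA-encoded genes.

    All cells are retained; only the gene columns whose names start with
    `mito_prefix` (an assumed naming convention) are removed.
    """
    keep_idx = [
        j for j, name in enumerate(gene_names) if not name.startswith(mito_prefix)
    ]
    kept_names = [gene_names[j] for j in keep_idx]
    filtered = [[cell[j] for j in keep_idx] for cell in counts]
    return filtered, kept_names


# Invented toy data: two cells, four genes (two of them mtDNA-encoded).
genes = ["Actb", "mt-Co1", "Hcn4", "mt-Nd1"]
raw_counts = [
    [10, 1, 0, 1],
    [4, 20, 6, 10],
]

filtered_counts, kept_genes = drop_mito_genes(raw_counts, genes)
print(kept_genes)
print(filtered_counts)
```

Because this step discards information, the mitochondrial fraction per cell should be computed and stored beforehand, and the unmodified matrix should be kept available.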
In general, distinguishing signal from noise among the cells of a given cluster can be facilitated by current normalization techniques (e.g., SCT) [18, 19]. Alternatively, specialized tools, such as EMBEDR, recover the ability to separate signal and noise in dimensionality reduction outputs, such as t-SNE and UMAP representations, which is essential for their subsequent use in quantitative analyses [20]. The obtained embedding quality is made available as a cell-wise, interpretable p value that has meaning across datasets. Besides these classical bioinformatics approaches, an alternative means to assess the quality of cardiomyocytes and other cells is to inspect their morphological features visually. For example, the Icell8 platform allows for basic microscopic examination of the cells and detection of stainings in three channels (e.g., for live/dead assays). As most low-quality cells are visibly damaged, cell imaging helps to identify a large proportion of them. However, it is not feasible for all single-cell sequencing methods and is relatively inefficient and time-consuming for larger cell counts.
To avoid the use of arbitrary %mtRNA thresholds, Ma et al. suggested an unsupervised method for optimizing quality control parameters, called EnsembleKQC [21]. The threshold is based on a function of the distribution of the data and represents a more objective method for the quality control of biological samples. However, this approach comes with two limitations. First, a corrupted sample with a large proportion of damaged cells will produce a data set in which most cells show increased mitochondrial fractions. Based on these increased values, the optimized threshold might be inappropriately high. Second, tissue samples that are more heterogeneous might show an unequal distribution of the data. Cell types with unusually high or low mitochondrial fractions might then be excluded for their “abnormal” characteristics.
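The first limitation can be illustrated with a generic distribution-based cutoff. The Python sketch below is not EnsembleKQC; it uses a simple, commonly applied rule (median plus a multiple of the unscaled median absolute deviation) as a stand-in, and all values as well as the choice of three MADs are invented for illustration.

```python
import statistics


def mad_upper_threshold(values, n_mads=3.0):
    """Return median + n_mads * MAD as a data-driven upper cutoff.

    MAD is the unscaled median absolute deviation from the median.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return med + n_mads * mad


# Invented toy data: mitochondrial fractions of a healthy and a damaged sample.
healthy_sample = [0.02, 0.03, 0.03, 0.04, 0.05]
damaged_sample = [0.20, 0.30, 0.35, 0.40, 0.50]

print(round(mad_upper_threshold(healthy_sample), 3))
print(round(mad_upper_threshold(damaged_sample), 3))
```

Because the cutoff follows the data, the damaged sample yields a far more permissive threshold than the healthy one, so that most of its compromised cells would pass the filter.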
Very recently, another data-driven approach was proposed in a preprint by Hippen et al. [22]. Applying mixture models in a probabilistic framework, their QC metric (miQC) combines the fraction of mitochondrial transcripts and the number of detected genes to computationally predict low-quality cells. Using a tumor sample, they demonstrate that miQC preserves more cells within identified clusters and minimizes sub-population bias compared to a uniform threshold approach, which can result in a disproportionate exclusion of certain cell populations, as demonstrated in this manuscript. At present, miQC might be the most appropriate tool to control the quality of scRNA-Seq data from the heart and other heterogeneous tissues in a more objective manner. In general, it is recommended to consider several parameters in conjunction to gain a more detailed overview of the overall quality of the data.