MetaCell: analysis of singlecell RNAseq data using Knn graph partitions
Abstract
scRNAseq profiles each represent a highly partial sample of mRNA molecules from a unique cell that can never be resampled, and robust analysis must separate the sampling effect from biological variance. We describe a methodology for partitioning scRNAseq datasets into metacells: disjoint and homogenous groups of profiles that could have been resampled from the same cell. Unlike clustering analysis, our algorithm specializes at obtaining granular as opposed to maximal groups. We show how to use metacells as building blocks for complex quantitative transcriptional maps while avoiding data smoothing. Our algorithms are implemented in the MetaCell R/C++ software package.
Keywords
RNAseq scRNAseq Graph partition Multinomial distribution Sampling variance Clustering SmoothingBackground
Singlecell RNAseq (scRNAseq) is used extensively for discovery and identification of cell types, for characterizing transcriptional states within them, and for inference of continuous gene expression gradients linking these states. These phenomenological observations are used for creating cell type atlases and as a starting point for analysis of different cellular processes, including differentiation, cell cycle, and response to stimuli [1, 2, 3, 4, 5, 6, 7, 8, 9] (reviewed in [10]). The advent of scRNAseq increased the resolution of models for transcriptional regulation by orders of magnitude compared to prior bulk methods, allowing precise and unbiased analysis of small cell populations as well as opening the way to quantitative modeling of subtle withinpopulation effects.
As technology matures, the analytical basis for interpreting scRNAseq experiments must become more principled. In a way similar to other experimental strategies aiming at improved resolution, scRNAseq relies on the ability to integrate a large number of highly noisy measurements for inferring a highresolution model of some target sample. In analogy, when performing optimal reconstruction of a microscopic sample, a typical microscopic sensor can reduce noise by resampling the same pixel or voxel, trading instrument time with precision and resolution. In scRNAseq, the major source of technical noise (not to be confused with various systematic biases) is introduced through partial sampling of some 1000–10,000 RNAmolecules from the pool of RNA within a cell, generating a highly discrete and noisy estimation for the concentration of any RNA species in this cell except very few superhigh abundance genes. In contrast to the microscopy analogy, the same cell cannot be revisited and resampled to decrease sampling noise, since scRNAseq technology involves lysing the cell. Instead, integration of data from different cells must be used to simultaneously capture the true biological variance among cells and the purely technical sampling variance of the experiment.
When scRNA analysis is tuned toward cell type detection [6, 11], the implicit model assumption is that single cells derived from the same transcriptional cluster are approximately identical. In this case, sampling noise can be overcome by pooling the molecules from a sufficiently large number of cells, such that the expected number of sampled transcripts (or unique molecular identifiers (UMIs)) from each significantly expressed gene allows precise inference of the concentration of this RNA species in the idealized cell state that the cluster represents. When aiming at modeling more subtle molecular states, in particular those involving dynamics of cellular differentiation or response to stimuli, the clustering state homogeneity assumption can no longer hold. In these scenarios, current techniques combine handling of sparse data with modeling (implicitly or explicitly) of cellular dynamics [3, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]. Inference of robust celltocell similarity metrics from sparse data is commonly used for construction of Knn graphs over which dynamics are inferred. Smoothing sparse data [25, 26, 27] or imputation of transcriptional states [25, 28, 29, 30] was proposed as a possible preprocess for modeling similarity in the data. Modelbased inference of transcriptional states from sparse data is on the other hand still difficult to derive, since parametric models for singlecell RNAseq data are lacking. Even though a basic parametric model for the sampling noise in scRNAseq profiles can be easily assumed, it is not routinely explicitly integrated within a broader context of model inference from scRNAseq data.
In this paper, we introduce the notion of metacells and develop a methodology for inferring and using them. A metacell (abbreviated MC) is in theory a group of scRNAseq cell profiles that are statistically equivalent to samples derived from the same RNA pool. Such profiles should therefore be distributed multinomially with predictable variance per gene (approximately proportional to the mean) and near zero genegene covariance. Moreover, given a set of scRNAseq profiles that are derived from the same multinomial distribution, it is trivial to infer the model parameters and establish their statistical confidence. If an entire scRNAseq dataset could be decomposed into disjoint metacells with sufficient coverage per metacell, many difficulties that follow from the sparsity of the data would be circumvented. In practice, one cannot assume a perfect metacell cover of the scRNAseq dataset a priori, and we found that directly searching for metacells using a parametric approach is highly sensitive to the many intricacies and biases of the data. Instead, we propose to use nonparametric celltocell similarities and partition the resulting Knn similarity graphs into densely connected subgraphs, which are filtered to derive approximately multinomial metacells. Metacells can then serve as building blocks for describing complex gene expression distributions with minimal parametric assumptions, scaling well with the number of cells and providing a more accurate approximation when increasing the number of sampled cells.
We implemented tools for deriving metacells and analyzing scRNAseq data using them in the new R/C++ package MetaCell. The utility of the approach was recently demonstrated in scenarios involving analysis of mammalian hematopoiesis differentiation [31], immunotherapy [32], blood cancer [33], and inference of cell type decompositions in comparative whole organism scRNAseq [34, 35]. Here we perform indepth analysis of the model and its performance through reanalysis of datasets including 8000 and 160,000 peripheral blood mononuclear cells (PBMC), and by dissecting two wholeorganism singlecell RNAseq maps from two worm species. The data show that metacells approximate the expression distribution in a surprisingly accurate fashion, dissecting the dataset into truly homogenous local neighborhoods and providing quantitative building blocks for exploring the global expression manifold. We suggest that MetaCell provides, especially as the size of singlecell atlases increases, an attractive universal first layer of analysis on top of which quantitative and dynamic analysis can be developed further.
Results
Overview of the MetaCell method
After the initial construction of a graph partition, we perform pruning and filtering of metacells to increase their homogeneity. We do not enforce a strict multinomial model as empirical data only approximately supports it (see indepth analysis below), and instead ensure that clear violations of homogeneity are filtered. First, outliers are detected and filtered using a simple parametric test for gene overexpression compared to their metacell. Second, the metacells’ homogeneity is verified, and metacells showing strong subcluster structure are split. In practice, splitting is rarely necessary, but outlier detection may require parameter tuning (see Additional file 1: Table S1). Third, metacells representing doublets (composed of groups of profiles that share a similar doublet mixture) are searched for and filtered in a supervised manner. Most of the doublets, however, are identified as such during the outlier filtering stage.
Figure 1a illustrates different types of metacells that are obtained in different experimental scenarios. When a limited number of single cells are sampled from a highly distinct transcriptional behavior, a metacell may define a completely isolated cluster (type I MCs). When a larger number of cells are sampled from a cell state, several metacells may cover it, defining variation in secondary biological behaviors (e.g., cell cycle) or even equivalent transcriptional distributions (type II MCs). More informatively, when sampling a dynamic process that induces a transcriptional gradient across single cells, metacells may create a piecewise approximation of the process (type III MCs). We note that in the latter cases, the MC cover need not be uniquely defined.
Based on a filtered set of metacells, we can robustly explore the scRNAseq transcription manifold, performing markerbased annotation of the metacells, grouping of metacells into higherorder clusters, and visualizing the data by projecting metacells onto a 2D space. In essence, the analysis downstream the identification of metacells is similar to common scRNAseq strategies, but replacing sparse single cells, or smoothed single cells, with fewer but more robust metacell profiles.
MetaCell is readily applicable as an R/C++ package and is scalable to large datasets. The full method and implementation details are given in the “Methods” section. Information on feature selection is provided in Additional file 3.
Metacells eliminate outliers and reconstruct cell type structure in PBMC data
We first illustrate the use of the MetaCell algorithm and pipeline through reanalysis of a small (n = 8276) dataset of PBMC scRNAseq profiles sampled from a healthy donor and downloaded from the 10x website. In a preprocessing step (see Additional file 2: Figure S1), we removed cells with less than 800 UMIs (Additional file 2: Figure S1A) and several noncoding RNAs linked with stress or apoptotic signatures (“blacklisted genes”) (Additional file 2: Figure S1B). We then applied the metacell construction pipeline as outlined above, using 816 high variance genes as features (Additional file 2: Figure S1C, excluding ribosomal proteins) and deriving an initial set of 82 MCs following 1000 resampling iterations using K = 100. The MC outlier/rare cell detection screen then identified 182 cells with at least one outlier gene (8fold or more enrichment over the respective MC model) (Fig. 1b, Additional file 2: Figure S2). Most outlier cells showed potential doublet profiles, coexpressing genes associated with two different cell types. For example, this effect was notable in the association of a coherent megakaryocytic gene module (including PF4, PPBP and more genes) with signatures linked to other cell types. In fact, pure megakaryocyte expression profiles are very rare in the data, and the MC outlier analysis highlights their identification (Additional file 2: Figure S2). In addition to potential doublets, outlier cells also included representatives of rare cell types, including cells expressing progenitor markers (SOX4 [36]) or eosinophilic markers (MS4A2, MS4A3 [37]).
Doublet outlier cells are observed when two cell types are mixed rarely in the data, thereby contaminating a metacell associated with one cell type with a few mixed signatures. More frequent doublet scenarios can give rise to homogeneous doublet MCs, as we observed for two cases combining expression of T cell marker genes (e.g., CD3D) with either B cell (CD79A) or monocyte (LYZ) markers (Fig. 1c). Following the removal of these two doublet MCs, we ended up with a model organizing 7901 cells in 80 MCs (45–176 cells per MC, median size 95 cells) and marking 375 cells as outliers or doublets. This model was annotated using enriched gene markers (Additional file 2: Figure S3) and visualized using a marker heat map (Fig. 1d) and a 2D layout computed from the MC adjacency matrix (Fig. 1e). This visualization organizes transcriptional states in the blood into clear cell type groups representing T, NK, and B cells; monocytes/macrophages; and DC populations. Within these cell types, the maps show additional structure. For example, T cells were organized into CD8+ effector states (marked by GZMH and additional genes), CD8+ preeffector states (marked by GZMK+), CCR7+ CD8+ cells with variable degree of cathepsinW (CTSW) expression, naïve CD8+ cells (IL7R+), and CD4+ cells showing some activation of Treg genes (FOXP3+). Overall, when sampling at a depth of 8000 cells, the metacell analysis allowed for robust identification of cell types and initial modeling of gene expression distribution within them. Additional coverage can lead to refined modeling of transcriptional distributions within cell types as we shall demonstrate below, but first, we will use this basic model to evaluate the similarity structure and homogeneity of metacells.
MetaCell graphs define a symmetrized and modular adjacency structure between MCs
Comparing metacells’ graph closure with their transcriptional homogeneity
To quantify the accuracy of the MC approximation to the similarity graph, we computed the fraction of Knn similarities captured within each MC, which we refer to here as the MC’s closure. As shown in Fig. 2c, the level of closure varies considerably between cell types. Distinct and low abundance cell types (type I MCs) can show very high closure (up to 100%), while multiple MCs that cover abundant cell types (type II or III MCs) show overall low closure (as low as 10% withinMC adjacencies, 20–30% within the three most linked MCs). Imperfect closure may suggest that the MC partition is suboptimal or, alternatively, that the Knn local similarity structure in large and diffused cell types is covered by multiple, nonmaximal but still homogeneous MCs (Type II MCs in Fig. 1a). To test this, we compared the intraMC UMI distribution to the distribution predicted by a simple multinomial model for specific genes and MCs (Fig. 2d). We found that low closure MCs show high degree of consistency with the multinomial model, confirming their homogeneity. Interestingly, MCs with very high closure may show a reciprocal behavior, where additional high variance is present within Knn consistent clusters (e.g., MC #70; note the bimodal distributions observed for most genes). This analysis highlights a key property of the MC partition: MCs are not maximal, and multiple highly similar MCs which are only weakly separated in the similarity graph can together approximate a larger cluster.
Multinomial sampling explains most of the intraMC UMI variance
Systematic screening for genes showing intraMC overdispersion (Fig. 2e) provides a global view on the consistency of the PBMC MC cover with simple multinomial sampling. In this screening, MCs containing residual, nonhomogeneous structure will be associated with many overdispersed genes. For example, this analysis associates the dendritic cells MC #70 with overdispersion of multiple megakaryocyteassociated and other genes. This suggests that these poorly sampled cell types show additional hidden structure and potential remaining outlier cells. The screening also reveals specific genes that are consistently overdispersed across many MCs, such as the earlyimmediate response gene module (including the transcription factors JUN, JUNB, FOS). This overdispersion is consistent with variable levels of activity of this pathway in multiple cell types, perhaps representing technical experimental stress. Other genes are overdispersed in a celltype specific fashion, for example cytotoxic (GNLY, CCL5) genes in NK and T subtypes, and MHCII and LYZ in myeloid cell types. These highly expressed genes may be incompatible with a simple multinomial sampling model, and their analysis may necessitate assuming prior biological variance to allow for overdispersion. Beyond these specific examples, however, intraMC distributions for the entire gene set (including genes that were not used as features for defining similarities) are generally well approximated by Poisson sampling with no zero inflation (Additional file 2: Figure S5). Together, the data shows that the degree of residual, intraMC overdispersion is relatively low in the PBMC MC cover, so that the variance of most genes is accounted for by a model assuming partition of cells into MCs from which UMIs are multinomially sampled.
Analysis of intra and interMC genegene covariance (Fig. 2f) provided an additional avenue for diagnosing structure within and between MCs. We observed persistent intraMC correlations between a limited set of genes, including the overdispersed modules of earlyimmediate genes, MHC class II genes, and S100 genes as well as a correlated gene set including actinrelated genes (ACTB, ACTG1, COTL1, PFN1). We did not observe strong intraMC correlations of cytotoxic and many other functional genes. The scarcity of strong intraMC genegene correlations (see for example Fig. 2g, MC #66) suggests that little residual structure remains within the MCs, and that the dataset is well summarized by the MC profiles. In the few cases where intraMC correlations are observed (Fig. 2g, MC #70), they indicate the need for a more flexible intraMC modeling, or alternatively call for deepening the dataset with more cells defining the transcriptional states underlying the MC.
Metacells are accurate local approximations of the expression manifold
Differences in prediction accuracy should reflect the different similarity measures employed by each method as well as the effect of disjoint partitioning applied in MetaCell. In theory, the partitioning strategy should provide less modeling flexibility compared to approaches that compute cellspecific neighborhoods. The latter effect should be particularly noticeable when several MCs discretize a continuum, such as differentiation trajectory (type III MCs, Fig. 1a). In practice, we observed relatively mild differences between the different approximations (Fig. 3b), with very few genes losing accuracy when MCs are used. Moreover, analysis of the gain in accuracy when including all genes in the models (Fig. 3c) suggested that MetaCell is significantly less exposed to overfitting than the Knn approaches. The diffusionbased smoothing approach showed minimal overfitting, but also loss of accuracy (Fig. 3c). Overall, the nearly multinomial intraMC UMI distribution observed above and the minimal loss of predictive power entailed by the MetaCell disjoint partition, together suggest that MCs succeed in capturing most of the biological variation in the data, while eliminating most of the sampling noise.
Metacells avoid artefactual gradient effects
We showed that the cell partitioning induced by MetaCell does not decrease local approximation accuracy and that, in fact, it even reduces the model’s tendency to overfit the data. We speculated that another advantage of partitioning would be robustness to oversmoothing. The discussion about oversmoothing recently arose in the context of evaluating scRNAseq imputation methods, i.e., methods that use the covariance patterns measured across multiple cells and genes to refine pergene, percell measurements (reviewed here [41]). Most imputation methods are local in the sense that they impute gene expression for a cell using its inferred neighborhood. It has been observed [27, 28] that in some cases imputation tends to enforce spurious proximities between cells, which in turn manifest as artefactual gradients, i.e., discrete states pertaining to be a series of cells gradually modulating expression of certain genes along a temporal process or a spatial axis. While oversmoothing is detected directly when evaluating imputation methods, it is in fact a potential concern with any model regardless of its downstream application, and stems from the manner in which cellcell similarities are defined.
We evaluated the susceptibility of the MetaCell model to oversmoothing using the expression predictions obtained in the previous section (the version without crossvalidation), comparing the different similarity structures included in that experiment. Our results support the robustness of MetaCell to artefactual gradients (Fig. 3d). For example, NK cells are known to be characterized by high levels of KLRF1, but do not express the T cell classical marker CD3 (Fig. 3d, top). Smoothing based on Knn similarity structures (MetaCell’s Knn or Seurat’s) or on diffusion similarities (MAGIC’s) gives rise to phantom gradients that can be interpreted erroneously, for example, as supporting differentiation of NK to T cells or vice versa. The MC statistics generate a much less detailed, but likely more realistic map of joint CD3D/KLRF1 expression. Similar phantom gradients are observed when analyzing CCR7+ CD8+ and CCR7+ CD8− cells (Fig. 3d, bottom). On the other hand, the MC model does reveal expression gradients in cases where sampling adequately supports them, such as in the tradeoff expression of GZMK+ and GZMH+ in T cells (Fig. 3e). These quantitative gradients are refined in the denser dataset we analyze below. Robust modeling of transcriptional gradients by MCs is also demonstrated on simulated data (Additional file 2: Figure S6).
Dissecting complex cell type hierarchies with MetaCell
Highresolution analysis of inter and intracell type states in the blood
Using MCs to define gradients of CD8+ effector T cell activation
Finally, we demonstrate the potential of applying MetaCell for indepth analysis of differentiation gradients through analysis of the transcriptional signatures in effector CD8+ T cells. Activation of the T cell effector program ultimately depends on expression of units of the cytotoxic granule (granzymes, cathepsins, granulysin) and of the machinery required for perforating target cells (e.g., perforin) [45]. Elevated expression of Perforin 1 (PRF1) is indeed observed in a subset of the CD8+ MCs, spanning a spectrum of intensity from background level to 10fold enrichment over it. We observed PRF1 enrichment to correlate strongly with multiple additional effector genes, for example granzyme H and B, FCGR3A, and KLRD1 (Fig. 5d), consistent with the idea of a spectrum of transcriptional states with variable effector gene toolkit expression in the blood. Remarkably, we identified a second set of genes showing elevated expression in MCs with lowtointermediate effector program expression (Fig. 5e), including most notably granzyme K (GZMK) and the phosphatase DUSP2, but possibly also the chemokine receptor CXCR4 and the adhesion/motility molecule AMICA1/JAML. The effector program expression gradient was also associated with decrease in relative housekeeping gene expression (e.g., ribosomal proteins, Fig. 5f). We note that the association between the transcriptional gradient of effector genes and temporal or differentiation processes cannot be assumed immediately. It is nevertheless tempting to suggest that effector program activation involves transient expression of the GZMKlinked genes observed here, suggesting several experimental directions for followup toward a better understanding of T cell commitment and regulation in the blood and other organs, and in particular within tumors [29, 46].
Discussion and conclusions
We introduce here the use of metacells for analyzing scRNAseq data. Metacells are defined as groups of singlecell profiles that ideally represent resampling from the same cellular state. In practice, we compute MCs as a graph partition using adequately processed similarities between singlecell profiles. We demonstrate that in real data, we can construct partitions such that the intraMC UMI distribution can be approximated as sparse multinomial sample, representing sampling from a highly specific transcriptional state with no significant additional variance. We show how to screen for MCs with overdispersion or residual pairwise gene correlations, reflecting deviation from this model and residual intraMC biological variation. We then demonstrate how the MCs can be used for indepth exploration of large data sets involving either a rich set of cell types (whole organism) or a limited and oversampled set (PBMCs). The analysis methodology we advocate involves direct inspection of the MC adjacency matrix, which provides analysts with complete information about cell type hierarchy and supports clustering at appropriate resolution. Combined with visual examination of correlation patterns between MCenriched genes, the result is a detailed and unbiased characterization of cell types and expression gradients that we have already used in several challenging analysis scenarios [31, 32, 33, 34, 35].
The main property that makes metacells a powerful analysis tool is their ability to increase the signaltonoise ratio in the data without introducing biases stemming from mistaken modeling assumptions or oversmoothing of the data. The only manipulation performed by MetaCell on the data is the pooling of highly similar cells, thereby forming a partition of the data. The analyses we present show that, despite enforcing this partitioning, a metacell cover provides accurate local approximations of the expression manifold. At the same time, partitioning entails multiple advantages. Statistically, it greatly reduces the effective number of parameters of the model, making it less prone to overfitting and to oversmoothing compared with naïve smoothing approaches. For the analyst, it allows for the characterization of welldefined, discrete and highly granular states in a conservative and easytointerpret framework.
In cases where residual intraMC structure is detected in the cover, additional cells can be sampled to refine the MC cover and tighten the approximation. Fundamentally however, in any realistic data set, there will always remain some undersampled behaviors regardless of sampling depth, and our current model will not provide a constructive approach for understanding such behaviors beyond signaling them out as nonhomogeneous. Fitting more flexible intraMC models, capable of accounting for not only sampling noise but also convergent processes such as cell cycle or stress [47, 48], or embedding the metacells in hierarchical or multiresolution structures [49, 50] should allow for more efficient extraction of the signals of interest. We view the integration of such models as an important future extension of this work.
Methods
Notation and definitions
We assume raw scRNAseq reads are mapped to genome sequences and assigned to cell barcodes and unique molecular identifiers (UMI) using pipelines that eliminate most of the UMI duplications induced by PCR and sequencing errors. We summarize all UMIs in the molecule count matrix U = [u_{gi}] on genes g ∈ G and cells i ∈ I. We define u_{g} as the total molecule count for gene g on the raw count matrix, and u_{i} as the total number of molecules for a cell (sometime referred to as the cell’s depth). The procedures below are designed to robustly define a metacell partition over the cells, which is denoted by a set of cell subsets M_{k} and a set of outliers O such that \( \left(\bigcup \limits_k{M}_k\right)\cup O=I \).
We assume a set of gene features F ⊆ G is specified and focus our analysis on a similarity graph between cells derived using data from these features (see below). We discuss several strategies for selecting genes in Additional file 3. We note that our features represent individual genes rather than principle components or other forms of reduced dimensions. This enables some direct approaches to testing and correcting the gene expression distributions within metacells. It also forces the modeling of similarities and derivation of metacells to work over highdimensional spaces and to account for noise and sparse data directly. Applying the metacell algorithmic pipeline to similarity structures derived using popular dimensionality reduction techniques is easily applicable as well, as we demonstrate in the results section.
The metacell balanced Knn cell similarity graph
A wellfounded parametric generative model for scRNAseq data is currently missing, mainly due to the limited understanding of the biological variation in transcriptional states within different cell populations, and the remarkable diversity of coupled (e.g., developmental) and uncoupled (e.g., cell cycle, stress) biological processes that are captured in typical singlecell RNAseq maps. We therefore use a simple nonparametric approach for modeling raw pairwise local similarities, which is then refined by additional analysis of the derived cell Knn similarity structure. We transform the raw UMI count U on the gene features F as U ′ = [u′_{gi}] = [log_{2}(ϵ + u_{gi})]_{g ∈ F} and compute the raw similarity matrix using the Pearson correlations on the transformed features R = [r(u′_{gi}, u′_{gj})]_{ij}. A simple variation on this procedure may include prior normalization of the U matrix by downsampling (sampling min(u_{i}) UMIs from each cell without replacement) so as to avoid biases associated with improved accuracy (and thereby higher similarity) between deeper UMI profiles. We however avoid downsampling when the distribution of the number of UMIs per cell is highly variable and correct for the sampling bias when manipulating the similarity graph as described below.
A weighted directed graph G is then constructed using [a_{ij}] as the weighted adjacency matrix. Note that nodes with degrees lower than K are possible following this procedure, since outlier cells may become disconnected or poorly connected during the balancing operations.
Seeding and optimizing graph partitions

Initialize k = 0
 While \( \underset{i}{\max }\ f(i)> size\_\mathit{\min} \) do:

sample a new seed cell j by drawing a sample from cells in I − C with weights proportional to f(i)^{3}

update mc(u) = k for u = j, u ∈ N^{out}(j) − C

Increment k and update C, f.

We terminate seeding using a minimum subgraph size parameter size _ min < K. When we meet the stop criterion, cells that are not associated with a seed (i.e., cells for which mc(i) = − 1) have at most size _ min uncovered neighbors and in particular will almost always have at least one covered neighbor (since the degree in the balanced graph is typically K).
 Until convergence:

Select a cell i

Reassign mc(i) = argmax_{k} w_{ik}

Update weights

Convergence is defined by deriving a partition in which all cells are associated with their highest scoring subgraph. To enforce convergence (which is not guaranteed to occur in general), we slowly increase the score association between cells and their current subgraph after each reassignment. This is especially useful when a large subset of cells (i.e., larger than K) are very homogeneous, which may result in unstable exchange of nodes between several modules covering this subset.
After convergence, there are no formal guarantees on size distribution of the subgraphs produced by the algorithm. Empirically, however, the connectivity of the graph (maximum K outgoing edges) and the seeding process promote a relatively uniform cover partition and prevent convergence toward solutions with very large subgraphs. Rare cases of cells that reside in connected components whose size is smaller than size _ min and were left uncovered during seeding are defined as outliers.
Importantly, the complexity of the entire procedure (seeding and optimization) is linear in the number of cells and the maximum degree K (or alternatively, linear in the number of edges in the graph). An efficient implementation of the algorithm therefore scales well to large datasets, as does its integration within an extensive resampling strategy, as we discuss next.
Resampling graph partitions and computing metacells
We improve the robustness of the above randomized graph partition algorithm using a resampling approach. Given the balanced graph G, we generate a series of subgraphs b = 1. . N_{B} (typically N_{B} = 500) by sampling cells independently without replacement with probability ρ (typically ρ = 0.75) and adding all edges connecting them, forming G^{b} = (V^{b}, E^{b}), V^{b} ⊂ V, E^{b} ⊂ E. For each resampled G^{b}, we apply the partition algorithm, thereby generating a set of partial graph partitions mc^{b}(i) for each i ∈ V^{b}. We summarize all partitions using the matrices O = [o_{ij}] and C = [c_{ij}], specifying how many times the pair of cells i, j were resampled together, and how many times they were both assigned to the same subgraph in the resampled partition, respectively. We then define the resampled cooccurrence matrix as \( {S}^{boot}=\left[{s}_{ij}^{boot}\right]=\left[{c}_{ij}/{o}_{ij}\right] \).
The values in S^{boot} are now used to compute a weighted, nondirected graph, discarding the original correlation distances. We compute for each cell i the value of the K^{core} (typically 30) highest frequency neighbors (denoted T_{i}) and then define a cooccurrence threshold for each pair of cells using the maximal of the two critical values multiplied by a factor T_{ij} = max (T_{i}, T_{j}) ∗ 0.5. Pairs with \( {S}_{ij}^{boot}>{T}_{ij} \) are used as the edges in a new graph denoted as G^{boot} on all cells. Note that G^{boot} is still of non homogeneous degrees, as setting fixed thresholds on edges implies that nodes in large and diffused clusters will have a lower T_{i} values and thereby higher degree than nodes in tight and robust clusters that always cluster in the same subgraphs. The parameter K^{core} provides users of the algorithm with flexible control over the degrees in the derived graph. The final partition solution is obtained by reapplying the same partition algorithm on the graph G^{boot}, resulting in a new set of subgraphs M_{i} and a potential list of outliers. This solution is subject to further filtering and verification, as described next.
Filtering clear parametric outliers from a metacell cover
Note that the regularization (adding 1 to observed and expected count) implies that high fold change values (e.g., > 2) cannot be attained for genes with very low overall UMI counts. However, this regularization is sufficient to ensure robust detection of clear outliers. Cells with one or more genes showing high f_{gi} values are labeled as potential outliers and removed from their metacell cover prior to indepth quantitative analysis of the model.
Verifying metacells homogeneity
Outlier filtering does not guarantee metacell homogeneity in cases where two distinct and significantly separated transcriptional states are grouped together. To screen for such scenarios, we attempt to cluster cells within each metacell M_{k} de novo. Clustering is performed by applying the DBSCAN densitybased clustering algorithm to the intrametacell similarity matrix, computed as the correlation distances described above but restricted to genes exhibiting mildly high intrametacell variance (normalized variance/mean > 1.2). If more than one cluster is detected, we split the metacell accordingly. In practice, metacells almost never include hidden subclusters and testing for splits is used mostly for validation purposes.
Defining the metacell gene expression profile
We then quantify relative expression as the log fold enrichment over the median metacell value:
Note that the lfp values are affected by the composition of metacells in the dataset up to a constant and that ϵ (typically set to 10^{−4}) should be adapted to the typical total molecule count within a metacell.
Metacell regularized force directed 2D projection
(here C = median_{k}( M_{k} ) is a scaling constant). We symmetrize B by replacing it with B′, the sum of its row and columnnormalized forms, and retain as candidate edges only pairs for which b′_{ml} > T_{edge}. We then construct a graph over the metacells G^{M} = (M, E^{M}), by adding the D highest scoring candidate edges (if they exist) for each metacell. This results in a graph with maximum degree D and any number of connected components. We compute coordinates (xm_{k}, ym_{k}) for each metacell by applying a standard forcedirected layout algorithm to the graph G^{M}. We then position cells by averaging the metacell coordinates of their neighbor cells in the original balanced graph G, but filter neighbors that define a metacell pair that is not connected in the graph G^{M}. Averaging allows for layout flexibility along one or few edges in the metacell graph when positioning large cell clusters that are dissected by several metacells.
Implementation
We implemented MetaCell using a combination of C++ and R code. We used parallelization over multicore machines. On a strong XeonE52660 dualCPU machine, the entire analysis pipeline for a small 8200 cells dataset, including bootstrap iterations and computing 2D visualizations, required 2 min and 20 cores, and a maximum of 4.8 GB of RAM. The entire analysis pipeline for a 160K cells’ dataset required 112 min and a maximum of 79GB RAM on the same machine.
Evaluating withinMC homogeneity
Following the computation of the MetaCell partition, our pipeline produces diagnostic statistics and plots to evaluate the level of adherence of the metacells to a multinomial sampling model. To visualize largescale adherence across all genes, we produce per MC plots comparing the coefficient of variation and the fraction of zero counts to the expected under a Poisson model (see examples in Additional file 2: Figure S5). In addition, we visualize adherence to binomial sampling of the top enriched genes per MC by plotting the observed distribution of UMI count and the same distribution sampled from a binomial model (see examples in Fig. 2d). For both observed and expected, counting is done after downsampling all cells within a metacell to uniform total counts. Finally, global diagnostic matrices over all MCs and marker genes (see example in Fig. 2e) are computed as follows: We downsample the UMIs to uniform total counts per MC and compute the binomial likelihood of the observed counts, as well as their overdispersion (observed divided by expected variance). We average these statistics over multiple downsamples and repeat the whole procedure over 999 fake count matrices drawn from the perMC multinomial model. Per gene and per MC, we compute the empirical p value of its likelihood with respect to the binomial null. We output the p values and the overdispersion values and visualize a summarizing heatmap of the latter. Note that when computing binomial statistics, we downsample with respect to feature and enriched genes only, and that the expected distributions are derived from the pool frequencies constrained to these genes.
Comparing local approximation accuracy using expression prediction
We designed a crossvalidation experiment to quantify how well the MetaCell partition captures local celltocell similarities. We divided the gene set into 100 folds, and leaving out each fold at a time computed celltocell similarities on the remaining genes using four different strategies. We next used these similarities to predict, per cell, the expression level of the leftout genes. Finally, we compared the quality of predictions across all genes. A model that captures accurately local similarities in the expression manifold is expected to produce accurate predictions.
The extent of overfitting was tested by avoiding the crossvalidation design and computing a single similarity matrix using all genes per modeling approach. Regardless of whether crossvalidation was used, a cell was never a part of its own prediction pool when comparing prediction accuracy (Fig. 3b, c). In contrast, for plotting the gradients (Fig. 3d, e), the predicted values were generated using all genes and all cells, as in a typical analysis.
Combining Seurat and MetaCell’s filtering criteria, only cells with at least 800 UMIs, number of expressed genes between 800 and 4000, and mitochondrial gene fraction below 0.1 are included. We omitted from the modeling and the evaluation mitochondrial genes and immunoglobulin genes. For MetaCell, we used MC size parameter K = 100 and 500 downsamples of 0.75 of the data during the graph resampling stage. For Seurat (package downloaded on 18/3/26), we used gene selection parameters x.low.cutoff = 0, y.cutoff = 0.8, negative binomial scaling over mitochondrial fraction and number of UMIs, and 40 PCs. For MAGIC (code downloaded on 18/3/19), we used 30 PCs, k = 5, ka = 4, epsilon = 1, and t = 6.
Whole organism scRNAseq analysis
For the Caenorhabditis elegans map, we analyzed the wholeorganism singlecell dataset published by Cao et al. [42] and generated using methanolfixed larval L2 stage cells and a split&pool scRNAseq strategy. We started from a UMI matrix containing 41,449 single cells. We filtered out cells with less than 100 and more than 8000 total UMIs. We used MetaCell to select marker genes with the following criteria: (1) a normalized size correlation below − 0.1 and/or a niche score over 0.1, (2) a minimum of 300 total UMIs observed, and (3) a minimum of 3 UMIs observed in at least three single cells. For MetaCell, we used MC size parameter K = 150 and 1000 downsamples of 0.75 of the data during the graph resampling stage. We computed the final partition from the cooccurrence matrix using a size parameter K = 30, a minimum MC size parameter of 30 and alpha = 2. We filtered outlier cells using a filtering parameter T_lfc = 4, resulting in a final filtered set of 38,149 cells.
For Schmidtea mediterranea, we analyzed the wholeadult singlecell dataset published by Fincher et al. [43] and generated using fresh cells from wholeadult and head area planarian samples and the Dropseq scRNAseq technology. We started from a UMI matrix containing 58,328 single cells. We filtered out cells with less than 500 and more than 18,000 total UMIs. We used MetaCell to select marker genes with the following criteria: (1) a normalized size correlation below − 0.1 and/or a niche score over 0.05, (2) a minimum of 300 total UMIs observed, and (3) a minimum of 3 UMIs observed in at least three single cells. In the graph partitioning stage, we used the same parameters as in the C. elegans analysis. We filtered outlier cells using a filtering parameter T_lfc = 4.5, resulting in a final filtered set of 56,627 cells.
Fine clustering using Seurat
Seurat’s clustering algorithm was used for producing a highresolution clustering of the 160K PBMCs dataset by applying the following procedure: Data was lognormalized and scaled to 10,000 UMIs per cell, 1000 genes with top variance/mean ratio were used as highly variable genes, these genes were rescaled by regressing on percell number of UMIs, and PCA reduction to 45 dimensions was applied to the rescaled variable genes. In order to generate a fine clustering solution, we set Seurat’s resolution parameter to 100, using the approximation parameters nn.eps = 0.5 and n.start = 10, which yielded 817 clusters. We note that Seurat is typically executed with much lower resolution values (0.6–3).
Notes
Acknowledgements
We are grateful to all members of the Tanay group for discussion.
Authors’ contributions
YB and AT designed the study. YB, ASP, YL, and AT developed the MC algorithm with input from AG, EC, and ZM. YB performed the computational experiments with help from AB and YL. AT, YB, YL, AL, EC and MH implemented the algorithms. YB and AT wrote the manuscript with input from all authors. AT supervised the study. All authors read and approved the final manuscript.
Funding
Work in AT group was supported in part by an ERC grant 724824, the Chan Zuckerberg Association, the Wolfson foundation, and FAMRI.
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Supplementary material
References
 1.Kumar RM, Cahan P, Shalek AK, Satija R, DaleyKeyser AJ, Li H, et al. Deconstructing transcriptional heterogeneity in pluripotent stem cells. Nature. 2014;516(7529):56.CrossRefGoogle Scholar
 2.Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, et al. Highly parallel genomewide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161(5):1202–14.CrossRefGoogle Scholar
 3.Jaitin DA, Kenigsberg E, KerenShaul H, Elefant N, Paul F, Zaretsky I, et al. Massively parallel singlecell RNAseq for markerfree decomposition of tissues into cell types. Science. 2014;343(6172):776–9.CrossRefGoogle Scholar
 4.Zeisel A, MuñozManchado AB, Codeluppi S, Lönnerberg P, La Manno G, Juréus A, et al. Cell types in the mouse cortex and hippocampus revealed by singlecell RNAseq. Science. 2015;347(6226):1138–42.CrossRefGoogle Scholar
 5.Reinius B, Mold JE, Ramsköld D, Deng Q, Johnsson P, Michaëlsson J, et al. Analysis of allelic expression patterns in clonal somatic cells by singlecell RNA–seq. Nat Genet. 2016;48(11):1430.CrossRefGoogle Scholar
 6.Grün D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, et al. Singlecell messenger RNA sequencing reveals rare intestinal cell types. Nature. 2015;525(7568):251.CrossRefGoogle Scholar
 7.Levin M, Anavy L, Cole AG, Winter E, Mostov N, Khair S, et al. The middevelopmental transition and the evolution of animal body plans. Nature. 2016;531(7596):637.CrossRefGoogle Scholar
 8.Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, et al. Computational analysis of celltocell heterogeneity in singlecell RNAsequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015;33(2):155.CrossRefGoogle Scholar
 9.Plass M, Solana J, Wolf FA, Ayoub S, Misios A, Glažar P, et al. Cell type atlas and lineage tree of a whole complex animal by singlecell transcriptomics. Science. 2018;360(6391):eaaq1723.CrossRefGoogle Scholar
 10.Tanay A, Regev A. Scaling singlecell genomics from phenomenology to mechanism. Nature. 2017;541(7637):331.CrossRefGoogle Scholar
 11.Grün D, Muraro MJ, Boisset JC, Wiebrands K, Lyubimova A, Dharmadhikari G, et al. De novo prediction of stem cell identity using singlecell transcriptome data. Cell Stem Cell. 2016;19(2):266–77.CrossRefGoogle Scholar
 12.Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of singlecell gene expression data. Nat Biotechnol. 2015;33(5):495.CrossRefGoogle Scholar
 13.Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, et al. SC3: consensus clustering of singlecell RNAseq data. Nat Methods. 2017;14(5):483.CrossRefGoogle Scholar
 14.Li H, Courtois ET, Sengupta D, Tan Y, Chen KH, Goh JJL, et al. Reference component analysis of singlecell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat Genet. 2017;49(5):708.CrossRefGoogle Scholar
 15.Lin P, Troup M, Ho JW. CIDR: ultrafast and accurate clustering through imputation for singlecell RNAseq data. Genome Biol. 2017;18(1):59.CrossRefGoogle Scholar
 16.Bendall SC, Davis KL, Amir ED, Tadmor MD, Simonds EF, Chen TJ, et al. Singlecell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell. 2014;157(3):714–25.CrossRefGoogle Scholar
 17.Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014;32(4):381.CrossRefGoogle Scholar
 18.Haghverdi L, Buettner F, Theis FJ. Diffusion maps for highdimensional singlecell analysis of differentiation data. Bioinformatics. 2015;31(18):2989–98.CrossRefGoogle Scholar
 19.Ocone A, Haghverdi L, Mueller NS, Theis FJ. Reconstructing gene regulatory dynamics from highdimensional singlecell snapshot data. Bioinformatics. 2015;31(12):i89–96.CrossRefGoogle Scholar
 20.Shin J, Berg DA, Zhu Y, Shin JY, Song J, Bonaguidi MA, et al. Singlecell RNAseq with waterfall reveals molecular cascades underlying adult neurogenesis. Cell Stem Cell. 2015;17(3):360–72.CrossRefGoogle Scholar
 21.Ji Z, Ji H. TSCAN: pseudotime reconstruction and evaluation in singlecell RNAseq analysis. Nucleic Acids Res. 2016;44(13):e117.CrossRefGoogle Scholar
 22.Welch JD, Hartemink AJ, Prins JF. SLICER: inferring branched, nonlinear cellular trajectories from single cell RNAseq data. Genome Biol. 2016;17(1):106.CrossRefGoogle Scholar
 23.Street K, Risso D, Fletcher RB, Das D, Ngai J, Yosef N, et al. Slingshot: cell lineage and pseudotime inference for singlecell transcriptomics. BMC Genomics. 2018;19(1):477.CrossRefGoogle Scholar
 24.Zheng S, Papalexi E, Butler A, Stephenson W, Satija R. Molecular transitions in early progenitors during human cord blood hematopoiesis. Mol Syst Biol. 2018;14(3):e8041.CrossRefGoogle Scholar
 25.Van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr A, et al. Recovering gene interactions from singlecell data using data diffusion. Cell. 2018;174(3):716–29.CrossRefGoogle Scholar
 26.Ronen J, Akalin A. netSmooth: Networksmoothing based imputation for single cell RNAseq [version 3; peer review: 2 approved]. F1000Research. 2018;7:8.Google Scholar
 27.Wagner F, Yan Y, Yanai I. Knearest neighbor smoothing for highthroughput singlecell RNASeq data. bioRxiv. 2018; Available from: https://www.biorxiv.org/content/early/2018/04/09/217737
 28.Huang M, Wang J, Torre E, Dueck H, Shaffer S, Bonasio R, et al. SAVER: gene expression recovery for singlecell RNA sequencing . Nat Methods. 2018;15(7):53942.Google Scholar
 29.Azizi E, Carr AJ, Plitas G, Cornish AE, Konopacki C, Prabhakaran S, et al. Singlecell map of diverse immune phenotypes in the breast tumor microenvironment . Cell. 2018;174(5):1293308.Google Scholar
 30.Li WV, Li JJ. An accurate and robust imputation method scImpute for singlecell RNAseq data. Nat Commun. 2018;9(1):997.Google Scholar
 31.Giladi A, Paul F, Herzog Y, Lubling Y, Weiner A, Yofe I, et al. Singlecell characterization of haematopoietic procgenitors and their trajectories in homeostasis and perturbed haematopoiesis. Nat Cell Biol. 2018;20(7):83646.Google Scholar
 32.Li H, van der Leun AM, Yofe I, Lubling Y, GelbardSolodkin D, van Akkooi AC, et al. Dysfunctional CD8 T cells form a proliferative, dynamically regulated compartment within human melanoma. Cell. 2019;176(4):775–89.CrossRefGoogle Scholar
 33.Ledergor G, Weiner A, Zada M, Wang SY, Cohen YC, Gatt ME, et al. Single cell dissection of plasma cell heterogeneity in symptomatic and asymptomatic myeloma. Nat Med. 2018;24(12):1867.CrossRefGoogle Scholar
 34.SebéPedrós A, Saudemont B, Chomsky E, Plessier F, Mailhé MP, Renno J, et al. Cnidarian cell type diversity and regulation revealed by wholeorganism singlecell RNASeq. Cell. 2018;173(6):1520–34.CrossRefGoogle Scholar
 35.SebéPedrós A, Chomsky E, Pang K, LaraAstiaso D, Gaiti F, Mukamel Z, et al. Early metazoan cell type diversity and the evolution of multicellular gene regulation. Nat Ecol Evol. 2018;2(7):1176.CrossRefGoogle Scholar
 36.Laurenti E, Doulatov S, Zandi S, Plumb I, Chen J, April C, et al. The transcriptional architecture of early human hematopoiesis identifies multilevel control of lymphoid commitment. Nat Immunol. 2013;14(7):756.CrossRefGoogle Scholar
 37.Donnadieu E, Jouvin MH, Rana S, Moffatt MF, Mockford EH, Cookson WO, et al. Competing functions encoded in the allergyassociated FcϵRIβ gene. Immunity. 2003;18(5):665–74.CrossRefGoogle Scholar
 38.Haghverdi L, Lun AT, Morgan MD, Marioni JC. Batch effects in singlecell RNAsequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol. 2018;36(5):421.CrossRefGoogle Scholar
 39.Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating singlecell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36(5):411.CrossRefGoogle Scholar
 40.McInnes L, Healy J, Melville J. Umap: Uniform manifold approximation and projection for dimension reduction. rXiv:1802.03426v2. 2018.Google Scholar
 41.Zhang L, Zhang S. Comparison of computational methods for imputing singlecell RNAsequencing data. IEEE/ACM Trans Comput Biol Bioinform. 2018;1. https://doi.org/10.1109/TCBB.2018.2848633.
 42.Cao J, Packer JS, Ramani V, Cusanovich DA, Huynh C, Daza R, et al. Comprehensive singlecell transcriptional profiling of a multicellular organism. Science. 2017;357(6352):661–7.CrossRefGoogle Scholar
 43.Fincher CT, Wurtzel O, de Hoog T, Kravarik KM, Reddien PW. Cell type transcriptome atlas for the planarian Schmidtea mediterranea. Science. 2018;360(6391):eaaq1736.CrossRefGoogle Scholar
 44.Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8:14049.CrossRefGoogle Scholar
 45.Voskoboinik I, Whisstock JC, Trapani JA. Perforin and granzymes: function, dysfunction and human pathology. Nat Rev Immunol. 2015;15(6):388.CrossRefGoogle Scholar
 46.Schelker M, Feau S, Du J, Ranu N, Klipp E, MacBeath G, et al. Estimation of immune cell content in tumour tissue using singlecell RNAseq data. Nat Commun. 2017;8(1):2032.CrossRefGoogle Scholar
 47.Fan J, Salathia N, Liu R, Kaeser GE, Yung YC, Herman JL, et al. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat Methods. 2016;13(3):241.CrossRefGoogle Scholar
 48.Buettner F, Pratanwanich N, McCarthy DJ, Marioni JC, Stegle O. fscLVM: scalable and versatile factor analysis for singlecell RNAseq. Genome Biol. 2017;18(1):212.CrossRefGoogle Scholar
 49.Peixoto TP. Hierarchical block structures and highresolution model selection in large networks. Phys Rev X. 2014;4(1):011047.Google Scholar
 50.Wolf FA, Hamey F, Plass M, Solana J, Dahlin JS, Gottgens B, et al. Graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. bioRxiv [Internet]. 2017; Available from: https://www.biorxiv.org/content/early/2017/10/25/208819
 51.MetaCell: analysis of single cell RNAseq data using kNN graph partitions [Internet]. [cited 2019 Sep 1]. Available from: https://tanaylab.github.io/metacell/
 52.Yael Baran, Akhiad Bercovich, Arnau SebePedros, Yaniv Lubling, Amir Giladi, Elad Chomsky, et al. MetaCell: analysis of single cell RNAseq data using kNN graph partitions [Internet]. Zenodo; 2019 [cited 2019 Jul 14]. Available from: https://doi.org/10.5281/zenodo.3334525.
 53.Datasets  Single Cell Gene Expression  Official 10x Genomics Support [Internet]. [cited 2019 Jul 13]. Available from: https://support.10xgenomics.com/singlecellgeneexpression/datasets
 54.Cell Atlas of Worm [Internet]. [cited 2017 Jun 27]. Available from: http://atlas.gs.washington.edu/wormrna/data/
 55.Cell type transcriptome atlas for the planarian Schmidtea mediterranea [Internet]. [cited 2018 Apr 17]. Available from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111764
Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.