Background

In an analogous manner to the maturation of RNA-Seq methodologies, single-cell RNA-Seq (scRNA-Seq) is now in its infancy and requires new computational tools to realize its full potential for dissecting and understanding the functional meaning of cell-to-cell heterogeneity [1, 2]. Visualization methods provide an effective strategy for inspecting and characterizing the phenotypic diversity between cells. In a typical scenario, the analysis begins with a matrix of expression levels of thousands of genes in hundreds of cells. An appealing way to make sense out of this immense data is to project it into a two dimensional scatter-plot, where each cell is represented by a single point. While such representations provide an easy way to see obvious stratification of cells into a taxonomy of discrete types, they can also provide more nuanced views of gradual transitions, reflecting for instance, developmental processes [3], physical locations [4], or the cell cycle [5]. Indeed, supplementing these two-dimensional views with additional, phenotypical information (e.g. the expression level of marker genes) can be used to provide the correct context, and make the observed diversity between cells interpretable [6, 7]).

There are three main challenges in making effective use of such visualizations for scRNA-Seq data. The first challenge is selecting an appropriate method for dimensionality reduction (projection) among candidates such as principal component analysis (PCA) [1, 2], independent component analysis (ICA) [3] or various non-linear methods such as t-distributed stochastic neighbor embedding (t-SNE) [8], each of which may highlight different aspects of the data. Once a projection is created, a second challenge is to interpret its biological significance, namely which cellular phenotypes [7] or processes [9] are most responsible for the resulting arrangement of cells. Lastly, scRNA-Seq can be difficult to correctly interpret due to technical confounders such as differences in gene capture rates [10, 11]. Performing functional interpretation on the input data without accounting for this effect may lead to incorrect interpretation of the biological meaning of cell-to-cell heterogeneity.

Introducing FastProject

To address these issues, we have developed FastProject, a software tool for the visualization and interpretation of scRNA-Seq data. FastProject allows the user to explore a gene expression matrix using a plethora of two-dimensional visualization methods. To interpret these two-dimensional plots, we use the concept of biological signatures - sets of genes that represent either a collection of genes with a common associated function (e.g. Glycolysis) or a dichotomy between two cellular states of interest (e.g. naïve vs. memory CD8 T-cells)[7]. We evaluate the extent to which these phenotypic dichotomies are reflected in the projections (i.e. to what extent do neighboring cells in the projection have similar values for the genes included in the signature), and highlight the significant projection-signature pairs. This analysis is made possible in single cells by modeling the probability that a missed transcript was actually expressed in the cell, and using these probabilities when evaluating signature scores on samples to minimize the effect of variable capture rates between cells. As a source for biological signatures, we use publicly available datasets that consist of comparative information from hundreds of studies (e.g. MSigDB [12]), which can be supplemented by the user to include any other gene signatures of interest. Through this automated analysis, FastProject therefore provides a systematic view of the main axes of variation in the data, along with their possible biological meaning.

Compared with currently-available software, FastProject is unique in the combination of methods it employs. Visualization software such as viSNE [6], although designed for cytometry data, allows for the integrated visualization of transcript levels overlaid on a two-dimensional tSNE projection. However, viSNE does not incorporate the use of gene sets or provide a method to systematically search for biological variation within a two-dimensional projection. Another method, PAGODA [13], makes use of gene sets, but does so in the context of a clustered heatmap of expression data, not a 2-dimensional embedding.

It is important to clarify up-front that FastProject is not a normalization tool. Indeed, it has been observed by us and others that without proper normalization scRNA-Seq data can be heavily confounded by technical factors such as library depth and complexity [7]. We strongly advocate the use of scRNA-Seq normalization tools (e.g. based on [5] or [7]) prior to any downstream analysis, and we assume that the data has been normalized prior to input to FastProject. Nevertheless, since scRNA-Seq measurements tend to be characterized by strong zero-inflation, we conduct our biological signature analysis while aiming to minimize the effect of gene dropout events (i.e. false negatives, described in [10, 14]).

When running FastProject, processing is done upfront, producing dozens of projection maps (using different algorithms and parameterizations) and their functional annotation, which can be inspected through an interactive HTML report. On a typical data set, consisting of around 1000 cells, processing time is typically between 10 and 30 min (see Additional file 1 for benchmarking results). Because processing is performed upfront, the user can quickly switch between different projection maps in the output report as well as share the results with colleagues who would not themselves need to install FastProject. Importantly, FastProject has been written to be easily extensible so that it may serve as a general platform for deploying and evaluating new gene filtering schemes, false-negative estimation methods, or projection techniques. Instructions for developers on how to augment FastProject are detailed in the FastProject wiki hosted at https://github.com/YosefLab/FastProject/wiki.

Implementation

Overview

The steps involved in the FastProject processing pipeline are depicted in Fig. 1.

Fig. 1
figure 1

The FastProject pipeline. a Diagram describing the FastProject pipeline. A gene expression matrix is taken as input (left), and the resulting output report (right) combines low dimensional-representations of the input with gene signatures to highlight signatures which best explain features in the data. b Configurations for the projection that can be selected among in the output report

The software starts with an evaluation of false negative rates, later used to down-weight the effect of drop-outs on the biological signature analysis. It then selects sufficiently variable genes for further analysis using increasingly stringent criteria. With these genes in hand, FastProject uses 11 different projection methods (summarized below) to calculate two dimensional coordinate for each cell. Based on a user-provided database of gene signatures it then computes a score for every cell/signature pair and uses a randomization test to identify statistically significant projection-signature associations. Importantly, the confounding effect of missed transcripts is mitigated by estimating the probability that each undetected gene was actually expressed in the cell, and down-weighting the contributions of these measurements in downstream analysis (similar to [7]). Altogether, FastProject outputs 76 possible projections (a combination of choosing different gene filtering criteria, whether or not the data was reduced to significant PCs prior to projection, and the final projection method) along with their functional annotations, which can be interactively inspected through a user-friendly HTML report (Figs. 1 b and 3). The results are also provided in the form of text files (including signature scores, projection coordinates etc.), which can be used for downstream analysis.

False-negative estimates

To account for expressed transcripts that are not detected (false negatives) due to the limitations in sensitivity [1, 10, 11, 15], an initial step in the processing pipeline is to evaluate these detection rates so that the subsequent analysis can down-weight the contribution of less reliably measured transcripts. More specifically, it has been observed that the detection probability for a transcript varies with the transcripts relative abundance measurement as well as the total amount of RNA in the sample [10, 14]. To characterize this, we calculate a false negative rate curve which estimates the probability of detection based on a gene’s abundance. A separate curve is estimated for each cell individually so that varying levels of library quality and cell integrity can be accounted for.

To perform this estimate, our procedure utilizes a set of human housekeeping genes [16] under the assumption that these housekeeping genes are constitutively expressed in all cells and only missed due to technical errors. Importantly, as the appropriate set of constitutively expressed genes may differ from study to study and between organisms, FastProject can accept a user-defined housekeeping list.

Our estimation of false negative rates is built on and extends upon our previous work [7]. For each housekeeping gene, we estimate its mean expression level by taking the average of non-zero measurements for the gene. We then use the estimated means to group the genes into 30 quantiles, and denote the mean of genes in quantile 1≤q≤30 as μ q . For each cell j and quantile q, we then compute F jq as the proportion of genes from q that are not detected by j. Based on our assumption of constitutive expression, we treat F jq as an empirical estimate to the dropout rates (i.e. probability that a gene is not detected, given that it is expressed). We use the F jq values to fit a sigmoid function \(\hat {F}_{j}(\cdot)\) that describes the observed dropout rates as a function the genes’ average expression when detected (Fig. 1 a):

$$ \hat{F}_{j}(x)=1/(1+\exp(\alpha_{j} \cdot (x-\beta_{j}))) $$
(1)

where the estimated parameters α j and β j minimize the residual sum of squares (RSS) term:

$$ \sum\limits_{q=1}^{30}(\hat{F}_{j} (\mu_{q}) - F_{jq})^{2} $$
(2)

Applying the fitted function globally for all genes, we estimate the conditional probability for a dropout event for gene i in cell j as: \(P(\text {Gene i is not detected in cell j} \mid \text {Gene i is expressed by cell j})=\hat {F}_{j}(\mu _{i})\) where μ i is the average of gene i’s expression level when detected. Finally, we estimate the prior probabilities for detection and expression of each gene in order to evaluate weights for the missed transcripts, P(Gene i is not expressed by cell j∣Gene i isnot detected in cell j), as described in the Methods section. These probability estimates are used further in the pipeline to reduce the effect of missed transcripts when evaluating signature scores and generating projections. As a first step in validating the efficacy of this approach, a set of simulations were performed (Additional file 2, procedure detailed in the “Methods” section) in which FastProject’s weighting procedure was able to distinguish between true negatives (representing unexpressed genes) and false negatives (representing technical drop-outs).

The fitted sigmoids \(\hat {F}_{j}(\cdot)\) can also be used to provide an overall evaluation of the abundance of false negatives in each cell j by taking the area under the curve. This in turn provides a way to identify and exclude poor quality samples, assuming that such samples have higher dropout rates. Such quality control filter (on cells) is available in FastProject and can be enabled when running the pipeline (it is turned off by default). With this option enabled, samples that score 1.6 median absolute deviation (MAD) units worse than the median quality score are removed prior to calculating signatures and low-dimensional projections.

Generating 2-dimensional projections

Gene filtering

Selecting an informative set of genes is necessary for obtaining biologically meaningful patterns of variability between cells. To this end, FastProject applies a strict pre-filtering step that discards genes detected in less than a threshold number of cells. The default threshold is 20 % of the input cells, however this is configurable as an input option. Subsequently, FastProject produces projections that derive from all pre-filtered genes as well as subsets thereof, calculated using either of two filtering schemes. The first filter option selects bimodal genes, a characteristic which in the past has been useful in isolating biological variation (e.g. cytokines related to Th17 differentiation [7] or antiviral and inflammatory response genes in stimulated dendritic cells [17]). In FastProject bimodal genes are selected using the Hartigan’s Dip Test (p<0.05). The second filter option selects highly-variable genes, based on their Fano-factor (σ 2/μ where μ is the average expression and σ the standard deviation across all cells). An increase in the Fano factor past a value of 1 indicates a departure from a Poisson statistic, which is the steady state distribution expected of constitutively expressed genes [18]. To select candidates with high Fano-factor, similar to the procedure in [19], genes are first stratified into 30 quantiles according to μ, and within each quantile genes are retained if their Fano-factor is more than 2 MAD above the quantile’s median Fano-factor.

Projection methods

The variety of methods available for the task of dimensionality reduction each come with strengths and weaknesses. For instance, projections of scRNA-Seq data based on PCA [7], provide an appealing guarantee about the preservation of variation, and makes the contribution of individual genes readily interpretable [6]. However, the underlying assumptions of PCA may not necessarily be supported by single cell data. In particular, PCA is a linear transformation, which may not be able to accurately capture non-linear relationships in the data (i.e. if the data is embedded within a non-linear low-dimensional manifold). Additionally, PCA posits that the projection axes should be uncorrelated, which again may not necessarily result in the most informative representation. The same criticisms apply for other linear methods such as ICA [3]. A complementary, and commonly used approach (t-SNE [8]) uses a non-linear projection that aims to preserve the structure in the data locally. However, while this method aims to ensure that points that are close in the high dimensional space will be close (with high probability) in the low dimensionality embedding, more global relations are not directly interpretable from the results. To avoid the issue of choosing up-front between these different options, FastProject uses these methods plus additional non-linear projection methods, including: ISOMAP [20] (using four nearest neighbors when defining the adjacency graph), PCA with a radial basis function (RBF) kernel (with kernel coefficient of \(\frac {1}{\text {Number of genes}}\)), Multi-Dimensional Scaling (MDS), and Spectral Embedding [21] (with Graph laplacian formed using k-nearest neighbors with \(k=\frac {\text {Number of cells}}{10}\)). For the linear PCA we consider three combinations of principal components (1st and 2nd, 1st and 3rd, 2nd and 3rd); and for tSNE we use two perplexity values (10 and 30), totaling in 11 projection methods overall. All methods are run as implemented in the Scikit-Learn package for Python [22]. After each projection, the resulting sets of 2-dimensional coordinates are mean-centered and scaled such that the average r 2=[x c o o r d i n a t e]2+[y c o o r d i n a t e]2 equals 1. This standardization is performed so that signature-projection scores (defined below) are more comparable between projections.

Incorporating false-negative estimates into projections

Non-linear methods such as t-SNE have been shown to effectively combine with PCA. In this approach, PCA is performed first to reduce high-dimensional data to an intermediate number of dimensions, and this intermediate representation is further reduced to 2 or 3 dimensions by the non-linear method. The procedure is effective, assuming that the initial PCA operation preserves the main structure of the data while discarding unnecessary components.

FastProject makes use of this procedure to allow for a general method of incorporating weights (derived from false-negative estimates) into nonlinear projections. Prior to a non-linear projection, a weighted-PCA is performed by using a weighted covariance matrix (similar to [7]). Entries in the weighted covariance matrix are calculated as:

$$ cov(X_{a}, X_{b}) = \frac{ {\sum\nolimits}_{j}(X_{aj} - \overline{X}_{a})(X_{bj} - \overline{X}_{b})w_{aj}w_{bj}}{ {\sum\nolimits}_{j}w_{aj}w_{bj}} $$
(3)

where X ij is the log-transformed expression of gene i in cell j and w represents the matrix of weights of equivalent size, defined as:

$$ {}\begin{aligned} w_{ij} = \left\{ \begin{array}{cc} P(\text{i not expressed in j} \mid \text{i not detected in j}) & X_{ij}=0 \\ 1 & X_{ij} \ne 0 \end{array}\right. \end{aligned} $$
(4)

To select the number of components to retain, following weighted-PCA, the software employs a randomization scheme, as described in [23], to select the top principal components with statistically significant (p<0.05) contribution to the overall variance (retaining at least 5 components when less than 5 are significant). All the subsequent non-linear projections are evaluated on this reduced-dimension matrix. To provide a way for evaluating the effects of this reduction, FastProject also runs each non-linear projection method on the original, non-reduced matrix, allowing the user to switch between these representations in the output report.

Signature-based analysis

To interpret the biological meaning of the organization of cells in the resulting two dimensional maps, FastProject incorporates domain knowledge through the input of gene “signatures” [7]. The signatures can reflect a comparative analysis between two conditions of interest and consist of a set of genes, each of which is labeled as either “up-regulated” or “down-regulated” in that comparison. Signatures, such as these, are based off of annotations of cell states obtained from public databases (e.g. the Immsig collection [12]), or provided by the user. For each signature, a score is computed against each cell by aggregating over the weighted expression level of its genes. Specifically, for signature S and cell j, the respective signature score R s (j) is calculated as:

$$ R_{s}(j) = \sum\limits_{i \in S} {sign}_{S}(i) \cdot X'_{ij} \cdot w_{ij} / \sum\limits_{i \in S} w_{ij} $$
(5)

Where \(X^{\prime }_{ij}\) is the standardized (Z-normalized across all cells) log expression level of gene i in cell j,w ij is the estimated False-Negative weight (defined above), and s i g n S (i)=−1 for genes in the "down-regulated" set and +1 otherwise. Notably, signatures can also be undirected, in which case the sign value is set to +1 for all member genes.

Projections vs. signatures

A signature-projection consistency score is calculated to evaluate how well each projection reflects the phenotypic variation that is captured by each signature. To this end, for each pair of signature, s, and projection, p, we compute a signature consistency score representing the extent to which neighboring points in the projection have similar signature scores. This is done by calculating for each cell j, an estimated signature rank \(\hat {r}_{sp}(j)\) based on its location in the two dimensional plot:

$$ \hat{r}_{sp}(j) = \frac{ {\sum\nolimits}_{k \ne j} r_{s}(k) e^{\Delta^{2}_{jk}/\alpha^{2}}}{ {\sum\nolimits}_{k \ne j} e^{\Delta^{2}_{jk}/\alpha^{2}}} $$
(6)

where Δ jk is the Euclidean distance between cells j and k in the projection, α defines an effective neighborhood size (set to 0.33 by default), and r s (j) is the rank of the signature score for cell j (i.e. a rank transformation of the quantitative signature scores R S ). The signature-projection consistency is then determined by the respective goodness of fit:

$$ Consistency(s,p)=1-\frac{median(| \hat{r}_{sp} - r_{s} |)}{Total\ Cell\ Count} $$
(7)

In this way, each signature/projection pairing is scored based on how similar signature scores are for samples located nearby in the projection space. To identify significantly consistent signature/projection pairs, we use signatures of randomly selected genes to create empirical background distributions of signature scores. We compare the consistency scores computed for the original signatures with those of the random ones (for example, see Fig. 2 a), obtaining P-values using a Z-test and correcting for multiple hypothesis testing using the Benjamini Hochberg procedure. We observed that the distribution parameters of scores generated by random signatures varied with the number of genes in the signature. To account for this, separate distributions are generated for different signature sizes (10, 20, 50, 100, and 200 genes) and when assessing the significance of a signature score, the score is compared against the background distribution with the most similar number of genes. In the output, we report all the signatures that had a significant match (F D R<0.05) with at least one projection.

Fig. 2
figure 2

Behavior of Signature Scores. Behavior of signature scores calculated on the human glioblastoma scRNA-seq data from Patel et al., 2014 [2]. a Distribution of Signature/Projection consistency scores across four different types of signatures, Signed (signed immunological signatures from MSigDB), Unsigned (various unsigned hallmark and pathway signatures from MSigDB), Random Signed (signed signatures with randomly selected genes), and Random Unsigned (unsigned signatures with randomly selected genes). Lower panel shows distributions from the same signatures, run on data in which gene expression levels have been shuffled within each cell. Comparing these, it can be seen that biological signatures tend to have higher consistency scores than random signatures and this distinction disappears using shuffled data. b Distribution of the Pearson’s correlation coefficient between signature scores and a confounding variable - the proportion of undetected genes in a sample. Upper plot shows correlations when signature are calculated by simply taking the unweighted average of log expression level for genes in the signature. Lower panel shows the effect of using the weighted method presented here

Results

Software

FastProject has been implemented as a command line Python package. As an input the software receives: (1) an expression matrix in a tab-delimited format (genes in rows, cells in columns). (2) a set of gene signatures, using the standard GMT format. Such sets of directed signatures are publicly available from various databases, such as MsigDB [12] (e.g. including signaling effects of genetic and chemical perturbations, cell cycle signatures, and comparison of cell types) and NetPath [24] (transcriptional effects of signaling cascades). Un-directed gene sets are naturally more abundant, and can be drawn from resources such as Gene Ontology [25], KEGG [26, 27], and MSigDB [12] (note that the latter includes much of the information in the former two). Importantly, the user can also upload his/her own signatures that reflect a phenotype of interest. For instance, in the example below, we study glioblastoma cells, and use signatures derived from microarray experiments, which define different tumor sub-types. FastProject also accepts pre-computed signatures, namely, an association of cells with pre-computed values. These can be categorical (e.g. annotations of clusters, computed by a different tool); or numerical (e.g. additional meta data associated with cells; e.g. levels of a handful of proteins, or technical factors such as library complexity; see Methods section for description of how pre-computed signatures are analyzed).

FastProject will calculate projections, signature scores, and their associations, covering all the options above (totaling in 76 different projections; Fig. 1 b). To examine the extent to which the projections are affected by zero values, FastProject treats the sample quality score (defined above), and the percentage of genes with zero expression as pre-computed signatures and evaluates their association with each projection method. The output is provided as an HTML file (Fig. 3), where projections, signatures and their associations can be inspected interactively. Additionally, a data export feature embedded in the HTML report allows the generation of tab-delimited text files that depict the output projection coordinates, signature scores, and other relevant information. The source code, running, examples and user manual are bundled with this manuscript submission and also hosted at https://github.com/YosefLab/FastProject.

Fig. 3
figure 3

FastProject Output Report. a Screenshot of FastProject interactive output report. 1) Controls for changing which genes were used when generating the projection and whether or not PCA was applied first. 2) Table displaying significance of the consistency score for each signature/projection pairing. Each row represents a signature and each column, a projection method. Clicking a cell in the table selects a signature and projection. 3) Scatter plot showing the selected projection annotated (color) with signature scores from the selected signature. 4) Heatmap showing average expression level of genes within each cluster. The clustering method can be changed through the dropdown menu in the same panel. b Corresponding scatterplot when selecting projection tSNE30 and the Patient signature. c Scatterplot for a signature representing response to the PPAR γ agonist rosiglitazone

Extending FastProject

FastProject has been designed using a modular architecture so that new projection methods or criteria to filter genes can easily be added to the pipeline. Since dimensionality reduction is still an active research area, this allows new methods to easily be compared against more traditional approaches. For example, the recently proposed ZIFA algorithm [28] can be added by appending the following lines to Projections.py:

This is documented in the software wiki, hosted with the project repository at https://github.com/YosefLab/FastProject/wiki.

Proof of concept

We applied FastProject to a recently published data sets of tumor cells from five glioblastoma patients [2]. The analyzed data, consisting of 430 single cells with mRNA abundance in units of transcripts per million (TPM) and normalized as described in [2], was downloaded from the Gene Expression Omnibus, accession number GSE57872. We applied FastProject on this data, using a large collection of both “signed” (i.e. up- and down- regulated genes) and “unsigned” signatures from MSig DB (the C2 (curated), H (Hallmark), and C7 (Immune signatures) collections) and NetPath [24].

As a first check, we explored the distributions of signature consistency scores obtained for the original vs. random signatures, and compared the results to an application of FastProject on randomized datasets, where the entries in each column (Cell) were shuffled (i.e. maintaining the percentage of zeroes in every cell; Fig. 2 a). As expected, we see pronounced differences between the original input signatures and the randomized ones when FastProject is applied on the non-shuffled data, and these differences disappear when we apply FastProject on the randomized data. As a second test, we evaluated the extent by which the signature consistency scores are biased by the abundance of zero-values in the analyzed cells. As expected, when we do not down-weigh the potential false negative entries, the signature consistency scores highly correlate with the amount of zeroes in each cell; namely the analysis primarily reflect what might be a result of technical dropouts. Conversely, down-weighing the false negatives reduces this bias (Fig. 2 b). We repeated this procedure using a second dataset of scRNA-Seq data from mouse dendritic cells responding to antigen stimulation [1], obtaining similar results (Additional files 3 and 4).

Examining the output report of FastProject (Fig. 3 a), we first observe that the various projection methods correctly stratify the cells according to their respective donors (Fig. 3 b). More importantly, FastProject automatically picks up on several of the most important features in this data, providing a proof of concept for its utility as an unbiased analysis tool. Specifically, the two dimensional position of the cells is highly consistent with their scoring according to an epithelial to mesenchymal transition signature, which is a strong marker of poor prognosis [29, 30]. The two dimensional positions are also associated with signatures of other key pathways altered in cancer, including immune and hypoxic responses as detailed below. While these observations were made using a general database of signatures (MSigDB), we supplemented our analysis with case-specific signatures - in this case gene signatures from TCGA that are predictive of Glioblastoma subtypes [30] to provide further support. As expected we see high level of concordance between the TCGA-derived scores of the Mesenchymal tumor sub-type and the epithelial to mesenchymal transition signatures from MSigDB. We also see the mirror image of the cell ranking when we consider the TCGA-derived signature of the Proneural tumor subtype of glioblastoma.

In addition to the HTML report, FastProject outputs all the cell-signature scores in textual format. Taking advantage of this feature, we were able to more closely inspect the relationship between the different pathways that were highly correlated with the two dimensional positions and reveal new associations in the data. Considering the top ranked signatures (p<10−15 for at least one projection method), and filtering overlapping signatures (Jaccard coefficient >30 %; 63 signatures remaining), we observe a clear pattern of signature clusters (Fig. 4). Interestingly, the mesenchymal signature is positively correlated with the expression of coagulation/complement genes (whose expression in the glioblastoma cells studied here was already verified by [2]). Both signatures are also correlated with the TNF α signaling, which supports previous findings concerning the role of this pathway in mesenchymal emergence [31]. On the other hand, the mesenchymal signature is negatively correlated with a hypoxia signature. While hypoxic regions are characteristic in solid tumors [32], this inverse correlation is surprising and possibly aligned with the up-regulation of angiogenetic markers in mesenchymal glioblastoma tumors [29]. Finally, we see a strong negative correlation with a signature of response to the PPAR γ agonist rosiglitazone, which aligns with previous observations of beneficial effects PPAR γ agonists have in glioblastoma [3335]. In addition to the inter-donor variation, FastProject’s visualization also highlights potentially important variation within a tumor. Indeed, the cells from patient MGH31 (Fig. 3 b, purple) are clearly divided in accordance to the two programs mentioned above - with cells with low mesenchymal and high hypoxic score on one side and the mirror image on the other.

Fig. 4
figure 4

Discovering Correlations between Signatures. FastProject makes its data amenable to further analysis by outputting signature scores and projection coordinates in text format. Shown here is a covariance matrix between top-ranked signatures (p<10−15 for at least one projection method) after removing overlapping signatures (J a c c a r d c o e f f i c i e n t>30 %) revealing signatures with similar patterns of expression

The glioblastoma case study underscores the utility of FastPtroject as a tool for scRNA-Seq data exploration. Starting from a normalized input matrix of gene expression in single cells, and a generic set of signatures, it clearly highlighted some of the main sources for phenotypic variation between cells, and the relation between these sources. Such insights provide an important first step in working with data sets an immense and as complex as the one presented here.

Conclusions

FastProject is a flexible, comprehensive, and automated pipeline that combines multiple techniques for the analysis of scRNA-Seq. It provides a first glance on the main axes of variation in the data (captured by projections of interest) and their biological meaning (the biological signatures that may explain the organization of cells within the projections). The tool was designed with a flexible API in mind, with the aim of establishing a general platform that will be used by the scRNA-Seq research community for deployment and evaluation of future methods, such as normalization, batch correction, removal of undesired effects (cell cycle, drop-outs), gene/cell filtering, and dimensionality reduction.

Methods

False-negative estimates

Let P(E ij N o t D ij ) denote the probability that a gene i is expressed in cell j conditioned that it is not detected (i.e. probability for a false negative). To estimate this probability we first estimate the priors for the detection, P(D ij ), and expression, P(E ij ), events:

For P(D ij ) we use the percentage of cells that detect gene i (e x p r e s s i o n>0), which we denote as \(\bar {D_{i}}\). For P(E ij ), one approach would be to use:

$$ {\small{\begin{aligned} P(E_{ij}) & = \frac{P(E_{ij}\ and\ D_{ij})}{P(D_{ij} \mid E_{ij})} \\ & = \frac{P(D_{ij})}{P(D_{ij} \mid E_{ij})} \\ & \approx \frac{P(D_{ij})}{1-\hat{F}_{j}(\mu_{i})} \\ & \approx \frac{\bar{D_{i}}}{1 - 1/(1+\exp(\alpha_{j} \cdot (\mu_{i} - \beta_{j})))} \end{aligned}}} $$
(8)

where μ i is the average of gene i’s log expression when detected and \(\hat {F}_{j}(\mu _{i})\) (defined in Eq. 1) represents P(N o t D ij E ij ) for each cell. However, in order to get a more robust estimation, we use the population mean of the conditional probability, taking: \(P(E_{ij}) \approx \frac {\bar {D_{i}}}{\frac {1}{N} {\sum \nolimits }_{k=1}^{N} P(D_{ik} \mid E_{ik})}\) where N is the number of cells in the dataset. Combining these terms, the full expression is:

$$ \begin{aligned} P(E_{ij} \mid Not\ D_{ij}) & = \frac{P(Not\ D_{ij} \mid E_{ij})P(E_{ij})}{P(Not\ D_{ij})} \\ & = \frac{P(Not\ D_{ij} \mid E_{ij})\frac{\bar{D_{i}}}{\frac{1}{N} {\sum\nolimits}_{k=1}^{N} P(D_{ik} \mid E_{ik})}}{1 - \bar{D_{i}}} \\ & = \frac{\hat{F}_{j}(\mu_{i})\frac{\bar{D_{i}}}{\frac{1}{N} {\sum\nolimits}_{k=1}^{N}1-\hat{F}_{k}(\mu_{i})}}{1 - \bar{D_{i}}} \\ \end{aligned} $$
(9)

This value is then complemented to obtain P(N o t E ij |N o t D ij ) (the true negative probability), which is used to weight missed transcripts. Notably, on occasions where the detection rate in a cell is higher than the prior estimate, this formulation results in a negative value. We therefore restrict the estimate to the range [ 0,1] by applying a clipping operation.

Simulating false-negatives

To evaluate our approach, we first applied it on simulated data. Count values for each gene were either set to zero (with probability P i ) or drawn from a negative binomial distribution parameterized by the mean, μ i , and the dispersion, \(\gamma _{i} = ({\sigma ^{2}_{i}} - \mu _{i})/{\mu _{i}^{2}}\). P i was set to 1.0 for the first 500 genes (considered to be housekeeping genes, and therefore constitutively active) and varied uniformly from 0 to 1 for the remaining genes. μ i was generated from an exponential distribution with m e a n=75 and γ was generated using a uniform distribution from 0 to 4 (reasonable distributions and ranges for μ and γ found by measuring these parameters using non-zero values from the Shalek et al. [1] single-cell data). The generated counts were transformed using l o g(x+1). In this way, an expression matrix of 3000 genes across 2000 cells was simulated.

After generating ‘true’ expression data, sigmoidal drop-out curves (as defined in Eq. 1) were generated for each cell by drawing α and β from uniform distributions (α∈[ 2,6];β∈[ 1,3]). Using these curves, ‘measured’ expression data was generated by dropping-out non-zero measurements using each cell’s drop-out characteristic and each gene’s non-zero mean. Using FastProject’s weight-estimation procedure, weights were estimated using the ‘measured’ data matrix. For comparison purposes, weights were also generated using the likelihood, P(D e t e c t i o nE x p r e s s i o n), as described in Gaublomme et al. [7]. The results, detailed in Additional file 2, demonstrate that FastProject’s weighting system produces different weight distributions for true negatives (un-expressed genes) as opposed to false negatives (dropped-out genes), assigning higher weights to the true negatives, as would be expected in an effective weighting procedure.

Clustering

For each of the projections, the FastProject HTML report includes a simple clustering analysis using the cells’ positioning in the respective two dimensional map. Specifically, samples are clustered based on Euclidean distance in the two-dimensional space using k-means with different k values. These clusters are used when rendering a heat-map below the projection showing the (per cluster) average z-score of expression for each gene in the signature.

Evaluating the consistency of projections and categorical pre-computed signatures

A different method is used to evaluate the significance of signature/projection pairings when operating on pre-computed signatures. For numerical pre-computed signatures, the assigned values are shuffled among cells and for each iteration of this procedure, the signature/projection consistency score is evaluated. P-values are then assessed by comparing the unshuffled score against this distribution using a Z-test as above. For factor signatures, it is necessary to calculated the consistency score in a different manner. To this end, for each pair of signature s and projection p we compute a signature consistency score representing the extent to which neighboring points in the projection are assigned to the same category. First, we evaluate a neighborhood-consistency score for every cell:

$$ {\small{\begin{aligned} \hat{L}_{sp}(j) = \frac{ {\sum\nolimits}_{k \ne j} \delta_{s}(j,k) e^{- \frac{\Delta^{2}_{jk} }{\alpha^{2}}}}{ {\sum\nolimits}_{k \ne j} e^{- \frac{\Delta^{2}_{jk} }{\alpha^{2}}}} \end{aligned}}} $$
(10)

where

$$ {\small{\begin{aligned} \delta_{s}(j,k) = \left\{ \begin{array}{cc} 1 & \text{samples k and j have the same label} \\ 0 & otherwise \end{array}\right. \end{aligned}}} $$
(11)

In this way, \(\hat {L}_{sp}(i)\) is closer to 1 if most of sample i’s neighbors have the same label. The signature-projection consistency is then determined by a measure of the overall consistency:

$$ Consistency(s,p)=median(\hat{L}_{sp}(j)) $$
(12)

These consistency scores are compared against a distribution of scores calculated from shuffled label assignments to assess significance using a Z-test as above.

Availability and requirements

Project Name: FastProject.Software Repository: https://github.com/yoseflab/fastproject.Software Manual: https://github.com/YosefLab/FastProject/wiki.Archived Versions: https://pypi.python.org/pypi/FastProject.Operating Systems: Platform-Independent.Programming language: Python.License: Free for Educational, Research, and Not-For-Profit purposes. Results from the analysis in this article can by found at https://github.com/YosefLab/FastProject/wiki/Example-Output-Reports.