Background

The greater precision afforded by single cell sequencing has broadened the scope of the average sequencing study. Unlike conventional bulk sequencing methods, which profile an aggregate of hundreds of thousands of cells, single cell sequencing methods make it possible to isolate a single cell and perform different types of omics profiling, including genomics, transcriptomics, epigenomics and proteomics [1]. One prominent technique that measures gene expression at the single cell level is single cell mRNA sequencing (scRNA-seq) [1, 2]. Unlike bulk sequencing, it unmasks the fundamental, widespread heterogeneity in gene expression among cells in a tissue, or among cells considered to be of the same type based on canonical markers [3, 4]. Hence, rather than simply examining differential expression between two samples, we can identify the cell types, and the genes expressed within each cell type, as a first step before differential expression analysis [4, 5]. Not only does this first step provide valuable insights into the transcriptomic profiles of individual cell types and states, but it also provides a deeper context for the subsequent differential expression analysis.

However, cell type identification is a multi-step process, and this has led to an explosion of new single cell software applications, referred to as a “cottage industry” [6]. According to Awesome Single Cell (https://github.com/seandavi/awesome-single-cell), a site that compiles a list of new single cell analysis methods, 118 methods have recently been created for analyzing single cell sequencing data (normalization, dimensionality reduction, clustering and differential expression), including a plethora of methods required for cell type identification.

Hence, it is necessary to comparatively assess the different tool combinations (aka pipelines) to determine which is best at cell type identification. Comparative analyses have been published on sequencing [7, 8], normalization [9] and clustering [10, 11]. Yet there has not been a comprehensive study that assesses whole pipelines and addresses broader issues of experimental design in cell type identification.

We developed a computational method to address this important gap: an easy-to-use R package, SCEED (Single Cell Experimental Design and Analysis). The package can simulate scRNA-seq data with user-provided statistical characteristics: total number of cells, number of genes, cell type proportions, marker genes and fold change (fC) of marker genes. The simulated dataset, whose cell types are known, can be analyzed using published cell type identification algorithms by incorporating them in SCEED. Systematically comparing the results of each analysis pipeline to the known true labels using the F1score (for details see Methods) makes it possible to identify the optimal single cell analysis procedure for the dataset, and also helps to determine the number of cells required for adequate power to detect the cell types.

Methods

The schematic of SCEED is shown in Fig. 1. Each step in SCEED is described below.

Fig. 1 Schematic representation of the SCEED pipeline. (Left to right) First, a simulated dataset is generated using the SCEED “generateDataset” function with the input parameters described under “Data simulation”. Next, the simulated dataset is analyzed using different single cell analysis procedures. To test the performance of each single cell algorithm, the F1score, a measure of a test’s accuracy, is computed. Finally, based on the F1score cutoff chosen by the user, the best analysis procedure and the number of cells required to perform the single cell experiment are selected.

Data simulation

Our procedure to simulate a single cell dataset is shown in Fig. 2. In step 1, a gene-by-cell expression matrix is simulated using the Splatter package [12], which simulates m cell types of given rarity/prevalence with n cells. In step 2, each cell type expresses a specific number of marker genes g with specific fold change levels fC. The mean expression level of each marker gene gi in group k is simulated as the product of a group-specific fold change level (sampled from a negative binomial distribution with shape = fCi and rate = 1) and the mean expression level of gi in all cells that are not part of k. For each cell in k, the final expression level of marker gene gi is the product of the simulated mean of gi and a library size simulated using Splatter [12]. The remaining steps are shown in Fig. 2.
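The marker gene step described above can be illustrated with a minimal Python sketch. It is not SCEED's implementation (SCEED is an R package built on Splatter). The function name, the toy library sizes and the input values are hypothetical. The text specifies a draw with shape = fCi and rate = 1; because shape and rate are the usual parameters of a gamma distribution, the sketch uses a gamma draw as a stand-in, which is an assumption and may differ from what SCEED actually samples.

```python
import random


def simulate_marker_expression(other_cells_mean, fold_change, library_sizes, rng):
    """Return (marker_mean, per-cell expression) for one marker gene in group k."""
    # Assumption: the "shape = fC, rate = 1" draw is read as a gamma draw.
    # gammavariate takes (shape, scale); a rate of 1 corresponds to a scale of 1.
    sampled_fold_change = rng.gammavariate(fold_change, 1.0)
    # Mean in group k = sampled fold change x mean of the gene outside group k.
    marker_mean = sampled_fold_change * other_cells_mean
    # Per-cell expression = simulated mean x that cell's library size.
    expression = [marker_mean * size for size in library_sizes]
    return marker_mean, expression


rng = random.Random(0)  # fixed seed so the sketch is reproducible
library_sizes = [0.8, 1.0, 1.3]  # hypothetical relative library sizes
marker_mean, expression = simulate_marker_expression(2.5, 4, library_sizes, rng)
```

In this reading the expected sampled fold change equals fC, since a gamma distribution with shape fC and rate 1 has mean fC.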

Fig. 2 Schematic representation of the generation of a simulated dataset using SCEED. (Left to right) A blank matrix is provided as input. Initially, (1) the mean expression of all genes and (2) the number of marker genes at a desired fold change cutoff are simulated, followed by adjustment for (3) biological and (4) technical noise. Finally, (5) single cell counts are simulated and returned as the output matrix.

Analyses

Single cell analysis steps

A standard single cell analysis procedure includes data normalization, dimensionality reduction and clustering [13]. Normalization is a crucial step in any single cell analysis: it adjusts for unwanted technical or biological variation that may otherwise affect the gene expression analysis. With large datasets such as single cell data, dimensionality reduction is also an important step. It transforms the data into a lower-dimensional space, which significantly reduces data complexity and makes data visualization easier. Finally, single cells with similar transcriptome profiles are clustered together to deduce putative (sub)populations, aka cell types.
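As an illustration of the normalization step only, the following Python sketch applies one common convention: scale each cell's counts to a fixed total, then take log(1 + x). This is not necessarily the normalization used by any of the algorithms tested here; the function name, the scale factor of 10,000 and the toy counts are assumptions made for the example.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Scale each cell to a common total count, then apply log(1 + x)."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # An empty cell has no signal to rescale; keep it as zeros.
            normalized.append([0.0] * len(cell))
            continue
        normalized.append([math.log1p(c / total * scale_factor) for c in cell])
    return normalized


# Rows are cells, columns are genes (toy values).
# The second cell has the same composition at one tenth of the sequencing depth.
counts = [[10, 0, 90], [1, 0, 9]]
normalized = log_normalize(counts)
```

After normalization the two toy cells have matching profiles, which shows how a difference in sequencing depth (a technical effect) is removed before clustering.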

Incorporation of single cell methods into the SCEED package

The SCEED package allows users to add any single cell analysis package of interest to its pipeline through a function named “sceed_AlgorithmName”, for example sceed_seurat. In the current implementation of SCEED, Kmeans, SIMLR and Seurat (details in the Results section) are available. Although we have added only three single cell algorithms, the SCEED package is flexible, and any number of single cell algorithms can be added for testing according to the user’s requirements.

Performance assessment

The performance of an analysis procedure is assessed by computing the F1score of a cluster. The F1score balances the recall (sensitivity) and precision of cell classification. A higher F1score indicates better performance of the algorithm tested. The user can choose an F1score threshold suitable for annotating the clusters as cell types, and on that basis select the best single cell analysis algorithm as well as the sample size.

$$ \mathrm{F1score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $$
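A minimal Python sketch of this calculation for one simulated cell type is given below. The text does not state how a cluster is matched to a known cell type, so the sketch assumes that the cluster containing most of that cell type's cells represents it; this matching rule, the function name and the toy labels are illustrative assumptions rather than SCEED's documented behavior.

```python
from collections import Counter


def f1_for_cell_type(true_labels, cluster_labels, cell_type):
    """F1 score for one cell type, matched to its best-overlapping cluster."""
    # Cluster assignments of the cells that truly belong to the cell type.
    member_clusters = [c for t, c in zip(true_labels, cluster_labels) if t == cell_type]
    if not member_clusters:
        return 0.0
    # Assumption: the cluster holding most of these cells represents the cell type.
    best_cluster, true_positives = Counter(member_clusters).most_common(1)[0]
    cluster_size = sum(1 for c in cluster_labels if c == best_cluster)
    precision = true_positives / cluster_size      # fraction of the cluster that is correct
    recall = true_positives / len(member_clusters)  # fraction of the cell type recovered
    return 2 * precision * recall / (precision + recall)


# Toy example: 4 rare cells and 6 common cells split across two clusters.
true_labels = ["rare"] * 4 + ["common"] * 6
cluster_labels = [1, 1, 1, 2, 1, 2, 2, 2, 2, 2]
rare_f1 = f1_for_cell_type(true_labels, cluster_labels, "rare")
```

In the toy example, three of the four rare cells fall in cluster 1, which holds four cells in total, so precision and recall are both 0.75 and the F1score is 0.75.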

Results

We used SCEED to test three widely used single cell algorithms for cell type identification: Kmeans, Seurat and SIMLR. For the Kmeans clustering approach, k was set equal to the number of cell types simulated. For the Seurat and SIMLR algorithms, the default parameters recommended by the authors were used. In Seurat, when using the “FindClusters” function, k.param was set to the number of cell types simulated. We generated 27 datasets with varying parameter choices.

Generating simulated single-cell datasets

In a single cell experiment, discovering rare cell populations is of utmost importance. To emphasize rare cell populations, we simulated single cell datasets in which five cell types were partitioned into unequal proportions such that one of them had a low proportion, representing a rare population, ranging from 2 to 10%. For instance, one configuration had five cell types in proportions of 0.1, 0.2, 0.2, 0.2 and 0.3. In each cell type, 50 genes were simulated as marker genes that were either 2-, 4- or 8-fold upregulated compared with the rest of the cell types. For the same cell type proportions, and keeping the other parameters the same, we simulated single cell datasets of 1000, 2000 or 3000 cells. More details of each dataset are shown in Table 1. In summary, we created 27 simulated single cell datasets. Notably, in the SCEED package, all these parameters (such as the number of cell types, single cells per cell type, genes per cell, marker genes per cell type and fold change cutoffs) can be adjusted according to the user’s requirements.
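The count of 27 datasets is consistent with a full factorial design over three rare cell proportions, three fold changes and three cell counts. The Python sketch below enumerates such a grid. The factorial structure is inferred from the reported numbers (rarities of 0.02, 0.05 and 0.1 and cell counts of 1000, 2000 and 3000 reported with Fig. 3), Table 1 remains the authoritative description, and the dictionary keys are hypothetical names rather than SCEED arguments.

```python
from itertools import product

rare_proportions = [0.02, 0.05, 0.1]  # proportion of the rare cell type
fold_changes = [2, 4, 8]              # upregulation of marker genes
cell_counts = [1000, 2000, 3000]      # total cells per dataset

# One parameter set per combination (assumed full factorial design).
datasets = [
    {
        "rare_proportion": proportion,
        "fold_change": fold_change,
        "n_cells": n_cells,
        "n_cell_types": 5,      # five cell types in every dataset
        "n_marker_genes": 50,   # marker genes per cell type
    }
    for proportion, fold_change, n_cells in product(rare_proportions, fold_changes, cell_counts)
]
```

Each entry describes one simulated dataset; the expected number of cells of the rare type is simply rare_proportion multiplied by n_cells.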

Table 1 Properties of the different simulated single cell datasets generated

Testing the performance of single-cell algorithms and estimation of sample size required

All these datasets were analyzed using three single cell algorithms, Kmeans, Seurat and SIMLR, and their performance was tested using the F1score. At the lowest marker gene fold change (fC) of 2, irrespective of the number of single cells collected, Seurat provided the best F1score for a rarity of 0.1. However, for fC of 2, at least 1000 cells may be needed to achieve an F1score > 0.9. As fC increased, the other algorithms also performed better (Additional files 1 and 2). Next, we compared the ability of these algorithms to detect an even rarer cell type, with a proportion of 0.05 (cell type proportions of 0.05, 0.35, 0.2, 0.2 and 0.2; Fig. 3). At fC = 2, Seurat reached an F1score of 0.93, but only when the number of single cells was ≥ 2000. In line with the previous observation, the other algorithms also performed better as fC increased at the 0.05 proportion. However, when we reduced the rare cell type proportion further to 0.02, SIMLR outperformed the other two algorithms, with an F1score of 0.69 for ≥ 1000 single cells. Separately, we also estimated the minimum sample size required to reach a given F1score. For instance, SIMLR attained an F1score > 0.7 for proportions of 0.1 and 0.05 with a sample size (number of single cells) of 1000, while Seurat required sample sizes of 1000 and 2000 for cell proportions of 0.1 and 0.05, respectively. We saw similar results for fC of 4 and 8 (see Additional files 1 and 2).
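The sample size estimate described above amounts to finding, for each algorithm, the smallest number of cells at which the F1score meets the chosen cutoff. A minimal Python sketch follows. The function names and the selection rule (prefer the algorithm that needs the fewest cells) are illustrative assumptions, and the scores in the example are placeholders, not results from this study.

```python
def minimum_cells(f1_by_cells, cutoff):
    """Smallest cell count whose F1 score meets the cutoff, or None if none does."""
    passing = [n for n, f1 in f1_by_cells.items() if f1 >= cutoff]
    return min(passing) if passing else None


def pick_procedure(results, cutoff):
    """Choose the algorithm that needs the fewest cells to reach the cutoff."""
    needed = {alg: minimum_cells(scores, cutoff) for alg, scores in results.items()}
    feasible = {alg: n for alg, n in needed.items() if n is not None}
    if not feasible:
        return None, None  # no algorithm reaches the cutoff at any tested size
    best = min(feasible, key=feasible.get)
    return best, feasible[best]


# Placeholder F1 scores for illustration only; these are not values from the study.
results = {
    "method_a": {1000: 0.62, 2000: 0.81, 3000: 0.90},
    "method_b": {1000: 0.74, 2000: 0.78, 3000: 0.80},
}
best, n_cells = pick_procedure(results, cutoff=0.7)
```

With these placeholder values, a cutoff of 0.7 selects method_b at 1000 cells, whereas a stricter cutoff of 0.85 selects method_a at 3000 cells, which illustrates that the preferred procedure depends on the threshold the user sets.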

Fig. 3 Performance of different single cell algorithms at different cell proportions. The F1score was calculated at rare cell proportions of 0.02, 0.05 and 0.1, with marker genes upregulated 2-fold, for datasets of 1000, 2000 and 3000 single cells. The x-axis represents the rare cell proportion and the y-axis represents the F1score.

Discussion and conclusion

We propose SCEED as an easy-to-use package to help researchers design a single cell experiment (estimate the number of cells required to identify novel cell types) and choose an optimal analysis procedure. The package takes into account the technical and biological parameters that characterize typical single cell RNA-seq data. Using the SCEED package, we simulated 27 single cell datasets that account for varying sample sizes, rarity of cell types and fold change in expression of marker genes. Such simulation matters because experimental settings differ. For example, researchers may plan to analyze cell types similar to beta cells from the islets of Langerhans in the pancreas, where marker genes such as insulin are expressed at far higher levels than the rest of the genes. In contrast, they may be interested in identifying subclasses of established cell types, where marker genes are expressed at only moderately higher levels than the rest of the genes. Using the SCEED package, researchers can generate simulated datasets that bear statistical properties similar to those of the expected data and test various single cell algorithms. Our package not only suggests the best method among the tested algorithms but also suggests the number of cells required to achieve the desired results. Because single cell transcriptome analysis is a rapidly growing field, the SCEED package makes it easy to add more single cell algorithms for testing.

In our study, we compared the performance of three widely used single cell algorithms. Though our simulations are limited, our study clearly shows that even widely used algorithms do not each perform best across the ranges of cell population rarity and fold change in expression of marker genes. Based on these results, we conclude that the SCEED package fills an important gap in the single cell analysis field. However, an extensive study is still needed to identify optimal analysis procedures for a variety of experimental settings and statistical properties of data. Such a study needs to account not only for the three parameters we tested, but also for variation in the other statistical parameters (which can be selected in the SCEED package), and to address the experimental designs of scRNA-seq experiments.