Introduction

Recent advances in single-cell RNA-Seq (scRNASeq) techniques have provided transcriptomes of large numbers of individual cells (single-cell gene expression data)1,2,3,4,5,6,7,8,9. In particular, analyzing the diversity and evolution of single cancer cells can enable advances in early cancer diagnosis and, ultimately, the choice of the best strategy for cancer treatment10,11,12. Furthermore, one important analysis of scRNASeq data is the identification of cell types, which can be achieved by applying an unsupervised clustering method to transcriptome data13,14,15,16,17,18,19.

Clustering algorithms such as k-means and density-based spatial clustering of applications with noise (DBSCAN)20 can identify groups of cells given the single-cell gene expression data. However, the clusters obtained by these algorithms might not be robust, and such algorithms require non-intuitive parameters13. For instance, given the number of clusters, k-means iteratively assigns data points (cells) to the nearest centroids (cluster centers) and recomputes the centroids of the resulting clusters. The algorithm starts with randomly chosen centroids, so its result depends on the predefined number of clusters (in DBSCAN, the maximum distance between two data points in the same neighborhood must be specified instead), on the random initialization, and on the number of runs.
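To make this sensitivity concrete, the following minimal R sketch (on hypothetical toy data) runs k-means twice with a single random start each; the two partitions, and their within-cluster sums of squares, can differ between runs, which is why multiple restarts are commonly used:

```r
# Toy illustration (hypothetical data): k-means with a single random start
# can converge to different local optima on the same data.
set.seed(1)
X <- matrix(rnorm(300), ncol = 2)   # 150 points in 2 dimensions

fit1 <- kmeans(X, centers = 3, nstart = 1)
fit2 <- kmeans(X, centers = 3, nstart = 1)

# Total within-cluster sums of squares may differ between the two runs;
# nstart > 1 mitigates this by keeping the best of several restarts.
c(fit1$tot.withinss, fit2$tot.withinss)
```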

Another challenge comes from the high dimensionality of the data, known as the “curse of dimensionality”. Identifying accurate clusters of data points based on the measured distances between pairs of data points may fail because pairwise distances become less informative (data points appear increasingly similar) as the dimensionality of the representation grows13,21. One approach to dealing with the curse of dimensionality is projecting the data into a lower dimensional space, known as dimensionality reduction. In this approach, the data is represented in a lower dimensional space while the characteristics (e.g., similarities between the data points) of the original data are preserved. Several methods have used different techniques based on this concept (e.g., principal component analysis) to determine the cell types22,23,24,25,26. Another approach to dealing with this challenge is feature selection, i.e., eliminating features (genes) that are not informative27. In the following, we provide a brief overview of related methods that identify cell types based on combinations of the approaches described above.
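As a concrete illustration of the dimensionality reduction approach, the short sketch below (on simulated data, not one of the datasets analyzed here) projects a gene-by-cell matrix onto its first two principal components:

```r
# Hypothetical example: PCA-based dimensionality reduction of a
# gene-by-cell expression matrix M (rows = genes, columns = cells).
set.seed(1)
M <- matrix(rpois(100 * 20, lambda = 5), nrow = 100)  # 100 genes, 20 cells

pca     <- prcomp(t(M))   # treat cells as observations
reduced <- pca$x[, 1:2]   # one 2-dimensional coordinate per cell
dim(reduced)              # 20 x 2
```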

The methods SC328 and Seurat25 use a combination of feature selection, dimensionality reduction, and clustering algorithms to identify the cell types. The authors of SC3 use a consensus clustering framework that combines, via complete-linkage hierarchical clustering, multiple clustering solutions obtained from spectral transformations followed by k-means. They first apply a gene filtering approach to the single-cell gene expression data to remove rare and ubiquitous genes/transcripts. Next, they compute distance matrices (distances between the cells) using the Euclidean, Pearson, and Spearman metrics. They transform the distance matrices using either principal component analysis (PCA)29 or the eigenvectors of the associated graph Laplacian. Next, they perform k-means clustering on the first d eigenvectors of the transformed distance matrices. Using the different k-means clustering results, they construct a consensus matrix that represents how often each pair of cells is clustered together. This consensus matrix is used as the input to hierarchical clustering with the complete-linkage agglomeration strategy30. The clusters are inferred at the k-th level of the hierarchy, where k is computed based on Random Matrix Theory31,32. The accuracy of SC3 is sensitive to the number of eigenvectors (d) chosen for the spectral transformation; the authors report that SC3 performs well when d is between 4% and 7% of the number of cells. The main advantage of SC3 is its high accuracy in the identification of cell types. However, it is not scalable33.

Seurat25 is a graph-based clustering method that projects the single-cell expression data into two-dimensional space using the t-distributed stochastic neighbor embedding (t-SNE) technique34 and then applies the DBSCAN method20 to the dimensionality-reduced data. Seurat may fail to find the cell types in small datasets (low cell numbers)28, reportedly because of difficulties in estimating densities when the number of data points is low.

RaceID35 determines the cell types by performing the k-means clustering algorithm, where the gap statistic is used to choose the number of clusters. RaceID does not perform well when the data does not contain rare cell populations, but it appears to be the preferred method when the aim is the identification of rare cell types13,33,36,37.

SNN-Cliq17 uses the shared nearest neighbor (SNN) concept, which considers the effect of the surrounding neighbor data points, to handle high-dimensional data. The authors of SNN-Cliq compute the similarity between pairs of data points (the similarity matrix) based on the Euclidean distance, referred to as the primary similarity measure. Using the similarity matrix, they list the k-nearest neighbors (KNN) of each data point. They propose a secondary similarity measure that computes the similarity between two data points based on their shared neighborhoods. An SNN graph is then constructed, in which nodes and weighted edges represent the data points and the similarities between them, respectively, and a graph-based clustering method is applied to this graph. The main disadvantage of graph-based methods such as SNN-Cliq is that scRNASeq data is not inherently graph-structured13; therefore, the accuracy of these methods depends on the graph representation of the scRNASeq data.
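The shared-neighbor idea can be sketched as follows (this illustrates the concept only, not SNN-Cliq's actual implementation; the function name and the choice k = 5 are ours):

```r
# Secondary similarity: the number of k-nearest neighbors two points share.
snn_counts <- function(X, k = 5) {
  d <- as.matrix(dist(X))   # primary similarity: Euclidean distances
  # k nearest neighbors of each point (column i holds the neighbors of
  # point i; order(row)[1] is the point itself, so it is skipped)
  knn <- apply(d, 1, function(row) order(row)[2:(k + 1)])
  n <- nrow(X)
  S <- matrix(0, n, n)
  for (i in seq_len(n))
    for (j in seq_len(n))
      S[i, j] <- length(intersect(knn[, i], knn[, j]))
  S
}
```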

SINCERA38 performs hierarchical clustering on a similarity matrix computed using the centered Pearson's correlation, with average linkage as the default choice. Consensus clustering39,40, tight clustering41, and Ward linkage42 are provided as alternative clustering approaches. When hierarchical clustering is used for cell cluster identification, users can choose a distance threshold or the number of clusters via visual inspection. SINCERA tends to identify many clusters which likely represent the same cell type13.

One way to identify robust clusters of cells is to resample the cells/genes and compare the original clusters with those obtained after resampling43. In the current paper, in order to explore the strength of a pattern (cluster of cells) in the data, we analyze the sensitivity of that pattern to small changes in the data. The data is resampled by replacing a certain number of data points with noise points drawn from a noise distribution. Our hypothesis is that if there is a strong pattern in the data, it will remain despite small perturbations44. Here, we develop a stable subtyping (clustering) method that employs the t-distributed stochastic neighbor embedding (t-SNE)34 and k-means clustering to identify the cell types. We add noise and apply a bootstrap method45,46 to identify the stable clusters of cells. We use the Adjusted Rand Index (ARI)47, adjusted mutual information (AMI)48,49, and V-measure50 to evaluate the clustering results on datasets in which the true cell types are known. We compare the results of our method with those of five other methods: RaceID35, SNN-Cliq17, SINCERA38, SEURAT25, and SC328, using 8 real datasets with known cell types and 5 simulated datasets. The results show that the proposed method performs better than the other five methods across different datasets.

Materials and methods

The goal of the proposed method is to identify the cell types present in a mixture of single cells. The input of the method is the single cell gene expression matrix (Mgene×cell) in which rows represent the genes and columns represent the cells. In the following, we provide more details about the input data and the different steps of the proposed framework. The overall approach is shown in Fig. 1.

Figure 1

The overall workflow of the proposed method. Given the single cell gene expression matrix, module (A) eliminates the genes that are not expressed in any cell. Using the resulting matrix, module (B) computes the Euclidean distance between the cells. The output of this module is a distance matrix in which the rows and columns are the cells (Dcell×cell). Module (C) reduces the dimensionality of the distance matrix using the t-distributed stochastic neighbor embedding (t-SNE) technique. In this module, an average silhouette method is employed to choose the optimal number of clusters k. Finally in module (D), the lower-dimension distance matrix and the optimal number of clusters k obtained from module (C) are used as the input data to identify the most stable clustering of cells. Figure 2 shows the details of module D.

Data source

The eight publicly available scRNA-seq datasets as well as the five simulated datasets used in our analysis are described in the Supplementary Materials. Among the eight real datasets, all but three (Klein51, Patel52, Treutlein53) are considered 'gold standard', since the labels of their cells are known in a definitive way. Patel52 and Treutlein53 are referred to as 'silver standard' by Kiselev et al.28, since their cell labels were determined based on computational methods and the authors' knowledge of the underlying biology.

We obtained the processed data from the Hemberg lab's website (https://hemberg-lab.github.io/scRNA.seq.datasets). Hemberg et al.54 use the SingleCellExperiment Bioconductor S4 class55 to store the data, and the scater package56 for quality control and plotting. The normalized data is deposited as a SingleCellExperiment object (.RData file), and the cell type information is available in the cell_type1 column of the “colData” slot of this object. The gene expression values are organized as a matrix in which rows are the genes and columns are the cells. In our analysis, genes (features) that are not expressed in any cell are removed. We did not filter out any cells.
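A minimal sketch of loading one of these objects might look as follows (we assume the .RData file stores the SingleCellExperiment under the name sce and that normalized values sit in the logcounts assay; the actual object and file names may differ):

```r
library(SingleCellExperiment)

load("dataset.RData")             # hypothetical file name
expr  <- logcounts(sce)           # normalized gene-by-cell expression matrix
truth <- colData(sce)$cell_type1  # reference cell type labels
```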

Gene filtering

As shown in Fig. 1A, we remove the genes/transcripts that are not expressed in any cell (i.e., whose expression value is zero in all cells). Such genes cannot provide information that differentiates between cell types57. The filtered single cell gene expression matrix (Mgene×cell) is used as the input to the second module of the proposed framework.
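In R, this step amounts to one line on the gene-by-cell matrix M:

```r
# Keep only genes with a non-zero expression value in at least one cell
# (assumes non-negative expression values, as in count or log-count data).
M_filtered <- M[rowSums(M) > 0, ]
```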

Measuring the dissimilarity between the cells

The distance between the cells is calculated using the Euclidean metric (Fig. 1B). The output of this step is the distance (dissimilarity) matrix Dcell×cell. We reduce the dimension of D by performing t-distributed stochastic neighbor embedding (t-SNE)34,58, a nonlinear dimensionality reduction/visualization technique (Fig. 1C). We refer to the output as D′cell×l, where 2 ≤ l ≤ cell; in this study, the number of dimensions l is 2.
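A sketch of these two modules in R, using the Rtsne package (the perplexity is left at its default here, since the value used is not stated in the text):

```r
library(Rtsne)

# Module B: Euclidean distances between cells (cells are columns of M).
D <- dist(t(M_filtered), method = "euclidean")

# Module C: t-SNE on the distance matrix, reducing to l = 2 dimensions.
tsne <- Rtsne(as.matrix(D), dims = 2, is_distance = TRUE)
Dp   <- tsne$Y   # the cell-by-2 embedded matrix D'
```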

Clustering

Identification of the optimal number of clusters

This section describes the third module of the proposed method (Fig. 1C). In this analysis, t-SNE is repeatedly (n = 50) applied to the distance matrix Dcell×cell to obtain the dimensionality-reduced distance matrix D′cell×l. Each time, the optimal number of clusters is calculated based on the average silhouette method using the dimensionality-reduced matrix D′: k-means clustering is applied to D′ for a range of candidate values of k (default = 2:20), and the k that maximizes the average silhouette measure is selected. Finally, the average of the selected values of k across the different repeats (n = 50), rounded to the nearest integer, is taken as the final optimal number of clusters.

The silhouette evaluates the quality of a clustering based on how well its data points are clustered. A silhouette measure is assigned to each data point, representing how close the data point is to its own cluster in comparison to the other clusters. For each data point i, this measure is calculated as follows:

$$s(i)=\frac{b(i)-a(i)}{\max \{a(i),b(i)\}}$$

where a(i) is the average distance between the data point i and all other data points within the same cluster, and b(i) is the smallest average distance of i to all points in any other cluster of which i is not a member. s(i) takes values from −1 to 1, where a high positive score shows that the given data point is well clustered (close to other points in its own cluster and far from points in the other clusters). Conversely, a high negative score shows that the data point is poorly clustered.
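A minimal sketch of this silhouette-based selection of k (the helper name and the nstart value are our own choices); in the full procedure this selection is repeated over the n = 50 t-SNE embeddings and the selected values are averaged:

```r
library(cluster)  # provides silhouette()

pick_k <- function(Dp, krange = 2:20) {
  d <- dist(Dp)
  avg_sil <- sapply(krange, function(k) {
    cl <- kmeans(Dp, centers = k, nstart = 10)$cluster
    mean(silhouette(cl, d)[, "sil_width"])  # average s(i) over all cells
  })
  krange[which.max(avg_sil)]  # k maximizing the average silhouette
}
```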

k-means clustering based on the resampling method

This section describes the details of the last module of the proposed method. As shown in Fig. 2, using the dimensionality-reduced distance matrix D′ and the number of clusters k chosen in the previous step, we identify the most stable clustering by generating different clustering solutions (clusteringi, i ∈ [1..n]) and measuring the stability of each solution based on a resampling method. The stability measure assigned to a particular clustering (denoted clusteringi) represents how often the k clusters belonging to that clustering are preserved when the input data (D′) is resampled several times. The resampled datasets are generated from D′ by randomly replacing 5% of the data points (cells) with noise. These noisy datasets are then used as the input to the k-means algorithm, so that several clusterings (clusteringi,j, j ∈ [1..m]) are generated from the resampled data (resampled versions of clusteringi).

Figure 2

Identifying the most stable clustering. In this analysis, given the lower-dimension distance matrix Dcell×l and the optimal number of clusters k, we calculate n different clusterings (clustering1, ..., clusteringn) using the k-means clustering algorithm. Then, the stability of each clustering is assessed based on a resampling approach (grey box). A stability score is assigned to each clustering based on how often its clusters are recovered when the input data is perturbed (resampled). A clustering with the maximum stability score is selected as the final solution.

In order to assess the stability of each cluster c in clusteringi (the original clustering), the cluster c is compared, based on the Jaccard coefficient, to all the clusters in each clustering obtained from the resampled data (clusteringi,j). The Jaccard coefficient59, a similarity measure between sets, computes the similarity between two clusters as follows:

$$J(A,B)=\frac{|A\cap B|}{|A\cup B|},\quad A,B\subseteq X$$

where A and B are two clusters, each consisting of data points from X = {x1, …, xN}.

If the Jaccard similarity between the cluster c (from the original clustering clusteringi) and the most similar cluster in the resampled clustering is equal to or greater than 0.75, the cluster is considered stable (preserved). Thus, the stability of each cluster in clusteringi is calculated as the percentage of times that cluster is preserved (Jaccard coefficient ≥ 0.75) across the m different resamplings.

We then average the stability measures of the k clusters belonging to clusteringi and take this average as the overall stability measure of clusteringi. Among the n different clustering solutions (clusteringi, i ∈ [1..n]), we select the one with the maximum stability measure as the final clustering solution.

Figure 3 shows the details of the resampling method we use to compute the stability measure for each clustering. The clusters obtained by applying k-means to the resampled dataset are compared with the clusters from the original input data based only on the non-noise points (the noise points are excluded when two clusters are compared with the Jaccard similarity metric).
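Since the Results section states that the fpc package was used for this step, a plausible sketch of the stability computation for one clustering is a clusterboot call along the following lines (argument values other than the 5% noise fraction and the 0.75 recovery threshold, which come from the text, are assumptions, and the output component names follow the fpc documentation):

```r
library(fpc)

# bootmethod = "noise" replaces a fraction of the points with noise;
# noisetuning[1] = 0.05 is that fraction, and recover = 0.75 is the
# Jaccard threshold above which a cluster counts as preserved.
cb <- clusterboot(Dp, B = 1000, bootmethod = "noise",
                  clustermethod = kmeansCBI, krange = k,
                  noisetuning = c(0.05, 4), recover = 0.75,
                  showplots = FALSE, count = FALSE)

cb$noiserecover        # per-cluster recovery rates (cluster stability)
mean(cb$noiserecover)  # overall stability of this clustering
```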

Figure 3

The resampling framework to compute the stability measure for each clustering. The input includes N data points X = {x1, ..., xN}, the number of clusters k, the number of resamplings m, and the clustering C obtained by applying k-means to X. This analysis generates m resampled datasets by randomly replacing 5% of the data points with noise, and computes m resampled clusterings using k-means. Each cluster c in C is compared with the most similar cluster in each resampled clustering, and the Jaccard coefficient between the two clusters is computed, with the noise points excluded. The percentage of times that the Jaccard coefficient is at least 0.75 is the stability measure for cluster c. The average of the stability measures of all clusters belonging to clustering C is taken as the overall stability measure for clustering C.

Validation methods

We use 13 different datasets in which the cell types (labels) are known. To measure the level of similarity between the reference labels and the inferred labels obtained by each clustering method, we use three different metrics: the adjusted rand index (ARI), the adjusted mutual information (AMI), and the V-measure, as explained in the following.

Adjusted rand index

Given the cell labels, the Adjusted Rand Index (ARI)47 is used to assess the similarity between the inferred clustering and the true clustering. ARI is close to 0 for a random clustering and equals 1 for perfect agreement with the true clustering. For a set of n data points, a contingency table is constructed from the number of data points shared between pairs of clusters. Suppose X = {X1, X2, ..., XR} and Y = {Y1, Y2, ..., YC} represent two different clusterings with R and C clusters, respectively. The overlap between X and Y can be summarized as a contingency table MR×C = [nij], where i = 1...R, j = 1...C; Xi and Yj denote clusters in X and Y, and i and j refer to the row and column of the contingency table, respectively. The ARI is defined as follows:

$$ARI=\frac{\sum_{ij}\binom{n_{ij}}{2}-\left[\sum_{i}\binom{a_{i}}{2}\sum_{j}\binom{b_{j}}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_{i}}{2}+\sum_{j}\binom{b_{j}}{2}\right]-\left[\sum_{i}\binom{a_{i}}{2}\sum_{j}\binom{b_{j}}{2}\right]\Big/\binom{n}{2}}$$
(1)

where nij denotes the number of shared data points between clusters Xi and Yj (nij = |Xi ∩ Yj|), \({a}_{i}={\sum }_{k}{n}_{ik}\) (the sum of the ith row of the contingency table), and \({b}_{j}={\sum }_{k}{n}_{kj}\) (the sum of the jth column of the contingency table).
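A direct R transcription of Eq. (1) from two label vectors is shown below (for routine use, mclust::adjustedRandIndex computes the same quantity):

```r
# Adjusted Rand Index from the contingency table of two labelings.
ari <- function(x, y) {
  M <- table(x, y)                       # contingency table [n_ij]
  a <- rowSums(M); b <- colSums(M); n <- sum(M)
  sum_ij  <- sum(choose(M, 2))           # sum of C(n_ij, 2)
  expect  <- sum(choose(a, 2)) * sum(choose(b, 2)) / choose(n, 2)
  maximum <- (sum(choose(a, 2)) + sum(choose(b, 2))) / 2
  (sum_ij - expect) / (maximum - expect)
}
```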

Adjusted mutual information

The adjusted mutual information (AMI)48,49 is a variation of mutual information that corrects for random partitioning, similar to the way the ARI corrects the Rand index. As explained in the previous section, given two different clusterings X = {X1, X2, ..., XR} and Y = {Y1, Y2, ..., YC} of n data points, with R and C clusters respectively, the overlap between X and Y can be summarized as a contingency table MR×C = [nij], where i = 1...R, j = 1...C, and nij represents the number of common data points between clusters Xi and Yj. If a data point is picked at random, the probability that it falls into cluster Xi is \(p(i)=\frac{|{X}_{i}|}{n}\). The entropy60 associated with the clustering X is calculated as follows:

$$H(X)=-\mathop{\sum }\limits_{i=1}^{R}P(i)\,\log P(i)$$
(2)

H(X) is non-negative and takes the value 0 only when there is no uncertainty in determining a data point's cluster membership (i.e., when there is only one cluster). The mutual information (MI) between two clusterings X and Y is calculated as follows:

$$MI(X,Y)=\mathop{\sum }\limits_{i=1}^{R}\mathop{\sum }\limits_{j=1}^{C}P(i,j)\,\log \frac{P(i,j)}{P(i)P(j)}$$
(3)

where P(i, j) denotes the probability that a data point belongs to both the cluster Xi in X and the cluster Yj in Y:

$$P(i,j)=\frac{|{X}_{i}\cap {Y}_{j}|}{n}$$
(4)

MI is a non-negative quantity upper bounded by the entropies H(X) and H(Y). It quantifies the information shared by the two clusterings and therefore can be considered as a clustering similarity measure. The adjusted measure for the mutual information is defined as follows:

$$AMI(X,Y)=\frac{MI(X,Y)-E\{MI(X,Y)\}}{\max \{H(X),H(Y)\}-E\{MI(X,Y)\}}$$
(5)

where the expected mutual information between two random clusterings is:

$$E\{MI(X,Y)\}=\mathop{\sum }\limits_{i=1}^{R}\mathop{\sum }\limits_{j=1}^{C}\,\mathop{\sum }\limits_{{n}_{ij}=\max (1,{a}_{i}+{b}_{j}-n)}^{\min ({a}_{i},{b}_{j})}\frac{{n}_{ij}}{n}\log \left(\frac{n\cdot {n}_{ij}}{{a}_{i}{b}_{j}}\right)\frac{{a}_{i}!\,{b}_{j}!\,(n-{a}_{i})!\,(n-{b}_{j})!}{n!\,{n}_{ij}!\,({a}_{i}-{n}_{ij})!\,({b}_{j}-{n}_{ij})!\,(n-{a}_{i}-{b}_{j}+{n}_{ij})!}$$
(6)

where ai and bj are the partial sums of the contingency table: \({a}_{i}={\sum }_{j=1}^{C}{n}_{ij}\) and \({b}_{j}={\sum }_{i=1}^{R}{n}_{ij}\).

The adjusted mutual information (AMI) takes a value of 1 when the two clusterings are identical and 0 when the MI between two partitions equals the value expected due to chance alone.
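The entropy and mutual information terms (Eqs. 2–4) are straightforward to compute from two label vectors, as sketched below; the exact expectation of Eq. (6) is more laborious, and in practice a package implementation of AMI (e.g., the aricode package) can be used instead:

```r
# Entropies and mutual information of two labelings (Eqs. 2-4).
mi_terms <- function(x, y) {
  n   <- length(x)
  Pxy <- table(x, y) / n                 # joint distribution P(i, j)
  Px  <- rowSums(Pxy); Py <- colSums(Pxy)
  H   <- function(p) -sum(p[p > 0] * log(p[p > 0]))  # entropy, Eq. (2)
  MI  <- sum(ifelse(Pxy > 0, Pxy * log(Pxy / outer(Px, Py)), 0))  # Eq. (3)
  list(Hx = H(Px), Hy = H(Py), MI = MI)
}
```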

V-measure

The V-measure50 is the harmonic mean of two measures: homogeneity and completeness. The homogeneity criterion is satisfied if a clustering assigns to each cluster only data points that are members of a single class (true cluster); thus, the class distribution within each cluster should be skewed toward a single class (zero entropy). To determine how close a given clustering is to this ideal, the conditional entropy of the class distribution given the identified clustering, H(C|K), is computed, where C = {C1, C2, ..., Cl} is the set of classes and K = {K1, K2, ..., Km} is the clustering. In the perfectly homogeneous case, this value is 0. However, this value depends on the size of the dataset and the distribution of class sizes, so the conditional entropy is normalized by the maximum reduction in entropy the clustering information could provide, H(C). The homogeneity is therefore defined as follows:

$$h=\begin{cases}1 & \text{if}\,H(C,K)=0\\ 1-\frac{H(C|K)}{H(C)} & \text{otherwise}\end{cases}$$
(7)

The completeness is symmetrical to the homogeneity50. In order to satisfy the completeness criterion, a clustering must assign all data points that are members of a single class to a single cluster. To measure completeness, the distribution of cluster assignments within each class is assessed; in a perfectly complete clustering solution, each of these distributions is completely skewed toward a single cluster.

Given the homogeneity h and completeness c, the V-measure is computed as the weighted harmonic mean of homogeneity and completeness:

$$\text{V-measure}=\frac{(1+\beta )\cdot h\cdot c}{(\beta \cdot h)+c}$$
(8)

If β is greater than 1, completeness is weighted more strongly in the calculation; if β is less than 1, homogeneity is weighted more strongly. Since the computations of homogeneity, completeness, and V-measure are independent of the number of classes, the number of clusters, the size of the dataset, and the clustering algorithm, these measures can be employed to evaluate any clustering solution.
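A sketch of homogeneity, completeness, and the V-measure (Eqs. 7–8) from two label vectors, with β = 1 as is conventional (the guard on H(C) handles the degenerate single-class case):

```r
# V-measure of a predicted clustering against reference class labels.
v_measure <- function(truth, pred, beta = 1) {
  n  <- length(truth)
  P  <- table(truth, pred) / n   # joint distribution over (class, cluster)
  Pc <- rowSums(P); Pk <- colSums(P)
  H  <- function(p) -sum(p[p > 0] * log(p[p > 0]))
  # conditional entropies H(C|K) and H(K|C)
  HCK <- -sum(ifelse(P > 0, P * log(sweep(P, 2, Pk, "/")), 0))
  HKC <- -sum(ifelse(P > 0, P * log(sweep(P, 1, Pc, "/")), 0))
  homo <- if (H(Pc) == 0) 1 else 1 - HCK / H(Pc)   # homogeneity, Eq. (7)
  comp <- if (H(Pk) == 0) 1 else 1 - HKC / H(Pk)   # completeness
  (1 + beta) * homo * comp / (beta * homo + comp)  # Eq. (8)
}
```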

Results

Tables 1–3 show the comparison between the proposed method and five other methods: RaceID35, SC328, SEURAT25, SINCERA38, and SNN-Cliq17, using the three metrics ARI, AMI, and V-measure, respectively.

Table 1 A comparison between the results of six methods (proposed, RaceID, SC3, Seurat, SINCERA, and SNN-Cliq) based on the ARI.
Table 2 A comparison between the results of six methods (proposed, RaceID, SC3, Seurat, SINCERA, and SNN-Cliq) based on the AMI.
Table 3 A comparison between the results of six methods (proposed, RaceID, SC3, Seurat, SINCERA, and SNN-Cliq) based on the V-measure.

We used the R package fpc61 to compute the k-means clustering based on the resampling method. We generated 20 different clusterings, and for each clustering we computed 1,000 clusterings based on the resampled datasets in order to find the most stable clustering. We used the log-transformation (M′ = log2(M + 1)) for all methods except SINCERA, for which we followed the authors' instructions38 and used the original z-score normalization instead. To generate the SC3 results, we used the R package SC3 (http://bioconductor.org/packages/SC3, v.1.8.0) and applied the same gene filtering approach that the authors proposed in their study (parameter gene_filter=TRUE).

For SEURAT we used the Seurat R package (v.2.3.4)62; we performed the t-SNE using the Rtsne R package with the default parameters and used the DBSCAN algorithm for clustering. We ran SNN-Cliq with the default parameters provided by the authors17. For RaceID, we used the R code provided by the authors35 (https://github.com/dgrun/RaceID).

As shown in Fig. 4, the proposed method performs better than the five other methods across the 13 datasets. In this figure, the three boxplots show the performance of each method on these 13 datasets based on the adjusted rand index (ARI), adjusted mutual information (AMI), and V-measure. We ran the proposed method, SC3, and RaceID on each dataset 50, 5, and 50 times, respectively, and calculated the average ARIs, AMIs, and V-measures over the different runs. Since SC3 is reported to be a stable method by its authors28, we ran it only 5 times; indeed, we observed results with a very small standard deviation in all 5 runs for all 13 datasets, confirming the authors' claims. The other clustering methods, SEURAT, SINCERA, and SNN-Cliq, were run only once since they are deterministic.

Figure 4

The performance comparison using 13 single cell datasets based on three metrics: the adjusted rand index (ARI), adjusted mutual information (AMI), and V-measure. The proposed method and RaceID were applied 50 times on each dataset. SC3 was applied only 5 times on each dataset because it is very stable. The average ARIs, AMIs, and V-measures across the different runs are reported for the proposed method, RaceID, and SC3. Since SNN-Cliq, SINCERA, and SEURAT are deterministic, they were run only once on each dataset.

Discussion

The results shown in Tables 1–3 merit some discussion. The Goolam dataset, for instance, includes 5 true cell types. On this dataset, the proposed algorithm identifies 3 clusters, while SC3 identifies 6, RaceID 1, Seurat 2, SINCERA 13, and SNN-Cliq 17. Even though the number of clusters closest to the number of true types is the 6 yielded by SC3, the membership of various cells in these clusters is not correct, since the ARI associated with these 6 clusters is only 0.59, compared to the ARI of 0.8 associated with the 3 clusters constructed by the proposed method.

Conversely, for the Patel dataset, which includes 5 cell types, the proposed method correctly estimated the number of clusters (k = 5). However, the distribution of the individual cells across these five clusters is not perfect, as illustrated by its ARI of 0.66, compared to the ARI of 0.78 associated with the SINCERA results.

As another observation, the Pollen dataset includes 11 cell types. On this dataset, the number of clusters determined by SINCERA (k = 10) is closest to the correct number of cell types. However, SC3 achieved the best clustering (ARI = 0.93) among the six methods, even though it identified 17 different clusters on this dataset.

Two conclusions may be drawn from these observations. First, results should not be assessed solely based on the agreement between the number of clusters found and the number of known cell types; the assignment of each cell to a given type is more important. Second, reporting a larger number of clusters tends to be associated with larger values of ARI, so results that include a very large number of clusters should be regarded with caution.

Neither RaceID nor Seurat was able to find a meaningful clustering for the Treutlein dataset: both identified a single cluster (k = 1), while this dataset includes 5 different cell types. As a result, the clusterings obtained by these two methods are poorly matched to the reference clustering. On the Deng dataset, the best ARI of 0.65 is obtained by SC3, but this value is not very high; the poor results obtained by all six methods on this dataset might be due to noisy data.

We also assessed the reproducibility/stability of the stochastic methods (proposed, RaceID, and SC3) by running each method several times. Although SC3's consensus pipeline provides a very stable solution (very low standard deviation for the three metrics and for k across all datasets), it is computationally more costly than the other methods. In summary, one key advantage of our proposed method is that it produces consistent clusterings across different datasets.

The run time of each method on the 13 datasets is shown in Fig. 5. Notably, the run times of RaceID, the proposed method, and SC3 increase non-linearly with the number of cells; at this time, it appears unfeasible to apply these methods to large datasets consisting of many thousands of cells. The fastest method is Seurat, which is graph-based. Graph-based methods often return a single clustering solution with a faster run time and do not require the user to provide the number of clusters33. Seurat is a popular choice for large datasets because of its speed and scalability; however, it has been shown that Seurat does not provide an accurate solution for smaller datasets33. The details of the run times are included in the Supplementary Materials.

Figure 5

The run time of the different methods using 13 single cell datasets.

More generally, finding an optimal clustering method that provides stable solutions in all situations may not be possible. Since no single method performs well in every setting, a comparative analysis of methods based on a set of criteria should be employed33.

Conclusion

Recent advances in single-cell RNA-Seq (scRNASeq) provide the opportunity to perform single-cell transcriptome analysis. In this paper, we develop a pipeline to cluster individual cells based on their gene expression values, such that each cluster consists of cells with specific functions or at distinct developmental stages. We first filter out genes that are not expressed in any cell. Then, we compute the distance between the cells using the Euclidean metric, and we reduce the dimensions of the distance matrix using the t-distributed stochastic neighbor embedding (t-SNE) technique. Based on the dimensionality-reduced distance matrix, we explore strong patterns (clusters) of cells by randomly drawing a percentage of the data points without replacement and replacing them with points from a noise distribution. We apply the proposed method to 13 different single cell datasets and compare it with five related methods: RaceID, SC3, Seurat, SINCERA, and SNN-Cliq. The evaluation results demonstrate that the proposed method yields better clustering results than the existing methods.