1 Introduction

Two-dimensional embeddings and their visualizations may assist in the analysis and interpretation of high-dimensional data. Intuitively, two data instances should be co-located in the resulting visualization if their multi-dimensional profiles are similar. For this task, non-linear embedding techniques such as t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten & Hinton, 2008) or uniform manifold approximation and projection (UMAP) (McInnes & Healy, 2018) have recently complemented traditional data transformation and embedding approaches such as principal component analysis (PCA) (Wold et al., 1987) and multi-dimensional scaling (Cox & Cox, 2008). While useful for visualizing data from a single coherent source, these methods may encounter problems with multiple data sources. Here, when performing dimensionality reduction on a merged data set, the resulting visualizations would typically reveal source-specific clusters instead of grouping data instances of the same class, regardless of data sources. This source-specific confounding is often referred to as domain shift (Gopalan et al., 2011), covariate shift (Bickel et al., 2009) or data set shift (Quiñonero-Candela et al., 2009). In bioinformatics, the domain-specific differences are more commonly referred to as batch effects (Butler et al., 2018; Haghverdi et al., 2018; Stuart et al., 2019).

Massive, multi-variate biological data sets often suffer from these source-specific biases. The focus of this work is single-cell genomics, a domain that was selected due to high biomedical relevance and abundance of recently published data. Single-cell RNA sequencing (scRNA-seq) data sets are the result of isolating RNA molecules from individual cells, which serve as an estimate of the expression of the cell’s genes. The studies can exceed thousands of cells and tens of thousands of genes, and typically start with cell type analysis. Here, it is expected that cells of the same type would cluster together in a two-dimensional data visualization (Wolf et al., 2018). For instance, Fig. 1a shows t-SNE embedded data from mouse brain cells originating from the visual cortex (Hrvatin et al., 2018) and the hypothalamus (Chen et al., 2017). The figure reveals distinct clusters but also separates the data from the two brain regions. These two regions share the same cell types and, contrary to the depiction in Fig. 1a, we would expect the data points from the two studies to overlap. Batch effects similarly limit the utility of t-SNE in the exploration of pancreatic cells in Fig. 1b, which renders the data from a pancreatic cell atlas (Baron et al., 2016) and similarly-typed cells from diabetic patients (Xin et al., 2016). Just as with the brain cells, the pancreatic cells cluster primarily by data source, again resulting in a visualization driven by batch effects.

Fig. 1

Batch effects are a driving factor of variation between the data sets. We depict a t-SNE visualization of two pairs of data sets. In each pair, the data sets share cell types, so we would expect cells from the reference data (blue) to mix with the cells in the secondary data set (orange). Instead, t-SNE clusters data according to the data source

Current solutions to embedding the data from various data sources address the batch effect problems up-front. The data is typically preprocessed and transformed such that the batch effects are explicitly removed. Recently proposed procedures for batch effect removal include canonical correlation analysis (Butler et al., 2018) and mutual nearest-neighbors (Haghverdi et al., 2018; Stuart et al., 2019). In these works, batch effects are deemed removed when cells from different sources exhibit good mixing in a t-SNE visualization. The elimination of batch effects may require aggressive data preprocessing, which may blur the boundaries between cell types. Another problem is the inclusion of any new data, for which the entire analysis pipeline must be rerun, usually resulting in a different embedding layout and clusters that bear little resemblance to the original visualization and thus require reinterpretation.

We propose a direct solution of rendering t-SNE visualizations to address batch effects. Our approach treats one of the data sets as a reference and embeds the cells from another, secondary data set to a reference-defined low-dimensional space. We construct a t-SNE embedding using the reference data set, which is then used as a scaffold to embed the secondary data. The key idea underpinning our approach is that secondary data points are embedded independently of one another.

Independent embedding of each secondary datum causes the clustering landscape to depend only on the reference scaffold, thus removing data source-driven variation. In other words, when including new data, the scaffold inferred from the reference data set is kept unchanged and defines a “gravitational field”, independently driving the embedding of each new instance. For example, in Fig. 2, the cells from the visual cortex define the scaffold (Fig. 2a) into which we embed the cells from the hypothalamus (Fig. 2b). Unlike in their joint t-SNE visualization (Fig. 1a), the hypothalamic cells are dispersed across the entire embedding space and their cell type correctly matches the prevailing type in reference clusters.

Fig. 2

A two-dimensional embedding of a reference containing brain cells (a) and the corresponding mapping of secondary data containing hypothalamic cells (b). The majority of hypothalamic cells were mapped to their corresponding reference cluster. For instance, astrocyte cells marked with red on the right were mapped to an oval cluster of same-typed cells denoted with the same color in the visualization on the left

The proposed solution implements a mapping of new data into an existing t-SNE visualization. While the utility of such an algorithm was already hinted at in a recent publication (Kobak & Berens, 2019), we here provide its practical and theoretically-grounded implementation. Considering the abundance of recent publications on batch effect removal, we present surprising evidence that a computationally more direct and principled embedding procedure solves the batch effects problem when constructing interpretable visualizations from different data sources.

Our contributions are twofold:

  1. We introduce a theoretically-grounded extension of the t-SNE visualization algorithm that supports embedding new data points into existing reference visualizations. Our extension is readily incorporated into existing approximation schemes, enabling its application to large data sets. We show that optimization using the default t-SNE parameters is highly unstable and propose parameter values leading to stable convergence.

  2. We show that the proposed t-SNE extensions can mitigate batch effects in the data sets and demonstrate this feature in treating single-cell gene expression data.

2 Related work

Batch effects are systematic biases between biological data sets caused by technical factors in the data collection and preparation process. It has been well documented that even small differences in the experimental setup of cell-dissociation, handling protocols, library-preparation technologies, or sequencing platforms can significantly affect the resulting gene-expression measurements (Tung et al., 2017; Hicks et al., 2018). When performing downstream comparative analyses, batch effects may confound real biological variability and introduce spurious correlations, leading to misleading conclusions.

Due to their severity, numerous computational approaches have been proposed to directly remove batch effects when performing joint analysis on two or more data sets. Batch effect removal is typically performed as a preprocessing step. Existing approaches involve either modifying the original data matrix or finding a joint lower-dimensional space, where batch effects are removed. Current methods broadly fall into two categories:

  1. Mutual nearest neighbor-based approaches aim to identify matching populations of cells across the data sets, using them to either find and correct the data sets (Haghverdi et al., 2018) or directly construct a batch-corrected k-nearest neighbor graph used in downstream analyses (Park et al., 2018).

  2. Joint embedding approaches map multiple data sets into a joint lower-dimensional space, where batch effects are removed. Some of these approaches opt for linear dimensionality-reduction methods such as PCA (Korsunsky et al., 2019) or MultiCCA (Butler et al., 2018), while others employ non-linear techniques from deep learning (Li et al., 2020; Lopez et al., 2018). Still other approaches use a combination of the two (Stuart et al., 2019; Hie et al., 2019). Note that these approaches bear similarity with transfer learning (Weiss et al., 2016), which has also been used in domain adaptation (Liu et al., 2019).

Besides computational techniques, approaches for the removal of batch effects can also use domain knowledge. For example, in the analysis of single-cell gene expression data, these approaches act on a subset of representative marker genes for a specific cell type. Instead of considering the entire gene-expression profile, which may be noisy and affected by batch effects, the idea is to profile the cells with a handful of genes that can collectively determine the cell type. One such procedure is scMap-Cluster, a consensus-based k-nearest neighbor method tailored explicitly to scRNA-seq gene-expression data (Kiselev et al., 2018). scMap-Cluster uses three correlation-based distance measures and uses a voting scheme to perform classification. To identify novel cell types, scMap-Cluster heuristically determines a distance threshold.

Our approach to batch effect removal falls into the second category, as we remove the batch effects through dimensionality reduction. Like scMap-Cluster, we also benefit from a standard single-cell data preprocessing pipeline that profiles the cells with representative genes. Unlike other batch effect removal procedures, the primary purpose of our approach is not classification but the visualization of the various cell types. If required, we can apply a k-nearest neighbor classifier to the resulting visualizations to obtain accuracy estimates and compare our approach to other classification methods. However, the classification aspect of our approach is secondary: the primary purpose of t-SNE is to aid scientists in exploratory data analysis and help them better understand the underlying data landscape.

3 Methods

We describe an end-to-end pipeline that uses fixed t-SNE coordinates as a scaffold for embedding new (secondary) data, enabling joint visualization of multiple data sources while mitigating batch effects. Our proposed approach starts by using t-SNE to embed a reference data set, with the aim of constructing a two-dimensional visualization to facilitate interpretation and cluster classification. Then, the placement of each new sample is optimized independently via the t-SNE loss function. Independent treatment of each data instance from a secondary data set disregards any interactions present in that data set, and prevents the formation of clusters that would be specific to the secondary data. Below, we start with a summary of t-SNE and its extensions (Sect. 3.1), introducing the relevant notation, upon which we base our secondary data embedding approach (Sect. 3.2).

3.1 Data embedding by t-SNE and its extensions

Local, non-linear dimensionality reduction by t-SNE is performed as follows. Given a multi-dimensional data set \({\mathbf {X}} = \left\{ {\mathbf {x}}_1, {\mathbf {x}}_2, \dots , {\mathbf {x}}_N \right\} \in {\mathbb {R}}^D\) where N is the number of data points in the reference data set, t-SNE aims to find a low dimensional embedding \({\mathbf {Y}} = \left\{ {\mathbf {y}}_1, {\mathbf {y}}_2, \dots , {\mathbf {y}}_N \right\} \in {\mathbb {R}}^d\) where \(d \ll D\), such that if points \({\mathbf {x}}_i\) and \({\mathbf {x}}_j\) are close in the multi-dimensional space, their corresponding embeddings \({\mathbf {y}}_i\) and \({\mathbf {y}}_j\) are also close. Since t-SNE is primarily used as a visualization tool, d is typically set to two. The similarity between two data points in t-SNE is defined as:

$$\begin{aligned} p_{j \mid i} = \frac{\exp \left( -\frac{1}{2} {\mathcal {D}}({\mathbf {x}}_i, {\mathbf {x}}_j ) / \sigma _i^2 \right) }{\sum _{k \ne i } \exp \left( -\frac{1}{2} {\mathcal {D}}({\mathbf {x}}_i, {\mathbf {x}}_k ) / \sigma _i^2 \right) }, \quad p_{i \mid i} = 0 \end{aligned}$$

where \({\mathcal {D}}\) is a distance measure. This is then symmetrized to

$$\begin{aligned} p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2N}. \end{aligned}$$

The bandwidth of each Gaussian kernel \(\sigma _i\) is selected such that the perplexity of the distribution matches a user-specified parameter value

$$\begin{aligned} \text {Perplexity} = 2^{H(P_i)} \end{aligned}$$

where \(H(P_i)\) is the Shannon entropy of \(P_i\),

$$\begin{aligned} H(P_i) = -\sum _j p_{j \mid i} \log _2 (p_{j \mid i}). \end{aligned}$$

Different bandwidths \(\sigma _i\) enable t-SNE to adapt to the varying density of the data in the multi-dimensional space.
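The bandwidth selection described above can be sketched as a bisection on \(\sigma _i\), since the perplexity of \(P_i\) increases monotonically with the bandwidth. The following is a minimal illustrative sketch in plain Python, not the implementation used in this work; the function names, the search interval, and the toy distance values are assumptions made for the example.

```python
import math

def conditional_probabilities(dists, sigma):
    """p_{j|i} for one point i, given D(x_i, x_j) for every j != i."""
    # Shifting by the smallest distance avoids underflow for tiny sigma;
    # the shift cancels when the weights are normalized.
    d_min = min(dists)
    weights = [math.exp(-0.5 * (d - d_min) / sigma ** 2) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """2 ** H(P_i), with the Shannon entropy H measured in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

def find_sigma(dists, target, lo=1e-3, hi=1e3, n_steps=100):
    """Bisection in log-space (the interval [lo, hi] is an assumption)."""
    for _ in range(n_steps):
        mid = math.sqrt(lo * hi)
        if perplexity(conditional_probabilities(dists, mid)) < target:
            lo = mid   # distribution too peaked: widen the kernel
        else:
            hi = mid   # distribution too flat: narrow the kernel
    return math.sqrt(lo * hi)

dists = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]  # toy values of D(x_i, x_j)
sigma = find_sigma(dists, target=3.0)
probs = conditional_probabilities(dists, sigma)
print(sigma, perplexity(probs))
```

Running the search separately for every point \({\mathbf {x}}_i\) yields one bandwidth per point, which is what lets the kernel adapt to local density.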

The similarity between points \({\mathbf {y}}_i\) and \({\mathbf {y}}_j\) in the embedding space is defined using the t-distribution with one degree of freedom

$$\begin{aligned} q_{ij} = \frac{\left( 1 + || {\mathbf {y}}_i - {\mathbf {y}}_j ||^2 \right) ^{-1}}{\sum _{k \ne l}\left( 1 + || {\mathbf {y}}_k - {\mathbf {y}}_l ||^2 \right) ^{-1}}, \quad q_{ii} = 0. \end{aligned}$$

The t-SNE method finds an embedding \({\mathbf {Y}}\) that minimizes the Kullback-Leibler (KL) divergence between \({\mathbf {P}}\) and \({\mathbf {Q}}\),

$$\begin{aligned} C = \text {KL}({\mathbf {P}} \mid \mid {\mathbf {Q}}) = \sum _{ij} p_{ij} \log \frac{p_{ij}}{q_{ij}}. \end{aligned}$$
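To make the objective concrete, the sketch below evaluates the embedding similarities \(q_{ij}\) and the KL divergence for a tiny toy layout. It is a quadratic-time illustration in plain Python under stated assumptions: the function names are hypothetical, and the matrix standing in for \({\mathbf {P}}\) is simply the similarity matrix of another toy layout rather than one derived from real data.

```python
import math

def embedding_similarities(Y):
    """q_ij from the Student t kernel with one degree of freedom."""
    n = len(Y)
    kernel = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(n):
            if k != l:  # q_ii stays zero
                sq_dist = sum((a - b) ** 2 for a, b in zip(Y[k], Y[l]))
                kernel[k][l] = 1.0 / (1.0 + sq_dist)
    total = sum(sum(row) for row in kernel)  # sum over all pairs k != l
    return [[value / total for value in row] for row in kernel]

def kl_divergence(P, Q):
    """KL(P || Q); zero entries of P contribute nothing and are skipped."""
    return sum(
        p * math.log(p / q)
        for p_row, q_row in zip(P, Q)
        for p, q in zip(p_row, q_row)
        if p > 0.0
    )

Y_a = [[0.0, 0.0], [1.0, 0.0], [0.0, 5.0]]  # toy layout
Y_b = [[0.0, 0.0], [3.0, 0.0], [0.0, 1.0]]  # a different toy layout
Q_a = embedding_similarities(Y_a)
Q_b = embedding_similarities(Y_b)
print(kl_divergence(Q_a, Q_a))  # identical distributions
print(kl_divergence(Q_a, Q_b))  # mismatched layouts give a positive value
```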

The time complexity needed to evaluate the similarities in Eq. 5 is \({\mathcal {O}}(N^2)\), making its application impractical for large data sets. We adopt a recent approach for low-rank approximation of gradients based on polynomial interpolation which reduces its time complexity to \({\mathcal {O}}(N)\). This approximation enables the visualization of massive data sets, possibly containing millions of data points (Linderman et al., 2019).

The resulting embeddings substantially depend on the value of the perplexity parameter. Perplexity can be interpreted as the number of neighbors for which the distances in the embedding space are preserved. Small values of perplexity result in tightly-packed clusters of points and effectively ignore the long-range interactions between clusters. Larger values may result in more globally consistent visualizations, which preserve distances on a large scale and organize clusters in a more meaningful way, but can lead to merging small clusters and thus obscuring local aspects of the data (Kobak & Berens, 2019).

The trade-off between the local organization and global consistency may be achieved by replacing the Gaussian kernels in Eq. 1 with a mixture of Gaussians of varying bandwidths (Lee et al., 2015). Multi-scale kernels are defined as

$$\begin{aligned} p_{j \mid i} \propto \frac{1}{L} \sum _{l=1}^{L} \exp \left( - \frac{1}{2} {\mathcal {D}}({\mathbf {x}}_i, {\mathbf {x}}_j ) / \sigma _{i,l}^2 \right) , \quad p_{i \mid i} = 0 \end{aligned}$$

where L is the number of mixture components as specified by the user. The bandwidths \(\sigma _{i,l}\) are selected in the same manner as in Eq. 1, but with a different value of perplexity for each l. In our experiments, we used a mixture of two Gaussian kernels with perplexity values of 50 and 500. A similar formulation of multi-scale kernels was proposed in Kobak and Berens (2019), and we found the resulting embeddings are visually very similar to those obtained with the approach described above (not shown for brevity).
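The multi-scale kernel can be illustrated with a short sketch. It assumes the per-component bandwidths have already been found by a perplexity search; the bandwidth values, the function name, and the toy distances below are hypothetical and chosen only to show the effect of adding a wider component.

```python
import math

def multiscale_probabilities(dists, sigmas):
    """p_{j|i} for one point i under an equally weighted Gaussian mixture.

    dists  -- values of D(x_i, x_j) for every j != i
    sigmas -- one bandwidth per mixture component (assumed precomputed)
    """
    weights = [
        sum(math.exp(-0.5 * d / s ** 2) for s in sigmas) / len(sigmas)
        for d in dists
    ]
    total = sum(weights)
    return [w / total for w in weights]

dists = [1.0, 4.0, 9.0, 25.0]
narrow = multiscale_probabilities(dists, sigmas=[1.0])       # single kernel
multi = multiscale_probabilities(dists, sigmas=[1.0, 3.0])   # two scales
print(narrow[-1], multi[-1])
```

In this toy case the most distant neighbor receives almost no probability mass under the narrow kernel alone, while the mixture assigns it a non-negligible share; this is the mechanism by which longer-range structure enters the affinities.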

When using t-SNE on larger data sets, the standard learning rate \(\eta = 200\) has been shown to lead to slower convergence and requires more iterations to achieve consistent embeddings (Belkina et al., 2019). We follow the recommendation of Belkina et al. (2019) and use a higher learning rate \(\eta = N / 12\) when visualizing larger data sets.

3.2 Adding new data points to reference embedding

Our algorithm, which embeds new data points to a reference embedding, consists of estimating similarities between each new point and the reference data and optimizing the position of each new data point in the embedding space. Unlike parametric models such as principal component analysis or autoencoders, t-SNE does not define an explicit mapping to the embedding space, and embeddings need to be found through loss function optimization.

The position of a new data point in embedding space is initialized to the median reference embedding position of its k nearest neighbors. While we found the algorithm to be robust to choices of k, we use \(k=10\) in our experiments.
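The initialization step can be sketched as follows. This is an illustrative sketch that assumes squared Euclidean distance in the data space, a brute-force neighbor search, and a coordinate-wise median; the function name and the toy data are hypothetical.

```python
from statistics import median

def median_init(new_point, X_ref, Y_ref, k=10):
    """Initial embedding position for one new point.

    Neighbors are searched in the original data space (X_ref); the result
    is the coordinate-wise median of their embedding positions (Y_ref).
    """
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    nearest = sorted(
        range(len(X_ref)), key=lambda i: sq_dist(new_point, X_ref[i])
    )[:k]
    n_dims = len(Y_ref[0])
    return [median(Y_ref[i][d] for i in nearest) for d in range(n_dims)]

# Toy reference: two groups in data space and their embedding positions.
X_ref = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0]]
Y_ref = [[-1.0, -1.0], [-1.2, -0.8], [-0.9, -1.1], [5.0, 5.0], [5.2, 5.1]]
w0 = median_init([0.2, 0.2], X_ref, Y_ref, k=3)
print(w0)
```

Because each new point is handled on its own, this step can be run independently for every point of the secondary data set.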

We adapt the standard t-SNE formulation from Eqs. 1 and 5 with

$$\begin{aligned} p_{j \mid i}&= \frac{\exp \left( -\frac{1}{2} {\mathcal {D}}({\mathbf {x}}_i, {\mathbf {v}}_j) / \sigma _i^2 \right) }{\sum _{i} \exp \left( -\frac{1}{2} {\mathcal {D}}({\mathbf {x}}_i, {\mathbf {v}}_j) / \sigma _i^2 \right) }, \end{aligned}$$
$$\begin{aligned} q_{j \mid i}&= \frac{\left( 1 + || {\mathbf {y}}_i - {\mathbf {w}}_j ||^2 \right) ^{-1}}{\sum _{i}\left( 1 + || {\mathbf {y}}_i - {\mathbf {w}}_j ||^2 \right) ^{-1}}, \end{aligned}$$

where \({\mathbf {V}} = \left\{ {\mathbf {v}}_1, {\mathbf {v}}_2, \dots , {\mathbf {v}}_M \right\} \in {\mathbb {R}}^D\) is the secondary data set containing M samples and \({\mathbf {W}} = \left\{ {\mathbf {w}}_1, {\mathbf {w}}_2, \dots , {\mathbf {w}}_M \right\} \in {\mathbb {R}}^d\) are their corresponding embeddings. Additionally, we omit the symmetrization step in Eq. 2. This enables new points to be inserted into the embedding independently of one another. The gradient of the loss (Eq. 6) with respect to \({\mathbf {w}}_j\) is:

$$\begin{aligned} \frac{\partial C}{\partial {\mathbf {w}}_j} = 2 \sum _i \left( p_{j \mid i} - q_{j \mid i} \right) \left( {\mathbf {w}}_j - {\mathbf {y}}_i \right) \left( 1 + || {\mathbf {y}}_i - {\mathbf {w}}_j || ^2 \right) ^{-1}. \end{aligned}$$

In the optimization step, we refine point positions using batch gradient descent. We use an adaptive learning rate scheme with momentum to speed up the convergence, as proposed by Jacobs (1988) and van der Maaten (2014). We run gradient descent with momentum \(\alpha = 0.8\) for 250 iterations, by which point the optimization had converged in all our experiments. Evaluating the gradients in Eq. 10 takes \({\mathcal {O}}(N \cdot M)\) time; by adapting the same polynomial interpolation-based approximation, this is reduced to \({\mathcal {O}}(\max \{ N, M \})\). The time complexity can further be reduced to \({\mathcal {O}}(M)\) by exploiting the fact that the reference embedding remains fixed.
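The optimization of a single new point against a fixed reference embedding can be sketched as below. This is a deliberately simplified illustration and not the implementation used in this work: it evaluates the exact \({\mathcal {O}}(N)\) gradient per point, uses plain momentum without the adaptive gains of Jacobs (1988), and all function names and toy values are assumptions. The gradient is written as the derivative of the KL loss with respect to \({\mathbf {w}}_j\), so that subtracting it moves the point toward reference points whose \(p_{j \mid i}\) exceeds \(q_{j \mid i}\).

```python
import math

def embedding_affinities(Y_ref, w):
    """Kernel values and q_{j|i} between one new point w and the reference."""
    kernel = [
        1.0 / (1.0 + sum((y_d - w_d) ** 2 for y_d, w_d in zip(y, w)))
        for y in Y_ref
    ]
    z = sum(kernel)
    return kernel, [k / z for k in kernel]

def kl_loss(p, Y_ref, w):
    """KL divergence between p_{j|i} and q_{j|i} for one new point."""
    _, q = embedding_affinities(Y_ref, w)
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)

def gradient(p, Y_ref, w):
    """Derivative of kl_loss with respect to the position w."""
    kernel, q = embedding_affinities(Y_ref, w)
    grad = [0.0] * len(w)
    for p_i, q_i, k_i, y in zip(p, q, kernel, Y_ref):
        coef = 2.0 * (p_i - q_i) * k_i
        for d in range(len(w)):
            grad[d] += coef * (w[d] - y[d])
    return grad

def embed_point(p, Y_ref, w_init, learning_rate=0.1, momentum=0.8, n_iter=250):
    """Gradient descent with momentum; the reference embedding stays fixed."""
    w = list(w_init)
    velocity = [0.0] * len(w)
    for _ in range(n_iter):
        grad = gradient(p, Y_ref, w)
        velocity = [momentum * v - learning_rate * g for v, g in zip(velocity, grad)]
        w = [w_d + v_d for w_d, v_d in zip(w, velocity)]
    return w

# Toy reference with two clusters; the new point is similar to the first one.
Y_ref = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
         [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]]
p = [1 / 3, 1 / 3, 1 / 3, 0.0, 0.0, 0.0]
w_init = [6.0, 6.0]
w_final = embed_point(p, Y_ref, w_init)
print(w_final, kl_loss(p, Y_ref, w_init), kl_loss(p, Y_ref, w_final))
```

Since the loss for one new point involves only that point and the fixed reference positions, the same routine can be applied to every secondary point independently, which is the property the method relies on.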

Special care must be taken to reduce the learning rate \(\eta\), as the default value in most implementations (\(\eta = 200\)) may cause points to “shoot off” from the reference embedding. This phenomenon arises because we embed into a previously defined t-SNE space, where the distances between data points, and the corresponding gradients of the optimization function, may be quite large. When running standard t-SNE, points are initialized and scaled to have variance 0.0001. The resulting gradients tend to be very small during the initial phase, resulting in stable convergence. When embedding new samples, the span of the embedding is much larger, resulting in substantially larger gradients, and the default learning rate causes points to move very far from the reference embedding. In our experiments, we found that decreasing the learning rate to \(\eta \sim 0.1\) produces stable solutions. Alternatively, we can employ gradient clipping to achieve similar behaviour. This is especially important when using the interpolation-based approximation, which places a grid of interpolation points over the embedding space, where the number of grid points is determined by the span of the embedding. Clearly, if even one point “shoots off” far from the embedding, the number of required grid points may grow dramatically, increasing the runtime substantially. The reduced learning rate suppresses this issue and does not slow the convergence because of the adaptive learning rate scheme, provided the optimization is run for a sufficient number of steps.
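Gradient clipping can take several forms, and the text does not specify which variant is meant. The sketch below shows one common choice, rescaling a gradient whose Euclidean norm exceeds a threshold; the function name and the threshold value are assumptions for illustration only.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

learning_rate = 200.0          # the common default discussed above
grad = [3.0, 4.0]              # toy gradient with norm 5
unclipped_step = [learning_rate * g for g in grad]
clipped_step = [learning_rate * g for g in clip_by_norm(grad, max_norm=0.25)]
print(unclipped_step, clipped_step)
```

Clipping bounds the length of any single update by the learning rate times the threshold, so one large gradient cannot push a point arbitrarily far from the reference embedding.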

4 Experiments and discussion

We apply the proposed approach to t-SNE visualizations of single-cell data. Data in this realm include a variety of cells from specific tissues and are characterized through gene expression. In our experiments, we considered several recently published data sets where cells were annotated with the cell type. Our aim was to construct t-SNE visualizations where similarly-typed cells would cluster together, despite systematic differences between data sources. To that end, we focus on comparing different ways of using t-SNE rather than differences to embeddings like PCA or MDS, which have been substantially covered before (van der Maaten & Hinton, 2008; Becht et al., 2019). Below, we list the data sets used in our experiments, and display the resulting data visualizations. Due to the unique nature of single-cell data, we apply a specialized single-cell pipeline for all our experiments, as described in Appendix A. Finally, we discuss the success of the proposed approach in alleviating the batch effects.

4.1 Data

We use three pairs of reference and secondary single-cell data sets originating from different organisms and tissues. The data in each pair were chosen so that the majority of cell types from the secondary data set were included in the reference set (Table 1). The cells in the data sets originate from the following three tissues:

Mouse brain:

The data set from Hrvatin et al. (2018) contains cells from the visual cortex exploring transcriptional changes after exposure to light. This was used as a reference for the data from Chen et al. (2017), containing cells from the mouse hypothalamus and their reaction to food deprivation. From the secondary data, we removed cells with no corresponding types in the reference: tanycytes, ependymal, epithelial, and unlabelled cells.

Human pancreas:

Baron et al. (2016) created an atlas of pancreatic cell types. We used this set as a reference for data from Xin et al. (2016), who examined transcriptional differences between healthy and type 2 diabetic patients.

Mouse retina:

Macosko et al. (2015) created an atlas of mouse retinal cell types. We used this as a reference for the data from Shekhar et al. (2016), who built an atlas for retinal bipolar cells.

Table 1 Data sets used in our experiments

4.2 t-SNE transform successfully alleviates batch effects

Figures 2, 3, and 4 show the embeddings of the reference data sets and their corresponding embeddings of the secondary data sets. In all the figures, the cells from the secondary data sets were positioned in the cluster of same-typed reference cells, providing strong evidence of the success of our approach. There are some deviations from these observations; for instance, in Fig. 2 several oligodendrocyte precursor cells (OPCs) were mapped to oligodendrocytes. This may be due to differences in annotation criteria by different authors, or due to inherent similarities of these types of cells. Examples of such erroneous placements can be found in other figures as well, but are uncommon and constitute less than 5% of the cells (less than 5% in brain, 1% in pancreas and 2% in retina secondary data).

Notice that we could simulate the split between reference and secondary data sets using one data set only and perform cross-validation; however, this type of experiment would not incorporate batch effects. We want to remind the reader that handling batch effects was central to our endeavor and that disregarding this effect could lead to overly optimistic results and data visualizations strikingly different from ours. For example, compare the visualizations from Figs. 1a and 2b, or Figs. 1b and 3b.

Fig. 3

Embedding of pancreatic cells from Baron et al. (2016) and cells from the same tissue from Xin et al. (2016). Just like in Fig. 2, the vast majority of the cells from the secondary data set were correctly mapped to the same-typed cluster of reference cells

Fig. 4

An embedding of a large reference of retinal cells from Macosko et al. (2015) (a) and mapping of cells from a smaller study that focuses on bipolar cells from Shekhar et al. (2016) (b). We use colors consistent with the study by

4.3 Construction of a reference embedding

We use a number of additional, recently proposed modifications to enhance the t-SNE visualization of the reference data set. Kobak and Linderman have shown that the global consistency of embeddings produced by popular visualization algorithms is largely dependent on their initialization (Kobak & Linderman, 2021). By utilizing PCA-based initialization, t-SNE is able to achieve more meaningful layouts of the resulting clusters (Fig. 5b) as opposed to using randomly initialized embeddings (Fig. 5a). Another important extension is the use of multi-scale similarities, which, in addition to considering short-range interactions, also model wider point neighborhoods. Coupled with PCA-based initialization, this produces even more meaningful visualizations where clusters form interpretable structures. For instance, consider Fig. 5c, which reveals two meaningful subgroups of GABAergic neurons, corresponding to their developmental origin, as discussed in Tasic et al. (2018), while this division is less apparent when using PCA-based initialization alone in Fig. 5b.

Fig. 5

A comparison of standard and multi-scale t-SNE on data from the mouse neocortex (Tasic et al., 2018). a Standard t-SNE using random initialization places clusters arbitrarily. The resulting clustering structure is not globally consistent, as clusters of the same type of cells are dispersed throughout the landscape. Non-Neuronal clusters, for instance, are mixed with clusters of GABAergic and Glutamatergic neurons. b By utilizing a globally consistent initialization for t-SNE, the clusters are organized in a more meaningful layout, where clusters of cells of the same type appear closer together. c Augmenting t-SNE with multi-scale similarities and using proper initialization provides a more meaningful layout of the clusters. Non-Neuronal and Endothelial cell types are now placed in the same region of the embedding. There are two clear sub-groups of GABAergic neurons corresponding to their developmental origins, which was not as apparent when using PCA-based initialization alone

We also observed the important role of gene selection in crafting the reference embedding spaces. We found that when selecting an insufficient number of genes, the resulting visualizations display overly fragmented clusters. When the selection is too broad and includes lowly expressed genes, the subclusters tend to overlap. These effects can all be attributed to the sparseness of the data sets and may be intrinsic to single-cell data. In our studies, we found that a selection of 3000 genes yields the most informative visualizations (Fig. 6).

Fig. 6

Gene selection plays an important role when constructing the reference embedding. a Using too few genes results in fragmented clusters. b Using an intermediate number of genes reveals clustering mostly consistent with cell annotations. c Including all the genes may lead to under-clustering of the more specialized cell types. In our example, the neuronal subclusters are more clearly defined in (b)

4.4 Optimization is crucial to producing meaningful point embeddings

In principle, our theoretically-grounded embedding of secondary data into the scaffold defined by the reference embedding could be simplified with the application of a nearest neighbors-based procedure. For example, while describing a set of tricks for t-SNE, Kobak and Berens (2019) proposed positioning new points into a known embedding by placing them in the median position of their 10 nearest neighbors, where the neighborhood was estimated in the original data space. Notice that we use this approach as well, but only for the initialization of positions of new data instances that are subject to further optimization. Although both nearest-neighbor search and t-SNE optimization can be computed in linear time, the former dominates the runtime (mouse retina example: 44,808 reference cells, 26,830 secondary cells, 9 min NN search, 13 s optimization).

Fig. 7 demonstrates a case where nearest neighbor-based positioning alone is insufficient. We construct a reference embedding using only neurons from Hrvatin et al. (2018) (Fig. 7a) and use that to position neuronal cells from the data set from Campbell et al. (2017). We utilize the median and weighted mean positions to initialize point positions from the secondary data set, as shown in Fig. 7c, d. After initialization, we optimize point positions using the procedure described above for 500 iterations. The resulting visualizations from both initializations are visually very similar, indicating stable convergence. We show one of the resulting visualizations in Fig. 7b.

Notice that both neighbor-based initialization schemes generally position data points such that their classification is unclear. Median-based initialization produces a sort of grid-like structure, while mean-based initialization positions the points almost continuously across the embedding space. Optimization reveals strong correspondence of several points to reference-defined clusters, while other points from the secondary data set are pushed away from their initial clusters, possibly indicating dissimilarity.

Fig. 7

Comparison of different initialization schemes for positioning new data points onto reference embeddings. a We construct a reference embedding using only neuronal subtypes from Hrvatin et al. (2018). b We position neuronal cells from Campbell et al. (2017) using the median initialization scheme from Kobak and Berens (2019) and run optimization for 500 iterations. Compare the optimized embedding with the unoptimized median initialization (c) and a simple weighted mean initialization (d)

4.5 On requirement of a complete reference set

Our approach assumes that all cell types from the secondary data set are present in the reference. Intuitively, using t-SNE in such a way is conceptually similar to classification via k-nearest neighbor classifiers and is similarly limited. The method may fail to reveal unseen cell types in the secondary data set, likely positioning them arbitrarily close to unrelated clusters. In some instances, unknown cell types may be sufficiently different from the reference data that t-SNE will repel them from existing clusters. However, we caution that this approach is unreliable and depends heavily on the chosen preprocessing pipeline.

We illustrate this with Fig. 8, where we first create a reference embedding containing only neuronal cells from Hrvatin et al. (2018). We then select only non-neuronal cells from Campbell et al. (2017) and add them to the reference embedding in Fig. 8b. The non-neuronal cells from Campbell et al. are scattered somewhat arbitrarily around several clusters in the reference embedding. Interestingly, the secondary data points form a “ring” around one of the clusters, indicating that these data points are very different from the cells in this cluster. Notice also that the points from the secondary data set exhibit little to no clustering and the different cell types seem to be mixed among each other. We hypothesize that this effect is due primarily to the single-cell preprocessing pipeline and not the limitations of our procedure itself, as the informative genes selected to create the reference neuronal embedding likely do not differentiate supportive glial cells from the secondary data set. This effect is similar to procedures such as scMap-Cluster, a consensus k-nearest neighbor method, which heuristically determines a distance threshold to identify unknown cell types (Kiselev et al., 2018).

Fig. 8

A reference embedding must contain all the cell types in the secondary data set to produce reliable results. a We construct a reference embedding containing only neuronal cells from Hrvatin et al. (2018). b We select only non-neuronal cells from Campbell et al. (2017) so that the two data sets share no cell types. Conceptually, t-SNE behaves similarly to a k-NN classifier and places the non-neuronal cells near their most similar points in the reference. In some instances, the non-neuronal cells are sufficiently different from the neuronal cells that they are repelled from the reference clusters. Such behavior results in the “ring” seen on the right-hand side of the embedding

Our procedure is, therefore, asymmetrical in the choice of reference and secondary data set. In practice, however, newly produced secondary data would be embedded into previously prepared reference landscapes. Large collections of data, e.g., the Human Cell Atlas initiative (Rozenblatt-Rosen et al., 2017), make it possible to scale up our approach to wider sets of cell types. Identifying potential failure cases, where rare cell types may still be missing from constructed reference embeddings, remains an open problem in the bioinformatics community and is an active area of research.

4.6 Comparison to other similar batch-effect methods

To quantitatively evaluate the predictive accuracy of the described procedure, we fit k-nearest neighbors classifiers on each reference t-SNE embedding from Figs. 2a, 3a and 4a and use them to predict the cell types for the secondary data set embeddings from Figs. 2b, 3b and 4b. The accuracy measures are reported in Table 2. Our procedure of embedding new data points into a two-dimensional t-SNE plane results in accuracy similar to that of approaches like random forests that use the full compendium of cell-characterizing features. The results indicate that positioning new cells onto a cell visualization plane is not only indicative but also an accurate instrument for cell type characterization.
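The evaluation can be sketched as follows: a k-nearest neighbors vote over the two-dimensional reference coordinates, scored with classification accuracy and the adjusted Rand index reported in Table 2. This is a minimal sketch on synthetic coordinates. The helper names, the choice of k = 5 for the embedding-space classifier, and the toy data are assumptions, and the numbers it prints are unrelated to Table 2.

```python
import numpy as np

def knn_predict(train_xy, train_labels, test_xy, k=5):
    """Majority vote among the k nearest reference points in embedding space."""
    d2 = ((test_xy[:, None, :] - train_xy[None, :, :]) ** 2).sum(axis=-1)
    nn = np.argsort(d2, axis=1)[:, :k]
    preds = []
    for row in train_labels[nn]:
        values, counts = np.unique(row, return_counts=True)
        preds.append(values[np.argmax(counts)])  # ties go to the first label in sort order
    return np.array(preds)

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings, from their contingency table."""
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    table = np.zeros((ai.max() + 1, bi.max() + 1), dtype=np.int64)
    np.add.at(table, (ai, bi), 1)

    def comb2(n):
        return n * (n - 1) / 2.0  # number of pairs, elementwise

    sum_cells = comb2(table).sum()
    sum_rows = comb2(table.sum(axis=1)).sum()
    sum_cols = comb2(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / comb2(len(a))
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0
    return float((sum_cells - expected) / (max_index - expected))

# Synthetic stand-ins for a reference embedding and embedded secondary points.
rng = np.random.default_rng(0)
centers = np.array([[-5.0, 0.0], [5.0, 0.0]])
ref_labels = np.repeat(np.array(["type_a", "type_b"]), 50)
ref_xy = np.repeat(centers, 50, axis=0) + rng.normal(size=(100, 2))
new_labels = np.repeat(np.array(["type_a", "type_b"]), 20)
new_xy = np.repeat(centers, 20, axis=0) + rng.normal(size=(40, 2))

pred = knn_predict(ref_xy, ref_labels, new_xy, k=5)
print(accuracy(new_labels, pred), adjusted_rand_index(new_labels, pred))
```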

Table 2 We compare our approach (t-SNE) to three other methods, evaluating performance using classification accuracy and the adjusted Rand index (ARI)

We compare our approach to two machine learning techniques, namely a k-nearest neighbor classifier (KNN) and a random forest ensemble, and to scMap-Cluster (Kiselev et al., 2018). For scMap-Cluster, we disable the distance threshold heuristic for identifying novel cell types, as our secondary data sets were chosen such that there is complete overlap between cell types. For the two machine learning approaches, we apply the typical single-cell preprocessing pipeline described in Appendix A, i.e., library-size normalization, log-transformation, and selection of the 1000 most informative genes. Similarly to scMap-Cluster, we use the cosine distance to find the 5 nearest neighbors in the KNN model. We used 100 trees in the random forest ensemble. The models were fit on the reference data set, and no hyper-parameter tuning was performed.
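The KNN baseline described above can be sketched end to end on synthetic counts. This is only an illustration of the listed steps and not the pipeline from Appendix A. The normalization target (`target_sum`), the use of variance as a stand-in for the gene informativeness score, the helper names, and the toy data are assumptions.

```python
import numpy as np

def normalize_log(counts, target_sum=1e4):
    """Library-size normalization followed by log(1 + x); target_sum is assumed."""
    lib = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / np.maximum(lib, 1.0) * target_sum)

def top_variance_genes(x, n_genes=1000):
    """Indices of the n_genes highest-variance genes (stand-in for 'most informative')."""
    order = np.argsort(x.var(axis=0))[::-1]
    return np.sort(order[:n_genes])

def cosine_knn_predict(x_ref, labels_ref, x_new, k=5):
    """Majority vote among the k reference cells with the highest cosine similarity."""
    ref = x_ref / np.maximum(np.linalg.norm(x_ref, axis=1, keepdims=True), 1e-12)
    new = x_new / np.maximum(np.linalg.norm(x_new, axis=1, keepdims=True), 1e-12)
    nn = np.argsort(-(new @ ref.T), axis=1)[:, :k]
    preds = []
    for row in labels_ref[nn]:
        values, counts = np.unique(row, return_counts=True)
        preds.append(values[np.argmax(counts)])
    return np.array(preds)

# Toy data: two cell types over 40 genes; the secondary set is sequenced 3x deeper.
rng = np.random.default_rng(0)
rates = np.full((2, 40), 0.2)
rates[0, :10] = 30.0      # marker genes of type_a
rates[1, 10:20] = 30.0    # marker genes of type_b
ref_labels = np.repeat(np.array(["type_a", "type_b"]), 30)
ref_counts = rng.poisson(np.repeat(rates, 30, axis=0))
new_labels = np.repeat(np.array(["type_a", "type_b"]), 10)
new_counts = rng.poisson(3.0 * np.repeat(rates, 10, axis=0))

# Genes are selected on the reference only and then reused for the secondary data.
ref_log = normalize_log(ref_counts)
genes = top_variance_genes(ref_log, n_genes=20)
pred = cosine_knn_predict(ref_log[:, genes], ref_labels,
                          normalize_log(new_counts)[:, genes], k=5)
print(float(np.mean(pred == new_labels)))
```

Selecting genes on the reference and reusing the same indices for the secondary data mirrors the statement that the models were fit on the reference data set only.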

Surprisingly, both the random forest and k-nearest neighbor models outperform scMap-Cluster, which is specifically tailored to scRNA-seq data. However, these results may be skewed because, in our examples, all the cell types from the secondary data set were present in the reference data set. One of the core features of scMap-Cluster is the detection of novel cell types, which none of the other methods support. In other words, the other three methods would always assign a cell type to a given cell, regardless of cell origin. Additionally, scMap-Cluster was primarily designed and tested on data sets produced by full-length sequencing protocols, which tend to detect a much higher number of molecules than sequencing protocols based on unique molecular identifiers (UMI). These two classes of sequencing protocols produce data sets with different sparsity and variance characteristics. This is consistent with the results in Table 2: only the data sets from the human pancreas were produced using a full-length sequencing protocol, and these are the data sets on which scMap-Cluster achieves reasonably high accuracy.

The aim of t-SNE is to construct embeddings in which neighborhoods are preserved; it is therefore unsurprising that the accuracy of our t-SNE-based approach is largely consistent with that of the k-nearest neighbors model. While our approach is comparable to the other models in terms of accuracy, we emphasize that the goal of t-SNE embeddings is to serve as visual aids in exploratory data analysis. It is thus surprising that our simple procedure performs competitively with specialized classification methods. In addition to providing the end-user with a cell-type prediction, our procedure allows the user to examine the low-dimensional embedding space, which may provide richer insight and interpretation of the resulting predictions.

5 Conclusion

Almost all recent publications of single-cell studies begin with a two-dimensional visualization of the data that reveals cellular diversity. While many dimensionality reduction techniques are available, different variants of t-SNE are most often used to produce such visualizations. Single-cell studies enable the exploration of biological mechanisms at a cellular level, and their publications in the past couple of years are abundant. One of the central tasks in single-cell studies is the classification of new cells based on findings from previous studies. Such transfer of knowledge is often difficult due to batch effects present in data from different sources. Addressing batch effects by adapting and extending t-SNE, the prevailing method used to present single-cell data in two-dimensional visualizations, motivated the research presented in this paper.

The proposed approach uses a t-SNE embedding as a scaffold for the positioning of new cells within the visualization, and possibly for aiding in their classification. The three case studies, which incorporate pairs of data sets from different domains but with similar classifications, demonstrate that our proposed procedure can effectively deal with batch effects to construct visualizations that correctly map secondary data sets onto an embedding of the data from an independent study that possibly uses a different experimental protocol. We quantitatively evaluate the predictive accuracy of our approach by fitting a k-nearest neighbors model on the resulting two-dimensional embeddings and compare its predictive accuracy to other machine learning methods that use the entire compendium of gene expressions that characterize the cells. Experiments show that our approach is successful in predicting cell types and performs comparably to other methods. This encouraging result indicates that by using our procedure, scientists can quickly and accurately determine the composition of new data by merely inspecting the resulting visualizations. While we focused here on reference visualizations constructed using t-SNE, this approach can be applied using any existing two-dimensional visualization.

6 Availability and implementation

The procedures described in this paper are provided as Python notebooks that are, together with the data, available in an open repository (see Footnote 1). The described methods were implemented and incorporated into openTSNE, our open-source, extensible t-SNE library for Python (Poličar et al., 2019).