Enhancing cluster analysis via topological manifold learning

We discuss topological aspects of cluster analysis and show that inferring the topological structure of a dataset before clustering it can considerably enhance cluster detection: theoretical arguments and empirical evidence show that clustering embedding vectors, which represent the structure of a data manifold, instead of the observed feature vectors themselves is highly beneficial. To demonstrate, we combine the manifold learning method UMAP for inferring the topological structure with the density-based clustering method DBSCAN. Synthetic and real data results show that this both simplifies and improves clustering in a diverse set of low- and high-dimensional problems including clusters of varying density and/or entangled shapes. Our approach simplifies clustering because topological pre-processing consistently reduces the parameter sensitivity of DBSCAN. Clustering the resulting embeddings with DBSCAN can then even outperform complex methods such as SPECTACL and ClusterGAN. Finally, our investigation suggests that the crucial issue in clustering does not appear to be the nominal dimension of the data or how many irrelevant features it contains, but rather how separable the clusters are in the ambient observation space they are embedded in, which is usually the (high-dimensional) Euclidean space defined by the features of the data. Our approach is successful because we perform the cluster analysis after projecting the data into a more suitable space that is optimized for separability, in some sense.


Introduction
Clustering is the task of uniting similar and separating dissimilar observations in a dataset (Kriegel et al, 2009; Aggarwal, 2014). It is a fundamental task in data analysis and is thus widely investigated in many fields. With this study, we intend to raise awareness for topological aspects of clustering and to provide empirical evidence that topologically-informed approaches which are conceptually and computationally simple can compete with or even outperform much more complex existing methods on a wide range of problems.

Problem specification
Cluster analysis is usually approached in an algorithm-driven manner, and considerations about the underlying principles of data generating processes and data structures are often limited to a probabilistic conceptualization assuming that the data X follow a joint probability distribution P(X) (Hastie et al, 2009) or, more precisely, a mixture of distributions (Aggarwal, 2014). In contrast, connections to topological data analysis (TDA) (Chazal and Michel, 2021; Wasserman, 2018), a branch of statistical data analysis inferring the structure of data leveraging topological concepts, are usually not considered. In general, the topological aspects of cluster analysis appear to be an under-investigated topic. Current textbooks on cluster analysis (e.g. Aggarwal and Reddy, 2014; Aggarwal, 2015; Giordani et al, 2020; Scitovski et al, 2021; Hennig et al, 2015) and recent reviews of the field (e.g. Jain et al, 1999; Kriegel et al, 2009; Assent, 2012; Pandove et al, 2018; Mittal et al, 2019) rarely mention the term "topology".
Following Niyogi et al (2011), we consider clustering a natural example of TDA. Since an improved understanding of the underlying principles governing the problem is likely to lead to more suitable methods and novel solutions, our work aims to reduce this lack of awareness of topological aspects in the clustering literature. Specifically, our approach follows Niyogi et al (2011, p. 2) who state that "clustering is a kind of topological question" which tries to separate the data into "connected components." One particularly relevant consequence of this topological perspective is its implication that the difficulty of a clustering problem is not necessarily determined by the data's (nominal) dimensionality.

Scope of the study
In this work, we make use of the well-known algorithm DBSCAN (Ester et al, 1996) for cluster detection and the recently developed manifold learning algorithm UMAP (McInnes et al, 2018) to infer the topological structure of a dataset.
UMAP has a decidedly topological underpinning, so it is suitable for a theoretical analysis from the clustering perspective we take here. In particular, it builds on simplicial complexes to obtain a fuzzy topological representation of the inherent structure of a dataset. As such, it is based on the same theoretical principles as topological data analysis (Chazal and Michel, 2021; Wasserman, 2018). In addition, it has already been shown that preprocessing by UMAP can improve clustering results (Allaoui et al, 2020) and that the resulting embeddings frequently yield "more compact clusters than t-SNE [another state-of-the-art manifold learning method] with more white space in between" (Kobak and Linderman, 2021, p. 157).
To be specific, "inferring the topological structure" as we do here with UMAP has two aspects: first, a fuzzy graph representation of the dataset is used to find the (number of) connected components. Second, this structure is represented by embedding vectors (i.e. coordinates in a representation space) that are optimized for the separability of the connected components. As we show in section 3, UMAP's graph construction and graph embedding steps both increase cluster separability, and their combined effect thus improves clusterability dramatically.
DBSCAN, on the other hand, is a widely used and well-established method for cluster detection (Schubert et al, 2017). In particular, it neither requires a pre-specified number of clusters nor does it make any assumptions about their specific shapes or patterns. This is important, as inferring the connected components of a dataset is largely equivalent to identifying the clusters it contains. Moreover, the optimized representation of the topological approach focuses on the separability of clusters, not on the specific shapes the clusters might have. Also note that UMAP's developers conjectured that it might enhance density-based clustering, but that this requires further investigation (McInnes, 2018).
From a practical perspective, this means we use UMAP to preprocess the data such that its representation is optimized for separability and use the resulting embedding vectors as inputs for DBSCAN. Although the theoretical and empirical considerations outlined above show that these two methods are suitable, it has to be emphasized that this does not mean that we consider UMAP and DBSCAN the most suitable combination in general. Certainly, additional research has to focus on the pros, cons, and differences between UMAP and other manifold learning methods, in particular t-SNE, and some efforts have already been made in this direction (Kobak and Linderman, 2021; Wang et al, 2021). In this paper, we intend to show that a topological perspective, in general, can improve the understanding and practical feasibility of clustering, not that this specific combination of methods is the most suitable. Combinations of clustering and/or manifold learning methods other than UMAP and DBSCAN are possible and certainly deserve investigation as well.
Moreover, note that there are other approaches to infer the topological structure of a dataset. For example, persistent homology, which also builds on simplicial complexes, quantifies the topological structure of a dataset by providing information on statistically significant persistent topological features such as connected components, holes, or voids (e.g. Wasserman, 2018). In contrast, measures of data separability such as the distance-based separability index (Guan and Loew, 2021) quantify the separability of datasets in a single scalar value. However, both approaches only contribute to the first aspect of inferring the topological structure, i.e. they do not provide a data representation optimized for separability.

Contributions
This study makes three distinct contributions: First, section 3 illustrates that approaches motivated by a topological perspective can dramatically reduce the complexity of clustering for both low- and high-dimensional data. This is achieved with an in-depth analysis of simulated data specifically designed to reflect some often described problems of clustering including high-dimensional data, clusters of different densities, and irrelevant features. In addition, a simple toy example demonstrates why and how inferring the intrinsic topological structure of a dataset with UMAP before clustering improves the clustering performance of DBSCAN.
Secondly, with intuition and motivation in place, section 4 is devoted to specific implications of the topological perspective. We describe which structures of a dataset are preserved when inferring the topological structure by finding connected components and enhancing separability (using the UMAP algorithm), in particular by contrasting topological against geometrical characteristics in a detailed qualitative and quantitative analysis of simple synthetic examples.
Finally, in section 5, we report extensive experiments using real-world data. Our results show that inferring the topological structure of datasets before clustering them not only improves the performance of DBSCAN (dramatically, for some examples such as MNIST), but also drastically reduces its parameter sensitivity. The comparatively simple approach of combining UMAP and DBSCAN can even outperform recently proposed clustering methods such as ClusterGAN (Mukherjee et al, 2019), which require expensive hyperparameter tuning, on complex datasets.
In addition, related work and the methods used are described in section 2, while the results are discussed in section 6 before we conclude in section 7.

Methods and related work
In this section, we first describe the background of the study and related work, before we outline the methods DBSCAN and UMAP, which are used for clustering and inferring topological structure, respectively, in this study. Readers who are familiar with the methods might skip the corresponding paragraphs. However, note that we will refer to some of the more technical details outlined here in section 3.2.

Background and related work
The body of literature on clustering, topological data analysis, and manifold learning is extensive and has seen contributions from many different areas and perspectives. General reviews on clustering have been provided for example by Jain et al (1999) and more recently by Saxena et al (2017). Moreover, there are several reviews focusing on cluster analysis for high-dimensional data (Kriegel et al, 2009; Assent, 2012; Pandove et al, 2018; Mittal et al, 2019). In addition, there exist overviews on TDA (e.g. Niyogi et al, 2011; Chazal and Michel, 2021; Wasserman, 2018) as well as on manifold and representation learning (Cayton, 2005; Bengio et al, 2013; Wang et al, 2021) including the textbooks by Ma and Fu (2012) and Lee and Verleysen (2007).
The variety of clustering algorithms is vast and endeavors have been made to capture this diversity through taxonomies. DBSCAN, the algorithm used here, is a density-based approach. One of its major advantages is that it does not require a pre-specified number of clusters and that the clusters can have arbitrary shapes and patterns. Its hierarchical version (HDBSCAN, Campello et al, 2013) does not use a global ε-threshold but instead determines multiple cut-off values itself, resulting in clusters of different densities, and therefore requires only the minPts parameter. Similar to HDBSCAN, the OPTICS algorithm (Ankerst et al, 1999) calculates an ordering of the observations without a global ε-threshold that provides broader insight on the structure of the data. However, the method does not explicitly assign cluster memberships. Instead, it allows visualizing the hierarchical cluster structure, for example via reachability plots (Ankerst et al, 1999).
Further categories are hierarchical and partitioning algorithms (Jain et al, 1999), where the latter can be divided further into sub-taxonomies. Some of them are based on the minimization of distances to certain prototypes (centroids, medoids, etc.). This includes algorithms like k-means (Lloyd, 1982) and its more general archetype, Gaussian Mixture Models (GMMs), for which the Expectation-Maximization (EM) algorithm (Dempster et al, 1977) is a prominent fitting procedure. A major caveat, however, is that these methods estimate a specific probabilistic model which includes the number of clusters to be detected and often fail if the data is distributed differently (Liu and Han, 2014).
In contrast, spectral clustering, a family of algorithms that shares some common ground with many manifold learning methods that are also based on spectral decompositions of pairwise (dis)similarity matrices, is more robust with respect to the shape and distribution of the clusters.However, these methods require the number of clusters to be specified in advance (Von Luxburg, 2007;Liu and Han, 2014).
Subspace clustering approaches emerged specifically for high-dimensional settings (Kriegel et al, 2009;Assent, 2012;Pandove et al, 2018;Mittal et al, 2019).The fundamental assumption here is that objects within a cluster do not exhibit high similarities among all dimensions but only within a small subset of features that can either (a) span an axis-parallel subspace or (b) an affine projection to an arbitrarily-oriented subspace ("correlation clustering").In both cases, the objects of a cluster are assumed to be located on a common, low-dimensional linear manifold.
In contrast, manifold learning is based on the assumption that data observed in a high-dimensional ambient observation space is distributed on or near a potentially nonlinear manifold with a much smaller intrinsic dimension than the ambient space (Ma and Fu, 2012). In general, the aim is to find low-dimensional representations of datasets preserving as much of the structure of the observed data as possible. A synonymous term is nonlinear dimension reduction (NDR) (Lee and Verleysen, 2007). However, there is no general definition of which characteristics are to be preserved and represented, and different methods infer the intrinsic structure and provide low-dimensional representations in different ways.
For instance, principal component analysis (PCA) yields embedding vectors that optimally preserve global Euclidean distances in the original data space, while other methods such as Isomap (Tenenbaum et al, 2000) yield embedding vectors that aim to preserve geodesic distances on a single, globally connected data manifold. Methods like t-distributed Stochastic Neighbor Embedding (t-SNE, van der Maaten and Hinton, 2008) and uniform manifold approximation and projection (UMAP, McInnes et al, 2018) have been successfully applied to complex high-dimensional datasets with cluster structure. More recently, methods with a specific topological focus such as the general purpose Topomap (Doraiswamy et al, 2021) as well as the domain specific Paga (Wolf et al, 2019), which focuses on the analysis of single cell data, have been proposed. The manifold learning-based clustering approach of Souvenir and Pless (2005) relies on the assumption that data is sampled from multiple intersecting lower-dimensional manifolds.
Several studies that precede ours also focus on the combination of manifold learning techniques and cluster analysis, with applications to cytometry data (Putri et al, 2019), brain tumor segmentation (Kaya et al, 2017), spectral clustering (Arias-Castro et al, 2017), or big data (Feldman et al, 2020), the latter three based on PCA. DBSCAN was used in combination with multi-dimensional scaling (MDS) in Mu et al (2020), and UMAP was used for time-series clustering (Pealat et al, 2021) as well as clustering SARS-CoV-2 mutation datasets (Hozumi et al, 2021). However, these all focus on specific domains and not on the underlying topological principles. In contrast, we base our work on a topological perspective on clustering first described theoretically by Niyogi et al (2011), who conceptualize clustering as the problem of identifying the connected components of a data manifold. We show the theoretical and practical utility of this perspective by means of extensive experiments based on synthetic and real datasets. Similar in spirit to our work, Allaoui et al (2020) perform a comparative study with real data to show that UMAP can considerably improve the performance of clustering algorithms. Among other things, they combined UMAP with HDBSCAN and report comparable clustering results for three of the real-world datasets (Pendigits, MNIST and FMNIST) also used here. However, in contrast to our study, Allaoui et al (2020) do not provide insights into the conceptual topological underpinnings, nor do they describe how the data structures preserved in UMAP embeddings lead to these performance improvements. Note that their results also show empirically that the benefits of the proposed approach are not tied to any particular combination of NDR and clustering methods.

UMAP
The principal idea behind UMAP essentially consists of two steps:
1) Constructing a weighted k-nearest neighbor (k-NN) graph from a pairwise distance matrix.
2) Finding a (low-dimensional) representation of the graph which preserves as much of its structure as possible.
Note that this is the fundamental principle in manifold learning and the details of the two steps constitute the differences between manifold learning methods (Wang et al, 2021). However, unlike many other manifold learning methods, UMAP is based on a solid theoretical foundation that ensures that the topology of the manifold is faithfully approximated by its fuzzy simplicial set representation. We concentrate on the computational aspects outlined in McInnes et al (2018) and refer interested readers to the original study for theoretical details.

Graph construction
Given a dataset $X = \{x_1, \dots, x_{n_{obs}}\}$ sampled from a space equipped with a distance metric $d(x_i, x_j)$, UMAP constructs a directed k-NN graph $\bar{G} = (V, E, w)$ with the vertices $V$ being the observations $x_i$ from $X$, $E$ the edges and $w$ the weights, based on the following definitions.
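Definition 1 The distance $\rho_i$ of an observation $x_i$ to its nearest neighbor is defined by
\[ \rho_i = \min\{ d(x_i, x_{i_j}) \mid 1 \le j \le k,\ d(x_i, x_{i_j}) > 0 \}, \]
where $x_{i_1}, \dots, x_{i_k}$ denote the $k$ nearest neighbors of $x_i$.
Definition 2 A (smooth) normalization factor $\sigma_i$ is set for each $x_i$ by
\[ \sum_{j=1}^{k} \exp\left( \frac{-\max(0,\ d(x_i, x_{i_j}) - \rho_i)}{\sigma_i} \right) = \log_2(k). \]
This defines a local (Riemannian) metric at point $x_i$.
Definition 3 Weight function: The edge weights of the graph are defined by
\[ w(x_i, x_{i_j}) = \exp\left( \frac{-\max(0,\ d(x_i, x_{i_j}) - \rho_i)}{\sigma_i} \right). \]
Note that the distance to the nearest neighbor $\rho_i$ ensures that $x_i$ is connected to at least one other point with an edge of weight 1 (local connectivity constraint).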
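Definitions 1 to 3 can be made concrete with a short sketch. It assumes the standard UMAP formulation of McInnes et al (2018): $\rho_i$ is the smallest positive neighbor distance, $\sigma_i$ is chosen so that the exponential weights sum to $\log_2(k)$, and the weights are $\exp(-\max(0, d - \rho_i)/\sigma_i)$. The function names, the bisection search, and the example distances are illustrative assumptions, and reference implementations such as umap-learn differ in details.

```python
import math

def local_scale(neigh_dists, n_iter=100, tol=1e-6):
    """Return (rho, sigma) for one observation from its k neighbor distances."""
    k = len(neigh_dists)
    # Definition 1: distance to the nearest neighbor (smallest positive distance).
    rho = min(d for d in neigh_dists if d > 0)
    # Definition 2: find sigma such that the weights sum to log2(k).
    target = math.log2(k)
    lo, hi, sigma = 0.0, math.inf, 1.0
    for _ in range(n_iter):
        total = sum(math.exp(-max(0.0, d - rho) / sigma) for d in neigh_dists)
        if abs(total - target) < tol:
            break
        if total > target:      # weights too large: shrink sigma
            hi = sigma
            sigma = (lo + hi) / 2
        else:                   # weights too small: grow sigma
            lo = sigma
            sigma = sigma * 2 if hi == math.inf else (lo + hi) / 2
    return rho, sigma

def directed_weights(neigh_dists):
    """Definition 3: directed edge weights from one observation to its neighbors."""
    rho, sigma = local_scale(neigh_dists)
    return [math.exp(-max(0.0, d - rho) / sigma) for d in neigh_dists]

# Illustrative distances from one observation to its k = 5 neighbors.
dists = [0.6, 0.7, 1.3, 1.2, 1.5]
weights = directed_weights(dists)
```

The nearest neighbor always receives weight 1, which is the local connectivity constraint, while more distant neighbors receive weights that decay at a rate set by the local scale $\sigma_i$.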
For the theory to work it is essential to assume that the data is uniformly distributed on the manifold, which is too strong an assumption for real-world data. The issue is bypassed by defining independent notions of distance at each observed point through $\sigma_i$ and $\rho_i$. However, these local metrics may not be mutually interchangeable, which means that the "distance" between neighboring points $x_i$ and $x_j$ may not be the same if measured w.r.t. $x_i$ or w.r.t. $x_j$, i.e., $d(x_i, x_j) \neq d(x_j, x_i)$, so edge weights in $\bar{G}$ depend on the direction of the edges.
A unified, undirected graph $G$ with adjacency matrix $B$ is obtained by
\[ B = A + A^\top - A \circ A^\top, \qquad (1) \]
with $A$ the weighted adjacency matrix of $\bar{G}$ and $\circ$ the point-wise product. Note that Eq. (1) represents the well-defined operation of unioning fuzzy simplicial sets (with which the manifold is approximated). The resulting entries in $B$ can be interpreted as the probability that at least one of the two directed edges between two vertices in $\bar{G}$ exists, or more generally as a measure of similarity between two observations $x_i$ and $x_j$. Note that it has recently been shown that a stricter notion of connectivity induced by mutual nearest neighbors can further improve the topology preserving property of standard UMAP used here (Dalmia and Sia, 2021).
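The symmetrization step can be illustrated in a few lines. The sketch assumes the fuzzy union takes the usual form $B = A + A^\top - A \circ A^\top$, which matches the stated interpretation of the entries as the probability that at least one of the two directed edges exists. The function name and the small matrix are made up for illustration.

```python
def fuzzy_union(A):
    """Symmetrize directed weights: B = A + A^T - A * A^T (element-wise)."""
    n = len(A)
    return [[A[i][j] + A[j][i] - A[i][j] * A[j][i] for j in range(n)]
            for i in range(n)]

# Hypothetical directed weights: A[i][j] is the weight of the edge i -> j.
A = [[0.0, 1.0, 0.2],
     [0.5, 0.0, 0.0],
     [0.4, 0.3, 0.0]]
B = fuzzy_union(A)
```

Each entry equals $1 - (1 - a_{ij})(1 - a_{ji})$, so an edge that has weight 1 in either direction keeps weight 1 in the undirected graph.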

Graph embedding
The objective is to find a configuration of points in the representation space $Y$ whose fuzzy simplicial set is as similar as possible to the fuzzy simplicial set of the original data, as represented by $G$. To find this low-dimensional representation, UMAP optimizes the cross entropy of edge weights in the two spaces. Similarities in the observation space are represented in terms of the local smooth nearest neighbor distances as
\[ v_{j|i} = \exp\left( \frac{-\max(0,\ d(x_i, x_j) - \rho_i)}{\sigma_i} \right), \qquad (2) \]
with $v_{ij} = (v_{j|i} + v_{i|j}) - v_{j|i} v_{i|j}$ (see Eq. (1)), and similarities in the representation space $Y$ as
\[ w_{ij} = \left( 1 + a \, \| y_i - y_j \|_2^{2b} \right)^{-1}. \qquad (3) \]
The cross entropy between the two fuzzy simplicial set representations,
\[ C = \sum_{i \neq j} \left[ v_{ij} \log\left( \frac{v_{ij}}{w_{ij}} \right) + (1 - v_{ij}) \log\left( \frac{1 - v_{ij}}{1 - w_{ij}} \right) \right], \qquad (4) \]
is minimized via stochastic gradient descent (SGD) to obtain the graph layout (by default a ≈ 1.929 and b ≈ 0.7915). The two terms in Eq. (4) represent the attractive and repulsive forces for the graph layout algorithm used here. Next to a and b, UMAP's central tuning parameters are the number of nearest neighbors k (often denoted as n or n_neighbors), the number of SGD optimisation iterations n-epochs, the dimension d of the representation space, and min-dist, a parameter controlling how close neighboring points can appear in the representation.
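To illustrate how the layout objective rewards good embeddings, the following sketch evaluates a fuzzy cross entropy for two candidate layouts. It assumes the usual UMAP forms $w_{ij} = (1 + a \|y_i - y_j\|_2^{2b})^{-1}$ and a sum of attractive and repulsive terms over unordered pairs. The function names, the affinity matrix V, and the two layouts are illustrative assumptions, and no optimization is performed.

```python
import math

A_DEFAULT, B_DEFAULT = 1.929, 0.7915  # default curve parameters quoted in the text

def low_dim_similarity(yi, yj, a=A_DEFAULT, b=B_DEFAULT):
    """Similarity of two embedded points: (1 + a * ||yi - yj||^(2b))^(-1)."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(yi, yj))
    return 1.0 / (1.0 + a * sq_dist ** b)

def _p_log_ratio(p, q, eps=1e-12):
    """p * log(p / q), with the convention 0 * log(0) = 0 and q clipped away from 0."""
    if p <= 0.0:
        return 0.0
    return p * math.log(p / max(q, eps))

def cross_entropy(V, Y):
    """Fuzzy cross entropy between affinities V and the layout Y (unordered pairs)."""
    total = 0.0
    n = len(Y)
    for i in range(n):
        for j in range(i + 1, n):
            w = low_dim_similarity(Y[i], Y[j])
            total += _p_log_ratio(V[i][j], w)              # attractive term
            total += _p_log_ratio(1.0 - V[i][j], 1.0 - w)  # repulsive term
    return total

# Hypothetical affinities: objects 0 and 1 are similar, object 2 is far from both.
V = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.1],
     [0.1, 0.1, 0.0]]
good_layout = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0)]  # similar objects placed close
bad_layout = [(0.0, 0.0), (5.0, 0.0), (0.1, 0.0)]   # similar objects placed apart
ce_good = cross_entropy(V, good_layout)
ce_bad = cross_entropy(V, bad_layout)
```

The layout that places the two similar objects next to each other yields the smaller cross entropy, which is the direction in which SGD moves the embedding.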

DBSCAN
The principal idea behind DBSCAN is captured within 6 definitions we adapt from Ester et al (1996) and elaborate on:
Definition 4 ε-neighborhood of an object: The ε-neighborhood of an object $x_i$, denoted by $N_\varepsilon(x_i)$, is defined by $N_\varepsilon(x_i) = \{x_j \in X \mid d(x_i, x_j) \le \varepsilon\}$, where $X$ denotes a given dataset.
Definition 5 Directly density-reachable: An object $x_i$ is directly density-reachable from an object $x_j$ w.r.t. a given ε-range and $MinPts$ if: 1) $x_i \in N_\varepsilon(x_j)$ and 2) $|N_\varepsilon(x_j)| \ge MinPts$ (core point condition).
Definition 6 Density-reachable: An object $x_i$ is density-reachable from another object $x_j$ w.r.t. ε and $MinPts$ if there is a chain of objects $x_1, \dots, x_c$ with $x_1 = x_j$ and $x_c = x_i$ such that $x_{l+1}$ is directly density-reachable from $x_l$.
Definition 7 Density-connected: An object $x_i$ is density-connected to another object $x_j$ w.r.t. ε and $MinPts$ if there is an object $o$ such that both $x_i$ and $x_j$ are density-reachable from $o$ w.r.t. ε and $MinPts$.
Definition 8 Cluster: Let $X$ be a given dataset of objects. A cluster $C$ w.r.t. ε and $MinPts$ is a non-empty subset of $X$ satisfying the following conditions: 1) $\forall x_i, x_j$: if $x_i \in C$ and $x_j$ is density-reachable from $x_i$ w.r.t. ε and $MinPts$, then $x_j \in C$ (Maximality). 2) $\forall x_i, x_j \in C$: $x_i$ is density-connected to $x_j$ w.r.t. ε and $MinPts$ (Connectivity).
Definition 9 Noise: Let $C_1, \dots, C_{n_c}$ be the $n_c$ clusters of the given dataset $X$ w.r.t. parameters $\varepsilon_i$ and $MinPts_i$, $i = 1, \dots, n_c$. Then noise is defined as the set of objects in the dataset $X$ that do not belong to any cluster $C_i$, i.e. $\text{noise} = \{x \in X \mid \forall i: x \notin C_i\}$.
In Definition 5, an object is a core point if it has at least $MinPts$ objects within its ε-neighborhood. In the case that no objects in a given dataset are density-reachable from one another, we would obtain $n_c$ clusters, where $n_c$ denotes the number of core points in a dataset $X$ for a given ε and $MinPts$. This means that the number of core points can be considered an upper bound for the number of emerging clusters for a given ε and $MinPts$. Further, it can be deduced from the core point definition that the region surrounding a core point is denser than the region around density-connected objects that do not satisfy $|N_\varepsilon(x_j)| \ge MinPts$, meaning that the latter are objects in sparser regions.
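Definitions 4 and 5 translate almost directly into code. The sketch below works on a precomputed distance matrix and follows Definition 4 literally, so an object belongs to its own ε-neighborhood. The function names and the four points on a line are illustrative assumptions.

```python
def eps_neighborhood(D, i, eps):
    """Definition 4: indices j with d(x_i, x_j) <= eps (includes i itself)."""
    return [j for j in range(len(D)) if D[i][j] <= eps]

def is_core(D, j, eps, min_pts):
    """Core point condition: the eps-neighborhood holds at least min_pts objects."""
    return len(eps_neighborhood(D, j, eps)) >= min_pts

def directly_density_reachable(D, i, j, eps, min_pts):
    """Definition 5: True if x_i is directly density-reachable from x_j."""
    return i in eps_neighborhood(D, j, eps) and is_core(D, j, eps, min_pts)

# Hypothetical example: four points on a line at positions 0, 1, 2 and 10.
positions = [0.0, 1.0, 2.0, 10.0]
D = [[abs(p - q) for q in positions] for p in positions]
core_flags = [is_core(D, j, eps=1.0, min_pts=3) for j in range(len(D))]
```

Only the point at position 1 is a core point here. The point at position 0 is directly density-reachable from it, but not the other way round, which shows that the relation is not symmetric.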

Inferring the topological structure enhances clusterability
In this section, we demonstrate that the correct use of manifold learning (here, specifically: UMAP), as motivated by our topological framing, largely avoids several frequently described challenges in cluster analysis.
A major problem affecting cluster analysis is that clustering often becomes more challenging in high-dimensional datasets. Specifically, the presence of many irrelevant and/or dependent features potentially degrades results (Kriegel et al, 2009). However, contrary to widespread "folk-methodological" superstitions and some sources like Assent (2012), the well-known result that $L_p$ distances lose their discriminating power in high dimensions (e.g. Beyer et al, 1999) is entirely irrelevant for well-posed clustering problems: both the original publication and subsequent works like Kriegel et al (2009) and Zimek and Vreeken (2015) show that the conditions for this result do not apply if the data is distributed in well separable clusters. In particular, this means that DBSCAN, being based on pairwise distance information, can easily detect clusters in high-dimensional datasets.
Nevertheless, there are other problems specific to density-based clustering, and DBSCAN in particular, among which finding a suitable density level is one of the most important (Kriegel et al, 2011; Assent, 2012). A recent review (Schubert et al, 2017) outlined some heuristic rules for specifying ε for DBSCAN, but domain knowledge should mostly determine such decisions. More importantly, density-based clustering is likely to fail for clusters with varying densities. In such cases, a single global density level (for example, specified via ε in DBSCAN) cannot delineate cluster boundaries successfully (Kriegel et al, 2011).
In addition to these well-known issues, we outline another more subtle, less well-known aspect: not only does the difficulty of a clustering problem not necessarily increase for high-dimensional X, but clusters may even become easier to detect in higher dimensional (embedding) spaces.

Enhancing clusterability of DBSCAN with UMAP
The four example datasets we consider here illustrate the following three points: (1) Density-based clustering works in some but not all high-dimensional settings. (2) Perfect performance may not be achievable even for extensive parameter grid searches, and suitable ε values are highly problem-specific. (3) Most importantly, manifold learning can considerably enhance clustering both by improving performance and by reducing parameter sensitivity of DBSCAN to the extent that it becomes almost tuning-free.
The datasets we consider here consist of three clusters sampled from three multivariate Gaussian distributions with different mean vectors. In the first two examples, denoted by $E_{100}$ and $E_{1000}$, the covariance matrix for all three Gaussians is the identity matrix, inducing clusters of similar densities. In the latter two examples, $U_3$ and $U_{1003}$, the covariance matrices differ, inducing clusters of different density. In addition, we consider problems with very different dimensionalities. Observations in setting $E_{100}$ are sampled from 100-dimensional Gaussians, while observations in setting $E_{1000}$ are sampled from 1000-dimensional Gaussians. In contrast, observations for $U_3$ and $U_{1003}$ are sampled from 3-dimensional Gaussians. For $U_{1003}$, an additional 1000 features that are irrelevant for cluster membership are sampled independently and uniformly from [0, 1]. For each setting, we sample 500 observations from each of the three clusters, i.e. each example dataset consists of 1500 observations in total. The complete specifications of the examples are given in Table 1.
Table 1 Specifications of the settings $E_{100}$, $E_{1000}$, $U_3$, and $U_{1003}$. In setting $U_{1003}$, clusters are defined by means of p = 3 dimensional Gaussians, yet an additional 1000 irrelevant features are sampled uniformly from [0, 1], leading to a total dimensionality of 1003.

Setting    p    Means    Variances

Figure 1 shows the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985, Eq. 5) and the Normalized Mutual Information (NMI) with maximum normalization (Vinh et al, 2010, Tab. 2) for different ε values obtained by either applying DBSCAN directly to the observed data or to their 2D UMAP embeddings. Both measures compare two data partitions and return a numeric value quantifying the agreement. While the NMI strictly ranges in [0, 1] (with a value of 1 indicating perfect concordance), the ARI is 0 only if the Rand Index exactly matches its expected value under the null hypothesis that the partitions are generated randomly from a hypergeometric distribution (Hubert and Arabie, 1985).
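For readers who want to reproduce the two agreement measures, a minimal sketch follows. It assumes the standard pair-counting form of the ARI and the NMI normalized by the larger of the two partition entropies. The function names are illustrative, and how noise points assigned by DBSCAN enter the comparison is left open here.

```python
import math
from collections import Counter

def _comb2(n):
    """Number of unordered pairs among n objects."""
    return n * (n - 1) / 2

def adjusted_rand_index(u, v):
    """ARI of two label sequences via the contingency table."""
    n = len(u)
    sum_ij = sum(_comb2(c) for c in Counter(zip(u, v)).values())
    sum_a = sum(_comb2(c) for c in Counter(u).values())
    sum_b = sum(_comb2(c) for c in Counter(v).values())
    expected = sum_a * sum_b / _comb2(n)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:   # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def _entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def nmi_max(u, v):
    """Mutual information divided by max(H(u), H(v))."""
    n = len(u)
    cu, cv = Counter(u), Counter(v)
    mi = sum(c / n * math.log(c * n / (cu[a] * cv[b]))
             for (a, b), c in Counter(zip(u, v)).items())
    denom = max(_entropy(u), _entropy(v))
    return 1.0 if denom == 0 else mi / denom

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])    # relabeled, identical partition
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # below chance level
```

Both measures are invariant to relabeling of the clusters. The second comparison shows that the ARI can become negative when agreement falls below its expected value, whereas the NMI is 0 in that case.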
Several aspects need to be emphasized. First of all, the effect of the dimensionality of the dataset on the performance of DBSCAN applied to the original data is complicated (Figure 1, first column (A)). Contrary to preconceived notions, it can be easier to detect clusters in higher dimensions. Figure 1 A shows that using only DBSCAN, clusters are more easily detected in the 1000-dimensional data (2nd row) than in the 100-dimensional data (1st row), although perfect performance is not achieved by DBSCAN in either of the two. The dimension of the Gaussian distributions defining the clusters is the only difference between these two settings. On the other hand, Figure 1 A shows that it can also be the other way round. In the 1003-dimensional dataset with 1000 irrelevant features (4th row), cluster performance is much lower than in the corresponding 3-dimensional dataset with only 3 relevant variables (3rd row). Again, perfect cluster performance is not achieved by DBSCAN alone. Note that settings $U_3$ and $U_{1003}$ define clusters with varying densities, so DBSCAN is expected not to provide a perfect result.
Secondly, finding a suitable value of ε is very challenging using DBSCAN alone. Note that the optimal $\varepsilon_{\mathrm{opt}}$ varies between 0.9 and 42.64 for these examples. Identifying a suitable ε is even more problematic since the sensible ε-ranges are very small (e.g. see $U_{1003}$). In some cases, clustering does not seem feasible at all even with an optimally chosen ε: optimal results are very poor for setting $E_{100}$ with ARI (NMI) = 0.003 (0.05) for $\varepsilon_{\mathrm{opt}}$ = 11.32 (10.98). Moreover, while $\varepsilon_{\mathrm{opt}}$ is not necessarily consistent for datasets with approximately the same dimensionality (compare $\varepsilon_{\mathrm{opt}}$ = 42.64 for $E_{1000}$ to $\varepsilon_{\mathrm{opt}}$ = 12.48 for $U_{1003}$), it can be similar for datasets with very different dimensionality (compare $\varepsilon_{\mathrm{opt}} \sim 11$ for $E_{100}$ to $\varepsilon_{\mathrm{opt}}$ = 12.48 for $U_{1003}$).
Finally, the crucial point we want to highlight with these examples is that inferring the topological structure before clustering by applying DBSCAN on UMAP embeddings instead of directly to the data makes all these issues (almost) completely disappear (see Figure 1 B). First of all, clustering performance is increased in all four examples; in three it even leads to perfect performance. In addition, UMAP dramatically reduces the complexity of finding a suitable ε. In all considered cases the sensible ε-ranges start near zero and rapidly reach the optimal value, and performance remains optimal over a wide range of ε-values in three of the four examples. Note that we do not tune UMAP at all: we simply set k = 5 and leave all other settings at their default values.
We emphasize that perfect performance is obtained for large swaths of the ε-range we consider for the two high-dimensional examples. This suggests that the crucial issue in clustering is not the nominal dimension of the dataset or whether it contains irrelevant features, but rather how separable the clusters are in their ambient space. This ambient space is usually simply the p-dimensional Euclidean space spanned by the features of the data, whereas the approach taken here clusters observations after projecting them into a space that is optimized for separability.
In summary, applying DBSCAN on UMAP embeddings not only improved performance considerably, but it also reduced the sensitivity of DBSCAN w.r.t. ε. In particular, suitable ε-ranges started near zero for all considered examples. Our experiments described in section 5 show that this holds for complex real data such as fashion MNIST (Xiao et al, 2017) as well, where applying DBSCAN on UMAP embeddings not only dramatically improved DBSCAN's performance but even outperformed the recently proposed ClusterGAN (Mukherjee et al, 2019) method. In the next subsection, we examine the technical aspects that explain this behavior in a simple toy example.

Reasons for improved clusterability
This section lays out possible reasons for the observed improvements in clusterability, with a detailed analysis of the underlying technical mechanisms in a simple toy example. Consider the following distance matrix between six objects:

\[
\begin{pmatrix}
0 & 0.6 & 0.7 & 1.3 & 1.2 & 1.5 \\
0.6 & 0 & 0.5 & 0.75 & 1.6 & 1.3 \\
0.7 & 0.5 & 0 & 1.4 & 1.3 & 1.1 \\
1.3 & 0.75 & 1.4 & 0 & 0.7 & 0.75 \\
1.2 & 1.6 & 1.3 & 0.7 & 0 & 0.75 \\
1.5 & 1.3 & 1.1 & 0.75 & 0.75 & 0
\end{pmatrix}
\]

Inspecting this distance matrix reveals two clusters of objects: objects 1 to 3 (the top-left block, shown in green) and objects 4 to 6 (the bottom-right block, shown in cyan). We set DBSCAN's core point condition parameter to minPts = 2. Note that the object itself is not considered part of its neighborhood. We set ε = 0.75, so that every object whose row (or column) in the distance matrix contains at least two off-diagonal entries ≤ 0.75 is considered a "core point". Since two objects from the different clusters, objects 2 and 4, have a distance of exactly 0.75 (orange entries), all objects are part of a single connected component, and the two dense regions are subsumed into a single large cluster for ε = 0.75.

To avoid this collapsed solution, one could try to reduce the ε parameter, e.g. to ε = 0.74. However, as a consequence, all the objects in the second (cyan) cluster now become "noise": they no longer satisfy the "core point" condition for minPts = 2, since at most one distance in each of their rows is ≤ 0.74. This means that only one cluster (top left, green) is detected.

From this first example, we conclude 1) that there may be cases where even a single object can connect two clusters, yielding a single collapsed cluster, and 2) that the sensitivity of clustering solutions to hyperparameter settings is large: a small change of the ε-parameter by only 0.01 led to a fundamentally different solution.
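The two DBSCAN runs described above can be reproduced in a few lines. The following is a minimal sketch, not the implementation used for the experiments: `dbscan_precomputed` is a hypothetical helper written for this illustration, objects are indexed from 0, and, as in the text, an object is not counted as part of its own neighborhood.

```python
from collections import deque

# Toy distance matrix from the text (objects 0-2: first cluster, 3-5: second).
D = [
    [0.0, 0.6, 0.7, 1.3, 1.2, 1.5],
    [0.6, 0.0, 0.5, 0.75, 1.6, 1.3],
    [0.7, 0.5, 0.0, 1.4, 1.3, 1.1],
    [1.3, 0.75, 1.4, 0.0, 0.7, 0.75],
    [1.2, 1.6, 1.3, 0.7, 0.0, 0.75],
    [1.5, 1.3, 1.1, 0.75, 0.75, 0.0],
]

NOISE = -1


def dbscan_precomputed(dist, eps, min_pts):
    """DBSCAN labels from a distance matrix; a point is NOT counted
    as its own neighbor (the convention used in the text)."""
    n = len(dist)
    neighbors = [[j for j in range(n) if j != i and dist[i][j] <= eps]
                 for i in range(n)]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [NOISE] * n
    cluster = 0
    for start in range(n):
        if not is_core[start] or labels[start] != NOISE:
            continue
        # Grow a new cluster from this unassigned core point.
        labels[start] = cluster
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in neighbors[i]:
                if labels[j] == NOISE:
                    labels[j] = cluster
                    if is_core[j]:  # only core points expand the cluster
                        queue.append(j)
        cluster += 1
    return labels


labels_075 = dbscan_precomputed(D, eps=0.75, min_pts=2)
labels_074 = dbscan_precomputed(D, eps=0.74, min_pts=2)
print(labels_075)  # [0, 0, 0, 0, 0, 0]     one collapsed cluster
print(labels_074)  # [0, 0, 0, -1, -1, -1]  second cluster becomes noise
```

Library implementations may count neighbors differently. In scikit-learn's `DBSCAN`, for instance, `min_samples` includes the point itself, so minPts = 2 in the convention used here corresponds to `min_samples=3` there.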
Thus, we should look for improvements that (i) reduce the sensitivity of results to the parameter settings and (ii) increase the separability of the data and thereby reduce the susceptibility of DBSCAN to merging multiple poorly separated clusters via interconnecting observations at their respective margins. Sharpening the distinction between dense and sparse regions within the dataset, i.e. increasing separability, improves clusterability. As we will now see, UMAP is able to do exactly that by arranging objects into clusters with fairly constant density within and empty regions in between.
To illustrate this, we consider the representation of the toy example via the fuzzy graph as constructed by UMAP. This reflects the fuzzy simplicial set representation of the data and crucially depends on the number of nearest neighbors k. We start with k = 6. This leads to a graph with adjacency matrix

\[
\begin{pmatrix}
0 & 1.0 & 0.95 & 0.29 & 0.53 & 0.25 \\
1.0 & 0 & 1.0 & 0.9 & 0.19 & 0.30 \\
0.95 & 1.0 & 0 & 0.24 & 0.45 & 0.58 \\
0.29 & 0.9 & 0.24 & 0 & 1.0 & 1.0 \\
0.53 & 0.19 & 0.45 & 1.0 & 0 & 1.0 \\
0.25 & 0.30 & 0.58 & 1.0 & 1.0 & 0
\end{pmatrix}
\]

Each cell represents the fuzzy edge weight $v_{ij}$ (Eq. 2) connecting two points, so each value represents the affinity of two observations, not their dissimilarity as in the distance matrix before. As before, the cluster structure is obvious in this representation, with high affinities (≥ 0.9) where distances had been low (≤ 0.75). The representation learned by UMAP in the graph construction step clearly reflects the cluster structure of the dataset. Note that this fuzzy topological representation by itself already amplifies the cluster structure: if we stopped UMAP at this point and converted the affinities $v_{ij}$ into dissimilarities, e.g. via $d_{ij} = 1 - v_{ij}$, $i \neq j$, DBSCAN with minPts = 2 would yield perfect cluster results for ε ∈ [0.01, 0.09]! Note as well that UMAP's graph layout optimization has not even been performed yet and that the nearest-neighbor parameter k has been set to 6, the largest possible value in this example. Thus, the vast improvement in separability we observe is due only to the way UMAP learns and represents the structure of the data in the fuzzy graph G. The improvement can be driven even further both by decreasing the parameter k and by conducting the graph layout optimization.
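The ε-range stated for the k = 6 fuzzy graph can be checked directly from the adjacency matrix. The sketch below converts affinities to dissimilarities via d = 1 − v and scans ε on a grid with step size 0.01. The helper names and the grid's upper end of 1.0 are choices made for this illustration, and the same neighborhood convention as in the text is assumed (a point is not its own neighbor).

```python
# Fuzzy edge weights for k = 6, copied from the adjacency matrix in the text.
V = [
    [0.0, 1.0, 0.95, 0.29, 0.53, 0.25],
    [1.0, 0.0, 1.0, 0.9, 0.19, 0.30],
    [0.95, 1.0, 0.0, 0.24, 0.45, 0.58],
    [0.29, 0.9, 0.24, 0.0, 1.0, 1.0],
    [0.53, 0.19, 0.45, 1.0, 0.0, 1.0],
    [0.25, 0.30, 0.58, 1.0, 1.0, 0.0],
]


def affinity_to_dissimilarity(v):
    """d_ij = 1 - v_ij for i != j, and 0 on the diagonal."""
    n = len(v)
    return [[0.0 if i == j else 1.0 - v[i][j] for j in range(n)]
            for i in range(n)]


def dbscan_clusters(dist, eps, min_pts):
    """Return DBSCAN clusters as a list of index sets (noise is dropped)."""
    n = len(dist)
    nbrs = [{j for j in range(n) if j != i and dist[i][j] <= eps}
            for i in range(n)]
    core = {i for i in range(n) if len(nbrs[i]) >= min_pts}
    assigned, clusters = set(), []
    for seed in sorted(core):
        if seed in assigned:
            continue
        members, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()  # always a core point
            for j in nbrs[i] - members - assigned:
                members.add(j)
                if j in core:
                    frontier.append(j)
        assigned |= members
        clusters.append(members)
    return clusters


D = affinity_to_dissimilarity(V)
eps_grid = [i / 100 for i in range(1, 101)]  # 0.01, 0.02, ..., 1.00
good = [eps for eps in eps_grid
        if sorted(sorted(c) for c in dbscan_clusters(D, eps, min_pts=2))
        == [[0, 1, 2], [3, 4, 5]]]
print(good[0], good[-1])  # 0.01 0.09
```

The upper end of the range is set by the affinity 0.9 between objects 2 and 4: once ε reaches 1 − 0.9 = 0.1, the two clusters are linked again.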
First, consider the effect of k. In the following, blanks in the matrices denote zero entries. Graph 9 shows G for k = 3. Clearly, the beneficial effects we noted for k = 6 are considerably amplified.
Almost all $v_{ij}$ become zero (i.e. there is no affinity/similarity between the two points) except for those joined in one of the clusters and the two entries which caused DBSCAN to break. Turning $v_{ij}$ into $d_{ij}$ as above, DBSCAN yields correct clusters for ε ∈ [0.01, 0.42]. By setting k = 2, the smallest possible value due to the local connectivity constraint, we can further distill the cluster structure down to its bare essentials (graph 10). Based on this graph, DBSCAN yields correct clusters for ε ∈ [0.01, 0.99]! Thus, by setting the nearest neighbor parameter of UMAP to a very small value, the cluster separability is dramatically amplified and DBSCAN's sensitivity w.r.t. ε is significantly reduced.

However, the graph layout optimization step has not even been performed yet. This additional step is crucial, in particular for reducing the parameter sensitivity of clustering methods. This is due to the fact that $d_{ij} = 1 - v_{ij}$ only converts affinities into dissimilarities. Finding a graph layout via the cross-entropy $C_{UMAP}$ as defined in Eq. 4 instead not only converts affinities (indirectly) into dissimilarities but also improves the conversion itself w.r.t. separability (on top of the separability gained by the graph construction), since the optimization procedure optimizes the graph layout for increased cluster separability. This can be explained as follows: $C_{UMAP}$ becomes minimal for $v_{ij} = w_{ij}$. For $v_{ij} = 0$, the further away from each other the embedding vectors $y_i$ and $y_j$ are placed, the better, since this will drive $w_{ij}$ towards zero. Considering graphs 9 and 10, we see that $v_{ij}$ is zero mostly for observations from different clusters. Minimizing $C_{UMAP}$ thus increases cluster separability in the embedding space by driving objects from different clusters apart. Note that minimizing the cross entropy "can be seen as an approximate bound-optimization (or Majorize-Minimize) algorithm [...] implicitly minimizing intra-class distances and maximizing inter-class distances" (Boudiaf et al, 2020, p. 3). The optimization in the graph embedding step of UMAP thus leads to tighter clusters with more white space in between.
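The repulsion argument can be made concrete with a small numerical sketch. It rests on two assumptions that are not restated in this section: that Eq. 4 is the usual fuzzy-set cross-entropy, with a per-pair contribution of v log(v/w) + (1 − v) log((1 − v)/(1 − w)), and that the low-dimensional affinity has UMAP's usual form w = 1/(1 + a·dist^(2b)). The values a = b = 1 and the function names are purely illustrative.

```python
import math


def low_dim_affinity(dist, a=1.0, b=1.0):
    """w = 1 / (1 + a * dist^(2b)); a and b are illustrative, not fitted values."""
    return 1.0 / (1.0 + a * dist ** (2 * b))


def pair_cross_entropy(v, w, tiny=1e-12):
    """Contribution of a single pair (i, j) to the fuzzy-set cross-entropy."""
    w = min(max(w, tiny), 1.0 - tiny)  # keep the logarithms finite
    attract = v * math.log(v / w) if v > 0 else 0.0
    repel = (1 - v) * math.log((1 - v) / (1 - w)) if v < 1 else 0.0
    return attract + repel


distances = [0.5, 1.0, 2.0, 4.0]
# v = 0 (no edge in the fuzzy graph): the cost shrinks as the points move apart.
repulsion = [pair_cross_entropy(0.0, low_dim_affinity(d)) for d in distances]
# v = 1 (certain edge): the cost grows as the points move apart.
attraction = [pair_cross_entropy(1.0, low_dim_affinity(d)) for d in distances]
print([round(c, 3) for c in repulsion])
print([round(c, 3) for c in attraction])
```

Under these assumptions, pairs with zero affinity are cheapest when placed far apart and pairs with affinity one are cheapest when placed close together, which is the mechanism described above.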
The most relevant additional benefit this graph embedding step provides is the large expansion of well-performing ε-ranges for DBSCAN. Since the graph layout optimization uses stochastic gradient descent, the resulting embedding vectors are not deterministic. To account for this randomness, we perform 25 embeddings for each value of k and compute separate averages of the lower and the upper interval boundaries of the ε-ranges yielding optimal cluster performance. On average, the obtained embedding coordinates yield correct clusters for k = 6 with ε ∈ [0.83, 1.03], for k = 3 with ε ∈ [0.70, 6.76], and for k = 2 with ε ∈ [0.79, 20.94]. Even the smallest (optimal) ε-ranges we observed over the 3 × 25 replications, [0.94, 1.03], [0.72, 1.33], and [1.16, 4.57] for k = 6, 3, and 2, respectively, are at least as wide as the range obtained on the fuzzy graph for k = 6 and considerably wider for k = 3 and k = 2. Further analysis of the variability resulting from optimizing embedding vectors via SGD can be found in appendix A.
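The averaging of interval boundaries over replicate embeddings can be sketched as follows. The score curves below are made-up placeholders, not results from the experiments, and the helper names are hypothetical. The sketch also assumes that the optimal ε values of each replicate form one contiguous interval.

```python
def optimal_eps_range(eps_grid, scores):
    """Smallest and largest eps at which the score attains its maximum."""
    best = max(scores)
    hits = [eps for eps, s in zip(eps_grid, scores) if s == best]
    return hits[0], hits[-1]


def average_boundaries(ranges):
    """Average the lower and the upper boundaries separately."""
    lowers, uppers = zip(*ranges)
    return sum(lowers) / len(lowers), sum(uppers) / len(uppers)


# Placeholder ARI curves for three replicate embeddings (illustration only).
eps_grid = [0.5, 1.0, 1.5, 2.0, 2.5]
curves = [
    [0.4, 1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0, 0.0],
    [0.4, 1.0, 1.0, 1.0, 0.0],
]
ranges = [optimal_eps_range(eps_grid, c) for c in curves]
mean_lower, mean_upper = average_boundaries(ranges)
print(ranges)
print(mean_lower, mean_upper)
```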
These results indicate how crucial optimizing separability by computing embedding vectors is for clustering performance. Appendix B confirms its importance on real data.
In these and the following experiments, all of UMAP's other hyperparameters were left at the implementation defaults, in particular min_dist = 0.1. Additionally adjusting these parameters might further increase separability. However, tuning parameters in an unsupervised setting is a notoriously difficult task, and since the results are already convincing when k is set to a small value, we concentrate on the effect of k.
In summary, both the graph construction and the graph embedding steps in the UMAP algorithm independently contribute to an increased separability of clusters in a dataset, and their combined effect improves clusterability dramatically.

The price to pay: structures preserved and lost
As we have outlined in the previous sections, UMAP is able to infer and even enhance the topological, i.e. the cluster, structure of a dataset. However, these improvements come at a price, which will be outlined in this section.

Topology vs. geometry
Beyond topological structure, i.e., mere "connectedness", datasets also have geometrical structure: the shapes of the clusters and how the clusters are positioned relative to each other in the ambient space.
Consider the example of a dataset consisting of three nested spheres embedded in a 3-dimensional (Euclidean) space (see Figure 2 A). What kind of structure does this dataset have? First of all, from a purely topological perspective, we have three unconnected topological subspaces, i.e. clusters: the three spheres. Moreover, from an additional geometrical perspective, we have information on the shape of the individual clusters: they form spheres, i.e. 2-dimensional surfaces. Finally, we have information on the relative position of the clusters to each other within the ambient feature space: the spheres are nested.
What happens if these data are represented in a 2D UMAP embedding? Since a sphere cannot be isometrically mapped to a 2-dimensional plane, some distortion of the geometric structure will be unavoidable in any 2D embedding. Figure 2 B shows that, in fact, most of the geometrical structure is lost in UMAP embeddings: the relative positioning of the clusters diverges from the original data and is not consistent over different embeddings. The effect on the shape of the clusters is less severe. While for k = 15 the embeddings are similar to circles, i.e. 2D spheres, for k = 7 the general circular shape is retained, yet less uniformly. In contrast, the topological structure of the different clusters is not only preserved in full but even exaggerated: clusters are much more separated in the embeddings, which is also reflected once again in much wider ε-ranges that yield sensible results (Figure 2 C). DBSCAN alone provides perfect clustering performance only over a much smaller ε-range than when applied to these UMAP embeddings.
As a further example, we consider the complex 2D synthetic dataset by Jain (2010), "who suggest that it cannot be solved by a clustering algorithm" (Barton et al, 2019, p. 2). This "impossible" data contains seven clusters with complex structure, see Figure 3 A. The clusters have different densities, are in part non-convex, and are not linearly separable. DBSCAN by itself is not able to detect the full cluster structure, and choosing ε from [0, 15] (step size: 0.01; minPts = 5) based on an optimal ARI value yields a very different cluster result than choosing ε based on the optimal NMI value (see Figure 3 B & C). This challenging example further demonstrates two important points. First, it shows how successfully UMAP embeddings preserve the connected components (i.e. topological structure) and simultaneously distort geometric structure. In Figure 3 D, we can see that the nested structure of the circles and the entanglement of the spirals are completely lost and that the spirals have been "unrolled" in the embedding space, but the different clusters are very clearly separated.
Second, the example illustrates that "dimension inflation" via UMAP can have a positive effect on cluster performance. "Dimension inflation" means that the data is embedded into a space of higher dimensionality than the observed data. Although this is uncommon and we are not aware of any work where this has been investigated before, there are no restrictions that prevent UMAP from being used in this way. Consider Figure 3 F, which shows ARI and NMI curves obtained with DBSCAN applied to (1) the data, (2) a 2D UMAP-5 embedding, and (3) a 3D UMAP-5 embedding. Although the 2D UMAP-5 embedding already improves performance and strongly reduces parameter sensitivity, it does not yield a perfect solution. In the 2D embedding (Fig. 3 D), the two spirals are very close to each other, with a gap between them that is smaller than the gap appearing within the black cluster. However, the three-dimensional UMAP-5 embedding not only further reduces parameter sensitivity but also allows for perfect cluster performance. A 3D visualization of this embedding is depicted in Figure 3 E, but note that a static 3D visualization does not make the improved separability visible very well. Figures 3 G-I show all pairwise plots of the three embedding dimensions of the 3D UMAP embedding, even though none of these 2D projections reflects the cluster structure well. We recommend basing exploratory analysis on 3D embeddings, as they are more likely to yield good results in complex data than 2D embeddings and still allow for very reasonable visualizations with dynamic plotting tools.

Fig. 3 Another example of complex synthetic data and the beneficial effect of "dimension inflation". 1st row: the "impossible" data with color according to true cluster structure. 2nd row: data colored according to DBSCAN cluster results if applied directly to the data (different optimal ε values for ARI and NMI). 3rd row: Visualizations of a 2D and 3D UMAP-5 embedding with colors according to true cluster structure. 4th row: ε-curves for DBSCAN applied to the data, a 2D UMAP-5, and a 3D UMAP-5 embedding. Last row: 2D visualizations of the 3D UMAP-5 embedding with colors according to true cluster structure. In all settings: DBSCAN computed for ε ∈ [0.01, 15], step size: 0.01; minPts = 5.
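The ε selection used in this example, scanning a grid and picking the value that maximizes ARI or NMI, can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the code behind Figure 3: the data are two interleaved half-moons standing in for the "impossible" dataset, the grid is much coarser, and `eps_scan` is a hypothetical helper. Note that scikit-learn's `min_samples` counts the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score


def eps_scan(X, y_true, eps_grid, min_samples=5):
    """ARI and NMI of DBSCAN for each eps; noise (label -1) is scored
    as if it were one additional cluster."""
    ari, nmi = [], []
    for eps in eps_grid:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        ari.append(adjusted_rand_score(y_true, labels))
        nmi.append(normalized_mutual_info_score(y_true, labels))
    return np.array(ari), np.array(nmi)


# Stand-in data; NOT the dataset discussed in the text.
X, y = make_moons(n_samples=300, noise=0.08, random_state=0)
eps_grid = np.arange(0.02, 1.0, 0.02)
ari, nmi = eps_scan(X, y, eps_grid)

# The two criteria need not agree on the same eps.
eps_best_ari = eps_grid[np.argmax(ari)]
eps_best_nmi = eps_grid[np.argmax(nmi)]
print(eps_best_ari, eps_best_nmi)
```

Applying the same scan to an embedding instead of `X` gives the kind of ε-curves that are compared in the figure.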

Outliers and noise points
Outliers are another important property of a dataset, but their distinctiveness and relative isolation are unlikely to be preserved in their UMAP embeddings. Consider Figure 4 A & C, which shows two 2D datasets with two clusters and, firstly, with two outliers (in blue) on the left-hand side, and, secondly, with additional, uniformly distributed noise points (in grey) on the right-hand side. Corresponding UMAP embeddings for k = 15 are depicted in Figure 4 B & D. Although the cluster structure is preserved, in both cases the outliers are no longer detectable as such (note that no dimension reduction has taken place). The same holds for noise points, which are embedded into proximal clusters and are then no longer detectable as noise. It has recently been shown for functional data that outlyingness can be seen as a metric structure of a dataset (Herrmann and Scheipl, 2021). Since UMAP does not preserve metric structure (i.e. distances) but connected components, the loss of the outlier structure is not surprising. Moreover, note that UMAP's local connectivity constraint, which ensures that each point is at least connected to its nearest neighbor, may render it generally impossible to preserve outlier structure in UMAP embeddings. Applying outlier detection methods in an additional preprocessing step before computing UMAP embeddings may solve this issue.
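One way such a preprocessing step could look is sketched below. The choice of detector is an assumption made for this illustration: the text does not name a method, and scikit-learn's `LocalOutlierFactor` is used here only as one readily available option. The data are synthetic stand-ins (two Gaussian clusters plus two isolated points), not the datasets of Figure 4.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2))
cluster_b = rng.normal(loc=(5.0, 0.0), scale=0.3, size=(100, 2))
outliers = np.array([[2.5, 6.0], [2.5, -6.0]])  # two isolated points
X = np.vstack([cluster_a, cluster_b, outliers])

# Flag outliers in the observation space, i.e. before any embedding.
lof = LocalOutlierFactor(n_neighbors=20)
flags = lof.fit_predict(X)  # -1 = flagged as outlier, 1 = inlier
is_outlier = flags == -1

# Only the remaining points would be passed on to UMAP and DBSCAN;
# the flagged points are kept aside and reported separately.
X_clean = X[~is_outlier]
print(is_outlier[-2:], X_clean.shape)
```

Because the flags are computed on the original distances, the outlier information survives even if the subsequent embedding would pull these points into a cluster.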

Overlapping and diffuse clusters
Clusters with considerable overlap or diffuse boundaries that result in a large likelihood of "bridge" points between nominally distinct clusters are especially challenging for most clustering algorithms.
First of all, consider Figure 5 A, which shows a 2D dataset consisting of two clusters that are connected by a small "bridge" of points (blue). From a purely topological perspective, we have a single connected topological subspace. A 2D UMAP representation, however, breaks this connected component apart, see Figure 5 B & C. Note that this holds for a small value of k = 15 as well as for a very large value of k = 505.

Another issue concerns clusters with substantial overlap, which are often modeled as diffuse components of a Gaussian mixture (Rasmussen, 2000). In such cases, UMAP and similar manifold learning methods are unlikely to improve clustering performance. Consider Figure 5 D. It shows a 2D dataset with two clusters following 2-dimensional Gaussian distributions with mean vectors (0, 2) and (2, 2) and unit covariance matrix. Note that in both embeddings (Figure 5 E & F) the clusters are not clearly separable, and the less so the larger UMAP's locality parameter k is chosen.
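How strongly the two Gaussian clusters overlap can be quantified with a short worked calculation. Assuming equal mixing weights (not stated in the text) and reading "unit covariance matrix" as the identity matrix, the best possible rule assigns each point to the nearer mean, and its error rate is Φ(−Δ/2), where Δ is the distance between the two means and Φ the standard normal distribution function.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mean_a, mean_b = (0.0, 2.0), (2.0, 2.0)
delta = math.dist(mean_a, mean_b)       # distance between the means: 2.0
bayes_error = normal_cdf(-delta / 2.0)  # equal weights, identity covariance
print(round(bayes_error, 4))  # 0.1587
```

Under these assumptions, roughly 16% of the observations lie on the "wrong" side of the optimal boundary, so there is no empty region between the clusters that a topological method could exploit.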
For strongly overlapping clusters, it is questionable to even consider such settings as ("pure") clustering tasks. From a topological perspective, such settings cannot be considered a well-posed clustering problem as there are no separable components in the data. However, in the presence of bridges, it seems reasonable to consider the dataset as consisting of two clusters. Whether overlapping clusters should be merged or considered separate must surely be answered w.r.t. the specific domain. Practitioners should be aware of how UMAP tends to behave in such settings: it typically breaks "bridges" apart and merges highly overlapping clusters.

Quantitative analysis of further synthetic data
In addition to the qualitative analyses of these toy datasets, we investigate further examples quantitatively in this section. The datasets under consideration are those from the Fundamental Clustering Problems Suite (FCPS) (Ultsch, 2005). These datasets are constructed such that they reflect specific clustering problems. Table 2 shows key characteristics of these datasets and the problems they present. More details including visualizations can be found in the corresponding papers (Thrun and Ultsch, 2020; Ultsch and Lötsch, 2020).
The results of applying DBSCAN directly to the data and to 2D UMAP embeddings with k = 10 are shown in Table 3. Shown are the highest achievable ARI and NMI values by approach and dataset as well as the ε-range $\varepsilon_{[\mathrm{ARI}>0]}$ for which ARI is greater than zero.
On the datasets Tetra and TwoDiamonds, DBSCAN does not perform perfectly. These datasets represent problems (specified as "almost touching clusters" (Tetra) and "cluster borders defined by density" (TwoDiamonds)) with less clearly separable clusters. Consistent with the examples presented in section 3.1, inferring the topological structure via UMAP not only drastically [...]

Table 2 Characteristics of the FCPS datasets: the number of clusters $n_c$, the number of observations $n_{obs}$, the number of features (dimensionality) p, and the problem as specified in corresponding papers (Thrun and Ultsch, 2020; Ultsch and Lötsch, 2020).

In contrast to that, inferring the relevant structure is not possible with UMAP in the settings EngyTime and Target, and thus it does not improve the performance of DBSCAN; it even reduces it. This is consistent with the results of the previous subsections: EngyTime is a setting with clusters that overlap strongly, while the Target data is a setting with six clusters of which four are defined by a few outliers.
In summary, the synthetic examples investigated in this and the previous section show that inferring the topological structure of a dataset can dramatically improve and simplify clustering: improvement in the sense that cluster detection with DBSCAN is considerably more reliable, and simplification in the sense that finding good parameters for DBSCAN becomes significantly less challenging: the suitable ε-ranges are typically much wider, they consistently start near zero and ARI/NMI quickly reach their optimum in this range, so that a quick and simple coarse grid search over small values of ε is likely to be successful.
We emphasize that these conclusions apply to diverse and challenging synthetic data settings that include low-dimensional as well as high-dimensional data, data with equal and unequal cluster densities, data with (many) irrelevant features, clusters of arbitrary shape, and not linearly separable clusters. In the next section, we show that this also holds for several real datasets.

Experiments on Real-World Data
An overview of the real datasets used in this study is given in Table 4. Since some of these datasets have already been used in other studies, we can not only investigate how the clustering performance of DBSCAN is improved if the topological structure of a dataset is inferred beforehand, but also compare our results to those reported for other clustering methods. The set of datasets includes the well known Iris data (Anderson, 1935; Fisher, 1936), the Wine data (Aeberhard et al, 1994; Forina et al, 1988; Dua and Graff, 2017), the Pendigits data (Alimoglu and Alpaydin, 2001; Dua and Graff, 2017) as well as the COIL (Nene et al, 1996), MNIST (Lecun et al, 1998) and Fashion MNIST (FMNIST) (Xiao et al, 2017) data. Following Mukherjee et al (2019), we use two different versions of FMNIST: one with the original ten clusters and a version reduced to five clusters which are pooled from the original ten based on their similarity. The results of applying DBSCAN directly to the datasets and to the embeddings obtained with UMAP are depicted in Figure 6 and Table 5.
Table 4 Characteristics of the real datasets: the number of clusters $n_c$, the number of observations $n_{obs}$, and the number of features (dimensionality) p. As in the ClusterGAN paper (Mukherjee et al, 2019), we investigate two versions of FMNIST: FMNIST-10 and FMNIST-5. The clusters in the latter are: 1: T-shirt/Top, Dress; 2: Trouser; 3: Pullover, Coat, Shirt; 4: Bag; 5: Sandal, Sneaker, Ankle Boot.

Figure 6 shows ARI and NMI as a function of ε for the different datasets. Table 5 details the optimum ARI and NMI achieved within the considered ε-ranges. We inferred the topological structure of the datasets for three different values of k ∈ {5, 10, 15}. Note that we did not tune UMAP at all and used min_dist = 0.1, n_components = 3 and spectral initialization throughout. The features of the Iris and Wine data were scaled and standardized, respectively.
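The setup described above can be sketched with the umap-learn and scikit-learn packages. This is a minimal illustration under assumptions, not the code used for the experiments: the helper names, `random_state`, `min_samples=5`, the ε grid, and the stand-in blob data are choices made here, and `umap` is imported lazily so that the baseline part runs without umap-learn installed. Only `n_neighbors`, `min_dist`, `n_components`, and `init` reflect settings named in the text.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.preprocessing import StandardScaler


def umap_embed(X, k, random_state=0):
    """3D UMAP embedding with the settings named in the text.

    random_state is an addition for reproducibility, not a setting from the text.
    """
    import umap  # provided by the umap-learn package; imported only when needed

    reducer = umap.UMAP(
        n_neighbors=k,
        min_dist=0.1,
        n_components=3,
        init="spectral",
        random_state=random_state,
    )
    return reducer.fit_transform(X)


def optimum_scores(Z, y_true, eps_grid, min_samples=5):
    """Best ARI and best NMI reached by DBSCAN over eps_grid (noise = label -1)."""
    ari = nmi = -np.inf
    for eps in eps_grid:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(Z)
        ari = max(ari, adjusted_rand_score(y_true, labels))
        nmi = max(nmi, normalized_mutual_info_score(y_true, labels))
    return ari, nmi


# Stand-in data: three well-separated blobs, standardized feature-wise.
X, y = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)
X_std = StandardScaler().fit_transform(X)
eps_grid = np.arange(0.1, 1.0, 0.1)

# Baseline: DBSCAN applied directly to the (standardized) data.
ari_raw, nmi_raw = optimum_scores(X_std, y, eps_grid)
print(ari_raw, nmi_raw)
```

With umap-learn installed, the embedding-based variant for each k ∈ {5, 10, 15} would be evaluated as `optimum_scores(umap_embed(X_std, k), y, eps_grid)`, which allows a side-by-side comparison with the baseline.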
In general, the results show that what has been observed for the synthetic examples also holds for real data. For all considered settings, inferring the topological structure of the dataset via UMAP before applying DBSCAN leads to better clustering performance than applying DBSCAN directly, dramatically so for MNIST and FMNIST. Moreover, it reduces the ε sensitivity of DBSCAN, with suitable ε-ranges starting close to zero and with high (> 0.5) ARI and NMI values for large parts of the ε-range. For DBSCAN directly applied to (F)MNIST, we additionally scanned the ε-range [0, 100] with a step size of 0.1, but performance did not improve over this extended search grid.
We also investigate the effect of optimizing the separability by constructing embedding vectors instead of using the fuzzy edge weights directly for the datasets Iris, Wine, COIL, and Pendigits. Clustering using UMAP's fuzzy graph weights directly performs worse, as expected. For example, on the Iris data, computing embedding vectors with UMAP-10 leads to optimal ARI/NMI = 0.89/0.86 over an ε-range of [0.67, 4.82], in contrast to 0.88/0.84 over [0.6, 0.61] if only the fuzzy graph weights of UMAP-10 are used. Both variants still yield better results than applying DBSCAN directly to the data (optimal ARI/NMI = 0.75/0.67). We found similar results for Wine, COIL, and Pendigits, see appendix B.

In addition, our results show that the fast, simple and very easily tuneable approach we have proposed leads to clustering performance comparable or superior to that of recently proposed clustering methods such as ClusterGAN (Mukherjee et al, 2019) and SPECTACL(N) (Hess et al, 2019) in some settings. Table 6 lists the highest results obtained on the respective datasets in other studies (Goebl et al, 2014; Mautz et al, 2017; Mukherjee et al, 2019; Hess et al, 2019). On Pendigits and FMNIST-5, DBSCAN applied to UMAP embeddings performs better than the best-performing methods FOSSCLU and ClusterGAN as reported by Goebl et al (2014), Mautz et al (2017), and Mukherjee et al (2019). On MNIST, comparable performance is achieved w.r.t. ClusterGAN and better performance w.r.t. SPECTACL(N). Only for the Wine data and FMNIST-10 is better performance reported for the methods FOSSCLU, LDA-k-means, and ClusterGAN.

It must be emphasized that these methods also require analysts to prespecify a fixed number of clusters that are to be found. ClusterGAN's optimal performances reported in Table 6 were achieved only if the true number of clusters was supplied (Mukherjee et al, 2019). The performance on MNIST considerably deteriorated if the number of clusters was not correctly specified. Recall that one of the major advantages of DBSCAN is that it does not require pre-specifying the number of clusters. The complexity of specifying and training ClusterGAN should also be taken into account. First of all, a suitable network architecture needs to be defined. Note that standard architectures specified elsewhere had to be adapted for ClusterGAN to achieve satisfactory performance. In addition, the various hyperparameters for the GAN, the SGD optimizer, and the generator-discriminator updating require substantial tuning. Finally, note that our approach works well in settings with both few and many clusters and for both small and large numbers of observations. This is also in contrast to ClusterGAN, which was "particularly difficult [... to train ...] with only a few thousand data points" (Mukherjee et al, 2019, p. 4616).

Discussion
In summary, the presented results show that considering clustering from a topological perspective consistently simplified analysis and improved results in a wide range of settings: from a practical perspective, inferring the topological structure of datasets and representing this structure in suitable embedding vectors that are, in some sense, optimized for separability between the different connected components (dramatically) increased clustering performances of DBSCAN, even outperforming a highly complex deep learning-based clustering method, as long as the clusters did not exhibit large overlap. These insights suggest some conceptual conclusions and raise a number of fundamental questions for cluster analysis, which we will discuss in the following.
To begin with, we argue that two "perspectives" on cluster analysis should be more strictly distinguished: on the one hand, settings where the aim is to infer the number of connected components in a dataset (the "topological perspective"), and on the other hand, settings where clusters may show considerable overlap (in the following the "probabilistic perspective"). If the "perspective" (implicitly) taken is not clearly specified, the results of cluster analysis can be misleading. For example, in applied, exploratory analyses relevant information may be lost, while in methodological analyses method comparisons can be misleading.
Consider a truly unsupervised and exploratory setting (i.e. the true number of clusters is not known and determining it is a crucial part of the problem) in an applied context. From the "topological perspective", applying methods that yield a fixed, pre-specified number of clusters is highly questionable in this situation. If the number of clusters is determined a priori, for example via domain knowledge, the analysis cannot falsify these a priori assumptions about the data and may hide any unexpected structure. This seems contradictory to the purpose of an exploratory analysis, where the discovery of unexpected structures can yield valuable new insights. If, on the other hand, approaches such as elbow-plots of cluster quality metrics are used to determine the number of clusters $n_c$ in a data-driven way, methods inferring and enhancing connected components should be used in the first place.
Another issue concerns the evaluation of competing methods for clustering using datasets with label information. Label information can be misleading, in particular if it is (also) used to pre-specify $n_c$, as the label information may not be consistent with the unconnected components of a dataset. Consider the FMNIST example, where a simple modification of label information (merging the original 10 into 5 broader categories) leads to considerably different results. Note that this change of labels was not introduced here, but in Mukherjee et al (2019). We assume that the performance of ClusterGAN on FMNIST, as measured based on the original labels, was not as convincing as for the other datasets. Since it requires no specialized domain knowledge to assess the general similarity of clusters in this dataset containing images of pieces of apparel, a change of labels is easy to do. But while this change did not improve the performance of ClusterGAN in terms of ARI and NMI by much, it considerably improves the performance of DBSCAN + UMAP. In other words: the labels were presumably changed such that they were much more consistent with the actual unconnected components, i.e. clusters, in the data. If only the original ten categories of clothing had been considered here, the method comparison would have been misleading, as the different ability of the methods to identify the (un)connected components of the data would have gone unnoticed. The original label information arguably does not reflect the actual cluster structure of the data. This is likely to be the case in many labeled datasets.
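The label pooling discussed here is a simple relabeling and can be written down explicitly. The sketch below follows the FMNIST-5 grouping listed with Table 4 and assumes the standard Fashion-MNIST class index order (0 = T-shirt/top, ..., 9 = Ankle boot). The names `FMNIST_NAMES`, `POOLED`, and `pool_labels` are hypothetical.

```python
# Standard Fashion-MNIST class names, indexed by the original labels 0-9.
FMNIST_NAMES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
                "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Pooled FMNIST-5 clusters, numbered 1-5 as in the Table 4 caption.
POOLED = {
    "T-shirt/top": 1, "Dress": 1,
    "Trouser": 2,
    "Pullover": 3, "Coat": 3, "Shirt": 3,
    "Bag": 4,
    "Sandal": 5, "Sneaker": 5, "Ankle boot": 5,
}


def pool_labels(labels10):
    """Map original 10-class labels to the pooled 5-cluster labels."""
    return [POOLED[FMNIST_NAMES[c]] for c in labels10]


print(pool_labels(range(10)))  # [1, 2, 3, 1, 3, 5, 3, 5, 4, 5]
```

Scores such as ARI and NMI are invariant to renaming clusters but not to merging them, which is why the same cluster result can score very differently against the two label sets.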
On the other hand, consider settings with overlapping clusters. Taking the topological perspective does not make a lot of sense here, as there are no unconnected components if clusters (strongly) overlap, and our investigations showed that it is, in general, questionable whether it is possible to infer such cluster structure with methods that aim to infer connected components. In such settings, one should rather take a "probabilistic perspective" and assume that the data follow a joint multi-modal probability distribution, i.e., a mixture of probability distributions. Note that this usually implies some kind of domain knowledge from which it makes sense to assume such structure. Many prominent clustering methods such as k-means, Gaussian Mixture models, or approaches based on the EM algorithm are based on this perspective. It has to be emphasized that our experiments on several widely used real-world benchmark datasets showed that an approach based on the topological perspective, which does not use the true number of clusters as a parameter, can perform comparably to or even better than methods that do so.
These considerations raise some important questions. First of all, from a rather practical perspective: Is it fair to compare methods that require $n_c$ as a parameter with those that do not? How trustworthy is the widely used approach to evaluate clustering methods using labeled data? Is it at all useful to apply non-probabilistic clustering methods on data with assumed strong cluster overlap?
Moreover, from a rather general conceptual perspective: Can there be methods that work optimally both in settings with large cluster overlap and settings of high separability? As Schubert et al (2017, p. 19) state in that regard: "To get deeper insights into DBSCAN, it would also be necessary to evaluate with respect to utility of the resulting clusters, as our experiments suggest that the datasets used do not yield meaningful clusters. We may thus be benchmarking on the 'wrong' datasets (but, of course, an algorithm should perform well on any data)." This already points to the problem of "wrong" datasets, while on the other hand, they state that a method should perform well in any setting. In the light of the insights presented here, we would argue that it is very fruitful to investigate the characteristics of settings in which a method or combination of methods works specifically well or even optimally. As outlined, we consider high cluster overlap and well separable clusters, in particular, to be examples of such settings. The underlying principles are fundamentally different (disconnected domains of the clusters vs. connected domains of the clusters) and may require different, maybe even contradictory objectives to be optimized. This is specifically relevant as a dataset may consist of both sorts of (assumed) structures. We think the insights and results presented here support this view.

Conclusion
This work considered cluster analysis from a topological perspective.Our results suggest that the crucial issue in clustering is not the nominal dimension of the dataset or whether it contains many irrelevant features, but rather how separable the clusters are in the ambient observation space they are embedded in.Extensive experiments on synthetic and real datasets clearly show that focusing on the topological structure of the data can dramatically improve and simplify cluster analysis both in low-and high-dimensional settings.To demonstrate this principle in practice, we used the manifold learning method UMAP to infer the connected components of the datasets and to create embedding vectors optimized for separability, to which we then applied DBSCAN.
Using synthetic data, we showed that this makes results much more robust to hyperparameters in a diverse set of problems including low-dimensional as well as high-dimensional data, data with equal and unequal cluster densities, data with (many) irrelevant features and clusters of arbitrary, not linearly separable shapes.The parameter sensitivity of DBSCAN is consistently and dramatically reduced, simplifying the search for a suitable ε-value.Moreover, the cluster detection performance of DBSCAN was considerably improved compared to applying it directly to the data.
Experiments in real data settings corroborated these insights.In addition, our results showed that the simple approach of combining UMAP and DBSCAN can even outperform complex clustering methods SPECTACL and deep-learning-based ClusterGAN on complex image data such as Fashion MNIST.
All these results were obtained with very little hyperparameter tuning for UMAP. In particular, we always used a small value of the parameter k/n_neighbors (k ∈ {5, 10, 15} in most of our experiments), markedly reducing the complexity of the parameter choice in density-based clustering. All other parameters were set to their default values. Based on a simple toy example, we provided a detailed technical explanation of why the choice of a small k is reasonable for the purpose of clustering.
Finally, we propose a conceptual differentiation of cluster analysis suggested by the topological perspective and the presented results. Specifically, we argue that settings with high cluster overlap and settings with well-separable clusters should be considered fundamentally different, requiring different kinds of methods for optimal results, a distinction that is usually not made explicit enough. We also propose that external label information should only be used to evaluate clustering solutions if these labels actually correspond to the (un)connected components of the data manifold from which the observations are sampled. If this is not the case, we would argue that evaluation metrics diverge from what clustering algorithms should properly optimize for, namely identifying (un)connected components, and results will be misleading.
We think these considerations point out important questions to be investigated in future work.
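To make concrete what evaluation against external labels measures, the adjusted Rand index (ARI) reported in the figures and tables can be computed from the contingency table of the two labelings. The following is a minimal standard-library sketch of the usual ARI formula, not the implementation used for the experiments. The function name is hypothetical, and treating the DBSCAN noise label -1 as an ordinary cluster label is an assumption made here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same observations."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement

    # Contingency counts n_ij with row marginals a_i and column marginals b_j.
    pairs = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    index = sum(comb(c, 2) for c in pairs.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (index - expected) / (maximum - expected)


# Assumption: the noise label -1 is handled like any other label.
reference = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, -1, 1, 1, 1]
score = adjusted_rand_index(reference, predicted)
```

The score depends only on how pairs of observations are grouped relative to the reference labels. If those labels do not correspond to connected components of the data manifold, a low value does not necessarily indicate a poor clustering.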

Fig. 4 Effect of UMAP on data with outliers and noise points. First column: 2D datasets with two clusters and two outliers (A) and with two outliers and noise points (C). Second column: UMAP embeddings with k = 15 (B & D, respectively). The cluster structure is preserved. Outliers and noise points are forced into the clusters.

Fig. 5 Effect of UMAP on data with connected components. Upper row: 2D data with two bridged clusters. Lower row: 2D dataset with two strongly overlapping clusters. A & D: data. B, C, E, F: UMAP embeddings with k = 15 and k = 505, respectively. UMAP breaks the bridged components up into two clusters but does not break up the strongly overlapping components.

Fig. A1 Maximum, mean, and minimum ARI (left column) and NMI (right column) curves summarized over 25 embeddings of the four synthetic settings E_100, E_1000, U_3, U_1003. Note that the curves do not reflect a single embedding, but the worst/mean/optimal case over all 25 embeddings for each individual ε-value. The maximum ARI and NMI values obtained by applying DBSCAN directly to the data are shown as a black dashed horizontal line and the corresponding ε-value as a black dashed vertical line. DBSCAN computed for ε ∈ [0.01, 15], step size: 0.01; minPts = 5.

Fig. A2 Maximum, mean, and minimum ARI (A) and NMI (B) curves summarized over 25 embeddings of the Iris, Wine, and COIL data. Note that the curves do not reflect a single embedding, but the worst/mean/optimal case over all 25 embeddings for each individual ε-value. The maximum ARI and NMI values obtained by applying DBSCAN directly to the data are shown as a black dashed horizontal line and the corresponding ε-value as a black dashed vertical line. DBSCAN computed for ε ∈ [0.01, 25], step size: 0.01; minPts = 5.

Fig. C4 Visualizing 2D UMAP-10 embeddings of the real datasets. Note that an embedding dimension of d = 2 was chosen for the purpose of optimal static visualization, in contrast to d = 3, which was used for better clustering in the quantitative experiments in Section 5.

Table 3
Maximum ARI and NMI and ε ranges corresponding to ARI > 0 for FCPS data.

Table 5
Maximum ARI and NMI for the real datasets. DBSCAN directly applied to the data and to 3D UMAP embeddings for k ∈ {5, 10, 15}. For the explored ε-ranges, see Fig. 6.

Table 6
Optimal ARI and NMI for some of the real datasets reported in other studies and the methods used. The last two columns show the corresponding optimal performances achieved with DBSCAN & UMAP.