1 Introduction

In this article, we explore the growing interface between deep learning and topology. We examine deep learning methods that make use of topological information to understand the shape of data, as well as the use of deep learning in calculating topological signatures. We broadly refer to this intersection of fields as topological deep learning. The advancements in topological deep learning have been enabled by the development of topological data analysis (TDA) over the last two decades.

TDA is a relatively recent amalgam of theory and algorithms that aims to obtain a geometric and topological understanding of data from real-world applications. The approach to data employed in TDA fundamentally differs from that in statistical learning. Rather than finding summary statistics, estimators, fitting approximate distributions, clustering, or training neural nets, TDA instead seeks to understand the properties of the geometric object, often a manifold, on which the data resides. This reflects the common intuition that data tends to lie on, or close to, a lower-dimensional manifold that is embedded in high-dimensional feature space. In this article, we sometimes refer to this as the data manifold.

The main goal of TDA is to infer information about the global structure of the data manifold, such as its connectivity and the presence of multi-dimensional holes. In the purely mathematical setting, this information is characterized by homology and the related concept of Betti numbers, which count the number of n-dimensional holes in a manifold. With a finite set of data points, the Betti numbers are unavailable, but TDA employs various substitutes such as the persistence diagram and the barcode. An important property of the topological information obtained is its invariance to continuous deformation and scaling. This property also lends itself to robustness against perturbation and noise. Another benefit is the versatility of TDA methods, owed mostly to the abstract origins of algebraic topology. The methods are applicable to a wide variety of data types and objects, including point cloud data in Euclidean spaces, categorical data, and images and functions. TDA is backed by explainable theory but lacks the learning ability and other practical aspects of deep neural networks. Deep neural networks, in turn, suffer from the need for large training datasets and billions of tunable parameters. Due to these aspects, the integration of TDA with deep neural networks poses a number of challenges.

Despite much recent activity in co-opting topological approaches in deep learning, what the leading approach should be remains unclear, mostly because of computational and theoretical concerns. The TDA methods discussed in this paper form but a small part of the ever-expanding interface between topological data analysis and machine learning. It is important to state that this survey does not provide an exhaustive treatment of TDA background and literature; for that, we refer the reader to the following excellent studies: (Pun et al. 2022; Edelsbrunner and Harer 2008; Ali et al. 2023; Carlsson 2009; Carlsson and Zomorodian 2009). Whilst a number of recent studies have focused on TDL targeting specific families of architectures (e.g. Message Passing Neural Networks in Papillon et al. (2023b)), our work provides broad coverage of TDA integration into various DL pipelines and architectures. We have aimed to select works that form a clear historical line of connection with deep learning approaches, to aid understandability.

This survey provides the broader machine learning community with a convenient starting point to explore how TDA has been integrated with deep learning. Such interaction brings novel perspectives, benefits, and challenges. We shed light on the benefits of this interaction demonstrated by the growing adoption of TDA in various deep learning applications. To the best of our knowledge, this is the first work that comprehensively covers topological deep learning and organizes the research works in this field in a unified taxonomy (Sect. 3).

We start in Sect. 2 by introducing the key theoretical concepts of TDA and their representations for learning. In Sect. 3, we explain how topological approaches can fit into different deep learning constructs, such as learnable features, feature transformations, and loss functions. In Sect. 3.4 we shed light on a promising use of TDA to understand and dissect trained deep models, called deep topological analytics.

We continue in Sect. 4.2 with a discussion of the known challenges of TDA and its adaptation to deep neural networks. We further discuss future directions and adjacent applications of topological deep learning, and we present some current libraries. Finally, we make some concluding remarks in Sect. 5.

Notations We write \(X \in \mathbb {R}^{n \times d}\) to denote the data set, where n is the number of samples and d the number of features or dimensions. We write \(\mathcal {M}\) to denote the underlying data manifold, which, for the purposes of this survey, is a locally Euclidean space embedded in \(\mathbb {R}^d\). We write BD and PD as abbreviations of barcode diagram and persistence diagram.

2 Overview of TDA

An object’s topology is broadly defined as the characteristics that remain invariant under continuous deformation, as if the object was made of soft rubber. How many connected components the object possesses, the holes or voids it contains, or how the object loops back on itself are a few examples of topological properties. In a sense, topological information can be considered qualitative. For example, if we demonstrate that data points lie on two totally disconnected sub-manifolds, then we know that the data comes from two very distinct sources, or that the underlying system has two distinct states.

A central concept is that of homology, which is a powerful tool to characterize the topological features of a space. Homology is an abstract concept; its general definition is outside the scope of this paper (Carlsson 2009). In essence, however, the k-th homology (where \(0\le k \le d\)) is a group (in the mathematical sense) that characterizes the set of k-dimensional loops in a topological space. The relationships between the various k-dimensional loops then characterize the holes or voids in the manifold.

When we say there is a 0-dimensional hole, it means that the space has disjoint connected components or isolated points. In other words, there are no paths connecting certain points in the manifold. The 0-th homology group identifies and counts these connected components, treating them as separate entities.

A 1-dimensional hole can be traced around with a 1-dimensional loop (like a loop of string). Consider, for example, a typical donut shape, as illustrated in Fig. 1. There are two essentially different ways to draw a loop on its surface: a loop that goes around the central hole, or one that passes through the hole (around the tube). An infinite number of nontrivial loops can be generated that may wind, double back, or wrap around multiple times before returning to their origin, but they will all be equivalent to combinations of these two loops.

For example, let’s refer to any loop that travels through the central hole and around the tube as a. Since such a loop can go around the tube once, twice, or an arbitrary number of times, or in the opposite direction, we can represent these loops as a, 2a, \(-a\), etc. In Fig. 1, there is another loop present that is not a multiple of a. It goes around the central hole along the long circumference of the tube, and we will refer to it as b. Any loop on the donut surface can be deformed (without breaking) to follow the loops a and b each an integer number of times. The fact that there are exactly two independent loops from which all others can be constructed indicates that the surface of the donut has two one-dimensional holes (the familiar intuition of a donut having ‘one hole’ refers to the solid donut, in which loop b can be contracted). Hence, the 1-homology counts the number of these holes. A 2-dimensional hole is a void, for example, the void enclosed by a hollow sphere, a hollow torus, or Swiss cheese in 3 dimensions.

Fig. 1 Illustration of how infinitely many different loops on a donut surface are all continuous deformations of combinations of just two loops (a, b), which are themselves not deformable into each other

The k-dimensional holes are counted specifically by the Betti numbers. The k-th Betti number is defined as the rank of the k-th homology group. In group theory, rank refers to a notion of independence; it is closely related to the concept of rank from linear algebra and represents a form of dimensionality (Hatcher 2002). In general, the Betti numbers can be quite difficult to compute, but fortunately, there are some settings where the calculations are straightforward.

2.1 Simplicial complexes and persistent homology

The k-th homology is much more convenient to work with when we restrict ourselves to simplicial complexes, which are structures built upon discrete sets. This is the natural domain for data-driven and machine-learning applications.

A simplex can be considered a generalization of a triangle or tetrahedron. It is the simplest polytope of any given dimension. A simplex in zero dimensions is a point, in one dimension is a line segment, in two dimensions is a triangle, in three is a tetrahedron, and so on. We use k-simplex to refer to a simplex of dimension k. Note that any simplex is composed of faces, which are themselves simplices of lower dimension. A simplicial complex K is a collection of simplices with two properties: each face of a simplex in K must also be in K, and the intersection of any two simplices of K is either empty or a face of both of them (Munkres 1993).

Consider each point in our data set X to be a vertex (a 0-simplex). We can define a set of 1-simplices as connections between pairs of vertices, 2-simplices between collections of three vertices, and so on. Thus, we build a simplicial complex K that gives some sense of “connectivity” between data points. It can be thought of as a hyper-graph on X. Note that K is not necessarily unique on X.

Homological information is much more easily obtained for a simplicial complex; in particular, the k-th Betti number can be computed through tractable linear algebra (Robins 1999). The Betti numbers in this setting are closely related to the Euler characteristic, which relates the numbers of vertices, edges, and faces of a polyhedron.
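
To make this linear-algebra route concrete, the following minimal NumPy sketch computes Betti numbers over real coefficients for a hand-built toy complex (a hollow triangle) using \(\beta _k = \dim C_k - \textrm{rank}\,\partial _k - \textrm{rank}\,\partial _{k+1}\); it is an illustration rather than the general algorithm used by TDA software.

import numpy as np

# Sketch: Betti numbers of a hollow triangle (3 vertices, 3 edges, no filled face)
# over real coefficients, using beta_k = dim C_k - rank d_k - rank d_{k+1}.

# Boundary matrix d_1: columns are the edges (0,1), (1,2), (0,2); rows are vertices.
d1 = np.array([[-1,  0, -1],
               [ 1, -1,  0],
               [ 0,  1,  1]])

rank_d1 = np.linalg.matrix_rank(d1)
rank_d0 = 0  # the boundary map on 0-chains is zero
rank_d2 = 0  # there are no 2-simplices in this complex

beta0 = 3 - rank_d0 - rank_d1  # number of connected components
beta1 = 3 - rank_d1 - rank_d2  # number of one-dimensional holes
print(beta0, beta1)  # 1 1: one component, one loop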

The goal now is to construct simplicial complexes on X that reflect the underlying topology of \(\mathcal {M}\). This is done by varying a scale parameter, typically a radius \(r>0\). The Čech complex and the Vietoris-Rips complex are two typical constructions (Chazal and Michel 2021). A Čech complex \(C_r(X)\) includes a k-simplex on \((k + 1)\) vertices of X if the collection of balls of radius r centered on those vertices has a non-empty intersection. The Vietoris-Rips (or simply Rips) complex \(V_r(X)\) includes a k-simplex on any set of \((k + 1)\) vertices that are all within pairwise distance r of each other (Zomorodian 2010). These two constructions can yield very different simplicial complexes on the same data set with the same r.
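
As an illustration of these constructions, the short sketch below (assuming the GUDHI Python library is installed; the point cloud and radii are arbitrary choices) builds Vietoris-Rips complexes on a noisy circle at two radii, showing that a larger radius can only add simplices.

import numpy as np
import gudhi

# Sketch: Vietoris-Rips complexes on a noisy circle at two radii.
# A larger radius can only add simplices, illustrating V_r(X) ⊆ V_r'(X) for r <= r'.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 50)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((50, 2))

for r in (0.3, 0.6):
    rips = gudhi.RipsComplex(points=X, max_edge_length=r)
    simplex_tree = rips.create_simplex_tree(max_dimension=2)
    print(f"r = {r}: {simplex_tree.num_simplices()} simplices")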

Fig. 2 Example of persistent homology for (a) a point cloud and (b) grey-scale image objects. Intuitively, the filtration captures the multiscale structural properties of the object by recording the persistence state of the topological features (e.g. holes) as the filtration threshold changes, for a monotonically increasing threshold \(r_1 \le r_2 \le \cdots \le r_n\). Panel (b) shows the sequence of sublevel sets for different threshold values after applying the filtration \(f^{r}= \{x \in X| f(x) \ge r\} \) to the image X. The corresponding persistence diagram (c, left) and barcode (c, right) are convenient summaries of the topological features’ lifetimes. Some holes (c and d, purple) persist much longer than others, while some (c and d, pink) are born later. (Colour figure online)

Persistent homology is obtained through a filtration F. Typically, an initial simplicial complex captures the fundamental structure of the space. It serves as the analysis’ starting point. The complex is then subjected to a series of additions of simplices, gradually introducing higher-dimensional characteristics and capturing global details of the space. These additions to F are governed by a filtration parameter that determines the analysis’ scale or resolution. As the value of the filtration parameter increases, simplices with higher assigned values are added to the complex, resulting in the emergence of new topological structures or the maintenance of existing ones.

In other words, F is a growing sequence of sub-complexes: \(K_1 \subseteq K_2 \subseteq \ldots \subseteq K_n = K\). Two commonly used examples of filtration are the sets of simplicial complexes, \(C_r(X)\) or \(V_r(X)\), that are obtained with increasing radius r of the balls around the data points. As we vary r, these constructs will naturally reflect different aspects of the topology of \(\mathcal {M}\). There is monotone inclusion of these simplicial complexes with increasing r, i.e. for two radii \(r \le r^\prime \) we have that \(C_r(X) \subseteq C_{r^\prime }(X)\) and \(V_r(X) \subseteq V_{r^\prime }(X)\).

Throughout the filtration, the evolving complexes form a nested sequence that reflects the evolution of the topological characteristics of the space across scales. The key idea is to track the appearance and disappearance of topological features over the filtration. We may see new loops created, separate components connected, or holes filled in as we increase r. We record the lifetime of these features with respect to r, that is, the appearance (at \(b_i\) for birth) and disappearance (at \(d_i\) for death) of a particular topological feature. Figure 2 shows an example of filtration and the corresponding lifetime of topological features.
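
The following sketch (again assuming GUDHI, with a synthetic noisy circle) computes the persistent homology of a Rips filtration and prints the birth-death pairs of long-lived features; the lifetime threshold used here is arbitrary and chosen only for illustration.

import numpy as np
import gudhi

# Sketch: persistent homology of a Rips filtration on a noisy circle.
rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 2.0 * np.pi, 100)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((100, 2))

st = gudhi.RipsComplex(points=X, max_edge_length=2.0).create_simplex_tree(max_dimension=2)
diagram = st.persistence()  # list of (dimension, (birth, death)) pairs

# Long-lived features are likely genuine: for a circle we expect one long H1 bar
# (the loop) and one essential H0 bar (the single connected component).
for dim, (b, d) in diagram:
    if d == float("inf") or d - b > 0.5:
        print(dim, b, d)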

2.2 Representations of persistent homology

The set of birth and death coordinates obtained from the filtration forms the backbone of persistent homology. The two most popular representations of this information are barcode diagrams and persistence diagrams (Carlsson 2009). The multiset of intervals \((b_i, d_i)\) forms the barcode diagram (BD), the name coming from the visual representation of the set of intervals as stacked line segments. In the persistence diagram (PD), the lifetime of each feature is represented by a point in \(\mathbb {R}^2\) with coordinates \((b_i, d_i)\). A filtration may have several copies of the same birth and death interval, which is represented in the PD by giving the point \((b_i, d_i)\) an integer-valued multiplicity. It is important to note that the BD and PD contain equivalent information, and one can define a bijection between the two. From here onwards we use the term PD to refer to either construct unless the BD is explicitly referred to.

The PD of a data set contains a wealth of topological information. Features that have a long persistence interval (\(d_i - b_i\)) are considered to be likely to reflect the true topological features of the underlying manifold \(\mathcal {M}\). These features are represented in the PD by points far away from the diagonal. A short persistence interval describes a feature that is possibly generated from noise or is otherwise insignificant. Features with short persistence will be represented by points close to the diagonal line in the PD. Hence, points in the PD that are further away from the diagonal are considered more informative than those that are closer to it.

Comparing the PDs of two objects is a way to assess their topological similarity. However, this is a challenging task due to the multiset information contained in the PDs. Figure 3 shows the basic underlying issue of differentiating PDs. In the next section, we discuss various methods to represent them in manners suitable for traditional machine learning and computation, with Sect. 3.3 exploring this issue from the perspective of deep learning loss.

Fig. 3 Visual depiction of the distance computation between two persistence diagrams \(D_{X_1}\) and \(D_{X_2}\). Comparing PDs (\(D_{X_1} - D_{X_2}\)) requires an assignment between their points. Here, each point in \(D_{X_1}\) is matched with the nearest adjacent point in \(D_{X_2}\), and any unmatched point is ignored (or assigned to the diagonal)

2.3 Homological feature vectorizations

Most machine learning methods assume that the input data resides in \(\mathbb {R}^d\) or, more generally, some Hilbert space \(\mathcal {H}\). Hence, they cannot be directly applied to datasets comprised of PDs, and the multiset information contained in the PD needs to be represented in some vector format. This process is called vectorization, which requires the definition of a continuous map \(f: \textrm{PD} \rightarrow \mathcal {H}\). There is a plethora of different published methods to achieve this, each having subtle consequences (Ali et al. 2023). It is important to note that these vectorization methods can be thought of as handcrafted feature engineering rather than feature learning. In this section, we discuss various strategies that have evolved over time.

A simple approach for representing PDs is using their statistical properties such as the sum, mean, variance, maximum, minimum, etc. (Adcock et al. 2016). The total Betti number of a certain filtration can also be used as a summary representation (Cang et al. 2015). These approaches yield a univariate output and lose information; however, they can still be useful.

Another approach is to vectorize BDs using histogram-like methods (Cang and Wei 2017; Cang et al. 2018). The basic concept is to discretize the BD along the filtration axis, creating equal-sized bins in which we count the number of persistent intervals. Alternatively, tropical coordinates defined on the space of BDs are a useful and stable algebraic representation (Kališnik 2018).
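
As a minimal sketch of this binning idea, the hypothetical helper below (plain NumPy; the function name, bin range, and bin count are illustrative choices, not taken from the cited works) counts how many persistence intervals are alive in each bin of a discretized filtration axis.

import numpy as np

def barcode_histogram(intervals, t_min=0.0, t_max=1.0, n_bins=10):
    # `intervals` is an (n, 2) array of (birth, death) pairs from one homology
    # dimension; the output is a fixed-length vector usable by standard ML models.
    edges = np.linspace(t_min, t_max, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    births, deaths = intervals[:, 0], intervals[:, 1]
    # A feature contributes to a bin if the bin centre falls inside its lifetime.
    alive = (births[None, :] <= centers[:, None]) & (centers[:, None] < deaths[None, :])
    return alive.sum(axis=1).astype(float)

# Example: two long bars and one short (noise-like) bar.
bars = np.array([[0.05, 0.90], [0.10, 0.80], [0.40, 0.45]])
print(barcode_histogram(bars))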

Yet another approach is to construct various forms of persistence functions from PDs. These functions are readily vectorized themselves; however, it is also convenient to work with them directly for many tasks (Bubenik and Dłotko 2017; Adams et al. 2017). Examples of these persistence functions include the persistence landscape (Bubenik 2020; Bubenik and Dłotko 2017), persistent Betti numbers (Edelsbrunner et al. 2002), the persistence Betti function (Xia et al. 2017), and persistence surfaces and persistence images (Adams et al. 2017), among others.

A useful feature representation technique called persistence codebooks (Zieliński et al. 2020) uses bag-of-words quantization techniques to group data points into a fixed-sized vector. Chevyrev et al. (2020) proposed persistence paths, which is a feature map for barcodes.

Representations can vary from simple to complex structures, and there is scope to investigate new vectorization methods that could benefit topological learning models. Note, however, that when a large feature vector is used to represent PDs, the curse of dimensionality comes into play. In this case, variable selection, regularization approaches, or dropout methods should be considered (Pun 2021; Chiu et al. 2017; Cai and Liu 2011; Srivastava et al. 2014).

In addition, it is important to consider the comparison of different PDs. To this end, various metrics have been proposed, such as the bottleneck distance (Mileyko et al. 2011), as well as adaptations of the Gromov-Hausdorff and Wasserstein metrics (Bubenik et al. 2018), amongst others. A central consideration is the stability of vectorizations and metrics, which we discuss in Sect. 3.3.

Whilst vectorization methods can be used in the input space, combining PD information with machine learning models can also be achieved with kernel-based models (Kwitt et al. 2015; Reininghau et al. 2015). Since metrics can be modified into kernels, various approaches have been proposed to induce kernel functions from PD information (Bubenik et al. 2018; Mileyko et al. 2011) and to incorporate them into traditional machine learning approaches like PCA and SVM. Topological kernel methods have been used successfully in various ways (Zhu et al. 2016; Kwitt et al. 2015). However, techniques based on kernel methods suffer from scalability issues (Kusano et al. 2016), as training typically scales poorly with the number of samples (e.g., roughly cubic in the case of kernel SVMs). For this reason, we do not discuss topological kernel methods any further in this paper.

Many of the aforementioned methods have advantageous stability properties with respect to standard metrics in TDA, like the Wasserstein or bottleneck distances. However, they all share the same drawback: the mapping from topological representations to a form compatible with existing learning techniques is predefined. It is therefore fixed and agnostic to the specific learning task, which makes it suboptimal. The phenomenal success of deep neural networks (e.g. He et al. (2016), Krizhevsky et al. (2012)) has shown that learning representations (i.e. feature learning) is a preferable approach.

Fig. 4 Topological deep learning introduces TDA methods into deep models, leading to topological neural architectures that can potentially address deep learning limitations. This is done by plugging in topological components for (a) learning feature embeddings (Sect. 3.1), (b) enhancing the learned representations (Sect. 3.2), and/or (c) regularizing the model using a topological loss (Sect. 3.3). Beyond that, (d) TDA can be used post-training to reveal insights about trained models (interpretability) (Sect. 3.4)

3 Topological deep learning (TDL)

Topological representations that incorporate structural information hold great promise for topological deep learning models (Hofer et al. 2017). Combining these cues with deep learning approaches has inherent benefits in various applications. On the flip side, deep learning approaches can be useful in overcoming some common hurdles faced by TDA approaches in estimating robust topological features. The incorporation of topological concepts into deep learning has only recently been investigated and the following benefits have been observed:

  • Global features that would otherwise be inaccessible via traditional feature maps can be extracted from the input data efficiently and robustly.

  • TDA is versatile and adaptable, meaning that we are not limited to specific problems and types of data (such as images, sensor measurements, time series, graphs, etc.).

  • TDA is noise-resistant across a number of problems, which include the classification of 3D surface meshes (Som et al. 2018; Reininghau et al. 2015; Li et al. 2014), the recognition of 2D object shapes (Turner et al. 2014), the manifold of natural image patches (Carlsson et al. 2007), the analysis of activity patterns in the visual cortex (Singh et al. 2008), and clustering (Chazal et al. 2013).

  • TDA can be applied to arbitrary data structures without any preprocessing, provided the right filtrations are used.

  • A new trend is emerging that allows efficient backpropagation through persistent homology components. This has been a long-standing challenge in TDA (further discussed in Sect. 3.3), but topological layers are now becoming compatible with deep learning and end-to-end training schemes.

We reiterate that though the benefits of using TDA (more specifically, persistent homology) and deep learning together have demonstrated success, there are still some theoretical and computational challenges in the application of TDA to data. We discuss these issues at length in Sect. 4.2.

In the rest of this section, we investigate TDA for deep learning through lenses of different magnifications and perspectives, as shown in Fig. 4. In particular, we explore the use of persistent homology in various ways. The discussion in Sects. 3.1–3.3 is focused on the on-training integration of TDA, that is, building topological neural architectures. However, a holistic view should also consider TDA’s contribution post-training (deep topological analytics). These analytics use TDA to study the ‘shape’ of a trained model. Thus, we review works that study deep model complexity and interpretability using TDA in Sect. 3.4.

3.1 Learning topological features embedding

In this section, we extend the discussion of fixed vectorization methods (Sect. 2.3) by introducing deep learnable vectorization (i.e. embedding). A key advantage here is the possibility of leveraging the deep model to simultaneously learn the vectorization of the data and the representation for the target task. For example, we may parameterize the vectorization of a persistence diagram \(\textrm{PD}\) into an embedding vector \(V \in \mathbb {R}^d\) by neural layers \(f_w\), where w denotes the trainable parameters. Guided by the task loss, we can efficiently learn the mapping \(f_w: \mathrm {PD_x} \rightarrow V_{x}\) and automatically answer the question of which family of vectorizations works best for the given task.

Handling PDs by neural networks is the focus of many deep topological embedding studies. Generally, deep PD vectorization layers should be continuous and permutation invariant with respect to the input, the latter requirement being motivated by the set nature of the persistence diagram. Hofer et al. (2017, 2019) introduced the first learnable deep vectorization of PDs. It adopts a permutation-invariant transformation by evaluating the PD’s points against Gaussian(s) whose mean and variance are learned during training. Since permutation invariance has been explored in other deep learning problems (e.g. Deep Sets (Zaheer et al. 2017) for point clouds), some vectorization techniques for PDs were borrowed from these works. For example, PersLay (Carrière et al. 2020) builds on Deep Sets for embedding extended PDs encoding graphs and uses it for graph classification. Recently, transformers have been used for PD embedding. The Persformer architecture (Reinauer et al. 2021) showed superior performance on synthetic and graph tasks while offering some interpretability features. Note that transformers without positional encoding can be made as expressive as Deep Sets, so the permutation invariance requirement can be maintained.

Zhou et al. (2022) proposed TopologyNet, a novel approach that directly fits the topological representations derived from input point cloud data. This method substantially reduces the computation time for generating topological representations, in contrast to traditional pipelines, while maintaining a small approximation error in practical scenarios. The output of TopologyNet holds potential for various downstream tasks that require efficient topological representations. Experimental evaluations incorporated TopologyNet as a topological branch within an autoencoder framework; the inclusion of the topological branch led to superior topology quality in the generated point clouds compared to an autoencoder lacking such a branch. Furthermore, the latent vectors generated by a topological autoencoder were used to train a latent generative adversarial network (GAN), enabling the generation of new point clouds from Gaussian noise. Evaluation indices indicated that including the topological autoencoder within the generative adversarial network improved the quality of the newly generated point clouds, surpassing the performance of a GAN without it.

Beyond PDs, deep embedding has been explored for other topological signatures. For example, PLLay (Kim et al. 2020) provides a layer for embedding persistence landscapes. PLLay’s claimed robustness to extreme topological distortion is backed by a tight stability bound that is independent of the input complexity.

Topological embedding transforms the topological input with a complex structure into a vector representation compatible with deep models. As discussed in this section, the process uses a custom topological input layer for embedding. In the next section, we explore topological components that enhance deep learning representation and usually have the flexibility to be plugged anywhere in the network.

Algorithm 1 Deep learnable topological embedding

Algorithm 1 represents the process of embedding persistence diagrams (PDs) into a vector space using deep neural network layers. The procedure DeepTopologicalEmbedding takes a persistence diagram as input, initializes an embedding vector and neural layers, and then maps each point in the PD to the embedding vector. The process is guided by a loss function to determine the best vectorization for the given task.
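
A minimal PyTorch sketch in the spirit of the Gaussian structure elements of Hofer et al. is given below: each diagram point is evaluated against learnable Gaussians and the responses are summed over points, which makes the output permutation invariant. The class name, number of elements, and parameterization are illustrative rather than the published implementation.

import torch
import torch.nn as nn

class GaussianPDEmbedding(nn.Module):
    # Sketch of a learnable, permutation-invariant persistence-diagram embedding.
    # Each (birth, death) point is compared against `n_elements` Gaussian structure
    # elements whose centres and widths are trained; summing responses over the
    # points of the diagram makes the output independent of the point ordering.
    def __init__(self, n_elements: int = 16):
        super().__init__()
        self.centers = nn.Parameter(torch.rand(n_elements, 2))     # learned means
        self.log_sigma = nn.Parameter(torch.zeros(n_elements, 2))  # learned widths

    def forward(self, pd: torch.Tensor) -> torch.Tensor:
        # pd: (n_points, 2) tensor of (birth, death) pairs; output: (n_elements,)
        sigma = self.log_sigma.exp()
        diff = (pd[:, None, :] - self.centers[None, :, :]) / sigma  # (n_points, n_elements, 2)
        response = torch.exp(-(diff ** 2).sum(dim=-1))              # (n_points, n_elements)
        return response.sum(dim=0)                                  # permutation invariant

# Usage: embed a toy diagram and feed the result to a downstream task head.
layer = GaussianPDEmbedding(n_elements=8)
toy_pd = torch.tensor([[0.1, 0.9], [0.2, 0.3], [0.4, 0.8]])
print(layer(toy_pd).shape)  # torch.Size([8])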

3.2 Integration of topological representations

Representation learning is the process of learning features from data that can be used to improve the accuracy of the model. Deep learning excels in this regard thanks to its powerful feature learning, but having a good representation goes further than achieving good performance on a target task (Bengio et al. 2013). For example, TDA’s stability can make deep representation resilient to input perturbation (de Surrel et al. 2022). Below, we review two categories of deep topological representations.

Constrained representations One approach is to train deep neural networks to learn representations that preserve the persistent homology of the input data. Again, TDA’s versatility ensures the feasibility of this as the topological signature can be computed for both the input and the internal representation. For example, Topological Autoencoders (Moor et al. 2020) perform the alignment through a loss, minimizing the divergence between input and latent representation topologies (both captured by PDs).

Augmented representations Another approach for topological representation is augmenting the deep features with topological signatures. The Persistence Enhanced Graph Network (PEGN) (Zhao et al. 2020) developed a graph spatial convolution that builds on persistent homology. Normally, convolution filters adapt to local graph structures through the use of node degree information. In contrast, PEGN weights the message passing between nodes using neighborhood information captured by persistence images. Moreover, Graph Filtration Learning (GFL) (Hofer et al. 2020) adapts the readout operation (a graph pooling-like operation) in Graph Neural Networks (GNNs) to be topologically aware. BDs are computed for the graph node features and vectorized. Interestingly, the filtration function is learned end-to-end. The Topological Graph Layer (TOGL) (Horn et al. 2022) extends GFL’s idea and learns multiple filtrations of a graph (rather than one) in an end-to-end manner.

Unlike the embedding layers (e.g. PersLay Carrière et al. (2020)) that expect a pre-specified input type (e.g. PDs), the topological representation layers discussed in this section enjoy more flexibility regarding the input and placement in the network. This comes with the attached cost of requiring careful design choices and guarantees on the layer characteristics (e.g. consistency of gradients in Hofer et al. (2020)).

Algorithm 2 Topological Representation Integration in Deep Neural Networks

The process of integrating topological representations into deep learning models is outlined in Algorithm 2. The exact method used (e.g. Topological Autoencoders, PEGN, GFL, TOGL) depends on the specific approach chosen.
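
As a simplified illustration of the augmented-representations idea (not the specific PEGN, GFL, or TOGL layers), the PyTorch sketch below concatenates a precomputed topological summary vector with learned deep features before the task head; all names and dimensions are placeholders.

import torch
import torch.nn as nn

class TopoAugmentedClassifier(nn.Module):
    # Sketch: fuse learned deep features with a precomputed topological vector.
    # `x` is the raw input; `topo` is a fixed-length topological summary of the
    # same sample (e.g. a persistence-image or histogram vectorization computed offline).
    def __init__(self, in_dim: int, topo_dim: int, n_classes: int):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.head = nn.Linear(64 + topo_dim, n_classes)

    def forward(self, x: torch.Tensor, topo: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)
        return self.head(torch.cat([feats, topo], dim=-1))

# Usage with random stand-in data.
model = TopoAugmentedClassifier(in_dim=32, topo_dim=10, n_classes=3)
x, topo = torch.randn(4, 32), torch.randn(4, 10)
print(model(x, topo).shape)  # torch.Size([4, 3])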

3.3 Topological loss

The most common approach for leveraging topology in deep learning is incorporating a topological penalty in the loss. The popularity of this approach stems from the fact that loss-based integration is straightforward and does not require changing the architecture or adding additional layers. The only caveat is that the loss should be differentiable and easy to compute. As noted previously, the capability of topological features to capture the complex structure of the data means that deep learning can learn robust representations guided by a topological loss. The resulting representations are thus likely to be robust to perturbations typical of real-world datasets, such as noise and outliers. An example of this is a common persistence loss (Hu et al. 2019), which minimizes the difference between a predicted persistence diagram \(\textrm{PD}_X\) and the true diagram \(\textrm{PD}_Y\):

$$\begin{aligned} \mathcal {L}_{\text {topological}} = d(\textrm{PD}_X,\textrm{PD}_Y) \end{aligned}$$
(1)

This has been used either as a standalone loss or as a regularizer (i.e. augmenting another loss) (Hu et al. 2019) in applications such as semantic segmentation (Hu et al. 2019), or generative modeling (Wang et al. 2020).

As discussed in Sect. 3.1, PDs do not lend themselves to vector representations in Euclidean space. Moreover, the construction of a PD is not differentiable in general (differentiability being a key requirement for backpropagation). One strategy to resolve this is to leverage a divergence or metric that can handle PDs. The p-Wasserstein distance and the bottleneck distance are popular choices:

$$\begin{aligned} d_{p,q}(\textrm{PD}_X,\textrm{PD}_Y)&= \Big [ \inf _{\pi \in \Pi (\textrm{PD}_X, \textrm{PD}_Y) } \sum _{t \in \textrm{PD}_X} \Vert t - \pi (t)\Vert _{q}^{p} \Big ]^{\frac{1}{p}} \end{aligned}$$
(2)
$$\begin{aligned} d_{\infty }(\textrm{PD}_X,\textrm{PD}_Y)&= \inf _{\pi \in \Pi (\textrm{PD}_X, \textrm{PD}_Y) } \sup _{t \in \textrm{PD}_X} \Vert t - \pi (t)\Vert _{\infty } \end{aligned}$$
(3)

where t is a point \((b_i, d_i)\in \mathbb {R}^2\) in \(\textrm{PD}_X\), \(\Pi (\textrm{PD}_X, \textrm{PD}_Y)\) denotes the set of bijections between \(\textrm{PD}_X\) and \(\textrm{PD}_Y\), and \(\Vert \cdot \Vert _q\) is the \(\ell _q\) norm. The bottleneck distance is thus the smallest achievable value, over all bijections that preserve the partial ordering of the points (i.e. we cannot match a point with a birth time greater than another point’s death time), of the largest distance between matched points. This ensures that the topological features being matched are comparable.

The initial popularity of the bottleneck distance is perhaps fueled by a stability theorem (Cohen-Steiner et al. 2005) for PDs of continuous functions. According to this theorem, the bottleneck distance is controlled by \(L_\infty \) distance, that is

$$\begin{aligned} d_{\infty }(\textrm{PD}_{f_1},\textrm{PD}_{f_2}) \le C \Vert f_1-f_2\Vert _{\infty } \end{aligned}$$
(4)

for some constant C. In effect, this means that the diagrams are stable with respect to small perturbations of the underlying data. A similar stability result exists for the p-Wasserstein distance. These results are the foundation of the stability guarantees in recent deep learning works, such as the stability of the Heat Kernel Signature in graphs (Carrière et al. 2020) and the stability of mini-batch-based diagram distances in Topological Autoencoders (Moor et al. 2020).

Among the limitations of (2) and (3) is the high computational budget needed by these distances when the number of points is large. As the distance requires point-wise matching, the computational complexity is \(\mathcal {O}(n^3)\) for n points (Anirudh et al. 2016). Also, in many applications (Wang et al. 2020; Chen et al. 2019), we aim to learn a model \(f_w\) that aligns a predicted diagram \(\textrm{PD}_P\) with a target (i.e. ground truth) diagram \(\textrm{PD}_T\) by gradually moving \(\textrm{PD}_P\) points towards \(\textrm{PD}_T\). This is typically achieved by pushing w in the negative direction of \(\nabla _w \mathcal {L}_{\text {topological}}\) and, obviously, assumes that the loss is differentiable with respect to the diagram. While the Wasserstein distance satisfies this requirement in general, it can have some instability issues (Solomon et al. 2021). Below, we select a few representative papers using topological losses in various applications and show how they handle these issues.

In generative modeling, TopoGAN (Wang et al. 2020) uses a slightly modified 1-Wasserstein distance to align the diagrams of generated and real images in medical image applications. The loss ignores the death time and focuses only on the birth time of the diagram features. Framed in this way, the loss becomes similar to the Sliced Wasserstein distance (Peyré et al. 2019), which can be computed efficiently and is still differentiable. A similar loss was used by Hu et al. (2019) for segmentation to encourage the deep model to produce output with a topology close to the ground truth. The cross-entropy loss is augmented with the 2-Wasserstein loss between persistence diagrams. To alleviate the computational burden, the method performs the calculation on a single small image patch (part of the image) at a time. In Clough et al. (2022), the authors rely on Betti numbers for semi-supervised image segmentation. A notable advantage here is that the output of a network trained on a small set of labeled images can still capture the actual Betti numbers correctly. This gives us the opportunity to initially train the model on a small labeled dataset guided by the Betti numbers loss \(\mathcal {L}_{\beta }\). The model is then fine-tuned using a large unlabeled dataset and guided by a loss (that incorporates \(\mathcal {L}_{\beta }\)). Since the estimation of Betti numbers is robust for unlabeled data, \(\mathcal {L}_{\beta }\) regularizes the second stage of training (fine-tuning). In classification, Chen et al. (2019) use a topological regularizer. To speed up the computation, it focuses on the zero homological dimension, where the persistence computations are particularly fast.

Algorithm 3 Topological Loss for Deep Learning

Algorithm 3 outlines the computation of topological loss using either the p-Wasserstein distance or the bottleneck distance. The procedure TopologicalLoss takes two persistence diagrams \(\textrm{PD}_X\) and \(\textrm{PD}_Y\), and the parameters p and q, then computes the p-Wasserstein or bottleneck distance as the topological loss. This loss can be used in deep learning models to minimize the difference between predicted and true topological features.
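
A minimal sketch of such a loss is given below, using GUDHI’s bottleneck and Wasserstein distance routines (the Wasserstein routine additionally requires the POT package). The function name mirrors the procedure described above but is illustrative, and, as discussed, the resulting value is not differentiable end-to-end without further relaxations.

import numpy as np
import gudhi
from gudhi.wasserstein import wasserstein_distance  # requires the POT package

def topological_loss(pd_x, pd_y, p=1.0, use_bottleneck=False):
    # Illustrative loss mirroring Algorithm 3: a distance between two diagrams.
    # Each diagram is an (n, 2) array of (birth, death) pairs for one homology
    # dimension. The papers discussed above use relaxations or restricted
    # settings to backpropagate through such a quantity.
    if use_bottleneck:
        return gudhi.bottleneck_distance(pd_x, pd_y)
    return wasserstein_distance(pd_x, pd_y, order=p, internal_p=2)

pd_pred = np.array([[0.0, 0.6], [0.1, 0.2]])
pd_true = np.array([[0.0, 0.7]])
print(topological_loss(pd_pred, pd_true))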

3.4 Deep topological analytics

The complementary value of TDA goes beyond on-training integration and the construction of topological neural architectures. In fact, leveraging TDA methods post-training can be even more insightful and powerful. Currently, researchers use TDA to address deep learning transparency (Liu et al. 2020), to study model complexity (Rieck et al. 2019; Carlsson and Gabrielsson 2020), and even to track down answers to seemingly mysterious aspects of deep learning, e.g. why deep networks outperform shallow ones (Naitzat et al. 2020). These efforts are centered around analyzing deep models using TDA approaches; hence, we call this deep topological analytics. We explore two aspects of it below.

Quantifying structural complexity Watanabe and Yamana (2021) treat a neural network as a weighted graph G(V, E), where V and E denote the network neurons and the relevance scores (computed from the weights), respectively. By computing persistence features (e.g. Betti numbers) across the filtration, we can gain insight into the network complexity. For example, an increase in the Betti number (the occurrence of a cycle between a set of neurons) can reflect the complexity of the knowledge stored in a deep neural network. In Rieck et al. (2019), the authors follow the same line and further develop training optimization strategies (e.g. early stopping) informed by homological features.
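
A toy sketch of this style of analysis is shown below (GUDHI assumed; the relevance scores, filtration convention, and threshold are placeholders rather than the construction of Watanabe and Yamana): a layer’s weights are treated as a weighted bipartite graph, filtered by relevance, and the number of connected components alive at a chosen filtration value is read off from the persistence pairs.

import numpy as np
import gudhi

def weight_graph_components(relevance, threshold=0.5):
    # Toy sketch: filter a layer's weight graph by relevance and count the
    # connected components (H0 features) still alive at a given filtration value.
    m, n = relevance.shape
    st = gudhi.SimplexTree()
    for v in range(m + n):
        st.insert([v], filtration=0.0)  # all neurons present from the start
    for i in range(m):
        for j in range(n):
            # Strong (high-relevance) connections receive small filtration values,
            # so they enter the filtration first.
            f = 1.0 - relevance[i, j] / relevance.max()
            st.insert([i, m + j], filtration=f)
    diagram = st.persistence()  # list of (dimension, (birth, death)) pairs
    return sum(1 for dim, (b, d) in diagram if dim == 0 and b <= threshold < d)

relevance = np.abs(np.random.default_rng(2).standard_normal((8, 6)))
print(weight_graph_components(relevance))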

Visual exploration of models Another use of TDA here is to provide post-hoc explanation and/or visual exploration of the internal functioning of deep models. For example, topological information provides insight into the overall structure of high-dimensional functions. The authors in Liu et al. (2020) use this to offer a scalable visual exploration tool for data-driven black-box models; this is an important research problem, and doing so in an intuitive way is a challenge. They also use topological splines to visualize the high-dimensional error landscape of the models. Similarly, TopoAct (Rathore et al. 2021) offers insightful information on the representations learned by neural networks and provides a visual exploration tool to study topological summaries of activation vectors. Works such as Polianskii (2018) shed light on how neural networks maintain the topological properties of the data when projecting it into low-dimensional space.

DNN focused topology optimization The concepts of inverting image representations and physics-informed neural networks served as inspiration for the topology optimization via neural reparameterization (TONR) framework (Zhang et al. 2021), which aims to address a variety of topology optimization problems. In this approach, the density field is optimized by updating the DNN parameters, with careful choice of the initial parameters, which leads to quicker training and suggests a good measure for topology optimization.

4 Discussion

TDA is a steadily developing and promising area, with successes in a wide variety of applications. However, there are open questions in applying TDA with deep neural networks. In this section, we discuss various successes and applications of deep TDA, highlight several open challenges for future research on deep TDA in both practical and theoretical aspects, and paint a speculative picture by outlining what persistent homology holds for the future. We also note some open-source implementations available for researchers to get started.

4.1 Successes and applications

Deep TDA has demonstrated potential in a variety of challenging settings. The invariance of PH information to continuous deformation means TDA applies well to settings where objects should have consistent shapes but may be transformed in some way. TDA is also well suited to bridging the gap between structural information and prior knowledge. If we have prior knowledge of the topology of a class of objects, then PDs are an effective tool for the classification and comparison of data against this class, even in the presence of noise or limited data. This robustness complements deep learning well.

A potential area of application for topological data analysis (TDA) combined with deep learning lies in multi-class segmentation tasks. In such tasks, it becomes feasible to delineate the topology of individual classes as well as the boundaries between classes. This extension can be viewed as an application of persistent homology (PH) to the issue examined by Clough et al. (2022) and Haft-Javaherian et al. (2020), where prior information was utilized to define the adjacencies amongst different brain regions.

TDA can produce good results on small datasets (Byrne et al. 2021; BenTaieb and Hamarne 2016), and is especially useful for medical imaging applications where cost and privacy concerns often limit data acquisition. Byrne et al. (2021) and BenTaieb and Hamarne (2016) investigated the limitations of conventional deep learning training procedures when applied to small datasets. These works reveal that such procedures rely heavily on pixel-wise loss functions, which restrict the optimization process in terms of extended or global features. They used persistent homology to construct topological loss functions that evaluate image segments against a known prior, resulting in a richer description of segmentation topology and better accuracy.

As persistent homology describes global structure, topological loss functions can suppress small false positives or false negatives related to the topology of an object. In segmentation tasks, for example, techniques such as morphological operations or CRF-based post-processing are used to remove local errors, but they have no concept of global topology. The benefit of a PH-based loss is that the correct global topology can be propagated together with local label smoothness. TDA has been used in settings with limited or noisy data, such as power forecasting (Senekane et al. 2021), segmenting aerial photography (Mosinska et al. 2018), and astronomy (Murugan and Robertson 2019).

In some applications, topological information (e.g. finding anomalies or changes in topology) may be more significant than statistical (e.g. pixel-wise) information. For example, in Vukicevic et al. (2017) and Byrne et al. (2016), detecting holes between heart chambers was more important than inferring the thickness of septal walls. For these types of applications, a loss function combining topological and statistical information can be adjusted in favor of topology when training a network.

Given its ability to preserve the global structure, TDA emerges as a promising approach for capturing intricate structural details and can be effectively integrated into generative models to produce new data that aligns with the topology of the training set. In a recent study conducted by  Zhou et al. (2022), a topological network was trained and incorporated as a branch within a generative adversarial network (GAN) framework. This integration aimed to enhance the performance of generating new point clouds. By leveraging the strengths of TDA and GANs, the researchers demonstrated significant improvements in the generation process, yielding more accurate and topologically consistent synthetic data.

Performance and comparative analyses of TDL typically focus on evaluating its effectiveness against traditional machine learning and deep learning models. Common metrics used in these studies include accuracy, precision, recall, and computational efficiency (Hofer et al. 2017; Moor et al. 2020; Huynh et al. 2021; Clough et al. 2022; Haft-Javaherian et al. 2020). Deep TDA often demonstrates superior performance in handling complex data structures and noisy datasets, showing resilience in maintaining accuracy under such conditions (Clough et al. 2022). Moreover, deep TDA models are frequently found to be more interpretable, a key advantage in critical applications where understanding model decisions is crucial (Singh et al. 2023; Fan et al. 2023).

The integration of topological data analysis (TDA) with deep learning methodologies has recently exhibited remarkable potential and practical application across various disciplines. The following are some of the pivotal fields that highlight the significance of this synergistic approach:

Biomedical imaging In biomedical imaging, this combination has been used for more accurate analysis of complex medical images. Researchers utilize these techniques for enhanced feature extraction and classification in areas such as tumor detection and organ segmentation (Hajij et al. 2021; Singh et al. 2023; Fan et al. 2023; Glatt and Liu 2023).

Genomics In genomics, it aids in the analysis of high-dimensional genetic data (Amézquita et al. 2023). It is particularly useful for understanding genetic diseases by identifying patterns and connections in genomic data that traditional methods might miss (Shapanis et al. 2023; Narayana et al. 2023; Amézquita et al. 2023; Yu et al. 2023; Wamil et al. 2023; Chulián et al. 2023; Morilla et al. 2022).

Protein engineering Topological Deep Learning is revolutionizing the way scientists approach the vast mutational space of proteins. It is particularly transformative when combined with existing protein structure prediction tools like AlphaFold2, enabling more precise and powerful structure-based strategies in protein engineering. Topological Deep Learning in this field is not just enhancing the speed and accuracy of protein design and analysis, but also opening new pathways for advancements in drug discovery, antibody development, and beyond (Qiu and Wei 2023a; Chen et al. 2022; Qiu and Wei 2023b).

Smart manufacturing TDA with deep learning enables enhanced detection of patterns and anomalies in manufacturing processes. This integration not only improves predictive maintenance but also optimizes production efficiency and quality control, paving the way for more intelligent and responsive manufacturing systems (Ko and Koo 2023; Sarpietro et al. 2022; Uray et al. 2023).

Finance and economics The financial sector employs these techniques for market analysis and risk assessment (Goel et al. 2020). By analyzing complex market data, this integrated approach helps in predicting stock market trends and in algorithmic trading (Chang and Lin 2023; Hafez et al. 2022).

Cybersecurity In cybersecurity, combining TDA with deep learning enhances the detection of anomalies and threats in network data, aiding in the identification and prevention of cyber-attacks (Zhen et al. 2022; Guo et al. 2022).

Topological analysis, as a general methodology, serves as a means of formalizing qualitative aspects inherent in reality. Integrating topological analysis with deep learning techniques proves to be highly advantageous for a wide range of tasks and applications, as highlighted previously. Additionally, the representation of data using TDA provides enhanced interpretability to human observers compared to the utilization of conventional black-box deep neural networks. This attribute allows for a deeper understanding of the underlying patterns and structures present in the data, thus enabling more meaningful insights to be derived from the analysis.

4.2 Challenges

Despite the success of TDA and its use in deep learning, we describe a few notable challenges that, if properly addressed, could benefit the field greatly.

Computational cost Many aspects of calculating persistent homology are computationally intractable. The construction of the Čech complex for a given r is known to be an NP-hard task. Computing Betti numbers for a given simplicial complex is also infeasible for very large-scale complexes. The costs of calculating TDA information add to already computationally expensive deep learning routines.

Lack of a universal framework for vectorization There is no universally accepted framework for incorporating topological information into deep learning; earlier representations were created in an ad-hoc manner or learned independently (Hofer et al. 2017; Moor et al. 2020). This is both a theoretical and a computational matter; the lack of a strong theory for encoding persistence diagrams as vectors is one example of the issues encountered. There have been a variety of ad-hoc solutions of varying merit, recently catalogued in Ali et al. (2023). Alternatively, vectorization methods have been chosen as part of learning strategies (Hofer et al. 2017; Moor et al. 2020).

Statistical guarantees Throughout this article, we have not discussed the statistical aspects of persistence due to finite sampling. For example, there is no guarantee that the PD derived from X reflects the true homology of \(\mathcal {M}\). The framework for understanding the statistical robustness of persistence information is still evolving. Some simple strategies for verification, such as sub-sampling and cross-validation, have been used in the literature (Chazal and Michel 2021). There is scope to further understand issues such as the minimum number of data points required to guarantee robust PDs. Furthermore, persistence is not well understood from a probabilistic point of view (e.g. the distribution of persistence arising from a distribution of shapes).

High-dimensional learning challenge There is no underlying theoretical framework for what topological features to expect with high-dimensional data. While abstract topological spaces can be enormously complex in high dimensions, we do not know whether to expect data to behave similarly. Moreover, high dimensional homological features are unattainable due to computational cost, and in any case, the sensitivity of PDs to sampling or noise is not well understood in high dimensions. This makes learning the underlying topology of the data for use in deep neural networks challenging.

The need for a good backpropagation strategy The differentiability of PDs or other homological quantities is not guaranteed or necessarily well understood. This makes backpropagation in deep neural networks that incorporate topological signatures extremely challenging or only feasible under special conditions (Moor et al. 2020).

Capturing multi-variate persistence In some cases, multiple concurrent filtrations are needed to fully capture the topology of the data manifold, especially for data in higher dimensions. This leads to multi-variate persistence, where the birth and death of topological features occur in multiple dimensions. This notion of persistence does not have a complete discrete invariant, unlike the one-dimensional BD that we have discussed so far. For the practical use of multi-variate persistence in deep learning, we would need new theoretical frameworks and better computational methods.

4.3 Future directions

It would be interesting to explore sophisticated deep learning architectures that learn mappings between high dimensional data and their corresponding PDs or other topological representations, furthering the work of de Surrel et al. (2022).

As deep learning models continue to grow in complexity and datasets in size, scalability and efficiency become even more crucial. Future directions in TDA for deep learning involve the development of scalable algorithms and efficient computational frameworks capable of handling large-scale datasets. This would enable the application of topological data analysis to diverse domains and real-world problems.

Interpreting deep learning models’ decisions remains a challenging endeavor. TDA offers a unique perspective by providing interpretable representations of complex data. Future directions in this area will focus on developing methodologies to extract meaningful topological features and interpret their significance in the context of deep learning tasks. This will facilitate a better understanding of the decision-making process for deep neural networks and increase their trustworthiness.

Regularization plays a crucial role in preventing overfitting and improving the generalization ability of deep learning models. Future research will explore how TDA-based regularization techniques can be integrated into deep learning frameworks. This could involve incorporating topological penalties or constraints to encourage models to capture meaningful topological features, leading to improved model generalization and robustness.

Many real-world applications involve multimodal data, such as images, text, and sensor data. Combining TDA with deep learning techniques provides a promising avenue for analyzing and integrating information from multiple modalities. Future directions include the development of TDA methods that can handle multimodal data and exploit the interactions between different modalities to uncover complex relationships and structures.

Transfer learning has proven to be an effective strategy for leveraging knowledge gained from one task to improve performance on a related task. Integrating TDA into transfer learning frameworks can enable the transfer of topological knowledge between domains or datasets. This could facilitate the adaptation of deep learning models to new domains by preserving the underlying topological structure and transferring relevant information.

Moreover, deep learning may yet yield new kinds of topological representation other than PDs, with robustness to different data deformations. PH could have further applications in multi-class open-set problems (where data may have unknown classes). If the topology among classes is relatively consistent, then the object labels of unknown classes could be better predicted.

4.4 Implementations

There are a number of open-source implementations of TDA available to practitioners. Here, we present three libraries that have interfaces with deep learning architectures.

GUDHI is an open-source library that implements relevant geometric data structures and TDA algorithms, and it can be integrated into the TensorFlow framework. PersLay (Carrière et al. 2020) and RipsLayer are implementations using GUDHI that learn persistence representations from complexes and PDs. They can handle automatic differentiation and are readily integrated into deep learning architectures.

Giotto-deep is an open-source extension of the Giotto-TDA library. It aims to provide seamless integration between TDA and deep learning on top of PyTorch. The developers aim to provide several off-the-shelf architectures so that topology can be used both for pre-processing data (with a variety of available methods) and within neural networks. One such example is Persformer (Reinauer et al. 2021).

TopoModelX is a recent Python package that extends Graph Neural Networks (GNNs) for application in topological domains, demonstrating a substantial development in the field of topological deep learning. The implementation of topological neural networks in TopoModelX started as the ICML 2023 Topological Deep Learning Challenge (Papillon et al. 2023a), hosted by the second annual Topology and Geometry (TAG) in Machine Learning Workshop at ICML. Participants contributed by implementing existing topological neural network methods from the literature and applying them to train on a benchmark dataset. TopoModelX offers a robust framework and essential functionalities, enabling researchers to either implement new GNN-based TDL algorithms or apply existing methodologies from scholarly literature to their specific problems.

5 Conclusion

The recent growth in TDA and the established efficacy of deep learning have meant that the integration of these techniques has been inevitable. There is no universal paradigm for combining TDA and deep learning. This article surveyed numerous ways in which these frameworks have benefited each other. We began with an overview of the key TDA concepts. Following this, we reviewed TDA in deep learning from a variety of perspectives. We described numerous challenges and opportunities that remain in this field, as well as some observed successes.