Persistence codebooks for topological data analysis

Persistent homology is a rigorous mathematical theory that provides a robust descriptor of data in the form of persistence diagrams (PDs) which are 2D multisets of points. Their variable size makes them, however, difficult to combine with typical machine learning workflows. In this paper we introduce persistence codebooks, a novel expressive and discriminative fixed-size vectorized representation of PDs that adapts to the inherent sparsity of persistence diagrams. To this end, we adapt bag-of-words, vectors of locally aggregated descriptors and Fischer vectors for the quantization of PDs. Persistence codebooks represent PDs in a convenient way for machine learning and statistical analysis and have a number of favorable practical and theoretical properties including 1-Wasserstein stability. We evaluate the presented representations on several heterogeneous datasets and show their (high) discriminative power. Our approach yields comparable—and partly even higher—performance in much less time than alternative approaches.


Introduction
Topological data analysis (TDA) provides a powerful framework for the structural analysis of high-dimensional data. An important tool in TDA is persistent homology, PH (Edelsbrunner et al. 2002). It provides a comprehensive, multiscale summary of the underlying data's shape and currently gains an increasing importance in data science (Ferri 2017). Recently, it has been successfully applied to computer vision problems, such as shape and texture analysis (Li et al. 2014;Reininghaus et al. 2015), 3D surface 1 3 analysis (Adams et al. 2017;Zeppelzauer et al. 2017), 3D shape matching (Carrière et al. 2015), mesh segmentation (Skraba et al. 2010), and motion analysis (Vejdemo-Johansson et al. 2015). Further application areas include time series analysis (Seversky et al. 2016), music tagging (Liu et al. 2016) and social-network analysis (Hofer et al. 2017) as well as applications from the bio-medical domain, e.g. biomolecular analysis (Cang and Wei 2017), brain network analysis (Lee et al. 2012), protein investigation (Gameiro et al. 2015) and material science (Nakamura et al. 2015).
Persistent homology can be efficiently computed using various currently available tools (Bauer et al. 2017;Chen and Kerber 2011;De Silva et al. 2011;Dey et al. 2016;Edelsbrunner and Harer 2010;Maria et al. 2014). A basic introduction to PH is given in Sect. 2 and the more detailed one in the "Appendix". The common representation of PH are persistence diagrams (PDs) which are multisets of points in ℝ 2 . Due to their variable size, which varies depending on the input data, PDs are not easy to integrate within common data analysis, statistics and machine learning workflows. To alleviate this problem, a number of kernel functions defined on PDs and vectorization methods for PDs have been introduced.
Kernel-based approaches have a strong theoretical background but in practice they often become inefficient when the number of training samples is large. As typically the entire kernel matrix must be computed explicitly (like in case of SVMs), this leads to roughly quadratic complexity in computation time and memory with respect to the size of the training set. Furthermore, vector-based approaches are limited to kernelized methods, such as SVM and kernel PCA. Vectorized representations, in contrast, are compatible with a much wider range of methods and do not suffer from complexity constraints of kernels. They, however, often lack in representational power, as they require the spatial quantization of the PDs, which is unsually non-adaptive and thus does not cope well with the sparseness of PDs.
In this work we present a novel adaptive representation of PDs which aims at combining the large representational power of kernel-based approaches with the general applicability of vectorized representations. To this end, we adapt the popular bag-ofwords (BoW) encoding (McCallum et al. 1998;Sivic and Zisserman 2003), as well as its more comprehensive extensions, such as VLAD (Jégou et al. 2010) and Fisher vectors (Perronnin and Dance 2007) to cope with the inherent sparsity of PDs. The proposed persistent codebooks provide universally applicable fixed-sized feature vectors. They are, under mild assumptions, stable with respect to a standard metric in PDs and thus, also theoretically, built upon a solid basis. The presented method is to some extend a generalization of persistence images, PI (Adams et al. 2017), which adapts to the underlying data distribution. In contrast, PI samples the distribution in a regular grid (corresponding to an image), what often results in unnecessary codewords. Experiments show that the new representations achieve peak performance and even outperform numerous competitive methods while being more compact and requiring orders of magnitude less time.
This paper builds upon previous work of Zieliński et al. (2019). The additional contribution includes: (1) two new persistence codebook representations (PVLAD and PFV) building upon vectors of locally aggregated descriptors (VLAD) and Fisher vectors (FV); (2) the investigation of their stability; (3) the introduction of stable variant of PVLAD algorithm together with the proofs of its stability; (4) a significant number of additional experiments on an extended collection of datasets; and (5) an extended discussion of results.
The paper is structured as follows. Section 2 gives a basic introduction to PH and reviews related approaches. In Sect. 3 we introduce persistence codebooks and investigate their stability. Sections 4 and 5 present the experimental setup and results. We conclude the work in Sect. 6.

Background on persistent homology
In this section, we first introduce persistent homology, and then describe related state-ofthe-art approaches, both kernel-and vectorization-based, that aim at making PH compatible with machine learning methods.
Under mild assumptions, persistent homology (PH) can be defined for a continuous function f ∶ X → ℝ , where X ⊂ R n . Typically f is a distance function from a collection of points, or a scalar value function defined on a grid of points, but in principle it can be an arbitrary function that satisfies a tameness assumption specified below. Focusing on sublevel sets L x = f −1 ((−∞, x]) , we let x grow from −∞ to +∞ . While this happens, we can observe a whole hierarchy of events. In dimension zero, connected components of L x will be created and merged together. One dimensional cycles that are not bounded, or higher dimensional voids, will appear in L x at critical points of f. The value of x on which a connected component, cycle or a higher dimensional void appears is refereed to as birth time.
They will subsequently either become identical (up to a deformation) to other cycles and voids (created earlier), or they will be glued-in and become trivial. The value of x on which that happens is refereed to as death time. Every connected component, a cycle, or a higher dimensional void can, therefore, be characterized by a pair of numbers, b and d, its birth and death time. The difference between the death and the birth, p = d − b , is the so-called persistence value. In this paper, we will use the birth-persistence pair [b, p] to encode the feature. The multi-set of birth-persistence pairs makes up a persistence diagram (PD). The set of all persistence diagrams will be denoted as D . Example PDs for three different input point clouds are shown in Fig. 1.
The persistence coordinate is often an indicator of whether a cycle is structurally relevant or more likely to be related to noise. This observation is justified by many stability theorems for persistence (Cohen-Steiner et al. 2007), which state that a small change in the space X, or in a function f, implies only a small change in the resulting persistence diagram. Consequently, points in the PD with low persistence can be removed by a small perturbation of the data; and therefore, are not considered stable features. Those stability results make PDs a robust tool in data analysis.
Throughout this paper we assume that the given function f is tame, i.e. it induces a finite number of birth-persistence points. There are various metrics on finite PDs. To define them, the finite diagrams have to be enriched with an infinite collection of points (b, 0), which represent features that are born and immediately die. Having the enriched PDs B and B ′ let us consider all possible matchings ∶ B → B � . The 1-Wasserstein distance is defined as: In this paper, when considering stability of the representations, we will consider the stability with respect to 1-Wasserstein distance. A more in-depth introduction to PH is provided in "Appendix".

Kernels and vectorized representations of PDs
Numerous kernel-based and vectorized approaches have been introduced to make PDs compatible with statistical analysis and machine learning methods. The goal of kernelbased approaches is to define dissimilarity measures (also known as kernel functions) on PDs to compare them, and thereby make them compatible with kernel-based machine learning methods, such as Support Vector Machines (SVMs), and kernel Principal Component Analysis (kPCA). Li et al. (2014) combine the traditional bag-of-features (BoF) approach with PDs by using various distance functions between 0-dimensional PDs (bottleneck and Wasserstein distances for PDs, L p distance functions for persistence landscapes of PDs) to generate kernels. On different datasets (SHREC 2010, TOSCA, hand gestures, Outex) they show that topological information is complementary to the information of traditional BoF. Reininghaus et al. (2015) propose a kernel for persistence diagrams by turning PDs into a continuous distribution by appropriate placement of Gaussian distributions in ℝ 2 . Subsequently, they define a kernel as a scalar product of the two corresponding distributions. They apply topological descriptors together with the novel kernel to shape retrieval and texture classification. Kusano et al. (2016) propose a persistence weighted Gaussian kernel, which employs the framework of kernel embedding of measures into reproducing kernel Hilbert spaces. Carrière et al. (2017) propose another kernel based on sliced approximation of the Wasserstein distance. The authors show that the kernel is not only stable, but also mimics bottleneck distances between PDs.  Fig. 1 The principle behind persistent codebooks on the example of the computational workflow of the persistence bag-of-words representation (PBoW): From the input data we compute PDs in dimension 1 in birth-persistence coordinates and combine them into one consolidated diagram (for the entire dataset). Next, a subset of points is obtained from this diagram by either a weighted or unweighted sub-sampling. Subsequently, we cluster the sub-sampled consolidated diagram to retrieve the codewords which will form our codebook. Finally, the points for each input PD are encoded by the codewords (BoW quantization).
In this illustration a hard assignment of points to words (PBoW) is performed. The result is a codeword histogram for each input PD that represents how many points fall into which cluster of the codebook, i.e. codeword cardinalities. These codeword histograms are a compact and fixed-size vectorial representation. It is worth mentioning that while the hard assignment presented here gives the idea of the procedure, in practice we often employ soft assignment for stability reasons. Please, note further that the workflows for other persistent codebook encodings (e.g. based on VLAD or Fisher Vectors) are structurally similar, but partly use different codeword generation, quantization, and assignment schemes to reduce the kernel computation time. They apply it to 3D shape segmentation, texture classification, and orbit recognition in dynamical systems. Another approach for the representation of PDs are persistence landscapes, PL (Bubenik 2015). PL is a stable functional representation of a PD obtained from transforming it into a sequence of real-valued piecewise linear functions. To compare two landscapes, the authors use standard L p distance. This distance can be used to define a kernel function. Note that PL can also be transformed into a fixed-length vectorized representation by sampling the values of the landscape function. The authors, however, are not reporting results for vectorized PLs; therefore, we compare to kernels derived from PLs in our experiments. More recently, Le and Yamada (2018) proposed persistence Fisher kernel. It is based on a Fisher information distance between persistence diagrams and preserves some of the geometrical properties of the persistence diagrams space. Another approach, proposed by Som et al. (2018), embeds persistence diagrams into a Grassmann manifold, where PDs are compared using a geodesic distance. Vectorized representations aim at deriving fixed-size encodings of PDs that can be used directly as input to current machine learning methods. One of the first attempt to vectorize PDs was presented by Aadcock et al. (2014). Given a collection of PDs D 1 , … , D n , a vector characterizing D i is obtained by taking the vector of matching distances between D i and D j , for every j ∈ {1, … , n} . A more recent approach, called persistence image, PI (Adams et al. 2017) is built upon earlier work on size functions (Donatini et al. 1998;Ferri et al. 1998). It maps a PD to a space of functions from ℝ 2 to ℝ by taking a weighted sum of two dimensional Gaussian kernels placed in the points of PD. Subsequently, a discretization of the obtained function on a fixed grid of points provides the vectorization of PDs. Anirudh et al. (2016) propose an alternative approach based on the reconstruction of a certain Riemannian manifold (RM) based on PDs and its subsequent representation by a fixed-size vector. In Di Fabio and Ferri (2015), PDs are represented as the coefficients of a complex polynomial having points of PD as roots. Similarly, using a sequence of weighting functions, Wang et al. (2019) transform PDs D 1 , … , D n into vectors V 1 , … , V n and obtain a polynomial representation of D i by taking a polynomial system having roots in the corresponding V i . Recently, as a continuation of this idea, the tropical algebra was used to construct a polynomial representation of diagrams (Kališnik 2019;Monod et al. 2019).
Recently, a third type of approach has been introduced, which aims at learning which points in the PD are of particular importance for the given task in a supervised manner by end-to-end learning (Hofer et al. 2017).
Overall, we can distinguish between approaches that learn the representation in a supervised or unsupervised manner. While supervised learned representations may better adapt to a specific task, the representations or kernels constructed in an unsupervised fashion bear less risk for overfitting because their construction is task agnostic. The proposed approach falls in the category of task agnostic representations and can be applied in supervised and unsupervised problem settings.

Persistence codebooks
In this section, we adapt the bag-of-words (BoW) model (McCallum et al. 1998;Sivic and Zisserman 2003) as well as its more comprehensive extensions, such as VLAD (Jégou et al. 2010) and Fisher vector (Perronnin and Dance 2007), introduced originally in text and image retrieval, for adaptive quantization of PDs into a fixed length vectorial representation. The idea behind BoW is to quantize variable length input data into a fixed-size representation by using a common dictionary, also called codebook of constant size. The codebook is generated from the input data in an unsupervised manner by extracting centers of clusters obtained from data clustering. The basic assumption behind BoW is that the clusters (i.e. codewords) capture the intrinsic structure of the data and, thereby represent an efficient vocabulary for the quantization of the data.
The overall approach of bag-of-words for persistence diagrams is visualized in Fig. 1. The input is a set of PDs extracted from all instances of a given dataset. First, all PDs are merged into one diagram. This consolidated diagram is then sub-sampled to reduce the influence of noise. In this paper, we consider two types of sub-sampling. A standard one which does not consider the persistence of the points, and one where points of higher persistence are more likely to be sampled, see Sect. 3.2 (we refer to those two types of sub-sampling as without and with weighting, respectively). From the (subsampled) consolidated diagram, the codebook C is generated using clustering. Given a codebook C, every input point P is encoded by assigning it to the nearest codeword from C. In traditional BoW this encoding leads to a codeword histogram, i.e. a histogram for which each codeword from C counts how many points from P are closest to this codeword. Further encodings investigated include vector of locally aggregated descriptors (VLAD) and Fisher vector (FV), see below.
For the proposed approaches, three important hyperparameters need to be identified: (1) the clustering algorithm used to generate the codebook, (2) the size of the codebook, i.e., the number of clusters, and (3) the type of proximity encoding which is used to obtain the final descriptors, i.e. hard and soft assignment. In this paper, we use k-means and Gaussian mixture models (GMM) for clustering. The codebook size is investigated empirically. In the following sections, we introduce persistent codebook approaches based on different quantizations and encodings, such as standard BoW, VLAD and FV. Consult Table 1 for an overview of representations introduced and evaluated in this paper.
For all approaches presented in these sections, we show if they are stable with respect to 1-Wasserstein distance. We would like to indicate that since the representations presented here are additive (consider the definition of additivity from Reininghaus et al. 2015), they are not stable for a p-Wasserstein distance for any p > 1 as indicated in Theorem 3 in Reininghaus et al. (2015). Table 1 The design space of persistent codebook approaches introduced in this paper together with their abbreviations for reference Each resulting representation can use either weighting of no weighting in codebook generation (see experiments in Sect. 5.1 for a direct comparison). For variants with weighting, we add "-w" to the abbreviation, e.g. "PBoW-w" for clarity

Persistence bag of words (PBoW)
Let us first consider a direct adaptation of BoW (Baeza-Yates et al. 1999;Sivic and Zisserman 2003) to PDs. Given a collection of persistence diagrams B 1 , B 2 , … , B n , they are consolidated into D = B 1 ∪ B 2 ∪ … ∪ B n and a codebook of size N is obtained by using k-means clustering on D. Let { i ∈ ℝ 2 , i = 1, … , N} denote the centers of obtained clusters (the codewords). Moreover, for a PD B = {x t ∈ ℝ 2 } T t=1 , let us denote NN(x t ) as the index of the codeword nearest to i} captures the number of points from B, which are closer to i than to any other j . Then the persistence bag of words (PBoW) is defined as a vector: Subsequently, (B) is normalized by taking the square root of each component (preserving the initial sign) and dividing it by the norm of the whole vector: This is a standard normalization for BoW (Perronnin et al. 2010), which reduces the influence of outliers.
Remark Let B, B � ∈ D be persistence diagrams containing only finitely many off-diagonal points. The persistence bag of words, PBoW with N words is not stable with respect to 1-Wasserstein distance.

Weighted subsampling for codebook generation
Aside from being unstable, the straight-forward application of BoW to PD would neglect an important property of persistence diagrams, i.e. that points in a PD with higher persistence are typically considered more important than points with lower persistence. It is a consequence of a stability theorem, Edelsbrunner and Harer (2010) indicating that points with low persistence are more likely to originate from a noise than the points of high persistence. (1) In order to integrate this property into the codebook generation procedure, we perform k-means clustering on a subset of points obtained by a weighted sampling described below. This results in extended procedure of codebook generation which is as follows: 1. Place all the persistence diagrams (or all diagrams of a certain dimension) on to a single consolidated persistence diagram D. 2. Subsample n points from D in a way that points of higher persistence are more likely to be sampled. In the experiments presented in this paper we set it to n = 10,000. 1 3. Perform k-means clustering on the obtained subset of n points to extract the centers of the clusters (the codewords).
For the weighted sampling of points from a persistence diagram we define a piecewise linear weighting function w a,b ∶ ℝ → ℝ as: and use it to weight second coordinates (persistence) of points in PD. In our experiments we set a and b to the persistence values corresponding to 0.05 and 0.95 quantiles of the persistence coordinate of the points in D. In the performed sub-sampling persistence points having longer values of the function w are more likely to be sampled. We want to highlight that in this case we have selected a linear weighting with respect to persistence, i.e. the probability of sampling of a point is proportional to its persistence. It works well in the cases considered in this paper, however in the case of very noisy data with just a few dominant persistent points, the points of high persistence may not be sampled at all. In such case, we suggest to consider the weighting to be a higher degree polynomial or an exponential function to boost the probability of capturing the high persistence points.
Please, note further that the sub-sampling does not directly enforce the points of the highest persistence to be automatically selected as the centers of clusters, but it makes the probability of such an event considerably larger. Examples of birth-persistence distributions with standard (unweighted) and weighted codebooks obtained with k-means and GMM are presented in Fig. 2. The unweighted clustering produces larger clusters, which are less adaptive to the strongest topological structures. At the same time, the weighted clustering yields a more adaptive codebook with a more uniform sampling of the space.

Weighted codeword assignment for persistence bag of words (wPBoW)
The weighting function from Sect. 3.2 can be similarly used to weight the histogram assignments to give points with higher persistence more influence in the final representation. For this purpose, instead of counting the number of points, we sum up the weights of their persistent coordinates. ( 1 Preliminary experiments have shown that this number is insensitive and has little influence on the results (evaluated value range: 1000-100,000).
where B ∈ D . We will refer to this representation as weighted persistence bag of words (wPBoW) in the following. Similary to standard PBoW, wPBoW is not stable with respect to 1-Wasserstein distance. The couterexample is identical with the one in Sect. 3.1, when we assume that function w a,b is identity.

Stable persistence bag of words (sPBoW)
After having integrated persistence-based weighting into codebook generation and also into histogram assignment, we aim at making the representation stable. To this end we adapt soft assignment of points to clusters and prove that such an approach guarantees stability of the resulting representation. Stable persistence bag of words (sPBoW) similarly to PBoW (and wPBoW) first consolidates PDs in the initial step of construction, and then generates a GMM based on the sub-sampled points (e.g. by expectation maximization

Consolidated diagram with unweighted codebook
birth persistence

Fig. 2
Codebook generation based on the consolidated PD with N = 7 codewords (top: k-means, bottom: GMM, left: no weighting, right: weighting). Using the language of computational geometry; one may tell that the cells in the top diagrams form a Voronoi diagram of the codewords. Equivalently all points in the same Voronoi cell have the same closest codeword. Weighting allows to sample more points of higher persistent from the diagram, representing more stable topological structures and yields more balanced cluster weights (even though there are many more points on the bottom of consolidated PD). Note that a cluster can (but does not have to) be created for a single high persistence point which is well separated from the others, as is the case here with the most persistent point (top-left quadrant) algorithm Nasrabadi 2007). This approach was originally introduced by Van Gemert et al. (2008). Let the parameters of the fitted GMM be = {w i , i , Σ i , i = 1, … , N} , where w i , i and Σ i denote the weight, mean vector and covariance matrix of i-th Gaussian, and N denotes the number of Gaussians. Given a PD B, the stable PBoW is defined as: is the likelihood that observation x t was generated by Gaussian i: The intuition behind this approach is to assign each point to all codewords, but with weight inversely proportional to the distance to the codewords.

Theorem Let B and B ′ be persistence diagrams with a finite number of non-diagonal points. Stable persistence bag of words, sPBoW with N words is stable with respect to 1-Wasserstein distance between the diagrams, that is
where C is a constant.
Proof Let ∶ B → B � be the optimal matching in the definition of 1-Wasserstein distance. For a fixed i ∈ {1, … , N} we have:

Persistence VLAD
Persistence VLAD (PVLAD) is based on vector of locally aggregated descriptors (VLAD) by Jégou et al. (2010), an extension of the bag-of-words concept, which accumulates the residual of each descriptor with respect to its assigned cluster. The first computation step is similar to PBoW: a codebook { i ∈ ℝ 2 , i = 1..N} is obtained from a training set using k-means clustering. Given a new PD B, each point x t ∈ B is associated with its nearest codeword NN(x t ) . In the second step, for each codeword i , we compute a sum of differences between i and all x t ∈ B for which NN(x t ) = i . This results in: The dimension of v PVLAD i equals 2 (differences on two coordinates), therefore is of size 2N. Intuitively, this vector should capture more information than PBoW alone, because it encodes the first order moments of the points assigned to a codeword instead of simply counting those points.
Similarly to PBoW, PVLAD is not stable with respect to 1-Wasserstein distance. Therefore, in Sect. 3.6, we propose to adapt a stable variant of VLAD, called soft VLAD.
Remark Let B, B � ∈ D be persistence diagrams containing only finite off-diagonal points. The persistence vector of locally aggregated descriptors, PVLAD with N words is not stable with respect to 1-Wasserstein distance. In order to be stable in 1-Wasserstein sense, PVLAD should fulfill the following condition:

Proof Starting from two clusters with centers
As > 0 can be arbitrarily small, and there does not exist a constant C that meets this condition. Therefore, PVLAD is not stable. ◻

Stable persistence VLAD
Similarly to PBoW, the hard association with codewords can be replaced by soft association in VLAD (Jégou et al. 2012), to account for instability. To this end, we define stable persistence VLAD (sPVLAD) as follows: where i (x t ) is the soft assignment of descriptor x t to ith Gaussian: In the stability theorem for stable persistence VLAD (presented below) we assume that coordinates of the points in the considered persistence diagrams are limited to a certain compact subset of ℝ 2 . This limitation is crucial to prove the stability and it is a reasonable assumption in case of TDA. Moreover, this theorem is true for any ℝ n (not only for ℝ 2 ).
Proof Let us first consider the following difference: where: while M i and L i are the maximal value and Lipschitz constant of i-th Gaussian p i . Note that the constant C 1 exists because diagrams are supported in a compact subset of ℝ 2 . Therefore, the Gaussians achieve a minimum value, which is bounded away from zero. When it comes to assignment (i), we simply put For a fixed i we can estimate: where: In (ii) we used the estimation (7) and the fact that sup y i (y) = 1 . The boundaries of ℝ 2 compact subset allow to determine the maximal distance between diagram points and the Gaussian centers, which is used in (iii) to estimate C 3 . Note that C 4 is independent of i, hence: and ◻ We would like to point out that in the estimation of C 1 and (iii) we use an assumption that diagrams are supported in the compact ℝ 2 subset [a, b] × [a, b] . As a result, if the support of a persistence diagram sequence diverges to ∞ , then the corresponding sequence of C 1 also diverges to infinity. Therefore C 4 is not a global constant and the persistence VLAD is not globally stable. We want to indicate, however, that for any practical case, the assumption about compact support of diagrams is always satisfied.

Persistence fisher vector
The idea of persistence Fisher vector (PFV) is based on Fisher vectors introduced by Perronnin and Dance (2007) and relies on the gradient of the log-likelihood with respect to the parameters of a Gaussian mixture model. Compared to the traditional BoW model, it captures first and second order moments. It can be extended to PDs as follows. Given a PD B ∈ D we aim a characterizing it with a gradient vector derived from a generative probability model (obtained for all PDs used for codebook generation). This is similar to sPVLAD, however in case of PFV we compute not only the first order, but also the second order moments of the points assigned to a codeword (i.e. not only the gradient for i but also for Σ i ). Let L(B| ) = log p(B| ) , where under the independence assumption: where: is the likelihood that point x t was generated by the GMM.
Assuming that the covariance matrices are diagonal (for ease of calculation), the der- and superscript d denotes the d-th dimension of a vector) can be effectively computed as (Perronnin and Dance 2007): The gradient vector is just a concatenation of the partial derivatives with respect to all the parameters.
To normalize the dynamic range of the different dimensions of the gradient vectors, the diagonal of the Fisher information matrix F is computed as: The persistence Fisher vector with N words is stable with respect to 1-Wasserstein distance, that is: Proof Persistence Fisher Vector is a concatenation of the two components presented in Eqs. (9) and (10). In order to be stable, both components have to be stable with respect to 1-Wasserstein distance; therefore, we estimate them separately (we skip d superscript from the original notation for clarity). The first FV component (10) can be estimated using the theorem about sPVLAD stability (8): The second component (10) can be estimated as follows: where D 1 is an upper bound for

3
Summing up the two estimates above, we conclude that persistence Fisher vector is stable with respect to 1-Wasserstein distance with a constant D 3 + C 4 ( i ) 2 , where D 3 is defined above and C 4 is defined in Sect. 3.6.

Experimental setup
To evaluate the proposed persistence BoW representations (PBoW, sPBoW, wPBoW, PVLAD, sPVLAD and PFV), we compare them with a number of state-of-the-art approaches including kernel-based methods and vectorized PD representations. The evaluation is performed on classification tasks involving different datasets representing heterogeneous data including, among others, 3D shapes, textures, and social media graphs. In the following, we describe the datasets used in our experiments, list the state-of-the-art approaches we compare with, and discuss the setup of the experiments.

Datasets
For the evaluation we incorporate various datasets which cover a wide range of different data types. Firstly, to provide a proof-of-concept, we evaluate all the approaches on a synthetically generated shape classes from Adams et al. (2017). Next, the approaches are evaluated on real-world datasets for 3D shape segmentation (Carrière et al. 2017), activity recognition in 3D motion capture data (Ali et al. 2007), geometry-informed material recognition (DeGol et al. 2016), classification of social network graphs (Hofer et al. 2017) and analysis of 3D surface texture ). The datasets are described in detail in the following sections. Where possible, we have used pre-computed PDs available with the datasets to foster reproducibility and comparability. As the computation times for some of the considered methods, especially for kernel-based approaches, do not scale well with the sizes of datasets, we have decided to randomly sub-sample some of the datasets (see details below).

Synthetic dataset
The first dataset is a synthetic dataset introduced by Adams et al. (2017). It consists of seven shape classes represented by point clouds in ℝ 3 of the following geometrical objects: unit cube, circle of diameter one, sphere of diameter one, three clusters with centers randomly chosen from unit cube, hierarchical structure of three minor clusters within three major clusters (where the centers of the minor clusters are chosen as small perturbations from the major cluster centers), and a torus (see Fig. 3 for example shapes). Each point cloud is randomly perturbed by positioning a Gaussian distribution of standard deviation 0.1 at this point and sampling novel points from the distribution. Overall, this dataset contains 50 point clouds for each of the six classes, each containing 500 3D points. This gives 300 point clouds in total.
From each point cloud, we compute the PDs in dimension 1 for a Vietoris-Rips filtration for a radius parameter equal to the maximal distance between points in the point cloud. 2 We employ the approximation method proposed by Dey et al. (2016) and the SimBa implementation based on the work of Dayu Shi. 3

Geometry-informed material recognition dataset (GeoMat)
The GeoMat dataset provides geometry information (point clouds) as well as visual images of 19 different materials, such as "brick", "grass" and "gravel" (DeGol et al. 2016). The GeoMat dataset contains patches sampled from larger photographs of surfaces from buildings and grounds. Each patch predominantly represents only one material, while each class consists of 600 images, each of size 100 × 100 pixels. Among them, there are pictures of different scales, i.e. 100 × 100 , 200 × 200 , 400 × 400 and 800 × 800.
For each patch, the dataset provides a depth image, 4 containing the local (fine-grained) surface texture and the global surface curvature. To filter out the global curvature, we transform each depth image into a point cloud in 3D space, consisting of 10, 000 points (every point represents one of the 100 × 100 pixels). Then, the resulting point cloud is rotated in a way that the Z axis represents depth and the global surface curvature is removed by fitting a second degree function (paraboloid) to the point cloud and subtracting the approximated values from the Z coordinates of the original points. The values of the Z-coordinates are  Fig. 3 Example shapes from the six shape classes of the synthetic dataset 2 We use Vietoris-Rips complexes to be consistent with the previous works. However, it should be noted that the alpha complex is a more optimal choice to compute persistent homology for a collection of points in low dimensional Euclidean space. 3 http://web.cse.ohio-state .edu/~dey.8/SimBa /Simba .html, last visited September, 2019. 4 Source: http://web.engr.illin ois.edu/~degol 2/pages /MatRe c_CVPR1 6.html, last visited September, 2019. then centered at 0 and the point cloud is projected back into a bitmap (depth map). Ultimately, PDs are computed by gray-scale filtration.

Social network graph datasets (Reddit)
To extend the range of different data types in our evaluation, we further incorporate graphbased datasets. To this end we employ the reddit-5k and reddit-12k datasets from Yanardag and Vishwanathan (2015), which contain discussion graphs from the reddit platform. 5 Nodes in the graphs correspond to users, and edges between users exist if one user has commented a posting of the other user. Different graphs are labeled by subreddits, which refer to different topics. The dataset reddit-5k contains overall 4999 graphs for 5 popular subreddits. The larger dataset, reddit-12k, contains 11929 graphs for 11 subreddits including topics like, e.g. "worldnews", "videos" and "atheism". The task for both datasets is to predict the subreddit (topic) from the input graph. For both datasets we use the pre-computed PDs available online. 6 They are obtained using filtration based on the vertex degree, that is, the number of edges incident to a vertex. More precisely, concerning vertices V, edges E, and vertex v ∈ V , deg(v) denote the number of edges in E that contain v. The filtration f (v) = deg(v) on vertices can be then extended to edges.That yields nontrivial persistent homology in dimension zero and one, where all one-dimensional classes are essential.

3D surface texture dataset (PetroSurf3D)
A further dataset in our experiments is the recently released PetroSurf3D dataset, which contains high-resolution 3D surface reconstructions from the archaeological domain with a resolution of approximately 0.1 mm (Poier et al. 2017). The reconstructions represent 26 natural rock surfaces that exhibit human-made engravings (so-called rock art), and thereby exhibit complex 3D surface textures. The classification task for PetroSurf3D is to automatically predict which areas of the surface have been manipulated by tools (engravings) and which have not, i.e. there are two classes of surface topographies: engraved areas and the natural rock surface. Engraved areas represent approximately 19% of the data. For each surface, a precise pixel-accurate ground truth exists together with a depth map of the surface. The depth maps are analyzed in a patch-wise manner. Overall, there are 754.386 square patches to classify from 26 surfaces. In order to keep the number of training samples in a practical range, we randomly subsampled each surface. Overall, a balanced set (equal class cardinalities) of 600 patches per surface ( 26 * 600 = 15,600 samples) is used in each repetition of an experiment.
For each patch, a PD is computed by grayscale filtration over the surface depth ranges (depth maps) as a basis for our experiments. To normalize the values for different shaped surfaces, the depth value range is z-standardized before filtration.

3D shape segmentation dataset
We further employ the 3D shape dataset from Chen et al. (2009) which was preprocessed by Carrière et al. (2015) for topological data analysis. The preprocessed dataset contains PDs for 5700 3D points from airplane models. Each point is assigned to one sub-part (segment) of an airplane, e.g., 'wing', 'vertical stabilizer' and 'horizontal stabilizer'. For our experiments we use the PDs computed by Carrière et al. (2015), available in their repository. 7 The PDs were generated by tracking topology evolution of a geodesic ball centered at the individual points of the input 3D model. Thereby, the radius grows from 0 to infinity. We focus on PDs of dimension 1 as the considered 3D shapes are connected. The task is to classify each point according to the segment it belongs to.

Motion capture dataset
Another real-world dataset represents 3-dimensional motion capture sequences of body joints (Ali et al. 2007). The dataset describes the following five activities: dancing, jumping, running, sitting and walking with 31, 14, 30, 35 and 48 instances, respectively. For each activity, a set of 19 3D motion trajectories (each corresponding to the motion of one tracked joint) is extracted. This corresponds to 3 ⋅ 19 = 57 curves of individual (x, y, and z) components for which 57 separate PDs are computed by Ali et al. (2007). For our experiments we employ the original pre-computed PDs. 8 Experiments with this dataset are performed only on the vectorized representations. For kernel-based approaches we would have to compute 57 full kernel matrices, which is computationally expensive and would further require an adequate method for the combination of the kernels. For vectorized representations, the proceedure is much more efficient and straight-forward. We simply compute one vector per PD and concatenate them into a final feature vector for classification. For the BoW approaches, we compute 57 codebooks, one for each 3D motion component, and concatenate the corresponding codeword histograms. In case of the Riemannian manifold representation, RM Anirudh et al. (2016), we generate vectorial features by PCA and concatenate them as proposed by the authors. We also use original procedure for PI (Adams et al. 2017).

Compared approaches
We compare our bag-of-word approaches with both kernel-based techniques and vectorized representations. Kernel-based approaches include: 2-Wasserstein distance, 9 2Wd (Kerber et al. 2017.), the multi-scale kernel, 10 MK (Reininghaus et al. 2015), and sliced Wasserstein kernel, 11 SWK (Carrière et al. 2017). Furthermore, we employ the persistence landscape 12 , 13 (PL) representation and generate a kernel matrix by the distance metric defined in Bubenik (2015). Vectorized PD representations include: persistence image 14 , PI Adams et al. (2017) and the Riemannian manifold approach, 15 RM Anirudh et al. (2016). The original PI implementation is rather inefficient since it takes into account the exact location of all birth-persistence points when calculating the values of PI. Therefore, we additionally perform experiments with an approximated unstable version of PI (referred to as approxPI in the following), which applies the Gaussian filter to 2D histogram of birthpersistence points. 16 We refrained from incorporating descriptors composed of simple topological statistics (such as minimum, maximum, and average of birth, death, and persistence) because, according to our previous research , they lack sensitivity and are easily outperformed by more sophisticated approaches like persistence images.

Setup
For all datasets, except Reddit, in Sect. 4.1 we consider the PDs of dimension 1 as a common input (cycles) since they best express the internal structure in the data and yielded the most promising results in related works (Adams et al. 2017;Carrière et al. 2015). In case of Reddit database we use PDs of dimension 0 (connected components), since graphs are considered as 1-complex; thus, first dimensional homology generators never die. In the considered datasets no infinite intervals of dimension 1 occur. In cases where infinite intervals are present, there are different ways to proceed: (1) ignoring them, (2) substituting infinity with some (large) number or (3) building separate representations for finite and infinite intervals. In the general case, we recommend to compute persistence codebooks for PDs of all available dimensions separately and to combine them before classification.
The classification pipeline is as follows. For the kernel-based approaches, we take the PDs as input and compute the explicit kernel matrices for the training and test samples. Next, we train an SVM from the explicitly computed kernel-matrix and evaluate it on the test set. For the vectorized representations we compute the respective feature vectors from the PDs and feed them into a linear SVM for training. This procedure allows direct comparison between kernel-based approaches and vectorized representations.
Ultimately, we run a Wilcoxon signed-rank test (with p value of 0.1) on the results to identify which results significantly differ from the best obtained result, and which ones do not, and can thus be considered equally good. The comparison is performed between the best method (the one with the best mean accuracy) and all the other methods, for each experiment separately. The mean accuracy is obtained as an average over 5 runs with the same train/test divisions used by all compared methods. The number of repetitions is relatively small for statistical tests, therefore we set p value to 0.1.
The entire code of all experiments (implemented in Matlab) is available at https ://githu b.com/bziiu j/pcode books . For external approaches, we use the publicly available implementations of the original authors. For clustering and bag-of-words encoding, we employ the VLFeat library (Vedaldi and Fulkerson 2008). Table 2 summarizes the results obtained in our experiments for EXP-A and EXP-B. For each combination of dataset and approach, we provide the obtained classification accuracy (including the standard deviation) and the processing time needed to construct the representations (excluding the time for classification). Note that for the synthetic dataset and the 3D shape segmentation dataset, results of EXP-A and EXP-B are equal, as no subsampling was needed to perform EXP-A.
Large differences exist in the processing times of the different approaches. The highest runtimes are obtained for the kernel-based approaches going up to 63k seconds for the PetroSurf3D dataset. The slowest kernel is 2Wd followed by MK, and approximately one order of magnitude faster PL and SWK. Note that computation complexity depends not only on a number of persistence diagrams, but is also highly affected by the average number of points per diagram. For the vectorized approaches, PI takes longest to compute. The runtimes, however, vary strongly, depending on the resolution of the employed PI (note that we have estimated the optimal parameters for each dataset by a grid search over all hyperparameters, see Tables 4 and 5). The RM representation is one to two magnitudes faster than PI. 17 The proposed approaches outperform almost all evaluated approaches in runtime for all datasets, both for EXP-A and EXP-B. The gain in runtime efficiency ranges from one to up to four orders of magnitude. For the largest dataset in the experiments (Petr-Surf3D in EXP-B), the fastest (PBoW) and the slowest (sPBoW) codebook approaches are still 3 and 2 magnitudes faster than PI, while reaching comparable accuracy. Competitive runtimes can be achieved by the approximated version of PI (approxPI). However, this leads to a slight drop in performance for most datasets compared to the exact (and stable) implementation of PI. Moreover, we observed that approxPI performs much worse than our approaches in the case of 3D Shape Segm., the largest database of EXP-A. Detailed investigation on this performance revealed that it is primarily dominated by kernel computations (reported in EXP-A due to comparison with kernel methods). It is expected, as the size of approxPI in the experiment mentioned above is 4900, while the size of sPBoW is only 70. Concerning EXP-A, the codebook approaches are comparable in computing time, which is due to the small size of the datasets. EXP-B demonstrates well how the different approaches scale to larger data. It shows that PBoW scales best (is fastest) and still obtains optimal results in all but one case. It thus represents the best tradeoff between time efficiency and classification accuracy. See Sect. 5.2 for further discussion.
From our experiments we conclude that persistence codebooks are significantly faster than most approaches while achieving similar or even better performance level. This shows that the codebooks capture well the essential information contained in the PDs and important for the respective classification tasks. The variablity of runtimes between the different codebook variants is low compared to the other approaches. Thus, for the selection of the appropriate codebook approach for a given problem in practice, the runtime plays a secondary role.
In the following sections, we analyze selected aspects of the novel representations in greater detail, such as runtime, dependency on parameters and the scalability of the approach to large number of input PDs.

Accuracy versus codebook size and weighted sub-sampling
The most important parameter for codebook-based representations is the codebook size N, i.e. the number of clusters. There is no commonly agreed analytic method to estimate the optimal codebook size; thus, the estimation is usually performed empirically. To investigate the sensitivity of codebook approaches and their performance on the codebook size, each approach was evaluated for a sequence of N values (see Tables 4, 5). The results are presented in Fig. 4, both without (left column) and with weighted sub-sampling (right column) of the consolidated PD. We can observe that all three variants of PBoW (PBoW, wPBoW and sPBoW) reach optimal performance at the level of about 50 words in a codebook (or earlier), a further increase of codebook size does not necessarily further improve its efficiency. In some cases, there is a slight improvement (synthetic data), in other cases performance goes down slightly (GeoMat) or remains a the same level. This shows that the codebook size is not rather insensitive parameter, once a certain minimum size is surpassed.
The remaining approaches (PVLAD, sPVLAD and PFV) show a general tendency to achieve their best performance early, with codebooks containing less than 40 words, and after that the accuracy drops substantially. The most prominent example of this behavior is depicted by the PVLAD method. It is caused by the fact that in cases where there is just a few clusters, it is simpler to capture well both the zeroth and the first moments, because clusters occupy large regions. However, once the number of clusters gets larger, cluster size shrinks and the assignment gets unstable. Figure 4 further shows the effect of weighting during sub-sampling for codebook generation. This can be best observed from the performance curves of sPBoW, where the effect is largest. For 3D shapes, weighting leads to a dramatic improvement in accuracy. For for synthetic data, GeoMat and PetroSurf3D, there is also a moderate improvement in performance. Only for Reddit5K, weighted subsampling degrades performance. For other methods, however, the introduction of weighted subsampling does yield an improvement on the Reddit5K experiment (see e.g. PBoW, wPBoW and sPVLAD). In the majority of cases, weighted subsampling has a positive impact on performance. For PVLAD, weighted subsampling even seems to compensate for weaknesses of the representation in situations where codebook sizes are large.
Overall, we conclude, that optimal and universal choice for codebook size is about 50 in case of PBoW, wPBoW and sPBoW; while for the remaining methods, 20 words seems to be sufficient. These values are thus good starting points for hyperparameter optimization on other datasets. The choice of weighted vs. non-weighted subsampling seems to be dataset dependent. For 3D shapes, for example, strong performance gains are achieved. For the other datasets the trend is not so clear. Tables 2 and 3 show that our approaches achieve comparable performance on almost all of the evaluated datasets and partly even outperform the compared approaches. Additionally, they beat all methods in speed. While the above tables show results for the optimal parameters (from the classification accuracy point of view), we decided to analyze the relationship between accuracy and computation time. For this purpose, we use PI and approxPI as references for comparison, because they represent the strongest competitors (in the sense of accuracy) of the proposed representations.

Accuracy versus time
In Fig. 5, we plot accuracy vs. time for the proposed approaches and PI for all datasets from EXP-B. We decided to focus on EXP-B here, because it operates on larger datasets than EXP-A and is thus better suited to study runtime efficienty. We vary the parameters with most influence placed on runtime (codebook size for persistent codebooks as well as the resolution of PI and approxPI) according to the values provided in Table 5. This directly influences the output dimension of the representation and is reflected by the area of the circles in Fig. 5, i.e. larger diameter means higher dimension. Note that experiments with codebooks and approxPI were performed on 1 CPU, while experiments on PI were performed in parallel on 8 CPUs. Therefore, the total runtime differences are in fact even larger than depicted. For more compact visualization (and avoiding a logarithmic scale which would compress too much), we decided not to take the number of CPUs into account for plotting. We can see clearly that the runtime of PI is always significally larger than any  Table 5). Note that the actual times of computation for the construction of the representations are presented. Codebooks and approxPI were computed on 1 CPU, while PI was constructed by using 8 CPUs in parallel. Moreover, SVM training and prediction time was not taken into consideration. This would further increase computational times, especially for PI and approxPI due to their larger dimension. The bottom-right plot shows results for codebook approaches and approxPI (but skipping PI) in the case of GeoMat dataset for a narrower range of x-axis (for better visibility) codebook representation. The accuracy obtained varies. For all datasets the performance level of PI is reached (or even superseeded) much quicker. In the case of GeoMat dataset, codebooks clearly outperform PI (while consuming much less time); and in case of the other experiments, they quickly achieve a similar performance level. The computational cost of achieving a higher performance with PI is over-proportionally high, while the performance gain is actually rather limited (approx. +1% ). Interestingly, approxPI strongly outperforms PI and can compete well with our representations in terms of runtime. The computation time of approxPI is in the same order of magnitude as that of the persistence codebook representation. However, the performance is in most cases lower as can be seen in Fig. 5 where pink dots represent approxPI. Note that for better visibility, we provide a wider and a narrower range of x-axis for the performance comparison of GeoMat dataset (last row of Fig. 5) because the latter allows easier comparison between the codebook approaches. Among our methods, PBoW is the fastest. We observe that runtimes of PBoW, wPBoW and PVLAD are almost not affected by codebook size. The other approaches, i.e. sPBoW, sPVLAD and PFV, that involve the computation of Gaussians, clearly require much more time when codebook size is increased. However, as observed before, larger codebook sizes are not necessarily required to obtain good accuracy, which mitigates the situation. approxPI is equally fast as PBoW but is not able to achieve the same level of performance.

Time versus dataset size
To investigate the runtime behavior of the proposed approaches in more detail, we evaluate how they scale to increasing dataset sizes (i.e. increasing numbers of input PDs). To this end, we employ the largest dataset in our experiments (PetroSurf3D) and randomly sample different numbers of PDs, starting from 1000 to 10, 000 in steps of 1000. To get a detailed breakdown of computation time, we separately measure the time needed for codebook generation, histogram assignment, and classification. The computation of the PDs is the same for all approaches, and thus is not included in this breakdown.
From the results presented in Fig. 6, we conclude that runtime grows almost linearly with dataset size. For the approaches without weighted subsampling (upper part in Fig. 6), most of the computation time is spent on histogram assignment and classification. Histogram assignment takes more time for more complex encoding methods, such as PVLAD and PFV. In case of PBoW, histogram assignment is particularly fast because of k-d trees being used (Bentley 1975). For sPBoW, Gaussian likelihood has to be computed, which slows down the computation. Assignment time, however, grows linearly with dataset size. Classification time takes the major part for PVLAD and PFV. This is due to the fact that the computational complexity of both, primal and dual SVM optimization, depends on dimensionality (Chapelle 2007), which is higher in case of PVLAD and PFV. The distribution of computation times is similar for the persistence codebook approaches with weighted subsampling (lower part in Fig. 6).

Qualitative analysis
In this section, we investigate PBoW with a special focus on its discriminative abilities. For this purpose, we employ the synthetic dataset as a proof-of-concept and GeoMat (for which we outperform other representations by a large margin) to investigate how this performance increase is achieved by PBoW compared to related approaches.

Synthetic dataset
We compute PBoW with N = 20 clusters for the synthetic dataset and visually analyze the codeword histograms obtained by (hard) assignment. To this end, for each of the six shape classes, we compute the average codebook histogram (over all samples of each class) to obtain one representative PBoW vector per class. The averaged PBoW histograms for all classes are presented in Fig. 7. Instead of only providing the histograms themselves, for each codeword of the histogram we plot the corresponding cluster center as a circle in the original birth-persistence domain and encode the number of assigned codeworks (the actual values of the histograms) in the area of the circles, i.e. the larger the count for a cluster, the larger the circle. The advantage of this representation is that the spatial distribution of the codewords in the PD is preserved.
From Fig. 7 we can see that, except for the classes "random cloud" and "sphere" (which are difficult to differentiate), all classes generate strongly different cluster distributions. Class "circle", for example, uniquely activates four clusters with strong persistence (topleft corner) and the "torus" class distributes its corresponding code words across a large number of clusters representing less persistent components. Figure 7 further illustrates an important property of persistence bag-of-words, namely its sparse nature. More specifically, areas with no points in the consolidated persistence diagram will contain no codewords (clusters). In Fig. 7, for example, no codeword is Fig. 7 Average codebook histograms computed for each of the six shape classes of the synthetic dataset. The cluster center of each codeword is presented as a circle in the birth-persistence domain. The area of the circles reflects the histogram values of the specific class. For all classes, the same codebook (same clustering) is employed; thus, dot locations are the same on all plots. The differences between the circles reflect the class differences obtained in the upper-right quadrant of the diagram, since no components are located there for the underlying data. Therefore, these unimportant areas are neglected and not encoded into the final representation. This not only reduces the dimension of the final representation, but further makes the representation adaptive to the underlying data. This in turn increases the information density in the obtained representation.

GeoMat dataset
We further investigate the performance on the GeoMat dataset to explain why (s)PBoW outperforms PI, approxPI and RM by such a large margin (see Table 2). To this end, we generate confusion matrices for PI and PBoW (see Fig. 8) to investigate their discriminative abilities. The matrices show that PBoW, for example, achieves better discrimination between classes "cement smooth" and "concrete cast-in-place" (i.e. classes 4 and 5). Average PBoW histograms for those classes are shown in Fig. 9. The histograms are on the first sight similar (upper row in Fig. 9). However, by zooming-in towards the birthpersistence plane in Fig. 9 (bottom row), differences become better visible. The plots in the center illustrate the difference between the class distributions (red color means left class is stronger, blue means right class is stronger for this cluster). The classes differ by finegrained spatial differences. The set of three blue points around birth time of 0 (which are characteristic for class "concrete cast-in-place") surrounded by red points (which are characteristic for class "cement smooth") illustrates this well (see lower central plot). For the Fig. 8 Confusion matrix for PI (left) and PBoW (right) on the GeoMat dataset from EXP-A. From the diagonal of the matrices we can see that PBoW outperforms PI for many classes (e.g. classes 2-5, 9 and 12). Furthermore, there are less confusions (off-diagonal values) for PBoW discrimination of these two classes, a particularly fine-grained codebook with many clusters is needed. PI (and approxPI) have problems with such fine-grained structures, because due to its limited resolution, all topological components in the most discriminative area would most likely fall into one PI-pixel. Therefore, an extraordinary high resolution would be necessary to capture the discriminative patterns between those two classes. The bag-ofwords model makes our approaches independent of the resolution and enables to capture even fine differences adaptively and in an unsupervised way. In Fig. 10 we show a similar comparison for classes "brick" and "concrete cast-in-place" (i.e. classes 2 and 5).

Fig. 9
Comparison of averaged PBoW histograms for class "cement smooth" (left, red) and "concrete castin-place" (right, blue) from GeoMat dataset (top row: total view; bottom row is zoomed in). The plot in the center shows the difference between the classes, where red color means that the left class has stronger support for this cluster and blue means that the right class has stronger support. The classes differ by finegrained spatial differences, which are not distinguishable in other vectorized representations. (Color figure online)

Conclusion
We have introduced the concept of persistence codebooks, a novel fixed-length vectorial representation for persistence diagrams. Persistence codebooks employ bag-ofwords encodings to quantize the persistence diagram into a vectorized representation. We propose different types of encodings (based on traditional bag-of-words, VLAD and Fisher Vectors), investigate their theoretic properties, such as their stability with respect to 1-Wasserstein, and introduce robust variants of the representations. Experiments on seven heterogeneous datasets show that they consistently achieve comparable performance to related methods, and partly even outperform them, with significantly shorter computation time. Though there is no overall winner among the introduced representations, we conclude that PFV is a powerful representation, as it achieves peak Fig. 10 Comparison of averaged PBoW histograms for classes "brick" (left, red) and "concrete cast-inplace" (right, blue) from GeoMat dataset (top row: total view; 2nd row: zoomed in view; 3rd row: even further zoomed in view). The plot in the center shows the difference between the classes, where red color means that the left class has stronger support for this cluster and blue means that the right class has stronger support. The classes differ by fine-grained spatial differences, which are not distinguishable in other vectorized representations. (Color figure online) performance over all evaluated datasets. It is followed by the PBoW variants, which also consistently achieve peak performance (but not for all datasets). PVLAD cannot compete with the other representations in our experiments and is thus less recommended. Moreover, we observe a certain tradeoff between computation time and accuracy when comparing stable and non-stable representations. Unstable representations like approxPI and PBoW are particularly fast. However, in general, they cannot compete with stable descriptors like PFV in terms of accuracy.
The novel representations have both attractive theoretic properties as well as practical properties, i.e. compactness, expressiveness, as well as the ability to adapt to the inherent sparsity of persistence diagrams. They can be constructed in a completely unsupervised fashion and achieve a high discriminativity compared to related approaches. The high computational efficiency of persistence codebooks could in future facilitate the application of TDA to larger datasets than possible today and enable real-time applications.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.

Appendix: Background on persistent homology
In this section we present basic introduction to persistent homology. Please consult (Edelsbrunner and Harer 2010;Edelsbrunner et al. 2002;Zomorodian and Carlsson 2005) for more information.
Topological spaces are typically infinite objects and, for the sake of data analysis, they have to be finitely represented by simplified objects called cell complexes. Cell complexes are build from cells: topologically simple objects having the property that an intersection of every pair of cells is either empty, or contains yet another cell in the cell complex.
A simplicial complex is a particular instance of a general cell complex. It is a natural tool in the study of multi-dimensional point cloud data. Cells of simplicial complex are called simplices and, in this particular case, are formed with convex hulls of collections of nearby points in the point cloud. Simplices are uniquely characterized by a collection of points involved in their convex hulls. A simplicial complex X needs to satisfy the following property: for every pair of simplices , ∈ X , ∩ is either empty or a simplex in X.
Given a point cloud X with a distance or a similarity measure d and a parameter r > 0 , one can define a Vietoris-Rips complex VR(X, r). It is a simplicial complex whose every simpliex = {v 0 , v 1 , … , v k } satisfies d(v i , v j ) ≤ r for every i, j ∈ {0, … , k} . For every simplex ∈ VR(X, r) , one can define a diameter of being the largest distance between the points in . This gives a natural ordering of simplices in VR(X, r): primarily by diameter of simplices and secondarily (when diameters of two simplices are the same) by inverse of the number of points in simplices 18 . It is easy to see that every prefix of such an ordering forms a simplicial complex, and therefore any increasing sequence of numbers 0 < r 1 < r 2 < … < r n yields a nested sequence of simplicial complexes: Another typical scenario when such a nested sequence of cell complexes arises is the case of values of a function f discretized on a grid G. The function f ∶ G → ℝ is typically an output of some numerical method. The grid G naturally corresponds to cubical complex G , and the function f provides an ordering of maximal cubes in the complex. This ordering induces a nested sequence of cubical complex, very much like a nested sequence of Vietoris-Rips complexes discussed above.
To cover those and other possible cases, later in this section we will focus on a general case of filtered cell complex: keeping in mind that most typically it will come from a point cloud, or numerical simulations on a grid.
Having a complex C i in the filtration, one can define its homology, H(C i ) . Rather than providing a formal definition, which can be found in Edelsbrunner and Harer (2010), we will focus on the intuitive understanding of the concept. Homology in dimension 0 measures number of connected components. In dimension 1 it measures the cycles, which do not bound to a (deformed) surface. In dimension 2 it corresponds to voids, i.e. regions of space totally bounded by a collection of triangles (very much like a ball bounds the void inside it). The idea of a cycle bounding a hole in the complex can be formalized using homology theory for arbitrary dimension.
Persistent homology measures the evolution of homology for the constitutive complexes in filtration. Once more and more cells are being added to a complex C i , new connected components or cycles may appear, old ones may become trivial or become identical (homologous) to others created earlier. For every connected component or a cycle, � ⊂ X = VR(X, 0) ⊂ VR(X, r 1 ) ⊂ VR(X, r 2 ) ⊂ ⋯ ⊂ VR(X, r n ) � = C 0 ⊂ C 1 ⊂ ⋯ ⊂ C n = C, Fig. 11 Various stages of a construction of a Vietoris-Rips complex for eight points sampled from a circle. Initially, for sufficiently small radius, only vertices are present in the complex. Gradually, more and more edges along with higher dimensional simplices of an increasing diameter are added. In all but the initial and final stage of the construction the topology of a circle is visible, and therefore will be recovered by PH in dimension one (depicted by the long bar below the picture) there are two important characteristics we will store: the first moment b, referred to as a birth time, when it appears in the filtration, and the last moment d, referred to as death time, when it either becomes trivial or becomes identical to other cycle created earlier.
In this paper, instead of a standard birth-death summaries of persistent homology, we use birth-persistence coordinates, which can be obtained by the [b, d] → [b, d − b] transformation. The basic geometrical idea behind PH is presented in Fig. 11.
A couple of assumptions about PDs are made. Firstly, as our aim is to perform computations, we assume that persistence diagrams consist of finitely many points of nonzero persistence. Secondly, PDs may also contain infinite intervals that correspond to the so-called essential classes, i.e. the cycles that are born but never die. Those infinite intervals need to be processed prior to the computations. There are at least three strategies one can apply: 1. to ignore infinite intervals and use only the finite ones for consideration; 2. to substitute +∞ in the death coordinates of the essential classes with a number N chosen by the user (a logical choice would be a number which is larger than a filtration value of any cell in the considered complex); 3. to build a pair of descriptors: one for finite, and one for infinite intervals and use them together as a final descriptor.
Given the available options, in the numerical experiments presented in this paper, we have chosen the simplest one, i.e. to ignore the infinite intervals. There are various classical metrics used to compare persistence diagrams (Edelsbrunner and Harer 2010). We will review them here, as they are essential in the study of stability of the presented representations. Note that the presentation is a bit non standard, as we are working on birth-peristence coordinates. Given two diagrams B and B ′ , we construct a matching ∶ B → B � assuming that points can also be matched to y = 0 axis. Putting B and B ′ in the same diagram, one can visualize a matching by drawing a line segment between x ∈ B and (x) (note that (x) is either in B ′ , or is a projection of x to its first coordinate). Given all the line segments, for each matching we can store the longest one, or a sum of lengths of all of them (raised to power q). Taking the minimum over all possible matchings of the obtained numbers will yield the bottleneck distance in the first case, and the Wasserstein distance (raised to power 1 q ) in the second case. More formally: Definition q-Wasserstein distance between two persistence diagrams B, B � ∈ D is defined as: In particular: An important feature of persistent homology is its stability. Intuitively, it indicates that small changes in the filtration imply small changes (for instance in Wasserstein metric) in the resulting persistence diagrams. Formally: Theorem Edelsbrunner and Harer (2010) Let be a finite cell complex and f , g ∶ → ℝ filtering Lipshitz functions. Let B and B ′ be the PDs of with filtration induced by f and g respectively. Then there exist constants C and k such that W 1 (B, B � ) ≤ C||f − g|| 1−k ∞ In this paper, we show stability with respect to 1-Wasserstein distance. Combined with the stability result described above, this indicates stability of bag-of-words representations with respect to the perturbation of initial data.