
Discrete & Computational Geometry, Volume 52, Issue 1, pp 44–70

Fréchet Means for Distributions of Persistence Diagrams

  • Katharine Turner
  • Yuriy Mileyko
  • Sayan Mukherjee
  • John Harer

Abstract

Given a distribution \(\rho \) on persistence diagrams and observations \(X_{1},\ldots ,X_{n} \mathop {\sim }\limits ^{iid} \rho \), we introduce an algorithm that estimates a Fréchet mean from the set of diagrams \(X_{1},\ldots ,X_{n}\). If the underlying measure \(\rho \) is a combination of Dirac masses \(\rho = \frac{1}{m} \sum _{i=1}^{m} \delta _{Z_{i}}\), then we prove that the algorithm converges to a local minimum and establish a law of large numbers for Fréchet means computed by the algorithm from observations drawn iid from \(\rho \). We illustrate the convergence of an empirical mean computed by the algorithm to a population mean by simulations from Gaussian random fields.

Keywords

Persistence diagram · Fréchet mean · Topological data analysis · Alexandrov space · Persistent homology

1 Introduction

There has been a recent effort in topological data analysis (TDA) to incorporate ideas from stochastic modeling. Much of this work involved the study of random abstract simplicial complexes generated from stochastic processes [10, 11, 12, 14, 22, 23] and non-asymptotic bounds on the convergence or consistency of topological summaries as the number of points increases [2, 4, 6, 19, 20]. The central idea in these papers has been to study statistical properties of topological summaries of point cloud data.

In [16] it was shown that a commonly used topological summary, the persistence diagram [8], admits well-defined probability distributions and notions such as expectations, variances, percentiles and conditional probabilities. The key contribution of this paper is characterizing Fréchet means and variances of finitely many persistence diagrams and providing an algorithm for estimating them. The existence of these means and variances was shown previously; however, no procedure for computing them was provided.

In this paper we state an algorithm which, given an observed set of persistence diagrams \(X_{1},\ldots ,X_{n}\), computes a new diagram that is a local minimum of the Fréchet function of the empirical measure \(\rho _{n} := n^{-1} \sum _{i=1}^{n} \delta _{X_{i}}\). In the case where the diagrams are sampled independently and identically from a probability measure that is a finite combination of Dirac masses, we provide a (weak) law of large numbers for the local minima computed by the algorithm we propose.

2 Persistence Diagrams and Alexandrov Spaces with Curvature Bounded from Below

In this section we state properties of the space of persistence diagrams that we will use in the subsequent sections. We first define persistence diagrams and the \(L^{2}\)-Wasserstein metric on the set of persistence diagrams. Note that this is not the same metric as was used in [16]. We discuss the relation between the two metrics, and why we work with the \(L^{2}\)-Wasserstein metric, later in this section. We then show that the space of persistence diagrams is a geodesic space, and specifically an Alexandrov space with curvature bounded from below. We show that the Fréchet function on this space is semiconcave, which allows us to define supporting vectors that serve as an analog of the gradient. The supporting vectors are used in the algorithm developed in the following section to find local minima; the algorithm is a gradient descent based method.

2.1 Persistent Homology and Persistence Diagrams

Consider a topological space \(\mathbb { X}\) and a bounded continuous function \(f: \mathbb { X}\rightarrow \mathbb { R}\). For a threshold \(a\) we define sublevel sets \(\mathbb { X}_{a} = f^{-1}(-\infty ,a]\). For \(a \le b\) inclusions \(\mathbb { X}_{a} \subset \mathbb { X}_{b}\) induce homomorphisms of the homology groups of sublevel sets:
$$\begin{aligned} \mathbf {f}_\ell ^{a,b}: \mathbf {H}_\ell (\mathbb { X}_{a}) \rightarrow \mathbf {H}_\ell (\mathbb { X}_{b}) \end{aligned}$$
for each dimension \(\ell \). We assume the function \(f\) is tame, meaning that for every dimension \(\ell \) there are only finitely many values \(c\) such that \(\mathbf {f}_\ell ^{c-\delta ,c}\) is not an isomorphism for any \(\delta >0\), and that \(\mathbf {H}_\ell (\mathbb { X}_{a})\) is finitely generated for all \(a \in \mathbb { R}\). We also assume that the homology groups are defined over field coefficients, e.g. \(\mathbb {Z}_{2}\).
By the tameness assumption the image \(\mathbf {F}_\ell ^{a-,b} := \text{ Im } \mathbf {f}_\ell ^{a-\delta ,b} \subset \mathbf {H}_\ell (\mathbb { X}_{b})\) is independent of \(\delta >0\) if \(\delta \) is small enough. The quotient group
$$\begin{aligned} \mathbf {B}_\ell ^{a} = \mathbf {H}_\ell (\mathbb { X}_{a}) / \mathbf {F}_\ell ^{a-,a} \end{aligned}$$
is the cokernel of \(\mathbf {f}_\ell ^{a-\delta ,a}\) and captures homology classes which did not exist in sublevel sets preceding \(\mathbb { X}_{a}\). This group is called the \(\ell \)-th birth group at \(\mathbb { X}_{a}\) and we say that a homology class \(\alpha \in \mathbf {H}_\ell (\mathbb { X}_{a})\) is born at \(\mathbb { X}_{a}\) if its projection onto \(\mathbf {B}_\ell ^{a}\) is nontrivial.
Consider the map
$$\begin{aligned} \mathbf {g}_\ell ^{a,b}: \mathbf {B}_\ell ^a \rightarrow \mathbf {H}_\ell (\mathbb { X}_{b}) / \mathbf {F}_\ell ^{a-,b} \end{aligned}$$
and denote its kernel as \(\mathbf {D}_\ell ^{a,b}\). The kernel captures homology classes that were born at \(\mathbb { X}_{a}\) but at \(\mathbb { X}_{b}\) are homologous to homology classes born before \(\mathbb { X}_{a}\). We say that a homology class \(\alpha \in \mathbf {H}_\ell (\mathbb { X}_{a})\) that was born at \(\mathbb { X}_{a}\) dies entering \(\mathbb { X}_{b}\) if its projection onto \(\mathbf {D}_\ell ^{a,b}\) is nontrivial but its projection onto \(\mathbf {D}_{\ell }^{a,b-\delta }\) is trivial for all sufficiently small \(\delta >0\). We also call \(b\) a degree-\(r\) death value of \(\mathbf {B}_\ell ^a\) if \(\mathrm {rank}\,\mathbf {D}_\ell ^{a,b}-\mathrm {rank}\,\mathbf {D}_\ell ^{a,b-\delta }=r>0\) for all sufficiently small \(\delta >0\).

If a homology class \(\alpha \) is born at \(\mathbb { X}_{a}\) and dies entering \(\mathbb { X}_{b}\) we set \(\mathrm {b}(\alpha ) = a\) and \(\mathrm {d}(\alpha ) = b\). We represent the births and deaths of \(\ell \)-dimensional homology classes by a multiset of points in \(\mathbb { R}^{2}\), with the horizontal axis corresponding to the birth of a class, the vertical axis corresponding to the death of a class, and the multiplicity of a point being the degree of the death value. The idea of a persistence diagram is to consider a basis of persistent homology classes \(\{\alpha \}\) and to represent each persistent homology class \(\alpha \) by the point \((\mathrm {b}(\alpha ), \mathrm {d}(\alpha ))\).

The persistence of \(\alpha \) is the difference \(\text{ pers }(\alpha ) = \mathrm {d}(\alpha ) - \mathrm {b}(\alpha )\). In the general setting we could have classes with infinite persistence, corresponding to points of the form \((-\infty , y)\) or \((x,\infty )\). These points are infinitely far from all points of finite persistence and hence would have to be treated separately; the space of persistence diagrams would be forced to be disconnected, with components indexed by the number of points at infinity. For the sake of clarity we restrict ourselves to the case where all classes have finite persistence. This can be achieved by considering extended persistence, but for simplicity we can simply kill everything by setting \(\mathbf {g}_\ell ^{a,b} = 0\) if \(b \ge \sup _{x \in \mathbb { X}} f(x)\).

After establishing some notation we can define persistence diagrams and the distance between two diagrams. Let \(\Delta =\{(x,y)\in \mathbb { R}^{2} \mid x=y\}\) be the diagonal in \(\mathbb { R}^{2}\). Let \(\Vert x-y \Vert \) be the usual Euclidean distance if \(x\) and \(y\) are off diagonal points. With a slight abuse of notation let \(\Vert x-\Delta \Vert \) denote the perpendicular distance between \(x\) and the diagonal and \(\Vert \Delta -\Delta \Vert =0\).
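For concreteness, the perpendicular distance \(\Vert x-\Delta \Vert \) can be computed directly; a minimal Python sketch (the function name is ours):

```python
import math

def dist_to_diagonal(p):
    """Perpendicular distance from p = (x, y) to the diagonal x = y.

    The nearest diagonal point is the projection ((x+y)/2, (x+y)/2),
    so the distance is |y - x| / sqrt(2)."""
    return abs(p[1] - p[0]) / math.sqrt(2.0)
```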

Definition 2.1

A persistence diagram is a countable multiset of points in \(\mathbb { R}^{2}\) along with the infinitely many copies of the diagonal \(\Delta =\{(x,y)\in \mathbb { R}^{2} \mid x=y\}\). We also require for the countably many points \(x_{j}\in \mathbb { R}^{2}\) not lying on the diagonal that \(\sum _{j} \Vert x_{j}-\Delta \Vert <\infty \).

Each point \(p=(a,b)\) in a persistence diagram corresponds to some homology class \(\alpha \) with \(\mathrm {b}(\alpha )=a\) and \(\mathrm {d}(\alpha )=b\). As a slight abuse of notation we say that \(p\) is born at \(\mathrm {b}(p):=\mathrm {b}(\alpha )\) and dies at \(\mathrm {d}(p):=\mathrm {d}(\alpha )\).

We denote the set of all persistence diagrams by \(\mathcal {D}\). One metric on \(\mathcal {D}\) is the \(L^{2}\)-Wasserstein metric
$$\begin{aligned} d_{L^{2}}(X,Y)^{2} = \inf _{\phi :X \rightarrow Y} \sum _{x\in X} \Vert x-\phi (x)\Vert ^{2}. \end{aligned}$$
(1)
Here we consider all the possible bijections \(\phi \) between the off diagonal points and copies of the diagonal in \(X\) and the off diagonal points and copies of the diagonal in \(Y\). Bijections always exist as any point can be paired to the diagonal. We will call a bijection optimal if it achieves this infimum.
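For small diagrams the infimum in (1) can be computed by brute force, enumerating all bijections after padding each side with copies of the diagonal. A Python sketch for illustration only (practical implementations use the Hungarian algorithm of Sect. 3); the function names are ours:

```python
import itertools
import math

def diag_proj(p):
    # Perpendicular projection of p onto the diagonal x = y.
    m = (p[0] + p[1]) / 2.0
    return (m, m)

def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def d_L2_squared(X, Y):
    """Squared L2-Wasserstein distance between two finite diagrams,
    given as lists of off-diagonal points.  Slots 0..n-1 / 0..m-1 are
    the points of X / Y; the remaining slots are diagonal copies, so
    every point may be matched either to a point or to the diagonal."""
    n, m = len(X), len(Y)
    best = math.inf
    for perm in itertools.permutations(range(n + m)):
        cost = 0.0
        for i, j in enumerate(perm):
            if i < n and j < m:      # point of X matched to point of Y
                cost += sq_dist(X[i], Y[j])
            elif i < n:              # point of X matched to the diagonal
                cost += sq_dist(X[i], diag_proj(X[i]))
            elif j < m:              # point of Y matched to the diagonal
                cost += sq_dist(Y[j], diag_proj(Y[j]))
            # a diagonal copy matched to a diagonal copy costs nothing
        best = min(best, cost)
    return best
```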
In much of the computational topology literature the following \(p\)-th Wasserstein distance between two persistence diagrams, \(X\) and \(Y\), is used
$$\begin{aligned} d_{W_{p}}(X, Y)=\Big ( \inf _{\phi }\sum _{x\in X}{\Vert x-\phi (x)\Vert ^p_{\infty }} \Big )^{\frac{1}{p}}. \end{aligned}$$
In [16] the above metric was used to define the following space of persistence diagrams
$$\begin{aligned} \mathcal {D}_{p}=\{x \mid d_{W_{p}}(x,\emptyset )<\infty \}, \end{aligned}$$
with \(p \ge 1\), where \(\emptyset \) is the diagram containing only the diagonal. It was shown in [16, Theorems 6 and 10] that \(\mathcal {D}_{p}\) is a complete separable metric space on which probability measures can be defined. Given a probability measure \(\rho \) on \(\mathcal {D}_{p}\), the existence of a Fréchet mean was proven under restrictions on the space of persistence diagrams \(\mathcal {D}_{p}\) [16, Theorem 21 and Lemma 27]. The basic requirement is that \(\rho \) has a finite second moment and that its support is compact or it is concentrated on a set with compact support.
In this paper we focus on the \(L^{2}\)-Wasserstein metric since it leads to a geodesic space with some known structure. Thus we consider the space of persistence diagrams
$$\begin{aligned} \mathcal {D}_{L^{2}}=\{x \mid d_{L^{2}}(x,\emptyset )<\infty \}. \end{aligned}$$
The results stated in the previous paragraph will also hold for \(\mathcal {D}_{L^{2}}\) with metric \(d_{L^{2}}\), including existence of Fréchet means. This follows from the fact that for any \(x,y \in \mathbb { R}^{2}\)
$$\begin{aligned} \Vert x-y \Vert _\infty \le \Vert x -y \Vert _2 \le \sqrt{2} \Vert x - y \Vert _\infty , \end{aligned}$$
(2)
so \(d_{W_2}(X,Y)\le d_{L^{2}}(X,Y) \le \sqrt{2}d_{W_2}(X,Y)\). This inequality coupled with the results in [7] implies the following stability result for the \(L^{2}\)-Wasserstein distance.

Theorem 2.2

Let \(\mathbb { X}\) be a triangulable, compact metric space such that \(d_{W_k}(\text{ Diag }(h), \emptyset )^k\le C_{\mathbb { X}}\) for any tame Lipschitz function \(h:\mathbb { X}\rightarrow \mathbb { R}\) with Lipschitz constant \(1\), where \(\text{ Diag }(h)\) denotes the persistence diagram of \(h\), \(k\in [1,2)\), and \(C_{\mathbb { X}}\) is a constant depending only on the space \(\mathbb { X}\). Then for two tame Lipschitz functions \(f, g : \mathbb {X} \rightarrow \mathbb {R}\) we have
$$\begin{aligned} d_{L^{2}}(\text{ Diag }(f), \text{ Diag }(g)) \le 2^{\frac{k+2}{2}}\big [ C \Vert f - g\Vert _{\infty }^{2-k}\big ]^{\frac{1}{2}}, \end{aligned}$$
where \(C = C_{\mathbb { X}} \max \{\text{ Lip }(f)^k,\text{ Lip }(g)^k\}\).

For ease of notation in the rest of the paper we denote \(d_{L^{2}}(X,Y)^{2}\) as \(d(X,Y)^{2}\).

Proposition 2.3

For any diagrams \(X, Y\in \mathcal {D}_{L^{2}}\) the infimum in (1) is always achieved.

We prove this proposition in the Appendix.

We now show that the space of persistence diagrams with the above metric is a geodesic space. A rectifiable curve \(\gamma : [ 0, l] \rightarrow X\) is called a geodesic if it is locally minimizing and parametrized proportionally to arc length. If \(\gamma \) is also globally minimizing, then it is said to be minimal. \(\mathcal {D}_{L^{2}}\) is a geodesic space if every pair of points is connected by a minimal geodesic. Now consider diagrams \(X\) and \(Y\) and some optimal pairing \(\phi \) between the points in \(X\) and \(Y\). Let \(\gamma :[0,1]\rightarrow \mathcal {D}_{L^{2}}\) be the path from \(X\) to \(Y\) where \(\gamma (t)\) is the diagram whose points have travelled in a straight line from the point \(x\) (which can be a copy of the diagonal) toward the point \(\phi (x)\) (which can be a copy of the diagonal) for a distance of \(t\Vert x-\phi (x)\Vert \); in other words, the diagram with points \(\{(1-t)x+t\phi (x)\,|\, x\in X\}\). Then \(\gamma \) is a geodesic from \(X\) to \(Y\). The proof of this is the observation that \(\phi _t^{X}:X\rightarrow \gamma (t)\) where
$$\begin{aligned} \phi _t^{X}(x)=(1-t)x+t\phi (x) \end{aligned}$$
(3)
is optimal.
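The geodesic just described can be realized concretely: given an optimal pairing, every matched pair is interpolated linearly, with a point paired to the diagonal interpolating toward its perpendicular projection. A Python sketch under that convention (names ours):

```python
def geodesic(matching, t):
    """Diagram gamma(t) on the geodesic induced by an optimal matching.

    `matching` is a list of pairs (x, phi(x)); a point matched to the
    diagonal should be paired with its perpendicular projection onto
    x = y.  Points of gamma(t) lying on the diagonal are omitted."""
    pts = []
    for x, y in matching:
        p = ((1 - t) * x[0] + t * y[0], (1 - t) * x[1] + t * y[1])
        if abs(p[0] - p[1]) > 1e-12:   # keep only off-diagonal points
            pts.append(p)
    return pts
```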

2.2 Gradients and Supporting Vectors on \(\mathcal {D}_{L^{2}}\)

We will propose a gradient descent based algorithm to compute Fréchet means. To analyze and understand the algorithm we will need to understand the structure of \(\mathcal {D}_{L^{2}}\). We will show that \(\mathcal {D}_{L^{2}}\) is an Alexandrov space with curvature bounded from below (see [5] for more information on these spaces). This result is not so surprising since there are known relations between \(L^{2}\)-Wasserstein spaces and Alexandrov spaces with curvature bounded from below [13, 21]. The motivating idea behind these spaces was to generalize the results of Riemannian geometry to metric spaces without Riemannian structure.

The properties and behavior of Fréchet means are closely related to the curvature of the space. For metric spaces with curvature bounded from above, called \(CAT\)-spaces, properties of Fréchet means have been investigated and there exist algorithms to compute them [25]. \(\mathcal {D}_{L^{2}}\) is not a \(CAT\)-space; see Proposition 2.4. It is, however, an Alexandrov space with curvature bounded from below. Less is known about properties of Fréchet means in these spaces, and about algorithms to compute them. We use the structure of Alexandrov spaces with curvature bounded from below to compute estimates of Fréchet means and provide some analysis of these estimates. Note that Fréchet means are the same as barycenters, the term used in much of the literature.

We first confirm that \(\mathcal {D}_{L^{2}}\) is not a \(CAT\)-space.

Proposition 2.4

\(\mathcal {D}_{L^{2}}\) is not in \(\text{ CAT }(k)\) for any \(k>0\).

Proof

If \(\mathcal {D}_{L^{2}} \in \text{ CAT }(k)\) then for all \(X,Y\in \mathcal {D}_{L^{2}}\) with \(d(X,Y)^{2}<\pi ^{2}/k\) there is a unique geodesic between them [3, Proposition 2.11]. However, we can find \(X,Y\) arbitrarily close with two distinct geodesics. One example is taking \(X\) to be a diagram with two diagonally opposite corners of a square and \(Y\) a diagram with the other two corners. The horizontal and vertical paths are equally optimal and we may choose the square to be as small as we wish. \(\square \)
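The counterexample can be checked numerically: for a concrete unit square above the diagonal, the "horizontal" and "vertical" matchings have equal cost, so both induced geodesics are minimal. A Python sketch (the particular square is our choice):

```python
def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

# X holds two opposite corners of a unit square above the diagonal,
# Y holds the other two corners.
X = [(0.0, 3.0), (1.0, 4.0)]
Y = [(1.0, 3.0), (0.0, 4.0)]

cost_horizontal = sq_dist(X[0], Y[0]) + sq_dist(X[1], Y[1])
cost_vertical = sq_dist(X[0], Y[1]) + sq_dist(X[1], Y[0])
```

Both matchings cost \(2\), so they induce two distinct geodesics between diagrams that can be made arbitrarily close by shrinking the square.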

The following inequality characterizes Alexandrov spaces with curvature bounded from below by zero [21]. Given a geodesic space \(\mathbb { X}\) with metric \(d'\) for any geodesic \(\gamma :[0,1] \rightarrow \mathbb { X}\) from \(X\) to \(Y\) and any \(Z \in \mathbb { X}\)
$$\begin{aligned} d'(Z,\gamma (t))^{2} \ge t d'(Z,Y)^{2} + (1-t) d'(Z,X)^{2} -t(1-t) d'(X,Y)^{2}. \end{aligned}$$
(4)
We now show that \(\mathcal {D}_{L^{2}}\) is a non-negatively curved Alexandrov space.

Theorem 2.5

The space of persistence diagrams \(\mathcal {D}_{L^{2}}\) with metric \(d\) given in (1) is a non-negatively curved Alexandrov space.

Proof

First observe that \(\mathcal {D}_{L^{2}}\) is a geodesic space. Let \(\gamma :[0,1] \rightarrow \mathcal {D}_{L^{2}}\) be a geodesic from \(X\) to \(Y\) and let \(Z \in \mathcal {D}_{L^{2}}\) be any diagram. We want to show that the inequality (4) holds.

Let \(\phi \) be an optimal bijection between \(X\) and \(Y\) which induces the geodesic \(\gamma \). That is, \(\gamma (t)=\{(1-t)x+t\phi (x)\,|\,x\in X\}\); define \(\phi _t(x) =(1-t)x+t\phi (x)\) as in (3). Let \(\phi _{Z}^{t}: Z \rightarrow \gamma (t)\) be optimal. Construct bijections \(\phi _{Z}^{X}:Z\rightarrow X\) and \(\phi _{Z}^{Y}: Z\rightarrow Y\) by \(\phi _{Z}^{X}= (\phi _t)^{-1}\circ \phi _{Z}^{t}\) and \(\phi _{Z}^{Y}=\phi \circ \phi _{Z}^{X}\). There is no reason to suppose that either bijection \(\phi _{Z}^{X}\) or \(\phi _{Z}^{Y}\) is optimal. Note that if \(\phi _{Z}^{t}(z)=\Delta \) then \(\phi _{Z}^{X}(z)=\Delta \) and \(\phi _{Z}^{Y}(z)=\Delta \).

From the formula for the distance in \(\mathcal {D}_{L^{2}}\) we observe
$$\begin{aligned} \begin{aligned} d(Z,\gamma (t))^{2}&=\sum _{z\in Z} \Vert z-\phi _{Z}^{t}(z)\Vert ^{2}=\sum _{z\in Z} \Vert z - [(1-t)\phi _{Z}^{X}(z)+t\phi _{Z}^{Y}(z)]\Vert ^{2}, \\ d(Z,Y)^{2}&\le \sum _{z\in Z}\Vert z-\phi _{Z}^{Y}(z)\Vert ^{2},\\ d(Z,X)^{2}&\le \sum _{z\in Z}\Vert z-\phi _{Z}^{X}(z)\Vert ^{2},\\ d(X,Y)^{2}&=\sum _{z\in Z}\Vert \phi _{Z}^{X}(z)-\phi (\phi _{Z}^{X}(z))\Vert ^{2}=\sum _{z\in Z}\Vert \phi _{Z}^{X}(z)-\phi _{Z}^{Y}(z)\Vert ^{2}. \end{aligned} \end{aligned}$$
(5)
Euclidean space has everywhere curvature zero so for each \(z\) in the diagram \(Z\), and all \(t\in [0,1]\), we have
$$\begin{aligned} \Vert z - [(1-t)\phi _{Z}^{X}(z)+t\phi _{Z}^{Y}(z)]\Vert ^{2}&= t\Vert z-\phi _{Z}^{Y}(z)\Vert ^{2} +(1-t)\Vert z-\phi _{Z}^{X}(z)\Vert ^{2}\\&-\,t(1-t)\Vert \phi _{Z}^{X}(z)-\phi _{Z}^{Y}(z)\Vert ^{2}. \end{aligned}$$
Combining these equalities with inequalities (5) gives us the desired result. \(\square \)

2.3 Properties of the Fréchet Function

Given a probability distribution \(\rho \) on \(\mathcal {D}_{L^{2}}\) we can define the corresponding Fréchet function to be
$$\begin{aligned} F:\mathcal {D}_{L^{2}} \rightarrow \mathbb { R}, \quad Y \mapsto \int \limits _{\mathcal {D}_{L^{2}}} d(X,Y)^{2} d\rho (X). \end{aligned}$$
The Fréchet mean set of \(\rho \) is the set of all the minimizers of the map \(F\) on \(\mathcal {D}_{L^{2}}\). If there is a unique minimizer then this is called the Fréchet mean of \(\rho \). The variance is then defined to be the infimum of the above functional.
We will show that the Fréchet function has the nice property of being semiconcave. For an Alexandrov space \(\Omega \), a locally Lipschitz function \(f : \Omega \rightarrow \mathbb { R}\) is called \(\lambda \) -concave if for any unit speed geodesic \(\gamma \) in \(\Omega \), the function
$$\begin{aligned} f \circ \gamma (t) - \lambda t^{2}/2 \end{aligned}$$
is concave. A function \(f : \Omega \rightarrow \mathbb { R}\) is called semiconcave if for any point \(x\in \Omega \) there is a neighborhood \(\Omega _x\) of \(x\) and \(\lambda \in \mathbb { R}\) such that the restriction \(f|_{\Omega _x}\) is \(\lambda \)-concave.

Proposition 2.6

If the support of \(\rho \) is bounded \((\)i.e., has bounded diameter\()\), then the corresponding Fréchet function is semiconcave.

Proof

We will first show that if the support of a probability distribution \(\rho \) is bounded then the corresponding Fréchet function is Lipschitz on any set with bounded diameter. We then show that for any unit length geodesic \(\gamma \) and any \(X \in \mathcal {D}_{L^{2}}\) the function
$$\begin{aligned} g_{X}(s) := d(\gamma (s), X)^{2} - s^{2} \end{aligned}$$
is concave. We then complete the proof by showing the Fréchet function \(F\) is 2-concave at every point (and hence \(F\) is semiconcave) by considering \(F(\gamma (s))-s^{2}\) as \(\int g_{X}(s) d\rho (X)\).
Let \(U\) be a subset of \(\mathcal {D}_{L^{2}}\) with bounded diameter. This means that there is some \(K\) such that for any \(Y \in U\) we have \(\int d(X,Y) d\rho (X) \le K\). Here we are also using that the support of \(\rho \) is bounded. Let \(Y,Z \in U\). Then
$$\begin{aligned} |F(Y)-F(Z)|&=\big |\int d(X,Y)^{2}-d(X,Z)^{2} d\rho (X)\big |\\&= \big |\int (d(X,Y) - d(X,Z))(d(X,Z)+ d(X,Y)) d\rho (X)\big |\\&\le \int d(Z,Y)(d(X,Z)+ d(X,Y))d\rho (X)\\&\le 2Kd(Z,Y). \end{aligned}$$
Let \(\gamma \) be a unit speed geodesic and \(X\in \mathcal {D}_{L^{2}}\). Consider the function
$$\begin{aligned} g_{X}(s):=d(\gamma (s), X)^{2} - s^{2}. \end{aligned}$$
We want to show that \(g_{X}\) is concave, which means that \(g_{X}(tx+(1-t)y)\ge tg_{X}(x) + (1-t)g_{X}(y)\). Let \(\tilde{\gamma }(t)\) be the geodesic from \(\gamma (y)\) to \(\gamma (x)\) traveling along \(\gamma \), so that \(\gamma (tx+(1-t)y) = \tilde{\gamma }(t)\) for \(t \in [0,1]\), and
$$\begin{aligned} t g_{X}(x)+(1-t)g_{X}(y)&= t d(\tilde{\gamma }(1),X)^{2} + (1-t) d(\tilde{\gamma }(0),X)^{2} -tx^{2}-(1-t)y^{2}\\&\le d(\tilde{\gamma }(t),X)^{2} +t(1-t) d (\tilde{\gamma }(0), \tilde{\gamma }(1))^{2} -tx^{2}-(1-t)y^{2}\\&=d(\tilde{\gamma }(t),X)^{2} +t(1-t)(x-y)^{2} -tx^{2}-(1-t)y^{2}\\&=d(\tilde{\gamma }(t),X)^{2} -(tx+(1-t)y)^{2}\\&=g_{X}(tx+(1-t)y). \end{aligned}$$
The inequality comes from the defining inequality (4) that makes \(\mathcal {D}_{L^{2}}\) a non-negatively curved Alexandrov space.
By the construction of \(g_{X}\) we can think of \(F(\gamma (s))-s^{2}\) as \(\int g_{X}(s) d\rho (X)\). This means that we can write
$$\begin{aligned} t[F(\gamma (x))-x^{2}]+ (1-t)[F(\gamma (y))-y^{2}] = \int tg_{X}(x)+(1-t)g_{X}(y) d\rho (X). \end{aligned}$$
The concavity of \(g_{X}\) ensures that \( tg_{X}(x)+(1-t)g_{X}(y)\le g_{X}(tx+(1-t)y)\) and hence
$$\begin{aligned} t[F(\gamma (x))-x^{2}]+ (1-t)[F(\gamma (y))-y^{2}]&\le \int g_{X}(tx+(1-t)y) d\rho (X)\\&=F(tx+(1-t)y) -(tx+(1-t)y)^{2}. \end{aligned}$$
\(\square \)

We now define the additional structure on Alexandrov spaces with curvature bounded from below that we will need to define gradients and supporting vectors. This exposition is a summary of the content in [21, 24].

Given a point \(Y\) in an Alexandrov space \(\mathcal {A}\) with non-negative curvature we first define the tangent cone \(T_{Y}\). Let \(\widehat{\Sigma }_{Y}\) be the set of all nontrivial unit-speed geodesics emanating from \(Y\). For \(\gamma , \eta \in \widehat{\Sigma }_{Y}\) the angle between them defined by
$$\begin{aligned} \angle _{Y}(\gamma , \eta ) :=\arccos \left( \lim \limits _{s,t\downarrow 0}\frac{s^{2}+t^{2}-d(\gamma (s),\eta (t))^{2}}{2st}\right) \in [0,\pi ], \end{aligned}$$
when the limit exists. We define the space of directions \((\Sigma _{Y}, \angle _{Y})\) at \(Y\) as the completion of \(\widehat{\Sigma }_{Y}/ \sim \) with respect to \(\angle _{Y}\), where \(\gamma \sim \eta \) if \(\angle _{Y}(\gamma ,\eta )=0\). The tangent cone \(T_{Y}\) is the Euclidean cone over \(\Sigma _{Y}\):
$$\begin{aligned} T_{Y}&:= \Sigma _{Y} \times [0,\infty )/\Sigma _{Y}\times \{0\}\\ d_{T_{Y}}((\gamma , s), (\eta , t))^{2}&:= s^{2}+t^{2} - 2st\cos \angle _{Y}(\gamma , \eta ). \end{aligned}$$
The inner product of \(\mathbf {u} = (\gamma ,s), \mathbf {v} = (\eta ,t) \in T_{Y}\) is defined as
$$\begin{aligned} \langle \mathbf {u}, \mathbf {v} \rangle _{Y} := st \cos \angle _{Y}(\gamma ,\eta ) = \frac{1}{2}\big [ s^{2} + t^{2} - d_{T_{Y}}(\mathbf {u},\mathbf {v})^{2} \big ]. \end{aligned}$$
A geometric description of the tangent cone \(T_{Y}\) is as follows. \(Y \in \mathcal {D}_{L^{2}}\) has countably many points \(\{y_{i}\}\) off the diagonal. A tangent vector is a set of vectors \(\{v_{i} \in \mathbb { R}^{2}\}\) one assigned to each \(y_{i}\) along with countably many vectors at points along the diagonal pointing perpendicular to the diagonal such that the sum of the squares of the lengths of all these vectors is finite. Observe that there can exist tangent vectors such that the corresponding geodesic may not exist for any positive amount of time. The angle between two tangent vectors is effectively a weighted average of all the angles between the pairs of vectors.
We now define differential structure as a limit of rescalings. For \(s > 0\) denote the rescaled space \((\mathcal {A},s\cdot d)\) by \(s \mathcal {A}\) and let \(i_{s}: s \mathcal {A} \rightarrow \mathcal {A}\) be the canonical map. For an open set \(\Omega \subset \mathcal {A}\) and any function \(f : \Omega \rightarrow \mathbb { R}\), the differential of \(f\) at a point \(p \in \Omega \) is the map \(d_{p} f: T_{p} \rightarrow \mathbb { R}\) defined by
$$\begin{aligned} d_{p} f = \lim _{s \rightarrow \infty } s (f \circ i_{s} - f(p)),\quad f \circ i_{s}: s \mathcal {A} \rightarrow \mathbb { R}. \end{aligned}$$
For semiconcave functions the above differential is well defined and we can study gradients and supporting vectors.

Definition 2.7

(Gradients and supporting vectors) Given an open set \(\Omega \subset \mathcal {A}\) and a function \(f: \Omega \rightarrow \mathbb { R}\) we denote by \(\nabla _{p} f\) the gradient of a function \(f\) at a point \(p \in \Omega \). \(\nabla _{p} f\) is the vector \(v \in T_{p}\) such that
  1. (i)

    \(d_{p} f(x) \le \langle v, x\rangle \) for all \(x\in T_{p}\)

     
  2. (ii)

    \(d_{p} f(v)=\langle v,v\rangle \).

     
For a semiconcave \(f\) the gradient exists and is unique (Theorem 1.7 in [15]). We say \(s\in T_{p}\) is a supporting vector of \(f\) at \(p\) if \(d_{p} f(x) \le - \langle s, x\rangle \) for all \(x\in T_{p}\). Note that \(-\nabla _{p} f\) is a supporting vector if it exists in the tangent cone at \(p\).

Lemma 2.8

  1. (i)

    If \(s\) is a supporting vector then \(\Vert s\Vert \ge \Vert \nabla _{p} f\Vert \).

     
  2. (ii)

If \(p\) is a local minimum of \(f\) and \(s\) is a supporting vector of \(f\) at \(p\) then \(s=0\).

     

Proof

  1. (i)
    First observe that from the definitions of \(\nabla _{p} f\) and supporting vectors we have
    $$\begin{aligned} \langle \nabla _{p} f, \nabla _{p} f \rangle = d_{p} f(\nabla _{p} f)\le -\langle s, \nabla _{p} f\rangle . \end{aligned}$$
    We also know that
    $$\begin{aligned} 0\le \langle \nabla _{p} f +s, \nabla _{p} f+s\rangle = \langle \nabla _{p} f, \nabla _{p} f\rangle + 2\langle \nabla _{p} f,s\rangle + \langle s, s\rangle . \end{aligned}$$
    These inequalities combined tell us that \(0\le -\langle \nabla _{p} f, \nabla _{p} f\rangle + \langle s, s\rangle .\)
     
  2. (ii)

If \(p\) is a local minimum of \(f\) then \(d_{p}f(x)\ge 0\) for all \(x\in T_{p}\). In particular \(d_{p}f(s)\ge 0\). Since \(s\) is a supporting vector, \(-\langle s, s \rangle \ge d_{p} f(s) \ge 0\). This implies \(\langle s,s\rangle =0\) and hence \(s=0\).\(\square \)

     

We care about gradients and supporting vectors because they can help us find local minima of the Fréchet function. Indeed, a necessary condition for \(F\) to have a local minimum at \(Y\) is that \(s=0\) for any supporting vector \(s\) of \(F\) at \(Y\). Since the tangent cone at \(Y\) is a convex subset of a Hilbert space, we can take integrals over probability measures with values in \(T_{Y}\). This allows us to find a formula for a supporting vector of the Fréchet function \(F\).

Proposition 2.9

Let \(Y\in \mathcal {D}_{L^{2}}\). For each \(X\in \mathcal {D}_{L^{2}}\) let \(F_{X}:Z \mapsto d(X, Z)^{2}\).
  1. (i)

    If \(\gamma \) is a distance achieving geodesic from \(Y\) to \(X\), then the tangent vector to \(\gamma \) at \(Y\) of length \(2d(X,Y)\) is a supporting vector at \(Y\) for \(F_{X}\).

     
  2. (ii)

    If \(s_{X}\) is a supporting vector at \(Y\) for the function \(F_{X}\) for each \(X\in \text {supp}(\rho )\) then \(s=\int s_{X}d\rho (X)\) is a supporting vector at Y of the Fréchet function \(F\) corresponding to the distribution \(\rho \).

     

Proof

(i) Let \(\gamma \) be a unit speed geodesic from \(Y\) to \(X\). Consider the tangent vector \(s_{X}=(\gamma , 2d(X,Y))\). Let \(\gamma (t)_{i}\) denote the point in \(\gamma (t)\) that is sent to \(x_{i} \in X\). Since \(\gamma \) is a distance achieving geodesic we know that
$$\begin{aligned} \inf _{\phi :X\rightarrow \gamma (0)} \sum _{i}\Vert x_{i} -\phi (x_{i})\Vert ^{2} = \sum _{i}\Vert x_{i}-\gamma (0)_{i}\Vert ^{2}=F_{X}(Y). \end{aligned}$$
To show \(d_{Y}F_{X}(v)\le \langle s_{X}, v\rangle \) for all \(v\in T_{Y}\) it is sufficient to consider vectors of the form \((\tilde{\gamma }, 1)\) where \(\tilde{\gamma }\) is a unit speed geodesic starting at \(Y\). Let \(\tilde{\gamma }(t)_{i}\) denote the point in \(\tilde{\gamma }(t)\) which started at \(\gamma (0)_{i}\). This means that \(x_{i} \mapsto \tilde{\gamma }(t)_{i}\) is a bijection from \(X\) to \(\tilde{\gamma }(t)\) and
$$\begin{aligned} d_{Y} F_{X}(v)&= \frac{d}{dt}\bigg |_{t=0} F_{X}(\tilde{\gamma }(t))\\&=\lim _{t\rightarrow 0} \frac{F_{X}(\tilde{\gamma }(t))-F_{X}(Y)}{t}\\&=\lim _{t\rightarrow 0}\frac{\inf \big \{\sum \Vert x_{i}-\phi (x_{i})\Vert ^{2}-\Vert x_{i}-\gamma (0)_{i}\Vert ^{2}\,|\, \phi :X \rightarrow \tilde{\gamma }(t)\big \}}{t}\\&\le \lim _{t\rightarrow 0}\frac{\sum \Vert x_{i}-\tilde{\gamma }(t)_{i}\Vert ^{2}-\Vert x_{i}-\gamma (0)_{i}\Vert ^{2}}{t}\\&=\lim _{t\rightarrow 0}\frac{\sum \Vert \tilde{\gamma }(0)_{i}-\tilde{\gamma }(t)_{i}\Vert ^{2}-2 \Vert \tilde{\gamma }(0)_{i}-\tilde{\gamma }(t)_{i}\Vert \Vert x_{i}-\gamma (0)_{i}\Vert \cos \theta _{i}}{t} \end{aligned}$$
where \(\theta _{i}\) is the angle between the paths \(s\mapsto \gamma (s)_{i}\) and \(t\mapsto \tilde{\gamma }(t)_{i}\) in the plane. Now
$$\begin{aligned} \Vert x_{i}-\gamma (0)_{i}\Vert =\Vert \gamma (d(X,Y))_{i}-\gamma (0)_{i}\Vert =d(X,Y)\frac{\Vert \gamma (s)_{i}-\gamma (0)_{i}\Vert }{s} \end{aligned}$$
for all \(s>0\) and \(\Vert \tilde{\gamma }(0)_{i}-\tilde{\gamma }(t)_{i}\Vert ^{2}=t^{2}\Vert \tilde{\gamma }(0)_{i}-\tilde{\gamma }(1)_{i}\Vert ^{2}\) for all \(t\). This implies that
$$\begin{aligned} d_{Y} F_{X}(v) \le -2d(X,Y)\lim _{t,s\downarrow 0}\frac{\sum \Vert \tilde{\gamma }(t)_{i}-\tilde{\gamma }(0)_{i}\Vert \Vert \gamma (s)_{i}-\gamma (0)_{i}\Vert \cos \theta _{i}}{st}. \end{aligned}$$
Recall from our construction of the tangent cone that
$$\begin{aligned} \langle v, s_{X}\rangle&\!=\! 2d_{L^{2}}(X,Y) \cos (\angle _{Y} (\gamma , \tilde{\gamma }))\\&\!=\!2 d(X,Y) \bigg (\lim _{s,t\downarrow 0} \frac{s^{2}\!+\!t^{2} - d(\gamma (s), \tilde{\gamma }(t))^{2}}{2st}\bigg )\\&\!=\!2d(X,Y) \bigg (\lim _{s,t\downarrow 0}\frac{\sum \Vert \gamma (s)_{i}\!-\!\gamma (0)_{i}\Vert ^{2}\!+\!\Vert \tilde{\gamma }(t)_{i}\!-\!\tilde{\gamma }(0)\Vert ^{2} \!-\! \Vert \gamma (s)_{i}\!-\! \tilde{\gamma }(t)_{i}\Vert ^{2}}{2st}\bigg )\\&=2d(X,Y)\bigg (\lim _{t,s\downarrow 0}\frac{\sum \Vert \tilde{\gamma }(t)_{i}-\tilde{\gamma }(0)_{i}\Vert \Vert \gamma (s)_{i}-\gamma (0)_{i}\Vert \cos \theta _{i}}{st}\bigg ). \end{aligned}$$
By comparing these equations we get \(d_{Y} F_{X}(v) \le -\langle v, s_{X}\rangle \) and thus we can conclude \(s_{X}\) is a supporting vector.
(ii) Now let \(s_{X}\) be any supporting vector of \(F_{X}\). By its definition we know that \(d_{Y} F_{X}(v) \le - \langle s_{X}, v\rangle \) for all \(v\in T_{Y}\) and hence
$$\begin{aligned} d_{Y} F(v)&=\int d_{Y} F_{X}(v) d\rho (X) \le \int \left( - \langle s_{X}, v \rangle \right) d\rho (X)= -\big \langle \int s_{X} d\rho (X) , v \big \rangle . \end{aligned}$$
\(\square \)

In the following section we provide an algorithm that computes a local minimum of a Fréchet function using a gradient descent procedure. The above results will be used since computing a supporting vector of \(Z \mapsto d(X,Z)^{2}\) can be significantly easier and faster than computing a supporting vector of \(F\) itself.
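Proposition 2.9 suggests how a supporting vector of an empirical Fréchet function is assembled in practice: for each observed diagram, the supporting vector of \(F_{X}\) at \(Y\) is twice the displacement of each point of \(Y\) toward its match in \(X\), and these contributions are averaged over the diagrams. A Python sketch under the assumption that the optimal matchings have already been computed (the data layout is ours):

```python
def supporting_vector(Y, matchings):
    """Supporting vector of the empirical Frechet function at Y.

    `Y` is a list of points and `matchings[k][i]` is the target (a point
    of the k-th observed diagram, or a diagonal projection) matched to
    the i-th point of Y.  Each diagram contributes twice the displacement
    toward its targets; the contributions are averaged."""
    m = len(matchings)
    vectors = []
    for i, y in enumerate(Y):
        vx = sum(2.0 * (phi[i][0] - y[0]) for phi in matchings) / m
        vy = sum(2.0 * (phi[i][1] - y[1]) for phi in matchings) / m
        vectors.append((vx, vy))
    return vectors
```

A zero supporting vector is the necessary condition for a local minimum from Lemma 2.8(ii); in this sketch that happens exactly when every point of \(Y\) is the arithmetic mean of its matched targets.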

3 Finding Local Minima of the Fréchet Function

In this section we state an algorithm that computes a Fréchet mean of a finite set of persistence diagrams and examine its convergence properties. Throughout we restrict our attention to diagrams with only finitely many off-diagonal points, with multiplicity of the points allowed.

Given a set of persistence diagrams \(\{X_{i}\}_{i=1}^{m}\) a Fréchet mean \(Y\) is a diagram that satisfies
$$\begin{aligned} \min _{Y \in \mathcal {D}_{L^{2}}} \Big [ F_{m} := \int \limits _{\mathcal {D}_{L^{2}}} d(X,Y)^{2} d \rho _{m} (X) \Big ], \end{aligned}$$
with the empirical measure \(\rho _{m} := m^{-1} \sum _{i=1}^{m} \delta _{X_{i}}\).

We employ a greedy search algorithm based on gradient descent to find a local minimum. A key component of this greedy algorithm (see Algorithm 1) consists of a variant of the Kuhn–Munkres (Hungarian) algorithm [18].

The Hungarian algorithm finds the least-cost assignment of tasks to people under the assumption that the numbers of tasks and people are equal. The input is the cost for each person to do each of the tasks. Suppose we have two diagrams \(X\) and \(Y\), each with only finitely many off-diagonal points. Add enough copies of the diagonal to both \(X\) and \(Y\) so that every off-diagonal point has the option of being matched with the diagonal. We can think of the points and copies of the diagonal in \(X\) as the people and the points and copies of the diagonal in \(Y\) as the tasks. The cost of \(x\in X\) doing task \(y\in Y\) is \(\Vert x-y\Vert ^{2}\). The total cost of an assignment (in other words, a bijection) \(\phi \) of tasks to people is \(\sum _{x\in X} \Vert x-\phi (x)\Vert ^{2}\). The Hungarian algorithm gives us a bijection \(\phi \) that minimizes this cost, that is, an optimal pairing between \(X\) and \(Y\).
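The augmented assignment problem above can be sketched in Python (an illustrative implementation, not the one used in the paper; SciPy's `linear_sum_assignment` plays the role of the Hungarian algorithm, and the function name is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_pairing(X, Y):
    """Optimal (squared-L2) matching between diagrams X and Y.
    Each diagram is a list of (birth, death) points; each side is augmented
    with one diagonal copy per point of the other side, so that every
    off-diagonal point may be matched to the diagonal instead."""
    n, k = len(X), len(Y)
    C = np.zeros((n + k, n + k))            # rows: X points + diagonal copies
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            C[i, j] = (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2
        C[i, k:] = (x[1] - x[0]) ** 2 / 2   # x matched to the diagonal
    for j, y in enumerate(Y):
        C[n:, j] = (y[1] - y[0]) ** 2 / 2   # y matched to the diagonal
    rows, cols = linear_sum_assignment(C)   # Hungarian-style solver
    return C[rows, cols].sum(), list(zip(rows, cols))

cost, pairing = optimal_pairing([(0.0, 1.0), (0.2, 0.4)], [(0.1, 1.1)])
# cost is d(X, Y)^2: (0, 1) pairs with (0.1, 1.1), (0.2, 0.4) with the diagonal
```

Diagonal-to-diagonal entries of the cost matrix are zero, so unused diagonal copies do not affect the total cost.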

We would like to use the arithmetic mean of points in the plane together with some number of copies of the diagonal. If \(x_{1}, \ldots , x_{m}\) are points in \(\mathbb { R}^{2}\) then their arithmetic mean \(w=\frac{1}{m}\sum _{i=1}^{m} x_{i}\) is the choice of \(z\) that minimizes the sum \(\sum _{i=1}^{m} \Vert z-x_{i}\Vert ^{2}\). If \(x_{i}=\Delta \) for all \(i\) then the arithmetic mean is set to be \(\Delta \). In the remaining case we may assume, without loss of generality, that \(x_{1}, \ldots , x_k\) are off-diagonal points and \(x_{k+1}, \ldots , x_{m}\) are all copies of the diagonal. Let \(w\) be the usual arithmetic mean of \(x_{1}, \ldots , x_k\) and let \(w_\Delta \) be the closest point on the diagonal to \(w\). We set
$$\begin{aligned} w':=\frac{kw+(m-k)w_\Delta }{m} \end{aligned}$$
to be the arithmetic mean of \(x_{1}, \ldots , x_{m}\). This is the choice of \(z\) that minimizes \(\sum _{i=1}^{m} \Vert z-x_{i}\Vert ^{2}\). We use an operation \(\text{ mean }_{i=1,\ldots ,m}(x_{i}^{j})\) that computes this arithmetic mean for each pairing over the diagrams.
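The three cases above can be folded into one small helper (an illustrative sketch; the function name and the encoding of the diagonal as `None` are ours):

```python
import numpy as np

def diagram_point_mean(points, num_diagonal):
    """Arithmetic mean of k off-diagonal points and (m - k) diagonal copies,
    following the formula w' = (k*w + (m-k)*w_Delta) / m from the text.
    Returns None (i.e. the diagonal) when every input is the diagonal."""
    k = len(points)
    m = k + num_diagonal
    if k == 0:
        return None                            # mean of diagonal copies
    w = np.mean(np.asarray(points, dtype=float), axis=0)
    w_delta = np.full(2, (w[0] + w[1]) / 2)    # closest diagonal point to w
    return (k * w + (m - k) * w_delta) / m

# two off-diagonal points averaged with two copies of the diagonal
p = diagram_point_mean([(0.0, 2.0), (0.0, 4.0)], num_diagonal=2)
```

Here \(w=(0,3)\), \(w_\Delta =(1.5,1.5)\), and the result is pulled toward the diagonal in proportion to the number of diagonal copies.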

Suppose \(Y\) is our current estimate of the Fréchet mean. Using the Hungarian algorithm we compute optimal pairings between \(Y\) and each of the \(X_{i}\). We denote these pairings by \(\{(y^{j}, x_{i}^{j})\}_{j=1}^{J_{i}}\), where \(J_{i}\) is the number of off-diagonal points in \(X_{i}\) and \(Y\) combined. For each \(y^{j}\ne \Delta \) we then consider all the \(x_{i}^{j}\) and let \(\tilde{y}^{j}\) be their arithmetic mean. Whenever a pairing \(\{(y^{j}, x_{i}^{j})\}_{j=1}^{J_{i}}\) contains a pair \((\Delta , x_{i}^{j})\), we treat this copy of the diagonal as distinct from any copy appearing in a pairing between \(Y\) and \(X_k\) with \(k\ne i\); in this case \(\tilde{y}^{j}\) is the arithmetic mean of \(m-1\) copies of the diagonal and the single point \(x_{i}^{j}\). Let \(Y'\) be the diagram with points \(\tilde{y}^{j}\). We will show later that if \(Y=Y'\) then \(Y\) is a local minimum of the Fréchet function. Otherwise we take \(Y'\) as our new current estimate.

The basic steps of Algorithm 1 are:
  (a) randomly initialize the mean diagram; for example, we can start at one of the \(m\) persistence diagrams or at the midway point of two of the \(m\) diagrams;
  (b) use the Hungarian algorithm to compute optimal pairings between the estimate of the mean diagram and each of the persistence diagrams;
  (c) update each point in the mean diagram estimate with the arithmetic mean over all diagrams (each point in the mean diagram is paired with a point, possibly on the diagonal, in each diagram);
  (d) if the updated estimate locally minimizes \(F_{m}\) then return the estimate, otherwise return to step (b).
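The steps above can be sketched end to end in Python (an illustrative sketch, assuming SciPy's `linear_sum_assignment` as the Hungarian step; this is not the authors' implementation, and it omits, for instance, any check that the pairings are unique):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pair_to(Y, X):
    """For each point of Y, find its optimal partner in X (or the diagonal,
    encoded as None) by solving an augmented assignment problem."""
    n, k = len(Y), len(X)
    C = np.zeros((n + k, n + k))
    for i, y in enumerate(Y):
        for j, x in enumerate(X):
            C[i, j] = (y[0] - x[0]) ** 2 + (y[1] - x[1]) ** 2
        C[i, k:] = (y[1] - y[0]) ** 2 / 2          # match y to the diagonal
    for j, x in enumerate(X):
        C[n:, j] = (x[1] - x[0]) ** 2 / 2          # match x to the diagonal
    rows, cols = linear_sum_assignment(C)
    return {r: (X[c] if c < k else None) for r, c in zip(rows, cols) if r < n}

def frechet_mean_estimate(diagrams, Y0, max_iter=100):
    """Greedy update in the spirit of Algorithm 1: re-pair, then replace each
    mean point by the diagonal-aware arithmetic mean of its partners."""
    Y = [tuple(map(float, y)) for y in Y0]          # step (a): initial estimate
    for _ in range(max_iter):
        partners = [pair_to(Y, X) for X in diagrams]  # step (b)
        newY = []
        for i in range(len(Y)):                     # step (c)
            pts = [p[i] for p in partners]          # partner of Y[i] per diagram
            offs = [q for q in pts if q is not None]
            m, k = len(pts), len(offs)
            if k == 0:
                continue                            # point collapses onto diagonal
            w = np.mean(np.asarray(offs, dtype=float), axis=0)
            wd = np.full(2, (w[0] + w[1]) / 2)      # nearest diagonal point to w
            newY.append(tuple((k * w + (m - k) * wd) / m))
        if newY == Y:                               # step (d): stopping criterion
            break
        Y = newY
    return Y

# toy example: the mean of {(0, 1)} and {(0, 2)} should sit at (0, 1.5)
mean = frechet_mean_estimate([[(0.0, 1.0)], [(0.0, 2.0)]], Y0=[(0.0, 1.0)])
```

Each iteration strictly decreases the cost unless the estimate is already a fixed point, which is the termination argument given in Sect. 3.1.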
An alternative to the above greedy approach would be a brute-force search over point configurations to find a Fréchet mean. One way to do this is to list all possible pairings between points in each pair of diagrams and then compute the arithmetic mean for each such pairing. One of these means will be a Fréchet mean. While this approach will find the complete mean set, its combinatorial complexity is prohibitive.

3.1 Convergence of the Greedy Algorithm

The remainder of this section establishes convergence properties of Algorithm 1. By convergence we mean that the algorithm will terminate at some point, having found a local minimum. The reason is that at each iteration the cost function \(F_{m}\) decreases and a new set of pairings is used, and there are only finitely many combinations of pairings between points in the diagrams.

We first develop necessary and sufficient conditions for a diagram \(Y\) to be a local minimum of the Fréchet function of a set of persistence diagrams. We define \(F_{i} (Z):= d(Z,X_{i})^{2}\), the Fréchet function corresponding to \(\delta _{X_{i}}\). This allows us to write the Fréchet function corresponding to the distribution \(\frac{1}{m}\sum _{i=1}^{m} \delta _{X_{i}}\) as \(F= \frac{1}{m} \sum _{i=1}^{m} F_{i}\).

The following lemma provides a necessary condition for a diagram to be a local minimum of \(F\). This condition is the stopping criterion in Algorithm 1.

Lemma 3.1

If \(W = \{w_{i}\}\) is a local minimum of the Fréchet function \(F = \frac{1}{m} \sum _{j=1}^{m} F_{j}\) then there is a unique optimal pairing from \(W\) to each of the \(X_{j}\), which we denote by \(\phi _{j}\), and each \(w_{i}\) is the arithmetic mean of the points \(\{\phi _{j}(w_{i})\}_{j=1, 2, \ldots , m}\). Furthermore, if \(w_k\) and \(w_l\) are off-diagonal points such that \(\Vert w_k-w_l\Vert =0\) then \(\Vert \phi _{j}(w_k)-\phi _{j}(w_l)\Vert =0\) for each \(j\).

Proof

Let \(\phi _{j}\) be optimal pairings (not yet assumed to be unique) between \(W\) and \(X_{j}\) and let \(s_{j}\) be the corresponding vectors in the tangent cone at \(W\) that are tangent to the geodesics induced by \(\phi _{j}\) and are of length \(d(X_{j},W)\). The \(2s_{j}\) are supporting vectors for the functions \(F_{j}(W)= d(W, X_{j})^{2}\) by Proposition 2.9, so \(\frac{2}{m}\sum _{j=1}^{m} s_{j}\) is a supporting vector of \(F\).

From Lemma 2.8 we know that \(\frac{2}{m}\sum _{j=1}^{m} s_{j}=0\). Since at each \(w_{i}\) the vector \(s_{j}\) gives the displacement from \(w_{i}\) to \(\phi _{j}(w_{i})\), the identity \(\sum _{j=1}^{m} s_{j}=0\) implies that \(w_{i}\) is the arithmetic mean of the points \(\{\phi _{j}(w_{i})\}_{j=1, 2,\ldots ,m}\).

Now suppose that \(\phi _k\) and \(\tilde{\phi _k}\) are both optimal pairings. By the above reasoning we have \(\frac{1}{m}(\tilde{s_k} + \sum _{j=1, j\ne k}^{m} s_{j})=0 =\frac{1}{m}\sum _{j=1}^{m} s_{j}\) and hence \(\tilde{s_k} = s_k\). This implies that \(\Vert \tilde{\phi _k}(w_{i}) -\phi _k(w_{i})\Vert =0\) for all \(w_{i}\in W\). In particular, for off-diagonal points \(w_k\) and \(w_l\) with \(\Vert w_k-w_l\Vert =0\) and \(\phi _k\) an optimal pairing, we can consider the pairing \(\tilde{\phi }_k\) with \(w_k\) and \(w_l\) swapped. Since \(\Vert \tilde{\phi _k}(w_{i}) -\phi _k(w_{i})\Vert =0\) for all \(w_{i}\in W\) we can conclude that \(\Vert \phi _{j}(w_k)-\phi _{j}(w_l)\Vert =0\) for each \(j\). \(\square \)

We now prove that the above condition is also sufficient for \(W\) to be a local minimum of \(F\) when \(F\) is the Fréchet function for the measure \(\frac{1}{m}\sum _{i}\delta _{X_{i}}\), with each diagram \(X_{i}\) having finitely many off-diagonal points. This requires a result about a local extension of optimal pairings.

Proposition 3.2

Let \(X\) and \(Y\) be diagrams, each with only finitely many off-diagonal points, such that there is a unique optimal pairing \(\phi _{X}^{Y}\) between them and no off-diagonal point in \(X\) matches the diagonal in \(Y\). We further stipulate that if \(y_k\) and \(y_l\) are off-diagonal points with \(\Vert y_k-y_l\Vert =0\) then \(\Vert (\phi _{X}^{Y})^{-1}(y_k)-(\phi _{X}^{Y})^{-1}(y_l)\Vert =0\). Then there is some \(r>0\) such that for every \(Z \in B(Y,r)\) there is a unique optimal pairing between \(X\) and \(Z\), and this optimal pairing is induced from the one from \(X\) to \(Y\). By this we mean there is a unique optimal pairing \(\phi _{Y}^Z\) from \(Y\) to \(Z\) and that the unique optimal pairing from \(X\) to \(Z\) is \(\phi _{Y}^Z \circ \phi _{X}^{Y}\).

Furthermore, if \(X_{1}, X_{2}, \ldots , X_{m}\) and \(Y\) are diagrams with finitely many off-diagonal points such that there is a unique optimal pairing \(\phi _{X_{i}}^{Y}\) between \(X_{i}\) and \(Y\) for each \(i\) with the same conditions as above, then there is some \(r>0\) such that for every \(Z \in B(Y,r)\) there is a unique optimal pairing between each \(X_{i}\) and \(Z\) and this optimal pairing is induced by the one from \(X_{i}\) to \(Y\).

Proof

Since \(Y\) has only finitely many off-diagonal points there is some \(\varepsilon >0\) such that for every diagram \(Z\) with \(d(Y,Z)<\varepsilon \) there is a unique geodesic from \(Y\) to \(Z\).

For each bijection \(\phi \) of points in \(X\) to points in \(Y\), define the function \(g_\phi \) between \(X\) and points in \(B(Y,\varepsilon )\) by setting
$$\begin{aligned} g_\phi (X,Z):=\sum _{x \in X} \Vert x - \phi _{Y}^Z(\phi (x))\Vert ^{2} + \sum _{\{z\in Z: (\phi _{Y}^Z)^{-1}(z)= \Delta \}}\Vert z-\Delta \Vert ^{2}, \end{aligned}$$
where \(\phi _{Y}^Z\) is the optimal pairing that comes from the unique geodesic from \(Y\) to \(Z\). First note that \(g_\phi (X,Z) \le \sum _{x\in X} \Vert x - \phi _{Y}^Z(\phi (x))\Vert ^{2} + d(Y,Z)^{2}\). Since there are only finitely many points in \(X\) and \(Y\), there is a bound \(M\) on \( \Vert x - \phi (x)\Vert + \varepsilon \) over all \(x\) and all \(\phi \); this \(M\) also bounds \(\Vert x - \phi _{Y}^Z(\phi (x))\Vert \) for all \(x\) and all \(\phi \). We also know \(\Vert \phi _{Y}^Z(\phi (x))-\phi (x)\Vert \le d(Y,Z)\) for all \(x \in X\). Let \(K\) be the number of off-diagonal points in the diagrams \(X\) and \(Y\) combined. Then
$$\begin{aligned} g_\phi (X,Z)&\le \sum \Vert x_{i} - \phi _{Y}^Z(\phi (x_{i}))\Vert ^{2}+ d(Y,Z)^{2}\\&\le \sum _{x\in X} (\Vert x-\phi (x)\Vert + \Vert \phi (x)-\phi _{Y}^Z(\phi (x))\Vert )^{2}+ d(Y,Z)^{2}\\&\le \sum _{x\in X} (\Vert x-\phi (x)\Vert ^{2}+ \Vert \phi (x)-\phi _{Y}^Z(\phi (x))\Vert ^{2}\\&\ \quad +2 \Vert x-\phi (x)\Vert \Vert \phi (x)-\phi _{Y}^Z(\phi (x))\Vert )+ d(Y,Z)^{2} \\&\le g_\phi (X,Y) + 2d(Y,Z)^{2} + 2M d(Y,Z) \, K. \end{aligned}$$
Similarly
$$\begin{aligned} g_\phi (X,Y)\le g_\phi (X,Z)+ 2d(Y,Z)^{2} + 2MK d(Z,Y). \end{aligned}$$
Let \(\phi _{X}^{Y}\) be the optimal pairing from \(X\) to \(Y\) which is assumed to be unique in the statement of the proposition. Let \(\hat{\phi }\) be another bijection of points in \(X\) to points in \(Y\). Since there are only finitely many off-diagonal points in \(X\) and \(Y\) there are only finitely many possible \(\hat{\phi }\). Set
$$\begin{aligned} \beta := \min _{\hat{\phi }\ne \phi _{X}^{Y}}\big \{ g_{\hat{\phi }}(X,Y)-g_{\phi _{X}^{Y}}(X,Y)\big \} = \min _{\hat{\phi }\ne \phi _{X}^{Y}}\big \{ g_{\hat{\phi }}(X,Y)-d(X,Y)^{2} \big \} \end{aligned}$$
which must be positive as \(\phi _{X}^{Y}\) is uniquely optimal by assumption.
Choose \(r>0\) such that \(4r^{2}+4MKr<\beta \). Now suppose that \(g_\phi (X,Z)\le g_{\phi _{X}^{Y}}(X,Z)\) for some \(Z \in B(Y,r)\) and some \(\phi \ne \phi _{X}^{Y}\). This would imply that
$$\begin{aligned} g_\phi (X,Y)&\le g_\phi (X,Z) + 2d(Y,Z)^{2} + 2MK\,d(Z,Y)\\&\le g_{\phi _{X}^{Y}}(X,Z) + 2d(Y,Z)^{2} + 2MK\,d(Z,Y)\\&\le g_{\phi _{X}^{Y}}(X,Y) + 4d(Y,Z)^{2} +4MK \,d(Y,Z)\\&<g_{\phi _{X}^{Y}}(X,Y) +\beta , \end{aligned}$$
which contradicts our choice of \(\beta \).

Now suppose \(X_{1}, X_2, \ldots , X_{m}\) and \(Y\) are diagrams with finitely many off diagonal points such that there is a unique optimal pairing \(\phi _{X_{i}}^{Y}\) between \(X_{i}\) and \(Y\) for each \(i\). By the above argument there are some \(r_{1}, r_2,\ldots ,r_{m}>0\) such that for each \(i\) and for every \(Z \in B(Y,r_{i})\) there is a unique optimal pairing between each \(X_{i}\) and \(Z\) and this optimal pairing is induced by the one from \(X_{i}\) to \(Y\). Take \(r=\min \{r_{i}\}\) which is positive. \(\square \)

The following theorem states that Algorithm 1 will find a local minimum on termination.

Theorem 3.3

Given diagrams \(\{X_{1},\ldots ,X_{m}\}\) and the corresponding Fréchet function \(F\), the diagram \(W = \{w_{i}\}\) is a local minimum of \(F\) if and only if there is a unique optimal pairing \(\phi _{j}\) from \(W\) to each of the \(X_{j}\) and each \(w_{i}\) is the arithmetic mean of the points \(\{\phi _{j}(w_{i})\}_{j=1,2,\ldots ,m}\).

Proof

In Lemma 3.1 we showed that this condition is necessary.

Given \(m\) points in the plane or copies of the diagonal, \(\{x_{1}, x_{2}, \ldots , x_{m}\}\), the choice of \(y\) which minimizes \(\sum _{i=1}^{m} \Vert x_{i}-y\Vert ^{2}\) is the arithmetic mean of \(\{x_{1}, \ldots , x_{m}\}\). As a result we know that \(F(Z)>F(W)\) for all \(Z\ne W\) with the same optimal pairings as \(W\) to \(X_{1}, X_{2}, \ldots , X_{m}\). Since, by Proposition 3.2, there is some ball \(B(W,r)\) such that every \(Z\in B(W,r)\) has the same optimal pairings as \(W\), we know that \(F(Z)>F(W)\) for all \(Z\ne W\) in \(B(W,r)\). Thus we can conclude that \(W\) is a local minimum. \(\square \)

4 Law of Large Numbers for the Empirical Fréchet Mean

In this section we study the convergence of Fréchet means computed from sampling sets to the set of means of a measure. Consider a measure \(\rho \) on the space of persistence diagrams \(\mathcal {D}_{L^{2}}\). Given a set of persistence diagrams \(\{X_{i}\}_{i=1}^{n} \mathop {\sim }\limits ^{iid} \rho \) one can define an empirical measure \(\rho _{n}=\frac{1}{n}\sum _{k=1}^n \delta _{X_k}\). We will examine the relation between the two sets
$$\begin{aligned} \mathbf {Y}&= \Big \{ \min _{Z \in \mathcal {D}_{L^{2}}} \Big [ F := \int \limits _{\mathcal {D}_{L^{2}}} d(X,Z)^{2} d\rho (X) \Big ] \Big \} ,\\ \mathbf {Y}_{n}&= \Big \{ \min _{Z \in \mathcal {D}_{L^{2}}} \Big [ F_{n} := \int \limits _{\mathcal {D}_{L^{2}}} d(X,Z)^{2} d\rho _{n}(X) \Big ] \Big \}, \end{aligned}$$
where \(\mathbf {Y}\) and \(\mathbf {Y}_{n}\) are the Fréchet mean sets of the measures \(\rho \) and \(\rho _{n}\), respectively. We would like to prove convergence of \(\mathbf {Y}_{n}\) to \(\mathbf {Y}\) as \(n\) grows, that is, a law of large numbers result.

There exist weak and strong laws of large numbers for general metric spaces (for example see [17, Theorem 3.4]). These results hold for global minima of the Fréchet and empirical Fréchet functions \(F\) and \(F_{n}\), respectively. It is not clear to us how to adapt these results to the case of Algorithm 1 where we can only ensure convergence to a local minimum. It is also not clear how we can adapt these theorems to get rates of convergence of the sample Fréchet mean set to the population quantity.

In this section we provide a law of large numbers result for the restricted case where \(\rho \) is a combination of Dirac masses
$$\begin{aligned} \rho =\frac{1}{m}\sum _{i=1}^{m} \delta _{Z_{i}}, \end{aligned}$$
where \(Z_{i}\) are diagrams with only finitely many off diagonal points and we allow for multiplicity in these points. The proof is constructive and we provide rates of convergence.

The main results of this section, Theorem 4.1 and Lemma 4.2, provide a probabilistic justification for Algorithm 1. Theorem 4.1 states that with high probability local minima of the empirical Fréchet function \(F_{n}\) will be close to local minima of the Fréchet function \(F\). Ideally we would like this convergence to hold for global minima, the Fréchet mean set. Lemma 4.2 states that the number of local minima of \(F_{n}\) is finite, with a bound that does not depend on \(n\). This suggests that applying Algorithm 1 to a random set of start conditions can be used to explore the finite set of local minima.

Theorem 4.1

Set \(\rho =\frac{1}{m}\sum _{i=1}^{m} \delta _{Z_{i}}\) where the \(Z_{i}\) are diagrams with finitely many off-diagonal points, multiplicity allowed. Let \(F\) be the Fréchet function corresponding to \(\rho \) and \(Y\) be a local minimum of \(F\). Set \(\{X_{i}\}_{i=1}^n \mathop {\sim }\limits ^{iid} \rho \), and denote the corresponding empirical measure by \(\rho _{n}=\frac{1}{n}\sum _{k=1}^n \delta _{X_k}\) and Fréchet function by \(F_{n}\). There exists a local minimum \(Y_{n}\) of \(F_{n}\) such that with probability greater than \(1-\delta \)
$$\begin{aligned} d(Y,Y_{n})^{2} \le \frac{m^{2} F(Y)}{n} \ln \left( \frac{m}{\delta }\right) , \end{aligned}$$
for \(n \ge 8m \ln \frac{m}{\delta }\) and \(\frac{m^{2} F(Y)}{n} \ln \left( \frac{m}{\delta }\right) < r^{2}\) where \(r\) characterizes the separation between the local minima of \(F\).
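To get a feel for the rate, the bound can be evaluated for concrete values of \(m\), \(F(Y)\), \(\delta \) and \(n\) (a hypothetical helper with made-up parameter values; the separate condition involving \(r\) is not checked here):

```python
import math

def theorem41_bound(m, F_Y, delta, n):
    """Evaluate the bound of Theorem 4.1 for concrete parameters:
    requires n >= 8 m ln(m/delta); then with probability > 1 - delta,
    d(Y, Y_n)^2 <= m^2 F(Y) ln(m/delta) / n."""
    n_min = 8 * m * math.log(m / delta)
    assert n >= n_min, "sample size too small for the bound to apply"
    return (m ** 2) * F_Y * math.log(m / delta) / n

# made-up values: 5 atoms, F(Y) = 1, 95% confidence, 10,000 samples
bound = theorem41_bound(m=5, F_Y=1.0, delta=0.05, n=10_000)
```

The squared distance bound decays like \(1/n\), so the distance itself decays like \(n^{-1/2}\), the usual parametric rate.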

Proof

The empirical distribution is
$$\begin{aligned} \rho _{n}=\frac{1}{n}\sum _{k=1}^n \delta _{X_k} = \frac{1}{m}\sum _{i=1}^{m} \xi _{i} \delta _{Z_{i}}, \end{aligned}$$
where \(\xi _{i}\) is the random variable giving the multiplicity with which \(Z_{i}\) appears in the empirical measure, \(\xi _{i}=|\{k:X_k=Z_{i}\}|\). Observe that \((\xi _{1}, \xi _2, \ldots , \xi _{m})\) follows a multinomial distribution with parameters \(n\) and \(p=\left( \frac{1}{m}, \frac{1}{m}, \ldots , \frac{1}{m}\right) \).

We will bound the probability that \(|\xi _{i}-\frac{n}{m}|>\varepsilon \frac{n}{m}\) for any \(i=1,2,\ldots ,m\). We will then show that, under the assumption that \(|\xi _{i}-\frac{n}{m}|\le \varepsilon \frac{n}{m}\) for all \(i=1,2,\ldots ,m\) with \(\varepsilon >0\) sufficiently small, there is a local minimum \(Y_{n}\) with \(d(Y,Y_{n})^{2}<\frac{\varepsilon ^{2} m F(Y)}{(1-\varepsilon )^{2}}\).

For each \(i\), \(\xi _{i} \sim \text{ Bin }(n,\frac{1}{m})\) and \(n-\xi _{i} \sim \text{ Bin }(n, 1-\frac{1}{m})\). Using Hoeffding’s inequality we obtain \(\Pr \left[ \xi _{i}-\frac{n}{m}\le -\varepsilon \frac{n}{m}\right] \le \frac{1}{2}\exp (-2\frac{\varepsilon ^{2} n}{m})\) and
$$\begin{aligned} \Pr \left[ \xi _{i}-\frac{n}{m}\ge \varepsilon \frac{n}{m}\right] =\Pr \left[ (n-\xi _{i})-(n-\frac{n}{m})\le -\varepsilon \frac{n}{m}\right] \le \frac{1}{2}\exp \big (-2\frac{\varepsilon ^{2} n}{m}\big ). \end{aligned}$$
Together they show that \(\Pr \left[ |\xi _{i}-\frac{n}{m}|\ge \varepsilon \frac{n}{m}\right] \le \exp (-2\frac{\varepsilon ^{2} n}{m})\) implying the bound
$$\begin{aligned} \Pr \left[ |\xi _{i}-\frac{n}{m}|<\varepsilon \frac{n}{m} \text { for all } i=1, 2, \ldots , m\right] \ge 1-m\exp \big (-2\frac{\varepsilon ^{2} n}{m}\big ). \end{aligned}$$
From now on we will assume that \(|\xi _{i}-\frac{n}{m}|<\varepsilon \frac{n}{m}\) for all \(i=1, 2, \ldots , m\). Let us consider our algorithm for finding a local minimum of \(F_{n}\) starting at the point \(Y\). We first define some notation. We denote the points in \(Y\) by \(\{y_{j}\}_{j=1}^J\). We denote by \(z_{i}^{j} := \phi _{Y}^{Z_{i}}(y_{j})\) the point in \(Z_{i}\) that \(y_{j}\) is paired to in the (unique) optimal bijection between \(Y\) and \(Z_{i}\). Recall that \(z_{i}^{j}\) could be the diagonal, but from our assumption that \(Y\) is a local minimum no off-diagonal point in any \(Z_{i}\) is paired with the diagonal in \(Y\).
Let \((a_{i}^{j}, b_{i}^{j})\) be the coefficients of the vector from \(y_{j}\) to \(z_{i}^{j}\) in the basis of \(\mathbb { R}^{2}\) given by \((\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}})\) and \((-\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}})\). This basis has the advantage that when \(z_{i}^{j}\) is the diagonal then \(a_{i}^{j}=0\) and \(b_{i}^{j}=d(y_{j},\Delta )\). From our assumption that \(Y\) is a local minimum we know that \(\sum _{i=1}^{m} a_{i}^{j}=0\) and \(\sum _{i=1}^{m} b_{i}^{j}=0\) for all \(j\) and
$$\begin{aligned} F(Y)=\frac{1}{m}\sum _{j=1}^J\sum _{i=1}^{m} \big ((a_{i}^{j})^{2} + (b_{i}^{j})^{2}\big ). \end{aligned}$$
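The change of basis used here is easy to check numerically (a small illustration with a made-up point; the sign of \(b\) depends on which side of the diagonal the point lies, so we compare absolute values):

```python
import numpy as np

# orthonormal basis from the proof: e1 along the diagonal, e2 across it
e1 = np.array([1.0, 1.0]) / np.sqrt(2)
e2 = np.array([-1.0, 1.0]) / np.sqrt(2)

def coeffs(y, z):
    """Coefficients (a, b) of the vector from y to z in the basis (e1, e2)."""
    v = np.asarray(z, dtype=float) - np.asarray(y, dtype=float)
    return float(v @ e1), float(v @ e2)

y = (1.0, 3.0)
z_diag = (2.0, 2.0)                       # the diagonal point closest to y
a, b = coeffs(y, z_diag)
d_y_diag = abs(y[1] - y[0]) / np.sqrt(2)  # L2 distance from y to the diagonal
# a vanishes and |b| equals d(y, diagonal), as the proof uses
```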
For the moment fix \(j\). Without loss of generality reorder the \(Z_{i}\) so that the first \(k\) (with \(1\le k \le m\)) of the \(z_{i}^{j}\) are off the diagonal and the remainder are copies of the diagonal. Let \(y_{j}^n\) be the point in \(\mathbb { R}^{2}\) given by
$$\begin{aligned} y_{j} + \bigg (\frac{1}{\xi _{1} +\xi _2 + \cdots +\xi _k} \sum _{i=1}^k \xi _{i} a_{i}^{j}\bigg ) \bigg (\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}}\bigg ) + \bigg (\frac{1}{n}\sum _{i=1}^{m} \xi _{i} b_{i}^{j}\bigg ) \bigg (-\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}}\bigg ). \end{aligned}$$
By construction this \(y_{j}^n\) is the weighted arithmetic mean of the \(z_{i}^{j}\) where we have weighted by the \(\xi _{i}\) taking into account that when \(i>k\) then \(z_{i}^{j}\) is the diagonal.
Under our assumption that \(|\xi _{i}-\frac{n}{m}|<\varepsilon \frac{n}{m}\) for all \(i=1, 2, \ldots , m\) and using \(\sum _{i=1}^k a_{i}^{j}=0=\sum _{i=1}^{m} b_{i}^{j}\) we know that
$$\begin{aligned} \Vert y_{j}-y_{j}^n\Vert ^{2}&= \frac{1}{(\xi _{1} +\xi _2 + \cdots + \xi _k)^{2}}\bigg ( \sum _{i=1}^k \xi _{i} a_{i}^{j}\bigg )^{2} + \frac{1}{n^{2}}\bigg (\sum _{i=1}^{m} \xi _{i}b_{i}^{j}\bigg )^{2}\\&= \frac{1}{(\xi _{1} +\xi _2 + \cdots +\xi _k)^{2}}\bigg ( \sum _{i=1}^k \Big (\xi _{i}-\frac{n}{m}\Big ) a_{i}^{j}\bigg )^{2} + \frac{1}{n^{2}}\bigg (\sum _{i=1}^{m} \Big (\xi _{i}-\frac{n}{m}\Big ) b_{i}^{j}\bigg )^{2}\\&\le \frac{1}{\frac{k^{2}}{m^{2}}n^{2}(1-\varepsilon )^{2}}\frac{\varepsilon ^{2}n^{2}}{m^{2}}\bigg ( \sum _{i=1}^k (a_{i}^{j})^{2}\bigg ) + \frac{1}{n^{2}}\frac{\varepsilon ^{2}n^{2}}{m^{2}}\bigg (\sum _{i=1}^{m} (b_{i}^{j})^{2}\bigg )\\&\le \frac{m\varepsilon ^{2}}{(1-\varepsilon )^{2}}\bigg ( \frac{1}{m} \sum _{i=1}^{m} \big ((a_{i}^{j})^{2} + (b_{i}^{j})^{2}\big )\bigg ). \end{aligned}$$
Set \(Y_{n}\) to be the diagram with off-diagonal points \(\{y_{j}^n\}_{j=1}^J\). Using the pairing between \(Y\) and \(Y_{n}\) where we pair \(y_{j}\) with \(y_{j}^n\) we conclude that
$$\begin{aligned} d(Y,Y_{n})^{2}&\le \sum _{j=1}^J \Vert y_{j}-y_{j}^n\Vert ^{2}\\&\le \sum _{j=1}^J \frac{m\varepsilon ^{2}}{(1-\varepsilon )^{2}}\bigg ( \frac{1}{m} \sum _{i=1}^{m} \big ((a_{i}^{j})^{2} + (b_{i}^{j})^{2}\big )\bigg )\\&\le \frac{m\varepsilon ^{2}}{(1-\varepsilon )^{2}}F(Y). \end{aligned}$$
Set \(\delta = m\exp \big (-2\frac{\varepsilon ^{2} n}{m}\big )\) and solve for \(\varepsilon \). This provides the bound that with probability greater than \(1-\delta \)
$$\begin{aligned} d(Y,Y_{n})^{2} \le \frac{m^{2} F(Y)}{ 2 n} \ln \left( \frac{m}{\delta }\right) \frac{1}{(1-\varepsilon )^{2}}. \end{aligned}$$
For \(\varepsilon \in [0,0.25]\) it holds that \((1-\varepsilon )^{-2} <2\), and \(n \ge 8m \ln \frac{m}{\delta }\) implies \(\varepsilon \le 0.25\).

We want to show that \(Y_{n}\) is a local minimum for sufficiently small \(\varepsilon \). Indeed it will be the output of Algorithm 1 given the initializing diagram of \(Y\). Since \(Y\) is a local minimum, Proposition 3.2 implies that there is a ball around \(Y\), \(B(Y,r)\), such that for every diagram in \(B(Y,r)\) there is a unique optimal pairing with each \(Z_{i}\) which corresponds to the unique optimal pairing between \(Y\) and \(Z_{i}\). That is \(\phi _{X}^{Z_{i}} = \phi _{X}^{Y}\circ \phi _{Y}^{Z_{i}}\) for all \(X\in B(Y,r)\). For \(\varepsilon >0\) such that \(\frac{\varepsilon ^{2} m F(Y)}{(1-\varepsilon )^{2}}<r^{2}\) we have \(Y_{n}\in B(Y,r)\). Plugging in for \(\varepsilon \) results in \(\frac{m^{2} F(Y)}{n} \ln \left( \frac{m}{\delta }\right) < r^{2}\).

This implies that \(\phi _{Y_{n}}^{Z_{i}} = \phi _{Y_{n}}^{Y}\circ \phi _{Y}^{Z_{i}}\) is the unique optimal pairing between \(Y_{n}\) and \(Z_{i}\) for all \(i\) and hence \(\phi _{Y_{n}}^{X_k} = \phi _{Y_{n}}^{Y}\circ \phi _{Y}^{X_k}\) for each of the sample diagrams \(X_k\). If \(X_k=Z_{i}\) then
$$\begin{aligned} \phi _{Y_{n}}^{X_k}(y_{j}^n)=\phi _{Y_{n}}^{Y}\circ \phi _{Y}^{Z_{i}}(y_{j}^n) = \phi _{Y} ^{Z_{i}}(y_{j})=z_{i}^{j}. \end{aligned}$$
By construction \(y_{j}^n\) is the weighted arithmetic mean of the \(z_{i}^{j}\) (weighted by the \(\xi _{i}\)), and hence \(y_{j}^n\) is the arithmetic mean of the \(x_k^{j}\). By Theorem 3.3, \(Y_{n}\) is a local minimum. \(\square \)

The above theorem provides a (weak) law of large numbers result for the local minima computed from \(n\) persistence diagrams, but it does not ensure that the number of local minima stays bounded as \(n\) goes to infinity. The utility of such a convergence result would be limited if the number of local minima could not be bounded. The following lemma states that it is bounded.

Lemma 4.2

Let \(\rho =\frac{1}{m}\sum _{i=1}^{m} \delta _{Z_{i}}\) as before. Let \(\rho _{n}=\frac{1}{n}\sum _{k=1}^n \delta _{X_k}\) be the empirical measure of \(n\) points drawn iid from \(\rho \) and let \(F_{n}\) be the corresponding Fréchet function. The number of local minima of \(F_{n}\) is bounded by \(\prod _{i=1}^{m}(k_{i}+1)^{(k_{1}+k_2+\cdots +k_{m})}\), where \(k_{i}\) is the number of off-diagonal points in the \(i\)-th diagram. This bound is independent of \(n\).

Proof

Set \(Y_{n}\) as a local minimum of \(F_{n}\). This implies there are unique optimal pairings \(\phi _{i}\) between \(Y_{n}\) and \(X_{i}\) for each \(i\) and that any point \(y\) in \(Y_{n}\) is the arithmetic mean of \(\{\phi _{i}(y)\}\). Since the optimal pairing is unique, if \(X_{i}=X_{j}\) then \(\phi _{i}=\phi _{j}\). This in turn means that the pairings \(\phi _{i}\) are determined by which of the \(Z_{i}\) appear among the \(X_{j}\) (with multiplicity). This implies that the number of local minima is bounded by the number of different partitions of the points in \(\bigcup _{j} X_{j}\) into subsets such that each subset has exactly one point from each of the \(X_{j}\). The number of subsets is bounded by \(k_{1}+k_2+\cdots +k_{m}\) and for each subset there is a bound of \(\prod _{i=1}^{m}(k_{i}+1)\) on the choices of which element to take from each of the \(X_{i}\). Thus the number of different partitions is bounded by \(\prod _{i=1}^{m}(k_{i}+1)^{(k_{1}+k_2+\cdots +k_{m})}\). \(\square \)
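The count in the lemma is straightforward to evaluate (an illustrative helper with made-up diagram sizes):

```python
def local_minima_bound(ks):
    """Bound of Lemma 4.2 on the number of local minima of F_n:
    prod_i (k_i + 1) raised to the power (k_1 + ... + k_m),
    where ks lists the off-diagonal point counts of Z_1, ..., Z_m."""
    total = sum(ks)
    bound = 1
    for k in ks:
        bound *= (k + 1) ** total
    return bound

# three diagrams with 2, 3 and 1 off-diagonal points (made-up sizes)
b = local_minima_bound([2, 3, 1])
```

Even for these tiny diagrams the bound is in the hundreds of millions, so it is a finiteness statement rather than a practical enumeration target.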

We would like to discuss not only the convergence of local minima but also the convergence of the Fréchet means. We can do this in the case when there is a unique Fréchet mean.

Lemma 4.3

Let \(\rho =\frac{1}{m}\sum _{i=1}^{m} \delta _{Z_{i}}\) as before. Suppose further that the corresponding Fréchet function \(F\) has a unique minimum. Let \(\rho _{n}=\frac{1}{n}\sum _{k=1}^n \delta _{X_k}\) be the empirical measure of \(n\) points drawn iid from \(\rho \) and let \(F_{n}\) be the corresponding Fréchet function. Let \(\mathbf {Y}\) be the Fréchet mean of \(F\) and \(\mathbf {Y}_{n}\) the set of Fréchet means of \(F_{n}\). With probability \(1\) the Hausdorff distance between \(\mathbf {Y}_{n}\) and \(\mathbf {Y}\) goes to zero as \(n\) goes to infinity.

Proof

It is sufficient for us to show for each \(r>0\) that with probability \(1\) there is some \(N_{r}\) such that \(\mathbf {Y}_{n}\subset B( \mathbf {Y},r)\) for all \(n>N_{r}\).

Fix \(r>0\). Suppose there does not exist some \(N_{r}\) such that \(\mathbf {Y}_{n}\subset B( \mathbf {Y},r)\) for all \(n>N_{r}\). Then there is some sequence of \(W_{n_k}\in \mathbf {Y}_{n_k}\) such that \(d(W_{n_k}, \mathbf {Y})\ge r\). The set \(\{W_{n_k}\}\) is clearly bounded, off-diagonally birth–death bounded and uniform and hence precompact. This implies that \((W_{n_k})\) has a convergent subsequence \((W_{{n_k}_{j}})\). Let \(W\) denote the limit of this sequence. Since \(d(W_{{n_k}_{j}}, \mathbf {Y})\ge r\) for all \(j\) we have \(d(W, \mathbf {Y})\ge r\).

By the arguments in Proposition 2.6 there is some \(K\) independent of \(n\) such that \(F_{n}\) is \(K\)-Lipschitz in \(B(W,1)\) and hence \(|F_{{n_k}_{j}}(W_{{n_k}_{j}}) - F_{{n_k}_{j}}(W)|\le K d(W_{{n_k}_{j}},W)\) for large \(j\). Hence, for all \(\varepsilon >0\) we can say that \( F_{{n_k}_{j}}(W)\le F_{{n_k}_{j}}(W_{{n_k}_{j}}) + \varepsilon \) for sufficiently large \(j\).

The law of large numbers tells us that \(F_{n}(W)\rightarrow F(W)\) and \(F_{n}(\mathbf {Y})\rightarrow F(\mathbf {Y})\) as \(n \rightarrow \infty \) with probability \(1\). Hence for all \(\varepsilon >0\) we know that with probability \(1\) both \(F(W)\le F_{n}(W) +\varepsilon \) and \(F_{n}(\mathbf {Y})\le F(\mathbf {Y}) +\varepsilon \) for sufficiently large \(n\).

From our assumption that \(W_{{n_k}_{j}}\) is a Fréchet mean of \(F_{{n_k}_{j}}\) we know that \(F_{{n_k}_{j}}(W_{{n_k}_{j}})\le F_{{n_k}_{j}}(\mathbf {Y})\) for all \(j\).

Let \(\varepsilon >0\). Combining the inequalities above we conclude that with probability \(1\)
$$\begin{aligned} F(W)\le F_{{n_k}_{j}}(W) + \varepsilon \le F_{{n_k}_{j}}(W_{{n_k}_{j}}) + 2\varepsilon \le F_{{n_k}_{j}}(\mathbf {Y})+2\varepsilon \le F(\mathbf {Y}) + 3\varepsilon , \end{aligned}$$
for \(j\) sufficiently large. Since \(\varepsilon >0\) was arbitrary we obtain \(F(W)\le F(\mathbf {Y})\) which contradicts the uniqueness assumption about the Fréchet mean. \(\square \)

5 Persistence Diagrams of Random Gaussian Fields

In this section we illustrate, via simulation, the utility of our algorithm for computing means and variances of persistence diagrams. The idea is to show that persistence diagrams generated from a random Gaussian field concentrate around the diagonal, with the mean diagram moving closer to the diagonal as the number of diagrams averaged increases.

The persistence diagrams were computed from a random Gaussian field over the unit square using the procedure outlined in Sect. 3 of [1]. The field generated is a stationary, isotropic, and infinitely differentiable random field. The Gaussian field was set to have mean zero and covariance function \(R(p) = \exp (-\alpha \Vert p \Vert ^{2})\) with \(\alpha = 100\). A few hundred levels in the range of the realization of the field were taken, and for each level a simplicial complex was constructed. This was done by taking a fine grid on the unit square and including a vertex, edge or square in the complex if and only if the values of the field at the corresponding vertex or set of vertices were higher than the level. The complex grows as the level decreases, which provides the filtration from which the birth–death values of the diagram were computed. We obtained from Subag 10,000 random persistence diagrams generated as described above. These diagrams contain points with infinite persistence; we ignore these points. Using extended persistence in computing the diagrams would address this issue.
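A coarse version of this simulation can be sketched as follows (an illustrative sketch only: we sample the field by a dense Cholesky factorization on a small grid, whereas [1] describes the construction used for the paper's much finer grids; the grid size, jitter term and seed are our choices):

```python
import numpy as np

def sample_gaussian_field(n_grid=15, alpha=100.0, seed=0):
    """Sample a mean-zero stationary Gaussian field on an n_grid x n_grid
    grid over the unit square with covariance exp(-alpha * ||p - q||^2).
    Dense Cholesky is fine at this coarse resolution; [1] gives a
    construction suitable for much finer grids."""
    xs = np.linspace(0.0, 1.0, n_grid)
    gx, gy = np.meshgrid(xs, xs)
    pts = np.column_stack([gx.ravel(), gy.ravel()])
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
    cov = np.exp(-alpha * d2) + 1e-8 * np.eye(len(pts))  # jitter for stability
    L = np.linalg.cholesky(cov)
    z = np.random.default_rng(seed).standard_normal(len(pts))
    return (L @ z).reshape(n_grid, n_grid)

field = sample_gaussian_field()
# superlevel-set filtrations of `field` would then yield persistence diagrams
```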
Fig. 1

The top two rows plot the mean persistence diagram for dimension zero. Each figure contains four means computed from the number of diagrams specified in the figure title. Each mean is computed from a different random sample of diagrams and is plotted in a different color. The bottom two rows are the same plots for dimension one

In Fig. 1 we display the mean diagram of sets of \(2, 4, 8, 16, 32, 64, 128\) diagrams randomly drawn from the 10,000 diagrams. This is done for both dimensions zero and one. We wanted to see the Fréchet means converge as the number of diagrams being averaged increases. To quantify this concentration we took ten draws of \(2, 4, 8, 16, 32, 64, 128\) diagrams from the 10,000 diagrams and considered the distribution \(\frac{1}{10}\sum _{i=1}^{10}\delta _{X_{i}}\), where the \(X_{i}\) are the Fréchet means of each of the sets of samples. We then computed the variance of these distributions, as documented in Table 1.
Table 1

Variance of the sample Fréchet means

Number of samples    \(H_0\)     \(H_{1}\)
  2                  0.8353      0.9058
  4                  0.6295      0.6741
  8                  0.4429      0.5608
 16                  0.4356      0.4618
 32                  0.3165      0.3742
 64                  0.3362      0.2965
128                  0.3127      0.2233


6 Discussion

In this paper we introduce an algorithm for computing estimates of Fréchet means of a set of persistence diagrams. We demonstrate local convergence of this algorithm and provide a law of large numbers for the Fréchet mean computed on this set when the underlying measure has the form \(\rho = m^{-1} \sum _{i=1}^{m} \delta _{X_{i}}\), where the \(X_{i}\) are persistence diagrams. We believe that generically the Fréchet function has a unique global minimum, and hence a unique Fréchet mean, but this remains to be shown.

The work in this paper is a first step, and several extensions are needed. A law of large numbers when the underlying measure is not restricted to a combination of Dirac masses is clearly important. The results in our paper depend strongly on the \(L^{2}\)-Wasserstein metric; generalizing them to the other Wasserstein metrics used in computational topology is of central interest. The proofs and problem formulation in this paper are very constructive: the proofs and algorithms are developed for the specific examples and constructions we propose and are not meant to generalize to other metrics or variants of the algorithm. It would be of great interest to recast the core ideas of the algorithm and theory we developed in a more general framework using properties of abstract metric spaces and probability theory on such spaces.

Footnotes

  1. If both \(x\) and \(\phi (x)\) are the diagonal then this is the diagonal. If exactly one of \(x\) or \(\phi (x)\) is the diagonal then we replace it in this sum by the closest point on the diagonal to \(\phi (x)\) or \(x\), respectively.

  2. Terminology given by Gromov [9] that stands for Cartan, Alexandrov, and Toponogov.

Acknowledgments

SM and KT would like to thank Shmuel Weinberger for discussions and insight. SM and KT would also like to thank E. Subag for help in obtaining persistence diagrams computed from random Gaussian fields and for explaining the generative model. JH and YM are pleased to acknowledge support from grants DTRA: HDTRA1-08-BRCWMD, DARPA: D12AP00001, AFOSR: FA9550-10-1-0436, and NIH (Systems Biology): 5P50-GM081883. SM is pleased to acknowledge support from grants NIH (Systems Biology): 5P50-GM081883, AFOSR: FA9550-10-1-0436, and NSF CCF-1049290.

References

  1. Adler, R.J., Bobrowski, O., Borman, M.S., Subag, E., Weinberger, S.: Persistent homology for random fields and complexes. In: Berger, J.O., Cai, T.T., Johnstone, I.M. (eds.) Borrowing Strength: Theory Powering Applications—A Festschrift for Lawrence D. Brown, vol. 6. Institute of Mathematical Statistics, Beachwood (2010)
  2. Bendich, P., Mukherjee, S., Wang, B.: Local homology transfer and stratification learning. In: ACM-SIAM Symposium on Discrete Algorithms (2012)
  3. Bridson, M.R., Haefliger, A.: Metric Spaces of Non-positive Curvature. Springer-Verlag, Berlin (1999)
  4. Bubenik, P., Carlsson, G., Kim, P.T., Luo, Z.-M.: Statistical topology via Morse theory, persistence, and nonparametric estimation. In: Viana, M.A.G., Wynn, H.P. (eds.) Algebraic Methods in Statistics and Probability II. Contemporary Mathematics, vol. 516, pp. 75–92. American Mathematical Society, Providence (2010)
  5. Burago, Y., Gromov, M., Perel'man, G.: A.D. Alexandrov spaces with curvature bounded below. Russ. Math. Surv. 47(2), 1–58 (1992)
  6. Chazal, F., Cohen-Steiner, D., Lieutier, A.: A sampling theory for compact sets in Euclidean space. Discrete Comput. Geom. 41, 461–479 (2009)
  7. Cohen-Steiner, D., Edelsbrunner, H., Harer, J., Mileyko, Y.: Lipschitz functions have \({L}_p\)-stable persistence. Found. Comput. Math. 10, 127–139 (2010). doi:10.1007/s10208-010-9060-6
  8. Edelsbrunner, H., Harer, J.: Computational Topology: An Introduction. American Mathematical Society, Providence (2010)
  9. Gromov, M.: Hyperbolic groups. In: Gersten, S.M. (ed.) Essays in Group Theory. Mathematical Sciences Research Institute Publications, vol. 8, pp. 75–263. Springer, New York (1987)
  10. Kahle, M.: Topology of random clique complexes. Discrete Math. 309(6), 1658–1671 (2009)
  11. Kahle, M.: Random geometric complexes (2011). http://arxiv.org/abs/0910.1649
  12. Kahle, M., Meckes, E.: Limit theorems for Betti numbers of random simplicial complexes (2010). http://arxiv.org/abs/1009.4130v3
  13. Lott, J., Villani, C.: Ricci curvature for metric-measure spaces via optimal transport. Ann. Math. 169, 903–991 (2009)
  14. Lunagómez, S., Mukherjee, S., Wolpert, R.L.: Geometric representations of hypergraphs for prior specification and posterior sampling (2009). http://arxiv.org/abs/0912.3648
  15. Lytchak, A.: Open map theorem for metric spaces. St. Petersbg. Math. J. 17(3), 477–491 (2006)
  16. Mileyko, Y., Mukherjee, S., Harer, J.: Probability measures on the space of persistence diagrams. Inverse Probl. 27(12), 124007 (2011)
  17. Molchanov, I.: Theory of Random Sets. Springer, London (2005)
  18. Munkres, J.: Algorithms for the assignment and transportation problems. J. Soc. Ind. Appl. Math. 5(1), 32–38 (1957)
  19. Niyogi, P., Smale, S., Weinberger, S.: Finding the homology of submanifolds with high confidence from random samples. Discrete Comput. Geom. 39, 419–441 (2008)
  20. Niyogi, P., Smale, S., Weinberger, S.: A topological view of unsupervised learning from noisy data. Manuscript (2008)
  21. Ohta, S.: Barycenters in Alexandrov spaces with curvature bounded below. Adv. Geom. 12, 571–587 (2012)
  22. Penrose, M.D.: Random Geometric Graphs. Oxford University Press, New York (2003)
  23. Penrose, M.D., Yukich, J.E.: Central limit theorems for some graphs in computational geometry. Ann. Appl. Probab. 11(4), 1005–1041 (2001)
  24. Petrunin, A.: Semiconcave functions in Alexandrov's geometry. Surv. Differ. Geom. 11, 137–201 (2007)
  25. Sturm, K.-T.: Probability measures on metric spaces of nonpositive curvature. In: Auscher, P., Coulhon, T., Grigoryan, A. (eds.) Heat Kernels and Analysis on Manifolds, Graphs, and Metric Spaces, vol. 338. American Mathematical Society, Providence (2002)

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Katharine Turner, Department of Mathematics, University of Chicago, Chicago, USA
  • Yuriy Mileyko, Department of Mathematics, University of Hawaii at Manoa, Honolulu, USA
  • Sayan Mukherjee, Departments of Statistical Science, Computer Science, and Mathematics, and Institute for Genome Sciences & Policy, Duke University, Durham, USA
  • John Harer, Departments of Mathematics, Computer Science, and Electrical and Computer Engineering, and Center for Systems Biology, Duke University, Durham, USA
