Crackle: The Homology of Noise
Abstract
We study the homology of simplicial complexes built via deterministic rules from a random set of vertices. In particular, we show that, depending on the randomness that generates the vertices, the homology of these complexes can either become trivial as the number \(n\) of vertices grows, or can contain more and more complex structures. The different behaviours are consequences of different underlying distributions for the generation of vertices, and we consider three illustrative examples, when the vertices are sampled from Gaussian, exponential, and power-law distributions in \({\mathbb {R}}^d\). We also discuss consequences of our results for manifold learning with noisy data, describing the topological phenomena that arise in this scenario as “crackle”, in analogy to audio crackle in temporal signal analysis.
Keywords
Čech complex · Random complexes · Persistent homology · Random Betti numbers
1 Introduction
This paper treats the homology of simplicial complexes built via deterministic rules from a random set of vertices. In particular, it shows that, depending on the randomness that generates the vertices, the homology of these complexes can either become trivial as the sample size grows, or can contain more and more complex structures.
The motivation for these results comes from applications of topological tools for pattern analysis, object identification, and especially for the analysis of data sets. Typically, one starts with a collection of points and forms some simplicial complexes associated to these, and then takes their homology. For example, the \(0\)-dimensional homology of such complexes can be interpreted as a version of clustering. The basic philosophy behind this attempt is that topology has an essentially qualitative nature and should therefore be robust with respect to small perturbations. Some recent references are [2, 3, 9, 15, 19] with two reviews, from different aspects, in [1] and [12]. Many of these papers find their raison d’être in essentially statistical problems, in which data generate the structures.
In order to be able, eventually, to extend the work in [17] beyond Gaussian noise, and make more concrete statements about the probabilistic features of the homology this extension generates, it is necessary to first focus on the behaviour of samples generated by pure noise, with no underlying manifold. In this case, thinking of the above setup, the manifold \(\mathcal {M}\) is simply the point at the origin, and the homology that we shall be trying to recapture is trivial. Nevertheless, we shall see that differing noise models can make this task extremely delicate, regardless of sample size.
1.1 Summary of Results
 (1)
The \(0\)-simplices of \(\check{C}(\mathcal {X},\varepsilon )\) are the points in \(\mathcal {X}\),
 (2)
An \(n\)-simplex \(\sigma =[x_{i_0},\ldots ,x_{i_n}]\) is in \(\check{C}(\mathcal {X},\varepsilon )\) if \(\bigcap _{k=0}^{n} B_{x_{i_k}}\!(\varepsilon ) \ne \emptyset \).
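To make the definition concrete, the following small sketch (illustrative only; the function names are our own, not from the paper) builds the \(1\)-skeleton of a Čech complex — an edge appears exactly when two \(\varepsilon \)-balls intersect, i.e. when two sample points lie within \(2\varepsilon \) of each other — and computes \(\beta _0\), the number of connected components, with a union–find pass:

```python
import math
from itertools import combinations

def cech_one_skeleton(points, eps):
    """Edges of the Cech complex: {x, y} is a 1-simplex iff the
    eps-balls around x and y intersect, i.e. dist(x, y) <= 2*eps."""
    edges = []
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        if math.dist(p, q) <= 2 * eps:
            edges.append((i, j))
    return edges

def betti_zero(n_points, edges):
    """beta_0 = number of connected components, via union-find."""
    parent = list(range(n_points))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(i) for i in range(n_points)})

pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
edges = cech_one_skeleton(pts, 0.6)   # balls of radius 0.6
print(betti_zero(len(pts), edges))    # -> 2: the two left points connect
```

Note that for simplices of dimension \(\ge 2\) the Čech condition requires a common point of all the balls (equivalently, a smallest enclosing ball of radius \(\le \varepsilon \)), which pairwise checks alone do not capture; the sketch above covers only the \(1\)-skeleton.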
Beyond the core, the topology is more varied. For fixed \(n\), there may be additional isolated components, but these are no longer placed densely enough to connect with one another and form a contractible set. Indeed, we shall show that the individual components will typically have non-trivial homology. Thus, in this region, the topology of the Čech complex is highly non-trivial, and many homology elements of different orders appear. We call this phenomenon “crackling”, akin to the well-known phenomenon caused by noise interference in audio signals and commonly referred to as crackle.
As we already mentioned, the Gaussian distribution is fundamentally different from the other two, and does not lead to crackling. In Sect. 2.4 we show that, for the Gaussian distribution, there are hardly any points located outside the core. Thus, as \(n\rightarrow \infty \), the union of balls around the sample points becomes a giant contractible ball of radius of order \(\sqrt{2\log n}\).
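The \(\sqrt{2\log n}\) radius is easy to check numerically. The following simulation (an illustrative sanity check under our own choices of \(n\) and \(d\), not part of the paper's argument) compares the farthest of \(n\) standard Gaussian points in \({\mathbb {R}}^2\) against the predicted radius:

```python
import math
import random

random.seed(0)  # fixed seed so the experiment is reproducible

def max_gaussian_norm(n, d=2):
    """Largest Euclidean norm among n i.i.d. standard Gaussian points in R^d."""
    return max(
        math.sqrt(sum(random.gauss(0.0, 1.0) ** 2 for _ in range(d)))
        for _ in range(n)
    )

n = 100_000
predicted = math.sqrt(2 * math.log(n))   # ~ 4.80
observed = max_gaussian_norm(n)
print(f"predicted ~ {predicted:.2f}, observed ~ {observed:.2f}")
```

For large \(n\) the observed maximum concentrates tightly around the prediction, which is why essentially no sample points fall outside the Gaussian core.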
It is now possible to understand a little better how the results of this paper relate to the noisy manifold learning problem discussed above. For example, if the distribution of the noise is Gaussian, our results imply that if the manifold is well behaved, and the sample size is moderate, noise outliers should not significantly interfere with homology recovery, since Gaussian noise does not introduce artificial homology elements, even with large samples. However, there is a delicate counterbalance here between “moderate” and “large”. Once the sample size is large, the core is also large, and the reconstructed manifold will have the topology of \(\mathcal {M}\oplus B_{O(\sqrt{2\log n})}(0)\), where \(\oplus \) is Minkowski addition. As \(n\) grows, the core will eventually envelop any compact manifold, and thus the homology of \(\mathcal {M}\) will be hidden by that of the core.
On the other hand, if the distribution of the noise is power-law or exponential, then noise outliers will typically generate extraneous homology elements that, for almost any sample size, will complicate the estimation of the original manifold. Furthermore, increasing the sample size in no way solves this problem. Note that this issue is in addition to the fact that increasing the sample size will, as in the Gaussian case, create the problem of a large core concealing the topology of \(\mathcal {M}\).
Thus, from a practical point of view, the message of this paper is that outliers cause problems in manifold estimation when noise is present, a fact well known to all practitioners who have worked in the area. What is qualitatively new here is a quantification of how this happens, and how it relates to the distribution of the noise. We do not attempt to solve the estimation problem here, but unfortunately it follows from the results of this paper that algorithms for handling outliers will probably require knowing at least the tail behaviour of the error distribution, which is information that, in practical situations, one does not generally want to assume as prior knowledge.
1.2 On Persistence Intervals
While the above discussion has concentrated on the persistence of noise-induced crackle as sample sizes grow, and the regions in \({\mathbb {R}}^d\) in which different types of homology appear, the proofs below also yield information about the more classical persistence diagrams of topological data analysis (cf. [8, 10, 11, 12]).
For example, in the two cases for which crackle persists—the power-law and exponential cases—estimates of the type appearing in Sect. 3 indicate that, with high probability, there exist extremely long bars in the bar code representation of persistent homology. Up to lower order corrections, preliminary calculations show that bar lengths for the \(k\)th homology can be as large as \(O(n^{a_k})\) for the power-law case, and \(b_k (\log \log n)\) for the exponential case, for appropriate \(a_k\) and \(b_k\). More detailed studies of these phenomena will appear in a later publication.
1.3 Poisson Processes
Although we have described everything so far in terms of a random sample \(\mathcal {X}\) of \(n\) points taken from a density \(f\), there is another way to approach the results of this paper, and that is to replace the points of \(\mathcal {X}\) with the points of a \(d\)-dimensional Poisson process \(\mathcal {P}_n\) whose intensity function is given by \(\lambda _n = n f\). In this case the number of points is no longer fixed, but has mean \(n\). Similarly to many phenomena in random geometric graphs (see [18]), the results of this paper hold without any change if we replace \(\mathcal {X}\) by \(\mathcal {P}_n\).
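This Poissonization can be sketched in a few lines (an illustrative sketch under our own naming; the single-point sampler standing in for \(f\) is hypothetical): draw the random number of points \(N \sim \mathrm{Poisson}(n)\), then draw \(N\) i.i.d. points from \(f\).

```python
import math
import random

random.seed(1)

def sample_poisson(mean):
    """Knuth's method: count how many uniforms can be multiplied in
    before the running product drops below e^{-mean}.
    Adequate for moderate means; not for mean >> 700."""
    limit = math.exp(-mean)
    k, prod = 0, random.random()
    while prod > limit:
        prod *= random.random()
        k += 1
    return k

def poisson_process_points(n, sampler):
    """Points of a Poisson process with intensity n*f, where `sampler`
    draws one point from the density f (here a stand-in, not from the paper)."""
    return [sampler() for _ in range(sample_poisson(n))]

# Stand-in density f: a standard Gaussian point in R^2.
pts = poisson_process_points(50, lambda: (random.gauss(0, 1), random.gauss(0, 1)))
print(len(pts))  # random; its distribution is Poisson with mean 50
```

The only change from the fixed-sample setup is that `len(pts)` fluctuates around \(n\) instead of equalling it.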
1.4 Disclaimers
Before starting the paper in earnest, and so as not to be accused of myopia, we note that the subject of manifold learning is obviously much broader than that described above, and algorithms for “estimating” an underlying manifold from a finite sample abound in the statistics and computer science literatures. Very few of them, however, take the algebraic point of view that we and the literature quoted above take. Furthermore, we note that other important results about the homology of Rips and Čech complexes for various distributions can be found in the papers [4, 5, 6, 13, 14]. While the methods and emphases of these papers are rather different, they demonstrate phenomena similar to the ones in this paper. The study of random geometric complexes typically concentrates on situations for which the number of points (\(n\)) goes to infinity and the radius (\(r_n\)) involved in defining the complexes goes to zero. Decreasing the radius \(r_n\) plays a similar role to increasing \(R_n\), as treated in this paper. Both actions result in making the complex sparser. For example, if \(r_n \rightarrow 0\) relatively slowly (\(r_n = \Omega ((\log n/n)^{1/d})\)), the entire complex behaves like the “core” discussed earlier. On the other hand, if \(r_n\rightarrow 0\) fast enough (\(r_n = o(n^{-1/d})\)), then the entire complex behaves like “crackle”. For more details see [13].
2 Results
In this section we shall present all our main results, along with some discussion, more technical than that of the Introduction. Recall from Sect. 1.3 that although we present all results for the point set \(\mathcal {X}\), they also hold if we replace the points of \(\mathcal {X}\) by the points of an appropriate Poisson process. All proofs are deferred to Sect. 3.
2.1 The Core of Distributions with Unbounded Support
Theorem 1
Theorem 1 implies that the core size has a completely different order of magnitude for each of the three distributions. The heavy-tailed, power-law distribution has the largest core, while the core of the Gaussian distribution is the smallest. While Theorem 1 provides a lower bound on the size of the core, the results in Theorems 2, 3 and 4 indicate the existence of an equivalent upper bound. In fact we believe that the upper bound would differ from the lower bound in Theorem 1 only by a constant, but this will not be pursued in this paper. In the following sections we shall study the behaviour of the Čech complex outside the core.
2.2 How Power-Law Noise Crackles
Theorem 2
Corollary 1

\([R_{0,n}^\varepsilon ,\infty )\)—there are hardly any points (\(\beta _k\sim 0\), \(0\le k \le d-1\)).

\([R_{0,n},R_{0,n}^\varepsilon )\)—points start to appear, and \(\beta _0\sim \mu _{\mathrm {p},0}\). The points are very few and scattered, so no cycles are generated (\(\beta _k \sim 0\), \(1\le k \le d-1\)).

\([R_{1,n}^\varepsilon ,R_{0,n})\)—the number of components grows to infinity, but no cycles are formed yet (\(\beta _0 \sim \infty \), and \(\beta _k = 0\), \(1 \le k \le d-1\)).

\([R_{1,n},R_{1,n}^\varepsilon )\)—a finite number of \(1\)-dimensional cycles show up, among the infinite number of components (\(\beta _0 \sim \infty \), \(\beta _1\sim \mu _{\mathrm {p},1}\), and \(\beta _k = 0\), \(2 \le k \le d-1\)).

\([R_{2,n}^\varepsilon ,R_{1,n})\)—we have \(\beta _0\sim \infty \), \(\beta _1\sim \infty \), and \(\beta _k\sim 0\) for \(2\le k\le d-1\).

\([R_{d-1,n},R_{d-1,n}^\varepsilon )\)—we have \(\beta _{d-1}\sim \mu _{\mathrm {p},d-1}\) and \(\beta _k\sim \infty \) for \(0\le k \le d-2\).

\([R_n^{\mathrm {c}},R_{d-1,n})\)—just before we reach the core, the complex exhibits the most intricate structure, with \(\beta _k \sim \infty \) for \(0\le k \le d-1\).
2.3 How Exponential Noise Crackles
In this section we focus on the exponential density function \(f=f_{\mathrm {e}}\). The results in this section are very similar to those for the power-law distribution, and we shall describe them only briefly. The differences lie in the specific values of the \(R_{k,n}\) and in the terms appearing in the limit formulae.
Theorem 3
Corollary 2
As in the power-law case, Theorem 3 implies the same “layered” behaviour, the only difference being in the values of \(R_{k,n}\). Examining the values of \(R_n^{\mathrm {c}}\) and \(R_{k,n}\), it is reasonable to guess that the phase transition in the exponential case occurs at \(R_n = \log n\).
2.4 Gaussian Noise Does Not Crackle
Theorem 4
Note that in the Gaussian case \(\lim _{n\rightarrow \infty }\big ( R_{0,n}^\varepsilon - R_n^{\mathrm {c}}\big ) = 0\). This implies that, as \(n\rightarrow \infty \), all that remains is the contractible core, with hardly anything outside it. In other words, the ball placed around every new point added to the sample immediately connects to the core, and thus Gaussian noise does not crackle.
3 Proofs
We now turn to proofs, starting with the proof of the main result of Sect. 2.1.
3.1 The Core
Proof
3.2 Crackle: Notation and General Lemmas
Lemma 1
Proof
Lemma 2
Proof
 1.For every \(k > 0\),$$\begin{aligned} \lim _{n\rightarrow \infty }n^{-k} \left( {\begin{array}{c}n\\ k\end{array}}\right) = \frac{1}{k!} \end{aligned}$$(3.5)
 2.For every sequence \(a_n\rightarrow 0\) and \(k\ge 0\),$$\begin{aligned} \lim _{n\rightarrow \infty }\frac{(1-a_n)^{n-k}}{\mathrm{e}^{-na_n}} = 1 \end{aligned}$$(3.6)
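Both elementary limits (3.5) and (3.6) can be verified numerically; the following sketch (an illustrative sanity check with parameter values of our own choosing, not part of the proofs) evaluates the two ratios at a large \(n\):

```python
import math

def binom_ratio(n, k):
    """n^{-k} * C(n, k), which tends to 1/k! as n -> infinity (limit (3.5))."""
    return math.comb(n, k) / n ** k

def exp_ratio(n, k, a_n):
    """(1 - a_n)^{n-k} / e^{-n*a_n}, which tends to 1 as a_n -> 0 (limit (3.6))."""
    return (1 - a_n) ** (n - k) / math.exp(-n * a_n)

n = 10**6
print(abs(binom_ratio(n, 3) - 1 / math.factorial(3)))  # small
print(abs(exp_ratio(n, 3, n ** -0.75) - 1))            # small
```

Here we take \(a_n = n^{-3/4}\) as an example of a sequence tending to \(0\) slowly enough that \(n a_n \to \infty \) while \(n a_n^2 \to 0\), so the ratio in (3.6) visibly approaches \(1\).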
3.3 Crackle: The Power Law Distribution
In this section we prove the results in Sect. 2.2. First, we need a few lemmas.
Lemma 3
Proof
Lemma 4
Proof
Lemma 5
Proof
The proof is very similar to the proof of Lemma 4. We need only replace \(T_k\) with an indicator function that tests whether a subcomplex generated by \(k+3\) points is connected. The exact value of \(\hat{\mu }_{\mathrm {p},k}\) will not be needed anywhere.\(\square \)
We can now prove Theorem 2.
Proof of Theorem 2 To prove the limit for \(\beta _{0,n}\) simply combine Lemma 3 with the inequality (3.2). To prove the limit for \(\beta _{k,n}\), \(k\ge 1 \), combine Lemmas 4 and 5 with the inequality (3.3).\(\square \)
3.4 Crackle: The Exponential Distribution
In this section we wish to prove Theorem 3. We start with the following lemmas.
Lemma 6
Proof
Lemma 7
Proof
Lemma 8
Proof
As in the proof of Lemma 5, we now mimic the proof of Lemma 7, replacing \(T_k\) with an indicator function that tests whether a subcomplex generated by \(k+3\) points is connected.\(\square \)
Proof (Proof of Theorem 3)
The proof follows the same steps as the proof of Theorem 2.\(\square \)
3.5 Crackle: The Gaussian Distribution
In this section we prove Theorem 4.
Proof (Proof of Theorem 4)
Acknowledgments
Adler and Bobrowski were supported in part by AFOSR FA8655-11-1-3039. Weinberger was supported in part by AFOSR FA9550-11-1-0216. The authors would like to thank Yuliy Baryshnikov, Matthew Strom Borman, Matthew Kahle, and Katherine Turner for a number of interesting and useful conversations, and Peter Landweber for useful comments on the first revision of this paper.
References
1. Adler, R.J., Bobrowski, O., Borman, M.S., Subag, E., Weinberger, S.: Persistent homology for random fields and complexes. In: Borrowing Strength: Theory Powering Applications—A Festschrift for Lawrence D. Brown, pp. 124–143. Institute of Mathematical Statistics, Beachwood (2010)
2. Aronshtam, L., Linial, N., Łuczak, T., Meshulam, R.: Collapsibility and vanishing of top homology in random simplicial complexes. Discrete Comput. Geom. 49(2), 317–334 (2013)
3. Babson, E., Hoffman, C., Kahle, M.: The fundamental group of random 2-complexes. J. Am. Math. Soc. 24(1), 1–28 (2011)
4. Bobrowski, O.: Algebraic Topology of Random Fields and Complexes. Ph.D. thesis, Faculty of Electrical Engineering, Technion–Israel Institute of Technology (2012). http://www.graduate.technion.ac.il/Theses/Abstracts.asp?Id=26908
5. Bobrowski, O., Adler, R.J.: Distance functions, critical points, and topology for some random complexes (2011). arXiv:1107.4775
6. Bobrowski, O., Mukherjee, S.: The topology of probability distributions on manifolds. Probab. Theory Relat. Fields 1–36 (2014). doi:10.1007/s00440-014-0556-x
7. Borsuk, K.: On the imbedding of systems of compacta in simplicial complexes. Fundam. Math. 35(1), 217–234 (1948)
8. Carlsson, G.: Topology and data. Bull. Am. Math. Soc. 46(2), 255–308 (2009)
9. Cohen, D.C., Farber, M., Kappeler, T.: The homotopical dimension of random 2-complexes (2010). arXiv:1005.3383
10. Edelsbrunner, H., Harer, J.: Persistent homology—a survey. Contemp. Math. 453, 257–282 (2008)
11. Edelsbrunner, H., Harer, J.L.: Computational Topology: An Introduction. American Mathematical Society, Providence, RI (2010)
12. Ghrist, R.: Barcodes: the persistent topology of data. Bull. Am. Math. Soc. 45(1), 61–75 (2008)
13. Kahle, M.: Random geometric complexes. Discrete Comput. Geom. 45(3), 553–573 (2011)
14. Kahle, M., Meckes, E.: Limit theorems for Betti numbers of random simplicial complexes. Homol. Homotopy Appl. 15(1), 343–374 (2013)
15. Meshulam, R., Wallach, N.: Homological connectivity of random k-dimensional complexes. Random Struct. Algorithms 34(3), 408–417 (2009)
16. Niyogi, P., Smale, S., Weinberger, S.: Finding the homology of submanifolds with high confidence from random samples. Discrete Comput. Geom. 39(1–3), 419–441 (2008)
17. Niyogi, P., Smale, S., Weinberger, S.: A topological view of unsupervised learning from noisy data. SIAM J. Comput. 40(3), 646–663 (2011)
18. Penrose, M.: Random Geometric Graphs. Oxford Studies in Probability, vol. 5. Oxford University Press, Oxford (2003)
19. Pippenger, N., Schleich, K.: Topological characteristics of random triangulated surfaces. Random Struct. Algorithms 28(3), 247–288 (2006)