Skip to main content

Functional summaries of persistence diagrams

Abstract

One of the primary areas of interest in applied algebraic topology is persistent homology, and, more specifically, the persistence diagram. Persistence diagrams have also become objects of interest in topological data analysis. However, persistence diagrams do not naturally lend themselves to statistical goals, such as inferring certain population characteristics, because their complicated structure makes common algebraic operations—such as addition, division, and multiplication—challenging (e.g., the mean might not be unique). To bypass these issues, several functional summaries of persistence diagrams have been proposed in the literature (e.g. landscape and silhouette functions). The problem of analyzing a set of persistence diagrams then becomes the problem of analyzing a set of functions, which is a topic that has been studied for decades in statistics. First, we review the various functional summaries in the literature and propose a unified framework for the functional summaries. Then, we generalize the definition of persistence landscape functions, establish several theoretical properties of the persistence functional summaries, and demonstrate and discuss their performance in the context of classification using simulated prostate cancer histology data, and two-sample hypothesis tests comparing human and monkey fibrin images, after developing a simulation study using a new data generator we call the Pickup Sticks Simulator (STIX).

This is a preview of subscription content, access via your institution.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15

Notes

  1. 1.

    Actually, as long as we have a filtration, we can define a persistence diagram.

  2. 2.

    Some literature considers the lower level set \(f^{-1}((-\infty , \lambda ))\). Both definitions are valid and here we use upper level sets because they yield a very straight forward definition when we consider the image data or certain functions estimated from a point cloud, such as kernel density estimates (Wasserman 2006).

  3. 3.

    Adams et al. (2017) also presented conditions under which persistence images are stable with respect to changes in the corresponding persistence diagrams.

  4. 4.

    A finitely discrete probability measure Q puts probability mass only finitely many points in \({\mathcal {B}}_F\).

  5. 5.

    While Fasy et al. (2018) will release a version of the gland generator that is fully verified with a pathologist, the preliminary version of this software that we used for the experiments in this paper will be made available upon request.

  6. 6.

    R (R Core Team 2017) is used to produce the STIX images, and the thickness t of the segments are set by the lwd plotting option.

  7. 7.

    The p-Wasserstein distance between persistence diagrams \(D_1\) and \(D_2\) is \(d_p(D_1, D_2) = \left( \inf _{\phi :D_1 \rightarrow D_2}\sum _{d\in D_1}\Vert d - \phi (d)\Vert _p^p\right) ^{1/p}\), where \(\phi \) is a bijection between the two persistence diagrams. See Robinson and Turner (2017) for more details.

  8. 8.

    This is important to ensure that the tests do not consistently incorrectly reject the null hypothesis.

References

  1. Abdollahi, A., Meysamie, A., Sheikhbahaei, S., Ahmadi, A., Tabriz, H.M., Bakhshandeh, M., Hosseinzadeh, H.: Inter/intra-observer reproducibility of Gleason scoring in prostate adenocarcinoma in Iranian pathologists. Urol. J. 9(2), 486–490 (2012)

    Google Scholar 

  2. Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., Chepushtanova, S., Hanson, E., Motta, F., Ziegelmeier, L.: Persistence images: a stable vector representation of persistent homology. J. Mach. Learn. Res. 18(1), 218–252 (2017)

    MathSciNet  MATH  Google Scholar 

  3. Adcock, A., Carlsson, E., Carlsson, G.: The ring of algebraic functions on persistence bar codes. Homol. Homotopy Appl. 18(1), 381–402 (2016)

    MathSciNet  MATH  Google Scholar 

  4. Adler, R.J.: On excursion sets, tube formulas and maxima of random fields. Ann. Appl. Probab. 10, 1–74 (2000)

    MathSciNet  MATH  Google Scholar 

  5. Adler, R.J., Agami, S.: Modelling persistence diagrams with planar point processes, and revealing topology with bagplots. J. Appl. Comput. Topol. 3(3), 139–183 (2019)

    MathSciNet  MATH  Google Scholar 

  6. Adler, R.J., Bobrowski, O., Borman, M.S., Subag, E., Weinberger, S.: Persistent homology for random fields and complexes. In: Borrowing Strength: Theory Powering Applications—A Festschrift for Lawrence D. Brown, Vol. 6, pp. 124–143. Institute of Mathematical Statistics (2010)

  7. Bendich, P., Marron, J.S., Miller, E., Pieloch, A., Skwerer, S.: Persistent homology analysis of brain artery trees. Ann. Appl. Stat. 10(1), 198 (2016)

    MathSciNet  Google Scholar 

  8. Biscio, C., Møller, J.: The accumulated persistence function, a new useful functional summary statistic for topological data analysis, with a view to brain artery trees and spatial point process applications. arXiv preprint arXiv:1611.00630 (2016)

  9. Bubenik, P.: Statistical topological data analysis using persistence landscapes. J. Mach. Learn. Res. 16(1), 77–102 (2015)

    MathSciNet  MATH  Google Scholar 

  10. Campbell, R.A., Overmyer, K.A., Selzman, C.H., Sheridan, B.C., Wolberg, A.S.: Contributions of extravascular and intravascular cells to fibrin network formation, structure, and stability. Blood 114(23), 4886–4896 (2009)

    Google Scholar 

  11. Carlsson, G.: Topology and data. Bull. Am. Math. Soc. 46(2), 255–308 (2009)

    MathSciNet  MATH  Google Scholar 

  12. Carlsson, G., Zomorodian, A., Collins, A., Guibas, L.J.: Persistence barcodes for shapes. Int. J. Shape Model. 11(2), 149–187 (2005)

  13. Center, M.M., Jemal, A., Lortet-Tieulent, J., Ward, E., Ferlay, J., Brawley, O., Bray, F.: International variation in prostate cancer incidence and mortality rates. Eur. Urol. 61(6), 1079–1092 (2012)

    Google Scholar 

  14. Chazal, F., Fasy, B.T., Lecci, F., Rinaldo, A., Wasserman, L.: Stochastic convergence of persistence landscapes and silhouettes. In: Proceedings of the Thirtieth Annual Symposium on Computational Geometry, p. 474. ACM (2014)

  15. Chen, Y.-C., Wang, D., Rinaldo, A., Wasserman, L.: Statistical analysis of persistence intensity functions. arXiv preprint arXiv:1510.02502 (2015)

  16. Ciollaro, M., Genovese, C., Lei, J., Wasserman, L.: The functional mean-shift algorithm for mode hunting and clustering in infinite dimensions. arXiv preprint arXiv:1408.1187 (2014)

  17. Cisewski, J., Croft, R.A., Freeman, P.E., Genovese, C.R., Khandai, N., Ozbek, M., Wasserman, L.: Non-parametric 3D map of the intergalactic medium using the Lyman-alpha forest. Mon. Not. R. Astron. Soc. 440(3), 2599–2609 (2014)

    Google Scholar 

  18. Cohen-Steiner, D., Edelsbrunner, H., Harer, J.: Stability of persistence diagrams. Discrete Comput. Geom. 37(1), 103–120 (2007)

    MathSciNet  MATH  Google Scholar 

  19. Crawford, L., Monod, A., Chen, A.X., Mukherjee, S., Rabadán, R.: Functional data analysis using a topological summary statistic: the smooth Euler characteristic transform. arXiv preprint arXiv:1611.06818 (2016a)

  20. Crawford, L., Monod, A., Chen, A.X., Mukherjee, S., Rabadán, R.: Topological summaries of tumor images improve prediction of disease free survival in glioblastoma multiforme. arXiv preprint arXiv:1611.06818 (2016b)

  21. Degras, D.A.: Simultaneous confidence bands for nonparametric regression with functional data. Stat. Sin. 21, 1735–1765 (2011)

    MathSciNet  MATH  Google Scholar 

  22. Edelsbrunner, H., Harer, J.: Persistent homology—a survey. Contemp. Math. 453, 257–282 (2008)

    MathSciNet  MATH  Google Scholar 

  23. Edelsbrunner, H., Letscher, D., Zomorodian, A.: Topological persistence and simplification. Discrete Comput. Geom. 28(4), 511–533 (2002)

    MathSciNet  MATH  Google Scholar 

  24. Edelsbrunner, H., Morozov, D.: Persistent homology: theory and practice. In: Proceedings of the European Congress of Mathematics, pp. 31–50 (2012)

  25. Engers, R.: Reproducibility and reliability of tumor grading in urological neoplasms. World J. Urol. 25(6), 595–605 (2007)

    Google Scholar 

  26. Epstein, J.I., Egevad, L., Amin, M.B., Delahunt, B., Srigley, J.R., Humphrey, P.A., Committee, G.: The 2014 International Society of Urological Pathology (ISUP) consensus conference on Gleason grading of prostatic carcinoma: definition of grading patterns and proposal for a new grading system. Am. J. Surg. Pathol. 40(2), 244–252 (2016)

    Google Scholar 

  27. Evans, S.M., Patabendi Bandarage, V., Kronborg, C., Earnest, A., Millar, J., Clouston, D.: Gleason group concordance between biopsy and radical prostatectomy specimens: a cohort study from Prostate Cancer Outcome Registry-Victoria. Prostate Int. 4(4), 145–151 (2016)

    Google Scholar 

  28. Fasy, B.T., Lecci, F., Rinaldo, A., Wasserman, L., Balakrishnan, S., Singh, A.: Confidence sets for persistence diagrams. Ann. Stat. 42(6), 2301–2339 (2014)

    MathSciNet  MATH  Google Scholar 

  29. Fasy, B.T., Payne, S., Schenfish, A., Schupback, J., Stouffer, N.: Simulating prostate cancer slide scans (2018) (forthcoming)

  30. Ferlay, J., Soerjomataram, I., Dikshit, R., Eser, S., Mathers, C., Rebelo, M., Parkin, D.M., Forman, D., Bray, F.: Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int. J. Cancer 136(5), E359–E386 (2015)

    Google Scholar 

  31. Ferraty, F., Vieu, P.: Nonparametric Functional Data Analysis: Theory and Practice. Springer, New York (2006)

    MATH  Google Scholar 

  32. Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics, vol. 1. Springer, New York (2001)

    MATH  Google Scholar 

  33. Gameiro, M., Mischaikow, K., Kalies, W.: Topological characterization of spatial-temporal chaos. Phys. Rev. E 70(3), 035203 (2004)

    MathSciNet  Google Scholar 

  34. Ghrist, R.: Barcodes: the persistent topology of data. Bull. Am. Math. Soc. 45(1), 61–75 (2008)

    MathSciNet  MATH  Google Scholar 

  35. Ghrist, R.W.: Elementary Applied Topology. Createspace, Seattle (2014)

    MATH  Google Scholar 

  36. Goodman, M., Ward, K.C., Osunkoya, A.O., Datta, M.W., Luthringer, D., Young, A.N., Marks, K., Cohen, V., Kennedy, J.C., Haber, M.J., Amin, M.B.: Frequency and determinants of disagreement and error in gleason scores: a population-based study of prostate cancer. Prostate 72(13), 1389–1398 (2012)

    Google Scholar 

  37. Helpap, B., Kristiansen, G., Beer, M., Köllermann, J., Oehler, U., Pogrebniak, A., Fellbaum, C.: Improving the reproducibility of the gleason scores in small foci of prostate cancer—suggestion of diagnostic criteria for glandular fusion. Pathol. Oncol. Res. 18(3), 615–621 (2012)

    Google Scholar 

  38. Humphrey, P.A.: Gleason grading and prognostic factors in carcinoma of the prostate. Mod. Pathol. 17(3), 292–306 (2004)

    MathSciNet  Google Scholar 

  39. Ieva, F., Paganoni, A., Pigoli, D., Vitelli, V.: Multivariate functional clustering for the analysis of ECG curves morphology. In: Cladag 2011 (8th International Meeting of the Classification and Data Analysis Group), pp. 1–4 (2011)

  40. Jacques, J., Preda, C.: Functional data clustering: a survey. Adv. Data Anal. Classif. 8(3), 231–255 (2014)

    MathSciNet  MATH  Google Scholar 

  41. Kerber, M., Morozov, D., Nigmetov, A.: Geometry helps to compare persistence diagrams. J. Exp. Algorithm. JEA 22(1), 1–4 (2017)

    MathSciNet  MATH  Google Scholar 

  42. Khasawneh, F.A., Munch, E.: Exploring equilibria in stochastic delay differential equations using persistent homology. In: ASME 2014 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pp. V008T11A034–V008T11A034. American Society of Mechanical Engineers (2014)

  43. Khasawneh, F.A., Munch, E.: Chatter detection in turning using persistent homology. Mech. Syst. Signal Process. 70, 527–541 (2016)

    Google Scholar 

  44. Kosorok, M.R.: Introduction to Empirical Processes and Semiparametric Inference. Springer, New York (2007)

    MATH  Google Scholar 

  45. Lamprecht, M.R., Sabatini, D.M., Carpenter, A.E.: CellProfiler: free, versatile software for automated biological image analysis. Biotechniques 42(1), 71–75 (2007)

    Google Scholar 

  46. Lawson, P., Berry, E., Brown, J.Q., Fasy, B.T., Wenk, C.: Topological descriptors for quantitative prostate cancer morphology analysis. In: Conference on Digital Pathology, Part of SPIE Medical Imaging. Honorable Mention Poster Award (2017)

  47. Lawson, P., Sholl, A.B., Brown, J.Q., Fasy, B.T., Wenk, C. Persistent Homology for the quantitative evaluation of architectural features in prostate cancer histology. Sci. Rep. 9(1), 1–15 (2019)

  48. Lei, J., Rinaldo, A., Wasserman, L.: A conformal prediction approach to explore functional data. Ann. Math. Artif. Intell. 74(1–2), 29–43 (2015)

    MathSciNet  MATH  Google Scholar 

  49. Li, B., Yu, Q.: Classification of functional data: a segmentation approach. Comput. Stat. Data Anal. 52(10), 4790–4800 (2008)

    MathSciNet  MATH  Google Scholar 

  50. Ma, S., Yang, L., Carroll, R.J.: A simultaneous confidence band for sparse longitudinal regression. Stat. Sin. 22, 95–122 (2012)

    MathSciNet  MATH  Google Scholar 

  51. Mileyko, Y., Mukherjee, S., Harer, J.: Probability measures on the space of persistence diagrams. Inverse Prob. 27(12), 124007 (2011)

    MathSciNet  MATH  Google Scholar 

  52. Monod, A., Kališnik, S., Patiño-Galindo, J.Á., Crawford, L.: Tropical sufficient statistics for persistent homology with a parametric application to infectious viral disease. SIAM J. Appl. Algebra Geom. 3(2), 337–371 (2019)

    MathSciNet  MATH  Google Scholar 

  53. Munkres, J.R.: Algebraic Topology. Prentice Hall, Upper Saddle River (1964)

    MATH  Google Scholar 

  54. Myllymäki, M., Mrkvička, T., Grabarnik, P., Seijo, H., Hahn, U.: Global envelope tests for spatial processes. J. R. Stat. Soc. Ser. B Stat. Methodol. 79(2), 381–404 (2017)

    MathSciNet  MATH  Google Scholar 

  55. Obayashi, I., Hiraoka, Y., Kimura, M.: Persistence diagrams with linear machine learning models. J. Appl. Comput. Topol. 1(3–4), 421–449 (2018)

    MathSciNet  MATH  Google Scholar 

  56. Padellini, T., Brutti, P.: (2017) Persistence flamelets: Multiscale persistent homology for kernel density exploration. arXiv preprint arXiv:1709.07097

  57. Perea, J.A., Harer, J.: Sliding windows and persistence: an application of topological methods to signal analysis. Found. Comput. Math. 15(3), 799–838 (2015)

    MathSciNet  MATH  Google Scholar 

  58. Pretorius, E., Vieira, W., Oberholzer, H., Auer, R.: Comparative scanning electron microscopy of platelets and fibrin networks of human and differents animals. Int. J. Morphol. 27(1), 69–76 (2009)

    Google Scholar 

  59. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2017)

    Google Scholar 

  60. Ramsay, J.O.: Functional Data Analysis. Wiley, New York (2006)

    Google Scholar 

  61. Robins, V.: Computational topology for point data: betti numbers of \(\alpha \)-shapes. In: Mecke, K.R., Stoyan, D. (eds.) Morphology of Condensed Matter, pp. 261–274. Springer, New York (2002)

    Google Scholar 

  62. Robinson, A., Turner, K.: Hypothesis testing for topological data analysis. J. Appl. Comput. Topol. 1(2), 241–261 (2017)

    MathSciNet  MATH  Google Scholar 

  63. Rossi, F., Villa, N.: Support vector machine for functional data classification. Neurocomputing 69(7–9), 730–742 (2006)

    Google Scholar 

  64. Rubin, H.: Uniform convergence of random functions with applications to statistics. Ann. Math. Stat. 27(1), 200–203 (1956)

    MathSciNet  MATH  Google Scholar 

  65. Shafer, G., Vovk, V.: A tutorial on conformal prediction. J. Mach. Learn. Res. 9(Mar), 371–421 (2008)

    MathSciNet  MATH  Google Scholar 

  66. Singh, N., Couture, H.D., Marron, J.S., Perou, C., Niethammer, M.: Topological descriptors of histology images. In: International Workshop on Machine Learning in Medical Imaging, pp. 231–239. Springer (2014)

  67. Sousbie, T., Pichon, C., Kawahara, H.: The persistent cosmic web and its filamentary structure–II. Illustrations. Mon. Not. R. Astron. Soc. 414(1), 384–403 (2011)

    Google Scholar 

  68. Tarpey, T., Kinateder, K.K.: Clustering functional data. J. Classif. 20(1), 093–114 (2003)

    MathSciNet  MATH  Google Scholar 

  69. Topaz, C.M., Ziegelmeier, L., Halverson, T.: Topological data analysis of biological aggregation models. PLoS ONE 10(5), e0126383 (2015)

    Google Scholar 

  70. Truesdale, M.D., Cheetham, P.J., Turk, A.T., Sartori, S., Hruby, G.W., Dinneen, E.P., Benson, M.C., Badani, K.K.: Gleason score concordance on biopsy-confirmed prostate cancer: is pathological re-evaluation necessary prior to radical prostatectomy? BJU Int. 107(5), 749–754 (2011)

    Google Scholar 

  71. Truong, M., Slezak, J.A., Lin, C.P., Iremashvili, V., Sado, M., Razmaria, A.A., Leverson, G., Soloway, M.S., Eggener, S.E., Abel, E.J., Downs, T.M., Jarrard, D.F.: Development and multi-institutional validation of an upgrading risk tool for Gleason 6 prostate cancer. Cancer 119(22), 3992–4002 (2013)

    Google Scholar 

  72. Turner, K.: Means and medians of sets of persistence diagrams. arXiv preprint arXiv:1307.8300 (2013)

  73. Turner, K., Mileyko, Y., Mukherjee, S., Harer, J.: Fréchet means for distributions of persistence diagrams. Discrete Comput. Geom. 52(1), 44–70 (2014)

    MathSciNet  MATH  Google Scholar 

  74. Turner, K., Mukherjee, S., Boyer, D.M.: Persistent homology transform for modeling shapes and surfaces. Inf. Inference J. IMA 3(4), 310–344 (2014)

    MathSciNet  MATH  Google Scholar 

  75. Van de Weygaert, R., Vegter, G., Edelsbrunner, H., Jones, B.J., Pranav, P., Park, C., Hellwing, W.A., Eldering, B., Kruithof, N., Bos, E., et al.: Alpha, betti and the megaparsec universe: on the topology of the cosmic web. In: Transactions on Computational Science XIV, pp. 60–101. Springer (2011)

  76. Van der Vaart, A.W.: Asymptotic Statistics (Cambridge Series in Statistical and Probabilistic Mathematics), vol. 3. Cambridge University Press, Cambridge (2000)

    Google Scholar 

  77. Von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)

    MathSciNet  Google Scholar 

  78. Vovk, V., Gammerman, A., Shafer, G.: Algorithmic Learning in a Random World. Springer, New York (2005)

    MATH  Google Scholar 

  79. Wang, J.-L., Chiou, J.-M., Müller, H.-G.: Review of functional data analysis. Annu. Rev. Stat. Appl. 3, 257–295 (2016)

    Google Scholar 

  80. Wasserman, L.: All of Nonparametric Statistics. Springer, New York (2006)

    MATH  Google Scholar 

  81. Wasserman, L.: Topological data analysis. Annual Review of Statistics and Its Application 5, 501–532 (2016)

    MathSciNet  Google Scholar 

  82. Worsley, K.J.: The geometry of random images. Chance 9(1), 27–40 (1996)

    Google Scholar 

  83. Yuan, K.-H.: A theorem on uniform convergence of stochastic functions with applications. J. Multivar. Anal. 62(1), 100–109 (1997)

    MathSciNet  MATH  Google Scholar 

  84. Zhu, X.: Persistent homology: an introduction and a new text representation for natural language processing. In: IJCAI, pp. 1953–1959 (2013)

  85. Zomorodian, A.J.: Topology for Computing. Cambridge Monographs, vol. 16. Cambridge University Press, Cambridge (2005)

    MATH  Google Scholar 

Download references

Acknowledgements

Berry acknowledges support from NSF DMS 1812055 and DMS 1945639. Berry and Fasy acknowledge supported by NIH and NSF under Grant Numbers NSF DMS 1664858 and NSF DMS 1557716, as well as the research teams at MSU and Tulane University working on the prostate cancer histology project. Chen is supported by NIH U01 AG016976 and NSF DMS 1810960. Cisewski-Kehe thanks the Yale Center for Research Computing for guidance and use of the research computing infrastructure. Cisewski-Kehe and Fasy acknowledge support from NSF under Grant Numbers DMS 1854220 and 1854336. Fasy is additionally supported by NSF CCF 1618605.

Author information

Affiliations

Authors

Corresponding author

Correspondence to Brittany Terese Fasy.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Proofs

The proofs from the propositions of Sect. 3 are presented below.

Proof of Proposition 1

This proof uses ideas from Rubin (1956) and Yuan (1997).

Let \(\epsilon >0\) be given and \({\widehat{F}}_n = {\widehat{F}} = \frac{1}{n}\sum _{i=1}^n F_i\). Because \({\mathcal {B}}_F\) is equicontinuous, the collection of differences

$$\begin{aligned} {\mathcal {B}}_{\varDelta } = \{{\widehat{F}}_n - {\bar{F}}:n = 1,2,\ldots \} \end{aligned}$$

is also equicontinuous. Since \({\mathbb {T}}\) is compact, there exists a number \(M>0\) and points \(t_1,\ldots ,t_M\in {\mathbb {T}}\) such that

$$\begin{aligned} \sup _{\varDelta \in {\mathcal {B}}_{\varDelta }} \min _{j =1,\ldots ,M}|\varDelta (t)-\varDelta (t_j)| < \epsilon /2. \end{aligned}$$

Namely, \(t_1,\ldots ,t_M\) forms a \(\epsilon /2\)-covering of \({\mathcal {B}}_{\varDelta }\).

Let \(\varDelta _n(t) = {\widehat{F}}(t)-{\bar{F}}(t)\), and note that \(\varDelta _n\in {\mathcal {B}}_{\varDelta }\). Then

$$\begin{aligned} \sup _{t\in {\mathbb {T}}}|{\widehat{F}}(t)-{\bar{F}}(t)|&= \sup _{t\in {\mathbb {T}}}|\varDelta _n(t)|\\&\le \sup _{t\in {\mathbb {T}}} \min _{j\in 1,\ldots ,M} |\varDelta _n(t)-\varDelta _n(t_j)| + \sup _{j\in 1,\ldots ,M} |\varDelta _n(t_j)|\\&\le \epsilon /2 + \sup _{j\in 1,\ldots ,M} |\varDelta _n(t_j)|. \end{aligned}$$

By (5), every \(t\in {\mathbb {T}}\) satisfies \({{\mathbb {E}}}|F_i(t)|<\infty \) so by the strong law of large numbers,

$$\begin{aligned} |\varDelta _n(t_j)| \overset{a.s.}{\rightarrow } 0 \end{aligned}$$

for every \(j=1,\ldots ,M\) and this implies that for any \(\delta >0\), there exists \(N>0\) such that

$$\begin{aligned} P\left( \sup _{n\ge N}|\varDelta _n(t_j)|>\frac{\epsilon }{2M}\right) <\frac{\delta }{M} \end{aligned}$$

for every \(j=1,\ldots ,M\).

As a result,

$$\begin{aligned} P\left( \sup _{t\in {\mathbb {T}}}|{\widehat{F}}(t)-{\bar{F}}(t)|>\epsilon \right)&\le P\left( \epsilon /2 + \sup _{j\in 1,\ldots ,M} |\varDelta _n(t_j)|>\epsilon \right) \\&\le P\left( \sup _{n\ge N} \sup _{j\in 1,\ldots ,M}|\varDelta _n(t_j)|>\frac{\epsilon }{2}\right) \\&\le \sum _{j=1}^M P\left( \sup _{n\ge N}|\varDelta _n(t_j)|>\frac{\epsilon }{2M}\right) \le \delta \end{aligned}$$

and the result follows. \(\square \)

Proof of Corollary 1

The case of persistence landscapes is just a special case of the generalized landscape so we will only prove the case of persistence intensity and generalized landscape. Note that Chazal et al. (2014) already show that persistence silhouettes with a power-weighted silhouette is 1-Lipschitz so the equicontinuity follows so we ignore this case.

Persistence intensity Given a persistence diagram \({\textsf {D}}\), the persistence intensity is the function

$$\begin{aligned} F(t,s;{\textsf {D}})={\mathbb {F}}({\textsf {D}};t,s) = \frac{1}{\left| {\textsf {D}}\right| }\sum _{j=1}^{\left| {\textsf {D}}\right| }\omega (d_j-b_j) \cdot K\left( \frac{\sqrt{|b_j-t|^2+|d_j-s|^2}}{h}\right) , \end{aligned}$$

Thus, for any \((t_1,s_1)\) and \((t_2,s_2)\),

$$\begin{aligned} \left| F(t_1,s_1;{\textsf {D}}) - F(t_2,s_2;{\textsf {D}}) \right|&= \left| \frac{1}{\left| {\textsf {D}}\right| }\sum _{j=1}^{\left| {\textsf {D}}\right| }\omega (d_j-b_j) \cdot \left\{ K\left( \frac{\sqrt{|b_j-t_1|^2+|d_j-s_1|^2}}{h}\right) \right. \right. \\&\quad \left. \left. - K\left( \frac{\sqrt{|b_j-t_2|^2+|d_j-s_2|^2}}{h}\right) \right\} \right| \\&\le \sup _{a,b\in {\mathbb {T}}}\Vert \omega (a,b)\Vert \times \left| K\left( \frac{\sqrt{|b_j-t_1|^2+|d_j-s_1|^2}}{h}\right) \right. \\&\quad \left. - K\left( \frac{\sqrt{|b_j-t_2|^2+|d_j-s_2|^2}}{h}\right) \right| \\&\le \sup _{a,b\in {\mathbb {T}}}\Vert \omega (a,b)\Vert L_0 \cdot \frac{\sqrt{(t_1-t_2)^2+(s_1-s_2)^2}}{h}, \end{aligned}$$

where \(L_0\) is the Lipschitz constant of the kernel function K, i.e., \(|K(x)-K(y)|\le L_0|x-y| \). The bound \(\sup _{a,b\in {\mathbb {T}}}\Vert \omega (a,b)\Vert L_0/h\) is uniform for all \((t_1,s_1)\) and \((t_2,s_2)\) so the corresponding persistence intensity functions form an equicontinuous family.

Generalized landscape Recall that generalized landscapes use the kth maximum value of \(\{{\widetilde{\varLambda }}_j(t; h):j =1,\ldots , |{\textsf {D}}|\}\), where

$$\begin{aligned} {\widetilde{\varLambda }}_j(t; h) = {\left\{ \begin{array}{ll} \frac{y_j}{K_h(0)} K_h\left( {t - x_j}\right) , &{}\quad \text {for } | \frac{t - x_j}{h} | \le 1 \\ 0, &{}\quad \text {otherwise, } \end{array}\right. } \end{aligned}$$

and \(y_j =|b_j-d_j|\) is the persistence, \(x_j = \frac{b_j+d_j}{2}\), and \(K_h(\cdot /h) = \frac{1}{h}K(\cdot /h)\). Therefore, for any \(t_1,t_2\),

$$\begin{aligned} |{\widetilde{\varLambda }}_j(t_1; h)-{\widetilde{\varLambda }}_j(t_2; h)|&\le \frac{y_j}{K_h(0)} |K_h\left( {t_1 - x_j}\right) -K_h\left( {t_2 - x_j}\right) |\\&\le \frac{y_j}{K(0)/h} \frac{1}{h}\left| K\left( \frac{t_1 - x_j}{h}\right) -K\left( \frac{t_2 - x_j}{h}\right) \right| \\&\le \frac{\mathsf{diam}({\mathbb {T}})}{K(0)} L_0 \frac{|t_1-t_2|}{h}, \end{aligned}$$

where \(L_0\) is the Lipschitz constant of the kernel function K and \(\mathsf{diam}({\mathbb {T}})\) is the diameter of \({\mathbb {T}}\). Thus, for a given \(h>0\), \({\widetilde{\varLambda }}_j(t; h)\) is Lipschitz with a constant \(\frac{\mathsf{diam}({\mathbb {T}}) L_0}{K(0)h} \). Moreover, this analysis applies to every \(j=1,\ldots , |{\textsf {D}}|\) so the kth maximum value of \(\{{\widetilde{\varLambda }}_j(t; h):j =1,\ldots , |{\textsf {D}}|\}\) is still Lipschitz with the same constant. This implies that the generalized landscape forms an equicontinuous family. \(\square \)

Proof of Proposition 2

Note that by Eq. (5), the functional summary satisfies

$$\begin{aligned} \sup _{t\in {\mathbb {T}}}\mathsf{Var}(F_i(t)) \le \sup _{t\in {\mathbb {T}}}{{\mathbb {E}}}(F^2_i(t))\le {\bar{U}}^2<\infty . \end{aligned}$$

Pointwise normality The first assertion \(\sqrt{n}({\widehat{F}}(t)-{\bar{F}}(t))\rightarrow N(0,\sigma ^2(t))\) follows from the usual central limit theorem because the variance is finite.

Normality of integrated difference To show the normality for the integrated difference, note that

$$\begin{aligned} \sqrt{n}\int ({\widehat{F}}(t)-{\bar{F}}(t))dt&= \sqrt{n}\int \left( \frac{1}{n}\sum _{i=1}^n F_i(t) - {\bar{F}}(t)\right) dt\\&= \sqrt{n} \frac{1}{n}\sum _{i=1}^n \underbrace{\int \left( F_i(t)-{\bar{F}}(t)\right) dt}_{=Y_i}\\&= \sqrt{n} {\bar{Y}}_n, \end{aligned}$$

where \({\bar{Y}}_n = \frac{1}{n}\sum _{i=1}^n Y_i\) and \(Y_1,\ldots ,Y_n\) are i.i.d. with mean \({{\mathbb {E}}}(Y_i) = 0\) and variance

$$\begin{aligned} \mathsf{Var}(Y_i )= \int \mathsf{Var}(F_i(t))dt = \int \sigma ^2(t)dt<\infty . \end{aligned}$$

Thus, by the usual central limit theorem again, we obtain the normality.

Convergence to a Gaussian process The proof of convergence toward a Gaussian process involves the concept of a Donsker class; see Section 2.2 of Kosorok (2007) for an introduction. Simply put, given i.i.d. random elements \(S_1,\ldots , S_n\in {\mathcal {S}}\) and a collection of mappings \(f_t: {\mathcal {S}}\mapsto {\mathbb {R}}\) with an index set \(t\in {\mathbb {T}}\), the collection \(\{f_t: t\in {\mathbb {T}}\}\) is called Donsker if the average function \({\bar{f}}(t) = \frac{1}{n}\sum _{i=1}^nf_t(S_i)\) converges to a tight Gaussian process in \({\mathcal {L}}_\infty \) metric. The assumption in Eq. (6) along with Theorem 2.5 in Kosorok (2007) implies that the class \({\mathcal {W}}\) is Donsker, which implies that the sample mean functional \({\widehat{F}}(t)\) converges to a Gaussian process. \(\square \)

Proof of of Proposition 3

Because Eq. (6) implies that the class \({\mathcal {W}}\) is Donsker, this proposition is a well-known result in the empirical process theory. For instance, see the discussion on page 21 of Kosorok (2007). Here we briefly highlight the basic idea.

By Proposition 3 and the continuous mapping theorem [see, e.g., page 16 of Kosorok (2007)],

$$\begin{aligned} \sqrt{n}\sup _{t\in {\mathbb {T}}}|{\widehat{F}}(t)-{\bar{F}}(t)| \overset{D}{\rightarrow } \sup _{t\in {\mathbb {T}}}|{\mathbb {B}}(t)|. \end{aligned}$$

In the case of the bootstrap, using the same proposition but now apply it to the bootstrap version, we obtain

$$\begin{aligned} \sqrt{n}\sup _{t\in {\mathbb {T}}}|{\widehat{F}}^*(t)-{\widehat{F}}(t)| \overset{D|F^{\otimes n}}{\rightarrow } \sup _{t\in {\mathbb {T}}}|{\mathbb {B}}(t)|, \end{aligned}$$

where \(\overset{D|F^{\otimes n}}{\rightarrow }\) denotes convergences in distribution given \(F_1,\ldots ,F_n\).

Therefore, the bootstrap quantile converges to the corresponding quantile of the original difference. \(\square \)

Proof of Proposition 4

This proof consists of two parts. In the first part, we prove that \({\widehat{q}}_\gamma \overset{P}{\rightarrow } q_\gamma \). In the second part, we prove the desired result.

Part 1 Given \(F_1,\ldots ,F_n\), we define

$$\begin{aligned} {\widehat{Q}}(t) = \frac{1}{n}\sum _{i=1}^n I(d(F_i,{\bar{F}})\le t) \end{aligned}$$

to be the empirical version of Q(t) and define

$$\begin{aligned} {\widetilde{q}}_\gamma : {\widehat{Q}}({\widetilde{q}}_\gamma ) = \gamma , \end{aligned}$$

to be the corresponding \(\gamma \) quantile.

Because \({\widehat{Q}}\) is just the empirical distribution function (EDF) of \(Y_1,\ldots ,Y_n\) where \(Y_i = d(F_i,{\bar{F}})\) and Q(t) is the cumulative distribution function (CDF) of \(Y_i\),

$$\begin{aligned} \sup _{t} |{\widehat{Q}}(t)-Q(t)| = O_P\left( \frac{1}{\sqrt{n}}\right) . \end{aligned}$$

By the mean value theorem and the fact that \({\widehat{Q}}({\widetilde{q}}_\gamma ) -Q({\widetilde{q}}_\gamma ) = o_P(1)\),

$$\begin{aligned} {\widehat{Q}}({\widetilde{q}}_\gamma ) -Q({\widetilde{q}}_\gamma )&=Q(q_\gamma ) -Q({\widetilde{q}}_\gamma ) \\&= Q'(q_\gamma ^*) (q_\gamma -{\widetilde{q}}_\gamma ) \end{aligned}$$

for some \(q_\gamma ^* \in [q_\gamma , {\widetilde{q}}_\gamma ]\). By assumption, \(Q'(q_\gamma ^*)\ge q_0>0\) so

$$\begin{aligned} |q_\gamma -{\widetilde{q}}_\gamma | \le \frac{1}{q_0} \left| {\widehat{Q}}({\widetilde{q}}_\gamma ) -Q({\widetilde{q}}_\gamma )\right| =O_P\left( \frac{1}{\sqrt{n}}\right) . \end{aligned}$$

Now, because \(\max {i=1,\ldots ,n}|d(F_i,{\bar{F}})-d(F_i,{\widehat{F}})| \le d({\bar{F}},{\widehat{F}})\), \({\widetilde{q}}_\gamma \), the quantile of \(d(F_1,{\bar{F}}),\ldots , d(F_n,{\bar{F}})\), and \({\widehat{q}}_\gamma \), the quantile of \(d(F_1,{\widehat{F}}),\ldots , d(F_n,{\widehat{F}})\), are bounded by

$$\begin{aligned} |{\widetilde{q}}_\gamma -{\widehat{q}}_\gamma | \le d({\bar{F}},{\widehat{F}}). \end{aligned}$$

which implies

$$\begin{aligned} |{\widehat{q}}_\gamma - q_\gamma |=O_P\left( \frac{1}{\sqrt{n}}\right) +d({\bar{F}},{\widehat{F}}). \end{aligned}$$

Part 2 Let

$$\begin{aligned} {\widehat{A}}_\gamma = P(F_\mathsf{new}\subset \widehat{{\mathcal {F}}}_\gamma |F_1,\ldots ,F_n) =P(d(F_\mathsf{new}, {\widehat{F}})\le {\widehat{q}}_\gamma |F_1,\ldots ,F_n). \end{aligned}$$

be the probability of interest. By the triangular inequality,

$$\begin{aligned} |d(F_\mathsf{new}, {\bar{F}})-d(F_\mathsf{new}, {\widehat{F}}) | \le d({\bar{F}}, {\widehat{F}}). \end{aligned}$$

Thus,

$$\begin{aligned} A_\gamma&= P(d(F_\mathsf{new}, {\widehat{F}})\le {\widehat{q}}_\gamma |F_1,\ldots ,F_n) \\&\ge P(d(F_\mathsf{new}, {\bar{F}})\le {\widehat{q}}_\gamma -d({\widehat{F}},{\bar{F}})|F_1,\ldots ,F_n)\\&= Q({\widehat{q}}_\gamma -d({\widehat{F}},{\bar{F}}))\\&\ge Q\left( q_\gamma -2d({\widehat{F}},{\bar{F}})+O_P\left( \frac{1}{\sqrt{n}}\right) \right) \\&= Q(q_\gamma ) + O_P\left( \left( \frac{1}{\sqrt{n}}\right) +d({\widehat{F}},{\bar{F}})\right) .\\ A_\gamma&= P(d(F_\mathsf{new}, {\widehat{F}})\le {\widehat{q}}_\gamma |F_1,\ldots ,F_n) \\&\le P(d(F_\mathsf{new}, {\bar{F}})\le {\widehat{q}}_\gamma +d({\widehat{F}},{\bar{F}})|F_1,\ldots ,F_n)\\&=Q({\widehat{q}}_\gamma +d({\widehat{F}},{\bar{F}}))\\&\le Q\left( q_\gamma +2d({\widehat{F}},{\bar{F}})+O_P\left( \frac{1}{\sqrt{n}}\right) \right) \\&= Q(q_\gamma ) + O_P\left( \left( \frac{1}{\sqrt{n}}\right) +d({\widehat{F}},{\bar{F}})\right) .\\ \end{aligned}$$

Thus,

$$\begin{aligned} A_\gamma = P(F_\mathsf{new}\subset \widehat{{\mathcal {F}}}_\gamma |F_1,\ldots ,F_n)= \underbrace{Q(q_\gamma )}_{=\gamma }+ O_P\left( \left( \frac{1}{\sqrt{n}}\right) +d({\widehat{F}},{\bar{F}})\right) , \end{aligned}$$

which completes the proof. \(\square \)

Fig. 16
figure16

STIX simulation results for homology dimension 0 by landscape and generalized landscape function order. The median permutation p values (ac) and \(\log _{10}\) (p values) (df) are plotted along with their interquartile range (the vertical lines) for two-sample hypothesis tests comparing samples drawn from the null population, \({\mathcal {P}}_{F,1}\), with an average thickness of \(t_1 = 5\). The alternative hypotheses include average thicknesses of \(t_2 = 5, 5.25, 5.5, 5.75, 6, 7, 8\), corresponding to images (af), respectively (except average thickness of 8 is not displayed). The permutation p values are based on 100 repetitions of 12 STIX images drawn from the null and alternative hypothesis, with 10,000 random permutations. The different plot colors and symbols represent the different functions and bandwidths considered; the function order is the ordering of the landscape and generalized landscape functions (color figure online)

Fig. 17
figure17

STIX simulation results for homology dimension 1 by landscape and generalized landscape function order. The median permutation p values (ac) and \(\log _{10}\) (p values) (df) are plotted along with their interquartile range (the vertical lines) for two-sample hypothesis tests comparing samples drawn from the null population, \({\mathcal {P}}_{F,1}\), with an average thickness of \(t_1 = 5\). The alternative hypotheses include average thicknesses of \(t_2 = 5, 5.25, 5.5, 5.75, 6, 7, 8\), corresponding to images (af), respectively (except average thickness of 8 is not displayed). The permutation p values are based on 100 repetitions of 12 STIX images drawn from the null and alternative hypothesis, with 10,000 random permutations. The different plot colors and symbols represent the different functions and bandwidths considered; the function order is the ordering of the landscape and generalized landscape functions (color figure online)

B STIX simulation results

For completeness, results for the STIX simulation study from Sect. 4.2.1 are displayed for \(t_2 = 5, 5.25, 5.5, 5.75, 6, 7\). Figures 16 and 17 include the 0th and 1st homology dimensions, respectively, for the landscape and generalized landscape functions for their first 10 function orders; a subset of the results for the 1st homology dimension are displayed in Fig. 12 and discussed in the main text. Note that the results for the setting with \(t_2 = 8\) are not displayed because the permutation p values for all function orders for the 1st homology dimension achieve the minimum possible p value, and most of the function orders for the 0th homology dimension achieve this minimum as well.

Similarly, the results for the other functional summaries and the concatenated orders of the landscape and generalized landscape functions for \(t_2 = 5, 5.25, 5.5, 5.75, 6, 7\) are displayed in Figs. 18 and 19 for the 0th and 1st homology dimensions, respectively. A subset of the results for the 1st homology dimension are displayed in Fig. 13 and discussed in the main text. Note that the results for the setting with \(t_2 = 8\) are not displayed because the permutation p values for all functional summaries for the 1st homology dimension achieve the minimum possible p value, and all functional summaries (except the PDT for the 0th homology dimension) achieve this minimum as well.

Fig. 18
figure18

STIX simulation results for homology dimension 0. The median permutation p values (ac) and \(\log _{10}\) (p values) (df) are plotted along with their interquartile range (the vertical lines) for two-sample hypothesis tests comparing samples drawn from the null population, \({\mathcal {P}}_{F,1}\), with an average thickness of \(t_1 = 5\). The alternative hypotheses include average thicknesses of \(t_2 = 5, 5.25, 5.5, 5.75, 6, 7, 8\), corresponding to images (af), respectively (except average thickness of 8 is not displayed). The permutation p values are based on 100 repetitions of 12 STIX images drawn from the null and alternative hypothesis, with 10,000 random permutations. The different plot colors and symbols represent the different functional summaries and bandwidths considered; the landscape and generalized landscape functions used orders 1–10 (color figure online)

Fig. 19
figure19

STIX simulation results for homology dimension 1. The median permutation p values (ac) and \(\log _{10}\) (p values) (df) are plotted along with their interquartile range (the vertical lines) for two-sample hypothesis tests comparing samples drawn from the null population, \({\mathcal {P}}_{F,1}\), with an average thickness of \(t_1 = 5\). The alternative hypotheses include average thicknesses of \(t_2 = 5, 5.25, 5.5, 5.75, 6, 7, 8\), corresponding to images (af), respectively (except average thickness of 8 is not displayed). The permutation p values are based on 100 repetitions of 12 STIX images drawn from the null and alternative hypothesis, with 10,000 random permutations. The different plot colors and symbols represent the different functional summaries and bandwidths considered; the landscape and generalized landscape functions used orders 1–10 (color figure online)

C Alternative generalized landscape function kernel results

C.1 Simulated Gleason data

Figure 20 displays classification test error results for generalized landscape functions using an alternative kernel function (the tricubic kernel) to the one considered in Sect. 4.1 (the triangle kernel). The ratio of the classification test errors are plot with the tricubic kernel in the numerator and the corresponding triangle kernel plotted in the denominator. The ratios vary between about 0.98 and 1.02 suggesting that the kernel choice is not strongly affecting the results in this setting.

Fig. 20
figure20

Classification test error ratios of gland classification tests results for generalized landscape functions using tricubic or triangle kernels. The different plot colors and symbols represent the different functions and bandwidths considered; the function order is the ordering of the generalized landscape functions. The ratios are the gland classification test results for the tricubic kernel function divided by the gland classification test results for the corresponding triangle kernels (as displayed in Fig. 8). The dashed line at a ratio of 1 is where the two kernels have identical classification test error (color figure online)

Fig. 21
figure21

STIX simulation results using a tricubic kernel function for the generalized landscape functions. The median permutation \(\log _{10}\) (p values) using the tricubic kernel function are plotted for a\(H_0\) and b\(H_1\) along with their interquartile range (the vertical lines) for two-sample hypothesis tests comparing samples drawn from the null population, \({\mathcal {P}}_{F,1}\), with an average thickness of \(t_1 = 5\). The alternative hypothesis displayed have an average thicknesses of \(t_2 = 5.75\). The permutation p values are based on 100 repetitions of 12 STIX images drawn from the null and alternative hypothesis, with 10,000 random permutations. The ratio of the median p values using the tricubic kernel function over the median p values using the triangle kernel function are displayed in c, d for the \(H_0\) and \(H_1\), respectively. The different plot colors and symbols represent the different functions and bandwidths considered; the function order is the ordering of the landscape and generalized landscape functions. The dashed line at a ratio of 1 is where the two kernels have identical classification test error (color figure online)

C.2 Pick-up sticks simulation data

Figure 21 displays the two-sample hypothesis test results for generalized landscape functions using an alternative kernel function (the tricubic kernel) for the setting with samples drawn from the null population, \({\mathcal {P}}_{F,1}\), with an average thickness of \(t_1 = 5\) and an alternative hypothesis with an average thicknesses of \(t_2 = 5.75\). Additionally, a comparison between the median p values using the tricubic kernel compared to the triangle kernal considered in Sect. 4.2.1. There appears to be more variability in these results compared to what we saw in “Appendix C.1”; however, the p values used in this setting tend to be smaller values than the classification test error values so a larger variation may imply only a small magnitude difference. For \(H_0\), the largest magnitude difference in p values (across all function orders and thicknesses considered) is 0.089 (in the setting with \(t_2 = 5\), function order 3, bandwidth of 0.60), with an interquartile range of 0 to 0.014. For \(H_1\), the largest magnitude difference in p values (across all function orders and thicknesses considered) is 0.065 (in the setting with \(t_2 = 5\), function order 2, bandwidth of 0.40), with an interquartile range of 0 to 0.006. There does not appear to be a discernible pattern in the variations between the tricubic and triangle kernel function results.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Berry, E., Chen, YC., Cisewski-Kehe, J. et al. Functional summaries of persistence diagrams. J Appl. and Comput. Topology 4, 211–262 (2020). https://doi.org/10.1007/s41468-020-00048-w

Download citation

Keywords

  • Topological data analysis
  • Persistent homology
  • Functional data analysis
  • Statistical inference

Mathematics Subject Classification

  • 55N31
  • 62R40
  • 62R10