Estimation of level set trees using adaptive partitions

Holmström, Lasse; Karttunen, Kyösti; Klemelä, Jussi

doi:10.1007/s00180-016-0702-2

Estimation of level set trees using adaptive partitions

Original Paper
Published: 02 December 2016

Volume 32, pages 1139–1163, (2017)
Cite this article

Computational Statistics Aims and scope Submit manuscript

437 Accesses
3 Citations
Explore all metrics

Abstract

We present methods for the estimation of level sets, a level set tree, and a volume function of a multivariate density function. The methods are such that the computation is feasible and estimation is statistically efficient in moderate dimensional cases (\(d\approx 8\)) and for moderate sample sizes (\(n\approx \) 50,000). We apply kernel estimation together with an adaptive partition of the sample space. We illustrate how level set trees can be applied in cluster analysis and in flow cytometry.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Notes

The Voronoi cell of a point \(p \in S =\{ X_1,\ldots ,X_n\}\) is \( V_p = \{ x \in \mathbf{R}^d : D(x,p) \le D(x,q) \text{ for } \text{ all } q \in S \} , \) where \(D(x,y) = \Vert x-y\Vert \) is the Euclidean distance in \(\mathbf{R}^d\). A Voronoi cell is a convex polyhedron: it is the intersection of the half-spaces of points at least as close to p as to q, taken over all \(q\in S\).
The Delaunay triangulation of \(S =\{ X_1,\ldots ,X_n\}\) is \( \text{ Delaunay }(S) = \{ \sigma \subset S : \bigcap _{p\in \sigma } V_p \ne \emptyset \} ; \) this is the collection of those tuples \(\sigma \) of \(d+1\) points whose Voronoi cells touch each other. Sets \(\sigma \in \text{ Delaunay }(S)\) are considered as the vertices of a simplex. Thus, a Delaunay triangulation is a collection of d-simplices. A d-simplex is the convex hull of its \(d+1\) vertices. In the two dimensional case (\(d=2\)) the simplices are triangles and in the three dimensional case (\(d=3\)) the simplices are tetrahedrons.
When the kernel function is the standard normal density, then the support of the kernel estimator is the whole space \(\mathbf{R}^d\). When the kernel function is the Bartlett density \(C(1-\Vert x\Vert ^2)_+\), then the support is a union of balls, and when the kernel function is the Epanechnikov product density \(C\prod _{i=1}^d (1-x_i^2)_+\), then the support is a union of rectangles.
A Reeb graph of a function (or a contour tree of a function) is graph whose leaf vertices represent the local minima or maxima and each interior vertex represent the joining or splitting of the contours of the function. Carr et al. (2003) give the following procedure for the computation of a Reeb graph. First a level set tree and a lower level set tree are calculated. A lower level set tree is analogous to a level set tree, but its nodes correspond to the lower level sets \( \{ x \in \mathbf{R}^d : f(x) \le \lambda \} . \) Second, the level set tree and the lower level set tree are pruned so that only the leaf nodes and the nodes with more than one child are left. Third, the pruned level set tree (split tree) and the pruned lower level set tree (join tree) are combined to obtain the Reeb graph.
Given a finite set \(\mathcal{{X}}\) of points in \(\mathbf{R}^d\) and a query point x, the k-d-algorithm finds the point in \(\mathcal{{X}}\) closest to x. The k-d-algorithm uses a k-d-tree, which is a similar binary tree as we use to represent an adaptive partition.
Note that when observations are in \(\mathbf{R}^d\) it is possible that only d of the observations are on the boundary, which happens when d observations are on the corners of the smallest rectangle containing the observations.
The unit simplex is defined as the convex hull of the origin and the vertices \(e_1,\ldots ,e_d\), where \(e_i\in \mathbf{R}^d\) has 1 in the ith position and 0 in the other positions. Klemelä (2004b) used a simulation example where the simplex was such that all edges have the same length. The current simulation example was used in Stuetzle and Nugent (2010) and Menardi and Azzalini (2014) with the sample size about 500.
The modes are located at (1 / 2, 0, 0, 0), \((-1/2,0,0,0)\), \((0,\sqrt{3}/2,0,0)\), \((0,1/(2\sqrt{3}),\sqrt{2/3},0)\), and \((0,1/(2\sqrt{3}),1/(2\sqrt{6}),\sqrt{15/24})\).

References

Aaron C (2013) Estimation of the support of the density and its boundary using random polyhedrons. Technical report, Université Blaise Pascal
Aghaeepour N (2010) FlowMeans: non-parametric flow cytometry data gating. R package version 1(16):
Aghaeepour N, Finak G, Hoos H, Mosmann TR, Brinkman R, Gottardo R, Scheuermann RH (2013) Critical assessment of automated flow cytometry data analysis techniques. Nat Methods 10(3):228–238
Article Google Scholar
Aghaeepour N, Nikolic R, Hoos HH, Brinkman RR (2011) Rapid cell population identification in flow cytometry data. Cytom Part A J Int Soc Anal Cytol 79(1):6–13
Article Google Scholar
Azzallini A, Torelli N (2007) Clustering via nonparametric density estimation. Stat Comput 17:71–80
Article MathSciNet Google Scholar
Baíllo A, Cuesta-Albertos JA, Cuevas A (2001) Convergence rates in nonparametric estimation of level sets. Stat Probab Lett 53:27–35
Article MathSciNet MATH Google Scholar
Baíllo A, Cuevas A, Justel A (2000) Set estimation and nonparametric detection. Can J Stat 28:765–782
Article MathSciNet MATH Google Scholar
Bashashati A, Brinkman RR (2009) A survey of flow cytometry data analysis methods. Adv Bioinform 2009:584–603
Google Scholar
Biau G, Cadre B, Pelletier B (2007) A graph-based estimator of the number of clusters. ESAIM Probab Stat 11:272–280
Article MathSciNet MATH Google Scholar
Breiman L, Friedman J, Olshen R, Stone CJ (1984) Classification and regression trees. Chapman and Hall, New York
MATH Google Scholar
Burman P, Polonik W (2009) Multivariate mode hunting: data analytic tools with measures of significance. J Multivar Anal 100:1198–1218
Article MathSciNet MATH Google Scholar
Cadre B (2006) Kernel estimation of density level sets. J Multivar Anal 97(4):999–1023
Article MathSciNet MATH Google Scholar
Carr H, Snoeyink J, Axen U (2003) Computing contour trees in any dimension. Comput Geom Theory Appl 24(2):75–94
Article MATH Google Scholar
Chaudhuri K, Dasgupta S (2010) Rates of convergence for the cluster tree. In: Lafferty JD, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A (eds) Advances in neural information processing systems 23. Curran Associates, Vancouver, pp 343–351
Google Scholar
Cuevas A, Febreiro M, Fraiman R (2000) Estimating the number of clusters. Can J Stat 28:367–382
Article MathSciNet MATH Google Scholar
Cuevas A, Febreiro M, Fraiman R (2001) Cluster analysis: a further approach based on density estimation. Comput Stat Data Anal 36:441–459
Article MathSciNet MATH Google Scholar
Cuevas A, Febreiro M, Fraiman R (2006) Plug-in estimation of general level sets. Aust N Z J Stat 48:7–19
Article MathSciNet Google Scholar
Cuevas A, Fraiman R (1997) A plug-in approach to support estimation. Ann Stat 25:2300–2312
Article MathSciNet MATH Google Scholar
Devroye L, Wise GL (1980) Detection of abnormal behavior via nonparametric estimation of the support. SIAM J Appl Math 38:480–488
Article MathSciNet MATH Google Scholar
Duong T, Cowling A, Koch I, Wand MP (2008) Feature significance for multivariate kernel density estimation. Comput Stat Data Anal 52(9):4225–4242
Article MathSciNet MATH Google Scholar
Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis and density estimation. J Am Stat Assoc 97:611–631
Article MathSciNet MATH Google Scholar
Ge Y, Sealfon SC (2012) Flowpeaks: a fast unsupervised clustering for flow cytometry data via k-means and density peak finding. Bioinformatics 28(15):2052–2058
Article Google Scholar
Hartigan JA (1975) Clustering algorithms. Wiley, New York
MATH Google Scholar
Hartigan JA (1987) Estimation of a convex density cluster in two dimensions. J Am Stat Assoc 82:267–270
Article MATH Google Scholar
Holmström L, Karttunen K, Klemelä J (2014) Estimation of level set trees with adaptive partitions: supplementary material. Technical report, University of Oulu
Indyk P (2004) Nearest neighbors in high-dimensional spaces. In: Goodman JE, O’Rourke J (eds) Handbook of discrete and computational geometry. Chapman & Hall/CRC, Boca Raton, pp 877–892
Google Scholar
Karttunen K, Holmström, L, Klemelä J (2014) Level set trees with enhanced marginal density visualization. In: In proceedings of the 5th international conference on information visualization theory and applications, (IVAPP 2014), Lisbon, Portugal, pp 210–217
Kent BP, Rinaldo A, Verstynen T (2013) DeBaCl: a Python package for interactive DEnsity-BAsed CLustering. J Stat Softw (submitted). arXiv:1307.8136
Klemelä J (2004a) Complexity penalized support estimation. J Multivar Anal 88:274–297
Article MathSciNet MATH Google Scholar
Klemelä J (2004b) Visualization of multivariate density estimates with level set trees. J Comput Graph Stat 13(3):599–620
Article MathSciNet Google Scholar
Klemelä J (2005) Algorithms for the manipulation of level sets of nonparametric density estimates. Comput Stat 20:349–368
Article MathSciNet MATH Google Scholar
Klemelä J (2006) Visualization of multivariate density estimates with shape trees. J Comput Graph Stat 15(2):372–397
Article MathSciNet Google Scholar
Klemelä J (2009) Smoothing of multivariate data: density estimation and visualization. Wiley, New York
Book MATH Google Scholar
Korostelev AP, Tsybakov AB (1993) Minimax theory of image reconstruction (Lecture notes in statistics), vol 82. Springer, Berlin
Kpotufe S, von Luxburg U (2011) Pruning nearest neighbor cluster trees. In: Proceedings of the 28th international conference on machine learning, vol 105, pp 225–232
Lo K, Brinkman RR, Gottardo R (2008) Automated gating of flow cytometry data via robust model-based clustering. Cytom Part A J Int Soc Anal Cytol 73:321–332
Article Google Scholar
Maier M, Hein M, von Luxburg U (2009) Optimal construction of k-nearest-neighbor graphs for identifying noisy clusters. Theor Comput Sci 410(19):1749–1764
Article MathSciNet MATH Google Scholar
Mammen E, Tsybakov AB (1995) Asymptotical minimax recovery of sets with smooth boundaries. Ann Stat 23:502–524
Article MathSciNet MATH Google Scholar
McMullen P (1970) The maximum numbers of faces of a convex polytope. Mathematika 17:179–184
Article MathSciNet MATH Google Scholar
Melamed MR, Lindmo T, Mendelsohn ML (1990) Flow cytometry and sorting, 2nd edn. Wiley, New York
Google Scholar
Menardi G, Azzalini A (2014) An advacement in clustering via nonparametric density estimation. Stat Comput 24:753–767
Article MathSciNet MATH Google Scholar
Morgan JN, Sonquist JA (1963) Problems in the analysis of survey data, and a proposal. J Am Stat Assoc 58:415–434
Article MATH Google Scholar
Müller DW, Sawitzki G (1991) Excess mass estimates and tests of multimodality. J Am Stat Assoc 86:738–746
MathSciNet MATH Google Scholar
Naumann U, Luta G, Wand MP (2010) The curvHDR method for gating flow cytometry samples. BMC Bioinform 11(44). doi:10.1186/1471-2105-11-44
Nolan D (1991) The excess-mass ellipsoid. J Multivar Anal 39:348–371
Article MathSciNet MATH Google Scholar
O’Neill K, Aghaeepour N, Spidlen J, Brinkman R (2013) Flow cytometry bioinformatics. PLoS Comput Biol 9(12):e1003365. doi:10.1371/journal.pcbi.1003365
Article Google Scholar
Ooi H (2002) Density visualization and mode hunting using trees. J Comput Graph Stat 11:328–347
Article MathSciNet Google Scholar
Polonik W (1995) Measuring mass concentration and estimating density contour clusters—an excess mass approach. Ann Stat 23:855–881
Article MathSciNet MATH Google Scholar
Reeb G (1946) Sur les points singuliers d’une forme de pfaff completement integrable ou d’une fonction numerique. C R Acad Sci Paris 222:847–849
MathSciNet MATH Google Scholar
Rigollet P, Vert R (2009) Optimal rates for plug-in estimators of density level sets. Bernoulli 15:1154–1178
Article MathSciNet MATH Google Scholar
Rinaldo A, Wasserman L (2010) Generalized density clustering. Ann Stat 38:2678–2722
Article MathSciNet MATH Google Scholar
Scott DW (1992) Multivariate density estimation: theory, practice, and visualization. Wiley, New York
Book MATH Google Scholar
Shapiro HM (2003) Practical flow cytometry, 4th edn. Wiley, New York
Book Google Scholar
Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall, London
Book MATH Google Scholar
Singh A, Scott C, Nowak R (2009) Adaptive Hausdorff estimation of density level sets. Ann Stat 37:2760–2782
Article MathSciNet MATH Google Scholar
Steinwart I (2015) Fully adaptive density-based clustering. Ann Stat 43:2132–2167
Article MathSciNet MATH Google Scholar
Stuetzle W (2003) Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J Classif 20(5):25–47
Article MathSciNet MATH Google Scholar
Stuetzle W, Nugent R (2010) A generalized single linkage method for estimating the cluster tree of a density. J Comput Graph Stat 19:397–418
Article MathSciNet Google Scholar
Tarjan RE (1976) Efficiency of a good but not linear set union algorithm. J ACM 22:215–225
Article MathSciNet MATH Google Scholar
Tsybakov AB (1997) On nonparametric estimation of density level sets. Ann Stat 25:948–969
Article MathSciNet MATH Google Scholar
Walther G (1997) Granulometric smoothing. Ann Stat 25:2273–2299
Article MathSciNet MATH Google Scholar
Walther G, Zimmerman N, Moore W, Parks D, Meehan S, Belitskaya I, Pan J, Herzenberg L (2009) Automatic clustering of flow cytometry data with density-based merging. Adv Bioinform 2009:686–759
Google Scholar
Zomorodian A (2012) Topological data analysis. In: Zomorodian A (ed) Advances in applied and computational topology, vol 70. American Mathematical Society, Providence, pp 1–40
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

Department of Mathematical Sciences, University of Oulu, Oulu, Finland
Lasse Holmström
CEMIS Oulu, University of Oulu, Oulu, Finland
Kyösti Karttunen
Helsinki, Finland
Jussi Klemelä

Authors

Lasse Holmström
View author publications
You can also search for this author in PubMed Google Scholar
Kyösti Karttunen
View author publications
You can also search for this author in PubMed Google Scholar
Jussi Klemelä
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jussi Klemelä.

Additional information

The authors gratefully acknowledge the TEKES funding under project 24301335.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 20040 KB)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Holmström, L., Karttunen, K. & Klemelä, J. Estimation of level set trees using adaptive partitions. Comput Stat 32, 1139–1163 (2017). https://doi.org/10.1007/s00180-016-0702-2

Download citation

Received: 11 June 2015
Accepted: 18 November 2016
Published: 02 December 2016
Issue Date: September 2017
DOI: https://doi.org/10.1007/s00180-016-0702-2

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Estimation of level set trees using adaptive partitions

Abstract

Access this article

Similar content being viewed by others

Application of Mixture Models to Large Datasets

Analysis of Mass Cytometry Data

diffcyt: Differential discovery in high-dimensional cytometry via high-resolution clustering

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Electronic supplementary material

Supplementary material 1 (pdf 20040 KB)

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Estimation of level set trees using adaptive partitions

Abstract

Access this article

Similar content being viewed by others

Application of Mixture Models to Large Datasets

Analysis of Mass Cytometry Data

diffcyt: Differential discovery in high-dimensional cytometry via high-resolution clustering

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Electronic supplementary material

Supplementary material 1 (pdf 20040 KB)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation