
Estimation of level set trees using adaptive partitions

  • Original Paper
Computational Statistics

Abstract

We present methods for estimating level sets, a level set tree, and a volume function of a multivariate density function. The methods are computationally feasible and statistically efficient in moderate-dimensional cases (\(d\approx 8\)) and for moderate sample sizes (\(n\approx 50{,}000\)). We apply kernel estimation together with an adaptive partition of the sample space. We illustrate how level set trees can be applied in cluster analysis and in flow cytometry.
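As a rough illustration of the objects named in the abstract, the following sketch computes a plug-in level set estimate from a kernel density estimate and lists its connected components at two levels. It is not the authors' method: it works in one dimension on a regular grid instead of an adaptive partition, and the sample, the bandwidth `h`, the levels, and the function names are all assumptions chosen for the example.

```python
import math


def kernel_density(x, sample, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    norm = len(sample) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample) / norm


def level_set_components(grid, values, lam):
    """Return maximal runs of consecutive grid points with value >= lam.

    Each run is reported as (left endpoint, right endpoint) and stands in
    for one connected component of the estimated level set.
    """
    components, start = [], None
    for i, v in enumerate(values):
        if v >= lam and start is None:
            start = i
        elif v < lam and start is not None:
            components.append((grid[start], grid[i - 1]))
            start = None
    if start is not None:
        components.append((grid[start], grid[-1]))
    return components


# Hypothetical toy sample with two well-separated groups of points.
sample = [-2.2, -2.0, -1.8, 1.8, 2.0, 2.2]

# Regular grid on [-5, 5]; an assumption standing in for a partition.
grid = [-5.0 + 0.05 * i for i in range(201)]
values = [kernel_density(x, sample, h=0.5) for x in grid]

low = level_set_components(grid, values, lam=0.0001)
high = level_set_components(grid, values, lam=0.1)
print(len(low), len(high))  # → 1 2
```

The single component at the low level splits into two components at the high level, and each high-level component lies inside a low-level one. Recording this nesting over a sequence of levels is what gives a tree structure.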


Notes

  1. The Voronoi cell of a point \(p \in S =\{ X_1,\ldots ,X_n\}\) is \( V_p = \{ x \in \mathbf{R}^d : D(x,p) \le D(x,q) \text{ for all } q \in S \} , \) where \(D(x,y) = \Vert x-y\Vert \) is the Euclidean distance in \(\mathbf{R}^d\). A Voronoi cell is a convex polyhedron: it is the intersection of the half-spaces of points at least as close to \(p\) as to \(q\), taken over all \(q\in S\).

  2. The Delaunay triangulation of \(S =\{ X_1,\ldots ,X_n\}\) is \( \text{Delaunay}(S) = \{ \sigma \subset S : \bigcap _{p\in \sigma } V_p \ne \emptyset \} ; \) this is the collection of those tuples \(\sigma \) of \(d+1\) points whose Voronoi cells touch each other. Each set \(\sigma \in \text{Delaunay}(S)\) is regarded as the vertex set of a simplex. Thus, a Delaunay triangulation is a collection of \(d\)-simplices. A \(d\)-simplex is the convex hull of its \(d+1\) vertices. In the two-dimensional case (\(d=2\)) the simplices are triangles, and in the three-dimensional case (\(d=3\)) the simplices are tetrahedra.

  3. When the kernel function is the standard normal density, then the support of the kernel estimator is the whole space \(\mathbf{R}^d\). When the kernel function is the Bartlett density \(C(1-\Vert x\Vert ^2)_+\), then the support is a union of balls, and when the kernel function is the Epanechnikov product density \(C\prod _{i=1}^d (1-x_i^2)_+\), then the support is a union of rectangles.

  4. A Reeb graph of a function (or a contour tree of a function) is a graph whose leaf vertices represent the local minima or maxima and whose interior vertices each represent the joining or splitting of the contours of the function. Carr et al. (2003) give the following procedure for the computation of a Reeb graph. First, a level set tree and a lower level set tree are calculated. A lower level set tree is analogous to a level set tree, but its nodes correspond to the lower level sets \( \{ x \in \mathbf{R}^d : f(x) \le \lambda \} . \) Second, the level set tree and the lower level set tree are pruned so that only the leaf nodes and the nodes with more than one child are left. Third, the pruned level set tree (split tree) and the pruned lower level set tree (join tree) are combined to obtain the Reeb graph.

  5. Given a finite set \(\mathcal{{X}}\) of points in \(\mathbf{R}^d\) and a query point \(x\), the k-d-algorithm finds the point in \(\mathcal{{X}}\) closest to \(x\). The k-d-algorithm uses a k-d-tree, a binary tree similar to the one we use to represent an adaptive partition.

  6. Note that when observations are in \(\mathbf{R}^d\) it is possible that only d of the observations are on the boundary, which happens when d observations are on the corners of the smallest rectangle containing the observations.

  7. The unit simplex is defined as the convex hull of the origin and the vertices \(e_1,\ldots ,e_d\), where \(e_i\in \mathbf{R}^d\) has 1 in the \(i\)th position and 0 in the other positions. Klemelä (2004b) used a simulation example in which all edges of the simplex have the same length. The current simulation example was used in Stuetzle and Nugent (2010) and Menardi and Azzalini (2014) with a sample size of about 500.

  8. The modes are located at \((1/2,0,0,0)\), \((-1/2,0,0,0)\), \((0,\sqrt{3}/2,0,0)\), \((0,1/(2\sqrt{3}),\sqrt{2/3},0)\), and \((0,1/(2\sqrt{3}),1/(2\sqrt{6}),\sqrt{15/24})\).
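The coordinates in Note 8 can be checked numerically. The short sketch below, which is not taken from the paper, computes all pairwise Euclidean distances between the five listed modes; the variable names are illustrative. The check shows that every pair of modes is at distance 1, so the five points are the vertices of a regular simplex in \(\mathbf{R}^4\) with unit edge length.

```python
import itertools
import math

# The five mode locations exactly as listed in Note 8.
modes = [
    (0.5, 0.0, 0.0, 0.0),
    (-0.5, 0.0, 0.0, 0.0),
    (0.0, math.sqrt(3) / 2, 0.0, 0.0),
    (0.0, 1 / (2 * math.sqrt(3)), math.sqrt(2 / 3), 0.0),
    (0.0, 1 / (2 * math.sqrt(3)), 1 / (2 * math.sqrt(6)), math.sqrt(15 / 24)),
]

# Euclidean distance for each of the 10 unordered pairs of modes.
distances = [math.dist(p, q) for p, q in itertools.combinations(modes, 2)]

print(len(distances))                                     # → 10
print(all(math.isclose(d, 1.0) for d in distances))       # → True
```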

References

  • Aaron C (2013) Estimation of the support of the density and its boundary using random polyhedrons. Technical report, Université Blaise Pascal

  • Aghaeepour N (2010) FlowMeans: non-parametric flow cytometry data gating. R package version 1(16)

  • Aghaeepour N, Finak G, Hoos H, Mosmann TR, Brinkman R, Gottardo R, Scheuermann RH (2013) Critical assessment of automated flow cytometry data analysis techniques. Nat Methods 10(3):228–238

  • Aghaeepour N, Nikolic R, Hoos HH, Brinkman RR (2011) Rapid cell population identification in flow cytometry data. Cytom Part A J Int Soc Anal Cytol 79(1):6–13

  • Azzalini A, Torelli N (2007) Clustering via nonparametric density estimation. Stat Comput 17:71–80

  • Baíllo A, Cuesta-Albertos JA, Cuevas A (2001) Convergence rates in nonparametric estimation of level sets. Stat Probab Lett 53:27–35

  • Baíllo A, Cuevas A, Justel A (2000) Set estimation and nonparametric detection. Can J Stat 28:765–782

  • Bashashati A, Brinkman RR (2009) A survey of flow cytometry data analysis methods. Adv Bioinform 2009:584–603

  • Biau G, Cadre B, Pelletier B (2007) A graph-based estimator of the number of clusters. ESAIM Probab Stat 11:272–280

  • Breiman L, Friedman J, Olshen R, Stone CJ (1984) Classification and regression trees. Chapman and Hall, New York

  • Burman P, Polonik W (2009) Multivariate mode hunting: data analytic tools with measures of significance. J Multivar Anal 100:1198–1218

  • Cadre B (2006) Kernel estimation of density level sets. J Multivar Anal 97(4):999–1023

  • Carr H, Snoeyink J, Axen U (2003) Computing contour trees in any dimension. Comput Geom Theory Appl 24(2):75–94

  • Chaudhuri K, Dasgupta S (2010) Rates of convergence for the cluster tree. In: Lafferty JD, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A (eds) Advances in neural information processing systems 23. Curran Associates, Vancouver, pp 343–351

  • Cuevas A, Febreiro M, Fraiman R (2000) Estimating the number of clusters. Can J Stat 28:367–382

  • Cuevas A, Febreiro M, Fraiman R (2001) Cluster analysis: a further approach based on density estimation. Comput Stat Data Anal 36:441–459

  • Cuevas A, Febreiro M, Fraiman R (2006) Plug-in estimation of general level sets. Aust N Z J Stat 48:7–19

  • Cuevas A, Fraiman R (1997) A plug-in approach to support estimation. Ann Stat 25:2300–2312

  • Devroye L, Wise GL (1980) Detection of abnormal behavior via nonparametric estimation of the support. SIAM J Appl Math 38:480–488

  • Duong T, Cowling A, Koch I, Wand MP (2008) Feature significance for multivariate kernel density estimation. Comput Stat Data Anal 52(9):4225–4242

  • Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis and density estimation. J Am Stat Assoc 97:611–631

  • Ge Y, Sealfon SC (2012) Flowpeaks: a fast unsupervised clustering for flow cytometry data via k-means and density peak finding. Bioinformatics 28(15):2052–2058

  • Hartigan JA (1975) Clustering algorithms. Wiley, New York

  • Hartigan JA (1987) Estimation of a convex density cluster in two dimensions. J Am Stat Assoc 82:267–270

  • Holmström L, Karttunen K, Klemelä J (2014) Estimation of level set trees with adaptive partitions: supplementary material. Technical report, University of Oulu

  • Indyk P (2004) Nearest neighbors in high-dimensional spaces. In: Goodman JE, O’Rourke J (eds) Handbook of discrete and computational geometry. Chapman & Hall/CRC, Boca Raton, pp 877–892

  • Karttunen K, Holmström L, Klemelä J (2014) Level set trees with enhanced marginal density visualization. In: Proceedings of the 5th international conference on information visualization theory and applications (IVAPP 2014), Lisbon, Portugal, pp 210–217

  • Kent BP, Rinaldo A, Verstynen T (2013) DeBaCl: a Python package for interactive DEnsity-BAsed CLustering. J Stat Softw (submitted). arXiv:1307.8136

  • Klemelä J (2004a) Complexity penalized support estimation. J Multivar Anal 88:274–297

  • Klemelä J (2004b) Visualization of multivariate density estimates with level set trees. J Comput Graph Stat 13(3):599–620

  • Klemelä J (2005) Algorithms for the manipulation of level sets of nonparametric density estimates. Comput Stat 20:349–368

  • Klemelä J (2006) Visualization of multivariate density estimates with shape trees. J Comput Graph Stat 15(2):372–397

  • Klemelä J (2009) Smoothing of multivariate data: density estimation and visualization. Wiley, New York

  • Korostelev AP, Tsybakov AB (1993) Minimax theory of image reconstruction (Lecture notes in statistics), vol 82. Springer, Berlin

  • Kpotufe S, von Luxburg U (2011) Pruning nearest neighbor cluster trees. In: Proceedings of the 28th international conference on machine learning, vol 105, pp 225–232

  • Lo K, Brinkman RR, Gottardo R (2008) Automated gating of flow cytometry data via robust model-based clustering. Cytom Part A J Int Soc Anal Cytol 73:321–332

  • Maier M, Hein M, von Luxburg U (2009) Optimal construction of k-nearest-neighbor graphs for identifying noisy clusters. Theor Comput Sci 410(19):1749–1764

  • Mammen E, Tsybakov AB (1995) Asymptotical minimax recovery of sets with smooth boundaries. Ann Stat 23:502–524

  • McMullen P (1970) The maximum numbers of faces of a convex polytope. Mathematika 17:179–184

  • Melamed MR, Lindmo T, Mendelsohn ML (1990) Flow cytometry and sorting, 2nd edn. Wiley, New York

  • Menardi G, Azzalini A (2014) An advancement in clustering via nonparametric density estimation. Stat Comput 24:753–767

  • Morgan JN, Sonquist JA (1963) Problems in the analysis of survey data, and a proposal. J Am Stat Assoc 58:415–434

  • Müller DW, Sawitzki G (1991) Excess mass estimates and tests of multimodality. J Am Stat Assoc 86:738–746

  • Naumann U, Luta G, Wand MP (2010) The curvHDR method for gating flow cytometry samples. BMC Bioinform 11(44). doi:10.1186/1471-2105-11-44

  • Nolan D (1991) The excess-mass ellipsoid. J Multivar Anal 39:348–371

  • O’Neill K, Aghaeepour N, Spidlen J, Brinkman R (2013) Flow cytometry bioinformatics. PLoS Comput Biol 9(12):e1003365. doi:10.1371/journal.pcbi.1003365

  • Ooi H (2002) Density visualization and mode hunting using trees. J Comput Graph Stat 11:328–347

  • Polonik W (1995) Measuring mass concentration and estimating density contour clusters—an excess mass approach. Ann Stat 23:855–881

  • Reeb G (1946) Sur les points singuliers d’une forme de Pfaff complètement intégrable ou d’une fonction numérique. C R Acad Sci Paris 222:847–849

  • Rigollet P, Vert R (2009) Optimal rates for plug-in estimators of density level sets. Bernoulli 15:1154–1178

  • Rinaldo A, Wasserman L (2010) Generalized density clustering. Ann Stat 38:2678–2722

  • Scott DW (1992) Multivariate density estimation: theory, practice, and visualization. Wiley, New York

  • Shapiro HM (2003) Practical flow cytometry, 4th edn. Wiley, New York

  • Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall, London

  • Singh A, Scott C, Nowak R (2009) Adaptive Hausdorff estimation of density level sets. Ann Stat 37:2760–2782

  • Steinwart I (2015) Fully adaptive density-based clustering. Ann Stat 43:2132–2167

  • Stuetzle W (2003) Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J Classif 20(5):25–47

  • Stuetzle W, Nugent R (2010) A generalized single linkage method for estimating the cluster tree of a density. J Comput Graph Stat 19:397–418

  • Tarjan RE (1976) Efficiency of a good but not linear set union algorithm. J ACM 22:215–225

  • Tsybakov AB (1997) On nonparametric estimation of density level sets. Ann Stat 25:948–969

  • Walther G (1997) Granulometric smoothing. Ann Stat 25:2273–2299

  • Walther G, Zimmerman N, Moore W, Parks D, Meehan S, Belitskaya I, Pan J, Herzenberg L (2009) Automatic clustering of flow cytometry data with density-based merging. Adv Bioinform 2009:686–759

  • Zomorodian A (2012) Topological data analysis. In: Zomorodian A (ed) Advances in applied and computational topology, vol 70. American Mathematical Society, Providence, pp 1–40

Author information

Corresponding author

Correspondence to Jussi Klemelä.

Additional information

The authors gratefully acknowledge the TEKES funding under project 24301335.

Electronic supplementary material


Supplementary material 1 (pdf 20040 KB)

About this article

Cite this article

Holmström, L., Karttunen, K. & Klemelä, J. Estimation of level set trees using adaptive partitions. Comput Stat 32, 1139–1163 (2017). https://doi.org/10.1007/s00180-016-0702-2

  • DOI: https://doi.org/10.1007/s00180-016-0702-2
