Density estimation with distribution element trees

Statistics and Computing

Abstract

The estimation of probability densities based on available data is a central task in many statistical applications. Especially in the case of large ensembles with many samples or high-dimensional sample spaces, computationally efficient methods are needed. We propose a new method that is based on a decomposition of the unknown distribution in terms of so-called distribution elements (DEs). These elements enable an adaptive and hierarchical discretization of the sample space with large or small elements in regions with smoothly or highly variable densities, respectively. The novel refinement strategy that we propose is based on statistical goodness-of-fit tests and pairwise independence tests (the latter as an approximation to mutual independence tests) that evaluate the local approximation of the distribution in terms of DEs. The capabilities of our new method are examined on several examples of different dimensionality, where it compares favorably with other state-of-the-art density estimators.

Notes

  1. Fixing the domain bounds based on the data range leads to bounds that are almost certainly too narrow. Because the estimate is normalized over this too-small domain, the resulting density estimates will be biased toward values that are too high.

  2. In the singular case of linearly dependent components \(x_1\) and \(x_2\), a DET estimator produces many small DEs that resolve the probability peak along the diagonal of the \(x_1\)-\(x_2\) space.
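
Note 1 can be illustrated with a small numerical experiment. The following sketch is only an illustration and not part of the DET method: it assumes NumPy, the function name is hypothetical, and the test density is chosen here as uniform on [0, 1] (true density 1). The density is estimated with a single constant element whose bounds are fixed at the data range.

```python
import numpy as np

def range_bounded_uniform_density(samples):
    """Constant density of one element spanning [min, max] of the data.

    Hypothetical helper for illustration: all probability mass is
    normalized over the sample range instead of the true support.
    """
    width = samples.max() - samples.min()
    return 1.0 / width

rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    x = rng.random(n)  # samples from the uniform density on [0, 1)
    print(n, range_bounded_uniform_density(x))
```

Because the sample range is always narrower than the true support, every estimate exceeds the true value of 1. For this uniform case the expected range is (n - 1)/(n + 1), so the bias shrinks as the ensemble grows.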

Author information

Corresponding author

Correspondence to Daniel W. Meyer.

Additional information

The author is grateful to Marco Weibel for his help during the preparation of this manuscript. Very valuable feedback from an associate editor and two reviewers and helpful input from Oliver Brenner and Florian Müller are gratefully acknowledged. Moreover, the author acknowledges helpful comments from Nina Roth and feedback on the initial version of this manuscript from Patrick Jenny, both from ETH Zürich. The author has been financially supported by ETH Zürich.

Appendix: Derivation of MMSE slope estimator

Without loss of generality, we write the linear marginal PDF (4) in a simpler form with \(x_i\in [0,1]\) and the subscripts dropped, which gives

$$\begin{aligned} p(x|\theta ) = \left( x - {\textstyle \frac{1}{2}}\right) \theta + 1. \end{aligned}$$

Calculating the mean of the random variable X based on this PDF gives \(\langle X\rangle = \frac{1}{12}(6 + \theta )\), so the slope parameter can be expressed in terms of this mean as

$$\begin{aligned} \theta = 6(2\langle X\rangle - 1). \end{aligned} \tag{20}$$
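
As a worked check, the mean quoted above follows from a direct integration of the simplified PDF over the unit interval. The second moment and the variance are obtained in the same way; they are not stated in the text and are listed here only for reference. The PDF is nonnegative on \([0,1]\) for \(|\theta |\le 2\).

$$\begin{aligned} \langle X\rangle &= \int _0^1 x\left[ \left( x - {\textstyle \frac{1}{2}}\right) \theta + 1\right] \mathrm {d}x = \theta \left( \frac{1}{3} - \frac{1}{4}\right) + \frac{1}{2} = \frac{1}{12}(6 + \theta ), \\ \langle X^2\rangle &= \int _0^1 x^2\left[ \left( x - {\textstyle \frac{1}{2}}\right) \theta + 1\right] \mathrm {d}x = \theta \left( \frac{1}{4} - \frac{1}{6}\right) + \frac{1}{3} = \frac{1}{12}(4 + \theta ), \\ \langle X^2\rangle - \langle X\rangle ^2 &= \frac{1}{12} - \frac{\theta ^2}{144}. \end{aligned}$$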

In the case of a finite ensemble, we estimate the mean with \(\langle X\rangle _n = \frac{1}{n}\sum _{j = 1}^n x_j\) and the slope by

$$\begin{aligned} \hat{\theta } = 6c(2\langle X\rangle _n - 1). \end{aligned}$$

Here, c is a correction factor that is determined by minimizing the mean square error (MSE) expressed as

$$\begin{aligned}
\langle (\hat{\theta } - \theta )^2\rangle &= \left\langle \left[ 6c\left( \frac{2}{n}\sum _{j = 1}^n x_j - 1\right) - \theta \right] ^2\right\rangle \\
&= \left\langle 36c^2\left( \frac{2}{n}\sum _{j = 1}^n x_j - 1\right) ^2 - 12c\left( \frac{2}{n}\sum _{j = 1}^n x_j - 1\right) \theta + \theta ^2\right\rangle \\
&= 36c^2\left\langle \frac{4}{n^2}\sum _{j = 1}^n\sum _{k = 1}^n x_j x_k - \frac{4}{n}\sum _{j = 1}^n x_j + 1\right\rangle \\
&\quad - 12c(2\langle X\rangle - 1)\theta + \theta ^2 \\
&= 36c^2\left( \frac{4}{n^2}\left\langle \sum _{j = 1}^n\sum _{k = 1}^n x_j x_k\right\rangle - 4\langle X\rangle + 1\right) \\
&\quad - 12c(2\langle X\rangle - 1)\theta + \theta ^2 \\
&= 36c^2\left( \frac{4}{n^2}\left\langle \sum _{j = 1}^n\sum _{\substack{k = 1\\ k \ne j}}^n x_j x_k + \sum _{j = 1}^n x_j^2\right\rangle - 4\langle X\rangle + 1\right) \\
&\quad - 12c(2\langle X\rangle - 1)\theta + \theta ^2 \\
&= 36c^2\left[ \frac{4(n-1)}{n}\langle X\rangle ^2 + \frac{4}{n}\langle X^2\rangle - 4\langle X\rangle + 1\right] \\
&\quad - 12c(2\langle X\rangle - 1)\theta + \theta ^2.
\end{aligned}$$

To determine the minimum MSE, we set

$$\begin{aligned}
\frac{\mathrm {d}}{\mathrm {d} c}\langle (\hat{\theta } - \theta )^2\rangle &= 72c\left( \frac{4(n-1)}{n}\langle X\rangle ^2 + \frac{4}{n}\langle X^2\rangle - 4\langle X\rangle + 1\right) \\
&\quad - 12(2\langle X\rangle - 1)\theta = 0,
\end{aligned}$$

which yields the correction factor

$$\begin{aligned}
c &= \frac{(2\langle X\rangle - 1)n\theta }{6[4(n-1)\langle X\rangle ^2 + 4\langle X^2\rangle - 4n\langle X\rangle + n]} \\
&= \frac{(2\langle X\rangle - 1)n\theta }{6[4n\langle X\rangle ^2 - 4\langle X\rangle ^2 + 4\langle X^2\rangle - 4n\langle X\rangle + n]} \\
&= \frac{(2\langle X\rangle - 1)n\theta }{6[n(2\langle X\rangle - 1)^2 + 4\langle X^{\prime 2}\rangle ]} \\
&= \frac{6n(2\langle X\rangle - 1)^2}{6[n(2\langle X\rangle - 1)^2 + 4\langle X^{\prime 2}\rangle ]}.
\end{aligned} \tag{21}$$

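Here, \(\langle X^{\prime 2}\rangle = \langle X^2\rangle - \langle X\rangle ^2\) is the variance of X, and the last step uses Eq. (20). For \(n\rightarrow \infty \) and \(\theta \ne 0\), the correction factor c goes to one.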
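
The effect of the correction factor can be checked numerically. The following Python sketch is an illustration under stated assumptions, not the author's implementation: it assumes NumPy, all function names are hypothetical, samples are drawn from the linear PDF by inverse-transform sampling, and c is evaluated from (21) with the true moments of X. The last point is an oracle choice made only for this check, since in practice \(\theta \) is unknown.

```python
import numpy as np

def sample_linear_pdf(theta, size, rng):
    """Inverse-transform samples from p(x) = (x - 1/2) * theta + 1 on [0, 1].

    The CDF is F(x) = theta/2 * x**2 + (1 - theta/2) * x, valid for |theta| <= 2.
    """
    u = rng.random(size)
    if abs(theta) < 1e-12:
        return u  # theta = 0 reduces to the uniform density
    b = 1.0 - 0.5 * theta
    return (-b + np.sqrt(b * b + 2.0 * theta * u)) / theta

def correction_factor(theta, n):
    """Correction factor c of Eq. (21), using the exact moments (oracle)."""
    mean = (6.0 + theta) / 12.0            # <X>
    var = 1.0 / 12.0 - theta**2 / 144.0    # <X'^2> = <X^2> - <X>^2
    d = (2.0 * mean - 1.0) ** 2
    return n * d / (n * d + 4.0 * var)

def slope_estimate(x, c=1.0):
    """Slope estimate 6 c (2 <X>_n - 1); the mean is taken over the last axis."""
    return 6.0 * c * (2.0 * np.mean(x, axis=-1) - 1.0)

def monte_carlo_mse(theta, n, trials=20000, seed=0):
    """Monte Carlo MSE of the uncorrected (c = 1) and corrected estimators."""
    rng = np.random.default_rng(seed)
    x = sample_linear_pdf(theta, (trials, n), rng)
    c = correction_factor(theta, n)
    mse_plain = np.mean((slope_estimate(x) - theta) ** 2)
    mse_corrected = np.mean((slope_estimate(x, c) - theta) ** 2)
    return mse_plain, mse_corrected

mse_plain, mse_corrected = monte_carlo_mse(theta=1.0, n=20)
print("c =", correction_factor(1.0, 20))
print("MSE with c = 1:", mse_plain)
print("MSE with c from (21):", mse_corrected)
```

For c = 1 the estimator is unbiased and its MSE equals \(144\langle X^{\prime 2}\rangle /n = (12 - \theta ^2)/n\), which gives a reference value for the first Monte Carlo result. The corrected estimator trades a small bias for a lower variance, so its MSE should come out smaller for small ensembles.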

Cite this article

Meyer, D.W. Density estimation with distribution element trees. Stat Comput 28, 609–632 (2018). https://doi.org/10.1007/s11222-017-9751-9
