Skip to main content
Log in

An advancement in clustering via nonparametric density estimation

  • Published:
Statistics and Computing Aims and scope Submit manuscript

Abstract

Density-based clustering methods hinge on the idea of associating groups to the connected components of the level sets of the density underlying the data, to be estimated by a nonparametric method. These methods claim some desirable properties and generally good performance, but they involve a non-trivial computational effort, required for the identification of the connected regions. In a previous work, the use of spatial tessellation such as the Delaunay triangulation has been proposed, because it suitably generalizes the univariate procedure for detecting the connected components. However, its computational complexity grows exponentially with the dimensionality of data, thus making the triangulation unfeasible for high dimensions. Our aim is to overcome the limitations of Delaunay triangulation. We discuss the use of an alternative procedure for identifying the connected regions associated to the level sets of the density. By measuring the extent of possible valleys of the density along the segment connecting pairs of observations, the proposed procedure shifts the formulation from a space with arbitrary dimension to a univariate one, thus leading benefits both in computation and visualization.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Similar content being viewed by others

Notes

  1. http://www.ncdc.noaa.gov/oa/climate/climateproducts.html#PUBS.

References

  • Abramson, I.S.: On bandwidth variation in kernel estimates—a square root law. Ann. Stat. 10, 1217–1223 (1982)

    Article  MATH  MathSciNet  Google Scholar 

  • Azzalini, A., Torelli, N.: Clustering via nonparametric density estimation. Stat. Comput. 17, 71–80 (2007)

    Article  MathSciNet  Google Scholar 

  • Azzalini, A., Menardi, G., Rosolin, T.: R package pdfCluster: cluster analysis via nonparametric density estimation (version 1.0-0) (2012). http://cran.r-project.org/package=pdfCluster

  • Burman, P., Polonik, W.: Multivariate mode hunting: data analytic tools with measures of significance. J. Multivar. Anal. 100, 1198–1218 (2009)

    Article  MATH  MathSciNet  Google Scholar 

  • Cuevas, A., Febrero, M., Fraiman, R.: Estimating the number of clusters. Can. J. Stat. 28, 367–382 (2000)

    Article  MATH  MathSciNet  Google Scholar 

  • Cuevas, A., Febrero, M., Fraiman, R.: Cluster analysis: a further approach based on density estimation. Comput. Stat. Data Anal. 36, 441–459 (2001)

    Article  MATH  MathSciNet  Google Scholar 

  • Dazard, J.E., Rao, J.S.: Local sparse bump hunting. J. Comput. Graph. Stat. 19, 900–929 (2010)

    Article  Google Scholar 

  • De la Cruz, R.: Bayesian non-linear regression models with skew-elliptical errors: applications to the classification of longitudinal profiles. Comput. Stat. Data Anal. 53, 436–449 (2008)

    Article  MATH  Google Scholar 

  • Du, Q., Faber, V., Gunzburger, M.: Centroidal Voronoi tessellations: applications and algorithms. SIAM Rev. 41, 637–676 (1999)

    Article  MATH  MathSciNet  Google Scholar 

  • Even, S.: Graph Algorithms, 2nd edn. Cambridge University Press, New York (2011)

    Book  Google Scholar 

  • Forina, M., Armanino, C., Lanteri, S., Tiscornia, E.: Classication of olive oils from their fatty acid composition. In: Food Research and Data Analysis, pp. 189–214. Applied Sc. Publishers, London (1983)

    Google Scholar 

  • Forina, M., Armanino, C., Castino, M., Ubigli, M.: Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25, 189–201 (1986)

    Google Scholar 

  • Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis and density estimation. J. Am. Stat. Assoc. 97, 611–631 (2002)

    Article  MATH  MathSciNet  Google Scholar 

  • Fraley, C., Raftery, A.E.: MCLUST vers. 3 for R: normal mixture modeling and model-based clustering. Tech. Rep. 504, Univ. Washington, Dep. Stat. (2006), rev. 2009

  • Friedman, J.H.: On bias, variance, 0–1 loss, and the curse of dimensionality. Data Min. Knowl. Discov. 1, 55–77 (1997)

    Article  Google Scholar 

  • Gower, J.C., Ross, G.J.S.: Minimum spanning trees and single linkage cluster analysis. J. R. Stat. Soc., Ser. C, Appl. Stat. 18, 54–64 (1969)

    MathSciNet  Google Scholar 

  • Hartigan, J.A.: Clustering Algorithms. Wiley, New York (1975)

    MATH  Google Scholar 

  • Kriegel, H., Kröger, P., Sander, J., Zimek, A.: Density-based clustering. Data Min. Knowl. Discov. 1, 231–240 (2011)

    Article  Google Scholar 

  • Krznaric, D., Levcopoulos, C.: Fast algorithms for complete linkage clustering. Discrete Comput. Geom. 19, 131–145 (1998)

    Article  MATH  MathSciNet  Google Scholar 

  • Lubischew, A.A.: On the use of discriminant analysis in taxonomy. Biometrics 18, 455–477 (1962)

    Article  MATH  Google Scholar 

  • Menardi, G., Torelli, N.: Reducing data dimension for cluster detection. J. Stat. Comput. Simul. (2012). doi:10.1080/00949655.2012.679032

    Google Scholar 

  • Minnotte, M.C.: Nonparametric testing for the existence of modes. Ann. Stat. 25, 1646–1660 (1997)

    Article  MATH  MathSciNet  Google Scholar 

  • Müller, D.W., Sawitzki, G.: Excess mass estimates and tests for multimodality. J. Am. Stat. Assoc. 86, 738–746 (1991)

    MATH  Google Scholar 

  • Prates, M., Lachos, V., Cabral, C.: R package mixsmsn: fitting finite mixture of scale mixture of skew-normal distributions (version 1.0-3) (2012). http://cran.r-project.org/web/packages/mixsmsn/index.html

  • Ooi, H.: Density visualization and mode hunting using trees. J. Comput. Graph. Stat. 11, 328–347 (2002)

    Article  MathSciNet  Google Scholar 

  • R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2013). http://www.R-project.org/. ISBN 3-900051-07-3-900051-0

  • Rinaldo, A., Wasserman, L.: Generalized density clustering. Ann. Stat. 38, 2678–2722 (2010)

    Article  MATH  MathSciNet  Google Scholar 

  • Sahu, S.K., Dey, D.K., Branco, M.D.: A new class of multivariate skew distributions with applications to Bayesian regression models. Can. J. Stat. 31, 129–150 (2003)

    Article  MATH  MathSciNet  Google Scholar 

  • Scott, D.W., Sain, S.: Multidimensional Density Estimation. Handbook of Statistics, vol. 24, pp. 229–261 (2005)

    Google Scholar 

  • Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, New York (1986)

    Book  MATH  Google Scholar 

  • Stuetzle, W.: Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J. Classif. 20, 25–47 (2003)

    Article  MATH  MathSciNet  Google Scholar 

  • Stuetzle, W., Nugent, R.: A generalized single linkage method for estimating the cluster tree of a density. J. Comput. Graph. Stat. 19, 397–418 (2010)

    Article  MathSciNet  Google Scholar 

  • Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a dataset via the GAP statistic. J. R. Stat. Soc., Ser. B, Stat. Methodol. 63, 411–423 (2000)

    Article  MathSciNet  Google Scholar 

  • Wishart, D.: Mode analysis: a generalization of nearest neighbor which reduces chaining effects. In: Cole, A.J. (ed.) Numerical Taxonomy, pp. 282–308. Academic Press, London (1969)

    Google Scholar 

Download references

Acknowledgements

We wish to thank the two anonymous referees and an Associate editor for their valuable comments which have stimulated a clearer exposition of the original formulation and helped to improve the overall outcome.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Giovanna Menardi.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Menardi, G., Azzalini, A. An advancement in clustering via nonparametric density estimation. Stat Comput 24, 753–767 (2014). https://doi.org/10.1007/s11222-013-9400-x

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11222-013-9400-x

Keywords

Navigation