Statistics and Computing

Volume 24, Issue 5, pp 753–767

An advancement in clustering via nonparametric density estimation

  • Giovanna Menardi
  • Adelchi Azzalini


Density-based clustering methods hinge on the idea of associating groups with the connected components of the level sets of the density underlying the data, which is estimated by a nonparametric method. These methods have some desirable properties and generally perform well, but they involve a non-trivial computational effort, required for the identification of the connected regions. In a previous work, the use of a spatial tessellation such as the Delaunay triangulation was proposed, because it suitably generalizes the univariate procedure for detecting the connected components. However, its computational complexity grows exponentially with the dimensionality of the data, making the triangulation infeasible in high dimensions. Our aim is to overcome the limitations of the Delaunay triangulation. We discuss the use of an alternative procedure for identifying the connected regions associated with the level sets of the density. By measuring the extent of possible valleys of the density along the segment connecting pairs of observations, the proposed procedure shifts the formulation from a space of arbitrary dimension to a univariate one, thus yielding benefits in both computation and visualization.
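The idea of measuring valleys along the segment between two observations can be illustrated with a minimal sketch. It is not the authors' exact definition: it assumes a fixed-bandwidth Gaussian kernel estimate and a simple "water-filling" valley measure, and the names `kde_gaussian` and `valley_measure` are hypothetical, chosen only for this illustration.

```python
import numpy as np


def kde_gaussian(points, data, bandwidth):
    """Fixed-bandwidth Gaussian kernel density estimate (an assumed estimator).

    points: array of shape (m, d); data: array of shape (n, d).
    """
    points = np.atleast_2d(points)
    data = np.atleast_2d(data)
    d = data.shape[1]
    # Scaled differences between every evaluation point and every observation.
    diff = (points[:, None, :] - data[None, :, :]) / bandwidth
    kernel = np.exp(-0.5 * np.sum(diff ** 2, axis=2))
    norm = (2.0 * np.pi) ** (d / 2) * bandwidth ** d
    return kernel.mean(axis=1) / norm


def valley_measure(x1, x2, data, bandwidth, n_grid=50):
    """Relative size of the density valleys along the segment from x1 to x2.

    Assumption of this sketch: the valley is the area between the density
    profile and its "filled" version, in which every dip is raised to the
    lower of the two peaks enclosing it. Returns a value in [0, 1).
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    t = np.linspace(0.0, 1.0, n_grid)[:, None]
    segment = (1.0 - t) * x1 + t * x2          # univariate path in d dimensions
    profile = kde_gaussian(segment, data, bandwidth)
    left = np.maximum.accumulate(profile)               # running max from x1
    right = np.maximum.accumulate(profile[::-1])[::-1]  # running max from x2
    filled = np.minimum(left, right)
    return float(np.sum(filled - profile) / np.sum(filled))
```

Under these assumptions, a pair of observations whose measure stays below a chosen tolerance would be treated as lying in the same connected region, while a large value signals a valley separating them. Whatever the dimension of the data, each pair only requires evaluating the density on a one-dimensional grid.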


Cluster analysis; Connected sets; Nonparametric density estimation; Kernel method



We wish to thank the two anonymous referees and an Associate Editor for their valuable comments, which prompted a clearer exposition of the original formulation and helped to improve the overall outcome.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Department of Statistical Sciences, University of Padua, Padova, Italy
