Statistics and Computing

, Volume 24, Issue 5, pp 753–767 | Cite as

An advancement in clustering via nonparametric density estimation

Article

Abstract

Density-based clustering methods hinge on the idea of associating groups to the connected components of the level sets of the density underlying the data, to be estimated by a nonparametric method. These methods claim some desirable properties and generally good performance, but they involve a non-trivial computational effort, required for the identification of the connected regions. In a previous work, the use of spatial tessellation such as the Delaunay triangulation has been proposed, because it suitably generalizes the univariate procedure for detecting the connected components. However, its computational complexity grows exponentially with the dimensionality of data, thus making the triangulation unfeasible for high dimensions. Our aim is to overcome the limitations of Delaunay triangulation. We discuss the use of an alternative procedure for identifying the connected regions associated to the level sets of the density. By measuring the extent of possible valleys of the density along the segment connecting pairs of observations, the proposed procedure shifts the formulation from a space with arbitrary dimension to a univariate one, thus leading benefits both in computation and visualization.

Keywords

Cluster analysis Connected sets Nonparametric density estimation Kernel method 

References

  1. Abramson, I.S.: On bandwidth variation in kernel estimates—a square root law. Ann. Stat. 10, 1217–1223 (1982) CrossRefMATHMathSciNetGoogle Scholar
  2. Azzalini, A., Torelli, N.: Clustering via nonparametric density estimation. Stat. Comput. 17, 71–80 (2007) CrossRefMathSciNetGoogle Scholar
  3. Azzalini, A., Menardi, G., Rosolin, T.: R package pdfCluster: cluster analysis via nonparametric density estimation (version 1.0-0) (2012). http://cran.r-project.org/package=pdfCluster
  4. Burman, P., Polonik, W.: Multivariate mode hunting: data analytic tools with measures of significance. J. Multivar. Anal. 100, 1198–1218 (2009) CrossRefMATHMathSciNetGoogle Scholar
  5. Cuevas, A., Febrero, M., Fraiman, R.: Estimating the number of clusters. Can. J. Stat. 28, 367–382 (2000) CrossRefMATHMathSciNetGoogle Scholar
  6. Cuevas, A., Febrero, M., Fraiman, R.: Cluster analysis: a further approach based on density estimation. Comput. Stat. Data Anal. 36, 441–459 (2001) CrossRefMATHMathSciNetGoogle Scholar
  7. Dazard, J.E., Rao, J.S.: Local sparse bump hunting. J. Comput. Graph. Stat. 19, 900–929 (2010) CrossRefGoogle Scholar
  8. De la Cruz, R.: Bayesian non-linear regression models with skew-elliptical errors: applications to the classification of longitudinal profiles. Comput. Stat. Data Anal. 53, 436–449 (2008) CrossRefMATHGoogle Scholar
  9. Du, Q., Faber, V., Gunzburger, M.: Centroidal Voronoi tessellations: applications and algorithms. SIAM Rev. 41, 637–676 (1999) CrossRefMATHMathSciNetGoogle Scholar
  10. Even, S.: Graph Algorithms, 2nd edn. Cambridge University Press, New York (2011) CrossRefGoogle Scholar
  11. Forina, M., Armanino, C., Lanteri, S., Tiscornia, E.: Classication of olive oils from their fatty acid composition. In: Food Research and Data Analysis, pp. 189–214. Applied Sc. Publishers, London (1983) Google Scholar
  12. Forina, M., Armanino, C., Castino, M., Ubigli, M.: Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25, 189–201 (1986) Google Scholar
  13. Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis and density estimation. J. Am. Stat. Assoc. 97, 611–631 (2002) CrossRefMATHMathSciNetGoogle Scholar
  14. Fraley, C., Raftery, A.E.: MCLUST vers. 3 for R: normal mixture modeling and model-based clustering. Tech. Rep. 504, Univ. Washington, Dep. Stat. (2006), rev. 2009 Google Scholar
  15. Friedman, J.H.: On bias, variance, 0–1 loss, and the curse of dimensionality. Data Min. Knowl. Discov. 1, 55–77 (1997) CrossRefGoogle Scholar
  16. Gower, J.C., Ross, G.J.S.: Minimum spanning trees and single linkage cluster analysis. J. R. Stat. Soc., Ser. C, Appl. Stat. 18, 54–64 (1969) MathSciNetGoogle Scholar
  17. Hartigan, J.A.: Clustering Algorithms. Wiley, New York (1975) MATHGoogle Scholar
  18. Kriegel, H., Kröger, P., Sander, J., Zimek, A.: Density-based clustering. Data Min. Knowl. Discov. 1, 231–240 (2011) CrossRefGoogle Scholar
  19. Krznaric, D., Levcopoulos, C.: Fast algorithms for complete linkage clustering. Discrete Comput. Geom. 19, 131–145 (1998) CrossRefMATHMathSciNetGoogle Scholar
  20. Lubischew, A.A.: On the use of discriminant analysis in taxonomy. Biometrics 18, 455–477 (1962) CrossRefMATHGoogle Scholar
  21. Menardi, G., Torelli, N.: Reducing data dimension for cluster detection. J. Stat. Comput. Simul. (2012). doi:10.1080/00949655.2012.679032 Google Scholar
  22. Minnotte, M.C.: Nonparametric testing for the existence of modes. Ann. Stat. 25, 1646–1660 (1997) CrossRefMATHMathSciNetGoogle Scholar
  23. Müller, D.W., Sawitzki, G.: Excess mass estimates and tests for multimodality. J. Am. Stat. Assoc. 86, 738–746 (1991) MATHGoogle Scholar
  24. Prates, M., Lachos, V., Cabral, C.: R package mixsmsn: fitting finite mixture of scale mixture of skew-normal distributions (version 1.0-3) (2012). http://cran.r-project.org/web/packages/mixsmsn/index.html
  25. Ooi, H.: Density visualization and mode hunting using trees. J. Comput. Graph. Stat. 11, 328–347 (2002) CrossRefMathSciNetGoogle Scholar
  26. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2013). http://www.R-project.org/. ISBN 3-900051-07-3-900051-0
  27. Rinaldo, A., Wasserman, L.: Generalized density clustering. Ann. Stat. 38, 2678–2722 (2010) CrossRefMATHMathSciNetGoogle Scholar
  28. Sahu, S.K., Dey, D.K., Branco, M.D.: A new class of multivariate skew distributions with applications to Bayesian regression models. Can. J. Stat. 31, 129–150 (2003) CrossRefMATHMathSciNetGoogle Scholar
  29. Scott, D.W., Sain, S.: Multidimensional Density Estimation. Handbook of Statistics, vol. 24, pp. 229–261 (2005) Google Scholar
  30. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, New York (1986) CrossRefMATHGoogle Scholar
  31. Stuetzle, W.: Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J. Classif. 20, 25–47 (2003) CrossRefMATHMathSciNetGoogle Scholar
  32. Stuetzle, W., Nugent, R.: A generalized single linkage method for estimating the cluster tree of a density. J. Comput. Graph. Stat. 19, 397–418 (2010) CrossRefMathSciNetGoogle Scholar
  33. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a dataset via the GAP statistic. J. R. Stat. Soc., Ser. B, Stat. Methodol. 63, 411–423 (2000) CrossRefMathSciNetGoogle Scholar
  34. Wishart, D.: Mode analysis: a generalization of nearest neighbor which reduces chaining effects. In: Cole, A.J. (ed.) Numerical Taxonomy, pp. 282–308. Academic Press, London (1969) Google Scholar

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. 1.Department of Statistical SciencesUniversity of PaduaPadovaItaly

Personalised recommendations