Computational Statistics, Volume 31, Issue 2, pp 771–798

Density-based clustering with non-continuous data

Original Paper

Abstract

Density-based clustering associates groups with regions of the sample space where the probability distribution underlying the observations has high density. While this approach to cluster analysis has some desirable properties, its use is necessarily limited to continuous data. The present contribution proposes a simple but effective way to circumvent this limitation, based on identifying continuous components underlying the non-continuous variables. The basic idea is explored in a number of variants applied to simulated data, which confirm the practical effectiveness of the technique and lead to recommendations for its use. Some illustrations using real data are also presented.
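As a rough illustration of the idea, the following Python sketch maps purely categorical observations to continuous coordinates and then clusters them by the modes of a kernel density estimate. It is one possible reading, not the authors' procedure: the abstract does not state which dissimilarity, scaling method, or density estimator is used, so the mismatch-proportion dissimilarity, classical multidimensional scaling, Gaussian mean shift, the bandwidth value, the toy data, and all function names are assumptions made for illustration.

```python
import numpy as np

def mismatch_dissimilarity(X):
    """Proportion of variables on which each pair of rows disagrees (assumed choice)."""
    X = np.asarray(X)
    return (X[:, None, :] != X[None, :, :]).mean(axis=2)

def classical_mds(D, k=2):
    """Classical (Torgerson) scaling of a dissimilarity matrix into k dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.full((n, n), 1.0 / n)   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared dissimilarities
    vals, vecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]           # indices of the k largest eigenvalues
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

def mean_shift_labels(Y, bandwidth, n_iter=500, merge_tol=1e-3):
    """Move every point uphill on a Gaussian kernel density estimate of Y;
    points that reach the same mode receive the same label."""
    Z = Y.copy()
    for _ in range(n_iter):
        sq = ((Z[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        W = np.exp(-0.5 * sq / bandwidth ** 2)
        Z_next = (W @ Y) / W.sum(axis=1, keepdims=True)
        converged = np.abs(Z_next - Z).max() < 1e-10
        Z = Z_next
        if converged:
            break
    modes, labels = [], np.empty(len(Z), dtype=int)
    for i, z in enumerate(Z):
        for j, m in enumerate(modes):
            if np.linalg.norm(z - m) < merge_tol:
                labels[i] = j
                break
        else:                                  # no existing mode is close enough
            modes.append(z)
            labels[i] = len(modes) - 1
    return labels

# Hypothetical toy data: five categorical variables separate two groups,
# the sixth variable is noise shared by both groups.
group_a = [["a", "x", "p", "k", "r", noise] for noise in ("u", "v")] * 5
group_b = [["b", "y", "q", "l", "s", noise] for noise in ("u", "v")] * 5
X = np.array(group_a + group_b)

D = mismatch_dissimilarity(X)                  # pairwise dissimilarities
Y = classical_mds(D, k=2)                      # continuous coordinates
labels = mean_shift_labels(Y, bandwidth=0.3)   # modal clustering on the coordinates
print(labels)
```

The bandwidth is tuned by hand for this toy example; in practice the result of any such pipeline depends on the dissimilarity, the number of retained dimensions, and the amount of smoothing.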

Keywords

Density estimation, Mixed variables, Modal clustering, Model-based clustering, Multidimensional scaling

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. Dipartimento di Scienze Statistiche, Università degli Studi di Padova, Padova, Italy
