Density-based clustering with non-continuous data

  • Original Paper
  • Computational Statistics

Abstract

Density-based clustering associates groups with regions of the sample space where the probability distribution underlying the observations has high density. While this approach to cluster analysis has some desirable properties, its use is inherently limited to continuous data. The present contribution proposes a simple but effective way to circumvent this limitation, based on identifying continuous components underlying the non-continuous variables. The basic idea is explored in a number of variants applied to simulated data, which confirm the effectiveness of the technique and lead to recommendations for its practical use. Some illustrations using real data are also presented.
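
The abstract does not say how the continuous components are obtained, so the following is only a minimal sketch of one possible route, not the authors' procedure. It assumes a simple-matching dissimilarity between categorical records followed by classical multidimensional scaling. The function names, the toy data, and the choice of two components are all hypothetical.

```python
import numpy as np

def simple_matching_distance(X):
    """Pairwise fraction of attributes on which two records disagree."""
    X = np.asarray(X)
    # Broadcast to compare every row with every other row, attribute by attribute.
    return (X[:, None, :] != X[None, :, :]).mean(axis=2)

def classical_mds(D, n_components=2):
    """Classical (Torgerson) scaling of a symmetric dissimilarity matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared dissimilarities
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[order], 0.0, None)   # negative eigenvalues are set to zero
    return eigvecs[:, order] * np.sqrt(lam)

# Hypothetical toy data: six units described by three categorical variables.
X = np.array([
    ["a", "x", "low"],
    ["a", "x", "low"],
    ["a", "y", "low"],
    ["b", "z", "high"],
    ["b", "z", "high"],
    ["b", "z", "mid"],
])

D = simple_matching_distance(X)
Z = classical_mds(D, n_components=2)  # continuous coordinates, one row per unit
```

The rows of `Z` are continuous coordinates that could then be passed to any density-based clustering routine. Other dissimilarities or other ways of extracting continuous components would fit the same outline.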

Author information

Corresponding author

Correspondence to Giovanna Menardi.

Electronic supplementary material

Supplementary material 1 (pdf 507 KB)

About this article

Cite this article

Azzalini, A., Menardi, G. Density-based clustering with non-continuous data. Comput Stat 31, 771–798 (2016). https://doi.org/10.1007/s00180-016-0644-8

  • DOI: https://doi.org/10.1007/s00180-016-0644-8
