Model-based clustering of probability density functions

  • Regular Article
Advances in Data Analysis and Classification

Abstract

Complex data in which each statistical unit is described not by a single observation (or vector of variables) but by a unit-specific sample of several, or even many, observations are becoming increasingly common. Reducing such samples to summary statistics, like the mean or the median, discards most of the information they carry about variability, skewness or multi-modality. Full information is preserved only if each unit is described by a whole distribution. This new kind of data, also known as “distribution-valued data”, requires the development of adequate statistical methods. This paper presents a method for grouping a set of probability density functions (pdfs) into homogeneous clusters when the pdfs have to be estimated nonparametrically from the unit-specific data. Since elements belonging to the same cluster are naturally thought of as samples from the same probability model, the idea is to tackle the clustering problem by defining and estimating a proper mixture model on the space of pdfs. Model building is challenging here because the domain space is infinite-dimensional and has a non-Euclidean geometry. By adopting a wavelet-based representation for the elements of the space, the task is accomplished using mixture models for hyper-spherical data. The proposed solution is illustrated through a simulation experiment and on two real data sets.
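To make the link between pdfs and hyper-spherical data concrete, the following is a minimal sketch under stated assumptions and not the procedure of the paper, whose details the abstract does not give. It assumes a square-root transform: since a pdf integrates to one, its square root has unit L2 norm, so its coefficients in any orthonormal basis lie on a unit hypersphere. A fixed-resolution histogram (a Haar-like, piecewise-constant basis) stands in for a wavelet estimate, and spherical k-means with hard assignments stands in for a mixture model for hyper-spherical data. The function names `sqrt_density_coords` and `spherical_kmeans`, the grid and the simulated samples are all hypothetical.

```python
import numpy as np


def sqrt_density_coords(sample, edges):
    """Map one unit-specific sample to a point on the unit hypersphere.

    Assumption: a histogram stands in for a wavelet density estimate.
    The square roots of the bin probabilities are the coefficients of
    sqrt(pdf) in the orthonormal basis of normalised bin indicators.
    """
    counts, _ = np.histogram(sample, bins=edges)
    probs = counts / counts.sum()  # bin probabilities sum to one
    return np.sqrt(probs)          # hence the vector has unit norm


def spherical_kmeans(X, k, n_iter=50):
    """Hard-assignment clustering of unit vectors by cosine similarity.

    A simplified stand-in for a mixture model on the hypersphere.
    """
    # Deterministic farthest-point initialisation of the k directions.
    centers = [X[0]]
    for _ in range(1, k):
        sim = np.max(X @ np.array(centers).T, axis=1)
        centers.append(X[np.argmin(sim)])
    centers = np.array(centers)

    for _ in range(n_iter):
        # Assign each unit to the most similar mean direction.
        labels = np.argmax(X @ centers.T, axis=1)
        for j in range(k):
            total = X[labels == j].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:  # leave the centre of an empty cluster unchanged
                centers[j] = total / norm
    return labels, centers


# Simulated units: 10 samples from N(-2, 1) and 10 from N(2, 1).
rng = np.random.default_rng(0)
edges = np.linspace(-6.0, 6.0, 33)  # 32 equal-width bins
samples = [rng.normal(-2.0, 1.0, 200) for _ in range(10)]
samples += [rng.normal(2.0, 1.0, 200) for _ in range(10)]

X = np.array([sqrt_density_coords(s, edges) for s in samples])
labels, centers = spherical_kmeans(X, k=2)

print(np.linalg.norm(X, axis=1).round(6))  # norms of the mapped units
print(labels)                              # cluster label of each unit
```

In this sketch the inner product between two rows of `X` is the Bhattacharyya coefficient of the two histograms, so cosine similarity on the sphere is a natural closeness measure between the estimated densities. A model-based variant would replace the hard assignments with posterior probabilities under directional component distributions.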

Author information

Correspondence to Daniela G. Calò.

Cite this article

Montanari, A., Calò, D.G. Model-based clustering of probability density functions. Adv Data Anal Classif 7, 301–319 (2013). https://doi.org/10.1007/s11634-013-0140-8

  • DOI: https://doi.org/10.1007/s11634-013-0140-8