A distributed EM algorithm to estimate the parameters of a finite mixture of components

  • Regular Paper
Knowledge and Information Systems

Abstract

This paper first introduces a distributed expectation maximization (DEM) algorithm, in a general form, for estimating the parameters of a finite mixture of components. The algorithm is used for density estimation and clustering of data distributed over the nodes of a network. A distributed incremental EM algorithm (DIEM) with a higher convergence rate is then proposed. After a full derivation of the distributed EM algorithms, their convergence is analyzed using the concept of negative free energy from statistical physics. An analytical approach is also developed for evaluating the convergence rate of both the incremental and the distributed incremental EM algorithms. It is shown analytically that DIEM converges much faster than DEM. Finally, simulation results confirm that DIEM markedly outperforms DEM on both synthetic and real data sets.
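
The abstract does not give the update equations, so the following is only a generic sketch of the idea of EM over data held on several nodes, not the authors' DEM or DIEM derivation. It assumes a one-dimensional Gaussian mixture and a simple scheme in which each node computes local sufficient statistics that are then summed; the function names `local_stats` and `distributed_em`, the synthetic data, and the initial parameters are all hypothetical.

```python
import math
import random


def local_stats(data, weights, means, variances):
    """E-step on one node: return that node's sufficient statistics."""
    k = len(weights)
    s0 = [0.0] * k  # sum of responsibilities per component
    s1 = [0.0] * k  # responsibility-weighted sum of x
    s2 = [0.0] * k  # responsibility-weighted sum of x squared
    for x in data:
        dens = [
            w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances)
        ]
        total = sum(dens) or 1e-300  # guard against numerical underflow
        for j in range(k):
            r = dens[j] / total
            s0[j] += r
            s1[j] += r * x
            s2[j] += r * x * x
    return s0, s1, s2


def distributed_em(nodes, weights, means, variances, iters=30):
    """Sum per-node statistics, then apply the usual mixture M-step."""
    k = len(weights)
    n = sum(len(data) for data in nodes)
    for _ in range(iters):
        s0, s1, s2 = [0.0] * k, [0.0] * k, [0.0] * k
        # Assumption: in a real network each node would run local_stats on
        # its own data and pass only these small statistics to its neighbour.
        for data in nodes:
            a, b, c = local_stats(data, weights, means, variances)
            for j in range(k):
                s0[j] += a[j]
                s1[j] += b[j]
                s2[j] += c[j]
        weights = [s0[j] / n for j in range(k)]
        means = [s1[j] / max(s0[j], 1e-12) for j in range(k)]
        variances = [
            max(s2[j] / max(s0[j], 1e-12) - means[j] ** 2, 1e-6)
            for j in range(k)
        ]
    return weights, means, variances


# Hypothetical data: 4 nodes, each holding samples from two Gaussians.
rng = random.Random(0)
nodes = [
    [rng.gauss(-2.0, 0.5) for _ in range(100)]
    + [rng.gauss(3.0, 0.5) for _ in range(100)]
    for _ in range(4)
]
weights, means, variances = distributed_em(
    nodes, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
)
print(weights, means, variances)
```

Only the three short statistic vectors cross node boundaries here, never the raw data, which is the usual motivation for this style of scheme. In the general incremental-EM idea, parameters are refreshed after each block's statistics change rather than after a full pass over all nodes; how DIEM realizes this is not stated in the abstract.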

Correspondence to Behrooz Safarinejadian.

About this article

Cite this article

Safarinejadian, B., Menhaj, M.B. & Karrari, M. A distributed EM algorithm to estimate the parameters of a finite mixture of components. Knowl Inf Syst 23, 267–292 (2010). https://doi.org/10.1007/s10115-009-0218-y

  • DOI: https://doi.org/10.1007/s10115-009-0218-y