Advances in Data Analysis and Classification, Volume 7, Issue 3, pp 281–300

Model-based clustering of high-dimensional data streams with online mixture of probabilistic PCA

  • Anastasios Bellas
  • Charles Bouveyron
  • Marie Cottrell
  • Jérôme Lacaille
Regular Article


Abstract

Model-based clustering is a popular tool, renowned for its probabilistic foundations and its flexibility. However, model-based clustering techniques usually perform poorly on high-dimensional data streams, which are now a common type of data. To overcome this limitation, we propose an online inference algorithm for the mixture of probabilistic PCA model. The proposed algorithm relies on an EM-based procedure and on a probabilistic and incremental version of PCA. Model selection is also handled in the online setting through parallel computing. Numerical experiments on simulated and real data demonstrate the effectiveness of our approach and compare it to state-of-the-art online EM-based algorithms.


Keywords

Model-based clustering; Mixture of probabilistic PCA; Data streams; High-dimensional data; Online inference

Mathematics Subject Classification

62 62-07 62H25 62H30 



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Anastasios Bellas (1)
  • Charles Bouveyron (1)
  • Marie Cottrell (1)
  • Jérôme Lacaille (2)
  1. SAMM (EA 4543), Université Paris 1, Paris Cedex 13, France
  2. Snecma, Groupe Safran, Moissy Cramayel, France
