Model-based clustering of high-dimensional data streams with online mixture of probabilistic PCA

Bellas, Anastasios; Bouveyron, Charles; Cottrell, Marie; Lacaille, Jérôme

doi:10.1007/s11634-013-0133-7

Regular Article
Published: 25 May 2013

Volume 7, pages 281–300, (2013)

Advances in Data Analysis and Classification

Anastasios Bellas¹, Charles Bouveyron¹, Marie Cottrell¹ & Jérôme Lacaille²

Abstract

Model-based clustering is a popular tool which is renowned for its probabilistic foundations and its flexibility. However, model-based clustering techniques usually perform poorly when dealing with high-dimensional data streams, which are nowadays a frequent data type. To overcome this limitation of model-based clustering, we propose an online inference algorithm for the mixture of probabilistic PCA model. The proposed algorithm relies on an EM-based procedure and on a probabilistic and incremental version of PCA. Model selection is also considered in the online setting through parallel computing. Numerical experiments on simulated and real data demonstrate the effectiveness of our approach and compare it to state-of-the-art online EM-based algorithms.
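
For orientation, the probabilistic PCA building block named in the abstract can be sketched as follows. This is not the online algorithm proposed in the article, which this page does not describe in detail. It is only the standard batch, single-component maximum-likelihood fit of probabilistic PCA (Tipping and Bishop, 1999), in which the noise variance is the mean of the discarded eigenvalues of the sample covariance. The function name `ppca_ml` and the argument `q` (latent dimension) are illustrative assumptions.

```python
import numpy as np


def ppca_ml(X, q):
    """Batch closed-form ML fit of probabilistic PCA (illustrative sketch only).

    X : (n, p) data matrix, q : latent dimension with 0 < q < p.
    Returns the mean, the (p, q) loading matrix W and the noise variance.
    """
    mu = X.mean(axis=0)
    # Maximum-likelihood covariance estimate (normalised by n, not n - 1).
    S = np.cov(X, rowvar=False, bias=True)
    # eigh returns eigenvalues in ascending order; reverse to descending.
    eigval, eigvec = np.linalg.eigh(S)
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]
    # Noise variance: average of the p - q discarded eigenvalues.
    sigma2 = eigval[q:].mean()
    # Loadings: leading eigenvectors scaled by sqrt(lambda_j - sigma2).
    W = eigvec[:, :q] * np.sqrt(np.maximum(eigval[:q] - sigma2, 0.0))
    return mu, W, sigma2
```

The fitted covariance `W @ W.T + sigma2 * np.eye(p)` reproduces the `q` leading eigenvalues of the sample covariance and replaces the remaining ones by their average. A mixture model would hold one such parameter set per cluster, and an online variant would update them as observations arrive rather than from a full data matrix.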

Author information

Authors and Affiliations

SAMM (EA 4543), Université Paris 1, 90, rue de Tolbiac, 75634 Paris Cedex 13, France
Anastasios Bellas, Charles Bouveyron & Marie Cottrell
Snecma, Groupe Safran, 77550 Moissy Cramayel, France
Jérôme Lacaille

Corresponding author

Correspondence to Charles Bouveyron.

About this article

Cite this article

Bellas, A., Bouveyron, C., Cottrell, M. et al. Model-based clustering of high-dimensional data streams with online mixture of probabilistic PCA. Adv Data Anal Classif 7, 281–300 (2013). https://doi.org/10.1007/s11634-013-0133-7

Received: 30 November 2012
Revised: 22 April 2013
Accepted: 11 May 2013
Published: 25 May 2013
Issue Date: September 2013
DOI: https://doi.org/10.1007/s11634-013-0133-7
