Abstract
We present new initialization methods for the expectation-maximization (EM) algorithm for multivariate Gaussian mixture models. Our methods are adaptations of the well-known K-means++ initialization and the Gonzalez algorithm. In this way, we aim to close the gap between simple random initializations (e.g., uniform sampling) and complex methods that crucially depend on the right choice of hyperparameters. Our extensive experiments indicate the usefulness of our methods compared to common techniques, e.g., those that apply the original K-means++ and Gonzalez algorithms directly, on both artificial and real-world data sets.
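The abstract refers to seeding EM via D²-sampling. The following is a minimal sketch of the classical K-means++ (D²) seeding used to build a starting GMM for EM; it is not the authors' adapted variant. The function names and the spherical-covariance heuristic are illustrative assumptions.

```python
import numpy as np

def kmeanspp_seeds(X, k, rng=None):
    """Classical K-means++ (D^2) sampling of k seed points from X."""
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    means = [X[rng.integers(n)]]  # first seed chosen uniformly at random
    for _ in range(k - 1):
        # squared distance of each point to its nearest chosen seed
        d2 = np.min(((X[:, None, :] - np.asarray(means)[None, :, :]) ** 2).sum(-1), axis=1)
        # next seed sampled proportionally to D^2
        means.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.asarray(means)

def init_gmm(X, k, rng=None):
    """Turn the seeds into a full GMM start for EM: uniform weights and
    spherical covariances derived from the average D^2 cost (a heuristic)."""
    means = kmeanspp_seeds(X, k, rng)
    d2 = np.min(((X[:, None, :] - means[None, :, :]) ** 2).sum(-1), axis=1)
    var = max(d2.mean() / X.shape[1], 1e-6)  # guard against zero variance
    weights = np.full(k, 1.0 / k)
    covs = np.array([var * np.eye(X.shape[1])] * k)
    return weights, means, covs
```

The Gonzalez algorithm differs only in the selection rule: instead of sampling proportionally to D², it deterministically picks the point with maximum distance to the seeds chosen so far.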
Notes
- 1.
The (inverse) pdf is unsuited due to its exponential behavior (numerical over-/underflows).
- 2.
Even with respect to a single Gaussian, i.e., \(\log \mathcal {N}(c\cdot x\,|\,c\cdot \mu ,c^2\cdot \varSigma )=\log \mathcal {N}(x|\mu ,\varSigma )-D \ln (c)\).
- 3.
As explained before, our goal is not to identify these GMMs.
- 4.
Averaging the (average) log-likelihood values over different data sets is not meaningful since the optimal log-likelihoods may deviate significantly.
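The scale identity in note 2 can be verified numerically. Below is a minimal sketch with a hand-rolled multivariate Gaussian log-density (the helper name `log_gauss` is illustrative, not from the paper): scaling the data by \(c\) shifts the log-density by exactly \(-D\ln(c)\).

```python
import numpy as np

def log_gauss(x, mu, Sigma):
    """Log-density of a multivariate Gaussian N(x | mu, Sigma)."""
    D = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)          # stable log|Sigma|
    maha = diff @ np.linalg.solve(Sigma, diff)    # Mahalanobis distance
    return -0.5 * (D * np.log(2 * np.pi) + logdet + maha)
```

The Mahalanobis term is scale-invariant under \(x\mapsto c\,x,\ \mu\mapsto c\,\mu,\ \varSigma\mapsto c^2\varSigma\), while \(\log|c^2\varSigma| = \log|\varSigma| + 2D\ln(c)\), which yields the \(-D\ln(c)\) shift.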
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this paper
Blömer, J., Bujna, K. (2016). Adaptive Seeding for Gaussian Mixture Models. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol. 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_24
DOI: https://doi.org/10.1007/978-3-319-31750-2_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31749-6
Online ISBN: 978-3-319-31750-2
eBook Packages: Computer Science (R0)