Exploring the number of groups in robust model-based clustering

García-Escudero, L. A.; Gordaliza, A.; Matrán, C.; Mayo-Iscar, A.

doi:10.1007/s11222-010-9194-z

Published: 28 July 2010

Statistics and Computing, volume 21, pages 585–599 (2011)

Abstract

Two key questions in clustering problems are how to determine the number of groups properly and how to measure the strength of group assignments. These questions are especially involved when a certain fraction of outlying data is also expected.

Any answer to these two key questions should depend on the assumed probabilistic model, the allowed group scatters, and what we understand by noise. With this in mind, this work presents some exploratory “trimming-based” tools together with their justifications. These tools are based on monitoring the optimal values reached when solving a robust clustering criterion and on the use of some “discriminant” factors.
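
As a rough illustration of what monitoring the optimal values of a trimmed clustering criterion can look like, the sketch below uses trimmed k-means as a simple stand-in. This is an assumption made for illustration only: the abstract does not specify the criterion, and trimmed k-means corresponds to spherical, equally scattered groups rather than to the general scatter constraints the paper alludes to. The function names `trimmed_kmeans_objective` and `monitor_objective`, and all parameter defaults, are hypothetical.

```python
import numpy as np


def trimmed_kmeans_objective(X, k, alpha, n_starts=10, n_iter=50, seed=0):
    """Best trimmed within-group sum of squares found over several random starts.

    A fraction `alpha` of the observations (those farthest from their
    nearest center) is discarded before centers are updated.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_keep = n - int(np.floor(alpha * n))  # number of untrimmed observations
    best = np.inf
    for _ in range(n_starts):
        # Start from k distinct observations chosen at random.
        centers = X[rng.choice(n, size=k, replace=False)].copy()
        for _ in range(n_iter):
            # Squared distance from every observation to every center.
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            nearest = d2.argmin(axis=1)
            # Keep the n_keep observations closest to their nearest center.
            keep = np.argsort(d2.min(axis=1))[:n_keep]
            new_centers = centers.copy()
            for j in range(k):
                members = keep[nearest[keep] == j]
                if members.size > 0:  # an empty group keeps its old center
                    new_centers[j] = X[members].mean(axis=0)
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        # Evaluate the trimmed objective at the final centers.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        objective = np.sort(d2.min(axis=1))[:n_keep].sum()
        best = min(best, objective)
    return best


def monitor_objective(X, ks, alphas):
    """Tabulate the best objective found for each (k, alpha) pair."""
    return {(k, a): trimmed_kmeans_objective(X, k, a) for k in ks for a in alphas}


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two synthetic groups plus a small fraction of scattered noise.
    X = np.vstack([
        rng.normal(0.0, 1.0, size=(50, 2)),
        rng.normal(10.0, 1.0, size=(50, 2)),
        rng.uniform(-30.0, 30.0, size=(10, 2)),
    ])
    for (k, a), value in sorted(monitor_objective(X, [1, 2, 3], [0.0, 0.1]).items()):
        print(f"k={k} alpha={a:.2f} objective={value:.1f}")
```

Reading such a table across `k` for each fixed `alpha` is one exploratory way to see where adding a group, or trimming more, stops producing a noticeable improvement. The heuristic above only finds local optima, so the tabulated values are approximations of the optimal ones.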

Author information

Authors and Affiliations

Departamento de Estadística e Investigación Operativa, Facultad de Ciencias, Universidad de Valladolid, 47002, Valladolid, Spain
L. A. García-Escudero, A. Gordaliza, C. Matrán & A. Mayo-Iscar

Corresponding author

Correspondence to L. A. García-Escudero.

Additional information

Research partially supported by the Spanish Ministerio de Ciencia e Innovación, grants MTM2008-06067-C02-01 and 02, and by the Consejería de Educación y Cultura de la Junta de Castilla y León, GR150.

About this article

Cite this article

García-Escudero, L.A., Gordaliza, A., Matrán, C. et al. Exploring the number of groups in robust model-based clustering. Stat Comput 21, 585–599 (2011). https://doi.org/10.1007/s11222-010-9194-z

Received: 04 December 2009
Accepted: 30 June 2010
Published: 28 July 2010
Issue Date: October 2011
DOI: https://doi.org/10.1007/s11222-010-9194-z
