Abstract
Two key questions in clustering problems are how to determine the number of groups properly and how to measure the strength of group assignments. These questions are especially involved when a certain fraction of outlying data is also expected.
Any answer to these two questions should depend on the assumed probabilistic model, the allowed group scatters, and what we understand by noise. With this in mind, some exploratory “trimming-based” tools are presented in this work, together with their justifications. These tools are based on monitoring the optimal values reached when solving a robust clustering criterion and on the use of some “discriminant” factors.
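The monitoring idea can be illustrated with a minimal sketch. The abstract does not specify the robust clustering criterion, so the code below assumes trimmed k-means as a simplified stand-in: for each number of groups k and each trimming fraction alpha, it records the best trimmed within-cluster sum of squares found by a basic multi-start local search. The function names `trimmed_kmeans` and `monitor`, the simulated data, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def trimmed_kmeans(X, k, alpha, n_starts=10, n_iter=50, seed=0):
    """Best trimmed within-cluster sum of squares found over several random starts.

    Assumption: trimmed k-means stands in for the robust criterion of the paper.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_keep = n - int(np.floor(alpha * n))  # observations kept after trimming
    best = np.inf
    for _ in range(n_starts):
        # start from k distinct observations chosen at random
        centers = X[rng.choice(n, size=k, replace=False)]
        for _ in range(n_iter):
            # squared distance from every point to every center, shape (n, k)
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            nearest = d2.argmin(axis=1)
            keep = np.argsort(d2.min(axis=1))[:n_keep]  # trim the farthest points
            new_centers = centers.copy()
            for j in range(k):
                members = keep[nearest[keep] == j]
                if members.size > 0:
                    new_centers[j] = X[members].mean(axis=0)
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        # objective: sum of the n_keep smallest squared distances to the nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, np.sort(d2.min(axis=1))[:n_keep].sum())
    return best

def monitor(X, ks, alphas):
    """Optimal value reached for every (k, alpha) pair."""
    return {(k, a): trimmed_kmeans(X, k, a) for k in ks for a in alphas}

# Simulated data (hypothetical): three tight groups plus scattered background noise.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(100, 2)),
    rng.normal([6, 0], 0.5, size=(100, 2)),
    rng.normal([3, 5], 0.5, size=(100, 2)),
    rng.uniform(-10, 15, size=(30, 2)),
])

values = monitor(X, ks=[1, 2, 3, 4], alphas=[0.0, 0.05, 0.1, 0.15])
for (k, a), v in sorted(values.items()):
    print(f"k={k} alpha={a:.2f} objective={v:.1f}")
```

Reading the printed table along k for a fixed alpha, and along alpha for a fixed k, gives a rough picture of where adding a group or trimming more observations stops producing a large improvement. This is only an exploratory aid under the stated assumptions, not the procedure of the paper.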
Additional information
Research partially supported by the Spanish Ministerio de Ciencia e Innovación, grants MTM2008-06067-C02-01 and 02, and by the Consejería de Educación y Cultura de la Junta de Castilla y León, GR150.
Cite this article
García-Escudero, L.A., Gordaliza, A., Matrán, C. et al. Exploring the number of groups in robust model-based clustering. Stat Comput 21, 585–599 (2011). https://doi.org/10.1007/s11222-010-9194-z
DOI: https://doi.org/10.1007/s11222-010-9194-z