Abstract
Deviations from theoretical assumptions, together with the presence of a certain amount of outlying observations, are common in many practical statistical applications. This is also the case when applying Cluster Analysis methods, where these problems can lead to unsatisfactory clustering results. Robust Clustering methods aim to avoid these unsatisfactory results. Moreover, there exist certain connections between robust procedures and Cluster Analysis that make Robust Clustering an appealing unifying framework. A review of different robust clustering approaches in the literature is presented. Special attention is paid to methods based on trimming, which try to discard the most outlying data when carrying out the clustering process.
Cite this article
García-Escudero, L.A., Gordaliza, A., Matrán, C. et al. A review of robust clustering methods. Adv Data Anal Classif 4, 89–109 (2010). https://doi.org/10.1007/s11634-010-0064-5
DOI: https://doi.org/10.1007/s11634-010-0064-5