A review of robust clustering methods

García-Escudero, Luis Angel; Gordaliza, Alfonso; Matrán, Carlos; Mayo-Iscar, Agustín

doi:10.1007/s11634-010-0064-5

Regular Article
Published: 18 June 2010

Volume 4, pages 89–109, (2010)
Advances in Data Analysis and Classification

Abstract

Deviations from theoretical assumptions, together with the presence of a certain amount of outlying observations, are common in many practical statistical applications. This is also the case when applying Cluster Analysis methods, where these problems can lead to unsatisfactory clustering results. Robust Clustering methods aim to avoid such unsatisfactory results. Moreover, there exist certain connections between robust procedures and Cluster Analysis that make Robust Clustering an appealing unifying framework. A review of different robust clustering approaches in the literature is presented. Special attention is paid to methods based on trimming, which try to discard the most outlying data when carrying out the clustering process.
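
To make the trimming idea concrete, the following is a minimal sketch of a trimmed variant of k-means, written only as an illustration and not as the procedure of any particular paper. It alternates between assigning points to their nearest centre, discarding the fraction alpha of points that lie farthest from their centre, and recomputing centres from the retained points. The function names, the parameter names (`init_centres`, `alpha`, `n_iter`) and the use of squared Euclidean distance are assumptions made for this example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_and_trim(points, centres, n_keep):
    """Label each point with its nearest centre; the points farthest from
    their centre (all but the n_keep closest) get label -1, i.e. trimmed."""
    nearest = []
    for p in points:
        d = [sq_dist(p, c) for c in centres]
        j = min(range(len(centres)), key=d.__getitem__)
        nearest.append((d[j], j))
    order = sorted(range(len(points)), key=lambda i: nearest[i][0])
    labels = [-1] * len(points)
    for i in order[:n_keep]:
        labels[i] = nearest[i][1]
    return labels


def trimmed_kmeans(points, init_centres, alpha, n_iter=100):
    """Illustrative trimmed k-means (hypothetical helper, not a library API).

    points       : list of equal-length tuples of floats
    init_centres : list of k starting centres (assumed supplied by the caller)
    alpha        : fraction of points to trim in every iteration
    """
    centres = [tuple(c) for c in init_centres]
    n_keep = len(points) - int(alpha * len(points))
    for _ in range(n_iter):
        labels = assign_and_trim(points, centres, n_keep)
        new_centres = []
        for j in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # coordinate-wise mean of the retained members
                new_centres.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                # an empty cluster keeps its previous centre
                new_centres.append(centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    return centres, assign_and_trim(points, centres, n_keep)
```

Because the result depends on the starting centres, a practical implementation would typically try several starts and keep the solution with the smallest within-cluster sum of squares over the retained points; this sketch leaves that choice to the caller.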

Author information

Authors and Affiliations

Departamento de Estadística e Investigación Operativa, Facultad de Ciencias, Universidad de Valladolid, Prado de la Magdalena, 47002, Valladolid, Spain
Luis Angel García-Escudero, Alfonso Gordaliza, Carlos Matrán & Agustín Mayo-Iscar

Corresponding author

Correspondence to Luis Angel García-Escudero.

About this article

Cite this article

García-Escudero, L.A., Gordaliza, A., Matrán, C. et al. A review of robust clustering methods. Adv Data Anal Classif 4, 89–109 (2010). https://doi.org/10.1007/s11634-010-0064-5

Received: 16 February 2010
Revised: 21 April 2010
Accepted: 28 April 2010
Published: 18 June 2010
Issue Date: September 2010
DOI: https://doi.org/10.1007/s11634-010-0064-5
