A review of robust clustering methods

  • Regular Article
Advances in Data Analysis and Classification

Abstract

Deviations from theoretical assumptions, together with the presence of a certain amount of outlying observations, are common in many practical statistical applications. Cluster Analysis is no exception: both problems can lead to unsatisfactory clustering results, which Robust Clustering methods aim to avoid. Moreover, there are connections between robust procedures and Cluster Analysis that make Robust Clustering an appealing unifying framework. This paper reviews the different robust clustering approaches in the literature, paying special attention to trimming-based methods, which try to discard the most outlying data while carrying out the clustering process.
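To make the trimming idea concrete, the following is a minimal sketch of a trimmed k-means style procedure, not the authors' implementation. It assumes squared Euclidean distance, a fixed trimming proportion `alpha`, and several random restarts. The function names (`trimmed_kmeans`, `_assign`), their parameters, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def _assign(X, centers, n_keep):
    """Label each point with its nearest center and flag the n_keep closest points."""
    # Squared Euclidean distance from every point to every center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    nearest = d2[np.arange(len(X)), labels]
    keep = np.zeros(len(X), dtype=bool)
    keep[np.argsort(nearest)[:n_keep]] = True
    # Cost is the within-cluster sum of squares over the kept points only.
    return labels, keep, nearest[keep].sum()


def trimmed_kmeans(X, k, alpha, n_starts=10, n_iter=100, seed=0):
    """Fit k centers while discarding the ceil(alpha * n) most outlying points.

    Returns the centers and one label per point, with -1 marking trimmed points.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    n_keep = n - int(np.ceil(alpha * n))
    best = None
    for _start in range(n_starts):
        # Random initial centers drawn from the data (an assumed, simple choice).
        centers = X[rng.choice(n, size=k, replace=False)].astype(float)
        for _it in range(n_iter):
            labels, keep, _cost = _assign(X, centers, n_keep)
            new_centers = centers.copy()
            for j in range(k):
                members = keep & (labels == j)
                if members.any():  # leave an empty cluster's center unchanged
                    new_centers[j] = X[members].mean(axis=0)
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        labels, keep, cost = _assign(X, centers, n_keep)
        if best is None or cost < best[0]:
            best = (cost, centers, np.where(keep, labels, -1))
    return best[1], best[2]


# Toy data: two tight groups plus four gross outliers appended at the end.
data_rng = np.random.default_rng(1)
group_a = data_rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = data_rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
outliers = np.array([[100.0, -100.0], [-100.0, 100.0],
                     [120.0, 120.0], [-90.0, -110.0]])
X = np.vstack([group_a, group_b, outliers])

centers, labels = trimmed_kmeans(X, k=2, alpha=0.05)
print(centers)
print("trimmed points:", np.flatnonzero(labels == -1))
```

In this sketch the trimmed set is recomputed at every iteration, so which observations count as outlying is decided jointly with the cluster centers rather than in a separate preprocessing step. The restarts are there because a single random start can settle on a poor local solution, for instance with a center sitting on an outlier.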



Author information

Corresponding author

Correspondence to Luis Angel García-Escudero.

About this article

Cite this article

García-Escudero, L.A., Gordaliza, A., Matrán, C. et al. A review of robust clustering methods. Adv Data Anal Classif 4, 89–109 (2010). https://doi.org/10.1007/s11634-010-0064-5

  • DOI: https://doi.org/10.1007/s11634-010-0064-5
