Advances in Data Analysis and Classification, Volume 4, Issue 2–3, pp 89–109

A review of robust clustering methods

  • Luis Angel García-Escudero
  • Alfonso Gordaliza
  • Carlos Matrán
  • Agustín Mayo-Iscar
Regular Article

Abstract

Deviations from theoretical assumptions, together with the presence of a certain amount of outlying observations, are common in many practical statistical applications. This is also the case when applying Cluster Analysis methods, where these problems can lead to unsatisfactory clustering results. Robust Clustering methods aim to avoid such results. Moreover, there are connections between robust procedures and Cluster Analysis that make Robust Clustering an appealing unifying framework. This paper reviews the different robust clustering approaches in the literature. Special attention is paid to methods based on trimming, which try to discard the most outlying data when carrying out the clustering process.
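
To make the idea of trimming concrete, the following is a minimal sketch of a trimmed k-means style iteration, not the procedure of any particular paper covered by the review. The function name `trimmed_kmeans`, the parameters `k` and `alpha` (the fraction of points to discard), and the random initialization are all illustrative assumptions. At each step the points farthest from their nearest center are left out before the centers are recomputed.

```python
import numpy as np


def trimmed_kmeans(X, k, alpha, n_iter=50, seed=0):
    """Illustrative trimmed k-means (hypothetical helper, not a library API).

    Fits k centers while ignoring the fraction alpha of points that lie
    farthest from their nearest center. Returns (centers, labels), where
    a label of -1 marks a trimmed point.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_keep = n - int(np.floor(alpha * n))  # number of untrimmed points
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)

    def assign(current_centers):
        # Squared distance from every point to every center, shape (n, k).
        d2 = ((X[:, None, :] - current_centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        dist = d2.min(axis=1)
        # Trimming step: keep only the n_keep points closest to a center.
        keep = np.zeros(n, dtype=bool)
        keep[np.argsort(dist)[:n_keep]] = True
        return nearest, keep

    for _ in range(n_iter):
        nearest, keep = assign(centers)
        # Update step: each center becomes the mean of its kept points.
        new_centers = centers.copy()
        for j in range(k):
            members = keep & (nearest == j)
            if members.any():
                new_centers[j] = X[members].mean(axis=0)
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break

    # Final assignment against the returned centers.
    nearest, keep = assign(centers)
    labels = np.where(keep, nearest, -1)
    return centers, labels
```

Like ordinary k-means, this sketch only reaches a local optimum that depends on the starting centers, so in practice it would be run from several random starts and the solution with the smallest trimmed sum of squared distances retained.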

Keywords

Clustering, Robustness, Model-based clustering, Trimming

Mathematics Subject Classification (2000)

62H30, 62G35

Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  • Luis Angel García-Escudero (1)
  • Alfonso Gordaliza (1)
  • Carlos Matrán (1)
  • Agustín Mayo-Iscar (1)

  1. Departamento de Estadística e Investigación Operativa, Facultad de Ciencias, Universidad de Valladolid, Valladolid, Spain
