TEST

pp 1–19

Robust model-based clustering with mild and gross outliers

  • Alessio Farcomeni
  • Antonio Punzo
Original Paper

Abstract

We propose a model-based clustering procedure in which each component accommodates cluster-specific mild outliers through a flexible distributional assumption, while a proportion of observations is additionally trimmed to handle gross outliers. Estimation and selection of the proportions of mild and gross outliers rely on a penalized likelihood approach, for which a theoretically grounded penalty parameter is obtained. Simulation studies illustrate the advantages of our procedure over flexible mixtures without trimming and over trimmed normal mixture models (tclust). We conclude with an original real data example on identifying the source of illicit drug shipments seized in Italy and Spain. The methodology proposed in this paper has been implemented in R functions, which can be downloaded from https://github.com/afarcome/cntclust.
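
The distinction between mild and gross outliers can be illustrated with a minimal sketch. It is not the authors' implementation and does not reproduce the cntclust functions. It assumes a univariate setting, one common parametrization of the contaminated normal (a hypothetical `delta` for the mild-outlier proportion and `eta > 1` for the variance inflation), already-known parameters, and a fixed trimming proportion `trim_prop`. The penalized likelihood selection of the two proportions is not covered.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def contaminated_normal_pdf(x, mu, sigma, delta, eta):
    """Two-part density: a 'good' normal plus a variance-inflated one.

    delta (assumed name): proportion of mild outliers in the cluster.
    eta (assumed name): variance inflation factor, taken to be > 1.
    """
    good = (1.0 - delta) * normal_pdf(x, mu, sigma)
    mild = delta * normal_pdf(x, mu, math.sqrt(eta) * sigma)
    return good + mild

def mixture_pdf(x, weights, components):
    """Mixture of contaminated normals; components are dicts of parameters."""
    return sum(w * contaminated_normal_pdf(x, **c)
               for w, c in zip(weights, components))

def trim_gross_outliers(data, weights, components, trim_prop):
    """Indices of the floor(n * trim_prop) points with lowest mixture density."""
    n_trim = math.floor(len(data) * trim_prop)
    order = sorted(range(len(data)),
                   key=lambda i: mixture_pdf(data[i], weights, components))
    return set(order[:n_trim])

def is_mild_outlier(x, mu, sigma, delta, eta):
    """True if the inflated part is the more probable source of x."""
    mild = delta * normal_pdf(x, mu, math.sqrt(eta) * sigma)
    return mild / contaminated_normal_pdf(x, mu, sigma, delta, eta) > 0.5

# Toy data: two clusters near 0 and 10, one moderately distant point (4.0)
# and one far-away point (50.0). All values are made up for illustration.
data = [-0.2, 0.1, 0.3, 4.0, 9.8, 10.0, 10.3, 50.0]
weights = [0.5, 0.5]
components = [
    {"mu": 0.0, "sigma": 1.0, "delta": 0.1, "eta": 20.0},
    {"mu": 10.0, "sigma": 1.0, "delta": 0.1, "eta": 20.0},
]

trimmed = trim_gross_outliers(data, weights, components, trim_prop=0.125)
mild_flag_far = is_mild_outlier(4.0, **components[0])
mild_flag_near = is_mild_outlier(0.1, **components[0])
```

In this toy setup the point at 50.0 is the one trimmed, while the point at 4.0 stays in the fit and is attributed to the inflated part of the first component. A full procedure would alternate such steps with parameter updates rather than use fixed parameters.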

Keywords

tclust · Contaminated normal · Penalized likelihood

Mathematics Subject Classification

62H30 · 91C20 · 62F35

Notes

Acknowledgements

The authors are grateful to two referees for constructive and helpful suggestions.

Supplementary material

Supplementary material 1: 11749_2019_693_MOESM1_ESM.pdf (PDF, 83 KB)

Copyright information

© Sociedad de Estadística e Investigación Operativa 2019

Authors and Affiliations

  1. Department of Economics and Finance, University of Rome “Tor Vergata”, Rome, Italy
  2. Department of Economics and Business, University of Catania, Catania, Italy
