Robust model-based clustering with mild and gross outliers

  • Original Paper
  • TEST

Abstract

We propose a model-based clustering procedure in which each component accommodates cluster-specific mild outliers through a flexible distributional assumption, while a proportion of observations is additionally trimmed. Estimation, together with selection of the proportions of mild and gross outliers, is carried out through a penalized likelihood approach, for which a theoretically grounded penalty parameter is obtained. Simulation studies illustrate the advantages of our procedure over flexible mixtures without trimming and over trimmed normal mixture models (tclust). We conclude with an original real data example on identifying the source of illicit drug shipments seized in Italy and Spain. The proposed methodology is implemented in R functions that can be downloaded from https://github.com/afarcome/cntclust.
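
To make the two kinds of outlier handling concrete, the following is a minimal Python sketch and not the authors' implementation (their code is the R software linked above). It assumes that the flexible component distribution is a contaminated normal, that is, a normal density mixed with a variance-inflated normal sharing the same mean, and that gross outliers are trimmed by discarding the observations with the lowest fitted mixture density. The function names and the parameters `good_prop`, `inflation`, and `trim_prop` are hypothetical, and the penalized likelihood used to select the outlier proportions is not shown.

```python
import numpy as np

def normal_pdf(x, mean, cov):
    """Multivariate normal density evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mean
    sol = np.linalg.solve(cov, diff.T).T       # cov^{-1} (x - mean), row-wise
    maha = np.einsum("ij,ij->i", diff, sol)    # squared Mahalanobis distances
    _, logdet = np.linalg.slogdet(cov)
    return np.exp(-0.5 * (maha + logdet + d * np.log(2 * np.pi)))

def contaminated_normal_pdf(x, mean, cov, good_prop, inflation):
    """Assumed component density: a 'good' normal plus a variance-inflated
    normal with the same mean that absorbs mild outliers."""
    return (good_prop * normal_pdf(x, mean, cov)
            + (1.0 - good_prop) * normal_pdf(x, mean, inflation * cov))

def trim_mask(x, weights, means, covs, good_props, inflations, trim_prop):
    """Boolean mask, True for retained observations. The ceil(n * trim_prop)
    points with the lowest mixture density are flagged as gross outliers."""
    dens = sum(w * contaminated_normal_pdf(x, m, c, g, e)
               for w, m, c, g, e in zip(weights, means, covs,
                                        good_props, inflations))
    n_trim = int(np.ceil(trim_prop * len(x)))
    keep = np.ones(len(x), dtype=bool)
    if n_trim > 0:
        keep[np.argsort(dens)[:n_trim]] = False
    return keep

# Toy data: two clusters plus one gross outlier appended as the last row.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
               rng.normal(6.0, 1.0, (50, 2)),
               [[40.0, -40.0]]])
keep = trim_mask(x,
                 weights=[0.5, 0.5],
                 means=[np.zeros(2), np.full(2, 6.0)],
                 covs=[np.eye(2), np.eye(2)],
                 good_props=[0.9, 0.9],
                 inflations=[10.0, 10.0],
                 trim_prop=0.01)
```

In this sketch the parameters are fixed by hand. In an actual fit they would be re-estimated on the retained observations, with density evaluation and trimming alternating until the retained set stabilizes.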

Acknowledgements

The authors are grateful to two referees for constructive and helpful suggestions.

Author information

Corresponding author

Correspondence to Alessio Farcomeni.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 83 KB)

Cite this article

Farcomeni, A., Punzo, A. Robust model-based clustering with mild and gross outliers. TEST 29, 989–1007 (2020). https://doi.org/10.1007/s11749-019-00693-z

  • DOI: https://doi.org/10.1007/s11749-019-00693-z
