The power of monitoring: how to make the most of a contaminated multivariate sample

Abstract

Diagnostic tools must rely on robust high-breakdown methodologies to avoid distortion in the presence of contamination by outliers. However, a disadvantage of having a single, even if robust, summary of the data is that important choices concerning parameters of the robust method, such as breakdown point, have to be made prior to the analysis. The effect of such choices may be difficult to evaluate. We argue that an effective solution is to look at several pictures, and possibly at a whole movie, of the available data. This can be achieved by monitoring, over a range of parameter values, the results computed through the robust methodology of choice. We show the information gain that monitoring provides in the study of complex data structures through the analysis of multivariate datasets using different high-breakdown techniques. Our findings support the claim that the principle of monitoring is very flexible and that it can lead to robust estimators that are as efficient as possible. We also address through simulation some of the tricky inferential issues that arise from monitoring.
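
The idea of monitoring can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a simulated bivariate sample with 20% contamination and uses a crude trimmed estimator, refined by concentration steps from a coordinate-wise median start, as a stand-in for the high-breakdown methods discussed in the paper. All function names (`mean_cov`, `sq_distances`, `trimmed_fit`, `monitor`) and the grid of retained fractions are hypothetical choices made for illustration. The trimmed scatter is not rescaled by a consistency factor, so only the relative change in distances across the grid is meaningful.

```python
import random
from statistics import median

def mean_cov(points):
    """Sample mean and 2x2 covariance (sxx, sxy, syy) of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def sq_distances(points, center, cov):
    """Squared Mahalanobis distances via the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    out = []
    for x, y in points:
        dx, dy = x - center[0], y - center[1]
        out.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return out

def trimmed_fit(points, keep_fraction, n_steps=20):
    """Illustrative trimmed fit: keep the h closest points, refit, repeat."""
    h = max(3, int(round(keep_fraction * len(points))))
    # Assumed starting point: coordinate-wise medians with identity scatter.
    center = (median(p[0] for p in points), median(p[1] for p in points))
    cov = (1.0, 0.0, 1.0)
    for _ in range(n_steps):
        d2 = sq_distances(points, center, cov)
        order = sorted(range(len(points)), key=d2.__getitem__)
        center, cov = mean_cov([points[i] for i in order[:h]])
    return center, cov

def monitor(points, fractions):
    """Map each retained fraction to (center, squared distances of all points)."""
    results = {}
    for f in fractions:
        center, cov = trimmed_fit(points, f)
        results[f] = (center, sq_distances(points, center, cov))
    return results

# Simulated data (an assumption): 80 clean points and 20 clustered outliers.
rng = random.Random(0)
clean = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(80)]
outliers = [(rng.gauss(6, 0.5), rng.gauss(6, 0.5)) for _ in range(20)]
data = clean + outliers

fractions = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
results = monitor(data, fractions)
for f in fractions:
    center, d2 = results[f]
    mean_out = sum(d2[80:]) / 20  # average distance of the known outliers
    print(f"keep={f:.1f} center=({center[0]:.2f}, {center[1]:.2f}) "
          f"mean outlier d2={mean_out:.1f}")
```

Reading the printed rows from the smallest to the largest retained fraction gives a rough version of the "movie": while the fit retains few enough points to exclude the contaminated cluster, the center should stay near the origin and the outliers should keep large distances. Once the retained fraction reaches 1.0 the fit is the classical mean and covariance, and the outliers are masked. The fraction at which the picture changes is the information that a single pre-chosen breakdown point would hide.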

Acknowledgements

We are very grateful to the Editor, Tommaso Proietti, for inviting this paper and for organizing its discussion. We also thank Alessio Farcomeni, Luca Greco, Domenico Perrotta and two anonymous reviewers for helpful comments on a previous draft. MR and ACA gratefully acknowledge support from the CRoNoS project, reference CRoNoS COST Action IC1408.

Author information

Corresponding author

Correspondence to Andrea Cerioli.

About this article

Cite this article

Cerioli, A., Riani, M., Atkinson, A.C. et al. The power of monitoring: how to make the most of a contaminated multivariate sample. Stat Methods Appl 27, 559–587 (2018). https://doi.org/10.1007/s10260-017-0409-8

Keywords

  • Data movie
  • Forward search
  • Outlier detection
  • MM-estimation
  • S-estimation
  • Trimming
  • Reweighting