The power of monitoring: how to make the most of a contaminated multivariate sample

  • Andrea Cerioli
  • Marco Riani
  • Anthony C. Atkinson
  • Aldo Corbellini
Original Paper

Abstract

Diagnostic tools must rely on robust high-breakdown methodologies to avoid distortion in the presence of contamination by outliers. However, a disadvantage of having a single, even if robust, summary of the data is that important choices concerning parameters of the robust method, such as breakdown point, have to be made prior to the analysis. The effect of such choices may be difficult to evaluate. We argue that an effective solution is to look at several pictures, and possibly at a whole movie, of the available data. This can be achieved by monitoring, over a range of parameter values, the results computed through the robust methodology of choice. We show the information gain that monitoring provides in the study of complex data structures through the analysis of multivariate datasets using different high-breakdown techniques. Our findings support the claim that the principle of monitoring is very flexible and that it can lead to robust estimators that are as efficient as possible. We also address through simulation some of the tricky inferential issues that arise from monitoring.
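
The idea of monitoring over a range of parameter values can be illustrated with a minimal sketch. It is not one of the estimators analysed in the paper: it assumes a simplified trimmed mean and covariance fit obtained by concentration steps, applies no consistency or small-sample correction, and uses synthetic data. The names `mahalanobis_sq`, `trimmed_fit` and `monitor` are hypothetical helpers introduced only for this example. For each breakdown point on a grid, the fit is recomputed and the squared Mahalanobis distance of every observation is stored, so that the trajectories can be inspected as the breakdown point decreases.

```python
import numpy as np

def mahalanobis_sq(X, mu, cov):
    """Squared Mahalanobis distance of every row of X from (mu, cov)."""
    diff = X - mu
    return np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)

def trimmed_fit(X, h, max_steps=100):
    """Mean and covariance of an h-subset found by concentration steps.

    Simplified stand-in for a high-breakdown estimator (assumption):
    a single robust start, no consistency factor, no reweighting.
    """
    # Initial ordering from coordinate-wise medians and MADs.
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    d2 = np.sum(((X - med) / mad) ** 2, axis=1)
    subset = np.sort(np.argsort(d2)[:h])
    for _ in range(max_steps):
        mu = X[subset].mean(axis=0)
        cov = np.cov(X[subset], rowvar=False)
        d2 = mahalanobis_sq(X, mu, cov)
        new_subset = np.sort(np.argsort(d2)[:h])
        if np.array_equal(new_subset, subset):
            break  # the subset no longer changes
        subset = new_subset
    return mu, cov, d2

def monitor(X, bdp_grid):
    """Distances of all observations for each breakdown point in bdp_grid."""
    n = X.shape[0]
    rows = []
    for bdp in bdp_grid:
        h = int(round(n * (1.0 - bdp)))  # number of observations kept
        _, _, d2 = trimmed_fit(X, h)
        rows.append(d2)
    return np.vstack(rows)  # shape: (len(bdp_grid), n)

# Synthetic illustration (assumption): 80 clean points and 20 shifted outliers.
rng = np.random.default_rng(0)
clean = rng.normal(size=(80, 2))
outliers = rng.normal(loc=6.0, scale=0.5, size=(20, 2))
X = np.vstack([clean, outliers])

bdp_grid = np.linspace(0.5, 0.0, 11)  # from very robust to non-robust
D = monitor(X, bdp_grid)
```

Plotting each column of `D` against `bdp_grid` gives one trajectory per observation. In this synthetic setting the distances of the last 20 rows are large at high breakdown points and shrink once the fit is allowed to include the contaminated units, which is the kind of change a monitoring plot is meant to reveal.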

Keywords

Data movie · Forward search · Outlier detection · MM-estimation · S-estimation · Trimming · Reweighting

Notes

Acknowledgements

We are very grateful to the Editor, Tommaso Proietti, for inviting this paper and for organizing its discussion. We also thank Alessio Farcomeni, Luca Greco, Domenico Perrotta and two anonymous reviewers for helpful comments on a previous draft. MR and ACA gratefully acknowledge support from the CRoNoS project, reference CRoNoS COST Action IC1408.

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2017

Authors and Affiliations

  1. Department of Economics and Management, University of Parma, Parma, Italy
  2. Department of Statistics, The London School of Economics, London, UK
