Multivariate outlier detection based on a robust Mahalanobis distance with shrinkage estimators

  • Regular Article
  • Statistical Papers

Abstract

A collection of robust Mahalanobis distances for multivariate outlier detection is proposed, based on the notion of shrinkage. Robust intensity and scaling factors are optimally estimated to define the shrinkage. Some properties are investigated, such as affine equivariance and breakdown value. The performance of the proposal is illustrated through comparison with other techniques from the literature, in a simulation study and with a real dataset. Its behavior when the underlying distribution is heavy-tailed or skewed shows that the method remains appropriate when the data deviate from the common assumption of normality. The high true positive rates and low false positive rates obtained in the vast majority of cases, together with the significantly shorter computation time, show the advantages of the proposal.
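
To make the general idea concrete, the following is a minimal sketch of a Mahalanobis-type distance built from a shrunk scatter matrix. It is not the estimator proposed in the article: the componentwise median as centre, the classical sample covariance as the matrix being shrunk, the trace-scaled identity as target, and the fixed `intensity` value are all assumptions made here for illustration, whereas the article estimates robust intensity and scaling factors optimally. The function names are hypothetical.

```python
import numpy as np


def shrinkage_mahalanobis_sq(X, intensity=0.1):
    """Squared Mahalanobis-type distances (illustrative assumption, not the paper's method).

    Centre: componentwise median. Scatter: sample covariance shrunk linearly
    toward a scaled identity with a fixed, user-chosen intensity in [0, 1].
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    center = np.median(X, axis=0)              # simple robust location
    S = np.cov(X, rowvar=False)                # classical (non-robust) covariance
    target = (np.trace(S) / p) * np.eye(p)     # scaled-identity shrinkage target
    sigma = (1.0 - intensity) * S + intensity * target
    diff = X - center
    # Solve sigma * z = diff^T instead of forming the inverse explicitly.
    z = np.linalg.solve(sigma, diff.T).T
    return np.einsum("ij,ij->i", diff, z)      # row-wise quadratic forms


def flag_outliers(X, cutoff, intensity=0.1):
    """Boolean mask of observations whose squared distance exceeds `cutoff`."""
    return shrinkage_mahalanobis_sq(X, intensity) > cutoff


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal((200, 2))
    X = np.vstack([clean, [[8.0, 8.0]]])       # one planted outlier (last row)
    # For p = 2 the chi-square 0.975 quantile has the closed form -2*log(0.025).
    cutoff = -2.0 * np.log(0.025)
    print(np.flatnonzero(flag_outliers(X, cutoff)))
```

The chi-square cutoff is a common convention for Mahalanobis distances under normality and is used here only for dimension 2, where the quantile has a closed form; other dimensions would need a quantile function from a statistics library. Because the scatter in this sketch starts from the sample covariance, it does not inherit the robustness properties the abstract discusses.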

Acknowledgements

The authors are grateful to the editor and the referees for the constructive and valuable comments. This research was partially supported by Ministerio de Economía, Industria y Competitividad, España, award number: ECO2015-66593-P.

Author information

Corresponding author

Correspondence to Elisa Cabana.

Electronic supplementary material

Supplementary material 1 (pdf 3784 KB)

Supplementary material 2 (xlsx 278 KB)

About this article

Cite this article

Cabana, E., Lillo, R.E. & Laniado, H. Multivariate outlier detection based on a robust Mahalanobis distance with shrinkage estimators. Stat Papers 62, 1583–1609 (2021). https://doi.org/10.1007/s00362-019-01148-1

  • DOI: https://doi.org/10.1007/s00362-019-01148-1
