Abstract
A collection of robust Mahalanobis distances for multivariate outlier detection is proposed, based on the notion of shrinkage. Robust intensity and scaling factors are optimally estimated to define the shrinkage. Some properties are investigated, such as affine equivariance and breakdown value. The performance of the proposal is illustrated through comparison with other techniques from the literature, in a simulation study and with a real dataset. Its behavior when the underlying distribution is heavy-tailed or skewed shows that the method remains appropriate when the data deviate from the common assumption of normality. The high true positive rates and low false positive rates obtained in the vast majority of cases, together with a significantly shorter computation time, show the advantages of the proposal.
Acknowledgements
The authors are grateful to the editor and the referees for the constructive and valuable comments. This research was partially supported by Ministerio de Economía, Industria y Competitividad, España, award number: ECO2015-66593-P.
Cite this article
Cabana, E., Lillo, R.E. & Laniado, H. Multivariate outlier detection based on a robust Mahalanobis distance with shrinkage estimators. Stat Papers 62, 1583–1609 (2021). https://doi.org/10.1007/s00362-019-01148-1
DOI: https://doi.org/10.1007/s00362-019-01148-1