Abstract
The minimum covariance determinant (MCD) approach estimates the location and scatter matrix using the subset of given size with lowest sample covariance determinant. Its main drawback is that it cannot be applied when the dimension exceeds the subset size. We propose the minimum regularized covariance determinant (MRCD) approach, which differs from the MCD in that the scatter matrix is a convex combination of a target matrix and the sample covariance matrix of the subset. A data-driven procedure sets the weight of the target matrix, so that the regularization is only used when needed. The MRCD estimator is defined in any dimension, is well-conditioned by construction and preserves the good robustness properties of the MCD. We prove that so-called concentration steps can be performed to reduce the MRCD objective function, and we exploit this fact to construct a fast algorithm. We verify the accuracy and robustness of the MRCD estimator in a simulation study and illustrate its practical use for outlier detection and regression analysis on real-life high-dimensional data sets in chemistry and criminology.
References
Agostinelli, C., Leung, A., Yohai, V., Zamar, R.: Robust estimation of multivariate location and scatter in the presence of cellwise and casewise contamination. Test 24(3), 441–461 (2015)
Agulló, J., Croux, C., Van Aelst, S.: The multivariate least trimmed squares estimator. J. Multivar. Anal. 99, 311–338 (2008)
Atkinson, A.C., Riani, M., Cerioli, A.: Exploring Multivariate Data with the Forward Search. Springer, New York (2004)
Bartlett, M.S.: An inverse matrix adjustment arising in discriminant analysis. Ann. Math. Stat. 22(1), 107–111 (1951)
Boudt, K., Cornelissen, J., Croux, C.: Jump robust daily covariance estimation by disentangling variance and correlation components. Comput. Stat. Data Anal. 56(11), 2993–3005 (2012)
Butler, R., Davies, P., Jhun, M.: Asymptotics for the minimum covariance determinant estimator. Ann. Stat. 21(3), 1385–1400 (1993)
Cator, E., Lopuhaä, H.: Central limit theorem and influence function for the MCD estimator at general multivariate distributions. Bernoulli 18(2), 520–551 (2012)
Croux, C., Haesbroeck, G.: Influence function and efficiency of the minimum covariance determinant scatter matrix estimator. J. Multivar. Anal. 71(2), 161–190 (1999)
Croux, C., Haesbroeck, G.: Principal components analysis based on robust estimators of the covariance or correlation matrix: influence functions and efficiencies. Biometrika 87, 603–618 (2000)
Croux, C., Gelper, S., Haesbroeck, G.: Regularized Minimum Covariance Determinant Estimator. Mimeo, New York (2012)
Esbensen, K., Midtgaard, T., Schönkopf, S.: Multivariate Analysis in Practice: A Training Package. Camo As, Oslo (1996)
Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(2), 432–441 (2008)
Gnanadesikan, R., Kettenring, J.: Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 28, 81–124 (1972)
Grübel, R.: A minimal characterization of the covariance matrix. Metrika 35(1), 49–52 (1988)
Hardin, J., Rocke, D.: Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator. Comput. Stat. Data Anal. 44, 625–638 (2004)
Hardin, J., Rocke, D.: The distribution of robust distances. J. Comput. Graph. Stat. 14(4), 928–946 (2005)
Hubert, M., Van Driessen, K.: Fast and robust discriminant analysis. Comput. Stat. Data Anal. 45, 301–320 (2004)
Hubert, M., Rousseeuw, P., Vanden Branden, K.: ROBPCA: a new approach to robust principal components analysis. Technometrics 47, 64–79 (2005)
Hubert, M., Rousseeuw, P., Van Aelst, S.: High breakdown robust multivariate methods. Stat. Sci. 23, 92–119 (2008)
Hubert, M., Rousseeuw, P., Verdonck, T.: A deterministic algorithm for robust location and scatter. J. Comput. Graph. Stat. 21(3), 618–637 (2012)
Khan, J., Van Aelst, S., Zamar, R.H.: Robust linear model selection based on least angle regression. J. Am. Stat. Assoc. 102(480), 1289–1299 (2007)
Ledoit, O., Wolf, M.: A well-conditioned estimator for large-dimensional covariance matrices. J. Multivar. Anal. 88, 365–411 (2004)
Lopuhaä, H., Rousseeuw, P.: Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Ann. Stat. 19, 229–248 (1991)
Maronna, R., Zamar, R.H.: Robust estimates of location and dispersion for high-dimensional datasets. Technometrics 44(4), 307–317 (2002)
Öllerer, V., Croux, C.: Robust high-dimensional precision matrix estimation. In: Modern Nonparametric, Robust and Multivariate Methods, pp. 325–350. Springer (2015)
Pison, G., Rousseeuw, P., Filzmoser, P., Croux, C.: Robust factor analysis. J. Multivar. Anal. 84, 145–172 (2003)
Rousseeuw, P.: Least median of squares regression. J. Am. Stat. Assoc. 79(388), 871–880 (1984)
Rousseeuw, P.: Multivariate estimation with high breakdown point. In: Grossmann, W., Pflug, G., Vincze, I., Wertz, W. (eds.) Mathematical Statistics and Applications, vol. B, pp. 283–297. Reidel Publishing Company, Dordrecht (1985)
Rousseeuw, P., Croux, C.: Alternatives to the median absolute deviation. J. Am. Stat. Assoc. 88(424), 1273–1283 (1993)
Rousseeuw, P., Van Driessen, K.: A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212–223 (1999)
Rousseeuw, P., Van Zomeren, B.: Unmasking multivariate outliers and leverage points. J. Am. Stat. Assoc. 85(411), 633–639 (1990)
Rousseeuw, P., Van Aelst, S., Van Driessen, K., Agulló, J.: Robust multivariate regression. Technometrics 46, 293–305 (2004)
Rousseeuw, P., Croux, C., Todorov, V., Ruckstuhl, A., Salibian-Barrera, M., Verbeke, T., Koller, M., Maechler, M.: Robustbase: Basic Robust Statistics. R package version 0.92-3 (2012)
SenGupta, A.: Tests for standardized generalized variances of multivariate normal populations of possibly different dimensions. J. Multivar. Anal. 23(2), 209–219 (1987)
Sherman, J., Morrison, W.J.: Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Ann. Math. Stat. 21(1), 124–127 (1950)
Todorov, V., Filzmoser, P.: An object-oriented framework for robust multivariate analysis. J. Stat. Softw. 32(3), 1–47 (2009)
Won, J.-H., Lim, J., Kim, S.-J., Rajaratnam, B.: Condition-number-regularized covariance estimation. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 75(3), 427–450 (2013)
Woodbury, M.A.: Inverting modified matrices. Memo. Rep. 42, 106 (1950)
Zhao, T., Liu, H., Roeder, K., Lafferty, J., Wasserman, L.: The huge package for high-dimensional undirected graph estimation in R. J. Mach. Learn. Res. 13, 1059–1062 (2012)
Acknowledgements
This research has benefited from the financial support of the Flemish Science Foundation (FWO) and project C16/15/068 of Internal Funds KU Leuven. We are grateful to Valentin Todorov for adding the MRCD functionality to the R package rrcov (Todorov and Filzmoser 2009), and to Yukai Yang for his initial assistance with this work. We also thank the Editor, two anonymous referees, Dries Cornilly, Christophe Croux, Gentiane Haesbroeck, Sebastiaan Höppner, Stefan Van Aelst and Marjan Wauters for their constructive comments.
Appendices
Appendix A: Proof of Theorem 1
Generate a p-variate sample \(\mathbf {Z}\) with \(p+1\) points for which \({\varvec{\Lambda }} = \frac{1}{p+1}\sum _{i=1}^{p+1} (\varvec{z}_i-\overline{\varvec{z}})(\varvec{z}_i-\overline{\varvec{z}})'\) is nonsingular, where \(\overline{\varvec{z}}=\frac{1}{p+1}\sum _{i=1}^{p+1}\varvec{z}_i\). Then \(\tilde{\varvec{z}}_i={\varvec{\Lambda }}^{-1/2}(\varvec{z}_i-\overline{\varvec{z}})\) has mean zero and covariance matrix \(\mathbf {I}_p\). Now compute \(\varvec{y}_i=\mathbf {T}^{1/2}\tilde{\varvec{z}}_i\;\), so that \(\mathbf {Y}\) has mean zero and covariance matrix \(\mathbf {T}\).
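The construction of \(\mathbf {Y}\) can be sketched numerically. The following is a minimal illustration, not the authors' code: the function name `sample_with_covariance`, the Gaussian draw for \(\mathbf {Z}\) and the example target matrix are assumptions made for this sketch, and symmetric matrix square roots are used throughout.

```python
import numpy as np

def sym_power(A, power):
    """Symmetric matrix power of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(A)
    return vecs @ np.diag(vals ** power) @ vecs.T

def sample_with_covariance(T, rng):
    """Return p+1 points (as rows) with exact mean zero and covariance T.

    The covariance uses the divisor p+1, as in the text.
    """
    p = T.shape[0]
    Z = rng.standard_normal((p + 1, p))      # assumed: any sample with nonsingular Lambda works
    Zc = Z - Z.mean(axis=0)                  # rows are (z_i - zbar)'
    Lam = Zc.T @ Zc / (p + 1)
    Z_tilde = Zc @ sym_power(Lam, -0.5)      # rows are z~_i', covariance I_p
    return Z_tilde @ sym_power(T, 0.5)       # rows are y_i' = (T^{1/2} z~_i)'

rng = np.random.default_rng(1)
T = np.array([[2.0, 0.5], [0.5, 1.0]])       # hypothetical target matrix
Y = sample_with_covariance(T, rng)
Y_mean = Y.mean(axis=0)
Y_cov = Y.T @ Y / Y.shape[0]
```

Here `Y_mean` is zero and `Y_cov` equals `T` up to floating-point error, which is exactly the property the proof relies on.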
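Next, create the artificial dataset
$$\begin{aligned} \tilde{\mathbf {X}}^1 = \left( w_1(\varvec{x}^{1}_{1}-\mathbf {m}_1),\ldots ,w_h(\varvec{x}^{1}_{h}-\mathbf {m}_1),\; w_{h+1}\varvec{y}_1,\ldots ,w_k\varvec{y}_{p+1}\right) ' \end{aligned}$$
with \(k=h+p+1\) points, where \(\varvec{x}^{1}_{1},\ldots ,\varvec{x}^{1}_{h}\) are the members of \(H_1\). The factors \(w_i\) are given by
$$\begin{aligned} w_i = {\left\{ \begin{array}{ll} \sqrt{k(1-\rho )/h} &{} \text {for } i=1,\ldots ,h \\ \sqrt{k\rho /(p+1)} &{} \text {for } i=h+1,\ldots ,k. \end{array}\right. } \end{aligned}$$
The mean and covariance matrix of \( {\tilde{\mathbf{X}}}^1\) are then
$$\begin{aligned} \frac{1}{k}\sum _{i=1}^{k}\varvec{\tilde{x}}^{1}_{i} = \frac{w_1}{k}\sum _{i=1}^{h}(\varvec{x}^{1}_{i}-\mathbf {m}_1) + \frac{w_k}{k}\sum _{j=1}^{p+1}\varvec{y}_j = \varvec{0} \end{aligned}$$
and
$$\begin{aligned} \frac{1}{k}\sum _{i=1}^{k}\varvec{\tilde{x}}^{1}_{i}(\varvec{\tilde{x}}^{1}_{i})' = \frac{1-\rho }{h}\sum _{i=1}^{h}(\varvec{x}^{1}_{i}-\mathbf {m}_1)(\varvec{x}^{1}_{i}-\mathbf {m}_1)' + \frac{\rho }{p+1}\sum _{j=1}^{p+1}\varvec{y}_j\varvec{y}_j' = (1-\rho )\mathbf {S}_1+\rho \mathbf {T} = \mathbf {K}_1. \end{aligned}$$
The regularized covariance matrix \(\mathbf {K}_1\) is thus the actual covariance matrix of the combined data set \({\tilde{\mathbf{X}}}^1\;\). Analogously we construct
$$\begin{aligned} \tilde{\mathbf {X}}^2 = \left( w_1(\varvec{x}^{2}_{1}-\mathbf {m}_2),\ldots ,w_h(\varvec{x}^{2}_{h}-\mathbf {m}_2),\; w_{h+1}\varvec{y}_1,\ldots ,w_k\varvec{y}_{p+1}\right) ' \end{aligned}$$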
where \(\varvec{x}^{2}_{1},\ldots ,\varvec{x}^{2}_{h}\) are the members of \(H_2\;\). The dataset \(\tilde{\mathbf {X}}^2\) has zero mean and covariance matrix \(\mathbf {K}_2=(1-\rho ) \mathbf {S}_2 + \rho \mathbf {T}\;\).
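The claim that the artificial dataset has mean zero and covariance matrix \((1-\rho )\mathbf {S}+\rho \mathbf {T}\) can be checked numerically. The sketch below is illustrative only and rests on assumptions: the weights are taken as \(\sqrt{k(1-\rho )/h}\) for the centered subset points and \(\sqrt{k\rho /(p+1)}\) for the \(\varvec{y}_j\) (the choice that makes the stated covariance hold when \(\mathbf {S}\) uses the divisor \(h\)), and the names `artificial_dataset` and `sym_power` are hypothetical. The example deliberately takes \(p>h\), where \(\mathbf {S}\) alone is singular.

```python
import numpy as np

def sym_power(A, power):
    """Symmetric matrix power of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(A)
    return vecs @ np.diag(vals ** power) @ vecs.T

def artificial_dataset(XH, T, rho, rng):
    """Stack the weighted, centered h-subset and p+1 weighted points with covariance T."""
    h, p = XH.shape
    k = h + p + 1
    m = XH.mean(axis=0)
    # p+1 points with exact mean zero and covariance T (divisor p+1)
    Z = rng.standard_normal((p + 1, p))
    Zc = Z - Z.mean(axis=0)
    Lam = Zc.T @ Zc / (p + 1)
    Y = Zc @ sym_power(Lam, -0.5) @ sym_power(T, 0.5)
    w_x = np.sqrt(k * (1.0 - rho) / h)       # assumed weight for subset points
    w_y = np.sqrt(k * rho / (p + 1))         # assumed weight for the y_j
    return np.vstack([w_x * (XH - m), w_y * Y])

rng = np.random.default_rng(2)
h, p, rho = 5, 8, 0.25                        # p > h: S alone would be singular
XH = rng.standard_normal((h, p))
T = np.eye(p)                                 # hypothetical target matrix
X_tilde = artificial_dataset(XH, T, rho, rng)
k = X_tilde.shape[0]

Xc = XH - XH.mean(axis=0)
S = Xc.T @ Xc / h
K = (1.0 - rho) * S + rho * T                 # regularized covariance
cov_tilde = X_tilde.T @ X_tilde / k           # covariance of the artificial data
```

Up to rounding, `cov_tilde` coincides with `K` and the column means of `X_tilde` vanish, so the regularized matrix can be treated as an ordinary covariance matrix of \(k\) points.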
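Denote \(d_{\mathbf {K}_1}(\varvec{\tilde{x}}) = \varvec{\tilde{x}'}( \mathbf {K}_1)^{-1}\varvec{\tilde{x}}\). We can then prove that:
$$\begin{aligned} \sum _{i=1}^{h} d_{\mathbf {K}_1}(\varvec{x}^{2}_{i}-\mathbf {m}_2)&\leqslant \sum _{i=1}^{h} d_{\mathbf {K}_1}(\varvec{x}^{2}_{i}-\mathbf {m}_1) \qquad (22)\\&\leqslant \sum _{i=1}^{h} d_{\mathbf {K}_1}(\varvec{x}^{1}_{i}-\mathbf {m}_1) \qquad (23) \end{aligned}$$
in which the second inequality (23) is the condition (18).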
The first inequality (22) can be shown as follows. Put \(\varvec{z}_i = (\mathbf {K}_1)^{-1/2}\varvec{x}^2_{i}\) and \(\tilde{\varvec{z}} = (\mathbf {K}_1)^{-1/2}\mathbf {m}_1\) and note that \(\overline{\varvec{z}}=(\mathbf {K}_1)^{-1/2}\mathbf {m}_2\) is the average of the \(\varvec{z}_i\). Then (22) becomes
$$\begin{aligned} \sum _{i=1}^{h} \Vert \varvec{z}_i - \overline{\varvec{z}} \Vert ^2 \leqslant \sum _{i=1}^{h} \Vert \varvec{z}_i - \tilde{\varvec{z}} \Vert ^2 \end{aligned}$$
which follows from the fact that the average \(\overline{\varvec{z}} \) is the unique minimizer over \(c\) of the least squares objective \(\sum _{i=1}^h \Vert \varvec{z}_i - c \Vert ^2\), so (22) becomes an equality if and only if \(\tilde{\varvec{z}} =\overline{\varvec{z}} \), which is equivalent to \(\mathbf {m}_2=\mathbf {m}_1\).
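It follows that
$$\begin{aligned} \frac{1}{k}\sum _{i=1}^{k} d_{\mathbf {K}_1}(\varvec{\tilde{x}}^{2}_{i}) \leqslant \frac{1}{k}\sum _{i=1}^{k} d_{\mathbf {K}_1}(\varvec{\tilde{x}}^{1}_{i}) = \text {Tr}\Big ( (\mathbf {K}_1)^{-1}\, \frac{1}{k}\sum _{i=1}^{k}\varvec{\tilde{x}}^{1}_{i}(\varvec{\tilde{x}}^{1}_{i})'\Big ) = \text {Tr}(\mathbf {I}_p) = p. \end{aligned}$$
Now put
$$\begin{aligned} b = \frac{1}{kp}\sum _{i=1}^{k} d_{\mathbf {K}_1}(\varvec{\tilde{x}}^{2}_{i}) \leqslant 1. \end{aligned}$$
If we now compute distances relative to \(b \mathbf {K}_1\;\), we find
$$\begin{aligned} \frac{1}{k}\sum _{i=1}^{k} d_{b\mathbf {K}_1}(\varvec{\tilde{x}}^{2}_{i}) = \frac{1}{bk}\sum _{i=1}^{k} d_{\mathbf {K}_1}(\varvec{\tilde{x}}^{2}_{i}) = p. \end{aligned}$$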
From the theorem in Grübel (1988), it follows that \(\mathbf {K}_2\) is the unique minimizer of \(\text{ det }(\mathbf {S})\) among all \(\mathbf {S}\) for which \(\frac{1}{k} \sum _{i=1}^k d_{\mathbf {S}}(\varvec{\tilde{x}}^{2}_{i} )=p\) (note that the mean of \(\varvec{\tilde{x}}^{2}_{i}\) is zero). Therefore
$$\begin{aligned} \text{ det }(\mathbf {K}_2) \leqslant \text{ det }(b\mathbf {K}_1) = b^p \text{ det }(\mathbf {K}_1) \leqslant \text{ det }(\mathbf {K}_1). \end{aligned}$$
We can only have \(\text{ det }(\mathbf {K}_2) = \text{ det }(\mathbf {K}_1)\) if both of these inequalities are equalities. For the first, by uniqueness we can only have equality if \(\mathbf {K}_2=b\mathbf {K}_1\). For the second inequality, equality holds if and only if \(b=1\). Combining both yields \(\mathbf {K}_2=\mathbf {K}_1\). Moreover, \(b=1\) implies that (22) becomes an equality, hence \(\mathbf {m}_2=\mathbf {m}_1\). This concludes the proof of Theorem 1.
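The monotonicity established in this proof can be observed numerically. The sketch below is a simplified illustration under stated assumptions, not the algorithm of the paper: \(\rho \) is fixed by hand rather than chosen by the data-driven procedure, the target \(\mathbf {T}\) is the identity, no standardization or consistency factor is applied, and the function names `regularized_cov` and `c_step` are hypothetical. Each step keeps the \(h\) points with the smallest distances relative to the current \(\mathbf {m}\) and \(\mathbf {K}\), which satisfies the condition used in the proof.

```python
import numpy as np

def regularized_cov(XH, T, rho):
    """Mean of the subset and K = rho*T + (1-rho)*S, with S using the divisor h."""
    m = XH.mean(axis=0)
    Xc = XH - m
    S = Xc.T @ Xc / XH.shape[0]
    return m, rho * T + (1.0 - rho) * S

def c_step(X, H, T, rho):
    """Return the indices of the h points closest to the current subset's (m, K)."""
    m, K = regularized_cov(X[H], T, rho)
    Xc = X - m
    d = np.sum(Xc * np.linalg.solve(K, Xc.T).T, axis=1)  # squared distances w.r.t. K
    return np.sort(np.argsort(d)[:len(H)])

rng = np.random.default_rng(0)
n, p, h, rho = 40, 50, 30, 0.3          # p > h, so the plain MCD is not defined
X = rng.standard_normal((n, p))
X[:5] += 10.0                           # a few shifted points acting as outliers
T = np.eye(p)                           # assumed target matrix

H = np.sort(rng.choice(n, size=h, replace=False))
logdets = []
for _ in range(50):
    _, K = regularized_cov(X[H], T, rho)
    logdets.append(np.linalg.slogdet(K)[1])   # log det(K) of the current subset
    H_new = c_step(X, H, T, rho)
    if np.array_equal(H_new, H):
        break                                 # subset no longer changes
    H = H_new
```

The recorded values in `logdets` form a non-increasing sequence, in line with Theorem 1. Working with the log-determinant avoids overflow or underflow of the determinant in higher dimensions.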
Appendix B: The OGK estimator
Maronna and Zamar (2002) presented a general method to obtain positive definite and approximately affine equivariant robust scatter matrices starting from a robust bivariate scatter measure. This method was applied to the bivariate covariance estimate of Gnanadesikan and Kettenring (1972). The resulting multivariate location and scatter estimates are called orthogonalized Gnanadesikan-Kettenring (OGK) estimates and are calculated as follows:
1. Let m(.) and s(.) be robust univariate estimators of location and scale.
2. Construct \(\varvec{y}_i=\varvec{D}^{-1}\varvec{x}_i\) for \(i=1,\ldots ,n\) with \(\varvec{D}=\text {diag}(s(X_1),\ldots ,s(X_p))\;\).
3. Compute the ‘pairwise correlation matrix’ \(\varvec{U}\) of the variables of \(\varvec{Y}=(Y_1,\ldots ,Y_p)\;\), given by \(u_{jk} = 1/4 (s(Y_j+Y_k)^2-s(Y_j-Y_k)^2)\;\). This \(\varvec{U}\) is symmetric but not necessarily positive definite.
4. Compute the matrix \(\varvec{E}\) of eigenvectors of \(\varvec{U}\) and
   (a) project the data on these eigenvectors, i.e. \(\varvec{V}=\varvec{Y}\varvec{E}\;\);
   (b) compute ‘robust variances’ of \(\varvec{V}=(V_1,\ldots ,V_p)\;\), i.e. \(\varvec{\Lambda } = \text {diag}(s^2(V_1),\ldots ,s^2(V_p))\;\);
   (c) set the \(p \times 1\) vector \(\hat{\varvec{\mu }}(\varvec{Y}) = \varvec{E}\varvec{m}\) where \(\varvec{m}=(m(V_1),\ldots ,m(V_p))^T\;\), and compute the positive definite matrix \(\hat{\varvec{\Sigma }}(\varvec{Y}) = \varvec{E}\varvec{\Lambda } \varvec{E}^T\;\).
5. Transform back to \(\varvec{X}\), i.e. \(\hat{\varvec{\mu }}_\text {OGK}= \varvec{D}\hat{\varvec{\mu }}(\varvec{Y})\) and \(\hat{\varvec{\Sigma }}_\text {OGK}= \varvec{D}\hat{\varvec{\Sigma }}(\varvec{Y}) \varvec{D}^T\;\).
Step 2 makes the estimate location invariant and scale equivariant, whereas the next steps replace the eigenvalues of \(\varvec{U}\) (some of which may be negative) by positive numbers. In the simulation study and empirical analysis, we set m(.) to the median and s(.) to either the median absolute deviation or the Qn scale estimator. We use the implementation in the R package rrcov of Todorov and Filzmoser (2009).
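The five steps can be written out compactly. The following is a minimal sketch under assumptions, not the rrcov implementation: it takes m(.) as the median and s(.) as the median absolute deviation with the usual normal consistency constant 1.4826, performs a single pass of the steps as listed, and omits any reweighting or iteration that a packaged implementation may add. The function names `mad` and `ogk` and the test data are hypothetical.

```python
import numpy as np

def mad(x):
    """Median absolute deviation, scaled by 1.4826 for consistency at the normal."""
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def ogk(X, m=np.median, s=mad):
    """One pass of the OGK steps; returns (location, scatter) estimates."""
    n, p = X.shape
    # Step 2: scale every variable by its robust scale
    D = np.array([s(X[:, j]) for j in range(p)])
    Y = X / D
    # Step 3: pairwise 'correlation' matrix U
    U = np.empty((p, p))
    for j in range(p):
        for k in range(p):
            U[j, k] = 0.25 * (s(Y[:, j] + Y[:, k]) ** 2 - s(Y[:, j] - Y[:, k]) ** 2)
    # Step 4: eigenvectors of U, projections, robust variances and locations
    _, E = np.linalg.eigh(U)
    V = Y @ E
    lam = np.array([s(V[:, j]) ** 2 for j in range(p)])
    mu_Y = E @ np.array([m(V[:, j]) for j in range(p)])
    Sigma_Y = E @ np.diag(lam) @ E.T
    # Step 5: transform back to the original scale
    mu = D * mu_Y
    Sigma = D[:, None] * Sigma_Y * D[None, :]
    return mu, Sigma

rng = np.random.default_rng(3)
A = np.array([[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.3, 0.2, 2.0]])
X = rng.standard_normal((200, 3)) @ A.T
X[:10] += 25.0                                  # 5% shifted outliers
mu_ogk, Sigma_ogk = ogk(X)
eigenvalues = np.linalg.eigvalsh(Sigma_ogk)
```

Because the eigenvalues of \(\varvec{U}\) are replaced by squared robust scales, the returned scatter matrix is symmetric with strictly positive eigenvalues whenever every robust scale is nonzero.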
Appendix C: The RMCD estimator
The RMCD as initially proposed by Croux et al. (2012) uses random subsets. Below we give its adaptation using deterministic subsets. We thank Christophe Croux and Gentiane Haesbroeck for their helpful guidelines in specifying the proposed detRMCD algorithm, which closely follows the MRCD algorithm presented in Sect. 3. It uses the GLASSO algorithm of Friedman et al. (2008), as implemented in the package huge of Zhao et al. (2012).
Cite this article
Boudt, K., Rousseeuw, P.J., Vanduffel, S. et al. The minimum regularized covariance determinant estimator. Stat Comput 30, 113–128 (2020). https://doi.org/10.1007/s11222-019-09869-x
DOI: https://doi.org/10.1007/s11222-019-09869-x