
The minimum regularized covariance determinant estimator


Abstract

The minimum covariance determinant (MCD) approach estimates the location and scatter matrix using the subset of given size with lowest sample covariance determinant. Its main drawback is that it cannot be applied when the dimension exceeds the subset size. We propose the minimum regularized covariance determinant (MRCD) approach, which differs from the MCD in that the scatter matrix is a convex combination of a target matrix and the sample covariance matrix of the subset. A data-driven procedure sets the weight of the target matrix, so that the regularization is only used when needed. The MRCD estimator is defined in any dimension, is well-conditioned by construction and preserves the good robustness properties of the MCD. We prove that so-called concentration steps can be performed to reduce the MRCD objective function, and we exploit this fact to construct a fast algorithm. We verify the accuracy and robustness of the MRCD estimator in a simulation study and illustrate its practical use for outlier detection and regression analysis on real-life high-dimensional data sets in chemistry and criminology.
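The core of the definition is easy to illustrate in code. The R sketch below computes the regularized scatter matrix for one given subset; it is a minimal illustration only, since the actual MRCD algorithm additionally standardizes the data and selects the weight of the target matrix and the subset in a data-driven way. The function name, the equal-weight sample covariance and the default identity target are illustrative choices, not part of the article.

```r
## Regularized scatter for a given h-subset: a convex combination of a target
## matrix and the sample covariance matrix of the subset (illustrative sketch).
mrcd_scatter <- function(X, subset, rho, Target = diag(ncol(X))) {
  Xh <- X[subset, , drop = FALSE]
  m  <- colMeans(Xh)                                 # location estimate of the subset
  Sh <- crossprod(sweep(Xh, 2, m)) / length(subset)  # sample covariance of the subset
  rho * Target + (1 - rho) * Sh                      # regularized scatter matrix
}
```

As noted in the acknowledgments, MRCD functionality has been added to the R package rrcov (Todorov and Filzmoser 2009).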


References

  • Agostinelli, C., Leung, A., Yohai, V., Zamar, R.: Robust estimation of multivariate location and scatter in the presence of cellwise and casewise contamination. Test 24(3), 441–461 (2015)

  • Agulló, J., Croux, C., Van Aelst, S.: The multivariate least trimmed squares estimator. J. Multivar. Anal. 99, 311–338 (2008)

  • Atkinson, A.C., Riani, M., Cerioli, A.: Exploring Multivariate Data with the Forward Search. Springer, New York (2004)

  • Bartlett, M.S.: An inverse matrix adjustment arising in discriminant analysis. Ann. Math. Stat. 22(1), 107–111 (1951)

  • Boudt, K., Cornelissen, J., Croux, C.: Jump robust daily covariance estimation by disentangling variance and correlation components. Comput. Stat. Data Anal. 56(11), 2993–3005 (2012)

  • Butler, R., Davies, P., Jhun, M.: Asymptotics for the minimum covariance determinant estimator. Ann. Stat. 21(3), 1385–1400 (1993)

  • Cator, E., Lopuhaä, H.: Central limit theorem and influence function for the MCD estimator at general multivariate distributions. Bernoulli 18(2), 520–551 (2012)

  • Croux, C., Haesbroeck, G.: Influence function and efficiency of the minimum covariance determinant scatter matrix estimator. J. Multivar. Anal. 71(2), 161–190 (1999)

  • Croux, C., Haesbroeck, G.: Principal components analysis based on robust estimators of the covariance or correlation matrix: influence functions and efficiencies. Biometrika 87, 603–618 (2000)

  • Croux, C., Gelper, S., Haesbroeck, G.: Regularized Minimum Covariance Determinant Estimator. Mimeo, New York (2012)

  • Esbensen, K., Midtgaard, T., Schönkopf, S.: Multivariate Analysis in Practice: A Training Package. Camo As, Oslo (1996)

  • Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(2), 432–441 (2008)

  • Gnanadesikan, R., Kettenring, J.: Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 28, 81–124 (1972)

  • Grübel, R.: A minimal characterization of the covariance matrix. Metrika 35(1), 49–52 (1988)

  • Hardin, J., Rocke, D.: Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator. Comput. Stat. Data Anal. 44, 625–638 (2004)

  • Hardin, J., Rocke, D.: The distribution of robust distances. J. Comput. Graph. Stat. 14(4), 928–946 (2005)

  • Hubert, M., Van Driessen, K.: Fast and robust discriminant analysis. Comput. Stat. Data Anal. 45, 301–320 (2004)

  • Hubert, M., Rousseeuw, P., Vanden Branden, K.: ROBPCA: a new approach to robust principal components analysis. Technometrics 47, 64–79 (2005)

  • Hubert, M., Rousseeuw, P., Van Aelst, S.: High breakdown robust multivariate methods. Stat. Sci. 23, 92–119 (2008)

  • Hubert, M., Rousseeuw, P., Verdonck, T.: A deterministic algorithm for robust location and scatter. J. Comput. Graph. Stat. 21(3), 618–637 (2012)

  • Khan, J., Van Aelst, S., Zamar, R.H.: Robust linear model selection based on least angle regression. J. Am. Stat. Assoc. 102(480), 1289–1299 (2007)

  • Ledoit, O., Wolf, M.: A well-conditioned estimator for large-dimensional covariance matrices. J. Multivar. Anal. 88, 365–411 (2004)

  • Lopuhaä, H., Rousseeuw, P.: Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Ann. Stat. 19, 229–248 (1991)

  • Maronna, R., Zamar, R.H.: Robust estimates of location and dispersion for high-dimensional datasets. Technometrics 44(4), 307–317 (2002)

  • Öllerer, V., Croux, C.: Robust high-dimensional precision matrix estimation. In: Modern Nonparametric, Robust and Multivariate Methods, pp. 325–350. Springer (2015)

  • Pison, G., Rousseeuw, P., Filzmoser, P., Croux, C.: Robust factor analysis. J. Multivar. Anal. 84, 145–172 (2003)

  • Rousseeuw, P.: Least median of squares regression. J. Am. Stat. Assoc. 79(388), 871–880 (1984)

  • Rousseeuw, P.: Multivariate estimation with high breakdown point. In: Grossmann, W., Pflug, G., Vincze, I., Wertz, W. (eds.) Mathematical Statistics and Applications, vol. B, pp. 283–297. Reidel Publishing Company, Dordrecht (1985)

  • Rousseeuw, P., Croux, C.: Alternatives to the median absolute deviation. J. Am. Stat. Assoc. 88(424), 1273–1283 (1993)

  • Rousseeuw, P., Van Driessen, K.: A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212–223 (1999)

  • Rousseeuw, P., Van Zomeren, B.: Unmasking multivariate outliers and leverage points. J. Am. Stat. Assoc. 85(411), 633–639 (1990)

  • Rousseeuw, P., Van Aelst, S., Van Driessen, K., Agulló, J.: Robust multivariate regression. Technometrics 46, 293–305 (2004)

  • Rousseeuw, P., Croux, C., Todorov, V., Ruckstuhl, A., Salibian-Barrera, M., Verbeke, T., Koller, M., Maechler, M.: Robustbase: Basic Robust Statistics. R package version 0.92-3 (2012)

  • SenGupta, A.: Tests for standardized generalized variances of multivariate normal populations of possibly different dimensions. J. Multivar. Anal. 23(2), 209–219 (1987)

  • Sherman, J., Morrison, W.J.: Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Ann. Math. Stat. 21(1), 124–127 (1950)

  • Todorov, V., Filzmoser, P.: An object-oriented framework for robust multivariate analysis. J. Stat. Softw. 32(3), 1–47 (2009)

  • Won, J.-H., Lim, J., Kim, S.-J., Rajaratnam, B.: Condition-number-regularized covariance estimation. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 75(3), 427–450 (2013)

  • Woodbury, M.A.: Inverting modified matrices. Memo. Rep. 42, 106 (1950)

  • Zhao, T., Liu, H., Roeder, K., Lafferty, J., Wasserman, L.: The huge package for high-dimensional undirected graph estimation in R. J. Mach. Learn. Res. 13, 1059–1062 (2012)

Author information

Correspondence to Kris Boudt.

Additional information


This research has benefited from the financial support of the Flemish Science Foundation (FWO) and project C16/15/068 of Internal Funds KU Leuven. We are grateful to Valentin Todorov for adding the MRCD functionality to the R package rrcov (Todorov and Filzmoser 2009), and to Yukai Yang for his assistance in the initial stage of this work. We also thank the Editor, two anonymous referees, Dries Cornilly, Christophe Croux, Gentiane Haesbroeck, Sebastiaan Höppner, Stefan Van Aelst and Marjan Wauters for their constructive comments.

Appendices

Appendix A: Proof of Theorem 1

Generate a p-variate sample \(\mathbf {Z}\) with \(p+1\) points for which \({\varvec{\Lambda }} = \frac{1}{p+1}\sum _{i=1}^{p+1} (\varvec{z}_i-\overline{\varvec{z}})(\varvec{z}_i-\overline{\varvec{z}})'\) is nonsingular, where \(\overline{\varvec{z}}=\frac{1}{p+1}\sum _{i=1}^{p+1}\varvec{z}_i\). Then \(\tilde{\varvec{z}}_i={\varvec{\Lambda }}^{-1/2}(\varvec{z}_i-\overline{\varvec{z}})\) has mean zero and covariance matrix \(\mathbf {I}_p\). Now compute \(\varvec{y}_i=\mathbf {T}^{1/2}\tilde{\varvec{z}}_i\;\), hence \(\mathbf {Y}\) has mean zero and covariance matrix \(\mathbf {T}\).

Next, create the artificial dataset

$$\begin{aligned} \tilde{\mathbf {X}}^{1} = \left( w_1(\varvec{x}^{1}_{1}-\mathbf {m}_1),\ldots ,w_h(\varvec{x}^{1}_{h}-\mathbf {m}_1),w_{h+1}\varvec{y}_1,\ldots ,w_{k}\varvec{y}_{p+1}\right) \end{aligned}$$

with \(k=h+p+1\) points, where \(\varvec{x}^{1}_{1},\ldots ,\varvec{x}^{1}_{h}\) are the members of \(H_1\). The factors \(w_i\) are given by

$$\begin{aligned} w_i = \left\{ \begin{array}{cl} \sqrt{k(1-\rho )/h} &{} \;\;\; \text{ for } \ i=1,\ldots ,h \\ \sqrt{k\rho /(p+1)} &{} \;\;\; \text{ for } \ i=h+1,\ldots ,k. \end{array}\right. \end{aligned}$$

The mean and covariance matrix of \( {\tilde{\mathbf{X}}}^1\) are then

$$\begin{aligned} \frac{1}{k} \sum _{i=1}^k \varvec{\tilde{x}}^1_i&= \sqrt{\frac{1-\rho }{kh}} \sum _{i=1}^h ( \varvec{x}^1_i - \mathbf {m}_1) +\sqrt{\frac{\rho }{k(p+1)}} \sum _{j=1}^{p+1} \varvec{y}_j = 0 \end{aligned}$$

and

$$\begin{aligned} \frac{1}{k} \sum _{i=1}^k \varvec{\tilde{x}}^{1}_{i} (\varvec{\tilde{x}}^{1}_{i})'&= \frac{1-\rho }{h} \sum _{i=1}^h ( \varvec{x}^{1}_{i} - \mathbf {m}_1)( \varvec{x}^{1}_{i} - \mathbf {m}_1)'\nonumber \\&\quad +\frac{\rho }{p+1} \sum _{j=1}^{p+1} \varvec{y}_j \varvec{y}'_j \\&= (1-\rho ) \mathbf {S}_1 + \rho \mathbf {T} = \mathbf {K}_1. \end{aligned}$$

The regularized covariance matrix \(\mathbf {K}_1\) is thus the actual covariance matrix of the combined data set \({\tilde{\mathbf{X}}}^1\;\). Analogously we construct

$$\begin{aligned} \tilde{\mathbf {X}}^2 = \left( w_1(\varvec{x}^{2}_{1}-\mathbf {m}_2),\ldots ,w_h(\varvec{x}^{2}_{h}-\mathbf {m}_2),w_{h+1}\varvec{y}_1,\ldots ,w_{k}\varvec{y}_{p+1}\right) \end{aligned}$$

where \(\varvec{x}^{2}_{1},\ldots ,\varvec{x}^{2}_{h}\) are the members of \(H_2\;\). \(\tilde{\mathbf {X}}^2\) has zero mean and covariance matrix \(\mathbf {K}_2=(1-\rho ) \mathbf {S}_2 + \rho \mathbf {T}\;\).
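The construction above can be checked numerically. The R sketch below uses arbitrary illustrative values for \(p\), \(h\) and \(\rho\), takes \(\mathbf {T}=\mathbf {I}_p\), and replaces the symmetric square roots \({\varvec{\Lambda }}^{-1/2}\) and \(\mathbf {T}^{1/2}\) by Cholesky factors (which also yield the stated mean and covariance); it verifies that the covariance matrix of \(\tilde{\mathbf {X}}^{1}\) equals \(\mathbf {K}_1=(1-\rho )\mathbf {S}_1+\rho \mathbf {T}\).

```r
set.seed(1)
p <- 4; h <- 20; rho <- 0.3
X1 <- matrix(rnorm(h * p), h, p)               # the h members of H_1 (illustrative data)
m1 <- colMeans(X1)
S1 <- crossprod(sweep(X1, 2, m1)) / h          # sample covariance S_1 of the subset (divisor h)
Tmat <- diag(p)                                # target matrix T (here I_p)

## p + 1 points with mean zero and covariance matrix exactly T
Z   <- matrix(rnorm((p + 1) * p), p + 1, p)
Zc  <- sweep(Z, 2, colMeans(Z))                # z_i - zbar
Lam <- crossprod(Zc) / (p + 1)                 # Lambda, nonsingular with probability 1
Y   <- Zc %*% solve(chol(Lam)) %*% chol(Tmat)  # rows have mean zero and covariance T

k  <- h + p + 1
Xt <- rbind(sqrt(k * (1 - rho) / h) * sweep(X1, 2, m1),  # w_i (x_i - m_1), i = 1,...,h
            sqrt(k * rho / (p + 1)) * Y)                 # w_i y_j,         i = h+1,...,k
K1 <- crossprod(Xt) / k                        # covariance of the combined data set
max(abs(K1 - ((1 - rho) * S1 + rho * Tmat)))   # essentially zero (rounding error only)
```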

Denote \(d_{\mathbf {K}_1}(\varvec{\tilde{x}}) = \varvec{\tilde{x}'}( \mathbf {K}_1)^{-1}\varvec{\tilde{x}}\). We can then prove that:

$$\begin{aligned} \frac{1}{k}\sum _{i=1}^h d_{\mathbf {K}_1}(\varvec{\tilde{x}}^2_{i} )&=\frac{1-\rho }{h}\sum _{i=1}^h d_{\mathbf {K}_1}(\varvec{x}^2_{i}-\mathbf {m}_2) \qquad (21) \\&\le \frac{1-\rho }{h}\sum _{i=1}^h d_{\mathbf {K}_1}(\varvec{x}^2_{i}-\mathbf {m}_1) \qquad (22) \\&\le \frac{1-\rho }{h}\sum _{i=1}^h d_{\mathbf {K}_1}(\varvec{x}^1_{i}-\mathbf {m}_1) \qquad (23) \\&= \frac{1}{k}\sum _{i=1}^h d_{\mathbf {K}_1}(\varvec{\tilde{x}}^1_{i}) \qquad (24) \end{aligned}$$

in which the second inequality (23) is precisely condition (18).

The first inequality (22) can be shown as follows. Put \(\varvec{z}_i = (\mathbf {K}_1)^{-1/2}\varvec{x}^2_{i}\) and \(\tilde{\varvec{z}} = (\mathbf {K}_1)^{-1/2}\mathbf {m}_1\) and note that \(\overline{\varvec{z}}=(\mathbf {K}_1)^{-1/2}\mathbf {m}_2\) is the average of the \(\varvec{z}_i\). Then (22) becomes

$$\begin{aligned} \sum _{i=1}^h \Vert \varvec{z}_i -\overline{\varvec{z}} \Vert ^2 \le \sum _{i=1}^h \Vert \varvec{z}_i - \tilde{\varvec{z}} \Vert ^2, \end{aligned}$$

which follows from the fact that \(\overline{\varvec{z}} \) is the unique minimizer of the least squares objective \(\sum _{i=1}^h \Vert \varvec{z}_i - \varvec{c} \Vert ^2\) over \(\varvec{c}\), so (22) becomes an equality if and only if \(\tilde{\varvec{z}} =\overline{\varvec{z}} \), which is equivalent to \(\mathbf {m}_2=\mathbf {m}_1\).

It follows that

$$\begin{aligned} \sum _{i=1}^k d_{\mathbf {K}_1}(\varvec{\tilde{x}}^2_{i} )&=\sum _{i=1}^h d_{\mathbf {K}_1}(\varvec{\tilde{x}}^2_{i} ) + \frac{k\rho }{p+1}\sum _{j=1}^{p+1} d_{\mathbf {K}_1}(\varvec{y}_{j} ) \\&\le \sum _{i=1}^h d_{\mathbf {K}_1}(\varvec{\tilde{x}}^1_{i} ) + \frac{k\rho }{p+1}\sum _{j=1}^{p+1} d_{\mathbf {K}_1}(\varvec{y}_{j} ) \\&= \sum _{i=1}^k d_{\mathbf {K}_1}(\varvec{\tilde{x}}^1_{i}). \end{aligned}$$

Now put

$$\begin{aligned} b = \frac{ \sum _{i=1}^k d_{\mathbf {K}_1}(\varvec{\tilde{x}}^2_{i} ) }{ \sum _{i=1}^k d_{\mathbf {K}_1}(\varvec{\tilde{x}}^1_{i} )} \le 1. \end{aligned}$$

If we now compute distances relative to \(b \mathbf {K}_1\;\), we find

$$\begin{aligned} \frac{1}{k} \sum _{i=1}^k d_{b\mathbf {K}_1}(\varvec{\tilde{x}}^2_{i} )&=\frac{1}{b} \frac{1}{k} \sum _{i=1}^k d_{\mathbf {K}_1}(\varvec{\tilde{x}}^2_{i} ) = \frac{1}{k} \sum _{i=1}^k d_{\mathbf {K}_1}(\varvec{\tilde{x}}^1_{i} ) \\&= \frac{1}{k} \sum _{i=1}^k (\varvec{\tilde{x}}^{1}_{i} )'(\mathbf {K}_1)^{-1} \varvec{\tilde{x}}^{1}_{i}\\&=\frac{1}{k} \sum _{i=1}^k (\mathbf {K}_1^{-1/2}\varvec{\tilde{x}}^{1}_{i} )'(\mathbf {K}_1^{-1/2}\varvec{\tilde{x}}^{1}_{i})\\&= \text{ Trace }\left( \frac{1}{k} \sum _{i=1}^k (\mathbf {K}_1^{-1/2}\varvec{\tilde{x}}^{1}_{i} )'(\mathbf {K}_1^{-1/2}\varvec{\tilde{x}}^{1}_{i}) \right) \\&= \text{ Trace }\left( (\mathbf {K}_1)^{-1/2} \left( \frac{1}{k} \sum _{i=1}^k (\varvec{\tilde{x}}^{1}_{i} )(\varvec{\tilde{x}}^{1}_{i})' \right) (\mathbf {K}_1)^{-1/2} \right) \\&= \text{ Trace }(\mathbf {I}_p) = p . \end{aligned}$$

From the theorem in Grübel (1988), it follows that \(\mathbf {K}_2\) is the unique minimizer of \(\text{ det }(\mathbf {S})\) among all \(\mathbf {S}\) for which \(\frac{1}{k} \sum _{i=1}^k d_{\mathbf {S}}(\varvec{\tilde{x}}^{2}_{i} )=p\) (note that the mean of \(\varvec{\tilde{x}}^{2}_{i}\) is zero). Therefore

$$\begin{aligned} \det (\mathbf {K}_2) \le \det (b\mathbf {K}_1) \le \det (\mathbf {K}_1). \end{aligned}$$

We can only have \(\text{ det }(\mathbf {K}_2) = \det (\mathbf {K}_1)\) if both of these inequalities are equalities. For the first, by uniqueness we can only have equality if \(\mathbf {K}_2=b\mathbf {K}_1\). For the second inequality, equality holds if and only if \(b=1\). Combining both yields \(\mathbf {K}_2=\mathbf {K}_1\). Moreover, \(b=1\) implies that (22) becomes an equality, hence \(\mathbf {m}_2=\mathbf {m}_1\). This concludes the proof of Theorem 1.

Appendix B: The OGK estimator

Maronna and Zamar (2002) presented a general method to obtain positive definite and approximately affine equivariant robust scatter matrices starting from a robust bivariate scatter measure. This method was applied to the bivariate covariance estimate of Gnanadesikan and Kettenring (1972). The resulting multivariate location and scatter estimates are called orthogonalized Gnanadesikan-Kettenring (OGK) estimates and are calculated as follows:

  1. Let m(.) and s(.) be robust univariate estimators of location and scale.

  2. Construct \(\varvec{y}_i=\varvec{D}^{-1}\varvec{x}_i\) for \(i=1,\ldots ,n\) with \(\varvec{D}=\text {diag}(s(X_1),\ldots ,s(X_p))\;\).

  3. Compute the ‘pairwise correlation matrix’ \(\varvec{U}\) of the variables of \(\varvec{Y}=(Y_1,\ldots ,Y_p)\;\), given by \(u_{jk} = 1/4 (s(Y_j+Y_k)^2-s(Y_j-Y_k)^2)\;\). This \(\varvec{U}\) is symmetric but not necessarily positive definite.

  4. Compute the matrix \(\varvec{E}\) of eigenvectors of \(\varvec{U}\) and

     (a) project the data on these eigenvectors, i.e. \(\varvec{V}=\varvec{Y}\varvec{E}\;\);

     (b) compute ‘robust variances’ of \(\varvec{V}=(V_1,\ldots ,V_p)\;\), i.e. \(\varvec{\Lambda } = \text {diag}(s^2(V_1),\ldots ,s^2(V_p))\;\);

     (c) set the \(p \times 1\) vector \(\hat{\varvec{\mu }}(\varvec{Y}) = \varvec{E}\varvec{m}\) where \(\varvec{m}=(m(V_1),\ldots ,m(V_p))^T\;\), and compute the positive definite matrix \(\hat{\varvec{\Sigma }}(\varvec{Y}) = \varvec{E}\varvec{\Lambda } \varvec{E}^T\;\).

  5. Transform back to \(\varvec{X}\), i.e. \(\hat{\varvec{\mu }}_\text {OGK}= \varvec{D}\hat{\varvec{\mu }}(\varvec{Y})\) and \(\hat{\varvec{\Sigma }}_\text {OGK}= \varvec{D}\hat{\varvec{\Sigma }}(\varvec{Y}) \varvec{D}^T\;\).

Step 2 makes the estimate location invariant and scale equivariant, whereas the next steps replace the eigenvalues of \(\varvec{U}\) (some of which may be negative) by positive numbers. In the simulation study and empirical analysis, we set m(.) to the median and s(.) to either the median absolute deviation or the Qn scale estimator. We use the implementation in the R package rrcov of Todorov and Filzmoser (2009).
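A compact one-pass translation of steps 1–5 into R may help clarify the computation; it is only an illustrative sketch with m(.) the median and s(.) the MAD, not the rrcov implementation used in the paper.

```r
## One-pass OGK sketch following steps 1-5 above, with m(.) = median and s(.) = mad.
## Illustrative only; the simulations in the paper use the rrcov implementation.
ogk_sketch <- function(X, m = median, s = mad) {
  p <- ncol(X)
  D <- diag(apply(X, 2, s))                  # step 2: robust scales of X_1,...,X_p
  Y <- X %*% solve(D)                        # y_i = D^{-1} x_i
  U <- diag(p)                               # step 3: pairwise 'correlations' (diagonal set to 1)
  for (j in seq_len(p - 1)) {
    for (l in seq(j + 1, p)) {
      U[j, l] <- U[l, j] <- (s(Y[, j] + Y[, l])^2 - s(Y[, j] - Y[, l])^2) / 4
    }
  }
  E    <- eigen(U, symmetric = TRUE)$vectors # step 4: eigenvectors of U
  V    <- Y %*% E                            # 4(a): project the data on the eigenvectors
  Lam  <- diag(apply(V, 2, s)^2)             # 4(b): robust variances of V_1,...,V_p
  muY  <- E %*% apply(V, 2, m)               # 4(c): location estimate in Y-coordinates
  SigY <- E %*% Lam %*% t(E)                 # 4(c): positive definite scatter of Y
  list(center = as.vector(D %*% muY),        # step 5: transform back to X
       cov    = D %*% SigY %*% t(D))
}
```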

Appendix C: The RMCD estimator

The RMCD as initially proposed by Croux et al. (2012) uses random subsets. Below we give its adaptation using deterministic subsets. We thank Christophe Croux and Gentiane Haesbroeck for their helpful guidance in specifying the proposed detRMCD algorithm, which closely follows the MRCD algorithm presented in Sect. 3. It uses the GLASSO algorithm of Friedman et al. (2008), as implemented in the R package huge of Zhao et al. (2012).

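For reference, a minimal sketch of the GLASSO step inside detRMCD is given below. It assumes the huge() interface of the package huge; the data matrix Xh of the current h-subset and the value of the regularization parameter are placeholders, since their choice is part of the detRMCD algorithm itself.

```r
## Sketch of the GLASSO step (assumptions: huge() with method = "glasso",
## placeholder subset Xh and regularization value lambda).
library(huge)
lambda    <- 0.1                           # illustrative regularization value
fit       <- huge(Xh, lambda = lambda, method = "glasso", verbose = FALSE)
Omega_hat <- as.matrix(fit$icov[[1]])      # regularized precision matrix of the subset
Sigma_hat <- solve(Omega_hat)              # corresponding regularized scatter matrix
```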

Cite this article

Boudt, K., Rousseeuw, P.J., Vanduffel, S. et al. The minimum regularized covariance determinant estimator. Stat Comput 30, 113–128 (2020). https://doi.org/10.1007/s11222-019-09869-x
