Robust estimation of multivariate location and scatter in the presence of cellwise and casewise contamination

  • Invited Paper
  • Published in TEST

Abstract

Multivariate location and scatter matrix estimation is a cornerstone of multivariate data analysis. We consider this problem when the data may contain independent cellwise and casewise outliers. Flat data sets, with a large number of variables and a relatively small number of cases, are commonplace in modern statistical applications. In such cases, globally down-weighting an entire case, as traditional robust procedures do, may lead to poor results. We highlight the need for a new generation of robust estimators that can efficiently deal with cellwise outliers and at the same time perform well under casewise outliers.

References

  • Alqallaf F, Van Aelst S, Yohai VJ, Zamar RH (2009) Propagation of outliers in multivariate data. Ann Stat 37(1):311–331

  • Alqallaf FA, Konis KP, Martin RD, Zamar RH (2002) Scalable robust covariance and correlation estimates for data mining. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’02, pp 14–23. doi:10.1145/775047.775050

  • Danilov M (2010) Robust estimation of multivariate scatter under non-affine equivariant scenarios. Dissertation, University of British Columbia

  • Danilov M, Yohai VJ, Zamar RH (2012) Robust estimation of multivariate location and scatter in the presence of missing data. J Am Stat Assoc 107:1178–1186

  • Davies P (1987) Asymptotic behaviour of S-estimators of multivariate location parameters and dispersion matrices. Ann Stat 15:1269–1292

  • Donoho DL (1982) Breakdown properties of multivariate location estimators. Dissertation, Harvard University

  • Farcomeni A (2014) Robust constrained clustering in presence of entry-wise outliers. Technometrics 56:102–111

  • Gervini D, Yohai VJ (2002) A class of robust and fully efficient regression estimators. Ann Stat 30(2):583–616

  • Huber PJ, Ronchetti EM (2009) Robust statistics, 2nd edn. Wiley, Hoboken

  • Hubert M, Rousseeuw PJ, Vakili K (2014) Shape bias of robust covariance estimators: an empirical study. Stat Pap 55:15–28

  • Maronna RA, Martin RD, Yohai VJ (2006) Robust statistics: theory and methods. Wiley, Chichester

  • Rousseeuw PJ (1985) Multivariate estimation with high breakdown point. In: Grossmann W, Pflug G, Vincze I, Wertz W (eds) Mathematical statistics and applications, vol B. Reidel Publishing Company, Dordrecht, pp 256–272

  • Rousseeuw PJ, Van Driessen K (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41:212–223

  • Salibian-Barrera M, Yohai VJ (2006) A fast algorithm for S-regression estimates. J Comput Graph Stat 15(2):414–427

  • Smith RE, Campbell NA, Licheld A (1984) Multivariate statistical techniques applied to pisolitic laterite geochemistry at Golden Grove, Western Australia. J Geochem Explor 22:193–216

  • Stahel WA (1981) Breakdown of covariance estimators. Tech. Rep. 31, Fachgruppe für Statistik, ETH Zürich, Switzerland

  • Stahel WA, Maechler M (2009) Comment on “invariant co-ordinate selection”. J R Stat Soc Ser B Stat Methodol 71:584–586

  • Tatsuoka KS, Tyler DE (2000) On the uniqueness of S-functionals and M-functionals under nonelliptical distributions. Ann Stat 28:1219–1243

  • Van Aelst S, Vandervieren E, Willems G (2012) A Stahel–Donoho estimator based on huberized outlyingness. Comput Stat Data Anal 56:531–542

  • Yohai VJ (1985) High breakdown point and high efficiency robust estimates for regression. Tech. Rep. 66, Department of Statistics, University of Washington. Available: http://www.stat.washington.edu/research/reports/1985/tr066.pdf

Acknowledgments

The authors thank the anonymous reviewers for their thoughtful comments and suggestions, which led to an improved version of the paper. Victor Yohai's research was partially supported by Grant W276 from Universidad de Buenos Aires, Grants PIP 112-2008-01-00216 and 112-2011-01-00339 from CONICET, and Grant PICT 2011-0397 from ANPCYT, Argentina. The research of Ruben Zamar and Andy Leung was partially funded by the Natural Sciences and Engineering Research Council of Canada.

Author information

Correspondence to Ruben H. Zamar.

Additional information

This invited paper is discussed in comments available at: doi:10.1007/s11749-015-0451-5; doi:10.1007/s11749-015-0452-4; doi:10.1007/s11749-015-0453-3; doi:10.1007/s11749-015-0454-2; doi:10.1007/s11749-015-0455-1; doi:10.1007/s11749-015-0456-0.

Appendix: Proofs

1.1 Proof of Proposition 2.1

Let \(\widehat{F}_n^+\) be the empirical distribution of \(|\widehat{Z}|\), where \(\widehat{Z}\) is defined by replacing \(\mu \) and \(\sigma \) with \(T_{0n}\) and \(S_{0n}\), respectively, in the definition of Z.

Note that

$$\begin{aligned} |Z-\widehat{Z}|&\le \left| \frac{X - \mu }{\sigma } - \frac{X - T_{0n}}{S_{0n}} \right| \\&\le \left| \frac{X - \mu }{\sigma } - \frac{X - \mu }{S_{0n}} \right| + \frac{|T_{0n} - \mu |}{S_{0n}} \\&\le \widehat{A}+\widehat{B} \end{aligned}$$

where \(\widehat{A} \rightarrow 0\) a.s. and \(\widehat{B} \rightarrow 0\) a.s. By the uniform continuity of \(F^+\), given \(\varepsilon >0\), there exists \(\delta > 0\) such that \(|F^+(z(1-\delta ) - \delta )-F^+(z)|\le \varepsilon /2\) for all z. With probability one there exists \(n_1\) such that \(n\ge n_1\) implies \(|\widehat{A}|<\delta \) and \(|\widehat{B}|<\delta \). By the Glivenko–Cantelli Theorem, with probability one there exists \(n_2\) such that \(n\ge n_2\) implies \(\sup _{z}|\widehat{F}_n^+(z) -F^+(z)|\le \varepsilon /2\). Let \(n_3=\max (n_1,n_2)\); then \(n\ge n_3\) implies

$$\begin{aligned} \widehat{F}_{n}^+(z)&\ge \widehat{F}_{n}^+(z(1-\delta ) -\delta )\\&=\left( \widehat{F}_{n}^+(z(1-\delta ) -\delta )-F_0^+(z(1-\delta )-\delta )\right) \\&\qquad +(F_0^+(z(1-\delta ) -\delta )-F_0^{+}(z))+(F_0^{+}(z)- F^+(z))+F^{+}(z) \end{aligned}$$

and then

$$\begin{aligned} \sup _{z> \eta }( F^+(z)-\widehat{F}_{n}^+(z))&\le \sup _{z> \eta }\left| F_0^+(z(1-\delta ) -\delta )- \widehat{F}_{n}^{+}(z(1-\delta ) -\delta )\right| \\&\qquad +\sup _{z> \eta }\left| F_0^{+}(z(1-\delta )-\delta )\!-\!F_0^{+}(z)\right| +\sup _{z> \eta }(F^{+}(z)\!-\!F_0^{+}(z))\\&\le \varepsilon \end{aligned}$$

This implies that \(n_{0}/n\rightarrow 0\) a.s.

1.2 Proof of Theorem 3.1

We need the following lemma, which is proved in Yohai (1985).

Lemma 7.1

Let \(\{\mathbf {Z}_{i}\}\) be i.i.d. random vectors taking values in \(\mathbb {R}^{k}\), with common distribution Q. Let \(f:\mathbb {R}^{k}\times \mathbb {R}^{h}\rightarrow \mathbb {R}\) be a continuous function and assume that for some \(\delta >0\) we have that

$$\begin{aligned} E_Q\left[ \sup _{||\lambda -\lambda _{0}||\le \delta }|f(\mathbf {Z},\lambda )|\right] <\infty . \end{aligned}$$

Then, if \(\widehat{\lambda }_{n}\rightarrow \lambda _{0}\) a.s., we have

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n}f(\mathbf {Z}_{i},\widehat{\lambda }_{n})\rightarrow E_Q\left[ f(\mathbf {Z},\lambda _{0})\right] \text { a.s.} \end{aligned}$$
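
The following R sketch is purely a numerical illustration of the lemma's conclusion and is not part of the original argument; the bounded bisquare-type function, the standard normal \(\mathbf {Z}_i\), and the choice \(\widehat{\lambda }_n = \mathrm{sd}(Z_1,\dots ,Z_n)\rightarrow \lambda _0 = 1\) are illustrative assumptions.

```r
## Numerical illustration of Lemma 7.1 (not part of the proof).
## Z_i ~ N(0,1), f(z, lambda) = rho(z / lambda) with rho bounded and continuous,
## and lambda_hat_n = sd(Z_1, ..., Z_n) -> lambda_0 = 1 a.s.
set.seed(1)

rho_bisq <- function(t, cc = 1.5) ifelse(abs(t) <= cc, 1 - (1 - (t / cc)^2)^3, 1)
f <- function(z, lambda) rho_bisq(z / lambda)

n_seq <- c(100, 1000, 10000, 100000)
emp <- sapply(n_seq, function(n) {
  z <- rnorm(n)
  lambda_hat <- sd(z)        # converges a.s. to lambda_0 = 1
  mean(f(z, lambda_hat))     # (1/n) sum_i f(Z_i, lambda_hat_n)
})

## The limit E[f(Z, lambda_0)], computed by numerical integration
lim <- integrate(function(z) f(z, 1) * dnorm(z), -Inf, Inf)$value

print(rbind(n = n_seq, empirical = round(emp, 4)))
cat("E[f(Z, lambda_0)] =", round(lim, 4), "\n")
```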

Proof of Theorem 3.1

Define

$$\begin{aligned} (\varvec{\widehat{\mu }}_{GS},\widetilde{\varvec{\Sigma }}_{GS})=\arg \min _{\varvec{\varvec{\mu }},|\varvec{\Sigma }|=1}s_{GS} (\varvec{\varvec{\mu }},\varvec{\Sigma },\varvec{\widehat{\Omega }}). \end{aligned}$$
(16)

We drop \({\mathbb {X}}\) and \({\mathbb {U}}\) from the arguments to simplify the notation. Since \(s_{GS}(\varvec{\mu },\lambda \varvec{\Sigma },\varvec{\widehat{\Omega }})=s_{GS}(\varvec{\mu },\varvec{\Sigma },\varvec{\widehat{\Omega }})\), to prove Theorem 3.1 it is enough to show

  1. (a)
    $$\begin{aligned} (\varvec{\widehat{\mu }}_{GS},\ \widetilde{\varvec{\Sigma }}_{GS} )\rightarrow (\varvec{\mu }_{0},\varvec{\Sigma }_{00})\text { a.s.,} \quad \quad \text { and} \end{aligned}$$
    (17)
  2. (b)
    $$\begin{aligned} s_{GS}(\varvec{\widehat{\varvec{\mu }}}_{GS},\widetilde{\varvec{\Sigma } }_{GS},\widetilde{\varvec{\Sigma }}_{GS})\rightarrow \sigma _{0}\text { a.s.} \end{aligned}$$
    (18)

Since

$$\begin{aligned} E_{H_0}\left( \rho \left( \frac{d\left( \mathbf {X} ,\varvec{\varvec{\mu }}_{0},\varvec{\Sigma }_{0}\right) }{\sigma _{0} c_{p}\,}\right) \right) =b, \end{aligned}$$

part (i) of Lemma 6 in the Supplemental Material of Danilov et al. (2012) implies that, given \(\varepsilon >0\), there exists \(\delta > 0\) such that

$$\begin{aligned} \underset{n\rightarrow \infty }{\underline{\lim }}\inf _{(\varvec{\mu },\varvec{\Sigma })\in C_{\varepsilon }^{C},|\varvec{\Sigma }|=1}\frac{1}{n}\sum _{i=1}^{n} c_{p}\rho \left( \frac{d\left( \mathbf {X}_{i},\varvec{\varvec{\mu } },\varvec{\Sigma }\right) }{\sigma _{0}c_{p}\,(1+\delta )}\right) >(b+\delta )c_{p}, \end{aligned}$$
(19)

where \(C_{\varepsilon }\) is a neighborhood of \((\varvec{\mu }_{0},\varvec{\Sigma }_{00})\) of radius \(\varepsilon \), and \(A^{C}\) denotes the complement of a set A. In addition, by part (iii) of the same lemma, for any \(\delta > 0\) we have

$$\begin{aligned} \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^{n}c_{p}\rho \left( \frac{d\left( \mathbf {X}_{i},\varvec{\varvec{\mu }}_{0},\varvec{\Sigma }_{00}\right) }{\sigma _{0}c_{p}\,(1+\delta )}\right) <b\,c_{p}. \end{aligned}$$
(20)

Let

$$\begin{aligned} Q_i(\varvec{\mu }, \mathbf \Sigma ) = c_p \rho \left( \frac{d\left( \mathbf{X}_i,\varvec{\mu },\varvec{\Sigma }\right) }{\sigma _0 c_p (1 + \delta )}\right) \end{aligned}$$

and

$$\begin{aligned} Q_i^{(\mathbf{U})}(\varvec{\mu }, \mathbf \Sigma ) = c_{p(\mathbf{U}_i)} \rho \left( \frac{d^*\left( \mathbf{X}_i^{(\mathbf{U}_i)},\varvec{\mu }^{(\mathbf{U}_i)},\varvec{\Sigma }^{(\mathbf{U}_i)}\right) }{S \,c_{p(\mathbf{U}_i)}\, \left| \widehat{\varvec{\Omega }}^{(\mathbf{U}_i)}\right| ^{1/p(\mathbf{U}_i)}}\right) . \end{aligned}$$

Now, if \(|\varvec{\Sigma }|=1\) and \(S=\sigma _{0}(1+\delta )/|\varvec{\widehat{\Omega }}|^{1/p}\), we have

$$\begin{aligned} \begin{aligned} \frac{1}{n}\sum _{i=1}^{n} Q_i^{(\mathbf{U})}(\varvec{\mu }, \mathbf \Sigma )&=\frac{1}{n}\sum _{p_{i}=p} Q_i(\varvec{\mu }, \mathbf \Sigma ) +\frac{1}{n}\sum _{p_{i}\ne p} Q_i^{(\mathbf{U})}(\varvec{\mu }, \mathbf \Sigma ) . \end{aligned} \end{aligned}$$
(21)

We also have

$$\begin{aligned} \frac{1}{n}\sum _{p_{i}\ne p}Q_i^{(\mathbf{U})}(\varvec{\mu }, \mathbf \Sigma ) \le c_{p}(1-t_{n}) \end{aligned}$$
(22)

and, therefore, by Assumption 3.4 we have

$$\begin{aligned} \lim _{n\rightarrow \infty }\sup _{\varvec{\mu },|\varvec{\Sigma }|=1}\frac{1}{n}\sum _{p_{i}\ne p} Q_i^{(\mathbf{U})}(\varvec{\mu }, \mathbf \Sigma ) =0 \text { a.s.} \end{aligned}$$
(23)

Similarly, we can prove that

$$\begin{aligned} \lim _{n\rightarrow \infty }\sup _{\varvec{\mu },|\varvec{\Sigma }|=1}\frac{1}{n}\sum _{p_{i}\ne p} Q_i(\varvec{\mu }, \mathbf \Sigma ) = 0\text { a.s.} \end{aligned}$$
(24)

and

$$\begin{aligned} c_{p}-\frac{1}{n}\sum _{i=1}^{n}c_{p(\mathbf {U}_{i})}\rightarrow 0,\text { a.s.} \end{aligned}$$
(25)

Then, from (19) and (21)–(25) we get

$$\begin{aligned} \underset{n\rightarrow \infty }{\underline{\lim }}\inf _{(\varvec{\mu },\varvec{\Sigma })\in C_{\varepsilon }^{C},|\varvec{\Sigma }|=1}\frac{1}{n}\sum _{i=1} ^{n} Q_i^{(\mathbf{U})}(\varvec{\mu }, \mathbf \Sigma ) > (b+\delta )\lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^{n}c_{p(\mathbf {U}_{i})} =(b+\delta )c_{p}\ \text {a.s.} \end{aligned}$$
(26)

Using similar arguments, from (20) we can prove

$$\begin{aligned} \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^{n} Q_i^{(\mathbf{U})}(\varvec{\mu }_0, \mathbf \Sigma _{00}) < b\lim _{n\rightarrow \infty }\frac{1}{n} \sum _{i=1}^{n}c_{p(\mathbf {U}_{i})} =b \, c_{p}\text { a.s.} \end{aligned}$$
(27)

Equations (26)–(27) imply that

$$\begin{aligned} \underset{n\rightarrow \infty }{\underline{\lim }}\inf _{(\varvec{\mu },\varvec{\Sigma })\in C_{\varepsilon }^{C},|\varvec{\Sigma }|=1}s_{GS}(\varvec{\varvec{\mu } },\varvec{\Sigma },\varvec{\widehat{\Omega }})>S\text { a.s.} \end{aligned}$$

and

$$\begin{aligned} \lim _{n\rightarrow \infty }s_{GS}(\varvec{\varvec{\mu }}_{0},\varvec{\Sigma }_{00},\varvec{\widehat{\Omega }})<S\text { a.s.} \end{aligned}$$

Therefore, with probability one there exists \(n_{0}\) such that for \(n>n_{0}\) we have \((\varvec{\widehat{\mu }}_{GS},\widetilde{\varvec{\Sigma }}_{GS})\in C_{\varepsilon }\). Then, \((\varvec{\widehat{\mu }}_{GS},\widetilde{\varvec{\Sigma }}_{GS})\rightarrow (\varvec{\mu }_{0},\varvec{\Sigma }_{00})\) a.s., proving (a).

Let

$$\begin{aligned} P_i( \varvec{\mu }, \varvec{\Sigma }, s) = c_p \rho \left( \frac{d\left( \mathbf {X}_{i},\varvec{\mu },\varvec{\Sigma }\right) }{c_p \,\,s}\right) \end{aligned}$$

and

$$\begin{aligned} P_i^{(\mathbf{U})}( \varvec{\mu }, \varvec{\Sigma }, s) = c_{p(\mathbf {U}_{i})} \rho \left( \frac{d\left( \mathbf {X}_{i}^{(\mathbf {U}_{i})},\varvec{\mu } ^{(\mathbf {U}_{i})},\varvec{\Sigma }^{(\mathbf {U}_{i})}\right) }{c_{p(\mathbf {U}_{i})}\,\,s}\right) . \end{aligned}$$

Since \(|\widetilde{\varvec{\Sigma }}_{GS}|=1\), \(s_{GS}(\varvec{\widehat{\mu }}_{GS},\widetilde{\varvec{\Sigma }}_{GS},\widetilde{\varvec{\Sigma }}_{GS})\) is the solution in s of the following equation:

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n} P_i^{(\mathbf{U})}( \widehat{\varvec{\mu }}_{GS}, \widetilde{\varvec{\Sigma }}_{GS}, s) =\frac{b}{n}\sum _{i=1} ^{n}c_{p(\mathbf {U}_{i})}. \end{aligned}$$
(28)

Then, to prove (18) it is enough to show that for all \(\varepsilon >0\)

$$\begin{aligned} \begin{aligned}&\lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^{n} P_i^{(\mathbf{U})}( \widehat{\varvec{\mu }}_{GS}, \widetilde{\varvec{\Sigma }}_{GS}, \sigma _0 + \varepsilon ) <b \, c_{p}\text { a.s.} \quad \text {and} \\&\lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^{n} P_i^{(\mathbf{U})}( \widehat{\varvec{\mu }}_{GS}, \widetilde{\varvec{\Sigma }}_{GS}, \sigma _0 - \varepsilon ) > b \, c_{p}\text { a.s.} \end{aligned} \end{aligned}$$
(29)

Using Assumption 3.4, to prove (29) it is enough to show

$$\begin{aligned} \begin{aligned}&\lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^{n} P_i( \widehat{\varvec{\mu }}_{GS}, \widetilde{\varvec{\Sigma }}_{GS}, \sigma _0 + \varepsilon ) <b \, c_{p}\text { a.s.} \quad \text {and} \\&\lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^{n} P_i( \widehat{\varvec{\mu }}_{GS}, \widetilde{\varvec{\Sigma }}_{GS}, \sigma _0 - \varepsilon ) > b \, c_{p}\text { a.s.} \end{aligned} \end{aligned}$$
(30)

It is immediate that

$$\begin{aligned} E\left( \rho \left( \frac{d\left( \mathbf {X},\varvec{\mu } _{0},\varvec{\Sigma }_{0}\right) }{c_{p}\,(\sigma _{0}+\varepsilon )}\right) \right) <E\left( \rho \left( \frac{d\left( \mathbf {X},\varvec{\mu }_{0},\varvec{\Sigma }_{0}\right) }{c_{p}\,\sigma _{0}}\right) \right) =b \end{aligned}$$

and

$$\begin{aligned} E\left( \rho \left( \frac{d\left( \mathbf {X},\varvec{\mu } _{0},\varvec{\Sigma }_{0}\right) }{c_{p}\,(\sigma _{0}-\varepsilon )}\right) \right) >E\left( \rho \left( \frac{d\left( \mathbf {X},\varvec{\mu }_{0},\varvec{\Sigma }_{0}\right) }{c_{p}\,\sigma _{0}}\right) \right) =b. \end{aligned}$$

Then, Eq. (30) follows from Lemma 7.1 and part (a). This proves (b).

Fig. 3 LRT distances under the barrow wheel contamination setting

1.3 Investigation of the performance under barrow wheel outliers

An anonymous referee suggested considering the performance of 2SGS under the barrow wheel contamination setting (Stahel and Maechler 2009; Hubert et al. 2014). We conduct a Monte Carlo study to compare the performance of 2SGS with three second-generation estimators under 5 and 10 % of outliers from the barrow wheel distribution. The data are generated using the R package robustX with default parameters. The three second-generation estimators are: the fast minimum covariance determinant (MCD), the fast S-estimator (FS), and the S-estimator (S), described in Sect. 4. The sample size is \(n = 10\times p\), for \(p=10\) and 20. The results in terms of the LRT measure are graphically displayed in Fig. 3. The performance of 2SGS is comparable to that of the other estimators designed to deal with casewise outliers like the barrow wheel type.
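
For readers who want a scaled-down, self-contained version of this comparison, the R sketch below generates barrow wheel data with robustX::rbwheel() (default parameters, as in the paper) and contrasts the classical covariance with the MCD and an S-estimator. It is only a sketch under stated assumptions: the LRT-type distance is taken in one common form, \(\mathrm{tr}(\widehat{\varvec{\Sigma }}\varvec{\Sigma }_{0}^{-1})-\log \det (\widehat{\varvec{\Sigma }}\varvec{\Sigma }_{0}^{-1})-p\); the identity matrix is used only as a stand-in for the target shape matrix of the original study; and the 2SGS computation itself is indicated only in a comment.

```r
## Scaled-down sketch of the barrow wheel comparison (illustrative only).
## Assumptions: the LRT-type distance has the common form
##   tr(S V^{-1}) - log det(S V^{-1}) - p,
## and the identity matrix is used as a stand-in for the true shape matrix.
library(robustX)     # rbwheel(): barrow wheel data generator
library(rrcov)       # CovSest(): S-estimators of scatter
library(robustbase)  # covMcd(): fast MCD

lrt_dist <- function(S, V) {
  M <- solve(V, S)   # V^{-1} S
  sum(diag(M)) - as.numeric(determinant(M, logarithm = TRUE)$modulus) - ncol(S)
}

set.seed(123)
p <- 10
n <- 10 * p
X <- rbwheel(n, p)   # default parameters, as in the paper

V0    <- diag(p)                                  # placeholder target shape
S_cls <- cov(X)                                   # classical covariance
S_mcd <- covMcd(X)$cov                            # fast MCD
S_s   <- getCov(CovSest(X, method = "bisquare"))  # S-estimator (bisquare)

## 2SGS itself could be computed with GSE::TSGS(X) if the GSE package is
## installed; its accessor functions may differ across versions.

round(c(classical = lrt_dist(S_cls, V0),
        MCD       = lrt_dist(S_mcd, V0),
        S         = lrt_dist(S_s,   V0)), 2)
```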

Table 4 Average CPU time in seconds (2.8 GHz Intel Xeon), measured with the R command system.time

1.4 Timing experiment

Table 4 shows the mean time needed to compute 2SGS for data with cellwise or casewise outliers, as described in Sect. 5. We also include the mean time needed to compute MVE-S for comparison; MVE-S is implemented in the R package rrcov, function CovSest, option method="bisquare". We consider 10 % contamination and several sample sizes and dimensions, and use the random correlation structures described in Sect. 4. For each pair of dimension and sample size, we average the computing times over 250 replications for each of the following setups: (a) cellwise contamination with k generated from U(0, 6), and (b) casewise contamination with k generated from U(0, 20). Computing times for 2SGS grow comparatively faster because GSE becomes more computationally intensive as the dimension and the fraction of cases affected by filtered cells increase.
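
A minimal timing sketch in the spirit of Table 4 follows. It times only MVE-S (via rrcov::CovSest, as named above) on clean multivariate normal data with a simple randomly rotated scatter matrix; the exact random correlation structures and contamination mechanisms of Sects. 4 and 5 are not reproduced, and timing of 2SGS is only indicated in a comment.

```r
## Minimal timing sketch in the spirit of Table 4 (illustrative only).
## The scatter matrix below is a simple randomly rotated covariance, not the
## exact random correlation structure of Sect. 4, and the data are clean.
library(rrcov)   # CovSest(): MVE-S as referenced in the text
library(MASS)    # mvrnorm()

set.seed(1)
p <- 10
n <- 5 * p

Q     <- qr.Q(qr(matrix(rnorm(p * p), p, p)))        # random rotation
Sigma <- Q %*% diag(seq(1, 0.1, length.out = p)) %*% t(Q)
X     <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)

## Average elapsed time over a few replications, as measured by system.time()
reps <- 5
t_mves <- system.time(
  for (r in seq_len(reps)) CovSest(X, method = "bisquare")
)["elapsed"] / reps
cat("MVE-S, average elapsed seconds:", round(t_mves, 3), "\n")

## 2SGS could be timed analogously, e.g. system.time(GSE::TSGS(X)),
## if the GSE package is installed.
```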

Cite this article

Agostinelli, C., Leung, A., Yohai, V.J. et al. Robust estimation of multivariate location and scatter in the presence of cellwise and casewise contamination. TEST 24, 441–461 (2015). https://doi.org/10.1007/s11749-015-0450-6
