Agostinelli, Leung, Yohai, and Zamar (Agostinelli et al. in the remainder) consider the difficult problem of robust estimation based on high-dimensional data. If outlying values can appear independently in the variables, then the majority of the observations in high-dimensional data can easily be contaminated, as pointed out in Alqallaf et al. (2009). Consequently, standard robust methods fail in this case, and new methods need to be developed that can handle this type of contamination. Moreover, in addition to independent contamination, casewise or structural outliers can still appear in the data. This situation was formalized as the partially spoiled independent contamination model in Alqallaf et al. (2009).
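
As a small worked illustration of this point, suppose that each cell is contaminated independently with the same probability \(\epsilon\) (the value \(\epsilon = 0.05\) below is only an illustrative assumption). The probability that a case with \(p\) variables contains no contaminated cell is then

\[
P(\text{case is clean}) = (1-\epsilon)^p, \qquad (1-0.05)^{20} \approx 0.358 ,
\]

so with \(p = 20\) about \(64\,\%\) of the cases are expected to contain at least one contaminated cell. For \(\epsilon = 0.05\), the expected fraction of contaminated cases exceeds one half as soon as \(p \ge 14\), since \(\log (0.5)/\log (0.95) \approx 13.5\).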

In their paper, Agostinelli et al. are the first to introduce a consistent estimator of multivariate location and scatter that is highly robust against both cellwise and casewise outliers. Their two-step generalized S-estimator (2SGS) is a strongly consistent estimator of the location and shape of general elliptical distributions. Similarly to other proposals, the estimator proceeds in two steps. In the first step, an outlier detection rule is used to identify potential cellwise outliers. A first novelty is the use of a data-adaptive cutoff instead of a fixed cutoff value when filtering cellwise outliers. A second novelty is to replace flagged outliers by missing values, as first proposed in Danilov (2010) and Farcomeni (2014), while earlier proposals tried to reduce their effect through some form of Winsorization, see e.g., Alqallaf et al. (2002), Van Aelst et al. (2011, 2012), and Van Aelst (2014). In the second step, the location and scatter are estimated from the dataset with missing values using the generalized S-estimator (GSE) of Danilov et al. (2012). GSE is a very effective estimator, but it is also computationally very demanding. This limits its use for really high-dimensional datasets, e.g., \(p\ge 100\).

The replacement of cellwise outliers by missing values opens the door to applying missing data methods to the incomplete data. For example, instead of directly estimating the parameters from the incomplete data by a complex estimation procedure, an initial imputation step can be applied. If the filtering and imputation are successful, then the imputed data will only contain casewise outliers. Thus, any standard robust estimation method can be used to estimate the parameters from the imputed data. Hence, computationally efficient procedures such as the fast MCD (Rousseeuw and Van Driessen 1999) or the fast S/MM (Salibian-Barrera and Yohai 2006; Salibian-Barrera et al. 2006) can be used. For very large high-dimensional data, the recently developed deterministic MCD (Hubert et al. 2012) or S/MM (Hubert et al. 2015) can be used.

When imputing the data, we need to take into account that the inserted missing values are not missing completely at random. However, the missingness is non-informative in the sense that the recorded value was an outlying value which did not provide any useful information. Moreover, if we make the common assumption that the cellwise contamination indicator \(\mathbf {B}_{\epsilon }\) in ICM (see expression (2) of Agostinelli et al.) is independent of both \(\mathbf {X}_0\) and \(\tilde{\mathbf {X}}\), then we can use the data distribution to impute the missing values. An overview of such imputation methods can be found in, e.g., Cevallos Valdiviezo and Van Aelst (2015). However, since the incomplete data may still contain structural outliers, a robust imputation strategy should be used (see e.g., Vanden Branden and Verboven 2009).

To illustrate these ideas, I consider the following procedure.

  1. Step I

    Eliminate cellwise outliers using the Gervini–Yohai filter and replace them by NA’s.

  2. Step II

    Impute the missing values. For this purpose, I use the following simple procedure. For each empty cell, determine the variable that is most correlated with the variable of the empty cell among the variables with a non-empty cell for that same case. The correlation is measured robustly using the Gnanadesikan and Kettenring (1972) procedure based on the efficient and robust \(Q_n\) estimator of scale (Rousseeuw and Croux 1993). Take this variable as the regressor in a robust simple regression that uses all the complete cases for the two variables. I used the MM-estimator of Yohai (1987) for this purpose. Assuming that the errors are normal, determine the predictive distribution for the empty cell and impute the cell by making a random draw from this distribution.

  3. Step III

    Robustly estimate the location and scatter from the imputed data. To see whether this procedure gives an improvement over the Huberized Stahel–Donoho (HSD) estimator which is considered in Agostinelli et al., I used the Stahel–Donoho (SD) estimator to estimate the parameters.
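
Step I can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the univariate adaptive-cutoff rule as I understand it from Agostinelli et al., with median and MAD standardization, a standard normal reference distribution and \(\alpha = 0.95\). The function names `gy_filter_column` and `gy_filter` are hypothetical, and the sketch assumes continuous data without ties or pre-existing missing values.

```python
import math
from statistics import NormalDist

import numpy as np


def gy_filter_column(x, alpha=0.95):
    """Return a copy of x in which flagged cells are replaced by NaN."""
    x = np.asarray(x, dtype=float)
    n = x.size
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))  # MAD, consistent at the normal
    z = np.abs(x - med) / mad                  # absolute standardized values
    # eta is the alpha-quantile of |Z| for Z ~ N(0, 1)
    eta = NormalDist().inv_cdf((1.0 + alpha) / 2.0)
    order = np.argsort(z)
    z_sorted = z[order]
    # Reference distribution of |Z|: P(|Z| <= t) = erf(t / sqrt(2))
    f0 = np.array([math.erf(t / math.sqrt(2.0)) for t in z_sorted])
    # Left limit of the empirical cdf of |z| at the order statistic with
    # 0-based index i is i / n (assuming no ties)
    gap = f0 - np.arange(n) / n
    tail = z_sorted >= eta
    # Estimated proportion of outliers: largest positive gap beyond eta
    d = max(0.0, float(gap[tail].max())) if tail.any() else 0.0
    # Small tolerance guards against floating-point error in floor(n * d)
    n_flag = int(math.floor(n * d + 1e-9))
    out = x.copy()
    if n_flag > 0:
        out[order[n - n_flag:]] = np.nan       # flag the largest |z| values
    return out


def gy_filter(X, alpha=0.95):
    """Apply the univariate filter to every column of the data matrix X."""
    X = np.asarray(X, dtype=float)
    cols = [gy_filter_column(X[:, j], alpha) for j in range(X.shape[1])]
    return np.column_stack(cols)
```

Because the number of flagged cells is driven by the gap between the reference and empirical distributions in the tail, a clean column is expected to lose only a few cells, while a column with many large values loses correspondingly more.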

The imputation technique in Step II is a simple attempt to approximate the conditional distribution of the variable with missing value based on the available data. Of course, more complex methods can be developed to better approximate this conditional distribution using all the variables with an observed value for the case whose cell needs to be imputed. To keep the computation time low, only one imputation is drawn from the predictive distribution. However, it is straightforward to generate multiple imputed datasets from which the parameters can be estimated more precisely in Step III.
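
The imputation in Step II can be sketched along the following lines. This is a simplified illustration under stated assumptions, not the code used for the simulation: the robust regression is a Theil–Sen-type fit with a MAD residual scale, swapped in for the MM-estimator of Yohai (1987) to keep the example dependency-free; the "most correlated" variable is taken to be the one with the largest absolute robust correlation; and the predictive distribution ignores the uncertainty in the fitted coefficients. All function names are hypothetical.

```python
import numpy as np


def qn_scale(x):
    """Qn scale estimator (naive O(n^2) version, normal consistency factor)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    h = n // 2 + 1
    k = h * (h - 1) // 2                      # rank of the pairwise distance
    i, j = np.triu_indices(n, k=1)
    diffs = np.abs(x[i] - x[j])
    return 2.2219 * np.partition(diffs, k - 1)[k - 1]


def gk_corr(x, y):
    """Gnanadesikan-Kettenring robust correlation based on Qn."""
    u = x / qn_scale(x)
    v = y / qn_scale(y)
    s_plus = qn_scale(u + v) ** 2
    s_minus = qn_scale(u - v) ** 2
    return (s_plus - s_minus) / (s_plus + s_minus)


def robust_line(x, y):
    """Stand-in robust simple regression (Theil-Sen slope, MAD scale)."""
    i, j = np.triu_indices(x.size, k=1)
    dx = x[j] - x[i]
    keep = dx != 0
    slope = np.median((y[j] - y[i])[keep] / dx[keep])
    partial = y - slope * x
    intercept = np.median(partial)
    sigma = 1.4826 * np.median(np.abs(partial - intercept))
    return intercept, slope, sigma


def impute_once(X, rng):
    """Fill each NaN cell by one draw from a normal predictive distribution."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    obs = ~np.isnan(X)
    p = X.shape[1]
    for i, j in zip(*np.where(~obs)):
        best, best_r = None, -1.0
        for k in range(p):
            if k == j or not obs[i, k]:
                continue                      # regressor must be observed in case i
            both = obs[:, j] & obs[:, k]      # complete cases for the pair
            r = abs(gk_corr(X[both, j], X[both, k]))
            if r > best_r:
                best, best_r = k, r
        if best is None:
            continue                          # no observed cell in this case
        both = obs[:, j] & obs[:, best]
        a, b, s = robust_line(X[both, best], X[both, j])
        out[i, j] = rng.normal(a + b * X[i, best], s)
    return out
```

Calling `impute_once` repeatedly with the same random generator yields different imputed datasets, which is one way to obtain the multiple imputations mentioned above. The sketch recomputes correlations for every empty cell; caching them per pair of variables would be the obvious refinement.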

To examine the performance of this imputation approach, Fig. 1 shows the results of a simulation with the same design as in Agostinelli et al. Results are shown for data with \(10\,\%\) of contamination. The plots on the left in Fig. 1 show the average LRT distances as a function of k for ICM outliers and can be compared to Fig. 1 in Agostinelli et al. The plots on the right in Fig. 1 show the average LRT distances as a function of k for THCM outliers and can be compared to Fig. 2 in Agostinelli et al. In addition to the imputed data SD estimator (ISD), the results for the standard SD estimator are shown as well.
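
For reference, the performance measure can be computed as in the sketch below. It assumes that the LRT distance between an estimated scatter matrix \(\hat{\Sigma }\) and the true scatter \(\Sigma _0\) takes the form \(\mathrm {trace}(\Sigma _0^{-1}\hat{\Sigma }) - \log |\Sigma _0^{-1}\hat{\Sigma }| - p\), which is how I read the definition in Agostinelli et al.; the function name `lrt_distance` is hypothetical.

```python
import numpy as np


def lrt_distance(sigma_hat, sigma0):
    """LRT distance trace(S0^{-1} S) - log det(S0^{-1} S) - p (assumed form)."""
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    sigma0 = np.asarray(sigma0, dtype=float)
    p = sigma0.shape[0]
    m = np.linalg.solve(sigma0, sigma_hat)   # Sigma0^{-1} Sigma_hat
    _, logdet = np.linalg.slogdet(m)         # log of the absolute determinant
    return float(np.trace(m) - logdet - p)
```

The distance is zero when the estimate equals the true scatter and grows as the eigenvalues of \(\Sigma _0^{-1}\hat{\Sigma }\) move away from one; averaging it over simulation replicates gives curves of the kind summarized in the figure.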

Fig. 1 Average LRT distances for \(10\,\%\) of contamination at different values of k. Left plots show results for ICM, right plots for THCM

As could be expected, the ISD estimator does not perform as well as the SD estimator for THCM contamination. For this model, ISD shows a pattern of behavior that is similar to HSD in Fig. 2 of Agostinelli et al., but with somewhat larger distances. However, in the case of ICM contamination, ISD behaves similarly to the 2SGS estimator and thus much better than HSD. In fact, it can be seen that for this type of data, with a correlation matrix that has a high condition number, there is not much difference between the HSD and SD estimators. Hence, the Winsorization in HSD is not effective in this setting. Overall, these limited results suggest that ISD can handle both cellwise and casewise outliers. Three-step estimators, in which an initial filtering is followed by a suitable robust imputation procedure, may be a viable alternative for robustly analyzing high-dimensional data. Due to the flexibility to choose an appropriate robust estimation procedure in the third step, it may be easier to extend this approach to handle large datasets in really high dimensions.