Discussion of “The power of monitoring: how to make the most of a contaminated multivariate sample”

We contribute to the discussion of an article in which Andrea Cerioli, Marco Riani, Anthony Atkinson and Aldo Corbellini review the advantages of analyzing multivariate data by monitoring how the estimated model parameters change as the estimation parameters vary. The focus is on robust methods and their sensitivity to the nominal efficiency and breakdown point. In congratulating the authors on their clear and stimulating exposition, we contribute to the discussion with an overview of our experience in applying monitoring in our application domain: datasets relevant for international trade analysis and anti-fraud, which raise new statistical challenges not yet fully addressed.

Monitoring trade data
CRAC introduced us to a particular monitoring instance, the Forward Search (FS; Atkinson and Riani 2000), more than ten years ago. We then studied together the application of monitoring to other established robust regression estimators (Riani et al. 2014). Currently, we use different forms of monitoring in the routine analysis of large numbers of regression datasets relevant to European Union policies, such as international trade and anti-fraud. We have many more reasons to support the approach enthusiastically than drawbacks to report.
Every month we compute robust estimates of "fair prices" for goods imported into the European Union from third countries. The estimates are used by customs and anti-fraud services to combat illegal practices. The financial impact on the EU budget is considerable, and the fair prices must somehow be "certified" in view of their use in court cases. We are therefore studying appropriate statistics or indicators to summarize the sensitivity of the robust fair-price estimate to the choice of the estimation method and the related parameters and tuning constants. To this end, monitoring is a precious instrument, although we face two main disadvantages: the substantial computation time (which increases with the sample size and the number of parameters monitored) and the lack of clear instruments to summarize the rich collection of monitored results automatically in a single statistic or indicator.
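To make the idea concrete, the following is a minimal sketch of such sensitivity monitoring in Python. It assumes a crude LTS fit based on random elemental starts and concentration steps, not the FSDA or SAS implementation; `lts_fit` and `monitor_lts` are hypothetical names introduced here only for illustration.

```python
import numpy as np

def lts_fit(X, y, h, n_starts=20, n_csteps=20, rng=None):
    """Crude LTS: random elemental starts refined by concentration
    (C-) steps; returns the coefficients with the smallest trimmed
    sum of squared residuals over h units."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, p = X.shape
    best_obj, best_beta = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, p, replace=False)          # elemental start
        beta = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        for _ in range(n_csteps):                      # concentration steps
            r2 = (y - X @ beta) ** 2
            sub = np.argsort(r2)[:h]                   # h smallest squared residuals
            beta = np.linalg.lstsq(X[sub], y[sub], rcond=None)[0]
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta

def monitor_lts(X, y, bdps):
    """Refit the crude LTS over a grid of breakdown points and collect
    the coefficient vectors: a bare-bones monitoring loop."""
    n = len(y)
    return {bdp: lts_fit(X, y, int(np.ceil(n * (1 - bdp)))) for bdp in bdps}
```

Plotting the entries of the returned coefficient vectors against the breakdown point gives the kind of trajectory that the monitoring plots summarize: a sudden jump flags the value at which the fit switches regime.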
We illustrate the need to monitor the stability of the fair-price estimates, even for small datasets, in Fig. 1, which concerns imports of "sports footwear with insoles of a length of less than 24 cm". We show the results obtained with SAS PROC ROBUSTREG and the MATLAB FSDA Toolbox (Riani et al. 2012, 2015) with four methods: FS, LTS, S and MM. We verified that the discordance between the two sets of results originates from the different default parameter values adopted by the two environments. The figure caption also reports the FS estimate, which is in line with the results obtained by the other estimators with the FSDA defaults. The typical forward plot associated with the FS (in the top-right panel) shows a sharp decrease of the monitored statistic (the minimum deletion residual), which is a sign of structure in the data that the other methods have not addressed properly because of an unfortunate choice of the key parameters that determine their robustness and efficiency.
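For readers unfamiliar with the statistic, the following sketch shows how a minimum deletion residual trajectory can be produced. It is a deliberately simplified Forward Search: a plain least-squares start replaces the robust start of the real algorithm, and `forward_search_mdr` is a name introduced here, not the FSDA routine.

```python
import numpy as np

def forward_search_mdr(X, y, m0=None):
    """Simplified Forward Search: grow the fitting subset one unit at a
    time and record the minimum deletion residual among the units not
    yet in the subset at each step."""
    n, p = X.shape
    m0 = p + 1 if m0 is None else m0
    # Crude initial subset: the m0 units with the smallest squared
    # residuals from an ordinary least-squares fit (the real FS uses
    # a robust start such as least median of squares).
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    subset = np.argsort((y - X @ beta) ** 2)[:m0]
    mdr = []
    for m in range(m0, n):
        Xs, ys = X[subset], y[subset]
        beta = np.linalg.lstsq(Xs, ys, rcond=None)[0]
        r = y - X @ beta
        s2 = (r[subset] ** 2).sum() / (m - p)          # residual variance on subset
        G = np.linalg.inv(Xs.T @ Xs)
        lev = np.einsum('ij,jk,ik->i', X, G, X)        # x_i' (Xs'Xs)^{-1} x_i
        out = np.setdiff1d(np.arange(n), subset)
        d = np.abs(r[out]) / np.sqrt(s2 * (1 + lev[out]))
        mdr.append(d.min())                            # minimum deletion residual
        subset = np.argsort(r ** 2)[:m + 1]            # next subset: m+1 best-fitting units
    return np.array(mdr)
```

A sharp peak in the returned trajectory, followed by a decrease once the outliers enter the subset and inflate the variance estimate, marks the steps at which a contaminating group joins the fit.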

The effect of concentrated non-contaminated observations
We introduce into the discussion another complication that occurs rather often in trade data, consisting of large proportions of non-contaminated observations falling in a small data region. To our knowledge, this problem has been addressed in robust statistics only recently, with Heikkonen et al. (2013) and Cerioli and Perrotta (2014) showing that the effect of a high-density region can be so strong as to override the benefits of robust devices such as trimming methods for robust clustering. We show that the monitoring plots are no exception and become completely uninformative in the presence of highly concentrated data. The proposal of Cerioli and Perrotta (2014) for these cases is to sample a much smaller subset of observations which preserves the cluster structure and also retains the main outliers of the original dataset. This goal is achieved by defining the retention probability of each point as an inverse function of the density function estimated on the whole dataset. Consider for example the datasets of Fig. 2, which for the sake of clarity will be called the "Books" and "Jewellery" datasets respectively. Both are characterized by a densely populated area in a "small trade" region of no practical interest in the anti-fraud context. In the case of the Books dataset, the units are so concentrated that only 0.02% of the data is retained, while the general data pattern is preserved. Note that the initial size of these datasets can be so large as to make the analyses computationally very demanding (the application of the FS to the 33,304 Books import flows ran out of memory after several hours on a 2.1 GHz Xeon processor with 16 GB of memory).
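A minimal version of this density-based thinning can be sketched as follows, assuming a Gaussian kernel density estimate as the density estimator and a retention probability proportional to the inverse of the estimated density; `thin_by_density` is a hypothetical name, not the authors' implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def thin_by_density(X, rng=None):
    """Retain each observation with a probability that decreases with
    the estimated density at that point, so that dense low-interest
    regions are heavily thinned while sparse points (the potential
    outliers) are almost surely kept."""
    rng = np.random.default_rng(0) if rng is None else rng
    dens = gaussian_kde(X.T)(X.T)     # KDE evaluated at every observation
    p = dens.min() / dens             # in (0, 1]; the most isolated point is kept for sure
    keep = rng.random(len(X)) < p
    return X[keep], keep, p
```

On data like the Books example, the retention probabilities in the dense "small trade" region can be orders of magnitude smaller than those of the isolated units, so the retained sample keeps the outliers and the overall pattern while shrinking drastically.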
To illustrate the issues that we encounter in monitoring this type of dataset, we first use a trade-like dataset simulated for a separate assessment exercise, still in progress. It is represented in Fig. 3 before and after thinning (left and right panel respectively). Along the lines of CRAC, we monitor the S and MM estimators fitted on the original data. To be more general than in the case illustrated in Section 2, from now on we use a model with an intercept. The forward plot trajectories appear as uninformative flat lines, shown in Fig. 4 for the MM estimator.
After thinning, when the same monitoring is applied to the retained units, the forward plots of the S estimator (right panel of Fig. 5) show that the outliers are very well identifiable when the breakdown point is chosen between 0.45 and 0.5, while masking appears as smaller values are chosen, with residual trajectories that become closer and closer. After the thinning step, the monitoring on the retained data clearly shows the presence of structure in the data. In the right panel of Fig. 7 a drastic decrease of the residuals occurs below a certain breakdown value, around 10%, suggesting that masking occurs when the outliers, which indeed amount to roughly 10% of the data (about 70 high-price outliers among a total of 723 units), start distorting the estimates. In the right panel of Fig. 8 the sudden decrease corresponding to a breakdown point of 0.45 indicates the presence of two major groups in the data.
Note, in both figures, the different scales of the monitored residuals in the original and thinned datasets. To understand the nature of this effect we monitored the intercept and slope values estimated in the two cases. Figure 9, which refers to the Books dataset, shows that the intercept is close to 0 when all data are fitted, while with the retained units it lies between 100 and 350, depending on the breakdown point, with an obvious inflation effect on the residuals. The corresponding slopes for a standard 0.5 breakdown point are around 3.5 and 2.7 respectively. We verified that the most reasonable slope (obtained with a robust fit using a model without intercept, to estimate the import price of the books) is 2.8, which is very close to the S fit on the retained units. Finally, note that the monitoring of the estimated regression parameters also shows that something occurs at a breakdown point approximately equal to 0.1. The monitoring of the minimum deletion residuals with the FS provides similar information about the two trade datasets. The left panel of Fig. 10 clearly shows that the Books dataset is formed by one main population and a set of outliers. The right panel of the same figure, which reports the same monitoring (from step 200) for many random starts on the Jewellery dataset, shows different sets of trajectories indicating the presence of multiple groups.

Closure
For CRAC, monitoring is more than a particular way of dealing with data: they often like to state that it is a true data-analysis philosophy, stemming from the belief that data can be completely understood only by appraising the effect on a fitted model of each statistical unit, or sub-group of units. In this discussion we have provided further evidence that monitoring is, at the very least, a powerful instrument for summarizing a lot of information in a single plot.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.