Accounting for outliers in optimal subsampling methods

Nowadays, in many different fields, massive data are available, and for several reasons it may be convenient to analyze just a subset of the data. The application of the D-optimality criterion can be helpful for optimally selecting a subsample of observations. However, it is well known that D-optimal support points lie on the boundary of the design space, and if they go hand in hand with extreme response values, they can have a severe influence on the estimated linear model (leverage points with high influence). To overcome this problem, we first propose a non-informative "exchange" procedure that enables us to select a "nearly" D-optimal subset of observations without high leverage values. We then provide an informative version of this exchange procedure, where besides high leverage points, the outliers in the responses (which are not necessarily associated with high leverage points) are also avoided. This is possible because, unlike in other design situations, in subsampling from big datasets the response values may be available. Finally, both the non-informative and informative selection procedures are adapted to I-optimality, with the goal of obtaining accurate predictions.


Introduction
Recently, the theory of optimal design has been exploited to draw from huge datasets a subsample containing the most information for the inferential goal; see Drovandi et al. (2017), Wang et al. (2019), Deldossi and Tommasi (2022), among others. Unfortunately, Big Datasets are usually the result of passive observation, so some high leverage values in the covariates and/or outliers in the response variable (denoted by Y) may be present. In this study, we assume that a small percentage of the data are outliers, and the goal is to provide a precise estimate of the model parameters, or an accurate prediction, for the model that generates the majority of the data.
The most commonly applied criterion is D-optimality. It is well known that D-optimal designs tend to lie on the boundary of the design region; thus, in the presence of high leverage values, all of these values would be selected. Since this circumstance could have a severe influence on the estimated model (leverage points with high influence), we propose an "exchange" procedure to select a "nearly" D-optimal subset which does not include high leverage values. Avoiding high leverage points, however, does not guard against all the outliers in Y. Therefore, we also modify the previous method to exploit the information about the responses and avoid the selection of abnormal Y-values. The first proposal is a non-informative procedure, as it is not based on the response observations, while the latter is an informative exchange method.
Finally, both these exchange algorithms are adapted to the I-criterion, which aims at providing accurate predictions in a set of covariate-values (called prediction set).
Notation and motivation of the work are introduced in Sect. 2. Section 3 describes the novel modified exchange algorithm to obtain both non-informative and informative D-optimal subsamples without outliers. In Sect. 4 we adapt our proposal to I-optimality, to select a subsample with the goal of obtaining accurate predictions. In Sect. 5 we develop some simulations and a real data example, to assess the performance of the proposed subsampling methods. Finally, in Appendix we suggest a procedure for the initialization of these algorithms.

Notation and motivation of the work
Assume that N independent responses have been generated by the super-population model

Y_i = x_i' β + ε_i,   i = 1, …, N,

where ' denotes transposition, β = (β_0, β_1, …, β_k)' is a vector of unknown coefficients, x_i = (1, x̃_i')' where x̃_i = (x_i1, …, x_ik)', for i = 1, …, N, are N iid repetitions of a k-variate explanatory variable, and the ε_i are iid random errors with zero mean and equal variance σ². D = {(x_1, Y_1), …, (x_N, Y_N)} denotes the available dataset, which is assumed to be a tall dataset, i.e. with k << N.
The population under study is denoted by U = {1, …, N}, and s_n ⊆ U denotes a sample without replications of size n from U (i.e. a collection of n different indices from U).
Herein, we describe a new sampling method from a given dataset D, with the goal of selecting n observations (k < n << N ) to produce an efficient parameter estimate or an accurate prediction for the model generating the whole dataset apart from a few outliers, i.e. a small quantity of points that take "abnormal" values with respect to the rest of the data and that possibly have been generated by a different model.
Given a sample s_n = {i_1, …, i_n}, let X be the n × (k+1) matrix whose rows are x_i', for i ∈ s_n, and let Y = (Y_{i_1}, …, Y_{i_n})' be the n × 1 vector of the sampled responses. We consider the OLS estimator of the coefficients of the linear model based on the sample s_n,

β̂ = (X'X)^{-1} X'Y = ( Σ_{i∈U} δ_i x_i x_i' )^{-1} Σ_{i∈U} δ_i x_i Y_i,

where δ_i = 1 if i ∈ s_n and δ_i = 0 otherwise denotes the sample inclusion indicator.
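As a concrete illustration, the following minimal sketch (synthetic data; the sizes, seed and variable names are our own choices, not the paper's setup) draws a subsample s_n and computes the OLS estimate from the sampled rows only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Full dataset: N observations, k covariates (illustrative sizes only).
N, k, n = 1000, 2, 50
X_full = np.column_stack([np.ones(N), rng.normal(size=(N, k))])
beta_true = np.array([1.0, 2.0, -0.5])
Y_full = X_full @ beta_true + rng.normal(scale=0.1, size=N)

# A sample s_n is a set of n distinct row indices of the dataset.
s_n = rng.choice(N, size=n, replace=False)
X, Y = X_full[s_n], Y_full[s_n]

# OLS estimator based on the subsample: beta_hat = (X'X)^{-1} X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
```

With only n = 50 of the N = 1000 rows, the subsample-based estimate is already close to the generating coefficients when the noise is small.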
To improve the precision of β̂, we suggest selecting the sample s_n according to D-optimality, and we denote the D-optimum sample by

s*_n = arg max_{s_n ⊆ U} det(X'X).

When D contains outliers, s*_n may include them, because D-optimal support points usually lie on the boundary of the experimental region. Example 1 illustrates this issue.
Example 1 An artificial dataset D with N = 10000 observations has been generated from a simple linear model, with the last 10 observations drawn from a different distribution. The left-hand side of Fig. 1 displays these last 10 observations, which are isolated with respect to the majority of the data generated from the first distribution. The right-hand side of Fig. 1 emphasises the D-optimal subsample of size n = 100, s*_n. As expected, all the abnormal values in X are included in s*_n because they maximize the determinant of the information matrix [s*_n has been obtained by applying the function od_KL of the R package OptimalDesign (Harman and Filová, 2019)].
A similar behaviour would also be displayed by the I-optimal subsample, which should be applied to get accurate predictions (see Sect. 4). To avoid the inclusion of outliers when applying D- or I-optimal subsampling, we propose a modification of the well-known exchange algorithm.
Before describing our proposal, we recall that a tool to identify an outlier in the factor space is the leverage score, h_ii = x_i'(X'X)^{-1} x_i. Points which are isolated in the factor space (i.e., located far from the main body of points) can be thought of as outliers and are characterized by high leverage values [see Chatterjee and Hadi (1986)]. More precisely, an observation x_i, with i = 1, …, n, such that

h_ii > ν_1 (k+1)/n,

where ν_1 is a tuning parameter usually set equal to 2 [see for instance Hoaglin and Welsch (1978)], is called a high leverage point. In general, high leverage points reduce the variance of the parameter estimates, and many leverage-based sampling procedures have been proposed in the literature [see, among others, Ma et al. (2015)]. However, if these high leverage points are associated with outlying response values, their inclusion in the sample may lead to misleading inferential results. For this reason, our aim is to avoid such points.
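The leverage rule is straightforward to implement. The sketch below (synthetic data; the planted isolated point, the sizes and the helper name are our own) computes the h_ii and flags observations whose leverage exceeds the usual ν_1 (k+1)/n cut-off:

```python
import numpy as np

def leverage_scores(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X'."""
    G_inv = np.linalg.inv(X.T @ X)
    # h_ii = x_i' (X'X)^{-1} x_i for every row i, without forming H.
    return np.einsum("ij,jk,ik->i", X, G_inv, X)

rng = np.random.default_rng(1)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
X[0, 1:] = 15.0                  # plant one point isolated in the factor space

h = leverage_scores(X)
nu1 = 2.0
threshold = nu1 * (k + 1) / n    # rule of thumb attributed to Hoaglin and Welsch (1978)
flags = h > threshold            # True for high leverage points
```

The isolated observation receives a leverage close to 1 and is flagged, while the bulk of the data stays well below the threshold; note also that the leverages always sum to the number of columns of X.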

Modified exchange algorithms
The common structure of the t-th iteration of an exchange algorithm consists in adding a unit, chosen from a list of candidate points C (t) , to the current sample s (t) n , and then deleting an observation from it. The choice of the augmented and deleted points is based on the achievement of some optimality criterion. For instance, for D-optimality, Algorithm 1 describes the classical exchange procedure [see Chapter 12 in Atkinson et al. (2007)].
Our main idea is to modify Algorithm 1 so that high leverage points are not proposed for the exchange, thus avoiding the inclusion in the sample of high leverage points with potentially high influence.

Algorithm 1 Exchange Algorithm for D-optimality
Require: Design matrix X, sample size n, initial sample s(0)_n, t_max, Ñ
Ensure: D-optimal sample
1: Set t = 0
2: while t < t_max do
3: Select randomly Ñ units from U − s(t)_n to form the set of candidate points for the exchange, C(t)
4: Select from C(t) the observation j_a that maximizes the determinant of the augmented information matrix, to form the augmented sample s(t)_{n+1}
5: Delete from s(t)_{n+1} the observation whose removal decreases the determinant the least, to form s(t+1)_n
6: Set t = t + 1
7: end while

If the information about the responses is not exploited in the candidate-selection step (the construction of C(t)), then the modified D-optimal sample is non-informative for the parameters of interest. The non-informative procedure is described in detail in Subsect. 3.1.
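A bare-bones version of such an exchange procedure can be sketched as follows. This is a simplified illustration of the general add/delete idea, not the paper's Algorithm 1 verbatim: it proposes random candidates and accepts a swap only when it increases log det(X'X) (all sizes, names and the stopping rule are illustrative):

```python
import numpy as np

def d_exchange(X_full, n, t_max=100, n_cand=50, seed=0):
    """Greedy exchange for D-optimality: repeatedly try to swap one
    in-sample unit for a random out-of-sample candidate, keeping the
    swap only if it increases log det(X'X)."""
    rng = np.random.default_rng(seed)
    N = X_full.shape[0]
    s = list(rng.choice(N, size=n, replace=False))   # random initial sample
    logdet = lambda idx: np.linalg.slogdet(X_full[idx].T @ X_full[idx])[1]
    best = logdet(s)
    for _ in range(t_max):
        out = np.setdiff1d(np.arange(N), s)          # units not in the sample
        cand = rng.choice(out, size=min(n_cand, out.size), replace=False)
        improved = False
        for j in cand:                               # candidate to add
            for pos in range(n):                     # unit to delete
                trial = s.copy()
                trial[pos] = int(j)
                val = logdet(trial)
                if val > best:
                    s, best, improved = trial, val, True
                    break
            if improved:
                break
        if not improved:                             # no swap helps: stop
            break
    return np.array(s), best

# Demo on synthetic data (illustrative sizes).
rng = np.random.default_rng(3)
X_demo = np.column_stack([np.ones(300), rng.normal(size=(300, 2))])
s_opt, ld = d_exchange(X_demo, n=12, t_max=20, n_cand=20)
```

In practice one would use the rank-one update formulas discussed below instead of recomputing the determinant from scratch at every trial exchange.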
Preventing high leverage points, however, does not guard against all the outliers in Y: there may exist points that are in the core of the data with respect to the features, while being abnormal with respect to the response variable. In Subsect. 3.2 we propose another version of the algorithm, where we employ the responses in the candidate-selection step to remove the outliers in Y. Note that the obtained optimal subsample becomes informative because of its dependence on the Y values.

Non-informative D-optimal samples without high leverage points
Let s(t)_n be the current sample of size n and s(0)_n an initial sample which does not include high leverage points (see Algorithm 5 in the Appendix for a detailed procedure to get a convenient initial sample).
To update s(t)_n, we first remove from it the unit i_m with the smallest leverage score,

i_m = arg min_{i ∈ s(t)_n} h_ii,   h_ii = x_i'(X_t'X_t)^{-1} x_i,

thus obtaining a reduced sample of size n − 1, where X_t denotes the design matrix associated to s(t)_n. Let X_t^- be the design matrix attained by leaving out the row x_{i_m}' from X_t. Subsequently, we add the unit j_a ∈ C(t) with the largest leverage score, where the set of candidate points for the exchange at the current iteration is

C(t) = { j ∈ U − s(t)_n : h_{i_m i_m} < h_{i_m i_m}(x_j) ≤ ν_1 (k+1)/n },   (2)

the inverse of the reduced information matrix can be obtained through the rank-one update

(X_t^-' X_t^-)^{-1} = (X_t'X_t)^{-1} + (X_t'X_t)^{-1} x_{i_m} x_{i_m}' (X_t'X_t)^{-1} / (1 − h_{i_m i_m})   (3)

[see Searle (1982), p. 153], and h_{i_m i_m}(x_j) is the leverage score obtained by exchanging x_{i_m} with x_j in s(t)_n. The next theorem provides an analytical expression for h_{i_m i_m}(x_j), which reduces the computational burden of the algorithm.

Theorem 1 Let h̄_jj = x_j'(X_t^-' X_t^-)^{-1} x_j. Then

h_{i_m i_m}(x_j) = h̄_jj / (1 + h̄_jj),   (4)

with

( jX_t' jX_t )^{-1} = (X_t^-' X_t^-)^{-1} − (X_t^-' X_t^-)^{-1} x_j x_j' (X_t^-' X_t^-)^{-1} / (1 + h̄_jj),   (5)

where jX_t is the matrix obtained from X_t by exchanging x_{i_m} with x_j.

Proof Expression (5) can be obtained from Lemma 3.3.1 in Fedorov (1972) after some cumbersome algebra.
By virtue of the upper bound in (2), our proposal is to consider as candidates for the exchange only those observations in U − s(t)_n which are not high leverage points. In addition, to speed up the algorithm, we reduce the number of exchanges by imposing the lower bound in (2): without it, if h_{i_m i_m}(x_j) ≤ h_{i_m i_m}, the new observation j could be removed at the subsequent iteration.
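The rank-one updating that makes these exchanges cheap can be checked numerically. The sketch below (synthetic data of our own) removes one row from a design matrix and verifies that the Sherman-Morrison downdate of (X'X)^{-1} reproduces the directly recomputed inverse:

```python
import numpy as np

# Synthetic design matrix (illustrative sizes, not the paper's data).
rng = np.random.default_rng(2)
n, p = 60, 3
X = rng.normal(size=(n, p))
G_inv = np.linalg.inv(X.T @ X)          # (X'X)^{-1} for the current sample

# Leverage of the unit to be removed, then the Sherman-Morrison downdate:
# (X'X - x x')^{-1} = (X'X)^{-1} + (X'X)^{-1} x x' (X'X)^{-1} / (1 - h).
i_m = 0
x = X[i_m]
h = x @ G_inv @ x
G_inv_minus = G_inv + np.outer(G_inv @ x, x @ G_inv) / (1.0 - h)

# Direct recomputation on the reduced matrix, for comparison.
X_minus = np.delete(X, i_m, axis=0)
G_inv_direct = np.linalg.inv(X_minus.T @ X_minus)
```

The downdate costs O(p²) per exchange instead of the O(np² + p³) of a full refit, which is what makes the exchange algorithm practical on large n.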
Algorithm 2 outlines the steps to select a D-optimal subsample without high leverage points.

Informative D-optimal sample without outliers
Whenever the response values are available, this information should be exploited by the exchange algorithm, obtaining an informative D-optimal subsample.
According to Chatterjee and Hadi (1986), an influential data point in Y is an observation that strongly influences the fitted values. To identify these influential values, we adopt Cook's distance, but other measures can be applied similarly. Cook's distance for the i-th observation, C_i, quantifies how much all of the fitted values of the model change when the i-th data point is deleted:

C_i = (Ŷ − Ŷ_(i))'(Ŷ − Ŷ_(i)) / [(k+1) σ̂²],   (6)

where Ŷ = Xβ̂, σ̂² is the residual mean square estimate of σ², and Ŷ_(i) = Xβ̂_(i) is the vector of predicted values when the i-th unit is removed from the dataset D.

Algorithm 2 Non-informative D-optimal sample without high leverage points
Require: Design matrix X, sample size n, initial sample s(0)_n, ν_1, t_max, Ñ
Ensure: D-optimal sample without high leverage points
1: Set t = 0
2: while t < t_max do
3: Identify the unit i_m = arg min_{i ∈ s(t)_n} h_ii
4: From (3), compute the inverse of the information matrix without i_m: (X_t^-' X_t^-)^{-1}
5: Select randomly Ñ units from U − s(t)_n
6: From (4), compute h_{i_m i_m}(x_j) (j = 1, …, Ñ), to identify the set of candidate points C(t) according to (2)
7: Select from C(t) the observation j_a = arg max_{j ∈ C(t)} h_{i_m i_m}(x_j)
8: Update s(t)_n by replacing unit i_m with j_a, to form s(t+1)_n
9: Set t = t + 1
10: end while
According to a general practical rule, any observation with a Cook's distance larger than 4/n may be considered as an influential point.
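Cook's distance and the 4/n rule can be sketched as follows. The synthetic data and the planted response outlier are our own; the per-observation closed form C_i = e_i² h_ii / (p σ̂² (1 − h_ii)²) used below is the standard algebraic equivalent of the deleted-fit definition:

```python
import numpy as np

def cooks_distances(X, Y):
    """Cook's distance for every observation, via the closed form
    C_i = e_i^2 h_ii / (p * s2 * (1 - h_ii)^2)."""
    n, p = X.shape
    G_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, G_inv, X)   # leverage scores
    beta = G_inv @ X.T @ Y
    e = Y - X @ beta                            # residuals
    s2 = e @ e / (n - p)                        # residual mean square
    return e**2 * h / (p * s2 * (1.0 - h) ** 2)

rng = np.random.default_rng(4)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=n)
Y[0] += 5.0                                     # plant one outlier in the response

C = cooks_distances(X, Y)
flag = C > 4.0 / n                              # common practical cut-off
```

The planted response outlier sits in the core of the factor space, so it would not be caught by a leverage screen, yet its Cook's distance far exceeds 4/n.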
To get an informative D-optimal sample, Algorithm 2 is modified by including the additional steps illustrated in Algorithm 3.
Algorithm 3 Informative optimal subsample without outliers: additional steps to be included between steps 7 and 8 of Algorithm 2 (and Algorithm 4)
Require: Dataset D, sample size n
Ensure: Informative D-optimal sample without outliers
1: Compute Cook's distance C_{j_a} for unit j_a
2: if C_{j_a} < 4/n then
3: go ahead to step 8 of Algorithm 2 (and Algorithm 4)
4: else
5: reject the exchange and go back to step 5 of Algorithm 2 (and Algorithm 4)
6: end if

As expected, the Iboss algorithm provides a subset similar to the D-optimal sample (cf. Fig. 1), since it selects the points on the boundary of the design space, thus including most of the outliers. As a consequence, the true model and the fitted model are quite distinct. Nor does the simple random sample produce a good fitted model, as it includes an outlier. The non-informative selection procedure improves the fit of the true model, although the best performance is obtained with the informative selection approach, which does not include outliers.
Remark Let us note that an increase of t_max and Ñ would lead to an improvement of the D-optimal subsamples, because of a better chance of exchanging sample points. In particular, it is reasonable to consider Ñ = N − n whenever N is not too large.

Optimal subsampling to get accurate predictions
If we are interested in obtaining accurate predictions on a set of values X_0 = {x_01, …, x_0N_0} instead of a precise parameter estimate, then we should select the observations minimizing the overall prediction variance. Let Ŷ_0i = β̂'x_0i be the prediction of μ_0i = E(Y_0i | x_0i) at x_0i, i = 1, …, N_0. The prediction variance at x_0i, also known as the "mean squared prediction error", is

Var(Ŷ_0i) = σ² x_0i'(X'X)^{-1} x_0i.

If X_0 is the N_0 × (k+1) matrix whose i-th row is x_0i', then a measure of the overall mean squared prediction error is the sum of the prediction variances over X_0:

Σ_{i=1}^{N_0} x_0i'(X'X)^{-1} x_0i = tr[X_0 (X'X)^{-1} X_0'].   (7)

The following sample minimizes the overall prediction variance (7) and is called I-optimal:

s*_n = arg min_{s_n ⊆ U} tr[X_0 (X'X)^{-1} X_0'].

It is well known that, to produce accurate predictions, it is advisable to avoid outliers. An I-optimal subsample without high leverage points can be obtained by modifying the deletion and augmentation steps of the exchange algorithm described in Sect. 3.1 according to the I-criterion. The current sample s(t)_n should be updated by removing the unit i_m which minimizes the increase in the overall mean squared prediction error. From the results given in Appendix A of Meyer and Nachtsheim (1995), the increment in the overall mean squared prediction error due to the omission of unit i is given by

h̃_ii = x_i'(X_t'X_t)^{-1} X_0' X_0 (X_t'X_t)^{-1} x_i / (1 − h_ii),

where X_t is the n × (k+1) matrix whose rows are x_i', i ∈ s(t)_n, so that i_m = arg min_{i ∈ s(t)_n} h̃_ii. Subsequently, to obtain again a sample of size n, from a set C(t) of candidate points we should add the unit j_a which maximizes the decrease in the overall mean squared prediction error:

j_a = arg max_{j ∈ C(t)} h̃_{i_m i_m}(x_j),

where X_t^- is the design matrix obtained by removing the row x_{i_m}' from X_t and (X_t^-' X_t^-)^{-1} can be computed from (3). The set of candidate points should be composed of units that are not at risk of being deleted at the next iteration and are not high leverage points:

C(t) = { j ∈ U − s(t)_n : h̃_{i_m i_m} < h̃_{i_m i_m}(x_j), h_{i_m i_m}(x_j) ≤ ν_1 (k+1)/n },   (8)

where h_{i_m i_m}(x_j) is given in (4), and the decrease associated to candidate x_j can be computed as

h̃_{i_m i_m}(x_j) = x_j'(X_t^-' X_t^-)^{-1} X_0' X_0 (X_t^-' X_t^-)^{-1} x_j / (1 + h̄_jj),   (9)

where jX_t is the matrix obtained from X_t by exchanging x_{i_m} with x_j and (jX_t' jX_t)^{-1} can be computed from Eq. (5).
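The I-criterion and the effect of deleting a unit can be illustrated numerically. The sketch below (synthetic data, illustrative sizes) computes tr[X0 (X'X)^{-1} X0'] and checks that the increase caused by removing one row matches a closed-form ratio that follows from the Sherman-Morrison identity (our own derivation, in the spirit of the deletion formulas cited in the text):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, N0 = 80, 3, 40
X = rng.normal(size=(n, p))             # current sample's design matrix
X0 = rng.normal(size=(N0, p))           # prediction set

G_inv = np.linalg.inv(X.T @ X)
# Overall mean squared prediction error (up to sigma^2):
# sum_i x_0i'(X'X)^{-1} x_0i = tr[X0 (X'X)^{-1} X0'].
i_crit = np.trace(X0 @ G_inv @ X0.T)

# Increase in the criterion when unit i is deleted, computed directly ...
i = 0
X_minus = np.delete(X, i, axis=0)
increment_direct = np.trace(X0 @ np.linalg.inv(X_minus.T @ X_minus) @ X0.T) - i_crit

# ... and via the closed form || X0 (X'X)^{-1} x_i ||^2 / (1 - h_ii),
# which follows from the Sherman-Morrison rank-one downdate.
x = X[i]
h_i = x @ G_inv @ x
v = X0 @ G_inv @ x
increment_formula = (v @ v) / (1.0 - h_i)
```

The increment is always nonnegative: dropping an observation can only inflate the prediction variance, which is why the deletion step removes the unit whose increment is smallest.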
Algorithm 4 summarizes the steps to select a non-informative I-optimal sample, while to obtain its informative version, it is enough to incorporate the additional steps of Algorithm 3.

Algorithm 4 Non-informative I-optimal sample without high leverage points
Require: Design matrix X, sample size n, initial sample s(0)_n, prediction set X_0 = {x_01, …, x_0N_0}, ν_1, t_max, Ñ
Ensure: I-optimal sample without high leverage points
1: Set t = 0
2: while t < t_max do
3: Identify the unit i_m = arg min_{i ∈ s(t)_n} h̃_ii
4: From (3), compute the inverse of the information matrix without i_m: (X_t^-' X_t^-)^{-1}
5: Select randomly Ñ units from U − s(t)_n
6: From (4) and (9), compute h_{i_m i_m}(x_j) and h̃_{i_m i_m}(x_j) (j = 1, …, Ñ), to identify the set of candidate points C(t) according to (8)
7: Select from C(t) the observation j_a = arg max_{j ∈ C(t)} h̃_{i_m i_m}(x_j)
8: Update s(t)_n by replacing unit i_m with j_a, to form s(t+1)_n
9: Set t = t + 1
10: end while


Numerical studies

Simulation results
In this section, we evaluate the performance of our proposals through a simulation study. We generate H × S random datasets of size N = 10^6, each including N_out = 500 high leverage points/outliers (with H = 30 and S = 50). The computation of some metrics will illustrate the validity of our procedures in selecting D- or I-optimal subsamples without outliers.
For each generated N × (k+1) design matrix hX, the corresponding responses have been obtained from the linear model with coefficient vector β = (1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1)'. To assess these subsampling techniques, we have generated a test set of size N_T = 500 without high leverage points and outliers (i.e. with N_out = 0). Finally, to implement the I-optimality procedure, we have generated a prediction region X_0 without high leverage points; in addition, to compare the performance of the distinct subsamples in terms of prediction ability on X_0, we have also generated the corresponding responses (without outliers). A subsample selected from the dataset hD_s (generated at the (h, s)-th simulation step) is denoted by s(h,s)_n, with its corresponding sampling indicator variable, for h = 1, …, H and s = 1, …, S. At each simulation step (h, s): (a) to evaluate the performance of the subsampling techniques with respect to the D- and I-optimality criteria, we have computed the average mean squared prediction error in X_0 [from (7)] and the logarithm of the determinant of the information matrix; (b) to assess the predictive ability of the selection algorithms, we have considered the average squared prediction error in X_0 and in X_T = {x_T1, …, x_TN_T}, for the different sampling strategies: non-inf. I, non-inf. D, inf. I, inf. D and SRS, respectively. The results have been obtained by setting n = 500, Ñ = 1000, t_max = 500, ν_1 = 2 and ν_2 = 3.
From Table 1, the non-informative procedures seem to provide "nearly" D- and I-optimal subsamples that do not include high leverage points (they would be exactly D- and I-optimal if they allowed for these abnormal values). This result is consistent with the definitions of I- and D-optimality. Table 2 instead lists the Monte Carlo averages of these quantities for the different subsamples, which enable us to assess the predictive ability of the subsampling techniques. From Table 2, we can appreciate the prominent role of the informative procedures. In fact, when the database includes outliers in Y which are not associated with high leverage points (as in this simulation study), only the informative procedures are able to exclude them, providing accurate predictions.
From the last row of Table 2, the SRS seems to behave quite well: it is fast, easy to implement and provides good predictions compared to the informative I-optimal subsampling. However, such a good performance is due to the low percentage of outliers present in the artificial datasets. Figure 3 displays the superiority of the informative procedures with respect to the passive learning selection (SRS) as the percentage of outliers increases. Of course, we consider a short range for the percentage of outliers because outliers are (by definition) a few isolated data points.
Comparing the third and fourth rows of Table 2, the informative I-optimal subsamples seem to outperform the D-optimal ones only slightly, even though I-optimality should reflect the goal of getting accurate predictions. This happens because the prediction set X_0 has a shape similar to that of the dataset. When X_0 defines a specific subset of covariate-values, the superiority of I-optimality emerges; see for instance the values of SPE_X0 and SPE_XT in Table 3, where X_0 and X_T involve only positive values of the features.

Remark Actually, to take into account the randomness of the SRS technique, we have drawn N_SRS = 50 different independent SRSs from each dataset hD_s, for h = 1, …, H and s = 1, …, S; the Monte Carlo averages for SRS are based also on these additional observations.

Real data example
In this section we apply our proposal to the diamonds dataset in the ggplot2 package. This dataset contains the prices and specifications of more than 50000 diamonds. More specifically, 7 features are included:
- The carat, x_1, which is the weight of the diamond and ranges from 0.2 to 5.01;
- The quality of the diamond cut, x_2, coded as one if the quality is better than "Very Good" and zero otherwise;
- The level of diamond color, x_3, coded as one if the quality is better than "level F" and zero otherwise;
- A measurement of the diamond clearness, x_4, which takes value one if the quality is better than "SI1" and zero otherwise;
- The total depth percentage, x_5;
- The width at the widest point, x_6;
- The volume of the diamond, x_7.
To avoid multicollinearity problems, x_1 has not been considered in the analysis, because it is highly correlated with x_7 (the volume). Furthermore, to obtain a better fit of the data, the quadratic effect of x_7 has been included in the model, where the response variable Y is the logarithm (log_10) of the price. The dataset contains some outliers, such as observation No. 24068, which corresponds to a diamond with an unusually large width that makes its price too high.
Let us assume that the goal is the prediction of the price of diamonds with a volume larger than 200 mm³. Therefore, to apply the I-optimality strategy, we have randomly selected a prediction set X_0 from all the diamonds with x_7 larger than 200 mm³. Then, the remaining dataset has been divided into four folds of equal size, to compare the different subsampling techniques through a cross-validation approach. In rotation, one fold represents the test set, while the others form the training set, from which subsamples of size n are selected according to the different algorithms. In each test set only diamonds with volume larger than 200 mm³ are considered; in addition, the outliers (if present) are removed. In this example we have set n = 100, Ñ = 2000, t_max = 2000.
The first two columns of Table 4 show that the minimum value of MSPE_X0 is associated with the non-informative I-algorithm, while the maximum value of Log(Det) corresponds to the non-informative D-optimal subsample. This result is consistent with the definitions of I- and D-optimality and with the results of the simulation study in Sect. 5. With regard to the predictive ability of the subsampling techniques, we can observe that the informative I-procedure leads to the minimum values of the cross-validation averages SE_D0 and SE_DT (last two columns of Table 4). Unlike in the simulation study, in this real data example the non-informative I-criterion also seems to perform properly. This is due to the fact that in the diamonds dataset most of the outliers in Y are associated with high leverage points, and thus the non-informative procedure is also able to exclude them, providing accurate predictions.

Discussion
Recent advances in technology have brought the ability to collect, transfer and store large datasets. The availability of such a huge amount of data is a great challenge nowadays. However, Big Datasets very often contain noisy data, because they are the result of passive observation rather than of a well-planned survey. Moreover, huge datasets may not be queried for free: typically, the agencies that create and manage huge databases allow data to be downloaded at a price per MB. Furthermore, there are circumstances where the value of the response variable may be obtained only for a restricted number of units.
For these reasons, we suggest considering only a subsample of the dataset, excluding abnormal values, with the idea that a subset of a few relevant data points may be more "informative" than a huge quantity of raw, redundant, and noisy observations. The theory of optimal design is a guide to drawing a subsample containing the most informative observations, but optimal subsamples frequently lie on the boundary of the factor domain, thereby including all the outliers. Two modifications of the well-known exchange algorithm are herein proposed to select "nearly" optimal subsamples without abnormal values:
- A non-informative procedure, which avoids the inclusion of high leverage points, and can be applied whenever information about the responses is not available or is too expensive to obtain;
- An informative procedure, which excludes outliers in the response besides high leverage points, and can be used whenever the responses are available.
A simulation study confirms that D-optimal subsampling should be applied when the inferential goal is precise estimation of the parameters, while the informative I-optimal algorithm should be applied to get accurate predictions on a specified prediction set.
A limitation of these methods is that they are model-based, while in real-life problems the model is unknown. This relevant issue will be handled in future research by adapting the algorithms to optimality criteria for model selection, possibly combined with the D- and I-criteria.
Finally, another challenging future development could be the extension of the proposed algorithms to the generalised linear model, because the definition of outliers and high leverage points, in this context, is not straightforward.