Sliced inverse median difference regression

In this paper we propose a sufficient dimension reduction algorithm based on the difference of inverse medians. The classic methodology based on inverse means in each slice was recently extended, by using inverse medians, to robustify existing methodology at the presence of outliers. Our effort is focused on using differences between inverse medians in pairs of slices. We demonstrate that our method outperforms existing methods at the presence of outliers. We also propose a second algorithm which is not affected by the ordering of slices when the response variable is categorical with no underlying ordering of its values.


Introduction
Sufficient Dimension Reduction (SDR) is a class of dimension reduction techniques used in regression to address the high dimensionality of a predictor vector X 2 R p when a response variable Y (assumed univariate without loss of generality) is regressed on X. In other words, in SDR we are trying to estimate a p Â d ðd\pÞ matrix b such that The space spanned by the columns of b is called a Dimension Reduction Subspace (DRS). There are many different b's that satisfy (1) and the main objective is to estimate the one with the minimum dimension d. The minimum DRS is known as the Central Dimension Reduction Space (CDRS) or simply the Central Space (CS) and is denoted with S YjX . There are some mild conditions of existence of the CS which we assume that they hold in this paper (see Yin et al. 2008). A number of methods have been proposed in the SDR literature. The most well known class of methods that has been developed and is being used most frequently is probably the class of methods based on inverse moments-see Li (1991), Cook and Weisberg (1991), Li and Wang (2007) among others. A comprehensive review of this methodology can be found in Li (2018).
There are two main drawbacks in the inverse-moment based methodology like Sliced Inverse Regression (SIR) which was introduced by Li (1991). First is the dependence on moments which suffer on the presence of outliers. To address this, Gather et al. (2001), Dong et al. (2015) and Christou (2018) suggested using inverse medians instead of means. Second is the dependence on the number of slices the range of Y is discretized into, especially in methods like Sliced Average Variance Estimation (SAVE-Cook and Weisberg 1991) where the second inverse moment is used as well. Zhu et al. (2010) suggested a cumulative slicing approach (Cumulative Mean Estimation-CUME) to avoid using the number of slices as a tuning parameters. Artemiou and Tian (2015) identified that the solution given by the cumulative approach will suffer in accuracy in cases where the response is categorical and in computational time in cases where we have massive datasets. They proposed Sliced Inverse Mean Difference (SIMD) approach which uses the difference between slice means and suggested two different algorithms for their method. One is the ''left vs right'' (LVR) which is more appropriate to be used when the response variable is continuous and the other is the ''one vs another'' (OVA) which is more appropriate when the response is categorical. The LVR approach was demonstrated to be theoretically equivalent to CUME, but computationally faster, and unlike CUME the OVA approach allowed SIMD to be used for categorical responses.
Although SIMD has certain advantages over similar methods, it suffers in accuracy at the presence of outliers. In this paper, we take a look in addressing this by proposing the use of Sliced Inverse Median Difference (SIMeD) which uses the difference between slice medians. As with SIMD, we propose the use of two different algorithms as they were proposed in Li et al. (2011). The ''left vs right'' (LVR) which gives computational speed and works better when the response is continuous and the ''one vs another'' (OVA) which gives better accuracy when the response is categorical. Although there are different definitions for the multivariate median, in this work we focus on the use of L1 median (spatial median) which was used in Dong et al. (2015). The main reason is that theoretical results require the uniqueness of the median and the L1 median is the only one which is shown to be unique for p [ 2 (Hettmansperger and McKean 1998).
The paper is structured as follows. We discuss previous methodology in Sect. 2 and then in Sect. 3 we discuss the newly proposed method (SIMeD). In Sect. 4 we demonstrate the performance with numerical studies and we close with a small discussion section.

Review of previous methodology
In this section we provide a review of the methodology previously proposed in the SDR framework which mostly relates to our proposal. We will discuss Sliced Inverse Regression (SIR), Cumulative Mean Estimation (CUME), Sliced Inverse Mean Difference Regression (SIMD) and Sliced Inverse Median (SIME). Let X be the p dimensional predictor vector, Y be the response variable and Z ¼ R À1=2 ðX À EðXÞÞ the standardized predictors, where R ¼ varðXÞ.

Sliced inverse regression (SIR)
SIR was introduced by Li (1991) and is considered the first method introduced in the SDR framework. As with all the methods we discuss in this section, SIR depends on the linearity assumption (or the linear conditional mean assumption) which is satisfied when the predictors have an elliptical distribution. The author proposed to standardize the predictors and then use the inverse mean EðZjYÞ. By performing an eigenvalue decomposition of the variance-covariance matrix varðEðZjYÞÞ one can find the directions which span the CS, S YjZ . One then can use the invariance principle (see Cook 1998) to find the direction that span S YjX .
One of the most important aspect of SIR is the use of inverse conditional means. The conditioning on Y is achieved by discretizing Y, that is, by slicing into a number of intervals (denoted with I 1 ; . . .; I H where H the number of intervals). Therefore, the inverse mean EðZjYÞ is in practice replaced with EðZjY 2 I i Þ for i ¼ 1; . . .; H and thus, it is the eigenvalue decomposition of varðEðZjY 2 I i ÞÞ which is used to find the directions which span the CS, S YjZ .

Cumulative mean estimation (CUME)
Although, it was shown that SIR is robust to the number of slices, Zhu et al. (2010) proposed the use of Cumulative Mean Estimation (CUME) which removes the need of tuning for the number of slices. In essence they proposed the use of the eigenvalue decomposition of the matrix varðEðZjY yÞÞ to find the directions which span the CS, S YjZ . In practice this means that one increases the value of y and recalculates the mean each time the value of y is large enough for a new observation to be included in the range fY yg.

Sliced inverse mean difference (SIMD)
Sliced Inverse Mean Difference (SIMD) was introduced by Artemiou and Tian (2015) to address issues of the aforementioned methodology. First, they identified that in case of massive datasets, the cumulative approach by CUME will be computationally very costly. Second, they identified that in the case of a response variable that has no underlying ordering, CUME wouldn't be able to work. Therefore, they proposed the use of the mean difference between subsets of the data (not necessarily slices) using two different algorithms which were influenced by the work of Li et al. (2011).
The first one, called ''left vs right'' (LVR) used the eigenvalue decomposition of varðEðZjY [ q i Þ À EðZjY q i ÞÞ where i ¼ 1; . . .H À 1 and q i is the cutoff point between slice i and slice i þ 1. This algorithm is shown to be theoretically equivalent to CUME but computationally more efficient due to the use of fewer comparisons. As with SIR it was shown to be robust to the number of slices. The second algorithm, called ''one vs another'' (OVA) uses the eigenvalue decomposition of varðEðZjY 2 I j Þ À EðZjY 2 I i ÞÞ where i ¼ 1; . . .H À 1 and j ¼ i; . . .; H (essentially one takes the difference between the means of two slices). This method was shown to be equivalent to SIR, and it was also more appropriate than CUME to be used in cases where Y was categorical. In cases where Y is categorical different orderings of the values of Y can lead to different results in CUME and the LVR approach and therefore comparing the difference between all possible pair of slices made more sense as the result in these case were not affected by the ordering of the values of Y.

Sliced inverse median (SIME)
Finally, the work by Dong et al. (2015) showed that to robustify SIR one can use the inverse L1 median instead of means. They preferred L1 median due to its uniqueness in cases where p ! 2. In their work they defined the inverse L1 median as: Definition 1 The inverse L1 median is given bym ¼ arg min l2R p EðkX À lkjYÞ, where k Á k is the Euclidean norm.
Similarly, we denote withm Z ¼ arg min l2R p EðkZ À lkjYÞ. Therefore they performed an eigenvalue decomposition of the matrix: varðm Z Þ to find the the vectors which span the CS S YjZ . The authors demonstrated the robustness in the presence of outliers of this algorithms as well as the good overall performance when there are no outliers. More recently, similar ideas were proposed by Christou (2018) but instead of using the L1 median, they suggest using the Oja and the Tukey median.

Sliced inverse median difference (SIMeD)
In this section the main idea of this paper is presented and an estimation algorithm and a method for determining the dimension of the CS is given.

Using inverse medians in sufficient dimension reduction
Here we present the foundations of our proposal. Let ðX; YÞ be a random pair where X 2 R p denotes the predictors and the response Y is assumed univariate without loss of generality. We also assume that the range of Y is discretized in H slices, which we denote with J 1 ; . . .; J H . Also we define the dividing points between two slices with q 1 ; . . .q HÀ1 , that is q k separates slice J k from slice J kþ1 . Finally, let Z ¼ R À1=2 ðX À lÞ the standardized predictors, where R ¼ varðXÞ and l ¼ EðXÞ.
Using the definition of the inverse L1 median one can define now the difference between medians. We use A 1 and A 2 to be two disjoint sets of X, the support of Y. Then we define,Ỹ to be the discretized version of Y, where IðÁÞ denotes the indicator function. Therefore, one can define the difference between two medians to be: Depending on the way one chooses to define the sets A 1 and A 2 , then the definition ofm d may take different forms. In this article, we will use two different definitions [as it was the case for the difference between means in the proposal of SIMD by Artemiou and Tian (2015)]. The first is called ''left vs right'' (LVR) and being more appropriate in cases where the response is continuous or at least a discrete variable with a logical numerical ordering of the values it can take. The second is called ''one vs another '' and is more appropriate when Y is categorical. With the LVR approach we select multiple cutoff points q and for each cutoff points we split the data to those above (A 1 ) and those below (A 2 ) the cutoff point. For the OVA approach we split the data points in a number of slices (like in previous slicing methods) and we select all possible pairs of slices, (one slice of the pair is A 1 and the other is A 2 ). The above two methods lead to the following definitions.
To simplify the notation we will avoid using the superscripts LVR and OVA from the mathematical notation of the median and the subscripts from the name in the rest of the paper in cases where it is clear which approach is being used.
From the definitions, one can see that each time the LVR algorithm is applied for each cutoff point one uses all the points, while each time the OVA algorithm is applied only the points on the two selected slices are used. The fact that the LVR algorithm uses all the points for different cutoff points, allows it to be more accurate than SIME which uses only the points within a slice. For example, if one compares SIME with H slices where the H À 1 cutoff points between the slices are used as the cutoff points of the LVR algorithm of SIMeD, then the SIMeD LVR algorithm performs more accurately and is more robust to the number of slices H. The LVR though will not perform in cases where the response variable is categorical, with no sense of ordering between the values, while SIME will still perform perfectly fine. The main problem with categorical responses in the LVR algorithm of SIMeD is the fact that it is constructed assuming the responses can be ordered and different orderings of the categories will give different answers. In SIME, as well as in earlier methods like SIR, this is not a problem since we use the points of each slide independently to find the inverse medians. Having this in mind, we propose the use of the OVA algorithm which finds the differences between the medians of all possible pairs of slices. Therefore in the case of a categorical response, the OVA algorithm will not be affected by different orderings of the categorical response. The following theorem gives the theoretical justification and we will demonstrate the advantages of each algorithm in the numerical experiments section later. Note that when using the medians we need the stronger assumption of ellipticity of the predictors instead of the more classic Linear Conditional Mean assumption in the SDR literature.
The proof of this theorem is pretty straight forward. Using the fact that Dong et al. (2015) showed that the L1 medianm 2 S YjX thenm LVR i andm OVA i;j are linear combinations of vectors in S YjX and therefore they belong in it.

Estimation algorithm
The estimation algorithm in the LVR case is as follows: 1. Standardize data to find Z ¼R À1=2 ðX ÀlÞ using robust estimators for location and scale. (we give details below) 2. Divide the range of Y into H slices using cutoff points q 1 ; . . .; q HÀ1 . 3. For each q i , i ¼ 1; . . .; H À 1 calculate the median difference using the sample version of formula (2), which we denote withm i . 4. Do an eigenvalue decomposition of the matrixV to find the largest d eigenvalues and the corresponding eigenvectors v 1 ; . .
Similarly, the estimation algorithm in the OVA case is as follows: 1. Standardize data to find Z ¼R À1=2 ðX ÀlÞ using robust estimators for location and scale. (we give details below) 2. Divide the range of Y into H slices and find all possible pairs, that is H 2 À Á pairs. 3. For each pair (i, j), i ¼ 1; . . .; H and j ¼ 1; . . .; i calculate the median difference using the sample version of formula (3), which we denote withm i;j .
4. Do an eigenvalue decomposition of the matrixV to find the largest d eigenvalues and the corresponding eigenvectors v 1 ; . . .; v d , wherê We suggest the use of a robust estimator from R ¼ varðXÞ for both algorithms to provide better results. Since we are using a robust estimator for the mean, it makes sense to use a robust estimator for R. If we don't, then outliers will have an effect on the estimator of R and therefore our results will be affected as well. Therefore, we use the minimum covariance determinant (MCD) estimator for R, which is defined as the covariance matrix which has the minimum determinant among all possible covariance matrices using a set of k points where n=2\k\n. It is implemented in function covMcd in package robustbase in R. (see Maechler et al. 2018)

Order determination tests
There are a number of ideas that were introduced in the literature for determining the dimension of the CS. Sequential tests were proposed for several methods in the literature [see for example Li (1991) for SIR, Shao et al. (2007) for SAVE, Artemiou and Tian (2015) for SIMD]. Bura and Yang (2011) proposed a unified approach. The second method was a BIC-type criterion which was proposed by Zhu et al. (2006) and also used for other methods like CUME in Zhu et al. (2010) and Li et al. (2011). Recently, a method called the ladle plot was proposed by Luo and Li (2016). In this paper, we use the BIC type criterion as the asymptotic distribution, which is needed for sequential tests, has very complicated variance structure when the estimators are based on medians. Also the ladle plot requires bootstrapping which is computationally more expensive than the other two approaches.
In the BIC-type criterion one tries to maximize: where k 2 f1; . . .; pg, k i the ith eigenvalue of candidate matrixV (in the estimation algorithm), k 1 ! Á Á Á ! k p , c 1 ðnÞ is a function of n which converge in probability to 0 as n ! 1 and c 2 ðkÞ is a nondecreasing sequence of numbers that depends on k.
The asymptotic properties and more specifically the asymptotic consistency of the method were studied in Zhu et al. (2006). Their result assumes that as long as we have a consistent estimator of the candidate matrix (matrixV in our estimation algorithm in Sect. 3.2) then the BIC criterion can estimate consistently the dimension d of the central subspace. Christou (2018) has shown the consistency of similar criterion when using the Tukey and Oja medians in each slice. We can use similar arguments in this case to prove the consistency of our criterion. The only step that will be different is the use of consistency of the estimator of the L1 median which was proved in Brown (1983).

Numerical experiments
In this section we discuss numerical experiments where we demonstrate the improvement in performance when Sliced Inverse Median Regression (SIMeD) is used compared to previous methodology. We start with simulated datasets and we then use 3 real datasets.

Simulated datasets
We run three different simulation experiments using the following models: where X ¼ ðX 1 ; . . .; X p Þ (p ¼ 10; 20 or 30) are either simulated form N p ð0; IÞ, or X $ Multivariate Cauchy, or even a mixture of two normals where a big proportion is drawn from N p ð0; IÞ and a small proportion is drawn from N p ð0; kIÞ where k 2 R. Furthermore, e $ Nð0; 1Þ and n ¼ 100, H ¼ 5; 10; 20. We run 100 simulations and we report the average trace correlation among all simulations. Trace correlation is calculated as follows: where P B is the projection matrix on the true subspace and PB the projection matrix on the estimated subspace. It takes values between 0 and 1 and the closer to 1 the better the estimation. We compare our newly proposed algorithm with SIR (Li 1991), SIME (Dong et al. 2015) and SIMD (Artemiou and Tian 2015). Here we have to be careful as SIR and SIME use individual slices and SIMD and SIMeD use comparison between slices. To make sure we have a fair setting for a comparison, the number of slices are chosen so that each slice contains approximately the same number of points. The cutoff points between the slices in SIR and SIME are used as the cutoff points of the LVR algorithm for SIMD and SIMeD, or in other words, those cutoff points are the ðkn=HÞ th percentiles, for k ¼ 1; . . .; H À 1. Also, note that, unless otherwise noted, for SIMD and SIMeD we run the LVR approach.
In the first experiment we compare the performance of the four algorithms on the five models with fixed number of slices H ¼ 10 and different predictor dimension p when the predictors are distributed from one of the three following scenarios: • a standard multivariate normal, • from a standard multivariate Cauchy where the components are pairwise independent (hence there are outliers) • from a mixture of multivariate normal distributions (where we can control the number of outliers).
As we can see in Table 1, our proposed algorithms, SIMeD, outperforms by a large margin SIMD in both scenarios which produce outliers while the two methods perform similarly when there are no outliers. Also, in all the settings SIMeD is more robust than SIME which is the algorithm that uses the median in each slice as was proposed in Dong et al. (2015). A similar advantage was demonstrated between SIMD and SIR in Artemiou and Tian (2015). Finally, Artemiou and Tian (2015) observed that the OVA approach of SIMD is equivalent to the SIR algorithm. Similarly here, we observe that the OVA approach of the SIMeD method is equivalent to SIME therefore we do not report results.
In the second experiment we compare the four methods on different number of slices in an effort to check the robustness of the algorithms against the number of slices. The results for the number of slices H ¼ 5; 10; 20 are presented in Table 2 where we can see that all algorithms are robust when there are no outliers. When we have outliers in the sample the algorithms using the inverse medians (SIME and SIMeD) are more robust to the number of slices. Especially our newly proposed algorithms for models I and II shows no real difference in performance for different number of slices whether there are outliers or not.
We further study the effect of the variance and the proportion of outliers in the case when we have mixture of multivariate normals and the results are shown in Tables 3 and 4 where we can see that our proposed methodology SIMeD is not affected at all by any of the two tuning parameters. SIME is as well not affected but SIMeD outperforms it on all settings we have explored.
Finally, we run the four experiments to verify the performance of our BIC criterion as was described in section 3.3. We set c 1 ðnÞ ¼ ð1=2Þn À3=5 and c 2 ðnÞ ¼ kðk þ 1Þ=2 and we run it for two different sample sizes n ¼ 200 and n ¼ 400. We choose to run the test on model I which has dimension d ¼ 1 and model II which  has dimension d ¼ 2. We can see from the results in Table 5 that our BIC criterion works well for SIMeD in all settings and it outperforms the same criterion for SIME especially when the predictors are drawn from a multivariate standard Cauchy distribution. The distribution column indicates whether X were drawn from multivariate standard Normal (a), from a multivariate standard Cauchy (b) or from a mixture of multivariate Normal distributions where 95% of the points are from standard normal and 5% from Nð0; 10IÞ (c). The dimension of the predictor p ¼ 10.
The last line gives information of the time in seconds needed to run all the simulations on the table for each method Table 3 Comparison of results for SIR, SIME, SIMD, SIMeD on the effect of the variance of the Normal distribution the outliers are drawn from for the 5 models when the data are drawn from a mixture of Normal distributions

Method
Model SV SIR SIME SIMD SIMeD The column ''sv'' indicates that 5% of the points are drawn from multivariate Normal with mean 0 and covariance matrix equal to ''sv'' times the identity matrix. All other points are drawn from a standard multivariate Normal distribution

Real datasets
We run experiments using three different real data sets. With the first two, we do demonstrate the effectiveness of the SIMeD method in the presence of outliers in a dataset with few predictors and in one with high number of predictors. With the last real data experiment we demonstrate the usefulness of the OVA algorithm when we have discrete responses with no logical ordering.

Concrete Strength dataset
We use the concrete strength dataset (see Yeh 1998) which has 8 predictors (Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, Age) and 1 response variable (Concrete Strength). The dataset has 1030 observations and we use 10 slices. We know from the description of the dataset that age is very significant along with some ingredients with most important being the water and the superplasticizer. We use the 4 methods (SIR, SIMD, SIME, SIMeD) to get an initial estimate of the first direction. Table 6 shows the first direction and we can see that SIR and SIMD which are based on means rank water second and age third (with cement close fourth) while age is ranked first and second The column k gives the proportion of points drawn from multivariate Normal with mean 0 and covariance matrix equal to 10 times the identity matrix. All other points are drawn from a standard multivariate Normal distribution in SIME and SIMeD, respectively, and water is ranked third (cement is fourth but further away than the other 3).
To demonstrate the ability of SIMeD to behave well under outliers, we run 100 experiments where we randomly select 30 observations and we multiply them by 10. Then we calculate the distance of the original coefficient vector we have in Table 6 to the new vector of coefficients (calculated with outliers) for each one of the methods. The boxplot in Fig. 1 of the 100 trace correlation distances defined in (4) for each method, show that SIMeD clearly outperforms the methods based on means and it also slightly outperforms the SIME method. The distribution column indicates whether we draw our X's from a multivariate standard Normal (a), from a multivariate standard Cauchy (b) or from a mixture of multivariate Normal distributions where 95% of the points are from standard normal and 5% from Nð0; 10IÞ (c)

Superconductivity Dataset
The superconductivity data set (see Hamidieh 2018) was obtained from UC Irvine machine learning repository (see Dua and Karra-Taniskidou 2017) and we use it to demonstrate the effectiveness in the methodology in handling outliers in high dimensional cases. The dataset tries to predict the critical temperature of a superconductor using 81 selected features. There are 21263 observations. We calculate the original predictor vector. We randomly select 100 points and we multiply them by 10 to create outliers and we repeat the experiment 100 times. We calculate the distance of the original coefficient vector (using the true data) to the new coefficient vector (where 100 points are outliers). Table 7 gives the mean distances of those 100 experiments for all 4 methods, as well as the median value. As we can see the two methods that are based on differences between slices, i.e. SIMD and SIMeD, outperform the other two. SIMeD has better mean distance and  In parenthesis we have the standard deviation for the means and the median absolute deviation of the medians SIMD has better median distance. In Fig. 2 we can see the range of values for the 4 methods. We first note that SIR and SIME can be influenced a lot (distances closed to 0) or not at all (distances close to 0.9). At the same time SIMD has a range between (0.1 and 0.9) and SIMeD (0.25 to 1). As expected the range of distances for all methods is much bigger in this dataset, due to the very high-dimensional nature of the dataset but our proposed method (SIMeD) performs better than SIR and SIME and is competitive with SIMD.

Iris dataset
We run the the algorithms on the Iris data where we have four predictors sepal length, sepal width, petal length and petal width. There are 150 observations, 50 from each of three species (setosa, versicolor, virginica). Table 8 shows the first direction when dimension reduction methods are applied. One can see that both  For SIMD and SIMeD the LVR approach is used SIME and SIMeD give a larger coefficient to petal width which is well known to be the one mostly separating the three species.
To emphasize the usefulness of the OVA algorithm when the response is categorical we show the first direction when the three species are first coded as 1 for setosa, 2 for versicolor and 3 for virginica and then when we keep setosa as 1 but we code virginica as 2 and versicolor as 3. We see in Table 9 that the OVA approach does not change estimate while the LVR changes a lot and gives a complete different direction.

Discussion
In this work we propose a new algorithm for robust dimension reduction using inverse median differences between slices and which we call SIMeD. We demonstrate the advantages of this algorithm over using a similar algorithm based on inverse means (SIMD) which was proposed by Artemiou and Tian (2015) at the presence of outliers and we also demonstrate that in any case it performs better than just using sliced inverse median regression (SIME) proposed by Dong et al. (2015). Moreover, we propose the OVA algorithm of SIMeD and demonstrate its advantage in the case of categorical responses. Although in this paper we choose the L1 median due to its uniqueness for p ! 2, we expect similar results to hold when other multivariate medians are used, for example Tukey and Oja median which were discussed in the SDR framework by Christou (2018).
Acknowledgements The authors would like to thank two anonymous reviewers and the Editor who have helped us improve the presentation of the manuscript with their comments. The authors would like to help Cardiff University for partially funding this work through the CUROP program.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creativecommons.org/licenses/by/4.0/.