Slice Weighted Average Regression

It has previously been shown that ordinary least squares can be used to estimate the coefficients of the single-index model under only mild conditions. However, the estimator is non-robust, leading to poor estimates for some models. In this paper we propose a new sliced least-squares estimator that utilizes ideas from Sliced Inverse Regression. Slices with problematic observations that contribute to high variability in the estimator can easily be down-weighted to robustify the procedure. The estimator is simple to implement and can result in vast improvements for some models when compared to the usual least-squares approach. While the estimator was initially conceived with the single-index model in mind, we also show that multiple directions can be obtained, therefore providing another notable advantage of using slicing with least squares. Several simulation studies and a real data example are included, as well as comparisons with other recent methods.


Introduction
Methodologies for dimension reduction have expanded immensely in recent decades. It has become a prevalent topic due to the rapid advancements in computer technologies and the need for researchers to find complex structures in high-dimensional data sets. The 'curse of dimensionality' (e.g. Bellman, 1961) is commonly mentioned in dimension reduction settings since it describes the problem whereby, as dimensionality gets higher, larger sample sizes are required to produce accurate estimates. Standard regression methods often assume simple regression models and can therefore fail at identifying more complicated relationships between a random univariate response variable, Y ∈ R, and a random high-dimensional predictor variable X. This is further complicated by our inability to sufficiently visualize high-dimensional data sets. The dimension reduction (DR) methods we consider here aim to reduce the dimensionality of the predictor vector by replacing it with one or more linear combinations of the predictor components. When little regression information is lost, this allows for the visualization of the data in a lower-dimensional framework. Li (1991) described such a scenario by the following model where, for X = [X_1, . . ., X_p]^T ∈ R^p, we have

Y = f(β_1^T X, . . ., β_K^T X, ε),   (1)

where f is the unknown link function, the β_i's (for i = 1, . . ., K) are the p-dimensional column vectors of coefficients and ε is the error term independent of X with E(ε) = 0. Dimension reduction is then achieved when we replace X with β_1^T X, . . ., β_K^T X where K < p. Then, a plot of Y versus the lower-dimensional projections, β_1^T X, . . ., β_K^T X, is referred to as a Sufficient Summary Plot (SSP, Cook, 1998) and reveals the structure of f. In this setting, the aim of DR methods is to find a basis for the set S_B = span(β_1, . . ., β_K), referred to as the effective dimension reduction (e.d.r) space, whose elements are referred to as e.d.r directions. Throughout we assume that S_B denotes the Central Dimension Reduction Space (CDRS, Cook, 1998), defined as the intersection of all dimension reduction subspaces.
In recent decades there has been much interest in this form of dimension reduction. Ordinary Least Squares (OLS) in the case of the single-index model (i.e. K = 1) (Brillinger, 1977, 1983; Li and Duan, 1989) led to further advances in methods such as Sliced Inverse Regression (SIR, Li, 1991), Sliced Average Variance Estimation (SAVE, Cook and Weisberg, 1991) and Principal Hessian Directions (pHd, Li, 1992), just to name a few. Each method has its strengths and limitations, and works well under specific conditions and assumptions and with particular model types. For this reason, combinations of dimension reduction methods and statistical tools have been used numerous times to improve the estimation of the CDRS; see, for example, Xia et al. (2002), Ye and Weiss (2003), Yin and Cook (2004), Zhu et al. (2007), Cook (2007), Cook and Forzani (2008), Cook and Forzani (2009), Li et al. (2011) and Soale and Dong (2022).
Motivated by the simplicity and good results that SIR achieves through slicing, our goal in this paper is to investigate the use of least squares within this slicing context.
We begin with a brief overview of OLS and SIR in Section 2. The theory and implementation of the new method is presented in Section 3. Influence functions are derived in Section 4 and used to understand the behaviour of the method for various contamination structures. Slice weights used to down-weight influential slices are formulated in Section 5. The effectiveness of the method is highlighted via simulations and a real-data example in Sections 6 and 7, respectively. We conclude with a discussion in Section 8. Proofs and other supporting material are given in the Appendix.

Dimension reduction methods
As our main motivating methods, in this section we briefly consider OLS and SIR for dimension reduction. Before we do, we provide the Linear Design Condition (LDC), defined in its general form for K ≥ 1 by Li (1991):

Condition 1 (LDC). For any b ∈ R^p, there exist scalar constants a_0, a_1, . . ., a_K such that

E(b^T X | β_1^T X, . . ., β_K^T X) = a_0 + a_1 β_1^T X + . . . + a_K β_K^T X.

Condition 1 is satisfied when X follows an elliptically symmetric distribution (Eaton, 1986). However, it has also been shown to often approximately hold when p is large (Hall and Li, 1993).

Ordinary Least Squares
Ordinary least squares (OLS) is commonly used for estimating the unknown parameters in multiple linear regression (MLR). Using the notation above for the model in (1), the MLR model is Y = β_0 + β_1^T X + ε, where β_0 and β_1 are the intercept and slope vector respectively. However, the capabilities of OLS extend beyond just MLR. Under the single-index model (K = 1), if X follows a multivariate normal distribution and an additive error is assumed, Brillinger (1977, 1983) showed that the OLS slope, which we denote b_ols, can be used to determine the direction of β_1 since b_ols = cβ_1 for some c ∈ R, provided c ≠ 0. Li and Duan (1989) extended the above for OLS to include milder distributional conditions for X (e.g. Condition 1 with K = 1) and a non-additive error for a single-index model of the form

Y = f(β^T X, ε).   (2)

Hence, OLS can be used much more extensively than for just the MLR model. However, OLS can fail for some model types where the population slope vector is equal to zero. This can occur when the model exhibits symmetric dependency (when the model is symmetric around the mean of β^T X), but also for less obvious scenarios (e.g. Garnham and Prendergast, 2013, Example 2.1).
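To make the direction-recovery result above concrete, the following is a minimal numerical sketch (not from the paper; the link function, seed and sample size are our own choices): for a single-index model with normal X, the sample OLS slope should align with β.

```python
import numpy as np

# Sketch only: under a single-index model Y = f(beta' X) + eps with
# normal X, the population OLS slope b_ols = [Var(X)]^{-1} Cov(X, Y)
# is proportional to beta, so its direction estimates the e.d.r. direction.
rng = np.random.default_rng(1)
n, p = 2000, 5
beta = np.array([1.0, 2.0, 0.0, 0.0, 0.0])
X = rng.standard_normal((n, p))
y = np.exp(X @ beta / 4.0) + 0.1 * rng.standard_normal(n)  # nonlinear link f

# Sample OLS slope via the normal equations on centred data
Xc = X - X.mean(axis=0)
b_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ (y - y.mean()))

# Agreement of the estimated and true directions (absolute cosine)
cos = abs(b_ols @ beta) / (np.linalg.norm(b_ols) * np.linalg.norm(beta))
```

Despite the nonlinear link, the cosine between b_ols and β is very close to one, which is exactly the Brillinger/Li-Duan phenomenon the text describes.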

Sliced Inverse Regression
Li (1991) introduced SIR to estimate the inverse regression curve, E(X|Y), which provides a notable advantage over the regression of Y|X when dealing with high-dimensional data. Let µ = E(X), Σ = Var(X) and the standardized regressor be defined as Z = Σ^{-1/2}(X − µ). Li (1991) showed that under Condition 1, E(Z|Y) ∈ Σ^{1/2} S_B for the model in (1). As a consequence, the eigenvectors of Var[E(Z|Y)] that correspond to the largest non-zero eigenvalues are elements of Σ^{1/2} S_B and a basis for S_B can be found by re-standardization of the eigenvectors with respect to Σ^{-1/2}. In practice, the estimation of the standardized inverse regression curve is performed by partitioning the data into several slices, S_1, . . ., S_H, based on the value of Y, which produces an approximating step function for E(Z|Y). The resulting SIR matrix is

V = Σ_{h=1}^{H} p_h Σ^{-1/2}(µ_h − µ)(µ_h − µ)^T Σ^{-1/2},

where p_h = P(Y ∈ S_h) and µ_h = E(X|Y ∈ S_h) is the hth slice mean. Since µ is a linear combination of the slice means, the maximum rank of V is H − 1. Under Condition 1, eigenvectors of V corresponding to nonzero eigenvalues are elements of Σ^{1/2} S_B. Therefore, re-standardizing with respect to Σ^{-1/2} provides a basis for S_B, or at least part of S_B if the rank is less than K.
Similar to OLS, SIR can perform well for various model types, but it also fails when the link function is symmetric around the mean of β^T X. In this case transformations may correct this issue (Prendergast and Garnham, 2016), or other dimension reduction methods, such as SAVE (Cook and Weisberg, 1991), may be used. Based on slice variances, in addition to the LDC in Condition 1, the constant conditional variance condition required by SAVE is:

Condition 2 (Constant Conditional Variance). Var(X | β_1^T X, . . ., β_K^T X) is non-random.

Conditions 1 and 2 are satisfied when, for example, X is normally distributed.
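The slicing estimator described above can be sketched in a few lines. This is a rough illustration, not a reference implementation: it uses equal-sized slices and the sample covariance, and all names are our own.

```python
import numpy as np

def sir_basis(X, y, H=5, K=1):
    """Rough sketch of SIR: standardize X, slice on the ordered response,
    form V from the slice means of Z, and re-standardize the top
    eigenvectors with respect to Sigma^{-1/2}."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(Sigma)
    Sig_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    Z = (X - mu) @ Sig_inv_sqrt                    # standardized regressor
    V = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), H):   # H roughly equal slices
        zbar = Z[idx].mean(axis=0)                 # slice mean of Z
        V += (len(idx) / n) * np.outer(zbar, zbar)
    _, evecs = np.linalg.eigh(V)
    B = Sig_inv_sqrt @ evecs[:, -K:]               # re-standardize
    return B / np.linalg.norm(B, axis=0)

# Example: a monotone single-index model; the leading SIR direction
# should align with beta.
rng = np.random.default_rng(0)
beta = np.array([1.0, 2.0, 0.0, 0.0, 0.0])
X = rng.standard_normal((2000, 5))
y = X @ beta + 0.5 * rng.standard_normal(2000)
b_sir = sir_basis(X, y, H=5, K=1)[:, 0]
cos_sir = abs(b_sir @ beta) / np.linalg.norm(beta)
```

Note that only slice means of Z enter V, which is why SIR is cheap and why it loses symmetric links: slice means on either side of the symmetry point cancel.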

Slice weighted average regression
The proposed method, which we call Slice Weighted Average Regression (SWAR), was conceptualized by noting the advantages of the slicing approach implemented by SIR and obtaining the slice slope vectors given by OLS. Since the population slope vector of OLS (i.e. the slope to be estimated) is given as b_ols = [Var(X)]^{-1} Cov(X, Y), SWAR is based on the slice covariances between Y and X, as opposed to the slice means of SIR and slice variances of SAVE.
The obtained slice slope vectors are then weighted and combined into a matrix whose eigenvectors corresponding to non-zero eigenvalues are elements of the CDRS. Within the hth slice, the slice slope vector is denoted by

b_h = [Var(X | Y ∈ S_h)]^{-1} Cov(X, Y | Y ∈ S_h).

As indicated in the following lemma, when both Conditions 1 and 2 hold, this slice slope vector contains information regarding the CDRS. The proof can be found in Appendix A.
Lemma 1. If Conditions 1 and 2 hold, then under the model in (1), [Var(X | Y ∈ S)]^{-1} Cov(X, Y | Y ∈ S) ∈ S_B for any subrange S of Y.
The SWAR matrix is then given by

R = Σ_{h=1}^{H} w_h b_h b_h^T,

where w_h, h = 1, . . ., H, are weights for the slices. For example, as for SIR we could use w_h = p_h = P(Y ∈ S_h); however, we also consider different weighting choices later.
Theorem 1.If Conditions 1 and 2 hold, then under the model in (1), eigenvectors corresponding to non-zero eigenvalues of R are elements of S B .
Proof. This follows directly from Lemma 1 since each of the b_h's is an element of the CDRS.
From Theorem 1, SWAR can return an orthonormal basis for S_B, since the returned eigenvectors do not require re-standardization, as long as rank(R) = K. Like OLS and SIR, for some model types SWAR may only find a partial basis, and the associated discussions for OLS and SIR also hold here.
Remark 1. Recall that SIR can find at most H − 1 e.d.r directions. With respect to SWAR, max{rank(R)} = H, so that a complete basis may be found when H ≥ K. In fact, the special case of H = 1 is simply the usual OLS slope vector, which can be used to determine the direction in the single-index model. However, as we will see later, combining multiple slopes can be beneficial.
An advantage of this method, compared to OLS, is the ability to find more than one informative e.d.r direction for models with K > 1. As shown in the following sections, SWAR can provide improved estimates compared to other methods in some contexts. Additionally, SWAR can perform well in the presence of contamination, for example, when model or distributional assumptions are violated by some observations in the data. Identifying contaminant points and returning a good e.d.r direction estimate at the same time is a great advantage of SWAR. The alternative weighting approaches given in later sections can improve the estimation further, providing more robust estimates. These claims are supported by simulation results and an example.
Consider a sample data set {y_i, x_i}_{i=1}^{n} and call (y_i, x_i) the ith pair. Let n_h denote the number of observations in the hth slice (h = 1, . . ., H) where n_1 + . . . + n_H = n. The SWAR estimation algorithm is defined as follows:

Step 1. Order the pairs according to the order of the y_i's so that the ith ordered pair has the ith smallest y_i.
Step 2. Partition the ordered data into H slices, with n h observations in the hth slice (h = 1, . . ., H).
Step 3. Obtain the OLS slope vector estimates for each of the slices and denote these as b 1 , . . ., b H .
Step 4. Form the SWAR matrix, R = Σ_{h=1}^{H} w_h b_h b_h^T, where w_h = n_h/n.
Step 5. Return the eigenvectors γ_1, . . ., γ_K corresponding to the K largest eigenvalues of R as the estimated basis for S_B.
A simple slicing strategy that is commonly used for SIR, and can also be used for SWAR, is to choose the number of slices H, and then to allocate an equal number (or approximately equal) of observations per slice (i.e., w h = 1/H).
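Steps 1 to 5 above translate directly into code. The following is a sketch under the stated equal-sized slicing strategy (function name and test model are our own choices, not the paper's):

```python
import numpy as np

def swar(X, y, H=5, K=1):
    """Direct sketch of SWAR Steps 1-5 with equal-sized slices
    and weights w_h = n_h / n."""
    n, p = X.shape
    R = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), H):   # Steps 1-2: order and slice
        if len(idx) <= p:
            raise ValueError("need n_h > p to fit OLS within each slice")
        Xh, yh = X[idx], y[idx]
        Xc = Xh - Xh.mean(axis=0)
        b_h = np.linalg.solve(Xc.T @ Xc, Xc.T @ (yh - yh.mean()))  # Step 3
        R += (len(idx) / n) * np.outer(b_h, b_h)                   # Step 4
    _, vecs = np.linalg.eigh(R)        # Step 5: top-K eigenvectors of R
    return vecs[:, -K:][:, ::-1]       # columns ordered by decreasing eigenvalue

# Quick check on a single-index linear model: the leading eigenvector
# of R should align with the true direction beta.
rng = np.random.default_rng(2)
beta = np.array([2.0, 1.0, 0.0, 0.0, 0.0])
X = rng.standard_normal((1000, 5))
y = X @ beta + 0.5 * rng.standard_normal(1000)
g1 = swar(X, y, H=5, K=1)[:, 0]
cos_swar = abs(g1 @ beta) / np.linalg.norm(beta)
```

Since the eigenvectors of the symmetric matrix R are already orthonormal, no re-standardization step is needed, consistent with the discussion after Theorem 1.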
Throughout this paper, OLS is used to obtain the slice coefficient vectors, however, other e.d.r direction estimators could easily be utilized as well.For example, Li and Duan (1989) showed that robust linear regression estimators such as M-estimators can also identify e.d.r directions.
Remark 2. It is important to note that if OLS is used as the slope estimator, then SWAR cannot be performed if the number of observations in a slice is equal to or less than the dimensionality of X, i.e. if n_h ≤ p. SIR is not limited by such a case since the slice mean can be determined for any slice with at least one observation. For SWAR, care needs to be taken that not too many slices are chosen. Some other methods that also combine multiple coefficient vectors into a dimension reduction matrix are Principal Quantile Regression (PQR, Wang et al., 2018) and Principal Asymmetric Least Squares (PALS, Soale and Dong, 2022), which instead of slices use varying quantile and expectile levels respectively. The main disadvantage of PQR compared to PALS is that it is a computationally intense procedure, with PALS being at least twice as fast as PQR as noted by Soale and Dong (2022).
Robustness studies of SIR and related methods have shown that outliers can be harmful to estimation (e.g. Gather et al., 2002). However, not all outliers are influential, as shown by example by Sheather and McKean (2001). As a tool for studying the robustness properties of estimators, influence functions for SIR have shown that it is the direction of the predictor vector relative to the e.d.r directions that largely determines whether an outlier is influential (Prendergast, 2005, 2007). Also, the response only contributes to slice placement for SIR, so an additional consideration for SWAR is the extent to which outlying response values, even in just one slice, can influence estimation overall. This leads us to the study of influence functions for SWAR for two reasons: firstly, to better understand the robustness properties of SWAR, and secondly, to introduce influence-derived weights for the robustness of SWAR.

Influence functions for SWAR
The Influence Function (IF, Hampel, 1974) measures the relative influence of a contaminant on an estimator of interest.
In other words, it measures how much the estimator changes with the addition or removal of a small amount of contamination. Consider the following contamination distribution that allows for contamination in both the response and predictor variables, defined as

G_ε = (1 − ε)G + ε∆_{w_0},

where 0 < ε < 1 is the proportion of contamination, G is the uncontaminated joint distribution of (Y, X) and ∆_{w_0} is the Dirac measure putting all of its mass at the contaminant point w_0 = (y_0, x_0). For a statistical estimator with functional T defined at G and G_ε, the IF in the direction of w_0 is defined as

IF(T, w_0; G) = lim_{ε→0} [T(G_ε) − T(G)]/ε.

A contaminant w_0 is highly influential when there is a big difference between the estimator at G and at G_ε, resulting in a large IF. More information about the influence function can be found, e.g., in Hampel (1986) and Clarke (2018).
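As a quick standalone illustration of the definition (our own toy example, not from the paper), the IF of the mean functional T(G) = E(X) is IF(x_0) = x_0 − µ, which can be checked numerically by mixing in a point mass with small weight ε:

```python
import numpy as np

# Toy illustration of the influence function definition, using the mean
# functional T(G) = E(X), whose influence function is IF(x0) = x0 - mu.
mu = 2.0                              # mean of the uncontaminated distribution G
x0 = 10.0                             # contaminant location
eps = 1e-6                            # small contamination proportion
T_G = mu                              # T at G
T_Geps = (1 - eps) * mu + eps * x0    # T at the eps-contaminated mixture G_eps
if_numeric = (T_Geps - T_G) / eps     # finite-eps version of the limit
# if_numeric is approximately x0 - mu = 8.0
```

The same finite-ε recipe applied to the SWAR functional is what underlies the closed-form results of Theorems 2 and 4.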
The following assumption was used in the derivation of the SIR influence functions (e.g. Prendergast, 2005). It is realistic in that it requires only that the slicing proportions are the same with and without contamination (e.g. H equally proportioned slices in both cases).

Assumption 1. The slicing proportions w_1, . . ., w_H are pre-determined independently of G.

Influence function and asymptotic variance for a single e.d.r. direction
Let I(y_0 ∈ S_h) denote the indicator function which is equal to 1 when y_0 belongs to the hth slice and 0 otherwise. Also recall that the OLS slope vector for the hth slice is b_h = Σ_h^{-1} Cov(X, Y | Y ∈ S_h), where Σ_h = Var(X | Y ∈ S_h) is the covariance matrix of the predictor vector in the hth slice. In the case of K > 1, closed-form solutions of the IF for e.d.r directions do not exist. Therefore, we consider the single-index case here, and in the next section a different IF approach for when K > 1.
Let γ 1 denote the functional for the first e.d.r direction estimator where γ 1 (G) = γ 1 .For the single-index model case where K = 1, the influence function is given below.
Theorem 2. Let h_0 ∈ {1, . . ., H} denote the slice within which the contamination is positioned (i.e. y_0 ∈ S_{h_0}) and use other notation defined previously. Under the model in (2) with K = 1 (single-index model) and Assumption 1, if Conditions 1 and 2 hold, then the influence function for the SWAR e.d.r direction, with functional γ_1, at G is given by, where r_{0,h_0} is the OLS residual for the contaminant within the h_0th slice and P = γ_1 γ_1^T is the projection matrix onto the CDRS.
The proof of Theorem 2 is given in Appendix B.1. Note that I − P is a projection matrix onto the complement of the CDRS. This highlights that it is the direction of the predictor vector that can largely determine influence and, as with SIR and other methods, explains why not all outliers are influential. We will look at some specific examples later.
For an estimator with functional T that is sufficiently regular, so that T(G_n) is asymptotically normal, the asymptotic variance of the estimator, ASV(T, G), at G is equal to (see, e.g. Hampel, 1986)

ASV(T, G) = E[IF(T, W; G) IF(T, W; G)^T],

where W is a random variable with distribution G.
For a normal X, the asymptotic variance of the SIR e.d.r direction estimator, denoted by b_SIR, is given by Prendergast (2005). Here, η_1 is the SIR re-standardized eigenvector that corresponds to the largest non-zero eigenvalue of the SIR matrix, denoted as ν_1.
The following theorem gives the ASV of the SWAR e.d.r direction, the proof of which is given in Appendix C.
Theorem 3. For a random X, using previously introduced notation and when Conditions 1 and 2 hold, the asymptotic variance of the SWAR e.d.r direction estimate under the model in (2) is given by ASV(γ_1, G), a p × p symmetric matrix whose diagonal elements are the ASVs of the p elements of the SWAR e.d.r direction estimate and whose off-diagonal elements are the asymptotic covariances between the elements.

Influence function for the subspace estimator
In dimension reduction we are mainly interested in the e.d.r direction estimators. However, Prendergast (2005) showed that an observation may be influential on a particular e.d.r direction but have no influence on the corresponding e.d.r space. Therefore, an influence function for the dimension reduction space estimator is more appropriate than one for individual e.d.r directions. Since SWAR returns an orthonormal basis, a candidate measure of influence is one introduced by Bénasséni (1990) in the context of principal components. This measure is based on the average length of the distance vector between each principal component and its projection onto the space spanned by the contaminated components. In the context of SWAR, let Γ denote the p × K matrix whose columns γ_1, . . ., γ_K are the e.d.r directions given at G. Similarly, Γ_ε is the corresponding matrix at G_ε. Then the measure of distance between the contaminated and the uncontaminated e.d.r spaces is

r(Γ, Γ_ε) = 1 − (1/K) Σ_{k=1}^{K} ‖γ_k − P(ε)γ_k‖,

where P(ε) = Γ_ε Γ_ε^T is the projection matrix onto the subspace spanned by the contaminated e.d.r directions γ_1(ε), . . ., γ_K(ε). Letting ρ denote the functional for Bénasséni's measure, the influence function of ρ for w_0 at G is then defined as in the previous section. Given that the γ_k's and the γ_k(ε)'s, for k = 1, . . ., K, are orthogonal and have unit length, it is clear that there is no influence on the e.d.r space estimator when r(Γ, Γ_ε) = 1, which happens when span(γ_1, . . ., γ_K) = span(γ_1(ε), . . ., γ_K(ε)). The influence is at its highest at r(Γ, Γ_ε) = 0, which occurs when the aforementioned spans are orthogonal to each other.
Theorem 4. Under Assumption 1 and given that Conditions 1 and 2 hold, the influence function for Bénasséni's measure applied to the SWAR e.d.r space at G is given by, The proof for Theorem 4 is given in Appendix B.2. From Theorems 2 and 4, it is evident that both influence functions depend on the contaminant through the residual r_{0,h_0} and the term (I_p − P)Σ^{-1}(x_0 − µ).

Identifying influence of certain observation types
Through the influence functions given in Theorems 2 and 4, we can identify how certain types of observations affect the estimation of the e.d.r direction and the e.d.r space provided by SWAR. For example, we observe the interesting cases below. Note that both IFs are functions of the residual, r_{0,h_0}, and of (I_p − P)Σ^{-1}(x_0 − µ), so the cases that follow hold for both the e.d.r direction and the e.d.r space estimators.
Case 1: Contaminant equal to the mean, x_0 = µ. When x_0 = µ, there is zero influence on the aforementioned estimators. Interestingly, in this case there is no effect from y_0 on the influence, even if y_0 has an extreme value or violates the model.
In contrast, an observation (y_0, x_0) can have unbounded influence on the estimators if y_0 and/or x_0 is arbitrarily large.
Case 4: Zero residual for w_0. There is also zero influence from the contaminant w_0 on either of the estimators if r_{0,h_0} = 0.

Influence function plots
Below we provide some example influence plots to visually demonstrate the effect of a contaminant on the e.d.r direction and the e.d.r space estimators. Consider the linear model in (13) with H = 5 equally probable slices. Let IF(γ_{1,1}, w_0; G) denote the influence value of the contaminant w_0 on the first element of the e.d.r direction γ_1 given by SWAR. In Figure 1, plot (a) shows IF(γ_{1,1}, w_0; G) and plot (b) shows IF(ρ, w_0; G), where for x_0 = [x_1, x_2], we set x_2 = 0 and allow y_0 and x_1 to vary. In both plots there is zero influence on the estimators when x_1 is also zero, which is expected as explained in Case 1. Within each slice, as x_1 and y_0 increase, the influence on γ_{1,1} in plot (a) increases without bound along the diagonal, that is, when y_0 moves towards the boundaries that determine the slice sub-ranges. In plot (b) of Figure 1, the influence on the e.d.r space follows the same trends as in plot (a), but we also observe approximately zero influence values when y_0 follows the model approximately, i.e. when y_0 ≈ x_1 since x_2 = 0.

Sample and Empirical influence functions
In the sample setting, the influence of the ith observation is found by considering the change in the estimator pre- and post-removal of the observation. For a sample of size n denoted by {y_i, x_i}_{i=1}^{n}, denote the empirical distribution by G_n and the empirical distribution without the ith observation by G_{n,(i)}. Then, the sample influence function (SIF) of an estimator with functional T is given by SIF(T, w_i; G_n) = (n − 1)[T(G_n) − T(G_{n,(i)})]. Therefore, the SIF for the SWAR e.d.r direction estimator, with functional γ_k, at G_n is given by SIF(γ_k, w_i; G_n) = (n − 1)[γ_k(G_n) − γ_k(G_{n,(i)})], for k = 1, . . ., K. Hence, large values in the components of the SIF(γ_k, w_i; G_n) vector indicate that observation i is highly influential.
Similarly, for Bénasséni's measure, let Γ and Γ_(i) denote the matrices whose columns are the estimated e.d.r directions at G_n and G_{n,(i)}. The sample influence function for ρ for the ith observation at G_n then follows similarly. It is clear that when |SIF(ρ, w_i; G_n)| is large, the ith observation is highly influential on the basis that is estimating the CDRS.
The disadvantage of the SIF is that it requires (n + 1) estimates in order to obtain the sample influence values for the entire sample, which can be computationally expensive especially when n and/or p are large. To overcome this, the empirical influence function (EIF) is an approximation of the SIF and can be found by replacing the population parameters of the IF with their sample estimates. Hence, the EIF of the ith observation for ρ for SWAR is computed with w_i belonging to the hth slice. For large n, EIF ≈ SIF, and it can be used in practice to efficiently detect influential observations. Below we provide a comparison between the SIF and EIF of Bénasséni's measure for SWAR. Consider the model as defined in (13) with n = 200 and p = 5. The SIF and EIF values for each of the observations are depicted in Figure 4. The figure shows that the EIF is a good approximation to the SIF and so can be used to identify the most influential observations without repeated SWAR estimation.
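The leave-one-out recipe behind the SIF can be sketched as follows. This is schematic only: it refits a compact SWAR for K = 1 and uses 1 − |γ̂^T γ̂_(i)| as a simple span-distance proxy, rather than the paper's exact Bénasséni-based SIF; all names are our own.

```python
import numpy as np

def swar_dir(X, y, H=5):
    """First SWAR e.d.r. direction (see Section 3), equal-sized slices."""
    n, p = X.shape
    R = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), H):
        Xh, yh = X[idx], y[idx]
        Xc = Xh - Xh.mean(axis=0)
        b = np.linalg.solve(Xc.T @ Xc, Xc.T @ (yh - yh.mean()))
        R += (len(idx) / n) * np.outer(b, b)
    return np.linalg.eigh(R)[1][:, -1]

def sif_leave_one_out(X, y, H=5):
    """Schematic SIF: refit without each observation and record a span
    distance between the full-data and leave-one-out directions."""
    g = swar_dir(X, y, H)
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        gi = swar_dir(X[keep], y[keep], H)
        # guard against tiny negative values from floating-point error
        out[i] = (n - 1) * max(0.0, 1.0 - abs(g @ gi))
    return out

# Example run: n + 1 SWAR fits in total, as the text notes.
rng = np.random.default_rng(3)
beta = np.array([1.0, 0.0, 0.0])
X = rng.standard_normal((120, 3))
y = X @ beta + 0.1 * rng.standard_normal(120)
sif_vals = sif_leave_one_out(X, y, H=4)
```

The n + 1 refits are exactly the cost the paragraph above describes; the EIF avoids them by plugging sample estimates into the closed-form IF instead.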

Some applications of the influence function
The usefulness of the IF extends beyond being a tool to explore the robustness properties of an estimator.
In this section, we propose using the mean influence to choose optimal weights for the slices in SWAR, as well as choosing K and H.
Our first goal is to down-weight the slices with high mean influence, allowing the slices with more stable estimates to contribute more to SWAR. We consider two different weighting techniques: namely, the within slice mean influence and the total mean influence. Detailed explanations for each of these are given below.

Within slice mean influence weights
Let X_n denote the n × p matrix whose ith row is x_i. Also, let b_{h,(i)} denote the hth slice slope vector without the ith observation and X_{n_h} be the matrix whose rows are the x_j's that fall in the hth slice. For this version of weighting, we consider influence on the dimension-reduced predictors within the hth slice, where the dimension reduction has been carried out using the estimated slope vectors within that slice. Similar approaches to determining influential observations have been considered previously (e.g. Prendergast, 2008; Prendergast and Smith, 2010). For each of the observations within the hth slice, compute the influence measure δ_{h,i}. Now, let δ̄_h = n_h^{-1} Σ_{i=1}^{n_h} δ_{h,i} be the mean of the δ_{h,i}'s. Then, the new weights are defined as follows, where these are scaled to sum to one.
From here onwards, this version of SWAR which uses the within slice mean influence weights, given in (18), will be referred to as SWAR W .

Total mean influence weights
This is a re-weighting process: first calculate the SWAR estimate (see Section 3) and then calculate the SIF in (15) for each of the n observations. Now, let {ρ_{h,i}}_{i=1}^{n_h} denote the SIFs for the observations in the hth slice and ρ̄_h the absolute sample mean of these values. Then, the total mean influence weights are given by, We then recompute the SWAR estimate but where these weights are used instead. The new re-weighted SWAR method with the total mean influence weights will be referred to as SWAR_T.
It is important to note that for SWAR_W and SWAR_T, the term 1/‖b_h‖² is used in the weights so that the hth slice slope vectors are normalised. This is because only the direction of the vectors is important here, and a vector with large length can dominate the dimension reduction matrix and affect the e.d.r direction estimates. Therefore, transforming the slope vectors to have unit length solves this problem.
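One plausible form of this recipe can be sketched as follows. This is schematic only: the exact weight formulas are those given in the paper, and the combination of inverse mean influence with the 1/‖b_h‖² normalisation below is our own reading of the description above.

```python
import numpy as np

def influence_weights(mean_influence, slopes):
    """Schematic influence-based slice weights: weight slice h inversely
    to its mean influence, with 1/||b_h||^2 normalising each slope to
    unit length in the SWAR matrix; weights are scaled to sum to one."""
    mi = np.asarray(mean_influence, dtype=float)
    norms2 = np.array([b @ b for b in slopes])   # ||b_h||^2 per slice
    w = 1.0 / (mi * norms2)
    return w / w.sum()

# Example: two unit-length slopes; the slice with twice the mean
# influence receives half the raw weight.
w = influence_weights([1.0, 2.0],
                      [np.array([1.0, 0.0]), np.array([0.0, 1.0])])
```

With unit-length slopes the weights reduce to normalised inverse mean influences, so the stable slice dominates the SWAR matrix as intended.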

Choosing K and H in SWAR
Influence functions can have multiple applications in the dimension reduction setting. Another application is that of choosing dimension reduction parameters (i.e. H and K) based on, for example, minimum mean influence. Such applications have been explored in Shaker (2013), and a similar approach has been presented by Ye and Weiss (2003) and Liquet and Saracco (2008, 2012), who evaluated the sensitivity of dimension reduction estimators at different parameter values using a bootstrap approach.
In the mean influence approach, which we adopt here, we consider the optimal pair (H, K) to be the one that results in the minimum mean influence. For SWAR, the sample influence function of Bénasséni's measure can be used to select H in conjunction with K. An example of this, for a simulated model, is given in Section 6. An inspection of the Estimated SSPs (ESSPs) is also recommended when choosing H and K in practice.
The only disadvantage of the mean SIF compared to the bootstrap approaches is the computational intensity of the SIF. However, for large sample sizes where EIF ≈ SIF, the EIF can be used in practice to choose H and K more efficiently, since the computational intensity reduces significantly. The HIF (Prendergast, 2007) provides another alternative to the SIF which can provide a better approximation than the EIF and is also less time-consuming than the SIF. The effectiveness of the SIF in choosing K and H is not examined thoroughly in the present paper.

Simulations
In this section, we provide simulated examples to demonstrate the effectiveness of SWAR, SWAR_W and SWAR_T. The performance of these methods is compared with those of OLS, PALS and SIR. We have chosen principal asymmetric least squares (PALS, Soale and Dong, 2022) since it is an interesting new method that also combines least squares estimates. The PALS objective function is given by, where R* = Y − β_0 − β_1(β^T X − µ) and where ρ_τ is the asymmetric least squares loss function (Newey and Powell, 1987). For given values of τ ∈ [0, 1], a slope using the above objective function is computed (estimated on the standardised scale and pre-multiplied by Σ^{-1/2}), and combined similarly to Step 4 of SWAR but with weights all equal to one.
For every model, we perform 1000 repetitions for each combination of the sample sizes n = 50, 200, 500 and 1000 with p = 5, 10 and 20, and H = 2, 5 and 10 for the slicing methods. For PALS we follow the lead of Soale and Dong (2022) and set the tuning parameter λ = 1 and τ = 0, 0.1, 0.2, . . ., 0.9, 1. To measure and evaluate the performance of each method we use the squared canonical correlations between the true and estimated e.d.r spaces.
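The evaluation metric can be sketched as follows (function name and examples are our own): the squared canonical correlations between two subspaces are the squared singular values of Q_B^T Q_Bhat after orthonormalizing both bases, with values near one indicating that the estimated subspace captures the true one.

```python
import numpy as np

def sq_canon_corr(B, Bhat):
    """Squared canonical correlations between the column spaces of two
    basis matrices B and Bhat (each p x K, or length-p vectors)."""
    Qb, _ = np.linalg.qr(B.reshape(B.shape[0], -1))       # orthonormalize B
    Qh, _ = np.linalg.qr(Bhat.reshape(Bhat.shape[0], -1)) # orthonormalize Bhat
    s = np.linalg.svd(Qb.T @ Qh, compute_uv=False)
    return np.sort(s ** 2)[::-1]   # all ones iff the subspaces coincide

# Example: two different bases for the same 2-dimensional subspace of R^3
B = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
Bhat = B @ np.array([[1.0, 1.0], [1.0, -1.0]])  # same span, rotated basis
corrs = sq_canon_corr(B, Bhat)
```

This is basis-invariant, which matters here because SWAR, SIR and PALS may all return different (but equally valid) bases for the same estimated space.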

Single-Index Model
Consider the following model (Model 1) with X ∼ N_p(0, I_p) independent of ε and ε ∼ N(0, 1). For this model we first perform the aforementioned methods for each combination of n, p and H. Then, we introduce contamination and perform the same comparisons to see where the influence weighting strategies can improve estimation.
We then replace 2% of the ordered observations based on the value of Y with contamination, where the contamination response is generated from N (150, 30 2 ) and we subtract five from each of the associated predictor vector elements.
To highlight the effect of this type of contamination, we first provide some example ESSPs in Figure 5 for n = 200 and p = 10. The contamination is clearly evident in the true SSP (i.e. assuming known β). However, from the estimated SSPs using OLS and SWAR, we can no longer distinguish these points from other observations. In contrast, the ESSP using SWAR_T provides an excellent estimate. Hence we have chosen this type of contamination since the resulting ESSPs may provide no indication that some problematic observations need further inspection. The averages of the squared correlations, along with their corresponding standard deviations shown in parentheses, for the uncontaminated and contaminated Model 1 are given in Table 1. The missing values (denoted by −) for SWAR, SWAR_W and SWAR_T in the tables are for when p ≥ n_h, so that slopes cannot be estimated. Figure 6 shows the box-plots of the correlations from 1000 repetitions of each method for p = 10. For simplicity, the results for n = 1000 are omitted from the table and the box-plots since they display similar trends to n = 500.
We can see that all methods perform extremely well for the uncontaminated case, especially as the sample size increases. OLS and PALS tend to perform better than the SWAR methods when the sample size is small and dimensionality is large (e.g., n = 50 and p = 20). SIR is also better than SWAR for small sample sizes, but performance is similar for larger n. Whereas SIR's performance is better when the number of slices is large, all versions of SWAR prefer a smaller or intermediate number of slices.

For the contaminated Model 1, the performance of OLS and PALS deteriorates significantly. SWAR also performs poorly, but to a lesser degree, and SIR provides reasonable estimates but still shows a notable decrease in performance. However, the weight-adjusted methods SWAR_W and SWAR_T show outstanding performance in the presence of contamination, which was shown by example in Figure 5 for SWAR_T.
Table 2: Choices based on the SIF of the SWAR e.d.r space estimator for number of slices H and dimension reduced predictors K, for the uncontaminated Model 1.
Finally, Table 2 shows how many times (out of 1000 repetitions) the SIF given in (15) chose each pair of H and K for SWAR, for the uncontaminated Model 1. Between the choices of K = 1 and K = 2, the SIF correctly chooses K = 1 in all replications except for n = 50 and p = 20, in which case the method struggles to find a good estimate.

For the optimal number of slices, H = 2 is chosen most often. This is consistent with the previous finding that SWAR prefers a smaller number of slices. As n increases, a larger number of slices is often chosen.

Multiple-Index model
Similarly to the previous example, for X ∼ N p (0, I) independent of ε with ε ∼ N (0, 1), consider now Model 2 with true e.d.r. directions β 1 = [1, 2, −3, 0, . . ., 0] and β 2 = [1, 1, 0, −2, 0, . . ., 0]. The average values of the squared canonical correlations, with the corresponding standard deviations in parentheses, can be found in Table 3. The box-plots of the squared canonical correlations for p = 10 are given in Figure 7. In the box-plots, we provide the squared canonical correlations for each direction so that performance can be compared for both directions.
For this model, PALS with λ = 1 fails to find a second direction; however, a different value of the tuning parameter improves the results significantly, especially as n increases. This model is therefore an example where λ can have a significant effect on estimation, a perhaps unexpected result given that in the analyses of PALS carried out so far, λ did not have a significant effect (Soale and Dong, 2022). SIR struggles to find a good estimate of the subspace when n is small (for H = 2 this is expected since the SIR matrix is of rank 1), and it can often find it difficult to estimate the second direction unless n is large, as is evident in the box-plots.
All SWAR versions perform comparatively well even when the sample size is small, in which case there is more variability in the estimation of the second direction. The performance of the SWAR methods also declines slightly as p increases. Additionally, SWAR and SWAR T benefit from a small to intermediate choice of H, whereas SWAR W benefits from an intermediate to large number of slices.
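For K = 2 the criterion in Table 3 is the set of squared canonical correlations between the true and estimated subspaces. A minimal numpy sketch, using the standard orthonormalize-then-SVD construction, is below; the function name is our own and the example reuses the two Model 2 directions.

```python
import numpy as np

def squared_canonical_correlations(B_true, B_hat):
    """Squared canonical correlations between the column spaces of the
    p x K matrices B_true and B_hat. Orthonormalize each basis via QR;
    the singular values of Q1'Q2 are the canonical correlations.
    Values near 1 for every direction indicate subspace recovery."""
    Q1, _ = np.linalg.qr(B_true)
    Q2, _ = np.linalg.qr(B_hat)
    s = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return s ** 2

# Example with the Model 2 directions: any basis spanning the same
# subspace yields squared canonical correlations of (approximately) 1.
p = 10
b1 = np.zeros(p); b1[:3] = [1, 2, -3]
b2 = np.zeros(p); b2[[0, 1, 3]] = [1, 1, -2]
B = np.column_stack([b1, b2])
print(squared_canonical_correlations(B, B @ np.array([[1., 1.], [0., 1.]])))
```

This criterion depends only on the spans, not on the particular bases, which is appropriate since only the e.d.r. space S B is identifiable.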

BigMac data example
In this section we compare the estimated directions given by OLS, SIR, SWAR, SWAR W and SWAR T for the 'BigMac' data from Enz (1991). Of interest is the regression of the response, the minimum labor required to buy a Big Mac and fries from McDonald's in each city, on the nine socio-economic predictor variables. There are 45 observations in the data, and so we choose H = 5 for SIR, while for SWAR, SWAR W and SWAR T we choose H = 2. These choices of H were confirmed to be the most suited to each method through inspection of the Estimated SSPs (ESSPs). Inspection of the ESSPs also revealed that K = 1 was suitable. The ESSPs for each method are shown in Figure 8, where an exponential curve has been added to each plot. It is evident from Figure 8 that SWAR provides the best ESSP for the BigMac data, with the curve fitting all observations reasonably well. OLS performs comparatively poorly, SIR performs well except for a notable outlier that the curve does not fit, and SWAR W also provides a good fit.

Discussion
In this paper we considered slice weighted average regression (SWAR), along with robustified versions of this method that use influence re-weighting (SWAR W and SWAR T). These versions utilize the mean sample influence function to down-weight the slices that contain highly influential observations and therefore contribute to a non-robust estimate. This weighting process can also be easily applied to other dimension reduction methods that combine multiple vectors into a dimension reduction matrix. Although not reported here, we also considered re-weighted versions of SIR and PALS, but with no notable improvements (some slight improvements for PALS were detected).
Despite the fact that OLS can only be used for a single direction (K = 1), the SWAR method is surprisingly capable of finding a second direction for the models we considered.
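The core structure of the estimator discussed above can be sketched as follows: slice the data by the ordered response, fit an OLS slope within each slice, form the weighted matrix R = ∑_h w_h b_h b_h' from Lemma 2's functional, and take its leading eigenvectors as the e.d.r. direction estimates. This is a minimal sketch under our own simplifications: equal slice weights are used as a placeholder (the SWAR W and SWAR T variants replace these with influence-based weights), and the function name is hypothetical.

```python
import numpy as np

def swar_directions(X, y, H=2, K=1, weights=None):
    """Sketch of the SWAR estimator: partition observations into H
    slices by the order of y, fit an OLS slope within each slice,
    form R = sum_h w_h * b_h b_h', and return the top K eigenvectors
    of R as estimated e.d.r. directions. Equal slice weights are a
    placeholder; robust variants down-weight influential slices."""
    n, p = X.shape
    order = np.argsort(y)
    slices = np.array_split(order, H)  # roughly equal-sized slices
    if weights is None:
        weights = np.full(H, 1.0 / H)
    R = np.zeros((p, p))
    for w, idx in zip(weights, slices):
        Xc = X[idx] - X[idx].mean(axis=0)   # center within the slice
        yc = y[idx] - y[idx].mean()
        b, *_ = np.linalg.lstsq(Xc, yc, rcond=None)  # slice OLS slope
        R += w * np.outer(b, b)
    vals, vecs = np.linalg.eigh(R)
    return vecs[:, np.argsort(vals)[::-1][:K]]  # leading K eigenvectors

# Single-index example: y = (beta'x)^3 / 10 + noise. The leading
# eigenvector should be highly correlated with beta.
rng = np.random.default_rng(1)
n, p = 500, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = [1, 2, -3]
y = (X @ beta) ** 3 / 10 + rng.normal(size=n)
b_hat = swar_directions(X, y, H=2, K=1)[:, 0]
print(np.corrcoef(X @ beta, X @ b_hat)[0, 1] ** 2)
```

Because R aggregates one slope vector per slice into a p × p matrix, more than one eigenvector can carry signal, which is what allows the slicing construction to recover a second direction where plain OLS cannot.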
for any range S of Y, since E(X|Y ∈ S) = E[E(X|B X)|Y ∈ S] due to the independence between ε and X.
Finally, from (22) and (23), [Var(X|Y ∈ S)]^{-1} Cov(X, Y |Y ∈ S) can be written in a form that is an element of S B, which completes the proof.
B Influence function derivations for SWAR
For simplicity, we let x̃_0 = x_0 − µ_h and ỹ_0 = y_0 − µ_y,h. From Prendergast (2007), and then from (24), we obtain the influence function for the inverse slice variance. For brevity we omit the details here, but by following closely the proof of (24) (see the proof of Lemma A1 of Prendergast, 2007), we obtain the result using (25) and (26).
Let R denote the functional for the SWAR matrix estimator, such that R(G) = ∑_{h=1}^{H} w_h B_h(G)B_h(G)^⊤. Using the Product Rule, the influence function for R follows directly from (27). We present the result in the following lemma for later use, where r_{0,h} = y_0 − µ_y,h − b_h^⊤(x_0 − µ_h) is the OLS residual of w_0 in the hth slice, x̃_0 = x_0 − µ_h, and µ_y,h and µ_h are the means of the y_i's and the x_i's in the hth slice, respectively.
Eq. 13 of Prendergast (2005) provides the influence functions for the eigenvectors of the SIR matrix estimator, and this result can be adapted and used directly here. However, from Lemma 2, in the case K > 1 the influence functions for the SWAR e.d.r. direction estimators depend on expressions involving Cov_G[x, y|y ∈ S_h(ε)] and Var_G[x|y ∈ S_h(ε)] which cannot be derived further. We therefore focus our attention on the case K = 1, where, since λ_j = 0 for j > 1, the form of the IF simplifies due to several results which we list here: (i) from Lemma 1, we have that b_h = c_1 β_1 for some c_1 ∈ R; since γ_1 is also a scalar multiple of β_1, then γ_j^⊤ b_h = 0 for j ≥ 2 (since γ_j^⊤ γ_1 = 0); (ii) from (22), we know that Σ_h^{-1} = Σ^{-1} − c_2 β_1 β_1^⊤ for some c_2 ∈ R, so that γ_j^⊤ Σ_h^{-1} = γ_j^⊤ Σ^{-1}; (iii) from (20) and (23). The proof of Theorem 2 is complete when applying (ii) to this first term also.

B.2 Proof of Theorem 4
From Bénasséni (1990), it can similarly be shown that (I − P)Σ_h^{-1} Var_G[X|Y ∈ S(ε)]b_h = 0 by using the form of the slice variance matrix in (20). The proof is then complete by substituting (28) into (30) and using the above simplifications.

Figure 4: Sample and empirical influence values on the e.d.r. space estimator for each observation, for the model given in (13).

Figure 5: The true (top left) and estimated SSPs for a realization of the contaminated Model 1 with n = 200 and p = 10. The ESSPs of OLS (top right), SWAR (bottom left) and SWAR T (bottom right) are depicted.

Figure 7: Box-plots of the squared canonical correlations for 1000 repetitions from PALS with λ = 1 and λ = 200, SIR, SWAR, SWAR W and SWAR T, for the various n and H combinations with p = 10, for Model 2.

Table 1: Average values of the squared correlations, cor(β^⊤X, β̂_n^⊤X)², with the corresponding standard deviations in parentheses, for 1000 realizations of Model 1 given by OLS, PALS, SIR, SWAR, SWAR W and SWAR T, for the uncontaminated and contaminated cases.

Table 3: Average values of the squared canonical correlations between the true and estimated directions, with the corresponding standard deviations in parentheses, from 1000 repetitions given by PALS (with λ = 1 and λ = 200), SIR, SWAR, SWAR W and SWAR T, for Model 2.
B.1 Influence functions for the SWAR matrix and the single-index model e.d.r. direction
Let C_h and C_XY,h denote the functionals for the hth slice covariance matrices so that, at G, C_h(G) = Σ_h and C_XY,h(G) = Σ_xy,h, respectively. In addition, let B_h denote the functional of the hth slice slope vector estimator, where B_h(G) = C_h(G)^{-1} C_XY,h(G). Using the Product Rule and setting ε to 0, we obtain the influence function IF(C_h^{-1}; w_0) for the inverse slice covariance matrix.