1 Introduction

Robust regression estimators are a standard and important tool in the toolbox of modern statisticians. They were introduced in the late sixties [36] and important early results appeared shortly thereafter [22, 23]. We recall that these estimators are defined as

$$\begin{aligned} {\widehat{\beta }}_\rho ={{\mathrm{argmin}}}_{\beta \in {\mathbb {R}}^p} \frac{1}{n}\sum _{i=1}^n \rho \left( Y_i-X_i'\beta \right) , \end{aligned}$$
(1)

for \(\rho \) a function chosen by the user. Here \(Y_i\) is a scalar response and \(X_i\) is a vector of predictors in \({\mathbb {R}}^p\). In the context we consider here, \(\rho \) will be a convex function. Naturally, one of the main reasons to use these estimators instead of the standard least-squares estimator is to increase the robustness of \({\widehat{\beta }}_\rho \) to outliers in, e.g., the \(Y_i\)’s. Formally, this robustness can be seen through results of Huber (see [24]) in the low-dimensional case where p is fixed. Huber showed that when \(Y_i=X_i'\beta _0+\epsilon _i\) and the \(\epsilon _i\)’s are i.i.d, under some mild regularity conditions, \({\widehat{\beta }}_\rho \) is asymptotically normal with mean \(\beta _0\) and (asymptotic) covariance

$$\begin{aligned} (X'X)^{-1}\frac{{\mathbf {E}}_{}\left( \psi ^2(\epsilon )\right) }{[{\mathbf {E}}_{}\left( \psi '(\epsilon )\right) ]^2}, \quad \text {where } \psi =\rho '. \end{aligned}$$
(2)

The question of understanding the behavior of these estimators in the high-dimensional setting where p is allowed to grow with n was raised very early on in [23, p. 802, questions b–f]. These questions started being answered in the mid to late eighties in work of Portnoy and Mammen (e.g. [28, 31,32,33,34]). However, these papers covered the case where \(p/n\rightarrow 0\) while \(p\rightarrow \infty \).

In the papers [16, 17], we explained (mixing, as in [23], rigorous arguments, simulations and heuristic arguments) that the case \(p/n\rightarrow \kappa \in (0,1)\) yielded a qualitatively completely different picture for this class of problems. For instance, under various technical assumptions, we explained that the risk \(||{\widehat{\beta }}_\rho -\beta _0||_2\) could be characterized through a system of two non-linear equations (sharing some characteristics with the one below), and that the distribution of the residuals could be found and was completely different from that of the \(\epsilon _i\)’s, by contrast with the low-dimensional case. Furthermore, we showed in [4] that maximum likelihood estimators were in general inefficient in high-dimension and found dimension-adaptive loss functions \(\rho \) that yielded better estimators than the ones we would have gotten by using the standard maximum likelihood estimator, i.e. using \(\rho =-\log f_\epsilon \), where \(f_\epsilon \) is the density of the i.i.d errors \(\epsilon _i\)’s. (We subsequently showed in [15]—which is an initial version of the current paper—that the techniques we had proposed in [16] could be made mathematically rigorous under various assumptions. See also the paper [9] that handles only the case of i.i.d Gaussian predictors, whereas El Karoui [15] can deal with more general assumptions on the predictors. Donoho and Montanari [9] also make interesting connections with the Scherbina–Tirrozi model in statistical physics—see [38, 40]. For other interesting results using rigorous approximate message passing techniques, see also e.g. [2].)

In the current paper, we study a generic extension of the robust regression problem involving ridge regularization. In other words, we study the statistical properties of

$$\begin{aligned} {\widehat{\beta }}={{\mathrm{argmin}}}_{\beta \in {\mathbb {R}}^p} \frac{1}{n}\sum _{i=1}^n \rho _i(Y_i-X_i'\beta )+\frac{\tau }{2}||\beta ||^2, \quad \text {where } Y_i=\epsilon _i+X_i'\beta _0. \end{aligned}$$

We will focus in particular on the case where there is no moment restriction on \(\epsilon _i\)’s. Furthermore, a key element of the study will be to show that the performance of \({\widehat{\beta }}\) is driven by the Euclidean geometry of the set of predictors \(\{X_i\}_{i=1}^n\). To do so, we will study “elliptical” models for \(X_i\)’s, i.e. \(X_i=\lambda _i {{\mathcal {X}}}_i\), where \({{\mathcal {X}}}_i\) has for instance independent entries. We note that when \(\lambda _i\) is independent of \({{\mathcal {X}}}_i\) and \({\mathbf {E}}_{}\left( \lambda _i^2\right) =1\), \(\mathrm {cov}\left( X_i\right) =\mathrm {cov}\left( {{\mathcal {X}}}_i\right) \). Hence these families of distributions for \(X_i\)’s have the same covariance, but as we will see they yield estimators whose performance varies quite substantially with the distribution of \(\lambda _i\)’s. As we explain below, the role of \(\lambda _i\)’s is to induce a “non-spherical geometry” on the predictors; understanding the impact of \(\lambda _i\)’s on the performance of \({\widehat{\beta }}\) is hence a way to understand how the geometry of the predictors affects the performance of the estimator. We note that in the low-dimensional case, when \(X_i\)’s are i.i.d, \(X'X/n\rightarrow \mathrm {cov}\left( X_1\right) \) in probability under mild assumptions, and hence the result of Huber mentioned in Eq. (2) shows that the limiting behavior of \({\widehat{\beta }}_\rho \) defined in Eq. (1) is the same under “elliptical” and non-elliptical models.

Our interest in elliptical distributions stems from the fact that, as we intuited for a related problem in [16], the behavior of quantities of the type \(X_i'Q X_i\) for Q deterministic is at the heart of the performance of \({\widehat{\beta }}\). Hence, studying elliptical distribution settings both sheds light on the impact of the geometry of predictors on the performance of the estimator and allows us to put to rest potential claims of “universality” of results obtained in the Gaussian (or geometrically similar) case. We note that in statistics there is a growing body of work showing the importance of predictor geometry on various high-dimensional problems (see e.g. [8, 13, 14, 18, 20]).
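
To make the geometric contrast concrete, here is a small numerical illustration (a hedged sketch of ours, not taken from the paper): the spherical and elliptical designs below have essentially the same covariance, close to \(\mathrm {Id}_p\), yet the squared norms \(||X_i||^2/p\) concentrate around 1 in the first case and around \(\lambda _i^2\) in the second. The specific distribution chosen for \(\lambda _i\) (a rescaled uniform) is arbitrary and serves only the illustration.

```python
import numpy as np

# Minimal sketch: two designs with (essentially) the same covariance but
# different geometry.  Spherical case: lambda_i = 1.  Elliptical case:
# lambda_i drawn from a bounded distribution, rescaled so that its sample
# second moment is 1 (these distributional choices are arbitrary).
rng = np.random.default_rng(0)
n, p = 2000, 1000

calX = rng.standard_normal((n, p))                 # rows are the spherical vectors
lam = rng.uniform(0.2, 1.4, size=n)                # bounded lambda_i
lam = lam / np.sqrt(np.mean(lam**2))               # enforce sample E(lambda^2) = 1
X_ell = lam[:, None] * calX                        # elliptical design X_i = lambda_i * calX_i

# Same covariance (deviations from Id_p are of the same, small order) ...
print("max |cov - Id| (spherical) :", np.abs(calX.T @ calX / n - np.eye(p)).max())
print("max |cov - Id| (elliptical):", np.abs(X_ell.T @ X_ell / n - np.eye(p)).max())

# ... but very different geometry: ||X_i||^2/p is near 1 in the spherical case
# and near lambda_i^2 in the elliptical case.
print("spherical  ||X_i||^2/p (5/50/95%):",
      np.round(np.percentile((calX**2).sum(1) / p, [5, 50, 95]), 3))
print("elliptical ||X_i||^2/p (5/50/95%):",
      np.round(np.percentile((X_ell**2).sum(1) / p, [5, 50, 95]), 3))
print("max_i | ||X_i||^2/p - lambda_i^2 |:",
      np.abs((X_ell**2).sum(1) / p - lam**2).max())
```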

One main motivation for allowing \(\rho _i\) to change with i is that it might be natural to use different loss functions for different observations if we happen to have information about distributional inhomogeneities in \(\{X_i,Y_i\}_{i=1}^n\). For instance, one group of observations could have errors coming from one distribution and a second group might have errors with a different distribution. Another reason is to gain information on the case of weighted regression, in which case \(\rho _i=w_i \rho \). Also, this analysis can be used to justify rigorously some of the claims made in [16]. Finally, it may prove useful in some bootstrap studies (see e.g. [19]).

In the current paper, we consider the situation where \(\beta _0\) is “diffuse”, i.e. all of its coordinates are small and it cannot be well approximated by a sparse vector. In this situation, use of ridge/\(\ell _2\) penalization is natural. The paper also answers the question, raised by other researchers in statistics, of knowing whether the techniques of the initial version [15] could be used in the situation we are considering here. Finally, the paper shows that some of the heuristics of [3] can be rigorously justified.

When \(\rho _i=\rho \) for all i, a natural question is to know whether we can find an optimal \(\rho \), in terms of prediction error for instance, as a function of the law of \(\epsilon _i\)’s—in effect asking similar questions to the ones answered by Huber [24] in low-dimension and in [4] in high-dimension. However, the constraints we impose in the current paper on both the errors (i.e. we do not require them to have any moments) and the functions \(\rho _i\)’s render part of the argument in [4] unusable and might require new ideas. So we will consider this “optimization over \(\rho _i\)’s and \(\tau \)” in future work, given that the current proof is already long.

The problem and setup considered in this paper are more natural in the context of robust regression than the ones studied in the initial version [15], where the chosen setup was targeted towards problems related to suboptimality of maximum likelihood methods. However, the strategy for the proof of the results here is similar to the strategy we devised in the initial [15]. There are three main conceptual novelties that create important new problems: handling ellipticity and the fact that \(\beta _0\ne 0\) both require new ideas in the second part of the proof (i.e. “Appendix 4”). Dealing with heavy tails and appropriate loss functions impacts the whole proof and requires many changes compared to the proof of [15]. Conceptually, this latter part is also the most important, as it shows that all the approximations made in earlier heuristic papers are valid, even in the presence of heavy-tailed errors. This situation is of course the one where these approximations, while having clearly shown their usefulness in giving conceptual and heuristic understanding of the statistical problem, were the most mathematically “suspicious”. So it is interesting to see that they can be made to work rigorously, especially since the probabilistic heuristics developed in these earlier papers allow researchers to shed light quickly on non-trivial statistical problems.

We now state our results. We believe our notation is standard, but we refer the reader to the section Notations (immediately before Eq. 9 below) in case clarification is needed.

2 Results

The main focus of the paper is in understanding the properties of

$$\begin{aligned} {\widehat{\beta }}={{\mathrm{argmin}}}_{\beta \in {\mathbb {R}}^p} \frac{1}{n}\sum _{i=1}^n \rho _i(Y_i-X_i'\beta )+\frac{\tau }{2}||\beta ||^2, \quad \text {where } Y_i=X_i'\beta _0+\epsilon _i, \end{aligned}$$
(3)

and \(\tau >0\). For all \(1\le i\le n\), we have \(\epsilon _i \in {\mathbb {R}}\) and \(X_i \in {\mathbb {R}}^p\).
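
Since the paper does not come with code, the following minimal sketch shows one way \({\widehat{\beta }}\) in Eq. (3) can be computed numerically; it is only an illustration of the objective being studied, not an implementation used in the paper. The pseudo-Huber loss \(\rho (x)=\sqrt{1+x^2}-1\) is our choice here because it has bounded \(\psi \) and Lipschitz, bounded \(\psi '\), as required by the assumptions below; the data-generating choices (Cauchy errors, elliptical design, diffuse \(\beta _0\)) roughly mimic the Example given in Sect. 2.1.

```python
import numpy as np
from scipy.optimize import minimize

# Hedged sketch (not the paper's code): compute the ridge-regularized
# M-estimator of Eq. (3) for a smooth convex loss.  We use the pseudo-Huber
# loss rho(x) = sqrt(1+x^2) - 1, an arbitrary choice with bounded psi and
# Lipschitz, bounded psi'.
rho = lambda x: np.sqrt(1.0 + x**2) - 1.0
psi = lambda x: x / np.sqrt(1.0 + x**2)

def ridge_M_estimator(X, Y, tau):
    n, p = X.shape
    def obj(beta):
        r = Y - X @ beta
        return np.mean(rho(r)) + 0.5 * tau * beta @ beta
    def grad(beta):
        r = Y - X @ beta
        return -X.T @ psi(r) / n + tau * beta
    return minimize(obj, np.zeros(p), jac=grad, method="L-BFGS-B").x

# Simulated data: Cauchy errors (no moments), elliptical design, diffuse
# beta_0 with ||beta_0|| = 1; all of these choices are for illustration only.
rng = np.random.default_rng(1)
n, p, tau = 1000, 500, 1.0
lam = rng.uniform(0.5, 1.3, size=n)
X = lam[:, None] * rng.standard_normal((n, p))
beta0 = rng.standard_normal(p); beta0 /= np.linalg.norm(beta0)
Y = X @ beta0 + rng.standard_cauchy(n)

beta_hat = ridge_M_estimator(X, Y, tau)
print("||beta_hat - beta_0|| =", np.linalg.norm(beta_hat - beta0))
```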

We prove four main results in the paper:

  1. we characterize the \(\ell _2\)-risk of our estimator, i.e. \(||{\widehat{\beta }}-\beta _0||_2\);

  2. we describe the behavior of the residuals \(R_i=Y_i-X_i'{\widehat{\beta }}\) and relate them to the leave-one-out prediction error \(\tilde{r}_{i,(i)}=Y_i-X_i'{\widehat{\beta }}_{(i)}\);

  3. we obtain an approximate update formula for \({\widehat{\beta }}\) when adding an observation (and show it is very accurate);

  4. we provide central limit theorems for the individual coordinates of \({\widehat{\beta }}\).

For the sake of clarity, we provide in the main text a series of assumptions that guarantee that our results hold. However, a more detailed and less restrictive statement of our assumptions is provided in the “Appendix”.

2.1 Preliminaries and overview of technical assumptions

We use the notation \(\text {prox}(\rho )\) to denote the proximal mapping of the function \(\rho \), which is assumed to be convex throughout the paper. This notion was introduced in [29]. We recall that

$$\begin{aligned} \text {prox}(c\rho )(x)&={{\mathrm{argmin}}}_{y\in {\mathbb {R}}} \left( c\rho (y)+\frac{1}{2}(x-y)^2\right) , \text { or equivalently,}\\ \text {prox}(c\rho )(x)&=(\mathrm {Id}+c\psi )^{-1}(x), \quad \text {where } \psi =\rho '. \end{aligned}$$

We refer the reader to [5, 29], or [37, Sect. 7.3], for more details on this operation. Note that the previous definitions imply that

$$\begin{aligned} \forall x, \; \;\text {prox}(c\rho )(x)+c\psi (\text {prox}(c\rho )(x))=x. \end{aligned}$$

We give examples of proximal mappings in the “Appendix 6”.
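
For concreteness, here is a short numerical sketch (ours, for illustration only) that evaluates \(\text {prox}(c\rho )\) by direct one-dimensional minimization and checks the identity displayed above; the pseudo-Huber loss used here is an arbitrary smooth convex choice, not one singled out by the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hedged sketch: evaluate prox(c*rho) numerically for a smooth convex loss
# and check that prox(c rho)(x) + c*psi(prox(c rho)(x)) = x.
rho = lambda y: np.sqrt(1.0 + y**2) - 1.0      # pseudo-Huber (illustrative choice)
psi = lambda y: y / np.sqrt(1.0 + y**2)        # psi = rho'

def prox(c, x):
    # argmin_y  c*rho(y) + (x - y)^2 / 2   (strictly convex in y)
    return minimize_scalar(lambda y: c * rho(y) + 0.5 * (x - y) ** 2).x

c = 2.0
for x in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    y = prox(c, x)
    print(f"x = {x:5.2f}   prox = {y: .4f}   prox + c*psi(prox) - x = {y + c * psi(y) - x: .2e}")
```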

We now state some sufficient assumptions that guarantee that all the results stated below are correct. The main proofs are in the “Appendix”. They are carried out at a much greater level of generality than the statements below, and various aspects of those proofs require much weaker assumptions than those we present here. We start by giving an example where all of our conditions are met.

Example

Our conditions are met when

  • \(p/n\rightarrow \kappa \in (0,\infty )\).

  • \(\epsilon _i\)’s are i.i.d Cauchy (with median at 0).

  • \(X_i=\lambda _i {{\mathcal {X}}}_i\), where \(\lambda _i\in {\mathbb {R}}\) and \({{\mathcal {X}}}_i \in {\mathbb {R}}^p\) are independent. \(\lambda _i\)’s are i.i.d with bounded support; \({{{\mathcal {X}}}}_i\)’s are i.i.d with i.i.d \({{{\mathcal {N}}}}(0,1)\) entries, or i.i.d entries with bounded support and mean 0 as well as variance 1. \(\{{{\mathcal {X}}}_i\}_{i=1}^n\), \(\{\lambda _i\}_{i=1}^n\) and \(\{\epsilon _i\}_{i=1}^n\) are independent.

  • \(\beta _0\) is a “diffuse” vector with \(\beta _0(i)=u_{i,p}/\sqrt{p}\), \(0\le |u_{i,p}|\le C\) and \(\sum _{i=1}^p {u_{i,p}^2}=p\), i.e. \(||\beta _0||_2=1\).

  • \(\rho _i=\rho \) for all i’s and \(\rho \) is convex. \(\psi =\rho '\) is bounded and \(\psi '\) is Lipschitz and bounded. \(\text {sign}(\psi (x))=\text {sign}(x)\) and \(\rho (x)\ge \rho (0)=0\).

We note that this last condition is satisfied for a smoothed approximation of the Huber function, where the discontinuity in \(\psi '\) at, say, 1 is replaced by a linear interpolation; see below for more details. Note however that the Huber function has a priori no statistical optimality properties in the context we consider.

Sufficient conditions for our results to hold

  • \(p/n\) has a finite non-zero limit.

  • \(\rho _i\)’s are chosen from finitely many possible convex functions. If \(\psi _i=\rho _i'\), \(\sup _i ||\psi _i||_{\infty }\le K\), \(\sup _i ||\psi _i'||_{\infty }\le K\), for some K. \(\psi _i'\) is also assumed to be Lipschitz-continuous. Also, for all \(x\in {\mathbb {R}}\), \(\text {sign}(\psi _i(x))=\text {sign}(x)\) and \(\rho _i(x)\ge \rho _i(0)=0\).

  • \(X_i=\lambda _i {{{\mathcal {X}}}}_i\), where \({{\mathcal {X}}}_i\)’s are i.i.d with independent entries. \(\lambda _i\)’s are independent and independent of \({{\mathcal {X}}}_i\)’s. The \({{\mathcal {X}}}_i\)’s satisfy a concentration property in the sense that if G is a convex 1-Lipschitz function (with respect to Euclidean norm), \(P(|G({{\mathcal {X}}}_i)-m_G|>t)\le C \exp (-{\mathsf {c}}t^2)\), for any \(t>0\), \(m_G\) being a median of \(G({{\mathcal {X}}}_i)\). We require the same assumption to hold when considering the columns of the \(n\times p\) design matrix \({{\mathcal {X}}}\). \({{\mathcal {X}}}_i\)’s have mean 0 and \(\mathrm {cov}\left( {{\mathcal {X}}}_i\right) =\mathrm {Id}_p\). We also assume that the coordinates of \({{\mathcal {X}}}_i\) have moments of all orders. Furthermore, for any given k, the kth moment of the entries of \({{\mathcal {X}}}_i\) is assumed to be bounded independently of n and p.

  • \({\mathbf {E}}_{}\left( \lambda _i^2\right) =1\), \({\mathbf {E}}_{}\left( \lambda _i^4\right) \) is bounded and \(\sup _{1\le i\le n} |\lambda _i|\) grows at most like \(C (\log n)^k\) for some k. \(\lambda _i\)’s may have different distributions, but the number of such possible distributions is finite.

  • \(\epsilon _i\)’s are independent. They may have different distributions, but the number of such possible distributions is finite. Those distributions are assumed to have densities that are differentiable, symmetric and unimodal. Furthermore, we assume that if \(f_i\) is the density of one such distribution, \(\lim _{x\rightarrow \infty } xf_i(x)=0.\) \(\{{{\mathcal {X}}}_i\}_{i=1}^n\), \(\{\lambda _i\}_{i=1}^n\) and \(\{\epsilon _i\}_{i=1}^n\) are independent.

  • \(||\beta _0||_2\) remains bounded. Furthermore, \(||\beta _0||_\infty =\mathrm {O}(n^{-{\mathsf {e}}})\), where \(1/4<{\mathsf {e}}\).

  • The fraction of observations for which each possible combination of functions and distributions \((\rho _i,{{\mathcal {L}}}(\epsilon _i),{{\mathcal {L}}}(\lambda _i))\) appears in our problem has a limit as \(n\rightarrow \infty \). (\({{\mathcal {L}}}(\epsilon _i)\) and \({{\mathcal {L}}}(\lambda _i)\) are the laws of \(\epsilon _i\) and \(\lambda _i\).)

We now state our most important results (several others are in the “Appendix”, where we give the proof) and our proof strategy; naturally, the two go together to provide a sketch of proof. We postpone our discussion of both the assumptions and our results to Sect. 2.3.

2.2 Results and proof strategy

2.2.1 Characterization of the risk of \({\widehat{\beta }}\)

Consider \({\widehat{\beta }}\) defined in Eq. (3) and assume that \(\tau >0\) is given, i.e. does not change with p and n. Under the technical assumptions detailed in Sect. 2.1, we have:

Theorem 2.1

As p, n tend to infinity while \(p/n\rightarrow \kappa \in (0,\infty )\), \(\mathrm {var}\left( ||{\widehat{\beta }}-\beta _0||^2\right) \rightarrow 0\). Furthermore, \(||{\widehat{\beta }}-\beta _0||\rightarrow r_{\rho }(\kappa )\) in probability, for \(r_{\rho }(\kappa )\) a deterministic scalar. Call \(W_i=\epsilon _i+r_{\rho }(\kappa )\lambda _i Z_i\), where \(Z_i\) is a \({{\mathcal {N}}}(0,1)\) random variable independent of \(\epsilon _i\) and \(\lambda _i\). Then there exists a constant \(c_{\rho }(\kappa )\ge 0\) such that

$$\begin{aligned} \left\{ \begin{array}{ll} \left[ \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^n{\mathbf {E}}_{}\left( [\text {prox}(c_{\rho }(\kappa )\lambda _i^2\rho _i)]'(W_i)\right) \right] &{}=1-\kappa +\tau c_{\rho }(\kappa ),\\ \kappa \left[ \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^n {\mathbf {E}}_{}\left( \frac{(W_i-\text {prox}(c_{\rho }(\kappa )\lambda _i^2 \rho _i)[W_i])^2}{\lambda _i^2}\right) \right] +\tau ^2 ||\beta _0||^2 c^2_{\rho }(\kappa )&{}=\kappa ^2 r^2_{\rho }(\kappa ). \end{array} \right. \end{aligned}$$
(4)

We note that

$$\begin{aligned} \frac{(x-\text {prox}(c_{\rho }(\kappa )\lambda _i^2 \rho _i)[x])^2}{\lambda _i^2}=c^2_\rho (\kappa )\lambda _i^2 \psi _i^2(\text {prox}(c_{\rho }(\kappa )\lambda _i^2 \rho _i)[x]), \end{aligned}$$

so in case \(\lambda _i\) takes the value 0, we can replace the expression on the left hand side by that on the right hand side, which does not involve dividing by \(\lambda _i^2\). This alternative expression also shows that there is no problem taking expectations in our equations.

The previous system can be reformulated in terms of \(\text {prox}((c_{\rho }(\kappa )\lambda _i^2\rho _i)^*)\), where \(f^*\) represents the Fenchel–Legendre dual of f. Indeed, Moreau’s prox identity [29] gives

$$\begin{aligned} \text {prox}((c\rho )^*)(x)=x-\text {prox}(c\rho )(x). \end{aligned}$$

This is partly why we chose to write the system as we did, since it can be rephrased purely in terms of \(\text {prox}([c_{\rho }(\kappa )\lambda _i^2\rho _i]^*)\), a formulation that has proven useful in previous related problems (see [4]).

We note that \(r_\rho (\kappa )\) and \(c_\rho (\kappa )\) will in general depend on \(\tau \), but we do not index those quantities by \(\tau \) to avoid cumbersome notations.
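
To make the system (4) concrete, consider the one special case in which every term is explicit: \(\rho _i(x)=x^2/2\), \(\lambda _i\equiv 1\) and \(\epsilon _i\sim {{\mathcal {N}}}(0,\sigma ^2)\), so that \({\widehat{\beta }}\) is ordinary ridge regression and \(\text {prox}(c\rho )(x)=x/(1+c)\). (This loss does not satisfy the boundedness assumptions on \(\psi \) stated above; we use it here only because it allows a closed-form cross-check.) The hedged sketch below, which is ours and not part of the paper, solves the resulting two scalar equations with a root finder and compares the predicted risk \(r_\rho (\kappa )\) to direct simulation of ridge regression.

```python
import numpy as np
from scipy.optimize import fsolve

# Special case of system (4): rho_i(x) = x^2/2, lambda_i = 1, eps_i ~ N(0, sigma^2).
# Then prox(c*rho)(x) = x/(1+c), [prox(c*rho)]'(x) = 1/(1+c), and with
# W = eps + r*Z, E(W^2) = sigma^2 + r^2, so (4) reduces to
#   1/(1+c)                                                  = 1 - kappa + tau*c,
#   kappa*c^2*(sigma^2 + r^2)/(1+c)^2 + tau^2*||beta0||^2*c^2 = kappa^2*r^2.
kappa, tau, sigma, norm_beta0 = 0.5, 0.5, 1.0, 1.0

def system(z):
    c, r = z
    eq1 = 1.0 / (1.0 + c) - (1.0 - kappa + tau * c)
    eq2 = (kappa * c**2 * (sigma**2 + r**2) / (1.0 + c)**2
           + tau**2 * norm_beta0**2 * c**2 - kappa**2 * r**2)
    return [eq1, eq2]

c_star, r_star = fsolve(system, [0.5, 0.5])
print("predicted c_rho(kappa) =", c_star, "  predicted risk r_rho(kappa) =", r_star)

# Direct check: with rho(x) = x^2/2 the estimator is ridge regression,
# beta_hat = (X'X/n + tau*Id)^{-1} X'Y/n.
rng = np.random.default_rng(2)
n = 2000; p = int(kappa * n)
risks = []
for _ in range(5):
    X = rng.standard_normal((n, p))
    beta0 = rng.standard_normal(p); beta0 *= norm_beta0 / np.linalg.norm(beta0)
    Y = X @ beta0 + sigma * rng.standard_normal(n)
    beta_hat = np.linalg.solve(X.T @ X / n + tau * np.eye(p), X.T @ Y / n)
    risks.append(np.linalg.norm(beta_hat - beta0))
print("simulated ||beta_hat - beta_0|| :", np.round(risks, 3))
```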

2.2.2 Organization of the proof and strategy

The proof is quite long so we now explain the main ideas and organization of the argument. Recall that if

$$\begin{aligned} F(\beta )=\frac{1}{n}\sum _{i=1}^n \rho _i \left( Y_i-X_i'\beta \right) +\frac{\tau }{2}||\beta ||^2, \end{aligned}$$

we have

$$\begin{aligned} {\widehat{\beta }}={{\mathrm{argmin}}}_{\beta \in {\mathbb {R}}^p} F(\beta ). \end{aligned}$$

The proof is broadly divided into three steps.

First step. The first idea is to relate \({\widehat{\beta }}\) and \({\widehat{\beta }}_{(i)}\), the solution of our optimization problem when the pair \((X_i,Y_i)\) is excluded from the problem. It is reasonable to expect that adding \((X_i,Y_i)\) will not change \(\tilde{r}_{j,(i)}=Y_j-X_j'{\widehat{\beta }}_{(i)}\) too much when \(j\ne i\), and hence that \(\tilde{r}_{j,(i)}\simeq R_j=Y_j-X_j'{\widehat{\beta }}\) when \(j\ne i\). Armed with this intuition, we can try to use a first-order Taylor expansion of \({\widehat{\beta }}\) around \({\widehat{\beta }}_{(i)}\) in the equation \(\nabla F ({\widehat{\beta }})=0\) to relate the two vectors. This is what the first part of the proof does, by surmising an approximation \(\eta _i\) for \({\widehat{\beta }}-{\widehat{\beta }}_{(i)}\) – an approximation that follows the intuitive lines above but is non-trivial to come up with at the level of precision we need. Much work is devoted to proving that this very informed guess is sufficiently accurate for our purposes. Since “the only thing we know” about \({\widehat{\beta }}\) is that \(\nabla F({\widehat{\beta }})=0\), we work on \(\nabla F({\widehat{\beta }})-\nabla F({\widehat{\beta }}_{(i)}+\eta _i)\) to do so, and show in our preliminaries (see “Appendix 2”) that controlling this latter quantity is enough to control \(||{\widehat{\beta }}-{\widehat{\beta }}_{(i)}-\eta _i||\). Once our bound for \(||{\widehat{\beta }}-{\widehat{\beta }}_{(i)}-\eta _i||\) is established, we use it to bound \({\mathbf {E}}_{}\left( |||{\widehat{\beta }}-\beta _0||^2-||{\widehat{\beta }}_{(i)}-\beta _0||^2|^2\right) \) and use a martingale inequality to deduce a bound on \(\mathrm {var}\left( ||{\widehat{\beta }}-\beta _0||^2\right) \), which we show goes to zero. The corresponding results are presented in Sect. 2.2.3 and the detailed mathematical analysis is in “Appendix 3”.

Second step. The second step of the proof is to relate \({\widehat{\beta }}\) to another quantity \({\widehat{\gamma }}\), which is the solution of our optimization problem when the last column of the matrix X is excluded from the problem—see Sect. 2.2.4 below and “Appendix 4” for detailed mathematical analysis. Call V the corresponding design matrix. In our setting, it is reasonable to expect that \(r_{i,[p]}=Y_i-X_i(p)\beta _0(p)-V_i'{\widehat{\gamma }}\simeq Y_i-X_i'{\widehat{\beta }}\). A first order Taylor expansion of \(\nabla F ({\widehat{\beta }})\) around \(({\widehat{\gamma }}'\; \beta _0(p))'\) and further manipulations yield an informed “guess”, denoted \({\tilde{b}}\) below, for \({\widehat{\beta }}\), and in particular for \({\widehat{\beta }}_p\), the last coordinate of \({\widehat{\beta }}\). A large amount of work is devoted to proving that the quantity we surmised—denoted \({\mathfrak {b}}_p\) below—approximates \({\widehat{\beta }}_p\) sufficiently well for our purposes—once again by doing delicate computations on the corresponding gradients. Since \({\mathfrak {b}}_p\) has a reasonably nice probabilistic representation, it is possible to write \({\mathbf {E}}_{}\left( {\mathfrak {b}}_p^2\right) \) in terms of other quantities appearing in the problem, such as \(\psi _i(r_{i,[p]})\) (where \(\psi _i=\rho _i'\)) and a quantity \({\mathsf {c}}_{\tau ,p}\) that is the trace of the inverse of a certain random matrix. Because \({\mathfrak {b}}_p\) approximates \({\widehat{\beta }}_p\) sufficiently well, our approximation of \({\mathbf {E}}_{}\left( {\mathfrak {b}}_p^2\right) \) can be used to yield a good approximation of \({\mathbf {E}}_{}\left( ||{\widehat{\beta }}-\beta _0||^2\right) \). However, we want the approximation of \({\mathbf {E}}_{}\left( ||{\widehat{\beta }}-\beta _0||^2\right) \) to not depend on quantities that depend on p, such as \(r_{i,[p]}\) and \({\mathsf {c}}_{\tau ,p}\). Further work is needed to show that the approximation of \({\mathbf {E}}_{}\left( ||{\widehat{\beta }}-\beta _0||^2\right) \) can be made in terms of \(\tilde{r}_{i,(i)}\)’s—which we used in the first part of the proof—and a new quantity \(c_\tau \), which is the trace of the inverse of a certain random matrix, as was \({\mathsf {c}}_{\tau ,p}\). The resulting approximation for \({\mathbf {E}}_{}\left( ||{\widehat{\beta }}-\beta _0||^2\right) \) is essentially the second equation of our system—see Proposition (2.4) for instance.

Third step. The last part of the proof—see Sect. 2.2.5 and “Appendix 5” for detailed mathematical analysis—is devoted to first showing that \(\tilde{r}_{i,(i)}=Y_i-X_i'{\widehat{\beta }}_{(i)}\) behaves asymptotically like \(\epsilon _i+\lambda _i \sqrt{{\mathbf {E}}_{}\left( ||{\widehat{\beta }}-\beta _0||^2\right) } Z_i\), where \(Z_i\sim {{\mathcal {N}}}(0,1)\). The work done previously in the proof is extremely useful for that. Finally, we show that \(c_\tau \) is asymptotically deterministic. The characterization of \(c_\tau \) is essentially the first equation of our system—see Theorem 2.6 below. After all this is established, we can state for instance central limit theorems for \({\widehat{\beta }}_p\) and interesting quantities that appear in our proof.

The following few subsubsections make all our intermediate results precise. Armed with the above explanation for our approach, they provide the reader with a clear overview of the arc of our proof. The detailed mathematical analysis is given in the “Appendix”.

2.2.3 Leave-one-observation out approximations

We call the residuals

$$\begin{aligned} R_i=Y_i-X_i'{\widehat{\beta }}=\epsilon _i-X_i'({\widehat{\beta }}-\beta _0). \end{aligned}$$

We consider the situation where we leave the ith observation, \((X_i,Y_i)\), out. We call

$$\begin{aligned} {\widehat{\beta }}_{(i)}={{\mathrm{argmin}}}_{\beta \in {\mathbb {R}}^p} F_i(\beta ), \quad \text {where } F_i(\beta )=\frac{1}{n}\sum _{j\ne i} \rho _j\left( \epsilon _j+X_j'\beta _0-X_j'\beta \right) +\frac{\tau }{2}||\beta ||^2. \end{aligned}$$

We use the notations

$$\begin{aligned} \tilde{r}_{j,(i)}=\epsilon _j-X_j'({\widehat{\beta }}_{(i)}-\beta _0) \text { and } S_i=\frac{1}{n}\sum _{j\ne i}\psi _j'(\tilde{r}_{j,(i)})X_jX_j'. \end{aligned}$$

Note that \(\tilde{r}_{j,(i)}\)’s are simply the leave-one-out residuals (for \(j\ne i\)) and the leave-one-out prediction error (for \(j=i\)).

Let us consider

$$\begin{aligned} {\widetilde{\beta }}_i={\widehat{\beta }}_{(i)}+\frac{1}{n}(S_i+\tau \mathrm {Id})^{-1}X_i \psi _i(\text {prox}(c_i\rho _i)(\tilde{r}_{i,(i)}))\triangleq {\widehat{\beta }}_{(i)}+\eta _i, \end{aligned}$$

where

$$\begin{aligned} c_i=\frac{1}{n}X_i'(S_i+\tau \mathrm {Id})^{-1}X_i, \text { and } \eta _i=\frac{1}{n}(S_i+\tau \mathrm {Id})^{-1}X_i \psi _i(\text {prox}(c_i\rho _i)(\tilde{r}_{i,(i)})). \end{aligned}$$

We have the following theorem.

Theorem 2.2

Under our technical assumptions, we have, for any fixed k, when \(\tau \) is held fixed,

$$\begin{aligned} \sup _{1\le i\le n} ||{\widehat{\beta }}-{\widetilde{\beta }}_i||=\mathrm {O}_{L_k}\left( \frac{\text {polyLog}(n)}{n}\right) . \end{aligned}$$

Also,

$$\begin{aligned} \sup _{1\le i\le n} \sup _{ j\ne i}|\tilde{r}_{j,(i)}-R_j|&=\mathrm {O}_{L_k}\left( \frac{\text {polyLog}(n)}{n^{1/2}}\right) ,\\ \sup _{1\le i\le n} |R_i-\text {prox}(c_i\rho _i)(\tilde{r}_{i,(i)})|&=\mathrm {O}_{L_k}\left( \frac{\text {polyLog}(n)}{n^{1/2}}\right) . \end{aligned}$$

Finally,

$$\begin{aligned} \mathrm {var}\left( ||{\widehat{\beta }}-\beta _0||_2^2\right) =\mathrm {O}\left( \frac{\text {polyLog}(n)}{n}\right) . \end{aligned}$$

A stronger version of this theorem is available in the “Appendix”. (We say that a sequence of random variables \(W_n=\mathrm {O}_{L_k}(1)\) if \(({\mathbf {E}}_{}\left( |W_n|^k\right) )^{1/k}=\mathrm {O}(1)\).)

There are two main reasons this theorem is interesting. First, it provides online-update formulas for \({\widehat{\beta }}\) through \({\widetilde{\beta }}_i\), with guaranteed approximation errors. Second, it relates the full residuals, whose statistical and probabilistic properties are quite complicated, to the much-simpler-to-understand “leave-one-out” prediction error, \(\tilde{r}_{i,(i)}\). Indeed, because \(X_i\) is independent of \({\widehat{\beta }}_{(i)}\) under our assumptions, the statistical properties of \({\widehat{\beta }}_{(i)}'X_i\) are much simpler to understand than those of \({\widehat{\beta }}'X_i\).
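
The following sketch (an illustration of ours; the pseudo-Huber loss and Gaussian design are not choices made by the paper) checks Theorem 2.2 numerically on one observation: \({\widetilde{\beta }}_i\) is a far better approximation of \({\widehat{\beta }}\) than \({\widehat{\beta }}_{(i)}\) itself, and \(R_i\simeq \text {prox}(c_i\rho _i)(\tilde{r}_{i,(i)})\).

```python
import numpy as np
from scipy.optimize import minimize, brentq

# Hedged numerical check of the leave-one-out approximation of Theorem 2.2.
psi   = lambda x: x / np.sqrt(1.0 + x**2)           # rho(x) = sqrt(1+x^2) - 1
psi_p = lambda x: (1.0 + x**2) ** (-1.5)
rho   = lambda x: np.sqrt(1.0 + x**2) - 1.0

def fit(D, Z, tau, n_div, p):
    # minimizes (1/n_div) * sum_j rho(Z_j - D_j'b) + (tau/2)*||b||^2
    obj  = lambda b: np.sum(rho(Z - D @ b)) / n_div + 0.5 * tau * b @ b
    grad = lambda b: -D.T @ psi(Z - D @ b) / n_div + tau * b
    return minimize(obj, np.zeros(p), jac=grad, method="L-BFGS-B",
                    options={"ftol": 1e-14, "gtol": 1e-10, "maxiter": 2000}).x

def prox(c, x):                                     # solves y + c*psi(y) = x
    return brentq(lambda y: y + c * psi(y) - x, x - c, x + c)

rng = np.random.default_rng(3)
n, p, tau = 400, 200, 0.5
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p); beta0 /= np.linalg.norm(beta0)
Y = X @ beta0 + rng.standard_cauchy(n)

beta_hat = fit(X, Y, tau, n, p)
R = Y - X @ beta_hat                                # full residuals

i = 0                                               # leave observation i out
mask = np.arange(n) != i
beta_i = fit(X[mask], Y[mask], tau, n, p)           # F_i also divides by n, as in the text
r_tilde = Y - X @ beta_i                            # r_tilde[j] = tilde r_{j,(i)}
S_i = (X[mask] * psi_p(r_tilde[mask])[:, None]).T @ X[mask] / n
c_i = X[i] @ np.linalg.solve(S_i + tau * np.eye(p), X[i]) / n
eta_i = np.linalg.solve(S_i + tau * np.eye(p), X[i]) * psi(prox(c_i, r_tilde[i])) / n
beta_tilde_i = beta_i + eta_i

print("||beta_hat - beta_tilde_i||          =", np.linalg.norm(beta_hat - beta_tilde_i))
print("||beta_hat - beta_hat_(i)||          =", np.linalg.norm(beta_hat - beta_i))
print("|R_i - prox(c_i rho)(r_tilde_{i,(i)})| =", abs(R[i] - prox(c_i, r_tilde[i])))
```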

2.2.4 Leave-one-predictor out approximations

Let V be the \(n\times (p-1)\) matrix corresponding to the first \((p-1)\) columns of the design matrix X. We call \(V_i\) in \({\mathbb {R}}^{p-1}\) the vector corresponding to the first \(p-1\) entries of \(X_i\), i.e. \(V_i'=(X_i(1),\ldots ,X_i(p-1))\). We call X(p) the vector in \({\mathbb {R}}^n\) with jth entry \(X_j(p)\), i.e. the \(p\)-th entry of the vector \(X_j\). When this does not create problems, we also use the standard notation \(X_{j,p}\) for \(X_j(p)\).

We use the notation \(\beta _0 = (\gamma _0'\; \beta _0(p))'\), i.e. \(\gamma _0\) is the vector corresponding to the first \(p-1\) coordinates of \(\beta _0\).

Let us call \({\widehat{\gamma }}\) the solution of our optimization problem when we use the design matrix V instead of X. In other words,

$$\begin{aligned} {\widehat{\gamma }}={{\mathrm{argmin}}}_{\gamma \in {\mathbb {R}}^{p-1}}\frac{1}{n}\sum _{i=1}^n \rho _i(\epsilon _i-V_i'(\gamma -\gamma _0))+\frac{\tau }{2}||\gamma ||^2. \end{aligned}$$
(5)

To state the following results, we will rely heavily on these definitions:

Definition

We call the corresponding residuals \(\{r_{i,[p]}\}_{i=1}^n\), i.e. \(r_{i,[p]}=\epsilon _i+V_i'\gamma _0-V_i'{\widehat{\gamma }}\). Let

$$\begin{aligned} u_p=\frac{1}{n}\sum _{i=1}^n \psi _i'(r_{i,[p]}) V_i X_i(p), \quad {\mathfrak {S}}_p=\frac{1}{n}\sum _{i=1}^n \psi _i'(r_{i,[p]}) V_i V_i'. \end{aligned}$$

We have \(u_p \in {\mathbb {R}}^{p-1}\) and \({\mathfrak {S}}_p\) is \((p-1)\times (p-1)\). We call

$$\begin{aligned} \xi _n&\triangleq \frac{1}{n}\sum _{i=1}^n X_i^2(p)\psi _i'(r_{i,[p]})-u_p'({\mathfrak {S}}_p+\tau \mathrm {Id})^{-1}u_p,\\ N_p&\triangleq \frac{1}{\sqrt{n}}\sum _{i=1}^n X_i(p)\psi _i(r_{i,[p]}). \end{aligned}$$

We call

$$\begin{aligned} {\mathfrak {b}}_p\triangleq \beta _0(p) \frac{\xi _n}{\tau +\xi _n}+\frac{1}{\sqrt{n}} \frac{N_p}{\tau +\xi _n}, \end{aligned}$$
(6)

and

$$\begin{aligned} {\widetilde{b}}= \begin{bmatrix} {\widehat{\gamma }}\\ \beta _0(p) \end{bmatrix} +[{\mathfrak {b}}_p-\beta _0(p)] \begin{bmatrix} -({\mathfrak {S}}_p+\tau \mathrm {Id})^{-1} u_p\\ 1 \end{bmatrix}. \end{aligned}$$
(7)

Theorem 2.3

Under our Assumptions, we have, for any fixed \(\tau >0\),

$$\begin{aligned} ||{\widehat{\beta }}-{\widetilde{b}}||\le \mathrm {O}_{L_k}\left( \frac{\text {polyLog}(n)}{[n^{1/2}\wedge n^{{\mathsf {e}}}]^2}\right) . \end{aligned}$$

In particular,

$$\begin{aligned} \sqrt{n}({\widehat{\beta }}_p-{\mathfrak {b}}_p)&=\mathrm {O}_{L_k}\left( \frac{\text {polyLog}(n)n^{1/2}}{[n^{1/2}\wedge n^{{\mathsf {e}}}]^2}\right) ,\\ \sup _i |X_i'({\widehat{\beta }}-{\widetilde{b}})|&=\mathrm {O}_{L_k}\left( \frac{\text {polyLog}(n)n^{1/2}}{[n^{1/2}\wedge n^{{\mathsf {e}}}]^2}\right) ,\\ \sup _{i}|R_i-r_{i,[p]}|&=\mathrm {O}_{L_k}\left( \left[ \frac{\text {polyLog}(n)}{\sqrt{n}\wedge n^{{\mathsf {e}}}}\right] \vee \left[ \frac{\text {polyLog}(n)n^{1/2}}{[n^{1/2}\wedge n^{{\mathsf {e}}}]^2}\right] \right) . \end{aligned}$$
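
As a numerical illustration of Theorem 2.3 (a hedged sketch of ours; the pseudo-Huber loss and Gaussian design are arbitrary choices), the quantities \(r_{i,[p]}\), \(u_p\), \({\mathfrak {S}}_p\), \(\xi _n\), \(N_p\) and \({\mathfrak {b}}_p\) can be assembled directly from their definitions and compared to the last coordinate of \({\widehat{\beta }}\). Note that the leave-one-predictor-out fit uses the true \(\beta _0(p)\), as in Eq. (5), so this is a proof device rather than an estimator.

```python
import numpy as np
from scipy.optimize import minimize

# Hedged illustration of Theorem 2.3: compute frak{b}_p from the
# leave-one-predictor-out fit and compare it to the last coordinate of beta_hat.
psi   = lambda x: x / np.sqrt(1.0 + x**2)
psi_p = lambda x: (1.0 + x**2) ** (-1.5)
rho   = lambda x: np.sqrt(1.0 + x**2) - 1.0

def fit(D, Z, tau):
    n, d = D.shape
    obj  = lambda b: np.mean(rho(Z - D @ b)) + 0.5 * tau * b @ b
    grad = lambda b: -D.T @ psi(Z - D @ b) / n + tau * b
    return minimize(obj, np.zeros(d), jac=grad, method="L-BFGS-B",
                    options={"ftol": 1e-14, "gtol": 1e-10, "maxiter": 2000}).x

rng = np.random.default_rng(4)
n, p, tau = 600, 300, 0.5
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p); beta0 /= np.linalg.norm(beta0)
Y = X @ beta0 + rng.standard_cauchy(n)

beta_hat = fit(X, Y, tau)

V, Xp = X[:, :-1], X[:, -1]                       # drop the last predictor
gamma_hat = fit(V, Y - Xp * beta0[-1], tau)       # solves Eq. (5); uses the true beta_0(p)
r_p = Y - Xp * beta0[-1] - V @ gamma_hat          # r_{i,[p]}

w = psi_p(r_p)
u_p = V.T @ (w * Xp) / n
S_p = (V * w[:, None]).T @ V / n
xi_n = np.mean(w * Xp**2) - u_p @ np.linalg.solve(S_p + tau * np.eye(p - 1), u_p)
N_p = Xp @ psi(r_p) / np.sqrt(n)
b_p = beta0[-1] * xi_n / (tau + xi_n) + N_p / (np.sqrt(n) * (tau + xi_n))

print("sqrt(n)*beta_hat_p =", np.sqrt(n) * beta_hat[-1],
      "   sqrt(n)*frak_b_p =", np.sqrt(n) * b_p,
      "   sqrt(n)*|difference| =", np.sqrt(n) * abs(beta_hat[-1] - b_p))
```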

Let us call

$$\begin{aligned} c_\tau =\frac{1}{n}\text {trace}\left( (S+\tau \mathrm {Id})^{-1}\right) , \quad \text {where } S=\frac{1}{n}\sum _{i=1}^n \psi _i'(R_i) X_i X_i'. \end{aligned}$$

We also have:

Proposition 2.4

Under our assumptions,

$$\begin{aligned} \left( \frac{p}{n}\right) ^2 {\mathbf {E}}_{}\left( ||{\widehat{\beta }}-\beta _0||_2^2\right) &= \frac{p}{n} \frac{1}{n}\sum _{i=1}^n {\mathbf {E}}_{}\left( [c_\tau \lambda _i \psi _i(\text {prox}(c_\tau \lambda _i^2 \rho _i)(\tilde{r}_{i,(i)}))]^2\right) \\ &\quad +\tau ^2 ||\beta _0||^2{\mathbf {E}}_{}\left( c_\tau ^2\right) +\mathrm {o}(1). \end{aligned}$$

Furthermore,

$$\begin{aligned} \sup _i |c_i-\lambda _i^2 c_\tau |=\mathrm {O}_{L_k}(n^{-1/2}\text {polyLog}(n)). \end{aligned}$$

2.2.5 Final steps and related results

Lemma 2.5

Under our assumptions, as n and p tend to infinity, \(\tilde{r}_{i,(i)}\) behaves like \(\epsilon _i+\lambda _i\sqrt{{\mathbf {E}}_{}\left( ||{\widehat{\beta }}-\beta _0||^2\right) }Z_i\), where \(Z_i\sim {{\mathcal {N}}}(0,1)\) is independent of \(\epsilon _i\) and \(\lambda _i\), in the sense of weak convergence.

Furthermore, if \(i\ne j\), \(\tilde{r}_{i,(i)}\) and \(\tilde{r}_{j,(j)}\) are asymptotically (pairwise) independent. The same is true for the pairs \((\tilde{r}_{i,(i)},\lambda _i)\) and \((\tilde{r}_{j,(j)},\lambda _j)\).
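
Lemma 2.5 can be looked at numerically as follows (a hedged sketch of ours, with a pseudo-Huber loss, \(\lambda _i=1\) and Gaussian errors so that the limiting law is easy to write down). Instead of n leave-one-out refits, the sketch reconstructs \(\tilde{r}_{i,(i)}\approx R_i+c_\tau \psi (R_i)\), a shortcut suggested by Theorem 2.2 together with the prox identity; this shortcut is ours and is only an approximation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hedged numerical look at Lemma 2.5 (illustration only).
psi   = lambda x: x / np.sqrt(1.0 + x**2)
psi_p = lambda x: (1.0 + x**2) ** (-1.5)
rho   = lambda x: np.sqrt(1.0 + x**2) - 1.0

rng = np.random.default_rng(6)
n, p, tau, sigma = 1000, 500, 0.5, 1.0
X = rng.standard_normal((n, p))                      # lambda_i = 1 here
beta0 = rng.standard_normal(p); beta0 /= np.linalg.norm(beta0)
Y = X @ beta0 + sigma * rng.standard_normal(n)

obj  = lambda b: np.mean(rho(Y - X @ b)) + 0.5 * tau * b @ b
grad = lambda b: -X.T @ psi(Y - X @ b) / n + tau * b
beta_hat = minimize(obj, np.zeros(p), jac=grad, method="L-BFGS-B").x

R = Y - X @ beta_hat
S = (X * psi_p(R)[:, None]).T @ X / n
c_tau = np.trace(np.linalg.inv(S + tau * np.eye(p))) / n
r_tilde = R + c_tau * psi(R)                         # reconstructed leave-one-out errors
r_hat = np.linalg.norm(beta_hat - beta0)             # plug-in for r_rho(kappa)

# Lemma 2.5 predicts r_tilde_{i,(i)} ~ eps_i + r*Z_i, here N(0, sigma^2 + r^2).
q = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
print("empirical quantiles of r_tilde   :", np.round(np.quantile(r_tilde, q), 3))
print("N(0, sigma^2 + r_hat^2) quantiles:",
      np.round(norm.ppf(q, scale=np.sqrt(sigma**2 + r_hat**2)), 3))
```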

Theorem 2.6

Under our assumptions, when \(p/n\rightarrow \kappa \in (0,\infty )\), \(||{\widehat{\beta }}-\beta _0||\rightarrow r_\rho (\kappa )\), where \(r_\rho (\kappa )\) is deterministic. Call \(W_i=\epsilon _i+\lambda _ir_\rho (\kappa )Z_i\), where \(Z_i\sim {{\mathcal {N}}}(0,1)\) is independent of \(\epsilon _i\) and \(\lambda _i\). Call

$$\begin{aligned} {\mathsf {G}}_n(x)&=\frac{1}{n}\sum _{i=1}^n {\mathbf {E}}_{}\left( \frac{1}{1+x\lambda _i^2\psi _i'(\text {prox}(x\lambda _i^2\rho _i)(W_i))}\right) \quad \hbox {and } {\mathsf {G}}(x)=\lim _{n\rightarrow \infty }{\mathsf {G}}_n(x)\;,\\ {\mathsf {H}}_n(x)&=\frac{1}{n}\sum _{i=1}^n {\mathbf {E}}_{}\left( [x \lambda _i \psi _i(\text {prox}(x \lambda _i^2 \rho _i)(W_i))]^2\right) \quad \hbox {and } {\mathsf {H}}(x)=\lim _{n\rightarrow \infty }{\mathsf {H}}_n(x). \end{aligned}$$

Under our assumptions, \(c_\tau \rightarrow c_\rho (\kappa )\) in probability, where \(c_\rho (\kappa )\) is the unique solution of the equation \({\mathsf {G}}(x)=1-\kappa +\tau x\). Furthermore, \(r_\rho (\kappa )\) solves

$$\begin{aligned} \kappa ^2 r^2_\rho (\kappa )=\kappa {\mathsf {H}}(c_\rho (\kappa )) +\tau ^2 ||\beta _0||^2 c^2_\rho (\kappa ). \end{aligned}$$

We note that the equation \({\mathsf {G}}(x)=1-\kappa +\tau x\) translates into the first equation of our system (4). This is a simple consequence of the properties of the derivative of Moreau’s proximal mapping—see Lemma 3.33.

The last equation of Theorem 2.6 is the second equation of our system (4). (The fact that the limits of \({\mathsf {G}}_n\) and \({\mathsf {H}}_n\) exist simply comes from our assumption that the proportion of times each possible triplet \((\rho _i,{{\mathcal {L}}}(\epsilon _i),{{\mathcal {L}}}(\lambda _i))\) appears has a limit as \(n\rightarrow \infty \).)

From this main theorem follow the following propositions.

Proposition 2.7

\(\xi _n\rightarrow \xi \) in probability, where \(\xi =\kappa /c_\rho (\kappa )-\tau >0\).

\(N_p\Longrightarrow {{\mathcal {N}}}(0,v^2)\) where

$$\begin{aligned} v^2=\lim _{n\rightarrow \infty } \frac{1}{n}\sum _{i=1}^n {\mathbf {E}}_{}\left( \lambda _i^2 \psi _i^2[\text {prox}(c_\rho (\kappa )\lambda _i^2\rho _i)(W_i)]\right) . \end{aligned}$$

Finally, when \(\beta _0(k)=\mathrm {O}(n^{-1/2})\),

$$\begin{aligned} \sqrt{n}[(\tau +\xi ){\widehat{\beta }}_k-\beta _0(k)\xi ]\Longrightarrow {{\mathcal {N}}}(0,v^2). \end{aligned}$$

The previous result can be used with \(v^2\) replaced by \({\hat{v}}_n^2=\frac{1}{n}\sum _{i=1}^n \lambda _i^2 \psi _i^2[\text {prox}(c_\tau \lambda _i^2\rho _i)(\tilde{r}_{i,(i)})]\) and \(\xi \) replaced by \(\omega _n=p/(nc_\tau )-\tau \) in testing applications—see the discussion after Proposition 3.30 for justifications. Naturally, since for all x, \(\lambda _i^2 \psi _i^2[\text {prox}(c_\tau \lambda _i^2\rho _i)(x)]= [x-\text {prox}(c_\tau \lambda _i^2\rho _i)(x)]^2/(c_\tau \lambda _i)^2\) when \(c_\tau >0\), \({\hat{v}}_n^2\) could also be written and computed using this alternative formulation.

We note that \(\omega _n\) is computable from the data. In our setup, \(\lambda _i\)’s are estimable using the scheme proposed in [14] and \({\hat{v}}_n^2\) can therefore also be estimated from the data. Hence, the previous proposition allows for testing the null hypothesis that \(\beta _0(k)=0\), for any \(1\le k \le p\).
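
A hedged sketch of this testing recipe is given below, in the simplest case \(\lambda _i=1\) (so no estimation of \(\lambda _i\) is needed) and with a pseudo-Huber loss of our choosing. To avoid n leave-one-out refits, it uses the approximation \(\psi _i(\text {prox}(c_\tau \lambda _i^2\rho _i)(\tilde{r}_{i,(i)}))\simeq \psi _i(R_i)\), which is suggested by Theorem 2.2 and Proposition 2.4; this shortcut is ours and should be understood as such.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hedged sketch: test H0: beta_0(k) = 0 using the CLT of Proposition 2.7,
# with xi replaced by omega_n and v by a plug-in estimate based on psi(R_i).
psi   = lambda x: x / np.sqrt(1.0 + x**2)
psi_p = lambda x: (1.0 + x**2) ** (-1.5)
rho   = lambda x: np.sqrt(1.0 + x**2) - 1.0

rng = np.random.default_rng(5)
n, p, tau = 800, 400, 0.5
X = rng.standard_normal((n, p))                   # lambda_i = 1
beta0 = np.zeros(p)
beta0[: p // 2] = rng.standard_normal(p // 2)     # coordinates k >= p//2 are null
beta0 /= np.linalg.norm(beta0)
Y = X @ beta0 + rng.standard_cauchy(n)

obj  = lambda b: np.mean(rho(Y - X @ b)) + 0.5 * tau * b @ b
grad = lambda b: -X.T @ psi(Y - X @ b) / n + tau * b
beta_hat = minimize(obj, np.zeros(p), jac=grad, method="L-BFGS-B").x

R = Y - X @ beta_hat
S = (X * psi_p(R)[:, None]).T @ X / n
c_tau = np.trace(np.linalg.inv(S + tau * np.eye(p))) / n
omega_n = p / (n * c_tau) - tau
v_hat = np.sqrt(np.mean(psi(R) ** 2))             # plug-in for v (lambda_i = 1)

T = np.sqrt(n) * (tau + omega_n) * beta_hat / v_hat   # approx N(0,1) under H0
pvals = 2 * norm.sf(np.abs(T))
print("rejection rate at 5% on null coordinates (should be near 0.05):",
      np.mean(pvals[p // 2:] < 0.05))
print("rejection rate at 5% on non-null coordinates (rough power):",
      np.mean(pvals[: p // 2] < 0.05))
```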

We are also now in a position to explain the behavior of the residuals.

Proposition 2.8

When our assumptions are satisfied and we further assume that \(\lambda _i\)’s are uniformly bounded, we have

$$\begin{aligned} \sup _{1\le i \le n}|R_i-\text {prox}(\lambda _i^2 c_\rho (\kappa ) \rho _i)(\tilde{r}_{i,(i)})|=\mathrm {o}_{L_k}(1). \end{aligned}$$

The behavior of the residuals is therefore qualitatively very different in this high-dimensional setting from its counterpart in the low-dimensional setting.

2.3 Discussion of assumptions and results

2.3.1 Why consider elliptical-like predictors?

The study of elliptical distributions is quite classical in multivariate statistics (see [1]). As pointed out by various authors (see, in the context of statistics and random matrix theory [8, 13, 20]), the Gaussian distribution has a very peculiar geometry in high-dimension. It is therefore important to be able to study models that break away from these geometric restrictions, which are not particularly natural from the point of view of data analysts.

Under our assumptions, in light of Lemma 3.37, it is clear that

$$\begin{aligned} \sup _{1\le i \le n}\left| \frac{||X_i||^2}{p}-\lambda _i^2\right| =\mathrm {o}_P(1), \quad \text {and } \sup _{i\ne j}\left| \frac{X_i'X_j}{p}\right| =\mathrm {o}_P(1). \end{aligned}$$

In the Gaussian case (or the Gaussian-like case of i.i.d entries for \(X_i\), with e.g. bounded entries satisfying the assumptions we stated above), \(\lambda _i=1\). Hence, Gaussian or Gaussian-like assumptions imply that predictor vectors are situated near a sphere and are nearly orthogonal. (This simple geometry is of course closely tied to—or a manifestation of—the concentration of measure for convex 1-Lipschitz functions of those random variables.)

This is clearly not the case for elliptical predictors, even though, under our assumptions, \(\mathrm {cov}\left( X_i\right) =\mathrm {Id}_p\) in the “elliptical” case we consider in the paper as well. So all the models we consider have the same covariance, but the corresponding datasets may have different geometric properties.

We show in the paper that the impact of the distribution of \(\lambda _i\)’s on the performance of the estimator depends on much more than its second moment, as Theorem 2.1 makes very clear. This is a situation that is similar to corresponding results in random matrix theory—see e.g. [13, 18]. It is therefore clear here again that predictor geometry (as measured by \(\lambda _i\)) plays a key role in the performance of our estimators in high-dimension. This is in sharp contrast with the low-dimensional setting—see [24]—which shows that in low-dimensional robust regression, what matters is only \(\mathrm {cov}\left( X_i\right) \).

These types of studies are also interesting and, we think, important, as they clearly show that there is little hope of statistically meaningful “universality” results derived from Gaussian design results: moving from independent Gaussian assumptions for the entries of \(X_i\) to i.i.d assumptions does not change the geometry of the predictors, which appears to be key here, as our proof’s reliance on concentration of quadratic forms in \({{\mathcal {X}}}_i\) makes clear. As such, while interesting on many counts, for instance to allow discrete predictors, moving from Gaussian to i.i.d assumptions is not a very significant perturbation of the model for statistical purposes. This is why we chose to work under elliptical assumptions. See also [8] for similar observations in a different statistical context.

In conclusion, the generalized elliptical models we study in this paper also prove that many models may be such that the predictors have the same covariance \(\mathrm {cov}\left( X_i\right) \) but yield very different performance when it comes to \(\lim ||{\widehat{\beta }}-\beta _0||\). They therefore provide a meaningful perturbation of the Gaussian assumption, give us insights into the impact of predictor geometry on the behavior of our estimators, and give us a rough idea of the subclass of models for which we can expect similar (or “universal”) performance for \({\widehat{\beta }}\).

Examples of distributions for \({{\mathcal {X}}}_i\) satisfying our concentration assumptions Corollary 4.10 in [27] shows that our assumptions are satisfied if \({{\mathcal {X}}}_i\) has independent entries bounded by \(1/(2\sqrt{{\mathsf {c}}})\). Theorem 2.7 in [27] shows that our assumptions are satisfied if \({{\mathcal {X}}}_i\) has independent entries with densities \(f_k\), \(1\le k\le p\), such that \(f_k(x)=\exp (-u_k(x))\) and \(u_k''(x)\ge \sqrt{c}\) for some \(c>0\). Then \({\mathsf {c}}=c/2\). This is in particular the case when \({{\mathcal {X}}}_i\) has i.i.d \({{\mathcal {N}}}(0,1)\) entries: then \(c=1\) and \({\mathsf {c}}=1/2\). We discuss briefly after Lemma 3.35 in the “Appendix” the impact of choosing other types of concentration assumptions.

2.3.2 Non-sparse \(\beta _0\): why consider \(\ell _2\)/ridge-regularization?

In this paper, we consider the case where \(\beta _0\) cannot—in general—be approximated in \(\ell _2\)-norm by a sparse vector. This situation is thought not to be uncommon in biology, where sparsity assumptions are sometimes in doubt (see, in a slightly different context, [12] and many similar references).

In other words, if \({\mathsf {s}}\) is a sparse vector (e.g. with support of size \(\mathrm {o}(p)\)), \(||\beta _0-{\mathsf {s}}||_2\) necessarily remains bounded away from 0 when \(\beta _0\) is diffuse (i.e. all of its entries are roughly of size \(p^{-1/2}\)). In the situation we consider, it is in fact unclear whether any estimator can be consistent in \(\ell _2\) for \(\beta _0\). One interesting aspect of our study is that the System (4) might allow us to optimize (at least in certain circumstances) over the functions \(\rho _i\)’s we consider to get the best performing estimator in the class of ridge-regularized robust regression estimators for \(\beta _0\) and hence potentially beat sparse estimators (in the same line of thought, there are of course numerous applied examples where ridge regression outperforms Lasso in terms of prediction error).

Finally, one benefit of our analysis is that we have a central limit theorem for the coordinates of \({\widehat{\beta }}\) (see Proposition 2.7), which makes testing possible. In the situation where \(\beta _0\) has some large entries (of size up to \(n^{-1/4-\eta }\), \(\eta >0\)) and many small ones [of size \(\mathrm {o}(n^{-1/2})\)], this central limit theorem and its more refined version in Proposition 3.30 could help in designing better performing estimators by using scaled versions of \(\{{\widehat{\beta }}_k\}_{k=1}^p\), which we would threshold according to the result of our test. In other words, these central limit theorems for the coordinates of \({\widehat{\beta }}\) are the gateway to the construction of Hodges-type estimators in the setup we consider.

2.3.3 A remark on the fixed design case

We have worked in this paper with a certain class of random designs. It is not unusual to do so in robust regression studies—see the classic papers by Portnoy [31, 32, 34]. In many areas of application, it is also unclear why statisticians should limit themselves to the study of fixed designs, in particular when they do not have control over the choice of the values of the predictors, i.e. they cannot design their experiments.

However, it is also interesting to understand what remains valid of our analysis in the case of fixed design. We note that our analysis already gives a few results in this direction.

In fact, since we have shown that \(\mathrm {var}\left( ||{\widehat{\beta }}-\beta _0||^2\right) \rightarrow 0\), it follows that

$$\begin{aligned} {\mathbf {E}}_{}\left( \mathrm {var}\left( ||{\widehat{\beta }}-\beta _0||^2|X\right) \right) \rightarrow 0 \text { and } \mathrm {var}\left( {\mathbf {E}}_{}\left( ||{\widehat{\beta }}-\beta _0||^2|X\right) \right) \rightarrow 0, \end{aligned}$$

because

$$\begin{aligned} \mathrm {var}\left( ||{\widehat{\beta }}-\beta _0||^2\right) ={\mathbf {E}}_{}\left( \mathrm {var}\left( ||{\widehat{\beta }}-\beta _0||^2|X\right) \right) + \mathrm {var}\left( {\mathbf {E}}_{}\left( ||{\widehat{\beta }}-\beta _0||^2|X\right) \right) . \end{aligned}$$

Therefore, with probability (over the design X) going to 1,

$$\begin{aligned} ||{\widehat{\beta }}-\beta _0||^2-r^2_\rho (\kappa )\rightarrow 0 \text { in } P_{\{\epsilon _i\}_{i=1}^n}\text {-probability}. \end{aligned}$$

(\(P_{\{\epsilon _i\}_{i=1}^n}\)-probability simply refers to probability statements with respect to the random \(\epsilon _i\)’s, the only source of randomness if the design matrix X is assumed to be fixed.) In other words, if the design is fixed, but results from one random draw of an \(n\times p\) matrix satisfying our distributional assumptions, Theorem 2.1 applies with probability (over the choice of design matrix) going to 1.

We note that \(||{\widehat{\beta }}-\beta _0||\) is an especially important quantity in terms of prediction error in our context, which is why our short discussion above focused on this quantity: if we are given a new predictor vector \(X_{new}\), we would naturally predict an unobserved response \(Y_{new}\) by \(X_{new}'{\widehat{\beta }}\) and hence, if \(Y_{new}=\epsilon _{new}+X_{new}'\beta _0\), our prediction error will be \(PE_{new}=\epsilon _{new}+X_{new}'(\beta _0-{\widehat{\beta }})\). Of course, if \(X_{new}\) has mean 0 and satisfies \(\mathrm {cov}\left( X_{new}\right) =\mathrm {Id}_p\), \({\mathbf {E}}_{X_{new}}\left[ (X_{new}'(\beta _0-{\widehat{\beta }}))^2\right] =||\beta _0-{\widehat{\beta }}||_2^2\). Hence the expected squared prediction error will be \(\mathrm {var}\left( \epsilon _{new}\right) +||\beta _0-{\widehat{\beta }}||_2^2\), provided \(\epsilon _{new}\) is independent of \(X_{new}\).

2.3.4 Optimization with respect to \(\tau \) and \(\rho \)

Just as the classic work of Huber on robust regression started by establishing central limit theorems for the estimator of interest (as a function of \(\rho \)) and proceeded to find optimal methods in various contexts (see [24]), one objective of our work is to pave the way for answering optimality questions in the setting we consider. An important first step to do so is therefore to obtain results such as Theorem 2.1.

A natural question is therefore to ask what are the optimal \(\rho _i\)’s in the context we consider, where optimality might be defined in terms of minimizing \(r_\rho (\kappa )\) in Theorem 2.1 or \(v^2\) in Proposition 2.7. For an example of such a study for \(r_\rho (\kappa )\) in a slightly different context, see [4]. Similarly, optimization over \(\tau \) should be possible. We leave however these questions for future work, since they are of a more analytic nature. (We have had success in [4] in the situation where \(\lambda _i=1\) and the errors are log-concave and hence not heavy-tailed, but the technique we employed in that paper does not apply readily here.)

We also note that in our context the optimal \(\tau \) – say for prediction – is in general not going to be close to 0, so the fact that our current study requires \(\tau >0\) is not a problem (see also [3]).

As can intuitively be seen from Proposition 2.7, the fact that \(||{\widehat{\beta }}-\beta _0||\) goes to a non-zero constant has two sources: bias induced by the ridge-regularization and the fact that each of the p coordinates has fluctuations of size \(n^{-1/2}\). Its asymptotically deterministic character comes on the other hand from our analysis in “Appendix 3”. This is in contrast with the low-dimensional case where p is fixed, where \(||{\widehat{\beta }}-\beta _0||\) goes to a non-zero constant simply because of bias issues and the law of large numbers.

In light of Proposition 2.7, it is clear that \(\tau \) plays in this problem a fairly similar role to the one it plays in low-dimension: it trades bias for variability in each individual coordinate. By contrast with the low-dimensional case however, even if we consider the case \(p<n\), the fact that the number of coordinates is of the same order of magnitude as n means that even for small \(\tau \) (and hence low-bias), \(||{\widehat{\beta }}-\beta _0||\) has a non-zero limit. The fact that this limit can be somewhat large is what suggests using values of \(\tau \) that are not close to 0. Interestingly, this is of course the situation encountered in practice in many real-world problems where p and n are large. A case in point is the situation where \(\beta _0=0\), in which case the optimal value of \(\tau \) is clearly \(\infty \); using \(\tau =0\) would put us back in the situation of [17], which would result in much worse performance for \(||{\widehat{\beta }}-\beta _0||\).

2.3.5 Possible extensions

Less smooth \(\rho \)’s and \(\psi \)’s While our approach is quite general and allows us to handle designs that are far from being Gaussian, the proof presented in this paper still requires some smoothness concerning \(\rho _i\)’s and \(\psi _i\)’s. On the one hand, results such as the ones obtained in [4] suggest that it is often the case that optimal loss functions in high-dimension are smoother than in low dimension. So the fact that we require \(\psi _i\)’s to be smooth is a source of less concern than it would be in low dimension. (Note also that the classic papers [28, 32] also require smoothness properties on \(\psi \).)

Though it is unclear whether the Huber function is optimal in any sense for the problems we are looking at, and hence whether it warrants a special focus, let us discuss this function in some detail. For the sake of simplicity let us focus on the situation where the transition from quadratic to linear happens at \(x=\pm 1\). Then

$$\begin{aligned} \psi (x)= {\left\{ \begin{array}{ll} x &{}\quad \text {if } |x|\le 1 \\ \text {sign}(x) &{} \quad \text {if } |x|\ge 1 \end{array}\right. }. \end{aligned}$$

So \(\psi \) is not differentiable at 1. However, it is easy to approximate this function by a function whose derivative is Lipschitz. As a matter of fact, if \(0< \eta <1\), the function \(\psi '_\eta \) defined by

$$\begin{aligned} \psi _\eta '(x)= {\left\{ \begin{array}{ll} 1 &{}\quad \text {if } |x|\le 1-\eta \\ \frac{1-|x|}{\eta } &{} \quad \text {if } |x| \in (1-\eta ,1) \\ 0 &{} \quad \text {if } |x|\ge 1 \end{array}\right. }, \end{aligned}$$

is \(1/\eta \)-Lipschitz. Furthermore, \(\psi _\eta \), the corresponding antisymmetric function given, when \(x\ge 0\), by

$$\begin{aligned} \psi _\eta (x)= {\left\{ \begin{array}{ll} x &{}\quad \text {if } 0\le x \le 1-\eta , \\ 1-\frac{\eta }{2}-\frac{(1-x)^2}{2\eta }&{}\quad \text {if } x \in (1-\eta ,1),\\ 1-\frac{\eta }{2}&{} \quad \text {if } x\ge 1, \end{array}\right. } \end{aligned}$$

can be made to be arbitrarily close to \(\psi \) and similarly for the corresponding \(\rho _\eta \), picked such that \(\rho _\eta (0)=0\).
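
For completeness, here is a direct transcription of \(\psi _\eta \) and \(\psi _\eta '\) into code, with a quick finite-difference consistency check; the value \(\eta =0.1\) is an arbitrary choice of ours.

```python
import numpy as np

# Transcription of the smoothed Huber score psi_eta and its derivative
# displayed above (transition at +/- 1), plus a numerical consistency check.
def psi_eta(x, eta):
    ax = np.abs(x)
    return np.where(ax <= 1 - eta, x,
           np.sign(x) * np.where(ax >= 1, 1 - eta / 2,
                                 1 - eta / 2 - (1 - ax) ** 2 / (2 * eta)))

def dpsi_eta(x, eta):
    ax = np.abs(x)
    return np.where(ax <= 1 - eta, 1.0,
           np.where(ax >= 1, 0.0, (1 - ax) / eta))

eta = 0.1
x = np.linspace(-2, 2, 2001)
num_deriv = np.gradient(psi_eta(x, eta), x)     # finite-difference derivative of psi_eta
print("max |finite-difference derivative - dpsi_eta| :",
      np.abs(num_deriv - dpsi_eta(x, eta)).max())
print("Lipschitz bound 1/eta =", 1 / eta,
      "  max observed slope of dpsi_eta:",
      np.abs(np.diff(dpsi_eta(x, eta)) / np.diff(x)).max())
```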

Our results apply to \(\rho _\eta \), for any \(\eta >0\). It seems quite likely that with a bit (and possibly quite a bit) of further approximation theoretic work, it should be possible to establish results similar to Theorem 2.1 for the Huber function by taking the limit of corresponding results for \(\rho _\eta \) with \(\eta \) arbitrarily small.

We note that most of our proof (in particular “Appendices 3 and 4”) is actually valid with functions \(\rho _i\)’s that can change with n. In particular, many results hold when \(\psi _i'\) are \(L_i(n)\)-Lipschitz with \(L_i(n)\le Cn^{\alpha }\). So one strategy to handle the case of the Huber function could be to use \(\psi _{\eta _n}\) with \(\eta _n=1/\log (n)\) for instance and strengthen the arguments of “Appendix 5”—in this very specific case where \(\psi _{\eta _n}\) has a limit—to get the Huber case as a limiting result. Because our proof is already long, we leave the details to the interested reader and might consider this problem in detail in future work.

Weighted robust regression One motivation for working on the problem at the level of generality we dealt with is that our results should allow us to tackle among other things weighted robust regression. For instance if \(\epsilon _i\)’s or \(\lambda _i\)’s in our model had different distributions, it would be natural to pick the corresponding \(\rho _i\)’s either as completely different functions, or maybe as \(\rho _i=w_i \rho \), with \(w_i\) deterministic but possibly depending on the distribution of \(\epsilon _i\)’s or \(\lambda _i\)’s. In the case where \(\epsilon _i\)’s and \(\lambda _i\)’s come from finitely many possible distributions, our results handle this situation.

Most of our results—i.e. those of “Appendices 3 and 4”—are true even when \(w_i\)’s are allowed to take a possibly infinite set of different values. If \(\epsilon _i\)’s are i.i.d, \(\lambda _i\)’s are i.i.d and \(w_i\)’s are i.i.d and these three groups of random variables are independent of each other, our arguments can be made to go through without much extra difficulty. The main potential problem is in “Appendix 5”, but then distributional symmetry between the \(R_i\)’s on one hand and the \(\tilde{r}_{i,(i)}\)’s on the other hand becomes helpful, as it did in [15]. So it is very likely that our results could be extended to cover this case at relatively little technical cost.

3 Conclusion

We have studied ridge-regularized robust regression estimators in the high-dimensional context where \(p/n\) has a finite non-zero limit. Our study has highlighted the importance of the geometry of the predictors in this problem: two models with similar covariance but different predictor geometry will in general yield estimators with very different performance. We have shown this result by studying the random design case in the context of elliptical predictors and looking at the influence of the “ellipticity parameter” \(\lambda _i\) on our results. Importantly, this shows that no statistically meaningful “universality” results can be derived from the study of Gaussian or i.i.d-designs, since their geometry is so peculiar (i.e. they are limited to the case \(\lambda _i=1\) for all i’s). The technique used in the paper seems versatile enough to be useful for several other high-dimensional M-estimation problems.

We have also obtained central limit theorems for the coordinates of \({\widehat{\beta }}\) that can be used for testing whether \(\beta _0(k)=0\) for any \(1\le k \le p\). However, our focus was mostly on the case where \(\beta _0\) is diffuse, with all coordinates small but contributing to \(Y_i=\epsilon _i+X_i'\beta _0\). Our results also provide a very detailed understanding of the properties of the residuals \(R_i\).

All these results were obtained without moment requirements on the errors \(\epsilon _i\)’s.

Finally, our characterization of the risk of these estimators raises interesting analytic questions related to finding optimal loss functions \(\rho _i\)’s in the context we consider. We plan to study these questions in the future.