On the impact of predictor geometry on the performance of high-dimensional ridge-regularized generalized robust regression estimators

We study ridge-regularized generalized robust regression estimators, i.e.

$$\widehat{\beta }={\mathrm{argmin}}_{\beta \in {\mathbb {R}}^p}\ \frac{1}{n}\sum _{i=1}^n \rho _i(Y_i-X_i'\beta )+\frac{\tau }{2}||\beta ||^2, \quad \text {where } \quad Y_i=\epsilon _i+X_i'\beta _0,$$

in the situation where p/n tends to a finite non-zero limit. Our study here focuses on the situation where the errors ε_i's are heavy-tailed and the X_i's have an "elliptical-like" distribution. Our assumptions are quite general and we do not require homoskedasticity of the ε_i's for instance. We obtain a characterization of the limit of ‖β̂ − β_0‖, as well as several other results, including central limit theorems for the entries of β̂.


Introduction
Robust regression estimators are a standard and important tool in the toolbox of modern statisticians. They were introduced in the late sixties [36] and important early results appeared shortly thereafter [22,23]. We recall that these estimators are defined as

$$\widehat{\beta}_\rho = \mathrm{argmin}_{\beta \in \mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^n \rho(Y_i - X_i'\beta), \qquad (1)$$

for ρ a function chosen by the user. Here Y_i is a scalar response and X_i is a vector of predictors in R^p. In the context we consider here, ρ will be a convex function. Naturally, one of the main reasons to use these estimators instead of the standard least-squares estimator is to increase the robustness of β̂_ρ to outliers in e.g. the Y_i's. Formally, this robustness result can be seen through results of Huber (see [24]), in the low-dimensional case where p is fixed. Huber showed that when Y_i = X_i'β_0 + ε_i, and when the ε_i's are i.i.d, under some mild regularity conditions, β̂_ρ is asymptotically normal with mean β_0 and (asymptotic) covariance

$$\frac{1}{n}\,\frac{E[\psi^2(\epsilon_1)]}{(E[\psi'(\epsilon_1)])^2}\,[\mathrm{cov}(X_1)]^{-1}, \quad \text{where } \psi = \rho'. \qquad (2)$$

The question of understanding the behavior of these estimators in the high-dimensional setting where p is allowed to grow with n was raised very early on in [23, p. 802, questions b-f]. These questions started being answered in the mid to late eighties in work of Portnoy and Mammen (e.g. [28, 31-34]). However, these papers covered the case where p/n → 0 while p → ∞.
In the papers [16,17], we explained (mixing, as in [23], rigorous arguments, simulations and heuristic arguments) that the case p/n → κ ∈ (0, 1) yielded a qualitatively completely different picture for this class of problems. For instance, under various technical assumptions, we explained that the risk ‖β̂_ρ − β_0‖² could be characterized through a system of two non-linear equations (sharing some characteristics with the one below), and that the distribution of the residuals could be found and was completely different from that of the ε_i's, by contrast with the low-dimensional case. Furthermore, we showed in [4] that maximum likelihood estimators were in general inefficient in high dimension and found dimension-adaptive loss functions ρ that yielded better estimators than the ones we would have gotten by using the standard maximum likelihood estimator, i.e. using ρ = − log f, where f is the density of the i.i.d errors ε_i's. (We subsequently showed in [15], which is an initial version of the current paper, that the techniques we had proposed in [16] could be made mathematically rigorous under various assumptions. See also the paper [9] that handles only the case of i.i.d Gaussian predictors, whereas El Karoui [15] can deal with more general assumptions on the predictors. Donoho and Montanari [9] also make interesting connections with the Shcherbina-Tirozzi model in statistical physics; see [38,40]. For other interesting results using rigorous approximate message passing techniques, see also e.g. [2].)

In the current paper, we study a generic extension of the robust regression problem involving ridge regularization. In other words, we study the statistical properties of

$$\widehat{\beta} = \mathrm{argmin}_{\beta \in \mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^n \rho_i(Y_i - X_i'\beta) + \frac{\tau}{2}\|\beta\|^2, \quad \text{where } Y_i = \epsilon_i + X_i'\beta_0. \qquad (3)$$

We will focus in particular on the case where there is no moment restriction on the ε_i's. Furthermore, a key element of the study will be to show that the performance of β̂ is driven by the Euclidean geometry of the set of predictors {X_i}_{i=1}^n. To do so, we will study "elliptical" models for the X_i's, i.e. X_i = λ_i 𝒳_i, where 𝒳_i has for instance independent entries. We note that when λ_i is independent of 𝒳_i and E(λ_i²) = 1, cov(X_i) = cov(𝒳_i). Hence these families of distributions for the X_i's have the same covariance, but as we will see they yield estimators whose performance varies quite substantially with the distribution of the λ_i's. As we explain below, the role of the λ_i's is to induce a "non-spherical geometry" on the predictors; understanding the impact of the λ_i's on the performance of β̂ is hence a way to understand how the geometry of the predictors affects the performance of the estimator. We note that in the low-dimensional case, when the X_i's are i.i.d, X'X/n → cov(X_1) in probability under mild assumptions, and hence the result of Huber mentioned in Eq. (2) shows that the limiting behavior of β̂_ρ defined in Eq. (1) is the same under "elliptical" and non-elliptical models.
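To fix ideas, here is a minimal numerical sketch of the estimator in Eq. (3), using the Huber loss as one possible choice of ρ_i; the function names, the choice of optimizer and the simulation parameters are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def huber(x, k=1.0):
    """Huber loss: quadratic near 0, linear in the tails."""
    a = np.abs(x)
    return np.where(a <= k, 0.5 * x**2, k * a - 0.5 * k**2)

def huber_psi(x, k=1.0):
    """psi = rho', the score function of the Huber loss."""
    return np.clip(x, -k, k)

def ridge_robust_fit(X, y, tau, k=1.0):
    """Minimize (1/n) sum_i rho(y_i - x_i' beta) + (tau/2) ||beta||^2, as in Eq. (3)."""
    n, p = X.shape
    obj = lambda b: huber(y - X @ b, k).mean() + 0.5 * tau * b @ b
    grad = lambda b: -X.T @ huber_psi(y - X @ b, k) / n + tau * b
    return minimize(obj, np.zeros(p), jac=grad, method="L-BFGS-B").x

# toy data: p/n = 1/2, heavy-tailed (Cauchy) errors, elliptical-like design X_i = lambda_i * calX_i
rng = np.random.default_rng(0)
n, p = 400, 200
lam = rng.uniform(0.5, 1.5, size=n)              # "ellipticity" factors
X = lam[:, None] * rng.standard_normal((n, p))
beta0 = rng.standard_normal(p) / np.sqrt(p)      # diffuse signal
y = X @ beta0 + rng.standard_cauchy(n)

beta_hat = ridge_robust_fit(X, y, tau=1.0)
print(np.linalg.norm(beta_hat - beta0))          # the quantity whose limit Theorem 2.1 characterizes
```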
Our interest in elliptical distributions stems from the fact that, as we intuited for a related problem in [16], the behavior of quantities of the type X_i'QX_i for Q deterministic is at the heart of the performance of β̂. Hence, studying elliptical distribution settings both sheds light on the impact of the geometry of predictors on the performance of the estimator and allows us to put to rest potential claims of "universality" of results obtained in the Gaussian (or geometrically similar) case. We note that in statistics there is a growing body of work showing the importance of predictor geometry in various high-dimensional problems (see e.g. [8,13,14,18,20]).
One main motivation for allowing ρ_i to change with i is that it might be natural to use different loss functions for different observations if we happen to have information about distributional inhomogeneities in {X_i, Y_i}_{i=1}^n. For instance, one group of observations could have errors coming from one distribution and a second group might have errors with a different distribution. Another reason is to gain information on the case of weighted regression, in which case ρ_i = w_i ρ. Also, this analysis can be used to justify rigorously some of the claims made in [16]. Finally, it may prove useful in some bootstrap studies (see e.g. [19]).
In the current paper, we consider the situation where β_0 is "diffuse", i.e. all of its coordinates are small and it cannot be well approximated by a sparse vector. In this situation, use of ridge/ℓ_2 penalization is natural. The paper also answers the question, raised by other researchers in statistics, of knowing whether the techniques of the initial version [15] could be used in the situation we are considering here. Finally, the paper shows that some of the heuristics of [3] can be rigorously justified.
When ρ_i = ρ for all i, a natural question is to know whether we can find an optimal ρ, in terms of prediction error for instance, as a function of the law of the ε_i's, in effect asking questions similar to the ones answered by Huber [24] in low dimension and in [4] in high dimension. However, the constraints we impose in the current paper on both the errors (i.e. we do not want them to have moments) and the functions ρ_i make part of the argument in [4] not usable and might require new ideas. So we will consider this "optimization over the ρ_i's and τ" in future work, given that the current proof is already long.
The problem and setup considered in this paper are more natural in the context of robust regression than the ones studied in the initial version [15], where the chosen setup was targeted towards problems related to suboptimality of maximum likelihood methods. However, the strategy for the proof of the results here is similar to the strategy we devised in the initial version [15]. There are three main conceptual novelties that create important new problems: handling ellipticity and the fact that β_0 ≠ 0 requires new ideas in the second part of the proof (i.e. "Appendix 4"). Dealing with heavy tails and appropriate loss functions impacts the whole proof and requires many changes compared to the proof of [15]. Conceptually, this latter part is also the most important, as it shows that all the approximations made in earlier heuristic papers are valid, even in the presence of heavy-tailed errors. This situation is of course the one where these approximations, while having clearly shown their usefulness in giving conceptual and heuristic understanding of the statistical problem, were the most mathematically "suspicious". So it is interesting to see that they can be made to work rigorously, especially since the probabilistic heuristics developed in these earlier papers allow researchers to shed light quickly on non-trivial statistical problems.
We now state our results. We believe our notations are standard but refer the reader to section Notations (immediately before Eq. 9 below) in case clarification is needed.

Results
The main focus of the paper is on understanding the properties of the estimator β̂ defined in Eq. (3), with τ > 0. For all 1 ≤ i ≤ n, we have ε_i ∈ R and X_i ∈ R^p. We prove four main results in the paper:
1. we characterize the ℓ_2-risk of our estimator, i.e. ‖β̂ − β_0‖_2;
2. we describe the behavior of the residuals R_i = Y_i − X_i'β̂ and relate them to the leave-one-out prediction error r̃_{i,(i)} = Y_i − X_i'β̂_{(i)};
3. we obtain an approximate update formula for β̂ when adding an observation (and show it is very accurate);
4. we provide central limit theorems for the individual coordinates of β̂.
For the sake of clarity, we provide in the main text a series of assumptions that guarantee that our results hold. However, a more detailed and less restrictive statement of our assumptions is provided in the "Appendix".

Preliminaries and overview of technical assumptions
We use the notation prox(ρ) to denote the proximal mapping of the function ρ, which is assumed to be convex throughout the paper. This notion was introduced in [29]. We recall that prox(cρ)(x) = argmin_{y∈R} [cρ(y) + (1/2)(x − y)²], or equivalently, for differentiable ρ with ψ = ρ', prox(cρ)(x) is the unique solution y of y + cψ(y) = x. We refer the reader to [5,29], or [37, Sect. 7.3], for more details on this operation. Note that the previous definitions imply that for all x, prox(cρ)(x) + cψ(prox(cρ)(x)) = x.
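As a complement to the examples of proximal mappings given in "Appendix 6", here is a small numerical sketch (not from the paper; the closed form below for the Huber loss is a standard computation) checking the defining identity prox(cρ)(x) + cψ(prox(cρ)(x)) = x.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber(y, k=1.0):
    """Huber loss rho with transition at +/- k."""
    return 0.5 * y**2 if abs(y) <= k else k * abs(y) - 0.5 * k**2

def prox_numeric(rho, c, x):
    """prox(c*rho)(x) = argmin_y c*rho(y) + 0.5*(x - y)^2, computed numerically."""
    return minimize_scalar(lambda y: c * rho(y) + 0.5 * (x - y)**2).x

def prox_huber(c, x, k=1.0):
    """Closed form for c * Huber: solve y + c*psi(y) = x explicitly."""
    return x / (1.0 + c) if abs(x) <= k * (1.0 + c) else x - c * k * np.sign(x)

c = 0.8
for x in [-3.0, -0.7, 0.2, 1.4, 5.0]:
    y = prox_numeric(huber, c, x)
    psi_y = np.clip(y, -1.0, 1.0)            # psi = rho' for the Huber loss with k = 1
    assert abs(y + c * psi_y - x) < 1e-5     # prox(c rho)(x) + c psi(prox(c rho)(x)) = x
    assert abs(y - prox_huber(c, x)) < 1e-5  # numerical prox matches the closed form
```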
We give examples of proximal mappings in "Appendix 6". We now state some sufficient assumptions that guarantee that all the results stated below are correct. The main proofs are in the "Appendix". The proofs in the "Appendix" are carried out at a much greater level of generality than what we are about to state, and various aspects of those proofs require much weaker assumptions than those we present here. We start by giving an example where all of our conditions are met.

Example Our conditions are met when
• p/n → κ ∈ (0, ∞).
• The ε_i's are i.i.d Cauchy (with median at 0).
• X_i = λ_i 𝒳_i, where λ_i ∈ R and 𝒳_i ∈ R^p are independent. The λ_i's are i.i.d with bounded support; the 𝒳_i's are i.i.d with i.i.d N(0, 1) entries, or i.i.d entries with bounded support and mean 0 as well as variance 1.
We note that this last condition is satisfied for a smoothed approximation of the Huber function, where the discontinuity in ψ′ at say 1 is replaced by a linear interpolation; see below for more details. Note however that the Huber function has a priori no statistical optimality properties in the context we consider.
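For concreteness, a small sketch (illustrative parameters only) of how such elliptical-like predictors can be generated, with E(λ_i²) = 1 so that cov(X_i) = Id_p as in the discussion above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 50

# lambda_i: bounded, two-point distribution normalized so that E[lambda_i^2] = 1
lam = rng.choice([0.5, np.sqrt(1.75)], size=n)

# calX_i: i.i.d N(0, Id_p) entries (one of the allowed choices above)
calX = rng.standard_normal((n, p))
X = lam[:, None] * calX

# same covariance as the Gaussian design calX ...
print(np.allclose(np.cov(X, rowvar=False), np.eye(p), atol=0.2))

# ... but a different geometry: ||X_i|| / sqrt(p) tracks |lambda_i| instead of concentrating near 1
print(np.corrcoef(np.linalg.norm(X, axis=1) / np.sqrt(p), lam)[0, 1])
```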

Sufficient conditions for our results to hold
• p/n has a finite non-zero limit.
• The ρ_i's are chosen from finitely many possible convex functions. If ψ_i = ρ_i′, sup_i ‖ψ_i‖_∞ ≤ K and sup_i ‖ψ_i′‖_∞ ≤ K, for some K. ψ_i′ is also assumed to be Lipschitz-continuous. Also, for all x ∈ R, sign(ψ_i(x)) = sign(x) and ρ_i(x) ≥ ρ_i(0) = 0.
• X_i = λ_i 𝒳_i, where the 𝒳_i's are i.i.d with independent entries. The λ_i's are independent and independent of the 𝒳_i's. The 𝒳_i's satisfy a concentration property in the sense that if G is a convex 1-Lipschitz function (with respect to Euclidean norm), P(|G(𝒳_i) − m_G| > t) ≤ C exp(−ct²) for any t > 0, m_G being a median of G(𝒳_i). We require the same assumption to hold when considering the columns of the n × p design matrix X. The 𝒳_i's have mean 0 and cov(𝒳_i) = Id_p. We also assume that the coordinates of 𝒳_i have moments of all orders. Furthermore, for any given k, the kth moment of the entries of 𝒳_i is assumed to be bounded independently of n and p.
• E(λ_i²) = 1 and sup_{1≤i≤n} |λ_i| grows at most like C(log n)^k for some k. The λ_i's may have different distributions, but the number of such possible distributions is finite.
• The ε_i's are independent. They may have different distributions, but the number of such possible distributions is finite. Those distributions are assumed to have densities that are differentiable, symmetric and unimodal. Furthermore, we assume that if f_i is the density of one such distribution, lim_{|t|→∞} t f_i(t) = 0.
• The fraction of the time each possible combination of functions and distributions for (ρ_i, L(ε_i), L(λ_i)) appears in our problem has a limit as n → ∞. (L(ε_i) and L(λ_i) are the laws of ε_i and λ_i.)

We now state our most important results (several others are in the "Appendix", where we give the proofs) and our proof strategy; naturally, the two go together to provide a sketch of proof. We postpone our discussion of both the assumptions and our results to Sect. 2.3.

Characterization of the risk of β
Consider β̂ defined in Eq. (3) and assume that τ > 0 is given, i.e. does not change with p and n. Under the technical assumptions detailed in Sect. 2.1, we have: We note that in case λ_i takes the value 0, we can replace the expression on the left-hand side by that on the right-hand side, which does not involve dividing by λ_i². This alternative expression also shows that there is no problem taking expectations in our equations.
The previous system can be reformulated in terms of prox([c_ρ(κ)λ_i²ρ_i]*), where f* represents the Fenchel-Legendre dual of f. Indeed, Moreau's prox identity [29] gives, for a convex function f,

prox(f)(x) + prox(f*)(x) = x.

This is partly why we chose to write the system as we did, since it can be rephrased purely in terms of prox([c_ρ(κ)λ_i²ρ_i]*), a formulation that has proven useful in previous related problems (see [4]).
We note that r_ρ(κ) and c_ρ(κ) will in general depend on τ, but we do not index those quantities by τ to avoid cumbersome notations.

Organization of the proof and strategy
The proof is quite long so we now explain the main ideas and organization of the argument. Recall that β̂ is characterized by ∇F(β̂) = 0, where F(β) = (1/n)Σ_{i=1}^n ρ_i(Y_i − X_i'β) + (τ/2)‖β‖². The proof is broadly divided into three steps.
First step. The first idea is to relate β̂ and β̂_{(i)}, the solution of our optimization problem when the pair (X_i, Y_i) is excluded from the problem. It is reasonable to expect that adding (X_i, Y_i) will not change r̃_{j,(i)} = Y_j − X_j'β̂_{(i)} too much when j ≠ i, and hence that r̃_{j,(i)} ≃ R_j = Y_j − X_j'β̂ when j ≠ i. Armed with this intuition, we can try to use a first-order Taylor expansion of β̂ around β̂_{(i)} in the equation ∇F(β̂) = 0 to relate the two vectors. This is what the first part of the proof does, by surmising an approximation η_i for β̂ − β̂_{(i)}, following along the intuitive lines above but non-trivial to come up with at the level of precision we need. Much work is devoted to proving that this very informed guess is sufficiently accurate for our purposes. Since "the only thing we know" about β̂ is that ∇F(β̂) = 0, we work on ∇F(β̂) − ∇F(β̂_{(i)} + η_i) to do so, and show in our preliminaries (see "Appendix 2") that controlling this latter quantity is enough to control β̂ − β̂_{(i)} − η_i. Once this bound is established, we use a martingale inequality to deduce a bound on var(‖β̂ − β_0‖²), which we show goes to zero. The corresponding results are presented in Sect. 2.2.3 and the detailed mathematical analysis is in "Appendix 3".
Second step. The second step of the proof is to relate β̂ to another quantity γ̂, which is the solution of our optimization problem when the last column of the matrix X is excluded from the problem; see Sect. 2.2.4 below and "Appendix 4" for the detailed mathematical analysis. Call V the corresponding design matrix. In our setting, it is reasonable to expect that the corresponding residuals r_{i,[p]} = Y_i − V_i'γ̂ are close to the R_i's. A Taylor expansion of ∇F(β̂) around (γ̂', β_0(p))' and further manipulations yield an informed "guess", denoted b̂ below, for β̂, and in particular for β̂_p, the last coordinate of β̂. A large amount of work is devoted to proving that the quantity we surmised, denoted b_p below, approximates β̂_p sufficiently well for our purposes, once again by doing delicate computations on the corresponding gradients. Since b_p has a reasonably nice probabilistic representation, it is possible to write E(b_p²) in terms of other quantities appearing in the problem, such as ψ_i(r_{i,[p]}) (where ψ_i = ρ_i') and a quantity c_{τ,p} that is the trace of the inverse of a certain random matrix. Because b_p approximates β̂_p sufficiently well, our approximation of E(b_p²) can be used to yield a good approximation of E‖β̂ − β_0‖². However, we want the approximation of E‖β̂ − β_0‖² to not depend on quantities that depend on p, such as r_{i,[p]} and c_{τ,p}. Further work is needed to show that the approximation of E‖β̂ − β_0‖² can be made in terms of the r̃_{i,(i)}'s, which we used in the first part of the proof, and a new quantity c_τ, which is the trace of the inverse of a certain random matrix, as was c_{τ,p}. The resulting approximation for E‖β̂ − β_0‖² is essentially the second equation of our system; see Proposition 2.4 for instance.
Third step. The last part of the proof (see Sect. 2.2.5 and "Appendix 5" for the detailed mathematical analysis) is devoted to first characterizing the asymptotic behavior of the leave-one-out prediction errors r̃_{i,(i)}. The work done previously in the proof is extremely useful for that. Finally, we show that c_τ is asymptotically deterministic. The characterization of c_τ is essentially the first equation of our system; see Theorem 2.6 below. After all this is established, we can state for instance central limit theorems for β̂_p and interesting quantities that appear in our proof.
The following few subsubsections make all our intermediate results precise. Armed with the above explanation for our approach, they provide the reader with a clear overview of the arc of our proof. The detailed mathematical analysis is given in the "Appendix".

Leave-one-observation out approximations
We call R_j = Y_j − X_j'β̂ the residuals. We consider the situation where we leave the ith observation, (X_i, Y_i), out. We call β̂_{(i)} the corresponding leave-one-out estimator, and we use the notation r̃_{j,(i)} = Y_j − X_j'β̂_{(i)}. Note that the r̃_{j,(i)}'s are simply the leave-one-out residuals (for j ≠ i) and the leave-one-out prediction error (for j = i). Let us consider the approximation β̃_i of β̂ constructed from β̂_{(i)} (see Eq. (20) in the "Appendix"). We have the following theorem. Also, a stronger version of this theorem is available in the "Appendix". (We say that a sequence of random variables There are two main reasons this theorem is interesting: first, it provides online-update formulas for β̂ through β̃_i, with guaranteed approximation errors. Second, it relates the full residuals, whose statistical and probabilistic properties are quite complicated, to the much-simpler-to-understand "leave-one-out" prediction error, r̃_{i,(i)}. Indeed, because X_i is independent of β̂_{(i)} under our assumptions, the statistical properties of β̂_{(i)}'X_i are much simpler to understand than those of β̂'X_i.
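Continuing the illustrative sketch given after Eq. (3) (so reusing the hypothetical ridge_robust_fit, X, y and beta_hat defined there), the leave-one-out quantities above can be computed directly; the theorems quantify how close these objects are.

```python
import numpy as np

i = 0
beta_loo = ridge_robust_fit(np.delete(X, i, axis=0), np.delete(y, i), tau=1.0)

R_i = y[i] - X[i] @ beta_hat            # full residual R_i
r_tilde = y[i] - X[i] @ beta_loo        # leave-one-out prediction error r~_{i,(i)}

print(np.linalg.norm(beta_hat - beta_loo))   # small: removing one observation barely moves the fit
print(R_i, r_tilde)                          # the theorem relates R_i to a prox transform of r~_{i,(i)}
```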

Leave-one-predictor out approximations
Let V be the n × (p − 1) matrix corresponding to the first (p − 1) columns of the design matrix X. We call V_i ∈ R^{p−1} the vector corresponding to the first p − 1 entries of X_i, i.e. V_i = (X_i(1), ..., X_i(p − 1)). We call X^{(p)} the vector in R^n with jth entry X_j(p), i.e. the pth entry of the vector X_j. When this does not create problems, we also use the standard notation X_{j,p} for X_j(p).
Let us call γ̂ the solution of our optimization problem when we use the design matrix V instead of X. In other words, For stating the following results, we will rely heavily on the following definitions: Definition We call r_{i,[p]} = Y_i − V_i'γ̂ the corresponding residuals. We have u_p ∈ R^{p−1} and S_p is (p − 1) × (p − 1). We call We call and b = (γ̂', β_0(p))'. Theorem 2.3 Under our Assumptions, we have, for any fixed τ > 0, In particular, Let us call We also have:

Lemma 2.5 Under our assumptions, as n and p tend to infinity, r̃_{i,(i)} behaves like ε_i + r_ρ(κ)λ_i Z, where Z ∼ N(0, 1) is independent of ε_i and λ_i, in the sense of weak convergence. Furthermore, if i ≠ j, r̃_{i,(i)} and r̃_{j,(j)} are asymptotically (pairwise) independent. The same is true for the pairs (r̃_{i,(i)}, λ_i) and (r̃_{j,(j)}, λ_j).

Theorem 2.6 Under our assumptions, when p/n → κ,

and

Under our assumptions, c_τ → c_ρ(κ) in probability, where c_ρ(κ) is the unique solution of the equation G(x) = 1 − κ + τx. We note that the equation G(x) = 1 − κ + τx translates into the first equation of our system (4). This is a simple consequence of the properties of the derivative of Moreau's proximal mapping; see Lemma 3.33.
The last equation of Theorem 2.6 is the second equation of our system (4). (The fact that the limits of G_n and H_n exist simply comes from our assumption that the proportion of times each possible triplet (ρ_i, L(ε_i), L(λ_i)) appears has a limit as n → ∞.) From this main theorem follow the propositions below.
Finally, when β_0(k) = O(n^{−1/2}), the previous result can be used with v² replaced by v̂² and ξ replaced by ω_n = p/(n c_τ) − τ in testing applications; see the discussion after Proposition 3.30 for justifications. Naturally, when c_τ > 0, v̂²_n could also be written and computed using this alternative formulation.
We note that ω_n is computable from the data. In our setup, the λ_i's are estimable using the scheme proposed in [14] and v̂²_n can therefore also be estimated from the data. Hence, the previous proposition allows for testing the null hypothesis that β_0(k) = 0, for any 1 ≤ k ≤ p.
We are also now in position to explain the behavior of the residuals.

Why consider elliptical-like predictors?
The study of elliptical distributions is quite classical in multivariate statistics (see [1]). As pointed out by various authors (see, in the context of statistics and random matrix theory, [8,13,20]), the Gaussian distribution has a very peculiar geometry in high dimension. It is therefore important to be able to study models that break away from these geometric restrictions, which are not particularly natural from the point of view of data analysts. Under our assumptions, in light of Lemma 3.37, it is clear that ‖X_i‖/√p ≃ |λ_i|, whereas for i ≠ j, X_i'X_j/p ≃ 0. In the Gaussian case (or the Gaussian-like case of i.i.d entries for X_i, with e.g. bounded entries which satisfy the assumptions we stated above), λ_i = 1. Hence, Gaussian or Gaussian-like assumptions imply that predictor vectors are situated near a sphere and are nearly orthogonal. (This simple geometry is of course closely tied to, or a manifestation of, the concentration of measure for convex 1-Lipschitz functions of those random variables.) This is clearly not the case for elliptical predictors, though under our assumptions cov(X_i) = Id_p, even in the "elliptical" case we consider in the paper. So all the models we consider have the same covariance but the corresponding datasets may have different geometric properties.
We show in the paper that the impact of the distribution of the λ_i's on the performance of the estimator involves much more than its second moment, as Theorem 2.1 makes very clear. This situation is similar to corresponding results in random matrix theory; see e.g. [13,18]. It is therefore clear here again that predictor geometry (as measured by λ_i) plays a key role in the performance of our estimators in high dimension. This is in sharp contrast with the low-dimensional setting (see [24]), where what matters in robust regression is only cov(X_i).
These types of studies are also interesting and, we think, important as they clearly show that there is little hope of statistically meaningful "universality" results derived from Gaussian designs: moving from independent Gaussian assumptions for the entries of X_i to i.i.d assumptions does not change the geometry of the predictors, which appears to be key here, as our proof's reliance on concentration of quadratic forms in X_i makes clear. As such, while interesting on many counts, for instance to allow discrete predictors, moving from Gaussian to i.i.d assumptions is not a very significant perturbation of the model for statistical purposes. This is why we chose to work under elliptical assumptions. See also [8] for similar observations in a different statistical context.
In conclusion, the generalized elliptical models we study in this paper also show that many models may be such that the predictors have the same covariance cov(X_i) but yield very different performance when it comes to lim ‖β̂ − β_0‖. They therefore provide a meaningful perturbation of the Gaussian assumption, give us insights into the impact of predictor geometry on the behavior of our estimators, and give us a rough idea of the subclass of models for which we can expect similar (or "universal") performance for β̂.
Examples of distributions for X_i satisfying our concentration assumptions Corollary 4.10 in [27] shows that our assumptions are satisfied if X_i has independent entries bounded by 1/(2√c). Theorem 2.7 in [27] shows that our assumptions are satisfied if X_i has independent entries with densities f_k, 1 ≤ k ≤ p, such that f_k(x) = exp(−u_k(x)) and u_k''(x) ≥ c for some c > 0; the concentration constant is then c/2. This is in particular the case when X_i has i.i.d N(0, 1) entries: then c = 1 and the constant is 1/2. We discuss briefly after Lemma 3.35 in the "Appendix" the impact of choosing other types of concentration assumptions.

Non-sparse β 0 : why consider 2 /ridge-regularization?
In this paper, we consider the case where β_0 cannot, in general, be approximated in ℓ_2-norm by a sparse vector. This situation is thought to not be uncommon in biology (see, in a slightly different context, [12], and many similar references), where sparsity assumptions are sometimes in doubt.
In other words, if s is a sparse vector (e.g. with support of size o(p)), we necessarily have, when β_0 is diffuse (i.e. all of its entries are roughly of size p^{−1/2}), that ‖β_0 − s‖ does not go to 0. In the situation we consider, it is in fact unclear whether any estimator can be consistent in ℓ_2 for β_0. One interesting aspect of our study is that the System (4) might allow us to optimize (at least in certain circumstances) over the functions ρ_i we consider to get the best performing estimator in the class of ridge-regularized robust regression estimators for β_0 and hence potentially beat sparse estimators (in the same line of thought, there are of course numerous applied examples where ridge regression outperforms the Lasso in terms of prediction error).
Finally, one benefit of our analysis is that we have a central limit theorem for the coordinates of β̂ (see Proposition 2.7), which makes testing possible. In the situation where β_0 has some large entries (of size up to n^{−1/4−η}, η > 0) and many small ones (of size o(n^{−1/2})), this central limit theorem and its more refined version in Proposition 3.30 could help in designing better performing estimators by using scaled versions of {β̂_k}_{k=1}^p, which we would threshold according to the result of our test. In other words, these central limit theorems for the coordinates of β̂ are the gateway to the construction of Hodges-type estimators in the setup we consider.
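Purely as a schematic illustration of that last remark, and assuming a valid plug-in standard error se_k for each coordinate is available (the normalizing quantities of Propositions 2.7 and 3.30 are not reproduced here), a Hodges-type thresholded estimator could look like the following sketch; the function and names are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def hodges_threshold(beta_hat, se, alpha=0.05):
    """Keep a coordinate only if it is significantly non-zero at level alpha,
    with a Bonferroni correction over the p coordinates; se[k] is an assumed
    plug-in standard error for beta_hat[k], not derived here."""
    p = len(beta_hat)
    z = beta_hat / se
    keep = np.abs(z) > norm.ppf(1.0 - alpha / (2.0 * p))
    return np.where(keep, beta_hat, 0.0)
```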

A remark on the fixed design case
We have worked in this paper with a certain class of random designs. It is not unusual to do so in robust regression studies; see the classic papers by Portnoy [31,32,34]. In many areas of application, it is also unclear why statisticians should limit themselves to the study of fixed designs, in particular when they do not have control over the choice of the values of the predictors, i.e. they cannot design their experiments.
However, it is also interesting to understand what remains valid of our analysis in the case of fixed design. We note that our analysis gives already a few results in this direction.
In fact, since we have shown that var(‖β̂ − β_0‖²) → 0, we have shown that Therefore, with probability (over the design X) going to 1, (ε-probability simply refers to probability statements with respect to the random ε_i's, the only source of randomness if the design matrix X is assumed to be fixed.) In other words, if the design is fixed, but results from one random draw of an n × p matrix satisfying our distributional assumptions, Theorem 2.1 applies with probability (over the choice of design matrix) going to 1.
We note that ‖β̂ − β_0‖ is an especially important quantity in terms of prediction error in our context, which is why our short discussion above focused on this quantity: if we are given a new predictor vector X_new, we would naturally predict an unobserved response Y_new by X_new'β̂ and hence, if Y_new = ε_new + X_new'β_0, our prediction error will be PE_new = ε_new + X_new'(β_0 − β̂). Of course, if X_new has mean 0 and satisfies cov(X_new) = Id_p, the expected squared prediction error will be var(ε_new) + ‖β_0 − β̂‖_2², provided ε_new is independent of X_new.
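Spelling out that last claim (a one-line computation, assuming the relevant moments of ε_new exist and conditioning on the training data, so that β̂ is treated as fixed):

$$E\left[PE_{new}^2 \mid \widehat{\beta}\right] = E\left[\left(\epsilon_{new} + X_{new}'(\beta_0-\widehat{\beta})\right)^2 \mid \widehat{\beta}\right] = \mathrm{var}(\epsilon_{new}) + \|\beta_0-\widehat{\beta}\|_2^2,$$

since the cross term vanishes by independence of ε_new and X_new (and E(X_new) = 0), while E[(X_new'v)²] = v' cov(X_new) v = ‖v‖_2² when cov(X_new) = Id_p.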

Optimization with respect to τ and ρ
Just as the classic work of Huber on robust regression started by establishing central limit theorems for the estimator of interest (as a function of ρ) and proceeded to find optimal methods in various contexts (see [24]), one objective of our work is to pave the way for answering optimality questions in the setting we consider. An important first step to do so is therefore to obtain results such as Theorem 2.1.
A natural question is therefore to ask what the optimal ρ_i's are in the context we consider, where optimality might be defined in terms of minimizing r_ρ(κ) in Theorem 2.1 or v² in Proposition 2.7. For an example of such a study for r_ρ(κ) in a slightly different context, see [4]. Similarly, optimization over τ should be possible. We leave these questions for future work, however, since they are of a more analytic nature. (We have had success in [4] in the situation where λ_i = 1 and the errors are log-concave and hence not heavy-tailed, but the technique we employed in that paper does not apply readily here.) We also note that in our context the optimal τ, say for prediction, is in general not going to be close to 0, so the fact that our current study requires τ > 0 is not a problem (see also [3]).
As can intuitively be seen from Proposition 2.7, the fact that ‖β̂ − β_0‖ goes to a non-zero constant has two sources: bias induced by the ridge regularization and the fact that each of the p coordinates has fluctuations of size n^{−1/2}. Its asymptotically deterministic character comes, on the other hand, from our analysis in "Appendix 3". This is in contrast with the low-dimensional case where p is fixed, where ‖β̂ − β_0‖ goes to a non-zero constant simply because of bias issues and the law of large numbers.
In light of Proposition 2.7, it is clear that τ plays in this problem a fairly similar role to the one it plays in low dimension: it trades bias for variability in each individual coordinate. By contrast with the low-dimensional case however, even if we consider the case p < n, the fact that the number of coordinates is of the same order of magnitude as n means that even for small τ (and hence low bias), ‖β̂ − β_0‖ has a non-zero limit. The fact that this limit can be somewhat large is what suggests using values of τ that are not close to 0. Interestingly, this is of course the situation encountered in practice in many real-world problems where p and n are large. A case in point is the situation where β_0 = 0, in which case the optimal value of τ is clearly ∞; using τ = 0 would put us back in the situation of [17], which would result in much worse performance for ‖β̂ − β_0‖.

Possible extensions
Less smooth ρ's and ψ's While our approach is quite general and allows us to handle designs that are far from being Gaussian, the proof presented in this paper still requires some smoothness of the ρ_i's and ψ_i's. On the one hand, results such as the ones obtained in [4] suggest that it is often the case that optimal loss functions in high dimension are smoother than in low dimension. So the fact that we require the ψ_i's to be smooth is a source of less concern than it would be in low dimension. (Note also that the classic papers [28,32] also require smoothness properties of ψ.) Though it is unclear whether the Huber function is optimal in any sense for the problems we are looking at, and hence whether it warrants a special focus, let us discuss this function in some detail. For the sake of simplicity let us focus on the situation where the transition from quadratic to linear happens at x = ±1. Then ψ(x) = x for |x| ≤ 1 and ψ(x) = sign(x) for |x| > 1, so ψ is not differentiable at 1. However, it is easy to approximate this function by a function whose derivative is Lipschitz. As a matter of fact, if 0 < η < 1, one can construct ψ_η whose derivative ψ_η′ is 1/η-Lipschitz. Furthermore, ψ_η, the corresponding antisymmetric function given explicitly for x ≥ 0, can be made arbitrarily close to ψ, and similarly for the corresponding ρ_η, picked such that ρ_η(0) = 0.
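One concrete choice matching this description (a hypothetical construction for illustration, not necessarily the one used in the paper: ψ_η′ falls linearly from 1 to 0 over [1, 1 + η], so ψ_η′ is (1/η)-Lipschitz and sup|ψ_η − ψ| ≤ η/2):

```python
import numpy as np

def psi_huber(x):
    """Huber score with transition at +/- 1: psi(x) = x for |x| <= 1, sign(x) otherwise."""
    return np.clip(x, -1.0, 1.0)

def psi_eta(x, eta=0.1):
    """Smoothed score: the jump of psi' at 1 is replaced by a linear interpolation on [1, 1+eta]."""
    a, s = np.abs(x), np.sign(x)
    out = np.where(a <= 1.0, a,                           # identity region, as for the Huber psi
          np.where(a <= 1.0 + eta,
                   a - (a - 1.0)**2 / (2.0 * eta),        # quadratic blend: slope goes from 1 to 0
                   1.0 + eta / 2.0))                      # constant tail
    return s * out

x = np.linspace(-3, 3, 7)
print(np.max(np.abs(psi_eta(x) - psi_huber(x))))          # at most eta / 2
```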
Our results apply to ρ_η, for any η > 0. It seems quite likely that with a bit (and possibly quite a bit) of further approximation-theoretic work, it should be possible to establish results similar to Theorem 2.1 for the Huber function by taking the limit of the corresponding results for ρ_η with η arbitrarily small.
We note that most of our proof (in particular "Appendices 3 and 4") is actually valid for functions ρ_i that can change with n. In particular, many results hold when ψ_i is L_i(n)-Lipschitz with L_i(n) ≤ Cn^α. So one strategy to handle the case of the Huber function could be to use ψ_{η_n} with η_n = 1/log(n) for instance and strengthen the arguments of "Appendix 5", in this very specific case where ψ_{η_n} has a limit, to get the Huber case as a limiting result. Because our proof is already long, we leave the details to the interested reader and might consider this problem in detail in future work.
Weighted robust regression One motivation for working on the problem at the level of generality we dealt with is that our results should allow us to tackle, among other things, weighted robust regression. For instance, if the ε_i's or λ_i's in our model had different distributions, it would be natural to pick the corresponding ρ_i's either as completely different functions, or maybe as ρ_i = w_i ρ, with w_i deterministic but possibly depending on the distribution of the ε_i's or λ_i's. In the case where the ε_i's and λ_i's come from finitely many possible distributions, our results handle this situation.
Most of our results, i.e. those of "Appendices 3 and 4", are true even when the w_i's are allowed to take a possibly infinite set of different values. If the ε_i's are i.i.d, the λ_i's are i.i.d and the w_i's are i.i.d, and these three groups of random variables are independent of each other, our arguments can be made to go through without much extra difficulty. The main potential problem is in "Appendix 5", but then distributional symmetry between the R_i's on one hand and the r̃_{i,(i)}'s on the other hand becomes helpful, as it did in [15]. So it is very likely that our results could be extended to cover this case at relatively little technical cost.

Conclusion
We have studied ridge-regularized robust regression estimators in the high-dimensional context where p/n has a finite non-zero limit. Our study has highlighted the importance of the geometry of the predictors in this problem: two models with similar covariance but different predictor geometry will in general yield estimators with very different performance. We have shown this result by studying the random design case in the context of elliptical predictors and looking at the influence of the "ellipticity parameter" λ_i on our results. Importantly, this shows that no statistically meaningful "universality" results can be derived from the study of Gaussian or i.i.d designs, since their geometry is so peculiar (i.e. they are limited to the case λ_i = 1 for all i). The technique used in the paper seems versatile enough to be useful for several other high-dimensional M-estimation problems.
We have also obtained central limit theorems for the coordinates of β̂ that can be used for testing whether β_0(k) = 0 for any 1 ≤ k ≤ p. However, our focus was mostly on the case where β_0 is diffuse, with all coordinates small but contributing to Y_i = ε_i + X_i'β_0. Our results also provide a very detailed understanding of the properties of the residuals R_i.
All these results were obtained without moment requirements on the errors ε_i. Finally, our characterization of the risk of these estimators raises interesting analytic questions related to finding optimal loss functions ρ_i in the context we consider. We plan to study these questions in the future.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix 1: Assumptions and technical elements
Recall that the focus of the paper is on understanding the properties of

$$\widehat{\beta} = \mathrm{argmin}_{\beta \in \mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^n \rho_i(Y_i - X_i'\beta) + \frac{\tau}{2}\|\beta\|^2,$$

where τ > 0. For all 1 ≤ i ≤ n, we have ε_i ∈ R and X_i ∈ R^p. Different parts of the proof require different assumptions. So we label them accordingly.
Most of our proof ("Appendices 3 and 4") is carried out for functions ρ i,n that may vary with n, so our assumptions reflect this and we carry out most of our work at this level of generality. However, we do not make the dependence of ρ i,n on n explicit to avoid cumbersome notations. Having these results available should make future work on weighted regression or work of a more approximation-theoretic nature (for instance using a sequence ρ i,n to approximate a function ρ i that is not smooth) easier. This is one of the prime motivations for working at this level of generality.
Naturally, our assumptions are more and more restrictive as the proof progresses, so the summary of assumptions we provided in the main text is obtained by going through the assumptions and simply tallying the more restrictive ones. A sketch of proof is provided in Sect. 2.2.2, which should be helpful in navigating the detailed proof we provide in this "Appendix".
Before we delve into the details of the assumptions needed for each part of the proof to work, we summarize for the convenience of the reader the assumptions we need for the whole proof to go through.

Assumptions under which the whole proof goes through
• A1: p/n has a finite non-zero limit and ψ_i is assumed to be Lipschitz. The functions ρ_i can be chosen among finitely many possible functions (over all n).
and identically distributed. Their distribution is allowed to change with p and n.
The entries of X_i are independent. Furthermore, for any 1-Lipschitz (with respect to Euclidean norm) convex function G, P(|G(X_i) − m_G| > t) ≤ C_n exp(−c_n t²), where m_G is a median of G(X_i); C_n and c_n can vary with n. For simplicity, we assume that 1/c_n = O(polyLog(n)) and C_n is bounded in n. The X_i's have mean 0 and cov(X_i) = Id_p. We also assume that the coordinates of X_i have moments of all orders. Furthermore, for any given k, the kth moment of the entries of X_i is assumed to be bounded independently of n and p. Also, for any 1 ≤ k ≤ p, the vectors (X_1(k), ..., X_n(k)) in R^n satisfy the same concentration property: for any 1-Lipschitz (with respect to Euclidean norm) convex function G, P(|G((X_1(k), ..., X_n(k))) − m_G| > t) ≤ C_n exp(−c_n t²); C_n and c_n can vary with n. As above, we assume that 1/c_n = O(polyLog(n)).
and of each other. Furthermore, for any r ∈ R, if Z ∼ N(0, 1), independent of ε_i, ε_i + rZ has a (differentiable) density f_{i,r} which is increasing on (−∞, 0) and decreasing on (0, ∞). Finally, lim_{|t|→∞} t f_{i,r}(t) = 0. The ε_i's and λ_i's may come from different distributions. However, the number of choices for the triplet (ρ_i, L(λ_i), L(ε_i)) is finite (over all n). Furthermore, the fraction of times each such triplet appears in our problem (see Eq. (8)) has a limit. (L(ε_i) just means the law of ε_i.) We note that the entries of X_i do not need to have the same distribution. Our condition A4 is satisfied when, for instance, the entries of X_i are bounded by polyLog(n) or Gaussian. Condition A7 just means that if for instance ρ_i = ρ and the λ_i's are i.i.d but the ε_i's can come from 3 distributions, the fraction of ε_i's coming from each of these three distributions has a limit as n → ∞. This last condition mostly plays a role in guaranteeing that we can take limits in various expressions. The simplest case is of course when ρ_i = ρ, the λ_i's are i.i.d and the ε_i's are i.i.d, in which case there is only one possible choice for the triplet (ρ_i, L(λ_i), L(ε_i)).
We now state the conditions under which we carry out the proof. We state them in one place for the convenience of the reader. A discussion follows immediately after the statement of all the conditions.

First part of the proof ("Appendix 3")
For the first part of the proof (i.e. "leave-one-Observation-out"), we work under the following assumptions: • O1: p/n has a finite non-zero limit.
• O2: sup_i ‖ψ_i‖_∞ ≤ C polyLog(n), where C is constant. This is natural in the context of robust statistics, since it means that we allow the ρ_i's to grow at most linearly at infinity. This assumption is for instance verified for Huber functions.
and identically distributed. Their distribution is allowed to change with p and n. Furthermore, for any 1-Lipschitz (with respect to Euclidean norm) convex function G, P(|G(X_i) − m_G| > t) ≤ C_n exp(−c_n t²), where m_G is a median of G(X_i); C_n and c_n can vary with n. For simplicity, we assume that 1/c_n = O(polyLog(n)) and C_n is bounded in n. The X_i's have mean 0 and cov(X_i) = Id_p. We also assume that the coordinates of X_i have moments of all orders. Furthermore, for any given k, the kth moment of the entries of X_i is assumed to be bounded independently of n and p.
Note that we do not assume that the ε_i's have identical distributions. Assumption O4 is satisfied for instance when the X_i's are N(0, Id_p) or have i.i.d entries bounded by polyLog(n); see [27] (this reference guarantees the concentration result we require is satisfied; the moment conditions need to be checked by other methods, but this is generally much simpler, as the case of Gaussian random variables clearly shows). Importantly, note that O4 does not require the entries of X_i to be independent; see [27] or [13] for examples of X_i satisfying O4 with dependent entries. In other respects, the assumption E(λ_i²) = 1 plays a very minor role mathematically and could be relaxed to E(λ_i²) being uniformly bounded without problems. Statistically, it is however important as it guarantees that cov(X_i) = Id_p in all the models we consider.

Second part of the proof ("Appendix 4")
For the second part of the proof (i.e. "leave-one-Predictor-out"), we need all the previous assumptions and
• P1: The X_i's have independent entries. Furthermore, for 1 ≤ k ≤ p, the vectors (X_1(k), ..., X_n(k)) in R^n satisfy: for any 1-Lipschitz (with respect to Euclidean norm) convex function G, P(|G((X_1(k), ..., X_n(k))) − m_G| > t) ≤ C_n exp(−c_n t²); C_n and c_n can vary with n. As above, we assume that 1/c_n = O(polyLog(n)).
constant independent of p and n. e satisfies α + 1/4 − e < 0.
• P4: 1/2 − 2α > 0 and min(1/2, e) − α − 1/4 > 0. The latter implies that min(1/2, e) − α > 0.
We note that according to Corollary 4.10 and the discussion that follows in [27], Assumptions O4 and P1 are compatible. O4 and P1 are for instance satisfied if the entries of the X_i's are independent and bounded by polyLog(n). Another example is the case of X_i ∼ N(0, Id_p), in which case c_n is a constant independent of the dimension.
Note that we do not assume that the entries of X i have the same distribution. We note that if for instance α = 1/12 and e = 5/12, all the conditions in P3-P4 are satisfied. When α = 0, they simply become e > 1/4.

Last part of the proof ("Appendix 5")
For the last part of the proof, when we combine everything together, we will need the following assumptions on top of all the others:
• F1: the ε_i's may have different distributions; however, they may only come from finitely many distributions. Furthermore, for any r ∈ R, if Z ∼ N(0, 1), independent of ε_i, ε_i + rZ has a differentiable density f_{i,r} which is increasing on (−∞, 0) and decreasing on (0, ∞). Finally, lim_{|t|→∞} t f_{i,r}(t) = 0.
• F3: α < 1/6 and α + 1/3 < 2 min(1/2, e).
• F4: there exists C independent of n and p such that E(λ_i⁴) ≤ C.
• F5: the λ_i's may have different distributions, but the set of possible distributions for λ_i is finite. Similarly, the ρ_i's may be different functions, but the set of possible functions ρ_i is finite. Also, the number of distinct triplets (ρ_i, L(ε_i), L(λ_i)) is finite (over all n). Furthermore, the proportion of each such distinct triplet has a limit as n → ∞.
Condition F3 is clearly satisfied in the case α = 1/12 and e = 5/12 we mentioned above. On the other hand, condition F5 requires that α = 0, since it prevents ρ_i from changing with n. (We note that since F5 is required only at the very end of the proof, one could probably weaken its requirements considerably if another situation than the one we investigate really called for it.) We refer the reader to Lemma 3.39 and the discussion immediately following it for examples of densities for the ε_i's satisfying F1. We note that smooth symmetric (around 0) log-concave densities will for instance satisfy all the assumptions we made about the ε_i's. See [25,26] for instance. This is also the case for the Cauchy distribution (see Theorem 1.6 in [7]). The latter is the most relevant reference here since we care about heavy-tailed ε_i's.

Discussion of the assumptions
Assumptions concerning the loss functions We wanted to investigate in this paper the situation where the ε_i's have no moment restrictions, as befits "classical" robust statistics studies. As such, it is natural to assume that the ψ_i's remain bounded, which is part of assumption A3. We think that interesting results can be found in "Appendices 3 and 4": in particular for dealing with ψ_i functions that are not smooth and require approximations by smooth ψ_{i,n} functions, but also for bootstrap studies for instance. This is why the paper handles those cases, even though we do not fully use these results in the main statements of the paper. We note that in specific cases, one would simply need to modify various arguments in "Appendix 5" to handle limits of those ψ_{i,n}.
Assumptions concerning the predictors Assumption O4 is a bit stronger than we will need. For instance, "Appendices 3 and 4" do not actually require the X_i's to have identical distributions; "Appendix 5" would work if we assumed that the X_i's were coming from finitely many distributions, with the proportion of X_i's picked from a particular distribution having a limit as n → ∞. The functions G we are working with will either be linear forms or square roots of quadratic forms, so we could limit our assumptions to those functions. However, as documented in [27] and discussed briefly in the introduction, a large number of natural or "reasonable" distributions satisfy the O4 assumptions. Our choice of having a potentially varying c_n is motivated by the idea that we could, for instance, relax an assumption of boundedness of the entries of the X_i's (which guarantees that O4 is satisfied when X_i has independent entries) and replace it by an assumption concerning the moments of the entries of the X_i's and a truncation of triangular arrays argument (see for instance [42]). We also refer the interested reader to [13] for a short list of distributions satisfying O4, compiled from various parts of [27]. Finally, we could replace the exp(−c_n t²) upper bound in O4 by exp(−c_n t^β) for some fixed β > 0 and it seems that all our arguments would go through. We chose not to work under these more general assumptions because it would involve extra book-keeping and would not enlarge the set of distributions we can consider enough to justify this extra technical cost. Importantly, O4 allows the entries of the X_i's to be dependent.
To give a concrete example, let us consider the situation where the entries of X_i are independent and symmetric with an exponential density chosen to have variance 1. Then it is clear that sup_{i,j} |X_i(j)| ≤ K[log(n)]² almost surely as n, p → ∞. Our analysis and assumptions then apply to the predictors obtained by truncating the entries of X_i at K[log(n)]² and rescaling by s_n, where s_n is chosen so that 1/s_n² = var_{i,n}, the variance of the truncated entries (this variance is not 1, since it comes from a truncation of X_i, but it is easy to see that var_{i,n} → 1). Note that the truncated entries coincide with those of X_i almost surely, and therefore our statistical problem is not affected. Very minor modifications to the arguments of "Appendix 5" are then needed to handle s_n and show that our results go through. Naturally the same argument could be made for other (non-exponential) distributions as long as sup_{i,j} |X_i(j)| grows at most like a power of log(n). We note that our method should also be able to handle c_n such that 1/c_n grows faster than polyLog(n) and hence deal with an even broader class of predictor distributions, but we chose not to do this in full detail to limit the book-keeping burden in an already long proof.
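A small sketch of this truncation-and-rescaling device (illustrative only; the constant K, the Laplace example and the direction of the rescaling are assumptions made for the sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 1000, 200

# symmetric exponential (Laplace) entries scaled to have variance 1
calX = rng.laplace(scale=1.0 / np.sqrt(2.0), size=(n, p))

# truncate at K (log n)^2, then rescale so the truncated entries have variance exactly 1
K = 1.0
level = K * np.log(n) ** 2
trunc = np.clip(calX, -level, level)
s_n = 1.0 / np.sqrt(trunc.var())       # 1 / s_n^2 = variance of the truncated entries
X_trunc = s_n * trunc

print((trunc != calX).mean())          # truncation is essentially never active here
print(X_trunc.var())                   # ~ 1 after rescaling, and s_n -> 1 as n grows
```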
Notations We will repeatedly use the following notations: Y_i = X_i'β_0 + ε_i; polyLog(n) is used to denote a power of log(n); λ_max(M) denotes the largest eigenvalue of the matrix M; |||M|||_2 denotes the largest singular value of M. We call Σ̂ = (1/n)Σ_{i=1}^n X_i X_i' the usual sample covariance matrix of the X_i's when the X_i's are known to have mean 0.
We write X =_L Y to say that the random variables X and Y are equal in law. We use the usual notation β̂_{(i)} to denote the regression vector we obtain when we do not use the pair (X_i, Y_i) or (X_i, ε_i) in our optimization problem, a.k.a. the leave-one-out estimate. We will also use the notation ..., X_n}. We use the notation (a, b) for either the interval (a, b) or the interval (b, a): in several situations, we will have to localize quantities in intervals using two values a and b but we will not know whether a < b or a > b. We denote by X the n × p design matrix whose ith row is X_i'. We write a ∧ b for min(a, b) and a ∨ b for max(a, b). If A and B are two symmetric matrices, A ⪰ B means that A − B is positive semi-definite, i.e. A is greater than B in the positive-definite/Loewner order. The notations o_P, O_P are used with their standard meanings, see e.g. [41, p. 12] for definitions. For the random variable W, we use the definition ‖W‖_{L_k} = (E|W|^k)^{1/k}. For sequences of random variables W_n, Z_n, we use the notation Remarks We call Note that under our assumptions on ρ, β̂ is defined as the solution of Recall the following important definitions.
Definition We call Proof Let β_1 and β_2 be two vectors in R^p. We have by definition We can use the mean value theorem to write (recall that we do not care about the order of the endpoints in our notation). Hence, which we write where This shows that Since the ρ_i's are convex, ψ_i' = ρ_i'' is non-negative and S_{β_1,β_2} is positive semi-definite. In the semi-definite order, we have S_{β_1,β_2} + τ Id_p ⪰ τ Id_p. In particular, Proposition 3.1 yields the following lemma.

Lemma 3.2 For any
The lemma is a simple consequence of Eq. (15) since by definition f(β̂) = 0.
Our strategy in what follows is to come up with "good candidates" for β 1 , for which we can control f (β 1 ) and transfer the information we will glean about the statistical properties of β 1 to β through Lemma 3.2.

On β and β − β 0
We show in the following lemma that ‖β̂‖ and ‖β̂ − β_0‖ cannot be too large.

Lemma 3.3 Let us call W_n(b) = (1/n)Σ_{i=1}^n ψ_i(ε_i + X_i'b)X_i.

Also, Therefore, under our assumptions O1-O6, Similarly, for any finite k, In the case k = 2, we have the more precise bound Proof The first and key inequality simply comes from applying Lemma 3.2 with β_1 = 0, after noticing that f(0) = −W_n(β_0). The second one comes from using β_1 = β_0 and noticing that f(β_0) = −W_n(0) + τβ_0. We note that under our assumptions, according to Lemma 3.38, which gives all the results about L_k bounds.

The last result about k = 2 follows from computing E‖W_n(0)‖² = (p/n)·(1/n)Σ_{i=1}^n E[ψ_i²(ε_i)] and using the bound sup_i ‖ψ_i‖_∞² on each E[ψ_i²(ε_i)]. • About E‖W_n(0)‖²: For the sake of clarity, we now explain in detail why E‖W_n(0)‖² takes this value. Since the ε_i's and X_i's are independent, using the fact that E(X_i) = 0 and that the ψ_i's are bounded, the cross terms vanish and we have E‖W_n(0)‖² = (1/n²)Σ_{i=1}^n E[ψ_i²(ε_i)] E‖X_i‖² = (p/n)·(1/n)Σ_{i=1}^n E[ψ_i²(ε_i)], since E‖X_i‖² = p. This gives the announced result.

Appendix 3: Approximating β̂ by β̂_{(i)}: leave-one-observation-out
We consider the situation where we leave the ith observation, (X_i, Y_i), out. By definition, We call r̃_{j,(i)} = Y_j − X_j'β̂_{(i)}. We also call We have of course Let us consider where These definitions and the approximations they will imply can be understood in light of the probabilistic heuristics we derived for a related problem in [16,17]. The interested reader is referred to those papers, where we made a large effort to explain our intuitive ideas, for more information and intuition; given page limit requirements, we do not give a complete heuristic derivation of our results and refer the reader to Sect. 2.2.2 for a detailed explanation of our strategy. We note however that the rigorous proof requires refinements over the intuitive ideas. Those aspects are of a more technical nature and become apparent only through the analysis that we present here.
One of our aims is to show Theorem 3.9 below, which shows that we can very accurately approximate β̂ by β̃_i. Note that the statistical properties of β̃_i are easier to understand than those of β̂; our high-quality approximations will allow us to transfer our understanding of β̃_i to β̂.

Deterministic bounds
Proposition 3.4 We have, with β̃_i defined in Eq. (20), where and Proof The proof and strategy are similar to the corresponding ones in [15]. However, since there are delicate cancellations in the argument, we give all the details. We recall that By the mean-value theorem, we also have In light of the previous simplifications, we have, using Since by definition, In other respects, When ρ is differentiable, x − cψ(prox(cρ)(x)) = prox(cρ)(x), almost by definition of the proximal mapping (see Lemma 3.31). Therefore, We conclude that Applying Lemma 3.2, we see that Clearly, controlling R_i is the key to controlling β̂ − β̃_i, so we need to develop insights into R_i.

Lemma 3.5 We have
and We note that under our assumptions, we have and, using Lemma 3.35 Proof This proof is essentially obvious. We refer the reader to a corresponding one given in [15] in case details are needed.
On γ * (X j , β (i) , η i ) and related quantities We now show how to control 1 , which is essential for turning Eq. (23) into a useful bound. We proceed by first getting a better understanding of Eq. (26).

Lemma 3.6 Suppose, as in our assumption
Proof By definition, we have The bound follows immediately, using the fact that ψ i is L i (n)-Lipschitz.

Stochastic aspects
Recall that we have by definition We can therefore bound R_i by Therefore, we also have This bound on R_i shows that we can control β̂ − β̃_i in L_k provided we can control each term in the above product in L_{3k}, by appealing to Hölder's inequality and Proposition 3.4.
We now turn our attention to the various elements of the bound on R i and show that we can control them under our assumptions.
We will control X_j'(S_i + τ Id)^{−1}X_i/n by appealing to Lemma 3.36.
in L k , for any finite k. Note also that under our Assumption O4, for any finite k, Proof The proof follows from that of Lemma 2.3 in [15]. Indeed, The proof of Lemma 2.3 in [15] shows that (1); this latter result is shown in [15] (see the discussion after Lemma 2.3 or Lemma 3.35 in the current "Appendix" applied to Now our assumptions O6 concerning sup i |λ i | = O L k (polyLog(n)) guarantee that the bounds we announced are valid.

Consequences
We have the following result. Recall that ψ i is assumed to be Lipschitz with Lipschitz constant L i (n).  polyLog(n)). This latter result is shown in Lemma 3.38. The statement concerning sup 1≤i≤n R i follows by the same arguments.

Proposition 3.8 Under Assumptions O1-O6, we have
We can now state and prove the following result, which relates residuals to leave-one-out prediction errors and gives a way to do online updates from β (i) to β.
We recall that β i is defined in Eq. (20).

Theorem 3.9 Under Assumptions O1-O7, we have, for any fixed k, when τ is held fixed and
In particular, we have Finally, We note that we could state a slightly finer result involving L i (n) and various powers of ψ i ∞ . However, we will not need such fine results in what follows, so we opt for slightly coarser but easier-to-state statements.
Proof The first two results simply follow from our work on R i .
The third result follows from the coarse bound and the fact that sup 1≤ j≤n X j √ n = O L k (polyLog(n)) under our assumptions. Our results on β − β i give control of the first term. Control of the second term follows from Lemma 3.7 and the assumption that ψ i is bounded by CpolyLog(n).
Let us now turn to the final result, i.e. the approximation of the residual R_i by a non-linear function of the leave-one-out prediction error r i,(i). Recall that Now, given the definition of β i, we have Hence, since almost by definition, if y = prox(cρ)(x), then y + cψ(y) = x, we get So we have established that and the result follows from our previous bounds.
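As a quick numerical sanity check of the proximal identity used here, the sketch below assumes the Huber loss (for which prox(cρ) has a closed form) and verifies that prox(cρ)(x) + cψ(prox(cρ)(x)) = x and that prox(cρ)(0) = 0.

```python
# Sanity check (assuming the Huber loss, for which prox(c*rho) has a closed form)
# of the identity used above: if y = prox(c*rho)(x), then y + c*psi(y) = x,
# and prox(c*rho)(0) = 0.
import numpy as np

def psi(y, k=1.345):                       # psi = rho' for the Huber rho
    return np.clip(y, -k, k)

def prox_huber(x, c, k=1.345):             # closed-form proximal mapping of c*rho
    return np.where(np.abs(x) <= k * (1 + c), x / (1 + c), x - c * k * np.sign(x))

x = np.linspace(-5.0, 5.0, 11)
c = 0.7
y = prox_huber(x, c)
print(np.max(np.abs(y + c * psi(y) - x)))  # ~0, confirming y + c*psi(y) = x
print(prox_huber(np.array([0.0]), c))      # [0.], confirming prox(c*rho)(0) = 0
```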

On the limiting variance of ||β||² and ||β − β_0||²
An interesting consequence of our leave-one-observation-out work is that we can use the ideas and approximations developed above to show that ||β − β_0|| and ||β|| are asymptotically deterministic (in other words, they can be approximated asymptotically by deterministic sequences).
Therefore ||β||² has a deterministic equivalent in probability and in L_2.
Proof We use the Burkholder/Efron-Stein inequality to show that var(||β||²) goes to 0 as n → ∞. In what follows, we rely on our approximations and our assumptions to have enough moments for all the expectations of the type E||β||^{2k} to be bounded like polyLog(n)/τ^{2k}. Note that this is the content of our Lemma 3.3.
Recall the Efron-Stein inequality [11]: if W is a function of n independent random variables, and W^(i) is any function of all those random variables except the ith, then var(W) ≤ ∑_{i=1}^n E[(W − W^(i))²]. In our arguments below, ||β||² plays the role of W and ||β^(i)||² plays the role of W^(i).
We first observe that Of course, using the fact that β by the Cauchy-Schwarz inequality, since E||β||^k exists and is bounded by K polyLog(n)/τ^k. Using the results of Theorem 3.9, we see that On the other hand, given the definition in Eq. (20), Since β^(i) and S_i are independent of X_i, and |||( n ), using our Assumption O4 on X_i applied to linear forms. Recall also that sup_i ||ψ_i||_∞ = O(polyLog(n)). Therefore, we see that both terms are O_{L_2}(polyLog(n)/(n c_n^{1/2})). We conclude that then Taking W = ||β||² and W^(i) = ||β^(i)||² in the Efron-Stein inequality, we clearly see that This shows that ||β||² has a deterministic equivalent in probability and in L_2.
• About ||β − β_0||². The results are obtained in a similar fashion, using our bound on ||β − β_0|| and replacing everywhere in the arguments β by β − β_0, and β^(i) by β^(i) − β_0. The condition ||β_0|| = O(polyLog(n)) plays a role in guaranteeing, in Lemma 3.3, that Let us now provide some more details. As above, we write Of course, using the fact that by the Cauchy-Schwarz inequality, since E||β − β_0||^k exists and is bounded by K polyLog(n)/τ^k. This latter fact follows from Assumption O7 and Lemma 3.3.
Using the results of Theorem 3.9, we see that provided α < 1/2. As above, given the definition in Eq. (20), Since β^(i) − β_0 and S_i are independent of X_i, and |||( n ), using our Assumption O4 on X_i applied to linear forms. Recall also that sup_i ||ψ_i||_∞ = O(polyLog(n)). Therefore, we see that both terms are O_{L_2}(polyLog(n)/nc We conclude that then Taking now W = ||β − β_0||² and W^(i) = ||β^(i) − β_0||² in the Efron-Stein inequality, we clearly see that This shows that ||β − β_0||² has a deterministic equivalent in probability and in L_2.
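As a toy Monte Carlo illustration of the inequality being used (with an arbitrary nonlinear function W of independent variables, unrelated to the estimator itself), one can check that var(W) is dominated by the Efron-Stein bound.

```python
# Toy Monte Carlo illustration of the Efron-Stein inequality used above:
# for W a function of n independent variables and W^(i) any function that
# ignores the i-th one, var(W) <= sum_i E[(W - W^(i))^2].
import numpy as np

rng = np.random.default_rng(1)
n, reps = 20, 5000

def W(x):                                   # an arbitrary nonlinear function of the sample
    return np.log(1 + np.sum(x**2))

vals, bound_terms = [], np.zeros(n)
for _ in range(reps):
    x = rng.standard_normal(n)
    w = W(x)
    vals.append(w)
    for i in range(n):
        w_i = W(np.delete(x, i))            # W^(i): drop the i-th coordinate
        bound_terms[i] += (w - w_i)**2

print("var(W)              ~", np.var(vals))
print("Efron-Stein bound   ~", np.sum(bound_terms) / reps)
```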

Appendix 4: Leaving out a predictor
In this second step of the proof, we do need, at various points, the entries of the vector X_i to be independent, whereas, as we showed before, this was not needed when studying what happens when we leave out an observation. Let V be the n × (p − 1) matrix corresponding to the first (p − 1) columns of the design matrix X. We call V_i in R^{p−1} the vector corresponding to the first p − 1 entries of X_i, i.e. V_i = (X_i(1), . . . , X_i(p − 1)). We call X(p) the vector in R^n with jth entry X_j(p), i.e. the pth entry of the vector X_j. When this does not create problems, we also use the standard notation X_{j,p} for X_j(p).
Let us call γ the solution of Note that γ 0 is the solution of the original optimization problem (3) when X_i(p) is replaced by 0. In this part of the paper, we will rely heavily on the following definitions: Definition We call the residuals corresponding to this optimization problem We call Note that u_p ∈ R^{p−1} and S_p is (p − 1) × (p − 1). We call and We will show later, in "On ξ_n" of Appendix 4, that ξ_n ≥ 0. However, we will use this information from the beginning; there is no circular argument. We note that ξ_n depends on the coordinate we are considering, here p, so it should be written ξ_n(p). However, to avoid cumbersome notation, we keep the notation ξ_n unless there are ambiguities or we need to stress which coordinate we are referring to. This happens only in parts of "About c_i's, ξ_n, N_p, and the limiting distribution of β(p)" of Appendix 5 below.
We consider Note that when ξ_n > 0, we have The aim of our work in the second part of this proof is to establish Theorem 3.20 on p. 39, which shows that ||b − β|| = O(polyLog(n)/n) in L_k. Because the last coordinate of b, b_p, has a reasonably simple probabilistic structure and our approximations are sufficiently good, we will be able to transfer our insights about this coordinate to β_p, the last coordinate of β. This is also true when considering √n(b_p − β_p), so our approximations will be interesting at that scale, too.
The approach and approximating quantities we choose-as well as the intuition behind those choices-can be understood by using variants of the ideas discussed in our work in [3,16] and [17].

Deterministic aspects

Proposition 3.11 We have
Furthermore, As we saw in Eq. (15) and Lemma 3.2, we have We note furthermore that, by definition of γ , The strategy of the proof is to control f ( b) by using g( γ ) to create good approximations and then recalling that g( γ ) = 0 p−1 .
Proof The proof strategy and ideas are tied to the technique developed in [15]; however, because there are a number of delicate cancellations in the argument, we give it in full detail. (Naturally, coming up with good approximating quantities required much work.) a. Work on the first (p − 1) coordinates of f(b) We call f_{p−1}(β) the first p − 1 coordinates of f(β). We call γ_ext the p-dimensional vector whose first p − 1 coordinates are γ and last coordinate is β_0(p), i.e.
For a vector v, we use the notation v comp,k to denote the p − 1 dimensional vector consisting of all the coordinates of v except the kth. Clearly, We can write by using the mean value theorem, for γ * i, p in the interval Let us call We have with this notation We note that by definition, , and Recalling the definition of S p and u p , we see that We conclude that . We have shown above that Recall the notation Clearly, We therefore see that We conclude that .

Representation of f ( b)
Aggregating all the results we have obtained so far, we see that We conclude immediately that This gives Eq. (33). The rest of the proof follows easily with mild modifications from [15] and we do not repeat it here.

Stochastic aspects
From now on, we assume that X(p) is independent of . This is consistent with Assumption P1. (Recall that X_i = λ_i X̃_i and therefore V_i = λ_i Ṽ_i.) Note that Assumption O4 is satisfied for V_i if it is satisfied for X_i: a convex 1-Lipschitz function of V_i can trivially be made into a convex 1-Lipschitz function of X_i by simply not acting on the last coordinate of X_i. Naturally, a large amount of the rest of the proof consists in showing that we can bound f(b) sufficiently finely for our results to hold true. So we will work on bounding each term in the product appearing in Eq. (33) in the rest of this section.
The last term is very easy to bound. In fact, using Eq. (34), we have and Hence, under assumptions O3-O4 and O6, we see that, for any fixed k and at τ fixed, As an aside, we note that p does not play any particular role here. If we considered the same quantity when we remove the kth predictor, and took the sup over 1 ≤ k ≤ p of the corresponding random variables, the same inequality would hold, in light of our work in e.g. Lemma 3.37.
The previous equation guarantees that We conclude, using Eq. (33), that provided the terms appearing inside the O_{L_k} have enough moments to enable us to use Hölder's inequality. Recall that Lemma 3.38 gives ||| |||_2 = O_{L_k}(polyLog(n)) under our assumptions O1-O7. At a high level, we expect sup_{1≤i≤n} |d_{i,p}| and [b_p − β_0(p)] to be small, which "should give us" that In fact, we will show in Proposition 3.12 that b_p − β_0(p) = O_{L_k}(polyLog(n)[n^{−1/2} ∨ n^{−e}]) and in Proposition 3.19 that sup_{1≤i≤n} |d_{i,p}| These are the key bounds we will need in showing that ||β − b|| is small. We now turn our attention to showing these two results.
We recall the notations Under our assumptions, we have E(X_i) = 0 and cov(X_i) = Id_p, and hence E[X_i²(p)] = 1. Recall that since we assume that

Proposition 3.12 We have Furthermore, under assumptions O1-O7 and P1, N_p = O_{L_k}(polyLog(n)) and therefore, when τ is held fixed, Proof From the definition of b_p, we see that, when ξ_n = 0 We will see later, in "On ξ_n" of Appendix 4, that ξ_n ≥ 0 (there is no circular argument; it is simply more convenient to postpone the investigation of the properties of ξ_n). It then immediately follows that Using independence of X(p) and {V_i, ϵ_i}_{i=1}^n, and E(X_i(p)) = 0 for all i, we have for instance, whether the right-hand side is finite or not. Using our bounds on max_i λ_i² and sup_i ||ψ_i||_∞, we therefore have Simple computations also show that N_p has as many moments as we need and that, for any finite k, under our assumptions, We therefore have

On ξ_n

Let us write ξ_n using matrix notations. Let D_{ψ_i(r_{·,[p]})} be the n × n diagonal matrix such that The notation D_{[ψ_i(r_{i,[p]})]_{i=1}^n} might make it clearer that we are referring to a unique matrix and not a sequence of matrices indexed by i, but we use D_{ψ_i(r_{·,[p]})} because it is a less cumbersome notation.
We also denote by X(p) the last column of the design matrix X. Then we have where This simply comes from elementary linear algebra and representing u_p and S_p in matrix form. For example, We are now ready to investigate in more detail the properties of ξ_n.

Lemma 3.13 We have ξ_n ≥ 0.
Furthermore, under Assumptions O1-O7 and P1, if D λ i is the diagonal matrix with ith entry λ i , Proof Let us first focus on The first part of the proof is very similar to the corresponding arguments in [15].
Since M is symmetric and has eigenvalues between 0 and 1, we also have, using e.g. Lemma V.1.5 in [6], Since X(p) satisfies the necessary concentration assumptions under Assumption P1, we can now appeal to Lemma 3.37 to obtain We now take a slight detour from the aim of showing that we have a very good approximation of β through b by working on finer properties of ξ_n and b_p. These properties will be essential in establishing the validity of the system (4).
To get a finer understanding of ξ n , we now focus on the properties of The previous lemma shows clearly why this is natural.

Lemma 3.14 Then we have, under Assumptions O1-O7 and P1, if M is the matrix defined in Eq. (37), We also have Proof We call d_{i,i} = ψ_i(r_{i,[p]})/n. Of course, by using the Sherman-Morrison-Woodbury formula (see e.g. [21], p. 19), Recall that we are interested in 1 This shows the second result of the lemma. On the other hand, With our definitions, we have, since λ 2 It immediately follows that as announced.
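As a sanity check, the rank-one Sherman-Morrison-Woodbury identity invoked above is easy to verify numerically; the matrices below are arbitrary and serve only to illustrate the formula.

```python
# Numerical check of the rank-one Sherman-Morrison(-Woodbury) identity:
# (A + u v')^{-1} = A^{-1} - A^{-1} u v' A^{-1} / (1 + v' A^{-1} u).
import numpy as np

rng = np.random.default_rng(2)
d = 8
A = rng.standard_normal((d, d))
A = A @ A.T + np.eye(d)                       # make A positive definite, hence invertible
u = rng.standard_normal(d)
v = rng.standard_normal(d)

lhs = np.linalg.inv(A + np.outer(u, v))
Ainv = np.linalg.inv(A)
rhs = Ainv - (Ainv @ np.outer(u, v) @ Ainv) / (1 + v @ Ainv @ u)
print(np.max(np.abs(lhs - rhs)))              # ~1e-15
```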
The previous result will be especially useful as an approximation result if we can show that the ζ_i's are small, since Assumption P2 (which we will use later) implies that (1/n)∑_{i=1}^n ||ψ_i||_∞ cannot be too large. This is what we do in the next few pages.

Controlling ζ i
The main problem that arises when trying to control ζ_i is the fact that the r_{j,[p]}'s appearing in S_p(i) depend on V_i. This prevents us from using concentration results for quadratic forms such as those shown in Lemma 3.37, so further approximation arguments are needed. Of course, the idea of using leave-two-out residuals to approximate {r_{j,[p]}}_{j≠i} immediately comes to mind. Hence our work in "Appendix 3" will later play a key role in showing that the ζ_i's are small.

Lemma 3.15 Suppose we can find {r_{j,[p]}^{(i)}}_{j≠i} such that
Then provided K_n has 3k uniformly bounded moments.
Proof We call Then, using for instance the first resolvent identity, i.e.
However, since AM_{i,p} is independent of (λ_i, V_i), we can use Lemma 3.37 and see that, by using the fact that λ_max((AM_{i,p} + τ Id)^{-1}) ≤ 1/τ.
Using the operator norm bound we gave above, we also have We conclude that Now, it is clear that under O1 and O4, sup_{1≤i≤n} ||V_i||²/n = O_{L_k}(1), and finally sup_{1≤i≤n}

Control of (1/n) trace((S_p(i) + τ Id)^{-1}) − (1/n) trace((S_p + τ Id)^{-1})

Using the Sherman-Morrison-Woodbury formula, we have After taking traces, we see that We conclude that sup_{1≤i≤n} provided we can use Hölder's inequality. In effect, this requires K_n to have 3k moments.
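Likewise, the first resolvent identity used in this proof, (A + τ Id)^{-1} − (B + τ Id)^{-1} = (A + τ Id)^{-1}(B − A)(B + τ Id)^{-1}, can be checked numerically on arbitrary positive semidefinite matrices.

```python
# Quick numerical check of the first resolvent identity invoked above:
# (A + tau*I)^{-1} - (B + tau*I)^{-1} = (A + tau*I)^{-1} (B - A) (B + tau*I)^{-1}.
import numpy as np

rng = np.random.default_rng(3)
d, tau = 6, 0.5
A = rng.standard_normal((d, d)); A = A @ A.T      # arbitrary psd matrix
B = rng.standard_normal((d, d)); B = B @ B.T      # arbitrary psd matrix

Ra = np.linalg.inv(A + tau * np.eye(d))
Rb = np.linalg.inv(B + tau * np.eye(d))
print(np.max(np.abs((Ra - Rb) - Ra @ (B - A) @ Rb)))   # ~1e-15
```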

Control of K n
A natural choice for r_{j,[p]}^{(i)}, defined in Lemma 3.15, is to use a leave-one-out estimator of γ, where the ith observation (and hence V_i) is omitted. Hence, all the work done in Theorem 3.9 becomes immediately relevant.

Lemma 3.16 Suppose we use for {r_{j,[p]}^{(i)}}_{j≠i} the residuals we would get by using a leave-one-out estimator of γ, i.e. excluding (V_i, ϵ_i) from problem (28).
With the notations of Lemma 3.15, we have, under assumptions O1-O7 and P1, In particular, for any fixed τ, Proof Let us call δ_n(i) random variables such that Applying Theorem 3.9 with R_j = r_{j,[p]} and r_{j,(i)} = r_{j,[p]}^{(i)}, The control of K_n follows immediately by using our assumptions on ψ_i, specifically the fact that it is Cn^α-Lipschitz. Important remark: the previous result has important consequences for c_i defined in Eq. (21). Indeed, we have the following corollary.

Corollary 3.17
Let c_i be defined as in Eq. (21) and c_τ be defined as in Eq. (14). Then, under assumptions O1-O7 and P1, we have The corollary follows by drawing an analogy between these quantities and the situation investigated in Lemmas 3.14, 3.15, and 3.16; we now give a detailed proof.

Proof
We have now established that Recalling the notation we see that this quantity is the analog of c_{τ,p} when we use all the predictors and not only (p − 1) of them. Indeed, c_i in Eq. (21) is defined, in the notation of the proof of Lemma 3.15, as an analog of 1 with the role of {r_{j,[p]}^{(i)}}_{j≠i} being played by the residuals obtained from the leave-one-out estimate of β, excluding (X_i, Y_i) from the problem. Lemma 3.15 in connection with Theorem 3.20 shows that sup_i | 1 ) under our assumptions. Passing from the (p − 1)-dimensional version of this result, i.e. Lemma 3.15, to the p-dimensional version gives the approximation stated in the corollary.
We therefore see that

Further results on ξ n and b p
We can combine all the results we have obtained so far in the following proposition.
Proposition 3.18 We have, under Assumptions O1-O7 and P1, Furthermore, under Assumptions O1-O7 and P1-P3, Both equations in this proposition are very important for this paper. The first one gives us a very precise idea of the behavior of ξ_n in terms of c_{τ,p}, which, as we will see in "Appendix 5", is relatively easy to understand. This first equation is also a stepping stone towards the first equation of the System (4). The second equation, on the other hand, is a stepping stone towards the second equation of System (4) in our main theorem, Theorem 2.1.

Proof • First equation
The proof of Eq. (45) consists simply of aggregating all the previous results and noticing that c_{τ,p} ≤ (p − 1)/(nτ) and therefore remains bounded. Indeed, we have This latter quantity was approximated in Lemma 3.14 by And in Lemma 3.13, we approximated ξ_n by 1 . This gives the result of Eq. (45), by simply keeping track of the approximation errors we make at each step.
• Second equation Recall that by definition (see Eqs. (31) and (30)), Therefore, We note that c τ, p λ i ψ i (r i, [ p] ), which depends only on (If needed, see the definition of c τ, p in Lemma 3.14.) Since X i ( p)'s are independent with mean 0 and variance 1, we conclude that Given the result in Eq. (45) and our bound on √ n[b p − β 0 ( p)] in Proposition 3.12, this means that In this last equation, we make use of Proposition 3.12 and Assumption P3 since under this assumption n β 0 2 ∞ polyLog(n)n 2α−1/2 → 0. This is what allows us to replace c τ, p (τ + ξ n ) by p/n without loss of accuracy in going from the second-to-last to the last equation.
We now need to control d i, p to show that our approximation of β by b in Proposition 3.11 will yield sufficiently good results that they can be used to prove Theorem 2.1.

On d i, p
Recall the definition (The fact that γ * i, p ∈ (r i, [ p] , r i,[ p] + ν i ) follows from writing the definition of We have the following result.
Proposition 3.19 We have, under Assumptions O1-O7 and P1-P3, at fixed τ, Proof Note that we can rewrite Recall that u_p = (1/n) V' D_{ψ_i(r_{·,[p]})} X(p). We can also rewrite it as , and our concentration assumptions on X(p) formulated in P1, we see that according to Lemma 3.36, we have where we look at V_i'(S_p + τ Id)^{-1} u_p as a linear form in X(p). Note that we have absorbed the sup_i |λ_i| in the polyLog(n) term. Now, ) ||| 2 S p and we conclude that We also note that sup_i |X_i(p)| = O_{L_k}(polyLog(n)/√c_n) under O4, O6 and P1, using the results of "Appendix 7". So we conclude that Recalling that |b_p − β_0(p)| = O_{L_k}(n^{−1/2} polyLog(n) + ||β_0||_∞), we finally see that Under our assumption that ψ_i is Cn^α-Lipschitz, we see that

Final conclusions
We can now gather together our approximation results in the following Theorem.

Theorem 3.20
Under Assumptions O1-O7 and P1-P3, we have, for any fixed τ > 0, In particular, We note that the index p in the previous theorem plays no particular role and similar results hold when p is replaced by any k, 1 ≤ k ≤ p.
Proof The Theorem is just the aggregation of all of our results, using the key bound on β − b in Proposition 3.11. The last statement is the only one that might need an explanation. With the notations of the proof of Proposition 3.19, we have R i − r i, [ p]  Combining the results of Eq. (46) and the previous theorem, we see that under Assumptions O1-O7 and P1-P4, Since p did not play any particular role as compared to any other index in our analysis, the same result holds when p is replaced by k, 1 ≤ k ≤ p.
Dividing the previous expression by n on both sides and summing over all the indices 1 ≤ k ≤ p, we finally get Our aim now is to further simplify the above expression to get the second equation of our system.

On c τ, p and c τ
We now show that c τ,k 's are all close to the same quantity, which turns out to be c τ .
Proof Let us recall the notation According to Lemma 3.40, we see that, since c_τ = (1/n) trace((S + τ Id_p)^{-1}), It is clear that under our assumptions, a = O_{L_k}(polyLog(n)), since using e.g. our work in "Appendix 7". Since ψ_i is Cn^α-Lipschitz and we have Hence, using arguments similar to the ones we have used in the proof of Lemma 3.15 (i.e. first resolvent identity, etc.), we see that Since c_{τ,p} = (1/n) trace((S_p + τ Id)^{-1}), the result we announced follows immediately. We note that p did not play a particular role here and hence taking the sup over those indices only adds a polyLog(n) term to the approximation. Hence our approximation is also valid for sup_{1≤k≤p} |c_τ − c_{τ,k}|.
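Numerically, the stability of such trace functionals under removal of one predictor is easy to observe. The sketch below assumes, purely for illustration, that S has the form X'DX/n with a diagonal positive weight matrix D standing in for the weights built from the ψ_i's; it is not the paper's exact construction.

```python
# Illustrative sketch (assumed form S = X' D X / n with a diagonal psd D, used only
# as a stand-in): removing one column of X changes (1/n) trace((S + tau*I)^{-1})
# only slightly, as in the comparison of c_tau and c_{tau,k}.
import numpy as np

rng = np.random.default_rng(4)
n, p, tau = 400, 200, 1.0
X = rng.standard_normal((n, p))
D = np.diag(rng.uniform(0.2, 1.0, size=n))          # stand-in diagonal weight matrix

def c_tau_of(Xmat):
    S = Xmat.T @ D @ Xmat / n
    return np.trace(np.linalg.inv(S + tau * np.eye(Xmat.shape[1]))) / n

c_tau = c_tau_of(X)
c_tau_p = c_tau_of(X[:, :-1])                       # drop the last predictor
print(c_tau, c_tau_p, abs(c_tau - c_tau_p))         # the difference is small (O(1/n))
```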
We are now ready to prove the last proposition of this section, which will help us get the second equation of our System (4).

Proposition 3.22 Under Assumptions O1-O7 and P1-P4,
Furthermore, when all the λ_i's are non-zero, Proof In light of the result in Proposition 3.21 and Assumption P3, which guarantees that ||β_0||_2² is uniformly bounded in p and n, we see that Therefore, Eq. (47) implies that Using Theorem 3.20 and our bound on ||ψ_i||_∞ from Assumption O3, we see that Thanks to Proposition 3.21 we also have In light of Eq. (27) and Assumption O3, we have Lemma 3.32, and specifically the computation of the derivative of prox(cρ)(x) with respect to c, allows us to bound the error In light of this, we see that, by using Corollary 3.17, we can re-express the previous equation as since Eq. (44) in Corollary 3.17 gives sup_i |c_i − λ_i² c_τ| = O_{L_k}(n^{2α−1/2} polyLog(n)). When the λ_i's are all different from 0, we can rewrite this equation as Finally, since almost by definition,

Appendix 5: Last steps of the proof
We now reach the last steps of the proof, and two important tasks remain to be completed. The first one is understanding the limiting behavior of r_{i,(i)} and showing that it behaves like ϵ_i + λ_i r_ρ(κ) Z_i in the limit, where Z_i ∼ N(0, 1). With a little bit of further work, the corresponding results will give us, in connection with Proposition 3.22, the second equation of our main system (4).
The second main task is then to show that c τ is asymptotically deterministic, i.e. it converges towards a non-random number.

On the asymptotic distribution of r_{i,(i)}
We have the following lemma.

Lemma 3.23 Under Assumptions O1-O7
and P1-P4, as n and p tend to infinity, Furthermore, if i ≠ j, r_{i,(i)} and r_{j,(j)} are asymptotically (pairwise) independent. The same is true for the pairs (r_{i,(i)}, λ_i) and (r_{j,(j)}, λ_j). Recall that β^(i) is independent of X_i and that X_i has mean 0, cov(X_i) = Id_p, and that, for any finite k, the first k absolute moments of its entries are assumed to be bounded uniformly in n.
Recall that we showed in Proposition 3.10 that var(||β − β_0||²) → 0. Thanks to Lemma 3.3, we also know that E||β − β_0||² is uniformly bounded. Furthermore, in the proof of Proposition 3.10, we showed that E ||β||² − ||β^(i)||² → 0 and that Let us now show that (β^(i) − β_0)'X_i behaves like N(0, E||β^(i) − β_0||²). We employ a strategy similar to the one used in [15], but give the argument in detail since it requires some new work.
We need a simple generalization of the standard Lindeberg-Feller theorem (see e.g. [39]). Indeed, if the a_{n,p}(k)'s are random variables with ∑_{k=1}^p a_{n,p}(k)² = A_n², E(A_n²) remains bounded in n, and the a_{n,p}(k)'s are independent of X_i, we see that: a) if Z ∼ N(0, Id_p), independent of the a_{n,p}(k)'s, then a_{n,p}'Z ∼ A_n N, where N ∼ N(0, 1) is independent of A_n (conditionally and unconditionally on a_{n,p}); b) Theorem 2.1.5 and its proof in [39] hold provided ∑_{k=1}^p E|a_{n,p}(k)|³ = o(1). The proof simply needs to be started conditionally on a_{n,p}, and the final moment bounds are then taken unconditionally. This very mild generalization gives, if φ is a C³ function with bounded second and third derivatives, where K is a constant that depends on the second and third absolute moments of the entries of X_i. It is therefore independent of n and p under our assumptions on X_i. To make matters clearer, we allow ourselves to use the notation v_k or v(k) to refer to the kth coordinate of the vector v.
In our setting, a_{n,p}(k) = β^(i)(k) − β_0(k). Recall that we have shown that The same arguments we used also apply to β^(i)(p), the pth coordinate of the leave-one-out estimate β^(i). So it is clear that We conclude that This, in connection with Corollary 2.1.9 in [39], shows that (β^(i) − β_0)'X_i behaves asymptotically like ||β^(i) − β_0|| N in the sense of weak convergence. Since ||β^(i) − β_0|| − E||β^(i) − β_0|| → 0 in probability and E||β^(i) − β_0|| remains bounded under our assumptions, Slutsky's lemma guarantees that asymptotically, in the sense of weak convergence (by which we mean that the difference of their characteristic functions goes to 0). Using the fact, which can be shown using results in the proof of Proposition 3.10, that E||β^(i) − β_0|| − E||β − β_0|| → 0, and Slutsky's lemma, we see that in the sense of weak convergence.
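A quick Monte Carlo illustration of this CLT step, with Rademacher entries standing in for the entries of X_i and a fixed unit vector a playing the role of β^(i) − β_0:

```python
# Monte Carlo illustration of the CLT step above: for a fixed vector a and X with
# independent mean-zero, variance-one entries (Rademacher here, so clearly
# non-Gaussian), a'X is approximately N(0, ||a||^2).
import numpy as np

rng = np.random.default_rng(5)
p, reps = 500, 20000
a = rng.standard_normal(p)
a /= np.linalg.norm(a)                                           # normalize so ||a|| = 1
X = (2 * rng.integers(0, 2, size=(reps, p)) - 1).astype(float)   # Rademacher entries
Z = X @ a
print("mean ~ 0:", Z.mean(), "  variance ~ 1:", Z.var())
print("P(|a'X| <= 1.96) ~ 0.95:", np.mean(np.abs(Z) <= 1.96))
```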
• Second part For the second part, we use a leave-two-out approach, namely we use the approximation r_{i,(i)} (1) and similarly for r_{j,(j)} (this is clear from Theorem 2.2; β^(ij) is computed by solving Problem (3) without (X_i, Y_i) or (X_j, Y_j)). It is clear that r_{i,(i)} and r_{j,(j)} are asymptotically independent conditional on X^(ij), i.e. all the predictors except X_i and X_j. But because their dependence on X^(ij) is only through ||β^(ij) − β_0||, which is asymptotically deterministic by arguments similar to those used in the proof of Proposition 3.10, we see that r_{i,(i)} and r_{j,(j)} are asymptotically independent.
After this high-level explanation, let us now give a detailed proof. The arguments we gave above apply to β^(ij) as they did to β^(i). In particular, since Of course, β^(ij) depends only on {X^(ij), ϵ^(ij)}. We call P^(ij) the joint probability measure P^(ij) = ∏_{k∉{i,j}} P_{X_k, ϵ_k}, i.e. probability computed with respect to all our random variables except (X_i, ϵ_i) and (X_j, ϵ_j) (we slightly abuse notation and do not index this probability measure by n for the sake of clarity). So we have found E_n^(ij), depending only on (X^(ij), ϵ^(ij)), such that P^(ij)(E_n^(ij)) → 1 and . The arguments we gave above (treating the a_{n,p}'s as deterministic quantities) then imply that, when (X^(ij), ϵ^(ij)) ∈ E_n^(ij), Let us now use characteristic function arguments.
Let (w_i, w_j) ∈ R² be fixed and , since the modulus of the functions we are integrating is bounded by 1. Now is a deterministic function of (X^(ij), ϵ^(ij)). Independence of X_i and X_j implies that Also, our conditional asymptotic normality arguments above imply that in P^(ij)-probability. We therefore have Since P(E_n^(ij)) → 1 and ||β^(ij) − β_0||² is asymptotically deterministic by arguments similar to those used in the proof of Proposition 3.10, we see that Therefore, This proves that α_i and α_j are asymptotically independent. It easily follows that the same is true for r_{i,(i)} and r_{j,(j)}. The same leave-two-out approach also shows asymptotic pairwise independence of the pairs (λ_i, r_{i,(i)}) and (λ_j, r_{j,(j)}), since β^(ij) is independent of λ_i and λ_j under Assumption O6, which guarantees independence of the λ_i's.
The lemma is shown.

On the asymptotic behavior of c τ
We are now in a position to show that c_τ = (1/n) trace((S + τ Id_p)^{-1}) is asymptotically deterministic. This result will require several steps.

Lemma 3.24 We work under Assumptions O1-O7, P1-P4 and F2-F4.
Consider the random function g_n(x), defined for x ≥ 0.
Under assumption O3, ψ i is L i (n)-Lipschitz. Therefore, for We recall that, according to Lemma 3.32, Hence, We finally conclude that We therefore have, when x ∨ y ≤ B, and x, y ≥ 0, ∀u, sup Since the right-hand side does not depend on u, we also have Naturally, g n (x) can be written as Therefore, for any x, y ≥ 0, we have Under assumptions P2 and F3-F4, we can now take expectations and get the result in L 1 , since all the terms on the right hand side are bounded in L 1 under those assumptions.
We have established stochastic equicontinuity of g n (x) on [0, B].

Lemma 3.25
Let us call G_n(x) = E(g_n(x)). Let B > 0 be given. For any given Under Assumptions O1-O7, P1-P4 and F1-F5, we also have Proof Under assumptions F1 and F5, we can divide the index set {1, . . . , n} into K subsets A_1, . . . , A_K, where K is finite (and does not grow with n), in which the (X_i, ϵ_i)'s, i ∈ A_j, play a symmetric role. Hence, var(g_n(x_0)) can be expressed as a sum of variances and covariances of finitely many functions of finitely many of the random variables (λ_i, r_{i,(i)}): for those random variables, we just need to pick a representative in each subset {A_j}_{j=1}^K. We note that since ψ_i is Lipschitz and hence continuous, g_n is an average of bounded continuous functions of the random variables of interest to us.
Asymptotic pairwise independence of the (λ_i, r_{i,(i)})'s, and the fact that ψ_i can only be one of finitely many functions, imply that var(g_n(x_0)) → 0 and therefore give the first result.
Let us now pick ε > 0. By the stochastic equicontinuity of g_n and our bound in Eq. (49), we can find x_1, . . . , x_K, independent of n, such that for all x ∈ [0, B], there exists l such that, when n is large enough, Note that We immediately get Because K is finite, the fact that for all l, |g_n(x_l) − G_n(x_l)| → 0 in L_2 implies that sup_{1≤l≤K} |g_n(x_l) − G_n(x_l)| → 0 in L_2. In particular, if n is sufficiently large, The lemma is shown. and n (x) = E(δ_n(x)). Call x_n a solution of δ_n(x_n) = 0 and x_{n,0} a solution of n (x_{n,0}) = 0. Since 0 ≤ g_n ≤ 1, we see that x_n ≤ p/(nτ), for otherwise δ_n(x) < 0. The same argument shows that if x > (p/n + ε)/τ, n (x) < −ε and x ∉ T_{n,ε}. Similarly, near solutions of δ_n(x) = 0 must be less than or equal to (p/n + ε)/τ.
• Proof of the fact that c τ is such that δ n (c τ ) = o(1) An important remark is that c τ is a near solution of δ n (x) = 0. This follows most clearly from arguments we have developed for c τ, p so we start by giving details through arguments for this random variable. Recall that in the notation of Lemma 3.14, we had Now, according to Eq. (40), According to Lemmas 3.14, 3.15 and 3.16, we have Of course, when x ≥ 0 and y ≥ 0, |1/(1 + x) − 1/(1 + y)| ≤ |x − y| ∧ 1. Hence, we see that We conclude that Exactly the same computations can be made with c τ , so we have established that Now we have seen in Theorem 2.2 that Through our assumptions on ψ i , this of course implies that We have furthermore noted that sup i |c i − λ 2 i c τ | = O L k (n −1/2+2α polyLog(n)) in Corollary 3.17. Using Lemma 3.32, this implies that and hence Gathering everything together, we get So we have established that δ n (c τ ) = O L k (n −1/2+3α polyLog(n)).
• Final details Note that for any given x, δ_n(x) − n (x) = o_P(1) by using Lemma 3.25. In our case, with the notation of this lemma, B = p/(nτ) + η/τ, for η > 0 given. This implies that, for any given ε > 0, sup with high probability when n is large. Therefore, for any ε > 0, if x_n is a solution of δ_n(x_n) = 0, This exactly means that x_n ∈ T_{n,ε} with high probability. The same argument applies for near solutions of δ_n(x) = 0, which, for any ε > 0, must belong to T_{n,ε} as n → ∞ with high probability. Of course, there is nothing random about T_{n,ε}, which is a deterministic set. Note that T_{n,ε} is compact because it is bounded and closed, using the fact that G_n = E(g_n) is continuous. If T_{n,0} were reduced to a single point, we would have established the asymptotically deterministic character of c_τ.
Given our work concerning the limiting behavior of r_{i,(i)} and our assumptions about the ϵ_i's, we see that Lemma 3.39 applies to lim_{n→∞} n (x) under assumption F1. Therefore, as n → ∞, T_{n,0} is reduced to a point and c_τ is asymptotically non-random. (Note that assumption F1 is stated in terms of the properties of densities of random variables of the form ϵ_i + r Z_i, where Z_i is N(0, 1), independent of ϵ_i, and r is arbitrary; Assumption F1 also gives us guarantees for ϵ_i + r λ_i Z_i at λ_i given, by a simple change of variable. The W_i's appearing in Lemma 3.39 are of the form ϵ_i + |λ_i| r Z_i, so assumption F1 is all we need for Lemma 3.39 to apply.)
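The asymptotically deterministic character of such trace functionals is a concentration phenomenon that is easy to observe numerically. The simplified sketch below takes D = Id (so S = X'X/n), which is only a stand-in for the actual weighted matrix S, and shows that the trace functional barely fluctuates across independent replications.

```python
# Simplified illustration (D = Id is a stand-in; not the paper's S): the quantity
# (1/n) trace((X'X/n + tau*I)^{-1}) concentrates around a deterministic value
# across independent draws of the design matrix.
import numpy as np

rng = np.random.default_rng(7)
n, p, tau, reps = 400, 200, 1.0, 20
vals = []
for _ in range(reps):
    X = rng.standard_normal((n, p))
    S = X.T @ X / n
    vals.append(np.trace(np.linalg.inv(S + tau * np.eye(p))) / n)
vals = np.array(vals)
print("mean of c_tau:", vals.mean(), "  std across replications:", vals.std())
```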

Proof of Theorem 2.1
We are now ready to prove Theorem 2.1.
So n can be interpreted as The fact that c τ is asymptotically arbitrarily close to the root of n (x) = 0 gives us the first equation in the system appearing in Theorem 2.1. The second equation of the system comes from Eq. (48). Theorem 2.1 is shown, with c ρ (κ) being the limit of c τ .
About c_i's, ξ_n, N_p, and the limiting distribution of β(p)

Theorem 2.1 as well as many of our intermediate results have interesting consequences for various quantities we encountered. Let us now state them. When we use the expression "under our assumptions", we mean assumptions O1-O7, P1-P4 and F1-F5.

On c i 's
Recall that in Corollary 3.17 we had shown that under our assumptions O1-O7 and P1-P4 Since we have now shown that c τ has a deterministic limit c ρ (κ), we have the following lemma.
We also have and c τ, p as well as c τ are bounded away from 0 in probability.
We note that, using the last result and arguments in the proof below, we have, with the notations of Theorem 2.1, Proof The proof follows easily from the result of Proposition 3.18, which gives us that c_{τ,p}(ξ_n + τ) − (p − 1)/n = O_{L_k}(n^{−1/2+2α} polyLog(n)) = o_P(1).
Since we have shown that c_{τ,p} converges to a deterministic constant (recall that c_τ − c_{τ,p} → 0), we see that this is also the case for ξ_n. Note also that ξ_n ≤ (1/n)∑_{i=1}^n X_i²(p) ||ψ_i||_∞, so E(ξ_n) remains bounded under our assumptions. To get the last result of the lemma, we just need to show that we can divide in the above display by c_{τ,p} and still have something that converges to 0. We now show that c_{τ,p} is bounded below. Note that c_{τ,p} − (p − 1)/(n(ξ_n + τ)) = o_P(1).
Since ξ_n is bounded in probability, we see that (p − 1)/(n(ξ_n + τ)) is bounded away from 0 in probability, which guarantees that c_{τ,p} is bounded away from 0 in probability.
The results involving c τ immediately follow by appealing to Proposition 3.21.

On N p
Recall that by definition, we had We have the following result. Furthermore, there exists v such that v_n² → v², so that The same result applies to N_k, for 1 ≤ k ≤ p.
As the proof makes clear, we can replace v_n² in the asymptotic statements above by v̂_n². Proof Note that we can write Under our assumptions, the X_i(p)'s are independent and independent of {λ_i ψ_i(r_{i,[p]})}_{i=1}^n. The mild generalization of the Lindeberg-Feller argument given in the proof of Lemma 3.23 now applies, using, in the notation of that lemma, a_{n,p}(i) = n^{−1/2} λ_i ψ_i(r_{i,[p]}), and recalling that |ψ_i(r_{i,[p]})| ≤ ||ψ_i||_∞. Since the λ_i's have 4 uniformly bounded moments under our assumptions, the fact that Asymptotic pairwise independence of (λ_i, r_{i,(i)}) and (λ_j, r_{j,(j)}) (see Lemma 3.23) in connection with Assumption F5 guarantees that We conclude that A_n² is asymptotically deterministic, and so is A_n. By Slutsky's lemma, we have that N_p behaves like N(0, v_n²).

Proposition 3.30 We have, with the notation of the previous lemmas, √n[(τ + ξ_n)β_p − β_0(p)ξ_n] ⇒ N(0, v²).
Similarly, we have for all 1 ≤ k ≤ p: provided β_0(k) = O(n^{−1/2}), The proof of the proposition we give below shows that ξ in the previous display can be replaced by any quantity ω_n such that ξ_n − ω_n = o_P(1). This is in particular the case if we choose ω_n = p/(nc_τ) − τ, according to Lemma 3.28.
The main advantage of this ω_n is that it is computable from the data, and we can therefore test the null hypothesis that β_0(p) = 0, since we can approximate v² by Recall that we showed that ξ_n = O_{L_k}(1) under our assumptions. It is easy to verify that the same is true for ξ, its limit. We also see that Recall that by definition, So we conclude, using Slutsky's lemma, that When β_0(p) = O(n^{−1/2}), we see that √n(ξ − ξ_n)(β_0(p)) = o_P(1).
• Extending the result from β_p to β_k We note that ξ_n in the above argument actually depends on p also; we avoided writing out this dependence earlier to avoid cumbersome notation. We now make it explicit for clarity. Our arguments above guarantee that where, as explained in Lemma 3.29, v does not depend on k. As explained above, if β_0(k) = O(n^{−1/2}), √n(ξ − ξ_n(k))(β_0(k)) = o_P(1) for all k's. Very importantly, ξ does not depend on p, since it is by definition ξ = (p − 1)/(nc_ρ(κ)) − τ. So we conclude that for all k, when β_0(k) = O(n^{−1/2}), Since the left-hand side weakly converges to N(0, v²), we conclude by Slutsky's lemma that for all k,
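The following is a hypothetical sketch of the coordinate-wise test suggested above. The quantities xi_hat (an estimate of ξ, e.g. via ω_n = p/(n ĉ_τ) − τ) and v_hat (an estimate of v) are placeholders; the paper's estimator v̂_n² of v² is not reproduced here.

```python
# Hypothetical sketch of the coordinate-wise test suggested above. `xi_hat` and
# `v_hat` are placeholder estimates (not the paper's exact estimators): under
# H0: beta_0(k) = 0, sqrt(n) * (tau + xi) * beta_hat(k) / v is approximately N(0, 1).
import numpy as np
from scipy.stats import norm

def z_test_coordinate(beta_hat_k, xi_hat, v_hat, n, tau):
    """Two-sided test of H0: beta_0(k) = 0 based on the CLT above."""
    z = np.sqrt(n) * (tau + xi_hat) * beta_hat_k / v_hat
    return z, 2 * norm.sf(abs(z))

# Placeholder numbers, for illustration only.
z, pval = z_test_coordinate(beta_hat_k=0.03, xi_hat=0.8, v_hat=1.2, n=500, tau=1.0)
print("z =", z, " p-value =", pval)
```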

Appendix 6: Notes on the proximal mapping
In this section of the Appendix we remind the reader of elementary properties of the proximal mapping. The proofs, when needed, can be found in e.g. [15]. Let ρ be differentiable and such that ψ changes sign at 0, i.e. sign(ψ(x)) = sign(x) for x ≠ 0. Then, prox(cρ)(0) = 0.
Hence, F n is polyLog(n)/c 1/2 n sup 1≤ j≤n L I j in We repeatedly use the following lemma in the proof.

Lemma 3.37
Suppose the assumptions of the previous Lemma are satisfied. Consider Q I j = 1 n X I c j M I j X I c j , where M I j is a random positive-semidefinite matrix depending only on X I j whose largest eigenvalue is λ max,I j . Assume that E (X i ) = 0, cov (X i ) = Id p and nc n → ∞. Then, we have in L k , The same bound holds when considering a single Q I j without the polyLog(n) term.
On the spectral norm of covariance matrices The results hold also in L k .
Proof The proof is essentially identical to that given in [15], which gives, following a simple adaptation of the well-known ε-net argument explained e.g. in [40], "Appendix 1", that It is clear that and the result follows immediately.
To compute G i (x, λ i ), we differentiate under the integral sign (under our assumptions the conditions of Theorem 9.1 in [10] are satisfied) to get dt.
Since the denominator of the function we integrate is positive, we conclude that Since F_i'(x) = −τ + G_i'(x), we see that F_i'(x) ≤ −τ < 0. Therefore F_i is a decreasing function on R_+. Of course, prox(0ρ)(t) = t, so that F_i(0) = p/n and lim_{x→∞} F_i(x) = −∞, since, for instance, So we conclude that the equation F_i(x) = 0 has a unique root. (Since F_i is differentiable, F_i is of course continuous.) We note that Therefore, F(0) = p/n and F'(x) ≤ −τ. So F is decreasing, differentiable, and hence has a unique root.
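Purely as an illustration of this kind of argument (the function F below is a stand-in built from the Huber proximal mapping, not the paper's F_i), a decreasing function that equals p/n at 0 and tends to −∞ has a unique root, which a bracketing root finder locates easily.

```python
# Purely illustrative stand-in (NOT the paper's F_i): a function of the form
# F(x) = (p/n) * E[(prox(x*rho))'(W)] - tau * x, with rho the Huber loss, equals
# p/n at x = 0, is decreasing, tends to -infinity, and therefore has a unique root.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(6)
p_over_n, tau, k = 0.5, 1.0, 1.345
W = rng.standard_cauchy(200_000) + 0.5 * rng.standard_normal(200_000)  # heavy-tailed W

def prox_huber_deriv(x, w):
    # derivative in w of prox(x*rho)(w) for the Huber rho with threshold k
    return np.where(np.abs(w) <= k * (1 + x), 1.0 / (1 + x), 1.0)

def F(x):
    return p_over_n * prox_huber_deriv(x, W).mean() - tau * x

root = brentq(F, 0.0, p_over_n / tau + 1.0)   # F(0) > 0 and F < 0 at the right end
print("unique root of the stand-in F:", root)
```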
Remark The conditions on the density of W are satisfied in many situations. For instance, if W_i = ϵ_i + r λ_i Z, where ϵ_i is symmetric about 0 and log-concave, Z is N(0, 1), independent of λ_i and ϵ_i, and r > 0, it is clear that the density of W satisfies the conditions of our lemma. Similar results hold under weaker assumptions on ϵ_i, of course. For more details, we refer the reader to e.g. [7] and [25,26,35].
In particular, we recall Theorem 1.6 in [7], which says that the convolution of two symmetric unimodal distributions on R is unimodal. Hence, when ϵ_i has a symmetric and unimodal distribution, so does W_i = ϵ_i + λ_i r Z, for any r. This is for instance the case when ϵ_i has a Cauchy distribution.
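A small numerical illustration of this remark: approximating the density of W = ϵ + rZ by a grid convolution of a Cauchy density with a Gaussian density and counting its local maxima.

```python
# Numerical illustration of the unimodality remark above: the density of
# W = eps + r*Z with eps ~ Cauchy(0,1) (symmetric, unimodal) and Z ~ N(0,1)
# is again unimodal. We approximate the convolution on a grid.
import numpy as np

r = 0.7
x = np.linspace(-30, 30, 4001)
dx = x[1] - x[0]
f_eps = 1.0 / (np.pi * (1 + x**2))                              # Cauchy density
f_rz = np.exp(-x**2 / (2 * r**2)) / np.sqrt(2 * np.pi * r**2)   # density of r*Z
f_w = np.convolve(f_eps, f_rz, mode="same") * dx                # density of eps + r*Z

# count strict local maxima of the convolved density
local_max = np.sum((f_w[1:-1] > f_w[:-2]) & (f_w[1:-1] > f_w[2:]))
print("number of local maxima:", local_max)                      # expect 1
```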

A linear algebraic remark
We need the following lemma at some point in the proof.

Lemma 3.40 Suppose the p × p matrix A is positive semi-definite and
Here a ∈ R. Let τ be a strictly positive real. Call τ = + τ Id p−1 . Then we have In particular, The proof is simple and we refer the reader to [15] for details if needed.