Generic Error Bounds for the Generalized Lasso with Sub-Exponential Data

This work performs a non-asymptotic analysis of the generalized Lasso under the assumption of sub-exponential data. Our main results continue recent research on the benchmark case of (sub-)Gaussian sample distributions and thereby explore what conclusions are still valid when going beyond. While many statistical features remain unaffected (e.g., consistency and error decay rates), the key difference manifests itself in how the complexity of the hypothesis set is measured. It turns out that the estimation error can be controlled by means of two complexity parameters that arise naturally from a generic-chaining-based proof strategy. The output model can be non-realizable, while the only requirement for the input vector is a generic concentration inequality of Bernstein-type, which can be implemented for a variety of sub-exponential distributions. This abstract approach allows us to reproduce, unify, and extend previously known guarantees for the generalized Lasso. In particular, we present applications to semi-parametric output models and phase retrieval via the lifted Lasso. Moreover, our findings are discussed in the context of sparse recovery and high-dimensional estimation problems.


Introduction
This paper is concerned with the following common inference problem in statistical learning: Let (x 1 , y 1 ), . . . , (x n , y n ) ∈ R p × R be samples of a random input-output pair (x, y) ∈ R p × R, whose joint probability distribution is unknown. What information about the relationship between x and y can we retrieve only based on the knowledge of (x 1 , y 1 ), . . . , (x n , y n )?
A classical instance of this problem is linear regression, where y depends linearly on x, say y = ⟨x, β₀⟩ + ν for an unknown parameter vector β₀ ∈ R^p and independent additive noise ν. While the resulting task of estimating β₀ is nowadays fairly well understood in the low-dimensional regime n ≥ p, it is still subject of ongoing research in the high-dimensional regime n ≪ p. In the latter scenario, it is indispensable to impose additional conditions on the input-output model. A typical assumption is that β₀ belongs to a known, convex hypothesis set K ⊂ R^p that is of low complexity in a certain sense. In such a model setup, a natural estimation procedure is based on solving the generalized Lasso

    minimize over β ∈ K:   (1/n) ∑_{i=1}^{n} (yᵢ − ⟨xᵢ, β⟩)².   (LS_K)

The popularity of Lasso-type estimators is due to several desirable properties. Perhaps most importantly, many efficient algorithmic implementations are available for (LS_K) due to the convexity of K (e.g., see [12,56,66]), accompanied by the suitability for a statistical analysis due to its simple variational formulation (e.g., see the textbooks [6,13,23]). A more astonishing feature of the generalized Lasso (LS_K) is its ability to deal with non-linear relations between x and y. In fact, inspired by a classical result of Brillinger [5], a recent work of Plan and Vershynin [42] shows that for Gaussian input vectors, (LS_K) yields a consistent estimator for single-index models, i.e., y = f(⟨x, β₀⟩) with an unknown, non-linear distortion function f : R → R. This finding has triggered a lot of related and follow-up research, e.g., see [14,16,19,38,46,52–54]. We note that these works form only a small fraction of a whole research area on non-linear observation models, lying at the interface of statistics, learning theory, signal processing, and compressed sensing.

Definition 1.2 (Sub-Gaussian/sub-exponential random variables) For α ∈ {1, 2}, we define the exponential Orlicz norm of a random variable Z : Ω → R by

    ‖Z‖_{ψ_α} := inf{ t > 0 | E[exp(|Z|^α / t^α)] ≤ 2 }.

The exponential Orlicz space is then denoted by L_{ψ_α} := { Z : Ω → R | ‖Z‖_{ψ_α} < ∞ }. The elements of the exponential Orlicz spaces L_{ψ_1} and L_{ψ_2} are called sub-exponential and sub-Gaussian random variables, respectively.
The notions of sub-exponentiality and sub-Gaussianity impose restrictions on the tails of a random variable, which must not be "too heavy". This intuition gives rise to several equivalent versions of Definition 1.2, which are summarized in Proposition A.1 in Appendix A; for a more detailed introduction, we refer to [58,Chap. 2 & 3].
Sub-Gaussian and sub-exponential random vectors are defined in terms of their one-dimensional marginals (i.e., projections onto one-dimensional subspaces):

Definition 1.3 (Sub-Gaussian/sub-exponential random vectors) For a random vector x ∈ R^p and α ∈ {1, 2}, we set

    ‖x‖_{ψ_α} := sup_{v ∈ S^{p−1}} ‖⟨x, v⟩‖_{ψ_α}.

If ‖x‖_{ψ_2} < ∞, we say that x is (uniformly) sub-Gaussian, and if ‖x‖_{ψ_1} < ∞, we say that x is (uniformly) sub-exponential.
The following result states a non-asymptotic error bound for the generalized Lasso (LS_K) with sub-exponential input vectors. Its proof is provided in Subsection 5.6, being a "by-product" of one of our main results, Corollary 2.15 in Subsection 2.3. For the sake of simplicity, we restrict ourselves to a polytopal hypothesis set K here, as this allows for explicit bounds on the complexity parameters. Moreover, it is worth emphasizing that for linear models, i.e., if y = ⟨x, β₀⟩, we simply obtain an estimation guarantee for β* = β₀.

Proposition 1.4 Let (x, y) ∈ R^p × R be a joint random pair such that y ∈ R is sub-exponential and x ∈ R^p is isotropic and sub-exponential with ‖x‖_{ψ_1} ≤ κ for some κ > 0. Let K ⊂ R^p be a convex polytope with D vertices and Euclidean diameter ∆₂(K), and let β* ∈ K be the expected risk minimizer on K, i.e., a solution to (1.1). Finally, let the observed sample pairs (x₁, y₁), . . . , (xₙ, yₙ) ∈ R^p × R be independent copies of (x, y). Then there exists a universal constant C > 0 such that for every u ≥ 8, the following holds true with probability at least 1 − 5 exp(−C · u²) − 2 exp(−C · √n): If the sample size obeys

    n ≳ κ^{10} · ∆₂(K) · log(D) · √n + log(D) + κ^6 · u²,

then every minimizer β̂ of (LS_K) satisfies an error bound of order O(n^{−1/4}) for ‖β̂ − β*‖₂, whose right-hand side depends on κ and the mismatch deviation σ(β*) := ‖y − ⟨x, β*⟩‖_{ψ_1}.
Informally speaking, Proposition 1.4 shows that estimation of the expected risk minimizer succeeds with overwhelmingly high probability as long as n ≫ ∆₂(K)² · log(D)². Such a statement is particularly appealing for high-dimensional problems such as sparse recovery. Another remarkable conclusion is that the estimator (LS_K) essentially performs as well as if the sample data were sub-Gaussian (cf. [15, Thm. 4.3]). Our main results in Section 2 confirm this observation in much greater generality, but they will also reveal several important differences to the sub-Gaussian case; first and foremost, we will be concerned with defining appropriate complexity measures for K, which do not explicitly appear in the polytopal setting of Proposition 1.4. In this respect, it is important to note that there are relevant special cases of sub-exponential vectors (e.g., those with independent coordinates) for which the above estimate is too pessimistic and can be improved. The elaboration of this aspect is a key concern of this article and motivates the introduction of a generic tail condition that takes the underlying "geometry" of the problem into account. Apart from this, let us also emphasize that the simplifications of Proposition 1.4 come at the price of suboptimal behavior regarding (a) the error decay rate O(n^{−1/4}), (b) the sub-exponential parameter κ, and (c) the model deviation parameter σ(β*).
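To make this setting concrete, the following minimal Python sketch simulates a linear model with sub-exponential (Laplace) inputs and solves (LS_K) for the scaled ℓ₁-ball K = √k·B₁^p via projected gradient descent. All problem sizes, the Laplace design, and the step-size rule are illustrative assumptions and not prescriptions of Proposition 1.4.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 500, 2000, 5                      # sample size, dimension, sparsity (illustrative)

beta0 = np.zeros(p)
beta0[:k] = 1.0 / np.sqrt(k)                # unit-norm, k-sparse parameter vector

# Sub-exponential, isotropic inputs: i.i.d. Laplace coordinates with unit variance.
X = rng.laplace(scale=1.0 / np.sqrt(2), size=(n, p))
y = X @ beta0 + 0.1 * rng.laplace(scale=1.0 / np.sqrt(2), size=n)

def project_l1_ball(v, radius):
    """Euclidean projection onto {w : ||w||_1 <= radius} (sorting-based algorithm)."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u * idx > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

# Generalized Lasso (LS_K) with K = sqrt(k) * B_1^p, solved by projected gradient descent.
radius = np.sqrt(k)
L = 2.0 * np.linalg.norm(X, 2) ** 2 / n     # Lipschitz constant of the gradient
beta = np.zeros(p)
for _ in range(500):
    grad = 2.0 / n * X.T @ (X @ beta - y)
    beta = project_l1_ball(beta - grad / L, radius)

print("estimation error:", np.linalg.norm(beta - beta0))
```

Any other convex hypothesis set K only requires swapping out the projection step; the rest of the iteration is unchanged.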
To the best of our knowledge, Proposition 1.4 is a new result, but it bears resemblance to a recent finding of Sattar and Oymak [46, Thm. 3.4], who consider a similar model setup with sub-exponential input vectors. Their analysis focuses on the projected gradient descent method, as an algorithmic implementation of (LS_K), and is therefore related to our estimation guarantees; see Subsection 3.4 for a more detailed comparison.

Contributions and Overview
The main purpose of this work is to shed more light on the estimation capacity of the generalized Lasso (LS_K) when the sample data are not sub-Gaussian. While Proposition 1.4 already gives a first glimpse into the prototypical situation of sub-exponential input vectors, we intend to address this problem in a more systematic and abstract way (cf. Problem 1.1). At the heart of our statistical analysis stands the so-called generic Bernstein concentration, which is introduced in Subsection 2.1 (see Definition 2.2). This concept is the outcome of a somewhat uncommon proof strategy: Instead of assuming a specific (sub-exponential) distribution for x, we study the associated excess risk of (LS_K) in an abstract sense, relying on an advanced generic chaining argument due to Mendelson [32]. Consequently, the key step of our approach is to understand the increment behavior of the underlying stochastic processes, and in fact, this precisely leads to generic Bernstein concentration as a natural condition for x. In that way, we are able to explore (LS_K) for a whole class of input distributions and thereby to refine the assumption of uniform sub-exponentiality in Proposition 1.4. Another important outcome of our analysis is a pair of general complexity parameters for the hypothesis set K (see Definitions 2.5 and 2.6), which are compatible with the notion of generic Bernstein concentration.
With these preliminaries at hand, we formulate our main result in Subsection 2.2 (see Theorem 2.10), which provides a novel, non-asymptotic error bound for (LS_K) under generic Bernstein concentration. However, a direct application of this guarantee to specific model situations is not always straightforward, since the aforementioned complexity parameters are of a local nature, implicitly depending on the desired precision level. For this reason, we present two more easily accessible corollaries of Theorem 2.10 in Subsection 2.3. These results are based on simplified complexity parameters (see Definitions 2.11 and 2.13, respectively), but come at the price of looser error bounds and sample-size conditions.
While the purpose of Section 2 is to develop a unified analysis for the generalized Lasso (LS_K), Section 3 is devoted to various applications and examples of our findings. We begin with a brief discussion on semi-parametric modeling in Subsection 3.1, demonstrating how our general results may be applied to specific parameter estimation problems. This is followed by several relevant examples of generic Bernstein concentration (see Subsections 3.2–3.4), leading to off-the-shelf guarantees for (LS_K) with sub-exponential and sub-Gaussian sample data; these parts also provide a comparison to related approaches in the literature. In Subsection 3.5, we then revisit our motivating case study on the lifted Lasso for phase-retrieval-like problems, a scenario in which sub-exponential distributions arise naturally. Finally, Subsection 3.6 contains a more detailed discussion of the complexity parameters from Section 2. In this context, it will become clearer that measuring complexity beyond sub-Gaussianity is a delicate issue and comes with largely unexplored difficulties. Nevertheless, we are able to establish simple bounds in the prototypical case of ℓ₁-balls, making our error bounds applicable to high-dimensional estimation and sparse recovery. Some concluding remarks are made in Section 4.

Differentiation from Previous Works
Apart from enabling heavier-tailed data, a crucial feature of generic Bernstein concentration is that it does not require any type of isotropy. Instead, the "geometric" behavior of the input vectors is captured by selecting two appropriate semi-norms (see Definition 2.2). This relaxation is key to the applicability of our results, as it allows us to handle structured, but anisotropic input vectors, such as those arising in the phase lift approach (see Subsection 3.5). It is not clear to us how this important challenge could be addressed with other techniques, especially those suggested in [15] and related articles mentioned above. The concept of generic Bernstein concentration therefore also presents a novel and systematic solution to this open problem, which is arguably one of the most significant achievements compared to previous works.
We close this part with another clarification: The present article is concerned with the generalized Lasso (LS_K) when the sample data are heavier-tailed than sub-Gaussian; in particular, the underlying distribution may be unbounded. An alternative strategy is to first truncate the raw data at an appropriate threshold and then to apply (LS_K) or a similar estimator. In fact, the latter approach is quite common in practice, but it also facilitates a theoretical study due to the boundedness of the involved random variables, e.g., see [19,60,62–64] for related results on non-linear observation models. However, the (concentration-based) machinery for bounded sample data is certainly not applicable to the model setup of the present paper. Instead, we rather follow the conceptual ideas of Mendelson [31,33], who points out the downsides of the bounded framework and develops a general theory for heavy-tailed problems. Although our analysis is concerned with more specific model assumptions, it is just general enough to allow for a rigorous understanding of estimation with sub-exponential data, thereby unifying and improving a series of previously known results from the literature (see Section 3). Thus, to a certain degree, our work can be seen as a "connecting piece" between such highly customized approaches and the abstract theory of Mendelson [31,33].

Notation
The letter C is reserved for constants, whose values could change from time to time, and we say that C is universal if its value does not depend on any other involved parameter. If an inequality holds true up to a universal constant C > 0, we usually write A ≲ B instead of A ≤ C · B; the notation A ≍ B means that both A ≲ B and B ≲ A hold true. Furthermore, the positive part of a real number s ∈ R is denoted by [s]₊ := max{s, 0}.
The cardinality of a finite set I is denoted by |I|. The j-th entry of a vector v ∈ R^p is denoted by v_j and the support of v is defined as supp(v) := {j | v_j ≠ 0}. The cardinality of supp(v) is referred to as the sparsity of v and we write ‖v‖₀ := |supp(v)|. For 1 ≤ q ≤ ∞, we denote the ℓ_q-norm on R^p by ‖·‖_q and the associated unit ball by B_q^p. The Euclidean unit sphere is given by S^{p−1} := {v ∈ R^p | ‖v‖₂ = 1}. The Frobenius norm is denoted by ‖·‖_F and the spectral norm by ‖·‖_op. We write I_p ∈ R^{p×p} for the identity matrix. Let L ⊂ R^p. By span(L), cone(L), and conv(L), we denote the linear hull, conic hull, and convex hull, respectively. The diameter of L with respect to a (pseudo-)metric d is defined as ∆_d(L) := sup_{v,w∈L} d(v, w); for the metric induced by a (semi-)norm, we use the corresponding subscript, e.g., ∆₂(L) for the Euclidean diameter. For v ∈ R^p, we use the notation v* for the linear functional ⟨·, v⟩, i.e., v* is the image of v under the Riesz isomorphism; analogously, we write A* for the image of a subset A ⊂ R^p under the Riesz isomorphism. Furthermore, if R^p is equipped with a probability measure µ, we can interpret v* as a random variable, i.e., v* = ⟨x, v⟩, where x is distributed according to µ. In particular, we have that ‖v*‖_{L^q} = (E[|⟨x, v⟩|^q])^{1/q}.

Main Results
This section presents the main results of this work. We begin with several technical preliminaries in Subsection 2.1, including the central concept of generic Bernstein concentration (see Definition 2.2) as well as the related complexity parameters (see Definition 2.5 and 2.6). The most general estimation guarantee is then formulated and discussed in Subsection 2.2 (see Theorem 2.10). This is followed by two corollaries in Subsection 2.3, employing simplified variants of our complexity parameters. Note that all proofs for this section are postponed to Section 5.

Preliminaries and Generic Bernstein Concentration
An error bound for the generalized Lasso (LS_K) is a statement about the minimizer of the following function:

Definition 2.1 (Empirical risk, excess risk) The objective function minimized in (LS_K), i.e.,

    L̂(β) := (1/n) ∑_{i=1}^{n} (yᵢ − ⟨xᵢ, β⟩)²,   β ∈ R^p,

is called the empirical risk. Moreover, for β, β♮ ∈ R^p, we call

    E(β, β♮) := L̂(β) − L̂(β♮)

the excess risk of β over β♮.
Since the map β →L(β) depends on the random pairs (x i , y i ), it can be seen as a stochastic process on the hypothesis set K. If the excess risk is strictly positive on a subset of K, the minimizer must be outside of this subset. In other words, we can localize the empirical risk minimizer in a certain set L ⊂ K if we have a positive lower bound for the excess risk on K \ L (see Fact 2.4 below). A powerful technique for proving such lower bounds is generic chaining for stochastic processes (see [50,51]).
The following definition introduces a generic concentration inequality for linear functions on the parameter space, which leads to an increment condition for the involved stochastic processes. Based on this condition, we will use chaining arguments to derive a generic error bound for (LS_K) (see Theorem 2.10 and its proof in Subsection 5.2). Estimation guarantees for specific classes of input vectors can then be obtained by considering concrete instances of this condition (see Section 3).

Definition 2.2 (Generic Bernstein concentration) Let x ∈ R^p be a random vector and let ‖·‖_g and ‖·‖_e be two semi-norms on R^p. We say that x exhibits generic Bernstein concentration with respect to (‖·‖_g, ‖·‖_e) if for every v ∈ R^p and every t ≥ 0, we have that

    P( |⟨x, v⟩ − E[⟨x, v⟩]| ≥ t ) ≤ 2 · exp( −min{ t²/‖v‖_g², t/‖v‖_e } ),   (2.1)

where exp(−∞) := 0 and t/0 := ∞ for t > 0, 0 for t = 0.
The prototypical instance of generic Bernstein concentration is a centered random vector x = (x₁, . . . , x_p) ∈ R^p with independent, sub-exponential coordinates: indeed, such an x exhibits generic Bernstein concentration with respect to (C_B · R · ‖·‖₂, C_B · R · ‖·‖_∞), where R := max_{1≤j≤p} ‖x_j‖_{ψ_1} and C_B > 0 is a universal constant. In this case, (2.1) simply corresponds to the classical Bernstein inequality (see Theorem A.3), justifying the terminology of Definition 2.2. More generally, (2.1) can be seen as an example of mixed-tail conditions, which are quite common in the generic chaining literature, e.g., see [11, Thm. 3.5] or [51, Thm. 2.2.23]. To be more specific, the semi-norm ‖·‖_g governs the Gaussian-like ('g') tail, while ‖·‖_e governs the exponential-like ('e') tail.
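The following sketch illustrates the mixed-tail inequality (2.1) by Monte Carlo for a vector with i.i.d. Laplace coordinates; the semi-norms are taken proportional to ‖·‖₂ and ‖·‖_∞ with ad-hoc pre-factors (the constant 3 below is an arbitrary assumption, not the constant C_B from the text).

```python
import numpy as np

rng = np.random.default_rng(1)
p, trials = 100, 100_000
v = rng.standard_normal(p)                  # an arbitrary fixed direction

# Centered coordinates with uniformly bounded psi_1 norm: standard Laplace.
x = rng.laplace(size=(trials, p))
z = x @ v                                    # the linear marginal <x, v>

norm_g = 3.0 * np.linalg.norm(v, 2)          # ad-hoc "Gaussian" semi-norm ||v||_g
norm_e = 3.0 * np.linalg.norm(v, np.inf)     # ad-hoc "exponential" semi-norm ||v||_e

for t in [15.0, 30.0, 60.0, 90.0]:
    empirical = np.mean(np.abs(z) >= t)
    bound = 2.0 * np.exp(-min(t**2 / norm_g**2, t / norm_e))
    print(f"t = {t:5.1f}   empirical tail {empirical:.2e}   mixed-tail bound {bound:.2e}")
```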
The central idea of generic chaining is that the expected infimum (or supremum) of a stochastic process depends on the "size" of the underlying index set, which is equipped with a (pseudo-)metric that reflects the increment behavior of the stochastic process. For certain classes of canonical processes, the appropriate way of measuring the size is given by the well-known γ-functional:

Definition 2.3 (γ-functional; [51, Def. 2.2.19]) Let L be a set equipped with a pseudo-metric d. We call a sequence (A_s)_{s∈N} of partitions of L an admissible partition sequence if |A₀| = 1 and |A_s| ≤ 2^{2^s} for s ≥ 1 and if the sequence is increasing, i.e., for every A ∈ A_{s+1} there is some A′ ∈ A_s with A ⊂ A′. (As usual, by a partition of L, we mean a family of pairwise disjoint, non-empty subsets of L whose union is L.) For α ∈ {1, 2}, the γ_α-functional of (L, d) is defined by

    γ_α(L, d) := inf sup_{v∈L} ∑_{s=0}^{∞} 2^{s/α} · ∆_d(A_s(v)),

where A_s(v) is the unique set in A_s containing v and the infimum is taken over all admissible partition sequences. In this work, we will only deal with pseudo-metrics induced by semi-norms. Hence, we may write γ_α(L, ‖·‖) := γ_α(L, d_{‖·‖}) where d_{‖·‖} is the pseudo-metric induced by a semi-norm ‖·‖.
Returning to the issue of finding an error bound for (LS_K), let us now fix some precision level t > 0 and an arbitrary target vector β♮ ∈ K (see also Problem 1.1). At the present level of abstraction, it is beneficial to leave the notion of the 'estimation error' as general as possible. For the sake of mental convenience, β♮ can be seen as a desirable outcome of an estimation procedure (e.g., the expected risk minimizer on K), but this interpretation is mathematically irrelevant. The error measure that will concern us in this section is the Euclidean distance ‖β̂ − β♮‖₂, where β̂ ∈ K is the estimate of the generalized Lasso, i.e., a minimizer of (LS_K). Since E(·, β♮) is a convex function (on K), one can make use of the following basic, yet important fact:

Fact 2.4 Let K_{β♮,t} := {β ∈ K | ‖β − β♮‖₂ = t}. If E(β, β♮) > 0 for all β ∈ K_{β♮,t}, then every minimizer β̂ of (LS_K) satisfies the error bound ‖β̂ − β♮‖₂ < t.
Consequently, it suffices to control E(·, β♮) on the spherical subset K_{β♮,t} of radius t around β♮. To this end, we loosely follow the approach of Mendelson [31] and decompose the excess risk as follows:

    E(β, β♮) = (1/n) ∑_{i=1}^{n} ⟨xᵢ, β − β♮⟩² + (2/n) ∑_{i=1}^{n} (⟨xᵢ, β♮⟩ − yᵢ) · ⟨xᵢ, β − β♮⟩ =: Q(β − β♮) + M(β, β♮).   (2.2)

In this decomposition, the excess risk is expressed as a sum of two empirical processes Q(β − β♮) and M(β, β♮), both indexed by β ∈ K_{β♮,t}, which we call the quadratic process and the multiplier process, respectively. Note that this corresponds to a second-order Taylor expansion of E(·, β♮) = L̂(·) − L̂(β♮): the quadratic process is the second-order term and the multiplier process is the first-order term M(β, β♮) = ⟨(∇L̂)(β♮), β − β♮⟩. Since the Hessian matrix HL̂(β♮) ∈ R^{p×p} is actually independent of β♮ for the squared loss, the quadratic process is translation-invariant in the sense that it only depends on β − β♮; hence, we write Q(β − β♮) rather than Q(β, β♮).

Figure 1: Illustration of the local q-complexity q^{(g,e)}_{t,n}(L) from Definition 2.5: (a) In order to measure the complexity of L locally at scale t, we consider the set L ∩ tS^{p−1}. (b) L ∩ tS^{p−1} is contained in the convex hull of the four points indicated by small black dots. (c) Defining S as these four points, the quantity (1/t) · (γ₁(S, ‖·‖_e)/√n + γ₂(S, ‖·‖_g + ‖·‖_e)) is an upper bound for the local q-complexity of L at scale t, which is defined as the infimum over all such upper bounds.
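The decomposition (2.2) is purely algebraic and can be verified numerically; the following minimal sketch (with an arbitrary toy data model) checks that the empirical excess risk coincides with the sum of the quadratic and the multiplier part.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 10
X = rng.laplace(size=(n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

beta_nat = rng.standard_normal(p)           # an arbitrary target vector
beta = rng.standard_normal(p)               # an arbitrary competitor

emp_risk = lambda b: np.mean((y - X @ b) ** 2)
excess = emp_risk(beta) - emp_risk(beta_nat)

h = beta - beta_nat
quadratic = np.mean((X @ h) ** 2)                          # Q(beta - beta_nat)
multiplier = 2.0 * np.mean((X @ beta_nat - y) * (X @ h))   # M(beta, beta_nat)

print(excess, quadratic + multiplier)       # the two numbers agree up to round-off
```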
With this notation at hand, the desired uniform lower bound E(β, β♮) > 0 amounts to the event

    { E(β, β♮) > 0 for all β ∈ K_{β♮,t} }.

Based on the γ-functional, we now define two general complexity parameters, which are adapted to the analysis of the quadratic process and the multiplier process, respectively. Both parameters are tailored to the above notion of generic Bernstein concentration and have in common that they measure the complexity of a set locally, i.e., at a certain scale t > 0. This reflects the fact that we are only interested in the behavior of the empirical processes on K_{β♮,t} and not on the full hypothesis set K.

Definition 2.5 (Local q-complexity) Let L ⊂ R^p and let ‖·‖_g and ‖·‖_e be semi-norms on R^p. For t > 0, we define the local q-complexity of L at scale t and sample size n with respect to (‖·‖_g, ‖·‖_e) by

    q^{(g,e)}_{t,n}(L) := inf{ (1/t) · ( γ₁(S, ‖·‖_e)/√n + γ₂(S, ‖·‖_g + ‖·‖_e) ) | S ⊂ R^p such that L ∩ tS^{p−1} ⊂ conv(S) }.

Remarkably, we do not simply measure the size of the set L ∩ tS^{p−1} in Definition 2.5, but optimize over all "skeletons" S of this set; see Figure 1 for an illustration and [36, Appx. A] for a related approach in the literature.

Definition 2.6 (Local m-complexity) Let L ⊂ R^p and let ‖·‖_g and ‖·‖_e be semi-norms on R^p. For t > 0, we define the local m-complexity of L at scale t with respect to (‖·‖_g, ‖·‖_e) by m^{(g,e)}_t(L); its construction parallels Definition 2.5 and is based on the γ-functionals of "skeletons" S, with the additional requirement that conv(S) also contains the origin.

It is worth noting that in the well-understood case of sub-Gaussian sample data, the q-complexity and m-complexity can both be identified with the notion of local Gaussian width; see also Subsection 3.2 for more details. In general, however, this simple geometric interpretation is no longer valid and the behavior of both parameters is highly non-trivial. We will return to this important issue later in Subsection 3.6.
In order to control the quadratic process (in Subsection 5.2.1), we will apply the small-ball method, which is a powerful tool to establish uniform lower bounds for non-negative empirical processes (see [26,31,33]). For this purpose, the notion of a small-ball function is required:

Definition 2.7 (Small-ball function; [26, p. 12995]) Let L ⊂ R^p and let x be a random vector in R^p. For θ ≥ 0, we define the small-ball function

    Q_{L,x}(θ) := inf_{v∈L} P( |⟨x, v⟩| ≥ θ ).

Since we are aiming at an error bound relative to an arbitrary target vector β♮ ∈ K, it is natural that this error bound depends on how well the associated linear hypothesis ⟨x, β♮⟩ predicts the actual output variable y (which may depend on x in a non-linear way). In other words, the estimation performance of (LS_K) is also affected by the behavior of the model mismatch y − ⟨x, β♮⟩, measuring how much y deviates from the linear model ⟨x, β♮⟩. The following parameters allow us to make this precise:

Definition 2.8 (Mismatch parameters) Given β♮ ∈ R^p and a random pair (x, y) ∈ R^p × R, the mismatch deviation of β♮ is defined by

    σ(β♮) := ‖y − ⟨x, β♮⟩‖_{ψ_1}.

Moreover, for t ≥ 0 and K ⊂ R^p, we define the (local) mismatch covariance ρ_t(β♮) of β♮ at scale t, which measures the correlation between the input vector x and the model mismatch y − ⟨x, β♮⟩ along the directions of (K − β♮) ∩ tS^{p−1}; we write ρ(β♮) for its global counterpart.

As the name suggests, the mismatch covariance captures the covariance between the input vector x and the model mismatch y − ⟨x, β♮⟩. Inspired by linear regression problems, it is useful to think of the model mismatch as "noise" that perturbs the linear model ⟨x, β♮⟩. In particular, if E[(y − ⟨x, β♮⟩)x] = 0, this noise is uncorrelated with all input variables (but not necessarily independent), implying that ρ(β♮) = ρ_t(β♮) = 0. In contrast, the mismatch deviation measures the sub-exponential tail behavior of the model mismatch. Note that in the noisy linear case, i.e., y = ⟨x, β♮⟩ + ν, we simply have that σ(β♮) = ‖ν‖_{ψ_1}. The interested reader is referred to Appendix B.1 for further remarks on the above notions of the mismatch covariance.
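Definition 2.7 is straightforward to probe numerically. The sketch below estimates the small-ball function by Monte Carlo over a finite grid of random directions; note that minimizing over a finite grid only upper-bounds the true infimum over K_∆, and the Laplace input model is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
p, trials, directions = 50, 20_000, 500
x = rng.laplace(scale=1.0 / np.sqrt(2), size=(trials, p))   # unit-variance Laplace inputs

# Random unit directions as a crude stand-in for K_Delta = span(K - K) ∩ S^{p-1}.
V = rng.standard_normal((directions, p))
V /= np.linalg.norm(V, axis=1, keepdims=True)

def small_ball(theta):
    # P(|<x, v>| >= theta) for each grid direction, then the minimum over the grid;
    # this only upper-bounds the true infimum over all directions.
    probs = np.mean(np.abs(x @ V.T) >= theta, axis=0)
    return probs.min()

for theta in [0.1, 0.25, 0.5, 1.0]:
    print(f"theta = {theta:4.2f}   grid-min of P(|<x,v>| >= theta) = {small_ball(theta):.3f}")
```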

A Local Error Bound for (LS K )
Before stating the error bound, let us formally summarize our assumptions about the sampling process:

Assumption 2.9 (Model setup) Let (x, y) ∈ R^p × R be a joint random pair where x ∈ R^p satisfies generic Bernstein concentration with respect to (‖·‖_g, ‖·‖_e) and y ∈ R is sub-exponential. Moreover, let K ⊂ R^p be a convex hypothesis set. We define the set K_∆ := span(K − K) ∩ S^{p−1} and assume that x satisfies a small-ball condition on K_∆ (in the sense of Definition 2.7) for some τ > 0. Finally, we assume that the observed sample pairs (x₁, y₁), . . . , (xₙ, yₙ) are independent copies of (x, y).
Although it can be helpful to imagine a semi-parametric relationship between x and y (see Subsection 3.1), such an assumption is not required at the current level of abstraction. Indeed, our main result, which is presented next, provides a generic error bound for the generalized Lasso (LS_K) without any specific observation model.

Theorem 2.10 (General error bound for (LS_K), local version) Let Assumption 2.9 be satisfied and fix a vector β♮ ∈ K. Then there exists a universal constant C > 0 such that for every u ≥ 8 and t ≥ 0, the following holds true with probability at least 1 − 5 exp(−C · u²) − 2 exp(−C · √n): If the sample size obeys the condition (2.4) and, in addition, the condition (2.5) is satisfied, then every minimizer β̂ of (LS_K) satisfies ‖β̂ − β♮‖₂ ≤ t. (For the case of exact recovery, i.e., t = 0, the corresponding complexity parameters q^{(g,e)}_{0,n}(K − β♮) and m^{(g,e)}_0(K − β♮) are introduced further below in Definition 2.11 in Subsection 2.3.)

The interpretation of the error bound established in Theorem 2.10 is not straightforward, since the right-hand side of (2.5) depends on the precision level t and the right-hand side of (2.4) depends on both t and n. But regardless of these implicit dependencies, the above statement has almost the same syntactic form as in the case of sub-Gaussian sample data, e.g., see [15, Thm. 3.6], and we can rely on the interpretation suggested there. The following way of reading Theorem 2.10 is quoted from [15, p. 41], except that the mathematical terms and the equation numbers have been altered accordingly:

A convenient way to read the above statement is as follows: First, fix an estimation accuracy t that can be tolerated. Then adjust the sample size n and β♮ ∈ K such that (2.4) and (2.5) are both fulfilled (if possible at all). In particular, if n is chosen such that (2.5) just holds with equality (up to a constant), we obtain an error bound of the form (2.6).

With that in mind, one might be tempted to think that not much changes when going beyond sub-Gaussianity, but this is far from being true. The key difference manifests itself in our generalized complexity parameters q^{(g,e)}_{t,n} and m^{(g,e)}_t. In fact, their behavior can be significantly more complicated than in the sub-Gaussian case. We defer a more detailed discussion to Subsection 3.6, but the applications in Subsections 3.2–3.5 can also be helpful for a better understanding of this issue. Finally, several additional remarks on Theorem 2.10 and possible refinements can be found in Appendix B.2.

Global and Conic Error Bounds for (LS K )
The local complexity parameters in Theorem 2.10 lead to a fairly strong, but implicit error bound for (LS_K). In this subsection, we state two corollaries of Theorem 2.10 which achieve a better interpretability at the price of suboptimality. The first one replaces the local complexity terms by their (more pessimistic) conic versions:

Definition 2.11 (Conic q- and m-complexity) Let L ⊂ R^p and let ‖·‖_g and ‖·‖_e be semi-norms on R^p. We define the conic q-complexity of L at sample size n with respect to (‖·‖_g, ‖·‖_e) by q^{(g,e)}_{0,n}(L); similarly, we define the conic m-complexity of L with respect to (‖·‖_g, ‖·‖_e) by m^{(g,e)}_0(L). In both cases, the defining expressions are those of Definitions 2.5 and 2.6 with the set L ∩ tS^{p−1} replaced by cone(L) ∩ S^{p−1}.

The notation q^{(g,e)}_{0,n}(L) and m^{(g,e)}_0(L) indicates that one can imagine the conic q- and m-complexity as the limit case t = 0 of their local counterparts from Definitions 2.5 and 2.6.
The conic complexity parameters allow us to remove the dependence of the right-hand sides of (2.4) and (2.5) on t in Theorem 2.10; this leads to our first corollary (Corollary 2.12). While providing an explicit error bound for the generalized Lasso (cf. (2.6)), Corollary 2.12 has the following drawback: If β♮ is an interior point of K, then cone(K − β♮) = R^p, and the complexity terms reduce to q^{(g,e)}_{0,n}(R^p) and m^{(g,e)}_0(R^p), respectively, i.e., they no longer reflect any complexity reduction due to the restricted hypothesis set K. Hence, unless the hypothesis set K is perfectly tuned such that β♮ is located on the boundary of K, Corollary 2.12 fails to provide a useful estimation guarantee in the high-dimensional regime p ≫ n. Evidently, this tuning problem affects the local error bound in Theorem 2.10 as well, but the situation is much less severe there, at least when β♮ is close to the boundary of K (more precisely, if inf_{β∈R^p\K} ‖β♮ − β‖₂ < t). This fact particularly explains why (LS_K) is a stable estimator.

Our second approach to simplify Theorem 2.10 is to measure the complexity of the hypothesis set "globally", rather than in a local neighborhood of β♮.

Definition 2.13 (Global q- and m-complexity) Let L ⊂ R^p and let ‖·‖_g and ‖·‖_e be semi-norms on R^p. We define the global q-complexity of L at sample size n with respect to (‖·‖_g, ‖·‖_e) by q^{(g,e)}_n(L); similarly, we define the global m-complexity of L with respect to (‖·‖_g, ‖·‖_e) by m^{(g,e)}(L).

The following lemma (Lemma 2.14) provides some basic facts about the global complexity parameters and relates them to their local counterparts. In particular, its second claim states that the global complexity parameters are translation-invariant. This allows us to decouple the complexity terms in Theorem 2.10 from β♮, leading to the following error bound:

Corollary 2.15 (General error bound for (LS_K), global version) Let Assumption 2.9 be satisfied and fix a vector β♮ ∈ K. Then there exists a universal constant C > 0 such that for every u ≥ 8, the following holds true with probability at least 1 − 5 exp(−C · u²) − 2 exp(−C · √n): If the sample size is sufficiently large, then every minimizer β̂ of (LS_K) satisfies the error bound (2.8).

If the complexity terms q^{(g,e)}_n(K) and m^{(g,e)}(K) are sufficiently small, then (2.8) provides a useful error bound in the high-dimensional regime p ≫ n, independently of the location of β♮ in K. Note that a prototypical application of Corollary 2.15 was already presented in Proposition 1.4 where K is a convex polytope (see also Proposition 3.13 in Subsection 3.6). However, the simplification of Corollary 2.15 has its price: the second summand in the error bound (2.8) exhibits a decay rate O(n^{−1/4}), which is substantially worse than the rate O(n^{−1/2}) achieved in Theorem 2.10 and Corollary 2.12. Moreover, the dependence on σ(β♮) is suboptimal in the "low-noise" regime, i.e., when σ(β♮) ≪ 1.

Applications and Examples
This section is devoted to specific applications of the generic error bounds presented in Section 2. We begin with a discussion of semi-parametric estimation problems in Subsection 3.1, in particular, how the generalized Lasso (LS_K) performs with non-linear output models. In Subsections 3.2–3.5, we then demonstrate that generic Bernstein concentration covers a whole "spectrum" of relevant distributions, in which (uniformly) sub-exponential and sub-Gaussian input vectors appear merely as the two marginal cases. Finally, we continue our discussion on the notions of q- and m-complexity in Subsection 3.6, thereby focusing on the prototypical situation of sparse recovery via ℓ₁-constraints.

Semi-Parametric Estimation Problems and the Mismatch Principle
We intentionally did not make a concrete choice of the target vector β♮ in Section 2. This strategy has led to very flexible (generic) error bounds for (LS_K), but it does not address any specific estimation problem. As already pointed out after the initial Problem 1.1, a valid choice of β♮ is the expected risk minimizer. Indeed, if x is isotropic and β♮ := E[yx] ∈ K, then β♮ is the expected risk minimizer (on both K and R^p) and we have that ρ(β♮) = ρ_t(β♮) = 0 (see Appendix B.1 and Figure 2 there). Hence, according to Theorem 2.10 (or its corollaries), (LS_K) yields a consistent estimator of β♮.
While such a statement is common in statistical learning, a much less obvious phenomenon is the capability of (LS_K) to solve semi-parametric estimation problems. In the context of this article, we may express a semi-parametric observation model as follows:

    y = F(x, β₀),

where β₀ ∈ R^p is an unknown parameter vector and F : R^p × R^p → R is a scalar output function which can be non-linear, random, and unknown. Agreeing on this model setup, the ultimate hope is now that (LS_K) is a (consistent) estimator of β₀. It turns out that this is often possible at least to a certain extent, even though fitting a linear model to non-linear observations might appear counterintuitive at first sight. A typical example is the simple classification rule y = sign(⟨x, β₀⟩), where there is still hope to recover the direction of β₀, but not its magnitude. This limitation gives rise to a relaxed estimation problem:

Problem 3.1 Is the generalized Lasso (LS_K) capable of estimating any element from a certain target set T_{β₀} ⊂ R^p, which contains all those parameter vectors that allow us to extract the information of interest?
Similarly to the more general formulation of Problem 1.1, the term 'information' is left unspecified here and depends on what a user considers as a desirable outcome of an estimation procedure. In the above example of binary classification, a natural choice of target set would be T β 0 := span({β 0 }), if one is interested in the recovery of any scalar multiple of β 0 .
Our guarantees from Section 2 allow us to tackle Problem 3.1 in a very systematic way: Select β♮ ∈ T_{β₀} ∩ K such that the mismatch covariance ρ(β♮) becomes as small as possible, and then apply Theorem 2.10 (or one of its corollaries) to obtain an error bound for the estimation of β♮. (For the sake of clarity, we only consider the global mismatch covariance here, which is easier to interpret and forms an upper bound for ρ_t(β♮) according to Appendix B.1; but refinements are certainly possible when analyzing ρ_t(β♮) instead of ρ(β♮). If x is isotropic, this selection procedure also has a nice geometric interpretation due to Appendix B.1.) This strategy ensures that the resulting target vector β♮ encodes the desired information, while the (asymptotic) bias of (LS_K) is brought under control. In particular, if ρ(β♮) = 0, we achieve a consistent estimator of β♮; note that the corresponding mismatch deviation σ(β♮) can still be large, but its size only affects the variance of the error ‖β̂ − β♮‖₂. The approach just described was developed by Genzel [15, Chap. 4], where it is referred to as the mismatch principle (see also the technical report [17]). It is worth pointing out that there is an important conceptual difference to the "naive" idea of first explicitly computing the expected risk minimizer (on K) and then finding the closest point on the target set T_{β₀}: indeed, we measure the complexity of K locally at β♮, which enables us to exploit beneficial geometric features directly on T_{β₀}.
We refer the reader to [15, Chap. 4] for a more extensive discussion of the mismatch principle and various applications to semi-parametric estimation problems. In the present work, we confine ourselves to an illustration in the prototypical situation of single-index models.

Proposition 3.2 Let x ∈ R^p be a centered, isotropic random vector. We assume that y obeys a single-index model of the form

    y = f(⟨x, β₀⟩),   (3.1)

where β₀ ∈ S^{p−1} is an unknown parameter vector and f : R → R is a (possibly unknown) scalar distortion function. Then β♮ := µβ₀ with µ := E[y · ⟨x, β₀⟩] minimizes the (global) mismatch covariance over T_{β₀} := span({β₀}), and we have that

    ρ(β♮) = ‖P^⊥_{β₀} E[yx]‖₂,

where P^⊥_{β₀} ∈ R^{p×p} is the projection onto the orthogonal complement of span({β₀}). In particular, if x is a standard Gaussian random vector, we have that ρ(β♮) = 0.
In the special case of a Gaussian input vector, Proposition 3.2 reproduces the original finding of Plan and Vershynin [42]: despite an unknown, non-linear distortion, the generalized Lasso still allows for consistent estimation of the parameter vector, or at least a scalar multiple of it. When combining Proposition 3.2 with the results from Section 2 (for an appropriately tuned hypothesis set K), we observe that their conclusion remains essentially valid for non-Gaussian inputs as long as the mismatch covariance ρ(β♮) vanishes or gets sufficiently small. On the other hand, if ρ(β♮) is too large, it can be useful to employ an adaptive estimator instead (e.g., see [61–63]), but there also exist worst-case scenarios where an asymptotic bias is inevitable, regardless of the considered estimator (see [2]). Finally, it is worth pointing out that we did not make any (explicit) assumptions on the tail behavior of the distribution of x in this subsection. Therefore, one can easily combine the above described approach with the findings of the forthcoming subsections, which investigate specific instances of generic Bernstein concentration.
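As a numerical companion to Proposition 3.2 (a sketch under the Gaussian single-index model; the distortion f = tanh and all problem sizes are arbitrary choices), one can estimate the rescaling factor µ and check that the mismatch covariance vector E[(y − ⟨x, µβ₀⟩)x] is nearly zero:

```python
import numpy as np

rng = np.random.default_rng(4)
p, trials = 20, 200_000
beta0 = np.zeros(p); beta0[0] = 1.0         # unit-norm parameter vector

f = np.tanh                                  # an arbitrary (unknown to the estimator) distortion

x = rng.standard_normal((trials, p))
y = f(x @ beta0)

mu = np.mean(y * (x @ beta0))                # estimate of mu = E[f(<x,beta0>) <x,beta0>]
beta_nat = mu * beta0

# Mismatch covariance vector E[(y - <x, beta_nat>) x]; for Gaussian x it should vanish.
mismatch = np.mean((y - x @ beta_nat)[:, None] * x, axis=0)
print("mu ~", mu)
print("||E[(y - <x, beta_nat>) x]||_2 ~", np.linalg.norm(mismatch))
```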

Sub-Gaussian Input Vectors
The current and subsequent subsections are devoted to several examples of generic Bernstein concentration (see Definition 2.2). Let us begin with the situation of (uniformly) sub-Gaussian input vectors. A characteristic property of sub-Gaussian random variables is that their tails are essentially not heavier than those of the normal distribution (see Proposition A.1(i)). Combining this property with Definition 1.3, one can easily verify that a sub-Gaussian random vector x ∈ R^p with sub-Gaussian norm ‖x‖_{ψ_2} exhibits generic Bernstein concentration with respect to (C · ‖x‖_{ψ_2} · ‖·‖₂, 0) for a universal constant C > 0. Since the sub-exponential part of the mixed-tail condition is effectively erased here by setting ‖·‖_e := 0, we observe that sub-Gaussian input vectors form a degenerate limit case at the lighter-tailed end of the "spectrum" of generic Bernstein concentration. Regarding the q- and m-complexities, the identity ‖·‖_e = 0 implies that the γ₁-functional effectively vanishes in their respective definitions, so that we end up with a rescaled version of the functional γ₂(·, ‖·‖₂). The celebrated Majorizing Measure Theorem of Talagrand [51, Thm. 2.4.1] relates this functional to a well-known complexity parameter, the Gaussian width

    w(L) := E[ sup_{v∈L} ⟨g, v⟩ ],   g ∼ N(0, I_p),   L ⊂ R^p.

The Gaussian width originates from classical results in geometric functional analysis and asymptotic convex geometry, e.g., see [18,20,35]. More recently, it has emerged as a useful tool for the analysis of high-dimensional estimation problems, e.g., see [3,10,34,37,44,49,57]. The connection to our analysis, which is provided by Talagrand's Majorizing Measure Theorem, is the fact that for every subset L ⊂ R^p, we have

    γ₂(L, ‖·‖₂) ≍ w(L).   (3.2)

Apart from a simple geometric interpretation of the resulting complexity parameters, (3.2) implies that the optimization over the "skeleton" is irrelevant (up to constants) in the sub-Gaussian case, since the Gaussian width is invariant under taking the convex hull. This explains why such an optimization is uncommon in the literature dealing with sub-Gaussian input data. The following fact summarizes the above considerations and allows us to relate the generic (global) error bound from Corollary 2.15 to the sub-Gaussian setting:

Fact 3.4 Let x ∈ R^p be a (uniformly) sub-Gaussian random vector, i.e., ‖x‖_{ψ_2} < ∞. Then x exhibits generic Bernstein concentration with respect to (C · ‖x‖_{ψ_2} · ‖·‖₂, 0). The global q- and m-complexities then satisfy the relation (3.3).

The comparison of our results with existing ones is facilitated by introducing the normalized complexities q^{(2,0)}_n and m^{(2,0)} in (3.3). Indeed, both parameters are unaffected by a rescaling of the input vector, which is a common feature of complexity measures defined in the literature. In contrast, their "unnormalized" counterparts q^{(g,e)}_n and m^{(g,e)} "absorb" the norm ‖x‖_{ψ_2} as a scalar pre-factor for the generic semi-norms ‖·‖_e and ‖·‖_g.
With regard to the local and conic q- and m-complexities, Talagrand's Majorizing Measure Theorem leads to similar conclusions. The (normalized) local q-complexity corresponds to the notion of local Gaussian width (cf. [14,42]). Since the definition of the (normalized) local m-complexity requires that conv(S) also contains the origin, it is not strictly equivalent to the local Gaussian width, but incorporates an additional constant term (cf. Appendix B.2(4)). Analogously, the (normalized) conic q-complexity and m-complexity correspond to the notion of conic Gaussian width (cf. [3,10]). A combination of these identifications with the corresponding error bounds from Section 2 (Theorem 2.10, Corollary 2.12, and Corollary 2.15) allows us to reproduce known estimation guarantees for (sub-)Gaussian sample data, e.g., see [15,42].
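For intuition about the Gaussian width appearing in (3.2), the following sketch (illustrative only) estimates w(B₁^p) by Monte Carlo and compares it with the familiar √(2 log p) scaling; since the supremum of ⟨g, ·⟩ over B₁^p is attained at a signed vertex, the width reduces to E‖g‖_∞.

```python
import numpy as np

rng = np.random.default_rng(5)
trials = 2_000

for p in [100, 1_000, 10_000]:
    g = rng.standard_normal((trials, p))
    width = np.abs(g).max(axis=1).mean()     # w(B_1^p) = E ||g||_inf
    print(f"p = {p:6d}   w(B_1^p) ~ {width:5.2f}   sqrt(2 log p) = {np.sqrt(2 * np.log(p)):5.2f}")
```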

Input Vectors with Independent Sub-Exponential Features
Although it is reassuring that the generic error bounds from Section 2 are consistent with existing results for the sub-Gaussian case, this setup does not constitute a proper example of the mixed-tail condition in Definition 2.2. A more natural example is given by input vectors with centered, independent, sub-exponential coordinates. In this case, generic Bernstein concentration can simply be implemented via the classical Bernstein inequality (see Theorem A.3). The following fact is a direct consequence of Theorem A.3.
Fact 3.5 Let x ∈ R^p be a random vector with centered, independent, sub-exponential coordinates. Set R := max_{1≤j≤p} ‖x_j‖_{ψ_1}. Then x exhibits generic Bernstein concentration with respect to (C_B · R · ‖·‖₂, C_B · R · ‖·‖_∞), where C_B > 0 is a universal constant (cf. Theorem A.3). The corresponding global q- and m-complexities are then measured with respect to the pair (‖·‖₂, ‖·‖_∞), up to the scalar factor C_B · R.

Remarkably, there exists a geometric interpretation of m^{(2,∞)} that is very similar to (3.2). To this end, let L ⊂ R^p and let Y = (Y₁, . . . , Y_p) be a random vector with independent, symmetric coordinates satisfying P(|Y_j| ≥ t) = exp(−t) for all j = 1, . . . , p. By a result of Talagrand [51, Thm. 10.2.8], we then have

    γ₂(L, ‖·‖₂) + γ₁(L, ‖·‖_∞) ≍ E[ sup_{v∈L} ⟨Y, v⟩ ].   (3.4)

Inspired by the notion of 'Gaussian' width, the expression on the right-hand side is referred to as the exponential width of L. Similarly to the sub-Gaussian case in Subsection 3.2, the relation (3.4) shows that the optimization over the "skeleton" does not make a difference in the present scenario, at least when ignoring universal constants. Therefore, we can conclude that the normalized m-complexity m^{(2,∞)} is equivalent to the exponential width. The idea of using the exponential width as a complexity measure for sub-exponential input vectors was proposed by Sivakumar et al. [48]. In contrast to the Gaussian width, the exponential width is not rotation-invariant: only the sub-Gaussian component of the complexity is tied to the Euclidean structure on R^p, whereas the sub-exponential component of the complexity is described by the ℓ_∞-norm. Based on results by Talagrand, it follows from [48, Thm. 1] that for every subset L ⊂ R^p, the exponential width can be controlled in terms of the Gaussian width of L. Regarding the feasibility of estimation in high dimensions, this shows that the situation for input vectors with independent, sub-exponential features is not substantially worse than for sub-Gaussian sample data.
According to Lemma 2.14(iii), the normalized global q-complexity q (2,∞) n (defined analogously to m (2,∞) ) can also be controlled by the exponential width. With this in mind, our lower bound for the quadratic process in Proposition 5.5 resembles a result of Sivakumar et al. [48,Thm. 3], if we assume input vectors with independent, sub-exponential entries. In contrast, [48,Thm. 3] just requires isotropy and uniform sub-exponentiality. Due to a lack of published proofs, we were unable to verify that the exponential width is the correct complexity measure for this more general scenario.
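In the same spirit, the exponential width from (3.4) can be estimated by Monte Carlo. For the unit ℓ₁-ball it equals E‖Y‖_∞, which grows like log p rather than √(2 log p); the sketch below (illustrative only, not part of the original analysis) makes this gap visible.

```python
import numpy as np

rng = np.random.default_rng(6)
trials = 1_000

for p in [100, 1_000, 10_000]:
    # Y has independent, symmetric coordinates with P(|Y_j| >= t) = exp(-t).
    Y = rng.exponential(size=(trials, p)) * rng.choice([-1.0, 1.0], size=(trials, p))
    exp_width = np.abs(Y).max(axis=1).mean()            # E sup_{v in B_1^p} <Y, v> = E ||Y||_inf
    gauss_width = np.abs(rng.standard_normal((trials, p))).max(axis=1).mean()
    print(f"p = {p:6d}   exponential width ~ {exp_width:5.2f}   Gaussian width ~ {gauss_width:5.2f}")
```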

(Uniformly) Sub-Exponential Input Vectors
We were already concerned with the situation of (uniformly) sub-exponential input vectors in Subsection 1.1 (see Proposition 1.4). Taking the more abstract viewpoint from Section 2, this setting corresponds to a degenerate limit case at the heavier-tailed end of the "spectrum" of generic Bernstein concentration. Indeed, an application of Proposition A.1(i) leads to the following fact:

Fact 3.6 Let x ∈ R^p be a (uniformly) sub-exponential random vector, that is, ‖x‖_{ψ_1} < ∞. Then x exhibits generic Bernstein concentration with respect to (0, C₁ · ‖x‖_{ψ_1} · ‖·‖₂), where C₁ > 0 is the constant from Proposition A.1(i). The global q- and m-complexities then involve both the γ₂- and the γ₁-functional with respect to the (rescaled) Euclidean norm C₁ · ‖x‖_{ψ_1} · ‖·‖₂.
According to Talagrand's Majorizing Measure Theorem, the resulting complexity expression can be related to the Gaussian width of the index set; see (3.6). The right-hand side of (3.6) agrees with the notion of perturbed width that was considered by [36,46]. Since Sattar and Oymak [46] focus on projected gradient descent as an algorithmic implementation of the generalized Lasso (LS_K), their error bounds are not directly comparable to ours (in the special case of sub-exponential input vectors), but they bear a resemblance. A remarkable difference is that we achieve an exponentially decaying probability of failure. This is due to the fact that we handle the multiplier process by Mendelson's chaining approach (see Subsection 5.2.2), which also explains why the notion of m-complexity does not appear in the results of [46]. Unlike in the settings of Subsections 3.2 and 3.3, a simple geometric interpretation of the complexity parameters is not available in the case of uniformly sub-exponential input vectors. In particular, there is no reason to believe that the optimization over the "skeleton" is unnecessary in general. Compared to the γ₂-functional, which can be controlled by means of the Gaussian width, the γ₁-functional seems more mysterious and intangible. However, it is at least possible to derive informative upper bounds in the important special case where K is a scaled ℓ₁-ball (see Subsection 3.6).

The Lifted Lasso and Phase Retrieval
Another relevant example of generic Bernstein concentration occurs when applying the so-called lifted Lasso to sub-Gaussian input vectors. The lifted Lasso introduced below can be seen as a variant of the phase lift approach (see [7,8]), which is tailored to the phase retrieval problem (e.g., see [47] for an overview). Our statistical analysis is not limited to this specific model setup and covers the more general scenario considered by Thrampoulidis and Rawat [52], namely single-index models with even output functions. In fact, Proposition 3.2 indicates that this is a highly non-trivial task: if y obeys (3.1) with Gaussian inputs and an even function f, then we would simply have µ = 0, so that the ordinary Lasso (LS_K) fails to recover the direction of the parameter vector β₀.
Phase lifting follows a different approach that allows us to reduce the non-linear phase retrieval problem to a more accessible linear problem. It is based on the simple, yet crucial observation that

    ⟨x, β⟩² = tr(xx^T ββ^T) = ⟨xx^T, ββ^T⟩_F,

where tr(·) denotes the trace and ⟨·, ·⟩_F the Hilbert-Schmidt inner product. The lifted Lasso then corresponds to the following convex optimization problem:

    minimize over B ∈ H:   (1/n) ∑_{i=1}^{n} ( yᵢ − ⟨xᵢxᵢ^T − E[xx^T], B⟩_F )²,   (LLS_H)

where H ⊂ R^{p×p} is a convex subset of the positive semidefinite cone in R^{p×p} which contains all "lifted" hypotheses, i.e., H ⊃ {ββ^T | β ∈ K} for some K ⊂ R^p. Note that the centering term E[xx^T] in (LLS_H) is important to achieve consistency (see also Proposition 3.11 below). If the input vector x ∈ R^p is sub-Gaussian, then xx^T ∈ R^{p×p} is sub-exponential. (Here and in the following, the matrix space R^{p×p} is canonically identified with R^{p²}; in particular, we interpret xx^T as a random vector in R^{p²}, rather than a random matrix.) In this sense, the lifted Lasso is a typical application where sub-exponential vectors occur naturally. The connection between the lifted setting and the notion of generic Bernstein concentration is given by the Hanson-Wright inequality (cf. [58, Thm. 6.2.1]):

Fact 3.8 Let x ∈ R^p be a centered, (uniformly) sub-Gaussian random vector. Then, by the Hanson-Wright inequality, for every B ∈ R^{p×p} the linear marginal ⟨xx^T − E[xx^T], B⟩_F satisfies a mixed tail bound of the form (2.1); in other words, xx^T − E[xx^T] exhibits generic Bernstein concentration with respect to (C · ‖x‖²_{ψ_2} · ‖·‖_F, C · ‖x‖²_{ψ_2} · ‖·‖_op), where C > 0 is a universal constant.
All aforementioned results can be integrated into our framework. The following error bound is a direct application of Corollary 2.15 to the setup of Fact 3.7:

Corollary 3.9 (Error bound for (LLS_H), global version) Let x ∈ R^p be as in Fact 3.7 and let y be sub-exponential. Let the sample pairs (x₁, y₁), . . . , (xₙ, yₙ) be independent copies of (x, y). Moreover, let H ⊂ R^{p×p} be convex and fix B♮ ∈ H. We also assume that the model setup of Assumption 2.9 is satisfied for the lifted sample pairs. Then there exists a universal constant C > 0 such that for every u ≥ 8, the following holds true with probability at least 1 − 5 exp(−C · u²) − 2 exp(−C · √n): If the sample size is sufficiently large, then every minimizer B̂ of (LLS_H) satisfies the error bound (3.7), where the mismatch parameters ρ₀(B♮) and σ(B♮) are defined with respect to the "lifted" random pair (xx^T − E[xx^T], y). (Note that the operator norm is absorbed by the Frobenius norm in the γ₂-part of the q-complexity, which is possible due to ‖·‖_F + ‖·‖_op ≍ ‖·‖_F.)

Obviously, one can derive analogous estimation guarantees for the local and conic complexity parameters based on Theorem 2.10 and Corollary 2.12, respectively.

Remark 3.10 From a practical viewpoint, the error bound for the lifted Lasso in Corollary 3.9 is only of indirect interest, since the actual goal is to construct an estimator for an appropriate target vector β♮ ∈ R^p with B♮ = β♮(β♮)^T. For this purpose, one may simply extract the rank-one component from a solution B̂ to (LLS_H). Indeed, let β̂ ∈ S^{p−1} be a unit-norm eigenvector of B̂ corresponding to the largest eigenvalue λ̂₁ of B̂ (recall that B̂ is positive semidefinite). Then, on that same event as in Corollary 3.9, an error bound holds true in terms of the error term Err on the right-hand side of (3.7); see [8, Sec. 6] and [52, Subsec. 2.1] for more details. In other words, λ̂₁β̂ is an estimator of either β♮ or −β♮, due to sign ambiguities. ♦

Since the error bound (3.7) is affected by the (constant) additive term ρ₀(B♮), it is natural to study situations where it vanishes. The following proposition concerns two model setups where this is the case. In conjunction with Corollary 3.9, this shows that the lifted Lasso (LLS_H) can provide a consistent estimator for phase-retrieval-like problems.

Proposition 3.11 (1) Let x ∈ R^p be a random vector with finite second moments (so that the centering term E[xx^T] is well-defined). We assume that y obeys a quadratic observation model of the form

    y = ⟨x, β₀⟩² + ν,
where β₀ ∈ R^p is an unknown parameter vector and ν is independent noise with E[ν] = 0. Then we have that ρ₀(B♮) = 0 for B♮ := β₀β₀^T.

(2) Let x ∼ N(0, I_p). We assume that y obeys a single-index model of the form

    y = f(⟨x, β₀⟩) + ν,

where β₀ ∈ S^{p−1} is an unknown parameter vector, f : R → R is a scalar output function, and ν is independent noise with E[ν] = 0. Set µ := ½ · E[f(Z)(Z² − 1)] with Z ∼ N(0, 1). Then, we have that ρ₀(B♮) = 0 for B♮ := µβ₀β₀^T.

For the sake of brevity, we omit a proof of Proposition 3.11; it is straightforward, but especially the second part requires some lengthy calculations, see [52, Appx. B.2] for more details. While the first statement of Proposition 3.11 addresses the classical phase retrieval problem under no additional assumptions on the input vector, the second one indicates that the lifted Lasso can handle much more general non-linearities, at least in the Gaussian case. In fact, one can achieve a consistent estimator of ±β₀ as long as µ ≠ 0, which includes a large subclass of even output functions. This observation allows us to reproduce a main result of Thrampoulidis and Rawat [52], thereby integrating it into a more general statistical framework for the lifted Lasso. They also investigate the important special case of sparse recovery, where β₀ is sparse and H is a subset of a scaled ℓ₁-ball in R^{p²}. A detailed analysis of this situation goes beyond the scope of this paper, but we emphasize that the complexity bounds presented in the next subsection could be used to derive results in that regard. Finally, we refer to [52, Subsec. 1.3] for further reading on recent approaches to phase retrieval and related problems.
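The following sketch illustrates the lifted Lasso (LLS_H) for the quadratic model of Proposition 3.11(1) in a small dimension, together with the rank-one extraction of Remark 3.10. It uses cvxpy with H chosen as a trace-bounded subset of the positive semidefinite cone; this particular choice of H, the problem sizes, and the noise level are illustrative assumptions rather than prescriptions of Corollary 3.9.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(7)
n, p = 300, 15
beta0 = rng.standard_normal(p)
beta0 /= np.linalg.norm(beta0)

x = rng.standard_normal((n, p))
y = (x @ beta0) ** 2 + 0.05 * rng.standard_normal(n)     # phaseless, noisy observations

# Lifted (centered) measurement matrices x_i x_i^T - E[x x^T] = x_i x_i^T - I_p.
A = np.einsum("ni,nj->nij", x, x) - np.eye(p)[None, :, :]

B = cp.Variable((p, p), PSD=True)
residuals = cp.hstack([y[i] - cp.sum(cp.multiply(A[i], B)) for i in range(n)])
objective = cp.Minimize(cp.sum_squares(residuals) / n)
constraints = [cp.trace(B) <= 2.0]                       # illustrative convex hypothesis set H
cp.Problem(objective, constraints).solve()

# Rank-one extraction (cf. Remark 3.10): top eigenpair of the solution.
eigvals, eigvecs = np.linalg.eigh(B.value)
beta_hat = np.sqrt(max(eigvals[-1], 0.0)) * eigvecs[:, -1]
err = min(np.linalg.norm(beta_hat - beta0), np.linalg.norm(beta_hat + beta0))
print("estimation error (up to sign):", err)
```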

Sparse Recovery and the Complexity of Polytopes
In this subsection, we discuss our complexity parameters in the context of high-dimensional estimation problems where n ≪ p, with a particular emphasis on sparse recovery; for a comprehensive introduction to high-dimensional statistics, we refer to the textbooks [6,13,23,58,59]. The common ground of sparse recovery problems is the assumption that the underlying parameter vector is sparse in a certain sense. In this part, we focus on the specific case where the target vector β ♮ ∈ R p is k-sparse, i.e., at most k of its coordinates are non-zero. Since the set of k-sparse vectors in R p is non-convex for k < p, it cannot be used as hypothesis set for the generalized Lasso (LS K ), and one has to come up with an appropriate convex relaxation. Probably the most natural choice is a scaled ℓ 1 -ball, which precisely leads to the standard Lasso studied by Tibshirani [55].
Let us begin with the situation where the hypothesis set is perfectly tuned in the sense that the target vector lies exactly on its boundary. The following result (Proposition 3.12) for the ℓ₁-ball provides bounds for the local and conic m- and q-complexities in the settings of Subsections 3.2–3.4. A proof is given in Appendix C.
Parts (ii) and (iii) of Proposition 3.12 provide corresponding bounds for the local and conic m- and q-complexities with respect to (‖·‖₂, ‖·‖_∞) and (0, ‖·‖₂), respectively. Since all upper bounds in Proposition 3.12 scale only logarithmically with the ambient dimension p, we can conclude that sparse recovery is feasible in the settings of Subsections 3.2–3.4. Moreover, the square-root dependence on the sparsity k is optimal in each of the above cases. In particular, the sample-size condition (2.4) of Theorem 2.10 takes the familiar form

    n ≳ k · log(2p/k),

where we have ignored other model-dependent parameters for the sake of clarity.
In situations where the hypothesis set is not perfectly tuned, it can be more appropriate to apply the global error bound from Corollary 2.15 instead of Theorem 2.10 (or Corollary 2.12). In this context, the "skeleton" optimization in the m- and q-complexities proves very useful. To this end, let us assume that K ⊂ R^p is a convex polytope, i.e., K = conv(F) for a finite set of vertices F ⊂ R^p. For α ∈ {1, 2}, we then have

    γ_α(F, d) ≲ (log|F|)^{1/α} · ∆_d(F).   (3.8)

This rather crude bound is proved straightforwardly by constructing an admissible partition sequence whose partitions contain all elements of F as singletons "as soon as admissible", and by bounding every diameter trivially by ∆_d(F). With (3.8) at hand, we immediately obtain the following bounds for the global complexity parameters in the case of polytopal hypothesis sets:

Proposition 3.13 Let K ⊂ R^p be a convex polytope with D vertices. Then the global q- and m-complexities of K can be bounded in terms of log(D) and the diameters ∆_e(K) and ∆_g(K) of K with respect to the semi-norms ‖·‖_e and ‖·‖_g, respectively.
Since the ℓ₁-ball in R^p has only D = 2p vertices, Proposition 3.13 implies that sparse recovery is possible in the high-dimensional regime n ≪ p for all variants of generic Bernstein concentration discussed in this section, even if the hypothesis set K is not perfectly tuned. For example, if β♮ is k-sparse and has unit norm, K = √k · B₁^p would be a valid choice for Corollary 2.15. Bypassing perfect tuning is in fact a desirable feature in statistics, but we point out that the error bound (2.8) in Corollary 2.15 exhibits a suboptimal decay rate of O(n^{−1/4}).
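To get a feel for the orders of magnitude in Proposition 3.13, the following sketch evaluates the crude vertex bound (3.8) (without its universal constant) for K = √k · B₁^p, whose D = 2p vertices are ±√k · e_j, and compares the γ₂-part with a Monte Carlo estimate of the Gaussian width of K; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
p, k, trials = 10_000, 5, 1_000
radius = np.sqrt(k)
D = 2 * p                                    # number of vertices of K = radius * B_1^p

# Crude chaining bound (3.8) for the vertex set F = {±radius * e_j}, d = Euclidean metric:
diam_l2 = 2.0 * radius                       # attained by antipodal vertices
gamma2_bound = np.sqrt(np.log(D)) * diam_l2

# Monte Carlo estimate of the Gaussian width of K (equivalently of conv(F), cf. (3.2)).
g = rng.standard_normal((trials, p))
gauss_width = radius * np.abs(g).max(axis=1).mean()

print("crude gamma_2 bound ~", gamma2_bound, "   Gaussian width ~", gauss_width)
```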
The above findings indicate a noteworthy phenomenon of our complexity parameters, and generic chaining in general. The argument behind Proposition 3.13 is especially effective for those polytopes with few vertices because we then only have to control the empirical processes over this small subset (cf. (3.8)). For the γ 2 (·, · 2 )-functional, this simplification is irrelevant, since it is equivalent to the Gaussian width according to (3.2). However, the general geometric mechanisms behind this fact remain largely mysterious, e.g., see [51,Sec. 2.4]. In particular, the situation is much less understood beyond this special case and the involved γ-functionals are not necessarily invariant under taking the convex hull. Consequently, controlling the m-and q-complexities in any specific situation is a highly non-trivial task.

Conclusion and Outlook
Leaving aside the specific aspects and applications discussed in the previous sections, the overall conclusion of our main results reads as follows: the benchmark case of sub-Gaussian sample data can be seen as a "barrier" behind which the estimation behavior of the generalized Lasso (LS_K) can change significantly. The key difference manifests itself in the way the complexity of the hypothesis set K is measured. Indeed, the m- and q-complexities do not generally enjoy a simple geometric interpretation similar to the Gaussian width, and except for some specific scenarios, the underlying chaining functionals are difficult to control (see Subsection 3.6). On the other hand, we have observed that several statistical and conceptual features remain valid beyond sub-Gaussianity. In particular, semi-parametric estimation problems can be treated as before, since the consistency of the generalized (or lifted) Lasso is not affected by the tail behavior of the input vectors (see Subsections 3.1 and 3.5).
At the technical core of this paper stands an application of generic chaining, as a means of controlling the quadratic and the multiplier process according to the underlying geometry of their index sets. In our specific analysis, this paradigm appears explicitly in the notion of generic Bernstein concentration: the correct way to measure complexity is determined by the tail behavior of the input vector, which is captured by two (appropriately chosen) semi-norms.
We close our discussion with a short list of open problems and possible extensions of our approach:

• Beyond sub-exponentiality. Are our results extendable to input vectors for which generic Bernstein concentration is simply too restrictive? An obvious relaxation would be that x obeys only an α-sub-exponential distribution, i.e., ‖x‖_{ψ_α} < ∞ for some 0 < α < 1 (cf. Theorem 5.10).
For instance, such distributions occur naturally when studying higher-order variants of the lifted Lasso, where the input data consist of tensor products. We believe that our basic proof strategy would not break down in such scenarios. In fact, even though ‖·‖_{ψ_α} is just a quasi-norm for 0 < α < 1, concentration inequalities are available, similarly to the case α = 1, e.g., see [4,21,45]. Hence, a careful adaptation of generic Bernstein concentration and the related chaining argument might lead to similar estimation guarantees as in Section 2. It is worth pointing out that lower bounds for the quadratic process under heavier-tailed inputs are the subject of recent research, e.g., see [25,27,30,48], while the behavior of the multiplier process remains largely unclear.
• The multiplier process. The conclusion of the previous point gives rise to another relevant issue: How tight is our bound for the multiplier process (Proposition 5.15)? Can it be improved in general, or at least in specific model setups? Let us be a little more precise about this concern: our approach to controlling the multiplier process is based on a powerful concentration inequality by Mendelson [32], formulated in Theorem 5.12. In contrast, the multiplier process is handled with more elementary arguments in most related works on the generalized Lasso, e.g., see [42,46,52]. These approaches suffer from a more pessimistic probability of success and may lead to different error bounds in some situations. We suspect that there exists a certain trade-off between the probability of success and the size of the related complexity terms. In this regard, a particularly interesting phenomenon is that, in contrast to the sub-Gaussian case, the complexities of the multiplier process and the quadratic process may be measured in different ways.
• Beyond linearity and convexity. The results of this paper are limited to convex hypothesis sets consisting of linear functions. Indeed, convexity is an important ingredient of Fact 2.4, while linearity enables the optimization over the "skeleton" in the proofs of Propositions 5.5 and 5.15. We expect that it is possible to drop the convexity assumption on K by analyzing projected gradient descent as an algorithmic implementation of (LS_K), e.g., see [38,40,46]. However, it is not clear to us whether our complexity terms would still adequately capture the non-convex nature of the hypothesis set; for instance, recall that the Gaussian width is invariant under taking the convex hull. In general, the analysis of non-convex optimization problems is very subtle, due to the possible presence of spurious local optima or saddle points. On the other hand, non-convex methods often perform better in both theory and practice, e.g., see [9]. These benefits have triggered a large amount of research in the last decade, but it is fair to say that many important issues in this field remain wide open.

Proofs of the Main Results
This part is dedicated to the proofs for Section 1 (provided in Subsection 5.6) and Section 2 (provided in Subsections 5.1-5.5).

Implications of Generic Bernstein Concentration
We begin with two implications of generic Bernstein concentration (see Definition 2.2), which are required for the proof of Theorem 2.10 in the next subsection, but might also be of independent interest. The proofs of both lemmas can be found in Appendix D. The first one concerns the q-th moment of the marginals of a random vector that satisfies generic Bernstein concentration; recall the notation v* from Subsection 1.4.
The second lemma addresses the symmetrized sum of i.i.d. random vectors that satisfy generic Bernstein concentration. The resulting random vector still exhibits generic Bernstein concentration but with respect to different semi-norms.

Lemma 5.2 Let x be a random vector in R^p that exhibits generic Bernstein concentration with respect to (‖·‖_g, ‖·‖_e), and let x_1, . . . , x_n be independent copies of x. Furthermore, let ε_1, . . . , ε_n be independent Rademacher random variables (also independent of the x_i). Then the rescaled symmetrized sum (1/√n)·Σ_{i=1}^n ε_i x_i exhibits generic Bernstein concentration with respect to (C(‖·‖_g + ‖·‖_e), (C/√n)·‖·‖_e), where C > 0 is a universal constant.
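The effect described in Lemma 5.2 can also be observed numerically: averaging n symmetrized sub-exponential samples pushes the sub-exponential part of the tail down, so that moderate deviations look essentially Gaussian. The following Monte Carlo sketch (an illustration only, not part of the proof; the Laplace model and the coordinate direction are assumptions on our part) compares the empirical tails of a single marginal ⟨x, v⟩ with those of ⟨(1/√n)·Σ_i ε_i x_i, v⟩.

```python
# Monte Carlo illustration of the message of Lemma 5.2 for Laplace inputs.
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
n, trials = 50, 200_000

# For v = e_1 and x with i.i.d. Laplace(0, 1/sqrt(2)) entries (unit variance),
# the marginal <x, v> is a single Laplace coordinate -- a sub-exponential tail.
marg = rng.laplace(scale=1 / sqrt(2), size=(trials, n))     # <x_i, v>, i = 1..n
eps = rng.choice([-1.0, 1.0], size=(trials, n))
single = marg[:, 0]                                          # <x_1, v>
avg = (eps * marg).sum(axis=1) / sqrt(n)                     # <(1/sqrt n) sum_i eps_i x_i, v>

for t in (2.0, 3.0, 4.0):
    gauss = 2 * (1 - 0.5 * (1 + erf(t / sqrt(2))))           # standard normal tail
    print(f"t={t}: single ~ {np.mean(np.abs(single) > t):.1e}, "
          f"symmetrized average ~ {np.mean(np.abs(avg) > t):.1e}, "
          f"gaussian ~ {gauss:.1e}")
```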

Proof of Theorem 2.10
Throughout this subsection and unless stated otherwise, we assume that the hypotheses of Theorem 2.10 are satisfied, in particular Assumption 2.9. Let us recall the decomposition of the excess risk from (2.2):

According to Fact 2.4, our main goal is to show that E(β, β♮) > 0 for all β ∈ K_{β♮,t}. For this purpose, we first treat the quadratic and the multiplier process separately in Subsections 5.2.1 and 5.2.2 below. The outcomes of this analysis are Propositions 5.5 and 5.15, respectively, which eventually allow us to derive the desired error bound in Subsection 5.2.3. We note that some results in Subsections 5.2.1 and 5.2.2 are presented in a slightly more general setting, considering a generic set L ⊂ R^p instead of specific subsets of K.

The Quadratic Process
We now address the issue of finding a lower bound for the quadratic process Q(β − β♮). Setting v := β − β♮ ∈ K − K, the square root of the quadratic process takes the form

Evidently, the quadratic process describes an interaction of the input vectors x_i with the difference of two hypotheses β, β♮ ∈ K. In this sense, it is intrinsic to K: it does not depend in any way on y, and in particular not on the model mismatch y − ⟨x, β♮⟩ (cf. Appendix B.2(3)). Since Q(β − β♮) is a non-negative empirical process, it is well suited for an application of the small-ball method. We state a version by Tropp [57, Prop. 5.1] here, but it should be emphasized that the original idea is due to Mendelson (e.g., see [31, Thm. 5.4]); recall the notion of the small-ball function from Definition 2.7.
where W_n(L, x) := E[ sup_{v∈L} (1/√n)·Σ_{i=1}^n ε_i⟨x_i, v⟩ ] is the empirical width of L with independent Rademacher random variables ε_1, . . . , ε_n.
In fact, Theorem 5.3 is a remarkable result, because it holds true without any strong tail assumption on x. However, its usefulness hinges on finding an appropriate (upper) bound for the empirical width W_n(L, x), which is usually not a simple task. In the specific context of this paper, where x exhibits generic Bernstein concentration, the following generic chaining bound by Talagrand will prove useful; recall the γ-functional from Definition 2.3.
for all v_1, v_2 ∈ L and all t > 0. Then, we have that

An appropriate combination of Theorem 5.3 and Theorem 5.4 leads to the following lower bound for the quadratic process:

Proposition 5.5 Let x, x_1, . . . , x_n, K ⊂ R^p, and τ > 0 be as in Assumption 2.9. For t > 0, let L ⊂ (K − K) ∩ tS^{p−1}. Then for every u > 0, we have that

q^{(g,e)}_{t,n}(L) − τ · u

with probability at least 1 − exp(−u²/2), where C_Q > 0 is a universal constant.
Proof. Let θ := t · τ. Then, Theorem 5.3 states that

holds true with probability at least 1 − exp(−u²/2). The claim of Proposition 5.5 follows from the bounds on Q_{2θ}(L, x) and W_n(L, x) that we establish in the following.
Lower bound for Q_{2θ}: We have that

Upper bound for W_n: According to the definition of the local q-complexity, there exists a set S ⊂ R^p with conv(S) ⊃ L ∩ tS^{p−1} (= L) whose associated chaining functional is bounded (up to a constant factor) by q^{(g,e)}_{t,n}(L).
Conditioning on the random variables ε_i, x_i, the function h(v) := (1/√n)·Σ_{i=1}^n ε_i⟨x_i, v⟩ is linear. Now, since every v ∈ L ⊂ conv(S) can be written as a convex combination of points s_1, . . . , s_M ∈ S, linearity implies that h(v) is the same convex combination of h(s_1), . . . , h(s_M). In particular, at least one of the h(s_i) is not smaller than h(v). This implies

and therefore W_n(L, x) ≤ W_n(S, x). To obtain an upper bound for W_n(S, x), we consider the associated stochastic process and intend to apply Theorem 5.4: since the x_i and the ε_i are independent, the distribution of (X_v)_{v∈S} only depends on the individual distributions of the x_i and the ε_i. Observing that

and that −ε_i has the same distribution as ε_i for each i, we conclude that (X_v)_{v∈S} is indeed symmetric. Regarding the increment condition, let v_1, v_2 ∈ S and set v := v_1 − v_2. Then, Lemma 5.2 implies that

Finally, Theorem 5.4 yields a bound on W_n(S, x) in terms of the chaining functionals of S, and hence W_n(L, x) ≲ q^{(g,e)}_{t,n}(L), which completes the proof.

The Multiplier Process
We now turn our attention to the multiplier process. Setting v := β − β♮ and ξ_i := ⟨x_i, β♮⟩ − y_i, the process takes the form

Unlike the quadratic process, the multiplier process is not intrinsic to the hypothesis set K, but (empirically) describes an interaction of the difference of two hypotheses β, β♮ ∈ K with the model mismatch ξ := ⟨x, β♮⟩ − y (cf. Appendix B.2(3)).
In order to control the multiplier process, we adapt another result by Mendelson [32], which is based on a refined chaining approach: instead of applying traditional generic chaining to the function class {ξ · ⟨·, β⟩ | β ∈ K}, Mendelson isolates the effect of the multiplier term ξ, which leads to a bound in terms of ‖ξ‖_{L_q} and geometric properties of the class {⟨·, β⟩ | β ∈ K}. In fact, his result holds true for more general (non-linear) function classes, but in view of the objectives of this article, we only recite the special case of linear functions. In order to state this result, several definitions are required.

Definition 5.6 ([32, Def. 1.6]) For a real-valued random variable Z and q ≥ 1, we define the (q)-norm by

It is worth comparing the above definition to the moment characterization of sub-Gaussian variables (see Proposition A.1(ii)). Mendelson [32, p. 3658] remarks that the (q)-norm "measure[s] the subgaussian behaviour of the functions involved, but only up to a fixed level, rather than at every level".

Definition 5.7 Let L be a set. We call a sequence (L_s)_{s∈N} ⊂ 2^L of subsets of L an admissible approximation sequence if |L_0| = 1 and |L_s| ≤ 2^{2^s} for s ≥ 1.
The following definition introduces a relative of Talagrand's γ-functional (see Definition 2.3). For this purpose, also recall the notation of dual vectors from Subsection 1.4; more precisely, we equip R^p with the pushforward measure P ∘ x^{−1} of a (generic) random vector x ∈ R^p, so that for

where the infimum is taken over all admissible approximation sequences (L_s)_{s∈N} and (π_s v)* is a nearest point to v* in (L_s)* with respect to the (u²·2^s)-norm.
With these definitions at hand, we can now state Mendelson's result, which provides a powerful concentration inequality for multiplier processes. We emphasize that the feature vector x and the multiplier ξ are not necessarily independent here, which is crucial for our analysis and an important difference to related results in the literature, e.g., see [22].

Theorem 5.9 ([32, Thm. 1.9]) Let L ⊂ R^p and let (x, ξ) ∈ R^p × R be a random pair such that ‖ξ‖_{L_q} < ∞ for some q > 2. We assume that (x_1, ξ_1), . . . , (x_n, ξ_n) are independent copies of (x, ξ). Then there exist constants C_0, C_1, . . . , C_4 > 0 (only depending on q) such that for every w, u > C_0, the following holds true with probability at least 1 − C_1 · w^{−q} · n^{−(q/2)+1} · log^q(n) − 4 exp(−C_2 · u²):

The term C_1 · w^{−q} · n^{−(q/2)+1} · log^q(n) in the probability of success arises from a concentration inequality for the random vector (ξ_i)_{i=1}^n ∈ R^n, for which we only assume that the q-th moment of its components exists for some q > 2. In fact, better rates can be achieved under more restrictive assumptions on the tails of ξ. For example, Mendelson proves a sub-Gaussian variant of Theorem 5.9 using Bernstein's inequality (see [32, Thm. 4.4]). If we assume that ξ is just sub-exponential (as in Assumption 2.9), Bernstein's inequality cannot be applied to the squared coordinates appearing in the Euclidean norm of (ξ_i)_{i=1}^n. However, the following recent result of Götze et al. [21] allows us to derive a concentration inequality in the sub-exponential case:

Theorem 5.10 Let X_1, . . . , X_n be independent, centered random variables with E[X_i²] < ∞ and ‖X_i‖_{ψ_α} ≤ R for some α ∈ (0, 1] ∪ {2}. Let B = [b_ij] ∈ R^{n×n} be a symmetric matrix. Then there exists a universal constant C > 0 such that for every t > 0, we have that

Corollary 5.11 Let ξ_1, . . . , ξ_n be i.i.d. sub-exponential random variables. Then there exists a universal constant C > 0 such that with probability at least 1 − 2 exp(−C · √n), we have that

Proof. We apply Theorem 5.10 with X_i := ξ_i − E[ξ_i], B := I_n, t := R² · n, and obtain

Let us assume that the complement of the event E has occurred. Then, using Proposition A.1(ii), it follows that

where the last step is due to

for α ∈ {1, 2}; see [58, Lem. 2.6.8] for a proof, which also works for α = 1. Consequently, we have that

The bound of Corollary 5.11 leads to the following sub-exponential version of Theorem 5.9:

Theorem 5.12 Let L ⊂ R^p and let (x, ξ) ∈ R^p × R be a random pair such that ‖ξ‖_{ψ_1} < ∞. We assume that (x_1, ξ_1), . . . , (x_n, ξ_n) are independent copies of (x, ξ). Then there exist universal constants C, C′, C̃ > 0 such that for every u ≥ 8, the following holds true with probability at least

Proof. Analogously to the proof of [32, Thm. 4.4], it is enough to adapt the last step of the proof of [32, Thm. 1.9]. To this end, we set the variables arising in the proof of [32, Thm. 4.4] to the values q := 6 and w := 1, which entails r = r′ = 2 and q_1 = 8. Then, with probability at least 1 − 2 exp(−C · √n), Corollary 5.11 implies that

Since ‖ξ‖_{L_6} ≲ ‖ξ‖_{ψ_1}, we also have that

with probability at least 1 − 2 exp(−C · u²), where ξ* and j_0 are objects defined in the proof of [32, Thm. 1.9]. The rest of the proof remains unchanged.
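As a quick numerical sanity check (an illustration only, not part of the proof) of the message of Corollary 5.11, one can verify that the Euclidean norm of a vector of i.i.d. sub-exponential entries is of order √n with overwhelming probability; the centered exponential distribution below is an illustrative choice.

```python
# Monte Carlo check: ||(xi_1, ..., xi_n)||_2 is of order sqrt(n) with high probability.
import numpy as np

rng = np.random.default_rng(2)
n, trials = 1000, 10_000
xi = rng.exponential(scale=1.0, size=(trials, n)) - 1.0   # centered, sub-exponential entries
norms = np.linalg.norm(xi, axis=1) / np.sqrt(n)
print("mean of ||xi||_2 / sqrt(n):", norms.mean())
print("0.9999-quantile:           ", np.quantile(norms, 0.9999))
```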
The following lemma is a centerpiece of our statistical analysis, as it allows us to control the complexity term Λ̃_u(L, x) via generic Bernstein concentration:

Lemma 5.13 Let L ⊂ R^p with 0 ∈ conv(L) and u ≥ 1. Let x be a random vector in R^p that exhibits generic Bernstein concentration with respect to (‖·‖_g, ‖·‖_e). Then, we have that

Proof. We first observe that

for every v ∈ R^p. Adopting the notation from [32, Def. 1.7], we set

which implies that

According to (5.3), the second summand of (5.4) can be bounded as follows:

where Δ_e(L) and Δ_g(L) are the diameters of L with respect to ‖·‖_e and ‖·‖_g, respectively. To handle the first summand of (5.4), we apply (5.3) once again:

where π_s^+ : L → L_s is an arbitrary map (depending on the respective admissible approximation sequence (L_s)_{s∈N} indexed by the infimum); note that we can indeed replace π_s by π_s^+ here, since by definition, (π_s)_{s∈N} is an optimal (functional-minimizing) sequence of projections with respect to the (u²·2^s)-norms.
We now show that the above expression is upper bounded by 5 · (u · γ_1(L, ‖·‖_e) + γ_2(L, ‖·‖_g)), which would imply the claim of Lemma 5.13. For this purpose, let (E_s)_{s∈N} and (G_s)_{s∈N} be two admissible partition sequences which approximate γ_1(L, ‖·‖_e) and γ_2(L, ‖·‖_g) up to a factor of 2, respectively. Furthermore, let (F_s)_{s∈N} be given by F_0 = {L} and

It is not hard to see that (F_s)_{s∈N} is indeed an admissible partition sequence. Next, we use the sequence (F_s)_{s∈N} to construct an admissible approximation sequence (L_s)_{s∈N} and a corresponding sequence of maps (π_s^•)_{s∈N}: for each s ∈ N, the set L_s ⊂ L is obtained by selecting exactly one (arbitrary) point v_F from every F ∈ F_s, while π_s^• maps every point in F to the respective v_F. This construction ensures that for s ≥ 1 and v ∈ L, we have

where we have used in the last line that (E_s)_{s∈N} and (G_s)_{s∈N} approximate γ_1(L, ‖·‖_e) and γ_2(L, ‖·‖_g) up to a factor of 2, respectively.
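For concreteness, the combined partition sequence used here (and again for (5.8) in the proof of Lemma 2.14) can be sketched by the standard common-refinement construction; whether this matches the display (5.5) verbatim is an assumption on our part. Given the admissible partition sequences (E_s)_{s∈N} and (G_s)_{s∈N}, one may set

\[
  \mathcal{F}_0 := \{L\}, \qquad
  \mathcal{F}_s := \{\, E \cap G \;:\; E \in \mathcal{E}_{s-1},\ G \in \mathcal{G}_{s-1} \,\} \quad \text{for } s \ge 1,
\]

so that

\[
  |\mathcal{F}_s| \;\le\; |\mathcal{E}_{s-1}| \cdot |\mathcal{G}_{s-1}| \;\le\; 2^{2^{s-1}} \cdot 2^{2^{s-1}} \;=\; 2^{2^s},
\]

and every cell of F_s is contained both in a cell of E_{s−1} and in a cell of G_{s−1}, so that the diameters with respect to ‖·‖_e and ‖·‖_g can be controlled simultaneously, at the cost of a shift by one level in the weights 2^{s/α}.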
Remark 5.14 The proof of Lemma 5.13 is inspired by [32, Subsec. 4.3], where the upper bound Λ̃_u(L, x) ≲ u · γ_1(L, ‖·‖_∞) + γ_2(L, ‖·‖_2) is derived under the assumption that x obeys an unconditional, isotropic, log-concave distribution. This assumption implies that x is stochastically dominated by a random vector with i.i.d. standard exponential coordinates, which enables a bound for ‖v*‖_{(q)} in terms of ‖v‖_2 and ‖v‖_∞ (cf. Subsection 3.3). ♦

The estimate from Lemma 5.13 leads us to our final result for the multiplier process:

Proposition 5.15
Let L ⊂ tS^{p−1} for some t > 0 and let (x, ξ) ∈ R^p × R be a random pair such that ‖ξ‖_{ψ_1} < ∞ and x exhibits generic Bernstein concentration with respect to (‖·‖_g, ‖·‖_e). We assume that (x_1, ξ_1), . . . , (x_n, ξ_n) are independent copies of (x, ξ). Then there exist universal constants C, C′ > 0 such that for every u ≥ 8, the following holds true with probability at least

Proof. According to the definition of the local m-complexity, there exists a set S̃ ⊂ R^p with

By Theorem 5.12, with probability at least 1 − 2 exp(−C · √n) − 4 exp(−C · u²), we have that

Now let v ∈ L. Since L ⊂ conv(S̃), the point v can be expressed as a convex combination of points s_1, . . . , s_M ∈ S̃ as in (5.2). Conditioning on the random variables x_i and ξ_i, the function

is a composition of a linear function and the convex function z ↦ |z|. Hence, h is convex and we can apply Jensen's inequality to obtain

Since v ∈ L was arbitrarily chosen, we can conclude that the following bound holds true if the event from (5.7) has occurred:

Finally, Lemma 5.13 implies

Λ̃_{C·u}(S̃, x) ≲ C · (u · γ_1(S̃, ‖·‖_e) + γ_2(S̃, ‖·‖_g)),

where we have also used that u ≥ 8 > 1.

Controlling the Excess Risk
With the results of Propositions 5.5 and 5.15 at hand, we are now ready to prove Theorem 2.10. Let us first consider the case t > 0. According to Fact 2.4, it suffices to show that E(β, β♮) > 0 for all β ∈ K_{β♮,t}. Remarkably, this argument is actually the only point in our proof where we rely on the convexity of the hypothesis set K. The remainder of the proof is divided into several substeps.
Step 2 (multiplier process): Applying Proposition 5.15 to L := K_{β♮,t} − β♮ = (K − β♮) ∩ tS^{p−1} and ξ := ⟨x, β♮⟩ − y, the following holds with probability at least 1 − 2 exp(−C · √n) − 4 exp(−C · u²): for every β ∈ K_{β♮,t}, we have that

where the second inequality is due to

If the aforementioned event has occurred and if t satisfies the condition (2.5) for an appropriate hidden constant, it follows that

Step 3 (excess risk): Finally, we assume that the events from Step 1 and Step 2 have occurred jointly, which indeed happens with probability at least 1 − 2 exp(−C · √n) − 5 exp(−C · u²) for an appropriately chosen constant C > 0. Then, we obtain

for all β ∈ K_{β♮,t}, which concludes the proof for t > 0. It remains to consider the case t = 0.
Step 4 (t = 0): In this case, q^{(g,e)}_{t,n}(K − β♮) and m^{(g,e)}_t(K − β♮) correspond to the conic complexities from Definition 2.11. Applying Proposition 5.5 and Proposition 5.15 simultaneously (as in the preceding steps) to L := cone(K − β♮) ∩ S^{p−1} and radius t := 1, we have that with probability at least

where we have also used that q^{(g,e)}_{0,n}(K − β♮)

Finally, let us assume that this event has occurred and let β ∈ K \ {β♮}. Then, we have

which implies that β♮ is the only solution to (LS_K).

Proof of Corollary 2.12
For t > 0 and L ⊂ R^p, we have that (1/t)·L ⊂ cone(L). Due to the homogeneity of the semi-norms ‖·‖_g and ‖·‖_e, we can rewrite the definition of q^{(g,e)}_{t,n}(L) as

q^{(g,e)}_{t,n}(L) = inf

which coincides with the definition of q^{(g,e)}_{0,n}(L), except that the infimum is taken over an inclusion-wise larger domain of sets. Therefore, it holds that q^{(g,e)}_{t,n}(L) ≤ q^{(g,e)}_{0,n}(L) (and analogously for the m-complexity). It follows that the replacement of the local complexities by the conic complexities (and of ρ_t(β♮) by ρ_0(β♮)) leads to stronger conditions in (2.4) and (2.5), which obviously cannot harm the validity of the theorem.

Proof of Lemma 2.14
The claim of (i) follows from the fact that the infima in the global complexity parameters are taken over inclusion-wise smaller domains of sets. The claim of (ii) follows from the fact that the affine term v does not affect the pseudo-metrics induced by the semi-norms ‖·‖_g and ‖·‖_e.
For the claim of (iii), we first observe that

γ_2(L, ‖·‖_g + ‖·‖_e) ≲ γ_2(L, ‖·‖_g) + γ_2(L, ‖·‖_e), (5.8)

which is stated as an exercise by Talagrand [51, Exc. 2.2.24]; its proof is based on the same strategy as the proof of Lemma 5.13: starting from two admissible partition sequences that approximate γ_2(L, ‖·‖_g) and γ_2(L, ‖·‖_e) up to a factor of 2, a third partition sequence is defined as in (5.5). Making use of (5.8), we obtain

where the second inequality is due to the fact that γ_2(L, d) ≤ γ_1(L, d) holds true for all sets L and pseudo-metrics d.
The claims of (iv) and (v) follow directly from the respective definitions.
To see that α ≳ κ^{−3}, observe that

where we have used the isotropy of x, the Cauchy-Schwarz inequality, and Proposition A.1(ii). Finally, note that due to the isotropy of x, we have that κ ≳ 1. This concludes the proof.

Appendix A Basic Facts on Sub-Gaussian and Sub-Exponential Random Variables
The following proposition provides two characterizations of sub-exponential and sub-Gaussian random variables. The first one concerns the exponential decay behavior of their tails, while the second one addresses the growth of their absolute moments. Note that the dependence of the constants on α can be dropped here, since we consider only two values of α.

Proposition A.1 Let α ∈ {1, 2} and let Z ∈ L_{ψ_α} be a random variable. Then the following holds:

(i) Z satisfies the concentration inequality

where C_α > 0 is a constant depending on α.

(ii) The moments of Z satisfy

where C_α > 0 is a constant depending on α.
The following two results are well-known inequalities for sub-Gaussian and sub-exponential random vectors, respectively. A comparison of both shows that the (weighted) sum of independent sub-exponential variables exhibits a mixed-tail behavior, as if it "were a mixture of sub-gaussian and sub-exponential distributions"; quote from [58, p. 35].

Appendix B Further Details on Section 2

B.1 Remarks on the Mismatch Covariance (Definition 2.8)
In this part, we adopt the notation of Definition 2.8, which has introduced the mismatch parameters. Since K_t ⊂ K_0 ⊂ S^{p−1}, we observe that the mismatch covariance satisfies

These bounds imply that all estimation guarantees presented in Section 2 remain true when replacing the local mismatch covariance by its global variant. However, there exist relevant scenarios where the second inequality in (B.1) becomes strict, so that considering ρ(β♮) leads to suboptimal results. To this end, it is useful to first relate the mismatch covariance to the expected risk L(·); in what follows, β* denotes the (unique) global expected risk minimizer.

Figure 2: The figure shows a situation where β* ∉ K and β♮ is the expected risk minimizer on K. This implies that the negative gradient at β♮ points out of K in the direction of β*; more geometrically, the dashed supporting hyperplane separates K and β*. Hence, we have that ⟨−∇L(β♮), v⟩ ≤ 0 for all v ∈ K − β♮, and in particular, ρ_t(β♮) ≤ 0. On the other hand, it holds that ρ(β♮) = (1/2)·‖∇L(β♮)‖_2 = ‖β* − β♮‖_2 > 0.

Now, let β♮ ∈ K be an expected risk minimizer on K, i.e., a solution to (1.1). A well-known optimality condition in convex analysis then implies that ρ_t(β♮) ≤ 0. On the other hand, if K does not contain a global expected risk minimizer, i.e., a solution to min_{β∈R^p} L(β), we have that ∇L(β♮) ≠ 0 and therefore ρ(β♮) > 0; and vice versa, ρ(β♮) = 0 implies that K contains a global expected risk minimizer. We refer to Figure 2 for an illustration of this argument when x is isotropic.17

In view of our main result, Theorem 2.10, the local mismatch covariance measures the asymptotic impact of the model mismatch. When positive, ρ_t(β♮) can be seen as an asymptotic bias term, while a negative value can have favorable effects on the estimation performance of (LS_K); see Appendix B.2(3).
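For the isotropic case, the identities used above can be made explicit by a short computation; this is a sketch under the assumption that the expected risk is the quadratic risk L(β) = E[(y − ⟨x, β⟩)²]:

\[
  \nabla \mathcal{L}(\beta) \;=\; 2\big(\mathbb{E}[x x^\top]\,\beta - \mathbb{E}[y\,x]\big) \;=\; 2\,(\beta - \beta^*),
  \qquad \beta^* = \mathbb{E}[y\,x],
\]

so that the global expected risk minimizer is β* and, in particular, (1/2)·‖∇L(β♮)‖_2 = ‖β♮ − β*‖_2, as claimed above.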

B.2 Remarks on Theorem 2.10
This part compiles several additional remarks on our main result, Theorem 2.10.
(1) Possible extensions. Theorem 2.10 is amenable to various extensions and generalizations. For instance, replacing the ℓ2-norm by an arbitrary semi-norm ‖·‖ in Fact 2.4 would lead to an error bound in terms of ‖·‖; note that such a step would also require an adaptation of the spherical intersections in the q- and m-complexities and of the small-ball condition (2.3). This extension becomes particularly useful when the covariance matrix of the input vector x is poorly conditioned or even degenerate.18 In this case, an appropriate linear transform of the ℓ2-error can account for the underlying covariance structure; see [15, Chap. 3 and Sec. 4.3] for a detailed discussion of this issue.

17 In the isotropic case, there also exists a nice functional-analytic interpretation: the mapping β ↦ ⟨x, β⟩ is an isometric embedding of the Hilbert space R^p into L_2(Ω, P). Then the components x_1, . . . , x_p ∈ L_2 of the random vector x constitute an orthonormal basis of the subspace G := {⟨x, β⟩ | β ∈ R^p}. This implies ρ(β♮) = ‖(⟨y − ⟨x, β♮⟩, x_j⟩_{L_2})_{j=1}^p‖_2 = ‖P_G(y − ⟨x, β♮⟩)‖_{L_2}, where P_G is the orthogonal projection onto G; in other words, ρ(β♮) corresponds to the "linear component" of the expected risk of β♮.

18 In principle, Assumption 2.9 imposes no explicit conditions on the covariance structure of x, but if it becomes too degenerate, the small-ball condition (2.3) might become unrealizable for ‖·‖ = ‖·‖_2.
Apart from that, it is possible to incorporate different loss functions or adversarial noise into Theorem 2.10; cf. [14] and [15,Chap. 3]. Working out the details goes beyond the scope of this paper, but is expected to be relatively straightforward. Furthermore, one might show similar estimation guarantees for the "basis-pursuit" version or the unconstrained version of the generalized Lasso; cf. [15,Chap. 3] and [28,29]. Finally, it is worth mentioning that the sub-exponentiality of y in Assumption 2.9 could be replaced by a less restrictive tail condition. This modification would concern the analysis of the multiplier process in Subsection 5.2.2; for example, a finite moment assumption for y would be sufficient when using Theorem 5.9 instead of Theorem 5.12.
(2) Small-ball condition. The small-ball condition (2.3) in Assumption 2.9 is based on Mendelson's more general condition (see [31, Asm. 3.1]), which reads

where F is a class of (not necessarily linear) hypothesis functions. We emphasize that the small-ball condition (2.3) is stated relative to the hypothesis set K; in particular, it suffices for x to be non-degenerate relative to the subspace span(K − K). This reflects the fact that the input vectors x_i are only of interest to us insofar as they enable us to discern differences between the hypotheses in K.

and set τ := α/4. As long as α > 0, we have that

This lower bound is a more convenient expression, since α measures the degeneracy of x relative to K_Δ, while δ can be seen as an (an-)isotropy parameter.
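As a purely numerical complement (not part of the paper's argument), small-ball behavior of this type can be probed by Monte Carlo simulation: for an input distribution and a finite sample of directions, one estimates inf_v P(|⟨x, v⟩| ≥ τ). The Laplace input model, the random directions, and the threshold below are illustrative assumptions.

```python
# Monte Carlo probe of small-ball probabilities over a finite set of unit directions.
import numpy as np

rng = np.random.default_rng(3)
p, n_dirs, trials, tau = 100, 200, 50_000, 0.25

dirs = rng.standard_normal((n_dirs, p))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)           # unit directions
x = rng.laplace(scale=1 / np.sqrt(2), size=(trials, p))       # isotropic, sub-exponential inputs
probs = np.mean(np.abs(x @ dirs.T) >= tau, axis=0)            # P(|<x, v>| >= tau) per direction
print("estimated small-ball probabilities: min = %.3f, median = %.3f"
      % (probs.min(), np.median(probs)))
```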
(3) Low- and high-noise regime. The statement of Theorem 2.10 could be further refined by specifying the smallest value of t such that the conditions (2.4) and (2.5) still hold true, while all other model parameters remain fixed. Such an optimization strategy is elaborated in the general learning framework of Mendelson [31]. Although the latter certainly has a wider scope than ours, there are important conceptual overlaps. Indeed, (2.4) is closely related to what Mendelson refers to as the "low-noise" regime: this condition is intrinsic to the hypothesis set K and does not depend on the model mismatch (the "noise") y − ⟨x, β♮⟩; in particular, it specifies how many samples are required for (LS_K) to recover a linear hypothesis function exactly. In contrast, the condition (2.5) is associated with the "high-noise" regime, as it strongly depends on the model mismatch in terms of ρ_t(β♮) and σ(β♮). A remarkable conclusion is possible when ρ_0(β♮) < 0 (cf. Appendix B.1): in this case, we may simply set t = 0, while (2.5) can even be satisfied if σ(β♮) > 0. In other words, exact recovery of β♮ is feasible in certain scenarios, despite the presence of noise or model misspecifications.

A related remark concerns the additive term in the error bound of (2.5), capturing the radii of (K − β♮) ∩ tS^{p−1} with respect to ‖·‖_g and ‖·‖_e (see the proof of Lemma 5.13). The appearance of such an additive term is in fact quite common in the literature, e.g., see [11,32].
To proceed, we need an upper bound for the moment generating function E[exp(λ⟨v, x̃_1⟩)]. Since x̃_1 is centered, we have that

Due to symmetry, (2.1) also holds true for the symmetrized random vector x̃_1, and from the proof of Lemma 5.1, we therefore obtain

Upper bound for A: According to Stirling's approximation, we have q! ≥ (q/e)^q, which implies

If λ < 1/(2e‖v‖_e), the above series is convergent and we have

A ≤ 1 + 2·(eλ‖v‖_e)²·(2 − eλ‖v‖_e)/(1 − eλ‖v‖_e)² ≤ 1 + 16(eλ‖v‖_e)² ≤ exp(16(eλ‖v‖_e)²).
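As a quick check of the middle estimate (a sketch, writing s := eλ‖v‖_e and using that s < 1/2):

\[
  \frac{2 s^2 (2 - s)}{(1 - s)^2} \;\le\; \frac{2 s^2 \cdot 2}{(1/2)^2} \;=\; 16\, s^2,
  \qquad 0 \le s < \tfrac{1}{2},
\]

and 1 + 16 s² ≤ exp(16 s²) by the elementary bound 1 + z ≤ e^z.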
Upper bound for B: Since we do not want to introduce further restrictions on λ, let us distinguish two cases. If λ < 1/(2e‖v‖_g), we use that (q/2)^{q/2} ≤ q^q (due to q ≥ 2) and apply the same strategy as for the bound on A, which yields B ≤ exp(16(eλ‖v‖_g)²). Now, let λ ≥ 1/(2e‖v‖_g).
Next, we combine the above bounds for A and B: the basic inequality exp(a) + exp(b) − 3 ≤