1 Introduction

Let \(X_1,\ldots ,X_n\) be i.i.d. random (column) vectors in \(\mathbb {R}^p\) with finite second moments. This paper contributes to the problem of obtaining finite-sample concentration bounds for the random covariance-type operator

$$\begin{aligned} \widehat{\varSigma }_{n}:=\frac{1}{n}\sum _{i=1}^nX_iX_i^T, \end{aligned}$$
(1)

with mean \(\varSigma := \mathbb {E}\left[ X_1X_1^T\right] \). This problem has received a great deal of attention recently, and has important applications to the estimation of covariance matrices [15, 22], to the analysis of methods for least squares problems [10] and to compressed sensing and high dimensional, small sample size statistics [2, 18, 21].

The most basic question is how many samples are needed to bring \(\widehat{\varSigma }_{n}\) close to \(\varSigma \). In general one needs at least \(n\ge p\) samples, so that the ranks of the two matrices can match. A natural problem is then to find conditions under which \(n\ge C(\varepsilon )\,p\) samples are enough for guaranteeing

$$\begin{aligned} \Pr ({\forall v\in \mathbb {R}^p,\, (1-\varepsilon )v^T\varSigma v \le v^T\widehat{\varSigma }_{n}\,v\le (1+\varepsilon )\,v^T\varSigma \,v})\approx 1, \end{aligned}$$
(2)

where \(C(\varepsilon )\) depends only on \(\varepsilon >0\) and on moment assumptions on the \(X_i\)’s.
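
When \(\varSigma \) is invertible, the event in (2) is equivalent to all eigenvalues of the normalized matrix \(\varSigma ^{-1/2}\widehat{\varSigma }_{n}\varSigma ^{-1/2}\) lying in \([1-\varepsilon ,1+\varepsilon ]\), which makes (2) easy to probe by simulation. The following is a minimal NumPy sketch of this check (our own illustration, not part of the results below; the choice of \(\varSigma \) and all numerical values are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, eps, trials = 20, 800, 0.5, 200

# An arbitrary covariance matrix Sigma with decaying spectrum (illustrative choice).
Sigma = np.diag(1.0 / np.arange(1, p + 1))
Sigma_isqrt = np.diag(np.sqrt(np.arange(1, p + 1)))   # Sigma^{-1/2}
L = np.linalg.cholesky(Sigma)

hits = 0
for _ in range(trials):
    X = rng.standard_normal((n, p)) @ L.T        # rows are i.i.d. N(0, Sigma)
    Sigma_hat = X.T @ X / n                      # the empirical covariance (1)
    M = Sigma_isqrt @ Sigma_hat @ Sigma_isqrt    # Sigma^{-1/2} Sigma_hat Sigma^{-1/2}
    evals = np.linalg.eigvalsh(M)
    # The event in (2) holds iff every eigenvalue of M lies in [1 - eps, 1 + eps].
    hits += (evals.min() >= 1 - eps) and (evals.max() <= 1 + eps)

print(f"empirical probability of the event in (2): {hits / trials:.2f}")
```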

A well-known bound by Rudelson [17, 20] implies that \(C(\varepsilon )\,p\log p\) samples are necessary and sufficient if the vectors \(\varSigma ^{-1/2}X_i/\sqrt{p}\) have uniformly bounded norms. Removing the \(\log p\) factor is relatively easy for sub-Gaussian vectors \(X_i\), but even the seemingly nice case of log-concave random vectors (which have sub-exponential moments) had to wait for the breakthrough papers by Adamczak et al. [1, 3]. A series of papers [9, 12, 15, 22] has proven similar results under finite-moment conditions on the one-dimensional marginals plus a (necessary) high-probability bound on \(\max _{i\le n}\,|X_i|_2\).

1.1 The sub-Gaussian lower tail

In this paper we focus on concentration properties of the lower tail of \(\widehat{\varSigma }_{n}\). As it turns out, information about the lower tail is sufficient for many applications, including the analysis of regression-type problems (see Theorem 1.2 below for an example). Moreover, the asymmetry between upper and lower tails is interesting from a purely mathematical perspective.

Our main result is the following theorem.

Theorem 1.1

(Proven in Sect. 4) Let \(X_1,\ldots ,X_n\) be i.i.d. copies of a random vector \(X\in \mathbb {R}^p\) with finite fourth moments. Define \(\varSigma :=\mathbb {E}\left[ XX^T\right] \) and assume

$$\begin{aligned} \forall v\in \mathbb {R}^p\,:\, \sqrt{\mathbb {E}\left[ (v^TX)^4\right] }\le \mathsf{h}\,v^T\varSigma \, v \end{aligned}$$
(3)

for some \(\mathsf{h}\in (1,+\infty )\). Set

$$\begin{aligned} \widehat{\varSigma }_{n}:=\frac{1}{n}\sum _{i=1}^nX_iX_i^T \hbox { as in } (1). \end{aligned}$$

Then, if the number n of samples satisfies

$$\begin{aligned} n\ge 81\mathsf{h}^2\,(p+2\ln (2/\delta ))/\varepsilon ^2, \end{aligned}$$

we have the bound

$$\begin{aligned} \Pr ({\forall v\in \mathbb {R}^p\,:\, v^T\widehat{\varSigma }_{n}\,v\ge (1-\varepsilon )\,v^T\varSigma \,v})\ge 1-\delta . \end{aligned}$$
(4)

Notice that the sample size n in Theorem 1.1 depends on \(p/\varepsilon ^2\), which is optimal if \(X_1\) has i.i.d. entries due to the Bai-Yin theorem [5]. Moreover, the dependence of n on \(\ln (1/\delta )/\varepsilon ^2\) shows that the sample size depends on the confidence level \(\delta \) in a sub-Gaussian fashion. More precisely, Theorem 1.1 implies that

$$\begin{aligned} V:= 1 - \inf _{\,v^T\varSigma \,v=1}\,v^T\widehat{\varSigma }_{n}\,v \end{aligned}$$

has a Gaussian-like right tail

$$\begin{aligned} \Pr \left( {V\ge C\,\mathsf{h}\,\sqrt{\frac{p}{n}} + r}\right) \le e^{-r^2n/C\mathsf{h}^2}\,\,(r\ge 0), \end{aligned}$$

with \(C>0\) universal. We observe that Theorem 1.1 has a more general version involving general sums of i.i.d. positive semidefinite random matrices; see Theorem 4.1 below for details.
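
When \(\varSigma \) is invertible, \(V\) equals \(1-\lambda _{\min }(\varSigma ^{-1/2}\widehat{\varSigma }_{n}\varSigma ^{-1/2})\), so the Gaussian-like right tail above can be observed directly by simulation, even for heavy-tailed coordinates with only four finite moments. A minimal sketch follows (our own illustration; the Student-t design and all constants are arbitrary choices, not taken from the results of this paper).

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, trials, df = 30, 1500, 500, 5      # df = 5: finite fourth moment, heavy tails

V = np.empty(trials)
for k in range(trials):
    # i.i.d. coordinates with unit variance and finite fourth moment (Student-t, 5 d.o.f.).
    X = rng.standard_t(df, size=(n, p)) / np.sqrt(df / (df - 2))
    M = X.T @ X / n                            # here Sigma = I, so M = Sigma^{-1/2} Sigma_hat Sigma^{-1/2}
    V[k] = 1.0 - np.linalg.eigvalsh(M).min()   # V = 1 - inf_{v: v'Sigma v = 1} v'Sigma_hat v

# Compare a few right-tail quantiles of V with the sqrt(p/n) scale predicted by Theorem 1.1.
for q in (0.5, 0.9, 0.99):
    print(f"quantile {q:4.2f} of V: {np.quantile(V, q):.3f}   sqrt(p/n) = {np.sqrt(p / n):.3f}")
```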

The main assumption in Theorem 1.1 is (3). This is a finite-moment assumption, and, from a theoretical perspective, it seems remarkable that one can obtain sub-Gaussian concentration from it. From the perspective of applications, there are reasonably natural settings where (3) is a sensible assumption.

  1.

    Assume first \(X=(X[1],X[2],\dots ,X[p])^T\) has diagonal \(\varSigma \) and satisfies a near unbiasedness assumption: for all \((i_1,i_2,i_3,i_4)\in \{1,\dots ,p\}^4\),

    $$\begin{aligned} \,i_4\not \in \{i_1,i_2,i_3\}\Rightarrow \mathbb {E}\left[ X[i_1]\,X[i_2]\,X[i_3]\,X[i_4]\right] =0. \end{aligned}$$

    This is true if \(X[1],X[2],\ldots ,X[p]\) are mean-zero four-wise independent random variables, or if the vector \(X\) is unconditional (i.e. its law is preserved when each coordinate is multiplied by an arbitrary sign). From this assumption we may obtain (3) with

    $$\begin{aligned} \mathsf{h}:= 6\,\max \left\{ \frac{\sqrt{\mathbb {E}\left[ X[i]^4\right] }}{\mathbb {E}\left[ X[i]^2\right] }\,:\, i=1,2,\ldots ,p,\, \mathbb {E}\left[ X[i]^2\right] >0\right\} . \end{aligned}$$

    (A numerical sanity check of this bound, and of item 2 below, is sketched right after this list.)
  2.

    Assume now that some X satisfying (3) is replaced by \(AX+\mu \) for some linear map \(A\in \mathbb {R}^{p'\times p}\) and some \(\mu \in \mathbb {R}^{p'}\). The new vector still satisfies (3) with a finite constant, although \(\mathsf{h}\) may change by a universal constant factor. Note that the matrix A may be singular and/or one may have \(p'>p\), in which case \(AX+\mu \) will have highly correlated components. In other words, highly correlated designs are allowed as long as they “come from a vector with uncorrelated entries”.

  3.

    The property \(\mathsf{h}<+\infty \) is also preserved when X is multiplied by an independent scalar \(\xi \), as long as \(\mathbb {E}\left[ \xi ^4\right] /\mathbb {E}\left[ \xi ^2\right] ^2\) is bounded by an absolute constant. As noted in [22], this is strictly weaker than what is needed for two-sided concentration as in (2).

  4.

    An assumption not covered by our theorem is that of bounded designs: \(\varSigma ^{-1/2}X_1/\sqrt{p}\) a.s. bounded. This is verified when the coordinates of X are orthonormal functions, such as the Fourier basis over [0, 1] (in this case \(\varSigma =I_{p\times p}\)). We note that this bounded-design case is optimally covered by Rudelson’s aforementioned bound [17, 20].
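
The sketch announced in item 1 is given next. It is our own numerical illustration (all distributional choices are arbitrary): it compares the factor-6 bound on \(\mathsf{h}\) from item 1 with a direct Monte Carlo estimate of \(\sqrt{\mathbb {E}\left[ (v^TX)^4\right] }/(v^T\varSigma \,v)\) over random directions, before and after applying a linear map as in item 2.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n_mc, n_dirs, df = 10, 200_000, 50, 8      # Monte Carlo sample size and number of test directions

# Item 1: i.i.d. mean-zero coordinates (Student-t, 8 d.o.f.): diagonal Sigma, four-wise independence.
X = rng.standard_t(df, size=(n_mc, p))
h_item1 = 6 * (np.sqrt((X**4).mean(axis=0)) / (X**2).mean(axis=0)).max()

def ratio_estimate(Y):
    """Monte Carlo estimate of sup_v sqrt(E[(v'Y)^4]) / (v' Sigma v) over random unit directions."""
    Sigma = Y.T @ Y / Y.shape[0]
    worst = 0.0
    for _ in range(n_dirs):
        v = rng.standard_normal(Y.shape[1])
        v /= np.linalg.norm(v)
        proj = Y @ v
        worst = max(worst, np.sqrt((proj**4).mean()) / (v @ Sigma @ v))
    return worst

print(f"item-1 bound on h        : {h_item1:.2f}")
print(f"direct estimate for X    : {ratio_estimate(X):.2f}")

# Item 2: a (here deliberately singular) linear image A X keeps (3) with a comparable constant.
A = rng.standard_normal((p, p))
A[:, -1] = A[:, 0]                            # make A singular on purpose
print(f"direct estimate for A X  : {ratio_estimate(X @ A.T):.2f}")
```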

One further attraction of Theorem 1.1 is its proof method, which is based on a PAC-Bayesian argument. The main feature of this method is that it provides a way to control empirical processes via entropic inequalities, as opposed to usual chaining methods. Further details about this method are given in Sect. 3 below. Although our application of this method is indebted to previous work by Audibert/Catoni [4] and Langford/Shawe-Taylor [13], we believe that this technique has much greater potential than what has been explored so far in the literature.

Remark 1

(Recent developments in lower tails) Many developments on variants of Theorem 1.1 have appeared since the first version of this paper. Almost simultaneously with us, Koltchinskii and Mendelson [11] obtained analogues of Theorem 1.1 under the assumption of \(q>4\) moments on the one dimensional marginals of \(X_1\). They also obtained results under our assumption (3), albeit with suboptimal dependence on p and \(\varepsilon \). Later, Yaskov [24, 25] obtained bounds under the assumption of uniform bounds for the weak \(L^q\) norms of one dimensional marginals, where \(q\ge 2\) is arbitrary. For each value of \(q>2\), he obtains the optimal exponent \(\alpha _q>0\) so that \(n=\Theta (p/\varepsilon ^{\alpha _q})\) samples are necessary and sufficient for (4) (with \(\delta =e^{-p}\)). His theorem is thus stronger than Theorem 4.1 except possibly for the dependence of n on \(\delta \). However, the first part of [8, Theorem 3.1] by van de Geer and Muro achieves bounds similar to Yaskov’s, with the same dependence on \(\delta \) as our own Theorem 1.1. There has also been some related progress in checking lower- and upper-tail properties that are relevant in the \(p\gg n\) setting [9].

1.2 Application to ordinary least squares with random design

Theorem 1.1 will be illustrated with an application to random design linear regression when \(n\gg p\gg 1\). In this setting one is given data in the form of n i.i.d. copies \((X_i,Y_i)_{i=1}^n\) of a random pair \((X,Y)\in \mathbb {R}^p\times \mathbb {R}\), where X is a vector of covariates and Y is a response variable. The goal is to find a vector \(\widehat{\beta }_n\) that depends solely on the data so that the square loss

$$\begin{aligned} \ell (\beta ):=\mathbb {E}\left[ (Y-X^T\beta )^2\right] \end{aligned}$$

is as small as possible. This setting of random design should be contrasted with the technically simpler case of fixed design, where the \(X_i\) are non-random. Fixed design results are not informative about out-of-sample prediction, which is important in many routine applications of OLS e.g. in Statistical Learning and in Linear Aggregation.

We show below that the usual ordinary least squares (OLS) estimator

$$\begin{aligned} \widehat{\beta }_n\in \mathrm{arg min}_{\beta \in \mathbb {R}^p}\frac{1}{n}\sum _{i=1}^n\,(Y_i-\beta ^TX_i)^2, \end{aligned}$$

achieves error rates \(\approx \sigma ^2\,(p+\ln (1/\delta ))/n\) in the random design setting, where \(\sigma ^2\) measures the magnitude of “errors”. The formal theorem (modulo some definitions in Sect. 5.1) is as follows.

Theorem 1.2

(Proven in Sect. 5.2) Define \((X,Y),(X_1,Y_1),\ldots ,(X_n,Y_n)\) as above. Let \({\beta }_{\min }\) denote a minimizer of \(\ell \) and let \(\eta :=Y-{\beta }_{\min }^TX\). Define \(\varSigma :=\mathbb {E}\left[ XX^T\right] \), and let \(\varSigma ^{-1/2}\) be the Moore-Penrose pseudoinverse of \(\varSigma ^{1/2}\). Also set \(Z:=\eta \,\varSigma ^{-1/2}X\). Let \(\mathsf{h},\sigma ^2,\mathsf{h}_{*}>0\) and \(q>2\) and assume that, for all \(v\in \mathbb {R}^p\),

$$\begin{aligned} \sqrt{\mathbb {E}\left[ (v^TX)^4\right] }\le & {} \mathsf{h}\,v^T\varSigma \,v; \end{aligned}$$
(5)
$$\begin{aligned} \mathbb {E}\left[ (v^TZ)^2\right]\le & {} \sigma ^2\,|v|_ 2^2; \text{ and } \end{aligned}$$
(6)
$$\begin{aligned} \root q \of {\mathbb {E}\left[ |Z|^q_ 2\right] }\le & {} \mathsf{h}_*\,\sigma \,\sqrt{p}. \end{aligned}$$
(7)

Then for any \(\varepsilon \in (0,1/2)\), there exists \(C>0\) depending only on \(\mathsf{h}_*,\varepsilon \) and q such that, when \(\delta \in (C/n^{q/2-1},1)\) and

$$\begin{aligned} n\ge C\mathsf{h}^2\,(p + 2\ln (4/\delta )), \end{aligned}$$

then

$$\begin{aligned} \Pr \left( {\ell (\widehat{\beta }_n) - \inf _{\beta \in \mathbb {R}^p}\ell (\beta )\le \frac{(1+\varepsilon )\,\sigma ^2}{n} \,\left( \sqrt{p} + C\sqrt{\ln (4/\delta )}\right) ^2}\right) \ge 1-\delta . \end{aligned}$$

So for \(n\gg p\gg 1\) the excess loss of OLS is bounded by \((1+o\left( 1\right) )\,\sigma ^2p/n\), with high probability. This can be shown to be tight; cf. the end of Sect. 5.1 for details. An important point is that Theorem 1.2 makes minimal assumptions on the data, and works in a completely model-free, non-parametric, heteroskedastic setting. Our moment assumptions are reasonable e.g. when those of Theorem 1.1 are reasonable and \(Z=\eta \,\varSigma ^{-1/2}X\) is not too far from isotropic. For instance, if the “noise”  term \(\eta \) is independent from X, this property follows from suitable moment assumptions on the noise and on the one-dimensional marginals of X. Even when there is no independence, one only needs higher moment assumptions on X and \(\eta \) (thanks to Hölder’s inequality).

Theorem 1.2 extends the recent papers by Hsu et al. [10] and by Audibert and Catoni [4]. Hsu et al. prove a variant of Theorem 1.2 where they assume an independent noise model with sub-Gaussian properties, as well as bounds on \(\varSigma ^{-1/2}X_i/\sqrt{p}\). Their bound does have the advantage of working up to much smaller values of \(\delta \). Audibert and Catoni obtain bounds for \(\delta \ge 1/n\), albeit with worse constants and only by assuming that \((v^TX_1)^2 \le B\,v^T\varSigma \,v\) almost surely for some \(B>0\). To the best of our knowledge, no excess loss bounds of optimal order were known under finite moment assumptions. We do note, however, that Theorem 1.2 is a simple consequence of our main result, Theorem 1.1, and a Fuk-Nagaev bound by Einmahl and Li [7, Theorem 4].

1.3 Organization

The remainder of the paper is organized as follows. Section 2 reviews some preliminaries and defines our notation. Section 3 discusses our PAC-Bayesian proof method, and Sect. 4 contains the proof of the sub-Gaussian lower tail (cf. Theorem 1.1). Some facts about OLS and our proof of Theorem 1.2 are presented in Sect. 5. The final Section contains some further remarks and open problems.

2 Notation and preliminaries

The coordinates of a vector \(v\in \mathbb {R}^p\) are denoted by \(v[1],v[2],\ldots ,v[p]\). We denote the space of matrices with p rows, \(p'\) columns and real entries by \(\mathbb {R}^{p\times p'}\). A matrix \(A\in \mathbb {R}^{p\times p}\) is symmetric if it equals its own transpose \(A^T\). Given \(A\in \mathbb {R}^{p\times p}\), we let \(\mathrm{tr}(A)\) denote the trace of A and \(\lambda _{\max }(A)\) denote its largest eigenvalue. Also, \(\mathrm{diag}(A)\) is the diagonal matrix whose diagonal entries match those of A. The \(p\times p\) identity matrix is denoted by \(I_{p\times p}\). We identify \(\mathbb {R}^p\) with the space of column vectors \(\mathbb {R}^{p\times 1}\), so that the standard Euclidean inner product of \(v,w\in \mathbb {R}^p\) is \(v^Tw\). The Euclidean norm is denoted by \(|v|_2:=\sqrt{v^Tv}\).

We say that \(A\in \mathbb {R}^{p\times p}\) is positive semidefinite, and write \(A\succeq 0\), if it is symmetric and \(v^TAv\ge 0\) for all \(v\in \mathbb {R}^p\). In this case one can easily show that

$$\begin{aligned} v^TAv=0\Leftrightarrow v^TA=0\Leftrightarrow Av=0. \end{aligned}$$
(8)

The \(2\rightarrow 2\) operator norm of \(A\in \mathbb {R}^{p\times p'}\) is

$$\begin{aligned} |A|_{2\rightarrow 2}:= \max _{v\in \mathbb {R}^{p'}\,:\, |v|_2=1}|Av|_2. \end{aligned}$$

For symmetric \(A\in \mathbb {R}^{p\times p}\) this is the largest absolute value of its eigenvalues. Moreover, if A is positive semidefinite, then \(|A|_{2\rightarrow 2}=\lambda _{\max }(A)\) is its largest eigenvalue, and (when A is invertible)

$$\begin{aligned} |A^{-1}|_{2\rightarrow 2} = \frac{1}{\min _{v\in \mathbb {R}^p\,:\, |v|_2=1}v^TAv}. \end{aligned}$$
(9)

Finally, we write \(A\succeq B\) if \(A-B\succeq 0\).
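
The identity (9) is elementary; for the reader who wants a quick sanity check, here is a minimal numerical sketch (our own illustration, with an arbitrary positive definite matrix).

```python
import numpy as np

rng = np.random.default_rng(3)
p = 6
G = rng.standard_normal((p, p))
A = G @ G.T + 0.1 * np.eye(p)                  # symmetric positive definite, hence invertible

# (9): |A^{-1}|_{2->2} = 1 / min_{|v|=1} v'Av, and the minimum equals lambda_min(A).
lhs = np.linalg.norm(np.linalg.inv(A), 2)      # spectral norm of A^{-1}
rhs = 1.0 / np.linalg.eigvalsh(A).min()
print(abs(lhs - rhs) < 1e-10)
```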

Throughout the paper we use big-oh and little-oh notation informally, mostly as shorthand. For instance, \(a=O\left( b\right) \) means that a is at most of the same order of magnitude as b, whereas \(a=o\left( b\right) \) or \(a\ll b\) means a is much smaller than b.

3 The PAC-Bayesian method

In this section we give an overview of the PAC-Bayesian method as applied to our problem. The actual proof of Theorem 1.1 is presented in Sect. 4 below.

At first sight it may seem odd that we can obtain strong concentration as in Theorem 1.1 from finite moment assumptions. The key point here is that, for any \(v\in \mathbb {R}^p\), the expression

$$\begin{aligned} v^T\widehat{\varSigma }_{n}v = \frac{1}{n}\sum _{i=1}^n\,(X_i^Tv) ^2 \end{aligned}$$

is a sum of random variables which are independent, identically distributed and non-negative. Such sums are well known to have sub-Gaussian lower tails under weak assumptions; this follows e.g. from Lemma A.1 below.

This fact may be used to show concentration of \(v^T\widehat{\varSigma }_{n}\,v\) for any fixed \(v\in \mathbb {R}^p\). It is less obvious how to turn this into a uniform bound. The standard techniques for this, such as chaining, involve looking at a discretized subset of \(\mathbb {R}^p\) and moving from this finite set to the whole space. In our case this second step is problematic, because it requires upper bounds on \(v^T\widehat{\varSigma }_{n}\,v\), and we know that our assumptions are not strong enough to obtain this.

What we use instead is the so-called PAC-Bayesian method [6] for controlling empirical processes. At a very high level, this method replaces chaining and union bounds with arguments based on the relative entropy. What this means in our case is that a “smoothed-out” version of the process \(v^T\widehat{\varSigma }_{n}\,v\) (\(v\in \mathbb {R}^p\)), where v is averaged over a Gaussian measure, automatically enjoys very strong concentration properties. This implies that the original process is also well behaved as long as the effect of the smoothing can be shown to be negligible. Many of our ideas come from Audibert and Catoni [4], who in turn credit Langford and Shawe-Taylor [13] for the idea of Gaussian smoothing.

To make our ideas more definite, we present a technical result that encapsulates the main ideas in our PAC-Bayesian approach. This requires some conditions.

Assumption 1

\(\{Z_\theta :\,\theta \in \mathbb {R}^p\}\) is a family of random variables defined on a common probability space \((\varOmega ,\mathcal {F},\mathbb {P})\). We assume that the map

$$\begin{aligned} \theta \mapsto Z_\theta (\omega )\in \mathbb {R}\end{aligned}$$

is continuous for each \(\omega \in \varOmega \). Given \(v\in \mathbb {R}^p\) and an invertible, positive semidefinite \(C\in \mathbb {R}^{p\times p}\), we let \(\Gamma _{v,C}\) denote the Gaussian probability measure over \(\mathbb {R}^p\) with mean v and covariance matrix C. We will also assume that for all \(\omega \in \varOmega \) the integrals

$$\begin{aligned} (\Gamma _{v,C}\,Z_\theta )\,(\omega ):= \int _{\mathbb {R}^p}Z_\theta (\omega )\,\Gamma _{v,C}(d\theta ) \end{aligned}$$

are well defined and depend continuously on v. We will use the notation \(\Gamma _{v,C}f_\theta \) to denote the integral of \(f_\theta \) (which may also depend on other parameters) over the variable \(\theta \) with the measure \(\Gamma _{v,C}\).

Proposition 3.1

(PAC-Bayesian Proposition) Assume the above setup, and also that C is invertible and \(\mathbb {E}\left[ e^{Z_\theta }\right] \le 1\) for all \(\theta \in \mathbb {R}^p\). Then for any \(t\ge 0\),

$$\begin{aligned} \Pr \left( {\forall v\in \mathbb {R}^p\,:\, \Gamma _{v,C}Z_\theta \le t + \frac{|C^{-1/2}v|_2^2}{2}}\right) \ge 1 - e^{-t}. \end{aligned}$$

In the next subsection we will apply this to prove Theorem 4.1. Here is a brief overview: we will perform a change of coordinates under which \(\varSigma =I_{p\times p}\). We will then define \(Z_\theta \) as

$$\begin{aligned} Z_\theta = \xi |\theta |^2_2-\xi \theta ^T\widehat{\varSigma }_{n}\,\theta + \text{(other } \text{ terms) } \end{aligned}$$

where \(\xi >0\) will be chosen in terms of t and the “other terms”  will ensure that \(\mathbb {E}\left[ e^{Z_\theta }\right] \le 1\). Taking \(C=\gamma \,I_{p\times p}\) will result in

$$\begin{aligned} \Gamma _{v,C}Z_\theta = \xi |v|^2_2-\xi v^T\widehat{\varSigma }_{n}\,v+ \xi S_v +\text{(other } \text{ terms) } \end{aligned}$$

where

$$\begin{aligned} S_v:= \gamma \,p -\gamma \,\mathrm{tr}(\widehat{\varSigma }_{n}) \end{aligned}$$

is a new term introduced by the “smoothing operator” \(\Gamma _{v,C}\). The choice \(\gamma =1/p\) will ensure that this term is small, and the “other terms” will also turn out to be manageable. The actual proof will be slightly complicated by the fact that we need to truncate the operator \(\widehat{\varSigma }_{n}\) to ensure that \(S_v\) is highly concentrated.
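
The only property of the smoothing needed to see where \(S_v\) comes from is the exact identity \(\Gamma _{v,\gamma I}\left( \theta ^TA\theta \right) = v^TAv + \gamma \,\mathrm{tr}(A)\) for any fixed matrix A. A quick Monte Carlo check of this identity (our own sketch; the matrix and the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
p, gamma, n_mc = 8, 1.0 / 8, 200_000          # gamma = 1/p, as suggested in the text

G = rng.standard_normal((p, p))
A = G @ G.T                                   # a PSD matrix standing in for Sigma_hat
v = rng.standard_normal(p)

# Monte Carlo: average theta' A theta over theta ~ N(v, gamma * I).
theta = v + np.sqrt(gamma) * rng.standard_normal((n_mc, p))
mc = np.einsum('ij,jk,ik->i', theta, A, theta).mean()

exact = v @ A @ v + gamma * np.trace(A)       # closed form: v'Av + gamma * tr(A)
print(f"Monte Carlo: {mc:.3f}   closed form: {exact:.3f}")
```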

Proof

As a preliminary step, we note that under our assumptions the map:

$$\begin{aligned} \omega \in \varOmega \mapsto \sup _{v\in \mathbb {R}^p}\left\{ \Gamma _{v,C}Z_\theta (\omega ) - \frac{|C^{-1/2}v|_2^2}{2}\right\} \in \mathbb {R}\cup \{+\infty \} \end{aligned}$$

is measurable, since (by continuity) we may take the supremum over \(v\in \mathbb {Q}^p\), which is a countable set. In particular, the event in the statement of the proposition is indeed a measurable set.

To continue, recall the definition of the Kullback-Leibler divergence (or relative entropy) for probability measures over a measurable space \((\Theta ,\mathcal {G})\):

$$\begin{aligned} K(\mu _{1}|\mu _0):= \left\{ \begin{array}{ll}\int _\Theta \,\ln \left( \frac{d\mu _{1}}{d\mu _0}\right) \,d\mu _1, &{} \text{ if } \mu _1\ll \mu _0; \\ +\infty , &{} \text{ otherwise }.\end{array}\right. \end{aligned}$$
(10)

A variational principle [14, eqn. (5.13)] implies that for any measurable function \(h:\Theta \rightarrow \mathbb {R}\):

$$\begin{aligned} \int h\,d\mu _1\le \ln \left( \int e^{h}\,d\mu _0\right) + K(\mu _{1}|\mu _0). \end{aligned}$$
(11)

We apply this when \(\Theta =\mathbb {R}^p\) with \(\mathcal {G}\) equal to the Borel \(\sigma \)-field \(\mathcal {B}(\mathbb {R}^p)\), \(\mu _1=\Gamma _{v,C}\), \(\mu _0=\Gamma _{0,C}\) and \(h=Z_\theta \). In this case it is well-known that the relative entropy of the two measures is \(|C^{-1/2}v|_2^2/2\) [19, Appendix A.5]. This implies:

$$\begin{aligned} \sup _{v\in \mathbb {R}^p} \left( \Gamma _{v,C}Z_\theta -\frac{|C^{-1/2}v|_2^2}{2}\right) \le \ln \left( \Gamma _{0,C}\,e^{Z_\theta }\right) . \end{aligned}$$

To finish, we prove that:

$$\begin{aligned} \Pr ({\Gamma _{0,C}\,e^{Z_\theta }\ge e^{t}})\le e^{-t}. \end{aligned}$$

But this follows from Markov’s inequality and Fubini’s Theorem:

$$\begin{aligned} \Pr \left( {\Gamma _{0,C}\,e^{Z_\theta }\ge e^{t}}\right) \le e^{-t}\,\mathbb {E}\left[ \Gamma _{0,C}\,e^{Z_\theta }\right] = e^{-t}\,\Gamma _{0,C}\mathbb {E}\left[ e^{Z_\theta }\right] \le e^{-t}, \end{aligned}$$

because \(\mathbb {E}\left[ e^{Z_\theta }\right] \le 1\) for any fixed \(\theta \). \(\square \)
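
The variational inequality (11) that drives this proof can be checked directly on a finite space \(\Theta \), where all three quantities are exact finite sums. A tiny sketch (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
m = 12                                   # a finite Theta = {0, ..., m-1}
mu0 = rng.random(m); mu0 /= mu0.sum()    # reference probability measure
mu1 = rng.random(m); mu1 /= mu1.sum()    # another probability measure (mu1 << mu0)
h = rng.standard_normal(m)               # an arbitrary function on Theta

lhs = (h * mu1).sum()                                   # int h d(mu1)
kl = (mu1 * np.log(mu1 / mu0)).sum()                    # K(mu1 | mu0)
rhs = np.log((np.exp(h) * mu0).sum()) + kl              # log int e^h d(mu0) + K(mu1 | mu0)
print(lhs <= rhs)                        # (11) always holds
```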

4 The sub-Gaussian lower tail

The goal of this section is to discuss and prove the following slight generalization of Theorem 1.1.

Theorem 4.1

Assume \(A_1,\dots ,A_n\in \mathbb {R}^{p\times p}\) are i.i.d. random self-adjoint, positive semidefinite matrices whose coordinates have bounded second moments. Define \(\varSigma := \mathbb {E}\left[ A_1\right] \) (this is an entrywise expectation) and

$$\begin{aligned} \widehat{\varSigma }_{n}:= \frac{1}{n}\sum _{i=1}^nA_i. \end{aligned}$$

Assume \(\mathsf{h}\in (1,+\infty )\) satisfies \(\sqrt{\mathbb {E}\left[ (v^TA_1\,v)^2\right] }\le \mathsf{h}\,v^T\varSigma \,v\) for all \(v\in \mathbb {R}^p\). Then for any \(\delta \in (0,1)\):

$$\begin{aligned} \Pr \left( {\forall v\in \mathbb {R}^p\,:\, v^T\widehat{\varSigma }_{n}\,v\ge \left( 1 - 9\,\mathsf{h}\,\sqrt{\frac{p+2\ln (2/\delta )}{n}}\right) \,v^T\varSigma v}\right) \ge 1-\delta . \end{aligned}$$

Theorem 1.1 is recovered when we set \(A_i=X_iX_i^T\), check that the moment assumption on \(v^TA_1\,v\) translates into (3), and note that

$$\begin{aligned} n\ge 81\,\mathsf{h}^2\,\left( \frac{p + 2\ln (2/\delta )}{\varepsilon ^2}\right) \Rightarrow 9\,\mathsf{h}\,\sqrt{\frac{p+2\ln (2/\delta )}{n}}\le \varepsilon . \end{aligned}$$

Theorem 4.1 is proved in several steps over the next subsections.

4.1 Preliminaries: normalization and truncation

We first note that we may assume that \(\varSigma \) is invertible. Indeed, if that is not the case, we can restrict ourselves to the range of \(\varSigma \), which is isometric to \(\mathbb {R}^{p'}\) for some \(p'\le p\), noting that \(A_iv=0\) and \(v^TA_i=0\) almost surely for any v that is orthogonal to the range (this follows from \(\mathbb {E}\left[ v^TA_1v\right] =0\) for v orthogonal to the range, combined with (8) above).

Granted invertibility, we may define:

$$\begin{aligned} B_i:= \varSigma ^{-1/2}A_i\varSigma ^{-1/2}\,\,(1\le i\le n) \end{aligned}$$
(12)

and note that \(B_1,\dots ,B_n\) are i.i.d. positive semidefinite with \(\mathbb {E}\left[ B_1\right] =I_{p\times p}\). Moreover,

$$\begin{aligned} \forall v\in \mathbb {R}^p\,:\, \sqrt{\mathbb {E}\left[ (v^TB_1v)^2\right] } = \sqrt{\mathbb {E}\left[ ((\varSigma ^{-1/2}v)^TA_1\,(\varSigma ^{-1/2}v))^2\right] }\le \mathsf{h}\,|v|^2_2. \end{aligned}$$
(13)

Define

$$\begin{aligned} t:=\ln (2/\delta )\quad \text{ and } \quad \varepsilon :=9\,\mathsf{h}\,\sqrt{\frac{p+2t}{n}}. \end{aligned}$$
(14)

Our goal is to show that the following holds with probability \(\ge 1-2e^{-t}\):

$$\begin{aligned} \forall w\in \mathbb {R}^p\,:\, w^T\widehat{\varSigma }_{n}\,w\ge (1-\varepsilon )\,w^T\varSigma \,w. \end{aligned}$$

Notice that, by homogeneity, it suffices to consider vectors of the form \(w=\varSigma ^{-1/2}\,v\) with \(|v|_2=1\). Thus our goal may be restated as follows.

$$\begin{aligned} \text{ Goal: } \Pr \left( {\forall v\in \mathbb {R}^p\,:\,|v|_2=1\Rightarrow \frac{1}{n}\sum _{i=1}^nv^TB_i\,v\ge 1-\varepsilon }\right) \ge 1-\delta . \end{aligned}$$
(15)

We will make yet another change to our goal. Fix some \(R>0\) and define (with hindsight) truncated operators

$$\begin{aligned} B_i^R:= \left( 1\wedge \frac{R}{\mathrm{tr}(B_i)}\right) \,B_i, \end{aligned}$$
(16)

with the convention that this is simply 0 if \(\mathrm{tr}(B_i)=0\). We collect some estimates for later use.

Lemma 4.1

We have for all \(v\in \mathbb {R}^p\) with unit norm

$$\begin{aligned}&\frac{1}{n}\sum _{i=1}^n {v}^T{B_i^Rv}\le \frac{1}{n}\sum _{i=1}^n {v}^T{B_iv};\\&\mathbb {E}\left[ (\mathrm{tr}(B_i^R))^2\right] \le \mathbb {E}\left[ (\mathrm{tr}(B_i))^2\right] \le (\mathsf{h}\,p)^2; \text{ and } \\&\quad \mathbb {E}\left[ {v}^T{B_i^Rv}\right] \ge \left( 1 - \frac{\mathsf{h}^2\,p}{R}\right) . \end{aligned}$$

Proof

The first assertion follows from the fact that the \(B_i\) are positive semidefinite, so \(v^T\,B_iv\ge 0\) and \(v^T\,B^R_iv = \alpha _i\,v^TB_iv\) for each i, with \(\alpha _i\in [0,1]\) a scalar. This same reasoning implies \(B_i^R\preceq B_i\), a fact that we will use below.

To prove the second assertion, we let \(e_1,\ldots ,e_p\) denote the canonical basis of \(\mathbb {R}^p\), and apply Minkowski’s inequality:

$$\begin{aligned} \mathbb {E}\left[ (\mathrm{tr}(B_i^R))^2\right]= & {} \mathbb {E}\left[ \left( \sum _{j=1}^p\,e_j^TB_i^R\,e_j\right) ^2\right] \\ \text{(use } 0\preceq B_i^R\preceq B_i\text{) }\le & {} \mathbb {E}\left[ \left( \sum _{j=1}^p\,e_j^TB_i\,e_j\right) ^2\right] \\ \text{(Minkowski) }\le & {} \left( \sum _{j=1}^p\, \sqrt{\mathbb {E}\left[ (e_j^TB_i\,e_j)^2\right] }\right) ^2 \\ \text{(eqn. } \text{(13)) }\le & {} (\mathsf{h}\,p)^2. \end{aligned}$$

To prove the third assertion, we fix some \(v\in \mathbb {R}^p\) with \(|v|_2=1\). We use again that \(v^TB_i\,v\ge 0\) to deduce

$$\begin{aligned} 1 - \mathbb {E}\left[ v^TB^R_i\,v\right] = \mathbb {E}\left[ v^T\,(B_i-B_i^R)\,v\right] \le \mathbb {E}\left[ (v^T\,B_i\,v)\,\chi _{\{\mathrm{tr}(B_i)>R\}}\right] . \end{aligned}$$

We may bound the RHS via Cauchy-Schwarz, noting that \(\mathbb {E}\left[ (v^TB_i\,v)^2\right] \le \mathsf{h}^2\) by (13) and \(\Pr (\mathrm{tr}(B_i)>R)\le \mathbb {E}\left[ \mathrm{tr}(B_i)^2\right] /R^2 \le (\mathsf{h}\,p/R)^2\) (by Markov’s inequality applied to \(\mathrm{tr}(B_i)^2\) and the second assertion). This gives:

$$\begin{aligned} 1 - \mathbb {E}\left[ v^TB^R_i\,v\right] \le \sqrt{\mathbb {E}\left[ (v^TB_i\,v)^2\right] \,\Pr (\mathrm{tr}(B_i)>R)}\le \frac{\mathsf{h}^2\,p}{R}. \end{aligned}$$

\(\square \)
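
The three assertions of Lemma 4.1 are easy to probe by simulation in the rank-one case \(B_i=X_iX_i^T\) with \(\varSigma =I_{p\times p}\), where \(v^TB_iv=(X_i^Tv)^2\) and \(\mathrm{tr}(B_i)=|X_i|_2^2\). A minimal sketch follows (our own illustration; the constant \(\mathsf{h}\) is only estimated over random directions, and the truncation level R is an arbitrary choice).

```python
import numpy as np

rng = np.random.default_rng(6)
p, n_mc, df = 10, 200_000, 6
X = rng.standard_t(df, size=(n_mc, p)) / np.sqrt(df / (df - 2))   # E[X X'] = I, finite 4th moments

# Estimate h from (13) over random unit directions (an illustration, not a rigorous supremum).
dirs = rng.standard_normal((50, p))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
h = max(np.sqrt(((X @ v) ** 4).mean()) for v in dirs)

R = 3 * h * p                                 # an arbitrary truncation level
tr_B = (X ** 2).sum(axis=1)                   # tr(B_i) = |X_i|^2
scale = np.minimum(1.0, R / tr_B)             # the factor min(1, R / tr(B_i)) from (16)

v = dirs[0]
quad = (X @ v) ** 2                           # v' B_i v
print("assertion 1:", (scale * quad).mean() <= quad.mean())         # truncation only decreases
print("assertion 2:", ((scale * tr_B) ** 2).mean() <= (h * p) ** 2) # second moment of the trace
print("assertion 3:", (scale * quad).mean() >= 1 - h ** 2 * p / R)  # truncation bias is small
```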

4.2 Applying the PAC-Bayesian method

We continue to use our definitions of \(B_1,\ldots ,B_n\) and \(B^R_1,\ldots ,B_n^R\), with the goal of proving (15). The parameters t and \(\varepsilon \) are as in (14). We also fix \(\xi >0\). We intend to apply Proposition 3.1 with \(C=I_{p\times p}/p\) and

$$\begin{aligned} Z_\theta := \xi \,\mathbb {E}\left[ {\theta }^T{B_1^R\theta }\right] - \frac{\xi ^2}{2n}\,\mathbb {E}\left[ ( {\theta }^T{B_1^R\theta })^2\right] - \xi \sum _{i=1}^n\frac{ {\theta }^T{B_i^R\theta }}{n}. \end{aligned}$$

Let us check that the assumptions of Proposition 3.1 are satisfied. First note that \(Z_\theta \) is a quadratic form in \(\theta \), and is therefore a.s. continuous as a function of \(\theta \). The same argument combined with the square integrability of the normal distribution shows that \(\Gamma _{v,C}\,Z_\theta \) is continuous in \(v\in \mathbb {R}^p\). The inequality \(\mathbb {E}\left[ e^{Z_\theta }\right] \le 1\) follows from independence, which implies

$$\begin{aligned} \mathbb {E}\left[ e^{Z_\theta }\right] = \prod _{i=1}^n\mathbb {E}\left[ e^{\frac{\xi \,\,\mathbb {E}\left[ {\theta }^T{B_1^R\theta }\right] }{n} - \frac{\xi {\theta }^T{B^R_i\theta }}{n} - \frac{\xi ^2}{2n^2}\,\mathbb {E}\left[ ( {\theta }^T{B_i^R\theta })^2\right] }\right] , \end{aligned}$$

plus the fact that, for any non-negative, square-integrable random variable W and any \(\xi >0\):

$$\begin{aligned} \mathbb {E}\left[ e^{\xi \,\mathbb {E}\left[ W\right] -\xi \,W - \frac{\xi ^2}{2}\,\mathbb {E}\left[ W^2\right] }\right] \le 1 \end{aligned}$$

(this is shown in Lemma A.1 in the Appendix). Therefore all assumptions of Proposition 3.1 are satisfied, and we may deduce from that result that, with probability \(\ge 1-e^{-t}\), for all \(v\in \mathbb {R}^p\):

$$\begin{aligned} \xi \,\Gamma _{v,C}\mathbb {E}\left[ {\theta }^T{B_1^R\theta }\right] - \frac{\xi ^2}{2n}\,\Gamma _{v,C}\mathbb {E}\left[ ( {\theta }^T{B_1^R\theta })^2\right] -\xi \sum _{i=1}^n\Gamma _{v,C}\frac{ {\theta }^T{B_i^R\theta }}{n} \le \frac{p|v|_2^2 + 2t}{2}. \end{aligned}$$

This is the same as saying that, with probability \(\ge 1-e^{-t}\), the following inequality holds for all \(v\in \mathbb {R}^p\) with \(|v|_2=1\):

$$\begin{aligned} \sum _{i=1}^n\Gamma _{v,C}\frac{ {\theta }^T{B_i^R\theta }}{n}\ge \Gamma _{v,C}\mathbb {E}\left[ {\theta }^T{B_1^R\theta }\right] - \left( \frac{\xi }{2n}\,\Gamma _{v,C}\mathbb {E}\left[ ( {\theta }^T{B_1^R\theta })^2\right] +\frac{p + 2t}{2\xi }\right) .\nonumber \\ \end{aligned}$$
(17)
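
The elementary inequality from Lemma A.1 used above, \(\mathbb {E}\left[ e^{\xi \,\mathbb {E}\left[ W\right] -\xi \,W - \frac{\xi ^2}{2}\,\mathbb {E}\left[ W^2\right] }\right] \le 1\) for non-negative square-integrable W, is also easy to test numerically. A quick sketch (our own illustration; the exponential W and the values of \(\xi \) are arbitrary, and the population moments are replaced by sample moments):

```python
import numpy as np

rng = np.random.default_rng(7)
n_mc = 1_000_000
W = rng.exponential(scale=2.0, size=n_mc)     # any non-negative, square-integrable W
EW, EW2 = W.mean(), (W ** 2).mean()           # sample proxies for E[W] and E[W^2]

for xi in (0.1, 0.5, 1.0, 3.0):
    val = np.exp(xi * EW - xi * W - 0.5 * xi ** 2 * EW2).mean()
    print(f"xi = {xi:3.1f}:  E[exp(...)] ~ {val:.4f}  (the bound says <= 1)")
```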

4.3 Dealing with the terms

The next step in the proof is to control all the terms involving \(\Gamma _{v,C}\) that appear in (17). For \(v\in \mathbb {R}^p\) with \(|v|_2=1\), explicit calculations reveal

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n\Gamma _{v,C}\, {\theta }^T{B_i^R\theta }= & {} \frac{1}{n}\,\sum _{i=1}^n {v}^T{B_i^Rv} + \sum _{i=1}^n\frac{\mathrm{tr}(B_i^R)}{p n}\nonumber \\ \hbox {(use Lemma 4.1)}\le & {} \frac{1}{n}\,\sum _{i=1}^n {v}^T{B_iv} + \sum _{i=1}^n\frac{\mathrm{tr}(B_i^R)}{p n}; \end{aligned}$$
(18)
$$\begin{aligned} \Gamma _{v,C}\,\mathbb {E}\left[ {\theta }^T{B_1^R\theta }\right]= & {} \mathbb {E}\left[ {v}^T{B_1^Rv}\right] + \frac{\mathbb {E}\left[ \mathrm{tr}(B_1^R)\right] }{p }\nonumber \\ \hbox {(use Lemma 4.1)}\ge & {} 1- \frac{\mathsf{h}^2\,p}{R} + \frac{\mathbb {E}\left[ \mathrm{tr}(B_1^R)\right] }{p}. \end{aligned}$$
(19)

We also need estimates for \(\Gamma _{v,C}\mathbb {E}\left[ ( {\theta }^T{B_1^R\theta })^2\right] \). Standard calculations with the normal distribution show that:

$$\begin{aligned} \Gamma _{v,C}( {\theta }^T{B_1^R\theta }) ^2 = \Gamma _{0,C}\,( {v}^T{B_1^Rv} + {\theta }^T{B_1^R\theta } + 2 {\theta }^T{B_1^Rv})^2. \end{aligned}$$
(20)

That is, instead of averaging \(\theta \) over \(\Gamma _{v,C}\), we may replace \(\theta \) by \(v+\theta \) and then average over \(\Gamma _{0,C}\).

We now consider the RHS of (20). The first two terms inside the brackets on the RHS are non-negative. By Cauchy-Schwarz and the AM/GM inequality, the third term satisfies

$$\begin{aligned} |2 {\theta }^T{B_1^Rv}|\le 2\sqrt{(\theta ^TB_1^R\,\theta )\, (v^TB_1^R\,v)}\le ( {v}^T{B_1^Rv} + {\theta }^T{B_1^R\theta }). \end{aligned}$$

We deduce that

$$\begin{aligned} 0\le {v}^T{B_1^Rv} + {\theta }^T{B_1^R\theta } + 2 {\theta }^T{B_1^Rv}\le 2\,( {v}^T{B_1^Rv} + {\theta }^T{B_1^R\theta }), \end{aligned}$$

and plugging this into (20) gives

$$\begin{aligned} \Gamma _{0,C}\,( {v}^T{B_1^Rv} + {\theta }^T{B_1^R\theta } + 2 {\theta }^T{B_1^Rv})^2\le & {} 4\Gamma _{0,C}\,[( {v}^T{B_1^Rv} + {\theta }^T{B_1^R\theta })^2] \end{aligned}$$
(21)
$$\begin{aligned} \text{(use } (a+b)^2\le 2a^2+ 2b^2)\le & {} 8 \, ( {v}^T{B_1^Rv})^2 \end{aligned}$$
(22)
$$\begin{aligned}&+\, 8\Gamma _{0,C}( {\theta }^T{B_1^R\theta })^2. \end{aligned}$$
(23)

We now compute the term in (23) as follows. First of all, since \(C=I_{p\times p}/p\),

$$\begin{aligned} \text{ Law } \text{ of } \theta ^TB_1^R\,\theta \text{ under } \Gamma _{0,C} = \text{ Law } \text{ of } \frac{1}{p}\sum _{i=1}^pN^2_i\lambda _i, \end{aligned}$$

where the \(\lambda _1,\ldots ,\lambda _p\ge 0\) are the eigenvalues of \(B_1^R\) and the \(N_1,\ldots ,N_p\) are independent standard Gaussian random variables. We note that \(\mathbb {E}\left[ N_i^2N_j^2\right] \le \mathbb {E}\left[ N_i^4\right] =3\) for all \(1\le i,j\le p\) and that the eigenvalues of \(B_1^R\) are all real and nonnegative (since \(B_1^R\succeq 0\)); therefore

$$\begin{aligned} \Gamma _{0,C}( {\theta }^T{B_1^R\theta })^2\le \frac{3\mathrm{tr}(B_i^R)^2}{p^2}. \end{aligned}$$

We combine this with (21), (22), and (23), then apply Lemma 4.1 and recall \(|v|_2=1\) to obtain:

$$\begin{aligned} \Gamma _{v,C}\mathbb {E}\left[ ( {\theta }^T{B_i^R\theta })^2\right] \le 8\mathsf{h}^2 + \frac{24}{p^2}\,\mathbb {E}\left[ \mathrm{tr}(B_i^R)^2\right] \le 32\mathsf{h}^2. \end{aligned}$$
(24)

We plug this last estimate into (17) together with (19) and (18). This results in the following inequality, which holds with probability \(\ge 1-e^{-t}\) simultaneously for all \(v\in \mathbb {R}^p\) with \(|v|_2=1\):

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n {v}^T{B_i^Rv}\ge 1 -\left\{ \frac{\mathsf{h}^2p}{R} + \frac{16\mathsf{h}^2}{n}\,\xi + \frac{p+2t}{2\xi }+ \left( \sum _{i=1}^n\frac{\mathrm{tr}(B_i^R)-\mathbb {E}\left[ \mathrm{tr}(B_i^R)\right] }{p n}\right) \right\} . \end{aligned}$$

This holds for any choice of \(\xi \). Optimizing over this parameter shows that, with probability \(\ge 1-e^{-t}\), we have the following inequality simultaneously for all \(v\in \mathbb {R}^p\) with \(|v|_2=1\).

$$\begin{aligned} \sum _{i=1}^n\frac{ {v}^T{B_iv}}{n}\ge & {} 1 - \left\{ \frac{\mathsf{h}^2p}{R} + 4\sqrt{2}\,\mathsf{h}\,\sqrt{\frac{(p+2t)}{n}}\right\} \nonumber \\&-\left( \sum _{i=1}^n\frac{\mathrm{tr}(B_i^R)-\mathbb {E}\left[ \mathrm{tr}(B_i^R)\right] }{p n}\right) . \end{aligned}$$
(25)

4.4 The final step: control of the trace

We now take care of the term involving the traces on the RHS. This is precisely the moment when the truncation of \(B_i\) is useful, as it allows for the use of Bernstein’s concentration inequality [23, Sect. 2.6]. This inequality states that, if \(Z_1,\dots ,Z_n\) are independent random variables with \(\mathbb {E}\left[ Z_i\right] =0\), \(\sum _{i=1}^n\mathbb {E}\left[ Z_i^2\right] =\sigma ^2\) and \(|Z_i|\le M\) for each \(1\le i\le n\) (\(M>0\) a constant), then

$$\begin{aligned} \Pr \left( {\sum _{i=1}^nZ_i\ge \sigma \,\sqrt{2t} + \frac{2Mt}{3}}\right) \le e^{-t}. \end{aligned}$$
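
As a quick numerical illustration of this form of Bernstein’s inequality (our own sketch; the uniform distribution, t, and the number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(8)
n, trials, t, M = 500, 5000, 3.0, 1.0

# Bounded, centered summands: uniform on [-1, 1], so |Z_i| <= M = 1 and Var(Z_i) = 1/3.
sigma = np.sqrt(n / 3.0)                      # sigma^2 = sum_i E[Z_i^2] = n/3
Z = rng.uniform(-1.0, 1.0, size=(trials, n))
threshold = sigma * np.sqrt(2 * t) + 2 * M * t / 3
emp = (Z.sum(axis=1) >= threshold).mean()
print(f"empirical tail probability: {emp:.4f}   Bernstein bound exp(-t): {np.exp(-t):.4f}")
```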

The term involving traces in (25) is a sum of i.i.d. mean-zero random variables that (because of the truncation) lie between \(-R/(pn)\) and \(R/(pn)\). Moreover, the variance of each term is at most \(\mathbb {E}\left[ \mathrm{tr}(B_i^R)^2\right] /(p^2 n^2)\le \mathsf{h}^2/n^2\) by Lemma 4.1. We deduce:

$$\begin{aligned} \Pr \left( {\sum _{i=1}^n\frac{\mathrm{tr}(B_i^R)-\mathbb {E}\left[ \mathrm{tr}(B_i^R)\right] }{p n}\le \mathsf{h}\sqrt{\frac{2\,t}{n}} + \frac{2R\,t}{3p n}}\right) \ge 1-e^{-t}. \end{aligned}$$

Combining this with (25) implies that, for any \(t\ge 0\), the following inequality holds with probability \(\ge 1-2e^{-t}\), simultaneously for all \(v\in \mathbb {R}^p\) with \(|v|_2=1\):

$$\begin{aligned} \sum _{i=1}^n\frac{ {v}^T{B_iv}}{n}\ge 1 - \left\{ \frac{\mathsf{h}^2p}{R} + 4\sqrt{2}\,\mathsf{h}\,\sqrt{\frac{(p+2t)}{n}} +\mathsf{h}\sqrt{\frac{2t}{n}} + \frac{2R\,t}{3p n}\right\} . \end{aligned}$$
(26)

This holds for any \(R>0\). Optimization over R gives

$$\begin{aligned} \inf _{R>0}\,\frac{\mathsf{h}^2p}{R} + \frac{2R\,t}{3p n} = 2\mathsf{h}\,\sqrt{\frac{2t}{3n}}\le \sqrt{2}\,\mathsf{h}\sqrt{\frac{2t}{n}}, \end{aligned}$$

so, with the right choice of R,

$$\begin{aligned} \left\{ \frac{\mathsf{h}^2p}{R} + 4\sqrt{2}\,\mathsf{h}\,\sqrt{\frac{(p+2t)}{n}} +\mathsf{h}\sqrt{\frac{2\,t}{n}} + \frac{2R\,t}{3p n}\right\} \le (5\sqrt{2} + 1)\,\mathsf{h}\,\sqrt{\frac{(p+2t)}{n}}\le \varepsilon , \end{aligned}$$

according to the definition of \(\varepsilon \) in (14). We obtain

$$\begin{aligned} \Pr \left( {\forall v\in \mathbb {R}^p\,:\,|v|_2=1\Rightarrow 1- \sum _{i=1}^n\frac{ {v}^T{B_iv}}{n}\le \varepsilon }\right) \ge 1-2e^{-t}. \end{aligned}$$

Inequality (15) follows. As noted in Sect. 4.1, (15) implies Theorem 4.1 and finishes the proof.

5 Applications in random-design linear regression

The main goal of this section is to prove Theorem 1.2.

5.1 Preliminaries

We begin by recalling some general facts about this problem. We assume \((X,Y)\in \mathbb {R}^p\times \mathbb {R}\) is a random pair, with \(X\in \mathbb {R}^p\) a vector of covariates and \(Y\in \mathbb {R}\) a response variable. We assume \(\mathbb {E}\left[ |X|_2^2\right] <+\infty \) and \(\mathbb {E}\left[ Y^2\right] < +\infty \). As in the introduction, we define the square loss function:

$$\begin{aligned} \ell (\beta ):=\mathbb {E}\left[ (Y-X^T\beta )^2\right] \,\,\,(\beta \in \mathbb {R}^p). \end{aligned}$$
(27)

It is not hard to show that \(\ell \) has at least one minimizer \({\beta }_{\min }\in \mathbb {R}^p\), defined so that \({\beta }_{\min }^TX\) equals the \(L^2\) projection of Y onto the linear space generated by the coordinates of X. In fact, this property uniquely defines the random variable \({\beta }_{\min }^TX\), if not necessarily the vector \({\beta }_{\min }\). It also implies

$$\begin{aligned} \eta :=Y-{\beta }_{\min }^T\,X \text{ satisfies } \mathbb {E}\left[ \eta \,X\right] =0. \end{aligned}$$
(28)

In fact, \({\beta }_{\min }\) is a minimizer of \(\ell \) if and only if (28) holds. Another calculation shows that

$$\begin{aligned} \forall \beta \in \mathbb {R}^p\,:\, \ell (\beta ) - \ell ({\beta }_{\min }) = |\varSigma ^{1/2}(\beta -{\beta }_{\min })|_2^2, \end{aligned}$$
(29)

where \(\varSigma :=\mathbb {E}\left[ XX^T\right] \). In particular, \({\beta }_{\min }\) is the unique minimizer of \(\ell \) if and only if \(\varSigma \) is non-singular.

Our main interest is in the OLS estimator, which satisfies

$$\begin{aligned} \widehat{\beta }_n\in \mathrm{arg min}_{\beta \in \mathbb {R}^p}\frac{1}{n}\sum _{i=1}^n\,(Y_i-\beta ^TX_i)^2. \end{aligned}$$

If \(\widehat{\varSigma }_{n}:=n^{-1}\sum _{i=1}^nX_iX_i^T\) is invertible, \(\widehat{\beta }_n\) is uniquely defined by the formula:

$$\begin{aligned} \widehat{\beta }_n:= \widehat{\varSigma }_{n}^{-1}\,\frac{1}{n}\sum _{i=1}^nY_iX_i. \end{aligned}$$
(30)

If \(\widehat{\varSigma }_{n}\) is not invertible, we may still define \(\widehat{\beta }_n\) by (30) if we let \(\widehat{\varSigma }_{n}^{-1}\) denote the Moore-Penrose pseudoinverse of \(\widehat{\varSigma }_{n}\). This definition will be used implicitly below.

An interesting test case for our result is that of a linear model with Gaussian noise and Gaussian design, where we assume that X is mean-zero Gaussian with covariance matrix \(\varSigma \) and \(\eta \) is mean zero Gaussian with variance \(\sigma ^2\) and independent of X. Using the notation of Theorem 1.2, we see that \(\mathbb {E}\left[ \eta ^2\right] =\sigma ^2\) is the variance of the noise, and \(\mathsf{h}_*\) does not depend on n or p. An explicit calculation (which we omit) implies that, for \(n\gg p\gg 1\),

$$\begin{aligned} |\varSigma ^{1/2}(\widehat{\beta }_n-{\beta }_{\min })|_ 2^2 \ge (1-o\left( 1\right) )\,\frac{\sigma ^2\,p}{n} \text{ with } \text{ probability } 1-o\left( 1\right) \!. \end{aligned}$$

Theorem 1.2 guarantees that OLS achieves this error rate under much weaker assumptions on the distribution of (X, Y).
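
The rate \(\sigma ^2p/n\) can also be observed directly by simulation with a heavy-tailed (non-Gaussian) design. The following minimal sketch is our own illustration (here \(\varSigma =I_{p\times p}\), the noise is independent of X with variance \(\sigma ^2=1\), so (6) holds with equality, and all constants are arbitrary); it compares the excess loss \(|\varSigma ^{1/2}(\widehat{\beta }_n-{\beta }_{\min })|_2^2\) with \(\sigma ^2p/n\).

```python
import numpy as np

rng = np.random.default_rng(9)
p, n, trials, df = 20, 2000, 300, 8
beta_min = rng.standard_normal(p)
sigma2 = 1.0                                   # noise variance; here (6) holds with equality

excess = np.empty(trials)
for k in range(trials):
    X = rng.standard_t(df, size=(n, p)) / np.sqrt(df / (df - 2))   # Sigma = I, heavy-tailed design
    eta = rng.standard_t(df, size=n) / np.sqrt(df / (df - 2))      # independent heavy-tailed noise
    Y = X @ beta_min + eta
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]                # OLS, as in (30)
    excess[k] = np.sum((beta_hat - beta_min) ** 2)                 # = |Sigma^{1/2}(beta_hat - beta_min)|^2

print(f"mean excess loss : {excess.mean():.4f}")
print(f"sigma^2 * p / n  : {sigma2 * p / n:.4f}")
print(f"0.95 quantile    : {np.quantile(excess, 0.95):.4f}")
```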

5.2 Proof of Theorem 1.2

Proof

We will assume for convenience that \(\varSigma \) is invertible; the general case requires minor modifications. For each i, define

$$\begin{aligned} \eta _i:= & {} Y_i - X_i^T{\beta }_{\min } \text{ and } \end{aligned}$$
(31)
$$\begin{aligned} Z_i:= & {} \eta _i\,\varSigma ^{-1/2}X_i = (Y_i-X_i^T{\beta }_{\min })\,\varSigma ^{-1/2}X_i, \end{aligned}$$
(32)

We note for later use that the \(Z_i\) are independent copies of the vector Z in the statement of Theorem 1.2.

The assumptions on X of Theorem 1.2 imply those of Theorem 1.1 with \(\varepsilon \) replaced by \(\varepsilon /10\) and \(\delta \) replaced by \(\delta /2\) (at least when C is a large enough constant). This implies that the event

$$\begin{aligned} \mathsf{Lower}:= \left\{ \forall v\in \mathbb {R}^p\,:\, v^T\widehat{\varSigma }_{n}\,v\ge \left( 1 - \varepsilon /10\right) \,v^T\varSigma v\right\} \end{aligned}$$
(33)

satisfies \(\Pr (\mathsf{Lower})\ge 1-\delta /2\) whenever the condition on n in Theorem 1.2 is satisfied. When \(\mathsf{Lower}\) holds, \(\widehat{\varSigma }_{n}\) is also invertible, so

$$\begin{aligned} {\beta }_{\min }= \widehat{\varSigma }_{n}^{-1}\,\widehat{\varSigma }_{n}\,{\beta }_{\min }= \widehat{\varSigma }_{n}^{-1}\,\frac{1}{n}\sum _{i=1}^n\,({\beta }_{\min }^TX_i)\,X_i. \end{aligned}$$

Comparing this with the definition of \(\widehat{\beta }_n\) in (30), we see that:

$$\begin{aligned} \widehat{\beta }_n-{\beta }_{\min }= \widehat{\varSigma }_{n}^{-1}\,\frac{1}{n}\sum _{i=1}^n\,(Y_i - {\beta }_{\min }^TX_i)\,X_i = \widehat{\varSigma }_{n}^{-1}\,\varSigma ^{1/2}\left( \frac{1}{n}\sum _{i=1}^nZ_i\right) . \end{aligned}$$

Therefore, when \(\mathsf{Lower}\) holds, the excess loss \(\ell (\widehat{\beta }_n)-\ell ({\beta }_{\min })\) satisfies

$$\begin{aligned} |\varSigma ^{1/2}(\widehat{\beta }_n-{\beta }_{\min })|_ 2^2 = \left| (\varSigma ^{1/2}\widehat{\varSigma }_{n}^{-1}\,\varSigma ^{1/2})\,\left( \frac{1}{n}\sum _{i=1}^nZ_i\right) \right| _2^2\le \frac{\left| \frac{1}{n}\sum _{i=1}^nZ_i\right| _2^2}{(1-\varepsilon /10)^2}, \end{aligned}$$
(34)

since Lower and (9) imply that the \(2\rightarrow 2\) operator norm of \(\varSigma ^{1/2}\widehat{\varSigma }_{n}^{-1}\,\varSigma ^{1/2}\) is at most \(1/(1-\varepsilon /10)\).

What we have discussed so far shows that (34) holds with probability \(\ge 1-\delta /2\). We now show that

$$\begin{aligned} \Pr \left( {\left| \frac{1}{n}\sum _{i=1}^nZ_i\right| _2^2\ge \frac{(1+\varepsilon /5)^2\,\sigma ^2}{n}\,\left( \sqrt{p}+ C\sqrt{\ln (4/\delta )}\right) ^2}\right) \le \frac{\delta }{3} \end{aligned}$$
(35)

noting that this finishes the proof with some room to spare regarding the dependency on \(\varepsilon \). To do this, we use the Fuk-Nagaev-type inequality by Einmahl and Li in [7, Theorem 4]. In the Euclidean setting, that result implies that if \(U_1,\dots ,U_n\) are i.i.d. mean zero random vectors with \(\varLambda _n:=n\lambda _{\max }(\mathbb {E}\left[ U_1U_1^T\right] )\), then for any \(q>2\), \(\alpha ,\phi \in (0,1)\) one can find \(D>0\) such that, for any \(t>0\),

$$\begin{aligned} \Pr \left( {\left| \sum _{i=1}^nU_i\right| _2\ge (1+\phi )\,\mathbb {E}\left[ \left| \sum _{i=1}^nU_i\right| _2\right] + t}\right) \le e^{-\frac{t^2}{(2+\alpha )\,\varLambda _n}}+ D\,n\,\frac{\mathbb {E}\left[ |U_1|^q_2\right] }{t^{q}}. \end{aligned}$$

To obtain (35), we apply the previous display with \(U_i=Z_i/n\), \(\phi =\varepsilon /10\), \(\alpha =1\), and

$$\begin{aligned} t:=\sigma \,\sqrt{\frac{(\varepsilon \,p/100)\vee 3\ln (4/\delta )}{n}}. \end{aligned}$$

We observe that

$$\begin{aligned} \mathbb {E}\left[ \left| \sum _{i=1}^nU_i\right| _2\right] \le \sqrt{\frac{1}{n}\,\mathbb {E}\left[ \left| Z_1\right| _2^2\right] }\le \sqrt{\sigma ^2\frac{p}{n}}, \end{aligned}$$

\(\varLambda _n\le \sigma ^2/n\) and \(\mathbb {E}\left[ |U_1|^q_2/t^q\right] \le (100\mathsf{h}^2_*)^{q/2}/(\varepsilon \,n)^{q/2}\) under our assumptions. We deduce

$$\begin{aligned} \Pr \left( {\left| \sum _{i=1}^n\frac{Z_i}{n}\right| _2\ge \left( 1+\frac{\varepsilon }{5}\right) \frac{\sigma }{\sqrt{n}}\,\left( \sqrt{p} + \sqrt{3\ln (4/\delta )}\right) }\right) \le \frac{\delta }{4} + D\,\frac{\mathsf{h}_*^{q}}{\varepsilon ^{q/2}\,n^{q/2-1}}. \end{aligned}$$

This clearly implies (35) after suitably adjusting the constants, at least in the desired range \(\delta \ge C/n^{q/2-1}\). \(\square \)

6 Final remarks

  • The PAC-Bayesian method used in this paper seems an efficient alternative to chaining and other typical empirical process methods. As such, it would be interesting to find other applications of it. One interesting question is whether some variant of the method can be used to prove two-sided concentration of \(\widehat{\varSigma }_{n}\).

  • Consider the setting of Theorem 4.1. Let \(\mathbb {R}^p_s\) denote the set of all \(v\in \mathbb {R}^p\) that are s-sparse, i.e. have at most s nonzero coordinates. \(\mathbb {R}^p_s\) is a union of \(\left( {\begin{array}{c}p\\ s\end{array}}\right) \le (ep/s)^s\) s-dimensional spaces, so if

    $$\begin{aligned} n\ge 100\mathsf{h}^2\,\frac{s + s\ln (ep/s) + 2\ln (2/\delta )}{\varepsilon ^2} \end{aligned}$$

    one may apply Theorem 4.1 to these subspaces and deduce that \(v^T\widehat{\varSigma }_{n}\,v\ge (1-\varepsilon )\,v^T\varSigma \,v\) for all \(v\in \mathbb {R}^p_s\), with probability \(\ge 1-\delta \) (a small exhaustive check of this sparse bound is sketched below). In a companion paper [16] we use this result to prove that \(\widehat{\varSigma }_{n}\) satisfies restricted eigenvalue properties when \(p\gg n\); such properties are important in Compressed Sensing and High Dimensional Statistics.
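
For small p and s the sparse lower bound above can be checked exhaustively, because the minimum of \(v^T\widehat{\varSigma }_{n}\,v\) over s-sparse unit vectors equals the minimum, over all supports S of size s, of the smallest eigenvalue of the \(s\times s\) submatrix of \(\widehat{\varSigma }_{n}\) indexed by S. A small sketch in the regime \(p>n\) (our own illustration; all parameters are arbitrary):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(10)
p, s, n, df = 60, 2, 40, 8                     # note p > n, so Sigma_hat is singular

X = rng.standard_t(df, size=(n, p)) / np.sqrt(df / (df - 2))   # Sigma = I, heavy-tailed design
Sigma_hat = X.T @ X / n

# min over s-sparse unit v of v' Sigma_hat v = min over supports S of lambda_min(Sigma_hat[S, S]).
sparse_min = min(
    np.linalg.eigvalsh(Sigma_hat[np.ix_(S, S)]).min()
    for S in combinations(range(p), s)
)
full_min = np.linalg.eigvalsh(Sigma_hat).min() # ~ 0, since rank(Sigma_hat) <= n < p
print(f"s-sparse minimum: {sparse_min:.3f}   unrestricted minimum: {full_min:.3f}")
```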