Random Design Analysis of Ridge Regression

D. Hsu, S. M. Kakade, and T. Zhang

Foundations of Computational Mathematics, Volume 14, Issue 3, pp. 569–600 (2014). doi:10.1007/s10208-014-9192-1
Abstract

This work gives a simultaneous analysis of both the ordinary least squares estimator and the ridge regression estimator in the random design setting under mild assumptions on the covariate/response distributions. In particular, the analysis provides sharp results on the “out-of-sample” prediction error, as opposed to the “in-sample” (fixed design) error. The analysis also reveals the effect of errors in the estimated covariance structure, as well as the effect of modeling errors, neither of which effects are present in the fixed design setting. The proofs of the main results are based on a simple decomposition lemma combined with concentration inequalities for random vectors and matrices.

Keywords

Linear regression · Ordinary least squares · Ridge regression · Randomized approximation

Mathematics Subject Classification

Primary 62J07 · Secondary 62J05

1 Introduction

In the random design setting for linear regression, we are provided with samples of covariates and responses, \((x_1,y_1),(x_2,y_2),\ldots ,(x_n,y_n)\), which are sampled independently from a population, where the \(x_i\) are random vectors and the \(y_i\) are random variables. Typically, these pairs are hypothesized to have the linear relationship
$$\begin{aligned} y_i = \langle \beta ,x_i \rangle + \epsilon _i \end{aligned}$$
for some linear function \(\beta \) (though this hypothesis need not be true). Here, the \(\epsilon _i\) are error terms, typically assumed to be normally distributed as \(\mathcal {N}(0,\sigma ^2)\). The goal of estimation in this setting is to find coefficients \(\hat{\beta }\) based on these \((x_i,y_i)\) pairs such that the expected prediction error on a new draw \((x,y)\) from the population, measured as \(\mathbb {E}[(\langle \hat{\beta },x \rangle - y)^2]\), is as small as possible. This goal can also be interpreted as estimating \(\beta \) with accuracy measured under a particular norm.

The random design setting stands in contrast to the fixed design setting, where the covariates \(x_1,x_2,\cdots ,x_n\) are fixed (i.e., deterministic), and only the responses \(y_1,y_2,\cdots ,y_n\) are treated as random. Thus, the covariance structure of the design points is completely known and need not be estimated, which simplifies the analysis of standard estimators. However, the fixed design setting does not directly address out-of-sample prediction, which is of primary concern in many applications; for instance, in prediction problems, the estimator \(\hat{\beta }\) is computed from an initial sample from the population, and the end goal is to use \(\hat{\beta }\) as a predictor of \(y\) given \(x\), where \((x,y)\) is a new draw from the population. A fixed design analysis only assesses the accuracy of \(\hat{\beta }\) on data already seen, while a random design analysis is concerned with the predictive performance on unseen data.

This work gives a detailed analysis of both the ordinary least squares and ridge estimators [9] in the random design setting that quantifies the essential differences between random and fixed design. In particular, the analysis reveals, through a simple decomposition:
  • The effect of errors in the estimated covariance structure.

  • The effect of approximating the true regression function by a linear function in the case that the model is misspecified.

  • The effect of errors due to noise in the response.

Neither of the first two effects is present in the fixed design analysis of ridge regression, and the random design analysis shows that the effect of errors in the estimated covariance structure is minimal—essentially a second-order effect as soon as the sample size is large enough. The analysis also isolates the effect of approximation error in the main terms of the estimation error bound so that the bound reduces to one that scales only with the noise variance when the approximation error vanishes.

Another important feature of the analysis that distinguishes it from that of previous work is that it applies to the ridge estimator with an arbitrary setting of \(\lambda \ge 0\). The estimation error is given in terms of the spectrum of the second moment of \(x\) and the particular choice of \(\lambda \)—the dimension of the covariate space does not enter explicitly except when \(\lambda =0\). When \(\lambda = 0\), we immediately obtain an analysis of ordinary least squares; we are not aware of any other random design analysis of the ridge estimator with this characteristic. More generally, the convergence rate can be optimized by appropriately setting \(\lambda \) based on assumptions about the spectrum.

Finally, while our analysis is based on an operator-theoretical approach similar to that of [19] and [4], it relies on probabilistic tail inequalities in a modular way that gives explicit dependencies without additional boundedness assumptions other than those assumed by the probabilistic bounds.

1.1 Outline

Section 2 discusses the model, preliminaries, and related work. Section 3 presents the main results on the excess mean squared error of the ordinary least squares and ridge estimators under random design and discusses the relationship to the standard fixed design analysis. Section 4 discusses an application to accelerating least squares computations on large data sets. The proofs of the main results are given in Sect. 5.

2 Preliminaries

2.1 Notation

Unless otherwise specified, all vectors in this work are assumed to live in a finite-dimensional inner product space with inner product \(\langle \cdot ,\cdot \rangle \). The restriction to finite dimensions is due to the probabilistic bounds used in the proofs; the main results of this work can be extended to (possibly infinite-dimensional) separable Hilbert spaces under mild assumptions by using suitable infinite-dimensional generalizations of these probabilistic bounds. We denote the dimensionality of this space by \(d\), but stress that our results will not explicitly depend on \(d\) except when considering the special case of \(\lambda =0\). Let \(\Vert \cdot \Vert _M\) for a self-adjoint positive definite linear operator \(M \succ 0\) denote the vector norm given by \(\Vert v\Vert _M := \sqrt{\langle v,Mv \rangle }\). When \(M\) is omitted, it is assumed to be the identity \(I\), so \(\Vert v\Vert = \sqrt{\langle v,v \rangle }\). Let \(u \otimes u\) denote the outer product of a vector \(u\), which acts as the rank-one linear operator \(v \mapsto (u \otimes u)v = \langle v,u \rangle u\). For a linear operator \(M\), let \(\Vert M\Vert \) denote its spectral (operator) norm, i.e., \(\Vert M\Vert = \sup _{v \ne 0} \Vert Mv\Vert / \Vert v\Vert \), and let \(\Vert M\Vert _{\mathrm{F }}\) denote its Frobenius norm, i.e., \(\Vert M\Vert _{\mathrm{F }}= \sqrt{{{\mathrm{tr}}}(M^* M)}\). If \(M\) is self-adjoint, \(\Vert M\Vert _{\mathrm{F }}= \sqrt{{{\mathrm{tr}}}(M^2)}\). Let \(\lambda _{\max }[M]\) and \(\lambda _{\min }[M]\), respectively, denote the largest and smallest eigenvalue of a self-adjoint linear operator \(M\).

2.2 Linear Regression

Let \(x\) be a random vector, and let \(y\) be a random variable. Throughout, it is assumed that \(x\) and \(y\) have finite second moments (\(\mathbb {E}[\Vert x\Vert ^2] < \infty \) and \(\mathbb {E}[y^2] < \infty \)). Let \(\{v_j\}\) be the eigenvectors of
$$\begin{aligned} \varSigma := \mathbb {E}[x \otimes x] , \end{aligned}$$
(1)
so that they form an orthonormal basis. The corresponding eigenvalues are
$$\begin{aligned} \lambda _j := \langle v_j,\varSigma v_j \rangle = \mathbb {E}[\langle v_j,x \rangle ^2]. \end{aligned}$$
We assume, without loss of generality, that all eigenvalues \(\lambda _j\) are strictly positive, since otherwise we may restrict attention to the subspace on which the assumption holds. Let \(\beta \) achieve the minimum mean squared error over all linear functions, i.e.,
$$\begin{aligned} \mathbb {E}[(\langle \beta ,x \rangle - y)^2] = \min _w \left\{ \mathbb {E}[(\langle w,x \rangle - y)^2] \right\} , \end{aligned}$$
so that
$$\begin{aligned} \beta := \sum _j \beta _j v_j, \quad \text {where} \quad \beta _j := \frac{\mathbb {E}[\langle v_j,x \rangle y]}{\mathbb {E}[\langle v_j,x \rangle ^2]}. \end{aligned}$$
(2)
We also have that the excess mean squared error of \(w\) over the minimum is
$$\begin{aligned} \mathbb {E}[(\langle w,x \rangle -y)^2] - \mathbb {E}[(\langle \beta ,x \rangle -y)^2] = \Vert w-\beta \Vert _\varSigma ^2 \end{aligned}$$
(see Proposition 5).

2.3 The Ridge and Ordinary Least Squares Estimators

Let \((x_1,y_1), (x_2,y_2), \cdots , (x_n,y_n)\) be independent copies of \((x,y)\), and let \(\widehat{\mathbb {E}}\) denote the empirical expectation with respect to these \(n\) copies, i.e.,
$$\begin{aligned} \widehat{\mathbb {E}}[f] := \frac{1}{n} \sum _{i=1}^n f(x_i,y_i), \quad \quad \widehat{\varSigma }:= \widehat{\mathbb {E}}[x \otimes x] = \frac{1}{n} \sum _{i=1}^n x_i \otimes x_i. \end{aligned}$$
(3)
Let \(\hat{\beta }_{\lambda }\) denote the ridge estimator with parameter \(\lambda \ge 0\), defined as the minimizer of the \(\lambda \)-regularized empirical mean squared error, i.e.,
$$\begin{aligned} \hat{\beta }_{\lambda }:= \arg \min _w \left\{ \widehat{\mathbb {E}}[(\langle w,x \rangle - y)^2] + \lambda \Vert w\Vert ^2 \right\} . \end{aligned}$$
(4)
The special case with \(\lambda = 0\) is the ordinary least squares estimator, which minimizes the empirical mean squared error. These estimators are uniquely defined if and only if \(\widehat{\varSigma }+ \lambda I \succ 0\) (a sufficient condition is \(\lambda > 0\)), in which case
$$\begin{aligned} \hat{\beta }_{\lambda }= (\widehat{\varSigma }+ \lambda I)^{-1} \widehat{\mathbb {E}}[xy]. \end{aligned}$$
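For concreteness, the closed-form expression above translates directly into code. The following is a minimal sketch (in Python/NumPy; the function and variable names are ours, not from the paper) of computing \(\hat{\beta }_{\lambda }\) from a sample:

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Closed-form ridge estimator (Sigma_hat + lam*I)^{-1} E_hat[x y], where
    Sigma_hat = (1/n) sum_i x_i x_i^T and E_hat[x y] = (1/n) sum_i x_i y_i.
    Setting lam = 0 gives ordinary least squares (when Sigma_hat is invertible)."""
    n, d = X.shape
    Sigma_hat = X.T @ X / n        # empirical second moment operator, as in (3)
    xy_hat = X.T @ y / n           # empirical cross moment E_hat[x y]
    return np.linalg.solve(Sigma_hat + lam * np.eye(d), xy_hat)
```

Solving the linear system directly mirrors the expression \((\widehat{\varSigma }+ \lambda I)^{-1} \widehat{\mathbb {E}}[xy]\) without forming an explicit inverse.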

2.4 Data Model

We now specify the conditions on the random pair \((x,y)\) under which the analysis applies.

2.4.1 Covariate Model

We first define the following effective dimensions of the covariate \(x\) based on the second moment operator \(\varSigma \) and the regularization level \(\lambda \):
$$\begin{aligned} d_{p,\lambda } := \sum _j \left( \frac{\lambda _j}{\lambda _j + \lambda } \right) ^p , \quad p \in \{1,2\}. \end{aligned}$$
(5)
It will become apparent in the analysis that these dimensions govern the sample size needed to ensure that \(\varSigma \) is estimated with sufficient accuracy. For technical reasons, we also use the quantity
$$\begin{aligned} \tilde{d}_{1,\lambda }:= \max \{ d_{1,\lambda }, 1 \} \end{aligned}$$
(6)
merely to simplify certain probability tail inequalities in the main result in the peculiar case that \(\lambda \rightarrow \infty \) (upon which \(d_{1,\lambda }\rightarrow 0\)). We remark that \(d_{2,\lambda }\) naturally arises in the standard fixed design analysis of ridge regression (see Proposition 1), and that \(d_{1,\lambda }\) was also used by [23] and [4] in their random design analyses of (kernel) ridge regression. It is easy to see that \(d_{2,\lambda }\le d_{1,\lambda }\), and that \(d_{p,\lambda }\) is at most the dimension \(d\) of the inner product space (with equality iff \(\lambda = 0\)).
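As an illustration, the effective dimensions in (5) are simple functions of the eigenvalues of \(\varSigma \); the following is a minimal sketch of our own (the eigenvalues are assumed known, e.g., in a simulation or after estimating \(\varSigma \)):

```python
import numpy as np

def effective_dims(eigvals, lam):
    """d_{1,lam} and d_{2,lam} from (5), given the eigenvalues lambda_j of Sigma
    and the regularization level lam >= 0."""
    ratios = eigvals / (eigvals + lam)       # lambda_j / (lambda_j + lam)
    return ratios.sum(), (ratios ** 2).sum()

# Example with a geometrically decaying spectrum: d_{2,lam} <= d_{1,lam} <= d,
# with equality to d iff lam = 0.
d1, d2 = effective_dims(2.0 ** -np.arange(10), lam=1e-2)
```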
Our main condition requires that the squared length of \((\varSigma + \lambda I)^{-1/2} x\) is never more than a constant factor greater than its expectation (hence the name bounded statistical leverage). The linear mapping \(x \mapsto (\varSigma + \lambda I)^{-1/2} x\) is sometimes called whitening when \(\lambda = 0\). The reason for considering \(\lambda > 0\), in which case we call the mapping \(\lambda \)-whitening, is that the expectation \(\mathbb {E}[\Vert (\varSigma + \lambda I)^{-1/2}x\Vert ^2]\) may only be small for sufficiently large \(\lambda \), as
$$\begin{aligned} \mathbb {E}[\Vert (\varSigma + \lambda I)^{-1/2} x\Vert ^2] = {{\mathrm{tr}}}((\varSigma + \lambda I)^{-1/2} \varSigma (\varSigma + \lambda I)^{-1/2}) = \sum _j \frac{\lambda _j}{\lambda _j + \lambda } = d_{1,\lambda }. \end{aligned}$$

Condition 1

(Bounded statistical leverage at \(\lambda \))

There exists finite \(\rho _{\lambda }\ge 1\) such that, almost surely,
$$\begin{aligned} \frac{\Vert (\varSigma + \lambda I)^{-1/2} x\Vert }{\sqrt{\mathbb {E}[\Vert (\varSigma + \lambda I)^{-1/2} x\Vert ^2]}} = \frac{\Vert (\varSigma + \lambda I)^{-1/2} x\Vert }{\sqrt{d_{1,\lambda }}} \le \rho _{\lambda }. \end{aligned}$$

The hard “almost sure” bound in Condition 1 may be relaxed to moment conditions simply by using different probability tail inequalities in the analysis. We do not consider this relaxation for sake of simplicity. We also remark that it is possible to replace Condition 1 with a sub-Gaussian condition (specifically, a requirement that every projection of \((\varSigma + \lambda I)^{-1/2} x\) be sub-Gaussian), which can lead to a sharper deviation bound in certain cases.
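To make Condition 1 concrete, the ratio on its left-hand side can be evaluated on a sample when \(\varSigma \) is known (e.g., in a simulation); a minimal sketch, with names of our choosing:

```python
import numpy as np

def leverage_ratios(X, Sigma, lam):
    """For each row x_i of X, compute ||(Sigma + lam*I)^{-1/2} x_i|| / sqrt(d_{1,lam});
    Condition 1 asks for an almost-sure upper bound rho_lam >= 1 on this ratio."""
    evals, evecs = np.linalg.eigh(Sigma)
    d1 = np.sum(evals / (evals + lam))                      # d_{1,lam} as in (5)
    W = evecs @ np.diag((evals + lam) ** -0.5) @ evecs.T    # (Sigma + lam*I)^{-1/2} (lam-whitening)
    return np.linalg.norm(X @ W, axis=1) / np.sqrt(d1)

# The maximum over a large sample estimates the smallest admissible rho_lam.
```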

Remark 1

(Ordinary least squares) If \(\lambda = 0\), then Condition 1 reduces to the requirement that there exists a finite \(\rho _0\ge 1\) such that, almost surely,
$$\begin{aligned} \frac{\Vert \varSigma ^{-1/2} x\Vert }{\sqrt{\mathbb {E}[\Vert \varSigma ^{-1/2} x\Vert ^2]}} = \frac{\Vert \varSigma ^{-1/2} x\Vert }{\sqrt{d}} \le \rho _0. \end{aligned}$$

Remark 2

(Bounded covariates) If \(\Vert x\Vert \le r\) almost surely, then
$$\begin{aligned} \frac{\Vert (\varSigma + \lambda I)^{-1/2}x\Vert }{\sqrt{d_{1,\lambda }}} \le \frac{r}{\sqrt{(\inf \{\lambda _j\} + \lambda )d_{1,\lambda }}}, \end{aligned}$$
in which case Condition 1 holds with \(\rho _{\lambda }\) satisfying
$$\begin{aligned} \rho _{\lambda }\le \frac{r}{\sqrt{\lambda d_{1,\lambda }}}. \end{aligned}$$

2.4.2 Response Model

The response model considered in this work is a relaxation of the typical Gaussian model; the model specifically allows for approximation error and general sub-Gaussian noise. Define the random variables
$$\begin{aligned} \mathrm{noise }(x) := y - \mathbb {E}[y|x] \quad \text {and} \quad \mathrm{approx }(x) := \mathbb {E}[y|x] - \langle \beta ,x \rangle , \end{aligned}$$
(7)
where \(\mathrm{noise }(x)\) corresponds to the response noise, and \(\mathrm{approx }(x)\) corresponds to the approximation error of \(\beta \). This gives the following modeling equation:
$$\begin{aligned} y = \langle \beta ,x \rangle + \mathrm{approx }(x) + \mathrm{noise }(x). \end{aligned}$$
Conditioned on \(x\), \(\mathrm{noise }(x)\) is random, while \(\mathrm{approx }(x)\) is deterministic.

The noise is assumed to satisfy the following sub-Gaussian moment condition.

Condition 2

(Sub-Gaussian noise) There exists finite \(\sigma \ge 0\) such that, almost surely,
$$\begin{aligned} \mathbb {E}\left[ \exp (\eta \mathrm{noise }(x)) | x\right] \le \exp (\eta ^2 \sigma ^2/2) \qquad \forall \eta \in \mathbb {R}. \end{aligned}$$

Condition 2 is satisfied, for instance, if \(\mathrm{noise }(x)\) is normally distributed with mean zero and variance \(\sigma ^2\).

For the next condition, define \(\beta _{\lambda }\) to be the minimizer of the regularized mean squared error, i.e.,
$$\begin{aligned} \beta _{\lambda }:= \arg \min _w \left\{ \mathbb {E}[(\langle w,x \rangle - y)^2] + \lambda \Vert w\Vert ^2 \right\} = (\varSigma + \lambda I)^{-1} \mathbb {E}[xy] , \end{aligned}$$
(8)
and also define
$$\begin{aligned} \mathrm{approx }_{\lambda }(x) := \mathbb {E}[y|x] - \langle \beta _{\lambda },x \rangle . \end{aligned}$$
(9)
The final condition requires a bound on the size of \(\mathrm{approx }_{\lambda }(x)\).

Condition 3

(Bounded approximation error at \(\lambda \)) There exists finite \(b_{\lambda }\ge 0\) such that, almost surely,
$$\begin{aligned} \frac{\Vert (\varSigma + \lambda I)^{-1/2} x \mathrm{approx }_{\lambda }(x)\Vert }{\sqrt{\mathbb {E}[\Vert (\varSigma + \lambda I)^{-1/2} x\Vert ^2]}} = \frac{\Vert (\varSigma + \lambda I)^{-1/2} x \mathrm{approx }_{\lambda }(x)\Vert }{\sqrt{d_{1,\lambda }}} \le b_{\lambda }. \end{aligned}$$

The hard “almost sure” bound in Condition 3 can easily be relaxed to moment conditions, but we do not consider it here for sake of simplicity. We also remark that \(b_{\lambda }\) only appears in lower-order terms in the main bounds.

Remark 3

(Ordinary least squares) If \(\lambda = 0\) and the dimension of the covariate space is \(d\), then Condition 3 reduces to the requirement that there exists a finite \(b_0 \ge 0\) such that, almost surely,
$$\begin{aligned} \frac{\Vert \varSigma ^{-1/2} x \mathrm{approx }(x)\Vert }{\sqrt{\mathbb {E}[\Vert \varSigma ^{-1/2} x\Vert ^2]}} = \frac{\Vert \varSigma ^{-1/2} x \mathrm{approx }(x)\Vert }{\sqrt{d}} \le b_0. \end{aligned}$$

Remark 4

(Bounded approximation error) If \(|\mathrm{approx }(x)| \le a\) almost surely and Condition 1 (with parameter \(\rho _{\lambda }\)) holds, then
$$\begin{aligned} \frac{\Vert (\varSigma + \lambda I)^{-1/2} x \mathrm{approx }_{\lambda }(x)\Vert }{\sqrt{d_{1,\lambda }}}&\le \rho _{\lambda }|\mathrm{approx }_{\lambda }(x)| \\&\le \rho _{\lambda }(a + |\langle \beta -\beta _{\lambda },x \rangle |) \\&\le \rho _{\lambda }(a + \Vert \beta -\beta _{\lambda }\Vert _{\varSigma +\lambda I} \Vert x\Vert _{(\varSigma + \lambda I)^{-1}}) \\&\le \rho _{\lambda }(a + \rho _{\lambda }\sqrt{d_{1,\lambda }}\Vert \beta -\beta _{\lambda }\Vert _{\varSigma +\lambda I}), \end{aligned}$$
where the first and last inequalities use Condition 1, the second inequality uses the definition of \(\mathrm{approx }_{\lambda }(x)\) in (9) and the triangle inequality, and the third inequality follows from Cauchy–Schwarz. The quantity \(\Vert \beta -\beta _{\lambda }\Vert _{\varSigma +\lambda I}\) can be bounded by \(\sqrt{\lambda } \Vert \beta \Vert \) using the arguments in the proof of Proposition 7. In this case, Condition 3 is satisfied with
$$\begin{aligned} b_{\lambda }\le \rho _{\lambda }(a + \rho _{\lambda }\sqrt{\lambda d_{1,\lambda }} \Vert \beta \Vert ). \end{aligned}$$
If in addition \(\Vert x\Vert \!\le \! r\) almost surely, then Condition 1 and Condition 3 are satisfied with
$$\begin{aligned} \rho _{\lambda }\le \frac{r}{\sqrt{\lambda d_{1,\lambda }}} \quad \text {and} \quad b_{\lambda }\le \rho _{\lambda }(a + r \Vert \beta \Vert ) \end{aligned}$$
as per Remark 2.

2.5 Related Work

The ridge and ordinary least squares estimators are classically studied in the fixed design setting: the covariates \(x_1, x_2, \cdots , x_n\) are fixed vectors in \(\mathbb {R}^d\), and the responses \(y_1, y_2, \cdots , y_n\) are independent random variables, each with mean \(\mathbb {E}[y_i] = \langle \beta ,x_i \rangle \) and variance \({{\mathrm{var}}}(y_i) \le \sigma ^2\) [16]. The analysis reviewed in Sect. 3.1 reveals that the expected prediction error \(\mathbb {E}[\Vert \hat{\beta }_{\lambda }- \beta \Vert _\varSigma ^2]\) is controlled by the sum of a bias term, which is zero when \(\lambda = 0\), and a variance term, which is bounded by \(\sigma ^2d_{2,\lambda }/n\). As discussed in Sect. 1, our random design analysis of the ridge estimator reveals the essential differences between fixed and random design by comparing with this classical analysis.

Many classical analyses of the ridge and ordinary least squares estimators in the random design setting (e.g., in the context of nonparametric estimators) do not actually show nonasymptotic \(O(d/n)\) convergence of the mean squared error to that of the best linear predictor, where \(d\) is the dimension of the covariate space. Rather, the error relative to the Bayes error is bounded by some multiple \(c > 1\) of the error of the optimal linear predictor relative to the Bayes error, plus a \(O(d/n)\) term [8]:
$$\begin{aligned} \mathbb {E}[(\langle \hat{\beta },x \rangle -\mathbb {E}[y|x])^2] \le c \cdot \mathbb {E}[(\langle \beta ,x \rangle -\mathbb {E}[y|x])^2] + O(d/n). \end{aligned}$$
Such bounds are appropriate in nonparametric settings where the error of the optimal linear predictor also approaches the Bayes error at an \(O(d/n)\) rate. Beyond these classical results, analyses of ordinary least squares often come with nonstandard restrictions on applicability or additional dependencies on the spectrum of the second moment operator (see the recent work of [2] for a comprehensive survey of these results). For instance, a result of [5] gives a bound on the excess mean squared error of the form
$$\begin{aligned} \Vert \hat{\beta }- \beta \Vert _\varSigma ^2 \le O\left( \frac{d + \log (\det (\widehat{\varSigma })/\det (\varSigma ))}{n} \right) , \end{aligned}$$
but the bound is only shown to hold when every linear predictor with low empirical mean squared error satisfies certain boundedness conditions.
This work provides ridge regression bounds explicitly in terms of the vector \(\beta \) (as a sequence) and in terms of the eigenspectrum of the second moment operator \(\varSigma \). While the essential setting we study is not new, previous analyses make unnecessarily strong boundedness assumptions or fail to give a bound in the case \(\lambda = 0\). Here we review the analyses of [4, 19, 20, 23]. [23] assumes \(\Vert x\Vert \le b_x\) and \(|\langle \beta ,x \rangle -y| \le b_{\mathrm{approx }}\) almost surely, and gives the bound
$$\begin{aligned} \Vert \hat{\beta }_{\lambda }-\beta \Vert _\varSigma ^2 \le \lambda \Vert \hat{\beta }_{\lambda }-\beta \Vert ^2 + c \cdot \frac{d_{1,\lambda }\cdot (b_{\mathrm{approx }} + b_x \Vert \hat{\beta }_{\lambda }-\beta \Vert )^2}{n} \end{aligned}$$
for some \(c>0\), where \(d_{1,\lambda }\) is the effective dimension at scale \(\lambda \) as defined in (5). The quantity \(\Vert \hat{\beta }_{\lambda }-\beta \Vert \) is then bounded by assuming \(\Vert \beta \Vert < \infty \). Thus, the dominant terms of the final bound have explicit dependences on \(b_{\mathrm{approx }}\) and \(b_x\). [19] assume that \(|y| \le b_y\) and \(\Vert x\Vert \le b_x\) almost surely, and prove the bound
$$\begin{aligned} \Vert \hat{\beta }_{\lambda }- \beta _{\lambda }\Vert _\varSigma ^2 \le c' \cdot \frac{b_x^2b_y^2}{n\lambda ^2} \end{aligned}$$
for some \(c'>0\) (and note that the bound becomes trivial when \(\lambda = 0\)); this is then used to bound \(\Vert \hat{\beta }_{\lambda }-\beta \Vert _\varSigma ^2\) under explicit assumptions on \(\beta \). [4] assume \(\Vert x\Vert \le b_x\) almost surely, and prove the bound (in their Theorem 4)
$$\begin{aligned} \Vert \hat{\beta }_{\lambda }-\beta \Vert _\varSigma ^2 \le c'' \cdot \biggl ( \Vert \beta _{\lambda }-\beta \Vert _\varSigma ^2 + \frac{b_x \Vert \beta _{\lambda }-\beta \Vert _\varSigma ^2}{n\lambda } + \frac{\sigma ^2 d_{1,\lambda }}{n} + o(1/n) \biggr ). \end{aligned}$$
Here, we also note that, if one desires the bound to hold with probability \(\ge 1-\mathrm {e}^{-t}\) for some \(t>0\), then the leading factor \(c^{\prime \prime }>1\) depends quadratically on \(t\). Finally, [20] explicitly require \(|y| \le b_y\), and their main bound on \(\Vert \hat{\beta }_{\lambda }-\beta \Vert _\varSigma ^2\) (specialized for the ridge estimator) depends on \(b_y\) in a dominant term. Moreover, this main bound contains \(c^{\prime \prime \prime } \cdot ( \lambda \Vert \beta _{\lambda }\Vert ^2 + \Vert \beta _{\lambda }-\beta \Vert _\varSigma ^2 )\) as a dominant term for some \(c^{\prime \prime \prime }>1\), and it is only given under explicit decay conditions on the eigenspectrum (their Eq. 6). The bound is also trivial when \(\lambda = 0\). Our result for ridge regression is given explicitly in terms of \(\Vert \beta _{\lambda }-\beta \Vert _\varSigma ^2\) (and therefore explicitly in terms of \(\beta \) as a sequence, the eigenspectrum of \(\varSigma \), and \(\lambda \)); this quantity vanishes when \(\lambda =0\) and can be small even when \(\Vert \beta \Vert \) itself is large. We note that \(\Vert \beta _{\lambda }-\beta \Vert _\varSigma ^2\) is precisely the bias term from the classical fixed design analysis of ridge regression, and therefore is natural to expect in a random design analysis.

Recently, [3] derived sharp risk bounds for the ordinary least squares and ridge estimators (in addition to specially developed PAC-Bayesian estimators) in a random design setting under very mild moment assumptions using PAC-Bayesian techniques. Their nonasymptotic bound for ordinary least squares holds with probability at least \(1-\mathrm {e}^{-t}\) but only for \(t \le \ln n\); this is essentially due to their weak moment assumptions. By relying on stronger moment assumptions, we allow the probability tail parameter \(t\) to be as large as \(\Omega (n/d)\). Our analysis is also arguably more transparent and yields more reasonable quantitative bounds. The analysis of [3] for the ridge estimator is established only in an asymptotic sense and therefore is not directly comparable to those provided here.

Finally, although the focus of our present work is on understanding the ordinary least squares and ridge estimators, it should also be mentioned that a number of other estimators have been considered in the literature with nonasymptotic prediction error bounds [3, 13, 14]. Indeed, the works of [3] and [13] propose estimators that require considerably weaker moment conditions on \(x\) and \(y\) to obtain optimal rates.

3 Random Design Regression

This section presents the main results of the paper on the excess mean squared error of the ridge estimator under random design (and its specialization to the ordinary least squares estimator). First, we review the standard fixed design analysis.

3.1 Review of Fixed Design Analysis

It is informative to first review the fixed design analysis of the ridge estimator. Recall that, in this setting, the design points \(x_1,x_2,\cdots ,x_n\) are fixed (deterministic) vectors, and the responses \(y_1,y_2,\cdots ,y_n\) are independent random variables. Therefore, we define \(\varSigma := \widehat{\varSigma }= n^{-1} \sum _{i=1}^n x_i \otimes x_i\) (which is nonrandom), and assume it has eigenvectors \(\{v_j\}\) and corresponding eigenvalues \(\lambda _j := \langle v_j,\varSigma v_j \rangle \). As in the random design setting, the linear function \(\beta := \sum _j \beta _j v_j\) where \(\beta _j := (n \lambda _j)^{-1} \sum _{i=1}^n \langle v_j,x_i \rangle \mathbb {E}[y_i]\) minimizes the expected mean squared error, i.e.,
$$\begin{aligned} \beta := \arg \min _w \frac{1}{n}\sum _{i=1}^n \mathbb {E}[(\langle w,x_i \rangle -y_i)^2]. \end{aligned}$$
Similar to the random design setup, define \(\mathrm{noise }(x_i) := y_i - \mathbb {E}[y_i]\) and \(\mathrm{approx }(x_i) := \mathbb {E}[y_i] - \langle \beta ,x_i \rangle \) for \(i=1,2,\cdots ,n\), so the following modeling equation holds:
$$\begin{aligned} y_i = \langle \beta ,x_i \rangle + \mathrm{approx }(x_i) + \mathrm{noise }(x_i) \end{aligned}$$
for \(i=1,2,\cdots ,n\). Because \(\varSigma = \widehat{\varSigma }\), the ridge estimator \(\hat{\beta }_{\lambda }\) in the fixed design setting is an unbiased estimator of the minimizer of the regularized mean squared error, i.e.,
$$\begin{aligned} \mathbb {E}[\hat{\beta }_{\lambda }] \!=\! (\varSigma \!+\! \lambda I)^{\!-\!1} \left( \frac{1}{n} \sum _{i\!=\!1}^n x_i\mathbb {E}[y_i] \right) \!=\! \arg \min _w \left\{ \frac{1}{n} \sum _{i\!=\!1}^n \mathbb {E}[(\langle w,x_i \rangle \! -\! y_i)^2] \!+\! \lambda \Vert w\Vert ^2 \right\} \!. \end{aligned}$$
This unbiasedness implies that the expected mean squared error of \(\hat{\beta }_{\lambda }\) has the bias-variance decomposition
$$\begin{aligned} \mathbb {E}[\Vert \hat{\beta }_{\lambda }- \beta \Vert _\varSigma ^2] = \Vert \mathbb {E}[\hat{\beta }_{\lambda }] - \beta \Vert _\varSigma ^2 + \mathbb {E}[\Vert \hat{\beta }_{\lambda }- \mathbb {E}[\hat{\beta }_{\lambda }]\Vert _\varSigma ^2]. \end{aligned}$$
(10)
The following bound on the expected excess mean squared error easily follows from this decomposition and the definition of \(\beta \) (see, e.g., Proposition 7):

Proposition 1

(Ridge regression: fixed design) Fix \(\lambda \ge 0\), and assume \(\varSigma + \lambda I\) is invertible. If there exists \(\sigma \ge 0\) such that \({{\mathrm{var}}}(y_i) \le \sigma ^2\) for all \(i=1,2,\cdots ,n\), then
$$\begin{aligned} \mathbb {E}[\Vert \hat{\beta }_{\lambda }- \beta \Vert _\varSigma ^2] \le \sum _j \frac{\lambda _j}{(\frac{\lambda _j}{\lambda } + 1)^2} \beta _j^2 + \frac{\sigma ^2}{n} \sum _j \left( \frac{\lambda _j}{\lambda _j + \lambda } \right) ^2 \end{aligned}$$
with equality iff \({{\mathrm{var}}}(y_i) = \sigma ^2\) for all \(i=1,2,\cdots ,n\).

Remark 5

(Effect of approximation error in fixed design) Observe that \(\mathrm{approx }(x_i)\) has no effect on the expected excess mean squared error.

Remark 6

(Effective dimension) The second sum in the bound is equal to \(d_{2,\lambda }\), a notion of effective dimension at regularization level \(\lambda \).

Remark 7

(Ordinary least squares in fixed design) Setting \(\lambda = 0\) gives the following bound for the ordinary least squares estimator \(\hat{\beta }_0\):
$$\begin{aligned} \mathbb {E}[\Vert \hat{\beta }_0 - \beta \Vert _\varSigma ^2] \le \frac{\sigma ^2 d}{n}, \end{aligned}$$
where, as before, equality holds iff \({{\mathrm{var}}}(y_i) = \sigma ^2\) for all \(i = 1,2,\cdots ,n\).
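As a sanity check of Proposition 1 (and of Remark 7 when \(\lambda = 0\)), the following Monte Carlo sketch compares the empirical expected excess error with the right-hand side in the equality case \({{\mathrm{var}}}(y_i) = \sigma ^2\). The design, spectrum, and noise level are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, lam = 200, 5, 1.0, 0.1
X = rng.standard_normal((n, d))                 # fixed design: held fixed across trials
beta = rng.standard_normal(d)
Sigma = X.T @ X / n                             # in fixed design, Sigma := Sigma_hat

errs = []
for _ in range(2000):
    y = X @ beta + sigma * rng.standard_normal(n)    # E[y_i] = <beta, x_i>, var(y_i) = sigma^2
    beta_hat = np.linalg.solve(Sigma + lam * np.eye(d), X.T @ y / n)
    diff = beta_hat - beta
    errs.append(diff @ Sigma @ diff)                 # ||beta_hat_lambda - beta||_Sigma^2

evals, evecs = np.linalg.eigh(Sigma)
b = evecs.T @ beta                                   # coordinates beta_j = <v_j, beta>
bias = np.sum(evals * b ** 2 / (evals / lam + 1) ** 2)            # first sum in Proposition 1 (lam > 0)
variance = sigma ** 2 / n * np.sum((evals / (evals + lam)) ** 2)  # sigma^2 d_{2,lam} / n
print(np.mean(errs), bias + variance)                # the two numbers should agree closely
```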

3.2 Ordinary Least Squares

Our analysis of the ordinary least squares estimator (under random design) is based on a simple decomposition of the excess mean squared error, similar to the one from the fixed design analysis. To state the decomposition, first let \(\bar{\beta }_0\) denote the conditional expectation of the least squares estimator \(\hat{\beta }_0\) conditioned on \(x_1,x_2,\cdots ,x_n\), i.e.,
$$\begin{aligned} \bar{\beta }_0 := \mathbb {E}[\hat{\beta }_0 | x_1,x_2,\cdots ,x_n] = \widehat{\varSigma }^{-1} \widehat{\mathbb {E}}[x\mathbb {E}[y|x]]. \end{aligned}$$
Also, define the bias and variance as
$$\begin{aligned} \varepsilon _{\mathrm{bs }}:= \Vert \bar{\beta }_0 - \beta \Vert _\varSigma ^2 \ ,\quad \quad \varepsilon _{\mathrm{vr }}:= \Vert \hat{\beta }_0 - \bar{\beta }_0\Vert _\varSigma ^2. \end{aligned}$$

Proposition 2

(Random design decomposition) We have
$$\begin{aligned} \Vert \hat{\beta }_0- \beta \Vert _\varSigma ^2&\le \varepsilon _{\mathrm{bs }}+ 2\sqrt{\varepsilon _{\mathrm{bs }}\varepsilon _{\mathrm{vr }}} + \varepsilon _{\mathrm{vr }}\\&\le 2(\varepsilon _{\mathrm{bs }}+ \varepsilon _{\mathrm{vr }}). \end{aligned}$$

Proof

The claim follows from the triangle inequality and the fact \((a+b)^2 \le 2(a^2+b^2)\). \(\square \)

Remark 8

Note that, in general, \(\mathbb {E}[\hat{\beta }_0] \ne \beta \) (unlike in the fixed design setting where \(\mathbb {E}[\hat{\beta }_0] =\beta \)). Hence, our decomposition differs from that in the fixed design analysis (see (10)).
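To illustrate the decomposition, the quantities \(\varepsilon _{\mathrm{bs }}\) and \(\varepsilon _{\mathrm{vr }}\) can be computed exactly in a simulation where \(\mathbb {E}[y|x]\) is known. The following is a minimal sketch with a toy distribution of our own; for Gaussian \(x\), the quadratic term below is uncorrelated with every linear function of \(x\), so \(\beta \) remains the minimizer in (2).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 500, 5, 1.0
Sigma = np.diag(2.0 ** -np.arange(d))                       # true second moment operator
beta = np.ones(d)

X = rng.standard_normal((n, d)) * np.sqrt(np.diag(Sigma))   # random design with E[x x^T] = Sigma
f = X @ beta + 0.5 * (X[:, 0] ** 2 - Sigma[0, 0])           # E[y|x]; quadratic part = approx(x)
y = f + sigma * rng.standard_normal(n)                      # Gaussian (hence sub-Gaussian) noise

Sigma_hat = X.T @ X / n
beta_hat = np.linalg.solve(Sigma_hat, X.T @ y / n)          # ordinary least squares estimator
beta_bar = np.linalg.solve(Sigma_hat, X.T @ f / n)          # conditional expectation given x_1..x_n

eps_bs = (beta_bar - beta) @ Sigma @ (beta_bar - beta)      # bias due to random design
eps_vr = (beta_hat - beta_bar) @ Sigma @ (beta_hat - beta_bar)  # effect of noise
excess = (beta_hat - beta) @ Sigma @ (beta_hat - beta)
# Proposition 2: excess <= eps_bs + 2*np.sqrt(eps_bs * eps_vr) + eps_vr <= 2*(eps_bs + eps_vr)
```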

Our first main result characterizes the excess loss of the ordinary least squares estimator.

Theorem 1

(Ordinary least squares regression) Pick any \(t > \max \{0, 2.6 - \log d \}\). Assume Condition 1 (with parameter \(\rho _0\)), Condition 2 (with \(\sigma \)), and Condition 3 (with \(b_0\)) hold and that
$$\begin{aligned} n \ge 6\rho _0^2 d(\log d + t). \end{aligned}$$
With probability at least \(1 - 3\mathrm {e}^{-t}\), the following holds:
  1. (1)
    Relative spectral norm error in \(\widehat{\varSigma }\): \(\widehat{\varSigma }\) is invertible, and
    $$\begin{aligned} \Vert \varSigma ^{1/2}\widehat{\varSigma }^{-1}\varSigma ^{1/2}\Vert \le (1 - \delta _{\mathrm{s }})^{-1}, \end{aligned}$$
    where \(\varSigma \) is defined in (1), \(\widehat{\varSigma }\) is defined in (3), and
    $$\begin{aligned} \delta _{\mathrm{s }}:= \sqrt{\frac{4\rho _0^2d(\log d + t)}{n}} + \frac{2\rho _0^2d(\log d + t)}{3n} \end{aligned}$$
    (note that the lower bound on \(n\) ensures \(\delta _{\mathrm{s }}\le 0.93 < 1\)).
     
  2. (2)
    Effect of bias due to random design:
    $$\begin{aligned} \varepsilon _{\mathrm{bs }}&\le \frac{2}{(1-\delta _{\mathrm{s }})^2} \Biggl ( \frac{\mathbb {E}[\Vert \varSigma ^{-1/2}x\mathrm{approx }(x)\Vert ^2]}{n} (1 + \sqrt{8t})^2 + \frac{16b_0^2 dt^2}{9n^2} \Biggr ) \\&\le \frac{2}{(1-\delta _{\mathrm{s }})^2} \Biggl ( \frac{\rho _0^2 d \mathbb {E}[\mathrm{approx }(x)^2]}{n} (1 + \sqrt{8t})^2 + \frac{16b_0^2 dt^2}{9n^2} \Biggr ) , \end{aligned}$$
    and \(\mathrm{approx }(x)\) is defined in (7).
     
  3. (3)
    Effect of noise:
    $$\begin{aligned} \varepsilon _{\mathrm{vr }}\le \frac{1}{1-\delta _{\mathrm{s }}} \cdot \frac{\sigma ^2 (d + 2 \sqrt{d t} + 2 t)}{n}. \end{aligned}$$
     

Remark 9

(Simplified form) Suppressing the terms that are \(o(1/n)\), the overall bound from Theorem 1 is
$$\begin{aligned} \Vert \hat{\beta }_0 \!-\! \beta \Vert _\varSigma ^2 \le \frac{2\mathbb {E}[\Vert \varSigma ^{\!-\!1/2}x\mathrm{approx }(x)\Vert ^2]}{n} (1\! +\! \sqrt{8t})^2\! +\! \frac{\sigma ^2 (d \!+\! 2 \sqrt{d t} \!+\! 2 t)}{n} \!+\! o(1/n) \end{aligned}$$
[so \(b_0\) appears only in the \(o(1/n)\) terms]. If the linear model is correct (i.e., \(\mathbb {E}[y|x] = \langle \beta ,x \rangle \) almost surely), then
$$\begin{aligned} \Vert \hat{\beta }_0 - \beta \Vert _\varSigma ^2 \le \frac{\sigma ^2 (d + 2 \sqrt{d t} + 2 t)}{n} + o(1/n). \end{aligned}$$
(11)
One can show that the constants in the first-order term in (11) are the same as those that one would obtain for a fixed design tail bound.

Remark 10

(Tightness of the bound) Since
$$\begin{aligned} \Vert \bar{\beta }_0 - \beta \Vert _\varSigma ^2 = \Vert (\varSigma ^{1/2} \widehat{\varSigma }^{-1} \varSigma ^{1/2}) \widehat{\mathbb {E}}[\varSigma ^{-1/2} x \mathrm{approx }(x)]\Vert ^2 \end{aligned}$$
and
$$\begin{aligned} \Vert \varSigma ^{1/2} \widehat{\varSigma }^{-1} \varSigma ^{1/2} - I\Vert \rightarrow 0 \end{aligned}$$
as \(n \rightarrow \infty \) (Lemma 2), \(\Vert \bar{\beta }_0 \!-\! \beta \Vert _\varSigma ^2\) is within constant factors of \(\Vert \widehat{\mathbb {E}}[\varSigma ^{\!-\!1/2} x\mathrm{approx }(x)]\Vert ^2\) for sufficiently large \(n\). Moreover,
$$\begin{aligned} \mathbb {E}[\Vert \widehat{\mathbb {E}}[\varSigma ^{-1/2} x\mathrm{approx }(x)]\Vert ^2] = \frac{\mathbb {E}[\Vert \varSigma ^{-1/2} x\mathrm{approx }(x)\Vert ^2]}{n} , \end{aligned}$$
which is the main term that appears in the bound for \(\varepsilon _{\mathrm{bs }}\). Similarly, \(\Vert \hat{\beta }_0 - \bar{\beta }_0\Vert _\varSigma ^2\) is within constant factors of \(\Vert \hat{\beta }_0 - \bar{\beta }_0\Vert _{\widehat{\varSigma }}^2\) for sufficiently large \(n\), and
$$\begin{aligned} \mathbb {E}[\Vert \hat{\beta }_0 - \bar{\beta }_0\Vert _{\widehat{\varSigma }}^2] \le \frac{\sigma ^2 d}{n} \end{aligned}$$
with equality iff \({{\mathrm{var}}}(y) = \sigma ^2\) (this comes from the fixed design risk bound in Remark 7). Therefore, in the case where \({{\mathrm{var}}}(y) = \sigma ^2\), we conclude that the bound in Theorem 1 is tight up to constant factors and lower-order terms.

3.3 Random Design Ridge Regression

The analysis of the ridge estimator under random design is again based on a simple decomposition of the excess mean squared error. Here, let \(\bar{\beta }_{\lambda }\) denote the conditional expectation of \(\hat{\beta }_{\lambda }\) given \(x_1,x_2,\cdots ,x_n\), i.e.,
$$\begin{aligned} \bar{\beta }_{\lambda }:= \mathbb {E}[\hat{\beta }_{\lambda }| x_1,x_2,\cdots ,x_n] = (\widehat{\varSigma }+ \lambda I)^{-1} \widehat{\mathbb {E}}[x\mathbb {E}[y|x]]. \end{aligned}$$
(12)
Define the bias from regularization, the bias from the random design, and the variance as
$$\begin{aligned} \varepsilon _{\mathrm{rg }}:= \Vert \beta _{\lambda }- \beta \Vert _\varSigma ^2 , \quad \quad \varepsilon _{\mathrm{bs }}:= \Vert \bar{\beta }_{\lambda }- \beta _{\lambda }\Vert _\varSigma ^2 , \quad \quad \varepsilon _{\mathrm{vr }}:= \ \Vert \hat{\beta }_{\lambda }- \bar{\beta }_{\lambda }\Vert _\varSigma ^2, \end{aligned}$$
where \(\beta _{\lambda }\) is the minimizer of the regularized mean squared error (see (8)).

Proposition 3

(General random design decomposition)
$$\begin{aligned} \Vert \hat{\beta }_{\lambda }- \beta \Vert _\varSigma ^2&\le \varepsilon _{\mathrm{rg }}+ \varepsilon _{\mathrm{bs }}+ \varepsilon _{\mathrm{vr }}+ 2(\sqrt{\varepsilon _{\mathrm{rg }}\varepsilon _{\mathrm{bs }}} + \sqrt{\varepsilon _{\mathrm{rg }}\varepsilon _{\mathrm{vr }}} + \sqrt{\varepsilon _{\mathrm{bs }}\varepsilon _{\mathrm{vr }}}) \\&\le 3(\varepsilon _{\mathrm{rg }}+ \varepsilon _{\mathrm{bs }}+ \varepsilon _{\mathrm{vr }}). \end{aligned}$$

Proof

The claim follows from the triangle inequality and the fact \(2\sqrt{ab} \le a+b\) for nonnegative \(a,b\), applied to each cross term. \(\square \)

Remark 11

Again, note that \(\mathbb {E}[\hat{\beta }_{\lambda }] \ne \beta _{\lambda }\) in general, so the bias-variance decomposition in (10) from the fixed design analysis is not directly applicable in the random design setting.

The following theorem is the main result of the paper:

Theorem 2

(Ridge regression) Fix some \(\lambda \ge 0\), and pick any \(t > \max \{0, 2.6 - \log \tilde{d}_{1,\lambda }\}\). Assume Condition 1 (with parameter \(\rho _{\lambda }\)), Condition 2 (with parameter \(\sigma \)), and Condition 3 (with parameter \(b_{\lambda }\)) hold, and that
$$\begin{aligned} n \ge 6\rho _{\lambda }^2d_{1,\lambda }(\log \tilde{d}_{1,\lambda }+ t), \end{aligned}$$
where \(d_{p,\lambda }\) for \(p \in \{1,2\}\) is defined in (5), and \(\tilde{d}_{1,\lambda }\) is defined in (6).
With probability at least \(1 - 4\mathrm {e}^{-t}\), the following holds:
  1. (1)
    Relative spectral norm error in \(\widehat{\varSigma }+ \lambda I\): \(\widehat{\varSigma }+ \lambda I\) is invertible, and
    $$\begin{aligned} \Vert (\varSigma + \lambda I)^{1/2}(\widehat{\varSigma }+ \lambda I)^{-1}(\varSigma + \lambda I)^{1/2}\Vert \le (1 - \delta _{\mathrm{s }})^{-1}, \end{aligned}$$
    where \(\varSigma \) is defined in (1), \(\widehat{\varSigma }\) is defined in (3), and
    $$\begin{aligned} \delta _{\mathrm{s }}:= \sqrt{\frac{4\rho _{\lambda }^2d_{1,\lambda }(\log \tilde{d}_{1,\lambda }+ t)}{n}} + \frac{2\rho _{\lambda }^2d_{1,\lambda }(\log \tilde{d}_{1,\lambda }+ t)}{3n} \end{aligned}$$
    (note that the lower bound on \(n\) ensures \(\delta _{\mathrm{s }}\le 0.93 < 1\)).
     
  2. (2)
    Frobenius norm error in \(\widehat{\varSigma }\):
    $$\begin{aligned} \Vert (\varSigma + \lambda I)^{-1/2}(\widehat{\varSigma }- \varSigma )(\varSigma + \lambda I)^{-1/2}\Vert _{\mathrm{F }}\le \sqrt{d_{1,\lambda }} \delta _{\mathrm{f }}, \end{aligned}$$
    where
    $$\begin{aligned} \delta _{\mathrm{f }}:= \sqrt{\frac{\rho _{\lambda }^2d_{1,\lambda }- d_{2,\lambda }/d_{1,\lambda }}{n}}(1+\sqrt{8t}) + \frac{4\sqrt{\rho _{\lambda }^4d_{1,\lambda }+ d_{2,\lambda }/d_{1,\lambda }}t}{3n}. \end{aligned}$$
     
  3. (3)
    Effect of regularization:
    $$\begin{aligned} \varepsilon _{\mathrm{rg }}\le \sum _j \frac{\lambda _j}{(\frac{\lambda _j}{\lambda } + 1)^2} \beta _j^2. \end{aligned}$$
    If \(\lambda \!=\! 0\), then \(\varepsilon _{\mathrm{rg }}\!=\! 0\).
     
  4. (4)
    Effect of bias due to random design:
    $$\begin{aligned} \varepsilon _{\mathrm{bs }}&\le \frac{2}{(1\!-\!\delta _{\mathrm{s }})^2} \Biggl ( \frac{\mathbb {E}[\Vert (\varSigma \!+\! \lambda I)^{\!-\!1/2}(x\mathrm{approx }_{\lambda }(x) \!-\! \lambda \beta _{\lambda })\Vert ^2]}{n} (1 \!+\! \sqrt{8t})^2\\&\quad \quad \quad \quad \quad \quad \quad + \frac{16\bigl (b_{\lambda }\sqrt{d_{1,\lambda }}\! +\! \sqrt{\varepsilon _{\mathrm{rg }}} \bigr )^2t^2}{9n^2} \Biggr ) \\&\le \frac{4}{(1\!-\!\delta _{\mathrm{s }})^2} \Biggl ( \frac{\rho _{\lambda }^2 d_{1,\lambda }\mathbb {E}[\mathrm{approx }_{\lambda }(x)^2] \!+\! \varepsilon _{\mathrm{rg }}}{n} (1\! +\! \sqrt{8t})^2\! +\! \frac{\bigl (b_{\lambda }\sqrt{d_{1,\lambda }}\! + \!\sqrt{\varepsilon _{\mathrm{rg }}} \bigr )^2t^2}{n^2} \Biggr )\! , \end{aligned}$$
    and \(\mathrm{approx }_{\lambda }(x)\) is defined in (9). If \(\lambda = 0\), then \(\mathrm{approx }_{\lambda }(x) = \mathrm{approx }(x)\) as defined in (7).
     
  5. (5)
    Effect of noise:
    $$\begin{aligned} \varepsilon _{\mathrm{vr }}\le \frac{\sigma ^2 \Bigl (d_{2,\lambda }+ \sqrt{d_{1,\lambda }d_{2,\lambda }}\delta _{\mathrm{f }}\Bigr )}{n (1-\delta _{\mathrm{s }})^2} + \frac{2\sigma ^2 \sqrt{\Bigl (d_{2,\lambda }+ \sqrt{d_{1,\lambda }d_{2,\lambda }}\delta _{\mathrm{f }}\Bigr )t}}{n (1-\delta _{\mathrm{s }})^{3/2}} + \frac{2\sigma ^2 t}{n (1-\delta _{\mathrm{s }})}. \end{aligned}$$
     

We now discuss various aspects of Theorem 2.

Remark 12

(Simplified form) Ignoring the terms that are \(o(1/n)\) and treating \(t\) as a constant, the overall bound from Theorem 2 is
$$\begin{aligned}&\Vert \hat{\beta }_{\lambda }- \beta \Vert _\varSigma ^2\\&\quad \le \Vert \beta _{\lambda }-\beta \Vert _\varSigma ^2 + O\left( \frac{\mathbb {E}[\Vert (\varSigma + \lambda I)^{-1/2}(x\mathrm{approx }_{\lambda }(x) - \lambda \beta _{\lambda })\Vert ^2] + \sigma ^2 d_{2,\lambda }}{n} \right) \\&\quad \le \Vert \beta _{\lambda }-\beta \Vert _\varSigma ^2 + O\left( \frac{\rho _{\lambda }^2d_{1,\lambda }\mathbb {E}[\mathrm{approx }_{\lambda }(x)^2] + \Vert \beta _{\lambda }-\beta \Vert _\varSigma ^2 + \sigma ^2 d_{2,\lambda }}{n} \right) \\&\quad \le \Vert \beta _{\lambda }-\beta \Vert _\varSigma ^2 \!+\! O\left( \frac{\rho _{\lambda }^2d_{1,\lambda }\mathbb {E}[\mathrm{approx }(x)^2] + (\rho _{\lambda }^2d_{1,\lambda }+1)\Vert \beta _{\lambda }-\beta \Vert _\varSigma ^2 + \sigma ^2 d_{2,\lambda }}{n} \right) \!, \end{aligned}$$
where the last inequality follows from the fact \(\sqrt{\mathbb {E}[\mathrm{approx }_{\lambda }(x)^2]} \le \sqrt{\mathbb {E}[\mathrm{approx }(x)^2]} + \Vert \beta _{\lambda }-\beta \Vert _\varSigma \).

Remark 13

(Effect of errors in \(\widehat{\varSigma }\)) The accuracy of \(\widehat{\varSigma }\) has a relatively mild effect on the bound—it appears essentially through multiplicative factors \((1-\delta _{\mathrm{s }})^{-1} = 1 + O(\delta _{\mathrm{s }})\) and \(1+\delta _{\mathrm{f }}\), where both \(\delta _{\mathrm{s }}\) and \(\delta _{\mathrm{f }}\) are decreasing with \(n\) (as \(n^{-1/2}\)), and therefore only contribute to lower-order terms overall.

Remark 14

(Effect of approximation error) The effect of approximation error is isolated in the term \(\Vert \bar{\beta }_{\lambda }-\beta _{\lambda }\Vert _\varSigma ^2\). The bound on \(\varepsilon _{\mathrm{bs }}\) scales with a fourth-moment quantity \(\mathbb {E}[\Vert (\varSigma + \lambda I)^{-1/2}(x\mathrm{approx }_{\lambda }(x) - \lambda \beta _{\lambda })\Vert ^2]\); when using the looser bound \(O(\rho _{\lambda }^2d_{1,\lambda }\mathbb {E}[\mathrm{approx }(x)^2] + (\rho _{\lambda }^2d_{1,\lambda }+1) \Vert \beta _{\lambda }-\beta \Vert _\varSigma ^2)\), the overall simplified bound from Remark 12 can be viewed as
$$\begin{aligned} \mathbb {E}[(\langle \hat{\beta }_{\lambda },x \rangle -\mathbb {E}[y|x])^2 | \hat{\beta }_{\lambda }]&\le \mathbb {E}[(\langle \beta ,x \rangle -\mathbb {E}[y|x])^2] \left( 1 + \frac{c_1 \rho _{\lambda }^2d_{1,\lambda }}{n} \right) \\&\quad + \mathbb {E}[\langle \beta _{\lambda }-\beta ,x \rangle ^2] \left( 1 + \frac{c_2 (\rho _{\lambda }^2d_{1,\lambda }+1)}{n} \right) \\&\quad + \text { terms due to stochastic noise} \end{aligned}$$
for some positive constants \(c_1\) and \(c_2\). Therefore, the (bound on the) mean squared error of \(\hat{\beta }_{\lambda }\) is the sum of two contributions (up to lower-order terms): the first is a scaling of the approximation errors \(\mathbb {E}[(\langle \beta ,x \rangle - \mathbb {E}[y|x])^2] + \mathbb {E}[\langle \beta _{\lambda }-\beta ,x \rangle ^2]\), where the scaling \(1 + O((\rho _{\lambda }^2d_{1,\lambda }+1)/n)\) tends to one as \(n \rightarrow \infty \); and the second is the stochastic noise contribution. The approximation error contribution is unique to random design, while the stochastic noise appears in both random and fixed design.

Remark 15

(Bounded covariates) Suppose \(\mathrm{approx }(x) = 0\) and that there exists \(r>0\) such that \(\Vert x\Vert \le r\) almost surely. This is the setting of a well-specified model with bounded covariates; the minimax risk over the class of models \(\beta \) with \(\Vert \beta \Vert \le B\) for some \(B>0\) is at least \(\Omega (\sqrt{\sigma ^2 r^2B^2 / n})\) [17]. In this case, using the inequalities \(\Vert \beta _{\lambda }-\beta \Vert _\varSigma ^2 \le \lambda \Vert \beta \Vert ^2/2\) and \(d_{2,\lambda }\le {{\mathrm{tr}}}(\varSigma )/(2\lambda )\), the simplified bound from Remark 12 reduces to
$$\begin{aligned} \Vert \hat{\beta }_{\lambda }- \beta \Vert _\varSigma ^2 \le \Biggl ( 1 + O\biggl ( \frac{1 + r^2/\lambda }{n} \biggr ) \Biggr ) \cdot \frac{\lambda \Vert \beta \Vert ^2}{2} + \frac{\sigma ^2}{n} \cdot \frac{{{\mathrm{tr}}}(\varSigma )}{2\lambda }. \end{aligned}$$
Choosing \(\lambda > 0\) to minimize the bound and using the fact \({{\mathrm{tr}}}(\varSigma ) \le r^2\) gives
$$\begin{aligned} \Vert \hat{\beta }_{\lambda }- \beta \Vert _\varSigma ^2 \le \sqrt{\frac{\sigma ^2r^2B^2}{n} \cdot \biggl ( 1 + O(1/n) \biggr )} + O\biggl ( \frac{r^2B^2}{n} \biggr ) , \end{aligned}$$
which matches the lower bound up to constant factors and lower-order terms.
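For completeness, here is the optimization over \(\lambda \) alluded to above, ignoring the \(1 + O((1 + r^2/\lambda )/n)\) factor:
$$\begin{aligned} \frac{\mathrm {d}}{\mathrm {d}\lambda } \left( \frac{\lambda \Vert \beta \Vert ^2}{2} + \frac{\sigma ^2 {{\mathrm{tr}}}(\varSigma )}{2\lambda n} \right) = \frac{\Vert \beta \Vert ^2}{2} - \frac{\sigma ^2 {{\mathrm{tr}}}(\varSigma )}{2\lambda ^2 n} = 0 \quad \Longrightarrow \quad \lambda ^\star = \sqrt{\frac{\sigma ^2 {{\mathrm{tr}}}(\varSigma )}{n \Vert \beta \Vert ^2}} , \end{aligned}$$
at which the two terms are equal and sum to \(\sqrt{\sigma ^2 {{\mathrm{tr}}}(\varSigma ) \Vert \beta \Vert ^2 / n} \le \sqrt{\sigma ^2 r^2 B^2 / n}\), using \({{\mathrm{tr}}}(\varSigma ) \le r^2\) and \(\Vert \beta \Vert \le B\).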

Remark 16

(Application to smoothing splines) The applications of ridge regression considered by [23] can also be analyzed using Theorem 2 (although technically our result is only proved in the finite-dimensional setting). We specifically consider the problem of approximating a periodic function with smoothing splines, which are functions \(f :\mathbb {R}\rightarrow \mathbb {R}\) whose \(s\)-th derivatives \(f^{(s)}\), for some \(s > 1/2\), satisfy
$$\begin{aligned} \int \left( f^{(s)}(t) \right) ^2 \mathrm {d}t < \infty . \end{aligned}$$
The one-dimensional covariate \(t \in \mathbb {R}\) can be mapped to the infinite-dimensional representation \(x := \phi (t) \in \mathbb {R}^\infty \), where
$$\begin{aligned} x_{2k} := \frac{\sin (kt)}{(k+1)^s} \quad \text {and}\quad x_{2k+1} := \frac{\cos (kt)}{(k+1)^s} , \quad k \in \{0, 1, 2, \cdots \}. \end{aligned}$$
Assume that the regression function is
$$\begin{aligned} \mathbb {E}[y|x] = \langle \beta ,x \rangle \end{aligned}$$
so \(\mathrm{approx }(x) = 0\) almost surely. Observe that \(\Vert x\Vert ^2 \le \frac{2s}{2s-1}\), so Condition 1 is satisfied with
$$\begin{aligned} \rho _{\lambda }:= \left( \frac{2s}{2s-1}\right) ^{1/2} \frac{1}{\sqrt{\lambda d_{1,\lambda }}} \end{aligned}$$
as per Remark 2. Therefore, the simplified bound from Remark 12 becomes in this case
$$\begin{aligned} \Vert \hat{\beta }_{\lambda }- \beta \Vert _\varSigma ^2&\le \Vert \beta _{\lambda }- \beta \Vert _\varSigma ^2 + C \cdot \left( \frac{2s}{2s-1} \cdot \frac{ \Vert \beta _{\lambda }-\beta \Vert _\varSigma ^2}{\lambda n} + \frac{\Vert \beta _{\lambda }- \beta \Vert _\varSigma ^2 + \sigma ^2d_{2,\lambda }}{n} \right) \\&\le \frac{\lambda \Vert \beta \Vert ^2}{2} + C \cdot \frac{\sigma ^2d_{2,\lambda }}{n} + C \cdot \left( \frac{2s}{2s-1} + \frac{\lambda }{2} \right) \cdot \frac{\Vert \beta \Vert ^2}{n} \end{aligned}$$
for some constant \(C > 0\), where we have used the inequality \(\Vert \beta _{\lambda }- \beta \Vert _\varSigma ^2 \le \lambda \Vert \beta \Vert ^2 / 2\). [23] shows that
$$\begin{aligned} d_{1,\lambda }\le \inf _{k \ge 1} \left\{ 2k + \frac{2/\lambda }{(2s-1)k^{2s-1}} \right\} . \end{aligned}$$
Since \(d_{2,\lambda }\le d_{1,\lambda }\), it follows that setting \(\lambda := k^{-2s}\) where \(k = \lfloor ( (2s-1)n / (2s) )^{1/(2s+1)} \rfloor \) gives the bound
$$\begin{aligned} \Vert \hat{\beta }_{\lambda }- \beta \Vert _\varSigma ^2 \le \left( \frac{\Vert \beta \Vert ^2}{2} + 2C\sigma ^2 \right) \cdot \left( \frac{2s-1}{2s} \cdot n \right) ^{-\frac{2s}{2s+1}} + \text {lower-order terms}, \end{aligned}$$
which has the optimal data-dependent rate of \(n^{-\frac{2s}{2s+1}}\) [22].
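As a small numerical illustration of this choice of regularization (a sketch; the function name and the example values are ours):

```python
import numpy as np

def spline_regularization(n, s):
    """Choice from Remark 16: k = floor(((2s-1) n / (2s))^(1/(2s+1))) and lambda = k^(-2s),
    which yields the n^{-2s/(2s+1)} rate."""
    k = int(np.floor(((2 * s - 1) * n / (2 * s)) ** (1.0 / (2 * s + 1))))
    k = max(k, 1)                      # guard against k = 0 for very small n (our addition)
    return k, float(k) ** (-2.0 * s)

k, lam = spline_regularization(n=10_000, s=2)   # e.g., s = 2: lam scales as n^{-2s/(2s+1)}
```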

Remark 17

(Comparison with fixed design) As already discussed, the ridge estimator behaves similarly under fixed and random designs, with the main differences being the lack of errors in \(\widehat{\varSigma }\) under fixed design, and the influence of approximation error under random design. These are revealed through the quantities \(\rho _{\lambda }\) and \(d_{1,\lambda }\) (and \(b_{\lambda }\) in lower-order terms), which are needed to apply the probability tail inequalities. Therefore, the scaling of \(\rho _{\lambda }^2d_{1,\lambda }\) with \(\lambda \) crucially controls the effect of random design compared with fixed design.

4 Application to Accelerating Least Squares Computations

Our results for the ordinary least squares estimator can be used to analyze a randomized approximation scheme for overcomplete least squares problems [7, 18]. The goal of these randomized methods is to approximately solve the least squares problem
$$\begin{aligned} \min _{w \in \mathbb {R}^d} \frac{1}{m} \Vert Aw - b\Vert ^2 \end{aligned}$$
for some large, full-rank design matrix \(A \in \mathbb {R}^{m \times d}\) (\(m \gg d\)) and vector \(b \in \mathbb {R}^m\). Note that using a standard method to exactly solve the least squares problem requires \(\Omega (md^2)\) operations, which can be prohibitive for large-scale problems. However, when an approximate solution is satisfactory, significant computational savings can be achieved through the use of randomization.

4.1 A Randomized Approximation Scheme for Least Squares

The approximation scheme is as follows:
  1. (1)

    The columns of \(A\) and the vector \(b\) are first subjected to a randomly chosen rotation matrix (i.e., an orthogonal transformation) \(\Theta \in \mathbb {R}^{m \times m}\). The distribution over rotation matrices that may be used is discussed below.

     
  2. (2)

    A sample of \(n\) rows of \([\Theta A, \Theta b] \in \mathbb {R}^{m \times (d+1)}\) are then selected uniformly at random with replacement; let \(\{ [x_i^{\scriptscriptstyle \top }, y_i] : i = 1,2,\cdots ,n \}\) (where \(x_i \in \mathbb {R}^d\) and \(y_i \in \mathbb {R}\)) be the \(n\) selected rows of \([\Theta A, \Theta b]\).

     
  3. (3)
    Finally, the least squares problem
    $$\begin{aligned} \min _{w \in \mathbb {R}^d} \frac{1}{n} \sum _{i=1}^n (\langle w,x_i \rangle - y_i)^2 \end{aligned}$$
    is solved by computing the ordinary least squares estimator \(\hat{\beta }_0\) on the sample \(\{ (x_i,y_i) : i = 1,2,\cdots ,n \}\).
     
The motivation for the random rotation \(\Theta \) is captured in Lemma 1, which shows that, if \(\Theta \) is chosen randomly from certain distributions over rotation matrices, then applying \(\Theta \) to \(A\) and \(b\) creates an equivalent least squares problem for which the statistical leverage parameter (the quantity \(\rho _0\) in Condition 1) is small. Consequently, the new least squares problem can be approximately solved with a small random sample, as per Theorems 1 and 2. Without the random rotation, the statistical leverage parameter could be so large that a small random sample of the rows will likely miss a row that is crucial for obtaining an accurate approximation. The role of statistical leverage in this setting was also pointed out by [6], although Lemma 1 makes the connection more direct. We note that Lemma 1 and the analysis below can be generalized to the case where \(\Theta \) is only approximately orthogonal; for most standard distributions over rotation matrices, the additional error terms that arise do not affect the overall analysis.

The running time of the approximation scheme is given by (i) the time required to apply the \(m \times m\) random rotation operator \(\Theta \) to the original \(m \times (d+1)\) matrix \([A, b]\) and randomly sample \(n\) rows, plus (ii) the time to solve the least squares problem on the smaller design matrix of size \(n \times d\). For (i), naïvely applying an arbitrary \(m \times m\) rotation matrix requires \(\Omega (m^2 d)\) operations; however, there are (distributions over) rotation matrices for which this running time can be reduced to \(O(md \log m)\) (see Example 2 in Sect. 4.3 below), which is a considerable speed-up when \(m\) is large. In fact, because only \(n\) out of \(m\) rows are to be retained anyway, this computation can be reduced to \(O(md \log n)\) [1]. For (ii), standard methods can produce the ordinary least squares estimator or the ridge regression estimator with \(O(nd^2)\) operations. Therefore, we are interested in the sample size \(n\) that suffices to yield an accurate approximation.
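The following is a minimal sketch of the three-step scheme, using a dense Haar-distributed rotation for simplicity rather than the fast transform of Example 2 (so step (1) here is far more expensive than necessary); all names are ours.

```python
import numpy as np

def approx_least_squares(A, b, n_sub, rng):
    """Randomized scheme of Sect. 4.1: (1) apply a random rotation Theta,
    (2) sample n_sub rows of [Theta A, Theta b] uniformly with replacement,
    (3) solve ordinary least squares on the subsample.
    This sketch draws a dense Haar rotation (O(m^3) to generate, O(m^2 d) to apply);
    in practice one uses the fast transform of Example 2."""
    m, d = A.shape
    Q, R = np.linalg.qr(rng.standard_normal((m, m)))
    Theta = Q * np.sign(np.diag(R))                  # sign correction makes Q Haar-distributed
    RA, Rb = Theta @ A, Theta @ b                    # step (1)
    idx = rng.integers(0, m, size=n_sub)             # step (2): uniform sampling with replacement
    return np.linalg.lstsq(RA[idx], Rb[idx], rcond=None)[0]   # step (3): OLS on the subsample

# Usage sketch:
# rng = np.random.default_rng(0)
# w_hat = approx_least_squares(A, b, n_sub=8 * A.shape[1], rng=rng)
```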

4.2 Analysis of the Approximation Scheme

Our approach to analyzing the above approximation scheme is to treat it as a random design regression problem. We apply Theorem 1 in this setting to give error bounds for the solution produced by the approximation scheme.

Let \((x,y) \in \mathbb {R}^d \times \mathbb {R}\) be a random pair distributed uniformly over the rows of \([\Theta A, \Theta b]\), where we assume that \(\Theta \) is randomly chosen from a suitable distribution over rotation matrices such as those described in Example 1 and Example 2. Lemma 1 (below) implies that there exists a constant \(c_0 > 0\) such that Condition 1 is satisfied with
$$\begin{aligned} \rho _0^2 \le c_0 \cdot \left( 1 + \frac{\log m + \tau }{d} \right) \end{aligned}$$
with probability at least \(1-\mathrm {e}^{-\tau }\) over the choice of the random rotation matrix \(\Theta \). Henceforth, we condition on the event that this holds.
Let \(\beta \in \mathbb {R}^d\) be the solution to the original least squares problem (i.e., \(\beta := \arg \min _w \Vert Aw-b\Vert ^2 / m\)), and let \(\hat{\beta }_0 \in \mathbb {R}^d\) be the ordinary least squares estimator computed on the random sample of the rows of \([\Theta A, \Theta b]\). Note that, for any \(w \in \mathbb {R}^d\),
$$\begin{aligned} \mathbb {E}[(\langle w,x \rangle -y)^2] = \frac{1}{m} \Vert \Theta Aw-\Theta b\Vert ^2 = \frac{1}{m} \Vert Aw-b\Vert ^2. \end{aligned}$$
Moreover, we may assume for simplicity that \(y - \langle \beta ,x \rangle = \mathrm{approx }(x)\) (i.e., there is no stochastic noise), so \(\mathbb {E}[\mathrm{approx }(x)^2] = \mathbb {E}[(\langle \beta ,x \rangle -y)^2] = \Vert A\beta -b\Vert ^2 / m\).
By Theorem 1, if at least
$$\begin{aligned} n \ge 6 \bigl (d + c_0 (\log m + \tau ) \bigr ) (\log d + t) \end{aligned}$$
rows of \([\Theta A,\Theta b]\) are sampled, then the ordinary least squares estimator \(\hat{\beta }_0\) satisfies the following approximation error guarantee (with probability at least \(1-3\mathrm {e}^{-t}\) over the random sample of rows):
$$\begin{aligned} \frac{1}{m} \Vert A\hat{\beta }_0 - b\Vert ^2 \le \frac{1}{m} \Vert A\beta - b\Vert ^2 \cdot \left( 1 + c_1 \frac{(d + \log m + \tau ) t}{n} \right) + o(1/n) \end{aligned}$$
for some constant \(c_1 > 0\). We note that the \(o(1/n)\) terms can be removed if one only requires constant probability of success (i.e., \(\tau \) and \(t\) treated as constants), as is considered by [7]. In this case, we achieve an error bound of
$$\begin{aligned} \frac{1}{m} \Vert A\hat{\beta }_0 - b\Vert ^2 \le \frac{1}{m} \Vert A\beta - b\Vert ^2 \cdot (1 + \epsilon ) \end{aligned}$$
for \(\epsilon > 0\) provided that the number of rows sampled is
$$\begin{aligned} n \ge c_2 (d + \log m) \left( \frac{1}{\epsilon } + \log d \right) \end{aligned}$$
for some constant \(c_2 > 0\).

4.3 Random Rotations and Bounding Statistical Leverage

The following lemma gives a simple condition on the distribution of the random orthogonal matrix \(\Theta \in \mathbb {R}^{m \times m}\) used to preprocess a data matrix \(A\) so that Condition 1 is applicable to a random vector \(x\) drawn uniformly from the rows of \(\Theta A\). Its proof is a straightforward application of Lemma 8.

Lemma 1

Fix any \(\tau > 0\) and \(\lambda \ge 0\). Suppose \(\Theta \in \mathbb {R}^{m \times m}\) is a random orthogonal matrix and \(\kappa > 0\) is a constant such that
$$\begin{aligned} \mathbb {E}\left[ \exp \left( \alpha ^{\scriptscriptstyle \top }\bigl (\sqrt{m} \Theta ^{\scriptscriptstyle \top }e_i\bigr ) \right) \right] \le \exp \left( \kappa \Vert \alpha \Vert ^2/2 \right) , \quad \forall \alpha \in \mathbb {R}^m , \forall i=1,2,\cdots ,m, \end{aligned}$$
(13)
where \(e_i\) is the \(i\)-th coordinate vector in \(\mathbb {R}^m\). Let \(A \in \mathbb {R}^{m \times d}\) be any matrix of rank \(d\), and let \(\varSigma := (1/m) (\Theta A)^{\scriptscriptstyle \top }(\Theta A) = (1/m) A^{\scriptscriptstyle \top }A\). There exists
$$\begin{aligned} \rho _{\lambda }^2 \le \kappa \left( 1 + 2\sqrt{\frac{\log m + \tau }{d_{1,\lambda }}} + \frac{2(\log m + \tau )}{d_{1,\lambda }} \right) \end{aligned}$$
such that
$$\begin{aligned} \Pr \left[ \max _{i=1,2,\cdots ,m} \Vert (\varSigma + \lambda I)^{-1/2} (\Theta A)^{\scriptscriptstyle \top }e_i \Vert ^2 > \rho _{\lambda }^2 d_{1,\lambda }\right] \le \mathrm {e}^{-\tau }, \end{aligned}$$
where \(d_{1,\lambda }:= \sum _{j=1}^d \frac{\lambda _j}{\lambda _j + \lambda }\) and \(\{ \lambda _1, \lambda _2, \cdots , \lambda _d \}\) are the eigenvalues of \(\varSigma \).

Proof

Let \(z_i := \sqrt{m} \Theta ^{\scriptscriptstyle \top }e_i\) for each \(i=1,2,\cdots ,m\). Let \(U \in \mathbb {R}^{m \times d}\) be a matrix of left orthonormal singular vectors of \((1/\sqrt{m}) A\), and let \(D_\lambda := {{\mathrm{diag}}}(\frac{\lambda _1}{\lambda _1 + \lambda }, \frac{\lambda _2}{\lambda _2 + \lambda }, \cdots , \frac{\lambda _d}{\lambda _d + \lambda })\). Note that \(D_\lambda = I\) if \(\lambda = 0\). We have
$$\begin{aligned} \Vert (\varSigma + \lambda I)^{-1/2} (\Theta A)^{\scriptscriptstyle \top }e_i\Vert = \Vert \sqrt{m} D_\lambda ^{1/2} U^{\scriptscriptstyle \top }\Theta ^{\scriptscriptstyle \top }e_i\Vert = \Vert D_\lambda ^{1/2} U^{\scriptscriptstyle \top }z_i\Vert . \end{aligned}$$
Since \({{\mathrm{tr}}}(UD_\lambda U^\top ) = d_{1,\lambda }\), \({{\mathrm{tr}}}(UD_\lambda ^2 U^\top ) \le d_{1,\lambda }\), and \(\lambda _{\max }[UD_\lambda U^\top ] \le 1\), Lemma 8 implies
$$\begin{aligned} \Pr \left[ \Vert D_\lambda ^{1/2} U^{\scriptscriptstyle \top }z_i\Vert ^2 > \kappa \left( d_{1,\lambda }+ 2\sqrt{d_{1,\lambda }(\log m + \tau )} + 2(\log m + \tau ) \right) \right] \le \mathrm {e}^{-\tau }/m. \end{aligned}$$
Therefore, by a union bound,
$$\begin{aligned}&\Pr \left[ \max _{i=1,2,\cdots ,m} \Vert (\varSigma + \lambda I)^{-1/2} (\Theta A)^{\scriptscriptstyle \top }e_i\Vert ^2\right. \\&\quad \quad \left. > \kappa \left( d_{1,\lambda }+ 2\sqrt{d_{1,\lambda }(\log m + \tau )} + 2(\log m + \tau ) \right) \right] \le \mathrm {e}^{-\tau }. \end{aligned}$$
\(\square \)
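
As a quick numerical illustration of the quantity bounded by Lemma 1, the following sketch (Python/NumPy) compares the scores \(\Vert (\varSigma + \lambda I)^{-1/2} A^{\scriptscriptstyle \top }e_i\Vert ^2\) before and after a random rotation. The deliberately coherent design matrix, the sizes, the choice \(\lambda = 0\), and the Haar-distributed rotation are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
m, d, lam = 1024, 10, 0.0                       # illustrative sizes and lambda

# A deliberately coherent design: the first d rows carry almost all the mass.
A = 1e-3 * rng.standard_normal((m, d))
A[:d] += np.eye(d)

def reg_leverage(M, lam):
    # Scores ||(Sigma + lam I)^{-1/2} M^T e_i||^2 for every row i of M.
    Sigma = M.T @ M / M.shape[0]
    S_inv = np.linalg.inv(Sigma + lam * np.eye(M.shape[1]))
    return np.einsum('ij,jk,ik->i', M, S_inv, M)

Q, R = np.linalg.qr(rng.standard_normal((m, m)))
Theta = Q * np.sign(np.diag(R))                 # Haar-distributed rotation

print(reg_leverage(A, lam).max())               # close to m: a single row dominates
print(reg_leverage(Theta @ A, lam).max())       # around d_{1,lam} + O(log m), as in Lemma 1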

Below, we give two simple examples under which the condition (13) in Lemma 1 holds.

Example 1

Let \(\Theta \) be distributed uniformly over all \(m \times m\) orthogonal matrices. Fix any \(i=1,2,\cdots ,m\). The random vector \(v := \Theta ^{\scriptscriptstyle \top }e_i\) is distributed uniformly on the unit sphere \(\mathbb {S}^{m-1}\). Let \(l\) be a \(\chi \) random variable with \(m\) degrees of freedom, independent of \(v\), so \(z := lv\) follows an isotropic multivariate Gaussian distribution. By Jensen’s inequality and the fact that \(\mathbb {E}[\exp (q^{\scriptscriptstyle \top }z)] \le \exp (\Vert q\Vert ^2/2)\) for any vector \(q \in \mathbb {R}^m\),
$$\begin{aligned} \mathbb {E}\left[ \exp \left( \alpha ^{\scriptscriptstyle \top }\bigl (\sqrt{m} \Theta ^{\scriptscriptstyle \top }e_i\bigr ) \right) \right]&= \mathbb {E}\left[ \exp \left( \alpha ^{\scriptscriptstyle \top }\bigl (\sqrt{m} v\bigr ) \right) \right] \\&= \mathbb {E}\left[ \mathbb {E}\left[ \exp \left( \frac{\sqrt{m}}{\mathbb {E}[l]} \alpha ^{\scriptscriptstyle \top }(\mathbb {E}[l] v) \right) \ \Big | \ v \right] \right] \\&\le \mathbb {E}\left[ \exp \left( \frac{\sqrt{m}}{\mathbb {E}[l]} \alpha ^{\scriptscriptstyle \top }(lv) \right) \right] \\&= \mathbb {E}\left[ \exp \left( \frac{\sqrt{m}}{\mathbb {E}[l]} \alpha ^{\scriptscriptstyle \top }z \right) \right] \\&\le \exp \left( \frac{\Vert \alpha \Vert ^2 m}{2 \mathbb {E}[l]^2} \right) \\&\le \exp \left( \frac{\Vert \alpha \Vert ^2}{2} \left( 1 - \frac{1}{4m} - \frac{1}{360m^3} \right) ^{-2} \right) , \end{aligned}$$
where the last inequality is due to the following lower estimate for \(\chi \) random variables:
$$\begin{aligned} \mathbb {E}[l] \ge \sqrt{m} \left( 1 - \frac{1}{4m} - \frac{1}{360m^3} \right) . \end{aligned}$$
Therefore, the condition (13) is satisfied with \(\kappa = 1 + O(1/m)\).
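
This calculation can be checked by Monte Carlo. In the sketch below (Python/NumPy), the dimension and the number of draws are illustrative assumptions, and \(\Theta ^{\scriptscriptstyle \top }e_i\) is sampled directly as a normalized Gaussian vector, which has the same uniform distribution on the sphere. The empirical moment generating function should fall below \(\exp (\kappa \Vert \alpha \Vert ^2/2)\), up to Monte Carlo error.

import numpy as np

rng = np.random.default_rng(2)
m, n_draws = 64, 200_000                        # illustrative dimension and sample count
alpha = rng.standard_normal(m)
alpha /= np.linalg.norm(alpha)                  # a fixed unit vector

# Theta^T e_i is uniform on the sphere, so sample it as a normalized Gaussian vector.
G = rng.standard_normal((n_draws, m))
proj = (G @ alpha) / np.linalg.norm(G, axis=1)  # alpha^T v for each draw of v

kappa = (1 - 1 / (4 * m) - 1 / (360 * m**3)) ** (-2)
empirical_mgf = np.mean(np.exp(np.sqrt(m) * proj))
print(empirical_mgf, np.exp(kappa / 2))         # empirical MGF vs. exp(kappa ||alpha||^2 / 2)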

Example 2

Let \(m\) be a power of two, and let \(\Theta := H{{\mathrm{diag}}}(s)/\sqrt{m}\), where \(H \in \{\pm 1\}^{m \times m}\) is the \(m \times m\) Hadamard matrix, and \(s := (s_1,s_2,\cdots ,s_m) \in \{\pm 1\}^m\) is a vector of \(m\) Rademacher variables (i.e., \(s_1,s_2,\cdots ,s_m\) are i.i.d. with \(\Pr [s_1=1]=\Pr [s_1=-1]=1/2\)). It is easy to check that \(\Theta \) is an orthogonal matrix. The random rotation \(\Theta \) is a key component of the fast Johnson–Lindenstrauss transform of [1], also used by [7]. It is especially important for the present application because it can be applied to vectors with \(O(m \log m)\) operations, which is significantly faster than the \(\Omega (m^2)\) running time of naïve matrix–vector multiplication.

For each \(i=1,2,\cdots ,m\), the distribution of \(\sqrt{m} \Theta ^{\scriptscriptstyle \top }e_i\) is the same as that of \(s\), and therefore
$$\begin{aligned} \mathbb {E}\left[ \exp \left( \alpha ^{\scriptscriptstyle \top }\bigl (\sqrt{m} \Theta ^{\scriptscriptstyle \top }e_i\bigr ) \right) \right] = \mathbb {E}\left[ \exp \left( \alpha ^{\scriptscriptstyle \top }s \right) \right] \le \exp (\Vert \alpha \Vert ^2 / 2), \end{aligned}$$
where the last step follows by Hoeffding’s inequality. Therefore, the condition (13) is satisfied with \(\kappa =1\).
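
A minimal sketch of applying this \(\Theta \) in \(O(m \log m)\) time via the iterative (butterfly) Walsh–Hadamard transform is given below in Python/NumPy. The function names and the small sanity check against an explicit Sylvester-ordered Hadamard matrix are illustrative and are not taken from [1] or [7].

import numpy as np

def fwht(x):
    # Iterative fast Walsh-Hadamard transform: returns H x for the Sylvester-ordered
    # m x m Hadamard matrix H (m a power of two), using O(m log m) operations.
    x = x.copy()
    m = len(x)
    h = 1
    while h < m:
        for i in range(0, m, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def apply_theta(x, s):
    # Theta x = H diag(s) x / sqrt(m).
    return fwht(s * x) / np.sqrt(len(x))

rng = np.random.default_rng(3)
m = 8
s = rng.choice([-1.0, 1.0], size=m)        # Rademacher signs
x = rng.standard_normal(m)

# Sanity check against the explicit Hadamard matrix (Sylvester construction).
H = np.array([[1.0]])
while H.shape[0] < m:
    H = np.block([[H, H], [H, -H]])
assert np.allclose(apply_theta(x, s), H @ (s * x) / np.sqrt(m))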

5 Proofs of Theorem 1 and Theorem 2

The proof of Theorem 2 uses the decomposition of \(\Vert \hat{\beta }_{\lambda }- \beta \Vert _\varSigma ^2\) in Proposition 3, and then bounds each term using the lemmas proved in this section.

The proof of Theorem 1 omits one term from the decomposition in Proposition 3 due to the fact that \(\beta = \beta _{\lambda }\) when \(\lambda = 0\); and it uses a slightly simpler argument to handle the effect of noise (Lemma 6 rather than Lemma 7), which reduces the number of lower-order terms. Other than these differences, the proof is the same as that for Theorem 2 in the special case of \(\lambda = 0\).

Define
$$\begin{aligned} \varSigma _{\lambda }&:= \varSigma + \lambda I , \end{aligned}$$
(14)
$$\begin{aligned} \widehat{\varSigma }_{\lambda }&:= \widehat{\varSigma }+ \lambda I , \quad \text {and}\end{aligned}$$
(15)
$$\begin{aligned} \Delta _{\lambda }&:= \varSigma _{\lambda }^{-1/2} (\widehat{\varSigma }- \varSigma ) \varSigma _{\lambda }^{-1/2} \nonumber \\&~= \varSigma _{\lambda }^{-1/2} (\widehat{\varSigma }_{\lambda }- \varSigma _{\lambda }) \varSigma _{\lambda }^{-1/2}. \end{aligned}$$
(16)
Recall the basic decomposition from Proposition 3:
$$\begin{aligned} \Vert \hat{\beta }_{\lambda }- \beta \Vert _\varSigma ^2 \le \left( \Vert \beta _{\lambda }- \beta \Vert _\varSigma + \Vert \bar{\beta }_{\lambda }- \beta _{\lambda }\Vert _\varSigma + \Vert \hat{\beta }_{\lambda }- \bar{\beta }_{\lambda }\Vert _\varSigma \right) ^2. \end{aligned}$$
Section 5.1 first establishes basic properties of \(\beta \) and \(\beta _{\lambda }\), which are then used to bound \(\Vert \beta _{\lambda }- \beta \Vert _\varSigma ^2\); this part is exactly the same as the standard fixed design analysis of ridge regression. Section 5.2 employs probability tail inequalities for the spectral and Frobenius norms of random matrices to bound the matrix errors in estimating \(\varSigma \) with \(\widehat{\varSigma }\). Finally, Sects. 5.3 and 5.4 bound the contributions of approximation error (in \(\Vert \bar{\beta }_{\lambda }- \beta _{\lambda }\Vert _\varSigma ^2\)) and noise (in \(\Vert \hat{\beta }_{\lambda }- \bar{\beta }_{\lambda }\Vert _\varSigma ^2\)), respectively, using probability tail inequalities for random vectors as well as the matrix error bounds for \(\widehat{\varSigma }\).

5.1 Basic Properties of \(\beta \) and \(\beta _{\lambda }\), and the Effect of Regularization

The following propositions are well known in the study of inverse problems:

Proposition 4

(Normal equations) \(\mathbb {E}[\langle w,x \rangle y] = \mathbb {E}[\langle w,x \rangle \langle \beta ,x \rangle ]\) for any \(w\).

Proof

It suffices to prove the claim for \(w = v_j\). Since \(\mathbb {E}[\langle v_j,x \rangle \langle v_{j'},x \rangle ] = 0\) for \(j' \ne j\), it follows that \(\mathbb {E}[\langle v_j,x \rangle \langle \beta ,x \rangle ] = \sum _{j'} \beta _{j'} \mathbb {E}[\langle v_j,x \rangle \langle v_{j'},x \rangle ] = \beta _j \mathbb {E}[\langle v_j,x \rangle ^2] = \mathbb {E}[\langle v_j,x \rangle y]\), where the last equality follows from the definition of \(\beta \) in (2). \(\square \)

Proposition 5

(Excess mean squared error) \(\mathbb {E}[(\langle w,x \rangle -y)^2] - \mathbb {E}[(\langle \beta ,x \rangle -y)^2] = \mathbb {E}[\langle w-\beta ,x \rangle ^2]\) for any \(w\).

Proof

Directly expanding the squares in the expectations reveals that
$$\begin{aligned}&\mathbb {E}[(\langle w,x \rangle -y)^2] - \mathbb {E}[(\langle \beta ,x \rangle -y)^2]\\&=\mathbb {E}[\langle w,x \rangle ^2] - 2\mathbb {E}[\langle w,x \rangle y] + 2\mathbb {E}[\langle \beta ,x \rangle y] - \mathbb {E}[\langle \beta ,x \rangle ^2] \\&= \mathbb {E}[\langle w,x \rangle ^2] -2\mathbb {E}[\langle w,x \rangle \langle \beta ,x \rangle ] + 2\mathbb {E}[\langle \beta ,x \rangle \langle \beta ,x \rangle ] - \mathbb {E}[\langle \beta ,x \rangle ^2] \\&= \mathbb {E}[\langle w,x \rangle ^2 -2\langle w,x \rangle \langle \beta ,x \rangle + \langle \beta ,x \rangle ^2] \\&=\mathbb {E}[\langle w-\beta ,x \rangle ^2], \end{aligned}$$
where the second equality follows from Proposition 4. \(\square \)

Proposition 6

(Shrinkage) For any \(j\),
$$\begin{aligned} \langle v_j,\beta _{\lambda } \rangle = \frac{\lambda _j}{\lambda _j + \lambda } \beta _j. \end{aligned}$$

Proof

Since \((\varSigma + \lambda I)^{-1} = \sum _j (\lambda _j + \lambda )^{-1} v_j \otimes v_j\),
$$\begin{aligned} \langle v_j,\beta _{\lambda } \rangle&= \langle v_j,(\varSigma + \lambda I)^{-1} \mathbb {E}[xy] \rangle = \frac{1}{\lambda _j + \lambda } \mathbb {E}[\langle v_j,x \rangle y]\\&= \frac{\lambda _j}{\lambda _j + \lambda } \frac{\mathbb {E}[\langle v_j,x \rangle y]}{\mathbb {E}[\langle v_j,x \rangle ^2]} =\frac{\lambda _j}{\lambda _j + \lambda } \beta _j. \end{aligned}$$
\(\square \)

Proposition 7

(Effect of regularization)
$$\begin{aligned} \Vert \beta -\beta _{\lambda }\Vert _\varSigma ^2 = \sum _j \frac{\lambda _j}{(\frac{\lambda _j}{\lambda } + 1)^2} \beta _j^2. \end{aligned}$$

Proof

By Proposition 6,
$$\begin{aligned} \langle v_j,\beta -\beta _{\lambda } \rangle = \beta _j - \frac{\lambda _j}{\lambda _j + \lambda } \beta _j = \frac{\lambda }{\lambda _j + \lambda } \beta _j. \end{aligned}$$
Therefore,
$$\begin{aligned} \Vert \beta -\beta _{\lambda }\Vert _\varSigma ^2 = \sum _j \lambda _j \left( \frac{\lambda }{\lambda _j + \lambda } \beta _j \right) ^2 = \sum _j \frac{\lambda _j}{(\frac{\lambda _j}{\lambda } + 1)^2} \beta _j^2. \end{aligned}$$
\(\square \)
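
Propositions 6 and 7 are easy to verify numerically. The sketch below (Python/NumPy) works directly in the eigenbasis of \(\varSigma \) and assumes, for illustration only, a diagonal covariance with the eigenvalues listed in the code and a well-specified linear model, so that \(\mathbb {E}[xy] = \varSigma \beta \).

import numpy as np

rng = np.random.default_rng(4)
d, lam = 5, 0.3                                 # illustrative dimension and lambda
eigs = np.array([2.0, 1.0, 0.5, 0.1, 0.01])     # eigenvalues lambda_j of Sigma
beta = rng.standard_normal(d)                   # coordinates beta_j in the eigenbasis

Sigma = np.diag(eigs)                           # work in the eigenbasis, so v_j = e_j
# Well-specified model: E[x y] = Sigma beta, hence beta_lam = (Sigma + lam I)^{-1} E[x y].
beta_lam = np.linalg.solve(Sigma + lam * np.eye(d), Sigma @ beta)

# Proposition 6: <v_j, beta_lam> = lambda_j / (lambda_j + lam) * beta_j.
assert np.allclose(beta_lam, eigs / (eigs + lam) * beta)

# Proposition 7: ||beta - beta_lam||_Sigma^2 = sum_j lambda_j beta_j^2 / (lambda_j / lam + 1)^2.
lhs = (beta - beta_lam) @ Sigma @ (beta - beta_lam)
rhs = np.sum(eigs * beta**2 / (eigs / lam + 1) ** 2)
assert np.allclose(lhs, rhs)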

5.2 Effect of Errors in \(\widehat{\varSigma }\)

Lemma 2

(Spectral norm error in \(\widehat{\varSigma }\)) Assume Condition 1 (with parameter \(\rho _{\lambda }\)) holds. Pick \(t > \max \{0, 2.6 - \log \tilde{d}_{1,\lambda }\}\). With probability at least \(1-\mathrm {e}^{-t}\),
$$\begin{aligned} \Vert \Delta _{\lambda }\Vert \le \sqrt{\frac{4\rho _{\lambda }^2d_{1,\lambda }(\log \tilde{d}_{1,\lambda }+ t)}{n}} + \frac{2\rho _{\lambda }^2d_{1,\lambda }(\log \tilde{d}_{1,\lambda }+ t)}{3n}, \end{aligned}$$
where \(\Delta _{\lambda }\) is defined in (16).

Proof

The claim is a consequence of the tail inequality from Lemma 10. First, define
$$\begin{aligned} \tilde{x}:= \varSigma _{\lambda }^{-1/2} x \quad \text {and}\quad \widetilde{\varSigma }:= \varSigma _{\lambda }^{-1/2} \varSigma \varSigma _{\lambda }^{-1/2} \end{aligned}$$
(where \(\varSigma _{\lambda }\) is defined in (14)), and let
$$\begin{aligned} Z&:= \tilde{x}\otimes \tilde{x}- \widetilde{\varSigma }\\&= \varSigma _{\lambda }^{-1/2} (x \otimes x - \varSigma ) \varSigma _{\lambda }^{-1/2} \end{aligned}$$
so \(\Delta _{\lambda }= \widehat{\mathbb {E}}[Z]\). Observe that \(\mathbb {E}[Z] = 0\) and
$$\begin{aligned} \Vert Z\Vert = \max \{ \lambda _{\max }[Z], \lambda _{\max }[-Z] \} \le \max \{ \Vert \tilde{x}\Vert ^2, 1 \} \le \rho _{\lambda }^2 d_{1,\lambda }, \end{aligned}$$
where the second inequality follows from Condition 1. Moreover,
$$\begin{aligned} \mathbb {E}[Z^2] = \mathbb {E}[(\tilde{x}\otimes \tilde{x})^2] - \widetilde{\varSigma }^2 = \mathbb {E}[\Vert \tilde{x}\Vert ^2 (\tilde{x}\otimes \tilde{x})] - \widetilde{\varSigma }^2 \end{aligned}$$
so
$$\begin{aligned} \lambda _{\max }[\mathbb {E}[Z^2]]&\le \lambda _{\max }[\mathbb {E}[(\tilde{x}\otimes \tilde{x})^2]] \le \rho _{\lambda }^2 d_{1,\lambda }\lambda _{\max }[\widetilde{\varSigma }] \le \rho _{\lambda }^2 d_{1,\lambda }\\ {{\mathrm{tr}}}(\mathbb {E}[Z^2])&\le {{\mathrm{tr}}}(\mathbb {E}[\Vert \tilde{x}\Vert ^2 (\tilde{x}\otimes \tilde{x})]) \le \rho _{\lambda }^2 d_{1,\lambda }{{\mathrm{tr}}}(\widetilde{\varSigma }) = \rho _{\lambda }^2 d_{1,\lambda }^2. \end{aligned}$$
The claim now follows from Lemma 10 (recall that \(\tilde{d}_{1,\lambda }= \max \{ 1, d_{1,\lambda }\}\)). \(\square \)
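
To illustrate the rate in Lemma 2, the following short simulation (Python/NumPy) tracks \(\Vert \Delta _{\lambda }\Vert \) as \(n\) grows and compares it with the leading \(\sqrt{d_{1,\lambda }\log \tilde{d}_{1,\lambda }/n}\) term of the bound, with the constants and the \(\rho _{\lambda }\) factor dropped. Gaussian covariates and the specific eigenvalue profile are illustrative assumptions (a Gaussian \(x\) satisfies Condition 1 only up to truncation, so this is a heuristic check of the rate, not of the constants).

import numpy as np

rng = np.random.default_rng(5)
d, lam = 20, 0.1                                # illustrative dimension and lambda
eigs = 1.0 / np.arange(1, d + 1)                # eigenvalues of Sigma
Sigma = np.diag(eigs)
W = np.diag(1.0 / np.sqrt(eigs + lam))          # Sigma_lam^{-1/2}
d1 = np.sum(eigs / (eigs + lam))                # effective dimension d_{1,lam}

for n in [100, 400, 1600, 6400]:
    X = rng.standard_normal((n, d)) * np.sqrt(eigs)     # rows x_i ~ N(0, Sigma)
    Sigma_hat = X.T @ X / n
    Delta = W @ (Sigma_hat - Sigma) @ W
    lead_term = np.sqrt(d1 * np.log(max(d1, 1.0)) / n)  # leading term of Lemma 2's bound
    print(n, np.linalg.norm(Delta, 2), lead_term)       # both decay like 1 / sqrt(n)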

Lemma 3

(Relative spectral norm error in \(\widehat{\varSigma }_{\lambda }\)) If \(\Vert \Delta _{\lambda }\Vert < 1\) where \(\Delta _{\lambda }\) is defined in (16), then
$$\begin{aligned} \Vert \varSigma _{\lambda }^{1/2} \widehat{\varSigma }_{\lambda }^{-1} \varSigma _{\lambda }^{1/2}\Vert \le \frac{1}{1 - \Vert \Delta _{\lambda }\Vert }, \end{aligned}$$
where \(\varSigma _{\lambda }\) is defined in (14) and \(\widehat{\varSigma }_{\lambda }\) is defined in (15).

Proof

Observe that
$$\begin{aligned} \varSigma _{\lambda }^{-1/2} \widehat{\varSigma }_{\lambda }\varSigma _{\lambda }^{-1/2}&= \varSigma _{\lambda }^{-1/2} (\varSigma _{\lambda }+ \widehat{\varSigma }_{\lambda }- \varSigma _{\lambda }) \varSigma _{\lambda }^{-1/2} \\&= I + \varSigma _{\lambda }^{-1/2} (\widehat{\varSigma }_{\lambda }- \varSigma _{\lambda }) \varSigma _{\lambda }^{-1/2} \\&= I + \Delta _{\lambda }, \end{aligned}$$
and that
$$\begin{aligned} \lambda _{\min }[I + \Delta _{\lambda }] \ge 1 - \Vert \Delta _{\lambda }\Vert > 0 \end{aligned}$$
by the assumption \(\Vert \Delta _{\lambda }\Vert < 1\) and Weyl’s theorem [10]. Therefore,
$$\begin{aligned} \Vert \varSigma _{\lambda }^{1/2} \widehat{\varSigma }_{\lambda }^{-1} \varSigma _{\lambda }^{1/2}\Vert&= \lambda _{\max }[(\varSigma _{\lambda }^{-1/2} \widehat{\varSigma }_{\lambda }\varSigma _{\lambda }^{-1/2})^{-1}] = \lambda _{\max }[(I + \Delta _{\lambda })^{-1}]\\&= \frac{1}{\lambda _{\min }[I + \Delta _{\lambda }]}\le \frac{1}{1 - \Vert \Delta _{\lambda }\Vert }. \end{aligned}$$
\(\square \)

Lemma 4

(Frobenius norm error in \(\widehat{\varSigma }\)) Assume Condition 1 (with parameter \(\rho _{\lambda }\)) holds. Pick any \(t > 0\). With probability at least \(1-\mathrm {e}^{-t}\),
$$\begin{aligned} \Vert \Delta _{\lambda }\Vert _{\mathrm{F }}&\le \sqrt{\frac{\mathbb {E}[\Vert \varSigma _{\lambda }^{-1/2}x\Vert ^4] - d_{2,\lambda }}{n}} (1 + \sqrt{8t}) + \frac{4\sqrt{\rho _{\lambda }^4d_{1,\lambda }^2 + d_{2,\lambda }}t}{3n} \\&\le \sqrt{\frac{\rho _{\lambda }^2d_{1,\lambda }^2 - d_{2,\lambda }}{n}} (1 + \sqrt{8t}) + \frac{4\sqrt{\rho _{\lambda }^4d_{1,\lambda }^2 + d_{2,\lambda }}t}{3n}, \end{aligned}$$
where \(\Delta _{\lambda }\) is defined in (16).

Proof

The claim is a consequence of the tail inequality in Lemma 9. As in the proof of Lemma 2, define \(\tilde{x}:= \varSigma _{\lambda }^{-1/2} x\) and \(\widetilde{\varSigma }:= \varSigma _{\lambda }^{-1/2} \varSigma \varSigma _{\lambda }^{-1/2}\), and let \(Z := \tilde{x}\otimes \tilde{x}- \widetilde{\varSigma }\) so \(\Delta _{\lambda }= \widehat{\mathbb {E}}[Z]\). Now endow the space of self-adjoint linear operators with the inner product given by \(\langle A,B \rangle _{\mathrm{F }}:= {{\mathrm{tr}}}(AB)\), and note that this inner product induces the Frobenius norm via \(\Vert M\Vert _{\mathrm{F }}^2 = \langle M,M \rangle _{\mathrm{F }}\). Observe that \(\mathbb {E}[Z] = 0\) and
$$\begin{aligned} \Vert Z\Vert _{\mathrm{F }}^2&= \langle \tilde{x}\otimes \tilde{x}- \widetilde{\varSigma }, \tilde{x}\otimes \tilde{x}- \widetilde{\varSigma } \rangle _{\mathrm{F }}\\&= \langle \tilde{x}\otimes \tilde{x}, \tilde{x}\otimes \tilde{x} \rangle _{\mathrm{F }}- 2\langle \tilde{x}\otimes \tilde{x}, \widetilde{\varSigma } \rangle _{\mathrm{F }}+ \langle \widetilde{\varSigma }, \widetilde{\varSigma } \rangle _{\mathrm{F }}\\&= \Vert \tilde{x}\Vert ^4 - 2\Vert \tilde{x}\Vert _{\widetilde{\varSigma }}^2 + {{\mathrm{tr}}}(\widetilde{\varSigma }^2) \\&= \Vert \tilde{x}\Vert ^4 - 2\Vert \tilde{x}\Vert _{\widetilde{\varSigma }}^2 + d_{2,\lambda }\\&\le \rho _{\lambda }^4 d_{1,\lambda }^2 + d_{2,\lambda }, \end{aligned}$$
where the inequality follows from Condition 1. Moreover,
$$\begin{aligned} \mathbb {E}[\Vert Z\Vert _{\mathrm{F }}^2]&= \mathbb {E}[\langle \tilde{x}\otimes \tilde{x}, \tilde{x}\otimes \tilde{x} \rangle _{\mathrm{F }}] - \langle \widetilde{\varSigma }, \widetilde{\varSigma } \rangle _{\mathrm{F }}\\&= \mathbb {E}[\Vert \tilde{x}\Vert ^4] - d_{2,\lambda }\\&\le \rho _{\lambda }^2 d_{1,\lambda }\mathbb {E}[\Vert \tilde{x}\Vert ^2] - d_{2,\lambda }\\&= \rho _{\lambda }^2 d_{1,\lambda }^2 - d_{2,\lambda }, \end{aligned}$$
where the inequality again uses Condition 1. The claim now follows from Lemma 9. \(\square \)

5.3 Effect of Approximation Error

Lemma 5

(Effect of approximation error) Assume Condition 1 (with parameter \(\rho _{\lambda }\)) and Condition 3 (with parameter \(b_{\lambda }\)) hold. Pick any \(t > 0\). If \(\Vert \Delta _{\lambda }\Vert < 1\) where \(\Delta _{\lambda }\) is defined in (16), then
$$\begin{aligned} \Vert \bar{\beta }_{\lambda }- \beta _{\lambda }\Vert _\varSigma \le \frac{1}{1 - \Vert \Delta _{\lambda }\Vert } \Vert \widehat{\mathbb {E}}[x \mathrm{approx }_{\lambda }(x) - \lambda \beta _{\lambda }] \Vert _{\varSigma _{\lambda }^{-1}}, \end{aligned}$$
where \(\bar{\beta }_{\lambda }\) is defined in (12), \(\beta _{\lambda }\) is defined in (8), \(\mathrm{approx }_{\lambda }(x)\) is defined in (9), and \(\varSigma _{\lambda }\) is defined in (14). Moreover, with probability at least \(1-\mathrm {e}^{-t}\),
$$\begin{aligned} \Vert \widehat{\mathbb {E}}[x \mathrm{approx }_{\lambda }(x) - \lambda \beta _{\lambda }] \Vert _{\varSigma _{\lambda }^{-1}}&\le \sqrt{ \frac{\mathbb {E}[\Vert \varSigma _{\lambda }^{-1/2} (x \mathrm{approx }_{\lambda }(x) - \lambda \beta _{\lambda })\Vert ^2]}{n}} (1 + \sqrt{8t})\\&\quad + \frac{4(b_{\lambda }\sqrt{d_{1,\lambda }} + \Vert \beta -\beta _{\lambda }\Vert _\varSigma )t}{3n} \\&\le \sqrt{\frac{2(\rho _{\lambda }^2 d_{1,\lambda }\mathbb {E}[\mathrm{approx }_{\lambda }(x)^2] + \Vert \beta -\beta _{\lambda }\Vert _\varSigma ^2)}{n}} (1 + \sqrt{8t})\\&\quad + \frac{4(b_{\lambda }\sqrt{d_{1,\lambda }} + \Vert \beta -\beta _{\lambda }\Vert _\varSigma )t}{3n}. \end{aligned}$$

Proof

By the definitions of \(\bar{\beta }_{\lambda }\) and \(\beta _{\lambda }\),
$$\begin{aligned} \bar{\beta }_{\lambda }- \beta _{\lambda }&= \widehat{\varSigma }_{\lambda }^{-1} \left( \widehat{\mathbb {E}}[x\mathbb {E}[y|x]] - \widehat{\varSigma }_{\lambda }\beta _{\lambda }\right) \\&= \varSigma _{\lambda }^{-1/2} (\varSigma _{\lambda }^{1/2} \widehat{\varSigma }_{\lambda }^{-1} \varSigma _{\lambda }^{1/2}) \varSigma _{\lambda }^{-1/2} \left( \widehat{\mathbb {E}}[x(\mathrm{approx }(x) + \langle \beta ,x \rangle )] - \widehat{\varSigma }\beta _{\lambda }- \lambda \beta _{\lambda }\right) \\&= \varSigma _{\lambda }^{-1/2} (\varSigma _{\lambda }^{1/2} \widehat{\varSigma }_{\lambda }^{-1} \varSigma _{\lambda }^{1/2}) \varSigma _{\lambda }^{-1/2} \left( \widehat{\mathbb {E}}[x(\mathrm{approx }(x) \!+ \!\langle \beta ,x \rangle \!-\! \langle \beta _{\lambda },x \rangle )] \!-\! \lambda \beta _{\lambda }\right) \\&= \varSigma _{\lambda }^{-1/2} (\varSigma _{\lambda }^{1/2} \widehat{\varSigma }_{\lambda }^{-1} \varSigma _{\lambda }^{1/2}) \varSigma _{\lambda }^{-1/2} \left( \widehat{\mathbb {E}}[x \mathrm{approx }_{\lambda }(x) - \lambda \beta _{\lambda }] \right) . \end{aligned}$$
Therefore, using the submultiplicative property of the spectral norm,
$$\begin{aligned} \Vert \bar{\beta }_{\lambda }- \beta _{\lambda }\Vert _\varSigma&\le \Vert \varSigma ^{1/2} \varSigma _{\lambda }^{-1/2}\Vert \Vert \varSigma _{\lambda }^{1/2} \widehat{\varSigma }_{\lambda }^{-1} \varSigma _{\lambda }^{1/2}\Vert \Vert \widehat{\mathbb {E}}[x \mathrm{approx }_{\lambda }(x) - \lambda \beta _{\lambda }] \Vert _{\varSigma _{\lambda }^{-1}} \\&\le \frac{1}{1 - \Vert \Delta _{\lambda }\Vert } \Vert \widehat{\mathbb {E}}[x \mathrm{approx }_{\lambda }(x) - \lambda \beta _{\lambda }] \Vert _{\varSigma _{\lambda }^{-1}}, \end{aligned}$$
where the second inequality follows from Lemma 3 and because
$$\begin{aligned} \Vert \varSigma ^{1/2} \varSigma _{\lambda }^{-1/2}\Vert ^2 = \lambda _{\max }[\varSigma _{\lambda }^{-1/2} \varSigma \varSigma _{\lambda }^{-1/2}] = \max _i \frac{\lambda _i}{\lambda _i + \lambda } \le 1 . \end{aligned}$$
The second part of the claim is a consequence of the tail inequality in Lemma 9. Observe that \(\mathbb {E}[x\mathrm{approx }(x)] = \mathbb {E}[x(\mathbb {E}[y|x] - \langle \beta ,x \rangle )] = 0\) by Proposition 4, and that \(\mathbb {E}[x\langle \beta - \beta _{\lambda },x \rangle ] - \lambda \beta _{\lambda }= \varSigma \beta - (\varSigma + \lambda I) \beta _{\lambda }= 0\). Therefore,
$$\begin{aligned} \mathbb {E}[\varSigma _{\lambda }^{-1/2}(x\mathrm{approx }_{\lambda }(x) - \lambda \beta _{\lambda })] = \varSigma _{\lambda }^{-1/2} \mathbb {E}[x(\mathrm{approx }(x) + \langle \beta - \beta _{\lambda },x \rangle ) - \lambda \beta _{\lambda }] = 0. \end{aligned}$$
Moreover, by Proposition 6 and Proposition 7,
$$\begin{aligned} \Vert \lambda \varSigma _{\lambda }^{-1/2} \beta _{\lambda }\Vert ^2&= \sum _j \frac{\lambda ^2}{\lambda _j + \lambda } \langle v_j,\beta _{\lambda } \rangle ^2\nonumber \\&= \sum _j \frac{\lambda ^2}{\lambda _j + \lambda } \left( \frac{\lambda _j}{\lambda _j + \lambda } \beta _j \right) ^2\nonumber \\&\le \sum _j \frac{\lambda ^2}{\lambda _j + \lambda } \left( \frac{\lambda _j}{\lambda _j + \lambda } \right) \beta _j^2\nonumber \\&= \sum _j \frac{\lambda _j}{(\frac{\lambda _j}{\lambda } + 1)^2} \beta _j^2\nonumber \\&= \Vert \beta - \beta _{\lambda }\Vert _\varSigma ^2. \end{aligned}$$
(17)
Combining the inequality from (17) with Condition 3 and the triangle inequality, it follows that
$$\begin{aligned} \Vert \varSigma _{\lambda }^{-1/2} (x\mathrm{approx }_{\lambda }(x) - \lambda \beta _{\lambda })\Vert&\le \Vert \varSigma _{\lambda }^{-1/2} x\mathrm{approx }_{\lambda }(x)\Vert + \Vert \lambda \varSigma _{\lambda }^{-1/2} \beta _{\lambda }\Vert \\&\le b_{\lambda }\sqrt{d_{1,\lambda }} + \Vert \beta - \beta _{\lambda }\Vert _\varSigma . \end{aligned}$$
Finally, by the triangle inequality, the fact \((a+b)^2 \le 2(a^2+b^2)\), the inequality from (17), and Condition 1,
$$\begin{aligned} \mathbb {E}[\Vert \varSigma _{\lambda }^{-1/2} (x\mathrm{approx }_{\lambda }(x) - \lambda \beta _{\lambda })\Vert ^2]&\le 2 (\mathbb {E}[\Vert \varSigma _{\lambda }^{-1/2}x\mathrm{approx }_{\lambda }(x)\Vert ^2] + \Vert \beta _{\lambda }-\beta \Vert _\varSigma ^2) \\&\le 2 (\rho _{\lambda }^2 d_{1,\lambda }\mathbb {E}[\mathrm{approx }_{\lambda }(x)^2] + \Vert \beta _{\lambda }-\beta \Vert _\varSigma ^2). \end{aligned}$$
The claim now follows from Lemma 9. \(\square \)

5.4 Effect of Noise

Lemma 6

(Effect of noise, \(\lambda = 0\)) Assume \(\lambda = 0\). Assume Condition 2 (with parameter \(\sigma \)) holds. Pick any \(t > 0\). With probability at least \(1-\mathrm {e}^{-t}\), either \(\Vert \Delta _0\Vert \ge 1\), or
$$\begin{aligned} \Vert \Delta _0\Vert < 1 \quad \text {and}\quad \Vert \bar{\beta }_0 - \hat{\beta }_0\Vert _\varSigma ^2 \le \frac{1}{1-\Vert \Delta _0\Vert } \cdot \frac{\sigma ^2 ( d + 2\sqrt{dt} + 2 t)}{n}, \end{aligned}$$
where \(\Delta _0\) is defined in (16).

Proof

Observe that
$$\begin{aligned} \Vert \bar{\beta }_0 - \hat{\beta }_0\Vert _\varSigma ^2 \le \Vert \varSigma ^{1/2} \widehat{\varSigma }^{-1/2}\Vert ^2 \Vert \bar{\beta }_0 - \hat{\beta }_0\Vert _{\widehat{\varSigma }}^2 = \Vert \varSigma ^{1/2} \widehat{\varSigma }^{-1} \varSigma ^{1/2}\Vert \Vert \bar{\beta }_0 - \hat{\beta }_0\Vert _{\widehat{\varSigma }}^2 ; \end{aligned}$$
and if \(\Vert \Delta _0\Vert < 1\), then \(\Vert \varSigma ^{1/2} \widehat{\varSigma }^{-1} \varSigma ^{1/2}\Vert \le 1/(1 - \Vert \Delta _0\Vert )\) by Lemma 3.
Let \(\xi := (\mathrm{noise }(x_1),\mathrm{noise }(x_2),\cdots ,\mathrm{noise }(x_n))\) be the random vector whose \(i\)-th component is \(\mathrm{noise }(x_i) = y_i - \mathbb {E}[y_i|x_i]\). By the definition of \(\hat{\beta }_0\) and \(\bar{\beta }_0\),
$$\begin{aligned} \Vert \hat{\beta }_0 - \bar{\beta }_0\Vert _{\widehat{\varSigma }}^2 = \Vert \widehat{\varSigma }^{-1/2} \widehat{\mathbb {E}}[x(y - \mathbb {E}[y|x])]\Vert ^2 = \xi ^{\scriptscriptstyle \top }\widehat{K} \xi , \end{aligned}$$
where \(\widehat{K} \in \mathbb {R}^{n \times n}\) is the symmetric matrix whose \((i,j)\)-th entry is \(\widehat{K}_{i,j} := n^{-2} \langle \widehat{\varSigma }^{-1/2} x_i, \widehat{\varSigma }^{-1/2} x_j \rangle \). Note that the nonzero eigenvalues of \(\widehat{K}\) are the same as those of
$$\begin{aligned} \frac{1}{n} \widehat{\mathbb {E}}\left[ (\widehat{\varSigma }^{-1/2} x) \otimes (\widehat{\varSigma }^{-1/2} x) \right] = \frac{1}{n} \widehat{\varSigma }^{-1/2} \widehat{\varSigma }\widehat{\varSigma }^{-1/2} = \frac{1}{n} I. \end{aligned}$$
By Lemma 8, with probability at least \(1-\mathrm {e}^{-t}\) (conditioned on \(x_1,x_2,\cdots ,x_n\)),
$$\begin{aligned} \xi ^{\scriptscriptstyle \top }\widehat{K} \xi \le \sigma ^2 ( {{\mathrm{tr}}}(\widehat{K}) + 2\sqrt{{{\mathrm{tr}}}(\widehat{K}^2)t} + 2\lambda _{\max }(\widehat{K}) t) = \frac{\sigma ^2 (d + 2\sqrt{dt} + 2t)}{n}. \end{aligned}$$
The claim follows. \(\square \)
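
The identity used here — that the nonzero eigenvalues of \(\widehat{K}\) all equal \(1/n\), so the quadratic form concentrates around \(\sigma ^2 d/n\) — can be checked with a short simulation (Python/NumPy). The Gaussian covariates and Gaussian noise are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(6)
n, d, sigma = 500, 10, 0.5                      # illustrative sizes and noise level
X = rng.standard_normal((n, d))                 # covariates x_1, ..., x_n
xi = sigma * rng.standard_normal(n)             # noise(x_i) = y_i - E[y_i | x_i]

Sigma_hat = X.T @ X / n
K_hat = X @ np.linalg.inv(Sigma_hat) @ X.T / n**2   # K_hat[i,j] = <Shat^{-1/2} x_i, Shat^{-1/2} x_j> / n^2

# The nonzero eigenvalues of K_hat all equal 1/n, so tr(K_hat) = d/n and lambda_max = 1/n.
print(np.sort(np.linalg.eigvalsh(K_hat))[-d:] * n)  # approximately [1, 1, ..., 1]

# The quadratic form xi' K_hat xi concentrates around sigma^2 d / n.
print(xi @ K_hat @ xi, sigma**2 * d / n)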

Lemma 7

(Effect of noise, \(\lambda \ge 0\)) Assume Condition 2 (with parameter \(\sigma \)) holds. Pick any \(t > 0\). Let \(K\) be the \(n \times n\) symmetric matrix whose \((i,j)\)-th entry is
$$\begin{aligned} K_{i,j} := \frac{1}{n^2} \langle \varSigma ^{1/2} \widehat{\varSigma }_{\lambda }^{-1} x_i, \varSigma ^{1/2} \widehat{\varSigma }_{\lambda }^{-1} x_j \rangle , \end{aligned}$$
where \(\widehat{\varSigma }_{\lambda }\) is defined in (15). With probability at least \(1-\mathrm {e}^{-t}\),
$$\begin{aligned} \Vert \bar{\beta }_{\lambda }- \hat{\beta }_{\lambda }\Vert _\varSigma ^2 \le \sigma ^2 ( {{\mathrm{tr}}}(K) + 2\sqrt{{{\mathrm{tr}}}(K)\lambda _{\max }(K)t} + 2\lambda _{\max }(K) t). \end{aligned}$$
Moreover, if \(\Vert \Delta _{\lambda }\Vert < 1\) where \(\Delta _{\lambda }\) is defined in (16), then
$$\begin{aligned} \lambda _{\max }(K) \le \frac{1}{n(1-\Vert \Delta _{\lambda }\Vert )} \quad \text {and}\quad {{\mathrm{tr}}}(K) \le \frac{d_{2,\lambda }+ \sqrt{d_{2,\lambda }\Vert \Delta _{\lambda }\Vert _{\mathrm{F }}^2}}{n(1-\Vert \Delta _{\lambda }\Vert )^2}. \end{aligned}$$

Proof

Let \(\xi := (\mathrm{noise }(x_1),\mathrm{noise }(x_2),\cdots ,\mathrm{noise }(x_n))\) be the random vector whose \(i\)-th component is \(\mathrm{noise }(x_i) = y_i - \mathbb {E}[y_i|x_i]\). By the definition of \(\hat{\beta }_{\lambda }\), \(\bar{\beta }_{\lambda }\), and \(K\),
$$\begin{aligned} \Vert \hat{\beta }_{\lambda }- \bar{\beta }_{\lambda }\Vert _\varSigma ^2 = \Vert \widehat{\varSigma }_{\lambda }^{-1} \widehat{\mathbb {E}}[x(y - \mathbb {E}[y|x])]\Vert _\varSigma ^2 = \xi ^{\scriptscriptstyle \top }K \xi . \end{aligned}$$
By Lemma 8, with probability at least \(1-\mathrm {e}^{-t}\) (conditioned on \(x_1,x_2,\cdots ,x_n\)),
$$\begin{aligned} \xi ^{\scriptscriptstyle \top }K \xi&\le \sigma ^2 ( {{\mathrm{tr}}}(K) + 2\sqrt{{{\mathrm{tr}}}(K^2)t} + 2\lambda _{\max }(K) t) \\&\le \sigma ^2 ( {{\mathrm{tr}}}(K) + 2\sqrt{{{\mathrm{tr}}}(K)\lambda _{\max }(K)t} + 2\lambda _{\max }(K) t), \end{aligned}$$
where the second inequality follows from von Neumann’s theorem [10].
Note that the nonzero eigenvalues of \(K\) are the same as those of
$$\begin{aligned} \frac{1}{n} \widehat{\mathbb {E}}\left[ (\varSigma ^{1/2} \widehat{\varSigma }_{\lambda }^{-1} x) \otimes (\varSigma ^{1/2} \widehat{\varSigma }_{\lambda }^{-1} x) \right] = \frac{1}{n} \varSigma ^{1/2} \widehat{\varSigma }_{\lambda }^{-1} \widehat{\varSigma }\widehat{\varSigma }_{\lambda }^{-1} \varSigma ^{1/2} . \end{aligned}$$
To bound \(\lambda _{\max }(K)\), observe that by the submultiplicative property of the spectral norm and Lemma 3,
$$\begin{aligned} n \lambda _{\max }(K)&= \Vert \varSigma ^{1/2} \widehat{\varSigma }_{\lambda }^{-1} \widehat{\varSigma }^{1/2} \Vert ^2 \\&\le \Vert \varSigma ^{1/2} \varSigma _{\lambda }^{-1/2}\Vert ^2\Vert \varSigma _{\lambda }^{1/2} \widehat{\varSigma }_{\lambda }^{-1/2}\Vert ^2\Vert \widehat{\varSigma }_{\lambda }^{-1/2} \widehat{\varSigma }^{1/2}\Vert ^2 \\&\le \Vert \varSigma _{\lambda }^{1/2} \widehat{\varSigma }_{\lambda }^{-1/2}\Vert ^2\\&=\Vert \varSigma _{\lambda }^{1/2} \widehat{\varSigma }_{\lambda }^{-1} \varSigma _{\lambda }^{1/2}\Vert \\&\le \frac{1}{1 - \Vert \Delta _{\lambda }\Vert }. \end{aligned}$$
To bound \({{\mathrm{tr}}}(K)\), first define the \(\lambda \)-whitened versions of \(\varSigma \), \(\widehat{\varSigma }\), and \(\widehat{\varSigma }_{\lambda }\) as
$$\begin{aligned} \varSigma _w&:= \varSigma _{\lambda }^{-1/2} \varSigma \varSigma _{\lambda }^{-1/2}, \\ \widehat{\varSigma }_w&:= \varSigma _{\lambda }^{-1/2} \widehat{\varSigma }\varSigma _{\lambda }^{-1/2}, \\ \widehat{\varSigma }_{\lambda ,w}&:= \varSigma _{\lambda }^{-1/2} \widehat{\varSigma }_{\lambda }\varSigma _{\lambda }^{-1/2}. \end{aligned}$$
Using these definitions with the cycle property of the trace,
$$\begin{aligned} n {{\mathrm{tr}}}(K)&= {{\mathrm{tr}}}(\varSigma ^{1/2} \widehat{\varSigma }_{\lambda }^{-1} \widehat{\varSigma }\widehat{\varSigma }_{\lambda }^{-1} \varSigma ^{1/2})\\&= {{\mathrm{tr}}}(\widehat{\varSigma }_{\lambda }^{-1} \widehat{\varSigma }\widehat{\varSigma }_{\lambda }^{-1} \varSigma )\\&= {{\mathrm{tr}}}(\widehat{\varSigma }_{\lambda ,w}^{-1} \widehat{\varSigma }_w\widehat{\varSigma }_{\lambda ,w}^{-1} \varSigma _w). \end{aligned}$$
Let \(\{ \lambda _j[M] \}\) denote the eigenvalues of a linear operator \(M\). By von Neumann’s theorem [10],
$$\begin{aligned} {{\mathrm{tr}}}(\widehat{\varSigma }_{\lambda ,w}^{-1} \widehat{\varSigma }_w\widehat{\varSigma }_{\lambda ,w}^{-1} \varSigma _w) \le \sum _j \lambda _j[\widehat{\varSigma }_{\lambda ,w}^{-1} \widehat{\varSigma }_w\widehat{\varSigma }_{\lambda ,w}^{-1}] \lambda _j[\varSigma _w] \end{aligned}$$
and by Ostrowski’s theorem [10],
$$\begin{aligned} \lambda _j[\widehat{\varSigma }_{\lambda ,w}^{-1} \widehat{\varSigma }_w\widehat{\varSigma }_{\lambda ,w}^{-1}] \le \lambda _{\max }[\widehat{\varSigma }_{\lambda ,w}^{-2}] \lambda _j[\widehat{\varSigma }_w]. \end{aligned}$$
Therefore,
$$\begin{aligned}&{{\mathrm{tr}}}(\widehat{\varSigma }_{\lambda ,w}^{-1} \widehat{\varSigma }_w\widehat{\varSigma }_{\lambda ,w}^{-1} \varSigma _w)\\&\quad \le \lambda _{\max }[\widehat{\varSigma }_{\lambda ,w}^{-2}] \sum _j \lambda _j[\widehat{\varSigma }_w] \lambda _j[\varSigma _w]\\&\quad \le \frac{1}{(1 - \Vert \Delta _{\lambda }\Vert )^2}\sum _j \lambda _j[\widehat{\varSigma }_w] \lambda _j[\varSigma _w]\\&\quad = \frac{1}{(1 - \Vert \Delta _{\lambda }\Vert )^2}\sum _j\left( \lambda _j[\varSigma _w]^2 + (\lambda _j[\widehat{\varSigma }_w] - \lambda _j[\varSigma _w]) \lambda _j[\varSigma _w]\right) \\&\quad \le \frac{1}{(1 - \Vert \Delta _{\lambda }\Vert )^2} \left( \sum _j\lambda _j[\varSigma _w]^2+ \sqrt{\sum _j (\lambda _j[\widehat{\varSigma }_w] - \lambda _j[\varSigma _w])^2} \sqrt{\sum _j \lambda _j[\varSigma _w]^2}\right) \\&\quad = \frac{1}{(1 - \Vert \Delta _{\lambda }\Vert )^2} \left( d_{2,\lambda }+ \sqrt{\sum _j (\lambda _j[\widehat{\varSigma }_w] - \lambda _j[\varSigma _w])^2} \sqrt{d_{2,\lambda }}\right) \\&\quad \le \frac{1}{(1 - \Vert \Delta _{\lambda }\Vert )^2}\left( d_{2,\lambda }+ \Vert \widehat{\varSigma }_w- \varSigma _w\Vert _{\mathrm{F }}\sqrt{d_{2,\lambda }}\right) \\&\quad = \frac{1}{(1 - \Vert \Delta _{\lambda }\Vert )^2}\left( d_{2,\lambda }+ \Vert \Delta _{\lambda }\Vert _{\mathrm{F }}\sqrt{d_{2,\lambda }} \right) , \end{aligned}$$
where the second inequality follows from Lemma 3, the third inequality follows from Cauchy–Schwarz, and the fourth inequality follows from Mirsky’s theorem [21].

Acknowledgments

The authors thank Dean Foster, David McAllester, and Robert Stine for many insightful discussions.

Copyright information

© SFoCM 2014

Authors and Affiliations

  1. Department of Computer Science, Columbia University, New York, USA
  2. Microsoft Research, Cambridge, USA
  3. Department of Statistics, Rutgers University, Piscataway, USA
