Abstract
This work gives a simultaneous analysis of both the ordinary least squares estimator and the ridge regression estimator in the random design setting under mild assumptions on the covariate/response distributions. In particular, the analysis provides sharp results on the “out-of-sample” prediction error, as opposed to the “in-sample” (fixed design) error. The analysis also reveals the effect of errors in the estimated covariance structure, as well as the effect of modeling errors, neither of which arises in the fixed design setting. The proofs of the main results are based on a simple decomposition lemma combined with concentration inequalities for random vectors and matrices.
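To make the distinction concrete, the following small simulation (an illustrative sketch, not taken from the paper; the dimensions, regularization parameter, and noise level are arbitrary assumptions) computes the ridge estimator on a random design and compares its in-sample (fixed design) error with its out-of-sample prediction error.

```python
# Sketch: in-sample (fixed design) vs. out-of-sample (random design)
# error of the ridge estimator.  All parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, noise = 100, 20, 1.0, 0.5

beta = rng.normal(size=d) / np.sqrt(d)       # true linear predictor
X = rng.normal(size=(n, d))                  # random design matrix
y = X @ beta + noise * rng.normal(size=n)    # noisy responses

# Ridge estimator: beta_hat = (X'X + lam*I)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# In-sample error: averaged over the observed design points.
in_sample = np.mean((X @ (beta_hat - beta)) ** 2)

# Out-of-sample error: E[(x'(beta_hat - beta))^2] over a fresh x.
# The population covariance here is the identity, so this equals the
# squared Euclidean norm of the parameter error.
out_of_sample = np.sum((beta_hat - beta) ** 2)

print(f"in-sample error:     {in_sample:.4f}")
print(f"out-of-sample error: {out_of_sample:.4f}")
```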
References
N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. SIAM J. Comput., 39(1):302–322, 2009.
J.-Y. Audibert and O. Catoni. Linear regression through PAC-Bayesian truncation, 2010. arXiv:1010.0072.
J.-Y. Audibert and O. Catoni. Robust linear least squares regression. The Annals of Statistics, 39(5):2766–2794, 2011.
A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
O. Catoni. Statistical Learning Theory and Stochastic Optimization, Lectures on Probability and Statistics, Ecole d’Eté de Probabilités de Saint-Flour XXXI - 2001, volume 1851 of Lecture Notes in Mathematics. Springer, 2004.
P. Drineas and M. W. Mahoney. Effective resistances, statistical leverage, and applications to linear equation solving, 2010. arXiv:1005.3097.
P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2010.
L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2002.
A. E. Hoerl. Application of ridge analysis to regression problems. Chemical Engineering Progress, 58:54–59, 1962.
R. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
D. Hsu, S. M. Kakade, and T. Zhang. A tail inequality for quadratic forms of subgaussian random vectors, 2011. arXiv:1110.2842.
D. Hsu, S. M. Kakade, and T. Zhang. Tail inequalities for sums of random matrices that depend on the intrinsic dimension. Electronic Communications in Probability, 17(14):1–13, 2012.
D. Hsu and S. Sabato. Loss minimization and parameter estimation with heavy tails, 2013. arXiv:1307.1827.
V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.
B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.
E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer, second edition, 1998.
M. Nussbaum. Minimax risk: Pinsker bound. In S. Kotz, editor, Encyclopedia of Statistical Sciences, Update Volume 3, pages 451–460. Wiley, New York, 1999.
V. Rokhlin and M. Tygert. A fast randomized algorithm for overdetermined linear least-squares regression. Proc. Natl. Acad. Sci. USA, 105(36):13212–13217, 2008.
S. Smale and D.-X. Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26:153–172, 2007.
I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory, pages 79–93, 2009.
G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, 1990.
C. J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10:1040–1053, 1982.
T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17:2077–2098, 2005.
Acknowledgments
The authors thank Dean Foster, David McAllester, and Robert Stine for many insightful discussions.
Communicated by Tomaso Poggio.
Appendix: Probability Tail Inequalities
The following probability tail inequalities are used in our analysis. These specific inequalities were chosen to satisfy the general conditions set out in Sect. 2.4; however, the analysis can be specialized or generalized by substituting other tail inequalities of the same type.
The first tail inequality is for positive semidefinite quadratic forms of a sub-Gaussian random vector. It generalizes a standard tail inequality for Gaussian random vectors based on linear combinations of \(\chi ^2\) random variables [15].
Lemma 8
(Quadratic forms of a sub-Gaussian random vector [11]) Let \(\xi \) be a random vector taking values in \(\mathbb {R}^n\) such that for some \(c \ge 0\),
\[ \mathbb {E}\bigl [\exp (\langle \alpha , \xi \rangle )\bigr ] \le \exp \bigl ( c \Vert \alpha \Vert ^2 / 2 \bigr ) \quad \text {for all } \alpha \in \mathbb {R}^n . \]
For all symmetric positive semidefinite matrices \(K \succeq 0\), and all \(t > 0\),
\[ \Pr \Bigl [ \xi ^{\top } K \xi > c \bigl ( \mathrm {tr}(K) + 2 \sqrt{\mathrm {tr}(K^2) \, t} + 2 \Vert K \Vert t \bigr ) \Bigr ] \le \mathrm {e}^{-t} . \]
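As an illustrative sanity check (not part of the paper), the following Monte Carlo sketch instantiates the lemma with a standard Gaussian \(\xi \), for which the sub-Gaussian condition above holds with \(c = 1\), and verifies that the empirical tail probability falls below \(\mathrm {e}^{-t}\).

```python
# Monte Carlo check of Lemma 8 for a standard Gaussian vector (c = 1).
# The dimension, trial count, and choice of K are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 10, 200_000, 2.0

A = rng.normal(size=(n, n))
K = A @ A.T                                     # a symmetric PSD matrix
thresh = np.trace(K) + 2 * np.sqrt(np.trace(K @ K) * t) \
         + 2 * np.linalg.eigvalsh(K).max() * t  # threshold with c = 1

xi = rng.normal(size=(trials, n))
quad = np.einsum('ti,ij,tj->t', xi, K, xi)      # quadratic form xi' K xi

print(f"empirical tail: {np.mean(quad > thresh):.2e}")
print(f"bound e^-t:     {np.exp(-t):.2e}")
```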
The next lemma is a tail inequality for sums of bounded random vectors; it is a standard application of Bernstein’s inequality.
Lemma 9
(Vector Bernstein bound; see, e.g., [11]) Let \(x_1, x_2, \ldots , x_n\) be independent random vectors such that
\[ \sum _{i=1}^n \mathbb {E}\bigl [ \Vert x_i \Vert ^2 \bigr ] \le v \quad \text {and} \quad \Vert x_i \Vert \le r \]
for all \(i = 1, 2, \ldots , n\), almost surely. Let \(s := x_1 + x_2 + \cdots + x_n\). For all \(t > 0\),
\[ \Pr \Bigl [ \Vert s - \mathbb {E}[s] \Vert > \sqrt{v} \bigl ( 1 + \sqrt{8t} \bigr ) + \tfrac{4}{3} r t \Bigr ] \le \mathrm {e}^{-t} . \]
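The following sketch (again an illustration under assumed parameters, not from the paper) checks the bound empirically with each \(x_i\) drawn uniformly from the unit sphere, so that \(r = 1\), \(v = n\), and \(\mathbb {E}[s] = 0\) by symmetry; the bound is conservative in this setting, so the empirical tail should fall well below \(\mathrm {e}^{-t}\).

```python
# Monte Carlo check of Lemma 9 with x_i uniform on the unit sphere:
# ||x_i|| = 1 = r, sum_i E||x_i||^2 = n = v, and E[s] = 0 by symmetry.
import numpy as np

rng = np.random.default_rng(0)
n, d, trials, t = 50, 5, 20_000, 2.0
r, v = 1.0, float(n)

x = rng.normal(size=(trials, n, d))
x /= np.linalg.norm(x, axis=2, keepdims=True)  # project to unit sphere
s = x.sum(axis=1)                              # s = x_1 + ... + x_n

thresh = np.sqrt(v) * (1 + np.sqrt(8 * t)) + (4.0 / 3.0) * r * t
print(f"empirical tail: {np.mean(np.linalg.norm(s, axis=1) > thresh):.2e}")
print(f"bound e^-t:     {np.exp(-t):.2e}")
```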
The last tail inequality concerns the spectral accuracy of an empirical second moment matrix.
Lemma 10
(Matrix Bernstein bound [12]) Let \(X\) be a random symmetric matrix, and \(r > 0\), \(v > 0\), and \(k > 0\) be such that, almost surely,
\[ \mathbb {E}[X] = 0 , \quad \lambda _{\max }(X) \le r , \quad \lambda _{\max }\bigl ( \mathbb {E}[X^2] \bigr ) \le v , \quad \mathrm {tr}\bigl ( \mathbb {E}[X^2] \bigr ) \le v k . \]
If \(X_1, X_2, \ldots , X_n\) are independent copies of \(X\), then for any \(t > 0\),
\[ \Pr \Biggl [ \lambda _{\max }\Biggl ( \frac{1}{n} \sum _{i=1}^n X_i \Biggr ) > \sqrt{\frac{2 v t}{n}} + \frac{r t}{3 n} \Biggr ] \le k t \bigl ( \mathrm {e}^t - t - 1 \bigr )^{-1} . \]
If \(t \ge 2.6\), then \(t (\mathrm {e}^t - t - 1)^{-1} \le \mathrm {e}^{-t/2}\).
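To illustrate the hypotheses concretely (a sketch with assumed parameters, not taken from the paper), one can take \(X = d \, u u^{\top } - I\) with \(u\) uniform on the unit sphere in \(\mathbb {R}^d\): then \(\mathbb {E}[X] = 0\), \(\lambda _{\max }(X) = d - 1\), and \(\mathbb {E}[X^2] = (d-1) I\), so the lemma applies with \(r = d - 1\), \(v = d - 1\), and \(k = d\).

```python
# Monte Carlo check of Lemma 10 with X = d*uu' - I, u uniform on the
# unit sphere in R^d: E[X] = 0, lambda_max(X) = d - 1, E[X^2] = (d-1)I,
# so r = d - 1, v = d - 1, k = d satisfy the hypotheses.
import numpy as np

rng = np.random.default_rng(0)
d, n, trials, t = 5, 100, 10_000, 5.0
r, v, k = float(d - 1), float(d - 1), float(d)

u = rng.normal(size=(trials, n, d))
u /= np.linalg.norm(u, axis=2, keepdims=True)
S = np.einsum('tni,tnj->tij', u, u) / n        # (1/n) sum_i u_i u_i'
M = d * S - np.eye(d)                          # (1/n) sum_i X_i
lam_max = np.linalg.eigvalsh(M)[:, -1]         # largest eigenvalue per trial

thresh = np.sqrt(2 * v * t / n) + r * t / (3 * n)
bound = k * t / (np.exp(t) - t - 1)
print(f"empirical tail: {np.mean(lam_max > thresh):.2e}")
print(f"bound:          {bound:.2e}")
```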