Abstract
The notion of generalization in classical Statistical Learning is often attached to the postulate that data points are independent and identically distributed (IID) random variables. While relevant in many applications, this postulate may not hold in general, encouraging the development of learning frameworks that are robust to non-IID data. In this work, we consider the regression problem from an Optimal Recovery perspective. Relying on a model assumption comparable to choosing a hypothesis class, a learner aims to minimize the worst-case error, without recourse to any probabilistic assumption on the data. We first develop a semidefinite program for calculating the worst-case error of any recovery map in finite-dimensional Hilbert spaces. Then, for any Hilbert space, we show that Optimal Recovery provides a formula which is user-friendly from an algorithmic point of view, as long as the hypothesis class is linear. Interestingly, this formula coincides with kernel ridgeless regression in some cases, proving that minimizing the average error and the worst-case error can yield the same solution. We provide numerical experiments in support of our theoretical findings.
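When the Optimal Recovery formula coincides with kernel ridgeless regression, the learned function is simply the minimum-RKHS-norm interpolant of the data. The snippet below is a minimal numerical sketch of that interpolant, assuming a Gaussian kernel, synthetic toy data, and a pseudo-inverse to cope with near-singular kernel matrices; the function names (gaussian_kernel, ridgeless_fit) are illustrative and the snippet does not reproduce the paper's experiments.

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Z."""
    d2 = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Z**2, axis=1)[None, :]
          - 2 * X @ Z.T)
    return np.exp(-gamma * d2)

def ridgeless_fit(X, y, gamma=1.0):
    """Kernel 'ridgeless' regression: the minimum-RKHS-norm interpolant of (X, y)."""
    K = gaussian_kernel(X, X, gamma)
    alpha = np.linalg.pinv(K) @ y            # pseudo-inverse copes with near-singular K
    return lambda X_new: gaussian_kernel(X_new, X, gamma) @ alpha

# Toy usage on synthetic data: the fit interpolates the training targets.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 2))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1]
f = ridgeless_fit(X, y, gamma=2.0)
print(np.max(np.abs(f(X) - y)))              # close to zero up to numerical error
```

Adding a small ridge term to K would give the usual kernel ridge regression estimator; the ridgeless interpolant above is its limit as the regularization parameter tends to zero.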
References
Beck, A., Eldar, Y.C.: Regularization in regression with bounded noise: A Chebyshev center approach. SIAM J. Matrix Anal. Appl. 29(2), 606–625 (2007)
Belkin, M., Ma, S., Mandal, S.: To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pp. 541–549 (2018)
Binev, P., Cohen, A., Dahmen, W., DeVore, R., Petrova, G., Wojtaszczyk, P.: Data assimilation in reduced modeling. SIAM/ASA J. Uncertain. Quantif. 5(1), 1–29 (2017)
Cotter, A., Keshet, J., Srebro, N.: Explicit approximations of the Gaussian kernel. arXiv preprint arXiv:1109.4603 (2011)
de Boor, C.: Best approximation properties of spline functions of odd degree. J. Math. Mech. 747–749 (1963)
de Boor, C.: Computational aspects of optimal recovery. In Optimal Estimation in Approximation Theory, pp. 69–91. Springer (1977)
de Boor, C.: Convergence of abstract splines. J. Approx. Theory 31(1), 80–89 (1981)
De Vito, E., Rosasco, L., Caponnetto, A., Giovannini, U., Odone, F.: Learning from examples as an inverse problem. J. Mach. Learn. Res. 6, 883–904 (2005)
De Vito, E., Rosasco, L., Caponnetto, A., Piana, M., Verri, A.: Some properties of regularized kernel methods. J. Mach. Learn. Res. 5, 1363–1390 (2004)
Duchon, J.: Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In Constructive Theory of Functions of Several Variables, pp. 85–100. Springer (1977)
Ettehad, M., Foucart, S.: Instances of computational optimal recovery: dealing with observation errors. SIAM/ASA J. Uncertain. Quantif. 9, 1438–1456 (2021)
Foucart, S.: Instances of computational optimal recovery: refined approximability models. J. Complex. 62, 101503 (2021)
Foucart, S., Liao, C.: Optimal recovery from inaccurate data in Hilbert spaces: Regularize, but what of the parameter? arXiv preprint arXiv:2111.02601 (2021)
Gurobi Optimization, LLC: Gurobi Optimizer Reference Manual (2020)
Hazan, E.: Introduction to online convex optimization. Found. Trends Optim. 2(3–4), 157–325 (2016)
Li, W., Lee, K.-H., Leung, K.-S.: Generalized regularized least-squares learning with predefined features in a Hilbert space. In Advances in Neural Information Processing Systems, pp. 881–888 (2007)
Liang, T., Rakhlin, A.: Just interpolate: Kernel “ridgeless” regression can generalize. Ann. Stat. (2019)
Micchelli, C.A., Rivlin, T.J.: A survey of optimal recovery. In Optimal Estimation in Approximation Theory, pp. 1–54. Springer (1977)
Micchelli, C.A., Rivlin, T.J.: Lectures on optimal recovery. In Numerical Analysis Lancaster 1984, pp. 21–93. Springer (1985)
Minh, H.: Some properties of Gaussian reproducing kernel Hilbert spaces and their implications for function approximation and learning theory. Constr. Approx. 32, 307–338 (2010)
Owhadi, H., Scovel, C., Schäfer, F.: Statistical numerical approximation. Not. Am. Math. Soc. 66, 1608–1617 (2019)
Packel, E.W.: Do linear problems have linear optimal algorithms? SIAM Rev. 30(3), 388–403 (1988)
Plaskota, L.: Noisy Information and Computational Complexity, vol. 95. Cambridge University Press (1996)
Pólik, I., Terlaky, T.: A survey of the S-lemma. SIAM Rev. 49(3), 371–418 (2007)
Rakhlin, A., Zhai, X.: Consistency of interpolation with Laplace kernels is a high-dimensional phenomenon. In Conference on Learning Theory, pp. 2595–2623 (2019)
Traub, J.F.: Information-Based Complexity. John Wiley and Sons Ltd. (2003)
Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw. 10(5), 988–999 (1999)
Wahba, G.: Spline Models for Observational Data. Society for Industrial and Applied Mathematics (1990)
Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., Chandra, V.: Federated learning with non-iid data. arXiv preprint arXiv:1806.00582 (2018)
Acknowledgements
C. L. and Y. W. are supported by the Texas A&M Triads for Transformation (T3) Program. S. F. was/is partially supported by grants from the NSF (DMS-1622134, DMS-1664803, DMS-2053172) and from the ONR (N00014-20-1-2787). S. F. and S. S. also acknowledge NSF grant CCF-1934904.
Additional information
Communicated by Felix Krahmer.
This work was initiated when they were both with the Department of Industrial and Systems Engineering, Texas A&M University, College Station, TX 77843, USA.
Appendix
Below, we generalize the result of [3] in two directions. For the first direction, instead of assuming that the target function \(f_0\) itself is well approximated by elements of a set V, we assume that it is some linear transform T applied to \(f_0\) that is well approximated. This translates into the modification (44) of the approximability set. The novelty occurs not for invertible transforms, but for noninvertible ones (e.g., when T represents a derivative, as in (36)). For the second direction, instead of attempting to recover \(f_0\) in full, we assume that we only need to estimate a quantity \(Q(f_0)\) depending on \(f_0\), such as its integral. Although we focus on the extreme situations where Q is the identity or where Q is a linear functional, the case of an arbitrary linear map Q is covered. Leaving the introduction of the transform T aside, one useful consequence of the result below is that knowledge of (a basis for) the space V is not needed, since only the values of the \(\ell _i(v_j)\)’s and \(Q(v_j)\)’s are required to form \((Q \circ R^{\mathrm{opt}})(\mathbf {y})\); an illustrative finite-dimensional sketch is given after the proof of Theorem 4.
Theorem 4
Let \(\mathcal {F},\mathcal {H},\mathcal {Z}\) be three normed spaces, \(\mathcal {H}\) being a Hilbert space, and let V be a subspace of \(\mathcal {H}\). Consider a linear quantity of interest \(Q: \mathcal {F}\rightarrow \mathcal {Z}\) and a linear map \(T: \mathcal {F}\rightarrow \mathcal {H}\). For \(\mathbf {y}\in \mathbb {R}^m\), define \(R^{\mathrm{opt}}(\mathbf {y}) \in \mathcal {F}\) as a solution to
Then the learning/recovery map \(Q \circ R^{\mathrm{opt}}: \mathbb {R}^m \rightarrow \mathcal {Z}\) is locally optimal over the model set
in the sense that, for any \(z \in \mathcal {Z}\),
Proof
Let us introduce the compatibility indicator
Given \(\mathbf {y}\in \mathbb {R}^m\), let \({\hat{f}} = R^{\mathrm{opt}}(\mathbf {y})\) denote the solution to (43). We shall establish (45) by showing on the one hand that
and on the other hand that, for any \(z \in \mathcal {Z}\).
Let us start with (47). Considering an arbitrary \(u \in \ker (L)\), notice that the quadratic expression in \(t \in \mathbb {R}\) given by
is minimized at the point \(t=0\). This forces the linear term \(\langle T{\hat{f}} - P_V(T{\hat{f}} ),Tu - P_V(Tu) \rangle \) to vanish, in other words
Now, considering \(f \in \mathcal {K}\) such that \(L(f) = \mathbf {y}\) written as \(f = {\hat{f}} + u\) for some \(u \in \ker (L)\), the fact that \(f \in \mathcal {K}\) reads
Rearranging the latter gives
It remains to take the definition (46) into account in order to bound \(\Vert Q(f) - Q({\hat{f}}) \Vert _{\mathcal {Z}} = \Vert Q(u) \Vert _{\mathcal {Z}}\) and arrive at (47).
Turning to (48), we consider \(u \in \ker (L)\) such that
It is clear that \(f^\pm := {\hat{f}} \pm u\) both satisfy \(L(f^\pm ) = \mathbf {y}\), while \(f^\pm \in \mathcal {K}\) follows from
Therefore, for any \(z \in \mathcal {Z}\),
Taking (53) and (54) into account finishes the proof of (48). \(\square \)
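For readers who wish to experiment with the recovery map of Theorem 4, here is a minimal finite-dimensional sketch written under simplifying assumptions that go beyond the statement above: \(\mathcal {H}= \mathbb {R}^n\), T equal to the identity, observation functionals stacked into a matrix A, a linear functional \(Q(f) = c^\top f\), and an explicit basis for V (which, as noted before the theorem, is not actually required in general). The helper optimal_recovery and the toy data are hypothetical illustrations rather than the authors' implementation; the constrained least-squares problem it solves is the finite-dimensional analogue of the minimization defining \(R^{\mathrm{opt}}(\mathbf {y})\).

```python
import numpy as np

def optimal_recovery(A, y, V_basis):
    """Recover f_hat minimizing ||f - P_V f||_2 subject to A f = y.

    Finite-dimensional analogue (H = R^n, T = identity) of the recovery map
    R^opt; assumes the compatibility condition V ∩ ker(A) = {0}.
    """
    n = A.shape[1]
    Qv, _ = np.linalg.qr(V_basis)            # orthonormal basis of V
    M = np.eye(n) - Qv @ Qv.T                # I - P_V, so f - P_V f = M f
    # KKT system of the equality-constrained least-squares problem:
    #   [2M  A^T] [f     ]   [0]
    #   [A    0 ] [lambda] = [y]
    m = A.shape[0]
    KKT = np.block([[2 * M, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([np.zeros(n), y])
    sol = np.linalg.lstsq(KKT, rhs, rcond=None)[0]
    return sol[:n]

# Toy usage: estimate a linear quantity of interest Q(f0) = c^T f0.
rng = np.random.default_rng(1)
n, m, dimV = 8, 4, 3
A = rng.standard_normal((m, n))              # observation functionals l_1, ..., l_m
V_basis = rng.standard_normal((n, dimV))     # approximation space V
f0 = V_basis @ rng.standard_normal(dimV) + 0.05 * rng.standard_normal(n)
y = A @ f0                                   # exact (noiseless) observations
f_hat = optimal_recovery(A, y, V_basis)
c = rng.standard_normal(n)
print(c @ f_hat, c @ f0)                     # estimated vs. true quantity of interest
```

The KKT matrix is nonsingular precisely when A has full row rank and \(V \cap \ker (A) = \{0\}\), which mirrors the compatibility indicator introduced in the proof; the call to lstsq is only a safeguard against numerically degenerate instances.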