Abstract
The notion of generalization in classical Statistical Learning is often attached to the postulate that data points are independent and identically distributed (IID) random variables. While relevant in many applications, this postulate may not hold in general, encouraging the development of learning frameworks that are robust to non-IID data. In this work, we consider the regression problem from an Optimal Recovery perspective. Relying on a model assumption comparable to choosing a hypothesis class, a learner aims to minimize the worst-case error, without recourse to any probabilistic assumption on the data. We first develop a semidefinite program for calculating the worst-case error of any recovery map in finite-dimensional Hilbert spaces. Then, for any Hilbert space, we show that Optimal Recovery provides a formula which is user-friendly from an algorithmic point of view, as long as the hypothesis class is linear. Interestingly, this formula coincides with kernel ridgeless regression in some cases, proving that minimizing the average error and the worst-case error can yield the same solution. We provide numerical experiments in support of our theoretical findings.
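When the Optimal Recovery formula coincides with kernel ridgeless regression, the learned function is simply the minimum-RKHS-norm interpolant of the data. The snippet below is a minimal numerical sketch of that interpolant, assuming a Gaussian kernel, synthetic toy data, and a pseudo-inverse to cope with near-singular kernel matrices; the function names (gaussian_kernel, ridgeless_fit) are illustrative and the snippet does not reproduce the paper's experiments.

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Z."""
    d2 = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Z**2, axis=1)[None, :]
          - 2 * X @ Z.T)
    return np.exp(-gamma * d2)

def ridgeless_fit(X, y, gamma=1.0):
    """Kernel 'ridgeless' regression: the minimum-RKHS-norm interpolant of (X, y)."""
    K = gaussian_kernel(X, X, gamma)
    alpha = np.linalg.pinv(K) @ y            # pseudo-inverse copes with near-singular K
    return lambda X_new: gaussian_kernel(X_new, X, gamma) @ alpha

# Toy usage on synthetic data: the fit interpolates the training targets.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 2))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1]
f = ridgeless_fit(X, y, gamma=2.0)
print(np.max(np.abs(f(X) - y)))              # close to zero up to numerical error
```

Adding a small ridge term to K would give the usual kernel ridge regression estimator; the ridgeless interpolant above is its limit as the regularization parameter tends to zero.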
References
Beck, A., Eldar, Y.C.: Regularization in regression with bounded noise: A Chebyshev center approach. SIAM J. Matrix Anal. Appl. 29(2), 606–625 (2007)
Belkin, M., Ma, S., Mandal, S.: To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pp. 541–549 (2018)
Binev, P., Cohen, A., Dahmen, W., DeVore, R., Petrova, G., Wojtaszczyk, P.: Data assimilation in reduced modeling. SIAM/ASA J. Uncertain. Quantif. 5(1), 1–29 (2017)
Cotter, A., Keshet, J., Srebro, N.: Explicit approximations of the Gaussian kernel. arXiv preprint arXiv:1109.4603 (2011)
de Boor, C.: Best approximation properties of spline functions of odd degree. J. Math. Mech. 747–749 (1963)
de Boor, C.: Computational aspects of optimal recovery. In Optimal Estimation in Approximation Theory, pp. 69–91. Springer (1977)
de Boor, C.: Convergence of abstract splines. J. Approx. Theory 31(1), 80–89 (1981)
De Vito, E., Rosasco, L., Caponnetto, A., Giovannini, U., Odone, F.: Learning from examples as an inverse problem. J. Mach. Learn. Res. 6, 883–904 (2005)
De Vito, E., Rosasco, L., Caponnetto, A., Piana, M., Verri, A.: Some properties of regularized kernel methods. J. Mach. Learn. Res. 5, 1363–1390 (2004)
Duchon, J.: Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In Constructive Theory of Functions of Several Variables, pp. 85–100. Springer (1977)
Ettehad, M., Foucart, S.: Instances of computational optimal recovery: dealing with observation errors. SIAM/ASA J. Uncertain. Quantif. 9, 1438–1456 (2021)
Foucart, S.: Instances of computational optimal recovery: refined approximability models. J. Complex. 62, 101503 (2021)
Foucart, S., Liao, C.: Optimal recovery from inaccurate data in Hilbert spaces: Regularize, but what of the parameter? arXiv preprint arXiv:2111.02601 (2021)
Gurobi Optimization, LLC: Gurobi Optimizer Reference Manual (2020)
Hazan, E.: Introduction to online convex optimization. Found. Trends Optim. 2(3–4), 157–325 (2016)
Li, W., Lee, K.-H., Leung, K.-S.: Generalized regularized least-squares learning with predefined features in a Hilbert space. In Advances in Neural Information Processing Systems, pp. 881–888 (2007)
Liang, T., Rakhlin, A.: Just interpolate: Kernel “ridgeless” regression can generalize. Ann. Stat. (2019)
Micchelli, C.A., Rivlin, T.J.: A survey of optimal recovery. In Optimal Estimation in Approximation Theory, pp. 1–54. Springer (1977)
Micchelli, C.A., Rivlin, T.J.: Lectures on optimal recovery. In Numerical Analysis Lancaster 1984, pp. 21–93. Springer (1985)
Minh, H.: Some properties of Gaussian reproducing kernel Hilbert spaces and their implications for function approximation and learning theory. Constr. Approx. 32, 307–338 (2010)
Owhadi, H., Scovel, C., Schäfer, F.: Statistical numerical approximation. Not. Am. Math. Soc. 66, 1608–1617 (2019)
Packel, E.W.: Do linear problems have linear optimal algorithms? SIAM Rev. 30(3), 388–403 (1988)
Plaskota, L.: Noisy Information and Computational Complexity, vol. 95. Cambridge University Press (1996)
Pólik, I., Terlaky, T.: A survey of the S-lemma. SIAM Rev. 49(3), 371–418 (2007)
Rakhlin, A., Zhai, X.: Consistency of interpolation with Laplace kernels is a high-dimensional phenomenon. In Conference on Learning Theory, pp. 2595–2623 (2019)
Traub, J.F.: Information-Based Complexity. John Wiley and Sons Ltd. (2003)
Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw. 10(5), 988–999 (1999)
Wahba, G.: Spline Models for Observational Data. Society for Industrial and Applied Mathematics (1990)
Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., Chandra, V.: Federated learning with non-iid data. arXiv preprint arXiv:1806.00582 (2018)
Acknowledgements
C. L. and Y. W. are supported by the Texas A&M Triads for Transformation (T3) Program. S. F. was/is partially supported by grants from the NSF (DMS-1622134, DMS-1664803, DMS-2053172) and from the ONR (N00014-20-1-2787). S. F. and S. S. also acknowledge NSF grant CCF-1934904.
Additional information
Communicated by Felix Krahmer.
This work was initiated when they were both with the Department of Industrial and Systems Engineering, Texas A&M University, College Station, TX 77843, USA.
Appendix
Below, we generalize the result of [3] in two directions. For the first direction, instead of assuming that the target function \(f_0\) itself is well approximated by elements of a set V, we assume that it is some linear transform T applied to \(f_0\) that is well approximated. This translates into the modification (44) of the approximability set. The novelty occurs not for invertible transforms, but for noninvertible ones (e.g., when T represents a derivative, as in (36)). For the second direction, instead of attempting to recover \(f_0\) in full, we assume that we only need to estimate a quantity \(Q(f_0)\) depending on \(f_0\), such as its integral. Although we focus on the extreme situations where Q is the identity or where Q is a linear functional, the case of an arbitrary linear map Q is covered. Leaving the introduction of the transform T aside, one useful consequence of the result below is that knowledge of (a basis for) the space V is not needed, since only the values of the \(\ell _i(v_j)\)’s and \(Q(v_j)\)’s are required to form \((Q \circ R^{\mathrm{opt}})(\mathbf {y})\); an illustrative finite-dimensional sketch is given after the proof of Theorem 4.
Theorem 4
Let \(\mathcal {F},\mathcal {H},\mathcal {Z}\) be three normed spaces, \(\mathcal {H}\) being a Hilbert space, and let V be a subspace of \(\mathcal {H}\). Consider a linear quantity of interest \(Q: \mathcal {F}\rightarrow \mathcal {Z}\) and a linear map \(T: \mathcal {F}\rightarrow \mathcal {H}\). For \(\mathbf {y}\in \mathbb {R}^m\), define \(R^{\mathrm{opt}}(\mathbf {y}) \in \mathcal {F}\) as a solution to
Then the learning/recovery map \(Q \circ R^{\mathrm{opt}}: \mathbb {R}^m \rightarrow \mathcal {Z}\) is locally optimal over the model set
in the sense that, for any \(z \in \mathcal {Z}\),
Proof
Let us introduce the compatibility indicator
Given \(\mathbf {y}\in \mathbb {R}^m\), let \({\hat{f}} = R^{\mathrm{opt}}(\mathbf {y})\) denote the solution to (43). We shall establish (45) by showing on the one hand that
and on the other hand that, for any \(z \in \mathcal {Z}\).
Let us start with (47). Considering an arbitrary \(u \in \ker (L)\), notice that the quadratic expression in \(t \in \mathbb {R}\) given by
is minimized at the point \(t=0\). This forces the linear term \(\langle T{\hat{f}} - P_V(T{\hat{f}} ),Tu - P_V(Tu) \rangle \) to vanish, in other words
Now, considering \(f \in \mathcal {K}\) such that \(L(f) = \mathbf {y}\) written as \(f = {\hat{f}} + u\) for some \(u \in \ker (L)\), the fact that \(f \in \mathcal {K}\) reads
Rearranging the latter gives
It remains to take the definition (46) into account in order to bound \(\Vert Q(f) - Q({\hat{f}}) \Vert _{\mathcal {Z}} = \Vert Q(u) \Vert _{\mathcal {Z}}\) and arrive at (47).
Turning to (48), we consider \(u \in \ker (L)\) such that
It is clear that \(f^\pm := {\hat{f}} \pm u\) both satisfy \(L(f^\pm ) = \mathbf {y}\), while \(f^\pm \in \mathcal {K}\) follows from
Therefore, for any \(z \in \mathcal {Z}\),
Taking (53) and (54) into account finishes the proof of (48). \(\square \)
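For readers who wish to experiment with the recovery map of Theorem 4, here is a minimal finite-dimensional sketch written under simplifying assumptions that go beyond the statement above: \(\mathcal {H}= \mathbb {R}^n\), T equal to the identity, observation functionals stacked into a matrix A, a linear functional \(Q(f) = c^\top f\), and an explicit basis for V (which, as noted before the theorem, is not actually required in general). The helper optimal_recovery and the toy data are hypothetical illustrations rather than the authors' implementation; the constrained least-squares problem it solves is the finite-dimensional analogue of the minimization defining \(R^{\mathrm{opt}}(\mathbf {y})\).

```python
import numpy as np

def optimal_recovery(A, y, V_basis):
    """Recover f_hat minimizing ||f - P_V f||_2 subject to A f = y.

    Finite-dimensional analogue (H = R^n, T = identity) of the recovery map
    R^opt; assumes the compatibility condition V ∩ ker(A) = {0}.
    """
    n = A.shape[1]
    Qv, _ = np.linalg.qr(V_basis)            # orthonormal basis of V
    M = np.eye(n) - Qv @ Qv.T                # I - P_V, so f - P_V f = M f
    # KKT system of the equality-constrained least-squares problem:
    #   [2M  A^T] [f     ]   [0]
    #   [A    0 ] [lambda] = [y]
    m = A.shape[0]
    KKT = np.block([[2 * M, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([np.zeros(n), y])
    sol = np.linalg.lstsq(KKT, rhs, rcond=None)[0]
    return sol[:n]

# Toy usage: estimate a linear quantity of interest Q(f0) = c^T f0.
rng = np.random.default_rng(1)
n, m, dimV = 8, 4, 3
A = rng.standard_normal((m, n))              # observation functionals l_1, ..., l_m
V_basis = rng.standard_normal((n, dimV))     # approximation space V
f0 = V_basis @ rng.standard_normal(dimV) + 0.05 * rng.standard_normal(n)
y = A @ f0                                   # exact (noiseless) observations
f_hat = optimal_recovery(A, y, V_basis)
c = rng.standard_normal(n)
print(c @ f_hat, c @ f0)                     # estimated vs. true quantity of interest
```

The KKT matrix is nonsingular precisely when A has full row rank and \(V \cap \ker (A) = \{0\}\), which mirrors the compatibility indicator introduced in the proof; the call to lstsq is only a safeguard against numerically degenerate instances.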