
Learning from non-random data in Hilbert spaces: an optimal recovery perspective

  • Original Article
  • Journal: Sampling Theory, Signal Processing, and Data Analysis

Abstract

The notion of generalization in classical Statistical Learning is often attached to the postulate that data points are independent and identically distributed (IID) random variables. While relevant in many applications, this postulate may not hold in general, encouraging the development of learning frameworks that are robust to non-IID data. In this work, we consider the regression problem from an Optimal Recovery perspective. Relying on a model assumption comparable to choosing a hypothesis class, a learner aims at minimizing the worst-case error, without recourse to any probabilistic assumption on the data. We first develop a semidefinite program for calculating the worst-case error of any recovery map in finite-dimensional Hilbert spaces. Then, for any Hilbert space, we show that Optimal Recovery provides a formula which is user-friendly from an algorithmic point-of-view, as long as the hypothesis class is linear. Interestingly, this formula coincides with kernel ridgeless regression in some cases, proving that minimizing the average error and worst-case error can yield the same solution. We provide numerical experiments in support of our theoretical findings.
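
The coincidence with kernel "ridgeless" regression mentioned above can be made concrete by recalling what that regression map does: it outputs the minimum-norm interpolant of the data in a reproducing kernel Hilbert space. The snippet below is a minimal sketch of this map, assuming a Gaussian kernel; the function names and parameters are illustrative choices, not taken from the article.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def ridgeless_fit(X, y, gamma=1.0):
    # Minimum-RKHS-norm interpolant: the coefficients c solve K c = y,
    # with the pseudoinverse covering a (numerically) singular kernel matrix.
    return np.linalg.pinv(gaussian_kernel(X, X, gamma)) @ y

def ridgeless_predict(X_train, c, X_new, gamma=1.0):
    return gaussian_kernel(X_new, X_train, gamma) @ c

# Toy usage: with distinct points, the Gaussian kernel matrix is invertible,
# so the fitted map reproduces the training targets up to round-off.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = np.sin(X[:, 0])
c = ridgeless_fit(X, y)
print(np.max(np.abs(ridgeless_predict(X, c, X) - y)))  # ~0
```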

Notes

  1. At the time of publication of this article, a portion of these works is already available; see [13].

  2. Some results can now be found in [13]; in particular, that article uncovers a principled way to choose the parameter of a Tikhonov-like regression.

References

  1. Beck, A., Eldar, Y.C.: Regularization in regression with bounded noise: A Chebyshev center approach. SIAM J. Matrix Anal. Appl. 29(2), 606–625 (2007)

  2. Belkin, M., Ma, S., Mandal, S.: To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pp. 541–549 (2018)

  3. Binev, P., Cohen, A., Dahmen, W., DeVore, R., Petrova, G., Wojtaszczyk, P.: Data assimilation in reduced modeling. SIAM/ASA J. Uncert. Quant. 5(1), 1–29 (2017)

  4. Cotter, A., Keshet, J., Srebro, N.: Explicit approximations of the Gaussian kernel. arXiv preprint arXiv:1109.4603 (2011)

  5. de Boor, C.: Best approximation properties of spline functions of odd degree. J. Math. Mech. 747–749 (1963)

  6. de Boor, C.: Computational aspects of optimal recovery. In Optimal Estimation in Approximation Theory, pages 69–91. Springer (1977)

  7. de Boor, C.: Convergence of abstract splines. J. Approx. Theory 31(1), 80–89 (1981)

  8. De Vito, E., Rosasco, L., Caponnetto, A., Giovannini, U., Odone, F.: Learning from examples as an inverse problem. J. Mach. Learn. Res. 6, 883–904 (2005)

  9. De Vito, E., Rosasco, L., Caponnetto, A., Piana, M., Verri, A.: Some properties of regularized kernel methods. J. Mach. Learn. Res. 5, 1363–1390 (2004)

  10. Duchon, J.: Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In Constructive Theory of Functions of Several Variables, pp. 85–100. Springer (1977)

  11. Ettehad, M., Foucart, S.: Instances of computational optimal recovery: dealing with observation errors. SIAM/ASA J. Uncert. Quant. 9, 1438–1456 (2021)

  12. Foucart, S.: Instances of computational optimal recovery: refined approximability models. J. Complex. 62, 101503 (2021)

  13. Foucart, S., Liao, C.: Optimal recovery from inaccurate data in Hilbert spaces: Regularize, but what of the parameter? arXiv preprint arXiv:2111.02601 (2021)

  14. Gurobi Optimization, LLC: Gurobi Optimizer Reference Manual (2020)

  15. Hazan, E.: Introduction to online convex optimization. Found. Trends Optim. 2(3–4), 157–325 (2016)

  16. Li, W., Lee, K.-H., Leung, K.-S.: Generalized regularized least-squares learning with predefined features in a Hilbert space. In Advances in Neural Information Processing Systems, pp. 881–888 (2007)

  17. Liang, T., Rakhlin, A.: Just interpolate: Kernel “ridgeless” regression can generalize. Ann. Stat. (2019)

  18. Micchelli, C. A., Rivlin, T. J.: A survey of optimal recovery. In Optimal Estimation in Approximation Theory, pp. 1–54. Springer (1977)

  19. Micchelli, C. A., Rivlin, T. J.: Lectures on optimal recovery. In Numerical Analysis Lancaster 1984, pp. 21–93. Springer (1985)

  20. Minh, H.: Some properties of Gaussian reproducing kernel Hilbert spaces and their implications for function approximation and learning theory. Constr. Approx. 32, 307–338 (2010)

  21. Owhadi, H., Scovel, C., Schäfer, F.: Statistical numerical approximation. Not. Am. Math. Soc. 66, 1608–1617 (2019)

  22. Packel, E.W.: Do linear problems have linear optimal algorithms? SIAM Rev. 30(3), 388–403 (1988)

  23. Plaskota, L.: Noisy Information and Computational Complexity, volume 95. Cambridge University Press (1996)

  24. Pólik, I., Terlaky, T.: A survey of the S-lemma. SIAM Rev. 49(3), 371–418 (2007)

  25. Rakhlin, A., Zhai, X.: Consistency of interpolation with Laplace kernels is a high-dimensional phenomenon. In Conference on Learning Theory, pp. 2595–2623 (2019)

  26. Traub, J. F.: Information-Based Complexity. John Wiley and Sons Ltd. (2003)

  27. Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw. 10(5), 988–999 (1999)

  28. Wahba, G.: Spline Models for Observational Data. Society for Industrial and Applied Mathematics (1990)

  29. Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., Chandra, V.: Federated learning with non-IID data. arXiv preprint arXiv:1806.00582 (2018)

Acknowledgements

C. L. and Y. W. are supported by the Texas A&M Triads for Transformation (T3) Program. S. F. was/is partially supported by grants from the NSF (DMS-1622134, DMS-1664803, DMS-2053172) and from the ONR (N00014-20-1-2787). S. F. and S. S. also acknowledge NSF grant CCF-1934904.

Author information

Corresponding author

Correspondence to Simon Foucart.

Additional information

Communicated by Felix Krahmer.

This work was initiated when they were both with the Department of Industrial and Systems Engineering, Texas A&M University, College Station, TX 77843, USA.

Appendix

Below, we generalize the result of [3] in two directions. For the first direction, instead of assuming that the target function \(f_0\) itself is well approximated by elements of a set V, we assume that it is some linear transform T applied to \(f_0\) that is well approximated. This translates into the modification (44) of the approximability set. The novelty occurs not for invertible transforms, but for noninvertible ones (e.g., when T represents a derivative, as in (36)). For the second direction, instead of attempting to recover \(f_0\) in full, we assume that we only need to estimate a quantity \(Q(f_0)\) depending on \(f_0\), such as its integral. Although we focus on the extreme situations where Q is the identity or where Q is a linear functional, the case of an arbitrary linear map Q is covered. Leaving the introduction of the transform T aside, one useful consequence of the result below is that knowledge of (a basis for) the space V is not needed, since only the values of the \(\ell _i(v_j)\)’s and \(Q(v_j)\)’s are required to form \((Q \circ R^{\mathrm{opt}})(\mathbf {y})\).
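
Before stating the theorem, here is a minimal finite-dimensional sketch of the construction (43) defining \(R^{\mathrm{opt}}\), assuming that \(\mathcal {F}\) and \(\mathcal {H}\) are Euclidean spaces and that T, L, and a basis of V are given as matrices; all names and dimensions below are illustrative choices, not taken from the article. The constraint set \(L(f) = \mathbf {y}\) is parametrized by a particular solution plus the kernel of L, and the resulting unconstrained problem is solved by least squares.

```python
import numpy as np

def null_space(L, tol=1e-10):
    # Orthonormal basis of ker(L), computed from the SVD of L.
    _, s, Vt = np.linalg.svd(L)
    return Vt[int(np.sum(s > tol)):].T

def optimal_recovery(T, L, Vb, y):
    """Sketch of problem (43): minimize ||T f - P_V(T f)|| subject to L f = y,
    with T, L given as matrices and the columns of Vb spanning V."""
    P = Vb @ np.linalg.pinv(Vb)            # orthogonal projector onto range(Vb) = V
    Pc = np.eye(P.shape[0]) - P            # projector onto the orthogonal complement of V
    f0 = np.linalg.pinv(L) @ y             # one particular solution of L f = y
    N = null_space(L)                      # the constraint set is {f0 + N z}
    z, *_ = np.linalg.lstsq(Pc @ T @ N, -Pc @ T @ f0, rcond=None)
    return f0 + N @ z

# Toy usage: estimate a linear functional Q(f) = q^T f from m = 5 observations.
rng = np.random.default_rng(1)
n = 12
T = np.eye(n)                              # T = identity: plain approximability model
L = rng.standard_normal((5, n))            # rows of L play the role of the l_i's
Vb = rng.standard_normal((n, 3))           # columns span the model space V
q = rng.standard_normal(n)
f_true = Vb @ rng.standard_normal(3)       # a target lying exactly in V
f_hat = optimal_recovery(T, L, Vb, L @ f_true)
print(abs(q @ f_hat - q @ f_true))         # ~0: here f_true is pinned down by V and the data
```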

Theorem 4

Let \(\mathcal {F},\mathcal {H},\mathcal {Z}\) be three normed spaces, \(\mathcal {H}\) being a Hilbert space, and let V be a subspace of \(\mathcal {H}\). Consider a linear quantity of interest \(Q: \mathcal {F}\rightarrow \mathcal {Z}\) and a linear map \(T: \mathcal {F}\rightarrow \mathcal {H}\). For \(\mathbf {y}\in \mathbb {R}^m\), define \(R^{\mathrm{opt}}(\mathbf {y}) \in \mathcal {F}\) as a solution to

$$\begin{aligned} \underset{f\in \mathcal {F}}{{\text {minimize}}}\, \Vert Tf - P_V(Tf)\Vert _{\mathcal {H}} \qquad \text{ s.to } L(f) = \mathbf {y}. \end{aligned}$$
(43)

Then the learning/recovery map \(Q \circ R^{\mathrm{opt}}: \mathbb {R}^m \rightarrow \mathcal {Z}\) is locally optimal over the model set

$$\begin{aligned} \mathcal {K}= \{ f \in \mathcal {F}: \mathrm{dist}(Tf,V) \le \epsilon \} \end{aligned}$$
(44)

in the sense that, for any \(z \in \mathcal {Z}\),

$$\begin{aligned} \sup _{f \in \mathcal {K}, L(f) = \mathbf {y}} \Vert Q(f) - Q \circ R^{\mathrm{opt}}(f)\Vert _{\mathcal {Z}} \le \sup _{f \in \mathcal {K}, L(f) = \mathbf {y}} \Vert Q(f) - z \Vert _{\mathcal {Z}}. \end{aligned}$$
(45)

Proof

Let us introduce the compatibility indicator

$$\begin{aligned} \mu := \sup _{u \in \ker (L) \setminus \{0\}} \frac{\Vert Q(u)\Vert _\mathcal {Z}}{\mathrm{dist}(Tu,V)}. \end{aligned}$$
(46)

Given \(\mathbf {y}\in \mathbb {R}^m\), let \({\hat{f}} = R^{\mathrm{opt}}(\mathbf {y})\) denote the solution to (43). We shall establish (45) by showing on the one hand that

$$\begin{aligned} \sup _{\begin{array}{c} f \in \mathcal {K}\\ L(f) = \mathbf {y} \end{array}} \Vert Q(f) - Q({\hat{f}})\Vert _{\mathcal {Z}} \le \mu \big [ \epsilon ^2 - \Vert T{\hat{f}} - P_V(T {\hat{f}})\Vert _\mathcal {H}^2 \big ]^{1/2} \end{aligned}$$
(47)

and on the other hand that, for any \(z \in \mathcal {Z}\),

$$\begin{aligned} \sup _{\begin{array}{c} f \in \mathcal {K}\\ L(f) = \mathbf {y} \end{array}} \Vert Q(f) - z \Vert _{\mathcal {Z}} \ge \mu \big [ \epsilon ^2 - \Vert T{\hat{f}} - P_V(T {\hat{f}})\Vert _\mathcal {H}^2 \big ]^{1/2}. \end{aligned}$$
(48)

Let us start with (47). Considering an arbitrary \(u \in \ker (L)\), notice that the quadratic expression in \(t \in \mathbb {R}\) given by

$$\begin{aligned} \Vert T({\hat{f}} + t u) - P_V(T({\hat{f}} + t u))\Vert _{\mathcal {H}}^2 = \Vert T{\hat{f}} - P_V(T{\hat{f}})\Vert _{\mathcal {H}}^2 + 2 t \langle T{\hat{f}} - P_V(T{\hat{f}}), Tu - P_V(Tu) \rangle + \mathcal {O}(t^2) \end{aligned}$$
(49)

is minimized at the point \(t=0\). This forces the linear term \(\langle T{\hat{f}} - P_V(T{\hat{f}} ),Tu - P_V(Tu) \rangle \) to vanish; in other words,

$$\begin{aligned} \langle T{\hat{f}} - P_V(T{\hat{f}} ),Tu \rangle = 0 \qquad \text{ for } \text{ any } u \in \ker (L). \end{aligned}$$
(50)

Now, considering \(f \in \mathcal {K}\) such that \(L(f) = \mathbf {y}\), written as \(f = {\hat{f}} + u\) for some \(u \in \ker (L)\), the fact that \(f \in \mathcal {K}\) reads, with the help of the orthogonality relation (50),

$$\begin{aligned} \epsilon ^2 \ge \Vert T({\hat{f}} + u) - P_V(T({\hat{f}} + u))\Vert _{\mathcal {H}}^2 = \Vert T{\hat{f}} - P_V(T{\hat{f}})\Vert _{\mathcal {H}}^2 + \Vert Tu - P_V(Tu)\Vert _{\mathcal {H}}^2. \end{aligned}$$
(51)

Rearranging the latter gives

$$\begin{aligned} \mathrm{dist}(Tu,V) \le \big [ \epsilon ^2 - \Vert T{\hat{f}} - P_V(T {\hat{f}})\Vert _\mathcal {H}^2 \big ]^{1/2}. \end{aligned}$$
(52)

It remains to take the definition (46) into account in order to bound \(\Vert Q(f) - Q({\hat{f}}) \Vert _{\mathcal {Z}} = \Vert Q(u) \Vert _{\mathcal {Z}}\) and arrive at (47).

Turning to (48), we consider \(u \in \ker (L)\) such that

$$\begin{aligned} \Vert Q(u)\Vert _{\mathcal {Z}}&= \mu \, \mathrm{dist}(Tu, V), \end{aligned}$$
(53)
$$\begin{aligned} \Vert Tu - P_V(Tu)\Vert _\mathcal {H}&= \big [ \epsilon ^2 - \Vert T{\hat{f}} - P_V(T {\hat{f}})\Vert _\mathcal {H}^2 \big ]^{1/2}. \end{aligned}$$
(54)

It is clear that \(f^\pm := {\hat{f}} \pm u\) both satisfy \(L(f^\pm ) = \mathbf {y}\), while \(f^\pm \in \mathcal {K}\) follows, with the help of (50) and (54), from

$$\begin{aligned} \Vert T f^\pm - P_V(T f^\pm )\Vert _\mathcal {H}^2 = \Vert (T{\hat{f}} - P_V(T {\hat{f}})) \pm (T u - P_V(T u)) \Vert _\mathcal {H}^2 = \Vert T{\hat{f}} - P_V(T {\hat{f}}) \Vert _\mathcal {H}^2 + \Vert Tu - P_V(T u) \Vert _\mathcal {H}^2 = \epsilon ^2. \end{aligned}$$
(55)

Therefore, for any \(z \in \mathcal {Z}\),

$$\begin{aligned} \sup _{\begin{array}{c} f \in \mathcal {K}\\ L(f) = \mathbf {y} \end{array}} \Vert Q(f) - z \Vert _{\mathcal {Z}}&\ge \max \{ \Vert Q(f^+) - z \Vert _{\mathcal {Z}}, \Vert Q(f^-) - z \Vert _{\mathcal {Z}}\} \nonumber \\&\ge \frac{1}{2} \big ( \Vert Q(f^+) - z \Vert _{\mathcal {Z}} + \Vert Q(f^-) - z \Vert _{\mathcal {Z}} \big )\nonumber \\&\ge \frac{1}{2} \Vert Q(f^+ - f^-) \Vert _{\mathcal {Z}} = \Vert Q(u) \Vert _{\mathcal {Z}}. \end{aligned}$$
(56)

Taking (53) and (54) into account completes the proof of (48). \(\square \)
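
As a numerical sanity check of the proof, the self-contained sketch below solves (43) in a small finite-dimensional example (the dimensions and matrices are arbitrary choices, not from the article) and verifies the orthogonality relation (50), which is the property driving both directions of the argument.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 10, 4
T = rng.standard_normal((8, n))            # a non-invertible transform T : R^n -> R^8
L = rng.standard_normal((m, n))            # m observation functionals as rows
Vb = rng.standard_normal((8, 2))           # columns span V inside the Hilbert space R^8
y = rng.standard_normal(m)

P = Vb @ np.linalg.pinv(Vb)                # orthogonal projector onto V
Pc = np.eye(8) - P                         # projector onto the orthogonal complement
_, s, Vt = np.linalg.svd(L)
N = Vt[int(np.sum(s > 1e-10)):].T          # orthonormal basis of ker(L)
f0 = np.linalg.pinv(L) @ y                 # particular solution of L f = y

# Solve (43): minimize ||Pc T (f0 + N z)|| over z, then form T f_hat - P_V(T f_hat).
z, *_ = np.linalg.lstsq(Pc @ T @ N, -Pc @ T @ f0, rcond=None)
f_hat = f0 + N @ z
residual = Pc @ (T @ f_hat)

# Orthogonality relation (50): <T f_hat - P_V(T f_hat), T u> = 0 for every u in ker(L).
print(np.max(np.abs(residual @ (T @ N))))  # ~0 up to round-off
```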

Cite this article

Foucart, S., Liao, C., Shahrampour, S. et al. Learning from non-random data in Hilbert spaces: an optimal recovery perspective. Sampl. Theory Signal Process. Data Anal. 20, 5 (2022). https://doi.org/10.1007/s43670-022-00022-w
