Abstract
We construct a least squares approximation method for the recovery of complex-valued functions from a reproducing kernel Hilbert space on \(D \subset \mathbb {R}^d\). The nodes are drawn at random for the whole class of functions, and the error is measured in \(L_2(D,\varrho _{D})\). We prove worst-case recovery guarantees by explicitly controlling all the involved constants. This leads to new preasymptotic recovery bounds with high probability for the error of hyperbolic Fourier regression on multivariate data. In addition, we further investigate its counterpart hyperbolic wavelet regression also based on least squares to recover non-periodic functions from random samples. Finally, we reconsider the analysis of a cubature method based on plain random points with optimal weights and reveal near-optimal worst-case error bounds with high probability. It turns out that this simple method can compete with the quasi-Monte Carlo methods in the literature which are based on lattices and digital nets.
1 Introduction
We consider the problem of learning complex-valued multivariate functions on a domain \(D \subset \mathbb {R}^d\) from function samples on the set of nodes \({\mathbf {X}}:=\{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\} \subset D\). The functions are modeled as elements from some reproducing kernel Hilbert space H(K) with kernel \(K:D\times D\rightarrow \mathbb {C}\). The nodes \({\mathbf {X}}\) are drawn independently according to a tailored probability measure depending on the spectral properties of the embedding of H(K) in \(L_2(D,\varrho _D)\), where the error is measured. Our main focus in this paper is on worst-case recovery guarantees. In fact, we aim at recovering all \(f \in H(K)\) simultaneously from sampled values at the sampling nodes in \({\mathbf {X}}\) with high probability. To be more precise, we implement algorithms and provide error bounds to control the worst-case error
$$\begin{aligned} \sup _{\Vert f\Vert _{H(K)}\le 1}\Vert f-S_{{\mathbf {X}}}^m f\Vert _{L_2(D,\varrho _D)}, \end{aligned}$$
where \(S_{{\mathbf {X}}}^m\) is the fixed recovery operator. In contrast to that, the problem of reconstructing an individual function from random samples has been considered by several authors in the literature, e.g., Smale and Zhou [61], Bohn [5, 6], Bohn and Griebel [7], Cohen et al. [13], Chkifa et al. [10], Cohen and Migliorati [16], and many others to mention just a few.
Let us emphasize that we do not develop a Monte Carlo method here. It is rather the use of “random information” which gained substantial interest in the information-based complexity (IBC) community and in the field of compressed sensing, see the recent survey [28] and [24]. We construct a recovery operator \(S_{{\mathbf {X}}}^m\) which computes a best least squares fit \(S_{{\mathbf {X}}}^mf\) to the given data \((f({\mathbf {x}}^1),\ldots ,f({\mathbf {x}}^n))^\top \) from the finite-dimensional space spanned by the first \(m-1\) singular vectors of the embedding
$$\begin{aligned} \mathrm {Id}:H(K) \rightarrow L_2(D,\varrho _D). \end{aligned}$$
(1.1)
The right singular vectors \(e_1(\cdot ), e_2(\cdot ),\ldots \) of this embedding are arranged according to their importance, i.e., with respect to the non-increasing rearrangement of the singular values \(\sigma _1 \ge \sigma _2 \ge \cdots >0\).
The investigations in this paper are inspired by the recent results by Krieg and Ullrich [34], which triggered substantial progress in the field. See also the discussion below in Remark 5.9 and [45, Sect. 7]. In this paper, we extend and improve the results from [34] in several directions. In particular, we investigate and implement a least squares regression algorithm under weaker conditions and give practically useful parameter choices which lead to a controlled failure probability and explicit error bounds.
A typical error bound relates the worst-case recovery error to the sequence of singular numbers \((\sigma _k)_{k\in \mathbb {N}}\) of the embedding (1.1) which represent the approximation numbers or linear widths. One main contribution of this paper is the following general bound, where all constants are determined precisely under mild conditions. Recall that \((e_k(\cdot ))_{k\in \mathbb {N}}\) denotes the sequence of right singular vectors of the embedding (1.1), i.e., the eigenfunctions of \(\mathrm {Id}^*\circ \mathrm {Id}:H(K) \rightarrow H(K)\), and \(\sigma _1 \ge \sigma _2 \ge \cdots >0\) the corresponding singular numbers.
Theorem 1.1
(cf. Corollary 5.6) Let H(K) be a separable reproducing kernel Hilbert space of complex-valued functions on a subset \(D \subset \mathbb {R}^d\) such that the positive semidefinite kernel \(K:D\times D\rightarrow \mathbb {C}\) satisfies \(\sup _{{\mathbf {x}}\in D}K({\mathbf {x}},{\mathbf {x}})<\infty \). Let further \(\varrho _D\) denote a probability measure on D. Furthermore, for \(n\in \mathbb {N}\) and \(\delta \in (0,1/3)\), we define \(m\in \mathbb {N}\) such that
holds. Then, the random reconstruction operator \(S^m_{{\mathbf {X}}}\) (see Algorithm 1), which uses samples on the n i.i.d. (according to \(\varrho _D\)) drawn nodes in \({\mathbf {X}}\), satisfies
where \(T(m) := \sup _{{\mathbf {x}}\in D}\sum _{k=m}^{\infty }|e_k({\mathbf {x}})|^2\).
The occurrence of the fundamental quantity N(m) is certainly not a surprise. It is also known as the spectral function (see [26] and the references therein), and in the case of orthogonal polynomials, it is the inverse of the infimum of the Christoffel function (cf., e.g., [23]). It represents a well-known ingredient for inequalities related to sampling and discretization, see, for instance, Gröchenig and Bass [1], Gröchenig [25, 26], Temlyakov [63, 64], and Temlyakov et al. [20]. If, for instance, \(N(m) \in {\mathcal {O}}(m)\) holds, we achieve near-optimal error bounds with respect to the number n of used sampling values. Note that by a straightforward computation we also have \(T(m) \le 2\sum _{k\ge m/2} \sigma _k^2 N(4k)/k\). Hence, if \(N(m) \in {\mathcal {O}}(m)\), the bound in Theorem 1.1 is upper bounded by
with \(m:=n/(c_1\log (n)+c_2\log (\delta ^{-1}))\) and a constant \(C_{K,\varrho _D}>0\) depending on the measure \(\varrho _D\) and the kernel K.
In the general case, we do not necessarily have \(N(m) \in {\mathcal {O}}(m)\). Here, a technique called “importance sampling”, see, e.g., [27, 46, 57, 58], turns out to be useful, see Algorithm 2. As proposed in [16] and specified in full detail in [34], one may sample from a reweighted distribution using a specific density \(\varrho _m\), defined in (5.12) below, which differs for each m and depends on the spectral properties of the embedding (1.1). It determines the important “area” from which to sample. In other words, we incorporate additional knowledge about the spectral properties of our embedding. The underlying recovery operator is still constructive, and the determined error bounds hold with high probability. Computing this envelope density (and sampling from it) has been studied in [16, Sect. 5]. A refinement of this technique together with Theorem 1.1 leads to the following precise bounds under even weaker conditions.
Theorem 1.2
(cf. Theorem 5.8) Let H(K) be a separable reproducing kernel Hilbert space of complex-valued functions on a subset \(D \subset \mathbb {R}^d\). Let further \(\varrho _D\) denote a non-trivial \(\sigma \)-finite measure on D, and assume that the positive semidefinite kernel \(K:D\times D\rightarrow \mathbb {C}\) satisfies
Furthermore, for \(n\in \mathbb {N}\) and \(\delta \in (0,1/3)\) we fix
Then, the random reconstruction operator \({\widetilde{S}}^m_{\mathbf {X}}\) (see Algorithm 2), which uses n samples drawn according to a probability measure depending on \(\varrho _D\), m, and K, satisfies
By the same reasoning as above, we may replace the error bound in (1.3) by (1.2), but this time with a universal (and precisely determined) constant \(C>0\). The result refines the bound in [34], as we give precise constants here in the general situation.
A further application of our least squares method lies in the field of numerical integration. Oettershagen [52], Belhadji et al. [2], Gröchenig [26], Migliorati and Nobile [43], and many others used least squares optimization to design quadrature rules. This results in (complex-valued) weights \({\mathbf {q}}:=(q_1,\ldots ,q_n)^\top \) in a cubature formula, i.e.,
where \({\mathbf {f}}:=(f({\mathbf {x}}^j))_{j=1}^n\) and \(\mu _D\) is the measure for which we want to compute the integral. In our setting, the integration nodes \({\mathbf {X}}= \{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\}\) are determined once in advance for the whole class. Clearly, the bounds from Theorems 1.1, 1.2 can be literally transferred to control the worst-case integration error
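Such least squares cubature weights can be computed by integrating the least squares approximant. The following numpy sketch does this for the univariate Fourier system on \([0,1)\) with the uniform measure; the test function and all parameter values are illustrative assumptions, not the choices analyzed below:

```python
import numpy as np

# Sketch: least squares cubature weights q with Q(f) = q^T (f(x^j))_j,
# obtained by integrating the least squares fit.  Univariate Fourier
# system on [0,1), uniform measure; parameters are illustrative.
rng = np.random.default_rng(0)
n, m = 64, 9
x = rng.random(n)                                   # i.i.d. uniform nodes
ks = np.arange(-(m - 1) // 2, (m - 1) // 2)         # m - 1 frequencies
L = np.exp(2j * np.pi * np.outer(x, ks))            # sampled system matrix

# Integrals of the basis functions over [0,1): only k = 0 contributes.
b = (ks == 0).astype(float)

# The fit has coefficients c = L^+ f, so Q(f) = b^T c, i.e. q^T = b^T L^+.
q = np.linalg.pinv(L).T @ b

f = lambda t: np.exp(np.cos(2 * np.pi * t))         # smooth periodic test function
Q = np.real(q @ f(x))                               # approximates the exact integral
```

The exact integral of this test function is the Bessel value \(I_0(1)\approx 1.2661\); the weighted rule reproduces it up to the truncation error of the 8-term Fourier fit.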
As the main example, we consider the recovery of functions from Sobolev spaces with mixed smoothness (also known as tensor product Sobolev spaces or hyperbolic cross spaces). This problem has been investigated by many authors over the last 30 years, see [18] and the references therein. The above general bound on the worst-case errors can, for instance, be used for any non-periodic embedding
The spaces \(H^s_{\text {mix}}([0,1]^d)\) can be represented in various ways as a reproducing kernel Hilbert space satisfying the requirements of the above theorems, see the concrete collection of examples in [3, Section 7.4]. Applying Theorem 1.2 and plugging in well-known upper bounds on the singular numbers we improve on the asymptotic sampling bounds in [18, Sect. 5], see also Dinh Dũng [17] and Byrenheid [8] for the non-periodic situation. In addition, using refined preasymptotic estimates for the \((\sigma _j)_{j\in \mathbb {N}}\) in the non-periodic situation (see [33, Section 4.3]) yields reasonable bounds for sampling numbers in case of small n.
Let us emphasize that the result by Krieg and Ullrich [34] can be considered a major advance in the research on the complexity of this problem. They disproved Conjecture 5.6.2 in [18] for \(p=2\) and \(1/2< s <(d-1)/2\). Indeed, the celebrated sparse grid points are outperformed in a certain range of s. This represents a first step toward the solution of [18, Outstanding Open Problem 1.4]. As a consequence of the recent contributions by Nagel et al. [45] and Temlyakov [65], based on the groundbreaking solution of the Kadison–Singer problem by Marcus et al. [42], it is now evident that sparse grid methods are not optimal in the full range of parameters (except maybe in \(d=2\)). Still, it is worth mentioning that sparse grids represent the best known deterministic construction as far as the asymptotic order is concerned. Indeed, the guarantees are deterministic and only slightly worse than those for random nodes in the asymptotic regime. However, regarding preasymptotics, the random constructions provide substantial advantages.
In this paper, we use the simple least squares algorithm from [5, 14, 16, 34], and we show that using random points makes it possible to obtain explicit worst-case recovery bounds also for small n. In the periodic setting, the analysis benefits from the fact that the underlying eigenvector system is a bounded orthonormal system (BOS), see (2.8), which implies in particular \(N(m) \in {\mathcal {O}}(m)\). In case of the complex Fourier system, we have BOS constant \(B=1\) and obtain dimension-free constants. This allows for a priori estimates on the number of samples and arithmetic operations required to ensure accuracy \(\varepsilon >0\) with our concrete algorithm. In particular, we incorporate recent preasymptotic bounds for the singular values \((\sigma _k)_{k\in \mathbb {N}}\), see [36, 37] and [33]. For the periodic mixed smoothness space \(H^s_{\text {mix}}(\mathbb {T}^d)\) with \(2s>1+\log _2 d\), equipped with the \(\Vert \cdot \Vert _{\#}\)-norm (see (8.4) below), we obtain with probability larger than \(1-3\delta \) the worst-case bound
The number of samples n scales similarly as after (1.2) with precisely determined absolute constants \(c_1,c_2 < 70\), see Corollary 8.4.
We also demonstrate in Sect. 9 that the BOS assumption is not necessary for obtaining a practical algorithm. Similarly to [5], we use an algorithm called “hyperbolic wavelet regression” and show that it recovers non-periodic functions belonging to a Sobolev space with mixed smoothness \(H^s_{\text {mix}}\) from the n nodes \({\mathbf {X}}\), drawn according to the uniform measure, with a rate similar to that in the periodic case. The proposed approach achieves rates which improve on the bounds in [5] and are only worse by a factor \(\log ^s n\) in comparison with the optimal rates achieved by the hyperbolic wavelet best approximation studied in [21, 60].
Finally, by our refined analysis we were able to settle a conjecture in [52] that the worst-case integration error for \(H^s_{\text {mix}}(\mathbb {T}^d)\) is bounded in order by \(n^{-s}\log (n)^{ds}\) with high probability. This conjecture was based on the outcome of several numerical experiments (described in [52]) where the worst-case error has been simulated using the RKHS Riesz representer. It is remarkable in two respects. First, it is possible to benefit from higher-order smoothness although we use plain random points. And second, this simple method can compete with most of the quasi-Monte Carlo methods based on lattices and digital nets studied in the literature, see [22, pp. 195, 247]. Moreover, if \(s<(d-1)/2\) we get a better asymptotic rate than sparse grid integration which is shown to be of order \(n^{-s}\log (n)^{(d-1)(s+1/2)}\), see [19].
We verify our theoretical findings in practice with several numerical experiments. There, we compare the recovery error of the least squares regression method \(S_{{\mathbf {X}}}^m\) to the optimal error given by the projection onto the span of the eigenvectors. We also study a non-periodic regime, where we randomly sample points according to the Chebyshev measure. Algorithmically, the coefficients \({\mathbf {c}}:=(c_k)_{k=1}^{m-1}\in \mathbb {C}^{m-1}\) of the approximant \(S_{{\mathbf {X}}}^m f:=\sum _{k=1}^{m-1} c_k\,\sigma _k^{-1}\,e_{k}\) can be obtained by computing the least squares solution of the (over-determined) linear system of equations \({\mathbf {L}}_m \, {\mathbf {c}}=(f({\mathbf {x}}^j))_{j=1}^n\), where \({\mathbf {L}}_m := \left( \sigma _k^{-1}\,e_{k}({\mathbf {x}}^j)\right) _{j=1;\,k=1}^{n;\,m-1}\in \mathbb {C}^{n\times (m-1)}\). To solve this linear system of equations, one can apply a standard conjugate gradient type iterative algorithm, e.g., LSQR [54]. The corresponding arithmetic cost is bounded from above by \(C\,R\,m\,n < C\, R\,n^2\), where \(C>0\) is an absolute constant and R is the number of iterations, which is rather small due to the well-conditioned least squares matrices.
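This pipeline can be illustrated with a minimal numpy sketch for the univariate Fourier system, where \(\sigma _k^{-1}e_k\) are the complex exponentials; `np.linalg.lstsq` serves as a stand-in for LSQR, and the target function and parameter values are assumptions for illustration only:

```python
import numpy as np

# Sketch of the least squares recovery step: build L_m from the sampled
# basis and solve the over-determined system L_m c = (f(x^j))_j.
# Univariate Fourier system eta_k(t) = exp(2 pi i k t) on [0,1).
rng = np.random.default_rng(1)
n, m = 128, 17
x = rng.random(n)                                   # i.i.d. uniform nodes
ks = np.arange(-(m - 1) // 2, (m - 1) // 2)         # m - 1 frequencies
L = np.exp(2j * np.pi * np.outer(x, ks))            # L_m in C^{n x (m-1)}

f = lambda t: 1.0 / (2.0 - np.cos(2 * np.pi * t))   # smooth periodic target
c, *_ = np.linalg.lstsq(L, f(x), rcond=None)        # least squares fit (LSQR stand-in)

t = np.linspace(0.0, 1.0, 401)
approx = np.real(np.exp(2j * np.pi * np.outer(t, ks)) @ c)
err = np.max(np.abs(approx - f(t)))                 # small: Fourier tail decays fast
```

Since the Fourier coefficients of this target decay geometrically, already 16 basis functions and logarithmic oversampling reproduce it to within roughly the truncation error.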
Outline In Sect. 2, we describe the setting in which we perform the worst-case analysis, using the framework of reproducing kernel Hilbert spaces of complex-valued functions. Section 3 is devoted to the least squares algorithm, whose worst-case analysis is given in Sects. 5, 8, and 9. In the first place, we present the general results in Sect. 5. Section 8 considers the particular case of hyperbolic Fourier regression. In Sect. 9, we investigate the particular case of non-periodic functions with a bounded mixed derivative and their recovery via hyperbolic wavelet regression. The main tools from probability theory, like concentration inequalities for spectral norms and Rudelson’s lemma, are provided in Sect. 4. The analysis of the recovery of individual functions (Monte Carlo) is given in Sect. 6. Consequences for optimally weighted numerical integration based on plain random points are given in Sects. 7 and 8.2. Finally, the numerical experiments are shown and discussed in Sect. 10.
Notation As usual \(\mathbb {N}\) denotes the natural numbers, \(\mathbb {N}_0:=\mathbb {N}\cup \{0\}\), \(\mathbb {Z}\) denotes the integers, \(\mathbb {R}\) the real numbers, and \(\mathbb {C}\) the complex numbers. If not indicated otherwise, the symbol \(\log \) denotes the natural logarithm. \(\mathbb {C}^n\) denotes the complex n-space, whereas \(\mathbb {C}^{m\times n}\) denotes the set of all \(m\times n\)-matrices \({\mathbf {L}}\) with complex entries. The spectral norm (i.e., the largest singular value) of matrices \({\mathbf {L}}\) is denoted by \(\Vert {\mathbf {L}}\Vert \) or \(\Vert {\mathbf {L}}\Vert _{2\rightarrow 2}\). Vectors and matrices are usually typeset bold with \({\mathbf {x}},{\mathbf {y}}\in \mathbb {R}^n\) or \(\mathbb {C}^n\). For \(0<p\le \infty \) and \({\mathbf {x}}\in \mathbb {C}^n\) we denote \(\Vert {\mathbf {x}}\Vert _p := (\sum _{i=1}^n |x_i|^p)^{1/p}\) with the usual modification in the case \(p=\infty \). If \(T:X\rightarrow Y\) is a continuous operator we write \(\Vert T:X\rightarrow Y\Vert \) for its operator (quasi-)norm. For two sequences \((a_n)_{n=1}^{\infty },(b_n)_{n=1}^{\infty }\subset \mathbb {R}\) we write \(a_n \lesssim b_n\) if there exists a constant \(c>0\) such that \(a_n \le c\,b_n\) for all n. We will write \(a_n \asymp b_n\) if \(a_n \lesssim b_n\) and \(b_n \lesssim a_n\). We write \(f \in {\mathcal {O}}(g)\) for non-negative functions f, g if there is a constant \(c>0\) such that \(f \le cg\). D denotes a subset of \(\mathbb {R}^d\) and \(\ell _{\infty }(D)\) the set of bounded functions on D with \(\Vert \cdot \Vert _{\ell _{\infty }(D)}\) the supremum norm.
2 Reproducing Kernel Hilbert Spaces
We will work in the framework of reproducing kernel Hilbert spaces. The relevant theoretical background can be found in [3, Chapt. 1] and [11, Chapt. 4]. Let \(L_2(D,\varrho _D)\) be the space of complex-valued square-integrable functions with respect to \(\varrho _D\). Here, \(D \subset \mathbb {R}^d\) is an arbitrary subset and \(\varrho _D\) a measure on D. We further consider a reproducing kernel Hilbert space H(K) with a Hermitian positive definite kernel \(K({\mathbf {x}},{\mathbf {y}})\) on \(D \times D\). The crucial property is the identity
for all \({\mathbf {x}}\in D\). It ensures that point evaluations are continuous functionals on H(K). We will use the notation from [11, Chapt. 4]. In the framework of this paper, the finite trace of the kernel
or its boundedness
is assumed. The boundedness of K implies that H(K) is continuously embedded into \(\ell _\infty (D)\), i.e.,
Note that we do not need the measure \(\varrho _D\) for this embedding.
The embedding operator
is compact under the integrability condition (2.2), which we always assume from now on. We additionally assume that H(K) is at least infinite dimensional. However, we do not assume the separability of H(K) here. Due to the compactness of \(\mathrm {Id}\) the operator \(\mathrm {Id}^{*}\circ \mathrm {Id}\) provides an at most countable system of strictly positive eigenvalues \((\lambda _j)_{j\in \mathbb {N}}\). These eigenvalues are summable as a consequence of (2.2) and (2.5), (2.6) below, such that the singular numbers \((\sigma _j)_{j\in \mathbb {N}}\) belong to \(\ell _2\). Indeed, let \(\mathrm {Id}^*\) be defined in the usual way as
Then, \(W_{\varrho _D} := \mathrm {Id}^*\circ \mathrm {Id}: H(K) \rightarrow H(K)\) is non-negative definite, self-adjoint and compact. Let \((\lambda _j,e_j)_{j\in \mathbb {N}}\) denote the eigenpairs of \(W_{\varrho _D}\), where \((e_j)_{j \in \mathbb {N}} \subset H(K)\) is an orthonormal system of eigenvectors, and \((\lambda _j)_{j \in \mathbb {N}}\) the corresponding positive eigenvalues. In fact, \(W_{\varrho _D}e_j = \lambda _je_j\) and \(\langle e_j, e_k \rangle _{H(K)} = \delta _{j,k}\). The sequence of positive eigenvalues is arranged in non-increasing order, i.e.,
Note that we have by Bessel’s inequality
Let us point out that by (2.4), the function
exists pointwise in \(\mathbb {C}\). This implies
and we get by \(\lambda _j = \sigma _j^2\) that \(\mathrm {Id}:H(K) \rightarrow L_2(D,\varrho _D)\) is a Hilbert–Schmidt operator if (2.2) holds. We will restrict to the situation where we have equality in (2.6). This can be achieved by posing additional assumptions, namely that H(K) is separable and \(\varrho _D\) is \(\sigma \)-finite, see [11, Thm. 4.27]. It further holds that
Hence, \((e_j)_{j\in \mathbb {N}}\) is also orthogonal in \(L_2(D,\varrho _D)\) and \(\Vert e_j\Vert _2 = \sqrt{\lambda _j} =: \sigma _j\). We define the orthonormal system \((\eta _j)_{j\in \mathbb {N}} := (\lambda _j^{-1/2}e_j)_{j\in \mathbb {N}}\) in \(L_2(D,\varrho _D)\).
For our subsequent analysis, the quantity
$$\begin{aligned} N(m) := \sup _{{\mathbf {x}}\in D}\sum _{k=1}^{m-1}|\eta _k({\mathbf {x}})|^2 \end{aligned}$$
(2.7)
plays a fundamental role. We often need the related quantity \(T(m) := \sup _{{\mathbf {x}}\in D}\sum _{k=m}^{\infty }|e_k({\mathbf {x}})|^2\), which can be estimated by \(T(m) \le 2\sum _{k\ge m/2} \sigma _k^2 N(4k)/k\). The first one is sometimes called the “spectral function”, see [26] and the references therein. Clearly, by (2.6), N(m) and T(m) are well defined if the kernel is bounded, i.e., if (2.3) is assumed. In fact, T(m) is bounded by \(\Vert K\Vert _{\infty }\). It may happen that the system \((\eta _k)_{k\in \mathbb {N}}\) is a uniformly \(\ell _\infty (D)\)-bounded orthonormal system (BOS), i.e., we have for all \(k\in \mathbb {N}\)
Let us call B the BOS constant of the system. In this case, we have \(N(m) \le (m-1)B^2 \in {\mathcal {O}}(m)\) and \(T(m) \le B^2\sum _{k=m}^{\infty }\lambda _k\).
Remark 2.1
We would like to point out an issue concerning the embedding operator \(\mathrm {Id}:H(K) \rightarrow L_2(D,\varrho _D)\) defined above. As discussed in [11, p. 127], this embedding operator is in general not injective as it maps a function to an equivalence class. As a consequence, the system of eigenvectors \((e_j)_{j\in \mathbb {N}}\) may not be a basis in H(K). (Note that H(K) may not even be separable.) However, there are conditions which ensure that the orthonormal system \((e_j)_{j\in \mathbb {N}}\) is an orthonormal basis in H(K), see [11, 4.5] and [11, Ex. 4.6, p. 163], which is related to Mercer’s theorem [11, Thm. 4.49]. Indeed, if we additionally assume that the kernel \(K(\cdot ,\cdot )\) is bounded and continuous on \(D \times D\) (for a domain \(D \subset \mathbb {R}^d\)), then H(K) is separable and consists of continuous functions, see [3, Thms. 16, 17]. If we finally assume that the measure \(\varrho _D\) is a Borel measure with full support, then \((e_j)_{j\in \mathbb {N}}\) is a complete orthonormal system in H(K). In this case, we have the pointwise identity
as well as equality in (2.4), (2.5) and (2.6), see, for instance, [3, Cor. 4]. Let us emphasize that a Mercer kernel \(K(\cdot ,\cdot )\), i.e., a continuous kernel on \(D\times D\) with a compact domain \(D \subset \mathbb {R}^d\), satisfies all these conditions, see [11, Thm. 4.49]. In this case, we even have (2.9) with absolute and uniform convergence on \(D \times D\). Let us point out that, to our surprise, Theorem 1.2 (Theorem 5.8) already holds under the finite trace condition (2.2) if H(K) is separable and \(\varrho _D\) is \(\sigma \)-finite. We do not have to assume continuity of the kernel. Note that the finite trace condition is natural in this context, as [29] shows.
3 Least Squares Regression
Our algorithm essentially boils down to the solution of an over-determined linear system
$$\begin{aligned} {\mathbf {L}}\,{\mathbf {c}}= {\mathbf {f}}, \end{aligned}$$
where \({\mathbf {L}}\in \mathbb {C}^{n \times m}\) is a matrix with \(n>m\). It is well known that the above system may not have a solution. However, we can ask for the vector \({\mathbf {c}}\) which minimizes the residual \(\Vert {\mathbf {f}}-{\mathbf {L}}\, {\mathbf {c}}\Vert _2\). Multiplying the system with \({\mathbf {L}}^{*}\) gives
$$\begin{aligned} {\mathbf {L}}^{*}\,{\mathbf {L}}\,{\mathbf {c}}= {\mathbf {L}}^{*}\,{\mathbf {f}}, \end{aligned}$$
which is called the system of normal equations. If \({\mathbf {L}}\) has full rank, then the unique solution of the least squares problem is given by
$$\begin{aligned} {\mathbf {c}}= \left( {\mathbf {L}}^{*}\,{\mathbf {L}}\right) ^{-1}\,{\mathbf {L}}^{*}\,{\mathbf {f}}. \end{aligned}$$
From the fact that the singular values of \({\mathbf {L}}\) are bounded away from zero, we get the following quantitative bound on the spectral norm of the Moore–Penrose inverse \(({\mathbf {L}}^{*}\,{\mathbf {L}})^{-1}\,{\mathbf {L}}^{*}\).
Proposition 3.1
Let \({\mathbf {L}}\in \mathbb {C}^{n\times m}\), \(m<n\), be a matrix of full rank with singular values \(\tau _1,\ldots ,\tau _m >0\) arranged in non-increasing order.
-
(i)
Then, also the matrix \(({\mathbf {L}}^{*}\,{\mathbf {L}})^{-1}\,{\mathbf {L}}^{*}\) has full rank and singular values \(\tau _m^{-1},\ldots ,\tau _1^{-1}\) (arranged in non-increasing order).
-
(ii)
In particular, it holds that
$$\begin{aligned} \left( {\mathbf {L}}^{*}\,{\mathbf {L}}\right) ^{-1}\,{\mathbf {L}}^{*} = {\mathbf {V}}^{*}\, \tilde{\varvec{\Sigma }}{\mathbf {U}}\end{aligned}$$whenever \({\mathbf {L}}= {\mathbf {U}}^{*} \varvec{\Sigma } {\mathbf {V}}\), where \(\varvec{\Sigma } \in \mathbb {R}^{n\times m}\) is a rectangular matrix with only \((\tau _1,\ldots ,\tau _m)\) on the “main diagonal” and unitary matrices \({\mathbf {U}}\in \mathbb {C}^{n\times n}\) and \({\mathbf {V}}\in \mathbb {C}^{m\times m}\). Here, \(\tilde{\varvec{\Sigma }} \in \mathbb {R}^{m\times n}\) denotes the matrix with \((\tau _1^{-1},\ldots ,\tau _m^{-1})\) on the “main diagonal”.
-
(iii)
The operator norm \(\Vert ({\mathbf {L}}^{*}\,{\mathbf {L}})^{-1}\,{\mathbf {L}}^{*}\Vert \) can be controlled as follows:
$$\begin{aligned} \Vert \left( {\mathbf {L}}^{*}\,{\mathbf {L}}\right) ^{-1}\,{\mathbf {L}}^{*}\Vert \le \tau _m^{-1}. \end{aligned}$$
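Statements (i)-(iii) can be checked numerically; a sketch with an arbitrary full-rank test matrix (not part of the analysis):

```python
import numpy as np

# Check: (L* L)^{-1} L* has singular values tau_m^{-1} >= ... >= tau_1^{-1},
# hence spectral norm tau_m^{-1} (Proposition 3.1).
rng = np.random.default_rng(2)
n, m = 50, 8
L = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
tau = np.linalg.svd(L, compute_uv=False)            # tau_1 >= ... >= tau_m > 0
P = np.linalg.solve(L.conj().T @ L, L.conj().T)     # Moore-Penrose inverse of L
sv = np.linalg.svd(P, compute_uv=False)             # non-increasing order
assert np.allclose(sv, (1.0 / tau)[::-1])           # reciprocal singular values
assert np.isclose(np.linalg.norm(P, 2), 1.0 / tau[-1])
```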
For function recovery, we will use the matrix
$$\begin{aligned} {\mathbf {L}}_m := \left( \eta _k({\mathbf {x}}^j)\right) _{j=1;\,k=1}^{n;\,m-1}\in \mathbb {C}^{n\times (m-1)} \end{aligned}$$
(3.1)
for a set \({\mathbf {X}}= \{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\}\subset D\) of distinct sampling nodes and the system \((\eta _k)_{k\in \mathbb {N}} := (\lambda _k^{-1/2}e_k)_{k\in \mathbb {N}}\). Below we will see that this matrix behaves well with high probability if n is large enough and the nodes in \({\mathbf {X}}\) are drawn i.i.d. according to \(\varrho _D\).
Using Algorithm 1, we compute the coefficients \(c_k\), \(k=1,\ldots ,m-1\), of the approximant
$$\begin{aligned} S^m_{{\mathbf {X}}}f := \sum _{k=1}^{m-1} c_k\,\eta _k = \sum _{k=1}^{m-1} c_k\,\sigma _k^{-1}\,e_k. \end{aligned}$$
Note that the mapping \(f \mapsto S^m_{{\mathbf {X}}}f\) is linear for a fixed set of sampling nodes \({\mathbf {X}}\subset D.\)
4 Concentration Inequalities
We will consider complex-valued random variables X and random vectors \((X_1,\ldots ,X_N)\) on a probability space \((\Omega ,{\mathcal {A}},\mathbb {P})\). As usual, we denote by \(\mathbb {E}X\) the expectation of X. With \(\mathbb {P}(A|B)\) and \(\mathbb {E}(X|B)\), we denote the conditional probability
$$\begin{aligned} \mathbb {P}(A|B) := \frac{\mathbb {P}(A\cap B)}{\mathbb {P}(B)} \end{aligned}$$
and the conditional expectation
$$\begin{aligned} \mathbb {E}(X|B) := \frac{\mathbb {E}(X\,\chi _B)}{\mathbb {P}(B)}, \end{aligned}$$
where \(\chi _B:\Omega \rightarrow \{0,1\}\) is the indicator function on B.
Let us start with the classical Markov inequality. If Z is a random variable and \(t>0\), then
$$\begin{aligned} \mathbb {P}(|Z| \ge t) \le \frac{\mathbb {E}|Z|}{t}. \end{aligned}$$
Of course, there is also a version involving conditional probability and expectation. In fact,
Let us state concentration inequalities for the norm of sums of complex rank-1 matrices. For the first result, we refer to Oliveira [53]. We will need the following notational convention: For a complex (column) vector \({\mathbf {y}}\in \mathbb {C}^N\) (or \(\ell _2\)), we will often use the tensor notation for the matrix
$$\begin{aligned} {\mathbf {y}}\otimes {\mathbf {y}}:= {\mathbf {y}}\,{\mathbf {y}}^{*} = \left( y_i\,\overline{y_j}\right) _{i,j=1}^N. \end{aligned}$$
Proposition 4.1
Let \({\mathbf {y}}^i, i=1,\ldots ,n\), be i.i.d. copies of a random vector \({\mathbf {y}}\in \mathbb {C}^{N}\) such that \(\Vert {\mathbf {y}}^i\Vert _2 \le M\) almost surely. Let further \(\mathbb {E}({\mathbf {y}}^i \otimes {\mathbf {y}}^i) = \varvec{\Lambda }\in \mathbb {C}^{N\times N}\) and \(0<t <1\). Then, it holds
Proof
In order to show the probability estimate, we refer to the proof of [53, Lem. 1] and observe
for \( 0\le s\le n/(2M^2)\). Since we restrict \(0<t<1\), the choice \(s=(4+2\sqrt{2})^{-1}nt/M^2\) yields
and, finally, the assertion holds. \(\square \)
Remark 4.2
A slightly stronger version for the case of real matrices can be found in Cohen, Davenport, Leviatan [13] (see also the correction in [14]). For \({\mathbf {y}}^i, i=1,\ldots ,n\), i.i.d. copies of a random vector \({\mathbf {y}}\in \mathbb {R}^{N}\) sampled from a bounded orthonormal system, one obtains the concentration inequality
where \(c_t = (1+t)(\ln (1+t))-t\). This leads to improved constants for the case of real matrices.
The following result goes back to Lust-Piquard and Pisier [41], and Rudelson [59]. The complex version with precise constants can be found in Rauhut [56, Cor. 6.20].
Proposition 4.3
Let \({\mathbf {y}}^i \in \mathbb {C}^N\) (or \(\ell _2\)), \(i=1,\ldots ,n\), and \(\varepsilon _i\) independent Rademacher variables taking values \(\pm 1\) with equal probability. Then,
with
Remark 4.4
The result is proved for complex (finite) matrices. Note that the factor \(\sqrt{\log (8\min \{n,N\})}\) is already an upper bound for \(\sqrt{\log (8r)}\), where r is the rank of the matrix \(\sum _i {\mathbf {y}}^i\otimes {\mathbf {y}}^i\). The proof of Proposition 4.3 with the precise constant is based on [56, Lem. 6.18], which itself rests on a non-commutative Khintchine inequality, see [56, 6.5]. This technique allows for controlling all the involved constants. Let us comment on the situation \(N=\infty \), i.e., \({\mathbf {y}}^j \in \ell _2\), where this inequality remains valid with the factor \(\sqrt{\log (8n)}\). In fact, if the matrices \({\mathbf {B}}_j\) in [56, Thm. 6.14] are replaced by rank-1 operators \({\mathbf {B}}_j:\ell _2\rightarrow \ell _2\) of type \({\mathbf {B}}_j = {\mathbf {y}}^j \otimes {\mathbf {y}}^j\) with \(\Vert {\mathbf {y}}^j\Vert _2 < \infty \), then all the arguments remain valid and an \(\ell _2\)-version of this non-commutative Khintchine inequality is available. This implies an \(\ell _2\)-version of [56, Lem. 6.18], which reads as follows: Let \({\mathbf {y}}^j \in \ell _2\), \(j=1,\ldots ,n\), and \(p\ge 2\). Then,
Since we control the moments of the random variable representing the norm on the left-hand side, we are now able to derive a concentration inequality by standard arguments [56, Prop. 6.5]. This concentration inequality then easily implies the \(\ell _2\)-version of (4.3).
As a consequence of this result, we obtain the following deviation inequality in the mean which will be sufficient for our purpose.
Corollary 4.5
Let \({\mathbf {y}}^i\), \(i=1,\ldots ,n\), be i.i.d. random vectors from \(\mathbb {C}^N\) or \(\ell _2\) with \(\Vert {\mathbf {y}}^i\Vert _2 \le M\) almost surely. Let further \(\varvec{\Lambda } = \mathbb {E}({\mathbf {y}}^i \otimes {\mathbf {y}}^i)\). Then, with \(N>n\) we obtain
Proof
By a well-known symmetrization technique (see [24, Lem. 8.4]), Proposition 4.3, and the Cauchy–Schwarz inequality, we obtain
Hence, we get \(F^2\le a^2(F+b)\) and we solve this inequality with respect to F, which yields
and this corresponds to the assertion. \(\square \)
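The elementary quadratic step in this proof can be spelled out; writing F for the expectation on the left-hand side and \(a,b\ge 0\) for the quantities arising from the preceding display (their exact form is not needed here), completing the square gives

```latex
F^2 \le a^2(F+b)
\;\Longleftrightarrow\;
\Big(F-\tfrac{a^2}{2}\Big)^2 \le \tfrac{a^4}{4}+a^2 b
\;\Longrightarrow\;
F \le \tfrac{a^2}{2}+\sqrt{\tfrac{a^4}{4}+a^2 b}
\le a^2 + a\sqrt{b},
```

where the last step uses \(\sqrt{u+v}\le \sqrt{u}+\sqrt{v}\) for \(u,v\ge 0\).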
5 Worst-case Errors for Least Squares Regression
5.1 Random Matrices from Sampled Orthonormal Systems
Let us start with a concentration inequality for the spectral norm of a matrix of type (3.1). It turns out that the complex matrix \({\mathbf {L}}_m:={\mathbf {L}}_m({\mathbf {X}})\in \mathbb {C}^{n\times (m-1)}\) has full rank with high probability, where the elements of the set \({\mathbf {X}}\subset D\) of sampling nodes are drawn i.i.d. at random according to \(\varrho _D\). We will find below that the eigenvalues of
$$\begin{aligned} {\mathbf {H}}_m := \frac{1}{n}\,{\mathbf {L}}_m^{*}\,{\mathbf {L}}_m \in \mathbb {C}^{(m-1)\times (m-1)} \end{aligned}$$
are bounded away from zero with high probability if m is small enough compared to n. We speak of an “oversampling factor” n/m. In case of a bounded orthonormal system with BOS constant B, see (2.8), it will turn out that a logarithmic oversampling is sufficient, see (5.2) below. Note that the boundedness constant B may also depend on the underlying spatial dimension d. However, if, for instance, the complex Fourier system \(\{\exp (2\pi \mathrm {i}\,{\mathbf {k}}\cdot {\mathbf {x}}) :{\mathbf {k}}\in \mathbb {Z}^d\}\) is considered, we are in the comfortable situation that \(B = 1\).
Proposition 5.1
Let \(n,m \in \mathbb {N}\), \(m\ge 2\). Let further \(\{\eta _1(\cdot ), \eta _2(\cdot ), \eta _3(\cdot ),\ldots \}\) be the orthonormal system in \(L_2(D,\varrho _D)\) induced by the kernel K and the n sampling nodes in \({\mathbf {X}}\) be drawn i.i.d. at random according to \(\varrho _D\). Then, it holds for \(0<t <1\) that
where N(m) is defined in (2.7) and \({\mathbf {I}}_m=\mathrm {diag}({\varvec{1}})\in \{0,1\}^{(m-1)\times (m-1)}\).
Proof
We set \({\mathbf {y}}^i := \left( \eta _1({\mathbf {x}}^i), \ldots , \eta _{m-1}({\mathbf {x}}^i)\right) ^*\), \(i=1,\ldots ,n,\) and observe
Moreover, due to the fact that we have an orthonormal system \((\eta _k)_{k\in \mathbb {N}}\), we obtain that \(\mathbb {E}({\mathbf {H}}_m) = {\mathbf {I}}_m\). The result follows by noting that \(M^2 \le N(m)\) in Proposition 4.1. \(\square \)
Remark 5.2
From this proposition we immediately obtain that the matrix \({\mathbf {H}}_m \in \mathbb {C}^{(m-1) \times (m-1)}\) has only eigenvalues larger than \(t:=1/2\) with probability at least \(1-\delta \) if
Hence, in case of a bounded orthonormal system with BOS constant \(B>0\), see (2.8), we may choose
with \(\kappa _{\delta ,B} := (\log (1/\delta )+\sqrt{2})^{-1}B^{-2}/48\).
From Proposition 5.1 we get that the \(m-1\) singular values \(\tau _1,\ldots , \tau _{m-1}\) of \({\mathbf {L}}_m\) from (3.1) are not smaller than \(\sqrt{n/2}\) and not larger than \(\sqrt{3n/2}\) with probability at least \(1-\delta \) if m is chosen such that (5.1) holds. In terms of Proposition 3.1, this means that \(\tau _1,\ldots , \tau _{m-1} \ge \sqrt{n/2}\). This leads to an upper bound on the norm of the Moore–Penrose inverse that is required for the least squares algorithm.
Proposition 5.3
Let \(\{\eta _1(\cdot ), \eta _2(\cdot ), \eta _3(\cdot ),\ldots \}\) be the orthonormal system in \(L_2(D,\varrho _D)\) induced by the kernel K. Let further \(m,n \in \mathbb {N}\), \(m\ge 2\), and \(0< \delta <1\) be chosen such that they satisfy (5.1). Then, the random matrix \({\mathbf {L}}_m\) from (3.1) satisfies
with probability at least \(1-\delta \).
In addition to the matrix \({\mathbf {L}}_m\), we need to consider a second linear operator that is defined using sampling values of the eigenfunctions \(e_j\). The importance of this operator has been pointed out in [34], where strong results on the concentration of infinite dimensional random matrices have been used. Since we only need the expectation of the norm, Rudelson’s lemma (see Proposition 4.3) together with a symmetrization technique suffices. This allows us to control the constants.
Proposition 5.4
Let \({\mathbf {X}}=\{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\}\subset D\) be a set of n sampling nodes drawn i.i.d. at random according to \(\varrho _D\), and consider the n i.i.d. random sequences
together with \(T(m):= \sup \limits _{{\mathbf {x}}\in D}\sum \limits _{k= m}^{\infty } |e_k({\mathbf {x}})|^2 < \infty \), see (2.7). Then, the operator
has expected norm
Proof
Note that \(\varvec{\Phi }_m^{*}\varvec{\Phi }_m = \sum \limits _{i=1}^n {\mathbf {y}}^i \otimes {\mathbf {y}}^i\) and
This gives
Finally, the bound in (5.3) follows from Corollary 4.5 (see also Remark 4.4 for \(N = \infty \)), the fact that \(\Vert \varvec{\Lambda }_m\Vert = \lambda _m = \sigma _m^2\) and \(M^2 = T(m)\). \(\square \)
5.2 Worst-case Errors with High Probability
Theorem 5.5
Let H(K) be a separable reproducing kernel Hilbert space on a domain \(D \subset \mathbb {R}^d\) with a positive semidefinite kernel \(K({\mathbf {x}},{\mathbf {y}})\) such that \(\sup _{{\mathbf {x}}\in D} K({\mathbf {x}},{\mathbf {x}}) <\infty \). We denote with \((\sigma _j)_{j\in \mathbb {N}}\) the non-increasing sequence of singular numbers of the embedding \(\mathrm {Id}:H(K) \rightarrow L_2(D,\varrho _D)\) for a probability measure \(\varrho _D\). Let further \(0<\delta <1\) and \(m,n \in \mathbb {N}\), where \(m\ge 2\) is chosen such that (5.1) holds. Drawing the n sampling nodes in \({\mathbf {X}}\) i.i.d. at random according to \(\varrho _D\), we have for the conditional expectation of the worst-case error
with \(C_{\mathrm {R}}\) from (4.4).
Proof
Let \(f \in H(K)\) be such that \(\Vert f\Vert _{H(K)} \le 1\). Let further \({\mathbf {X}}=\{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\}\) be such that \(\Vert {\mathbf {H}}_m-{\mathbf {I}}_m\Vert \le 1/2\). Using orthogonality and the fact that \(S^m_{\mathbf {X}}P_{m-1}f=P_{m-1}f\), we estimate
where \(P_{m-1}f\) denotes the projection \(\sum _{j=1}^{m-1} \langle f,e_j \rangle e_j\) in H(K), which yields \(\Vert f - P_{m-1}f\Vert _{L_2(D,\varrho _D)} \le \sigma _m\). Note further that for any \({\mathbf {x}}\in D\)
where \(T(\cdot ,{\mathbf {x}}) = K(\cdot ,{\mathbf {x}})-\sum _{j=1}^{\infty }e_j(\cdot )\overline{e_j({\mathbf {x}})}\) denotes an element in H(K). Its norm is given by
This gives
Returning to (5.5) we estimate
where \(\varvec{\Phi }_m\) denotes the infinite matrix from Proposition 5.4. Note that we used (2.4) in the last but one step. The relation in (5.7) together with (5.5) and \(\Vert f\Vert _{H(K)} \le 1\) implies
Integrating on both sides yields
Note that the integral on the right-hand side of (5.8) vanishes because of (5.6) and the fact that we have an equality sign in (2.6) due to our assumptions (separability of H(K)). This gives
Taking Proposition 5.4 and (4.1) into account and noting that \(\mathbb {P}(\Vert {\mathbf {H}}_m-{\mathbf {I}}_m\Vert \le 1/2)\) is larger than \(1-\delta \), we obtain the assertion. \(\square \)
In addition, we may easily obtain a deviation inequality by using Markov’s inequality and standard arguments. It reads as follows.
Corollary 5.6
Under the same assumptions as in Theorem 5.5, it holds for fixed \(\delta >0\)
where \(C:=3+8\,C_{\mathrm {R}}^2+4\,C_{\mathrm {R}}<28.05\) is an absolute constant.
Proof
We define the events
and split up
Treating each summand separately, we have
Next we estimate \(\mathbb {P}(A^\complement | B)\) using the Markov inequality (4.2), Theorem 5.5, and setting
which yields
and the assertion follows. \(\square \)
Example 5.7
Theorem 5.5 as well as Corollary 5.6 can also be applied to unbounded orthonormal systems, which, however, may lead to non-optimal error bounds. For instance, let \(D=[-1,1]\) and let \(\varrho _D\) be the normalized Lebesgue measure on D. Then, the second-order operator \({\text {A}}\) defined by
characterizes for \(s>1\) weighted Sobolev spaces
which are in fact reproducing kernel Hilbert spaces with reproducing kernel
where \({\mathcal {P}}_k:D\rightarrow \mathbb {R}\), \(k\in \mathbb {N}\), are \(L_2(D)\)-normalized Legendre polynomials. Clearly, \(({\mathcal {P}}_k)_{k\in \mathbb {N}}\) provides an ONB in \(L_2(D)\) and plays the role of \((\eta _k)_{k\in \mathbb {N}}\) in our setting. Moreover, we have \(e_k=\lambda _k\eta _k=(1+(k(k+1))^s)^{-1/2}{\mathcal {P}}_k\), and, accordingly, \(\sigma _k=(1+(k(k+1))^s)^{-1/2}\). The two quantities N(m) and T(m) are given by
Applying Theorem 5.5 or Corollary 5.6 and choosing m as large as possible leads to the relation \(m^2\sim n/\log {n}\), which is far from optimal with respect to the number n of used samples, i.e., we observe worst-case error estimates of the form
in expectation and with high probability, respectively.
On the other hand, the result by Bernardi and Maday [4, Thm. 6.2] using a polynomial interpolation operator \(j_{n-1}\) at Gauss points guarantees
which is optimal in its main rate s. However, as it turns out below (see Example 5.10), it is not optimal with respect to the power of the logarithm. \(\square \)
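The quadratic growth of N(m) for the Legendre system can be checked directly. The snippet below is a numerical illustration (under the normalization \(\eta _k = \sqrt{2k-1}\,{\mathcal {P}}^{\mathrm {classical}}_{k-1}\) with respect to \(\varrho _D = \mathrm {d}x/2\), consistent with the relation \(m^2 \sim n/\log n\) observed above); it evaluates \(\sup _{x} \sum _{k=1}^{m-1}|\eta _k(x)|^2\) on a grid:

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

m = 20
x = np.linspace(-1.0, 1.0, 2001)
# eta_k = sqrt(2k-1) P_{k-1}: Legendre polynomials normalized in L_2([-1,1], dx/2)
N_m = sum((2 * k - 1) * Legendre.basis(k - 1)(x) ** 2 for k in range(1, m)).max()
print(N_m)   # (m-1)^2 = 361, attained at the endpoints x = +-1
```

Since \({\mathcal {P}}^{\mathrm {classical}}_{k}(\pm 1) = (\pm 1)^k\), the supremum equals \(\sum _{k=1}^{m-1}(2k-1) = (m-1)^2\), i.e., \(N(m)\) grows quadratically in m for this system.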
Example 5.7 illustrates that the suggested approach leads to worst-case error estimates with high probability. However, the achieved upper bounds are not optimal in specific situations. We will overcome this limitation in the next section by drawing the sampling nodes with respect to a weighted, tailored distribution and applying a weighted least squares algorithm.
5.3 Improvements Due to Importance Sampling
We are interested in the question of optimal sampling recovery of functions from reproducing kernel Hilbert spaces in \(L_2(D,\varrho _D)\). The goal is to get reasonable bounds in n, preferably in terms of the singular numbers of the embedding. Theorem 5.5 already gives a satisfying answer in the case of bounded kernels and \(N(m) \in {\mathcal {O}}(m)\). In order to drop both conditions, we will use a weighted (deterministic) least squares algorithm (see Algorithm 2) to recover functions \(f\in H(K)\) from samples at random nodes (“random information” in the sense of [28]). The approach is a slight modification of the one proposed earlier in [34]. We will use a technique known as “(optimal) importance sampling”, where one defines a density function depending on the spectral properties of the embedding operator. The sampling nodes are then drawn according to this density. In the Monte Carlo setting (or “randomized setting”) this has been successfully applied, e.g., by Cohen and Migliorati [16], see Remark 6.3. Also in connection with compressed sensing it led to substantial improvements when recovering multivariate functions, see [57, 58]. This technique was originally applied, e.g., to the approximation of integrals, see [27]. However, the setting we are interested in requires additional work since the sampling nodes are supposed to be drawn in advance for the whole class of functions.
As already mentioned, we construct a more suitable distribution which is used to draw the sampling nodes at random. In particular, we tailor a probability density function \(\varrho _m:D\rightarrow \mathbb {R}\) and define the measure \(\mu _m\) by
Then, we may draw the sampling nodes in \({\mathbf {X}}\subset D\) i.i.d. at random according to \(\mu _m\). For the chosen set \({\mathbf {X}}\), we define the approximation operator \({\widetilde{S}}^m_{\mathbf {X}}\) as indicated in Algorithm 2.
Choosing the specific density function
guarantees worst-case error estimates which are optimal up to logarithmic factors and up to a specific failure probability.
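The resulting weighted least squares step can be summarized in a few lines. The following sketch is a generic implementation of the reweighting used in Algorithm 2 (the concrete density from (5.12) is not reproduced here; any positive density values at the nodes can be plugged in): each row of the plain least squares system is scaled by \(\varrho _m({\mathbf {x}}^i)^{-1/2}\).

```python
import numpy as np

def weighted_lsq_coeffs(f_vals, eta_vals, dens_vals):
    """Weighted least squares fit in the spirit of Algorithm 2 (a sketch).

    f_vals    -- samples f(x^i) at nodes drawn i.i.d. from mu_m
    eta_vals  -- n x (m-1) matrix with entries eta_k(x^i)
    dens_vals -- density values rho_m(x^i) > 0 at the nodes

    Solves min_c sum_i rho_m(x^i)^{-1} |sum_k c_k eta_k(x^i) - f(x^i)|^2,
    i.e. the system L~ c = g with rows scaled by rho_m(x^i)^{-1/2}.
    The approximant is then S~ f = sum_k c_k eta_k.
    """
    w = 1.0 / np.sqrt(dens_vals)
    c, *_ = np.linalg.lstsq(eta_vals * w[:, None], f_vals * w, rcond=None)
    return c
```

If the target function lies in the span of \(\eta _1,\ldots ,\eta _{m-1}\) and the system matrix has full column rank, the coefficients are recovered exactly for any admissible weighting.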
Theorem 5.8
Let H(K) be a separable reproducing kernel Hilbert space of complex-valued functions defined on D such that
for some non-trivial \(\sigma \)-finite measure \(\varrho _D\) on D, where \((\sigma _j)_{j\in \mathbb {N}}\) denotes the non-increasing sequence of singular numbers of the embedding \(\mathrm {Id}:H(K) \rightarrow L_2(D,\varrho _D)\). Let further \(\delta \in (0,1/3)\) and let \(n\in \mathbb {N}\) be large enough such that
holds. Moreover, we assume \(\varrho _m:D\rightarrow \mathbb {R}\) and \(\mu _m\) as stated in (5.12) and (5.11). We draw each node in \({\mathbf {X}}:=\{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\}\subset D\) i.i.d. at random according to \(\mu _m\), which yields
where \(C:=3+16\,C_{\mathrm {R}}^2+4\sqrt{2}\,C_{\mathrm {R}}<49.5\) is an absolute constant.
Proof
We emphasize that the second argument of the \(\max \) term in (5.14) makes sense since we know from (2.6) that the sequence of singular numbers is square-summable. As a technical modification of the density function presented in [34], we use the density \(\varrho _m :D \rightarrow \mathbb {R}\) as stated in (5.12). As above, the family \((e_j(\cdot ))_{j\in \mathbb {N}}\) represents the eigenvectors corresponding to the non-vanishing eigenvalues of the compact self-adjoint operator \(W_{\varrho _D}:=\mathrm {Id}^*\circ \mathrm {Id}: H(K) \rightarrow H(K)\), the sequence \((\lambda _j)_{j\in \mathbb {N}}\) represents the ordered eigenvalues, and finally \(\eta _j:=\lambda _j^{-1/2} e_j\). Since we assume the separability of H(K) and the \(\sigma \)-finiteness of \(\varrho _D\), we observe equality in (2.6), cf. [11, Thm. 4.27], and thus easily see that \(\int _D \varrho _m({\mathbf {x}}) \, \varrho _D(\mathrm {d}{\mathbf {x}}) = 1\). Let us define a family of kernels \({\widetilde{K}}_m({\mathbf {x}},{\mathbf {y}})\), indexed by \(m\in \mathbb {N}\), via
and a new measure \(\mu _m\) as stated in (5.11) with the corresponding weighted space \(L_2(D,\mu _m)\). Clearly, \({\widetilde{K}}_m(\cdot ,\cdot )\) is a positive type function. As a consequence of
we obtain by an elementary calculation that with a constant \(c_m>0\) it holds
Indeed,
with
Hence, if \(\varrho _m({\mathbf {x}})\) or \(\varrho _m({\mathbf {y}})\) happens to be zero, then we may put \({\widetilde{K}}_m({\mathbf {x}},{\mathbf {y}}) := 0\) in (5.15). In any case, due to (5.16), the kernel \({\widetilde{K}}_m({\mathbf {x}},{\mathbf {y}})\) fits the requirements in Theorem 5.5. In fact, for Theorem 5.5 it is necessary that \({\widetilde{N}}(m)\) and \({\widetilde{T}}(m)\) are well defined and that we have access to function values in order to create the matrix \({\widetilde{{\mathbf {L}}}}_m\) and to take the function values \(f({\mathbf {x}}^k)\), \(k=1,\ldots ,n\). Let us first discuss the quantities \({\widetilde{N}}(m)\) and \({\widetilde{T}}(m)\) appearing in Theorem 5.5 for this new kernel \({\widetilde{K}}_m({\mathbf {x}},{\mathbf {y}})\). It is clear that the embedding \(\mathrm {Id}:H({\widetilde{K}}_m) \rightarrow L_2(D,\mu _m)\) shares the same spectral properties as the original embedding. Note that a function g belongs to \(H({\widetilde{K}}_m)\) if and only if \(g(\cdot ) = f(\cdot )/\sqrt{\varrho _m(\cdot )}\) for some \(f\in H(K)\), where we always put \(0/0:=0\). Clearly, as a consequence of (5.16) and (5.17) below (together with a density argument), we have that \(\varrho _m({\mathbf {x}}) = 0\) implies \(f({\mathbf {x}}) = 0\) for all \(f\in H(K)\). Moreover, whenever \(\Vert f\Vert _{H(K)} \le 1\), the function \(g:=f/\sqrt{\varrho _m}\) satisfies \(\Vert g\Vert _{H({\widetilde{K}}_m)}\le 1\). Indeed, let
Then, \(\langle f,f\rangle _{H(K)} = \sum _{j=1}^N\sum _{i=1}^N \alpha _i \overline{\alpha _j}K({\mathbf {x}}^j,{\mathbf {x}}^i)\). We have
This implies
What remains is a standard density argument. The singular numbers of the new embedding remain the same. The singular vectors \({\tilde{e}}_k(\cdot )\) and \({\tilde{\eta }}_k(\cdot )\) are slightly different. They are the original ones divided by \(\sqrt{\varrho _m(\cdot )}\). In fact,
Furthermore, taking (2.5) into account, we find
In order to define the new reconstruction operator \({\widetilde{S}}_{{\mathbf {X}}}^m\) we need to create the matrices \({\widetilde{{\mathbf {L}}}}_m\) using the new function system \({\tilde{\eta }}_k\) and take function evaluations \(f({\mathbf {x}}^1),\ldots ,f({\mathbf {x}}^n)\). In more detail, we solve the least squares problem
and the vector \({\tilde{\mathbf {c}}}\) contains the coefficients of the least squares approximation \(S_{{\mathbf {X}}}^m(g)=\sum _{j=1}^{m-1} {\tilde{c}}_j{\tilde{\eta }}_j\) of \(g:=f/\sqrt{\varrho _m}\). This leads to Algorithm 2. Consequently, Theorem 5.5 allows to estimate the error
where \({\widetilde{S}}^m_{\mathbf {X}}f:=\sqrt{\varrho _m}\,S_{{\mathbf {X}}}^m(g)=\sum _{j=1}^{m-1} {\tilde{c}}_j\eta _j({\mathbf {x}})\).
We stress that \({\widetilde{S}}^{m}_{\mathbf {X}}f\) and the direct computation of \(S^{m}_{\mathbf {X}}f\) using \({\mathbf {L}}_m\, {\mathbf {c}} = {\mathbf {f}}\) may not coincide since both are based on different least squares problems in general.
It remains to note that for fixed n and m as in (5.13) we have for \({\mathbf {X}}=({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n)\) the relation
Applying a slight modification of Corollary 5.6, i.e., setting
in the proof, yields
where we took (5.18) into account and \(C=3+16C_R^2+4\sqrt{2}C_R\) is as stated above. \(\square \)
Remark 5.9
-
(i)
In order to prove the theorem in full generality for separable RKHS, we use a technical modification of the density function presented in [34]. Clearly, as a consequence of (2.5), the function \(\varrho _m\) is positive and defined pointwise for any \({\mathbf {x}}\in D\). Moreover, it can be computed precisely from the knowledge of \(K({\mathbf {x}},{\mathbf {x}})\) and the first \(m-1\) eigenvalues and corresponding eigenvectors. The pointwise defined density function will be an essential ingredient for drawing the nodes in \({\mathbf {X}}\), on the one hand, and performing a reweighted least squares fit, on the other hand. Note that point evaluations of the density function are also used in the algorithm. To circumvent the lack of injectivity we introduce a new reproducing kernel Hilbert space \(H({\widetilde{K}}_m)\) built upon the modified density function. Corollary 5.6 is applied to this situation, and we obtain Theorem 5.8, which also improves the results in [34] by determining explicit constants. In turn, we show that the original algorithm proposed in [34] also works in this more general situation since both densities are equal almost everywhere.
-
(ii)
The situation of a non-separable RKHS H(K) is not considered in Theorem 5.8. It has been considered in the subsequent paper [44] by Moeller and T. Ullrich. Here the sampling density has to be modified even further to get a bound
$$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)} \le 1} \Vert f-{\widetilde{S}}^m_{{\mathbf {X}}}f\Vert ^2_{L_2(D,\varrho _D)} \le c_1 \max \left\{ \frac{1}{n},\frac{1}{m}\sum \limits _{k\ge m/2} \sigma _k^2\right\} \end{aligned}$$with probability larger than \(1-c_2n^{1-r}\) and \(m:=\lfloor n/(c_3r\log (n)) \rfloor \). With this method we cannot get beyond the rate \(n^{-1/2}\) in case of non-separable Hilbert spaces H(K).
-
(iii)
While this paper was under review, the statement “recovery with high probability” has been refined by Ullrich [66] and, independently, by Moeller and Ullrich [44]. In fact, the failure probability of the above recovery can be controlled by \(n^{-r}\) (and is therefore decaying polynomially in the number of samples) by only paying a multiplicative constant r in the bounds, see also (ii).
-
(iv)
As a further follow up, Nagel et al. [45] developed a subsampling technique to select a subset \({\mathbf {J}} \subset {\mathbf {X}}\) of \({\mathcal {O}}(m)\) nodes out of the randomly chosen node set \({\mathbf {X}}\) such that the corresponding least squares recovery operator \({\widetilde{S}}^m_{{\mathbf {J}}}\) gives
$$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)} \le 1} \Vert f-{\widetilde{S}}^m_{{\mathbf {J}}}f\Vert ^2_{L_2(D,\varrho _D)} \le \frac{C\log m}{m}\sum \limits _{k=cm}^{\infty } \sigma _k^2 \end{aligned}$$with precisely determined universal constants \(C,c>0\).
Example 5.10
-
(i)
If we choose \(m \asymp n/\log (n)\) according to (5.13) and assume a polynomial decay of the singular values, i.e., \(\sigma _m\lesssim m^{-s}\log ^\alpha m\) with \(s>1/2\) (which is, for instance, the case for Sobolev embeddings into \(L_2\)), we find
$$\begin{aligned} \sup \limits _{\Vert f\Vert _{{H(K_s)}}\le 1} \Vert f-{\widetilde{S}}_{{\mathbf {X}}}^mf\Vert _{L_2(D)} \lesssim m^{-s}\log ^{\alpha } m \asymp n^{-s}\log ^{\alpha +s} n. \end{aligned}$$Accordingly, we observe the best possible main rate \(-s\) with respect to the used number of samples n, i.e., we achieve optimality up to logarithmic factors. For a more specific example, see (ii) below and Sect. 8.
-
(ii)
Let us reconsider the setting from Example 5.7. Theorem 5.8 provides a highly improved sampling strategy compared to the one discussed in Example 5.7. Certainly, we change the underlying distribution for the random selection of the sampling nodes and we incorporate weights in the least squares algorithm, cf. Algorithm 2. However, this allows for a crucial improvement of the relation of the maximal polynomial degree m and the number n of sampling nodes to \(m\sim n/\log (n)\), which leads to estimates
$$\begin{aligned} \sup \limits _{\Vert f\Vert _{{H(K_s)}}\le 1} \Vert f-{\widetilde{S}}^m_{\mathbf {X}}f\Vert _{L_2(D)} \lesssim \max \left\{ \sigma _m,\sqrt{m^{-1}\sum _{k=m}^\infty \sigma _k^2}\right\} \asymp m^{-s}\lesssim \left( \frac{\log (n)}{n}\right) ^{s} \end{aligned}$$that are optimal in m and—up to logarithmic factors—even optimal in n. Here \(s>1/2\) is admitted. Compared to the results in Example 5.7, we obtain the same rates of convergence with respect to the polynomial degree m and significantly improved rates of convergence with respect to the number n of used sampling values. Note that instead of the density given in (5.12) we may use the Chebyshev measure given by the density \(\varrho ({\mathbf {x}}) = (\pi \sqrt{1-x^2})^{-1}\) on \(D = [-1,1]\) since the \(L_2\)-normalized Legendre polynomials \({\mathcal {P}}_k(x)\) are dominated by \(C(1-x^2)^{-1/4}\) for all \(k \in \mathbb {N}_0\), see [57]. Hence, in contrast to (5.12), this sampling measure is universal for all m.
In comparison with the results in Example 5.7, we obtain a significantly better rate of convergence with respect to the number n of used sampling values. Applying the recent result mentioned in Remark 5.9(iv) we obtain for the subsampled weighted least squares operator the worst-case error bound
Note that \({\widetilde{S}}_{{\mathbf {J}}}^mf\) uses \(n \in {\mathcal {O}}(m)\) many samples of f. \(\square \)
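A compact end-to-end illustration of this universal Chebyshev strategy (a sketch with hypothetical parameter choices, not taken from the paper): draw Chebyshev-distributed nodes, reweight the Legendre least squares system by the reciprocal square root of the sampling density, and recover a smooth target function.

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

rng = np.random.default_rng(1)
n, m = 400, 12
u = rng.uniform(0.001, 0.999, n)            # avoid the endpoints x = +-1
x = np.cos(np.pi * u)                       # Chebyshev-distributed nodes in (-1, 1)
rho = 2.0 / (np.pi * np.sqrt(1.0 - x**2))   # sampling density w.r.t. rho_D = dx/2
eta = np.column_stack([np.sqrt(2*k - 1) * Legendre.basis(k - 1)(x)
                       for k in range(1, m)])   # normalized Legendre system
w = 1.0 / np.sqrt(rho)
c, *_ = np.linalg.lstsq(eta * w[:, None], np.exp(x) * w, rcond=None)  # f(x) = e^x
xt = np.linspace(-0.99, 0.99, 500)
approx = sum(c[k-1] * np.sqrt(2*k-1) * Legendre.basis(k-1)(xt) for k in range(1, m))
err = np.max(np.abs(approx - np.exp(xt)))
print(err)   # small: e^x is very well approximated by polynomials of degree 10
```

Since the same Chebyshev density works for every polynomial degree m, no density has to be recomputed when m changes.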
5.4 The Power of Standard Information and Tractability
The approximation framework studied in this paper was first considered by Wasilkowski and Woźniakowski in 2001, see [67]. Since then, many authors have dealt with the problem of how well one can approximate functions from RKHS by only using function values. For further historical remarks see [51] and [45, Rem. 1].
In this context, sampling values are called standard information, abbreviated by \(\Lambda ^{\mathrm{std}}\). In contrast to this, one may also allow linear information (abbreviated by \(\Lambda ^{\mathrm{all}}\)), which means that the approximation is computed from finitely many pieces of information about the function coming from arbitrary linear functionals. Clearly, \(\Lambda ^{\mathrm{std}}\) is a subset of \(\Lambda ^{\mathrm{all}}\) due to (2.1). It is well known that the best possible worst-case error with respect to m pieces of information from \(\Lambda ^{\mathrm{all}}\) can be achieved by approximating functions by means of their exact Fourier partial sum within \({\text {span}}\{\eta _j:j=1,\ldots ,m\}\). The corresponding worst-case error is then determined by \(\sigma _{m+1}\), see [48, Cor. 4.12], which establishes a natural lower bound on the worst-case error with respect to \(\Lambda ^{\mathrm{std}}\). One crucial question in IBC is to determine whether or not standard information \(\Lambda ^{\mathrm{std}}\) is as powerful as linear information \(\Lambda ^{\mathrm{all}}\). In this context, “power” is usually identified with the order of convergence of the considered error with respect to the number of used pieces of information.
From Theorem 5.8, see also Example 5.10(i), we obtain that the polynomial order of convergence for standard information is the same as for linear information if the kernel has a finite trace (i.e., \(\mathrm {Id}:H(K)\rightarrow L_2(D,\varrho _D)\) is a Hilbert–Schmidt operator). This problem has been addressed in [38, 50, Open Problem 1] and [51, Open Problem 126]. In [34], this observation has already been made for the situation that the embedding operator is injective (such that the eigenvectors of the strictly positive eigenvalues form an orthonormal basis in H(K), see Remark 2.1). The contribution of Theorem 5.8 is to get explicitly determined constants, on the one hand. On the other hand, it shows that separability of the RKHS and a finite trace condition are essentially enough for this purpose. Note that the finite trace condition cannot be avoided, see [29].
We will further discuss some consequences for the tractability of this problem. For the necessary notions and definitions from the field of information-based complexity, see [48, 49, 51]. We comment on polynomial tractability with respect to linear information \(\Lambda ^{\mathrm{all}}\) and standard information \(\Lambda ^{\mathrm{std}}\). Let us consider the family of approximation problems
where \(K_d:D_d\times D_d \rightarrow \mathbb {C}\) is a family of reproducing kernels. In [48, Thm. 5.1] strong polynomial tractability of the family \(\{{\text {APP}}_d\}\) with respect to \(\Lambda ^{\text {all}}\) is characterized as follows: There is a \(\tau >0\) such that
where \(\sigma _{j,d}\), \(j=1,2,\ldots \), are the singular values belonging to \({{\text {APP}}}_d\) for fixed d. From our analysis in Theorem 5.8, we directly obtain a sufficient condition for polynomial tractability with respect to \(\Lambda ^{\text {std}}\). Without going into detail, we denote by \(n^{\mathrm{wor}}(\varepsilon ,d;\Lambda )\), \(\Lambda \in \{\Lambda ^{\mathrm{std}},\Lambda ^{\mathrm{all}}\}\), the minimal number of pieces of information from the specified class \(\Lambda \) that any algorithm requires in order to achieve a worst-case error \(\varepsilon \) for the d-dimensional approximation problem. Certainly, \(n^{\mathrm{wor}}(\varepsilon ,d;\Lambda )\) usually depends on \(\varepsilon \), the dimension d, and \(\Lambda \). At this point, we would like to mention that polynomial tractability means that \(n^{\mathrm{wor}}(\varepsilon ,d;\Lambda )\le C\,\varepsilon ^{-p}\,d^{q}\), \(C,p,q\ge 0\), i.e., \(n^{\mathrm{wor}}(\varepsilon ,d;\Lambda )\) can be bounded by terms that are polynomial in both d and \(\varepsilon ^{-1}\). Furthermore, the problem \({\text {APP}}_d\) is called strongly polynomially tractable in the case that the estimate on \(n^{\mathrm{wor}}\) holds with \(q=0\).
Theorem 5.11
The family \(\{{\text {APP}}_d\}\) is strongly polynomially tractable with respect to \(\Lambda ^{\mathrm{std}}\)
-
(i)
if there exists \(\tau \le 2\) such that (5.19) holds true or
-
(ii)
if \(\{{\text {APP}}_d\}\) is strongly polynomially tractable with respect to \(\Lambda ^{\mathrm{all}}\) with exponent \(0<p_{\mathrm{all}}<2\), i.e., \(n^{\mathrm{wor}}(\varepsilon ,d;\Lambda ^{\mathrm{all}}) \le C_{\mathrm{all}}\,\varepsilon ^{-p_{\mathrm{all}}}\).
More precisely, there are constants \(C_{\mathrm{std}}, \delta >0\) only depending on \((C,\tau )\) or \((C_{\mathrm{all}}, p_{\mathrm{all}})\) such that
with \(p_{\mathrm{std}} = \tau \) in case (i) and \(p_{\mathrm{std}} = p_{\mathrm{all}}\) in case (ii).
Proof
Note that (ii) implies (i) with \(\tau = 2\). Furthermore, Theorem 5.8 implies strong polynomial tractability if (i) is assumed. In case (i) we may use Stechkin’s lemma [18, Lem. 7.8] which gives that \(\sum _{j=m}^{\infty } \sigma _{j,d}^2 \le C^2m^{-2/\tau +1}\) for all d. This gives the exponent \(p_{\text {std}} = \tau \) and an additional \(\log \) due to (5.13). If (ii) is assumed then \(\sum _{j=m}^{\infty } \sigma _{j,d}^2 \le C'm^{-2/p_{\mathrm{all}} +1}\) for all d. \(\square \)
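The Stechkin-type tail bound used in case (i) can be sanity-checked numerically. The snippet below is an illustration for one concrete sequence (the constant is not optimized here): it verifies \(\sum _{j=m}^{\infty } \sigma _j^2 \le C^2 m^{-2/\tau +1}\) for \(\sigma _j = j^{-1}\), \(\tau = 3/2\), and \(C = (\sum _j \sigma _j^{\tau })^{1/\tau }\).

```python
import numpy as np

tau = 1.5
sigma = 1.0 / np.arange(1.0, 2.0 * 10**5)     # sigma_j = j^{-1}, nonincreasing
C = np.sum(sigma ** tau) ** (1.0 / tau)       # C = (sum_j sigma_j^tau)^{1/tau}
tails_ok = all(np.sum(sigma[m - 1:] ** 2) <= C**2 * m ** (-2.0 / tau + 1.0)
               for m in (10, 100, 1000, 10000))
print(tails_ok)
```

Here the tail \(\sum _{j\ge m} j^{-2} \approx m^{-1}\) decays faster than the guaranteed rate \(m^{-1/3}\), so the bound holds with room to spare.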
Theorem 5.11 is stronger than [51, Thm. 26.20] in two aspects. As pointed out in the proof, assumption (i) is weaker than (ii), which is essentially the one in [51, Thm. 26.20]. Furthermore, our statement is stronger since \(p_{\mathrm{std}}\) equals \(p_{\mathrm{all}}\). The authors in [51, 26.6.1] showed that \(p_{\mathrm{std}} = p_{\mathrm{all}}+\frac{1}{2} [p_{\mathrm{all}}]^2\), noted that “the lack of the exact exponent represents another major challenge”, and formulated Open Problem 127. Our considerations prove that the dependence on \(\varepsilon \) of the tractability estimates on \(n^{\mathrm{wor}}(\varepsilon ,d;\Lambda ^{\mathrm{all}})\) and \(n^{\mathrm{wor}}(\varepsilon ,d;\Lambda ^{\mathrm{std}})\) coincides up to logarithmic factors. Similar assertions hold true when strong polynomial tractability is replaced by polynomial tractability. The modifications are straightforward.
Example 5.12
Let us consider an example from [37, 55], namely, if \(s_1 \le s_2 \le \cdots \le s_d\), then
and \({\text {APP}}_d:H_{\text {mix}}^{\vec {s}}(\mathbb {T}^d)\rightarrow L_2(\mathbb {T}^d)\). Here \(\mathbb {T}= [0,1]\), where opposite points are identified, and \(H^s(\mathbb {T})\) is normed by
with \(w_s(k) = \max \{1,(2\pi )^s|k|^s\}\), see also Sect. 8. The smoothness vector \(\vec {s}\) is supposed to grow in j, e.g., for \(\beta >0\) we have
It has been shown in [55], see also [37], that the growth condition (5.20) is necessary and sufficient for polynomial tractability with respect to \(\Lambda ^{\text {all}}\). Taking Theorem 5.11 into account, we easily check that (5.19) is satisfied with \(\tau \le 2\) whenever \(\beta \cdot \tau > 1\), which means that \(\beta > 1/2\) is sufficient for strong polynomial tractability. In this case we obtain for any \(1/\beta < \tau \le 2\) that
6 Recovery of Individual Functions
In this section, we are interested in the reconstruction of an individual function f (taken from the unit ball of H(K)) from samples at random nodes. In the IBC community, see [48, 49, 51], such a scenario is called the “randomized setting”, which refers to the occurrence of a random element in the algorithm (Monte Carlo). The following investigations improve on the results in [68] in two respects. On the one hand, it is not necessary to choose the sampling nodes according to a whole family of probability density functions, see Remark 6.3. On the other hand, assuming a bounded kernel turns out to be sufficient to obtain the rate of convergence matching that of \((\sigma _k)_{k\in \mathbb {N}}\) when randomly choosing sampling nodes according to the given probability measure \(\varrho _D\), due to Corollary 6.2.
The subsequent analysis is, however, related to the one in [13, 14]. With similar techniques as above, we will obtain an estimate for the conditional mean of the individual error \(\Vert f-S^m_{\mathbf {X}}f\Vert _{L_2(D,\varrho _D)}\).
Theorem 6.1
Let \(\varrho _D\) be a probability measure on D and let H(K) denote a reproducing kernel Hilbert space which is compactly embedded into \(L_2(D,\varrho _D)\). Let \(m,n\in \mathbb {N}\), \(m\ge 2\), be chosen such that (5.1) holds for some \(0<\delta <1\). Let further f be a fixed function such that \(\Vert f\Vert _{ H(K)} \le 1\). Drawing each sampling node in \({\mathbf {X}}=\{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\} \subset D\) i.i.d. at random according to \(\varrho _D\), we have for the conditional expectation of the individual error
where \(0<C_{\delta }\le 0.06\) depends on \(\delta \).
Proof
This time we follow the proof of [13, Thm. 2] and obtain
where \(g = f - P_{m-1}f\). Averaging on both sides yields
Note that the last summand on the right-hand side vanishes since g is orthogonal to \(\eta _k\), \(k = 1,\ldots ,m-1\). This gives
by taking \(4 N(m)/n \le C_{\delta }/\log (2n)\) into account, see (5.1), where \(C_\delta \le \frac{4}{48\sqrt{2}}\le 0.06\). \(\square \)
Analogous to the proof of Corollary 5.6, one shows the following result.
Corollary 6.2
Under the same assumptions as in Theorem 6.1, it holds for fixed \(\delta >0\)
Remark 6.3
-
(i)
Following [16] we may relax the condition (5.1) to
$$\begin{aligned} m := \left\lfloor \frac{n}{48\left( \sqrt{2}\log (2n)-\log {\delta }\right) }\right\rfloor +1, \end{aligned}$$when sampling with respect to the new measure (importance sampling)
$$\begin{aligned} \mu _m(A) := \int _A \varrho _m({\mathbf {x}}) \, \varrho _D(\mathrm {d}{\mathbf {x}}), \end{aligned}$$where \(\varrho _m\) is given by
$$\begin{aligned} \varrho _m({\mathbf {x}}) := \frac{1}{m-1}\sum \limits _{k=1}^{m-1}|\eta _k({\mathbf {x}})|^2. \end{aligned}$$Then, the corresponding approximation of the function f is based on the solution of a weighted least squares problem similar to the one discussed in Sect. 5.3. Note that we do not have to assume the square summability of the singular numbers of the embedding \(\mathrm {Id}:H(K)\rightarrow L_2(D,\varrho _D)\) here.
-
(ii)
The result of Theorem 6.1, advanced with (i), directly leads to estimates on the power of standard information, see Sect. 5.4, in the so-called randomized setting, cf. [51]. Roughly speaking, n sampling values (standard information) in the randomized setting are at least as powerful as \( c n/\log {n}\) pieces of linear information, i.e., the supremum over all \(f\in H(K)\) of the expected approximation error caused by the weighted least squares regression is almost as good as the best possible worst-case approximation error caused by the projection. The recent work [40] treats this topic in full detail. Moreover, in [39] the authors present improvements on the power of standard information in the so-called average case setting. Both papers investigate the weighted least squares regression given in Algorithm 2 using the techniques that yielded the results of this section.
(iii)
Together with the Weaver subsampling technique in [45], see also [47], we obtain
$$\begin{aligned}&e^{\mathrm{ran}}\left( n,d;\Lambda ^{\mathrm{std}}\right) \\&\quad \le C\min \left\{ \sqrt{\log n}\cdot e^{\mathrm{det}}\left( c_1n,d;\Lambda ^{\mathrm{all}}\right) , e^{\mathrm{det}}\left( c_2n/\log n,d;\Lambda ^{\mathrm{all}}\right) \right\} , \end{aligned}$$with three universal constants \(C,c_1,c_2>0\), where we use the notation from [48, 49, 51]. This is a further significant improvement of the results in [40] in the situation above if the sequence \((\sigma _k)_{k\in \mathbb {N}}\) decays “faster” than \(k^{-1/2}\).
While this paper was under review, it was shown by Cohen and Dolbeault [15] that even \(e^{\mathrm{ran}}(n,d;\Lambda ^{\mathrm{std}}) \le Ce^{\mathrm{det}}(cn,d;\Lambda ^{\mathrm{all}})\) holds true. Note that this result is also based on a weighted least squares method and [47].
7 Optimal Weights for Numerical Integration
We consider the problem of approximating the integral with respect to \(\mu _D\) of a function f by the cubature rule \({\text {Q}}_{\mathbf {X}}^m\)
where the weights \(q_j\) are determined by \({\mathbf {X}}= \{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\}\). Indeed, assuming that \({\mathbf {L}}_m\) has full column rank and following Oettershagen [52], we set
In fact, the cubature rule \({\text {Q}}_{\mathbf {X}}^m\) is the implicit integration of the least squares solution \(S^m_{\mathbf {X}}f:=\sum _{k=1}^{m-1} c_k \eta _k\), cf. (3.2), \({\varvec{c}}:= ({\mathbf {L}}_m^{*}\,{\mathbf {L}}_m)^{-1}\,{\mathbf {L}}_m^*\, {\varvec{f}}\), since
Using this, we give upper bounds on the integration error caused by \({\text {Q}}_{\mathbf {X}}^m\).
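For concreteness, the following self-contained sketch (a univariate Fourier setting with illustrative sizes of our own choosing) computes these weights as the first column of \(\overline{{\mathbf {L}}_m\,({\mathbf {L}}_m^{*}{\mathbf {L}}_m)^{-1}}\), which is the special case \(\eta _1\equiv 1\) with Lebesgue measure on the torus, and verifies that the resulting rule integrates every function in \({\text {span}}\{\eta _1,\ldots ,\eta _{m-1}\}\) exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 12                       # illustrative sizes (logarithmic oversampling)
x = rng.random(n)                    # i.i.d. uniform nodes on the torus [0, 1)

# the univariate Fourier system eta_1, ..., eta_{m-1}; eta_1 is the constant
freqs = np.array([0, 1, -1, 2, -2, 3, -3, 4, -4, 5, -5])
L = np.exp(2j * np.pi * np.outer(x, freqs))          # the matrix L_m, n x (m-1)

# q^T = b^T (L^* L)^{-1} L^*  with  b_k = int_0^1 eta_k dx = delta_{k,1};
# equivalently, q is the first column of conj(L (L^* L)^{-1})
G = L.conj().T @ L
q = np.conj(L @ np.linalg.solve(G, np.eye(m - 1, 1, dtype=complex)))[:, 0]

# the rule reproduces the integral of each basis function:
Q_on_basis = q @ L                   # Q(eta_k) for k = 1, ..., m-1
print(np.allclose(Q_on_basis, np.eye(1, m - 1)[0]))  # -> True
```

Exactness on the span is automatic here: applied to samples of \(\eta _k\), the least squares solution returns \(\eta _k\) itself whenever \({\mathbf {L}}_m\) has full column rank.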
Theorem 7.1
Let \(\mu _D\) be a measure on D such that \(L_1(D,\varrho _D) \hookrightarrow L_1(D,\mu _D)\). Denote by
the norm of the embedding. Under the same assumptions as in Theorem 5.5, it holds for fixed \(\delta >0\)
Proof
Using the embedding relation of the different \(L_1\)-spaces we have
We conclude the proof using Corollary 5.6. \(\square \)
In Theorem 7.1, we assume that the kernel is bounded, i.e., \(\sup _{{\mathbf {x}}\in D} K({\mathbf {x}},{\mathbf {x}})<\infty \). However, as we will see below, it is enough to assume \(\int _{D}K({\mathbf {x}},{\mathbf {x}}) \, \varrho _D(\mathrm {d}{\mathbf {x}})<\infty \) and that \(\varrho _D\) is a finite measure. We define the following modified (reweighted) cubature formula
where \({\varvec{f}}\) is the vector of function samples \(\big (f({\mathbf {x}}^1),\ldots ,f({\mathbf {x}}^n)\big )\), the vector \({\varvec{b}}\) is given as in (7.1), \({\widetilde{{\mathbf {L}}}}_m\) as in Algorithm 2, and
Here, \(\varrho _m(\cdot )\) is given by (5.12) above and depends on the spectral properties of the kernel. Note that we simply set the respective entry of \({\mathbf {D}}_{\varrho _m}\) to 0 if \(\varrho _m\) happens to vanish at the corresponding node.
Theorem 7.2
If \(\varrho _D\) denotes a finite measure and \(\int _D K({\mathbf {x}},{\mathbf {x}}) \, \varrho _D(\mathrm {d}{\mathbf {x}})<\infty \), then
holds with probability larger than \(1-3\delta \) if the n nodes in \({\mathbf {X}}\) are drawn i.i.d. according to (5.11) above.
Proof
Using the bound on the \(L_2(D,\varrho _D)\) error from Theorem 5.8 together with (7.2) above, we obtain (7.3) with high probability if the nodes \({\mathbf {X}}\) are sampled according to the measure (5.11). \(\square \)
Remark 7.3
The previous result essentially improves on the bound in [2, Thm. 1] if the singular values decay polynomially. Note that we assume that \(\varrho _D\) is a finite measure and \(\int _D K({\mathbf {x}},{\mathbf {x}}) \, \varrho _D(\mathrm {d}{\mathbf {x}})<\infty \). In [2, Thm. 1], a logarithmic oversampling is not required; however, the error bounds are worse by a factor n, which is substantial, e.g., in the case of the Sobolev kernel, see Sect. 8.2.
8 Hyperbolic Cross-Fourier Regression
In the sequel, we are interested in the recovery of functions from periodic Sobolev spaces. That is, we consider functions on the torus \(\mathbb {T}^d \simeq [0,1)^d\) where opposite points are identified. Note that the unit cube \([0,1]^d\) is preferred here since it has Lebesgue measure 1 and is therefore a probability space. We could have also worked with \([0,2\pi ]^d\) and the Lebesgue measure (which can be made a probability measure by a d-dependent rescaling). The general error bounds for the recovery error given below (in terms of \((\sigma _j)_{j\in \mathbb {N}}\) like in Theorem 8.2) are not affected by this rescaling since the sequence \((\sigma _j)_{j\in \mathbb {N}}\) then also changes. However, some of the preasymptotic estimates for the \((\sigma _j)_{j\in \mathbb {N}}\) are sensitive to the choice of domain, as the results of Krieg [33] point out.
For \(\alpha \in \mathbb {N}\) we define the space \(H^{\alpha }_{\text {mix}}(\mathbb {T}^d)\) as the Hilbert space with the inner product
Defining the weight
and the univariate kernel function
directly leads to
which is a reproducing kernel for \(H^\alpha _{\text {mix}}(\mathbb {T}^d)\). The Fourier series representation of \(K^d_{\alpha ,*}({\mathbf {x}},{\mathbf {y}})\) is specified by
In particular, for any \(f\in H^\alpha _{\text {mix}}(\mathbb {T}^d)\) we have
The kernel defined in (8.3) associated with the inner product (8.1) can be extended to the case of fractional smoothness \(s>0\) replacing \(\alpha \) by s in (8.2)–(8.3) which in turn leads to the inner product
and the corresponding norm
The (ordered) sequence \((\lambda _j^*)_{j=1}^{\infty }\) of eigenvalues of the corresponding mapping \(W = \mathrm {Id}^{*}\circ \mathrm {Id}\), where \(\mathrm {Id}:H\big (K^d_{s,*}\big ) \rightarrow L_2(\mathbb {T}^d)\), is the non-increasing rearrangement of the numbers
The corresponding orthonormal system \((e_j^*)_{j=1}^{\infty }\) in \(H\big (K^d_{s,*}\big )\) is given by
Consequently, the orthonormal system \((\eta _j^*)_{j=1}^{\infty }\) in \(L_2(\mathbb {T}^d)\) is the properly ordered classical Fourier system \(\eta _{{\mathbf {k}}}({\mathbf {x}}) = \exp (2\pi \mathrm {i}\,{\mathbf {k}}\cdot {\mathbf {x}})\). This directly implies the following behavior of the quantity N(m) defined in (2.7). It holds
Remark 8.1
It is possible to define a smoothness vector \({\mathbf {s}} = (s_1,\ldots ,s_d)\) to emphasize different smoothness in different coordinate directions. Such kernels will be denoted with \(K_{{\mathbf {s}}}({\mathbf {x}},{\mathbf {y}})\). In [37], the authors establish preasymptotic error bounds which can be used for the least squares analysis as we will see below.
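Returning to the properly ordered Fourier system: it can be generated by sorting the frequencies by their weights. The following sketch (d = 2, with the illustrative \(w_{s,\#}\)-type weight \(w_s({\mathbf {k}})=\prod _{j=1}^d(1+|k_j|)^{s}\) and parameters of our own choosing) enumerates the first frequencies and checks that the induced eigenvalues \(\lambda _j = w_s({\mathbf {k}}_j)^{-2}\) are indeed non-increasing:

```python
import itertools
import numpy as np

def ordered_frequencies(d, s, kmax, m):
    """First m frequencies k in Z^d, ordered by non-decreasing weight
    w_s(k) = prod_j (1 + |k_j|)^s (illustrative weight; ties lexicographic)."""
    freqs = list(itertools.product(range(-kmax, kmax + 1), repeat=d))
    freqs.sort(key=lambda k: (np.prod([(1.0 + abs(kj)) ** s for kj in k]), k))
    return freqs[:m]

d, s = 2, 1.5
freqs = ordered_frequencies(d, s, kmax=32, m=50)
# the eigenvalues of W are the reciprocals of the squared weights
lams = np.array([np.prod([(1.0 + abs(kj)) ** (-2 * s) for kj in k]) for k in freqs])

print(freqs[0])                                # -> (0, 0): the constant mode first
print(bool(np.all(np.diff(lams) <= 1e-12)))    # -> True: non-increasing
```

The resulting index sets are finite hyperbolic crosses; the cut-off `kmax` merely has to be large enough that the first m frequencies are contained in the search box.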
Recent estimates in [35] allow for determining uniform recovery guarantees with preasymptotic error bounds. For this study, we need to change the kernel weight to the less natural but, for preasymptotic considerations, more convenient weight
\(s>0\). As a consequence, the univariate kernel
as well as the tensor product kernel
changes and has modified Fourier series expansions. Of course, the weight \(w_{s,\#}\) yields an equivalent norm
in the space \(H^s_{\text {mix}}(\mathbb {T}^d)\). However, from a complexity theoretic point of view, it is worth noting the difference between the two approaches. The respective unit balls belonging to the two norms differ significantly since the equivalence constants may depend badly on d. Moreover, we stress the fact that the non-increasing rearrangement of the eigenvalues \((\lambda _j^\#)_{j=1}^{\infty }\) of the mapping \(W = \mathrm {Id}^{*}\circ \mathrm {Id}\), where \(\mathrm {Id}:H\big (K^d_{s,\#}\big ) \rightarrow L_2(\mathbb {T}^d)\), differs from \((\lambda _j^*)_{j=1}^{\infty }\) since the associated mappings \(j\rightarrow {\mathbf {k}}\) do not coincide. Accordingly, the sampling operators \(S_{\mathbf {X}}^{m,*}\) and \(S_{\mathbf {X}}^{m,\#}\) are defined with respect to the different non-increasing rearrangements of the basis functions \(\eta _{{\mathbf {k}}}\), and thus may also differ.
Although there are many further equivalent norms for \(H^s_{\text {mix}}(\mathbb {T}^d)\), we will use only the mentioned ones in order to apply advantageous known upper bounds on the eigenvalues \(\lambda _j^{*}\) and \(\lambda _j^\#\). In the regime of interest here, the recovery of periodic functions with mixed smoothness, we even have preasymptotic bounds for these eigenvalues/singular values available (see [33, 35, 36]). Note that theoretically everything is known about the singular values \(\sigma _m^\square \), \(\square \in \{*,\#\}\), since the behavior of this sequence is determined by the non-increasing rearrangement of the reciprocals of the tensor product weights \(\prod _{j=1}^d w_{s,\square }(k_j)\), cf. (8.2) and (8.4). The analysis of the rearrangement of multi-indexed sequences has revealed that the upper bound
holds for all \(m\in \mathbb {N}\), cf. [35, Thm. 4.1].
One may argue that the kernel \(K^d_{s,\#}\) is less “natural” than the kernel \(K^d_{s,*}\). To address this, we use the observation by Krieg [33], which yields the preasymptotic bound
in the range \(2\le m\le 3^d\), where the exponent also scales linearly in s.
8.1 Uniform Recovery of Functions from \(H^s_{\text {mix}}(\mathbb {T}^d)\)
First, we consider the asymptotic behavior of the sampling error caused by the presented least squares approach. Since the asymptotic bounds on \(\lambda _j^*\) and \(\lambda _j^\#\) differ only by constants that we do not specify explicitly, we study both cases collectively. For \(\square \in \{*,\#\}\), the asymptotic behavior of the sequence \((\sigma _j^\square )_{j=1}^\infty \) for the embedding \(\mathrm {Id}:H(K_{s,\square }^d) \rightarrow L_2(\mathbb {T}^d)\) has been known for a long time, cf. [18, Chapt. 4] and the references therein. There is a constant \({\tilde{C}}_d^\square \) which depends exponentially on d such that
As a direct consequence of Corollary 5.6, we determine explicit error bounds as well as the asymptotic error behavior (8.7), where the latter holds for all equivalent norms in \(H^s_{\text {mix}}(\mathbb {T}^d)\). Please note that the constants \(C_d^\square \) depend heavily on the specific norm. Moreover, a statement similar to (8.7) can also be obtained with the technique in [34].
Theorem 8.2
Let \(d\in \mathbb {N}, s>1/2\), \(\delta >0\) and \(n\in \mathbb {N}\) be given. Choose \(m\in \mathbb {N}\), \(m\ge 2\),
and draw the n sampling nodes in \({\mathbf {X}}\) uniformly i.i.d. at random in \(\mathbb {T}^d\). Then, we achieve
In particular, there is a constant \(C_d^\square >0\) depending on d such that for \(0<\delta <1/3\) it holds
Proof
The result follows from Corollary 5.6 and our specific situation where \(B=1\), \(N(m)=m-1\) and \(T_\square (m)=\sum _{k=m}^\infty (\sigma _k^\square )^2\). We estimate \(T_\square (m)\) using [12]
Hence, the right-hand side in (5.9) can be bounded from above by a constant times \(\sigma _m^\square \), which behaves as \(n^{-s}(\log n)^{sd}\). \(\square \)
In addition, we investigate the preasymptotic error behavior using the aforementioned estimates (8.5) on the singular values \(\sigma _m^\#\) that belong to \(\mathrm {Id}:H\big (K^d_{s,\#}\big ) \rightarrow L_2(\mathbb {T}^d)\). Since the upper bounds have been proven only for this specific type of mapping, the following results, in particular the explicitly determined constants, may only hold for RKHSs with weight functions \(w_s({\mathbf {k}})\ge \prod _{j=1}^d(1+|k_j|)^s\), which is fulfilled for \(w_{s,\#}\).
Theorem 8.3
Let \(d\in \mathbb {N}\), \(s>(1+\log _2d)/2\), \(0<\delta <1\), and \(n\in \mathbb {N}\) such that
is at least 2. Drawing the n sampling nodes in \({\mathbf {X}}\) uniformly i.i.d. at random in \(\mathbb {T}^d\) yields
Proof
We apply Theorem 5.5 and take into account that we choose
for large enough n. Furthermore, we set \(c:=16/3\), \(\beta :=2s/(1+\log _2d)\) and estimate \(T_\#(m)\), see (8.5),
Taking (5.4) into account, we bound
where \(b:=\frac{\log (8n)}{48\sqrt{2}\log (2n)}+\frac{\log (8n)}{n}\). The term in the brackets is monotonically decreasing in n. We stress that the last estimates are reasonable for \(m\ge 2\), and thus we need at least \(n\ge 464\). This choice of n leads to the upper bound 29/6 for the term within the brackets. Thus, for \(m\ge 2\), the estimate
holds and the assertion follows. \(\square \)
Similar to Corollary 5.6, we apply Markov’s inequality to get a lower bound on the success probability of the randomly chosen sampling set.
Corollary 8.4
Under the same assumptions as in Theorem 8.3, it holds
Proof
We follow the argumentation in the proof of Corollary 5.6 using the inequality
\(\square \)
Example 8.5
For \(s=5\) and \(d=16\), we would like to fulfill
i.e., with \(\delta =1/300\) we choose the smallest possible \(m:=2\,873\) such that
holds. Clearly, for n such that \(m-1=2\,872\le \frac{n}{48(\sqrt{2}\log (2n)-\log {\delta })}\) holds, we observe the desired estimate. We choose the smallest possible \(n:=3\,879\,166\).
8.2 Numerical Integration of Periodic Functions
In [52, Sec. 4.2], the author discussed the construction of stable cubature weights for the approximation of the integral
for functions from periodic Sobolev spaces with dominating mixed smoothness. The integration nodes \({\mathbf {X}}\) are drawn uniformly and independently at random from \(\mathbb {T}^d\). Below, in Corollary 8.6, the cubature rule \({\text {Q}}_{\mathbf {X}}^m\) is fixed for the whole class. The result in [52, Thm. 4.5] bounds the worst-case integration error from above by
Corresponding numerical tests suggest an even better behavior of the integration error, cf. [52, Rem. 4.6]. The theoretical results of this section confirm that the optimal main rate of the approach presented in [52] is \(n^{-s}\); in particular, we obtain the upper bound \(\lesssim _{d,s}n^{-s}(\log n)^{sd} \) for this specific setting. We achieve the following statement on the worst-case integration error.
Corollary 8.6
Let \(d\in \mathbb {N}, s>1/2\) and \(0<\delta <1/3\). We choose \(n\in \mathbb {N}\) such that m as stated in (8.8) is at least 2. Drawing the n sampling nodes in \({\mathbf {X}}\) uniformly i.i.d. at random from \(\mathbb {T}^d\), we put the cubature weight vector \({\varvec{q}}\) to be the first column of \(\overline{{\mathbf {L}}_m}\, \overline{({\mathbf {L}}_m^{*}\,{\mathbf {L}}_m)^{-1}}.\) Then, with probability at least \(1-3\delta \), we obtain
In particular, there is a constant \(C_d^\square >0\) depending on d such that for \(0<\delta <1/3\) it holds with probability \(1-3\delta \)
Proof
We apply Theorem 7.1 with \(\mu _{\mathbb {T}^d} = \varrho _{\mathbb {T}^d} \equiv 1\) followed by (8.6). \(\square \)
Remark 8.7
By the same reasoning, the result in Theorem 8.3 transfers almost literally to the integration problem. In fact, having \(s>(1+\log _2 d)/2\) we see a non-trivial preasymptotic behavior. The above bounds show that this method based on random points competes with most of the quasi-Monte-Carlo methods studied in the literature, see [22, pp. 195, 247].
9 Hyperbolic Wavelet Regression
The following scenario of replacing the Fourier system by dyadic wavelets has already been investigated by Bohn [5, Sec. 5.5.2], [6] using piecewise linear prewavelets. Here, we use orthogonal wavelets and improve the result in [5] in two directions: first, we remove a d-dependent \(\log \)-factor, and second, our result holds for the whole class and not just for one individual function, i.e., we control the worst-case error. It is worth mentioning that we only lose a \(\log \)-factor which is independent of d compared to the benchmark result in [21].
Let us start with the necessary definitions since we are now in a non-periodic setting. For \(s>0\) let us define the space \(H^s_{\text {mix}}(\mathbb {R}^d)\) as the collection of all functions from \(L_2(\mathbb {R}^d)\) such that
Here \({\mathcal {F}}\) denotes the Fourier transform on \(\mathbb {R}^d\) given by
It is well known that \(H^s_{\text {mix}}(\mathbb {R}^d)\) can be characterized using hyperbolic wavelets. Let \((\psi _{j,k})_{j\in \mathbb {N}_0, k\in \mathbb {Z}}\) be a univariate orthonormal wavelet system (if \(j=0\) then \(\psi _{0,k}\) denotes the orthogonal scaling function). Then, we denote with
the corresponding hyperbolic wavelet basis in \(L_2(\mathbb {R}^d)\). For our analysis we need the univariate wavelet to be compactly supported, which means that \(\psi _{j,k}\) is supported “near” the interval \([k2^{-j}, (k+1)2^{-j}]\). If the wavelet basis has sufficient smoothness and vanishing moments, then \(f \in H^s_{\text {mix}}(\mathbb {R}^d)\) holds if and only if
This leads to the norm equivalence
Clearly, if \(\Vert f\Vert _{H^s_{\text {mix}}(\mathbb {R}^d)} \le 1\), then the sequence \((2^{\Vert {\mathbf {j}}\Vert _1s}\langle f,\psi _{{\mathbf {j}},{\mathbf {k}}}\rangle )_{{\mathbf {j}},{\mathbf {k}}}\) has an \(\ell _2\)-norm bounded by a constant, which will be important for our later analysis.
Let us consider the unit cube \([0,1]^d\). Let further \(D_{{\mathbf {j}}}\) be the set of all \({\mathbf {k}}\in \mathbb {Z}^d\) such that the support of the wavelet \(\psi _{{\mathbf {j}},{\mathbf {k}}}\) has a non-empty intersection with \([0,1]^d\). This directly leads to the extended domain \(\Omega \) given by
It holds \([0,1]^d \subset \Omega \), and the system \((\psi _{{\mathbf {j}},{\mathbf {k}}})_{{\mathbf {j}}\in \mathbb {N}_0^d,{\mathbf {k}}\in D_{{\mathbf {j}}}}\) is an orthonormal system in \(L_2(\Omega )\), however not a basis. Note that \(\Omega \) is still a bounded tensor domain whose measure is bounded by a constant depending on the support length of the wavelet basis. It is also clear that this orthonormal system is not uniformly bounded in \(L_\infty \).
In the sequel we want to recover functions \(f\in H^s_{\text {mix}}(\mathbb {R}^d)\) on the domain \([0,1]^d\) from samples on the slightly larger extended domain \(\Omega \) in a uniform way. In other words, the discrete locations of the sampling nodes \({\mathbf {X}}= \{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\}\) are chosen in advance for the whole class of functions. Let us consider the operator
which is known from hyperbolic wavelet approximation, see [21, 60]. The following worst-case error bound is well known and follows directly from (9.1):
We now consider a special case of the matrix \({\mathbf {L}}_m\) from (3.1), namely
Here, \(m = m(\ell ) \asymp 2^\ell \ell ^{d-1}\) and the functions \({\widetilde{\psi }}_{{\mathbf {j}},{\mathbf {k}}} = \sqrt{|\Omega |}\psi _{{\mathbf {j}},{\mathbf {k}}}\) enumerate the properly re-normalized wavelets \(\psi _{{\mathbf {j}},{\mathbf {k}}}, \Vert {\mathbf {j}}\Vert _1 \le \ell , {\mathbf {k}}\in D_{{\mathbf {j}}}\), which form an orthonormal system in the space \(L_2(\Omega ,\varrho _\Omega )\) with the probability measure \(\varrho _\Omega = \frac{\mathrm {d}{\mathbf {x}}}{|\Omega |}\). The n sampling nodes in \({\mathbf {X}}\) are drawn i.i.d. at random according to \(\varrho _\Omega \). Note that, due to the construction, \(|\Omega |\) is bounded by a constant which depends on the chosen wavelet system. This, in turn, depends on the assumed mixed regularity of the function f, i.e., the mixed smoothness \(s>0\): the larger s is chosen, the larger the support of a properly chosen orthonormal wavelet system has to be. We propose Algorithm 3 for computing the wavelet coefficients of an approximation \(S_{\mathbf {X}}^\ell f\) to f. Note that there is a slight abuse of notation since we use the wavelet level \(\ell \) as upper index (in contrast to m, the dimension of the subspace used earlier).
Theorem 9.1
Let \(0<\delta <1\). Let further \(s>1/2\) and \((\psi _{{\mathbf {j}},{\mathbf {k}}})_{{\mathbf {j}},{\mathbf {k}}}\) be a hyperbolic and compactly supported orthonormal wavelet system such that (9.1) holds true. Then, the algorithm \(S^\ell _{{\mathbf {X}}}\) indicated in Algorithm 3 recovers any \(f \in H^s_{\text {mix}}(\mathbb {R}^d)\) on \(L_2([0,1]^d)\) with high probability from \(n = n(\ell )\) random samples which are drawn in advance for the whole class. Precisely,
with probability larger than \(1-\delta \). The operator \(S^{\ell }_{{\mathbf {X}}}\) uses \(n(\ell ) \asymp 2^{\ell }\ell ^d\) many samples such that the bound in (9.3) reads as \({\widetilde{C}}_{\delta ,d}n^{-s}\log (n)^{ds}\) in terms of the number of samples.
Remark 9.2
(i)
Note that the optimal operator \({\tilde{P}}_\ell \) uses \(n(\ell ) \asymp 2^{\ell }\ell ^{d-1}\) wavelet coefficients. The gap between sampling recovery (\(\Lambda ^{\text {std}}\)) and general linear approximation (\(\Lambda ^{\text {all}}\)), see, e.g., [18, 48, 49, 51], is reduced to a \(\log \)-factor, which is independent of d.
(ii)
The matrix defined in (9.2) is rather sparse. It has \(n \asymp 2^{\ell }\ell ^{d}\) rows and \(m \asymp 2^{\ell }\ell ^ {d-1}\) columns. In every row we have only \(\asymp \#\{\Vert {\mathbf {j}}\Vert _1\le \ell \} \asymp \ell ^d\) many nonzero entries. This gives an additional acceleration for the least squares algorithm since matrix vector multiplications are cheap in this situation.
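The sparsity described in (ii) can be checked directly. The following sketch (our own illustration with the Haar system, which does not satisfy the smoothness assumptions of Theorem 9.1 but shares the compact-support structure, and illustrative sizes) assembles the matrix for d = 2 and confirms that each row has exactly \(\#\{{\mathbf {j}}\in \mathbb {N}_0^d:\Vert {\mathbf {j}}\Vert _1\le \ell \}=\binom{\ell +d}{d}\asymp \ell ^d\) nonzero entries:

```python
import itertools
import math
import numpy as np
from scipy.sparse import lil_matrix

def haar(x, j, k):
    """Univariate Haar system on [0, 1): psi_{0,0} is the scaling function;
    for j >= 1 the level j holds 2^(j-1) wavelet translates."""
    if j == 0:
        return 1.0
    scale = 2 ** (j - 1)
    t = scale * x - k
    if not 0.0 <= t < 1.0:
        return 0.0
    return math.sqrt(scale) * (1.0 if t < 0.5 else -1.0)

d, level = 2, 4
# enumerate the tensor-product Haar functions with ||j||_1 <= level
cols = [(j, k)
        for j in itertools.product(range(level + 1), repeat=d) if sum(j) <= level
        for k in itertools.product(*(range(1 if ji == 0 else 2 ** (ji - 1))
                                     for ji in j))]

rng = np.random.default_rng(1)
n = 300
X = rng.random((n, d))

L = lil_matrix((n, len(cols)))
for i, x in enumerate(X):
    for c, (j, k) in enumerate(cols):
        v = math.prod(haar(x[t], j[t], k[t]) for t in range(d))
        if v != 0.0:
            L[i, c] = v

# per node there is exactly one nonzero translate per multi-level j, so each
# row has #{j : ||j||_1 <= level} = binom(level + d, d) nonzero entries
nnz = L.tocsr().getnnz(axis=1)
print(int(nnz.min()), int(nnz.max()), math.comb(level + d, d))   # -> 15 15 15
```

This per-row count is what makes matrix–vector multiplications with \({\mathbf {L}}_m\) cheap in the least squares iterations.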
Proof of Theorem 9.1
We follow the proof of Theorem 5.5. Let \({\mathbf {X}}\) be a set of n randomly drawn nodes from \(\Omega \) according to \(\varrho _\Omega \). If the number n of samples satisfies (5.1), that is
then
is satisfied with probability larger than \(1-\delta \). Let \({\mathbf {X}}\) be such that this is the case. Then, we estimate
where \({\tilde{\varvec{\Phi }}}_\ell \) is defined similarly to Proposition 5.4. This time we put
Let us define the quantity
which goes along the lines of Proposition 5.4. Then, we get with literally the same arguments
Let us compute \({\tilde{T}}(\ell )\). Due to the compact support of the wavelet system, for fixed \({\mathbf {j}}\) and \({\mathbf {x}}\) there are only O(1) many wavelets \(\psi _{{\mathbf {j}},{\mathbf {k}}}\) such that \(\psi _{{\mathbf {j}},{\mathbf {k}}}({\mathbf {x}})\) is non-zero. For those O(1) wavelets, we have
Hence, we get
By the same reasoning we may estimate \({\tilde{N}}(\ell )\) in (9.4). Clearly n may be chosen such that
Plugging (9.6) and (9.7) into (9.5), we obtain
The same standard arguments as used in Theorem 5.5 and Corollary 5.6 lead to the bound in (9.3). It remains to estimate the number of samples n depending on \(\ell \), see (9.7). This clearly gives \(\log (n) \gtrsim \ell \) and hence \(n\gtrsim 2^\ell \ell ^d\) which concludes the proof. \(\square \)
10 Numerical Experiments
10.1 Recovery of Functions from Spaces with Mixed Smoothness
In this section, we perform numerical tests for the hyperbolic cross Fourier regression based on random sampling nodes from Sect. 8, i.e., we apply Algorithm 1 to periodic test functions f from the spaces \(H^s_{\text {mix}}(\mathbb {T}^d)\). In Fig. 1, we visualize realizations for such random nodes in the two- and three-dimensional case.
Besides random point sets, different types of deterministic lattices have also been used for numerical integration and function recovery, see for instance [9, 31, 32]. This motivates us to consider Frolov lattices [30] and Fibonacci lattices (cf., e.g., [62, Sec. IV.2]) in the context of this paper, see Fig. 2 for examples of such lattices.
In the following, we use the weight function
Note that for computational reasons we avoid the \(2\pi \) in this weight w. By the reasoning after (8.5), the weights without \(2\pi \) lead to a slightly slower decay of the respective singular numbers.
For a given number n of samples, we use the \(m=\lfloor n/(4\log n)\rfloor \) frequencies \({\mathbf {k}}\in \mathbb {Z}^d\) where \(w({\mathbf {k}})\) is smallest, i.e., we define \(I_m:=\{{\mathbf {k}}_1,\ldots ,{\mathbf {k}}_m\}\subset \mathbb {Z}^d\), \(|I_m|=m\), assuming an arrangement of \({\mathbf {k}}_j\) fulfilling \(w({\mathbf {k}}_1)\le w({\mathbf {k}}_2)\le \ldots \). Here ties are broken in numerical order starting with the first component \(k_1\) of \({\mathbf {k}}\) until the last one \(k_d\). Corresponding to our theoretical results, the goal is to compute a least squares approximation \(S_{\mathbf {X}}^mf\) of the projection \(P_{m-1}f\) of the function f to the \({\text {span}}\{\exp (2\pi \mathrm {i}\,{\mathbf {k}}\cdot {\mathbf {x}}) :{\mathbf {k}}\in I_m\}\) using the n sampling values at the nodes in \({\mathbf {X}}\).
Comments on the arithmetic cost of Algorithm 1. Building the index set \(I_{m-1}\), i.e., enumerating the basis functions \(\eta _1,\ldots ,\eta _{m-1}\), requires
arithmetic operations, and setting up the matrix \({\mathbf {L}}_m\) requires
arithmetic operations, where \(C_1,C_2>0\) are absolute constants. Afterward, running Algorithm 1 requires
arithmetic operations, where \(C_3>0\) is an absolute constant and \(R\in \mathbb {N}\) is the number of LSQR iterations. If one chooses m as in Theorem 5.5, cf. (5.1) and (5.2), the condition number of the matrix \({\mathbf {L}}_m\) is \(\le \sqrt{3}\) with high probability and one obtains \(R\le 17\) for an LSQR accuracy of \(\approx 10^{-8}\). Please note that we choose \(m=\lfloor n/(4\log n)\rfloor \) in our experiments, which is slightly larger than the theory (5.2) requires.
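The pipeline just described can be sketched in a few lines. In the following sketch, the concrete weight \(w({\mathbf {k}})=\prod _{j=1}^d(1+k_j^2)^{1/2}\) and all sizes are our own illustrative assumptions, and numpy's dense `lstsq` stands in for the iterative LSQR solver: we build the index set \(I_m\) with the stated tie-breaking, assemble \({\mathbf {L}}_m\), and solve the least squares problem; for a test function lying in the span, the coefficients are recovered to machine precision.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 2, 2000
m = int(n / (4 * np.log(n)))                  # m = floor(n / (4 log n)) = 65

# build I_m: the m frequencies with smallest weight w(k) = prod_j (1 + k_j^2)^(1/2)
kmax = 16
ks = np.stack(np.meshgrid(*(np.arange(-kmax, kmax + 1),) * d), axis=-1).reshape(-1, d)
w = np.prod(1.0 + ks.astype(float) ** 2, axis=1) ** 0.5
order = np.lexsort((ks[:, 1], ks[:, 0], w))   # primary key w, ties in numerical order
I_m = ks[order[:m]]

X = rng.random((n, d))                        # uniform i.i.d. nodes on T^d
L = np.exp(2j * np.pi * (X @ I_m.T))          # the matrix L_m (n x m)

c_true = np.zeros(m, dtype=complex)           # a test function inside the span
c_true[0], c_true[4] = 1.0, 0.5
f = L @ c_true                                # its samples at the nodes

c = np.linalg.lstsq(L, f, rcond=None)[0]      # least squares coefficients
print(np.allclose(c, c_true, atol=1e-6))      # -> True
```

With the logarithmic oversampling above, the Gram matrix \({\mathbf {L}}_m^{*}{\mathbf {L}}_m\) is well-conditioned with high probability, which is exactly what keeps the iteration count of LSQR small.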
Remark 10.1
Let us compare the hyperbolic cross-Fourier regression from Sect. 8, which uses random samples, with the single rank-1 lattice sampling approach from [9, 31], which uses highly structured deterministic sampling nodes. Up to logarithmic factors, both approaches have comparable error estimates w.r.t. the number m of basis functions and comparable arithmetic complexities. Single rank-1 lattice sampling has slightly worse recovery error estimates w.r.t. m than Algorithm 1, cf. Theorem 8.2. On the other hand, the arithmetic complexity for single rank-1 lattice sampling is slightly better. Moreover, the error estimates when using rank-1 lattices are guaranteed upper bounds, whereas the worst-case upper bounds in Sect. 8 hold with high probability. However, for fixed m, the used number of samples for single rank-1 lattice sampling is distinctly higher, i.e., almost quadratic compared to the approach from this paper. This results in error estimates w.r.t. the number n of used sampling values that are distinctly worse for single rank-1 lattices.
Subsequently, we consider three different test functions \(f:\mathbb {T}^d \rightarrow \mathbb {R}\) whose Fourier coefficients \({\hat{f}}_{{\mathbf {k}}}:=\int _{\mathbb {T}^d}f({\mathbf {x}}) \,\exp (-2\pi \mathrm {i}\,{\mathbf {k}}\cdot {\mathbf {x}}) \,\mathrm {d}{\mathbf {x}}\), \({\mathbf {k}}\in \mathbb {Z}^d\), decay like \(|{\hat{f}}_{{\mathbf {k}}}|\sim \prod _{i=1}^d \big (1+|k_i|^2\big )^{-\alpha /2}\) for \(\alpha \in \{5/4,\, 2,\, 6\}\) and, consequently, \(f\in H^s_{\text {mix}}(\mathbb {T}^d)\) with \(s=\alpha -1/2-\varepsilon \) for \(\varepsilon >0\).
10.1.1 Test function f from \(H^{3/4-\varepsilon }_{\text {mix}}(\mathbb {T}^d)\)
We start with the test function
where we have for the Fourier coefficients \(|{\hat{f}}_{{\mathbf {k}}}|\sim \big (\prod _{i=1}^d (1+|k_i|^2)^{1/2}\big )^{-5/4}\) and, consequently, \(f\in H^{3/4-\varepsilon }_{\text {mix}}(\mathbb {T}^d)\), \(\varepsilon >0\).
In Fig. 3a, we visualize the relative approximation errors \({\tilde{a}}_m:=\Vert f-P_{m-1}f\Vert _{L_2(\mathbb {T}^d)}\) for spatial dimensions \(d=2,3,4,5\). Due to (8.6), these errors should decay like \(m^{-0.75+\varepsilon }(\log m)^{(d-1)\cdot (0.75-\varepsilon )}\) for sufficiently large m. Correspondingly, we plot \(m^{-0.75}(\log m)^{(d-1)\cdot 0.75}\) as black dotted graphs. We observe that the obtained approximation errors nearly decay as the theory suggests.
Next, we apply Algorithm 1 to the test function f using n randomly selected sampling nodes as sampling scheme. We do not compute the least squares solution directly but use the iterative method LSQR [54] on the matrix \({\mathbf {L}}_m\), \(m=\lfloor n/(4\log n) \rfloor \), with tolerance parameter \(5\cdot 10^{-8}\) and at most 100 iterations. The obtained sampling errors \({\tilde{g}}_n:=\Vert f-S^m_{{\mathbf {X}}}f \Vert _{L_2(\mathbb {T}^d)}\) are visualized in Fig. 3b together with the graphs \(\sim n^{-0.75}(\log n)^{d\cdot 0.75}\) as dotted lines, which correspond to the theoretical upper bounds \(n^{-0.75+\varepsilon }(\log n)^{d\cdot (0.75-\varepsilon )}\), cf. (8.7). For \(d=2\) and \(d=3\), the errors nearly decay like these bounds. For \(d=4\) and \(d=5\), the errors seem to decay slightly slower than the bounds. In order to investigate this further, we also plot the corresponding approximation errors \({\tilde{a}}_m\) with \(m=\lfloor n/(4\log n) \rfloor \) as thick dashed lines. We observe that these approximation errors \({\tilde{a}}_{m}\), which are the best possible errors achievable in this setting, almost coincide with the sampling errors. This means that we might still observe preasymptotic behavior.
For \(d=2\) spatial dimensions, we have a closer look at the sampling errors. In Fig. 4a, we again plot the approximation errors \({\tilde{a}}_{m}\), \(m=\lfloor n/(4\log n) \rfloor \). In addition, the aliasing errors \(\Vert P_{m-1}f-S^m_{{\mathbf {X}}}f \Vert _{L_2(\mathbb {T}^d)}\), \(m=\lfloor n/(4\log n) \rfloor \), which are the errors caused by Algorithm 1 since
are shown as triangles. We observe that the aliasing errors nearly decay like the approximation errors and that they are about one order of magnitude smaller. This corresponds to the behavior we observed in Fig. 3b.
Moreover, we compare least squares using random point sets with least squares using quasi-random point sets. In particular, so-called Frolov lattices [30] are considered as sampling sets in \(d=2\) spatial dimensions and used in Algorithm 1. The resulting sampling errors almost coincide with the approximation errors. The aliasing errors are visualized in Fig. 4a as circles. It is remarkable that they decay similarly and are even lower than the aliasing errors for random nodes in most cases. A similar behavior can be observed for dimension \(d=3\), cf. Fig. 4b.
In addition, we consider Fibonacci lattices in dimension \(d=2\), cf. Fig. 4a. For \(n\ge 832\,040\), the matrices \({\mathbf {L}}_m\), \(m=\lfloor n/(4\log n) \rfloor \), contain at least two identical columns and correspondingly, the smallest eigenvalue of \({\mathbf {L}}_m^{*}\,{\mathbf {L}}_m\) is zero. Therefore, obtaining \(S^m_{{\mathbf {X}}}f\) via Algorithm 1 is not possible if the least squares solution is computed directly. An iterative method like LSQR may still work, but the number of iterations may have to be restricted. In Fig. 4a, the obtained aliasing errors via LSQR are shown as squares; they are smaller than in the other cases but seem to decay slower. However, when we decreased the tolerance parameter of the LSQR algorithm, we observed aliasing errors and sampling errors larger than 1 for \(n\ge 832\,040\).
10.1.2 Kink test function f from \(H^{3/2-\varepsilon }_{\text {mix}}(\mathbb {T}^d)\)
Next, we consider the kink test function \(f:\mathbb {T}^d\rightarrow \mathbb {R},\)
with Fourier coefficients
Besides the different test function f, we use the same setting as before.
In Fig. 5a, we visualize the relative approximation errors \({\tilde{a}}_m:=\Vert f-P_{m-1}f \Vert _{L_2(\mathbb {T}^d)}\) for spatial dimensions \(d=2,3,4,5\). These errors should decay like \(m^{-1.5+\varepsilon }(\log m)^{(d-1)\cdot (1.5-\varepsilon )}\) for sufficiently large m, and we observe that the obtained approximation errors nearly decay as the theoretical results suggest. Next, we apply Algorithm 1 with random nodes to the test function f using the iterative method LSQR. The resulting sampling errors \({\tilde{g}}_n:=\Vert f-S^m_{{\mathbf {X}}}f \Vert _{L_2(\mathbb {T}^d)}\) are depicted in Fig. 5b. In addition, the graphs \(\sim n^{-1.5}(\log n)^{d\cdot 1.5}\) are shown as dotted lines which roughly correspond to the theoretical upper bounds \(n^{-1.5+\varepsilon }(\log n)^{d\cdot (1.5-\varepsilon )}\). The errors seem to decay according to this bound for \(d=2\) and slower for \(d=3,4,5\). Again, we also plot the corresponding approximation errors \({\tilde{a}}_m\) with \(m=\lfloor n/(4\log n)\rfloor \) as thick dashed lines, and we observe that these approximation errors \({\tilde{a}}_{\lfloor n/(4\log n) \rfloor }\) almost coincide with the sampling errors. Correspondingly, we still have preasymptotic behavior for \(d=3,4,5\).
10.1.3 Test function f from \(H^{11/2-\varepsilon }_{\text {mix}}(\mathbb {T}^d)\)
As a third test function f, we consider an \(L_2\)-normalized product of periodic one-dimensional B-splines of order 6, where each factor depends on a single variable and is a piecewise polynomial of degree 5; therefore, \(f\in H^{11/2-\varepsilon }_{\text {mix}}(\mathbb {T}^d)\).
In Fig. 6a, the relative approximation errors \({\tilde{a}}_m:=\Vert f-P_{m-1}f \Vert _{L_2(\mathbb {T}^d)}\) are visualized for spatial dimensions \(d=2,3,4,5\); they roughly decay like \(m^{-5.5+\varepsilon }(\log m)^{(d-1)\cdot (5.5-\varepsilon )}\) for sufficiently large m. Figure 6b depicts the sampling errors \({\tilde{g}}_n:=\Vert f-S^m_{{\mathbf {X}}}f \Vert _{L_2(\mathbb {T}^d)}\) obtained by applying Algorithm 1 with random nodes to the test function f using the iterative method LSQR. In addition, the graphs \(\sim n^{-5.5}(\log n)^{d\cdot 5.5}\) are plotted as dotted lines, which correspond to the theoretical upper bounds. The errors seem to decay roughly according to this bound for \(d=2\) and \(d=3\), or slightly more slowly. Again, the corresponding approximation errors \({\tilde{a}}_m\) with \(m=\lfloor n/(4\log n)\rfloor \) are shown as thick dashed lines, and we observe that these approximation errors \({\tilde{a}}_{\lfloor n/(4\log n) \rfloor }\) almost coincide with the sampling errors. This means we still have preasymptotic behavior.
10.2 Integration of Functions from Spaces with Mixed Smoothness
Extensive numerical tests on integration using random point sets were performed by Oettershagen [52], cf. in particular [52, Sec. 4.1] for tests concerning numerical integration in Sobolev spaces with mixed smoothness \(H^{s}_{\text {mix}}(\mathbb {T}^d)\). These numerical tests, performed for \(d\in \{2,4,8,16\}\) spatial dimensions and smoothness \(s\in \{1,2,3\}\), suggest that the worst-case cubature error may decay with a main rate of \(n^{-s}\) with additional log factors. This is remarkable behavior, since plain Monte Carlo with random points usually leads to an error decay of \(n^{-1/2}\).
However, the corresponding theoretical results in [52, Sec. 4.2] only give a main rate of \(n^{-s+1/2}\). We highlight that our results obtained in this paper bridge this gap of 1/2 in the main rate, since we show a worst-case error of \(\sim n^{-s}(\log n)^{d\,s}\) in Corollary 8.6, i.e., our theoretical main rate corresponds to the observations by Oettershagen [52]. Moreover, Algorithm 1 guarantees suitable error bounds in the preasymptotic setting, cf. Corollary 8.6 together with (8.5). More details on cubature rules based on least squares additionally refined using some variance reduction technique can be found in [43].
10.3 Recovery and Integration for the Non-periodic Case
Now we consider the non-periodic situation. We use the Chebyshev measure \({\tilde{\varrho }}_D(\mathrm {d}{\mathbf {x}}):=\prod _{t=1}^d(\pi \sqrt{1-x_t^2})^{-1} \, \mathrm {d}{\mathbf {x}}\) as random sampling scheme on \(D = [-1,1]^d\). We further define the non-periodic space \({\tilde{H}}^s_{\text {mix}}([-1,1]^d)\) via the reproducing kernel
and its tensor product
The space \({\tilde{H}}^s_{\text {mix}}([-1,1]^d)\) is embedded into \(L_2(D,{\tilde{\varrho }}_D)\) if and only if \(s>1/2\). We denote the k-th basis index of \(\eta _k\), \(k=1,\ldots ,m-1\), by \({\varvec{h}}_k:=\big (h_{k,t}\big )_{t=1}^d\in \mathbb {N}_0^d\), and we have \(\eta _1 \equiv 1\) and
if \(k>1\). Moreover, for the vector \({\varvec{b}}\in \mathbb {C}^{m-1}\) in (7.1), we have
for \(\mu _D\equiv 1\).
For the numerical experiments, we consider the test function
on \(D:=[-1,1]^d\), where the one-dimensional version is depicted in Fig. 7. For the Chebyshev coefficients of f, we have
and consequently, \(f\in {\tilde{H}}^{3/2-\varepsilon }_{\text {mix}}([-1,1]^d)\) for \(\varepsilon >0\).
We use the parameters from Sect. 10.1 and apply Algorithm 1 to the test function f in the non-periodic setting, where we generate the random nodes with respect to the measure \({\tilde{\varrho }}_D(\mathrm {d}{\mathbf {x}}):=\prod _{t=1}^d(\pi \sqrt{1-x_t^2})^{-1} \, \mathrm {d}{\mathbf {x}}\). In Fig. 8, we show realizations of these random nodes. As before, we do not compute the least squares solution directly but use the iterative method LSQR on the matrix \({\mathbf {L}}_m\), \(m=\lfloor n/(4\log n)\rfloor \). The obtained sampling errors \({\tilde{g}}_n:=\Vert f-S^m_{{\mathbf {X}}}f\Vert _{L_2([-1,1]^d,{\tilde{\varrho }}_D)}\) are plotted in Fig. 9a as triangles, as well as the corresponding approximation errors \({\tilde{a}}_m:=\Vert f-P_{m-1} f \Vert _{L_2([-1,1]^d,{\tilde{\varrho }}_D)}\) as thick dashed lines. We observe that the sampling and approximation errors almost coincide. Moreover, we plot the graphs \(\sim n^{-1.5}(\log n)^{d\cdot 1.5}\) as dotted lines, which correspond to the expected theoretical upper bounds \(n^{-1.5+\varepsilon }(\log n)^{d\cdot (1.5-\varepsilon )}\). We observe that the obtained numerical errors nearly decay like these theoretical upper bounds.
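Drawing nodes from the product Chebyshev measure is straightforward via the substitution \(x=\cos (\pi u)\); the following sketch is our own and includes a simple moment check.

```python
import numpy as np

# If U is uniform on (0, 1), then x = cos(pi * U) has the Chebyshev density
# 1 / (pi * sqrt(1 - x^2)) on (-1, 1); d independent coordinates give a draw
# from the product measure on [-1, 1]^d.
rng = np.random.default_rng(2)
d, n = 2, 100_000
X = np.cos(np.pi * rng.random((n, d)))     # n random nodes in [-1, 1]^d

# Sanity check: E[x_t^2] = E[cos^2(pi U)] = 1/2 under the Chebyshev measure.
mean_sq = float((X ** 2).mean())
```

Note the characteristic clustering of these nodes near the boundary of the cube, visible in the realizations of Fig. 8.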
In addition, we use the numerically computed Chebyshev coefficients \(c_k\) from Algorithm 1 to compute the approximation \(Q_{\mathbf {X}}f\) of I(f) by \(Q_{\mathbf {X}}f=\int _D S_{\mathbf {X}}^mf\,\mathrm {d}\mu _D=\sum _{k=1}^m c_k \, b_k\), where the complex numbers \(b_k\) are calculated as stated in (10.1). We perform each test 100 times with different random nodes. The averages of the integration errors \(|I(f) - Q_{\mathbf {X}}f|\) over the 100 test runs are depicted in Fig. 9b, and the maxima as error bars. Moreover, we plot the graphs \(\sim n^{-2}(\log n)^{d\cdot 2}\) as dotted lines and observe that the obtained integration errors approximately decay like these graphs. For comparison, we also plot \(n^{-1.5} (\log n)^{d\cdot 1.5}\) for \(d=2\) and \(d=5\) as thick solid lines, which belong to the theoretical bounds \(n^{-1.5+\varepsilon } (\log n)^{d\cdot (1.5-\varepsilon )}\) one obtains analogously to (8.9) in Sect. 8 for the non-periodic case. These thick solid lines decay distinctly more slowly.
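Once the weights \(b_k\) are tabulated, the cubature step \(Q_{\mathbf {X}}f=\sum _k c_k b_k\) reduces to a dot product. Since (10.1) is not reproduced here, the following one-dimensional sketch (our own toy illustration, not the paper's implementation) uses the standard integrals \(\int _{-1}^{1} T_h(x)\,\mathrm {d}x = 2/(1-h^2)\) for even \(h\) and 0 for odd \(h\).

```python
import numpy as np

def chebyshev_weights(m):
    # b_h = integral of T_h over [-1, 1]: 2 / (1 - h^2) for even h, 0 for odd h
    b = np.zeros(m)
    for h in range(0, m, 2):
        b[h] = 2.0 / (1.0 - h * h)
    return b

# Toy example: f = T_0 + T_2 has Chebyshev coefficients c = (1, 0, 1, 0, 0);
# its exact integral over [-1, 1] is 2 + (-2/3) = 4/3.
b = chebyshev_weights(5)
c = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
Q = float(c @ b)
```

In the experiments above, the \(c_k\) come from the least squares recovery, so the accuracy of \(Q_{\mathbf {X}}f\) inherits the quality of the coefficient recovery.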
In particular, we strongly expect that the theoretical preasymptotic results in (8.5) and [33] also hold for the Chebyshev case.
Notes
In our numerical tests, we used the Frolov lattices constructed by the methods presented in [30]; the corresponding software has been published at https://ins.uni-bonn.de/content/software-frolov.
References
Bass, R.F., Gröchenig, K.: Relevant sampling of band-limited functions. Illinois J. Math. 57(1), 43–58 (2013)
Belhadji, A., Bardenet, R., Chainais, P.: Kernel quadrature with DPPs. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. (2019)
Berlinet, A., Thomas-Agnan, C.: Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, Boston (2004). With a preface by Persi Diaconis
Bernardi, C., Maday, Y.: Polynomial interpolation results in Sobolev spaces. J. Comput. Appl. Math. 43(1), 53–80 (1992)
Bohn, B.: Error analysis of regularized and unregularized least-squares regression on discretized function spaces. Dissertation, Institut für Numerische Simulation, Universität Bonn (2017)
Bohn, B.: On the convergence rate of sparse grid least squares regression. In: Garcke, J., Pflüger, D., Webster, C., Zhang, G. (eds.) Sparse Grids and Applications—Miami 2016, Lecture Notes in Computational Science and Engineering, vol. 123, pp. 19–41. Springer (2018). Also available as INS Preprint no 1711
Bohn, B., Griebel, M.: Error estimates for multivariate regression on discretized function spaces. SIAM J. Numer. Anal. 55(4), 1843–1866 (2017)
Byrenheid, G.: Sparse representation of multivariate functions based on discrete point evaluations. Dissertation, Institut für Numerische Simulation, Universität Bonn (2018)
Byrenheid, G., Kämmerer, L., Ullrich, T., Volkmer, T.: Tight error bounds for rank-1 lattice sampling in spaces of hybrid mixed smoothness. Numer. Math. 136(4), 993–1034 (2017)
Chkifa, A., Cohen, A., Migliorati, G., Nobile, F., Tempone, R.: Discrete least squares polynomial approximation with random evaluations—application to parametric and stochastic elliptic PDEs. ESAIM: M2AN 49(3), 815–837 (2015)
Christmann, A., Steinwart, I.: Support Vector Machines. Springer, Berlin (2008)
Cobos, F., Kühn, T., Sickel, W.: Optimal approximation of multivariate periodic Sobolev functions in the sup-norm. J. Funct. Anal. 270(11), 4196–4212 (2016)
Cohen, A., Davenport, M.A., Leviatan, D.: On the stability and accuracy of least squares approximations. Found. Comput. Math. 13(5), 819–834 (2013)
Cohen, A., Davenport, M.A., Leviatan, D.: Correction to: on the stability and accuracy of least squares approximations [MR3105946]. Found. Comput. Math. 19(1), 239 (2019)
Cohen, A., Dolbeault, M.: Optimal pointwise sampling for \(L^2\) approximation. arXiv:2105.05545 (2021)
Cohen, A., Migliorati, G.: Optimal weighted least-squares methods. SMAI J. Comput. Math. 3, 181–203 (2017)
Dung, D.: B-spline quasi-interpolant representations and sampling recovery of functions with mixed smoothness. J. Complex. 27(6), 541–567 (2011)
Dung, D., Temlyakov, V.N., Ullrich, T.: Hyperbolic Cross Approximation. Advanced Courses in Mathematics. CRM Barcelona. Birkhäuser/Springer (2019)
Dung, D., Ullrich, T.: Lower bounds for the integration error for multivariate functions with mixed smoothness and optimal Fibonacci cubature for functions on the square. Math. Nachr. 288(7), 743–762 (2015)
Daĭ, F., Primak, A., Temlyakov, V.N., Tikhonov, S.Y.: Integral norm discretization and related problems. Uspekhi Mat. Nauk 74(4(448)), 3–58 (2019)
DeVore, R.A., Konyagin, S.V., Temlyakov, V.N.: Hyperbolic wavelet approximation. Constr. Approx. 14(1), 1–26 (1998)
Dick, J., Kuo, F.Y., Sloan, I.H.: High-dimensional integration: the quasi-Monte Carlo way. Acta Numer. 22, 133–288 (2013)
Dunkl, C.F., Xu, Y.: Orthogonal Polynomials of Several Variables, 1st edn. Cambridge University Press, Cambridge (2001)
Foucart, S., Rauhut, H.: A Mathematical Introduction to Compressive Sensing. Applied and Numerical Harmonic Analysis. Birkhäuser/Springer, New York (2013)
Gröchenig, K.: Irregular sampling, Toeplitz matrices, and the approximation of entire functions of exponential type. Math. Comput. 68(226), 749–765 (1999)
Gröchenig, K.: Sampling, Marcinkiewicz–Zygmund inequalities, approximation, and quadrature rules. J. Approx. Theory 257, 105455 (2020)
Hinrichs, A.: Optimal importance sampling for the approximation of integrals. J. Complex. 26(2), 125–134 (2010)
Hinrichs, A., Novak, E., Krieg, D., Prochno, J., Ullrich, M.: On the power of random information. In: Hickernell, F.J., Kritzer, P. (eds.) Multivariate Algorithms and Information-Based Complexity. Radon Series on Computational and Applied Mathematics, vol. 27, pp. 43–64. De Gruyter, Berlin (2020)
Hinrichs, A., Novak, E., Vybíral, J.: Linear information versus function evaluations for \(L_2\)-approximation. J. Approx. Theory 153(1), 97–107 (2008)
Kacwin, C., Oettershagen, J., Ullrich, M., Ullrich, T.: Numerical performance of optimized Frolov lattices in tensor product reproducing kernel Sobolev spaces. Found. Comput. Math. 21, 849–889 (2021). https://doi.org/10.1007/s10208-020-09463-y
Kämmerer, L., Potts, D., Volkmer, T.: Approximation of multivariate periodic functions by trigonometric polynomials based on rank-1 lattice sampling. J. Complex. 31(4), 543–576 (2015)
Kämmerer, L., Potts, D., Volkmer, T.: Approximation of multivariate periodic functions by trigonometric polynomials based on sampling along rank-1 lattice with generating vector of Korobov form. J. Complex. 31(3), 424–456 (2015)
Krieg, D.: Tensor power sequences and the approximation of tensor product operators. J. Complex. 44, 30–51 (2018)
Krieg, D., Ullrich, M.: Function values are enough for \(L_2\)-approximation. Found. Comput. Math. (2020). https://doi.org/10.1007/s10208-020-09481-w
Kühn, T.: New preasymptotic estimates for the approximation of periodic Sobolev functions. In: de Gier, J., Praeger, C.E., Tao, T. (eds.) 2018 MATRIX Annals, MATRIX Book Ser., vol. 3, pp. 97–112. Springer, Cham (2020)
Kühn, T., Sickel, W., Ullrich, T.: Approximation of mixed order Sobolev functions on the \(d\)-torus: asymptotics, preasymptotics and \(d\)-dependence. Constr. Approx. 42(3), 353–398 (2015)
Kühn, T., Sickel, W., Ullrich, T.: How anisotropic mixed smoothness affects the decay of singular numbers of Sobolev embeddings. J. Complex. 63, 101523 (2021)
Kuo, F.Y., Wasilkowski, G.W., Woźniakowski, H.: On the power of standard information for multivariate approximation in the worst case setting. J. Approx. Theory 158(1), 97–125 (2009)
Lu, W., Wang, H.: On the power of standard information for tractability for \(L_2\)-approximation in the average case setting. arXiv:2101.05200 (2021)
Lu, W., Wang, H.: On the power of standard information for tractability for \(L_2\)-approximation in the randomized setting. arXiv:2101.03665 (2021)
Lust-Piquard, F., Pisier, G.: Noncommutative Khintchine and Paley inequalities. Ark. Mat. 29(2), 241–260 (1991)
Marcus, A.W., Spielman, D.A., Srivastava, N.: Interlacing families II: mixed characteristic polynomials and the Kadison–Singer problem. Ann. of Math. (2) 182(1), 327–350 (2015)
Migliorati, G., Nobile, F.: Stable high-order randomized cubature formulae in arbitrary dimension. arXiv:1812.07761v4 (2020)
Moeller, M., Ullrich, T.: \({L}_2\)-norm sampling discretization and recovery of functions from RKHS with finite trace. Sampl. Theory Signal Process. Data Anal. (2021). To appear
Nagel, N., Schäfer, M., Ullrich, T.: A new upper bound for sampling numbers. Found. Comput. Math. (2021). https://doi.org/10.1007/s10208-021-09504-0
Needell, D., Srebro, N., Ward, R.: Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Math. Program. 155(1–2, Ser. A), 549–573 (2016)
Nitzan, S., Olevskii, A., Ulanovskii, A.: Exponential frames on unbounded sets. Proc. Am. Math. Soc. 144(1), 109–118 (2016)
Novak, E., Woźniakowski, H.: Tractability of Multivariate Problems. Vol. 1: Linear Information, EMS Tracts in Mathematics, vol. 6. European Mathematical Society (EMS), Zürich (2008)
Novak, E., Woźniakowski, H.: Tractability of Multivariate Problems. Volume II: Standard Information for Functionals, EMS Tracts in Mathematics, vol. 12. European Mathematical Society (EMS), Zürich (2010)
Novak, E., Woźniakowski, H.: On the power of function values for the approximation problem in various settings. Surv. Approx. Theory 6, 1–23 (2011)
Novak, E., Woźniakowski, H.: Tractability of multivariate problems. Volume III: Standard Information for Operators, EMS Tracts in Mathematics, vol. 18. European Mathematical Society (EMS), Zürich (2012)
Oettershagen, J.: Construction of Optimal Cubature Algorithms with Applications to Econometrics and Uncertainty Quantification. Dissertation, Institut für Numerische Simulation, Universität Bonn (2017)
Oliveira, R.I.: Sums of random Hermitian matrices and an inequality by Rudelson. Electr. Commun. Probab. 15, 203–212 (2010)
Paige, C.C., Saunders, M.A.: LSQR: an algorithm for sparse linear equations and sparse least squares. ACM Trans. Math. Softw. 8(1), 43–71 (1982)
Papageorgiou, A., Woźniakowski, H.: Tractability through increasing smoothness. J. Complex. 26(5), 409–421 (2010)
Rauhut, H.: Compressive sensing and structured random matrices. In: Fornasier, M. (ed.) Theoretical Foundations and Numerical Methods for Sparse Recovery. Radon Series on Computational and Applied Mathematics, vol. 9. De Gruyter, Berlin (2010)
Rauhut, H., Ward, R.: Sparse Legendre expansions via \(\ell _1\)-minimization. J. Approx. Theory 164(5), 517–533 (2012)
Rauhut, H., Ward, R.: Interpolation via weighted \(\ell _1\) minimization. Appl. Comput. Harmon. Anal. 40(2), 321–351 (2016)
Rudelson, M.: Random vectors in the isotropic position. J. Funct. Anal. 164(1), 60–72 (1999)
Sickel, W., Ullrich, T.: Tensor products of Sobolev–Besov spaces and applications to approximation from the hyperbolic cross. J. Approx. Theory 161(2), 748–786 (2009)
Smale, S., Zhou, D.-X.: Shannon sampling and function reconstruction from point values. Bull. Am. Math. Soc. (N.S.) 41(3), 279–305 (2004)
Temlyakov, V.N.: Approximation of Periodic Functions. Computational Mathematics and Analysis Series. Nova Science Publishers Inc., Commack (1993)
Temlyakov, V.N.: The Marcinkiewicz-type discretization theorems. Constr. Approx. 48(2), 337–369 (2018)
Temlyakov, V.N.: Sampling discretization error of integral norms for function classes. J. Complex. 54, 101408 (2019)
Temlyakov, V.N.: On optimal recovery in \({L}_2\). J. Complex. (To appear)
Ullrich, M.: On the worst-case error of least squares algorithms for \({L}_2\)-approximation with high probability. J. Complex. 60 (2020)
Wasilkowski, G.W., Woźniakowski, H.: On the power of standard information for weighted approximation. Found. Comput. Math. 1(4), 417–434 (2001)
Wasilkowski, G.W., Woźniakowski, H.: The power of standard information for multivariate approximation in the randomized setting. Math. Comput. 76(258), 965–988 (2007)
Acknowledgements
The authors would like to thank Felix Bartel, Bastian Bohn, Michael Griebel, Karlheinz Gröchenig, Aicke Hinrichs, Boris S. Kashin, David Krieg, Thomas Kühn, Laura Lippert, Erich Novak, Jens Oettershagen, Peter Oswald, Daniel Potts, Holger Rauhut, Ingo Steinwart, Martin Stoll, Vladimir N. Temlyakov and Mario Ullrich for several discussions on the topic. In addition, the authors thank the referees as well as the editor for their valuable suggestions and remarks. L.K. gratefully acknowledges the funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, project number 380648269). T.U. would like to acknowledge support by the DFG Ul-403/2-1. T.V. gratefully acknowledges support by the SAB 100378180.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Communicated by Albert Cohen.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Kämmerer, L., Ullrich, T. & Volkmer, T. Worst-case Recovery Guarantees for Least Squares Approximation Using Random Samples. Constr Approx 54, 295–352 (2021). https://doi.org/10.1007/s00365-021-09555-0