1 Introduction

We consider the problem of learning complex-valued multivariate functions on a domain \(D \subset \mathbb {R}^d\) from function samples on the set of nodes \({\mathbf {X}}:=\{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\} \subset D\). The functions are modeled as elements from some reproducing kernel Hilbert space H(K) with kernel \(K:D\times D\rightarrow \mathbb {C}\). The nodes \({\mathbf {X}}\) are drawn independently according to a tailored probability measure depending on the spectral properties of the embedding of H(K) in \(L_2(D,\varrho _D)\), where the error is measured. Our main focus in this paper is on worst-case recovery guarantees. In fact, we aim at recovering all \(f \in H(K)\) simultaneously from sampled values at the sampling nodes in \({\mathbf {X}}\) with high probability. To be more precise, we implement algorithms and provide error bounds to control the worst-case error

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)}\le 1} \Vert f-S_{{\mathbf {X}}}^mf\Vert _{L_2(D,\varrho _D)}, \end{aligned}$$

where \(S_{{\mathbf {X}}}^m\) is a fixed recovery operator. In contrast to that, the problem of reconstructing an individual function from random samples has been considered by several authors in the literature, e.g., Smale and Zhou [61], Bohn [5, 6], Bohn and Griebel [7], Cohen et al. [13], Chkifa et al. [10], Cohen and Migliorati [16], and many others, to mention just a few.

Let us emphasize that we do not develop a Monte Carlo method here. It is rather the use of “random information” which gained substantial interest in the information-based complexity (IBC) community and in the field of compressed sensing, see the recent survey [28] and [24]. We construct a recovery operator \(S_{{\mathbf {X}}}^m\) which computes a best least squares fit \(S_{{\mathbf {X}}}^mf\) to the given data \((f({\mathbf {x}}^1),\ldots ,f({\mathbf {x}}^n))^\top \) from the finite-dimensional space spanned by the first \(m-1\) singular vectors of the embedding

$$\begin{aligned} \mathrm {Id}: H(K) \rightarrow L_2(D,\varrho _D). \end{aligned}$$
(1.1)

The right singular vectors \(e_1(\cdot ), e_2(\cdot ),\ldots \) of this embedding are arranged according to their importance, i.e., with respect to the non-increasing rearrangement of the singular values \(\sigma _1 \ge \sigma _2 \ge \cdots >0\).

The investigations in this paper are inspired by the recent results by Krieg and Ullrich [34], which triggered substantial progress in the field. See also the discussion below in Remark 5.9 and [45, Sect. 7]. In this paper, we extend and improve the results from [34] in several directions. In particular, we investigate and implement a least squares regression algorithm under weaker conditions and give practically useful parameter choices which lead to a controlled failure probability and explicit error bounds.

A typical error bound relates the worst-case recovery error to the sequence of singular numbers \((\sigma _k)_{k\in \mathbb {N}}\) of the embedding (1.1) which represent the approximation numbers or linear widths. One main contribution of this paper is the following general bound, where all constants are determined precisely under mild conditions. Recall that \((e_k(\cdot ))_{k\in \mathbb {N}}\) denotes the sequence of right singular vectors of the embedding (1.1), i.e., the eigenfunctions of \(\mathrm {Id}^*\circ \mathrm {Id}:H(K) \rightarrow H(K)\), and \(\sigma _1 \ge \sigma _2 \ge \cdots >0\) the corresponding singular numbers.

Theorem 1.1

(cf. Corollary 5.6) Let H(K) be a separable reproducing kernel Hilbert space of complex-valued functions on a subset \(D \subset \mathbb {R}^d\) such that the positive semidefinite kernel \(K:D\times D\rightarrow \mathbb {C}\) satisfies \(\sup _{{\mathbf {x}}\in D}K({\mathbf {x}},{\mathbf {x}})<\infty \). Let further \(\varrho _D\) denote a probability measure on D. Furthermore, for \(n\in \mathbb {N}\) and \(\delta \in (0,1/3)\), we define \(m\in \mathbb {N}\) such that

$$\begin{aligned} N(m):=\sup _{{\mathbf {x}}\in D}\sum \limits _{k=1}^{m-1}\sigma _k^{-2}\,|e_k({\mathbf {x}})|^2 \le \frac{n}{48\left( \sqrt{2}\log (2n)-\log \delta \right) } \end{aligned}$$

holds. Then, the random reconstruction operator \(S^m_{{\mathbf {X}}}\) (see Algorithm 1), which uses samples on the n i.i.d. (according to \(\varrho _D\)) drawn nodes in \({\mathbf {X}}\), satisfies

$$\begin{aligned} \mathbb {P}\left( \sup \limits _{\Vert f\Vert _{H(K)}\le 1} \Vert f-S^m_{\mathbf {X}}f\Vert ^2_{L_2(D,\varrho _D)} \le \frac{29}{\delta }\,\max \left\{ \sigma ^2_m,\frac{\log (8n)}{n}T(m) \right\} \right) \ge 1-3\delta , \end{aligned}$$

where \(T(m) := \sup _{{\mathbf {x}}\in D}\sum _{k=m}^{\infty }|e_k({\mathbf {x}})|^2\).

The occurrence of the fundamental quantity N(m) is certainly not a surprise. It is also known as the spectral function (see [26] and the references therein), and in the case of orthogonal polynomials, it is the inverse of the infimum of the Christoffel function (cf., e.g., [23]). It represents a well-known ingredient for inequalities related to sampling and discretization, see, for instance, Gröchenig and Bass [1], Gröchenig [25, 26], Temlyakov [63, 64], and Temlyakov et al. [20]. If, for instance, \(N(m) \in {\mathcal {O}}(m)\) holds, we achieve near-optimal error bounds with respect to the number n of used sampling values. Note that by a straightforward computation we also have \(T(m) \le 2\sum _{k\ge m/2}^{\infty } \sigma _k^2 N(4k)/k.\) Hence, if \(N(m) \in {\mathcal {O}}(m)\), the bound in Theorem 1.1 can be further estimated by

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)}\le 1} \Vert f-S^m_{\mathbf {X}}f\Vert ^2_{L_2(D,\varrho _D)} \le \frac{C_{K,\varrho _D}}{\delta \cdot m}\sum \limits _{k\ge m/2}^{\infty } \sigma _k^2 \end{aligned}$$
(1.2)

with \(m:=n/(c_1\log (n)+c_2\log (\delta ^{-1}))\) and a constant \(C_{K,\varrho _D}>0\) depending on the measure \(\varrho _D\) and the kernel K.

In the general case we do not necessarily have \(N(m) \in {\mathcal {O}}(m)\). Here, a technique called “importance sampling”, see, e.g., [27, 46, 57, 58], turns out to be useful, see Algorithm 2. As proposed in [16] and specified in full detail in [34], one may sample from a reweighted distribution using a specific density \(\varrho _m\), defined in (5.12) below, which differs for each m and depends on the spectral properties of the embedding (1.1). It determines the important “area” to sample. In other words, we incorporate additional knowledge about the spectral properties of our embedding. The underlying recovery operator is still constructive, and the determined error bounds hold with high probability. Computing this envelope density (and sampling from it) has been studied in [16, Sect. 5]. A refinement of this technique together with Theorem 1.1 leads to the following precise bounds under even weaker conditions.

Theorem 1.2

(cf. Theorem 5.8) Let H(K) be a separable reproducing kernel Hilbert space of complex-valued functions on a subset \(D \subset \mathbb {R}^d\). Let further \(\varrho _D\) denote a non-trivial \(\sigma \)-finite measure on D, and assume that the positive semidefinite kernel \(K:D\times D\rightarrow \mathbb {C}\) satisfies

$$\begin{aligned} \int _{D} K({\mathbf {x}},{\mathbf {x}}) \,\varrho _D(\mathrm {d}{\mathbf {x}}) < \infty . \end{aligned}$$

Furthermore, for \(n\in \mathbb {N}\) and \(\delta \in (0,1/3)\) we fix

$$\begin{aligned} m := \left\lfloor \frac{n}{96(\sqrt{2}\log (2n)-\log \delta )}\right\rfloor . \end{aligned}$$

Then, the random reconstruction operator \({\widetilde{S}}^m_{\mathbf {X}}\) (see Algorithm 2), which uses n samples drawn according to a probability measure depending on \(\varrho _D\), m, and K, satisfies

$$\begin{aligned} \mathbb {P}\left( \sup \limits _{\Vert f\Vert _{H(K)}\le 1} \Vert f-{\widetilde{S}}^m_{\mathbf {X}}f\Vert ^2_{L_2(D,\varrho _D)} \le \frac{50}{\delta }\,\max \left\{ \sigma ^2_m,\frac{\log (8n)}{n}\sum _{j=m}^\infty \sigma _j^2 \right\} \right) \ge 1-3\delta . \end{aligned}$$
(1.3)

By the same reasoning as above we may replace the error bound in (1.3) by (1.2) but this time with a universal (and precisely determined) constant \(C>0\). The result refines the bound in [34] as we give precise constants here in the general situation.

A further application of our least squares method is in the field of numerical integration. Oettershagen [52], Belhadji et al. [2], Gröchenig [26], Migliorati and Nobile [43], and many others used least squares optimization to design quadrature rules. This results in (complex-valued) weights \({\mathbf {q}}:=(q_1,\ldots ,q_n)^\top \) in a cubature formula, i.e.,

$$\begin{aligned} {\widetilde{Q}}_{{\mathbf {X}}}^mf = {\mathbf {q}}^\top \cdot {\mathbf {f}} = \sum \limits _{j = 1}^n q_j\, f({\mathbf {x}}^j) := \int _D {\widetilde{S}}^ m_{{\mathbf {X}}}f \, \mathrm {d}\mu _D, \end{aligned}$$

where \({\mathbf {f}}:=(f({\mathbf {x}}^j))_{j=1}^n\) and \(\mu _D\) is the measure for which we want to compute the integral. In our setting, the integration nodes \({\mathbf {X}}= \{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\}\) are determined once in advance for the whole class. Clearly, the bounds from Theorems 1.1, 1.2 can be literally transferred to control the worst-case integration error

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)}\le 1} \left| \int _D f({\mathbf {x}})\,d{\mathbf {x}}- {\widetilde{Q}}^m_{{\mathbf {X}}}f\right| , \end{aligned}$$

see Theorems 7.1, 7.2.
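For illustration, the following minimal sketch (not taken from the paper) indicates how such weights can be computed once the basis integrals \(\int _D \eta _k \, \mathrm {d}\mu _D\) are available; the plain (unweighted) operator \(S^m_{{\mathbf {X}}}\) is used for simplicity, and all names are illustrative.

```python
import numpy as np

def cubature_weights(L, eta_integrals):
    """Weights q with q^T (f(x^1), ..., f(x^n)) = integral of the least squares fit.

    L             : (n, m-1) matrix with entries eta_k(x^j), cf. (3.1)
    eta_integrals : length-(m-1) vector b with b_k = int_D eta_k dmu_D (assumed known)
    Since the fit is sum_k c_k eta_k with c = (L^* L)^{-1} L^* f, the weights satisfy
    q^T = b^T (L^* L)^{-1} L^*.
    """
    M = np.linalg.solve(L.conj().T @ L, L.conj().T)   # Moore-Penrose inverse (L^* L)^{-1} L^*
    return M.T @ eta_integrals                        # length-n vector of (complex) weights
```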

As the main example, we consider the recovery of functions from Sobolev spaces with mixed smoothness (also known as tensor product Sobolev spaces or hyperbolic cross-spaces). This problem has been investigated by many authors in the last 30 years, see [18] and the references therein. The above general bound on the worst-case errors can, for instance, be used for any non-periodic embedding

$$\begin{aligned} \mathrm {Id}:H^s_{\text {mix}}([0,1]^d) \rightarrow L_2([0,1]^d),\quad s>1/2. \end{aligned}$$

The spaces \(H^s_{\text {mix}}([0,1]^d)\) can be represented in various ways as a reproducing kernel Hilbert space satisfying the requirements of the above theorems, see the concrete collection of examples in [3, Section 7.4]. Applying Theorem 1.2 and plugging in well-known upper bounds on the singular numbers we improve on the asymptotic sampling bounds in [18, Sect. 5], see also Dinh Dũng [17] and Byrenheid [8] for the non-periodic situation. In addition, using refined preasymptotic estimates for the \((\sigma _j)_{j\in \mathbb {N}}\) in the non-periodic situation (see [33, Section 4.3]) yields reasonable bounds for sampling numbers in case of small n.

Let us emphasize that the result by Krieg and Ullrich [34] represents major progress in the research on the complexity of this problem. They disproved Conjecture 5.6.2 in [18] for \(p=2\) and \(1/2< s <(d-1)/2\). Indeed, the celebrated sparse grid points are outperformed in a certain range for s. This represents a first step toward the solution of [18, Outstanding Open Problem 1.4]. As a consequence of the recent contributions by Nagel et al. [45] and Temlyakov [65], based on the groundbreaking solution of the Kadison–Singer problem by Marcus et al. [42], it is now evident that sparse grid methods are not optimal in the full range of parameters (except maybe in \(d=2\)). Still, it is worth mentioning that sparse grids represent the best known deterministic construction as far as the asymptotic order is concerned. Indeed, the guarantees are deterministic and only slightly worse compared to random nodes in the asymptotic regime. However, regarding preasymptotics the random constructions provide substantial advantages.

In this paper, we use the simple least squares algorithm from [5, 14, 16, 34], and we show that using random points makes it possible to obtain explicit worst-case recovery bounds also for small n. In the periodic setting the analysis benefits from the fact that the underlying eigenvector system is a bounded orthonormal system (BOS), see (2.8), which implies in particular \(N(m) \in {\mathcal {O}}(m)\). In case of the complex Fourier system, we have a BOS constant \(B=1\) and obtain dimension-free constants. This allows for a priori estimates on the number of required samples and arithmetic operations in order to ensure accuracy \(\varepsilon >0\) with our concrete algorithm. In particular, we incorporate recent preasymptotic bounds for the singular values \((\sigma _k)_{k\in \mathbb {N}}\), see [36, 37] and [33]. For the periodic mixed smoothness space \(H^s_{\text {mix}}(\mathbb {T}^d)\) with \(2s>1+\log _2 d\), equipped with the \(\Vert \cdot \Vert _{\#}\)-norm (see (8.4) below), we obtain with probability larger than \(1-3\delta \) the worst-case bound

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{\#}\le 1}\Vert f-S^{m,\#}_{\mathbf {X}}f\Vert ^2_{L_2(\mathbb {T}^d)} \le \frac{10s}{\delta (2s-1-\log _2d)}\left( \frac{16}{3m}\right) ^{\frac{2s}{1+\log _2d}}. \end{aligned}$$

The number of samples n scales similarly as after (1.2) with precisely determined absolute constants \(c_1,c_2 < 70\), see Corollary 8.4.

We also demonstrate in Sect. 9 that the BOS assumption is not necessary for getting a practical algorithm. Similarly to [5], we use an algorithm called “hyperbolic wavelet regression” and show that it recovers non-periodic functions belonging to a Sobolev space with mixed smoothness \(H^s_{\text {mix}}\) from the n nodes \({\mathbf {X}}\) drawn according to the uniform measure with a rate similar to the one in the periodic case. The proposed approach achieves rates which improve on the bounds in [5] and are only worse by a factor of \(\log ^s n\) in comparison with the optimal rates achieved by hyperbolic wavelet best approximation studied in [21, 60].

Finally, by our refined analysis we were able to settle a conjecture in [52] that the worst-case integration error for \(H^s_{\text {mix}}(\mathbb {T}^d)\) is bounded in order by \(n^{-s}\log (n)^{ds}\) with high probability. This conjecture was based on the outcome of several numerical experiments (described in [52]) where the worst-case error has been simulated using the RKHS Riesz representer. It is remarkable in two respects. First, it is possible to benefit from higher-order smoothness although we use plain random points. And second, this simple method can compete with most of the quasi-Monte Carlo methods based on lattices and digital nets studied in the literature, see [22, pp. 195, 247]. Moreover, if \(s<(d-1)/2\) we get a better asymptotic rate than sparse grid integration which is shown to be of order \(n^{-s}\log (n)^{(d-1)(s+1/2)}\), see [19].

We practically verify our theoretical findings with several numerical experiments. There, we compare the recovery error for the least squares regression method \(S_{{\mathbf {X}}}^m\) to the optimal error given by the projection on the eigenvector space. We also study a non-periodic regime, where we randomly sample points according to the Chebyshev measure. Algorithmically, the coefficients \({\mathbf {c}}:=(c_k)_{k=1}^{m-1}\in \mathbb {C}^{m-1}\) of the approximation \(S_{{\mathbf {X}}}^mf:=\sum _{k=1}^{m-1} c_k\,\sigma _k^{-1}\,e_{k}\) can be obtained by computing the least squares solution of the (over-determined) linear system of equations \({\mathbf {L}}_m \, {\mathbf {c}}=(f({\mathbf {x}}^j))_{j=1}^n\), where \({\mathbf {L}}_m := \left( \sigma _k^{-1}\,e_{k}({\mathbf {x}}^j)\right) _{j=1;\,k=1}^{n;\,m-1}\in \mathbb {C}^{n\times (m-1)}\). In order to solve this linear system of equations, one can apply a standard conjugate gradient type iterative algorithm, e.g., LSQR [54]. The corresponding arithmetic costs are bounded from above by \(C\,R\,m\,n < C\, R\,n^2\), where \(C>0\) is an absolute constant and R is the number of iterations, which is rather small due to the well-conditioned least squares matrices.
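The following is a minimal sketch of this computation (illustrative, not the authors' implementation): the matrix \({\mathbf {L}}_m\) is assembled column by column from the functions \(\eta _k=\sigma _k^{-1}e_k\), which are assumed to be available as callables, and a dense least squares solver is used for brevity; for large n one would instead run a few LSQR iterations as described above.

```python
import numpy as np

def fit_coefficients(eta, X, f_values):
    """Least squares coefficients c minimizing ||L_m c - f||_2.

    eta      : list of m-1 callables eta_1, ..., eta_{m-1} = sigma_k^{-1} e_k
               (each vectorized over an array of nodes)
    X        : array of n sampling nodes
    f_values : array (f(x^1), ..., f(x^n))
    """
    L = np.column_stack([eta_k(X) for eta_k in eta])   # n x (m-1), complex in general
    c, *_ = np.linalg.lstsq(L, f_values, rcond=None)   # dense stand-in for LSQR
    return c
```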

Outline In Sect. 2, we describe the setting in which we want to perform the worst-case analysis. There we use the framework of reproducing kernel Hilbert spaces of complex-valued functions. Section 3 is devoted to the least squares algorithm; the corresponding worst-case analysis is given in Sects. 5, 8, and 9. In the first place, we present the general results in Sect. 5. Section 8 considers the particular case of hyperbolic Fourier regression. In Sect. 9, we investigate the particular case of non-periodic functions with a bounded mixed derivative and their recovery via hyperbolic wavelet regression. The main tools from probability theory, like concentration inequalities for spectral norms and Rudelson’s lemma, are provided in Sect. 4. The analysis of the recovery of individual functions (Monte Carlo) is given in Sect. 6. Consequences for optimally weighted numerical integration based on plain random points are given in Sects. 7 and 8.2. Finally, the numerical experiments are shown and discussed in Sect. 10.

Notation As usual \(\mathbb {N}\) denotes the natural numbers, \(\mathbb {N}_0:=\mathbb {N}\cup \{0\}\), \(\mathbb {Z}\) denotes the integers, \(\mathbb {R}\) the real numbers, and \(\mathbb {C}\) the complex numbers. If not indicated otherwise the symbol \(\log \) denotes the natural logarithm. \(\mathbb {C}^n\) denotes the complex n-space, whereas \(\mathbb {C}^{m\times n}\) denotes the set of all \(m\times n\)-matrices \({\mathbf {L}}\) with complex entries. The spectral norm (i.e., the largest singular value) of a matrix \({\mathbf {L}}\) is denoted by \(\Vert {\mathbf {L}}\Vert \) or \(\Vert {\mathbf {L}}\Vert _{2\rightarrow 2}\). Vectors and matrices are usually typeset in bold with \({\mathbf {x}},{\mathbf {y}}\in \mathbb {R}^n\) or \(\mathbb {C}^n\). For \(0<p\le \infty \) and \({\mathbf {x}}\in \mathbb {C}^n\) we denote \(\Vert {\mathbf {x}}\Vert _p := (\sum _{i=1}^n |x_i|^p)^{1/p}\) with the usual modification in the case \(p=\infty \). If \(T:X\rightarrow Y\) is a continuous operator we write \(\Vert T:X\rightarrow Y\Vert \) for its operator (quasi-)norm. For two sequences \((a_n)_{n=1}^{\infty },(b_n)_{n=1}^{\infty }\subset \mathbb {R}\) we write \(a_n \lesssim b_n\) if there exists a constant \(c>0\) such that \(a_n \le c\,b_n\) for all n. We will write \(a_n \asymp b_n\) if \(a_n \lesssim b_n\) and \(b_n \lesssim a_n\). We write \(f \in {\mathcal {O}}(g)\) for non-negative functions f, g if there is a constant \(c>0\) such that \(f \le cg\). D denotes a subset of \(\mathbb {R}^d\) and \(\ell _{\infty }(D)\) the set of bounded functions on D with \(\Vert \cdot \Vert _{\ell _{\infty }(D)}\) the supremum norm.

2 Reproducing Kernel Hilbert Spaces

We will work in the framework of reproducing kernel Hilbert spaces. The relevant theoretical background can be found in [3, Chapt. 1] and [11, Chapt. 4]. Let \(L_2(D,\varrho _D)\) be the space of complex-valued square-integrable functions with respect to \(\varrho _D\). Here, \(D \subset \mathbb {R}^d\) is an arbitrary subset and \(\varrho _D\) a measure on D. We further consider a reproducing kernel Hilbert space H(K) with a Hermitian positive definite kernel \(K({\mathbf {x}},{\mathbf {y}})\) on \(D \times D\). The crucial property is the identity

$$\begin{aligned} f({\mathbf {x}}) = \langle f, K(\cdot ,{\mathbf {x}}) \rangle _{H(K)} \end{aligned}$$
(2.1)

for all \({\mathbf {x}}\in D\). It ensures that point evaluations are continuous functionals on H(K). We will use the notation from [11, Chapt. 4]. In the framework of this paper, the finite trace of the kernel

$$\begin{aligned} \int _{D} K({\mathbf {x}},{\mathbf {x}}) \, \varrho _D(\mathrm {d}{\mathbf {x}}) < \infty \end{aligned}$$
(2.2)

or its boundedness

$$\begin{aligned} \Vert K\Vert _{\infty } := \sup \limits _{{\mathbf {x}}\in D} \sqrt{K({\mathbf {x}},{\mathbf {x}})} < \infty \end{aligned}$$
(2.3)

is assumed. The boundedness of K implies that H(K) is continuously embedded into \(\ell _\infty (D)\), i.e.,

$$\begin{aligned} \Vert f\Vert _{\ell _{\infty }(D)} \le \Vert K\Vert _{\infty }\cdot \Vert f\Vert _{H(K)}. \end{aligned}$$

Note that we do not need the measure \(\varrho _D\) for this embedding.

The embedding operator

$$\begin{aligned} \mathrm {Id}:H(K) \rightarrow L_2(D,\varrho _D) \end{aligned}$$

is compact under the integrability condition (2.2), which we always assume from now on. We additionally assume that H(K) is infinite-dimensional. However, we do not assume the separability of H(K) here. Due to the compactness of \(\mathrm {Id}\) the operator \(\mathrm {Id}^{*}\circ \mathrm {Id}\) provides an at most countable system of strictly positive eigenvalues \((\lambda _j)_{j\in \mathbb {N}}\). These eigenvalues are summable as a consequence of (2.2) and (2.5), (2.6) below, such that the singular numbers \((\sigma _j)_{j\in \mathbb {N}}\) belong to \(\ell _2\). Indeed, let \(\mathrm {Id}^*\) be defined in the usual way as

$$\begin{aligned} \langle \mathrm {Id}(f), g\rangle _{L_2} = \langle f, \mathrm {Id}^{*}(g) \rangle _{H(K)} . \end{aligned}$$

Then, \(W_{\varrho _D} := \mathrm {Id}^*\circ \mathrm {Id}: H(K) \rightarrow H(K)\) is non-negative definite, self-adjoint and compact. Let \((\lambda _j,e_j)_{j\in \mathbb {N}}\) denote the eigenpairs of \(W_{\varrho _D}\), where \((e_j)_{j \in \mathbb {N}} \subset H(K)\) is an orthonormal system of eigenvectors, and \((\lambda _j)_{j \in \mathbb {N}}\) the corresponding positive eigenvalues. In fact, \(W_{\varrho _D}e_j = \lambda _je_j\) and \(\langle e_j, e_k \rangle _{H(K)} = \delta _{j,k}\). The sequence of positive eigenvalues is arranged in non-increasing order, i.e.,

$$\begin{aligned} \lambda _1 \ge \lambda _2 \ge \lambda _3 \ge \cdots > 0. \end{aligned}$$

Note that we have by Bessel’s inequality

$$\begin{aligned} \Vert f\Vert _{H(K)}^2 \ge \sum \limits _{k=1}^{\infty } |\langle f,e_k\rangle _{H(K)} |^2. \end{aligned}$$
(2.4)

Let us point out that by (2.4), the function

$$\begin{aligned} {\mathbf {x}}\mapsto \sum \limits _{k=1}^{\infty }|e_k({\mathbf {x}})|^2 \le K({\mathbf {x}},{\mathbf {x}}) \end{aligned}$$
(2.5)

exists pointwise in \(\mathbb {C}\). This implies

$$\begin{aligned} \sum \limits _{k=1}^{\infty } \lambda _k \le \int _D K({\mathbf {x}},{\mathbf {x}}) \, \varrho _D(\mathrm {d}{\mathbf {x}}) \end{aligned}$$
(2.6)

and we get by \(\lambda _j = \sigma _j^2\) that \(\mathrm {Id}:H(K) \rightarrow L_2(D,\varrho _D)\) is a Hilbert–Schmidt operator if (2.2) holds. We will restrict to the situation where we have equality in (2.6). This can be achieved by posing additional assumptions, namely that H(K) is separable and \(\varrho _D\) is \(\sigma \)-finite, see [11, Thm. 4.27]. It further holds that

$$\begin{aligned} \langle e_j, e_k \rangle _{L_2} = \langle \mathrm {Id}(e_j), \mathrm {Id}(e_k) \rangle _{L_2} = \langle We_j,e_k \rangle _{H(K)} = \lambda _j \langle e_j,e_k\rangle _{H(K)} = \lambda _j\delta _{j,k}. \end{aligned}$$

Hence, \((e_j)_{j\in \mathbb {N}}\) is also orthogonal in \(L_2(D,\varrho _D)\) and \(\Vert e_j\Vert _2 = \sqrt{\lambda _j} =: \sigma _j\). We define the orthonormal system \((\eta _j)_{j\in \mathbb {N}} := (\lambda _j^{-1/2}e_j)_{j\in \mathbb {N}}\) in \(L_2(D,\varrho _D)\).

For our subsequent analysis, the quantity

$$\begin{aligned} N(m) := \sup \limits _{{\mathbf {x}}\in D}\sum \limits _{k=1}^{m-1}|\eta _k({\mathbf {x}})|^2\ \end{aligned}$$
(2.7)

plays a fundamental role. We often need the related quantity \(T(m) := \sup _{{\mathbf {x}}\in D}\sum _{k=m}^{\infty }|e_k({\mathbf {x}})|^2\) which can be estimated by \(T(m) \le 2\sum _{k\ge m/2}^{\infty } \sigma _k^2 N(4k)/k\). The first one is sometimes called “spectral function”, see [26] and the references therein. Clearly, by (2.5), N(m) and T(m) are well defined if the kernel is bounded, i.e., if (2.3) is assumed. In fact, T(m) is bounded by \(\Vert K\Vert _{\infty }^2\). It may happen that the system \((\eta _k)_{k\in \mathbb {N}}\) is a uniformly \(\ell _\infty (D)\) bounded orthonormal system (BOS), i.e., we have for all \(k\in \mathbb {N}\)

$$\begin{aligned} \Vert \eta _k\Vert _{\ell _{\infty }(D)} \le B. \end{aligned}$$
(2.8)

Let us call B the BOS constant of the system. In this case, we have \(N(m) \le (m-1)B^2 \in {\mathcal {O}}(m)\) and \(T(m) \le B^2\sum _{k=m}^{\infty }\lambda _k\).
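For concreteness, both quantities can be approximated numerically by replacing the supremum over D with a maximum over a fine grid; the following sketch assumes that the functions \(\eta _k\) (respectively \(e_k\)) are available as callables and truncates the tail in T(m) at some index \(k_{\max }\), so it only yields lower bounds for the suprema.

```python
import numpy as np

def spectral_function_N(eta, grid, m):
    # grid approximation of N(m) = sup_x sum_{k=1}^{m-1} |eta_k(x)|^2, cf. (2.7)
    return sum(np.abs(eta_k(grid)) ** 2 for eta_k in eta[: m - 1]).max()

def tail_T(e, grid, m, k_max):
    # grid/truncation approximation of T(m) = sup_x sum_{k >= m} |e_k(x)|^2
    return sum(np.abs(e_k(grid)) ** 2 for e_k in e[m - 1 : k_max]).max()
```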

Remark 2.1

We would like to point out an issue concerning the embedding operator \(\mathrm {Id}:H(K) \rightarrow L_2(D,\varrho _D)\) defined above. As discussed in [11, p. 127], this embedding operator is in general not injective as it maps a function to an equivalence class. As a consequence, the system of eigenvectors \((e_j)_{j\in \mathbb {N}}\) may not be a basis in H(K). (Note that H(K) may not even be separable.) However, there are conditions which ensure that the orthonormal system \((e_j)_{j\in \mathbb {N}}\) is an orthonormal basis in H(K), see [11, 4.5] and [11, Ex. 4.6, p. 163], which is related to Mercer’s theorem [11, Thm. 4.49]. Indeed, if we additionally assume that the kernel \(K(\cdot ,\cdot )\) is bounded and continuous on \(D \times D\) (for a domain \(D \subset \mathbb {R}^d\)), then H(K) is separable and consists of continuous functions, see [3, Thms. 16, 17]. If we finally assume that the measure \(\varrho _D\) is a Borel measure with full support, then \((e_j)_{j\in \mathbb {N}}\) is a complete orthonormal system in H(K). In this case, we have the pointwise identity

$$\begin{aligned} K({\mathbf {x}},{\mathbf {y}})=\sum \limits _{j=1}^{\infty } \overline{e_j({\mathbf {y}})}\,e_j({\mathbf {x}}),\quad {\mathbf {x}},{\mathbf {y}}\in D, \end{aligned}$$
(2.9)

as well as equality signs in (2.4), (2.5) and (2.6), see, for instance, [3, Cor. 4]. Let us emphasize that a Mercer kernel \(K(\cdot ,\cdot )\), which is a continuous kernel on \(D\times D\) with a compact domain \(D \subset \mathbb {R}^d\), satisfies all these conditions, see [11, Thm. 4.49]. In this case, we even have (2.9) with absolute and uniform convergence on \(D \times D\). Let us point out that, to our surprise, Theorem 1.2 (Theorem 5.8) already holds true under the finite trace condition (2.2) if H(K) is separable and \(\varrho _D\) is \(\sigma \)-finite. We do not have to assume continuity of the kernel. Note that the finite trace condition is natural in this context, as [29] shows.

3 Least Squares Regression

Our algorithm essentially boils down to the solution of an over-determined system

$$\begin{aligned} {\mathbf {L}}\, {\mathbf {c}} = {\mathbf {f}} \end{aligned}$$

where \({\mathbf {L}}\in \mathbb {C}^{n \times m}\) is a matrix with \(n>m\). It is well known that the above system may not have a solution. However, we can ask for the vector \({\mathbf {c}}\) which minimizes the residual \(\Vert {\mathbf {f}}-{\mathbf {L}}\, {\mathbf {c}}\Vert _2\). Multiplying the system with \({\mathbf {L}}^*\) gives

$$\begin{aligned} {\mathbf {L}}^{*}\,{\mathbf {L}}\, {\mathbf {c}} = {\mathbf {L}}^{*}\, {\mathbf {f}} \end{aligned}$$

which is called the system of normal equations. If \({\mathbf {L}}\) has full rank, then the unique solution of the least squares problem is given by

$$\begin{aligned} {\mathbf {c}} = \left( {\mathbf {L}}^{*}\,{\mathbf {L}}\right) ^{-1}\,{\mathbf {L}}^{*}\, {\mathbf {f}}. \end{aligned}$$

From the fact that the singular values of \({\mathbf {L}}\) are bounded away from zero, we get the following quantitative bound on the spectral norm of the Moore–Penrose inverse \(({\mathbf {L}}^{*}\,{\mathbf {L}})^{-1}\,{\mathbf {L}}^{*}\).

Proposition 3.1

Let \({\mathbf {L}}\in \mathbb {C}^{n\times m}\), \(m<n\), be a matrix of full rank with singular values \(\tau _1,\ldots ,\tau _m >0\) arranged in non-increasing order.

  1. (i)

    Then, also the matrix \(({\mathbf {L}}^{*}\,{\mathbf {L}})^{-1}\,{\mathbf {L}}^{*}\) has full rank and singular values \(\tau _m^{-1},\ldots ,\tau _1^{-1}\) (arranged in non-increasing order).

  2. (ii)

    In particular, it holds that

    $$\begin{aligned} \left( {\mathbf {L}}^{*}\,{\mathbf {L}}\right) ^{-1}\,{\mathbf {L}}^{*} = {\mathbf {V}}^{*}\, \tilde{\varvec{\Sigma }}{\mathbf {U}}\end{aligned}$$

    whenever \({\mathbf {L}}= {\mathbf {U}}^{*} \varvec{\Sigma } {\mathbf {V}}\), where \(\varvec{\Sigma } \in \mathbb {R}^{n\times m}\) is the rectangular matrix whose only nonzero entries are \((\tau _1,\ldots ,\tau _m)\) on the “main diagonal”, and \({\mathbf {U}}\in \mathbb {C}^{n\times n}\) and \({\mathbf {V}}\in \mathbb {C}^{m\times m}\) are unitary matrices. Here, \(\tilde{\varvec{\Sigma }} \in \mathbb {R}^{m\times n}\) denotes the matrix with \((\tau _1^{-1},\ldots ,\tau _m^{-1})\) on the “main diagonal”.

  3. (iii)

    The operator norm \(\Vert ({\mathbf {L}}^{*}\,{\mathbf {L}})^{-1}\,{\mathbf {L}}^{*}\Vert \) can be controlled as follows:

    $$\begin{aligned} \Vert \left( {\mathbf {L}}^{*}\,{\mathbf {L}}\right) ^{-1}\,{\mathbf {L}}^{*}\Vert \le \tau _m^{-1}. \end{aligned}$$
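The statements of Proposition 3.1 are easy to verify numerically; the following small check (purely illustrative) compares the singular values of the Moore–Penrose inverse of a random full-rank matrix with the reciprocals of its singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 10
L = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))  # full rank almost surely

pinv = np.linalg.solve(L.conj().T @ L, L.conj().T)        # (L^* L)^{-1} L^*
tau = np.linalg.svd(L, compute_uv=False)                  # tau_1 >= ... >= tau_m > 0
tau_pinv = np.linalg.svd(pinv, compute_uv=False)

print(np.allclose(np.sort(tau_pinv), np.sort(1.0 / tau)))    # item (i): reciprocal singular values
print(np.isclose(np.linalg.norm(pinv, 2), 1.0 / tau.min()))  # item (iii): spectral norm equals 1/tau_m
```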

For function recovery, we will use the following matrix

$$\begin{aligned} {\mathbf {L}}_m := {\mathbf {L}}_m({\mathbf {X}}) = \left( \begin{array}{cccc} \eta _1\left( {\mathbf {x}}^1\right) &{}\eta _2\left( {\mathbf {x}}^1\right) &{} \cdots &{} \eta _{m-1}\left( {\mathbf {x}}^1\right) \\ \vdots &{} \vdots &{}&{} \vdots \\ \eta _1\left( {\mathbf {x}}^n\right) &{}\eta _2\left( {\mathbf {x}}^n\right) &{} \cdots &{} \eta _{m-1}\left( {\mathbf {x}}^n\right) \end{array}\right) , \end{aligned}$$
(3.1)

for a set \({\mathbf {X}}= \{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\}\subset D\) of distinct sampling nodes and the system \((\eta _k)_{k\in \mathbb {N}} := (\lambda _k^{-1/2}e_k)_{k\in \mathbb {N}}\). Below we will see that this matrix behaves well with high probability if n is large enough and the nodes in \({\mathbf {X}}\) are drawn i.i.d. from D according to \(\varrho _D\).

(Algorithm 1: least squares approximation \(S^m_{{\mathbf {X}}}\); displayed as a figure in the original.)

Using Algorithm 1, we compute the coefficients \(c_k\), \(k=1,\ldots ,m-1\), of the approximant

$$\begin{aligned} S^m_{{\mathbf {X}}}f:=\sum _{k = 1}^{m-1} c_k\, \eta _k. \end{aligned}$$
(3.2)

Note that the mapping \(f \mapsto S^m_{{\mathbf {X}}}f\) is linear for a fixed set of sampling nodes \({\mathbf {X}}\subset D.\)
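The following is a compact sketch of Algorithm 1 (illustrative, not the authors' code), under the assumption that the functions \(\eta _k\) are given as callables; the univariate Fourier system with the uniform measure on [0, 1), a BOS with constant \(B=1\), serves only as an example.

```python
import numpy as np

def algorithm_1(f_samples, eta, X):
    """Algorithm 1: least squares approximation S_X^m f = sum_k c_k eta_k, cf. (3.2).

    f_samples : array (f(x^1), ..., f(x^n)), the only information used about f
    eta       : list of m-1 callables eta_1, ..., eta_{m-1}
    X         : array of n nodes drawn i.i.d. according to rho_D
    Returns a callable evaluating the approximant.
    """
    L = np.column_stack([eta_k(X) for eta_k in eta])       # the matrix (3.1)
    c, *_ = np.linalg.lstsq(L, f_samples, rcond=None)      # least squares coefficients
    return lambda x: sum(ck * eta_k(x) for ck, eta_k in zip(c, eta))

# illustration: univariate Fourier system (BOS constant B = 1), uniform nodes on [0, 1)
eta = [lambda x, k=k: np.exp(2j * np.pi * k * x) for k in range(-8, 9)]
rng = np.random.default_rng(0)
X = rng.uniform(size=500)
f = lambda x: 1.0 / (1.25 - np.cos(2 * np.pi * x))         # smooth periodic test function
approximant = algorithm_1(f(X), eta, X)
```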

4 Concentration Inequalities

We will consider complex-valued random variables X and random vectors \((X_1,\ldots ,X_N)\) on a probability space \((\Omega ,{\mathcal {A}},\mathbb {P})\). As usual we will denote with \(\mathbb {E}X\) the expectation of X. With \(\mathbb {P}(A|B)\) and \(\mathbb {E}(X|B)\), we denote the conditional probability

$$\begin{aligned} \mathbb {P}(A|B) := \frac{\mathbb {P}(A\cap B)}{\mathbb {P}(B)} \end{aligned}$$

and the conditional expectation

$$\begin{aligned} \mathbb {E}(X|B) = \frac{\mathbb {E}(\chi _B\cdot X)}{\mathbb {P}(B)}, \end{aligned}$$
(4.1)

where \(\chi _B:\Omega \rightarrow \{0,1\}\) is the indicator function on B.

Let us start with the classical Markov inequality. If Z is a random variable, then

$$\begin{aligned} \mathbb {P}(|Z|>t) \le \frac{\mathbb {E}|Z|}{t},\quad t>0. \end{aligned}$$

Of course, there is also a version involving conditional probability and expectation. In fact,

$$\begin{aligned} \mathbb {P}(|Z|>t~|~B) \le \frac{\mathbb {E}(|Z|~|~B)}{t},\quad t>0. \end{aligned}$$
(4.2)

Let us state concentration inequalities for the norm of sums of complex rank-1 matrices. For the first result, we refer to Oliveira [53]. We will need the following notational convention: For a complex (column) vector \({\mathbf {y}}\in \mathbb {C}^N\) (or \(\ell _2\)), we will often use the tensor notation for the matrix

$$\begin{aligned} {\mathbf {y}}\otimes {\mathbf {y}}:= {\mathbf {y}}\, {\mathbf {y}}^*= {\mathbf {y}}\, {\overline{{\mathbf {y}}}}^\top \in \mathbb {C}^{N \times N}\;(\text {or}\, \mathbb {C}^{\mathbb {N}\times \mathbb {N}}). \end{aligned}$$

Proposition 4.1

Let \({\mathbf {y}}^i, i=1,\ldots ,n\), be i.i.d. copies of a random vector \({\mathbf {y}}\in \mathbb {C}^{N}\) such that \(\Vert {\mathbf {y}}^i\Vert _2 \le M\) almost surely. Let further \(\mathbb {E}({\mathbf {y}}^i \otimes {\mathbf {y}}^i) = \varvec{\Lambda }\in \mathbb {C}^{N\times N}\) and \(0<t <1\). Then, it holds

$$\begin{aligned} \mathbb {P}\left( \left\| \frac{1}{n}\sum \limits _{i=1}^n{\mathbf {y}}^i \otimes {\mathbf {y}}^i-\varvec{\Lambda }\right\| >t\right) \le (2\min (n,N))^{\sqrt{2}}\exp \left( -\frac{nt^2}{12M^2}\right) . \end{aligned}$$

Proof

In order to show the probability estimate, we refer to the proof of [53, Lem. 1] and observe

$$\begin{aligned} \mathbb {P}\left( \left\| \frac{1}{n}\sum \limits _{i=1}^n{\mathbf {y}}^i \otimes {\mathbf {y}}^i-\varvec{\Lambda }\right\| >t\right) \le (2\min (n,N))^{\frac{1}{1-2M^2s/n}}\exp \left( -st+\frac{2M^2s^2}{n-2M^2s}\right) \end{aligned}$$

for \( 0\le s\le n/(2M^2)\). Since we restrict \(0<t<1\), the choice \(s=(4+2\sqrt{2})^{-1}nt/M^2\) yields

$$\begin{aligned}&(2\min (n,N))^{\frac{1}{1-2M^2s/n}}\exp \left( -st+\frac{2M^2s^2}{n-2M^2s}\right) \\&\quad = (2\min (n,N))^{\sqrt{2}}\exp \left( -\frac{nt^2}{(6+4\sqrt{2})M^2}\right) \end{aligned}$$

and, finally, the assertion holds. \(\square \)

Remark 4.2

A slightly stronger version for the case of real matrices can be found in Cohen, Davenport, Leviatan [13] (see also the correction in [14]). For \({\mathbf {y}}^i, i=1,\ldots ,n\), i.i.d. copies of a random vector \({\mathbf {y}}\in \mathbb {R}^{N}\) sampled from a bounded orthonormal system, one obtains the concentration inequality

$$\begin{aligned} \mathbb {P}\left( \Bigg \Vert \frac{1}{n}\sum \limits _{i=1}^n{\mathbf {y}}^i \otimes {\mathbf {y}}^i-{\mathbf {I}}\Bigg \Vert >t\right) \le 2N\exp (-c_tn/M^2), \end{aligned}$$

where \(c_t = (1+t)(\ln (1+t))-t\). This leads to improved constants for the case of real matrices.
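As a purely illustrative experiment (not part of the paper), the concentration from Proposition 4.1 can be observed numerically. Here the random vectors are built from the univariate complex Fourier system at uniformly distributed nodes, so that \(\varvec{\Lambda }={\mathbf {I}}\) and \(M^2=N\); all parameters are chosen for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, t = 20000, 16, 0.5
ks = np.arange(1, N + 1)

x = rng.uniform(size=n)                                 # i.i.d. uniform nodes on [0, 1)
Y = np.exp(2j * np.pi * np.outer(x, ks))                # i-th row is y^i with ||y^i||_2^2 = N = M^2
G = Y.conj().T @ Y / n                                  # (1/n) sum_i y^i (x) y^i, expectation I
deviation = np.linalg.norm(G - np.eye(N), 2)

failure_bound = (2 * min(n, N)) ** np.sqrt(2) * np.exp(-n * t**2 / (12 * N))
print(deviation, failure_bound)   # the observed deviation stays far below t = 0.5, while the
                                  # bound on P(deviation > t) is already below 1e-9 here
```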

The following result goes back to Lust-Piquard and Pisier [41], and Rudelson [59]. The complex version with precise constants can be found in Rauhut [56, Cor. 6.20].

Proposition 4.3

Let \({\mathbf {y}}^i \in \mathbb {C}^N\) (or \(\ell _2\)), \(i=1,\ldots ,n\), and \(\varepsilon _i\) independent Rademacher variables taking values \(\pm 1\) with equal probability. Then,

$$\begin{aligned} \mathbb {E}_{\varepsilon } \left\| \sum \limits _{i=1}^n \varepsilon _i \, {\mathbf {y}}^i\otimes {\mathbf {y}}^i\right\| \le C_{\mathrm {R}}\sqrt{\log (8\min \{n, N\})}~\cdot ~ \sqrt{\left\| \sum \limits _{i=1}^n {\mathbf {y}}^i\otimes {\mathbf {y}}^i\right\| }~\cdot ~ \max \limits _{i=1,\ldots ,n}\Vert {\mathbf {y}}^i\Vert _2,\nonumber \\ \end{aligned}$$
(4.3)

with

$$\begin{aligned} C_{\mathrm {R}} = \sqrt{2}+\frac{1}{4\sqrt{2\log (8)}} \in [1.53,1.54]. \end{aligned}$$
(4.4)

Remark 4.4

The result is proved for complex (finite) matrices. Note that the factor \(\sqrt{\log (8\min \{n,N\})}\) is already an upper bound for \(\sqrt{\log (8r)}\), where r is the rank of the matrix \(\sum _i {\mathbf {y}}^i\otimes {\mathbf {y}}^i\). The proof of Proposition 4.3 with the precise constant is based on [56, Lem. 6.18] which itself is based on a non-commutative Khintchine inequality, see [56, 6.5]. This technique allows for controlling all the involved constants. Let us comment on the situation \(N=\infty \), i.e., \({\mathbf {y}}^j \in \ell _2\), where this inequality remains valid with the factor \(\sqrt{\log (8n)}\). In fact, if the matrices \({\mathbf {B}}_j\) in [56, Thm. 6.14] are replaced by rank-1-operators \({\mathbf {B}}_j:\ell _2\rightarrow \ell _2\) of type \({\mathbf {B}}_j = {\mathbf {y}}^j \otimes {\mathbf {y}}^j\) with \(\Vert {\mathbf {y}}^j\Vert _2 < \infty \), then all the arguments remain valid and an \(\ell _2\)-version of this non-commutative Khintchine inequality is available. This implies an \(\ell _2\)-version of [56, Lem. 6.18] which reads as follows: Let \({\mathbf {y}}^j \in \ell _2\), \(j=1,\ldots ,n\), and \(p\ge 2\). Then,

$$\begin{aligned} \left( \mathbb {E}_{\varepsilon } \left\| \sum \limits _{i=1}^n \varepsilon _i \, {\mathbf {y}}^i\otimes {\mathbf {y}}^i\right\| ^p\right) ^{1/p} \le 2^{3/(4p)}n^{1/p}\sqrt{p} \, \mathrm {e}^{-1/2}\sqrt{\left\| \sum \limits _{i=1}^n {\mathbf {y}}^i\otimes {\mathbf {y}}^i\right\| }\cdot \max \limits _{i=1,\ldots ,n}\Vert {\mathbf {y}}^i\Vert _2. \end{aligned}$$

Since we control the moments of the random variable representing the norm on the left-hand side, we are now able to derive a concentration inequality by standard arguments [56, Prop. 6.5]. This concentration inequality then easily implies the \(\ell _2\)-version of (4.3).

As a consequence of this result, we obtain the following deviation inequality in the mean which will be sufficient for our purpose.

Corollary 4.5

Let \({\mathbf {y}}^i\), \(i=1,\ldots ,n\), be i.i.d. random vectors from \(\mathbb {C}^N\) or \(\ell _2\) with \(\Vert {\mathbf {y}}^i\Vert _2 \le M\) almost surely. Let further \(\varvec{\Lambda } = \mathbb {E}({\mathbf {y}}^i \otimes {\mathbf {y}}^i)\). Then, with \(N>n\) we obtain

$$\begin{aligned} \mathbb {E}\left\| \frac{1}{n}\sum \limits _{i=1}^n {\mathbf {y}}^i \otimes {\mathbf {y}}^i-\varvec{\Lambda }\right\| \le 4 C_{\mathrm {R}}^2\frac{\log (8n)}{n}M^2+2C_{\mathrm {R}}\sqrt{\frac{\log (8n)}{n}}M\sqrt{\Vert \varvec{\Lambda }\Vert }. \end{aligned}$$

Proof

By a well-known symmetrization technique (see [24, Lem. 8.4]), Proposition 4.3, and the Cauchy–Schwarz inequality, we obtain

$$\begin{aligned} F&:= \mathbb {E}\left\| \frac{1}{n}\sum \limits _{i=1}^n {\mathbf {y}}^i \otimes {\mathbf {y}}^i-\varvec{\Lambda }\right\| \le 2\mathbb {E}_{{\mathbf {y}}} \mathbb {E}_{\varepsilon } \left\| \frac{1}{n}\sum \limits _{i=1}^n \varepsilon _i{\mathbf {y}}^i \otimes {\mathbf {y}}^i\right\| \\&\le 2C_{\mathrm {R}}\frac{\sqrt{\log (8n)}}{n}\left( \mathbb {E}\max \limits _{i=1,\ldots ,n}\Vert {\mathbf {y}}^i\Vert ^2_2\right) ^{1/2} \left( \mathbb {E}\left\| \sum \limits _{i=1}^n {\mathbf {y}}^i \otimes {\mathbf {y}}^i\right\| \right) ^{1/2}\\&\le \underbrace{2C_{\mathrm {R}}\frac{\sqrt{\log (8n)}}{\sqrt{n}}M}_{=:a}\left( \underbrace{\mathbb {E}\left\| \frac{1}{n}\sum \limits _{i=1}^n {\mathbf {y}}^i \otimes {\mathbf {y}}^i-\varvec{\Lambda }\right\| }_{=F} +\underbrace{\Vert \varvec{\Lambda }\Vert }_{=:b}\right) ^{1/2}. \end{aligned}$$

Hence, we get \(F^2\le a^2(F+b)\) and we solve this inequality with respect to F, which yields

$$\begin{aligned} 0\le F \le \frac{a^2}{2}+\sqrt{\frac{a^4}{4}+a^2b}\le a^2+a\sqrt{b} \end{aligned}$$

and this corresponds to the assertion. \(\square \)

5 Worst-case Errors for Least-Squares Regression

5.1 Random Matrices from Sampled Orthonormal Systems

Let us start with a concentration inequality for the spectral norm of a matrix of type (3.1). It turns out that the complex matrix \({\mathbf {L}}_m:={\mathbf {L}}_m({\mathbf {X}})\in \mathbb {C}^{n\times (m-1)}\) has full rank with high probability, where the elements of the set \({\mathbf {X}}\subset D\) of sampling nodes are drawn i.i.d. at random according to \(\varrho _D\). We will find below that the eigenvalues of

$$\begin{aligned} {\mathbf {H}}_m:={\mathbf {H}}_m({\mathbf {X}}) = \frac{1}{n}{\mathbf {L}}_m^*\, {\mathbf {L}}_m \in \mathbb {C}^{(m-1)\times (m-1)} \end{aligned}$$

are bounded away from zero with high probability if m is small enough compared to n. We speak of an “oversampling factor” n/m. In case of a bounded orthonormal system with BOS constant B, see (2.8), it will turn out that a logarithmic oversampling is sufficient, see (5.2) below. Note that the boundedness constant B may also depend on the underlying spatial dimension d. However, if, for instance, the complex Fourier system \(\{\exp (2\pi \mathrm {i}\,{\mathbf {k}}\cdot {\mathbf {x}}) :{\mathbf {k}}\in \mathbb {Z}^d\}\) is considered, we are in the comfortable situation that \(B = 1\).

Proposition 5.1

Let \(n,m \in \mathbb {N}\), \(m\ge 2\). Let further \(\{\eta _1(\cdot ), \eta _2(\cdot ), \eta _3(\cdot ),\ldots \}\) be the orthonormal system in \(L_2(D,\varrho _D)\) induced by the kernel K and the n sampling nodes in \({\mathbf {X}}\) be drawn i.i.d. at random according to \(\varrho _D\). Then, it holds for \(0<t <1\) that

$$\begin{aligned} \mathbb {P}\left( \Vert {\mathbf {H}}_m - {\mathbf {I}}_m\Vert > t\right) \le (2n)^{\sqrt{2}} \, \exp \left( -\frac{nt^2}{12\cdot N(m)}\right) , \end{aligned}$$

where N(m) is defined in (2.7) and \({\mathbf {I}}_m=\mathrm {diag}({\varvec{1}})\in \{0,1\}^{(m-1)\times (m-1)}\).

Proof

We set \({\mathbf {y}}^i := \left( \eta _1({\mathbf {x}}^i), \ldots , \eta _{m-1}({\mathbf {x}}^i)\right) ^*\), \(i=1,\ldots ,n,\) and observe

$$\begin{aligned} {\mathbf {H}}_m:={\mathbf {H}}_m({\mathbf {X}}) = \frac{1}{n}{\mathbf {L}}_m^*\, {\mathbf {L}}_m = \frac{1}{n} \sum \limits _{i = 1}^n {\mathbf {y}}^i \otimes {\mathbf {y}}^i. \end{aligned}$$

Moreover, due to the fact that we have an orthonormal system \((\eta _k)_{k\in \mathbb {N}}\), we obtain that \(\mathbb {E}({\mathbf {H}}_m) = {\mathbf {I}}_m\). The result follows by noting that \(M^2 \le N(m)\) in Proposition 4.1. \(\square \)

Remark 5.2

From this proposition we immediately obtain that the matrix \({\mathbf {H}}_m \in \mathbb {C}^{(m-1) \times (m-1)}\) has only eigenvalues larger than \(t:=1/2\) with probability at least \(1-\delta \) if

$$\begin{aligned} N(m) \le \frac{n}{48(\sqrt{2}\log (2n)-\log \delta )}. \end{aligned}$$
(5.1)

Hence, in case of a bounded orthonormal system with BOS constant \(B>0\), see (2.8), we may choose

$$\begin{aligned} m \le \kappa _{\delta ,B}\frac{n}{\log (2n)} \end{aligned}$$
(5.2)

with \(\kappa _{\delta ,B} := (\log (1/\delta )+\sqrt{2})^{-1}B^{-2}/48\).

From Proposition 5.1 we get that the \(m-1\) singular values \(\tau _1,\ldots , \tau _{m-1}\) of \({\mathbf {L}}_m\) from (3.1) are all not smaller than \(\sqrt{n/2}\) and not larger than \(\sqrt{3n/2}\) with probability at least \(1-\delta \) if m is chosen such that (5.1) holds. In terms of Proposition 3.1, this means that \(\tau _1,\ldots , \tau _{m-1} \ge \sqrt{n/2}\). This leads to an upper bound on the norm of the Moore–Penrose inverse that is required for the least squares algorithm.
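In practice, the largest admissible m is obtained from n, \(\delta \) and the BOS constant B by inverting (5.1) with \(N(m)\le (m-1)B^2\); the following minimal sketch of this bookkeeping is purely illustrative.

```python
import numpy as np

def admissible_m(n, delta, B=1.0):
    """Largest m with (m-1) B^2 <= n / (48 (sqrt(2) log(2n) - log(delta))), cf. (5.1)."""
    bound = n / (48.0 * (np.sqrt(2) * np.log(2 * n) - np.log(delta)))
    return int(np.floor(bound / B**2)) + 1   # uses N(m) <= (m-1) B^2 for a BOS, cf. Remark 5.2

print(admissible_m(10**5, delta=0.01))       # e.g. for the Fourier system (B = 1)
```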

Proposition 5.3

Let \(\{\eta _1(\cdot ), \eta _2(\cdot ), \eta _3(\cdot ),\ldots \}\) be the orthonormal system in \(L_2(D,\varrho _D)\) induced by the kernel K. Let further \(m,n \in \mathbb {N}\), \(m\ge 2\), and \(0< \delta <1\) be chosen such that they satisfy (5.1). Then, the random matrix \({\mathbf {L}}_m\) from (3.1) satisfies

$$\begin{aligned} \Vert \left( {\mathbf {L}}_m^{*}\,{\mathbf {L}}_m\right) ^{-1}\,{\mathbf {L}}_m^{*}\Vert \le \sqrt{\frac{2}{n}} \end{aligned}$$

with probability at least \(1-\delta \).

In addition to the matrix \({\mathbf {L}}_m\), we need to consider a second linear operator that is defined using sampling values of the eigenfunctions \(e_j\). The importance of this operator has been pointed out in [34], where strong results on the concentration of infinite dimensional random matrices have been used. Since we only need the expectation of the norm, we only use Rudelson’s lemma, see Proposition 4.3, and a symmetrization technique. This allows us to control the constants.

Proposition 5.4

Let \({\mathbf {X}}=\{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\}\subset D\) be a set of n sampling nodes drawn i.i.d. at random according to \(\varrho _D\), and consider the n i.i.d. random sequences

$$\begin{aligned} {\mathbf {y}}^i = \left( e_m\left( {\mathbf {x}}^i\right) ,e_{m+1}\left( {\mathbf {x}}^i\right) ,\ldots \right) ^\top ,\quad i=1,\ldots ,n, \end{aligned}$$

together with \(T(m):= \sup \limits _{{\mathbf {x}}\in D}\sum \limits _{k= m}^{\infty } |e_k({\mathbf {x}})|^2 < \infty \), see (2.7). Then, the operator

$$\begin{aligned} \varvec{\Phi }_m:\ell _2 \rightarrow \mathbb {C}^n,\quad {\mathbf {z}} \mapsto \left( \begin{array}{c} \langle {\mathbf {z}},{\mathbf {y}}^1\rangle _{\ell _2}\\ \vdots \\ \langle {\mathbf {z}},{\mathbf {y}}^n\rangle _{\ell _2} \end{array}\right) \end{aligned}$$

satisfies

$$\begin{aligned} \mathbb {E}\left( \Vert \varvec{\Phi }_m\Vert ^2\right) \le n\left( \sigma _m^2+4\,C_{\mathrm {R}}^2\frac{\log (8n)}{n}T(m)+2\,C_{\mathrm {R}}\,\sigma _m\,\sqrt{\frac{\log (8n)}{n}T(m)}\right) . \end{aligned}$$
(5.3)

Proof

Note that \(\varvec{\Phi }_m^{*}\varvec{\Phi }_m = \sum \limits _{i=1}^n {\mathbf {y}}^i \otimes {\mathbf {y}}^i\) and

$$\begin{aligned} \varvec{\Lambda }_m := \mathbb {E}\left( \frac{1}{n}\varvec{\Phi }_m^{*}\varvec{\Phi }_m\right) = \mathbb {E}\left( {\mathbf {y}}^i \otimes {\mathbf {y}}^i\right) = \mathrm {diag}\left( \sigma ^2_m,\sigma ^2_{m+1},\ldots \right) . \end{aligned}$$

This gives

$$\begin{aligned} \Vert \varvec{\Phi }_m\Vert ^2 = \Vert \varvec{\Phi }_m^{*}\varvec{\Phi }_m\Vert \le \Vert \varvec{\Phi }_m^{*}\varvec{\Phi }_m-n\varvec{\Lambda }_m\Vert +n\Vert \varvec{\Lambda }_m\Vert . \end{aligned}$$

Finally, the bound in (5.3) follows from Corollary 4.5 (see also Remark 4.4 for \(N = \infty \)), the fact that \(\Vert \varvec{\Lambda }_m\Vert = \lambda _m = \sigma _m^2\) and \(M^2 = T(m)\). \(\square \)

5.2 Worst-case Errors with High Probability

Theorem 5.5

Let H(K) be a separable reproducing kernel Hilbert space on a domain \(D \subset \mathbb {R}^d\) with a positive semidefinite kernel \(K({\mathbf {x}},{\mathbf {y}})\) such that \(\sup _{{\mathbf {x}}\in D} K({\mathbf {x}},{\mathbf {x}}) <\infty \). We denote with \((\sigma _j)_{j\in \mathbb {N}}\) the non-increasing sequence of singular numbers of the embedding \(\mathrm {Id}:H(K) \rightarrow L_2(D,\varrho _D)\) for a probability measure \(\varrho _D\). Let further \(0<\delta <1\) and \(m,n \in \mathbb {N}\), where \(m\ge 2\) is chosen such that (5.1) holds. Drawing the n sampling nodes in \({\mathbf {X}}\) i.i.d. at random according to \(\varrho _D\), we have for the conditional expectation of the worst-case error

$$\begin{aligned}&\mathbb {E}\left( \sup \limits _{\Vert f\Vert _{H(K)}\le 1} \Vert f-S^m_{\mathbf {X}}f\Vert ^2_{L_2(D,\varrho _D)}\Big |\Vert {\mathbf {H}}_m-{\mathbf {I}}_m\Vert \le 1/2\right) \nonumber \\&\quad \le \frac{1}{1-\delta }\left( 3\sigma _m^2+8\,C_{\mathrm {R}}^2\frac{\log (8n)}{n}T(m)+4\,C_{\mathrm {R}}\,\sigma _m\,\sqrt{\frac{\log (8n)}{n}T(m)}\right) \nonumber \\&\quad \le \frac{3+8\,C_{\mathrm {R}}^2+4\,C_{\mathrm {R}}}{1-\delta }\,\max \left\{ \sigma ^2_m ,\frac{\log (8n)}{n}T(m)\right\} \end{aligned}$$
(5.4)

with \(C_{\mathrm {R}}\) from (4.4).

Proof

Let \(f \in H(K)\) such that \(\Vert f\Vert _{H(K)} \le 1\). Let further \({\mathbf {X}}=\{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\}\) be such that \(\Vert {\mathbf {H}}_m-{\mathbf {I}}_m\Vert \le 1/2\). Using orthogonality and the reproducing property \(S^m_{\mathbf {X}}P_{m-1}f=P_{m-1}f\), we estimate

$$\begin{aligned} \begin{aligned} \Vert f-S^m_{\mathbf {X}}f\Vert ^2_{L_2\left( D,\varrho _D\right) }&= \Vert f-P_{m-1}f\Vert ^2_{L_2\left( D,\varrho _D\right) }+\Vert P_{m-1}f-S^m_{\mathbf {X}}f\Vert ^2_{L_2\left( D,\varrho _D\right) }\\&\le \sigma ^2_m + \Vert S^m_{\mathbf {X}}\left( P_{m-1}f-f\right) \Vert ^2_{L_2\left( D,\varrho _D\right) }\\&=\sigma ^2_m + \left\| \left( {\mathbf {L}}_m^{*}\,{\mathbf {L}}_m\right) ^{-1}\,{\mathbf {L}}_m^{*} \left( \left( P_{m-1}f-f\right) \left( {\mathbf {x}}^k\right) \right) _{k=1}^n\right\| ^2_{2}\\&\le \sigma _m^2+\frac{2}{n}\sum \limits _{k=1}^n \Big |\left( f-P_{m-1}f\right) ({\mathbf {x}}^k)\Big |^2, \end{aligned} \end{aligned}$$
(5.5)

where \(P_{m-1}f\) denotes the projection \(\sum _{j=1}^{m-1} \langle f,e_j \rangle e_j\) in H(K) yielding \(\Vert f - P_{m-1}f\Vert _{L_2(D,\varrho _D)} \le \sigma _m\). Note further that for any \({\mathbf {x}}\in D\)

$$\begin{aligned} \begin{aligned}&\left( f-P_{m-1}f\right) ({\mathbf {x}}) \\&\quad = \left\langle f, K(\cdot ,{\mathbf {x}})-\sum \limits _{j=1}^{\infty }e_j(\cdot )\overline{e_j({\mathbf {x}})} +\sum \limits _{j=1}^{\infty }e_j(\cdot )\overline{e_j({\mathbf {x}})}-\sum \limits _{j=1}^{m-1}e_j(\cdot )\overline{e_j({\mathbf {x}})} \right\rangle _{H(K)}\\&\quad = \sum \limits _{j=m}^\infty \langle f,e_j\rangle _{H(K)} e_j({\mathbf {x}}) + \langle f, T(\cdot ,{\mathbf {x}})\rangle _{H(K)}, \end{aligned} \end{aligned}$$

where \(T(\cdot ,{\mathbf {x}}) = K(\cdot ,{\mathbf {x}})-\sum _{j=1}^{\infty }e_j(\cdot )\overline{e_j({\mathbf {x}})}\) denotes an element in H(K). Its norm is given by

$$\begin{aligned} \Vert T(\cdot ,{\mathbf {x}})\Vert _{H(K)}^2 := \langle T(\cdot ,{\mathbf {x}}), T(\cdot ,{\mathbf {x}})\rangle _{H(K)} = K({\mathbf {x}},{\mathbf {x}})-\sum _{j=1}^{\infty }|e_j({\mathbf {x}})|^2. \end{aligned}$$
(5.6)

This gives

$$\begin{aligned} \Big |\left( f-P_{m-1}f\right) ({\mathbf {x}})\Big |^2&\le \left| \sum \limits _{j=m}^\infty \langle f,e_j\rangle _{H(K)} e_j({\mathbf {x}})\right| ^2 \\&\quad + 2\Vert f\Vert _{H(K)}^2\Vert T(\cdot ,{\mathbf {x}})\Vert _{H(K)}\sqrt{\sum \limits _{j=m}^{\infty }|e_j({\mathbf {x}})|^2}\\&\quad + \Vert f\Vert _{H(K)}^2 \Vert T(\cdot ,{\mathbf {x}})\Vert _{H(K)}^2\\&\overset{(2.5)}{\le }\left( \sum \limits _{i=m}^\infty \overline{\langle f,e_i\rangle _{H(K)} e_i({\mathbf {x}})}\right) \left( \sum \limits _{j=m}^\infty \langle f,e_j\rangle _{H(K)} e_j({\mathbf {x}})\right) \\&\quad +\Vert f\Vert _{H(K)}^2\Vert T(\cdot ,{\mathbf {x}})\Vert _{H(K)}\left( \Vert T(\cdot ,{\mathbf {x}})\Vert _{H(K)}+2\Vert K\Vert _{\infty }\right) \\&\le \sum \limits _{i=m}^\infty \sum \limits _{j=m}^\infty \overline{\langle f,e_i\rangle _{H(K)}} \langle f,e_j\rangle _{H(K)} \overline{e_i({\mathbf {x}})} e_j({\mathbf {x}})\\&\quad +3\Vert T(\cdot ,{\mathbf {x}})\Vert _{H(K)}\Vert K\Vert _{\infty }\Vert f\Vert _{H(K)}^2. \end{aligned}$$

Returning to (5.5) we estimate

$$\begin{aligned} \begin{aligned} \sum \limits _{k=1}^n\Big |\left( f-P_{m-1}f\right) \left( {\mathbf {x}}^k\right) \Big |^2&\le \Vert \left( \langle f,e_j\rangle _{H(K)}\right) _{j\in \mathbb {N}}\Vert ^2_2 \; \Vert \varvec{\Phi }_m\Vert ^2\\&\quad +3\Vert K\Vert _{\infty }\Vert f\Vert _{H(K)}^2\sum \limits _{k=1}^{n}\Vert T\left( \cdot ,{\mathbf {x}}^k\right) \Vert \\&\le \Vert f\Vert _{H(K)}^2\left( \Vert \varvec{\Phi }_m\Vert ^2 +3\Vert K\Vert _{\infty }\sum \limits _{k=1}^{n}\Vert T\left( \cdot ,{\mathbf {x}}^k\right) \Vert \right) , \end{aligned} \end{aligned}$$
(5.7)

where \(\varvec{\Phi }_m\) denotes the infinite matrix from Proposition 5.4. Note that we used (2.4) in the last but one step. The relation in (5.7) together with (5.5) and \(\Vert f\Vert _{H(K)} \le 1\) implies

$$\begin{aligned} \Vert f-S^m_{\mathbf {X}}f\Vert ^2_{L_2\left( D,\varrho _D\right) }&\le \sigma ^2_m + \Vert \left( {\mathbf {L}}_m^{*}\,{\mathbf {L}}_m\right) ^{-1}\,{\mathbf {L}}_m^{*}\Vert ^2\cdot \sum \limits _{k=1}^n \Big |\left( f-P_{m-1}f\right) \left( {\mathbf {x}}^k\right) \Big |^2\\&= \sigma ^2_m + \frac{2}{n}\Vert \varvec{\Phi }_m\Vert ^2 + \frac{6\Vert K\Vert _{\infty }}{n}\sum \limits _{k=1}^n \Vert T\left( \cdot ,{\mathbf {x}}^k\right) \Vert . \end{aligned}$$

Integrating on both sides yields

$$\begin{aligned} \begin{aligned}&\int _{\Vert {\mathbf {H}}_m-{\mathbf {I}}_m\Vert \le 1/2}\sup \limits _{\Vert f\Vert _{{H(K)}}\le 1}\Vert f-S^m_{\mathbf {X}}f\Vert ^2_{L_2\left( D,\varrho _D\right) } \, \varrho ^n_D\left( \mathrm {d}{\mathbf {X}}\right) \\&\quad \le \sigma ^2_m + \frac{2}{n}\mathbb {E}\left( \Vert \varvec{\Phi }_m\Vert ^2\right) + \frac{6\Vert K\Vert _{\infty }}{n}\sum \limits _{k=1}^n \int _D\Vert T\left( \cdot ,{\mathbf {x}}^k\right) \Vert \, \varrho _D(\mathrm {d}{\mathbf {x}}) \\&\quad = \sigma ^2_m + \frac{2}{n}\mathbb {E}\left( \Vert \varvec{\Phi }_m\Vert ^2\right) . \end{aligned} \end{aligned}$$
(5.8)

Note that the integral on the right-hand side of (5.8) vanishes because of (5.6) and the fact that we have an equality sign in (2.6) due to our assumptions (separability of H(K)). This gives

$$\begin{aligned} 0= & {} \int _D K({\mathbf {x}},{\mathbf {x}}) \, \varrho _D(\mathrm {d}{\mathbf {x}})-\int _D\sum \limits _{j=1}^{\infty }\lambda _j \, \varrho _D(\mathrm {d}{\mathbf {x}}) \\= & {} \int _D \left( K({\mathbf {x}},{\mathbf {x}})-\sum \limits _{j=1}^{\infty }|e_j({\mathbf {x}})|^2\right) \, \varrho _D(\mathrm {d}{\mathbf {x}}). \end{aligned}$$

Taking Proposition 5.4 and (4.1) into account and noting that \(\mathbb {P}(\Vert {\mathbf {H}}_m-{\mathbf {I}}_m\Vert \le 1/2)\) is larger than \(1-\delta \), we obtain the assertion. \(\square \)

In addition to that we may easily get a deviation inequality by using Markov’s inequality and standard arguments. It reads as follows.

Corollary 5.6

Under the same assumptions as in Theorem 5.5, it holds for fixed \(\delta >0\)

$$\begin{aligned} \mathbb {P}\left( \sup \limits _{\Vert f\Vert _{{H(K)}}\le 1} \Vert f-S^m_{\mathbf {X}}f\Vert ^2_{L_2\left( D,\varrho _D\right) } \le \frac{C}{\delta }\,\max \left\{ \sigma ^2_m,\frac{\log (8n)}{n}T(m) \right\} \right) \ge 1-3\delta , \end{aligned}$$
(5.9)

where \(C:=3+8\,C_{\mathrm {R}}^2+4\,C_{\mathrm {R}}<28.05\) is an absolute constant.

Proof

We define the events

$$\begin{aligned} A&:=\left\{ {\mathbf {X}}\;:\sup \limits _{\Vert f\Vert _{H(K)}\le 1} \Vert f-S^m_{\mathbf {X}}f\Vert ^2_{L_2\left( D,\varrho _D\right) }\le t\right\} ,\\ B&:=\left\{ {\mathbf {X}}\;:\Vert {\mathbf {H}}_m-{\mathbf {I}}_m\Vert \le 1/2\right\} \end{aligned}$$

and split up

$$\begin{aligned} \mathbb {P}(A) =1- \mathbb {P}\left( A^\complement \right) =1-\mathbb {P}\left( A^\complement \cap B\right) -\mathbb {P}\left( A^\complement \cap B^\complement \right) . \end{aligned}$$

Treating each summand separately, we have

$$\begin{aligned} \mathbb {P}\left( A^\complement \cap B\right)&= \mathbb {P}\left( A^\complement | B\right) \mathbb {P}(B)\le \mathbb {P}\left( A^\complement | B\right) ,\\ \mathbb {P}\left( A^\complement \cap B^\complement \right)&\le \mathbb {P}\left( B^\complement \right) \le \delta . \end{aligned}$$

Next we estimate \(\mathbb {P}(A^\complement | B)\) using the Markov inequality (4.2), Theorem 5.5, and setting

$$\begin{aligned} t:=\frac{3+8\,C_{\mathrm {R}}^2+4\,C_{\mathrm {R}}}{\delta }\,\max \left\{ \sigma _m^2,\frac{\log (8n)}{n}T(m)\right\} \end{aligned}$$

which yields

$$\begin{aligned} \mathbb {P}\left( A^\complement | B\right) \le \frac{1}{t}\,\mathbb {E}\left( \sup \limits _{\Vert f\Vert _{H(K)}\le 1} \Vert f-S^m_{\mathbf {X}}f\Vert ^2_{L_2\left( D,\varrho _D\right) }\,\Big |\, B\right) \le \frac{\delta }{1-\delta }\overset{\delta \le 1/2}{\le } 2\delta \end{aligned}$$

and the assertion follows. \(\square \)

Example 5.7

Theorem 5.5 as well as Corollary 5.6 can also be applied to unbounded orthonormal systems, which may lead to non-optimal error bounds. For instance, let \(D=[-1,1]\) and \(\varrho _D\) the normalized Lebesgue measure on D. Then, the second-order operator \({\text {A}}\) defined by

$$\begin{aligned} {\text {A}}\!f(x)=-\left( (1-x^2)\, f'(x)\right) ' \end{aligned}$$

characterizes for \(s>1\) weighted Sobolev spaces

$$\begin{aligned} H(K_s):=\left\{ f\in L_2(D):{\text {A}}^{s/2}\!f\in L_2(D)\right\} \end{aligned}$$

which are in fact reproducing kernel Hilbert spaces with reproducing kernel

$$\begin{aligned} K_s(x,y)=\sum _{k\in \mathbb {N}}\left( 1+(k(k+1))^s\right) ^{-1}{\mathcal {P}}_k(x){\mathcal {P}}_k(y), \end{aligned}$$

where \({\mathcal {P}}_k:D\rightarrow \mathbb {R}\), \(k\in \mathbb {N}\), are \(L_2(D)\)-normalized Legendre polynomials. Clearly, \(({\mathcal {P}}_k)_{k\in \mathbb {N}}\) provides an ONB in \(L_2(D)\) and plays the role of \((\eta _k)_{k\in \mathbb {N}}\) in our setting. Moreover, we have \(e_k=\sigma _k\eta _k=(1+(k(k+1))^s)^{-1/2}{\mathcal {P}}_k\), and, accordingly, \(\sigma _k=(1+(k(k+1))^s)^{-1/2}\). The two quantities N(m) and T(m) are given by

$$\begin{aligned} N(m)&=\sup _{x\in D}\sum _{k=1}^{m-1} |{\mathcal {P}}_k(x)|^2=\sum _{k=0}^{m-2}\frac{2k+1}{2}=\frac{(m-1)^2}{2},\\ T(m)&=\sup _{x\in D}\sum _{k=m}^\infty \frac{2k-1}{2\left( 1+(k(k+1))^s\right) }\asymp \sum _{k=m}^\infty k^{-2s+1} \asymp m^{-2s+2}. \end{aligned}$$

Applying Theorem 5.5 or Corollary 5.6 and choosing m as large as possible leads to the relation \(m^2\sim n/\log {n}\), which is far from optimal with respect to the number n of used samples, i.e., we observe worst-case error estimates of the form

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{{H(K_s)}}\le 1} \Vert f-S^m_{\mathbf {X}}f\Vert _{L_2(D)} \lesssim \sigma _m \sim m^{-s}\lesssim \left( \frac{\log (n)}{n}\right) ^{s/2} \end{aligned}$$

in expectation and with high probability, respectively.

On the other hand, the result by Bernardi and Maday [4, Thm. 6.2] using a polynomial interpolation operator \(j_{n-1}\) at Gauss points guarantees

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{{H(K_s)}}\le 1} \Vert f-j_{n-1} f\Vert _{L_2(D)} \lesssim \frac{\log (n)}{n^s}, \end{aligned}$$

which is optimal in its main rate s. However, as it turns out below (see Example 5.10), it is not optimal with respect to the power of the logarithm. \(\square \)
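The quantities appearing in this example are easy to reproduce numerically. The following illustrative sketch (with an arbitrary choice of s and of the truncation index) evaluates the \(L_2(D)\)-normalized Legendre polynomials via the three-term recurrence and approximates N(m) and T(m) on a grid.

```python
import numpy as np

s, m, k_max = 2.0, 20, 500
x = np.linspace(-1.0, 1.0, 2001)

# standard Legendre polynomials via the three-term recurrence, then L2(D)-normalization
P = np.empty((k_max, x.size))
P[0], P[1] = 1.0, x
for j in range(1, k_max - 1):
    P[j + 1] = ((2 * j + 1) * x * P[j] - j * P[j - 1]) / (j + 1)
P *= np.sqrt((2 * np.arange(k_max) + 1) / 2)[:, None]   # row j holds the normalized degree-j polynomial

N_m = (P[: m - 1] ** 2).sum(axis=0).max()
k = np.arange(1, k_max + 1)                             # index k corresponds to degree k - 1
sigma2 = 1.0 / (1.0 + (k * (k + 1.0)) ** s)
T_m = (sigma2[m - 1 :, None] * P[m - 1 :] ** 2).sum(axis=0).max()

print(N_m, (m - 1) ** 2 / 2)   # N(m) = (m-1)^2/2, attained at the endpoints
print(T_m)                      # decays like m^{-2s+2} as m grows
```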

Example 5.7 illustrates that the suggested approach leads to worst-case error estimates with high probability. However, the achieved upper bounds are not optimal in specific situations. We will overcome this limitation in the next section by drawing the sampling nodes with respect to a weighted, tailored distribution and applying a weighted least squares algorithm.

5.3 Improvements Due to Importance Sampling

We are interested in the question of optimal sampling recovery of functions from reproducing kernel Hilbert spaces in \(L_2(D,\varrho _D)\). The goal is to get reasonable bounds in n, preferably in terms of the singular numbers of the embedding. Theorem 5.5 already gives a satisfying answer in case of bounded kernels and \(N(m) \in {\mathcal {O}}(m)\). In order to drop both conditions, we will use a weighted (deterministic) least squares algorithm (see Algorithm 2) to recover functions \(f\in H(K)\) from samples at random nodes (“random information” in the sense of [28]). The approach is a slight modification of the one proposed earlier in [34]. We use a technique known as “(optimal) importance sampling”, where one defines a density function depending on the spectral properties of the embedding operator. The sampling nodes are then drawn according to this density. In the Monte Carlo setting (or “randomized setting”), this has been successfully applied, e.g., by Cohen and Migliorati [16], see Remark 6.3. Also in connection with compressed sensing it led to substantial improvements when recovering multivariate functions, see [57, 58]. This technique was originally applied, e.g., to the approximation of integrals, see [27]. However, the setting in which we are interested requires additional work since the sampling nodes are supposed to be drawn in advance for the whole class of functions.

(Algorithm 2: weighted least squares approximation \({\widetilde{S}}^m_{\mathbf {X}}\); see the description below.)

As already mentioned, we construct a more suitable distribution which is used to draw the sampling nodes at random. In particular, we tailor a probability density function \(\varrho _m:D\rightarrow \mathbb {R}\) and define the probability measure \(\mu _m\) via

$$\begin{aligned} \mu _m(A):=\int _A\varrho _m({\mathbf {x}})\varrho _D(\mathrm {d}{\mathbf {x}}),\quad A\subset D \text{ measurable } . \end{aligned}$$
(5.11)

Then, we may draw the sampling nodes in \({\mathbf {X}}\subset D\) i.i.d. at random according to \(\mu _m\). For the chosen set \({\mathbf {X}}\), we define the approximation operator \({\widetilde{S}}^m_{\mathbf {X}}\) as indicated in Algorithm 2.

Choosing the specific density function

$$\begin{aligned} \varrho _m({\mathbf {x}}) := \frac{1}{2}\left( \frac{1}{m-1}\sum \limits _{j=1}^{m-1} |\eta _j({\mathbf {x}})|^2 + \left( \sum \limits _{j=m}^{\infty } \lambda _j\right) ^{-1}\left( K({\mathbf {x}},{\mathbf {x}}) - \sum \limits _{j=1}^{m-1}|e_j({\mathbf {x}})|^2\right) \right) \end{aligned}$$
(5.12)

guarantees worst-case error estimates which are optimal up to logarithmic factors and up to a specific failure probability.
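
Before we state the corresponding recovery guarantee, we sketch how the density (5.12), the sampling from \(\mu _m\), and the weighted least squares fit of Algorithm 2 fit together in code. The sketch below is our own illustration, not a reference implementation: all function names are ours, the eigenpairs \((\lambda _j,e_j)\), the kernel diagonal, a sampler for \(\varrho _D\), and an upper bound on \(\varrho _m\) are assumed to be available, and the domain is taken one-dimensional for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

def rho_m_factory(K_diag, lam, e_funs, m, tail_sum):
    """Density (5.12) on a one-dimensional domain D.
    K_diag(x): kernel diagonal K(x, x); lam[j], e_funs[j]: eigenvalue lambda_{j+1} and
    eigenfunction e_{j+1} (assumed to be supplied by the user);
    tail_sum: sum_{j >= m} lambda_j, assumed to be known or precomputed."""
    lam = np.asarray(lam, dtype=float)

    def rho_m(x):
        x = np.atleast_1d(x)
        e_sq = np.array([np.abs(f(x)) ** 2 for f in e_funs[: m - 1]])      # |e_j(x)|^2
        first = (e_sq / lam[: m - 1, None]).sum(axis=0) / (m - 1)          # (1/(m-1)) sum |eta_j(x)|^2
        second = (K_diag(x) - e_sq.sum(axis=0)) / tail_sum
        return 0.5 * (first + second)

    return rho_m

def draw_nodes(n, sample_rho_D, rho_m, sup_rho_m):
    # Rejection sampling of n i.i.d. nodes from mu_m(dx) = rho_m(x) rho_D(dx);
    # sample_rho_D(size) draws from rho_D and sup_rho_m >= sup_x rho_m(x) (both assumptions).
    nodes = []
    while len(nodes) < n:
        cand = sample_rho_D(n)
        accept = rng.random(n) * sup_rho_m <= rho_m(cand)
        nodes.extend(np.atleast_1d(cand)[accept])
    return np.array(nodes[:n])

def weighted_least_squares(X, f_vals, eta_funs, rho_m):
    # Algorithm 2: least squares solution of L~ c = g with L~[k, j] = eta_j(x^k)/sqrt(rho_m(x^k))
    # and g_k = f(x^k)/sqrt(rho_m(x^k)); the approximant is S~ f = sum_j c_j eta_j.
    w = 1.0 / np.sqrt(rho_m(X))
    L = np.column_stack([eta(X) for eta in eta_funs]) * w[:, None]
    c, *_ = np.linalg.lstsq(L, f_vals * w, rcond=None)
    return c
```

The multivariate case works in the same way; only the node sampler and the evaluation of the basis functions change.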

Theorem 5.8

Let H(K) be a separable reproducing kernel Hilbert space of complex-valued functions defined on D such that

$$\begin{aligned} \int _{D} K({\mathbf {x}},{\mathbf {x}}) \, \varrho _D(\mathrm {d}{\mathbf {x}}) < \infty \end{aligned}$$

for some non-trivial \(\sigma \)-finite measure \(\varrho _D\) on D, where \((\sigma _j)_{j\in \mathbb {N}}\) denotes the non-increasing sequence of singular numbers of the embedding \(\mathrm {Id}:H(K) \rightarrow L_2(D,\varrho _D)\). Let further \(\delta \in (0,1/3)\) and \(n\in \mathbb {N}\) large enough, such that

$$\begin{aligned} m := \left\lfloor \frac{n}{96(\sqrt{2}\log (2n)-\log \delta )}\right\rfloor \ge 2 \end{aligned}$$
(5.13)

holds. Moreover, we assume \(\varrho _m:D\rightarrow \mathbb {R}\) and \(\mu _m\) as stated in (5.12) and (5.11). We draw each node in \({\mathbf {X}}:=\{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\}\subset D\) i.i.d. at random according to \(\mu _m\), which yields

$$\begin{aligned} \mathbb {P}\left( \sup \limits _{\Vert f\Vert _{{H(K)}}\le 1} \Vert f-{\widetilde{S}}^m_{\mathbf {X}}f\Vert ^2_{L_2\left( D,\varrho _D\right) } \le \frac{C}{\delta }\,\max \left\{ \sigma ^2_m,\frac{\log (8n)}{n}\sum _{j=m}^\infty \sigma _j^2 \right\} \right) \ge 1-3\delta , \end{aligned}$$
(5.14)

where \(C:=3+16\,C_{\mathrm {R}}^2+4\sqrt{2}\,C_{\mathrm {R}}<49.5\) is an absolute constant.

Proof

We emphasize that the second argument of the \(\max \) term in (5.14) makes sense since we know from (2.6) that the sequence of singular numbers is square-summable. As a technical modification of the density function presented in [34], we use the density \(\varrho _m :D \rightarrow \mathbb {R}\) as stated in (5.12). As above, the family \((e_j(\cdot ))_{j\in \mathbb {N}}\) represents the eigenvectors corresponding to the non-vanishing eigenvalues of the compact self-adjoint operator \(W_{\varrho _D}:=\mathrm {Id}^*\circ \mathrm {Id}: H(K) \rightarrow H(K)\), the sequence \((\lambda _j)_{j\in \mathbb {N}}\) represents the ordered eigenvalues, and finally \(\eta _j:=\lambda _j^{-1/2} e_j\). Since we assume the separability of H(K) and the \(\sigma \)-finiteness of \(\varrho _D\), we observe equality in (2.6), cf. [11, Thm. 4.27], and thus we easily see that \(\int _D \varrho _m({\mathbf {x}}) \, \varrho _D(\mathrm {d}{\mathbf {x}}) = 1\). Let us define a family of kernels \({\widetilde{K}}_m({\mathbf {x}},{\mathbf {y}})\), indexed by \(m\in \mathbb {N}\), via

$$\begin{aligned} {\widetilde{K}}_m({\mathbf {x}},{\mathbf {y}}) := \frac{K({\mathbf {x}},{\mathbf {y}})}{\sqrt{\varrho _m({\mathbf {x}})\varrho _m({\mathbf {y}})}}, \end{aligned}$$
(5.15)

and a new measure \(\mu _m\) as stated in (5.11) with the corresponding weighted space \(L_2(D,\mu _m)\). Clearly, \({\widetilde{K}}_m(\cdot ,\cdot )\) is a positive type function. As a consequence of

$$\begin{aligned} |K({\mathbf {x}},{\mathbf {y}})| \le \sqrt{K({\mathbf {x}},{\mathbf {x}})}\cdot \sqrt{K({\mathbf {y}},{\mathbf {y}})},\quad {\mathbf {x}},{\mathbf {y}}\in D, \end{aligned}$$

we obtain by an elementary calculation that with a constant \(c_m>0\) it holds

$$\begin{aligned} |K({\mathbf {x}},{\mathbf {y}})| \le c_m \sqrt{\varrho _m({\mathbf {x}})}\cdot \sqrt{\varrho _m({\mathbf {y}})}. \end{aligned}$$
(5.16)

Indeed,

$$\begin{aligned} \begin{aligned} K({\mathbf {x}},{\mathbf {x}})&= \sum \limits _{j=1}^{m-1} |e_j({\mathbf {x}})|^2 + \left( K({\mathbf {x}},{\mathbf {x}}) - \sum \limits _{j=1}^{m-1} |e_j({\mathbf {x}})|^2\right) \\&= \sum \limits _{j=1}^{m-1} \lambda _j|\eta _j({\mathbf {x}})|^2 + \left( \sum \limits _{j=m}^{\infty }\lambda _j\right) \cdot \left( \sum \limits _{j=m}^{\infty }\lambda _j\right) ^{-1}\left( K({\mathbf {x}},{\mathbf {x}}) - \sum \limits _{j=1}^{m-1} |e_j({\mathbf {x}})|^2\right) \\&\le c_m \varrho _m({\mathbf {x}}) \end{aligned} \end{aligned}$$

with

$$\begin{aligned} c_m := 2\max \left\{ \lambda _1(m-1),\sum \limits _{j=m}^\infty \lambda _j\right\} . \end{aligned}$$

Hence, if \(\varrho _m({\mathbf {x}})\) or \(\varrho _m({\mathbf {y}})\) happens to be zero, then we may put \({\widetilde{K}}_m({\mathbf {x}},{\mathbf {y}}) := 0\) in (5.15). In any case, due to (5.16), the kernel \({\widetilde{K}}_m({\mathbf {x}},{\mathbf {y}})\) fits the requirements in Theorem 5.5. In fact, for Theorem 5.5 it is necessary that \({\widetilde{N}}(m)\) and \({\widetilde{T}}(m)\) are well defined and that we have access to the function values \(f({\mathbf {x}}^k)\), \(k=1,\ldots ,n\), in order to set up the matrix \({\widetilde{{\mathbf {L}}}}_m\) and the right-hand side. Let us discuss the quantities \({\widetilde{N}}(m)\) and \({\widetilde{T}}(m)\) appearing in Theorem 5.5 for this new kernel \({\widetilde{K}}_m({\mathbf {x}},{\mathbf {y}})\) first. It is clear that the embedding \(\mathrm {Id}:H({\widetilde{K}}_m) \rightarrow L_2(D,\mu _m)\) shares the same spectral properties as the original embedding. Note that a function g belongs to \(H({\widetilde{K}}_m)\) if and only if \(g(\cdot ) = f(\cdot )/\sqrt{\varrho _m(\cdot )}\), \(f\in H(K)\), where we always put \(0/0:=0\). Clearly, as a consequence of (5.16) and (5.17) below (together with a density argument), we have that \(\varrho _m({\mathbf {x}}) = 0\) implies \(f({\mathbf {x}}) = 0\) for all \(f\in H(K)\). Moreover, whenever \(\Vert f\Vert _{H(K)} \le 1\), the function \(g:=f/\sqrt{\varrho _m}\) satisfies \(\Vert g\Vert _{H({\widetilde{K}}_m)}\le 1\). Indeed, let

$$\begin{aligned} f(\cdot ) = \sum _{i=1}^N \alpha _i K(\cdot ,{\mathbf {x}}^i). \end{aligned}$$
(5.17)

Then, \(\langle f,f\rangle _{H(K)} = \sum _{j=1}^N\sum _{i=1}^N \alpha _i \overline{\alpha _j}K({\mathbf {x}}^j,{\mathbf {x}}^i)\). We have

$$\begin{aligned} \begin{aligned} g&= f(\cdot )/\sqrt{\varrho _m(\cdot )} = \sum _{i=1}^N \alpha _i K\left( \cdot ,{\mathbf {x}}^i\right) /\sqrt{\varrho _m(\cdot )}\\&= \sum _{i=1}^N \alpha _i \sqrt{\varrho _m\left( {\mathbf {x}}^i\right) } \frac{K(\cdot ,{\mathbf {x}}^i)}{\sqrt{\varrho _m(\cdot )}\sqrt{\varrho _m({\mathbf {x}}^i)}}. \end{aligned} \end{aligned}$$

This implies

$$\begin{aligned} \begin{aligned} \langle g,g\rangle _{H({\widetilde{K}}_m)}&= \sum _{j=1}^N\sum _{i=1}^N \alpha _i \overline{\alpha _j}\sqrt{\varrho _m({\mathbf {x}}^i)}\sqrt{\varrho _m({\mathbf {x}}^j)} \frac{K({\mathbf {x}}^j,{\mathbf {x}}^i)}{\sqrt{\varrho _m({\mathbf {x}}^j)}\sqrt{\varrho _m({\mathbf {x}}^i)}}\\&= \langle f,f\rangle _{H(K)}. \end{aligned} \end{aligned}$$

What remains is a standard density argument. The singular numbers of the new embedding remain the same. The singular vectors \({\tilde{e}}_k(\cdot )\) and \({\tilde{\eta }}_k(\cdot )\) are slightly different. They are the original ones divided by \(\sqrt{\varrho _m(\cdot )}\). In fact,

$$\begin{aligned} {\widetilde{N}}(m) := \sup \limits _{{\mathbf {x}}\in D}\sum \limits _{k=1}^{m-1} |\eta _k({\mathbf {x}})|^2/\varrho _m({\mathbf {x}})&\le 2(m-1)\sup \limits _{{\mathbf {x}}\in D}\sum \limits _{k=1}^{m-1} |\eta _k({\mathbf {x}})|^2/\sum \limits _{j=1}^{m-1} |\eta _j({\mathbf {x}})|^2 \\&= 2(m-1). \end{aligned}$$

Furthermore, taking (2.5) into account, we find

$$\begin{aligned} \begin{aligned} {\widetilde{T}}(m)&:= \sup \limits _{{\mathbf {x}}\in D}\sum \limits _{k=m}^{\infty } |e_k({\mathbf {x}})|^2/\varrho _m({\mathbf {x}})\\&\le 2\left( \sum \limits _{j=m}^{\infty } \lambda _j\right) \sup \limits _{{\mathbf {x}}\in D}\sum \limits _{k=m}^\infty |e_k({\mathbf {x}})|^2/\left( K({\mathbf {x}},{\mathbf {x}})-\sum \limits _{j=1}^{m-1}|e_j({\mathbf {x}})|^2\right) \\&\le 2\left( \sum \limits _{j=m}^{\infty } \lambda _j\right) \sup \limits _{{\mathbf {x}}\in D}\sum \limits _{k=m}^\infty |e_k({\mathbf {x}})|^2/\sum \limits _{j=m}^{\infty }|e_j({\mathbf {x}})|^2\\&\le 2\sum \limits _{j=m}^{\infty } \lambda _j = 2\sum \limits _{j=m}^{\infty } \sigma _j^2. \end{aligned} \end{aligned}$$
(5.18)

In order to define the new reconstruction operator \({\widetilde{S}}_{{\mathbf {X}}}^m\) we need to create the matrices \({\widetilde{{\mathbf {L}}}}_m\) using the new function system \({\tilde{\eta }}_k\) and take function evaluations \(f({\mathbf {x}}^1),\ldots ,f({\mathbf {x}}^n)\). In more detail, we solve the least squares problem

$$\begin{aligned} {\widetilde{{\mathbf {L}}}}_m{{\tilde{\mathbf {c}}}}={\varvec{g}}, \; \text {where}\; {\widetilde{{\mathbf {L}}}}_m:=\left( {{\tilde{\eta }}}_j({\mathbf {x}}^k)\right) _{k=1,j=1}^{n,m-1}, \; {\varvec{g}}:=\left( \frac{f({\mathbf {x}}^1)}{\sqrt{\varrho _m({\mathbf {x}}^1)}},\ldots ,\frac{f({\mathbf {x}}^n)}{\sqrt{\varrho _m({\mathbf {x}}^n)}}\right) ^\top , \end{aligned}$$

and the vector \({\tilde{\mathbf {c}}}\) contains the coefficients of the least squares approximation \(S_{{\mathbf {X}}}^m(g)=\sum _{j=1}^{m-1} {\tilde{c}}_j{\tilde{\eta }}_j\) of \(g:=f/\sqrt{\varrho _m}\). This leads to Algorithm 2. Consequently, Theorem 5.5 allows us to estimate the error

$$\begin{aligned} \Vert g-S_{{\mathbf {X}}}^m(g)\Vert _{L_2\left( D,\mu _m\right) }=\Vert f-\sqrt{\varrho _m}\,S_{{\mathbf {X}}}^m(g)\Vert _{L_2\left( D,\varrho _D\right) }= \Vert f-{\widetilde{S}}^m_{\mathbf {X}}f\Vert _{L_2\left( D,\varrho _D\right) }, \end{aligned}$$

where \({\widetilde{S}}^m_{\mathbf {X}}f:=\sqrt{\varrho _m}\,S_{{\mathbf {X}}}^m(g)=\sum _{j=1}^{m-1} {\tilde{c}}_j\eta _j({\mathbf {x}})\).

We stress that \({\widetilde{S}}^{m}_{\mathbf {X}}f\) and the direct computation of \(S^{m}_{\mathbf {X}}f\) using \({\mathbf {L}}_m\, {\mathbf {c}} = {\mathbf {f}}\) may not coincide since both are based on different least squares problems in general.

It remains to note that for fixed n and m as in (5.13) we have for \({\mathbf {X}}=({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n)\) the relation

$$\begin{aligned} \begin{aligned} \sup \limits _{\Vert f\Vert _{ H(K)} \le 1}\Vert f-{\widetilde{S}}^m_{{\mathbf {X}}}(f)\Vert ^2_{L_2\left( D,\varrho _D\right) }&= \sup \limits _{\Vert f\Vert _{ H(K)} \le 1}\Vert f/\sqrt{\varrho _m} - S^m_{{\mathbf {X}}}\left( f/\sqrt{\varrho _m}\right) \Vert ^2_{L_2\left( D,\mu _m\right) }\\&\le \sup \limits _{\Vert g\Vert _{ H\left( {\widetilde{K}}_m\right) } \le 1}\Vert g - S^m_{{\mathbf {X}}}(g)\Vert ^2_{L_2\left( D,\mu _m\right) }. \end{aligned} \end{aligned}$$

Applying a slight modification of Corollary 5.6, i.e., setting

$$\begin{aligned} t:=\frac{1}{\delta }\left( 3\sigma _m^2+8\,C_{\mathrm {R}}^2\frac{\log (8n)}{n}{\widetilde{T}}(m)+4\,C_{\mathrm {R}}\,\sigma _m\,\sqrt{\frac{\log (8n)}{n}{\widetilde{T}}(m)}\right) \end{aligned}$$

in the proof, yields

$$\begin{aligned} \begin{aligned} \mathbb {P}\left( \sup \limits _{\Vert f\Vert _{{H(K)}}\le 1} \Vert f-{\widetilde{S}}^m_{\mathbf {X}}f\Vert ^2_{L_2\left( D,\varrho _D\right) } \le \frac{C}{\delta }\max \left( \sigma _m^2,\frac{\log (8n)}{n}\sum _{j=m}^\infty \sigma _j^2\right) \right) \ge 1-3\delta , \end{aligned} \end{aligned}$$

where we took (5.18) into account and \(C=3+16C_R^2+4\sqrt{2}C_R\) is as stated above. \(\square \)

Remark 5.9

  1. (i)

    In order to prove the theorem in full generality for separable RKHS, we use a technical modification of the density function presented in [34]. Clearly, as a consequence of (2.5), the function \(\varrho _m\) is positive and defined pointwise for any \({\mathbf {x}}\in D\). Moreover, it can be computed precisely from the knowledge of \(K({\mathbf {x}},{\mathbf {x}})\) and the first \(m-1\) eigenvalues and corresponding eigenvectors. The pointwise defined density function is an essential ingredient for drawing the nodes in \({\mathbf {X}}\), on the one hand, and for performing a reweighted least squares fit, on the other hand. Note that also point evaluations of the density function are used in the algorithm. To circumvent the lack of injectivity, we introduce a new reproducing kernel Hilbert space \(H({\widetilde{K}}_m)\) built upon the modified density function. To this situation Corollary 5.6 is applied, and we obtain Theorem 5.8, which also improves the results in [34] by determining explicit constants. In turn, we show that the original algorithm proposed in [34] also works in this more general situation since both densities are equal almost everywhere.

  2. (ii)

    The situation of a non-separable RKHS H(K) is not considered in Theorem 5.8. It has been considered in the subsequent paper [44] by Moeller and T. Ullrich. Here the sampling density has to be modified even further to get a bound

    $$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)} \le 1} \Vert f-{\widetilde{S}}^m_{{\mathbf {X}}}f\Vert ^2_{L_2(D,\varrho _D)} \le c_1 \max \left\{ \frac{1}{n},\frac{1}{m}\sum \limits _{k\ge m/2} \sigma _k^2\right\} \end{aligned}$$

    with probability larger than \(1-c_2n^{1-r}\) and \(m:=\lfloor n/(c_3r\log (n)) \rfloor \). With this method we cannot get beyond the rate \(n^{-1/2}\) in case of non-separable Hilbert spaces H(K).

  3. (iii)

    While this paper was under review, the statement “recovery with high probability” has been refined by Ullrich [66] and, independently, by Moeller and Ullrich [44]. In fact, the failure probability of the above recovery can be controlled by \(n^{-r}\) (and is therefore decaying polynomially in the number of samples) by only paying a multiplicative constant r in the bounds, see also (ii).

  4. (iv)

    As a further follow up, Nagel et al. [45] developed a subsampling technique to select a subset \({\mathbf {J}} \subset {\mathbf {X}}\) of \({\mathcal {O}}(m)\) nodes out of the randomly chosen node set \({\mathbf {X}}\) such that the corresponding least squares recovery operator \({\widetilde{S}}^m_{{\mathbf {J}}}\) gives

    $$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)} \le 1} \Vert f-{\widetilde{S}}^m_{{\mathbf {J}}}f\Vert ^2_{L_2(D,\varrho _D)} \le \frac{C\log m}{m}\sum \limits _{k=cm}^{\infty } \sigma _k^2 \end{aligned}$$

    with precisely determined universal constants \(C,c>0\).

Example 5.10

  1. (i)

    If we choose \(m \asymp n/\log (n)\) according to (5.13) and assume a polynomial decay of the singular values, i.e., \(\sigma _m\lesssim m^{-s}\log ^\alpha m\) with \(s>1/2\) (which is, for instance, the case for Sobolev embeddings into \(L_2\)), we find

    $$\begin{aligned} \sup \limits _{\Vert f\Vert _{{H(K_s)}}\le 1} \Vert f-{\widetilde{S}}_{{\mathbf {X}}}^mf\Vert _{L_2(D)} \lesssim m^{-s}\log ^{\alpha } m \asymp n^{-s}\log ^{\alpha +s} n. \end{aligned}$$

    Accordingly, we observe the best possible main rate \(-s\) with respect to the used number of samples n, i.e., we achieve optimality up to logarithmic factors. For a more specific example, see (ii) below and Sect. 8.

  2. (ii)

    Let us reconsider the setting from Example 5.7. Theorem 5.8 provides a highly improved sampling strategy compared to the one discussed in Example 5.7. Certainly, we change the underlying distribution for the random selection of the sampling nodes and we incorporate weights in the least squares algorithm, cf. Algorithm 2. However, this allows for a crucial improvement of the relation of the maximal polynomial degree m and the number n of sampling nodes to \(m\sim n/\log (n)\), which leads to estimates

    $$\begin{aligned} \sup \limits _{\Vert f\Vert _{{H(K_s)}}\le 1} \Vert f-{\widetilde{S}}^m_{\mathbf {X}}f\Vert _{L_2(D)} \lesssim \max \left\{ \sigma _m,\sqrt{m^{-1}\sum _{k=m}^\infty \sigma _k^2}\right\} \asymp m^{-s}\lesssim \left( \frac{\log (n)}{n}\right) ^{s} \end{aligned}$$

    that are optimal in m and—up to logarithmic factors—even optimal in n. Here \(s>1/2\) is admitted. Compared to the results in Example 5.7, we obtain the same rates of convergence with respect to the polynomial degree m and significantly improved rates of convergence with respect to the number n of used sampling values. Note that instead of the density given in (5.12) we may use the Chebyshev measure given by the density \(\varrho ({\mathbf {x}}) = (\pi \sqrt{1-x^2})^{-1}\) on \(D = [-1,1]\) since the \(L_2\)-normalized Legendre polynomials \({\mathcal {P}}_k(x)\) are dominated by \(C(1-x^2)^{-1/4}\) for all \(k \in \mathbb {N}_0\), see [57]. Hence, in contrast to (5.12), this sampling measure is universal for all m.

In comparison with the results in Example 5.7, we obtain a significantly better rate of convergence with respect to the number n of used sampling values. Applying the recent result mentioned in Remark 5.9(iv) we obtain for the subsampled weighted least squares operator the worst-case error bound

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{{H(K_s)}}\le 1} \Vert f-{\widetilde{S}}_{{\mathbf {J}}}^mf\Vert _{L_2(D)} \lesssim \frac{\sqrt{\log (n)}}{n^s}. \end{aligned}$$

Note that \({\widetilde{S}}_{{\mathbf {J}}}^mf\) uses \(n \in {\mathcal {O}}(m)\) many samples of f. \(\square \)

5.4 The Power of Standard Information and Tractability

The approximation framework studied in this paper has first been considered by Wasilkowski and Woźniakowski in 2001, see [67]. After that, many authors dealt with the question of how well one can approximate functions from an RKHS using only function values. For further historical remarks see [51] and [45, Rem. 1].

In this context, sampling values are called standard information, abbreviated by \(\Lambda ^{\mathrm{std}}\). In contrast to this, one may also allow linear information (abbreviated by \(\Lambda ^{\mathrm{all}}\)), which means that the approximation is computed from finitely many pieces of information about the function obtained from arbitrary linear functionals. Clearly, \(\Lambda ^{\mathrm{std}}\) is a subset of \(\Lambda ^{\mathrm{all}}\) due to (2.1). It is well known that the best possible worst-case error with respect to m pieces of information from \(\Lambda ^{\mathrm{all}}\) can be achieved by approximating functions by means of their corresponding exact Fourier partial sum within \({\text {span}}\{\eta _j:j=1,\ldots ,m\}\). Then, the corresponding worst-case error is determined by \(\sigma _{m+1}\), see [48, Cor. 4.12], which establishes a natural lower bound on the worst-case error with respect to \(\Lambda ^{\mathrm{std}}\). One crucial question in IBC is to determine whether or not standard information \(\Lambda ^{\mathrm{std}}\) is as powerful as linear information \(\Lambda ^{\mathrm{all}}\). In this context, “power” is usually identified with the order of convergence of the considered error with respect to the number of used pieces of information.

From Theorem 5.8, see also Example 5.10(i), we obtain that the polynomial order of convergence for standard information is the same as for linear information if the kernel has a finite trace (i.e., \(\mathrm {Id}:H(K)\rightarrow L_2(D,\varrho _D)\) is a Hilbert–Schmidt operator). This problem has been addressed in [38, 50, Open Problem 1] and [51, Open Problem 126]. In [34], this observation has already been made for the situation that the embedding operator is injective (such that the eigenvectors of the strictly positive eigenvalues form an orthonormal basis in H(K), see Remark 2.1). The contribution of Theorem 5.8 is, on the one hand, to obtain explicitly determined constants. On the other hand, it shows that separability of the RKHS and a finite trace condition are essentially enough for this purpose. Note that the finite trace condition cannot be avoided, see [29].

We will further discuss some consequences for the tractability of this problem. For the necessary notions and definitions from the field of information-based complexity, see [48, 49, 51]. We comment on polynomial tractability with respect to linear information \(\Lambda ^{\mathrm{all}}\) and standard information \(\Lambda ^{\mathrm{std}}\). Let us consider the family of approximation problems

$$\begin{aligned} {{\text {APP}}}_d:H(K_d) \rightarrow L_2(D_d,\varrho _{D_d}),\quad d\in \mathbb {N}, \end{aligned}$$

where \(K_d:D_d\times D_d \rightarrow \mathbb {C}\) is a family of reproducing kernels. In [48, Thm. 5.1] strong polynomial tractability of the family \(\{{\text {APP}}_d\}\) with respect to \(\Lambda ^{\text {all}}\) is characterized as follows: There is a \(\tau >0\) such that

$$\begin{aligned} C:=\sup _d \left( \sum \limits _{j=1}^{\infty } \sigma _{j,d}^{\tau }\right) ^{1/\tau } < \infty , \end{aligned}$$
(5.19)

where \(\sigma _{j,d}\), \(j=1,\ldots \), are the singular values belonging to \({{\text {APP}}}_d\) for fixed d. From our analysis in Theorem 5.8, we directly obtain a sufficient condition for polynomial tractability with respect to \(\Lambda ^{\text {std}}\). Without going into detail, we denote by \(n^{\mathrm{wor}}(\varepsilon ,d;\Lambda )\), \(\Lambda \in \{\Lambda ^{\mathrm{std}},\Lambda ^{\mathrm{all}}\}\), the minimal number n of pieces of information from the specified class \(\Lambda \) that any algorithm requires in order to achieve a worst-case error \(\varepsilon \) for the d-dimensional approximation problem. Certainly, \(n^{\mathrm{wor}}(\varepsilon ,d;\Lambda )\) usually depends on \(\varepsilon \), the dimension d and \(\Lambda \). At this point, we would like to mention that polynomial tractability means that \(n^{\mathrm{wor}}(\varepsilon ,d;\Lambda )\le C\,\varepsilon ^{-p}\,d^{q}\) with constants \(C,p,q\ge 0\), i.e., \(n^{\mathrm{wor}}(\varepsilon ,d;\Lambda )\) can be bounded by a term that is polynomial both in d and in \(\varepsilon ^{-1}\). Furthermore, the problem \({\text {APP}}_d\) is called strongly polynomially tractable in case the estimate on \(n^{\mathrm{wor}}\) holds with \(q=0\).

Theorem 5.11

The family \(\{{\text {APP}}_d\}\) is strongly polynomially tractable with respect to \(\Lambda ^{\mathrm{std}}\)

  1. (i)

    if there exists \(\tau \le 2\) such that (5.19) holds true or

  2. (ii)

    if \(\{{\text {APP}}_d\}\) is strongly polynomially tractable with respect to \(\Lambda ^{\mathrm{all}}\) with exponent \(0<p_{\mathrm{all}}<2\), i.e., \(n^{\mathrm{wor}}(\varepsilon ,d;\Lambda ^{\mathrm{all}}) \le C_{\mathrm{all}}\,\varepsilon ^{-p_{\mathrm{all}}}\).

More precisely, there are constants \(C_{\mathrm{std}}, \delta >0\) only depending on \((C,\tau )\) or \((C_{\mathrm{all}}, p_{\mathrm{all}})\) such that

$$\begin{aligned} n^{\text {wor}}\left( \varepsilon ,d;\Lambda ^{\mathrm{std}}\right) \le C_{\mathrm{std}}\,\varepsilon ^{-p_{\mathrm{std}}}\log (1/\varepsilon )^{\delta } \end{aligned}$$

with \(p_{\mathrm{std}} = \tau \) in case (i) and \(p_{\mathrm{std}} = p_{\mathrm{all}}\) in case (ii).

Proof

Note that (ii) implies (i) with \(\tau = 2\). Furthermore, Theorem 5.8 implies strong polynomial tractability if (i) is assumed. In case (i) we may use Stechkin’s lemma [18, Lem. 7.8] which gives that \(\sum _{j=m}^{\infty } \sigma _{j,d}^2 \le C^2m^{-2/\tau +1}\) for all d. This gives the exponent \(p_{\text {std}} = \tau \) and an additional \(\log \) due to (5.13). If (ii) is assumed then \(\sum _{j=m}^{\infty } \sigma _{j,d}^2 \le C'm^{-2/p_{\mathrm{all}} +1}\) for all d. \(\square \)
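
For completeness, we record an elementary (and, in the constant, slightly weaker) version of the cited Stechkin-type step: for a non-increasing sequence \((\sigma _{j,d})_{j\in \mathbb {N}}\) and \(0<\tau <2\) one has

$$\begin{aligned} m\,\sigma _{m,d}^{\tau } \le \sum \limits _{j=1}^{m} \sigma _{j,d}^{\tau } \le C^{\tau } \quad \Longrightarrow \quad \sigma _{j,d}\le C\,j^{-1/\tau },\qquad \sum \limits _{j=m}^{\infty } \sigma _{j,d}^{2} \le C^{2}\sum \limits _{j=m}^{\infty } j^{-2/\tau } \le \frac{2\,C^{2}}{2-\tau }\, m^{-2/\tau +1}, \end{aligned}$$

which is all that is needed for the exponent \(p_{\mathrm{std}}=\tau \) up to the logarithmic factor coming from (5.13).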

Theorem 5.11 is stronger than [51, Thm. 26.20] in two aspects. As pointed out in the proof, assumption (i) is weaker than (ii), which is essentially the one in [51, Thm. 26.20]. Furthermore, our statement is stronger since \(p_{\mathrm{std}}\) equals \(p_{\mathrm{all}}\). The authors in [51, 26.6.1] showed that \(p_{\mathrm{std}} = p_{\mathrm{all}}+\frac{1}{2} [p_{\mathrm{all}}]^2\) and proposed that “the lack of the exact exponent represents another major challenge” and formulated Open Problem 127. Our considerations prove that the dependence on \(\varepsilon \) of the tractability estimates on \(n^{\mathrm{wor}}(\varepsilon ,d;\Lambda ^{\mathrm{all}})\) and \(n^{\mathrm{wor}}(\varepsilon ,d;\Lambda ^{\mathrm{std}})\) coincides up to logarithmic factors. Similar assertions hold true when strong polynomial tractability is replaced by polynomial tractability. The modifications are straightforward.

Example 5.12

Let us consider an example from [37, 55], namely, if \(s_1 \le s_2 \le \cdots \le s_d\), then

$$\begin{aligned} H\left( K_d\right) :=H_{\text {mix}}^{\vec {s}}\left( \mathbb {T}^d\right) = H^{{s_1}}(\mathbb {T}) \otimes \cdots \otimes H^{s_d}(\mathbb {T}) \end{aligned}$$

and \({\text {APP}}_d:H_{\text {mix}}^{\vec {s}}(\mathbb {T}^d)\rightarrow L_2(\mathbb {T}^d)\). Here \(\mathbb {T}= [0,1]\), where opposite points are identified, and \(H^s(\mathbb {T})\) is normed by

$$\begin{aligned} \Vert f\Vert ^2_{H^s} := \sum \limits _{k \in \mathbb {Z}} |{\hat{f}}_k|^2 \, w_{s}(k)^2 \end{aligned}$$

with \(w_s(k) = \max \{1,(2\pi )^s|k|^s\}\), see also Sect. 8. The smoothness vector \(\vec {s}\) is supposed to grow in j, e.g., for \(\beta >0\) we have

$$\begin{aligned} s_j \ge \beta \log _2(j+1),\quad j\in \mathbb {N}. \end{aligned}$$
(5.20)

It has been shown in [55], see also [37], that the growth condition (5.20) is necessary and sufficient for polynomial tractability with respect to \(\Lambda ^{\text {all}}\). Taking Theorem 5.11 into account, we easily check that (5.19) is satisfied with \(\tau \le 2\) whenever \(\beta \cdot \tau > 1\), which means that \(\beta > 1/2\) is sufficient for strong polynomial tractability. In this case we obtain for any \(1/\beta < \tau \le 2\) that

$$\begin{aligned} n^{\text {wor}}\left( \varepsilon ,d;\Lambda ^{{\text {std}}}\right) \le C_{\tau }\,\varepsilon ^{-\tau }. \end{aligned}$$

6 Recovery of Individual Functions

In this section, we are interested in the reconstruction of an individual function f (taken from the unit ball of H(K)) from samples at random nodes. In the IBC community, see [48, 49, 51], such a scenario is called “randomized setting”, which refers to the occurrence of a random element in the algorithm (Monte Carlo). The following investigations improve on the results in [68] from different points of view. On the one hand, it is not necessary to choose sampling nodes according to a whole family of probability density functions, see Remark 6.3. On the other hand, assuming a bounded kernel turns out to be sufficient to obtain a rate of convergence matching that of \((\sigma _k)_{k\in \mathbb {N}}\) when randomly choosing sampling nodes according to the given probability measure \(\varrho _D\), see Corollary 6.2.

However, the subsequent analysis is related to the one in [13, 14]. With similar techniques as above we will get an estimate for the conditional mean of the individual error \(\Vert f-S^m_{\mathbf {X}}f\Vert _{L_2(D,\varrho _D)}\).

Theorem 6.1

Let \(\varrho _D\) be a probability measure on D and let H(K) denote a reproducing kernel Hilbert space which is compactly embedded into the space \(L_2(D,\varrho _D)\). Let \(m,n\in \mathbb {N}\), \(m\ge 2\), be chosen such that (5.1) holds for some \(0<\delta <1\). Let further f be a fixed function such that \(\Vert f\Vert _{ H(K)} \le 1\). Drawing each sampling node in \({\mathbf {X}}=\{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\} \subset D\) i.i.d. at random according to \(\varrho _D\), we have for the conditional expectation of the individual error

$$\begin{aligned} \mathbb {E}\left( \Vert f-S^m_{\mathbf {X}}f\Vert ^2_{L_2\left( D,\varrho _D\right) }\Big |\Vert {\mathbf {H}}_m-{\mathbf {I}}_m\Vert \le 1/2\right) \le \frac{1}{1-\delta }\left( \sigma _m^2 + \frac{C_{\delta }}{\log n}\sigma _m^2\right) \le \frac{1.1}{1-\delta }\,\sigma _m^2, \end{aligned}$$

where \(0<C_{\delta }\le 0.06\) depends on \(\delta \).

Proof

This time we follow the proof of [13, Thm. 2] and obtain

$$\begin{aligned} \begin{aligned} \Vert f-S^m_{\mathbf {X}}f\Vert ^2_{L_2\left( D,\varrho _D\right) }&= \Vert f-P_{m-1}f\Vert ^2_{L_2\left( D,\varrho _D\right) }+\Vert P_{m-1}f-S^m_{\mathbf {X}}f\Vert ^2_{L_2\left( D,\varrho _D\right) }\\&\le \sigma ^2_m + \Vert S^m_{\mathbf {X}}\left( P_{m-1}f-f\right) \Vert ^2_{L_2\left( D,\varrho _D\right) }\\&\le \sigma ^2_m + \Vert \left( {\mathbf {L}}_m^{*}\,{\mathbf {L}}_m\right) ^{-1}\Vert ^2 \, \Vert {\mathbf {L}}_m^{*}\, \left( g({\mathbf {x}}^1),\ldots ,g({\mathbf {x}}^n)\right) ^\top \Vert ^2_2\\&\le \sigma ^2_m + \frac{4}{n^2}\sum \limits _{k=1}^{m-1} \left| \sum \limits _{j=1}^n \overline{{\eta _k}({\mathbf {x}}^j)}g({\mathbf {x}}^j)\right| ^2\\&= \sigma ^2_m + \frac{4}{n^2}\sum \limits _{k=1}^{m-1}\sum \limits _{j=1}^n\sum \limits _{i=1}^n g({\mathbf {x}}^i)\overline{g({\mathbf {x}}^j)}\overline{\eta _k({\mathbf {x}}^i)}\eta _k({\mathbf {x}}^j), \end{aligned} \end{aligned}$$

where \(g = f - P_{m-1}f\). Averaging on both sides yields

$$\begin{aligned}&\int _{\Vert {\mathbf {H}}_m-{\mathbf {I}}_m\Vert \le 1/2}\Vert f-S^m_{\mathbf {X}}f\Vert ^2_{L_2(D,\varrho _D)} \, \varrho ^n_D(\mathrm {d}{\mathbf {x}})\\&\quad \le \sigma ^2_m + \frac{4}{n^2}\sum \limits _{k=1}^{m-1}\sum \limits _{j=1}^n\sum \limits _{i=1}^n \mathbb {E}\left( g({\mathbf {x}}^i)\overline{g({\mathbf {x}}^j)}\overline{\eta _k({\mathbf {x}}^i)}\eta _k({\mathbf {x}}^j)\right) \\&\quad = \sigma _m^2+\frac{4}{n} \sum \limits _{k=1}^{m-1}\int _D |g({\mathbf {x}})|^2|\eta _k({\mathbf {x}})|^2\,\varrho _D(\mathrm {d}{\mathbf {x}}) \\&\qquad + \frac{4n(n-1)}{n^2}\sum \limits _{k=1}^{m-1} \left| \int _D g({\mathbf {x}})\overline{\eta _k({\mathbf {x}})} \, \varrho _D(\mathrm {d}{\mathbf {x}})\right| ^2. \end{aligned}$$

Note that the last summand on the right-hand side vanishes since g is orthogonal to \(\eta _k\), \(k = 1,\ldots ,m-1\). This gives

$$\begin{aligned} \begin{aligned}&\int _{\Vert {\mathbf {H}}_m-{\mathbf {I}}_m\Vert \le 1/2}\Vert f-S^m_{\mathbf {X}}f\Vert ^2_{L_2(D,\varrho _D)} \, \varrho ^n_D(\mathrm {d}{\mathbf {x}}) \\&\quad \le \sigma _m^2 + \frac{4 N(m)}{n}\Vert f-P_{m-1}f\Vert ^2_{L_2(D,\varrho _D)}\\&\quad \le \sigma _m^2 + \frac{C_{\delta }}{\log (2n)}\sigma _m^2 \end{aligned} \end{aligned}$$

by taking \(4 N(m)/n \le C_{\delta }/\log (2n)\) into account, see (5.1), which holds with a constant \(C_\delta \le \frac{4}{48\sqrt{2}}\le 0.06\). \(\square \)

Analogous to the proof of Corollary 5.6, one shows the following result.

Corollary 6.2

Under the same assumptions as in Theorem 6.1, it holds for fixed \(\delta >0\)

$$\begin{aligned} \mathbb {P}\left( \Vert f-S^m_{\mathbf {X}}f\Vert ^2_{L_2(D,\varrho _D)} \le \frac{1}{\delta }\, \sigma _m^2\,\left( 1+\frac{0.06}{\log n}\right) \,\right) \ge 1-3\delta . \end{aligned}$$

Remark 6.3

  1. (i)

    Following [16] we may relax the condition (5.1) to

    $$\begin{aligned} m := \left\lfloor \frac{n}{48\left( \sqrt{2}\log (2n)-\log {\delta }\right) }\right\rfloor +1, \end{aligned}$$

    when sampling with respect to the new measure (importance sampling)

    $$\begin{aligned} \mu _m(A) := \int _A \varrho _m({\mathbf {x}}) \, \varrho _D(\mathrm {d}{\mathbf {x}}), \end{aligned}$$

    where \(\varrho _m\) is given by

    $$\begin{aligned} \varrho _m({\mathbf {x}}) := \frac{1}{m-1}\sum \limits _{k=1}^{m-1}|\eta _k({\mathbf {x}})|^2. \end{aligned}$$

    Then, the corresponding approximation of the function f is based on the solution of a weighted least squares problem similar to the one discussed in Sect. 5.3. Note that we do not have to assume the square summability of the singular numbers of the embedding \(\mathrm {Id}:H(K)\rightarrow L_2(D,\varrho _D)\) here.

  2. (ii)

    The result of Theorem 6.1 advanced with (i) directly leads to estimates on the power of standard information, see Sect. 5.4, in the so-called randomized setting, cf. [51]. Roughly speaking, n sampling values (standard information) in the randomized setting are at least as powerful as \( c n/\log {n}\) pieces of linear information, i.e., the supremum over all \(f\in H(K)\) of the expected approximation error that is caused by the weighted least squares regression is almost as good as the best possible worst-case approximation error caused by the projection. The recent work [40] treats that topic in full detail. Moreover, in [39] the authors present improvements on the power of standard information in the so-called average case setting. Both papers investigate the weighted least squares regression given in Algorithm 2 using the techniques that yielded the results of this section.

  3. (iii)

    Together with the Weaver subsampling technique in [45], see also [47], we obtain

    $$\begin{aligned}&e^{\mathrm{ran}}\left( n,d;\Lambda ^{\mathrm{std}}\right) \\&\quad \le C\min \left\{ \sqrt{\log n}\cdot e^{\mathrm{det}}\left( c_1n,d;\Lambda ^{\mathrm{all}}\right) , e^{\mathrm{det}}\left( c_2n/\log n,d;\Lambda ^{\mathrm{all}}\right) \right\} , \end{aligned}$$

    with three universal constants \(C,c_1,c_2>0\), where we use the notation from [48, 49, 51]. This is a further significant improvement of the results in [40] in the situation above if the sequence \((\sigma _k)_{k\in \mathbb {N}}\) decays “faster” than \(k^{-1/2}\).

While this paper was under review, it was shown by Cohen and Dolbeault [15] that even \(e^{\mathrm{ran}}(n,d;\Lambda ^{\mathrm{std}}) \le Ce^{\mathrm{det}}(cn,d;\Lambda ^{\mathrm{all}})\) holds true. Note that this result is also based on a weighted least squares method and [47].

7 Optimal Weights for Numerical Integration

We consider the problem of approximating the integral with respect to \(\mu _D\) of a function f by the cubature rule \({\text {Q}}_{\mathbf {X}}^m\)

$$\begin{aligned} {\text {Int}}_{\mu _D} f:=\int _D f \, \mathrm {d}\mu _D \approx {\text {Q}}_{\mathbf {X}}^m f:=\sum _{j=1}^n q_j f\left( {\mathbf {x}}^j\right) = {\varvec{q}}^\top {\varvec{f}}, \end{aligned}$$

where the weights \(q_j\) are determined by \({\mathbf {X}}= \{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\}\). Indeed, assuming \({\mathbf {L}}_m\) of full column rank and following Oettershagen [52], we set

$$\begin{aligned} {\varvec{q}}&:= \overline{{\mathbf {L}}_m}\, \overline{({\mathbf {L}}_m^{*}\,{\mathbf {L}}_m)^{-1}}\, {\varvec{b}}\;\in \mathbb {C}^n,\nonumber \\ \text {where } \; {\varvec{b}}&:=\big (b_k\big )_{k=1}^{m-1} \in \mathbb {C}^{m-1} \quad \text {with}\quad b_k:=\int _D \eta _k\,\mathrm {d}\mu _D. \end{aligned}$$
(7.1)

In fact, the cubature rule \({\text {Q}}_{\mathbf {X}}^m\) is the implicit integration of the least squares solution \(S^m_{\mathbf {X}}f:=\sum _{k=1}^{m-1} c_k \eta _k\), cf. (3.2), \({\varvec{c}}:= ({\mathbf {L}}_m^{*}\,{\mathbf {L}}_m)^{-1}\,{\mathbf {L}}_m^*\, {\varvec{f}}\), since

$$\begin{aligned} {\text {Q}}_{\mathbf {X}}^m f&=\sum _{j=1}^n q_j f\left( {\mathbf {x}}^j\right) = {\varvec{b}}^\top \left( {\mathbf {L}}_m^{*}\,{\mathbf {L}}_m\right) ^{-1} \, {\mathbf {L}}_m^*\, {\varvec{f}} = \sum _{k=1}^{m-1} c_k\int _D \eta _k\,\mathrm {d}\mu _D=\int _D S^m_{\mathbf {X}}f\mathrm {d}\mu _D. \end{aligned}$$

Using this, we give upper bounds on the integration error caused by \({\text {Q}}_{\mathbf {X}}^m\).
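
In code, the weights (7.1) and the identity just derived can be realized as follows. This is a sketch of our own; the toy example at the end, with the Fourier basis on \(\mathbb {T}\) and b having a single nonzero entry for the constant basis function, is purely illustrative and all names are ours.

```python
import numpy as np

def cubature_weights(L, b):
    """Weights from (7.1): q = conj(L_m) conj((L_m^* L_m)^{-1}) b, assuming L_m has full
    column rank; L has entries eta_j(x^k) and b_k = int_D eta_k d(mu_D)."""
    G = L.conj().T @ L                                 # Gram matrix L_m^* L_m
    return np.conj(L @ np.linalg.solve(G, np.conj(b)))

# Toy illustration on T = [0,1): eta_k(x) = exp(2 pi i k x), k = 0,...,m-2, mu_D the
# Lebesgue measure, hence b = (1, 0, ..., 0); the concrete choices here are our own.
rng = np.random.default_rng(1)
n, m = 200, 10
x = rng.random(n)
L = np.exp(2j * np.pi * np.outer(x, np.arange(m - 1)))
b = np.zeros(m - 1); b[0] = 1.0
q = cubature_weights(L, b)

f = lambda t: 2.0 + np.exp(2j * np.pi * 3 * t)         # lies in the span of the basis; integral = 2
print(np.real(q @ f(x)))                               # the rule integrates f exactly (up to rounding)

# consistency with the identity above: Q_X^m f = b^T (L^* L)^{-1} L^* f
c = np.linalg.solve(L.conj().T @ L, L.conj().T @ f(x))
print(np.allclose(q @ f(x), b @ c))                    # True
```

Since the test function lies in the span of the chosen basis, the least squares fit reproduces it exactly and the rule returns the exact integral; for general functions the integration error is controlled by Theorem 7.1 below.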

Theorem 7.1

Let \(\mu _D\) be a measure on D such that \(L_1(D,\varrho _D) \hookrightarrow L_1(D,\mu _D)\). Denote with

$$\begin{aligned} C_{\varrho ,\mu } := \Vert {\text {Id}}:L_1\left( D,\varrho _D\right) \rightarrow L_1(D,\mu _D)\Vert < \infty \end{aligned}$$

the norm of the embedding. Under the same assumptions as in Theorem 5.5, it holds for fixed \(\delta >0\)

$$\begin{aligned} \mathbb {P}\left( \sup \limits _{\Vert f\Vert _{{H(K)}}\le 1} \left| \int _D f \, \mathrm {d}\mu _D-{\text {Q}}_{\mathbf {X}}^m f\right| ^2 \le \frac{29\, C_{\varrho ,\mu }^2}{\delta }\,\max \left\{ \sigma ^2_m,\frac{\log (8n)}{n}T(m)\right\} \right) \ge 1-3\delta . \end{aligned}$$

Proof

Using the embedding relation of the different \(L_1\)-spaces we have

$$\begin{aligned} \left| \int _D f \, \mathrm {d}\mu _D-{\text {Q}}_{\mathbf {X}}^m f\right| =\left| \int _D f-S^m_{\mathbf {X}}f \, \mathrm {d}\mu _D\right|&\le \int _D \Big |f-S^m_{\mathbf {X}}f \Big | \, \mathrm {d}\mu _D\nonumber \\&\le C_{\varrho ,\mu }\int _D \Big |f-S^m_{\mathbf {X}}f \Big | \, \mathrm {d}\varrho _D\nonumber \\&\le C_{\varrho ,\mu }\,\Vert f-S^m_{\mathbf {X}}f\Vert _{L_2(D,\varrho _D)}. \end{aligned}$$
(7.2)

We conclude the proof using Corollary 5.6. \(\square \)

In Theorem 7.1, we assume that the kernel is bounded, i.e., \(\sup _{{\mathbf {x}}\in D} K({\mathbf {x}},{\mathbf {x}})<\infty \). However, as we will see below it is enough to assume \(\int _{D}K({\mathbf {x}},{\mathbf {x}}) \, \varrho _D(\mathrm {d}{\mathbf {x}})<\infty \) and that \(\varrho _D\) is a finite measure. We define the following modified (reweighted) cubature formula

$$\begin{aligned} \begin{aligned} {\widetilde{Q}}^m_{{\mathbf {X}}} f&:= \int _D {\widetilde{S}}^m_{{\mathbf {X}}}f({\mathbf {x}})\,\varrho _D(\mathrm {d}{\mathbf {x}}) = \widetilde{{\varvec{q}}}^\top {\varvec{f}}\\&=\left( {\mathbf {D}}_{\varrho _m}\overline{{\widetilde{{\mathbf {L}}}}_m}\,\overline{\left( {\widetilde{{\mathbf {L}}}}_m^{*}{\widetilde{{\mathbf {L}}}}_m\right) ^{-1}}\, {\varvec{b}}\right) ^\top {\varvec{f}} = {\varvec{b}}^\top \left( {\widetilde{{\mathbf {L}}}}_m^{*}{\widetilde{{\mathbf {L}}}}_m\right) ^{-1} {\widetilde{{\mathbf {L}}}}_m^*{\mathbf {D}}_{\varrho _m}{\varvec{f}}, \end{aligned} \end{aligned}$$

where \({\varvec{f}}\) is the vector of function samples \(\big (f({\mathbf {x}}^1),\ldots ,f({\mathbf {x}}^n)\big )\), the vector \({\varvec{b}}\) is given as in (7.1), \({\widetilde{{\mathbf {L}}}}_m\) as in Algorithm 2, and

$$\begin{aligned} {\mathbf {D}}_{\varrho _m} = \mathrm {diag}\left( 1/\sqrt{\varrho _m({\mathbf {x}}^1)},\ldots ,1/\sqrt{\varrho _m({\mathbf {x}}^n)}\right) . \end{aligned}$$

Here, \(\varrho _m(\cdot )\) is given by (5.12) above and depends on the spectral properties of the kernel. Note that we simply put the respective entry of \({\mathbf {D}}_{\varrho _m}\) to 0 if \(\varrho _m({\mathbf {x}}^j)\) happens to be zero.

Theorem 7.2

If \(\varrho _D\) denotes a finite measure and \(\int _D K({\mathbf {x}},{\mathbf {x}}) \, \varrho _D(\mathrm {d}{\mathbf {x}})<\infty \), then

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)}\le 1}\left| \int _D f({\mathbf {x}}) \, \varrho _D(\mathrm {d}{\mathbf {x}}) - {\widetilde{Q}}^m_{{\mathbf {X}}}(f)\right| ^2 \le \frac{50}{\delta }\varrho _D(D)\max \left\{ \sigma ^2_m ,\frac{\log (8n)}{n}\sum \limits _{j=m}^{\infty }\sigma _j^2\right\} \end{aligned}$$
(7.3)

holds with probability larger than \(1-3\delta \) if the n nodes in \({\mathbf {X}}\) are drawn i.i.d. according to (5.11) above.

Proof

Using the bound on the \(L_2(D,\varrho _D)\) error from Theorem 5.8 together with (7.2) above, we obtain (7.3) with high probability if the nodes \({\mathbf {X}}\) are sampled according to the measure (5.11). \(\square \)

Remark 7.3

The previous result essentially improves on the bound in [2, Thm. 1] if the singular values decay polynomially. Note that we assume that \(\varrho _D\) is a finite measure and \(\int _D K({\mathbf {x}},{\mathbf {x}}) \, \varrho _D(\mathrm {d}{\mathbf {x}})<\infty \). In [2, Thm. 1], a logarithmic oversampling is not required; however, the error bounds are worse by a factor n, which is substantial, e.g., in the case of the Sobolev kernel, see Sect. 8.2.

8 Hyperbolic Cross-Fourier Regression

In the sequel, we are interested in the recovery of functions from periodic Sobolev spaces. That is, we consider functions on the torus \(\mathbb {T}^d \simeq [0,1)^d\) where opposite points are identified. Note that the unit cube \([0,1]^d\) is preferred here since it has Lebesgue measure 1 and is therefore a probability space. We could have also worked with \([0,2\pi ]^d\) and the Lebesgue measure (which can be made a probability measure by a d-dependent rescaling). The general error bounds for the recovery error given below (in terms of \((\sigma _j)_{j\in \mathbb {N}}\) like in Theorem 8.2) are not affected by this rescaling since the sequence \((\sigma _j)_{j\in \mathbb {N}}\) then also changes. However, some of the preasymptotic estimates for the \((\sigma _j)_{j\in \mathbb {N}}\) are sensitive to the choice of the domain, as the results in Krieg [33] point out.

For \(\alpha \in \mathbb {N}\) we define the space \(H^{\alpha }_{\text {mix}}(\mathbb {T}^d)\) as the Hilbert space with the inner product

$$\begin{aligned} \langle f , g\rangle _{H^{\alpha ,*}_{\text {mix}}} := \sum \limits _{{\mathbf {j}}\in \{0,\alpha \}^d}\langle D^{({\mathbf {j}})}f,D^{({\mathbf {j}})}g \rangle _{L_2(\mathbb {T}^d)}. \end{aligned}$$
(8.1)

Defining the weight

$$\begin{aligned} w_{\alpha ,*}(k) = (1+(2\pi |k|)^{2\alpha })^{1/2},\quad k\in \mathbb {Z}, \end{aligned}$$
(8.2)

and the univariate kernel function

$$\begin{aligned} K^1_{\alpha ,*}(x,y):=\sum \limits _{k\in \mathbb {Z}} \frac{\exp (2\pi \mathrm {i}k(y-x) )}{w_{\alpha ,*}(k)^2},\quad x,y\in \mathbb {T}, \end{aligned}$$

directly leads to

$$\begin{aligned} K^d_{\alpha ,*}({\mathbf {x}},{\mathbf {y}}):= K^1_{\alpha ,*}(x_1,y_1)\otimes \cdots \otimes K^1_{\alpha ,*}(x_d,y_d),\quad {\mathbf {x}},{\mathbf {y}}\in \mathbb {T}^d, \end{aligned}$$
(8.3)

which is a reproducing kernel for \(H^\alpha _{\text {mix}}(\mathbb {T}^d)\). The Fourier series representation of \(K^d_{\alpha ,*}({\mathbf {x}},{\mathbf {y}})\) is specified by

$$\begin{aligned} K^d_{\alpha ,*}({\mathbf {x}},{\mathbf {y}})&:=\sum \limits _{{\mathbf {k}}\in \mathbb {Z}^d} \frac{\exp (2\pi \mathrm {i}\,{\mathbf {k}}\cdot ({\mathbf {y}}-{\mathbf {x}}) )}{w_{\alpha ,*}(k_1)^2\cdots w_{\alpha ,*}(k_d)^2} \\&= \sum \limits _{{\mathbf {k}}\in \mathbb {Z}^d} \frac{\exp (2\pi \mathrm {i}\,{\mathbf {k}}\cdot ({\mathbf {y}}-{\mathbf {x}}))}{\prod _{j=1}^d \left( 1+(2\pi |k_j|)^{2\alpha }\right) } ,\quad {\mathbf {x}},{\mathbf {y}}\in \mathbb {T}^d. \end{aligned}$$

In particular, for any \(f\in H^\alpha _{\text {mix}}(\mathbb {T}^d)\) we have

$$\begin{aligned} f({\mathbf {x}}) = \langle f, K^d_{\alpha ,*}({\mathbf {x}},\cdot ) \rangle _{H^{\alpha ,*}_{\text {mix}}}. \end{aligned}$$

The kernel defined in (8.3) associated with the inner product (8.1) can be extended to the case of fractional smoothness \(s>0\) replacing \(\alpha \) by s in (8.2)–(8.3) which in turn leads to the inner product

$$\begin{aligned} \langle f , g\rangle _{H^{s,*}_{\text {mix}}} := \sum \limits _{{\mathbf {k}}\in \mathbb {Z}^d} {\hat{f}}_{{\mathbf {k}}} \, \overline{{\hat{g}}_{{\mathbf {k}}}} \, \prod _{j=1}^d w_{s,*}(k_j)^2 \end{aligned}$$

and the corresponding norm

$$\begin{aligned} \Vert f\Vert _{*}:=\Vert f\Vert _{H^{s,*}_{\text {mix}}} := \left( \sum \limits _{{\mathbf {k}}\in \mathbb {Z}^d} |{\hat{f}}_{{\mathbf {k}}}|^2 \, \prod _{j=1}^d w_{s,*}(k_j)^2\right) ^{1/2}. \end{aligned}$$

The (ordered) sequence \((\lambda _j^*)_{j=1}^{\infty }\) of eigenvalues of the corresponding mapping \(W = \mathrm {Id}^{*}\circ \mathrm {Id}\), where \(\mathrm {Id}:H\big (K^d_{s,*}\big ) \rightarrow L_2(\mathbb {T}^d)\), is the non-increasing rearrangement of the numbers

$$\begin{aligned} \left\{ \lambda _{{\mathbf {k}}}^*:=\prod \limits _{j=1}^d w_{s,*}(k_j)^{-2}=\prod \limits _{j=1}^d (1+(2\pi |k_j|)^{2s})^{-1} :{\mathbf {k}}\in \mathbb {Z}^d\right\} . \end{aligned}$$

The corresponding orthonormal system \((e_j^*)_{j=1}^{\infty }\) in \(H\big (K^d_{s,*}\big )\) is given by

$$\begin{aligned} \left\{ e_{{\mathbf {k}}}({\mathbf {x}}):=\exp (2\pi \mathrm {i}\,{\mathbf {k}}\cdot {\mathbf {x}})\prod \limits _{j=1}^d (1+(2\pi |k_j|)^{2s})^{-1/2} :{\mathbf {k}}\in \mathbb {Z}^d\right\} . \end{aligned}$$

Consequently, the orthonormal system \((\eta _j^*)_{j=1}^{\infty }\) in \(L_2(\mathbb {T}^d)\) is the properly ordered classical Fourier system \(\eta _{{\mathbf {k}}}({\mathbf {x}}) = \exp (2\pi \mathrm {i}\,{\mathbf {k}}\cdot {\mathbf {x}})\). This directly implies the following behavior of the quantity N(m) defined in (2.7). It holds

$$\begin{aligned} N(m) = m-1\quad \text {and}\quad T_*(m) = \sum \limits _{j = m}^\infty \lambda _j^*= \sum \limits _{j = m} ^\infty (\sigma ^*_j)^2. \end{aligned}$$
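
For moderate d, the non-increasing rearrangement of the tensor-product eigenvalues, and thereby \(T_*(m)\), can be computed by enumerating frequencies in a box. The following sketch is our own; the box truncation is an additional approximation, and a moderate box size already yields the leading entries exactly.

```python
import itertools
import numpy as np

def ordered_eigenvalues_star(s, d, k_bound):
    """Non-increasing rearrangement of lambda_k^* = prod_j (1 + (2 pi |k_j|)^(2s))^{-1}
    over the truncated frequency box |k_j| <= k_bound (truncation is an approximation)."""
    w_inv = np.array([1.0 / (1.0 + (2.0 * np.pi * abs(k)) ** (2 * s))
                      for k in range(-k_bound, k_bound + 1)])
    lam = [np.prod(w_inv[list(idx)])
           for idx in itertools.product(range(len(w_inv)), repeat=d)]
    return np.sort(np.array(lam))[::-1]                 # (sigma_j^*)^2, j = 1, 2, ...

sig_sq = ordered_eigenvalues_star(s=2, d=3, k_bound=8)
m = 50
print(sig_sq[:5])                                       # largest eigenvalues
print(sig_sq[m - 1:].sum())                             # T_*(m) = sum_{j >= m} (sigma_j^*)^2 (truncated)
```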

Remark 8.1

It is possible to define a smoothness vector \({\mathbf {s}} = (s_1,\ldots ,s_d)\) to emphasize different smoothness in different coordinate directions. Such kernels will be denoted with \(K_{{\mathbf {s}}}({\mathbf {x}},{\mathbf {y}})\). In [37], the authors establish preasymptotic error bounds which can be used for the least squares analysis as we will see below.

Recent estimates in [35] allow for determining uniform recovery guarantees with preasymptotic error bounds. For this study, we need to change the kernel weight to a less natural, but for preasymptotic considerations more convenient, weight

$$\begin{aligned} w_{s,\#}(k) = (1+2\pi |k|)^{s} ,\quad k\in \mathbb {Z}, \end{aligned}$$
(8.4)

\(s>0\). As a consequence, the univariate kernel

$$\begin{aligned} K^1_{s,\#}(x,y):=\sum \limits _{k\in \mathbb {Z}} \frac{\exp (2\pi \mathrm {i}k(y-x) )}{w_{s,\#}(k)^2},\quad x,y\in \mathbb {T}, \end{aligned}$$

as well as the tensor product kernel

$$\begin{aligned} K^d_{s,\#}({\mathbf {x}},{\mathbf {y}}) := \sum \limits _{{\mathbf {k}}\in \mathbb {Z}^d} \frac{\exp (2\pi \mathrm {i}\,{\mathbf {k}}\cdot ({\mathbf {y}}-{\mathbf {x}}))}{\prod _{j=1}^d (1+2\pi |k_j|)^{2s}},\quad {\mathbf {x}},{\mathbf {y}}\in \mathbb {T}^d, \end{aligned}$$

changes and has modified Fourier series expansions. Of course, the weight \(w_{s,\#}\) yields an equivalent norm

$$\begin{aligned} \Vert f\Vert _{\#}:=\Vert f\Vert _{H^{s,\#}_{\text {mix}}(\mathbb {T}^d)} = \left( \sum \limits _{{\mathbf {k}}\in \mathbb {Z}^d} |{\hat{f}}_{{\mathbf {k}}}|^2 \prod _{j=1}^d \left( 1+2\pi |k_j|\right) ^{2s} \right) ^{1/2} \end{aligned}$$

in the space \(H^s_{\text {mix}}(\mathbb {T}^d)\). However, from a complexity theoretic point of view, it is worth noting the difference between both approaches. The respective unit balls belonging to both norms differ significantly since the equivalence constants for both norms may depend badly on d. Moreover, we stress that the non-increasing rearrangement \((\lambda _j^\#)_{j=1}^{\infty }\) of the eigenvalues of the mapping \(W = \mathrm {Id}^{*}\circ \mathrm {Id}\), where \(\mathrm {Id}:H\big (K^d_{s,\#}\big ) \rightarrow L_2(\mathbb {T}^d)\), differs from \((\lambda _j^*)_{j=1}^{\infty }\) since the associated mappings \(j\rightarrow {\mathbf {k}}\) do not coincide. Accordingly, the sampling operators \(S_{\mathbf {X}}^{m,*}\) and \(S_{\mathbf {X}}^{m,\#}\) are defined with respect to the different non-increasing rearrangements of the basis functions \(\eta _{{\mathbf {k}}}\), and thus may also differ.

Despite the fact that there are many more different equivalent norms for \(H^s_{\text {mix}}(\mathbb {T}^d)\), we will use only the mentioned ones in order to apply advantageous known upper bounds on the eigenvalues \(\lambda _j^{*}\) and \(\lambda _j^\#\). In the regime of interest here, the recovery of periodic functions with mixed smoothness, we even have preasymptotic bounds for these eigenvalues/singular values available (see [33, 35, 36]). Note that theoretically everything is known about the singular values \(\sigma _m^\square \), \(\square \in \{*,\#\}\), since the behavior of this sequence is determined by the non-increasing rearrangement of the reciprocals of the tensor product weights \(\prod _{j=1}^d w_{s,\square }(k_j)\), cf. (8.2) and (8.4). The analysis of the rearrangement of multi-indexed sequences has revealed that the upper bound

$$\begin{aligned} \sigma _m^\#\le \min \left( 1,\frac{16}{3m}\right) ^{\frac{s}{1+\log _2d}}\le \left( \frac{16}{3m}\right) ^{\frac{s}{1+\log _2d}} \end{aligned}$$
(8.5)

holds for all \(m\in \mathbb {N}\), cf. [35, Thm. 4.1].

One may argue that the kernel \(K^d_{s,\#}\) is less “natural” than the kernel \(K^d_{s,*}\). For the latter kernel we use the observation by Krieg [33], which yields the preasymptotic bound

$$\begin{aligned} \sigma _{m}^*\le \left( \frac{1.26}{m}\right) ^{\frac{1.83\cdot s}{4+\log _2(d)}} \end{aligned}$$

in the range \(2\le m\le 3^d\), where the exponent scales also linearly in s.
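
Both preasymptotic bounds are explicit and can be tabulated directly; the snippet below (our own illustration) prints the right-hand sides of (8.5) and of the bound above for a few values of m.

```python
import numpy as np

def bound_hash(m, s, d):
    # right-hand side of (8.5)
    return min(1.0, 16.0 / (3.0 * m)) ** (s / (1.0 + np.log2(d)))

def bound_star(m, s, d):
    # Krieg's preasymptotic bound, stated for 2 <= m <= 3**d
    return (1.26 / m) ** (1.83 * s / (4.0 + np.log2(d)))

s, d = 2, 8
for m in (10, 100, 1000):
    print(m, bound_hash(m, s, d), bound_star(m, s, d))
```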

8.1 Uniform Recovery of Functions from \(H^s_{\text {mix}}(\mathbb {T}^d)\)

First, we consider the asymptotic behavior of the sampling error caused by the presented least squares approach. Since the asymptotic bounds on \(\lambda _j^*\) and \(\lambda _j^\#\) differ only by constants that we do not specify explicitly, we study both cases collectively. For \(\square \in \{*,\#\}\), the asymptotic behavior of the sequence \((\sigma _j^\square )_{j=1}^\infty \) for the embedding \(\mathrm {Id}:H(K_{s,\square }^d) \rightarrow L_2(\mathbb {T}^d)\) has been known for a long time, cf. [18, Chapt. 4] and the references therein. There is a constant \({\tilde{C}}_d^\square \) which depends exponentially on d such that

$$\begin{aligned} \sigma _m^\square \le {\tilde{C}}_d^\square \, m^{-s}(\log m)^{s(d-1)}, \quad m\in \mathbb {N}. \end{aligned}$$
(8.6)

As a direct consequence of Corollary 5.6, we determine explicit error bounds as well as the asymptotic error behavior (8.7), where the latter holds for all equivalent norms in \(H^s_{\text {mix}}(\mathbb {T}^d)\). Please note that the constants \(C_d^\square \) depend heavily on the specific norm. Moreover, a statement similar to (8.7) can be also obtained with the technique in [34].

Theorem 8.2

Let \(d\in \mathbb {N}, s>1/2\), \(\delta >0\) and \(n\in \mathbb {N}\) be given. Choose \(m\in \mathbb {N}\), \(m\ge 2\),

$$\begin{aligned} m \le \left\lfloor \frac{n}{48(\sqrt{2}\log (2n)-\log \delta )}\right\rfloor +1 \end{aligned}$$

and draw the n sampling nodes in \({\mathbf {X}}\) uniformly i.i.d. at random in \(\mathbb {T}^d\). Then, we achieve

$$\begin{aligned}&\mathbb {P}\left( \sup \limits _{\Vert f\Vert _{\square }\le 1} \Vert f-S_{\mathbf {X}}^{m,\square } f\Vert ^2_{L_2(\mathbb {T}^d)} \le \frac{29}{\delta }\,\max \left\{ \left( \sigma _m^\square \right) ^2, \frac{\log (8n)}{n}\sum _{k=m}^\infty (\sigma _k^\square )^2 \right\} \right) \\&\qquad \ge 1-3\delta . \end{aligned}$$

In particular, there is a constant \(C_d^\square >0\) depending on d such that for \(0<\delta <1/3\) it holds

$$\begin{aligned} \mathbb {P}\left( \sup \limits _{\Vert f\Vert _{\square }\le 1} \Vert f-S_{\mathbf {X}}^{m,\square } f\Vert _{L_2(\mathbb {T}^d)} \le \frac{C_d^\square }{\sqrt{\delta }}n^{-s}(\log n)^{sd}\right) \ge 1-3\delta . \end{aligned}$$
(8.7)

Proof

The result follows from Corollary 5.6 and our specific situation where \(B=1\), \(N(m)=m-1\) and \(T_\square (m)=\sum _{k=m}^\infty (\sigma _k^\square )^2\). We estimate \(T_\square (m)\) using [12]

$$\begin{aligned} T_\square (m) := \sup \limits _{{\mathbf {x}}\in D}\sum \limits _{k=m}^{\infty }|e_k({\mathbf {x}})|^2= \sum \limits _{k = m}^{\infty } (\sigma _k^\square )^2 \asymp m^{-2(s-1/2)}(\log m)^{2(d-1)s} \asymp m(\sigma _m^\square )^2. \end{aligned}$$

Hence, the right-hand side in (5.9) can be bounded from above by a constant times \(\sigma _m^\square \), which behaves as \(n^{-s}(\log n)^{sd}\). \(\square \)

In addition, we investigate the preasymptotic error behavior using the aforementioned estimates (8.5) on the singular values \(\sigma _m^\#\) that belong to \(\mathrm {Id}:H\big (K^d_{s,\#}\big ) \rightarrow L_2(\mathbb {T}^d)\). Since the upper bounds have been proven only for this specific type of mappings, the following results, in particular the explicitly determined constants, may only hold for RKHS with weight functions \(w_s({\mathbf {k}})\ge \prod _{j=1}^d(1+|k_j|)^s\), which is fulfilled for \(w_{s,\#}\).

Theorem 8.3

Let \(d\in \mathbb {N}\), \(s>(1+\log _2d)/2\), \(0<\delta <1\), and \(n\in \mathbb {N}\) such that

$$\begin{aligned} m:= \left\lfloor \frac{n}{48(\sqrt{2}\log (2n)-\log {\delta })}\right\rfloor +1 \end{aligned}$$
(8.8)

is at least 2. Drawing the n sampling nodes in \({\mathbf {X}}\) uniformly i.i.d. at random in \(\mathbb {T}^d\) yields

$$\begin{aligned}&\mathbb {E}\left( \sup \limits _{\Vert f\Vert _{\#}\le 1} \Vert f-S^{m,\#}_{\mathbf {X}}f\Vert ^2_{L_2(\mathbb {T}^d)}\Big |\Vert {\mathbf {H}}_{m,\#}-{\mathbf {I}}_m\Vert \le 1/2\right) \\&\quad \le \frac{29}{6\,(1-\delta )}\frac{2s}{2s-1-\log _2d}\left( \frac{16}{3m}\right) ^{\frac{2s}{1+\log _2d}}. \end{aligned}$$

Proof

We apply Theorem 5.5 and take into account that we choose

$$\begin{aligned} m := \left\lfloor \frac{n}{48(\sqrt{2}\log (2n)-\log {\delta })}\right\rfloor +1\le \frac{n}{48\sqrt{2}\log (2n)} +1 \end{aligned}$$

for large enough n. Furthermore, we set \(c:=16/3\), \(\beta :=2s/(1+\log _2d)\) and estimate \(T_\#(m)\), see (8.5),

$$\begin{aligned} T_\#(m)&:= \sup \limits _{{\mathbf {x}}\in D}\sum \limits _{k=m}^{\infty }|e_k({\mathbf {x}})|^2= \sum \limits _{k = m}^{\infty } \left( \sigma _k^\#\right) ^2 \le c^{\beta }\sum _{k=m}^\infty k^{-\beta }\\&\le c^{\beta }\left( m^{-\beta }+\frac{1}{\beta -1}m^{-\beta +1}\right) \le c^{\beta }\frac{\beta }{\beta -1}m^{-\beta +1}. \end{aligned}$$

Taking (5.4) into account, we bound

$$\begin{aligned}&\mathbb {E}\left( \sup \limits _{\Vert f\Vert _{\#}\le 1} \Vert f-S^{m,\#}_{\mathbf {X}}f\Vert ^2_{L_2(\mathbb {T}^d)}\Big |\Vert {\mathbf {H}}_{m,\#}-{\mathbf {I}}_m\Vert \le 1/2\right) \\&\quad \le \frac{c^\beta m^{-\beta }}{1-\delta }\left( 3+8\,C_{\mathrm {R}}^2\frac{m\log (8n)}{n}\frac{\beta }{\beta -1}+4\,C_{\mathrm {R}}\,\sqrt{\frac{m\log (8n)}{n}\frac{\beta }{\beta -1}}\right) \\&\quad \le \frac{\beta \,c^\beta m^{-\beta }}{(1-\delta )(\beta -1)}\left( 3+8\,C_{\mathrm {R}}^2 b+4\,C_{\mathrm {R}}\,\sqrt{b}\right) , \end{aligned}$$

where \(b:=\frac{\log (8n)}{48\sqrt{2}\log (2n)}+\frac{\log (8n)}{n}\). The term in the brackets is monotonically decreasing in n. We stress that the last estimates are reasonable for \(m\ge 2\) and thus, we need at least \(n\ge 464\). This choice of n leads to an upper bound of the term within the brackets which is 29/6. Thus, for \(m\ge 2\), the estimate

$$\begin{aligned} \mathbb {E}\left( \sup \limits _{\Vert f\Vert _{\#}\le 1} \Vert f-S^{m,\#}_{\mathbf {X}}f\Vert ^2_{L_2(\mathbb {T}^d)}\Big |\Vert {\mathbf {H}}_{m,\#}-{\mathbf {I}}_m\Vert \le 1/2\right) \le \frac{29\,\beta \,c^\beta m^{-\beta }}{6\,(1-\delta )(\beta -1)} \end{aligned}$$

holds and the assertion follows. \(\square \)

Similar to Corollary 5.6, we apply Markov’s inequality to get a lower bound on the success probability of the randomly chosen sampling set.

Corollary 8.4

Under the same assumptions as in Theorem 8.3, it holds

$$\begin{aligned}&\mathbb {P}\left( \sup \limits _{\Vert f\Vert _{\#}\le 1} \Vert f-S^{m,\#}_{\mathbf {X}}f\Vert ^2_{L_2(\mathbb {T}^d)}\right. \\&\quad \left. \le \frac{29}{6\,\delta }\frac{2s}{2s-1-\log _2d}\left( \frac{256(\sqrt{2}\log (2n)-\log \delta )}{n}\right) ^{\frac{2s}{1+\log _2d}} \right) \\&\quad \ge \mathbb {P}\left( \sup \limits _{\Vert f\Vert _{\#}\le 1}\Vert f-S^{m,\#}_{\mathbf {X}}f\Vert ^2_{L_2(\mathbb {T}^d)} \le \frac{29}{6\,\delta }\frac{2s}{2s-1-\log _2d}\left( \frac{16}{3m}\right) ^{\frac{2s}{1+\log _2d}} \right) \\&\qquad \ge 1-3\delta . \end{aligned}$$

Proof

We follow the argumentation in the proof of Corollary 5.6 using the inequality

$$\begin{aligned} m^{-1}\le \frac{48(\sqrt{2}\log (2n)-\log \delta )}{n}. \end{aligned}$$

\(\square \)

Example 8.5

For \(s=5\) and \(d=16\), we would like to fulfill

$$\begin{aligned} \mathbb {P}\left( \sup \limits _{\Vert f\Vert _{\#}\le 1} \Vert f-S^{m,\#}_{\mathbf {X}}f\Vert _{L_2(\mathbb {T}^d)}\le 0.1 \right) \ge 0.99, \end{aligned}$$

i.e., with \(\delta =1/300\) we choose \(m:=2\,873\) as the smallest possible value such that

$$\begin{aligned} 2900 \left( \frac{16}{3m}\right) ^{2}\le 0.01 \end{aligned}$$

holds. Clearly, for n such that \(m-1=2\,872\le \frac{n}{48(\sqrt{2}\log (2n)-\log {\delta })}\) holds, we observe the desired estimate. We choose the smallest possible \(n:=3\,879\,166\).

8.2 Numerical Integration of Periodic Functions

In [52, Sec. 4.2], the author discussed the construction of stable cubature weights for the approximation of the integral

$$\begin{aligned} {\text {I}}(f):=\int _{\mathbb {T}^d} f({\mathbf {x}})\,\mathrm {d}{\mathbf {x}}\approx {\text {Q}}_{\mathbf {X}}^mf := \sum \limits _{j=1}^n q_jf({\mathbf {x}}^j) \end{aligned}$$

for functions from periodic Sobolev spaces with dominating mixed smoothness. The integration nodes \({\mathbf {X}}\) are drawn uniformly and independently at random from \(\mathbb {T}^d\). Below, in Corollary 8.6, the cubature rule \({\text {Q}}_{\mathbf {X}}^m\) is fixed for the whole class. The result in [52, Thm. 4.5] bounds the worst-case integration error from above by

$$\begin{aligned} \lesssim _{d,s}n^{-s+1/2}(\log n)^{sd-1/2}. \end{aligned}$$

Corresponding numerical tests indicate a better behavior of the integration error, cf. [52, Rem. 4.6]. The theoretical results of this section confirm that the optimal main rate of the approach presented in [52] is \(n^{-s}\); in particular, we obtain the upper bound \(\lesssim _{d,s}n^{-s}(\log n)^{sd} \) for this specific setting. We achieve the following statement on the worst-case integration error.

Corollary 8.6

Let \(d\in \mathbb {N}, s>1/2\) and \(0<\delta <1/3\). We choose \(n\in \mathbb {N}\) such that m as stated in (8.8) is at least 2. Drawing the n sampling nodes in \({\mathbf {X}}\) uniformly i.i.d. at random from \(\mathbb {T}^d\), we put the cubature weight vector \({\varvec{q}}\) to be the first column of \(\overline{{\mathbf {L}}_m}\, \overline{({\mathbf {L}}_m^{*}\,{\mathbf {L}}_m)^{-1}}.\) Then, with probability at least \(1-3\delta \), we obtain

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{\square }\le 1} |I(f)-{\text {Q}}_{\mathbf {X}}^mf| \le \sqrt{\frac{29}{\delta }}\,\max \left\{ \sigma _m^\square ,\sqrt{\frac{\log (8n)}{n}\sum _{k=m}^\infty (\sigma _k^\square )^2} \right\} . \end{aligned}$$

In particular, there is a constant \(C_d^\square >0\) depending on d such that for \(0<\delta <1/3\) it holds with probability \(1-3\delta \)

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{\square }\le 1} |I(f)-{\text {Q}}_{\mathbf {X}}^mf| \le \frac{C_d^\square }{\sqrt{\delta }}n^{-s}(\log n)^{sd}. \end{aligned}$$
(8.9)

Proof

We apply Theorem 7.1 with \(\mu _{\mathbb {T}^d} = \varrho _{\mathbb {T}^d} \equiv 1\) followed by (8.6). \(\square \)
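To make the construction of the weights concrete, the following Python sketch (our own illustration, not the authors' implementation) performs the corresponding computation for a small hyperbolic cross Fourier system on \(\mathbb {T}^d\): it draws uniform random nodes, assembles the matrix \({\mathbf {L}}_m\) with the constant function as first column, and extracts the weight vector \({\varvec{q}}\) as the first column of \(\overline{{\mathbf {L}}_m}\, \overline{({\mathbf {L}}_m^{*}\,{\mathbf {L}}_m)^{-1}}\). The index set used here is a simple product-weight hyperbolic cross chosen for brevity.

```python
import numpy as np
from itertools import product

def hyperbolic_cross(d, radius):
    """Frequencies k in Z^d with prod_j (1+|k_j|) <= radius, sorted by that product
    weight; the constant frequency k = 0 comes first."""
    box = range(-int(radius), int(radius) + 1)
    wgt = lambda k: np.prod([1 + abs(kj) for kj in k])
    idx = [k for k in product(box, repeat=d) if wgt(k) <= radius]
    idx.sort(key=lambda k: (wgt(k), k))
    return np.array(idx)

def cubature_weights(X, freqs):
    """q = first column of conj(L_m) conj((L_m^* L_m)^{-1}), cf. Corollary 8.6."""
    L = np.exp(2j * np.pi * X @ freqs.T)          # n x m matrix of exp(2 pi i k.x)
    M = np.linalg.inv(L.conj().T @ L)             # (L_m^* L_m)^{-1}
    return (L.conj() @ M.conj())[:, 0]

rng = np.random.default_rng(0)
d, n = 2, 2000
X = rng.random((n, d))                            # uniform i.i.d. nodes on T^d
q = cubature_weights(X, hyperbolic_cross(d, radius=12))
print(q.sum().real)                               # Q_X^m applied to f = 1 returns 1
```

Since the constant function belongs to the approximation space, the weights sum to 1 up to rounding errors, which the last line checks.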

Remark 8.7

By the same reasoning, the result in Theorem 8.3 transfers almost literally to the integration problem. In fact, having \(s>(1+\log _2 d)/2\) we see a non-trivial preasymptotic behavior. The above bounds show that this method based on random points competes with most of the quasi-Monte-Carlo methods studied in the literature, see [22, pp. 195, 247].

9 Hyperbolic Wavelet Regression

The following scenario of replacing the Fourier system by dyadic wavelets has already been investigated by Bohn [5, Sec. 5.5.2], [6] using piecewise linear prewavelets. Here, we use orthogonal wavelets and will improve the result in [5] in two directions. First, we remove a d-dependent \(\log \)-factor and second, our result holds for the whole class and not just one individual function, i.e., we control the worst-case error. It is worth mentioning that we only lose a \(\log \)-factor, independent of d, compared to the benchmark result in [21].

Let us start with the necessary definitions since we are now in a non-periodic setting. For \(s>0\) let us define the space \(H^s_{\text {mix}}(\mathbb {R}^d)\) as the collection of all functions from \(L_2(\mathbb {R}^d)\) such that

$$\begin{aligned} \Vert f\Vert _{H^s_{\text {mix}}(\mathbb {R}^d)} = \left\| \left( \prod \limits _{i=1}^d (1+|y_i|)^s\right) {\mathcal {F}}f({\mathbf {y}})\right\| _{L_2(\mathbb {R}^d)} < \infty . \end{aligned}$$

Here \({\mathcal {F}}\) denotes the Fourier transform on \(\mathbb {R}^d\) given by

$$\begin{aligned} {\mathcal {F}}f({\mathbf {x}}) = \frac{1}{\sqrt{2\pi }^d}\int _{\mathbb {R}^d} f({\mathbf {y}})\exp (-\mathrm {i} \, {\mathbf {y}}\cdot {\mathbf {x}})\,\mathrm {d}{\mathbf {y}},\quad {\mathbf {x}}\in \mathbb {R}^d. \end{aligned}$$

It is well known that \(H^s_{\text {mix}}(\mathbb {R}^d)\) can be characterized using hyperbolic wavelets. Let \((\psi _{j,k})_{j\in \mathbb {N}_0, k\in \mathbb {Z}}\) be a univariate orthonormal wavelet system (if \(j=0\) then \(\psi _{0,k}\) denotes the orthogonal scaling function). Then, we denote with

$$\begin{aligned} \psi _{{\mathbf {j}},{\mathbf {k}}}({\mathbf {x}}) := \psi _{j_1,k_1}(x_1)\cdots \psi _{j_d,k_d}(x_d),\quad {\mathbf {x}}\in \mathbb {R}^d, \; {\mathbf {j}}\in \mathbb {N}_0^d, \; {\mathbf {k}}\in \mathbb {Z}^d, \end{aligned}$$

the corresponding hyperbolic wavelet basis in \(L_2(\mathbb {R}^d)\). For our analysis we need the univariate wavelet to be compactly supported, which means that \(\psi _{j,k}\) is supported “near” the interval \([k2^{-j}, (k+1)2^{-j}]\). If the wavelet basis has sufficient smoothness and vanishing moments, then \(f \in H^s_{\text {mix}}(\mathbb {R}^d)\) holds if and only if

$$\begin{aligned} \left( \sum \limits _{{\mathbf {j}}\in \mathbb {N}_0^d}\sum \limits _{{\mathbf {k}}\in \mathbb {Z}^d}2^{2\Vert {\mathbf {j}}\Vert _1s}|\langle f,\psi _{{\mathbf {j}},{\mathbf {k}}}\rangle |^2\right) ^{1/2} <\infty . \end{aligned}$$

This leads to the norm equivalence

$$\begin{aligned} \Vert f\Vert _{H^s_{\text {mix}}(\mathbb {R}^d)} \asymp \left( \sum \limits _{{\mathbf {j}}\in \mathbb {N}_0^d}\sum \limits _{{\mathbf {k}}\in \mathbb {Z}^d}2^{2\Vert {\mathbf {j}}\Vert _1s}|\langle f,\psi _{{\mathbf {j}},{\mathbf {k}}}\rangle |^2\right) ^{1/2}. \end{aligned}$$
(9.1)

Clearly, if \(\Vert f\Vert _{H^s_{\text {mix}}(\mathbb {R}^d)} \le 1\), then the sequence \((2^{\Vert {\mathbf {j}}\Vert _1s}\langle f,\psi _{{\mathbf {j}},{\mathbf {k}}}\rangle )_{{\mathbf {j}},{\mathbf {k}}}\) has an \(\ell _2\)-norm bounded by a constant, which will be important for our later analysis.

Let us consider the unit cube \([0,1]^d\). Let further \(D_{{\mathbf {j}}}\) be the set of all \({\mathbf {k}}\in \mathbb {Z}^d\) such that the support \(\mathrm{supp}\,\psi _{{\mathbf {j}},{\mathbf {k}}}\) of the wavelet has a non-empty intersection with \([0,1]^d\). This directly leads to the extended domain \(\Omega \) given by

$$\begin{aligned} \Omega :=\bigcup \limits _{{\mathbf {j}}\in \mathbb {N}_0^d} \bigcup \limits _{{\mathbf {k}}\in D_{{\mathbf {j}}}} \mathrm{supp}\,\psi _{{\mathbf {j}},{\mathbf {k}}}. \end{aligned}$$

It holds \([0,1]^d \subset \Omega \), and the system \((\psi _{{\mathbf {j}},{\mathbf {k}}})_{{\mathbf {j}}\in \mathbb {N}_0^d,{\mathbf {k}}\in D_{{\mathbf {j}}}}\) is an orthonormal system in \(L_2(\Omega )\), however not a basis. Note that \(\Omega \) is still a bounded tensor domain whose measure is a constant depending only on the support length of the wavelet basis. It is also clear that this orthonormal system is not uniformly bounded in \(L_\infty \).

In the sequel we want to recover functions \(f\in H^s_{\text {mix}}(\mathbb {R}^d)\) on the domain \([0,1]^d\) from samples on the slightly larger extended domain \(\Omega \) in a uniform way. In other words, the discrete locations of the sampling nodes \({\mathbf {X}}= \{{\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\}\) are chosen in advance for the whole class of functions. Let us consider the operator

$$\begin{aligned} {\tilde{P}}_\ell f := \sum \limits _{\Vert {\mathbf {j}}\Vert _1 \le \ell }\sum \limits _{{\mathbf {k}}\in D_{{\mathbf {j}}}}\langle f,\psi _{{\mathbf {j}},{\mathbf {k}}}\rangle \psi _{{\mathbf {j}},{\mathbf {k}}},\quad \ell \in \mathbb {N}, \end{aligned}$$

which is known from hyperbolic wavelet approximation, see [21, 60]. The following worst-case error bound is well known and follows directly from (9.1):

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{H^s_{\text {mix}}(\mathbb {R}^d)} \le 1} \Vert f-{\tilde{P}}_\ell f\Vert _{L_2([0,1]^d)} \asymp 2^{-s\ell }. \end{aligned}$$
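For convenience, the upper bound in this equivalence can be seen directly: by (9.1) and the orthonormality of the wavelets, for \(\Vert f\Vert _{H^s_{\text {mix}}(\mathbb {R}^d)}\le 1\) we have

$$\begin{aligned} \Vert f-{\tilde{P}}_\ell f\Vert _{L_2([0,1]^d)}^2 \le \sum \limits _{\Vert {\mathbf {j}}\Vert _1 > \ell }\sum \limits _{{\mathbf {k}}\in \mathbb {Z}^d}|\langle f,\psi _{{\mathbf {j}},{\mathbf {k}}}\rangle |^2 \le 2^{-2\ell s}\sum \limits _{\Vert {\mathbf {j}}\Vert _1 > \ell }\sum \limits _{{\mathbf {k}}\in \mathbb {Z}^d}2^{2\Vert {\mathbf {j}}\Vert _1 s}|\langle f,\psi _{{\mathbf {j}},{\mathbf {k}}}\rangle |^2 \lesssim 2^{-2\ell s}, \end{aligned}$$

since the wavelets with \(\Vert {\mathbf {j}}\Vert _1\le \ell \) and \({\mathbf {k}}\notin D_{{\mathbf {j}}}\) vanish on \([0,1]^d\).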

We now consider a special case of the matrix \({\mathbf {L}}_m\) from (3.1), namely

$$\begin{aligned} {\tilde{\mathbf {L}}}_\ell := \left( \begin{array}{c} \left( {\widetilde{\psi }}_{{\mathbf {j}},{\mathbf {k}}}({\mathbf {x}}^1)\right) ^\top _{\Vert {\mathbf {j}}\Vert _1\le \ell ,{\mathbf {k}}\in D_{{\mathbf {j}}}}\\ \vdots \\ \left( {\widetilde{\psi }}_{{\mathbf {j}},{\mathbf {k}}}({\mathbf {x}}^n)\right) ^\top _{\Vert {\mathbf {j}}\Vert _1\le \ell ,{\mathbf {k}}\in D_{{\mathbf {j}}}} \end{array}\right) . \end{aligned}$$
(9.2)

Here, \(m = m(\ell ) \asymp 2^\ell \ell ^{d-1}\) and the functions \({\widetilde{\psi }}_{{\mathbf {j}},{\mathbf {k}}} = \sqrt{|\Omega |}\psi _{{\mathbf {j}},{\mathbf {k}}}\) enumerate the properly re-normalized wavelets \(\psi _{{\mathbf {j}},{\mathbf {k}}}\), \(\Vert {\mathbf {j}}\Vert _1 \le \ell \), \({\mathbf {k}}\in D_{{\mathbf {j}}}\), which now form an orthonormal system in the space \(L_2(\Omega ,\varrho _\Omega )\) with the probability measure \(\varrho _\Omega = \frac{\mathrm {d}{\mathbf {x}}}{|\Omega |}\). The n sampling nodes in \({\mathbf {X}}\) are drawn i.i.d. at random according to \(\varrho _\Omega \). Note that, by construction, \(|\Omega |\) is bounded by a constant which depends on the chosen wavelet system. This, in turn, depends on the assumed mixed regularity of the function f, i.e., the mixed smoothness \(s>0\): the larger s is chosen, the larger the support of a properly chosen orthonormal wavelet system has to be. We propose Algorithm 3 for computing the wavelet coefficients of an approximation \(S_{\mathbf {X}}^\ell f\) to f. Note that there is a slight abuse of notation since we use the wavelet level \(\ell \) as upper index (in contrast to m, the dimension of the subspace used earlier).

Algorithm 3
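To indicate how Algorithm 3 can be realized, the following Python sketch (our own illustration, not the authors' implementation) assembles the matrix (9.2) and solves the least squares problem with LSQR. It uses the Haar system purely for convenience, since Haar wavelets are compactly supported, orthonormal, and require no domain extension on \([0,1]^d\) (so \(\Omega =[0,1]^d\) and \(|\Omega |=1\)); for larger smoothness s, wavelets with more smoothness and vanishing moments are needed for (9.1) to hold.

```python
import numpy as np
from itertools import product
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import lsqr

def haar_1d(j, k, x):
    """Univariate Haar system on [0,1): j = 0 is the scaling function (constant 1),
    j >= 1 are Haar wavelets on level j-1 with shifts k = 0,...,2^(j-1)-1."""
    if j == 0:
        return np.where((0 <= x) & (x < 1), 1.0, 0.0)
    t = 2.0 ** (j - 1) * x - k
    return 2.0 ** ((j - 1) / 2) * (np.where((0 <= t) & (t < 0.5), 1.0, 0.0)
                                   - np.where((0.5 <= t) & (t < 1), 1.0, 0.0))

def wavelet_matrix(X, level):
    """Matrix (9.2): rows = sampling nodes, columns = tensor wavelets with ||j||_1 <= level."""
    n, d = X.shape
    cols = []
    for j in product(range(level + 1), repeat=d):
        if sum(j) > level:
            continue
        for k in product(*[range(2 ** max(jt - 1, 0)) for jt in j]):
            col = np.ones(n)
            for t in range(d):
                col *= haar_1d(j[t], k[t], X[:, t])
            cols.append(col)
    # dense assembly for brevity; a genuinely sparse assembly would exploit that each
    # row has only about ell^d nonzero entries due to the compact supports
    return csr_matrix(np.column_stack(cols))

rng = np.random.default_rng(1)
d, level = 2, 5
n = 2 ** level * (level + 1) ** d                      # n ~ 2^ell * ell^d, cf. Theorem 9.1
X = rng.random((n, d))                                 # uniform i.i.d. nodes in Omega = [0,1]^d
L = wavelet_matrix(X, level)
f = lambda X: np.prod(np.minimum(X, 1.0 - X), axis=1)  # a simple test function
coeffs = lsqr(L, f(X), atol=1e-10, btol=1e-10)[0]      # least squares wavelet coefficients
print(L.shape, "residual:", np.linalg.norm(L @ coeffs - f(X)))
```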

Theorem 9.1

Let \(0<\delta <1\). Let further \(s>1/2\) and \((\psi _{{\mathbf {j}},{\mathbf {k}}})_{{\mathbf {j}},{\mathbf {k}}}\) be a hyperbolic and compactly supported orthonormal wavelet system such that (9.1) holds true. Then, the algorithm \(S^\ell _{{\mathbf {X}}}\) indicated in Algorithm 3 recovers any \(f \in H^s_{\text {mix}}(\mathbb {R}^d)\) on \(L_2([0,1]^d)\) with high probability from \(n = n(\ell )\) random samples which are drawn in advance for the whole class. Precisely,

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{H^s_{\text {mix}}(\mathbb {R}^d)}\le 1} \Vert f-S^{\ell }_{{\mathbf {X}}} f\Vert _{L_2([0,1]^d)} \lesssim C_{\delta ,d}2^{-\ell s} \end{aligned}$$
(9.3)

with probability larger than \(1-\delta \). The operator \(S^{\ell }_{{\mathbf {X}}}\) uses \(n(\ell ) \asymp 2^{\ell }\ell ^d\) many samples such that the bound in (9.3) reads as \({\widetilde{C}}_{\delta ,d}n^{-s}\log (n)^{ds}\) in terms of the number of samples.

Remark 9.2

(i)

    Note that the optimal operator \({\tilde{P}}_\ell \) uses \(m(\ell ) \asymp 2^{\ell }\ell ^{d-1}\) wavelet coefficients. The gap between sampling recovery (\(\Lambda ^{\text {std}}\)) and general linear approximation (\(\Lambda ^{\text {all}}\)), see, e.g., [18, 48, 49, 51], is reduced to a \(\log \)-factor, which is independent of d.

(ii)

    The matrix defined in (9.2) is rather sparse. It has \(n \asymp 2^{\ell }\ell ^{d}\) rows and \(m \asymp 2^{\ell }\ell ^ {d-1}\) columns. In every row there are only \(\asymp \#\{{\mathbf {j}}\in \mathbb {N}_0^d:\Vert {\mathbf {j}}\Vert _1\le \ell \} \asymp \ell ^d\) nonzero entries. This gives an additional acceleration for the least squares algorithm since matrix–vector multiplications are cheap in this situation.

Proof of Theorem 9.1

We follow the proof of Theorem 5.5. Let \({\mathbf {X}}\) be a set of n randomly drawn nodes from \(\Omega \) according to \(\varrho _\Omega \). If the number n of samples satisfies (5.1), that is

$$\begin{aligned} {\tilde{N}}(\ell ):= \sup _{{\mathbf {x}}\in \Omega } \sum \limits _{\Vert {\mathbf {j}}\Vert _1 \le \ell }\sum \limits _{{\mathbf {k}}\in D_{{\mathbf {j}}}} \Big |{\widetilde{\psi }}_{{\mathbf {j}},{\mathbf {k}}}({\mathbf {x}})\Big |^2 \lesssim \frac{n}{\log n -\log \delta }, \end{aligned}$$
(9.4)

then

$$\begin{aligned} \Big \Vert \left( {\tilde{\mathbf {L}}}_\ell ^{*}\,{\tilde{\mathbf {L}}}_\ell \right) ^{-1}\,{\tilde{\mathbf {L}}}_\ell ^{*}\Big \Vert \le \sqrt{\frac{2}{n}} \end{aligned}$$

is satisfied with probability larger than \(1-\delta \). Let \({\mathbf {X}}\) be such that this is the case. Then, we estimate

$$\begin{aligned}&\Vert f-S_{{\mathbf {X}}}^{\ell }f\Vert _{L_2([0,1]^d)} \le \Vert f-{\tilde{P}}_{\ell }f\Vert _{L_2([0,1]^d)} + \Vert {\tilde{P}}_{\ell }f - S_{{\mathbf {X}}}^{\ell }f\Vert _{L_2([0,1]^d)}\\&\quad \lesssim \left\| f-\sum \limits _{\Vert {\mathbf {j}}\Vert _1\le \ell }\sum \limits _{{\mathbf {k}}\in \mathbb {Z}^d} \langle f,\psi _{{\mathbf {j}},{\mathbf {k}}}\rangle \psi _{{\mathbf {j}},{\mathbf {k}}}\right\| _{L_2(\mathbb {R}^d)}+\Vert S_{{\mathbf {X}}}^{\ell }({\tilde{P}}_{\ell }f - f)\Vert _{L_2(\Omega , \varrho _{\Omega })}\\&\quad \lesssim 2^{-\ell s} + \sqrt{\frac{2}{n}}\cdot \left[ \sum \limits _{u=1}^n \left( \sum \limits _{\Vert {\mathbf {j}}\Vert _1> \ell }\sum \limits _{{\mathbf {k}}\in \mathbb {Z}^d}\overline{\langle f,\psi _{{\mathbf {j}},{\mathbf {k}}}\rangle \psi _{{\mathbf {j}},{\mathbf {k}}}({\mathbf {x}}^u)}\right) \right. \\&\qquad \times \left. \left( \sum \limits _{\Vert {\mathbf {j}}\Vert _1> \ell }\sum \limits _{{\mathbf {k}}\in \mathbb {Z}^d}\langle f,\psi _{{\mathbf {j}},{\mathbf {k}}}\rangle \psi _{{\mathbf {j}},{\mathbf {k}}}({\mathbf {x}}^u)\right) \right] ^{1/2} \\&\quad \lesssim 2^{-\ell s} + \sqrt{\frac{2}{n}}\cdot \left[ \sum \limits _{\Vert {\mathbf {j}}'\Vert _1> \ell }\sum \limits _{{\mathbf {k}}' \in \mathbb {Z}^d}\sum \limits _{\Vert {\mathbf {j}}\Vert _1> \ell }\sum \limits _{{\mathbf {k}}\in \mathbb {Z}^d}\right. \\&\qquad \left. 2^{\Vert {\mathbf {j}}'\Vert _1s}2^{\Vert {\mathbf {j}}\Vert _1s}\overline{\langle f,\psi _{{\mathbf {j}}',{\mathbf {k}}'}\rangle }\langle f,\psi _{{\mathbf {j}},{\mathbf {k}}}\rangle \sum \limits _{u=1}^n \frac{\overline{\psi _{{\mathbf {j}}',{\mathbf {k}}'}({\mathbf {x}}^u)}}{2^{\Vert {\mathbf {j}}'\Vert _1s}}\frac{\psi _{{\mathbf {j}},{\mathbf {k}}}({\mathbf {x}}^u)}{2^{\Vert {\mathbf {j}}\Vert _1s}}\right] ^{1/2} \\&\quad \lesssim 2^{-\ell s} + \sqrt{\frac{2}{n}} \big \Vert (2^{\Vert {\mathbf {j}}\Vert _1s}\langle f,\psi _{{\mathbf {j}},{\mathbf {k}}}\rangle )_{{\mathbf {j}},{\mathbf {k}}}\big \Vert _2 \; \Vert {\tilde{\varvec{\Phi }}}_\ell \Vert \\&\quad \lesssim 2^{-\ell s} + \sqrt{\frac{2}{n}}\Vert {\tilde{\varvec{\Phi }}}_\ell \Vert , \end{aligned}$$

where \({\tilde{\varvec{\Phi }}}_\ell \) is defined in analogy to Proposition 5.4. This time we put

$$\begin{aligned} {\tilde{\varvec{\Phi }}}_\ell := \left( \begin{array}{c} \left( 2^{-\Vert {\mathbf {j}}\Vert _1s}\psi _{{\mathbf {j}},{\mathbf {k}}}({\mathbf {x}}^1)\right) ^\top _{\Vert {\mathbf {j}}\Vert _1> \ell ,{\mathbf {k}}\in D_{{\mathbf {j}}}}\\ \vdots \\ \left( 2^{-\Vert {\mathbf {j}}\Vert _1s}\psi _{{\mathbf {j}},{\mathbf {k}}}({\mathbf {x}}^n)\right) ^\top _{\Vert {\mathbf {j}}\Vert _1> \ell ,{\mathbf {k}}\in D_{{\mathbf {j}}}} \end{array}\right) . \end{aligned}$$

Let us define the quantity

$$\begin{aligned} {\tilde{T}}(\ell ) := \sup \limits _{{\mathbf {x}}\in \Omega } \sum \limits _{\Vert {\mathbf {j}}\Vert _1 >\ell }\sum \limits _{{\mathbf {k}}\in \mathbb {Z}^d} 2^{-2\Vert {\mathbf {j}}\Vert _1s}|\psi _{{\mathbf {j}},{\mathbf {k}}}({\mathbf {x}})|^2, \end{aligned}$$

which goes along the lines of Proposition 5.4. Then, we get with literally the same arguments

$$\begin{aligned} \mathbb {E}\Vert {\tilde{\varvec{\Phi }}}_{\ell }\Vert ^2 \lesssim n\left( 2^{-2\ell s} + \frac{\log n}{n}{\tilde{T}}(\ell )+ 2^{-\ell s}\sqrt{\frac{\log n}{n}{\tilde{T}}(\ell )}\right) . \end{aligned}$$
(9.5)

Let us compute \({\tilde{T}}(\ell )\). Due to the compact support of the wavelet system, for fixed \({\mathbf {j}}\) and fixed \({\mathbf {x}}\) there are only O(1) many wavelets \(\psi _{{\mathbf {j}},{\mathbf {k}}}\) such that \(\psi _{{\mathbf {j}},{\mathbf {k}}}({\mathbf {x}})\) is non-zero. For those O(1) wavelets, we have

$$\begin{aligned} |\psi _{{\mathbf {j}},{\mathbf {k}}}({\mathbf {x}})|^2 \lesssim 2^{\Vert {\mathbf {j}}\Vert _1}. \end{aligned}$$

Hence, we get

$$\begin{aligned} {\tilde{T}}(\ell ) \lesssim \sum \limits _{\Vert {\mathbf {j}}\Vert _1>\ell } 2^{\Vert {\mathbf {j}}\Vert _1(1-2s)}\lesssim 2^{\ell (1-2s)}\ell ^{d-1}. \end{aligned}$$
(9.6)

By the same reasoning we may estimate \({\tilde{N}}(\ell )\) in (9.4). Clearly n may be chosen such that

$$\begin{aligned} {\tilde{N}}(\ell ) \asymp 2^{\ell }\ell ^{d-1} \lesssim \frac{n}{\log (n)}. \end{aligned}$$
(9.7)

Plugging (9.6) and (9.7) into (9.5), we obtain

$$\begin{aligned} \mathbb {E}\Vert {\tilde{\varvec{\Phi }}}_{\ell }\Vert ^2 \lesssim n2^{-2\ell s}. \end{aligned}$$

The same standard arguments as used in Theorem 5.5 and Corollary 5.6 lead to the bound in (9.3). It remains to estimate the number of samples n depending on \(\ell \), see (9.7). This clearly gives \(\log (n) \gtrsim \ell \) and hence \(n\gtrsim 2^\ell \ell ^d\) which concludes the proof. \(\square \)

10 Numerical Experiments

10.1 Recovery of Functions from Spaces with Mixed Smoothness

In this section, we perform numerical tests for the hyperbolic cross Fourier regression based on random sampling nodes from Sect. 8, i.e., we apply Algorithm 1 to periodic test functions f from the spaces \(H^s_{\text {mix}}(\mathbb {T}^d)\). In Fig. 1, we visualize realizations for such random nodes in the two- and three-dimensional case.

Besides random point sets, different types of deterministic lattices have also been used for numerical integration and function recovery, see for instance [9, 31, 32]. This motivates us to consider Frolov lattices [30] and Fibonacci lattices (cf., e.g., [62, Sec. IV.2]) in the context of this paper, see Fig. 2 for examples of such lattices.

In the following, we use the weight function

$$\begin{aligned} w({\mathbf {k}}):=\prod _{i=1}^d (1+|k_i|^2)^{1/2}. \end{aligned}$$

Note that for computational reasons we avoid the \(2\pi \) in this weight w. By the reasoning after (8.5), the weights without \(2\pi \) lead to a slightly slower decay of the respective singular numbers.

Fig. 1: Realizations of random nodes for hyperbolic Fourier regression

Fig. 2: Examples of Fibonacci and Frolov lattices in \(d=2\) spatial dimensions

For a given number n of samples, we use the \(m=\lfloor n/(4\log n)\rfloor \) frequencies \({\mathbf {k}}\in \mathbb {Z}^d\) where \(w({\mathbf {k}})\) is smallest, i.e., we define \(I_m:=\{{\mathbf {k}}_1,\ldots ,{\mathbf {k}}_m\}\subset \mathbb {Z}^d\), \(|I_m|=m\), assuming an arrangement of the \({\mathbf {k}}_j\) fulfilling \(w({\mathbf {k}}_1)\le w({\mathbf {k}}_2)\le \ldots \). Ties are broken in numerical order, starting with the first component \(k_1\) of \({\mathbf {k}}\) and proceeding to the last one \(k_d\). Corresponding to our theoretical results, the goal is to compute a least squares approximation \(S_{\mathbf {X}}^mf\) of the projection \(P_{m-1}f\) of the function f onto \({\text {span}}\{\exp (2\pi \mathrm {i}\,{\mathbf {k}}\cdot {\mathbf {x}}) :{\mathbf {k}}\in I_m\}\) using the n sampling values at the nodes in \({\mathbf {X}}\).

Comments on the arithmetic cost of Algorithm 1. Building the index set \(I_{m-1}\), i.e., enumerating the basis functions \(\eta _1,\ldots ,\eta _{m-1}\), requires

$$\begin{aligned} \le 4 \, C_1 \, d \, m^2 \log m \le C_1 \, d \, n \, m \end{aligned}$$

arithmetic operations, and setting up the matrix \({\mathbf {L}}_m\) requires

$$\begin{aligned} \le C_2 \, d \, n \, m \end{aligned}$$

arithmetic operations, where \(C_1,C_2>0\) are absolute constants. Afterward, running Algorithm 1 requires

$$\begin{aligned} \le C_3\, R \, n \, m \le C_3\, R \, \frac{n^2}{4\log n} \end{aligned}$$

arithmetic operations, where \(C_3>0\) is an absolute constant and \(R\in \mathbb {N}\) is the number of LSQR iterations. If one chooses m as in Theorem 5.5, cf. (5.1) and (5.2), the condition number of the matrix \({\mathbf {L}}_m\) is \(\le \sqrt{3}\) with high probability and one obtains \(R\le 17\) for an LSQR accuracy of \(\approx 10^{-8}\). Please note that we choose \(m=\lfloor n/(4\log n)\rfloor \) in our experiments, which is slightly larger than the theory (5.2) requires.
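For orientation, the following Python sketch (our own illustration, not the authors' code) builds the index set \(I_m\) described above, assembles \({\mathbf {L}}_m\) from uniform random nodes, and reports its condition number; for brevity it solves the least squares problem with a dense solver, whereas the experiments below use LSQR.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
d, n = 2, 4096
m = int(n / (4 * np.log(n)))                       # experimental choice m = floor(n/(4 log n))

# the m frequencies with smallest weight w(k) = prod_i (1+k_i^2)^(1/2); ties are broken
# lexicographically (one possible reading of the tie-breaking rule above)
w = lambda k: float(np.prod(np.sqrt(1.0 + np.asarray(k, dtype=float) ** 2)))
cand = sorted((k for k in product(range(-64, 65), repeat=d) if w(k) <= 64.0),
              key=lambda k: (w(k), k))
freqs = np.array(cand[:m])

X = rng.random((n, d))                             # uniform i.i.d. nodes
L = np.exp(2j * np.pi * X @ freqs.T)               # the matrix L_m, size n x m

# For m as in Theorem 5.5 the condition number is <= sqrt(3) with high probability;
# the larger experimental choice of m typically still yields a well-conditioned matrix.
print("cond(L_m) =", np.linalg.cond(L))

# test function from Sect. 10.1.1; dense least squares solve instead of LSQR for brevity
f = lambda X: 1.5 ** (d / 2) * np.prod(1.0 - np.abs(2.0 * (X % 1.0) - 1.0), axis=1) ** 0.25
coeffs = np.linalg.lstsq(L, f(X).astype(complex), rcond=None)[0]
print("computed coefficient for k = 0:", coeffs[0])
```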

Remark 10.1

Let us compare the hyperbolic cross Fourier regression from Sect. 8, which uses random samples, with the single rank-1 lattice sampling approach from [9, 31], which uses highly structured deterministic sampling nodes. Up to logarithmic factors, both approaches have comparable error estimates w.r.t. the number m of basis functions and comparable arithmetic complexities. Single rank-1 lattice sampling has slightly worse recovery error estimates w.r.t. m than Algorithm 1, cf. Theorem 8.2. On the other hand, the arithmetic complexity for single rank-1 lattice sampling is slightly better. Moreover, the error estimates when using rank-1 lattices are guaranteed upper bounds, whereas the worst-case upper bounds in Sect. 8 hold with high probability. However, for fixed m, the number of samples used for single rank-1 lattice sampling is distinctly higher, i.e., almost quadratic compared to the approach from this paper. This results in error estimates w.r.t. the number n of sampling values used that are distinctly worse for single rank-1 lattices.

Subsequently, we consider three different test functions \(f:\mathbb {T}^d \rightarrow \mathbb {R}\), where the Fourier coefficients \({\hat{f}}_{{\mathbf {k}}}:=\int _{\mathbb {T}^d}f({\mathbf {x}}) \,\exp (-2\pi \mathrm {i}\,{\mathbf {k}}\cdot {\mathbf {x}}) \,\mathrm {d}{\mathbf {x}}\), \({\mathbf {k}}\in \mathbb {Z}^d\), of f decay like \(|{\hat{f}}_{{\mathbf {k}}}|\sim \prod _{i=1}^d \big (1+|k_i|^2\big )^{-\alpha /2}\) for \(\alpha \in \{5/4,\, 2,\, 6\}\) and, consequently, \(f\in H^s_{\text {mix}}(\mathbb {T}^d)\) with \(s=\alpha -1/2-\varepsilon \) for \(\varepsilon >0\).

10.1.1 Test function f from \(H^{3/4-\varepsilon }_{\text {mix}}(\mathbb {T}^d)\)

We start with the test function

$$\begin{aligned} f:\mathbb {T}^d \rightarrow \mathbb {R}, \quad f({\mathbf {x}}) := \left( \frac{3}{2}\right) ^{d/2} \prod _{i=1}^d \left( 1 - \big |2 (x_i \bmod 1) - 1\big |\right) ^{1/4}, \end{aligned}$$

where we have for the Fourier coefficients \(|{\hat{f}}_{{\mathbf {k}}}|\sim \big (\prod _{i=1}^d (1+|k_i|^2)^{1/2}\big )^{-5/4}\) and, consequently, \(f\in H^{3/4-\varepsilon }_{\text {mix}}(\mathbb {T}^d)\), \(\varepsilon >0\).

Fig. 3: Approximation errors and least squares sampling errors for test function \(f\in H^{3/4-\varepsilon }_{\text {mix}}(\mathbb {T}^d)\)

In Fig. 3a, we visualize the relative approximation errors \({\tilde{a}}_m:=\Vert f-P_{m-1}f\Vert _{L_2(\mathbb {T}^d)}\) for spatial dimensions \(d=2,3,4,5\). Due to (8.6), these errors should decay like \(m^{-0.75+\varepsilon }(\log m)^{(d-1)\cdot (0.75-\varepsilon )}\) for sufficiently large m. Correspondingly, we plot \(m^{-0.75}(\log m)^{(d-1)\cdot 0.75}\) as black dotted graphs. We observe that the obtained approximation errors nearly decay as the theory suggests.

Next, we apply Algorithm 1 on the test function f using n randomly selected sampling nodes as sampling scheme. We do not compute the least squares solution directly but use the iterative method LSQR [54] on the matrix \({\mathbf {L}}_m\), \(m=\lfloor n/(4\log n) \rfloor \). The obtained sampling errors \({\tilde{g}}_n:=\Vert f-S^m_{{\mathbf {X}}}f \Vert _{L_2(\mathbb {T}^d)}\) are visualized in Fig. 3b as well as the graphs \(\sim n^{-0.75}(\log n)^{d\cdot 0.75}\) as dotted lines which correspond to the theoretical upper bounds \(n^{-0.75+\varepsilon }(\log n)^{d\cdot (0.75-\varepsilon )}\), cf. (8.7). We set the tolerance parameter of LSQR to \(5\cdot 10^{-8}\) and the maximum number of iterations to 100. For \(d=2\) and \(d=3\), the errors nearly decay like these bounds. For \(d=4\) and \(d=5\), the errors seem to decay slightly slower than the bounds. In order to investigate this further, we also plot the corresponding approximation errors \({\tilde{a}}_m\) with \(m=\lfloor n/(4\log n) \rfloor \) as thick dashed lines. We observe that these approximation errors \({\tilde{a}}_{m}\), which are the best possible errors that can be achieved in this setting, almost coincide with the sampling errors. This means that we might still observe preasymptotic behavior.

Fig. 4: Least squares aliasing errors and approximation errors for test function \(f\in H^{3/4-\varepsilon }_{\text {mix}}(\mathbb {T}^d)\)

For \(d=2\) spatial dimensions, we have a closer look at the sampling errors. In Fig. 4a, we again plot the approximation errors \({\tilde{a}}_{m}\), \(m=\lfloor n/(4\log n) \rfloor \). In addition, the aliasing errors \(\Vert P_{m-1}f-S^m_{{\mathbf {X}}}f \Vert _{L_2(\mathbb {T}^d)}\), \(m=\lfloor n/(4\log n) \rfloor \), which are the errors caused by Algorithm 1 since

$$\begin{aligned} \Vert f-S^m_{{\mathbf {X}}}f \Vert _{L_2(\mathbb {T}^d)}^2 = \Vert f-P_{m-1}f \Vert _{L_2(\mathbb {T}^d)}^2 + \Vert P_{m-1}f-S^m_{{\mathbf {X}}}f \Vert _{L_2(\mathbb {T}^d)}^2, \end{aligned}$$

are shown as triangles. We observe that the aliasing errors nearly decay like the approximation errors, and that they are about one order of magnitude smaller. This corresponds to the behavior we observed in Fig. 3b.

Moreover, we compare least squares using random point sets with least squares using quasi-random point sets. In particular, so-called Frolov lattices [30] are considered as sampling sets in \(d=2\) spatial dimensions and used in Algorithm 1. The resulting sampling errors almost coincide with the approximation errors. The aliasing errors are visualized in Fig. 4a as circles. It is remarkable that they decay similarly and are even lower than the aliasing errors for random nodes in most cases. A similar behavior can be observed for dimension \(d=3\), cf. Fig. 4b.

In addition, we consider Fibonacci lattices in dimension \(d=2\), cf. Fig. 4a. For \(n\ge 832\,040\), the matrices \({\mathbf {L}}_m\), \(m=\lfloor n/(4\log n) \rfloor \), contain at least two identical columns and correspondingly, the smallest eigenvalue of \({\mathbf {L}}_m^{*}\,{\mathbf {L}}_m\) is zero. Therefore, obtaining \(S^m_{{\mathbf {X}}}f\) via Algorithm 1 is not possible if the least squares solution is computed directly. An iterative method like LSQR may still work, but the number of iterations may have to be restricted. In Fig. 4a, the obtained aliasing errors via LSQR are shown as squares; they are smaller than in the other cases but seem to decay more slowly. However, when we decreased the tolerance parameter of the LSQR algorithm, we observed aliasing errors and sampling errors larger than 1 for \(n\ge 832\,040\).

10.1.2 Kink test function f from \(H^{3/2-\varepsilon }_{\text {mix}}(\mathbb {T}^d)\)

Next, we consider the kink test function \(f:\mathbb {T}^d\rightarrow \mathbb {R},\)

$$\begin{aligned} f({\mathbf {x}}) = \prod _{i=1}^d \left( \frac{15}{4\sqrt{3}} \cdot 5^{3/4} \cdot \max \left( \frac{1}{5} - \left( (x_i\bmod 1)-\frac{1}{2}\right) ^2, 0\right) \right) \in H^{3/2-\varepsilon }_{\text {mix}}(\mathbb {T}^d) \end{aligned}$$

with Fourier coefficients

$$\begin{aligned} {\hat{f}}_{{\mathbf {k}}} = \prod _{i=1}^d {\left\{ \begin{array}{ll} \frac{5^{5/4}\sqrt{3}}{8} \, (-1)^{k_i} \, \frac{\sqrt{5} \, \sin (2k_i\pi /\sqrt{5}) - 2 k_i \pi \cos (2k_i\pi /\sqrt{5}) }{\pi ^3 k_i^3} &{} \text {for } k_i\ne 0, \\ \frac{5^{1/4}}{\sqrt{3}}&{} \text {for } k_i=0. \end{array}\right. } \end{aligned}$$

Besides the different test function f, we use the same setting as before.
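Since the Fourier coefficients of this test function are given in closed form and f is \(L_2\)-normalized, the relative approximation errors \({\tilde{a}}_m\) can be evaluated via Parseval's identity. The following Python sketch (our own illustration; the concrete values of d and m are chosen only for demonstration) does this for one choice of m:

```python
import numpy as np
from itertools import product

def fhat_1d(k):
    """Univariate factor of the Fourier coefficients of the kink test function."""
    k = np.asarray(k, dtype=float)
    out = np.full_like(k, 5.0 ** 0.25 / np.sqrt(3.0))              # value for k_i = 0
    nz = k != 0
    kk = np.abs(k[nz])                                             # the coefficients are even in k
    out[nz] = (5.0 ** 1.25 * np.sqrt(3.0) / 8.0 * (-1.0) ** kk
               * (np.sqrt(5.0) * np.sin(2.0 * kk * np.pi / np.sqrt(5.0))
                  - 2.0 * kk * np.pi * np.cos(2.0 * kk * np.pi / np.sqrt(5.0)))
               / (np.pi ** 3 * kk ** 3))
    return out

d, m = 2, 123                                  # e.g. m = floor(n/(4 log n)) for n = 4096
w = lambda k: float(np.prod(np.sqrt(1.0 + np.asarray(k, dtype=float) ** 2)))
cand = sorted((k for k in product(range(-64, 65), repeat=d) if w(k) <= 64.0),
              key=lambda k: (w(k), k))
I_m = np.array(cand[:m])                       # index set as described in Sect. 10.1

# Parseval: ||f||_{L_2} = 1, hence a_m = sqrt(1 - sum_{k in I_m} |fhat_k|^2)
coeff_sq = np.prod(fhat_1d(I_m) ** 2, axis=1)
print("relative approximation error:", np.sqrt(1.0 - coeff_sq.sum()))
```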

Fig. 5: Approximation errors and least squares sampling errors for test function \(f\in H^{3/2-\varepsilon }_{\text {mix}}(\mathbb {T}^d)\)

In Fig. 5a, we visualize the relative approximation errors \({\tilde{a}}_m:=\Vert f-P_{m-1}f \Vert _{L_2(\mathbb {T}^d)}\) for spatial dimensions \(d=2,3,4,5\). These errors should decay like \(m^{-1.5+\varepsilon }(\log m)^{(d-1)\cdot (1.5-\varepsilon )}\) for sufficiently large m, and we observe that the obtained approximation errors nearly decay as the theoretical results suggest. Next, we apply Algorithm 1 with random nodes to the test function f using the iterative method LSQR. The resulting sampling errors \({\tilde{g}}_n:=\Vert f-S^m_{{\mathbf {X}}}f \Vert _{L_2(\mathbb {T}^d)}\) are depicted in Fig. 5b. In addition, the graphs \(\sim n^{-1.5}(\log n)^{d\cdot 1.5}\) are shown as dotted lines which roughly correspond to the theoretical upper bounds \(n^{-1.5+\varepsilon }(\log n)^{d\cdot (1.5-\varepsilon )}\). The errors seem to decay according to this bound for \(d=2\) and slower for \(d=3,4,5\). Again, we also plot the corresponding approximation errors \({\tilde{a}}_m\) with \(m=\lfloor n/(4\log n)\rfloor \) as thick dashed lines, and we observe that these approximation errors \({\tilde{a}}_{\lfloor n/(4\log n) \rfloor }\) almost coincide with the sampling errors. Correspondingly, we still have preasymptotic behavior for \(d=3,4,5\).

10.1.3 Test function f from \(H^{11/2-\varepsilon }_{\text {mix}}(\mathbb {T}^d)\)

As a third test function f, we consider an \(L_2\)-normalized product of periodic one-dimensional B-Splines of order 6, where each factor depends on a single variable and is a piecewise polynomial of degree 5, and therefore, we have \(f\in H^{11/2-\varepsilon }_{\text {mix}}(\mathbb {T}^d)\).

Fig. 6: Approximation errors and least squares sampling errors for test function \(f\in H^{11/2-\varepsilon }_{\text {mix}}(\mathbb {T}^d)\)

In Fig. 6a, the relative approximation errors \({\tilde{a}}_m:=\Vert f-P_{m-1}f \Vert _{L_2(\mathbb {T}^d)}\) are visualized for spatial dimensions \(d=2,3,4,5\), which roughly decay like \(m^{-5.5+\varepsilon }(\log m)^{(d-1)\cdot (5.5-\varepsilon )}\) for sufficiently large m. The sampling errors \({\tilde{g}}_n:=\Vert f-S^m_{{\mathbf {X}}}f \Vert _{L_2(\mathbb {T}^d)}\) when applying Algorithm 1 with random nodes to the test function f using the iterative method LSQR are depicted in Fig. 6b. In addition, the graphs \(\sim n^{-5.5}(\log n)^{d\cdot 5.5}\) are plotted as dotted lines which correspond to the theoretical upper bounds. The errors seem to decay roughly according to this bound for \(d=2\) and \(d=3\) or slightly slower. Again, the corresponding approximation errors \({\tilde{a}}_m\) with \(m=\lfloor n/(4\log n)\rfloor \) are shown as thick dashed lines, and we observe that these approximation errors \({\tilde{a}}_{\lfloor n/(4\log n) \rfloor }\) almost coincide with the sampling errors. This means we still have preasymptotic behavior.

10.2 Integration of Functions from Spaces with Mixed Smoothness

Extensive numerical tests on integration using random point sets were performed by Oettershagen [52], cf. in particular [52, Sec. 4.1] for tests concerning numerical integration in Sobolev spaces with mixed smoothness \(H^{s}_{\text {mix}}(\mathbb {T}^d)\). These numerical tests, performed for \(d\in \{2,4,8,16\}\) spatial dimensions and smoothness \(s\in \{1,2,3\}\), suggest that the worst-case cubature error may decay with a main rate of \(n^{-s}\) with additional log factors. This is a remarkable behavior since plain Monte-Carlo with random points usually leads to an error decay of \(n^{-1/2}\).

However, the corresponding theoretical results in [52, Sec. 4.2] only give a main rate of \(n^{-s+1/2}\). We highlight that our results obtained in this paper bridge this gap of 1/2 in the main rate, since we show a worst-case error of \(\sim n^{-s}(\log n)^{d\,s}\) in Corollary 8.6, i.e., our theoretical main rate corresponds to the observations by Oettershagen [52]. Moreover, Algorithm 1 guarantees suitable error bounds in the preasymptotic setting, cf. Corollary 8.6 together with (8.5). More details on cubature rules based on least squares additionally refined using some variance reduction technique can be found in [43].

10.3 Recovery and Integration for the Non-periodic Case

Now we consider the non-periodic situation. We use the Chebyshev measure \({\tilde{\varrho }}_D(\mathrm {d}{\mathbf {x}}):=\prod _{t=1}^d(\pi \sqrt{1-x_t^2})^{-1} \, \mathrm {d}{\mathbf {x}}\) as random sampling scheme on \(D = [-1,1]^d\). We further define the non-periodic space \({\tilde{H}}^s_{\text {mix}}([-1,1]^d)\) via the reproducing kernel

$$\begin{aligned} {\tilde{K}}^1_{s,*}(x,y):=1+2\sum \limits _{h\in \mathbb {N}} \frac{\cos (h\arccos (x))\cos (h\arccos (y))}{w_{s,*}(h)^2},\quad x,y\in [-1,1], \end{aligned}$$

and its tensor product

$$\begin{aligned} {\tilde{K}}^d_{s,*}({\mathbf {x}},{\mathbf {y}}):= {\tilde{K}}^1_{s,*}(x_1,y_1)\otimes \cdots \otimes {\tilde{K}}^1_{s,*}(x_d,y_d),\quad {\mathbf {x}},{\mathbf {y}}\in [-1,1]^d. \end{aligned}$$

The space \({\tilde{H}}^s_{\text {mix}}([-1,1]^d)\) is embedded into \(L_2(D,{\tilde{\varrho }}_D)\) if and only if \(s>1/2\). We denote the basis index of \(\eta _k\), \(k=1,\ldots ,m-1\), by \({\varvec{h}}_k:=\big (h_{k,t}\big )_{t=1}^d\in \mathbb {N}_0^d\); we have \(\eta _1 \equiv 1\) and

$$\begin{aligned} \eta _k({\mathbf {x}})=\prod _{t=1}^d \sqrt{2}^{\min \{h_{k,t},1\}} \cos \left( h_{k,t}\arccos x_t\right) \end{aligned}$$

if \(k>1\). Moreover, for the vector \({\varvec{b}}\in \mathbb {C}^{m-1}\) in (7.1), we have

$$\begin{aligned} b_k:=\int _D \eta _k\,\mathrm {d}\mu _D = \prod _{t=1}^d {\left\{ \begin{array}{ll} 2 &{} \text {for } h_{k,t}=0,\\ \frac{2\sqrt{2}}{1-h_{k,t}^2} &{} \text {for } h_{k,t}\in 2\mathbb {N},\\ 0 &{} \text {for } h_{k,t}\in 2\mathbb {N}-1, \end{array}\right. } \end{aligned}$$
(10.1)

for \(\mu _D\equiv 1\).
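The following Python sketch (our own illustration; the index set and the test function are ad hoc choices for demonstration) draws Chebyshev-distributed nodes via \(x=\cos (\pi u)\) with u uniformly distributed, evaluates the basis functions \(\eta _k\), assembles the vector \({\varvec{b}}\) from (10.1), and approximates I(f) by \(\sum _k c_k b_k\) with the least squares coefficients \(c_k\):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 2000
X = np.cos(np.pi * rng.random((n, d)))                 # nodes distributed by the Chebyshev measure

def eta(h, X):
    """Tensor Chebyshev basis function for the index vector h in N_0^d."""
    vals = np.ones(X.shape[0])
    for t, ht in enumerate(h):
        vals *= np.sqrt(2.0) ** min(ht, 1) * np.cos(ht * np.arccos(X[:, t]))
    return vals

def b_entry(h):
    """b_k = int_D eta_k dx with mu_D = 1, cf. (10.1)."""
    out = 1.0
    for ht in h:
        if ht == 0:
            out *= 2.0
        elif ht % 2 == 0:
            out *= 2.0 * np.sqrt(2.0) / (1.0 - ht ** 2)
        else:
            return 0.0
    return out

H = list(np.ndindex(8, 8))                             # tensor-product index set for d = 2 (illustrative only)
L = np.column_stack([eta(h, X) for h in H])            # matrix L_m for the Chebyshev system
b = np.array([b_entry(h) for h in H])

f = lambda X: np.prod(np.exp(X), axis=1)               # smooth test function on [-1,1]^d
c = np.linalg.lstsq(L, f(X), rcond=None)[0]            # least squares coefficients
print("Q_X f =", c @ b, " exact:", (np.e - 1.0 / np.e) ** d)
```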

Fig. 7: Dilated, scaled, and shifted B-spline of order 2 considered in the interval \([-1,1]\)

For the numerical experiments, we consider the test function

$$\begin{aligned} f:\mathbb {R}^d\rightarrow \mathbb {R}, \quad f({\mathbf {x}}) := \left( \frac{3\pi }{49\pi -48\sqrt{3}}\right) ^{d/2} \, \prod _{i=1}^d {\left\{ \begin{array}{ll} 5+2x_i &{} \text {for } -5/2\le x_i< -1/2, \\ 3-2x_i &{} \text {for } -1/2 \le x_i < 3/2, \\ 0 &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$

on \(D:=[-1,1]^d\), where the one-dimensional version is depicted in Fig. 7. For the Chebyshev coefficients of f, we have

$$\begin{aligned} {\hat{f}}_{{\mathbf {k}}} = \left( \frac{3\pi }{49\pi -48\sqrt{3}}\right) ^{d/2} \, \prod _{i=1}^d {\left\{ \begin{array}{ll} 4\cdot \dfrac{\sqrt{3}\,k_i\cos (2\pi k_i/3) + \sin (2\pi k_i/3)}{\sqrt{2}(-k_i + k_i^3)\pi } &{} \text {for } k_i\ge 2, \\ -(3\sqrt{6} + 2\sqrt{2}\pi )/(6\pi ) &{} \text {for } k_i=1, \\ 11/3 - 2\sqrt{3}/\pi &{} \text {for } k_i=0, \end{array}\right. } \end{aligned}$$

and consequently, \(f\in {\tilde{H}}^{3/2-\varepsilon }_{\text {mix}}([-1,1]^d)\) for \(\varepsilon >0\).

Fig. 8: Realizations of random nodes with respect to the Chebyshev measure \({\tilde{\varrho }}_D({\mathbf {x}})\)

Fig. 9: Sampling errors and integration errors for non-periodic test function \(f\in {\tilde{H}}^{3/2-\varepsilon }_{\text {mix}}([-1,1]^d)\)

We use the parameters as in Sect. 10.1 and apply Algorithm 1 on the test function f in the non-periodic setting, where we generate the random nodes with respect to the measure \({\tilde{\varrho }}_D(\mathrm {d}{\mathbf {x}}):=\prod _{i=1}^d(\pi \sqrt{1-x_i^2})^{-1} \, \mathrm {d}{\mathbf {x}}\). In Fig. 8, we show realizations for these random nodes. As before, we do not compute the least squares solution directly but use the iterative method LSQR on the matrix \({\mathbf {L}}_m\), \(m=\lfloor n/(4\log n)\rfloor \). The obtained sampling errors \({\tilde{g}}_n:=\Vert f-S^m_{{\mathbf {X}}}f\Vert _{L_2([-1,1]^d,\varrho _D)}\) with \(m=\lfloor n/(4\log n)\rfloor \) are plotted in Fig. 9a as triangles as well as the corresponding approximation errors \({\tilde{a}}_m:=\Vert f-P_{m-1} f \Vert _{L_2([-1,1]^d,\varrho _D)}\) as thick dashed lines. We observe that the sampling and approximation errors almost coincide. Moreover, we plot the graphs \(\sim n^{-1.5}(\log n)^{d\cdot 1.5}\) as dotted lines which correspond to the expected theoretical upper bounds \(n^{-1.5+\varepsilon }(\log n)^{d\cdot (1.5-\varepsilon )}\). We observe that the obtained numerical errors nearly decay like these theoretical upper bounds.

In addition, we use the numerically computed Chebyshev coefficients \(c_k\) from Algorithm 1 to compute the approximation \(Q_{\mathbf {X}}f\) of I(f) by \(Q_{\mathbf {X}}f=\int _D S_{\mathbf {X}}^mf\,\mathrm {d}\mu _D=\sum _{k=1}^{m-1} c_k \, b_k\), where the numbers \(b_k\) are calculated as stated in (10.1). We perform each test 100 times with different random nodes. The averages for the integration errors \(|I(f) - Q_{\mathbf {X}}f|\) of the 100 test runs are depicted in Fig. 9b and the maxima as error bars. Moreover, we plot the graphs \(\sim n^{-2}(\log n)^{d\cdot 2}\) as dotted lines, and we observe that the obtained integration errors approximately decay like these graphs. For comparison, we also plot \(n^{-1.5} (\log n)^{d\cdot 1.5}\) for \(d=2\) and \(d=5\) as thick solid lines which belong to the theoretical results \(n^{-1.5+\varepsilon } (\log n)^{d\cdot (1.5-\varepsilon )}\) one obtains analogously to (8.9) in Sect. 8 for the non-periodic case. These thick solid lines decay distinctly more slowly.

In particular, we strongly expect that the theoretical preasymptotic results in (8.5) and [33] also hold for the Chebyshev case.