Worst case recovery guarantees for least squares approximation using random samples

We consider a least squares regression algorithm for the recovery of complex-valued functions belonging to a reproducing kernel Hilbert space $H(K)$ from random data measuring the error in $L_2(D,\varrho_D)$. We prove worst-case recovery guarantees with high probability and improve on the recent new upper bounds by Krieg, M. Ullrich for general sampling numbers by explicitly controlling all the involved constants with respect to the underlying spatial dimension $d$. This leads to new preasymptotic recovery bounds with high probability for the error of {Hyperbolic Fourier Regression} for multivariate functions. In addition, we analyze the algorithm {Hyperbolic Wavelet Regression} also based on least-squares to recover non-periodic functions from random samples. As a further application we reconsider the analysis of a cubature method based on plain random points with optimal weights introduced by Oettershagen in 2017. We confirm a conjecture (which was based on various numerical experiments) and give improved near-optimal worst-case error bounds with high probability. It turns out that this simple method can compete with the quasi-Monte Carlo methods in the literature which are based on lattices and digital nets. Last but not least, we contribute new preasymptotic bounds to the problem of the recovery of individual functions from $n$ samples which has been already considered by Bohn, Griebel; Cohen, Davenport, Leviatan; Chkifa, Migliorati, Nobile, Tempone; Cohen, Migliorati; Krieg and several others.


Introduction
We consider the problem of learning complex-valued multivariate function on a domain D ⊂ R d from function samples on the nodes X = {x 1 , ..., x n } ⊂ D, which are drawn independently according to a probability measure ̺ D . This may be interpreted as reconstructing a function from so-called scattered data, see [50]. The functions are modeled as elements from some reproducing kernel Hilbert space (RKHS) H(K) with kernel K(·, ·), see [1]. The error is measured in L 2 (D, ̺ D ). This problem has been considered by several authors in the literature, e.g. Bohn [2,3], Bohn, Griebel [4], Cohen, Davenport, Leviatan [9], Chkifa, Migliorati, Nobile, Tempone [7], Cohen, Migliorati [11] and many others to mention just a few. In these contributions the authors studied the problem of recovering an individual function from random observations. Our main focus is the uniform recovery of functions belonging to a class H(K) and to give worst-case recovery guarantees. In fact, the results below have a flavor similar to the uniform recovery results in compressed sensing [18], where an RIP matrix is drawn once such that with high probability every s−sparse vector can be recovered from measurements using this matrix. Accordingly, we recover all f ∈ H(K) from sampled values at X = (x 1 , ..., x n ) simultaneously with high probability, where the error is measured in L 2 (D, ̺ D ). To be more precise, we provide bounds for the worst-case error We construct a recovery operator S m X which computes a best least squares fit S m X f to the given data (f (x 1 ), ..., f (x n )) ⊤ from the finite-dimensional space spanned by the first m singular vectors of the embedding Id : H(K) → L 2 (D, ̺ D ) . (1.1) The right singular vectors e 1 (·), e 2 (·), ... of this embedding are arranged according to their importance, i.e., the non-increasing rearrangement of the singular values σ 1 ≥ σ 2 ≥ · · · > 0. As already mentioned, the nodes X are drawn at random from the domain D according to ̺ D . However, in contrast to the Monte-Carlo setting we use the same reconstruction operator S m X for the entire class H(K). The investigations of this paper are inspired by the recent results by Krieg and M. Ullrich [25], where an asymptotically near optimal least squares algorithm is presented, see the discussion in Section 8.3. In this paper we extend and improve the results from [25] in several directions. In particular, we investigate the least squares regression and give practically useful parameter choices which lead to a controlled failure probability and explicit error bounds.
A typical error bound relates the worst-case recovery error to the sequence of singular numbers (σ k ) k∈N of the embedding (1.1) which represent the approximation numbers or linear widths. One main contribution of this paper is the following general bound, where all constants are determined precisely under mild conditions. Recall that (e k (·)) k denotes the sequence of right singular vectors of the embedding (1.1), i.e., the eigenfunctions of Id * • Id : H(K) → H(K). In cases where T (m) ∼ n σ 2 m up to logarithmic factors we observe nearly optimal error bounds with respect to the first m − 1 basis functions e k (·) that are used to compute S m X f ∈ span{e 1 , . . . , e m−1 }. In addition, for N(m) ∼ m we also achieve near optimal error bounds with respect to the number of used sampling values n. Although the statement of Theorem 1.1 is highly general we may extract new bounds in several relevant situations. In this paper, we discuss Hyperbolic Fourier Regression and Hyperbolic Wavelet Regression in detail. Note that if the system of right singular vectors (e k ) k in Theorem 1.1 has the additional property that the orthonormal system (η k ) k (in L 2 (D, ̺ D )) given by η k := σ −1 k e k is uniformly bounded, i.e. η k ∞ ≤ B for all k, then the bounds in Theorem 1.1 can be made much more precise, see Section 8.2 below. In fact, as a consequence we do not only see the new asymptotic bound in [25], we also extend it to a preasymptotic range since we have complete knowledge of all involved constants. Furthermore, our result is no longer restricted to the case of real-valued functions. In the proof we combine a modification of a recent concentration inequality for sums of Hermitian matrices due to Oliveira [38] with a symmetrization technique involving the complex version of Rudelson's lemma, see Lemma 4.3 below. The precise constants in this powerful inequality have been computed by Rauhut [41].
Note, that in case of the complex Fourier system we have B = 1 and obtain dimension-free constants. We emphasize that within our worst-case error analysis we keep track of all the involved constants and give preasymptotic bounds where it is possible. This allows for a priori estimates on the number of required samples and arithmetic operations in order to ensure accuracy ε > 0 with our concrete algorithm. In particular for Hyperbolic Fourier Regression, we discuss and apply recent preasymptotic bounds for the singular values (σ k ) k∈N , see [27,28] and [23]. For the non-periodic situation we will use a modification of the algorithm called Hyperbolic Wavelet Regression where we use orthonormal compactly supported wavelets which are not uniformly bounded in L ∞ . However, the quantities N(m) and T (m) can be controlled in this situation. In particular, the matrices appearing in the least squares problem are very sparse which gives an additional acceleration for the runtime of the algorithm.
In the above result we control the failure probability and sample with respect to the known density ̺ D . One may implement and use this result in practice since the error guarantees are reliable with high probability, see Section 10. In addition, a technique called "optimal importance sampling", see e.g. [19,42,43,32], turns to be useful in this context. As proposed in [11] and specified in full detail in [25], one may sample from a reweighted distribution/density ̺ m , which is different for any m and depends on the spectral properties of the embedding (1.1). It determines the important "area" to sample. In other words, we incorporate additional knowledge about the spectral properties of our embedding. A refinement of this technique together with our Theorem 1.1 above leads to the following accurate bound on the sampling numbers g n (Id : H(K) → L 2 (D, ̺ D )), defined in Section 5.3, under even weaker conditions compared to Theorem 1.1.

the general bound
Note, that the maximum in (1.2) is really required since the second quantity may behave better in order than the first one, which, in fact, represents a natural benchmark in our setting. The result refines the bound in [25] as we give precise constants here for the sampling numbers in a very general situation. Assume for the moment the finiteness of sup d ∞ j=1 σ τ j,d for some 0 < τ ≤ 2 with respect to a d-indexed family of approximation problems, see [34]. As a direct consequence of (1.2) we then obtain polynomial tractability with respect to standard information. In addition, the tractability exponent with respect to standard information is arbitrarily close to the exponent with respect to linear information in the framework of Open Problem 127 in [36], see Section 5.4 below. The quantity T (m) from Theorem 1.1 can be replaced in terms of singular values (σ k ) k since we sampled with respect to an "envelope density" adapted to the specific problem and different from ̺ D . Although we stated the above result only in terms of sampling numbers, the underlying recovery operator is constructive and the determined error bounds hold with high probability. Computing this envelope density (and sample from it) has been studied in [11,Sect. 5]. However, as we will see below, in case of a compactly supported wavelet system or the classical complex Fourier system it is enough to sample from the target density ̺ D to obtain similar bounds.
Algorithmically, the coefficients c : k e k can be obtained by computing the least squares solution of the (over-determined) linear system of equations L m c = (f (x j )) n j=1 , where L m := σ −1 k e k (x j ) n; m−1 j=1; k=1 ∈ C n×(m−1) . In order to solve this linear system of equations, one can apply a standard conjugate gradient type iterative algorithm, e.g., LSQR [39]. The corresponding arithmetic costs are bounded from above by C R m n < C R n 2 , where C > 0 is an absolute constant and R the number of iterations. Our proof of Theorem 1.1 includes estimates of the singular values of the matrix L m , which yields that the condition number of L m is bounded from above by √ 3 with probability at least 1 − δ, i.e., only a small number R of iterations is required. The arithmetic cost of the algorithm will be further reduced if the underlying system matrix L m is sparse, which is the case in the wavelet setting when using compactly supported wavelets instead of classical Fourier basis polynomials, see Remark 9.2 below.
As a special case we consider the recovery of functions from periodic Sobolev spaces with mixed smoothness. This problem has been investigated by many authors in the last 30 years, see [13,Sect. 5] and the references therein. There is recent progress in [25], which motivated the investigations in this paper. The authors showed the existence of a point set and a recovery algorithm with improved asymptotic worst-case errors, see Section 8.3 for a detailed discussion of this result. In this paper, we use the simple least squares algorithm from [25,2], and we show that using random points makes it also possible to obtain explicit worst case recovery guarantees even in the preasymptotic range. The analysis benefits from the fact that the underlying eigenvector system is a bounded orthonormal system (BOS). However, we demonstrate in Section 9 that this assumption is not necessary for our analysis. We show that Hyperbolic wavelet regression recovers non-periodic functions belonging to a Sobolev space with mixed smoothness H s mix from n nodes X drawn according to ̺ D in the worst-case with the same asymptotic rate. The proposed approach achieves rates that are only worse by log s n in comparison to the optimal rates achieved by hyperbolic wavelet best approximation studied in [16] and [46]. The above general bound on g n can for instance be used for any non-periodic embedding where H s mix ([0, 1] d ) can be represented in various ways as a reproducing kernel Hilbert space satisfying the requirements of the above theorem, see the concrete collection of examples in [1, 7.4]. Plugging in well-known upper bounds on the singular numbers we improve on the asymptotic sampling bounds by Dinh Dũng [12] and Byrenheid [5]. In addition, using refined preasymptotic estimates for the (σ j ) j in the non-periodic situation (see [23, 4.3]) may be also able to obtain reasonable bounds for g n in case of small n.
In addition to the worst-case error, we also consider the error for recovering individual functions belonging to reproducing kernel Hilbert spaces. Now we deal with a least squares Monte-Carlo method, which has been already considered in the literature, see for instance [9], [2], [7] and others. We complement the results from [9] and put them into a multivariate context.
A further application of our least squares method is in the field of numerical integration. Oettershagen [37] used a least squares optimization approach in order to compute the nearoptimal (complex-valued) weights (q 1 , ..., q n ) in a cubature formula, i.e., where µ D is the measure for which we want to compute the integral. The integration nodes X = (x 1 , ..., x n ) are determined in advance. Clearly, the bounds from Theorem 1.1, 1.2 can be literally transferred to estimate the worst-case integration error, see Corollary 8.11 below. In fact, by our refined analysis we were able to settle a conjecture in [37] that the worst-case integration error for H s mix (T d ) is bounded in order by n −s log(n) ds with high probability. This conjecture was based on the outcome of several numerical experiments (described in [37]) where the worst-case error has been simulated using the RKHS Riesz representer. It is remarkable in two respects. First, it is possible to benefit from higher order smoothness by using plain random points. Note that this simple method can compete with most of the quasi-Monte Carlo methods based on lattices and digital nets studied in the literature, see [17, pp. 195, 247]. Moreover, if s < (d − 1)/2 we get a better asymptotic rate than sparse grid integration which is shown to be in the order of n −s log(n) (d−1)(s+1/2) , see [14]. Second, from well-known results in Monte-Carlo integration (which is actually not the case here) one would certainly expect an order of n −1/2 when using the same samples.
We practically verify our theoretical findings with several numerical experiments. There we compare the recovery error for the least squares regression method S m X compared to the optimal error given by the projection on the eigenvector space. We also study a non-periodic regime, where we randomly sample points according to the Chebyshev measure.
Outline. In Section 2 we describe the setting in which we want to perform the worst-case analysis. There we use the framework of reproducing kernel Hilbert spaces of complexvalued functions. Section 3 is devoted to the least squares algorithm, where the worstcase analysis is given in Sections 5,8,and 9. In the first place, we present the general results in Section 5. Section 8 considers the particular case of hyperbolic Fourier regression. In Section 9 we investigate the particular case of non-periodic functions with a bounded mixed derivative and their recovery via hyperbolic wavelet regression. The main tools from probability theory, like concentration inequalities for spectral norms and Rudelson's lemma, are provided in Section 4. The analysis of the recovery of individual functions (Monte Carlo) is given in Section 6. Consequences for optimally weighted numerical integration based on plain random points are given in Section 7. Finally, the numerical experiments are shown and discussed in Section 10.
Notation. As usual N denotes the natural numbers, N 0 := N∪{0}, Z denotes the integers, R the real numbers, and C the complex numbers. C n denotes the complex n-space, whereas C m×n denotes the set of all m × n-matrices L with complex entries. The spectral norm (i.e. the largest singular value) of matrices L is denoted by L or L 2→2 . Vectors and matrices are usually typesetted bold with x, y ∈ R n or C n . For 0 < p ≤ ∞ and x ∈ C n we denote x p := ( n i=1 |x i | p ) 1/p with the usual modification in the case p = ∞. If T : X → Y is a continuous operator we write T : X → Y for its operator (quasi-)norm. For two sequences (a n ) ∞ n=1 , (b n ) ∞ n=1 ⊂ R we write a n b n if there exists a constant c > 0 such that a n ≤ c b n for all n. We will write a n ≍ b n if a n b n and b n a n .

The setting
We will work in the framework of reproducing kernel Hilbert spaces. The relevant theoretical background can be found in [1,Chapt. 1]. Let L 2 (D, ̺ D ) be the space of complex-valued square-integrable functions with respect to ̺ D . Here D ⊂ R d is a domain and ̺ D a probability measure. We further consider a reproducing kernel Hilbert space H(K) with a Hermitian kernel K(x, y) on D × D. We recall some basic facts about RKHS, see for instance [1, 1.5]. The identity for all x ∈ D, ensures that point evaluations are continuous functionals on H(K). In some cases the boundedness of the kernel is required, i.e., The finiteness of C K implies that H(K) is embedded into L ∞ (D).
The embedding operator Id : H(K) → L 2 (D, ̺ D ) is already continuous under a much weaker condition, namely We will see below, that under this condition the operator Id is even compact and the singular numbers (σ j ) j belong to ℓ 2 . Indeed, let Id * be defined as Then W ̺ D := Id * • Id : H(K) → H(K) is given by This operator is non-negative definite. Let {(λ j , e j )} j∈N denote the complete orthonormal system of eigenvectors with corresponding non-negative real eigenvalues. In fact, W e j = λ j e j and e j , e k H(K) = δ j,k . The eigenvalues (and corresponding eigenfunctions) are arranged in non-increasing order, i.e., λ 1 ≥ λ 2 ≥ λ 3 ≥ · · · ≥ 0.
We will arrange the eigenfunctions {e k } k∈N accordingly. Note, that we have in particular It further holds that e j , e k L 2 = Id(e j ), Id(e k ) L 2 = W e j , e k H(K) = λ j e j , e k H(K) = λ j δ j,k .
Hence, {e j } j∈N is also orthogonal in L 2 (D, ̺ D ) and e j 2 = λ j =: σ j . Due to the fact, that we get that λ j → 0 and hence the compactness of W : Note also, that we have for x, y ∈ D that with convergence in C. See for instance [1,Thm. 14]. It is important to mention that this identity holds although the corresponding orthonormal system η k := λ will play a fundamental role. Let us point out that the function converges pointwise in C, see [1,Thm. 14]. Hence, N(m) and T (m) are well-defined if (2.1) is assumed. In fact, T (m) is bounded by C K . It may happen that the system {η k } k is a uniformly L ∞ (D) bounded orthonormal system (BOS), i.e., we have for all k ∈ N η k ∞ ≤ B .

Least squares regression
Our algorithm essentially boils down to the solution of an over-determined system where L ∈ C n×m is a matrix with n > m. It is well-known that the above system may not have a solution. However, we can ask for the vector c which minimizes the residual f − L · c 2 . Multiplying the system with L * gives which is called the system of normal equations. If L has full rank then the unique solution of the least squares problem is given by From the fact that the singular values of L are bounded away from zero we get the following quantitative bound on the spectral norm of the Moore-Penrose inverse (L * L) −1 L * .
Proposition 3.1. Let L ∈ C n×m be a matrix with m < n with full rank and singular values τ 1 , ..., τ m > 0 arranged in non-increasing order.
(iii) The operator norm (L * L) −1 L * can be controlled as follows Proof. We start with L = U * ΣV and write We finally putΣ := (Σ * Σ) −1 Σ * and immediately observe the desired properties.
For function recovery we will use the following matrix for a set X = {x 1 , ..., x n } ⊂ D of distinct sampling nodes and the system {η k } := {λ −1/2 k e k }. Below we will see that this matrix behaves well with high probability if n is large enough and x 1 , ..., x n are chosen independently and identically ̺ D -distributed from D.

Algorithm 1 Least squares regression.
Input: X = {x 1 , ..., x n } ⊂ D set of distinct sampling nodes, f = (f (x 1 ), ..., f (x n )) ⊤ samples of f evaluated at the nodes from X, m ∈ N m < n such that the matrix L m := L m (X) from (3.1) has full (column) rank.
Solve the over-determined linear system via least squares (e.g. directly or via the LSQR algorithm [39]), i.e., compute Using Algorithm 1, we compute the coefficients c k , k = 1, . . . , m − 1, of the approximant Note, that the mapping f → S m X f is linear for a fixed set of sampling nodes

Concentration inequalities
We will consider complex-valued random variables X and random vectors (X 1 , ..., X N ) on a probability space (Ω, A, P). As usual we will denote with EX the expectation of X. With P(A|B) and E(X|B) we denote the conditional probability and the conditional expectation Let us start with the classical Markov inequality. If Z is a random variable then Of course, there is also a version involving conditional probability and expectation. In fact, Let us state concentration inequalities for the norm of sums of complex rank-1 matrices. For the first result we refer to Oliveira [38]. We will need the following notational convention: For a complex (column) vector y ∈ C N (or ℓ 2 ) we will often use the tensor notation for the matrix y ⊗ y := y · y * = y · y ⊤ ∈ C N ×N (or C N×N ) .
Proposition 4.1. Let y i , i = 1, ..., n, be i.i.d. copies of a random vector y ∈ C N such that y i 2 ≤ M almost surely. Let further E(y i ⊗ y i ) = Λ ∈ C N ×N and 0 < t < 1. Then it holds Proof. In order to show the probability estimate, we refer to the proof of [38, Lem. 1] and observe and, finally, the assertion holds.
Remark 4.2. A slightly stronger version for the case of real matrices can be found in Cohen, Davenport, Leviatan [9] (see also the correction in [10]). For y i , i = 1, ..., n, i.i.d. copies of a random vector y ∈ R N sampled from a bounded orthonormal system, one obtains the concentration inequality where c t = (1 + t)(ln(1 + t)) − t. This leads to improved constants for the case of real matrices.
.., n, and ε i independent Rademacher variables taking values ±1 with equal probability. Then Remark 4.4. The result is proved for complex (finite) matrices. Note, that the factor log(8 min{n, N}) is already an upper bound for log(8r), where r is the rank of the matrix i y i ⊗ y i . The proof of Proposition 4.3 with the precise constant is based on [41, Lem. 6.18] which itself is based on a noncommutative Khintchine inequality, see [41, 6.5]. This technique allows for controlling all the involved constants. Let us comment on the situation N = ∞, i.e., y j ∈ ℓ 2 , where this inequality keeps valid with the factor log(8n). In fact, if the matrices B j in [41, Thm. 6.14] are replaced by rank-1-operators B j : ℓ 2 → ℓ 2 of type B j = y j ⊗ y j with y j 2 < ∞ then all the arguments keep valid and an ℓ 2 -version of this noncommutative Khintchine inequality is available. This implies an ℓ 2 -version of [41,Lem. 6.18] which reads as follows: Let y j ∈ ℓ 2 , j = 1, ..., n, and p ≥ 2. Then Since we control the moments of the random variable representing the norm on the left-hand side we are now able to derive a concentration inequality by standard arguments (Proposition [41,Prop. 6.5]). This concentration inequality then easily implies the ℓ 2 -version of (4.3).
As a consequence of this result we obtain the following deviation inequality in the mean which will be sufficient for our purpose.
Proof. By well-known symmetrization technique (see [18,Lem. 8.4]), Proposition 4.3, and the Cauchy-Schwarz inequality, we obtain Hence, we get F 2 ≤ a 2 (F + b) and we solve this inequality with respect to F , which yields and this corresponds to the assertion.
5 Worst-case errors for least-squares regression

Random matrices from sampled orthonormal systems
Let us start with a concentration inequality for the spectral norm of a matrix of type (3.1). It turns out that the complex matrix L m := L m (X) ∈ C n×(m−1) has full rank with high probability, where x 1 , ..., x n ∈ D are drawn i.i.d. at random according to ̺ D . We will find below that the eigenvalues of are bounded away from zero with high probability if m is small enough compared to n. We speak of an "oversampling factor" n/m. In case of a bounded orthonormal system with BOS constant B it will turn out that a logarithmic oversampling is sufficient, see (5.2) below. Note, that the boundedness constant B may also depend on the underlying spatial dimension d. However, if for instance the complex Fourier system {exp(2πi k · x) : k ∈ Z d } is considered we are in the comfortable situation that B = 1.
Proof. We set y i := η 1 (x i ), . . . , η m−1 (x i ) * , i = 1, ..., n, and observe Moreover, due to the fact that we have an orthonormal system {η k } k∈N , we obtain that E(H m ) = I m . The result follows by noting that M 2 ≤ N(m) in Proposition 4.1.
Remark 5.2. From this proposition we immediately obtain that the matrix H m ∈ C (m−1)×(m−1) has only eigenvalues larger than t := 1/2 with probability at least 1 − δ if So, in case of a bounded orthonormal system with BOS constant B > 0, we may choose From Proposition 5.1 we get that all m−1 singular values τ 1 , ..., τ m−1 of L m from (3.1) are all not smaller than n/2 and not larger than 3n/2 with probability at least 1 − δ if m is chosen such that (5.1) holds. In terms of Proposition 3.1 this means that τ 1 , ..., τ m−1 ≥ n/2. This leads to an upper bound on the norm of the Moore-Penrose inverse that is required for the least squares algorithm. Proposition 5.3. Let {η 1 (·), η 2 (·), η 3 (·), ...} be the orthonormal system in L 2 (D, ̺ D ) induced by the kernel K. Let further m, n ∈ N, m ≥ 2, and 0 < δ < 1 be chosen such that they satisfy (5.1). Then the random matrix L m from (3.1) satisfies In addition to the matrix L m , we need to consider a second linear operator that is defined using sampling values of the eigenfunctions e j . The importance of this operator has been pointed out in [25], where strong results on the concentration of infinite dimensional random matrices have been used. Since we only need the expectation of the norm, we only use Rudelson's lemma, see Proposition 4.3, and a symmetrization technique. This allows us to control the constants.

Worst-case errors with high probability
Theorem 5.5. Let H(K) be a reproducing kernel Hilbert space on a domain D ⊂ R d with a positive semidefinite kernel K(x, y) such that sup x∈D K(x, x) < ∞. Let further 0 < δ < 1 and m, n ∈ N, where m ≥ 2 is chosen such that (5.1) holds. Drawing {x 1 , ..., x n } i.i.d. at random according to ̺ D , we have for the conditional expectation of the worst-case error .
Using orthogonality and the reproducing property S m f, e j e j and by applying Proposition 5.3, we continue where we use (2.2). Integrating on both sides yields Taking Proposition 5.4 and (4.1) into account and noting that P( H m − I m ≤ 1/2) is larger than 1 − δ we obtain the assertion.
In addition to that we may easily get a deviation inequality by using Markov's inequality and standard arguments. It reads as follows.
Corollary 5.6. Under the same assumptions as in Theorem 5.5 it holds for fixed δ > 0 P sup Proof. We define the events Treating each summand separately, we have Next we estimate P(A ∁ |B) using the Markov inequality (4.2), Theorem 5.5, and setting and the assertion follows.

Optimal sampling recovery of functions
We are interested in the question of optimal sampling recovery of functions from reproducing kernel Hilbert spaces in L 2 (D, ̺ D ). The quantity we want to study is classically given by We formulate the following theorem which is a direct consequence of Theorem 5.5 above.
In fact, we will use a technique which is used in the literature under the term "(optimal) importance sampling". It has been first applied in the least squares setting by Cohen and Migliorati in [11]. Authors also applied this technique e.g. for the approximation of integrals, see [19]. Also in connection with compressed sensing this has been successfully applied, see [43,42].
In particular, we use the (optimal) density function (sampling strategy) from [25], which is an essential ingredient for the following considerations. Furthermore, we introduce a new reproducing kernel Hilbert space H( K m ) based on this density function. To this situation Theorem 5.5 is applied, and we obtain Theorem 5.7, which improves the asymptotic results in [25] by determining explicit constants. defined on D such that for some non-trivial measure ̺ D on D. We denote with Id : H(K) → L 2 (D, ̺ D ) the corresponding compact embedding and with (σ j ) j its non-increasing sequence of singular numbers. Then we have for n ∈ N and Proof. We emphasize that the right hand side in (5.7) makes sense since we know from (2.3) that the sequence of singular numbers is square summable. The crucial modification of [11, (32)] which has been presented in [25] leads to the family of density functions ̺ m (x) : D → R, defined for any m ∈ N as By [1, Thm. 14] we know that ̺ m is defined pointwise. Moreover, we easily see that D ̺ m (x)̺ D (dx) = 1. This particular choice of the sampling density as well as the general proof strategy has already been used in [25]. The functions e j are the right singular vectors of the compact embedding Id : H(K) → L 2 (D, ̺ D ), λ j represent the ordered eigenvalues of Id * • Id : H(K) → H(K), and η j = λ −1/2 j e j . Let us define the kernel K m (x, y) via with the corresponding weighted space L 2 (D, µ m ) . Moreover, due to the definition of ̺ m (x) and Cauchy-Schwarz inequality we have If ̺ m (x) happens to be zero then we put K m (x, ·) := 0 and K m (·, x) := 0. By definition of the measure µ m it is also clear that so this event happens "almost" never. In other words, when randomly drawing points according to µ m then we observe ̺ m (x) = 0 almost surely. Thus the kernel K m (x, y) fits the requirements in Theorem 5.5. In fact, in Theorem 5.5 it is necessary that N (m) and T (m) are well-defined and that we have access to function values at randomly drawn nodes (with respect to the measure µ m on D) in order to create the matrices L m and take the function values f (x j ), j = 1, ..., n. Let us discuss the quantities N (m) and T (m) appearing in Theorem 5.5 for this new kernel K m (x, y) first. It is clear, that the embedding Id : H( K m ) → L 2 (D, µ m ) has the same spectral properties than the original embedding. The singular vectorsẽ k (·) andη k (·) are slightly different. They are the original ones divided by ̺ m (·). In fact, Furthermore, In order to define the new reconstruction operator S m X we need to create the matrices L m using the new function systemη k and take function evaluations f (x 1 ), ..., f (x n ). In more detail, we solve the least squares problem We stress that S m X f and the direct computation of S m X f using L m · c = f may not coincide since both are based on different least squares problems in general.
Due to (5.10), an estimate on g − S m X (g) L 2 (D,µm) also holds true for the error of the approximation S m X f . In addition to the considerations in Theorem 5.5, we need to avoid division by zero which leads to the conditional expectation Hence, this technical modification does not play a role for the final estimate in Theorem 5.5. Moreover, whenever f H(K) ≤ 1 the function g := f / √ ̺ m satisfies g H( Km) ≤ 1. In fact, it is easy to show that f, e k H(K) = g,ẽ k H( Km) for all k. Since both systems are an orthonormal basis in the respective space we see the equality of norms. It remains to note that for fixed n and m as in (5.6) we have for X = (x 1 , ..., Applying Theorem 5.5 with δ = e −5 and the above mentioned slight modification gives where in the last but one estimate we took (5.9) into account. Finally, the estimate on the conditional expectation implies the existence of a set X = (x 1 , ..., x n ) such that the above bound holds.

Comments on tractability
Let us discuss some issues on the tractability of this problem. For the necessary notions and definitions from the field of information based complexity see [34,35,36]. First, we comment on an observation from [25]. Assume, d is fixed and we have a polynomial decay of the (σ n ) n , i.e., there are positive numbers α > 1/2, β > 0 such that we obtain a similar bound for the sampling numbers. Indeed, from Theorem 5.7 we obtain by an elementary calculation g n n −α (log n) α+β .
In other words, it follows that α std 2,̺ D = α lin 2,̺ D , i.e., the polynomial order of convergence for standard information is the same as for linear information if the kernel is integrable. This problem has been addressed in [29] and [36,Open Problem 126]. Note that the integrability of the kernel is equivalent to the fact that the eigenvalues of Id * • Id : H(K) → H(K) are summable (nuclear operator).
Secondly, let us comment on polynomial tractability with respect to linear information Λ all and standard information Λ std . We consider the family of approximation problems More precisely, there are constants C std , δ > 0 only depending on (C, τ ) or (C all , p all ) such that n wor (ε, d; Λ std ) ≤ C std ε −p std log(1/ε) δ with p std = τ in case (i) and p std = p all in case (ii).
Proof. Note that (ii) implies (i) with τ = 2. Furthermore, Theorem 5.7 implies strong polynomial tractability if (i) is assumed. In case (i) we may use Stechkin's lemma [13,Lem. 7.8] which gives that ∞ j=m σ 2 j,d ≤ C 2 m −2/τ +1 for all d. This gives the exponent p std = τ and an additional log due to (5.6). If (ii) is assumed then ∞ j=m σ 2 j,d ≤ C ′ m −2/p all +1 for all d. Theorem 5.8 is stronger than Theorem 26.20 in [36] in two aspects. As pointed out in the proof, assumption (i) is weaker than (ii), which is essentially the one in [36,Theorem 26.20]. Furthermore, our statement is stronger since p std equals p all . The authors in [36, 26.6.1] showed that p std = p all + 1 2 [p all ] 2 and proposed that "the lack of the exact exponent represents another major challenge" and formulated Open Problem 127. Our considerations prove that the dependence on ε of the tractability estimates on n wor (ε, d; Λ all ) and n wor (ε, d; Λ std ) coincide up to logarithmic factors. Similar assertions hold true when strong polynomial tractability is replaced by polynomial tractability. The modifications are straight forward.
Example 5.9. Let us consider an example from [40] and [28], namely, if s 1 ≤ s 2 ≤ ... ≤ s d then Here T = [0, 1], where opposite points are identified, and H s (T) is normed by with w s (k) = max{1, (2π) s |k| s }, see also Section 8. The smoothness vector s is supposed to grow in j, e.g., for β > 0 we have It has been shown in [40], see also [28], that the growth condition (5.12) is necessary and sufficient for polynomial tractability with respect to Λ all . Taking Theorem 5.8 into account we easily check that (5.11) is satisfied with τ ≤ 2 whenever β · τ > 1, which means that β > 1/2 is sufficient for strong polynomial tractability. In this case we obtain for any 1/β < τ ≤ 2 that n wor (ε, d; Λ std ) ≤ C τ ε −τ .
Proof. This time we follow the proof of [9, Thm. 2] and obtain where g = f − P m−1 f . Averaging on both sides yields Note, that the last summand on the right-hand side vanishes since g is orthogonal to η k , k = 1, ..., m − 1. This gives Analogously to the proof of Corollary 5.6, one shows the following result. Corollary 6.2. Under the same assumptions as in Theorem 6.1 it holds for fixed δ > 0 Remark 6.3. Following [11] we may relax requirement (6.1) on the kernel K to where ̺ m is given by Then, the corresponding approximation of the function f is based on the solution of a least squares problem similar to that we discussed in Section 5.3.

Optimal weights for numerical integration on random nodes
We consider the problem of approximating the integral with respect to µ D of a function f by the cubature rule Q m where the weights q j are determined by X = (x 1 , ..., x n ) and b k := D η k dµ D , k = 1, ..., m−1.
Indeed, assuming L m of full column rank and following Oettershagen [37] we set In fact, the cubature rule Q m X is the implicit integration of the least squares solution S m Using this, we give upper bounds on the integration error caused by Q m X .
Proof. Using the embedding relation of the different L 1 -spaces we have We conclude the proof using Corollary 5.6.
When dealing with the problem of integrating an individual function, Theorem 6.1 has a direct consequence for Monte-Carlo integration.
Proof. Similar as in the proof of the previous theorem we estimate the integration error by the L 2 -approximation error and apply Theorem 6.1 afterwards.
Analogously to the proof of Corollary 5.6, one shows the following result.

Hyperbolic cross Fourier regression
In the sequel we are interested in the recovery of functions from periodic Sobolev spaces. That is, we consider functions on the torus T d ≃ [0, 1) d where opposite points are identified. Note, that the unit cube [0, 1] d is preferred here since it has Lebesgue measure 1 and is therefore a probability space. We could have also worked with [0, 2π] d and the Lebesgue measure (which can be made a probability measure by a d-dependent rescaling). The general error bounds for the recovering error given below (in terms of (σ j ) j like in Theorem 8.5) are not affected by this rescaling) since the sequence (σ j ) j then also changes. However, some of the preasymptotic estimates for the (σ j ) j are sensitive with respect to a different domain as the results in Krieg [23] point out.
For α ∈ N we define the space H α mix (T d ) as the Hilbert space with the inner product Defining the weight w α, * (k) = (1 + (2π|k|) 2α ) 1/2 , k ∈ Z (8.2) and the univariate kernel function which is a reproducing kernel for H α mix (T d ). The Fourier series representation of K d α, * (x, y) is specified by In particular, for any f ∈ H α mix (T d ) we have . The (ordered) sequence (λ * j ) ∞ j=1 of eigenvalues of the corresponding mapping W = Id * •Id, where Id : H K d s, * → L 2 (T d ), is the non-increasing rearrangement of the numbers The corresponding orthonormal system (e * j ) ∞ j=1 in H K d α, * is given by Consequently, the orthonormal system (η * j ) ∞ j=1 in L 2 (T d ) is the properly ordered classical Fourier system η k (x) = exp(2πik · x). This directly implies the following behavior of the quantities N(m) and T (m) defined in (2.4) and (2.5), see also the comment after (2.6). It holds Remark 8.1. It is possible to define a smoothness vector s = (s 1 , ..., s d ) to emphasize different smoothness in different coordinate directions. Such kernels will be denoted with K s (x, y). In [28] the authors establish preasymptotic error bounds which can be used for the least squares analysis as we will see below.
Recent estimates in [26] allow for determining uniform recovery guarantees with preasymptotic error bounds. For this study, we need to change the kernel weight to a less natural, but for preasymptotic considerations more convenient structure of the weight w s,# (k) = (1 + 2π|k|) s , k ∈ Z , (8.4) s > 0. As a consequence, the univariate kernel as well as the tensor product kernel changes and has modified Fourier series expansions. Of course, the weight w s,# yields an equivalent norm in the space H s mix (T d ). However, from a complexity theoretic point of view, it is worth noting the difference of both approaches. The respective unit balls belonging to both norms differ significantly, since the equivalence constants for both norms may depend badly on d. Moreover, we stress on the fact that the non-increasing rearrangements of the eigenvalues (λ # j ) ∞ j=1 of the mapping W = Id * •Id, where Id : H K d s,# → L 2 (T d ) differs from (λ * j ) ∞ j=1 since the associated mappings j → k do not coincide. Accordingly, the sampling operators S m, * X and S m,# X are defined with respect to the different non-increasing rearrangements of the basis functions η k and, thus, may also differ.
Despite the fact that there are many more different equivalent norms for H s mix (T d ), we will use only the mentioned ones in order to apply advantageous known upper bounds on the eigenvalues λ * j and λ # j . In the regime of interest here, the recovery of periodic functions with mixed smoothness, we even have preasymptotic bounds for these eigenvalues/singular values available (see [27,26,23]). Note that theoretically everything is known about the singular values σ m , ∈ { * , #}, since the behavior of this sequence is determined by the non-increasing rearrangement of the reciprocals of the tensor product weights d j=1 w s, (k j ), cf. (8.2) and (8.4). The analysis of the rearrangement of multi-indexed sequences has been started by Kühn, Sickel, Ullrich [27], where preasymptotic estimates of the type hold for all m ∈ N. One may argue that the kernel K d s,# is less "natural" than the kernel K d s, * . For this purpose we use the observations by D. Krieg [23]. It turned out that the size (measure) of the underlying domain D plays a significant role for the preasymptotic bound. where the exponent scales linear in s.

Recovery of individual functions from H s mix (T d )
In this section we aim at the recovery of a single individual function from random samples via least squares. Of course, this represents a less ambitious goal, since the random samples are not supposed to work uniformly for the whole class of functions. As one may expect this leads to better bounds. Note that, when doing numerical experiments with specific test functions, we actually observe the bounds for the individual recovery. We have learned from our general Theorem 6.1 that we may relate the individual recovery error to the singular values/approximation numbers of the respective embedding. Together with Theorem 6.1 and the considerations in the previous section this may be directly used to obtain preasymptotic error guarantees for the recovery of one single function using random samples. Note that this represents a special Monte-Carlo method. Theorem 8.2 below shows that our least square approach has a slightly worse sampling complexity compared to the method in [24]. However, the numerical performance is striking and, in addition, it is robust with respect to Gaussian noise, which will be considered in a subsequent contribution.
Theorem 8.2. Let d ∈ N, s > 1/2 and 0 < δ < 1. Let further n ∈ N such that Let f be a fixed function such that f * ≤ 1. We draw X = {x 1 , ..., x n } independently at random according to the uniform measure. Then the corresponding least squares operator S m, * X recovers f in the mean .
With the help of Markov's inequality we also get a bound in probability.
holds with probability exceeding 1 − 2δ. . for all n ∈ N with probability exceeding 1 − 2δ. Here we use the bound in (8.6).

Uniform recovery of functions from H s mix (T d )
First, we consider the asymptotic behavior of the sampling error caused by the presented least squares approach. Since the asymptotic bounds on λ * j and λ # j differ only by constants, that we do not specify explicit, we study both cases collectively. For ∈ { * , #}, the asymptotic behavior of the sequence (σ j ) ∞ j=1 for the embedding Id : H(K d s, ) → L 2 (T d ) is known for a long time. There is a constantC d which depends exponentially on d such that As a direct consequence of Corollary 5.6 we determine explicit error bounds as well as the asymptotic error behavior (8.8), where the latter holds for all equivalent norms in H s mix (T d ). Please note that the constants C d depend heavily on the specific norm. Moreover, a statement similar to (8.8) can be also obtained with the technique in [25]. Then, for δ > 0 we achieve In particular, there is a constant C d > 0 depending on d such that for 0 < δ < 1 it holds Hence, the right-hand side in (5.5) can be bounded from above by a constant times σ m , which behaves as n −s (log n) sd .
Remark 8.6. To our best knowledge, the bound in the previous theorem first showed up in Bohn [2,Sec. 5.5.4]. There the approximation error has been given for an individual function. We have already considered this problem in the last paragraph.
In addition, we investigate the preasymptotic error behavior using the aforementioned estimates (8.6) on the singular values σ # m that belongs to Id : H K d s,# → L 2 (T d ). Since the upper bounds have been proven only for this specific type of mappings, the following results, in particular the explicitly determined constants, may only hold for RKHS with weight functions w s (k) ≥ d j=1 (1 + |k j |) s , which is fulfilled for w s,# .
Theorem 8.7. Let d ∈ N, s > (1 + log 2 d)/2, 0 < δ < 1, and n ∈ N such that is at least 2. Then, we obtain Taking (5.4) into account, we bound The term in the brackets is monotonically decreasing in n. We stress that the last estimates are reasonable for m ≥ 2 and thus, we need at least n ≥ 464. This choice of n leads to an upper bound of the term within the brackets which is 29/6. Thus, for m ≥ 2, the estimate holds and the assertion follows.
Similar to Corollary 5.6, we apply Markov's inequality to get a lower bound on the success probability of the randomly chosen sampling set.
Proof. We follow the argumentation in the proof of Corollary 5.6 using the inequality

Sampling recovery of multivariate functions from H s mix (T d )
Let us return to the problem of optimal sampling recovery of functions, described in Section 5.3. We consider the specific situation Here we consider the minimal worst-case error (sampling numbers/widths) defined by It has been shown by several authors, see e.g. [45] and [13,Sec. 5] for some historical remarks, that for s > 1/2 it holds asymptotically for n ∈ N Note, that there is a gap in the logarithm between upper and lower bound. Recently, Krieg and Ullrich [25] showed by using the probabilistic method the existence of a discrete point set in T d where the "radius of information", see [34,Sec. 4.1], with respect to the case of real-valued functions is bounded from above by n −s (log n) ds . Based on this observation it turned out that the "spline algorithm", see [34,Thm. 4.2], would serve as recovery algorithm such that now g n (Id) d n −s (log n) ds .
Clearly, if s < (d − 1)/2 then the gap in (8.10) is reduced to (log n) s . Let us emphasize that the result by Krieg, Ullrich [25] can be considered as a major progress for the research on the complexity of this problem. They disproved Conjecture 5.6.2. in [13] for p = 2 and 1/2 < s < (d − 1)/2. Indeed, the celebrated sparse grid points are now beaten by random points in a certain range for s. Apart from the illuminating observation that sparse grids are not asymptotically optimal in this context, a concrete and implementable algorithm with controlled failure probability has not been specified in [25]. Further research was necessary to make the observations accessible from an algorithmic point of view. These issues have been addressed in this paper. Our method improves on the sparse grid technique with high probability. We also give an accurate bound on the above sampling numbers which is also reasonable for small n. Indeed, taking Theorem 8.5 with fixed δ = 29/64 into account, we observe where m := ⌊n ( √ 2 log(2n)+1) −1 /48⌋+1 and (σ j ) j denotes the sequence of singular numbers of the embedding.
Still it is worth mentioning that the sparse grids represent the best known deterministic construction what concerns the asymptotic order. Indeed, the guarantees are deterministic and only slightly worse compared to random nodes in the asymptotic regime. However, regarding preasymptotics the random constructions provide substantial advantages. The problem is somehow related to the recent efforts in compressed sensing. There the optimal RIP matrices are given as realizations of random matrices. Known deterministic constructions are far from being optimal.
There is still the somewhat philosophical question whether there is an intrinsic additional difficulty when restricting to algorithms based on function samples rather than Fourier coefficients. From a practical point of view sampling algorithms are highly relevant since we usually have given discrete samples of functions. The question remains: are the asymptotic characteristics σ n and g n of the same order or do they rather behave like σ n = o(g n )? This represents a fundamental open problem in hyperbolic cross approximation, see [13, Outstanding Open Problem 1.4]. Surprisingly, there is also recent progress in this direction. It turned out that the optimal sampling recovery problem is connected to the solution of the famous Kadison-Singer problem, see [31] . An interesting consequence of the groundbreaking result in [31] has been obtained by the authors in [33]. It reads as follows.
Then there exists a set X = {x 1 , ..., x n } ⊂ D with n ≤ C 1 m ′ such that for every In other words, every d-variate trigonometric polynomial with a fixed index set I, |I| = m − 1, may be stably recovered from C|I| = n samples, since the matrix has a condition number bounded by C 3 /C 2 which, however, may depend badly on d. This shows in particular, that the problem of optimal recovery is strongly connected to the Marcinkiewicz discretization problem. Recently, this family of problems gained renewed interest and has been studied systematically by V.N. Temlyakov and coauthors, see [48,49,15]. From an algorithmic point of view the construction above is hard to apply in practice since the existence of the point set is shown probabilistically without providing a small failure probability. However, this construction represents a promising candidate to make progress for the "Outstanding Open Problem" in [13]. In fact, this point construction behaves well in Proposition 5.3. The question remains open how to control the spectral norm of the infinite matrix in Proposition 5.4.

Numerical integration of periodic functions
In [37,Sec. 4.2], the author discussed the construction of stable cubature weights for the approximation of the integral for functions from periodic Sobolev spaces of dominating mixed smoothness. The integration nodes X are drawn uniformly and independently at random from [0, 1] d . Below, in Corollary 8.11, the cubature rule Q m X is fixed for the whole class. In contrast to that, we study a Monte-Carlo method in Theorem 8.13 below. The result in [37,Thm. 4.5] bounds the worst case integration error from above by d,s n −s+1/2 (log n) sd−1/2 .
Corresponding numerical tests promise better behavior of the integration error, cf. [37,Rem. 4.6]. Our theoretical results of this section confirm that the optimal main rate of the presented approach in [37] is n −s , in particular we obtain the upper bound d,s n −s (log n) sd for this specific setting. We achieve the following statement on the worst case integration error.
Corollary 8.11. Let d ∈ N, s > 1/2 and 0 < δ < 1. We choose n ∈ N such that m as stated in (8.9) is at least 2. Drawing X = {x 1 , . . . , x n } uniformly i.i.d. at random from [0, 1] d we put the cubature weight vector q to be the first column of L m (L * m L m ) −1 . Then, with probability at least 1 − 2δ, In particular, there is a constant C d > 0 depending on d such that for 0 < δ < 1 it holds with probability 1 − δ Proof. We apply Theorem 7.1 with µ T d = ̺ T d ≡ 1 followed by (8.7).
Remark 8.12. By the same reasoning the result in Theorem 8.7 transfers almost literally to the integration problem. In fact, having s > (1+log 2 d)/2 we see a non-trivial preasymptotic behavior. The above bounds show that this method based on random points competes with most of the quasi-Monte-Carlo methods studied in the literature, see [17, pp. 195, 247]. Now we consider the problem of integrating one particular function with random sampling points and a cubature formula that uses optimal weights, cf. [37]. Interestingly, we may reduce the general bound in Theorem 7.2 by a logarithmic factor in this specific situation. We stress the fact, that we construct a cubature rule that uses sampling values from a set of random nodes similar to the Monte-Carlo method. On the one hand, computing the optimal weights causes additional computational effort compared to plain Monte-Carlo. On the other hand, this strategy yields substantially improved cubature rules.
Theorem 8.13. Let d ∈ N, s > 1/2 and 0 < δ < 1. We choose n ∈ N such that m as stated in (8.9) is at least 2. Let further f be a fixed function such that f H s mix (T d ) ≤ 1. Then we get Proof. We first observe that with g = f − P m−1 f . Due to the special structure of the Fourier system (I(g) = 0) and the reproduction property of S m X we find Examining the proof of Theorem 6.1 once again we see that we have already estimated S m X g 2 L 2 (T d ) . In (6.2) and what follows we obtained Hence, we find Making the right-hand side in Theorem precise we obtain: Corollary 8.14. Under the assumptions of Theorem 8.13 we estimate

Hyperbolic wavelet regression
The following scenario of replacing the Fourier system by dyadic wavelets has already been investigated by Bohn [2,Sec. 5.5.2], [3]. Using orthogonal wavelets we will improve the result in [2] in two directions. First, due to the vanishing moments of the wavelets we will get a better error bound since the hyperbolic wavelet system is L 2 -stable. And second, our result holds for the whole class and not just one individual function, i.e., we control the worst-case error. It is worth mentioning, that we only loose a log-factor which is independent of d compared to the benchmark result in [16].
Let us start with the necessary definitions since we are now in a non-periodic setting. For s > 0 let us define the space H s mix (R d ) as the collection of all functions from L 2 (R d ) such that Here F denotes the Fourier transform on R d given by It is well-known, that H s mix (R d ) can be characterized using hyperbolic wavelets. Let (ψ j,k ) j∈N 0 ,k∈Z be a univariate orthonormal wavelet system (if j = 0 then ψ 0,k denotes the orthogonal scaling function). Then we denote with the corresponding hyperbolic wavelet basis in L 2 (R d ) . For our analysis we need that the univariate wavelet is a compactly supported wavelet which means that ψ j,k is supported "near" the interval [k2 −j , (k + 1)2 −j ]. If the wavelet basis has sufficient smoothness and vanishing moments then f ∈ H s mix (R d ) holds if and only if This leads to the norm equivalence Clearly, if f H s mix (R d ) ≤ 1 then the sequence (2 j 1 s f, ψ j,k ) j,k has an ℓ 2 -norm bounded by a constant, which will be important for our later analysis.
Let us consider the unit cube Q = [0, 1] d . Let further D j the set of all k such that the wavelet supp ψ j,k has a non-empty intersection with [0, 1] d . This directly leads to the extended domain Ω given by Ω := It holds [0, 1] d ⊂ Ω and the system (ψ j,k ) j∈N d 0 ,k∈D j is an orthonormal system in L 2 (Ω), however not a basis. Note, that Ω is still a bounded tensor domain with a measure proportional to 1 depending on the support length of the wavelet basis. It is also clear that this orthonormal system is not uniformly bounded in L ∞ .
In the sequel we want to recover functions f ∈ H s mix (R d ) on the domain [0, 1] d from samples on the slightly larger extended domain Ω in a uniform way. In other words, the discrete locations X = (x 1 , ..., x n ) of the sampling nodes are chosen in advance for the whole class of functions. Let us consider the operator P ℓ f := j 1 ≤ℓ k∈D j f, ψ j,k ψ j,k , ℓ ∈ N , Algorithm 2 Hyperbolic wavelet regression. Input: ℓ ∈ N, n :Ñ(ℓ) ≍ 2 ℓ ℓ d−1 n log(n) , X = {x 1 , ..., x n } ⊂ D set of distinct sampling nodes, f = (f (x 1 ), ..., f (x n )) ⊤ samples of f evaluated at the nodes from X, such that the matrixL ℓ :=L ℓ (X) from (9.2) has full (column) rank.
Solve the over-determined linear system L ℓ · (c j,k ) j,k = f via least squares (e.g. directly or via the LSQR algorithm [39]), i.e., compute which is known from hyperbolic wavelet approximation, see [16,46]. The following worst-case error bound is well-known and follows directly from (9.1) We now consider a special case of the matrix L m from (3.1), namelỹ Here m = m(ℓ) ≍ 2 ℓ ℓ d−1 and the functions ψ j,k = |Ω|ψ j,k enumerate the properly renormalized wavelets ψ j,k , j 1 ≤ ℓ, k ∈ D j , which is now an orthonormal system in the space L 2 (Ω, ̺ Ω ) with the probability measure ̺ Ω = dx |Ω| . The nodes X = (x 1 , ..., x n ) are drawn i.i.d. at random according to ̺ Ω . Note that, due to the construction, we have that |Ω| is bounded by a constant which depends on the chosen wavelet system. This, on the other hand, depends on the assumed mixed regularity properties of the function f , i.e., the mixed smoothness s > 0. The larger s is chosen, the larger the support of a properly chosen orthonormal wavelet system has to be. We propose Algorithm 2 for computing the wavelet coefficients of an approximation S ℓ X f to f . Theorem 9.1. Let 0 < δ < 1. Let further s > 1/2 and (ψ j,k ) j,k be a hyperbolic and compactly supported orthonormal wavelet system such that (9.1) holds true. Then the algorithm S ℓ X described above recovers any f ∈ H s mix (R d ) on L 2 ([0, 1] d ) with high probability from n = n(ℓ) random samples which are drawn in advance for the whole class. Precisely, with probability larger than 1 − δ. The operator S ℓ X uses n(ℓ) ≍ 2 ℓ ℓ d many samples, which results in the complexity bound Remark 9.2. (i) Note, that the optimal operatorP ℓ uses n(ℓ) ≍ 2 ℓ ℓ d−1 wavelet coefficients. The gap between sampling recovery (Λ std ) and general linear approximation (Λ all ), see e.g. [13], [34,35,36], is reduced to a log-factor, which is independent of d.
(ii) The matrix defined in (9.2) is rather sparse. It has n ≍ 2 ℓ ℓ d rows and m ≍ 2 ℓ ℓ d−1 columns. In every row we have only ≍ #{ j 1 ≤ ℓ} ≍ ℓ d many non-zero entries. This is gives an additional acceleration for the least squares algorithm since matrix vector multiplications are cheap in this situation.
Let us define the quantityT which goes along the lines of Proposition 5.4. Then we get with literally the same arguments Let us computeT (ℓ). Due to the compact support of the wavelet system there are for fixed j only O(1) many wavelets ψ j,k such that ψ j,k (x) is non-zero. For those O(1) wavelets we have Hence, we getT (ℓ) By the same reasoning we may estimateÑ(ℓ) in (9.4). Clearly n may be chosen such that Plugging (9.6) and (9.7) into (9.5) we obtain The same standard arguments as used in Theorem 5.5 and Corollary 5.6 lead to the bound in (9.3). It remains to estimate the number of samples n depending on ℓ, see (9.7). This clearly gives log(n) ℓ and hence n 2 ℓ ℓ d which concludes the proof.

Recovery of functions from spaces with mixed smoothness
In this section, we perform numerical tests for the hyperbolic cross Fourier regression based on random sampling nodes from Section 8, i.e., we apply Algorithm 1 for periodic test functions f from the spaces H s mix (T d ). In Figure 1, we visualize realizations for such random nodes in the two-and three-dimensional case.
Besides random point sets, different types of deterministic lattices have also been used for numerical integration and function recovery, see for instance [21,22] and [6]. This motivates us to consider Frolov lattices [20] and Fibonacci lattices (cf. e.g. [47,Sec. IV.2]) in the context of this paper, see Figure 2 for examples of such lattices.
In the following, we use the weight function (1 + |k i | 2 ) 1/2 .  Note that for computational reasons we avoid the 2π in this weight w. By the reasoning after (8.6), the weights without 2π lead to a slightly slower decay of the respective singular numbers.
For a given number n of samples, we use the m = ⌊n/(4 log n)⌋ frequencies k ∈ Z d where w(k) is smallest. Here ties are broken in numerical order starting with the first component k 1 of k until the last one k d . Corresponding to our theoretical results, the goal is to compute a least squares approximation S m X f , |X| = n, of the projection P m−1 f of the function f to the span{exp(2πi k · x) : k ∈ I}.
Comments on the arithmetic cost of Algorithm 1. Building the index set I, i.e., enumerating the basis functions η 1 , . . . , η m−1 , requires arithmetic operations, and setting up the matrix L m requires ≤ C 2 d n m arithmetic operations, where C 1 , C 2 > 0 are absolute constants. Afterwards, running Algorithm 1 requires ≤ C 3 R n m ≤ C 3 R n 2 4 log n arithmetic operations, where C 3 > 0 is an absolute constant and R ∈ N is the number of LSQR iterations. If one chooses m as in Theorem 8.2 or 8.5, cf. (5.1), the condition number of the matrix L m is ≤ √ 3 with high probability and one obtains R ≤ 17 for a LSQR accuracy of ≈ 10 −8 . Here in our experiments, we choose m = ⌊n/(4 log n)⌋ slightly larger.
Remark 10.1. Let us compare the hyperbolic cross Fourier regression from Section 8, which uses random samples, with the single rank-1 lattice sampling approach from [21,6], which uses highly structured deterministic sampling nodes. Up to logarithmic factors, both approaches have comparable error estimates w.r.t. the number m of basis functions and comparable arithmetic complexities. Single rank-1 lattice sampling has slightly worse recovery error estimates w.r.t. m than Algorithm 1, cf. Theorem 8.5. On the other hand, the arithmetic complexity for single rank-1 lattice sampling is slightly better. Moreover, the error estimates when using rank-1 lattices are guaranteed upper bounds, whereas the worst case upper bounds in Section 8 hold with high probability. However, for fixed m the used number of samples for single rank-1 lattice sampling is distinctly higher, i.e., almost quadratic compared to the approach from this paper. This results in error estimates w.r.t. the number n of used sampling values that are distinctly worse for single rank-1 lattices.
Subsequently, we consider three different test functions f : We start with the test function where we have for the Fourier coefficients |f k | ∼ d i=1 (1 + |k i | 2 ) 1/2 −5/4 and, consequently, In Figure 3a, we visualize the relative approximation errorsã m := f − P m−1 f L 2 (T d ) for spatial dimensions d = 2, 3, 4, 5.
Correspondingly, we plot m −0.75 (log m) (d−1)·0.75 as black dotted graphs. We observe that the obtained approximation errors nearly decay as the theory suggests.
Next, we apply Algorithm 1 on the test function f using n randomly selected sampling nodes as sampling scheme. We do not compute the least squares solution directly but use the iterative method LSQR [39] on the matrix L m , m = ⌊n/(4 log n)⌋. The obtained sampling errorsg n := f − S m X f L 2 (T d ) are visualized in Figure 3b as well as the graphs ∼ n −0.75 (log n) d·0.75 as dotted lines which correspond to the theoretical upper bounds n −0.75+ε (log n) d·(0.75−ε) , cf. (8.8). We set the tolerance parameter of LSQR to 5 · 10 −8 and the maximum number of iterations to 100. For d = 2 and d = 3, the errors nearly decay like these bounds. For d = 4 and d = 5, the errors seem to decay slightly slower than the bounds. In order to investigate this further, we also plot the corresponding approximation errorsã m with m = ⌊n/(4 log n)⌋ as thick dashed lines. We observe that these approximation errorsã m , which are the best possible errors that can be achieved in this setting, almost  coincide with the sampling errors. This means that we might still observe preasymptotic behavior. For d = 2 spatial dimensions, we have a closer look at the sampling errors. In Figure 4a, we again plot the approximation errorsã m , m = ⌊n/(4 log n)⌋. In addition, the aliasing errors P m−1 f − S m X f L 2 (T d ) , m = ⌊n/(4 log n)⌋, which are the errors caused by Algorithm 1 since , are shown as triangles. We observe that the aliasing errors nearly decay like the approximation errors, and that they are by one order of magnitude smaller. This corresponds to the behavior we observed in Figure 3b.
Moreover, so-called Frolov lattices [20] are considered as sampling sets in d = 2 spatial dimensions and used in Algorithm 1. The resulting sampling errors almost coincide with the approximation errors. The aliasing errors are visualized in Figure 4a as circles. They decay similarly and are lower than the aliasing errors for random nodes in most cases.
In addition, we consider Fibonacci lattices. For n ≥ 832040, the matrices L m , m = ⌊n/(4 log n)⌋, contain at least two identical columns and correspondingly, the smallest eigenvalue of L * m L m is zero. Therefore, obtaining S m X f via Algorithm 1 is not possible if the least squares solution is computed directly. An iterative method like LSQR may still work but the number of iterations may have to be restricted. In Figure 4a, the obtained aliasing errors via LSQR are shown as squares, and they are smaller than in the other cases but they seem to decay slower. However, when we decreased the tolerance parameter of the LSQR algorithm, we observed for n ≥ 832040 aliasing errors and, correspondingly, sampling errors larger than 1. Kink test function f from H with Fourier coefficientŝ Besides the different test function f , we use the same setting as before.
In Figure 5a, we visualize the relative approximation errorsã m := f − P m−1 f L 2 (T d ) for spatial dimensions d = 2, 3, 4, 5. These errors should decay like m −1.5+ε (log m) (d−1)·(1.5−ε) for sufficiently large m, and we observe that the obtained approximation errors nearly decay as the theoretical results suggest. Next, we apply Algorithm 1 with random nodes to the test function f using the iterative method LSQR. The resulting sampling errorsg n := f − S m X f L 2 (T d ) are depicted in Figure 5b. In addition, the graphs ∼ n −1.5 (log n) d·1.5 are shown as dotted lines which roughly correspond to the theoretical upper bounds n −1.5+ε (log n) d·(1.5−ε) . The errors seem to decay according to this bound for d = 2 and slower for d = 3, 4, 5. Again, we also plot the corresponding approximation errorsã m with m = ⌊n/(4 log n)⌋ as thick dashed lines, and we observe that these approximation errorsã ⌊n/(4 log n)⌋ almost coincide with the sampling errors. Correspondingly, we still have preasymptotic behavior for d = 3, 4, 5. n , d = 2ã ⌊n/(4 log n)⌋ , d = 2 g n , d = 3ã ⌊n/(4 log n)⌋ , d = 3 g n , d = 4ã ⌊n/(4 log n)⌋ , d = 4 g n , d = 5ã ⌊n/(4 log n)⌋ , d = 5 n −1.5 (log n) d·1.5 (b) sampling error Test function f from H 11/2−ε mix As a third test function f , we consider a periodic B-Spline of order 6, which is a piecewise polynomial of degree 5, and therefore, we have f ∈ H (T d ). In Figure 6a, the relative approximation errorsã m := f − P m−1 f L 2 (T d ) are visualized for spatial dimensions d = 2, 3, 4, 5, which roughly decay like m −5.5+ε (log m) (d−1)·(5.5−ε) for sufficiently large m. The sampling errorsg n := f − S m X f L 2 (T d ) when applying Algorithm 1 with random nodes to the test function f using the iterative method LSQR are depicted in Figure 6b. In addition, the graphs ∼ n −5.5 (log n) d·5.5 are plotted as dotted lines which correspond to the theoretical upper bounds. The errors seem to decay roughly according to this bound for d = 2 and d = 3 or slightly slower. Again, the corresponding approximation errorsã m with m = n/(4 log n) are shown as thick dashed lines, and we observe that these approximation errorsã ⌊n/(4 log n)⌋ almost coincide with the sampling errors. This means we still have preasymptotic behavior.

Integration of functions from spaces with mixed smoothness
Extensive numerical tests on integration using random point sets were performed by Oettershagen [37], cf. in particular [37,Sec. 4.1] for tests concerning numerical integration in Sobolev spaces with mixed smoothness H s mix (T d ). These numerical tests performed for d ∈ {2, 4, 8, 16} spatial dimensions and smoothness s ∈ {1, 2, 3} suggest that the worst case cubature error may decay with a main rate of n −s with additional log factors. This is a remarkable behavior since plain Monte-Carlo with random points usually leads to an error decay of n −1/2 . However, the corresponding theoretical results in [37,Sec. 4.2] only give a main rate of n −s+1/2 . We highlight that our results obtained in this paper bridge this gap of 1/2 in the n , d = 2ã ⌊n/(4 log n)⌋ , d = 2 g n , d = 3ã ⌊n/(4 log n)⌋ , d = 3 g n , d = 4ã ⌊n/(4 log n)⌋ , d = 4 g n , d = 5ã ⌊n/(4 log n)⌋ , d = 5 n −5.5 (log n) d·5.5 (b) sampling error  We use the parameters as in Section 10.1 and apply Algorithm 1 on the test function f in the non-periodic setting, where we generate the random nodes with respect to the measure n , d = 2ã ⌊n/(4 log n)⌋ , d = 2 g n , d = 3ã ⌊n/(4 log n)⌋ , d = 3 g n , d = 4ã ⌊n/(4 log n)⌋ , d = 4 g n , d = 5ã ⌊n/(4 log n)⌋ , d = 5 ∼ n −1.5 (log n) d·1.5 (a) sampling and approximation eror  ̺ D (x) := d i=1 (π 1 − x 2 i ) −1 . In Figure 8, we show realizations for these random nodes. As before, we do not compute the least squares solution directly but use the iterative method LSQR on the matrix L m , m = ⌊n/(4 log n)⌋. The obtained sampling errorsg n := f − S m X f L 2 ([−1,1] d ,̺ D ) with m = ⌊n/(4 log n)⌋ are plotted in Figure 9a as triangles as well as the corresponding approximation errorsã m := f − P m−1 f L 2 ([−1,1] d ,̺ D ) as thick dashed lines. We observe that the sampling and approximation errors almost coincide. Moreover, we plot the graphs ∼ n −1.5 (log n) d·1.5 as dotted lines which correspond to the expected theoretical upper bounds n −1.5+ε (log n) d·(1.5−ε) . We observe that the obtained numerical errors nearly decay like these theoretical upper bound.
In addition, we use the numerically computed Chebyshev coefficients c k from Algorithm 1 to compute the approximation Q X f of I(f ) by Q X f = D S m X f dµ D = m k=1 c k b k where the complex numbers b k are calculated as stated in (10.1). We repeatedly perform each test 100 times with different random nodes. The averages for the integration errors |I(f ) − Q X f | of the 100 test runs are depicted as triangles in Figure 9b and the maxima as error bars. Moreover, we plot the graphs ∼ n −2 (log n) d·2 as dotted lines, and we observe that the obtained integration errors approximately decay like these graphs. For comparison, we also plot n −1.5 (log n) d·1.5 for d = 2 and d = 5 as thick solid lines which belong to the theoretical results n −1.5+ε (log n) d·(1.5−ε) one obtains analogously to (8.11) in Section 8 for the non-periodic case. These thick solid lines decay distinctly slower.
In particular, we strongly expect that the theoretical preasymptotic results in (8.5), (8.6) and [23] also hold for the Chebyshev case.
ing by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, project number 380648269). T.U. would like to acknowledge support by the DFG Ul-403/2-1. T.V. gratefully acknowledges support by the SAB 100378180.