A new upper bound for sampling numbers

We provide a new upper bound for sampling numbers $(g_n)_{n\in \mathbb{N}}$ associated to the compact embedding of a separable reproducing kernel Hilbert space into the space of square integrable functions. There are universal constants $C,c>0$ (which are specified in the paper) such that $$ g^2_n \leq \frac{C\log(n)}{n}\sum\limits_{k\geq \lfloor cn \rfloor} \sigma_k^2\quad,\quad n\geq 2\,, $$ where $(\sigma_k)_{k\in \mathbb{N}}$ is the sequence of singular numbers (approximation numbers) of the Hilbert-Schmidt embedding $\text{Id}:H(K) \to L_2(D,\varrho_D)$. The algorithm which realizes the bound is a least squares algorithm based on a specific set of sampling nodes. These are constructed out of a random draw in combination with a down-sampling procedure coming from the celebrated proof of Weaver's conjecture, which was shown to be equivalent to the Kadison-Singer problem. Our result is non-constructive since we only show the existence of a linear sampling operator realizing the above bound. The general result can for instance be applied to the well-known situation of $H^s_{\text{mix}}(\mathbb{T}^d)$ in $L_2(\mathbb{T}^d)$ with $s>1/2$. We obtain the asymptotic bound $$ g_n \leq C_{s,d}n^{-s}\log(n)^{(d-1)s+1/2}\,, $$ which improves on very recent results by shortening the gap between upper and lower bound to $\sqrt{\log(n)}$.


Introduction
In this paper we study the classical problem of the optimal recovery of multivariate functions from $n$ function samples, a problem which turns out to be extremely difficult in several relevant situations. Since we want to recover the function $f$ from the samples $(f(x_1), \dots, f(x_n))$, the problem boils down to the question of how to choose the sampling nodes $X = (x_1, \dots, x_n)$ and a corresponding recovery algorithm. The minimal worst-case error for an optimal choice is reflected by the $n$-th (linear) sampling number defined by $$ g_n(\mathrm{Id}_{K,\varrho_D}) := \inf_{x_1,\dots,x_n \in D}\;\inf_{\varphi\,:\,\mathbb{C}^n \to L_2(D,\varrho_D)}\;\sup_{\|f\|_{H(K)}\le 1}\; \|f - \varphi(f(x_1),\dots,f(x_n))\|_{L_2(D,\varrho_D)}\,. \qquad (1.1) $$ The functions are modeled as elements from a separable reproducing kernel Hilbert space $H(K)$ of functions on a set $D \subset \mathbb{R}^d$ with finite trace kernel $K(\cdot,\cdot)$, see Section 3 below. The error is measured in $L_2(D,\varrho_D)$. Our main result is the existence of two universal constants $C, c > 0$ such that the relation $$ g_n^2 \le \frac{C\log(n)}{n}\sum_{k\ge\lfloor cn\rfloor}\sigma_k^2\,,\qquad n\ge 2\,, \qquad (1.2) $$ holds true between the sampling numbers $(g_n)_{n\in\mathbb{N}}$ and the square summable singular numbers $(\sigma_k)_{k\in\mathbb{N}}$ of the compact embedding $$ \mathrm{Id}: H(K) \to L_2(D,\varrho_D)\,. $$ The algorithm which realizes the bound (1.2) in the sense of (1.1) is a (linear) least squares algorithm based on a specific set of sampling nodes. These are constructed out of a random draw in combination with a down-sampling procedure coming from the proof of Weaver's conjecture [22], see Section 2. In its original form the result in [22] is not applicable for our purpose, which is why we slightly generalize it, see Theorem 2.3 below. Note that the result in (1.2) is non-constructive: we have neither a construction of a deterministic set of nodes nor control of the failure probability. The problem discussed in the present paper is tightly related to the problem of the Marcinkiewicz discretization of $L_2$-norms for functions from finite-dimensional spaces (e.g. trigonometric polynomials). In fact, constructing well-conditioned matrices for the least squares approximation is an equivalent issue. Let us emphasize that V.N. Temlyakov (and coauthors) already used the construction of S. Nitzan, A. Olevskii and A. Ulanovskii [22] for the Marcinkiewicz discretization problem in the context of multivariate (hyperbolic cross) polynomials, see [29, 28] and the very recent paper [18]. It actually turned out afterwards that the proof of [18, Lem. 2.2] already gives a version of our Theorem 2.3 with unspecified constants.
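To illustrate the strength of (1.2), consider the following model computation (not taken from the paper; we assume a purely polynomial decay of the singular numbers). If $\sigma_k \asymp k^{-s}$ for some $s > 1/2$, then $$ \sum_{k\ge\lfloor cn\rfloor}\sigma_k^2 \;\asymp\; \sum_{k\ge\lfloor cn\rfloor} k^{-2s} \;\asymp\; n^{1-2s}\,, $$ and (1.2) yields $$ g_n^2 \;\lesssim\; \frac{\log(n)}{n}\, n^{1-2s} \;=\; n^{-2s}\log(n)\,, \qquad\text{i.e.}\qquad g_n \;\lesssim\; n^{-s}\sqrt{\log(n)}\,, $$ which is only a factor $\sqrt{\log(n)}$ away from the benchmark $\sigma_n \asymp n^{-s}$.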
Compared to the result by D. Krieg and M. Ullrich [13], the relation (1.2) is slightly stronger. In fact, the difference shows mostly in the log-exponent, as the example below demonstrates. The general relation (1.2) yields a significant improvement in the situation of mixed Sobolev embeddings in $L_2$, see Section 7. Applied, for instance, to the situation of $H^s_{\text{mix}}(\mathbb{T}^d)$ in $L_2(\mathbb{T}^d)$ with $s > 1/2$, the result in (1.2) yields $$ g_n \le C_{s,d}\, n^{-s}\log(n)^{(d-1)s+1/2}\,, \qquad (1.3) $$ whereas the result in [13, 11, 32, 20] implies $$ g_n \le C_{s,d}\, n^{-s}\log(n)^{(d-1)s+s}\,. $$ The log-gap between the two bounds grows with $s > 1/2$. Our new result achieves rates that are only worse by $\sqrt{\log(n)}$ in comparison to the benchmark rates given by the singular numbers. Note that in $d \ge 3$ and for any $s > 1/2$ the bound (1.3) yields a better performance than any sparse grid technique is able to provide, see [27], [2], [25], [7], [5], [3] and [6, Sect. 5]. In addition, combining the above result with recent preasymptotic estimates for the $(\sigma_j)_j$, see [15], [16], [14], [12], we are able to obtain reasonable bounds for $g_n$ also in the case of small $n$. See Section 7 for further comments and references in this direction. D. Krieg and M. Ullrich [13] used a sophisticated random sampling strategy which allowed for establishing a new connection between sampling numbers and singular values. Let us emphasize that this can be considered a major progress after a long period of research around this problem. The result in the present paper partly relies on this random sampling strategy, which draws nodes according to a distribution built upon spectral properties of the embedding. The advantage of the pure random strategy in connection with a $\log(n)$-oversampling is the fact that the failure probability decays polynomially in $n$, as has recently been shown by M. Ullrich [32] and, independently, by M. Moeller together with the third named author [20]. In other words, although this approach incorporates a probabilistic ingredient, the failure probability is controlled and the algorithm may be implemented. Note that there are obvious parallels to the field of compressed sensing, where the measurement matrix is also drawn at random and satisfies the RIP with high probability.
Notation. As usual, $\mathbb{N}$ denotes the natural numbers, $\mathbb{N}_0 := \mathbb{N}\cup\{0\}$, $\mathbb{Z}$ denotes the integers, $\mathbb{R}$ the real numbers, $\mathbb{R}_+$ the non-negative real numbers and $\mathbb{C}$ the complex numbers. For a natural number $m$ we set $[m] := \{1, \dots, m\}$. We will also use $\dot\cup$ to emphasize that a union is disjoint. If not indicated otherwise, $\log(\cdot)$ denotes the natural logarithm of its argument. $\mathbb{C}^n$ denotes the complex $n$-space, whereas $\mathbb{C}^{m\times n}$ denotes the set of all $m\times n$-matrices $L$ with complex entries. Vectors and matrices are usually typeset boldface, e.g. $x, y \in \mathbb{C}^n$. The matrix $L^*$ denotes the adjoint matrix. The spectral norm of a matrix $L$ is denoted by $\|L\|$ or $\|L\|_{2\to 2}$. For a complex (column) vector $y \in \mathbb{C}^n$ (or $\ell_2$) we will often use the tensor notation for the matrix $$ y\otimes y := y\cdot y^* = y\cdot \bar{y}^{\top} \in \mathbb{C}^{n\times n} \quad (\text{or } \mathbb{C}^{\mathbb{N}\times\mathbb{N}})\,. $$
For $0 < p \le \infty$ and $x \in \mathbb{C}^n$ we denote $\|x\|_p := \big(\sum_{i=1}^n |x_i|^p\big)^{1/p}$ with the usual modification in the case $p = \infty$ or $x$ being an infinite sequence. As usual, we denote by $\mathbb{E}X$ the expectation of a random variable $X$ on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$. Given a measurable subset $D \subset \mathbb{R}^d$ and a measure $\varrho$, we denote by $L_2(D,\varrho)$ the space of all square-integrable complex-valued functions (equivalence classes) on $D$, i.e. those with $\int_D |f(x)|^2\,d\varrho(x) < \infty$. We will often use $\Omega = D^n$ as probability space with the product measure $\mathbb{P} = d\varrho^n$ if $\varrho$ is a probability measure itself.

Weaver's theorem
In this section we prove a modified version of Weaver's $KS_2$-theorem, also known as Weaver's $KS_2$-conjecture, from [33], which was shown to be equivalent to the famous Kadison-Singer conjecture [10] dating back as far as 1959. For a long time these statements were mere conjectures, and many people even believed them to be false. Since the celebrated proof given by A. Marcus, D. Spielman, and N. Srivastava [19] in 2015, however, they have turned into actual theorems and thus into rather strong tools for various applications; it is in fact Weaver's $KS_2$-conjecture that is at the heart of our argument in this article. We need it in a slightly modified form, however, formulated in Theorem 2.3 below. The starting point for its proof is the following reformulation of the classical Weaver statement, which already occurred in [22]. We will formulate it with slightly improved constants, see [21].
Note that the above statement is trivial for $\varepsilon \ge (2+\sqrt{2})^{-2}$, since in this case the lower bound is $\le 0$ and the upper bound is $\ge 1$. Relaxing condition (2.1), one obtains an analogous statement for non-tight frames.
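As a small numerical illustration (not part of the paper's argument; the random frame model and all parameters below are assumptions made for the demo), the following Python snippet draws a Parseval frame of $\mathbb{C}^n$ with small vector norms and inspects the frame bounds of the two classes of a random bipartition. Weaver's theorem guarantees the existence of a partition whose classes both have frame bounds close to $1/2$; for a generic random frame even a random split already exhibits this behavior.

```python
import numpy as np

# Numerical illustration (not the paper's proof, which is non-constructive):
# draw a Parseval frame u_1, ..., u_m in C^n with small norms ||u_i||_2^2,
# split it into two random halves and inspect the frame bounds of each half.

rng = np.random.default_rng(0)
n, m = 20, 2000

# Random m x n matrix with rows u_i; orthonormalizing the columns makes
# sum_i u_i (u_i)^* = I_n, i.e. the rows form a Parseval frame of C^n.
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
Q, _ = np.linalg.qr(A)                     # Q has orthonormal columns, shape (m, n)
eps = np.max(np.sum(np.abs(Q) ** 2, axis=1))
print(f"max_i ||u_i||_2^2 = {eps:.4f}")    # small, of order n/m

mask = rng.random(m) < 0.5                 # random bipartition S_1 / S_2
for S in (Q[mask], Q[~mask]):
    G = S.conj().T @ S                     # frame operator of the partition class
    ev = np.linalg.eigvalsh(G)
    print(f"frame bounds of partition class: [{ev.min():.3f}, {ev.max():.3f}]")
```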
for all $w \in \mathbb{C}^n$, where $\beta \ge \alpha > 0$ are some fixed constants. Then there is a partition $S_1 \,\dot\cup\, S_2 = [m]$, such that for each $j = 1, 2$ and all $w \in \mathbb{C}^n$.
Again, the above statement is trivial for $\varepsilon/\alpha \ge (2+\sqrt{2})^{-2}$. Now we are ready to formulate and prove the theorem which is convenient for our later purpose. The proof technique is analogous to the one used for the proof of Lemma 2 in [22]. It was pointed out to us by V.N. Temlyakov that the proof of Lemma 2.2 in [18], which is stated in a weaker form, already gives a version of the theorem below with unspecified constants.
for all $w \in \mathbb{C}^n$, where $c_1, c_2, c_3$ only depend on $k_1, k_2, k_3$. More precisely, we can estimate

Proof. To ease the notation a bit, let us set $\zeta := 2+\sqrt{2}$. Put $\delta := k_1\frac{n}{m}$, $\alpha_0 := k_2$, $\beta_0 := k_3$, and define the $\alpha_\ell, \beta_\ell$ recursively. Assume for the moment that $\delta < (2\zeta)^{-2}k_2$. We want to show that there is a constant $\gamma > 0$, not depending on $\delta$, and an $L \in \mathbb{N}$, such that $\alpha_\ell \ge (2\zeta)^2\delta$ for all $\ell \le L$, as well as the corresponding two-sided bounds. The definition of $L$ directly yields $\alpha_{L+1} < (2\zeta)^2\delta$, but by the above also the stated estimate. It remains to find a $\gamma > 0$ as described above. To do so, first observe the behavior following from the definition of the $\alpha_\ell$ and $\beta_\ell$. We have $\alpha_L \ge (2\zeta)^2\delta$, so that $\zeta\sqrt{\delta/\alpha_L} \le 2^{-1}$, which yields the final claim. With this at hand, consider the situation of the theorem. If $k_1\frac{n}{m} \ge (2\zeta)^{-2}k_2$, the assertion follows directly for $J = [m]$ and the choice $c_1 = (2\zeta)^2 k_1/k_2$, $c_2 = (2\zeta)^2 k_1$, and $c_3 = (2\zeta)^2 k_1 k_3/k_2$. Furthermore, these constants clearly satisfy the given estimates.
Otherwise, for $\delta := k_1\frac{n}{m} < (2\zeta)^{-2}k_2$, let $\alpha_\ell, \beta_\ell$ be as above. The vectors $u_i$ fulfill the assumptions of Corollary 2.2 for $\alpha = \alpha_0 = k_2$ and $\beta = \beta_0 = k_3$, so that there is a set $J_1 \subseteq [m]$ with the corresponding frame bounds for all $w \in \mathbb{C}^n$. By choosing the smaller of the two partition classes $S_1$ or $S_2$, we may assume $\#J_1 \le \frac{1}{2}m$. We can now apply Corollary 2.2 again, where we restrict ourselves to the indices in $J_1$. We thus get a $J_2 \subseteq J_1$ with the analogous bounds for all $w \in \mathbb{C}^n$. Again, by choosing the smaller partition class, we may assume $\#J_2 \le \frac{1}{4}m$. By what was proven in the first part of this proof, we therefore get the claimed two-sided estimate for all $w \in \mathbb{C}^n$. We thus get the assertion for $J = J_{L+1}$, $c_2 = \zeta^2 k_1$ and $c_3 = (2\zeta)^2\gamma k_1$. As for $c_1$, looking at the quantities $\varphi_\ell$, we see that the $\varphi_\ell$ are monotonically increasing. Thus the claimed estimate for $c_1$ follows.

Reproducing kernel Hilbert spaces
We will work in the framework of reproducing kernel Hilbert spaces. The relevant theoretical background can be found in [1, Chapt. 1] and [4, Chapt. 4]. The papers [8] and [26] are also of particular relevance for the subject of this paper.
Let $L_2(D, \varrho_D)$ be the space of complex-valued square-integrable functions with respect to $\varrho_D$. Here $D \subset \mathbb{R}^d$ is an arbitrary measurable subset and $\varrho_D$ a measure on $D$. We further consider a reproducing kernel Hilbert space $H(K)$ with a Hermitian positive definite kernel $K(x,y)$ on $D \times D$. The crucial property of reproducing kernel Hilbert spaces is the fact that Dirac functionals are continuous, or, equivalently, that the reproducing property $$ f(x) = \langle f, K(\cdot, x)\rangle_{H(K)} $$ holds for all $f \in H(K)$ and all $x \in D$. It ensures that point evaluations are continuous functionals on $H(K)$. We will use the notation from [4, Chapt. 4]. In the framework of this paper, the finite trace condition on the kernel reads $$ \mathrm{tr}(K) := \int_D K(x,x)\,d\varrho_D(x) < \infty\,. \qquad (3.1) $$ The embedding operator $$ \mathrm{Id}: H(K) \to L_2(D,\varrho_D) \qquad (3.2) $$ is Hilbert-Schmidt under the finite trace condition (3.1), see [8], [26, Lemma 2.3], which we always assume from now on. We additionally assume that $H(K)$ is at least infinite dimensional. Let us denote by $(\lambda_k)_{k\in\mathbb{N}}$ the (at most) countable system of strictly positive eigenvalues of the compact self-adjoint operator $\mathrm{Id}^*\circ\mathrm{Id}$, arranged in non-increasing order. We will also need the left and right singular vectors $(e_k)_k \subset H(K)$ and $(\eta_k)_k \subset L_2(D,\varrho_D)$, which both represent orthonormal systems in the respective spaces, related by $e_k = \sigma_k\eta_k$ with $\lambda_k = \sigma_k^2$ for $k\in\mathbb{N}$. We would like to emphasize that the embedding (3.2) is not necessarily injective. In other words, for certain kernels there might also be a nontrivial null-space of the embedding in (3.2). Therefore, the system $(e_k)_k$ from above is not necessarily a basis in $H(K)$. It would be a basis under additional restrictions, e.g. if the kernel $K(\cdot,\cdot)$ is continuous and bounded (i.e. a Mercer kernel). It is shown in [8], [26, Lemma 2.3] that if $\mathrm{tr}(K) < \infty$ and $H(K)$ is separable, the non-negative function $$ x \mapsto K(x,x) - \sum_{k\in\mathbb{N}}|e_k(x)|^2 $$ vanishes $\varrho_D$-almost everywhere. Let us finally define the "spectral functions" $$ N(m) := \sup_{x\in D}\sum_{k=1}^{m-1}|\eta_k(x)|^2 \qquad\text{and}\qquad T(m) := \sup_{x\in D}\sum_{k=m}^{\infty}|e_k(x)|^2\,, $$ provided that they exist.
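For concreteness, here is a small Python experiment with an assumed toy spectrum (cosine system on $[0,1]$ with Lebesgue measure and $\sigma_k = k^{-s}$; none of this is taken from the paper) that evaluates the trace and the two spectral functions numerically. For this particular system one gets $N(m) \le 2(m-1)$ and $T(m) \le 2\sum_{k\ge m}\sigma_k^2$ simply because $\sup_x|\eta_k(x)|^2 \le 2$.

```python
import numpy as np

# Toy model (assumed for illustration, not the paper's setting): the cosine
# ONB of L_2([0,1], dx) with spectrum sigma_k = k^{-s}, so that
# e_k = sigma_k * eta_k and tr(K) = sum_k sigma_k^2 < infinity for s > 1/2.
s, kmax, m = 1.0, 500, 10
x = np.linspace(0.0, 1.0, 2001)
k = np.arange(kmax)                 # index j corresponds to eta_{j+1}
sigma = (1.0 + k) ** (-s)           # sigma_1 = 1, sigma_2 = 2^{-s}, ...

# eta_1 = 1, eta_{j+1} = sqrt(2) cos(pi j x) for j >= 1
eta = np.where(k[:, None] == 0, 1.0, np.sqrt(2.0) * np.cos(np.pi * k[:, None] * x))

trace = np.sum(sigma ** 2)                         # tr(K) = sum_k sigma_k^2
N_m = np.max(np.sum(eta[: m - 1] ** 2, axis=0))    # N(m) = sup_x sum_{k<m} |eta_k|^2
T_m = np.max(np.sum((sigma[m - 1 :, None] * eta[m - 1 :]) ** 2, axis=0))

print(f"tr(K) = {trace:.4f}")
print(f"N({m}) = {N_m:.2f} <= 2(m-1) = {2 * (m - 1)}")
print(f"T({m}) = {T_m:.5f} <= 2 sum_(k>=m) sigma_k^2 = {2 * np.sum(sigma[m - 1:] ** 2):.5f}")
```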

Re-weighted least squares
Let us begin with concentration inequalities for the spectral norm of sums of complex rank-1 matrices. Such matrices appear as $L^*L$ when studying least squares solutions of overdetermined linear systems $$ L\,c = f\,, $$ where $L \in \mathbb{C}^{n\times m}$ is a matrix with $n > m$. It is well-known that the above system may not have a solution. However, we can ask for the vector $c$ which minimizes the residual $\|f - L\,c\|_2$. Multiplying the system with $L^*$ gives $$ L^*L\,c = L^*f\,, $$ which is called the system of normal equations. If $L$ has full rank, then the unique solution of the least squares problem is given by $$ c = (L^*L)^{-1}L^*f\,. \qquad (4.1) $$ For function recovery problems we will use the matrix $$ L_{n,m} := \begin{pmatrix} \eta_1(x_1) & \cdots & \eta_{m-1}(x_1)\\ \vdots & & \vdots\\ \eta_1(x_n) & \cdots & \eta_{m-1}(x_n) \end{pmatrix} \qquad (4.2) $$ for $X = (x_1, \dots, x_n) \in D^n$ of distinct sampling nodes and a system $(\eta_k(\cdot))_k$ of functions.

(ii) In particular, it holds that $(L^*L)^{-1}L^* = V^*\widetilde{\Sigma}U$ whenever $L = U^*\Sigma V$, where $\Sigma \in \mathbb{R}^{n\times m}$ is a rectangular matrix only with $(\tau_1, \dots, \tau_m)$ on the main diagonal and unitary matrices $U \in \mathbb{C}^{n\times n}$ and $V \in \mathbb{C}^{m\times m}$. Here $\widetilde{\Sigma} \in \mathbb{R}^{m\times n}$ denotes the matrix with $(\tau_1^{-1}, \dots, \tau_m^{-1})$ on the main diagonal.

(iii) The operator norm $\|(L^*L)^{-1}L^*\|_{2\to 2}$ can be controlled as follows: $$ \|(L^*L)^{-1}L^*\|_{2\to 2} \le \Big(\min_{i=1,\dots,m}\tau_i\Big)^{-1}\,. $$

Being in the RKHS setting, we compute the coefficients $c_k$, $k = 1, \dots, m-1$, of the approximant $$ S^m_X f := \sum_{k=1}^{m-1} c_k\,\eta_k $$ using the least squares algorithm (4.1). We will also use the re-weighted version below, where $\varrho_m(\cdot)$ is a density function which essentially first appeared in [13] and has been modified in [20] to $$ \varrho_m(x) := \frac{1}{2}\Big(\frac{1}{m-1}\sum_{k=1}^{m-1}|\eta_k(x)|^2 + \frac{K(x,x) - \sum_{k=1}^{m-1}|e_k(x)|^2}{\sum_{j\ge m}\sigma_j^2}\Big)\,. \qquad (4.4) $$

Algorithm 1 Re-weighted least squares regression [13].
Input: $X = (x_1, \dots, x_k) \in D^k$ matrix of distinct sampling nodes, $f = (f(x_1), \dots, f(x_k))^\top$ samples of $f$ evaluated at the nodes from $X$, $m \in \mathbb{N}$, $m \le k$, such that the matrix $\widetilde{L}_{k,m}$ in (4.5) has full (column) rank.
Compute the re-weighted samples $g := (g_1, \dots, g_k)^\top$ with $g_i := f(x_i)/\sqrt{\varrho_m(x_i)}$.
Solve the over-determined linear system $$ \widetilde{L}_{k,m}\,(c_1, \dots, c_{m-1})^\top = g\,, \qquad \widetilde{L}_{k,m} := \Big(\frac{\eta_j(x_i)}{\sqrt{\varrho_m(x_i)}}\Big)_{i=1,\dots,k;\ j=1,\dots,m-1}\,, \qquad (4.5) $$ via least squares, cf. (4.1).
Output: $\widetilde{S}^m_X f := \sum_{j=1}^{m-1} c_j\,\eta_j$.
Note that the mapping $f \mapsto \widetilde{S}^m_X f$ is linear for a fixed set of sampling nodes if the matrix $\widetilde{L}_{k,m}$ is well-conditioned. The next section gives sufficient conditions when this is the case.
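To make Algorithm 1 concrete, the following minimal Python sketch runs the re-weighted least squares method in an assumed toy setting (cosine system on $[0,1]$, uniform sampling density, so the re-weighting is trivial; the functions `eta`, `density` and all parameters are illustrative choices, not the paper's construction).

```python
import numpy as np

rng = np.random.default_rng(1)

def eta(j, x):  # assumed ONB of L_2([0,1], dx): eta_0 = 1, eta_j = sqrt(2) cos(pi j x)
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * x)

def density(x, m):  # stand-in for the density rho_m of (4.4); uniform in this demo
    return np.ones_like(x)

def f(x):  # test function lying in span{eta_0, ..., eta_10}
    return np.cos(2.0 * np.pi * x) + 0.5 * np.cos(6.0 * np.pi * x)

m, k = 12, 400                 # k of order m log(m) samples, cf. Section 5
x = rng.random(k)              # i.i.d. nodes w.r.t. the (uniform) sampling density

w = 1.0 / np.sqrt(density(x, m))
L_tilde = np.column_stack([w * eta(j, x) for j in range(m - 1)])  # matrix (4.5)
g = w * f(x)                   # re-weighted samples

c, *_ = np.linalg.lstsq(L_tilde, g, rcond=None)   # least squares solution (4.1)

t = np.linspace(0.0, 1.0, 2001)
approx = sum(c[j] * eta(j, t) for j in range(m - 1))
print(f"max recovery error on a grid: {np.max(np.abs(approx - f(t))):.2e}")
```

With exact samples of a function in the span, the recovery is exact up to rounding; the interesting regime of Sections 5 and 6 is when $f$ has an infinite expansion and the number of nodes is pushed down to $O(m)$.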

Concentration results for random matrices
We start with a concentration inequality for the spectral norm of a matrix of type (4.2). It turns out that the complex matrix $L_{n,m} := L_{n,m}(X) \in \mathbb{C}^{n\times(m-1)}$ has full rank with high probability if $X = (x_1, \dots, x_n)$ is drawn i.i.d. at random from $D^n$ according to a measure $\mathbb{P} = d\varrho^n$, the functions $(\eta_k)_{k=1}^{m-1}$ are orthonormal w.r.t. the measure $\varrho$, and $m$ is not too large (compared to $n$). We will find below that the eigenvalues of $\frac{1}{n}L_{n,m}^*L_{n,m}$ then concentrate around $1$. Let us now turn to infinite matrices. We need a result which can be applied to independent $\ell_2$-sequences of the form considered below, where $$ F := \max\Big\{\frac{8r\log(n)}{n}M^2\kappa^2\,,\ \|\Lambda\|_{2\to 2}\Big\} $$ and $\kappa = 1+\dots$ This result can be rephrased as follows.
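The following short Python check (an illustration under assumed choices: uniform measure on $[0,1]$, cosine system, oversampling factor $40$) verifies empirically that for $n \asymp m\log(m)$ i.i.d. nodes the spectrum of $\frac{1}{n}L^*_{n,m}L_{n,m}$ stays inside $[1/2, 3/2]$, which is exactly the event used later in the proof of Theorem 6.1.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 32
n = int(40 * m * np.log(m))    # oversampling n ~ 40 m log(m), cf. Section 6
x = rng.random(n)              # i.i.d. nodes, uniform on [0,1] (assumed demo measure)

# Columns: the first m-1 functions of the cosine ONB evaluated at the nodes.
cols = [np.ones(n)] + [np.sqrt(2.0) * np.cos(np.pi * j * x) for j in range(1, m - 1)]
L = np.column_stack(cols)

ev = np.linalg.eigvalsh(L.conj().T @ L / n)
print(f"spectrum of (1/n) L*L contained in [{ev.min():.3f}, {ev.max():.3f}]")
```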

New bounds for sampling numbers
We are interested in the question of optimal sampling recovery of functions from reproducing kernel Hilbert spaces in $L_2(D,\varrho_D)$. The quantity we want to study is classically given by (1.1); it quantifies the recovery of functions out of $n$ function values in the worst-case setting. The goal is to get reasonable bounds for this quantity in $n$, preferably in terms of the singular numbers of the embedding. The correct asymptotic behavior of this quantity in different situations is one of the central open questions in Information Based Complexity (IBC), see, e.g., [17], [23], and in the field of Hyperbolic Cross Approximation, see [6].

Main idea. In the following theorem we apply Weaver's theorem to a random frame. The idea is to construct a sampling operator $S^m_J$ using $O(m)$ sampling nodes as follows. We draw nodes $X = (x_1, \dots, x_n)$ at random, where $n$ scales as $m\log m$. At this stage we have too many sampling nodes, but a "good" frame in the sense of (4.2). To this frame (the rows of $L_{n,m}$) we apply our modified Weaver theorem. The result is a shrunken, well-conditioned sub-frame corresponding to a subset $(x_i)_{i\in J}$ of the initial set of sampling nodes. With this sub-frame (a sub-matrix of $L_{n,m}$) we solve the over-determined system via the least squares Algorithm 1. This represents the sampling operator. When it comes to the error analysis, we again benefit from the fact that we deal with a (not too small compared to $n$) subset of the original nodes $X$. The consequence is that we do not pay too much (a factor $\sqrt{\log n}$) compared to the sampling operator based on the original set $X$ of nodes. However, we only used $O(m)$ sampling nodes, which makes the difference. (An illustrative numerical sketch of this pipeline is given at the end of this section.)

Theorem 6.1. Let $H(K)$ be a separable reproducing kernel Hilbert space on a set $D \subset \mathbb{R}^d$ with a positive semidefinite kernel $K(x,y)$ such that the embedding $\mathrm{Id}_{K,\varrho_D}: H(K) \to L_2(D,\varrho_D)$ is Hilbert-Schmidt. Let $(\sigma_k)_{k=1}^\infty$ denote the sequence of singular numbers of this compact embedding. Then we have for the sequence of sampling numbers $g_n := g_n(\mathrm{Id}_{K,\varrho_D})$ the general bound $$ g_n^2 \le \frac{C\log(n)}{n}\sum_{k\ge\lfloor cn\rfloor}\sigma_k^2\,,\qquad n \ge 2\,, $$ with two universal constants $C, c > 0$, which are specified in Remark 6.2 below.
Proof. Let $m \ge 2$. Similarly as in [13], [11] and [20], we use the density function (4.4) in order to consider the embedding $\widetilde{\mathrm{Id}}: H(\widetilde{K}_m) \to L_2(D, \mu_m)$ with $d\mu_m = \varrho_m(\cdot)\,d\varrho_D$. In fact, we define the new kernel $$ \widetilde{K}_m(x,y) := \frac{K(x,y)}{\sqrt{\varrho_m(x)}\sqrt{\varrho_m(y)}}\,. $$ This yields $N(m) \le 2(m-1)$ and $T(m) \le 2\sum_{j=m}^{\infty}\sigma_j^2$ for the spectral functions of the new kernel. For the details of this, see the discussion in the proof of [11, Thm. 5.7] and [11, Thm. 5.5]. Note that the operators $\widetilde{S}^m_X$ and $S^m_X$ are defined by Algorithm 1 and (4.1). The number of samples will be chosen later as $O(m)$. Choose now the smallest $n$ such that $$ m \le \frac{n}{40\log(n)}\,. $$
This implies $N(m) \le 2(m-1) \le n/(20\log(n))$. Applying Theorem 5.1 with $r = 2$ gives that the rows of the matrix $\frac{1}{\sqrt{n}}L_{n,m}$ represent a finite frame with frame bounds $1/2$ and $3/2$ with high probability (the failure probability $2n^{-1}$ decays polynomially in $n$) when $X = (x_1, \dots, x_n) \in D^n$ is sampled w.r.t. the measure $d\mu_m = \varrho_m(\cdot)\,d\varrho_D$. That means, with high probability we have for any $w \in \mathbb{C}^{m-1}$ $$ \frac{1}{2}\|w\|_2^2 \;\le\; \frac{1}{n}\|L_{n,m}w\|_2^2 \;\le\; \frac{3}{2}\|w\|_2^2\,. \qquad (6.3) $$ Let us now denote by $P_{m-1}: H(\widetilde{K}_m) \to H(\widetilde{K}_m)$ the projection operator onto $\mathrm{span}\{e_1(\cdot)/\sqrt{\varrho_m(\cdot)}, \dots, e_{m-1}(\cdot)/\sqrt{\varrho_m(\cdot)}\}$. Following the proof in [20, Thm. 5.1], we obtain almost surely the estimate (6.5) for some constant $c_1 > 0$. Due to the high probability of both events, (6.3) and (6.5), there exists an instance of nodes $\{x_1, \dots, x_n\}$ such that the bound (6.5) is true and $\frac{1}{\sqrt{n}}L_{n,m}$ represents a finite frame in $\mathbb{C}^{m-1}$. Note that all the assumptions of Theorem 2.3 are fulfilled. Indeed, the squared Euclidean norms of the rows of $\frac{1}{\sqrt{n}}L_{n,m}$ are bounded by $2(m-1)/n$. (Note that there is some danger of confusion here, since the roles of $n$ and $m$ are swapped.) To this finite frame we may apply Theorem 2.3, which yields a sub-matrix $L_{J,m}$ having $\#J = O(m)$ rows which, properly normalized, still form a frame in $\mathbb{C}^{m-1}$; the corresponding two-sided bound for all $w \in \mathbb{C}^{m-1}$ is stated in (6.6). With this matrix we perform the least squares method (4.1) applied to the shrunken vector of function samples $(f(x_i))_{i\in J}$ corresponding to the rows of $L_{J,m}$. We denote the least squares operator by $S^m_J$. Note that this operator uses only $\#J = O(m)$ samples. Let $\|g\|_{H(\widetilde{K}_m)} \le 1$ and again denote by $P_{m-1}$ the projection operator onto $\mathrm{span}\{e_1(\cdot)/\sqrt{\varrho_m(\cdot)}, \dots, e_{m-1}(\cdot)/\sqrt{\varrho_m(\cdot)}\}$. Then it holds $$ \|g - S^m_J g\|_{L_2} \;\le\; \|g - P_{m-1}g\|_{L_2} + \|S^m_J(g - P_{m-1}g)\|_{L_2}\,. \qquad (6.7) $$ Using Lemma 4.1 together with (6.6) allows to bound the second summand, and by the choice of $n$ together with (6.5) we may estimate further to arrive at (6.8). Hence, we obtain the claimed estimate, where all the involved constants are universal. This implies the statement of the theorem.
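As announced in the "Main idea" paragraph, the following Python sketch mirrors the two-stage construction at a purely illustrative level. Since the Weaver-type down-sampling of Theorem 2.3 is non-constructive, the `subsample` routine below is a hypothetical stand-in (norm-weighted random row selection); it reproduces the bookkeeping (draw $n \asymp m\log m$ random nodes, keep $O(m)$ of them, run least squares on the kept rows), not the deterministic frame guarantees.

```python
import numpy as np

rng = np.random.default_rng(3)

def eta(j, x):  # assumed cosine ONB on [0,1] (illustration only)
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * x)

def subsample(L, target):
    # Hypothetical stand-in for the non-constructive Weaver down-sampling:
    # keep `target` rows, picked with probability proportional to row norms.
    p = np.sum(np.abs(L) ** 2, axis=1)
    idx = rng.choice(L.shape[0], size=target, replace=False, p=p / p.sum())
    return np.sort(idx)

def f(x):  # test function in the span of the first basis functions
    return np.cos(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)

m = 16
n = int(40 * m * np.log(m))              # oversampled random draw
x = rng.random(n)
L = np.column_stack([eta(j, x) for j in range(m - 1)])

J = subsample(L, target=4 * m)           # keep only O(m) of the nodes
c, *_ = np.linalg.lstsq(L[J], f(x[J]), rcond=None)

t = np.linspace(0, 1, 2001)
approx = sum(c[j] * eta(j, t) for j in range(m - 1))
print(f"#nodes used: {len(J)}, max error: {np.max(np.abs(approx - f(t))):.2e}")
```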

An outstanding open problem
Let us once again comment on an important open problem for the optimal sampling recovery of multivariate functions. We consider the minimal worst-case error (sampling numbers/widths) defined by (7.1), cf. (1.1). Let us comment on the class $H^s_{\mathrm{mix}}(\mathbb{T}^d)$. That is, we consider functions on the torus $\mathbb{T}^d \simeq [0,1)^d$ where opposite points are identified. Note that the unit cube $[0,1]^d$ is preferred here since it has Lebesgue measure 1 and is therefore a probability space. We could have also worked with $[0,2\pi]^d$ and the Lebesgue measure (which can be made a probability measure by a $d$-dependent rescaling). For $\alpha \in \mathbb{N}$ we define the space $H^\alpha_{\mathrm{mix}}(\mathbb{T}^d)$ as the Hilbert space with the inner product given in (7.2), defined via the weight $$ w_\alpha(k) = (1 + (2\pi|k|)^{2\alpha})^{1/2}\,,\qquad k \in \mathbb{Z}\,, \qquad (7.3) $$ and the associated univariate kernel function in (7.4). It has been shown by various authors, see [6, Chapt. 4] and the references therein, that we have asymptotically (for $s > 0$) $$ \sigma_n(\mathrm{Id}_{s,d}) \asymp_{s,d} n^{-s}(\log n)^{(d-1)s}\,,\qquad n \ge 2\,. \qquad (7.5) $$ The correct asymptotic behavior of (7.1) is one of the central open questions in Information Based Complexity (IBC), see, e.g., [23], and Hyperbolic Cross Approximation, see [6, Outstanding Open Problem 1.4]. It has been shown by several authors, see e.g. [25], [5], [30] and [6, Sec. 5] for some historical remarks, that for $s > 1/2$ the bound $$ c_{s,d}\,n^{-s}(\log n)^{(d-1)s} \;\le\; g_n(\mathrm{Id}_{s,d}) \;\le\; C_{s,d}\,n^{-s}(\log n)^{(d-1)(s+1/2)} \qquad (7.6) $$ holds asymptotically in $n \in \mathbb{N}$. Note that there is a $d$-dependent gap in the logarithm between the upper and the lower bound.
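The asymptotics (7.5) can be made tangible numerically. The singular numbers of $\mathrm{Id}_{s,d}$ are the non-increasing rearrangement of the tensorized weights $\prod_{j=1}^{d} w_s(k_j)^{-1}$ over $k \in \mathbb{Z}^d$; the short Python script below (an illustration under the simplifying assumption that $w_s(k)$ is replaced by $(1+|k|)^s$, which changes only constants) sorts these values for $d = 2$ and compares them with the benchmark $n^{-s}(\log n)^{(d-1)s}$.

```python
import numpy as np

# Illustrate sigma_n ~ n^{-s} (log n)^{(d-1)s} for d = 2 by sorting the
# tensorized weights prod_j (1 + |k_j|)^{-s} over a finite frequency box.
s, K = 1.0, 512
w = (1.0 + np.abs(np.arange(-K, K + 1))) ** (-s)   # simplified univariate weights
sigma = np.sort(np.outer(w, w).ravel())[::-1]      # tensorize for d = 2 and sort

for n in (10, 100, 1000, 2000):
    rate = n ** (-s) * np.log(n) ** s              # benchmark rate (7.5) for d = 2
    print(f"n = {n:>4}: sigma_n = {sigma[n - 1]:.3e}, "
          f"n^(-s) log(n)^s = {rate:.3e}, ratio = {sigma[n - 1] / rate:.2f}")
```

The ratio stays of moderate size over several decades of $n$, reflecting the matching polynomial and logarithmic orders.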
Recently, Krieg and Ullrich [13] improved this bound by using a probabilistic technique, showing that for $s > 1/2$ $$ g_n(\mathrm{Id}_{s,d}) \;\lesssim_{s,d}\; n^{-s}(\log n)^{(d-1)s+s}\,. $$
Clearly, if $s < (d-1)/2$ then the gap in (7.6) is reduced to $(\log n)^s$, which, however, still grows with $s$. In particular, there is no improvement in $d = 2$. Nevertheless, this result can be considered a major progress in the research on the complexity of this problem. They disproved Conjecture 5.6.2 in [6] for $p = 2$ and $1/2 < s < (d-1)/2$. Indeed, the celebrated sparse grid points are now beaten by random points in a certain range of $s$. This again reflects the "power of random information", see [9]. Still, it is worth mentioning that the sparse grids represent the best known deterministic construction as far as the asymptotic order is concerned. Indeed, the guarantees are deterministic and only slightly worse compared to random nodes in the asymptotic regime. However, regarding preasymptotics the random constructions provide substantial advantages. The problem is somewhat related to the recent efforts in compressed sensing, where the optimal RIP matrices are given as realizations of random matrices, while the known deterministic constructions are far from being optimal.
In the present paper we prove that the sparse grids are beaten for the full range of $s > 1/2$ whenever $d \ge 3$. In case $d = 2$ our approach and the sparse grids show the same performance. Indeed, inserting (7.5) into the bound in Theorem 6.1 gives $$ g_n(\mathrm{Id}_{s,d}) \;\lesssim_{s,d}\; n^{-s}(\log n)^{(d-1)s+1/2}\,, \qquad (7.7) $$ which shortens the gap between upper and lower bound to $\sqrt{\log n}$. The best known lower bound is the one from (7.5). It is neither clear whether the bound in (7.7) is sharp nor whether it can be improved, so this framework might serve as a candidate for Remark 6.3. Therefore, the outstanding open question remains (see, e.g., [6, Chapt. 5] and the references therein) whether there is an intrinsic additional difficulty when restricting to algorithms based on function samples rather than general linear information such as Fourier coefficients. From a practical point of view, sampling algorithms are highly relevant since we usually only have discrete samples of functions at hand. The question remains: are the asymptotic characteristics $\sigma_n$ and $g_n$ of the same order, or do they rather behave like $\sigma_n = o(g_n)$? This represents a fundamental open problem in hyperbolic cross approximation, see [6, Outstanding Open Problem 1.4].