$L_2$-norm sampling discretization and recovery of functions from RKHS with finite trace

In this paper we study $L_2$-norm sampling discretization and sampling recovery of complex-valued functions in RKHS on $D \subset \mathbb{R}^d$ based on random function samples. We only assume the finite trace of the kernel (Hilbert–Schmidt embedding into $L_2$) and provide several concrete estimates with precise constants for the corresponding worst-case errors. In general, our analysis does not need any additional assumptions and also covers non-Mercer kernels and non-separable RKHS. The failure probability is controlled and decays polynomially in $n$, the number of samples. Under the mild additional assumption of separability we observe improved rates of convergence related to the decay of the singular values. Our main tool is a spectral norm concentration inequality for infinite complex random matrices with independent rows, complementing earlier results by Rudelson, Mendelson, Pajor, Oliveira and Rauhut.


Introduction
This paper can be seen as a continuation of [11,13]. We study the reconstruction of complex-valued multivariate functions on a domain $D \subset \mathbb{R}^d$ from values at the (randomly sampled) nodes $X := (x_1, \dots, x_n) \in D^n$ via weighted least squares algorithms. In addition, we are interested in the sampling discretization of the squared $L_2$-norm of such functions using $n$ random nodes. Both problems recently gained substantial interest, see [11,13,14,25–27,29], and are strongly related, as we know from Wasilkowski [30] and the recent systematic studies by Temlyakov [26] and Gröchenig [9] on $L_p$-norm discretization. Our main interest is in accurate estimates for worst-case errors depending on the number $n$ of nodes. In this paper, the functions are modeled as elements from some reproducing kernel Hilbert space $H(K)$, which is supposed to be compactly embedded into $L_2(D,\varrho_D)$. Its kernel is a positive definite Hermitian function $K : D \times D \to \mathbb{C}$. In the papers [11,13,18] the authors mainly restrict to the case of separable RKHS [11,18] or Mercer kernels on compact domains [13] with the finite trace property to study the worst-case recovery error

$$\sup_{\|f\|_{H(K)} \le 1} \|f - S^m_X f\|_{L_2(D,\varrho_D)} \qquad (1.1)$$

for some recovery operator $S^m_X$. It computes a best least squares fit $S^m_X f$ to the given data from the finite-dimensional space spanned by the first $m-1$ singular vectors $\eta_1(\cdot), \dots, \eta_{m-1}(\cdot)$ of the embedding

$$\mathrm{Id} : H(K) \to L_2(D,\varrho_D). \qquad (1.2)$$
We complement existing results by a refined analysis based on spectral norm concentration of infinite matrices, to improve on the constants and the bounds for the failure probability on the one hand. On the other hand, the question remained whether the bounds on (1.1) may be extended to the most general situation where only the finite trace condition is assumed. This setting is not covered by the above-mentioned references. In this paper we construct a new (weighted) least squares algorithm for this general situation, which was first addressed by Wasilkowski and Woźniakowski in [31]. Surprisingly, we are able to improve on the bound in [31, Thm. 1] by obtaining the worst-case bound $o(\sqrt{\log n/n})$ in case of square summable singular values $(\sigma_k)_k$ (finite trace) of the embedding. It seems that, in general, their decay influences the bounds rather weakly (in contrast to the results in [11,13,18]).
In addition to the general sampling recovery problem we study the discretization of $L_2$-integral norms in reproducing kernel Hilbert spaces $H(K)$ where only random information is used. To be more precise, we provide bounds for the $L_2$ worst-case discretization error

$$\sup_{\|f\|_{H(K)} \le 1} \Big| \|f\|^2_{L_2(D,\varrho_D)} - \frac{1}{n}\sum_{i=1}^n |f(x_i)|^2 \Big|. \qquad (1.3)$$

This quantity controls the simultaneous discretization of the squared $L_2(D,\varrho_D)$-norms of all functions from $H(K)$. For finite-dimensional spaces we speak of Marcinkiewicz–Zygmund inequalities, a classical topic which also gained a lot of interest in recent years, see Temlyakov [26] and the references therein. Let us emphasize that both problems (sampling recovery and discretization) are strongly related. It has been shown by Wasilkowski [30] that the recovery of the norm $\|\cdot\|$ of a function from a function class is equally difficult as the recovery of the function in that norm using linear information. In other words, if we have a good sampling recovery operator $Sf$ in $L_2(D,\varrho_D)$ we may construct an equally good recovery for the norm of $f$ by simply taking $\|Sf\|_{L_2(D,\varrho_D)}$ as approximant. This, however, is a simple consequence of the triangle inequality. Wasilkowski shows even more, namely that optimal information for the recovery problem is nearly optimal for the "norm-recovery" problem. However, let us emphasize that we recover the square of the norm in (1.3) (rather than the norm itself). It has been observed by Temlyakov in [25] that this indeed makes a difference if we assume a certain algebra property for point-wise multiplication, which is for instance present for mixed Sobolev spaces with smoothness $s > 1/2$. Taking into account that in this framework optimal quadrature behaves asymptotically better than sampling recovery (the improvement happens in the log), see [8, Chap. 5, 9] and the references therein, we see that Wasilkowski's result does not hold true for this slightly modified framework. In fact, the worst-case error (1.3) may behave much better than the corresponding optimal sampling recovery error. In contrast to that, we use random information here, i.e., nodes which are randomly drawn according to the natural probability measure $\varrho_D$ or some related measure, and aim for results with high probability. As stated below, we obtain a worse asymptotic error behavior for the classical discretization operator in (1.3) compared to the (non-squared) sampling recovery error in (1.1). However, we are able to control the dependence on the parameters and the failure probability rather explicitly, as (1.4), (1.6) and Theorems 1.2 and 1.3 show.
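To make the quantity in (1.3) concrete, the following small numerical sketch compares the plug-in estimator $\frac{1}{n}\sum_i |f(x_i)|^2$ with the exact squared norm for a simple test function. The choice $D = [0,1]$ with uniform $\varrho_D$ and the test function are illustrative assumptions, not objects from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setting (not from the paper): D = [0, 1], rho_D the uniform measure.
f = lambda x: np.sin(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)

# Reference value of ||f||^2_{L_2} via a dense grid (Riemann sum for uniform rho_D).
grid = np.linspace(0, 1, 200001)
norm_sq_exact = np.mean(np.abs(f(grid)) ** 2)

# Plug-in estimator (1/n) * sum_i |f(x_i)|^2 from n random nodes, cf. (1.3).
for n in [100, 1000, 10000]:
    x = rng.uniform(0, 1, n)
    print(n, abs(np.mean(np.abs(f(x)) ** 2) - norm_sq_exact))
```

The observed errors shrink roughly like $n^{-1/2}$, in line with the $\sqrt{(\log n)/n}$-type bounds discussed below.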
Major parts of the analysis in this paper are based on the following concentration inequality for sums of self-adjoint (infinite) complex random matrices.

Theorem 1.1 (Section 3) Let $y_i$, $i = 1, \dots, n$, be i.i.d. random sequences from the complex $\ell_2$. Let further $n \ge 3$ and $M > 0$ such that $\|y_i\|_2 \le M$ almost surely for all $i = 1, \dots, n$. We further put $\Lambda := \mathbb{E}\, y_i \otimes y_i$ and assume $\|\Lambda\|_{2\to2} \le 1$. Then we have for $0 < t \le 1$

$$\mathbb{P}\Big(\Big\|\frac{1}{n}\sum_{i=1}^n y_i \otimes y_i - \Lambda\Big\|_{2\to2} > t\Big) \le 2^{3/4}\, n\, \exp\Big(-\frac{t^2 n}{21 M^2}\Big).$$

Finite-dimensional results of this type are given by Rudelson [23], Tropp [28], Oliveira [20], Rauhut [22] and others. Mendelson and Pajor [16] were the first who addressed the infinite-dimensional case of real matrices as well, see Remark 3.1. The technique used has been introduced by Buchholz [2,3] and further developed by Rauhut for the purpose of analyzing RIP matrices based on complex bounded orthonormal systems (see [22] and the references therein). It is based on an operator version of the noncommutative Khintchine inequality [2,3] together with Talagrand's symmetrization technique.
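The statement of Theorem 1.1 can be illustrated by a Monte Carlo experiment in a finite truncation of $\ell_2$. The distribution of the $y_i$ below (uniform on a complex sphere, so that $\|y_i\|_2 = M$ almost surely and $\Lambda = (M^2/d)\,\mathrm{Id}$) is a hypothetical choice made only for this experiment.

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite truncation of ell_2 (d coordinates kept): y_i uniform on the complex
# sphere of radius M, so that ||y_i||_2 = M a.s. and Lambda = (M^2/d) * Id.
d, M = 50, 1.0

def sample_y(n):
    g = rng.standard_normal((n, d)) + 1j * rng.standard_normal((n, d))
    return M * g / np.linalg.norm(g, axis=1, keepdims=True)

Lambda = (M ** 2 / d) * np.eye(d)

for n in [100, 1000, 10000]:
    Y = sample_y(n)
    emp = Y.T @ Y.conj() / n                  # (1/n) sum_i y_i (y_i)^*
    print(n, np.linalg.norm(emp - Lambda, 2)) # spectral-norm deviation, ~ n^{-1/2}
```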
As a direct consequence of Theorem 1.1 we obtain for separable $H(K)$ and a probability measure $\varrho_D$ the bound

$$\sup_{\|f\|_{H(K)} \le 1} \Big| \|f\|^2_{L_2(D,\varrho_D)} - \frac{1}{n}\sum_{i=1}^n |f(x_i)|^2 \Big| \le \|K\|_\infty^2 \sqrt{\frac{21\, r \log n}{n}} \qquad (1.4)$$

with probability at least $1 - 2n^{1-r}$, whenever the kernel is bounded, i.e., $\|K\|_\infty := \sup_{x \in D} \sqrt{K(x,x)} < \infty$ (uniform boundedness). This condition is equivalent to the fact that the embedding of $H(K)$ into $\ell_\infty(D)$ is continuous and has norm less than or equal to a finite number $M$ (commonly called $M$-boundedness). The measure $\varrho_D$ is supposed to be a probability measure and the $x_i$, $i = 1, \dots, n$, are drawn independently at random according to $\varrho_D$. Note that this problem is related to classical uniform bounds on the "defect function" in learning theory with respect to $M$-bounded function classes, see, e.g., [4,6]. There, bounds for (1.3) are usually given in terms of covering (or entropy) numbers of the unit ball of $H(K)$ in $\ell_\infty(D)$, see [4,12]. Here we consider situations where we neither have such information nor an embedding into $\ell_\infty(D)$. Choosing $t$ appropriately (see Theorem 6.1), the worst-case discretization error may be bounded as $O(\sqrt{(\log n)/n})$ with high probability. To get rid of the uniform boundedness condition on the function class we may work with the weaker finite trace condition

$$\operatorname{tr}(K) := \int_D K(x,x)\, d\varrho_D(x) < \infty \qquad (1.5)$$

and prove a similar error bound for a slightly modified discretization operator when sampling the nodes $x_i$ independently according to the modified measure $\nu(x) d\varrho_D(x)$ with $\nu(x) := K(x,x)/\operatorname{tr}(K)$. One only has to replace $\|K\|_\infty^2$ by $\operatorname{tr}(K)$ in the right-hand side of (1.4). In other words, we have

$$\sup_{\|f\|_{H(K)} \le 1} \Big| \|f\|^2_{L_2(D,\varrho_D)} - \frac{1}{n}\sum_{i=1}^n \frac{|f(x_i)|^2}{\nu(x_i)} \Big| \le \operatorname{tr}(K) \sqrt{\frac{21\, r \log n}{n}} \qquad (1.6)$$

with probability exceeding $1 - 2n^{1-r}$ for large enough $n$, see Theorem 6.3. This means that the success probability tends to 1 rather quickly as the number of samples increases.
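One possible way to realize the change of measure is rejection sampling from the density $\nu$; the kernel diagonal below is a made-up example with finite trace, not a kernel from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setting: D = [0, 1] with uniform rho_D; K(x, x) = 1 + x^2.
K_diag = lambda x: 1.0 + x ** 2

grid = np.linspace(0, 1, 200001)
tr_K = np.mean(K_diag(grid))            # tr(K) = int_D K(x,x) d rho_D(x) (approx.)

def sample_nu(n):
    """Rejection sampling from nu(x) = K(x,x)/tr(K) w.r.t. rho_D."""
    out, bound = [], 2.0 / tr_K         # bound = sup_x K(x,x)/tr(K) for this kernel
    while len(out) < n:
        x = rng.uniform(0, 1)
        if rng.uniform(0, bound) < K_diag(x) / tr_K:
            out.append(x)
    return np.array(out)

# Weighted estimator (1/n) sum_i |f(x_i)|^2 / nu(x_i) with x_i ~ nu d rho_D, cf. (1.6).
f = lambda x: np.sin(2 * np.pi * x)
x = sample_nu(2000)
print(np.mean(np.abs(f(x)) ** 2 / (K_diag(x) / tr_K)))  # approx. 0.5 = ||f||^2
```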
As for the sampling recovery problem we start with a result in the most general situation. A modification of the recovery operator $S^m_X$ from [11,13], see Algorithm 1 below, has been used to study the situation which is left as an open problem in [11]. The result reads as follows.

Theorem 1.2 (Section 7) Let $K$ be a positive definite Hermitian kernel satisfying the finite trace condition (1.5). For $n \in \mathbb{N}$ and $r > 1$ put

$$m := \Big\lfloor \frac{n}{14\, r \log n} \Big\rfloor. \qquad (1.7)$$

Drawing $X = (x_1, \dots, x_n)$ at random according to the product measure $(\varrho_m(x) d\varrho_D(x))^n$ with the density defined in (7.1), we have the worst-case recovery bound (7.2) with probability at least $1 - 3n^{-r}$; in particular, the worst-case error is $o(\sqrt{\log n / n})$ whenever the singular values are square summable. Here $S^m_X$ is the least squares operator from Algorithm 1 together with (7.1), and $\operatorname{tr}_0(K)$ is defined in (4.6).
In fact, we recover all $f \in H(K)$ from sampled values at $X = (x_1, \dots, x_n)$ simultaneously with probability larger than $1 - 3n^{-r}$ by only assuming that the kernel $K(\cdot,\cdot)$ has finite trace (1.5). Note that this result improves on a result by Wasilkowski and Woźniakowski [31], where also only the finite trace is required, see also Novak and Woźniakowski [19, Thm. 26.10]. The authors proved (roughly speaking) a rate of $n^{-1/(2+p)}$ for the worst-case error with respect to standard information if the sequence of singular numbers is $p$-summable for $p \le 2$. We refer to Sect. 7 for further explanation.
In order to define the recovery operator $S^m_X$ and the sampling density $\varrho_m(x)$ we need to incorporate spectral properties of the embedding (1.2), namely also the left and right singular functions $(e_k)_k \subset H(K)$ and $(\eta_k)_k \subset L_2(D,\varrho_D)$, ordered according to their importance (size of the corresponding singular number). Both systems are orthonormal in the respective spaces and related by $e_k = \sigma_k \eta_k$.
The above result can be improved essentially if we assume that $H(K)$ is separable. This is for instance the case if $K$ is a Mercer kernel, i.e., continuous on a compact domain $D$. However, assuming only separability of $H(K)$ also includes the situation of continuous kernels on unbounded domains $D$, even $D = \mathbb{R}^d$. The following result improves on the results given in [11,13] in several directions: it works under less restrictive assumptions, the constants are improved and, last but not least, the failure probability decays polynomially in $n$. We would like to point out that, while preparing this manuscript, Ullrich [29] proved a version of the next theorem with stronger requirements and different constants based on Oliveira's concentration result (see Remark 3.9). The following theorem is a reformulation of Theorem 5.2 in Sect. 5.

Theorem 1.3 (Section 5) Let $K$ be a positive definite Hermitian kernel such that $H(K)$ is separable and the finite trace condition (1.5) holds true. With the notation from above we have, for $n \in \mathbb{N}$ and $m := \lfloor n/(14\, r \log n) \rfloor$ from (1.7), the error bound stated in Theorem 5.2, where the density $\varrho_m$ is defined in (5.3) and the operator $S^m_X$ is defined in Algorithm 1.

We would like to emphasize that the operator $S^m_X$ uses $n \asymp m \log m$ samples of its argument. Based on this result it has been recently shown by the second named author (and coauthors, see [18]) that there exists a sampling operator $S^m_J$ using only $O(m)$ samples yielding the bound

$$\sup_{\|f\|_{H(K)} \le 1} \|f - S^m_J f\|^2_{L_2(D,\varrho_D)} \le \frac{C}{m} \sum_{k \ge cm} \sigma_k^2$$

with universal and specified constants $C, c > 0$. However, for this improvement one has to sacrifice the high success probability.

Notation. As usual, $\mathbb{N}$ denotes the natural numbers, $\mathbb{N}_0 := \mathbb{N} \cup \{0\}$, $\mathbb{Z}$ denotes the integers, $\mathbb{R}$ the real numbers, $\mathbb{R}_+$ the non-negative real numbers and $\mathbb{C}$ the complex numbers. If not indicated otherwise, $\log(\cdot)$ denotes the natural logarithm of its argument. $\mathbb{C}^n$ denotes the complex $n$-space, whereas $\mathbb{C}^{m \times n}$ denotes the set of all $m \times n$ matrices $L$ with complex entries. Vectors and matrices are usually typeset in boldface, e.g., $x, y \in \mathbb{C}^n$. The matrix $L^*$ denotes the adjoint matrix. The spectral norm of a matrix $L$ is denoted by $\|L\|$ or $\|L\|_{2\to2}$. For a complex (column) vector $y \in \mathbb{C}^n$ (or $\ell_2$) we will often use the tensor notation for the matrix $y \otimes y := y \cdot y^* \in \mathbb{C}^{n \times n}$ (or $\mathbb{C}^{\mathbb{N} \times \mathbb{N}}$).
For $0 < p \le \infty$ and $x \in \mathbb{C}^n$ we denote $\|x\|_p := (\sum_{i=1}^n |x_i|^p)^{1/p}$ with the usual modification in the case $p = \infty$ or $x$ being an infinite sequence. Operator norms for $T : H(K) \to L_2$ will be denoted by $\|T\|_{K,2}$. As usual, $\mathbb{E}X$ denotes the expectation of a random variable $X$ on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$. Given a measurable subset $D \subset \mathbb{R}^d$ and a measure $\varrho$, we denote by $L_2(D,\varrho)$ the space of all square integrable complex-valued functions (equivalence classes) on $D$ with

$$\|f\|_{L_2(D,\varrho)} := \Big(\int_D |f(x)|^2\, d\varrho(x)\Big)^{1/2} < \infty.$$

We will often use $\Omega = D^n$ as probability space with the product measure $\mathbb{P} = d\varrho^n$ if $\varrho$ is a probability measure itself. We sometimes use the notation $f = O(g)$ for positive functions $f, g$, which means that there is a constant $C > 0$ such that $f \le C g$.

Concentration results for sums of random matrices
Let us begin with concentration inequalities for the spectral norm of sums of complex rank-1 matrices. Such matrices appear as $L^*L$ when studying least squares solutions of over-determined linear systems $L \cdot c = f$, where $L \in \mathbb{C}^{n \times (m-1)}$ is a matrix with $n > m$, $f \in \mathbb{C}^n$ and $c \in \mathbb{C}^{m-1}$. It is well known that the above system may not have a solution. However, we can ask for the vector $c$ which minimizes the residual $\|f - L \cdot c\|_2$. Multiplying the system with $L^*$ gives $L^*L \cdot c = L^* \cdot f$, which is called the system of normal equations. If $L$ has full rank then the unique solution of the least squares problem is given by $c = (L^*L)^{-1} L^* f$. For function recovery and discretization problems we will use the matrix

$$L_m := \big(\eta_k(x_i)\big)_{i=1,\dots,n;\ k=1,\dots,m-1}, \qquad (2.1)$$

built from a set $X = (x_1, \dots, x_n) \in D^n$ of distinct sampling nodes and a system of functions $(\eta_k)_k$; the coefficients of the approximant are computed via least squares, see Algorithm 1. Note that the mapping $f \mapsto S^m_X f$ is linear for a fixed set of sampling nodes $X = (x_1, \dots, x_n) \in D^n$.
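In code, the normal equations and a numerically preferable QR/SVD-based solver look as follows (a generic sketch for random test data, not tied to the matrices of this paper):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 200, 11
L = rng.standard_normal((n, m - 1)) + 1j * rng.standard_normal((n, m - 1))
f = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# Normal equations: L* L c = L* f (valid if L has full column rank).
c_normal = np.linalg.solve(L.conj().T @ L, L.conj().T @ f)
# QR/SVD-based least squares: better conditioned, preferred in practice.
c_lstsq, *_ = np.linalg.lstsq(L, f, rcond=None)
print(np.allclose(c_normal, c_lstsq))
```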
We start with a concentration inequality for the spectral norm of a matrix of type (2.1). It turns out that in certain situations the complex matrix $L_m := L_m(X) \in \mathbb{C}^{n \times (m-1)}$ has full rank with high probability, where $X = (x_1, \dots, x_n)$ is drawn at random from $D^n$ according to a product measure $\mathbb{P} = d\varrho^n$. We will find that the eigenvalues of

$$H_m := \frac{1}{n} L_m^* L_m = \frac{1}{n} \sum_{i=1}^n y_i \otimes y_i, \qquad y_i := (\eta_k(x_i))_{k=1}^{m-1}, \qquad (2.3)$$

are bounded away from zero with high probability if $m$ is small enough compared to $n$ and the functions $\eta_k(\cdot)$ form an orthonormal system with respect to the measure $\varrho$ from which the nodes in $X$ are sampled. Let us define the corresponding spectral function

$$N(m) := \sup_{x \in D} \sum_{k=1}^{m-1} |\eta_k(x)|^2.$$

Theorem 2.2 Let $H_m$ be given as above. Then, for $0 < t \le 1$,

$$\mathbb{P}\big(\lambda_{\min}(H_m) \le 1-t\big) \le m\, c_t^{-n/N(m)} \quad \text{as well as} \quad \mathbb{P}\big(\lambda_{\max}(H_m) \ge 1+t\big) \le m\, d_t^{-n/N(m)},$$

where $c_t := (1-t)^{1-t} e^t$ and $d_t := (1+t)^{1+t} e^{-t}$.

Proof We apply Theorem 2.1. To do this we define $A_i = \frac{1}{n} y_i \otimes y_i$. One easily sees that all the matrices $A_i$ are always positive semi-definite with $\lambda_{\max}(A_i) \le N(m)/n$, and $\lambda_{\min}(\mathbb{E} H_m) = \lambda_{\max}(\mathbb{E} H_m) = 1$ by the orthonormality of the $\eta_k$. Plugging this into Theorem 2.1 yields the claim.

Theorem 2.3 For $n \ge m$ and $r > 1$ the matrix $H_m$ has only eigenvalues greater than or equal to $1/2$ with probability at least $1 - n^{1-r}$, provided that $N(m) \le n/(7\, r \log n)$. In particular, we have

$$\|(L_m^* L_m)^{-1} L_m^*\|_{2\to2} \le \sqrt{\frac{2}{n}} \qquad (2.6)$$

with the same probability.

Proof Choosing $t = 1/2$ and solving for $N(m)$ in the above probability bound (using $n^{1-r}$ on the right-hand side) gives the desired result. Indeed, $c_{1/2} = \sqrt{e/2}$ and hence $\log c_{1/2} = (1-\log 2)/2 \ge 1/7$. This gives the following implications (read from bottom to top):

$$\mathbb{P}\big(\lambda_{\min}(H_m) \le 1/2\big) \le m\, c_{1/2}^{-n/N(m)} \le n^{1-r} \ \Longleftarrow\ \frac{n}{N(m)} \log c_{1/2} \ge r \log n \ \Longleftarrow\ N(m) \le \frac{n}{7\, r \log n}. \qquad (2.7)$$

The bound in (2.6) is a consequence of [11, Proposition 3.1].
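A quick numerical check of Theorem 2.3, with the hypothetical choice of the complex exponentials as orthonormal system (for which $N(m) = m-1$):

```python
import numpy as np

rng = np.random.default_rng(4)

# eta_k(x) = exp(2 pi i k x) are orthonormal w.r.t. the uniform measure on [0, 1].
def H_m(x, m):
    Lm = np.exp(2j * np.pi * np.outer(x, np.arange(1, m)))  # L_m = (eta_k(x_i))
    return Lm.conj().T @ Lm / len(x)                        # H_m = (1/n) L_m* L_m

n, r = 10000, 2
m = int(n / (14 * r * np.log(n)))                           # m as in (1.7)
lam_min = np.linalg.eigvalsh(H_m(rng.uniform(0, 1, n), m)).min()
print(m, lam_min)   # lam_min should exceed 1/2 with high probability
```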
From [11, Proposition 3.1] we also get a lower bound for $\|(L_m^* L_m)^{-1} L_m^*\|_{2\to2}$, namely of order $n^{-1/2}$, with probability at least $1 - n^{1-r}$, where the nodes are sampled i.i.d. according to $\varrho$.

Norm concentration for infinite matrices
In this section we want to extend some of the results from Sect. 2 to the infinite-dimensional framework. We provide a new concentration inequality derived from the non-commutative Khintchine inequality via a bootstrapping argument using a symmetrization result by Ledoux and Talagrand [15] for Rademacher sums of random operators $B_i = y_i \otimes y_i$, where $y_i$ denotes a complex random infinite $\ell_2$-sequence.
The $p$-th Schatten class $S_p$ is defined as the set of all compact operators $A$ on $\ell_2$ whose sequence of singular numbers $(s_j(A))_j$ belongs to $\ell_p$, equipped with $\|A\|_{S_p} := \|(s_j(A))_j\|_p$. The quantity $\|\cdot\|_{S_p}$ is a norm, see, e.g., [7].

Corollary 3.2 One easily sees that

$$\|A\|_{2\to2} \le \|A\|_{S_p}$$

and that for $A$ with rank at most $r$ it holds that

$$\|A\|_{S_p} \le r^{1/p} \|A\|_{2\to2}.$$

Proof (of Corollary 3.5) We utilize the non-commutative Khintchine inequality with $B_i := y_i \otimes y_i$, which belong to every $S_{2n}$ since they have (at most) rank 1. Applying literally the same arguments as in [22, Lemma 6.18] we obtain the result (see [17] for details).
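Both elementary inequalities of Corollary 3.2 are easy to verify numerically via the singular value decomposition; the random low-rank test matrix is, of course, only an illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 40))  # rank <= 3

s = np.linalg.svd(A, compute_uv=False)       # singular values s_j(A)
schatten = lambda p: np.sum(s ** p) ** (1 / p)
spec = s.max()                               # spectral norm ||A||_{2->2}
for p in [2, 4, 8]:
    # ||A||_{2->2} <= ||A||_{S_p} <= rank(A)^{1/p} ||A||_{2->2}
    assert spec <= schatten(p) <= 3 ** (1 / p) * spec + 1e-9
print(spec, schatten(2), schatten(4))
```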
We estimate tails of random variables by means of their moments; we will use a well-known relation between moment bounds and tail bounds, see, e.g., [22].

Proof (of Proposition 3.8) We use a method as in [22, Theorem 7.3]. For $2 \le p < \infty$ we consider the moment quantity $D_{p,n,M}$. Since $\sum_{i=1}^n y_i \otimes y_i$ has rank (at most) $n$, it is compact. The expectation matrix $\Lambda$ is a positive semidefinite operator with finite trace since $\|y_i\|_2 \le M$ for all $i = 1, \dots, n$ almost surely. This means $\Lambda$ is a trace-class operator and therefore compact. Since $\frac{1}{n}\sum_{i=1}^n y_i \otimes y_i - \Lambda$ is compact and the subspace $\mathcal{K}(\ell_2,\ell_2)$ of all compact operators from $\ell_2$ to $\ell_2$ is separable, we can choose a countable set $\mathcal{D}$ from the dual space of $\mathcal{K}(\ell_2,\ell_2)$ as in Proposition 3.7 such that the spectral norm of this difference is realized as a countable supremum over $\mathcal{D}$. Since every $f \in \mathcal{D}$ is a continuous linear functional we get (3.1). Proposition 3.7 applied to (3.1) with $F(t) = t^p$, together with Rudelson's lemma (Corollary 3.5) for $2 \le p < \infty$, gives the moment bound (3.2).

We now consider the random variable $\min\{F, \|\frac{1}{n}\sum_{i=1}^n y_i \otimes y_i - \Lambda\|_{2\to2}\}$. In the case $D^2_{p,n,M} \le F$ the minimum can be estimated by the moment bound above; using Proposition 3.6 we then obtain the tail estimate (3.4). Using this result we are now able to prove our main concentration inequality.

Proof of Theorem 1.1 Let us return to (3.2) in the above proof. Since $\|\Lambda\|_{2\to2} \le 1$ we get as a consequence for $0 < t \le 1$ a variant of the moment bound with $\tilde{D}_{p,n,M}$ instead of $D_{p,n,M}$. We continue in the proof as above, using $\tilde{D}_{p,n,M}$ and without replacing $u$ by $\sqrt{2r\log n}$ in (3.3). With the same argumentation as above we get rid of the minimum and obtain this time the tail bound (3.6) for $F \ge \max\{4M^2u^2\kappa^2/(nt),\, t\}$. The maximum is no longer necessary if we choose $u^2 := t^2 n/(4M^2\kappa^2)$. Plugging this choice into (3.6) and noting that $8\kappa^2 \le 21$ gives the desired bound.

Remark 3.9
The first result of this type is due to Rudelson [23] for vectors from $\mathbb{R}^n$. Complex versions were proved by Rauhut [22] and Oliveira [20, Lem. 1]. Note that the result stated by Oliveira (Lemma 1) contains a small inaccuracy in the probability bound; a corrected version has been stated in [11, Prop. 4.1]. In his paper Oliveira also comments on the infinite-dimensional complex situation where $m = \infty$, but does not give a full proof. His proof method is different from ours. Note also that in [16, Cor. 2.6] Mendelson and Pajor give a concentration result for the infinite-dimensional case of real vectors. Let us finally mention that a version of our Theorem 1.3 (in the next section) under more restrictive assumptions has been recently proved by Ullrich [29] based on Oliveira's concentration result.

Reproducing kernel Hilbert spaces
We will work in the framework of reproducing kernel Hilbert spaces. The relevant theoretical background can be found in [1, Chap. 1] and [4, Chap. 4]. The papers [10,24] are also of particular relevance for the subject of this paper. Let $L_2(D,\varrho_D)$ be the space of complex-valued square-integrable functions with respect to $\varrho_D$. Here $D \subset \mathbb{R}^d$ is an arbitrary subset and $\varrho_D$ a measure on $D$. We further consider a reproducing kernel Hilbert space $H(K)$ with a Hermitian positive definite kernel $K(\cdot,\cdot)$ on $D \times D$. The crucial property of reproducing kernel Hilbert spaces is the fact that Dirac functionals are continuous, or equivalently, the reproducing property

$$f(x) = \langle f, K(\cdot,x)\rangle_{H(K)}$$

holds for all $x \in D$ and $f \in H(K)$. It ensures that point evaluations are continuous functionals on $H(K)$. We will use the notation from [4, Chap. 4]. In the framework of this paper, the finite trace of the kernel,

$$\operatorname{tr}(K) := \int_D K(x,x)\, d\varrho_D(x) < \infty, \qquad (4.1)$$

will play a central role. With $\ell_\infty(D)$ we denote the set of bounded functions on $D$ and with $\|\cdot\|_{\ell_\infty(D)}$ the supremum norm. Note that we do not need the measure $\varrho_D$ for this embedding. In fact, here we mean "boundedness" in the strong sense (in contrast to essential boundedness w.r.t. the measure $\varrho_D$). The embedding operator

$$\mathrm{Id}_{K,\varrho_D} : H(K) \to L_2(D,\varrho_D) \qquad (4.3)$$

is Hilbert–Schmidt under the finite trace condition (4.1), see [10], [24, Lemma 2.3], which we always assume from now on. We additionally assume that $H(K)$ is at least infinite-dimensional. Let us denote by $(\lambda_j)_{j\in\mathbb{N}}$ the (at most) countable system of strictly positive eigenvalues of $W_{K,\varrho_D} = \mathrm{Id}^*_{K,\varrho_D} \circ \mathrm{Id}_{K,\varrho_D}$, arranged in non-increasing order, i.e., $\lambda_1 \ge \lambda_2 \ge \dots > 0$. We will also need the left and right singular vectors $(e_k)_k \subset H(K)$ and $(\eta_k)_k \subset L_2(D,\varrho_D)$, which both represent orthonormal systems in the respective spaces, related by $e_k = \sigma_k \eta_k$ with $\lambda_k = \sigma_k^2$ for $k \in \mathbb{N}$. We would like to emphasize that the embedding (4.3) is not necessarily injective. In other words, for certain kernels there might be a nontrivial nullspace of the embedding $\mathrm{Id}$ in (4.4). Therefore, the system $(e_k)_k$ from above is not necessarily a basis in $H(K)$. It would be a basis under additional restrictions, e.g., if the kernel $K(\cdot,\cdot)$ is continuous and bounded (Mercer kernel). Based on this observation we will decompose the kernel $K(\cdot,\cdot)$ as follows:

$$K(x,y) = \sum_k e_k(x)\overline{e_k(y)} + K_0(x,y). \qquad (4.5)$$

By Bessel's inequality we get that $K_0(x,x) = K(x,x) - \sum_k |e_k(x)|^2 \ge 0$ and hence

$$\operatorname{tr}_0(K) := \operatorname{tr}(K) - \sum_k \lambda_k \ge 0. \qquad (4.6)$$

It is shown in [10], [24, Lemma 2.3] that if $\operatorname{tr}(K) < \infty$ and $H(K)$ is separable then $\operatorname{tr}_0(K) = 0$. As we will see below, it will make a big difference whether $\operatorname{tr}_0(K)$ vanishes or not. The second case is only apparent if $H(K)$ is non-separable; in other words, if $\operatorname{tr}_0(K) > 0$ then $H(K)$ is necessarily non-separable. The function $m \mapsto \sup_{x\in D}\sum_{k=1}^{m-1}|\eta_k(x)|^2$ from Sect. 2 is often called "spectral function", see [9] and the references therein.
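For intuition, the spectral data $(\lambda_j, \sigma_j, \eta_j)$ of the embedding can be approximated numerically by discretizing the kernel integral operator on a fine grid (a Nyström-type sketch; the Gaussian kernel on $[0,1]$ with uniform $\varrho_D$ is an illustrative assumption, not the general setting of the paper):

```python
import numpy as np

# Illustrative kernel: Gaussian on D = [0, 1], rho_D uniform.
def kernel(x, y):
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / 0.1)

N = 2000
x = np.linspace(0, 1, N)
w = 1.0 / N                         # uniform quadrature weights
G = kernel(x, x)

# Eigenvalues of the discretized integral operator approximate (lambda_j)_j.
lam, V = np.linalg.eigh(w * G)
lam, V = lam[::-1], V[:, ::-1]      # non-increasing order
eta = V / np.sqrt(w)                # quadrature-orthonormal eigenfunction values

print("trace:", w * np.trace(G), " sum of eigenvalues:", lam.sum())
print("largest singular values sigma_j:", np.sqrt(np.maximum(lam[:5], 0)))
```

The printed trace equals the sum of all eigenvalues, mirroring $\operatorname{tr}(K) = \sum_j \lambda_j$ in the separable case ($\operatorname{tr}_0(K) = 0$).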

Sampling recovery guarantees for separable RKHS
In this section we deal with the case that $H(K)$ is a separable Hilbert space on a subset $D \subset \mathbb{R}^d$ which is compactly embedded into $L_2(D,\varrho_D)$ for a given measure $\varrho_D$. The first theorem gives a result in a more restrictive situation, namely that $\varrho_D$ is a probability measure and the kernel is bounded. In the second theorem we sample with respect to the probability density function $\varrho_m$ defined below in (5.3) and introduced by Krieg and Ullrich [13,14]. We use the same proof strategy as in [11]. Here we do not apply Rudelson's bound [23] on the expectation; we rather use the concentration inequality proved in Proposition 3.8. This leads to a polynomially decaying failure probability, see also [29].

Proof (of Theorem 5.1) We define the events $A$ and $B$, where the event $A$ controls the smallest eigenvalue of the matrix $H_m$ from (2.3) and the event $B$ the concentration quantity $F$ from (3.4). The operator $\tilde{L}_m$ is the weighted least squares matrix from Algorithm 1. Using the union bound estimate we get

$$\mathbb{P}(A \cap B) \ge 1 - \mathbb{P}(A^c) - \mathbb{P}(B^c).$$

Theorem 2.3 implies $\mathbb{P}(A^c) \le n^{1-r}$, and after noting that $\tilde{L}_m^* \tilde{L}_m = \sum_{i=1}^n y_i \otimes y_i$ we infer from Proposition 3.8 that $\mathbb{P}(B^c) \le 2^{3/4} n^{1-r}$. In total we have $\mathbb{P}(A \cap B) \ge 1 - \eta n^{1-r}$ with $\eta := 2^{3/4} + 1$. According to the proof of [11, Theorem 5.5] we need the function

$$T(\cdot,x) := K(\cdot,x) - \sum_{j=1}^\infty e_j(\cdot)\overline{e_j(x)}, \qquad (5.2)$$

which denotes an element of $H(K)$. Its norm is given by

$$\|T(\cdot,x)\|^2_{H(K)} = K(x,x) - \sum_{j=1}^\infty |e_j(x)|^2.$$

Note that the function in (5.2) is zero for $\varrho_D$-almost every $x$ because we have an equality sign in (4.6) due to our assumptions (separability of $H(K)$ and finite trace). Hence, $\mathbb{P}(C) = 1$ with

$$C := \{X \in D^n :\ T(\cdot,x_i) = 0 \text{ for all } i = 1, \dots, n\}.$$

Let now $X \in A \cap B \cap C$. Then we can use, for any $f \in H(K)$ with $\|f\|_{H(K)} \le 1$, a similar argument as in [11, Theorem 5.5], which yields the claimed bound.

In the sequel we consider a more general situation. The measure from which the points are sampled will be adapted to the spectral properties of the embedding. This allows us to specify the bound above in terms of the singular numbers of the embedding and to benefit from their decay. Let us recall the density function from [13], which we adapt to our framework as follows:

$$\varrho_m(x) := \frac{1}{2}\Big(\frac{1}{m-1}\sum_{j=1}^{m-1}|\eta_j(x)|^2 + \frac{K(x,x) - \sum_{j=1}^{m-1}|e_j(x)|^2}{\operatorname{tr}(K) - \sum_{j=1}^{m-1}\lambda_j}\Big). \qquad (5.3)$$

Algorithm 1 Weighted least squares approximation [5], [13], [11].
Input: $X = (x_1, \dots, x_n) \in D^n$ a set of distinct sampling nodes, samples of $f$ evaluated at the nodes from $X$, and $m \in \mathbb{N}$, $m < n$, such that the matrix $\tilde{L}_m$ in (5.4) has full (column) rank.
Compute weighted samples $g := (g_j)_{j=1}^n$ with $g_j := f(x_j)/\sqrt{\varrho_m(x_j)}$.
Solve the over-determined linear system $\tilde{L}_m c = g$ via least squares, i.e., compute $c := (\tilde{L}_m^* \tilde{L}_m)^{-1} \tilde{L}_m^* g$, where

$$\tilde{L}_m := \Big(\frac{\eta_k(x_j)}{\sqrt{\varrho_m(x_j)}}\Big)_{j=1,\dots,n;\ k=1,\dots,m-1}. \qquad (5.4)$$

Output: $S^m_X f := \sum_{k=1}^{m-1} c_k \eta_k \in \operatorname{span}\{\eta_1, \dots, \eta_{m-1}\}$.

We know from (4.6) that the sequence of singular numbers is square summable. We use the modified density function $\varrho_m(\cdot) : D \to \mathbb{R}$ from (5.3), which has been introduced in [11] as a version of the one from [13]. As above, the family $(e_j(\cdot))_{j\in\mathbb{N}}$ represents the eigenvectors corresponding to the non-vanishing eigenvalues of the compact self-adjoint operator $W_{\varrho_D} := \mathrm{Id}^* \circ \mathrm{Id} : H(K) \to H(K)$, the sequence $(\lambda_j)_{j\in\mathbb{N}}$ represents the ordered eigenvalues, and finally $\eta_j := \lambda_j^{-1/2} e_j$. Clearly, as a consequence of (4.5), the function $\varrho_m$ is positive and defined pointwise for any $x \in D$. Moreover, it can be computed precisely from the knowledge of $K(x,x)$ and the first $m-1$ eigenvalues and corresponding eigenvectors. It clearly holds that $\int_D \varrho_m(x)\, d\varrho_D(x) = 1$. Here is one of the main results of this paper (note that Theorem 1.3 from the introduction is a simple reformulation of Theorem 5.2 below).

Proof (of Theorem 5.2) This result is a consequence of Theorem 5.1 above, applied to the newly constructed probability measure $\varrho_m(x) d\varrho_D(x)$, together with the observation that the renormalized system $\eta_k/\sqrt{\varrho_m}$ satisfies the spectral function bound $N(m) \le 2(m-1)$.
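A compact sketch of Algorithm 1 in the special case where the orthonormal system consists of complex exponentials on $[0,1]$ and the sampling density is taken to be $\varrho_D$ itself; in the theorems above one samples from $\varrho_m$ of (5.3) instead, and all concrete choices here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(7)

def recover(f_vals, x, m, density_vals):
    """Weighted least squares fit of the first m-1 coefficients, cf. Algorithm 1."""
    w = 1.0 / np.sqrt(density_vals)                        # weights 1/sqrt(rho(x_j))
    Lm = w[:, None] * np.exp(2j * np.pi * np.outer(x, np.arange(m - 1)))
    g = w * f_vals                                         # weighted samples
    c, *_ = np.linalg.lstsq(Lm, g, rcond=None)             # least squares solution
    return c                                               # coefficients w.r.t. eta_k

# Test function with frequencies 1 and 3 (coefficients 1 and 0.3).
f = lambda x: np.exp(2j * np.pi * x) + 0.3 * np.exp(2j * np.pi * 3 * x)
n, m = 2000, 16
x = rng.uniform(0, 1, n)
c = recover(f(x), x, m, np.ones(n))   # uniform rho_D, so density = 1
print(np.round(c[:5], 3))             # c[1] ~ 1, c[3] ~ 0.3, rest ~ 0
```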

Sampling discretization of the $L_2$-norm
Motivated by supervised learning theory, see e.g. [6], one is interested in uniform bounds for the following version of the "defect function",

$$\sup_{\|f\|_H \le 1} \Big| \|f\|^2_{L_2(D,\varrho_D)} - \frac{1}{n}\sum_{i=1}^n |f(x_i)|^2 \Big|,$$

with respect to $f$ belonging to some hypothesis space $H$, which is usually embedded into $C(D)$, the space of continuous functions on the domain $D$. From a more classical perspective, authors were interested in discretizing $L_p$-norms using Marcinkiewicz–Zygmund inequalities. This subject has recently been studied systematically by Temlyakov [25], see also Gröchenig [9]. The following theorem is an immediate implication of our concentration result, Theorem 1.1.
If we fix $r > 1$, the above bound can be reformulated as

$$\sup_{\|f\|_{H(K)} \le 1} \Big| \|f\|^2_{L_2(D,\varrho_D)} - \frac{1}{n}\sum_{i=1}^n |f(x_i)|^2 \Big| \le \sqrt{21\,r}\, \|K\|_\infty^2 \sqrt{\frac{\log n}{n}}$$

with probability exceeding $1 - 2n^{1-r}$ and $n$ sufficiently large, namely such that $21 \tilde{M}^2 r \log(n)/n \le 1$. The nodes $X = (x_1, \dots, x_n)$ are randomly drawn according to the product measure $(d\varrho_D(x))^n$.

Proof Fix $f$ with $\|f\|_{H(K)} \le 1$ and put $M := \|K\|_\infty$. The identity $f(x) = \sum_{k=1}^\infty \langle f, e_k\rangle_{H(K)} e_k(x)$ holds almost surely since $\operatorname{tr}_0(K) = 0$ due to the separability of $H(K)$, see the paragraph after (4.6). In fact, the identity fails only on a null set $A \subset D^n$, which is independent of $f$. This follows by using the same arguments as after (5.2). Hence, the discretization error is bounded by $t$ with probability exceeding $1 - 2^{3/4} n \exp(-t^2 n/(21 \tilde{M}^2))$ by Theorem 1.1. Here $\tilde{M}^2 = M^2/\|\Lambda\|_{2\to2}$. Hence, we may choose $t = \sqrt{21 \tilde{M}^2 r \log(n)/n} \le 1$ to finally get

$$\sup_{\|f\|_{H(K)} \le 1} \Big| \|f\|^2_{L_2(D,\varrho_D)} - \frac{1}{n}\sum_{i=1}^n |f(x_i)|^2 \Big| \le \sqrt{21\,r}\, \|K\|_\infty^2 \sqrt{\frac{\log n}{n}}. \qquad (6.1)$$

Remark 6.2 (i) The uniform boundedness (in $\ell_\infty$) of a function class is sometimes also called $M$-boundedness in the learning theory literature. It represents a common assumption there when analyzing the defect function. In fact, uniform bounds on the defect function are usually proved via the concept of covering numbers of the unit ball of $H(K)$ in $\ell_\infty(D)$, see [12,25]. In the above theorem covering number estimates were not used at all. (ii) The quantity $\|K\|_\infty$ may be replaced by $\|K\|_{L_\infty(D,\varrho_D)}$, i.e., the essential supremum with respect to the probability measure $\varrho_D$. Since $\|K\|_{L_\infty(D,\varrho_D)}$ might be smaller than $\|K\|_\infty$, we obtain a slight improvement.
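For the reader's convenience we spell out the simple balancing behind this choice of $t$ (using the tail bound of Theorem 1.1 as stated in the introduction): requiring $\exp(-t^2 n/(21\tilde{M}^2)) = n^{-r}$ leads to

$$\frac{t^2 n}{21 \tilde{M}^2} = r \log n \iff t = \sqrt{\frac{21 \tilde{M}^2 r \log n}{n}},$$

so that the failure probability becomes $2^{3/4} n \cdot n^{-r} \le 2 n^{1-r}$, while the constraint $t \le 1$ is precisely the condition $n/\log n \ge 21\, r \tilde{M}^2$ on the sample size.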
Now we want to get rid of the uniform boundedness of the class and only assume the finite trace. We have to change the sampling measure and modify the norm discretization operator by incorporating weights, as announced in (1.6). The corresponding theorem, Theorem 6.3, provides such a bound with $\operatorname{tr}(K)$ in place of $\|K\|^2_\infty$, with an absolute constant $C > 0$ and $m := \lfloor n/(14\, r \log n) \rfloor$. Note that this result improves on a result in Wasilkowski and Woźniakowski [31], see also Novak and Woźniakowski [19, Thm. 26.10]. The authors in [31, Thm. 1] constructed a recovery operator using $n$ samples having squared worst-case error not greater than

$$\min\Big\{\sigma_\ell^2 + \frac{\ell \operatorname{tr}(K)}{n} :\ \ell = 1, 2, 3, \dots\Big\}.$$
If we, for instance, assume that $\sum_{k=1}^\infty \sigma_k^p < \infty$ for some $0 < p \le 2$, then we may balance $\ell \asymp n^{p/(p+2)}$ to obtain a rate of $O(n^{-1/(p+2)})$. In Theorem 1.2, see (7.2), we obtain a rate of $o(\sqrt{(\log n)/n})$ already for $p = 2$. In case $p < 2$ we obtain $O(n^{-1/2})$. It seems that, in general, the decay properties of the singular values have a rather weak influence on the recovery bound compared to the separable case, where the rate is much better than $O(n^{-1/2})$. A lower bound showing that we cannot essentially improve on $O(n^{-1/2})$ with the above algorithm will be provided in [17].
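The balancing of the Wasilkowski–Woźniakowski bound can also be checked numerically; the singular value decay $\sigma_\ell = \ell^{-1/p}$ and the normalization $\operatorname{tr}(K) = 1$ below are model assumptions made purely for illustration.

```python
import numpy as np

# Model decay sigma_l = l^{-1/p} (p-summable up to logarithms), tr(K) = 1.
for p in [1.0, 2.0]:
    for n in [10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]:
        l = np.arange(1.0, n + 1)
        val = np.min(l ** (-2.0 / p) + l / n)   # min over l of sigma_l^2 + l tr(K)/n
        # The ratio stabilizes, confirming the squared-error rate n^{-2/(p+2)},
        # i.e. the rate n^{-1/(p+2)} for the (non-squared) worst-case error.
        print(p, n, round(val / n ** (-2.0 / (p + 2)), 3))
```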