1 Introduction

This paper can be seen as a continuation of [11, 13]. We study the reconstruction of complex-valued multivariate functions on a domain \(D \subset \mathbb {R}^d\) from values at the (randomly sampled) nodes \({\mathbf {X}}\,{:}{=}\,({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n) \in D^n\) via weighted least squares algorithms. In addition, we are interested in the sampling discretization of the squared \(L_2\)-norm of such functions using n random nodes. Both problems recently gained substantial interest, see [11, 13, 14, 25, 26, 27, 29], and are strongly related as we know from Wasilkowski [30] and the recent systematic studies by Temlyakov [26] and Gröchenig [9] on \(L_p\)-norm discretization. Our main interest is in accurate estimates for worst-case errors depending on the number n of nodes. In this paper, the functions are modeled as elements from some reproducing kernel Hilbert space H(K), which is supposed to be compactly embedded into \(L_2(D,\varrho _D)\). Its kernel is a positive definite Hermitian function \(K:D\times D\rightarrow \mathbb {C}\). In the papers [11, 13, 18] the authors mainly restrict to the case of separable RKHS [11, 18] or Mercer kernels on compact domains [13] with the finite trace property to study the quantity

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)} \le 1} \int _D|f({\mathbf {x}})-S^m_{{\mathbf {X}}}f({\mathbf {x}})|^2\,d\varrho _D({\mathbf {x}})\, \end{aligned}$$
(1.1)

for some recovery operator \(S^m_{{\mathbf {X}}}\). It computes a best least squares fit \(S_{{\mathbf {X}}}^mf\) to the given data

$$\begin{aligned} {\mathbf {f}} = (f({\mathbf {x}}^1),\ldots ,f({\mathbf {x}}^n))^\top \end{aligned}$$

from the finite-dimensional space spanned by the first \(m-1\) singular vectors \(\eta _1(\cdot ),\ldots ,\eta _{m-1}(\cdot )\) of the embedding

$$\begin{aligned} \mathrm {Id}: H(K) \rightarrow L_2(D,\varrho _D). \end{aligned}$$
(1.2)

We complement existing results by a refined analysis based on spectral norm concentration of infinite matrices to improve, on the one hand, on the constants and on the bounds for the failure probability. On the other hand, the question remained whether the bounds on (1.1) may be extended to the most general situation where only the finite trace condition is assumed. This setting is not covered by the above-mentioned references. In this paper we construct a new (weighted) least squares algorithm for this general situation, which was first addressed by Wasilkowski and Woźniakowski in [31]. Surprisingly, we were able to improve on the bound in [31, Thm. 1] by obtaining the worst-case bound \(o(\sqrt{\log n / n})\) in the case of square summable singular values \((\sigma _k)_k\) (finite trace) of the embedding. It seems that, in general, their decay influences the bounds rather weakly (in contrast to the results in [11, 13, 18]).

In addition to the general sampling recovery problem we study the discretization of \(L_2\)-integral norms in reproducing kernel Hilbert spaces H(K) where only random information is used. To be more precise, we provide bounds for the following \(L_2\)-worst-case discretization errors

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)} \le 1} \left| \int _D |f({\mathbf {x}})|^2d\varrho _D({\mathbf {x}}) - \frac{1}{n}\sum \limits _{i=1}^n|f({\mathbf {x}}^i)|^2\right| . \end{aligned}$$
(1.3)

This quantity controls the simultaneous discretization of the squared \(L_2(D,\varrho _D)\)-norms of all functions from H(K). For finite-dimensional spaces we speak of Marcinkiewicz–Zygmund inequalities, a classical topic which also gained a lot of interest in recent years, see Temlyakov [26] and the references therein. Let us emphasize that both problems (sampling recovery and discretization) are strongly related. It has been shown by Wasilkowski [30] that the recovery of the norm \(\Vert \cdot \Vert \) of a function from a function class is as difficult as the recovery of the function in that norm using linear information. In other words, if we have a good sampling recovery operator Sf in \(L_2(D,\varrho _D)\) we may construct an equally good recovery for the norm of f by simply taking \(\Vert Sf\Vert _{L_2(D,\varrho _D)}\) as approximant. This, however, is a simple consequence of the triangle inequality. Wasilkowski shows even more, namely that optimal information for the recovery problem is nearly optimal for the “norm-recovery” problem. However, let us emphasize that we recover the square of the norm in (1.3) (rather than the norm itself). It has been observed by Temlyakov in [25] that this indeed makes a difference if we assume a certain algebra property for point-wise multiplication, namely \(\Vert fg\Vert _{H(K)} \le c\Vert f\Vert _{H(K)}\cdot \Vert g\Vert _{H(K)}\), which is for instance present for mixed Sobolev spaces with smoothness \(s>1/2\). Taking into account that in this framework optimal quadrature behaves asymptotically better than sampling recovery (the improvement happens in the \(\log \)), see [8, Chap. 5, 9] and the references therein, we see that Wasilkowski’s result does not hold true for this slightly modified framework. In fact, the worst-case error (1.3) may behave much better than the corresponding optimal sampling recovery error. In contrast to that we use random information here, i.e., nodes which are randomly drawn according to the natural probability measure \(\varrho _D\) or some related measure, and aim for results with high probability. As stated below we obtain a worse asymptotic error behavior for the classical discretization operator in (1.3) compared to the (non-squared) sampling recovery error in (1.1). However, we are able to control the dependence on the parameters and the failure probability rather explicitly, as (1.4), (1.6) and Theorems 1.2 and 1.3 show.

Major parts of the analysis in this paper are based on the following concentration inequality for sums of complex self-adjoint (infinite) random matrices.

Theorem 1.1

(Section 3) Let \({\mathbf {y}}^i , i= 1, \dots , n,\) be i.i.d. random sequences from the complex \( \ell _2\). Let further \(n \ge 3\) and \(M>0\) such that \(\Vert {\mathbf {y}}^i \Vert _2 \le M\) almost surely for all \(i=1, \dots , n\). We further put \({\Lambda } \,{:}{=}\, \mathbb {E}{\mathbf {y}}^i \otimes {\mathbf {y}}^i\) and assume that \(\Vert {\Lambda }\Vert _{2\rightarrow 2} \le 1\). Then we have for \(0<t\le 1\)

$$\begin{aligned} \mathbb {P}\left( \left\| \frac{1}{n} \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i - {\Lambda } \right\| _{2 \rightarrow 2} \ge t \right) \le 2^\frac{3}{4} n\exp \left( -\frac{t^2n}{21M^2}\right) . \end{aligned}$$
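
To illustrate the quantitative content of this bound, the following minimal numerical sketch (our own illustration, not part of the paper) estimates the failure probability in a simple finite-dimensional truncation: the \({\mathbf {y}}^i\) are drawn uniformly from the unit sphere of \(\mathbb {R}^d\), so that \(\Vert {\mathbf {y}}^i\Vert _2 \le M = 1\) almost surely and \({\Lambda } = \mathbb {E}\,{\mathbf {y}}^i\otimes {\mathbf {y}}^i = d^{-1}\mathbf {I}\) has spectral norm at most 1.

```python
import numpy as np

# Monte Carlo sketch of Theorem 1.1 in a finite-dimensional truncation (an
# illustration of ours).  The y^i are uniform on the unit sphere of R^d, hence
# ||y^i||_2 = M = 1 almost surely and Lambda = E[y y^T] = I/d with norm <= 1.
rng = np.random.default_rng(0)
d, n, t, M = 20, 2000, 0.5, 1.0
Lambda = np.eye(d) / d

trials, exceed = 200, 0
for _ in range(trials):
    Y = rng.standard_normal((n, d))
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)          # project onto the sphere
    deviation = np.linalg.norm(Y.T @ Y / n - Lambda, 2)     # spectral norm deviation
    exceed += deviation >= t

bound = 2 ** 0.75 * n * np.exp(-(t ** 2) * n / (21 * M ** 2))
print(exceed / trials, bound)   # empirical failure rate vs. the theoretical bound
```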

Finite-dimensional results of this type are given by Rudelson [23], Tropp [28], Oliveira [20], Rauhut [22] and others. Mendelson and Pajor [16] were the first who addressed the infinite-dimensional case of real matrices as well, see Remark 3.1. The technique used has been introduced by Buchholz [2, 3] and further developed by Rauhut for the purpose of analyzing RIP matrices based on complex bounded orthonormal systems (see [22] and the references therein). It is based on an operator version of the non-commutative Khintchine inequality [2, 3] together with Talagrand’s symmetrization technique.

As a direct consequence of Theorem 1.1 we always obtain for separable H(K) and a probability measure \(\varrho _D\) the bound

$$\begin{aligned} \begin{aligned}&\mathbb {P}\left( \sup \limits _{\Vert f\Vert _{H(K)}\le 1}\left| \int _D |f({\mathbf {x}})|^2\,d\varrho _D({\mathbf {x}})-\frac{1}{n}\sum \limits _{i=1}^n |f({\mathbf {x}}^i)| ^2\right| >t\Vert \mathrm {Id}\Vert _{K,2}^2\right) \\&\quad \le 2^{3/4}n\exp \left( -\frac{t^2n \Vert \mathrm {Id}\Vert _{K,2}^2}{21\Vert K\Vert _{\infty }^2}\right) \end{aligned} \end{aligned}$$
(1.4)

if the kernel is bounded, i.e., \(\Vert K\Vert _{\infty } \,{:}{=}\, \sup _{{\mathbf {x}}\in D} \sqrt{K({\mathbf {x}},{\mathbf {x}})}<\infty \) (uniform boundedness). This condition is equivalent to the fact that the embedding of H(K) into \(\ell _\infty \) is continuous and has norm less than or equal to a finite number M (commonly called M-boundedness). The measure \(\varrho _D\) is supposed to be a probability measure and \({\mathbf {x}}^i,\, i = 1, \dots , n\) are drawn independently at random according to \(\varrho _D\). Note that this problem is related to classical uniform bounds on the “defect function” in learning theory with respect to M-bounded function classes, see, e.g., [4, 6]. There, bounds for (1.3) are usually given in terms of covering (or entropy) numbers of the unit ball of H(K) in \(\ell _\infty \), see [4, 12]. Here we consider situations where we neither have such information nor an embedding into \(\ell _\infty \). Choosing t appropriately (see Theorem 6.1), the worst-case discretization error may be bounded as \({\mathcal {O}}(\sqrt{(\log n)/n})\) with high probability. To get rid of the uniform boundedness condition of the function class we may work with the weaker finite trace condition

$$\begin{aligned} {\text {tr}}(K) \,{:}{=}\, \int _{D} K({\mathbf {x}},{\mathbf {x}})d\varrho _D({\mathbf {x}}) < \infty \end{aligned}$$
(1.5)

and prove a similar error bound for a slightly modified discretization operator when sampling the nodes \({\mathbf {x}}^i\) independently according to the modified measure \(\nu ({\mathbf {x}})d\varrho _D({\mathbf {x}})\) with \(\nu ({\mathbf {x}}) \,{:}{=}\, K({\mathbf {x}},{\mathbf {x}})/{\text {tr}}(K)\). One only has to replace \(\Vert K\Vert _{\infty }^2\) by \({\text {tr}}(K)\) in the right-hand side of (1.4). In other words, we have

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)}\le 1}\left| \int _D |f({\mathbf {x}})|^2\,d\varrho _D({\mathbf {x}})-\frac{1}{n}\sum \limits _{i=1}^n \frac{|f({\mathbf {x}}^i)|^2}{\nu ({\mathbf {x}}^i)}\right| \le \sqrt{21{\text {tr}}(K)\Vert \mathrm {Id}\Vert ^2r\frac{\log n}{n}} \end{aligned}$$
(1.6)

with probability exceeding \(1-2n^{1-r}\) for large enough n, see Theorem 6.3. This means that the success probability tends to 1 rather quickly as the number of samples increases.
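
For a single fixed function the weighted estimator in (1.6) can be formed directly from samples; the following sketch (with a made-up bounded kernel, the uniform base measure on \([0,1]\) and one toy test function, all our own illustrative assumptions) shows the mechanics. It does not, of course, reproduce the supremum over the unit ball that the theorem controls.

```python
import numpy as np

# Sketch of the weighted estimator from (1.6) for one fixed f (illustrative
# assumptions: K(x, x) = 1 on D = [0, 1], uniform base measure, so tr(K) = 1
# and nu(x) = K(x, x) / tr(K) = 1 coincides with the base measure).
rng = np.random.default_rng(1)

nu = lambda x: np.ones_like(x)                 # K(x, x) / tr(K) for this toy kernel
f = lambda x: np.sin(2 * np.pi * x)            # toy test function
exact = 0.5                                    # int_0^1 |f(x)|^2 dx

for n in (100, 1000, 10000, 100000):
    nodes = rng.random(n)                      # i.i.d. draws from nu(x) d rho_D(x)
    estimate = np.mean(np.abs(f(nodes)) ** 2 / nu(nodes))
    print(n, abs(estimate - exact))            # Monte Carlo error for this single f
```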

As for the sampling recovery problem we start with a result in the most general situation. A modification of the recovery operator \(\widetilde{S}_{{\mathbf {X}}}^m\) from [11, 13], see Algorithm 1 below, has been used to study the situation which is left as an open problem in [11]. The result reads as follows.

Theorem 1.2

(Section 7) Let H(K) be a reproducing kernel Hilbert space on a subset \(D \subset \mathbb {R}^d\) with a positive definite Hermitian kernel \(K(\cdot ,\cdot )\) such that the finite trace property (1.5) holds true. Let \(r > 1\) and \(m,n \in \mathbb {N},\) \(n \ge 3,\) where \(m\) is chosen according to (1.7). Drawing \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n)\) at random according to the product measure \((\varrho _m({\mathbf {x}})d\varrho _D({\mathbf {x}}))^n\) with the density defined in (7.1),  we have

$$\begin{aligned} \sup _{\Vert f\Vert _{H(K)} \le 1 } \big \Vert f - \widetilde{S}_{{\mathbf {X}}}^m f \big \Vert ^2_{L_2(D,\varrho _D)} \le 441 \max \left\{ \sigma _m^2, \frac{r \log n }{n} \sum _{j=m}^\infty \sigma _j^2, \frac{{\text {tr}}_0(K)}{n} \right\} \end{aligned}$$

with probability at least \( 1 - \eta n^{1-r}\) where \(\eta = 2^\frac{3}{4} +1\). \(\widetilde{S}_{{\mathbf {X}}}^m\) is the least squares operator from Algorithm 1 together with (7.1) and \({\text {tr}}_0(K)\) is defined in (4.6).

In fact, we recover all \(f \in H(K)\) from sampled values at \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n)\) simultaneously with probability larger than \(1-3n^{1-r}\) by only assuming that the kernel \(K(\cdot , \cdot )\) has finite trace (1.5). Note that this result improves on a result by Wasilkowski and Woźniakowski [31], where only the finite trace is required, see also Novak and Woźniakowski [19, Thm. 26.10]. The authors proved (roughly speaking) a rate of \(n^{-1/(2+p)}\) for the worst-case error with respect to standard information if the sequence of singular numbers is p-summable for \(p \le 2\). We refer to Sect. 7 for further explanation.

In order to define the recovery operator \(\widetilde{S}^m_{{\mathbf {X}}}\) and the sampling density \(\varrho _m({\mathbf {x}})\) we need to incorporate spectral properties of the embedding (1.2), namely also the left and right singular functions \((e_k)_k \subset H(K)\) and \((\eta _k)_k \subset L_2(D,\varrho _D)\) ordered according to their importance (size of the corresponding singular number). Both systems are orthonormal in the respective spaces and related by \(e_k = \sigma _k \eta _k\).

The above result can be improved essentially if we assume that H(K) is separable. This is for instance the case if K is a Mercer kernel, i.e., continuous on a compact domain D. However, assuming only separability of H(K) also includes the situation of continuous kernels on unbounded domains D, even \(D=\mathbb {R}^d\). The following result already improves on the results given in [11, 13] in several directions. The theorem works under less restrictive assumptions, the constants are improved and, last but not least, the failure probability decays polynomially in n. We would like to point out that, while preparing this manuscript, Ullrich [29] proved a version of the next theorem with stronger requirements and different constants based on Oliveira’s concentration result (see Remark 3.9). The following theorem is a reformulation of Theorem 5.2 in Sect. 5.

Theorem 1.3

(Section 5) Let K be a positive definite Hermitian kernel such that H(K) is separable and the finite trace condition (1.5) holds true. With the notation from above we have for \(n\in \mathbb {N}\), \(r>1,\) and

$$\begin{aligned} m \,{:}{=}\, \left\lfloor \frac{n}{14 r\log n}\right\rfloor \end{aligned}$$
(1.7)

the bound

$$\begin{aligned} \mathbb {P}\left( \sup _{\Vert f\Vert _{H(K)} \le 1} \Vert f - \widetilde{S}_{{\mathbf {X}}}^m f\Vert _{L_2(D,\varrho _D)}^2 \le \frac{15}{m}\sum \limits _{j=\lfloor m/2 \rfloor }^{\infty }\sigma _j^2\right) \ge 1 -3 n^{1-r}, \end{aligned}$$

where \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n)\) is sampled according to the product measure \((\varrho _m({\mathbf {x}})d\varrho _D({\mathbf {x}}))^n\) (see (5.3)) and the operator \(\widetilde{S}_{{\mathbf {X}}}^m\) is defined in Algorithm 1.

We would like to emphasize that the operator \(\widetilde{S}_{{\mathbf {X}}}^m\) uses \(n \asymp m\log m\) samples of its argument. Based on this result it has been recently shown by the second named author (and coauthors, see [18]) that there exists a sampling operator \(\widetilde{S}^m_J\) using only \({\mathcal {O}}(m)\) samples yielding the bound

$$\begin{aligned} \sup _{\Vert f\Vert _{H(K)} \le 1} \Vert f - \widetilde{S}_{J}^m f\Vert _{L_2(D,\varrho _D)}^2 \le \frac{C\log (m)}{m}\sum \limits _{j=\lfloor cm \rfloor }^{\infty }\sigma _j^2 \end{aligned}$$

with universal and specified constants \(C,c>0\). However, for this improvement one has to sacrifice the high success probability.

Notation As usual \(\mathbb {N}\) denotes the natural numbers, \(\mathbb {N}_0\,{:}{=}\,\mathbb {N}\cup \{0\}\), \(\mathbb {Z}\) denotes the integers, \(\mathbb {R}\) the real numbers, \(\mathbb {R}_+\) the non-negative real numbers and \(\mathbb {C}\) the complex numbers. If not indicated otherwise \(\log (\cdot )\) denotes the natural logarithm of its argument. \(\mathbb {C}^n\) denotes the complex n-space, whereas \(\mathbb {C}^{m\times n}\) denotes the set of all \(m\times n\)-matrices \({\mathbf {L}}\) with complex entries. Vectors and matrices are usually typeset in boldface with \({\mathbf {x}},{\mathbf {y}}\in \mathbb {C}^n\). The matrix \({\mathbf {L}}^{*}\) denotes the adjoint matrix. The spectral norm of matrices \({\mathbf {L}}\) is denoted by \(\Vert {\mathbf {L}}\Vert \) or \(\Vert {\mathbf {L}}\Vert _{2\rightarrow 2}\). For a complex (column) vector \({\mathbf {y}}\in \mathbb {C}^n\) (or \(\ell _2\)) we will often use the tensor notation for the matrix

$$\begin{aligned} {\mathbf {y}}\otimes {\mathbf {y}}\,{:}{=}\, {\mathbf {y}}\cdot {\mathbf {y}}^*= {\mathbf {y}}\cdot \overline{{\mathbf {y}}}^\top \in \mathbb {C}^{n \times n}\;(\text {or } \mathbb {C}^{\mathbb {N}\times \mathbb {N}}) . \end{aligned}$$

For \(0<p\le \infty \) and \({\mathbf {x}}\in \mathbb {C}^n\) we denote \(\Vert {\mathbf {x}}\Vert _p \,{:}{=}\, (\sum _{i=1}^n |x_i|^p)^{1/p}\) with the usual modification in the case \(p=\infty \) or \({\mathbf {x}}\) being an infinite sequence. Operator norms for \(T:H(K) \rightarrow L_2\) will be denoted with \(\Vert T\Vert _{K,2}\). As usual we will denote with \(\mathbb {E}X\) the expectation of a random variable X on a probability space \((\Omega , \mathcal {A}, \mathbb {P})\). Given a measurable subset \(D\subset \mathbb {R}^d\) and a measure \(\varrho \) we denote with \(L_2(D,\varrho )\) the space of all square integrable complex-valued functions (equivalence classes) on D with \(\int _D |f({\mathbf {x}})|^2\, d\varrho ({\mathbf {x}})<\infty \). We will often use \(\Omega = D^n\) as probability space with the product measure \(\mathbb {P}= d\varrho ^n\) if \(\varrho \) is a probability measure itself. We sometimes use the notation \(f = {\mathcal {O}}(g)\) for positive functions f and g, which means that there is a constant \(c>0\) with \(f(t) \le cg(t)\). In addition we say \(f = o(g)\) if \(f(t)/g(t) \rightarrow 0\) as \(t \rightarrow \infty \).

2 Concentration results for sums of random matrices

Let us begin with concentration inequalities for the spectral norm of sums of complex rank-1 matrices. Such matrices appear as \({\mathbf {L}}^*{\mathbf {L}}\) when studying least squares solutions of over-determined linear systems

$$\begin{aligned} {\mathbf {L}}\cdot {\mathbf {c}} = {\mathbf {f}}, \end{aligned}$$

where \({\mathbf {L}}\in \mathbb {C}^{n \times m}\) is a matrix with \(n>m\), \({\mathbf {f}} \in \mathbb {C}^n\) and \({\mathbf {c}} \in \mathbb {C}^{m}\). It is well-known that the above system may not have a solution. However, we can ask for the vector \({\mathbf {c}}\) which minimizes the residual \(\Vert {\mathbf {f}}-{\mathbf {L}}\cdot {\mathbf {c}}\Vert _2\). Multiplying the system with \({\mathbf {L}}^*\) gives

$$\begin{aligned} {\mathbf {L}}^{*}{\mathbf {L}}\cdot {\mathbf {c}} = {\mathbf {L}}^{*}\cdot {\mathbf {f}} \end{aligned}$$

which is called the system of normal equations. If \({\mathbf {L}}\) has full rank then the unique solution of the least squares problem is given by

$$\begin{aligned} {\mathbf {c}} = ({\mathbf {L}}^{*}{\mathbf {L}})^{-1}{\mathbf {L}}^{*}\cdot {\mathbf {f}}. \end{aligned}$$
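
The following short sketch (sizes and data are illustrative assumptions of ours) confirms numerically that the normal-equations formula agrees with a standard least squares solver.

```python
import numpy as np

# A quick sketch: the normal-equations solution coincides with a QR-based least squares
# solver for a random complex overdetermined system (sizes are illustrative assumptions).
rng = np.random.default_rng(2)
n, m = 50, 10
L = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
f = rng.standard_normal(n) + 1j * rng.standard_normal(n)

c_normal = np.linalg.solve(L.conj().T @ L, L.conj().T @ f)   # (L*L)^{-1} L* f
c_lstsq = np.linalg.lstsq(L, f, rcond=None)[0]               # minimizes ||f - L c||_2
print(np.allclose(c_normal, c_lstsq))
```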

For function recovery and discretization problems we will use the following matrix

$$\begin{aligned} {\mathbf {L}}_m \,{:}{=}\, \left( \begin{array}{cccc} \eta _1({\mathbf {x}}^1) &{}\eta _2({\mathbf {x}}^1)&{} \cdots &{} \eta _{m-1}({\mathbf {x}}^1)\\ \vdots &{} \vdots &{}&{} \vdots \\ \eta _1({\mathbf {x}}^n) &{}\eta _2({\mathbf {x}}^n)&{} \cdots &{} \eta _{m-1}({\mathbf {x}}^n) \end{array}\right) = \left( \begin{array}{c} {{\mathbf {y}}^1}^\top \\ \vdots \\ {{\mathbf {y}}^n}^\top \end{array}\right) \quad \text{ and } \quad {\mathbf {f}} = \left( \begin{array}{c} f({\mathbf {x}}^1)\\ \vdots \\ f({\mathbf {x}}^n) \end{array}\right) , \end{aligned}$$
(2.1)

for a vector \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n) \in D^n\) of distinct sampling nodes and a system of functions \((\eta _k)_{k=1}^{m-1}\). We put \({\mathbf {y}}^i \,{:}{=}\, (\eta _1({\mathbf {x}}^i),\ldots ,\eta _{m-1}({\mathbf {x}}^i))^\top , i=1,\ldots ,n\). The coefficients \(c_k\), \(k=1,\ldots ,m-1\), of the approximant

$$\begin{aligned} S^m_{{\mathbf {X}}}f\,{:}{=}\,\sum _{k = 1}^{m-1} c_k\, \eta _k\, \end{aligned}$$
(2.2)

are computed via least squares, see Algorithm 1. Note that the mapping \(f \mapsto S^m_{{\mathbf {X}}}f\) is linear for a fixed set of sampling nodes \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n) \in D^n.\)
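
A minimal sketch of this plain least squares recovery, assuming the trigonometric system on \([0,1]\) with the uniform measure and a target function lying in the span of \(\eta _1,\ldots ,\eta _{m-1}\) (both our own illustrative choices):

```python
import numpy as np

# A hedged sketch of the plain least squares operator S^m_X from (2.1)-(2.2).  The
# orthonormal system (complex exponentials on [0, 1] with the uniform measure) and the
# target function are illustrative assumptions of ours, not fixed by the paper.
rng = np.random.default_rng(3)

def eta(k, x):
    return np.exp(2j * np.pi * k * x)         # orthonormal in L2([0, 1], dx)

n, m = 200, 10
nodes = rng.random(n)                         # i.i.d. nodes drawn from rho_D (uniform here)
f = lambda x: eta(3, x) + 0.5 * eta(7, x)     # lies in span{eta_1, ..., eta_{m-1}}

ks = np.arange(1, m)                          # indices 1, ..., m-1
L_m = eta(ks[None, :], nodes[:, None])        # the n x (m-1) matrix from (2.1)
coeffs = np.linalg.lstsq(L_m, f(nodes), rcond=None)[0]

grid = np.linspace(0.0, 1.0, 1000, endpoint=False)
approx = eta(ks[None, :], grid[:, None]) @ coeffs
print(np.max(np.abs(f(grid) - approx)))       # essentially zero: f is recovered exactly
```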

We start with a concentration inequality for the spectral norm of a matrix of type (2.1). It turns out that in certain situations the complex matrix \({\mathbf {L}}_m\,{:}{=}\,{\mathbf {L}}_m({\mathbf {X}})\in \mathbb {C}^{n\times (m-1)}\) has full rank with high probability, where \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n)\) is drawn at random from \(D^n\) according to a product measure \(\mathbb {P}= d\varrho ^n\). We will find that the eigenvalues of

$$\begin{aligned} {\mathbf {H}}_m\,{:}{=}\,{\mathbf {H}}_m({\mathbf {X}}) = \frac{1}{n}{\mathbf {L}}_m^*{\mathbf {L}}_m = \frac{1}{n} \sum \limits _{i = 1}^n {\mathbf {y}}^i \otimes {\mathbf {y}}^i \in \mathbb {C}^{(m-1)\times (m-1)}, \end{aligned}$$
(2.3)

are bounded away from zero with high probability if m is small enough compared to n and the functions \(\eta _k(\cdot )\) denote an orthonormal system with respect to the measure \(\varrho \) from which the nodes in \({\mathbf {X}}\) are sampled. Let us define the corresponding spectral function

$$\begin{aligned} N(m) \,{:}{=}\, \sup \limits _{{\mathbf {x}}\in D}\sum \limits _{k=1}^{m-1}|\eta _k({\mathbf {x}})|^2. \end{aligned}$$
(2.4)
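
For instance, for the (suitably enumerated) trigonometric system on \(D=[0,1]^d\) equipped with the Lebesgue measure one has \(|\eta _k({\mathbf {x}})|\equiv 1\) and hence \(N(m)=m-1\). This is the smallest possible value, since for any orthonormal system and any probability measure \(\varrho \) we have \(N(m)\ge \int _D\sum _{k=1}^{m-1}|\eta _k({\mathbf {x}})|^2\,d\varrho ({\mathbf {x}})=m-1\).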

From [28, Theorem 1.1] we get the following result.

Theorem 2.1

(Matrix Chernoff) For a finite sequence \((\mathbf {A}_k)_{k=1}^n\) of independent, Hermitian, positive semi-definite random matrices with dimension m satisfying \(\lambda _{\max }(\mathbf {A}_k) \le R\) almost surely it holds

$$\begin{aligned} \mathbb {P}\left( \lambda _{\min }\left( \sum _{k=1}^n \mathbf {A}_k\right) \le (1-t)\,\mu _{\min }\right)&\le m\,\left[ \frac{e^{-t}}{(1-t)^{1-t}}\right] ^{\mu _{\min }/R},\\ \mathbb {P}\left( \lambda _{\max }\left( \sum _{k=1}^n \mathbf {A}_k\right) \ge (1+t)\,\mu _{\max }\right)&\le m\,\left[ \frac{e^{t}}{(1+t)^{1+t}}\right] ^{\mu _{\max }/R} \end{aligned}$$

for \(t \in [0,1]\) where \( \mu _{\min }: = \lambda _{\min }\left( \sum _{k=1}^n \mathbb {E}\mathbf {A}_k \right) \) and \( \mu _{\max }: = \lambda _{\max }\left( \sum _{k=1}^n \mathbb {E}\mathbf {A}_k \right) \).

Theorem 2.2

Let \(n,m \in \mathbb {N}\), \(m\ge 2 \) and let \(\{\eta _1(\cdot ), \eta _2(\cdot ), \eta _3(\cdot ),\dots ,\eta _{m-1}(\cdot )\}\) be an orthonormal system in \(L_2(D,\varrho )\). Let \({\mathbf {H}}_m\) be given as above and let \({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\in D\) be drawn i.i.d. at random according to \(\mathbb {P}= d\varrho \). Then we have for \(0<t <1\) that

$$\begin{aligned} \mathbb {P}(\lambda _{\min } ({\mathbf {H}}_m) < 1-t) \le m \exp \left( -{\frac{n\, \log c_t}{N(m)}}\right) , \end{aligned}$$

as well as

$$\begin{aligned} \mathbb {P}(\lambda _{\max } ({\mathbf {H}}_m) > 1+t) \le m \exp \left( -{\frac{n\, \log d_t}{N(m)}}\right) , \end{aligned}$$

where \(c_t \,{:}{=}\,(1-t)^{1-t}e^t\) and \(d_t \,{:}{=}\,(1+t)^{1+t}e^{-t}\).

Proof

We apply Theorem 2.1. To do this we define \(\mathbf {A}_i = \frac{1}{n}\, {\mathbf {y}}^i \otimes {\mathbf {y}}^i\). One easily sees that all the matrices \(\mathbf {A}_i\) are always positive semi-definite and \(\lambda _{\min }\Big (\sum _{i=1}^n \mathbb {E}\mathbf {A}_i \Big ) =1\). We have that

$$\begin{aligned} \lambda _{\max } (\mathbf {A}_i) = \Vert {\mathbf {y}}^i\Vert ^2/n \le N(m)/n. \end{aligned}$$

Plugging this into Theorem 2.1 yields

$$\begin{aligned} \mathbb {P}(\lambda _{\min }({\mathbf {H}}_m) \le 1-t) \le m \left[ \frac{e^{-t}}{(1-t)^{1-t}} \right] ^{n/N(m)} \le m \exp \left( -{\frac{n \log c_t}{N(m)}} \right) . \end{aligned}$$

The bound for \(\lambda _{\max }({\mathbf {H}}_m)\) follows in the same way from the second inequality in Theorem 2.1. \(\square \)

Theorem 2.3

For \(n \ge m\) and \(r > 1\) all eigenvalues of the matrix \({\mathbf {H}}_m\) are greater than 1/2 with probability at least \(1 - n^{1-r}\) if

$$\begin{aligned} N(m) \le \frac{n}{ 7 \, r \log n}. \end{aligned}$$
(2.5)

In particular,  we have

$$\begin{aligned} \Vert ({\mathbf {L}}_m^{*}{\mathbf {L}}_m)^{-1}{\mathbf {L}}_m^{*}\Vert _{2\rightarrow 2} \le \sqrt{\frac{2}{n}}. \end{aligned}$$
(2.6)

Proof

Choosing \(t = 1/2\) and solving for N(m) in the above probability bound (using \(n^{1-r}\) on the right-hand side) gives the desired result. Indeed

$$\begin{aligned} \mathbb {P}(\lambda _{\min } ({\mathbf {H}}_m) < 1-t) \le m \exp \left( -{\frac{n \log c_t}{N(m)}} \right) \le n^{1-r}. \end{aligned}$$

This gives the following implications (read from bottom to top)

$$\begin{aligned} \begin{aligned} \log (m) - \log (c_t) \frac{n}{N(m)}&\le \log n^{1-r} \\ \frac{\log m-\log n^{1-r} }{\log c_t}&\le \frac{n}{N(m)} \\ N(m)&\le \frac{n\log c_t}{\log m - \log n^{1-r}} \\ N(m)&\le \frac{n}{7 (\log n - (1-r) \log n)}\\ N(m)&\le \frac{n}{ 7\, r \log n}. \end{aligned} \end{aligned}$$
(2.7)

The bound in (2.6) is a consequence of [11, Proposition 3.1]. \(\square \)
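
The statement of Theorem 2.3 is easy to test empirically; the following sketch (again with the trigonometric system and uniform sampling on \([0,1]\), an illustrative choice of ours for which \(N(m)=m-1\)) draws many node sets and counts how often \(\lambda _{\min }({\mathbf {H}}_m)\) drops below 1/2.

```python
import numpy as np

# Numerical sanity check of Theorem 2.3 under illustrative assumptions: for the
# trigonometric system on [0, 1] with uniform sampling one has N(m) = m - 1, so we pick
# the largest m allowed by (2.5) and verify that lambda_min(H_m) > 1/2 in all trials.
rng = np.random.default_rng(4)
r, n = 2.0, 2000
m = int(n / (7 * r * np.log(n))) + 1          # largest m with N(m) = m - 1 <= n/(7 r log n)

def eta(k, x):
    return np.exp(2j * np.pi * k * x)

trials, failures = 200, 0
for _ in range(trials):
    nodes = rng.random(n)
    L_m = eta(np.arange(1, m)[None, :], nodes[:, None])
    H_m = L_m.conj().T @ L_m / n              # the matrix from (2.3)
    failures += np.linalg.eigvalsh(H_m).min() < 0.5
print(m, failures / trials)                   # failure fraction, to be compared with n^{1-r}
```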

From [11, Proposition 3.1] we also get a lower bound of \(\Vert ({\mathbf {L}}_m^{*}{\mathbf {L}}_m)^{-1}{\mathbf {L}}_m^{*}\Vert _{2\rightarrow 2}\) with high probability.

Corollary 2.4

Let \(\{\eta _1(\cdot ), \eta _2(\cdot ), \eta _3(\cdot ),\ldots \}\) be an orthonormal system in \(L_2(D,\varrho )\). Let further \(r>1\) and \(m,n \in \mathbb {N}\), \(m\ge 2\) such that

$$\begin{aligned} N(m) \le \frac{n}{10 \, r \log n} \end{aligned}$$

holds. Then the random matrix \({\mathbf {L}}_m\) from (2.1) satisfies

$$\begin{aligned} \Vert ({\mathbf {L}}_m^{*}{\mathbf {L}}_m)^{-1}{\mathbf {L}}_m^{*}\Vert _{2\rightarrow 2} \ge \sqrt{\frac{2}{3n}} \end{aligned}$$

with probability at least \(1-n^{1-r},\) where the nodes are sampled i.i.d. according to \(\varrho \).

3 Norm concentration for infinite matrices

In this section we want to extend some of the results from Sect. 2 to the infinite-dimensional framework. We provide a new concentration inequality derived from the non-commutative Khintchine inequality via a bootstrapping argument using a symmetrization result by Ledoux and Talagrand [15] for Rademacher sums of random operators \( \mathbf{B}_i = {\mathbf {y}}^i \otimes {\mathbf {y}}^i\), where \({\mathbf {y}}^i\) denotes a random sequence from the complex \( \ell _2 \).

Definition 3.1

(Schatten-p-Norm) For a compact operator \(\mathbf{A}: H \rightarrow K\) between complex Hilbert spaces H and K and \(1 \le p \le \infty \) we define the Schatten-p-norm:

$$\begin{aligned} \Vert \mathbf{A}\Vert _{S_p} = \Vert \sigma (\mathbf{A})\Vert _p \end{aligned}$$

where \(\sigma (\mathbf {A})\) is the vector of singular values of \(\mathbf{A}\).

The quantity \(\Vert \cdot \Vert _{S_p}\) is a norm, see, e.g. [7].

Corollary 3.2

One easily sees that

$$\begin{aligned} \Vert \mathbf{A}\Vert _{2 \rightarrow 2} = \Vert \mathbf{A}\Vert _{S_\infty } \le \Vert \mathbf{A}\Vert _{S_p} \end{aligned}$$

and that for \(\mathbf{A}\) with rank at most \(r\) it holds that

$$\begin{aligned} \Vert \mathbf{A}\Vert _{S_p} \le r^{\frac{1}{p}}\, \Vert \mathbf{A}\Vert _{2 \rightarrow 2} \end{aligned}$$

for \(1\le p \le \infty \).

Definition 3.3

(Schatten-class) Let \(H, K \) be complex Hilbert spaces and \( 1 \le p < \infty \). The \(p\)-th Schatten-class is defined as

$$\begin{aligned} S_p(H,K) \,{:}{=}\, \big \{ \mathbf{A}: H \rightarrow K ~\text {compact}~:~ \Vert \mathbf{A} \Vert _{S_p} < \infty \big \}. \end{aligned}$$

Theorem 3.4

(Non-commutative Khintchine inequality, [2, 3]) Let \(n \in \mathbb {N}\) and \(\mathbf{B}_i,\) \(i=1,\ldots ,m,\) denote operators from \( S_{2n}\). Let further \(\varepsilon _i\) denote independent Rademacher variables for \( i= 1 \dots m \). Then it holds

$$\begin{aligned} \left( \mathbb {E}\left\| \sum _{i=1}^m \varepsilon _i \mathbf{B}_i\right\| _{S_{2n}}^{2n}\right) ^{\frac{1}{2n}} \le \left( \frac{(2n)!}{2^n\, n!}\right) ^{\frac{1}{2n}} \max \left\{ \left\| \left( \sum _{i=1}^m \mathbf{B}_i\mathbf{B}_i^*\right) ^{1/2}\right\| _{S_{2n}}, \left\| \left( \sum _{i=1}^m \mathbf{B}_i^*\mathbf{B}_i\right) ^{1/2}\right\| _{S_{2n}}\right\} . \end{aligned}$$

Corollary 3.5

(Rudelson’s lemma) Let \({\mathbf {y}}^i\) be a sequence from the complex \( \ell _2\) and \(\varepsilon _i\) independent Rademacher variables for \( i= 1 \dots m \). Then it holds for \( 2 \le p < \infty \) that

Proof

We utilize the non-commutative Khintchine inequality with \( \mathbf{B}_i \,{:}{=}\, {\mathbf {y}}^i \otimes {\mathbf {y}}^i\) which belong to every \(S_{2n}\) since they have (at most) rank 1. Applying literally the same arguments as in [22, Lemma 6.18] we obtain the result (see [17] for details).

We estimate tails of random variables by means of their moments. We will use a well-known relation, see e.g. [22, Proposition 6.5].

Proposition 3.6

(Moments and tails) Let \(X\) be a random variable that for all \(p \ge p_0\) satisfies

$$\begin{aligned} \big (\mathbb {E}|X|^p \big )^\frac{1}{p} \le \alpha \beta ^\frac{1}{p} p^\frac{1}{\gamma } \end{aligned}$$

for some constants \(\alpha , \beta , \gamma , p_0 > 0\). Then

$$\begin{aligned} \mathbb {P}\big (|X| \ge e^{\frac{1}{\gamma }} \alpha u\big ) \le \beta \exp \left( -\frac{u^\gamma }{\gamma }\right) \end{aligned}$$

for all \(u \ge p_0^{\frac{1}{\gamma }}\).

From [15, Lemma 6.3] we get the following result.

Proposition 3.7

(Symmetrization, [15]) Let \(F : \mathbb {R}_+ \rightarrow \mathbb {R}_+ \) be convex. Let \({\mathbf {X}}_i,\) \(i=1,\ldots n,\) be independent random variables in a separable Banach space \((B, \Vert \cdot \Vert )\) such that \( \mathbb {E}F (\Vert {\mathbf {X}}_i\Vert ) < \infty \). Let further \(\varvec{\varepsilon } = (\varepsilon _i)_{i=1}^n\) be independent Rademacher variables which are also independent of \({\mathbf {X}}_i\). Then it holds that

$$\begin{aligned} \mathbb {E}F\left( \sup _{f \in D } \left| f \left( \sum _{i=1}^n {\mathbf {X}}_i \right) - \mathbb {E}f \left( \sum _{i=1}^n {\mathbf {X}}_i \right) \right| \right) \le \mathbb {E}F\left( 2 \left\| \sum _{i=1}^n \varepsilon _i {\mathbf {X}}_i \right\| \right) , \end{aligned}$$

where \(D\) is a countable set of linear functionals with \(\Vert {\mathbf {x}}\Vert = \sup _{f \in D}\big |f({\mathbf {x}})\big |\) for all \({\mathbf {x}}\in B\).

Proposition 3.8

Let \({\mathbf {y}}^i , i= 1, \dots , n, \) be i.i.d. random sequences from the complex \( \ell _2\). Let further \(n \ge 3,\) \(r > 1,\) \(M>0\) such that \(\Vert {\mathbf {y}}^i \Vert _2 \le M\) for all \(i=1, \dots , n\) almost surely and \( \mathbb {E}{\mathbf {y}}^i \otimes {\mathbf {y}}^i = {\Lambda }\) for all \(i=1, \dots , n\). Then

$$\begin{aligned} \mathbb {P}\left( \left\| \frac{1}{n} \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i - {\Lambda } \right\| _{2 \rightarrow 2} \ge F \right) \le 2^\frac{3}{4} \, n^{1-r}, \end{aligned}$$

where \(F \,{:}{=}\, \max \Big \{ \frac{8r \log n}{n} M^2 \kappa ^2 , \Vert {\Lambda } \Vert _{2 \rightarrow 2} \Big \}\) and \( \kappa = \frac{1+\sqrt{5}}{2}\) .

Proof

We use a method as in [22, Theorem 7.3]. For \(2 \le p < \infty \) we put

$$\begin{aligned} E_p \,{:}{=}\, \mathbb {E}\left\| \frac{1}{n} \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i - {\Lambda } \right\| _{2 \rightarrow 2}^p. \end{aligned}$$

Since \(\sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i\) has rank (at most) \(n\) it is compact. The expectation matrix \({\Lambda }\) is a positive semidefinite operator with finite trace since \(\Vert {\mathbf {y}}^i \Vert _2 \le M\) for all \(i=1 \dots n\) almost surely. This means \( {\Lambda }\) is a trace class operator and therefore compact. Since \( \frac{1}{n} \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i - {\Lambda }\) is compact and the subspace of all compact operators \(K(\ell _2, \ell _2)\) from \(\ell _2\) to \( \ell _2\) is separable we can choose a countable set \(D\) from the dual space of \(K(\ell _2, \ell _2)\) as in Proposition 3.7 such that

$$\begin{aligned} \left\| \frac{1}{n} \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i - {\Lambda } \right\| _{2 \rightarrow 2} = \sup \limits _{f \in D}\left| f \left( \frac{1}{n} \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i \right) - f \left( \mathbb {E}\frac{1}{n}\sum \limits _{i=1}^n {\mathbf {y}}^i \otimes {\mathbf {y}}^i \right) \right| . \end{aligned}$$

Since \(f\) is a continuous linear functional we get

$$\begin{aligned} E_p = \mathbb {E}\left( \sup \limits _{f \in D}\left| f \left( \frac{1}{n} \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i \right) - \mathbb {E}f \left( \frac{1}{n}\sum \limits _{i=1}^n {\mathbf {y}}^i \otimes {\mathbf {y}}^i \right) \right| ^p \right) . \end{aligned}$$
(3.1)

Proposition 3.7 applied to (3.1) with \(F(t) = t^p\) together with Rudelson’s lemma (Corollary 3.5) for \(2\le p < \infty \) in the form

yields

$$\begin{aligned} \begin{aligned} E_p&\le 2^p \, \mathbb {E}_{\mathbf {y}}\mathbb {E}_{\varvec{\varepsilon }} \left\| \frac{1}{n} \sum _{i=1}^n \varepsilon _i \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i \right\| _{2 \rightarrow 2} ^{p} \\&\le \left( \frac{2}{\sqrt{n}} \right) ^p 2^{\frac{3}{4}} n \, p^{\frac{p}{2}} e^{-\frac{p}{2}}\, \mathbb {E}_{\mathbf {y}}\left( \sqrt{\left\| \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i \right\| _{2 \rightarrow 2}} \, \max _{i = 1 \dots n}\Vert {\mathbf {y}}^i \Vert _{2} \right) ^p \\&\le \left( \frac{2}{\sqrt{n}} \right) ^p 2^{\frac{3}{4}} n \, p^{\frac{p}{2}} e^{-\frac{p}{2}} {M}^{p} \, \mathbb {E}_{\mathbf {y}}\left( \sqrt{\left\| \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i \right\| _{2 \rightarrow 2}} \right) ^p \\&\le \left( \frac{2}{\sqrt{n}} \right) ^p 2^{\frac{3}{4}} n \, p^{\frac{p}{2}} e^{-\frac{p}{2}} {M}^{p} \, \left( \sqrt{\mathbb {E}_{\mathbf {y}}\left( \left\| \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i - {\Lambda } \right\| _{2 \rightarrow 2} \right) + \Vert {\Lambda } \Vert _{2\rightarrow 2}} \right) ^p \end{aligned} \end{aligned}$$

because of \(\Vert {\mathbf {y}}^i \Vert _2 \le M \) and Hölder’s inequality. Using triangle inequality and the fact that we have

Setting \(D_{p,n,M} \,{:}{=}\, \frac{2}{\sqrt{n}} 2^{\frac{3}{4p}} M p^{\frac{1}{2}} n^{\frac{1}{p}} e^{-\frac{1}{2}}\) yields

(3.2)

where \(F \ge \Vert {\Lambda } \Vert _{2\rightarrow 2} \) will be chosen later. Solving this inequality gives

We now consider the random variable \( \min \Big \{F,\Big \Vert \frac{1}{n} \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i - {\Lambda } \Big \Vert _{2 \rightarrow 2}\Big \}\). Obviously

In the case of \( D_{p,n,M}^2 \le F\) we get

and otherwise

This yields

Using Proposition 3.6 we get that

$$\begin{aligned} \mathbb {P}\left( \min \left\{ F,\left\| \frac{1}{n} \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i - {\Lambda } \right\| _{2 \rightarrow 2}\right\} \ge \frac{2}{\sqrt{n}} M \kappa \sqrt{F} u \right) \le 2^\frac{3}{4} \, n \exp \left( \frac{- u^2}{2} \right) \end{aligned}$$
(3.3)

for all \( u \ge \sqrt{2}\). Now we choose \(u = \sqrt{2r \log n}\) with \(r > 1\) and \(n \ge 3\). This gives

$$\begin{aligned} \mathbb {P}\left( \min \left\{ F,\left\| \frac{1}{n} \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i - {\Lambda } \right\| _{2 \rightarrow 2}\right\} \ge \frac{2}{\sqrt{n}} M \kappa \sqrt{F} \sqrt{2r \log n} \right) \le 2^\frac{3}{4} \, n^{1-r}. \end{aligned}$$

In case \( F \ge \frac{2}{\sqrt{n}} M \kappa \sqrt{F} u\) we can avoid the minimum on the left-hand side. It clearly holds

$$\begin{aligned} F \ge \frac{2}{\sqrt{n}} M \kappa \sqrt{F} u = \frac{2}{\sqrt{n}} M \kappa \sqrt{F} \sqrt{2r \log n}, \end{aligned}$$

and hence

$$\begin{aligned} \sqrt{F} \ge \frac{2}{\sqrt{n}} M \kappa \sqrt{2r \log n}. \end{aligned}$$

The latter is satisfied if

$$\begin{aligned} F \,{:}{=}\, \max \left\{ \frac{8r \log n}{n} M^2 \kappa ^2 , \Vert {\Lambda }\Vert _{2 \rightarrow 2} \right\} . \end{aligned}$$
(3.4)

This yields

$$\begin{aligned} \mathbb {P}\left( \left\| \frac{1}{n} \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i - {\Lambda } \right\| _{2 \rightarrow 2} \ge F \right) \le 2^\frac{3}{4} \, n^{1-r}. \end{aligned}$$

\(\square \)

Using this result we are now able to prove our main concentration inequality.

Proof of Theorem 1.1


Let us return to (3.2) in the above proof. Since \(\Vert {\varvec{\Lambda }}\Vert _{2\rightarrow 2 } \le 1\) we get as a consequence for \(0<t\le 1\)

(3.5)

with

$$\begin{aligned} \tilde{D}_{p,n,M} \,{:}{=}\, \frac{1}{\sqrt{t}}\frac{2}{\sqrt{n}} 2^{\frac{3}{4p}} M p^{\frac{1}{2}} n^{\frac{1}{p}} e^{-\frac{1}{2}}. \end{aligned}$$

We continue as in the proof above, using \(\tilde{D}_{p,n,M}\) instead of \(D_{p,n,M}\) and without replacing u by \(\sqrt{2r\log n}\) in (3.3). With the same argument as above we get rid of the minimum and obtain this time

$$\begin{aligned} \mathbb {P}\left( \left\| \frac{1}{n} \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i - {\Lambda } \right\| _{2 \rightarrow 2} \ge F \right) \le 2^\frac{3}{4} n\exp (-u^2/2) \end{aligned}$$
(3.6)

for \(F \ge \max \{4M^2u^2\kappa ^2/(nt),t\}\). The maximum is no longer necessary if we choose \(u^2\,{:}{=}\,t^2n/(4M^2\kappa ^2)\). Plugging this choice into (3.6) and noting that \(8\kappa ^2 \le 21\) gives the desired bound. \(\square \)

Remark 3.9

The first result of this type is due to Rudelson [23] for vectors from \(\mathbb {R}^n\). Complex versions were proved by Rauhut [22] and Oliveira [20, Lem. 1]. Note that the result stated by Oliveira (Lemma 1) contains a small incorrectness in the probability bound. A corrected version has been stated in [11, Prop. 4.1]. In his paper Oliveira also comments on the infinite dimensional complex situation where \(m=\infty \) but does not give a full proof. The proof method is different from ours. Note also that in [16, Cor. 2.6] Mendelson and Pajor give a concentration result for the infinite dimensional case of real vectors. Let us finally mention that a version of our Theorem 1.3 (in the next section) under more restrictive assumptions has been recently proved by Ullrich [29] based on Oliveira’s concentration result.

4 Reproducing kernel Hilbert spaces

We will work in the framework of reproducing kernel Hilbert spaces. The relevant theoretical background can be found in [1, Chapt. 1] and [4, Chapt. 4]. The papers [10, 24] are also of particular relevance for the subject of this paper.

Let \(L_2(D,\varrho _D)\) be the space of complex-valued square-integrable functions with respect to \(\varrho _D\). Here \(D \subset \mathbb {R}^d\) is an arbitrary subset and \(\varrho _D\) a measure on D. We further consider a reproducing kernel Hilbert space H(K) with a Hermitian positive definite kernel \(K(\cdot ,\cdot )\) on \(D \times D\). The crucial property of reproducing kernel Hilbert spaces is the fact that Dirac functionals are continuous, or equivalently, the reproducing property holds

$$\begin{aligned} f({\mathbf {x}}) = \langle f, K(\cdot ,{\mathbf {x}}) \rangle _{H(K)} \end{aligned}$$

for all \({\mathbf {x}}\in D\). It ensures that point evaluations are continuous functionals on H(K). We will use the notation from [4, Chapt. 4]. In the framework of this paper, the finite trace of the kernel

$$\begin{aligned} {\text {tr}}(K)\,{:}{=}\,\Vert K\Vert ^2_{2} = \int _{D} K({\mathbf {x}},{\mathbf {x}})d\varrho _D({\mathbf {x}}) < \infty \end{aligned}$$
(4.1)

or its boundedness

$$\begin{aligned} \Vert K\Vert _{\infty } \,{:}{=}\, \sup \limits _{{\mathbf {x}}\in D} \sqrt{K({\mathbf {x}},{\mathbf {x}})} < \infty \end{aligned}$$
(4.2)

is assumed. The boundedness of K implies that H(K) is continuously embedded into \(\ell _\infty (D)\), i.e.,

$$\begin{aligned} \Vert f\Vert _{\ell _{\infty }(D)} \le \Vert K\Vert _{\infty }\cdot \Vert f\Vert _{H(K)}. \end{aligned}$$
(4.3)

With \(\ell _{\infty }(D)\) we denote the set of bounded functions on D and with \(\Vert \cdot \Vert _{\ell _{\infty }(D)}\) the supremum norm. Note that we do not need the measure \(\varrho _D\) for this embedding. In fact, here we mean “boundedness” in the strong sense (in contrast to essential boundedness w.r.t. the measure \(\varrho _D\)). The embedding operator

$$\begin{aligned} \mathrm {Id}:H(K) \rightarrow L_2(D,\varrho _D) \end{aligned}$$
(4.4)

is Hilbert–Schmidt under the finite trace condition (4.1), see [10, 24, Lemma 2.3], which we always assume from now on. We additionally assume that H(K) is at least infinite-dimensional. Let us denote by \((\lambda _j)_{j\in \mathbb {N}}\) the (at most) countable system of strictly positive eigenvalues of \(W_{K,\varrho _D} = \mathrm {Id}_{K,\varrho _D}^{*} \circ \mathrm {Id}_{K,\varrho _D}\), arranged in non-increasing order, i.e.,

$$\begin{aligned} \lambda _1 \ge \lambda _2 \ge \lambda _3 \ge \cdots > 0. \end{aligned}$$

We will also need the left and right singular vectors \((e_k)_k \subset H(K)\) and \((\eta _k)_k \subset L_2(D,\varrho _D)\) which both represent orthonormal systems in the respective spaces, related by \(e_k = \sigma _k \eta _k\) with \(\lambda _k = \sigma _k^2\) for \(k\in \mathbb {N}\). We would like to emphasize that the embedding (4.4) is not necessarily injective. In other words, for certain kernels there might also be a nontrivial nullspace of the embedding \(\mathrm {Id}\) in (4.4). Therefore, the system \((e_k)_k\) from above is not necessarily a basis in H(K). It would be a basis under additional restrictions, e.g., if the kernel \(K(\cdot ,\cdot )\) is continuous and bounded (Mercer kernel). Based on this observation we will decompose the kernel \(K(\cdot ,\cdot )\) as follows

$$\begin{aligned} \begin{aligned} K({\mathbf {x}},{\mathbf {y}})&= K^0({\mathbf {x}},{\mathbf {y}}) + K^1({\mathbf {x}},{\mathbf {y}})\\&\,{:}{=}\,\left( K({\mathbf {x}},{\mathbf {y}})-\sum \limits _{k=1}^{\infty } e_k({\mathbf {x}})\overline{e_k({\mathbf {y}})}\right) + \sum \limits _{k=1}^{\infty } e_k({\mathbf {x}})\overline{e_k({\mathbf {y}})}. \end{aligned} \end{aligned}$$
(4.5)

By Bessel’s inequality we get that

$$\begin{aligned} {\text {tr}}_0(K)\,{:}{=}\,{\text {tr}}(K^0)=\int _{D} K({\mathbf {x}},{\mathbf {x}})d\,\varrho _D({\mathbf {x}}) - \sum \limits _{k=1}^{\infty } \lambda _k \ge 0. \end{aligned}$$
(4.6)

It is shown in [10, 24, Lemma 2.3] that if \({\text {tr}}(K)<\infty \) and H(K) is separable then \({\text {tr}}_0(K) = 0\). As we will see below, it will make a big difference whether \({\text {tr}}_0(K)\) vanishes or not. The second case can only occur if H(K) is non-separable. In other words, if \(H(K)\) is separable the function \(K^0({\mathbf {x}},{\mathbf {x}}) \,{:}{=}\, K({\mathbf {x}},{\mathbf {x}})- \sum _{k=1}^\infty |e_k({\mathbf {x}})|^2\) is zero almost everywhere with respect to the measure \(\varrho _D\). Let us finally define the two crucial quantities

$$\begin{aligned} N(m) \,{:}{=}\, \sup \limits _{{\mathbf {x}}\in D}\sum \limits _{k=1}^{m-1}|\eta _k({\mathbf {x}})|^2\ \end{aligned}$$
(4.7)

and

$$\begin{aligned} T(m) \,{:}{=}\, \sup \limits _{{\mathbf {x}}\in D}\sum \limits _{k=m}^{\infty }|e_k({\mathbf {x}})|^2. \end{aligned}$$
(4.8)

The first one is often called “spectral function”, see [9] and the references therein.

5 Sampling recovery guarantees for separable RKHS

In this section we deal with the case that H(K) is a separable Hilbert space on a subset \(D \subset \mathbb {R}^d\) which is compactly embedded in \(L_2(D,\varrho _D)\) for a given measure \(\varrho _D\). The first theorem gives a result in a more restrictive situation, namely that \(\varrho _D\) is a probability measure and the kernel is bounded. In the second theorem we sample with respect to the probability density function \(\varrho _m\) defined below in (5.3) and introduced by Krieg and Ullrich [13, 14]. We use the same proof strategy as in [11]. Here we do not apply Rudelson’s bound [23] on the expectation. We rather use the concentration inequality proved in Proposition 3.8. This leads to a polynomially decaying failure probability, see also [29].

Theorem 5.1

Let H(K) be a separable reproducing kernel Hilbert space on a set \(D \subset \mathbb {R}^d\) with a positive definite kernel \(K(\cdot ,\cdot )\) such that \(\sup _{{\mathbf {x}}\in D} K({\mathbf {x}},{\mathbf {x}}) <\infty \). We denote with \((\sigma _j)_{j\in \mathbb {N}}\) the non-increasing sequence of singular numbers of the embedding \(\mathrm {Id}:H(K) \rightarrow L_2(D,\varrho _D)\) for a probability measure \(\varrho _D\). Let further \(r>1\) and \(m,n \in \mathbb {N},\) \(n \ge 3\) where \(m\ge 2\) is chosen such that

$$\begin{aligned} N(m) \le \frac{n}{7 r\log n} \end{aligned}$$
(5.1)

holds. Drawing \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n)\) according to the product measure \(d\varrho _D^n,\) we have

$$\begin{aligned} \mathbb {P}\left( \sup _{\Vert f\Vert _{H(K)} \le 1} \left\| f - S_{{\mathbf {X}}}^m f \right\| _{L_2(D,\varrho _D)}^2 \le 5 \max \left\{ \sigma _m^2, \frac{8r \log n}{n} T(m) \kappa ^2 \right\} \right) \ge 1 - \eta n^{1-r}, \end{aligned}$$

where \(\eta = 2^\frac{3}{4} +1\), \( \kappa = \frac{1+\sqrt{5}}{2}\) and \(N(m), \, T(m)\) are defined in (4.7),  (4.8) .

Proof

We define the events

$$\begin{aligned} A&\,{:}{=}\,\left\{ {\mathbf {X}}\in D^n~:~\frac{1}{n}\Big \Vert {\varvec{\Phi }}_m^*{\varvec{\Phi }}_m\Big \Vert _{2 \rightarrow 2} \le F + \sigma _m^2 \right\} ,\\ B&\,{:}{=}\,\left\{ {\mathbf {X}}\in D^n~:~\frac{1}{2}\le \lambda _{i}({\mathbf {H}}_m), \, i=1 \dots m \right\} , \end{aligned}$$

where \(F\) appears in (3.4) and \({\mathbf {H}}_m\) in (2.3). The operator \({\varvec{\Phi }}_m\) is given by

$$\begin{aligned} {\varvec{\Phi }}_m:\ell _2 \rightarrow \mathbb {C}^n,\quad {\mathbf {z}} \mapsto \left( \begin{array}{c} \langle {\mathbf {z}},{\mathbf {y}}^1 \rangle \\ \vdots \\ \langle {\mathbf {z}},{\mathbf {y}}^n \rangle \end{array}\right) \end{aligned}$$

with \({\mathbf {y}}^i = (e_m({\mathbf {x}}^i),e_{m+1}({\mathbf {x}}^i) \dots )^\top \) for all \( i=1 \dots n\). Hence we may choose F as

$$\begin{aligned} F\,{:}{=}\,\max \left\{ \frac{8r \log n}{n} T(m)\kappa ^2 , \sigma _m^2 \right\} \end{aligned}$$

in this specific situation. It clearly holds

$$\begin{aligned} \mathbb {P}(A \cap B)= 1 - \mathbb {P}(A^\complement \cup B^\complement ). \end{aligned}$$

Using the union bound estimate we get

$$\begin{aligned} \mathbb {P}(A^\complement \cup B^\complement ) \le \mathbb {P}(A^\complement ) + \mathbb {P}(B^\complement ). \end{aligned}$$

Theorem 2.3 implies

$$\begin{aligned} \mathbb {P}(B^\complement )\le n^{1-r}. \end{aligned}$$

And after noting

$$\begin{aligned} \mathbb {P}\Big (A^\complement \Big ) = \mathbb {P}\left( \frac{1}{n}\Big \Vert {\varvec{\Phi }}_m^*{\varvec{\Phi }}_m \Big \Vert _{2 \rightarrow 2}> F + \Vert {\varvec{\Lambda }}\Vert _{2 \rightarrow 2} \right) \le \mathbb {P}\left( \left\| \frac{1}{n}{\varvec{\Phi }}_m^*{\varvec{\Phi }}_m - {\varvec{\Lambda }} \right\| _{2 \rightarrow 2} > F \right) \end{aligned}$$

we infer from \({\varvec{\Phi }}_m^*{\varvec{\Phi }}_m = \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i\) and Proposition 3.8 that

$$\begin{aligned} \mathbb {P}\Big (A^\complement \Big ) \le \mathbb {P}\left( \left\| \frac{1}{n} \sum _{i=1}^n \,{\mathbf {y}}^i \otimes {\mathbf {y}}^i - {\varvec{\Lambda }} \right\| _{2 \rightarrow 2} \ge F\right) \le 2^\frac{3}{4} \, n^{1-r} . \end{aligned}$$

In total we have

$$\begin{aligned} \mathbb {P}(A \cap B) \ge 1- \eta n^{1-r} \end{aligned}$$

with \(\eta \,{:}{=}\, 2^{3/4}+1\). According to the proof of [11, Theorem 5.5] we need the function \(T(\cdot ,{\mathbf {x}}) \,{:}{=}\, K(\cdot ,{\mathbf {x}})-\sum _{j=1}^{\infty }e_j(\cdot )\overline{e_j({\mathbf {x}})}\), which denotes an element in H(K). Its norm is given by

$$\begin{aligned} \Vert T(\cdot ,{\mathbf {x}})\Vert _{H(K)}^2 \,{:}{=}\, \langle T(\cdot ,{\mathbf {x}}), T(\cdot ,{\mathbf {x}})\rangle _{H(K)} = K({\mathbf {x}},{\mathbf {x}})-\sum _{j=1}^{\infty }|e_j({\mathbf {x}})|^2. \end{aligned}$$
(5.2)

Note that the function in (5.2) is zero almost everywhere because of the fact that we have an equality sign in (4.6) due to our assumptions (separability of H(K) and finite trace). Hence, \(\mathbb {P}(C) = 1\) with

$$\begin{aligned} C \,{:}{=}\, \left\{ {\mathbf {X}}\in D^n~:~\sum \limits _{k=1}^n \Vert T(\cdot ,{\mathbf {x}}^k)\Vert _{H(K)} = 0\right\} . \end{aligned}$$

Let now \({\mathbf {X}}\in A\cap B\cap C\). Then we can use for any \(f\in H(K)\) with \(\Vert f\Vert _{H(K)}\le 1\) a similar argument as in [11, Theorem 5.5]

$$\begin{aligned} \Vert f-S^m_{\mathbf {X}}f\Vert ^2_{L_2(D,\varrho _D)}&\le \sigma ^2_m + \Vert ({\mathbf {L}}_m^{*}{\mathbf {L}}_m)^{-1}{\mathbf {L}}_m^{*}\Vert _{2\rightarrow 2}^2\cdot \sum \limits _{k=1}^n \Big |\Big (f-P_{m-1}f\Big )({\mathbf {x}}^k)\Big |^2\\&\le \sigma ^2_m + \frac{2}{n}\Vert {\varvec{\Phi }}_m\Vert _{2\rightarrow 2}^2 + \frac{6\Vert K\Vert _{\infty }}{n}\sum \limits _{k=1}^n \Vert T(\cdot ,{\mathbf {x}}^k)\Vert _{H(K)}\\&= \sigma _m^2+\frac{2}{n}\Vert {\varvec{\Phi }}_m\Vert _{2\rightarrow 2}^2\\&\le 2F+3\sigma _m^2. \end{aligned}$$

This yields

$$\begin{aligned} \mathbb {P}\left( \sup \limits _{\Vert f\Vert _{{H(K)}}\le 1} \Vert f-S^m_{\mathbf {X}} f\Vert ^2_{L_2(D,\varrho _D)} \le 2F + 3\sigma _m^2 \right) \ge 1 - \eta n^{1-r} \end{aligned}$$

and therefore

$$\begin{aligned} \mathbb {P}\left( \sup _{\Vert f\Vert _{H(K)} \le 1} \Big \Vert f - S_{{\mathbf {X}}}^m f \Big \Vert _{L_2(D,\varrho _D)}^2 \le 5 \max \left\{ \sigma _m^2, \frac{8r \log n}{n} T(m) \kappa ^2 \right\} \right) \ge 1-\eta n^{1-r}. \end{aligned}$$

\(\square \)

In the sequel we consider a more general situation. The measure from which the points are sampled will be adapted to the spectral properties of the embedding. This allows us to specify the bound above in terms of the singular numbers of the embedding and to benefit from their decay. Let us recall the density function from [13], which we will adapt to our framework as follows

$$\begin{aligned} \varrho _m({\mathbf {x}}) = \frac{1}{2} \left( \frac{1}{m-1} \sum _{j = 1}^{m-1} |\eta _j({\mathbf {x}})|^2 + \frac{K({\mathbf {x}},{\mathbf {x}}) - \sum _{j = 1}^{m-1}|e_j({\mathbf {x}})|^2}{\int _D K({\mathbf {x}},{\mathbf {x}}) d\varrho _D({\mathbf {x}}) - \sum _{j = 1}^{m-1}\lambda _j} \right) . \end{aligned}$$
(5.3)

We know from (4.6) that the sequence of singular numbers is square summable. We use the modified density function \(\varrho _m(\cdot ):D \rightarrow \mathbb {R}\) which has been introduced in [11] as a version of the one from [13]. As above, the family \((e_j(\cdot ))_{j\in \mathbb {N}}\) represents the eigenvectors corresponding to the non-vanishing eigenvalues of the compact self-adjoint operator \(W_{\varrho _D}\,{:}{=}\,\mathrm {Id}^*\circ \mathrm {Id}: H(K) \rightarrow H(K)\), the sequence \((\lambda _j)_{j\in \mathbb {N}}\) represents the ordered eigenvalues and finally \(\eta _j\,{:}{=}\,\lambda _j^{-1/2} e_j\).
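
For kernels whose spectral data are known in closed form, the density (5.3) can be written down explicitly. The following sketch uses the Brownian-motion kernel \(K(x,y)=\min (x,y)\) on \([0,1]\) with the Lebesgue measure, for which \(\lambda _j = ((j-1/2)\pi )^{-2}\), \(\eta _j(x) = \sqrt{2}\sin ((j-1/2)\pi x)\) and \({\text {tr}}(K)=1/2\); this concrete choice is ours and only serves as an illustration.

```python
import numpy as np

# A hedged numerical sketch of the density rho_m in (5.3) for a kernel with closed-form
# spectral data (Brownian-motion kernel K(x, y) = min(x, y) on [0, 1], Lebesgue measure).
# Here lambda_j = ((j - 1/2) pi)^(-2), eta_j(x) = sqrt(2) sin((j - 1/2) pi x),
# e_j = sqrt(lambda_j) eta_j and tr(K) = 1/2; this concrete kernel is our own choice.
m = 8
x = np.linspace(0.0, 1.0, 4001)
js = np.arange(1, m)                                      # j = 1, ..., m - 1

eta = np.sqrt(2.0) * np.sin((js[None, :] - 0.5) * np.pi * x[:, None])
lam = ((js - 0.5) * np.pi) ** -2.0

first = (eta ** 2).sum(axis=1) / (m - 1)                  # (m-1)^{-1} sum_j |eta_j(x)|^2
tail = (x - (lam[None, :] * eta ** 2).sum(axis=1)) / (0.5 - lam.sum())
rho_m = 0.5 * (first + tail)

print(rho_m.min() >= 0.0, rho_m.mean())                   # nonnegative, integrates to ~1
```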

Clearly, as a consequence of (4.5) the function \(\varrho _m\) is positive and defined point-wise for any \({\mathbf {x}}\in D\). Moreover, it can be computed precisely from the knowledge of \(K({\mathbf {x}},{\mathbf {x}})\) and the first \(m-1\) eigenvalues and corresponding eigenvectors. It clearly holds that \(\int _D \varrho _m({\mathbf {x}})d\varrho _D({\mathbf {x}}) = 1\). Here is one of the main results of this paper (note that Theorem 1.3 from the introduction is a simple reformulation of the theorem below).

Theorem 5.2

Let H(K) be a separable reproducing kernel Hilbert space of complex-valued functions defined on D such that

$$\begin{aligned} \int _{D} K({\mathbf {x}},{\mathbf {x}})d\varrho _D({\mathbf {x}}) < \infty \end{aligned}$$

for some measure \(\varrho _D\) on D, where \((\sigma _k)_k\) denotes the non-increasing sequence of singular numbers of the embedding \(\mathrm {Id}:H(K) \rightarrow L_2(D,\varrho _D)\). Then we have for \(n\in \mathbb {N}\), \(r>1,\) and

$$\begin{aligned} m \,{:}{=}\, \left\lfloor \frac{n}{14 r\log n }\right\rfloor \end{aligned}$$
(5.5)

the bound

$$\begin{aligned} \mathbb {P}\left( \sup _{\Vert f\Vert _{H(K)} \le 1} \Vert f - \widetilde{S}_{{\mathbf {X}}}^m f\Vert _{L_2(D,\varrho _D)}^2 \le 5 \max \left\{ \sigma _m^2, \frac{16r\kappa ^2 \log n}{n}\sum \limits _{j=m}^{\infty }\sigma _j^2\right\} \right) \ge 1 - \eta n^{1-r}, \end{aligned}$$

where \({\mathbf {X}}\) is sampled i.i.d. according to \(\varrho _m({\mathbf {x}})d\varrho _D({\mathbf {x}})\) in (5.3) above and \(\eta = 2^\frac{3}{4} +1,\) \( \kappa = \frac{1+\sqrt{5}}{2}\).

Proof

This result is a consequence of Theorem 5.1 above which is applied to the newly constructed probability measure \(\varrho _m(\cdot )\) together with

$$\begin{aligned} \widetilde{K}_m({\mathbf {x}},{\mathbf {y}}) \,{:}{=}\, \frac{K({\mathbf {x}},{\mathbf {y}})}{\sqrt{\varrho _m({\mathbf {x}})}\sqrt{\varrho _m({\mathbf {y}})}}. \end{aligned}$$
(5.6)

We observe

$$\begin{aligned} \sup _{\Vert f\Vert _{H(K)} \le 1 } \big \Vert f - \widetilde{S}_{{\mathbf {X}}}^m f \big \Vert _{L_2(D,\varrho _D)} \le \sup _{\Vert g\Vert _{H(\widetilde{K}_m)} \le 1 } \big \Vert g - S_{{\mathbf {X}}}^m g \big \Vert _{L_2(D,\mu _m)}, \end{aligned}$$

\(\widetilde{N}(m) \le 2(m-1)\), and \(\widetilde{T}(m) \le 2\sum _{j=m}^{\infty }\sigma _j^2\), see [11, Thms. 5.5, 5.8]. Applying Theorem 5.1 leads to the stated bound. \(\square \)

6 Sampling discretization of the \(L_2\)-norm

Motivated by supervised learning theory, see e.g. [6], one is interested in uniform bounds for the following version of the “defect function”

$$\begin{aligned} L_{{\mathbf {X}}}(f) \,{:}{=}\, \int _D |f({\mathbf {x}})|^2\,d\varrho _D({\mathbf {x}}) - \frac{1}{n}\sum \limits _{j=1}^n |f({\mathbf {x}}^j)|^2 \end{aligned}$$

with respect to f belonging to some hypothesis space \(\mathcal {H}\) which is usually embedded into \(\mathcal {C}(D)\), the continuous functions on the domain D. From a more classical perspective, authors were interested in discretizing \(L_p\)-norms using Marcinkiewicz–Zygmund inequalities. This subject has been recently studied systematically by Temlyakov [25], see also Gröchenig [9]. The following theorem will be an immediate implication of our concentration result in Theorem 1.1.

Theorem 6.1

Let \(\varrho _D\) denote a probability measure on the measurable subset \(D \subset \mathbb {R}^d\) and H(K) be a separable reproducing kernel Hilbert space with kernel \(K(\cdot ,\cdot )\) such that

$$\begin{aligned} \Vert K\Vert _\infty \,{:}{=}\, \sup \limits _{{\mathbf {x}}\in D} \sqrt{K({\mathbf {x}},{\mathbf {x}})}<\infty \, \end{aligned}$$

(equivalently,  the unit ball in H(K) is uniformly bounded in \(\ell _\infty ).\) Then we have for \(0<t\le 1\)

$$\begin{aligned} \mathbb {P}\left( \sup \limits _{\Vert f\Vert _{H(K)}\le 1}\left| \int _D |f({\mathbf {x}})|^2\,d\varrho _D({\mathbf {x}})-\frac{1}{n}\sum \limits _{i=1}^n |f({\mathbf {x}}^i)|^2\right| >t\Vert \mathrm {Id}\Vert ^2_{K,2}\right) \le 2^{3/4}n\exp \left( -\frac{nt^2 \Vert \mathrm {Id}\Vert ^2_{K,2}}{21 \Vert K\Vert _{\infty }^2}\right) . \end{aligned}$$

If we fix \(r>1\) the above bound can be reformulated as

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)}\le 1}\left| \int _D |f({\mathbf {x}})|^2\,d\varrho _D({\mathbf {x}})-\frac{1}{n}\sum \limits _{i=1}^n |f({\mathbf {x}}^i)|^2\right| \le \Vert \mathrm {Id}\Vert _{H(K)\rightarrow L_2}\Vert K\Vert _{\infty }\sqrt{\frac{21r\log n}{n}} \end{aligned}$$

with probability exceeding \(1-2n^{1-r}\) for n sufficiently large, namely

$$\begin{aligned} \frac{n}{\log n} \ge \frac{21 r \Vert K\Vert _{\infty }^2}{\Vert \mathrm {Id}\Vert _{K,2}^2}. \end{aligned}$$

The nodes \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n)\) are randomly drawn according to the product measure \((d \varrho _D({\mathbf {x}}))^n\).

Proof

Fix f with \(\Vert f\Vert _{H(K)} \le 1\) and put \(M\,{:}{=}\,\Vert K\Vert _{\infty }\). Due to the \(L_2\)-identity

$$\begin{aligned} f = \sum \limits _{i=1}^\infty \sigma _i \langle f,e_i\rangle \eta _i \end{aligned}$$

we find

$$\begin{aligned} \int _D |f({\mathbf {x}})|^2\,d\varrho _D({\mathbf {x}}) = \sum \limits _{i=1}^\infty \sigma _i^2|\langle f,e_i\rangle |^2 = \langle {\hat{{\mathbf {f}}},{\varvec{\Lambda }}\hat{{\mathbf {f}}}}\rangle _{\ell _2} \end{aligned}$$

with \(\hat{{\mathbf {f}}} = (\langle f, e_k \rangle _{H(K)})_{k\in \mathbb {N}}\) and \({\varvec{\Lambda }} = \mathrm {diag}(\sigma ^2_1,\sigma ^2_2,\ldots )\). Note that \(\Vert {\varvec{\Lambda }}\Vert _{2\rightarrow 2} = \Vert \mathrm {Id}\Vert _{K,2}^2\). Furthermore, putting

$$\begin{aligned} {\mathbf {y}}^i = (e_1({\mathbf {x}}^i),e_2({\mathbf {x}}^i),\ldots ,e_k({\mathbf {x}}^i),\ldots )^\top ,\quad i=1,\ldots ,n, \end{aligned}$$

we find

$$\begin{aligned} \frac{1}{n}\sum \limits _{i=1}^n |f({\mathbf {x}}^i)|^2 = \left\langle \hat{{\mathbf {f}}},\left( \frac{1}{n} \sum \limits _{i=1}^n {\mathbf {y}}^i \otimes {\mathbf {y}}^i\right) \hat{{\mathbf {f}}}\right\rangle _{\ell _2} \end{aligned}$$

holds almost surely since \({\text {tr}}_0(K) = 0\) due to the separability of H(K), see the paragraph after (4.6). In fact, the identity fails on a nullset \(A \subset D^n\), which is independent of f. This follows by using the same arguments as after (5.2). Hence,

with probability exceeding \(1-2^{3/4}n\exp (-t^2n/(21\tilde{M}^2))\) by Theorem 1.1. Here \(\tilde{M}^2 = M^2/\Vert {\varvec{\Lambda }}\Vert _{2\rightarrow 2}\). Hence, we may choose \(t = \sqrt{21\tilde{M}^2 r \log (n)/n} \le 1\) to finally get

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)}\le 1}\left| \int _D |f({\mathbf {x}})|^2\,d\varrho _D({\mathbf {x}})-\frac{1}{n}\sum \limits _{i=1}^n |f({\mathbf {x}}^i)|^2\right| \le \sqrt{21\Vert {\varvec{\Lambda }}\Vert _{2\rightarrow 2}M^2r\frac{\log n}{n}}. \end{aligned}$$
(6.1)

\(\square \)
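
The sample size condition in Theorem 6.1 is easily evaluated numerically; the following small helper (with made-up placeholder values for \(r\), \(\Vert K\Vert _{\infty }^2\) and \(\Vert \mathrm {Id}\Vert _{K,2}^2\)) computes the smallest admissible n.

```python
import numpy as np

# A tiny helper for the sample size condition n / log(n) >= 21 r ||K||_inf^2 / ||Id||_{K,2}^2
# in Theorem 6.1; the numerical values below are made-up placeholders, not from the paper.
def minimal_n(threshold, n_start=3):
    """Smallest integer n >= n_start with n / log(n) >= threshold."""
    n = n_start
    while n / np.log(n) < threshold:
        n += 1
    return n

r, K_inf_sq, embedding_norm_sq = 2.0, 4.0, 1.0
print(minimal_n(21.0 * r * K_inf_sq / embedding_norm_sq))
```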

Remark 6.2

  1. (i)

    The uniform boundedness (in \(\ell _\infty \)) of a function class is sometimes also called M-boundedness in the learning theory literature. It represents a common assumption there to analyze the defect function. In fact, uniform bounds on the defect function are proved by using the concept of covering numbers of the unit ball of H(K) in \(\ell _{\infty }(D)\), see [12, 25]. In the above theorem covering number estimates were not used at all.

  2. (ii)

    The quantity \(\Vert K\Vert _{\infty }\) may be replaced by \(\Vert K\Vert _{L_{\infty }(D,\varrho _D)}\), i.e., the essential supremum with respect to the probability measure \(\varrho _D\). Since \(\Vert K\Vert _{L_{\infty }(D,\varrho _D)}\) might be smaller than \(\Vert K\Vert _{\infty }\) we obtain a slight improvement.

Now we want to drop the uniform boundedness of the class and assume only the finite trace property. To this end we have to change the sampling measure and modify the norm discretization operator by incorporating weights. The corresponding theorem reads as follows.

Theorem 6.3

Let \(\varrho _D\) denote an arbitrary measure on the measurable subset \(D \subset \mathbb {R}^d\) and H(K) be a separable reproducing kernel Hilbert space with (Hermitian) kernel \(K(\cdot ,\cdot )\) such that

$$\begin{aligned} {\text {tr}}(K) \,{:}{=}\, \int _D K({\mathbf {x}},{\mathbf {x}})\,d\varrho _D({\mathbf {x}}) < \infty . \end{aligned}$$

We define the probability density function

$$\begin{aligned} \nu ({\mathbf {x}}) \,{:}{=}\, \frac{K({\mathbf {x}},{\mathbf {x}})}{{\text {tr}}(K)} \end{aligned}$$

and sample \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n)\) from the product measure \(\big (\nu ({\mathbf {x}})\,d\varrho _D({\mathbf {x}})\big )^n\). Then we have

$$\begin{aligned} \mathbb {P}\left( \sup \limits _{\Vert f\Vert _{H(K)}\le 1}\left| \int _D |f({\mathbf {x}})|^2\,d\varrho _D({\mathbf {x}})-\frac{1}{n}\sum \limits _{i=1}^n \frac{|f({\mathbf {x}}^i)|^2}{\nu ({\mathbf {x}}^i)}\right| >t\Vert \mathrm {Id}\Vert _{K,2}^2\right) \le 2^{3/4}n\exp \left( -\frac{nt^2\Vert \mathrm {Id}\Vert ^2_{K,2}}{21{\text {tr}}(K) }\right) . \end{aligned}$$

If we fix \(r>1\) then this result can be reformulated as

$$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)}\le 1}\left| \int _D |f({\mathbf {x}})|^2\,d\varrho _D({\mathbf {x}})-\frac{1}{n}\sum \limits _{i=1}^n \frac{|f({\mathbf {x}}^i)|^2}{\nu ({\mathbf {x}}^i)}\right| \le \sqrt{21{\text {tr}}(K)\Vert \mathrm {Id}\Vert ^2r\frac{\log n}{n}} \end{aligned}$$

with probability exceeding \(1-2n^{1-r}\), provided n is sufficiently large, namely

$$\begin{aligned} \frac{n}{\log n} \ge \frac{21 r {\text {tr}}(K)}{\Vert \mathrm {Id}\Vert _{K,2}^2}. \end{aligned}$$

Proof

We want to apply Theorem 6.1. Let us define the normalized kernel

$$\begin{aligned} \tilde{K}({\mathbf {x}},{\mathbf {y}}) \,{:}{=}\, \frac{K({\mathbf {x}},{\mathbf {y}})}{\sqrt{\nu ({\mathbf {x}})}\sqrt{\nu ({\mathbf {y}})}}. \end{aligned}$$

Then \(\Vert \tilde{K}\Vert _{\infty }^2 = {\text {tr}}(K)\) and

$$\begin{aligned} \begin{aligned}&\sup \limits _{\Vert f\Vert _{H(K)}\le 1}\left| \int _D |f({\mathbf {x}})|^2\,d\varrho _D({\mathbf {x}})-\frac{1}{n}\sum \limits _{i=1}^n \frac{|f({\mathbf {x}}^i)|^2}{\nu ({\mathbf {x}}^i)}\right| \\&\quad = \sup \limits _{\Vert f\Vert _{H(\tilde{K})}\le 1}\left| \int _D |f({\mathbf {x}})|^2\,\nu ({\mathbf {x}})d\varrho _D({\mathbf {x}})-\frac{1}{n}\sum \limits _{i=1}^n |f({\mathbf {x}}^i)|^2\right| . \end{aligned} \end{aligned}$$
(6.2)

It remains to note that

$$\begin{aligned} \Vert \mathrm {Id}\Vert _{K,2,d\varrho _D} = \Vert \mathrm {Id}\Vert _{\tilde{K},2,\nu (\cdot )d\varrho _D(\cdot )} \end{aligned}$$

and we may apply Theorem 6.1. \(\square \)
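In an implementation, the reweighting of Theorem 6.3 amounts to drawing the nodes from \(\nu ({\mathbf {x}})\,d\varrho _D({\mathbf {x}})\) and dividing each summand by \(\nu ({\mathbf {x}}^i)\). The sketch below indicates the mechanics for a toy finite-rank kernel with non-constant diagonal; the sine system, the decay and the rejection sampler are assumptions made only for this example (the toy kernel is even bounded, so the weighting is not required here, it merely illustrates the estimator).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy finite-rank construction on D = [0, 1]:
#   eta_k(x) = sqrt(2) * sin(pi * k * x)  (orthonormal in L_2[0,1]),  e_k = sigma_k * eta_k.
N = 50
k = np.arange(1, N + 1)
sigma = k ** (-1.0)

def eval_e(x):
    """Matrix with entries e_k(x^i) = sigma_k * eta_k(x^i)."""
    return sigma[None, :] * np.sqrt(2.0) * np.sin(np.pi * np.outer(x, k))

trace_K = np.sum(sigma ** 2)            # tr(K) = sum_k sigma_k^2 = int_D K(x,x) dx

def nu(x):
    """Density nu(x) = K(x,x) / tr(K) with K(x,x) = sum_k |e_k(x)|^2."""
    return np.sum(eval_e(x) ** 2, axis=1) / trace_K

def sample_nu(n):
    """Rejection sampling from nu on [0,1]; uses K(x,x) <= 2 tr(K) since |eta_k| <= sqrt(2)."""
    out = []
    while len(out) < n:
        x = rng.random(n)
        accept = rng.random(n) < nu(x) / 2.0
        out.extend(x[accept].tolist())
    return np.array(out[:n])

# Weighted estimator of the squared L2-norm of f with coefficients fhat (||f||_{H(K)} = ||fhat||_2).
fhat = rng.standard_normal(N)
fhat /= np.linalg.norm(fhat)
n = 5000
x = sample_nu(n)
fx = eval_e(x) @ fhat                   # f(x^i) = sum_k fhat_k e_k(x^i)
estimate = np.mean(np.abs(fx) ** 2 / nu(x))
exact = np.sum(sigma ** 2 * np.abs(fhat) ** 2)   # int_D |f|^2 d varrho_D
print(f"weighted estimate = {estimate:.5f},  exact squared L2-norm = {exact:.5f}")
```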

7 Non-separable RKHS

Now we deal with a more general situation and drop the separability assumption for H(K). We only assume the finite trace property (1.5). For this purpose we define the new density function

$$\begin{aligned} \varrho _m({\mathbf {x}}) = \frac{1}{3} \left( \frac{ \sum _{j = 1}^{m-1} |\eta _j({\mathbf {x}})|^2}{m-1} + \frac{ \sum _{j=m}^\infty |e_j({\mathbf {x}})|^2}{\sum _{j=m}^{\infty } \lambda _j} + \frac{K_0({\mathbf {x}},{\mathbf {x}})}{ {\text {tr}}(K^0)} \right) \end{aligned}$$
(7.1)

and the operator \(\widetilde{S}^m_{{\mathbf {X}}}\) from Algorithm 1. Clearly, \(\int _D \varrho _m({\mathbf {x}})\, d\varrho _D({\mathbf {x}}) = 1\) holds. Theorem 1.2 provides the bound

$$\begin{aligned} \sup _{\Vert f\Vert _{H(K)} \le 1 } \big \Vert f - \widetilde{S}_{{\mathbf {X}}}^m f \big \Vert ^2_{L_2(D,\varrho _D)} \le C \max \left\{ \sigma _m^2, \frac{r \log n }{n} \sum _{j=m}^\infty \sigma _j^2, \frac{{\text {tr}}(K^0)}{n} \right\} \end{aligned}$$
(7.2)

with an absolute constant \(C>0\) and \(m \,{:}{=}\, \lfloor n/(14 r\log n)\rfloor \). Note that this result improves on a result in Wasilkowski and Woźniakowski [31], see also Novak and Woźniakowski [19, Thm. 26.10]. The authors in [31, Thm. 1] constructed a recovery operator using n samples whose squared worst-case error is not greater than

$$\begin{aligned} \min \left\{ \sigma _{\ell }^2 + \frac{{\text {tr}}(K)\ell }{n}~:~\ell = 1,2,3,\ldots \right\} . \end{aligned}$$

If we assume, for instance, that \(\sum _{k=1}^{\infty } \sigma _k^p < \infty \) for some \(0<p\le 2\), then we may balance \(\ell \asymp n^{p/(p+2)}\) to obtain a rate of \({\mathcal {O}}(n^{-1/(p+2)})\). In Theorem 1.2, see (7.2), we obtain a rate of \(o(\sqrt{(\log n)/n})\) already for \(p=2\). In case \(p<2\) we obtain \({{\mathcal {O}}}(n^{-1/2})\). It seems that, in general, the decay properties of the singular values have a rather weak influence on the recovery bound, in contrast to the separable case, where the bound is much better than \({\mathcal {O}}(n^{-1/2})\). A lower bound showing that we cannot essentially improve on \({\mathcal {O}}(n^{-1/2})\) with the above algorithm will be provided in [17].
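This comparison can also be made concrete numerically. The following sketch evaluates, up to the absolute constants, the max-term in (7.2) with \(m = \lfloor n/(14 r\log n)\rfloor \) against the balanced bound from [31, Thm. 1] for the model decay \(\sigma _k = k^{-s}\); the decay model, the truncation of the tail sum and all parameter values are assumptions chosen only for illustration.

```python
import numpy as np

def bound_72(n, s=0.75, r=2.0, trace_K0=1.0, tail_cut=10**6):
    """Max-term in (7.2) for the model sigma_k = k^{-s} (squared worst-case error, constants dropped)."""
    m = max(int(n / (14.0 * r * np.log(n))), 2)
    j = np.arange(m, tail_cut, dtype=float)
    tail = np.sum(j ** (-2.0 * s))                 # approximates sum_{j >= m} sigma_j^2
    return max(m ** (-2.0 * s), r * np.log(n) / n * tail, trace_K0 / n)

def bound_ww(n, s=0.75, trace_K=1.0):
    """Squared worst-case bound from [31, Thm. 1]: min_l sigma_l^2 + tr(K) * l / n."""
    l = np.arange(1, n + 1, dtype=float)
    return np.min(l ** (-2.0 * s) + trace_K * l / n)

for n in [10**3, 10**4, 10**5]:
    print(f"n = {n:7d}   (7.2): {bound_72(n):.2e}   [31, Thm. 1]: {bound_ww(n):.2e}")
```

For \(s>1/2\) the second column decays essentially like \(n^{-1}\) (squared error), whereas the balanced bound from [31] only decays like \(n^{-2s/(2s+1)}\), reflecting the discussion above.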

Proof of Theorem 1.2

Proof

Step 1. Let us assume that \(M\,{:}{=}\,\Vert K\Vert _{\infty } = \sup _{{\mathbf {x}}\in D}\sqrt{K({\mathbf {x}},{\mathbf {x}})} < \infty \). By the spectral theorem we can decompose

$$\begin{aligned} H(K) = N(\mathrm {Id}) \oplus \overline{ {\mathrm{span}}\{e_1(\cdot ), e_2(\cdot ), \dots \}} \end{aligned}$$

where \(N(\mathrm {Id})\) is the nullspace of the embedding. Let us now define

$$\begin{aligned} K^1({\mathbf {x}},{\mathbf {y}}) = \sum _{j=1}^\infty e_j({\mathbf {x}}) \overline{e_j({\mathbf {y}})} \end{aligned}$$

and

$$\begin{aligned} K^0({\mathbf {x}},{\mathbf {y}}) = K({\mathbf {x}},{\mathbf {y}}) - K^1({\mathbf {x}},{\mathbf {y}}) . \end{aligned}$$

Then \(K^0({\mathbf {x}},{\mathbf {y}}) \) is the reproducing kernel of the nullspace \(N(\mathrm {Id})\) and \( \sup _{{\mathbf {x}}\in D} \sqrt{K^0({\mathbf {x}}, {\mathbf {x}})} =: M_0 \le M < \infty \). We estimate

$$\begin{aligned} \begin{aligned}&\sup _{\Vert f\Vert _{H(K)} \le 1 } \big \Vert f - S_{{\mathbf {X}}}^m f \big \Vert _{L_2(D,\varrho _D)} \\&\quad \le \sup _{\Vert g\Vert _{H(K^0)} \le 1 } \big \Vert g - S_{{\mathbf {X}}}^m g \big \Vert _{L_2(D,\varrho _D)} + \sup _{\Vert g\Vert _{H(K^1)} \le 1 } \big \Vert g - S_{{\mathbf {X}}}^m g \big \Vert _{L_2(D,\varrho _D)}. \end{aligned} \end{aligned}$$
(7.3)

The second summand can be treated with Theorem 5.1. The space \(H(K^1)\) is separable and \(K^1({\mathbf {x}},{\mathbf {x}})\) is bounded. Theorem 5.1 gives

$$\begin{aligned} \sup _{\Vert g\Vert _{H(K^1)} \le 1} \Big \Vert g - S_{{\mathbf {X}}}^m g \Big \Vert _{L_2(D,\varrho _D)}^2 \le 5 \max \left\{ \sigma _m^2, \frac{8r \log n}{n} T(m) \kappa ^2 \right\} \end{aligned}$$
(7.4)

with probability at least \( 1 - \eta n^{1-r} \) whenever (5.1) holds. The number \(\kappa \) is the same as in Proposition 3.8.

Note that all the functions in \(H({K}^0)\) are zero in \(L_2(D,\varrho _D)\) since this space is the nullspace of the embedding. Hence

$$\begin{aligned} \begin{aligned} \sup _{\Vert g\Vert _{H({K}^0)} \le 1 } \big \Vert g - {S}_{{\mathbf {X}}}^m g \big \Vert _{L_2(D,\varrho _D)}^2 =&\sup _{\Vert g\Vert _{H({K}^0)} \le 1 } \big \Vert {S}_{{\mathbf {X}}}^m g \big \Vert _{L_2(D,\varrho _D)}^2 \\ \le&\frac{2}{n} \sup _{\Vert g\Vert _{H({K}^0)} \le 1 } \sum _{i=1}^n |g({\mathbf {x}}^i)|^2 \end{aligned} \end{aligned}$$

holds for the same \({\mathbf {X}}= ({\mathbf {x}}^1, \dots , {\mathbf {x}}^n) \) for which (7.4) holds. We only need the operator norm of \({\mathbf {H}}_m\) (see (2.3)) to be larger than \(\frac{1}{2}\), which comes from Theorem 2.1. At this point we employ the “Representer Theorem” from learning theory, see for instance [4, Theorem 5.5]. We claim that

$$\begin{aligned} \sup _{\Vert g\Vert _{H({K}^0)} \le 1 } \sum _{i=1}^n |g({\mathbf {x}}^i)|^2 = \sup _{\varvec{\alpha }^\intercal {K}^0[{\mathbf {X}}] \overline{\varvec{\alpha }} \le 1} \sum _{i=1}^n|{g}({\mathbf {x}}^i)|^2, \end{aligned}$$
(7.5)

where \({K}^0[{\mathbf {X}}] \,{:}{=}\, \big ({K}^0({\mathbf {x}}^i, {\mathbf {x}}^j)\big )_{i,j = 1}^n\) is the kernel matrix at the points from \({\mathbf {X}}\) and \( {g}({\mathbf {x}}) = \sum _{i =1}^n \alpha _i K^0({\mathbf {x}},{\mathbf {x}}^i)\).

In other words, we can reduce the problem to the finite-dimensional space \( {\mathrm{span}}\big \{ K^0(\cdot , {\mathbf {x}}^1),\dots , K^0(\cdot , {\mathbf {x}}^n) \big \}\). Note that

$$\begin{aligned} \begin{aligned} \Vert {g}\Vert _{H({K}^0)}^2&= \sum _{i = 1}^n \sum _{j = 1}^n {K}^0({\mathbf {x}}^i, {\mathbf {x}}^j) \alpha _i \overline{\alpha _j} \\&= \varvec{\alpha }^\intercal {K}^0[{\mathbf {X}}] \overline{\varvec{\alpha }} \\&\le 1. \end{aligned} \end{aligned}$$

The reason is that \(g \in H(K^0)\) can be decomposed into \(g = g_1 + g_2\) with \(g_1 \perp g_2\) and \(g_1 = Pg \), the orthogonal projection onto \({\mathrm{span}}\big \{ K^0(\cdot , {\mathbf {x}}^1), \dots ,K^0(\cdot , {\mathbf {x}}^n) \big \}\). Due to \(g_1 \perp g_2\), we have that \(\langle g_2, K^0(\cdot , {\mathbf {x}}^i)\rangle = 0 = g_2({\mathbf {x}}^i)\) for all i. Hence \(\sum _{i=1}^n |g({\mathbf {x}}^i)|^2 = \sum _{i=1}^n |g_1({\mathbf {x}}^i)|^2\) and \( \Vert g_1\Vert _{H({K}^0)}^2 \le 1 \). Therefore

$$\begin{aligned} \sup _{\Vert g\Vert _{H({K}^0)} \le 1 } \big \Vert g - {S}_{{\mathbf {X}}}^m g \big \Vert _{L_2(D,\varrho _D)}^2&\le \frac{2}{n} \sup _{\Vert g\Vert _{H({K}^0)} \le 1 } \sum _{i=1}^n |g({\mathbf {x}}^i)|^2 \nonumber \\&\le \frac{2}{n} \sup _{\varvec{\alpha }^\intercal {K}^0[{\mathbf {X}}] \overline{\varvec{\alpha }} \le 1 } \sum _{i=1}^n \left| \sum _{j=1}^n \alpha _j {K}^0({\mathbf {x}}^i , {\mathbf {x}}^j) \right| ^2. \end{aligned}$$
(7.6)

Since \(H({K}^0)\) is the nullspace of the embedding, we know that \({K}^0({\mathbf {x}}^i , {\mathbf {x}}^j)\) is zero almost surely for \(i \ne j\). We can therefore continue to estimate (7.6) by

$$\begin{aligned} \begin{aligned} \frac{2}{n} \sup _{\varvec{\alpha }^\intercal {K}^0[{\mathbf {X}}] \overline{\varvec{\alpha }} \le 1 } \sum _{i=1}^n \left| \sum _{j=1}^n \alpha _j {K}^0({\mathbf {x}}^i , {\mathbf {x}}^j) \right| ^2&= \frac{2}{n} \sup _{\varvec{\alpha }^\intercal {K}^0[{\mathbf {X}}] \overline{\varvec{\alpha }} \le 1 } \sum _{i=1}^n |\alpha _i|^2 \big | {K}^0({\mathbf {x}}^i , {\mathbf {x}}^i) \big |^2 \\&\le \frac{2 M_0^2}{n} \sup _{\varvec{\alpha }^\intercal {K}^0[{\mathbf {X}}] \overline{\varvec{\alpha }} \le 1 } \sum _{i=1}^n |\alpha _i|^2 \big | {K}^0({\mathbf {x}}^i , {\mathbf {x}}^i) \big | \\&\le \frac{2 M_0^2}{n} \end{aligned} \end{aligned}$$
(7.7)

since, for every \(\varvec{\alpha }\) with \(\varvec{\alpha }^\intercal {K}^0[{\mathbf {X}}] \overline{\varvec{\alpha }} \le 1\), we have almost surely

$$\begin{aligned} \sum _{i=1}^n |\alpha _i|^2 \big | {K}^0({\mathbf {x}}^i , {\mathbf {x}}^i) \big |&= \sum _{i = 1}^n \sum _{j = 1}^n \alpha _i \overline{\alpha _j} {K}^0({\mathbf {x}}^i, {\mathbf {x}}^j) \\&= \varvec{\alpha }^\intercal {K}^0[{\mathbf {X}}] \overline{\varvec{\alpha }} \\&\le 1. \end{aligned}$$

This leads to

$$\begin{aligned}&\sup _{\Vert f\Vert _{H(K)} \le 1 } \big \Vert f -S_{{\mathbf {X}}}^m f \big \Vert ^2_{L_2(D,\varrho _D)}\nonumber \\&\quad \le \left( \sup _{\Vert g\Vert _{H(K^0)} \le 1 } \big \Vert g - S_{{\mathbf {X}}}^m g \big \Vert _{L_2(D,\varrho _D)} + \sup _{\Vert g\Vert _{H(K^1)} \le 1 } \big \Vert g - S_{{\mathbf {X}}}^m g \big \Vert _{L_2(D,\varrho _D)}\right) ^2 \nonumber \\&\quad \le \left( \sqrt{\frac{1}{4 \kappa ^2}} + \sqrt{5} \right) ^2 \max \left\{ \sigma _m^2, \frac{8r \log n}{n} T(m) \kappa ^2,\frac{8 M_0^2 \kappa ^2}{n} \right\} \nonumber \\&\quad \le 7 \max \left\{ \sigma _m^2, \frac{8r \log n}{n} T(m) \kappa ^2, \frac{8 M_0^2 \kappa ^2}{n} \right\} \end{aligned}$$
(7.8)

with probability exceeding \(1-\eta n^{1-r}\).
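The reduction (7.5)–(7.7) rests on a piece of finite-dimensional linear algebra: for a Hermitian positive semi-definite matrix \(G = {K}^0[{\mathbf {X}}]\), the supremum of \(\sum _i |(G\varvec{\alpha })_i|^2\) over all \(\varvec{\alpha }\) with \(\varvec{\alpha }^\intercal G\overline{\varvec{\alpha }} \le 1\) equals the largest eigenvalue of G, and it collapses to \(\max _i G_{ii}\) once the off-diagonal entries vanish. A small numerical check (with randomly generated matrices, purely for illustration; the Monte Carlo search only gives a lower estimate approaching the supremum):

```python
import numpy as np

rng = np.random.default_rng(2)

def objective_max(G, trials=20000):
    """Monte Carlo lower estimate of sup { sum_i |(G a)_i|^2 : a^T G a <= 1 } for PSD G."""
    n = G.shape[0]
    best = 0.0
    for _ in range(trials):
        a = rng.standard_normal(n)
        a /= np.sqrt(a @ G @ a)          # normalize so that a^T G a = 1
        best = max(best, float(np.sum((G @ a) ** 2)))
    return best

# Generic PSD Gram matrix: the supremum is the largest eigenvalue of G.
A = rng.standard_normal((4, 4))
G = A @ A.T
print(objective_max(G), np.linalg.eigvalsh(G)[-1])

# Diagonal Gram matrix (the almost-sure situation for K^0[X] in (7.7)):
# the supremum collapses to max_i G_ii, which is the bound used in the proof.
D = np.diag(rng.random(4))
print(objective_max(D), np.max(np.diag(D)))
```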

Step 2. We define the measure \(d\mu _m({\mathbf {x}}) = \varrho _m({\mathbf {x}})d\varrho _D({\mathbf {x}}) \) as well as the kernel \(\widetilde{K}_m({\mathbf {x}},{\mathbf {y}})\) as in (5.6) and \(\widetilde{K}_m^0, \widetilde{K}_m^1\) accordingly. This gives

$$\begin{aligned} \sup _{\Vert f\Vert _{H(K)} \le 1 } \big \Vert f - \widetilde{S}_{{\mathbf {X}}}^m f \big \Vert _{L_2(D,\varrho _D)} \le \sup _{\Vert g\Vert _{H(\widetilde{K}_m)} \le 1 } \big \Vert g - \widetilde{S}_{{\mathbf {X}}}^m g \big \Vert _{L_2(D,\mu _m)}. \end{aligned}$$
(7.9)

We apply the results from Step 1 to the right-hand side. Hence, we need bounds for \(\widetilde{K}^0_m({\mathbf {x}},{\mathbf {x}})\), \(\widetilde{N}(m)\) and \(\widetilde{T}(m)\), where the latter two quantities are associated with \(\widetilde{K}^1_m\). We will now show that \(\widetilde{K}_m^0({\mathbf {x}},{\mathbf {x}})\) can be bounded by \(3{\text {tr}}(K^0)\). In fact,

$$\begin{aligned} \widetilde{K}_m^0({\mathbf {x}},{\mathbf {x}})&= \frac{K^0({\mathbf {x}},{\mathbf {x}})}{\varrho _m({\mathbf {x}})} \nonumber \\&= \frac{K({\mathbf {x}},{\mathbf {x}}) - \sum _{k = 1}^{\infty } |e_k({\mathbf {x}})|^2}{\frac{1}{3} \bigg ( \frac{ \sum _{j = 1}^{m-1} |\eta _j({\mathbf {x}})|^2}{m-1} + \frac{ \sum _{j=m}^\infty |e_j({\mathbf {x}})|^2}{\sum _{j=m}^{\infty } \lambda _j} + \frac{K^0({\mathbf {x}},{\mathbf {x}})}{ {\text {tr}}(K^0)} \bigg )} \nonumber \\&\le 3 \, {\text {tr}}(K^0). \end{aligned}$$
(7.10)

Hence, we have \(\widetilde{M}_0^2 \le 3{\text {tr}}(K^0)\). By the same arguments as in the proof of Theorem 5.2, see also [11, Thm. 5.7], we get \(\widetilde{N}(m) \le 3(m-1)\) and \(\widetilde{T}(m) \le 3\sum _{j=m}^\infty \sigma _j^2\). Plugging this into (7.8) gives

$$\begin{aligned} \sup _{\Vert g\Vert _{H(\widetilde{K}_m)} \le 1 } \big \Vert g - \widetilde{S}_{{\mathbf {X}}}^m g \big \Vert ^2_{L_2(D,\mu _m)}&\le 7 \max \left\{ \sigma _m^2, \frac{24 \kappa ^2 r \log n }{n} \sum _{j=m}^\infty \sigma _j^2, \frac{24 \kappa ^2{\text {tr}}(K^0)}{n} \right\} \\ \sup _{\Vert f\Vert _{H(K)} \le 1 } \big \Vert f - \widetilde{S}_{{\mathbf {X}}}^m f \big \Vert ^2_{L_2(D,\varrho _D)}&\le 441 \max \left\{ \sigma _m^2, \frac{r \log n }{n} \sum _{j=m}^\infty \sigma _j^2, \frac{{\text {tr}}(K^0)}{n} \right\} . \end{aligned}$$

\(\square \)
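For experimentation, the construction of this section can be mimicked in a few lines: draw the nodes from \(\varrho _m({\mathbf {x}})\,d\varrho _D({\mathbf {x}})\), multiply samples and basis functions by \(1/\sqrt{\varrho _m({\mathbf {x}}^i)}\), and solve a least squares problem in \({\mathrm{span}}\{\eta _1,\ldots ,\eta _{m-1}\}\). The sketch below is only schematic: the precise operator \(\widetilde{S}^m_{{\mathbf {X}}}\) is fixed in Algorithm 1 (not restated here), and the trigonometric basis, the decay \(\sigma _k = k^{-1}\), the truncation, and the simplified two-term density (possible since \(K^0 = 0\) in this toy model) are assumptions made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy separable model on D = [0,1]:  eta_k(x) = exp(2*pi*i*k*x),  e_k = sigma_k * eta_k.
N, m = 200, 16
k = np.arange(1, N + 1)
sigma = k ** (-1.0)

def eta(x):                      # (len(x), N) matrix of eta_k(x^i)
    return np.exp(2j * np.pi * np.outer(x, k))

def rho_m(x):
    """Simplified density in the spirit of (7.1): here K^0 = 0, so only two terms remain."""
    E = np.abs(eta(x)) ** 2                       # |eta_j(x)|^2 = 1 for this basis
    head = E[:, :m - 1].sum(axis=1) / (m - 1)
    tail = (sigma[m - 1:] ** 2 * E[:, m - 1:]).sum(axis=1) / np.sum(sigma[m - 1:] ** 2)
    return 0.5 * (head + tail)

# Target f with ||f||_{H(K)} <= 1, sampled at nodes drawn from rho_m * d(varrho_D).
fhat = rng.standard_normal(N)
fhat /= np.linalg.norm(fhat)
f = lambda x: eta(x) @ (sigma * fhat)

n = 2000
x = rng.random(n)                # rho_m is identically 1 for this basis, so uniform sampling suffices
w = 1.0 / np.sqrt(rho_m(x))      # weights 1/sqrt(rho_m(x^i))
L = w[:, None] * eta(x)[:, :m - 1]               # weighted design matrix
rhs = w * f(x)
coef = np.linalg.lstsq(L, rhs, rcond=None)[0]    # least squares fit in span{eta_1,...,eta_{m-1}}

# Exact squared L2 recovery error (the eta_k are orthonormal in L_2[0,1]).
err_sq = np.sum(np.abs(coef - sigma[:m - 1] * fhat[:m - 1]) ** 2) \
         + np.sum(np.abs(sigma[m - 1:] * fhat[m - 1:]) ** 2)
print("squared L2 recovery error:", err_sq, " vs sigma_m^2 =", sigma[m - 1] ** 2)
```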

Concerning a counterpart of the \(L_2\)-norm discretization result for general RKHS with finite trace, we can prove the following.

Theorem 7.1

Let \(H(K)\) be a RKHS and \(\varrho _D\) be a measure on \(D \subset \mathbb {R}^d\). Then the following holds.

  1. (i)

    If \(\Vert K\Vert _{\infty }\,{:}{=}\, \sup _{{\mathbf {x}}\in D} \sqrt{K({\mathbf {x}}, {\mathbf {x}})} < \infty \), \(\varrho _D\) denotes a probability measure on D, and \({\mathbf {x}}^i\), \(i=1,\ldots ,n\), are drawn i.i.d. according to \(\varrho _D\), then

    $$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)}\le 1}\left| \int _D |f({\mathbf {x}})|^2 d \varrho _D({\mathbf {x}}) - \frac{1}{n} \sum _{i = 1}^n |f({\mathbf {x}}^i)|^2 \right| \le 8\sqrt{r\frac{\log (n)}{n}} \Vert K\Vert _{\infty }^2 \end{aligned}$$

    holds with probability at least \(1 - 2 n^{1-r}\) if n is large enough,  i.e. \(\frac{n}{\log n} \ge \frac{21 r \Vert K\Vert _{\infty }^2}{\Vert \mathrm {Id}\Vert _{K,2}^2}\).

  2. (ii)

    If \({\text {tr}}(K) = \int _D K({\mathbf {x}},{\mathbf {x}}) d\varrho _D({\mathbf {x}}) < \infty \) then

    $$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)}\le 1}\left| \int _D |f({\mathbf {x}})|^2 d \varrho _D({\mathbf {x}}) - \frac{1}{n} \sum _{i = 1}^n \frac{|f({\mathbf {x}}^i)|^2}{\nu ({\mathbf {x}}^i)} \right| \le 8 {\text {tr}}(K)\sqrt{r\frac{\log (n)}{n}} \end{aligned}$$

    holds with probability at least \(1 - 2 n^{1-r}\) provided \(\frac{n}{\log n} \ge \frac{21 r {\text {tr}}(K)}{\Vert \mathrm {Id}\Vert _{K,2}^2}\), where \(\nu ({\mathbf {x}}) = \frac{K({\mathbf {x}}, {\mathbf {x}})}{{\text {tr}}(K)}\) and \({\mathbf {x}}^i\), \(i=1,\ldots ,n\), are drawn i.i.d. according to \(\nu ({\mathbf {x}})d\varrho _D({\mathbf {x}})\).

Proof

Since we may have \({\text {tr}}_0(K) > 0\), the decomposition \(K({\mathbf {x}}, {\mathbf {y}}) = K^0({\mathbf {x}}, {\mathbf {y}}) + K^1({\mathbf {x}}, {\mathbf {y}})\) leads to a “non-trivial” kernel \(K^0({\mathbf {x}}, {\mathbf {y}})\). We estimate in case (i):

$$\begin{aligned}&\sup _{\Vert f\Vert _{H(K)} \le 1} \left| \int _D |f({\mathbf {x}})|^2 d \varrho _D({\mathbf {x}}) - \frac{1}{n} \sum _{i = 1}^n |f({\mathbf {x}}^i)|^2 \right| \nonumber \\&\quad = \sup _{\Vert f\Vert _{H(K)} \le 1} \left| \int _D |f_0({\mathbf {x}}) + f_1({\mathbf {x}})|^2 d \varrho _D({\mathbf {x}}) - \frac{1}{n} \sum _{i = 1}^n |f_0({\mathbf {x}}^i) + f_1({\mathbf {x}}^i)|^2 \right| \nonumber \\&\quad = \sup _{\Vert f\Vert _{H(K)} \le 1} \left| \int _D |f_1({\mathbf {x}})|^2d \varrho _D({\mathbf {x}}) - \frac{1}{n} \sum _{i = 1}^n |f_0({\mathbf {x}}^i) |^2 - \frac{2}{n} \sum _{i = 1}^n {\text {Re}}\big (f_0({\mathbf {x}}^i) \overline{f_1({\mathbf {x}}^i)}\big ) - \frac{1}{n} \sum _{i = 1}^n |f_1({\mathbf {x}}^i)|^2\right| \nonumber \\&\quad \le \sup _{\Vert f\Vert _{H(K^1)} \le 1} \left| \int _D |f_1({\mathbf {x}})|^2 d \varrho _D({\mathbf {x}}) - \frac{1}{n} \sum _{i = 1}^n |f_1({\mathbf {x}}^i) |^2\right| \end{aligned}$$
(7.11)
$$\begin{aligned}&\qquad + \sup _{\Vert f\Vert _{H(K^0)} \le 1} \frac{1}{n} \sum _{i = 1}^n |f_0({\mathbf {x}}^i) |^2 \end{aligned}$$
(7.12)
$$\begin{aligned}&\qquad + \sup _{\Vert f\Vert _{H(K)} \le 1} \frac{2}{n} \left( \sum _{i = 1}^n |f_0({\mathbf {x}}^i) |^2\right) ^{\frac{1}{2}} \left( \sum _{i = 1}^n |f_1({\mathbf {x}}^i) |^2\right) ^{\frac{1}{2}} \end{aligned}$$
(7.13)

and note that (7.11) \(\le \sqrt{21 \Vert \mathrm {Id}\Vert ^2 M^2 r \frac{\log {n}}{n}}\) by Theorem 6.1 with probability at least \(1-2n^{1-r}\), where \(M\,{:}{=}\,\Vert K\Vert _{\infty }\). To estimate (7.12) we use the same reasoning leading to (7.5) and get

$$\begin{aligned} \sup _{\Vert f\Vert _{H(K^0)} \le 1} \frac{1}{n} \sum _{i = 1}^n |f_0({\mathbf {x}}^i) |^2 \le \frac{1}{n} M^2, \end{aligned}$$
(7.14)

where we used \(K^0({\mathbf {x}},{\mathbf {x}}) \le K({\mathbf {x}},{\mathbf {x}}) \le M^2\). We also use (7.14) in order to estimate (7.13). It holds

$$\begin{aligned} \sup _{\Vert f\Vert _{H(K)} \le 1} \frac{2}{n} \left( \sum _{i = 1}^n |f_0({\mathbf {x}}^i) |^2\right) ^{\frac{1}{2}} \left( \sum _{i = 1}^n |f_1({\mathbf {x}}^i) |^2\right) ^{\frac{1}{2}}&\le \frac{2}{\sqrt{n}} M \left( \frac{1}{n} \sum _{i = 1}^n |f_1({\mathbf {x}}^i) |^2\right) ^{\frac{1}{2}} \\&\le \frac{2 M^2}{\sqrt{n}}. \end{aligned}$$

In total we estimate

$$\begin{aligned} (7.11) + (7.12) + (7.13)&\le \sqrt{21 r} M^2 \sqrt{\frac{\log (n)}{n}} + \frac{M^2}{n} + \frac{2 M^2}{\sqrt{n}} \\&\le 8 M^2 \sqrt{\frac{r\log (n)}{n}}. \end{aligned}$$

To prove (ii) we use the same technique as in the proof of Theorem 6.3: replacing \(\frac{1}{n} \sum _{i = 1}^n |f({\mathbf {x}}^i)|^2\) by \(\frac{1}{n} \sum _{i = 1}^n \frac{|f({\mathbf {x}}^i)|^2}{\nu ({\mathbf {x}}^i)}\), where \(\nu ({\mathbf {x}}) = \frac{K({\mathbf {x}}, {\mathbf {x}})}{{\text {tr}}(K)}\), and \(M^2 \) by \({\text {tr}}(K)\), we can reduce everything to case (i). \(\square \)