Abstract
In this paper we study \(L_2\)-norm sampling discretization and sampling recovery of complex-valued functions in RKHS on \(D \subset \mathbb {R}^d\) based on random function samples. We only assume the finite trace of the kernel (Hilbert–Schmidt embedding into \(L_2\)) and provide several concrete estimates with precise constants for the corresponding worst-case errors. In general, our analysis does not need any additional assumptions and also covers non-Mercer kernels as well as non-separable RKHS. The failure probability is controlled and decays polynomially in n, the number of samples. Under the mild additional assumption of separability we observe improved rates of convergence related to the decay of the singular values. Our main tool is a spectral norm concentration inequality for infinite complex random matrices with independent rows, complementing earlier results by Rudelson, Mendelson, Pajor, Oliveira and Rauhut.
1 Introduction
This paper can be seen as a continuation of [11, 13]. We study the reconstruction of complex-valued multivariate functions on a domain \(D \subset \mathbb {R}^d\) from values at the (randomly sampled) nodes \({\mathbf {X}}\,{:}{=}\,({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n) \in D^n\) via weighted least squares algorithms. In addition, we are interested in the sampling discretization of the squared \(L_2\)-norm of such functions using n random nodes. Both problems recently gained substantial interest, see [11, 13, 14, 25–27, 29], and are strongly related as we know from Wasilkowski [30] and the recent systematic studies by Temlyakov [26] and Gröchenig [9] on \(L_p\)-norm discretization. Our main interest lies in accurate estimates for worst-case errors depending on the number n of nodes. In this paper, the functions are modeled as elements of some reproducing kernel Hilbert space H(K), which is supposed to be compactly embedded into \(L_2(D,\varrho _D)\). Its kernel is a positive definite Hermitian function \(K:D\times D\rightarrow \mathbb {C}\). In the papers [11, 13, 18] the authors mainly restrict to the case of separable RKHS [11, 18] or Mercer kernels on compact domains [13] with finite trace property to study the quantity
for some recovery operator \(S^m_{{\mathbf {X}}}\). It computes a best least squares fit \(S_{{\mathbf {X}}}^mf\) to the given data
from the finite-dimensional space spanned by the first \(m-1\) singular vectors \(\eta _1(\cdot ),\ldots ,\eta _{m-1}(\cdot )\) of the embedding
We complement existing results by a refined analysis based on spectral norm concentration of infinite matrices which, on the one hand, improves the constants and the bounds for the failure probability. On the other hand, the question remained whether the bounds on (1.1) may be extended to the most general situation where only the finite trace condition is assumed. This setting is not covered by the above mentioned references. In this paper we construct a new (weighted) least squares algorithm for this general situation, which was first addressed by Wasilkowski and Woźniakowski in [31]. Surprisingly, we were able to improve on the bound in [31, Thm. 1] by obtaining the worst-case bound \(o(\sqrt{\log n / n})\) in case of square summable singular values \((\sigma _k)_k\) (finite trace) of the embedding. It seems that, in general, their decay influences the bounds rather weakly (in contrast to the results in [11, 13, 18]).
In addition to the general sampling recovery problem we study the discretization of \(L_2\)-integral norms in reproducing kernel Hilbert spaces H(K) where only random information is used. To be more precise, we provide bounds for the following \(L_2\)-worst-case discretization errors
This quantity controls the simultaneous discretization of the squared \(L_2(D,\varrho _D)\)-norms of all functions from H(K). For finite-dimensional spaces we speak of Marcinkiewicz–Zygmund inequalities, a classical topic which also gained a lot of interest in recent years, see Temlyakov [26] and the references therein. Let us emphasize that both problems (sampling recovery and discretization) are strongly related. It has been shown by Wasilkowski [30] that the recovery of the norm \(\Vert \cdot \Vert \) of a function from a function class is as difficult as the recovery of the function in that norm using linear information. In other words, if we have a good sampling recovery operator Sf in \(L_2(D,\varrho _D)\) we may construct an equally good recovery for the norm of f by simply taking \(\Vert Sf\Vert _{L_2(D,\varrho _D)}\) as approximant. This, however, is a simple consequence of the triangle inequality. Wasilkowski shows even more, namely that optimal information for the recovery problem is nearly optimal for the “norm-recovery” problem. However, let us emphasize that we recover the square of the norm in (1.3) (rather than the norm itself). It has been observed by Temlyakov in [25] that this indeed makes a difference if we assume a certain algebra property for point-wise multiplication, namely \(\Vert fg\Vert _{H(K)} \le c\Vert f\Vert _{H(K)}\cdot \Vert g\Vert _{H(K)}\), which is for instance present for mixed Sobolev spaces with smoothness \(s>1/2\). Taking into account that in this framework optimal quadrature behaves asymptotically better than sampling recovery (the improvement happens in the \(\log \)), see [8, Chap. 5, 9] and the references therein, we see that Wasilkowski’s result does not hold true for this slightly modified framework. In fact, the worst-case error (1.3) may behave much better than the corresponding optimal sampling recovery error.
In contrast to that we use random information here, i.e., nodes which are randomly drawn according to the natural probability measure \(\varrho _D\) or some related measure, and aim for results holding with high probability. As stated below we obtain a worse asymptotic error behavior for the classical discretization operator in (1.3) compared to the (non-squared) sampling recovery error in (1.1). However, we are able to control the dependence on the parameters and the failure probability rather explicitly, as (1.4), (1.6), Theorems 1.2 and 1.3 show.
Major parts of the analysis in this paper are based on the following concentration inequality for sums of complex self-adjoint (infinite) random matrices.
Theorem 1.1
(Section 3) Let \({\mathbf {y}}^i\), \(i=1,\ldots ,n\), be i.i.d. random sequences from the complex \(\ell _2\). Let further \(n \ge 3\) and \(M>0\) be such that \(\Vert {\mathbf {y}}^i \Vert _2 \le M\) almost surely for all \(i=1,\ldots ,n\). We further put \({\Lambda } \,{:}{=}\, \mathbb {E}\,{\mathbf {y}}^i \otimes {\mathbf {y}}^i\) and assume that \(\Vert {\Lambda }\Vert _{2\rightarrow 2} \le 1\). Then we have for \(0<t\le 1\)
Finite-dimensional results of this type are given by Rudelson [23], Tropp [28], Oliveira [20], Rauhut [22] and others. Mendelson and Pajor [16] were the first who addressed the infinite-dimensional case of real matrices as well, see Remark 3.1. The technique used has been introduced by Buchholz [2, 3] and further developed by Rauhut for the purpose of analyzing RIP matrices based on complex bounded orthonormal systems (see [22] and the references therein). It is based on an operator version of the non-commutative Khintchine inequality [2, 3] together with Talagrand’s symmetrization technique.
As a direct consequence of Theorem 1.1 we obtain for separable H(K) and a probability measure \(\varrho _D\) always
if the kernel is bounded, i.e., \(\Vert K\Vert _{\infty } \,{:}{=}\, \sup _{{\mathbf {x}}\in D} \sqrt{K({\mathbf {x}},{\mathbf {x}})}<\infty \) (uniform boundedness). This condition is equivalent to the fact that the embedding of H(K) into \(\ell _\infty \) is continuous and has norm at most some finite number M (commonly called M-boundedness). The measure \(\varrho _D\) is supposed to be a probability measure and \({\mathbf {x}}^i,\, i = 1, \dots , n\) are drawn independently at random according to \(\varrho _D\). Note that this problem is related to classical uniform bounds on the “defect function” in learning theory with respect to M-bounded function classes, see, e.g., [4, 6]. There, bounds for (1.3) are usually given in terms of covering (or entropy) numbers of the unit ball of H(K) in \(\ell _\infty \), see [4, 12]. Here we consider situations where we neither have such information nor an embedding into \(\ell _\infty \). Choosing t appropriately (see Theorem 6.1), the worst-case discretization error may be bounded as \({\mathcal {O}}(\sqrt{(\log n)/n})\) with high probability. To get rid of the uniform boundedness condition on the function class we may work with the weaker finite trace condition
and prove a similar error bound for a slightly modified discretization operator when sampling the nodes \({\mathbf {x}}^i\) independently according to the modified measure \(\nu ({\mathbf {x}})d\varrho _D({\mathbf {x}})\) with \(\nu ({\mathbf {x}}) \,{:}{=}\, K({\mathbf {x}},{\mathbf {x}})/{\text {tr}}(K)\). One only has to replace \(\Vert K\Vert _{\infty }^2\) by \({\text {tr}}(K)\) in the right-hand side of (1.4). In other words, we have
with probability exceeding \(1-2n^{1-r}\) for large enough n, see Theorem 6.3. This means that the success probability tends to 1 rather quickly as the number of samples increases.
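To make this reweighted discretization concrete, here is a minimal Python sketch for a toy finite-rank kernel on \([0,1]\) (the system \(\eta _k\), the singular numbers \(\sigma _k\) and the test function are our own illustrative choices, not from the paper): sampling the nodes from \(\nu ({\mathbf {x}})d\varrho _D({\mathbf {x}})\) with \(\nu ({\mathbf {x}}) = K({\mathbf {x}},{\mathbf {x}})/{\text {tr}}(K)\) and reweighting the summands by \(1/\nu \) yields an unbiased estimator of the squared \(L_2\)-norm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy RKHS on D = [0, 1] with uniform rho_D (illustrative choices, not from the
# paper): eta_k(x) = sqrt(2) sin(k pi x) is orthonormal in L2(0, 1) and we take
# singular numbers sigma_k = 1/k, so K(x, x) = sum_k sigma_k^2 |eta_k(x)|^2.
K_TERMS = 50
sigma = 1.0 / np.arange(1, K_TERMS + 1)
ks = np.arange(1, K_TERMS + 1)

def K_diag(x):
    return (sigma[:, None] ** 2 * 2.0 * np.sin(np.pi * np.outer(ks, x)) ** 2).sum(axis=0)

tr_K = np.sum(sigma ** 2)                     # tr(K) = sum_k sigma_k^2

def f(x):                                     # f = e_1 + 0.5 e_3 with e_k = sigma_k eta_k
    return sigma[0] * np.sqrt(2) * np.sin(np.pi * x) \
        + 0.5 * sigma[2] * np.sqrt(2) * np.sin(3 * np.pi * x)

true_sq_norm = sigma[0] ** 2 + 0.25 * sigma[2] ** 2   # ||f||_{L2}^2

def sample_nu(n):
    # Rejection sampling from nu(x) = K(x, x)/tr(K); here K(x, x) <= 2 tr(K).
    out = np.empty(0)
    while out.size < n:
        cand = rng.random(n)
        keep = rng.random(n) * 2.0 * tr_K <= K_diag(cand)
        out = np.concatenate([out, cand[keep]])
    return out[:n]

n = 100_000
x = sample_nu(n)
# Reweighted estimator: (1/n) sum_i |f(x_i)|^2 / nu(x_i)  ->  ||f||_{L2}^2.
estimate = np.mean(f(x) ** 2 * tr_K / K_diag(x))
print(estimate, true_sq_norm)
```

The two printed values agree up to the Monte Carlo fluctuation, which the theory above controls uniformly over the unit ball of H(K).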
As for the sampling recovery problem we start with a result in the most general situation. A modification of the recovery operator \(\widetilde{S}_{{\mathbf {X}}}^m\) from [11, 13], see Algorithm 1 below, has been used to study the situation which was left as an open problem in [11]. The result reads as follows.
Theorem 1.2
(Section 7) Let H(K) be a reproducing kernel Hilbert space on a subset \(D \subset \mathbb {R}^d\) with a positive definite Hermitian kernel \(K(\cdot ,\cdot )\) such that the finite trace property (1.5) holds true. Let \(r > 1\) and \(m,n \in \mathbb {N},\) \(n \ge 3,\) where \(m\) is chosen according to (1.7). Drawing \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n)\) at random according to the product measure \((\varrho _m({\mathbf {x}})d\varrho _D({\mathbf {x}}))^n\) with the density defined in (7.1), we have
with probability at least \( 1 - \eta n^{1-r}\) where \(\eta = 2^\frac{3}{4} +1\). \(\widetilde{S}_{{\mathbf {X}}}^m\) is the least squares operator from Algorithm 1 together with (7.1) and \({\text {tr}}_0(K)\) is defined in (4.6).
In fact, we recover all \(f \in H(K)\) from sampled values at \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n)\) simultaneously with probability larger than \(1-3n^{1-r}\) by only assuming that the kernel \(K(\cdot , \cdot )\) has finite trace (1.5). Note that this result improves on a result by Wasilkowski and Woźniakowski [31], where only the finite trace is required, see also Novak and Woźniakowski [19, Thm. 26.10]. The authors proved (roughly speaking) a rate of \(n^{-1/(2+p)}\) for the worst-case error with respect to standard information if the sequence of singular numbers is p-summable for \(p \le 2\). We refer to Sect. 7 for further explanation.
In order to define the recovery operator \(\widetilde{S}^m_{{\mathbf {X}}}\) and the sampling density \(\varrho _m({\mathbf {x}})\) we need to incorporate spectral properties of the embedding (1.2), namely also the left and right singular functions \((e_k)_k \subset H(K)\) and \((\eta _k)_k \subset L_2(D,\varrho _D)\) ordered according to their importance (size of the corresponding singular number). Both systems are orthonormal in the respective spaces related by \(e_k = \sigma _k \eta _k\).
The above result can be improved essentially if we assume that H(K) is separable. This is for instance the case if K is a Mercer kernel, i.e., continuous on a compact domain D. However, assuming only separability of H(K) also includes the situation of continuous kernels on unbounded domains D, even \(D=\mathbb {R}^d\). The following result already improves on the results given in [11, 13] in several directions. The theorem works under less restrictive assumptions, the constants are improved and, last but not least, the failure probability decays polynomially in n. We would like to point out that, while preparing this manuscript, Ullrich [29] proved a version of the next theorem with stronger requirements and different constants based on Oliveira’s concentration result (see Remark 3.9). The following theorem is a reformulation of Theorem 5.2 in Sect. 5.
Theorem 1.3
(Section 5) Let K be a positive definite Hermitian kernel such that H(K) is separable and the finite trace condition (1.5) holds true. With the notation from above we have for \(n\in \mathbb {N}\) and
the bound
where \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n)\) is sampled according to the product measure \((\varrho _m({\mathbf {x}})d\varrho _D({\mathbf {x}}))^n\) (see (5.3)) and the operator \(\widetilde{S}_{{\mathbf {X}}}^m\) is defined in Algorithm 1.
We would like to emphasize that the operator \(\widetilde{S}_{{\mathbf {X}}}^m\) uses \(n \asymp m\log m\) samples of its argument. Based on this result it has been recently shown by the second named author (and coauthors, see [18]) that there exists a sampling operator \(\widetilde{S}^m_J\) using only \({\mathcal {O}}(m)\) samples yielding the bound
with universal and specified constants \(C,c>0\). However, for this improvement one has to sacrifice the high success probability.
Notation As usual \(\mathbb {N}\) denotes the natural numbers, \(\mathbb {N}_0\,{:}{=}\,\mathbb {N}\cup \{0\}\), \(\mathbb {Z}\) denotes the integers, \(\mathbb {R}\) the real numbers, \(\mathbb {R}_+\) the non-negative real numbers and \(\mathbb {C}\) the complex numbers. If not indicated otherwise \(\log (\cdot )\) denotes the natural logarithm of its argument. \(\mathbb {C}^n\) denotes the complex n-space, whereas \(\mathbb {C}^{m\times n}\) denotes the set of all \(m\times n\)-matrices \({\mathbf {L}}\) with complex entries. Vectors and matrices are usually typeset boldface with \({\mathbf {x}},{\mathbf {y}}\in \mathbb {C}^n\). The matrix \({\mathbf {L}}^{*}\) denotes the adjoint matrix. The spectral norm of matrices \({\mathbf {L}}\) is denoted by \(\Vert {\mathbf {L}}\Vert \) or \(\Vert {\mathbf {L}}\Vert _{2\rightarrow 2}\). For a complex (column) vector \({\mathbf {y}}\in \mathbb {C}^n\) (or \(\ell _2\)) we will often use the tensor notation for the matrix
For \(0<p\le \infty \) and \({\mathbf {x}}\in \mathbb {C}^n\) we denote \(\Vert {\mathbf {x}}\Vert _p \,{:}{=}\, (\sum _{i=1}^n |x_i|^p)^{1/p}\) with the usual modification in the case \(p=\infty \) or \({\mathbf {x}}\) being an infinite sequence. Operator norms for \(T:H(K) \rightarrow L_2\) will be denoted with \(\Vert T\Vert _{K,2}\). As usual we will denote with \(\mathbb {E}X\) the expectation of a random variable X on a probability space \((\Omega , \mathcal {A}, \mathbb {P})\). Given a measurable subset \(D\subset \mathbb {R}^d\) and a measure \(\varrho \) we denote with \(L_2(D,\varrho )\) the space of all square integrable complex-valued functions (equivalence classes) on D with \(\int _D |f({\mathbf {x}})|^2\, d\varrho ({\mathbf {x}})<\infty \). We will often use \(\Omega = D^n\) as probability space with the product measure \(\mathbb {P}= d\varrho ^n\) if \(\varrho \) is a probability measure itself. We sometimes use the notation \(f = {\mathcal {O}}(g)\) for positive functions f, g, which means that there is a constant \(c>0\) with \(f(t) \le cg(t)\). In addition we say \(f = o(g)\) if \(f(t)/g(t) \rightarrow 0\) as \(t \rightarrow \infty \).
2 Concentration results for sums of random matrices
Let us begin with concentration inequalities for the spectral norm of sums of complex rank-1 matrices. Such matrices appear as \({\mathbf {L}}^*{\mathbf {L}}\) when studying least squares solutions of over-determined linear systems
where \({\mathbf {L}}\in \mathbb {C}^{n \times m}\) is a matrix with \(n>m\), \({\mathbf {f}} \in \mathbb {C}^n\) and \({\mathbf {c}} \in \mathbb {C}^{m-1}\). It is well-known that the above system may not have a solution. However, we can ask for the vector \({\mathbf {c}}\) which minimizes the residual \(\Vert {\mathbf {f}}-{\mathbf {L}}\cdot {\mathbf {c}}\Vert _2\). Multiplying the system with \({\mathbf {L}}^*\) gives
which is called the system of normal equations. If \({\mathbf {L}}\) has full rank then the unique solution of the least squares problem is given by
For function recovery and discretization problems we will use the following matrix
for \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n) \in D^n\) of distinct sampling nodes and a system of functions \((\eta _k)_{k=1}^{m-1}\). We put \({\mathbf {y}}^i \,{:}{=}\, (\eta _1({\mathbf {x}}^i),\ldots ,\eta _{m-1}({\mathbf {x}}^i))^\top , i=1,\ldots ,n\). The coefficients \(c_k\), \(k=1,\ldots ,m-1\), of the approximant
are computed via least squares, see Algorithm 1. Note that the mapping \(f \mapsto S^m_{{\mathbf {X}}}f\) is linear for a fixed set of sampling nodes \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n) \in D^n.\)
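To see the pieces fit together, here is a minimal end-to-end sketch of such an (unweighted) least squares approximant, assuming purely for illustration the trigonometric system \(\eta _k(x) = e^{2\pi \mathrm {i} kx}\) on \(D=[0,1]\) with uniform \(\varrho _D\) and a toy target function of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative setup (our own choices): eta_k(x) = exp(2 pi i k x), k = 1..m-1.
m, n = 9, 400
ks = np.arange(1, m)

def f(x):
    # Smooth target with decaying coefficients 1/k^2, plus a small component
    # outside span(eta_1, ..., eta_{m-1}) that the projection cannot capture.
    return sum(np.exp(2j * np.pi * k * x) / k ** 2 for k in ks) \
        + 0.001 * np.exp(2j * np.pi * 40 * x)

x = rng.random(n)                             # X drawn i.i.d. from rho_D
L = np.exp(2j * np.pi * np.outer(x, ks))      # the matrix L_m of samples eta_k(x^i)
c = np.linalg.lstsq(L, f(x), rcond=None)[0]   # coefficients of S^m_X f

# L2 error on a fine grid: essentially the truncation error of f.
grid = np.linspace(0, 1, 4096, endpoint=False)
approx = np.exp(2j * np.pi * np.outer(grid, ks)) @ c
err = np.sqrt(np.mean(np.abs(approx - f(grid)) ** 2))
print(err)
```

The error is dominated by the content of f outside the span of the first \(m-1\) functions, which is exactly the behavior the worst-case bounds below quantify.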
We start with a concentration inequality for the spectral norm of a matrix of type (2.1). It turns out that in certain situations the complex matrix \({\mathbf {L}}_m\,{:}{=}\,{\mathbf {L}}_m({\mathbf {X}})\in \mathbb {C}^{n\times (m-1)}\) has full rank with high probability, where \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n)\) is drawn at random from \(D^n\) according to a product measure \(\mathbb {P}= d\varrho ^n\). We will find that the eigenvalues of
are bounded away from zero with high probability if m is small enough compared to n and the functions \(\eta _k(\cdot )\) denote an orthonormal system with respect to the measure \(\varrho \) from which the nodes in \({\mathbf {X}}\) are sampled. Let us define the corresponding spectral function
From [28, Theorem 1.1] we get the following result.
Theorem 2.1
(Matrix Chernoff) For a finite sequence \((\mathbf {A}_k)\) of independent, Hermitian, positive semi-definite random matrices with dimension m satisfying \(\lambda _{\max }(\mathbf {A}_k) \le R\) almost surely it holds
for \(t \in [0,1]\) where \( \mu _{\min }: = \lambda _{\min }\left( \sum _{k=1}^n \mathbb {E}\mathbf {A}_k \right) \) and \( \mu _{\max }: = \lambda _{\max }\left( \sum _{k=1}^n \mathbb {E}\mathbf {A}_k \right) \).
Theorem 2.2
Let \(n,m \in \mathbb {N}\), \(m\ge 2\), and let \(\{\eta _1(\cdot ), \eta _2(\cdot ), \eta _3(\cdot ),\dots ,\eta _{m-1}(\cdot )\}\) be an orthonormal system in \(L_2(D,\varrho )\). Let \({\mathbf {H}}_m\) be given as above. If \({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n\in D\) are drawn i.i.d. at random according to \(\varrho \), then we have for \(0<t <1\) that
as well as
where \(c_t \,{:}{=}\,(1-t)^{1-t}e^t\) and \(d_t \,{:}{=}\,(1+t)^{1+t}e^{-t}\).
Proof
We apply Theorem 2.1. To this end we define \(\mathbf {A}_i = \frac{1}{n}\, {\mathbf {y}}^i \otimes {\mathbf {y}}^i\). One easily sees that the matrices \(\mathbf {A}_i\) are positive semi-definite and that \(\lambda _{\min }\Big (\sum _{i=1}^n \mathbb {E}\mathbf {A}_i \Big ) =1\). We have that
Plugging this into Theorem 2.1 yields
Theorem 2.3
For \(n \ge m\) and \(r > 1\) the matrix \({\mathbf {H}}_m\) has only eigenvalues greater than 1/2 with probability at least \(1 - n^{1-r}\) if
In particular, we have
Proof
Choosing \(t = 1/2\) and solving for N(m) in the above probability bound (using \(n^{1-r}\) on the right-hand side) gives the desired result. Indeed
This gives the following implications (read from bottom to top)
The bound in (2.6) is a consequence of [11, Proposition 3.1]. \(\square \)
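The conclusion of Theorem 2.3 is easy to observe numerically. A Monte Carlo sketch under toy assumptions (the system \(\eta _k(x) = e^{2\pi \mathrm {i} kx}\), for which the spectral function equals \(m-1\), and the ad hoc oversampling choice \(n=1000\) are our own):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup (our own choice): eta_k(x) = exp(2 pi i k x), k = 1..m-1, is
# orthonormal in L2([0,1], dx), hence E H_m = I.
m = 8
n = 1000          # comfortably many samples compared to (m - 1) log n

def H_matrix():
    x = rng.random(n)
    ks = np.arange(1, m)
    Y = np.exp(2j * np.pi * np.outer(x, ks))    # rows are the vectors y^i
    return (Y.conj().T @ Y) / n                 # H_m = (1/n) sum_i y^i (y^i)^*

# Smallest eigenvalue over repeated draws of the node set X.
lam_min = np.array([np.linalg.eigvalsh(H_matrix())[0] for _ in range(200)])
print((lam_min > 0.5).mean())
```

With this oversampling the smallest eigenvalue stays above 1/2 in essentially every trial, matching the high-probability statement of the theorem.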
From [11, Proposition 3.1] we also get an upper bound on \(\Vert ({\mathbf {L}}_m^{*}{\mathbf {L}}_m)^{-1}{\mathbf {L}}_m^{*}\Vert _{2\rightarrow 2}\) holding with high probability.
Corollary 2.4
Let \(\{\eta _1(\cdot ), \eta _2(\cdot ), \eta _3(\cdot ),\ldots \}\) be an orthonormal system in \(L_2(D,\varrho )\). Let further \(r>1\) and \(m,n \in \mathbb {N}\), \(m\ge 2\) such that
holds. Then the random matrix \({\mathbf {L}}_m\) from (2.1) satisfies
with probability at least \(1-n^{1-r},\) where the nodes are sampled i.i.d. according to \(\varrho \).
3 Norm concentration for infinite matrices
In this section we want to extend some of the results from Sect. 2 to the infinite dimensional framework. We provide a new concentration inequality derived from the non-commutative Khintchine inequality via a bootstrapping argument using a symmetrization result by Ledoux and Talagrand [15] for Rademacher sums of random operators \( \mathbf{B}_i = {\mathbf {y}}^i \otimes {\mathbf {y}}^i\), where \({\mathbf {y}}^i\) will denote a complex random infinite \(\ell _2 \)-sequence.
Definition 3.1
(Schatten-p-Norm) For a compact operator \(\mathbf{A}: H \rightarrow K\) between complex Hilbert spaces H and K and \(1 \le p \le \infty \) we define the Schatten-p-norm:
where \(\sigma (\mathbf {A})\) is the vector of singular values of \(\mathbf{A}\).
The quantity \(\Vert \cdot \Vert _{S_p}\) is a norm, see, e.g. [7].
Corollary 3.2
One easily sees that
and that for \(\mathbf {A}\) with rank at most \(r\) it holds that
for \(1\le p \le \infty \).
Definition 3.3
(Schatten-class) Let \(H, K \) be complex Hilbert spaces and \( 1 \le p < \infty \). The \(p\)-th Schatten-class is defined as
Theorem 3.4
(Non-commutative Khintchine inequality, [2, 3]) Let \(n \in \mathbb {N}\) and \(\mathbf{B}_i,\) \(i=1,\ldots ,m,\) denote operators from \( S_{2n}\). Let further \(\varepsilon _i\) denote independent Rademacher variables for \( i= 1 \dots m \). Then it holds
Corollary 3.5
(Rudelson’s lemma) Let \({\mathbf {y}}^i\), \(i= 1, \dots, m\), be sequences from the complex \( \ell _2\) and \(\varepsilon _i\) independent Rademacher variables. Then it holds for \( 2 \le p < \infty \) that
Proof
We utilize the non-commutative Khintchine inequality with \( \mathbf{B}_i \,{:}{=}\, {\mathbf {y}}^i \otimes {\mathbf {y}}^i\) which belong to every \(S_{2n}\) since they have (at most) rank 1. Applying literally the same arguments as in [22, Lemma 6.18] we obtain the result (see [17] for details).
We estimate tails of random variables by means of their moments. We will use a well-known relation, see e.g. [22, Proposition 6.5].
Proposition 3.6
(Moments and tails) Let \(X\) be a random variable that for all \(p \ge p_0\) satisfies
for some constants \(\alpha , \beta , \gamma , p_0 > 0\). Then
for all
From [15, Lemma 6.3] we get the following result.
Proposition 3.7
(Symmetrization, [15]) Let \(F : \mathbb {R}_+ \rightarrow \mathbb {R}_+ \) be convex. Let \({\mathbf {X}}_i,\) \(i=1,\ldots ,n,\) be independent random variables in a separable Banach space \((B, \Vert \cdot \Vert )\) such that \( \mathbb {E}F (\Vert {\mathbf {X}}_i\Vert ) < \infty \). Let further \(\varvec{\varepsilon } = (\varepsilon _i)_{i=1}^n\) be independent Rademacher variables which are also independent of the \({\mathbf {X}}_i\). Then it holds that
where \(D\) is a countable set of linear functionals with \(\Vert {\mathbf {x}}\Vert = \sup _{f \in D}\big |f({\mathbf {x}})\big |\) for all \({\mathbf {x}}\in B\).
Proposition 3.8
Let \({\mathbf {y}}^i\), \(i=1,\ldots ,n\), be i.i.d. random sequences from the complex \( \ell _2\). Let further \(n \ge 3,\) \(r > 1,\) \(M>0\) be such that \(\Vert {\mathbf {y}}^i \Vert _2 \le M\) almost surely and \( \mathbb {E}\,{\mathbf {y}}^i \otimes {\mathbf {y}}^i = {\Lambda }\) for all \(i=1,\ldots ,n\). Then
where \(F \,{:}{=}\, \max \Big \{ \frac{8r \log n}{n} M^2 \kappa ^2 , \Vert {\Lambda } \Vert _{2 \rightarrow 2} \Big \}\) and \( \kappa = \frac{1+\sqrt{5}}{2}\).
Proof
We use a method as in [22, Theorem 7.3]. For \(2 \le p < \infty \) we put
Since \(\sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i\) has rank (at most) \(n\) it is compact. The expectation matrix \({\Lambda }\) is a positive semidefinite operator with finite trace since \(\Vert {\mathbf {y}}^i \Vert _2 \le M\) for all \(i=1 \dots n\) almost surely. This means \( {\Lambda }\) is a trace class operator and therefore compact. Since \( \frac{1}{n} \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i - {\Lambda }\) is compact and the subspace of all compact operators \(K(\ell _2, \ell _2)\) from \(\ell _2\) to \( \ell _2\) is separable we can choose a countable set \(D\) from the dual space of \(K(\ell _2, \ell _2)\) as in Proposition 3.7 such that
Since \(f\) is a continuous linear functional we get
Proposition 3.7 applied to (3.1) with \(F(t) = t^p\) together with Rudelson’s lemma (Corollary 3.5) for \(2\le p < \infty \) in the form
yields
because of \(\Vert {\mathbf {y}}^i \Vert _2 \le M \) and Hölder’s inequality. Using the triangle inequality we then have
Setting \(D_{p,n,M} \,{:}{=}\, \frac{2}{\sqrt{n}} 2^{\frac{3}{4p}} M p^{\frac{1}{2}} n^{\frac{1}{p}} e^{-\frac{1}{2}}\) yields
where \(F \ge \Vert {\Lambda } \Vert _{2\rightarrow 2} \) will be chosen later. Solving this inequality gives
We now consider the random variable \( \min \Big \{F,\frac{1}{n} \Big \Vert \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i - {\Lambda } \Big \Vert _{2 \rightarrow 2}\Big \}\). Obviously
In the case of \( D_{p,n,M}^2 \le F\) we get
and otherwise
This yields
Using Proposition 3.6 we get that
for all \( u \ge \sqrt{2}\). Now we choose \(u = \sqrt{2r \log n}\) with \(r > 1\) and \(n \ge 3\). This gives
In case \( F \ge \frac{2}{\sqrt{n}} M \kappa \sqrt{F} u\) we can avoid the minimum on the left-hand side. It clearly holds
and hence
The latter is satisfied if
This yields
\(\square \)
Using this result we are now able to prove our main concentration inequality.
Proof of Theorem 1.1
Proof
Let us return to (3.2) in the above proof. Since \(\Vert {\Lambda }\Vert _{2\rightarrow 2 } \le 1\) we get as a consequence for \(0<t\le 1\)
with
We continue in the proof as above using \(\tilde{D}_{p,n,M}\) instead of \(D_{p,n,M}\) and without replacing u by \(\sqrt{2r\log n}\) in (3.3). With the same argumentation as above we get rid of the minimum and obtain this time
for \(F \ge \max \{4M^2u^2\kappa ^2/(nt),t\}\). The maximum is no longer necessary if we choose \(u^2\,{:}{=}\,t^2n/(4M^2\kappa ^2)\). Plugging this choice into (3.6) and noting that \(8\kappa ^2 \le 21\) gives the desired bound. \(\square \)
Remark 3.9
The first result of this type is due to Rudelson [23] for vectors from \(\mathbb {R}^n\). Complex versions were proved by Rauhut [22] and Oliveira [20, Lem. 1]. Note that the result stated by Oliveira (Lemma 1) contains a small inaccuracy in the probability bound. A corrected version has been stated in [11, Prop. 4.1]. In his paper Oliveira also comments on the infinite dimensional complex situation where \(m=\infty \) but does not give a full proof. The proof method is different from ours. Note also that in [16, Cor. 2.6] Mendelson and Pajor give a concentration result for the infinite dimensional case of real vectors. Let us finally mention that a version of our Theorem 1.3 (in the next section) under more restrictive assumptions has been recently proved by Ullrich [29] based on Oliveira’s concentration result.
4 Reproducing kernel Hilbert spaces
We will work in the framework of reproducing kernel Hilbert spaces. The relevant theoretical background can be found in [1, Chapt. 1] and [4, Chapt. 4]. The papers [10, 24] are also of particular relevance for the subject of this paper.
Let \(L_2(D,\varrho _D)\) be the space of complex-valued square-integrable functions with respect to \(\varrho _D\). Here \(D \subset \mathbb {R}^d\) is an arbitrary subset and \(\varrho _D\) a measure on D. We further consider a reproducing kernel Hilbert space H(K) with a Hermitian positive definite kernel \(K(\cdot ,\cdot )\) on \(D \times D\). The crucial property of reproducing kernel Hilbert spaces is the fact that Dirac functionals are continuous, or equivalently, the reproducing property holds
for all \({\mathbf {x}}\in D\). It ensures that point evaluations are continuous functionals on H(K). We will use the notation from [4, Chapt. 4]. In the framework of this paper, the finite trace of the kernel
or its boundedness
is assumed. The boundedness of K implies that H(K) is continuously embedded into \(\ell _\infty (D)\), i.e.,
With \(\ell _{\infty }(D)\) we denote the set of bounded functions on D and with \(\Vert \cdot \Vert _{\ell _{\infty }(D)}\) the supremum norm. Note that we do not need the measure \(\varrho _D\) for this embedding. In fact, here we mean “boundedness” in the strong sense (in contrast to essential boundedness w.r.t. the measure \(\varrho _D\)). The embedding operator
is Hilbert–Schmidt under the finite trace condition (4.1), see [10, 24, Lemma 2.3], which we always assume from now on. We additionally assume that H(K) is at least infinite dimensional. Let us denote the (at most) countable system of strictly positive eigenvalues \((\lambda _j)_{j\in \mathbb {N}}\) of \(W_{K,\varrho _D} = \mathrm {Id}_{K,\varrho _D}^{*} \circ \mathrm {Id}_{K,\varrho _D}\) arranged in non-increasing order, i.e.,
We will also need the left and right singular vectors \((e_k)_k \subset H(K)\) and \((\eta _k)_k \subset L_2(D,\varrho _D)\) which both represent orthonormal systems in the respective spaces related by \(e_k = \sigma _k \eta _k\) with \(\lambda _k = \sigma _k^2\) for \(k\in \mathbb {N}\). We would like to emphasize that the embedding (4.3) is not necessarily injective. In other words, for certain kernels there might also be a nontrivial nullspace of the embedding \(\mathrm {Id}\) in (4.4). Therefore, the system \((e_k)_k\) from above is not necessarily a basis in H(K). It would be a basis under additional restrictions, e.g., if the kernel \(K(\cdot ,\cdot )\) is continuous and bounded (Mercer kernel). Based on this observation we will decompose the kernel \(K(\cdot ,\cdot )\) as follows
By Bessel’s inequality we get that
It is shown in [10, 24, Lemma 2.3] that if \({\text {tr}}(K)<\infty \) and H(K) is separable then \({\text {tr}}_0(K) = 0\). As we will see below, it will make a big difference whether \({\text {tr}}_0(K)\) vanishes or not. The latter case can only occur if H(K) is non-separable. In other words, if \(H(K)\) is separable the function \(K^0({\mathbf {x}},{\mathbf {x}}) \,{:}{=}\, K({\mathbf {x}},{\mathbf {x}})- \sum _{k=1}^\infty |e_k({\mathbf {x}})|^2\) is zero almost everywhere with respect to the measure \(\varrho _D\). Let us finally define the two crucial quantities
and
The first one is often called “spectral function”, see [9] and the references therein.
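Both quantities are easy to evaluate for a toy spectrum. In the sketch below we assume, purely for illustration (our own choices, not from the paper), the system \(\eta _k(x) = e^{2\pi \mathrm {i} kx}\) on \([0,1]\) with \(\sigma _k = 1/k\), and that \(N(m)\) denotes the spectral function \(\sup _{{\mathbf {x}}}\sum _{k<m}|\eta _k({\mathbf {x}})|^2\) while \(T(m)\) denotes the tail \(\sup _{{\mathbf {x}}}\sum _{k\ge m}|e_k({\mathbf {x}})|^2\):

```python
import numpy as np

# Toy spectrum (our own choice): |eta_k(x)| = 1 for all x, sigma_k = 1/k,
# e_k = sigma_k eta_k.
m = 10
K_TAIL = 100_000
sigma = 1.0 / np.arange(1, K_TAIL + 1)

# Spectral function: sup_x sum_{k<m} |eta_k(x)|^2; here it equals m - 1 at every x.
x = np.linspace(0.0, 1.0, 2001)
ks = np.arange(1, m)
N_m = np.max(np.sum(np.abs(np.exp(2j * np.pi * np.outer(ks, x))) ** 2, axis=0))

# Tail: sup_x sum_{k>=m} |e_k(x)|^2 = sum_{k>=m} sigma_k^2
# (truncated at K_TAIL terms for the numerics).
T_m = np.sum(sigma[m - 1:] ** 2)
print(N_m, T_m)
```

For this spectrum \(N(m)\) grows linearly while \(T(m)\) decays like \(1/m\), which is the kind of interplay the conditions of the theorems below exploit.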
5 Sampling recovery guarantees for separable RKHS
In this section we deal with the case that H(K) is a separable Hilbert space on a subset \(D \subset \mathbb {R}^d\) which is compactly embedded into \(L_2(D,\varrho _D)\) for a given measure \(\varrho _D\). The first theorem gives a result in a more restrictive situation, namely that \(\varrho _D\) is a probability measure and the kernel is bounded. In the second theorem we sample with respect to the probability density function \(\varrho _m\) defined below in (5.3) and introduced by Krieg and Ullrich [13, 14]. We use the same proof strategy as in [11]. Here we do not apply Rudelson’s bound [23] on the expectation. We rather use the concentration inequality proved in Proposition 3.8. This leads to a polynomially decaying failure probability, see also [29].
Theorem 5.1
Let H(K) be a separable reproducing kernel Hilbert space on a set \(D \subset \mathbb {R}^d\) with a positive definite kernel \(K(\cdot ,\cdot )\) such that \(\sup _{{\mathbf {x}}\in D} K({\mathbf {x}},{\mathbf {x}}) <\infty \). We denote with \((\sigma _j)_{j\in \mathbb {N}}\) the non-increasing sequence of singular numbers of the embedding \(\mathrm {Id}:H(K) \rightarrow L_2(D,\varrho _D)\) for a probability measure \(\varrho _D\). Let further \(r>1\) and \(m,n \in \mathbb {N},\) \(n \ge 3\) where \(m\ge 2\) is chosen such that
holds. Drawing \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n)\) according to the product measure \(d\varrho _D^n,\) we have
where \(\eta = 2^\frac{3}{4} +1\), \( \kappa = \frac{1+\sqrt{5}}{2}\) and \(N(m), \, T(m)\) are defined in (4.7), (4.8) .
Proof
We define the events
where \(F\) appears in (3.4) and \({\mathbf {H}}_m\) in (2.3). The operator \({\varvec{\Phi }}_m\) is given by
with \({\mathbf {y}}^i = (e_m({\mathbf {x}}^i),e_{m+1}({\mathbf {x}}^i), \dots )^\top \) for all \( i=1,\dots ,n\). Hence we may choose F as
in this specific situation. It clearly holds
Using the union bound estimate we get
Theorem 2.3 implies
After noting
we infer from \({\varvec{\Phi }}_m^*{\varvec{\Phi }}_m = \sum _{i=1}^n \, {\mathbf {y}}^i \otimes {\mathbf {y}}^i\) and Proposition 3.8 that
In total we have
with \(\eta \,{:}{=}\, 2^{3/4}+1\). According to the proof of [11, Theorem 5.5] we need the function \(T(\cdot ,{\mathbf {x}}) \,{:}{=}\, K(\cdot ,{\mathbf {x}})-\sum _{j=1}^{\infty }e_j(\cdot )\overline{e_j({\mathbf {x}})}\), which is an element of H(K). Its norm is given by
Note that the function in (5.2) is zero almost everywhere because of the fact that we have an equality sign in (4.6) due to our assumptions (separability of H(K) and finite trace). Hence, \(\mathbb {P}(C) = 1\) with
Let now \({\mathbf {X}}\in A\cap B\cap C\). Then, for any \(f\in H(K)\) with \(\Vert f\Vert _{H(K)}\le 1\), we can argue similarly as in [11, Theorem 5.5]:
This yields
and therefore
\(\square \)
In the sequel we consider a more general situation. The measure from which the points are sampled will be adapted to the spectral properties of the embedding. This allows us to specify the bound above in terms of the singular numbers of the embedding and to benefit from their decay. Let us recall the density function from [13], which we adapt to our framework as follows
We know from (4.6) that the sequence of singular numbers is square summable. We use the modified density function \(\varrho _m(\cdot ):D \rightarrow \mathbb {R}\) which has been introduced in [11] as a version of the one from [13]. As above, the family \((e_j(\cdot ))_{j\in \mathbb {N}}\) denotes the eigenvectors corresponding to the non-vanishing eigenvalues of the compact self-adjoint operator \(W_{\varrho _D}\,{:}{=}\,\mathrm {Id}^*\circ \mathrm {Id}: H(K) \rightarrow H(K)\), the sequence \((\lambda _j)_{j\in \mathbb {N}}\) denotes the ordered eigenvalues, and finally \(\eta _j\,{:}{=}\,\lambda _j^{-1/2} e_j\).
Clearly, as a consequence of (4.5) the function \(\varrho _m\) is positive and defined point-wise for any \({\mathbf {x}}\in D\). Moreover, it can be computed precisely from the knowledge of \(K({\mathbf {x}},{\mathbf {x}})\) and the first \(m-1\) eigenvalues and corresponding eigenvectors. It clearly holds that \(\int _D \varrho _m({\mathbf {x}})d\varrho _D({\mathbf {x}}) = 1\). Here is one of the main results of this paper (note that Theorem 1.3 from the introduction is a simple reformulation of the theorem below).
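The concrete formula (5.3) for \(\varrho _m\) is not repeated here. The sketch below only illustrates the sampling mechanism, namely how to draw i.i.d. nodes from a computable density with respect to \(\varrho _D\) by rejection sampling; the density used is a hypothetical stand-in, not \(\varrho _m\) itself.

```python
import numpy as np

rng = np.random.default_rng(1)

# Generic rejection sampler: draws from a density rho(x) with respect to a
# uniform rho_D on [0, 1), given a pointwise bound rho_max >= rho(x).
# (Assumed generic implementation, not the paper's algorithm.)
def sample_from_density(rho, rho_max, n):
    out = []
    while len(out) < n:
        x = rng.random()                  # proposal from rho_D = uniform[0, 1)
        if rng.random() * rho_max <= rho(x):
            out.append(x)                 # accept with probability rho(x)/rho_max
    return np.array(out)

# hypothetical stand-in density (integrates to 1 over [0, 1), bounded by 1.5)
rho = lambda x: 1 + 0.5 * np.sin(2 * np.pi * x)
xs = sample_from_density(rho, rho_max=1.5, n=20000)
# sanity check: E[1/rho(x)] under rho(x) d rho_D(x) equals 1
print(abs(np.mean(1 / rho(xs)) - 1) < 0.02)
```

Any density that can be evaluated pointwise, such as \(\varrho _m\) computed from \(K({\mathbf {x}},{\mathbf {x}})\) and the first \(m-1\) eigenpairs, can be sampled this way.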
Theorem 5.2
Let H(K) be a separable reproducing kernel Hilbert space of complex-valued functions defined on D such that
for some measure \(\varrho _D\) on D, where \((\sigma _k)_k\) denotes the non-increasing sequence of singular numbers of the embedding \(\mathrm {Id}:H(K) \rightarrow L_2(D,\varrho _D)\). Then we have for \(n\in \mathbb {N}\) and
the bound
where \({\mathbf {X}}\) is sampled i.i.d. according to \(\varrho _m({\mathbf {x}})d\varrho _D({\mathbf {x}})\) in (5.3) above and \(\eta = 2^\frac{3}{4} +1,\) \( \kappa = \frac{1+\sqrt{5}}{2}\).
Proof
This result is a consequence of Theorem 5.1 above, applied to the newly constructed probability measure \(\varrho _m({\mathbf {x}})d\varrho _D({\mathbf {x}})\) together with
We observe
\(\widetilde{N}(m) \le 2(m-1)\), and \(\widetilde{T}(m) \le 2\sum _{j=m}^{\infty }\sigma _j^2\), see [11, Thms. 5.5, 5.8]. Applying Theorem 5.1 leads to the stated bound. \(\square \)
6 Sampling discretization of the \(L_2\)-norm
Motivated by supervised learning theory, see e.g. [6], one is interested in uniform bounds for the following version of the “defect function”
with respect to f belonging to some hypothesis space \(\mathcal {H}\) which is usually embedded into \(\mathcal {C}(D)\), the continuous functions on the domain D. From a more classical perspective, authors were interested in discretizing \(L_p\)-norms using Marcinkiewicz–Zygmund inequalities. This subject has been recently studied systematically by Temlyakov [25], see also Gröchenig [9]. The following theorem will be an immediate implication of our concentration result in Theorem 1.1.
Theorem 6.1
Let \(\varrho _D\) denote a probability measure on the measurable subset \(D \subset \mathbb {R}^d\) and H(K) be a separable reproducing kernel Hilbert space with kernel \(K(\cdot ,\cdot )\) such that
(equivalently, the unit ball in H(K) is uniformly bounded in \(\ell _\infty \)). Then we have for \(0<t\le 1\)
If we fix \(r>1\) the above bound can be reformulated as
with probability exceeding \(1-2n^{1-r}\) and n sufficiently large, namely
The nodes \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n)\) are randomly drawn according to the product measure \((d \varrho _D({\mathbf {x}}))^n\).
Proof
Fix f with \(\Vert f\Vert _{H(K)} \le 1\) and put \(M\,{:}{=}\,\Vert K\Vert _{\infty }\). Due to the \(L_2\)-identity
we find
with \(\hat{{\mathbf {f}}} = (\langle f, e_k \rangle _{H(K)})_{k\in \mathbb {N}}\) and \({\varvec{\Lambda }} = \mathrm {diag}(\sigma ^2_1,\sigma ^2_2,\ldots )\). Note that \(\Vert {\varvec{\Lambda }}\Vert _{2\rightarrow 2} = \Vert \mathrm {Id}\Vert _{K,2}^2\). Furthermore, putting
we find
holds almost surely since \({\text {tr}}_0(K) = 0\) due to the separability of H(K), see the paragraph after (4.6). In fact, the identity fails on a nullset \(A \subset D^n\), which is independent of f. This follows by using the same arguments as after (5.2). Hence,
with probability exceeding \(1-\exp (-t^2n/(21\tilde{M}^2))\) by Theorem 1.1. Here \(\tilde{M}^2 = M^2/\Vert {\varvec{\Lambda }}\Vert _{2\rightarrow 2}\). Hence, we may choose \(t = \sqrt{21\tilde{M}^2 r \log (n)/n} \le 1\) to finally get
\(\square \)
Remark 6.2
-
(i)
The uniform boundedness (in \(\ell _\infty \)) of a function class is sometimes also called M-boundedness in the learning theory literature. It represents a common assumption there to analyze the defect function. In fact, uniform bounds on the defect function are proved by using the concept of covering numbers of the unit ball of H(K) in \(\ell _{\infty }(D)\), see [12, 25]. In the above theorem covering number estimates were not used at all.
-
(ii)
The quantity \(\Vert K\Vert _{\infty }\) may be replaced by \(\Vert K\Vert _{L_{\infty }(D,\varrho _D)}\), i.e., the essential supremum with respect to the probability measure \(\varrho _D\). Since \(\Vert K\Vert _{L_{\infty }(D,\varrho _D)}\) might be smaller than \(\Vert K\Vert _{\infty }\), we obtain a slight improvement.
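The mechanism behind Theorem 6.1 is that the defect is a Hermitian quadratic form in the coefficient sequence \(\hat{{\mathbf {f}}}\), so the supremum over the unit ball of H(K) equals a spectral norm. A minimal numerical sketch, under an assumed trigonometric toy model with \(\sigma _k^2 = k^{-2}\) (our choice of example):

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy model: e_k(x) = exp(2*pi*i*k*x) on [0, 1), rho_D uniform,
# lambda_k = k^(-2), truncated at N terms. Writing f in an H(K)-orthonormal
# basis, the discretization defect is the quadratic form of
# Lambda - (1/n) sum_i z_i z_i^* with z_i = (sqrt(lambda_k) e_k(x_i))_k,
# so the worst case over the unit ball is its spectral norm.
N, n = 50, 2000
lam = 1.0 / np.arange(1, N + 1) ** 2
x = rng.random(n)

k = np.arange(1, N + 1)
Z = np.sqrt(lam) * np.exp(2j * np.pi * np.outer(x, k))  # rows are the z_i

A = np.diag(lam) - (Z.conj().T @ Z) / n
worst_case = np.linalg.norm(A, 2)        # sup over the H(K) unit ball
# compare against the shape of the reformulated bound (here with r = 1,
# sum(lam) playing the role of the sup of K(x, x))
print(worst_case < 8 * np.sqrt(np.log(n) / n) * np.sum(lam))
```

The observed worst case is typically far below the bound; the theorem guarantees the \(\sqrt{\log (n)/n}\) behavior with high probability.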
Now we want to get rid of the uniform boundedness assumption on the class and only assume the finite trace. We have to change the sampling measure and modify the norm discretization operator by incorporating weights. The corresponding theorem reads as follows.
Theorem 6.3
Let \(\varrho _D\) denote an arbitrary measure on the measurable subset \(D \subset \mathbb {R}^d\) and H(K) be a separable reproducing kernel Hilbert space with (Hermitian) kernel \(K(\cdot ,\cdot )\) such that
We define the probability density function \(\nu ({\mathbf {x}}) \,{:}{=}\, \frac{K({\mathbf {x}}, {\mathbf {x}})}{{\text {tr}}(K)}\)
and sample \({\mathbf {X}}= ({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^n)\) from the product measure \((\nu ({\mathbf {x}})d\varrho _D({\mathbf {x}}))^n\). Then we have
If we fix \(r>1\) then this result can be reformulated as
with probability exceeding \(1-2n^{1-r}\) and n sufficiently large, namely
Proof
We want to apply Theorem 6.1. Let us define the normalized kernel
Then \(\Vert \tilde{K}\Vert _{\infty } = {\text {tr}}(K)\) and
It remains to note that
and we may apply Theorem 6.1. \(\square \)
7 Non-separable RKHS
Now we deal with a more general situation and drop the separability assumption for H(K). We only assume the finite trace property (1.5). For this purpose we define the new density function
and the operator \(\widetilde{S}^m_{{\mathbf {X}}}\) from Algorithm 1. Clearly, it holds \(\int \varrho _m({\mathbf {x}}) d\varrho _D({\mathbf {x}}) = 1\). Theorem 1.2 provides the bound
with an absolute constant \(C>0\) and \(m \,{:}{=}\, \lfloor n/(14 r\log n)\rfloor \). Note that this result improves on a result by Wasilkowski and Woźniakowski [31], see also Novak and Woźniakowski [19, Thm. 26.10]. The authors of [31] constructed a recovery operator using n samples whose squared worst-case error is not greater than
If we, for instance, assume that \(\sum _{k=1}^{\infty } \sigma _k^p < \infty \) for some \(0<p\le 2\), then we may balance \(\ell \asymp n^{p/(p+2)}\) to obtain a rate of \({\mathcal {O}}(n^{-1/(p+2)})\). In Theorem 1.2, see (7.2), we obtain a rate of \(o(\sqrt{(\log n)/n})\) already for \(p=2\). In case \(p<2\) we obtain \({{\mathcal {O}}}(n^{-1/2})\). It seems that, in general, the decay properties of the singular values have a rather weak influence on the recovery bound compared to the separable case, where the rate is much better than \({\mathcal {O}}(n^{-1/2})\). A lower bound showing that we cannot essentially improve on \({\mathcal {O}}(n^{-1/2})\) with the above algorithm will be provided in [17].
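The comparison of the two rates is simple arithmetic; for \(p=2\) one can check directly that \(\sqrt{(\log n)/n}\) beats the balanced rate \(n^{-1/(p+2)} = n^{-1/4}\) for moderate n already:

```python
import numpy as np

# For p = 2: compare the rate sqrt(log(n)/n) from Theorem 1.2 with the
# balanced rate n^(-1/(p+2)) = n^(-1/4). Pure arithmetic, no paper data.
n = 10**6
rate_thm = np.sqrt(np.log(n) / n)   # approx 3.7e-3
rate_balanced = n ** (-0.25)        # approx 3.2e-2
print(rate_thm < rate_balanced)
```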
Proof of Theorem 1.2
Proof
Step 1. Let us assume that \(M\,{:}{=}\,\Vert K\Vert _{\infty } = \sup _{{\mathbf {x}}\in D}\sqrt{K({\mathbf {x}},{\mathbf {x}})} < \infty \). By the spectral theorem we can decompose
where N is the nullspace of the embedding. Let us now define
and
Therefore, \(K^0({\mathbf {x}},{\mathbf {y}}) \) is the reproducing kernel of the nullspace \(N(\mathrm {Id})\) and \( \sup _{{\mathbf {x}}\in D} \sqrt{K^0({\mathbf {x}}, {\mathbf {x}})} =: M_0 \le M < \infty \). We estimate
The second summand can be treated with Theorem 5.1. The space \(H(K^1)\) is separable and \(K^1({\mathbf {x}},{\mathbf {x}})\) is bounded. Theorem 5.1 gives
with probability at least \( 1 - \eta n^{1-r} \) whenever (5.1) holds. The number \(\kappa \) is the same as in Proposition 3.8.
Note that all the functions in \(H({K}^0)\) are zero in \(L_2(D,\varrho _D)\) since this space is the nullspace of the embedding. Hence
holds for the same \({\mathbf {X}}= ({\mathbf {x}}^1, \dots , {\mathbf {x}}^n) \) for which (7.4) holds. We only need the operator norm of \({\mathbf {H}}_m\) (see (2.3)) to be larger than \(\frac{1}{2}\), which comes from Theorem 2.1. At this point we employ the “Representer Theorem” from learning theory, see for instance [4, Theorem 5.5]. We claim that
where \(\big ({K}^0({\mathbf {x}}^i, {\mathbf {x}}^j)\big )_{i,j = 1}^n\) is the kernel taken at the points from \({\mathbf {X}}\) and \( {g}({\mathbf {x}}) = \sum _{i =1}^n \alpha _i K^0({\mathbf {x}},{\mathbf {x}}^i)\).
In other words, we can reduce the problem to the finite dimensional space
\( {\mathrm{span}}\big \{ K^0(\cdot , {\mathbf {x}}^1),\dots , K^0(\cdot , {\mathbf {x}}^n) \big \}.\) Note that
The reason is that \(g \in H(K^0)\) can be decomposed into \(g = g_1 + g_2\) with \(g_1 \perp g_2\) and \(g_1 = Pg \), the orthogonal projection onto \({\mathrm{span}}\big \{ K^0(\cdot , {\mathbf {x}}^1), \dots ,K^0(\cdot , {\mathbf {x}}^n) \big \}\). Due to \(g_1 \perp g_2\), we have that \(\langle g_2, K^0(\cdot , {\mathbf {x}}^i)\rangle = 0 = g_2({\mathbf {x}}^i)\) for all i. Hence \(\sum _{i=1}^n |g({\mathbf {x}}^i)|^2 = \sum _{i=1}^n |g_1({\mathbf {x}}^i)|^2\) and \( \Vert g_1\Vert _{H({K}^0)}^2 \le 1 \). Therefore
Since \(H({K}^0)\) is the nullspace of the embedding we know that \({K}^0({\mathbf {x}}^i , {\mathbf {x}}^j)\) is zero almost surely for \(i \ne j\). We can therefore continue to estimate (7.6) by
since we have almost surely
This leads to
with probability exceeding \(1-\eta n^{1-r}\).
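The representer-theorem reduction used in Step 1 can be checked numerically: for \(g = \sum _i \alpha _i K(\cdot ,{\mathbf {x}}^i)\) one has \(\sum _j |g({\mathbf {x}}^j)|^2 = \Vert G\alpha \Vert ^2\) and \(\Vert g\Vert _{H(K)}^2 = \alpha ^* G \alpha \) with the Gram matrix \(G = (K({\mathbf {x}}^i,{\mathbf {x}}^j))_{i,j}\), so the supremum of \(\sum _j |g({\mathbf {x}}^j)|^2\) over the unit ball equals \(\lambda _{\max }(G)\). A sketch with an assumed Gaussian kernel (our example, not the paper's \(K^0\)):

```python
import numpy as np

rng = np.random.default_rng(3)

# Gram matrix of an assumed Gaussian kernel at the sample points
def gram(pts, gamma=2.0):
    d2 = (pts[:, None] - pts[None, :]) ** 2
    return np.exp(-gamma * d2)

pts = rng.random(8)
G = gram(pts)
lam_max = np.linalg.eigvalsh(G)[-1]     # largest eigenvalue of the Gram matrix

# random g in span{K(., x^1), ..., K(., x^n)} normalized to ||g||_H = 1,
# i.e. alpha with alpha^T G alpha = 1
alpha = rng.standard_normal(8)
alpha /= np.sqrt(alpha @ G @ alpha)
sample_sum = np.sum((G @ alpha) ** 2)   # sum_j |g(x^j)|^2
print(sample_sum <= lam_max + 1e-12)    # bounded by lambda_max(G)
```

The substitution \(\beta = G^{1/2}\alpha \) turns the constrained supremum into \(\sup _{\Vert \beta \Vert \le 1}\beta ^* G \beta = \lambda _{\max }(G)\), which is the finite-dimensional reduction exploited in the proof.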
Step 2. We define the measure \(d\mu _m({\mathbf {x}}) = \varrho _m({\mathbf {x}})d\varrho _D({\mathbf {x}}) \) as well as the kernel \(\widetilde{K}_m({\mathbf {x}},{\mathbf {y}})\) as in (5.6) and \(\widetilde{K}_m^0, \widetilde{K}_m^1\) accordingly. This gives
We apply the results from Step 1 to the right-hand side. Hence, we have to know the bound for \(\widetilde{K}^0_m\), \(\widetilde{N}(m)\) and \(\widetilde{T}(m)\), where the latter quantities are associated to \(\widetilde{K}^1_m\). We will now show that \(\widetilde{K}_m^0\) can be bounded by \(3{\text {tr}}(K^0)\). In fact,
Hence, we have \(\widetilde{M}_0^2 = 3{\text {tr}}(K^0)\). By the same arguments as in the proof of Theorem 5.2, see also [11, Thm. 5.7], we get \(\widetilde{N}(m) \le 3(m-1)\) and \(\widetilde{T}(m) \le 3\sum _{j=m}^\infty \sigma _j^2\). Plugging this into (7.8) gives
\(\square \)
Concerning a counterpart of the \(L_2\)-norm discretization result for general RKHS with finite trace, we can prove the following.
Theorem 7.1
Let \(H(K)\) be an RKHS and \(\varrho _D\) be a measure on \(D \subset \mathbb {R}^d\).
-
(i)
If \(\Vert K\Vert _{\infty }\,{:}{=}\, \sup _{{\mathbf {x}}\in D} \sqrt{K({\mathbf {x}}, {\mathbf {x}})} < \infty \), \(\varrho _D\) denotes a probability measure on D, and the \({\mathbf {x}}^i, i=1,\ldots ,n,\) are drawn i.i.d. according to \(\varrho _D\), then
$$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)}\le 1}\left| \int _D |f({\mathbf {x}})|^2 d \varrho _D({\mathbf {x}}) - \frac{1}{n} \sum _{i = 1}^n |f({\mathbf {x}}^i)|^2 \right| \le 8\sqrt{r\frac{\log (n)}{n}} \Vert K\Vert _{\infty }^2 \end{aligned}$$holds with probability at least \(1 - 2 n^{1-r}\) if n is large enough, i.e. \(\frac{n}{\log n} \ge \frac{21 r \Vert K\Vert _{\infty }^2}{\Vert \mathrm {Id}\Vert _{K,2}^2}\).
-
(ii)
If \({\text {tr}}(K) = \int _D K({\mathbf {x}},{\mathbf {x}}) d\varrho _D({\mathbf {x}}) < \infty \) then
$$\begin{aligned} \sup \limits _{\Vert f\Vert _{H(K)}\le 1}\left| \int _D |f({\mathbf {x}})|^2 d \varrho _D({\mathbf {x}}) - \frac{1}{n} \sum _{i = 1}^n \frac{|f({\mathbf {x}}^i)|^2}{\nu ({\mathbf {x}}^i)} \right| \le 8 {\text {tr}}(K)\sqrt{r\frac{\log (n)}{n}} \end{aligned}$$holds with probability at least \(1 - 2 n^{1-r}\) and \(\frac{n}{\log n} \ge \frac{21 r {\text {tr}}(K)}{\Vert \mathrm {Id}\Vert _{K,2}^2}\) where \(\nu ({\mathbf {x}}) = \frac{K({\mathbf {x}}, {\mathbf {x}})}{{\text {tr}}(K)}\) and \({\mathbf {x}}^i, i=1,\ldots ,n,\) are drawn i.i.d. according to \(\nu ({\mathbf {x}})d\varrho _D({\mathbf {x}})\).
Proof
Since we may have \(\text{ tr}_0(K) > 0\), the decomposition \(K({\mathbf {x}}, {\mathbf {y}}) = K^0({\mathbf {x}}, {\mathbf {y}}) + K^1({\mathbf {x}}, {\mathbf {y}})\) leads to a “non-trivial” kernel \(K^0({\mathbf {x}}, {\mathbf {y}})\). We estimate in case (i):
and note that (7.11) \(\le \sqrt{21 \Vert \mathrm {Id}\Vert ^2 M^2 r \frac{\log {n}}{n}}\) by Theorem 6.1 with probability at least \(1-2n^{1-r}\), where \(M\,{:}{=}\,\Vert K\Vert _{\infty }\). To estimate (7.12) we use the same reasoning leading to (7.5) and get
where we used \(K({\mathbf {x}},{\mathbf {x}}) \le M^2\). We also use (7.14) in order to estimate (7.13). It holds
In total we estimate
To prove (ii), we use the same technique as in Theorem 6.3: replacing \(\frac{1}{n} \sum _{i = 1}^n |f({\mathbf {x}}^i)|^2\) with \(\frac{1}{n} \sum _{i = 1}^n \frac{|f({\mathbf {x}}^i)|^2}{\nu ({\mathbf {x}}^i)}\), where \(\nu ({\mathbf {x}}) = \frac{K({\mathbf {x}}, {\mathbf {x}})}{{\text {tr}}(K)}\), and \(M^2 \) by \({\text {tr}}(K)\), we can reduce everything to case (i). \(\square \)
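Theorem 7.1(ii) can be illustrated with an assumed example: the Brownian-motion kernel \(K(x,y)=\min (x,y)\) on \([0,1]\) with uniform \(\varrho _D\) has \(K(x,x)=x\), \({\text {tr}}(K)=1/2\), and hence \(\nu (x)=2x\), which can be sampled by inverse-CDF sampling.

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed example: Brownian-motion kernel K(x, y) = min(x, y) on [0, 1],
# rho_D uniform. Then K(x, x) = x, tr(K) = 1/2, nu(x) = 2x, and inverse-CDF
# sampling from nu(x) dx gives x = sqrt(u) with u uniform on [0, 1).
n = 100000
x = np.sqrt(rng.random(n))
nu = 2 * x

f = lambda t: t                   # f(x) = x has ||f||_{H(K)} = 1 in this RKHS
est = np.mean(f(x) ** 2 / nu)     # weighted discretization of ||f||_{L2}^2
exact = 1.0 / 3.0                 # int_0^1 x^2 dx
# compare against the shape of the bound in (ii) with r = 1 and tr(K) = 1/2
print(abs(est - exact) < 8 * 0.5 * np.sqrt(np.log(n) / n))
```

The weight \(1/\nu ({\mathbf {x}}^i)\) makes the estimator unbiased under the tilted sampling measure, which is exactly the reweighting used in the proof above.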
References
Berlinet, A., Thomas-Agnan, C.: Reproducing kernel Hilbert spaces in probability and statistics. Kluwer Academic Publishers, Boston, MA, (2004). With a preface by Persi Diaconis
Buchholz, A.: Operator Khintchine inequality in non-commutative probability. Math. Ann. 319, 1–16 (2001)
Buchholz, A.: Optimal constants in Khintchine type inequalities for Fermions, Rademachers and \(q\)-Gaussian operators. Bulletin Polish Acad. Sci. Math. 53(3), 315–321 (2005)
Christmann, A., Steinwart, I.: Support Vector Machines. Springer (2008)
Cohen, A., Migliorati, G.: Optimal weighted least-squares methods. SMAI J. Comput. Math. 3, 181–203 (2017)
Cucker, F., Zhou, D.X.: Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, Cambridge (2007)
Dirksen, S.: Noncommutative and vector-valued Rosenthal inequalities. Dissertation, Delft Institute of Applied Mathematics, (2011)
Dũng, D., Temlyakov, V.N., Ullrich, T.: Hyperbolic Cross Approximation. Advanced Courses in Mathematics. CRM Barcelona, Birkhäuser/Springer (2019)
Gröchenig, K.: Sampling, Marcinkiewicz-Zygmund inequalities, approximation, and quadrature rules. J. Approx. Theory 257 (2020)
Hein, M., Bousquet, O.: Kernels, associated structures and generalizations. Technical Report 127, Max Planck Institute for Biological Cybernetics, Tübingen, Germany, (2004)
Kämmerer, L., Ullrich, T., Volkmer, T.: Worst-case recovery guarantees for least squares approximation using random samples. Constr. Approx., to appear, arXiv:1911.10111v3
Konyagin, S.V., Temlyakov, V.N.: The entropy in learning theory. Error estimates. Constr. Approx. 25(1), 1–27 (2007)
Krieg, D., Ullrich, M.: Function values are enough for \({L}_2\)-approximation. Found. Comp. Math., to appear, arXiv:math/1905.02516v3
Krieg, D., Ullrich, M.: Function values are enough for \({L}_2\)-approximation. Part II. J. Complexity, to appear, arXiv:2011.01779
Ledoux, M., Talagrand, M.: Probability in Banach Spaces. Springer (1991)
Mendelson, S., Pajor, A.: On singular values of matrices with independent rows. Bernoulli 12, 761–773 (2006)
Moeller, M.: Norm-concentration results for infinite random matrices with independent rows. Bachelor’s thesis, Faculty of Mathematics, TU Chemnitz (2021)
Nagel, N., Schäfer, M., Ullrich, T.: A new upper bound for sampling numbers. Found. Comp. Math., to appear, arXiv:2010.00327
Novak, E., Woźniakowski, H.: Tractability of multivariate problems. Volume III: Standard information for operators, volume 18 of EMS Tracts in Mathematics. European Mathematical Society (EMS), Zürich, (2012)
Oliveira, R.I.: Sums of random Hermitian matrices and an inequality by Rudelson. Electr. Comm. Probab. 15, 203–212 (2010)
Paige, C.C., Saunders, M.A.: LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Trans. Math. Software 8, 43–71 (1982)
Rauhut, H.: Compressive sensing and structured random matrices. In: Fornasier, M. (ed.) Theoretical Foundations and Numerical Methods for Sparse Recovery, volume 9 of Radon Series on Computational and Applied Mathematics. de Gruyter, Berlin (2010)
Rudelson, M.: Random vectors in the isotropic position. J. Funct. Anal. 164, 60–72 (1999)
Steinwart, I., Scovel, C.: Mercer's theorem on general domains: On the interaction between measures, kernels, and RKHSs. Constr. Approx. 35(3), 363–417 (2012)
Temlyakov, V.: Sampling discretization error of integral norms for function classes. J. Complex. 54 (2019)
Temlyakov, V.N.: The Marcinkiewicz-type discretization theorems. Constr. Approx. 48(2), 337–369 (2018)
Temlyakov, V.N.: On optimal recovery in \(L_2\). J. Complex. 65 (2021)
Tropp, J.: User-friendly tail bounds for sums of random matrices. Found. Comp. Math. 12(4), 389–434 (2011)
Ullrich, M.: On the worst-case error of least squares algorithms for \(L_2\)-approximation with high probability. J. Complex. 60 (2020)
Wasilkowski, G.W.: Some nonlinear problems are as easy as the approximation problem. Comput. Math. Appl. 10(4–5), 351–363 (1985)
Wasilkowski, G.W., Woźniakowski, H.: On the power of standard information for weighted approximation. Found. Comput. Math. 1(4), 417–434 (2001)
Acknowledgements
T.U. would like to acknowledge support by the DFG Ul-403/2-1. The authors would like to thank T. Kühn, V.N. Temlyakov, M. Ullrich, K. Pozharska and two anonymous referees for comments and further references.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Communicated by Deguang Han.
Moeller, M., Ullrich, T. \(L_2\)-norm sampling discretization and recovery of functions from RKHS with finite trace. Sampl. Theory Signal Process. Data Anal. 19, 13 (2021). https://doi.org/10.1007/s43670-021-00013-3
Keywords
- Spectral norm concentration
- Least squares approximation
- Random sampling
- Discretization
- Marcinkiewicz–Zygmund inequalities