Abstract
Methods for obtaining a function g in a relationship \(y=g(x)\) from observed samples of y and x are the building blocks for black-box estimation. The classical parametric approach discussed in the previous chapters uses a function model that depends on a finite-dimensional vector, e.g., a polynomial model. We have seen that an important issue is the choice of the model order. This chapter describes some regularization approaches that reconcile the flexibility of the model class with the well-posedness of the solution by exploiting an alternative paradigm to traditional parametric estimation. Instead of constraining the unknown function to a specific parametric structure, the function is searched over a possibly infinite-dimensional functional space. Overfitting and ill-posedness are circumvented by using reproducing kernel Hilbert spaces as hypothesis spaces and the related norms as regularizers. Such kernel-based approaches thus make it possible to cast all the regularized estimators based on quadratic penalties encountered in the previous chapters as special cases of a more general theory.
6.1 Preliminaries
Techniques for reconstructing a function g in a functional relationship \(y=g(x)\) from observed samples of y and x are the fundamental building blocks for black-box estimation. As already seen in Chap. 3 when treating linear regression, given a finite set of pairs \((x_i, y_i)\) the aim is to determine a function g having a good prediction capability, i.e., for a new pair (x, y) we would like the prediction g(x) close to y (e.g., in the MSE sense).
The classical parametric approach discussed in Chap. 3 uses a model \(g_{\theta }\) that depends on a finite-dimensional vector \(\theta \). A very simple example is a polynomial model, treated in Example 3.1, given, e.g., by \(g_{\theta }(x)=\theta _1+\theta _2 x +\theta _3x^2\) whose coefficients \(\theta _i\) can be estimated by fitting the data via least squares. In this parametric scenario, we have seen that an important issue is the model order choice. In fact, the least squares objective improves as the dimension of \(\theta \) increases, eventually leading to data interpolation. But overparametrized models, as a rule, perform poorly when used to predict future output data, even if benign overfitting may sometimes happen, as e.g., described in the context of deep networks [17, 55, 75]. Another drawback related to overparameterization is that the problem may become ill-posed in the sense of Hadamard, i.e., the solution may be non-unique, or ill-conditioned. This means that the estimate may be highly sensitive even to small perturbations of the outputs \(y_i\) as, e.g., illustrated in Fig. 1.3 of Sect. 1.2.
This chapter describes some regularization approaches that reconcile the flexibility of the model class with the well-posedness of the solution by exploiting an alternative paradigm to traditional parametric estimation. Instead of constraining the unknown function to a specific parametric structure, g will be searched over a possibly infinite-dimensional functional space. Overfitting and ill-posedness are circumvented by using reproducing kernel Hilbert spaces (RKHSs) as hypothesis spaces and the related norms as regularizers. Such norms generalize the quadratic penalties seen in Chap. 3. In this scenario, the estimator is completely defined by a positive definite kernel which has to encode the expected function properties, e.g., the smoothness level. Furthermore, we will see that, even when the model class is infinite dimensional, the function estimate turns out to be a finite linear combination of basis functions computable from the kernel. The estimator also enjoys strong asymptotic properties, making it possible (under reasonable assumptions on data generation) to approach the optimal predictor as the data set size grows to infinity.
The kernel-based approaches described in the following sections thus make it possible to cast all the regularized estimators based on quadratic penalties encountered in the previous chapters as special cases of a more general theory. In addition, RKHS theory paves the way to the development of other powerful techniques, e.g., for the estimation of an infinite number of impulse response coefficients (IIR model estimation), for continuous-time linear system identification and also for nonlinear system identification.
The reader not familiar with functional analysis will find in the first part of the appendix of this chapter a brief overview of the basic results used in the next sections, e.g., the concept of a linear and bounded functional, which is key to defining a RKHS.
6.2 Reproducing Kernel Hilbert Spaces
In what follows, we use \(\mathscr {X}\) to indicate domains of functions. In machine learning, this set is often referred to as the input space with its generic element \(x \in \mathscr {X}\) called input location. Sometimes, \(\mathscr {X}\) is assumed to be a compact metric space, e.g., one can think of \(\mathscr {X}\) as a closed and bounded set in the familiar space \(\mathbb {R}^m\) equipped with the Euclidean norm. In what follows, all the functions are real valued, so that \(f: \mathscr {X} \rightarrow \mathbb {R}\).
Reproducing kernel Hilbert spaces We now introduce a class of Hilbert spaces \(\mathscr {H}\) which play a fundamental role as hypothesis spaces for function estimation problems. Our goal is to estimate maps which permit to make predictions over the whole \(\mathscr {X}\). Thus, a basic requirement is to search for the predictor in a space containing functions which are well-defined pointwise for any \(x \in \mathscr {X}\). In particular, we assume that all the pointwise evaluators \(g \rightarrow g(x)\) are linear and bounded over \(\mathscr {H}\). This means that \(\forall x \in \mathscr {X}\) there exists \(C_x< \infty \) such that
$$\begin{aligned} |g(x)| \le C_x \Vert g \Vert _{\mathscr {H}} \quad \forall g \in \mathscr {H}. \end{aligned}$$(6.1)
The above condition is stronger than requiring \(g(x) < \infty \ \forall x\) since \(C_x\) can depend on x but not on g. This property already leads to the function spaces of interest. The following definitions are taken from [13].
Definition 6.1
(RKHS, based on [13]) A reproducing kernel Hilbert space (RKHS) over a non-empty set \(\mathscr {X}\) is a Hilbert space of functions \(g:\mathscr {X} \rightarrow \mathbb {R}\) such that (6.1) holds.
As suggested by the name itself, RKHSs are related to the concept of a positive definite kernel [13, 20], a particular function defined over \(\mathscr {X}\times \mathscr {X}\). In the literature it is also called a positive semidefinite kernel; hence, in what follows, positive definite kernel and positive semidefinite kernel will denote the same mathematical object. This is also specified in the next definition.
Definition 6.2
(Positive definite kernel, Mercer kernel and kernel section, based on [13]) Let \(\mathscr {X}\) denote a non-empty set. A symmetric function \(K:\mathscr {X}\times \mathscr {X} \rightarrow \mathbb {R}\) is called positive definite kernel or positive semidefinite kernel if, for any finite natural number p, it holds
$$\begin{aligned} \sum _{i=1}^p \sum _{j=1}^p a_i a_j K(x_i,x_j) \ge 0 \quad \forall (a_1,\ldots ,a_p) \in \mathbb {R}^p, \ \forall (x_1,\ldots ,x_p) \in \mathscr {X}^p. \end{aligned}$$
If strict inequality holds for any set of p distinct input locations \(x_k\) and any nonzero vector of coefficients, i.e.,
$$\begin{aligned} \sum _{i=1}^p \sum _{j=1}^p a_i a_j K(x_i,x_j) > 0 \quad \text{ if } \ (a_1,\ldots ,a_p) \ne 0, \end{aligned}$$
then the kernel is strictly positive definite.
If \(\mathscr {X}\) is a metric space and the positive definite kernel is also continuous, then K is said to be a Mercer kernel.
Finally, given a kernel K, the kernel section \(K_x\) centred at x is the function \(\mathscr {X} \rightarrow \mathbb {R}\) defined by
$$\begin{aligned} K_x(y) = K(x,y) \quad \forall y \in \mathscr {X}. \end{aligned}$$
Hence, in the sense given above, a positive definite kernel “contains” matrices which are all at least positive semidefinite.
We are now in a position to state a fundamental theorem from [13] here specialized to Mercer kernels which lead to RKHSs containing continuous functions (the proof is reported in Sect. 6.9.2).
Theorem 6.1
(RKHSs induced by Mercer kernels, based on [13]) Let \(\mathscr {X}\) be a compact metric space and let \(K:\mathscr {X}\times \mathscr {X} \rightarrow \mathbb {R}\) be a Mercer kernel. Then, there exists a unique (up to isometries) Hilbert space \(\mathscr {H}\) of functions, called RKHS associated to K, such that
1. all the kernel sections belong to \(\mathscr {H}\), i.e.,
$$\begin{aligned} K_x \in \mathscr {H} \quad \forall x \in \mathscr {X}; \end{aligned}$$(6.2)
2. the so-called reproducing property holds, i.e.,
$$\begin{aligned} \langle K_x, g \rangle _{\mathscr {H}} = g(x) \quad \forall (x, g) \in \left( \mathscr {X},\mathscr {H}\right) . \end{aligned}$$(6.3)
In addition, \(\mathscr {H}\) is contained in the space \(\mathscr {C}\) of continuous functions.
Remark 6.1
Note that the space \(\mathscr {H}\) characterized in Theorem 6.1 is indeed a RKHS according to Definition 6.1. In fact, for any input location x the kernel section \(K_x\) belongs to the space and, according to the reproducing property, represents the evaluation functional at x. Then, Theorem 6.27 (Riesz representation theorem), reported in the appendix to this chapter, permits the conclusion that all the pointwise evaluators over \(\mathscr {H}\) are linear and bounded.
While Theorem 6.1 establishes a link between Mercer kernels (which enjoy continuity properties) and RKHSs, it is possible also to state a one-to-one correspondence with the entire class of positive definite kernels (not necessarily continuous). In particular, the following result holds.
Theorem 6.2
(Moore–Aronszajn, based on [13]) Let \(\mathscr {X}\) be any non-empty set. Then, to every RKHS \(\mathscr {H}\) there corresponds a unique positive definite kernel K such that the reproducing property (6.3) holds. Conversely, given a positive definite kernel K, there exists a unique RKHS of real-valued functions defined over \(\mathscr {X}\) where (6.2) and (6.3) hold.
The proof can be quite easily obtained using Theorem 6.27 (Riesz representation theorem) and arguments similar to those contained in the proof of Theorem 6.1.
Further notes and RKHSs examples Thus, a RKHS \(\mathscr {H}\) can be defined just by specifying a kernel K, also called the reproducing kernel of \(\mathscr {H}\). In particular, any RKHS is generated by the kernel sections. More specifically, let \(S=\text{ span }( \{ K_x \}_{ x \in \mathscr {X} })\) and define the following norm in S
$$\begin{aligned} \Vert f \Vert ^2_{\mathscr {H}} = \sum _{i}\sum _{j} c_i c_j K(x_i,x_j), \end{aligned}$$(6.4)
where
$$\begin{aligned} f = \sum _{i} c_i K_{x_i} \end{aligned}$$
is any finite linear combination of kernel sections. Then, one has that \(\mathscr {H}\) is the closure of S w.r.t. the norm (6.4), i.e., \(\mathscr {H} = \overline{S}\).
Summarizing, one has
- all the kernel sections \(K_x(\cdot )\) belong to the RKHS \(\mathscr {H}\) induced by K;
- \(\mathscr {H}\) contains also all the finite linear combinations of kernel sections, along with some particular infinite sums, convergent w.r.t. the norm (6.4);
- every \(f \in \mathscr {H}\) is thus a linear combination of a possibly infinite number of kernel sections.
Assume for instance \(K(x_1,x_2) = \exp \left( - \Vert x_1-x_2\Vert ^2\right) \), which is the so-called Gaussian kernel. Then, all the functions in the corresponding RKHS are sums, or limits of sums, of functions proportional to Gaussians. As further elucidated later on, this means that every function of \(\mathscr {H}\) inherits properties such as smoothness and integrability of the kernel, e.g., we have seen in Theorem 6.1 that kernel continuity implies \(\mathscr {H} \subset \mathscr {C}\). This fact has an important consequence on modelling: instead of specifying a whole set of basis functions, it suffices to choose a single positive definite kernel that encodes the desired properties of the function to be synthesized.
Example 6.3
(Norm in a two-dimensional RKHS) We introduce a very simple RKHS to illustrate how the kernel K can be seen as a similarity function that establishes the norm (complexity) of a function by comparing function values at different input locations.
When \(\mathscr {X}\) has finite cardinality m, the functions are evaluated just on a finite number of input locations. Hence, each function f is in one-to-one correspondence with the m-dimensional vector
$$\begin{aligned} \left[ f(x_1) \ \ldots \ f(x_m) \right] ^T, \end{aligned}$$
where \(x_1,\ldots ,x_m\) are the elements of \(\mathscr {X}\).
In addition, any kernel is in one-to-one correspondence with one symmetric positive semidefinite matrix \(\mathbf {K} \in \mathbb {R}^{m \times m}\) with (i, j)-entry \(\mathbf {K}_{ij} = K(i,j)\). Finally, the kernel sections can be seen as the columns of \(\mathbf {K}\).
Assume, e.g., \(m=2\) with \(\mathscr {X}=\{1,2\}\). Then, the functions can be seen as two-dimensional vectors and any kernel K is in one-to-one correspondence with one symmetric positive semidefinite matrix \(\mathbf {K} \in \mathbb {R}^{2 \times 2}\). The RKHS \(\mathscr {H}\) associated to K is finite-dimensional being spanned just by the two kernel sections \(K_1(\cdot )\) and \(K_2(\cdot )\) which can be seen as the two columns of \(\mathbf {K}\). Hence, the functions f in \(\mathscr {H}\) are in one-to-one correspondence with the vectors
If \(\mathbf {K}\) is full rank, \(\mathscr {H}\) covers the whole \(\mathbb {R}^2\) and from (6.4) we have
$$\begin{aligned} \Vert f \Vert ^2_{\mathscr {H}} = f^T \mathbf {K}^{-1} f, \quad f = \left[ f(1) \ f(2) \right] ^T. \end{aligned}$$
For the sake of simplicity, assume also that \(\mathbf {K}_{11}=\mathbf {K}_{22}=1\) so that it must hold \(-1<\mathbf {K}_{12}<1\). Then, considering, e.g., the function \(f(i)=i\), one has
$$\begin{aligned} \Vert f \Vert ^2_{\mathscr {H}} = \frac{5-4\mathbf {K}_{12}}{1-\mathbf {K}_{12}^2}. \end{aligned}$$
Figure 6.1 displays \(\Vert f \Vert ^2_{\mathscr {H}}\) as a function of \(\mathbf {K}_{12}\). One can see that the norm diverges as \(|\mathbf {K}_{12}|\) approaches 1.
If, e.g., \(\mathbf {K}_{12}=1\) the kernel function becomes constant over \(\mathscr {X} \times \mathscr {X}\). Hence, the two kernel sections \(K_1(\cdot )\) and \(K_2(\cdot )\) coincide, being constant with \(K_1(i)=K_2(i)=1\) for \(i=1,2\). This means that \(\mathbf {K}_{12}=1\) induces a space \(\mathscr {H}\) containing only constant functions. This explains why the norm (complexity) of f becomes large if \(\mathbf {K}_{12}\) is close to 1: the space becomes less and less “tolerant” of functions with \(f(1)\ne f(2)\).
Letting now \(f(1)=1\) and \(f(2)=a\), the joint effect of \(\mathbf {K}_{12}\) and a is explained by the formula
$$\begin{aligned} \Vert f \Vert ^2_{\mathscr {H}} = \frac{1+a^2-2a\mathbf {K}_{12}}{1-\mathbf {K}_{12}^2}. \end{aligned}$$
Note that, thinking now of \(\mathbf {K}_{12}\) as fixed, the function with minimal RKHS norm (complexity) is obtained with \(a=\mathbf {K}_{12}\) and has a norm equal to one. \(\square \)
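The calculations in Example 6.3 are easy to reproduce numerically. The short Python sketch below (an illustration added here, not part of the original text) evaluates the squared norm \(f^T \mathbf {K}^{-1} f\) of the function \(f(i)=i\) for a few values of \(\mathbf {K}_{12}\), showing the divergence as \(|\mathbf {K}_{12}|\) approaches 1; all numerical values are purely illustrative.

```python
import numpy as np

def rkhs_norm_sq(f_vals, K):
    """Squared RKHS norm of the function with values f_vals, for a full-rank kernel matrix K."""
    return float(f_vals @ np.linalg.solve(K, f_vals))

f = np.array([1.0, 2.0])                # the function f(i) = i
for k12 in [0.0, 0.5, 0.9, 0.99]:
    K = np.array([[1.0, k12], [k12, 1.0]])
    # the closed-form expression (5 - 4 K12) / (1 - K12^2) agrees with f^T K^{-1} f
    print(k12, rkhs_norm_sq(f, K), (5 - 4 * k12) / (1 - k12 ** 2))
```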
Example 6.4
(\(\mathscr {L}_2^{\mu }\) and \(\ell _2\) ) Let \(\mathscr {X}=\mathbb {R}\) and consider the classical Lebesgue space of square summable functions with \(\mu \) equal to the Lebesgue measure. Recall that this is a Hilbert space whose elements are equivalence classes of functions measurable w.r.t. Lebesgue: any group of functions which differ only on a set of null measure (e.g., containing only a countable number of input locations) identifies the same vector. Hence, \(\mathscr {L}_2^{\mu }\) cannot be a RKHS since pointwise evaluation is not even well defined.
Let instead \(\mathscr {X}={\mathbb N}\) (the set of natural numbers) and define the identity kernel
$$\begin{aligned} K(i,j) = \delta _{ij}, \end{aligned}$$
where \(\delta _{ij}\) is the Kronecker delta. Clearly, K is symmetric and positive definite according to Definition 6.2 (it can be associated with an identity matrix of infinite size). Hence, it induces a unique RKHS \(\mathscr {H}\) that contains all the finite combinations of the kernel sections. In particular, any finite sum can be written as \(f(\cdot ) = \sum _{i=1}^{m} f_i K_{i}(\cdot )\), where some of the \(f_i\) may be null, and corresponds to a sequence with a finite number of nonzero components. To obtain the entire \(\mathscr {H}\), we need also to add all the limits of Cauchy sequences w.r.t. the norm (6.4) given by
$$\begin{aligned} \Vert f \Vert ^2_{\mathscr {H}} = \sum _{i=1}^{m} f_i^2, \end{aligned}$$
which coincides with the classical Euclidean norm of \([f_1 \ldots f_m]\). This allows us to conclude that the associated RKHS is the classical space \(\ell _2\) of square summable sequences.
As a final note, Definition 6.1 easily confirms that \(\ell _2\) is a RKHS. In fact, for every \(f=[f_1 \ f_2 \ \ldots ] \in \ell _2\) one has
$$\begin{aligned} |f_i| \le \sqrt{\sum _{j} f_j^2} = \Vert f \Vert _{\mathscr {H}} \quad \forall i \in {\mathbb N} \end{aligned}$$
and, recalling (6.1), this shows that all the evaluation functionals \(f \rightarrow f_i\) with \( i \in {\mathbb N}\) are bounded. \(\square \)
Example 6.5
(Sobolev space and the first-order spline kernel) While in the previous example we have seen that \(\mathscr {L}_2^{\mu }\) is not a RKHS, consider now the space obtained by integrating the functions in this space. In particular, let \(\mathscr {X}=[0,1]\), set \(\mu \) to the Lebesgue measure and consider
$$\begin{aligned} \mathscr {H} = \Big \{ f \ : \ f(x) = \int _0^x h(a) da, \ \ h \in \mathscr {L}_2^{\mu } \Big \}. \end{aligned}$$
One thus has that any f in \(\mathscr {H}\) satisfies \(f(0)=0\) and is absolutely continuous: its derivative \(h=\dot{f}\) is defined almost everywhere and is Lebesgue integrable.
With the inner product given by
$$\begin{aligned} \langle f, g \rangle _{\mathscr {H}} = \int _0^1 \dot{f}(a) \dot{g}(a) da, \end{aligned}$$
it is easy to see that \(\mathscr {H}\) is a Hilbert space. In fact, \(\mathscr {L}_2^{\mu }\) is Hilbert and we have established a one-to-one correspondence between functions in \(\mathscr {H}\) and \(\mathscr {L}_2^{\mu }\) which preserves the inner product. Such \(\mathscr {H}\) is an example of Sobolev space [2] since the complexity of a function is measured by the energy of its derivative:
$$\begin{aligned} \Vert f \Vert ^2_{\mathscr {H}} = \int _0^1 \dot{f}^2(a) da. \end{aligned}$$
Now, given \(x \in [0,1]\), let \(\chi _x(\cdot )\) be the indicator function of the set [0, x]. Then, one has
$$\begin{aligned} |f(x)| = \Big | \int _0^1 \chi _x(a) \dot{f}(a) da \Big | \le \sqrt{x} \ \Vert f \Vert _{\mathscr {H}}, \end{aligned}$$
where we have used the Cauchy–Schwarz inequality. Hence, \(\mathscr {H}\) is also a RKHS since all the evaluation functionals are bounded. We now prove that its reproducing kernel is the so-called first-order (linear) spline kernel given by
$$\begin{aligned} K(x,y) = \min (x,y). \end{aligned}$$(6.6)
In fact, every kernel section belongs to \(\mathscr {H}\), being piecewise linear with \(\dot{K}_x = \chi _x\). Furthermore, (6.6) satisfies the reproducing property since
$$\begin{aligned} \langle K_x, f \rangle _{\mathscr {H}} = \int _0^1 \chi _x(a) \dot{f}(a) da = \int _0^x \dot{f}(a) da = f(x). \end{aligned}$$
The linear spline kernel and some of its sections are displayed in the top panels of Fig. 6.2. \(\square \)
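The reproducing property of the first-order spline kernel can also be checked numerically. The following sketch (with an arbitrary test function satisfying \(f(0)=0\), chosen only for illustration) approximates \(\langle K_x, f \rangle _{\mathscr {H}}=\int _0^1 \dot{K}_x(u)\dot{f}(u)du\) by quadrature and compares it with \(f(x)\).

```python
import numpy as np

u = np.linspace(0.0, 1.0, 100001)
f = np.sin(2 * u)                       # a test function with f(0) = 0
df = 2 * np.cos(2 * u)                  # its derivative

for x in [0.2, 0.55, 0.9]:
    dKx = (u <= x).astype(float)        # derivative of the kernel section K_x (indicator of [0, x])
    inner = np.trapz(dKx * df, u)       # <K_x, f>_H computed by quadrature
    print(x, inner, np.sin(2 * x))      # matches f(x) up to quadrature error
```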
6.2.1 Reproducing Kernel Hilbert Spaces Induced by Operations on Kernels \(\star \)
We report some classical results about RKHSs induced by operations on kernels which can be derived from [13]. The first theorem characterizes the RKHS induced by the sum or product of two kernels.
Theorem 6.6
(RKHS induced by sum or product of two kernels, based on [13]) Let K and G be two positive definite kernels over the same domain \(\mathscr {X} \times \mathscr {X}\), associated to the RKHSs \(\mathscr {H}\) and \(\mathscr {G}\), respectively.
The sum \(K+G\), where
is the reproducing kernel of the RKHS \(\mathscr {R}\) containing functions
with
The product KG, where
is instead the reproducing kernel of the RKHS \(\mathscr {R}\) containing functions
with
The second theorem instead provides the connection between two RKHSs, with the second one obtained from the first one by sampling its kernel.
Theorem 6.7
(RKHS induced by kernel sampling, based on [13]) Let \(\mathscr {H}\) be the RKHS induced by the kernel \(K: \mathscr {X} \times \mathscr {X} \rightarrow {\mathbb R}\). Let \(\mathscr {Y} \subset \mathscr {X}\) and denote with \(\mathscr {R}\) the RKHS of functions over \(\mathscr {Y}\) induced by the restriction of the kernel K on \(\mathscr {Y} \times \mathscr {Y}\). Then, the functions in \(\mathscr {R}\) correspond to the functions in \(\mathscr {H}\) sampled on \(\mathscr {Y}\). One also has
where \(g_{\mathscr {Y}}\) is g sampled on \(\mathscr {Y}\).
The following theorem lists some operations which permit to build kernels (and hence RKHSs) from simple building blocks.
Theorem 6.8
(Building kernels from kernels, based on [13]) Let \(K_1\) and \(K_2\) two positive definite kernels over \(\mathscr {X} \times \mathscr {X}\) and \(K_3\) a positive definite kernel over \(\mathbb {R}^m \times \mathbb {R}^m\). Let also P an \(m \times m\) symmetric positive semidefinite matrix and \(\mathscr {P}(x)\) a polynomial with positive coefficients. Then, the following functions are positive definite kernels over \(\mathscr {X} \times \mathscr {X}\):
-
\(K(x,y)=K_1(x,y) + K_2(x,y)\) (see also Theorem 6.6).
-
\(K(x,y)=aK_1(x,y), \quad a \ge 0\).
-
\(K(x,y)=K_1(x,y)K_2(x,y)\) (see also Theorem 6.6).
-
\(K(x,y)=f(x)f(y), \quad f: \mathscr {X} \rightarrow \mathbb {R}\).
-
\(K(x,y)=K_3(f(x),f(y)), \quad f: \mathscr {X} \rightarrow \mathbb {R}^m\).
-
\(K(x,y)=x^T P y, \quad \mathscr {X}=\mathbb {R}^m\).
-
\(K(x,y)=\mathscr {P}(K_1(x,y))\).
-
\(K(x,y)=\exp (K_1(x,y))\).
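The closure properties listed in Theorem 6.8 can be illustrated numerically: Gram matrices built from composite kernels must remain positive semidefinite. The sketch below (with arbitrary input locations and base kernels chosen only for illustration) checks the smallest eigenvalue of a few composite Gram matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 2))                       # 30 arbitrary input locations in R^2

def gram(kernel, X):
    return np.array([[kernel(x, y) for y in X] for x in X])

k1 = lambda x, y: np.exp(-np.sum((x - y) ** 2))            # a radial kernel
k2 = lambda x, y: float(x @ y)                             # a linear kernel (P = I)

composites = {
    "sum":        lambda x, y: k1(x, y) + k2(x, y),
    "product":    lambda x, y: k1(x, y) * k2(x, y),
    "polynomial": lambda x, y: 1 + 2 * k2(x, y) + 3 * k2(x, y) ** 2,  # P(K) with positive coefficients
    "exp":        lambda x, y: np.exp(k2(x, y)),
}
for name, k in composites.items():
    print(name, np.linalg.eigvalsh(gram(k, X)).min())      # all >= 0 up to round-off
```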
6.3 Spectral Representations of Reproducing Kernel Hilbert Spaces
In the previous section we have seen that any RKHS is generated by its kernel sections. We now discuss another representation obtainable when the kernel can be diagonalized as follows
$$\begin{aligned} K(x,y) = \sum _{i \in \mathscr {I}} \zeta _i \rho _i(x) \rho _i(y), \end{aligned}$$(6.8)
where the set \(\mathscr {I}\) is countable. This will lead to new insights on the nature of the RKHSs, generalizing to the infinite-dimensional case the connection between regularization and basis expansion reported in Sect. 5.6.
A simple situation holds when the input space has finite cardinality, e.g., \(\mathscr {X}=\{x_1 \ldots x_m\}\). Under this assumption, any positive definite kernel is in one-to-one correspondence with the \(m \times m\) matrix \(\mathbf {K}\) whose (i, j)-entry is \(K(x_i,x_j)\). The representation (6.8) then follows from the spectral theorem applied to \(\mathbf {K}\). In fact, if \(\zeta _i\) and \(v_i\) are, respectively, the eigenvalues and the orthonormal (column) eigenvectors of \(\mathbf {K}\), (6.8) can be written as
$$\begin{aligned} \mathbf {K} = \sum _{i} \zeta _i v_i v_i^T, \end{aligned}$$
where the functions \(\rho _i(\cdot )\) have become the vectors \(v_i\). One generalization of this result is described below.
Let \(L_K\) be the linear operator defined by the positive definite kernel K as follows:
$$\begin{aligned} L_K[f](x) = \int _{\mathscr {X}} K(x,y) f(y) d\mu (y). \end{aligned}$$
We also assume that \(\mu \) is a \(\sigma \)-finite and nondegenerate Borel measure on \(\mathscr {X}\). Essentially this means that \(\mathscr {X}\) is the countable union of measurable sets with finite measure and that \(\mu \) “covers” entirely \(\mathscr {X}\). The reader can, e.g., consider \(\mathscr {X} \subset \mathbb {R}^m\) and think of \(\mu \) as the Lebesgue measure or any probability measure with \(\mu (A)>0\) for any non-empty open set \(A \subset \mathscr {X}\). The next classical result goes under the name of Mercer theorem whose formulations trace back to [60].
Theorem 6.9
(Mercer theorem, based on [60]) Let \(\mathscr {X}\) be a compact metric space equipped with a nondegenerate and \(\sigma \)-finite Borel measure \(\mu \) and let K be a Mercer kernel on \(\mathscr {X} \times \mathscr {X}\). Then, there exists a complete orthonormal basis of \(\mathscr {L}_2^{\mu }\) given by a countable number of continuous functions \(\{\rho _i\}_{i \in \mathscr {I}}\) satisfying
$$\begin{aligned} L_K[\rho _i] = \zeta _i \rho _i, \quad i \in \mathscr {I}, \end{aligned}$$(6.10)
with \(\zeta _i >0 \ \forall i\) if K is strictly positive definite and \(\lim _{i \rightarrow \infty } \zeta _i =0\) if the number of eigenvalues is infinite.
One also has
$$\begin{aligned} K(x,y) = \sum _{i \in \mathscr {I}} \zeta _i \rho _i(x) \rho _i(y), \end{aligned}$$(6.11)
where the convergence of the series is absolute and uniform on \(\mathscr {X} \times \mathscr {X}\).
The following result characterizes a RKHS through the eigenfunctions of \(L_K\). The proof is reported in Sect. 6.9.3.
Theorem 6.10
(RKHS defined by an orthonormal basis of \(\mathscr {L}_2^{\mu }\)) Under the same assumption of Theorem 6.9, if the \(\rho _i\) and \(\zeta _i\) satisfy (6.10), with also \(\zeta _i>0 \ \forall i\), one has
$$\begin{aligned} \mathscr {H} = \Big \{ f \ : \ f = \sum _{i \in \mathscr {I}} a_i \rho _i \ \ \text{ with } \ \sum _{i \in \mathscr {I}} \frac{a_i^2}{\zeta _i} < \infty \Big \}. \end{aligned}$$(6.12)
In addition, if
$$\begin{aligned} f = \sum _{i \in \mathscr {I}} a_i \rho _i \quad \text{ and } \quad g = \sum _{i \in \mathscr {I}} b_i \rho _i, \end{aligned}$$
one has
$$\begin{aligned} \langle f, g \rangle _{\mathscr {H}} = \sum _{i \in \mathscr {I}} \frac{a_i b_i}{\zeta _i}, \end{aligned}$$
so that
$$\begin{aligned} \Vert f \Vert ^2_{\mathscr {H}} = \sum _{i \in \mathscr {I}} \frac{a_i^2}{\zeta _i}. \end{aligned}$$
Hence, it also follows that \(\{\sqrt{\zeta _i} \rho _i\}_{i \in \mathscr {I}}\) is an orthonormal basis of \(\mathscr {H}\).
The representation (6.12) is not unique since the spectral maps, i.e., the functions that associate a kernel with a decomposition of the type (6.8), are not unique. They depend on the chosen measure \(\mu \) even if they lead to the same RKHS.
Theorem 6.10 thus shows that any kernel admitting an expansion (6.11) coming from the Mercer theorem induces a separable RKHS, i.e., having a countable basis given by the \(\rho _i\). Later on, Theorem 6.13 will show that such result holds under much milder assumptions. In fact, the representation (6.12) can be obtained starting from any diagonalized kernel (6.8) involving generic functions \(\rho _i\), e.g., not necessarily independent of each other. One can also remove the compactness hypothesis on the input space, e.g., letting \(\mathscr {X}\) be the entire \(\mathbb {R}^m\).
Remark 6.2
(Relationship between \(\mathscr {H}\) and \(\mathscr {L}_2^{\mu }\) ) Theorem 6.10 points out an interesting connection between \(\mathscr {H}\) and \(\mathscr {L}_2^{\mu }\). Since the functions \(\rho _i\) form an orthonormal basis in \(\mathscr {L}_2^{\mu }\), one has
$$\begin{aligned} \mathscr {L}_2^{\mu } = \Big \{ f \ : \ f = \sum _{i \in \mathscr {I}} c_i \rho _i \ \ \text{ with } \ \sum _{i \in \mathscr {I}} c_i^2 < \infty \Big \}, \end{aligned}$$
while (6.12) shows that
$$\begin{aligned} \mathscr {H} = \Big \{ f \ : \ f = \sum _{i \in \mathscr {I}} c_i \rho _i \ \ \text{ with } \ \sum _{i \in \mathscr {I}} \frac{c_i^2}{\zeta _i} < \infty \Big \}. \end{aligned}$$
If \(\zeta _i>0 \ \forall i\), one has the set inclusion \(\mathscr {H} \subset \mathscr {L}_2^{\mu }\) since the functions in the RKHS must satisfy a more stringent condition on the decay of the expansion coefficients (the \(\zeta _i\) decay to zero).
In addition, let \(L_K^{1/2}\) denote the operator defined as the square root of \(L_K\), i.e., for any \(f \in \mathscr {L}_2^{\mu }\) with \(f= \sum _{i \in \mathscr {I}} c_i \rho _i\), one has
$$\begin{aligned} L_K^{1/2}[f] = \sum _{i \in \mathscr {I}} \sqrt{\zeta _i} c_i \rho _i. \end{aligned}$$
This is a smoothing operator: the function \(L_K^{1/2}[f] \) is more regular than f since the expansion coefficients \(\sqrt{\zeta _i}c_i \) decrease to zero faster than the \(c_i\). In view of (6.15) and (6.16), we obtain
$$\begin{aligned} \mathscr {H} = \Big \{ L_K^{1/2}[f] \ : \ f \in \mathscr {L}_2^{\mu } \Big \}, \end{aligned}$$
which shows that the RKHS can be thought of as the output of the linear system \(L_K^{1/2}\) fed with the space \(\mathscr {L}_2^{\mu }\), i.e., \(\mathscr {H} = L_K^{1/2} \mathscr {L}_2^{\mu }\).
Example 6.11
(Spline kernel expansion) In Example 6.5, we have seen that the space of functions on the unit interval satisfying \(f(0)=0\) and \(\int _0^1 \dot{f}^2(x) dx < \infty \) is the RKHS associated to the first-order spline kernel \(\min (x,y)\). We now derive a representation of the type (6.12) for this space setting \(\mu \) to the Lebesgue measure. For this purpose, consider the system
$$\begin{aligned} \int _0^1 \min (x,y) \rho (y) dy = \zeta \rho (x), \quad x \in [0,1]. \end{aligned}$$
The above equation is equivalent to
$$\begin{aligned} \int _0^x y \rho (y) dy + x \int _x^1 \rho (y) dy = \zeta \rho (x), \end{aligned}$$
which implies \(\rho (0)=0\). Taking the derivative w.r.t. x we also obtain
$$\begin{aligned} \int _x^1 \rho (y) dy = \zeta \dot{\rho }(x), \end{aligned}$$
that implies \( \dot{\rho }(1)=0\). Differentiating again w.r.t. x gives
$$\begin{aligned} -\rho (x) = \zeta \ddot{\rho }(x), \end{aligned}$$
whose general solution is
$$\begin{aligned} \rho (x) = a \sin \left( \frac{x}{\sqrt{\zeta }}\right) + b \cos \left( \frac{x}{\sqrt{\zeta }}\right) . \end{aligned}$$
The boundary conditions \(\rho (0)=\dot{\rho }(1)=0\) imply \(b=0\) and lead to the following possible eigenvalues:
$$\begin{aligned} \zeta _i = \left( \frac{2}{(2i-1)\pi }\right) ^2, \quad i=1,2,\ldots \end{aligned}$$
The orthonormality condition also implies \(a=\sqrt{2}\) so that we obtain
$$\begin{aligned} \rho _i(x) = \sqrt{2} \sin \left( \frac{(2i-1)\pi x}{2}\right) , \quad i=1,2,\ldots \end{aligned}$$
This provides the formulation (6.12) of the Sobolev space \(\mathscr {H}\). Figure 6.3 plots three eigenfunctions (left panel) and the first 100 eigenvalues \(\zeta _i\) (right panel). It is evident that the larger i, the larger the high-frequency content of \(\rho _i\) and the larger the RKHS norm of such a basis function. In fact, a large value of i corresponds to a small eigenvalue \(\zeta _i\) and one has \(\Vert \rho _i\Vert ^2_{\mathscr {H}}=1/\zeta _i\). \(\square \)
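The expansion just derived can also be checked numerically by truncating the series. In the sketch below (a pure illustration; the truncation level and test points are arbitrary), the partial sums are compared with \(\min (x,y)\).

```python
import numpy as np

def truncated_spline_kernel(x, y, n_terms=2000):
    i = np.arange(1, n_terms + 1)
    zeta = (2.0 / ((2 * i - 1) * np.pi)) ** 2               # eigenvalues derived above
    rho_x = np.sqrt(2) * np.sin((2 * i - 1) * np.pi * x / 2)
    rho_y = np.sqrt(2) * np.sin((2 * i - 1) * np.pi * y / 2)
    return float(np.sum(zeta * rho_x * rho_y))               # truncated Mercer expansion

for x, y in [(0.2, 0.7), (0.5, 0.5), (0.9, 0.3)]:
    print(x, y, truncated_spline_kernel(x, y), min(x, y))    # partial sum vs. min(x, y)
```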
Example 6.12
(Translation invariant kernels and Fourier expansion) A translation invariant kernel depends only on the difference of its two arguments. Hence, there exists \(h:\mathscr {X} \rightarrow \mathbb {R}\) such that \(K(x,y)=h(x-y)\). Assume that \(\mathscr {X}=[0,2\pi ]\) and that h can be extended to a continuous, symmetric and periodic function over \(\mathbb {R}\). Then, it can be expanded in terms of the following uniformly convergent Fourier series
$$\begin{aligned} h(t) = \zeta _0 + \sum _{i=1}^{\infty } \zeta _i \cos (it), \end{aligned}$$
where \(\zeta _0\) accounts for the constant component and we assume \(\zeta _i>0 \ \forall i\). We thus obtain the kernel expansion
$$\begin{aligned} K(x,y) = \zeta _0 + \sum _{i=1}^{\infty } \zeta _i \left[ \cos (ix)\cos (iy) + \sin (ix)\sin (iy) \right] \end{aligned}$$
in terms of functions which are all orthogonal in \(\mathscr {L}_2^{\mu }\). Hence, these kernels induce RKHSs generated by the Fourier basis, with different inner products determined by \(\zeta _i\). \(\square \)
6.3.1 More General Spectral Representation \(\star \)
Now, assume that the kernel K is available in the form \(K(x,y) = \sum _{i \in \mathscr {I}} \ \zeta _i \rho _i(x) \rho _i(y)\) with \(\zeta _i > 0 \ \forall i\), but with functions \(\rho _i\) not necessarily orthonormal. More generally, we do not even require that they are independent, e.g., \(\rho _1\) could be a linear combination of \(\rho _2\) and \(\rho _3\). The following result shows that the RKHS associated to K is still generated by the \(\rho _i\), but the relationship of the expansion coefficients with \(\Vert \cdot \Vert _{\mathscr {H}}\) is more involved than in the previous case.
Theorem 6.13
(RKHS induced by a diagonalized kernel) Let \(\mathscr {H}\) be the RKHS induced by \(K(x,y) = \sum _{i \in \mathscr {I}} \ \zeta _i \rho _i(x) \rho _i(y)\) with \(\zeta _i > 0 \ \forall i\) and the set \(\mathscr {I}\) countable. Then, \(\mathscr {H}\) is separable and admits the representation
$$\begin{aligned} \mathscr {H} = \Big \{ g \ : \ g = \sum _{i \in \mathscr {I}} a_i \rho _i \ \ \text{ with } \ \sum _{i \in \mathscr {I}} \frac{a_i^2}{\zeta _i} < \infty \Big \}, \end{aligned}$$
and one has
$$\begin{aligned} \Vert g \Vert ^2_{\mathscr {H}} = \min _{\{a_i\}} \ \sum _{i \in \mathscr {I}} \frac{a_i^2}{\zeta _i} \quad \text{ s.t. } \quad g = \sum _{i \in \mathscr {I}} a_i \rho _i. \end{aligned}$$(6.20)
The proof is reported in Sect. 6.9.4 while an application example is given below.
Example 6.14
Let
Using Theorem 6.13, we obtain that the RKHS \(\mathscr {H}\) associated to K is spanned by \(\sin ^2(x)\), \(\cos ^2(x)\) and the constant function. Now, let \(f(x)=1\) and consider the problem of computing \(\Vert f\Vert _{\mathscr {H}}^2\). To have a correspondence with (6.8) we can, e.g., fix the notation
and
Since the functions \(\rho _i\) are not independent, many different representations for \(f(x)=1\) can be found. In particular, one has
so that
with the minimum 1/2 obtained at \(c=1/2\). Hence, according to the norm of \(\mathscr {H}\), the “minimum energy” representation of \(f(x)=1\) is \(1/2(\rho _1(x)+ \rho _2(x) + \rho _3(x))\).
\(\square \)
6.4 Kernel-Based Regularized Estimation
6.4.1 Regularization in Reproducing Kernel Hilbert Spaces and the Representer Theorem
A powerful approach to reconstruct a function \(g:\mathscr {X} \rightarrow \mathbb {R}\) from sparse data \(\{x_i,y_i\}_{i=1}^N\) consists of minimizing a suitable functional over a RKHS. An important generalization of the estimators based on quadratic penalties, denoted by ReLS-Q in Chap. 3, is defined by
$$\begin{aligned} \hat{g} = \mathop {\text {arg min}}\limits _{f \in \mathscr {H}} \ \sum _{i=1}^N \mathscr {V}_i\left( y_i,f(x_i)\right) + \gamma \Vert f \Vert ^2_{\mathscr {H}}. \end{aligned}$$(6.21)
In (6.21), \(\mathscr {V}_i\) are loss functions measuring the distance between \(y_i\) and \(f(x_i)\). They can take only positive values and are assumed convex w.r.t. their second argument \(f(x_i)\). As an example, when the quadratic loss is adopted for any i, one obtains
Then, the norm \(\Vert \cdot \Vert _{\mathscr {H}}\) defines the regularizer, e.g., given by the energy of the first-order derivative
which corresponds to the spline norm introduced in Example 6.5. Finally, the positive scalar \(\gamma \) is the regularization parameter (already encountered in the previous chapters) which has to balance adherence to experimental data and function regularity. Indeed, the idea underlying (6.21) is that the predictor \(\hat{g}\) should be able to describe the data without being too complex according to the RKHS norm. In particular, the scope of the regularizer is to restore the well-posedness of the problem, making the solution depend continuously on the data. It should also include our available information on the unknown function, e.g., the expected smoothness level.
The importance of the RKHSs in the context of regularization methods stems from the following central result, whose first formulation can be found in [52]. It shows that the solutions of the class of variational problems (6.21) admit a finite-dimensional representation, independently of the dimension of \(\mathscr {H}\). The proof of an extended version of this result can be found in Sect. 6.9.5.
Theorem 6.15
(Representer theorem, adapted from [104]) Let \(\mathscr {H}\) be a RKHS. Then, all the solutions of (6.21) admit the following expression
$$\begin{aligned} \hat{g} = \sum _{i=1}^N c_i K_{x_i}, \end{aligned}$$(6.22)
where the \(c_i\) are suitable scalar expansion coefficients.
Thus, as in the traditional linear parametric approach, the optimal function is a linear combination of basis functions. However, a fundamental difference is that their number is now equal to the number of data pairs, and is thus not fixed a priori. In fact, the functions appearing in the expression of the minimizer \(\hat{g}\) are just the kernel sections \(K_{x_i}\) centred on the input data. The representer theorem also conveys the message that, using estimators of the form (6.21), it is not possible to recover arbitrarily complex functions from a finite amount of data. The solution is always confined to a subspace with dimension equal to the data set size.
Now, let \(\mathbf {K} \in \mathbb {R}^{N \times N}\) be the positive semidefinite matrix (called kernel matrix, or Gram matrix) such that \(\mathbf {K}_{ij} = K(x_i,x_j)\). The ith row of \(\mathbf {K}\) is denoted by \(\mathbf {k}_i\). Using this notation, if \(g = \sum _{i=1}^N \ c_i K_{x_i}\) then
$$\begin{aligned} g(x_i) = \mathbf {k}_i c, \qquad \Vert g \Vert ^2_{\mathscr {H}} = \sum _{i=1}^N \sum _{j=1}^N c_i c_j \langle K_{x_i}, K_{x_j} \rangle _{\mathscr {H}} = c^T \mathbf {K} c, \end{aligned}$$(6.23)
where \(c=[c_1,\ldots ,c_N]^T\) and the second equality derives from the reproducing property or, equivalently, from (6.4).
Using the representer theorem, we can plug the expression (6.22) of the optimal \(\hat{g}\) into the objective (6.21). Then, exploiting (6.23), the variational problem (6.21) boils down to
$$\begin{aligned} \hat{c} = \mathop {\text {arg min}}\limits _{c \in \mathbb {R}^N} \ \sum _{i=1}^N \mathscr {V}_i\left( y_i, \mathbf {k}_i c\right) + \gamma c^T \mathbf {K} c. \end{aligned}$$(6.24)
The regularization problem (6.21) has been thus reduced to a finite-dimensional optimization problem whose order N does not depend on the dimension of the original space \(\mathscr {H}\). In addition, since each loss function \(\mathscr {V}_i\) has been assumed convex, the objective (6.24) is convex overall. How to compute the expansion coefficients now depends on the specific choice of the \(\mathscr {V}_i\), as discussed in the next section.
Remark 6.3
(Kernel trick and implicit basis functions encoding) Assume that the kernel admits the expansion \(K(x,y) = \sum _{i =1}^{\infty } \ \zeta _i \rho _i(x) \rho _i(y), \ \ \zeta _i > 0\). Then, as discussed in Sect. 6.3, any function in \(\mathscr {H}\) has the representation
$$\begin{aligned} g = \sum _{i=1}^{\infty } a_i \rho _i, \qquad \Vert g \Vert ^2_{\mathscr {H}} = \sum _{i=1}^{\infty } \frac{a_i^2}{\zeta _i}. \end{aligned}$$
Problem (6.21) can then be rewritten using the infinite-dimensional vector \(a=[a_1 \ a_2 \ \ldots ]\) as unknown:
$$\begin{aligned} \hat{a} = \mathop {\text {arg min}}\limits _{a} \ \sum _{i=1}^N \mathscr {V}_i\Big ( y_i, \sum _{j=1}^{\infty } a_j \rho _j(x_i) \Big ) + \gamma \sum _{j=1}^{\infty } \frac{a_j^2}{\zeta _j}, \end{aligned}$$
and an equivalent representation of (6.22) becomes \(\hat{g}=\sum _{i =1}^{\infty } \ \hat{a}_i \rho _i\). In comparison to this reformulation, the use of the kernel and of the representer theorem brings modelling and computational advantages. In fact, through K one needs neither to choose the number of basis functions to be used (the kernel can already include in an implicit way an infinite number of basis functions) nor to store any basis function in memory (the representer theorem reduces inference to solving a finite-dimensional optimization problem based on the kernel matrix \(\mathbf {K}\)). These features are related to what is called the kernel trick in the machine learning literature.
6.4.2 Representer Theorem Using Linear and Bounded Functionals
A more general version of the representer theorem, obtained in [52], replaces \(f(x_i)\) with \(L_i[f]\), where \(L_i\) is linear and bounded. In the first part of the following result \(\mathscr {H}\) is just required to be Hilbert. In Sect. 6.9.5 we will see how Theorem 6.16 can be further generalized.
Theorem 6.16
(Representer theorem with functionals \(L_i\), adapted from [104]) Let \(\mathscr {H}\) be a Hilbert space and consider the optimization problem
$$\begin{aligned} \hat{g} = \mathop {\text {arg min}}\limits _{f \in \mathscr {H}} \ \sum _{i=1}^N \mathscr {V}_i\left( y_i,L_i[f]\right) + \gamma \Vert f \Vert ^2_{\mathscr {H}}, \end{aligned}$$(6.25)
where each \(L_i: \mathscr {H} \rightarrow \mathbb {R}\) is linear and bounded. Then, all the solutions of (6.25) admit the following expression
$$\begin{aligned} \hat{g} = \sum _{i=1}^N c_i \eta _i, \end{aligned}$$(6.26)
where the \(c_i\) are suitable scalar expansion coefficients and each \(\eta _i \in \mathscr {H}\) is the representer of \(L_i\), i.e., for any i and \(f \in \mathscr {H}\):
$$\begin{aligned} L_i[f] = \langle \eta _i, f \rangle _{\mathscr {H}}. \end{aligned}$$(6.27)
In particular, if \(\mathscr {H}\) is a RKHS with kernel K, each basis function is given by
$$\begin{aligned} \eta _i(x) = L_i[K(x,\cdot )] \quad \forall x \in \mathscr {X}. \end{aligned}$$(6.28)
The existence of \(\eta _i\) satisfying (6.27) is ensured by the Riesz representation theorem (Theorem 6.27). One can also prove that in a RKHS a linear functional L is linear and bounded if and only if the function f obtained by applying L to the kernel, i.e., \(f(x)=L[K(x,\cdot )] \ \forall x\), belongs to the RKHS.
Note also that Theorem 6.15 is indeed a special case of the last result. In fact, let \(\mathscr {H}\) be a RKHS and \(L_i[f]=f(x_i) \ \forall i\). Then, each \(L_i\) is linear and bounded and each \(\eta _i\) becomes the kernel section \(K_{x_i}\) according to the reproducing property.
Example 6.17
(Solution using the quadratic loss) Let us adopt a quadratic loss in (6.25), i.e., \(\mathscr {V}_i(y_i,L_i[f])=(y_i-L_i[f])^2\). This makes the objective strictly convex so that a unique solution exists. To find it, plugging (6.26) in (6.25) and using also (6.28), the following quadratic problem is obtained
$$\begin{aligned} \hat{c} = \mathop {\text {arg min}}\limits _{c \in \mathbb {R}^N} \ \Vert Y - O c \Vert ^2 + \gamma c^T O c, \end{aligned}$$(6.29)
where \(Y=[y_1, \ldots ,y_N]^T\), \(\Vert \cdot \Vert \) is the Euclidean norm, while the \(N \times N\) matrix O has i, j entry given by
$$\begin{aligned} O_{ij} = L_i[\eta _j] = \langle \eta _i, \eta _j \rangle _{\mathscr {H}}. \end{aligned}$$
The minimizer \(\hat{c}\) of (6.29) is unique if O is full rank. Otherwise, all the solutions lead to the same function estimate in view of the (already mentioned) strict convexity of (6.25). In particular, one can always use as optimal expansion coefficients the components of the vector
$$\begin{aligned} \hat{c} = \left( O + \gamma I_N \right) ^{-1} Y. \end{aligned}$$
In Sect. 6.5.1 this result will be further discussed in the context of the so-called regularization networks, where one comes back to assume \(L_i[f]=f(x_i)\). \(\square \)
6.5 Regularization Networks and Support Vector Machines
The choice of the loss \(\mathscr {V}_i\) in (6.21) yields regularization algorithms with different properties. We will illustrate four different cases below.
6.5.1 Regularization Networks
Let us consider the quadratic loss function \(\mathscr {V}_i(y_i,f(x_i))= r_i^2\), with the residual \(r_i\) defined by \(r_i=y_i-f(x_i)\). Such a loss, also depicted in Fig. 6.4 (top left panel), leads to the problem
$$\begin{aligned} \hat{g} = \mathop {\text {arg min}}\limits _{f \in \mathscr {H}} \ \sum _{i=1}^N \left( y_i-f(x_i)\right) ^2 + \gamma \Vert f \Vert ^2_{\mathscr {H}}, \end{aligned}$$(6.32)
which is a generalization of the regularized least squares problem encountered in the previous chapters. In particular, it extends the estimator (3.58a) based on quadratic penalty called ReLS-Q in Chap. 3. The estimator (6.32) is known in the literature as regularization network [71] or also kernel ridge regression. The strict convexity of the objective (6.32) ensures that the minimizer \(\hat{g}\) not only exists but is also unique (this issue is further discussed in the remark at the end of this subsection).
To find the solution, we can follow the same arguments developed in Example 6.17, just specializing the result to the case \(L_i[f]=f(x_i)\). We will see that the matrix O has just to be replaced by the kernel matrix \(\mathbf {K}\).
As previously done, let \(Y=[y_1, \ldots ,y_N]^T\) and use \(\Vert \cdot \Vert \) to indicate the Euclidean norm. Then, the corresponding regularization problem (6.24) becomes
$$\begin{aligned} \hat{c} = \mathop {\text {arg min}}\limits _{c \in \mathbb {R}^N} \ \Vert Y - \mathbf {K} c \Vert ^2 + \gamma c^T \mathbf {K} c, \end{aligned}$$(6.33)
which is a finite-dimensional ReLS-Q. After simple calculations, one of the optimal solutions is found to be
$$\begin{aligned} \hat{c} = \left( \mathbf {K} + \gamma I_N \right) ^{-1} Y, \end{aligned}$$(6.34)
where \(I_{N}\) is the \(N \times N\) identity matrix. The estimate from the regularization network is thus available in closed form, given by \(\hat{g} = \sum _{i=1}^N \ \hat{c}_i K_{x_i}\) with the optimal coefficient vector \(\hat{c}\) solving a linear system of equations.
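The closed-form solution (6.34) makes the regularization network straightforward to implement. The following sketch builds the kernel matrix, solves for \(\hat{c}\) and evaluates \(\hat{g}\) on new input locations; the Gaussian kernel, its width, the simulated data and \(\gamma \) are all illustrative choices, not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 40))                         # input locations x_i
Y = np.sin(6 * X) + 0.1 * rng.standard_normal(40)          # noisy outputs y_i

def kern(a, b, width=0.1):
    """Gaussian kernel matrix between two sets of scalar inputs (illustrative choice)."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / width)

gamma = 0.1
K = kern(X, X)
c_hat = np.linalg.solve(K + gamma * np.eye(len(X)), Y)     # expansion coefficients, as in (6.34)

x_new = np.linspace(0, 1, 200)
g_hat = kern(x_new, X) @ c_hat                             # g_hat(x) = sum_i c_hat_i K(x, x_i)
print(np.max(np.abs(g_hat - np.sin(6 * x_new))))           # rough fit error on a grid
```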
Remark 6.4
(Regularization network as projection) An interpretation of the regularization network can be also given in terms of a projection. In particular, let \(\mathscr {R}\) be the Hilbert space \(\mathbb {R}^N \times \mathscr {H}\) (any element is a couple containing a vector v and a function f) with norm defined, for any \(v \in \mathbb {R}^N\) and \(f \in \mathscr {H}\), by
Let also S be the (closed) subspace given by all the couples (v, f) satisfying the constraint \(v=[f(x_1) \ldots f(x_N)]\). Then, if \(g=(Y,0)\) where 0 here denotes the null function in \(\mathscr {H}\), the projection of g onto S is
It is now immediate to conclude that \(g_S\) corresponds to \(([\hat{g}(x_1) \ldots \hat{g}(x_n)],\hat{g})\) where \(\hat{g}\) is indeed the minimizer (6.32), which must thus be unique in view of Theorem 6.25 (Projection theorem). Note that this interpretation can be extended to all the variational problems (6.21) containing losses defined by a norm induced by an inner product in \(\mathbb {R}^N\).
6.5.2 Robust Regression via Huber Loss \(\star \)
As described in Sect. 3.6.1, a shortcoming of the quadratic loss is its sensitivity to outliers, because the influence of large residuals \(r_i\) grows quadratically. In the presence of outliers, one is better off using a loss function that grows linearly. These issues have been widely studied in the field of robust statistics [51], where loss functions such as Huber’s have been introduced. Recalling (3.115), one has
where we still have \(r_i=y_i-f(x_i)\). The Huber loss function with \(\delta =1\) is shown in Fig. 6.4 (top right panel). Notice that it grows linearly and is thus robust to outliers. When \(\delta \rightarrow +\infty \), one recovers the quadratic loss. On the other hand, we also have \(\lim _{\delta \rightarrow 0^+} \mathscr {V}_i(r)/\delta = |r_i|\) that is the absolute value loss.
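Since the Huber loss is differentiable, the finite-dimensional problem (6.24) equipped with it can be solved with a general-purpose smooth optimizer. The sketch below uses scipy.special.huber, whose convention (quadratic branch \(r^2/2\), linear branch \(\delta (|r|-\delta /2)\)) may differ from (3.115) by constant factors; data, kernel and tuning parameters are illustrative.

```python
import numpy as np
from scipy.special import huber
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 1, 50))
Y = np.sin(6 * X) + 0.1 * rng.standard_normal(50)
Y[::10] += 3.0                                            # inject a few outliers

K = np.exp(-((X[:, None] - X[None, :]) ** 2) / 0.1)       # illustrative Gaussian kernel matrix
gamma, delta = 0.1, 0.5

def objective(c):
    r = Y - K @ c
    return np.sum(huber(delta, r)) + gamma * c @ K @ c    # Huber data fit + RKHS-norm penalty

c_hat = minimize(objective, np.zeros(len(X)), method="L-BFGS-B").x
g_hat = K @ c_hat
print(np.max(np.abs(g_hat - np.sin(6 * X))))              # fit error on the inputs
```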
6.5.3 Support Vector Regression \(\star \)
Sometimes, it is desirable to neglect prediction errors, as long as they are below a certain threshold. This can be achieved, e.g., using Vapnik’s \(\varepsilon \)-insensitive loss given, for \(r_i=y_i-f(x_i)\), by
$$\begin{aligned} \mathscr {V}_i(y_i,f(x_i)) = |r_i|_{\varepsilon } = \max \left( 0, |r_i| - \varepsilon \right) . \end{aligned}$$
The Vapnik loss with \(\varepsilon =0.5\) is shown in Fig. 6.4 (bottom left panel). Notice that it has a null plateau in the interval \([-\varepsilon , \varepsilon ]\) so that any predictor closer than \(\varepsilon \) to \(y_i\) is seen as a perfect interpolant. The loss then grows linearly, thus ensuring robustness. The regularization problem (6.21) associated with the \(\varepsilon \)-insensitive loss function turns out to be
$$\begin{aligned} \hat{g} = \mathop {\text {arg min}}\limits _{f \in \mathscr {H}} \ \sum _{i=1}^N |y_i - f(x_i)|_{\varepsilon } + \gamma \Vert f \Vert ^2_{\mathscr {H}} \end{aligned}$$
and is called Support Vector Regression (SVR), see, e.g., [37]. The SVR solution, given by \(\hat{g} = \sum _{i=1}^N \ \hat{c}_i K_{x_i}\) according to the representer theorem, is characterized by sparsity in \(\hat{c}\), i.e., some components \(\hat{c}_i\) are set to zero. This feature is briefly discussed below.
In the SVR case, obtaining the optimal coefficient vector \(\hat{c}\) by (6.24) is not trivial since the loss \(| \cdot |_{\varepsilon }\) is not differentiable everywhere. This difficulty can be circumvented by replacing (6.24) with the following equivalent problem obtained considering two additional N-dimensional parameter vectors \(\xi \) and \(\xi ^*\):
$$\begin{aligned} \min _{c, \xi , \xi ^*} \ \sum _{i=1}^N \left( \xi _i + \xi _i^* \right) + \gamma c^T \mathbf {K} c \end{aligned}$$(6.36)
subject to the constraints
$$\begin{aligned} y_i - \mathbf {k}_i c \le \varepsilon + \xi _i, \quad \mathbf {k}_i c - y_i \le \varepsilon + \xi _i^*, \quad \xi _i \ge 0, \quad \xi _i^* \ge 0, \quad i=1,\ldots ,N. \end{aligned}$$
To see that its minimizer contains the optimal solution \(\hat{c}\) of (6.24), it suffices to notice that (6.36) assigns a linear penalty only when \(|y_i - \mathbf {k}_ic| > \varepsilon \).
Problem (6.36) is quadratic subject to linear inequality constraints, hence it is solvable by standard optimization approaches like interior point methods [64, 108]. Calculating the Karush–Kuhn–Tucker conditions, it is possible to show that the condition \(|y_i - \mathbf {k}_i\hat{c}| < \varepsilon \) implies \(\hat{c}_i=0\). Indexes i for which \(\hat{c}_i \ne 0\) instead identify the set of input locations \(x_i\) called support vectors.
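For illustration, the problem (6.24) with the \(\varepsilon \)-insensitive loss can also be transcribed directly as a convex program, e.g., with the cvxpy modelling package, instead of forming the QP (6.36) explicitly. This is only a sketch under illustrative choices of data, kernel and tuning parameters; a Cholesky factor is used so that the regularizer is expressed as a sum of squares.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 1, 40))
Y = np.sin(6 * X) + 0.1 * rng.standard_normal(40)

K = np.exp(-((X[:, None] - X[None, :]) ** 2) / 0.1)      # illustrative Gaussian kernel matrix
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(X)))        # so that c^T K c ~= ||L^T c||^2

gamma, eps = 0.1, 0.2
c = cp.Variable(len(X))
loss = cp.sum(cp.pos(cp.abs(Y - K @ c) - eps))           # epsilon-insensitive loss
problem = cp.Problem(cp.Minimize(loss + gamma * cp.sum_squares(L.T @ c)))
problem.solve()

c_hat = c.value
print("coefficients above 1e-4 in magnitude:", int(np.sum(np.abs(c_hat) > 1e-4)), "of", len(X))
```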
6.5.4 Support Vector Classification \(\star \)
The three losses illustrated above were originally proposed for regression problems, with the output y real valued. When the outputs can assume only two values, e.g., 1 and −1, a classification problem arises. Here, the goal of the predictor is just to separate the two classes. This problem can be seen as a special case of regression. In particular, even if the output space is binary, consider prediction functions \(f: \mathscr {X} \rightarrow \mathbb {R}\) and assume that the input \(x_i\) is associated with class 1 if \(f(x_i)\ge 0\) and with class \(-1\) if \(f(x_i)<0\). Let the margin on an example \((x_i,y_i)\) be \(m_i=y_if(x_i)\). Then, we will see that the value of \(m_i\) is a measure of how well we are classifying the available data. One can thus try to maximize the margin while still searching for a function that is not too complex according to the RKHS norm. In particular, we can exploit (6.21) with a loss that depends on the margin as described below.
The most natural classification loss is the \(0-1\) loss defined for any i by
and depicted in Fig. 6.4 (bottom right panel, dashed line). Adopting it, the first component of the objective in (6.21) returns the number of misclassifications. However, the \(0-1\) loss is not convex and leads to an optimization problem of combinatorial nature.
An alternative is the so-called hinge loss [98] defined by
$$\begin{aligned} \mathscr {V}_i(y_i,f(x_i)) = \max \left( 0, 1 - m_i \right) , \quad m_i = y_i f(x_i), \end{aligned}$$
which thus provides a linear penalty when \(m_i<1\). Figure 6.4 (bottom right panel, solid line) illustrates that it is a convex upper bound on the \(0-1\) loss. The problem associated with the hinge loss turns out to be
$$\begin{aligned} \hat{g} = \mathop {\text {arg min}}\limits _{f \in \mathscr {H}} \ \sum _{i=1}^N \max \left( 0, 1 - y_i f(x_i) \right) + \gamma \Vert f \Vert ^2_{\mathscr {H}} \end{aligned}$$(6.37)
and is called support vector classification (SVC).
Like in the SVR case, obtaining the optimal coefficient vector by (6.37) is not trivial since the hinge loss is not differentiable. But one can still resort to an equivalent problem, now obtained considering just an additional parameter vector \(\xi \):
$$\begin{aligned} \min _{c, \xi } \ \sum _{i=1}^N \xi _i + \gamma c^T \mathbf {K} c \end{aligned}$$
subject to the constraints
$$\begin{aligned} y_i \mathbf {k}_i c \ge 1 - \xi _i, \quad \xi _i \ge 0, \quad i=1,\ldots ,N. \end{aligned}$$
As in the SVR case, the optimal solution \(\hat{c}\) is sparse and indexes i for which \(\hat{c}_i \ne 0\) define the support vectors \(x_i\).
6.6 Kernels Examples
The reproducing kernel characterizes the hypothesis space \(\mathscr {H}\). Together with the loss function, it also completely defines the key estimator (6.21) which exploits the RKHS norm as regularizer. The choice of K has thus a crucial impact on the ability of predicting future output data. Some important RKHSs are discussed below.
6.6.1 Linear Kernels, Regularized Linear Regression and System Identification
We now show that the regularization network (6.32) generalizes the ReLS-Q problem introduced in Chap. 3 which adopts quadratic penalties. The link is provided by the concept of linear kernel.
We start assuming that the input space is \(\mathscr {X} = {\mathbb R}^m\). Hence, any input location x corresponds to an m-dimensional (column) vector. If \(P \in {\mathbb R}^{m \times m}\) denotes a symmetric and positive semidefinite matrix, a linear kernel is defined as follows
$$\begin{aligned} K(x,y) = x^T P y. \end{aligned}$$
All the kernel sections are linear functions. Hence, their span defines a finite-dimensional (closed) subspace of linear functions that, in view of Theorem 6.1 (and the subsequent discussion), coincides with the whole \(\mathscr {H}\). Hence, the RKHS induced by the linear kernel is simply a space of linear functions and, for any \(g \in \mathscr {H}\), there exists \(a \in {\mathbb R}^m\) such that
$$\begin{aligned} g(x) = x^T P a \quad \forall x \in \mathscr {X}, \qquad \Vert g \Vert ^2_{\mathscr {H}} = a^T P a. \end{aligned}$$
If P is full rank, letting \(\theta := P a\), we also have
$$\begin{aligned} g(x) = x^T \theta \quad \forall x \in \mathscr {X}, \qquad \Vert g \Vert ^2_{\mathscr {H}} = \theta ^T P^{-1} \theta . \end{aligned}$$
Now, let us use such \(\mathscr {H}\) in the regularization network (6.32). Without using the representer theorem, we can plug the representation \(g(x)=\theta ^T x\) in the regularization problem to obtain \(\hat{g}(x)=\hat{\theta }^Tx\) where
$$\begin{aligned} \hat{\theta } = \mathop {\text {arg min}}\limits _{\theta \in \mathbb {R}^m} \ \Vert Y - \varPhi \theta \Vert ^2 + \gamma \theta ^T P^{-1} \theta , \end{aligned}$$(6.39)
with the ith row of the regression matrix \(\varPhi \) equal to \(x_i^T\). One can see that (6.39) coincides with ReLS-Q, with the regularization matrix P which defines the linear kernel K and, in turn, the penalty term \(\theta ^T P^{-1} \theta \).
We now derive the connection with linear system identification in discrete time. The data set consists of the output measurements \(\{y_i\}_{i=1}^N\), collected on the time instants \(\{t_i\}_{i=1}^N\), and of the system input u. We can form each input location using past input values as follows
$$\begin{aligned} x_i = \left[ u(t_i-1) \ u(t_i-2) \ \ldots \ u(t_i-m) \right] ^T, \end{aligned}$$
where m is the FIR order and an input delay of one unit has been assumed. Then, if Y collects the noisy outputs, \(\hat{\theta }\) becomes the impulse response estimate. This establishes a correspondence between regularized FIR estimation and regularization in RKHS induced by linear kernels.
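As a small illustration of regularized FIR estimation, the sketch below builds the regression matrix from past input values and computes the minimizer of (6.39) in closed form, \(\hat{\theta } = (\varPhi ^T\varPhi + \gamma P^{-1})^{-1}\varPhi ^TY\); the input signal, the “true” impulse response and the diagonal regularization matrix P are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(4)
N, m = 200, 30
u = rng.standard_normal(N + m)                     # system input
g_true = 0.8 ** np.arange(1, m + 1)                # illustrative "true" FIR coefficients

# regression matrix Phi: i-th row is [u(t_i - 1), ..., u(t_i - m)], with t_i = m, ..., m + N - 1
Phi = np.array([[u[t - k] for k in range(1, m + 1)] for t in range(m, m + N)])
Y = Phi @ g_true + 0.1 * rng.standard_normal(N)

gamma = 1.0
P = np.diag(0.9 ** np.arange(1, m + 1))            # illustrative regularization (kernel) matrix
theta_hat = np.linalg.solve(Phi.T @ Phi + gamma * np.linalg.inv(P), Phi.T @ Y)
print(np.linalg.norm(theta_hat - g_true))          # estimation error
```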
6.6.1.1 Infinite-Dimensional Extensions \(\star \)
In place of \(\mathscr {X}=\mathbb {R}^m\), let now \(\mathscr {X} \subset \mathbb {R}^\infty \), i.e., the input space contains sequences. We can interpret any input location as an infinite-dimensional column vector and use ordinary notation of algebra to handle infinite-dimensional objects. For instance, if \(x,y \in \mathscr {X}\) then \(x^Ty=\langle x,y \rangle _2\) where \(\langle \cdot ,\cdot \rangle _2\) is the inner product in \(\ell _2\). Assume we are given a symmetric and infinite-dimensional matrix P such that the linear kernel
$$\begin{aligned} K(x,y) = x^T P y \end{aligned}$$
is well defined over a subset of \(\mathbb {R}^\infty \times \mathbb {R}^\infty \). For example, if P is absolutely summable, i.e., \(\sum _{ij} |P_{ij}|<\infty \), the kernel is well defined for any input location \(x \in \mathscr {X}\) with \(\mathscr {X}=\ell _\infty \). The kernel section centred on x is the infinite-dimensional column vector Px. Following arguments similar to those seen in the finite-dimensional case, one can conclude that the RKHS associated to such K contains linear functions of the form \(g(x)=a^T P x\) with \(a \in \mathscr {X}\). Roughly speaking, the regularization network (6.32) relying on such hypothesis space is the limit of Problem (6.39) for \(m \rightarrow \infty \). To compute the solution, in this case it is necessary to resort to the representer theorem (6.22). One obtains
$$\begin{aligned} \hat{g}(x) = x^T \hat{\theta } \quad \forall x \in \mathscr {X}, \end{aligned}$$
where \(\hat{c}\) is defined by (6.34) and
$$\begin{aligned} \hat{\theta } = \sum _{i=1}^N \hat{c}_i P x_i. \end{aligned}$$
The link with linear system identification follows the same reasoning previously developed but \(x_i\) now contains an infinite number of past input values, i.e.,
$$\begin{aligned} x_i = \left[ u(t_i-1) \ u(t_i-2) \ u(t_i-3) \ \ldots \right] ^T. \end{aligned}$$
With this correspondence, the regularization network now implements regularized IIR estimation and \(\hat{\theta }\) contains the impulse response coefficients estimates. In fact, note that the nature of \(x_i\) makes the value \(\hat{g}(x_i)\) the convolution between the system input u and \(\hat{\theta }\) evaluated at \(t_i\) (with one unit input delay).
In a more sophisticated scenario, in place of sequences, the input space \(\mathscr {X}\) could contain functions. For instance, \(\mathscr {X} \subset \mathscr {P}^c\) where \(\mathscr {P}^c\) is the space of piecewise continuous functions on \(\mathbb {R}^+\). Thus, each input location corresponds to a continuous function \(x:\mathbb {R}^+ \rightarrow \mathbb {R}\). Given a suitable symmetric function \(P: \mathbb {R}^+ \times \mathbb {R}^+ \rightarrow \mathbb {R}\), a linear kernel is now defined by
The corresponding RKHS thus contains linear functionals: any \(f \in \mathscr {H}\) maps x (which is a function) into \(\mathbb {R}\). The solution of the regularization network (6.32) equipped with such hypothesis space is
where \(\hat{c}\) is still defined by (6.34) and
The connection with linear system identification is obtained by defining
(if the input u(t) is continuous for \(t \ge 0\) and causal, the functions \(x_i(t)\) are piecewise continuous, making the assumption \(\mathscr {X} \subset \mathscr {P}^c\) necessary). In this way, each \(g \in \mathscr {H}\) represents a different linear system. Furthermore, the regularization network (6.32) implements regularized system identification in continuous time and \(\hat{\theta }\) is the continuous-time impulse response estimate. The class of kernels which include the BIBO stability constraint will be discussed in the next chapter.
6.6.2 Kernels Given by a Finite Number of Basis Functions
Assume we are given an input space \(\mathscr {X}\) and m independent functions \(\rho _i:\mathscr {X}\rightarrow \mathbb {R}\). Then, we define
$$\begin{aligned} K(x,y) = \sum _{i=1}^m \rho _i(x) \rho _i(y). \end{aligned}$$
It is easy to verify that K is a positive definite kernel. Recalling Theorem 6.13, the associated RKHS coincides with the m-dimensional space spanned by the basis functions \(\rho _i\). Each function in \(\mathscr {H}\) has the representation \(g(x) = \sum _{i=1}^m \theta _i \rho _i(x)\) and, in view of (6.20) and the independence of the basis functions, one has
$$\begin{aligned} \Vert g \Vert ^2_{\mathscr {H}} = \sum _{i=1}^m \theta _i^2 = \Vert \theta \Vert ^2. \end{aligned}$$
Consider now the regularization network (6.32) equipped with such hypothesis space. The solution can be computed without using the representer theorem by plugging in (6.32) the expression of g as a function of \(\theta \). Letting \(\varPhi \in {\mathbb R}^{N \times m}\) with \(\varPhi _{ij} = \rho _j(x_i)\), we obtain \(\hat{g} = \sum _{i=1}^m \ \hat{\theta }_i \rho _i\) with
$$\begin{aligned} \hat{\theta } = \mathop {\text {arg min}}\limits _{\theta \in \mathbb {R}^m} \ \Vert Y - \varPhi \theta \Vert ^2 + \gamma \Vert \theta \Vert ^2 = \left( \varPhi ^T \varPhi + \gamma I_m \right) ^{-1} \varPhi ^T Y. \end{aligned}$$(6.41)
The solution (6.41) coincides with the ridge regression estimate introduced in Sect. 1.2.
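The equivalence between the “primal” ridge form (6.41) and the kernel form of the regularization network can be checked numerically: with \(K(x,y)=\sum _i\rho _i(x)\rho _i(y)\), the kernel matrix is \(\varPhi \varPhi ^T\) and the two estimators give the same predictions. The basis (monomials), the data and \(\gamma \) in the sketch below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
N, m = 50, 6
X = rng.uniform(-1, 1, N)
Phi = np.column_stack([X ** j for j in range(m)])   # basis functions rho_j(x) = x^j
Y = np.sin(3 * X) + 0.1 * rng.standard_normal(N)
gamma = 0.5

theta_hat = np.linalg.solve(Phi.T @ Phi + gamma * np.eye(m), Phi.T @ Y)   # ridge, as in (6.41)
K = Phi @ Phi.T                                     # kernel matrix of K(x,y) = sum_j rho_j(x) rho_j(y)
c_hat = np.linalg.solve(K + gamma * np.eye(N), Y)   # regularization network coefficients
print(np.max(np.abs(Phi @ theta_hat - K @ c_hat)))  # agreement up to round-off
```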
6.6.3 Feature Map and Feature Space \(\star \)
Let \(\mathscr {F}\) be a space endowed with an inner product, and assume that a representation of the form
$$\begin{aligned} K(x,y) = \langle \phi (x), \phi (y) \rangle _{\mathscr {F}}, \quad \phi : \mathscr {X} \rightarrow \mathscr {F}, \end{aligned}$$(6.42)
is available. Then, it follows immediately that K is a positive definite kernel. In this context, \(\phi \) is called a feature map, and \(\mathscr {F}\) the feature space. For instance, to have the connection with the kernel discussed in the previous subsection, we can think of \(\phi \) as a vector containing m functions. It is defined for any x by
$$\begin{aligned} \phi (x) = \left[ \rho _1(x) \ \ldots \ \rho _m(x) \right] ^T, \end{aligned}$$
so that \(\mathscr {F}=\mathbb {R}^m\) with the Euclidean inner product. Then, we obtain
$$\begin{aligned} \langle \phi (x), \phi (y) \rangle _{\mathscr {F}} = \sum _{i=1}^m \rho _i(x) \rho _i(y) = K(x,y). \end{aligned}$$
Now, given any positive definite kernel K, Theorem 6.2 (Moore–Aronszajn theorem) implies the existence of at least one feature map, namely, the RKHS map \(\phi _{\mathscr {H}}:\mathscr {X} \rightarrow \mathscr {H}\) such that
$$\begin{aligned} \phi _{\mathscr {H}}(x) = K_x, \qquad K(x,y) = \langle K_x, K_y \rangle _{\mathscr {H}}, \end{aligned}$$
where the representation (6.42) follows immediately from the reproducing property. These arguments show that K is a positive definite kernel iff there exists at least one Hilbert space \(\mathscr {F}\) and a map \(\phi : \mathscr {X} \rightarrow \mathscr {F}\) such that \(K(x,y)=\langle \phi (x), \phi (y) \rangle _{\mathscr {F}}\).
Feature maps and feature spaces are not unique since, by introducing any linear isometry \(I:\mathscr {H} \rightarrow \mathscr {F}\), one can obtain a representation in a different space:
Now, assume that the kernel admits the decomposition (6.8), i.e.,
$$\begin{aligned} K(x,y) = \sum _{i \in \mathscr {I}} \zeta _i \rho _i(x) \rho _i(y), \end{aligned}$$
with \(\zeta _i > 0 \ \forall i\). Then, a spectral feature map of K is
$$\begin{aligned} \phi (x) = \left[ \sqrt{\zeta _1} \rho _1(x) \ \ \sqrt{\zeta _2} \rho _2(x) \ \ \ldots \right] ^T, \end{aligned}$$
with
$$\begin{aligned} \mathscr {F} = \ell _2. \end{aligned}$$
In fact, we have
$$\begin{aligned} \langle \phi (x), \phi (y) \rangle _{2} = \sum _{i \in \mathscr {I}} \zeta _i \rho _i(x) \rho _i(y) = K(x,y). \end{aligned}$$
It is worth also pointing out the role of the feature map within the estimation scenario. In many applications, linear functions are not models powerful enough. Kernels define more expressive spaces by (implicitly) mapping the data into a high-dimensional feature space where linear machines can be applied. Then, the use of the estimator (6.21) does not require to know any feature map associated to K: the representer theorem shows that the only information needed to compute the estimate is the kernel matrix, as also discussed in Remark 6.3.
6.6.4 Polynomial Kernels
Another example of kernel is the (inhomogeneous) polynomial kernel [70]. For \(x,y \in \mathbb {R}^m\), it is defined by
$$\begin{aligned} K(x,y) = \left( \langle x, y \rangle _2 + c \right) ^p, \end{aligned}$$
with \(\langle \cdot , \cdot \rangle _2\) denoting the classical Euclidean inner product. As an example, assume \(c=1\) and \(m=p=2\) with \(x=[x_a \ x_b]\) and \(y=[y_a \ y_b]\). Then, one obtains the kernel expansion
$$\begin{aligned} K(x,y) = 1 + 2x_ay_a + 2x_by_b + 2x_ax_by_ay_b + x_a^2y_a^2 + x_b^2y_b^2, \end{aligned}$$
of the type (6.8) with the \(\rho _i(x_a,x_b)\) given by all the monomials of degree up to 2, i.e., the 6 functions
$$\begin{aligned} 1, \quad \sqrt{2} x_a, \quad \sqrt{2} x_b, \quad \sqrt{2} x_a x_b, \quad x_a^2, \quad x_b^2. \end{aligned}$$
More in general, if \(c>0\), the polynomial kernel induces a \(\left( {\begin{array}{c}m+p\\ p\end{array}}\right) \)-dimensional RKHS spanned by all possible monomials up to the pth degree. The number of basis function is thus finite but exponential in p. This simple example is in some sense opposite to that described in Sect. 6.6.2. It shows how a kernel can be used to define implicitly a rich class of basis functions.
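The implicit feature map of the polynomial kernel can be made explicit in small cases. The sketch below compares \((x^Ty+1)^2\) in \(\mathbb {R}^2\) with the inner product of the six monomial features listed above (random test vectors are used only for illustration).

```python
import numpy as np

def phi(v):
    a, b = v
    return np.array([1.0, np.sqrt(2) * a, np.sqrt(2) * b,
                     np.sqrt(2) * a * b, a ** 2, b ** 2])

rng = np.random.default_rng(6)
x, y = rng.standard_normal(2), rng.standard_normal(2)
print((x @ y + 1) ** 2)      # kernel value
print(phi(x) @ phi(y))       # inner product of explicit features: the same number
```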
6.6.5 Translation Invariant and Radial Basis Kernels
A kernel is said translation invariant if there exists \(h:\mathscr {X} \rightarrow \mathbb {R}\) such that \(K(x,y)=h(x-y)\). This class has been already encountered in Example 6.12 where its relationship with the Fourier basis (in the case of one-dimensional input space) is illustrated. A general characterization is given below, see also [80].
Theorem 6.18
(Bochner, based on [23]) A positive definite kernel K over \(\mathscr {X} = \mathbb {R}^d\) is continuous and of the form \(K(x,y)=h(x-y)\) if and only if there exists a probability measure \(\mu \) and a positive scalar \(\eta \) such that:
Translation invariant kernels include also the class of radial basis kernels (RBF) of the form \(K(x,y) = h(\Vert x-y\Vert )\) where \(\Vert \cdot \Vert \) is the Euclidean norm [85]. A notable example is the so-called Gaussian kernel
$$\begin{aligned} K(x,y) = \exp \left( - \frac{\Vert x-y \Vert ^2}{\rho } \right) , \end{aligned}$$
where \(\rho \) denotes the kernel width. This kernel is often used to model functions expected to be somewhat regular. Note however that \(\rho \) has an important role in tuning the smoothness level. A low value makes the kernel close to diagonal, so that a low norm can be assigned also to rapidly changing functions. On the other hand, as \(\rho \) grows to infinity, only functions close to being constant are given a low penalty. This is the same phenomenon illustrated in Fig. 6.1.
Another widely adopted kernel, which induces spaces of functions less regular than the Gaussian one, is the Laplacian kernel, which uses the Euclidean norm in place of the squared Euclidean norm:
$$\begin{aligned} K(x,y) = \exp \left( - \frac{\Vert x-y \Vert }{\rho } \right) . \end{aligned}$$
Differently from the kernels described in the first part of Sect. 6.6.1, as well as in Sects. 6.6.2 and 6.6.4, the RKHS associated with any non-constant RBF kernel is infinite dimensional (it cannot be spanned by a finite number of basis functions). The associated RKHS can be shown to be dense in the space of all continuous functions defined on a compact subset \(\mathscr {X} \subset \mathbb {R}^m\). This means that every continuous function can be approximated in this space with the desired accuracy, as measured by the sup-norm \(\sup _{x \in \mathscr {X}} |f(x)|\). This property is called universality. This does not imply that the RKHS induced by a universal kernel includes any continuous function. For instance, the Gaussian kernel is universal but it has been proved that its RKHS does not contain any polynomial, including the constant function [69].
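The role of the kernel width can be visualized by looking at the Gram matrix on a fixed grid: for small \(\rho \) the Gaussian Gram matrix is close to diagonal, for large \(\rho \) it is close to the constant (all-ones) matrix. A minimal sketch (grid and widths chosen only for illustration):

```python
import numpy as np

X = np.linspace(0, 1, 6)
D2 = (X[:, None] - X[None, :]) ** 2
for rho in [0.001, 0.1, 100.0]:
    K = np.exp(-D2 / rho)
    off_diag = K[~np.eye(len(X), dtype=bool)]
    print(rho, off_diag.min(), off_diag.max())   # nearly 0 for small rho, nearly 1 for large rho
```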
6.6.6 Spline Kernels
To simplify the exposition, let \(\mathscr {X}=[0,1]\) and let also \(g^{(j)}\) denote the jth derivative of g, with \(g^{(0)}:=g\). Intuitively, in many circumstances an effective regularizer is obtained by penalizing the energy of the pth derivative of g, i.e., employing
An interesting question is whether this penalty term can be cast in the RKHS theory. For \(p=1\), a positive answer has been given by Example 6.5. Actually, the answer is positive for any integer p. In fact, consider the Sobolev space of functions g whose first \(p-1\) derivatives are absolutely continuous and satisfy \(g^{(j)}(0)=0\) for \(j=0,\ldots ,p-1\). The same arguments developed in Example 6.5 when \(p=1\) can be easily generalized to prove that this is a RKHS \(\mathscr {H}\) with norm
The corresponding kernel is the pth-order spline kernel
where \(G_p\) is the so-called Green’s function given by
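A standard way of writing this construction (a sketch, consistent with the Laplace-transform remark that follows) is
$$ K(x,y) = \int _0^1 G_p(x,u) \, G_p(y,u) \, du, \qquad G_p(x,u) = \frac{(x-u)_+^{p-1}}{(p-1)!}, $$
where \((\cdot )_+\) denotes the positive part.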
Note that the Laplace transform of \(G_p(\cdot ,0)\) is \(1/s^p\). Hence, the Green’s function is connected with the impulse response of a p-fold integrator. When \(p=1\), we recover the linear spline kernel of Example 6.5:
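In one common form, this kernel reads
$$ K(x,y) = \min (x,y), $$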
whereas \(p=2\) leads to the popular cubic spline kernel [104]:
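Under the construction sketched above, the latter can be written as
$$ K(x,y) = \frac{x\,y\,\min (x,y)}{2} - \frac{\min (x,y)^3}{6}. $$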
The linear and the cubic spline kernel are displayed in Fig. 6.2.
We can use the spline hypothesis space in the regularization problem (6.21). Then, from the representer theorem one obtains that the estimate \(\hat{g}\) is a pth-order smoothing spline with derivatives continuous exactly up to order \(2p-2\) (the choice of the order is thus related to the expected function smoothness). This can also be seen from the kernel sections plotted in Fig. 6.2 for p equal to 1 and 2. For \(p=2\), the (finite) sum of kernel sections provides the well-known cubic smoothing splines, i.e., piecewise third-order polynomials.
Spline functions enjoy many numerical properties originally studied in the interpolation scenario. In particular, piecewise polynomials circumvent Runge’s phenomenon (large oscillations affecting the reconstructed function) which, e.g., arises when high-order polynomials are employed [81]. Fit convergence rates are discussed, e.g., in [3, 14].
6.6.7 The Bias Space and the Spline Estimator
Bias space As discussed in Sect. 4.5, in a Bayesian setting, in some cases it can be useful to enrich \(\mathscr {H}\) with a low-dimensional parametric part, known in the literature as the bias space. The bias space typically consists of linear combinations of functions \(\{\phi _k\}_{k=1}^m\). For instance, if the unknown function exhibits a linear trend, one may let \(m=2\) and \(\phi _1(x)=1,\phi _2(x)=x\). Then, one can assume that g is the sum of two functions, one in \(\mathscr {H} \) and the other in the bias space. In this way, the function space becomes \(\mathscr {H} + \text{ span } \{ \phi _1,\ldots , \phi _m\}\). Using a quadratic loss, the regularization problem is given by
and the overall function estimate turns out to be \(\hat{g} = \hat{f} + \sum _{k=1}^{m} \hat{\theta }_k\phi _k\). Note that the expansion coefficients in \(\theta \) are not subject to any penalty term, but a low value of m avoids overfitting. The solution can be computed by exploiting an extended version of the representer theorem. In particular, it holds that
where, assuming that \(\varPhi \in {\mathbb R}^{N \times m}\) is full column rank and \(\varPhi _{ij} = \phi _j(x_i)\),
Remark 6.5
(Extended version of the representer theorem) The correctness of formulas (6.51a–6.51c) can be easily verified as follows. Fix \(\theta \) to the optimizer \(\hat{\theta }\) in the objective present in the rhs of (6.49). Then, we can use the representer theorem with Y replaced by \(Y-\varPhi \hat{\theta }\) to obtain \(\hat{f} = \sum _{i=1}^{N} \hat{c}_i K_{x_i} \) with
with A indeed given by (6.51c). This proves (6.51b). Using the definition of A this also implies
Now, if we fix f to \(\hat{f}\), the optimizer \(\hat{\theta }\) is just the least squares estimate of \(\theta \) with Y replaced by \(Y- \mathbf {K} \hat{c}\). Hence, we obtain
Using \(Y-\mathbf {K} \hat{c}= \varPhi \hat{\theta } + \gamma \hat{c}\) in the expression for \(\hat{\theta }\), we obtain \(\left( \varPhi ^T \varPhi \right) ^{-1} \varPhi ^T \hat{c}=0\). Multiplying the lhs and rhs of (6.51b) by \(\left( \varPhi ^T \varPhi \right) ^{-1} \varPhi ^T\) and using this last equality, (6.51a) is finally obtained.
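Collecting the steps of the remark, one consistent set of closed-form expressions for the quantities referenced above (a sketch in the notation of this section, with I the \(N \times N\) identity matrix) is
$$ A = \left( \mathbf {K} + \gamma I \right) ^{-1}, \qquad \hat{\theta } = \left( \varPhi ^T A \varPhi \right) ^{-1} \varPhi ^T A Y, \qquad \hat{c} = A \left( Y - \varPhi \hat{\theta } \right) . $$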
The spline estimator The bias space is useful, e.g., when spline kernels are adopted. In fact, the spline space of order p contains functions all satisfying the constraints \(g^{(j)}(0)=0\) for \(j=0,\ldots ,p-1\). Then, to cope with nonzero initial conditions, one can enrich such RKHS with polynomials up to order \(p-1\). The enriched space is \(\mathscr {H} \oplus \text{ span } \{1,x,\ldots ,x^{p-1}\}\), with \(\oplus \) denoting a direct sum, and enjoys the universality property mentioned at the end of Sect. 6.6.5. The resulting spline estimator becomes a notable example of (6.49): it solves
whose explicit solution is given by (6.50) setting \(\phi _k(x)=x^{k-1}\) and \(\varPhi _{ij} = x_i^{j-1}\).
Fig. 6.5 Cubic spline estimator (6.52) with three different values of the regularization parameter: truth (red thick line), noisy data (\(\circ \)) and estimate (black solid line)
We consider a simple numerical example to illustrate the estimator (6.52) and the impact of different choices of \(\gamma \) on its performance. The task is the reconstruction of the function \(g(x)=e^{\sin (10x)}\), with \(x \in [0,1]\), from 100 direct samples corrupted by white Gaussian noise with standard deviation 0.3. The estimates obtained from (6.52) with \(p=2\) and three different values of \(\gamma \) are displayed in the three panels of Fig. 6.5. The cubic spline estimate plotted in the top left panel is affected by oversmoothing: the overly large value of \(\gamma \) overweights the norm of f in the objective (6.52), introducing a large bias. Hence, the model is too rigid and unable to describe the data. The top right panel displays the opposite situation, obtained by adopting too low a value of \(\gamma \), which overweights the loss function in (6.52). This leads to a high-variance estimator: the model is overly flexible and overfits the measurements. Finally, the estimate in the bottom panel of Fig. 6.5 is obtained using the regularization parameter that is optimal in the MSE sense. The good trade-off between bias and variance leads to an estimate close to the truth. As already pointed out in the previous chapters, the choice of \(\gamma \) can thus be interpreted as the counterpart of model order selection in the classical parametric paradigm.
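The following is a minimal numerical sketch of this experiment, assuming the cubic spline kernel form and the linear system characterizing the bias-space solution sketched above; the function names, the random seed and the specific value of gamma are illustrative and not taken from the text.

import numpy as np

def cubic_spline_kernel(x, y):
    # Cubic spline kernel on [0, 1] (one common form, p = 2)
    m = np.minimum(x, y)
    return x * y * m / 2.0 - m ** 3 / 6.0

rng = np.random.default_rng(0)
N = 100
x = rng.uniform(0.0, 1.0, N)
y = np.exp(np.sin(10.0 * x)) + 0.3 * rng.standard_normal(N)

gamma = 1e-4                                       # illustrative value; the text tunes it in the MSE sense
K = cubic_spline_kernel(x[:, None], x[None, :])    # N x N kernel matrix
Phi = np.column_stack([np.ones(N), x])             # bias space: span{1, x}

# Linear system for the bias-space solution:
#   (K + gamma*I) c + Phi theta = y,    Phi^T c = 0
M = np.block([[K + gamma * np.eye(N), Phi],
              [Phi.T, np.zeros((2, 2))]])
sol = np.linalg.solve(M, np.concatenate([y, np.zeros(2)]))
c, theta = sol[:N], sol[N:]

def g_hat(xnew):
    # Evaluate the estimate at new input locations
    xnew = np.atleast_1d(xnew)
    return cubic_spline_kernel(xnew[:, None], x[None, :]) @ c + theta[0] + theta[1] * xnew

xg = np.linspace(0.0, 1.0, 200)
mse = np.mean((g_hat(xg) - np.exp(np.sin(10.0 * xg))) ** 2)
print(f"MSE on a fine grid: {mse:.4f}")

Re-running the script with larger or smaller values of gamma reproduces, qualitatively, the oversmoothing and overfitting behaviours described above.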
6.7 Asymptotic Properties \(\star \)
6.7.1 The Regression Function/Optimal Predictor
In what follows, we use \(\mu \) to indicate a probability measure on the input space \(\mathscr {X}\). For simplicity, we assume that it admits a probability density function (pdf) denoted by \({\mathrm p}_{x}\). The input locations \(x_i\) are now seen as random quantities and \({\mathrm p}_{x}\) models the stochastic mechanism through which they are drawn from \(\mathscr {X}\). For instance, in the system identification scenario treated in Sect. 6.6.1, each input location contains system input values, e.g., see (6.40). If we assume that the input is a stationary stochastic process, all the \(x_i\) indeed follow the same pdf \({\mathrm p}_x\).
Let also \(\mathscr {Y}\) indicate the output space. Then, \({\mathrm p}_{yx}\) denotes the joint pdf on \(\mathscr {X} \times \mathscr {Y}\) which factorizes into \({\mathrm p}_{y|x}(y|x){\mathrm p}_{x}(x)\). Here, \({\mathrm p}_{y|x}\) is the pdf of the output y conditional on a particular realization x.
Let us now introduce some important quantities that are functions of \(\mathscr {X},\mathscr {Y}\) and \({\mathrm p}_{yx}\). Given a function f, the least squares error associated with f is defined by
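In one standard form, this error is
$$ \mathrm {Err}(f) = \int _{\mathscr {X} \times \mathscr {Y}} \big ( y - f(x) \big )^2 \, {\mathrm p}_{yx}(x,y) \, dx \, dy = \mathbb {E} \big [ (y - f(x))^2 \big ] . $$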
The following result, also discussed in [33], characterizes the minimizer of \(\mathrm {Err}(f)\) and has connections with Theorem 4.1.
Theorem 6.19
(The regression function, based on [33]) We have
where \(f_{\rho }\) is the so-called regression function defined by
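In symbols, a standard way of stating these two relations is
$$ \mathrm {Err}(f_{\rho }) = \min _{f} \ \mathrm {Err}(f), \qquad f_{\rho }(x) = \int _{\mathscr {Y}} y \, {\mathrm p}_{y|x}(y|x) \, dy = \mathbb {E}[y \, | \, x] . $$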
One can see that the regression function does not depend on the marginal density \({\mathrm p}_{x}\) but only on the conditional \({\mathrm p}_{y|x}\). For any given x, it corresponds to the posterior mean (Bayes estimate) of the output y conditional on x. The proof of this fact is easily obtained by first using the decomposition reported below and then noticing that its very last term is independent of f.
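The decomposition in question can be written as
$$ \mathrm {Err}(f) = \int _{\mathscr {X}} \big ( f(x) - f_{\rho }(x) \big )^2 \, {\mathrm p}_x(x) \, dx + \mathrm {Err}(f_{\rho }), $$
where the cross term vanishes since, for every x, the conditional mean of \(y - f_{\rho }(x)\) is zero.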
Theorem 6.19 shows that \(f_{\rho }\) is the best output predictor in the sense that it minimizes the expected quadratic loss (MSE) on a new output drawn from \({\mathrm p}_{yx}\). Now, we will consider a scenario where \({\mathrm p}_{y|x}\) (and possibly also \({\mathrm p}_{x}\)) is unknown and only N samples \(\{x_i,y_i\}_{i=1}^N\) from \({\mathrm p}_{yx}\) are available. We will study the asymptotic properties (N growing to infinity) of the regularized approaches previously described. The regularization network case is treated in the following subsection.
6.7.2 Regularization Networks: Statistical Consistency
Consider the following regularization network
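In the notation used here, this estimator can be written as
$$ \hat{g}_N = \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {H}} \ \frac{1}{N} \sum _{i=1}^N \big ( y_i - f(x_i) \big )^2 + \gamma \Vert f \Vert _{\mathscr {H}}^2 , $$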
which coincides with (6.32) except for the introduction of the scale factor 1/N in the quadratic loss. We have also stressed the dependence of the estimate on the data set size N. Our goal is to assess whether \(\hat{g}_N\) converges to \(f_{\rho }\) as \(N \rightarrow \infty \) using the norm \(\Vert \cdot \Vert _{\mathscr {L}_2^{\mu }}\) defined by the pdf \({\mathrm p}_x\) as follows
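Explicitly,
$$ \Vert f \Vert _{\mathscr {L}_2^{\mu }}^2 = \int _{\mathscr {X}} f^2(x) \, {\mathrm p}_x(x) \, dx . $$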
First, details on the data generation process are provided.
Data generation assumptions The probability measure \(\mu \) on \(\mathscr {X}\) is assumed to be Borel nondegenerate. As already recalled, this means that realizations from \({\mathrm p}_{x}\) can cover \(\mathscr {X}\) entirely, without holes. This happens, e.g., when \({\mathrm p}_x(x)>0 \ \forall x \in \mathscr {X}\). The stochastic processes \(x_i\) and \(y_i\) are jointly stationary, with joint pdf \({\mathrm p}_{yx}\).
The study is not limited to the i.i.d. case. This is important, e.g., in system identification where, as visible in (6.40), input locations contain past input values shifted in time, hence introducing correlation among the \(x_i\). Let a, b indicate two integers with \(a \le b\). Then, \(\mathscr {M}_a^b\) denotes the \(\sigma \)-algebra generated by \((x_a,y_a),\ldots ,(x_b,y_b)\). The process (x, y) is said to satisfy a strong mixing condition if there exists a sequence of real numbers \(\psi _m\), converging to zero, such that, \(\forall k,m\ge 1\), the condition reported below holds.
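One standard way of stating this requirement is
$$ \sup _{A \in \mathscr {M}_1^k, \ B \in \mathscr {M}_{k+m}^{\infty }} \big | P(A \cap B) - P(A) P(B) \big | \le \psi _m, \qquad \lim _{m \rightarrow \infty } \psi _m = 0 . $$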
Intuitively, if a, b represent different time instants, this means that the random variables tend to become independent as their temporal distance increases.
Assumption 6.20
(Data generation and strong mixing condition) The probability measure \(\mu \) on the input space (having pdf \({\mathrm p}_x\)) is nondegenerate. In addition, the random variables \(x_i\) and \(y_i\) form two jointly stationary stochastic processes, with finite moments up to the third order and satisfy a strong mixing condition. Finally, denoting with \(\psi _i\) the mixing coefficients, one has
Consistency Result
The following theorem, whose proof is in Sect. 6.9.6, illustrates the convergence in probability of (6.55) to the best output predictor.
Theorem 6.21
(Statistical consistency of the regularization networks) Let \(\mathscr {H}\) be a RKHS of functions \(f: \mathscr {X} \rightarrow \mathbb {R}\) induced by the Mercer kernel K, with \(\mathscr {X}\) a compact metric space. Assume that \(f_{\rho } \in \mathscr {H}\) and that Assumption 6.20 holds. In addition, let
where \(\alpha \) is any scalar in \((0,\frac{1}{2})\). Then, as N goes to infinity, one has
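In the \(\mathscr {L}_2^{\mu }\) norm defined above, this convergence reads
$$ \Vert \hat{g}_N - f_{\rho } \Vert _{\mathscr {L}_2^{\mu }} \ \longrightarrow _p \ 0, $$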
where \(\longrightarrow _p\) denotes convergence in probability.
The meaning of (6.56) is the following. The regularizer \(\Vert \cdot \Vert _{\mathscr {H}}^2\) in (6.55) restores the well-posedness of the problem by introducing some bias in the estimation process. Intuitively, to have consistency, the amount of regularization should decay to zero as N goes to \(\infty \), but not too rapidly, in order to keep the variance term under control. This can be obtained by making the regularization parameter \(\gamma \) go to zero at the rate suggested by (6.56).
6.7.3 Connection with Statistical Learning Theory
We now discuss the class of estimators (6.21) within the framework of statistical learning theory.
Learning problem Let us consider the problem of learning from examples as defined in statistical learning. The starting point is that described in Sect. 6.7.1. There is an unknown probabilistic relationship between the variables x and y described by the joint pdf \({\mathrm p}_{yx}\) on \(\mathscr {X} \times \mathscr {Y}\). We are given examples \(\{x_i,y_i\}_{i=1}^N\) of this relationship, called training data, which are independently drawn from \({\mathrm p}_{yx}\). The aim of the learning process is to obtain an estimator \(\hat{g}_N\) (a map from the training set to a space of functions) able to predict the output y given any \(x \in \mathscr {X}\).
Generalization and consistency In the statistical learning scenario, the two fundamental properties of an estimator are generalization and consistency. To introduce them, we first need a loss function \(\mathscr {V}(y,f(x))\), also called the risk functional. Then, the mean error associated with a function f is the expected risk given by
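In one standard form, denoting the expected risk by \(I(f)\) (the notation also adopted in Footnote 3),
$$ I(f) = \int _{\mathscr {X} \times \mathscr {Y}} \mathscr {V}\big (y, f(x)\big ) \, {\mathrm p}_{yx}(x,y) \, dx \, dy . $$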
Note that, in the quadratic loss case, the expected risk coincides with the error already introduced in (6.53). Given a function f, the empirical risk is instead defined by
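Denoting the empirical risk by \(I_N(f)\) (a symbol introduced here for convenience), it reads
$$ I_N(f) = \frac{1}{N} \sum _{i=1}^N \mathscr {V}\big (y_i, f(x_i)\big ) . $$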
Then, we introduce a class of functions forming the hypothesis space \(\mathscr {F}\) where the predictor is searched for. The ideal predictor, also called the target function, is given by (see Footnote 3)
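In the notation just introduced,
$$ f_0 = \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {F}} \ I(f) . $$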
In general, even when a quadratic loss is chosen, \(f_{0}\) does not coincide with the regression function \(f_{\rho }\) introduced in (6.54), since \(\mathscr {F}\) may not contain \(f_{\rho }\).
The concepts of generalization and consistency trace back to [97, 99,100,101]. Below, recall that \(\hat{g}_N\) is stochastic since it is a function of the training set, which contains the random variables \(\{x_i,y_i\}_{i=1}^N\).
Definition 6.3
(Generalization and consistency, based on [102]) The estimator \(\hat{g}_N\) (uniformly) generalizes if \(\forall \varepsilon >0\):
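One standard formulation of this requirement is
$$ \lim _{N \rightarrow \infty } \ \sup _{{\mathrm p}_{yx}} \ P \Big \{ \big | I(\hat{g}_N) - I_N(\hat{g}_N) \big | > \varepsilon \Big \} = 0 . $$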
The estimator is instead (universally) consistent if \(\forall \varepsilon >0\):
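Correspondingly, one standard formulation of consistency is
$$ \lim _{N \rightarrow \infty } \ \sup _{{\mathrm p}_{yx}} \ P \Big \{ I(\hat{g}_N) - I(f_0) > \varepsilon \Big \} = 0 . $$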
From (6.61), one can see that generalization implies that the performance on the training set (the empirical error) must converge to the “true” performance on future outputs (the expected error). The presence of the \(\sup _{{\mathrm p}_{yx}}\) is then to indicate that this property must hold uniformly w.r.t. all the possible stochastic mechanisms which generate the data. Consistency, as defined in (6.62), instead requires the expected error of \(\hat{g}_N\) to converge to the expected error achieved by the best predictor in \(\mathscr {F}\). Note that the reconstruction of \(f_{0}\) is not required. The goal is that \(\hat{g}_N\) be able to mimic the prediction performance of \(f_0\) asymptotically. Key issues in statistical learning theory are the understanding of the conditions on \(\hat{g}_N\), the function class \(\mathscr {F}\) and the loss \(\mathscr {V}\) which ensure such properties.
Empirical Risk Minimization
The most natural technique to determine \(f_0\) from data is the empirical risk minimization (ERM) approach where the empirical risk is optimized:
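In the notation above, the ERM estimator is
$$ \hat{g}_N = \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {F}} \ I_N(f) = \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {F}} \ \frac{1}{N} \sum _{i=1}^N \mathscr {V}\big (y_i, f(x_i)\big ) . $$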
The study of ERM has provided a full characterization of the necessary and sufficient conditions for its generalization and consistency. To introduce them, we first need to provide further details on the data generation assumptions.
Assumption 6.22
(Data generation assumptions) It holds that
-
the \(\{x_i,y_i\}_{i=1}^N\) are i.i.d. and each couple has joint pdf \({\mathrm p}_{yx}\);
-
the input space \(\mathscr {X}\) is a compact set in the Euclidean space;
-
\(y \in \mathscr {Y}\) almost surely with \(\mathscr {Y}\) a bounded real set;
-
the class of functions \(\mathscr {F}\) is bounded, e.g., under the sup-norm;
-
\(A \le \mathscr {V}(y,f(x)) \le B\), for \(f \in \mathscr {F},y \in \mathscr {Y}\), with A, B finite and independent of f and y. \(\square \)
Note that, if the first four points hold true, in practice any loss function of interest, such as quadratic, Huber or Vapnik, satisfies the last requirement.
We now introduce the concept of \(V_{\gamma }\)-dimension [5]. It is a complexity measure which extends the concept of Vapnik–Chervonenkis (VC) dimension, originally introduced for indicator functions.
Definition 6.4
(\(V_{\gamma }\) -dimension, based on [5]) Let Assumption 6.22 hold. The \(V_{\gamma }\)-dimension of \(\mathscr {V}\) in \(\mathscr {F}\), i.e., of the set \(\mathscr {V}(y,f(x)), \ f \in \mathscr {F}\), is defined as the maximum number h of vectors \((x_1,y_1),\ldots ,(x_h,y_h)\) that can be separated into two classes in all \(2^h\) possible ways using the rules reported below, for \(f \in \mathscr {F}\) and some \(s\ge 0\). If, for any h, it is possible to find h pairs \((x_1,y_1),\ldots ,(x_h,y_h)\) that can be separated in all the \(2^h\) possible ways, the \(V_{\gamma }\)-dimension of \(\mathscr {V}\) in \(\mathscr {F}\) is infinite.
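In one standard formulation, the separation rules in question assign a pair \((x_i,y_i)\) to the first or to the second class according to
$$ \mathscr {V}\big (y_i, f(x_i)\big ) \ge s + \gamma \qquad \text{ or } \qquad \mathscr {V}\big (y_i, f(x_i)\big ) \le s - \gamma . $$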
So, the \(V_{\gamma }\)-dimension is infinite if, for any data set size, one can always find a function f and a set of points which can be separated by f in any possible way. Note that the margin required to distinguish the two classes increases as \(\gamma \) increases. This means that the \(V_{\gamma }\)-dimension is a monotonically decreasing function of \(\gamma \).
The following definition deals with the uniform, distribution-free convergence of empirical means to expectations for classes of real-valued functions. It is related to the so-called uniform laws of large numbers.
Definition 6.5
(Uniform Glivenko–Cantelli class, based on [5]) Let \(\mathscr {G}\) denote a space of functions \(\mathscr {Z} \rightarrow \mathscr {R}\), where \(\mathscr {R} \) is a bounded real set, and let \({\mathrm p}_z\) denote a generic pdf on \(\mathscr {Z}\). Then, \(\mathscr {G}\) is said to be a Uniform Glivenko–Cantelli (uGC) class (see Footnote 4) if
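In one standard formulation, with \(z_i\) denoting i.i.d. samples drawn from \({\mathrm p}_z\), the defining condition reads
$$ \forall \varepsilon > 0: \quad \lim _{N \rightarrow \infty } \ \sup _{{\mathrm p}_z} \ P \Bigg \{ \sup _{g \in \mathscr {G}} \Bigg | \frac{1}{N} \sum _{i=1}^N g(z_i) - \mathbb {E}\big [ g(z) \big ] \Bigg | > \varepsilon \Bigg \} = 0 . $$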
It turns out that, under the ERM framework, generalization and consistency are equivalent concepts. Moreover, the finiteness of the \(V_{\gamma }\)-dimension coincides with the concept of uGC class relative to the adopted losses and turns out to be the necessary and sufficient condition for generalization and consistency [5]. This is formalized below.
Theorem 6.23
(ERM and \(V_{\gamma }\)-dimension, based on [5]) Let Assumption 6.22 hold. The following facts are then equivalent:
-
ERM (uniformly) generalizes.
-
ERM is (uniformly) consistent.
-
The \(V_{\gamma }\)-dimension of \(\mathscr {V}\) in \(\mathscr {F}\) is finite for any \(\gamma >0\).
-
The class of functions \(\mathscr {V}(y,f(x))\) with \(f \in \mathscr {F}\) is uGC.
In the last point regarding the uGC class, one can follow Definition 6.5 using the correspondences \(\mathscr {Z}=\mathscr {X} \times \mathscr {Y}\), \(z=(x,y)\), \({\mathrm p}_{z}={\mathrm p}_{yx}\) and \(\mathscr {R}=[A,B]\).
Connection with Regularization in RKHS
The connection between statistical learning theory and the class of kernel-based estimators (6.21) is obtained using as function space \(\mathscr {F}\) a ball \(\mathscr {B}_r\) in a RKHS \(\mathscr {H}\), i.e.,
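that is, in one common convention,
$$ \mathscr {B}_r = \big \{ f \in \mathscr {H} \, : \, \Vert f \Vert _{\mathscr {H}} \le r \big \} . $$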
The ERM method (6.63) becomes
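that is, a constrained problem of the form
$$ \min _{f \in \mathscr {H}} \ \frac{1}{N} \sum _{i=1}^N \mathscr {V}\big (y_i, f(x_i)\big ) \quad \text{ subject } \text{ to } \quad \Vert f \Vert _{\mathscr {H}} \le r, $$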
which is an inequality constrained optimization problem. Exploiting Lagrangian theory, we can find a positive scalar \(\gamma \), a function of r and of the data set size N, which makes (6.65) equivalent to
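that is, to a penalized problem of the form
$$ \min _{f \in \mathscr {H}} \ \frac{1}{N} \sum _{i=1}^N \mathscr {V}\big (y_i, f(x_i)\big ) + \gamma \Vert f \Vert _{\mathscr {H}}^2, $$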
which, apart from constants, coincides with (6.21). The question now is whether (6.65) is consistent in the sense of statistical learning theory. The answer is positive. In fact, under Assumption 6.22, it can be proved that the class of functions \(\mathscr {V}\) in \(\mathscr {F}\) is uGC if \(\mathscr {F}\) is uGC. In addition, one sufficient (but not necessary) condition for \(\mathscr {F}\) to be uGC is that \(\mathscr {F}\) be a compact set in the space of continuous functions. The following important result then holds.
Theorem 6.24
(Generalization and consistency of the kernel-based approaches, based on [33, 65]) Let \(\mathscr {H}\) be any RKHS induced by a Mercer kernel containing functions \(f: \mathscr {X} \rightarrow \mathbb {R}\), with \(\mathscr {X}\) a compact metric space. Then, for any r, the ball \(\mathscr {B}_r\) is compact in the space of continuous functions equipped with the sup-norm. It then follows that \(\mathscr {B}_r\) is uGC and, if Assumption 6.22 holds, the regularized estimator (6.65) generalizes and is consistent.
Theorem 6.24 thus shows that kernel-based approaches make it possible to exploit flexible infinite-dimensional models with the guarantee that the best prediction performance (achievable inside the chosen class) will be asymptotically reached.
6.8 Further Topics and Advanced Reading
Basic functional analysis principles can be found, e.g., in [59, 79, 112]. The concept of RKHS was developed in 1950 in the seminal works [13, 20]. Classical books on the subject are [6, 82, 84]. RKHSs were introduced to the machine learning community in [46, 47], leading, in conjunction with Tikhonov regularization theory [21, 96], to the development of many powerful kernel-based algorithms [42, 86].
Extensions of the theory to vector-valued RKHSs are described in [62]. This is connected to the so-called multi-task learning problem [18, 29], which deals with the simultaneous reconstruction of several functions. Here, the key point is that measurements taken on a function (task) may be informative about the other ones, see [16, 40, 68, 95] for illustrations of the advantages of this approach. Multi-task learning will be illustrated in Chap. 9 using also a numerical example based on real pharmacokinetics data.
The Mercer theorem dates back to [60], which also discusses the connection with integral equations, see also the book [50]. Extensions of the theorem to noncompact domains are discussed in [94]. The first version of the representer theorem appears in [52]. It has then been the subject of many generalizations, which can be found in [11, 36, 83, 103, 110]. Recent works have also extended the classical formulation to the context of vector-valued functions (multi-task learning and collaborative filtering), matrix regularization problems (with penalty given by spectral functions of matrices), and matricizations of tensors, see, e.g., [1, 7, 12, 54, 87]. These different types of representer theorems are cast in a general framework in [10].
The term regularization network traces back to [71] where it is illustrated that a particular regularized scheme is equal to a radial basis function network. Support vector regression and classification were introduced in [24, 31, 37, 98], see also the classical book [102]. Robust statistics are described in [51].
The term “kernel trick” was used in [83] while interpretation of kernels as inner products in a feature space was first described in [4]. Sobolev spaces are illustrated, e.g., in [2] while classical works on smoothing splines are [32, 104]. The important spline interpolation properties are described in [3, 14, 22].
Polynomial kernels were used for the first time in [70] while an application to Wiener system identification can be found in [44], as also discussed later on in Chap. 8 devoted to nonlinear system identification. An explicit (spectral) characterization of the RKHS induced by the Gaussian kernel can be found in [91, 92], while the more general case of radial basis kernels is treated in [85]. The concept of universal kernel is discussed, e.g., in [61, 90].
The strong mixing condition is discussed, e.g., in [107] and [34].
The convergence proof for the regularization network relies upon the integral operator approach described in [88] in an i.i.d. setting and its extension to the dependent case developed in [66] in the Wiener system identification context. For other works on statistical consistency and learning rates of regularized least squares in RKHS see, e.g., [48, 93, 105, 109, 111].
Statistical learning theory and the concepts of generalization and consistency, in connection with the uniform law of large numbers, date back to the works of Vapnik and Chervonenkis [97, 99,100,101]. Other related works on convergence of empirical processes are [38, 39, 73]. The concept of \(V_{\gamma }\) dimension and its equivalence with the Glivenko–Cantelli class is proved in [5], see also [41] for links with RKHS. Relationships between the concept of stability of estimates (continuous dependence on the data) and generalization/consistency can be found in [63, 72], see also [26] for previous work on this subject. Numerical computation of the regularized estimate (6.21) is discussed in the literature studying the relationship between machine learning and convex optimization [19, 25, 77]. In the regularization network case (quadratic loss), if the data set size N is large, plain application of a solver with computational cost \(O(N^3)\) can be highly inefficient. Then, one can use approximate representations of the kernel function [15, 53], based, e.g., on the Nyström method or greedy strategies [89, 106, 113]. One can also exploit the Mercer theorem by just using an mth-order approximation of K given by \(\sum _{i=1}^m \zeta _i \rho _i(x) \rho _i(y)\). The solution obtained with this kernel may provide accurate approximations also when \(m \ll N\), see [28, 43, 67, 114, 115]. Training of kernel machines can be also accelerated by using randomized low-dimensional feature spaces [74], see also [78] for insights on learning rates.
In the case of a generic convex loss (different from the quadratic one), one problem is that the objective is not differentiable everywhere. In this circumstance, the powerful interior point (IP) methods [64, 108] can be employed, which apply damped Newton iterations to a relaxed version of the Karush–Kuhn–Tucker (KKT) equations for the objective [27]. A statistical and computational framework that allows their broad application to the problem (6.21) for a wide class of piecewise linear quadratic losses can be found in [8, 9]. In practice, IP methods exhibit a relatively fast convergence behaviour. However, as in the quadratic case, a difficulty can arise if N is very large, i.e., it may not be possible to store the entire kernel matrix in memory, and this fact can hinder the application of second-order optimization techniques such as the (damped) Newton method. A way to circumvent this problem is given by the so-called decomposition methods, where a subset of the coefficients \(c_i\), called the working set, is selected, and the associated low-dimensional sub-problem is solved. In this way, only the corresponding entries of the kernel matrix need to be loaded into memory, e.g., see [30, 56,57,58]. An extreme case of decomposition method is coordinate descent, where the working set contains only one coefficient [35, 45, 49].
Notes
- 1.
One can then also easily check that the case \(\mathbf {K}_{12}=-1\) instead induces a RKHS containing only functions satisfying \(f(1)=-f(2)\).
- 2.
- 3.
Here, and also when introducing empirical risk minimization (ERM), we assume that all the introduced minimizers exist. If this does not hold true, all the concepts remain valid by resorting to the concept of almost minimizers and almost ERM, with \(I(f_0):=\inf _{f \in \mathscr {F}} \ I(f)\).
- 4.
Sometimes, the class defined by (6.5) in terms of convergence in probability is called weak uGC while almost sure convergence leads to a strong uGC. However, it can be proved that, if Assumption 6.22 holds true and the function class is the composition of the losses with \(\mathscr {F}\), the two concepts become equivalent.
References
Abernethy J, Bach F, Evgeniou T, Vert JP (2009) A new approach to collaborative filtering: operator estimation with spectral regularization. J Mach Learn Res 10:803–826
Adams RA, Fournier J (2003) Sobolev spaces. Academic Press
Ahlberg JH, Nilson EH (1963) Convergence properties of the spline fit. J Soc Indust Appl Math 11:95–104
Aizerman A, Braverman EM, Rozoner LI (1964) Theoretical foundations of the potential function method in pattern recognition learning. Autom Remote Control 25:821–837
Alon N, Ben-David S, Cesa-Bianchi N, Haussler D (1997) Scale-sensitive dimensions, uniform convergence, and learnability. J ACM 44(4):615–631
Alpay D (2003) Reproducing kernel Hilbert spaces and applications. Springer
Amit Y, Fink M, Srebro N, Ullman S (2007) Uncovering shared structures in multiclass classification. In: Proceedings of the 24th international conference on machine learning, ICML ’07, New York, NY, USA. ACM, pp 17–24
Aravkin A, Burke J, Pillonetto G (2012) Nonsmooth regression and state estimation using piecewise quadratic log-concave densities. In: Proceedings of the 51st IEEE conference on decision and control (CDC 2012)
Aravkin A, Burke JV, Pillonetto G (2013) Sparse/robust estimation and Kalman smoothing with nonsmooth log-concave densities: modeling, computation, and theory. J Mach Learn Res 14:2689–2728
Argyriou A, Dinuzzo F (2014) A unifying view of representer theorems. In: Proceedings of the 31th international conference on machine learning, vol 32, pp 748–756
Argyriou A, Micchelli CA, Pontil M (2009) When is there a representer theorem? vector versus matrix regularizers. J Mach Learn Res 10:2507–2529
Argyriou A, Micchelli CA, Pontil M (2010) On spectral learning. J Mach Learn Res 11:935–953
Aronszajn N (1950) Theory of reproducing kernels. Trans Am Math Soc 68:337–404
Atkinson KE (1968) On the order of convergence of natural cubic spline interpolation. SIAM J Numer Anal 5(1):89–101
Bach FR, Jordan MI (2005) Predictive low-rank decomposition for kernel methods. In: Proceedings of the 22nd international conference on Machine learning, ICML ’05, New York, NY, USA. ACM, pp 33–40
Bakker B, Heskes T (2003) Task clustering and gating for Bayesian multitask learning. J Mach Learn Res 4:83–99
Bartlett PL, Long PM, Lugosi G, Tsigler A (2020) Benign overfitting in linear regression. PNAS 117:30063–30070
Baxter J (1997) A Bayesian/information theoretic model of learning to learn via multiple task sampling. Mach Learn 28:7–39
Bennett KP, Parrado-Hernandez E (2006) The interplay of optimization and machine learning research. J Mach Learn Res 7:1265–1281
Bergman S (1950) The kernel function and conformal mapping. Mathematical surveys and monographs. AMS
Bertero M (1989) Linear inverse and ill-posed problems. Adv Electron Electron Phys 75:1–120
Birkhoff G, De Boor C (1964) Error bounds for spline interpolation. J Math Mech 13:827–835
Bochner S (1933) Monotone Funktionen, Stieltjessche Integrale und harmonische Analyse. Math Ann 108:378–410
Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the 5th annual ACM workshop on computational learning theory. ACM Press, pp 144–152
Bottou L, Chapelle O, DeCoste D, Weston J (eds) (2007) Large scale kernel machines. MIT Press, Cambridge, MA, USA
Bousquet O, Elisseeff A (2002) Stability and generalization. J Mach Learn Res 2:499–526
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press
Carli FP, Chiuso A, Pillonetto G (2012) Efficient algorithms for large scale linear system identification using stable spline estimators. In: IFAC symposium on system identification
Caruana R (1997) Multitask learning. Mach Learn 28(1):41–75
Collobert R, Bengio S (2001) SVMTorch: support vector machines for large-scale regression problems. J Mach Learn Res 1:143–160
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
Craven P, Wahba G (1979) Smoothing noisy data with spline functions. Numer Math 31:377–403
Cucker F, Smale S (2001) On the mathematical foundations of learning. Bull Am Math Soc 39:1–49
Dehling H, Philipp W (1982) Almost sure invariance principles for weakly dependent vector-valued random variables. Ann Probab 10(3):689–701
Dinuzzo F (2011) Analysis of fixed-point and coordinate descent algorithms for regularized kernel methods. IEEE Trans Neural Netw 22(10):1576–1587
Dinuzzo F, Scholkopf B (2012) The representer theorem for Hilbert spaces: a necessary and sufficient condition. In: Bartlett P, Pereira FCN, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems, vol 25, pp 189–196
Drucker H, Burges CJC, Kaufman L, Smola A, Vapnik V (1997) Support vector regression machines. In: Advances in neural information processing systems
Dudley RM, Giné E, Zinn J (1991) Uniform and universal Glivenko-Cantelli classes. J Theor Probab 4(3):485–510
Dudley RM (1984) École d’Été de Probabilités de Saint-Flour XII - 1982, chapter A course on empirical processes. Springer, Berlin, Heidelberg, pp 1–142
Evgeniou T, Micchelli CA, Pontil M (2005) Learning multiple tasks with kernel methods. J Mach Learn Res 6:615–637
Evgeniou T, Pontil M (1999) On the \({V}_\gamma \) dimension for regression in reproducing kernel Hilbert spaces. In: Algorithmic learning theory, 10th international conference, ALT ’99, Tokyo, Japan, Dec 1999, Proceedings, lecture notes in artificial intelligence, vol 1720. Springer, pp 106–117
Evgeniou T, Pontil M, Poggio T (2000) Regularization networks and support vector machines. Adv Comput Math 13:1–50
Ferrari-Trecate G, Williams CKI, Opper M (1999) Finite-dimensional approximation of Gaussian processes. In: Proceedings of the 1998 conference on advances in neural information processing systems. MIT Press, Cambridge, MA, USA, pp 218–224
Franz MO, Schölkopf B (2006) A unifying view of Wiener and Volterra theory and polynomial kernel regression. Neural Comput 18:3097–3118
Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22
Girosi F (1998) An equivalence between sparse approximation and support vector machines. Neural Comput 10(6):1455–1480
Girosi F, Jones M, Poggio T (1995) Regularization theory and neural networks architectures. Neural Comput 7(2):219–269
Guo ZC, Zhou DX (2013) Concentration estimates for learning with unbounded sampling. Adv Comput Math 38(1):207–223
Ho CH, Lin CJ (2012) Large-scale linear support vector regression. J Mach Learn Res 13:3323–3348
Hochstadt H (1973) Integral equations. Wiley
Huber PJ (1981) Robust statistics. Wiley, New York, NY, USA
Kimeldorf G, Wahba G (1970) A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann Math Stat 41(2):495–502
Kulis B, Sustik M, Dhillon I (2006) Learning low-rank kernel matrices. In: Proceedings of the 23rd international conference on Machine learning, ICML ’06, New York, NY, USA. ACM, pp 505–512
Lafferty J, Zhu X, Liu Y (2004) Kernel conditional random fields: representation and clique selection. In: Proceedings of the twenty-first international conference on machine learning, ICML ’04, New York, NY, USA. ACM
Liang T, Rakhlin A (2020) Just interpolate: kernel ridgeless regression can generalize. Ann Stat 48(3):1329–1347
Lin CJ (2001) On the convergence of the decomposition method for support vector machines. IEEE Trans Neural Netw 12(12):1288–1298
List N, Simon HU (2004) A general convergence theorem for the decomposition method. In: Proceedings of the 17th annual conference on computational learning theory, pp 363–377
List N, Simon HU (2007) General polynomial time decomposition algorithms. J Mach Learn Res 8:303–321
Megginson RE (1998) An introduction to Banach space theory. Springer
Mercer J (1909) Functions of positive and negative type and their connection with the theory of integral equations. Philos Trans R Soc Lond 209(3):415–446
Micchelli CA, Xu Y, Zhang H (2006) Universal kernels. J Mach Learn Res 7:2651–2667
Micchelli CA, Pontil M (2005) On learning vector-valued functions. Neural Comput 17(1):177–204
Mukherjee S, Niyogi P, Poggio T, Rifkin R (2006) Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Adv Comput Math 25(1):161–193
Nemirovskii A, Nesterov Y (1994) Interior-point polynomial algorithms in convex programming, vol 13. SIAM, Philadelphia, PA, USA
Pillonetto G (2008) Solutions of nonlinear control and estimation problems in reproducing kernel hilbert spaces: existence and numerical determination. Automatica 44(8):2135–2141
Pillonetto G (2013) Consistent identification of Wiener systems: a machine learning viewpoint. Automatica 49(9):2704–2712
Pillonetto G, Bell BM (2007) Bayes and empirical Bayes semi-blind deconvolution using eigenfunctions of a prior covariance. Automatica 43(10):1698–1712
Pillonetto G, Dinuzzo F, De Nicolao G (2010) Bayesian on-line multi-task learning of Gaussian processes. IEEE Trans Pattern Anal Mach Intell 32(2):193–205
Pillonetto G, Quang MH, Chiuso A (2011) A new kernel-based approach for nonlinear system identification. IEEE Trans Autom Control 56(12):2825–2840
Poggio T (1975) On optimal nonlinear associative recall. Biol Cybern 19(4):201–209
Poggio T, Girosi F (1990) Networks for approximation and learning. Proc IEEE 78:1481–1497
Poggio T, Rifkin R, Mukherjee S, Niyogi P (2004) General conditions for predictivity in learning theory. Nature 428(6981):419–422
Pollard D (1989) Asymptotics via empirical processes. J Stat Sci 4(4):341–354
Rahimi A, Recht B (2007) Random features for large-scale kernel machines. Advances in neural information processing systems, pp 1177–1184
Ribeiro AH, Hendriks J, Wills A, Schön TB (2021) Beyond Occam’s razor in system identification: double-descent when modeling dynamics. In: Proceedings of the 19th IFAC symposium on system identification (SYSID), Online, July 2021
Riesz F (1909) Sur les opérations fonctionnelles linéaires. Comptes rendus de l'Académie des Sciences (in French) 149:974–977
Rockafellar RT (1970) Convex analysis. Princeton Landmarks in Mathematics. Princeton University Press
Rudi A, Rosasco L (2017) Generalization properties of learning with random features. In: Advances in neural information processing systems, pp 3218–3228
Rudin W (1987) Real and complex analysis. McGraw-Hill, Singapore
Rudin W (1990) Fourier analysis on groups. Wiley
Runge C (1901) Über empirische Funktionen und die Interpolation zwischen äquidistanten Ordinaten. Zeitschrift für Mathematik und Physik 46:224–243
Saitoh S (1988) Theory of reproducing kernels and its applications, vol 189. Pitman research notes in mathematics series. Longman Scientific and Technical, Harlow
Schölkopf B, Herbrich R, Smola AJ (2001) A generalized representer theorem. Neural Netw Comput Learn Theory 81:416–426
Schölkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. (Adaptive computation and machine learning). MIT Press
Scovel C, Hush D, Steinwart I, Theiler J (2010) Radial kernels and their reproducing kernel Hilbert spaces. J Complex 26(6):641–660
Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press
Signoretto M, Tran DQ, De Lathauwer L, Suykens JAK (2014) Learning with tensors: a framework based on convex optimization and spectral regularization. Mach Learn 94(3):303–351
Smale S, Zhou DX (2007) Learning theory estimates via integral operators and their approximations. Constr Approx 26:153–172
Smola A, Schölkopf B (2000) Sparse greedy matrix approximation for machine learning. In: Proceedings of the seventeenth international conference on machine learning, ICML ’00, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc., pp 911–918
Sriperumbudur BK, Fukumizu K, Lanckriet G (2011) Universality, characteristic kernels and RKHS embedding of measures. J Mach Learn Res 12:2389–2410
Steinwart I (2002) On the influence of the kernel on the consistency of support vector machines. J Mach Learn Res 2:67–93
Steinwart I, Hush D, Scovel C (2006) An explicit description of the reproducing kernel Hilbert space of Gaussian RBF kernels. IEEE Trans Inf Theory 52:4635–4643
Steinwart I, Hush D, Scovel C (2009) Learning from dependent observations. J Multivar Anal 100(1):175–194
Sun H (2005) Mercer theorem for RKHS on noncompact sets. J Complex 21(3):337–349
Thrun S, Pratt L (1997) Learning to learn. Kluwer
Tikhonov AN, Arsenin VY (1977) Solutions of ill-posed problems. Winston/Wiley, Washington, D.C
Vapnik V (1982) Estimation of dependences based on empirical data. Springer series in statistics. Springer, New York, NY, USA
Vapnik V (1997) The nature of statistical learning theory. Springer
Vapnik V, Chervonenkis A (1971) On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab Appl 16(2):264–280
Vapnik V, Chervonenkis A (1981) Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory Probab Appl 26:532–553
Vapnik V, Chervonenkis A (1991) The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognit Image Anal 1(3):283–305
Vapnik V (1998) Statistical learning theory. Wiley, New York, NY, USA
De Vito E, Rosasco L, Caponnetto A, Piana M, Verri A (2004) Some properties of regularized kernel methods. J Mach Learn Res 5:1363–1390
Wahba G (1990) Spline models for observational data. SIAM, Philadelphia
Wang C, Zhou DX (2011) Optimal learning rates for least squares regularized regression with unbounded sampling. J Complex 27(1):55–67
Williams CKI, Seeger M (2000) Using the Nyström method to speed up kernel machines. In: Proceedings of the 2000 conference on advances in neural information processing systems, Cambridge, MA, USA. MIT Press, pp 682–688
Withers CS (1981) Conditions for linear processes to be strong-mixing. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 57(4):477–480
Wright SJ (1997) Primal-dual interior-point methods. Siam, Englewood Cliffs, N.J., USA
Wu Q, Ying Y, Zhou DX (2006) Learning rates of least-square regularized regression. Found Comput Math 6:171–192
Yu Y, Cheng H, Schuurmans D, Szepesvari C (2013) Characterizing the representer theorem. In: Proceedings of the 30th international conference on machine learning, pp 570–578
Yuan M, Tony Cai T (2010) A reproducing kernel Hilbert space approach to functional linear regression. Ann Stat 38:3412–3444
Zeidler E (1995) Applied functional analysis. Springer
Zhang K, Kwok JT (2010) Clustered Nyström method for large scale manifold learning and dimension reduction. IEEE Trans Neural Netw 21(10):1576–1587
Zhu H, Rohwer R (1996) Bayesian regression filters and the issue of priors. Neural Comput Appl 4:130–142
Zhu H, Williams CKI, Rohwer RJ, Morciniec M (1998) Gaussian regression and optimal finite dimensional linear models. In: Neural networks and machine learning. Springer, Berlin
6.9 Appendix
6.9.1 Fundamentals of Functional Analysis
We gather some basic functional analysis definitions and results.
Vector Spaces
We will assume that the reader is familiar with the concept of real vector space V (the field is given by the real numbers). Here, we just recall that this is a set whose elements are called vectors. The space is closed w.r.t. two operations, called addition and scalar multiplication, which satisfy the usual algebraic properties. This means that any linear and finite combination of vectors still falls in V. When the vector space contains functions \(g: \mathscr {X} \rightarrow \mathbb {R}\), for any \(f,g \in V\) and \(\alpha \in \mathbb {R}\) the two operations are defined pointwise, as reported below.
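For completeness, these pointwise operations are
$$ (f+g)(x) = f(x) + g(x), \qquad (\alpha f)(x) = \alpha f(x), \qquad \forall x \in \mathscr {X} . $$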
Inner Products and Norms
An inner product on V is the function
which is
-
1.
linear in the first argument
$$ \langle \alpha v + \beta y , z \rangle = \alpha \langle v , z \rangle + \beta \langle y , z \rangle , \quad v,y,z \in V \quad \alpha ,\beta \in \mathbb {R}; $$ -
2.
symmetric (and so also linear in the second argument)
$$ \langle v , y \rangle = \langle y , v \rangle ; $$ -
3.
positive, in the sense that
$$ \langle v , v \rangle \ge 0 \quad \forall v $$with
$$ \langle v , v \rangle = 0 \ \iff \ v=0, $$where in the r.h.s. 0 denotes the null vector.
Recall also that a norm on V is the nonnegative function
which satisfies
-
1.
absolute homogeneity
$$ \Vert \alpha v \Vert = |\alpha | \Vert v\Vert , \quad v \in V \quad \alpha \in \mathbb {R}; $$ -
2.
the triangle inequality
$$ \Vert v + y \Vert \le \Vert v \Vert + \Vert y \Vert ; $$ -
3.
null vector condition
$$ \Vert v \Vert = 0 \ \iff \ v=0. $$
The norm induced by the inner product \(\langle \cdot , \cdot \rangle \) is given by
and it is easy to check that this function indeed satisfies all the three norm axioms listed above. One also has the Cauchy–Schwarz inequality
Finally, recall that both \(\langle \cdot , x \rangle \) with \(x \in V\) and \(\Vert \cdot \Vert \) are examples of continuous functionals \(V \rightarrow \mathbb {R}\), i.e., if \(\lim _{j \rightarrow \infty } \Vert v-v_j\Vert =0\), then
Hilbert and Banach Spaces
A Hilbert space \(\mathscr {H}\) is a vector space equipped with an inner product \(\langle \cdot , \cdot \rangle \) which is complete w.r.t. the norm \(\Vert \cdot \Vert \) induced by such inner product. This means that, given any Cauchy sequence, i.e., a sequence of vectors \(\{g_j\}_{j=1}^\infty \) such that
there exists \(g \in \mathscr {H}\) such that
In other words, every Cauchy sequence is convergent. Examples of Hilbert spaces used in this book are
-
the classical Euclidean space \(\mathbb {R}^m\) of vectors \(a=[a_1 \ \ldots \ a_m]\) equipped with the classical Euclidean inner product
$$ \langle a, b \rangle _2 = \sum _{i=1}^m a_i b_i $$sometimes denoted just by \(\langle \cdot , \cdot \rangle \) in the book;
-
the space \(\ell _2\) of squared summable real sequences \(a=[a_1 \ a_2 \ldots ]\), i.e., such that
$$ \sum _{i=1}^\infty a_i^2 < \infty , $$equipped with the inner product
$$ \langle a, b \rangle _2 = \sum _{i=1}^\infty a_i b_i; $$ -
the classical Lebesgue space \(\mathscr {L}_2\) of functions (where the measure \(\mu \) is here omitted to simplify notation) \(g: \mathscr {X} \rightarrow \mathbb {R}\) which are squared summable w.r.t. the measure \(\mu \), i.e., such that
$$ \int _{ \mathscr {X}} g^2(x) d\mu (x) < \infty , $$equipped with the inner product still denoted by \(\langle \cdot , \cdot \rangle _2 \) but now given by
$$ \langle g, f \rangle _2 = \int _{ \mathscr {X}} g(x)f(x) d\mu (x). $$
The spaces reported above are also instances of metric spaces where, for every couple of vectors f, g, there is a notion of distance defined by \(\Vert f-g\Vert \). Other metric spaces are the Banach spaces. They are normed vector spaces complete w.r.t. the metric induced by their norm. Hence, every Hilbert space is a Banach space but the converse is not true: this happens when \(\Vert \cdot \Vert \) does not derive from an inner product. Examples of Banach spaces (whose norm does not derive from an inner product) are
-
the space \(\ell _1\) of absolutely summable real sequences \(a=[a_1 \ a_2 \ldots ]\), i.e., such that
$$ \sum _{i=1}^\infty |a_i| < \infty , $$equipped with the norm
$$ \Vert a \Vert _1 = \sum _{i=1}^\infty |a_i|; $$ -
the Lebesgue space \(\mathscr {L}_1\) of functions \(g: \mathscr {X} \rightarrow \mathbb {R}\) absolutely integrable w.r.t. the measure \(\mu \), i.e., such that
$$ \int _{\mathscr {X}} |g(x)| d\mu (x) < \infty , $$equipped with the norm
$$ \Vert g \Vert _1 = \int _{\mathscr {X}} |g(x)| d\mu (x); $$ -
the space \(\ell _{\infty }\) of bounded real sequences \(a=[a_1 \ a_2 \ldots ]\), i.e., such that
$$ \sup _i |a_i| < \infty , $$equipped with the norm
$$ \Vert a \Vert _{\infty } = \sup _i |a_i|; $$ -
the space \(\mathscr {C}\) of continuous functions \(g: \mathscr {X} \rightarrow \mathbb {R}\), where \(\mathscr {X}\) is a compact set typically in \(\mathbb {R}^m\), equipped with the sup-norm (also called uniform norm)
$$ \Vert g \Vert _{\infty } = \max _{x \in \mathscr {X}} |g(x)|; $$ -
the Lebesgue space \(\mathscr {L}_\infty \) of functions \(g: \mathscr {X} \rightarrow \mathbb {R}\) which are essentially bounded w.r.t. the measure \(\mu \), i.e., for any g there exists M such that
$$ |g(x)| \le M \ \text{ almost } \text{ everywhere } \text{ in } \mathscr {X} \text{ w.r.t. } \text{ the } \text{ measure } \mu , $$equipped with the norm
$$ \Vert g \Vert _\infty = \inf \left\{ M \ | \ |g(x)| \le M \ \text{ almost } \text{ everywhere } \text{ in } \mathscr {X} \text{ w.r.t. } \text{ the } \text{ measure } \mu \right\} . $$
An infinite-dimensional Hilbert (or Banach) space is said to be separable if it admits a countable basis \(\{ \rho _j \}_{j=1}^\infty \), i.e., for any g in the space we can find scalars \(c_j\) such that
When such vectors \(\{ \rho _j \}\) satisfy also the conditions
then the basis is said to be orthonormal.
Subspaces, Projections and Compact Sets
A subset S of the vector space V is said to be a subspace if S is itself a vector space with the same addition and multiplication operations defined in V. The symbol
denotes the subspace containing all the finite linear combinations of vectors taken from the (possibly uncountable) family \(\{ \rho _j \}_{ j \in A }\).
Given a subspace (or simply a set) S contained in a Hilbert (or Banach) space, we define
Then, S is said to be closed if
The orthogonal to a subspace S of a Hilbert space is denoted by \(S^\perp \) and defined by
It is easy to prove that \(S^\perp \) is always a closed subspace.
The following fundamental theorem holds.
Theorem 6.25
(Projection theorem) Let S be a closed subspace of a Hilbert space with norm \(\Vert \cdot \Vert _\mathscr {H}\). Then, one has
-
any \(g \in \mathscr {H}\) has a unique decomposition
$$ g = g_S + g_{S^\perp }, \quad g_S \in S, \ g_{S^\perp } \in S^\perp ; $$ -
\(g_S\) is the projection of g onto S, i.e.,
$$ g_S= \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in S} \ \Vert g-f \Vert _\mathscr {H}; $$ -
it holds that
$$ \Vert g \Vert _\mathscr {H}^2 = \Vert g_S \Vert _\mathscr {H}^2 + \Vert g_{S^\perp } \Vert _\mathscr {H}^2. $$
A set A contained in a Hilbert (or Banach) space with norm \(\Vert \cdot \Vert \) is said to be compact if, given any sequence \(\{g_j\}\) of vectors all contained in A, it is possible to extract a subsequence \(\{g_{k_j}\}\) convergent in A, i.e., there exists \(g \in A\) such that
When the space is finite-dimensional, a set is compact iff it is closed and bounded.
Linear and Bounded Functionals
Given a Hilbert space \(\mathscr {H}\) with norm \(\Vert \cdot \Vert _{\mathscr {H}}\), a functional \(L: \mathscr {H} \rightarrow \mathbb {R}\) is said to be bounded (or, equivalently, continuous) if there exists a positive scalar C such that
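that is, such that
$$ |L[f]| \le C \, \Vert f \Vert _{\mathscr {H}} \quad \forall f \in \mathscr {H} . $$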
The following classical theorem holds.
Theorem 6.26
(Closed graph theorem) Let \(\mathscr {H}\) be a Hilbert (or Banach) space. Then \(L: \mathscr {H} \rightarrow \mathbb {R}\) is linear and bounded if and only if the graph of L, i.e.,
is a closed set in the product space \(\mathscr {H} \times \mathbb {R}\). This means that if \(\{ f_i \}_{i=1}^{+\infty }\) is a sequence converging to \(f \in \mathscr {H}\) and \(\{L[f_i]\}_{i=1}^{+\infty }\) converges to \(y \in \mathbb {R}\), then \(L[f]=y\).
This other fundamental theorem asserts that every linear and bounded functional over \(\mathscr {H}\) is in one-to-one correspondence with a vector in \(\mathscr {H}\).
Theorem 6.27
(Riesz representation theorem, based on [76]) Let \(\mathscr {H}\) be a Hilbert space and let \(L: \mathscr {H} \rightarrow \mathbb {R}\). Then L is linear and bounded if and only if there is a unique \(f \in \mathscr {H}\) such that
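namely,
$$ L[g] = \langle g , f \rangle _{\mathscr {H}} \quad \forall g \in \mathscr {H} . $$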
6.9.2 Proof of Theorem 6.1
First, we derive two lemmas which are instrumental to the main proof.
Lemma 6.1
Let
If there exists a Hilbert space \(\mathscr {H}\) satisfying conditions (6.2) and (6.3), then \(\mathscr {H}\) is the closure of S, i.e., \(\mathscr {H} = \bar{S}\).
Proof
It follows from condition (6.2) that \(\bar{S}\) is a closed subspace which must be contained in \(\mathscr {H}\). Theorem 6.25 (Projection theorem) then ensures that any function \(f \in \mathscr {H}\) can be written as
As for the component \(f_{\bar{S}^\perp }\), using condition (6.3) (reproducing property) we obtain
In fact, every kernel section belongs to S and is thus orthogonal to every function in \(\bar{S}^\perp \). Hence, \(f_{\bar{S}^\perp }\) is the null vector and this concludes the proof. \(\square \)
Lemma 6.2
Let \(S=\text{ span }( \{ K_x \}_{ x \in \mathscr {X} })\) and define
where f is a generic element in S, hence admitting representation
Then, \(\Vert \cdot \Vert _{\mathscr {H}}\) is a well-defined norm in S.
Proof
The reader can easily check that absolute homogeneity and the triangle inequality are satisfied by \(\Vert \cdot \Vert _{\mathscr {H}}\). We only need to prove the null vector condition, i.e., that for every \(f \in S\) one has
Now, assume that \(\Vert f \Vert _{\mathscr {H}} = 0\) where \(f(\cdot ) = \sum _{i=1}^{m} c_i K_{x_i}(\cdot )\). While the coefficients \(\{c_i\}_{i=1}^m\) and the input locations \(\{x_i\}_{i=1}^m\) are fixed and define f, let also \(c_{m+1}\) and \(x_{m+1}\) be an arbitrary scalar and input location, respectively. Define \(\mathbf {K} \in \mathbb {R}^{m \times m}\) and \(\mathbf {K} _+ \in \mathbb {R}^{(m+1) \times (m+1)}\) as the two matrices with (i, j)-entry given by \(K(x_i,x_j)\). Let also \(c=[c_1 \ \ldots \ c_m]^T\) and \(c_+=[c_1 \ \ldots \ c_m \ c_{m+1}]^T\). Note that \(\mathbf {K} c\) is the vector which contains the values of f at the input locations \(\{x_i\}_{i=1}^m\).
Since K is positive definite, it holds that
In addition, since by assumption
it follows that the components of the vector \(\mathbf {K} c\), which are the values of f on \(\{x_i\}_{i=1}^m\), are all null. Now, we show that \(f(x)=0\) holds everywhere, hence also at the generic input location \(x_{m+1} \in \mathscr {X}\). In fact, after simple calculations, one obtains
Now, assume that \(f(x_{m+1})>0\). Then, since the last term on the r.h.s. is infinitesimal of order two w.r.t. \(c_{m+1}\) we can find a negative value for \(c_{m+1}\) sufficiently close to zero such that \(c_+^T \mathbf {K} _+ c_+ <0\) which contradicts the fact that K is positive definite. If \(f(x_{m+1})<0\) we can instead find a positive value for \(c_{m+1}\) sufficiently close to zero such that \(c_+^T \mathbf {K} _+ c_+ <0\), which is still a contradiction. Hence, we must have \(f(x_{m+1}) = 0\). Since \(x_{m+1}\) was arbitrary, we conclude that f must be the null function.
\(\square \)
We now prove Theorem 6.1. Let \(S=\text{ span }( \{ K_x \}_{ x \in \mathscr {X} })\) and, for any \(f,g \in S\) having representations
define
By Lemma 6.2, it is immediate to check that \(\langle \cdot , \cdot \rangle _{\mathscr {H}}\) is a well-defined inner product on S. Then, we now show that the desired Hilbert space is \(\mathscr {H} = \bar{S}\), where \(\bar{S}\) is the completion of S w.r.t. the norm induced by \(\langle \cdot , \cdot \rangle _{\mathscr {H}}\).
Condition (6.2) is trivially satisfied since, by construction, all the kernel sections belong to \(\mathscr {H}\).
As for the condition (6.3), we start checking that it holds over S. Introducing the couple of functions in S given by
we have
showing that the reproducing property holds in S. Let us now consider the completion of S. To this aim, let \(\{f_j\}\) be a Cauchy sequence with \(f_j \in S \ \forall j\). We have
where we have used first the reproducing property (since it holds in S) and then the Cauchy–Schwarz inequality. We have
where the scalar q independent of x exists because the kernel K is continuous over the compact \(\mathscr {X} \times \mathscr {X}\). Combining the last two inequalities leads to
which shows that the convergence in \(\mathscr {H}\) implies also uniform convergence. In other words, if \(f_j \rightarrow f\) in \(\mathscr {H}\) w.r.t. \(\Vert \cdot \Vert _\mathscr {H}\), then \(f_j \rightarrow f\) also in the space \(\mathscr {C}\) of continuous functions w.r.t. the sup-norm \(\Vert \cdot \Vert _{\infty }\). Since \(S \subset \mathscr {C}\) and \(\mathscr {C}\) is Banach, all the functions in the completion of S are continuous, i.e., \(\mathscr {H} \subset \mathscr {C}\). Furthermore, if \(f_j \rightarrow f\) in \(\mathscr {H}\), one has that for any \(x \in \mathscr {X}\)
by the continuity of the inner product. But we also have
since \(f_j \in S \ \forall j\), the reproducing property holds in S and convergence in \(\mathscr {H}\) implies uniform (and, hence, pointwise) convergence. This shows that \(\langle f, K_x \rangle _\mathscr {H} = f(x) \ \forall f \in \mathscr {H}\), i.e., the reproducing property holds over all the space \(\mathscr {H}\).
The last point is the unicity of \(\mathscr {H}\). For the sake of contradiction, assume that there exists another Hilbert space \(\mathscr {G}\) which satisfies conditions (6.2) and (6.3). By Lemma 6.1, we must have \(\mathscr {G}=\bar{S}\) where the completion of S is w.r.t. the norm \(\Vert \cdot \Vert _\mathscr {G}\) deriving from the inner product \(\langle \cdot ,\cdot \rangle _{\mathscr {G}}\). Condition (6.3) holds both in \(\mathscr {H}\) and in \(\mathscr {G}\), so that we have
Since the functions in S are finite linear combinations of kernel sections, by the linearity of the inner product, the above equality allows to conclude that
Such an equality, together with the uniqueness of limits, implies that the completion of S w.r.t. \(\Vert \cdot \Vert _\mathscr {H}\) coincides with the completion w.r.t. \(\Vert \cdot \Vert _\mathscr {G}\). Hence, \(\mathscr {H}\) and \(\mathscr {G}\) are the same Hilbert space and this completes the proof.
6.9.3 Proof of Theorem 6.10
It is not difficult to see that (6.12), equipped with the inner product (6.13), is a Hilbert space. In addition, using the Mercer theorem, in particular the expansion (6.11), every kernel section can be written as \(K_x = \sum _{i \in \mathscr {I}} \zeta _i \rho _i(x) \rho _i\), so that from (6.13) one has
$$\Vert K_x \Vert^2_{\mathscr {H}} = \sum _{i \in \mathscr {I}} \frac{(\zeta _i \rho _i(x))^2}{\zeta _i} = \sum _{i \in \mathscr {I}} \zeta _i \rho _i^2(x) = K(x,x) < \infty$$
and, for any \(f = \sum _{i \in \mathscr {I}} a_i \rho _i\), it also holds that
$$\langle f, K_x \rangle_{\mathscr {H}} = \sum _{i \in \mathscr {I}} \frac{a_i \, \zeta _i \rho _i(x)}{\zeta _i} = \sum _{i \in \mathscr {I}} a_i \rho _i(x) = f(x).$$
This shows that every kernel section belongs to \(\mathscr {H}\) and the reproducing property holds. Theorem 6.1 then ensures that \(\mathscr {H}\) is indeed the RKHS associated to K.
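The spectral construction can also be explored numerically: discretizing the integral operator on a grid provides approximations of the \(\zeta _i\) and \(\rho _i\), from which the expansion (6.11) and the reproducing property under the coefficient-weighted inner product used above can be checked. In the Python sketch below the measure is uniform on [0, 1] and the kernel is Gaussian; both are illustrative assumptions.

```python
import numpy as np

# approximate the Mercer eigenpairs of a kernel on [0, 1] (uniform measure)
# by discretizing the integral operator on an n-point grid; the Gaussian
# kernel is an illustrative choice
n = 400
t = np.linspace(0.0, 1.0, n)
K = np.exp(-10.0 * (t[:, None] - t[None, :]) ** 2)

w = 1.0 / n                              # quadrature weight of the uniform grid
evals, evecs = np.linalg.eigh(w * K)
order = np.argsort(evals)[::-1]
zeta = evals[order]                      # approximations of the zeta_i
rho = evecs[:, order] / np.sqrt(w)       # approximations of rho_i on the grid

m = int(np.sum(zeta > 1e-12))            # numerically nonzero eigenvalues

# expansion (6.11): K(x, y) = sum_i zeta_i rho_i(x) rho_i(y)
K_rec = (rho[:, :m] * zeta[:m]) @ rho[:, :m].T
print(np.max(np.abs(K_rec - K)))         # tiny residual due to the truncation

# reproducing property with <f, g> = sum_i a_i b_i / zeta_i: the coefficients
# of K_x are zeta_i rho_i(x), so <K_x, K_y> should return K(x, y)
ix, iy = 100, 250
a_x, a_y = zeta[:m] * rho[ix, :m], zeta[:m] * rho[iy, :m]
print(np.sum(a_x * a_y / zeta[:m]), K[ix, iy])   # the two values agree
```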
6.9.4 Proof of Theorem 6.13
First, let \(\mathscr {H}\) be the RKHS induced by \(K(x,y)=\zeta \rho (x) \rho (y)\). Any RKHS is spanned by its kernel sections, hence in this case \(\mathscr {H}\) is the one-dimensional subspace generated by \(\rho \). By the reproducing property it holds that
In addition, one has
so that
Now, consider the kernel of interest \(K(x,y) = \sum _{i=1}^{\infty } \zeta _i \rho _i(x) \rho _i(y)\) associated with \(\mathscr {H}\). Define \(K_j(x,y) = \zeta _j \rho _j(x) \rho _j(y)\), with \(\Vert \cdot \Vert _{\mathscr {H}_j} \) denoting the norm induced by \(K_j\). From the discussion above it holds that
Think of \(K(x,y) = \sum _{i=1}^{\infty } \zeta _i \rho _i(x) \rho _i(y)\) as the sum of \(K_j(x,y)\) and \(K_{-j}(x,y) = \sum _{k \ne j} \zeta _k \rho _k(x) \rho _k(y)\). Then, using Theorem 6.6 and (6.70), one has
where \(\mathscr {H}_{-j}\) is the RKHS induced by \(K_{-j}\). Evaluating the objective at \((c_j=1,h=0)\), one obtains
and this shows that \(\rho _j \in \mathscr {H} \ \forall j\).
Now we prove that the functions \(\rho _j\) generate the whole RKHS \(\mathscr {H}\) induced by K. Using Theorem 6.25 (Projection theorem), it follows that for any \(f \in \mathscr {H}\) we have
where G indicates the closure in \(\mathscr {H}\) of the subspace generated by all the \(\rho _k\). Using the reproducing property, one obtains
where the last equality exploits the relation \(h \perp \rho _k \ \forall k \). This completes the first part of the proof.
As for the RKHS norm characterization, first let \(\mathscr {H}_j^\infty \) be the RKHS induced by the kernel \(\sum _{k=j}^\infty K_k\), with \(h_j\) denoting a generic element of \(\mathscr {H}_j^\infty \). Then, given \(f \in \mathscr {H}\), using Theorem 6.6 in an iterative fashion, we obtain
In particular, every equality above is obtained by thinking of the kernel \(\sum _{k=j}^\infty K_k\) as the sum of \(K_j\) and \(\sum _{k=j+1}^\infty K_k\). Then, \(h_j\) can be decomposed into two parts, i.e., \(h_j = c_j\rho _j + h_{j+1}\), with \(\Vert \rho _j\Vert ^2_{\mathscr {H}_j} = 1/\zeta _j\) where, as before, \(\Vert \cdot \Vert _{\mathscr {H}_j}\) denotes the norm in the one-dimensional RKHS induced by \(K_j\). Now, let \(\hat{c}_1,\ldots ,\hat{c}_{n-1},\hat{h}_n\) be the minimizers of the last objective (to simplify the exposition, the minimizer can be assumed unique without loss of generality) and note that \(\Vert \hat{h}_n\Vert _{\mathscr {H}_n^\infty }\) must go to zero as \(n \rightarrow \infty \). It then follows that the norm \(\Vert f \Vert _{\mathscr {H}}^2\) is indeed given by \(\min _{\{c_k\}} \ \sum _{k=1}^\infty \ \frac{c_k^2}{\zeta _k }\), with the \(\{c_k\}\) subject to the constraint \(\lim _{n \rightarrow \infty } \ \Vert f - \sum _{k=1}^n c_k \rho _k \Vert _{\mathscr {H}} = 0\), and with the minimizing sequence given by \(\hat{c}_1,\hat{c}_{2},\ldots \).
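The norm characterization just derived can be checked numerically in a finite-dimensional setting where the measure is uniform over a grid, so that the Mercer decomposition reduces to an ordinary eigendecomposition. The sketch below compares \(a^T \mathbf {K} a\), the squared norm of a finite combination of kernel sections, with \(\sum _k c_k^2/\zeta _k\); the kernel, the grid and the coefficients are illustrative choices.

```python
import numpy as np

# take the measure to be uniform over an n-point grid so that the Mercer
# decomposition is an ordinary eigendecomposition (illustrative setting)
n = 400
t = np.linspace(0.0, 1.0, n)
kern = lambda u, v: np.exp(-10.0 * (u[:, None] - v[None, :]) ** 2)

w = 1.0 / n                                        # mass of each grid point
zeta, V = np.linalg.eigh(w * kern(t, t))
order = np.argsort(zeta)[::-1]
zeta, rho = zeta[order], V[:, order] / np.sqrt(w)  # zeta_k and rho_k on the grid

# f = sum_i a_i K_{x_i}, with the x_i taken among the grid points
idx = np.array([80, 200, 320])
X, a = t[idx], np.array([1.0, -0.7, 0.4])
f_vals = kern(t, X) @ a

norm_kernel = a @ kern(X, X) @ a                   # ||f||_H^2 = a^T K(X, X) a

m = int(np.sum(zeta > 1e-12))                      # numerically nonzero modes
c = w * rho[:, :m].T @ f_vals                      # c_k = <f, rho_k> in L2(mu)
norm_spectral = np.sum(c ** 2 / zeta[:m])          # sum_k c_k^2 / zeta_k

print(norm_kernel, norm_spectral)                  # the two values essentially coincide
```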
6.9.5 Proofs of Theorems 6.15 and 6.16
We prove the following more general result, which includes Theorems 6.15 and 6.16 as special cases.
Theorem 6.28
Let \(\mathscr {H}\) be a Hilbert space. Consider the optimization problem
and assume that
- problem (6.71) admits at least one solution;
- each \(L_i: \mathscr {H} \rightarrow \mathbb {R}\) is linear and bounded;
- the objective \(\varPhi \) is strictly increasing w.r.t. its last argument.
Then, all the solutions of (6.71) admit the following expression
where the \(c_i\) are suitable scalar expansion coefficients and each \(\eta _i \in \mathscr {H}\) is the representer of \(L_i\), i.e.,
In particular, if \(\mathscr {H}\) is a RKHS with kernel K, each basis function is given by
To prove the above result, let \(\hat{g}\) be a solution of (6.71) and denote by S the (closed) subspace spanned by the N representers \(\eta _i\) of the functionals \(L_i\), i.e.,
Exploiting Theorem 6.25 (Projection theorem), we can write
For the sake of contradiction, assume that \(\hat{g}_{S^\perp }\) is different from the null function. Then, we have
where the last equality exploits the fact that each \(\eta _i\) is orthogonal to all the functions in \(S^\perp \) while the inequality exploits the assumption that \(\varPhi \) is strictly increasing w.r.t. its last argument. This contradicts the optimality of \(\hat{g}\) and implies that \(\hat{g}_{S^\perp }\) must be the null function, hence concluding the first part of the proof.
Finally, to prove (6.28), note that, if \(\mathscr {H}\) is a RKHS, one has
$$\eta_i(x) = \langle \eta_i, K_x \rangle_{\mathscr {H}} = L_i[K_x],$$
where the first equality comes from the reproducing property, while the second one derives from the fact that \(\eta _i\) is the representer of \(L_i\).
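In the special case where each \(L_i\) is the evaluation functional \(L_i[g]=g(x_i)\) and the objective is a regularized least squares criterion, the theorem states that the estimate is a finite combination of the kernel sections \(K_{x_i}\), with coefficients solving a linear system (the same expression exploited later in the proof of Lemma 6.5). A minimal Python sketch follows; the Gaussian kernel, the synthetic data and the value of \(\gamma \) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def kern(u, v, width=10.0):
    return np.exp(-width * (u[:, None] - v[None, :]) ** 2)

# noisy samples of an unknown function (synthetic, illustrative data)
N = 50
x = rng.uniform(0.0, 1.0, N)                               # inputs x_1, ..., x_N
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)   # outputs y_1, ..., y_N

# representer theorem for regularized least squares: g_hat = sum_i c_i K_{x_i}
gamma = 1e-3
c = np.linalg.solve(kern(x, x) + N * gamma * np.eye(N), y)

def g_hat(x_new):
    # finite expansion over the kernel sections centred at the data points
    return kern(np.atleast_1d(x_new), x) @ c

print(g_hat(0.25))    # close to sin(2*pi*0.25) = 1 for this noise level
```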
6.9.6 Proof of Theorem 6.21
Preliminary Lemmas
The first lemma, whose proof can be found in [34], states a bound on the correlation between two random variables taking values in a Hilbert space.
Lemma 6.3
(based on [34]) Let a and b be zero-mean random variables measurable with respect to the \(\sigma \)-algebras \(\mathscr {M}_1\) and \(\mathscr {M}_2\) and with values in the Hilbert space \(\mathscr {H}\) having inner product \(\langle \cdot , \cdot \rangle _{\mathscr {H}}\). Then, it holds that
where all the expectations above are assumed to exist and
As for the second lemma, first it is useful to introduce the following integral operator:
Since the assumptions underlying Theorem 6.9 (Mercer theorem) hold true, there exists a complete orthonormal basis of \(\mathscr {L}_2^{\mu }\), denoted by \(\{\rho _i\}_{i \in \mathscr {I}}\), which satisfies
To simplify the exposition, in what follows we assume \(\zeta _i>0 \ \forall i\). Then, for \(r>0\), we define the operators \(L_K^{-r}\) and \(L_K^{r}\) as follows
The function \(L_K^{-r}[f]\) is less regular than f since its expansion coefficients go to zero more slowly. Instead, \(L_K^{r}\) is a smoothing operator since \(\zeta _i^r c_i\) goes to zero faster than \(c_i\) as i goes to infinity. When \(r=1/2\) we recover the operator \(L_K^{1/2}\) already defined in (6.17) which satisfies \(\mathscr {H} = L_K^{1/2} \mathscr {L}_2^{\mu }\). The following lemma holds.
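The action of \(L_K^{r}\) and \(L_K^{-r}\) can be visualized numerically by discretizing the integral operator on a grid: both operators simply rescale the coefficients of f on the basis \(\{\rho _i\}\) by \(\zeta _i^{\pm r}\). The sketch below assumes a uniform measure on [0, 1] and a Gaussian kernel, both illustrative choices.

```python
import numpy as np

# spectral sketch of L_K^r and L_K^{-r} on a grid: both operators act by
# rescaling the coefficients of f on the basis {rho_i} by zeta_i^(+/-r)
n = 300
t = np.linspace(0.0, 1.0, n)
w = 1.0 / n
K = np.exp(-10.0 * (t[:, None] - t[None, :]) ** 2)

zeta, V = np.linalg.eigh(w * K)
order = np.argsort(zeta)[::-1]
m = int(np.sum(zeta[order] > 1e-10))       # keep the numerically stable modes
zeta, rho = zeta[order][:m], V[:, order][:, :m] / np.sqrt(w)

def apply_LK_power(f_vals, r):
    # expand f on the rho_i, rescale the coefficients by zeta_i^r, re-synthesize
    coeff = w * rho.T @ f_vals             # c_i = <f, rho_i> in L2(mu)
    return rho @ (zeta ** r * coeff)

f = np.sin(2 * np.pi * t)                  # a test function on the grid
smooth = apply_LK_power(f, 0.5)            # L_K^{1/2} f: faster-decaying coefficients
rough = apply_LK_power(smooth, -0.5)       # L_K^{-1/2} undoes the smoothing
print(np.max(np.abs(rough - f)))           # small: only the discarded modes are lost
```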
Lemma 6.4
If \(L_K^{-r} f_{\rho } \in \mathscr {L}_2^{\mu }\) for some \(0<r \le 1\), letting
one has
Proof
By assumption, there exists \(g \in \mathscr {L}_2^{\mu }\), say \(g=\sum _{i \in \mathscr {I}} d_i \rho _i\), such that \(f_{\rho }= L_K^{r}g \) so that \(f_{\rho }=\sum _{i \in \mathscr {I}} \zeta _i^r d_i \rho _i\). Now, we characterize the solution \(\hat{f}\) of (6.76) using \(f = \sum _{i \in \mathscr {I}} c_i \rho _i\) and optimizing w.r.t. the \(c_i\). The objective becomes
and setting the partial derivatives w.r.t. each \(c_i\) to zero, we obtain
This implies
If \(0<r \le 1\), it follows that
and this proves (6.77). \(\square \)
In the proof of the third lemma reported below, the notation \(S_x: \mathscr {H} \rightarrow \mathbb {R}^N\) indicates the sampling operator defined by \(S_x f=[f(x_1) \ldots f(x_N)]\). In addition, \(S_x^T\) denotes its adjoint, i.e., for any \(c \in \mathbb {R}^N\), it satisfies
where \(\langle \cdot , \cdot \rangle \) is the Euclidean inner product. Hence, one has
Lemma 6.5
Define
with \(\hat{f} \) defined by (6.76). Then, if \( \hat{g}_N\) is given by (6.55), one has
Proof
First, we derive two useful equalities involving \(\hat{f}\) and \(\hat{g}_N\). The first one is
The last equality in (6.80) follows from the definitions of \(L_K\) and \(\eta _i\). The first equality can be obtained by using the representation \(f_{\rho }=\sum _{i \in \mathscr {I}} d_i \rho _i\) and then following the same steps as in the first part of the previous lemma’s proof to obtain
The second result consists of the following alternative expression for \(\hat{g}_N\):
where I denotes the identity operator. To prove it, we will use the equality \(\hat{g}_N=S_x^T \left( \mathbf {K}+N \gamma I_{N}\right) ^{-1}Y\), which derives from the representer theorem, and also the fact that, for any vector \(c \in \mathbb {R}^N\), it holds that \(S_xS_x^Tc=\mathbf {K}c\), with \(\mathbf {K}\) the kernel matrix built using \([x_1 \ldots x_N]\). Then, we have
Now, it is also useful to obtain a bound on the inverse of the operator \(\frac{S_x^T S_x}{N} + \gamma I\). Assume that \(v \in \mathscr {H}\) and let u satisfy
We take inner products on both sides with u and use the equality \(\langle S_x^T S_x u, u \rangle _{\mathscr {H}} =\langle S_x u, S_x u \rangle \) to obtain
One has
Thus, we have shown that
Now, it follows from (6.81) that
Exploiting the equalities
which derive from (6.79) and (6.80), respectively, we obtain
The use of (6.82) then completes the proof. \(\square \)
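The equivalence between the kernel-matrix expression \(\hat{g}_N=S_x^T \left( \mathbf {K}+N \gamma I_{N}\right) ^{-1}Y\) and the operator expression (6.81) used above can be traced back to a matrix "push-through" identity, which is easy to verify in finite dimensions. In the sketch below a random matrix A stands in for the sampling operator \(S_x\) (which, in general, acts on an infinite-dimensional space); this is an illustrative finite-dimensional surrogate, not the operator itself.

```python
import numpy as np

# finite-dimensional check of the push-through identity
#   A^T (A A^T + lam I)^(-1) = (A^T A + lam I)^(-1) A^T,
# with a random N x d matrix A standing in for S_x and lam playing N*gamma
rng = np.random.default_rng(1)
N, d, lam = 6, 9, 0.3
A = rng.standard_normal((N, d))
Y = rng.standard_normal(N)

lhs = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(N), Y)   # S_x^T (K + lam I)^(-1) Y
rhs = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ Y)   # (S_x^T S_x + lam I)^(-1) S_x^T Y
print(np.allclose(lhs, rhs))   # True
```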
Proof of Statistical Consistency
Let \(\hat{f}\) be defined by (6.76), i.e.,
Then, consider the following error decomposition
The first term \(\Vert \hat{f}-f_{\rho } \Vert _{\mathscr {L}_2^{\mu }}\) on the r.h.s. is deterministic. The assumption \(f_{\rho } \in \mathscr {H}\) ensures that \(\Vert L_K^{-r} f_{\rho } \Vert _{\mathscr {L}_2^{\mu }}< \infty \) for \(0 \le r \le 1/2\). It thus follows from Lemma 6.4 that, at least for \(0<r \le 1/2\), it holds that
Now, consider the second term \(\Vert \hat{g}_N-\hat{f} \Vert _{\mathscr {L}_2^{\mu }}\). Since the input space (the function domain) is compact, and recalling also (6.69), there exists a constant A such that
To obtain a bound for the r.h.s. involving the RKHS norm, consider the stochastic function
already introduced in (6.79). Using the reproducing property, one has
The function \(\hat{f}\) belongs to \(\mathscr {H}\) and is thus continuous on the compact \(\mathscr {X}\). In addition, the kernel K is continuous on the compact \(\mathscr {X} \times \mathscr {X}\) and the process \(\{x_i,y_i\}\) has finite moments up to the third order by assumption. Hence, there exists a constant B independent of i such that
We can now come back to \(\Vert \hat{g}_N-\hat{f} \Vert _{\mathscr {H}}\). From Lemma 6.5, \(\forall \gamma >0\) it holds that
Now, using first Jensen’s inequality and then (6.87), (6.88), Assumption 6.20 and (6.73) in Lemma 6.3 (with a and b replaced by \(\eta _i- \mathscr {E}[\eta _i]\) and \(\eta _j - \mathscr {E}[\eta _j]\)), one obtains constants C and D such that
where \(\eta \) replaces \(\eta _i\) or \(\eta _j\) when the expectation is independent of i and j. This latter result, combined with (6.85) and (6.88), leads to the existence of a constant E such that
that, combined with (6.83) and (6.84), implies that for any \(0<r \le 1/2\)
Hence, when \(\gamma \) is chosen according to (6.56), \(\mathscr {E}\Vert \hat{g}_N-f_{\rho } \Vert _{\mathscr {L}_2^{\mu }} \) converges to zero as N grows to \(\infty \). Using the Markov inequality, (6.57) is finally obtained.
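As a numerical complement to the consistency result, the following Monte Carlo sketch shows the \(\mathscr {L}_2^{\mu }\) error of the regularized estimate decreasing as N grows when \(\gamma \) shrinks with N. The target function, the kernel, the noise level and the schedule \(\gamma =N^{-1/2}\) are illustrative choices; the schedule actually required by the theorem is the one specified in (6.56).

```python
import numpy as np

rng = np.random.default_rng(2)
kern = lambda u, v: np.exp(-10.0 * (u[:, None] - v[None, :]) ** 2)
f_rho = lambda u: np.sin(2 * np.pi * u)     # noise-free part of the data (illustrative)

t = np.linspace(0.0, 1.0, 500)              # grid used to evaluate the L2(mu) error

for N in [20, 80, 320, 1280]:
    errs = []
    for _ in range(10):                     # average over independent data sets
        x = rng.uniform(0.0, 1.0, N)
        y = f_rho(x) + 0.3 * rng.standard_normal(N)
        gamma = N ** (-0.5)                 # gamma shrinks as N grows (illustrative rule)
        c = np.linalg.solve(kern(x, x) + N * gamma * np.eye(N), y)
        g_hat = kern(t, x) @ c
        errs.append(np.sqrt(np.mean((g_hat - f_rho(t)) ** 2)))
    print(N, np.mean(errs))                 # the average error decreases with N
```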