6.1 Preliminaries

Techniques for reconstructing a function g in a functional relationship $$y=g(x)$$ from observed samples of y and x are the fundamental building blocks of black-box estimation. As already seen in Chap. 3 when treating linear regression, given a finite set of pairs $$(x_i, y_i)$$, the aim is to determine a function g with good prediction capability, i.e., for a new pair (x, y) we would like the prediction g(x) to be close to y (e.g., in the MSE sense).

The classical parametric approach discussed in Chap. 3 uses a model $$g_{\theta }$$ that depends on a finite-dimensional vector $$\theta$$. A very simple example is a polynomial model, treated in Example 3.1, given, e.g., by $$g_{\theta }(x)=\theta _1+\theta _2 x +\theta _3x^2$$, whose coefficients $$\theta _i$$ can be estimated by fitting the data via least squares. In this parametric scenario, we have seen that an important issue is the choice of the model order. In fact, the least squares objective improves as the dimension of $$\theta$$ increases, eventually leading to data interpolation. But overparameterized models, as a rule, perform poorly when used to predict future output data, even if benign overfitting may sometimes occur, as, e.g., described in the context of deep networks [17, 55, 75]. Another drawback related to overparameterization is that the problem may become ill-posed in the sense of Hadamard, i.e., the solution may be non-unique, or ill-conditioned. This means that the estimate may be highly sensitive even to small perturbations of the outputs $$y_i$$, as, e.g., illustrated in Fig. 1.3 of Sect. 1.2.

This chapter describes some regularization approaches which make it possible to reconcile the flexibility of the model class with the well-posedness of the solution, exploiting an alternative paradigm to traditional parametric estimation. Instead of constraining the unknown function to a specific parametric structure, g will be searched for over a possibly infinite-dimensional functional space. Overfitting and ill-posedness are circumvented by using reproducing kernel Hilbert spaces (RKHSs) as hypothesis spaces and the related norms as regularizers. Such norms generalize the quadratic penalties seen in Chap. 3. In this scenario, the estimator is completely defined by a positive definite kernel, which has to encode the expected function properties, e.g., the smoothness level. Furthermore, we will see that, even when the model class is infinite dimensional, the function estimate turns out to be a finite linear combination of basis functions computable from the kernel. The estimator also enjoys strong asymptotic properties, achieving (under reasonable assumptions on the data generation) the optimal predictor as the data set size grows to infinity.

The kernel-based approaches described in the following sections thus make it possible to cast all the regularized estimators based on quadratic penalties encountered in the previous chapters as special cases of a more general theory. In addition, RKHS theory paves the way to the development of other powerful techniques, e.g., for the estimation of an infinite number of impulse response coefficients (IIR model estimation), for continuous-time linear system identification and also for nonlinear system identification.

Readers not familiar with functional analysis will find, in the first part of the appendix to this chapter, a brief overview of the basic results used in the next sections, such as the concept of a linear and bounded functional, which is key to defining a RKHS.

6.2 Reproducing Kernel Hilbert Spaces

In what follows, we use $$\mathscr {X}$$ to indicate the domain of the functions. In machine learning, this set is often referred to as the input space, with its generic element $$x \in \mathscr {X}$$ called an input location. Sometimes, $$\mathscr {X}$$ is assumed to be a compact metric space; e.g., one can think of $$\mathscr {X}$$ as a closed and bounded set in the familiar space $$\mathbb {R}^m$$ equipped with the Euclidean norm. Moreover, all the functions are real valued, so that $$f: \mathscr {X} \rightarrow \mathbb {R}$$.

Reproducing kernel Hilbert spaces We now introduce a class of Hilbert spaces $$\mathscr {H}$$ which play a fundamental role as hypothesis spaces for function estimation problems. Our goal is to estimate maps which permit predictions over the whole $$\mathscr {X}$$. Thus, a basic requirement is to search for the predictor in a space containing functions which are well defined pointwise at any $$x \in \mathscr {X}$$. In particular, we assume that all the pointwise evaluators $$g \rightarrow g(x)$$ are linear and bounded over $$\mathscr {H}$$. This means that $$\forall x \in \mathscr {X}$$ there exists $$C_x< \infty$$ such that

\begin{aligned} |g(x)| \le C_x\Vert g\Vert _{\mathscr {H}}, \quad \forall g \in \mathscr {H}. \end{aligned}
(6.1)

The above condition is stronger than just requiring $$|g(x)| < \infty \ \forall x$$, since $$C_x$$ may depend on x but not on g. This property already leads to the function spaces of interest. The following definitions are taken from [13].

Definition 6.1

(RKHS, based on [13]) A reproducing kernel Hilbert space (RKHS) over a non-empty set $$\mathscr {X}$$ is a Hilbert space of functions $$g:\mathscr {X} \rightarrow \mathbb {R}$$ such that (6.1) holds.

As suggested by the name itself, RKHSs are related to the concept of a positive definite kernel [13, 20], a particular function defined over $$\mathscr {X}\times \mathscr {X}$$. In the literature it is also called a positive semidefinite kernel; hence, in what follows, positive definite kernel and positive semidefinite kernel will denote the same mathematical object. This is also specified in the next definition.

Definition 6.2

(Positive definite kernel, Mercer kernel and kernel section, based on [13]) Let $$\mathscr {X}$$ denote a non-empty set. A symmetric function $$K:\mathscr {X}\times \mathscr {X} \rightarrow \mathbb {R}$$ is called a positive definite kernel, or positive semidefinite kernel, if, for any natural number p, it holds that

$$\sum _{i=1}^{p}\sum _{j=1}^{p}a_ia_j K(x_i,x_j) \ge 0, \quad \forall (x_k,a_k) \in \mathscr {X}\times \mathbb {R}, \quad k=1,\ldots , p.$$

If strict inequality holds for any set of p distinct input locations $$x_k$$ and coefficients $$a_k$$ not all equal to zero, i.e.,

$$\sum _{i=1}^{p}\sum _{j=1}^{p}a_ia_j K(x_i,x_j) > 0,$$

then the kernel is strictly positive definite.

If $$\mathscr {X}$$ is a metric space and the positive definite kernel is also continuous, then K is said to be a Mercer kernel.

Finally, given a kernel K, the kernel section $$K_x$$ centred at x is the function $$\mathscr {X} \rightarrow \mathbb {R}$$ defined by

$$K_x(y) = K(x,y) \quad \forall y \in \mathscr {X}.$$

Hence, in the sense given above, a positive definite kernel “contains” only positive semidefinite matrices: every matrix built from kernel evaluations over a finite set of input locations is positive semidefinite.
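As a quick numerical companion (our own sketch, not from the text), one can build the Gram matrix of a kernel over a few sampled input locations and verify the condition of Definition 6.2; here we use the Gaussian kernel $$\exp(-\Vert x_1-x_2\Vert^2)$$, introduced later in the chapter:

```python
import numpy as np

# Gaussian kernel K(a, b) = exp(-||a - b||^2), a positive definite kernel.
def gauss_kernel(a, b):
    return np.exp(-np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))                 # 20 input locations in R^3
K = np.array([[gauss_kernel(a, b) for b in X] for a in X])

# The matrix "contained" in the kernel is symmetric positive semidefinite:
# sum_ij a_i a_j K(x_i, x_j) = a^T K a >= 0 for every coefficient vector a.
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)   # psd up to round-off
a = rng.standard_normal(20)
assert a @ K @ a >= -1e-10
```

With distinct input locations the Gaussian kernel is in fact strictly positive definite, so all the eigenvalues above are strictly positive.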

We are now in a position to state a fundamental theorem from [13], here specialized to Mercer kernels, which lead to RKHSs containing continuous functions (the proof is reported in Sect. 6.9.2).

Theorem 6.1

(RKHSs induced by Mercer kernels, based on [13]) Let $$\mathscr {X}$$ be a compact metric space and let $$K:\mathscr {X}\times \mathscr {X} \rightarrow \mathbb {R}$$ be a Mercer kernel. Then, there exists a unique (up to isometries) Hilbert space $$\mathscr {H}$$ of functions, called RKHS associated to K, such that

1.

all the kernel sections belong to $$\mathscr {H}$$, i.e.,

\begin{aligned} K_x \in \mathscr {H} \quad \forall x \in \mathscr {X}; \end{aligned}
(6.2)
2.

the so-called reproducing property holds, i.e.,

\begin{aligned} \langle K_x, g \rangle _{\mathscr {H}} = g(x) \quad \forall (x, g) \in \mathscr {X}\times \mathscr {H}. \end{aligned}
(6.3)

In addition, $$\mathscr {H}$$ is contained in the space $$\mathscr {C}$$ of continuous functions.

Remark 6.1

Note that the space $$\mathscr {H}$$ characterized in Theorem 6.1 is indeed a RKHS according to Definition 6.1. In fact, for any input location x the kernel section $$K_x$$ belongs to the space and, according to the reproducing property, represents the evaluation functional at x. Then, Theorem 6.27 (Riesz representation theorem), reported in the appendix to this chapter, permits us to conclude that all the pointwise evaluators over $$\mathscr {H}$$ are linear and bounded.

While Theorem 6.1 establishes a link between Mercer kernels (which enjoy continuity properties) and RKHSs, it is also possible to state a one-to-one correspondence with the entire class of positive definite kernels (not necessarily continuous). In particular, the following result holds.

Theorem 6.2

(Moore–Aronszajn, based on [13]) Let $$\mathscr {X}$$ be any non-empty set. Then, to every RKHS $$\mathscr {H}$$ there corresponds a unique positive definite kernel K such that the reproducing property (6.3) holds. Conversely, given a positive definite kernel K, there exists a unique RKHS of real-valued functions defined over $$\mathscr {X}$$ where (6.2) and (6.3) hold.

The proof can be quite easily obtained using Theorem 6.27 (Riesz representation theorem) and arguments similar to those contained in the proof of Theorem 6.1.

Further notes and RKHS examples Thus, a RKHS $$\mathscr {H}$$ can be defined just by specifying a kernel K, also called the reproducing kernel of $$\mathscr {H}$$. In particular, any RKHS is generated by its kernel sections. More specifically, let $$S=\text{ span }( \{ K_x \}_{ x \in \mathscr {X} })$$ and define the following norm on S:

\begin{aligned} \Vert f \Vert _{\mathscr {H}}^2 = \sum _{i=1}^p \sum _{j=1}^p c_i c_j K(x_i,x_j), \end{aligned}
(6.4)

where

\begin{aligned} f(\cdot ) = \sum _{i=1}^{p} c_i K_{x_i}(\cdot ). \end{aligned}

Then, one has

$$\mathscr {H} = S \ \cup \ \left\{ \text{ all } \text{ the } \text{ limits } \text{ w.r.t. } \Vert \cdot \Vert _{\mathscr {H}} \text{ of } \text{ Cauchy } \text{ sequences } \text{ contained } \text{ in } S \right\} .$$

Summarizing, one has

• all the kernel sections $$K_x(\cdot )$$ belong to the RKHS $$\mathscr {H}$$ induced by K;

• $$\mathscr {H}$$ contains also all the finite linear combinations of kernel sections along with some particular infinite sums, convergent w.r.t. the norm (6.4);

• every $$f \in \mathscr {H}$$ is thus a linear combination of a possibly infinite number of kernel sections.

Assume for instance $$K(x_1,x_2) = \exp \left( - \Vert x_1-x_2\Vert ^2\right)$$, which is the so-called Gaussian kernel. Then, all the functions in the corresponding RKHS are sums, or limits of sums, of functions proportional to Gaussians. As further elucidated later on, this means that every function of $$\mathscr {H}$$ inherits properties of the kernel such as smoothness and integrability, e.g., we have seen in Theorem 6.1 that kernel continuity implies $$\mathscr {H} \subset \mathscr {C}$$. This fact has an important consequence for modelling: instead of specifying a whole set of basis functions, it suffices to choose a single positive definite kernel that encodes the desired properties of the function to be synthesized.

Example 6.3

(Norm in a two-dimensional RKHS) We introduce a very simple RKHS to illustrate how the kernel K can be seen as a similarity function that establishes the norm (complexity) of a function by comparing function values at different input locations.

When $$\mathscr {X}$$ has finite cardinality m, the functions are evaluated just on a finite number of input locations. Hence, each function f is in one-to-one correspondence with the m-dimensional vector

$$\mathbf {f} = \left( \begin{array}{c} f(1) \\ f(2) \\ \vdots \\ f(m) \end{array}\right) .$$

In addition, any kernel is in one-to-one correspondence with a symmetric positive semidefinite matrix $$\mathbf {K} \in \mathbb {R}^{m \times m}$$ with (i, j)-entry $$\mathbf {K}_{ij} = K(i,j)$$. Finally, the kernel sections can be seen as the columns of $$\mathbf {K}$$.

Assume, e.g., $$m=2$$ with $$\mathscr {X}=\{1,2\}$$. Then, the functions can be seen as two-dimensional vectors and any kernel K is in one-to-one correspondence with a symmetric positive semidefinite matrix $$\mathbf {K} \in \mathbb {R}^{2 \times 2}$$. The RKHS $$\mathscr {H}$$ associated to K is finite-dimensional, being spanned by the two kernel sections $$K_1(\cdot )$$ and $$K_2(\cdot )$$, which can be seen as the two columns of $$\mathbf {K}$$. Hence, the functions f in $$\mathscr {H}$$ are in one-to-one correspondence with the vectors

$$\mathbf {f} = \left( \begin{array}{c} f(1) \\ f(2) \end{array}\right) = \mathbf {K} c, \quad c \in \mathbb {R}^2.$$

If $$\mathbf {K}$$ is full rank, $$\mathscr {H}$$ covers the whole $$\mathbb {R}^2$$ and from (6.4) we have

$$\Vert f \Vert ^2_{\mathscr {H}} = c^T \mathbf {K} c = \mathbf {f}^T \mathbf {K}^{-1} \mathbf {f}.$$

For the sake of simplicity, assume also that $$\mathbf {K}_{11}=\mathbf {K}_{22}=1$$, so that full rankness of the positive semidefinite $$\mathbf {K}$$ requires $$-1<\mathbf {K}_{12}<1$$. Then, considering, e.g., the function $$f(i)=i$$, one has

\begin{aligned} \Vert f \Vert ^2_{\mathscr {H}}= & {} [1 \ \ 2] \ \mathbf {K}^{-1} \ [1 \ \ 2]^T\\= & {} \frac{5-4\mathbf {K}_{12}}{1-\mathbf {K}_{12}^2}, \quad -1<\mathbf {K}_{12}<1. \end{aligned}

Figure 6.1 displays $$\Vert f \Vert ^2_{\mathscr {H}}$$ as a function of $$\mathbf {K}_{12}$$. One can see that the norm diverges as $$|\mathbf {K}_{12}|$$ approaches 1.

If, e.g., $$\mathbf {K}_{12}=1$$, the kernel function becomes constant over $$\mathscr {X} \times \mathscr {X}$$. Hence, the two kernel sections $$K_1(\cdot )$$ and $$K_2(\cdot )$$ coincide, being constant with $$K_1(i)=K_2(i)=1$$ for $$i=1,2$$. This means that $$\mathbf {K}_{12}=1$$ induces a space $$\mathscr {H}$$ containing only constant functions. This explains why the norm (complexity) of f becomes large if $$\mathbf {K}_{12}$$ is close to 1: the space becomes less and less “tolerant” of functions with $$f(1)\ne f(2)$$.

Letting now $$f(1)=1$$ and $$f(2)=a$$, the joint effect of $$\mathbf {K}_{12}$$ and a is explained by the formula

\begin{aligned} \Vert f \Vert ^2_{\mathscr {H}}= & {} [1 \ \ a] \ \mathbf {K}^{-1} \ [1 \ \ a]^T\\= & {} \frac{(a-\mathbf {K}_{12})^2}{1-\mathbf {K}_{12}^2}+1, \quad -1<\mathbf {K}_{12}<1. \end{aligned}

Note that, thinking now of $$\mathbf {K}_{12}$$ as fixed, the function with minimal RKHS norm (complexity) is obtained with $$a=\mathbf {K}_{12}$$ and has a norm equal to one. $$\square$$
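The computations of Example 6.3 are easy to reproduce numerically; a minimal sketch (our own) with numpy:

```python
import numpy as np

# RKHS norm on X = {1, 2}: ||f||^2_H = f^T K^{-1} f, with unit-diagonal kernel matrix.
def rkhs_norm_sq(fvec, K12):
    K = np.array([[1.0, K12], [K12, 1.0]])
    return float(fvec @ np.linalg.inv(K) @ fvec)

# f(1) = 1, f(2) = 2: closed form (5 - 4*K12) / (1 - K12^2).
for K12 in (-0.5, 0.0, 0.9):
    closed = (5 - 4 * K12) / (1 - K12 ** 2)
    assert abs(rkhs_norm_sq(np.array([1.0, 2.0]), K12) - closed) < 1e-10

# f(1) = 1, f(2) = a: the minimal-norm choice is a = K12, with norm equal to one.
K12 = 0.3
a_grid = np.linspace(-1.0, 1.0, 201)
norms = [rkhs_norm_sq(np.array([1.0, a]), K12) for a in a_grid]
a_star = a_grid[int(np.argmin(norms))]
assert abs(a_star - K12) < 1e-8
assert abs(min(norms) - 1.0) < 1e-3
```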

Example 6.4

($$\mathscr {L}_2^{\mu }$$ and $$\ell _2$$ ) Let $$\mathscr {X}=\mathbb {R}$$ and consider the classical Lebesgue space $$\mathscr {L}_2^{\mu }$$ of square summable functions, with $$\mu$$ equal to the Lebesgue measure. Recall that this is a Hilbert space whose elements are equivalence classes of Lebesgue-measurable functions: any group of functions which differ only on a set of null measure (e.g., on a countable number of input locations) identifies the same vector. Hence, $$\mathscr {L}_2^{\mu }$$ cannot be a RKHS since pointwise evaluation is not even well defined.

Let instead $$\mathscr {X}={\mathbb N}$$ (the set of natural numbers) and define the identity kernel

\begin{aligned} K(i,j)=\delta _{ij}, \ \ (i,j) \in {\mathbb N}\times {\mathbb N}, \end{aligned}
(6.5)

where $$\delta _{ij}$$ is the Kronecker delta. Clearly, K is symmetric and positive definite according to Definition 6.2 (it can be associated with an identity matrix of infinite size). Hence, it induces a unique RKHS $$\mathscr {H}$$ that contains all the finite combinations of the kernel sections. In particular, any finite sum can be written as $$f(\cdot ) = \sum _{i=1}^{m} f_i K_{i}(\cdot )$$, where some of the $$f_i$$ may be null, and corresponds to a sequence with a finite number of non-null components. To obtain the entire $$\mathscr {H}$$, we also need to add all the limits of Cauchy sequences w.r.t. the norm (6.4), given by

\begin{aligned} \Vert f\Vert _{\mathscr {H}}^2= & {} \left\| \sum _{i=1}^{m} f_i K_{i}(\cdot ) \right\| _{\mathscr {H}}^2 \\= & {} \sum _{i=1}^m \sum _{j=1}^m f_i f_j K(i,j) = \sum _{i=1}^m f_i^2, \end{aligned}

which coincides with the classical Euclidean norm of $$[f_1 \ldots f_m]$$. This allows us to conclude that the associated RKHS is the classical space $$\ell _2$$ of square summable sequences.

As a final note, Definition 6.1 easily confirms that $$\ell _2$$ is a RKHS. In fact, for every $$f=[f_1 \ f_2 \ \ldots ] \in \ell _2$$ one has

$$|f_i| \le \sqrt{\sum _i f_i^2} = \Vert f \Vert _2 \quad \forall i,$$

and, recalling (6.1), this shows that all the evaluation functionals $$f \rightarrow f_i$$ with $$i \in {\mathbb N}$$ are bounded. $$\square$$
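A small numerical companion to this example (our own illustration): for the identity kernel the Gram matrix over any finite index set is the identity, so the norm (6.4) of a sequence with finitely many nonzero entries reduces to its Euclidean norm:

```python
import numpy as np

# f = 3*K_1 - 1*K_2 + 2*K_3, a sequence with finitely many nonzero entries.
f = np.array([3.0, -1.0, 2.0])
norm_sq = f @ np.eye(3) @ f                 # (6.4) with K(i, j) = delta_ij
assert norm_sq == np.sum(f ** 2)            # RKHS norm = Euclidean norm

# Boundedness (6.1) of every evaluation functional: |f_i| <= ||f||_2, i.e., C_i = 1.
assert np.all(np.abs(f) <= np.sqrt(norm_sq))
```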

Example 6.5

(Sobolev space and the first-order spline kernel) While in the previous example we have seen that $$\mathscr {L}_2^{\mu }$$ is not a RKHS, consider now the space obtained by integrating the functions in $$\mathscr {L}_2^{\mu }$$. In particular, let $$\mathscr {X}=[0,1]$$, set $$\mu$$ to the Lebesgue measure and consider

\begin{aligned} \mathscr {H} = \left\{ f \ | \ f(x) = \int _0^x h(y) dy \ \text{ with } \ h \in \mathscr {L}_2^{\mu } \right\} . \end{aligned}

One thus has that any f in $$\mathscr {H}$$ satisfies $$f(0)=0$$ and is absolutely continuous: its derivative $$h=\dot{f}$$ is defined almost everywhere and is square integrable.

With the inner product given by

$$\langle f,g \rangle _{\mathscr {H}} = \langle \dot{f}, \dot{g} \rangle _{\mathscr {L}_2^{\mu }},$$

it is easy to see that $$\mathscr {H}$$ is a Hilbert space. In fact, $$\mathscr {L}_2^{\mu }$$ is a Hilbert space and we have established a one-to-one correspondence between functions in $$\mathscr {H}$$ and $$\mathscr {L}_2^{\mu }$$ which preserves the inner product. Such $$\mathscr {H}$$ is an example of a Sobolev space [2], since the complexity of a function is measured by the energy of its derivative:

$$\Vert f \Vert _{\mathscr {H}}^2 = \int _0^1 \dot{f}^2(x) dx.$$

Now, given $$x \in [0,1]$$, let $$\chi _x(\cdot )$$ be the indicator function of the set [0, x]. Then, one has

\begin{aligned} | f(x) |= & {} \left| \int _0^x \dot{f}(a) da \right| = \left| \langle \chi _x , \dot{f} \rangle _{\mathscr {L}_2^{\mu }} \right| \\\le & {} \Vert \dot{f} \Vert _{\mathscr {L}_2^{\mu }} = \Vert f \Vert _{\mathscr {H}}, \end{aligned}

where we have used the Cauchy–Schwarz inequality. Hence, $$\mathscr {H}$$ is also a RKHS since all the evaluation functionals are bounded. We now prove that its reproducing kernel is the so-called first-order (linear) spline kernel given by

\begin{aligned} K(x,y) = \min (x,y). \end{aligned}
(6.6)

In fact, every kernel section belongs to $$\mathscr {H}$$, being piecewise linear with $$\dot{K}_x = \chi _x$$. Furthermore, (6.6) satisfies the reproducing property since

\begin{aligned} \langle f, K_x \rangle _{\mathscr {H}}= & {} \langle \dot{f} , \chi _x \rangle _{\mathscr {L}_2^{\mu }} \\= & {} \int _0^x \dot{f}(y) dy = f(x). \end{aligned}

The linear spline kernel and some of its sections are displayed in the top panels of Fig. 6.2. $$\square$$
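The reproducing property just proved can also be checked numerically. The sketch below (our own, with an arbitrary test function satisfying $$f(0)=0$$) approximates the inner product $$\langle K_x, f \rangle _{\mathscr {H}} = \langle \chi _x, \dot{f} \rangle _{\mathscr {L}_2^{\mu }}$$ by quadrature and compares it with f(x):

```python
import numpy as np

# Test function with f(0) = 0 (hypothetical choice) and its derivative.
f = lambda y: np.sin(2 * y) + y ** 2
fdot = lambda y: 2 * np.cos(2 * y) + 2 * y

x = 0.6
h = 1e-5
mid = np.arange(0.0, 1.0, h) + h / 2        # midpoint quadrature grid on [0, 1]
chi_x = (mid <= x).astype(float)            # indicator of [0, x] = derivative of K_x

# <K_x, f>_H = <chi_x, fdot>_{L2} = int_0^x fdot(y) dy, which must equal f(x).
inner = np.sum(fdot(mid) * chi_x) * h
assert abs(inner - f(x)) < 1e-6
```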

6.2.1 Reproducing Kernel Hilbert Spaces Induced by Operations on Kernels $$\star$$

We report some classical results about RKHSs induced by operations on kernels which can be derived from [13]. The first theorem characterizes the RKHS induced by the sum or product of two kernels.

Theorem 6.6

(RKHS induced by sum or product of two kernels, based on [13]) Let K and G be two positive definite kernels over the same domain $$\mathscr {X} \times \mathscr {X}$$, associated to the RKHSs $$\mathscr {H}$$ and $$\mathscr {G}$$, respectively.

The sum $$K+G$$, where

$$[K+G](x,y)=K(x,y)+G(x,y),$$

is the reproducing kernel of the RKHS $$\mathscr {R}$$ containing functions

$$f= h +g, \quad (h,g) \in \mathscr {H} \times \mathscr {G}$$

with

$$\Vert f \Vert ^2_{\mathscr {R}} = \min _{h \in \mathscr {H}, g \in \mathscr {G}} \Vert h \Vert ^2_{\mathscr {H}} + \Vert g \Vert ^2_{\mathscr {G}} \ \text{ s.t. } \ f=h+g.$$

The product KG, where

$$[KG](x,y)=K(x,y)G(x,y)$$

is instead the reproducing kernel of the RKHS $$\mathscr {R}$$ containing functions

$$f= hg, \quad (h,g) \in \mathscr {H} \times \mathscr {G}$$

with

$$\Vert f \Vert ^2_{\mathscr {R}} = \min _{h \in \mathscr {H}, g \in \mathscr {G}} \Vert h \Vert ^2_{\mathscr {H}}\Vert g \Vert ^2_{\mathscr {G}} \ \text{ s.t. } \ f=hg.$$

The second theorem instead provides the connection between two RKHSs, with the second one obtained from the first one by sampling its kernel.

Theorem 6.7

(RKHS induced by kernel sampling, based on [13]) Let $$\mathscr {H}$$ be the RKHS induced by the kernel $$K: \mathscr {X} \times \mathscr {X} \rightarrow {\mathbb R}$$. Let $$\mathscr {Y} \subset \mathscr {X}$$ and denote by $$\mathscr {R}$$ the RKHS of functions over $$\mathscr {Y}$$ induced by the restriction of the kernel K to $$\mathscr {Y} \times \mathscr {Y}$$. Then, the functions in $$\mathscr {R}$$ correspond to the functions in $$\mathscr {H}$$ sampled on $$\mathscr {Y}$$. One also has

\begin{aligned} \Vert f \Vert ^2_{\mathscr {R}} = \min _{g \in \mathscr {H}} \ \Vert g \Vert ^2_{\mathscr {H}} \ \ \text{ s.t. } \ \ g_{\mathscr {Y}}=f, \end{aligned}
(6.7)

where $$g_{\mathscr {Y}}$$ is g sampled on $$\mathscr {Y}$$.
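Theorem 6.7 admits a simple finite-dimensional illustration (our own sketch): on $$\mathscr {X}=\{1,\ldots,5\}$$ a kernel is a 5 × 5 matrix, its restriction to a subset $$\mathscr {Y}$$ is a submatrix, and the restricted RKHS norm equals the minimal $$\mathscr {H}$$-norm (6.7) over all extensions; the minimum-norm extension used below is the standard kernel interpolant, stated here as an assumption of the sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5))
K = A @ A.T + 0.1 * np.eye(5)                # a strictly positive definite kernel matrix
idx = [0, 2, 4]                              # Y, a subset of X = {1,...,5}
KY = K[np.ix_(idx, idx)]                     # kernel restricted to Y x Y

fY = np.array([1.0, -2.0, 0.5])              # a function defined on Y
norm_R = fY @ np.linalg.solve(KY, fY)        # ||f||^2 in the restricted RKHS

# Minimum-norm extension of f to X: g = K[:, Y] KY^{-1} fY; it agrees with fY on Y.
g = K[:, idx] @ np.linalg.solve(KY, fY)
assert np.allclose(g[idx], fY)
norm_H = g @ np.linalg.solve(K, g)           # ||g||^2 in the full RKHS
assert abs(norm_R - norm_H) < 1e-8           # equality predicted by (6.7)
```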

The following theorem lists some operations which permit building kernels (and hence RKHSs) from simple building blocks.

Theorem 6.8

(Building kernels from kernels, based on [13]) Let $$K_1$$ and $$K_2$$ be two positive definite kernels over $$\mathscr {X} \times \mathscr {X}$$ and $$K_3$$ a positive definite kernel over $$\mathbb {R}^m \times \mathbb {R}^m$$. Let also P be an $$m \times m$$ symmetric positive semidefinite matrix and $$\mathscr {P}(x)$$ a polynomial with positive coefficients. Then, the following functions are positive definite kernels over $$\mathscr {X} \times \mathscr {X}$$:

• $$K(x,y)=K_1(x,y) + K_2(x,y)$$ (see also Theorem 6.6).

• $$K(x,y)=aK_1(x,y), \quad a \ge 0$$.

• $$K(x,y)=K_1(x,y)K_2(x,y)$$ (see also Theorem 6.6).

• $$K(x,y)=f(x)f(y), \quad f: \mathscr {X} \rightarrow \mathbb {R}$$.

• $$K(x,y)=K_3(f(x),f(y)), \quad f: \mathscr {X} \rightarrow \mathbb {R}^m$$.

• $$K(x,y)=x^T P y, \quad \mathscr {X}=\mathbb {R}^m$$.

• $$K(x,y)=\mathscr {P}(K_1(x,y))$$.

• $$K(x,y)=\exp (K_1(x,y))$$.

6.3 Spectral Representations of Reproducing Kernel Hilbert Spaces

In the previous section we have seen that any RKHS is generated by its kernel sections. We now discuss another representation obtainable when the kernel can be diagonalized as follows

\begin{aligned} K(x,y) = \sum _{i \in \mathscr {I}} \ \zeta _i \rho _i(x) \rho _i(y), \ \ \zeta _i > 0 \ \forall i , \end{aligned}
(6.8)

where the set $$\mathscr {I}$$ is countable. This will lead to new insights on the nature of the RKHSs, generalizing to the infinite-dimensional case the connection between regularization and basis expansion reported in Sect. 5.6.

A simple situation holds when the input space has finite cardinality, e.g., $$\mathscr {X}=\{x_1 \ldots x_m\}$$. Under this assumption, any positive definite kernel is in one-to-one correspondence with the $$m \times m$$ matrix $$\mathbf {K}$$ whose (i, j)-entry is $$K(x_i,x_j)$$. The representation (6.8) then follows from the spectral theorem applied to $$\mathbf {K}$$. In fact, if $$\zeta _i$$ and $$v_i$$ are, respectively, the eigenvalues and the orthonormal (column) eigenvectors of $$\mathbf {K}$$, (6.8) can be written as

$$\mathbf {K} = \sum _{i=1}^m \zeta _i v_i v_i^T,$$

where the functions $$\rho _i(\cdot )$$ have become the vectors $$v_i$$. One generalization of this result is described below.
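The finite-cardinality case just described can be reproduced in a few lines (our own sketch, using the first-order spline kernel of Example 6.5 sampled at six points):

```python
import numpy as np

# Spectral theorem on a sampled kernel matrix: K = sum_i zeta_i v_i v_i^T,
# the matrix counterpart of the expansion (6.8).
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.1, 1.0, 6))       # 6 input locations in (0, 1]
K = np.minimum.outer(x, x)                  # first-order spline kernel min(x, y), sampled

zeta, V = np.linalg.eigh(K)                 # eigenvalues zeta_i, orthonormal eigenvectors v_i
K_rec = sum(z * np.outer(v, v) for z, v in zip(zeta, V.T))
assert np.allclose(K, K_rec)                # the expansion reconstructs K exactly
assert np.allclose(V.T @ V, np.eye(6))      # the v_i play the role of the functions rho_i
```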

Let $$L_K$$ be the linear operator defined by the positive definite kernel K as follows:

\begin{aligned} L_K[f](\cdot ) = \int _{\mathscr {X}} K(\cdot ,x) f(x) d\mu (x). \end{aligned}
(6.9)

We also assume that $$\mu$$ is a $$\sigma$$-finite and nondegenerate Borel measure on $$\mathscr {X}$$. Essentially, this means that $$\mathscr {X}$$ is the countable union of measurable sets with finite measure and that $$\mu$$ “covers” $$\mathscr {X}$$ entirely. The reader can, e.g., consider $$\mathscr {X} \subset \mathbb {R}^m$$ and think of $$\mu$$ as the Lebesgue measure or any probability measure with $$\mu (A)>0$$ for any non-empty open set $$A \subset \mathscr {X}$$. The next classical result goes under the name of Mercer theorem, whose formulations trace back to [60].

Theorem 6.9

(Mercer theorem, based on [60]) Let $$\mathscr {X}$$ be a compact metric space equipped with a nondegenerate and $$\sigma$$-finite Borel measure $$\mu$$ and let K be a Mercer kernel on $$\mathscr {X} \times \mathscr {X}$$. Then, there exists a complete orthonormal basis of $$\mathscr {L}_2^{\mu }$$ given by a countable number of continuous functions $$\{\rho _i\}_{i \in \mathscr {I}}$$ satisfying

\begin{aligned} L_K[\rho _i] = \zeta _i \rho _i, \quad i \in \mathscr {I}, \quad \zeta _1 \ge \zeta _2 \ge \ \cdots \ \ge 0, \end{aligned}
(6.10)

with $$\zeta _i >0 \ \forall i$$ if K is strictly positive definite and $$\lim _{i \rightarrow \infty } \zeta _i =0$$ if the number of eigenvalues is infinite.

One also has

\begin{aligned} K(x,y) = \sum _{i \in \mathscr {I}} \ \zeta _i \rho _i(x) \rho _i(y), \end{aligned}
(6.11)

where the convergence of the series is absolute and uniform on $$\mathscr {X} \times \mathscr {X}$$.

The following result characterizes a RKHS through the eigenfunctions of $$L_K$$. The proof is reported in Sect. 6.9.3.

Theorem 6.10

(RKHS defined by an orthonormal basis of $$\mathscr {L}_2^{\mu }$$) Under the same assumptions of Theorem 6.9, if the $$\rho _i$$ and $$\zeta _i$$ satisfy (6.10), with also $$\zeta _i>0 \ \forall i$$, one has

\begin{aligned} \mathscr {H} = \left\{ f \ \Big | \ f(x) = \sum _{i \in \mathscr {I}} c_i \rho _i(x) \ \text{ s.t. } \ \sum _{i \in \mathscr {I}} \frac{c_i^2}{\zeta _i } < \infty \right\} . \end{aligned}
(6.12)

In addition, given

$$f = \sum _{i \in \mathscr {I}} a_i \rho _i, \quad g = \sum _{i \in \mathscr {I}} b_i \rho _i,$$

one has

\begin{aligned} \langle f,g \rangle _{\mathscr {H}} = \sum _{i \in \mathscr {I}} \frac{a_i b_i}{\zeta _i}, \end{aligned}
(6.13)

so that

\begin{aligned} \Vert f\Vert _{\mathscr {H}}^2 = \sum _{i \in \mathscr {I}} \frac{a_i^2}{\zeta _i}. \end{aligned}
(6.14)

Hence, it also follows that $$\{\sqrt{\zeta _i} \rho _i\}_{i \in \mathscr {I}}$$ is an orthonormal basis of $$\mathscr {H}$$.

The representation (6.12) is not unique since the spectral maps, i.e., the functions that associate a kernel with a decomposition of the type (6.8), are not unique. They depend on the chosen measure $$\mu$$ even if they lead to the same RKHS.

Theorem 6.10 thus shows that any kernel admitting an expansion (6.11) coming from the Mercer theorem induces a separable RKHS, i.e., one having a countable basis given by the $$\rho _i$$. Later on, Theorem 6.13 will show that this result holds under much milder assumptions. In fact, the representation (6.12) can be obtained starting from any diagonalized kernel (6.8) involving generic functions $$\rho _i$$, e.g., not necessarily independent of each other. One can also remove the compactness hypothesis on the input space, e.g., letting $$\mathscr {X}$$ be the entire $$\mathbb {R}^m$$.

Remark 6.2

(Relationship between $$\mathscr {H}$$ and $$\mathscr {L}_2^{\mu }$$ ) Theorem 6.10 points out an interesting connection between $$\mathscr {H}$$ and $$\mathscr {L}_2^{\mu }$$. Since the functions $$\rho _i$$ form an orthonormal basis in $$\mathscr {L}_2^{\mu }$$, one has

\begin{aligned} f \in \mathscr {L}_2^{\mu } \ \iff \ f= \sum _{i \in \mathscr {I}} c_i \rho _i \ \text{ with } \ \sum _{i \in \mathscr {I}} \ c_i^2 < \infty \end{aligned}
(6.15)

while (6.12) shows that

\begin{aligned} f \in \mathscr {H} \ \iff \ f= \sum _{i \in \mathscr {I}} c_i \rho _i \ \text{ with } \ \sum _{i \in \mathscr {I}} \ \frac{c_i^2}{\zeta _i} < \infty . \end{aligned}
(6.16)

If $$\zeta _i>0 \ \forall i$$, one has the set inclusion $$\mathscr {H} \subset \mathscr {L}_2^{\mu }$$ since the functions in the RKHS must satisfy a more stringent condition on the decay of the expansion coefficients (the $$\zeta _i$$ decay to zero).

In addition, let $$L_K^{1/2}$$ denote the operator defined as the square root of $$L_K$$, i.e., for any $$f \in \mathscr {L}_2^{\mu }$$ with $$f= \sum _{i \in \mathscr {I}} c_i \rho _i$$, one has

\begin{aligned} L_K^{1/2}[f] = \sum _{i \in \mathscr {I}} \sqrt{\zeta _i}c_i \rho _i. \end{aligned}
(6.17)

This is a smoothing operator: the function $$L_K^{1/2}[f]$$ is more regular than f since the expansion coefficients $$\sqrt{\zeta _i}c_i$$ decrease to zero faster than the $$c_i$$. In view of (6.15) and (6.16), we obtain

\begin{aligned} \mathscr {H} = \left\{ L_K^{1/2}[f] \ \ | \ \ f \in \mathscr {L}_2^{\mu } \right\} , \end{aligned}
(6.18)

which shows that the RKHS can be thought of as the output of the linear system $$L_K^{1/2}$$ fed with the space $$\mathscr {L}_2^{\mu }$$, i.e., $$\mathscr {H} = L_K^{1/2} \mathscr {L}_2^{\mu }$$.
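The relation $$\mathscr {H} = L_K^{1/2} \mathscr {L}_2^{\mu }$$ hides an isometry that is easy to verify numerically: by (6.14) and (6.17), the RKHS norm of $$L_K^{1/2}[f]$$ equals the $$\mathscr {L}_2^{\mu }$$ norm of f. A sketch (our own, using the spline-kernel eigenvalues derived in the next example as an assumed spectrum):

```python
import numpy as np

rng = np.random.default_rng(3)
c = rng.standard_normal(50)                   # coefficients of f in the basis {rho_i}
i = np.arange(1, 51)
zeta = 1.0 / (i * np.pi - np.pi / 2) ** 2     # assumed eigenvalues zeta_i (decaying to zero)

# g = L_K^{1/2}[f] has coefficients sqrt(zeta_i) * c_i, as in (6.17).
g_coef = np.sqrt(zeta) * c

# RKHS norm (6.14) of g: sum_i (sqrt(zeta_i) c_i)^2 / zeta_i = sum_i c_i^2 = ||f||^2_{L2}.
norm_H_sq = np.sum(g_coef ** 2 / zeta)
assert abs(norm_H_sq - np.sum(c ** 2)) < 1e-10
```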

Example 6.11

(Spline kernel expansion) In Example 6.5, we have seen that the space of functions on the unit interval satisfying $$f(0)=0$$ and $$\int _0^1 \dot{f}^2(x) dx < \infty$$ is the RKHS associated to the first-order spline kernel $$\min (x,y)$$. We now derive a representation of the type (6.12) for this space, setting $$\mu$$ to the Lebesgue measure. For this purpose, consider the eigenvalue problem

$$\int _0^1 \min (x,y) \rho (y) dy = \zeta \rho (x) .$$

The above equation is equivalent to

$$\int _0^x y \rho (y) dy + x \int _x^1 \rho (y) dy = \zeta \rho (x),$$

which, evaluated at $$x=0$$, implies $$\rho (0)=0$$. Taking the derivative w.r.t. x we also obtain

$$\int _x^1 \rho (y) dy = \zeta \dot{\rho }(x)$$

which, evaluated at $$x=1$$, implies $$\dot{\rho }(1)=0$$. Differentiating again w.r.t. x gives

$$-\rho (x) = \zeta \ddot{\rho }(x),$$

whose general solution is

$$\rho (x) = a \sin (x / \sqrt{\zeta }) + b \cos (x / \sqrt{\zeta }), \quad a,b \in \mathbb {R}.$$

The boundary conditions $$\rho (0)=\dot{\rho }(1)=0$$ imply $$b=0$$ and lead to the following possible eigenvalues:

$$\zeta _i = \frac{1}{( i \pi - \pi /2)^2}, \quad i=1,2,\ldots .$$

The orthonormality condition also implies $$a=\sqrt{2}$$ so that we obtain

$$\rho _i(x) = \sqrt{2} \sin \left( i \pi x - \frac{\pi x}{2}\right) , \quad i=1,2,\ldots .$$

This provides the representation (6.12) of the Sobolev space $$\mathscr {H}$$. Figure 6.3 plots three eigenfunctions (left panel) and the first 100 eigenvalues $$\zeta _i$$ (right panel). It is evident that the larger i, the larger the high-frequency content of $$\rho _i$$ and the RKHS norm of this basis function. In fact, a large value of i corresponds to a small eigenvalue $$\zeta _i$$, and one has $$\Vert \rho _i\Vert ^2_{\mathscr {H}}=1/\zeta _i$$. $$\square$$
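The eigenpairs derived in this example can be checked numerically against (6.11); the sketch below (our own) evaluates truncations of $$\sum_i \zeta_i \rho_i(x) \rho_i(y)$$ and compares them with $$\min(x,y)$$:

```python
import numpy as np

# Truncated Mercer expansion of the first-order spline kernel, with
# zeta_i = 1 / (i*pi - pi/2)^2 and rho_i(x) = sqrt(2) * sin((i*pi - pi/2) * x).
def spline_expansion(x, y, n):
    i = np.arange(1, n + 1)
    w = i * np.pi - np.pi / 2               # so that zeta_i = 1 / w_i^2
    return np.sum((2.0 / w ** 2) * np.sin(w * x) * np.sin(w * y))

x, y = 0.3, 0.7
err10 = abs(spline_expansion(x, y, 10) - min(x, y))
err1000 = abs(spline_expansion(x, y, 1000) - min(x, y))
assert err10 < 0.05                         # tail bound: sum_{i>n} 2/w_i^2 ~ 2/(pi^2 n)
assert err1000 < 1e-3                       # the truncated series approaches min(x, y)
```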

Example 6.12

(Translation invariant kernels and Fourier expansion) A translation invariant kernel depends only on the difference of its two arguments. Hence, there exists $$h:\mathscr {X} \rightarrow \mathbb {R}$$ such that $$K(x,y)=h(x-y)$$. Assume that $$\mathscr {X}=[0,2\pi ]$$ and that h can be extended to a continuous, symmetric and periodic function over $$\mathbb {R}$$. Then, it can be expanded in terms of the following uniformly convergent Fourier series

$$h(x)= \sum _{i=0}^{\infty } \ \zeta _i \cos ( i x),$$

where $$\zeta _0$$ accounts for the constant component and we assume $$\zeta _i>0 \ \forall i$$. We thus obtain the kernel expansion

$$K(x,y) = \zeta _0 + \sum _{i=1}^{\infty } \ \zeta _{i} \cos (i x) \cos (i y) + \sum _{i=1}^{\infty } \ \zeta _{i} \sin (i x) \sin (i y),$$

in terms of functions which are all orthogonal in $$\mathscr {L}_2^{\mu }$$. Hence, these kernels induce RKHSs generated by the Fourier basis,  with different inner products determined by $$\zeta _i$$. $$\square$$
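For a concrete instance of this construction, consider the hypothetical choice $$h(t)=\exp(\cos t)$$ (continuous, symmetric, $$2\pi$$-periodic, with all Fourier cosine coefficients positive); the sketch below (our own) recovers the $$\zeta_i$$ by numerical quadrature and checks the expansion of $$K(x,y)=h(x-y)$$:

```python
import numpy as np

h = lambda t: np.exp(np.cos(t))                    # hypothetical periodic profile
t = np.linspace(0.0, 2 * np.pi, 20001)[:-1]        # uniform grid on [0, 2*pi)
dt = t[1] - t[0]

# Fourier cosine coefficients of h (rectangle rule, spectrally accurate here).
zeta0 = np.sum(h(t)) * dt / (2 * np.pi)            # constant component
zeta = [np.sum(h(t) * np.cos(i * t)) * dt / np.pi for i in range(1, 21)]

# Truncated expansion K(x, y) = zeta0 + sum_i zeta_i [cos(ix)cos(iy) + sin(ix)sin(iy)].
x, y = 1.1, 2.5
Kxy = zeta0 + sum(z * (np.cos(i * x) * np.cos(i * y) + np.sin(i * x) * np.sin(i * y))
                  for i, z in enumerate(zeta, start=1))
assert abs(Kxy - h(x - y)) < 1e-6                  # the expansion reproduces h(x - y)
```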

6.3.1 More General Spectral Representation $$\star$$

Now, assume that the kernel K is available in the form $$K(x,y) = \sum _{i \in \mathscr {I}} \ \zeta _i \rho _i(x) \rho _i(y)$$ with $$\zeta _i > 0 \ \forall i$$, but with functions $$\rho _i$$ that are not necessarily orthonormal. More generally, we do not even require them to be independent; e.g., $$\rho _1$$ could be a linear combination of $$\rho _2$$ and $$\rho _3$$. The following result shows that the RKHS associated to K is still generated by the $$\rho _i$$, but the relationship between the expansion coefficients and $$\Vert \cdot \Vert _{\mathscr {H}}$$ is more involved than in the previous case.

Theorem 6.13

(RKHS induced by a diagonalized kernel) Let $$\mathscr {H}$$ be the RKHS induced by $$K(x,y) = \sum _{i \in \mathscr {I}} \ \zeta _i \rho _i(x) \rho _i(y)$$ with $$\zeta _i > 0 \ \forall i$$ and the set $$\mathscr {I}$$ countable. Then, $$\mathscr {H}$$ is separable and admits the representation

\begin{aligned} \mathscr {H} = \left\{ f \ \Big | \ f(x) = \sum _{i \in \mathscr {I}} c_i \rho _i(x) \ \text{ s.t. } \ \sum _{i \in \mathscr {I}} \frac{c_i^2}{\zeta _i } < \infty \right\} \end{aligned}
(6.19)

and one has

\begin{aligned} \Vert f\Vert _{\mathscr {H}}^2 = \min _{\{c_i\}} \sum _{i \in \mathscr {I}} \frac{c_i^2}{\zeta _i} \ \text{ s.t. } \ f = \sum _{i \in \mathscr {I}} c_i \rho _i. \end{aligned}
(6.20)

The proof is reported in Sect. 6.9.4 while an application example is given below.

Example 6.14

Let

$$K(x,y) = 2 \sin ^2(x) \sin ^2(y) + 2 \cos ^2(x) \cos ^2(y) + 1.$$

Using Theorem 6.13, we obtain that the RKHS $$\mathscr {H}$$ associated to K is spanned by $$\sin ^2(x)$$, $$\cos ^2(x)$$ and the constant function. Now, let $$f(x)=1$$ and consider the problem of computing $$\Vert f\Vert _{\mathscr {H}}^2$$. To have a correspondence with (6.8) we can, e.g., fix the notation

$$\rho _1(x) = \sin ^2(x), \quad \rho _2(x) = \cos ^2(x), \quad \rho _3(x)= 1$$

and

$$\zeta _1 = 2, \quad \zeta _2 = 2, \quad \zeta _3=1.$$

Since the functions $$\rho _i$$ are not independent, many different representations of $$f(x)=1$$ can be found. In particular, one has

$$1= c \rho _1(x) + c \rho _2(x) + (1-c) \rho _3(x) \quad \forall c \in \mathbb {R},$$

so that

$$\Vert f \Vert _{\mathscr {H}}^2 = \min _{c} \ \frac{c^2}{2} + \frac{c^2}{2} + (1-c)^2 = \min _{c} \ 2c^2 -2c +1 = \frac{1}{2}$$

with the minimum 1/2 obtained at $$c=1/2$$. Hence, according to the norm of $$\mathscr {H}$$, the “minimum energy” representation of $$f(x)=1$$ is $$1/2(\rho _1(x)+ \rho _2(x) + \rho _3(x))$$.

$$\square$$
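The minimization in Example 6.14 can also be verified numerically; the grid search below is only an illustrative check of the closed-form answer:

```python
import numpy as np

# Numerical check of Example 6.14: among all representations
# 1 = c*rho_1(x) + c*rho_2(x) + (1 - c)*rho_3(x), find the one of minimum
# squared norm c^2/2 + c^2/2 + (1 - c)^2 by a simple grid search.
c = np.linspace(-2.0, 3.0, 100001)
norm2 = c**2 / 2 + c**2 / 2 + (1 - c)**2

i = np.argmin(norm2)
print(c[i], norm2[i])   # both approximately 0.5: minimum 1/2 attained at c = 1/2
```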

6.4 Kernel-Based Regularized Estimation

6.4.1 Regularization in Reproducing Kernel Hilbert Spaces and the Representer Theorem

A powerful approach to reconstruct a function $$g:\mathscr {X} \rightarrow \mathbb {R}$$  from sparse data $$\{x_i,y_i\}_{i=1}^N$$ consists of minimizing a suitable functional over a RKHS. An important generalization of the estimators based on quadratic penalties, denoted by ReLS-Q in Chap. 3, is defined by

\begin{aligned} \hat{g} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {H}} \ \sum _{i=1}^{N}\mathscr {V}_i(y_i,f(x_i))+ \gamma \Vert f\Vert _{\mathscr {H}}^2. \end{aligned}
(6.21)

In (6.21), $$\mathscr {V}_i$$ are loss functions measuring the distance  between $$y_i$$ and $$f(x_i)$$. They take only nonnegative values and are assumed convex w.r.t. their second argument $$f(x_i)$$. As an example, when the quadratic loss is adopted for any i, one obtains

$$\mathscr {V}_i(y_i,f(x_i)) = (y_i -f(x_i))^2.$$

Then, the norm $$\Vert \cdot \Vert _{\mathscr {H}}$$ defines the regularizer, e.g., given by the energy of the first-order derivative

$$\Vert f \Vert _{\mathscr {H}}^2 = \int _0^1 \dot{f}^2(x) dx,$$

which corresponds to the spline norm introduced in Example 6.5.  Finally, the positive scalar $$\gamma$$ is the regularization parameter  (already encountered in the previous chapters) which balances adherence to the experimental data against function regularity. Indeed, the idea underlying (6.21) is that the predictor $$\hat{g}$$ should be able to describe the data without being too complex according to the RKHS norm. In particular, the purpose of the regularizer is to restore the well-posedness of the problem, making the solution depend continuously on the data. It should also encode the available information on the unknown function, e.g., the expected smoothness level.

The importance of the RKHSs in the context of regularization methods stems from the following central result, whose first formulation can be found in [52]. It shows that the solutions of the class of variational problems (6.21) admit a finite-dimensional representation, independently of the dimension of $$\mathscr {H}$$. The proof of an extended version of this result can be found in Sect. 6.9.5.

Theorem 6.15

(Representer theorem, adapted from [104]) Let $$\mathscr {H}$$ be a RKHS. Then, all the solutions of (6.21) admit the following expression

\begin{aligned} \hat{g} = \sum _{i=1}^N \ c_i K_{x_i}, \end{aligned}
(6.22)

where the $$c_i$$ are suitable scalar expansion coefficients.

Thus, as in the traditional linear parametric approach, the optimal function is a linear combination of basis functions. However, a fundamental difference is that their number is now equal to the number of data pairs, and is thus not fixed a priori. In fact, the functions appearing in the expression of the minimizer $$\hat{g}$$ are just the kernel sections $$K_{x_i}$$ centred on the input data. The representer theorem also conveys the message that, using estimators of the form (6.21), it is not possible to recover arbitrarily complex functions from a finite amount of data. The solution is always confined to a subspace with dimension equal to the data set size.

Now, let $$\mathbf {K} \in \mathbb {R}^{N \times N}$$ be the positive semidefinite matrix (called kernel matrix, or Gram matrix) such that $$\mathbf {K}_{ij} = K(x_i,x_j)$$. The ith row of $$\mathbf {K}$$ is denoted by $$\mathbf {k}_i$$. Using this notation, if $$g = \sum _{i=1}^N \ c_i K_{x_i}$$ then

\begin{aligned} g(x_i) = \mathbf {k}_i c \ \ \text{ and } \ \ \Vert g \Vert _{\mathscr {H}}^2 = c^T\mathbf {K}c, \end{aligned}
(6.23)

where $$c=[c_1,\ldots ,c_N]^T$$ and the second equality derives from the reproducing property or, equivalently, from (6.4).

Using the representer theorem, we can plug the expression (6.22) of the optimal $$\hat{g}$$ into the objective (6.21). Then, exploiting (6.23), the variational problem (6.21) boils down to

\begin{aligned} \min _{c \in \mathbb {R}^{N}} \sum _{i=1}^N \mathscr {V}_i (y_i,\mathbf {k}_ic)+ \gamma c^T\mathbf {K}c. \end{aligned}
(6.24)

The regularization problem (6.21) has been thus reduced to a finite-dimensional optimization problem whose order N does not depend on the dimension of the original space $$\mathscr {H}$$. In addition, since each loss function $$\mathscr {V}_i$$ has been assumed convex, the objective (6.24) is convex overall. How to compute the expansion coefficients now depends on the specific choice of the $$\mathscr {V}_i$$, as discussed in the next section.

Remark 6.3

(Kernel trick and implicit basis functions encoding) Assume that the kernel admits the expansion $$K(x,y) = \sum _{i =1}^{\infty } \ \zeta _i \rho _i(x) \rho _i(y), \ \ \zeta _i > 0$$.   Then, as discussed in Sect. 6.3, any function in $$\mathscr {H}$$ has the representation

$$f=\sum _{i =1}^{\infty } \ a_i \rho _i \ \ \text {with} \ \ \Vert f\Vert _{\mathscr {H}}^2=\sum _{j=1}^{\infty } \frac{a_j^2}{\zeta _j}.$$

Problem (6.21) can then be rewritten using the infinite-dimensional vector $$a=[a_1 \ a_2 \ \ldots ]$$ as unknown:

$$\hat{a} =\arg \min _a \ \sum _{i=1}^N \mathscr {V}_i\left( y_i,\sum _{j=1}^\infty a_j \rho _j(x_i)\right) + \gamma \sum _{j=1}^{\infty } \frac{a_j^2}{\zeta _j},$$

and an equivalent representation of (6.22) becomes $$\hat{g}=\sum _{i =1}^{\infty } \ \hat{a}_i \rho _i$$. In comparison with this reformulation, the use of the kernel and of the representer theorem offers both modelling and computational advantages. In fact, through K one needs neither to choose the number of basis functions to be used (the kernel can implicitly encode an infinite number of basis functions) nor to store any basis function in memory (the representer theorem reduces inference to solving a finite-dimensional optimization problem based on the kernel matrix $$\mathbf {K}$$). These features are related to what is called the kernel trick in the machine learning literature.

6.4.2 Representer Theorem Using Linear and Bounded Functionals

A more general version of the representer theorem, derived in [52], is obtained by replacing $$f(x_i)$$ with $$L_i[f]$$, where $$L_i$$ is linear and bounded. In the first part of the following result $$\mathscr {H}$$ is only required to be a Hilbert space. In Sect. 6.9.5 we will see how Theorem 6.16 can be further generalized.

Theorem 6.16

(Representer theorem with functionals $$L_i$$, adapted from [104]) Let $$\mathscr {H}$$ be a Hilbert space and consider the optimization problem

\begin{aligned} \hat{g} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {H}} \ \sum _{i=1}^{N}\mathscr {V}_i(y_i,L_i[f])+ \gamma \Vert f\Vert _{\mathscr {H}}^2, \end{aligned}
(6.25)

where each $$L_i: \mathscr {H} \rightarrow \mathbb {R}$$ is linear and bounded. Then, all the solutions of (6.25) admit the following expression

\begin{aligned} \hat{g} = \sum _{i=1}^N \ c_i \eta _i, \end{aligned}
(6.26)

where the $$c_i$$ are suitable scalar expansion coefficients and each $$\eta _i \in \mathscr {H}$$ is the representer of $$L_i$$, i.e., for any i and $$f \in \mathscr {H}$$:

\begin{aligned} L_i[f]=\langle f,\eta _i \rangle _{\mathscr {H}}. \end{aligned}
(6.27)

In particular, if $$\mathscr {H}$$ is a RKHS with kernel K, each basis function is given by

\begin{aligned} \eta _i(x) = L_i[K(\cdot ,x)]. \end{aligned}
(6.28)

The existence of $$\eta _i$$ satisfying (6.27) is ensured by the Riesz representation theorem (Theorem 6.27). One can also prove that in a RKHS a linear functional L is bounded if and only if the function f obtained by applying L to the kernel, i.e., $$f(x)=L[K(x,\cdot )] \ \forall x$$, belongs to the RKHS.

Note also that Theorem 6.15 is indeed a special case of the last result. In fact, let $$\mathscr {H}$$ be a RKHS and $$L_i[f]=f(x_i) \ \forall i$$. Then, each $$L_i$$ is linear and bounded and each $$\eta _i$$ becomes the kernel section $$K_{x_i}$$ according to the reproducing property.

Example 6.17

(Solution using the quadratic loss) Let us adopt a quadratic loss in (6.25), i.e., $$\mathscr {V}_i(y_i,L_i[f])=(y_i-L_i[f])^2$$. This makes the objective strictly convex so that a unique solution exists. To find it, plugging (6.26) into (6.25) and using (6.28), the following quadratic problem is obtained

\begin{aligned} \Vert Y-Oc \Vert ^2+ \gamma c^T O c \end{aligned}
(6.29)

where $$Y=[y_1, \ldots ,y_N]^T$$, $$\Vert \cdot \Vert$$ is the Euclidean norm, while the $$N \times N$$ matrix O has ij entry given by

\begin{aligned} O_{ij}=\langle \eta _i, \eta _j \rangle _{\mathscr {H}} = L_i[L_j[K]]. \end{aligned}
(6.30)

The minimizer $$\hat{c}$$ of (6.29) is unique if O is full rank. Otherwise, all the solutions lead to the same function estimate in view of the (already mentioned) strict convexity of (6.25). In particular, one can always use as optimal expansion coefficients the components of the vector

\begin{aligned} \hat{c} = (O+\gamma I_N)^{-1}Y. \end{aligned}
(6.31)

In Sect. 6.5.1 this result will be further discussed in the context of the so-called regularization networks, where one comes back to assume $$L_i[f]=f(x_i)$$. $$\square$$

6.5 Regularization Networks and Support Vector Machines

The choice of the loss $$\mathscr {V}_i$$ in (6.21)  yields regularization algorithms with different properties. We will illustrate four different cases below.

6.5.1 Regularization Networks

Let us consider the quadratic loss function $$\mathscr {V}_i(y_i,f(x_i))= r_i^2$$, with the residual $$r_i$$ defined by $$r_i=y_i-f(x_i)$$. Such a loss, also depicted in Fig. 6.4 (top left panel), leads to the problem

\begin{aligned} \hat{g} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {H}} \ \sum _{i=1}^{N} (y_i-f(x_i))^2+ \gamma \Vert f\Vert _{\mathscr {H}}^2, \end{aligned}
(6.32)

which is a generalization of the regularized least squares problem encountered in the previous chapters. In particular, it extends the estimator (3.58a) based on quadratic penalty called ReLS-Q in Chap. 3. The estimator (6.32) is known in the literature as regularization network [71] or also kernel ridge regression.  The strict convexity of the objective (6.32) ensures that the minimizer $$\hat{g}$$ not only exists but is also unique (this issue is further discussed in the remark at the end of this subsection).

To find the solution, we can follow the same arguments developed in Example 6.17, just specializing the result to the case $$L_i[f]=f(x_i)$$. We will see that the matrix O has just to be replaced by the kernel matrix $$\mathbf {K}$$.

As previously done, let $$Y=[y_1, \ldots ,y_N]^T$$ and use $$\Vert \cdot \Vert$$ to indicate the Euclidean norm. Then, the corresponding regularization problem (6.24) becomes

\begin{aligned} \min _{c \in \mathbb {R}^{N}} \Vert Y-\mathbf {K}c\Vert ^2 + \gamma c^T\mathbf {K}c, \end{aligned}
(6.33)

which is a finite-dimensional ReLS-Q. After simple calculations, one of the optimal solutions is found to be

\begin{aligned} \hat{c} = \left( \mathbf {K}+\gamma I_{N}\right) ^{-1}Y, \end{aligned}
(6.34)

where $$I_{N}$$ is the $$N \times N$$ identity matrix. The estimate from the regularization network is thus available in closed form, given by $$\hat{g} = \sum _{i=1}^N \ \hat{c}_i K_{x_i}$$ with the optimal coefficient vector $$\hat{c}$$ solving a linear system of equations.
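A minimal sketch of the regularization network in Python; the Gaussian kernel, the simulated data and the values of $$\gamma$$ and of the kernel width are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

# Regularization network (6.32) with a Gaussian kernel (illustrative choice).
rng = np.random.default_rng(0)
N = 30
X = np.sort(rng.uniform(0, 1, N))
Y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(N)

def kernel(a, b, width=0.1):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * width**2))

gamma = 0.05
Kmat = kernel(X, X)                                    # kernel (Gram) matrix
c_hat = np.linalg.solve(Kmat + gamma * np.eye(N), Y)   # closed form (6.34)

def g_hat(x):                                          # g_hat = sum_i c_hat_i K_{x_i}
    return kernel(np.atleast_1d(x), X) @ c_hat

print(float(g_hat(0.25)))   # roughly sin(pi/2) = 1
```

The whole estimate is obtained by solving a single $$N \times N$$ linear system, as stated by (6.34).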

Remark 6.4

(Regularization network as projection) An interpretation of the regularization network can also be given in terms of a projection. In particular, let $$\mathscr {R}$$ be the Hilbert space $$\mathbb {R}^N \times \mathscr {H}$$ (each element is a pair (v, f) consisting of a vector v and a function f) with norm defined, for any $$v \in \mathbb {R}^N$$ and $$f \in \mathscr {H}$$, by

$$\Vert (v,f)\Vert _{\mathscr {R}}^2 = \Vert v\Vert ^2 + \gamma \Vert f \Vert _{\mathscr {H}}^2, \ \ \gamma >0, \ \ \Vert \cdot \Vert = \text {Euclidean norm}.$$

Let also S be the (closed) subspace given by all the pairs (v, f) satisfying the constraint $$v=[f(x_1) \ldots f(x_N)]$$. Then, if $$g=(Y,0)$$, where 0 here denotes the null function in $$\mathscr {H}$$, the projection of g onto S is

\begin{aligned} g_S &= \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{h \in S} \ \Vert g-h \Vert ^2_{\mathscr {R}} \\ &= \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{(\{f(x_i)\}_{i=1}^N,f), \ f \in \mathscr {H}} \ \sum _{i=1}^{N} (y_i-f(x_i))^2+ \gamma \Vert f\Vert _{\mathscr {H}}^2. \end{aligned}

It is now immediate to conclude that $$g_S$$ corresponds to $$([\hat{g}(x_1) \ldots \hat{g}(x_N)],\hat{g})$$ where $$\hat{g}$$ is indeed the minimizer of (6.32), which must thus be unique in view of Theorem 6.25 (Projection theorem). Note that this interpretation can be extended to all the variational problems (6.21) containing losses defined by a norm induced by an inner product in $$\mathbb {R}^N$$.

6.5.2 Robust Regression via Huber Loss $$\star$$

As described in Sect. 3.6.1, a shortcoming of the quadratic loss is its sensitivity to outliers because the influence of large residuals $$r_i$$ grows quadratically. In the presence of outliers, it is better to use a loss function that grows linearly. These issues have been widely studied in the field of robust statistics [51],   where loss functions such as Huber’s   have been introduced. Recalling (3.115), one has

$$\mathscr {V}_i(y_i,f(x_i)) = \left\{ \begin{array}{ll} \frac{r_i^2}{2}, &{} |r_i|\le \delta \\ \delta \left( |r_i|-\frac{\delta }{2}\right) , &{} |r_i| > \delta \end{array} \right.$$

where we still have $$r_i=y_i-f(x_i)$$. The Huber loss function with $$\delta =1$$ is shown in Fig. 6.4 (top right panel). Notice that it grows linearly for large residuals and is thus robust to outliers. When $$\delta \rightarrow +\infty$$, one recovers the quadratic loss (up to a constant factor). On the other hand, we also have $$\lim _{\delta \rightarrow 0^+} \mathscr {V}_i(y_i,f(x_i))/\delta = |r_i|$$, which is the absolute value loss.
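The effect of the Huber loss can be sketched by solving the finite-dimensional problem (6.24) numerically. The Gaussian kernel, the data with one injected outlier and the values of $$\delta$$ and $$\gamma$$ are illustrative assumptions, and a general-purpose smooth optimizer stands in for a dedicated solver:

```python
import numpy as np
from scipy.optimize import minimize

# Robust kernel regression: problem (6.24) with the Huber loss.
rng = np.random.default_rng(1)
N = 20
X = np.linspace(0, 1, N)
Y = np.sin(2 * np.pi * X) + 0.05 * rng.standard_normal(N)
Y[5] += 5.0                                            # inject one outlier

Kmat = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * 0.1**2))
gamma, delta = 0.1, 0.5

def huber(r):
    return np.where(np.abs(r) <= delta, r**2 / 2, delta * (np.abs(r) - delta / 2))

def objective(c):
    r = Y - Kmat @ c
    return np.sum(huber(r)) + gamma * c @ Kmat @ c     # (6.24), Huber loss

c_hat = minimize(objective, np.zeros(N), method="L-BFGS-B").x
g_hat = Kmat @ c_hat
print(abs(g_hat[5] - Y[5]))   # large residual: the fit does not chase the outlier
```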

6.5.3 Support Vector Regression $$\star$$

Sometimes, it is desirable to neglect prediction errors as long as they are below a certain threshold. This can be achieved, e.g., using Vapnik’s $$\varepsilon$$-insensitive loss given,   for $$r_i=y_i-f(x_i)$$, by

$$\mathscr {V}_i(y_i,f(x_i)) = |r_i|_{\varepsilon } = \left\{ \begin{array}{ll} 0, &{} |r_i| \le \varepsilon \\ |r_i|-\varepsilon , &{} |r_i| > \varepsilon \end{array} \right.$$

The Vapnik loss with $$\varepsilon =0.5$$ is shown in Fig. 6.4 (bottom left panel). Notice that it has a null plateau in the interval $$[-\varepsilon , \varepsilon ]$$ so that any predictor closer than $$\varepsilon$$ to $$y_i$$ is seen as a perfect interpolant. The loss then grows linearly, thus ensuring robustness. The regularization problem (6.21) associated with the $$\varepsilon$$-insensitive loss function turns out to be

\begin{aligned} \hat{g} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {H}} \ \sum _{i=1}^{N} |y_i-f(x_i)|_{\varepsilon }+ \gamma \Vert f\Vert _{\mathscr {H}}^2, \end{aligned}
(6.35)

and is called Support Vector Regression (SVR), see, e.g., [37]. The SVR solution, given by $$\hat{g} = \sum _{i=1}^N \ \hat{c}_i K_{x_i}$$ according to the representer theorem, is characterized by sparsity in $$\hat{c}$$, i.e., some components $$\hat{c}_i$$ are set to zero. This feature is briefly discussed below.

In the SVR case, obtaining the optimal coefficient vector $$\hat{c}$$ by (6.24) is not trivial since the loss $$| \cdot |_{\varepsilon }$$ is not differentiable everywhere. This difficulty can be circumvented by replacing (6.24) with the following equivalent problem obtained considering two additional N-dimensional parameter vectors $$\xi$$ and $$\xi ^*$$:

\begin{aligned} \min _{c,\xi ,\xi ^*} \ \sum _{i=1}^N (\xi _i + \xi _i^*) + \gamma c^T\mathbf {K}c, \end{aligned}
(6.36)

subject to the constraints

\begin{aligned}&y_i - \mathbf {k}_ic \le \varepsilon + \xi _i, \quad i=1,\ldots ,N,\\&\mathbf {k}_ic - y_i \le \varepsilon + \xi ^*_i, \quad i=1,\ldots ,N,\\&\xi _i,\xi _i^* \ge 0, \qquad \qquad i=1,\ldots ,N. \end{aligned}

To see that its minimizer contains the optimal solution $$\hat{c}$$ of (6.24), it suffices to notice that (6.36) assigns a linear penalty only when $$|y_i - \mathbf {k}_ic| > \varepsilon$$.

Problem (6.36) is quadratic subject to linear inequality constraints, hence it is solvable by standard optimization approaches like interior point methods [64, 108]. Calculating the Karush–Kuhn–Tucker conditions, it is possible to show that the condition $$|y_i - \mathbf {k}_i\hat{c}| < \varepsilon$$ implies $$\hat{c}_i=0$$. Indexes i for which $$\hat{c}_i \ne 0$$ instead identify the set of input locations $$x_i$$ called support vectors.
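A small-scale sketch of (6.36), assuming illustrative data, a Gaussian kernel and a general-purpose solver in place of the interior point methods mentioned above:

```python
import numpy as np
from scipy.optimize import minimize

# Support vector regression via the constrained problem (6.36).
rng = np.random.default_rng(2)
N = 10
X = np.linspace(0, 1, N)
Y = np.sin(2 * np.pi * X) + 0.05 * rng.standard_normal(N)
Kmat = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * 0.15**2))
gamma, eps = 0.01, 0.2

def obj(z):                                   # z = [c, xi, xi_star]
    c, xi, xis = z[:N], z[N:2 * N], z[2 * N:]
    return np.sum(xi) + np.sum(xis) + gamma * c @ Kmat @ c

cons = [
    {"type": "ineq", "fun": lambda z: eps + z[N:2 * N] - (Y - Kmat @ z[:N])},
    {"type": "ineq", "fun": lambda z: eps + z[2 * N:] - (Kmat @ z[:N] - Y)},
    {"type": "ineq", "fun": lambda z: z[N:]},  # slack variables are nonnegative
]
# feasible starting point: c = 0 with the slacks absorbing the residuals
z0 = np.concatenate([np.zeros(N), np.maximum(Y - eps, 0), np.maximum(-Y - eps, 0)])
z = minimize(obj, z0, constraints=cons, method="SLSQP",
             options={"maxiter": 500}).x
c_hat, xi, xis = z[:N], z[N:2 * N], z[2 * N:]
print(np.max(np.abs(Y - Kmat @ c_hat)))   # residuals essentially within the tube
```

At the returned solution the constraints are satisfied up to solver tolerance; a dedicated QP solver would also expose the dual variables that reveal the support vectors.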

6.5.4 Support Vector Classification $$\star$$

The three losses illustrated above were originally proposed for regression problems, where the output y is real valued. When the outputs can assume only two values, e.g., 1 and −1, a classification problem arises. Here, the purpose of the predictor is just to separate the two classes. This problem can be seen as a special case of regression. In particular, even if the output space is binary, consider prediction functions $$f: \mathscr {X} \rightarrow \mathbb {R}$$ and assume that the input $$x_i$$ is associated to the class 1 if $$f(x_i)\ge 0$$ and to the class $$-1$$ if $$f(x_i)<0$$. Let the margin on an example $$(x_i,y_i)$$ be $$m_i=y_if(x_i)$$. Then, we will see that the value of $$m_i$$ is a measure of how well we are classifying the available data. One can thus try to maximize the margin while still searching for a function not too complex according to the RKHS norm. In particular, we can exploit (6.21) with a loss that depends on the margin as described below.

The most natural classification loss is the $$0-1$$ loss defined for any i by

$$\mathscr {V}_i(y_i,f(x_i)) = \left\{ \begin{array}{ll} 0, &{} m_i >0 \\ 1, &{} m_i \le 0 \end{array} \right. \quad m_i=y_if(x_i),$$

and depicted in Fig. 6.4 (bottom right panel, dashed line). Adopting it, the first component of the objective in (6.21) returns the number of misclassifications. However, the $$0-1$$ loss is not convex and leads to an optimization problem of combinatorial nature.

An alternative is the so-called hinge loss [98] defined by

$$\mathscr {V}_i(y_i,f(x_i)) = | 1 - y_i f(x_i) |_+ = \left\{ \begin{array}{ll} 0, &{} m_i > 1 \\ 1-m_i, &{} m_i \le 1 \end{array} \right. \quad m_i=y_if(x_i),$$

which thus provides a linear penalty when $$m_i<1$$. Figure 6.4 (bottom right panel, solid line) illustrates that it is a convex upper bound on the $$0-1$$ loss. The problem associated with the hinge loss turns out to be

\begin{aligned} \hat{g} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {H}} \ \sum _{i=1}^{N} |1 - y_i f(x_i)|_+ + \gamma \Vert f\Vert _{\mathscr {H}}^2, \end{aligned}
(6.37)

and is called support vector classification (SVC).

Like in the SVR case, obtaining the optimal coefficient vector by (6.37) is not trivial since the hinge loss is not differentiable. But one can still resort to an equivalent problem, now obtained considering just an additional parameter vector $$\xi$$:

\begin{aligned} \min _{c,\xi } \ \sum _{i=1}^N \ \xi _i + \gamma c^T\mathbf {K}c, \end{aligned}
(6.38)

subject to the constraints

\begin{aligned}&y_i (\mathbf {k}_ic) \ge 1 - \xi _i, \quad i=1,\ldots ,N,\\&\xi _i \ge 0, \qquad \qquad \ \ i=1,\ldots ,N. \end{aligned}

As in the SVR case, the optimal solution $$\hat{c}$$  is sparse and indexes i for which $$\hat{c}_i \ne 0$$ define the support vectors $$x_i$$.

6.6 Kernels Examples

The reproducing kernel characterizes the hypothesis space $$\mathscr {H}$$. Together with the loss function, it also completely defines the key estimator (6.21) which exploits the RKHS norm as regularizer. The choice of K thus has a crucial impact on the ability to predict future output data. Some important RKHSs are discussed below.

6.6.1 Linear Kernels, Regularized Linear Regression and System Identification

We now show that the regularization network (6.32) generalizes the ReLS-Q problem introduced in Chap. 3 which adopts quadratic penalties. The link is provided by the concept of linear kernel.

We start assuming that the input space is $$\mathscr {X} = {\mathbb R}^m$$. Hence, any input location x corresponds to an m-dimensional (column) vector. If $$P \in {\mathbb R}^{m \times m}$$ denotes a symmetric and positive semidefinite matrix, a linear kernel is defined as follows

$$K(y,x) = y^T P x, \quad (x,y) \in \mathbb {R}^m \times \mathbb {R}^m.$$

All the kernel sections are linear functions. Hence, their span defines a finite-dimensional (closed) subspace of linear functions that, in view of Theorem 6.1 (and subsequent discussion) coincides with the whole $$\mathscr {H}$$. Hence, the RKHS induced by the linear kernel is simply a space of linear functions and, for any $$g \in \mathscr {H}$$, there exists $$a \in {\mathbb R}^m$$ such that

$$g(x)=a^T P x=K_a(x).$$

If P is full rank, letting $$\theta := P a$$, we also have

\begin{aligned} \Vert g \Vert ^2_{\mathscr {H}} &= \Vert K_a \Vert ^2_{\mathscr {H}} = \langle K_a, K_a \rangle _{\mathscr {H}} \\ &= K(a,a) = a^T P a \\ &= \theta ^T P^{-1} \theta . \end{aligned}

Now, let us use such $$\mathscr {H}$$ in the regularization network (6.32). Without using the representer theorem, we can plug the representation $$g(x)=\theta ^T x$$ in the regularization problem to obtain $$\hat{g}(x)=\hat{\theta }^Tx$$ where

\begin{aligned} \hat{\theta }= \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\theta \in \mathbb {R}^{m}} \ \Vert Y-\varPhi \theta \Vert ^2+ \gamma \theta ^T P^{-1} \theta , \end{aligned}
(6.39)

with the ith row of the regression matrix $$\varPhi$$ equal to $$x_i^T$$. One can see that (6.39) coincides with ReLS-Q, with the regularization matrix P defining the linear kernel K and, in turn, the penalty term $$\theta ^T P^{-1} \theta$$.

We now derive the connection with linear system identification in discrete time. The data set consists of the output measurements $$\{y_i\}_{i=1}^N$$, collected at the time instants $$\{t_i\}_{i=1}^N$$, and of the system input u. We can form each input location using past input values as follows

\begin{aligned} x_i = [u_{t_i-1} \ u_{t_i-2} \ \ldots \ u_{t_i-m}]^T, \end{aligned}
(6.40)

where m is the FIR order and an input delay of one unit has been assumed. Then, if Y collects the noisy outputs, $$\hat{\theta }$$ becomes the impulse response estimate. This establishes a correspondence between regularized FIR estimation and regularization in RKHS induced by linear kernels.
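The correspondence between (6.39) and the kernel route (6.34) can be checked numerically; the data, the FIR order and the regularization matrix P below are illustrative assumptions:

```python
import numpy as np

# Check that the linear-kernel regularization network reproduces ReLS-Q (6.39).
rng = np.random.default_rng(3)
N, m = 40, 5
Phi = rng.standard_normal((N, m))              # rows are the x_i^T (past inputs)
theta_true = np.array([0.8, 0.5, 0.3, 0.15, 0.05])
Y = Phi @ theta_true + 0.05 * rng.standard_normal(N)

P = np.diag(0.7 ** np.arange(m))               # regularization matrix / linear kernel
gamma = 0.5

# ReLS-Q solution of (6.39)
theta_rels = np.linalg.solve(Phi.T @ Phi + gamma * np.linalg.inv(P), Phi.T @ Y)

# Kernel route: K_ij = x_i^T P x_j, c from (6.34), theta = sum_i c_i P x_i
Kmat = Phi @ P @ Phi.T
c_hat = np.linalg.solve(Kmat + gamma * np.eye(N), Y)
theta_kern = P @ Phi.T @ c_hat

assert np.allclose(theta_rels, theta_kern)     # the two routes coincide
```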

6.6.1.1 Infinite-Dimensional Extensions $$\star$$

In place of $$\mathscr {X}=\mathbb {R}^m$$, let now $$\mathscr {X} \subset \mathbb {R}^\infty$$, i.e., the input space contains sequences. We can interpret any input location as an infinite-dimensional column vector and use ordinary notation of algebra to handle infinite-dimensional objects. For instance, if $$x,y \in \mathscr {X}$$ then $$x^Ty=\langle x,y \rangle _2$$ where $$\langle \cdot ,\cdot \rangle _2$$ is the inner product in $$\ell _2$$. Assume we are given a symmetric and infinite-dimensional matrix P such that the linear kernel

$$K(y,x) = y^T P x$$

is well defined over a subset of $$\mathbb {R}^\infty \times \mathbb {R}^\infty$$. For example, if P is absolutely summable, i.e., $$\sum _{ij} |P_{ij}|<\infty$$, the kernel is well defined for any input location $$x \in \mathscr {X}$$ with $$\mathscr {X}=\ell _\infty$$. The kernel section centred on x is the infinite-dimensional column vector Px. Following arguments similar to those seen in the finite-dimensional case, one can conclude that the RKHS associated to such K contains linear functions of the form $$g(x)=a^T P x$$ with $$a \in \mathscr {X}$$. Roughly speaking, the regularization network (6.32) relying on such hypothesis space is the limit of Problem (6.39) for $$m \rightarrow \infty$$. To compute the solution, in this case it is necessary to resort to the representer theorem (6.22). One obtains

$$\hat{g}(x) = \sum _{i=1}^N \ \hat{c}_i K_{x_i}(x) = \hat{\theta }^T x$$

where $$\hat{c}$$ is defined by (6.34) and

$$\hat{\theta } := \sum _{i=1}^N \ \hat{c}_i P x_i.$$

The link with linear system identification follows the same reasoning previously developed but $$x_i$$ now contains an infinite number of past input values, i.e.,

$$x_i = [u_{t_i-1} \ u_{t_i-2} \ u_{t_i-3} \ldots ]^T.$$

With this correspondence, the regularization network now implements regularized IIR estimation and $$\hat{\theta }$$ contains the impulse response coefficient estimates. In fact, note that the nature of $$x_i$$ makes the value $$\hat{g}(x_i)$$ the convolution between the system input u and $$\hat{\theta }$$ evaluated at $$t_i$$ (with one unit of input delay).

In a more sophisticated scenario, in place of sequences, the input space $$\mathscr {X}$$ could contain functions. For instance, $$\mathscr {X} \subset \mathscr {P}^c$$ where $$\mathscr {P}^c$$ is the space of piecewise continuous functions on $$\mathbb {R}^+$$. Thus, each input location corresponds to a continuous function $$x:\mathbb {R}^+ \rightarrow \mathbb {R}$$. Given a suitable symmetric function $$P: \mathbb {R}^+ \times \mathbb {R}^+ \rightarrow \mathbb {R}$$, a linear kernel is now defined by

$$K(y,x) = \int _{\mathbb {R}^+ \times \mathbb {R}^+} \ y(t) P(t,\tau ) x(\tau ) dt d\tau .$$

The corresponding RKHS thus contains linear functionals: any $$f \in \mathscr {H}$$ maps x (which is a function) into $$\mathbb {R}$$. The solution of the regularization network (6.32) equipped with such hypothesis space is

$$\hat{g}(x) = \sum _{i=1}^N \ \hat{c}_i K_{x_i}(x) = \int _{\mathbb {R}^+} \hat{\theta }(\tau ) x(\tau ) d \tau ,$$

where $$\hat{c}$$ is still defined by (6.34) and

$$\hat{\theta }(\tau ) := \sum _{i=1}^N \ \hat{c}_i \int _{\mathbb {R}^+} \ P(\tau ,t) x_i(t) dt.$$

The connection with linear system identification is obtained by defining

$$x_i(t) = u(t_i-t), \quad t \ge 0$$

(if the input u(t) is continuous for $$t \ge 0$$ and causal, the functions $$x_i(t)$$ are piecewise continuous, making the assumption $$\mathscr {X} \subset \mathscr {P}^c$$ necessary). In this way, each $$g \in \mathscr {H}$$ represents a different linear system. Furthermore, the regularization network (6.32) implements regularized system identification in continuous time and $$\hat{\theta }$$ is the continuous-time impulse response estimate. The class of kernels which include the BIBO stability constraint will be discussed in the next chapter.

6.6.2 Kernels Given by a Finite Number of Basis Functions

Assume we are given an input space $$\mathscr {X}$$ and m independent functions $$\rho _i:\mathscr {X}\rightarrow \mathbb {R}$$.  Then, we define

$$K(x,y) = \sum _{i=1}^m \rho _i(x) \rho _i(y).$$

It is easy to verify that K is a positive definite kernel. Recalling Theorem 6.13, the associated RKHS coincides with the m-dimensional space spanned by the basis functions $$\rho _i$$. Each function in $$\mathscr {H}$$ has the representation $$g(x) = \sum _{i=1}^m \theta _i \rho _i(x)$$ and, in view of (6.20) and the independence of the basis functions, one has

$$\Vert g \Vert _{\mathscr {H}}^2 = \sum _{i=1}^m \ \theta _i^2.$$

Consider now the regularization network (6.32) equipped with such hypothesis space. The solution can be computed without using the representer theorem by plugging in (6.32) the expression of g as a function of $$\theta$$. Letting $$\varPhi \in {\mathbb R}^{N \times m}$$ with $$\varPhi _{ij} = \rho _j(x_i)$$, we obtain $$\hat{g} = \sum _{i=1}^m \ \hat{\theta }_i \rho _i$$ with

\begin{aligned} \hat{\theta } = \arg \min _{\theta \in \mathbb {R}^{m}} \ \Vert Y-\varPhi \theta \Vert ^2+ \gamma \Vert \theta \Vert ^2. \end{aligned}
(6.41)

The solution (6.41) coincides with the ridge regression estimate introduced in Sect. 1.2.
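A quick numerical check (with illustrative basis functions and data) that the kernel built from m basis functions reproduces the ridge regression estimate (6.41):

```python
import numpy as np

# Kernel K(x,y) = sum_i rho_i(x) rho_i(y) versus ridge regression (6.41).
rng = np.random.default_rng(4)
N = 25
X = np.linspace(0, 1, N)
Y = np.exp(-X) + 0.05 * rng.standard_normal(N)

# m = 3 independent basis functions rho_i (illustrative choice)
Phi = np.column_stack([np.ones(N), X, X**2])    # Phi_ij = rho_j(x_i)
gamma = 0.1

theta_ridge = np.linalg.solve(Phi.T @ Phi + gamma * np.eye(3), Phi.T @ Y)

Kmat = Phi @ Phi.T                              # kernel matrix of K above
c_hat = np.linalg.solve(Kmat + gamma * np.eye(N), Y)
theta_kern = Phi.T @ c_hat                      # expansion coefficients of g_hat

assert np.allclose(theta_ridge, theta_kern)     # same estimate either way
```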

6.6.3 Feature Map and Feature Space $$\star$$

Let $$\mathscr {F}$$ be a space endowed with an inner product, and assume that a representation of the form

\begin{aligned} K(x,y) = \langle \phi (x), \phi (y) \rangle _{\mathscr {F}}, \qquad \phi :\mathscr {X}\rightarrow \mathscr {F}, \end{aligned}
(6.42)

is available. Then, it follows immediately that K is a positive definite kernel. In this context, $$\phi$$ is called a feature map, and $$\mathscr {F}$$ the feature space. For instance, to have the connection with the kernel discussed in the previous subsection, we can think of $$\phi$$ as a vector containing m functions. It is defined for any x by

$$\phi (x)=\left( \begin{array}{c}\rho _1(x) \\ \rho _2(x) \\ \vdots \\ \rho _m(x) \end{array}\right)$$

so that $$\mathscr {F}=\mathbb {R}^m$$ with the Euclidean inner product. Then, we obtain

$$K(x,y) = \langle \phi (x), \phi (y) \rangle _{2} = \phi ^T(x) \phi (y) = \sum _{i=1}^m \rho _i(x) \rho _i(y).$$

Now, given any positive definite kernel K, Theorem 6.2 (Moore–Aronszajn theorem) implies the existence of at least one feature map, namely, the RKHS map $$\phi _{\mathscr {H}}:\mathscr {X} \rightarrow \mathscr {H}$$ such that

$$\phi _{\mathscr {H}}(x) = K_x,$$

where the representation (6.42) follows immediately from the reproducing property. These arguments show that K is a positive definite kernel iff there exists at least one Hilbert space $$\mathscr {F}$$ and a map $$\phi : \mathscr {X} \rightarrow \mathscr {F}$$ such that $$K(x,y)=\langle \phi (x), \phi (y) \rangle _{\mathscr {F}}$$.

Feature maps and feature spaces are not unique since, by introducing any linear isometry $$I:\mathscr {H} \rightarrow \mathscr {F}$$, one can obtain a representation in a different space:

$$K(x,y) = \langle \phi _{\mathscr {H}}(x), \phi _{\mathscr {H}}(y) \rangle _{\mathscr {H}} = \langle I \circ \phi _{\mathscr {H}}(x), I \circ \phi _{\mathscr {H}}(y) \rangle _\mathscr {F}.$$

Now, assume that the kernel admits the decomposition (6.8), i.e.,

$$K(x,y) = \sum _{i=1}^{\infty } \ \zeta _i \rho _i(x) \rho _i(y)$$

with $$\zeta _i > 0 \ \forall i$$. Then, a spectral feature map of K is

$$\phi _{\mu }: \mathscr {X} \rightarrow \ell _2$$

with

$$\phi _{\mu }(x) = \{ \sqrt{\zeta _i} \rho _i(x) \}_{i=1}^{\infty }, \ \ x \in \mathscr {X}.$$

In fact, we have

$$\langle \phi _{\mu }(x), \phi _{\mu }(y) \rangle _2 = \sum _{i=1}^{\infty } \ \zeta _i \rho _i(x) \rho _i(y) = K(x,y).$$
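As a numerical sketch of the spectral feature map, one can truncate a known expansion of this type: for the linear spline kernel $$K(x,y)=\min \{x,y\}$$ on [0, 1] (the covariance of Brownian motion), the Mercer expansion has $$\zeta _i = 1/((i-1/2)^2\pi ^2)$$ and $$\rho _i(x)=\sqrt{2}\sin ((i-1/2)\pi x)$$, so a truncated $$\phi _{\mu }$$ approximately reproduces the kernel:

```python
import numpy as np

# Known Mercer expansion of the linear spline kernel K(x, y) = min(x, y) on [0, 1]:
# zeta_i = 1 / ((i - 1/2)^2 pi^2), rho_i(x) = sqrt(2) sin((i - 1/2) pi x).
m = 5000                                   # truncation level (illustrative)
i = np.arange(1, m + 1)
zeta = 1.0 / (((i - 0.5) * np.pi) ** 2)

def phi_mu(x):
    # truncated spectral feature map: { sqrt(zeta_i) rho_i(x) }
    return np.sqrt(zeta) * np.sqrt(2.0) * np.sin((i - 0.5) * np.pi * x)

x, y = 0.3, 0.7
approx = phi_mu(x) @ phi_mu(y)             # inner product in (truncated) l2
err = abs(approx - min(x, y))
```

Since the eigenvalues decay as $$1/i^2$$, the truncation error is of order 1/m.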

It is also worth pointing out the role of the feature map within the estimation scenario. In many applications, linear models are not powerful enough. Kernels define more expressive spaces by (implicitly) mapping the data into a high-dimensional feature space where linear machines can be applied.  Note that the use of the estimator (6.21) does not require knowledge of any feature map associated with K: the representer theorem shows that the only information needed to compute the estimate is the kernel matrix, as also discussed in Remark 6.3.

6.6.4 Polynomial Kernels

Another example of kernel is the (inhomogeneous) polynomial kernel [70]. For $$x,y \in \mathbb {R}^m$$, it is defined by

$$K(x,y) = \left( \langle x, y \rangle _2 +c\right) ^p, \quad p \in \mathbb {N}, \quad c \ge 0,$$

where $$\langle \cdot , \cdot \rangle _2$$ denotes the classical Euclidean inner product. As an example, assume $$c=1$$ and $$m=p=2$$ with $$x=[x_a \ x_b]$$ and $$y=[y_a \ y_b]$$. Then, one obtains the kernel expansion

$$K(x,y) = 1+ x_a^2y_a^2+x_b^2y_b^2+2x_ax_by_ay_b+2x_ay_a+2x_by_b,$$

of the type (6.8) with the $$\rho _i(x_a,x_b)$$ given by all the monomials of degree up to 2, i.e., the 6 functions

$$1, \ x_a^2, \ x_b^2, \ x_ax_b, \ x_a, \ x_b.$$

More generally, if $$c>0$$, the polynomial kernel induces a $$\left( {\begin{array}{c}m+p\\ p\end{array}}\right)$$-dimensional RKHS spanned by all possible monomials up to the pth degree. The number of basis functions is thus finite but grows combinatorially with m and p. This simple example is in some sense opposite to that described in Sect. 6.6.2. It shows how a kernel can be used to implicitly define a rich class of basis functions.
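The degree-2 expansion above can be verified numerically; the snippet below (with illustrative random inputs) checks that $$(\langle x, y \rangle _2 + 1)^2$$ agrees with the sum over the six monomials:

```python
import numpy as np

# Inhomogeneous polynomial kernel (here with c = 1, p = 2, m = 2).
def K_poly(x, y, c=1.0, p=2):
    return (x @ y + c) ** p

rng = np.random.default_rng(0)
x, y = rng.standard_normal(2), rng.standard_normal(2)
xa, xb = x
ya, yb = y

# Expansion into the 6 monomials of degree up to 2.
expansion = (1 + xa ** 2 * ya ** 2 + xb ** 2 * yb ** 2
             + 2 * xa * xb * ya * yb + 2 * xa * ya + 2 * xb * yb)
gap = abs(K_poly(x, y) - expansion)
```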

6.6.5 Translation Invariant and Radial Basis Kernels

A kernel is said to be translation invariant if there exists $$h:\mathscr {X} \rightarrow \mathbb {R}$$ such that $$K(x,y)=h(x-y)$$. This class has already been encountered in Example 6.12, where its relationship with the Fourier basis (in the case of a one-dimensional input space) is illustrated. A general characterization is given below, see also [80].

Theorem 6.18

(Bochner, based on [23]) A positive definite kernel K over $$\mathscr {X} = \mathbb {R}^d$$ is continuous and of the form $$K(x,y)=h(x-y)$$ if and only if there exists a probability measure $$\mu$$ and a positive scalar $$\eta$$ such that:

$$K(x,y) = \eta \int _{\mathscr {X}} \cos \left( \langle z, x-y \rangle _2 \right) d\mu (z).$$
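A Monte Carlo sketch of this representation for the Gaussian kernel $$K(x,y)=\exp (-\Vert x-y\Vert ^2/\rho )$$: choosing $$\mu$$ as a Gaussian measure with covariance $$(2/\rho ) I$$ and $$\eta =1$$ (an assumption matched to this particular kernel), the expectation of $$\cos (\langle z, x-y \rangle _2)$$ recovers K:

```python
import numpy as np

# Assumed measure for the Gaussian kernel exp(-||x - y||^2 / rho):
# z ~ N(0, (2 / rho) I) and eta = 1, since E[cos(z^T d)] = exp(-||d||^2 / rho).
rng = np.random.default_rng(0)
rho, n_mc = 0.5, 200_000

x = np.array([0.3, -0.1])
y = np.array([-0.2, 0.4])
d = x - y

z = rng.normal(scale=np.sqrt(2.0 / rho), size=(n_mc, 2))
mc = np.mean(np.cos(z @ d))                 # Monte Carlo estimate of the integral
exact = np.exp(-d @ d / rho)
err = abs(mc - exact)
```

This same identity underlies random Fourier feature approximations of translation invariant kernels.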

Translation invariant kernels include also the class of radial basis kernels (RBF)  of the form $$K(x,y) = h(\Vert x-y\Vert )$$ where $$\Vert \cdot \Vert$$ is the Euclidean norm [85]. A notable example is the so-called Gaussian kernel

\begin{aligned} K(x,y) = \exp \left( -\frac{\Vert x-y\Vert ^2}{\rho }\right) , \quad \rho > 0, \end{aligned}
(6.43)

where $$\rho$$ denotes the kernel width. This kernel is often used to model functions expected to be somewhat regular. Note, however, that $$\rho$$ plays an important role in tuning the smoothness level. A low value makes the kernel matrix close to diagonal, so that a low norm is assigned also to rapidly changing functions. On the other hand, as $$\rho$$ grows large, only functions close to constant are given a low penalty. This is the same phenomenon illustrated in Fig. 6.1.
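The role of the width can be visualized on a kernel matrix; in the sketch below (grid and widths are illustrative choices), a small $$\rho$$ drives the matrix toward the identity, while a large $$\rho$$ drives it toward the all-ones matrix, i.e., toward nearly constant functions:

```python
import numpy as np

# Gaussian kernel matrix on a uniform grid for two illustrative widths.
def gauss_K(xs, rho):
    d2 = (xs[:, None] - xs[None, :]) ** 2
    return np.exp(-d2 / rho)

xs = np.linspace(0.0, 1.0, 10)
K_narrow = gauss_K(xs, 1e-4)     # small width: kernel matrix close to the identity
K_wide = gauss_K(xs, 1e4)        # large width: kernel matrix close to all ones

off_diag_narrow = np.abs(K_narrow - np.eye(len(xs))).max()
dev_from_ones_wide = np.abs(K_wide - 1.0).max()
```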

Another widely adopted kernel, which induces spaces of functions less regular than the Gaussian one, is the Laplacian kernel which uses the Euclidean norm in place of the squared Euclidean norm:

\begin{aligned} K(x,y) = \exp \left( -\frac{\Vert x-y\Vert }{\rho } \right) , \quad \rho > 0. \end{aligned}
(6.44)

Unlike the kernels described in the first part of Sect. 6.6.1, as well as in Sects. 6.6.2 and 6.6.4, the RKHS associated with any non-constant RBF kernel is infinite dimensional (it cannot be spanned by a finite number of basis functions). The associated RKHS can be shown to be dense in the space of all continuous functions defined on a compact subset $$\mathscr {X} \subset \mathbb {R}^m$$. This means that every continuous function can be approximated in this space to the desired accuracy, as measured by the sup-norm $$\sup _{x \in \mathscr {X}} |f(x)|$$. This property is called universality.  It does not imply that the RKHS induced by a universal kernel includes every continuous function. For instance, the Gaussian kernel is universal, but it has been proved that its RKHS does not contain any polynomial, not even the constant function [69].

6.6.6 Spline Kernels

To simplify the exposition, let $$\mathscr {X}=[0,1]$$ and let also $$g^{(j)}$$ denote the jth derivative of g, with $$g^{(0)}:=g$$. Intuitively, in many circumstances an effective regularizer is obtained by penalizing the energy of the pth derivative of g, i.e., employing

\begin{aligned} \int _0^1 \left( g^{(p)}(x)\right) ^2 dx. \end{aligned}

An interesting question is whether this penalty term can be cast in the RKHS theory. For $$p=1$$, a positive answer has been given by Example 6.5. Actually, the answer is positive for any integer p. In fact, consider the Sobolev space of functions g whose first $$p-1$$ derivatives are absolutely continuous and satisfy $$g^{(j)}(0)=0$$ for $$j=0,\ldots ,p-1$$. The same arguments developed in Example 6.5 when $$p=1$$ can be easily generalized to prove that this is a RKHS $$\mathscr {H}$$ with norm

$$\Vert g\Vert _{\mathscr {H}}^2=\int _0^1 \left( g^{(p)}(x)\right) ^2 dx.$$

The corresponding kernel is the pth-order spline kernel

\begin{aligned} K(x,y) = \int _0^1 G_p(x,u)G_p(y,u)du, \end{aligned}
(6.45)

where $$G_p$$ is the so-called Green’s function given by

\begin{aligned} G_p(x,u) = \frac{(x-u)_+^{p-1}}{(p-1)!} , \qquad (u)_+ = \left\{ \begin{array}{cl} u & \text{ if } ~ u \ge 0 \\ 0 & \text{ otherwise } \end{array} \right. . \end{aligned}
(6.46)

Note that the Laplace transform of $$G_p(\cdot ,0)$$ is $$1/s^p$$. Hence, the Green’s function is connected with the impulse response of a p-fold integrator. When $$p=1$$, we recover the linear spline kernel of Example 6.5

\begin{aligned} K(x,y) = \min \{x, y\} \end{aligned}
(6.47)

whereas $$p=2$$ leads to the popular cubic spline kernel [104]:

\begin{aligned} K(x,y) = \frac{x y \min \{x, y\}}{2}-\frac{(\min \{x, y\})^3}{6}. \end{aligned}
(6.48)

The linear and the cubic spline kernel are displayed in Fig. 6.2.
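The closed forms (6.47) and (6.48) can be recovered from the integral (6.45) by numerical quadrature; the sketch below uses a midpoint rule with an illustrative grid size:

```python
import numpy as np
from math import factorial

# Green's function (6.46); the p = 1 case reduces to the indicator of u <= x.
def G(p, x, u):
    ind = (u <= x).astype(float)
    if p == 1:
        return ind
    return ind * (x - u) ** (p - 1) / factorial(p - 1)

# Midpoint-rule quadrature of the integral (6.45) on [0, 1].
def K_spline(p, x, y, n=200_000):
    u = (np.arange(n) + 0.5) / n
    return np.mean(G(p, x, u) * G(p, y, u))

x, y = 0.4, 0.8
err1 = abs(K_spline(1, x, y) - min(x, y))                 # linear spline kernel (6.47)
cubic = x * y * min(x, y) / 2 - min(x, y) ** 3 / 6        # cubic spline kernel (6.48)
err2 = abs(K_spline(2, x, y) - cubic)
```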

We can use the spline hypothesis space in the regularization problem (6.21). Then, from the representer theorem one obtains that the estimate $$\hat{g}$$ is a pth-order smoothing spline with derivatives continuous exactly up to order $$2p-2$$ (the choice of the order is thus related to the expected function smoothness). This can also be seen from the kernel sections plotted in Fig. 6.2 for p equal to 1 and 2. For $$p=2$$ the (finite) sum of kernel sections provides the well-known cubic smoothing splines, i.e., piecewise third-order polynomials.

Spline functions enjoy many numerical properties originally studied in the interpolation scenario. In particular, piecewise polynomials circumvent Runge’s phenomenon (large oscillations affecting the reconstructed function) which arises, e.g., when high-order polynomials are employed [81]. Fit convergence rates are discussed, e.g., in [3, 14].

6.6.7 The Bias Space and the Spline Estimator

Bias space As discussed in Sect. 4.5, in a Bayesian setting, in some cases it can be useful to enrich $$\mathscr {H}$$ with a low-dimensional parametric part, known in the literature as bias space.  The bias space typically consists of linear combinations of functions $$\{\phi _k\}_{k=1}^m$$. For instance, if the unknown function exhibits a linear trend, one may let $$m=2$$ and $$\phi _1(x)=1,\phi _2(x)=x$$. Then, one can assume that g is the sum of two functions, one in $$\mathscr {H}$$ and the other in the bias space. In this way, the function space becomes $$\mathscr {H} + \text{ span } \{ \phi _1,\ldots , \phi _m\}$$. Using a quadratic loss, the regularization problem is given by

\begin{aligned} (\hat{f},\hat{\theta }) = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\begin{array}{c} f \in \mathscr {H},\\ \theta \in \mathbb {R}^m \end{array} } \sum _{i=1}^{N}\left( y_i-f(x_i)-\sum _{k=1}^{m} \theta _k \phi _k(x_i)\right) ^2+ \gamma \Vert f\Vert _{\mathscr {H}}^2, \end{aligned}
(6.49)

and the overall function estimate turns out to be $$\hat{g} = \hat{f} + \sum _{k=1}^{m} \hat{\theta }_k\phi _k$$. Note that the expansion coefficients in $$\theta$$ are not subject to any penalty term, but a low value of m avoids overfitting. The solution can be computed by exploiting an extended version of the representer theorem.  In particular, it holds that

\begin{aligned} \hat{g} = \sum _{i=1}^{N} \hat{c}_i K_{x_i} + \sum _{k=1}^{m} \hat{\theta }_k\phi _k, \end{aligned}
(6.50)

where, assuming that $$\varPhi \in {\mathbb R}^{N \times m}$$ is full column rank and $$\varPhi _{ij} = \phi _j(x_i)$$,

\begin{aligned} \hat{\theta }&= \left( \varPhi ^T A^{-1} \varPhi \right) ^{-1} \varPhi ^T A^{-1} Y \end{aligned}
(6.51a)
\begin{aligned} \hat{c}&= A^{-1} \left( Y- \varPhi \hat{\theta }\right) \end{aligned}
(6.51b)
\begin{aligned} A&= \mathbf {K} + \gamma I_{N}. \end{aligned}
(6.51c)
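These formulas can be verified numerically: on synthetic data (the linear spline kernel and a two-dimensional bias space below are illustrative choices), the closed-form pair $$(\hat{c},\hat{\theta })$$ must satisfy the stationarity conditions of the joint objective $$\Vert Y - \mathbf {K}c - \varPhi \theta \Vert ^2 + \gamma c^T \mathbf {K} c$$ obtained from the representer theorem:

```python
import numpy as np

# Synthetic data, linear spline kernel min(x, y), bias space {1, x} (illustrative).
rng = np.random.default_rng(1)
N, gamma = 30, 0.1
x = np.sort(rng.uniform(0.05, 1.0, N))
Y = np.sin(4 * x) + 0.1 * rng.standard_normal(N)

Kmat = np.minimum.outer(x, x)
Phi = np.column_stack([np.ones(N), x])

A = Kmat + gamma * np.eye(N)                                       # (6.51c)
Ainv = np.linalg.inv(A)
theta_hat = np.linalg.solve(Phi.T @ Ainv @ Phi, Phi.T @ Ainv @ Y)  # (6.51a)
c_hat = Ainv @ (Y - Phi @ theta_hat)                               # (6.51b)

# Stationarity residuals of ||Y - K c - Phi theta||^2 + gamma c' K c: both vanish.
resid = Kmat @ c_hat + Phi @ theta_hat - Y
grad_c = Kmat @ resid + gamma * Kmat @ c_hat
grad_theta = Phi.T @ resid
```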

Remark 6.5

(Extended version of the representer theorem) The correctness of formulas (6.51a)–(6.51c) can be easily verified as follows. Fix $$\theta$$ to the optimizer $$\hat{\theta }$$ in the objective on the rhs of (6.49). Then, we can use the representer theorem with Y replaced by $$Y-\varPhi \hat{\theta }$$ to obtain $$\hat{f} = \sum _{i=1}^{N} \hat{c}_i K_{x_i}$$ with

$$\hat{c} = A^{-1} \left( Y- \varPhi \hat{\theta }\right)$$

with A indeed given by (6.51c). This proves (6.51b). Using the definition of A this also implies

$$Y-\mathbf {K} \hat{c}= \varPhi \hat{\theta } + \gamma \hat{c}.$$

Now, if we fix f to $$\hat{f}$$, the optimizer $$\hat{\theta }$$ is just the least squares estimate of $$\theta$$ with Y replaced by $$Y- \mathbf {K} \hat{c}$$. Hence, we obtain

$$\hat{\theta }= \left( \varPhi ^T \varPhi \right) ^{-1} \varPhi ^T (Y- \mathbf {K} \hat{c}).$$

Using $$Y-\mathbf {K} \hat{c}= \varPhi \hat{\theta } + \gamma \hat{c}$$ in the expression for $$\hat{\theta }$$, we obtain $$\left( \varPhi ^T \varPhi \right) ^{-1} \varPhi ^T \hat{c}=0$$. Multiplying the lhs and rhs of (6.51b) by $$\left( \varPhi ^T \varPhi \right) ^{-1} \varPhi ^T$$ and using this last equality, (6.51a) is finally obtained.

The spline estimator The bias space is useful, e.g., when spline kernels are adopted. In fact, the spline space of order p contains functions all satisfying the constraints $$g^{(j)}(0)=0$$ for $$j=0,\ldots ,p-1$$. Then, to cope with nonzero initial conditions, one can enrich such RKHS with polynomials up to order $$p-1$$. The enriched space is $$\mathscr {H} \oplus \text{ span } \{1,x,\ldots ,x^{p-1}\}$$, with $$\oplus$$ denoting a direct sum, and enjoys the universality property mentioned at the end of Sect. 6.6.5. The resulting spline estimator becomes a notable example of (6.49): it solves

\begin{aligned} \min _{\begin{array}{c} f \in \mathscr {H},\\ \theta \in \mathbb {R}^p \end{array} } \sum _{i=1}^{N} \left( y_i -f(x_i)-\sum _{k=1}^{p} \theta _k x_i^{k-1} \right) ^2+ \gamma \int _0^1 \left( f^{(p)}(x) \right) ^2 dx, \end{aligned}
(6.52)

whose explicit solution is given by (6.50) setting $$\phi _k(x)=x^{k-1}$$ and $$\varPhi _{ij} = x_i^{j-1}$$.

We consider a simple numerical example to illustrate the estimator (6.52) and the impact of different choices of $$\gamma$$ on its performance. The task is the reconstruction of the function $$g(x)=e^{\sin (10x)}$$, with $$x \in [0,1]$$, from 100 direct samples corrupted by white Gaussian noise with standard deviation 0.3. The estimates obtained from (6.52) with $$p=2$$ and three different values of $$\gamma$$ are displayed in the three panels of Fig. 6.5. The cubic spline estimate plotted in the top left panel is affected by oversmoothing: the overly large value of $$\gamma$$ overweights the norm of f in the objective (6.52), introducing a large bias. Hence, the model is too rigid and unable to describe the data. The top right panel displays the opposite situation, obtained by adopting too low a value of $$\gamma$$, which overweights the loss function in (6.52). This leads to a high-variance estimator: the model is overly flexible and overfits the measurements. Finally, the estimate in the bottom panel of Fig. 6.5 is obtained using the regularization parameter that is optimal in the MSE sense. The good trade-off between bias and variance leads to an estimate close to the truth. As already pointed out in the previous chapters, the choice of $$\gamma$$ can thus be interpreted as the counterpart of model order selection in the classical parametric paradigm.
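A sketch of this experiment can be implemented directly from (6.50)–(6.51); the seed, sample locations and $$\gamma$$ grid below are illustrative choices, so the numbers differ from Fig. 6.5, but the bias-variance trade-off emerges the same way, with the MSE-optimal $$\gamma$$ in the interior of the grid:

```python
import numpy as np

# g(x) = exp(sin(10 x)) from 100 noisy samples (std 0.3), estimated by a cubic
# spline (p = 2) with bias space {1, x}, i.e., formulas (6.50)-(6.51).
rng = np.random.default_rng(0)
N = 100
x = np.sort(rng.uniform(0.01, 1.0, N))
g_true = np.exp(np.sin(10 * x))
y = g_true + 0.3 * rng.standard_normal(N)

mn = np.minimum.outer(x, x)
Kmat = np.outer(x, x) * mn / 2 - mn ** 3 / 6         # cubic spline kernel (6.48)
Phi = np.column_stack([np.ones(N), x])               # bias space: 1, x

def fit_mse(gamma):
    A = Kmat + gamma * np.eye(N)
    Ainv = np.linalg.inv(A)
    theta = np.linalg.solve(Phi.T @ Ainv @ Phi, Phi.T @ Ainv @ y)  # (6.51a)
    c = Ainv @ (y - Phi @ theta)                                   # (6.51b)
    g_hat = Kmat @ c + Phi @ theta                                 # (6.50)
    return np.mean((g_hat - g_true) ** 2)

gammas = np.logspace(-8, 2, 11)
mses = np.array([fit_mse(gam) for gam in gammas])
best = int(np.argmin(mses))    # interior of the grid: bias-variance trade-off
```

Very small $$\gamma$$ values essentially reproduce the noisy measurements (high variance), very large ones collapse the estimate onto the linear bias space (high bias).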

6.7 Asymptotic Properties $$\star$$

6.7.1 The Regression Function/Optimal Predictor

In what follows, we use $$\mu$$ to indicate a probability measure on the input space $$\mathscr {X}$$. For simplicity, we assume that it admits a probability density function (pdf) denoted by $${\mathrm p}_{x}$$. The input locations $$x_i$$ are now seen as random quantities and $${\mathrm p}_{x}$$ models the stochastic mechanism through which they are drawn from $$\mathscr {X}$$. For instance, in the system identification scenario treated in Sect. 6.6.1, each input location contains system input values, e.g., see (6.40). If we assume that the input is a stationary stochastic process, all the $$x_i$$ indeed follow the same pdf $${\mathrm p}_x$$.

Let also $$\mathscr {Y}$$ indicate the output space. Then, $${\mathrm p}_{yx}$$ denotes the joint pdf on $$\mathscr {X} \times \mathscr {Y}$$ which factorizes into $${\mathrm p}_{y|x}(y|x){\mathrm p}_{x}(x)$$. Here, $${\mathrm p}_{y|x}$$ is the pdf of the output y conditional on a particular realization x.

Let us now introduce some important quantities that are functions of $$\mathscr {X},\mathscr {Y}$$ and $${\mathrm p}_{yx}$$. Given a function f, the least squares error associated with f is defined by

\begin{aligned} \mathrm {Err}(f) = \mathscr {E} (y-f(x))^2 = \int _{\mathscr {X} \times \mathscr {Y}} \ (y-f(x))^2 {\mathrm p}_{yx}(y,x) dx dy. \end{aligned}
(6.53)

The following result, also discussed in [33], characterizes the minimizer of $$\mathrm {Err}(f)$$ and has connections with Theorem 4.1.

Theorem 6.19

(The regression function, based on [33]) We have

$$f_{\rho } = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_f \ \mathrm {Err}(f),$$

where $$f_{\rho }$$ is the so-called regression function defined by

\begin{aligned} f_{\rho }(x) = \int _{\mathscr {Y}} y {\mathrm p}_{y|x}(y|x)dy, \quad x \in \mathscr {X}. \end{aligned}
(6.54)

One can see that the regression function does not depend on the marginal density $${\mathrm p}_{x}$$ but only on the conditional $${\mathrm p}_{y|x}$$. For any given x, it corresponds to the posterior mean (Bayes estimate) of the output y conditional on x. The proof of this fact is easily obtained by first using the following decomposition

\begin{aligned} \mathrm {Err}(f) &= \int _{\mathscr {X} \times \mathscr {Y}} \ (y-f_{\rho }(x)+f_{\rho }(x)-f(x))^2 {\mathrm p}_{yx}(y,x) dx dy \\ &= \mathscr {E}(f_{\rho }(x)-f(x))^2 + \mathscr {E}(y-f_{\rho }(x))^2 \\ &\quad + 2 \int _{\mathscr {X}} (f_{\rho }(x)-f(x)) \underbrace{\left( \int _{\mathscr {Y}} (y-f_{\rho }(x)) {\mathrm p}_{y|x}(y|x) dy \right) }_{0} {\mathrm p}_x(x) dx \\ &= \mathscr {E}(f_{\rho }(x)-f(x))^2 + \mathscr {E}(y-f_{\rho }(x))^2, \end{aligned}

and then noticing that the very last term is independent of f.
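The decomposition can be illustrated by Monte Carlo, with an assumed model $$y = f_{\rho }(x) + e$$ (here $$f_{\rho }(x)=\sin (x)$$, $$x \sim U(0,1)$$ and Gaussian noise, all illustrative choices) and an arbitrary competitor f: the cross term vanishes and $$\mathrm {Err}(f)$$ exceeds $$\mathrm {Err}(f_{\rho })$$:

```python
import numpy as np

# Assumed model: f_rho(x) = sin(x), x ~ U(0, 1), y = f_rho(x) + e, e ~ N(0, 0.25).
rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(0.0, 1.0, n)
f_rho = np.sin(x)
y = f_rho + 0.5 * rng.standard_normal(n)

f = lambda t: 0.8 * t                       # an arbitrary competitor predictor

err_f = np.mean((y - f(x)) ** 2)            # Monte Carlo estimate of Err(f)
err_rho = np.mean((y - f_rho) ** 2)         # Err(f_rho): the noise variance
decomp = np.mean((f_rho - f(x)) ** 2) + err_rho
gap = abs(err_f - decomp)                   # cross term, zero in expectation
```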

Theorem 6.19 shows that $$f_{\rho }$$ is the best output predictor in the sense that it minimizes the expected quadratic loss (MSE) on a new output drawn from $${\mathrm p}_{yx}$$. Now, we will consider a scenario where $${\mathrm p}_{y|x}$$ (and possibly also $${\mathrm p}_{x}$$) is unknown and only N samples $$\{x_i,y_i\}_{i=1}^N$$ from $${\mathrm p}_{yx}$$ are available. We will study the asymptotic properties (N growing to infinity) of the regularized approaches previously described. The regularization network case is treated in the following subsection.

6.7.2 Regularization Networks: Statistical Consistency

Consider the following regularization network

\begin{aligned} \hat{g}_N= \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {H}} \ \frac{\sum _{i=1}^{N} (y_i-f(x_i))^2}{N} + \gamma \Vert f\Vert _{\mathscr {H}}^2, \end{aligned}
(6.55)

which coincides with (6.32) except for the introduction of the scale factor 1/N in the quadratic loss. We have also stressed the dependence of the estimate on the data set size N. Our goal is to assess whether $$\hat{g}_N$$ converges to $$f_{\rho }$$ as $$N \rightarrow \infty$$ using the norm $$\Vert \cdot \Vert _{\mathscr {L}_2^{\mu }}$$ defined by the pdf $${\mathrm p}_x$$ as follows

$$\Vert f \Vert ^2_{\mathscr {L}_2^{\mu }} = \int _{\mathscr {X}} f^2(x){\mathrm p}_x(x)dx.$$

First, details on the data generation process are provided.

Data generation assumptions The probability measure $$\mu$$ on $$\mathscr {X}$$ is assumed to be Borel nondegenerate. As already recalled, this means that realizations from $${\mathrm p}_{x}$$ can cover $$\mathscr {X}$$ entirely, without holes. This happens, e.g., when $${\mathrm p}_x(x)>0 \ \forall x \in \mathscr {X}$$. The stochastic processes $$x_i$$ and $$y_i$$ are jointly stationary, with joint pdf $${\mathrm p}_{yx}$$.

The study is not limited to the i.i.d. case. This is important, e.g., in system identification where, as visible in (6.40), input locations contain past input values shifted in time, hence introducing correlation among the $$x_i$$. Let a, b indicate two integers with $$a \le b$$. Then, $$\mathscr {M}_a^b$$ denotes the $$\sigma$$-algebra generated by $$(x_a,y_a),\ldots ,(x_b,y_b)$$. The process (x, y) is said to satisfy a strong mixing condition if there exists a sequence of real numbers $$\psi _i$$ such that, $$\forall k,i\ge 1$$, one has

$$|P(A \cap B) - P(A)P(B) | \le \psi _i \quad \forall A \in \mathscr {M}_1^k, B \in \mathscr {M}_{k+i}^\infty$$

with

$$\lim _{i \rightarrow \infty } \psi _i = 0.$$

Intuitively, if a and b represent different time instants, this means that the random variables tend to become independent as their temporal distance increases.

Assumption 6.20

(Data generation and strong mixing condition) The probability measure $$\mu$$ on the input space (having pdf $${\mathrm p}_x$$) is nondegenerate. In addition, the random variables $$x_i$$ and $$y_i$$ form two jointly stationary stochastic processes, with finite moments up to the third order and satisfy a strong mixing condition. Finally, denoting with $$\psi _i$$ the mixing coefficients, one has

\begin{aligned} \sum _{i=1}^\infty \left( |\psi _i|^{1/3} \right) < \infty . \end{aligned}

Consistency Result

The following theorem, whose proof is in Sect. 6.9.6, illustrates the convergence in probability of (6.55) to the best output predictor.

Theorem 6.21

(Statistical consistency of the regularization networks)  Let $$\mathscr {H}$$ be a RKHS of functions $$f: \mathscr {X} \rightarrow \mathbb {R}$$ induced by the Mercer kernel K, with $$\mathscr {X}$$ a compact metric space. Assume that $$f_{\rho } \in \mathscr {H}$$ and that Assumption 6.20 holds. In addition, let

\begin{aligned} \gamma \propto \frac{1}{N^{\alpha }}, \end{aligned}
(6.56)

where $$\alpha$$ is any scalar in $$(0,\frac{1}{2})$$. Then, as N goes to infinity, one has

\begin{aligned} \Vert \hat{g}_N - f_{\rho } \Vert _{\mathscr {L}_2^{\mu }} \longrightarrow _p 0, \end{aligned}
(6.57)

where $$\longrightarrow _p$$ denotes convergence in probability.

The meaning of (6.56) is the following. The regularizer $$\Vert \cdot \Vert _{\mathscr {H}}^2$$ in (6.55) restores the well-posedness of the problem by introducing some bias in the estimation process. Intuitively, to have consistency, the amount of regularization should decay to zero as N goes to $$\infty$$, but not too rapidly, in order to keep the variance term under control. This can be obtained by making the regularization parameter $$\gamma$$ go to zero at the rate suggested by (6.56).
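A sketch of this behaviour on synthetic data (the Gaussian kernel, target $$f_{\rho }(x)=\sin (6x)$$ and noise level below are illustrative assumptions, with $$\gamma = N^{-0.45}$$ so that $$\alpha = 0.45 \in (0,\frac{1}{2})$$): the empirical $$\mathscr {L}_2^{\mu }$$ error of the regularization network shrinks as N grows.

```python
import numpy as np

# Illustrative setup: Gaussian kernel, f_rho(x) = sin(6x), x ~ U(0, 1),
# noise std 0.2, and gamma = N^(-0.45), i.e., alpha = 0.45 in (0, 1/2).
rng = np.random.default_rng(0)

def rn_error(N, alpha=0.45, rho=0.1):
    x = rng.uniform(0.0, 1.0, N)
    y = np.sin(6 * x) + 0.2 * rng.standard_normal(N)
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / rho)
    gamma = N ** (-alpha)
    # minimizer of (6.55); the 1/N in the loss yields (K + N gamma I) c = y
    c = np.linalg.solve(K + N * gamma * np.eye(N), y)
    xt = np.linspace(0.0, 1.0, 2000)         # dense grid approximating the L2 norm
    g_hat = np.exp(-(xt[:, None] - x[None, :]) ** 2 / rho) @ c
    return np.sqrt(np.mean((g_hat - np.sin(6 * xt)) ** 2))

errs = [rn_error(N) for N in (20, 200, 2000)]
```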

6.7.3 Connection with Statistical Learning Theory

We now discuss the class of estimators (6.21) within the framework of statistical learning theory.

Learning problem Let us consider the problem of learning from examples as defined in statistical  learning. The starting point is that described in Sect. 6.7.1. There is an unknown probabilistic relationship between the variables x and y described by the joint pdf $${\mathrm p}_{yx}$$ on $$\mathscr {X} \times \mathscr {Y}$$. We are given examples $$\{x_i,y_i\}_{i=1}^N$$ of this relationship, called training data, which are independently drawn from $${\mathrm p}_{yx}$$.    The aim of the learning process is to obtain an estimator $$\hat{g}_N$$ (a map from the training set to a space of functions) able to predict the output y given any $$x \in \mathscr {X}$$.

Generalization and consistency In the statistical learning scenario, the two fundamental properties of an estimator are generalization and consistency. To introduce them, we first define a loss function $$\mathscr {V}(y,f(x))$$, called risk functional.  Then, the mean error associated with a function f is the expected risk given by

\begin{aligned} I(f) = \int _{\mathscr {X} \times \mathscr {Y}} \ \mathscr {V}(y,f(x)) {\mathrm p}_{yx}(y,x) dx dy. \end{aligned}
(6.58)

Note that, in the quadratic loss case, the expected risk coincides with the error already introduced in (6.53). Given a function f, the empirical risk  is instead defined by

\begin{aligned} I_N(f) = \frac{1}{N} \sum _{i=1}^N \ \mathscr {V}(y_i,f(x_i)). \end{aligned}
(6.59)

Then, we introduce a class of functions forming the hypothesis space $$\mathscr {F}$$ where the predictor is searched for. The ideal predictor, also called the target function, is given by

\begin{aligned} f_{0} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {F}} \ I(f). \end{aligned}
(6.60)

In general, even when a quadratic loss is chosen, $$f_{0}$$ does not coincide with the regression function $$f_{\rho }$$ introduced in (6.54), since $$\mathscr {F}$$ may not contain $$f_{\rho }$$.

The concepts of generalization and consistency trace back to [97, 99,100,101]. Below, recall that $$\hat{g}_N$$ is stochastic since it is a function of the training set, which contains the random variables $$\{x_i,y_i\}_{i=1}^N$$.

Definition 6.3

(Generalization and consistency, based on [102]) The estimator $$\hat{g}_N$$ (uniformly) generalizes if $$\forall \varepsilon >0$$:

\begin{aligned} \lim _{N \rightarrow \infty } \ \sup _{{\mathrm p}_{yx}} \ \mathbb {P} \left\{ | I_N(\hat{g}_N) - I(\hat{g}_N) | > \varepsilon \right\} =0. \end{aligned}
(6.61)

The estimator is instead (universally) consistent if $$\forall \varepsilon >0$$:

\begin{aligned} \lim _{N \rightarrow \infty } \ \sup _{{\mathrm p}_{yx}} \ \mathbb {P} \left\{ I(\hat{g}_N) > I(f_{0}) + \varepsilon \right\} =0. \end{aligned}
(6.62)

From (6.61), one can see that generalization requires the performance on the training set (the empirical error) to converge to the “true” performance on future outputs (the expected error). The presence of $$\sup _{{\mathrm p}_{yx}}$$ indicates that this property must hold uniformly w.r.t. all the possible stochastic mechanisms which generate the data. Consistency, as defined in (6.62), instead requires the expected error of $$\hat{g}_N$$ to converge to the expected error achieved by the best predictor in $$\mathscr {F}$$. Note that the reconstruction of $$f_{0}$$ is not required: the goal is that $$\hat{g}_N$$ be able to mimic the prediction performance of $$f_0$$ asymptotically. Key issues in statistical learning theory are understanding the conditions on $$\hat{g}_N$$, the function class $$\mathscr {F}$$ and the loss $$\mathscr {V}$$ which ensure such properties.

Empirical Risk Minimization

The most natural technique to determine $$f_0$$ from data is the empirical risk minimization (ERM)   approach where the empirical risk is optimized:

\begin{aligned} \hat{g}_N = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {F}} \ I_N(f) = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {F}} \ \frac{1}{N} \sum _{i=1}^N \ \mathscr {V}(y_i,f(x_i)). \end{aligned}
(6.63)

The study of ERM has provided a full characterization of the necessary and sufficient conditions for its generalization and consistency. To introduce them, we first need to provide further details on the data generation assumptions.

Assumption 6.22

(Data generation assumptions) It holds that

• the $$\{x_i,y_i\}_{i=1}^N$$ are i.i.d. and each couple has joint pdf $${\mathrm p}_{yx}$$;

• the input space $$\mathscr {X}$$ is a compact set in the Euclidean space;

• $$y \in \mathscr {Y}$$ almost surely with $$\mathscr {Y}$$ a bounded real set;

• the class of functions $$\mathscr {F}$$ is bounded, e.g., under the sup-norm;

• $$A \le \mathscr {V}(y,f(x)) \le B$$, for $$f \in \mathscr {F},y \in \mathscr {Y}$$, with AB finite and independent of f and y. $$\square$$

Note that, if the first four points hold true, in practice any loss function of interest, such as the quadratic, Huber or Vapnik losses, satisfies the last requirement.

We now introduce the concept of $$V_{\gamma }$$-dimension [5]. It is a complexity measure which extends the concept of the Vapnik–Chervonenkis (VC) dimension, originally introduced for indicator functions.

Definition 6.4

($$V_{\gamma }$$-dimension, based on [5]) Let Assumption 6.22 hold. The $$V_{\gamma }$$-dimension of $$\mathscr {V}$$ in $$\mathscr {F}$$, i.e., of the set $$\mathscr {V}(y,f(x)), \ f \in \mathscr {F}$$, is defined as the maximum number h of vectors $$(x_1,y_1),\ldots ,(x_h,y_h)$$ that can be separated in all $$2^h$$ possible ways using the rules

\begin{aligned}&\text {Class 1:} \ \text {if} \ \mathscr {V}(y_i,f(x_i)) \ge s + \gamma ,\\&\text {Class 0:} \ \text {if} \ \mathscr {V}(y_i,f(x_i)) \le s - \gamma \end{aligned}

for $$f \in \mathscr {F}$$ and some $$s\ge 0$$. If, for any h, it is possible to find h pairs $$(x_1,y_1),\ldots ,(x_h,y_h)$$ that can be separated in all the $$2^h$$ possible ways, the $$V_{\gamma }$$-dimension of $$\mathscr {V}$$ in $$\mathscr {F}$$ is infinite.

So, the $$V_{\gamma }$$-dimension is infinite if, for any data set size, one can always find a set of points which can be separated in every possible way by functions in the class. Note that the margin required to distinguish the classes increases with $$\gamma$$. This means that the $$V_{\gamma }$$-dimension is a monotonically decreasing function of $$\gamma$$.

The following definition deals with the uniform, distribution-free convergence of empirical means to expectations for classes of real-valued functions. It is related to the so-called uniform laws of large numbers.

Definition 6.5

(Uniform Glivenko–Cantelli class, based on [5]) Let $$\mathscr {G}$$ denote a space of functions $$\mathscr {Z} \rightarrow \mathscr {R}$$, where $$\mathscr {R}$$ is a bounded real set, and let $${\mathrm p}_z$$ denote a generic pdf on $$\mathscr {Z}$$. Then, $$\mathscr {G}$$ is said to be a uniform Glivenko–Cantelli (uGC) class if

\begin{aligned} \forall \varepsilon>0 \quad \lim _{N \rightarrow \infty } \ \sup _{{\mathrm p}_z} \ \mathbb {P} \left\{ \sup _{g \in \mathscr {G} } \left| \frac{1}{N} \sum _{i=1}^N \ g(z_i) - \int _{\mathscr {Z}} g(z){\mathrm p}_z(z)dz \right| > \varepsilon \right\} =0. \end{aligned}

It turns out that, under the ERM framework, generalization and consistency are equivalent concepts. Moreover, the finiteness of the $$V_{\gamma }$$-dimension coincides with the concept of uGC class relative to the adopted losses and turns out to be the necessary and sufficient condition for generalization and consistency [5]. This is formalized below.

Theorem 6.23

(ERM and $$V_{\gamma }$$-dimension, based on [5]) Let Assumption 6.22 hold. The following facts are then equivalent:

• ERM (uniformly) generalizes.

• ERM is (uniformly) consistent.

• The $$V_{\gamma }$$-dimension of $$\mathscr {V}$$ in $$\mathscr {F}$$ is finite for any $$\gamma >0$$.

• The class of functions $$\mathscr {V}(y,f(x))$$ with $$f \in \mathscr {F}$$ is uGC.

In the last point regarding the uGC class, one can follow Definition 6.5 using the correspondences $$\mathscr {Z}=\mathscr {X} \times \mathscr {Y}$$, $$z=(x,y)$$, $${\mathrm p}_{z}={\mathrm p}_{yx}$$ and $$\mathscr {R}=[A,B]$$.

Connection with Regularization in RKHS

The connection between statistical learning theory and the class of kernel-based estimators (6.21) is obtained using as function space $$\mathscr {F}$$ a ball $$\mathscr {B}_r$$ in a RKHS $$\mathscr {H}$$, i.e.,

\begin{aligned} \mathscr {F}=\mathscr {B}_r:= \Big \{ f \in \mathscr {H} \ | \ \Vert f \Vert _{\mathscr {H}} \le r \Big \}. \end{aligned}
(6.64)

The ERM method (6.63) becomes

\begin{aligned} \hat{g}_N = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f} \ \frac{1}{N} \sum _{i=1}^N \ \mathscr {V}(y_i,f(x_i)) \quad \text {s.t.} \ \ \Vert f \Vert _{\mathscr {H}} \le r, \end{aligned}
(6.65)

which is an inequality-constrained optimization problem. Exploiting Lagrangian theory, we can find a positive scalar $$\gamma$$, a function of r and of the data set size N, which makes (6.65) equivalent to

$$\hat{g}_N = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {H}} \ \frac{1}{N} \sum _{i=1}^N \ \mathscr {V}(y_i,f(x_i)) + \gamma \left( \Vert f\Vert _{\mathscr {H}}^2-r^2\right) ,$$

which, apart from constants, coincides with (6.21). The question now is whether (6.65) is consistent in the sense of statistical learning theory. The answer is positive. In fact, under Assumption 6.22, it can be proved that the class of functions $$\mathscr {V}$$ in $$\mathscr {F}$$ is uGC if $$\mathscr {F}$$ is uGC. In addition, one sufficient (but not necessary) condition for $$\mathscr {F}$$ to be uGC is that $$\mathscr {F}$$ be a compact set in the space of continuous functions. The following important result then holds.

Theorem 6.24

(Generalization and consistency of the kernel-based approaches, based on [33, 65])  Let $$\mathscr {H}$$ be any RKHS induced by a Mercer kernel containing functions $$f: \mathscr {X} \rightarrow \mathbb {R}$$, with $$\mathscr {X}$$ a compact metric space. Then, for any r, the ball $$\mathscr {B}_r$$ is compact in the space of continuous functions equipped with the sup-norm. It then follows that $$\mathscr {B}_r$$ is uGC and, if Assumption 6.22 holds, the regularized estimator (6.65) generalizes and is consistent.

Theorem 6.24 thus shows that kernel-based approaches permit the use of flexible infinite-dimensional models with the guarantee that the best prediction performance (achievable inside the chosen class) will be asymptotically reached.
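The correspondence between the constrained and penalized formulations can also be sketched numerically (with illustrative data and kernel): for the penalized problem with quadratic loss, the RKHS norm $$\Vert \hat{f}\Vert _{\mathscr {H}} = \sqrt{c^T \mathbf {K} c}$$ of the solution decreases monotonically with $$\gamma$$, tracing out the radii r for which the two problems share the same solution:

```python
import numpy as np

# Illustrative data and Gaussian kernel.
rng = np.random.default_rng(0)
N = 50
x = np.sort(rng.uniform(0.0, 1.0, N))
y = np.sin(5 * x) + 0.1 * rng.standard_normal(N)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)

def rkhs_norm(gamma):
    # penalized quadratic-loss solution: c = (K + N gamma I)^{-1} y
    # (the 1/N in the loss of (6.65) yields the factor N in front of gamma);
    # the RKHS norm of the estimate is ||f_hat||_H = sqrt(c' K c)
    c = np.linalg.solve(K + N * gamma * np.eye(N), y)
    return float(np.sqrt(c @ K @ c))

norms = [rkhs_norm(g) for g in (1e-4, 1e-2, 1.0)]   # decreasing in gamma
```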

Basic functional analysis principles can be found, e.g., in [59, 79, 112]. The concept of RKHS was developed in 1950 in the seminal works [13, 20]. Classical books on the subject are [6, 82, 84]. RKHSs have been introduced within the machine learning community in [46, 47] leading, in conjunction with Tikhonov  regularization theory [21, 96], to the development of many powerful kernel-based algorithms [42, 86].

Extensions of the theory to vector-valued RKHSs are described in [62].  This is connected to the so-called multi-task learning problem  [18, 29], which deals with the simultaneous reconstruction of several functions. Here, the key point is that measurements taken on one function (task) may be informative w.r.t. the other ones, see [16, 40, 68, 95] for illustrations of the advantages of this approach. Multi-task learning will be illustrated in Chap. 9 using also a numerical example based on real pharmacokinetics data.

The Mercer theorem dates back to [60], which also discusses the connection with integral equations, see also the book [50]. Extensions of the theorem to noncompact domains are discussed in [94]. The first version of the representer theorem appears in [52]. It has since been the subject of many generalizations, which can be found in [11, 36, 83, 103, 110]. Recent works have also extended the classical formulation to the context of vector-valued functions (multi-task learning and collaborative filtering), matrix regularization problems (with penalty given by spectral functions of matrices), and matricizations of tensors, see, e.g., [1, 7, 12, 54, 87]. These different types of representer theorems are cast in a general framework in [10].

The term regularization network traces back to [71], where it is shown that a particular regularization scheme coincides with a radial basis function network. Support vector regression and classification were introduced in [24, 31, 37, 98], see also the classical book [102]. Robust statistics are described in [51].

The term “kernel trick” was used in [83], while the interpretation of kernels as inner products in a feature space was first described in [4]. Sobolev spaces are illustrated, e.g., in [2], while classical works on smoothing splines are [32, 104]. The important spline interpolation properties are described in [3, 14, 22].
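The feature-space interpretation mentioned above can be made concrete with a minimal numerical sketch (a second-degree polynomial kernel on scalar inputs, with all values chosen purely for illustration): the kernel evaluates the inner product between explicit feature vectors without ever forming them.

```python
import numpy as np

def poly_kernel(x, y):
    # Second-degree polynomial kernel on scalar inputs: (1 + xy)^2
    return (1.0 + x * y) ** 2

def feature_map(x):
    # Explicit feature map phi such that poly_kernel(x, y) = phi(x) . phi(y),
    # since (1 + xy)^2 = 1 + 2xy + x^2 y^2
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])

x, y = 0.7, -1.3
k_direct = poly_kernel(x, y)                      # kernel evaluation
k_feature = feature_map(x) @ feature_map(y)       # inner product in feature space
print(k_direct, k_feature)                        # the two values coincide
```

The same correspondence holds for vector inputs and higher degrees, with the feature space growing combinatorially while the kernel evaluation cost stays fixed, which is precisely the point of the trick.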

Polynomial kernels were used for the first time in [70], while an application to Wiener system identification can be found in [44], as also discussed later on in Chap. 8, devoted to nonlinear system identification. An explicit (spectral) characterization of the RKHS induced by the Gaussian kernel can be found in [91, 92], while the more general case of radial basis kernels is treated in [85]. The concept of universal kernel is discussed, e.g., in [61, 90].

The strong mixing condition is discussed, e.g., in [107] and [34].

The convergence proof for the regularization network relies upon the integral operator approach described in [88] in an i.i.d. setting, and on its extension to the dependent case developed in [66] in the Wiener system identification context. For other works on statistical consistency and learning rates of regularized least squares in RKHSs see, e.g., [48, 93, 105, 109, 111].

Statistical learning theory and the concepts of generalization and consistency, in connection with the uniform law of large numbers, date back to the works of Vapnik and Chervonenkis [97, 99–101]. Other related works on convergence of empirical processes are [38, 39, 73]. The concept of $$V_{\gamma }$$ dimension and its equivalence with the Glivenko–Cantelli class is proved in [5], see also [41] for links with RKHSs. Relationships between the concept of stability of estimates (continuous dependence on the data) and generalization/consistency can be found in [63, 72], see also [26] for previous work on this subject.

Numerical computation of the regularized estimate (6.21) is discussed in the literature studying the relationship between machine learning and convex optimization [19, 25, 77]. In the regularization network case (quadratic loss), if the data set size N is large, plain application of a solver with computational cost $$O(N^3)$$ can be highly inefficient. In this case, one can use approximate representations of the kernel function [15, 53], based, e.g., on the Nyström method or greedy strategies [89, 106, 113]. One can also exploit the Mercer theorem by using an mth-order approximation of K given by $$\sum _{i=1}^m \zeta _i \rho _i(x) \rho _i(y)$$. The solution obtained with this kernel may provide accurate approximations even when $$m \ll N$$, see [28, 43, 67, 114, 115]. Training of kernel machines can also be accelerated by using randomized low-dimensional feature spaces [74], see also [78] for insights on learning rates.
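As a minimal sketch of these low-rank approximation schemes (assuming one-dimensional inputs, a Gaussian kernel, and illustrative choices of N, m and the regularization parameter, none of which come from the text), the following compares the plain $$O(N^3)$$ solve of the regularization network with a Nyström-type approximation of rank $$m \ll N$$:

```python
import numpy as np

def gauss_kernel(a, b, width=0.2):
    # Gaussian kernel matrix between the 1-D sample vectors a and b
    d = a[:, None] - b[None, :]
    return np.exp(-d ** 2 / (2 * width ** 2))

rng = np.random.default_rng(0)
N, m, gamma = 200, 40, 1e-2
x = np.sort(rng.uniform(0.0, 1.0, N))
y = np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(N)

# Exact regularization network: solve (K + gamma*I) c = y, cost O(N^3)
K = gauss_kernel(x, x)
c = np.linalg.solve(K + gamma * np.eye(N), y)
f_exact = K @ c

# Nystrom approximation: m landmark inputs, K ~ Phi Phi^T with Phi N x m
idx = np.linspace(0, N - 1, m).astype(int)
xm = x[idx]
L = np.linalg.cholesky(gauss_kernel(xm, xm) + 1e-8 * np.eye(m))  # jitter for stability
Phi = np.linalg.solve(L, gauss_kernel(xm, x)).T                  # Phi = K_{Nm} L^{-T}
# Equivalent ridge regression in the m-dimensional feature space, cost O(N m^2)
w = np.linalg.solve(Phi.T @ Phi + gamma * np.eye(m), Phi.T @ y)
f_nystrom = Phi @ w

print(np.max(np.abs(f_exact - f_nystrom)))
```

Because the Gaussian kernel has rapidly decaying eigenvalues, a modest number of landmarks already reproduces the exact fit closely, which is the behaviour the references above quantify.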

In the case of a generic convex loss (different from the quadratic), one problem is that the objective is not differentiable everywhere. In this circumstance, the powerful interior point (IP) methods [64, 108] can be employed, which apply damped Newton iterations to a relaxed version of the Karush–Kuhn–Tucker (KKT) equations for the objective [27]. A statistical and computational framework that allows their broad application to the problem (6.21) for a wide class of piecewise linear-quadratic losses can be found in [8, 9]. In practice, IP methods exhibit relatively fast convergence. However, as in the quadratic case, a difficulty can arise if N is very large: it may not be possible to store the entire kernel matrix in memory, and this can hinder the application of second-order optimization techniques such as the (damped) Newton method. A way to circumvent this problem is given by the so-called decomposition methods, where a subset of the coefficients $$c_i$$, called the working set, is selected and the associated low-dimensional sub-problem is solved. In this way, only the corresponding entries of the kernel matrix need to be loaded into memory, e.g., see [30, 56–58]. An extreme case of decomposition method is coordinate descent, where the working set contains only one coefficient [35, 45, 49].
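A minimal sketch of this extreme case can be given, for simplicity, with the quadratic loss, where the representer coefficients solve $$(K+\gamma I)c = y$$ (synthetic data, a Gaussian kernel, and all numerical values below are illustrative assumptions): each coordinate update loads a single kernel row and minimizes the quadratic objective exactly over that coefficient, so the full kernel matrix is never stored.

```python
import numpy as np

def kernel_row(x, i, width=0.2):
    # Row i of the Gaussian kernel matrix, generated on demand so that
    # the full N x N matrix never has to be held in memory
    return np.exp(-(x - x[i]) ** 2 / (2 * width ** 2))

rng = np.random.default_rng(0)
N, gamma = 200, 0.1
x = rng.uniform(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(N)

# Cyclic coordinate descent (working set of size one) for (K + gamma*I) c = y:
# exact minimization of 0.5 c'(K + gamma*I)c - y'c over one c_i at a time
c = np.zeros(N)
g = np.zeros(N)                        # running product g = K c
for sweep in range(100):
    for i in range(N):
        Ki = kernel_row(x, i)
        ci_new = (y[i] - g[i] + Ki[i] * c[i]) / (Ki[i] + gamma)
        g += Ki * (ci_new - c[i])      # keep g = K c up to date
        c[i] = ci_new

# Relative residual of the linear system after the sweeps
res = np.linalg.norm(g + gamma * c - y) / np.linalg.norm(y)
print(res)
```

Each exact coordinate minimization decreases the quadratic objective monotonically, and since the system matrix is positive definite the cycle converges to the regularization network solution; the memory footprint per step is a single kernel row, which is the point of decomposition methods.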