Abstract
In the previous parts of the book, we have studied how to handle linear system identification by using regularized least squares (ReLS) with finite-dimensional structures given, e.g., by finite impulse response (FIR) models. In this chapter, we cast this approach in the RKHS framework developed in the previous chapter. We show that ReLS with quadratic penalties can be reformulated as a function estimation problem in the finite-dimensional RKHS induced by the regularization matrix. This leads to a new paradigm for linear system identification that also provides new insights and regularization tools to handle infinite-dimensional problems involving, e.g., IIR and continuous-time models. For this whole class of problems, we will see that the representer theorem ensures that the regularized impulse response is a finite linear combination of basis functions given by the convolution between the system input and the kernel sections. We then consider the issue of kernel estimation and introduce several tuning methods that have close connections with those related to the regularization matrix discussed in Chap. 3. Finally, we introduce the notion of stable kernels, which induce RKHSs containing only absolutely summable impulse responses, and study minimax properties of regularized impulse response estimation.
7.1 Regularized Linear System Identification in Reproducing Kernel Hilbert Spaces
7.1.1 Discrete-Time Case
We will consider linear discrete-time systems in the form of the so-called output error (OE) models. Data are generated according to the relationship
where y(t), u(t) and \(e(t) \in \mathbb {R}\) are the system output, the known system input and the noise at time instant \(t\in \mathbb N\), respectively. In addition, \(G^0(q)\) is the “true” system that has to be identified from the input–output samples with q being the time shift operator, i.e., \(qu(t)=u(t+1)\). Here, and also in all the remaining parts of the chapter, we assume that e is white noise (all its components are mutually uncorrelated).
In Chap. 2, we have seen that there exist different ways to parametrize \(G^0(q)\). In what follows, we will start our discussions exploiting the simplest impulse response descriptions given by FIR models and then we will consider more general infinite-dimensional models also in continuous time. We will see that there is a common way to estimate them through regularization in the RKHS framework and the representer theorem.
7.1.1.1 FIR Case
The FIR case corresponds to
where m is the FIR order, \(g_1,\ \dots , g_m\) are the FIR coefficients and \(\theta \) is the unknown vector that collects them. Model (7.2) can be rewritten in vector form as follows:
where
and
with
Instead of describing FIR model estimation directly in the regularized RKHS framework, let us first recall the ReLS method with quadratic penalty term introduced in Chap. 3. It gives the estimate of \(\theta \) by solving the following problem:
where the regularization matrix \(P\in {\mathbb R}^{m\times m}\) is positive semidefinite, assumed invertible for simplicity. The regularization parameter \(\gamma \) is a positive scalar that, as already seen, has to balance adherence to experimental data and strength of regularization.
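To fix ideas, the following sketch (hypothetical Python, not code from the book) implements the ReLS estimate by directly minimizing the quadratic objective in (7.4), i.e., by solving the associated normal equations; zero initial conditions for the input are assumed.

```python
import numpy as np

def rels_fir(y, u, m, P, gamma):
    """ReLS FIR estimate: minimize ||Y - Phi theta||^2 + gamma * theta' P^{-1} theta.

    y[i] stores y(i+1) and u[s] stores u(s); zero initial conditions are
    assumed, so Phi[i, j] = u(i - j), i.e., the input delayed by j+1 samples.
    """
    N = len(y)
    Phi = np.zeros((N, m))
    for i in range(N):
        for j in range(m):
            if 0 <= i - j < len(u):
                Phi[i, j] = u[i - j]
    # Closed-form minimizer of the quadratic objective
    theta_hat = np.linalg.solve(Phi.T @ Phi + gamma * np.linalg.inv(P), Phi.T @ y)
    return theta_hat

# Example: a TC-type regularization matrix recalled from Chap. 5, P_ij = alpha**max(i, j)
m, alpha, gamma = 30, 0.8, 1.0
P = np.array([[alpha ** max(i, j) for j in range(1, m + 1)] for i in range(1, m + 1)])
```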
Now we show that (7.4) can be reformulated as a function estimation problem with regularization in the RKHS framework. To this aim, we will see that the key is to use the \(m \times m\) matrix P to define the kernel over the domain \(\{1,2,\dots ,m\} \times \{1,2,\dots ,m\}\). This in turn will define a RKHS of functions \(g: \{1,2,\dots ,m\} \rightarrow \mathbb {R}\). Such functions are connected with the components \(g_i\) of the m-dimensional vector \(\theta \) by the relation \(g(i)=g_i\). So, the functional view is obtained by replacing the vector \(\theta \) with the function that maps i into the ith component of \(\theta \).
Let us define a positive semidefinite kernel \(K:\mathscr {X}\times \mathscr {X}\rightarrow {\mathbb R}\) as follows:
where \(P_{ij}\) is the (i, j)th entry of the regularization matrix P. It is obvious that K is positive semidefinite because P is positive semidefinite. Its kernel sections will be denoted by \(K_i\) with \(i=1,\ldots ,m\) and are the columns of P seen as functions mapping \(\mathscr {X}\) into \(\mathbb {R}\).
Now, using the Moore–Aronszajn Theorem, illustrated in Theorem 6.2, the kernel K reported in (7.5) defines a unique RKHS \(\mathscr {H}\) such that \(\langle K_i, g \rangle _{\mathscr {H}} = g(i)\), \(\forall (i, g) \in \left( \mathscr {X},\mathscr {H}\right) \). This is the function space where we will search for the estimate of the FIR coefficients. According to the discussion following Theorem 6.2, since there are just m kernel sections \(K_i\) associated to the m columns of P, for any impulse response candidate \(g \in \mathscr {H},\) there exist m scalars \(a_j\) such that
where P(i, : ) is the ith row of P. Since g(i) is the ith component of \(\theta \), one has
By the reproducing property, we also have
and this implies
As a result, the ReLS method (7.4) can be reformulated as follows:
which is a regularized function estimation problem in the RKHS \(\mathscr {H}\).
In view of the equivalence between (7.4) and (7.7), the FIR function estimate \(\hat{g}\) has the closed-form expression given by (7.4d). The correspondence is established by \(\hat{g}(i)=\hat{\theta }_i\). We will show later that such closed-form expression can be derived/interpreted by exploiting the representer theorem.
Remark 7.1
\(\star \) Besides (7.7), there is also an alternative way to reformulate the ReLS method (7.4) as a function estimation problem with regularization in the RKHS framework. This has been sketched in the discussions on linear kernels in Sect. 6.6.1. The difference lies in the choice of the function to be estimated and the choice of the corresponding kernel. In particular, in this chapter, we have obtained (7.7) choosing the function and the corresponding kernel to be the FIR g and (7.5), respectively. In contrast, in Sect. 6.6.1, the RKHS is defined by the kernel
and contains the linear functions \(x^T\theta \), where the input locations x encapsulate m past input values. So, using (7.8), the corresponding RKHS does not contain impulse responses but functions that directly represent linear systems mapping regressors (built from input values) into outputs.
7.1.1.2 IIR Case
The infinite impulse response (IIR) case corresponds to
where \(\theta =[g_1,\dots , g_\infty ]^T\). So, the model order m is set to \(\infty \) and we have to handle infinite-dimensional objects. To face the intrinsic ill-posedness of the estimation problem, one could think of introducing an infinite-dimensional regularization matrix P. However, the penalty \(\theta ^TP^{-1}\theta \), adopted in (7.4) for the FIR case, would then be undefined. So, the RKHS setting is needed to define regularized IIR estimates. The first step is to choose a positive semidefinite kernel \(K:\mathbb {N}\times \mathbb {N}\rightarrow {\mathbb R}\). Then, let \(\mathscr {H}\) be the RKHS associated with K and \(g \in \mathscr {H}\) be the IIR function with \(g(k)=g_k\) for \(k\in \mathbb {N}\). Finally, the estimate is given by
One may wonder whether it is possible to obtain a closed-form expression of the IIR estimate \(\hat{g}\) as in the FIR case. The answer is positive and is given by the following representer theorem. It derives from Theorem 6.16, reported in the previous chapter, applied to the case of quadratic loss functions which, as discussed in Example 6.17, allows one to recover the expansion coefficients of the estimate by just solving a linear system of equations, see (6.29) and (6.31). Before stating the result formally, it is useful to point out the following two facts:
-
in the dynamic systems context treated in this chapter, any functional \(L_i\) present in Theorem 6.16 is now applied to discrete-time impulse responses g which live in the RKHS \(\mathscr {H}\). Hence, it represents the discrete-time convolution with the input, i.e., \(L_i\) maps \(g \in \mathscr {H}\) into the system output evaluated at the time instant \(t=i\);
-
from the discussion after Theorem 6.16, recall also that a functional L is linear and bounded in \(\mathscr {H}\) if and only if the function f, defined for any x by \(f(x)=L[K(x,\cdot )]\), belongs to \(\mathscr {H}\). Hence, the condition (7.11) reported below is equivalent to assuming that the system input defines linear and bounded functionals over the RKHS induced by K.
Theorem 7.1
(Representer theorem for discrete-time linear system identification, based on [73, 90]). Consider the function estimation problem (7.10). Assume that \(\mathscr {H}\) is the RKHS induced by a positive semidefinite kernel \(K:\mathbb {N}\times \mathbb {N}\rightarrow {\mathbb R}\) and that, for \(t=1,\ldots ,N\), the functions \(\eta _t\) defined by
are all well defined in \(\mathscr {H}\). Then, the solution of (7.10) is
where \(\hat{c}_t\) is the tth entry of the vector
with \(Y=[y(1),\dots , y(N)]^T\) and with the (t, s)th entry of O given by
Theorem 7.1 discloses an important feature of regularized impulse response estimation in RKHS. The function estimate \(\hat{g}\) has a finite-dimensional representation that does not depend on the dimension of the RKHS \(\mathscr {H}\) induced by the kernel but only on the data set size N.
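To make the construction of Theorem 7.1 concrete, the following sketch (hypothetical Python, not from the book) builds the basis functions (7.11), the output kernel matrix (7.14) and the coefficients (7.13); the infinite sums are truncated at a finite lag T, which is acceptable in practice when the kernel decays quickly.

```python
import numpy as np

def regularized_iir(y, u, kernel, gamma, T=200):
    """Sketch of the representer-theorem solution of (7.10) (Theorem 7.1).

    kernel(k, j) returns K(k, j) for k, j = 1, 2, ...; y[i] stores y(i+1),
    u[s] stores u(s), with u(s) = 0 for s < 0 (causal input). The infinite
    sums in (7.11) and (7.14) are truncated at lag T.
    """
    N = len(y)
    # U[i, k-1] = u(t_i - k) with t_i = i + 1 and k = 1..T
    U = np.zeros((N, T))
    for i in range(N):
        for k in range(1, T + 1):
            if 0 <= i + 1 - k < len(u):
                U[i, k - 1] = u[i + 1 - k]
    # Kernel evaluated on the truncated lag grid
    Kmat = np.array([[kernel(k, j) for j in range(1, T + 1)] for k in range(1, T + 1)])
    Eta = U @ Kmat              # Eta[i, :] = basis function eta_{t_i} at lags 1..T, cf. (7.11)
    O = U @ Kmat @ U.T          # output kernel matrix, cf. (7.14)
    c_hat = np.linalg.solve(O + gamma * np.eye(N), y)   # coefficients, cf. (7.13)
    g_hat = Eta.T @ c_hat       # impulse response estimate on lags 1..T, cf. (7.12)
    return g_hat
```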
Example 7.2
(Stable spline kernel for IIR estimation) To estimate high-order FIR models, in the previous chapters, we have introduced some regularization matrices related to the DC, TC and stable spline kernels, see (5.40) and (5.41). Consider now the TC kernel, also called first-order stable spline, with support extended to \(\mathbb {N} \times \mathbb {N}\), i.e.,
This kernel induces a RKHS that contains IIR models and can be conveniently adopted in the estimator (7.10). An interesting question is to derive the structure of the induced regularizer \(\Vert g\Vert ^2_{\mathscr {H}}\). One could connect K with the matrix P entering (7.4a) but its inverse is undefined since now P is infinite dimensional. To derive the stable spline norm, it is instead necessary to resort to functional analysis arguments. In particular, in Sect. 7.7.1, it is proved that
an expression that clearly reveals how the kernel (7.15) includes information on smooth exponential decay. When used in (7.10), the resulting IIR estimate balances the data fit (sum of squared residuals) and the energy of the impulse response increments, weighted by coefficients that increase exponentially with time t and thus enforce stability.
Let us now consider a simple application of the representer theorem. Assume that the system input is a causal step of unit amplitude, i.e., \(u(t)=1\) for \(t \ge 0\) and \(u(t)=0\) otherwise. The functions (7.11) are given by
For instance, the first three basis functions are
and, in general, one has
Hence, any \(\eta _t\) is a well-defined function in the RKHS induced by K, being the sum of the first t kernel sections. Then, according to Theorem 7.1, we conclude that the IIR estimate returned by (7.10) is spanned by the functions \(\{\eta _t\}_{t=1}^N\) with coefficients then computable from (7.13). \(\square \)
Although Theorem 7.1 is stated for the IIR case (7.10), the same result also holds for the FIR case (7.7). The only difference is that the series in (7.11) and (7.14) have to be replaced by finite sums up to the FIR order m. Then, interestingly, one can interpret the regularized FIR estimate (7.4d) in a different way exploiting the representer theorem perspective. In particular, one finds \(O=\varPhi P\varPhi ^T\) while the basis functions \(\{\eta _t\}_{t=1}^N\) are in one-to-one correspondence with the N columns of \(P\varPhi ^T\), each of dimension m.
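The equivalence just described can be checked numerically; the sketch below (with randomly generated data, purely illustrative) verifies that the ReLS closed form coincides with the representer-theorem form where \(O=\varPhi P\varPhi ^T\) and the basis functions are the columns of \(P\varPhi ^T\).

```python
import numpy as np

# Sketch: numerical check (random data) that the ReLS solution of (7.4)
# equals the representer-theorem form with O = Phi P Phi^T and basis
# functions given by the columns of P Phi^T.
rng = np.random.default_rng(0)
N, m, gamma = 40, 10, 0.5
Phi = rng.standard_normal((N, m))
Y = rng.standard_normal(N)
A = rng.standard_normal((m, m))
P = A @ A.T + 0.1 * np.eye(m)            # positive definite regularization matrix

theta_rels = np.linalg.solve(Phi.T @ Phi + gamma * np.linalg.inv(P), Phi.T @ Y)
O = Phi @ P @ Phi.T
c_hat = np.linalg.solve(O + gamma * np.eye(N), Y)
theta_rt = P @ Phi.T @ c_hat             # g_hat = sum_t c_t eta_t, eta_t = t-th column of P Phi^T

assert np.allclose(theta_rels, theta_rt)
```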
7.1.2 Continuous-Time Case
Now, we consider linear continuous-time systems, still focusing on the output error (OE) model structure. The system outputs are collected over N time instants \(t_i\). Hence, the measurement model is
where y(t), u(t) and e(t) are the system output, the known input and the noise at time instant \(t\in {\mathbb R}^+\), respectively, while \(g^0(t), \ t\in {\mathbb R}^+\) is the “true” system impulse response.
Similarly to what was done in the previous section, we will study how to determine, from a finite set of input–output data, a regularized estimate of the impulse response \(g^0\) in the RKHS framework. The first step is to choose a positive semidefinite kernel \(K:{\mathbb R}^+\times {\mathbb R}^+\rightarrow {\mathbb R}\). It induces the RKHS \(\mathscr {H}\) containing the impulse response candidates \(g \in \mathscr {H}\). Then, the linear model can be estimated by solving the following function estimation problem:
The closed-form expression of the impulse response estimate \(\hat{g}\) is given by the following representer theorem that again derives from Theorem 6.16 and the same discussion reported before Theorem 7.1. Note just that now any functional \(L_i\) entering Theorem 6.16 is applied to continuous-time impulse responses g in the RKHS \(\mathscr {H}\). Hence, it represents the continuous-time convolution with the input, i.e., \(L_i\) maps \(g \in \mathscr {H}\) into the system output evaluated at the time instant \(t_i\).
Theorem 7.3
(Representer theorem for continuous-time linear system identification, based on [73, 90]) Consider the function estimation problem (7.18). Assume that \(\mathscr {H}\) is the RKHS induced by a positive semidefinite kernel \(K:{\mathbb R}^+\times {\mathbb R}^+\rightarrow {\mathbb R}\) and that, for \(i=1,\ldots ,N\), the functions \(\eta _i\) defined by
are all well defined in \(\mathscr {H}\). Then, the solution of (7.18) is
where \(\hat{c}_i\) is the ith entry of the vector
with \(Y=[y(t_1),\dots , y(t_N)]^T\) and the (i, j)th entry of O given by
Example 7.4
(Stable spline kernel for continuous-time system identification) In Example 6.5, we introduced the first-order spline kernel \(\min (x,y)\) on \([0,1]\times [0,1]\). It describes a RKHS of continuous functions f on the unit interval that satisfy \(f(0)=0\) and whose squared norm is the energy of the first-order derivative, i.e.,
To describe stable impulse responses g, we instead need a kernel defined over the positive real axis \(\mathbb {R}^+\) that induces the constraint \(g(+\infty )=0\). A simple way to obtain this is to exploit the composition of the spline kernel with an exponential change of coordinates mapping \(\mathbb {R}^+\) into [0, 1]. The resulting kernel is called (continuous-time) first-order stable spline kernel. It is given by
where \(\beta >0\) regulates the change of coordinates and, hence, the impulse response's decay rate. So, \(\beta \) can be seen as a kernel parameter related to the dominant pole of the system.
It is interesting to note the similarity between the kernel (7.15) and the first-order stable spline kernel (7.24). By letting \(\alpha =\exp (-\beta )\), the sampled version of the first-order stable spline kernel (7.24) corresponds exactly to the TC kernel (7.15). The top panel of Fig. 7.1 plots (7.24) together with some kernel sections: they are all continuous and decay exponentially to zero. This kernel also inherits the universality property of the splines. In fact, its kernel sections can approximate any continuous impulse response on all the compact subsets of \(\mathbb {R}^+\).
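The correspondence between (7.24) and the TC kernel (7.15) can be verified in a few lines; the sketch below (hypothetical Python) simply compares the two formulas on a grid of integer lags.

```python
import numpy as np

def ss1_ct(t, s, beta):
    """First-order stable spline kernel (7.24): min(e^{-beta t}, e^{-beta s}) = e^{-beta max(t, s)}."""
    return np.exp(-beta * np.maximum(t, s))

def tc(i, j, alpha):
    """TC / first-order stable spline kernel (7.15): alpha^{max(i, j)}."""
    return alpha ** np.maximum(i, j)

# Sampling (7.24) on the integers with alpha = exp(-beta) recovers the TC kernel
beta = 0.5
i, j = np.meshgrid(np.arange(1, 11), np.arange(1, 11))
assert np.allclose(ss1_ct(i, j, beta), tc(i, j, np.exp(-beta)))
```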
The relationship with splines also permits one to easily obtain a spectral decomposition of (7.24). In particular, in Example 6.11, we obtained the following expansion of the spline kernel:
with
where all the \(\rho _i\) are mutually orthogonal on [0, 1] w.r.t. the Lebesgue measure. In view of the simple connection between spline and stable spline kernels given by exponential time transformations, one easily obtains that the first-order stable spline kernel can be diagonalized as follows:
with
where the \(\phi _i\) are now orthogonal on \([0,+\infty )\) w.r.t. the measure \(\mu \) of density \(\beta e^{-\beta t}\). In Fig. 6.3, we reported the eigenfunctions \(\rho _i\) with \(i=1,2,8\) and the eigenvalues \(\zeta _i\) for the first-order spline kernel (6.47). For comparison, we now show in Fig. 7.2 the corresponding eigenfunctions \(\phi _i\) of the first-order stable spline kernel (7.24) with \(\beta =1\), together with the \(\zeta _i\). While the eigenvalues are the same, differently from the \(\rho _i\), the eigenfunctions \(\phi _i\) now decay exponentially to zero.
Having obtained one spectral decomposition of (7.24), we can now exploit Theorem 6.10 to obtain the following representation of the RKHS induced by the first-order stable spline kernel:
and the squared norm of g turns out to be
Now we will exploit the above results to obtain a more useful expression for \(\Vert g\Vert _{\mathscr {H}}^2\). The deep connection between the spline and stable spline kernels implies that these two spaces are isometrically isomorphic, i.e., there is a one-to-one correspondence that preserves inner products. In fact, we can associate to any stable spline function g(t) in \(\mathscr {H}\) the spline function f(t) in the space induced by (6.47) such that \(g(t)=f(e^{-\beta t})\). So, \(g(t)=\sum _{i=1}^\infty c_i\phi _i(t)\) implies \(f(t)=\sum _{i=1}^\infty c_i \rho _i(t)\) and the two functions indeed have the same norm \( \sum _{i=1}^\infty \frac{c_i^2}{\zeta _i}\). Now, using (7.23) and (7.28), we obtain
This expression gives insights into the nature of the stable spline space. Compared to the classical Sobolev space induced by the first-order spline kernel, the norm penalizes the energy of the first-order derivative of g with a weight proportional to \(e^{\beta t}\). Such a norm thus forces all the functions in \(\mathscr {H}\) to be continuous impulse responses decaying to zero at least exponentially. Note also that (7.29) indeed appears as the continuous-time counterpart of the norm (7.16) associated with the discrete-time stable spline kernel.
Let us see now how to generalize the kernel (7.24). In Sect. 6.6.6 of the previous chapter, we have introduced the general class of spline kernels. Here, we started our discussion using the first-order (linear) spline kernel \(\min (x,y)\) but we have seen that higher-order models can be useful to reconstruct smoother functions, an important example being the second-order (cubic) spline kernel (6.48). Applying exponential time transformations to the splines, the class of the so-called stable spline kernels is obtained. For instance, from (6.48), one obtains the second-order stable spline kernel
The bottom panels of Fig. 7.1 plot (7.30) together with some kernel sections: they decay exponentially to zero and are more regular than those associated with (7.24). \(\square \)
7.1.3 More General Use of the Representer Theorem for Linear System Identification \(\star \)
Theorems 7.1 and 7.3 are special cases of the more general representer theorem involving function estimation from sparse and noisy data. It was reported as Theorem 6.16 in the previous chapter. Let us briefly recall it. Its starting point was the optimization problem
where \(\mathscr {V}_i\) is a loss function, e.g., the quadratic loss adopted in this chapter, and each functional \(L_i: \mathscr {H} \rightarrow \mathbb {R}\) is linear and bounded. Then, all the solutions of (7.31) are given by
where each \(\eta _i \in \mathscr {H}\) is the representer of \(L_i\) given by
How to compute the expansion coefficients \(c_i\) will then depend on the nature of the \(\mathscr {V}_i\), as described in Sect. 6.5.
The estimator (7.31) can be exploited for linear system identification thinking of g as an impulse response, using e.g., a stable spline kernel to define \(\mathscr {H}\). The linear functional \(L_i\) is then defined by a convolution and returns the system noiseless outputs at instant \(t_i\). In particular, in discrete-time one has
while in continuous time, it holds that
When quadratic losses are used, (7.31) becomes the regularization network described in Sect. 6.5.1, whose expansion coefficients are available in closed form. One has \(\hat{c} = (O+\gamma I_N)^{-1}Y\) with the (t, s)-entry of the matrix O given by \(O_{ts} = L_s[L_t[K]]\), as given by (7.14) in discrete time and by (7.22) in continuous time. The use of losses \(\mathscr {V}_i\) different from the quadratic one then also opens the way to the definition of many new algorithms for impulse response estimation. For example, the use of Vapnik's \(\epsilon \)-insensitive loss described in Sect. 6.5.3 leads to support vector regression for linear system identification. Beyond promoting sparsity in the coefficients \(c_i\), it also makes the estimator robust against outliers since the penalty on large residuals grows only linearly. Outliers can also be tackled by adopting the \(\ell _1\) or Huber loss, see Sect. 6.5.2. A general system identification framework that includes all the convex piecewise linear quadratic losses and penalties is, e.g., described in [2].
Interestingly, the estimator (7.31) can also be conveniently adopted for linear system identification by giving g a meaning different from an impulse response. For instance, in system identification there are important IIR models that use the Laguerre functions, see e.g., [91, 92], whose z-transform is
They form an orthonormal basis in \(\ell _2\) and some of them are displayed in Fig. 7.3.
Another option is given by the Kautz basis functions, which also allow one to include information on the presence of system resonances [46]. Using \(\phi _i\) to denote such basis functions, the impulse response model can be written as
A problem is how to determine the coefficients \(g_i\) from data. Classical approaches use truncated expansions \(f = \sum _{i=1}^{d} \ g_i \phi _i\), with model order d estimated using, e.g., Akaike’s criterion, as discussed in Sect. 2.4.3, and then determine the \(g_i\) by least squares. An interesting alternative is to let \(d=+\infty \) and to think that the \(g_i\) define the function g such that \(g(i)=g_i\). One can then estimate the coefficients through (7.31) adopting a kernel, like TC and stable spline, that includes information on the expansion coefficients’ decay to zero. Working in discrete time, the functionals \(L_i\) entering (7.31) are in this case defined by
while in continuous time, one has
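As a discrete-time illustration of this construction, the sketch below (hypothetical Python, with assumed names) builds the matrix collecting the values \((\phi _j * u)(t_i)\), so that each functional \(L_i\) acts on the coefficient sequence g through a finite (truncated) linear combination; combined with a kernel K on the coefficient index, such as TC, the representer-theorem formulas then apply with output kernel matrix \(F K F^T\).

```python
import numpy as np

def basis_functionals(u, times, phis):
    """Sketch of the discrete-time functionals L_i of Sect. 7.1.3 when g collects
    expansion coefficients: L_i[g] = sum_j g(j) * (phi_j * u)(t_i).

    phis is a list of truncated basis functions, with phis[j][k-1] = phi_{j+1}(k);
    u[s] stores u(s), assumed zero outside the recorded range.
    Returns F with F[i, j] = (phi_{j+1} * u)(times[i]), so the noiseless
    outputs of the expansion model are F @ g with g = [g_1, g_2, ...].
    """
    F = np.zeros((len(times), len(phis)))
    for i, t in enumerate(times):
        for j, phi in enumerate(phis):
            F[i, j] = sum(phi[k - 1] * u[t - k]
                          for k in range(1, len(phi) + 1) if 0 <= t - k < len(u))
    return F
```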
7.1.4 Connection with Bayesian Estimation of Gaussian Processes
Similarly to what was discussed in the finite-dimensional setting in Sect. 4.9, the more general regularization in RKHS can also be given a probabilistic interpretation in terms of Bayesian estimation. In this paradigm, the different loss functions correspond to alternative statistical models for the observation noise, while the kernel represents the covariance of the unknown random signal, assumed independent of the noise. In particular, when the loss is quadratic, all the involved distributions are Gaussian.
We now discuss the connection from the linear system identification perspective, where the “true” impulse response \(g^0\) is seen as the random signal to estimate. Consider the measurement model
where \(L_i\) is a linear functional of the true impulse response \(g^0\) defined by convolution with the system input evaluated at \(t_i\). One has
in discrete time and
in continuous time. So, the impulse response estimators discussed in this chapter can be compactly written as
where the RKHS \(\mathscr {H}\) contains functions \(g:\mathscr {X}\rightarrow {\mathbb R}\) with \(\mathscr {X}=\mathbb {N}\) in discrete time and \(\mathscr {X}={\mathbb R}^+\) in continuous time.
The following result (whose simple proof is in Sect. 7.7.2) shows that, under Gaussian assumptions on the impulse response and the noise, (7.37) provides the minimum variance estimate of \(g^0\) given the measurements \(Y=[y(t_1),\dots ,y(t_N)]^T\).
Proposition 7.1
Let the following assumptions hold:
-
the impulse response \(g^0\) is a zero-mean Gaussian process on \(\mathscr {X}\). Its covariance function is defined by
$$ \mathscr {E} (g^0(t)g^0(s)) = \lambda K(t,s), $$
where \(\lambda \) is a positive scalar and K is a kernel;
-
the e(t) are mutually independent zero-mean Gaussian random variables with variance \(\sigma ^2\). Moreover, they are independent of \(g^0\).
Let \(\mathscr {H}\) be the RKHS induced by K, set \(\gamma =\sigma ^2/\lambda \) and define
Then, \(\hat{g}\) is the minimum variance estimator of \(g^0\) given Y, i.e.,
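Proposition 7.1 can be checked numerically in a truncated setting where the convolution acts on finitely many impulse response coefficients; the sketch below (illustrative, with a random matrix standing in for the convolution operator) verifies that the Gaussian posterior mean coincides with the regularized estimate computed with \(\gamma =\sigma ^2/\lambda \).

```python
import numpy as np

# Sketch: with g0 ~ N(0, lambda*K) and white Gaussian noise of variance sigma2,
# the posterior mean lambda*K*U'(lambda*U*K*U' + sigma2*I)^{-1} Y coincides with
# the regularized estimate K*U'(U*K*U' + gamma*I)^{-1} Y for gamma = sigma2/lambda.
rng = np.random.default_rng(1)
T, N, lam, sigma2, alpha = 50, 30, 2.0, 0.05, 0.9
K = np.array([[alpha ** max(i, j) for j in range(1, T + 1)] for i in range(1, T + 1)])
U = rng.standard_normal((N, T))          # stands in for the (truncated) convolution matrix
Y = rng.standard_normal(N)

gamma = sigma2 / lam
bayes = lam * K @ U.T @ np.linalg.solve(lam * U @ K @ U.T + sigma2 * np.eye(N), Y)
rkhs = K @ U.T @ np.linalg.solve(U @ K @ U.T + gamma * np.eye(N), Y)
assert np.allclose(bayes, rkhs)
```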
Remark 7.2
The connection between regularization in RKHS and estimation of Gaussian processes was first pointed out in [51] in the context of spline regression, using quadratic losses, see also [41, 83, 90]. The connection also holds for a wide class of losses \(\mathscr {V}_i\) different from the quadratic one. For instance, in this statistical framework, using the absolute value loss corresponds to Laplacian noise assumptions. The statistical interpretation of an \(\epsilon \)-insensitive loss in terms of Gaussians with mean and variance given by suitable random variables can be found in [79], see also [40, 67]. For all these kinds of noise models, and many others, it can be shown that the RKHS estimate \(\hat{g}\) includes all the possible finite-dimensional maximum a posteriori estimates of \(g^0\), see [3] for details.
Fig. 7.4 The largest space contains all the realizations of a zero-mean Gaussian process of covariance K. The smallest space is the RKHS \(\mathscr {H}\) induced by K, assumed here infinite dimensional. The probability that realizations of f fall in the RKHS is zero. Instead, when the assumptions underlying the representer theorem hold, the realizations of the minimum variance estimator \({\mathscr {E}}[f|Y]\) are contained in \(\mathscr {H}\) with probability one
Remark 7.3
The relation between RKHSs and Gaussian stochastic processes, or more general Gaussian random fields, is stated by Proposition 7.1 in terms of minimum variance estimators. In particular, since the representer theorem ensures that such an estimator is the sum of a finite number of basis functions belonging to \(\mathscr {H}\), it turns out that \(\hat{g}\) belongs to the RKHS induced by the covariance of \(g^0\) with probability one. Now, one may also wonder what happens a priori, before seeing the data. In other words, the question is whether realizations of a zero-mean Gaussian process of covariance K fall in the RKHS induced by K. If the kernel K is associated with an infinite-dimensional \(\mathscr {H}\), the answer is negative with probability one, as graphically illustrated in Fig. 7.4. While deep discussions can be found in [9, 34, 59, 68], here we just give a hint of this fact. Assume that the kernel admits the decomposition
inducing an M-dimensional RKHS \(\mathscr {H}\). Let the deterministic functions \(\phi _i\) be independent. Then, we know from Theorem 6.13 that, if \(f(t) =\sum _{i=1}^M a_i \phi _i(t)\), then
Now, think of K as a covariance and let \(a_i\) be zero-mean Gaussian and independent random variables of variance \(\zeta _i\), i.e.,
Then, the so-called Karhunen–Loève expansion of the Gaussian random field \(f\sim \mathscr {N}(0,K)\), also discussed in Sect. 5.6 to connect regularization and basis expansion in finite dimension, is given by
with M possibly infinite and convergence in quadratic mean. The RKHS norm of f is now a random variable and, since the \(a_i\) are mutually independent with \({\mathscr {E}}a_i^2 = \zeta _i\), one has
So, if the RKHS is infinite dimensional, one has \(M=\infty \) and the expected (squared) RKHS norm of the process f diverges to infinity.
7.1.5 A Numerical Example
Our goal now is to illustrate the influence of the choice of the kernel on the quality of the impulse response estimate using also the Bayesian interpretation of regularization. The example is a simple linear discrete time system in the form of (7.1). Using the z-transform, its transfer function is
The system’s impulse response is reported in Fig. 7.5. The disturbances e(t) are independent and Gaussian random variables with mean zero and variance \(0.05^2\). For ease of visualization, we let the input u(t) be an impulsive signal, i.e., \(u(0)=1\) and \(u(t)=0\) elsewhere. Thus, the impulse response has to be estimated from 20 direct and noisy impulse response measurements.
We consider a Monte Carlo simulation of 200 runs. At each run, the outputs are obtained by generating mutually independent measurement noises. One data set is shown in Fig. 7.5. For each of the 200 data sets, we use the regularized IIR estimator (7.10). As for \(K:\mathbb {N}\times \mathbb {N}\rightarrow {\mathbb R}\), we will compare the performance of three kernels: the Gaussian (6.43), the cubic spline (6.48) and the stable spline (7.15) defined, respectively, by
Recall that the Gaussian and the cubic spline kernel are the most used in machine learning to include information on smoothness. The cubic spline estimator could be also complemented with a bias space given, e.g., by a linear function, as described in Sect. 6.6.7. However, one would obtain results very similar to those described in what follows.
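For reference, common forms of the three kernels are sketched below (hypothetical Python); the exact parametrizations used in (6.43) and (6.48) may differ slightly, so these should be read as assumptions rather than the book's definitions.

```python
import numpy as np

def gaussian(i, j, rho):
    """A common form of the Gaussian (RBF) kernel with width parameter rho (assumed form)."""
    return np.exp(-((i - j) ** 2) / rho)

def cubic_spline(i, j):
    """A common form of the second-order (cubic) spline kernel with zero conditions at the origin."""
    m = np.minimum(i, j)
    return i * j * m / 2 - m ** 3 / 6

def stable_spline(i, j, alpha):
    """First-order stable spline / TC kernel (7.15)."""
    return alpha ** np.maximum(i, j)
```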
To adopt the estimator (7.10), we need to find a suitable value for the regularization parameter \(\gamma \) and also for the unknown kernel parameters, i.e., the kernel width \(\rho \) in the Gaussian kernel and the stability parameter \(\alpha \) for the stable spline. As already done, e.g., in Sect. 1.2 for ridge regression, an oracle-based procedure is adopted to optimally balance bias and variance. The unknown parameters are obtained by maximizing the measure of fit defined as follows:
where computation is restricted only to the first 50 samples where, in practice, the impulse response is different from zero. This tuning procedure is ideal since it exploits the true function \(g^0\). It is useful here since it excludes the uncertainty brought by the kernel tuning procedure and will fully reveal the influence of the kernel choice on the quality of the impulse response estimate.
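Since (7.39) is not displayed above, the sketch below assumes a standard fit measure, namely \(100\,(1-\Vert g^0-\hat{g}\Vert _2/\Vert g^0-\bar{g}^0\Vert _2)\) computed on the first 50 samples, with \(\bar{g}^0\) the sample mean of \(g^0\); this is an assumption consistent with the description, not necessarily the exact formula used in the book.

```python
import numpy as np

def fit_percent(g_true, g_hat, n=50):
    """Assumed fit measure: 100 * (1 - ||g0 - ghat|| / ||g0 - mean(g0)||) on the first n samples."""
    g0, gh = np.asarray(g_true)[:n], np.asarray(g_hat)[:n]
    return 100 * (1 - np.linalg.norm(g0 - gh) / np.linalg.norm(g0 - np.mean(g0)))
```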
Fig. 7.6 True impulse response (thick line) and 200 impulse response estimates obtained using the cubic spline kernel (6.48) (top panel), the Gaussian kernel (6.43) (middle) and the stable spline kernel (bottom). The unknown parameters are estimated by an oracle that maximizes the fit (7.39) for each data set
The impulse response estimates obtained by the cubic spline, the Gaussian and the stable spline kernel are reported in Fig. 7.6. When the cubic spline kernel (6.48) is chosen, the impulse response estimates diverge as time progresses. This result can also be given a Bayesian interpretation where (6.48) becomes the covariance of the stochastic process \(g^0\). Specifically, the cubic spline kernel models the impulse response as the double integration of white noise. So, the impulse response coefficients are correlated but the prior variance increases over time. For stable systems, the variability is instead expected to decay to zero as t progresses. When the Gaussian kernel (6.43) is chosen, the quality of the impulse response estimates improves considerably, but many of them exhibit oscillations and the variance of the impulse response estimator is still large. Bayesian arguments here show that the Gaussian kernel models \(g^0\) as a stationary stochastic process. Smoothness information is encoded but not the fact that one expects the prior variance to decay to zero. Finally, the impulse response estimates returned by the stable spline kernel (7.15) are all very close to the truth. These outcomes are similar to those described, e.g., in Example 5.4 in Sect. 5.5. In particular, even if this example is rather simple, it shows clearly that a straightforward application of standard kernels from the machine learning and smoothing splines literature may give unsatisfactory results. Inclusion of dynamic system features in the regularizer, like smooth exponential decay, greatly enhances the quality of the impulse response estimates.
7.2 Kernel Tuning
As we have seen in the previous parts of the book, the kernels depend on some unknown parameters, the so-called hyperparameters. They can include, e.g., scale factors, the kernel width of the Gaussian kernel or the impulse response's decay rate in the TC and stable spline kernels. In real-world applications, the oracle-based procedure used in the previous section cannot be applied. The kernels need instead to be tuned from data. This procedure is referred to as hyperparameter estimation and is the counterpart of model order selection in the classical paradigm of system identification. It determines model complexity within the new paradigm where system identification is seen as regularized function estimation in RKHSs. This calibration step will thus have a major impact on the model's performance, e.g., in terms of predictive capability on new data. Due to the connection with the ReLS methods in quadratic form, the tuning methods introduced in Chaps. 3 and 4 can be easily applied also in the RKHS framework. In particular, let \(K(\eta )\) denote a kernel, where \(\eta \) is the hyperparameter vector belonging to the set \(\varGamma \). Such a vector could also include other parameters not present in the kernel, e.g., the noise variance \(\sigma ^2\). Some calibration methods to estimate \(\eta \) from data are reported below.
7.2.1 Marginal Likelihood Maximization
The first approach we describe is marginal likelihood maximization (MLM), also called the empirical Bayes method in Sect. 4.4. MLM relies on the Bayesian interpretation of function estimation in RKHS discussed in Sect. 7.1.4. Under the same assumptions stated in Proposition 7.1, \(\eta \) can be estimated by maximum likelihood
with \({\mathrm p}(Y|\eta )\) obtained by integrating out \(g^0\) from the joint density \({\mathrm p}(Y|g^0){\mathrm p}(g^0|\eta )\), i.e.,
The probability density \({\mathrm p}(Y|\eta )\) is the marginal likelihood and, hence, (7.40) is called the MLM method.
Computation of (7.41) is especially simple in our case since our measurement model is linear and Gaussian. In fact, in the Bayesian interpretation of regularized linear system identification in RKHS, the impulse response \(g^0\) is a zero-mean Gaussian process with covariance \(\lambda K,\) where \(\lambda \) is a positive scale factor. The impulse response is also assumed independent of the noises e(t), which are white and Gaussian with variance \(\sigma ^2\). Recall also the definition of the matrix O, now possibly a function of \(\eta \), reported in (7.14) for the discrete-time case, i.e., when \(\mathscr {X}=\mathbb {N}\), and in (7.22) for the continuous-time case, i.e., when \(\mathscr {X}={\mathbb R}^+\). The matrix \(\lambda O(\eta )\) plays an important role in the MLM method since it corresponds to the covariance matrix of the noise-free output vector \([L_1[g^0],\ \dots ,\ L_N[g^0]]^T\) and is thus often called the output kernel matrix. Then, as also discussed in Sect. 7.7.2, it follows that the vector Y is Gaussian with zero mean, i.e.,
where the covariance matrix \(Z(\eta )\) is given by
with \(I_N\) the \(N \times N\) identity matrix. Here, the vector \(\eta \) could, e.g., contain both \(\lambda \) and \(\sigma ^2\). One then obtains that the empirical Bayes estimate of \(\eta \) in (7.40) becomes
where the objective is proportional to the minus log of the marginal likelihood.
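A minimal sketch of this tuning strategy is reported below (hypothetical Python): it evaluates the objective of (7.42), with \(Z(\eta )=\lambda O(\eta )+\sigma ^2 I_N\), and performs a simple grid search over a scale factor and a kernel parameter; in practice, gradient-based optimizers are typically used instead.

```python
import numpy as np
from itertools import product

def neg_log_marglik(Y, O, lam, sigma2):
    """Objective of (7.42) up to constants: Y' Z^{-1} Y + log det Z, with Z = lam*O + sigma2*I."""
    N = len(Y)
    Z = lam * O + sigma2 * np.eye(N)
    _, logdet = np.linalg.slogdet(Z)
    return Y @ np.linalg.solve(Z, Y) + logdet

def mlm_grid(Y, output_kernel, lam_grid, alpha_grid, sigma2):
    """Sketch of MLM by grid search over (lam, alpha); output_kernel(alpha) is assumed
    to return the N x N output kernel matrix O(eta) of (7.14)/(7.22)."""
    best = None
    for lam, alpha in product(lam_grid, alpha_grid):
        val = neg_log_marglik(Y, output_kernel(alpha), lam, sigma2)
        if best is None or val < best[0]:
            best = (val, lam, alpha)
    return best[1], best[2]
```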
As discussed in Chap. 4, the MLM method embodies the Occam's razor principle, i.e., unnecessarily complex models are automatically penalized, see e.g., [83]. In particular, Occam's factor arises from the marginalization and manifests itself in the term \(\log \det (Z(\eta ))\) in (7.42). A simple example can be obtained by considering the behaviour of the objective for different values of the kernel scale factor \(\lambda \). When \(\lambda \) increases, the model becomes more complex since, under a stochastic viewpoint, the prior variance of the impulse response \(g^0\) increases. In fact, the term \(Y^T Z(\eta )^{-1} Y\), related to the data fit, decreases since the inverse of \(Z(\eta )\) tends to the null matrix (the model has infinite variance and can describe any kind of data). But Occam's factor increases since \(\det (Z(\eta ))\) grows to infinity. In this way, \(\hat{\eta }\) will balance data fit and model complexity.
Fig. 7.7 True impulse response (red thick line) and impulse response estimates obtained by ridge regression with hyperparameters estimated by an oracle that optimizes the fit (top panel), and by the stable spline kernels of order 1 (middle) and 2 (bottom) with hyperparameters estimated by marginal likelihood maximization
7.2.1.1 Numerical Example
To illustrate the effectiveness of MLM, we revisit the example reported in Sect. 1.2. The problem is to reconstruct the impulse response reported in Fig. 7.7 (red line) from the 1000 input–output data displayed in Fig. 1.2. The system input is low pass, and this makes estimation hard due to ill-conditioning.
We will adopt three kernels. Using \(\delta \) to denote the Kronecker delta, the value K(i, j) is defined, respectively, by
The first choice corresponds to ridge regression with the regularizer given by the sum of squared impulse response coefficients. The other two are the first- and second-order stable spline kernel reported in (7.15) and in (7.30), respectively. More specifically, the last kernel corresponds to the discrete-time version of (7.30) with \(\alpha =e^{-\beta }\).
In Fig. 1.5, we reported the ridge regularized estimate with \(\gamma \) chosen by an oracle to maximize the fit. To ease comparison with other approaches, such a figure is also reproduced in the top panel of Fig. 7.7. The reconstruction is not satisfactory since the regularizer does not include information on smoothness and decay. In fact, the Bayesian interpretation reveals that ridge regression describes the impulse response as a realization of white noise, a poor model for stable dynamic systems. This also explains the presence of oscillations in the reconstructed profile.
The middle and bottom panel report the estimates obtained by the stable spline kernels with the noise variance and the hyperparameters \(\gamma ,\alpha \) tuned from data through MLM. Even if no oracle is used, the quality of the impulse response reconstruction greatly increases. This is also confirmed by a Monte Carlo study where 200 data sets are obtained using the same kind of input but generating new independent noise realizations. MATLAB boxplots of the 200 fits for all the three estimators are in Fig. 7.8. Here, the median is given by the central mark while the box edges are the 25th and 75th percentiles. Then, the whiskers extend to the most extreme fits not seen as outliers. Finally, the outliers are plotted individually. Average fits are \(73.7\%\) for ridge, \(83.9\%\) for first-order and \(90.2\%\) for second-order stable spline.
In this example, one can see that it is preferable to use the second-order stable spline kernel. This is easily explained by the fact that the true impulse response is quite regular so that increasing our expected smoothness improves the performance.
Interestingly, the selection between different kernels, like first- and second-order stable spline, can also be automatically performed by MLM, so addressing the problem of model comparison described in Sect. 2.6.2. In fact, let s denote an additional hyperparameter that may only assume the values 0 or 1. Then, we can consider the combined kernel
and optimize the hyperparameters \(s,\alpha \) and \(\gamma \) by MLM. Clearly, the role of s is to select one of the two kernels, e.g., if the estimate \(\hat{s}\) is 0, then the impulse response estimate will be given by a second-order stable spline. Applying this procedure to our problem, one finds that the second-order stable spline kernel is selected 177 times out of the 200 Monte Carlo runs. The obtained fits are shown in Fig. 7.9; their mean is \(88.8\%\).
Remark 7.4
Kernel choice via MLM also has connections with selection through the concept of Bayesian model probability discussed in Sect. 4.11, see also [50]. In fact, assume we are given different competing kernels (covariances) \(K^i\) and, for the moment, assume also that all the hyperparameter vectors \(\eta ^i\) are known. We can then interpret each kernel as a different model. We can also assign a priori probabilities that the data have been generated by the ith covariance \(K^i\), hence thinking of any model as a random variable itself. If all the kernels are given the same probability, the marginal likelihood computed using \(K^i\) becomes proportional to the posterior probability of the ith model. This makes it possible to exploit the marginal likelihood to select the “best” kernel-based estimate among those generated by the \(K^i\). When the hyperparameters are unknown, the marginal likelihoods can be evaluated with each \(\eta ^i\) set to its estimate \(\hat{\eta }^i\). In this case, care is needed since maximized likelihoods define model posterior probabilities that do not account for hyperparameter uncertainty. For example, if the dimension of \(\eta ^i\) changes with i, the risk is to select a kernel that has many parameters and overfits. This problem can be mitigated by adopting the criteria described in Sect. 2.4.3, e.g., using BIC, we compute
where N is the number of available output measurements and \(\dim \eta ^i\) is the number of hyperparameters contained in the ith model. Note that, when using stable spline kernels as in the above example, the BIC penalty is irrelevant since the first- and the second-order stable spline estimator contain the same number of unknown hyperparameters.
7.2.2 Stein’s Unbiased Risk Estimator
The second method is Stein's unbiased risk estimator (SURE), introduced in Sect. 3.5.3.2. The idea of SURE is to minimize an unbiased estimator of the risk, which is the expected in-sample validation error of the model estimate. In what follows, \(g^0\) is no longer stochastic as in the previous subsection but corresponds to a deterministic impulse response. Identification data are given by
where the \(e(t_i)\) are independent, with zero mean and known variance \(\sigma ^2\), and each \(L_i\) is the linear functional defined by convolutions with the system input evaluated at \(t_i\). One thus has \(L_i[g^0]=\sum ^{\infty }_{k=1}g^0(k)u(t_i-k)\) in discrete time, where the \(t_i\) assume integer values, and \(L_i[g^0]=\int _0^\infty g^0(\tau )u(t_i-\tau )d\tau \) in continuous time. The N independent validation output samples \(y_{v}(t_i)\) are then defined by using the same input that generates the identification data but an independent copy of the noises, i.e.,
So, all the 2N random variables \(e_{v}(t_i)\) and \(e(t_i)\) are mutually independent, with zero mean and noise variance \(\sigma ^2\). Consider the impulse response estimator
as a function of the hyperparameter vector \(\eta \). The predictions of the \(y_{v}(t_i)\) are then given by \(L_i[\hat{g}]\) and also depend on \(\eta \). The expected in-sample validation error of the model estimate \(\hat{g}\) is then given by the mean prediction error
where the expectation \({\mathscr {E}}\) is over the random noises \(e_v(t_i)\) and \(e(t_i)\). Note that the result not only depends on \(\eta \) but also on the unknown (deterministic) impulse response \(g^0\). So, we cannot compute the prediction error. However, it is possible to derive an unbiased estimate of it. To obtain this, let \(\hat{Y}(\eta )\) be the (column) vector with components \(L_i[\hat{g}]\). The output kernel matrix \(O(\eta )\), already introduced to describe marginal likelihood maximization, then gives the connection between the vector Y containing the measured outputs \(y(t_i)\) and the predictions. In fact, using the representer theorem to obtain \(\hat{g}\), and hence the \(L_i[\hat{g}]\), one obtains
Following the same line of discussion developed in Sect. 3.5.3.2 to obtain (3.96), we can derive the following unbiased estimator of (7.44):
where \(\text {dof}(\eta )\) are the degrees of freedom of \(\hat{Y}(\eta )\) given by
that vary from N to 0 as \(\gamma \) increases from 0 to \(\infty \).
Note that (7.46) is a function only of the N output measurements \(y(t_i)\). We can thus estimate the hyperparameter vector \(\eta \) by minimizing the unbiased estimator \(\widehat{\text {EVE}_{\text {in}}}(\eta )\) of \({\text {EVE}_{\text {in}}}(\eta )\), obtaining
The above formula has the same form as the AIC criterion (2.33) computed assuming Gaussian noise of known variance \(\sigma ^2\), except that the dimension m of the model parameter \(\theta \) is now replaced by the degrees of freedom \(\text {dof}(\eta )\).
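The sketch below (hypothetical Python) evaluates the SURE objective under the assumption, consistent with (7.45) and (7.47), that \(\hat{Y}(\eta )=O(\eta )(O(\eta )+\gamma I_N)^{-1}Y\) and \(\text {dof}(\eta )=\mathrm {trace}(O(\eta )(O(\eta )+\gamma I_N)^{-1})\).

```python
import numpy as np

def sure_criterion(Y, O, gamma, sigma2):
    """Sketch of the SURE objective: in-sample fit plus a degrees-of-freedom penalty.

    Assumes Y_hat = O (O + gamma I)^{-1} Y and dof = trace(O (O + gamma I)^{-1}),
    consistent with (7.45) and (7.47).
    """
    N = len(Y)
    H = O @ np.linalg.inv(O + gamma * np.eye(N))   # influence matrix
    Y_hat = H @ Y
    dof = np.trace(H)
    return np.sum((Y - Y_hat) ** 2) / N + 2 * sigma2 * dof / N
```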
7.2.3 Generalized Cross-Validation
The third approach is the generalized cross-validation (GCV) method. As discussed in Sects. 2.6.3 and 3.5.2.3, cross-validation (CV) is a classical way to estimate the expected validation error by efficient reuse of the data and GCV is closely related with the N-fold CV with quadratic losses. To describe it in the RKHS framework, let \(\hat{g}^{k}\) be the solution of the following function estimation problem:
So, \(\hat{g}^{k}\) is the function estimate when the kth datum \(y(t_k)\) is left out. As also described, e.g., in [90, Chap. 4], the following relation between the prediction error of \(\hat{g}\) and the prediction error of \(\hat{g}^{k}\) holds:
where \(H_{kk}(\eta )\) is the (k, k)th element of the influence matrix
Therefore, the validation error of the N-fold CV with quadratic loss function is
Minimizing the above equation as a criterion to estimate the hyperparameter \(\eta \) leads to the predicted residual sums of squares (PRESS) method
The above criterion coincides with that derived in (3.80) working in the finite-dimensional setting.
GCV is a variant of (7.52) obtained by replacing each \(H_{kk}(\eta )\), \(k=1,\dots , N\), in (7.52) with their average. One obtains
In view of (7.45), one has
and, from (7.47) one can see that \(\mathrm {trace}(H(\eta ))\) corresponds to the degrees of freedom \(\text {dof}(\eta )\), i.e.,
So, the GCV (7.53) can be rewritten as follows:
This corresponds to the criterion (3.82) obtained in the finite-dimensional setting. Differently from SURE, a practical advantage of PRESS and GCV is that they do not require knowledge (or preliminary estimation) of the noise variance \(\sigma ^2\).
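For completeness, a minimal sketch of the GCV score, in the standard form consistent with (7.55), is reported below (hypothetical Python).

```python
import numpy as np

def gcv_criterion(Y, O, gamma):
    """Sketch of the GCV score: mean squared residual divided by (1 - dof/N)^2.
    No knowledge of the noise variance is required."""
    N = len(Y)
    H = O @ np.linalg.inv(O + gamma * np.eye(N))   # influence matrix
    dof = np.trace(H)
    return np.sum((Y - H @ Y) ** 2) / N / (1 - dof / N) ** 2
```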
7.3 Theory of Stable Reproducing Kernel Hilbert Spaces
In the numerical experiments reported in this chapter, we have seen that regularized IIR models based, e.g., on TC and stable spline kernels provide much better estimates of stable linear dynamic systems than other popular machine learning choices like the Gaussian kernel. The key was the inclusion in the identification process of information on the decay rate of the impulse response. This motivates the study of the class of the so-called stable kernels, which enforce the stability constraint on the induced RKHS.
7.3.1 Kernel Stability: Necessary and Sufficient Conditions
The necessary and sufficient condition for a linear system to be bounded-input–bounded-output (BIBO) stable is that its impulse response satisfies \(g \in \ell _1\) in the discrete-time case and \(g \in \mathscr {L}_1\) in the continuous-time case. Here, \(\ell _1\) is the space of absolutely summable sequences, while \(\mathscr {L}_1\) contains the absolutely summable functions on \(\mathbb {R}^+\) (equipped with the classical Lebesgue measure), i.e.,
Therefore, for regularized identification of stable systems the impulse response should be searched within a RKHS that is a subspace of \(\ell _1\) in discrete time and a subspace of \(\mathscr {L}_1\) in continuous time. This naturally leads to the following definition of stable kernels.
Definition 7.1
(Stable kernel, based on [32, 73]) Let \(K:\mathscr {X}\times \mathscr {X}\rightarrow {\mathbb R}\) be a positive semidefinite kernel and \(\mathscr {H}:\mathscr {X}\rightarrow {\mathbb R}\) be the RKHS induced by K. Then, K is said to be stable if
-
\(\mathscr {H} \subset \ell _1\) for the discrete-time case where \(\mathscr {X}=\mathbb {N}\);
-
\(\mathscr {H} \subset \mathscr {L}_1\) for the continuous-time case where \(\mathscr {X}={\mathbb R}^+\).
If a kernel K is not stable, it is also said to be unstable. Accordingly, the RKHS \(\mathscr {H}\) is said to be stable or unstable if K is stable or unstable.
Given a kernel, the question is now how to assess its stability. For this purpose, a direct use of the above definition is often challenging since it can be difficult to understand which functions belong to the associated RKHS. Stability conditions stated directly in terms of K would instead be desirable. One first observation is that, since \(\mathscr {H}\) contains all the kernel sections according to Theorem 6.2, all of them must be stable. In discrete time, this means \(K(i,\cdot ) \in \ell _1\) for all i. However, this condition is necessary but not sufficient for stability, a fact which is not so surprising since we have seen in Sect. 6.2 that \(\mathscr {H}\) contains also all the Cauchy limits of linear combinations of kernel sections. For instance, in Example 6.4, we have seen that the identity kernel \(K(i,j)=\delta _{ij}\), connected with ridge regression but here defined over all \({\mathbb N}\times {\mathbb N}\), induces \(\ell _2\). Such a space is not contained in \(\ell _1\). So, the identity kernel is not stable even if each of its kernel sections, containing only one non-null element, is stable.
The following fundamental result can be found in a more general form in [16] and gives the desired characterization of kernel stability. Maybe not surprisingly, we will see that the key test spaces are \(\ell _\infty \), which contains the bounded sequences in discrete time, and \(\mathscr {L}_\infty \), which contains the essentially bounded functions in continuous time. The proof is reported in Sect. 7.7.3.
Theorem 7.5
(Necessary and sufficient condition for kernel stability, based on [16, 32, 73]) Let \(K:\mathscr {X}\times \mathscr {X}\rightarrow {\mathbb R}\) be a positive semidefinite kernel with \(\mathscr {X}=\mathbb {N}\) or \(\mathscr {X}={\mathbb R}^+\). Then,
-
one has
$$\begin{aligned} \mathscr {H}\subset \ell _1 \iff \sum _{s=1}^\infty \left| \sum _{t=1}^\infty K(s,t)l_t\right| <\infty ,\ \forall \ l\in \ell _\infty \end{aligned}$$ (7.56)
for the discrete-time case where \(\mathscr {X}=\mathbb {N}\);
-
one has
$$\begin{aligned} \mathscr {H} \subset \mathscr {L}_1 \iff \int _0^\infty \left| \int _0^\infty K(s,t)l(t)dt\right| ds <\infty ,\ \forall \ l\in \mathscr {L}_\infty \end{aligned}$$ (7.57)
for the continuous-time case where \(\mathscr {X}={\mathbb R}^+\).
Figure 7.10 illustrates the meaning of Theorem 7.5 by resorting to a simple system theory argument. In particular, a kernel can be seen as an acausal linear time-varying system. In discrete time it induces the following input–output relationship
where \(K_i(j)=K(i,j),\) while \(u_i\) and \(y_i\) denote the system input and output at instant i. Then, the RKHS induced by K is stable iff system (7.58) maps every bounded input \(\{u_i\}_{i=1}^\infty \) into a summable output \(\{y_i\}_{i=1}^\infty \). Abusing notation, we can also see K as an infinite-dimensional matrix with i, j-entry given by \(K_i(j)\) with u and y infinite-dimensional column vectors. Then, using ordinary algebra notation to handle these objects, the input–output relationship becomes \(y=Ku\) and the stability condition is
In Theorem 7.5, it is immediate to see that including the constraint \(-1 \le l_t \le 1 \ \forall t\) on the test functions does not have any influence on the stability test. With this constraint, one has
The following result is then an immediate corollary of Theorem 7.5 obtained exploiting the above inequalities. It states that absolute summability is a sufficient condition for a kernel to be stable.
Corollary 7.1
(based on [16, 32, 73]) Let \(K:\mathscr {X}\times \mathscr {X}\rightarrow {\mathbb R}\) be a positive semidefinite kernel with \(\mathscr {X}=\mathbb {N}\) or \(\mathscr {X}={\mathbb R}^+\). Then,
-
one has
$$\begin{aligned} \mathscr {H} \subset \ell _1 \quad \text {if}\quad \sum _{s=1}^\infty \sum _{t=1}^\infty | K(s,t) | <\infty \end{aligned}$$ (7.59)
for the discrete-time case where \(\mathscr {X}=\mathbb {N}\);
-
one has
$$\begin{aligned} \mathscr {H} \subset \mathscr {L}_1 \quad \text {if}\quad \int _0^\infty \int _0^\infty | K(s,t) | dtds <\infty \end{aligned}$$ (7.60)
for the continuous-time case where \(\mathscr {X}={\mathbb R}^+\).
Finally, consider the class of nonnegative-valued kernels \(K^{ \text{+ }}\), i.e., satisfying \(K(s,t) \ge 0 \ \forall s,t\). If a kernel is stable, using as test function \(l(t)=1 \ \forall t\), one must have
in discrete time, and
in continuous time. So, for nonnegative-valued kernels, stability implies (absolute) summability of the kernel. But, since we have seen in Corollary 7.1 that absolute summability implies stability, the following result holds.
Corollary 7.2
(based on [16, 32, 73]) Let \(K^{\text{+ }}:\mathscr {X}\times \mathscr {X}\rightarrow {\mathbb R}\) be a positive semidefinite and nonnegative-valued kernel with \(\mathscr {X}=\mathbb {N}\) or \(\mathscr {X}={\mathbb R}^+\). Then,
-
one has
$$\begin{aligned} \mathscr {H} \subset \ell _1 \iff \sum _{s=1}^\infty \sum _{t=1}^\infty K^{ \text{+ }}(s,t) <\infty \end{aligned}$$ (7.61)
for the discrete-time case where \(\mathscr {X}=\mathbb {N}\);
-
one has
$$\begin{aligned} \mathscr {H} \subset \mathscr {L}_1 \iff \int _0^\infty \int _0^\infty K^{ \text{+ }}(s,t) dtds <\infty \end{aligned}$$ (7.62)
for the continuous-time case where \(\mathscr {X}={\mathbb R}^+\).
As an example, we can now show that the Gaussian kernel (6.43) defined e.g., over \({\mathbb R}^+ \times {\mathbb R}^+\) is not stable. In fact, it is nonnegative valued and one has
The same holds for the spline kernels (6.45) extended to \(\mathbb {R}^+ \times \mathbb {R}^+\) and also for the translation invariant kernels introduced in Example 6.12, as proved, e.g., in [32] using the Schoenberg representation theorem. Hence, none of these models is suited for stable impulse response estimation.
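The summability tests of Corollaries 7.1 and 7.2 can be illustrated numerically by truncating the double sums; the sketch below (hypothetical Python, illustrative parameter values) shows the different behaviour of the TC and Gaussian kernels.

```python
import numpy as np

# Sketch: truncated double sums suggested by Corollary 7.2 for nonnegative-valued
# kernels. The partial sum of the TC kernel settles to a finite value, while that
# of the Gaussian kernel keeps growing (roughly linearly) with the horizon T.
def double_sum(kernel, T):
    s = np.arange(1, T + 1)
    S, Tt = np.meshgrid(s, s)
    return kernel(S, Tt).sum()

tc = lambda s, t: 0.9 ** np.maximum(s, t)
gauss = lambda s, t: np.exp(-((s - t) ** 2) / 10.0)

for T in (100, 200, 400):
    print(T, double_sum(tc, T), double_sum(gauss, T))
```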
Remark 7.5
Any unstable kernel can be made stable simply by truncation. More specifically, let \(K:\mathscr {X}\times \mathscr {X}\rightarrow {\mathbb R}\) be an unstable kernel with \(\mathscr {X}=\mathbb {N}\) or \(\mathscr {X}={\mathbb R}^+\). Then, by setting \(K(s,t)=0\) for \(s,t>T\) for any given \(T\in \mathscr {X}\), a stable kernel is obtained. Care should however be taken when a FIR model is obtained through this operation. In fact, consider, e.g., the use of the cubic spline or Gaussian kernel in the estimation problem depicted in Fig. 7.6, setting T equal to 20 or 50. Even after truncation, such models would not give good performance: the undue oscillations affecting the estimates in the top and middle panels of Fig. 7.6 would still be present. The reason is that these two kernels do not encode the information that the variability of the impulse response decreases as time progresses, as already discussed using the Bayesian interpretation of regularization.
7.3.2 Inclusions of Reproducing Kernel Hilbert Spaces in More General Lebesque Spaces \(\star \)
We now discuss the conditions for a RKHS to be contained in the spaces \(\mathscr {L}_p^{\mu }\) equipped with a generic measure \(\mu \). The following analysis will then include both the space \(\mathscr {L}_1\) (considered before with the Lebesgue measure) and \(\ell _1\) as special cases obtained with \(p=1\). First, we need the following definition.
Definition 7.2
(based on [16]) Let \(1 \le p \le \infty \) and \(q=\frac{p}{p-1}\) with the convention \(\frac{p}{p-1}=\infty \) if \(p=1\) and \(\frac{p}{p-1}=1\) if \(p=\infty \). Moreover, let \(K:\mathscr {X}\times \mathscr {X}\rightarrow {\mathbb R}\) be a positive semidefinite kernel. Then, the kernel K is said to be q-bounded if
-
1.
the kernel section \(K_s \in \mathscr {L}_p^{\mu }\) for almost all \(s \in \mathscr {X}\), i.e., for every \(s \in \mathscr {X}\) except on a set of null measure w.r.t. \(\mu \);
-
2.
the function \(\int _0^\infty K(s,t)l(t)d\mu (t) \in \mathscr {L}_p^{\mu }\), \(\forall l \in \mathscr {L}_{q}^{\mu }\).
The following theorem then gives the necessary and sufficient condition for the q-boundedness of a kernel and is a special case of Proposition 4.2 in [16].
Theorem 7.6
(based on [16]) Let \(K:\mathscr {X}\times \mathscr {X}\rightarrow {\mathbb R}\) be a positive semidefinite kernel with \(\mathscr {H}\) the induced RKHS. Then, \(\mathscr {H}\) is a subspace of \(\mathscr {L}_p^{\mu }\) if and only if K is q-bounded, i.e.,
Theorem 7.6 thus makes it possible to check whether a RKHS is contained in \(\mathscr {L}_p^{\mu }\) by inspecting the properties of the kernel. Interestingly, setting \(p=1\), which implies \(q=\infty \), and \(\mu \), e.g., to the Lebesgue measure, one can see that the concepts of stable and \(\infty \)-bounded kernel are equivalent. Theorem 7.5 is then a special case of Theorem 7.6.
7.4 Further Insights into Stable Reproducing Kernel Hilbert Spaces \(\star \)
In this section, we provide some additional insights into the structure of the stable kernels and associated RKHSs. The analysis is focused on the discrete-time case where the kernel K can be seen as an infinite-dimensional matrix with the (i, j)-entries denoted by \(K_{ij}\). Thus, the function domain is the set of natural numbers \(\mathbb {N}\) and the RKHS contains discrete-time impulse responses of causal systems.
As discussed after (7.58) in connection with Fig. 7.10, the kernel K can also be associated with an acausal linear time-varying system, often called a kernel operator in the literature. It maps the infinite-dimensional input (sequence) u into the infinite-dimensional output Ku whose ith component is \(\sum _{j=1}^\infty K_{ij} u_j\). Two important kernel operators will be considered. The first one maps \(\ell _{\infty }\) into \(\ell _1\) and is key for kernel stability, as pointed out in Theorem 7.5. The second one maps \(\ell _2\) into \(\ell _2\) itself and will be important to discuss spectral decompositions of stable kernels.
7.4.1 Inclusions Between Notable Kernel Classes
To state some relationships between stable kernels and other fundamental classes, we start by introducing some sets of RKHSs. Define
-
the set \(\mathscr {S}_{s}\) that contains all the stable RKHSs;
-
the set \(\mathscr {S}_{1}\) with all the RKHSs induced by absolutely summable kernels, i.e., satisfying
$$\begin{aligned} \sum _{ij} \ |K_{ij}| < +\infty ; \end{aligned}$$
-
the set \(\mathscr {S}_{ft}\) of RKHSs induced by finite-trace kernels, i.e., satisfying
$$\begin{aligned} \sum _{i} \ K_{ii} < +\infty ; \end{aligned}$$
-
the set \(\mathscr {S}_{2}\) associated to squared summable kernels, i.e., satisfying
$$\begin{aligned} \sum _{ij} \ K_{ij}^2 < +\infty . \end{aligned}$$
One has then the following result from [8] (see Sect. 7.7.4 for some details on its proof).
Theorem 7.7
(based on [8]) It holds that
Figure 7.11 gives a graphical description of Theorem 7.7 in terms of inclusions of kernel classes. Its meaning is further discussed below.
In Corollary 7.1, we have seen that absolute summability is a sufficient condition for kernel stability. The result \(\mathscr {S}_{1} \subset \mathscr {S}_{s}\) also shows that such an inclusion is strict. Hence, one cannot conclude that a kernel is unstable from the sole failure of absolute summability.
The fact that \(\mathscr {S}_{s} \subset \mathscr {S}_{ft}\) means that the set of finite-trace kernels contains the stable class. This inclusion is strict, hence the trace analysis can be used only to show that a given RKHS is not contained in \(\ell _1\). There are however interesting consequences of this fact. Consider all the RKHSs induced by translation invariant kernels
where h satisfies the positive semidefinite constraints. The trace of these kernels is \(\sum _i \ K_{ii}=\sum _i \ h(0)\) and it always diverges unless h is the null function. So, all the translation invariant kernels are unstable (as already mentioned after Corollary 7.2). Other instability results become also immediately available. For instance, all the kernels with diagonal elements satisfying \(K_{ii} \propto i^{-\delta }\) are unstable if \(\delta \le 1\).
Finally, the strict inclusion \(\mathscr {S}_{ft} \subset \mathscr {S}_{2}\) shows that the finite-trace test is more powerful than a check of kernel squared summability.
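The membership tests behind Theorem 7.7 are easy to check numerically for a specific kernel. The sketch below does so for the TC/stable spline kernel \(\alpha ^{\max (i,j)}\) on a finite truncation, comparing the partial sums with the corresponding geometric-series limits; the value of \(\alpha \) and the truncation size are arbitrary choices.

```python
import numpy as np

alpha, d = 0.95, 2000                      # illustrative parameter and truncation size
i = np.arange(1, d + 1)
K = alpha ** np.maximum.outer(i, i)        # TC / stable spline kernel alpha^{max(i,j)}

trace   = np.trace(K)                      # finite-trace test    (class S_ft)
abs_sum = np.abs(K).sum()                  # absolute summability (class S_1)
sq_sum  = (K ** 2).sum()                   # squared summability  (class S_2)

# Geometric-series limits of the three quantities, for comparison:
print(trace,   alpha / (1 - alpha))
print(abs_sum, alpha * (1 + alpha) / (1 - alpha) ** 2)
print(sq_sum,  alpha**2 * (1 + alpha**2) / (1 - alpha**2) ** 2)
```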
7.4.2 Spectral Decomposition of Stable Kernels
As discussed in Sect. 6.6.3 and in Remark 6.3, kernels can define spaces rich in functions by (implicitly) mapping the space of the regressors into high-dimensional feature spaces where linear estimators can be used. This makes it possible to use such (implicitly nonlinear) algorithms even without knowing the feature map explicitly, i.e., without exact knowledge of which functions are encoded in the kernel. In particular, in Sect. 6.3, we have seen that if the kernel admits the spectral representation
then the \(\rho _i(x)\) are the basis functions that span the RKHS induced by K. For instance, the basis functions \(\rho _1(x)=1,\rho _2(x)=x,\rho _3(x)=x^2,\ldots \) describe polynomial models which are, e.g., included up to a certain degree in the polynomial kernel discussed in Sect. 6.6.4. Now, we will see that stable kernels always admit an expansion of the type (7.64) with the \(\rho _i\) forming a basis of \(\ell _2\). The number of \(\zeta _i\) different from zero then corresponds to the dimension of the induced RKHS.
Formally, it is now necessary to consider the operator induced by a stable kernel K as a map from \(\ell _2\) into \(\ell _2\) itself. Again, it is useful to see K as an infinite-dimensional matrix so that we can think of Kv as the result of the kernel operator applied to \(v \in \ell _2\). An operator is said to be compact if it maps any bounded sequence \(\{v_i\}\) into a sequence \(\{K v_i\}\) from which a convergent subsequence can be extracted [85, 95]. From Theorem 7.7, we know that any stable kernel K is finite trace and, hence, squared summable. This fact ensures the compactness of the kernel operator, as discussed in [8] and stated below.
Theorem 7.8
(based on [8]) Any operator induced by a stable kernel is self-adjoint, positive semidefinite and compact as a map from \(\ell _2\) into \(\ell _2\) itself.
This result allows us to exploit the spectral theorem [35] to obtain an expansion of K. Now, recall that spectral decompositions were discussed in Sect. 6.3, where Mercer’s theorem was also reported. Mercer’s theorem derivations exploit the spectral theorem and, as, e.g., in Theorem 6.9, they typically assume that the kernel domain is compact, see also [86] for discussions and extensions. Indeed, the first formulations considered continuous kernels on compact domains (proving also uniform convergence of the expansion). However, the spectral theorem does not require the domain to be compact and, when applied to discrete-time kernels on \(\mathbb {N} \times \mathbb {N}\), it guarantees pointwise convergence. It thus becomes the natural generalization of the decomposition of a symmetric matrix in terms of eigenvalues and eigenvectors, initially discussed in the finite-dimensional setting in Sect. 5.6 to link regularization and basis expansion. This is summarized in the following proposition that holds in virtue of Theorem 7.8.
Proposition 7.2
(Representation of stable kernels, based on [8]) Assume that the kernel K is stable. Then, there always exists an orthonormal basis of \(\ell _2\) composed by eigenvectors \(\{\rho _i\}\) of K with corresponding eigenvalues \(\{\zeta _i\}\), i.e.,
In addition, the kernel admits the following expansion:
with \(x,y \in \mathbb {N}\).
While the next subsection will use the above result to discuss the representation of stable RKHSs, some numerical considerations regarding (7.65) are now in order. From an algorithmic viewpoint, many efficient machine learning procedures use truncated Mercer expansions to approximate the kernel, see [42, 52, 75, 93, 96] for discussions on their optimality in a stochastic framework. Applications to system identification can be found in [15], where it is shown that a relatively small number of eigenfunctions (w.r.t. the data set size) can well approximate regularized impulse response estimates. These works trace back to the so-called Nyström method where an integral equation is replaced by finite-dimensional approximations [5, 6]. However, obtaining the Mercer expansion (7.65) in closed form is often hard. Fortunately, the \(\ell _2\) basis and related eigenvalues of a stable RKHS can be numerically recovered (with arbitrary precision w.r.t. the \(\ell _2\) norm) through a sequence of SVDs applied to truncated kernels [8]. Formally, let \(K^{(d)}\) denote the \(d \times d\) positive semidefinite matrix obtained by retaining only the first d rows and columns of K. Let also \(\rho _i^{(d)}\) and \(\zeta _i^{(d)}\) be, respectively, the eigenvectors of \(K^{(d)}\), seen as elements of \(\ell _2\) with a tail of zeros, and the eigenvalues returned by the SVD of \(K^{(d)}\). Assume, for simplicity, single multiplicity of each \(\zeta _i\). Then, for any i, as d grows to \(\infty \) one has
where \(\Vert \cdot \Vert _2\) is the \(\ell _2\) norm.
In Fig. 7.12, we show some eigenvectors (left panel) and the first 100 eigenvalues (right panel) of the stable spline kernel \(K_{xy} =\alpha ^{\max {(x,y)}}\) with \(\alpha =0.99\). Results are obtained by applying SVDs to truncated kernels of different sizes and monitoring convergence of eigenvectors and eigenvalues; the final outcome was obtained with \(d=2000\).
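The numerical procedure just described can be sketched in a few lines: eigendecompose truncated kernels \(K^{(d)}\) of increasing size and monitor the leading eigenvalues. The code below does this for the stable spline kernel with \(\alpha =0.99\) used in Fig. 7.12; the truncation sizes are illustrative.

```python
import numpy as np

def truncated_kernel(alpha, d):
    i = np.arange(1, d + 1)
    return alpha ** np.maximum.outer(i, i)   # stable spline kernel K_xy = alpha^{max(x,y)}

alpha = 0.99
for d in (250, 500, 1000, 2000):
    zeta, rho = np.linalg.eigh(truncated_kernel(alpha, d))
    zeta, rho = zeta[::-1], rho[:, ::-1]     # sort eigenvalues in decreasing order
    print(d, zeta[:5])                       # the leading eigenvalues stabilize as d grows
# The columns of rho, padded with an infinite tail of zeros, approximate the
# l2-orthonormal eigenvectors rho_i of the kernel operator.
```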
7.4.3 Mercer Representations of Stable Reproducing Kernel Hilbert Spaces and of Regularized Estimators
Now we exploit the representations of the RKHSs induced by a diagonalized kernel as discussed in Theorems 6.10 and 6.13 (where compactness of the input space is not even required). In view of Proposition 7.2, assuming for simplicity all the \(\zeta _i\) different from zero, one obtains that the RKHS associated to a stable K always admits the representation
where the \(\rho _i\) are the eigenvectors of K forming an orthonormal basis of \(\ell _2\).Footnote 1 If \(g = \sum _{i=1}^{\infty } a_i \rho _i \), one also has
The fact that any stable RKHS is generated by an \(\ell _2\) basis also gives a clear connection with the important impulse response estimators that adopt orthonormal functions, e.g., the Laguerre functions illustrated in Fig. 7.3 [46, 91, 92]. A classical approach used in the literature is to introduce the model \(g=\sum _i a_i \rho _i\) and then to use linear least squares to determine the expansion coefficients \(a_i\). In particular, let \(L_t[g]\) be the system output, i.e., the convolution between the known input and g evaluated at the time instant t. Then, the impulse response estimate is
where d determines model complexity and is typically selected using AIC or cross-validation (CV) as discussed in Chap. 2.
In view of (7.67) and (7.68), the regularized estimator (7.10), equipped with a stable RKHS, is equivalent to
This result is connected with the kernel trick discussed in Remark 6.3 and shows that regularized least squares in a stable (infinite-dimensional) RKHS always models impulse responses using an \(\ell _2\) orthonormal basis, as in the classical works on linear system identification. But the key difference between (7.69) and (7.70) is that complexity is no longer controlled by the model order, because d is set to \(\infty \). Complexity instead depends on the regularization parameter \(\gamma \) (and possibly also on other kernel parameters) that balances the data fit and the penalty term. The latter induces stability by using the kernel eigenvalues \(\zeta _i\) to constrain the rate at which the expansion coefficients decay to zero.
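The equivalence between the kernel-based form of the regularized estimate and its rewriting in the eigenbasis of K can be verified numerically. The sketch below assumes that the regularized estimator has the usual ReLS form with the RKHS norm as penalty, so that the representer-theorem solution is \(\hat{g} = K U^\top (U K U^\top + \gamma I)^{-1} y\) for a regression matrix U built from the input; sizes, hyperparameters and the “true” impulse response are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, gamma, alpha = 60, 80, 0.1, 0.9            # illustrative sizes and hyperparameters
i = np.arange(1, d + 1)
K = alpha ** np.maximum.outer(i, i)              # truncated stable spline kernel

u = rng.standard_normal(N + d)                   # white-noise input
U = np.array([[u[t - k] for k in range(1, d + 1)] for t in range(d, d + N)])
g0 = 0.8 ** i * np.cos(0.5 * i)                  # "true" exponentially decaying impulse response
y = U @ g0 + 0.1 * rng.standard_normal(N)

# (a) kernel-based form of the regularized estimate (representer theorem)
g_kern = K @ U.T @ np.linalg.solve(U @ K @ U.T + gamma * np.eye(N), y)

# (b) same problem written in the eigenbasis of K: g = sum_i a_i rho_i with
#     penalty gamma * sum_i a_i^2 / zeta_i, cf. (7.70)
zeta, R = np.linalg.eigh(K)
A = U @ R
a_hat = np.linalg.solve(A.T @ A + gamma * np.diag(1.0 / zeta), A.T @ y)
g_eig = R @ a_hat

print(np.max(np.abs(g_kern - g_eig)))            # tiny: the two forms coincide numerically
```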
7.4.4 Necessary and Sufficient Stability Condition Using Kernel Eigenvectors and Eigenvalues
We have seen that a fruitful way to design a regularized estimator for linear system identification is to introduce a kernel by specifying its entries \(K_{ij}\). This modelling technique translates our expected features of an impulse response into kernel properties, e.g., smooth exponential decay as described by the stable spline, TC and DC kernels. This route exploits the kernel trick, i.e., the implicit encoding of the basis functions. In some circumstances, it could be useful to build a kernel starting from the design of its eigenfunctions \(\rho _i\) and eigenvalues \(\zeta _i\). A notable example is given by the (already cited) Laguerre or Kautz functions that belong to the more general class of Takenaka–Malmquist orthogonal basis functions [46]. They can be useful to describe oscillatory behavior or the presence of fast/slow poles.
Since any stable kernel can be associated with an \(\ell _2\) basis, the following fundamental problem arises. Given an orthonormal basis \(\{\rho _i\}\) of \(\ell _2\), for example of the Takenaka–Malmquist type, what conditions on the eigenvalues \(\zeta _i\) ensure stability of \(K_{xy} = \sum _{i=1}^{+\infty } \zeta _i \rho _i(x) \rho _i(y)\)? The answer is given by the following result derived from [8], which reports the necessary and sufficient condition (the proof is given in Sect. 7.7.5).
Theorem 7.9
(RKHS stability using Mercer expansions, based on [8]) Let \(\mathscr {H}\) be the RKHS induced by K with
where the \(\{\rho _i\}\) form an orthonormal basis of \(\ell _2\). Let also
Then, one has
where \(\langle \cdot , \cdot \rangle _2 \) is the inner product in \(\ell _2\).
Thus, clearly, there is no stability if one function \(\rho _i\) associated with \(\zeta _i > 0\) does not belong to \(\ell _1\). In fact, one can choose u containing the signs of the components of \(\rho _i\), which leads to \(\langle \rho _i,u \rangle _2= +\infty \). Nothing is instead required of the eigenvectors associated with \(\zeta _i=0\). Theorem 7.9 also permits one to derive the following sufficient stability condition.
Corollary 7.3
(based on [8]) Let \(\mathscr {H}\) be the RKHS induced by the kernel \(K_{xy} = \sum _{i=1}^{+\infty } \zeta _i \rho _i(x) \rho _i(y)\) with \(\{\rho _i\}\) an orthonormal basis of \(\ell _2\). Then, it holds that
Furthermore, such a condition also implies kernel absolute summability and, hence, it is not necessary for RKHS stability.
It is easy to exploit the stability condition (7.72) to design models of stable impulse responses starting from an \(\ell _2\) basis. Let us reconsider, e.g., Laguerre or Kautz basis functions \(\{\rho _i\}\) to build the impulse response model
To exploit (7.70), one has to define stability constraints on the expansion coefficients \(a_i\). This corresponds to defining the \(\zeta _i\) in such a way that the regularizer
enforces absolute summability of g. Laguerre and Kautz models belong to the Takenaka–Malmquist class of functions \(\rho _i\) that all satisfy
with M a constant independent of i [46]. Then, Corollary 7.3 ensures that the choice
enforces the stability constraint for the entire Takenaka–Malmquist class.
Let us now consider the class of orthonormal basis functions \(\rho _i\) all contained in a ball of \(\ell _1\). Then, the necessary and sufficient stability condition assumes an especially simple form, as the following result shows.
Corollary 7.4
(based on [8]) Let \(\mathscr {H}\) be the RKHS induced by the kernel \(K_{xy} = \sum _{i=1}^{+\infty } \zeta _i \rho _i(x) \rho _i(y)\) with \(\{\rho _i\}\) an orthonormal basis of \(\ell _2\) and \(\Vert \rho _i \Vert _1 \le M < +\infty \) if \(\zeta _i>0\), with M not dependent on i. Then, one has
Finally, Fig. 7.13 illustrates graphically all the stability results here obtained starting from Mercer expansions.
Fig. 7.13 Inclusion properties of some important kernel classes in terms of Mercer expansions. This representation is the dual of that reported in Fig. 7.11 and defines kernel sets through properties of the kernel eigenvectors \(\rho _i\), forming an orthonormal basis in \(\ell _2\), and of the corresponding kernel eigenvalues \(\zeta _i\). The condition \(\sum _i \zeta _i \Vert \rho _i\Vert _1^2 < \infty \) is the most restrictive since it implies kernel absolute summability. The necessary and sufficient condition for stability is \(\sup _{u \in {\mathscr {U}}_{\infty }} \ \sum _i \zeta _i \langle \rho _i, u \rangle _2^2< \infty \). Finally, \(\sum _i \zeta _i < \infty \) and \(\sum _i \zeta _i^2 < \infty \) are exactly the conditions for a kernel to be finite trace and squared summable, respectively
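As the simplest illustration of Corollary 7.4, consider the canonical basis \(\rho _i = e_i\), for which \(\Vert \rho _i \Vert _1 = 1\) and the induced kernel is diagonal, \(K=\mathrm{diag}(\zeta _i)\); the corollary then states that stability is equivalent to \(\sum _i \zeta _i < \infty \). The short sketch below (with arbitrary eigenvalue sequences) checks this via the absolute summability of the truncated kernel.

```python
import numpy as np

d = 5000                                  # truncation size (arbitrary)
i = np.arange(1, d + 1)

# Canonical basis rho_i = e_i, so ||rho_i||_1 = 1 and K = diag(zeta).
zeta_stable   = 1.0 / i**2                # sum_i zeta_i < infinity -> stable kernel
zeta_unstable = 1.0 / i                   # sum_i zeta_i = infinity -> unstable kernel

for zeta in (zeta_stable, zeta_unstable):
    # For a diagonal kernel, sum_{ij} |K_ij| = sum_i zeta_i = trace(K), so
    # absolute summability, finite trace and stability all coincide here.
    print(zeta.sum())
# The first partial sum approaches pi^2/6; the second grows like log(d).
```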
7.5 Minimax Properties of the Stable Spline Estimator \(\star \)
In this section, we will derive non-asymptotic upper bounds on the MSE of the regularized IIR estimator (7.10), valid for all the exponentially stable discrete-time systems whose poles lie in the disc of radius \(\rho \) of the complex plane. The obtained bounds can be evaluated before any data are observed. Results of this kind give insight into the so-called sample complexity, i.e., the number of measurements needed to achieve a certain accuracy in impulse response reconstruction. This is an attractive feature even if, since the bounds need to hold for all the models falling in a particular class, they are often quite loose for the particular dynamic system at hand. However, they have considerable theoretical value since they also permit one to assess the quality of (7.10) through nonparametric minimax concepts. Such a setting considers the worst case within an infinite-dimensional class and has been widely studied in nonparametric regression and density estimation [88]. In particular, the obtained bounds will lead to conditions ensuring optimality in order, i.e., the best convergence rate of (7.10) in the minimax sense. We will derive them by considering system inputs given by white noise and using the TC/stable spline kernel (7.15) as regularizer. The important dependence among the convergence rate of (7.10) to the true impulse response, the kernel stability parameter \(\alpha \) and the stability radius \(\rho \) will be elucidated.
7.5.1 Data Generator and Minimax Optimality
As in the previous part of the chapter, we use \(g^0\) to denote the impulse response of a discrete-time linear system. The measurements are generated as follows:
where \(g^0(k)\) are the impulse response coefficients. We will always assume that \(g^0\) is a deterministic and exponentially stable impulse response, while the input u and the noise e are stochastic processes as specified below.
Assumption 7.10
The impulse response \(g^0\) belongs to the following set:
The system input and the noise are discrete-time stochastic processes. One has that \(\lbrace u(t) \rbrace _{t \in \mathbb {Z}}\) are independent and identically distributed (i.i.d.) zero-mean random variables with
Finally, \(\lbrace e(t) \rbrace _{t \in \mathbb {Z}}\) are independent random variables, independent of \(\lbrace u(t) \rbrace _{t \in \mathbb {Z}}\), with
The available measurements are
where N is the data set size.
The quality of an impulse response estimator \(\hat{g}\), function of the data set \(\mathcal{D}_T\), will be measured through the estimation error \(\mathscr {E} \Vert g^0-\hat{g}\Vert _2\), where \(\Vert \cdot \Vert _2\) is the norm in the space \(\ell _2\) of squared summable sequences. Note that the expectation is taken w.r.t. the randomness of the system input and the measurement noise. The worst-case error over the family \(\mathscr {S}\) of exponentially stable systems reported in (7.75) will also be considered. In particular, the uniform \(\ell _2\)-risk of \(\hat{g}\) is
An estimator \(g^{*}\) is then said to be minimax if the following equality holds for any data set size N:
meaning that \(g^{*}\) minimizes the error in the worst-case scenario. Building such an estimator is in general very difficult. For this reason, it is often convenient to consider just the asymptotic behaviour by introducing the concept of optimality in order. Specifically, an estimator \(\bar{g}\) is optimal in order if
where \(C_N\) is a function of the data set size satisfying \(\sup _N \ C_N<\infty \) and \(g^{*}\) is minimax. In our linear system identification setting, optimality in order thus ensures that, as N grows to infinity, the convergence rate of \(\bar{g}\) to the true impulse response \(g^0\) cannot be improved by any other system identification procedure in the minimax sense.
7.5.2 Stable Spline Estimator
As anticipated, our study is focused on the following regularized estimator:
equipped with the stable spline kernel
For future developments, it is important to control the complexity of (7.79) not only through the hyperparameters \(\gamma \) and \(\alpha \) but also through the dimension d of the following subspace:
over which optimization of the objective in (7.79) is performed. In particular, we will consider the estimator
and will study how N and the choice of \(\gamma ,\alpha ,d\) influence the estimation error and, hence, the convergence rate. This will lead to complexity control rules that are a hybrid of those seen in the classical and in the regularized framework. To this end, we first rewrite (7.81) in terms of regularized FIR estimation by exploiting the structure of the stable spline norm (7.16), which shows that
Let us define the matrix
and the regressors
Now, one can easily see that the first d components of \(\hat{g}^d\) in (7.81) are contained in the vector
Hence, we obtain
where
In real applications, one cannot measure the input at all time instants, and our data set \(\mathcal{D}_T\) in (7.78) may contain only the inputs \(u(1),\ldots ,u(N)\). So, differently from what is postulated in the above equations, in practice the regressors are never perfectly known. One solution is simply to replace the unknown input values \(\{u(t)\}_{t<1}\) entering (7.84) with zeros. Even under this model misspecification, all the results introduced in the next sections still hold.
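The finite-dimensional computations described above can be sketched as follows. We assume that R in (7.83) is the \(d\times d\) stable spline matrix \(R_{ij}=\alpha ^{\max (i,j)}\), that the regressors (7.84) collect the d most recent past inputs, and that (7.85) has the standard ReLS form with penalty \(\gamma \,\theta ^\top R^{-1}\theta \); the unknown inputs \(u(t)\), \(t<1\), are replaced by zeros as just discussed. All numerical values and the “true” impulse response are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, alpha, gamma, sigma = 500, 50, 0.9, 1.0, 0.1    # illustrative values
i = np.arange(1, d + 1)
R = alpha ** np.maximum.outer(i, i)                   # d x d stable spline matrix (assumed form of (7.83))

u = rng.standard_normal(N)                            # white-noise input u(1), ..., u(N)
u_pad = np.concatenate([np.zeros(d), u])              # unknown inputs u(t), t < 1, replaced by zeros
# Regressors phi(t) = [u(t-1), ..., u(t-d)] (assumed form of (7.84))
Phi = np.array([[u_pad[d + t - k - 1] for k in range(1, d + 1)] for t in range(1, N + 1)])

g0 = 0.85 ** i                                        # "true" exponentially stable impulse response
y = Phi @ g0 + sigma * rng.standard_normal(N)

# ReLS estimate of the first d impulse response coefficients (assumed form of (7.85))
theta_hat = R @ Phi.T @ np.linalg.solve(Phi @ R @ Phi.T + gamma * np.eye(N), y)
print(np.linalg.norm(theta_hat - g0))                 # l2 error on the first d coefficients
```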
7.5.3 Bounds on the Estimation Error and Minimax Properties
The following theorem will report non-asymptotic bounds that illustrate the dependence of \(\mathscr {E} \Vert g^0-\hat{g}^d\Vert _2\) on the following three key variables:
-
the FIR order d which determines the truncation error;
-
the parameter \(\alpha \) contained in the matrix R reported in (7.83) that establishes the exponential decay of the estimated impulse response coefficients;
-
the regularization parameter \(\gamma \) which trades off the penalty defined by R against adherence to the experimental data.
In addition, it gives conditions on \(\alpha \) which ensure optimality in order if some conditions on the stability radius \(\rho \) entering (7.75) and on the FIR order d (function of the data set size N) are fulfilled. Below, the notation O(1) indicates an absolute constant, independent of N. Furthermore, given \(x \in \mathbb {R}\), we use \(\left\lfloor {x}\right\rfloor \) to indicate the largest integer not larger than x. The following result then holds.
Theorem 7.11
(based on [74]) Let the FIR order d be defined by the following function of the data set size N:
with N large enough to guarantee \(d^{*} \ge 1\).
Then, under Assumption 7.10, the estimator (7.81) satisfies
where
Furthermore, if the measurement noise is Gaussian and \(\sqrt{\alpha } \ge \rho \), the stable spline estimator (7.81) is optimal in order.
To illustrate the meaning of Theorem 7.11, it is first useful to recall a result obtained in [43] that relies on Fano’s inequality. It shows that, if a dynamic system is fed with white input and the measurement noise is Gaussian, the expected \(\ell _2\) error of any impulse response estimator cannot decay to zero faster than \(\sqrt{\frac{\ln N}{N}}\) in a minimax sense.
Theorem 7.12
(based on [43]) Let Assumption 7.10 hold and assume also that the measurement noise is Gaussian. Then, if \(\hat{g}\) is any impulse response estimator built with \(\mathcal{D}_T\), for N sufficiently large one has
\(\blacksquare \)
To illustrate the convergence rate of the stable spline estimator, first note that the FIR dimension \(d^{*}\) in (7.88) scales logarithmically with N. Apart from irrelevant constants, one in fact has
We now consider the three terms on the r.h.s. of (7.89) with \(d=d^{*}\). Since
the first two terms decay to zero at least as \(\sqrt{\frac{\ln N}{N}}\). Regarding the third one, one has
and this shows that the optimal convergence rate is obtained if \( \alpha \ge \rho \) but the case \(\alpha <\rho \) can be critical. In particular, combining (7.89) with (7.93) and (7.94), the following considerations arise:
-
the convergence rate of the stable spline estimator (7.81) does not depend on \(\gamma \) but only on the relationship between the kernel parameter \(\alpha \) and the stability radius \(\rho \) defining the class of dynamic systems (7.75);
-
using Theorem 7.12, one can see from (7.94) that, if \(\alpha <\rho \), the achievement of the optimal rate is related to the term \(N^{-\frac{\ln \rho }{\ln \alpha }}\), which appears as the third term in (7.89). The key condition is
$$ \frac{\ln \rho }{\ln \alpha } \ge 0.5 \implies \sqrt{\alpha } \ge \rho . $$This indeed corresponds to what was stated in the final part of Theorem 7.11: under Gaussian noise the stable spline estimator is optimal in order if \(\sqrt{\alpha }\) is an upper bound on the stability radius \(\rho \).
Relationships (7.93) and (7.94) also clarify what happens when the kernel includes a too fast exponential decay rate, i.e., when \(\sqrt{\alpha } <\rho \). In this case, the error goes to zero as \(N^{-\frac{\ln \rho }{\ln \alpha }}\), getting worse as \(\sqrt{\alpha }\) drifts away from \(\rho \). Such a phenomenon has a simple explanation. A too small \(\alpha \) forces the impulse response estimate to decay to zero even when the true impulse response coefficients are still significantly different from zero. This corresponds to a strong bias: a wrong amount of regularization is introduced in the estimation process, hence compromising the convergence rate. This is also graphically illustrated in Fig. 7.14, which plots the convergence rate \(\ln \rho / \ln \alpha \) as a function of \(\sqrt{\alpha }\) for five different values of \(\rho \).
The analysis thus shows how \(\alpha \) plays a fundamental role in controlling impulse response complexity and, hence, in establishing the properties of the regularized estimator. This is not surprising also in view of the deep connection between the decay rate and the degrees of freedom of the model. This was illustrated in Fig. 5.6 of Sect. 5.5.1 using the class of DC kernels which includes TC as special case.
Fig. 7.14 Convergence rate \(\ln \rho / \ln \alpha \) of the stable spline estimator as a function of \(\sqrt{\alpha }\) for \(\sqrt{\alpha } <\rho \) with \(\rho \) in the set \(\{0.7,0.8,0.9,0.95,0.99\}\). When \(\sqrt{\alpha } <\rho \) the estimation error converges to zero as \(N^{-\frac{\ln \rho }{\ln \alpha }}\). Instead, if \(\sqrt{\alpha } \ge \rho \) the error decays as \(\sqrt{\frac{\ln N}{N}}\), making the stable spline estimator optimal in order when the measurement noise is Gaussian
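The rate comparison of Fig. 7.14 can be reproduced with a few lines of code; the grid of values for \(\sqrt{\alpha }\) is an arbitrary choice, and the exponent 1/2 stands for the \(\sqrt{\ln N / N}\) regime up to the logarithmic factor.

```python
import numpy as np

rhos = [0.7, 0.8, 0.9, 0.95, 0.99]
sqrt_alpha = np.linspace(0.3, 0.999, 8)           # illustrative grid for sqrt(alpha)
alpha = sqrt_alpha ** 2

for rho in rhos:
    rate = np.log(rho) / np.log(alpha)            # error ~ N^{-rate} when sqrt(alpha) < rho
    rate = np.where(sqrt_alpha >= rho, 0.5, rate) # optimal regime: exponent 1/2 (up to the log factor)
    print(rho, np.round(rate, 3))                 # the exponent degrades as sqrt(alpha) falls below rho
```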
7.6 Further Topics and Advanced Reading
The idea to handle linear system identification with regularization methods in the RKHS framework first appears in [72]. As already mentioned, the representer theorems introduced in this chapter are special cases of that involving linear and bounded functionals reported in the previous chapter, see Theorem 6.16. More general versions of representer theorems with, e.g., more general loss functions and/or regularization terms can be found in, e.g., [33]. Similarly to the spline smoothing problem studied in Sect. 6.6.7, it could be useful to enrich the regularized impulse response estimators here described with a parametric component. Of course, the corresponding regularized estimator will still have a closed-form finite-dimensional representation that depends on both the number of data N and the number of enriched parametric components, e.g., see [72, 90].
The stable spline kernel [72] and the diagonal correlated kernel [19] are the first two kernels introduced in the linear system identification literature. The stability of a kernel (or equivalently the stability of a RKHS) first appeared in [32, 73]. The stability of a kernel is equivalent to the \(\infty \)-boundedness of the kernel, which is a special case of the more general q-boundedness with \(1<q\le \infty \) in [16]. The proof in [16] for the sufficiency and necessity of the q-boundedness of a kernel is quite involved and abstract. Theorem 7.5 is also discussed in [24], see also [76] where the stability analysis exploits the output kernel. The optimal kernel that minimizes the mean squared error was studied in [19, 73]. As already discussed, unfortunately, the optimal kernel cannot be applied in practice because it depends on the true impulse response to be estimated. Nevertheless, it offers a guideline to design kernels for linear system identification and more general function estimation problems. Motivated by these findings, many stable kernels have been introduced over the years, e.g., [17, 21, 77, 80, 97]. In particular, [17] proposed linear multiple kernels to handle systems with complicated dynamics, e.g., with distinct time constants and distinct resonant frequencies, and [77] further extended this idea and proposed “integral” versions of the stable spline kernels. To design kernels embedding more general prior knowledge, e.g., overdamped/underdamped dynamics, common structure, etc., it is natural to divide the prior knowledge into different types and then develop systematic ways to design kernels accordingly, see [21, 80, 97]. In particular, the approaches proposed in [21] are based on machine learning and system theory perspectives, those in [80] rely on the maximum entropy principle, and the method proposed in [97] uses harmonic analysis.
Along with the kernel design, many efforts have also been spent on “kernel analysis”. In particular, many kernels can be given maximum entropy interpretations, including the stable spline kernel, the diagonal correlated kernel and the more general simulation-induced kernel [14, 21, 23]. This can help to understand the prior knowledge embedded in the model. Many kernels have the Markov property, e.g., [83]. Examples are the diagonal correlated kernel and some carefully designed simulation-induced kernels [21]. Exploiting this property can help in designing efficient implementations. As we have seen, the spectral analysis of kernels is often not available in closed form, even if it can be numerically recovered, but exceptions include the stable spline and the diagonal correlated kernel [20, 22, 72].
The hyperparameter tuning problem has been studied for a long time in the context of function estimation from noisy observations, e.g., [83, 90]. The marginal likelihood maximization method relies on the connection with the Bayesian estimation of Gaussian processes, which was first studied in [51] in spline regression, see also [41, 83, 90]. More discussions on its relation to Bayesian evidence and Occam’s razor principle can be found in, e.g., [27, 60]. Stein’s unbiased risk estimation method is also known as the \(C_p\) statistics [61]. The generalized cross-validation method was first proposed in [28] and shown to be rotation invariant in [44]. The problem can also be tackled using full Bayes approaches relying on stochastic simulation techniques, e.g., Markov chain Monte Carlo [1, 39].
In the context of linear system identification, some theoretical results on the hyperparameter estimation problem have been derived. In particular, it was shown in [4] that the marginal likelihood maximization method is consistent for diagonal kernels in terms of the mean square error and asymptotically minimizes a weighted mean square error for nondiagonal kernels. In [78], the robustness of the marginal likelihood maximization is analysed with the help of the excess degrees of freedom. It is further shown in [63, 64, 66] that Stein’s unbiased risk estimation as well as many cross-validation methods are asymptotically optimal in the sense of mean square error. In [4, 17, 94], the optimal hyperparameter of the marginal likelihood maximization is shown to be sparse. By exploiting this property, it is possible to handle various structure detection problems in system identification, like sparse dynamic network identification [17, 26]. Full Bayes approaches can be found, e.g., in [69].
As also recalled in the previous chapter, a straightforward implementation of the regularization method in the RKHS framework has computational complexity \(O(N^3)\) and is thus prohibitive when N is large. Many efficient approximation methods have been proposed in machine learning, e.g., [53, 81, 82]. In the context of linear system identification, there is another practical issue that must be noted in the implementation: the ill-conditioning possibly arising from the use of stable kernels, which is unavoidable due to the nature of stability. Hence, extra care has to be taken when developing efficient implementations. Some approximation methods have been proposed to reduce the computational complexity and mitigate numerical issues. The first one is to truncate the IIR at a suitable finite order n. Then, the computational complexity becomes \(O(n^3)\) and one can also use the approach proposed in [18] relying on some fundamental algebraic techniques and reliable matrix factorizations. The other one is to truncate the infinite expansion of a kernel at a finite order l. Then, the computational complexity becomes \(O(l^3)\), see [15]. See also [36] for efficient kernel-based regularization implementation using the Alternating Direction Method of Multipliers (ADMM). Another practical issue is the difficulty caused by local minima. For kernels with a small number of hyperparameters, e.g., the stable spline kernel and the diagonal correlated kernel, this difficulty can be handled using different starting points or grid methods. For systems with complicated dynamics, it is suggested to apply linear multiple kernels [17], since the corresponding marginal likelihood maximization is a difference of convex programming problem and a stationary point can be found efficiently using sequential convex optimization techniques, e.g., [48, 87].
In this chapter, we only considered single-input single-output linear systems with white measurement noise. For multiple-input single-output linear systems, it is natural to use multi-input impulse response models and then assume that the overall system has a block diagonal kernel [73]. The regularization method can also be extended to handle linear systems with colored noise, e.g., ARMAX models. One can exploit the fact that such systems can be approximated arbitrarily well by finite-order ARX models [57]. The problem thus becomes a special case of multiple-input single-output systems where the regressors also contain past outputs [71]. This will also be illustrated in Chap. 9.
In practice, the data could be contaminated by outliers due to a failure in the measurement or transmission equipment, e.g., [56, Chap. 15]. In the presence of outliers, robust statistics suggests replacing the commonly used Gaussian distribution for the noise with heavy-tailed distributions, e.g., [49]. For regularization methods in the RKHS framework, the key difficulty is that the hyperparameter estimation criteria and the regularized estimate may not have closed-form expressions. Several methods have been proposed to overcome this difficulty. In particular, an expectation maximization (EM) method was proposed in [10] and further improved in [55] exploiting a variational expectation method.
Input design is an important issue for classical system identification and many results have been obtained, e.g., [38, 45, 47, 56]. For regularized system identification in RKHS, some results have been reported recently. The first result was given in [37], where the mutual information between the output and the impulse response was chosen as the input design criterion. Unfortunately, obtaining the optimal input involves the solution of a nonconvex optimization problem. Differently from [37], the work [65] adopts scalar measures of the Bayesian mean square error as input design criteria, proposing a two-step procedure to find the globally optimal input through convex optimization.
As for the construction of uncertainty regions around the dynamic system estimates, approaches are available which return bounds that, beyond being non-asymptotic, are also exact, i.e., have the desired inclusion probability. This requires some assumptions on data generation, like the introduction of prior distributions on the impulse response. An important example, already widely discussed in this book, is the use of a Bayesian framework that interprets regularization as Gaussian regression [83]. The posterior density becomes available in closed form and Bayes intervals can be easily obtained. Another approach to compute bounds for linear regression is the sign-perturbed sums (SPS) technique [30]. Following a randomization principle, it builds guaranteed uncertainty regions for deterministic parametric models in a quasi-distribution-free setup [11, 12]. Recently, there have been notable extensions to the class of models that SPS can handle. The first line of thought still sees the unknown parameters as deterministic but introduces regularization, see [29, 70, 89] and also [31], which is a first attempt to move beyond the strictly parametric nature of SPS. A second line of thought allows for the exploitation of some form of prior knowledge at a more fundamental probabilistic level [13, 70].
Finally, the interested readers are referred to the survey [73] for more references, see also [25, 58].
Notes
- 1.
In (7.67), we have assumed that all the kernel eigenvalues are strictly positive so that \(\mathscr {H}\) is infinite dimensional. If some \(\zeta _i\) are null, \(\mathscr {H}\) is spanned only by the eigenvectors associated with the non-null ones. If only a finite number of \(\zeta _i\) are different from zero, K has finite rank and \(\mathscr {H}\) is finite dimensional. A notable case is that of the RKHSs induced by truncated kernels, i.e., such that there exists d such that \(K_{ii}=0 \ \forall i>d\). As we have seen, such kernels induce finite-dimensional RKHSs containing FIR systems of order d.
References
Andrieu C, Doucet A, Holenstein R (2010) Particle Markov Chain Monte Carlo methods. J. R. Stat. Soc. Series B 72(3):269–342
Aravkin A, Burke JV, Pillonetto G (2018) Generalized system identification with stable spline kernels. SIAM J. Sci. Comput. 40(5):1419–1443
Aravkin A, Bell BM, Burke JV, Pillonetto G (2015) The connection between Bayesian estimation of a Gaussian random field and RKHSs. IEEE Trans. Neural Netw. Learn. Syst. 26(7):1518–1524
Aravkin A, Burke JV, Chiuso A, Pillonetto G (2014) Convex vs non-convex estimators for regression and sparse estimation: the mean squared error properties of ARD and GLASSO. J. Mach. Learn. Res. 15(1):217–252
Atkinson K (1975) Convergence rates for approximate eigenvalues of compact integral operators. SIAM J. Numer. Anal. 12(2):213–222
Baker C (1977) The numerical treatment of integral equations. Clarendon press
Bisiacco M, Pillonetto G (2020) Kernel absolute summability is sufficient but not necessary for RKHS stability. SIAM J Control Optim
Bisiacco M, Pillonetto G (2020) On the mathematical foundations of stable RKHSs. Automatica
Bogachev VI (1998) Gaussian measures. AMS
Bottegal G, Aravkin A, Hjalmarsson H, Pillonetto G (2016) Robust EM kernel-based methods for linear system identification. Automatica 67:114–126
Campi MC, Weyer E (2005) Guaranteed non-asymptotic confidence regions in system identification. Automatica 41(10):1751–1764
Carè A, Csáji BCs, Campi MC, Weyer E (2018) Finite-sample system identification: an overview and a new correlation method. IEEE Control Syst Lett 2(1):61–66
Carè A, Pillonetto G, Campi MC (2018) Uncertainty bounds for kernel-based regression: a Bayesian SPS approach. In: 2018 IEEE 28th international workshop on machine learning for signal processing (MLSP), pp 1–6
Carli FP, Chen T, Ljung L (2017) Maximum entropy kernels for system identification. IEEE Trans. Autom. Control 62(3):1471–1477
Carli FP, Chiuso A, Pillonetto G (2012) Efficient algorithms for large scale linear system identification using stable spline estimators. In: IFAC symposium on system identification
Carmeli C, De Vito E, Toigo A (2006) Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Anal. Appl. 4:377–408
Chen T, Andersen MS, Ljung L, Chiuso A, Pillonetto G (2014) System identification via sparse multiple kernel-based regularization using sequential convex optimization techniques. IEEE Trans. Autom. Control 59(11):2933–2945
Chen T, Ljung L (2013) Implementation of algorithms for tuning parameters in regularized least squares problems in system identification. Automatica 49:2213–2220
Chen T, Ohlsson H, Ljung L (2012) On the estimation of transfer functions, regularizations and Gaussian processes - Revisited. Automatica 48:1525–1535
Chen T, Pillonetto G, Chiuso A, Ljung L (2015) Spectral analysis of the DC kernel for regularized system identification. In: Proceedings 54th IEEE conference on decision and control (CDC), pp. 4017–4022, Osaka, Japan
Chen T (2018) On kernel design for regularized LTI system identification. Automatica 90:109–122
Chen T (2019) Continuous-time DC kernel – a stable generalized first-order spline kernel. IEEE Trans. Autom. Control 63:4442–4447
Chen T, Ardeshiri T, Carli FP, Chiuso A, Ljung L, Pillonetto G (2016) Maximum entropy properties of discrete-time first-order stable spline kernel. Automatica 66:34–38
Chen T, Pillonetto G (2018) On the stability of reproducing kernel Hilbert spaces of discrete-time impulse responses. Automatica
Chiuso A (2016) Regularization and Bayesian learning in dynamical systems: Past, present and future. Annu. Rev. Control 41:24–38
Chiuso A, Pillonetto G (2012) A Bayesian approach to sparse dynamic network identification. Automatica 48(8):1553–1565
Cox RT (1946) Probability, frequency, and reasonable expectation. Am. J. Phys. 14(1):1–13
Craven P, Wahba G (1979) Smoothing noisy data with spline functions. Numer. Math. 31:377–403
Csáji B (2019) Non-asymptotic confidence regions for regularized linear regression estimates. In: Faragó István, Izsák Ferenc, Simon Péter L (eds) Progress in Industrial Mathematics at ECMI 2018. Springer International Publishing, Cham, pp 605–611
Csáji B, Campi MC, Weyer E (2015) Sign-perturbed sums: a new system identification approach for constructing exact non-asymptotic confidence regions in linear regression models. IEEE Trans Signal Process 63(1):169–181
Csáji B, Kis KB (2019) Distribution-free uncertainty quantification for kernel methods by gradient perturbations. Mach Learn 108(8):1677–1699
Dinuzzo F (2015) Kernels for linear time invariant system identification. SIAM J. Control Optim. 53(5):3299–3317
Dinuzzo F, Scholkopf B (2012) The representer theorem for Hilbert spaces: a necessary and sufficient condition. In: Bartlett P, Pereira FCN, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems, vol 25, pp 189–196
Driscoll MF (1973) The reproducing kernel Hilbert space structure of the sample paths of a Gaussian process. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 26:309–316
Dunford N, Schwartz JT (1963) Linear operators. InterScience Publishers
Fujimoto Y (2021) Efficient implementation of kernel regularization based on ADMM. In: Proceedings of the 19th IFAC symposium on system identification (SYSID), Online, July 2021
Fujimoto Y, Sugie T (2018) Informative input design for kernel-based system identification. Automatica 89(3):37–43
Gevers M (2005) Identification for control: From the early achievements to the revival of experiment design. Eur. J. Control 11:335–352
Gilks WR, Richardson S, Spiegelhalter DJ (1996) Markov chain Monte Carlo in Practice. Chapman and Hall, London
Girosi F (1991) Models of noise and robust estimates. A.I. Memo 1287, Artificial Intelligence Laboratory, Massachusetts Institute of Technology
Girosi F, Jones M, Poggio T (1995) Regularization theory and neural networks architectures. Neural Comput. 7(2):219–269
Gittens A, Mahoney M (2016) Revisiting the Nyström method for improved large-scale machine learning. J. Mach. Learn. Res. 17(1):3977–4041
Goldenshluger A (1998) Nonparametric estimation of transfer functions: rates of convergence and adaptation. IEEE Trans. Inf. Theory 44(2):644–658
Golub GH, Heath M, Wahba G (1979) Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21(2):215–223
Goodwin GC, Payne RL (1977) Dynamic system identification: experiment design and data analysis. Academic Press, New York
Heuberger P, Van den Hof P, Wahlberg B (2005) Modelling and identification with rational orthogonal basis functions. Springer
Hjalmarsson H (2005) From experiment design to closed loop control. Automatica 41(3):393–438
Horst R, Thoai NV (1999) DC programming: overview. J Optim Theory Appl 103(1):1–43
Huber PJ (1981) Robust Statistics. John Wiley and Sons, New York, NY, USA
Kass RE, Raftery AE (1995) Bayes factors. J Am Stat Assoc 90:773–795
Kimeldorf G, Wahba G (1970) A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Stat. 41(2):495–502
Kumar S, Mohri M, Talwalkar A (2012) Sampling methods for the Nyström method. J. Mach. Learn. Res. 13(1):981–1006
Lázaro-Gredilla M, Quiñonero-Candela J, Rasmussen CE, Figueiras-Vidal AR (2010) Sparse spectrum Gaussian process regression. J Mach Learn Res 99:1865–1881
Lidskii VB (1959) Non-self-adjoint operators with a trace. Dokl. Akad. Nauk. 125:485–487
Lindfors M, Chen T (2019) Regularized LTI system identification in the presence of outliers: a variational EM approach. Automatica, under review
Ljung L (1999) System identification - theory for the user, 2nd edn. Prentice-Hall, Upper Saddle River
Ljung L, Wahlberg B (1992) Asymptotic properties of the least-squares method for estimating transfer functions and disturbance spectra. Adv. Appl. Probab. 24:412–440
Ljung L, Chen T, Mu B (2019) A shift in paradigm for system identification. Int J Control 1–8
Lukic MN, Beder JH (2001) Stochastic processes with sample paths in reproducing kernel Hilbert spaces. Trans. Am. Math. Soc. 353:3945–3969
MacKay DJC (1992) Bayesian interpolation. Neural Comput. 4:415–447
Mallows CL (1973) Some comments on CP. Technometrics 15(4):661–675
Megginson RE (1998) An introduction to Banach space theory. Springer
Mu B, Chen T, Ljung L (2018) Asymptotic properties of generalized cross validation estimators for regularized system identification. In: The 18th IFAC symposium on system identification (SYSID)
Mu B, Chen T, Ljung L (2018) On asymptotic properties of hyperparameter estimators for kernel-based regularization methods. Automatica 94:381–395
Mu B, Chen T (2018) On input design for regularized LTI system identification: Power-constrained input. Automatica 97:327–338
Mu B, Chen T, Ljung L (2018) Asymptotic properties of hyperparameter estimators by using cross-validations for regularized system identification. In: Proceedings of the 57th IEEE conference on decision and control, pp 644–649
Palmer JA, Wipf DP, Kreutz-Delgado K, Rao BD (2006) Variational EM algorithms for non-Gaussian latent variable models. Adv Neural Inf Process Syst
Parzen E (1963) Probability density functionals and reproducing kernel Hilbert spaces. In: Proceedings of the symposium on time series analysis. Wiley, New York
Pillonetto G, Bell BM (2007) Bayes and empirical Bayes semi-blind deconvolution using eigenfunctions of a prior covariance. Automatica 43(10):1698–1712
Pillonetto G, Carè A, Campi MC (2018) Kernel-based SPS. IFAC-PapersOnLine 51(15):31–36. 18th IFAC symposium on system identification SYSID 2018
Pillonetto G, Chiuso A, De Nicolao G (2011) Prediction error identification of linear systems: a nonparametric Gaussian regression approach. Automatica 47(2):291–305
Pillonetto G, De Nicolao G (2010) A new kernel-based approach for linear system identification. Automatica 46(1):81–93
Pillonetto G, Dinuzzo F, Chen T, De Nicolao G, Ljung L (2014) Kernel methods in system identification, machine learning and function estimation: a survey. Automatica 50
Pillonetto G, Scampicchio A (2021) Sample complexity and minimax properties of exponentially stable regularized estimators. IEEE Trans Autom Control
Pillonetto G, Schenato L, Varagnolo D (2019) Distributed multi-agent Gaussian regression via finite-dimensional approximations. IEEE Trans. Pattern Anal. Mach. Intell. 41(9):2098–2111
Pillonetto G (2018) System identification using kernel-based regularization: New insights on stability and consistency issues. Automatica 93:321–332
Pillonetto G, Chen T, Chiuso A, De Nicolao G, Ljung L (2016) Regularized linear system identification using atomic, nuclear and kernel-based norms: The role of the stability constraint. Automatica 69:137–149
Pillonetto G, Chiuso A (2015) Tuning complexity in regularized kernel-based regression and linear system identification: The robustness of the marginal likelihood estimator. Automatica 58:106–117
Pontil M, Mukherjee S, Girosi F (2000) On the noise model of support vector machine regression. In: Proceedings of algorithmic learning theory 11th international conference ALT 2000, Sydney
Prando G, Chiuso A, Pillonetto G (2017) Maximum entropy vector kernels for MIMO system identification. Automatica 79:326–339
Quiñonero-Candela J, Rasmussen CE, Williams CKI (2007) Approximation methods for Gaussian process regression. In: Bottou L, Chapelle O, DeCoste D, Weston J (eds) Large-Scale Kernel Machines. MIT Press, Cambridge, MA, USA, pp 203–223
Quiñonero-Candela J, Rasmussen CE (2005) A unifying view of sparse approximate Gaussian process regression. J Mach Learn Res 6:1939–1959
Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. The MIT Press
Robert D (2017) On the traces of operators (from Grothendieck to Lidskii). EMS newsletter 3:26–33
Rudin W (1987) Real and Complex Analysis. McGraw-Hill, Singapore
Sun H (2005) Mercer theorem for RKHS on noncompact sets. J. Complex. 21(3):337–349
Tao PD, An LTH (1997) Convex analysis approach to D.C. programming: theory, algorithms and applications. ACTA Math Vietnam 22:289–355
Tsybakov AB (2008) Introduction to nonparametric estimation. Springer
Volpe V (2015) Identification of dynamical systems with finitely many data points. University of Brescia, MSc thesis
Wahba G (1990) Spline models for observational data. SIAM, Philadelphia
Wahlberg B (1991) System identification using Laguerre models. IEEE Trans Autom Control AC-36:551–562
Wahlberg B (1994) System identification using Kautz models. IEEE Trans. Autom. Control 39(6):1276–1282
Williams CKI, Seeger M (2000) Using the Nyström method to speed up kernel machines. In: Advances in neural information processing systems. MIT Press, Cambridge, pp 682–688
Wipf DP, Rao BD (2004) Sparse Bayesian learning for basis selection. IEEE Trans. Signal Process. 52(8):2153–2164
Zeidler E (1995) Applied functional analysis. Springer
Zhu H, Williams CKI, Rohwer RJ, Morciniec M (1998) Gaussian regression and optimal finite dimensional linear models. In: Neural networks and machine learning. Springer, Berlin
Zorzi M, Chiuso A (2018) The harmonic analysis of kernel functions. Automatica 94:125–137
7.7 Appendix
7.7.1 Derivation of the First-Order Stable Spline Norm
We will exploit a representation of the RKHS induced by the first-order discrete-time stable spline kernel given by a linear transformation of the space \(\ell _2\) containing the squared summable sequences. This has some connections with the relationship between squared summable function spaces and RKHS discussed in Remark 6.2, even if no spectral decomposition of the kernel will be needed below.
Let \(\mathscr {H}\) be the RKHS induced by the stable spline kernel (7.15) with elements denoted by \(g=\{g_t\}_{t=1}^{+\infty }\). We will see that any \(g \in \mathscr {H}\) can be written as
where the scalars \(\{\psi _{tj}\}\) define the linear operator mapping \(\ell _2\) into \(\mathscr {H}\). By adopting notation of ordinary algebra to handle infinite-dimensional objects, one can see g as an infinite-dimensional column vector. In addition, (7.95) can be rewritten as \(g = \varPsi w,\) where \(\varPsi \) is an infinite-dimensional matrix with (t, j)-entry given by \(\psi _{tj}\). We will now obtain the expression of \(\varPsi \). Let
where \(e^j\) is the infinite-dimensional column vector with all null elements except its jth entry which is equal to one. Let also
The inverse \(\varPsi ^{-1}\) of \(\varPsi \) acts as follows: given a sequence g, it maps g into
Then, given \(\varPsi \) in (7.96), we will show that the space
with inner product given by
is the RKHS induced by the stable spline kernel. First, it is easy to see that the null space of \(\varPsi \) contains only the null vector. Then, since \(\ell _2\) is Hilbert, one obtains that \(\mathscr {H}\) is a Hilbert space. We can now exploit Theorem 6.2, i.e., the Moore–Aronszajn theorem, to prove that it is also the desired RKHS. To obtain this, the two conditions described below have to be checked.
The first condition says that any kernel section must belong to the space \(\mathscr {H}\) in (7.98). Thanks to the algebraic view, we can see the stable spline kernel K as an infinite-dimensional matrix. Hence, the kernel sections are the infinite-dimensional columns of K and, in particular, we use \(K_t\) to indicate the tth column. Now, one has to assess that \(\Vert K_t \Vert _{\mathscr {H}}^2 < \infty \ \forall t\). Note that
Then, we have
and the first condition is so satisfied.
The second condition is the reproducing property, i.e., one has to assess that
This holds true since
showing that the second condition is also satisfied.
Using (7.99), one has
and this confirms the structure of the norm reported in (7.16).
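The two conditions verified above can also be checked numerically on a finite truncation of the stable spline kernel, where the RKHS inner product reduces to the quadratic form \(\langle f, g\rangle _{\mathscr H} = f^\top K^{-1} g\); the kernel parameter, truncation size and indices below are arbitrary.

```python
import numpy as np

alpha, d = 0.9, 60                            # kernel parameter and truncation size (arbitrary)
i = np.arange(1, d + 1)
K = alpha ** np.maximum.outer(i, i)           # truncated stable spline kernel
K_inv = np.linalg.inv(K)

s, t = 10, 37                                 # arbitrary (0-based) indices of two kernel sections
Ks, Kt = K[:, s], K[:, t]
print(Kt @ K_inv @ Kt, K[t, t])               # ||K_t||_H^2 = K(t, t) is finite
print(Ks @ K_inv @ Kt, K[s, t])               # reproducing property: <K_s, K_t>_H = K(s, t)
```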
7.7.2 Proof of Proposition 7.1
We will exploit the results on estimation of Gaussian vectors reported in Sect. 4.2.2.
Let \(\text {Cov} [u,v]\) denote the covariance matrix of two random vectors u and v, i.e.,
First, we consider the distribution of Y. Note that \(L_i[g^0]\) is a linear functional of the stochastic process \(g^0\). Hence, since linear transformations of normal processes preserve Gaussianity, the noise-free output \([L_1[g^0], \ldots , L_{N}[g^0]]^T\) is a multivariate zero-mean Gaussian random vector. Furthermore, since
the covariance matrix of \([L_1[g^0], \ldots , L_{N}[g^0]]^T\), apart from the scale factor \(\lambda \), is indeed defined by the output kernel matrix O reported in (7.14) for the discrete-time case, i.e., when \(\mathscr {X}=\mathbb {N}\), and in (7.22) for the continuous-time case, i.e., when \(\mathscr {X}={\mathbb R}^+\). Now, recall that the e(t), where \(t=1,\dots ,N\), are assumed to be mutually independent and Gaussian distributed with zero mean and variance \(\sigma ^2\). Moreover, they are also assumed independent of \(g^0\). One then obtains that \(g^0\) and Y are jointly Gaussian, with the mean and covariance matrix of Y given by
For what regards the covariance matrix of \(g^0\) and Y, the independence assumptions imply that
Then, using also the correspondence \(\gamma =\sigma ^2/\lambda \), we have
where \(\hat{c}_t\) is the tth entry of the vector \(\hat{c}\) defined in (7.13) for the discrete-time case or in (7.21) for the continuous-time case. This completes the proof.
7.7.3 Proof of Theorem 7.5
We only consider the proof for the discrete-time case (7.56). The continuous-time case (7.57) can be proved in a similar way. To prove (7.56), we first need a lemma.
Lemma 7.1
Consider the linear operator \(L_K\) defined by
where \(K:\mathbb {N}\times \mathbb {N}\rightarrow {\mathbb R}\) is a positive semidefinite kernel. Assume that \(L_K\) satisfies the following property: for any \(l\in \ell _{\infty },\) one has \(L_K[l]\in \ell _1\). Then, \(L_K\) is a continuous (bounded) linear operator, i.e., there exists a scalar \(b>0\), independent of l, such that
Proof
First, we show that for any \(s\in \mathbb {N}\), the kernel section \(K_s(\cdot )\) belongs to \(\ell _1\). To show this, for any \(s\in \mathbb {N}\), we can define a sequence \(l\in \ell _\infty \) in the following way:
Then plugging this l into (7.101) yields \(L_K[l]=\sum _{t=1}^\infty |K(s,t)| \). Since \(L_K[l]\in \ell _1\) for every \(l\in \ell _{\infty }\), then we obtain
Now, for any \(l,a \in \ell _\infty \), it holds that
where both \(\Vert l - a \Vert _{\infty } \) and \(\sum _{t=1}^\infty | K(s,t)|\) are finite for any \(s\in \mathbb {N}\) since \(l,a\in \ell _{\infty }\) and in view of (7.103). Following (7.104), the remaining proof is a simple application of the closed graph theorem, see Theorem 6.26. In fact, let \(l \rightarrow a\) in \(\ell _\infty \) and \(L_K[l] \rightarrow g\) in \(\ell _1\). Then (7.104) shows that \(L_K[l](s) \rightarrow L_K[a](s)\) for every \(s\in \mathbb {N}\), implying that \(g_s = L_K[a](s)\) for every \(s\in \mathbb {N}\). As a result, the graph \((l,L_K[l])\) is closed and thus \(L_K\) is continuous (bounded) by the closed graph theorem. \(\square \)
Now let us consider (7.56) in Theorem 7.5. We first prove the sufficient part, i.e.,
We start by introducing some definitions. For any \(f \in \mathscr {H}\), we let \(l \in \ell _\infty \) be a sequence defined by the signs of f, i.e.,
and let also \(l^n\) be a sequence defined by
Then we have
where the last identity is due to the reproducing property of K. Moreover, by the Cauchy–Schwarz inequality, we have
Now we show that \(\left\| \sum _{t=1}^\infty l_t^n K_t(\cdot ) \right\| _{\mathscr {H}}\) is finite. First, we note that
and then from the linear operator \(L_K\) defined in (7.101) and its boundedness property (7.102) proved in Lemma 7.1, we obtain
where we have used the fact that \(\Vert l^n\Vert _\infty =1\) for any \(n\in \mathbb {N}\). Combining the above equation with (7.105) yields
Since \(f\in \mathscr {H}\), the norm \(\Vert f \Vert _{\mathscr {H}}\) is finite and hence \(\sum _{t=1}^n | f_t |\) is bounded above uniformly in \(n\in \mathbb N\). The partial sums \(\sum _{t=1}^n | f_t |\) thus form an increasing sequence that is bounded above, so the monotone convergence theorem ensures that the limit \(\lim _{n\rightarrow \infty }\sum _{t=1}^n | f_t |\) exists; denoting it by \(\sum _{t=1}^\infty | f_t |\), this shows that \(f \in \ell _1\). Since f was chosen arbitrarily, it follows that \(\mathscr {H} \subset \ell _1\), which completes the proof of the sufficient part.
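Continuing with the illustrative kernel used after Lemma 7.1, the argument above yields in particular \(\sum _{t=1}^\infty |f_t| \le \sqrt{b}\,\Vert f\Vert _{\mathscr {H}}\), with b any constant satisfying (7.102). The following rough numerical check (again our own sketch, on a finite truncation and for f in the span of finitely many kernel sections) confirms the inequality with the crude constant \(b=\sum _{s,t}|K(s,t)|\).

```python
import numpy as np

# Rough check of ||f||_1 <= sqrt(b) * ||f||_H for f = sum_t c_t K_t(.) with finitely
# many nonzero coefficients, using K(s,t) = alpha**max(s,t) truncated to n terms
# (the tails are negligible for this fast-decaying kernel).
alpha, n, n_c = 0.8, 400, 20
s, t = np.meshgrid(np.arange(1, n + 1), np.arange(1, n + 1), indexing="ij")
K = alpha ** np.maximum(s, t)
b = np.abs(K).sum()                      # crude constant for the bound (7.102)

rng = np.random.default_rng(2)
c = np.zeros(n)
c[:n_c] = rng.standard_normal(n_c)       # finitely many nonzero coefficients
f = K @ c                                # f_j = sum_t K(j,t) c_t on the truncation
norm_1 = np.abs(f).sum()                 # (truncated) l1 norm of f
norm_H = np.sqrt(c @ K @ c)              # RKHS norm of a finite kernel expansion
print(norm_1 <= np.sqrt(b) * norm_H)     # expected: True
```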
Now, we prove the necessary part, i.e.,
Again, we start by introducing some definitions. For any \(f \in \mathscr {H}\) and \(l \in \ell _{\infty }\), we define a new sequence lf by letting \([lf]_t=l_tf_t\), \(\forall t\in \mathbb N\), where \([lf]_t\) is the tth entry in the sequence lf. Then we have \( l f \in \ell _1 \), because \(l\in \ell _{\infty }\) and \(f \in \ell _1\) due to \(\mathscr {H} \subset \ell _1\). Moreover, we define \( g^n(\cdot ) = \sum _{t=1}^n l_t K_t(\cdot ) \) with \(n\in \mathbb {N}\). Now we show that the sequence of functions \(g^n(\cdot )\) with \(n\in \mathbb {N}\) is a weak Cauchy sequence in \(\mathscr {H}\). To show this, we take without loss of generality \(m\le n\) and \(m\in \mathbb {N}\), and then we have
Moreover, we have
Since \(l f \in \ell _1\), i.e., \(\sum _{t=1}^\infty |l_tf_t|<\infty \), the Cauchy criterion ensures that
which implies
Combining the above equation with (7.106) shows that the sequence of functions \(g^n(\cdot )=\sum _{t=1}^n l_t K_t(\cdot )\) with \(n\in \mathbb {N}\) is a weak Cauchy sequence. Recall that every Hilbert space, beyond being complete, is also weakly sequentially complete, since every Hilbert space is reflexive; see Definition 2.5.23 along with Corollaries 2.8.10 and 2.8.11 in [62]. Hence, the sequence of functions \(g^n(\cdot )=\sum _{t=1}^n l_t K_t(\cdot )\) with \(n\in \mathbb {N}\) is also a weakly convergent sequence, i.e., there exists an \(h \in \mathscr {H}\) such that
Now, we take \(f(\cdot )=K_s(\cdot )\) in the above equation. Using the reproducing property of K, the left-hand side becomes
while the right-hand side becomes
This implies that
Finally, note that \(h \in \mathscr {H} \subset \ell _1\), therefore
which completes the necessary part as well and, hence, concludes the proof.
7.7.4 Proof of Theorem 7.7
First, it is useful to set up some notation. Let r be an integer or \(r=\infty \). Then, we define the set \({\mathscr {U}}_r\) as follows:
Let p be another integer associated with the odd number \(m=2p+1\) and with \(n=2^m\). We also use \(x_i \in {\mathscr {U}}_m\), with \(i=1,2,\dots ,n\), to indicate the distinct vectors containing exactly m entries equal to \(\pm 1\) (their ordering is irrelevant). Then, for any \(n=2^3, 2^5, 2^7, \dots \), the \(n \times m\) matrix \(V^{(n)}\) is given by
and its rows contain all the \(2^m\) possible vectors with entries \(\pm 1\). We now discuss the inclusions stated in the theorem.
The inclusion \(\mathscr {S}_{1} \subseteq \mathscr {S}_{s}\) derives from Corollary 7.1, where we have seen that absolute summability is a sufficient condition for kernel stability. The proof of the strict inclusion \(\mathscr {S}_{1} \subset \mathscr {S}_{s}\) is not trivial and is reported in [7], where one can find a particular kernel, built from the matrices \(V^{(n)}\) in (7.109), that is stable but not absolutely summable.
As for the inclusion \(\mathscr {S}_{s} \subset \mathscr {S}_{ft}\), let \(M_m\) denote a positive semidefinite matrix of size \(m \times m\). Consider also the linear operator \(M_m:{\mathbb R}^m \rightarrow {\mathbb R}^m\) with domain and co-domain equipped, respectively, with the \(\ell _{\infty }\) and the \(\ell _1\) norms. Its operator norm is then given by
where the last equality follows from Bauer's maximum principle for convex functions. First, we prove that
To this end, since \(V^{(n)T}\) contains all the vectors in \({\mathscr {U}}_m\) as its columns, the problem is equivalent to evaluating
and then finding the column with maximum \(\ell _1\) norm. The \(\ell _1\) norm of each column can be obtained as the scalar product of the column with a suitable \(x \in {\mathscr {U}}_m\) containing the signs of the column entries. Hence, the \(n^2\) entries of
surely contain these n \(\ell _1\) norms. Furthermore, the maximum \(\ell _1\) norm we are looking for is the maximum of all these \(n^2\) entries, since \(x_1^Tc \le x_2^Tc, \ \forall x_1 \in {\mathscr {U}}_m\) if \(x_2=\text {sign}(c)\), where the function sign returns, for each entry of c, the value 1 if that entry is positive and \(-1\) otherwise. Also, since \(V^{(n)}M_mV^{(n) T}\) is positive semidefinite, the maximum is found along its diagonal, i.e.,
We now note that the trace of \(V^{(n)}M_mV^{(n) T}\) satisfies
Finally,
and this proves (7.111).
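The steps above can be verified by brute force on a small example (our own sketch, with a random positive semidefinite matrix M and reading (7.111) as the bound \(\mathrm {trace}(M_m)\le \Vert M_m\Vert _{\infty ,1}\) produced by the argument).

```python
import numpy as np
from itertools import product

# Brute-force check of the steps above for a small random PSD matrix M:
# (i)  trace(V M V^T) = 2**m * trace(M)  (cross terms cancel over all sign vectors),
# (ii) ||M||_{inf,1} = max_{x in {+-1}^m} x^T M x, i.e., the max diagonal entry of V M V^T,
# hence trace(M) <= ||M||_{inf,1}.
m = 5                                                  # odd, as in the construction m = 2p + 1
rng = np.random.default_rng(3)
A = rng.standard_normal((m, m))
M = A @ A.T                                            # positive semidefinite

V = np.array(list(product([1.0, -1.0], repeat=m)))     # all 2**m sign vectors as rows
VMV = V @ M @ V.T

print(np.isclose(np.trace(VMV), 2**m * np.trace(M)))   # (i): True
norm_inf_1 = max(np.abs(M @ x).sum() for x in V)       # sup over the l_inf unit ball (Bauer)
print(np.isclose(norm_inf_1, VMV.diagonal().max()))    # (ii): True
print(np.trace(M) <= norm_inf_1 + 1e-9)                # trace(M) <= ||M||_{inf,1}: True
```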
Now, think of \(M_k\) as the leading \(k \times k\) submatrix of the stable kernel represented by the infinite-dimensional matrix K. We also use \(L_K\) to denote the associated kernel operator mapping \(\ell _{\infty }\) into \(\ell _1\). So, it holds that
where \(\Vert L_K\Vert _{\infty ,1}\) indicates the operator norm of \(L_K\), i.e.,
Using (7.111), we obtain
and, since \(\mathrm {trace}[M_k]\) is a monotone non-decreasing sequence upper-bounded by \(\Vert L_K\Vert _{\infty ,1}<+\infty \), one also has
This shows that the trace of any stable kernel is finite. Such an inclusion is strict, as the following example shows. Let v be a vector such that \(v \in \ell _2\) and \(v \notin \ell _1\), and consider the kernel
One has \(\mathrm {trace}(K) = \Vert v\Vert _2^2<+\infty \). If \(w=\text {sign}(v) \in \ell _{\infty }\), one has \(Kw=v\Vert v\Vert _1\), which implies \(\Vert Kw\Vert _1=\infty \). So, the kernel K has finite trace but is unstable.
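For a concrete instance of this example (our own choice of v; the computations above indicate that the kernel is \(K=vv^T\)), take \(v_t=1/t\), which belongs to \(\ell _2\) but not to \(\ell _1\). On growing truncations, the trace stabilizes around \(\pi ^2/6\approx 1.645\), while \(\Vert Kw\Vert _1=\Vert v\Vert _1^2\) keeps growing without bound.

```python
import numpy as np

# Truncated illustration of the counterexample K = v v^T with v_t = 1/t (v in l2, not in l1):
# the trace converges to ||v||_2^2 = pi^2/6, while ||K w||_1 with w = sign(v) diverges.
for n in (10, 100, 1000, 10000, 100000):
    v = 1.0 / np.arange(1, n + 1)
    w = np.sign(v)                       # all ones here
    trace_K = v @ v                      # trace(v v^T) = ||v||_2^2
    Kw_l1 = np.abs(v * (v @ w)).sum()    # ||K w||_1 = ||v||_1^2, since K w = v (v^T w)
    print(n, round(trace_K, 4), round(Kw_l1, 2))
```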
The inclusion \(\mathscr {S}_{ft} \subset \mathscr {S}_{2}\) relies on the important relation between nuclear and Hilbert–Schmidt (HS) operators, e.g., see [35, 54, 84]. In particular, let K be a kernel, seen as an infinite-dimensional matrix, and let \(L_K\) be the induced kernel operator as a map from \(\ell _2\) into \(\ell _2\) itself. Given any orthonormal basis \(\{v_i\}\) in \(\ell _2\), the nuclear norm of \(L_K\) is
and is independent of the chosen basis. Then, \(L_K\) is said to be nuclear if (7.112) is finite. Its (squared) Hilbert–Schmidt (HS) norm is instead
and is also independent of the chosen basis. Then, \(L_K\) is said to be HS if (7.113) is finite. It is also known that any nuclear operator is HS and can be written as the composition of two HS operators.
For our purposes, we now exploit the fact that any finite-trace kernel induces a nuclear operator, as shown in [8]. So, (7.113) is also finite and, choosing \(\{v_i\}\) to be the canonical basis \(\{e_i\}\) of \(\ell _2\), one obtains
Such an inclusion is also strict, as illustrated by the example
Finally, \(\mathscr {S}_{2}\) is contained in the set of all positive semidefinite infinite-dimensional matrices. Furthermore, the inclusion is strict: this can be seen by just considering the example \(K=vv^T,\) where v is the infinite-dimensional column vector with all components equal to 1.
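Putting the pieces together, the proof establishes the following chain of strict inclusions, restated here for convenience in the notation of the theorem:

\[ \mathscr {S}_{1} \subset \mathscr {S}_{s} \subset \mathscr {S}_{ft} \subset \mathscr {S}_{2} \subset \{\text {positive semidefinite kernels}\}. \]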
7.7.5 Proof of Theorem 7.9
The notation \(L_K\) is still used to denote the operator induced by the kernel K and mapping \(\ell _{\infty }\) into \(\ell _1\). Its operator norm is \(\Vert L_K\Vert _{\infty ,1}\), while \((\zeta _i,\rho _i)\) denote its eigenvalues and corresponding eigenvectors, orthogonal in \(\ell _2\). From Theorem 7.5 and Lemma 7.1, one has
Since the function
is convex, Bauer's maximum principle ensures that
where
Using ordinary matrix notation also for infinite-dimensional matrices, we can write \(K=UDU^T\), where D is diagonal and contains the eigenvalues \(\zeta _i\) of K, while the columns of U contain the corresponding eigenvectors \(\rho _i\). One has
and, hence,
Letting \(s(u)=\text {sign}(y)\), we obtain
Using (7.116) and noticing that \(f(u)=h(u)\), this implies
Now, define
Exploiting the definition of s(u), one has
On the other hand,
which implies
So, one has
and this concludes the proof in view of (7.115).