5.1 Preliminaries

As we have discussed in the preceding chapters, system identification can be framed as an inverse problem which aims at finding a dynamical model \(\mathcal{M}\) from a set of measured input–output “training” data \(\mathcal{D}_{T}:=\{u(t),y(t)\}_{t=1,\ldots ,N}\). The field of inverse problems [5] has motivated the development of, and is pervaded by, regularization techniques; it is therefore evident that regularization could and should play a major role in the system identification arena as well.

Nevertheless, we believe it is fair to say that regularization did not have a pervasive impact on system identification until very recently. To introduce its use in this field, we will refer to the linear models \(\mathscr {M}=\{\mathscr {M}(\theta )|\theta \in D_{\mathscr {M}}\}\) introduced in Chap. 2, Eq. (2.1). Note that this notation includes not only classical parametric structures, such as ARX, ARMAX and Box–Jenkins models, but also so-called nonparametric ones where the “parameter” \(\theta \) may be infinite dimensional, e.g., containing all the impulse response coefficients of the filters \(W_y(q)\) and \(W_u(q)\) which characterize the predictor

$$\begin{aligned} \hat{y}(t|\theta )&= W_y(q)y(t) + W_u(q)u(t). \end{aligned}$$
(5.1)

The transfer functions \(W_y(q)\) and \(W_u(q)\) are related to the input–output model

$$ y(t)=G(q,\theta )u(t)+H(q,\theta )e(t) $$

by the relation

$$\begin{aligned} W_y(q)&:=[1-H^{-1}(q,\theta )] \quad W_u(q):=H^{-1}(q,\theta )G(q,\theta ),\end{aligned}$$
(5.2)

see also (2.4).

For simplicity here, we consider the single-output case \(y(t) \in {\mathbb R}\). In the prediction error framework described in Chap. 2, the model fit is typically measured by the negative log likelihood

$$ V_N(\theta ) = -2 \mathrm{log}\; {\mathrm p}(\mathcal{D}_T|\theta ) = - 2 \sum _{t=1}^N \mathrm{log}({\mathrm p}(y(t) - \hat{y}(t|\theta ))), $$

which in the Gaussian case is, up to constants, proportional to the sum of squared prediction errors

$$ V_N(\theta )\propto \sum _{t=1}^N (y(t) - \hat{y}(t|\theta ))^2. $$

As discussed in Chap. 3, regularization can be added to make the inverse problem of estimating the model \(\mathcal{M}(\theta )\) from data well-posed, and therefore regularized estimators \(\hat{\theta }_R \) of the form

$$\begin{aligned} \hat{\theta }_R := \mathop {\mathrm{arg\;} \mathrm{min}}\limits _{\theta } \, W_N(\theta )= \mathop {\mathrm{arg\;} \mathrm{min}}\limits _{\theta } \, V_N(\theta ) + J_ \gamma (\theta ) \end{aligned}$$
(5.3)

are considered. This framework has been extensively discussed in the previous chapter in the context of linear regression under the squared loss \( V_N(\theta ) =\) \( \Vert Y - \varPhi \theta \Vert _2^2,\) see e.g., Eq. (3.57).

The function \(J_\gamma (\theta )\) is usually referred to as the penalty function, and possibly depends on some (hyper-)parameter \(\gamma \). In the simplest case \(J_\gamma (\theta )\) takes the multiplicative form

$$ J_\gamma (\theta ): = \gamma J(\theta ) $$

and \(\gamma \) acts as a scaling factor which controls the “amount” of regularization. The most famous example is the so-called ridge regression problem, in which a quadratic loss \(V_N(\theta )\) is used and \(J(\theta ): = \Vert \theta \Vert ^2\) so that (see also (3.61a)):

$$ \hat{\theta }^{R} := \mathop {\mathrm{arg}\; \mathrm{min}}\limits _{\theta } \, \Vert Y - \varPhi \theta \Vert _2^2 + \gamma \Vert \theta \Vert ^2 = \left( \varPhi ^T \varPhi + \gamma I\right) ^{-1} \varPhi ^T Y. $$
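As a sanity check, the closed-form expression above can be reproduced numerically. The following sketch (synthetic data, all numerical values hypothetical) verifies that the closed form coincides with the minimizer of the penalized least-squares criterion, obtained here by solving an equivalent augmented least-squares problem:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression problem: Y = Phi @ theta0 + noise.
N, n = 50, 5
Phi = rng.standard_normal((N, n))
theta0 = rng.standard_normal(n)
Y = Phi @ theta0 + 0.1 * rng.standard_normal(N)

gamma = 1.0  # regularization strength

# Closed-form ridge estimate (Phi^T Phi + gamma I)^{-1} Phi^T Y.
theta_ridge = np.linalg.solve(Phi.T @ Phi + gamma * np.eye(n), Phi.T @ Y)

# Same estimate by minimizing ||Y - Phi theta||^2 + gamma ||theta||^2
# directly, via the augmented system [Phi; sqrt(gamma) I].
A = np.vstack([Phi, np.sqrt(gamma) * np.eye(n)])
b = np.concatenate([Y, np.zeros(n)])
theta_aug, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(theta_ridge, theta_aug)
```

Note also that, for any \(\gamma >0\), the ridge estimate has a smaller norm than the least squares estimate, reflecting the shrinkage towards the origin induced by the penalty.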

However, ridge regression has not had a significant impact in the context of System Identification, i.e., when the vector \(\theta \) contains the impulse response coefficients of a (linear) dynamical system. To understand why, it is important to discuss the choice of \(J_\gamma (\theta )\). We will see that it plays a fundamental role and strongly influences the properties of the estimator \(\hat{\theta }_R\). In particular, we will see how \(J_\gamma (\theta )\) should be designed to encode properties of dynamical systems such as BIBO stability, smoothness in time domain and frequency domain, oscillatory behaviour and so on; this is a form of “inductive bias” well known and studied in the machine learning community, see e.g., [61].

As argued in Chap. 4, regularization can be given a Bayesian interpretation. In fact, introducing a probabilistic prior on model parameters \(\theta \) of the form

$$\begin{aligned} \mathrm {p}_\gamma (\theta ) \propto e^{-\frac{J_ \gamma (\theta )}{2}} \end{aligned}$$
(5.4)

and the likelihood function:

$$\begin{aligned} {\mathrm p}(\mathcal{D}_T|\theta ) \propto e^{-\frac{V_N(\theta )}{2}} \end{aligned}$$
(5.5)

the maximum a posteriori (MAP) estimator of \(\theta \) (see (4.2)), becomes

$$\begin{aligned} \hat{\theta }^{\mathrm MAP}&:= \mathrm{arg\;max}_\theta \; {\mathrm p}(\theta |\mathcal{D}_T) \end{aligned}$$
(5.6)
$$\begin{aligned}&= \mathrm{arg\;max}_\theta \; {\mathrm p}(\mathcal{D}_T|\theta ) {\mathrm p}_\gamma (\theta ) \end{aligned}$$
(5.7)
$$\begin{aligned}&= \mathrm{arg\;max}_\theta \; \mathrm{log} \left[ {\mathrm p}(\mathcal{D}_T|\theta ) {\mathrm p}_\gamma (\theta )\right] \end{aligned}$$
(5.8)
$$\begin{aligned}&= \mathrm{arg\;min}_\theta \; -\mathrm{log} \left[ {\mathrm p}(\mathcal{D}_T|\theta ) \right] - \mathrm{log} \left[ {\mathrm p}_\gamma (\theta )\right] \end{aligned}$$
(5.9)
$$\begin{aligned}&= \mathrm{arg\;min}_\theta \; V_N(\theta ) + J_ \gamma (\theta )\end{aligned}$$
(5.10)
$$\begin{aligned}&= \hat{\theta }_R. \end{aligned}$$
(5.11)

In what follows, we will therefore use interchangeably the “regularization” framework, and thus think of \(J_\gamma (\theta )\) as a penalty function, or the “Bayesian” framework, and thus think of \(p_\gamma (\theta )\) as a prior (with some caution in the infinite-dimensional case).

5.2 MSE and Regularization

The final goal of modelling is to perform some task, e.g., prediction or control, on future unseen data. As such, the quality of the estimated model should be measured with that objective in mind. For simplicity, we will consider a prediction task, referring the reader to the literature discussed in Sect. 5.9 for extensions. To this purpose, in addition to the training data \(\mathcal{D}_{T},\) let us introduce testing data:

$$ \mathcal{D}_{test}:=\{u_{test}(t),y_{test}(t)\}_{t=1,\ldots ,N_{test}}.$$

A model \(\hat{\mathcal{M}}:= \mathcal{M}(\hat{\theta })\) estimated using the training data \(\mathcal{D}_{T}\) should then predict the testing data \(\mathcal{D}_{test}\) well. In particular, let \(\hat{y}(t|\hat{\theta })\) be the output prediction at instant t constructed using the estimated model. Then, we can measure the performance of \(\hat{\mathcal{M}}\) using the Mean Squared Error (MSE) on output (Y) prediction, assuming that data are generated by some “true”, yet unknown parameter vector \( \theta _0\). This is defined as

$$\begin{aligned} MSE_{Y}(\hat{\mathscr {M}},\theta _0) = {\mathscr {E}}\left( \frac{1}{N_{test}} \sum _{t=1}^{N_{test}}(y_{test}(t) - \hat{y}_{test}(t|\hat{\theta }))^2 \right) = {\mathscr {E}}\left( y_{test}(t) - \hat{y}_{test}(t|\hat{\theta }) \right) ^2, \end{aligned}$$
(5.12)

where, for simplicity, we have assumed stationary statistics for the pairs \(u_{test}(t),y_{test}(t)\) in the last step. In this section, we will argue that using regularization in estimating \(\hat{\theta }\) can indeed help in obtaining a small \(MSE_{Y}(\hat{\mathscr {M}},\theta _0)\). Let us first assume that data are generated by an unknown “true” linear time-invariant (LTI) causal model:

$$\begin{aligned} y(t) = \sum _{k=1}^{\infty }g_k u(t-k) + e(t), \end{aligned}$$
(5.13)

where the “true” “parameter” \(\theta _0 = [g_1,g_2,g_3,\ldots ,g_n,\ldots ]\) is an infinite sequence in \(\ell ^1\), i.e.,

$$ \sum _{k=1}^\infty |g_k| <\infty . $$

We now consider the model class \(\mathscr {M}(\theta )\) of Finite Impulse Response (FIR) Output Error (OE) models

$$\begin{aligned} y(t) = \sum _{k=1}^{n} \theta _k u(t-k) + e(t), \end{aligned}$$
(5.14)

where the parameter vector \(\theta \in {\mathbb R}^n\) contains the coefficients of an nth-order finite impulse response model. Under the assumption that the input process is unit variance white noise, independent of the measurement noise, and defining

$$ \hat{g}_k: = \left\{ \begin{array}{rl} \hat{\theta }_k &{} k=1,\ldots ,n \\ 0 &{} \text {otherwise}\end{array}\right. $$

the MSE (5.12) has the expression

$$\begin{aligned} \begin{array}{rcl} MSE_{Y}(\hat{\mathscr {M}},\theta _0) &{}= &{} {\mathscr {E}}(y_{test}(t) - \hat{y}_{test}(t|\hat{\theta }))^2 \\ &{} = &{} {\mathscr {E}}\left( \sum _{k=1}^{\infty }(g_k - \hat{g}_k) u_{test}(t-k) + e(t) \right) ^2 \\ &{} = &{}\underbrace{\sum _{k=1}^{\infty } {\mathscr {E}}(g_k - \hat{g}_k)^2} _{{\mathscr {E}}\Vert g - \hat{g}\Vert ^2} + \sigma ^2 \\ &{} = &{}\underbrace{\sum _{k=1}^{\infty } {\mathscr {E}}(\hat{g}_k - {\mathscr {E}}[\hat{g}_k])^2}_{Variance} +\underbrace{\sum _{k=1}^{\infty } (g_k - {\mathscr {E}}[\hat{g}_k])^2}_{Bias^2}+ \sigma ^2 \\ &{} = &{}\underbrace{ \sum _{k=1}^{n } {\mathscr {E}}( \hat{\theta }_k - {\mathscr {E}}[\hat{\theta }_k])^2}_{Variance} + \underbrace{\sum _{k=1}^{n } (g_k - {\mathscr {E}}[\hat{\theta }_k])^2 + \sum _{k=n+1}^{\infty } g^2_k }_{Bias^2}+ \sigma ^2. \end{array} \end{aligned}$$
(5.15)

This is nothing but the usual bias-variance trade-off discussed in Chap. 1: the model (\(\theta \) in this case) has to be rich enough (i.e., n large) to capture the “true” data generating mechanism (low bias) but also simple enough (i.e., n small) to be estimated using the available data with low variability (low variance). The squared loss

$$ {\mathscr {E}}\Vert g - \hat{g}\Vert ^2 = \sum _{k=1}^{\infty } {\mathscr {E}}(g_k - \hat{g}_k)^2 $$

present on the right-hand side of (5.15), after the third equality, is called a compound loss on the (possibly infinite) vector \(\theta \) [60, 63] and defines the MSE.
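The decomposition (5.15) can be verified numerically. The sketch below (a hypothetical first-order true system with \(g_k = 0.8^k\), white unit-variance input and a truncated FIR model fitted by least squares, all values illustrative) estimates the compound loss by Monte Carlo and checks that it splits exactly into the sample variance and squared-bias terms, the latter including the truncation bias \(\sum _{k>n} g_k^2\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical true system g_k = 0.8^k (BIBO stable), negligible beyond K.
K, N, n, sigma = 100, 80, 10, 0.5
g = 0.8 ** np.arange(1, K + 1)

def one_run():
    u = rng.standard_normal(N + K)                       # white input with known past
    Phi_full = np.array([[u[K + t - k] for k in range(1, K + 1)]
                         for t in range(N)])
    y = Phi_full @ g + sigma * rng.standard_normal(N)
    theta_hat, *_ = np.linalg.lstsq(Phi_full[:, :n], y, rcond=None)
    return np.concatenate([theta_hat, np.zeros(K - n)])  # g_hat padded with zeros

G = np.array([one_run() for _ in range(300)])
mean_g = G.mean(axis=0)
variance = ((G - mean_g) ** 2).mean(axis=0).sum()
bias2 = ((g - mean_g) ** 2).sum()   # includes truncation bias sum_{k>n} g_k^2
total = ((G - g) ** 2).sum(axis=1).mean()

# Decomposition (5.15): compound MSE = variance + squared bias
# (exact for sample moments computed with the same sample mean).
assert np.isclose(total, variance + bias2)
```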

Considering compound losses of this type allows us to connect with the discussion made in Chap. 1 on Stein’s effect. To simplify exposition, let us assume that the identification input is a discrete impulse \(u(t) = \delta (t)\) so that we can think of y(t) as direct noisy measurements of all the (nonzero) impulse response coefficients

$$\begin{aligned} y(t) = g_t + e(t) \quad \quad t=1,\ldots ,n. \end{aligned}$$
(5.16)

Defining \(Y:=[y(1),\ldots ,y(n)]^T\) and \(E:=[e(1),\ldots ,e(n)]^T\) the measurement model (5.16) can be written in vector form

$$\begin{aligned} Y = \theta + E, \quad \quad E \sim \mathscr {N}(0,\sigma ^2 I_n). \end{aligned}$$
(5.17)

As we have seen in Chap. 1, the least squares estimator \(\hat{\theta }_{LS}\) for model (5.17) is dominated (for \(n>2\)) by the James–Stein estimator discussed in Sect. 1.1.1. As argued in Chap. 1, the James–Stein estimator (1.3) is a special case of a regularized estimator (5.3) where \(J_\gamma (\theta ) = \gamma \Vert \theta \Vert ^2\) and \(\gamma \) takes the data-dependent form (1.4)

$$ \gamma = \frac{(n-2)\sigma ^2}{\Vert y \Vert ^2-(n-2)\sigma ^2}. $$

Following this route, the James–Stein estimator favours “small” parameter values (the regularization term \(J_\gamma (\theta )= \gamma \Vert \theta \Vert ^2\) penalises large \(\Vert \theta \Vert \)); it is therefore to be expected that the gain w.r.t. the least squares estimator is larger when the “true” parameter vector is close to the origin; this has been illustrated in Fig. 1.1.
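A small Monte Carlo experiment makes this dominance tangible. The sketch below (hypothetical numerical values; the model is the direct-measurement setting (5.17)) applies the James–Stein shrinkage, equivalently ridge regression with the data-dependent \(\gamma \) above, and compares its compound MSE with that of least squares for a “small” true parameter:

```python
import numpy as np

rng = np.random.default_rng(2)

# Direct measurements Y = theta0 + E, E ~ N(0, sigma^2 I_n), as in (5.17).
n, sigma, M = 10, 1.0, 5000
theta0 = 0.3 * np.ones(n)        # hypothetical "small" true parameter

mse_ls = mse_js = 0.0
for _ in range(M):
    y = theta0 + sigma * rng.standard_normal(n)
    # James-Stein shrinkage: ridge with gamma = (n-2)sigma^2/(||y||^2-(n-2)sigma^2),
    # i.e. y multiplied by the factor 1 - (n-2)sigma^2/||y||^2.
    shrink = 1.0 - (n - 2) * sigma**2 / (y @ y)
    mse_ls += ((y - theta0) ** 2).sum() / M
    mse_js += ((shrink * y - theta0) ** 2).sum() / M

# For theta0 near the origin, James-Stein clearly beats least squares
# in compound MSE (least squares gives n * sigma^2 = 10 here).
assert mse_js < mse_ls
```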

As pointed out in Sect. 1.1.2, there is actually nothing special in having chosen the origin as a reference. In fact, the penalty term can be replaced with \(J_\gamma (\theta ) = \gamma \Vert \theta -a\Vert ^2\) for any \(a\in {\mathbb R}^n\), yielding estimators which always dominate least squares provided \(\gamma \) is chosen as

$$\frac{(n-2)\sigma ^2}{\Vert y-a \Vert ^2-(n-2)\sigma ^2}.$$

This teaches us that, under certain circumstances, it is possible to steer estimators, through a suitable penalty functional, towards certain regions of the parameter space (or, more generally, of the model space); most importantly, this can be done without any loss (actually with a gain) for any possible occurrence of the “true” yet unknown system. However, the reader should bear in mind that this only holds for the compound loss (5.15) and should not be seen as a panacea. For instance, James–Stein estimators may provide only marginal improvements over least squares in situations where the signal-to-noise ratio is highly non-uniform over the parameter space, a situation often encountered in system identification when input signals are not white and poor excitation may be present, e.g., in certain frequency bands. This has been illustrated in Example 1.2.

Therefore, as a take-home message from Chap. 1 and the discussion above, we should keep in mind that regularization has much to offer, yet its use in system identification is not straightforward. The main reasons are as follows:

  1.

    Often one cannot restrict attention to Output Error models (i.e., noise models should also be included), and the input process is neither impulsive nor white. Thus, the MSE (5.12) takes a different form than (5.15). This calls for extensions of James–Stein estimators to weighted losses and non-orthogonal designs; to some extent this has been pursued in the statistics literature, and the reader is referred to [4, 9, 43, 64] and references therein. See also [13, Sect. 6].

  2.

    While James–Stein estimators have been built with the purpose of showing that the least squares estimator is not admissible (see Sect. 1.1.1 for a formal definition), it may not necessarily be our primary goal to dominate least squares (or another estimator) uniformly over the parameter space. In order to cure the ill-conditioning phenomenon widely discussed in Chap. 3, it could be advantageous to tailor regularization to certain “dynamical-system”-oriented properties, thus gaining a lot in certain regions of the model space, while possibly incurring minor losses in other regions which are very unlikely.

The latter is one of the main goals of this book, i.e., to provide the reader with a thorough understanding of the role of regularization in estimating dynamical systems so as to optimally design regularization methods depending on the intended use of the model. In the remaining part of the chapter, we will first introduce the concept of “optimal” prior and derive its expression. We will then connect the structure of the optimal prior to the notion of BIBO stability for linear dynamical systems and also its link with smoothness in time and frequency domains. Connection with the Bayesian setting will also be provided. The chapter will be concluded with an historical overview of how the use of regularization in the context of estimation of dynamical systems has evolved, illustrating also the role played by time- and frequency-domain smoothness.

5.3 Optimal Regularization for FIR Models

Let us consider the problem of estimating the impulse response \(\{\theta _k\}_{k=1,\ldots ,n}\) of the FIR model (5.14) using data \(\{y(t)\}_{t=1,\ldots ,N}\). The FIR model can be compactly written as

$$\begin{aligned} Y = \varPhi \theta + E, \end{aligned}$$
(5.18)

where \(Y:=[y(1),\ldots ,y(N)]^T\), \(E:=[e(1),\ldots ,e(N)]^T\) and \(\varPhi \) contains the input samples, which are assumed to be available at all times needed, so as to avoid issues related to initial conditions. We will still use \(\theta _0\) to denote the “true” value that has generated the data.

We now consider the class of regularized estimators

$$ \hat{\theta }^{R}: = \mathop {\mathrm{arg\;\;min}}\limits _{\theta \in {\mathbb R}^n} \;\; \Vert Y - \varPhi \theta \Vert ^2 + \sigma ^2 \theta ^T P^{-1} \theta $$

parametrized by the regularization matrix \(P = P^T > 0\). As shown in Chap. 3, see Eq. (3.60), the generalized ridge regression estimator \(\hat{\theta }^{R}\) can be extended also to the case where P is singular, so that we can assume \(P = P^T \succeq 0\). As a matter of fact, in the Bayesian framework introduced in Chap. 4, \(\hat{\theta }^{R}\) can also be interpreted as the MAP estimator

$$ \hat{\theta }^{\mathrm {MAP}}: = \arg \max \;\; p(\theta |Y) $$

obtained under the assumption that the noise E is zero-mean Gaussian with variance \(\sigma ^2 I\), and that \(\theta \) is independent of E and zero-mean Gaussian with (possibly singular) variance \(P = P^T \succeq 0\) (the singular case was described in (4.19)).

In this section, to emphasize the dependence of the estimator \( \hat{\theta }^{R}\) on \(P=P^T \succeq 0\), we will use the notation

$$ \hat{\theta }^P : = \hat{\theta }^{\mathrm R} = \hat{\theta }^{\mathrm MAP}. $$

Our objective now is to study the performance of the estimator \(\hat{\theta }^P\), in terms of MSE, as a function of \(P = P^T \succeq 0\), under the assumption that Y has been generated by a “true model” of the form (5.18) with a deterministic and unknown parameter \(\theta _0\). Thus, the only sources of “randomness” are the noise vector E and the system input, which is seen as a stochastic process (independent of E) in this section.

We consider a test experiment with a new input \(u_{test}(t)\), independent of the input u(t) used for identification; for convenience of notation, we define the lagged test input vector

$$ \phi _{test}(t): = \left[ u_{test}(t),\ldots ,u_{test}(t -n+1)\right] ^T$$

so that under (5.14) the test output is given by

$$ y_{test}(t) = \phi ^T_{test}(t)\theta _0 + e_{test}(t). $$

Let us also define the covariance matrix

$$ W_u = Var\left\{ \phi _{test}(t)\right\} = {\mathscr {E}}\phi _{test}(t) \phi ^T_{test}(t) $$

(note that stationarity is assumed here; indeed, \(W_u\) does not depend on the time t) and the MSE matrix

$$ M_{\theta _0}(P):={\mathscr {E}}(\theta _0- \hat{\theta }^{P})(\theta _0-\hat{\theta }^{P})^T. $$

If we now consider the output mean squared error \(MSE_{Y}(\hat{\mathscr {M}},\theta _0) \) in (5.12) computed for the model \(\hat{\mathcal{M}},\) we obtain

$$\begin{aligned} \begin{array}{rcl} MSE_{Y}(\hat{\mathscr {M}},\theta _0) &{}=&{} {\mathscr {E}}\left( y_{test}(t) - \hat{y}_{test}(t|\hat{\theta }^P) \right) ^2 \\ &{} = &{} {\mathscr {E}}\left[ \phi _{test}^T(t)\theta _0 + e_{test}(t) - \phi _{test}^T(t) \hat{\theta }^{P}\right] ^2\\ &{} = &{} {\mathscr {E}}\left[ (\theta _0- \hat{\theta }^{P})^T\phi _{test}(t) \phi _{test}^T(t) (\theta _0- \hat{\theta }^{P}) \right] + \sigma ^2 \\ &{} = &{} Tr\{ {\mathscr {E}}(\theta _0- \hat{\theta }^{P}) (\theta _0- \hat{\theta }^{P})^T {\mathscr {E}}\phi _{test}(t) \phi _{test}^T(t) \} + \sigma ^2 \\ &{} = &{} Tr\{ M_{\theta _0}(P) W_u\} + \sigma ^2, \end{array} \end{aligned}$$
(5.19)

where in the second-to-last equality we have used the fact that the test input and noise are independent of the training input and noise in the identification data used for estimating \(\hat{\theta }^P\).

A direct consequence of this fact is that, given two prior covariance matrices P and \(P^*\), if \(M_{\theta _0}(P) \succeq M_{\theta _0}(P^*)\), then

$$ MSE_Y(\hat{\theta }^{P},\theta _0) \ge MSE_Y(\hat{\theta }^{P^*},\theta _0) \quad \quad \forall W_u, $$

i.e., the estimator \(\hat{\theta }^{P^*}\) outperforms \(\hat{\theta }^{P}\) in terms of output prediction for any possible choice of the test input covariance \(W_u\). Thus, if the modelling purpose is output prediction, it is of interest to minimize, w.r.t. all possible \(P=P^T \succeq 0\), the matrix \(M_{\theta _0}(P)\), i.e., to find

$$\begin{aligned} {P^*}: = \mathop {\arg \min }_{P=P^T \succeq 0} \;\; M_{\theta _0}(P), \end{aligned}$$
(5.20)

so that \(\hat{\theta }^{P^*}\) outperforms any other \(\hat{\theta }^{P}\) in terms of output error (5.19) for any choice of the (test) input covariance \(W_u\). Under the assumption that the true model generating the data is an FIR model of length n with impulse response

$$ g_k = \left\{ \begin{array}{cc} \theta _{0,k} &{} k\le n\\ 0 &{} k>n, \end{array} \right. $$

the solution \(P^*\) of the minimization problem in (5.20) has been derived in Proposition 3.1, and takes the form

$$\begin{aligned} P^* = \theta _0 \theta _0^T, \end{aligned}$$
(5.21)

where \(\theta _0\) is the “true” impulse response of the data-generating mechanism (5.14). An alternative proof of the optimal solution (5.21) to problem (5.20) can be found in Sect. 5.10.1. Since \(P^*\) depends on the unknown true system, this result is not of practical interest; however, if we think of the FIR model (5.14) as the approximation of a BIBO stable infinite impulse response model

$$\begin{aligned} y(t) =\sum _{k=1}^{\infty } \theta _{0,k} u(t-k) + e(t), \end{aligned}$$
(5.22)

the impulse response \(\theta _0\) should have finite \(\ell _1\) norm \(\Vert \theta _0\Vert _1\), i.e.,

$$\begin{aligned} \Vert \theta _0\Vert _1: = \sum _{k=1}^{\infty } |\theta _{0,k}| < \infty , \end{aligned}$$
(5.23)

and therefore \(\theta _{0,k}\) should decay as a function of the index k. As a result, the entries \([P^*]_{ij} = \theta _{0,i}\theta _{0,j}\) of the optimal kernel decay as functions of the row and column indices i and j. In Bayesian terms, the elements \([P]_{ij}\) of any “good” candidate prior variance are thus expected to behave in the same way. As we will see later in this chapter, recent forms of regularization for system identification include a decay rate condition on the elements \([P]_{ij}\), so as to guarantee that the estimated system is BIBO stable. Therefore, we will often refer to conditions on the decay rate of P as “stability conditions”. While condition (5.23) is obviously satisfied when \(\theta \) is a finite-dimensional vector, this loose connection between the decay rate of the kernel and stability needs to be tightened. We will see in the next section that this can be properly formulated in a Bayesian framework.
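The optimality of (5.21) can be illustrated by a Monte Carlo sketch: for a hypothetical decaying \(\theta _0\), the (oracle) choice \(P^* = \theta _0\theta _0^T\) yields a markedly smaller compound MSE than ordinary ridge regression \(P = I\). All numerical values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

n, N, sigma, M = 10, 20, 1.0, 2000
theta0 = 0.9 ** np.arange(1, n + 1)      # hypothetical decaying true response
Phi = rng.standard_normal((N, n))        # fixed regressors across runs

def reg_est(P, Y):
    # Singular-safe regularized estimate P Phi^T (Phi P Phi^T + sigma^2 I)^{-1} Y
    return P @ Phi.T @ np.linalg.solve(Phi @ P @ Phi.T + sigma**2 * np.eye(N), Y)

P_star = np.outer(theta0, theta0)        # optimal (oracle) choice (5.21), rank one
P_ridge = np.eye(n)                      # ordinary ridge regression

mse_star = mse_ridge = 0.0
for _ in range(M):
    Y = Phi @ theta0 + sigma * rng.standard_normal(N)
    mse_star += ((reg_est(P_star, Y) - theta0) ** 2).sum() / M
    mse_ridge += ((reg_est(P_ridge, Y) - theta0) ** 2).sum() / M

# The oracle kernel shrinks the estimate onto the direction of theta0,
# giving a much smaller compound MSE than isotropic shrinkage.
assert mse_star < mse_ridge
```

Of course, \(P^*\) requires knowledge of \(\theta _0\); the rest of the chapter is precisely about designing P without such oracle information.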

5.4 Bayesian Formulation and BIBO Stability

In the previous section, we have considered only FIR models which are reasonable approximations of any BIBO LTI system in most practical scenarios. However, it is of interest to formulate the estimation of LTI BIBO stable systems in full generality, without assuming the impulse response to be of finite support. This entails working with infinite dimensional impulse responses \(\{\theta _k\}_{k \in {\mathbb N}}\). In this chapter, we first consider the Bayesian framework, while regularization in infinite-dimensional Hilbert spaces will be addressed in Chap. 6. To start with, we model the unknown impulse response \(\{\theta _k\}_{k \in {\mathbb N}}\) as a stochastic process indexed over time k; this is the straightforward extension to the infinite-dimensional case of (5.18) where \(\theta \) was a finite-dimensional random vector. In this context, it is of interest to introduce the concept of “stable” priors:

Definition 5.1

(Stable priors) A prior on \(\{\theta _k\}_{k \in {\mathbb N}}\) is said to be stable if its realizations are almost surely sequences in \(\ell _1\), i.e.,

$$\displaystyle {\sum _{k=1}^\infty } |\theta _k| < \infty \quad \quad a.s. $$

In most of this book, mostly for computational reasons, we will also assume that \(\{\theta _k\}_{k \in {\mathbb N}}\) is Gaussian (i.e., that any finite collection of random variables \(\{\theta _k\}_{k \in I}\), \(I=\{i_1,\ldots ,i_\ell \}\), \(i_k \in {\mathbb N}\), \(\ell \in {\mathbb N}\), is jointly Gaussian). This is formalized in the following assumption.

Assumption 5.1

Under the Bayesian framework, we assume \(\{\theta _k\}_{k \in {\mathbb N}}\) to be a Gaussian stochastic process with mean \(\{m_k\}_{k \in {\mathbb N}}\) and covariance function \(K(t,s)\), \(t,s \in {\mathbb N}\).

   \(\square \)

It is an interesting fact that, under additional assumptions on the mean and covariance functions, the prior is stable according to Definition 5.1, as formalized in the following lemma whose proof is in Sect. 5.10.2.

Lemma 5.1

Under Assumption 5.1 and if the following additional conditions hold

$$\begin{aligned} \begin{array}{c} \displaystyle {\sum _{k=1}^\infty } |m_k| =M_{\ell _1}< \infty \quad \quad \quad \quad \displaystyle {\sum _{k=1}^\infty } K(k,k)^{1/2} =K_{\ell _1}<\infty , \end{array} \end{aligned}$$
(5.24)

then the prior is stable as per Definition 5.1, i.e.,

$$\displaystyle {\sum _{k=1}^\infty } |\theta _k| < \infty \quad \quad a.s. $$

In most of this book, we will also make the assumption that the a priori mean \(m_t\) is identically zero, so that only the condition on the covariance \(K(t,s)\) needs to be checked to ensure stability. We will now discuss different forms of prior covariances K encountered in the literature.

5.5 Smoothness and Contractivity: Time- and Frequency-Domain Interpretations

As seen in Sect. 5.3, the optimal regularizer should mimic the “true” impulse response, which is clearly infeasible since the impulse response is unknown. However, as already discussed in Sect. 5.4, we can use the prior to encode qualitative behaviour of impulse responses of BIBO stable linear systems. In particular, we have seen in Lemma 5.1 that a certain decay condition on the prior mean and covariance guarantees that (almost surely) only BIBO stable linear systems are described. The simplest example of such a prior model is the following.

Example 5.2

(Diagonal (DI) prior) Assume the prior mean to be zero \(m_t = 0\), \(\forall t\in {\mathbb N}\) and the covariance function to be diagonal with exponentially decaying entries

$$ K(t,s) = \lambda \alpha ^t \delta (t-s) \quad \quad t,s \in {\mathbb N}\quad \quad \lambda >0, \quad 0\le \alpha <1. $$

The parameters \(\lambda \) (scale factor) and \(\alpha \) (decay rate) are treated as hyperparameters to be estimated from data, using e.g., marginal likelihood maximization, as described in Sect. 4.4. It is worth observing that the assumptions of Lemma 5.1 are satisfied, indeed

$$ \sum _{t\in {\mathbb N}} |m_t| = 0 \quad \quad \sum _{t\in {\mathbb N}} K(t,t)^{1/2} = \sum _{t\in {\mathbb N}}\sqrt{\lambda } \alpha ^{t/2} =\sqrt{\lambda } \frac{\sqrt{\alpha }}{1 -\sqrt{\alpha }} < \infty $$

and hence this is a stable prior.

   \(\square \)
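The stability of the DI prior can also be observed empirically: realizations drawn from it have \(\ell _1\) norms well below the deterministic bound appearing in the computation above. A minimal sampling sketch (hypothetical hyperparameter values, response truncated at a finite horizon):

```python
import numpy as np

rng = np.random.default_rng(5)

lam, alpha, K = 1.0, 0.8, 200   # hypothetical hyperparameters, truncation horizon

# Realizations from the DI prior: independent theta_k ~ N(0, lam * alpha^k).
std = np.sqrt(lam * alpha ** np.arange(1, K + 1))
samples = std * rng.standard_normal((30, K))

# Lemma 5.1 bound: sum_k K(k,k)^{1/2} = sqrt(lam) sqrt(alpha)/(1 - sqrt(alpha)).
bound = np.sqrt(lam) * np.sqrt(alpha) / (1 - np.sqrt(alpha))

# The average l1 norm of the realizations stays below the bound
# (indeed E|theta_k| = sqrt(2/pi) K(k,k)^{1/2} < K(k,k)^{1/2}).
assert np.abs(samples).sum(axis=1).mean() < bound
```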

It is interesting to observe that a decay rate condition on the impulse response coefficients is closely related to a smoothness condition in the frequency domain. To see this, let us introduce the frequency response function

$$ G(e^{j\omega }):= \sum _{k=1}^{\infty } \theta _k e^{-j\omega k}. $$

The \(L_2\)-norm of the first derivative \(\frac{dG(e^{j\omega })}{d\omega }\) can be considered

$$ \left\| \frac{dG(e^{j\omega })}{d\omega }\right\| ^2 :=\frac{1}{2\pi } \int _{0}^{2\pi }\left| \frac{dG(e^{j\omega })}{d\omega }\right| ^2 \, d\omega $$

which using Parseval’s theorem can be expressed in time domain

$$\begin{aligned} \left\| \frac{dG(e^{j\omega })}{d\omega }\right\| ^2 =\sum _{k=1}^\infty k^2 |\theta _k|^2. \end{aligned}$$
(5.25)

Computing higher-order derivatives, and using again Parseval’s theorem, the \(L_2\)-norm of the mth-order derivative is given by

$$\begin{aligned} \left\| \frac{d^{(m)}G(e^{j\omega })}{d\omega ^{(m)}}\right\| ^2 =\sum _{k=1}^\infty k^{2m} |\theta _k|^2. \end{aligned}$$
(5.26)

Hence, the condition that the \(\{\theta _k\}\) decay rapidly (and possibly exponentially as postulated by the Diagonal kernel) with k, implies a bound on the \(L_2\) norm of the mth-order derivatives, i.e., smoothness in the frequency domain of the model.
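Identity (5.25) can be checked numerically. In the sketch below (a hypothetical truncated impulse response \(\theta _k = 0.7^k\)), the integral is replaced by a uniform frequency-grid average, which is exact for a finite-length response once the grid is fine enough:

```python
import numpy as np

K = 30
theta = 0.7 ** np.arange(1, K + 1)      # hypothetical stable impulse response

# Frequency grid and derivative of G(e^{jw}) = sum_k theta_k e^{-jwk}:
#   dG/dw = sum_k theta_k * (-jk) e^{-jwk}.
M = 4 * K
w = 2 * np.pi * np.arange(M) / M
k = np.arange(1, K + 1)
dG = (theta * (-1j * k)) @ np.exp(-1j * np.outer(k, w))

# Parseval (5.25): (1/2pi) int |dG/dw|^2 dw = sum_k k^2 theta_k^2.
# The grid average is exact here since M exceeds twice the response length.
lhs = (np.abs(dG) ** 2).mean()
rhs = (k**2 * theta**2).sum()
assert np.isclose(lhs, rhs)
```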

Fig. 5.1

Sample realizations from the diagonal kernel prior for \(\alpha =0.4\) (top) and \(\alpha =0.8\) (bottom). Impulse response is on the left, frequency response (magnitude only) on the right

As illustrated in Fig. 5.1, smoothness in the frequency domain decreases as \(\alpha \) increases. However, under this prior the impulse response coefficients are modelled as independent (yet not identically distributed) random variables. Thus no smoothness in the time domain is enforced, as is instead typically done with priors based on random walks, which are the discrete-time counterpart of the spline models discussed in Sect. 4.9. A prior model that, in addition to stability, also includes a smoothness condition in the time domain is the so-called TC kernel:

Example 5.3

(Tuned-Correlated (TC) prior) Assume the prior mean is zero \(m_t = 0\), \(\forall t\in {\mathbb N}\) and the covariance function takes the form

$$ K(t,s) =\lambda \alpha ^{\max (t,s)} \quad \quad t,s \in {\mathbb N}\quad \quad \lambda >0, \quad 0\le \alpha <1. $$

As in the previous example, the parameters \(\lambda \) (scale factor) and \(\alpha \) (decay rate) are treated as hyperparameters to be estimated from data, using e.g., marginal likelihood maximization. It is worth observing that also in this case the assumptions of Lemma 5.1 are satisfied, indeed

$$ \sum _{t\in {\mathbb N}} |m_t| = 0 \quad \quad \sum _{t\in {\mathbb N}} K(t,t)^{1/2} = \sum _{t\in {\mathbb N}}\sqrt{\lambda } \alpha ^{t/2} = \sqrt{\lambda } \frac{ \sqrt{\alpha }}{1 - \sqrt{\alpha }} < \infty $$

and hence this is a stable prior. In addition, the TC prior now introduces correlation between impulse response coefficients, e.g., one has

$$ {\mathscr {E}}\theta _{t} \theta _{s} = \lambda \alpha ^t \quad \forall t \ge s. $$

Hence, the correlation is nonzero and decays exponentially to zero.   \(\square \)
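The second-moment structure of the TC prior can be verified by sampling. The sketch below (hypothetical hyperparameter values) draws realizations via a Cholesky factor of the kernel matrix and checks that the empirical moments reproduce \(\lambda \alpha ^{\max (t,s)}\):

```python
import numpy as np

rng = np.random.default_rng(7)

lam, alpha, n = 1.0, 0.8, 50
t = np.arange(1, n + 1)
K = lam * alpha ** np.maximum.outer(t, t)   # TC kernel K(t,s) = lam * alpha^max(t,s)

# Sample impulse-response realizations theta ~ N(0, K) via Cholesky
# (a tiny jitter guards against numerical loss of positive definiteness).
L = np.linalg.cholesky(K + 1e-12 * np.eye(n))
thetas = rng.standard_normal((20000, n)) @ L.T

# Empirical second moments reproduce E[theta_t theta_s] = lam * alpha^max(t,s);
# adjacent coefficients are positively correlated (time-domain smoothness).
emp = thetas.T @ thetas / len(thetas)
assert np.allclose(emp, K, atol=0.1)
assert emp[0, 1] > 0
```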

Fig. 5.2

Sample realizations from the Tuned-Correlated (TC) prior for \(\alpha =0.4\) (top) and \(\alpha =0.8\) (bottom). Impulse response is on the left, frequency response (magnitude only) on the right

Fig. 5.3

30 sample realizations from the diagonal (top) and Tuned-Correlated (bottom) prior for \(\alpha =0.8\)

Figure 5.2 shows two typical realizations from the TC prior, both in time domain and frequency domain, for \(\alpha = 0.4\) (top) and \(\alpha = 0.8\) (bottom), while Fig. 5.3 shows 30 sample realizations from the DI (top) and TC (bottom) priors, respectively.

Example 5.4

(Importance of stable priors) In order to illustrate the advantage of using stable priors, we now consider a simple example of identification of an output error model. In particular, we consider a system of the form

$$ y(t) = \sum _{k=1}^{\infty } g_k u(t-k) + e(t), $$

where the measured input u(t) and the noise e(t) are realizations from white Gaussian noise with zero mean and unit variance. The impulse response is

$$ g_k = \left\{ \begin{array}{cc} \left( \frac{k}{2}\right) ^2 e^{-\frac{k}{4} } &{} k\ge 1 \\ 0 &{} k<1\end{array}\right. \!. $$

For the purpose of identification, we assume the input to be available at all time instants needed. For illustration purposes, the impulse response has been truncated at \(k=50\), since it is practically zero for \(k>50\). We also assume that output measurements y(t) are available for \(t=1,\ldots ,35\). The hyperparameters are all estimated using marginal likelihood maximization, see Sect. 4.4. The results are shown in Fig. 5.4. The reconstruction error is measured using the percentage root mean square (RMS) error:

$$\begin{aligned} \sqrt{\frac{\sum _{k=1}^\infty (g_k - \hat{g}_k)^2}{\sum _{k=1}^\infty g^2_k}} \times 100 \%. \end{aligned}$$
(5.27)

As illustrated in Fig. 5.4, it is apparent that the results obtained using the stable priors, see panels (b) and (c), outperform those returned by the spline (random walk) prior, see panel (a), which does not include the stability constraint. The best relative error is obtained by the TC prior (\(\simeq \)10%) and grows to as much as \(\simeq \)33% for the spline prior. It can also be observed that, while for the stable priors (b) and (c) the confidence intervals shrink as the time index k grows, the same does not hold for the spline prior. The same behaviour had been observed in Sect. 4.9, see Fig. 4.1.   \(\square \)

Fig. 5.4

Panels a–c: impulse response reconstruction (blue) and true (red) with \(95\%\) Bayesian confidence intervals (dashed). Panel d shows the relative RMS error (5.27) on impulse response reconstruction as a function of the scale factor \(\lambda \). For the DI and TC priors, the optimal decay rate \(\alpha \) is estimated for each scale factor using the marginal likelihood. The star denotes the performance obtained using the scale factor selected by marginal likelihood optimization. It is remarkable that the relative error achieved by maximizing the marginal likelihood is close to the minimum achievable by an oracle who would have access to the true impulse response and could thus minimize the relative RMS error
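The qualitative message of this example can be reproduced in a few lines. The sketch below uses the same true impulse response but, for simplicity, fixes the hyperparameters by hand instead of maximizing the marginal likelihood, and compares the TC (stable) prior with a random-walk (spline-like) prior over repeated noise realizations; all numerical choices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)

K, N, sigma = 50, 35, 1.0
k = np.arange(1, K + 1)
g = (k / 2.0) ** 2 * np.exp(-k / 4.0)            # true impulse response

# Fixed (hand-picked) hyperparameters; the book instead tunes them
# by marginal likelihood maximization.
P_tc = 10.0 * 0.85 ** np.maximum.outer(k, k)     # stable TC kernel
P_rw = 10.0 * np.minimum.outer(k, k)             # random-walk (spline-like) kernel

def estimate(P, Phi, y):
    # Regularized/Bayesian estimate P Phi^T (Phi P Phi^T + sigma^2 I)^{-1} y
    return P @ Phi.T @ np.linalg.solve(Phi @ P @ Phi.T + sigma**2 * np.eye(N), y)

def rms_pct(g_hat):
    # Percentage RMS reconstruction error as in (5.27)
    return 100.0 * np.sqrt(((g - g_hat) ** 2).sum() / (g**2).sum())

err_tc, err_rw = [], []
for _ in range(20):
    u = rng.standard_normal(N + K)                # white input with known past
    Phi = np.array([[u[K + t - j] for j in k] for t in range(N)])
    y = Phi @ g + sigma * rng.standard_normal(N)
    err_tc.append(rms_pct(estimate(P_tc, Phi, y)))
    err_rw.append(rms_pct(estimate(P_rw, Phi, y)))

# The stability-encoding TC prior reconstructs the decaying response more
# accurately than the random-walk prior, mirroring the message of Fig. 5.4.
assert np.mean(err_tc) < np.mean(err_rw)
```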

In the next section, a class of stable priors, which includes TC as a special case, will be derived following a first-principle maximum entropy framework.

5.5.1 Maximum Entropy Priors for Smoothness and Stability: From Splines to Dynamical Systems

The class of Stable Spline priors introduced in the paper [49] extends smoothness priors ideas used in splines models introduced in Sect. 4.9, embedding exponential decay conditions on the impulse response prior. They ultimately lead to estimated models which are BIBO stable with probability 1.

In this section, we will introduce a simple construction of these stable spline priors in discrete time. In particular, we will exploit a very natural axiomatic derivation in the maximum entropy framework introduced in Chap. 4. For the sake of illustration, we will only consider the so-called stable spline prior of order one (also known as the TC prior, see Example 5.3) and its extension known as DC prior. Possible extensions will be discussed, but not developed in full detail.

The most natural construction, inspired by smoothing spline ideas, is based on the following two observations:

  1.

    Stability: the variance of \(\theta _k\) should decay “sufficiently fast” (see Lemma 5.1), possibly exponentially, with the lag k. Assuming a zero-mean process, this can be expressed using a condition on second-order moments of the form:

    $$\begin{aligned} {\mathscr {E}}\left[ \theta _{k}^2\right] = \lambda _S \alpha ^{k} \quad k=1,\ldots ,n \quad 0< \alpha <1. \end{aligned}$$
    (5.28)

    For reasons that will become clear later on, imposing equality (as done above) rather than inequality constraints is convenient.

  2.

    Smoothness: the difference between adjacent coefficients should be constrained, e.g., as measured by the relative variance,

    $$\begin{aligned} \frac{{\mathscr {E}}\left[ (\theta _{k-1}-\theta _k)^2\right] }{ {\mathscr {E}}\left[ \theta _{k-1}^2\right] } = \lambda _R \quad k=2,\ldots ,n. \end{aligned}$$
    (5.29)

Using the stability constraint and redefining the constant \(\lambda _R\), condition (5.29) can be rewritten as

$$\begin{aligned} {\mathscr {E}}\left[ (\theta _{k-1}-\theta _k)^2\right] = \lambda _R \alpha ^{k-1} \quad k=2,\ldots ,n. \end{aligned}$$
(5.30)

The following theorem (whose proof is reported in Sect. 5.10.3) derives the class of maximum entropy priors under the constraints (5.28) and (5.29). Next, in Corollary 5.1 (whose proof is in Sect. 5.10.4), we will see that for special choices of \(\lambda _S\) and \(\lambda _R\) the well-known TC and DC priors [10, 52] are obtained.

Theorem 5.5

Let \(\{\theta _{k}\}_{k=1,\ldots ,n}\) be a zero mean, absolutely continuous random vector with density \(p_\theta (\theta )\), that satisfies the following constraints (with \(0< \alpha <1\)):

$$\begin{aligned} \begin{array}{ccl} {\mathscr {E}}\left[ \theta _{k}^2\right] &=& \lambda _S\alpha ^{k} \quad k=1,\ldots ,n \\ {\mathscr {E}}\left[ (\theta _{k-1}-\theta _k)^2\right] &=& \lambda _R \alpha ^{k-1} \quad k=2,\ldots ,n \end{array} \end{aligned}$$
(5.31)

with \(\lambda _S\in {\mathbb R}\) and \(\lambda _R \in {\mathbb R}\) such that

$$\begin{aligned} \lambda _S (1-\sqrt{\alpha })^2< \lambda _R < \lambda _S (1+\sqrt{\alpha }) ^2. \end{aligned}$$
(5.32)

Then, the solution \(p_{\theta ,ME}(\theta )\) of the maximum entropy problem

$$\begin{aligned} p_{\theta ,ME}: = \mathop {\mathrm{arg\;} \mathrm{max}}\limits _{ {\mathrm p}(\cdot ) \;\; s.t. \;\; (5.31)} \;\; - {\mathscr {E}}\log (p_{\theta }(\theta )) \end{aligned}$$
(5.33)

has the following form:

$$\begin{aligned} p_{\theta ,ME}(\theta ) = C e^{-\frac{1}{2} \theta ^T \varSigma ^{-1} \theta }, \end{aligned}$$
(5.34)

where the matrix \(\varSigma ^{-1}\) has the band structure:

$$ \varSigma ^{-1} = \left[ \begin{array}{cccccc} * & * & 0 & \dots & \dots & 0 \\ * & * & * & 0 & \dots & 0 \\ 0 & * & * & * & 0 & \dots \\ \vdots & \dots & \ddots & \ddots & \ddots & \dots \\ 0 & \dots & 0 & * & * & * \\ 0 & \dots & \dots & 0 & * & * \end{array}\right] . $$

The maximum entropy process admits the backward representation

$$ \theta _{k-1} = a_B \theta _k + w_k \quad w_k \sim \mathcal{N}(0,\sigma _k^2) \quad k\in \{2,\ldots ,n\} $$

with

$$\begin{aligned} a_B = \frac{\lambda _S(1+\alpha ) - \lambda _R}{2\lambda _S\alpha }, \end{aligned}$$
(5.35)
$$\begin{aligned} \sigma _k^2 = \lambda _S \alpha ^{k-1} (1 - a_B^2 \alpha ), \end{aligned}$$
(5.36)

and terminal condition

$$\begin{aligned} {\mathscr {E}}\theta ^2_n = \lambda _S \alpha ^{n}. \end{aligned}$$
(5.37)

Last, the autocovariance of \(\theta _k\) satisfies the relation:

$$\begin{aligned} {\mathscr {E}}\theta _k \theta _h = \lambda _Sa_B^{|k-h|} \alpha ^{\max \{k,h\}}. \end{aligned}$$
(5.38)
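As a numerical sanity check of Theorem 5.5, the covariance (5.38) can be built from (5.35) and verified against the moment constraints (5.28) and (5.30). The hyperparameter values below are arbitrary choices satisfying (5.32), not those of the book's experiments:

```python
import numpy as np

# Illustrative hyperparameter values satisfying (5.32).
lam_S, lam_R, alpha, n = 1.0, 0.5, 0.8, 30
a_B = (lam_S * (1 + alpha) - lam_R) / (2 * lam_S * alpha)   # Eq. (5.35)

k = np.arange(1, n + 1)
# Autocovariance (5.38): E[theta_k theta_h] = lam_S * a_B^{|k-h|} * alpha^{max(k,h)}.
K, H = np.meshgrid(k, k, indexing="ij")
Sigma = lam_S * a_B ** np.abs(K - H) * alpha ** np.maximum(K, H)

# Stability constraint (5.28): the diagonal equals lam_S * alpha^k.
assert np.allclose(np.diag(Sigma), lam_S * alpha ** k)

# Smoothness constraint (5.30): E[(theta_{k-1} - theta_k)^2] = lam_R * alpha^{k-1}.
d = np.array([Sigma[i - 1, i - 1] + Sigma[i, i] - 2 * Sigma[i - 1, i]
              for i in range(1, n)])
assert np.allclose(d, lam_R * alpha ** k[:-1])
```

The second assertion simply expands \({\mathscr {E}}[(\theta _{k-1}-\theta _k)^2]\) in terms of the entries of \(\varSigma \), confirming that (5.38) indeed satisfies (5.30).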

Corollary 5.1

Under the conditions of Theorem 5.5 and defining

$$\begin{aligned} \rho : = a_B \sqrt{\alpha } = \frac{\lambda _S(1+\alpha ) - \lambda _R}{2\lambda _S\sqrt{\alpha }}, \end{aligned}$$
(5.39)

the maximum entropy model in Theorem 5.5 corresponds to the so-called DC-kernel [10], i.e.,

$$\begin{aligned} {\mathscr {E}}\theta _k \theta _h = \lambda _S \rho ^{|k-h|} \alpha ^{\frac{k+h}{2}}. \end{aligned}$$
(5.40)

In particular, for \(\lambda _R = \lambda _S(1-\alpha ) \), this reduces to the so-called TC kernel [10] with

$$\begin{aligned} {\mathscr {E}}\theta _k \theta _h = \lambda _S\alpha ^{\max \{k,h\}}, \end{aligned}$$
(5.41)

while for \(\lambda _R = \lambda _S(1+\alpha ),\) we obtain the covariance of the “diagonal” kernel

$$\begin{aligned} {\mathscr {E}}\theta _k \theta _h = \left\{ \begin{array}{cc}\lambda _S\alpha ^{k} & k=h\\ 0 & k\ne h\end{array}\right. \!. \end{aligned}$$
(5.42)
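Both special cases of Corollary 5.1 are easy to verify numerically: substituting the stated values of \(\lambda _R\) into (5.39)–(5.40) should reproduce (5.41) and (5.42) entrywise. A minimal sketch with illustrative parameter values:

```python
import numpy as np

def dc_kernel(lam_S, lam_R, alpha, n):
    """DC kernel (5.40), with rho given by (5.39)."""
    rho = (lam_S * (1 + alpha) - lam_R) / (2 * lam_S * np.sqrt(alpha))
    k = np.arange(1, n + 1, dtype=float)
    K, H = np.meshgrid(k, k, indexing="ij")
    return lam_S * rho ** np.abs(K - H) * alpha ** ((K + H) / 2)

lam_S, alpha, n = 1.0, 0.8, 20
k = np.arange(1, n + 1, dtype=float)
K_g, H_g = np.meshgrid(k, k, indexing="ij")

# lam_R = lam_S (1 - alpha): rho = sqrt(alpha) and (5.40) collapses to TC (5.41).
TC = dc_kernel(lam_S, lam_S * (1 - alpha), alpha, n)
assert np.allclose(TC, lam_S * alpha ** np.maximum(K_g, H_g))

# lam_R = lam_S (1 + alpha): rho = 0 and (5.40) collapses to the DI kernel (5.42).
DI = dc_kernel(lam_S, lam_S * (1 + alpha), alpha, n)
assert np.allclose(DI, np.diag(lam_S * alpha ** k))
```

The TC case follows from the identity \(|k-h|/2 + (k+h)/2 = \max \{k,h\}\), which the first assertion checks numerically.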

Remark 5.1

In the maximum entropy kernel derived in Theorem 5.5, which includes DC, TC and DI as special cases as stressed in Corollary 5.1, the constant \(\lambda _S\) plays only the role of a scale factor while \(\alpha \) is a “decay rate”. Therefore, by fixing \(\lambda _S=1\) and \(\alpha =0.8\), we can study the behaviour as the “regularity” constant \(\lambda _R\) varies in the interval \(\lambda _S(1-\sqrt{\alpha })^2 = \lambda _{R,min}\le \lambda _R \le \lambda _{R,max} = \lambda _S(1+\sqrt{\alpha })^2\). This is entirely equivalent to studying the behaviour of the kernel as a function of the ratio \(\lambda _R/\lambda _S\). We thus consider a grid of 9 possible values \(\lambda _{R,min} = \lambda _{R,1}< \lambda _{R,2}< \dots < \lambda _{R,9} = \lambda _{R,max} \). Figure 5.5 then plots 5 sample realizations for each of these values, with panel (i) corresponding to the value \(\lambda _{R,i}\). In particular, \(\lambda _{R,4} = \lambda _S(1-\alpha )\) corresponds to the TC kernel and \(\lambda _{R,6} = \lambda _S(1+\alpha )\) induces the DI kernel. For each realization from the prior (solid line), its best single-exponential fit is also shown in order to highlight the “overall” decay rate, which can be thought of as an envelope of the curves. In panel (1), with \(\lambda _{R}\) taking the smallest possible value, hence imposing the “maximum” amount of regularity, all realizations are pure exponentials. In panel (9), with \(\lambda _{R}\) taking its maximum value, all realizations are pure damped oscillations. In fact, in both cases, it can be checked that the corresponding kernel is singular.

Fig. 5.5

Sample realizations (solid) and best (least squares) exponential fit as a function of the kernel parameters. In all figures \(\alpha =0.8\) and \(\lambda _S = 1\). The regularity parameter \(\lambda _R\) varies, from its minimum value \(\lambda _{R,min} = \lambda _S (1-\sqrt{\alpha })^2 \simeq 0.011\) in panel (1) to the maximum value \(\lambda _{R,max} = \lambda _S (1+\sqrt{\alpha })^2 \simeq 3.589\) in panel (9). Panel (4), with \(\lambda _R = 0.2\), corresponds to the TC kernel; panel (6) with \(\lambda _R = 2.6\) to the DI kernel

Degrees of Freedom of the DC Kernels

Theorem 5.5 provides a class of kernels \(K_{\eta }\) parametrized by the hyperparameter vector \(\eta := [\lambda _S, \lambda _R, \alpha ]\). In Fig. 5.5, we have illustrated how realizations from the prior change as a function of the regularity parameter \(\lambda _R\) having fixed \(\lambda _S = 1\) (or, equivalently, as a function of the ratio \(\lambda _R/\lambda _S\)). As discussed in Chap. 4, choosing the prior is equivalent to describing the model class. In the linear system identification context, this then defines a penalty function on impulse responses. A way to measure the “size” of the model class is to use the concept of equivalent degrees of freedom, introduced in the Bayesian context in Sect. 4.8. Unfortunately, the degrees of freedom are defined in terms of the output predictor sensitivity and thus require specifying not only the model class but also the experimental conditions under which the model is estimated. Only in limiting cases (such as an improper prior on finitely and linearly parametrized model classes) do the degrees of freedom become independent of the experiment and coincide with the number of parameters. In this section, we thus consider the prototypical setup in Eq. (5.18):

$$\begin{aligned} Y = \varPhi \theta _0 + E \quad \quad Y \in {\mathbb R}^N, \quad N = 1000, \quad \theta _0 \in {\mathbb R}^n. \end{aligned}$$
(5.43)

We recall that the matrix \(\varPhi \) is a Hankel matrix built with the input samples \(\{u(t)\}\) so that \(\varPhi \theta _0\) implements the convolution of u with \(\theta _0\). The input \(\{u(t)\}\) is now assumed to be zero-mean unit variance white noise, and the same is assumed for the noise \(\{e(t)\}\). We consider two scenarios in which the order of the system (the length n of \(\theta _0\)) is assumed to be either \(n=30\) or \(n=100\). Exploiting the derivation in Chap. 4 (see Definition 4.2 and Proposition 4.3), the degrees of freedom \(\mathrm {dof}(\eta )\), as a function of the hyperparameter vector \(\eta \), are given by

$$ \mathrm {dof}(\eta ) = {{\,\mathrm{trace}\,}}\left( \varPhi (\varPhi ^T \varPhi + K^{-1}_{\eta } )^{-1} \varPhi ^T \right) . $$

Assuming also here that \(\lambda _S = 1\), we study how \({\mathrm {dof}}(\eta )\) varies as a function of \(\lambda _R\) for three different values of \(\alpha \) (0.6, 0.8, and 0.95). The behaviour is illustrated in Fig. 5.6, where it is apparent that the maximum is achieved for the DI kernel, and the minimum (a bit smaller than 1) is attained at the extreme points, where the kernel has rank exactly equal to 1. It is interesting to observe the interplay between the value of \(\alpha \) (which controls the decay rate) and the length n of the FIR model. As the coefficient vector \(\theta _0\) changes from length \(n=30\) (left) to \(n=100\) (right), the effective “size” of the model does not change much for \(\alpha = 0.6\) and \(\alpha = 0.8\), while it does increase for \(\alpha = 0.95\). This confirms that the kernel, for \(\alpha \) fixed, effectively controls the model complexity, so that the estimator becomes insensitive to the chosen length, provided n is “big enough” w.r.t. \(\alpha \). In particular, \(n=15\) would be sufficient for \(\alpha =0.6\) and \(n=30\) for \(\alpha =0.8\), while for \(\alpha =0.95\) the effective size is about \(n=100\).
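A sketch of this computation under the white-noise-input assumption above; the regressor construction and parameter values are illustrative, and the trace is rearranged with its cyclic property so that no \(N\times N\) matrix is ever formed:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 1000, 30
lam_S, alpha = 1.0, 0.8

# Regressor matrix built from a white-noise input so that Phi @ theta is the
# convolution of u with theta, as in setup (5.43); values are illustrative.
u = rng.standard_normal(N + n)
Phi = np.column_stack([u[n - k : n - k + N] for k in range(1, n + 1)])

def dc_kernel(lam_R):
    """DC kernel (5.40) with lam_S and alpha fixed above."""
    rho = (lam_S * (1 + alpha) - lam_R) / (2 * lam_S * np.sqrt(alpha))
    k = np.arange(1, n + 1, dtype=float)
    K, H = np.meshgrid(k, k, indexing="ij")
    return lam_S * rho ** np.abs(K - H) * alpha ** ((K + H) / 2)

def dof(K_eta):
    # trace(Phi (Phi^T Phi + K^{-1})^{-1} Phi^T), rewritten as
    # trace((Phi^T Phi + K^{-1})^{-1} Phi^T Phi) via the cyclic property.
    A = Phi.T @ Phi + np.linalg.inv(K_eta)
    return np.trace(np.linalg.solve(A, Phi.T @ Phi))

d = dof(dc_kernel(lam_S * (1 + alpha)))  # the DI kernel
assert 0 < d < n                         # dof never exceeds the parameter count
```

The inequality in the last line reflects the general fact that the equivalent degrees of freedom of a regularized estimator lie strictly between 0 and the number of parameters.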

Fig. 5.6

Effective degrees of freedom of the DC kernel as a function of \(\lambda _R\) (\(\lambda _S= 1\)) for model (5.43); \(n=30\) (left), \(n=100\) (right). From top to bottom: \(\alpha =0.6\), \(\alpha = 0.8\) and \(\alpha =0.95\)

Extension to Smoothness Conditions on Filtered Versions \(\star \)

So far, we have limited our attention to so-called “first-order” stable splines, which are derived by imposing conditions on first-order differences, leading to first-order, i.e., AR(1), realizations. Of course, these constructions can be generalized by replacing (5.31) with a higher-order constraint of the form

$$\begin{aligned} \begin{array}{ccl} {\mathscr {E}}\Vert \theta _k\Vert ^2 &\le & \lambda _S \alpha ^{k} \\ {\mathscr {E}}\Vert \theta _k - \sum _{i=1}^p a_i \theta _{k+i}\Vert ^2 &\le & \lambda _R \alpha ^{k}. \end{array} \end{aligned}$$
(5.44)

While the first constraint is a “standard” stability condition, the second constraint can be interpreted as a filtered frequency domain smoothness condition. In fact, defining the filter \(F(q): = 1 - \sum _{i=1}^p a_i q^i\), let us denote with \(\theta ^F_k\) the sequence obtained filtering \(\theta _k\) with F(q). The condition

$$ {\mathscr {E}}\Vert \theta ^F_k\Vert ^2 = {\mathscr {E}}\Vert \theta _k - \sum _{i=1}^p a_i \theta _{k+i}\Vert ^2 \le \lambda _R \alpha ^{k} $$

implies that \(\theta ^F_k\) should decay “fast” enough (in mean square) and thus

$$ {\mathscr {E}}\sum _{k=0}^{\infty } k^{2m} \Vert \theta ^F_k\Vert ^2 $$

should be small for any integer m. As a consequence, if

$$ G(e^{j\omega }):= \sum _{k=1}^{\infty } \theta _k e^{-j\omega k}, $$

using Parseval’s theorem,

$$ \displaystyle { {\mathscr {E}}\int _{0}^{2\pi } \left\| F(e^{j\omega }) \frac{d^{(m)}G(e^{j\omega })}{d\omega ^{(m)}}\right\| ^2 d\omega } $$

should be small as well, implying that \(\theta _k\) should concentrate most of its energy (variance) in frequency bands where the absolute value of the filter \(F(e^{j\omega })\) is small.

We regard developments of this type, in principle, as a straightforward extension of the basic ideas discussed in this chapter to obtain DC kernels. In particular, the choice of the coefficients a in (5.44) is a design issue, which can be guided by prior knowledge on the candidate models, and its underlying principles and ideas are the same as those illustrated above. There are however additional complications due to the richer structure of the constraints, which might entail non-trivial issues to derive an analytic expression of the kernel.

5.6 Regularization and Basis Expansion \(\star \)

The \(\ell _2\) (ridge regression) regularized estimators that have been discussed in this chapter can also be framed in the context of basis expansion using the so-called Karhunen–Loève decomposition  of the random process \(\theta \). For the sake of exposition, we will now consider the finite-dimensional case, i.e., we will study FIR models of length n of the form (5.14). Extension to the infinite-dimensional case will be discussed in the framework of Reproducing Kernel Hilbert Spaces illustrated in Chap. 6. Under this finite-dimensional assumption, we consider the covariance matrix \(\mathbf{K}\in {\mathbb R}^{n\times n}\) whose entries satisfy \([\mathbf{K}]_{(t,s)}:=K(t,s) = \mathrm{cov}(\theta _t,\theta _s)\). The matrix \(\mathbf{K}\) can be written in terms of its spectral decomposition (Singular Value Decomposition) in the form:

$$\begin{aligned} \mathbf{K}= USU^T = \sum _{i=1}^n \xi _i u_i u_i^T \quad u_i \in {\mathbb R}^n \quad \Vert u_i\Vert =1 \quad u_i \perp u_j \;\;\; \forall i\ne j, \end{aligned}$$
(5.45)

where

$$ U:=[u_1,\ldots ,u_n] \quad S:=\mathrm{diag}\{\xi _1,\ldots ,\xi _n\}. $$

The set of vectors \(u_i \in {\mathbb R}^n\) provides an orthonormal basis of \({\mathbb R}^n\) so that any impulse response \(\theta \in {\mathbb R}^n\) can be written using the orthonormal basis expansion

$$\begin{aligned} \theta = \sum _{i=1}^n \, u_{i} \beta _i \quad \quad \beta _i := <\theta ,u_i>, \end{aligned}$$
(5.46)

where the coefficients \(\beta _i = <\theta ,u_i> = u_i^T \theta \) are therefore zero-mean random vectors with covariances

$$ {\mathscr {E}}\beta _i \beta _j = {\mathscr {E}}u_i^T \theta \theta ^T u_j = u_i^T \mathbf{K}u_j =\xi _i \delta _{ij}. $$

Clearly, the argument above can be reversed. Namely, starting from (a possibly orthonormal) basis \(u_i\), \(i=1,\ldots ,n\) the random basis expansion

$$\begin{aligned} \theta = \sum _{i=1}^n \, u_{i} \beta _i , \quad \quad \beta _i \sim \mathcal{N}(0,\xi _i) \quad (\beta _1,\ldots ,\beta _n) \quad \mathrm{independent} \end{aligned}$$
(5.47)

induces a probability description of the candidate \(\theta \)’s which turns out to be zero mean and with covariance matrix as in (5.45). This interpretation provides a clear link between “standard” models described in terms of basis expansions, regularization and the Bayesian view.
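Both directions of this argument can be sketched in a few lines: sampling \(\theta \) from the expansion (5.47) and recovering the coefficients by projection as in (5.46). The TC kernel and its parameters below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
k = np.arange(1, n + 1, dtype=float)
# TC covariance (5.41) as the example kernel (alpha = 0.8, lam_S = 1 are illustrative).
K = 0.8 ** np.maximum.outer(k, k)

# Spectral decomposition (5.45): K = U S U^T with orthonormal columns u_i.
xi, U = np.linalg.eigh(K)          # eigh returns ascending eigenvalues
xi, U = xi[::-1], U[:, ::-1]       # reorder so that xi_1 >= xi_2 >= ...

# Karhunen-Loeve sampling (5.47): theta = sum_i u_i beta_i with beta_i ~ N(0, xi_i).
beta = rng.standard_normal(n) * np.sqrt(np.maximum(xi, 0.0))
theta = U @ beta

# The coefficients are recovered by projection, beta_i = <theta, u_i> ...
assert np.allclose(U.T @ theta, beta)
# ... and the expansion reproduces the kernel: U S U^T = K.
assert np.allclose((U * xi) @ U.T, K)
```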

Remark 5.2

(Low-Rank Kernel Approximation) The spectral decomposition of the kernel (5.45) suggests also that, when some singular values \(\xi _i\) are “very small”, it can be easily approximated by a low-rank matrix

$$ \mathbf{K}= \sum _{i=1}^n \xi _i u_i u_i^T \simeq \sum _{i=1}^{\hat{n}}\xi _i u_i u_i^T \quad \quad \hat{n} \le n. $$

This is equivalent to approximating the \(\xi _i\) below a certain threshold with zero singular values. This threshold can be chosen by a standard SVD-truncation criterion, e.g., neglecting singular values below a certain fraction of the largest singular value \(\xi _1\), i.e., that satisfy

$$ \xi _i < \frac{ \xi _1}{R}. $$

In Fig. 5.7, the value \(R = 20\) has been chosen to plot the most relevant eigenfunctions. Low-rank kernel approximation can also be exploited to reduce the computational burden in computing the solutions.
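The truncation rule of this remark can be sketched as follows, again using an illustrative TC kernel; the spectral-norm error of the truncation is bounded by the largest discarded singular value:

```python
import numpy as np

n, R = 50, 20
k = np.arange(1, n + 1, dtype=float)
K = 0.8 ** np.maximum.outer(k, k)      # TC kernel, used here only as an example

xi, U = np.linalg.eigh(K)
xi, U = xi[::-1], U[:, ::-1]           # descending singular values

# Keep only the singular values above xi_1 / R (Remark 5.2, R = 20).
keep = xi >= xi[0] / R
K_lowrank = (U[:, keep] * xi[keep]) @ U[:, keep].T
n_hat = int(keep.sum())

# The spectral-norm error is bounded by the largest discarded singular value.
assert n_hat < n
assert np.linalg.norm(K - K_lowrank, 2) <= xi[0] / R + 1e-12
```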

Figure 5.7 shows the eigenfunctions of the DC kernel for different choices of the hyperparameters. As already studied in the previous section, the “complexity” of the kernel, measured, e.g., by the degrees of freedom as illustrated in Fig. 5.6, varies as the hyperparameters change. In the context of basis expansions, this is clear from Fig. 5.8, which shows how the singular values of the kernel, i.e., the variances of the basis expansion coefficients \(\beta _i\) introduced in (5.47), vary as the hyperparameters change. For instance, when \(\lambda _R = \lambda _{R,min}\), see panel (1), and \(\lambda _R = \lambda _{R,max}\), see panel (9), the kernel has rank 1. Instead, the singular values decay more slowly for the DI kernel, see panel (6), which also has the largest number of degrees of freedom, see Fig. 5.6.

Fig. 5.7

First \(\hat{n}\) eigenfunctions of the DC kernel. To enhance clarity, \(\hat{n}\) is chosen for each combination of the parameters as \(\hat{n} = \arg \max _i \; i \;\; s.t. \; \; \xi _i>\xi _1/20\) (see Remark 5.2). In all figures, \(\alpha =0.8\) and \(\lambda _S = 1\). The regularity parameter \(\lambda _R\) varies, from its minimum value \(\lambda _{R,min} = \lambda _S (1-\sqrt{\alpha })^2 \simeq 0.011\) in panel (1) to the maximum value \(\lambda _{R,max} = \lambda _S (1+\sqrt{\alpha })^2 \simeq 3.589\) in panel (9). Panel (4), with \(\lambda _R = 0.2\), corresponds to the TC kernel; panel (6) with \(\lambda _R = 2.6\) to the DI kernel

Fig. 5.8

First 10 singular values of the DC kernel. In all figures, \(\alpha =0.8\) and \(\lambda _S = 1\). The regularity parameter \(\lambda _R\) varies, from its minimum value \(\lambda _{R,min} = \lambda _S (1-\sqrt{\alpha })^2 \simeq 0.011\) in panel (1) to the maximum value \(\lambda _{R,max} = \lambda _S (1+\sqrt{\alpha })^2 \simeq 3.589\) in panel (9). Panel (4), with \(\lambda _R = 0.2\), corresponds to the TC kernel; panel (6) with \(\lambda _R = 2.6\) to the DI kernel

Even though this section is devoted to finite impulse response models (i.e., n finite, and therefore BIBO stable systems), it still makes sense to discuss what happens to the coefficients \(\theta _k\) when n becomes “large”, and the relation with BIBO stability. In Lemma 5.1, we have seen that a sufficient condition for a.s. BIBO stability of realizations from the Gaussian prior is that the diagonal elements of K satisfy the summability condition

$$ \sum _{t=1}^\infty K(t,t)^{1/2} < \infty $$

which requires a “sufficiently fast” decay rate of the diagonal K(t, t). A quite natural question concerns how the behaviour of K(t, t) reflects on the basis vectors \(u_i\). The following lemma, whose proof is in Sect. 5.10.5, gives the answer.

Lemma 5.2

The basis vectors \(u_i\) introduced in (5.45), whose tth elements are denoted by \(u_{it}\), satisfy the inequality

$$\begin{aligned} |u_{it}| \le \frac{1}{\xi _i} C [{\mathbf{K}}]_{t,t}^{1/2}, \quad C: =\sum _{t=1}^n [{\mathbf{K}}]_{t,t}. \end{aligned}$$
(5.48)

Condition (5.48) holds also in the infinite dimensional case, i.e., as \(n\rightarrow \infty \), provided K(ts) admits the spectral decomposition

$$ K(t,s)= \sum _{i=1}^{\infty } \xi _i u_{it} u_{is}, $$

where the \(u_{i}\) are orthonormal sequences in \(\ell _2\) and the condition \(\sum _{t=1}^\infty K(t,t) = C <\infty \) is satisfied.

While this result is essentially trivial for n finite, it becomes important when \(n\rightarrow \infty \), since it provides a condition on the tail behaviour of the eigenvectors (eigenfunctions). For instance, if the diagonal entries (variances) K(t, t) of the kernel decay exponentially fast as a function of t, so do the \(u_{it}\). The decay of the eigenfunctions can be visually inspected in Fig. 5.7.
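For finite n, the bound (5.48) can be checked directly. The sketch below uses an illustrative TC kernel, whose diagonal decays exponentially:

```python
import numpy as np

n = 20
t = np.arange(1, n + 1, dtype=float)
K = 0.8 ** np.maximum.outer(t, t)      # TC kernel: K(t, t) = 0.8^t decays exponentially

xi, U = np.linalg.eigh(K)
xi, U = xi[::-1], U[:, ::-1]           # descending eigenvalues

C = np.trace(K)                        # C = sum_t [K]_{t,t}
# Bound (5.48): |u_{it}| <= (C / xi_i) * K(t, t)^{1/2}; rows index t, columns index i.
bound = np.sqrt(np.diag(K))[:, None] * (C / xi)[None, :]
assert np.all(np.abs(U) <= bound + 1e-12)
```

Note that the bound is only informative for the leading eigenvectors: for small \(\xi _i\) the right-hand side is large and the inequality holds trivially.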

5.7 Hankel Nuclear Norm Regularization

As discussed above, regularization can be used to enforce smoothness and stability of impulse responses. Yet this is just one way, and possibly not the most common in the field of dynamical systems, to control the “complexity” of model classes.

For instance, in the parametric approach to system identification, the complexity can be measured by the dimension of a minimal state-space realization of the unknown system. For ease of exposition, let us now only consider the single-input single-output output error case (i.e., \(H(z) = 1\)). In this case, the number of free parameters is \(2n+1\), where n is the degree of the denominator of the transfer function \(G_\theta (z)\); this also equals the dimension n of a minimal state-space realization of \(G_\theta (z)\), which is called the McMillan degree of \(G(z,\theta )\), as seen in Sect. 2.2.1.1. To fix notation, let us introduce a minimal state-space realization of \(G(z,\theta )\)

$$\begin{aligned} \begin{array}{rcl} x_{t+1} & = & A x_t + B u_t \quad \quad x_t \in {\mathbb R}^n,\\ y_t & = & C x_t \end{array} \end{aligned}$$
(5.49)

which is such that \(G(z,\theta )= C(zI-A)^{-1}B\). If \(\{g(k,\theta )\}_{k \in {\mathbb N}}\) is the impulse response sequence, parametrized by \(\theta \), then one has \(g(k,\theta ) = CA^{k-1}B\) \(\forall k>0\).

It is well known from realization theory that the McMillan degree has a close connection with the so-called Hankel matrix formed with the impulse response coefficients, i.e.,

$$\begin{aligned} \mathcal{H}_{r,c}(\theta ):=\left[ \begin{array}{ccccc} g(1,\theta ) & g(2,\theta ) & g(3,\theta ) & \dots & g(c,\theta ) \\ g(2,\theta ) & g(3,\theta ) & g(4,\theta )& \dots & g(c+1,\theta ) \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ g(r,\theta )& g(r+1,\theta ) &g(r+2,\theta )& \dots & g(r+c-1,\theta ) \end{array}\right] \end{aligned}$$
(5.50)

with r block rows and c block columns. The following lemma holds.

Lemma 5.3

(based on [65]) The linear time-invariant system with impulse response \(\{g(k,\theta ) \}_{k\in {\mathbb N}}\) admits a minimal state-space realization of order n (i.e., has McMillan degree equal to n) if and only if, for some choice of r, c, the following holds:

$$\begin{aligned} n = \mathrm{rank}\{\mathcal{H}_{r,c}(\theta )\} = \mathrm{rank}\{\mathcal{H}_{r+j,c+i}(\theta )\} \quad \quad \forall \;\; i,j \in {\mathbb N}. \end{aligned}$$
(5.51)

In practice, only a finite number of impulse response (Markov) parameters \(g(k,\theta ) \), \(k=1,\ldots ,p\), is available, and the problem of finding a state-space model of the form (5.49) such that \(g(k,\theta ) = CA^{k-1}B\) \(\forall \; k=1,\ldots ,p\) is known as the partial realization problem.
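Lemma 5.3 can be illustrated numerically: building the Hankel matrix (5.50) from the Markov parameters of a minimal second-order model, its rank stays equal to the McMillan degree as rows and columns are added. The state-space matrices below are hypothetical values chosen for illustration:

```python
import numpy as np

# A minimal order-2 state-space model; the matrices are illustrative, not from the book.
A = np.array([[0.5, 0.2], [0.0, -0.3]])
B = np.array([[1.0], [1.0]])
C = np.array([[1.0, 0.5]])

# Markov parameters g(k) = C A^{k-1} B, k = 1, ..., p.
p = 12
g = [(C @ np.linalg.matrix_power(A, k - 1) @ B).item() for k in range(1, p + 1)]

def hankel(g, r, c):
    """Hankel matrix (5.50) built from the impulse response coefficients."""
    return np.array([[g[i + j] for j in range(c)] for i in range(r)])

# Lemma 5.3: the rank equals the McMillan degree (2 here) and does not
# grow when rows or columns are added.
assert np.linalg.matrix_rank(hankel(g, 3, 3)) == 2
assert np.linalg.matrix_rank(hankel(g, 6, 6)) == 2
```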

This shows that, indeed, a notion of “complexity” can be attached to the dimension n of a minimal state-space realization (5.49); therefore, the rank of the Hankel matrix \(\mathcal{H}_{r,c}(\theta )\) can be considered as a candidate for performing regularization. This leads to the choice of a penalty given by

$$\begin{aligned} J_{\mathcal{H},\gamma }(\theta ):= \gamma \, \mathrm{rank}\{\mathcal{H}_{r,c}(\theta )\} \end{aligned}$$
(5.52)

for suitable values of the integers c, r. Unfortunately, similarly to what happens for the \(\ell _0\) quasi-norm \(\Vert x\Vert _0\) (defined as the number of non-zero entries in the vector x) discussed in Sect. 3.6.2.1, the rank functional is not convex; as a result, solving optimization problems involving penalties of the form (5.52) is problematic. The very same issue arises in a variety of rank-constrained optimization problems.

As seen in Chap. 3, to overcome this limitation, inspired by work on \(\ell _1\) regularization, researchers have suggested using the nuclear norm \(\Vert A \Vert _*\) of a matrix \(A\in {\mathbb R}^{m\times n}\), defined as

$$\begin{aligned} \Vert A \Vert _*: = {{\,\mathrm{trace}\,}}\left( \sqrt{{ A}^T{A}}\right) = \sum _i \, \sigma _i(A), \end{aligned}$$
(5.53)

where \(\sigma _i(A)\) denotes the ith singular value of the matrix A, as a surrogate for the rank of the matrix A. The nuclear norm is also known as Ky–Fan n-norm or trace norm.  This choice is motivated by the following lemma.

Lemma 5.4

(based on [20]) Given a matrix \(A \in {\mathbb R}^{m\times n}\) the nuclear norm of A is the convex envelope of the rank function on the set \(\mathcal{A}:=\{ A \in {\mathbb R}^{m\times n}, \; \Vert A\Vert \le 1\}\).
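A minimal sketch of the definition (5.53) and of the surrogate role stated in Lemma 5.4 (the matrix A below is an arbitrary example):

```python
import numpy as np

def nuclear_norm(A):
    """||A||_* = trace(sqrt(A^T A)) = sum of the singular values, Eq. (5.53)."""
    return np.linalg.svd(A, compute_uv=False).sum()

A = np.array([[3.0, 0.0], [0.0, 4.0]])
assert np.isclose(nuclear_norm(A), 7.0)   # singular values are 4 and 3

# Consistent with Lemma 5.4 (convex envelope of the rank), the nuclear norm
# lower-bounds the rank on the unit spectral-norm ball:
B = A / np.linalg.norm(A, 2)              # rescale so that ||B|| = 1
assert nuclear_norm(B) <= np.linalg.matrix_rank(B)
```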

These considerations have led to a whole class of regularization methods which build upon the nuclear norm of the Hankel matrix

$$ J_{\mathcal{H},\gamma }(\theta ):= \gamma \,\Vert \mathcal{H}_{r,c}(\theta )\Vert _* $$

as a possible regularizer. Also several extensions have been considered, including weighted versions of the form

$$ J_{\mathcal{H},\gamma }(\theta ):= \gamma \,\Vert W_r\mathcal{H}_{r,c}(\theta )W_c\Vert _* $$

where \(W_c\) and \(W_r\) are, respectively, “column” and “row” weightings. The latter can possibly be adapted iteratively, in the framework of iteratively reweighted methods such as those commonly used in conjunction with \(\ell _1\) and/or \(\ell _2\) reweighted schemes, see e.g., [72].

The Hankel norm regularizer can also be studied from a Bayesian perspective, considering the prior

$$\begin{aligned} \mathrm {p}_{\mathcal{H},\gamma }(\theta ) \propto \exp \left( -\gamma \Vert \mathcal{H}_{r,c}(\theta )\Vert _*\right) \propto \exp \left( -\gamma \sum _i \, \sigma _i(\mathcal{H}_{r,c}(\theta )) \right) . \end{aligned}$$
(5.54)

To gain some intuition on the structure of this prior, let \(g(k,\theta )=\theta _k\) and consider the following modified prior which penalizes the nuclear norm of the squared Hankel matrix, i.e.,

$$\begin{aligned} \tilde{ \mathrm {p}}_{\mathcal{H},\gamma }(\theta ) \propto \exp \left( -\gamma \Vert \mathcal{H}_{r,c}(\theta )\mathcal{H}_{r,c}(\theta )^T \Vert _*\right) \propto \exp \left( -\gamma \sum _i \, \sigma _i(\mathcal{H}_{r,c}(\theta )\mathcal{H}_{r,c}(\theta )^T) \right) . \end{aligned}$$
(5.55)

The reason for introducing \( \tilde{ \mathrm {p}}\) is twofold. The first is related to the fact that the prior (5.55) is equivalent to assuming that the entries \(\theta _k\) of the impulse response are independent zero mean Gaussians, as formalized in the following proposition.

Proposition 5.1

(based on [53]) Let \( \tilde{ \mathrm {p}}_{\mathcal{H},\gamma }(\theta )\) be as in (5.55) and let \(\theta \in {\mathbb R}^m \sim \tilde{\mathrm {p}}_{\mathcal{H},\gamma }(\theta ),\) where \(\mathcal{H}_{p,p}(\theta )\) is its \(p\times p\) Hankel matrix (with \(m=2p-1\)). Then the \(\theta _k\)’s are zero mean, independent and Gaussian. In particular:

$$\begin{aligned} \theta _k \sim \left\{ \begin{array}{cl} \mathscr {N}\left( 0,\frac{1}{2\gamma k}\right) &{} \ \text{ if } \ 1 \leqslant k \leqslant \frac{m+1}{2} \\ \mathscr {N}\left( 0,\frac{1}{2\gamma (m-k+1)}\right) &{} \ \text{ if } \ \frac{m+1}{2} < k \leqslant m \end{array} \right. \!. \end{aligned}$$
(5.56)

As illustrated in Fig. 5.9, from (5.56) one sees that the variance of \(\theta _k\) is not decaying with the lag k, and hence the prior \( \tilde{ \mathrm {p}}_{\mathcal{H},\gamma }(\theta )\) does not induce a BIBO stable hypothesis space.

Second, the prior \( \tilde{ \mathrm {p}}_{\mathcal{H},\gamma }(\theta )\) can be used as a proposal distribution for an MCMC scheme, as introduced in Sect. 4.10, to sample from the Hankel prior \(\mathrm {p}_{\mathcal{H},\gamma }(\theta ) \) in (5.54) with \(g(k,\theta )=\theta _k\). Samples from \(\mathrm {p}_{\mathcal{H},\gamma }(\theta )\) can then be used to approximate the variances \(\text{ Var }\{\theta _k\}\) and the correlations \(\text{ Corr }\{\theta _{k},\theta _{h}\}\). These are shown in Fig. 5.9. In particular, the solid line in the left panel shows \(\text{ Var }\{\theta _k\}\) as a function of k, while the right panel shows \(\text{ Corr }\{\theta _{k},\theta _{k+h}\}\) as a function of h for k fixed to 50. It is clear that, even though under \(\mathrm {p}_{\mathcal{H},\gamma }(\theta )\) the \(\theta _k\)’s are not Gaussian, the variances resemble those of \(\tilde{\mathrm {p}}_{\mathcal{H},\gamma }(\theta )\) (left panel, dashed line) and their correlations resemble those of independent variables. For the sake of comparison, the left panel also plots the profiles of the impulse response coefficients’ variances using the TC prior for two different decay rates (dashdot lines).
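The variance profile (5.56) can be computed directly; the function name below is ours and \(\gamma = 1\) is an arbitrary choice:

```python
import numpy as np

def hankel_sq_variances(m, gamma=1.0):
    """Variances of theta_k under the prior (5.56); m = 2p - 1 must be odd."""
    k = np.arange(1, m + 1)
    mult = np.where(k <= (m + 1) // 2, k, m - k + 1)   # 1, 2, ..., p, ..., 2, 1
    return 1.0 / (2.0 * gamma * mult)

v = hankel_sq_variances(79)   # the dimension used in Fig. 5.9
# The profile is symmetric in k and does NOT decay with the lag:
assert np.allclose(v, v[::-1])
assert v[0] == v[-1] == 0.5
```

The tail variance equals the initial one, which is precisely why this prior does not induce a BIBO stable hypothesis space.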

Fig. 5.9

Prior induced by the Hankel Nuclear Norm: the impulse response coefficients are contained in the vector \(\theta \in {\mathbb R}^{79}\), modelled as a random vector with probability density function \(\mathrm {p}_{\mathcal{H},\gamma }(\theta ) \propto \exp (-\Vert \mathcal{H}_{40,40}(\theta ) \Vert _*)\). Left: variances of the impulse response coefficients \(\theta _k\) reconstructed by MCMC (solid line) and approximated using the prior (5.56) (dashed line). The figure also displays the variances of \(\theta _k\) when \(\theta \) is a Gaussian random vector with stable spline (TC) covariance (5.41) for two different values of \(\alpha \) (dashdot lines). All the profiles are rescaled so that they share the same initial value. Right: 40th row of the matrix containing the correlation coefficients returned by the MATLAB command corrcoef(M) where each column of the 79\(\times 10^6\) matrix M contains one MCMC realization of \(\theta \) under the Hankel prior \(\mathrm {p}_{\mathcal{H},\gamma }(\theta )\). The adopted MCMC scheme was a random walk Metropolis with increments proportional to the variances (5.56) divided by a factor equal to 4

These observations suggest that, while the nuclear norm regularization (prior) accounts for system-theoretic notions of model complexity as defined by the McMillan degree, it fails to include decay rate and smoothness constraints. One would expect, therefore, that Hankel regularization alone may not give satisfactory results, as it is not able to properly bound the candidate set of models. It turns out that the maximum entropy framework discussed in Sect. 5.5.1 can be used to build prior distributions which account for stability and smoothness as well as “complexity”. The following theorem (whose proof is given in Sect. 5.10.6) gives the structure of the MaxEnt prior under a simple “TC”-like condition on the stability-smoothness constraint.

Theorem 5.6

Let \(\{\theta _{k}\}_{k=1,\ldots ,m}\) be a zero mean, absolutely continuous random vector with density \(p_\theta (\theta )\), which satisfies the following constraints:

$$\begin{aligned} \begin{array}{ccl} {\mathscr {E}}\left[ \theta _{m}^2\right] &\le & \sigma ^2 \alpha ^{m-1} \\ {\mathscr {E}}\left[ (\theta _{k-1}-\theta _k)^2\right] & \le & \sigma ^2 \alpha ^{k-2}(1-\alpha ) \quad k=2,\ldots ,m \\ {\mathscr {E}}\Vert \mathcal{H}_{r,c}(\theta )\Vert _* & \le &h. \end{array} \end{aligned}$$
(5.57)

Then, the solution \(\mathrm {p}_{\theta ,MEH}(\theta )\) of the maximum entropy problem

$$\begin{aligned} \mathrm {p}_{\theta ,MEH}: = \mathop {\mathrm{arg\;} \mathrm{max}}\limits _{ {\mathrm p}(\cdot ) \;\; s.t. \;\; (5.57)} \;\; - {\mathscr {E}}\log (\mathrm{p}_{\theta }(\theta )) \end{aligned}$$
(5.58)

has the following form:

$$\begin{aligned} \mathrm {p}_{\theta ,MEH}(\theta ) \propto e^{-\mu _H \Vert \mathcal{H}_{r,c}(\theta )\Vert _*} \left[ \prod _{k=2}^{m} e^{-\frac{1}{2}\mu _{k-1} (\theta _{k-1}-\theta _{k})^2} \right] e^{-\frac{1}{2}\mu _{m}{\theta _{m}^2}}, \end{aligned}$$
(5.59)

where the Lagrange multipliers \(\mu _H,\mu _1,\ldots ,\mu _m\) are determined so that the constraints (5.57) are satisfied.
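Although the normalizing constant of (5.59) is intractable, the unnormalized negative log-prior is easy to evaluate, which is all that an MCMC scheme such as the random walk Metropolis sampler mentioned above requires. Below is a minimal NumPy sketch; the function names and the particular choice of multipliers are ours, for illustration only.

```python
import numpy as np

def hankel_matrix(theta, r, c):
    """H_{r,c}(theta): [H]_{i,j} = theta_{i+j-1} (1-based); needs len(theta) = r+c-1."""
    assert theta.size == r + c - 1
    return np.array([[theta[i + j] for j in range(c)] for i in range(r)])

def neg_log_prior(theta, mu_H, mu):
    """Unnormalized -log p_{theta,MEH}(theta) from (5.59).

    mu = (mu_1, ..., mu_m): mu_1..mu_{m-1} weight the squared increments
    (theta_{k-1} - theta_k)^2, mu_m weights theta_m^2; mu_H weights the
    nuclear norm of a near-square Hankel matrix built from theta.
    """
    m = theta.size
    r = (m + 1) // 2
    c = m + 1 - r
    nuc = np.linalg.norm(hankel_matrix(theta, r, c), 'nuc')  # sum of singular values
    incr = np.sum(mu[:m - 1] * (theta[:-1] - theta[1:]) ** 2)
    return mu_H * nuc + 0.5 * incr + 0.5 * mu[m - 1] * theta[m - 1] ** 2
```

As a sanity check, a first-order (McMillan degree one) impulse response \(\theta_k = \alpha^k\) yields a rank-one Hankel matrix, so the nuclear norm term reduces to a single singular value.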

The Hankel nuclear norm discussed in this chapter is only one possible way to favour models that are “simple” in the sense of having small McMillan degree. Indeed, it is by no means trivial to use priors of the form (5.59), which involve nuclear norm terms, in conjunction with marginal likelihood optimization to estimate hyperparameters. Several variations are possible and, indeed, matricial reweighting schemes such as those used in [55] can be employed in a Bayesian context, leading to iteratively reweighted schemes reminiscent of \(\ell _1/\ell _2 \) reweighting [72].

5.8 Historical Overview

The framework discussed in this chapter has a long history that can be traced back, largely outside the control community, to the early ’70s of the last century. In this section, we review these developments and point out similarities and differences with the theory developed in this chapter.

5.8.1 The Distributed Lag Estimator: Prior Means and Smoothing

To the best of our knowledge, Bayesian methods for estimating dynamical systems were first advocated in the early ’70s in the econometrics literature for FIR models of the form (5.14), which were referred to as distributed lag models. The length n of the FIR model was actually left unspecified, and possibly let go to infinity.

In particular, [40, 62] were the first to talk about (and apply) Bayesian methods for system identification, arguing that “rigid parametric” structures may be inadequate, extending arguments which can be found in [66] for “static” linear regression models to the “dynamical” systems scenario. In the paper [40], having in mind that modes of linear time-invariant systems have an exponentially decaying behaviour of the type \(\alpha ^t\), it was suggested to describe the unknown impulse response \(\theta \) with a process having an exponentially decaying prior mean

$$\begin{aligned} \{m_t\}_{t\in {\mathbb N}} \quad \quad m_t := \lambda \alpha ^t \quad |\alpha | <1. \end{aligned}$$
(5.60)

Other possible response patterns were also considered, such as the hump, composed of the response build-up, its maximum and its decay; see [40] for details and alternative patterns. The covariance function K(t, s) in [40] was taken so that the ratio

$$ \frac{std(\theta _t)}{m_t} $$

remains constant over time t. This was called the “proportionality principle” and can be achieved with the choice

$$\begin{aligned} K(t,s) = cov(\theta _t,\theta _s) := v w_{ts} \alpha ^{t+s-2} \quad |w_{ts}| \le 1 \end{aligned}$$
(5.61)

so that the normalized standard deviation

$$ \frac{std(\theta _t)}{m_t} = \frac{\sqrt{K(t,t)}}{m_t} = \frac{\sqrt{v w_{tt} \alpha ^{2t-2}} }{\lambda \alpha ^t} = \frac{\sqrt{v w_{tt} \alpha ^{-2}} }{\lambda } $$

is indeed constant if \(w_{tt}\) is so. This would imply that prior credible intervals have constant relative size w.r.t. their means, see p. 1065 of [40].

The choice (5.61) left the coefficients \(w_{ts}\) unspecified and, indeed in [40], it was emphasized that “the selection of the values of the set of \(w_{ij}\) still remains a relatively difficult task”; one suggestion, inspired by work on smoothing [34], has been to take

$$\begin{aligned} w_{ij} = w^{|i-j|} \quad 0< w < 1 \end{aligned}$$
(5.62)

leading to

$$\begin{aligned} K_{ij} = v \alpha ^{i+j-2} w^{|i-j|}, \end{aligned}$$
(5.63)

which is exactly the DC kernel introduced in Corollary 5.1. It is also interesting to observe that [40] already suggested the use of marginal likelihood to choose the most suitable prior distribution in the class.
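In matrix form, the covariance (5.63) is straightforward to build and check. The sketch below (parameter values are illustrative, not from the text) verifies positive semidefiniteness and the proportionality principle, i.e., that \(std(\theta _t)/m_t\) is constant when \(m_t = \lambda \alpha ^t\).

```python
import numpy as np

def dc_kernel(n, v, alpha, w):
    """DC kernel (5.63): K[i,j] = v * alpha^(i+j-2) * w^|i-j|, with 1-based i, j."""
    i = np.arange(1, n + 1)
    return (v * alpha ** (i[:, None] + i[None, :] - 2)
              * w ** np.abs(i[:, None] - i[None, :]))

n, v, alpha, w, lam = 50, 1.0, 0.8, 0.9, 1.0
K = dc_kernel(n, v, alpha, w)
m = lam * alpha ** np.arange(1, n + 1)   # exponentially decaying prior mean (5.60)
ratio = np.sqrt(np.diag(K)) / m          # std(theta_t) / m_t, should be constant
```

The constant ratio here equals \(\sqrt{v}/(\lambda \alpha )\), in agreement with the display above.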

Of course, postulating a prior mean m introduces a remarkable prejudice in the estimation procedure and requires quite accurate knowledge of the expected \(\theta \). The paper [62], inspired by “smoothing priors” arguments, suggested instead that the prior mean should be zero and that only smoothness conditions on the lags should be enforced; this leads to a zero mean prior, i.e., \(\lambda =0\) in (5.60), with a dth degree smoothing covariance. For instance, for \(d=2\), the prior model can be expressed in terms of the second-order differences:

$$ \beta :=\underbrace{\left[ \begin{array}{cccccc} 1 &{} -2 &{} 1 &{} 0 &{} \dots &{}0 \\ 0 &{} 1 &{} -2 &{}\ddots &{} \vdots &{} 0\\ \vdots &{} \vdots &{} \ddots &{} \ddots &{} \ddots &{} \vdots \\ 0 &{} \dots &{} \dots &{} 1 &{} -2 &{} 1 \end{array}\right] }_{:=S }\theta = S \theta $$

postulating \({\mathscr {E}}\beta \beta ^T = S {\mathscr {E}}\theta \theta ^T S^T =I\).
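Shiller's prior can be sampled once S is made square and invertible; the unit penalties on the initial condition and on the first difference used below match the caption of Fig. 5.10 but are otherwise our assumption, as is the whole sketch.

```python
import numpy as np

def shiller_cov(n):
    """Covariance implied by E[(S theta)(S theta)^T] = I, where S stacks
    second-order differences, augmented (our assumption, cf. Fig. 5.10)
    with unit penalties on theta_1 and on the first difference so that
    the resulting matrix is square and invertible."""
    S = np.zeros((n, n))
    S[0, 0] = 1.0                              # penalty on the initial condition
    S[1, 0:2] = [-1.0, 1.0]                    # penalty on the first difference
    for i in range(2, n):
        S[i, i - 2:i + 1] = [1.0, -2.0, 1.0]   # second-order differences
    Si = np.linalg.inv(S)
    return Si @ Si.T                           # Sigma = S^{-1} S^{-T}

rng = np.random.default_rng(0)
Sigma = shiller_cov(40)
L = np.linalg.cholesky(Sigma)
realizations = (L @ rng.standard_normal((40, 50))).T   # 50 smooth trajectories
```

The realizations are smooth but their variance grows with t (roughly like a doubly integrated white noise), which is precisely the lack of stability visible in Fig. 5.10.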

Fig. 5.10

50 realizations from Shiller’s prior (with penalty on initial condition and first difference). It is clear from the picture that the realizations are smooth, as expected, but certainly do not resemble impulse responses of (stable) linear systems

It is clear from Fig. 5.10 that this prior enforces smoothness in the time domain (and therefore low-pass behaviour in the frequency domain) but provides no guarantee of stability.

5.8.2 Frequency-Domain Smoothing and Stability

The “time-domain” smoothing discussed in the previous section was criticized by Akaike [1], who questioned whether time-domain smoothness conditions would “be the most natural ones”. Akaike suggested that smoothness should instead be enforced in the frequency domain, i.e., considering the frequency response

$$G(e^{j\omega }):= \sum _{k=1}^{n} \theta _k e^{-j\omega k}. $$

To this purpose, the \(L_2\)-norm of the first derivative \(\frac{dG(e^{j\omega })}{d\omega }\) can be considered and we have already seen in (5.25) that one obtains

$$\begin{aligned} \left\| \frac{dG(e^{j\omega })}{d\omega }\right\| ^2 =\sum _{k=1}^n k^2 |\theta _k|^2. \end{aligned}$$
(5.64)

Large values of \(\left\| \frac{dG(e^{j\omega })}{d\omega }\right\| ^2\) can thus be discouraged by using the right-hand side of (5.64) as a penalty, which can be written in the form:

$$ {\mathrm p}(\gamma ,\theta ):= \theta ^T K_\gamma ^{-1} \theta , $$

where

$$\begin{aligned} K_\gamma :=\frac{1}{\gamma } \mathrm{diag}\left\{ 1, \;\frac{1}{4}, \; \frac{1}{9},\dots , \frac{1}{n^2} \right\} . \end{aligned}$$
(5.65)

This is of course equivalent to assuming that the impulse response vector \(\theta \) has a zero-mean normal prior with covariance \(K_\gamma \).

Unfortunately, in the limit \(n\rightarrow \infty \), the covariance function (5.65) does not meet the (more stringent) sufficient conditions of Lemma 5.1; of course rather straightforward extensions include setting penalties on higher-order derivatives, which would result in a faster decay rate of the diagonal elements of (5.65). This is a manifestation of the well-known link between regularity in the frequency domain and decay rate of the impulse response already discussed in Sect. 5.5.
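The identity (5.64) is just Parseval's theorem applied to the sequence \(\{k\theta _k\}\); it can be checked numerically (using the \(1/2\pi \)-normalized \(L_2\) norm), together with its quadratic-form expression \(\theta ^T K_\gamma ^{-1}\theta \) for \(\gamma =1\). The numerical setup below is ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
theta = rng.standard_normal(n) * 0.5 ** np.arange(1, n + 1)
k = np.arange(1, n + 1)

# Right-hand side of (5.64)
rhs = np.sum(k ** 2 * theta ** 2)

# Left-hand side: (1/2pi) * integral over [0, 2pi) of |dG/domega|^2,
# computed exactly by averaging on a uniform grid (G' is a trig polynomial)
omega = np.linspace(0.0, 2 * np.pi, 4096, endpoint=False)
dG = np.exp(-1j * np.outer(omega, k)) @ (-1j * k * theta)
lhs = np.mean(np.abs(dG) ** 2)

# Penalty theta^T K_gamma^{-1} theta with gamma = 1, cf. (5.65)
quad = theta @ (np.diag(k ** 2.0) @ theta)
```

Averaging over a uniform grid is exact here (up to rounding) because \(|dG/d\omega |^2\) is a trigonometric polynomial of degree below the grid size.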

5.8.3 Exponential Stability and Stochastic Embedding

More recently, Gaussian priors for dynamical systems have been considered in the control literature; in particular, a zero-mean Gaussian prior with diagonal and exponentially decaying covariance

$$\begin{aligned} {\mathscr {E}}\theta \theta ^T = K_{\rho ,\alpha }:=\alpha \,\mathrm{diag}\left\{ 1,\; \rho ,\;\rho ^2, \dots , \rho ^{n-1} \right\} \end{aligned}$$
(5.66)

has been proposed in the so-called “stochastic embedding” framework [25, 26]. Let us now briefly introduce the problem: consider an Output Error model of the form

$$ y(t) = \sum _{k=1}^\infty g_k(\theta ) u(t-k) + e(t), $$

where \(g_k(\theta )\), \(\theta \in {\mathbb R}^n\) is a parametric description of the unknown impulse response \(\{g_k\}_{k=1,\ldots ,\infty }\) in the model class \(\mathcal{M}_n(\theta )\). Let \(\hat{\theta }\) be some parametric estimator of \(\theta \), e.g., the PEM estimator

$$\begin{aligned} \hat{\theta }= \mathop {\mathrm{arg\;min}}\limits _\theta \; \sum _{t=1}^N \Vert y(t) - G(z,\theta )u(t)\Vert ^2. \end{aligned}$$
(5.67)

Let now

$$ \hat{G}(z):=G(z,\hat{\theta })= \sum _{k=1}^\infty g_k(\hat{\theta }) z^{-k} $$

be the corresponding estimator of the transfer function \(G(z,\theta )= \sum _{k=1}^\infty g_k(\theta ) z^{-k}\).

In the Model Error Modelling framework,  it is assumed that the “true” transfer function G(z) is only partially captured by the chosen model class \(\mathcal{M}_n(\theta )\) so that

$$\begin{aligned} G(z) = G(z,\theta _0) + \tilde{G}(z) \quad G(z,\theta _0) \in \mathcal{M}_n(\theta ) \end{aligned}$$
(5.68)

and \(\tilde{G}(z)\) represents a model error. The purpose of Model Error Modelling is to obtain a statistical description of the model error, say

$$\tilde{G}(z): = G(z) - \hat{G}(z)$$

which may be used, for instance, to estimate the model order, e.g., the dimension n of the parameter vector \(\theta \). This can be achieved by minimizing an estimate of the MSE

$$ {\mathscr {E}}\Vert G(z) - G(z,\hat{\theta })\Vert ^2 $$

while accounting for the model error model \(\tilde{G}(z)\), see e.g., Eqs. (89)–(92) in [26].

The model error \(\tilde{G}(z)\) is estimated in [26] starting from the least squares residuals \(v_{\hat{\theta }}(t):=y(t) - G(z,\hat{\theta })u(t)\) which, under assumption (5.68), are expected to be described by the model

$$ v(t) = \tilde{G}(z) u(t) + e(t). $$

It is remarkable that [26] proposes to estimate the parameters \(\alpha \) and \(\rho \) that characterize the covariance (5.66) by resorting to marginal likelihood maximization

$$\begin{aligned} (\hat{\alpha }, \hat{\rho }):=\mathop {\mathrm{arg\;max}}\limits _{\alpha ,\rho } \int {\mathrm p}(V_{\hat{\theta }}|\tilde{g}) {\mathrm p}(\tilde{g}|\alpha ,\rho )\, d\tilde{g}, \end{aligned}$$
(5.69)

where \(V_{\hat{\theta }}:=[v_{\hat{\theta }}(1),\ldots ,v_{\hat{\theta }}(N)]\). It is also interesting to observe that the exponential decay of the covariance sequence (5.66) implies a smoothness condition on the frequency response function similar in spirit to that advocated in [1]. This is formalized in the following result, whose proof is in Sect. 5.10.7.
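For a model-error model truncated to an FIR of length n, the marginal likelihood in (5.69) is Gaussian and available in closed form, since \(v = \Phi \tilde{g} + e\) with \(\tilde{g} \sim \mathcal{N}(0, K_{\rho ,\alpha })\) and \(e \sim \mathcal{N}(0, \sigma ^2 I)\) implies \(v \sim \mathcal{N}(0, \Phi K_{\rho ,\alpha }\Phi ^T + \sigma ^2 I)\). The sketch below is ours: the FIR truncation, noise variance and crude grid search merely stand in for the maximization in (5.69).

```python
import numpy as np

def neg2_log_marglik(alpha, rho, Phi, v, sigma2):
    """-2 log p(v | alpha, rho), up to constants, for v = Phi g + e with
    g ~ N(0, K_{rho,alpha}) as in (5.66) and e ~ N(0, sigma2 I)."""
    n = Phi.shape[1]
    K = alpha * np.diag(rho ** np.arange(n))          # covariance (5.66)
    S = Phi @ K @ Phi.T + sigma2 * np.eye(len(v))     # marginal covariance of v
    _, logdet = np.linalg.slogdet(S)
    return logdet + v @ np.linalg.solve(S, v)

# Toy data and a grid search standing in for the optimization in (5.69)
rng = np.random.default_rng(2)
N, n, sigma2 = 200, 20, 0.01
Phi = rng.standard_normal((N, n))
g_true = 0.6 ** np.arange(1, n + 1) * rng.standard_normal(n)  # decaying response
v = Phi @ g_true + np.sqrt(sigma2) * rng.standard_normal(N)
grid = [(a, r) for a in (0.1, 0.5, 1.0) for r in (0.3, 0.6, 0.9)]
alpha_hat, rho_hat = min(grid, key=lambda p: neg2_log_marglik(*p, Phi, v, sigma2))
```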

Lemma 5.5

Let \(\{g_{k,\alpha }\}_{k=0,\ldots ,\infty }\) be a zero-mean Gaussian process with covariance (5.66) and let

$$G_\alpha (e^{j\omega }):=\sum _{k=0}^\infty g_{k,\alpha } e^{-jk\omega } \quad \omega \in [0,2\pi )$$

be its Fourier transform. Then the Lipschitz-like condition

$$\begin{aligned} \begin{array}{c} {\mathscr {E}}[\Vert G_{{\alpha }}(e^{j\omega _1}) - G_{{\alpha }}(e^{j\omega _2})\Vert ^2] \le \frac{c}{1-\alpha } (\omega _1-\omega _2)^2 \quad \quad |\alpha |<1 \end{array} \end{aligned}$$
(5.70)

holds.

5.9 Further Topics and Advanced Reading

Section 1.3 already reported a list of topics and readings on inverse problems, Stein estimators and their link with the Empirical Bayes framework.

The use of regularization and Bayesian priors can probably be dated back to the paper [71], where smoothing ideas were advocated for a denoising problem in the field of Actuarial Science. See also the much later reference [34]. Later developments are essentially impossible to survey in this short section, and we refer the reader to [66] for an early overview of the use of Bayes priors in the context of linear regression; the interested reader may also consult [22, 31, 32, 42, 59], where generalized ridge regression has been proposed to stabilize ill-conditioned inverse problems.

To the best of our knowledge, [40, 62] were the first to use these ideas in the context of dynamical systems, called “distributed-lag” models in these early references. This work was subsequently taken up by Akaike [1] and later on by Kitagawa and Gersch in a series of papers, see e.g., [35, 36], which culminated in the well-known book [37]. The seminal papers by Leamer and Shiller have also been continued by the econometrics community, starting with the work by Doan, Litterman and Sims, see e.g., [18] for an overview and further references. This has led to the so-called “Minnesota prior”, which has been discussed quite extensively in the econometrics literature; several variations and extensions can be found, see for instance [23, 41].

The econometrics literature has since then studied Bayesian procedures for system identification rather intensively, mostly under the acronym Bayesian VARs; the main driving motivation was that of handling high-dimensional time series (i.e., p large, called cross-sectional dimension in the econometrics literature) with possibly many explicative variables (m large), see for instance [2, 17, 23, 38].

The problem of tuning the regularization parameters (or equivalently the hyperparameters describing the prior in a Bayesian setting) has received relatively little attention in the econometrics literature: [40] already suggested the use of Empirical Bayes procedures, while [2, 18] propose tuning the hyperparameters using out-of-sample and in-sample errors, respectively. The paper [38] and the more recent work [23] again adopt an Empirical Bayes approach using the marginal likelihood; [23] claims the superiority of this approach w.r.t. previous “ad hoc” techniques [2, 18].

Despite this long history, the use of Bayesian priors for system identification has gained popularity only in relatively recent times, e.g., see the survey [52]. We believe it is fair to say that the reason for this is that much more effort has recently been devoted to developing prior models tailored to estimating dynamical systems. In the remaining part of the book, these issues will be dealt with in some detail. The reader is referred to [10, 11, 49, 50, 55] for various classes of prior models and to [6, 7, 12, 55] for more details on Maximum Entropy derivations. Extensions include prior models to estimate sparse models for high-dimensional time series [14, 74] as well as classes of priors for nonlinear dynamical models [51], which will be thoroughly discussed in Chap. 8. In particular, the techniques described in this chapter can also be used to identify the so-called dynamic networks, which consist of a large set of interconnected dynamic systems. Modelling such complex physical systems is important in several fields of science and engineering, including biomedicine and neuroscience [27, 30, 46, 56]. Estimation is difficult since such networks are often large scale and their topology is typically unknown [14, 44, 67]. One typically postulates the existence of many connections and then has to understand from data which ones are really active. Since in real physical systems often only a small fraction of the links is active, the estimation process needs to exploit sparsity regularizers such as those introduced in Chap. 3 and their stochastic interpretations, like the Bayesian Lasso [47]. In the context of linear dynamic networks, where modules are defined by impulse responses, many approaches have recently been designed, e.g., relying on local multi-input single-output (MISO) models [16, 19, 45].
Contributions based on variational Bayesian inference and/or nonparametric regularization, deeply connected with the techniques discussed in this book, can be found in [14, 33, 58, 73]. Methods to infer the full network dynamics using (structured) multiple-input multiple-output (MIMO) models can instead be found in [21, 69], with consistency of the estimates analyzed in [57]. A contribution based on the combination of the stable spline kernel and the so-called horseshoe sparsity prior [8, 54, 68] has been developed in [48]. See also [3, 24, 29, 70] for insights on identifiability issues and [28], where compressed sensing is exploited.

5.10 Appendix

5.10.1 Optimal Kernel

Theorem 5.7

The solution \(P^*\) of problem (5.20) is given by

$$\begin{aligned} P^* = \theta _0 \theta _0^T, \end{aligned}$$
(5.71)

where \(\theta _0\) is the “true” impulse response of the data-generating mechanism (5.14).

Proof

The proof will proceed as follows: let us denote by \(\hat{\theta }^{P^*}\) the estimator obtained with \(P = P^*\) as in (5.71). Consider the error

$$ \tilde{\theta }^P :=\theta _0 - \hat{\theta }^P $$

which can be written as

$$ \begin{array}{rcl} \tilde{\theta }^P &{}=&{} \theta _0 - \hat{\theta }^P\\ &{} = &{} \theta _0 - \hat{\theta }^{P^*} + \hat{\theta }^{P^*} - \hat{\theta }^P \\ &{} = &{} \tilde{\theta }^{P^*} + \left( \hat{\theta }^{P^*} - \hat{\theta }^{P}\right) . \end{array} $$

We shall show that the following orthogonality property holds:

$$\begin{aligned} {\mathscr {E}}\tilde{\theta }^{P^*} \left( \hat{\theta }^{P^*}- \hat{\theta }^{P} \right) ^T =0 \end{aligned}$$
(5.72)

so that

$$\begin{aligned} {\mathscr {E}}\tilde{\theta }^P (\tilde{\theta }^P)^T = {\mathscr {E}}\tilde{\theta }^{P^*} (\tilde{\theta }^{P^*})^T + {\mathscr {E}}\left( \hat{\theta }^{P^*} - \hat{\theta }^{P} \right) \left( \hat{\theta }^{P^*} - \hat{\theta }^{P} \right) ^T \end{aligned}$$
(5.73)

and therefore:

$$ M_\theta (P) - M_\theta (P^*) = {\mathscr {E}}\tilde{\theta }^P (\tilde{\theta }^P)^T - {\mathscr {E}}\tilde{\theta }^{P^*} (\tilde{\theta }^{P^*})^T = {\mathscr {E}}\left( \hat{\theta }^{P^*} - \hat{\theta }^{P} \right) \left( \hat{\theta }^{P^*} - \hat{\theta }^{P} \right) ^T \succeq 0 $$

which will prove the claim that \(P^*= \theta _0 \theta _0^T\) is the optimal solution to (5.20).

It now just remains to show that (5.72) holds. To do so, let us rewrite (4.7), assuming \(\mu _{\theta }=0\) and using the matrix inversion lemma (3.145):

$$ \begin{array}{rcl} \hat{\theta }^P &{}=&{} \left( \sigma ^2 I + P \varPhi ^T \varPhi \right) ^{-1} P \varPhi ^T Y\\ &{}=&{} \left( \sigma ^2 I + P \varPhi ^T \varPhi \right) ^{-1} P \varPhi ^T (\varPhi \theta _0 + E) \\ &{}=&{} \left( \sigma ^2 I + P \varPhi ^T \varPhi \right) ^{-1}\left[ \left( P \varPhi ^T \varPhi + \sigma ^2 I - \sigma ^2 I\right) \theta _0 + P \varPhi ^T E\right] \\ &{} = &{} \theta _0 - \left( \sigma ^2 I + P \varPhi ^T \varPhi \right) ^{-1} \left[ \sigma ^2 \theta _0 - P \varPhi ^T E\right] . \end{array} $$

Therefore, the error \(\tilde{\theta }^P:= \theta _0 - \hat{\theta }^P\) can be written in the form:

$$\begin{aligned} \tilde{\theta }^P =\mathop {\underbrace{\left( \sigma ^2 I + P \varPhi ^T \varPhi \right) ^{-1}}}_{:=W_P} \left[ \sigma ^2 \theta _0 - P \varPhi ^T E\right] = W_P \left[ \sigma ^2 \theta _0 - P \varPhi ^T E\right] . \end{aligned}$$
(5.74)

Now, using (5.74), we have:

$$ \hat{\theta }^{P^*} - \hat{\theta }^{P} = \tilde{\theta }^{P} - \tilde{\theta }^{P^*} = \sigma ^2 \left( W_P -W_{P^*} \right) \theta _0 + \left( W_{P^*}P^* -W_{P} P\right) \varPhi ^T E. $$

Now, let us compute

$$\begin{aligned} \begin{array}{rcl} {\mathscr {E}}\left( \hat{\theta }^{P^*} - \hat{\theta }^{P} \right) ( \tilde{\theta }^{P^*}) ^T &{}=&{} \sigma ^4 \left( W_P -W_{P^*} \right) \theta _0 \theta _0^T W_{P^*} ^T- \sigma ^2 \left( W_{P^*}P^* -W_{P} P\right) \varPhi ^T \varPhi P^* W_{P^*}^T \\ &{} = &{}\sigma ^2 \left[ \sigma ^2\left( W_P -W_{P^*} \right) - \left( W_{P^*}P^* -W_{P} P\right) \varPhi ^T \varPhi \right] P^* W_{P^*}^T. \end{array} \end{aligned}$$
(5.75)

If we now use the identity

$$ W_P \left( \sigma ^2 I + P \varPhi ^T \varPhi \right) = I \quad \quad \Rightarrow \quad \quad \sigma ^2 W_P = I - W_P P \varPhi ^T \varPhi $$

we obtain

$$ \sigma ^2\left( W_P - W_{P^*}\right) =\left( W_{P^*} P^* - W_P P\right) \varPhi ^T \varPhi $$

so that, using (5.75),

$$ {\mathscr {E}}\left( \hat{\theta }^{P^*} - \hat{\theta }^{P} \right) ( \tilde{\theta }^{P^*}) ^T =0 $$

which proves (5.72) and thus the theorem.   \(\square \)
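The conclusion of the theorem can be checked numerically. From (5.74), the MSE matrix is available in closed form, \(M_\theta (P) = W_P\left[ \sigma ^4\theta _0\theta _0^T + \sigma ^2 P\varPhi ^T\varPhi P\right] W_P^T\), so one can verify that \(M_\theta (P) - M_\theta (P^*) \succeq 0\) for a generic alternative \(P\). The test problem below is our own construction.

```python
import numpy as np

def mse_matrix(P, Phi, theta0, sigma2):
    """Closed-form MSE matrix of the regularized estimator, from (5.74)."""
    n = len(theta0)
    W = np.linalg.inv(sigma2 * np.eye(n) + P @ Phi.T @ Phi)
    M = sigma2 ** 2 * np.outer(theta0, theta0) + sigma2 * P @ Phi.T @ Phi @ P
    return W @ M @ W.T

rng = np.random.default_rng(3)
n, N, sigma2 = 5, 30, 0.5
Phi = rng.standard_normal((N, n))
theta0 = 0.7 ** np.arange(1, n + 1)
P_star = np.outer(theta0, theta0)    # optimal kernel (5.71)
A = rng.standard_normal((n, n))
P_alt = A @ A.T                      # a generic alternative (PSD) kernel
gap = (mse_matrix(P_alt, Phi, theta0, sigma2)
       - mse_matrix(P_star, Phi, theta0, sigma2))
```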

5.10.2 Proof of Lemma 5.1

Consider the following upper bound on the probability that the \(\ell _1\) norm of \(\theta \) be larger than a given threshold \(T_{\ell _1}\):

$$ {\mathbb P}\left[ \sum _{t=1}^\infty |\theta _t| \ge T_{\ell _1}\right] \le \frac{1}{T_{\ell _1}}{\mathscr {E}}\sum _{t=1}^\infty |\theta _t | = \frac{1}{T_{\ell _1}} {\sum _{t=1}^\infty {\mathscr {E}}|\theta _t| } \le \frac{1}{T_{\ell _1}}\sum _{t=1}^\infty \left( |m_t| + \sqrt{2/\pi } K(t,t)^{1/2}\right) $$

where we have used the triangle inequality together with the equality \({\mathscr {E}}|X| =\sigma \sqrt{2/\pi } \) for \(X\sim \mathcal{N}(0,\sigma ^2)\). Using the hypothesis (5.24), we have that

$$ {\mathbb P}\left[ \sum _{t=1}^\infty |\theta _t| \ge T_{\ell _1}\right] \le \frac{M_{\ell _1} + K_{\ell _1} \sqrt{2/\pi }}{T_{\ell _1}} $$

and therefore

$$ {\mathbb P}\left[ \sum _{t=1}^\infty |\theta _t| < T_{\ell _1}\right] \ge 1-\frac{M_{\ell _1} + K_{\ell _1} \sqrt{2/\pi }}{T_{\ell _1}}. $$

Taking the limit as \(T_{\ell _1}\rightarrow +\infty \) we have

$$ {\mathbb P}\left[ \sum _{t=1}^\infty |\theta _t| < +\infty \right] = 1 $$

which concludes the proof.

5.10.3 Proof of Theorem 5.5

The proof is based on the fact that the Maximum Entropy distribution \({\mathrm p}(\theta )\) under the constraints \({\mathscr {E}}f_k(\theta ) = F_k\) and \({\mathscr {E}}g_k(\theta ) = G_k\) has the “Gibbs” structure, i.e., it is the exponential of a weighted sum of the constraint functionals (see e.g., [15]):

$$ {\mathrm p}(\theta ) \propto e^{- \sum _i \left( \mu _i f_i(\theta )+ \gamma _i g_i(\theta )\right) }. $$

In our case, we have \(f_k(\theta )=\theta _k^2\) and \(g_k(\theta ) = (\theta _{k-1}-\theta _k)^2\), and therefore the max-ent solution has the form

$$\begin{aligned} \begin{array}{rcl} p_{\theta ,ME}(\theta )= & {} C e^{-\frac{1}{2} \left( \mu _1 \theta _1^2 + \sum _{k=2}^n \left[ \mu _k \theta _k^2 + \gamma _k (\theta _{k-1}-\theta _k)^2\right] \right) }. \end{array} \end{aligned}$$
(5.76)

Using a well-known result in graphical models (see e.g., Lauritzen [39]), the variables \(\theta _k\) and \(\{\theta _{k+2},\ldots ,\theta _n\}\) are conditionally independent given \(\theta _{k+1}\), because \(\theta _{k+1}\) is the only neighbour of \(\theta _{k}\) in the graph representing \({\mathrm p}(\theta _k,\theta _{k+1},\ldots ,\theta _n)\) (or, equivalently, \(\theta _{k+1}\) separates \(\theta _k\) from \(\theta _{k+2},\theta _{k+3},\ldots ,\theta _n\)).

In our case, this conditional independence implies that the best linear estimator \(\hat{\theta }_{k-1}\) of \(\theta _{k-1}\) given \( \theta _k, \theta _{k+1},\ldots ,\theta _n\) depends only on \(\theta _{k}\) (i.e., \(\hat{\theta }_{k-1} = a_{B,k} \theta _k \)), so that the vector \(\theta \) admits the backward representation:

$$\begin{aligned} \theta _{k-1} = a_{B,k} \theta _k + w_k \end{aligned}$$
(5.77)

with \(w_k:=\theta _{k-1} - \hat{\theta }_{k-1}=\theta _{k-1} - a_{B,k} \theta _k \) zero mean and uncorrelated with \( \theta _k, \theta _{k+1},\ldots ,\theta _n\). Let us define \(\sigma _k^2: = {\mathscr {E}}w_k^2\). In order to express \(a_{B,k}\) and \(\sigma _k^2\) as functions of \(\lambda _R,\lambda _S,\alpha \), we exploit the constraints (5.31) and the dynamical model (5.77). In particular, we have

$$\begin{aligned} \begin{array}{rcl} \lambda _S \alpha ^{k-1} &{}=&{} {\mathscr {E}}\theta _{k-1}^2\\ &{}=&{} a_{B,k}^2 {\mathscr {E}}\theta _{k}^2 + \sigma _k^2 \\ &{} = &{} a_{B,k}^2 \lambda _S \alpha ^{k} + \sigma _k^2 \end{array} \end{aligned}$$
(5.78)
$$\begin{aligned} \begin{array}{rcl} \lambda _R \alpha ^{k-1} &{}=&{} {\mathscr {E}}(\theta _{k-1} - \theta _k)^2 \\ &{} = &{} {\mathscr {E}}((a_{B,k} -1) \theta _k + w_k)^2 \\ {} &{}=&{} (a_{B,k} -1)^2 {\mathscr {E}}\theta _k^2 + {\mathscr {E}}w_k^2 \\ &{}=&{} (a_{B,k} -1)^2 \lambda _S \alpha ^k + \sigma _k^2. \end{array} \end{aligned}$$
(5.79)

Subtracting (5.79) from (5.78) we obtain

$$ (\lambda _S-\lambda _R) \alpha ^{k-1} = a_{B,k}^2 \lambda _S \alpha ^{k} - (a_{B,k} -1)^2 \lambda _S \alpha ^k = (2 a_{B,k} -1)\lambda _S \alpha ^k $$

which implies that

$$ a_{B,k} = \frac{\lambda _S (1+\alpha ) - \lambda _R}{2 \lambda _S \alpha } =: a_B $$

which is independent of k and is thus denoted by \(a_B\) as in (5.35). From (5.79) we also have that

$$ \sigma _k^2 = \lambda _R \alpha ^{k-1} - (a_{B} -1)^2 \lambda _S \alpha ^k = ( \lambda _R - (a_{B} -1)^2 \lambda _S\alpha ) \alpha ^{k-1} = (1-a_B^2\alpha )\lambda _S\alpha ^{k-1} $$

where the last equality follows after a few manipulations and proves (5.36). Replacing

$$ a_B - 1 = \frac{\lambda _S (1-\alpha ) - \lambda _R}{2 \lambda _S \alpha } $$

in the previous equation we have:

$$ \sigma _k^2 = \left[ \lambda _R - \left( \frac{\lambda _S (1-\alpha ) - \lambda _R}{2 \lambda _S \alpha }\right) ^2 \lambda _S\alpha \right] \alpha ^{k-1}. $$

Of course \(\sigma _k^2\), and thus the right-hand side, must be positive (for simplicity we exclude the singular case \(\sigma _k^2 =0\)):

$$ \lambda _R - \left( \frac{\lambda _S (1-\alpha ) - \lambda _R}{2 \lambda _S \alpha }\right) ^2 \lambda _S\alpha = \frac{4\lambda _R\lambda _S \alpha - \left( \lambda _S (1-\alpha ) - \lambda _R\right) ^2}{4 \lambda _S \alpha } > 0 $$

which in turn is equivalent to

$$ 4\lambda _R\lambda _S \alpha - \left( \lambda _S (1-\alpha ) - \lambda _R\right) ^2 > 0. $$

This happens if and only if

$$ \lambda _R^2 - 2\lambda _R \lambda _S(1+\alpha ) + (1-\alpha )^2 \lambda _S^2 < 0. $$

This is a degree two polynomial in \(\lambda _R\) with two positive roots

$$ \lambda _{R,i} = \lambda _S(1+\alpha ) \pm \sqrt{\lambda _S^2 (1+\alpha ) ^2- \lambda _S^2 (1-\alpha )^2} = \lambda _S(1+\alpha \pm 2 \sqrt{\alpha }) \quad i=1,2 $$

and therefore our problem is feasible if and only if

$$ \lambda _{R,min}=\lambda _{R,1} = \lambda _S(1+\alpha - 2 \sqrt{\alpha })< \lambda _R < \lambda _S(1+\alpha + 2 \sqrt{\alpha }) = \lambda _{R,2} = \lambda _{R,max} $$

thus proving (5.32). Now it remains to prove that (5.76) takes the form (5.34). First let us observe that the exponent of (5.76) is a quadratic form in \(\theta \), and therefore (5.76) can be written in the form

$$ p_{\theta ,ME}(\theta ) = C e^{-\frac{1}{2} \theta ^T \varPhi \theta }. $$

Last, since in (5.76) only products of the form \(\theta _k\theta _h\) with \(h\in \{k-1,k,k+1\}\) appear, the matrix \(\varPhi =\varPhi ^T\) has the following band structure:

$$ \varPhi = \left[ \begin{array}{cccccc} * &{} * &{} 0 &{} \dots &{} \dots &{} 0 \\ * &{} * &{} * &{} 0 &{} \dots &{} 0 \\ 0 &{} * &{} * &{} * &{} 0 &{} \dots \\ \vdots &{} \dots &{} \ddots &{} \ddots &{} \ddots &{} \dots \\ 0 &{} \dots &{} 0 &{} * &{} * &{} * \\ 0 &{} \dots &{} \dots &{} 0 &{} * &{} * \end{array}\right] . $$

In addition, for \(p_{\theta ,ME}(\theta )\) to be a density, \(\varPhi \) needs to be positive definite (otherwise there would be directions along which the density does not decay). Since \(\theta \) admits the backward AR representation (5.77) with \({\mathscr {E}}w_k^2 =\sigma ^2_k>0\), the covariance matrix \(\varSigma = {\mathscr {E}}\theta \theta ^T \) is positive definite and thus \(\varPhi = \varSigma ^{-1}\). To compute the autocovariance function \({\mathscr {E}}\theta _h \theta _k\), we consider the following cases: if \(k=h\), we have

$$ {\mathscr {E}}\theta _h \theta _h = \lambda _S \alpha ^h. $$

If \(k>h\) we have

$$ {\mathscr {E}}\theta _h \theta _k = a_B {\mathscr {E}}\theta _{h+1} \theta _k $$

and iterating the relation we find

$${\mathscr {E}}\theta _h \theta _k = a_B^{k-h} {\mathscr {E}}\theta _k \theta _k = \lambda _S a_B^{k-h} \alpha ^k. $$

Analogously, if \(h>k\) we have

$${\mathscr {E}}\theta _h \theta _k = a_B^{h-k} {\mathscr {E}}\theta _h \theta _h = \lambda _S a_B^{h-k} \alpha ^h. $$

Combining the three cases we obtain

$$ {\mathscr {E}}\theta _k \theta _h = \lambda _Sa_B^{|k-h|} \alpha ^{\max \{k,h\}} $$

proving (5.38).
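The conclusions of this proof can be verified numerically: building \(\varSigma \) from (5.38) for a feasible pair \((\lambda _R,\lambda _S)\), its inverse must be tridiagonal, and under the matching condition \(\lambda _R = \lambda _S(1-\alpha )\) one must recover \(a_B = 1\) (the TC kernel of Corollary 5.1). A sketch with illustrative parameter values:

```python
import numpy as np

lam_S, lam_R, alpha, n = 1.0, 0.3, 0.8, 8
# Feasibility (5.32): lam_S(1 - sqrt(alpha))^2 < lam_R < lam_S(1 + sqrt(alpha))^2
assert lam_S * (1 - np.sqrt(alpha)) ** 2 < lam_R < lam_S * (1 + np.sqrt(alpha)) ** 2

a_B = (lam_S * (1 + alpha) - lam_R) / (2 * lam_S * alpha)       # (5.35)
k = np.arange(1, n + 1)
Sigma = (lam_S * a_B ** np.abs(k[:, None] - k[None, :])
               * alpha ** np.maximum(k[:, None], k[None, :]))    # covariance (5.38)
Phi = np.linalg.inv(Sigma)   # the matrix Phi = Sigma^{-1} of the proof
```

Tridiagonality of \(\varSigma ^{-1}\) is exactly the band structure displayed above, a consequence of the Markov (AR(1)) property of the process.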

5.10.4 Proof of Corollary 5.1

Using the definition (5.39) in Eq. (5.38) we obtain:

$$ {\mathscr {E}}\theta _k \theta _h = \lambda _Sa_B^{|k-h|} \alpha ^{\max \{k,h\}} = \lambda _S \frac{\rho ^{|k-h|}}{\alpha ^{\frac{|k-h|}{2}}}\alpha ^{\max \{k,h\}} = \lambda _S \rho ^{|k-h|} \alpha ^{\frac{k+h}{2}}. $$

In addition, if the matching condition \(\lambda _R = \lambda _S(1-\alpha ) \) is satisfied, then from (5.35) \( a_B = 1 \) and from (5.39) \(\rho = \sqrt{\alpha }\); substituting in (5.40) we obtain

$$ {\mathscr {E}}\theta _k \theta _h= \lambda _S \rho ^{|k-h|} \alpha ^{\frac{k+h}{2}} = \lambda _S \alpha ^{\max \{k,h\}} $$

i.e., the covariance sequence of the well known TC kernel.

5.10.5 Proof of Lemma 5.2

The proof of this lemma is a simple application of the Cauchy–Schwarz inequality. In particular, we have:

$$\begin{array}{rcl} |u_{it}| &{}=&{} \frac{1}{\xi _i} \left| \sum _{s=1}^n [{\mathbf{K}}]_{t,s} u_{is}\right| \le \frac{1}{\xi _i}\sum _{s=1}^n \sqrt{[{\mathbf{K}}]_{t,t}}\sqrt{[{\mathbf{K}}]_{s,s}} |u_{is}|\\ &{} = &{}\frac{1}{\xi _i} \sqrt{[{\mathbf{K}}]_{t,t}}\ \sum _{s=1}^n \sqrt{[{\mathbf{K}}]_{s,s}} |u_{is} |\le \frac{1}{\xi _i} K(t,t)^{1/2} \underbrace{\sqrt{\displaystyle {\sum _{s=1}^n}[{\mathbf{K}}]_{s,s}}}_{=\sqrt{C}< \infty } \underbrace{\sqrt{\displaystyle {\sum _{s=1}^n} {|u_{is}|^2}}}_{=1}, \end{array} $$

where the last inequality follows from the Cauchy–Schwarz inequality and the fact that \(u_{i}\) has 2-norm equal to 1 for all i. The same conclusion clearly holds also in the infinite-dimensional case, i.e., as \(n\rightarrow \infty \), if K(t, s) admits the spectral decomposition

$$ K(t,s)= \sum _{i=1}^{\infty } \xi _i u_{it} u_{is} $$

and the condition \(\sum _{t} K(t,t) = C <\infty \) holds. In particular this latter condition holds true if the more stringent condition \(\sum _{t} K^{1/2}(t,t) <\infty \) in Lemma 5.1 is satisfied.

5.10.6 Proof of Theorem 5.6

The proof follows from the fact that the Maximum Entropy distribution \({\mathrm p}(x)\) under the inequality constraints \({\mathscr {E}}f_i(x) \le \gamma _i\) has the “Gibbs” structure, i.e., it is the exponential of a weighted sum of the constraint functionals (see e.g., [15]):

$$ {\mathrm p}(x) \propto e^{- \sum _i \mu _i f_i(x)}. $$

5.10.7 Proof of Lemma 5.5

Since \(\{g_{k,\alpha }\}_{k=0,\ldots ,\infty }\) is zero mean, then clearly also \(G_\alpha (e^{j\omega })\) is so, i.e., \({\mathscr {E}}G_\alpha (e^{j\omega }) =0\). If we now consider the difference

$$ G_{{\alpha }}(e^{j\omega _1}) - G_{{\alpha }}(e^{j\omega _2}) = \sum _{k=0}^\infty g_{k,\alpha } \left[ e^{-jk\omega _1} -e^{-jk\omega _2}\right] , $$

taking the expected value of the squared norm, and using the fact that \({\mathscr {E}}g_{k,\alpha } g_{h,\alpha } = c\alpha ^k \delta _{k-h}\), we have

$$ {\mathscr {E}}\Vert G_{{\alpha }}(e^{j\omega _1}) - G_{{\alpha }}(e^{j\omega _2})\Vert ^2= \sum _{k=0}^\infty c\alpha ^k \Vert e^{-jk\omega _1} -e^{-jk\omega _2}\Vert ^2. $$

Now, using

$$ \Vert e^{-jk\omega _1} -e^{-jk\omega _2}\Vert ^2 = 2 \left( 1-\cos (k(\omega _1-\omega _2))\right) \le k^2(\omega _1-\omega _2)^2 $$

and the convergence of the series \(\sum _{k} k^2\alpha ^k\) for \(|\alpha |<1\), the thesis follows.

5.10.8 Forward Representations of Stable-Splines Kernels \(\star \)

A major drawback of the backward construction is that it is not straightforward to extend it to an infinite interval, i.e., to let \(n\rightarrow \infty \) in order to consider infinitely long impulse response models \(\{\theta _{k}\}_{k\in {\mathbb N}}\). However, this difficulty can be circumvented by exploiting the “forward” representation of (5.77), which turns out to be again a time-varying AR(1) model. Theorem 5.8 derives the forward AR(1) representation of the maximum entropy process found in Theorem 5.5.

Theorem 5.8

The maximum entropy solution to (5.33) found in Theorem 5.5 admits the forward AR(1) representation

$$\begin{aligned} \theta _{k+1} = a_F \theta _k + w_k \quad \quad k\ge 0 \end{aligned}$$
(5.80)

with zero-mean initial condition such that \( {\mathscr {E}}\theta _0^2 = \lambda _S\), and where

$$\begin{aligned} a_F = \rho \alpha ^{1/2} = a_B \alpha \end{aligned}$$
(5.81)

and \(w_k\) is a sequence of zero mean variables, uncorrelated with the initial condition \(\theta _0\) and such that

$$\begin{aligned} {\mathscr {E}}w_k w_h = \left\{ \begin{matrix} \sigma _{F,k}^2&{} k = h \\ 0 &{} {k\ne h} \end{matrix} \right. \end{aligned}$$
(5.82)

with \( \sigma _{F,k}^2 = \lambda _S \alpha ^{k+1}(1-\rho ^2)\).

Proof

First of all, let us observe that, if \(\theta _k\) admits an AR(1) forward representation of the form (5.80) (with \(w_k\) satisfying (5.82)), then \(a_F\) must satisfy the relation

$$ a_F = {\mathscr {E}}\theta _{k+1} \theta _{k} \left( {\mathscr {E}}\theta ^2_{k}\right) ^{-1}. $$

Using the expression (5.38), we obtain:

$$ {\mathscr {E}}\theta _{k+1} \theta _{k} \left( {\mathscr {E}}\theta ^2_{k}\right) ^{-1} = \lambda _S a_B \alpha ^{k+1} \left( \lambda _S \alpha ^{k}\right) ^{-1} = a_B \alpha $$

and recalling that \(\rho = a_B \alpha ^{1/2}\) we also obtain

$$ a_F = a_B \alpha = \rho \alpha ^{1/2}. $$

In addition, denoting \( \sigma _{F,k}^2: = {\mathscr {E}}w_k^2\),

$$ {\mathscr {E}}\theta _{k+1}^2 = a_F^2 {\mathscr {E}}\theta _{k}^2 + \sigma _{F,k}^2 $$

must hold. Therefore,

$$ \sigma _{F,k}^2 = {\mathscr {E}}\theta _{k+1}^2 - a_F^2 {\mathscr {E}}\theta _{k}^2 = \lambda _S \alpha ^{k+1} - \rho ^2 \alpha \, \lambda _S\alpha ^{k} = \lambda _S \alpha ^{k+1} (1 - \rho ^2). $$

It is also straightforward to verify that, if \(\theta _k\) is generated by (5.80), then

$$ {\mathscr {E}}\theta _{k + \tau } \theta _k = a_F^\tau {\mathscr {E}}\theta _k ^2 = a_F ^\tau \lambda _S \alpha ^{k} = \lambda _S a_B^\tau \alpha ^{k+\tau } \quad \tau >0 $$

which is exactly of the form

$$ {\mathscr {E}}\theta _{h} \theta _k = \lambda _S a_B^{|h-k|} \alpha ^{\mathrm{max}(k,h)} $$

provided \(h = k + \tau \), \(\tau > 0\). This concludes the proof.   \(\square \)
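The forward representation can be checked by exact moment propagation: with \(a_F = \rho \sqrt{\alpha }\) and \(\sigma ^2_{F,k} = \lambda _S\alpha ^{k+1}(1-\rho ^2)\), the recursion \({\mathscr {E}}\theta _{k+1}^2 = a_F^2 {\mathscr {E}}\theta _{k}^2 + \sigma ^2_{F,k}\) must reproduce \({\mathscr {E}}\theta _k^2 = \lambda _S \alpha ^k\), and the cross-covariances must match (5.38). A short sketch with illustrative parameter values:

```python
import numpy as np

lam_S, alpha, rho, n = 1.0, 0.8, 0.7, 12
a_F = rho * np.sqrt(alpha)          # forward coefficient (5.81)
a_B = rho / np.sqrt(alpha)          # backward coefficient, from rho = a_B sqrt(alpha)

# Propagate variances through theta_{k+1} = a_F theta_k + w_k
var = [lam_S]                        # E[theta_0^2] = lam_S
for k in range(n - 1):
    sig_F2 = lam_S * alpha ** (k + 1) * (1 - rho ** 2)   # variance of w_k (5.82)
    var.append(a_F ** 2 * var[-1] + sig_F2)
```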