## 8.1 Nonlinear System Identification

In Sect. 2.2, Eq. (2.2), a model of a dynamical system was defined as a predictor function g that maps past input–output data

\begin{aligned} Z^{t-1}=\{y(t-1),u(t-1),y(t-2),u(t-2),\ldots \} \end{aligned}

to the next output

\begin{aligned} \hat{y}(t|\theta )=g(t,\theta ,Z^{t-1}), \end{aligned}
(8.1)

where $$\theta$$ is a parameter vector that indexes the model. The predictor could possibly also be a nonparametric map belonging to some function class. If g is a nonlinear function of $$Z^{t-1}$$, the model is nonlinear, and inferring it from all the available measurements contained in the training set $$\mathcal{D}_T$$ is the task of Nonlinear System Identification. This is a complex area with a vast and rich literature. One reason for this richness is that very many parameterizations of g have been suggested, each with various proposed estimation methods; see, e.g., the survey [36]. The different parameterizations allow various degrees of prior knowledge about the system to be accounted for, which gives grey-box models with different shades of grey: see the section The Palette of Nonlinear Models in [36].

A typical element of nonlinear models is that a static nonlinearity, $$\zeta (t)=h(\eta (t))$$, can be present somewhere in the structure. Dealing with static nonlinearities is therefore an essential feature of nonlinear identification. See the sidebar “Static Nonlinearities” in [36] and Sect. 8.5.2 for some brief remarks.

If no prior physical knowledge is available, we have a black-box model. Then we need to employ parameterizations for g that are very flexible and can describe any reasonable function with arbitrary accuracy. Typical choices are neural networks or deep nets; see Sect. 8.5.1 for some comments. Alternatively, one can define g nonparametrically as belonging to a certain (possibly infinite-dimensional) function class. This leads to kernel methods, like regularization networks and Gaussian process inference, treated in the next section.

For both grey- and black-box models, nonlinear identification is characterized by considerable structural uncertainty. This typically leads to parametric models with many parameters, and regularization will be a natural and useful tool to handle that. This chapter will discuss typical uses of regularization for various tasks in nonlinear system identification.

## 8.2 Kernel-Based Nonlinear System Identification

Consider the measurements model

\begin{aligned} y(t_i)=f^0(x_i)+ e(t_i), \quad i=1,\dots , N, \end{aligned}
(8.2)

where $$y(t_i)$$ is the system output at instant $$t_i$$, corrupted by the noise $$e(t_i)$$, and $$f^0$$ is the unknown function to be reconstructed. The link with nonlinear system identification is obtained by assuming that $$x_i$$ contains past input and/or output values, i.e.,

\begin{aligned} x_i = [u_{t_i-1} \ u_{t_i-2} \ \ldots \ u_{t_i-m_u} \ y_{t_i-1} \ y_{t_i-2} \ \ldots \ y_{t_i-m_y} ]. \end{aligned}
(8.3)

In this way, the function $$f^0$$ represents a dynamic system. For the sake of simplicity, let $$m=m_u=m_y$$, where m will be called the system memory in what follows. Then, if $$m<\infty$$, a nonlinear ARX (NARX) model is obtained. A nonlinear FIR (NFIR) model is instead obtained when $$x_i$$ contains only past inputs, i.e.,

\begin{aligned} x_i = [u_{t_i-1} \ u_{t_i-2} \ \ldots \ u_{t_i-m}]. \end{aligned}
(8.4)

Now, with these correspondences, we can assume that our nonlinear predictor belongs to a function class $$\mathscr {H}$$ given by an RKHS. Then, given the N couples $$\{x_i,y(t_i)\}$$, the regularization network

\begin{aligned} \hat{f} = \mathop {\mathrm {arg}\,\mathrm {min}}\limits _{f \in \mathscr {H}} \ \sum _{i=1}^N (y(t_i)-f(x_i))^2 + \gamma \Vert f\Vert _{\mathscr {H}}^2 \end{aligned}
(8.5)

implements regularized NARX, with $$f:\mathbb {R}^{2m} \rightarrow \mathbb {R}$$,  or NFIR, with $$f:\mathbb {R}^{m} \rightarrow \mathbb {R}$$.

To obtain the estimate $$\hat{f}$$ we can now exploit Theorem 6.15, i.e., the representer theorem. Since we focus on quadratic loss functions, the results in Sect. 6.5.1 ensure that our system estimate $$\hat{f}$$ not only exists and is unique but is also available in closed form. In particular, let $$Y=[y(t_1), \ldots , y(t_N)]^T$$ and $$\mathbf {K} \in \mathbb {R}^{N \times N}$$ be the kernel matrix such that $$\mathbf {K}_{ij} = K(x_i,x_j)$$. The nonlinear system estimate is then the sum of the N kernel sections centred on the $$x_i$$, i.e.,

\begin{aligned} \hat{f} = \sum _{i=1}^N \ \hat{c}_i K_{x_i} \end{aligned}
(8.6)

with coefficients $$\hat{c}_i$$ contained in the vector

\begin{aligned} \hat{c} = \left( \mathbf {K}+\gamma I_{N}\right) ^{-1}Y, \end{aligned}
(8.7)

with $$I_{N}$$ the $$N \times N$$ identity matrix.
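In code, the closed-form solution (8.6)–(8.7) amounts to solving one linear system of size N. Below is a minimal sketch in Python/NumPy, using the Gaussian kernel as a concrete choice; the function names and the kernel choice are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def gaussian_kernel(X, A, rho):
    """Gaussian kernel exp(-||x - a||^2 / rho) between the rows of X and A."""
    d2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / rho)

def fit_regularization_network(X, Y, gamma, rho):
    """Solve (8.7): c_hat = (K + gamma I)^{-1} Y."""
    K = gaussian_kernel(X, X, rho)
    return np.linalg.solve(K + gamma * np.eye(len(Y)), Y)

def predict(X_new, X, c_hat, rho):
    """Evaluate (8.6): f_hat(x) = sum_i c_hat_i K(x, x_i)."""
    return gaussian_kernel(X_new, X, rho) @ c_hat
```

For NARX or NFIR use, the rows of X are the regressors (8.3) or (8.4) built from past inputs and outputs.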

For future developments, it is useful in the remaining part of this section to cast the connection between regularization in RKHSs and Bayesian estimation in this nonlinear setting. Some strategies for hyperparameter tuning will also be recalled.

### 8.2.1 Connection with Bayesian Estimation of Gaussian Random Fields

First, we recall an important result obtained in the linear setting in Sect. 7.1.4. The starting point was the measurements model

$$y(t_i)=L_i[g^0]+ e(t_i), \quad i=1,\dots , N,$$

with $$g^0$$ denoting the system impulse response and $$L_i[g^0]$$ representing the convolution between $$g^0$$ and the input, evaluated at $$t_i$$. Proposition 7.1 stated that, if $$\mathscr {H}$$ is the RKHS induced by a kernel K, then

\begin{aligned} \hat{g} = \mathop {\mathrm {arg}\,\mathrm {min}}\limits _{g \in \mathscr {H}} \ \sum _{i=1}^N (y(t_i)-L_i[g])^2 + \gamma \Vert g\Vert _{\mathscr {H}}^2 \end{aligned}

is the minimum variance impulse response estimator when the noise e is white and Gaussian, while $$g^0$$ is a zero-mean Gaussian process (independent of e) with covariance proportional to K, i.e.,

$$\mathscr {E} (g^0(t)g^0(s)) \propto K(t,s).$$

So, the choice of K ensures that the probability mass is concentrated on our expected impulse responses. For instance, in previous chapters we have seen that the TC/stable spline class describes time-courses that are smooth and exponentially decaying, with a decay level established by some hyperparameters. A very simple way to understand the prior assumptions introduced in the model is to simulate some curves, which will thus represent some of our candidate impulse responses. As an example, some realizations from the discrete-time TC kernel (7.15), given by $$K(i,j)=\alpha ^{\max (i,j)}$$ with $$\alpha =0.9$$, are reported in the left panels of Fig. 8.1.

Consider the nonlinear scenario with the measurements model given by (8.2) and input locations containing past inputs and outputs. The fundamental difference w.r.t. the linear setting is that the unknown function $$f^0$$ now directly represents the nonlinear input–output relationship. The connection with Bayesian estimation is obtained by thinking of $$f^0$$ as a nonlinear stochastic surface, in particular a zero-mean Gaussian random field. This is a generalization of a stochastic process to general domains: for any set of input locations $$\{x^*_i\}_{i=1}^p$$, the vector $$[f^0(x^*_1) \ \ldots \ f^0(x^*_p)]$$ is jointly Gaussian. In particular, the covariance of this vector is assumed to be proportional to the kernel matrix $$\mathbf {K}$$ whose (i, j)-entry is $$\mathbf {K}_{ij} = K(x^*_i,x^*_j)$$. This corresponds to saying that $$f^0$$ is a zero-mean Gaussian random field with covariance $$\lambda K$$, with $$\lambda$$ a positive scalar, independent of the white Gaussian noises $$e(t_i)$$ of variance $$\sigma ^2$$. Then,

\begin{aligned} \hat{f} = \mathop {\mathrm {arg}\,\mathrm {min}}\limits _{f \in \mathscr {H}} \ \sum _{i=1}^N (y(t_i)-f(x_i))^2 + \gamma \Vert f\Vert _{\mathscr {H}}^2, \quad \gamma =\frac{\sigma ^2}{\lambda } \end{aligned}

turns out to be the minimum variance estimator of the nonlinear system $$f^0$$. In this stochastic scenario, our model assumptions can be better understood by simulating some nonlinear surfaces from the prior. They will represent some of our candidate nonlinear systems. As an example, some realizations from the Gaussian kernel (6.43), given by $$K(x,a)=\exp (-\Vert x-a\Vert ^2/\rho )$$ with $$\rho =1000$$, are reported in the right panels of Fig. 8.1. It is apparent that such a covariance includes just information on the smoothness of the input–output map, i.e., the fact that similar inputs should produce similar system outputs.
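Simulating prior realizations like those of Fig. 8.1 is straightforward to sketch: evaluate the kernel on a grid of input locations, factor the resulting covariance matrix, and multiply by standard Gaussian draws. The snippet below is an illustrative sketch; the grid sizes and the jitter term added for numerical stability are assumptions.

```python
import numpy as np

def sample_field(K, n_samples=3, jitter=1e-6, seed=0):
    """Draw zero-mean Gaussian samples with covariance K (Cholesky-based)."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(K + jitter * np.eye(len(K)))
    return L @ rng.standard_normal((len(K), n_samples))

# Candidate impulse responses: TC kernel K(i, j) = alpha^max(i, j), alpha = 0.9
i = np.arange(1, 51)
K_tc = 0.9 ** np.maximum.outer(i, i)
g_draws = sample_field(K_tc)                    # columns are realizations

# Candidate nonlinear surfaces: Gaussian kernel on a 2-D grid of regressors
u = np.linspace(-10, 10, 15)
G = np.array([[a, b] for a in u for b in u])    # grid of input locations
D2 = ((G[:, None, :] - G[None, :, :]) ** 2).sum(-1)
f_draws = sample_field(np.exp(-D2 / 1000.0))    # reshape to (15, 15) to plot
```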

### 8.2.2 Kernel Tuning

As already discussed, e.g., in Sect. 3.5, even when the structure of a kernel is assigned, the estimator (8.5) typically contains unknown parameters that have to be determined from data. For example, if the Gaussian kernel $$\exp (-\Vert x-a\Vert ^2/\rho )$$ is adopted, the unknown hyperparameter vector $$\eta$$ will contain the regularization parameter $$\gamma$$, the kernel width $$\rho$$ and possibly also the system memory m. We now briefly discuss the estimation of $$\eta$$, just pointing out some natural connections with the techniques illustrated in Sect. 7.2 in the linear scenario.

An important observation is that, when a quadratic loss is adopted, even in the nonlinear setting the estimator (8.5) leads to predictors linear in the data Y. In addition, since we assume data generated according to (8.2), direct noisy measurements of f are available. Hence, the output kernel matrix O used in Sect. 7.2 just reduces to the kernel matrix $$\mathbf {K}$$ computed over the $$x_i$$ where data are collected. In fact, from (8.6) and (8.7) one can see that the predictions $$\hat{y}_i$$, i.e., the estimates of the $$f^0(x_i)$$, are the components of $$\mathbf {K}\hat{c}$$. So, they are collected in the vector

\begin{aligned} \hat{Y}(\eta )= \mathbf {K}(\eta )(\mathbf {K}(\eta )+ \gamma I_N)^{-1}Y. \end{aligned}
(8.8)

Now, consider techniques like SURE and GCV that see $$f^0$$ as a deterministic function so that the randomness in Y derives only from the output noise. Exploiting the same line of reasoning reported in Sects. 3.5.2 and 3.5.3 (see also Sect. 7.2), from (8.8) we see that the influence matrix is given by $$\mathbf {K}(\eta )(\mathbf {K}(\eta )+ \gamma I_N)^{-1}$$. Hence, the degrees of freedom are

\begin{aligned} \text {dof}(\eta ) = {{\,\mathrm{trace}\,}}(\mathbf {K}(\eta )(\mathbf {K}(\eta )+\gamma I_N)^{-1}). \end{aligned}
(8.9)

Then, the SURE estimate of $$\eta$$ is obtained by minimizing the following unbiased estimator of the prediction risk

\begin{aligned} \hat{\eta }= \mathop {\mathrm {arg}\,\mathrm {min}}\limits _{\eta \in \varGamma } \frac{1}{N}\Vert Y-\hat{Y}(\eta )\Vert ^2 +2\sigma ^2\frac{\text {dof}(\eta )}{N} \end{aligned}
(8.10)

while the GCV estimate is

\begin{aligned} \hat{\eta } = \mathop {\mathrm {arg}\,\mathrm {min}}\limits _{\eta \in \varGamma } \ \frac{\Vert Y-\hat{Y}(\eta )\Vert ^2}{(1-\text {dof}(\eta )/N)^2}, \end{aligned}
(8.11)

where we have used $$\varGamma$$ to denote the optimization domain.

If we instead consider the Bayesian framework discussed in the previous subsection, we see $$f^0$$ as a zero-mean Gaussian random field of covariance $$\lambda K$$, with $$\lambda$$ a positive scale factor, independent of the white Gaussian noise of variance $$\sigma ^2$$. Since $$y(t_i)=f^0(x_i)+ e(t_i)$$, following the same reasoning developed in the finite-dimensional context in Sect. 4.4, one obtains that the vector Y is zero-mean Gaussian, i.e.,

$$Y \sim \mathscr {N}(0,Z(\eta ))$$

with covariance matrix

$$Z(\eta ) = \lambda \mathbf {K}(\eta ) + \sigma ^2 I_{N}.$$

Above, the vector $$\eta$$ could, e.g., contain $$\lambda , \sigma ^2, m$$ and also other parameters entering K. Then, the marginal likelihood estimate of $$\eta$$ is easily obtained as

\begin{aligned} \hat{\eta }= \arg \min _{\eta \in \varGamma } \ Y^T Z(\eta )^{-1} Y +\log \det (Z(\eta )). \end{aligned}
(8.12)
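All three tuning criteria share the same ingredients: the kernel matrix, the residuals and the degrees of freedom (8.9). A minimal sketch evaluating the SURE (8.10), GCV (8.11) and marginal likelihood (8.12) objectives for one candidate hyperparameter setting could read as follows; the function name and interface are illustrative assumptions.

```python
import numpy as np

def tuning_objectives(K, Y, gamma, lam, sigma2):
    """Evaluate the SURE, GCV and negative log marginal likelihood
    objectives for a candidate hyperparameter setting."""
    N = len(Y)
    H = K @ np.linalg.inv(K + gamma * np.eye(N))   # influence matrix
    Y_hat = H @ Y                                  # predictions (8.8)
    dof = np.trace(H)                              # degrees of freedom (8.9)
    rss = np.sum((Y - Y_hat) ** 2)
    sure = rss / N + 2 * sigma2 * dof / N          # SURE objective (8.10)
    gcv = rss / (1 - dof / N) ** 2                 # GCV objective (8.11)
    Z = lam * K + sigma2 * np.eye(N)               # marginal covariance of Y
    ml = Y @ np.linalg.solve(Z, Y) + np.linalg.slogdet(Z)[1]   # (8.12)
    return sure, gcv, ml
```

Each objective is then minimized over a grid or by a continuous optimizer; for the Bayesian criterion, $$\gamma$$ is tied to $$\lambda$$ and $$\sigma^2$$ through $$\gamma = \sigma^2/\lambda$$.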

## 8.3 Kernels for Nonlinear System Identification

In the previous section we have cast the kernel-based estimator (8.5) in the framework of nonlinear system identification. We have also provided its Bayesian interpretation and recalled how to estimate the hyperparameter vector $$\eta$$ when the parametric form of K is assigned. But the crucial question now is the regularization design. This is a fundamental issue, initially discussed in Sect. 3.4.2, which in this setting consists of choosing a kernel structure suited to model nonlinear dynamic systems. Two interesting options come from the machine learning literature. The first one is the (already mentioned) Gaussian kernel

$$K(x,a)=e^{\frac{-\Vert x-a\Vert ^2}{\rho }}$$

that can describe input–output relationships just known to be smooth. We have also seen in Sect. 6.6.5 that this model is infinite-dimensional, i.e., its induced RKHS cannot be spanned by a finite number of basis functions. It is also universal, being dense in the space of all continuous functions defined on any compact subset of the regressors’ domain. These appear attractive features when little information on the system dynamics is available.

A second alternative is the polynomial kernel

\begin{aligned} K(x,a)= \left( \langle x, a \rangle _2+1 \right) ^p, \quad p \in \mathbb {N}, \end{aligned}
(8.13)

where $$\langle \cdot , \cdot \rangle _2$$ is the classical Euclidean inner product. In the NFIR case, where the input locations $$x_i \in \mathbb {R}^m$$ are given by (8.4), such a kernel has a fundamental connection with the Volterra representation of nonlinear systems, see, e.g., [35]. In fact, we know from Sect. 6.6.4 that the induced RKHS is not universal but has dimension $$\left( {\begin{array}{c}m+p\\ p\end{array}}\right)$$ and contains all possible monomials up to the pth degree. Hence, the polynomial kernel implicitly encodes truncated discrete Volterra series of the desired order. It avoids the curse of dimensionality since the possibly large number of coefficients need not be computed explicitly, thanks to the implicit encoding of the monomials. In fact, from (8.7) one can see that the estimation complexity, even if cubic in the number N of output data, turns out to be linear in the system memory m and independent of the degree p of nonlinearity.
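The dimension count is easy to verify numerically. The sketch below (illustrative code; the helper name is not from the text) counts the monomials of degree at most p in m variables and checks the count against $$\binom{m+p}{p}$$, while the kernel itself is evaluated in O(m) operations whatever p is.

```python
import numpy as np
from math import comb
from itertools import combinations_with_replacement

def poly_kernel(x, a, p):
    """Polynomial kernel (8.13): (<x, a> + 1)^p, O(m) per evaluation."""
    return (np.dot(x, a) + 1.0) ** p

m, p = 6, 3
# One monomial per multiset of at most p variable indices
n_monomials = sum(1 for k in range(p + 1)
                  for _ in combinations_with_replacement(range(m), k))
assert n_monomials == comb(m + p, p)   # 84, the induced RKHS dimension
```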

### 8.3.1 A Numerical Example

We will consider a numerical example where the Gaussian and the polynomial kernel are used to estimate a nonlinear dynamic system from input–output data.

Consider the NFIR

\begin{aligned} f^0(x_t) = \left( \sum _{i=1}^{80} g_i^0 u_{t-i} \right) - u_{t-2}u_{t-3} - 0.25u^2_{t-4} + 0.25u_{t-1}u_{t-2} + 0.75u^3_{t-3}+0.5\left( u^2_{t-1}+u_{t-1}u_{t-3}+u_{t-2}u_{t-4}\right) \end{aligned}
(8.14)

with nonlinearities taken from [40], while the coefficients $$g_i^0$$ are reported in Fig. 8.2. The inputs are independent Gaussian random variables of variance 4. The measurements model is that reported in (8.2), with the noise e white and Gaussian of variance 4 and independent of u. Such a system is strongly nonlinear: the contribution of the linear part (defined by the $$g_i^0$$) to the output variance is around $$12 \%$$ of the overall variance.

We generate 2000 input–output couples and display them in Fig. 8.3. The first 1000 couples $$\{u_k,y_k\}_{k=1}^{1000}$$ are the identification data, while the other 1000, $$\{u_k,y_k\}_{k=1001}^{2000}$$, form the test set. They are used to assess the performance of an estimator in terms of the prediction fit

\begin{aligned} 100\left( 1 - \left[ \frac{\sum ^{2000}_{k=1001}|y_k-\hat{y}_k|^2 }{\sum ^{2000}_{k=1001}|y_k-\bar{y} |^2}\right] ^{\frac{1}{2}}\right) ,\ \ \bar{y}=\frac{1}{1000}\sum ^{2000}_{k=1001}y_k, \end{aligned}
(8.15)

where the $$\hat{y}_k$$ are the predictions returned by a certain estimator by assuming null initial conditions, i.e., computed by using only $$\{u_k\}_{k=1001}^{2000}$$ and setting to zero the inputs falling outside the test set.
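The fit measure (8.15) can be computed with a few lines; this is a minimal sketch, and the function name is an assumption.

```python
import numpy as np

def prediction_fit(y, y_hat):
    """Prediction fit (8.15): 100 * (1 - ||y - y_hat|| / ||y - y_bar||)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    num = np.sum((y - y_hat) ** 2)
    den = np.sum((y - y.mean()) ** 2)
    return 100.0 * (1.0 - np.sqrt(num / den))
```

A fit of 100 means perfect prediction, while predicting the test-set mean gives a fit of 0.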

First, consider the estimator (8.5) equipped with either the Gaussian or the polynomial kernel with input locations

$$x_i = [u_{t_i-1} \ u_{t_i-2} \ \ldots \ u_{t_i-m}],$$

where the system memory m is seen as a hyperparameter to be estimated from data. Specifically, when using the Gaussian kernel

$$K(x,a)=e^{\frac{-\Vert x-a\Vert ^2}{\rho }}$$

the estimator depends on the unknown hyperparameter vector

$$\eta =[m \ \ \gamma \ \ \rho ],$$

where m is the system memory, $$\gamma$$ is the regularization parameter and $$\rho$$ is the kernel width. Instead, when using the polynomial kernel

$$K(x,a)= \left( \langle x, a \rangle _2+1 \right) ^p, \quad p \in \mathbb {N},$$

we have

$$\eta =[m \ \ \gamma \ \ p],$$

where, in place of $$\rho$$, the third unknown hyperparameter is the polynomial order p. In both cases, we estimate $$\eta$$ by using an oracle. In particular, for a given $$\eta$$, the estimator (8.5) determines $$\hat{f}$$ by using only the identification data, but the oracle has access to the test set and selects the hyperparameter vector that maximizes the prediction fit (8.15). Note that this calibration is quite computationally expensive. In fact, one has to introduce a grid to account for the discrete nature of the system memory m. The polynomial kernel also requires the introduction of another grid for the polynomial order p.

Figure 8.4 reports some test set data (red line) extracted from the last 1000 outputs displayed in the right panel of Fig. 8.3. When adopting the Gaussian kernel, the oracle chooses $$m=4$$. When using the polynomial kernel, it selects $$m=6$$ and sets the polynomial order to $$p=3$$. The top panel of Fig. 8.4 shows the predictions returned by the oracle-based Gaussian kernel (black line). The prediction fit is not very high, equal to $$69.6 \%$$. The bottom panel instead plots the results from the oracle-based polynomial kernel (black line). The prediction fit increases to $$73.5\%$$ but still does not appear satisfactory. Figure 8.5 also reports the MATLAB boxplots of 100 prediction fits returned by the two kernel-based estimators in a Monte Carlo study. At each of the 100 runs, new realizations of the inputs and noises define a new identification and test set. One can see that, on average, the polynomial kernel performs a bit better than the Gaussian kernel, but its mean prediction fit is still only around $$72\%$$.

### 8.3.2 Limitations of the Gaussian and Polynomial Kernel

From (8.14) one can see that the NFIR order is $$m=80$$, while the oracle sets $$m=4$$ and $$m=6$$ when using, respectively, the Gaussian and the polynomial kernel. This introduces a bias in the estimation process that is clearly visible in the predictions reported in Fig. 8.4. Let us try to understand the reasons for this phenomenon.

**Polynomial kernel** First, consider the polynomial kernel. The oracle chooses the correct polynomial order $$p=3$$ to account for the highest-order term $$0.75u^3_{t-3}$$ present in the system. This choice, however, already defines a complex model, since it includes all the monomials up to order 3. In particular, with $$m=6$$ and $$p=3$$ the number of adopted basis functions is

$$\left( {\begin{array}{c}m+p\\ p\end{array}}\right) = \left( {\begin{array}{c}6+3\\ 3\end{array}}\right) = 84,$$

which is quite large considering that 1000 outputs are available. If m is increased to 7, one would implicitly use

$$\left( {\begin{array}{c}m+p\\ p\end{array}}\right) = \left( {\begin{array}{c}7+3\\ 3\end{array}}\right) = 120$$

basis functions. In general, values of m larger than 6 are not acceptable for the oracle: even a careful tuning of the regularization parameter $$\gamma$$ does not permit good control of the estimator’s variance. This is illustrated in Fig. 8.6, which displays the best test set prediction fit that can be obtained by the oracle as a function of the system memory m. The maximum is indeed obtained with $$m=6$$. Instead, the value $$m=80$$ leads to a very small fit, around $$25\%$$, because it introduces an overly complex model with

$$\left( {\begin{array}{c}m+p\\ p\end{array}}\right) = \left( {\begin{array}{c}80+3\\ 3\end{array}}\right) = 91881$$

monomials.

Another reason why the polynomial kernel cannot control the model variance well is the way it (implicitly) regularizes the monomial coefficients. We describe this point through a simple example. A quadratic polynomial kernel is considered, but similar considerations would still hold for larger degrees. Let $$p=2$$, $$x=[u_{t-1} \ \ldots \ u_{t-m}]$$ and $$a=[u_{\tau -1} \ \ldots \ u_{\tau -m}]$$. Exploiting the multinomial theorem, one obtains

\begin{aligned} K(x,a) &= \Big (\sum _{i=1}^m u_{t-i}u_{\tau -i} +1\Big )^2 \\ &= \sum _{i=1}^m u^2_{t-i} u_{\tau -i}^2 +2\sum _{i=2}^m \sum _{j=1}^{i-1} ( u_{t-i} u_{t-j}) (u_{\tau -i}u_{\tau -j}) + 2\sum _{i=1}^m u_{t-i} u_{\tau -i}+1. \end{aligned}

This defines the following diagonalized version of the quadratic polynomial kernel

$$K(x,a) = \sum _{i} \zeta _i \rho _i(x) \rho _i(a),$$

where the $$\rho _i(x)$$ are all the monomials up to degree 2 contained in the following set

\begin{aligned} \Big \{u^2_{t-m},\ldots ,u^2_{t-1},u_{t-m}u_{t-m+1},\ldots ,u_{t-m}u_{t-1},u_{t-m+1}u_{t-m+2}, \\ \ldots ,u_{t-m+1}u_{t-1}, \ldots , u_{t-2}u_{t-1},u_{t-m},\ldots ,u_{t-1},1 \Big \}, \end{aligned}

with the corresponding $$\zeta _i$$ given by

\begin{aligned} \Big \{1,\ldots ,1,2,\ldots ,2,2, \ldots ,2, \ldots , 2,2,\ldots ,2,1 \Big \}. \end{aligned}

According to the RKHS theory described in Sect. 6.3, for any f in the RKHS $$\mathscr {H}$$ induced by such a kernel, one has

$$f(x) = \sum _{i} c_i \rho _i(x), \quad \Vert f\Vert ^2_{\mathscr {H}} = \sum _{i} \frac{c_i^2}{\zeta _i},$$

where all the eigenvalues $$\zeta _i$$ assume value 1 or 2 (most of them are equal to 2). Hence, one can see that the regularizer $$\Vert f\Vert ^2_{\mathscr {H}}$$ does not incorporate any fading memory concept, typical of dynamic systems. In fact, the two coefficients of the monomials $$\{u^2_{t-m},u^2_{t-1}\}$$, or those of the couple $$\{u_{t-m}u_{t-m+1},u_{t-2}u_{t-1}\}$$, are assigned the same penalty. But, similarly to the linear case, one should instead expect that the inputs $$u_{t-i}$$ have less influence on $$y_t$$ as the positive lag i increases.
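The uniform weighting can be checked directly: rebuilding the kernel from its monomials with the weights $$\zeta _i \in \{1,2\}$$ reproduces $$(\langle x,a\rangle +1)^2$$ exactly. A small numerical verification (illustrative code, not from the text):

```python
import numpy as np

def quad_kernel_from_monomials(x, a):
    """Rebuild (<x, a> + 1)^2 as sum_i zeta_i rho_i(x) rho_i(a)."""
    m = len(x)
    val = np.sum(x ** 2 * a ** 2) + 1.0        # squares and constant: zeta = 1
    val += 2.0 * np.sum(x * a)                 # linear monomials: zeta = 2
    for i in range(1, m):                      # cross monomials: zeta = 2
        for j in range(i):
            val += 2.0 * (x[i] * x[j]) * (a[i] * a[j])
    return val

rng = np.random.default_rng(1)
x, a = rng.standard_normal(5), rng.standard_normal(5)
assert np.isclose(quad_kernel_from_monomials(x, a), (np.dot(x, a) + 1.0) ** 2)
```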

**Gaussian kernel** As in the case of the polynomial model, one of the limitations of the Gaussian kernel $$K(x,a)=\exp (-\Vert x-a\Vert ^2/\rho )$$ in modelling nonlinear systems is that it does not include any fading memory concept. Hence, the inputs $$\{u_{t-1},u_{t-2},\ldots ,u_{t-m}\}$$ included in the input location are expected to have the same influence on $$y_t$$. This can also be appreciated through the Bayesian interpretation of regularization, e.g., by inspecting the system realizations generated by the Gaussian kernel reported in the right panels of Fig. 8.1.

Still adopting a stochastic viewpoint, another drawback is that the covariance $$\exp (-\Vert x-a\Vert ^2/\rho )$$ describes stationary processes, implying that the variance of $$f^0(x)$$ does not depend on the input location. This is now illustrated in the one-dimensional case where $$x \in \mathbb {R}$$ and the kernel models a static nonlinear system $$f^0(u)$$, i.e., the (noiseless) output y depends only on a single input value u. The left panel of Fig. 8.7 plots some realizations from $$\exp (-(u-a)^2/500)$$. They can be poor nonlinear system candidates, since a nonlinear system, like that reported in (8.14), often also contains a linear component. For this reason, it can be useful to enrich the model with a linear kernel. Its effect can be appreciated by looking at the realizations plotted in the right panel of Fig. 8.7, which are now drawn by using $$\exp (-(u-a)^2/500)+ua/400$$ as covariance.

The fact that the predictive capability of a nonlinear model can improve considerably by adding a linear component can also be understood by considering Theorem 6.15 (representer theorem). Using only a Gaussian kernel, the estimate $$\hat{f}$$ of the nonlinear system returned by (8.5) is the sum of N Gaussian functions centred on the $$x_i$$. Hence, in the regions where no data are available, the function $$\hat{f}$$ just decays to zero, and this can lead to poor predictions when, e.g., a linear component is present in the system. This phenomenon is illustrated in the left panel of Fig. 8.8. In this case, the prediction performance can be greatly enhanced by adding a linear kernel, whose results are visible in the right panel of the same figure.

### 8.3.3 Nonlinear Stable Spline Kernel

We will build a kernel $$\mathscr {K}$$ for nonlinear system identification, namely the nonlinear stable spline kernel, by exploiting what has been learnt from the previous example. To simplify the exposition, we consider the NFIR case, but all the ideas developed here can be immediately extended to NARX models, as discussed at the end of this section.

First, it is useful to define $$\mathscr {K}$$ as the sum of a linear and a nonlinear kernel, i.e.,

\begin{aligned} \mathscr {K}(x_i,x_j) = \lambda _{L} x_i^T P x_j + \lambda _{NL} K(x_i,x_j), \end{aligned}
(8.16)

where the input locations are here seen as column vectors, i.e.,

$$x_i = [u_{t_i-1} \ u_{t_i-2} \ \ldots \ u_{t_i-m}]^T,$$

$$P \in \mathbb {R}^{m \times m}$$ is a symmetric positive semidefinite matrix that models the impulse response of the system’s linear part, while K describes the nonlinear dynamics. Note that the two scale factors $$\lambda _{L}$$ and $$\lambda _{NL}$$ are unknown hyperparameters that balance the contributions of the linear and nonlinear parts to the output.

As for P, such a matrix can be defined by resorting to the class of stable kernels developed in the previous chapters. In particular, using the TC/stable spline kernel, the (a, b)-entry of P is

\begin{aligned} P_{ab} =\alpha _{L}^{\max {(a,b)}}, \quad 0 \le \alpha _{L} <1, \quad a=1,\ldots ,m, \ b=1,\ldots ,m, \end{aligned}
(8.17)

where $$\alpha _L$$ determines the decay rate of the impulse response governing the linear dynamics.

As for K, we will define it by modifying the classical Gaussian kernel in order to include fading memory concepts. Following the same ideas underlying the TC kernel, we encode the information that $$u_{t-i}$$ is expected to have less influence on $$y_t$$ as i increases by defining

\begin{aligned} K(x_i,x_j)= \exp \Big ( -\sum _{k=1}^m \ \alpha _{NL}^{k-1} \frac{(u_{t_i-k}-u_{t_j-k})^2}{\rho } \Big ), \quad 0 < \alpha _{NL} \le 1. \end{aligned}
(8.18)

The additional hyperparameter $$\alpha _{NL}$$ encodes the information that the influence of past inputs decays exponentially to zero. To understand how this kernel models the nonlinear surface, and how different values of $$\alpha _{NL}$$ can describe different system features, we can use the Bayesian interpretation of regularization. In particular, consider an example with $$m=2$$, so that the components of $$x_i$$ are $$u_{t_i-1}$$ and $$u_{t_i-2}$$, and let the system $$f^0$$ be a zero-mean Gaussian random field with covariance given by (8.18) with $$\rho =1000$$. If $$\alpha _{NL}=1$$, we recover the Gaussian kernel. Hence, before seeing any data, $$u_{t_i-1}$$ and $$u_{t_i-2}$$ are expected to have the same influence on the system output. This can be appreciated by drawing some realizations from such a random field, e.g., see the top panel of Fig. 8.9 (or the right panels of Fig. 8.1).

With $$\alpha _{NL}$$ very close to zero, the output depends mainly on $$u_{t_i-1}$$, i.e.,

$$K(x_i,x_j)\approx \exp \Big ( -\frac{(u_{t_i-1}-u_{t_j-1})^2}{\rho } \Big ).$$

This can be appreciated by looking at the realization in the middle panel of Fig. 8.9 obtained with $$\alpha _{NL}=0.001$$. One can see that, for fixed $$u_{t_i-1}$$, changes in $$u_{t_i-2}$$ do not produce appreciable variations in the function value. If the value of $$\alpha _{NL}$$ is now increased, the input value $$u_{t_i-2}$$ starts playing a role. This is visible in the bottom panel where the realization is now generated by using $$\alpha _{NL}=0.1$$.
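Putting (8.16)–(8.18) together, the whole kernel can be sketched in a few lines; this is an illustrative implementation, and the function name and interface are assumptions.

```python
import numpy as np

def nss_kernel(xi, xj, lam_L, lam_NL, alpha_L, alpha_NL, rho):
    """Nonlinear stable spline kernel (8.16): TC-regularized linear part
    (8.17) plus fading-memory Gaussian part (8.18).
    xi, xj are input locations [u_{t-1}, ..., u_{t-m}]."""
    m = len(xi)
    k = np.arange(1, m + 1)
    P = alpha_L ** np.maximum.outer(k, k)          # TC kernel matrix (8.17)
    linear = xi @ P @ xj
    w = alpha_NL ** (k - 1)                        # exponentially decaying weights
    nonlinear = np.exp(-np.sum(w * (xi - xj) ** 2) / rho)   # (8.18)
    return lam_L * linear + lam_NL * nonlinear
```

With $$\alpha _{NL}=1$$ the nonlinear part reduces to the ordinary Gaussian kernel, while $$\alpha _{NL}$$ close to zero makes the kernel depend essentially on $$u_{t-1}$$ only, matching the realizations of Fig. 8.9.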

The nonlinear stable spline kernel also enjoys a computational advantage. Using classical machine learning kernels, like the Gaussian or polynomial ones, the choice of the dimension m of the input space is a delicate issue. It requires discrete tuning, as encountered in classical linear system identification when estimating, e.g., the FIR or ARX order, and this can be computationally expensive. In the case of the polynomial kernel, another discrete parameter is the polynomial order p, which requires an additional grid. By introducing stability/fading memory hyperparameters, one can instead set m to a large value, increasing the flexibility of the estimator. Then, the estimation of $$\alpha _{L}$$ and $$\alpha _{NL}$$ from data permits controlling the “effective” dimension of the regressor space in a continuous manner. In light of the continuous nature of the optimization domain, one needs to solve only one optimization problem, involving, e.g., SURE (8.10), GCV (8.11) or Empirical Bayes (8.12).

Finally, as already mentioned, the extension to NARX models is very simple. Let $$x_i=[a_i^T \ b_i^T]^T$$ with

$$a_i = [u_{t_i-1} \ u_{t_i-2} \ \ldots \ u_{t_i-m}]^T, \quad b_i = [y_{t_i-1} \ y_{t_i-2} \ \ldots \ y_{t_i-m}]^T.$$

Then, the kernel (8.16) can be modified as follows

\begin{aligned} \mathscr {K}(x_i,x_j) = \lambda _{a} a_i^T P_a a_j + \lambda _b b_i^T P_b b_j + \lambda _{c} K_c(a_i,a_j)K_d(b_i,b_j) \end{aligned}
(8.19)

with the matrices $$P_a$$ and $$P_b$$ defined by the TC kernel (8.17), with possibly different decay rates $$\alpha _{L}$$, and the nonlinear kernels $$K_c$$ and $$K_d$$ defined by (8.18), with possibly different decay rates $$\alpha _{NL}$$. A possible variation is

\begin{aligned} \mathscr {K}(x_i,x_j) = \lambda _{a} a_i^T P_a a_j + \lambda _b b_i^T P_b b_j + \lambda _{c} K_c(a_i,a_j)+ \lambda _{d} K_d(b_i,b_j), \end{aligned}
(8.20)

where the nonlinear dynamics are no longer the product, as in (8.19), but instead the sum of nonlinear functions which depend on either past inputs or past outputs. In fact, recall from Theorem 6.6 that sums and products of kernels induce well-defined RKHSs containing, respectively, sums and products of functions belonging to the spaces associated with the single kernels.

### 8.3.4 Numerical Example Revisited: Use of the Nonlinear Stable Spline Kernel

Let us now reconsider the numerical example where the nonlinear system (8.14) is used to generate the identification and test data reported in Fig. 8.3. Now, we use the estimator (8.5) equipped with the nonlinear stable spline kernel (8.16). The system memory is set to $$m=100$$. Hence, we let $$\alpha _{L}$$ and $$\alpha _{NL}$$ determine from data which past inputs mostly influence the output through the linear and nonlinear system parts, respectively. In particular, the hyperparameter vector $$\eta =[ \lambda _{L} \ \lambda _{NL} \ \alpha _{L} \ \alpha _{NL} \ \rho ]$$ is estimated via marginal likelihood maximization using the 1000 input–output training data.

Figure 8.10 shows the same test set data (red line) reported in Fig. 8.4, extracted from the last 1000 outputs visible in the right panel of Fig. 8.3. The predictions (black line) returned by the nonlinear stable spline kernel are now very close to the truth. The prediction fit is around $$90\%$$. Comparing these results with those in Fig. 8.4, one can see that the prediction performance is much better than that of the Gaussian and polynomial kernels. Recall also that those two estimators tune complexity by using an oracle that is not implementable in practice. Figure 8.11 also plots the MATLAB boxplots of the 100 prediction fits returned, after a Monte Carlo study of 100 runs, by these two oracle-based estimators, already present in Fig. 8.5, and by nonlinear stable spline. One can see that the use of a regularizer that accounts for dynamic system features largely improves the prediction fits.

## 8.4 Explicit Regularization of Volterra Models

In what follows, we use $$\mathrm {C}(k,m)$$ to indicate the number of ways one can form the nonnegative integer k as the sum of m nonnegative integers. This is the same problem as distributing k objects into m groups (some groups may get zero objects). By combinatorial theory, we have

\begin{aligned} \mathrm {C}(k,m) = \binom{k+m-1}{m-1}. \end{aligned}
(8.21)
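The closed form (8.21) is easy to check numerically against brute-force enumeration; a minimal sketch (the function names are ours):

```python
from itertools import product
from math import comb

def C(k, m):
    """C(k, m): ways to write k as an ordered sum of m nonnegative integers."""
    return comb(k + m - 1, m - 1)

def C_brute(k, m):
    """Brute-force count of exponent vectors (beta_1, ..., beta_m) summing to k."""
    return sum(1 for beta in product(range(k + 1), repeat=m) if sum(beta) == k)

for k in range(1, 4):
    for m in range(1, 5):
        assert C(k, m) == C_brute(k, m)

print(C(2, 3))  # 6 distinct quadratic monomials in 3 variables
```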

We adopt the model description (8.2) and seek a simple representation for the model f(x). For notational simplicity, assume that f is scalar valued with past inputs only, i.e., $$(m_y=0, m_u=m)$$, with input location x given by (8.4). A straightforward idea is to mimic a polynomial Taylor expansion

\begin{aligned} f(x) =\sum _{k=1}^{p} g_kx^k. \end{aligned}
(8.22)

This innocent-looking function expansion is in fact a bit more complex than it looks. The kth power of the m-row vector x is to be interpreted as a $$\mathrm {C}(k,m)$$-dimensional column vector, with each element being a monomial of the m components x(i) of x whose exponents sum to k:

\begin{aligned} \alpha _r ^{(k)}= x(1)^{\beta (k,1)}x(2)^{\beta (k,2)}\cdots x(m)^{\beta (k,m)} \end{aligned}
(8.23a)
\begin{aligned} \beta (k,\ell ) \ \text {nonnegative, such that}\;\sum _{\ell =1}^m \beta (k,\ell ) = k \end{aligned}
(8.23b)
\begin{aligned} r=1,2, \ldots , \mathrm {C}(k,m). \end{aligned}
(8.23c)

In (8.22) $$g_k$$ is to be interpreted as a row vector with $$\mathrm {C}(k,m)$$ elements

\begin{aligned} g_k=[g_k^{(1)},\ldots ,g_k^{(\mathrm {C}(k,m))}]. \end{aligned}
(8.23d)

The response f(x) is thus made of $$d(p,m)= \sum _{k=1}^p \mathrm {C}(k,m)$$ contributions (“impulse responses”), one from each of the nonlinear combinations of past inputs

\begin{aligned} \alpha _r ^{(k)}= u_{t-1}^{\beta (k,1)}u_{t-2}^{\beta (k,2)}\cdots u_{t-m}^{\beta (k,m)} \end{aligned}
(8.23e)
\begin{aligned} r=1,\ldots ,\mathrm {C}(k,m),\quad k=1,\ldots ,p. \end{aligned}
(8.23f)
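The monomial features (8.23e) can be enumerated directly, e.g., with itertools; the ordering of the index r below is an arbitrary choice:

```python
from itertools import combinations_with_replacement
from math import comb, prod

def volterra_features(u_past, k):
    """All degree-k monomials of u_past = [u(t-1), ..., u(t-m)], cf. (8.23e)."""
    m = len(u_past)
    return [prod(u_past[i] for i in idx)
            for idx in combinations_with_replacement(range(m), k)]

u_past = [2.0, 3.0]                  # m = 2, illustrative values
print(volterra_features(u_past, 2))  # [4.0, 6.0, 9.0]: u1^2, u1*u2, u2^2
assert len(volterra_features(u_past, 2)) == comb(2 + 2 - 1, 2 - 1)
```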

This expansion of the model (8.22) is the Volterra Model discussed, e.g., by [7, 35]. It has d(p, m) parameters. The reader may recognize this as an explicit treatment of the polynomial kernel (8.13) which does not exploit any implicit encoding of basis functions and, hence, does not exploit the kernel trick described in Remark 6.3. This also has some connections with the explicit regularization approaches for linear system identification discussed in Sect. 7.4.4 using, e.g., Laguerre functions.

So, this model has memory length m and polynomial order p. As $$p\rightarrow \infty$$, f(x) in (8.22), possibly with the addition of a constant function, can approximate any (“reasonable”) function arbitrarily well. This universal approximation property is of course very valuable for black-box models and created considerable interest in Volterra models. However, it is easy to see that the number d(p, m) of parameters $$g_k$$ increases very rapidly with m and p, and that high-order polynomials in the observed signals may create numerically ill-conditioned calculations. Hence, Volterra models have not been used much in practical identification problems, except for small values of m and p.
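A quick computation of $$d(p,m)$$ illustrates how rapidly the parameter count grows (the values of m and p below are arbitrary):

```python
from math import comb

def d(p, m):
    """Total number of Volterra parameters, d(p, m) = sum_{k=1}^p C(k, m)."""
    return sum(comb(k + m - 1, m - 1) for k in range(1, p + 1))

for m in (5, 10, 50):
    print(m, [d(p, m) for p in (1, 2, 3)])  # grows fast in both m and p
```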

A remedy for the large number of parameters and ill-conditioned numerics is clearly to use regularization. In [4] it is discussed how to regularize the Volterra model to make it a practical tool. In short, the idea is the following, illustrated for a small example with $$p=2$$.

We write the model, adding also a scalar $$g_0$$ that accounts for a constant component in the output, so that one has

\begin{aligned} y(t)&= g_0 + g_1^T\varphi (t) +\varphi ^T(t)G_2\varphi (t) \end{aligned}
(8.24a)
\begin{aligned}&\varphi ^T(t)=[u(t-1), u(t-2),\ldots , u(t-m)] \end{aligned}
(8.24b)
\begin{aligned}&g_1=\theta _1 \quad m- \text {dimensional column vector} \end{aligned}
(8.24c)
\begin{aligned}&G_2 \; m\times m\; \text {symmetric matrix}, \end{aligned}
(8.24d)

where the matrix $$G_2$$ is formed from the elements $$g_2^{(1)}, g_2^{(2)}, \ldots$$ in the expansion (8.22)–(8.23e).

The regularized estimate can now be formed from the criterion

\begin{aligned} \hat{\theta }^{\text {R}}=\mathop {\mathrm {arg}\,\mathrm {min}}\limits _{\theta } \Vert Y-\varPhi _N^T\theta \Vert ^2 + \theta ^TD\theta \end{aligned}
(8.25)

with

\begin{aligned} \theta = [g_0,\theta _1^T,\theta _2^T]^T \end{aligned}
(8.26)

and $$\theta _2$$ is an $$m(m+1)/2$$-dimensional column vector made up from $$G_2$$, and Y is the vector of observed outputs y(t) with $$t=1,\ldots ,N$$. The regression matrix $$\varPhi _N$$ is formed from the components of $$\varphi (t)$$ in the obvious way. It is natural to decompose the regularization matrix accordingly:

\begin{aligned} D= \begin{bmatrix} d_0&0&0\\ 0&D_1&0\\ 0&0&D_2 \end{bmatrix} \end{aligned}
(8.27)

and treat the regularization of the constant term ($$d_0$$), the linear term ($$D_1$$) and the quadratic term ($$D_2$$) in (8.24a) separately. As discussed in Chap. 5, a natural choice of regularization matrices is to let them reflect prior information about the corresponding parameters. This means that $$d_0$$ can be taken as any suitable scalar. The $$\theta _1$$ vector for the first-order term describes a regular linear impulse response, and its prior can be taken as, e.g., the DC kernel reported in (5.40), i.e.,

\begin{aligned} P_1(i,j) = c \cdot e^{-\alpha |i-j|} e^{-\beta \frac{(i+j)}{2}}. \end{aligned}
(8.28)
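As a numerical illustration, the matrix $$P_1$$ in (8.28) is straightforward to build; the hyperparameter values below are illustrative assumptions:

```python
import numpy as np

def dc_kernel(m, c=1.0, alpha=0.5, beta=0.4):
    """DC kernel (8.28) over lags i, j = 1, ..., m."""
    i = np.arange(1, m + 1)
    I, J = np.meshgrid(i, i, indexing="ij")
    return c * np.exp(-alpha * np.abs(I - J)) * np.exp(-beta * (I + J) / 2)

P1 = dc_kernel(10)
# symmetric, with prior variances P1[i, i] decaying with the lag (a stability prior)
print(np.allclose(P1, P1.T), P1[0, 0] > P1[-1, -1])
```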

For the second-order model $$\theta _2$$ it is natural to treat the second-order nonlinear term in the Volterra expansion as a two-dimensional surface, described by two time-indices $$\tau _1$$ and $$\tau _2$$, so that the parameter at $$\tau _1,\tau _2$$ is the contribution to the Volterra sum from $$u(t-\tau _1)\cdot u(t-\tau _2)$$. This is illustrated in Fig. 8.12. The prior value of this contribution can be formed as the product of two kernels built up from responses in a coordinate system $$\mathscr {U,V}$$ obtained after an orthonormal coordinate transformation, corresponding to a rotation by $$45^\circ$$ of the original $$\tau _1,\tau _2$$-plane:

\begin{aligned} P_2(i,j)&= c_2P_{\mathscr {V}}(i,j)P_{\mathscr {U}}(i,j) \end{aligned}
(8.29)
\begin{aligned} P_{\mathscr {V}}(i,j)&=e^{-\alpha _{\mathscr {V}}\left| |\mathscr {V}_i|-|\mathscr {V}_j|\right| } e^{-\beta _{\mathscr {V}}\frac{\left| |\mathscr {V}_i|+|\mathscr {V}_j|\right| }{2}} \end{aligned}
(8.30)
\begin{aligned} P_{\mathscr {U}}(i,j)&=e^{-\alpha _{\mathscr {U}}\left| |\mathscr {U}_i|-|\mathscr {U}_j|\right| } e^{-\beta _{\mathscr {U}}\frac{\left| |\mathscr {U}_i|+|\mathscr {U}_j|\right| }{2}}, \end{aligned}
(8.31)

where $$\mathscr {U}_i$$ and $$\mathscr {V}_i$$ refer to the coordinates in the new system. The corresponding prior distribution is depicted in Fig. 8.12. As desired, it is smooth and decays to zero in all directions. The coordinate change is useful to make the surface smooth over critical border lines.
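One plausible numerical reading of (8.29)–(8.31), with illustrative hyperparameter values, is the following sketch: map each entry of $$\theta _2$$ to its lag pair $$(\tau _1,\tau _2)$$, rotate by $$45^\circ$$, and multiply DC-type kernels in the rotated coordinates.

```python
import numpy as np

m = 5
# one theta_2 entry per unordered lag pair (tau1 <= tau2)
pairs = [(t1, t2) for t1 in range(1, m + 1) for t2 in range(t1, m + 1)]
# rotated (orthonormal) coordinates of each pair
U = np.array([(t1 + t2) / np.sqrt(2) for t1, t2 in pairs])
V = np.array([(t1 - t2) / np.sqrt(2) for t1, t2 in pairs])

def dc_like(z, alpha, beta):
    """DC-type kernel in the absolute value of a rotated coordinate."""
    a = np.abs(z)
    return (np.exp(-alpha * np.abs(a[:, None] - a[None, :]))
            * np.exp(-beta * (a[:, None] + a[None, :]) / 2))

c2 = 1.0
P2 = c2 * dc_like(V, 0.5, 0.4) * dc_like(U, 0.5, 0.4)
print(P2.shape)  # (15, 15) for m = 5, i.e., m(m+1)/2 lag pairs
```

The resulting matrix is symmetric and positive semidefinite, since it is a Schur product of valid kernels.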

This regularization was deployed in [36], section “Example 5(a) Black-Box Volterra Model of the Brain”. Thanks to the regularization, quite useful results were obtained with a model with 594 parameters. An extension of regularized Volterra models, based on a similar idea, is treated in [41], which also provides an EM algorithm to estimate the hyperparameters in the regularization matrices. Another development, where the ideas developed in [4] are coupled with the implicit encoding of kernels, can be found in [8].
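Note also that, for fixed D, the criterion (8.25) has the closed-form minimizer $$\hat{\theta }^{\text {R}}=(\varPhi _N\varPhi _N^T+D)^{-1}\varPhi _N Y$$. A minimal numerical sketch with illustrative dimensions, synthetic data and a generic diagonal D:

```python
import numpy as np

rng = np.random.default_rng(0)
n_theta, N = 6, 200
Phi_N = rng.standard_normal((n_theta, N))        # one regressor column per sample
theta_true = rng.standard_normal(n_theta)
Y = Phi_N.T @ theta_true + 0.1 * rng.standard_normal(N)

D = 0.5 * np.eye(n_theta)                        # block-diagonal in general, cf. (8.27)
theta_hat = np.linalg.solve(Phi_N @ Phi_N.T + D, Phi_N @ Y)

def crit(th):
    """The regularized criterion (8.25)."""
    return np.sum((Y - Phi_N.T @ th) ** 2) + th @ D @ th

# the closed-form solution beats any perturbed parameter vector
print(crit(theta_hat) < crit(theta_hat + 0.01))
```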

## 8.5 Other Examples of Regularization in Nonlinear System Identification

### 8.5.1 Neural Networks and Deep Learning Models

There are many other universal approximators for nonlinear systems f(x) than those based on kernels or on the explicit Volterra model (8.22). The most common ones are various neural network models (NNMs), see, e.g., [12, 23]. They use simple nonlinearities connected in more or less complex networks. The parameters are the weights in the connections as well as characterizations of the nonlinearities. Like Volterra models, they are capable of approximating any reasonable system arbitrarily well given sufficiently many parameters. This means that an NNM typically has many parameters. In simple applications there could be hundreds of parameters, but some applications, especially so-called deep models, could have tens of thousands of parameters [18], see also [9, 11, 13, 43] for deep NARX and state-space models. Even if benign overfitting has sometimes been observed for overparametrized models [3, 19, 30], in general regularization is a very important tool also for estimating such models. Hence, many tricks are typically included in the estimation/minimization schemes.

$$\ell _2$$, $$\ell _1$$ penalties These include the traditional weighted $$\ell _2$$ and $$\ell _1$$ norm penalties that we discuss in this book, see, e.g., Sect. 3.6. For example, all estimation algorithms in the System Identification Toolbox [22] are equipped with optional weighted $$\ell _2$$-regularization, also when NNMs are estimated.

Early termination It is common to monitor not only the fit to estimation data in the minimization process, but also how well the current model fits a validation data set. Then the minimization is terminated when the fit to validation data no longer improves, even when the estimation criterion value keeps improving. This early termination technique is in fact equivalent to traditional regularization, as shown in [38].
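A toy sketch of the mechanism, with illustrative data, model and step size: gradient descent on a least-squares criterion, keeping the parameters that best fit a held-out validation set and terminating once the validation fit stalls.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 10, 40
X_est = rng.standard_normal((N, n))
X_val = rng.standard_normal((N, n))
theta_true = rng.standard_normal(n)
y_est = X_est @ theta_true + 0.5 * rng.standard_normal(N)
y_val = X_val @ theta_true + 0.5 * rng.standard_normal(N)

theta = np.zeros(n)
step = 1e-3
best_val, best_theta = np.inf, theta.copy()
for it in range(5000):
    # one gradient step on the estimation criterion ||y_est - X_est theta||^2
    theta -= step * 2 * X_est.T @ (X_est @ theta - y_est)
    val = np.mean((y_val - X_val @ theta) ** 2)
    if val < best_val:
        best_val, best_theta = val, theta.copy()
    elif it > 50:
        break  # validation fit no longer improves: terminate early

print(best_val)
```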

Dropout or Dilution A special technique common in (deep) learning with NNMs is to curb the flexibility of the model by ignoring (dropping) randomly chosen nodes in the network. This is of course a way to prevent the model from overfitting, and it provides regularization of the estimation just as the other methods in this book, but by a quite different technique. See, e.g., [17, 28] for more details.

### 8.5.2 Static Nonlinearities and Gaussian Process (GP)

A basic problem in nonlinear system identification is to handle estimation of a static nonlinear function $$h(\eta )$$ from known observations

\begin{aligned}\{\zeta (t),\eta (t),t=1,\ldots ,N\},\qquad \zeta (t)=h(\eta (t))+\text {noise}.\end{aligned}

A general way to do this is to apply Gaussian Process (GP) estimation [29], see also Sects. 4.9 and 8.2.1. Then $$h(\eta )$$ is seen as a Gaussian stochastic process with a prior mean (often zero) and a certain prior covariance function $$K(\eta _1,\eta _2)$$. The argument can range over both discrete and continuous domains. After a number of observations $$z=\{\zeta (t),\eta (t),t=1,\ldots ,N\}$$, the posterior distribution of the process, $$h^p(\eta |z)$$, can be determined for any $$\eta$$. This is, in short, how the function h can be estimated. As seen in Sect. 8.2.1, it corresponds to a kernel method with the kernel determined by the prior covariance function $$K(\eta _1,\eta _2)$$.
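A minimal GP sketch for this setting, with an illustrative Gaussian (RBF) choice for the prior covariance K and h taken as tanh for the example; the posterior mean is the standard GP regression formula.

```python
import numpy as np

def rbf(a, b, ell=0.5):
    """Gaussian (RBF) prior covariance K(eta_1, eta_2), an illustrative choice."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(2)
eta = np.linspace(-2, 2, 60)
zeta = np.tanh(eta) + 0.05 * rng.standard_normal(eta.size)   # h = tanh here

sigma2 = 0.05 ** 2                                           # noise variance
eta_star = np.linspace(-2, 2, 5)                             # query points
K = rbf(eta, eta)
# posterior mean of h at eta_star given the observations
h_post = rbf(eta_star, eta) @ np.linalg.solve(K + sigma2 * np.eye(eta.size), zeta)

print(np.round(h_post, 2))
```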

### 8.5.3 Block-Oriented Models

A very common family of nonlinear dynamic models is obtained by networks of linear dynamic models G(q) and nonlinear static functions h(x), see Fig. 8.13. The simplest and most common ones are the Hammerstein model $$y(t)=G(h(u(t)))$$, obtained by passing the input through a static nonlinearity before it enters the linear system, and the Wiener model $$z=G(u), \ y(t)=g(z(t))$$, where the output of a linear system subsequently passes through the nonlinearity.

The important contribution [5] has shown that any nonlinear system with fading memory can be approximated by a Wiener model. See also, e.g., [37] for a survey and [42] for a general approach to Hammerstein–Wiener identification allowing coloured noise sources both before and after the nonlinearities (which may be non-invertible).

Traditionally, the nonlinearities have been parametrized, e.g., as piecewise constant or piecewise linear functions, as polynomials or as neural nets. Recently it has become more common to work with nonparametric nonlinearities, typically modelled by the GP approach, and the whole estimation is then treated in a Bayesian setting. For example, in [21] the linear part of a Wiener model is parametrized by state-space matrices A, B in an observer canonical form with suitable priors, and the output nonlinearity h(z) is a Gaussian process with prior mean $$=z$$ (“linear output”) and a large and “smooth” prior covariance. To obtain the posterior densities, a particle Gibbs sampler (PMCMC, Particle Markov Chain Monte Carlo) is employed.

In [32] the same approach is used to model the output nonlinearity, but the linear part is written as an impulse response, with a prior of the same type as discussed in Sect. 5.5.1. The whole problem can then be written as

\begin{aligned} y = \varphi (\varPhi g), \end{aligned}
(8.32)

where y is the observed output, $$\varphi$$ is the output static nonlinearity, g is the impulse response of the linear system and $$\varPhi$$ is the Toeplitz matrix formed from the input. The problem is then to determine the posterior densities $${\mathrm p}(\varphi |y)$$ and $${\mathrm p}(g|y)$$ by Bayesian calculations. In [31] a similar technique is used for estimating Hammerstein models.
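A small simulation sketch of the structure (8.32), with an illustrative exponentially decaying impulse response and tanh as the output nonlinearity:

```python
import numpy as np

rng = np.random.default_rng(3)
N, m = 50, 8
u = rng.standard_normal(N)
g = 0.7 ** np.arange(m)        # an illustrative stable impulse response

# Toeplitz matrix from the input: row t holds u(t), u(t-1), ..., u(t-m+1)
Phi = np.zeros((N, m))
for t in range(N):
    for k in range(min(t + 1, m)):
        Phi[t, k] = u[t - k]

z = Phi @ g                    # linear block
y = np.tanh(z)                 # static output nonlinearity

# sanity check: Phi @ g equals the direct (causal) convolution of u with g
assert np.allclose(z, np.convolve(u, g)[:N])
print(y[:3])
```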

### 8.5.4 Hybrid Models

A common class of nonlinear models is hybrid models [15, 39]. They change their properties depending on some regime variable p(t) (which may be formed from the inputs and outputs themselves) [16]. Think of a collection of linear models that describe the system behaviour in different parts of the operating space and automatically shift as the operating point changes. Building a hybrid model involves two steps: (1) find the collection of relevant different models and (2) determine the areas where each model is operative. This is considered quite a difficult problem, and approaches from different areas in control theory have been tested. Here we will comment upon a few ideas that relate to regularized identification.

A basic problem is to decide when a change in system behaviour occurs. This relates to change detection and signal segmentation. A regularization-based method to segment ARX models was suggested in [25]. The standard way to estimate ARX models can be described as in Chap. 2:

\begin{aligned} \min _{\theta } \sum _{t=1}^N \Vert y(t)-\varphi ^T(t)\theta \Vert ^2. \end{aligned}
(8.33)

This gives the average linear model behaviour over the time record $$t\in [1,\ldots ,N]$$. To follow momentary changes over time, we could estimate N models by

\begin{aligned} \min _{\theta (t),t=1,\ldots ,N} \sum _{t=1}^N \Vert y(t)-\varphi ^T(t)\theta (t)\Vert ^2. \end{aligned}
(8.34)

This would give a perfect fit with a pretty useless collection of models. To be more selective in accepting new models, we can add an $$\ell _1$$ regularization term, discussed in Sect. 3.6, obtaining:

\begin{aligned} \min _{\theta (t),t=1,\ldots ,N} \sum _{t=1}^N \Vert y(t)-\varphi ^T(t)\theta (t)\Vert ^2+\gamma \sum _{t=2}^N \Vert \theta (t)-\theta (t-1)\Vert _{1}. \end{aligned}
(8.35)

One could also use $$\ell _p$$ norms with $$p>1$$ as regularizers, but it is crucial that the penalty is a sum of norms and not a sum of squared norms. Then, adopting a suitable value for the regularization parameter $$\gamma$$, the penalty favours terms in the second sum that are exactly zero and not just small. This will force the number of different models from (8.35) to be small and thus just flag when essential changes have taken place.
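A tiny numerical illustration of the penalty in (8.35) for a scalar, piecewise-constant parameter path (the path itself is an arbitrary example): most increments are exactly zero, so the penalty only charges for the few actual changes.

```python
import numpy as np

# a piecewise-constant scalar parameter path with two changes (arbitrary example)
theta = np.concatenate([np.full(40, 1.0), np.full(30, -0.5), np.full(30, 2.0)])
diffs = np.abs(np.diff(theta))           # |theta(t) - theta(t-1)|, t = 2, ..., N

n_changes = int(np.count_nonzero(diffs))
penalty = diffs.sum()                    # the regularization term of (8.35), gamma = 1
print(n_changes, penalty)  # 2 changes; penalty = 1.5 + 2.5 = 4.0
```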

This idea is taken further in [24] to build hybrid models of PWA (piecewise affine) character. The starting point is again (8.34), but now the number of models is reduced by looking at all the raw models:

\begin{aligned} \min _{\theta (t),t=1,\ldots ,N} \sum _{t=1}^N \Vert y(t)-\varphi ^T(t)\theta (t)\Vert ^2+\gamma \sum _{t=1}^N \sum _{s=1}^N K(p(t),p(s))\Vert \theta (t)-\theta (s)\Vert . \end{aligned}
(8.36)

Here $$K(p_1,p_2)$$ is a weighting based on the respective regime variables p. This gives a number of, say, d submodels, and they can then be associated with values of the regime variable by a classification step.

These ideas of segmentation, building a collection of d submodels and associating them with particular values of time, are taken to a further degree of sophistication in [27]. The idea there is to build the hybrid stable spline (HSS) algorithm, based on a joint use of the TC (stable spline) kernel, see Sect. 5.5.1, for a family of ARX models like (8.34). The classification of the models is built into the algorithm by letting the classification parameters be part of the hyperparameters. An MCMC scheme is employed to handle the nonconvex and combinatorial difficulties of the maximum likelihood criterion.

### 8.5.5 Sparsity and Variable Selection

In all estimation problems it is essential to find the regressors $$x_k(t)$$, $$k=1,\ldots ,d$$, that are best suited for predicting the goal variable y(t). The variables $$x_k$$ can be formed from the observations of the system in many different ways. It is generally desired to find a small collection of regressors, and statistics offers many tools for this: hypothesis testing, projection pursuit [14], manifold learning/dimensionality reduction [10, 26, 34], ANOVA, see, e.g., [20] for applications to nonlinear system identification.

The problem of variable (regressor) selection can be formulated as follows. Given a model with n candidate regressors $$\tilde{x}_k(t)$$

\begin{aligned} y(t) =f(\tilde{x}_1(t),\ldots , \tilde{x}_n(t)) + e(t) \end{aligned}
(8.37)

find a subselection or combination of regressors $$x_1(t),\ldots ,x_d(t)$$ that gives the best model of the system. Note that the NARX model (8.3) is a special case of (8.37) with $$x_k(t)=[y(t-k),u(t-k)]$$. In principle one could try out different subsets of regressors and see how good the resulting models are (in cross-validation). That would in most cases mean overwhelmingly many tests.

Instead, the $$\ell _1$$-norm regularization discussed in Sect. 3.6.1, leading to LASSO in (3.105), is a very powerful tool for variable selection and sparsity. In what follows each $$\tilde{x}_i(t)$$ is scalar and is the ith component of the n-dimensional vector x(t). Then, for a linearly parametrized model

\begin{aligned} y(t) = \beta _1\tilde{x}_1(t)+\dots +\beta _n\tilde{x}_n(t) + e(t), \end{aligned}
(8.38)

the best regressors are found by the criterion

\begin{aligned}&\min _{{\mathrm {B}}}\sum _{t=1}^N \Vert y(t)-\varPhi (t){\mathrm {B}}\Vert ^2 + \gamma \Vert {\mathrm {B}}\Vert _1 \end{aligned}
(8.39)
\begin{aligned} {\mathrm {B}}&= [\beta _1,\beta _2,\dots ,\beta _n]^T \end{aligned}
(8.40)
\begin{aligned} \varPhi (t)&=[\tilde{x}_1(t),\tilde{x}_2(t),\ldots ,\tilde{x}_n(t)]. \end{aligned}
(8.41)
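A minimal sketch of solving the criterion (8.39) by proximal gradient (ISTA) with soft-thresholding, on synthetic data where only two of $$n=8$$ candidate regressors are active; the LASSO estimate zeroes out the rest. All values are illustrative assumptions.

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(4)
N, n = 200, 8
Phi = rng.standard_normal((N, n))                  # rows are Phi(t) as in (8.41)
B_true = np.array([1.5, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = Phi @ B_true + 0.1 * rng.standard_normal(N)

gamma = 20.0
L = np.linalg.eigvalsh(2 * Phi.T @ Phi).max()      # Lipschitz constant of the gradient
B = np.zeros(n)
for _ in range(500):
    grad = 2 * Phi.T @ (Phi @ B - y)
    B = soft(B - grad / L, gamma / L)

print(np.round(B, 2))  # only the two active regressors survive
```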

This idea of using $$\ell _1$$-norm regularization was extended to the general model (8.37) in [2]. It is based on the idea of estimating the partial derivatives $$\beta _k= \frac{\partial f}{\partial \tilde{x}_k}$$ in (8.37) analogously to (8.39). In particular, the Taylor expansion of f(x(t)) around $$x^0$$ is

\begin{aligned} f(x(t))=f(x^0)+(x(t)-x^0)^T\frac{\partial f}{\partial \tilde{x} } +\mathscr {O}(\Vert x(t)-x^0\Vert ^2). \end{aligned}
(8.42)

The partial derivative is evaluated at $$x^0$$ and is a column vector of dimension n with row k given by the derivative w.r.t. $$\tilde{x}_k$$. As anticipated, denote this by $$\beta _k$$. These parameters can be estimated by least squares with

\begin{aligned} \min _{\alpha , {\mathrm {B}}}\sum _{t=1}^N\Vert y(t)-\alpha -(x(t)-x^0)^T{\mathrm {B}}\Vert ^2\cdot K(x(t)-x^0) + \gamma \Vert {\mathrm {B}}\Vert _1, \end{aligned}
(8.43)

where $$\alpha$$ corresponds to $$f(x^0)$$, $${\mathrm {B}}$$ is the vector of partial derivatives $$\beta _k$$ and K is a kernel that focuses the sum on points x(t) in the vicinity of $$x^0$$. The $$\ell _1$$ norm regularization term is added just as in (8.39) to promote zero estimates of the gradients. This will focus on selecting regressors $$\tilde{x}_k$$ that are important for the model.

With the so-called iterative reweighting, [6], the regularization term can be refined to

\begin{aligned} \gamma \sum _{k=1}^n w_k | \beta _k|, \end{aligned}
(8.44)

where $$w_k= 1/|\hat{\beta }_k|$$ are based on the estimates from (8.43). This refinement is suggested to be included in the algorithm of [2].

Note that this test depends on the chosen point $$x^0$$. It would be a big task to investigate “many” such points. In [1] it is instead suggested to estimate the expected values $$E x_i \frac{\partial f}{\partial \tilde{x}_i}$$ and $$E {\frac{\partial f}{\partial \tilde{x}_i}}$$. This is done using the pdfs for $$\tilde{x}_i$$, given by $${\mathrm p}_i(u)$$, and $$\frac{d {\mathrm p}_i(u)}{dx_i}$$, which can be estimated by simple density estimation (involving only a scalar random variable).

A comprehensive study of sparsity and regularization is made in [33]. It works with a more complex model definition, allowing $$f: \mathbb {R}^n \rightarrow \mathbb {R}$$ to be defined over several Hilbert spaces. The bottom line is still based on $$\ell _1$$-norm regularization of partial derivatives, and the final learning algorithm is given by minimization of the functional

\begin{aligned} \frac{1}{N}\sum _{t=1}^N(y_t-f(x(t)))^2+ \gamma \left( 2\sum _{i=1}^n \Big \Vert \frac{\partial f}{\partial \tilde{x}_i}\Big \Vert _N+\nu \Vert f\Vert ^2_{\mathscr {H} }\right) . \end{aligned}
(8.45)

Here, $$\mathscr {H}$$ can be an RKHS, and the penalty on each partial derivative is given by

$$\Big \Vert \frac{\partial f}{\partial \tilde{x}_i} \Big \Vert _N = \sqrt{\frac{1}{N}\sum _{t=1}^N \Big (\frac{\partial f(x(t))}{\partial \tilde{x}_i}\Big )^2 },$$

$$\gamma$$ is the regularization parameter and $$\nu$$ is a small positive number included to ensure stability and a strongly convex regularizer.