## 9.1 Identification of Discrete-Time Output Error Models

In this section, we will consider two numerical experiments with data generated according to the discrete-time output error (OE) model

$$y(t)=G_0(q)u(t)+e(t),$$

where $$G_0$$ is a rational transfer function while e is white Gaussian noise independent of the known input u. Using simulated data, we will compare the performance of the classical PEM approach, as described in Chap. 2, with some of the regularized techniques illustrated in this book. In particular, we will adopt a regularized high-order FIR model, with impulse response coefficients contained in the m-dimensional (column) vector $$\theta$$ and the output data in the (column) vector $$Y=[y(1) \ldots y(N)]^T$$. So, letting the regression matrix $$\varPhi \in {\mathbb R}^{N \times m}$$ be

$$\small \varPhi = \left( \begin{array}{ccccc} u(0) & u(-1) & u(-2) & \ldots & u(-m+1) \\ u(1) & u(0) & u(-1) & \ldots & u(-m+2) \\ \vdots & \vdots & \vdots & & \vdots \\ u(N-1) & u(N-2) & u(N-3) & \ldots & u(N-m) \end{array}\right) ,$$

our estimator is

\begin{aligned} \hat{\theta }&=\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _\theta \Vert Y-\varPhi \theta \Vert ^2 + \gamma \theta ^TP^{-1}\theta \end{aligned}
(9.1a)
\begin{aligned}&=P\varPhi ^T(\varPhi P\varPhi ^T+\gamma I_{N})^{-1}Y\end{aligned}
(9.1b)
\begin{aligned}&=(P\varPhi ^T\varPhi +\gamma I_{m})^{-1}P\varPhi ^TY. \end{aligned}
(9.1c)

We have already seen in (5.40), (5.41) and (7.30), using MaxEnt arguments and spline theory, that choices for the regularization matrix P can be the first- or second-order stable spline kernel, denoted by TC and SS, respectively, or the DC kernel. They are recalled below, specifying also the hyperparameter vector $$\eta$$:

\begin{aligned} \text {TC}\quad&P_{kj}(\eta )=\lambda \alpha ^{\max (k,j)}; \nonumber \\&\lambda \ge 0, \; 0\le \alpha <1, \; \eta =[\lambda ,\alpha ], \end{aligned}
(9.2)
\begin{aligned} \text {SS}\quad&P_{kj}(\eta ) = \lambda \left( \frac{\alpha ^{k+j+\max (k,j)}}{2} - \frac{\alpha ^{3\max (k,j)}}{6} \right) \nonumber \\&\lambda \ge 0, \; 0\le \alpha <1, \; \eta =[\lambda ,\alpha ], \end{aligned}
(9.3)
\begin{aligned} \text {DC}\quad&P_{kj}(\eta )=\lambda \alpha ^{(k+j)/2}\rho ^{|j-k|}; \;\nonumber \\&\lambda \ge 0, \; 0\le \alpha <1,|\rho |\le 1,\ \; \eta =[\lambda ,\alpha ,\rho ]. \end{aligned}
(9.4)
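The kernels (9.2)–(9.4) are straightforward to form as matrices. The NumPy sketch below (for illustration; the hyperparameter values are placeholders) builds the three Gram matrices and checks two known properties: each matrix is positive semidefinite, and TC coincides with the special case of DC obtained with $$\rho=\sqrt{\alpha}$$.

```python
import numpy as np

def tc_kernel(m, lam, alpha):
    """TC kernel (9.2): P_kj = lam * alpha^max(k,j) for k,j = 1,...,m."""
    k = np.arange(1, m + 1)
    return lam * alpha ** np.maximum.outer(k, k)

def ss_kernel(m, lam, alpha):
    """SS kernel (9.3)."""
    k = np.arange(1, m + 1)
    S, M = np.add.outer(k, k), np.maximum.outer(k, k)
    return lam * (alpha ** (S + M) / 2 - alpha ** (3 * M) / 6)

def dc_kernel(m, lam, alpha, rho):
    """DC kernel (9.4): P_kj = lam * alpha^((k+j)/2) * rho^|k-j|."""
    k = np.arange(1, m + 1)
    S, D = np.add.outer(k, k), np.abs(np.subtract.outer(k, k))
    return lam * alpha ** (S / 2) * rho ** D

# every kernel matrix must be positive semidefinite
for P in (tc_kernel(30, 1.0, 0.8), ss_kernel(30, 1.0, 0.8),
          dc_kernel(30, 1.0, 0.8, 0.5)):
    assert np.all(np.linalg.eigvalsh(P) > -1e-10)
# TC is the special case of DC with rho = sqrt(alpha)
assert np.allclose(dc_kernel(30, 1.0, 0.8, np.sqrt(0.8)), tc_kernel(30, 1.0, 0.8))
```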

### 9.1.1 Monte Carlo Studies with a Fixed Output Error Model

In this example the true impulse response is fixed to that reported in Fig. 8.2, obtained by random generation of a rational transfer function of order 10. It has to be estimated from 500 input–output couples (collected with system initially at rest). The input is white noise filtered by the rational transfer function $$1/(z-p)$$ where p will vary over the unit interval during the experiment. Note that p establishes the difficulty of our system identification problem. Values close to zero make the input similar to white noise and the output data informative over a wide range of frequencies. Instead, values of p close to 1 increase the low-pass nature of the input and, hence, the ill-conditioning. The measurement noise is white and Gaussian with variance equal to that of the noiseless output divided by 50. Two estimators will be adopted:

• Oe+Or. Classical PEM approach (2.22) equipped with an oracle. In particular, our candidate models are rational transfer functions where the order of the two polynomials is equal and can vary between 1 and 30. For any model order, estimation is performed through nonlinear least squares by solving (2.22) with $$\ell$$ in (2.21) set to the quadratic function. The method is implemented in oe.m of the MATLAB System Identification Toolbox. Then, the oracle chooses the estimate which maximizes the fit

\begin{aligned} 100\left( 1 - \left[ \frac{\sum ^{100}_{k=1}|g_k^0-\hat{g}(k)|^2 }{\sum ^{100}_{k=1}|g_k^0-\bar{g}^0|^2}\right] ^{\frac{1}{2}}\right) ,\ \ \bar{g}^0=\frac{1}{100}\sum ^{100}_{k=1}g^0_k, \end{aligned}
(9.5)

where $$g^0_k$$ are the true impulse response coefficients while $$\hat{g}(k)$$ denote their estimates. The estimator is given the information that system initial conditions are null.

• TC+ML. This is the regularized estimator (9.1), equipped with the kernel TC. The number of estimated impulse response coefficients is $$m=100$$ and the regression matrix is built with $$u(t)=0$$ if $$t<0$$. At every run, the noise variance is estimated by fitting via least squares a low-bias model for the impulse response. Then, the two kernel hyperparameters are obtained via marginal likelihood optimization, see (7.42).  The method is implemented in impulseest.m of the MATLAB System Identification Toolbox.

We consider 4 Monte Carlo studies of 300 runs defined by different values of p in the set $$\{0,0.9,0.95,0.99\}$$. As already mentioned, $$p=0$$ corresponds to white noise input while $$p=0.99$$ leads to a highly ill-conditioned problem (output data provide little information at high frequencies). Figure 9.1 reports the boxplots of the 300 fits returned by Oe+Or and TC+ML for the four different values of p. Even if PEM exploits an oracle to tune complexity, its performance is (slightly) better than that of TC+ML only when the input is white noise, see also Table 9.1. When p increases, the ill-conditioning affecting the problem increases and TC+ML outperforms Oe+Or even if no oracle is used for hyperparameter tuning. This also points out the effectiveness of marginal likelihood optimization in controlling complexity.

This case study shows that continuous tuning of hyperparameters may be a more versatile and powerful approach than classical estimation of discrete model orders. Another problem for PEM here could be the presence of local minima in the objective. This is much less critical when adopting kernel-based regularization: TC+ML regulates complexity through only two hyperparameters, while Oe+Or has to optimize many more parameters (their number growing with the postulated model order).

### 9.1.2 Monte Carlo Studies with Different Output Error Models

Now we consider two Monte Carlo studies of 1000 runs regarding identification of several discrete-time output error models. The outputs are still given by

$$y(t)=G_0(q)u(t)+e(t)$$

with e white Gaussian noise independent of u, but the rational transfer function $$G_0$$ changes at any run. In fact, a 30th-order single-input single-output continuous-time system is first randomly generated by the MATLAB command rss.m. It is then sampled at 3 times its bandwidth and used only if its poles fall within the circle of the complex plane centred at the origin with radius 0.99.

With the system at rest, 1000 input–output pairs are generated as follows. At any run, the system input is unit variance white Gaussian noise filtered by a second-order rational transfer function generated by the same procedure adopted to obtain $$G_0$$. The outputs are corrupted by an additive white Gaussian noise with an SNR (the ratio between the variance of the noiseless output and that of the noise) randomly chosen in [1, 20] at any run. In the first experiment, the data set

\begin{aligned} \mathcal{D}_T =\{u(1),y(1),\ldots ,u(N),y(N)\} \end{aligned}

contains the first 200 input–output couples, i.e., $$N=200$$, while in the second experiment all the 1000 couples are used, i.e., $$N=1000$$.

Starting from null initial conditions, at any run we also generate two different kinds of test sets

\begin{aligned} \mathcal{D}_{test}= \{u^{new}(1),y^{new}(1),\ldots ,u^{new}(M),y^{new}(M)\}, \quad M=1000. \end{aligned}

The first test set is especially challenging since noiseless outputs are generated by using unit variance white Gaussian noise as input. In the second test set the input has instead the same statistics as that entering the identification data, hence making its prediction easier.

The performance of a model characterized by $$\hat{\theta }$$, and returning $$\hat{y}^{new}(t|\hat{\theta })$$ as output prediction at instant t, is

\begin{aligned} \mathscr {F}(\hat{\theta }) = 100 \left( 1- \sqrt{\frac{\sum _{t=1}^{M} \left( y^{new}(t) -\hat{y}^{new}(t | \hat{\theta }) \right) ^2}{\sum _{t=1}^{M} \left( y^{new}(t) -\bar{y}^{new} \right) ^2}} \right) , \quad M=1000, \end{aligned}
(9.6)

where $$\bar{y}^{new}$$ is the average output in $$\mathcal{D}_{test}$$ and $$\hat{y}^{new}(t|\hat{\theta })$$ are computed assuming zero initial conditions (otherwise high-order models could have the advantage to calibrate the initial conditions to fit $$\mathcal{D}_{test}$$). The prediction fit (9.6) can be obtained by the MATLAB command predict(model,data,k,'ini','z') where model and data denote structures containing the estimated model and the test set $$\mathcal{D}_{test}$$, respectively.
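A minimal implementation of the fit measure (9.6) is immediate. The sketch below (Python, in place of the MATLAB predict-based computation) also checks the two reference points of the scale: a perfect predictor scores 100, while the trivial mean predictor scores 0.

```python
import numpy as np

def prediction_fit(y_new, y_hat):
    """Prediction fit (9.6), in percent (100 = perfect prediction)."""
    y_new, y_hat = np.asarray(y_new, float), np.asarray(y_hat, float)
    num = np.sum((y_new - y_hat) ** 2)
    den = np.sum((y_new - np.mean(y_new)) ** 2)
    return 100 * (1 - np.sqrt(num / den))

y = np.array([1.0, 2.0, 3.0, 4.0])
assert prediction_fit(y, y) == 100.0                    # perfect predictor
assert prediction_fit(y, np.full(4, y.mean())) == 0.0   # mean predictor
```

Note that the fit can be negative when a model predicts worse than the constant mean predictor.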

In what follows, we will also use estimators equipped with an oracle which evaluates the fit (9.6) for the test set of interest. Different rational models with orders between 1 and 30 are tried and the oracle selects the orders that give the best fit. We are now in a position to introduce the following 6 estimators:

• Oe+Or1. Classical PEM approach (2.22), with quadratic $$\ell$$ in (2.21), equipped with an oracle which uses the first test set (white noise input). As said, candidate models are rational transfer functions whose order can vary between 1 and 30. For any order, the model is returned by the function oe.m of the MATLAB’s System Identification Toolbox [14].

• Oe+Or2. The same procedure described above except that the oracle maximizes the prediction fit using the second test set (test input with statistics equal to those of the training input).

• Oe+CV. The classical approach now does not use any oracle: model order is estimated by cross validation, splitting the identification data into two sets containing, respectively, the first and the last N/2 data in $$\mathcal{D}_T$$. The model order minimizing the sum of squared prediction errors on the validation set (computed assuming zero initial conditions) is chosen. Finally, the system estimate is computed using all the data in $$\mathcal{D}_T$$ by solving (2.22) with quadratic loss.

• {TC+ML,SS+ML,DC+ML}. These are three regularized FIR estimators of the form (9.1) with order 200 and kernels TC (9.2), SS (9.3) and DC (9.4). Marginal likelihood optimization (7.42) is used to determine the noise variance and the kernel hyperparameters (2 for SS and TC, 3 for DC). The regularized FIR models are estimated using the function impulseest.m in MATLAB’s System Identification Toolbox [14].

#### 9.1.2.1 Results

The MATLAB boxplots in Fig. 9.2 contain the 1000 fit measures returned by the estimators during the first experiment with $$N=200$$ (left panels) and the second experiment with $$N=1000$$ (right panels). Table 9.2 reports the average fit values.

In the top panels of Fig. 9.2 one can see the fits of the first test set. Recall that Oe+Or1 has access to such data to optimize the prediction capability. Interestingly, despite this advantage, the performance of all three regularized approaches is close to that of the oracle, while that of Oe+CV is not so satisfactory. This is also visible in the first two rows of Table 9.2.

The bottom panels of Fig. 9.2 show results relative to the second test set which is used by Oe+Or2 to maximize the prediction fit. Since training and test data are more similar, the prediction capability of Oe+CV improves significantly but the regularized estimators still outperform the classical approach, see also the last two rows of Table 9.2.

### 9.1.3 Real Data: A Robot Arm

Consider now the vibrating flexible robot arm described in [27], where two feedforward controller design methods were compared on trajectory tracking problems. The input of the robot arm is the driving couple and the output is the acceleration at the tip of the robot arm. The input–output data contain 40960 data points. They are collected at a sampling frequency of 500 Hz for 10 periods with each period containing 4096 data points. A portion of the data is shown in Fig. 9.3. The identification problem of the robot arm was studied in [23, Sect. 11.4.4] with frequency domain methods.

We will build models by both the classical prediction error method and the kernel method with the DC kernel. Since the true system is unknown, to compare the performance of different impulse response estimates we divide the data into two parts: the training and the test set, given by the first 6000 input–output couples and the remaining ones, respectively. Then, we measure how well the models, built with the estimation data, predict the test outputs.

For the prediction error method, we estimate nth-order state-space models without disturbance model and with zero initial conditions for $$n=1,\ldots ,36$$. This method is available in MATLAB’s System Identification Toolbox [13] as the command pem(data,n,'dist','no','ini','z'). The prediction fits computed using (9.6) are shown as a function of n in Fig. 9.4. An oracle that has access to the test set would select the order $$n=18$$, hence obtaining a prediction fit equal to $$79.75\%$$. For the kernel method with the DC kernel, we estimate a FIR model of high order 3000 with hyperparameters tuned by optimizing the marginal likelihood. When forming the regression matrix, the unknown input data are set to zero. The prediction fit (9.6) is 83.07% and is shown as a horizontal solid line in Fig. 9.4. The kernel method with the DC kernel is available in MATLAB’s System Identification Toolbox [14] as the command impulseest(data,3000,0,opt) where, in the option opt, we set opt.RegulKernel='dc'; opt.Advanced.AROrder=0.

The Bode magnitude plot of the models estimated by PEM and the DC kernel is shown in Fig. 9.5. The empirical frequency function estimate obtained using the command etfe in MATLAB’s System Identification Toolbox [14] is also displayed.

The measured output and the predicted output over a portion of the test set are shown in Fig. 9.6. If one is concerned that a FIR model of order 3000 is quite large, one can reduce such a high-order model by projecting it onto a low-order state-space model. Exploiting model order reduction techniques, the fit of a state-space model of order $$n=25$$ is 79.8%, still better than the best state-space description that can be obtained by PEM.

### 9.1.4 Real Data: A Hairdryer

The second application is a real laboratory device, whose function is similar to that of a hairdryer: the air is fanned through a tube and then heated at a mesh of resistor wires, as described in [13, Sect. 17.3]. The input to the hairdryer is the voltage over the mesh of resistor wires while the output is the air temperature measured by a thermocouple. The input–output data contain 1000 data points collected at a sampling frequency of 12.5 Hz for 80 s. A portion of the data is shown in the top panel of Fig. 9.7. Since the input–output values move around 5 and 4.9, respectively, we detrend the measurements in such a way that they move around 0. The estimation and test set data are then given by the first and the last 500 input–output couples, respectively.

As in the case of the robot arm, we build models by the classical prediction error method with an oracle, which maximizes the prediction fit, and the regularized approach with the DC kernel, with hyperparameters tuned by marginal likelihood optimization. For the prediction error method, we estimate nth-order state-space models without disturbance model for $$n=1,\ldots ,36$$ and with zero initial conditions. The fits, as a function of n, are shown in Fig. 9.8. The best result is obtained for order $$n=5$$ and turns out to be $$88.38\%$$. For the kernel method with the DC kernel, we estimate a FIR model with order 70. When forming the regression matrix, we set the unknown input data to zero. The prediction fit (9.6) is equal to 88.15%, somewhat close to that achieved by PEM+Oracle. It is shown as a dash-dot blue line in Fig. 9.8. The test set and the predicted outputs returned by the two methods are shown in Fig. 9.9. One can see that the regularized approach has a prediction capability very close to that of PEM+Oracle.

## 9.2 Identification of ARMAX Models

In this section we consider the identification of linear systems

\begin{aligned} y(t) = \left\{ \sum _{i=1}^p G_{0i}(q)u_i(t) \right\} +H_0(q)e(t). \end{aligned}
(9.7)

Differently from the previous cases, besides the presence of multiple observable inputs $$u_i$$, the noise model is also unknown. In fact, the e(t) are white Gaussian noise of unit variance filtered by a system $$H_0(q)$$ that has to be estimated from data.

First, it is useful to cast the identification of the general model (9.7) in a regularized context. Without loss of generality, to simplify the exposition, let $$p=1$$ with the single observable input denoted by u. Exploiting (2.4), given the general linear model (9.7), we can write any predictor as two infinite impulse responses from y and u, respectively. When using ARX models, we have seen in (2.8) that such infinite responses specialize to finite responses. One has

\begin{aligned} \nonumber y(t)=&-a_1y(t-1)-\cdots -a_{n_a}y(t-n_a)+b_1u(t-1)+\cdots \\&+b_{n_b}u(t-n_b)+e(t)=\varphi _y^T(t)\theta _a+\varphi _u^T(t)\theta _b + e(t), \end{aligned}
(9.8)

where $$\theta _a = \begin{bmatrix} a_1&\ldots&a_{n_a} \end{bmatrix}^T$$, $$\theta _b = \begin{bmatrix} b_1&\ldots&b_{n_b} \end{bmatrix}^T$$ and $$\varphi _y(t)$$, $$\varphi _u(t)$$ are made up from y and u in an obvious way. Thus, the ARX model is a linear regression model, to which the same ideas of regularization can be applied. This point is important since we have seen in Theorem 2.1 that ARX-expressions become arbitrarily good approximators for general linear systems as the orders $$n_a,n_b$$ tend to infinity. However, as discussed in Chap. 2, high-order ARX can suffer from large variance. A solution is to set $$n_a=n_b=n$$ to a large value and then introduce regularization matrices for the two impulse responses from y and from u. The P-matrix in (9.1) can be partitioned along with $$\theta _a, \theta _b$$:

\begin{aligned} P(\eta _1,\eta _2) = \begin{bmatrix} P^a(\eta _1)&{}0\\ 0&{}P^b(\eta _2) \end{bmatrix} \end{aligned}
(9.9)

with $$P^{a}(\eta _1),P^{b}(\eta _2)$$ defined, e.g., by any of (9.2)–(9.4). Letting $$\theta =[\theta _a^T \ \theta _b^T]^T$$ and building the regression matrix using $$[\varphi ^T_y(t) \ \varphi ^T_u(t)]$$ as rows, the estimator (9.1) now becomes a regularized high-order ARX. The MATLAB code for estimating this model using, e.g., the DC kernel would be

```
ao = arxRegulOptions('RegularizationKernel','DC');
[Lambda,R] = arxRegul(data,na,nb,nk,ao);
aropt = arxOptions;
aropt.Regularization.Lambda = Lambda;
aropt.Regularization.R = R;
m = arx(data,na,nb,nk,aropt);
```

We can also easily extend this construction to multiple inputs. Given a generic p, one needs to estimate $$p+1$$ impulse responses, with the matrix (9.9) now containing $$p+1$$ blocks. If there are multiple outputs, one approach is to consider each output channel as a separate linear regression as in (9.8). The difference is that now the other outputs also need to be appended, as done with the inputs.

### 9.2.1 Monte Carlo Experiment

A challenging Monte Carlo study of 1000 runs is now considered. Data are generated at any run by an ARMAX model of order 30 having p observable inputs, i.e.,

$$y(t) = \left\{ \sum _{i=1}^p \frac{B_i(q)}{A(q)}u_i(t) \right\} + \frac{C(q)}{A(q)} e(t),$$

with p drawn from a random variable uniformly distributed on $$\{2,3,4,5 \}$$. Note that the system contains $$p+1$$ rational transfer functions. They depend on the polynomials $$A,B_i$$ and C which are randomly generated at any run by the MATLAB function drmodel.m. This function is first called to obtain the common denominator A and the first numerator $$B_1$$. The other p calls are used to obtain the numerators of the remaining rational transfer functions. The system so generated is accepted if the modulus of its poles is not larger than 0.95. In addition, letting $$G_i(q)=\frac{B_i(q)}{A(q)}$$ and $$H(q)=\frac{C(q)}{A(q)}$$, the signal-to-noise ratio has to satisfy

$$1 \le \frac{\sum _{i=1}^p \Vert G_i\Vert _2^2}{\Vert H\Vert _2^2} \le 20$$

where $$\Vert G_i\Vert _2,\Vert H\Vert _2$$ are the $$\ell _2$$ norms of the system impulse responses.
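The acceptance rule can be checked by computing truncated impulse responses. The sketch below (Python with SciPy, using placeholder first-order polynomials in place of systems generated by drmodel.m, with $$p=2$$) evaluates the squared $$\ell _2$$ norms and the resulting ratio.

```python
import numpy as np
from scipy import signal

def h2_norm_sq(b, a, n=2000):
    """Squared l2 norm of the impulse response of b(q)/a(q), truncated at n taps."""
    imp = np.zeros(n)
    imp[0] = 1.0
    return float(np.sum(signal.lfilter(b, a, imp) ** 2))

# placeholder polynomials standing in for drmodel.m output (p = 2)
A = [1.0, -0.5]
B1, B2, C = [2.0], [1.0], [1.0, 0.3]
snr = (h2_norm_sq(B1, A) + h2_norm_sq(B2, A)) / h2_norm_sq(C, A)
assert 1 <= snr <= 20    # this system would be accepted
```

For stable systems the truncation error is negligible once n is much larger than the dominant time constant.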

After a transient to mitigate the effect of initial conditions, at any run 300 input–output couples are collected to form the identification data set $$\mathcal{D}_T$$ and another 1000 to define the test set $$\mathcal{D}_{test}$$. In both cases, the input is white Gaussian noise of unit variance.

Differently from the output error models, in the ARMAX case the performance measure adopted to compare different estimated models depends on the prediction horizon k. More specifically, let $$\hat{y}^{new}_k(t|\hat{\theta })$$ be the k-step-ahead predictor associated with an estimated model characterized by $$\hat{\theta }$$. For any t, this function predicts the test output $$y^{new}(t)$$ k steps ahead by using the values of the test input $$u^{new}$$ up to time $$t-1$$ and of the test output $$y^{new}$$ up to $$t-k$$. The prediction difficulty in general increases as k gets larger. The special case $$k=1$$ corresponds to the one-step-ahead predictor given by (2.4); see, e.g., [13, Sect. 3.2] for the expressions of the generic k-step-ahead impulse responses.

As done in (9.6), we use $$\bar{y}^{new}$$ to denote the mean of the outputs in $$\mathcal{D}_{test}$$, but now the prediction fit depends on k, being given by

\begin{aligned} \mathscr {F}_k(\hat{\theta }) = 100 \left( 1- \sqrt{\frac{\sum _{t=1}^{M} \left( y^{new}(t) -\hat{y}^{new}_k(t | \hat{\theta }) \right) ^2}{\sum _{t=1}^{M} \left( y^{new}(t) -\bar{y}^{new} \right) ^2}} \right) , \quad M=1000. \end{aligned}
(9.10)

In this case, we say that an estimator is equipped with an oracle if it can use the test set to maximize $$\sum _{k=1}^{20} \mathscr {F}_{k}$$ by tuning the complexity of the model estimated using the identification data. The following estimators are then introduced:

• PEM+Oracle: this is the classical PEM approach (2.22) with quadratic loss equipped with an oracle. The candidate model structures are ARMAX models with polynomials all having the same degree up to 30. For any model order, the MATLAB command pem.m (or armax.m) of the MATLAB System Identification Toolbox [14] is used to obtain the system’s estimate.

• PEM+CV: in place of the oracle, model complexity is estimated by cross validation splitting $$\mathcal{D}_T$$ into two sets containing, respectively, the first and the last 150 input–output couples. The model order which minimizes the sum of the squared one-step-ahead prediction errors computed with zero initial conditions for the validation data is selected. The final system’s estimate is returned by (2.22) using all the identification data.

• {PEM+AICc,PEM+BIC}: this is the classical PEM approach with AIC-type criteria used to tune complexity, as reported in (2.35) and (2.36).

• {TC+ML,SS+ML,DC+ML}: these are the three regularized least squares estimators introduced at the beginning of this section which determine the unknown coefficients of the multi-input version of the ARX model. After setting the length of each predictor impulse response to 50, the regularization matrices entering the multi-input version of (9.9) are defined by the TC (9.2), SS (9.3) or DC (9.4) kernels. The first 50 input–output pairs in $$\mathcal{D}_T$$ are used just as entries of the regression matrix. For every impulse response, a different scale factor $$\lambda$$ and a common variance decay rate $$\alpha$$ (and, in the case of DC, a correlation $$\rho$$) are adopted. The hyperparameters are determined via marginal likelihood optimization.

All the system input delays are assumed known and their values are provided to all the estimators described above.

The average of the fits $$\mathscr {F}_k$$ given by (9.10), as a function of the prediction horizon k, is reported in Fig. 9.10. Since PEM equipped with Akaike-like criteria returns very small average fits, results achieved by these procedures are not displayed. The MATLAB boxplots of the 1000 values of $$\mathscr {F}_1$$ and $$\mathscr {F}_{20}$$ returned by all the estimators are visible in Fig. 9.11. The average fit of SS+ML is quite close to that of PEM+Oracle, which is in turn outperformed by TC+ML and DC+ML. This is remarkable also considering that such kernel-based approaches can be used in real applications while PEM+Oracle relies on an ideal tuning which exploits the test set. Results returned by PEM equipped with CV are instead unsatisfactory.

The results outline the importance of regularization, especially in experiments with relatively small data sets. In this case, only 300 input–output measurements are available with quite complex systems of order 30. The classical PEM approach equipped with any model order-selection rule cannot predict the test set better than the oracle. However, the latter can tune complexity only by exploring a finite set of given models. Kernel-based approaches can instead balance bias and variance by continuous tuning of regularization parameters. In this way, better performing trade-offs may be reached.

### 9.2.2 Real Data: Temperature Prediction

Now we consider thermodynamic modelling of buildings using some real data taken from [22]. Eight sensors are placed in two rooms of a small two-floor residential building of about 80 $$\text {m}^2$$ and 200 $$\text {m}^3$$. They are located only on one floor (approximately 40 $$\text {m}^2$$). More specifically, temperatures are collected through a wireless sensor network made of 8 Tmote-Sky nodes produced by Moteiv Inc. The building was inhabited during the measurement period consisting of 8 days and samples were taken every 5 min. A thermostat controlled the heating system with the reference temperature manually set every day depending upon occupancy and other needs. This makes available a total of 8 temperature profiles displayed in Fig. 9.12. One can see the high level of collinearity of the signals. This makes the problem ill-conditioned, complicating the identification process.

We just consider multiple-input single-output (MISO) models. The temperature from the first node is seen as the output ($$y_i$$) and the other 7 temperatures as inputs ($$u^j_i$$, $$j=1,\ldots ,7$$). Data are divided into 2 parts: those collected at time instants $$1,\ldots ,1200$$ form the identification set while those at instants $$1201,\ldots ,2500$$ are used for test purposes. With 5 min sampling times, 1200 instants almost correspond to 100 h, a rather small time interval. Hence, we assume a “stationary” environment and normalize the data so as to have zero mean and unit variance before performing identification. Quality of the k-step-ahead prediction on test data is measured by (9.10).

Identification has been performed using ARMAX models with an oracle which has access to the test set. This estimator, called PEM+Or, maximizes $$\sum _{k=1}^{48} \mathscr {F}_k$$ which accounts for the prediction capability up to 4 h ahead. The other estimator is regularized ARX equipped with the TC kernel with a different scale factor $$\lambda$$ assigned to each unknown one-step-ahead predictor impulse response and a common decay rate $$\alpha$$. The length of each impulse response is set to 50 and the hyperparameters are estimated via marginal likelihood maximization using only the identification data. This estimator is denoted by TC+ML. Results are reported in Fig. 9.13 (top panel): the performance of PEM+Or and TC+ML is quite similar. Sample trajectories of one-hour-ahead test data prediction returned by TC+ML are also reported in Fig. 9.13 (bottom panel).

## 9.3 Multi-task Learning and Population Approaches $$\star$$

In the previous chapters we have studied the problem of reconstructing a real-valued function from discrete and noisy samples. An extension is the so-called multi-task learning problem in which several functions (tasks) are simultaneously estimated. This problem is significant if the tasks are related to each other so that measurements taken on a function are informative with respect to the other ones. An example is given by a network of linear systems whose impulse responses share some common features. Here, a relevant problem is the study of anomaly detection in homogeneous populations of dynamic systems [5, 6, 10]. Normally, all of them are supposed to have the same (possibly unknown) nominal dynamics. However, there can be a subset of systems that have anomalies (deviations from the mean) and the goal is to detect them from the data collected in the population. Important applications of multi-task learning arise also in biomedicine when multiple experiments are performed in subjects from a population [9]. Similar patterns are observed in individual responses, so that measurements collected in a subject can also help reconstruct the responses of other individuals. In pharmacokinetics (PK) and pharmacodynamics (PD) the joint analysis of several individual curves is often exploited and called population analysis [24]. One class of adopted models is parametric, e.g., compartmental models [7]. The problem can be solved using, e.g., the NONMEM software, which traces back to the seventies [3, 25], or more sophisticated approaches like Bayesian MCMC algorithms [15, 28]. More recently, machine learning/nonparametric approaches have been proposed for the population analysis of PK/PD data [16, 19, 20].

In the machine learning literature, the term multi-task learning was originally introduced in [4]. The performance improvement achievable by using a multi-task approach instead of a single-task one, which learns the functions separately, was then pointed out in [1, 26], see also [2] for a Bayesian treatment. Next, a regularized kernel method hinging on the theory of vector-valued reproducing kernel Hilbert spaces [18] was proposed in [8]. Developments and applications of multi-task learning can then be found, e.g., in [11, 12, 17, 21, 29, 30].

We will now see that multi-task learning can be cast within the RKHS setting developed in the previous chapters by defining a particular kernel. Just to simplify exposition, let us assume that there is a common input space X for all the tasks and consider a set of k functions $$\mathbf{{f}}_i:X \mapsto \mathbb {R}$$. Assume also that the following $$n_i$$ input–output data are available for each task i

\begin{aligned} (x_{1i},y_{1i}),(x_{2i},y_{2i}),\ldots ,(x_{n_ii},y_{n_ii}). \end{aligned}
(9.11)

Our goal is to jointly estimate all the unknown functions $$\mathbf{{f}}_i$$ starting from these examples. For this aim, first a kernel can be introduced to include our knowledge on the single functions (like smoothness) and also on their relationships. This can be done by defining an enlarged input space

$$\mathscr {X} = X \times \{1,2,\,\ldots ,k\}.$$

Hence, a generic element of $$\mathscr {X}$$ is the couple $$(x,i)$$ where $$x \in X$$ while $$i \in \{1,\ldots ,k\}$$. The index i thus specifies that the input location belongs to the part of the function domain connected with the ith function. The information regarding all the tasks can now be specified by the kernel $$K: \mathscr {X} \times \mathscr {X} \rightarrow \mathbb {R}$$ which induces a RKHS of functions $$\mathbf{{f}}: \mathscr {X} \rightarrow \mathbb {R}$$. In fact, we are just exploiting RKHS theory on function domains that include both continuous and discrete components. Note that, in practice, any function $$\mathbf {f}$$ embeds k functions $$\mathbf{{f}}_i$$.

Regularization in RKHS then allows us to reconstruct the tasks from the data (9.11) by computing

\begin{aligned} {{\hat{\mathbf {f}}}} = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\mathbf{{f}} \in \mathscr {H}} \ \sum _{i=1}^k \sum _{l=1}^{n_i}\mathscr {V}_{li}(y_{li},\mathbf{{f}}_i(x_{li}))+ \gamma \Vert \mathbf{{f}}\Vert _{\mathscr {H}}^2. \end{aligned}
(9.12)

Under general conditions on the losses $$\mathscr {V}_{li}$$, we can then apply the representer theorem, i.e., Theorem 6.15, to obtain the following expression for the minimizer:

\begin{aligned} {{\hat{\mathbf {f}}}}_j(x)=\sum _{i=1}^k \sum _{l=1}^{n_i} c_{li} K\left( (x,j),(x_{li},i)\right) \quad x \in X, j=1,\ldots ,k \end{aligned}
(9.13)

where $$\{c_{li}\}$$ are suitable scalars. Adopting quadratic losses which include weights $$\{\sigma ^2_{li}\}$$, i.e.,

$$\mathscr {V}_{li}(a,b)=\frac{(a-b)^2}{\sigma ^2_{li}}$$

for any $$a,b \in \mathbb {R}$$, a regularization network is obtained and the expansion coefficients $$\{c_{li}\}$$ solve the following linear system of equations

\begin{aligned} \sum _{i=1}^k \sum _{l=1}^{n_i} \left[ K((x_{li},i),(x_{jq},q))+\gamma \sigma ^2_{jq} \delta _{lj} \delta _{iq} \right] c_{li}=y_{jq}, \end{aligned}
(9.14)

where $$q=1,\ldots ,k$$, $$j=1,\ldots ,n_q$$ and $$\delta _{ij}$$ is the Kronecker delta.
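As a concrete illustration, the linear system (9.14) can be assembled and solved numerically. The sketch below is a minimal Python implementation of the resulting regularization network; the kernel `K` (an average-plus-shift structure with Gaussian building blocks), the helper `fit_multitask` and all numerical values are illustrative assumptions, not part of the text:

```python
import numpy as np

# Hypothetical multi-task kernel on the enlarged input space (x, task index):
# an "average plus shift" structure with Gaussian building blocks (assumed).
def K(x1, p, x2, q, lam_bar=1.0, lam_tilde=0.5):
    k_avg = np.exp(-(x1 - x2) ** 2)            # shared (average-task) part
    k_shift = np.exp(-2.0 * (x1 - x2) ** 2)    # task-specific part
    return lam_bar * k_avg + (p == q) * lam_tilde * k_shift

def fit_multitask(X, tasks, Y, sigma2, gamma=1.0):
    """Regularization network: solve (G + gamma * Sigma) c = Y, i.e., the
    linear system (9.14) in matrix form, then return the estimate (9.13)."""
    n = len(X)
    G = np.array([[K(X[a], tasks[a], X[b], tasks[b]) for b in range(n)]
                  for a in range(n)])
    c = np.linalg.solve(G + gamma * np.diag(sigma2), Y)
    return lambda x, j: sum(c[b] * K(x, j, X[b], tasks[b]) for b in range(n))

# Toy data: two tasks sampled at a few input locations
X = np.array([0.0, 0.5, 1.0, 0.2, 0.7])
tasks = np.array([0, 0, 0, 1, 1])
Y = np.array([0.0, 0.4, 0.8, 0.1, 0.6])
sigma2 = np.ones(5)
f_hat = fit_multitask(X, tasks, Y, sigma2, gamma=1e-6)
```

With a very small $$\gamma$$ the estimate nearly interpolates the data; larger values of $$\gamma$$ trade data fit for regularity, as in (9.12).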

**Connection with Bayesian estimation** Exploiting the same arguments developed in Sect. 8.2.1, the following relationship between (9.13), (9.14) and Bayesian estimation of Gaussian random fields is obtained. Let the measurement model be

\begin{aligned} y_{ji}=\mathbf{{f}}_i(x_{ji})+e_{ji} \end{aligned}
(9.15)

where $$\{e_{ji}\}$$ are independent Gaussian noises of variances $$\{\sigma ^2_{ji}\}$$. Define

$$y_i=[y_{1i} \ldots y_{n_ii}]^T, \qquad y^k=[y_1^T \ldots y_k^T]^T.$$

Assume also that $$\{\mathbf{{f}}_i\}$$ are zero-mean Gaussian random fields, independent of the noises, with covariances

\begin{aligned} \text {Cov} \left( \mathbf{{f}}_i(x),\mathbf{{f}}_q(s)\right) =K((x,i),(s,q)) \qquad x,s \in X, \end{aligned}

where $$i=1,\ldots ,k$$ and $$q=1,\ldots ,k$$. Then, one obtains that for $$j=1,\ldots ,k$$, the minimum variance estimate of $$\mathbf{{f}}_j$$ conditional on $$y^k$$ is defined by (9.13), (9.14) by setting $$\gamma =1$$. Furthermore, the posterior variance of $$\mathbf{{f}}_j(x)$$ is

\begin{aligned} \text {Var}\left[ \mathbf{{f}}_j(x) | y^k \right] =\text {Var}\left[ \mathbf{{f}}_j(x)\right] - \text {Cov}\left( \mathbf{{f}}_j(x),y^k\right) \left( \text {Var}\left[ y^k\right] \right) ^{-1} \text {Cov}\left( \mathbf{{f}}_j(x),y^k\right) ^T. \end{aligned}
(9.16)

In the above formula, in view of the independence assumptions, one has

\begin{aligned} \text {Var}\left[ y^k\right] =\left( \begin{array}{cccc} V_{11} &{} V_{12} &{} \ldots &{} V_{1k} \\ V_{21} &{} V_{22} &{} \ldots &{} V_{2k} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ V_{k1} &{} V_{k2} &{} \ldots &{} V_{kk} \end{array} \right) + \left( \begin{array}{cccc} \varSigma _{1} &{} 0 &{} \ldots &{} 0 \\ 0 &{} \varSigma _{2} &{} \ldots &{} 0 \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ 0 &{} 0 &{} \ldots &{} \varSigma _{k} \end{array} \right) , \end{aligned}

where each block $$V_{iq}$$ belongs to $$\mathbb {R}^{n_i \times n_q}$$ and its $$(l,j)$$-entry is given by

\begin{aligned} V_{iq}(l,j)=K((x_{li},i),(x_{jq},q)), \end{aligned}

while $$\varSigma _i={{\,\mathrm{diag}\,}}\{\sigma ^2_{1i},\ldots ,\sigma ^2_{n_i i}\}$$. In addition

\begin{aligned} \text {Cov}\left( \mathbf{{f}}_j(x),y^k\right)&=\text {Cov}\left( \mathbf{{f}}_j(x),[\mathbf{{f}}_1(x_{11}) \ldots \mathbf{{f}}_1(x_{n_11}) \ldots \mathbf{{f}}_k(x_{1k}) \ldots \mathbf{{f}}_k(x_{n_kk})] \right) \\&= \left[ K((x,j),(x_{11},1)) \ \ldots \ K((x,j),(x_{n_11},1)) \ \ldots \ K((x,j),(x_{1k},k)) \ \ldots \ K((x,j),(x_{n_kk},k)) \right] . \end{aligned}
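Formula (9.16) can be evaluated numerically under this Gaussian interpretation. The sketch below assembles $$\text {Var}[y^k]$$ and $$\text {Cov}(\mathbf{{f}}_j(x),y^k)$$ entrywise from a multi-task kernel; the kernel, the helper `posterior_variance` and all values are illustrative assumptions:

```python
import numpy as np

# Hypothetical multi-task kernel (assumed, for illustration only)
def K(x1, p, x2, q):
    return np.exp(-(x1 - x2) ** 2) + (p == q) * 0.5 * np.exp(-2.0 * (x1 - x2) ** 2)

def posterior_variance(x, j, X, tasks, sigma2):
    """Evaluate (9.16): prior variance minus the part explained by y^k."""
    n = len(X)
    # Var[y^k] = V + blockdiag(Sigma_1, ..., Sigma_k), assembled entrywise
    Vy = np.array([[K(X[a], tasks[a], X[b], tasks[b]) for b in range(n)]
                   for a in range(n)]) + np.diag(sigma2)
    # Cov(f_j(x), y^k): row vector of kernel evaluations
    r = np.array([K(x, j, X[b], tasks[b]) for b in range(n)])
    return K(x, j, x, j) - r @ np.linalg.solve(Vy, r)

X = np.array([0.0, 0.5, 1.0, 0.2, 0.7])
tasks = np.array([0, 0, 0, 1, 1])
sigma2 = 0.01 * np.ones(5)
pv = posterior_variance(0.5, 0, X, tasks, sigma2)
```

With a small noise variance, the posterior variance at a measured location is close to zero, and it never exceeds the prior variance $$K((x,j),(x,j))$$.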

**Example of multi-task kernel: average plus shift** A simple yet useful class of multi-task kernels is obtained by defining K as follows:

\begin{aligned} K((x_1,p),(x_2,q)) = \overline{\lambda } \overline{K}(x_1,x_2) + \delta _{pq}\widetilde{\lambda } \widetilde{K}_{p}(x_1,x_2) \end{aligned}
(9.17)

where $$\overline{\lambda }$$ and $$\widetilde{\lambda }$$ are two scale factors that typically need to be estimated from data. Such a kernel describes each function as the sum of an average function $$\mathbf {\overline{f}}$$, hereafter named the average task, and an individual shift $$\mathbf {\widetilde{f}}_i$$ specific to each task. Indeed, if $$\overline{\lambda } = 0$$, all the functions would be learnt independently of each other. Instead, when $$\widetilde{\lambda } = 0$$, all the tasks are actually the same. The Bayesian interpretation of multi-task learning discussed above also facilitates the understanding of this model. In fact, once the kernel is seen as a covariance, it is easy to see that, for any i and $$x \in X$$, each task decomposes into

$$\mathbf{{f}}_i(x)={{\bar{\mathbf {f}}}}(x)+{{\tilde{\mathbf {f}}}}_i(x)$$

where $${{\bar{\mathbf {f}}}}$$ and $$\{{{\tilde{\mathbf {f}}}}_i\}$$ are zero-mean independent Gaussian random fields.
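This decomposition can be verified by simulation: sampling the average task and the shifts as independent zero-mean Gaussian fields and checking that the empirical covariances of their sums reproduce (9.17). The sketch below uses Gaussian kernels as placeholders for $$\overline{K}$$ and $$\widetilde{K}_{p}$$, with arbitrary scale factors; all choices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.0, 0.5, 1.0])                # evaluation grid
lam_bar, lam_tilde = 1.0, 0.5                # scale factors (assumed values)
D2 = np.subtract.outer(x, x) ** 2
K_bar = np.exp(-D2)                          # placeholder average-task kernel
K_til = np.exp(-2.0 * D2)                    # placeholder shift kernel

# Draw f_i = f_bar + f_tilde_i, with f_bar shared across the tasks and
# the shifts drawn independently for each task
n_draws = 200000
L_bar = np.linalg.cholesky(lam_bar * K_bar + 1e-10 * np.eye(3))
L_til = np.linalg.cholesky(lam_tilde * K_til + 1e-10 * np.eye(3))
f_bar = L_bar @ rng.standard_normal((3, n_draws))
f1 = f_bar + L_til @ rng.standard_normal((3, n_draws))
f2 = f_bar + L_til @ rng.standard_normal((3, n_draws))

# Empirical covariances: common plus shift part within a task,
# common part only across distinct tasks, as prescribed by (9.17)
C_same = (f1 @ f1.T) / n_draws
C_diff = (f1 @ f2.T) / n_draws
```

Within a task the covariance contains both terms of (9.17); across distinct tasks only the average-task term survives, since the shifts are independent.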

### 9.3.2 Numerical Example: Real Pharmacokinetic Data

Multi-task learning is now illustrated by considering a data set connected with xenobiotics administration in 27 human subjects [20]. Such administration can be seen as the input to a continuous-time linear dynamic system whose (measurable) output is the drug profile in plasma. In each subject, 8 measurements were collected at 0.5, 1, 1.5, 2, 4, 8, 12 and 24 h after a bolus, an input which can be seen as a Dirac delta. Hence, one has to deal with a particular continuous-time system identification problem where noisy and direct samples of the impulse response are available.

In this experiment, the noises are known to be Gaussian and heteroscedastic, i.e., their variances are not constant, being given by $$\sigma _{ij}^2 = (0.1y_{ij})^{2}$$. The 27 experimental concentration profiles are displayed in Fig. 9.14, together with the average profile. In light of the number of subjects, such an average curve is a reasonable estimate of the average task $$\overline{\mathbf {f}}$$.

The whole data set consists of 216 pairs $$(x_{ij},y_{ij})$$, for $$i=1, \ldots , 8$$ and $$j=1, \ldots ,27$$, and is split into an identification (training) set and a test set. As regards training, a sparse sampling schedule is considered: only 3 measurements per subject are randomly chosen among the 8 available data. We will adopt the multi-task estimator (9.12) to reconstruct all the continuous-time profiles. In view of the Gaussian and heteroscedastic nature of the noise, the losses are defined by

$$\mathscr {V}_{ij}(a,b) = \frac{(a-b)^2}{\sigma _{ij}^2}.$$

As regards the function model, since humans are expected to give similar responses to the drug, quite close to an average function, the kernel (9.17) is adopted. In addition, it is known that in these experiments there is a greater variability for small values of t, followed by an asymptotic decay to zero. This motivates the use of a stable kernel to model both the average task and the shifts. A model suggested in [20] is a cubic spline kernel under the time transformation

$$h(t) = \frac{3}{t+3}$$

which defines (9.17) through the correspondences

$$\overline{K}(t,\tau ) = \widetilde{K}_{p}(t,\tau ) = \frac{h(t) h(\tau ) \min \{h(t), h(\tau )\}}{2}-\frac{(\min \{h(t), h(\tau )\})^3}{6}.$$

One can check that this model induces a stable RKHS by using Corollary 7.2. In fact, the kernels are nonnegative-valued and the integral of a generic kernel section is

$$\int _0^{+\infty } \left( \frac{h(t) h(\tau ) \min \{h(t), h(\tau )\}}{2}-\frac{(\min \{h(t), h(\tau )\})^3}{6}\right) d\tau$$
$$= \frac{1}{2(t+3)^3} \left( (27t+81)\log (\frac{t+3}{3})+13.5t+67.5\right)$$

and this result clearly implies

$$\int _0^{+\infty } \int _0^{+\infty } \left( \frac{h(t) h(\tau ) \min \{h(t), h(\tau )\}}{2}-\frac{(\min \{h(t), h(\tau )\})^3}{6}\right) d\tau dt < \infty .$$
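The closed-form expression for the integral of the kernel section can also be checked numerically, e.g., at $$t=1$$. The sketch below compares a trapezoidal quadrature (on a log-spaced grid, truncated at a large upper limit, with a correspondingly loose tolerance) against the formula above; the grid and truncation point are arbitrary choices:

```python
import numpy as np

def h(s):
    return 3.0 / (s + 3.0)

def section(t, tau):
    # generic section of the time-warped cubic spline kernel
    m = np.minimum(h(t), h(tau))
    return h(t) * h(tau) * m / 2.0 - m ** 3 / 6.0

t = 1.0
# log-spaced grid plus the origin, truncated at tau = 1e6
tau = np.concatenate(([0.0], np.logspace(-6, 6, 200001)))
y = section(t, tau)
numeric = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(tau))  # trapezoidal rule
closed = ((27 * t + 81) * np.log((t + 3) / 3)
          + 13.5 * t + 67.5) / (2 * (t + 3) ** 3)
```

The truncation error is of order $$1/T$$ for an upper limit T, since the section decays as $$\tau ^{-2}$$, so the two values agree up to a small tolerance.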

The initial plasma concentration is known to be zero. Hence, a zero-variance virtual measurement at $$t=0$$ was added for all the tasks. The hyperparameters $$\overline{\lambda }$$ and $$\widetilde{\lambda }$$ were then estimated via marginal likelihood maximization by exploiting the Bayesian interpretation of multi-task learning discussed above.
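Under the Bayesian interpretation, the marginal likelihood of the data is Gaussian with a covariance that depends linearly on the two scale factors of (9.17), so its negative log can be evaluated in closed form and minimized, e.g., by a simple grid search. The sketch below uses toy outputs and placeholder Gaussian kernels, not the pharmacokinetic data set; all names and values are illustrative assumptions:

```python
import numpy as np

def neg_log_marglik(lam_bar, lam_tilde, tasks, y, sigma2, K_bar, K_til):
    """-log N(y; 0, V), V = lam_bar*K_bar + lam_tilde*(same-task mask)*K_til + Sigma."""
    same = np.equal.outer(tasks, tasks)          # 1 on same-task pairs
    V = lam_bar * K_bar + lam_tilde * same * K_til + np.diag(sigma2)
    _, logdet = np.linalg.slogdet(V)
    return 0.5 * (y @ np.linalg.solve(V, y) + logdet + len(y) * np.log(2 * np.pi))

rng = np.random.default_rng(1)
X = np.linspace(0.0, 1.0, 10)
tasks = np.repeat([0, 1], 5)                     # two toy tasks, 5 samples each
D2 = np.subtract.outer(X, X) ** 2
K_bar, K_til = np.exp(-D2), np.exp(-2.0 * D2)    # placeholder kernels
sigma2 = 0.01 * np.ones(10)
y = rng.standard_normal(10)                      # toy outputs

# grid search over candidate scale factors
grid = [(a, b) for a in (0.1, 1.0, 10.0) for b in (0.1, 1.0, 10.0)]
best = min(grid, key=lambda p: neg_log_marglik(p[0], p[1], tasks, y,
                                               sigma2, K_bar, K_til))
```

In practice a continuous optimizer would replace the grid search, but the objective evaluation is the same.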

The left and right panels of Fig. 9.15 report the results obtained by the single- and the multi-task approach, respectively, in 5 subjects. One can see the data and the estimated curves with their $$95\%$$ confidence intervals obtained using the posterior variance (9.16). Each panel also shows the estimates obtained by employing the full sampling grid. It is apparent that the multi-task estimates are closer to these reference profiles. A good predictive capability with respect to the other five “unobserved” data points is also visible. To better quantify this aspect, let $$I^f$$ and $$I^r_j$$ denote the full and the reduced sampling grid in the jth subject. Let also $$I_j=I^f \setminus I_j^r$$, whose cardinality is 5. Then, for each subject, we also define the prediction error as

$$RMSE_j^{MT} = \sqrt{ \frac{\sum _{i \in I_j} (y_{ij} - \hat{\mathbf{{f}}}_j(x_{ij}))^2}{5}}$$

with the single-task error $$RMSE_j^{ST}$$ defined in a similar way. Figure 9.16 then reports the boxplots of the 27 RMSE values returned by the single- and multi-task estimates. The improvement in prediction performance due to the kernel-based population approach is evident.