Abstract
This chapter collects some numerical experiments to test the performance of kernel-based approaches for discrete-time linear system identification. Using Monte Carlo simulations, we will compare the performance of kernel-based methods with the classical PEM approaches described in Chap. 2. Simulated and real data are included, concerning a robotic arm, a hairdryer and a problem of temperature prediction. We conclude the chapter by introducing so-called multitask learning, where several functions (tasks) are simultaneously estimated. This problem is significant if the tasks are related to each other, so that measurements taken on one function are informative with respect to the other ones. A problem involving real pharmacokinetics data, related to the so-called population approaches, is then illustrated. Results will often be illustrated by using MATLAB boxplots. As already mentioned in Sect. 7.2, when commenting on Fig. 7.8, the median is given by the central mark while the box edges are the 25th and 75th percentiles. The whiskers extend to the most extreme fits not seen as outliers. Then, the outliers are plotted individually.
9.1 Identification of Discrete-Time Output Error Models
In this section, we will consider two numerical experiments with data generated according to the discrete-time output error (OE) model
where \(G_0\) is a rational transfer function while e is white Gaussian noise independent of the known input u. Using simulated data, we will compare the performance of the classical PEM approach, as described in Chap. 2, with some of the regularized techniques illustrated in this book. In particular, we will adopt regularized high-order FIR models, with impulse response coefficients contained in the m-dimensional (column) vector \(\theta \) and the output data in the (column) vector \(Y=[y(1) \ldots y(N)]^T\). So, letting the regression matrix \(\varPhi \in {\mathbb R}^{N \times m}\) be
our estimator is
We have already seen in (5.40), (5.41) and (7.30), using MaxEnt arguments and spline theory, that choices for the regularization matrix P can be the first- or second-order stable spline kernel, denoted by TC and SS, respectively, or the DC kernel. They are recalled below, specifying also the hyperparameter vector \(\eta \):
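To make these choices concrete, the following Python/NumPy sketch builds the three regularization matrices from their standard expressions (TC: \(\lambda \alpha ^{\max (i,j)}\); SS: \(\lambda (\alpha ^{i+j+\max (i,j)}/2-\alpha ^{3\max (i,j)}/6)\); DC: \(\lambda \alpha ^{(i+j)/2}\rho ^{|i-j|}\)) and computes the regularized estimate in the familiar closed form of regularized least squares. This is only an illustrative re-implementation under our own naming, not the toolbox code:

```python
import numpy as np

def tc_kernel(m, lam, alpha):
    # First-order stable spline (TC): P[i,j] = lam * alpha**max(i,j), lags 1..m
    k = np.arange(1, m + 1)
    return lam * alpha ** np.maximum.outer(k, k)

def ss_kernel(m, lam, alpha):
    # Second-order stable spline (SS)
    k = np.arange(1, m + 1)
    i, j = np.meshgrid(k, k, indexing="ij")
    mx = np.maximum(i, j)
    return lam * (alpha ** (i + j + mx) / 2 - alpha ** (3 * mx) / 6)

def dc_kernel(m, lam, alpha, rho):
    # Diagonal/correlated (DC): P[i,j] = lam * alpha**((i+j)/2) * rho**|i-j|
    k = np.arange(1, m + 1)
    i, j = np.meshgrid(k, k, indexing="ij")
    return lam * alpha ** ((i + j) / 2) * rho ** np.abs(i - j)

def reg_fir_estimate(Phi, Y, P, sigma2):
    # Regularized least squares in the "dual" form, avoiding inversion of P:
    # theta_hat = P Phi' (Phi P Phi' + sigma2 I)^{-1} Y
    N = Phi.shape[0]
    return P @ Phi.T @ np.linalg.solve(Phi @ P @ Phi.T + sigma2 * np.eye(N), Y)
```

The "dual" form avoids inverting P and is usually better conditioned when m is large, as with the high-order FIR models used below.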
9.1.1 Monte Carlo Studies with a Fixed Output Error Model
In this example the true impulse response is fixed to that reported in Fig. 8.2, obtained by random generation of a rational transfer function of order 10. It has to be estimated from 500 input–output couples (collected with the system initially at rest). The input is white noise filtered by the rational transfer function \(1/(z-p)\), where p will vary over the unit interval during the experiment. Note that p establishes the difficulty of our system identification problem. Values close to zero make the input similar to white noise and the output data informative over a wide range of frequencies. Instead, values of p close to 1 increase the low-pass nature of the input and, hence, the ill-conditioning. The measurement noise is white and Gaussian with variance equal to that of the noiseless output divided by 50. Two estimators will be adopted:

Oe+Or. Classical PEM approach (2.22) equipped with an oracle. In particular, our candidate models are rational transfer functions where the order of the two polynomials is equal and can vary between 1 and 30. For any model order, estimation is performed through nonlinear least squares by solving (2.22) with \(\ell \) in (2.21) set to the quadratic function. The method is implemented in oe.m of the MATLAB System Identification Toolbox. Then, the oracle chooses the estimate which maximizes the fit
$$\begin{aligned} 100\left( 1 - \left[ \frac{\sum ^{100}_{k=1}\left( g_k^0-\hat{g}(k)\right) ^2 }{\sum ^{100}_{k=1}\left( g_k^0-\bar{g}^0\right) ^2}\right] ^{\frac{1}{2}}\right) ,\ \ \bar{g}^0=\frac{1}{100}\sum ^{100}_{k=1}g^0_k, \end{aligned}$$(9.5)where \(g^0_k\) are the true impulse response coefficients while \(\hat{g}(k)\) denote their estimates. The estimator is given the information that system initial conditions are null.

TC+ML. This is the regularized estimator (9.1), equipped with the kernel TC. The number of estimated impulse response coefficients is \(m=100\) and the regression matrix is built with \(u(t)=0\) if \(t<0\). At every run, the noise variance is estimated by fitting via least squares a low-bias model for the impulse response. Then, the two kernel hyperparameters are obtained via marginal likelihood optimization, see (7.42). The method is implemented in impulseest.m of the MATLAB System Identification Toolbox.
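The fit measure (9.5) used to score the estimators admits a direct re-implementation. The sketch below (the function name is ours) computes it for generic true and estimated impulse responses:

```python
import numpy as np

def impulse_fit(g_true, g_hat):
    # Percentage fit of Eq. (9.5):
    # 100 * (1 - ||g0 - ghat|| / ||g0 - mean(g0)||)
    g_true = np.asarray(g_true, dtype=float)
    g_hat = np.asarray(g_hat, dtype=float)
    num = np.linalg.norm(g_true - g_hat)
    den = np.linalg.norm(g_true - g_true.mean())
    return 100.0 * (1.0 - num / den)
```

A perfect estimate gives a fit of 100, while an estimate equal to the mean of the true coefficients gives 0.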
We consider 4 Monte Carlo studies of 300 runs defined by different values of p in the set \(\{0,0.9,0.95,0.99\}\). As already mentioned, \(p=0\) corresponds to white noise input while \(p=0.99\) leads to a highly ill-conditioned problem (output data provide little information at high frequencies). Figure 9.1 reports the boxplots of the fits returned by Oe+Or and TC+ML for the four different values of p. Even if PEM exploits an oracle to tune complexity, its performance is (slightly) better than TC+ML only when the input is white noise, see also Table 9.1. When p increases, the ill-conditioning affecting the problem increases and TC+ML outperforms Oe+Or even if no oracle is used for hyperparameter tuning. This also points out the effectiveness of marginal likelihood optimization in controlling complexity.
This case study shows that continuous tuning of hyperparameters may be a more versatile and powerful approach than classical estimation of discrete model orders. A problem related to PEM here could also be the presence of local minima of the objective. This is much less critical when adopting kernel-based regularization. In fact, TC+ML regulates complexity through only two hyperparameters while Oe+Or has to optimize many more parameters (a function of the postulated model order).
9.1.2 Monte Carlo Studies with Different Output Error Models
Now we consider two Monte Carlo studies of 1000 runs regarding identification of several discretetime output error models. The outputs are still given by
with e white Gaussian noise independent of u, but the rational transfer function \(G_0\) changes at any run. In fact, a 30th-order single-input single-output continuous-time system is first randomly generated by the MATLAB command rss.m. It is then sampled at 3 times its bandwidth and used if its poles fall within the circle of the complex plane with centre at the origin and radius 0.99.
With the system at rest, 1000 input–output pairs are generated as follows. At any run, the system input is unit variance white Gaussian noise filtered by a second-order rational transfer function generated by the same procedure adopted to obtain \(G_0\). The outputs are corrupted by an additive white Gaussian noise with an SNR (the ratio between the variance of the noiseless output and that of the noise) randomly chosen in [1, 20] at any run. In the first experiment, the data set
contains the first 200 input–output couples, i.e., \(N=200\), while in the second experiment all the 1000 couples are used, i.e., \(N=1000\).
Starting from null initial conditions, at any run we also generate two different kinds of test sets
The first test set is especially challenging since the noiseless outputs are generated by using unit variance white Gaussian noise as input. In the second test set the input has instead the same statistics as that entering the identification data, hence making its prediction easier.
The performance of a model characterized by \(\hat{\theta }\), and returning \(\hat{y}^{new}(t|\hat{\theta })\) as output prediction at instant t, is
where \(\bar{y}^{new}\) is the average output in \(\mathcal{D}_{test}\) and \(\hat{y}^{new}(t|\hat{\theta })\) are computed assuming zero initial conditions (otherwise high-order models could have the advantage of calibrating the initial conditions to fit \(\mathcal{D}_{test}\)). The prediction fit (9.6) can be obtained by the MATLAB command predict(model,data,k,'ini','z') where model and data denote structures containing the estimated model and the test set \(\mathcal{D}_{test}\), respectively.
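As a hedged illustration of how such predictions can be produced for a FIR model, the sketch below simulates the estimated impulse response on the test input with zero initial conditions and scores it with a normalized-error fit of the same form as (9.5) applied to outputs, which is how we read (9.6); the function names are ours:

```python
import numpy as np

def fir_predict_zero_ic(theta, u_new):
    # Simulated output of an estimated FIR model on a test input,
    # with zero initial conditions (u(t)=0 for t<0), as in the text
    return np.convolve(u_new, theta)[: len(u_new)]

def prediction_fit(y_new, y_hat):
    # Percentage prediction fit of the form (9.6):
    # 100 * (1 - ||y - yhat|| / ||y - mean(y)||)
    y_new = np.asarray(y_new, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return 100.0 * (1.0 - np.linalg.norm(y_new - y_hat)
                    / np.linalg.norm(y_new - y_new.mean()))
```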
In what follows, we will also use estimators equipped with an oracle which evaluates the fit (9.6) for the test set of interest. Different rational models with orders between 1 and 30 are tried and the oracle selects the orders that give the best fit. We are now in a position to introduce the following 6 estimators:

Oe+Or1. Classical PEM approach (2.22), with quadratic \(\ell \) in (2.21), equipped with an oracle which uses the first test set (white noise input). As said, candidate models are rational transfer functions whose order can vary between 1 and 30. For any order, the model is returned by the function oe.m of the MATLAB’s System Identification Toolbox [14].

Oe+Or2. The same procedure described above except that the oracle maximizes the prediction fit using the second test set (test input with statistics equal to those of the training input).

Oe+CV. The classical approach now does not use any oracle: model order is estimated by cross validation, splitting the identification data \(\mathcal{D}_T\) into two sets containing the first and the last N/2 data, respectively. The model order minimizing the sum of squared prediction errors (computed assuming zero initial conditions) is chosen. Finally, the system estimate is computed using all the data in \(\mathcal{D}_T\) by solving (2.22) with quadratic loss.

{TC+ML,SS+ML,DC+ML}. These are three regularized FIR estimators of the form (9.1) with order 200 and kernels TC (9.2), SS (9.3) and DC (9.4). Marginal likelihood optimization (7.42) is used to determine the noise variance and the kernel hyperparameters (2 for SS and TC, 3 for DC). The regularized FIR models are estimated using the function impulseest.m in the MATLAB System Identification Toolbox [14].
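The order-selection procedure of Oe+CV can be sketched generically. In the fragment below, fit and predict are hypothetical stand-ins for the toolbox estimation and simulation routines; only the split-and-score logic mirrors the description above:

```python
import numpy as np

def select_order_by_cv(u, y, fit, predict, orders):
    # Cross-validation split as in Oe+CV: estimate on the first half of the
    # identification data, score the sum of squared prediction errors
    # (zero initial conditions) on the second half, pick the minimizer.
    # fit(u, y, n) and predict(model, u) are user-supplied stand-ins.
    N = len(y)
    u1, y1 = u[: N // 2], y[: N // 2]
    u2, y2 = u[N // 2:], y[N // 2:]
    scores = []
    for n in orders:
        model = fit(u1, y1, n)
        err = y2 - predict(model, u2)
        scores.append(np.sum(err ** 2))
    return orders[int(np.argmin(scores))]
```

In the actual experiments, the final model is then re-estimated on all the identification data at the selected order.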
9.1.2.1 Results
The MATLAB boxplots in Fig. 9.2 contain the 1000 fit measures returned by the estimators during the first experiment with \(N=200\) (left panels) and the second experiment with \(N=1000\) (right panels). Table 9.2 reports the average fit values.
In the top panels of Fig. 9.2 one can see the fits on the first test set. Recall that Oe+Or1 has access to such data to optimize the prediction capability. Interestingly, despite this advantage, the performance of all three regularized approaches is close to that of the oracle, while that of Oe+CV is not so satisfactory. This is also visible in the first two rows of Table 9.2.
The bottom panels of Fig. 9.2 show results relative to the second test set which is used by Oe+Or2 to maximize the prediction fit. Since training and test data are more similar, the prediction capability of Oe+CV improves significantly but the regularized estimators still outperform the classical approach, see also the last two rows of Table 9.2.
9.1.3 Real Data: A Robot Arm
Consider now the vibrating flexible robot arm described in [27], where two feedforward controller design methods were compared on trajectory tracking problems. The input of the robot arm is the driving couple and the output is the acceleration at the tip of the robot arm. The input–output data contain 40960 data points. They are collected at a sampling frequency of 500 Hz for 10 periods with each period containing 4096 data points. A portion of the data is shown in Fig. 9.3. The identification problem of the robot arm was studied in [23, Sect. 11.4.4] with frequency domain methods.
We will build models by both the classical prediction error method and the kernel method with the DC kernel. Since the true system is unknown, to compare the performance of different impulse response estimates we divide the data into two parts: the training and the test set, given by the first 6000 input–output couples and the remaining ones, respectively. Then, we measure how well the models, built with the estimation data, predict the test outputs.
For the prediction error method, we estimate nth-order state-space models without disturbance model and with zero initial conditions for \(n=1,\ldots ,36\). This method is available in MATLAB's System Identification Toolbox [13] as the command pem(data,n,'dist','no','ini','z'). The prediction fits computed using (9.6) are shown as a function of n in Fig. 9.4. An oracle that has access to the test set would select the order \(n=18\), hence obtaining a prediction fit equal to \(79.75\%\). For the kernel method with the DC kernel, we estimate a high-order FIR model of order 3000 with hyperparameters tuned by optimizing the marginal likelihood. When forming the regression matrix, the unknown input data are set to zero. The prediction fit (9.6) is 83.07% and is shown as a horizontal solid line in Fig. 9.4. The kernel method with the DC kernel is available in MATLAB's System Identification Toolbox [14] as the command impulseest(data,3000,0,opt) where, in the option opt, we set opt.RegulKernel='dc'; opt.Advanced.AROrder=0.
The Bode magnitude plot of the models estimated by PEM and the DC kernel is shown in Fig. 9.5. The empirical frequency function estimate obtained using the command etfe in MATLAB’s System Identification Toolbox [14] is also displayed.
The measured output and the predicted output over a portion of the test set are shown in Fig. 9.6. If one is concerned that a FIR model of order 3000 is quite large, such a high-order model can be reduced by projecting it onto a low-order state-space model. Exploiting model order reduction techniques, the fit of a state-space model of order \(n=25\) is 79.8%, still better than the best state-space description that can be obtained by PEM.
9.1.4 Real Data: A Hairdryer
The second application is a real laboratory device, whose function is similar to that of a hairdryer: the air is fanned through a tube and then heated at a mesh of resistor wires, as described in [13, Sect. 17.3]. The input to the hairdryer is the voltage over the mesh of resistor wires while the output is the air temperature measured by a thermocouple. The input–output data contain 1000 data points collected at a sampling frequency of 12.5 Hz for 80 s. A portion of the data is shown in the top panel of Fig. 9.7. Since the input–output values move around 5 and 4.9, respectively, we detrend the measurements in such a way that they move around 0. The estimation and test set data are then given by the first and the last 500 input–output couples, respectively.
As in the case of the robot arm, we build models by the classical prediction error method with an oracle, which maximizes the prediction fit, and the regularized approach with the DC kernel, with hyperparameters tuned by marginal likelihood optimization. For the prediction error method, we estimate nth-order state-space models without disturbance model for \(n=1,\ldots ,36\) and with zero initial conditions. The fits, as a function of n, are shown in Fig. 9.8. The best result is obtained for order \(n=5\) and turns out to be \(88.38\%\). For the kernel method with the DC kernel, we estimate a FIR model of order 70. When forming the regression matrix, we set the unknown input data to zero. The prediction fit (9.6) is somewhat close to that achieved by PEM+Oracle, being equal to 88.15%. It is shown as a dash-dot blue line in Fig. 9.8. The test set and the predicted outputs returned by the two methods are shown in Fig. 9.9. One can see that the regularized approach has a prediction capability very close to that of PEM+Oracle.
9.2 Identification of ARMAX Models
In this section we consider the identification of linear systems
Differently from the previous cases, besides the presence of multiple observable inputs \(u_i\), also the noise model is unknown. In fact, e(t) is white Gaussian noise of unit variance filtered by a system \(H_0(q)\) that has to be estimated from data.
First, it is useful to cast the identification of the general model (9.7) in a regularized context. Without loss of generality, to simplify the exposition, let \(p=1\) with the single observable input denoted by u. Exploiting (2.4), given the general linear model (9.7), we can write any predictor as two infinite impulse responses from y and u, respectively. When using ARX models, we have seen in (2.8) that such infinite responses specialize to finite responses. One has
where \(\theta _a = \begin{bmatrix} a_1&\ldots&a_{n_a} \end{bmatrix}^T\), \(\theta _b = \begin{bmatrix} b_1&\ldots&b_{n_b} \end{bmatrix}^T\) and \(\varphi _y(t)\), \(\varphi _u(t)\) are made up from y and u in an obvious way. Thus, the ARX model is a linear regression model, to which the same ideas of regularization can be applied. This point is important since we have seen in Theorem 2.1 that ARX expressions become arbitrarily good approximators for general linear systems as the orders \(n_a,n_b\) tend to infinity. However, as discussed in Chap. 2, high-order ARX can suffer from large variance. A solution is to set \(n_a=n_b=n\) to a large value and then introduce regularization matrices for the two impulse responses from y and from u. The matrix P in (9.1) can be partitioned along with \(\theta _a, \theta _b\):
with \(P^{a}(\eta _1),P^{b}(\eta _2)\) defined, e.g., by any of (9.2)–(9.4). Letting \(\theta =[\theta _a^T \ \theta _b^T]^T\) and building the regression matrix using \([\varphi ^T_y(t) \ \varphi ^T_u(t)]\) as rows, the estimator (9.1) now becomes a regularized high-order ARX. The MATLAB code for estimating this model using, e.g., the DC kernel would be
ao = arxRegulOptions('RegularizationKernel','DC');
[Lambda,R] = arxRegul(data,na,nb,nk,ao);
aropt = arxOptions; aropt.Regularization.Lambda = Lambda;
aropt.Regularization.R = R;
m = arx(data,na,nb,nk,aropt);
We can also easily extend this construction to multiple inputs. Given any generic p, one needs to estimate \(p+1\) impulse responses, with the matrix (9.9) now containing \(p+1\) blocks. If there are multiple outputs, one approach is to consider each output channel as a separate linear regression as in (9.8). The difference is that now the other outputs also need to be appended as done with the inputs.
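A minimal sketch of the construction just described: one kernel block per impulse response is placed on the diagonal of the regularization matrix, as in (9.9), and the ARX regressor rows stack past outputs and inputs as in (9.8). The helper names are ours:

```python
import numpy as np

def block_diag_penalty(blocks):
    # Assemble the block-diagonal regularization matrix of (9.9),
    # one kernel block per predictor impulse response
    sizes = [B.shape[0] for B in blocks]
    P = np.zeros((sum(sizes), sum(sizes)))
    ofs = 0
    for B, s in zip(blocks, sizes):
        P[ofs:ofs + s, ofs:ofs + s] = B
        ofs += s
    return P

def arx_regressor(y, u, n):
    # Single-input case (9.8): the row at time t is
    # [y(t-1),...,y(t-n), u(t-1),...,u(t-n)], with target y(t)
    N = len(y)
    rows = []
    for t in range(n, N):
        rows.append(np.concatenate([y[t - n:t][::-1], u[t - n:t][::-1]]))
    return np.array(rows), y[n:]
```

With p inputs one simply appends p input blocks to each row and p kernel blocks to the penalty, mirroring the multi-input extension above.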
9.2.1 Monte Carlo Experiment
One challenging Monte Carlo study of 1000 runs is now considered. Data are generated at any run by an ARMAX model of order 30 having p observable inputs, i.e.,
with p drawn from a random variable uniformly distributed on \(\{2,3,4,5 \}\). Note that the system contains \(p+1\) rational transfer functions. They depend on the polynomials \(A,B_i\) and C which are randomly generated at any run by the MATLAB function drmodel.m. This function is first called to obtain the common denominator A and the first numerator \(B_1\). The other p calls are used to obtain the numerators of the remaining rational transfer functions. The system so generated is accepted if the modulus of its poles is not larger than 0.95. In addition, letting \(G_i(q)=\frac{B_i(q)}{A(q)}\) and \(H(q)=\frac{C(q)}{A(q)}\), the signal-to-noise ratio has to satisfy
where \(\Vert G_i\Vert _2,\Vert H\Vert _2\) are the \(\ell _2\) norms of the system impulse responses.
After a transient to mitigate the effect of initial conditions, at any run 300 input–output couples are collected to form the identification data set \(\mathcal{D}_T\) and another 1000 to define the test set \(\mathcal{D}_{test}\). In all cases, the input is white Gaussian noise of unit variance.
Differently from the output error models, in the ARMAX case the performance measure adopted to compare different estimated models depends on the prediction horizon k. More specifically, let \(\hat{y}^{new}_k(t|\hat{\theta })\) be the k-step-ahead predictor associated with an estimated model characterized by \(\hat{\theta }\). For any t, such a function predicts k steps ahead the test output \(y^{new}(t)\) by using the values of the test input \(u^{new}\) up to time \(t-1\) and of the test output \(y^{new}\) up to \(t-k\). The prediction difficulty in general increases as k gets larger. The special case \(k=1\) corresponds to the one-step-ahead predictor given by (2.4); see, e.g., [13, Sect. 3.2] for the expressions of the generic k-step-ahead predictors.
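For a first-order ARX model, the k-step-ahead predictor just described can be sketched by iterating the one-step predictor and feeding predicted outputs back in place of the unknown ones (a standard scheme; the function name and signature are ours):

```python
import numpy as np

def kstep_predict_arx(a, b, y, u, k):
    # k-step-ahead prediction for the first-order ARX model
    # y(t) = -a*y(t-1) + b*u(t-1) + e(t): iterate the one-step predictor,
    # substituting predicted outputs for the unmeasured ones. Inputs are
    # known up to t-1, outputs only up to t-k.
    N = len(y)
    yhat = np.full(N, np.nan)
    for t in range(k, N):
        yp = y[t - k]                      # last measured output
        for s in range(t - k + 1, t + 1):  # roll the predictor forward
            yp = -a * yp + b * u[s - 1]
        yhat[t] = yp
    return yhat
```

On noiseless data generated by the same model, the k-step prediction is exact for any k, which is a convenient sanity check.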
As done in (9.6), we use \(\bar{y}^{new}\) to denote the mean of the outputs in \(\mathcal{D}_{test}\), but now the prediction fit depends on k, being given by
In this case, we say that an estimator is equipped with an oracle if it can use the test set to maximize \(\sum _{k=1}^{20} \mathscr {F}_{k}\) by tuning the complexity of the model estimated using the identification data. The following estimators are then introduced:

PEM+Oracle: this is the classical PEM approach (2.22) with quadratic loss equipped with an oracle. The candidate model structures are ARMAX models with polynomials all having the same degree up to 30. For any model order, the MATLAB command pem.m (or armax.m) of the MATLAB System Identification Toolbox [14] is used to obtain the system's estimate.

PEM+CV: in place of the oracle, model complexity is estimated by cross validation, splitting \(\mathcal{D}_T\) into two sets containing, respectively, the first and the last 150 input–output couples. The model order which minimizes the sum of the squared one-step-ahead prediction errors, computed with zero initial conditions on the validation data, is selected. The final system estimate is returned by (2.22) using all the identification data.

{PEM+AICc,PEM+BIC}: this is the classical PEM approach with AICtype criteria used to tune complexity, as reported in (2.35) and (2.36).

{TC+ML,SS+ML,DC+ML}: these are the three regularized least squares estimators introduced at the beginning of this section which determine the unknown coefficients of the multi-input version of the ARX model. After setting the length of each predictor impulse response to 50, the regularization matrices entering the multi-input version of (9.9) are defined by the TC (9.2), SS (9.3) or DC (9.4) kernels. The first 50 input–output pairs in \(\mathcal{D}_T\) are used just as entries of the regression matrix. For every impulse response, a different scale factor \(\lambda \) and a common variance decay rate \(\alpha \) (and, in the case of DC, a correlation \(\rho \)) are adopted. The hyperparameters are determined via marginal likelihood optimization.
All the system input delays are assumed known and their values are provided to all the estimators described above.
The average of the fits \(\mathscr {F}_k\) given by (9.10), as a function of the prediction horizon k, is reported in Fig. 9.10. Since PEM equipped with Akaike-like criteria returns very small average fits, the results achieved by these procedures are not displayed. The MATLAB boxplots of the 1000 values of \(\mathscr {F}_1\) and \(\mathscr {F}_{20}\) returned by all the estimators are visible in Fig. 9.11. The average fit of SS+ML is quite close to that of PEM+Oracle, which is in turn outperformed by TC+ML and DC+ML. This is remarkable also considering that such kernel-based approaches can be used in real applications, while PEM+Oracle relies on an ideal tuning which exploits the test set. Results returned by PEM equipped with CV are instead unsatisfactory.
The results outline the importance of regularization, especially in experiments with relatively small data sets. In this case, only 300 input–output measurements are available with quite complex systems of order 30. The classical PEM approach equipped with any model order-selection rule cannot predict the test set better than the oracle. However, the latter can tune complexity only by exploring a finite set of given models. Kernel-based approaches can instead balance bias and variance by continuous tuning of regularization parameters. In this way, better performing trade-offs may be reached.
9.2.2 Real Data: Temperature Prediction
Now we consider thermodynamic modelling of buildings using some real data taken from [22]. Eight sensors are placed in two rooms of a small two-floor residential building of about 80 \(\text {m}^2\) and 200 \(\text {m}^3\). They are located only on one floor (approximately 40 \(\text {m}^2\)). More specifically, temperatures are collected through a wireless sensor network made of 8 Tmote Sky nodes produced by Moteiv Inc. The building was inhabited during the measurement period, consisting of 8 days, and samples were taken every 5 min. A thermostat controlled the heating system, with the reference temperature manually set every day depending upon occupancy and other needs. This makes available a total of 8 temperature profiles, displayed in Fig. 9.12. One can see the high level of collinearity of the signals. This makes the problem ill-conditioned, complicating the identification process.
We just consider multipleinput singleoutput (MISO) models. The temperature from the first node is seen as the output (\(y_i\)) and the other 7 temperatures as inputs (\(u^j_i\), \(j=1,\ldots ,7\)). Data are divided into 2 parts: those collected at time instants \(1,\ldots ,1200\) form the identification set while those at instants \(1201,\ldots ,2500\) are used for test purposes. With 5 min sampling times, 1200 instants almost correspond to 100 h, a rather small time interval. Hence, we assume a “stationary” environment and normalize the data so as to have zero mean and unit variance before performing identification. Quality of the kstepahead prediction on test data is measured by (9.10).
Identification has been performed using ARMAX models with an oracle which has access to the test set. This estimator, called PEM+Or, maximizes \(\sum _{k=1}^{48} \mathscr {F}_k\), which accounts for the prediction capability up to 4 h ahead. The other estimator is the regularized ARX equipped with the TC kernel, with a different scale factor \(\lambda \) assigned to each unknown one-step-ahead predictor impulse response and a common decay rate \(\alpha \). The length of each impulse response is set to 50 and the hyperparameters are estimated via marginal likelihood maximization using only the identification data. This estimator is denoted by TC+ML. Results are reported in Fig. 9.13 (top panel): the performance of PEM+Or and TC+ML is quite similar. Sample trajectories of the one-hour-ahead test data prediction returned by TC+ML are also reported in Fig. 9.13 (bottom panel).
9.3 Multitask Learning and Population Approaches \(\star \)
In the previous chapters we have studied the problem of reconstructing a real-valued function from discrete and noisy samples. An extension is the so-called multitask learning problem in which several functions (tasks) are simultaneously estimated. This problem is significant if the tasks are related to each other, so that measurements taken on one function are informative with respect to the other ones. An example is given by a network of linear systems whose impulse responses share some common features. Here, a relevant problem is the study of anomaly detection in homogeneous populations of dynamic systems [5, 6, 10]. Normally, all of them are supposed to have the same (possibly unknown) nominal dynamics. However, there can be a subset of systems that have anomalies (deviations from the mean) and the goal is to detect them from the data collected in the population. Important applications of multitask learning arise also in biomedicine when multiple experiments are performed on subjects from a population [9]. Similar patterns are observed in individual responses, so that measurements collected on one subject can also help reconstruct the responses of other individuals. In pharmacokinetics (PK) and pharmacodynamics (PD) the joint analysis of several individual curves is often exploited and called population analysis [24]. One class of adopted models is parametric, e.g., compartmental ones [7]. The problem can be solved using, e.g., the NONMEM software, which traces back to the seventies [3, 25], or more sophisticated approaches like Bayesian MCMC algorithms [15, 28]. More recently, machine learning/nonparametric approaches have been proposed for the population analysis of PK/PD data [16, 19, 20].
In the machine learning literature, the term multitask learning was originally introduced in [4]. The performance improvement achievable by using a multitask approach instead of a single-task one, which learns the functions separately, was then pointed out in [1, 26], see also [2] for a Bayesian treatment. Next, in [8] a regularized kernel method was proposed, hinging upon the theory of vector-valued reproducing kernel Hilbert spaces [18]. Developments and applications of multitask learning can then be found, e.g., in [11, 12, 17, 21, 29, 30].
9.3.1 Kernel-Based Multitask Learning
We will now see that multitask learning can be cast within the RKHS setting developed in the previous chapters by defining a particular kernel. Just to simplify exposition, let us assume that there is a common input space X for all the tasks and consider a set of k functions \(\mathbf{{f}}_i:X \mapsto \mathbb {R}\). Assume also that the following \(n_i\) input–output data are available for each task i
Our goal is to jointly estimate all the unknown functions \(\mathbf{{f}}_i\) starting from these examples. For this aim, first a kernel can be introduced to include our knowledge on the single functions (like smoothness) and also on their relationships. This can be done by defining an enlarged input space
Hence, a generic element of \(\mathscr {X}\) is the couple (x, i) where \(x \in X\) while \(i \in \{1,\ldots ,k\}\). The index i thus specifies that the input location belongs to the part of the function domain connected with the ith function. The information regarding all the tasks can now be specified by the kernel \(K: \mathscr {X} \times \mathscr {X} \rightarrow \mathbb {R}\) which induces a RKHS of functions \(\mathbf{{f}}: \mathscr {X} \rightarrow \mathbb {R}\). In fact, we are just exploiting RKHS theory on function domains that include both continuous and discrete components. Note that, in practice, any function \(\mathbf {f}\) embeds k functions \(\mathbf{{f}}_i\).
Regularization in RKHS then allows us to reconstruct the tasks from the data (9.11) by computing
Under general conditions on the losses \(\mathscr {V}_{li}\), we can then apply the representer theorem, i.e., Theorem 6.15, to obtain the following expression for the minimizer:
where \(\{c_{li}\}\) are suitable scalars. Adopting quadratic losses which include weights \(\{\sigma ^2_{li}\}\), i.e.,
for any \(a,b \in \mathbb {R}\), a regularization network is obtained and the expansion coefficients \(\{c_{li}\}\) solve the following linear system of equations
where \(q=1,\ldots ,k\), \(j=1,\ldots ,n_q\) and \(\delta _{ij}\) is the Kronecker delta.
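In matrix form, stacking all the measurements, the system above amounts to solving a single linear system with the Gram matrix of the multitask kernel. A minimal sketch (the function names are ours; the per-measurement weights enter through a diagonal matrix):

```python
import numpy as np

def regularization_network(K, y, gamma, sigma2):
    # Expansion coefficients of the representer-theorem solution:
    # (K + gamma * diag(sigma2)) c = y, with K the Gram matrix of the
    # multitask kernel over all enlarged inputs (x_li, i)
    return np.linalg.solve(K + gamma * np.diag(sigma2), y)

def evaluate_tasks(c, K_new):
    # f_hat at new enlarged inputs: row r of K_new collects
    # K((x*, i*), (x_lj, j)) over all training data
    return K_new @ c
```

As the weights \(\sigma ^2_{li}\) shrink to zero, the solution interpolates the data, which gives a simple sanity check.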
Connection with Bayesian estimation. Exploiting the same arguments developed in Sect. 8.2.1, the following relationship between (9.13), (9.14) and Bayesian estimation of Gaussian random fields is obtained. Let the measurement model be
where \(\{e_{ji}\}\) are independent Gaussian noises of variances \(\{\sigma ^2_{ji}\}\). Define
Assume also that \(\{\mathbf{{f}}_i\}\) are zeromean Gaussian random fields, independent of the noises, with covariances
where \(i=1,\ldots ,k\) and \(q=1,\ldots ,k\). Then, one obtains that for \(j=1,\ldots ,k\), the minimum variance estimate of \(\mathbf{{f}}_j\) conditional on \(y^k\) is defined by (9.13), (9.14) by setting \(\gamma =1\). Furthermore, the posterior variance of \(\mathbf{{f}}_j(x)\) is
In the above formula, in view of the independence assumptions, one has
where each block \(V_{iq}\) belongs to \(\mathbb {R}^{n_i \times n_q}\) and its (l, j)entry is given by
while \(\varSigma _i={{\,\mathrm{diag}\,}}\{\sigma ^2_{1i},\ldots ,\sigma ^2_{n_i i}\}\). In addition
Example of multitask kernel: average plus shift. A simple yet useful class of multitask kernels is obtained by defining K as follows:
where \(\overline{\lambda }^2\) and \(\widetilde{\lambda }^2\) are two scale factors that typically need to be estimated from data. Such a kernel describes each function as the sum of an average function \(\mathbf {\overline{f}}\), hereafter named the average task, and an individual shift \(\mathbf {\widetilde{f}}_j(x)\) specific to each task. Indeed, if \(\overline{\lambda } = 0\) all the functions would be learnt independently of each other. Instead, when \(\widetilde{\lambda } = 0\) all the tasks are actually the same. The Bayesian interpretation of multitask learning discussed above also facilitates the understanding of this model. In fact, once the kernel is seen as a covariance, it is easy to see that, for any i and \(x \in X\), each task decomposes into
where \({{\bar{\mathbf {f}}}}\) and \(\{{{\tilde{\mathbf {f}}}}_i\}\) are zeromean independent Gaussian random fields.
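The average-plus-shift structure can be sketched as a sum of a shared kernel and a task-specific one switched on by the Kronecker delta; `Kbar`, `Ktilde` and the function name below are hypothetical placeholders for any valid single-task kernels:

```python
def avg_shift_kernel(j, q, x, xp, lam_bar2, lam_tilde2, Kbar, Ktilde):
    """Multi-task kernel: a shared 'average task' component plus a
    task-specific shift, active only when both arguments come from the
    same task (j == q). A sketch of the average-plus-shift structure;
    Kbar and Ktilde are user-supplied single-task kernels."""
    return lam_bar2 * Kbar(x, xp) + (j == q) * lam_tilde2 * Ktilde(x, xp)
```

Setting `lam_bar2 = 0` decouples the tasks, while `lam_tilde2 = 0` forces all tasks to coincide with the average one, mirroring the two limit cases discussed above.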
9.3.2 Numerical Example: Real Pharmacokinetic Data
Multi-task learning is now illustrated by considering a data set connected with xenobiotics administration in 27 human subjects [20]. Such an administration can be seen as the input to a continuous-time linear dynamic system whose (measurable) output is the drug profile in plasma. In each subject, 8 measurements were collected at 0.5, 1, 1.5, 2, 4, 8, 12 and 24 h after a bolus, an input which can be seen as a Dirac delta. Hence, one has to deal with a particular continuous-time system identification problem where noisy and direct samples of the impulse response are available.
In this experiment, the noises are known to be Gaussian and heteroscedastic, i.e., their variances are not constant, being given by \(\sigma _{ij}^2 = (0.1y_{ij})^{2}\). The 27 experimental concentration profiles are displayed in Fig. 9.14, together with the average profile. In light of the number of subjects, such an average curve is a reasonable estimate of the average task \(\overline{\mathbf {f}}\).
The whole data set consists of 216 pairs \((x_{ij},y_{ij})\), for \(i=1, \ldots , 8\) and \(j=1, \ldots ,27\), and is split into an identification (training) set and a test set. As for training, a sparse sampling schedule is considered: only 3 measurements per subject are randomly chosen among the 8 available data. We will adopt the multi-task estimator (9.12) to reconstruct all the continuous-time profiles. In view of the Gaussian and heteroscedastic nature of the noise, the losses are defined by
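With the 10% proportional-error noise model stated above, such a heteroscedastic quadratic loss amounts to a residual sum of squares weighted by \(1/\sigma_{ij}^2\). A minimal sketch, with illustrative function and parameter names:

```python
import numpy as np

def weighted_sq_loss(y, f, rel_err=0.1):
    """Heteroscedastic quadratic loss with variances (rel_err * y)^2,
    matching the proportional-error noise model of the experiment.
    'rel_err' is the assumed name for the 0.1 factor."""
    y = np.asarray(y, dtype=float)
    sigma2 = (rel_err * y) ** 2          # per-sample noise variances
    r = y - np.asarray(f, dtype=float)   # residuals
    return np.sum(r ** 2 / sigma2)
```

Each residual is thus normalized by its own noise standard deviation, so measurements taken at high concentrations, which are noisier, are down-weighted.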
As for the function model, since humans are expected to give similar responses to the drug, quite close to an average function, the kernel (9.17) is adopted. In addition, it is known that in these experiments there is greater variability for small values of t, followed by an asymptotic decay to zero. This motivates the use of a stable kernel to model both the average task and the shifts. A model suggested in [20] is a cubic spline kernel under the time transformation
which defines (9.17) through the correspondences
One can check that this model induces a stable RKHS by using Corollary 7.2. In fact, the kernels are nonnegative-valued and the integral of a generic kernel section is
and this result clearly implies
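The stability mechanism can be illustrated with the classic stable-spline construction, i.e., a cubic spline kernel composed with an exponentially decaying time change. The exact transformation adopted in [20] is not reproduced in the text above, so \(\tau(t)=e^{-\beta t}\) below is an illustrative stand-in:

```python
import numpy as np

def cubic_spline_kernel(u, v):
    # Standard cubic spline kernel on [0, 1]
    m = np.minimum(u, v)
    return u * v * m / 2.0 - m ** 3 / 6.0

def stable_cubic_spline(t, s, beta=1.0):
    """Cubic spline kernel composed with the exponentially decaying time
    change tau(t) = exp(-beta * t). An illustrative stand-in for the
    transformation used in [20]."""
    return cubic_spline_kernel(np.exp(-beta * t), np.exp(-beta * s))
```

Since \(\tau(t)\to 0\) as \(t\to\infty\), each kernel section decays to zero and has finite integral, which is the property exploited by Corollary 7.2 to conclude stability of the induced RKHS.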
The initial plasma concentration is known to be zero. Hence, a zero-variance virtual measurement at \(t=0\) was added for all tasks. The hyperparameters \(\overline{\lambda }^2\) and \(\widetilde{\lambda }^2\) were then estimated via marginal likelihood maximization by exploiting the Bayesian interpretation of multi-task learning discussed above.
The left and right panels of Fig. 9.15 report the results obtained by the single- and the multi-task approach, respectively, in 5 subjects. One can see the data and the estimated curves with their \(95\%\) confidence intervals obtained using the posterior variance (9.16). Each panel also shows the estimates obtained by employing the full sampling grid. It is apparent that the multi-task estimates are closer to these reference profiles. A good predictive capability with respect to the other five “unobserved” data is also visible. To better quantify this aspect, let \(I^f\) and \(I^r_j\) denote the full and reduced sampling grids in the jth subject. Let also \(I_j=I^f \setminus I_j^r\), whose cardinality is 5. Then, for each subject, we also define the prediction error as
with the single-task \(RMSE_j^{ST}\) defined in a similar way. Figure 9.16 then reports the boxplots of the 27 RMSEs returned by the single- and multi-task estimates. The improvement in prediction performance due to the kernel-based population approach is evident.
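The per-subject prediction error can be sketched as follows, computing the RMSE over the held-out samples \(I_j = I^f \setminus I^r_j\); the argument names are illustrative:

```python
import numpy as np

def prediction_rmse(y_full, f_hat, reduced_idx):
    """RMSE of one subject over the held-out samples.

    y_full: measurements on the full sampling grid I^f; f_hat: estimates
    at the same points; reduced_idx: indices of the training samples I^r_j.
    The held-out set is the complement of reduced_idx in the full grid."""
    held_out = [i for i in range(len(y_full)) if i not in set(reduced_idx)]
    r = np.asarray(y_full, float)[held_out] - np.asarray(f_hat, float)[held_out]
    return np.sqrt(np.mean(r ** 2))
```

With 8 samples per subject and 3 used for training, the mean inside the square root runs over the 5 held-out points, matching the cardinality of \(I_j\) stated above.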
References
Bakker B, Heskes T (2003) Task clustering and gating for Bayesian multitask learning. J Mach Learn Res 4:83–99
Baxter J (1997) A Bayesian/Information theoretic model of learning to learn via multiple task sampling. Mach Learn 28:7–39
Beal S, Sheiner L (1992) NONMEM user’s guide. NONMEM Project Group, University of California, San Francisco
Caruana R (1997) Multitask learning. Mach Learn 28(1):41–75
Chen T, Andersen MS, Chiuso A, Pillonetto G, Ljung L (2014) Anomaly detection in homogenous populations: a sparse multiple kernel-based regularization method. In: 53rd IEEE conference on decision and control, pp 265–270
Chu E, Gorinevsky D, Boyd S (2011) Scalable statistical monitoring of fleet data. IFAC Proc Vol 44(1):13227–13232. 18th IFAC world congress
Davidian M, Giltinan DM (1995) Nonlinear models for repeated measurement data. Chapman and Hall, New York
Evgeniou T, Micchelli CA, Pontil M (2005) Learning multiple tasks with kernel methods. J Mach Learn Res 6:615–637
Ferrazzi F, Magni P, Bellazzi R (2003) Bayesian clustering of gene expression time series. In: Proceedings of 3rd international workshop on bioinformatics for the management, analysis and interpretation of microarray data (NETTAB 2003), pp 53–55
Gorinevsky D, Matthews B, Martin R (2012) Aircraft anomaly detection using performance models trained on fleet data. In: 2012 conference on intelligent data understanding, pp 17–23
Ling H, Wang Z, Li P, Shi Y, Chen J, Zou F (2019) Improving person re-identification by multi-task learning. Neurocomputing 347:109–118
Liu Z, Huang B, Cui Y, Xu Y, Zhang B, Zhu L, Wang Y, Jin L, Wu D (2019) Multi-task deep learning with dynamic programming for embryo early development stage classification from time-lapse videos. IEEE Access 7:122153–122163
Ljung L (1999) System identification: theory for the user, 2nd edn. Prentice-Hall, Upper Saddle River
Ljung L (2013) System identification toolbox V8.3 for MATLAB. The MathWorks, Inc., Natick
Lunn DJ, Best N, Thomas A, Wakefield JC, Spiegelhalter D (2002) Bayesian analysis of population PK/PD models: general concepts and software. J Pharmacokinet Pharmacodyn 29(3):271–307
Magni P, Bellazzi R, De Nicolao G, Poggesi I, Rocchetti M (2002) Nonparametric AUC estimation in population studies with incomplete sampling: a Bayesian approach. J Pharmacokinet Pharmacodyn 29(5/6):445–471
Maurer A, Pontil M, RomeraParedes B (2016) The benefit of multitask representation learning. J Mach Learn Res 17(1):2853–2884
Micchelli CA, Pontil M (2005) On learning vector-valued functions. Neural Comput 17(1):177–204
Neve M, De Nicolao G, Marchesi L (2005) Nonparametric identification of pharmacokinetic population models via Gaussian processes. In: Proceedings of 16th IFAC world congress, Praha, Czech Republic
Neve M, De Nicolao G, Marchesi L (2007) Nonparametric identification of population models via Gaussian processes. Automatica 43(7):1134–1144
Pillonetto G, Dinuzzo F, De Nicolao G (2010) Bayesian online multitask learning of Gaussian processes. IEEE Trans Pattern Anal Mach Intell 32(2):193–205
Pillonetto G, Chen T, Chiuso A, De Nicolao G, Ljung L (2016) Regularized linear system identification using atomic, nuclear and kernel-based norms: the role of the stability constraint. Automatica 69:137–149
Pintelon R, Schoukens J (2012) System identification: a frequency domain approach, 2nd edn. Wiley, New York
Sheiner LB (1994) The population approach to pharmacokinetic data analysis: rationale and standard data analysis methods. Drug Metab Rev 15:153–171
Sheiner LB, Rosenberg B, Marathe V (1977) Estimation of population characteristics of pharmacokinetic parameters from routine clinical data. J Pharmacokinet Biopharm 5(5):445–479
Thrun S, Pratt L (1997) Learning to learn. Kluwer, Boston
Torfs D, Vuerinckx R, Swevers J, Schoukens J (1998) Comparison of two feedforward design methods aiming at accurate trajectory tracking of the end point of a flexible robot arm. IEEE Trans Control Syst Technol 6(1):2–14
Wakefield JC, Smith AFM, Racine-Poon A, Gelfand AE (1994) Bayesian analysis of linear and nonlinear population models by using the Gibbs sampler. J Appl Stat 41:201–221
Xu Y, Li X, Chen D, Li H (2018) Learning rates of regularized regression with multiple Gaussian kernels for multi-task learning. IEEE Trans Neural Netw Learn Syst 29(11):5408–5418
Zhou Q, Zhao Q (2016) Flexible clustered multi-task learning by learning representative tasks. IEEE Trans Pattern Anal Mach Intell 38(2):266–278
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2022 The Author(s)
Pillonetto, G., Chen, T., Chiuso, A., De Nicolao, G., Ljung, L. (2022). Numerical Experiments and Real World Cases. In: Regularized System Identification. Communications and Control Engineering. Springer, Cham. https://doi.org/10.1007/978-3-030-95860-2_9