2.1 The State-of-the-Art Identification Setup

System Identification is characterized by five basic concepts:

  • \(\textit{X}\): the experimental conditions under which the data is generated;

  • \(\mathscr {D}\): the  data;

  • \(\mathscr {M}\): the model structure and its parameters \(\theta \);

  • \(\mathscr {I}\): the identification method by which a parameter value \(\hat{\theta }\) in the model structure \(\mathscr {M}(\theta )\) is determined based on the data \(\mathscr {D}\);

  • \(\mathscr {V}\): the validation process that generates confidence in the identified model.

See Fig. 2.1. Navigating to a model that passes the validation test (“is not falsified”) is typically an iterative process, involving revisions of the necessary choices. Helpful support tools have been developed for several of the steps in this loop. It is, however, neither quite possible nor desirable to fully automate the choices, since subjective perspectives related to the intended use of the model are very important.

2.2 \(\mathscr {M}\): Model Structures

A model structure \(\mathscr {M}\) is a set of parametrized models that describe the relations between the inputs u and outputs y of the system. The parameters are denoted by \(\theta \), so a particular model will be denoted by \(\mathscr {M}(\theta )\). The set of models is then

$$\begin{aligned} \mathscr {M}=\{\mathscr {M}(\theta )|\theta \in D_{\mathscr {M}}\}. \end{aligned}$$
(2.1)
Fig. 2.1 The identification work loop

The models may be expressed and formalized in many different ways. The most common model type is linear and time-invariant (LTI), but possible models include both nonlinear and time-varying cases, so a list of concrete models in actual use would be both very long and diverse.

It is useful to take the general view that a model gives a rule to predict (one-step-ahead) the output at time t, i.e., y(t) (a p-dimensional column vector), based on observations of previous input–output data up to time \(t-1\) (denoted by \(Z^{t-1}=\{y(t-1),u(t-1),y(t-2),u(t-2),\ldots \}\)). Here u(t) is the input at time t and we assume here that the data are collected in discrete time and denote for simplicity the samples as enumerated by t.

The predicted output will then be

$$\begin{aligned} \hat{y}(t|\theta )=g(t,\theta ,Z^{t-1}) \end{aligned}$$
(2.2)

for a certain function g of past data. This covers a very wide variety of model descriptions, sometimes in a somewhat abstract way. The descriptions become much more explicit when we specialize to linear models.

A note on “inputs” All measurable disturbances that affect y should be included among the inputs u to the system, even if they cannot be manipulated as control inputs. In some cases, the system may entirely lack measurable inputs, so the model (2.2) then just describes how future outputs can be predicted from past ones. Such models are called time series, and correspond to systems that are driven by unobservable disturbances. Most of the techniques described in this book apply also to such models.

A note on disturbances A complete model involves both a description of the input–output relations and a description of how various disturbance or noise sources affect the measurements. The noise description is essential both to understand the quality of the model predictions and the model uncertainty. Proper control design also requires a picture of the disturbances in the system.

2.2.1 Linear Time-Invariant Models

For linear time-invariant (LTI) systems, a general model structure is given by the transfer function G from input u to output y and with an additive disturbance—or noise—v(t):

$$\begin{aligned} y(t)=G(q,\theta )u(t)+v(t). \end{aligned}$$
(2.3a)

This model is in discrete time and q denotes the shift operator \(qy(t)=y(t+1)\). The sampling interval is set to one time unit. The expansion of \(G(q,\theta )\) in the inverse (backwards) shift operator gives the impulse response of the system:

$$\begin{aligned} G(q,\theta )u(t)=\sum ^\infty _{k=1} g_k(\theta )q^{-k}u(t)=\sum ^\infty _{k=1}g_k(\theta )u(t-k). \end{aligned}$$
(2.3b)

The discrete-time Fourier transform (or the z-transform of  the impulse response, evaluated in \(z=e^{i\omega }\)) gives the frequency response of the system:

$$\begin{aligned} G(e^{i\omega },\theta )=\sum ^{\infty }_{k=1}g_k(\theta )e^{-ik\omega }. \end{aligned}$$
(2.3c)

The function G describes how an input sinusoid shifts phase and amplitude when it passes through the system.
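As a numerical illustration of (2.3b)–(2.3c), the following sketch (with a made-up first-order impulse response \(g_k = 0.5\cdot 0.8^{k-1}\)) computes the frequency response by truncating the sum and checks it against the closed-form transfer function:

```python
import numpy as np

# Hypothetical first-order example: g_k = 0.5 * 0.8**(k-1), k = 1, 2, ...
K = 200                                  # truncation length of the impulse response
k = np.arange(1, K + 1)
g = 0.5 * 0.8 ** (k - 1)

# Frequency response (2.3c): G(e^{iw}) = sum_k g_k e^{-ikw}
w = np.linspace(0, np.pi, 256)
G = np.array([np.sum(g * np.exp(-1j * k * wi)) for wi in w])

# For this example, G(q) = 0.5 q^{-1} / (1 - 0.8 q^{-1}), so a closed form is available
G_exact = 0.5 * np.exp(-1j * w) / (1 - 0.8 * np.exp(-1j * w))
print(np.max(np.abs(G - G_exact)))       # small truncation error
```

The static gain \(G(e^{i0})\) of this example is \(0.5/(1-0.8)=2.5\), matching the sum of the impulse response coefficients.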

The additive noise term v can be described as white noise e(t), filtered through another transfer function H:

$$\begin{aligned} \ v(t) = H(q,\theta )e(t)\end{aligned}$$
(2.3d)
$$\begin{aligned} \ {\mathscr {E}}e^2(t)=\sigma ^2\end{aligned}$$
(2.3e)
$$\begin{aligned} {\mathscr {E}}e(t)e^T(k)=0 \;\text {if}\; k\ne t \end{aligned}$$
(2.3f)

(\({\mathscr {E}}\) denotes mathematical expectation).

This noise characterization is quite versatile and with a suitable choice of H it can describe a disturbance with a quite arbitrary spectrum. It is useful to normalize (2.3d) by making H monic:

$$\begin{aligned} H(q,\theta ) = 1 + h_1(\theta )q^{-1} + \cdots . \end{aligned}$$
(2.3g)

To think in terms of the general model description (2.2), with the predictor as a unifying model concept, it is useful to rewrite (2.3), assuming H to be inversely stable [5, Sect. 3.2], as

$$\begin{aligned} H^{-1}(q,\theta )y(t)&=H^{-1}(q,\theta )G(q,\theta ) u(t) +e(t)\\ y(t)&= [1-H^{-1}(q,\theta )]y(t) +H^{-1}(q,\theta )G(q,\theta )u(t) +e(t)\\ &=G(q,\theta )u(t)+[1-H^{-1}(q,\theta )][y(t)-G(q,\theta )u(t)]+e(t). \end{aligned}$$

Note that the expansion of \(H^{-1}\) starts with “1”, so the expansion of \(1-H^{-1}\) starts with \(\tilde{h}_1 q^{-1}\); there is thus a delay in y. That means that the right-hand side is known at time \(t-1\), except for the term e(t), which is unpredictable at time \(t-1\) and must be replaced by its mean 0. All this means that the predictor for (2.3) (the conditional mean of y(t) given past data) is

$$\begin{aligned} \hat{y}(t|\theta ) = G(q,\theta )u(t)+[1-H^{-1}(q,\theta )][y(t)-G(q,\theta )u(t)]. \end{aligned}$$
(2.4)

It is easy to interpret the first term as a simulation using the input u, adjusted with a prediction of the additive disturbance v(t) at time t, based on past values of v. The predictor is thus an easy reformulation of the basic transfer functions G and H. The question now is how to parametrize these.
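As a small sanity check of the predictor formula (2.4), the sketch below (a hypothetical FIR system with MA(1) noise; all numbers are made up) applies (2.4) by filtering and verifies that the resulting prediction errors reproduce the innovations e(t):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
N = 500
u = rng.standard_normal(N)
e = 0.1 * rng.standard_normal(N)

# Hypothetical model: G(q) = 0.5 q^{-1}, H(q) = 1 + 0.7 q^{-1}  (FIR + MA(1) noise)
b1, c1 = 0.5, 0.7
Gu = lfilter([0.0, b1], [1.0], u)          # G(q) u(t)
y = Gu + lfilter([1.0, c1], [1.0], e)      # (2.3a) with v = H e

# Predictor (2.4): y_hat = G u + (1 - H^{-1}) (y - G u)
v = y - Gu
y_hat = Gu + (v - lfilter([1.0], [1.0, c1], v))   # H^{-1} applied by inverse filtering

eps = y - y_hat                             # prediction errors should equal e
print(np.max(np.abs(eps - e)))
```

Since the true G and H are used, the one-step prediction errors coincide with the white innovations, as the theory says they should.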

2.2.1.1 The McMillan Degree

Given just the sequence of impulse responses \(g_k\), with \(k=1,2,\ldots \), one may consider different ways of representing the system in a more compact form, like rational transfer functions or state-space models, to be considered below. A quite useful concept is then the McMillan degree:

From a given impulse response sequence, \(g_k\) (that could be \(p \times m\) matrices that describe a system with m inputs and p outputs) form the Hankel matrix

$$\begin{aligned} H_k= \begin{bmatrix} g_1&g_2&g_3&\cdots &g_k\\ g_2&g_3&g_4&\cdots &g_{k+1}\\ \vdots &\vdots &\vdots & &\vdots \\ g_k&g_{k+1}&g_{k+2}&\cdots &g_{2k-1} \end{bmatrix}. \end{aligned}$$
(2.5)

Then as k increases, the McMillan degree n of the impulse response is the maximal rank of \(H_k\):

$$\begin{aligned} n = \max _k \text {rank}\; H_k. \end{aligned}$$
(2.6)

This means that the impulse response can be generated from an nth-order state-space model, but not from any lower-order model.
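The rank test (2.6) is easy to try numerically. The sketch below uses a made-up impulse response that is the sum of two first-order modes, so its McMillan degree is 2:

```python
import numpy as np

# Impulse response of a hypothetical second-order system:
# g_k = 2 * 0.9**k - 0.5**k, a sum of two first-order modes
K = 10
g = np.array([2 * 0.9 ** k - 0.5 ** k for k in range(1, 2 * K)])

def hankel_rank(g, k):
    # Build the k x k Hankel matrix (2.5), entries H[i, j] = g_{1+i+j}, and return its numerical rank
    H = np.array([[g[i + j] for j in range(k)] for i in range(k)])
    return np.linalg.matrix_rank(H, tol=1e-8)

print([hankel_rank(g, k) for k in range(1, 6)])  # rank saturates at the McMillan degree 2
```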

2.2.1.2 Black-Box Models

A black-box model  uses no physical insight or interpretation, but is just a general and flexible parameterization. It is natural to let G and H be rational in the shift operator:

$$\begin{aligned} G(q,\theta )&=\frac{B(q)}{F(q)}; \quad H(q,\theta )=\frac{C(q)}{D(q)}\end{aligned}$$
(2.7a)
$$\begin{aligned} B(q)&=b_1q^{-1}+b_2q^{-2}+\ldots +b_{n_b}q^{-n_b}\end{aligned}$$
(2.7b)
$$\begin{aligned} F(q)&=1+f_1q^{-1}+\ldots +f_{n_f}q^{-n_f}, \end{aligned}$$
(2.7c)

where C and D are monic like F, i.e., start with a “1”. The vector collecting all the coefficients is

$$\begin{aligned} \theta&=[b_1,b_2, \ldots , f_{nf}]. \end{aligned}$$
(2.7d)

Common black-box structures of this kind are FIR (finite impulse response model, \(F=C=D=1\)), ARMAX (autoregressive moving average with exogenous input, \(F=D\)), and BJ (Box–Jenkins, all four polynomials different).

A Very Common Case: The ARX Model

A very common case is that \(F=D=A\) and \(C=1\) which gives the ARX model (autoregressive with exogenous input):

$$\begin{aligned} y(t)&=A^{-1}(q)B(q)u(t)+A^{-1}(q)e(t) \, \text {or}\end{aligned}$$
(2.8a)
$$\begin{aligned} A(q)y(t)&=B(q)u(t) + e(t) \, \text {or}\end{aligned}$$
(2.8b)
$$\begin{aligned} y(t)&+a_1y(t-1)+\ldots +a_{n_a}y(t-n_a)\end{aligned}$$
(2.8c)
$$\begin{aligned}&=b_1u(t-1)+\ldots +b_{n_b}u(t-n_b)+e(t). \end{aligned}$$
(2.8d)

This means that the expression for the predictor (2.4) becomes very simple:

$$\begin{aligned} \hat{y}(t|\theta )&=\varphi ^T(t)\theta \end{aligned}$$
(2.9)
$$\begin{aligned} \varphi ^T(t)&= \begin{bmatrix} -y(t-1)&-y(t-2)&\ldots&-y(t-n_a)&u(t-1)&\ldots&u(t-n_b) \end{bmatrix}\end{aligned}$$
(2.10)
$$\begin{aligned} \theta ^T&= \begin{bmatrix} a_1&a_2&\ldots&a_{n_a}&b_1&b_2&\ldots&b_{n_b} \end{bmatrix}. \end{aligned}$$
(2.11)

In statistics, such a model is known as a linear regression.
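The linear regression structure (2.9)–(2.11) means that \(\theta \) can be estimated by ordinary least squares. A minimal sketch, with a hypothetical first-order ARX system and made-up coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000
u = rng.standard_normal(N)
e = 0.1 * rng.standard_normal(N)

# Simulate a hypothetical first-order ARX system (2.8):
# y(t) = -a1*y(t-1) + b1*u(t-1) + e(t), with a1 = -0.7, b1 = 0.5
a1, b1 = -0.7, 0.5
y = np.zeros(N)
for t in range(1, N):
    y[t] = -a1 * y[t - 1] + b1 * u[t - 1] + e[t]

# Regressor matrix with rows phi^T(t) as in (2.10), and LS estimate of theta = [a1, b1]
Phi = np.column_stack([-y[:-1], u[:-1]])
theta_hat, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
print(theta_hat)                                 # close to [-0.7, 0.5]
```

The least-squares problem is convex with a closed-form solution, which is exactly why ARX estimation is so numerically attractive.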

We note that as \(n_a\) and \(n_b\) increase to infinity the predictor (2.9) may approximate any linear model predictor (2.4). This points to a very important general approximation property of ARX models:

Theorem 2.1

(based on [6]) Suppose a true linear system is given by

$$\begin{aligned} y(t)=G_0(q)u(t)+H_0(q)e(t), \end{aligned}$$
(2.12)

where \(G_0(q)\) and \(H^{-1}_0(q)\) are stable filters,

$$\begin{aligned} G_0(q)&=\sum _{k=1}^\infty g_k q^{-k}\\ H^{-1}_0(q)&=\sum _{k=1}^\infty \tilde{h}_k q^{-k}\\ d(n)&=\sum ^\infty _{k=n}|g_k|+|\tilde{h}_k| \end{aligned}$$

and e is a sequence of independent zero-mean random variables with bounded fourth-order moments.

Consider an ARX model (2.8) with orders \(n_a, n_b=n\), estimated from N observations. Assume that the order n depends on the number of data as n(N), and tends to infinity such that \(n(N)^5/N \rightarrow 0\). Assume also that the system is such that \(d(n(N))\sqrt{N} \rightarrow 0\) as \(N\rightarrow \infty \). Then the ARX model estimates \(\hat{A}_{n(N)} (q)\) and \(\hat{B}_{n(N)}(q)\) of order n(N) obey

$$\begin{aligned} \frac{\hat{B}_{n(N)}(q)}{\hat{A}_{n(N)}(q)} \rightarrow G_0(q), \quad \frac{1}{\hat{A}_{n(N)}(q)} \rightarrow H_0(q) \text { as } N \rightarrow \infty . \end{aligned}$$
(2.13)

Intuitively, the above result follows from the fact that the true predictor for the system

$$\begin{aligned} \hat{y}(t) = (1-H_0^{-1})y(t)+H_0^{-1}G_0u(t) = \sum _{k=1}^\infty \left[ \tilde{h}_k y(t-k) + \tilde{g}_k u(t-k)\right] \end{aligned}$$

is stable. Hence, it can be truncated at any n with arbitrary accuracy, and the truncated sum is the predictor of an nth-order ARX model.

This is quite a useful result saying that ARX models can approximate any linear system, if the orders are sufficiently large. ARX models are easy to estimate. The estimates are calculated by linear least squares (LS) techniques, which are convex and numerically robust. Estimating a high-order ARX model, possibly followed by some model order reduction, could thus be an alternative to the numerically more demanding general PEM criterion minimization (2.22) introduced later on. This has been extensively used, e.g., by [14, 15]. The only drawback with high-order ARX models is that they may suffer from high variance.

2.2.1.3 Grey-Box Models

If some physical facts are known about the system, they can be incorporated in the model structure. Such a model, based on physical insights and with built-in behaviour that mimics known physics, is known as a Grey-Box Model. For example, consider an airplane whose motion equations are known from Newton’s laws, but where certain parameters, like the aerodynamical derivatives, are unknown. Then it is natural to build a continuous-time state-space model from the known physical equations:

$$\begin{aligned} \begin{aligned} \dot{x}(t)&=A(\theta )x(t)+B(\theta )u(t)\\ y(t)&=C(\theta )x(t)+D(\theta )u(t)+v(t). \end{aligned} \end{aligned}$$
(2.14)

Here \(\theta \) collects some entries of the matrices A, B, C, D, corresponding to unknown physical parameters, while the other matrix entries represent known physical behaviour. This model can be sampled with well-known sampling formulas (obeying the input inter-sample properties, zero-order hold or first-order hold) to give

$$\begin{aligned} \begin{aligned} x(t+1)&=\mathscr {F}(\theta )x(t)+\mathscr {G}(\theta )u(t)\\ y(t)&=C(\theta )x(t)+D(\theta )u(t)+w(t). \end{aligned} \end{aligned}$$
(2.15)

The model (2.15) has the transfer function from u to y

$$\begin{aligned} G(q,\theta )=C(\theta )[qI-\mathscr {F}(\theta )]^{-1}\mathscr {G}(\theta )+D(\theta ) \end{aligned}$$
(2.16)

so we have achieved a particular parameterization of the general linear model (2.3).
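Sampling (2.14) into (2.15) under zero-order hold can be done with a single matrix exponential of an augmented matrix (a standard identity, not specific to this book). A sketch with a made-up second-order A and B:

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical continuous-time grey-box model: a damped oscillator
# x_dot = A x + B u (parameter values made up for the demo)
A = np.array([[0.0, 1.0],
              [-4.0, -0.8]])
B = np.array([[0.0],
              [1.0]])
T = 0.1                                   # sampling interval

# Zero-order-hold sampling of (2.14) -> (2.15) via one augmented matrix exponential:
# expm([[A, B], [0, 0]] * T) = [[F, G], [0, I]]
n, m = A.shape[0], B.shape[1]
M = np.zeros((n + m, n + m))
M[:n, :n] = A
M[:n, n:] = B
Md = expm(M * T)
F, G = Md[:n, :n], Md[:n, n:]
print(F)
print(G)
```

Here \(F = e^{AT}\) and \(G = \int_0^T e^{As}\,ds\, B\), which satisfy the check \(AG = (F-I)B\).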

2.2.1.4 Continuous-Time Models

The general model description (2.2) describes how the predictions evolve in discrete time. In many cases, however, we are interested in continuous-time (CT) models, for example, for physical interpretation and simulation. CT model estimation is nevertheless contained in the described framework, as the linear state-space model (2.14) illustrates.

2.2.2 Nonlinear Models

A nonlinear model is a relation (2.2), where the function g is nonlinear in the input–output data Z. There is a rich variation in how to specify the function g more explicitly. A quite general way is the nonlinear state-space equation, which is a counterpart to (2.15):

$$\begin{aligned} \begin{aligned} x(t+1)&=f(x(t),u(t),v(t),\theta )\\ y(t)&=h(x(t),e(t),\theta ), \end{aligned} \end{aligned}$$
(2.17)

where v and e are white noises.

2.3 \(\mathscr {I}\): Identification Methods—Criteria

The goal of identification is to match the model to the data. Here the basic techniques for such matching will be discussed. Suppose we have collected a data record in the time  domain

$$\begin{aligned} \mathcal{D}_T=\{u(1),y(1),\ldots ,u(N),y(N)\} \end{aligned}$$
(2.18)

which in this book will be called the identification set or training set, with N being its size. Since the model is in essence a predictor, a natural way to evaluate it is to see how well it predicts the measured output. It is thus quite natural to form the prediction errors for (2.2):

$$\begin{aligned} \varepsilon (t,\theta ) = y(t)-\hat{y} (t|\theta ). \end{aligned}$$
(2.19)

The “size” of this error can be measured by some scalar norm:

$$\begin{aligned} \ell (\varepsilon (t,\theta )) \end{aligned}$$
(2.20)

and the performance of the predictor over the whole data record \(\mathcal{D}_T\) is given by

$$\begin{aligned} V_N(\theta )=\sum ^N_{t=1}\ell (\varepsilon (t,\theta )). \end{aligned}$$
(2.21)

A natural parameter estimate is the value that minimizes this prediction fit:

$$\begin{aligned} \hat{\theta }_N= {{\,\mathrm{arg\,min}\,}}_{\theta \in D_{\mathscr {M}}} V_N(\theta ). \end{aligned}$$
(2.22)

This is the Prediction Error Method (PEM) and it is applicable to general model structures. See, e.g., [5] or [7] for more details.
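A minimal sketch of PEM (2.19)–(2.22) for a hypothetical first-order output-error model, where the predictor is just the simulated output \(G(q,\theta )u(t)\) and the quadratic criterion is minimized numerically (all system numbers are made up):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.signal import lfilter

rng = np.random.default_rng(2)
N = 1000
u = rng.standard_normal(N)

# Hypothetical output-error system: G0(q) = 0.5 q^{-1} / (1 - 0.8 q^{-1}),
# with white noise added at the output
y = lfilter([0.0, 0.5], [1.0, -0.8], u) + 0.1 * rng.standard_normal(N)

def V(theta):
    # Prediction-error criterion (2.21) with l(eps) = eps^2, for an
    # output-error predictor y_hat = G(q, theta) u
    b, f = theta
    y_hat = lfilter([0.0, b], [1.0, f], u)
    eps = y - y_hat                       # prediction errors (2.19)
    return np.sum(eps ** 2)

res = minimize(V, x0=[0.1, -0.1], method="Nelder-Mead")   # (2.22)
print(res.x)                              # close to [0.5, -0.8]
```

Unlike the ARX case, this criterion is not quadratic in the parameters, which is why a numerical search is needed.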

The PEM approach can be embedded in a statistical setting. The ML methodology below offers a systematic framework to do so.

2.3.1 A Maximum Likelihood (ML) View

If the system innovations e have a probability density function (pdf) f(x), then the criterion function (2.21) with \(\ell (x)=-\log f(x)\) will be, up to \(\theta \)-independent terms, the negative logarithm of the likelihood function. See Lemma 5.1 in [5]. More specifically, let the system have p outputs, and let the innovations be Gaussian with zero mean and covariance matrix \(\varLambda \), so that

$$\begin{aligned} y(t) = \hat{y}(t|\theta _0) +e(t), \quad e(t)\in N(0,\varLambda ) \end{aligned}$$
(2.23)

for the \(\theta _0\) that generated the data. Then it follows that the negative logarithm of the likelihood function for estimating \(\theta \) from y is

$$\begin{aligned} L_N(\theta ) = \frac{1}{2} [V_N(\theta ) + N \log \det \varLambda + Np \log 2\pi ], \end{aligned}$$
(2.24)

where \(V_N(\theta )\) is defined by (2.21), with

$$\begin{aligned} \ell (\varepsilon (t,\theta )) = \varepsilon ^T(t,\theta )\varLambda ^{-1}\varepsilon (t,\theta ). \end{aligned}$$
(2.25)

That means that the maximum likelihood model estimate (MLE) for known \(\varLambda \) is obtained by minimizing \(V_N(\theta )\). If \(\varLambda \) is not known, it can be included among the parameters and estimated, ([5], p. 218), which results in a criterion

$$\begin{aligned} D_N(\theta ) = \det \sum ^N_{t=1} \varepsilon (t,\theta )\varepsilon ^T(t,\theta ) \end{aligned}$$
(2.26)

to be minimized.

A Bayesian interpretation of (2.22) as well as a regularized version will be given in Chap. 4.

2.4 Asymptotic Properties of the Estimated Models

As we have seen in the first chapter, bias and variance play important roles in estimation problems. We will here give a short account of how these concepts are treated in classical system identification.

2.4.1 Bias and Variance

The observations, in particular the output from the system, are affected by noise and disturbances. That means that the estimated model parameters (2.22) also will be affected by disturbances. These disturbances are typically described as stochastic processes, which makes the estimate \(\hat{\theta }_N\) a random variable. The estimate has a certain probability density function, which could be complicated to compute. Often the analysis is restricted to its mean and variance only. The difference between the mean and a true description of the system measures the bias of the model. If the mean coincides with the true system, the estimate is said to be unbiased. As already pointed out in (1.1), the total error in a model thus has two contributions: the bias and the variance.

2.4.2 Properties of the PEM Estimate as \(N\rightarrow \infty \)

Except in simple special cases it is quite difficult to compute the pdf of the estimate \(\hat{\theta }_N\). However, its asymptotic properties as \(N\rightarrow \infty \) are easier to establish. The basic results can be summarized as follows (see [5, Chaps. 8 and 9] for a more complete treatment):

  • Limit model:

    $$\begin{aligned} \hat{\theta }_N \rightarrow \theta ^*={{\,\mathrm{arg\,min}\,}}\left[ \lim _{N\rightarrow \infty } \frac{1}{N}V_N(\theta )\approx {\mathscr {E}}\ell (\varepsilon (t,\theta ))\right] . \end{aligned}$$
    (2.27)

    Here \({\mathscr {E}}\) denotes mathematical expectation. So the estimate will converge to the best possible model, in the sense that it gives the smallest average prediction error.

  • Asymptotic covariance matrix for scalar output models:

    In case the prediction errors \(e(t)=\varepsilon (t,\theta ^*)\) for the limit model are approximately white, the covariance matrix of the parameters is asymptotically given by

    $$\begin{aligned} \text {Cov} \hat{\theta }_N \sim \frac{\kappa (\ell )}{N} \left[ \text {Cov} \frac{d}{d\theta }\hat{y}(t|\theta )\right] ^{-1}. \end{aligned}$$
    (2.28)

    That means that the covariance matrix of the parameter estimate is given by the inverse covariance matrix of the gradient of the predictor w.r.t. the parameters. Here (prime denoting derivatives)

    $$\begin{aligned} \kappa (\ell )=\frac{{\mathscr {E}}[\ell '(e(t))]^2}{[{\mathscr {E}}\ell ''(e(t))]^2}. \end{aligned}$$
    (2.29)

    Note that

    $$\begin{aligned} \kappa (\ell )=\sigma ^2= {\mathscr {E}}e^2(t)\quad \text {if}\, \quad \ell (e)=e^2/2. \end{aligned}$$

    If the model structure contains the true system, it can be shown that this covariance matrix is the smallest that can be achieved by any unbiased estimate, in case the norm \(\ell \) is chosen as the negative logarithm of the pdf of e. That is, it fulfils the Cramér–Rao inequality [2]. These results are valid for quite general model structures.

  • Results for LTI models:

    Now, specialize to linear models (2.3) and assume that the true system is described by

    $$\begin{aligned} y(t)=G_0(q)u(t)+H_0(q)e(t), \end{aligned}$$
    (2.30)

    which could be general transfer functions, possibly much more complicated than the model. Then

    • $$\begin{aligned} \theta ^*={{\,\mathrm{arg\,min}\,}}_\theta \int _{-\pi }^\pi |G(e^{i\omega },\theta )-G_0(e^{i\omega })|^2\frac{\varPhi _u(\omega )}{|H(e^{i\omega },\theta )|^2}d\omega . \end{aligned}$$
      (2.31)

      That is, the frequency function of the limiting model will approximate the true frequency function as well as possible in a frequency norm given by the input spectrum \(\varPhi _u\) and the noise model.

    • For a linear black-box model, the covariance of the estimated frequency function is

      $$\begin{aligned} \text {Cov} G(e^{i\omega },\hat{\theta }_N) \sim \frac{n}{N} \frac{\varPhi _v(\omega )}{\varPhi _u(\omega )} \; \text {as}\; n,N \rightarrow \infty , \end{aligned}$$
      (2.32)

      where n is the model order and \(\varPhi _v\) is the noise spectrum \(\sigma ^2 |H_0(e^{i\omega })|^2\). The variance of the estimated frequency function at a given frequency is thus, for a high-order model, proportional to the noise-to-signal ratio at that frequency. That is a natural and intuitive result.

2.4.3 Trade-Off Between Bias and Variance

The quality of the model depends on the quality of the measured data and the flexibility of the chosen model structure (2.1). A more flexible model structure typically has smaller bias, since it is easier to come closer to the true system. At the same time, it will have a higher variance: with higher flexibility it is easier to be fooled by disturbances and this may lead to data overfitting. So the trade-off between bias and variance to reach a small total error is a choice of balanced flexibility of the model structure.

As the model gets more flexible, the fit to the estimation data in (2.22), given by \(V_N(\hat{\theta }_N)\), will always improve. To account for the variance contribution, it is thus necessary to modify this fit to assess the total quality of the model. A much used technique for this is Akaike’s criterion (AIC), [1],

$$\begin{aligned} \hat{\theta }_N = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\mathscr {M}, \theta \in D_{\mathscr {M}}} 2L_N(\theta )+2\text {dim} \theta , \end{aligned}$$
(2.33)

where \(L_N\) is the negative log likelihood function. The minimization also takes place over a family of model structures with different numbers of parameters (dim \(\theta \)).

For Gaussian innovations e with unknown and estimated variance, the criterion AIC takes the form

$$\begin{aligned} \hat{\theta }_N = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\mathscr {M}, \theta \in D_{\mathscr {M}}} \left[ \log \det \left[ \frac{1}{N} \sum ^N_{t=1} \varepsilon (t,\theta )\varepsilon ^T(t,\theta )\right] +2\frac{m}{N}\right] \quad \text {AIC} \end{aligned}$$
(2.34)

with \(m= \text {dim} \theta \) and after normalization and omission of model-independent quantities.

There is also a small-sample version, described in [4] and known in the literature as corrected Akaike’s criterion (AICc),  defined by

$$\begin{aligned} \hat{\theta }_N = {{\,\mathrm{arg\,min}\,}}_\theta \left[ \log \det \left[ \frac{1}{N} \sum ^N_{t=1} \varepsilon (t,\theta )\varepsilon ^T(t,\theta )\right] +2\frac{m}{(N-m-1)} \right] , \quad \text {AICc}. \end{aligned}$$
(2.35)

Another variant places a larger penalty on the model flexibility:

$$\begin{aligned} \hat{\theta }_N= {{\,\mathrm{arg\,min}\,}}_\theta \left[ \log \det \left[ \frac{1}{N} \sum ^N_{t=1} \varepsilon (t,\theta )\varepsilon ^T(t,\theta )\right] +\log (N)\frac{m}{N}\right] , \text {BIC, MDL}. \end{aligned}$$
(2.36)

This is known as Bayesian information criterion (BIC) or Rissanen’s  Minimum Description Length (MDL) criterion, see, e.g.,  [10, 11] and [5, pp. 505–507].
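The order selection criteria (2.34) and (2.36) can be illustrated for scalar-output ARX models, where the \(\log \det \) term reduces to the log of the residual variance. A sketch with a made-up second-order system (the helper arx_fit is hypothetical, not from any toolbox):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1000
u = rng.standard_normal(N)
e = 0.1 * rng.standard_normal(N)

# Hypothetical true system: a second-order ARX process
y = np.zeros(N)
for t in range(2, N):
    y[t] = 1.2 * y[t - 1] - 0.5 * y[t - 2] + 0.8 * u[t - 1] + e[t]

def arx_fit(n):
    # Least-squares ARX(n, n) fit; returns (number of parameters m, residual variance)
    rows = [np.concatenate([-y[t - n:t][::-1], u[t - n:t][::-1]]) for t in range(n, N)]
    Phi = np.array(rows)
    theta, *_ = np.linalg.lstsq(Phi, y[n:], rcond=None)
    eps = y[n:] - Phi @ theta
    return 2 * n, np.mean(eps ** 2)

def score(n, penalty):
    # Scalar-output form of (2.34)/(2.36): log residual variance + penalty * m / N
    m, s2 = arx_fit(n)
    return np.log(s2) + penalty * m / N

orders = range(1, 8)
best_aic = min(orders, key=lambda n: score(n, 2.0))        # AIC
best_bic = min(orders, key=lambda n: score(n, np.log(N)))  # BIC
print(best_aic, best_bic)
```

With the stronger log(N) penalty, BIC tends to recover the true order, while AIC may occasionally accept a slightly larger model.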

Section 2.6 contains further aspects on the choice of model structure.

2.5 \(\textit{X}\): Experiment Design

Experiment design involves all questions that concern the collection of estimation data, such as selecting which signals to measure, which sampling rate to use, and also the design of the input including possible feedback configurations.

The theory of experiment design primarily relies upon analysis of how the asymptotic parameter covariance matrix (2.28) depends on the design variables: so the essence of experiment design can be symbolized as

$$\begin{aligned} \min _{\textit{X}} \text {trace}\{ C [ E \psi (t)\psi ^T(t)]^{-1}\}, \end{aligned}$$

where \(\psi \) is the gradient of the prediction w.r.t. the parameters and the matrix C is used to weight variables reflecting the intended use of the model.

For linear systems, the input design is often expressed as selecting the spectrum (frequency contents) of u.

This leads to the following recipe: concentrate the input power in frequency regions where a good model fit is essential and where disturbances dominate.

The measurement setup, such as whether band-limited inputs are used to estimate continuous-time models and how the experiment equipment is instrumented with band-pass filters (see, e.g., [8, Sects. 13.2–3]), also belongs to the important experiment design questions.

2.6 \(\mathscr {V}\): Model Validation

Model validation is about obtaining a model that, at least for the time being, can be accepted. It amounts to examining and scrutinizing the model to check whether it can be used for its purpose. These methods are of course problem dependent and contain several subjective elements; therefore, no conclusive procedure for validation can be given. A few useful techniques will be listed here. Basically, it is a matter of trying to falsify a model under the conditions it will be used for, and also of gaining confidence in its ability to reproduce new data from the system.

2.6.1 Falsifying Models: Residual Analysis

An estimated model is never a correct description of a true system. In that sense, a model cannot be “validated”, i.e., proved to be correct. Instead it is instructive to try and falsify it, i.e., confront it with facts that may contradict its correctness. A good principle is to look for the simplest unfalsified model, see, e.g., [9].

Residual analysis is the leading technique for falsifying models: the residuals, or one-step-ahead prediction errors \(\hat{\varepsilon }(t)=\varepsilon (t,\hat{\theta }_N)=y(t)-\hat{y}(t|\hat{\theta }_N)\), should ideally not contain any traces of past inputs or past residuals. If they do, the predictions are not ideal. So, it is natural to test the correlation functions

$$\begin{aligned} \hat{r}_{\hat{\varepsilon },u} (k)&=\frac{1}{N}\sum ^N_{t=1}\hat{\varepsilon }(t+k)u(t)\end{aligned}$$
(2.37)
$$\begin{aligned} \hat{r}_{\hat{\varepsilon }} (k)&=\frac{1}{N}\sum ^N_{t=1}\hat{\varepsilon }(t+k)\hat{\varepsilon }(t) \end{aligned}$$
(2.38)

and check that they are not larger than certain thresholds. Here N is the length of the data record and k typically ranges over a fraction of the interval \([-N,N]\). See, e.g., [5, Sect. 16.6] for more details.
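A sketch of the whiteness and independence tests (2.37)–(2.38), using ideal residuals (white noise independent of u) and the common 95% threshold \(1.96/\sqrt{N}\) for normalized correlations (the threshold choice is a standard convention, assumed here):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 1000
u = rng.standard_normal(N)
eps = rng.standard_normal(N)   # residuals of a hypothetically correct model: white, independent of u

def xcorr(a, b, k):
    # Sample correlation (2.37)/(2.38) at lag k >= 0, normalized to a correlation coefficient
    r = np.mean(a[k:] * b[:N - k])
    return r / (np.std(a) * np.std(b))

# 95% confidence threshold for the whiteness/independence tests
thresh = 1.96 / np.sqrt(N)
lags = range(1, 21)
n_eps_u = sum(abs(xcorr(eps, u, k)) > thresh for k in lags)
n_eps = sum(abs(xcorr(eps, eps, k)) > thresh for k in lags)
print(n_eps_u, n_eps)     # only about 1 lag in 20 should exceed the threshold
```

For a model that fails the test, many lags would exceed the threshold, indicating unmodeled dynamics or noise structure.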

2.6.2 Comparing Different Models

When several models have been estimated, the question is how to choose the “best one”. Models that employ more parameters naturally show a better fit to the data, and it is necessary to compensate for that. The model selection criteria AIC (2.34) and BIC (2.36) are examples of how such decisions can be made. They can be extended to regular hypothesis tests, where more complex models are accepted or rejected at various test levels, see, e.g., [5, Sect. 16.4].

Making comparisons in the frequency domain is a very useful complement for domain experts used to thinking in terms of natural frequencies, natural damping, etc.

2.6.3 Cross-Validation

Cross-validation (CV) is an important statistical concept that loosely means that the model performance is tested on a data set (validation data) other than the estimation data. There is an extensive literature on cross-validation, e.g., [13], and many ways to split the available data into estimation and validation parts have been suggested. The goal is to estimate the model's capability to predict future data for different choices of \(\theta \). Parameter selection is thus performed by optimizing the estimated prediction score. Hold-out validation is the simplest form of CV: the available data are split into two parts, where one (the estimation set) is used to estimate the model, and the other (the validation set) is used to assess the prediction capability. By ensuring independence of the model fit from the validation data, the estimate of the prediction performance is approximately unbiased. For models that do not require estimation of initial states, like FIR and ARX models, CV can be applied efficiently in more sophisticated ways by splitting the data into more portions, as described in [3].
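A sketch of hold-out CV for a hypothetical first-order ARX data set: the first half of the data estimates ARX(n, n) models of different orders, and the second half scores their one-step predictions (helper names are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 1000
u = rng.standard_normal(N)
e = 0.3 * rng.standard_normal(N)

# Hypothetical first-order ARX data: y(t) = 0.7 y(t-1) + 0.5 u(t-1) + e(t)
y = np.zeros(N)
for t in range(1, N):
    y[t] = 0.7 * y[t - 1] + 0.5 * u[t - 1] + e[t]

def regressors(y, u, n):
    # Rows phi^T(t) for an ARX(n, n) model, and the matching targets
    rows = [np.concatenate([-y[t - n:t][::-1], u[t - n:t][::-1]]) for t in range(n, len(y))]
    return np.array(rows), y[n:]

half = N // 2
def cv_score(n):
    # Fit on the estimation half, score one-step predictions on the validation half
    Phi_e, ye = regressors(y[:half], u[:half], n)
    theta, *_ = np.linalg.lstsq(Phi_e, ye, rcond=None)
    Phi_v, yv = regressors(y[half:], u[half:], n)
    return np.mean((yv - Phi_v @ theta) ** 2)

# Validation MSE stays near the noise variance (~0.09) once n covers the true order
print([round(cv_score(n), 3) for n in (1, 2, 4, 8)])
```

Because the validation data were not used in the fit, the scores give an approximately unbiased estimate of the prediction performance and can be compared across orders.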