1 Function Minimization

The minimization (or maximization) of a multivariate function F(x) is a frequent task in solving non-linear systems of equations, clustering, maximum-likelihood estimation, function and model fitting, supervised learning, and similar problems. A basic classification of minimization methods distinguishes between methods that require the computation of the gradient or even the Hessian matrix of the function and gradient-free methods. All methods discussed in the following subsections are iterative and require a suitable starting point x 0. Implementations of the methods discussed in the following, along with many others, can, for instance, be found in the MATLAB® Optimization Toolbox™ [1] and in the Python package scipy.optimize [2].

1.1 Newton–Raphson Method

If F(x) is at least twice continuously differentiable in its domain, it can be approximated by its second-order Taylor expansion \(\hat {F}\) at the starting point x 0:

$$\displaystyle \begin{gathered} F({{\boldsymbol{x}}_{0}}+{\,{\boldsymbol{h}}})\approx \hat{F}({{\boldsymbol{x}}})=F({{\boldsymbol{x}}_{0}})+{{\boldsymbol{g}}}({{\boldsymbol{x}}_{0}}){\,{\boldsymbol{h}}}+\frac{1}{2}{\,{\boldsymbol{h}}}{{}^{\mathsf{T}}}{{{\boldsymbol{H}}}}({{\boldsymbol{x}}_{0}}){\,{\boldsymbol{h}}}, \end{gathered} $$
(3.1)

where g(x) = ∇F(x) is the gradient and H(x) = ∇2 F(x) is the Hessian matrix of F(x). The step h is determined such that \(\hat {F}\) has a stationary point at x 1 = x 0 +  h, i.e., that the gradient of \(\hat {F}\) is zero at x 1:

$$\displaystyle \begin{gathered} \nabla\hat{F}({{\boldsymbol{x}}_{1}})={{\boldsymbol{g}}}({{\boldsymbol{x}}_{0}})+{\,{\boldsymbol{h}}}{{}^{\mathsf{T}}}{{{\boldsymbol{H}}}}({{\boldsymbol{x}}_{0}})=\mathbf{0}{\Longrightarrow} {\,{\boldsymbol{h}}}=-[{{{\boldsymbol{H}}}}({{\boldsymbol{x}}_{0}})]{{}^{-1}}\hspace{0.5pt}{{\boldsymbol{g}}}({{\boldsymbol{x}}_{0}}){{}^{\mathsf{T}}}.{} \end{gathered} $$
(3.2)

Note that if x is a column vector, g(x 0) is a row vector. In order to ensure that the Wolfe conditions [3] are satisfied, Eq. (3.2) is often relaxed to:

$$\displaystyle \begin{aligned} {\,{\boldsymbol{h}}}&=-\eta\hspace{0.5pt}[{{{\boldsymbol{H}}}}({{\boldsymbol{x}}_{0}})]{{}^{-1}}\hspace{0.5pt}{{\boldsymbol{g}}}({{\boldsymbol{x}}_{0}}){{}^{\mathsf{T}}}, \end{aligned} $$
(3.3)

with a learning rate η ∈ (0, 1). Inverting the Hessian matrix can be computationally expensive; instead, h can be computed as an (approximate) solution of the linear system H(x 0) ⋅ h = −g(x 0)T. This procedure is iterated to produce a sequence of values according to:

$$\displaystyle \begin{gathered} {{\boldsymbol{x}}_{k+1}}={{\boldsymbol{x}}_{k}}-\eta\hspace{0.5pt}[{{{\boldsymbol{H}}}}({{\boldsymbol{x}}_{k}})]{{}^{-1}}\hspace{0.5pt}{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k}}){{}^{\mathsf{T}}}. \end{gathered} $$
(3.4)

If the starting point x 0 is sufficiently close to a local minimum, the sequence converges quadratically to the local minimum. In practice, the iteration is stopped as soon as the norm ‖g(x k)‖ of the gradient falls below some predefined bound ε. If F is a convex function, the local minimum is also the global minimum.
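For illustration, the iteration of Eq. (3.4) can be sketched in a few lines of Python with NumPy. The quadratic test function and all names are chosen for this example only; a production implementation would add safeguards such as a line search or Hessian modification:

```python
import numpy as np

def newton_raphson(grad, hess, x0, eta=1.0, eps=1e-8, max_iter=100):
    """Damped Newton-Raphson iteration, cf. Eq. (3.4)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:      # stop when the gradient norm is small
            break
        # Solve H h = -g instead of inverting the Hessian explicitly
        h = np.linalg.solve(hess(x), -g)
        x = x + eta * h
    return x

# Test function F(x, y) = 2x^2 + 16y^2 with analytic gradient and Hessian
grad = lambda x: np.array([4.0 * x[0], 32.0 * x[1]])
hess = lambda x: np.diag([4.0, 32.0])

x_min = newton_raphson(grad, hess, x0=[3.0, 1.0])
```

For a quadratic function and η = 1, a single Newton step already reaches the minimum.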

1.2 Descent Methods

As the computation of the Hessian matrix is computationally costly, various methods that do not require it have been devised, for instance, descent methods. A descent method is an iterative algorithm that searches for an approximate minimum of F by decreasing the value of F in every iteration. The iteration has the form x k+1 = x k + η k d k, where d k is the search direction and η k is the step-size parameter. As with the Newton–Raphson method, when a (local) minimum is reached, it cannot be left anymore.

1.2.1 Line Search

A search direction d is called a descent direction at the point \({{\boldsymbol {x}}}\in {\mathbb {R}}^n\) if g(x) ⋅d < 0. If η is sufficiently small, then

$$\displaystyle \begin{gathered} F({{\boldsymbol{x}}}+\eta\,{{\boldsymbol{d}}})<F({{\boldsymbol{x}}}). \end{gathered} $$
(3.5)

Once a search direction d k has been chosen at the point x k, line search implies that the line x = x k + λ d k is followed to its closest local minimum. For various line search methods such as fixed and variable step size, interpolation, golden section or Fibonacci’s method, see [4, 5].

1.2.2 Steepest Descent

Steepest descent follows the opposite direction of the gradient. If it is combined with line search, each search direction is orthogonal to the previous one, leading to a zig-zag search path. This can be very inefficient in the vicinity of a minimum where the Hessian matrix has a large condition number (ratio of the largest to the smallest eigenvalue); see Fig. 3.1.

Fig. 3.1
figure 1

Contour lines of the function F(x, y) = 2x 2 + 16y 2, and steepest descent with line search starting at the point x 0 = (3;1)
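The zig-zag behaviour can be reproduced with a short sketch. For the quadratic F(x, y) = 2x² + 16y² = ½ xᵀA x of Fig. 3.1, the optimal step size along the negative gradient is available in closed form; the code is illustrative only:

```python
import numpy as np

A = np.diag([4.0, 32.0])               # F(x, y) = 2x^2 + 16y^2 = 0.5 x^T A x

def steepest_descent(x0, eps=1e-8, max_iter=1000):
    """Steepest descent with exact line search on the quadratic form A."""
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(max_iter):
        g = A @ x                       # gradient of the quadratic
        if np.linalg.norm(g) < eps:
            break
        lam = (g @ g) / (g @ A @ g)     # exact minimizer along -g
        x = x - lam * g
        path.append(x.copy())
    return x, path

x_min, path = steepest_descent([3.0, 1.0])
```

Consecutive steps are orthogonal, producing the zig-zag path of Fig. 3.1, and the condition number of 8 makes the method take dozens of iterations on this simple problem.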

1.2.3 Quasi-Newton Methods

In the Newton–Raphson method, the search direction is d = −[H(x k)]−1 g(x k)T. If H(x k) is positive definite, then d is a descent direction, and so is −A g(x k)T for any positive definite matrix A. In a quasi-Newton method, A is constructed as an approximation to the inverse Hessian matrix, using only gradient information. The initial search direction is the negative gradient, d 0 = −g(x 0)T, and the initial matrix A 0 is the identity matrix. Each iteration performs a line search along the current search direction:

$$\displaystyle \begin{gathered} \lambda_k=\arg\min_\lambda F\left({{\boldsymbol{x}}_{k}}+\lambda\hspace{0.5pt}{{\boldsymbol{d}}}_k\right),\ \; {{\boldsymbol{x}}_{k+1}}={{\boldsymbol{x}}_{k}}+\lambda_k\hspace{0.5pt}{{\boldsymbol{d}}}_k. \end{gathered} $$
(3.6)

The new search direction is then computed according to:

$$\displaystyle \begin{gathered} {{\boldsymbol{d}}}_{k+1}=-{{{\boldsymbol{A}}}}_{k+1}\hspace{0.5pt}{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k+1}}){{}^{\mathsf{T}}}, \end{gathered} $$
(3.7)

where A k+1 is the updated approximation to the inverse Hessian matrix. There are two different algorithms for computing the update [6, p. 422], both using the change of the gradient along the step, denoted by \({\boldsymbol {y}}_k=\left [{{\boldsymbol {g}}}({{\boldsymbol {x}}_{k+1}})-{{\boldsymbol {g}}}({{\boldsymbol {x}}_{k}})\right ]{{ }^{\mathsf {T}}}\).

Davidon–Fletcher–Powell algorithm

$$\displaystyle \begin{gathered} {{{\boldsymbol{A}}}}_{k+1}={{{\boldsymbol{A}}}}_k+\lambda_k\frac{{{\boldsymbol{d}}}_k{{\boldsymbol{d}}}_k{{}^{\mathsf{T}}}}{{{\boldsymbol{d}}}_k{{}^{\mathsf{T}}}{\boldsymbol{y}}_k}-\frac{{{{\boldsymbol{A}}}}_k{\boldsymbol{y}}_k{\boldsymbol{y}}_k{{}^{\mathsf{T}}}{{{\boldsymbol{A}}}}_k}{{\boldsymbol{y}}_k{{}^{\mathsf{T}}}{{{\boldsymbol{A}}}}_k{\boldsymbol{y}}_k}. \end{gathered} $$
(3.8)

Broyden–Fletcher–Goldfarb–Shanno algorithm

$$\displaystyle \begin{gathered} {{{\boldsymbol{A}}}}_{k+1}=\lambda_k\frac{{{\boldsymbol{d}}}_k{{\boldsymbol{d}}}_k{{}^{\mathsf{T}}}}{{{\boldsymbol{d}}}_k{{}^{\mathsf{T}}}{\boldsymbol{y}}_k}+\left({{\boldsymbol{I}}}-\frac{{{\boldsymbol{d}}}_k{\boldsymbol{y}}_k{{}^{\mathsf{T}}}}{{{\boldsymbol{d}}}_k{{}^{\mathsf{T}}}{\boldsymbol{y}}_k}\right){{{\boldsymbol{A}}}}_k\left({{\boldsymbol{I}}}-\frac{{\boldsymbol{y}}_k{{\boldsymbol{d}}}_k{{}^{\mathsf{T}}}}{{\boldsymbol{y}}_k{{}^{\mathsf{T}}}{{\boldsymbol{d}}}_k}\right). \end{gathered} $$
(3.9)
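As a sketch of how these formulas interlock, the following code applies the BFGS update of Eq. (3.9) with an exact line search, which is available in closed form for a quadratic test function; all names are illustrative. In exact arithmetic, the method terminates on an n-dimensional quadratic in n steps, with the final A equal to the inverse Hessian:

```python
import numpy as np

H_quad = np.diag([4.0, 32.0])          # Hessian of F(x, y) = 2x^2 + 16y^2
grad = lambda x: H_quad @ x

def bfgs(x0, eps=1e-10, max_iter=50):
    """Quasi-Newton minimization with the BFGS update of Eq. (3.9)."""
    x = np.asarray(x0, dtype=float)
    A = np.eye(len(x))                 # initial inverse-Hessian approximation
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < eps:
            break
        d = -A @ g                     # search direction, Eq. (3.7)
        lam = -(g @ d) / (d @ H_quad @ d)   # exact line search (quadratic F)
        x_new = x + lam * d
        g_new = grad(x_new)
        y = g_new - g                  # change of the gradient along the step
        dy = d @ y
        I = np.eye(len(x))
        A = (lam * np.outer(d, d) / dy
             + (I - np.outer(d, y) / dy) @ A @ (I - np.outer(y, d) / dy))
        x, g = x_new, g_new
    return x, A

x_min, A_final = bfgs([3.0, 1.0])      # A_final approximates H_quad^{-1}
```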

1.2.4 Conjugate Gradients

If the function \(F({{\boldsymbol {x}}}),{{\boldsymbol {x}}}\in {\mathbb {R}}^n\) is quadratic of the form \(F({{\boldsymbol {x}}})=\frac {1}{2}{{\boldsymbol {x}}}{{ }^{\mathsf {T}}}{{{\boldsymbol {A}}}}{{\boldsymbol {x}}}-{{\boldsymbol {b}}}{{ }^{\mathsf {T}}}{{\boldsymbol {x}}}+c\) with positive definite A, the global minimum can be found in exactly n steps, if line search with a set of conjugate search directions is used. Such a set {d 1, …, d n} is characterized by the following conditions:

$$\displaystyle \begin{gathered} {{\boldsymbol{d}}}_i{{}^{\mathsf{T}}}{{{\boldsymbol{A}}}}\hspace{0.5pt}{{\boldsymbol{d}}}_j=0,\ \; i\neq j,\quad{{\boldsymbol{d}}}_i{{}^{\mathsf{T}}}{{{\boldsymbol{A}}}}\hspace{0.5pt}{{\boldsymbol{d}}}_i\neq0,\quad i=1,\ldots,n. \end{gathered} $$
(3.10)

The set is linearly independent and a basis of \({\mathbb {R}}^n\). An example for n = 2 is shown in Fig. 3.2.

Fig. 3.2
figure 2

Contour lines of the function F(x, y) = 2x 2 + 16y 2, and descent with line search and conjugate gradients starting at the point x 0 = (3;1). The minimum is reached in two steps

In the general case, the conjugate gradient method proceeds by successive approximations, generating a new search direction in every iteration. Given an approximation x k, the new search direction is d k = −g(x k)T + β k d k−1, where d 0 is arbitrary. A line search along direction d k gives the next approximation:

$$\displaystyle \begin{gathered} \lambda_k=\arg\min_\lambda F({{\boldsymbol{x}}_{k}}+\lambda\hspace{0.5pt}{{\boldsymbol{d}}}_k),\ \; {{\boldsymbol{x}}_{k+1}}={{\boldsymbol{x}}_{k}}+\lambda_k\hspace{0.5pt}{{\boldsymbol{d}}}_k. \end{gathered} $$
(3.11)

Different variants of the algorithm exist, corresponding to different prescriptions for computing β k. Two of them are given here [6, pp. 406–416].

Fletcher–Reeves algorithm

$$\displaystyle \begin{gathered} \beta_k=\frac{{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k}})\hspace{0.5pt}{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k}}){{}^{\mathsf{T}}}}{{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k-1}})\hspace{0.5pt}{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k-1}}){{}^{\mathsf{T}}}}. \end{gathered} $$
(3.12)

Polak–Ribière algorithm

$$\displaystyle \begin{gathered} \beta_k=\frac{{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k}})\left[{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k}})-{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k-1}})\right]{{}^{\mathsf{T}}}}{{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k-1}})\hspace{0.5pt}{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k-1}}){{}^{\mathsf{T}}}}. \end{gathered} $$
(3.13)

It is customary to reset β k to zero whenever k is a multiple of n, in order to avoid the accumulation of rounding errors. For non-quadratic F(x), convergence to the minimum in n steps is not guaranteed in any case.
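A sketch of the Fletcher–Reeves variant on the quadratic of Fig. 3.2, again with the closed-form line search available for quadratics; as stated above, the minimum is reached in exactly two steps:

```python
import numpy as np

A = np.diag([4.0, 32.0])               # F(x, y) = 2x^2 + 16y^2
grad = lambda x: A @ x

def fletcher_reeves(x0, eps=1e-10, max_iter=100):
    """Conjugate gradients with the Fletcher-Reeves beta of Eq. (3.12)."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g                              # first direction: steepest descent
    n_steps = 0
    while np.linalg.norm(g) > eps and n_steps < max_iter:
        lam = -(g @ d) / (d @ A @ d)    # exact line search on the quadratic
        x = x + lam * d
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)    # Fletcher-Reeves, Eq. (3.12)
        d = -g_new + beta * d
        g = g_new
        n_steps += 1
    return x, n_steps

x_min, n_steps = fletcher_reeves([3.0, 1.0])
```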

1.3 Gradient-Free Methods

A popular gradient-free method is the downhill-simplex or Nelder–Mead algorithm [7]. It can be applied to functions whose derivatives are unknown, do not exist everywhere, or are too costly or difficult to compute. In n dimensions, the method stores n + 1 test points x 1, …, x n+1 at every iteration, ordered by increasing function values, and the centroid x 0 of all points but the last one. The simplex generated by the test points is then modified according to the function values in the test points. The allowed modifications are: reflection, expansion, contraction and shrinking. The iteration is terminated when the function value of the best point does not change significantly anymore. The size of the initial simplex is important; choosing it too small can lead to a very localized search. On the other hand, it is possible to escape from a local minimum by restarting the search with a sufficiently large simplex.

An example with the Rosenbrock function F(x, y) = (0.8 − x)2 + 200 (y − x 2)2 is shown in Fig. 3.3. The function has a very shallow minimum at x = 0.8, y = 0.64. With a tolerance of 10−8 on the function value, the minimum is reached after 90 steps.

Fig. 3.3
figure 3

Contour lines of the Rosenbrock function F(x, y) = (0.8 − x)2 + 200 (y − x 2)2 and minimization with the downhill-simplex method starting at the point x 0 = (1.5;1). With a tolerance of 10−8 on the function value, the minimum is reached after 90 steps
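The example of Fig. 3.3 can be reproduced with the Nelder–Mead implementation in scipy.optimize [2]; the starting point and the tolerance follow the text:

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function of Fig. 3.3; minimum at x = 0.8, y = 0.64
F = lambda v: (0.8 - v[0])**2 + 200.0 * (v[1] - v[0]**2)**2

res = minimize(F, x0=[1.5, 1.0], method='Nelder-Mead',
               options={'fatol': 1e-8, 'xatol': 1e-8})
```

The exact step count depends on the construction of the initial simplex and the implementation details of the termination test.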

Other gradient-free methods are simulated annealing, tabu search, particle swarm optimization, genetic algorithms, etc.

2 Statistical Models and Estimation

In the context of this book a statistical model is defined as a functional dependence of observed quantities (observations or measurements) on unknown quantities of interest (parameters or state vectors). The parameters cannot be observed directly, and the observations are subject to stochastic uncertainties. The aim is to estimate the parameters from the observations according to some criterion of optimality.

2.1 Linear Regression Models

A linear regression model has the following general form [8]:

$$\displaystyle \begin{gathered}{} {{\boldsymbol{m}}_{}}={{{\boldsymbol{F}}}}{}{{\boldsymbol{p}}}+{{\boldsymbol{c}}_{}}+{{\boldsymbol{\varepsilon}}},\ \;{\mathsf{E}\left[{{\boldsymbol{\varepsilon}}}\right]}=\mathbf{0},\ \;{\mathsf{Var}\left[{{\boldsymbol{\varepsilon}}}\right]}={{{\boldsymbol{V}}}}={{{\boldsymbol{G}}}}{{}^{-1}}, \end{gathered} $$
(3.14)

where m is the (n × 1) vector of observations, F is the known (n × m) model matrix with m ≤ n, assumed to be of full rank, p is the (m × 1) vector of model parameters, c is a known constant offset, and ε is the (n × 1) vector of observation errors with zero expectation and (n × n) covariance matrix V, assumed to be known.

LS estimation of p requires the minimization of the following objective function:

$$\displaystyle \begin{gathered} {\mathcal{S}}({{\boldsymbol{p}}})=\left({{\boldsymbol{m}}_{}}-{{{\boldsymbol{F}}}}{}{{\boldsymbol{p}}}-{{\boldsymbol{c}}_{}}\right){{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\, \left({{\boldsymbol{m}}_{}}-{{{\boldsymbol{F}}}}{}{{\boldsymbol{p}}}-{{\boldsymbol{c}}_{}}\right). \end{gathered} $$
(3.15)

The least-squares (LS) estimator \({{\boldsymbol {\tilde {p}}}}\) and its covariance matrix C are given by:

$$\displaystyle \begin{gathered}{} {{\boldsymbol{\tilde{p}}}}=({{{\boldsymbol{F}}}}{}{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}{{{\boldsymbol{F}}}}{}){{}^{-1}}{{{\boldsymbol{F}}}}{}{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}({{\boldsymbol{m}}_{}}-{{\boldsymbol{c}}_{}}),\ \;{{{\boldsymbol{C}}}}=({{{\boldsymbol{F}}}}{}{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}{{{\boldsymbol{F}}}}{}){{}^{-1}}. \end{gathered} $$
(3.16)

The estimator \({{\boldsymbol {\tilde {p}}}}\) is unbiased and has the smallest covariance matrix among all estimators that are linear functions of the observations. If the distribution of ε is multivariate normal, the estimator is efficient, i.e., has the smallest possible covariance matrix among all unbiased estimators.

The residuals r of the regression are defined by:

$$\displaystyle \begin{gathered} {\boldsymbol{r}}={{\boldsymbol{m}}_{}}-{{\boldsymbol{c}}_{}}-{{{\boldsymbol{F}}}}{}{{\boldsymbol{\tilde{p}}}},\ \;{{\boldsymbol{R}}}={\mathsf{Var}\left[{\boldsymbol{r}}\right]}={{{\boldsymbol{V}}}}-{{{\boldsymbol{F}}}}{}\hspace{0.5pt}({{{\boldsymbol{F}}}}{}{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}{{{\boldsymbol{F}}}}{}){{}^{-1}}{{{\boldsymbol{F}}}}{}{{}^{\mathsf{T}}}. \end{gathered} $$
(3.17)

The standardized residuals s, also called the “pulls” in high-energy physics, are given by:

$$\displaystyle \begin{gathered} s_i=\frac{r_i}{\sqrt{{{\boldsymbol{R}}}_{ii}}},\ \; i=1,\ldots,n. \end{gathered} $$
(3.18)

If the model is correctly specified, the pulls have mean 0 and standard deviation 1. The chi-square statistic of the regression is defined as:

$$\displaystyle \begin{gathered} {\chi^2}={\boldsymbol{r}}{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}\hspace{0.5pt}{\boldsymbol{r}},\ \;\mathrm{with}\ \; {\mathsf{E}\left[\hspace{0.5pt}{\chi^2}\right]}=n-m. \end{gathered} $$
(3.19)

If the observation errors are normally distributed, χ 2 is χ 2-distributed with d = n − m degrees of freedom; its expectation is d and its variance is 2 d. Its p-value p is defined by the following probability transform:

$$\displaystyle \begin{gathered} p=1-G_{d}\hspace{0.5pt}(\hspace{0.5pt}{\chi^2})=\int_{{\chi^2}}^{\infty} g_{d}\hspace{0.5pt}(x)\,\mathrm{d}x, \end{gathered} $$
(3.20)

where G k(x) is the cumulative distribution function of the χ 2-distribution with k degrees of freedom and g k(x) is its probability density function (PDF). Large values of χ 2 correspond to small p-values. If the model is correctly specified, p is uniformly distributed in the unit interval. A very small p-value indicates a misspecification of the model or of the covariance matrix V , or both.
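The estimator, residuals, pulls and chi-square statistic of Eqs. (3.16)–(3.19) can be demonstrated on a simulated straight-line fit; the model, the noise level and all names are invented for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Straight-line model m = F p + eps with uncorrelated errors of known sigma
x = np.linspace(0.0, 1.0, 20)
Fm = np.column_stack([np.ones_like(x), x])   # model matrix F (n x 2)
p_true = np.array([1.0, 2.0])                # intercept and slope
sigma = 0.1
m = Fm @ p_true + rng.normal(0.0, sigma, size=x.size)

V = sigma**2 * np.eye(x.size)                # covariance matrix of the errors
G = np.linalg.inv(V)                         # weight matrix G = V^{-1}

# LS estimator and its covariance matrix, Eq. (3.16)
C = np.linalg.inv(Fm.T @ G @ Fm)
p_hat = C @ Fm.T @ G @ m

# Residuals, pulls and chi-square, Eqs. (3.17)-(3.19)
r = m - Fm @ p_hat
R = V - Fm @ C @ Fm.T
pulls = r / np.sqrt(np.diag(R))
chi2 = r @ G @ r                             # E[chi2] = n - m = 18
```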

2.2 Nonlinear Regression Models

The linear regression model in Eq. (3.14) can be generalized to a nonlinear model:

$$\displaystyle \begin{gathered} {{\boldsymbol{m}}_{}}={{\boldsymbol{f}}_{}}({{\boldsymbol{p}}})+{{\boldsymbol{\varepsilon}}},\ \;{\mathsf{E}\left[{{\boldsymbol{\varepsilon}}}\right]}=\mathbf{0},\ \;{\mathsf{Var}\left[{{\boldsymbol{\varepsilon}}}\right]}={{{\boldsymbol{V}}}}={{{\boldsymbol{G}}}}{{}^{-1}}, \end{gathered} $$
(3.21)

where f is an (n × 1) vector of smooth functions of the m parameters. LS estimation of p requires the minimization of the following objective function:

$$\displaystyle \begin{gathered} {\mathcal{S}}({{\boldsymbol{p}}})=\left[{{\boldsymbol{m}}_{}}-{{\boldsymbol{f}}_{}}({{\boldsymbol{p}}})\right]{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\, \left[{{\boldsymbol{m}}_{}}-{{\boldsymbol{f}}_{}}({{\boldsymbol{p}}})\right]. \end{gathered} $$
(3.22)

The function \({\mathcal {S}}({{\boldsymbol {p}}})\) can be minimized with any of the methods discussed in Sect. 3.1. The one used most frequently is probably the Gauss–Newton method, which is based on the first-order Taylor expansion of f and results in the following iteration:

$$\displaystyle \begin{gathered} {{\boldsymbol{\tilde{p}}}}_{k+1}={{\boldsymbol{\tilde{p}}}}_{k}+({{{\boldsymbol{F}}}}{}_k{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}{{{\boldsymbol{F}}}}{}_k){{}^{-1}}{{{\boldsymbol{F}}}}{}_k{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}\left[{{\boldsymbol{m}}_{}}-{{\boldsymbol{f}}_{}}({{\boldsymbol{\tilde{p}}}}_{k})\right],\ \;\mathrm{with}\ \;{{{\boldsymbol{F}}}}{}_k=\frac{\partial{{\boldsymbol{f}}_{}}}{\partial{{\boldsymbol{p}}}}\bigg|{}_{{{\boldsymbol{p}}}={{\boldsymbol{\tilde{p}}}}_{k}}. \end{gathered} $$
(3.23)

At each step, the covariance matrix C k+1 of \({{\boldsymbol {\tilde {p}}}}_{k+1}\) is approximately given by:

$$\displaystyle \begin{gathered} {{{\boldsymbol{C}}}}_{k+1}=({{{\boldsymbol{F}}}}{}_k{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}{{{\boldsymbol{F}}}}{}_k){{}^{-1}}. \end{gathered} $$
(3.24)

In general, the covariance matrix of the final estimate \({{\boldsymbol {\tilde {p}}}}\) can be approximated by the inverse of the Hessian of \({\mathcal {S}}({{\boldsymbol {p}}})\) at \({{\boldsymbol {\tilde {p}}}}\). The final chi-square statistic χ 2 is given by:

$$\displaystyle \begin{gathered} \chi^2=\left[{{\boldsymbol{m}}_{}}-{{\boldsymbol{f}}_{}}({{\boldsymbol{\tilde{p}}}})\right]{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}\left[{{\boldsymbol{m}}_{}}-{{\boldsymbol{f}}_{}}({{\boldsymbol{\tilde{p}}}})\right]. \end{gathered} $$
(3.25)

In the case of Gaussian observation errors, the chi-square statistic is approximately χ 2-distributed, and its p-value is approximately uniformly distributed. The iteration is stopped when the chi-square statistic does not change significantly any more.
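A sketch of the Gauss–Newton iteration for an invented exponential-decay model; the model function, its Jacobian and the noise level are assumptions of the example:

```python
import numpy as np

rng = np.random.default_rng(2)

# Nonlinear model f(p): m_i = p0 * exp(-p1 * t_i) + eps_i
t = np.linspace(0.0, 2.0, 30)
p_true = np.array([2.0, 1.5])
sigma = 0.05
m = p_true[0] * np.exp(-p_true[1] * t) + rng.normal(0.0, sigma, size=t.size)
G = np.eye(t.size) / sigma**2                # weight matrix G = V^{-1}

f = lambda p: p[0] * np.exp(-p[1] * t)
# Jacobian of f with respect to p: the matrix F_k of Eq. (3.24)
jac = lambda p: np.column_stack([np.exp(-p[1] * t),
                                 -p[0] * t * np.exp(-p[1] * t)])

p = np.array([1.0, 1.0])                     # starting point
for _ in range(20):                          # Gauss-Newton iteration
    Fk = jac(p)
    step = np.linalg.solve(Fk.T @ G @ Fk, Fk.T @ G @ (m - f(p)))
    p = p + step
    if np.linalg.norm(step) < 1e-10:
        break

C = np.linalg.inv(Fk.T @ G @ Fk)             # covariance, Eq. (3.24)
chi2 = (m - f(p)) @ G @ (m - f(p))           # final chi-square, Eq. (3.25)
```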

2.3 State Space Models

A dynamic or state space model describes the state of an object in space or time, such as a rocket or a charged particle [9]. The state usually changes continuously, but is assumed to be of interest only at discrete instances in the present context. These instances are labeled with indices from 0, the initial state, to n, the final state. The state at instance k is specified by the state vector q k. The spatial or temporal evolution of the state is described by the system equation, which is Eq. (3.26) in the linear case and Eq. (3.41) in the general case.

2.3.1 Linear State Space Models and the Kalman Filter

In the simplest case, the state at instant k is an affine function of the state at instant k − 1 plus a random disturbance called the system or process noise, with known expectation and covariance matrix:

$$\displaystyle \begin{gathered}{} {{\boldsymbol{q}}_{k}}={\boldsymbol{F}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}{{\boldsymbol{q}}_{k-1}}+{{\boldsymbol{d}}_{k}}+{{\boldsymbol{\gamma}}_{k}},\ \; {\mathsf{E}\left[{{\boldsymbol{\gamma}}_{k}}\right]}={{\boldsymbol{g}}_{k}},\ \; {\mathsf{Var}\left[{{\boldsymbol{\gamma}}_{k}}\right]}={{{\boldsymbol{Q}}}_{k}}. \end{gathered} $$
(3.26)

The process noise γ k may affect only a subset of the state vector, in which case its covariance matrix Q k is not of full rank.

In most cases, only some or even none of the components of the state vector can be observed directly. Instead, the observations are functions of the state plus an observation error. In the simplest case, this function is again affine:

$$\displaystyle \begin{gathered}{} {{\boldsymbol{m}}_{k}}={{{\boldsymbol{H}}}_{k}}{{\boldsymbol{q}}_{k}}+{{\boldsymbol{c}}_{k}}+{{\boldsymbol{\varepsilon}}_{k}},\ \;{\mathsf{E}\left[{{\boldsymbol{\varepsilon}}_{k}}\right]}=\mathbf{0}, \ \; {\mathsf{Var}\left[{{\boldsymbol{\varepsilon}}_{k}}\right]}={{{\boldsymbol{V}}}_{ k}}={{{\boldsymbol{G}}}_{k}}{{}^{-1}}. \end{gathered} $$
(3.27)

Process noise and observation errors are assumed to be independent.

Given observations m 1, …, m n, an initial state q 0 and an initial state covariance matrix C 0, all states can be estimated recursively by the Kalman filter [10]. Assume there is an estimated state vector \({{\boldsymbol {\tilde {q}}}}_{k-1}\) with its covariance matrix C k−1 at instant k − 1. The estimated state vector at instant k is obtained by a prediction step followed by an update step. After the last update step, the full information contained in all observations can be propagated back to all previous states by the smoother. If both process noise and observation errors are normally distributed, the Kalman filter is the optimal filter; if not, it is still the best linear filter.

Prediction step

The state vector and its covariance matrix are propagated to the next instance using the system equation Eq. (3.26) and linear error propagation:

$$\displaystyle \begin{gathered} {{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}={\boldsymbol{F}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}{{\boldsymbol{\tilde{q}}}}_{k-1}+{{\boldsymbol{d}}_{k}}+{{\boldsymbol{g}}_{k}},\ \; {{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}}={\boldsymbol{F}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}{{{\boldsymbol{C}}}_{ k-1}}{\boldsymbol{F}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}{{}^{\mathsf{T}}}+{{{\boldsymbol{Q}}}_{k}}.{} \end{gathered} $$
(3.28)

Update step

The updated state vector is the weighted mean of the predicted state vector and the observation m k. There are two equivalent ways to compute the update. The first one uses the gain matrix K k:

$$\displaystyle \begin{aligned} {{\boldsymbol{\tilde{q}}}}_{k}&={{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}+{{{\boldsymbol{K}}}_{k}}\hspace{0.5pt}({{\boldsymbol{m}}_{k}}-{{\boldsymbol{c}}_{k}}-{{{\boldsymbol{H}}}_{k}}{{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}), \ \; {{{\boldsymbol{C}}}_{ k}}=({{\boldsymbol{I}}}-{{{\boldsymbol{K}}}_{k}}{{{\boldsymbol{H}}}_{k}})\hspace{0.5pt}{{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}},{} \end{aligned} $$
(3.29)
$$\displaystyle \begin{aligned} {{{\boldsymbol{K}}}_{k}}&={{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}}{{{\boldsymbol{H}}}_{k}}{{}^{\mathsf{T}}}({{{\boldsymbol{V}}}_{ k}}+{{{\boldsymbol{H}}}_{k}}{{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}}{{{\boldsymbol{H}}}_{k}}{{}^{\mathsf{T}}}){{}^{-1}}.{} \end{aligned} $$
(3.30)

The second one is a multivariate weighted mean:

$$\displaystyle \begin{aligned} {{\boldsymbol{\tilde{q}}}}_{k}&={{{\boldsymbol{C}}}_{ k}}\hspace{0.5pt}\left[{{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}}{{}^{-1}}{{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}+{{{\boldsymbol{H}}}_{k}}{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}_{k}}\hspace{0.5pt}({{\boldsymbol{m}}_{k}}-{{\boldsymbol{c}}_{k}})\right],{} \end{aligned} $$
(3.31)
$$\displaystyle \begin{aligned} {{{\boldsymbol{C}}}_{ k}}&=({{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}}{{}^{-1}}+{{{\boldsymbol{H}}}_{k}}{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}_{k}}{{{\boldsymbol{H}}}_{k}}){{}^{-1}}.{} \end{aligned} $$
(3.32)

The update step has an associated chi-square statistic \({\chi ^2_k}\), which can be computed from the predicted residual r k | k−1 or from the updated residual r k:

$$\displaystyle \begin{aligned} {\boldsymbol{r}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}&={{\boldsymbol{m}}_{k}}-{{\boldsymbol{c}}_{k}}-{{{\boldsymbol{H}}}_{k}}{{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}},\ {{\boldsymbol{R}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}={\mathsf{Var}\left[{\boldsymbol{r}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}\right]}={{{\boldsymbol{V}}}_{ k}}+{{{\boldsymbol{H}}}_{k}}{{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}}{{{\boldsymbol{H}}}_{k}}{{}^{\mathsf{T}}}, \end{aligned} $$
(3.33)
$$\displaystyle \begin{aligned} {\boldsymbol{r}}_{k}&={{\boldsymbol{m}}_{k}}-{{\boldsymbol{c}}_{k}}-{{{\boldsymbol{H}}}_{k}}{{\boldsymbol{\tilde{q}}}}_{k},\ {{\boldsymbol{R}}}_{k}={\mathsf{Var}\left[{\boldsymbol{r}}_{k}\right]}={{{\boldsymbol{V}}}_{ k}}-{{{\boldsymbol{H}}}_{k}}{{{\boldsymbol{C}}}_{ k}}{{{\boldsymbol{H}}}_{k}}{{}^{\mathsf{T}}}, \end{aligned} $$
(3.34)
$$\displaystyle \begin{aligned} {\chi^2}_k&={\boldsymbol{r}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}{{}^{\mathsf{T}}}{{\boldsymbol{R}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}{{}^{-1}}\hspace{0.5pt}{\boldsymbol{r}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}={\boldsymbol{r}}_{k}{{}^{\mathsf{T}}}{{\boldsymbol{R}}}_{k}{{}^{-1}}\hspace{0.5pt}{\boldsymbol{r}}_{k}.{} \end{aligned} $$
(3.35)

If the model is correctly specified and both process noise and observation errors are normally distributed, \({\chi ^2_k}\) is χ 2-distributed with a number of degrees of freedom equal to the dimension of m k. The total chi-square \({\chi ^2_{\mathrm {tot}}}\) of the filter is obtained by summing \({\chi ^2_k}\) over all k.
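The prediction and gain-matrix update steps, Eqs. (3.28)–(3.30), can be sketched for an invented one-dimensional constant-velocity model; the matrices, noise levels and the vague initial state are assumptions of the example:

```python
import numpy as np

rng = np.random.default_rng(3)

# State q = (position, velocity); only the position is observed
dt = 1.0
Fk = np.array([[1.0, dt], [0.0, 1.0]])   # system matrix F_{k|k-1}
Q = 1e-4 * np.eye(2)                     # process-noise covariance
H = np.array([[1.0, 0.0]])               # measurement matrix
V = np.array([[0.25]])                   # observation-error covariance

# Simulate a track and noisy position measurements
n = 50
q_true = np.zeros((n + 1, 2))
q_true[0] = [0.0, 1.0]
for k in range(1, n + 1):
    q_true[k] = Fk @ q_true[k - 1]
ms = q_true[1:, 0] + rng.normal(0.0, 0.5, size=n)

# Kalman filter with a vague initial state
q, C = np.array([0.0, 0.0]), 100.0 * np.eye(2)
for k in range(n):
    q = Fk @ q                                    # prediction, Eq. (3.28)
    C = Fk @ C @ Fk.T + Q
    r = ms[k] - (H @ q)[0]                        # predicted residual
    S = V + H @ C @ H.T                           # its covariance
    K = C @ H.T @ np.linalg.inv(S)                # gain matrix, Eq. (3.30)
    q = q + K.ravel() * r                         # update, Eq. (3.29)
    C = (np.eye(2) - K @ H) @ C
```

After processing all measurements, the filtered state is close to the true final state, and C is its estimated covariance matrix.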

Smoothing

The smoother propagates the full information contained in the last estimate \({{\boldsymbol {\tilde {q}}}}_{n}\) back to all previous states. There are again two equivalent formulations. The first one uses the gain matrix A k of the smoother:

$$\displaystyle \begin{aligned} {{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}&={{\boldsymbol{\tilde{q}}}}_{k}+{{{\boldsymbol{A}}}_{k}}({{\boldsymbol{\tilde{q}}}}_{k+1{\hspace{0.5pt}|\hspace{0.5pt}}{}n}-{{\boldsymbol{\tilde{q}}}}_{k+1{\hspace{0.5pt}|\hspace{0.5pt}}{}k}), \ \;{{{\boldsymbol{A}}}_{k}}={{{\boldsymbol{C}}}_{ k}}{{\boldsymbol{F}}_{k+1}}{{}^{\mathsf{T}}}{{{\boldsymbol{C}}}_{ k+1{\hspace{0.5pt}|\hspace{0.5pt}}{}k}}{{}^{-1}},{} \end{aligned} $$
(3.36)
$$\displaystyle \begin{aligned} {{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}}&={{{\boldsymbol{C}}}_{ k}}-{{{\boldsymbol{A}}}_{k}}({{{\boldsymbol{C}}}_{ k+1{\hspace{0.5pt}|\hspace{0.5pt}}{}k}}-{{{\boldsymbol{C}}}_{ k+1{\hspace{0.5pt}|\hspace{0.5pt}}{}n}})\hspace{0.5pt}{{{\boldsymbol{A}}}_{k}}{{}^{\mathsf{T}}}{}. \end{aligned} $$
(3.37)

This formulation is numerically unstable, as the difference of the two positive definite matrices in Eq. (3.37) can fail to be positive definite as well because of rounding errors. The second, numerically stable formulation realizes the smoother by running two filters, one forward and one backward, on the same sets of observations. The smoothed state is a weighted mean of the states of the two filters:

$$\displaystyle \begin{gathered} {{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}={{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}}\left[{{{\boldsymbol{C}}}_{ k}}{{}^{-1}}{{\boldsymbol{\tilde{q}}}}_{k}+\left({{{\boldsymbol{C}}}^{\,\mathrm{b}}_{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k+1}}\right){{}^{-1}}{{\boldsymbol{\tilde{q}}}}{}^{\,\mathrm{b}}_{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k+1}\right], \ \; {{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}}{{}^{-1}}={{{\boldsymbol{C}}}_{ k}}{{}^{-1}}+\left({{{\boldsymbol{C}}}^{\,\mathrm{b}}_{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k+1}}\right){{}^{-1}},{} \end{gathered} $$
(3.38)

where \({{\boldsymbol {\tilde {q}}}}{ }^{\,\mathrm {b}}_{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k+1}\) is the predicted state from the backward filter and \({{{\boldsymbol {C}}}^{\,\mathrm {b}}_{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k+1}}\) its covariance matrix. Alternatively, the predicted state from the forward filter and the updated state from the backward filter can be combined. The smoother step has an associated chi-square statistic \({\chi ^2_{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}\), which can be computed from the smoothed residuals:

$$\displaystyle \begin{aligned} {\boldsymbol{r}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}&={{\boldsymbol{m}}_{k}}-{{\boldsymbol{c}}_{k}}-{{{\boldsymbol{H}}}_{k}}{{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}},\ {{\boldsymbol{R}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}={\mathsf{Var}\left[{\boldsymbol{r}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}\right]}={{{\boldsymbol{V}}}_{ k}}-{{{\boldsymbol{H}}}_{k}}{{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}}{{{\boldsymbol{H}}}_{k}}{{}^{\mathsf{T}}},{} \end{aligned} $$
(3.39)
$$\displaystyle \begin{aligned} {\chi^2}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}&={\boldsymbol{r}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}{{}^{\mathsf{T}}}{{\boldsymbol{R}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}{{}^{-1}}{\boldsymbol{r}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}.{} \end{aligned} $$
(3.40)

If the model is correctly specified and both process noise and observation errors are normally distributed, \({\chi ^2_{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}\) is χ 2-distributed with a number of degrees of freedom equal to the dimension of m k. As the smoothed state vectors are all estimated from the same information, they are mutually correlated. The prescription for computing their joint covariance matrix is given in [11].

The Kalman filter can also be implemented as an information filter and as a square-root filter [10].

2.3.2 Nonlinear State Space Models and the Extended Kalman Filter

In the applications of the Kalman filter to track fitting, see Chap. 6, the system equation is usually nonlinear; see Sect. 4.3 and Fig. 4.4. It has the following form:

$$\displaystyle \begin{gathered}{} {{\boldsymbol{q}}_{k}}={\boldsymbol{f}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}({{\boldsymbol{q}}_{k-1}})+{{\boldsymbol{\gamma}}_{k}},\ \; {\mathsf{E}\left[{{\boldsymbol{\gamma}}_{k}}\right]}={{\boldsymbol{g}}_{k}},\ \; {\mathsf{Var}\left[{{\boldsymbol{\gamma}}_{k}}\right]}={{{\boldsymbol{Q}}}_{k}}. \end{gathered} $$
(3.41)

In the prediction step, the exact linear error propagation is replaced by an approximate linearized error propagation, using the Jacobian of the system function; see also Sect. 4.4 and Fig. 4.5:

$$\displaystyle \begin{gathered} {{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}={\boldsymbol{f}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}({{\boldsymbol{\tilde{q}}}}_{k-1})+{{\boldsymbol{g}}_{k}},\ \; {{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}}={\boldsymbol{F}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}{{{\boldsymbol{C}}}_{ k-1}}{\boldsymbol{F}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}{{}^{\mathsf{T}}}+{{{\boldsymbol{Q}}}_{k}},\ \;\mathrm{with}\ \; {\boldsymbol{F}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}=\frac{\partial{\boldsymbol{f}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}}{\partial{{\boldsymbol{q}}}}\bigg|{}_{{{\boldsymbol{q}}}={{\boldsymbol{\tilde{q}}}}_{k-1}}. \end{gathered} $$
(3.42)

More rarely, the measurement equation can be nonlinear as well:

$$\displaystyle \begin{gathered}{} {{\boldsymbol{m}}_{k}}={{{\boldsymbol{h}}}_{k}}({{\boldsymbol{q}}_{k}})+{{\boldsymbol{\varepsilon}}_{k}},\ \;{\mathsf{E}\left[{{\boldsymbol{\varepsilon}}_{k}}\right]}=\mathbf{0}, \ \; {\mathsf{Var}\left[{{\boldsymbol{\varepsilon}}_{k}}\right]}={{{\boldsymbol{V}}}_{ k}}={{{\boldsymbol{G}}}_{k}}{{}^{-1}}. \end{gathered} $$
(3.43)

The first-order Taylor expansion of h k at \({{\boldsymbol {\tilde {q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}\) gives:

$$\displaystyle \begin{gathered} {{{\boldsymbol{h}}}_{k}}({{\boldsymbol{q}}_{k}})\approx{{{\boldsymbol{H}}}_{k}}{{\boldsymbol{q}}_{k}}+{{\boldsymbol{c}}_{k}},\ \; {{\boldsymbol{c}}_{k}}={{{\boldsymbol{h}}}_{k}}({{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}})-{{{\boldsymbol{H}}}_{k}}{{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}. \end{gathered} $$
(3.44)

Using this H k and c k in Eqs. (3.29) and (3.30) or Eqs. (3.31) and (3.32) gives the update equations for the nonlinear measurement equation. If required, the update can be iterated by re-expanding h k at \({{\boldsymbol {\tilde {q}}}}_{k}\) and recomputing H k, c k and K k. The smoother can be implemented via its gain matrix, Eqs. (3.36) and (3.37), or by the backward filter, Eq. (3.38).

3 Clustering

Clustering is the classification of objects into groups, such that similar objects end up in the same group. The similarity of objects can be summarized in a distance/similarity matrix containing the pair-wise distances/similarities of the objects to be clustered. A cluster is then a group with small distances/large similarities inside the group and large distances/small similarities to objects outside the group. After clustering, the resulting cluster structure should be validated by measuring the internal consistency of each cluster. For further information and examples, the reader is directed to the literature [12,13,14].

3.1 Hierarchical Clustering

Hierarchical clustering groups the objects with a sequence of nested partitions [12, Chapter 3]. If the sequence starts from single-object clusters and merges them into larger clusters, the clustering is called agglomerative; if the sequence starts from a single cluster containing all objects and splits it into successively smaller clusters, the clustering is called divisive. At any stage of the clustering, all clusters are pairwise disjoint. The number of clusters is not necessarily specified in advance, but can be determined “on the fly”. If in divisive clustering a cluster is considered to be valid, further splitting is not required, but also not forbidden. If in agglomerative clustering the merging of two clusters results in an invalid cluster, the merger is undone.
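Agglomerative clustering is available, for instance, in scipy.cluster.hierarchy; a small sketch on invented data with two well-separated groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)

# Two well-separated groups of points in the plane
pts = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                 rng.normal(5.0, 0.3, size=(20, 2))])

# Agglomerative clustering: the sequence of merges is stored in Z
Z = linkage(pts, method='single')                 # single-link merge criterion
labels = fcluster(Z, t=2, criterion='maxclust')   # cut into two clusters
```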

3.2 Partitional Clustering

Partitional clustering directly divides the objects into a predefined number K of clusters, usually by optimizing an objective function that describes the global quality of the cluster structure [12, Chapter 4]. Partitional clustering can be repeated for several values of K in order to find the optimal cluster structure. In the fuzzy variant of partitional clustering, an object can belong to more than one cluster with a certain degree of membership [15].
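A classic instance of partitional clustering is the k-means algorithm, which for fixed K minimizes the within-cluster sum of squared distances; a minimal sketch on invented data:

```python
import numpy as np

rng = np.random.default_rng(5)

# Two separated groups of points; K = 2 clusters
pts = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
                 rng.normal(4.0, 0.3, size=(30, 2))])

def kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm for partitional clustering with K clusters."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assign every object to its nearest cluster center ...
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # ... then move each center to the mean of its assigned objects
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans(pts, K=2)
```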

3.3 Model-Based Clustering

In model-based clustering, the objects are assumed to be drawn from a mixture distribution with two or more components. Each component is described by a PDF and has an associated probability or “weight” in the mixture [16]. The parameters of the mixture are usually estimated by the Expectation-Maximization (EM) algorithm [17,18,19]; see Sect. 7.2.2. A by-product of the EM algorithm is the posterior probability π ik of object i belonging to cluster k, for all objects and all clusters. The result is again a fuzzy clustering.
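A sketch of the EM algorithm for a two-component one-dimensional Gaussian mixture; the data and the initialization are invented, and the array resp holds the posterior probabilities π ik mentioned above:

```python
import numpy as np

rng = np.random.default_rng(6)

# Sample from a two-component Gaussian mixture with weights 0.3 and 0.7
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

def em_gmm(x, n_iter=200):
    """EM estimation of weights, means and variances of a 1D two-component mixture."""
    w = np.array([0.5, 0.5])                 # mixture weights
    mu = np.array([x.min(), x.max()])        # crude initialization of the means
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E step: posterior probability of each point for each component
        pdf = (np.exp(-0.5 * (x[:, None] - mu)**2 / var)
               / np.sqrt(2.0 * np.pi * var))
        resp = w * pdf
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: re-estimate weights, means and variances
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu)**2).sum(axis=0) / nk
    return w, mu, var, resp

w, mu, var, resp = em_gmm(x)
```

Thresholding or ranking the rows of resp turns the fuzzy result into a hard partition if one is required.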