1 Function Minimization

The minimization (or maximization) of a multivariate function F(x) is a frequent task in solving non-linear systems of equations, clustering, maximum-likelihood estimation, function and model fitting, supervised learning, and similar problems. A basic classification of minimization methods distinguishes between methods that require the computation of the gradient or even the Hessian matrix of the function and gradient-free methods. All methods discussed in the following subsections are iterative and require a suitable starting point x 0. Implementations of the methods discussed in the following, along with many others, can, for instance, be found in the MATLAB® Optimization Toolbox™ [1] and in the Python package scipy.optimize [2].

1.1 Newton–Raphson Method

If F(x) is at least twice continuously differentiable in its domain, it can be approximated by its second-order Taylor expansion \(\hat {F}\) at the starting point x 0:

$$\displaystyle \begin{gathered} F({{\boldsymbol{x}}_{0}}+{\,{\boldsymbol{h}}})\approx \hat{F}({{\boldsymbol{x}}})=F({{\boldsymbol{x}}_{0}})+{{\boldsymbol{g}}}({{\boldsymbol{x}}_{0}}){\,{\boldsymbol{h}}}+\frac{1}{2}{\,{\boldsymbol{h}}}{{}^{\mathsf{T}}}{{{\boldsymbol{H}}}}({{\boldsymbol{x}}_{0}}){\,{\boldsymbol{h}}}, \end{gathered} $$
(3.1)

where g(x) = ∇F(x) is the gradient and H(x) = ∇2 F(x) is the Hessian matrix of F(x). The step h is determined such that \(\hat {F}\) has a stationary point at x 1 = x 0 +  h, i.e., that the gradient of \(\hat {F}\) is zero at x 1:

$$\displaystyle \begin{gathered} \nabla\hat{F}({{\boldsymbol{x}}_{1}})={{\boldsymbol{g}}}({{\boldsymbol{x}}_{0}})+{\,{\boldsymbol{h}}}{{}^{\mathsf{T}}}{{{\boldsymbol{H}}}}({{\boldsymbol{x}}_{0}})=\mathbf{0}{\Longrightarrow} {\,{\boldsymbol{h}}}=-[{{{\boldsymbol{H}}}}({{\boldsymbol{x}}_{0}})]{{}^{-1}}\hspace{0.5pt}{{\boldsymbol{g}}}({{\boldsymbol{x}}_{0}}){{}^{\mathsf{T}}}.{} \end{gathered} $$
(3.2)

Note that if x is a column vector, g(x 0) is a row vector. In order to ensure that the Wolfe conditions [3] are satisfied, Eq. (3.2) is often relaxed to:

$$\displaystyle \begin{aligned} {\,{\boldsymbol{h}}}&=-\eta\hspace{0.5pt}[{{{\boldsymbol{H}}}}({{\boldsymbol{x}}_{0}})]{{}^{-1}}\hspace{0.5pt}{{\boldsymbol{g}}}({{\boldsymbol{x}}_{0}}){{}^{\mathsf{T}}}, \end{aligned} $$
(3.3)

with a learning rate η ∈ (0, 1). Inverting the Hessian matrix can be computationally expensive; instead, h can be computed as an (approximate) solution of the linear system H(x 0) ⋅ h = −g(x 0)T. This procedure is iterated to produce a sequence of values according to:

$$\displaystyle \begin{gathered} {{\boldsymbol{x}}_{k+1}}={{\boldsymbol{x}}_{k}}-\eta\hspace{0.5pt}[{{{\boldsymbol{H}}}}({{\boldsymbol{x}}_{k}})]{{}^{-1}}\hspace{0.5pt}{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k}}){{}^{\mathsf{T}}}. \end{gathered} $$
(3.4)

If the starting point x 0 is sufficiently close to a local minimum, the sequence converges quadratically to the local minimum. In practice, the iteration is stopped as soon as the norm ‖g(x k)‖ of the gradient falls below some predefined bound ε. If F is a convex function, the local minimum is also the global minimum.
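For illustration, the iteration of Eq. (3.4) can be sketched in a few lines of Python with NumPy. The quadratic test function and all names are chosen for this example only; a production implementation would add safeguards such as a line search or Hessian modification:

```python
import numpy as np

def newton_raphson(grad, hess, x0, eta=1.0, eps=1e-8, max_iter=100):
    """Damped Newton-Raphson iteration, cf. Eq. (3.4)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:      # stop when the gradient norm is small
            break
        # Solve H h = -g instead of inverting the Hessian explicitly
        h = np.linalg.solve(hess(x), -g)
        x = x + eta * h
    return x

# Test function F(x, y) = 2x^2 + 16y^2 with analytic gradient and Hessian
grad = lambda x: np.array([4.0 * x[0], 32.0 * x[1]])
hess = lambda x: np.diag([4.0, 32.0])

x_min = newton_raphson(grad, hess, x0=[3.0, 1.0])
```

For a quadratic function and η = 1, a single Newton step already reaches the minimum.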

1.2 Descent Methods

As the computation of the Hessian matrix is computationally costly, various methods that do not require it have been devised, for instance, descent methods. A descent method is an iterative algorithm that searches for an approximate minimum of F by decreasing the value of F in every iteration. The iteration has the form x k+1 = x k + η k d k, where d k is the search direction and η k is the step-size parameter. As with the Newton–Raphson method, when a (local) minimum is reached, it cannot be left anymore.

1.2.1 Line Search

A search direction d is called a descent direction at the point \({{\boldsymbol {x}}}\in {\mathbb {R}}^n\) if g(x) ⋅d < 0. If η is sufficiently small, then

$$\displaystyle \begin{gathered} F({{\boldsymbol{x}}}+\eta\,{{\boldsymbol{d}}})<F({{\boldsymbol{x}}}). \end{gathered} $$
(3.5)

Once a search direction d k has been chosen at the point x k, line search implies that the line x = x k + λ d k is followed to its closest local minimum. For various line search methods such as fixed and variable step size, interpolation, golden section or Fibonacci’s method, see [4, 5].

1.2.2 Steepest Descent

Steepest descent follows the opposite direction of the gradient. If it is combined with line search, each search direction is orthogonal to the previous one, leading to a zig-zag search path. This can be very inefficient in the vicinity of a minimum where the Hessian matrix has a large condition number (ratio of the largest to the smallest eigenvalue); see Fig. 3.1.

Fig. 3.1
figure 1

Contour lines of the function F(x, y) = 2x 2 + 16y 2, and steepest descent with line search starting at the point x 0 = (3;1)
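The zig-zag behaviour can be reproduced with a short sketch. For the quadratic F(x, y) = 2x² + 16y² = ½ xᵀA x of Fig. 3.1, the optimal step size along the negative gradient is available in closed form; the code is illustrative only:

```python
import numpy as np

A = np.diag([4.0, 32.0])               # F(x, y) = 2x^2 + 16y^2 = 0.5 x^T A x

def steepest_descent(x0, eps=1e-8, max_iter=1000):
    """Steepest descent with exact line search on the quadratic form A."""
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(max_iter):
        g = A @ x                       # gradient of the quadratic
        if np.linalg.norm(g) < eps:
            break
        lam = (g @ g) / (g @ A @ g)     # exact minimizer along -g
        x = x - lam * g
        path.append(x.copy())
    return x, path

x_min, path = steepest_descent([3.0, 1.0])
```

Consecutive steps are orthogonal, producing the zig-zag path of Fig. 3.1, and the condition number of 8 makes the method take dozens of iterations on this simple problem.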

1.2.3 Quasi-Newton Methods

In the Newton–Raphson method, the search direction is d = −[H(x k)]−1 g(x k)T. If H(x k) is positive definite, then d is a descent direction, and so is −A g(x k)T for any positive definite matrix A. In a quasi-Newton method, A is constructed as an approximation to the inverse Hessian matrix, using only gradient information. The initial search direction is the negative gradient, d 0 = −g(x 0)T, and the initial matrix A 0 is the identity matrix. Each iteration performs a line search along the current search direction:

$$\displaystyle \begin{gathered} \lambda_k=\arg\min_\lambda F\left({{\boldsymbol{x}}_{k}}+\lambda\hspace{0.5pt}{{\boldsymbol{d}}}_k\right),\ \; {{\boldsymbol{x}}_{k+1}}={{\boldsymbol{x}}_{k}}+\lambda_k\hspace{0.5pt}{{\boldsymbol{d}}}_k. \end{gathered} $$
(3.6)

The new search direction is then computed according to:

$$\displaystyle \begin{gathered} {{\boldsymbol{d}}}_{k+1}=-{{{\boldsymbol{A}}}}_{k+1}\hspace{0.5pt}{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k+1}}){{}^{\mathsf{T}}}, \end{gathered} $$
(3.7)

where A k+1 is the updated approximation to the inverse Hessian matrix. There are two different algorithms for computing the update [6, p. 422], both using the change of the gradient along the step, denoted by \({\boldsymbol {y}}_k=\left [{{\boldsymbol {g}}}({{\boldsymbol {x}}_{k+1}})-{{\boldsymbol {g}}}({{\boldsymbol {x}}_{k}})\right ]{{ }^{\mathsf {T}}}\).

Davidon–Fletcher–Powell algorithm

$$\displaystyle \begin{gathered} {{{\boldsymbol{A}}}}_{k+1}={{{\boldsymbol{A}}}}_k+\lambda_k\frac{{{\boldsymbol{d}}}_k{{\boldsymbol{d}}}_k{{}^{\mathsf{T}}}}{{{\boldsymbol{d}}}_k{{}^{\mathsf{T}}}{\boldsymbol{y}}_k}-\frac{{{{\boldsymbol{A}}}}_k{\boldsymbol{y}}_k{\boldsymbol{y}}_k{{}^{\mathsf{T}}}{{{\boldsymbol{A}}}}_k}{{\boldsymbol{y}}_k{{}^{\mathsf{T}}}{{{\boldsymbol{A}}}}_k{\boldsymbol{y}}_k}. \end{gathered} $$
(3.8)

Broyden–Fletcher–Goldfarb–Shanno algorithm

$$\displaystyle \begin{gathered} {{{\boldsymbol{A}}}}_{k+1}=\lambda_k\frac{{{\boldsymbol{d}}}_k{{\boldsymbol{d}}}_k{{}^{\mathsf{T}}}}{{{\boldsymbol{d}}}_k{{}^{\mathsf{T}}}{\boldsymbol{y}}_k}+\left({{\boldsymbol{I}}}-\frac{{{\boldsymbol{d}}}_k{\boldsymbol{y}}_k{{}^{\mathsf{T}}}}{{{\boldsymbol{d}}}_k{{}^{\mathsf{T}}}{\boldsymbol{y}}_k}\right){{{\boldsymbol{A}}}}_k\left({{\boldsymbol{I}}}-\frac{{\boldsymbol{y}}_k{{\boldsymbol{d}}}_k{{}^{\mathsf{T}}}}{{\boldsymbol{y}}_k{{}^{\mathsf{T}}}{{\boldsymbol{d}}}_k}\right). \end{gathered} $$
(3.9)
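As a sketch of how these formulas interlock, the following code applies the BFGS update of Eq. (3.9) with an exact line search, which is available in closed form for a quadratic test function; all names are illustrative. In exact arithmetic, the method terminates on an n-dimensional quadratic in n steps, with the final A equal to the inverse Hessian:

```python
import numpy as np

H_quad = np.diag([4.0, 32.0])          # Hessian of F(x, y) = 2x^2 + 16y^2
grad = lambda x: H_quad @ x

def bfgs(x0, eps=1e-10, max_iter=50):
    """Quasi-Newton minimization with the BFGS update of Eq. (3.9)."""
    x = np.asarray(x0, dtype=float)
    A = np.eye(len(x))                 # initial inverse-Hessian approximation
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < eps:
            break
        d = -A @ g                     # search direction, Eq. (3.7)
        lam = -(g @ d) / (d @ H_quad @ d)   # exact line search (quadratic F)
        x_new = x + lam * d
        g_new = grad(x_new)
        y = g_new - g                  # change of the gradient along the step
        dy = d @ y
        I = np.eye(len(x))
        A = (lam * np.outer(d, d) / dy
             + (I - np.outer(d, y) / dy) @ A @ (I - np.outer(y, d) / dy))
        x, g = x_new, g_new
    return x, A

x_min, A_final = bfgs([3.0, 1.0])      # A_final approximates H_quad^{-1}
```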

1.2.4 Conjugate Gradients

If the function \(F({{\boldsymbol {x}}}),{{\boldsymbol {x}}}\in {\mathbb {R}}^n\) is quadratic of the form \(F({{\boldsymbol {x}}})=\frac {1}{2}{{\boldsymbol {x}}}{{ }^{\mathsf {T}}}{{{\boldsymbol {A}}}}{{\boldsymbol {x}}}-{{\boldsymbol {b}}}{{ }^{\mathsf {T}}}{{\boldsymbol {x}}}+c\) with positive definite A, the global minimum can be found in exactly n steps, if line search with a set of conjugate search directions is used. Such a set {d 1, …, d n} is characterized by the following conditions:

$$\displaystyle \begin{gathered} {{\boldsymbol{d}}}_i{{}^{\mathsf{T}}}{{{\boldsymbol{A}}}}\hspace{0.5pt}{{\boldsymbol{d}}}_j=0,\ \; i\neq j,\quad{{\boldsymbol{d}}}_i{{}^{\mathsf{T}}}{{{\boldsymbol{A}}}}\hspace{0.5pt}{{\boldsymbol{d}}}_i\neq0,\quad i=1,\ldots,n. \end{gathered} $$
(3.10)

The set is linearly independent and a basis of \({\mathbb {R}}^n\). An example for n = 2 is shown in Fig. 3.2.

Fig. 3.2
figure 2

Contour lines of the function F(x, y) = 2x 2 + 16y 2, and descent with line search and conjugate gradients starting at the point x 0 = (3;1). The minimum is reached in two steps

In the general case, the conjugate gradient method proceeds by successive approximations, generating a new search direction in every iteration. Given an approximation x k, the new search direction is d k = −g(x k)T + β k d k−1, where d 0 is arbitrary. A line search along direction d k gives the next approximation:

$$\displaystyle \begin{gathered} \lambda_k=\arg\min_\lambda F({{\boldsymbol{x}}_{k}}+\lambda\hspace{0.5pt}{{\boldsymbol{d}}}_k),\ \; {{\boldsymbol{x}}_{k+1}}={{\boldsymbol{x}}_{k}}+\lambda_k\hspace{0.5pt}{{\boldsymbol{d}}}_k. \end{gathered} $$
(3.11)

Different variants of the algorithm exist, corresponding to different prescriptions for computing β k. Two of them are given here [6, pp. 406–416].

Fletcher–Reeves algorithm

$$\displaystyle \begin{gathered} \beta_k=\frac{{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k}})\hspace{0.5pt}{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k}}){{}^{\mathsf{T}}}}{{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k-1}})\hspace{0.5pt}{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k-1}}){{}^{\mathsf{T}}}}. \end{gathered} $$
(3.12)

Polak–Ribière algorithm

$$\displaystyle \begin{gathered} \beta_k=\frac{{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k}})\left[{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k}})-{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k-1}})\right]{{}^{\mathsf{T}}}}{{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k-1}})\hspace{0.5pt}{{\boldsymbol{g}}}({{\boldsymbol{x}}_{k-1}}){{}^{\mathsf{T}}}}. \end{gathered} $$
(3.13)

It is customary to reset β k to zero whenever k is a multiple of n, in order to avoid the accumulation of rounding errors. For non-quadratic F(x), convergence to the minimum in n steps is not guaranteed in any case.
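A sketch of the Fletcher–Reeves variant on the quadratic of Fig. 3.2, again with the closed-form line search available for quadratics; as stated above, the minimum is reached in exactly two steps:

```python
import numpy as np

A = np.diag([4.0, 32.0])               # F(x, y) = 2x^2 + 16y^2
grad = lambda x: A @ x

def fletcher_reeves(x0, eps=1e-10, max_iter=100):
    """Conjugate gradients with the Fletcher-Reeves beta of Eq. (3.12)."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g                              # first direction: steepest descent
    n_steps = 0
    while np.linalg.norm(g) > eps and n_steps < max_iter:
        lam = -(g @ d) / (d @ A @ d)    # exact line search on the quadratic
        x = x + lam * d
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)    # Fletcher-Reeves, Eq. (3.12)
        d = -g_new + beta * d
        g = g_new
        n_steps += 1
    return x, n_steps

x_min, n_steps = fletcher_reeves([3.0, 1.0])
```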

1.3 Gradient-Free Methods

A popular gradient-free method is the downhill-simplex or Nelder–Mead algorithm [7]. It can be applied to functions whose derivatives are unknown, do not exist everywhere, or are too costly or difficult to compute. In n dimensions, the method stores n + 1 test points x 1, …, x n+1 at every iteration, ordered by increasing function values, and the centroid x 0 of all points but the last one. The simplex generated by the test points is then modified according to the function values in the test points. The allowed modifications are: reflection, expansion, contraction and shrinking. The iteration is terminated when the function value of the best point does not change significantly anymore. The size of the initial simplex is important; choosing it too small can lead to a very localized search. On the other hand, it is possible to escape from a local minimum by restarting the search with a sufficiently large simplex.

An example with the Rosenbrock function F(x, y) = (0.8 − x)2 + 200 (y − x 2)2 is shown in Fig. 3.3. The function has a very shallow minimum at x = 0.8, y = 0.64. With a tolerance of 10−8 on the function value, the minimum is reached after 90 steps.

Fig. 3.3
figure 3

Contour lines of the Rosenbrock function F(x, y) = (0.8 − x)2 + 200 (y − x 2)2 and minimization with the downhill-simplex method starting at the point x 0 = (1.5;1). With a tolerance of 10−8 on the function value, the minimum is reached after 90 steps
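The example of Fig. 3.3 can be reproduced with the Nelder–Mead implementation in scipy.optimize [2]; the starting point and the tolerance follow the text:

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function of Fig. 3.3; minimum at x = 0.8, y = 0.64
F = lambda v: (0.8 - v[0])**2 + 200.0 * (v[1] - v[0]**2)**2

res = minimize(F, x0=[1.5, 1.0], method='Nelder-Mead',
               options={'fatol': 1e-8, 'xatol': 1e-8})
```

The exact step count depends on the construction of the initial simplex and the implementation details of the termination test.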

Other gradient-free methods are simulated annealing, tabu search, particle swarm optimization, genetic algorithms, etc.

2 Statistical Models and Estimation

In the context of this book a statistical model is defined as a functional dependence of observed quantities (observations or measurements) on unknown quantities of interest (parameters or state vectors). The parameters cannot be observed directly, and the observations are subject to stochastic uncertainties. The aim is to estimate the parameters from the observations according to some criterion of optimality.

2.1 Linear Regression Models

A linear regression model has the following general form [8]:

$$\displaystyle \begin{gathered}{} {{\boldsymbol{m}}_{}}={{{\boldsymbol{F}}}}{}{{\boldsymbol{p}}}+{{\boldsymbol{c}}_{}}+{{\boldsymbol{\varepsilon}}},\ \;{\mathsf{E}\left[{{\boldsymbol{\varepsilon}}}\right]}=\mathbf{0},\ \;{\mathsf{Var}\left[{{\boldsymbol{\varepsilon}}}\right]}={{{\boldsymbol{V}}}}={{{\boldsymbol{G}}}}{{}^{-1}}, \end{gathered} $$
(3.14)

where m is the (n × 1) vector of observations, F is the known (n × m) model matrix with m ≤ n, assumed to be of full rank, p is the (m × 1) vector of model parameters, c is a known constant offset, and ε is the (n × 1) vector of observation errors with zero expectation and (n × n) covariance matrix V, assumed to be known.

LS estimation of p requires the minimization of the following objective function:

$$\displaystyle \begin{gathered} {\mathcal{S}}({{\boldsymbol{p}}})=\left({{\boldsymbol{m}}_{}}-{{{\boldsymbol{F}}}}{}{{\boldsymbol{p}}}-{{\boldsymbol{c}}_{}}\right){{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\, \left({{\boldsymbol{m}}_{}}-{{{\boldsymbol{F}}}}{}{{\boldsymbol{p}}}-{{\boldsymbol{c}}_{}}\right). \end{gathered} $$
(3.15)

The least-squares (LS) estimator \({{\boldsymbol {\tilde {p}}}}\) and its covariance matrix C are given by:

$$\displaystyle \begin{gathered}{} {{\boldsymbol{\tilde{p}}}}=({{{\boldsymbol{F}}}}{}{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}{{{\boldsymbol{F}}}}{}){{}^{-1}}{{{\boldsymbol{F}}}}{}{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}({{\boldsymbol{m}}_{}}-{{\boldsymbol{c}}_{}}),\ \;{{{\boldsymbol{C}}}}=({{{\boldsymbol{F}}}}{}{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}{{{\boldsymbol{F}}}}{}){{}^{-1}}. \end{gathered} $$
(3.16)

The estimator \({{\boldsymbol {\tilde {p}}}}\) is unbiased and has the smallest covariance matrix among all estimators that are linear functions of the observations. If the distribution of ε is multivariate normal, the estimator is efficient, i.e., has the smallest possible covariance matrix among all unbiased estimators.

The residuals r of the regression are defined by:

$$\displaystyle \begin{gathered} {\boldsymbol{r}}={{\boldsymbol{m}}_{}}-{{\boldsymbol{c}}_{}}-{{{\boldsymbol{F}}}}{}{{\boldsymbol{\tilde{p}}}},\ \;{{\boldsymbol{R}}}={\mathsf{Var}\left[{\boldsymbol{r}}\right]}={{{\boldsymbol{V}}}}-{{{\boldsymbol{F}}}}{}\hspace{0.5pt}({{{\boldsymbol{F}}}}{}{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}{{{\boldsymbol{F}}}}{}){{}^{-1}}{{{\boldsymbol{F}}}}{}{{}^{\mathsf{T}}}. \end{gathered} $$
(3.17)

The standardized residuals s, also called the “pulls” in high-energy physics, are given by:

$$\displaystyle \begin{gathered} s_i=\frac{r_i}{\sqrt{{{\boldsymbol{R}}}_{ii}}},\ \; i=1,\ldots,n. \end{gathered} $$
(3.18)

If the model is correctly specified, the pulls have mean 0 and standard deviation 1. The chi-square statistic of the regression is defined as:

$$\displaystyle \begin{gathered} {\chi^2}={\boldsymbol{r}}{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}\hspace{0.5pt}{\boldsymbol{r}},\ \;\mathrm{with}\ \; {\mathsf{E}\left[\hspace{0.5pt}{\chi^2}\right]}=n-m. \end{gathered} $$
(3.19)

If the observation errors are normally distributed, χ 2 is χ 2-distributed with d = n − m degrees of freedom; its expectation is d and its variance is 2 d. Its p-value p is defined by the following probability transform:

$$\displaystyle \begin{gathered} p=1-G_{d}\hspace{0.5pt}(\hspace{0.5pt}{\chi^2})=\int_{{\chi^2}}^{\infty} g_{d}\hspace{0.5pt}(x)\,\mathrm{d}x, \end{gathered} $$
(3.20)

where G k(x) is the cumulative distribution function of the χ 2-distribution with k degrees of freedom and g k(x) is its probability density function (PDF). Large values of χ 2 correspond to small p-values. If the model is correctly specified, p is uniformly distributed in the unit interval. A very small p-value indicates a misspecification of the model or of the covariance matrix V , or both.
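The estimator, residuals, pulls and chi-square statistic of Eqs. (3.16)–(3.19) can be demonstrated on a simulated straight-line fit; the model, the noise level and all names are invented for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Straight-line model m = F p + eps with uncorrelated errors of known sigma
x = np.linspace(0.0, 1.0, 20)
Fm = np.column_stack([np.ones_like(x), x])   # model matrix F (n x 2)
p_true = np.array([1.0, 2.0])                # intercept and slope
sigma = 0.1
m = Fm @ p_true + rng.normal(0.0, sigma, size=x.size)

V = sigma**2 * np.eye(x.size)                # covariance matrix of the errors
G = np.linalg.inv(V)                         # weight matrix G = V^{-1}

# LS estimator and its covariance matrix, Eq. (3.16)
C = np.linalg.inv(Fm.T @ G @ Fm)
p_hat = C @ Fm.T @ G @ m

# Residuals, pulls and chi-square, Eqs. (3.17)-(3.19)
r = m - Fm @ p_hat
R = V - Fm @ C @ Fm.T
pulls = r / np.sqrt(np.diag(R))
chi2 = r @ G @ r                             # E[chi2] = n - m = 18
```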

2.2 Nonlinear Regression Models

The linear regression model in Eq. (3.14) can be generalized to a nonlinear model:

$$\displaystyle \begin{gathered} {{\boldsymbol{m}}_{}}={{\boldsymbol{f}}_{}}({{\boldsymbol{p}}})+{{\boldsymbol{\varepsilon}}},\ \;{\mathsf{E}\left[{{\boldsymbol{\varepsilon}}}\right]}=\mathbf{0},\ \;{\mathsf{Var}\left[{{\boldsymbol{\varepsilon}}}\right]}={{{\boldsymbol{V}}}}={{{\boldsymbol{G}}}}{{}^{-1}}, \end{gathered} $$
(3.21)

where f is an (n × 1) vector of smooth functions of the m parameters. LS estimation of p requires the minimization of the following objective function:

$$\displaystyle \begin{gathered} {\mathcal{S}}({{\boldsymbol{p}}})=\left[{{\boldsymbol{m}}_{}}-{{\boldsymbol{f}}_{}}({{\boldsymbol{p}}})\right]{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\, \left[{{\boldsymbol{m}}_{}}-{{\boldsymbol{f}}_{}}({{\boldsymbol{p}}})\right]. \end{gathered} $$
(3.22)

The function \({\mathcal {S}}({{\boldsymbol {p}}})\) can be minimized with any of the methods discussed in Sect. 3.1. The one used most frequently is probably the Gauss–Newton method, which is based on the first-order Taylor expansion of f and results in the following iteration:

$$\displaystyle \begin{gathered} {{\boldsymbol{\tilde{p}}}}_{k+1}={{\boldsymbol{\tilde{p}}}}_{k}+({{{\boldsymbol{F}}}}{}_k{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}{{{\boldsymbol{F}}}}{}_k){{}^{-1}}{{{\boldsymbol{F}}}}{}_k{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}\left[{{\boldsymbol{m}}_{}}-{{\boldsymbol{f}}_{}}({{\boldsymbol{\tilde{p}}}}_{k})\right],\ \;\mathrm{with}\ \;{{{\boldsymbol{F}}}}{}_k=\frac{\partial{{\boldsymbol{f}}_{}}}{\partial{{\boldsymbol{p}}}}\bigg|{}_{{{\boldsymbol{p}}}={{\boldsymbol{\tilde{p}}}}_{k}}. \end{gathered} $$
(3.23)

At each step, the covariance matrix C k+1 of \({{\boldsymbol {\tilde {p}}}}_{k+1}\) is approximately given by:

$$\displaystyle \begin{gathered} {{{\boldsymbol{C}}}}_{k+1}=({{{\boldsymbol{F}}}}{}_k{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}{{{\boldsymbol{F}}}}{}_k){{}^{-1}}. \end{gathered} $$
(3.24)

In general, the covariance matrix of the final estimate \({{\boldsymbol {\tilde {p}}}}\) can be approximated by the inverse of the Hessian of \({\mathcal {S}}({{\boldsymbol {p}}})\) at \({{\boldsymbol {\tilde {p}}}}\). The final chi-square statistic χ 2 is given by:

$$\displaystyle \begin{gathered} \chi^2=\left[{{\boldsymbol{m}}_{}}-{{\boldsymbol{f}}_{}}({{\boldsymbol{\tilde{p}}}})\right]{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}}\hspace{0.5pt}\left[{{\boldsymbol{m}}_{}}-{{\boldsymbol{f}}_{}}({{\boldsymbol{\tilde{p}}}})\right]. \end{gathered} $$
(3.25)

In the case of Gaussian observation errors, the chi-square statistic is approximately χ 2-distributed, and its p-value is approximately uniformly distributed. The iteration is stopped when the chi-square statistic does not change significantly any more.
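A sketch of the Gauss–Newton iteration for an invented exponential-decay model; the model function, its Jacobian and the noise level are assumptions of the example:

```python
import numpy as np

rng = np.random.default_rng(2)

# Nonlinear model f(p): m_i = p0 * exp(-p1 * t_i) + eps_i
t = np.linspace(0.0, 2.0, 30)
p_true = np.array([2.0, 1.5])
sigma = 0.05
m = p_true[0] * np.exp(-p_true[1] * t) + rng.normal(0.0, sigma, size=t.size)
G = np.eye(t.size) / sigma**2                # weight matrix G = V^{-1}

f = lambda p: p[0] * np.exp(-p[1] * t)
# Jacobian of f with respect to p: the matrix F_k of Eq. (3.24)
jac = lambda p: np.column_stack([np.exp(-p[1] * t),
                                 -p[0] * t * np.exp(-p[1] * t)])

p = np.array([1.0, 1.0])                     # starting point
for _ in range(20):                          # Gauss-Newton iteration
    Fk = jac(p)
    step = np.linalg.solve(Fk.T @ G @ Fk, Fk.T @ G @ (m - f(p)))
    p = p + step
    if np.linalg.norm(step) < 1e-10:
        break

C = np.linalg.inv(Fk.T @ G @ Fk)             # covariance, Eq. (3.24)
chi2 = (m - f(p)) @ G @ (m - f(p))           # final chi-square, Eq. (3.25)
```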

2.3 State Space Models

A dynamic or state space model describes the state of an object in space or time, such as a rocket or a charged particle [9]. The state usually changes continuously, but is assumed to be of interest only at discrete instances in the present context. These instances are labeled with indices from 0, the initial state, to n, the final state. The state at instance k is specified by the state vector q k. The spatial or temporal evolution of the state is described by the system equation, which is Eq. (3.26) in the linear case and Eq. (3.41) in the general case.

2.3.1 Linear State Space Models and the Kalman Filter

In the simplest case, the state at instant k is an affine function of the state at instant k − 1 plus a random disturbance called the system or process noise, with known expectation and covariance matrix:

$$\displaystyle \begin{gathered}{} {{\boldsymbol{q}}_{k}}={\boldsymbol{F}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}{{\boldsymbol{q}}_{k-1}}+{{\boldsymbol{d}}_{k}}+{{\boldsymbol{\gamma}}_{k}},\ \; {\mathsf{E}\left[{{\boldsymbol{\gamma}}_{k}}\right]}={{\boldsymbol{g}}_{k}},\ \; {\mathsf{Var}\left[{{\boldsymbol{\gamma}}_{k}}\right]}={{{\boldsymbol{Q}}}_{k}}. \end{gathered} $$
(3.26)

The process noise γ k may affect only a subset of the state vector, in which case its covariance matrix Q k is not of full rank.

In most cases, only some or even none of the components of the state vector can be observed directly. Instead, the observations are functions of the state plus an observation error. In the simplest case, this function is again affine:

$$\displaystyle \begin{gathered}{} {{\boldsymbol{m}}_{k}}={{{\boldsymbol{H}}}_{k}}{{\boldsymbol{q}}_{k}}+{{\boldsymbol{c}}_{k}}+{{\boldsymbol{\varepsilon}}_{k}},\ \;{\mathsf{E}\left[{{\boldsymbol{\varepsilon}}_{k}}\right]}=\mathbf{0}, \ \; {\mathsf{Var}\left[{{\boldsymbol{\varepsilon}}_{k}}\right]}={{{\boldsymbol{V}}}_{ k}}={{{\boldsymbol{G}}}_{k}}{{}^{-1}}. \end{gathered} $$
(3.27)

Process noise and observation errors are assumed to be independent.

Given observations m 1, …, m n, an initial state q 0 and an initial state covariance matrix C 0, all states can be estimated recursively by the Kalman filter [10]. Assume there is an estimated state vector \({{\boldsymbol {\tilde {q}}}}_{k-1}\) with its covariance matrix C k−1 at instant k − 1. The estimated state vector at instant k is obtained by a prediction step followed by an update step. After the last update step, the full information contained in all observations can be propagated back to all previous states by the smoother. If both process noise and observation errors are normally distributed, the Kalman filter is the optimal filter; if not, it is still the best linear filter.

Prediction step

The state vector and its covariance matrix are propagated to the next instance using the system equation Eq. (3.26) and linear error propagation:

$$\displaystyle \begin{gathered} {{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}={\boldsymbol{F}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}{{\boldsymbol{\tilde{q}}}}_{k-1}+{{\boldsymbol{d}}_{k}}+{{\boldsymbol{g}}_{k}},\ \; {{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}}={\boldsymbol{F}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}{{{\boldsymbol{C}}}_{ k-1}}{\boldsymbol{F}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}{{}^{\mathsf{T}}}+{{{\boldsymbol{Q}}}_{k}}.{} \end{gathered} $$
(3.28)

Update step

The updated state vector is the weighted mean of the predicted state vector and the observation m k. There are two equivalent ways to compute the update. The first one uses the gain matrix K k:

$$\displaystyle \begin{aligned} {{\boldsymbol{\tilde{q}}}}_{k}&={{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}+{{{\boldsymbol{K}}}_{k}}\hspace{0.5pt}({{\boldsymbol{m}}_{k}}-{{\boldsymbol{c}}_{k}}-{{{\boldsymbol{H}}}_{k}}{{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}), \ \; {{{\boldsymbol{C}}}_{ k}}=({{\boldsymbol{I}}}-{{{\boldsymbol{K}}}_{k}}{{{\boldsymbol{H}}}_{k}})\hspace{0.5pt}{{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}},{} \end{aligned} $$
(3.29)
$$\displaystyle \begin{aligned} {{{\boldsymbol{K}}}_{k}}&={{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}}{{{\boldsymbol{H}}}_{k}}{{}^{\mathsf{T}}}({{{\boldsymbol{V}}}_{ k}}+{{{\boldsymbol{H}}}_{k}}{{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}}{{{\boldsymbol{H}}}_{k}}{{}^{\mathsf{T}}}){{}^{-1}}.{} \end{aligned} $$
(3.30)

The second one is a multivariate weighted mean:

$$\displaystyle \begin{aligned} {{\boldsymbol{\tilde{q}}}}_{k}&={{{\boldsymbol{C}}}_{ k}}\hspace{0.5pt}\left[{{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}}{{}^{-1}}{{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}+{{{\boldsymbol{H}}}_{k}}{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}_{k}}\hspace{0.5pt}({{\boldsymbol{m}}_{k}}-{{\boldsymbol{c}}_{k}})\right],{} \end{aligned} $$
(3.31)
$$\displaystyle \begin{aligned} {{{\boldsymbol{C}}}_{ k}}&=({{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}}{{}^{-1}}+{{{\boldsymbol{H}}}_{k}}{{}^{\mathsf{T}}}{{{\boldsymbol{G}}}_{k}}{{{\boldsymbol{H}}}_{k}}){{}^{-1}}.{} \end{aligned} $$
(3.32)

The update step has an associated chi-square statistic \({\chi ^2_k}\), which can be computed from the predicted residual r k | k−1 or from the updated residual r k:

$$\displaystyle \begin{aligned} {\boldsymbol{r}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}&={{\boldsymbol{m}}_{k}}-{{\boldsymbol{c}}_{k}}-{{{\boldsymbol{H}}}_{k}}{{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}},\ {{\boldsymbol{R}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}={\mathsf{Var}\left[{\boldsymbol{r}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}\right]}={{{\boldsymbol{V}}}_{ k}}+{{{\boldsymbol{H}}}_{k}}{{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}}{{{\boldsymbol{H}}}_{k}}{{}^{\mathsf{T}}}, \end{aligned} $$
(3.33)
$$\displaystyle \begin{aligned} {\boldsymbol{r}}_{k}&={{\boldsymbol{m}}_{k}}-{{\boldsymbol{c}}_{k}}-{{{\boldsymbol{H}}}_{k}}{{\boldsymbol{\tilde{q}}}}_{k},\ {{\boldsymbol{R}}}_{k}={\mathsf{Var}\left[{\boldsymbol{r}}_{k}\right]}={{{\boldsymbol{V}}}_{ k}}-{{{\boldsymbol{H}}}_{k}}{{{\boldsymbol{C}}}_{ k}}{{{\boldsymbol{H}}}_{k}}{{}^{\mathsf{T}}}, \end{aligned} $$
(3.34)
$$\displaystyle \begin{aligned} {\chi^2}_k&={\boldsymbol{r}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}{{}^{\mathsf{T}}}{{\boldsymbol{R}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}{{}^{-1}}\hspace{0.5pt}{\boldsymbol{r}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}={\boldsymbol{r}}_{k}{{}^{\mathsf{T}}}{{\boldsymbol{R}}}_{k}{{}^{-1}}\hspace{0.5pt}{\boldsymbol{r}}_{k}.{} \end{aligned} $$
(3.35)

If the model is correctly specified and both process noise and observation errors are normally distributed, \({\chi ^2_k}\) is χ 2-distributed with a number of degrees of freedom equal to the dimension of m k. The total chi-square \({\chi ^2_{\mathrm {tot}}}\) of the filter is obtained by summing \({\chi ^2_k}\) over all k.
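The prediction and gain-matrix update steps, Eqs. (3.28)–(3.30), can be sketched for an invented one-dimensional constant-velocity model; the matrices, noise levels and the vague initial state are assumptions of the example:

```python
import numpy as np

rng = np.random.default_rng(3)

# State q = (position, velocity); only the position is observed
dt = 1.0
Fk = np.array([[1.0, dt], [0.0, 1.0]])   # system matrix F_{k|k-1}
Q = 1e-4 * np.eye(2)                     # process-noise covariance
H = np.array([[1.0, 0.0]])               # measurement matrix
V = np.array([[0.25]])                   # observation-error covariance

# Simulate a track and noisy position measurements
n = 50
q_true = np.zeros((n + 1, 2))
q_true[0] = [0.0, 1.0]
for k in range(1, n + 1):
    q_true[k] = Fk @ q_true[k - 1]
ms = q_true[1:, 0] + rng.normal(0.0, 0.5, size=n)

# Kalman filter with a vague initial state
q, C = np.array([0.0, 0.0]), 100.0 * np.eye(2)
for k in range(n):
    q = Fk @ q                                    # prediction, Eq. (3.28)
    C = Fk @ C @ Fk.T + Q
    r = ms[k] - (H @ q)[0]                        # predicted residual
    S = V + H @ C @ H.T                           # its covariance
    K = C @ H.T @ np.linalg.inv(S)                # gain matrix, Eq. (3.30)
    q = q + K.ravel() * r                         # update, Eq. (3.29)
    C = (np.eye(2) - K @ H) @ C
```

After processing all measurements, the filtered state is close to the true final state, and C is its estimated covariance matrix.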

Smoothing

The smoother propagates the full information contained in the last estimate \({{\boldsymbol {\tilde {q}}}}_{n}\) back to all previous states. There are again two equivalent formulations. The first one uses the gain matrix A k of the smoother:

$$\displaystyle \begin{aligned} {{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}&={{\boldsymbol{\tilde{q}}}}_{k}+{{{\boldsymbol{A}}}_{k}}({{\boldsymbol{\tilde{q}}}}_{k+1{\hspace{0.5pt}|\hspace{0.5pt}}{}n}-{{\boldsymbol{\tilde{q}}}}_{k+1{\hspace{0.5pt}|\hspace{0.5pt}}{}k}), \ \;{{{\boldsymbol{A}}}_{k}}={{{\boldsymbol{C}}}_{ k}}{{\boldsymbol{F}}_{k+1}}{{}^{\mathsf{T}}}{{{\boldsymbol{C}}}_{ k+1{\hspace{0.5pt}|\hspace{0.5pt}}{}k}}{{}^{-1}},{} \end{aligned} $$
(3.36)
$$\displaystyle \begin{aligned} {{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}}&={{{\boldsymbol{C}}}_{ k}}-{{{\boldsymbol{A}}}_{k}}({{{\boldsymbol{C}}}_{ k+1{\hspace{0.5pt}|\hspace{0.5pt}}{}k}}-{{{\boldsymbol{C}}}_{ k+1{\hspace{0.5pt}|\hspace{0.5pt}}{}n}})\hspace{0.5pt}{{{\boldsymbol{A}}}_{k}}{{}^{\mathsf{T}}}{}. \end{aligned} $$
(3.37)

This formulation is numerically unstable, as the difference of the two positive definite matrices in Eq. (3.37) can fail to be positive definite as well because of rounding errors. The second, numerically stable formulation realizes the smoother by running two filters, one forward and one backward, on the same sets of observations. The smoothed state is a weighted mean of the states of the two filters:

$$\displaystyle \begin{gathered} {{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}={{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}}\left[{{{\boldsymbol{C}}}_{ k}}{{}^{-1}}{{\boldsymbol{\tilde{q}}}}_{k}+\left({{{\boldsymbol{C}}}^{\,\mathrm{b}}_{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k+1}}\right){{}^{-1}}{{\boldsymbol{\tilde{q}}}}{}^{\,\mathrm{b}}_{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k+1}\right], \ \; {{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}}{{}^{-1}}={{{\boldsymbol{C}}}_{ k}}{{}^{-1}}+\left({{{\boldsymbol{C}}}^{\,\mathrm{b}}_{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k+1}}\right){{}^{-1}},{} \end{gathered} $$
(3.38)

where \({{\boldsymbol {\tilde {q}}}}{ }^{\,\mathrm {b}}_{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k+1}\) is the predicted state from the backward filter and \({{{\boldsymbol {C}}}^{\,\mathrm {b}}_{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k+1}}\) its covariance matrix. Alternatively, the predicted state from the forward filter and the updated state from the backward filter can be combined. The smoother step has an associated chi-square statistic \({\chi ^2_{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}\), which can be computed from the smoothed residuals:

$$\displaystyle \begin{aligned} {\boldsymbol{r}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}&={{\boldsymbol{m}}_{k}}-{{\boldsymbol{c}}_{k}}-{{{\boldsymbol{H}}}_{k}}{{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}},\ {{\boldsymbol{R}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}={\mathsf{Var}\left[{\boldsymbol{r}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}\right]}={{{\boldsymbol{V}}}_{ k}}-{{{\boldsymbol{H}}}_{k}}{{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}}{{{\boldsymbol{H}}}_{k}}{{}^{\mathsf{T}}},{} \end{aligned} $$
(3.39)
$$\displaystyle \begin{aligned} {\chi^2}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}&={\boldsymbol{r}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}{{}^{\mathsf{T}}}{{\boldsymbol{R}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}{{}^{-1}}{\boldsymbol{r}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}.{} \end{aligned} $$
(3.40)

If the model is correctly specified and both process noise and observation errors are normally distributed, \({\chi ^2_{k{\hspace{0.5pt}|\hspace{0.5pt}}{}n}}\) is χ 2-distributed with a number of degrees of freedom equal to the dimension of m k. As the smoothed state vectors are all estimated from the same information, they are mutually correlated. The prescription for computing their joint covariance matrix is given in [11].

The Kalman filter can also be implemented as an information filter and as a square-root filter [10].

2.3.2 Nonlinear State Space Models and the Extended Kalman Filter

In the applications of the Kalman filter to track fitting, see Chap. 6, the system equation is usually nonlinear; see Sect. 4.3 and Fig. 4.4. It has the following form:

$$\displaystyle \begin{gathered}{} {{\boldsymbol{q}}_{k}}={\boldsymbol{f}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}({{\boldsymbol{q}}_{k-1}})+{{\boldsymbol{\gamma}}_{k}},\ \; {\mathsf{E}\left[{{\boldsymbol{\gamma}}_{k}}\right]}={{\boldsymbol{g}}_{k}},\ \; {\mathsf{Var}\left[{{\boldsymbol{\gamma}}_{k}}\right]}={{{\boldsymbol{Q}}}_{k}}. \end{gathered} $$
(3.41)

In the prediction step, the exact linear error propagation is replaced by an approximate linearized error propagation, using the Jacobian of the system function; see also Sect. 4.4 and Fig. 4.5:

$$\displaystyle \begin{gathered} {{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}={\boldsymbol{f}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}({{\boldsymbol{\tilde{q}}}}_{k-1})+{{\boldsymbol{g}}_{k}},\ \; {{{\boldsymbol{C}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}}={\boldsymbol{F}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}{{{\boldsymbol{C}}}_{ k-1}}{\boldsymbol{F}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}{{}^{\mathsf{T}}}+{{{\boldsymbol{Q}}}_{k}},\ \;\mathrm{with}\ \; {\boldsymbol{F}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}=\frac{\partial{\boldsymbol{f}}_{k{\hspace{0.5pt}|\hspace{0.5pt}} k-1}}{\partial{{\boldsymbol{q}}}}\bigg|{}_{{{\boldsymbol{q}}}={{\boldsymbol{\tilde{q}}}}_{k-1}}. \end{gathered} $$
(3.42)

More rarely, the measurement equation can be nonlinear as well:

$$\displaystyle \begin{gathered}{} {{\boldsymbol{m}}_{k}}={{{\boldsymbol{h}}}_{k}}({{\boldsymbol{q}}_{k}})+{{\boldsymbol{\varepsilon}}_{k}},\ \;{\mathsf{E}\left[{{\boldsymbol{\varepsilon}}_{k}}\right]}=\mathbf{0}, \ \; {\mathsf{Var}\left[{{\boldsymbol{\varepsilon}}_{k}}\right]}={{{\boldsymbol{V}}}_{ k}}={{{\boldsymbol{G}}}_{k}}{{}^{-1}}. \end{gathered} $$
(3.43)

The first-order Taylor expansion of h k at \({{\boldsymbol {\tilde {q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}\) gives:

$$\displaystyle \begin{gathered} {{{\boldsymbol{h}}}_{k}}({{\boldsymbol{q}}_{k}})\approx{{{\boldsymbol{H}}}_{k}}{{\boldsymbol{q}}_{k}}+{{\boldsymbol{c}}_{k}},\ \; {{\boldsymbol{c}}_{k}}={{{\boldsymbol{h}}}_{k}}({{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}})-{{{\boldsymbol{H}}}_{k}}{{\boldsymbol{\tilde{q}}}}_{{k{\hspace{0.5pt}|\hspace{0.5pt}}{}k-1}}. \end{gathered} $$
(3.44)

Using this H k and c k in Eqs. (3.29) and (3.30) or Eqs. (3.31) and (3.32) gives the update equations for the nonlinear measurement equation. If required, the update can be iterated by re-expanding h k at \({{\boldsymbol {\tilde {q}}}}_{k}\) and recomputing H k, c k and K k. The smoother can be implemented via its gain matrix, Eqs. (3.36) and (3.37), or by the backward filter, Eq. (3.38).

3 Clustering

Clustering is the classification of objects into groups, such that similar objects end up in the same group. The similarity of objects can be summarized in a distance/similarity matrix containing the pair-wise distances/similarities of the objects to be clustered. A cluster is then a group with small distances/large similarities inside the group and large distances/small similarities to objects outside the group. After clustering, the resulting cluster structure should be validated by measuring the internal consistency of each cluster. For further information and examples, the reader is directed to the literature [12,13,14].

3.1 Hierarchical Clustering

Hierarchical clustering groups the objects with a sequence of nested partitions [12, Chapter 3]. If the sequence starts from single-object clusters and merges them into larger clusters, the clustering is called agglomerative; if the sequence starts from a single cluster containing all objects and splits it into successively smaller clusters, the clustering is called divisive. At any stage of the clustering, all clusters are pairwise disjoint. The number of clusters is not necessarily specified in advance, but can be determined “on the fly”. If in divisive clustering a cluster is considered to be valid, further splitting is not required, but also not forbidden. If in agglomerative clustering the merging of two clusters results in an invalid cluster, the merger is undone.
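Agglomerative clustering is available, for instance, in scipy.cluster.hierarchy; a small sketch on invented data with two well-separated groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)

# Two well-separated groups of points in the plane
pts = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                 rng.normal(5.0, 0.3, size=(20, 2))])

# Agglomerative clustering: the sequence of merges is stored in Z
Z = linkage(pts, method='single')                 # single-link merge criterion
labels = fcluster(Z, t=2, criterion='maxclust')   # cut into two clusters
```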

3.2 Partitional Clustering

Partitional clustering directly divides the objects into a predefined number K of clusters, usually by optimizing an objective function that describes the global quality of the cluster structure [12, Chapter 4]. Partitional clustering can be repeated for several values of K in order to find the optimal cluster structure. In the fuzzy variant of partitional clustering, an object can belong to more than one cluster with a certain degree of membership [15].
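A classic instance of partitional clustering is the k-means algorithm, which for fixed K minimizes the within-cluster sum of squared distances; a minimal sketch on invented data:

```python
import numpy as np

rng = np.random.default_rng(5)

# Two separated groups of points; K = 2 clusters
pts = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
                 rng.normal(4.0, 0.3, size=(30, 2))])

def kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm for partitional clustering with K clusters."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assign every object to its nearest cluster center ...
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # ... then move each center to the mean of its assigned objects
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans(pts, K=2)
```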

3.3 Model-Based Clustering

In model-based clustering, the objects are assumed to be drawn from a mixture distribution with two or more components. Each component is described by a PDF and has an associated probability or “weight” in the mixture [16]. The parameters of the mixture are usually estimated by the Expectation-Maximization (EM) algorithm [17,18,19]; see Sect. 7.2.2. A by-product of the EM algorithm is the posterior probability π ik of object i belonging to cluster k, for all objects and all clusters. The result is again a fuzzy clustering.
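A sketch of the EM algorithm for a two-component one-dimensional Gaussian mixture; the data and the initialization are invented, and the array resp holds the posterior probabilities π ik mentioned above:

```python
import numpy as np

rng = np.random.default_rng(6)

# Sample from a two-component Gaussian mixture with weights 0.3 and 0.7
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

def em_gmm(x, n_iter=200):
    """EM estimation of weights, means and variances of a 1D two-component mixture."""
    w = np.array([0.5, 0.5])                 # mixture weights
    mu = np.array([x.min(), x.max()])        # crude initialization of the means
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E step: posterior probability of each point for each component
        pdf = (np.exp(-0.5 * (x[:, None] - mu)**2 / var)
               / np.sqrt(2.0 * np.pi * var))
        resp = w * pdf
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: re-estimate weights, means and variances
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu)**2).sum(axis=0) / nk
    return w, mu, var, resp

w, mu, var, resp = em_gmm(x)
```

Thresholding or ranking the rows of resp turns the fuzzy result into a hard partition if one is required.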