Kálmán filters are a ubiquitous technique for state-space estimation from multiple noisy measurements, and are used in fields as diverse as robotics, signal processing, and econometrics. In particle physics they are most commonly used as a method to incorporate kinematical constraints and detector-material interactions when estimating the particle track state from clustered hits in tracking stations. As such, Kálmán filters often form the basis of event reconstruction algorithms.
Recent emphasis on complete online processing of full events motivates the need for more efficient reconstruction algorithms. In particular, from Run 3 of the LHC, the LHCb experiment intends to perform full event reconstruction at 30 MHz in the high-level trigger, to exploit the efficiency gain from performing analysis-level selections earlier in the pipeline. The execution speed of this reconstruction, of which the Kálmán filter is a dominant contributor [67], is therefore strictly limited from a cost-performance perspective.
As many of these operations are inherently parallelisable, implementation of the reconstruction and track filtering on graphics processing units (GPUs) shows good promise, and is potentially a more cost-effective alternative to CPUs. Nevertheless, as GPUs are generally designed as single-instruction multiple-data processors, they lack many features that are found in CPUs, such as support for conditional program flow, large caches, and fast interconnects between the compute cores.
Kálmán Filter Formalism
Kálmán filters recursively compute closed-form least-squares estimates of the state and its covariance matrix, under two assumptions: that all uncertainties are well described by multidimensional normal distributions, and that only linear relations exist between the state at step t and the state at step \(t + 1\), and between the state and the measurement. The application of a Kálmán filter can be broken down into three stages: a prediction (or projection) stage, where the state at step t is projected linearly to a state at step \(t + 1\); a filtering stage, where the state at step \(t + 1\) is corrected using the measurement at step \(t + 1\) and its covariance matrix; and a smoothing stage after all filtering steps, where state and covariance matrix updates are propagated backwards through the states to achieve a globally optimal configuration. The formulation here follows that of Refs. [68, 69] (Fig. 9).
The first projection step is described by a set of recurrence relations that extrapolate the state described by a vector \(\mathbf {p}\) at step t to the values at step \(t + 1\), given by
$$\begin{aligned} \mathbf {p}_{t + 1, \text {proj}} = \mathbf {F}_t \mathbf {p}_t, \end{aligned}$$
(1)
with the covariance matrix of \(\mathbf {p}\) given by \(\mathbf {C}\), where
$$\begin{aligned} \mathbf {C}_{t + 1, \text {proj}} = \mathbf {F}_t \mathbf {C}_t \mathbf {F}_t^{\top } + \mathbf {Q}_t. \end{aligned}$$
(2)
These relations are expressed in terms of the transfer matrix \(\mathbf {F}_t\) and the random error matrix \(\mathbf {Q}_t\). The expression in Eq. 1 uses the underlying modelling assumptions (in the case of this particular track reconstruction, simple kinematics) that generate \(\mathbf {p}_{t + 1}\) from \(\mathbf {p}_t\) via the application of the linear operator \(\mathbf {F}_t\). The error matrix \(\mathbf {Q}_t\) contains the process noise, that is, terms that describe additive errors to the estimated state, such as those that are picked up after each propagation step from material interactions.
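The projection in Eqs. 1 and 2 amounts to two matrix products. A minimal NumPy sketch is given here; the function name `project` and all numerical values are illustrative assumptions, and a straight-line transfer matrix with unit step length and no process noise is used purely so that the result is easy to check by hand.

```python
import numpy as np

def project(p, C, F, Q):
    """Project state p and covariance C one step forward (Eqs. 1 and 2)."""
    p_proj = F @ p
    C_proj = F @ C @ F.T + Q
    return p_proj, C_proj

# Illustrative inputs (assumed values, not taken from the study).
d = 1.0  # step length along z
F = np.array([[1.0, d, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, d],
              [0.0, 0.0, 0.0, 1.0]])
p = np.array([0.10, 0.02, -0.20, 0.01])  # x, tan(theta), y, tan(phi)
C = np.eye(4)                            # unit covariance before the step
Q = np.zeros((4, 4))                     # no process noise in this check

p_proj, C_proj = project(p, C, F, Q)
print(p_proj)
print(C_proj)
```

With these inputs the positions advance by one step length times the slope, and the position variance grows because the slope uncertainty feeds into it.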
At step \(t + 1\), the prediction from step t to \(t + 1\), \(\mathbf {p}_{t + 1, \text {proj}}\), is updated using the measurements at \(t + 1\), \(\mathbf {m}_{t + 1}\). The relation between the measurement \(\mathbf {m}\) and the state \(\mathbf {p}\) is given by \(\mathbf {H}\) (which in general is independent of t), and the updated filtered expectation of \(\mathbf {p}_{t + 1}\) becomes
$$\begin{aligned} \mathbf {p}_{t + 1, \text {filt}} = \mathbf {C}_{t + 1, \text {filt}} \left[ \mathbf {C}^{-1}_{t + 1, \text {proj}} \mathbf {p}_{t + 1, \text {proj}} + \mathbf {H}^{\top } \mathbf {G}_{t + 1} \mathbf {m}_{t + 1} \right] , \end{aligned}$$
(3)
where
$$\begin{aligned} \mathbf {C}_{t + 1, \text {filt}} = \left[ \mathbf {C}^{-1}_{t + 1, \text {proj}} + \mathbf {H}^{\top } \mathbf {G}_{t + 1} \mathbf {H} \right] ^{-1} \end{aligned}$$
(4)
is the corresponding covariance matrix. Here, \(\mathbf {G}_t\) is the weight matrix corresponding to the measurement noise, such as the detector resolution, at step t.
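As a sketch of the filtering step, the update can be written in the weighted-mean form, in which the filtered covariance is the inverse of the sum of the inverse projected covariance and \(\mathbf {H}^{\top } \mathbf {G} \mathbf {H}\). The function name `filter_step` and all numerical values below are assumptions; unit prediction and measurement variances are chosen so that the filtered position should lie midway between the prediction and the measurement.

```python
import numpy as np

def filter_step(p_proj, C_proj, m, H, G):
    """Combine the projected state with a measurement (weighted-mean form)."""
    C_proj_inv = np.linalg.inv(C_proj)
    C_filt = np.linalg.inv(C_proj_inv + H.T @ G @ H)
    p_filt = C_filt @ (C_proj_inv @ p_proj + H.T @ G @ m)
    return p_filt, C_filt

# Illustrative inputs (assumed values): only x and y are measured.
H = np.diag([1.0, 0.0, 1.0, 0.0])
G = np.diag([1.0, 0.0, 1.0, 0.0])        # 1 / sigma^2 with sigma = 1
p_proj = np.array([0.0, 0.1, 0.0, -0.1])
C_proj = np.eye(4)
m = np.array([1.0, 0.0, 2.0, 0.0])       # measurement padded to four entries

p_filt, C_filt = filter_step(p_proj, C_proj, m, H, G)
print(p_filt)
print(np.diag(C_filt))
```

The unmeasured slope components keep their projected values and variances, while the position variances are halved. Explicit inverses are used here for readability only; a performance-oriented implementation would more likely solve the corresponding linear systems.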
Up to this point, all information is updated in the forward direction. However, downstream information can also be used to update upstream state estimates, to obtain a globally optimal set of states. To do this propagation, a backward transport operator is defined as
$$\begin{aligned} \mathbf {A}_t = \mathbf {C}_{t, \text {filt}} \mathbf {F}_t^{\top } \mathbf {C}^{-1}_{t + 1, \text {proj}}, \end{aligned}$$
(5)
which is used to perform the smoothing step in the backward direction and update the state
$$\begin{aligned} \mathbf {p}_{t, \text {smooth}} = \mathbf {p}_{t, \text {filt}} + \mathbf {A}_t( \mathbf {p}_{t + 1, \text {smooth}} - \mathbf {p}_{t + 1, \text {proj}}), \end{aligned}$$
(6)
and covariance matrix
$$\begin{aligned} \mathbf {C}_{t, \text {smooth}} = \mathbf {C}_{t, \text {filt}} + \mathbf {A}_t( \mathbf {C}_{t + 1, \text {smooth}} - \mathbf {C}_{t + 1, \text {proj}}) \mathbf {A}_t^{\top }, \end{aligned}$$
(7)
at t using the now smoothed state and covariance matrix at \(t + 1\).
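A single backward smoothing step following Eqs. 5 to 7 can be sketched as below. The function name `smooth_step` and the inputs are assumptions, chosen so that the backward transport operator reduces to \(\mathbf {F}^{-1}\) and the result can be verified analytically.

```python
import numpy as np

def smooth_step(p_filt, C_filt, p_proj_next, C_proj_next,
                p_smooth_next, C_smooth_next, F):
    """Update the state at step t from the smoothed state at step t + 1."""
    A = C_filt @ F.T @ np.linalg.inv(C_proj_next)            # Eq. 5
    p_smooth = p_filt + A @ (p_smooth_next - p_proj_next)    # Eq. 6
    C_smooth = C_filt + A @ (C_smooth_next - C_proj_next) @ A.T  # Eq. 7
    return p_smooth, C_smooth

# Illustrative inputs (assumed values, no process noise).
d = 1.0
F = np.array([[1.0, d, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, d],
              [0.0, 0.0, 0.0, 1.0]])
p_filt = np.zeros(4)
C_filt = np.eye(4)
p_proj_next = F @ p_filt
C_proj_next = F @ C_filt @ F.T
# Pretend the downstream information halved the covariance and shifted x.
p_smooth_next = np.array([1.0, 0.0, 0.0, 0.0])
C_smooth_next = 0.5 * C_proj_next

p_smooth, C_smooth = smooth_step(p_filt, C_filt, p_proj_next, C_proj_next,
                                 p_smooth_next, C_smooth_next, F)
print(p_smooth)
print(C_smooth)
```

If the smoothed and projected quantities at \(t + 1\) coincide, both correction terms vanish and the filtered state at t is returned unchanged.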
The covariance matrix can also be used to form a \(\chi ^2\) test statistic to determine the consistency of a hit with the fitted track,
$$\begin{aligned} \chi ^2_t = \mathbf {r}_t^{\top } \mathbf {G}_t \mathbf {r}_t + (\mathbf {p}_{t, \text {filt}} - \mathbf {p}_{t, \text {proj}})^{\top } \mathbf {C}_{t,\text {proj}}^{-1} (\mathbf {p}_{t, \text {filt}} - \mathbf {p}_{t, \text {proj}}), \end{aligned}$$
(8)
where \(\mathbf {r}_t\) is the residual,
$$\begin{aligned} \mathbf {r}_t = \mathbf {m}_t - \mathbf {H} \mathbf {p}_{t, \text {filt}}. \end{aligned}$$
(9)
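The \(\chi ^2\) contribution of a single hit can be evaluated directly from the filtered and projected states. In the sketch below the function name `chi2_increment` and the inputs are assumptions; the filtered state is the one obtained for unit prediction and measurement variances, so both terms can be checked by hand.

```python
import numpy as np

def chi2_increment(p_filt, p_proj, C_proj, m, H, G):
    """Chi-square of one hit: measurement term plus state-pull term."""
    r = m - H @ p_filt            # residual of the filtered state
    dp = p_filt - p_proj          # shift applied by the filter
    return r @ G @ r + dp @ np.linalg.inv(C_proj) @ dp

# Illustrative inputs (assumed values).
H = np.diag([1.0, 0.0, 1.0, 0.0])
G = np.diag([1.0, 0.0, 1.0, 0.0])
p_proj = np.array([0.0, 0.1, 0.0, -0.1])
C_proj = np.eye(4)
m = np.array([1.0, 0.0, 2.0, 0.0])
p_filt = np.array([0.5, 0.1, 1.0, -0.1])   # midway between p_proj and m

chi2 = chi2_increment(p_filt, p_proj, C_proj, m, H, G)
print(chi2)
```

Each term contributes 1.25 here, giving 2.5 in total. This agrees with the equivalent form based on the residual of the projected state, \(1^2/2 + 2^2/2\), where the denominators are the sums of the prediction and measurement variances.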
Kálmán Filter Configuration
To investigate the performance characteristics of a Kálmán filter implemented in Poplar on the IPU, a tracker with 2D active planes of 1 m \(\times\) 1 m in \(\hat{x}-\hat{y}\) is considered, separated by a homogeneous inactive medium that induces multiple scattering. Five of these planes are used, separated in \(\hat{z}\) by \(d = 1\) m of the inactive medium, and indexed by t. Each of these detector planes records measured track hits, \(\mathbf {m} = \{m_x, m_y\}\), discretised according to the physical resolution of the detector planes, \(\sigma\).
No magnetic field is considered; its inclusion would only require a minor modification of the track state (to infer momentum) and the inclusion of the magnetic field description in \(\mathbf {F}\). It is assumed initially that each track registers a hit on each of the five planes, and that the matching of hits to tracks is perfect. In reality, dummy hits can be introduced to the tracking algorithms, and tracks are often post-processed to find the most likely set, so neither of these assumptions compromises the generality of this proof of principle.
A state vector, \(\mathbf {p}_t = \{x_t, \tan {\theta _t}, y_t, \tan {\phi _t}\}\), corresponding to the most likely values of the track x-position, \(x_t\); y-position, \(y_t\); tangent of the track slope in \(\hat{x}-\hat{z}\), \(\tan {\theta _t}\); and tangent of the track slope in \(\hat{y}-\hat{z}\), \(\tan {\phi _t}\); is estimated at each plane, t. It follows that the model parameters for such a system are
$$\begin{aligned} \mathbf {F}&= \begin{bmatrix} 1 &{} d &{} 0 &{} 0 \\ 0 &{} 1 &{} 0 &{} 0 \\ 0 &{} 0 &{} 1 &{} d \\ 0 &{} 0 &{} 0 &{} 1 \\ \end{bmatrix}, \quad \mathbf {G} = \begin{bmatrix} 1/\sigma ^2 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 1/\sigma ^2 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 \\ \end{bmatrix},\end{aligned}$$
(10)
$$\begin{aligned} \mathbf {H}&= \begin{bmatrix} 1 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 1 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 \\ \end{bmatrix}, \quad \mathbf {Q} = \begin{bmatrix} z_0^2\theta _0^2 &{} z_0\theta _0^2 &{} z_0^2\theta _0^2 &{} z_0\theta _0^2 \\ z_0\theta _0^2 &{} \theta _0^2 &{} z_0\theta _0^2 &{} \theta _0^2 \\ z_0^2\theta _0^2 &{} z_0\theta _0^2 &{} z_0^2\theta _0^2 &{} z_0\theta _0^2 \\ z_0\theta _0^2 &{} \theta _0^2 &{} z_0\theta _0^2 &{} \theta _0^2 \\ \end{bmatrix}, \end{aligned}$$
(11)
where the parameterisation for \(\mathbf {Q}\) is obtained from Ref. [70] disregarding higher order terms in the track slopes; \(z_0\) is the material depth; and \(\theta _0^2\) is the variance of the multiple scattering angle.
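The model matrices of Eqs. 10 and 11 can be assembled as in the following sketch. The function name `model_matrices` and the numerical values of \(\sigma\), \(z_0\) and \(\theta _0\) are assumptions for illustration only. The matrix \(\mathbf {Q}\) as written above is the outer product of the vector \(\{z_0\theta _0, \theta _0, z_0\theta _0, \theta _0\}\) with itself, which is how it is built here.

```python
import numpy as np

def model_matrices(d, sigma, z0, theta0):
    """Return F, G, H, Q for the field-free five-plane tracker."""
    F = np.array([[1.0, d, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, d],
                  [0.0, 0.0, 0.0, 1.0]])
    H = np.diag([1.0, 0.0, 1.0, 0.0])        # only x and y are measured
    G = H / sigma**2                         # weights 1 / sigma^2 on x and y
    v = np.array([z0 * theta0, theta0, z0 * theta0, theta0])
    Q = np.outer(v, v)                       # reproduces every entry of Eq. 11
    return F, G, H, Q

# Hypothetical parameter values, in metres and radians.
F, G, H, Q = model_matrices(d=1.0, sigma=0.01, z0=0.5, theta0=0.002)
print(Q)
```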
The initial state for the first projection step is set to be equal to the hits on the first plane, \(\mathbf {p}_{0,\text {proj}} = \{ m_{0,x}, 0, m_{0, y}, 0 \}\), and the covariance matrix set to equal the full uncertainty on the track state,
$$\begin{aligned} \mathbf {C}_{0, \text {proj}} = \begin{bmatrix} (\varDelta x) ^2 &{} 0 &{} 0 &{} 0 \\ 0 &{} (\varDelta \tan \theta ) ^2 &{} 0 &{} 0 \\ 0 &{} 0 &{} (\varDelta y) ^2 &{} 0 \\ 0 &{} 0 &{} 0 &{} (\varDelta \tan \phi ) ^2 \\ \end{bmatrix}, \end{aligned}$$
(12)
where \(\varDelta x = \varDelta y = 1\) m, and \(\varDelta \tan \theta = \varDelta \tan \phi = 1\).
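To illustrate how the initial state seeds the forward pass, the sketch below filters five noise-free hits of a straight track. Several choices are assumptions made for this check only: the initial covariance is taken as diagonal with unit uncertainties so that it is invertible, multiple scattering is switched off (\(\mathbf {Q} = 0\)), the first plane is placed at \(z = d\), and the values of \(\sigma\) and the track slopes are arbitrary.

```python
import numpy as np

d, sigma = 1.0, 0.01        # plane spacing and resolution in metres (sigma assumed)
F = np.array([[1.0, d, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, d],
              [0.0, 0.0, 0.0, 1.0]])
H = np.diag([1.0, 0.0, 1.0, 0.0])
G = H / sigma**2
Q = np.zeros((4, 4))        # scattering switched off for this check

# Noise-free hits of a straight track from the origin (assumed slopes),
# padded to four entries to match the 4 x 4 form of H.
tan_theta, tan_phi = 0.10, -0.05
z = d * np.arange(1, 6)
hits = [np.array([tan_theta * zi, 0.0, tan_phi * zi, 0.0]) for zi in z]

# Initial state from the first hit, with zero slopes and unit uncertainties.
p = np.array([hits[0][0], 0.0, hits[0][2], 0.0])
C = np.diag([1.0, 1.0, 1.0, 1.0])

for t, m in enumerate(hits):
    C_inv = np.linalg.inv(C)
    C = np.linalg.inv(C_inv + H.T @ G @ H)     # filtered covariance
    p = C @ (C_inv @ p + H.T @ G @ m)          # filtered state
    if t < len(hits) - 1:                      # project to the next plane
        p, C = F @ p, F @ C @ F.T + Q

print(p)
```

After the last plane the estimated slopes are close to the generated ones, since the broad initial slope uncertainty carries little weight compared with five precise position measurements.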
In this study, simulated particles are produced at (0, 0, 0) and travel in the positive \(\hat{z}\) direction towards the detector planes. At each plane, the particle interacts with the active detector material according to its projection on the \(\hat{x}-\hat{y}\) plane of the detector, with a location that is subject to a random fluctuation in each direction, depending on the total path length, to simulate the effect of multiple scattering. Subsequently, the location of the hit is discretised according to the granularity of the active detector area. These two effects determine, respectively, the process noise and the measurement uncertainty that enter the Kálmán filter. An example of the simulated detector configuration can be seen in Fig. 10, with the corresponding hits and reconstructed track states.
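A toy version of this hit generation is sketched below. It is not the simulation used in the study: the Gaussian form of the fluctuation, its linear scaling with path length, the parameter names (`pitch`, `scatter`) and their default values, and the placement of the first plane at \(z = d\) are all assumptions.

```python
import numpy as np

def simulate_hits(tan_theta, tan_phi, n_planes=5, d=1.0, pitch=0.01,
                  scatter=1e-3, rng=None):
    """Generate discretised (x, y) hits for a straight track from the origin."""
    rng = np.random.default_rng() if rng is None else rng
    z = d * np.arange(1, n_planes + 1)                 # assumed plane positions
    ideal = np.column_stack([tan_theta * z, tan_phi * z])
    # Path length from the origin to each plane along the track.
    path = z * np.sqrt(1.0 + tan_theta**2 + tan_phi**2)
    # Random displacement in x and y that grows with the path length.
    smeared = ideal + scatter * path[:, None] * rng.standard_normal((n_planes, 2))
    # Snap each coordinate to the nearest multiple of the detector pitch.
    hits = np.round(smeared / pitch) * pitch
    return z, hits

z, hits = simulate_hits(0.10, -0.05, rng=np.random.default_rng(seed=1))
print(hits)
```

Setting `scatter=0` isolates the discretisation, which is convenient when checking a filter implementation against a known straight line.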
Benchmarks
The Kálmán filter described in “Kálmán Filter Configuration” is implemented for the IPU hardware using the Poplar C++ SDK. To exploit the independence of the particle tracks, each track is assigned to a single IPU tile, where all operations in “Kálmán Filter Formalism” are performed. In principle, this results in 1,216 Kálmán filter operations proceeding in parallel. However, optimal throughput is only achieved when several batches of tracks are copied to each tile initially, and then operated on sequentially. From Fig. 11, it can be seen that for batches of size greater than \(\sim 10\) tracks, almost perfect parallelism is achieved, with a peak throughput of around \(2.2\times 10^{6}\) tracks per second for this configuration.
It is interesting to study the behaviour of the IPU implementation of the Kálmán filter with a workload that relies on program branch statements and random memory accesses. To this end, a modification of the above Kálmán filter configuration is implemented, where a proportion of hits are forced to be inconsistent with tracks they have been assigned to. This results in a large value of the \(\chi ^2\) expression in Eq. 8. At each step the \(\chi ^2\) value is evaluated, and if it is above a certain threshold, the state is not updated and the previous state is propagated to the next state under the assumption that no hit was observed at this stage.
On the IPU, this is implemented by a branch statement in the vertex code, which is executed on each tile separately. By way of comparison, an equivalent Kálmán filter configuration is also implemented in TensorFlow (v2.1.0) for execution on the GPU. In TensorFlow the subsequent filtering step is modified using a conditional gather-scatter update to the state and state propagation parameters. Despite the sub-optimal TensorFlow-based GPU implementation, it is instructive to compare the relative throughput in the case where the states are conditionally modified, and the case where no conditional execution is performed. On the IPU, the reduction in peak throughput is approximately half that of the GPU: the IPU operates at \(91\%\) of its peak throughput in this case, compared to \(80\%\) for the GPU. This is likely because the conditional execution results in an inefficiency caused by divergence of parallel threads on the GPU (‘warp divergence’), whereas on the IPU these execute independently.
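The two styles of conditional update can be contrasted in a short NumPy sketch. This is neither the Poplar vertex code nor the TensorFlow implementation; the function names, the unit measurement variance and the threshold of 9 are hypothetical. `branch_update` takes a per-track branch, whereas `masked_update` evaluates the update for every track in a batch and then selects the result with a mask, in the spirit of a data-parallel conditional update.

```python
import numpy as np

H = np.diag([1.0, 0.0, 1.0, 0.0])
G = np.diag([1.0, 0.0, 1.0, 0.0])     # unit measurement variance (assumed)

def filter_with_chi2(p_proj, C_proj, m):
    """Filtered state, filtered covariance and chi-square for one hit."""
    C_inv = np.linalg.inv(C_proj)
    C_filt = np.linalg.inv(C_inv + H.T @ G @ H)
    p_filt = C_filt @ (C_inv @ p_proj + H.T @ G @ m)
    r, dp = m - H @ p_filt, p_filt - p_proj
    return p_filt, C_filt, r @ G @ r + dp @ C_inv @ dp

def branch_update(p_proj, C_proj, m, chi2_max):
    """Per-track branch: keep the projection if the hit is rejected."""
    p_filt, C_filt, chi2 = filter_with_chi2(p_proj, C_proj, m)
    if chi2 > chi2_max:
        return p_proj, C_proj
    return p_filt, C_filt

def masked_update(P_proj, C_projs, M, chi2_max):
    """Batch version: compute every update, then select with a mask."""
    results = [filter_with_chi2(p, C, m) for p, C, m in zip(P_proj, C_projs, M)]
    P_filt = np.stack([res[0] for res in results])
    C_filts = np.stack([res[1] for res in results])
    keep = np.array([res[2] for res in results]) <= chi2_max
    P_out = np.where(keep[:, None], P_filt, P_proj)
    C_out = np.where(keep[:, None, None], C_filts, C_projs)
    return P_out, C_out, keep

# Two tracks: one consistent hit and one deliberately inconsistent hit.
P_proj = np.zeros((2, 4))
C_projs = np.stack([np.eye(4), np.eye(4)])
M = np.array([[0.1, 0.0, -0.1, 0.0],
              [10.0, 0.0, 10.0, 0.0]])
P_out, C_out, keep = masked_update(P_proj, C_projs, M, chi2_max=9.0)
print(keep)
print(P_out)
```

Both functions return the same states. The difference lies in the work performed: the masked version always pays for the full update, while the branching version can skip work for rejected hits when each track runs independently.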