1 Introduction

Vertex finding is the process of dividing all or a subset of the reconstructed tracks in an event into classes such that all tracks in a class are presumably produced at the same vertex. Vertices in an event can be classified as primary vertices or secondary vertices. In a fixed-target experiment, a primary vertex is the point where a beam particle collides with a target particle; in a collider experiment, a primary vertex is the point where two beam particles collide. In the LHC experiments CMS and ATLAS, there are many primary vertices in each event, because many collisions occur in each bunch crossing. As a rule, however, at most one of these primary vertices is of interest to the subsequent analysis; this is called the signal vertex. The signal vertex is distinguished from the other primary vertices (pile-up vertices) by kinematic criteria such as the transverse momenta of the tracks forming the vertex.

A secondary vertex is the point where an unstable particle decays, or where a particle interacts with the material of the detector. As the search for secondary vertices is often based on a well-reconstructed primary vertex, this chapter deals only with primary vertex finding; secondary vertex finding is deferred to Chap. 9.

In collider experiments, the position of the beam spot and the size of the luminous region contain precise prior information about the transverse position of primary vertices, see Fig. 7.1. The prior information on the position along the beam axis is much weaker, as the primary vertices can be anywhere in a zone defined by the bunch length, which is a couple of centimeters at the LHC. In fixed-target experiments, the prior information is given by the beam profile and by the position and size of the target.

Fig. 7.1 The transverse size of the luminous region of the LHC as determined by ATLAS in 2017 [1]. Left: horizontal size; right: vertical size

Vertex finding methods can be roughly divided into three main types: generic clustering algorithms, topological methods, and iterated estimators. The latter can be considered as a special kind of model-based clustering. For a brief introduction to methods of clustering, see Sect. 3.3.

2 Primary Vertex Finding in 1D

In the LHC experiments ATLAS and CMS, there are many beam–beam interactions during a single bunch crossing at the typical luminosity of the collider. In order to find all primary vertices, primary track candidates are first selected by a cut on their distance from the z-axis, i.e., the beam line. The selected tracks are then clustered on the basis of their z-coordinates at the point of closest approach to the centre of the beam spot. The vertex finding problem is thus reduced to clustering in a single spatial dimension.

2.1 Divisive Clustering

A simple divisive clustering can be performed by requiring a gap of at least d between adjacent clusters/vertices. The threshold d depends on the shape of the beam profile along z, on the expected number of interactions per bunch crossing, and on the precision of the z-coordinate used for clustering. It has to be optimized by studying simulated events. If subsequent validation of a cluster fails, it can be further divided by a more refined method (see Sect. 7.3.3).
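As an illustration, the following minimal Python sketch performs such gap-based divisive clustering on the z-coordinates of the selected tracks. The function name, the threshold value, and the use of NumPy are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def gap_clusters(z, d=0.3):
    """Split 1D track positions z (in cm) into clusters wherever the
    gap between adjacent sorted values exceeds the threshold d (in cm)."""
    z = np.sort(np.asarray(z))
    cuts = np.where(np.diff(z) > d)[0] + 1   # indices where a new cluster starts
    return np.split(z, cuts)

# Toy example: three well-separated interaction regions along z
rng = np.random.default_rng(1)
z = np.concatenate([rng.normal(-3.0, 0.05, 20),
                    rng.normal( 0.5, 0.05, 30),
                    rng.normal( 4.0, 0.05, 10)])
print([len(c) for c in gap_clusters(z)])     # -> [20, 30, 10]
```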

2.2 Model-Based Clustering

Assume that there are n tracks with z-coordinates \(z_i,\ i=1,\ldots,n\), sorted in ascending order: \(z_1 < z_2 < \ldots < z_n\). The \(z_i\) are assumed to be sampled from a Gaussian mixture with the following PDF:

$$\displaystyle \begin{gathered} f(z)=\sum_{k=1}^K \omega_k\,\varphi(z;v_k,\sigma_k^2),\ \;\sum_{k=1}^K\omega_k=1, \end{gathered} $$
(7.1)

where K is the number of mixture components, \(\varphi\) is the PDF of the normal distribution, \(\omega_k\) is the component weight, \(v_k\) is the mean value, and \(\sigma_k^2\) is the variance of component \(k,\ k=1,\ldots,K\). As the association of the tracks represented by the points \(z_i\) to the vertices represented by the component means \(v_k\) is unknown, latent (unobserved) variables \(y_i,\ i=1,\ldots,n\) are introduced that encode the association:

$$\displaystyle \begin{gathered} y_i=k\iff z_i\mbox{ belongs to component }k,\ i=1,\ldots,n. \end{gathered} $$
(7.2)

The latent variables and the unknown parameters of the mixture (weights, means, variances) can be estimated by the Expectation-Maximization (EM) algorithm [2,3,4]. The EM algorithm is iterative, and each iteration consists of two steps. In the E-step (expectation step), the latent variables are estimated, given the observations and the current estimate of the mixture parameters. In the M-step (maximization step), the maximum likelihood estimate of the mixture parameters is computed, using the estimates of the latent variables from the E-step. Convergence is guaranteed, as the likelihood increases in every iteration. There is no guarantee, however, of reaching the global maximum of the likelihood function.

In the special case of a normal mixture, explicit formulas can be obtained. Assume that the M-step of iteration j gives the estimates \(\omega_{k}(j),v_k(j),\sigma_k^2(j),\ k=1,\ldots,K\). The E-step of the next iteration j + 1 computes the association probabilities or ownership weights \(p_{i,k}\) of all points with respect to all components:

$$\displaystyle \begin{gathered} p_{i,k}=\frac{\omega_k(j)\,\varphi(z_i;v_k(j),\sigma_k^2(j))} {\sum_{l=1}^K \omega_l(j)\,\varphi(z_i;v_l(j),\sigma_l^2(j))}, \ i=1,\ldots,n, \ k=1,\ldots,K. \end{gathered} $$
(7.3)

In the M-step the mixture parameters are updated:

$$\displaystyle \begin{gathered} \begin{aligned} \omega_k(j+1)&=\frac{1}{n}\sum_{i=1}^n p_{i,k}, \\ v_k(j+1)&=\frac{\sum_{i=1}^n p_{i,k}\,z_i}{\sum_{i=1}^n p_{i,k}},\\ \sigma_k^2(j+1)&=\frac{\sum_{i=1}^n p_{i,k}\left[z_i-v_k(j+1)\right]^2}{\sum_{i=1}^n p_{i,k}},\ \;\ k=1,\ldots,K. \end{aligned} \end{gathered} $$
(7.4)
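Equations (7.3) and (7.4) translate directly into code. The following Python sketch is a minimal implementation for fixed K; the initialization and the fixed iteration count are assumptions, and a production version would instead test convergence of the likelihood.

```python
import numpy as np
from scipy.stats import norm

def em_gaussian_mixture(z, K, n_iter=100):
    """EM for a 1D Gaussian mixture, Eqs. (7.3) and (7.4)."""
    z = np.asarray(z)
    n = len(z)
    # simple initialization: uniform weights, means spread over the data range
    w = np.full(K, 1.0 / K)
    v = np.linspace(z.min(), z.max(), K)
    s2 = np.full(K, z.var() / K)
    for _ in range(n_iter):
        # E-step, Eq. (7.3): ownership weights p[i, k]
        p = w * norm.pdf(z[:, None], loc=v, scale=np.sqrt(s2))
        p /= p.sum(axis=1, keepdims=True)
        # M-step, Eq. (7.4): update weights, means, and variances
        pk = p.sum(axis=0)
        w = pk / n
        v = (p * z[:, None]).sum(axis=0) / pk
        s2 = (p * (z[:, None] - v) ** 2).sum(axis=0) / pk
    return w, v, s2
```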

The EM algorithm suffers from the fact that the number K of clusters has to be selected in advance. A possible solution to this problem is to set K to the largest value that can reasonably be expected, and to merge components that are sufficiently similar after the EM algorithm has converged.

The selection of the optimal number of components can be automatized by sparse model-based clustering [5, 6]. Sparsity of the final mixture is achieved by an appropriate prior on the mixture weights. As explicit formulas are no longer available, estimation of the mixture parameters has to be done via Markov Chain Monte Carlo (MCMC). A comparison with the EM algorithm can be found in [6].

2.3 EM Algorithm with Deterministic Annealing

Assume that there are n tracks with z-coordinates \(z_i,\ i=1,\ldots,n\), again in ascending order, with associated standard errors \(\sigma_i,\ i=1,\ldots,n\). The EM algorithm described in Sect. 7.2.2 is sensitive to the initial values of the component parameters. This can be remedied by introducing deterministic annealing (DA), see [7]. DA introduces a temperature parameter T, which is used to scale the standard errors of the \(z_i\). Annealing starts at a high temperature, corresponding to large errors, and the temperature is lowered according to a predefined annealing schedule. At each temperature, the association probabilities of the data points \(z_i\) to the current cluster centers are computed. If there are data points not associated to any of the clusters, indicated by very low association probabilities, one of these points is chosen as a new cluster center. The number of clusters is thus determined dynamically. The algorithm is summarized in Table 7.1. Note that, in contrast to the model-based clustering in Sect. 7.2.2, the association probabilities are computed using the uncertainties of the track positions instead of the vertex positions.

Table 7.1 Algorithm: Vertex finding with EM algorithm and deterministic annealing
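Since the body of Table 7.1 is not reproduced here, the Python sketch below is only one plausible reading of the prose description above. The annealing schedule, the number of EM iterations per temperature, and the threshold p_min for declaring a point unassociated are all assumptions.

```python
import numpy as np

def em_da_vertices(z, sigma, T_schedule=(256, 64, 16, 4, 1), p_min=1e-3):
    """Sketch of vertex finding with EM and deterministic annealing: the
    track variances are inflated by the temperature T; a point whose
    (unnormalized) association to every cluster stays below p_min is
    promoted to a new cluster center."""
    z, sigma = np.asarray(z, float), np.asarray(sigma, float)
    v = np.array([np.average(z, weights=1 / sigma**2)])  # single initial center
    for T in T_schedule:
        s2 = T * sigma**2                    # temperature-scaled track variances
        for _ in range(50):                  # EM iterations at this temperature
            g = np.exp(-0.5 * (z[:, None] - v)**2 / s2[:, None])
            p = g / np.maximum(g.sum(axis=1, keepdims=True), 1e-300)
            # weighted-mean update of the cluster centers
            v = ((p * (z / s2)[:, None]).sum(axis=0)
                 / (p / s2[:, None]).sum(axis=0))
        lost = g.max(axis=1) < p_min         # points far from every center
        if lost.any():
            v = np.append(v, z[np.argmax(lost)])   # seed a new cluster center
    return v
```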

Two examples with ten randomly generated clusters within a space of 1 cm and \(\sigma_i=0.01\) cm are shown in Fig. 7.2. In the example on the top, the found clusters perfectly match the true clusters; in the example on the bottom, two true clusters merge into a single found cluster, and one of the found clusters is spurious, containing a single data point.

Fig. 7.2 Examples of cluster finding with the EM algorithm and deterministic annealing. The results are discussed in the text

2.4 Clustering by Deterministic Annealing

Assume now that there are K vertex positions \(v_k,\ k=1,\ldots,K\), called prototypes, that represent K clusters of tracks. The expected discrepancy between data points and prototypes is given by:

$$\displaystyle \begin{gathered} E=\sum_{i=1}^n\sum_{k=1}^K P(z_i\in C_k)\,d(z_i,v_k), \end{gathered} $$
(7.7)

where \(C_k\) is the cluster represented by \(v_k\) and \(d(z_i,v_k)\) is a measure of distance between the data point \(z_i\) and the prototype \(v_k\) [8]. A typical choice of \(d(z_i,v_k)\) is the weighted squared difference:

$$\displaystyle \begin{gathered} d(z_i,v_k)=\left(\frac{z_i-v_k}{\sigma_i}\right)^2, \end{gathered} $$
(7.8)

where \(\sigma_i\) is again the standard error of the track position \(z_i\). As there is no prior knowledge on the probabilities \(P(z_i\in C_k)\), the principle of maximum entropy can be applied, giving a Gibbs distribution:

$$\displaystyle \begin{gathered} P(z_i\in C_k)=\frac{\exp\left(-\beta\,d(z_i,v_k)\right)}{\sum_{l=1}^K \exp\left(-\beta\,d(z_i,v_l)\right)}. \end{gathered} $$
(7.9)

The parameter β = 1∕T is the inverse of the “temperature” of the system.

Finding the most probable configuration of the prototypes at a given temperature is equivalent to minimizing the “free energy” F, which is given by [8]:

$$\displaystyle \begin{gathered} F=-\frac{1}{\beta}\sum_{i=1}^n \ln \sum_{k=1}^K\exp\left(-\beta\,d(z_i,v_k)\right). \end{gathered} $$
(7.10)

Minimization of F has to be done by numerical methods; see Sect. 3.1.

At infinite temperature, i.e., β = 0, there is a single cluster containing all data points. The temperature is now lowered according to a predefined annealing schedule. At some positive value of β, the cluster will undergo a phase transition and split into smaller clusters. Annealing is continued until the end of the schedule. At every temperature, a characteristic number of effective prototypes emerges at distinct positions, independent of the number of prototypes.

Instead of using a large number of prototypes, many of which may coincide at a given temperature, one can work with weighted prototypes, where the weight \(\rho_k\) corresponds to the fraction of unweighted prototypes that coincide at \(v_k\). The weights \(\rho_k\) always sum to 1, and the free energy is slightly modified:

$$\displaystyle \begin{gathered} F=-\frac{1}{\beta}\sum_{i=1}^n \ln \sum_{k=1}^K\rho_k \exp\left(-\beta\,d(z_i,v_k)\right). \end{gathered} $$
(7.11)

The association probabilities and the cluster weights are given by:

$$\displaystyle \begin{gathered} p_{i,k}=P(z_i\in C_k)=\frac{\rho_k\exp\left(-\beta\,d(z_i,v_k)\right)}{\sum_{l=1}^K \rho_l\exp\left(-\beta\,d(z_i,v_l)\right)},\ \; \rho_k=\frac{1}{n}\sum_{i=1}^n p_{i,k}. \end{gathered} $$
(7.12)

The annealing is started at high temperature with a single prototype of weight \(\rho_1=1\). If the distance function is chosen as in Eq. (7.8), the minimum of F is at the weighted mean of the data points. During annealing, the temperature is gradually decreased and local minima of F emerge. The critical temperature \(T_k^{\mathrm{c}}\) of cluster k is the point where a local minimum turns into a saddle point and is given by:

$$\displaystyle \begin{gathered} T_k^{\mathrm{c}}=2\sum_{i=1}^n \frac{p_{i,k}}{\sigma_i^2}\left(\frac{z_i-v_k}{\sigma_i}\right)^2\bigg/\sum_{i=1}^n \frac{p_{i,k}}{\sigma_i^2}. \end{gathered} $$
(7.13)

Whenever the temperature falls below the critical temperature of a cluster, the prototype of this cluster is replaced by two nearby representatives. The association probabilities and the new cluster weights are recomputed according to Eq. (7.12). If the final temperature is small enough, the soft assignment of data points to clusters turns into a hard assignment.
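The scheme of Eqs. (7.8) and (7.11)–(7.13) can be sketched as follows in Python. The fixed-point iteration count, the cooling factor, the stopping temperature, and the split-by-perturbation step are assumptions; a real implementation would monitor the convergence of F.

```python
import numpy as np

def crit_temperature(z, sigma, v, p):
    """Critical temperature of each cluster, Eq. (7.13)."""
    w = p / sigma[:, None]**2                            # p_ik / sigma_i^2
    return 2 * (w * ((z[:, None] - v) / sigma[:, None])**2).sum(0) / w.sum(0)

def da_vertex_finder(z, sigma, cool=0.6, T_stop=1e-3, eps=1e-3):
    """Sketch of clustering by deterministic annealing with weighted
    prototypes, Eqs. (7.8) and (7.11)-(7.13)."""
    z, sigma = np.asarray(z, float), np.asarray(sigma, float)
    v = np.array([np.average(z, weights=1 / sigma**2)])  # single prototype
    rho = np.array([1.0])
    p = np.ones((len(z), 1))
    T = 2 * crit_temperature(z, sigma, v, p)[0]          # start above T_c
    while T > T_stop:
        for _ in range(50):                              # fixed-point iterations
            d = ((z[:, None] - v) / sigma[:, None])**2   # Eq. (7.8)
            g = rho * np.exp(-d / T)
            p = g / np.maximum(g.sum(1, keepdims=True), 1e-300)   # Eq. (7.12)
            rho = p.mean(0)
            v = ((p * (z / sigma**2)[:, None]).sum(0)
                 / (p / sigma[:, None]**2).sum(0))
        # split every prototype whose critical temperature is above T
        for k in np.where(crit_temperature(z, sigma, v, p) > T)[0]:
            v = np.append(v, v[k] + eps)                 # two nearby copies
            v[k] -= eps
            rho = np.append(rho, rho[k] / 2)
            rho[k] /= 2
        T *= cool
    return v, rho, p
```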

3 Primary Vertex Finding in 3D

In principle, the clustering methods described above for vertex finding in 1D can also be applied to vertex finding in 3D. It has to be noted, though, that the shortest distance in space between two tracks is peculiar insofar as it does not satisfy the triangle inequality: if tracks a and b are close, and tracks b and c are close, it does not follow that tracks a and c are close as well. The distance between two clusters of tracks should therefore be defined as the maximum of the individual pairwise distances, known as complete linkage in the clustering literature. Alternatively, the distance between two clusters can be the distance between the two vertices fitted from the clusters.
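To make the complete-linkage rule concrete, the following sketch computes the distance between two clusters of tracks as the maximum of the pairwise track–track distances. The straight-line track model (a point and a unit direction per track) and the helper names are simplifying assumptions.

```python
import numpy as np

def track_distance(p1, d1, p2, d2):
    """Shortest distance in space between two straight-line tracks, each
    given by a point p on the track and a unit direction vector d."""
    n = np.cross(d1, d2)
    if np.allclose(n, 0.0):                          # parallel tracks
        return np.linalg.norm(np.cross(p2 - p1, d1))
    return abs(np.dot(p2 - p1, n)) / np.linalg.norm(n)

def cluster_distance(tracks_a, tracks_b):
    """Complete linkage: maximum pairwise distance between two clusters,
    each a list of (point, direction) tuples."""
    return max(track_distance(p1, d1, p2, d2)
               for (p1, d1) in tracks_a for (p2, d2) in tracks_b)
```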

3.1 Preclustering

Because of the high track and vertex multiplicity in typical collider experiments at the LHC, preliminary clusters of tracks can be formed by selecting a primary vertex or a small group of primary vertices found in 1D. All tracks that are compatible with these are put in a preliminary cluster. This reduces the combinatorics and opens the possibility of processing these preliminary clusters in parallel. Tracks that are not compatible with any primary vertex are reserved for secondary vertex finding. In low-multiplicity experiments, this preliminary clustering can be omitted.

3.2 Greedy Clustering

Greedy clustering is agglomerative and starts with a single track, preferably a high-quality track with many hits and a good \(\chi^2\) (see Sect. 6.4.1). It is combined with its nearest neighbour in 3D, and a vertex is fitted from the two tracks. If the fit is successful, the vertex is stored. The track nearest to the vertex is then added, for instance by means of an extended Kalman filter (see Sect. 8.1.2.2). This procedure is continued until the vertex fit fails. Clustering is then resumed with an unused track.

The greedy clustering does not guarantee the globally best assignment of tracks to vertices, as tracks that are attached to a vertex remain attached forever. This can be cured by using a robust vertex fit throughout (see Sect. 8.2), allowing a track to be removed from a vertex if it is tagged as an outlier.
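A schematic Python version of the greedy procedure is given below; fit_vertex, nearest_track, the quality attribute, and the \(\chi^2\) cut are hypothetical placeholders for the actual vertex fitter and track model, not part of any specific toolkit.

```python
def greedy_vertex_finding(tracks, fit_vertex, nearest_track, max_chi2=9.0):
    """Greedy agglomerative vertex finding (sketch). fit_vertex(tracks)
    returns a vertex object with a chi2 attribute, or None on failure;
    nearest_track(cluster, candidates) returns the closest candidate."""
    unused = sorted(tracks, key=lambda t: t.quality, reverse=True)
    vertices = []
    while len(unused) >= 2:
        cluster = [unused.pop(0)]                 # seed with the best track
        while unused:
            trial = nearest_track(cluster, unused)
            vertex = fit_vertex(cluster + [trial])
            if vertex is None or vertex.chi2 > max_chi2:
                break                             # fit fails: close this cluster
            cluster.append(trial)
            unused.remove(trial)
        if len(cluster) >= 2:
            vertices.append((fit_vertex(cluster), cluster))
    return vertices
```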

3.3 Iterated Estimators

This is a divisive clustering algorithm with the following steps:

1. Perform a (preferably robust) vertex fit with all tracks.

2. Discard all incompatible tracks.

3. Repeat step 1 with all discarded tracks.

The iteration stops when no vertex with at least two tracks can be successfully fitted. Step 2 might itself be iterative, especially if the vertex fit is not robust, so that the incompatible tracks have to be removed sequentially. An iterative vertex finder, based on an adaptive fit (Sect. 8.2.2) and called the Adaptive Vertex Reconstructor (AVR, [9]), is implemented in the RAVE toolbox [10, 11]; see Appendix C.
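In code, the scheme can be condensed as follows; robust_fit and compatible are hypothetical stand-ins for a robust vertex fit (Sect. 8.2) and a track–vertex compatibility test.

```python
def iterated_vertex_finder(tracks, robust_fit, compatible, min_tracks=2):
    """Divisive iterated-estimator vertex finding (sketch): fit all tracks,
    keep the compatible ones as a vertex, repeat on the rest."""
    vertices, remaining = [], list(tracks)
    while len(remaining) >= min_tracks:
        vertex = robust_fit(remaining)                        # step 1
        if vertex is None:
            break                                             # no fit possible
        used = [t for t in remaining if compatible(t, vertex)]
        if len(used) < min_tracks:
            break                                             # vertex not valid
        vertices.append(vertex)
        remaining = [t for t in remaining if t not in used]   # steps 2 and 3
    return vertices, remaining
```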

3.4 Topological Vertex Finder

A general topological vertex finder called ZVTOP was proposed in [12]. It is related to the Radon transform, which is a continuous version of the Hough transform used for track finding (see Sect. 5.1.2). The search for vertices is based on a function \(V(\boldsymbol{v})\) that quantifies the probability of a vertex at location \(\boldsymbol{v}\). For each track a Gaussian probability tube \(f_i(\boldsymbol{v})\) is constructed. The function \(V(\boldsymbol{v})\) is defined taking into account that the value of \(f_i(\boldsymbol{v})\) must be significant for at least two tracks:

$$\displaystyle \begin{aligned} V({{\boldsymbol{v}}})=\sum_{i=1}^n f_i({{\boldsymbol{v}}})-\frac{\sum_{i=1}^n f_i^2({{\boldsymbol{v}}})}{\sum_{i=1}^n f_i({{\boldsymbol{v}}})} \end{aligned}$$

Due to the second term on the right-hand side, \(V(\boldsymbol{v})\approx 0\) in regions where \(f_i(\boldsymbol{v})\) is significant for only one track. The form of \(V(\boldsymbol{v})\) can be modified to fold in known physics information about probable vertex locations. For instance, \(V(\boldsymbol{v})\) can be augmented by a further function \(f_0(\boldsymbol{v})\) describing the location and spread of the interaction point. In addition, \(V(\boldsymbol{v})\) may be modified by a factor dependent on the angular location of the point \(\boldsymbol{v}\).
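The vertex function and the Gaussian tubes can be sketched as follows in Python; the straight-line track model and a common tube width sigma are simplifying assumptions.

```python
import numpy as np

def gaussian_tube(point, direction, sigma):
    """Gaussian probability tube around a straight-line track; 'point'
    lies on the track and 'direction' is its direction vector."""
    point = np.asarray(point, float)
    u = np.asarray(direction, float)
    u = u / np.linalg.norm(u)
    def f(v):
        r = np.asarray(v, float) - point
        d2 = r @ r - (r @ u)**2          # squared distance from v to the line
        return np.exp(-0.5 * d2 / sigma**2)
    return f

def vertex_function(v, tubes):
    """ZVTOP vertex function V(v) for a list of tube functions f_i."""
    f = np.array([tube(v) for tube in tubes])
    s = f.sum()
    return s - (f**2).sum() / s if s > 0 else 0.0
```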

Vertex finding amounts to finding the local maxima of the function \(V(\boldsymbol{v})\). The search starts at the calculated maxima of the products \(f_i(\boldsymbol{v})f_j(\boldsymbol{v})\) for all track pairs. For each of these points, the nearest maximum of \(V(\boldsymbol{v})\) is found. As \(V(\boldsymbol{v})\) is a smooth function, any of the methods discussed in Sect. 3.1 can be employed. The found maxima are clustered to form candidate vertex regions. The final association of the tracks to the vertex candidates can be done on the basis of the respective \(\chi^2\) contributions or by an adaptive fit (see Sect. 8.2.2). An experimental application is described in [13].

In [14], the topological vertex finder was augmented by a procedure based on the concept of the minimum spanning tree of a graph. The search region is binned, and for each track the bins crossed by the track (or by a tangent to the track at the point of closest approach) are incremented by one.

3.5 Medical Imaging Vertexer

The Medical Imaging Vertexer (MIV) [15] is similar to ZVTOP, but differs from it in two respects: first, it works with a pixelized representation of the track density; second, it applies a medical imaging filter to the density before finding the maxima. The vertex finder can be summarized in the following steps (a code sketch of the first two steps is given after the list):

1. All tracks are back-projected into a volume to be searched for vertices. The volume is represented by a 3D histogram. The bin size of the histogram is comparable to the tube size in ZVTOP.

2. The histogram is transformed into Fourier space, filtered by a medical imaging filter that removes artifacts and reduces blurring, and transformed back to a histogram in 3D space.

3. The filtered histogram is searched for local maxima. Clustering starts with the highest bin. If the next highest bin is adjacent, it is added to the cluster; otherwise it is the seed for the next cluster. This is iterated until no bin is above a predefined threshold.

4. Clusters are split or merged using a resolution criterion similar to the one used in ZVTOP [12].

5. A cluster is accepted as a vertex candidate if its starting bin exceeds a predefined threshold. The vertex position is estimated as the center of gravity of the cluster.
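A compressed sketch of the first two steps, assuming tracks sampled as points along straight lines and a generic isotropic filter in frequency space; the actual medical imaging filter of [15] is not reproduced here, and the ramp filter below is only an illustrative stand-in.

```python
import numpy as np

def miv_filtered_density(track_points, edges, filt):
    """MIV steps 1-2 (sketch): back-project tracks into a 3D histogram and
    filter it in Fourier space. track_points is a list of (N_i, 3) arrays of
    points sampled along each track; edges holds the three bin-edge arrays;
    filt maps frequency magnitude to filter gain."""
    H, _ = np.histogramdd(np.vstack(track_points), bins=edges)
    freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in H.shape], indexing='ij')
    K = np.sqrt(sum(f**2 for f in freqs))            # frequency magnitude
    return np.fft.ifftn(np.fft.fftn(H) * filt(K)).real

# Example: a simple ramp filter, as used in filtered back-projection
ramp = lambda K: K
```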

The performance of the MIV has been studied and compared to the Adaptive Vertex Reconstructor (AVR) in [15]. It is shown that the MIV finds vertices with higher efficiency and higher purity at large pile-up, whereas the AVR performs better at small pile-up.