1 Introduction

Identifying jets that contain bottom and charm hadrons, and separating them from jets originating from lighter quarks, is a critical task in the LHC physics program, referred to as “flavor tagging”. Bottom and charm jets are characterized by the presence of secondary decays “inside” the jet: the bottom and charm hadrons decay several millimeters past the primary interaction point (primary vertex), and only the stable outgoing particles are measured by the detector. Figure 1 illustrates a typical bottom jet decay, with two consecutive displaced vertices from a bottom decay (blue lines) and a charm decay (yellow lines).

Fig. 1

Illustration of a jet with secondary decay vertices. In order to identify the flavor of the jet, vertex reconstruction aims to group together the tracks measured in the detector based on their point of origin

Existing flavor tagging algorithms use a combination of low-level variables (the charged particle tracks, reconstructed secondary vertices), and high-level features engineered by experts as input to neural networks of various architectures in order to perform jet flavor classification [1].

Vertex reconstruction can be separated into two tasks, vertex finding, and vertex fitting [2]. Vertex finding refers to the task of partitioning the set of tracks, and vertex fitting refers to estimating the vertex positions given each sub-set of tracks. Existing algorithms typically use an iterative procedure of finding and fitting to perform both tasks together. We focus on using a neural network for vertex finding only. Vertex finding is a challenging task because of two factors:

  • Secondary vertices can be in close proximity to the primary vertex, and to each other, within the measurement resolution of the track trajectories.

  • The charged particle multiplicity in each individual vertex is low, typically between 1 and 5 tracks.

Vertex reconstruction is in essence an inverse problem of a complicated noisy (forward) function:

$$\begin{aligned} \text {Particle Decay} \rightarrow \text {Particle Measurement in Detector} \end{aligned}$$
(1)

Neural networks can find a model for this inverse problem without expert intervention by using supervised learning, i.e., by providing many examples of the forward process, which can be provided by simulations. They can also be easily optimized by retraining without expert intervention. Particle colliders may have different modes of operation during their lifetime, such as the LHC increasing its collision energy over the years. Different data taking conditions require re-optimizing reconstruction algorithms, and neural networks provide a simple way to perform that re-optimization.

Since the set of tracks to be partitioned has no inherent order, we use an equivariant neural network architecture (see footnote 1). We show in this paper that this constraint on the model results in better performance.

We first describe the dataset on which we test our proposed algorithm in Sect. 2. The model architecture and the baseline algorithms are described in Sect. 3. Section 4 discusses the performance metrics defined for vertex finding. Section 5 describes how the impact of vertex finding on jet classification was assessed, and the results are presented in Sect. 6. Conclusions are given in Sect. 7.

1.1 Background

Standard vertex reconstruction algorithms. Existing vertex reconstruction techniques are based on the geometry of the tracks, or on a combination of the geometry and constraints that are configured by hand to match a specific particle decay pattern [3]. A standard algorithm for finding and fitting multiple vertices is adaptive vertex reconstruction (AVR) [2, 4, 5]. The basic concept of AVR is to perform a least squares fit of the vertex position using all the tracks, then remove the less compatible tracks from the fit, and refit the removed tracks to additional vertices. This repeats until no tracks are left. AVR can be used to first fit the primary vertex, with special considerations for its unique properties, and subsequently fit secondary vertices. In this paper it is used as a general multi-vertex fitter, applied only to tracks associated to a single jet.
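As an illustration, the finding/fitting loop can be sketched as follows. This is a hypothetical, simplified rendering of the AVR procedure described above, not the RAVE implementation; `fit_vertex` and `track_weight` are placeholder functions standing in for the least-squares vertex fit and the track-to-vertex compatibility weight.

```python
# Hypothetical sketch of the adaptive vertex reconstruction (AVR) loop.
# fit_vertex and track_weight are placeholders for the least-squares vertex fit
# and the track-to-vertex compatibility weight of the actual algorithm.
def adaptive_vertex_reconstruction(tracks, weight_cut=0.5):
    vertices = []
    remaining = list(tracks)
    while remaining:
        vertex = fit_vertex(remaining)                      # fit one vertex to all remaining tracks
        weights = [track_weight(t, vertex) for t in remaining]
        compatible = [t for t, w in zip(remaining, weights) if w >= weight_cut]
        if not compatible:                                  # no compatible tracks left: stop
            break
        vertices.append((vertex, compatible))
        # the less compatible tracks are refit to additional vertices
        remaining = [t for t, w in zip(remaining, weights) if w < weight_cut]
    return vertices
```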

Deep learning on sets and graphs. Following the successful application of deep learning to images [6, 7], there is an ongoing research effort aimed at applying deep learning to other data structures such as unordered sets [8,9,10] and graphs [11,12,13,14]. Typical learning tasks for such domains are point-cloud classification for sets, or molecule property prediction for graphs. A challenge in both scenarios stems from the arbitrary order of the elements in the set or the nodes in the graph. Fully connected, convolutional and recurrent networks do not have the correct inductive bias for learning tasks on unordered sets [15]: they assume a fixed size or an ordering in the data. A popular design principle for networks that process such unordered data is constraining layers to be equivariant or invariant to the reordering operation. By using only equivariant layers the neural network is constrained to represent only equivariant functions.

Recently, the Set2Graph (S2G) model [16] was proposed as a simple, equivariant model for learning tasks in which the input is an arbitrarily ordered set of n elements and the output is an \(n\times n\) matrix that represents their pairwise relations. The S2G model was proved to be universal, meaning it can approximate any equivariant function from a set to a graph. We use this model in this paper.

Deep learning for particle physics. Neural networks that operate on sets have been used recently in a number of particle physics applications [17]. The data structure of an unordered set is a natural description for most particle physics reconstruction tasks, and recent progress in the field of graph neural networks [15] has prompted many new applications. For the problem of track reconstruction, a graph neural network was used to classify the paths between adjacent detector “hits” [18, 19]. This is a similar application to vertex finding since the end result must be a partition of the set of hits to different tracks. Other applications of graph neural networks to partitioning sets of objects include particle reconstruction in calorimeters and liquid argon time projection chambers  [20,21,22,23]. Direct jet classification has also been proposed with a few different variants of message passing networks [24,25,26,27,28,29,30,31].

2 Data

We test the proposed algorithm on a simulated dataset (see footnote 2). The dataset consists of jets sampled from \(pp\rightarrow t\bar{t}\) events at \(\sqrt{s}=14\) TeV. The events are generated with pythia8 [32] and a basic detector simulation is performed with delphes [33], emulating a detector similar to ATLAS [34]. Charged particle tracks are represented by 6 perigee parameters (\(d_{0}\), \(z_{0}\), \(\phi \), \(\text {cot}\theta \), \(p_{T}\), q) and their covariance matrix. Noise is added to the track perigee parameters with Gaussian smearing. The track parameter resolution depends on the transverse momentum \(p_{T}\) and pseudorapidity \(\eta \) of the track in a qualitatively similar way to the measurements reported in [34]. The covariance matrix is diagonal in this simplified track smearing model – the smearing is done independently for each parameter with no correlated effects.
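For illustration, the independent smearing might be implemented along the following lines. This is a hedged sketch, assuming per-parameter resolution functions of \(p_{T}\) and \(\eta \); the actual resolution model of the dataset is not reproduced here.

```python
import numpy as np

# Simplified sketch of the track smearing: each continuous perigee parameter is
# smeared independently with a Gaussian, so the covariance matrix is diagonal.
# resolution(pt, eta) is a hypothetical model returning a width per parameter.
def smear_track(true_params, resolution, rng=np.random.default_rng()):
    """true_params: dict with keys d0, z0, phi, cot_theta, pt (the charge q is not smeared)."""
    eta = np.arcsinh(true_params["cot_theta"])                  # pseudorapidity from cot(theta)
    sigmas = resolution(true_params["pt"], eta)                 # dict of per-parameter widths
    smeared = {k: true_params[k] + rng.normal(0.0, sigmas[k]) for k in sigmas}
    cov = np.diag([sigmas[k] ** 2 for k in sigmas])             # diagonal covariance matrix
    return smeared, cov
```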

Jets are constructed from calorimeter energy deposits with the anti-\(k_{\mathrm {T}}\) algorithm [35] with a distance parameter of \(R = 0.4\). Charged tracks are associated to jets using a \(\varDelta R < 0.4\) cone around the jet axis. The flavor labeling of jets (as bottom, charm or light) is done by matching weakly decaying bottom and charm hadrons to the jet with a \(\varDelta R\) cone of size 0.3.

A basic jet selection is applied, requiring jets to have \(p_{T} > 20\) GeV and \(|\eta | < 2.5\). The input to the vertex finding algorithms is the set of tracks associated to each jet, together with the jet \(p_{T}\), \(\eta \), \(\phi \) and jet mass.

Dataset composition. The properties of secondary vertices, such as their distance from the primary vertex, depend on the jet flavor but also on \(p_{T}\), \(\eta \), and the number of tracks (\(n_{\mathrm {tracks}}\)). However, the distribution of these parameters differs between the flavors, depending on the process used to generate the sample. The dataset is therefore built by sampling equal numbers of jets from each flavor in each \((p_{T},\eta ,n_{\mathrm {tracks}})\) bin, as illustrated in Fig. 2a. For each bin, the flavor with the fewest jets (usually c jets) in that bin determines the number of jets sampled from the other flavors. Figure 2b shows the resulting distribution of the number of vertices in each jet flavor, and Fig. 2c shows the distribution of \(p_{T}\), \(\eta \), and \(n_{\mathrm {tracks}}\) for all the flavors. The dataset is split into training (500k jets), validation, and testing datasets (100k jets each).
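A minimal sketch of this per-bin balancing, assuming the jets are collected in a pandas DataFrame with (hypothetical) columns `flavor`, `pt_bin`, `eta_bin` and `ntrk_bin`:

```python
import pandas as pd

def balance_by_flavor(jets: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Sample equal numbers of b, c and light jets in each (pT, eta, n_tracks) bin.
    The flavor with the fewest jets in a bin sets the number of jets kept
    from each flavor in that bin."""
    balanced = []
    for _, group in jets.groupby(["pt_bin", "eta_bin", "ntrk_bin"]):
        n_min = group["flavor"].value_counts().min()
        balanced.append(
            group.groupby("flavor", group_keys=False).sample(n=n_min, random_state=seed)
        )
    return pd.concat(balanced, ignore_index=True)
```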

Fig. 2

a The dataset is composed by selecting equal numbers of jets from each flavor in each bin of \(p_{T}\), \(\eta \), and \(n_{\mathrm {tracks}}\). b Distribution of the number of secondary vertices for the different jet flavors. c The resulting distribution of \(p_{T}\), \(\eta \), and \(n_{\mathrm {tracks}}\) in the dataset

3 Vertex finding algorithms

We compare four different algorithms.

  • Adaptive vertex reconstruction (AVR).

  • Set2Graph neural network.

  • Track pair (TP) classifier.

  • Recurrent neural network (RNN) model.

AVR serves as the baseline, and represents the existing vertex reconstruction algorithms. The S2G model is our universal equivariant model. The TP and RNN algorithms are baseline neural networks that are similar to S2G but remove one of its important properties: The TP algorithm is not universal, while the RNN is not equivariant. The architectures of all models are described below.

3.1 Adaptive vertex reconstruction

We use adaptive vertex reconstruction as implemented in the RAVE software package [4]. This algorithm is representative of existing (non neural network based) methods. The input to the algorithm is the set of tracks associated to the jet and their covariance matrices. The output is a set of vertices, and a set of track-to-vertex association weights. The algorithm can associate a track to more than one vertex. To convert this output into an unambiguous partition, each track is assigned to the vertex to which it has the highest weight. There are hyperparameters that control the iterative fitting and finding procedure, such as cuts on the track-to-vertex weight for removing outliers, and these were scanned to find the set of cuts resulting in the highest Rand index (defined in Sect. 4.1). Additional details about the hyper-parameter optimization are given in Appendix A.
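The conversion to a hard assignment is a simple argmax over the association weights; a sketch (the array layout is an assumption of this example, not RAVE's interface):

```python
import numpy as np

def harden_assignment(weights):
    """weights: (n_tracks, n_vertices) array of track-to-vertex association weights
    produced by the fitter.  Each track is assigned to the vertex for which it has
    the highest weight, yielding an unambiguous partition."""
    return np.asarray(weights).argmax(axis=1)   # vertex index for each track
```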

3.2 Set2Graph neural network

For the neural network training, the vertex finding task is cast as an edge classification task, as illustrated in Fig. 3. The input consists of the tracks associated to a jet, represented as an \(n_{\mathrm {tracks}} \times d_{\mathrm {in}}\) matrix, with the \(d_{\mathrm {in}}=10\) features composed of the 6 track perigee parameters and the jet feature vector (the jet features are duplicated for each track). The output is a binary label attached to each pair of tracks indicating whether they originated from the same position in space.
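For concreteness, a sketch of how such an input matrix can be assembled (assuming per-track arrays of the six perigee parameters and a four-component jet feature vector of \(p_{T}\), \(\eta \), \(\phi \) and mass):

```python
import numpy as np

def build_input(track_params, jet_features):
    """track_params: (n_tracks, 6) array of perigee parameters;
    jet_features: length-4 array (pT, eta, phi, mass).
    Returns the (n_tracks, 10) input matrix, with the jet features duplicated per track."""
    n_tracks = track_params.shape[0]
    jet_block = np.tile(jet_features, (n_tracks, 1))
    return np.concatenate([track_params, jet_block], axis=1)
```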

Fig. 3

The input and training target for the neural network algorithms. For a jet with \(n_{\mathrm {tracks}}\), the input is an array of \(n_{\mathrm {tracks}} \times d_{\mathrm {in}}\) track and jet features (jet features are represented by the light blue boxes, track features by the colored boxes), and the target output is a binary classification label for each of the \(n_{\mathrm {tracks}}\times (n_{\mathrm {tracks}}-1)\) ordered pairs of tracks in the jet

Fig. 4

Partitioning a set of jet tracks using a neural network. A set-to-set component, \(\phi \), creates a hidden representation of each track, with size \(d_{\text {hidden}}\). A broadcasting layer, \(\beta \), then creates a representation for each directed edge (ordered pair of tracks in the jet) by combining the representation of the two tracks and the sum of all representations. An edge classifier \(\psi \) then operates on the directed edges. This output is used for training the model (see the target definition in Fig. 3). During inference the output of the edge classifier is symmetrized to produce an edge score. The edge scores are used to define the set partition by optimizing the partition score, as described in Sect. 3.4

The S2G network is built as a composition of 3 modules, \(\psi \circ \beta \circ \phi \): a set-to-set component \(\phi \), a broadcasting layer \(\beta \), and a final edge classifier \(\psi \). Here we give only a high level description of what each module does and its purpose; the specific model details are given in Appendix B. The model architecture is illustrated in Fig. 4.

The set-to-set component \(\phi \) takes as input the matrix of size \(n_{\mathrm {tracks}} \times d_{\mathrm {in}}\). The output of \(\phi \) is a hidden representation vector for each track, with size \(n_{\mathrm {tracks}} \times d_{\mathrm {hidden}}\). \(\phi \) is where information is exchanged between tracks and it is implemented as a deep sets [8] network.

The broadcasting layer \(\beta \) constructs a representation for each ordered pair of tracks (directed edge) using the output of \(\phi \). The edge representation is simply a concatenation of the representations of the two tracks, with the sum of all track representations, resulting in an output of size \((n_{\mathrm {tracks}}(n_{\mathrm {tracks}}-1))\times 3 d_{\mathrm {hidden}}\).

The edge classifier \(\psi \) is an MLP that operates on the edges to produce an edge score. This edge score is trained according to the target defined in Fig. 3. During inference (after the training is complete) the edge scores are symmetrized, so for an unordered track pair the edge score \(s_{ij}\) is:

$$\begin{aligned} s_{ij} = \sigma \left( \frac{1}{2} \left( \psi (\text {track}_{i}, \text {track}_{j})+\psi (\text {track}_{j}, \text {track}_{i}) \right) \right) \end{aligned}$$
(2)

where \(\sigma \) is the sigmoid function.
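A compact PyTorch sketch of the \(\psi \circ \beta \circ \phi \) composition, including the symmetrization of Eq. (2), may help make the structure concrete. Layer sizes and depths here are illustrative placeholders, not the configuration of Appendix B.

```python
import torch
import torch.nn as nn

class DeepSetsPhi(nn.Module):
    """Set-to-set component phi: a minimal deep sets block (sizes are illustrative)."""
    def __init__(self, d_in=10, d_hidden=64):
        super().__init__()
        self.local = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mix = nn.Sequential(nn.Linear(2 * d_hidden, d_hidden), nn.ReLU())

    def forward(self, x):                                   # x: (n_tracks, d_in)
        h = self.local(x)                                   # per-track encoding
        pooled = h.sum(dim=0, keepdim=True).expand_as(h)    # permutation-invariant context
        return self.mix(torch.cat([h, pooled], dim=1))      # (n_tracks, d_hidden)

class Set2GraphSketch(nn.Module):
    """phi -> beta (broadcast) -> psi, as in Fig. 4.  A sketch, not the trained model."""
    def __init__(self, d_in=10, d_hidden=64):
        super().__init__()
        self.phi = DeepSetsPhi(d_in, d_hidden)
        self.psi = nn.Sequential(nn.Linear(3 * d_hidden, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, 1))

    def forward(self, x):                                   # x: (n_tracks, d_in)
        h = self.phi(x)                                     # (n_tracks, d_hidden)
        n = h.shape[0]
        total = h.sum(dim=0).expand(n, n, -1)               # sum of all track representations
        hi = h.unsqueeze(1).expand(n, n, -1)                # representation of track i
        hj = h.unsqueeze(0).expand(n, n, -1)                # representation of track j
        edges = torch.cat([hi, hj, total], dim=-1)          # beta: one vector per ordered pair
        logits = self.psi(edges).squeeze(-1)                # raw directed-edge scores
        # symmetrization and sigmoid of Eq. (2); the diagonal (i = j) is ignored downstream
        return torch.sigmoid(0.5 * (logits + logits.T))
```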

3.3 Neural network baselines

The neural network baselines are meant to check the importance of the properties of the S2G model. The models have a similar number of trainable parameters: 0.46M for S2G, 0.42M and 0.53M for TP and RNN respectively. They share the same architecture of \(\psi \circ \beta \circ \phi \) as the S2G model, with some components replaced as described below. Their properties are summarized in Table 1.

Table 1 Comparison of the neural network models. The inference time and FLOPS are measured per single jet with 14 tracks. FLOPS were estimated with [36]

The TP classifier is not a universal model. It will allow us to quantify the contribution of the information exchange between tracks to the overall vertex finding performance. As illustrated in Fig. 5, the hidden representation created for each track by the deep set module is conditional on the other tracks in the jet. We expect that for the task of vertex finding, being aware of all tracks is important, as the probability of a track pair being connected is conditional on the presence or lack of additional tracks nearby.

The TP classifier checks this assumption about the data. If the probability of each track pair is conditional only on the properties of the track pair, this algorithm will perform as well as the S2G model. It is still expected to perform reasonably well, as it can still learn to join together tracks based on their geometry alone.

The deep set based \(\phi \) layer is replaced by an MLP applied to each track in the jet (independently from the other tracks) to produce a hidden vector representation of that track. While a deep set has been proven to be universal (it can approximate any function from sets to sets) [37], applying an elementwise MLP is not universal for permutation equivariant functions.

Additionally, the broadcasting layer \(\beta \) does not use the sum of the track hidden representations. The \(\psi \) network operates only on the pair of track hidden representations. Therefore in the TP classifier there is no information exchange between the track pairs – each track pair is classified independently.

In the RNN model the \(\phi \) deep set component is replaced by a stack of bi-directional GRU layers [38]. Each GRU layer processes the sequence of track representations, sorted by the track transverse momentum. The layer output is a concatenation of the sequence of hidden representations from both directional passes of the GRU, therefore each track hidden representation still contains information from all other tracks in the jet. This model can theoretically learn any function that the S2G model can, but its architecture is not equivariant. This model will show if the equivariance is a useful inductive bias for this task. Additionally, the sequential nature of the RNN leads to a slower inference time compared to the S2G and TP models (see Table 1).
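A sketch of the two replacement \(\phi \) components (again with illustrative layer sizes); only this module differs between the baselines and the S2G model:

```python
import torch.nn as nn

class TrackwiseMLPPhi(nn.Module):
    """TP baseline: phi is an MLP applied to each track independently,
    so no information is exchanged between tracks."""
    def __init__(self, d_in=10, d_hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_hidden))

    def forward(self, x):                    # x: (n_tracks, d_in)
        return self.mlp(x)                   # each row is processed on its own

class BiGRUPhi(nn.Module):
    """RNN baseline: tracks are processed as a p_T-sorted sequence by a
    bi-directional GRU, so the output depends on the ordering (not equivariant)."""
    def __init__(self, d_in=10, d_hidden=64):
        super().__init__()
        self.gru = nn.GRU(d_in, d_hidden, bidirectional=True, batch_first=True)

    def forward(self, x_sorted_by_pt):       # (n_tracks, d_in), sorted by track p_T
        out, _ = self.gru(x_sorted_by_pt.unsqueeze(0))
        return out.squeeze(0)                # (n_tracks, 2 * d_hidden)
```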

3.4 Inference

The network output needs to be converted into a cluster assignment for the tracks. If tracks i and j are connected, and track j is connected to track k, then the edge between tracks i and k must also be connected, regardless of its edge score. This could lead to a situation where many edges with low edge scores are artificially connected. Therefore we utilize the partition score optimization algorithm proposed by the authors of [21]. Track pairs whose score (Eq. 2) is above a threshold of 0.5 are considered in sequence of decreasing score, and are “connected” only if their addition increases the partition score:

$$\begin{aligned} {\text {Partition score}} = \sum \delta _{ij}\ln (s_{ij})+(1-\delta _{ij})\ln (1-s_{ij}) \end{aligned}$$
(3)

where \(\delta _{ij}\) is 1 if \( \text {track}_{i}\) and \( \text {track}_{j}\) are assigned to the same cluster, and 0 otherwise. In other words, if the connection of two tracks leads to an indirect connection between tracks with low edge scores, the connection is rejected.
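A sketch of this greedy procedure; a union-find structure tracks the clusters, and the exact implementation of Ref. [21] may differ in detail.

```python
import numpy as np

def partition_tracks(scores, threshold=0.5):
    """scores: symmetric (n, n) matrix of edge scores s_ij from Eq. (2).
    Candidate edges with s_ij > threshold are tried in order of decreasing score
    and kept only if the partition score of Eq. (3) improves."""
    n = scores.shape[0]
    parent = list(range(n))                     # union-find forest over the tracks

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    def partition_score():                      # Eq. (3) for the current clustering
        total = 0.0
        for i in range(n):
            for j in range(i + 1, n):
                s = float(np.clip(scores[i, j], 1e-6, 1.0 - 1e-6))
                total += np.log(s) if find(i) == find(j) else np.log(1.0 - s)
        return total

    candidates = sorted(((scores[i, j], i, j) for i in range(n)
                         for j in range(i + 1, n) if scores[i, j] > threshold),
                        reverse=True)
    current = partition_score()
    for _, i, j in candidates:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue                            # already in the same cluster
        parent[ri] = rj                         # tentatively connect the two clusters
        new = partition_score()
        if new > current:
            current = new                       # the connection improves the score: keep it
        else:
            parent[ri] = ri                     # reject: it forces low-score indirect edges
    return [find(i) for i in range(n)]          # cluster label for each track
```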

Fig. 5

The deep set module \(\phi \) in the S2G model (top) creates the track hidden representation based on information exchange between the tracks in the jet. The TP classifier (center), however, creates the hidden representation with an MLP, which operates on each track individually. The RNN model (bottom) creates the hidden representation with a bi-directional GRU, which means the output depends on the order in which the tracks are sorted

3.5 Training procedure and loss function

We train the network f to perform edge predictions, i.e., to predict the probability that each pair of input tracks originates from the same vertex. For a jet with \(n_{\mathrm {tracks}}\) tracks we therefore predict \(n_{\mathrm {tracks}} (n_{\mathrm {tracks}}-1)\) directed edge scores. The network f is trained on these edge predictions before the symmetrization step; the symmetrization applied at inference reduces them to \(n_{\mathrm {tracks}} (n_{\mathrm {tracks}}-1)/2\) edge scores for the unordered track pairs.

In terms of edge classification, it is important to balance the false positive and false negative rates. We initially trained the network with a standard binary cross entropy (BCE) loss function:

$$\begin{aligned} \text {BCE}= \sum _{\mathrm {edges}} -y_{\mathrm {edge}} \ln (\hat{y}_{\mathrm {edge}}) -(1-y_{\mathrm {edge}}) \ln (1-\hat{y}_{\mathrm {edge}}) \end{aligned}$$
(4)

where \(\hat{y}_{\mathrm {edge}}\) is the predicted edge value, between 0 and 1, and \(y_{\mathrm {edge}}\) is the truth edge label (0 or 1). The sum is over all edges in a single jet.

Training with BCE loss function resulted in a high number of false negatives. We therefore introduced a loss function based on the \(F_\beta \) score, defined as:

$$\begin{aligned} F_\beta = \frac{(1+\beta ^2)\cdot \text {TP}}{ (1+\beta ^2)\cdot \text {TP}+ \text {FP}+\beta ^2\cdot \text {FN} } \end{aligned}$$
(5)

with TP, FP, FN the true positives, false positives and false negatives respectively. The \(F_\beta \) score is not differentiable. Quantities such as true positives are defined by functions that contain non differentiable conditions, for example:

$$\begin{aligned} \text {true positives}\equiv \sum _{\mathrm {edges}}( \hat{y}_{\mathrm {edge}} > \text {threshold} ) y_{\mathrm {edge}} \end{aligned}$$
(6)

To compute a differentiable \(F_\beta \) loss, denoted as \(F_\beta ^*\), these quantities are approximated by differentiable functions:

$$\begin{aligned} \begin{aligned}&\text {true positives}^*\equiv \sum _{\mathrm {edges}}\hat{y}_{\mathrm {edge}}\cdot y_{\mathrm {edge}}\\&\text {false positives}^*\equiv \sum _{\mathrm {edges}}\hat{y}_{\mathrm {edge}}\cdot (1- y_{\mathrm {edge}}) \\&\text {false negatives}^*\equiv \sum _{\mathrm {edges}}(1-\hat{y}_{\mathrm {edge}})\cdot y_{\mathrm {edge}}. \end{aligned} \end{aligned}$$
(7)

However, training with the \(F_\beta ^*\) loss only was unstable: depending on the random weight initialization of the network, the training would sometimes fail to converge. A combined loss of BCE and \(F_\beta ^*\) was finally used:

$$\begin{aligned} \text {Loss}= \text {BCE}-\lambda \sum _{\mathrm {jets}}F_\beta ^* \end{aligned}$$
(8)

\(\lambda \) and \(\beta \) are hyperparameters that control the balance between false negatives and false positives.
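A sketch of the combined loss of Eqs. (4), (7) and (8) for a single jet; the values of \(\beta \) and \(\lambda \) used here are placeholders, not the tuned hyperparameters.

```python
import torch
import torch.nn.functional as F

def vertexing_loss(edge_pred, edge_target, beta=1.0, lam=1.0):
    """edge_pred: predicted edge probabilities in (0, 1); edge_target: 0/1 truth labels (as floats).
    Returns BCE - lam * soft F_beta for one jet (Eq. 8); in practice the F_beta* term
    is summed over the jets in a batch."""
    bce = F.binary_cross_entropy(edge_pred, edge_target, reduction="sum")   # Eq. (4)
    tp = (edge_pred * edge_target).sum()                                    # Eq. (7)
    fp = (edge_pred * (1.0 - edge_target)).sum()
    fn = ((1.0 - edge_pred) * edge_target).sum()
    f_beta = (1.0 + beta ** 2) * tp / ((1.0 + beta ** 2) * tp + fp + beta ** 2 * fn + 1e-8)
    return bce - lam * f_beta                                               # Eq. (8)
```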

Fig. 6

Definition of internal, interpair and external edges for a pair of vertices

4 Performance metrics for vertex finding

We quantify the vertex finding performance from 3 different perspectives: the entire jet, individual vertices, and pairs of vertices. The motivation for defining multiple metrics is that vertex finding is an intermediate step which is used for a number of other tasks related to event reconstruction. Therefore it is important to quantify the performance for a wide variety of jets with different kinds of decay topologies.

4.1 Overall jet performance

For jets as a whole, we consider the adjusted Rand index (ARI) [39]. ARI is a measure of the similarity between two set partitions. For vertex finding where the ground truth is well defined, we can treat the ARI of a jet as a “score” that tells us how well our vertex finding algorithm reproduced the ground truth partition. ARI is a normalized form of the Rand index, defined as:

$$\begin{aligned} {\mathrm {RI}} = \frac{\text {number of correct edges}}{ \text {number of edges in the set} }. \end{aligned}$$
(9)

Correct edges are edges whose label matches the label they have in the ground truth (true positives and true negatives). The adjustment of the RI is done by normalizing relative to the expectation value of the RI:

$$\begin{aligned} {\mathrm {ARI}}=\frac{{\mathrm {RI}}-{\mathbb {E}}[{\mathrm {RI}}]}{1-{\mathbb {E}}[{\mathrm {RI}}]}. \end{aligned}$$
(10)

The expectation value of the RI is defined by a choice of a random clustering model. There are several models one can adopt, described in Ref. [40]. In our case a suitable choice is the “one-sided” comparison, where the true vertex assignment is considered fixed, and the expectation value is computed assuming one draws a completely random vertex assignment for the algorithm prediction. The expression for the expectation value is therefore:

$$\begin{aligned} \mathbb {E}[{\mathrm {RI}}] = \frac{B_{N-1}}{B_{N}}\,\frac{\sum _{i} \binom{g_{i}}{2}}{\binom{N}{2}} + \left( 1- \frac{B_{N-1}}{B_{N}} \right) \left( 1- \frac{\sum _{i} \binom{g_{i}}{2}}{\binom{N}{2}} \right) \end{aligned}$$
(11)

where \(N\equiv n_{\mathrm {tracks}}\), \(B_{N}\) is the Bell number (the number of possible partitions of a set with N elements), the sum runs over the vertices i in the jet, and \(g_{i}\) is the number of tracks in the i-th vertex.

An ARI score of 1 means the algorithm found the correct cluster assignment, while 0 represents a cluster assignment that is as good as random guessing. We consider the ARI score in 3 categories: perfect (ARI of 1), intermediate (ARI between 0.5 and 1), and poor (ARI lower than 0.5).
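A direct transcription of Eqs. (9)–(11) into code, as a minimal sketch (the labels are assumed to be per-track cluster indices; no optimized library routine is used):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def bell(n):
    """Bell number: the number of possible partitions of a set with n elements."""
    return 1 if n == 0 else sum(comb(n - 1, k) * bell(k) for k in range(n))

def adjusted_rand_index(pred_labels, true_labels):
    """One-sided ARI: the true vertex assignment is fixed and the expectation of the
    RI is taken over uniformly random predicted partitions (Eq. 11)."""
    n = len(true_labels)
    pairs = comb(n, 2)
    # Rand index: fraction of track pairs whose edge label matches the ground truth (Eq. 9)
    correct = sum((pred_labels[i] == pred_labels[j]) == (true_labels[i] == true_labels[j])
                  for i in range(n) for j in range(i + 1, n))
    ri = correct / pairs
    # expected RI under a random partition; g_i are the true vertex sizes
    sizes = [list(true_labels).count(v) for v in set(true_labels)]
    p_same_true = sum(comb(g, 2) for g in sizes) / pairs
    p_same_rand = bell(n - 1) / bell(n)   # probability a random partition joins a given pair
    e_ri = p_same_rand * p_same_true + (1 - p_same_rand) * (1 - p_same_true)
    return (ri - e_ri) / (1 - e_ri)       # Eq. (10)
```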

4.2 Vertices and vertex-pairs performance

Fig. 7

Jet classification model. The vertex finding module contains either one of the neural network models described in Sect. 3, or the predictions produced by the baseline AVR algorithms, pre-computed on the training dataset. If a pre-trained network is used in the vertex finding module, its weights are frozen during the training of the jet classifier

Instead of looking at an entire jet, we can consider subsets of the jet – individual vertices and all possible vertex pairs. We distinguish between internal, external, and interpair edges. Figure 6 illustrates the definition. Internal edges connect tracks inside a vertex, interpair edges connect tracks in one vertex to tracks in the other vertex (this definition is only relevant for vertex pairs), and external edges connect tracks from the vertex/vertex pair to other tracks in the jet. Note that “external edges” refers to edges that are connected at only one end to one of the tracks in the subset under consideration (vertex or vertex pair) – not to all edges that are external to the subset. Considering a specific vertex, or a pair of vertices, we can compute separately the accuracy for each type of edge:

$$\begin{aligned} \text {Accuracy}_{\text {edge type}}= \frac{\text {correct edges}}{ \text {number of edges of that type}} \end{aligned}$$
(12)

where for internal edges, correct edges are those predicted to be connected by the algorithm, and for the other types, correct edges are those predicted to be disconnected.
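A sketch of these per-type accuracies for a vertex pair, following the definitions above (the boolean edge-matrix layout is an assumption of this example):

```python
def vertex_pair_accuracies(pred, n_tracks, v1, v2):
    """pred: symmetric boolean matrix of predicted edges over all tracks in the jet;
    v1, v2: lists of track indices for the two true vertices of the pair."""
    others = [k for k in range(n_tracks) if k not in v1 + v2]
    internal = [(i, j) for v in (v1, v2) for i in v for j in v if i < j]
    interpair = [(i, j) for i in v1 for j in v2]
    external = [(i, k) for i in v1 + v2 for k in others]

    def accuracy(edges, should_be_connected):
        correct = sum(pred[i][j] == should_be_connected for i, j in edges)
        return correct / len(edges) if edges else 1.0

    return {"internal": accuracy(internal, True),     # correct if predicted connected
            "interpair": accuracy(interpair, False),  # correct if predicted disconnected
            "external": accuracy(external, False)}    # correct if predicted disconnected
```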

Table 2 Comparing vertex finding performance from three perspectives: jet, vertex and vertex-pair. See Sect. 4 for the definitions of the various metrics. The mean for each metric, split by jet flavor, is shown for the S2G, AVR and TP algorithms. The S2G model outperforms or equals the other algorithms, maintaining the baseline AVR high accuracy for light jets with significant improvements for b and c jets

We can also multiply the different kinds of accuracies to compute an overall accuracy for the vertex/vertex-pair in question (see footnote 3).

For individual vertices, we can evaluate the accuracy as a function of any vertex property we deem important, for example the number of tracks in the vertex. For vertex pairs, an important metric is the performance as a function of the distance between the two vertices. It is expected that as the distance between vertices decreases, accurate vertex finding becomes more difficult, and nearby vertices will be merged. The vertex pair performance metrics allow us to quantify that.

5 Impact on jet classification

In order to assess the impact of improved vertex finding on jet classification, we trained a classifier that took the edge classification prediction of the different algorithms as input, along with the track and jet features. The classifier predicts whether the jet is a bottom, charm or light jet. The architecture for jet classification is illustrated in Fig. 7. A vertex finding module (either AVR, or one of the neural network models) is used to produce an edge prediction for the input set of tracks, which is added to a hidden representation created by a deep set. The resulting graph is processed by a graph network [15] and the resulting graph representation is classified by an MLP. Details about the architecture and training are given in Appendix C. In this scenario, the edge predictions can be considered as a form of supervised attention for the jet classifier. The weights of the vertex finding module are frozen during training.
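A highly simplified sketch of this setup (the graph network of Appendix C is replaced here by a single edge-weighted message-passing step, `DeepSetsPhi` refers to the earlier S2G sketch, and layer sizes are placeholders):

```python
import torch
import torch.nn as nn

class JetClassifierSketch(nn.Module):
    """Classifier of Fig. 7: a frozen, pre-trained vertex finding module supplies
    edge predictions that act as supervised attention over track pairs."""
    def __init__(self, vertex_finder, d_in=10, d_hidden=64, n_classes=3):
        super().__init__()
        self.vertex_finder = vertex_finder
        for p in self.vertex_finder.parameters():
            p.requires_grad = False                        # frozen during classifier training
        self.track_encoder = DeepSetsPhi(d_in, d_hidden)   # deep set, as in the S2G sketch
        self.edge_mlp = nn.Sequential(nn.Linear(2 * d_hidden + 1, d_hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU(),
                                  nn.Linear(d_hidden, n_classes))

    def forward(self, x):                                  # x: (n_tracks, d_in)
        edge_scores = self.vertex_finder(x)                # (n, n) supervised attention map
        h = self.track_encoder(x)                          # (n, d_hidden)
        n = h.shape[0]
        hi = h.unsqueeze(1).expand(n, n, -1)
        hj = h.unsqueeze(0).expand(n, n, -1)
        edges = torch.cat([hi, hj, edge_scores.unsqueeze(-1)], dim=-1)
        graph_repr = self.edge_mlp(edges).mean(dim=(0, 1)) # pooled graph representation
        return self.head(graph_repr)                       # logits for b / c / light
```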

The baseline classification performance is given by training the same model with an untrained S2G vertex finding module. This baseline model has the ability to reach the same performance as the model with the pre-trained S2G network, as it is an identical network. However, it is trained only with the classification objective, with the vertex finding module and the rest of the network trained together. This baseline therefore shows whether an unsupervised attention mechanism can reach similar classification performance, which would require it to identify the relevant features in the data without guidance.

6 Results

The vertex finding results are summarized in Table 2. The S2G model outperforms AVR in all jet performance metrics. The improvement is significant (about a 20% increase in ARI) for b and c jets, while for light jets the same high performance is maintained. The ARI distribution for the different flavors is shown in Fig. 8 – while there is still a substantial amount of poorly reconstructed jets (with ARI < 0.5), there are more than twice as many perfectly reconstructed b and c jets compared to AVR. In Fig. 9 the mean ARI is shown as a function of both the number of tracks and the number of vertices in the jet. For b jets, there is a very large improvement in jets with a small number of tracks, but the advantage over AVR is maintained across the entire range. The AVR algorithm outperforms S2G only for b and c jets that have a single vertex, which are very rare in the dataset.

Fig. 8

ARI scores for the different flavors of jets. We consider 3 categories: Perfect – jets with an ARI score of exactly 1, Intermediate – a score between 0.5 and 1 and Poor – scores below 0.5

Fig. 9

Mean ARI scores for the different flavors of jets as a function of the jet properties

When considering vertex and vertex-pair metrics, for bottom and charm jets the mean internal accuracy for S2G is within 1% of the baseline, and a large increase (between 10 and 20%) is achieved for external and interpair accuracy. Figure 10 shows the performance for vertices as a function of vertex size (i.e., the number of tracks in the vertex). The S2G algorithm maintains an advantage over the full range of vertex sizes. The S2G model has a similar internal accuracy to the baseline, but a 10% increase in external accuracy for smaller vertices.

Table 3 Jet flavor classification performance metrics. The model with the pre-trained S2G vertex finding module outperforms the other algorithms overall
Fig. 10

Vertex performance as a function of the vertex size. Internal, external and combined accuracy are defined in Sect. 4.2

Figure 11 shows the performance for vertex pairs, as a function of the distance between the vertices. Again the S2G shows a promising ability to separate vertices even when the distance between them approaches 0. The performance increase of about 10% in combined accuracy comes from the improvement in interpair and external accuracy, i.e., less merging of vertices.

Comparison to neural network baselines. Both the TP and RNN algorithms have a lower ARI by about 5–10% compared to the S2G model for b and c jets. S2G also outperforms both baselines in vertex and vertex-pair combined accuracy. From Fig. 8 we can see that S2G has the highest percentage of perfectly reconstructed jets, and Figs. 9, 10 and 11 show that this advantage is maintained across the entire dataset.

Fig. 11

Vertex pair accuracy as a function of distance between the vertices. The internal accuracy is shown for both smaller vertex (the vertex with fewer tracks, vertex #1) and the larger vertex (vertex #2)

Impact on jet classification. The results for jet classification are shown in Table 3. The pre-trained S2G classifier outperforms the AVR based classifier by over 10% in terms of overall accuracy, with the most significant gain coming from the increased rejection of light jets (an increase in light jet F1 from 40% to 69%). The neural network baseline with an S2G based vertexing module that is trained only towards the classification objective shows better performance than the AVR and track pair based algorithms. This indicates that the network is able to learn some important features of the data by itself. The RNN and S2G based models have similar performance, with the S2G model outperforming the RNN in particular in c jet identification.

7 Conclusions

We proposed training a neural network to perform vertex finding, using supervised learning. We found that it outperforms standard techniques for multiple performance metrics of vertex reconstruction, and shows a promising increase in performance for nearby vertices.

We utilized the Set2Graph model, a simple equivariant and universal model of functions from sets to graphs. We showed that the model’s universality and equivariance were both important. The universality was needed to properly learn the vertex finding task, by taking into account information from all tracks in the jet. Equivariance was a useful inductive bias, resulting in better performance compared to a recurrent neural network, which could in theory learn the same function as the S2G model. We evaluated the impact of the improved accuracy in vertex reconstruction on jet classification by training a classifier that used the vertex finding predictions as input, as a sort of supervised attention mechanism. We found that improved vertex finding leads to improved classification. The supervised attention mechanism led to better results compared to an identical model with unsupervised attention. The universal models (S2G and RNN) had the best performance, however the equivariance of S2G gave it a slight advantage over the RNN.

Future work may explore the application of this technique to more complicated decays such as boosted Higgs to (bb/cc), and apply it to more realistic datasets that include full detector simulation and pileup interactions.