Secondary Vertex Finding in Jets with Neural Networks

Jet classification is an important ingredient in measurements and searches for new physics at particle coliders, and secondary vertex reconstruction is a key intermediate step in building powerful jet classifiers. We use a neural network to perform vertex finding inside jets in order to improve the classification performance, with a focus on separation of bottom vs. charm flavor tagging. We implement a novel, universal set-to-graph model, which takes into account information from all tracks in a jet to determine if pairs of tracks originated from a common vertex. We explore different performance metrics and find our method to outperform traditional approaches in accurate secondary vertex reconstruction. We also find that improved vertex finding leads to a significant improvement in jet classification performance.


Introduction
Identifying jets containing bottom and charm hadrons and separating them from jets that originate from lighter quarks, is a critical task in the LHC physics program, referred to as "flavor tagging". Bottom and charm jets are characterized by the presence of secondary decays "inside" the jet -the bottom and charm hadrons will decay several millimeters past the primary interaction point (primary vertex), and only stable outgoing particles will be measured by the detector. Figure 1 illustrates a typical bottom jet decay, with two consecutive displaced vertices from a bottom decay (blue lines) and charm decay (yellow lines).
Existing flavor tagging algorithms use a combination of low-level variables (the charged particle tracks, reconstructed secondary vertices), and high-level features engineered by experts as input to neural networks of various architectures in order to perform jet flavor classification [1].  Vertex reconstruction can be separated into two tasks, vertex finding, and vertex fitting [2]. Vertex finding refers to the task of partitioning the set of tracks, and vertex fitting refers to estimating the vertex positions given each sub-set of tracks. Existing algorithms typically use an iterative procedure of finding and fitting to perform both tasks together. We focus on using a neural network for vertex finding only. Vertex finding is a challenging task because of two factors: -Secondary vertices can be in close proximity to the primary vertex, and to each other, within the measurement resolution of the track trajectories. -The charged particle multiplicity in each individual vertex is low, typically between 1 and 5 tracks.
Vertex reconstruction is in essence an inverse problem of a complicated noisy (forward) function: Particle Decay → Particle Measurement in Detector (1) Neural networks can find a model for this inverse problem without expert intervention by using supervised learning, i.e., by providing many examples of the forward process, which can be provided by simulations. They can also be easily optimized by retraining without expert intervention. Particle colliders may have different modes of operation during their lifetime, such as the LHC increasing its collision energy over the years. Different data taking conditions require re-optimizing reconstruction algorithms, and neural networks provide a simple way to perform that reoptimization.
Since the set of tracks to be partitioned has no inherent order, we use an equivariant 1 neural network architecture. We show in this paper that this constraint on the model results in better performance.
We first describe the dataset on which we test our proposed algorithm in Section 2. The model architecture and the baseline algorithms are described in Section 3. Section 4 discusses the performance metrics defined for vertex finding. Section 5 describes how the impact of vertex finding on jet classification was assessed, and the results are presented in Section 6. Conclusions are given in Section 7.

Background
Standard vertex reconstruction algorithms. Existing vertex reconstruction techniques are based on the geometry of the tracks, or a combination of the geometry and constraints that are configured by hand to match a specific particle decay pattern [3]. In order to handle finding and fitting multiple vertices, a standard algorithm is adaptive vertex reconstruction (AVR) [2,4,5]. The basic concept of AVR is to perform a least squares fit of the vertex position given all the tracks, then remove less compatible tracks from the fit, and refit those tracks again to more vertices. This repeats until no tracks are left. AVR can be used to first fit the primary vertex with special considerations for its unique properties, and subsequently fit secondary vertices. In this paper it is used as a general multi-vertex fitter, applied only to tracks associated to a single jet.
Deep learning on sets and graphs. Following the successful application of deep learning to images [6,7], there is an ongoing research effort aimed at applying deep learning to other data structures such as unordered sets [8][9][10] and graphs [11][12][13][14]. Typical learning tasks for such domains are point-cloud classification for sets, or molecule property prediction, for graphs. A challenge in both scenarios stems from the arbitrary order of the elements in the set or the nodes in the graph. Fully connected, convolutional and recurrent networks do not have the correct inductive bias for learning tasks on unordered sets [15]. They assume a fixed size or an ordering in the data. A popular design principle for networks that process such unordered data is constraining layers to be equivariant or invariant to the reordering operation. By using only equivariant layers the neural networks is constrained to represent only equivariant functions.
Recently, the Set2Graph (S2G) model [16] was proposed as a simple, equivariant model for learning tasks in which the input is an arbitrarily ordered set of n elements and the output is an n × n matrix that represents their pairwise relations. The S2G model was proved to be universal, meaning it can approximate any equivariant function from a set to a graph. We use this model in this paper.
Deep learning for particle physics. Neural networks that operate on sets have been used recently in a number of particle physics applications [17]. The data structure of an unordered set is a natural description for most particle physics reconstruction tasks, and recent progress in the field of graph neural networks [15] has prompted many new applications. For the problem of track reconstruction, a graph neural network was used to classify the paths between adjacent detector "hits" [18,19]. This is a similar application to vertex finding since the end result must be a partition of the set of hits to different tracks. Other applications of graph neural networks to partitioning sets of objects include particle reconstruction in calorimeters and liquid argon time projection chambers [20][21][22][23]. Direct jet classification has also been proposed with a few different variants of message passing networks [24][25][26][27][28][29][30][31].

Data
We test the proposed algorithm on a simulated dataset 2 . The dataset consists of jets sampled from pp → tt events at √ s = 14 TeV. The events are generated with PYTHIA8 [32] and a basic detector simulation is performed with DELPHES [33], emulating a detector similar to ATLAS [34]. charged particle tracks are represented by 6 perigee parameters (d 0 , z 0 , φ , cotθ , p T , q) and their covariance matrix. Noise is added to the track perigee parameters with Gaussian smearing. The track parameters resolution depends on the transverse momentum p T and pseudorapidity η of the track in a qualitatively similar way to the measurements reported in [34]. The covariance matrix is diagonal in this simplified track smearing model-the smearing is done independently for each parameter with no correlated effects.
Jets are constructed from calorimeter energy deposits with the anti-k T algorithm [35] with a distance parameter of R = 0.4. Charged tracks are cone associated to jets with a ∆ R < 0.4 cone around the jet axis. The flavor labeling of jets (as bottom, charm or light) is done by matching weakly decaying bottom and charm hadrons to the jet with a ∆ R cone of size 0.3.
A basic jet selection is applied, requiring jets have p T > 20 GeV and |η| < 2.5 The input to the vertex finding algorithms is the set of tracks associated to each jet, the jet p T , η, φ and jet mass.
Dataset composition. The properties of secondary vertices, such as their distance from the primary vertex, depend on the jet flavor but also on p T , η, and number of tracks (n tracks ). However, the distribution of those parameters is different for the different flavors, depending on the process used to generate the sample. The dataset is therefore built by sampling equal numbers of jets from each flavor in each (p T , η, n tracks ) bin, as illustrated in Figure 2a. For each bin, the flavor with the least amount of jets (usually c jets) in that bin determines the number of jets from the other flavors that are sampled. Figure 2b shows the resulting distribution of the number of vertices in each jet flavor, and Figure 2c shows the distribution of p T , η, and n tracks for all the flavors. The dataset is split into training (500k jets), validation, and testing datasets (100k jets each).

Vertex Finding Algorithms
We compare 4 different algorithms.
-Recurrent neural network (RNN) model. AVR serves as the baseline, and represents the existing vertex reconstruction algorithms. The S2G model is our universal equivariant model. The TP and RNN algorithms are baseline neural networks that are similar to S2G but remove one of its important properties: The TP algorithm is not universal, while the RNN is not equivariant. The architectures of all models are described below.

Adapative vertex reconstruction
We use adaptive vertex reconstruction as implemented in the RAVE software package [4]. This algorithm is a representative of existing (non neural network based) methods. The input to the algorithm is the set of tracks associated to the jet and their covariance matrix. The output is a set of vertices, and a set of track-to-vertex association weights. The algorithm can associate a track to more than one vertex. To convert this output into an unambiguous partition, each track is assigned to the vertex to which it has the highest weight. There are hyperparameters that control the iterative fitting or finding procedure such as cuts on the track-to-vertex weight for removing outliers, and these were scanned to find the most performant set of cuts based on the Rand index (defined in Section 4.1). Additional details about the hyperparameter optimization are given in Appendix A.

Set2Graph Neural Network
For the neural network training, the vertex finding task is cast as an edge classification task, as illustrated in Figure 3. The S2G network is built as a composition of 3 modules, ψ • β • φ : a set-to-set component, φ , a broadcasting layer β and a final edge classifier ψ. Here we give only a high level description of what each module does and its purpose, the specific model details are given in Appendix B. The model architecture is illustrated in Figure 4.
The set-to-set component φ takes as input the matrix of size n tracks × d in . The output of φ is a hidden representation vector for each track, with size n tracks × d hidden . φ is where information is exchanged between tracks and it is implemented as a deep sets [8] network.
The broadcasting layer β constructs a representation for each ordered pair of tracks (directed edge) using the output of φ . The edge representation is simply a concatenation of the representations of the two tracks, with the sum of all track representations, resulting in an output of size (n track (n track − 1)) × 3d hidden .
The edge classifier ψ is an MLP that operates on the edges to produce an edge score. This edge score is trained according to the target defined in Figure 3. During inference (after the training is complete) the edge scores are symmetrized, so for an unordered track pair the edge score s i j is: Where σ is the sigmoid function.

Neural network baselines
The neural network baselines are meant to check the importance of the properties of the S2G model. The models have  Table 1. The TP classifier is not a universal model. It will allow us to quantify the contribution of the information exchange between tracks to the overall vertex finding performance. As illustrated in Figure 5, the hidden representation created for each track by the deep set module is conditional on the other tracks in the jet. We expect that for the task of vertex finding, being aware of all tracks is important, as the probability of a track pair being connected is conditional on the presence or lack of additional tracks nearby.
The TP classifier checks this assumption about the data. If the probability of each track pair is conditional only on the properties of the track pair, this algorithm will perform as well as the S2G model. It is still expected to perform reasonably well, as it can still learn to join together tracks based on their geometry alone.
The deep set based φ layer is replaced by an MLP applied to each track in the jet (independently from the other tracks) to produce some hidden vector representation of that track. While a deep set has been proven to be universal (can approximate any function from sets to sets) [37] applying elementwise MLP is not universal for permutation equivariant functions.
Additionally, the broadcasting layer β does not use the sum of the track hidden representations. The ψ network operates only on the pair of track hidden representations. Therefore in the TP classifier there is no information exchange between the track pairs-each track pair is classified independently.
In the RNN model the φ deep set component is replaced by a stack of bi-directional GRU layers [38]. Each GRU layer processes the sequence of track representations, sorted by the track transverse momentum. The layer output is a concatenation of the sequence of hidden representations from both directional passes of the GRU, therefore each track hidden representation still contains information from all other tracks in the jet. This model can theoretically learn any function that the S2G model can, but its architecture is not equivarient. This model will show if the equivariance is a useful  Figure 3). During inference the output of the edge classifier is symmetrized to produce an edge score. The edge scores are used to define the set partition by optimizing the partition score, as described in 3.4 inductive bias for this task. Additionally, the sequential nature of the RNN leads to a slower inference time compared to the S2G and TP models (see Table 1).

Inference
The network output needs to be converted into a cluster assignment for the tracks. If an edge tracks i → j is connected, and track j is connected to track k, then the edge between i → k must also be connected, regardless of its edges score. This could lead to a situation where many edges with low edge scores are artificially connected. Therefore we utilize the partition score optimization algorithm proposed by the authors of [21]. Track pairs whose score (eq. 2) is above a threshold of 0.5 are considered in sequence of decreasing score, and are "connected" only if their addition decreases the partition score: where δ i j is 1 if track i and track j are assigned to the same cluster. In other words, if the connection of two tracks leads to an indirect connection between tracks with low edge scores, the connection is rejected.

Training procedure and Loss function
We train the network f to perform edge predictions, i.e., predicting the probability of each pair of input tracks to originate from the same vertex. For a jet with n tracks we therefore predict n tracks (n tracks − 1) edge scores. We train the network f with the edge predictions before the symmetrization step, which results in n tracks (n tracks − 1)/2 edge scores.
In terms of edge classification, it is import to balance the false positive and negative rates. We initially trained the network with a standard binary cross entropy (BCE) loss function: whereŷ edge is the edge predicted value, between 0 and 1, and y edge is the truth edge label (0 or 1). The sum is over all edges in a single jet.
Training with BCE loss function resulted in a high number of false negatives. We therefore introduced a loss function based on the F β score, defined as: with TP, FP, FN the true positives, false positives and false negatives respectively. The F β score is not differentiable. Quantities such as true positives are defined by functions that contain non differentiable conditions, for example: true positives ≡ ∑ edges (ŷ edge > threshold)y edge (6) To compute a differentiable F β loss, denoted as F * β these quantities are approximated as differentiable functions: However, training with the F * β loss only was unstable. Given the random weight initialization of the network, the training would sometimes fail to converge. A combined loss of BCE and F1 was finally used: λ and β are hyperparameters that control the balance between false negatives and false positives.

Performance Metrics for Vertex Finding
We quantify the vertex finding performance from 3 different perspectives: The entire jet, individual vertices and pairs of vertices. The motivation for defining multiple metrics is that vertex finding is an intermediate step which is used for a number of other tasks related to event reconstruction. Therefore it is important to quantify the performance for a wide variety of jets with different kind of decay topologies.

Overall Jet Performance
For jets as a whole, we consider the adjusted Rand index (ARI) [39]. ARI is a measure of the similarity between two set partitions. For vertex finding where the ground truth is well defined, we can treat the ARI of a jet as a "score" that tells us how well our vertex finding algorithm reproduced the ground truth partition. ARI is a normalized form of the Rand index, defined as: RI = number of correct edges number of edges in the set (9) Correct edges are edges whose label matches the label they have in the ground truth (true positives and true negatives). The adjustment of the RI is done by normalizing relative to the expectation value or the RI: The expectation value of the RI is defined by a choice of a random clustering model. There are several models one can adopt, described in Ref [40]. In our case a suitable choice is the "one-sided" comparison, where the true vertex assignment is considered fixed, and the expectation value is computed assuming one draws a completely random vertex assignment for the algorithm prediction. The expression for the expectation value is therefore: (11) where N ≡ n tracks , B N is the bell number (the number of possible partitions of a set with N elements), the sum is over the i vertices in the jet and g i is the number of tracks in the i-th vertex.
An ARI score of 1 means the algorithm found the correct cluster assignment, while 0 represents a cluster assignment that is as good as random guessing. We consider the ARI score in 3 categories: perfect (ARI of 1), intermediate (ARI between 0.5 and 1), and poor (ARI lower than 0.5).

Vertices and Vertex-Pairs Performance
Instead of looking at an entire jet, we can consider subsets of the jet-individual vertices and all possible vertex pairs. We distinguish between internal, external, and interpair edges. Figure 7 illustrates the definition. Internal edges connect tracks inside a vertex, Interpair edges connect tracks in one vertex to tracks in the other vertex (this definition is only relevant for vertex pairs), and external edges connect tracks from the vertex/vertex pair to other tracks in the jet. Note that "external edges" refers to edges that are connected only at one end to one of the tracks in the subset under consideration (vertex or vertex pair)-not to all edges that are external to the subset. Considering a specific vertex, or a pair of vertices, we can compute separately the accuracy for each type of edge: Accuracy edge type = correct edges number of edges of that type (12) where for internal edges, correct edges are those predicted to be connected by the algorithm, and for the other types, correct edges are those predicted to be disconnected.
We can also multiply the different kinds of accuracies to compute an overall accuracy for the vertex/vertex-pair in question 3 .
For individual vertices, we can evaluate the accuracy as a function of any vertex property we deem important, for example the number of tracks in the vertex. For vertex pairs, an important metric is the performance as a function of the distance between the two vertices. It is expected that as the distance between vertices decreases, accurate vertex finding becomes more difficult, and nearby vertices will be merged. The vertex pair performance metrics allow us to quantify that.

Impact on Jet Classification
In order to asses the impact of improved vertex finding on jet classification, we trained a classifier that took the edge classification prediction of the different algorithms as input, along with the tracks and jet features. The classifier predicts if the jet is a bottom, charm or light jet. The architecture for jet classification is illustrated in Figure 6. A vertex finding module (either AVR, or one of the neural network models) is used to produce an edge prediction for the input set of tracks, which is added to a hidden representation created by a deep set. The resulting graph is processed by a graph network [15] and the resulting graph representation is classified by an MLP. Details about the architecture and training are given in Appendix C. In this scenario, the edge predictions can be considered as a form of supervised attention for the jet classifier. The weights of the vertex finding module are frozen during training.  The baseline classification performance is given by training the same model with an untrained S2G vertex finding module. This baseline model has the ability to reach the same performance as the model with the pre-trained S2G network, as it is an identical network. However it is trained only with the classification objective, where both vertex finding module and the rest of the network are trained together. This baseline therefore shows if an unsupervised attention mechanism can reach similar classification performance, which would require it to identify the relevant features in the data without guidance.

Results
The vertex finding results are summarized in Table 2. The S2G model outperforms AVR in all jet performance metrics. The improvement is significant (about 20% increase in ARI) for b and c jets, while for light jets the same high performance is maintained. The ARI distribution for the different flavors is shown in Figure 8 -while there is still a substantial amount of poorly reconstructed jets (with ARI < 0.5) there are more than twice as many perfectly reconstructed b and c jets compared to AVR. In Figure 9 the mean ARI is shown as a function of both the number of tracks, and the number of vertices in the jet. For b jets, there is a very large improvement in jets with a small number of tracks, but the advantage over AVR is maintained across the entire range. The AVR algorithm outperforms S2G only in b and c jets which have only one vertex, which are very rare in the dataset.
When considering vertex and vertex-pair metrics, for bottom and charm jets the mean internal accuracy for S2G is within 1% of the baseline, and a large increase (between 10  We consider 3 categories: Perfect-jets with an ARI score of exactly 1, Intermediate-a score between 0.5 and 1 and Poor-scores below 0.5 to 20%) is achieved for external and inter-pair accuracy. Figure 10 shows the performance for vertices, as a function of vertex size (i.e., number of tracks in the vertex). The S2G algorithm maintains an advantage over the full range of vertex sizes. The S2G model has a similar internal accuracy to the baseline, but a 10% increase in external accuracy for smaller vertices. Figure 11 show the performance for vertex pairs, as a function of the distance between the vertices. Again the S2G shows a promising ability to separate vertices even when the distance between them approaches 0. The performance increase of about 10% in combined accuracy comes from the improvement in interpair and external accuracy, i.e., less merging of vertices.  Comparison to neural network baselines Both the TP and RNN algorithms have a lower ARI by about 5 to 10% compared to the S2G model for b and c jets. S2G also outperforms both baselines in vertex and vertex-pair combined accuracy. From Figure 8 we can see that S2G has the highest percentage of perfectly reconstructed jets, and and Figures 9, 10 and 11 show that this advantage is maintained across the entire dataset.
Impact on jet classification The results for jet classification are shown in Table 3. The pre-trained S2G classifier outperforms the AVR based classifier by over 10% in terms of overall accuracy with the most significant gain coming from the increased rejection of light jets (an increase in light jet

Conclusions
We proposed training a neural network to perform vertex finding, using supervised learning. We found that it outperforms standard techniques for multiple performance metrics of vertex reconstruction, and shows promising increase in performance for nearby vertices.
We utilized the Set2Graph model, a simple equivariant and universal model of functions from sets to graphs. We showed that the model's universality and equivariance were both important. The universality was needed to properly learn the vertex finding task, by taking into account information from all tracks in the jet. Equivariance was a useful inductive bias, resulting in better performance compared to recur-rent neural network which could in theory learn the same function as the S2G model. We evaluated the impact of the improved accuracy in vertex reconstruction on jet classification by training a classifier that used the vertex finding predictions as input, as a sort of supervised attention mechanism. We found that improved vertex finding lead to improved classification. The supervised attention mechanism lead to better results compared to an identical model with un-supervised attention. The universal models (S2G and RNN) had the best performance, however the equivariance of S2G gave it a slight advantage over the RNN.
Future work may explore the application of this technique to more complicated decays such as boosted Higgs to (bb/cc), and apply it to more realistic datasets that include full detector simulation and pileup interactions.  The AVR algorithm in RAVE [4] has three main parameters that can be adjusted by the user --Primary vertex significance cut -Secondary vertex significance cut minimum weight for a track to stay in a fitted vertex The values for there parameters were scanned in a grid between 0.1 to 10 for the significance cuts (33 equally spaced values) and between 0.1 to 0.8 for the minimum weight (10 values). For each possible value of the parameters, the mean RI was computed for each of the 3 flavors in the training dataset. The values of the b, c and light jet RI are shown in figure 12. The working point that was chosen had the highest b jet RI with a mean light jet RI above 0.95: -Primary cut: 2.5 -Secondary cut: 2.5 minimum weight: 0.2

Appendix B: Model Architecture and Training Details
Hyperparameter tuning and ablation studies The optimization of the model hyperparameters and architecture used in this paper are described in detail in the supplementary material of [16]. Below we describe the architecture for the final optimized model. The attention block in the deep set layer is a key/query attention [41,42]: Where X is the n × d in input, f 1 , f 2 are the key and query MLPs of width d small = d in /10.
If we describe the stack of deep set layers by their output dimension d out , the φ module layer dimensions are: The edge classifier component ψ takes in the n · (n − 1)×(5·3) output of the broadcasting layer, and uses a single hidden layer MLP with output dimensions (256, 1).
Baseline TP Classifier. The MLP that replaces the deep set layers has the following output sizes: φ TP output dimensions = (384, 384, 384, 384, 5) (B. 3) The edge classifier component ψ is identical expect its input size is now 5 · 2 instead of 5 · 3 due to the absence of the sum in the broadcasting layer. Each GRU layer is bi directional. Each direction results in a hidden representation of size d out /2, and the results are concatenated.
Training Hyperparameters We used a batch size of 2048, Adam optimizer [43] with learning rate of 10 −3 . Training takes place in less than 2 hours on a single Tesla V100 GPU. The training is stopped when the validation loss stops does not decrease for 20 epochs.
The model, illustrated in Figure 6 is composed of four components: -Deep set network -Vertex finding module, -Graph network [15]. The deep set creates a hidden representation for each track in the input.
Vertex finding module This is either the AVR pre-computed vertex assignment, or one of the vertex finding networks. The output of this module is an edge prediction e i j between any two tracks in the input set.
The graph network creates a hidden representation for the tracks based on the output of the deep set and the vertex finding module, which is treated as edge features for the fully connected graph of tracks.
The graph network is composed of a sequence of GN blocks, each with an edge update and node update MLP. where h t i is the ith node hidden representation at step t, g t is the global representation of the graph (sum of all node hidden representations), E t and U t are the edge and node update MLPs for layer t of the graph network and e i j is the edge prediction given by the vertex finding module for the edge between node i and j. N(i) is the node neighborhood. In this model the graph is always fully connected, so the node neighborhood contains all the nodes in the graph. The edge update MLP has linear layers with sizes: E t dimensions = (126 · 3 + 1, 100, 20) (C.9) The node update MLP has linear layers with sizes: U t dimensions = (126 + 20, 100, 126) (C.10) The graph network has 3 such GN blocks.
The jet classifier MLP takes as input the sum of track hidden representations and the jet features (p T , η, φ , jet mass). It predicts if the jet is a b,c or light jet. The model is trained with a batch size of 1000, Adam optimizer and a learning rate of 5 · 10 −4 , and cross entropy loss. Training takes less than 2 hours on single Tesla V100 GPU. The training is stopped when the validation loss stops does not decrease for 20 epochs.