1 Introduction

Charged particle tracking plays an essential role in High-Energy Physics (HEP), including particle identification and kinematics, vertex finding, lepton reconstruction, and flavor jet tagging. At the core of particle tracking is a pattern recognition algorithm that must associate a list of 2D or 3D position measurements from a tracking detector (known as hits or spacepoints in the literature) to a list of particle track candidates (or tracks). A track is defined as a list of spacepoints associated by the pattern recognition to a charged particle.

The number of particle track candidates varies significantly from one experiment setup to another. For example, in a High-Luminosity LHC (HL-LHC) [1] collision event, due to the pile-up of multiple proton–proton collisions per bunch crossing, there are typically 5000 charged particles and 100,000 spacepoints, about 50% of which are associated to particles of interest.

Fig. 1

A simulated HL-LHC collision event (top) as seen by the TrackML tracking detector [2]. The detector schematic (bottom) shows the top half of the detector projected on the r-z plane. The z-axis is along the beam direction

A typical HEP offline tracking algorithm [3,4,5] has four stages: spacepoint formation, track seeding, track following, and track fitting. The spacepoint formation stage combines the detector readout cell raw data into clusters from which the spacepoint 3D coordinates, and their uncertainties, are determined. Track seeding combines spacepoints into doublet or triplet seeds. Each seed provides an initial track direction, origin, and possibly a curvature, with associated uncertainties. The track following stage adds more spacepoints to the seed by looking for matching spacepoints along the extrapolated trajectory. Finally, a track fitting stage, which may be combined with the track following, fits a trajectory through the track spacepoints to assess the track quality and measure the particle’s physical and kinematic properties (charge, momentum, origin, etc.). To avoid biasing physics results, each stage of the algorithm must have high efficiency, meaning it must identify e.g. \(> 90\%\) of the charged particles within a fiducial region (e.g. \(p_\text {T} > 1\) GeV, \(|\eta | < 4\)) as track candidates. Track seeding and track filtering must also have high purity, meaning that e.g. \(>60\%\) of the track seeds and track candidates must correspond to charged particles. High purity keeps the number of track candidates, and the associated computational costs, under control.

Online tracking algorithms may use different pattern recognition algorithms (Footnote 1) to create and filter track seeds and candidates, but share the same high efficiency requirements. Online applications also have stringent computing requirements (e.g. latency \(O(10)~\upmu \)s for LHC triggers).

The computational cost of current tracking algorithms grows faster than linearly with beam intensity and detector occupancy, as shown in Fig. 2. Given the order-of-magnitude increase in beam intensity at the HL-LHC, charged particle pattern recognition algorithms might well limit the discovery potential of HL-LHC experiments.

Fig. 2

Reconstruction wall time per event as a function of the average number of interactions per bunch crossing \(\langle \,\mu \,\rangle \). Top: ATLAS Run 2 Inner Detector reconstruction with default configurations [10]. Bottom: CMS time spent in tracking sequence for 2016 tracking, 2017 tracking with conventional seeding, and 2017 tracking with Cellular Automaton (CA) seeding [11]

Over the last two decades, tracking computational challenges arising from the increased number of combinations have been addressed by tightening fiducial regions for charged particles, developing highly optimized tracking algorithms [4, 5], and even optimizing the geometry of tracking detectors. These optimizations brought order-of-magnitude gains in tracking computational performance with limited impact on physics. While these efforts continue [12], it is unlikely that another order of magnitude can be gained through incremental optimization without impacting physics performance. Furthermore, given the computational complexity and iterative nature of current track following and filtering algorithms, it is challenging to run them efficiently on data parallel architectures like GPUs.

The TrackML challenge [2] jump-started the application of deep learning pattern recognition methods to HEP tracking. The HEP.TrkX pilot project [13] proposed the use of graph networks to filter track doublet and triplet seeds [14]. Building on that work, the Exa.TrkX project [15] has demonstrated the applicability of Geometric Deep Learning (GDL) methods [16] – specifically metric learning and Graph Neural Networks (GNN) – to particle tracking [17]. GDL is concerned with learning representations of data that have complex geometrical relationships and no natural ordering, like detector spacepoints. GDL models are computationally regular, naturally parallel and therefore well-suited to run on hardware accelerators.

This work describes new developments that enabled the first study of the computing and physics performance of the Exa.TrkX pipeline on the entire TrackML detector at HL-LHC design luminosity, a step towards the validation of the pipeline on ATLAS and CMS data.

2 Related work

Early on, the HEP.TrkX pilot project attempted to assign and regress track parameters to single spacepoints using image processing models. Subsequent attempts at estimating track parameters using image processing and recurrent networks showed promising results [18] in a simplified environment. A similar realization of the method is reported in Ref. [19], where a model processing images from successive pixel detector layers is used to produce tracklets that seed classical pattern recognition. The method yields superior seeding efficiency for tracks within jets in dense environments. The concept of using an LSTM [20] to supplement the Kalman Filter method for track following, developed by HEP.TrkX [14, 18, 21], was later found in one of the promising solutions of the accuracy phase [22] of the TrackML challenge. The task of particle tracking was also addressed with a hit-to-track assignment method using a gated recurrent unit (GRU) [23], producing promising results in sparse environments [21]. This approach was computationally constrained by its use of recurrent models.

Reference [24] applies the track finding approach developed in Ref. [25] to the whole detector by exploiting a new data-driven graph construction method and large model support in TensorFlow [26]. Reference [27] applies a similar GNN model to the task of particle-flow reconstruction. The model has a classification objective, followed by a partial regression of generator-level particle candidate kinematics. The method performs at least as well as a classical particle-flow algorithm in HL-LHC-like collision conditions. As part of the Exa.TrkX project, graph networks are used for LArTPC track reconstruction [28]. Reference [29] explores the opportunity to implement Exa.TrkX-inspired graph networks on FPGAs. Starting from the input stage of the Exa.TrkX pipeline, Ref. [30] studies the impact of cluster shape information on track seeding performance. In Ref. [31], metric learning is used to improve the purity in spacepoint buckets formed using similarity hashing. With the advent of quantum computers of increasing size came the development of quantum machine learning techniques, also applied in particle physics [32]. In particular, inspired by the Exa.TrkX team’s use of GNNs for charged particle tracking, quantum graph networks have been tested on the same problem [33].

3 Methodology

3.1 Input data

This study is based on the TrackML dataset, which uses a Monte Carlo simulation of top quark pair production from proton–proton collisions at the HL-LHC. To simulate the effect of event pileup and produce realistic detector occupancy, a Poisson-distributed number (with \(\mu =200\)) of QCD “minimum bias” events is overlaid on top of the \(t\bar{t}\) collisions.

The TrackML detector is a set of concentric cylindrical layers of pixelated sensors (the barrel) complemented by a set of circular disks (the endcaps) to ensure nearly \(4\pi \) coverage in solid angle, as pictured in Fig. 1. Figure 3 shows the spatial distribution of the spacepoints of a typical event. One notable feature of this dataset is the inclusion of “noise” spacepoints, added as a proxy for various low-momentum particle interactions and detector effects which would otherwise require more expensive and detailed simulations.

Fig. 3

A typical event distribution of spacepoints projected on the x-z plane, parallel to the beam direction (left), and the x-y plane, orthogonal to the beam direction (right)

3.2 The Geometric Deep Learning Pipeline

This paper updates the methodology previously presented in Ref. [17] to a fully-learned pipeline, where both graph construction and graph classification are trained. This section describes the pipeline (represented schematically in Fig. 4) used to obtain the results in Sect. 4. Details of the latest model design, parameter choices, and technical optimizations are discussed in Sect. 5.

Fig. 4

Stages of the TrackML track formation inference pipeline. Light red boxes are trainable stages

The pipeline currently used to reconstruct tracks from a pointcloud of spacepoints requires six discrete stages of processing and inference. These broadly consist of a preprocessing stage, three stages required to construct a spacepoint graph, and two stages required to classify the graph edges and partition them into track candidates. Each stage is trained independently (due to memory constraints) on the output of the previous stage’s inference.

First, the dataset is processed into a format suitable for model training. This includes calculating directional information and summary statistics from the charge deposited in each spacepoint, i.e. the cell features in Fig. 4. These values are appended to the cylindrical coordinates of each spacepoint to form an input feature vector to the pipeline. To apply a graph neural network to this set of data, it is necessary to arrange them into a graph. One can apply various geometric heuristics to define which spacepoints are likely to be connected by an edge (i.e. belong to the same track), but a useful technique is to train a model on the geometry of connected tracks. Thus, our second stage is to train an Embedding Network – a multi-layer perceptron (MLP) which embeds each spacepoint into an N-dimensional latent space. The graph is constructed by connecting neighboring spacepoints within a radius \(r_{\text {embedding}}\), in the latent space. We train this embedding with a pairwise hinge loss, to encourage spacepoints that belong to the same track to be close in the embedded space, according to the Euclidean metric. This allows for a highly efficient edge construction, since we do not rely on any heuristics of the detector geometry that may lead to missed edges.
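
The radius-based graph construction described above can be sketched in a few lines. This is only a brute-force illustration in plain Python: the function name `build_radius_graph` and the toy latent coordinates are hypothetical, and a quadratic loop like this would not scale to \(O(10^5)\) spacepoints, where an accelerated neighbor search is needed instead.

```python
import math

def build_radius_graph(embeddings, r_embedding):
    """Connect every pair of embedded spacepoints closer than r_embedding.

    Brute-force O(N^2) search, for illustration only.
    """
    edges = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            # Euclidean distance in the latent space
            if math.dist(embeddings[i], embeddings[j]) < r_embedding:
                edges.append((i, j))
    return edges

# Toy 2D latent space (hypothetical values): points 0 and 1 are close,
# point 2 is far away from both.
latent = [(0.0, 0.0), (0.3, 0.4), (5.0, 5.0)]
edges = build_radius_graph(latent, r_embedding=1.0)
print(edges)
```

Increasing `r_embedding` adds more candidate edges, which raises efficiency at the cost of purity.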

The edge selection at this stage is close to 100% efficient but \(O(1)\%\) pure, with a graph size of \(O(10^5)\) nodes and \(O(10^7)\) edges (the purity-efficiency trade-off can be tuned with the choice of \(r_{\text {embedding}}\)). Before running training or inference on the memory-intensive GNN, we filter these edges down with another MLP. The input to this third stage is the concatenated features on either side of each edge. That is, the Filter Network is a binary classifier applied to the set of edges. Constraining edge efficiency to remain high (above 96%) leads to much sparser graphs, of \(O(10^6)\) edges.
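
To make the efficiency and purity figures quoted for the graph edges concrete, the sketch below computes both quantities for a toy edge list. It assumes the usual definitions (efficiency as the fraction of true edges that are retained, purity as the fraction of retained edges that are true); the function name and the toy edges are hypothetical.

```python
def edge_efficiency_and_purity(selected_edges, true_edges):
    """Return (efficiency, purity) of a set of selected edges.

    Edges are treated as undirected, so (1, 2) and (2, 1) are the same edge.
    """
    selected = {frozenset(edge) for edge in selected_edges}
    truth = {frozenset(edge) for edge in true_edges}
    n_true_selected = len(selected & truth)
    efficiency = n_true_selected / len(truth)    # true edges that were kept
    purity = n_true_selected / len(selected)     # kept edges that are true
    return efficiency, purity

# Toy example: all three true edges are kept, together with three fakes.
true_edges = [(0, 1), (1, 2), (3, 4)]
selected_edges = [(0, 1), (2, 1), (3, 4), (0, 3), (1, 4), (2, 4)]
eff, pur = edge_efficiency_and_purity(selected_edges, true_edges)
print(eff, pur)
```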

The fourth stage of the pipeline is the training and inference of the graph neural network. The results presented in this work are predominantly obtained from the Interaction Network architecture, first proposed in Ref. [34]. This variety of GNN includes hidden features on both nodes and edges, which are propagated around the graph (called “message passing”) with consecutive concatenations along edges and aggregations of messages at receiving nodes. In the final layer of the network, each edge receives a binary classification as true or fake, and the network is trained with a cross-entropy loss.
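
A single message-passing iteration of the kind described above can be sketched as follows. This is a schematic illustration, not the trained network: the update functions, which would be MLPs in practice, are replaced by a toy function, edges are assumed to be directed from sender to receiver, and all names are hypothetical.

```python
def message_passing_step(node_feats, edge_feats, edges, edge_fn, node_fn):
    """One message-passing iteration with sum aggregation (needs >= 1 edge)."""
    # Edge update: concatenate sender, receiver, and current edge features.
    new_edge_feats = [
        edge_fn(node_feats[s] + node_feats[r] + e)
        for (s, r), e in zip(edges, edge_feats)
    ]
    # Aggregation: sum the updated edge features arriving at each receiver.
    dim = len(new_edge_feats[0])
    aggregated = [[0.0] * dim for _ in node_feats]
    for (_, r), message in zip(edges, new_edge_feats):
        for k, value in enumerate(message):
            aggregated[r][k] += value
    # Node update: concatenate each node with its aggregated messages.
    new_node_feats = [node_fn(x + agg) for x, agg in zip(node_feats, aggregated)]
    return new_node_feats, new_edge_feats

def toy_update(features):
    """Stand-in for an MLP: sums its inputs into a single feature."""
    return [sum(features)]

nodes = [[1.0], [2.0], [3.0]]
edges = [(0, 1), (1, 2), (0, 2)]          # (sender, receiver)
edge_feats = [[0.0], [0.0], [0.0]]
new_nodes, new_edges = message_passing_step(
    nodes, edge_feats, edges, toy_update, toy_update
)
print(new_nodes, new_edges)
```

Summation is used for the aggregation because it does not depend on the order in which messages arrive.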

The final stage of the TrackML pipeline involves task-specific post-processing. If our goal is track formation, we can place a threshold on the edge scores produced by the GNN and partition the graph into connected components. If our goal is track seeding, we can directly sample the classified edges for high likelihood combinations of connected triplets, or convert the entire graph to a triplet graph and train this on a second GNN to classify the triplets. A triplet graph is formed by taking all edges in the original (doublet) graph and assigning them as nodes in the new triplet graph. The nodes in this triplet graph are connected if they share a hit in the doublet graph. Applying a GNN to this structure produces highly pure sets of seeds as shown in Ref. [17].
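
The track formation step (thresholding the edge scores and partitioning the graph into connected components) can be sketched with a small union-find structure. The function name, the toy scores, and the threshold value of 0.5 are assumptions made for illustration only.

```python
def label_tracks(num_nodes, edges, scores, threshold):
    """Group spacepoints into track candidates via connected components."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Keep only edges whose score passes the threshold, and merge their ends.
    for (i, j), score in zip(edges, scores):
        if score > threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    components = {}
    for node in range(num_nodes):
        components.setdefault(find(node), []).append(node)
    return sorted(components.values())

# Toy graph: the low-scoring edge (2, 3) is cut, splitting the chain in two;
# spacepoint 5 has no edges and forms its own component.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
scores = [0.9, 0.8, 0.1, 0.95]
tracks = label_tracks(6, edges, scores, threshold=0.5)
print(tracks)
```

In practice, single-spacepoint components such as the last one would be discarded rather than kept as track candidates.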

Many of these techniques are common to other applications being explored in the Exa.TrkX collaboration. The pattern of nearest-neighbor graph-building and GNN edge classification has shown its potential for neutrino experiments [28] and the CMS High Granularity Calorimeter [25]. Indeed, these applications build on the TrackML pipeline and extend it, for example by adding the particle type as an edge feature.

4 Results

4.1 Tracking performance of the TrackML pipeline

4.1.1 Tracking efficiency and purity

The performance of a tracking pipeline is mainly characterized by tracking efficiency and purity. For efficiency calculations, only charged particles that satisfy \(|\eta | < 4.0\) and \(p_\text {T} > 100\) MeV are considered. These selected particles, \(N_{particles}(\text {selected})\), are hereafter referred to as particles.

The overall tracking efficiency, known as physics efficiency \(\epsilon _\text {phys}\) (Eq. 1), is defined as the fraction of particles that are matched to at least one reconstructed track. A particle is considered to be matched to a reconstructed track when (1) the majority of spacepoints in the reconstructed track belong to the same true track, and (2) the majority of spacepoints in the matched true particle track are found in the reconstructed track (Footnote 2).

To measure the efficiency of the tracking pipeline itself, we also define the technical efficiency \(\epsilon _\text {tech}\) (Eq. 2) as the fraction of reconstructable particles matching at least one reconstructed track. Reconstructable particles have a trajectory that leaves at least five spacepoints in the detector. Tracking purity (Eq. 3) is defined as the fraction of reconstructed tracks that match a selected particle (Footnote 3).

$$\begin{aligned} \epsilon _\text {phys}&= \frac{N_{particles}(\text {selected, matched})}{N_{particles}(\text {selected})} \end{aligned} \tag{1}$$
$$\begin{aligned} \epsilon _\text {tech}&= \frac{N_{particles}(\text {selected, reconstructable, matched})}{N_{particles}(\text {selected, reconstructable})} \end{aligned} \tag{2}$$
$$\begin{aligned} \text {Purity}&= \frac{N_{tracks}(\text {selected, matched})}{N_{tracks}(\text {selected})} \end{aligned} \tag{3}$$
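
The matching rule and the three quantities defined above can be illustrated with a short sketch. It assumes that each spacepoint belongs to at most one particle and, as a simplification, uses all reconstructed tracks as the purity denominator; the function names and the toy event are hypothetical.

```python
from collections import Counter

MIN_HITS = 5  # reconstructable particles leave at least five spacepoints

def match_tracks(tracks, hit_to_particle, particle_hits):
    """Return {track index: particle id} under the double-majority rule."""
    matches = {}
    for t, hits in enumerate(tracks):
        counts = Counter(hit_to_particle[h] for h in hits if h in hit_to_particle)
        if not counts:
            continue  # track made of noise spacepoints only
        pid, shared = counts.most_common(1)[0]
        # (1) majority of the track's hits come from this particle, and
        # (2) majority of the particle's hits are in the track.
        if 2 * shared > len(hits) and 2 * shared > len(particle_hits[pid]):
            matches[t] = pid
    return matches

def tracking_metrics(tracks, particle_hits, selected):
    hit_to_particle = {h: p for p, hits in particle_hits.items() for h in hits}
    matches = match_tracks(tracks, hit_to_particle, particle_hits)
    matched = set(matches.values()) & selected
    reconstructable = {p for p in selected if len(particle_hits[p]) >= MIN_HITS}
    eps_phys = len(matched) / len(selected)
    eps_tech = len(matched & reconstructable) / len(reconstructable)
    purity = sum(p in selected for p in matches.values()) / len(tracks)
    return eps_phys, eps_tech, purity

# Toy event: particle B has only three spacepoints; spacepoint 14 is noise.
particle_hits = {"A": [0, 1, 2, 3, 4, 5], "B": [6, 7, 8], "C": [9, 10, 11, 12, 13]}
tracks = [[0, 1, 2, 3, 4], [9, 10, 6, 14]]
eps_phys, eps_tech, purity = tracking_metrics(tracks, particle_hits, {"A", "B", "C"})
print(eps_phys, eps_tech, purity)
```

Here only particle A is matched: the second track shares two of its four spacepoints with particle C, which is not a strict majority.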

Averaged over 50 testing events from the TrackML dataset, the physics efficiency for particles with \(p_\text {T} > 500\) MeV is \(88.7\pm 0.3\%\) and the technical efficiency is \(97.6\pm 0.3\%\). Without any fiducial \(p_\text {T} \) cut, the physics efficiency becomes \(67.2\pm 0.1\%\) and the technical efficiency \(91.3\pm 0.2\%\). The tracking purity is \(58.3\pm 0.6\%\). Using the TrackML challenge scoring system and all tracks in the event, we obtained a score of \(0.877\pm 0.005\) (Footnote 4). The errors quoted are statistical only.

Figure 5 shows the \(p_\text {T} \) distribution of particles as well as the tracking efficiency as a function of particle \(p_\text {T} \). The physics efficiency for particles with \(p_\text {T} \) in the range [100, 300] MeV is 43% and is therefore not displayed in the plot. The physics efficiency for particles with \(p_\text {T} > 700\) MeV is above 88%. The technical efficiency is 82% for particles with \(p_\text {T} \) in the range [100, 300] MeV, and increases to above 97% for particles with \(p_\text {T} > 700\) MeV. Figure 5 also shows the \(\eta \) distribution of particles with \(p_\text {T} > 500\) MeV as well as the tracking efficiency as a function of the particle \(\eta \). The physics efficiency is higher in the barrel region of the detector (volumes 8, 13, 17 in Fig. 1), while the technical efficiency is almost flat across the \(\eta \) range. In Fig. 5 the \(p_\text {T} \) and \(\eta \) of the matched truth particle were used, rather than the \(p_\text {T} \) and \(\eta \) of the reconstructed track. We leave a study of track quality and detector resolution effects for future work.

Fig. 5

Top row: selected, reconstructable, and matched particles (left) and tracking efficiency (right) as a function of \(p_\text {T} \) for particles with \(|\eta | < 4\). Bottom row: selected, reconstructable, and matched particles (left) and tracking efficiency (right) as a function of \(\eta \) for \(p_\text {T} > 0.5\) GeV. The definition of “selected”, “reconstructable”, and “matched” can be found in Sect. 4.1.1

4.1.2 Systematic studies

Before using a tracking algorithm in production, it is necessary to measure its sensitivity to systematic effects, including pile-up, noise and digitization errors, and uncertainties in the measurement of detector properties (alignment, rotation, magnetic field map, etc.).

Measuring precisely the impact of pile-up collisions on tracking performance is beyond the scope of this work, but we can estimate pile-up’s impact on tracking performance by plotting efficiency and purity as a function of the number of spacepoints in the detector. Figure 6 shows that the effect of the increased detector occupancy is a smooth performance degradation of the order of a percent. In future work, we will study the origin of this degradation in order to match the stable performance of traditional algorithms [36].

Fig. 6

Mean and standard deviation of the technical efficiency (top) and purity (bottom) as a function of the total number of spacepoints in an event

The impact of noise spacepoints can be estimated using the TrackML dataset by studying the inference performance of the tracking pipeline, trained without any noise spacepoints, as a function of the fraction of noise spacepoints (up to a maximum of 20% of the total). Table 1 shows the technical tracking efficiency and purity for different noise levels. The efficiency decreases by \(\simeq 1.6\%\) and the purity by \(\simeq 5.4\%\) when 20% of the spacepoints are noise. The loss of efficiency happens primarily for particles with \(p_\text {T} < 500\) MeV (Fig. 7).

Table 1 Technical efficiency and purity for different noise fractions \((N^{\text {noise}}_{\text {spp}}/N_{\text {spp}}) \times 100\%\)

Fig. 7

Relative technical efficiency as a function of \(p_\text {T} \). Each curve shows the ratio of \(\text {eff}(\text {noise}=N\%) / \text {eff}(\text {noise}=0)\)

Detector misalignment effects are approximated by shifting the x coordinate of all spacepoints by up to 1 mm, either in the innermost TrackML barrel detector layer or in the four innermost layers (volume 8 in Fig. 1). In both cases, the impact on the tracking efficiency is less than 0.1%. However, an in-depth study of misalignments and other detector effects requires access to each experiment’s detailed detector simulation data. We leave these studies as future work to be performed in collaboration with each experiment.

4.2 Distributed training performance

Our training sample consists of 7500 pileup events from the TrackML dataset. It takes about 1.5 days to train the Exa.TrkX pipeline on an Nvidia A100 GPU for a single set of hyper-parameters. It is therefore desirable to use distributed training to parallelize model training and hyper-parameter optimization (HPO). This study relied on data parallel training [37] implemented using Horovod [38] and TensorFlow’s tf.distributed framework [39]. Horovod supports distributed training across multiple nodes, while tf.distributed allows the same code to be used across CPUs, TPUs, and GPUs.

For this study, the TrackML pipeline is trained on up to 64 Nvidia V100 GPUs across eight NERSC Cori-GPU computing nodes. Using the Horovod framework (Fig. 8), the training time per epoch is reduced from 22 min with 1 GPU to 0.5 min with 64 GPUs (Footnote 5). The strong scaling efficiency (Footnote 6) is about 90% with 2 GPUs and 75% with 8 GPUs. This deviation from ideal scaling is due to the model setup time and data movement costs.
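
As a worked example, the sketch below evaluates the strong scaling efficiency from the two timings quoted above. It assumes the usual definition, the single-GPU time divided by the product of the number of GPUs and the multi-GPU time, which may differ in detail from the definition used for Fig. 8; the function name is hypothetical.

```python
def strong_scaling_efficiency(t_one_gpu, t_n_gpus, n_gpus):
    """Ratio of the ideal time (t_one_gpu / n_gpus) to the measured time."""
    return t_one_gpu / (n_gpus * t_n_gpus)

# Timings quoted in the text (minutes): 22 with 1 GPU, 0.5 with 64 GPUs.
eff_64 = strong_scaling_efficiency(22.0, 0.5, 64)
print(f"{eff_64:.1%}")
```

Because the quoted timings are rounded, the resulting value of roughly 69% for 64 GPUs is indicative only.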

Fig. 8

Time per training epoch (left) and strong scaling efficiency (right) for distributed training of the GNN. The top row refers to the Horovod implementation, the bottom row to the tf.distributed one. The first bin in the bottom left diagram refers to the serial case, in which the input graph is not padded

Table 2 Average inference time for synchronous execution of the TrackML pipeline benchmarked on CPUs and GPUs. For these step-by-step measurements, we force the pipeline to execute serially by calling torch.cuda.synchronize after each step. The total inference time comprises all the steps including ones not listed in the table

Figure 8 also shows the scaling behaviour of the tf.distributed implementation. Since this implementation requires all input data to be of the same size, we have to pad all input graphs to a fixed size. This essentially doubles the time needed to train one epoch, which increases from 22 min for dynamic input graph sizes to 41 min for constant graph sizes. Leaving aside this fixed overhead, tf.distributed appears to scale better than Horovod, achieving \(\simeq 85\%\) strong scaling efficiency with 8 GPUs.
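
The padding requirement can be illustrated with a minimal sketch. It shows one common way to bring variable-size edge lists to a fixed size, using dummy edges and a mask that marks the real entries; the function name, the dummy edge value, and the masking scheme are assumptions, not a description of the implementation used for these measurements.

```python
def pad_graph(edges, max_edges, pad_edge=(0, 0)):
    """Pad an edge list to max_edges entries and return it with a 0/1 mask."""
    if len(edges) > max_edges:
        raise ValueError("graph has more edges than the fixed size allows")
    n_pad = max_edges - len(edges)
    padded = list(edges) + [pad_edge] * n_pad
    mask = [1] * len(edges) + [0] * n_pad   # 1 = real edge, 0 = padding
    return padded, mask

padded, mask = pad_graph([(0, 1), (1, 2)], max_edges=4)
print(padded, mask)
```

Every graph then costs as much to process as the largest one, which is consistent with the longer epoch time observed for constant graph sizes.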

4.3 Inference performance on CPU and GPU

It is crucial to characterize the computational cost of the end-to-end learned tracking algorithm. We rely on the PyTorch and TensorFlow libraries to optimize our inference pipeline on CPU and GPU. The execution time for the inference pipeline has been measured on two hardware platforms: Nvidia V100 GPUs with 16 GB on-board memory, and Intel Xeon 6148s (Skylake) CPUs with 40 cores and 192 GB memory per node. The inputs to the filtering step do not fit into the GPU memory. Therefore, edge filtering for one event is executed in mini-batches with a fixed batch size of 800k edges. Typically, the inputs to the filtering from one event are split into seven batches, leading to additional computational cost for moving data from host to GPU. The peak GPU memory consumption is about 15.7 GB as obtained from the Nvidia profiling tool.
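
The mini-batching of the filtering step can be sketched as a simple generator over index ranges. Only the batch size of 800k edges comes from the text; the function name and the edge count of five million are hypothetical, chosen so that the event splits into seven batches.

```python
BATCH_SIZE = 800_000  # fixed mini-batch size quoted in the text

def edge_batches(n_edges, batch_size=BATCH_SIZE):
    """Yield (start, stop) index ranges covering all edges of one event."""
    for start in range(0, n_edges, batch_size):
        yield start, min(start + batch_size, n_edges)

n_edges = 5_000_000  # hypothetical number of edges entering the filter
batches = list(edge_batches(n_edges))
print(len(batches), batches[-1])
```

Each additional batch adds one more host-to-GPU transfer, so the cost grows in steps as the number of edges crosses a multiple of the batch size.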

Averaging over 500 events, it takes \(2.2 \pm 0.3\) wall-clock seconds per event (as measured by the Python module time) to run the inference pipeline on the GPU and \(202 \pm 35\) seconds to run it on a single CPU core. This total execution time includes every step of the calculation, and in particular the time needed to move data from host to GPU. Table 2 breaks down the wall-clock time for the most significant steps of the pipeline. The results show that the graph creation and filtering steps are the biggest targets for further optimization in order to surpass traditional algorithms in terms of inference time [40].

In addition, Fig. 9 shows that the total inference time depends almost linearly on the number of spacepoints in the event for both CPUs and GPUs. The step-like dispersion in the GPU case is due to the splitting of the inputs to the filtering step into mini-batches: each step-like jump indicates that one more mini-batch is added.

Fig. 9

Total inference time as a function of number of spacepoints in each event for CPUs (top) and GPUs (bottom)

Many optimizations were introduced to the pipeline in order to achieve these GPU timings, which before optimization took over 20 seconds per event. These improvements include porting all data processing to the GPU-accelerated CuPy library [41], writing custom sparse operations for graph processing (e.g. doublet-to-triplet conversion [42], graph intersection methods), using FAISS [43] for large-k NN graph construction, and performing track labelling with CuGraph’s connected component algorithm on GPU [44] (Footnote 7). These improvements are specific to the inference stage; training optimizations will be discussed in the following section, and ongoing developments in Sect. 6. No CPU-specific optimization was performed in this work.

5 Discussion

The performance given above is the result of experimentation across various feature sets, architectures, model configurations and hyperparameters. It has also been necessary to overcome a variety of training hurdles in terms of memory and computational availability. We describe here training and inference details that should allow a reader to reproduce these results on the provided codebase.

5.1 Feature set

The input dataset includes both spatial coordinates and highly granular pixel cluster shape information. Graph construction (the second pipeline step in Fig. 4, which includes the learned embedding model and the edge filter model) appears to benefit significantly from the cluster shape information, which approximately doubles the purity at a fixed high efficiency. The summary cluster shape statistics include the number of channels and the total charge deposited, as well as local and global representations of the cluster as a high-level feature vector. Details about the calculation of this feature vector as well as a thorough exploration of the effect of cluster shape information on seeding performance are provided in Ref. [30]. Cluster shape information does not appear to improve the performance of the GNN, and in fact seems to degrade it. This suggests that the width of the GNN hidden layers is not great enough to capture the functional relationship of cluster information between nodes. Scaling to a width that properly explores this question would require more memory than available on the Nvidia A100 GPUs used for this study.

Depending on the final goal of the pipeline, further features can be included in the loss calculation in order to bias the model towards desired regions. For example, if our aim is to maximize the TrackML score (described in Ref. [2]), which uses a weighting function \(s_i\) that places more importance on a spacepoint i from a longer and higher \(p_\text {T} \) track, and in the first and last sets of detector layers, we can up-weight true edges by this function, normalized to have a mean weight of 1. To measure the performance of models trained to this goal, we introduce a weighted purity measure. Weighted purity is defined as a function of the TrackML weights \(w_{ij}\) and the truth \(y_{ij}\in \{0,1\}\) of each edge connecting spacepoint i and spacepoint j,

$$\begin{aligned} \text {Purity}_\text {weighted}&= \frac{\sum _{ij} w_{ij} y_{ij}}{\sum _{ij} w_{ij}}, \\ w_{ij}&= {\left\{ \begin{array}{ll} \frac{1}{2}(s_i + s_j), &amp; \text {if } y_{ij} = 1 \\ 1, &amp; \text {if } y_{ij} = 0 \end{array}\right. } \end{aligned}$$
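
The weighted purity defined above can be evaluated directly from an edge list, as in the sketch below. The function name and the toy spacepoint weights are hypothetical.

```python
def weighted_purity(edges, truth, s):
    """Weighted purity of an edge list.

    edges: list of (i, j) spacepoint index pairs
    truth: list of 0/1 labels y_ij, one per edge
    s:     per-spacepoint weights s_i
    """
    numerator = denominator = 0.0
    for (i, j), y in zip(edges, truth):
        # True edges get the mean weight of their two spacepoints,
        # fake edges get unit weight.
        w = 0.5 * (s[i] + s[j]) if y == 1 else 1.0
        numerator += w * y
        denominator += w
    return numerator / denominator

s = [2.0, 1.0, 1.0, 0.5]                 # toy spacepoint weights
edges = [(0, 1), (1, 2), (2, 3)]
truth = [1, 0, 1]
purity_w = weighted_purity(edges, truth, s)
print(purity_w)
```

In this toy case the edge weights are 1.5, 1 and 0.75, so the weighted purity is 2.25/3.25, slightly higher than the unweighted value of 2/3 because the heavier true edge counts for more.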

We see significant improvements in this metric when validating on the weighted model: the Embedding Network improves from a weighted purity of \(1.7\% \pm 0.2\%\) to \(2.0\% \pm 0.3\%\), while the Filter Network improves from a weighted purity of \(8.4\% \pm 0.6\%\) to \(11.7\% \pm 1.0\%\). Given this weighting, the model learns to prioritize higher \(p_\text {T} \) and longer tracks, while disregarding less informative tracks. Using this bias, we can achieve the same TrackML score with a constructed graph size reduced by approximately 25%. Using this technique to improve the TrackML score is ongoing work.

5.2 Graph construction

Having chosen a feature set, we train the learned embedding space using a training paradigm commonly referred to as a Siamese Network [46], where a particular spacepoint (called the source) is run through an MLP, here 6 layers each with 512 hidden channels, hyperbolic tangent (tanh) activations, and layer normalization. The final layer of the MLP takes the features to an 8-dimensional latent space. A different, comparison spacepoint (called the target) is also run through this same Embedding Network, and the Euclidean (L2) distance d in the latent space between the source and target enters a comparative hinge loss

$$\begin{aligned} \mathcal {L}_\text {hinge} = {\left\{ \begin{array}{ll} d^p, \text { if } y_{ij} = 1 \\ \text {max}(0, 1 - d^p), \text { if } y_{ij} = 0 \end{array}\right. } \end{aligned}$$

where p is a hyperparameter that we choose to be 2.

If the source i and target j spacepoints share an edge in the event’s truth graph (Footnote 8), we designate them as neighbours with \(y_{ij} = 1\); otherwise they are designated \(y_{ij}=0\). In this way, the hinge loss draws together truth graph neighbors and repels non-neighbors.
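
The pairwise hinge loss can be written out for a single source-target pair as in the sketch below. The function name and the toy latent coordinates are hypothetical; the margin of 1 and the exponent \(p=2\) follow the text.

```python
import math

def hinge_loss(source, target, y, p=2, margin=1.0):
    """Hinge loss for one pair of embedded spacepoints.

    y = 1 for truth-graph neighbors (pulled together),
    y = 0 otherwise (pushed apart until d**p exceeds the margin).
    """
    d = math.dist(source, target)   # Euclidean distance in latent space
    if y == 1:
        return d ** p
    return max(0.0, margin - d ** p)

src = (0.0, 0.0)
loss_true_pair = hinge_loss(src, (0.3, 0.4), y=1)    # d = 0.5
loss_close_fake = hinge_loss(src, (0.3, 0.4), y=0)   # inside the margin
loss_far_fake = hinge_loss(src, (3.0, 4.0), y=0)     # d = 5, beyond the margin
print(loss_true_pair, loss_close_fake, loss_far_fake)
```

The last pair contributes zero loss and therefore zero gradient, which is why randomly chosen negative pairs become uninformative once the embedding has separated most of them.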

Training performance of the Embedding Network is highly dependent on the choice of source-target example pairs. In early epochs, it is enough to choose random pairs. However, at some point, many random pairs will contribute no gradient to the loss, as they will be separated by a distance greater than the margin. At that point, it is useful to implement hard negative mining [47]. We run a GPU-optimised k-nearest-neighbor (KNN) algorithm (Footnote 9) to mine examples around each source vector, within the hinge margin \(d=1\). The computational overhead of the KNN step is largely offset by the mined examples, all of which contribute to the loss.
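
Hard negative mining of this kind can be sketched with a brute-force neighbor search. The function name, the choice \(k=2\), and the toy embeddings are hypothetical, and the plain Python loop stands in for the GPU-optimised KNN search used in practice.

```python
import math

def mine_hard_negatives(embeddings, true_pairs, margin=1.0, k=2):
    """Return non-neighbor pairs that lie within the hinge margin.

    For each source, the k nearest points in latent space are inspected;
    those that are not truth-graph neighbors are kept as hard negatives.
    """
    truth = {frozenset(pair) for pair in true_pairs}
    mined = []
    for i, src in enumerate(embeddings):
        nearest = sorted(
            (math.dist(src, tgt), j)
            for j, tgt in enumerate(embeddings)
            if j != i
        )[:k]
        for d, j in nearest:
            if d < margin and frozenset((i, j)) not in truth:
                mined.append((i, j))
    return mined

# Toy latent space: points 0, 1, 2 are clustered, point 3 is far away.
latent = [(0.0, 0.0), (0.1, 0.0), (0.3, 0.0), (5.0, 5.0)]
hard_negatives = mine_hard_negatives(latent, true_pairs=[(0, 1)])
print(hard_negatives)
```

Every mined pair lies inside the margin and is labelled \(y_{ij}=0\), so each one yields a non-zero hinge loss.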

A similar technique is used in the Filter Network, where the vast majority of the edges produced from the graph construction in the embedded space are easy to classify as fake. This is already a highly imbalanced dataset, with around 98.5% of edges fake. Again, within several epochs, the Filter Network is able to classify many of these as fake, so we balance each batch with all true edges, the same number of hard negatives (i.e. negatives the filter is unsure of) and the same number of easy negatives (to maintain performance on these edges). The Filter Network is an MLP that takes the 24-feature concatenated edge features and feeds forward through 3 layers of 1024 hidden channels, to a binary cross-entropy loss function.
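
The batch balancing described above can be sketched as follows. The text does not specify how "unsure" negatives are identified, so the score cut `hard_cut=0.5` that separates hard from easy negatives is an assumption, as are the function name and the toy scores.

```python
import random

def balanced_batch(scores, truth, hard_cut=0.5, seed=0):
    """Select all true edges plus equal numbers of hard and easy negatives.

    scores: current filter output per edge (higher = more track-like)
    truth:  0/1 label per edge
    Returns the indices of the edges to include in the batch.
    """
    rng = random.Random(seed)
    true_idx = [i for i, y in enumerate(truth) if y == 1]
    hard = [i for i, y in enumerate(truth) if y == 0 and scores[i] >= hard_cut]
    easy = [i for i, y in enumerate(truth) if y == 0 and scores[i] < hard_cut]
    n = len(true_idx)
    return (
        true_idx
        + rng.sample(hard, min(n, len(hard)))   # fakes the filter is unsure of
        + rng.sample(easy, min(n, len(easy)))   # fakes it already rejects
    )

truth = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.1, 0.2, 0.05]
batch = balanced_batch(scores, truth)
print(batch)
```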

5.3 GNN edge classification

In choosing the best GNN architecture, memory usage remains a significant constraint. The Interaction Network (IN) [34] used for these results appears to perform marginally better than Attention Graph Neural Networks (AGNN) [14, 49], the other class of GNN considered for the pipeline. However, both of these networks require gradients to be retained in memory for every graph edge. Indeed, this anisotropic treatment of edges (i.e. a node is able to receive the messages of each of its neighbors in a non-uniform way) is what allows these two architectures to be so expressive. Depending on hardware availability, we have found two solutions to the memory constraint. Access to next-generation Nvidia A100 GPUs allowed an IN to be trained with 8 steps of message passing, aggregating edge features at each node, with each node and concatenated edge features passing through two-layer MLPs of [128, 64] hidden features and ReLU activations [50]. The aggregation function should be permutation invariant; in this work, we take it to be a summation.

For lower-memory GPUs, such as the Nvidia V100, we attained similar performance training the AGNN architecture, with [64, 64, 64]-channel MLPs applied to each edge and node. Adding residuals [51] across the 8 message passing steps greatly improved performance in this case. To fit full-event training on a single V100, it was necessary to employ various techniques, such as mixed precision training and gradient checkpointing. The latter stores only the input of each layer rather than all intermediate activations. On the backward pass, the activations needed to compute the gradients are recalculated on the fly, allowing for a 4\(\times \) reduction in memory usage for an 8-iteration GNN. Another technique explored is to split each event into pieces and train on each piece as a standalone batch. There is a noticeable impact on performance due to messages being interrupted at the boundaries of the graph pieces. In future work, we will present ongoing efforts to parallelise these graph pieces across multiple GPUs, retaining the high performance that full-event training allows.

5.4 Physics-inspired data augmentation

Preliminary work on using coordinate transforms to augment the training data has been explored with varying degrees of success. In this study, focused on track seeding, only the innermost detector layers (volumes 7–9 in Fig. 1) were used.

One promising approach is to make a copy of each graph in the training set that has been reflected across the phi-axis [52]. The phi reflection creates the charge conjugate graph and helps to balance any asymmetry between positively and negatively charged particles within the training set. Using the phi-reflected graphs boosts efficiency by \(\simeq 2\%\) and purity by \(\simeq 1\%\) in the barrel. This performance boost comes at the cost of doubling the training time. In future work, we will investigate the opportunity of integrating charge conjugation symmetry into the network itself.
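A minimal sketch of this augmentation follows, under two assumptions that are not taken from the text: that each spacepoint is stored as cylindrical coordinates (r, phi, z), and that the reflection is implemented as phi → −phi. The graph layout (a dict with `nodes` and `edges`) and the function names are hypothetical.

```python
def reflect_phi(graph):
    """Return a copy of the graph with phi -> -phi at every spacepoint."""
    return {
        "nodes": [(r, -phi, z) for (r, phi, z) in graph["nodes"]],
        "edges": list(graph["edges"]),  # connectivity is unchanged
    }


def augment_with_reflections(graphs):
    """Return the original graphs followed by their phi-reflected copies."""
    return list(graphs) + [reflect_phi(g) for g in graphs]


# Hypothetical three-hit track whose phi grows with r (one bending direction).
graph = {
    "nodes": [(32.0, 0.10, -5.0), (72.0, 0.12, -11.0), (116.0, 0.15, -18.0)],
    "edges": [(0, 1), (1, 2)],
}
augmented = augment_with_reflections([graph])
```

In the reflected copy phi decreases with r, so the track bends the opposite way, as the track of an oppositely charged particle would. The augmented set is twice the size of the original, which is consistent with the doubled training time.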

A second promising trick is to use a Hough Transform [6, 7] on the graph to create edge features. Using the Hough parameters as edge features boosts efficiency by \(\simeq 2\%\) and purity by \(\simeq 1\%\). A further efficiency boost of \(\simeq 3\%\) (and \(\simeq 2\%\) to purity) comes from using the Hough accumulator to extract an edge weight. This edge weight effectively pools information from every node, and therefore comes at a large computational cost (filling the accumulator in Hough space). On the other hand, the Hough parameters can be computed quickly from the two nodes that define the edge.
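To illustrate why the Hough parameters of an edge are cheap to compute, the sketch below derives the normal-form line parameters (rho, theta) from the two endpoints of an edge. The parameterisation is an assumption: the text does not specify which Hough transform or which coordinate plane is used, so the classic straight-line form rho = x cos(theta) + y sin(theta) in a generic 2D projection is used here, and the function name is hypothetical. The accumulator-based edge weight is not shown.

```python
import math


def hough_edge_features(p1, p2):
    """Return (rho, theta) of the straight line through two 2D points.

    The line is written in normal form: rho = x*cos(theta) + y*sin(theta).
    """
    (x1, y1), (x2, y2) = p1, p2
    dx, dy = x2 - x1, y2 - y1
    if dx == 0.0 and dy == 0.0:
        raise ValueError("the two points must be distinct")
    # The line normal is (-dy, dx); theta is its polar angle.
    theta = math.atan2(dx, -dy)
    rho = x1 * math.cos(theta) + y1 * math.sin(theta)
    if rho < 0.0:
        # Flip the normal so that rho is non-negative.
        rho = -rho
        theta = theta - math.pi if theta > 0.0 else theta + math.pi
    return rho, theta


# Hypothetical edge between two points lying on the line x + y = 1.
rho, theta = hough_edge_features((0.0, 1.0), (1.0, 0.0))
```

Only the two endpoint coordinates enter the calculation, in contrast with the accumulator-based weight, which needs a pass over every node in the event.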

6 Conclusions and future work

This work shows how a tracking pipeline based on geometric deep learning can achieve state-of-the-art computing performance that scales linearly with the number of spacepoints, showing great promise for the next generation of HEP experiments. The inference pipeline has been optimized on GPU systems, on the assumption that the next generation of HEP experiments will have widespread access to accelerators either locally in heterogeneous systems [27, 53] or remotely [54, 55].

Within the simplifying assumptions of the TrackML dataset, we have shown how the Exa.TrkX pipeline could meet the tracking performance requirements of current collider experiments. Preliminary studies suggest that this performance should be robust against systematic effects like detector noise, misalignment, and pile-up.

Much remains to be done to validate these promising results. To this end, the Exa.TrkX project is collaborating with physicists from ATLAS [56], CMS [57], DUNE [58], ICARUS [59], and MuonE [60].

The goal is to adapt the Exa.TrkX pipeline to each experiment’s needs and simulated datasets, and to measure its performance and robustness against systematic effects according to the experiment’s own metrics. For example, it is crucial for HL-LHC experiments to study the performance of tracking algorithms in dense environments, like high-\(p_\text {T} \) jets. Given the interest in long-lived particle observation at the HL-LHC, it will also be important to study the performance of the Exa.TrkX pipeline for tracks coming from a displaced vertex.Footnote 10

On the computational side, there are several optimization opportunities to explore systematically, including mixed precision training, multi-GPU training and inference with graph data parallelisation (that is, one event spread across multiple GPUs) [61]; locality-sensitive hashing to speed up the KNN/graph construction stage [62]; model quantization, operator fusion and other improvements with TensorRT [63]; and clustering of final node embeddings, rather than the hard connected components method, with GravNet-style architectures [64].

The distributed training results presented in this work are promising but still preliminary. To fully exploit the capabilities of upcoming HPC systems and to further reduce training time while potentially pushing further on model size, it will be beneficial to perform further studies on large scale training of GNNs for track reconstruction. Given the size of the input graphs, this problem may be amenable to training techniques which parallelise the processing of input graphs across multiple GPUs in training.

Finally, it will be interesting to measure the computing performance of (parts of) the Exa.TrkX pipeline on domain-specific accelerators like Google TPU [65] and GraphCore IPU [66], comparing power consumption, latency and throughput with “traditional” GPUs.

7 Software availability

A growing number of groups are studying the application of graph networks to HEP reconstruction (see [67] for a recent review). Some of these works [24, 27,28,29,30,31, 33] have strong connections with the Exa.TrkX project. To promote collaboration and reproducibility, the Exa.TrkX software is available from the HEP Software Foundation’s Trigger and Reconstruction GitHub.Footnote 11 A pipeline of re-usable modules is implemented within the PyTorch Lightning system, which allows for uncluttered and simple model definitions. As the stages of the pipeline are interdependent, logging utilities are integrated so that a specific combination of stages and hyperparameters can be tracked and reproduced. Extensive documentation is provided to help track reconstruction groups start exploring geometric learning. The roadmap for this repository includes adding performance metrics to the codebase; a taxonomy of model features; and short tutorials for each of the available applications.