Performance of a Geometric Deep Learning Pipeline for HL-LHC Particle Tracking

The Exa.TrkX project has applied geometric learning concepts such as metric learning and graph neural networks to HEP particle tracking. Exa.TrkX's tracking pipeline groups detector measurements to form track candidates and filters them. The pipeline, originally developed using the TrackML dataset (a simulation of an LHC-inspired tracking detector), has been demonstrated on other detectors, including DUNE Liquid Argon TPC and CMS High-Granularity Calorimeter. This paper documents new developments needed to study the physics and computing performance of the Exa.TrkX pipeline on the full TrackML dataset, a first step towards validating the pipeline using ATLAS and CMS data. The pipeline achieves tracking efficiency and purity similar to production tracking algorithms. Crucially for future HEP applications, the pipeline benefits significantly from GPU acceleration, and its computational requirements scale close to linearly with the number of particles in the event.


Introduction
Charged particle tracking plays an essential role in High-Energy Physics (HEP), including particle identification and kinematics, vertex finding, lepton reconstruction, and flavor jet tagging. At the core of particle tracking there is a pattern recognition algorithm that must associate a list of 2D or 3D position measurements from a tracking detector (known as hits or spacepoints in the literature) to a list of particle track candidates (or tracks). A track is defined as a list of spacepoints associated by the pattern recognition to a charged particle.
E-mail: xju@lbl.gov
The number of particle track candidates varies significantly from one experiment setup to another. For example, in a High-Luminosity LHC (HL-LHC) [1] collision event, due to the pile-up of multiple proton-proton collisions per bunch crossing, there are typically 5,000 charged particles and 100,000 spacepoints, about 50% of which are associated to particles of interest.
A typical HEP offline tracking algorithm [3,4,5] has four stages: spacepoint formation, track seeding, track following, and track fitting. The spacepoint formation stage combines the detector readout cell raw data in clusters from which the spacepoint 3D coordinates, and their uncertainties, are determined. Track seeding combines spacepoints in doublet or triplet seeds. Each seed provides an initial track direction, origin, and possibly a curvature, with associated uncertainties. The track following stage adds more spacepoints to the seed by looking for matching spacepoints along the extrapolated trajectory. Finally, a track fitting stage, which may be combined with the track following, fits a trajectory through the track spacepoints to assess the track quality and measure the particle's physical and kinematic properties (charge, momentum, origin, etc.). To avoid biasing physics results, each stage of the algorithm must have high efficiency, meaning it must identify e.g. > 90% of the charged particles within a fiducial region (e.g. p_T > 1 GeV, |η| < 4) as track candidates. Track seeding and track filtering must also have high purity, meaning that e.g. > 60% of the track seeds and track candidates must correspond to charged particles. High purity keeps the number of track candidates, and the associated computational costs, under control.
Fig. 1 A simulated HL-LHC collision event (top) as seen by the TrackML tracking detector [2]. The detector schematic (bottom) shows the top half of the detector projected on the r-z plane. The z-axis is along the beam direction.
Online tracking algorithms may use different pattern recognition algorithms, including Hough transforms [6,7] and cellular automata [8,9], to create and filter track seeds and candidates, but share the same high efficiency requirements. Online applications also have stringent computing requirements (e.g. latency of O(10) µs for LHC triggers).
The computational cost of current tracking algorithms grows faster than linearly with beam intensity and detector occupancy, as demonstrated in Figure 2. Given the order-of-magnitude increase in beam intensity at the HL-LHC, charged particle pattern recognition algorithms might well limit the discovery potential of HL-LHC experiments.
Fig. 2 Top: [...] average pileup [...] [10]. Bottom: CMS time spent in tracking sequence for 2016 tracking, 2017 tracking with conventional seeding, and 2017 tracking with Cellular Automaton (CA) seeding [11].
Over the last two decades, tracking computational challenges arising from the increased number of combinations have been addressed by tightening fiducial regions for charged particles, developing highly optimized tracking algorithms [4,5], and even optimizing the geometry of tracking detectors. These optimizations brought order-of-magnitude gains in tracking computational performance with limited impact on physics. While these efforts continue [12], it is unlikely that another order of magnitude can be gained through incremental optimization without impacting physics performance. Furthermore, given the computational complexity and iterative nature of current track following and filtering algorithms, it is challenging to run them efficiently on data-parallel architectures like GPUs.
The TrackML challenge [2] jump-started the application of deep learning pattern recognition methods to HEP tracking. The HEP.TrkX pilot project [13] proposed the use of graph networks to filter track doublet and triplet seeds [14]. Building on that work, the Exa.TrkX project [15] has demonstrated the applicability of Geometric Deep Learning (GDL) methods [16], specifically metric learning and Graph Neural Networks (GNN), to particle tracking [17]. GDL is concerned with learning representations of data that have complex geometrical relationships and no natural ordering, like detector spacepoints. GDL models are computationally regular, naturally parallel and therefore well-suited to run on hardware accelerators.
This work describes new developments that enabled the first study of the computing and physics performance of the Exa.TrkX pipeline on the entire TrackML detector at HL-LHC design luminosity, a step towards the validation of the pipeline on ATLAS and CMS data.

Related work
Early on, the HEP.TrkX pilot project attempted to assign and regress track parameters to single spacepoints using image processing models. Subsequent attempts at estimating track parameters using image processing and recurrent networks showed promising results [18] in a simplified environment. A similar realization of the method is reported in [19], where a model processing images from successive pixel detector layers is used to produce tracklets that serve as seeds for classical pattern recognition. The method yields superior seeding efficiency for tracks within jets in dense environments. The concept of using LSTM [20] to supplement the Kalman Filter method for track following, developed by HEP.TrkX [18,21,14], was later found in one of the promising solutions of the accuracy phase [22] of the TrackML challenge. The task of particle tracking was addressed with a hit-to-track assignment method using gated recurrent units [23] (GRU), producing promising results in sparse environments [21]. This approach was constrained computationally due to the use of recurrent models.
Ref. [24] applies the track finding approach developed in Ref. [25] to the whole detector by exploiting a new data-driven graph construction method and large model support in TensorFlow [26]. Ref. [27] applies a similar GNN model to the task of particle-flow reconstruction. The model has a classification objective, followed by a partial regression of generator-level particle candidate kinematics. The method performs at least as well as a classical particle-flow algorithm in HL-LHC-like collision conditions. As part of the Exa.TrkX project, graph networks are used for LArTPC track reconstruction [28]. Ref. [29] explores the opportunity to implement Exa.TrkX-inspired graph networks on FPGAs. Starting from the input stage of the Exa.TrkX pipeline, Ref. [30] studies the impact of cluster shape information on track seeding performance. In Ref. [31], metric learning is used to improve the purity of spacepoint buckets formed using similarity hashing. With the advent of quantum computers of increasing size came the development of quantum machine learning techniques, which have also been applied in particle physics [32]. In particular, inspired by the Exa.TrkX team's use of GNNs for charged particle tracking, quantum graph networks have been tested on the same problem [33].

Input Data
This study is based on the TrackML dataset, which uses a Monte Carlo simulation of top quark pair (tt̄) production from proton-proton collisions at the HL-LHC. To simulate the effect of event pileup and produce realistic detector occupancy, a Poisson random number (with µ = 200) of QCD "minimum bias" events is overlaid on top of the tt̄ collisions.
The TrackML detector is a set of concentric cylindrical layers of pixelated sensors (the barrel) complemented by a set of circular disks (the endcaps) to ensure nearly 4π coverage in solid angle, as pictured in Figure 1. Figure 3 shows the spatial distribution of the spacepoints of a typical event. One notable feature of this dataset is the inclusion of "noise" spacepoints, added as a proxy for various low-momentum particle interactions and detector effects which would otherwise require more expensive and detailed simulations.

The Geometric Deep Learning Pipeline
This paper updates the methodology previously presented in Ref. [17] to a fully-learned pipeline, where both graph construction and graph classification are trained. This section describes the pipeline (represented schematically in Figure 4) used to obtain the results in § 4. Details of the latest model design, parameter choices, and technical optimizations are discussed in § 5.
The pipeline currently used to reconstruct tracks from a pointcloud of spacepoints requires six discrete stages of processing and inference. These broadly consist of a preprocessing stage, three stages required to construct a spacepoint graph, and two stages required to classify the graph edges and partition them into track candidates. Each stage is trained independently (due to memory constraints) on the output of the previous stage's inference.
First, the dataset is processed into a format suitable for model training. This includes calculating directional [...]. These values are appended to the cylindrical coordinates of each spacepoint to form an input feature vector to the pipeline. To apply a graph neural network to this set of data, it is necessary to arrange them into a graph. One can apply various geometric heuristics to define which spacepoints are likely to be connected by an edge (i.e. belong to the same track), but a useful technique is to train a model on the geometry of connected tracks. Thus, our second stage is to train an Embedding Network, a multi-layer perceptron (MLP) which embeds each spacepoint into an N-dimensional latent space. The graph is constructed by connecting neighboring spacepoints within a radius r_embedding in the latent space. We train this embedding with a pairwise hinge loss, to encourage spacepoints that belong to the same track to be close in the embedded space, according to the Euclidean metric. This allows for a highly efficient edge construction, since we do not rely on any heuristics of the detector geometry that may lead to missed edges.
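The radius-based graph construction described above can be illustrated with a minimal sketch. It assumes the spacepoints have already been embedded; the function name, the toy two-dimensional latent points and the radius value are hypothetical, and the brute-force pair loop stands in for the optimized nearest-neighbor search a full-size event would need.

```python
import math

def build_radius_graph(embedded, radius):
    """Return undirected edges (i, j), i < j, whose latent-space
    Euclidean distance is below `radius`."""
    edges = []
    for i in range(len(embedded)):
        for j in range(i + 1, len(embedded)):
            if math.dist(embedded[i], embedded[j]) < radius:
                edges.append((i, j))
    return edges

# Toy 2-D latent space: points 0-2 sit close together, point 3 is far away.
latent_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.15), (2.0, 2.0)]
edges = build_radius_graph(latent_points, radius=0.2)
print(edges)
```

Increasing the radius adds more candidate edges, which raises efficiency at the cost of purity; this is the trade-off controlled by r_embedding.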
The edge selection at this stage is close to 100% efficient but O(1)% pure, with a graph size of O(10^5) nodes and O(10^7) edges (the purity-efficiency trade-off can be tuned with the choice of r_embedding). Before running training or inference on the memory-intensive GNN, we filter these edges down with another MLP. The input to this third stage is the concatenated features on either side of each edge. That is, the Filter Network is a binary classifier applied to the set of edges. Constraining edge efficiency to remain high (above 96%) leads to much sparser graphs, of O(10^6) edges.
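To make the edge-level quantities concrete, the sketch below computes edge efficiency and purity for a selected edge set. It assumes the usual definitions (efficiency is the fraction of true edges that are selected, purity is the fraction of selected edges that are true), which the text does not spell out; the function name and toy edge lists are hypothetical.

```python
def edge_efficiency_and_purity(selected_edges, true_edges):
    """Treat edges as undirected pairs and compare selection with truth."""
    selected = {tuple(sorted(e)) for e in selected_edges}
    truth = {tuple(sorted(e)) for e in true_edges}
    n_correct = len(selected & truth)
    efficiency = n_correct / len(truth) if truth else 0.0
    purity = n_correct / len(selected) if selected else 0.0
    return efficiency, purity

# Toy example: three true edges, four selected edges, two of them correct.
true_edges = [(0, 1), (1, 2), (3, 4)]
selected_edges = [(1, 0), (1, 2), (0, 2), (2, 3)]
efficiency, purity = edge_efficiency_and_purity(selected_edges, true_edges)
print(f"efficiency = {efficiency:.3f}, purity = {purity:.3f}")
```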
The fourth stage of the pipeline is the training and inference of the graph neural network. The results presented in this work are predominantly obtained from the Interaction Network architecture, first proposed in Ref. [34]. This variety of GNN includes hidden features on both nodes and edges, which are propagated around the graph (called "message passing") with consecutive concatenations along edges and aggregations of messages at receiving nodes. In the final layer of the network, each edge is classified as true or fake, and the network is trained with a binary cross-entropy loss.
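The message-passing pattern described above (concatenation along edges, then aggregation at receiving nodes) can be sketched in a few lines. This is a deliberately simplified illustration, not the network used for the results: each MLP is replaced by a single hypothetical dense layer with ReLU, the weights are fixed toy values, and only one step is shown.

```python
def dense_relu(weights, bias, x):
    """One dense layer followed by ReLU; a toy stand-in for an MLP."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def message_passing_step(nodes, edges, edge_feats, edge_layer, node_layer):
    edge_w, edge_b = edge_layer
    node_w, node_b = node_layer
    # Edge update: concatenate sender, receiver and current edge features.
    new_edge_feats = [dense_relu(edge_w, edge_b, nodes[s] + nodes[r] + e)
                      for (s, r), e in zip(edges, edge_feats)]
    # Aggregation: sum the incoming messages at each receiving node.
    aggregated = [[0.0] * len(edge_b) for _ in nodes]
    for (_, r), message in zip(edges, new_edge_feats):
        aggregated[r] = [a + m for a, m in zip(aggregated[r], message)]
    # Node update: concatenate each node with its aggregated messages.
    new_nodes = [dense_relu(node_w, node_b, n + a)
                 for n, a in zip(nodes, aggregated)]
    return new_nodes, new_edge_feats

# Three nodes with one feature each, two directed edges (sender, receiver).
nodes = [[1.0], [2.0], [3.0]]
edges = [(0, 1), (1, 2)]
edge_feats = [[0.0], [0.0]]
edge_layer = ([[1.0, 1.0, 1.0]], [0.0])  # 3 inputs -> 1 edge feature
node_layer = ([[1.0, 1.0]], [0.0])       # 2 inputs -> 1 node feature
new_nodes, new_edge_feats = message_passing_step(
    nodes, edges, edge_feats, edge_layer, node_layer)
print(new_nodes, new_edge_feats)
```

Summation is used for the aggregation because it does not depend on the order in which messages arrive.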
The final stage of the TrackML pipeline involves task-specific post-processing. If our goal is track formation, we can place a threshold on the edge scores produced by the GNN and partition the graph into connected components. If our goal is track seeding, we can directly sample the classified edges for high likelihood combinations of connected triplets, or convert the entire graph to a triplet graph and train this on a second GNN to classify the triplets. A triplet graph is formed by taking all edges in the original (doublet) graph and assigning them as nodes in the new triplet graph. The nodes in this triplet graph are connected if they share a hit in the doublet graph. Applying a GNN to this structure produces highly pure sets of seeds as shown in Ref. [17].
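Both post-processing options described above reduce to simple graph operations. The sketch below, under stated assumptions, shows (a) track formation by thresholding edge scores and collecting connected components with a union-find structure, and (b) the doublet-to-triplet graph conversion. The function names, the 0.5 threshold and the toy scores are hypothetical, and the quadratic loop in the triplet conversion is for illustration only.

```python
def tracks_from_edge_scores(n_spacepoints, edges, scores, threshold=0.5):
    """Keep edges scoring above `threshold`, return connected components."""
    parent = list(range(n_spacepoints))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (i, j), score in zip(edges, scores):
        if score > threshold:
            parent[find(i)] = find(j)

    components = {}
    for i in range(n_spacepoints):
        components.setdefault(find(i), []).append(i)
    # Isolated spacepoints are not track candidates.
    return sorted(c for c in components.values() if len(c) > 1)

def doublet_to_triplet_graph(doublet_edges):
    """Doublet edges become nodes; two are linked if they share a hit."""
    triplet_edges = []
    for a in range(len(doublet_edges)):
        for b in range(a + 1, len(doublet_edges)):
            if set(doublet_edges[a]) & set(doublet_edges[b]):
                triplet_edges.append((a, b))
    return triplet_edges

doublets = [(0, 1), (1, 2), (2, 3), (3, 4)]
edge_scores = [0.9, 0.8, 0.1, 0.95]
track_candidates = tracks_from_edge_scores(6, doublets, edge_scores)
triplet_edges = doublet_to_triplet_graph(doublets)
print(track_candidates, triplet_edges)
```

In the toy example the low-scoring edge (2, 3) is cut, so the chain splits into two candidates and spacepoint 5 is left unassigned.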
Many of these techniques are common to other applications being explored in the Exa.TrkX collaboration. The pattern of nearest-neighbor graph-building and GNN edge classification has shown its potential for neutrino experiments [28] and CMS High Granularity Calorimeter [25]. Indeed, these applications build on the TrackML pipeline and extend it, for example by adding the particle type as an edge feature.

Tracking Efficiency and Purity
The performance of a tracking pipeline is mainly characterized by tracking efficiency and purity. For efficiency calculations, only charged particles that satisfy |η| < 4.0 and p_T > 100 MeV are considered. These selected particles, N_particles(selected), are hereafter referred to as particles.
The overall tracking efficiency, known as physics efficiency ε_phys (Eq. 1), is defined as the fraction of particles that are matched to at least one reconstructed track. A particle is considered to be matched to a reconstructed track when 1) the majority of spacepoints in the reconstructed track belong to the same true track, and 2) the majority of spacepoints in the matched true particle track are found in the reconstructed track.
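The two-sided matching criterion above can be written as a short function. This is a minimal sketch: "majority" is interpreted as strictly more than 50%, which is an assumption, and the function name, the data layout (a hit-to-particle map with None for noise hits) and the toy event are hypothetical.

```python
from collections import Counter

def match_track(track_hits, hit_to_particle, particle_sizes):
    """Return the matched particle id under the double-majority rule,
    or None if the track matches no particle."""
    owners = Counter(hit_to_particle.get(h) for h in track_hits)
    owners.pop(None, None)  # noise hits belong to no particle
    if not owners:
        return None
    particle, n_shared = owners.most_common(1)[0]
    # 1) most spacepoints of the track come from this particle
    majority_of_track = n_shared > 0.5 * len(track_hits)
    # 2) most spacepoints of this particle are in the track
    majority_of_particle = n_shared > 0.5 * particle_sizes[particle]
    return particle if majority_of_track and majority_of_particle else None

# Toy event: particle A leaves hits 0-3, particle B hits 4-5, hit 6 is noise.
hit_to_particle = {0: "A", 1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: None}
particle_sizes = {"A": 4, "B": 2}
good_match = match_track([0, 1, 2, 4], hit_to_particle, particle_sizes)
no_match = match_track([3, 5, 6], hit_to_particle, particle_sizes)
print(good_match, no_match)
```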
To measure the efficiency of the tracking pipeline itself, we also define the technical efficiency ε_tech (Eq. 2) as the fraction of reconstructable particles matching at least one reconstructed track. Reconstructable particles have a trajectory that leaves at least five spacepoints in the detector. Tracking purity (Eq. 3) is defined as the fraction of reconstructed tracks that match a selected particle.
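Equations 1 to 3 are referenced in the text but are not reproduced here. Written out from the prose definitions, they would take a form like the following; the counting symbols N_particles and N_tracks and their labels are assumed notation.

```latex
\begin{align}
\epsilon_{\mathrm{phys}} &=
  \frac{N_{\mathrm{particles}}(\text{selected, matched})}
       {N_{\mathrm{particles}}(\text{selected})} \tag{1} \\
\epsilon_{\mathrm{tech}} &=
  \frac{N_{\mathrm{particles}}(\text{selected, reconstructable, matched})}
       {N_{\mathrm{particles}}(\text{selected, reconstructable})} \tag{2} \\
\text{Purity} &=
  \frac{N_{\mathrm{tracks}}(\text{matched to a selected particle})}
       {N_{\mathrm{tracks}}} \tag{3}
\end{align}
```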
Averaged over 50 testing events from the TrackML dataset, the physics efficiency for particles with p_T > 500 MeV is 88.7 ± 0.3% and the technical efficiency is 97.6 ± 0.3%. Without any fiducial p_T cut, the physics efficiency becomes 67.2 ± 0.1% and the technical efficiency 91.3 ± 0.2%. The tracking purity is 58.3 ± 0.6%. Using the TrackML challenge scoring system and all tracks in the event, we obtained a score of 0.877 ± 0.005. The errors quoted are statistical only. Figure 5 shows the p_T distribution of particles as well as the tracking efficiency as a function of particle p_T. The physics efficiency for particles with p_T in [100, 300] MeV is 43% and is therefore not displayed in the plot. The physics efficiency for particles with p_T > 700 MeV is above 88%. The technical efficiency is 82% for particles with p_T in [100, 300] MeV, and increases to above 97% for particles with p_T > 700 MeV. Figure 5 also shows the η distribution of particles with p_T > 500 MeV as well as the tracking efficiency as a function of the particle η. The physics efficiency is higher in the barrel region of the detector (volumes 8, 13, 17 in Figure 1), while the technical efficiency is almost flat across the η range. In Figure 5 the p_T and η of the matched truth particle were used, rather than the p_T and η of the reconstructed track. We leave a study of track quality and detector resolution effects for future work.

Systematic Studies
Before using a tracking algorithm in production, it is necessary to measure its sensitivity to systematic effects, including pile-up, noise and digitization errors, and uncertainties in the measurement of detector properties (alignment, rotation, magnetic field map, etc.).
Measuring precisely the impact of pile-up collisions on tracking performance is beyond the scope of this work, but we can estimate pile-up's impact on tracking performance by plotting efficiency and purity as a function of the number of spacepoints in the detector. Figure 6 shows that the effect of the increased detector occupancy is a smooth performance degradation of the order of a percent. In future work, we will study the origin of this degradation to achieve the stable performance of traditional algorithms [36].
The impact of noise spacepoints can be estimated using the TrackML dataset by studying the inference performance of the tracking pipeline, trained without any noise spacepoints, as a function of the fraction of noise spacepoints (up to a maximum of 20% of the total). Table 1 shows the technical tracking efficiency and purity for different noise levels. The efficiency decreases by 1.6% and the purity by 5.4% when 20% of noise spacepoints are present. The loss of efficiency happens primarily for particles with p_T < 500 MeV (Figure 7). Detector misalignment effects are approximated by shifting the x coordinate of all spacepoints in the innermost TrackML barrel detector layer, or in the four innermost layers (volume 8 in Figure 1), by up to 1 mm. In both cases, the impact on the tracking efficiency is less than 0.1%. However, studying misalignments and other detector effects in depth requires access to each experiment's detailed detector simulation data. We leave these studies as future work to be performed in collaboration with each experiment.

Distributed Training Performance
Our training sample consists of 7500 pileup events from the TrackML dataset. It takes about 1.5 days to train the Exa.TrkX pipeline on an Nvidia A100 GPU for a single set of hyper-parameters. It is therefore desirable to use distributed training to parallelize model training and hyper-parameter optimization (HPO). This study relied on data parallel training [37] implemented using Horovod [38] and TensorFlow's tf.distribute framework [39]. Horovod supports distributed training across multiple nodes, while tf.distribute allows the same code to be used across CPUs, TPUs, and GPUs.
For this study, the TrackML pipeline is trained on up to 64 Nvidia V100 GPUs across eight NERSC Cori-GPU computing nodes. Using the Horovod framework (Figure 8), training time is reduced from 22 minutes with 1 GPU to 0.5 minutes with 64 GPUs (see footnote 5). The strong scaling efficiency (see footnote 6) is about 90% with 2 GPUs and 75% with 8 GPUs. This deviation from ideal scaling is due to the model setup time and data movement costs. Figure 8 also shows the scaling behaviour of the tf.distribute implementation. Since this implementation requires all input data to be of the same size, we have to pad all input graphs to a fixed size. This essentially doubles the time needed to train one epoch, which increases from 22 minutes for dynamic input graph sizes to 41 minutes for constant graph sizes. Leaving aside this fixed overhead, tf.distribute appears to scale better than Horovod, achieving 85% strong scaling efficiency with 8 GPUs.
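As a worked example, the strong scaling efficiency can be evaluated from the timings quoted above, using the definition given in the accompanying footnote, t_1/(N × t_N) × 100%. The function name is hypothetical, and since the quoted timings are rounded the result is only indicative.

```python
def strong_scaling_efficiency(t_one_gpu, t_n_gpus, n_gpus):
    """Strong scaling efficiency in percent: t_1 / (N * t_N) * 100."""
    return 100.0 * t_one_gpu / (n_gpus * t_n_gpus)

# Horovod timings quoted in the text: 22 minutes on 1 GPU, 0.5 minutes on 64.
eff_64 = strong_scaling_efficiency(22.0, 0.5, 64)
print(f"strong scaling efficiency with 64 GPUs: {eff_64:.2f}%")
```

A value of 100% would correspond to ideal scaling, where doubling the number of GPUs halves the training time.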

Inference performance on CPU and GPU
It is crucial to characterize the computational cost of the end-to-end learned tracking algorithm. We rely on the PyTorch and TensorFlow libraries to optimize our inference pipeline on CPU and GPU. The execution time for the inference pipeline has been measured on two hardware platforms: Nvidia V100 GPUs with 16 GB on-board memory, and Intel Xeon 6148 (Skylake) CPUs with 40 cores and 192 GB memory per node. The inputs to the filtering step do not fit into the GPU memory. Therefore, edge filtering for one event is executed in mini-batches with a fixed batch size of 800k edges. Typically, the inputs to the filtering from one event are split into seven batches, leading to additional computational cost for moving data from host to GPU. The peak GPU memory consumption is about 15.7 GB as obtained from the Nvidia profiling tool.
Fig. 5 Top row: selected, reconstructable, and matched particles (left) and tracking efficiency (right) as a function of p_T for particles with |η| < 4. Bottom row: selected, reconstructable, and matched particles (left) and tracking efficiency (right) as a function of η for p_T > 0.5 GeV. The definitions of "selected", "reconstructable", and "matched" can be found in § 4.1.1.
Footnote 5 (Distributed Training Performance): All measurements in that section were taken training on spacepoints from the barrel region of the TrackML detector. For comparison, training with spacepoints from the whole detector takes 70 minutes per epoch on one Nvidia A100 GPU.
Footnote 6 (Distributed Training Performance): Strong scaling efficiency is defined as t_1/(N × t_N) × 100%, where t_N is the time to train on a fixed total number of events across N GPUs.
Averaging over 500 events, it takes 2.2 ± 0.3 wall-clock seconds per event (as measured by the Python module time) to run the inference pipeline on the GPU and 202 ± 35 seconds to run it on a single CPU core. This total execution time includes every step of the calculation, and in particular the time needed to move data from host to GPU. Table 2 breaks down the wall-clock time for the most significant steps of the pipeline. The results show that the graph creation and filtering steps are the biggest targets for further optimization in order to surpass traditional algorithms in terms of inference time [40].
In addition, Figure 9 shows how the total inference time depends almost linearly on the number of spacepoints in the event for both CPUs and GPUs. The step-like dispersion in the GPU case is due to the splitting of the inputs to the filtering step into mini-batches. A step-like jump indicates that one more mini-batch is added. Many optimizations were introduced to the pipeline in order to achieve these GPU timings, which before optimization took over 20 seconds per event. These improvements include porting all data processing to the GPU-accelerated CuPy library [41], writing custom sparse operations for graph processing (e.g. doublet-to-triplet conversion [42], graph intersection methods), using FAISS [43] [...]. These improvements are specific to the inference stage; training optimizations will be discussed in the following section, and ongoing developments in § 6. No CPU-specific optimization was performed in this work.
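The step-like behaviour of the GPU timing follows from the number of fixed-size mini-batches needed by the edge filtering. The sketch below splits an event's edge list into index ranges of at most 800k edges, the batch size quoted earlier; the function name and the five-million-edge example are hypothetical.

```python
def split_into_minibatches(n_edges, batch_size=800_000):
    """Return (start, stop) index ranges covering `n_edges` edges
    in mini-batches of at most `batch_size` edges."""
    return [(start, min(start + batch_size, n_edges))
            for start in range(0, n_edges, batch_size)]

# A hypothetical event with five million candidate edges needs seven batches.
batches = split_into_minibatches(5_000_000)
print(len(batches), batches[-1])
```

Each additional batch adds one more host-to-GPU transfer and one more forward pass, so the filtering time jumps whenever the edge count crosses a multiple of the batch size.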

Discussion
The performance given above is the result of experimentation across various feature sets, architectures, model configurations and hyperparameters. It has also been necessary to overcome a variety of training hurdles in terms of memory and computational availability. We describe here training and inference details that should allow a reader to reproduce these results on the provided codebase.

Feature Set
The input dataset includes both spatial coordinates and highly granular pixel cluster shape information. Graph construction (the second pipeline step in Figure 4, which includes the learned embedding model and the edge filter model) appears to benefit significantly from the cluster shape information, approximately doubling the purity at a fixed high efficiency. The summary cluster shape statistics include the number of channels and the total charge deposited, as well as local and global representations of the cluster as a high-level feature vector.
Details about the calculation of this feature vector as well as a thorough exploration of the effect of cluster shape information on seeding performance are provided in Ref. [30]. Cluster shape information does not appear to improve the performance of the GNN, and in fact seems to degrade it. This suggests that the width of the GNN hidden layers is not great enough to capture the functional relationship of cluster information between nodes. Scaling to a width that properly explores this question would require more memory than available on the Nvidia A100 GPUs used for this study.
Depending on the final goal of the pipeline, further features can be included in the loss calculation in order to bias the model towards desired regions. For example, if our aim is to maximize the TrackML score (described in Ref. [2]), a weighting function s_i that places more importance on a spacepoint i from a longer and higher p_T track, and in the first and last sets of detector layers, we can weight-up true edges by this function, normalized to a mean weight of 1. To measure the performance of models trained to this goal, we introduce a weighted purity measure. Weighted purity is defined as a function of the TrackML weights w_ij and the truth y_ij ∈ {0, 1} of each edge connecting spacepoint i and spacepoint j.
We see significant improvements in this metric when validating on the weighted model: the Embedding Network improves from a weighted purity of 1.7% ± 0.2% to 2.0% ± 0.3%, while the Filter Network improves from a weighted purity of 8.4% ± 0.6% to 11.7% ± 1.0%. Given this weighting, the model learns to prioritize higher p_T and longer tracks, while disregarding less informative tracks. Using this bias, we can achieve the same TrackML score with a constructed graph size reduced by approximately 25%. Using this technique to improve the TrackML score is ongoing work.
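The weighted purity equation itself is not reproduced in the text. One natural form consistent with the description is the weighted fraction of true edges, Σ w_ij y_ij / Σ w_ij; the sketch below uses that form as an assumption. The function name and toy values are hypothetical, and fake edges are given unit weight for illustration.

```python
def weighted_purity(weights, truth):
    """Assumed form: sum(w_ij * y_ij) / sum(w_ij) over the selected edges."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * y for w, y in zip(weights, truth)) / total

# Five selected edges, two of them true. The first true edge carries a
# large weight, the second a small one.
edge_weights = [2.0, 1.0, 1.0, 1.0, 0.5]
edge_truth = [1, 0, 0, 0, 1]
purity_weighted = weighted_purity(edge_weights, edge_truth)
purity_plain = weighted_purity([1.0] * 5, edge_truth)
print(f"weighted: {purity_weighted:.3f}, unweighted: {purity_plain:.3f}")
```

With unit weights the measure reduces to the ordinary purity, so the two numbers can be compared directly.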

Graph Construction
Having chosen a feature set, to train the learned embedding space we use a training paradigm commonly referred to as a Siamese Network [46], where a particular spacepoint, called the source, is run through an MLP, here 6 layers each with 512 hidden channels, hyperbolic tangent (tanh) activations, and layer normalization. The final layer of the MLP takes the features to an 8-dimensional latent space. A different, comparison spacepoint, called the target, is also run through this same Embedding Network, and the L2 norm distance d in the latent space between the source and target enters a comparative hinge loss [...], where p is a hyperparameter that we choose to be 2. If the source i and target j spacepoints share an edge in the event's truth graph, we designate them as neighbours with y_ij = 1; otherwise they are designated y_ij = 0. In this way, the hinge loss draws together truth graph neighbors and repels non-neighbors.
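The exact loss expression is not reproduced in the text, so the following is only a generic sketch of a pairwise hinge loss of the hinge-embedding type: neighbours are penalized by their distance, and non-neighbours only when they fall inside the margin. The functional form, the function name and the margin default are assumptions, and the role of the hyperparameter p is not modelled.

```python
def pairwise_hinge_loss(distance, is_neighbour, margin=1.0):
    """Generic hinge-embedding form (an assumption, not the paper's equation):
    neighbours pay their distance, non-neighbours pay max(0, margin - d)."""
    if is_neighbour:
        return distance
    return max(0.0, margin - distance)

loss_true_pair = pairwise_hinge_loss(0.3, True)    # pulled together
loss_near_fake = pairwise_hinge_loss(0.3, False)   # pushed apart
loss_far_fake = pairwise_hinge_loss(1.5, False)    # already beyond the margin
print(loss_true_pair, loss_near_fake, loss_far_fake)
```

In this form a non-neighbour pair beyond the margin contributes zero loss and therefore no gradient, which is why the choice of training pairs matters.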
Training performance of the Embedding Network is highly dependent on the choice of source-target example pairs. In early epochs, it is enough to choose random pairs. However, at some point, many random pairs will contribute no gradient to the loss, as they will be separated by a distance greater than the margin. At that point, it is useful to implement hard negative mining [47]. We run a GPU-optimised k-nearest-neighbor (KNN) algorithm to mine examples around each source vector, within the hinge margin d = 1. The computational overhead of the KNN step is significantly offset by the mined examples, which all contribute to the loss.
A similar technique is used in the Filter Network, where the vast majority of the edges produced from the graph construction in the embedded space are easy to classify as fake. This is already a highly imbalanced dataset, with around 98.5% of edges fake. Again, within several epochs, the Filter Network is able to classify many of these as fake, so we balance each batch with all true edges, the same number of hard negatives (i.e. negatives the filter is unsure of) and the same number of easy negatives (to maintain performance on these edges). The Filter Network is an MLP that takes the 24-feature concatenated edge features and feeds forward through 3 layers of 1024 hidden channels, to a binary cross-entropy loss function.

GNN Edge Classification
In choosing the best GNN architecture, memory usage remains a significant constraint. The Interaction Network (IN) [34] presented in these results does appear to marginally attain the best performance against Attention Graph Neural Networks (AGNN) [49,14], the other class of GNN considered for the pipeline. However, both of these networks require gradients to be retained in memory for every graph edge. Indeed, this anisotropic treatment of edges (i.e. a node is able to receive the messages of each of its neighbors in a non-uniform way) is what allows these two architectures to be so expressive. Depending on hardware availability, we have found two solutions to the memory constraint. Access to next-generation Nvidia A100 GPUs allowed an IN to be trained with 8 steps of message passing, aggregating edge features at each node, and each node and concatenated edge features passing through two-layer MLPs of [128,64] hidden features and ReLU activations [50]. The aggregation function should be permutation invariant. In this work, we take it to be a summation.
For lower-memory GPUs, such as the Nvidia V100, we attained similar performance training the AGNN architecture, with [64,64,64]-channel MLPs applied to each edge and node. Adding residuals [51] across the 8 message passing steps greatly improved performance in this case. To fit full-event training on a single V100, it was necessary to employ various techniques, such as mixed precision training and gradient checkpointing. The latter stores only the input of each layer, not the gradients. On the backward pass, gradients are re-calculated on the fly, allowing for a 4x reduction in memory usage for an 8-iteration GNN. Another technique explored is to split the events piecemeal and train on each piece as a standalone batch. There is a noticeable impact on performance due to messages being interrupted at the graph edges. In future work, we will present ongoing efforts to parallelise these graph pieces across multiple GPUs, retaining the high performance that full-event training allows.

Physics-inspired data augmentation
Preliminary work on using coordinate transforms to augment the training data has been explored with varying degrees of success. In this study, focused on track seeding, only the innermost detector layers (volumes 7-9 in Figure 1) were used.
One promising approach is to make a copy of each graph in the training set that has been reflected across the phi-axis [52]. The phi reflection creates the charge conjugate graph and helps to balance any asymmetry between positively and negatively charged particles within the training set. Using the phi-reflected graphs boosts efficiency by 2% and purity by 1% in the barrel. This performance boost comes at the cost of doubling the training time. In future work, we will investigate the opportunity of integrating charge conjugation symmetry into the network itself.
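A minimal sketch of this augmentation is shown below. It interprets the reflection as mapping the azimuthal angle phi to -phi for every spacepoint while leaving r, z and the edge list unchanged; this interpretation, the (r, phi, z) tuple layout and the function names are assumptions for illustration.

```python
def reflect_phi(spacepoints):
    """Mirror spacepoints given as (r, phi, z) tuples by mapping phi to -phi."""
    return [(r, -phi, z) for r, phi, z in spacepoints]

def augment_with_reflection(graph):
    """Return the original graph plus its phi-reflected copy.
    The edge list is reused because the connectivity does not change."""
    nodes, edges = graph
    return [graph, (reflect_phi(nodes), list(edges))]

# Toy graph: three spacepoints on a curving track and their two edges.
toy_graph = ([(30.0, 0.10, -5.0), (70.0, 0.12, -11.0), (110.0, 0.15, -18.0)],
             [(0, 1), (1, 2)])
augmented = augment_with_reflection(toy_graph)
print(len(augmented), augmented[1][0][0])
```

Because every graph gains one mirrored copy, the training set doubles in size, which matches the doubling of training time noted above.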
A second promising trick is to use a Hough Transform [6,7] on the graph to create edge features. Using the Hough parameters as edge features boosts efficiency by 2% and purity by 1%. A further efficiency boost of 3% (and 2% to purity) comes from using the Hough accumulator to extract an edge weight. This edge weight effectively pools information from every node, and therefore comes at a large computational cost (filling the accumulator in Hough space). On the other hand, the Hough parameters can be computed quickly from the two nodes that define the edge.

Conclusions and Future Work
This work shows how a tracking pipeline based on geometric deep learning can achieve state-of-the-art computing performance that scales linearly with the number of spacepoints, showing great promise for the next generation of HEP experiments. The inference pipeline has been optimized on GPU systems, on the assumption that the next generation of HEP experiments will have widespread access to accelerators either locally in heterogeneous systems [27,53] or remotely [54,55].
Within the simplifying assumptions of the TrackML dataset, we have shown how the Exa.TrkX pipeline could meet the tracking performance requirements of current collider experiments. Preliminary studies suggest that this performance should be robust against systematic effects like detector noise, misalignment, and pile-up.
The goal is to adapt the Exa.TrkX pipeline to each experiment's needs and simulated datasets, and to measure its performance and robustness against systematic effects according to each experiment's metrics. For example, it is crucial for HL-LHC experiments to study the performance of tracking algorithms in dense environments, like high-p_T jets. Given the interest in long-lived particle observation at the HL-LHC, it will also be important to study the performance of the Exa.TrkX pipeline for tracks coming from a displaced vertex.
On the computational side, there are several optimization opportunities to explore systematically, including mixed precision training; multi-GPU training and inference with graph data parallelisation (that is, one event spread across multiple GPUs) [61]; locality sensitive hashing to speed up the KNN/graph construction stage [62]; model quantization, operator fusion and other improvements with TensorRT [63]; and clustering of final node embeddings, rather than the hard connected components method, with GravNet-style architectures [64].
The distributed training results presented in this work are promising but still preliminary. To fully exploit the capabilities of upcoming HPC systems and to further reduce training time while potentially pushing further on model size, it will be beneficial to perform further studies on large scale training of GNNs for track reconstruction. Given the size of the input graphs, this problem may be amenable to training techniques which parallelise the processing of input graphs across multiple GPUs in training.
Finally, it will be interesting to measure the computing performance of (parts of) the Exa.TrkX pipeline on domain-specific accelerators like Google TPU [65] and GraphCore IPU [66], comparing power consumption, latency and throughput with "traditional" GPUs.

Software availability
A growing number of groups are studying the application of graph networks to HEP reconstruction (see [67] for a recent review). Some of these works [24,27,28,29,30,31,33] have strong connections with the Exa.TrkX project. To promote collaboration and reproducibility, the Exa.TrkX software is available from the HEP Software Foundation's Trigger and Reconstruction GitHub. A pipeline of re-usable modules is implemented within the PyTorch Lightning system, which allows for uncluttered and simple model definitions. As each stage of the pipeline depends on the output of the previous one, logging utilities are integrated that allow a specific combination of stages and hyperparameters to be trackable and reproducible. Extensive documentation is provided to help track reconstruction groups start exploring geometric learning. The roadmap for this repository includes adding performance metrics to the codebase; a taxonomy of model features; and short tutorials for each of the available applications.