Learning representations of irregular particle-detector geometry with distance-weighted graph networks

We explore the use of graph networks to deal with irregular-geometry detectors in the context of particle reconstruction. Thanks to their representation-learning capabilities, graph networks can exploit the full detector granularity, while natively managing the event sparsity and arbitrarily complex detector geometries. We introduce two distance-weighted graph network architectures, dubbed GarNet and GravNet layers, and apply them to a typical particle reconstruction task. The performance of the new architectures is evaluated on a data set of simulated particle interactions on a toy model of a highly granular calorimeter, loosely inspired by the endcap calorimeter to be installed in the CMS detector for the High-Luminosity LHC phase. We study the clustering of energy depositions, which is the basis for calorimetric particle reconstruction, and provide a quantitative comparison to alternative approaches. The proposed algorithms provide an interesting alternative to existing methods, offering equally performing or less resource-demanding solutions with fewer underlying assumptions on the detector geometry and, consequently, the possibility to generalize to other detectors.


Introduction
Machine Learning (ML) techniques have traditionally been a key ingredient of event processing at particle colliders, employed in tasks such as particle reconstruction (clustering), identification (classification), and energy or direction measurement (regression) in calorimeters and tracking devices. The first applications of Neural Networks to High Energy Physics (HEP) date back to the '80s [1,2,3,4]. Starting with the MiniBooNE experiment [5], Boosted Decision Trees became the state of the art, and played a crucial role in the discovery of the Higgs boson by the ATLAS and CMS experiments [6]. Recently, a series of studies on different aspects of LHC data taking and data processing workflows have demonstrated the potential of Deep Learning (DL) in collider applications, both as a way to speed up current algorithms and to improve their performance. Nevertheless, the list of DL models actually deployed in the centralized workflows of the LHC experiments remains quite short (see footnote 1). Many of these studies, which are typically proof-of-concept demonstrations, are based on convolutional neural networks (CNNs) [10], which perform computer vision tasks by applying translation-invariant kernels to raw digital images. CNN architectures applied to HEP data thus impose a requirement for the particle detectors to be represented as regular arrays of sensors. This requirement, common to many of the approaches described in Section 2, creates problems for realistic applications of CNNs in collider experiments.

Footnote 1: As an example, at the moment such a list for the CMS experiment consists of a set of b-tagging algorithms [7,8] and a data quality monitoring algorithm for the muon drift tube chambers [9]. Other applications exist at the analysis level, downstream from the centralized event processing. In data analyses, one typically considers abstract four-momenta and not low-level quantities such as detector hits, making the use of DL techniques easier.

In this work, we propose novel Deep Learning architectures based on graph networks to improve the performance and reduce the execution time of typical particle-reconstruction tasks, such as cluster reconstruction and particle identification. In contrast to CNNs, graph networks can learn optimal detector-hit representations without making specific assumptions on the detector geometry. In particular, no data preprocessing is required, even for detectors with irregular geometries. We consider the specific case of particle reconstruction in calorimeters, for which this characteristic of graph networks may become especially relevant in the near future. In view of the High-Luminosity LHC phase, the endcap calorimeter of the CMS detector will be replaced by a novel-design digital calorimeter, the High Granularity Calorimeter (HGCAL), consisting of arrays of hexagonal silicon sensor cells interleaved with absorber layers [11]. Being positioned close to the beam pipe and exposed to ∼ 200 proton-proton collisions on average per bunch crossing, this detector will be characterized by high occupancy over its large number of readout channels. Downstream in the data processing pipeline, the unprecedented number of sensors and their geometry will cause an increase in event size and consequently in the computational needs, necessitating novel data processing approaches given the expected computing limitations [12].
The detector we consider in this study, described in detail in Section 4, is loosely inspired by the HGCAL geometry. In particular, it features a similarly irregular sensor structure, with sensor sizes varying with the detector depth as well as within a single layer. On the other hand, the HGCAL hexagonal sensors were traded for square-shaped sensors, in order to keep the computing resources needed to generate the training data set within a manageable limit.
As a benchmark application, we consider the basis for all further particle reconstruction tasks in a calorimeter: clustering of the recorded energy deposits into disentangled showers from individual particles. To this purpose, we introduce two novel distance-weighted graph network architectures, the GarNet and the GravNet layers, which are designed to provide a good balance between performance and the computing resources needed for inference. While our discussion is limited to a calorimetry-related problem, the design of these new layer architectures is such that it automatically generalizes to any kind of sparse data, such as hits collected by a typical tracking device or reconstructed particle candidates inside a hadronic jet. We believe that architectures of this kind are more practical to deploy in a realistic experimental environment and could become relevant for the LHC experiments, both for offline and real-time event processing and selection as well as shower simulation.

This paper is structured as follows: Section 2 reviews related previous works. In Section 3, we describe the GravNet and GarNet architectures. Section 4 describes the data set used for this study. Section 5 introduces the metric used to optimize the networks. Section 6 describes the models. The results are presented in Sections 7 and 8 in terms of accuracy and computational efficiency, respectively. Conclusions are presented in Section 9.

Related Work
In recent years, DL models, and in particular CNNs, have become very popular in different areas of HEP. CNNs were successfully applied to calorimeter-oriented tasks, including particle identification [11,13,14,15,16], energy regression [11,13,15,16], hadronic jet identification [17,18,19,20], fast simulation [13,21,22,23,24] and pileup subtraction in jets [25]. Many of these works assume a simplified detector description: the detector is represented as a regular array of sensors expressed as 2D or 3D images, and the problem of overlapping regions at the transition between detector components (e.g. barrel and endcap) is ignored.
Sometimes the fixed-grid pixel shape is intended to reflect the typical angular resolution of the detector, which is implicitly assumed to be a constant, while in reality it depends on the energy of the incoming particle.
In order to overcome this practical difficulty with CNN architectures, different HEP-related studies investigating alternative architectures have been performed. In the context of jet identification, several authors studied models based on recurrent [7,8,26] and recursive [27] networks, graph networks [28], and DeepSets [29]. Recurrent architectures have also been studied for event classification [30]. In general, these approaches take as input a particle-based representation of an event and thus are easier to apply in applications running after a global event reconstruction based on a particle-flow algorithm [31,32].
Outside the HEP domain, overcoming the necessity for a regular structure motivated original research to use graph-based networks [33], which in general are suited for processing point-wise data with no regular structure by representing them as vertices in a graph. A comprehensive overview of various graph-based networks can be found in Ref. [34]. In a typical implementation of a graph-based network, the vertices are connected according to some predefined criteria at the preprocessing stage. The connections between the vertices (edges) then define paths of information exchange [35,36]. In some cases, the edge and vertex properties are used to infer attention (weight) assigned to each neighbour during this information exchange, while leaving the neighbour relations (adjacency matrix) unchanged [37]. Some of these architectures have already been considered for collider physics, in the context of jet tagging [38], event topology classification [39], and for pileup subtraction [40].
Particularly interesting for irregular detectors are, however, networks that are capable of learning the geometry, as studied in combination with message passing [41]. Within this approach, the adjacency matrix is trainable. In other words, the neighbour relations, which encode the detector geometry, are not imposed at the preprocessing stage but are inferred from the input data. Although this approach is promising, its downside is the need to connect all vertices with each other, which makes it computationally challenging for graphs with a large number of vertices as the memory requirement becomes forbiddingly high. This problem is overcome by defining only a subset of connections between neighbours in a learnable space representation, where the edge properties of each vertex to a limited number of its neighbours are used to calculate a new feature representation per vertex, which is then passed to the next layer of similar structure [42]. This approach is implemented in the EdgeConv layer and the corresponding DGCNN model [42]. The neighbours are selected based on the new vertex features, which makes it particularly challenging to create a gradient for training with respect to the neighbour selection. The DGCNN model works around this issue by using the edge features themselves. However, due to the dynamic calculation of neighbour relations in high-dimensional space, this network still requires substantial computing resources, which would make its use for triggering purposes in collider detectors unfeasible.

The GravNet and GarNet layers
The neural network layers proposed in this study are designed to provide competitive performance on particle reconstruction tasks while dealing with data sparsity in an efficient way. These architectures aim to keep a trainable space representation at minimal computational costs. The layers receive as input a $B \times V \times F_{\mathrm{IN}}$ data set, consisting of a batch of $B$ examples, each represented by a set of $V$ detector hits, embedded in the network set through $F_{\mathrm{IN}}$ features. For instance, the $F_{\mathrm{IN}}$ features could include the Cartesian coordinates of a given sensor, its address (layer number, module number, etc.), the sensor time stamp, the recorded energy, etc.
A pictorial representation of the operations performed by the two layers is shown in Fig. 1. For both architectures, the first step is to apply a dense neural network to each of the $V$ detector hits, deriving from the $F_{\mathrm{IN}}$ features two output arrays: the first array ($S$) is interpreted as a set of coordinates in some learned representation space (for the GravNet layer) or as the distance between the considered vertex and a set of $S$ aggregators (for the GarNet layer); the second array ($F_{\mathrm{LR}}$) is interpreted as a learned representation of the vertex features. At this point, a given input example of initial dimension $V \times F_{\mathrm{IN}}$ is converted into a graph with $V$ vertices in the abstract space identified by $S$. Each vertex is represented by the $F_{\mathrm{LR}}$ features, derived from the initial inputs. The projection from the $V \times F_{\mathrm{IN}}$ input to this graph is linear, with trainable weights and bias vectors.
The main difference between the GravNet and the GarNet architectures is in the way the $V$ vertices are connected when building the graph. In the case of the GravNet layer, the Euclidean distances $d_{jk}$ between $(j, k)$ pairs of vertices in the $S$ space are used to associate to each vertex its closest $N$ neighbours. In the case of the GarNet layer, the graph is built connecting each of the $V$ vertices to a set of $\dim(S)$ aggregators. What is learned by $S$, in this case, is the distance between a vertex and each of the aggregators.
Once the edges of the graph are built, each vertex (aggregator) of the GravNet (GarNet) layer collects the information associated with the $F_{\mathrm{LR}}$ features across its edges. This is done in three steps:

1. The quantities

$$\tilde{f}^i_{jk} = f^i_j \times V(d_{jk}) \tag{1}$$

are computed for the feature $f^i_j$ of each of the vertices $v_j$ connected to a given vertex or aggregator $v_k$, scaling the original value by a potential that is a function of the Euclidean distance $d_{jk}$, giving the gravitational network GravNet its name. The potential function is introduced to enhance the contribution of close-by vertices. For this reason, $V$ has to be a decreasing function of $d_{jk}$. In this study, we use a Gaussian potential $V(d_{jk}) = \exp(-d_{jk}^2)$ for the GravNet layer and an exponential potential $V(d_{jk}) = \exp(-|d_{jk}|)$ for the GarNet layer.

2. The $\tilde{f}^i_{jk}$ functions computed from all the edges associated with a vertex or aggregator $v_k$ are combined, generating a new feature $\tilde{f}^i_k$ of $v_k$. For instance, we consider the average of the $\tilde{f}^i_{jk}$ across the $j$ edges and their maximum. In our case, it was particularly crucial to extend the choice of aggregator functions beyond the maximum, which was already explored for similar architectures [42]. In fact, the mean function (as any other similar function) helped improve the convergence of the model, by taking into account the contribution of all the vertices.

3. Each adopted combination rule in the previous step generates a new set of features $\tilde{F}_{\mathrm{LR}}$. All of them are concatenated to the original $F_{\mathrm{IN}}$ vector. This extended vector is transformed into a set of $F_{\mathrm{OUT}}$ new vertex features, using a fully connected dense layer with tanh activation. The concatenation is done for each initial vertex. In the case of the GarNet layer, this requires an additional step of passing the $\tilde{f}^i_k$ features of the $v_k$ aggregators back to the initial vertices, weighted by the $V(d_{jk})$ potential. This exchange of the garnered information through the aggregators defines the GarNet name.
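The GravNet aggregation described above can be illustrated with a minimal pure-Python sketch. It is not the authors' TensorFlow implementation: the function name `gravnet_aggregate`, the list-based data layout, and the choice to exclude a vertex from its own neighbour list are assumptions made for illustration, and the trainable dense layers before and after the aggregation are omitted.

```python
import math

def gravnet_aggregate(coords, feats, n_neighbours):
    """Distance-weighted mean and max aggregation over nearest neighbours.

    coords: per-vertex coordinates in the learned space S (list of lists).
    feats:  per-vertex learned features F_LR (list of lists).
    Returns, per vertex, the mean features followed by the max features.
    Requires at least two vertices.
    """
    out = []
    for k, ck in enumerate(coords):
        # Squared Euclidean distance from vertex k to every other vertex j
        # (excluding k itself is an assumption of this sketch).
        d2 = [(sum((a - b) ** 2 for a, b in zip(cj, ck)), j)
              for j, cj in enumerate(coords) if j != k]
        nearest = sorted(d2)[:n_neighbours]
        # Scale each neighbour feature by the Gaussian potential exp(-d_jk^2).
        weighted = [[math.exp(-dist2) * f for f in feats[j]]
                    for dist2, j in nearest]
        n_feat = len(feats[k])
        mean = [sum(w[i] for w in weighted) / len(weighted) for i in range(n_feat)]
        peak = [max(w[i] for w in weighted) for i in range(n_feat)]
        out.append(mean + peak)
    return out
```

In a full layer, the returned vector would be concatenated with the input features of each vertex and passed through a dense layer with tanh activation.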
The full process transforms the initial $B \times V \times F_{\mathrm{IN}}$ data set into a $B \times V \times F_{\mathrm{OUT}}$ data set. As is common with graph networks, the main advantage comes from the fact that the $F_{\mathrm{OUT}}$ output (unlike the $F_{\mathrm{IN}}$ input) carries collective information from each vertex and its surroundings, providing a more informative input to downstream processing. Thanks to the distinction between learned space information $S$ and learned features $F_{\mathrm{LR}}$, the dimensionality of connections in the graph is kept under control, resulting in a smaller memory consumption than, for instance, the EdgeConv layer.
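For comparison, the aggregator-based exchange of the GarNet layer can be sketched in the same spirit. This is a hedged illustration in pure Python, not the published implementation: `garnet_exchange` and its argument layout are hypothetical, and the dense layers that produce the distances and features are left out.

```python
import math

def garnet_exchange(dists, feats):
    """Gather vertex features at the aggregators and send them back.

    dists[j][k]: learned distance between vertex j and aggregator k.
    feats[j]:    learned features F_LR of vertex j.
    Returns, per vertex, the features returned by every aggregator.
    """
    n_vert, n_agg, n_feat = len(dists), len(dists[0]), len(feats[0])
    # Exponential potential exp(-|d_jk|) for every vertex-aggregator edge.
    pot = [[math.exp(-abs(d)) for d in row] for row in dists]
    agg = []
    for k in range(n_agg):
        weighted = [[pot[j][k] * f for f in feats[j]] for j in range(n_vert)]
        mean = [sum(w[i] for w in weighted) / n_vert for i in range(n_feat)]
        peak = [max(w[i] for w in weighted) for i in range(n_feat)]
        agg.append(mean + peak)
    # Pass the aggregated features back to each vertex, weighted by the
    # same potential, and concatenate over aggregators.
    return [[pot[j][k] * a for k in range(n_agg) for a in agg[k]]
            for j in range(n_vert)]
```

Each vertex is connected to only a handful of aggregators, so the number of edges grows with the number of vertices times the number of aggregators rather than with the number of vertex pairs.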
The two layer architectures and the models based on them, described in the following sections, are implemented in TensorFlow [43].

Data set
The data set used in this paper is based on a simplified calorimeter with irregular geometry, built in GEANT4 [44]. The calorimeter is made entirely of Tungsten, with a width of 30 cm × 30 cm in the x and y directions and a length of 2 m in the longitudinal direction (z), which corresponds to 20 nuclear interaction lengths. The longitudinal dimension is further split into 20 layers of equal thickness. Each layer contains square sensor cells, with a fine segmentation in the quadrant with x > 0 and y > 0 and a lower granularity elsewhere. The total number of cells and their individual sizes vary by layer, replicating the basic features of a slightly irregular calorimeter. For more details, see Fig. 2 and Table 1.
Charged pions are generated at z = −2 m; the x and y coordinates of the generation vertex are randomly sampled within |x| < 5 cm and |y| < 5 cm. The x and y components of the particle momentum are set to 0, while the z component is sampled uniformly between 10 and 100 GeV. The particles therefore impinge on the calorimeter front face perpendicularly and shower along the longitudinal direction.
The resulting total energy deposit in each cell, as well as the cell position, width, and layer number, are recorded for each event. These quantities correspond to the $F_{\mathrm{IN}}$ feature vector given as input to the graph models (see Section 3). Each example consists of the result of two overlapping showers. Cell by cell, the energy of two showers is summed and the fraction belonging to each of the showers in each cell is defined as the ground truth. In addition, the position of the largest energy deposit per shower is recorded. If this position is the same for the two overlapping showers, they are considered not separable and the event is discarded. This applies to about 5% of the events. In total 16 000 000 events are generated. Out of these, 100 000 are used for validation and 250 000 for testing. The rest is used for training.

Layer    Cells (x > 0, y > 0)    Cells elsewhere
0        64                      48
1        64                      108
2-3      100                     192
4-7      64                      108
8-11     64                      48
12-13    16                      12
14-19    4                       3

Table 1: Number of cells in the finely segmented quadrant and the rest of the layer, for the benchmark calorimeter geometry described in the text.
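The construction of the ground truth from two overlapping showers can be sketched as follows. The dictionary-based representation (cell identifier mapped to deposited energy) and the function name `overlay_showers` are assumptions made for illustration; both showers are assumed to contain at least one cell.

```python
def overlay_showers(shower_a, shower_b):
    """Overlay two showers, each mapping a cell id to its deposited energy.

    Returns (total energy per cell, truth fractions per cell as (a, b)),
    or None when both showers have their largest deposit in the same
    cell, in which case the event is discarded as not separable.
    """
    if max(shower_a, key=shower_a.get) == max(shower_b, key=shower_b.get):
        return None
    total, fractions = {}, {}
    for cell in set(shower_a) | set(shower_b):
        e_a = shower_a.get(cell, 0.0)
        e_b = shower_b.get(cell, 0.0)
        e_sum = e_a + e_b
        if e_sum > 0.0:
            # Summed energy is the network input; fractions are the target.
            total[cell] = e_sum
            fractions[cell] = (e_a / e_sum, e_b / e_sum)
    return total, fractions
```

For instance, a cell receiving 1 GeV from each shower enters the event with 2 GeV and truth fractions (0.5, 0.5).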

Clustering metrics
To identify individual showers and use their properties, e.g. for a subsequent particle identification task, the energy deposits should be clustered so that overlapping parts are identified without removing important parts of the original shower. Therefore, the clustering algorithms should predict the energy fraction of each sensor belonging to each shower. Lower energy deposits are slightly less important. These considerations define the loss function:

$$L = \sum_k \frac{\sum_i \sqrt{E_i t_{ik}}\,(p_{ik} - t_{ik})^2}{\sum_i \sqrt{E_i t_{ik}}}, \tag{2}$$

where $p_{ik}$ and $t_{ik}$ are the predicted and true energy fractions in sensor $i$ and shower $k$. These are weighted by the square root of $E_i t_{ik}$, which is the total energy deposit in sensor $i$ belonging to shower $k$, to introduce a mild energy scaling within each shower. In addition, in each event we randomly label one of the showers as the test shower and the other as the noise shower, and define the clustering energy response $R_k$ of shower $k$ ($k$ = test, noise) as:

$$R_k = \frac{\sum_i E_i p_{ik}}{\sum_i E_i t_{ik}}. \tag{3}$$
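A minimal sketch of an energy-weighted clustering loss and of the clustering energy response is given here in pure Python. It assumes that the loss is a squared error on the fractions, weighted by the square root of the true shower energy in each sensor and normalized per shower by the sum of the weights, and that the response is the ratio of predicted to true shower energy; the normalization and the function names are assumptions of this sketch.

```python
import math

def clustering_loss(energy, truth, pred):
    """energy[i]: total deposit in sensor i.
    truth[i][k], pred[i][k]: true and predicted fraction of sensor i
    belonging to shower k."""
    loss = 0.0
    for k in range(len(truth[0])):
        # Weight sqrt(E_i * t_ik): square root of the shower-k energy in sensor i.
        weights = [math.sqrt(e * t[k]) for e, t in zip(energy, truth)]
        norm = sum(weights)
        if norm > 0.0:
            loss += sum(w * (p[k] - t[k]) ** 2
                        for w, p, t in zip(weights, pred, truth)) / norm
    return loss

def response(energy, truth, pred, k):
    """Ratio of predicted to true energy assigned to shower k."""
    predicted = sum(e * p[k] for e, p in zip(energy, pred))
    true = sum(e * t[k] for e, t in zip(energy, truth))
    return predicted / true
```

A perfect prediction gives a loss of 0 and a response of 1 for both showers; energy wrongly moved from one shower to the other lowers one response and raises the other.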

Models
The models need to incorporate neural network layers to identify localized structures as well as to perform information exchange globally between the sensors. This can be achieved either by multiple message passing iterations between neighbouring sensors or a direct global information exchange. Here, we employ a combination of both. The input to all models is an array of sensors, each holding its recorded energy deposits, global position coordinates, sensor size, and layer number. We compare three different graph-network approaches to a CNN-based approach (Binning), presented as a baseline. Each model is designed to contain approximately 100 000 free parameters. The model structure is as follows:

- Binning: a regular grid of 20 × 20 × 20 pixels is imposed on the irregular geometry. Each pixel contains the information of at most one sensor. The information is concatenated to the mean of these features in all pixels, pre-processed by one 1 × 1 × 1 CNN layer with 20 nodes, and then fed through eight blocks of CNN layers. Each block consists of a CNN layer with a kernel of 7 × 7 × 1 followed by a layer with a kernel of 1 × 1 × 3, each containing 14 filters. The output of each block is passed to the next block and simultaneously added to a list of all block outputs. All CNN layers employ tanh activation functions. Finally, the full list of block outputs per pixel is reshaped to represent the vertices of the graph and fed through a dense layer with 128 nodes and ReLU activation. Different CNN models have also been tested and showed similar or worse performance.

- DGCNN model: adapting the model proposed in Ref. [42] to our problem, the sensor features are interpreted as positions of points in a 16-dimensional space and fed through one global space transformation followed by four blocks, each comprising one EdgeConv layer. Our EdgeConv layer has a similar configuration to that of Ref. [42], with 40 neighbouring vertices and three internal dense layers with ReLU activation acting on the edges with 64 nodes each. The output of the EdgeConv layer is concatenated with its mean over all vertices and fed to one dense layer with 64 nodes and ReLU activation, which concludes the block. The output of each block is passed to the next block and simultaneously added to a list of all block outputs per vertex together with the mean over vertices. This list is finally fed to a dense layer with 32 nodes and ReLU activation.

- GarNet model: the original vertex features are concatenated with the mean of the vertex features and then passed on to one dense layer with 32 nodes and tanh activation before entering 11 subsequent GarNet layers. These layers contain $S = 4$ aggregators, to which $F_{\mathrm{LR}} = 20$ features are passed, and $F_{\mathrm{OUT}} = 32$ output nodes. The output of each layer is passed to the next and added to a vector containing the concatenated outputs of each GarNet layer. The latter is finally passed to a dense layer with 48 nodes and ReLU activation.
In all cases, each output vertex of these model building blocks is fed through one dense layer with ReLU activation and three nodes, followed by a dense layer with two output nodes and softmax activation. This last processing step determines the energy fraction belonging to each shower. Batch normalisation [45] is applied in all models to the input and after each block.
All models are trained on the full training data set using the Adam optimizer [46] and an initial learning rate of about $3 \times 10^{-4}$, the exact value depending on the model. The learning rate is reduced exponentially in steps to the minimum of $3 \times 10^{-6}$ after 2 million iterations. Once the learning rate has reached the minimum level, it is modulated by 10% at a fixed frequency, following the method proposed in Ref. [47].
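The learning-rate schedule described above can be sketched as a plain function of the iteration number. The number of decay stages, the modulation period, and the sinusoidal shape of the modulation are not given in the text and are purely illustrative assumptions here; the method of Ref. [47] may use a different waveform.

```python
import math

def learning_rate(step, lr_start=3e-4, lr_min=3e-6, decay_steps=2_000_000,
                  n_stages=20, mod_period=10_000, mod_depth=0.10):
    """Stepwise exponential decay followed by a periodic modulation.

    n_stages, mod_period and the sine waveform are hypothetical choices.
    """
    if step < decay_steps:
        # Piecewise-constant stages spaced evenly on a logarithmic scale.
        stage = step * n_stages // decay_steps
        return lr_start * (lr_min / lr_start) ** (stage / n_stages)
    # After reaching the minimum, oscillate by +-10% around it.
    phase = 2.0 * math.pi * (step - decay_steps) / mod_period
    return lr_min * (1.0 + mod_depth * math.sin(phase))
```

With these defaults the rate starts at 3e-4, drops by a constant factor every 100 000 iterations, and oscillates between 2.7e-6 and 3.3e-6 after 2 million iterations.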

Clustering performance
All approaches described in Section 6 perform well for clustering purposes. An example is shown in Fig. 3, where two charged pions with an energy of approximately 50 GeV enter the calorimeter. One pion loses a significant fraction of energy in an electromagnetic shower in the first calorimeter layers. The remaining energy is carried by a single particle passing the central part of the calorimeter before showering. The second pion passes the first layers as a minimally ionizing particle and showers in the central part of the calorimeter. Even though the two showers largely overlap, the GravNet network (shown here as an example) is able to identify and separate the two showers very well. The track within the calorimeter is well identified and reconstructed and the energy fractions properly assigned, even in the parts where the two showers heavily overlap. Similar performance can be observed with the other investigated methods.
Quantitatively, the models are compared with respect to multiple performance metrics. The first two are the mean and the variance of the loss function value ($\mu_L$ and $\sigma_L$) computed according to Equation (2) over the test events. The mean and the variance of the test shower response ($\mu_R$ and $\sigma_R$), where the response is defined in Equation (3), are also compared. While the test shower response follows an approximately normal distribution over the majority of the test events, a small outlier population, where the shower clustering fails, causes $\mu_R$ and $\sigma_R$ to misparametrize the core of the distribution. Therefore, the response kernel mean $\mu^*_R$ and variance $\sigma^*_R$, restricted to test showers with response between 0.2 and 2.8, are added to the set of evaluation metrics. In addition, we also compare the clustering accuracy ($A$), defined as the fraction of showers with response between 0.7 and 1.3. Finally, the above set of metrics is duplicated, with the second set using only the sensors with energy fractions between 0.2 and 0.8 in the computation of the loss function and the response. The second set of metrics characterizes the performance of the models in the particularly challenging case of reconstructing significantly overlapping clusters. The two sets of metrics are called inclusive and overlap-specific in the remainder of the discussion.
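The response-based metrics listed above can be computed from a list of per-shower responses as in the following sketch. The function name, the dictionary keys, and the treatment of the interval boundaries as inclusive are assumptions; the sketch also assumes that at least one response falls inside the core window.

```python
def response_metrics(responses, core=(0.2, 2.8), accurate=(0.7, 1.3)):
    """Mean and variance of the response, the same restricted to the core
    of the distribution, and the clustering accuracy."""
    def mean_var(values):
        mu = sum(values) / len(values)
        return mu, sum((v - mu) ** 2 for v in values) / len(values)

    mu, var = mean_var(responses)
    # Kernel metrics: drop outliers where the clustering failed.
    kernel = [r for r in responses if core[0] <= r <= core[1]]
    mu_core, var_core = mean_var(kernel)
    n_accurate = sum(accurate[0] <= r <= accurate[1] for r in responses)
    return {"mu": mu, "var": var, "mu_core": mu_core,
            "var_core": var_core, "accuracy": n_accurate / len(responses)}
```

A single failed clustering with a response of 5 among otherwise well-reconstructed showers shifts the plain mean noticeably while leaving the kernel mean close to 1, which is the motivation for quoting both.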
The metric values are listed in Table 2. Comparing the inclusive metrics, it can be seen that the GravNet layer outperforms the other approaches, including even the more resource-intensive DGCNN model. The GarNet model performance is in between the DGCNN model and the binning approach in terms of reconstruction of individual shower hit fractions, parametrized by $\mu_L$ and $\sigma_L$. However, in characteristics related to clustering response, the binning model slightly outperforms the GarNet and DGCNN models. On the other hand, with respect to overlap-specific metrics, the graph-based approaches outperform the binning approach. The DGCNN and GravNet models perform equally well, and the GarNet model lies in between the binning approach and GravNet.

One should notice that part of the incorrectly predicted events are actually correctly clustered events in which the test shower is labelled as noise shower (shower swapping). Since the labelling is irrelevant in a clustering problem, this behaviour is not a real inefficiency of the algorithm. We denote by $s$ the fraction of events where this behaviour is observed. In Table 3, we calculate the loss for both choices and evaluate the performance parameters for the assignment that minimizes the loss. The binning model shows the largest fraction of swapped showers. The difference in response between the best-performing GravNet model and the GarNet model is enhanced, while the difference between the GravNet and DGCNN model scales similarly, likely because of their similar general structure.
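The treatment of shower swapping, evaluating the loss for both label assignments and keeping the one with the lower value, can be sketched as follows. The helper names are hypothetical, and the plain mean-squared error used in the example is only a stand-in for the actual clustering loss.

```python
def best_assignment(truth, pred, loss_fn):
    """truth[i], pred[i]: (fraction of shower 0, fraction of shower 1) in sensor i.

    Returns (prediction in the chosen labelling, its loss, swapped flag).
    """
    swapped_pred = [(p1, p0) for p0, p1 in pred]
    direct = loss_fn(truth, pred)
    swapped = loss_fn(truth, swapped_pred)
    if swapped < direct:
        return swapped_pred, swapped, True
    return pred, direct, False

def mse(truth, pred):
    # Plain mean-squared error per sensor, used here only as a stand-in loss.
    return sum((p - t) ** 2
               for tr, pr in zip(truth, pred)
               for t, p in zip(tr, pr)) / len(truth)
```

The fraction of events for which the flag is true corresponds to the swapped fraction discussed in the text.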
In Fig. 4, the performance of the models is compared in bins of the test shower energy with respect to inclusive and overlap-specific $\mu_R$ and $\sigma_R$. For the inclusive metrics, the GravNet model outperforms the other models in the full range, and the GarNet model shows the worst performance, albeit in a comparable range. The resource-intensive DGCNN model lies in between GravNet and GarNet.
The overall upward bias in the response for lower shower energies warrants an explanation. This bias is a result of edge effects, induced by our choice of using an adapted mean-square error loss to predict a quantity bounded in [0,1] (the energy fraction). This choice of loss function creates an expectation value larger than 0 at a peak value of 0 (and vice-versa at a fraction of 1), and therefore pushes the prediction away from being exactly 0 or 1, leading to an underestimation at high energies and an overestimation at low energies. The design of a customized loss function that eliminates this bias is left to future studies. For the moment, we are interested in a performance comparison between models, all affected by this bias.
For overlap-specific metrics, the edge effects are highly suppressed. The figures confirm that the graph-based models outperform the binning method at all test shower energies. It is also seen that the GravNet and the DGCNN models show similar performance.

Resource requirements
In addition to the clustering performance, it is important to take into account the computational resources demanded by each model during inference. The required inference time and memory consumption can have a significant impact on the applicability of the network for reconstruction tasks in constrained environments, such as the online and offline central-processing workflows of a typical collider-physics experiment. We evaluate the inference time $t$ and memory consumption $m$ for the models studied here on one NVIDIA GTX 1080 Ti GPU for batch sizes of 1 and 100, denoted as $(t_1, m_1)$ and $(t_{100}, m_{100})$, respectively. The inference time is also evaluated on one Intel Xeon E5-2650 CPU core ($t^{\mathrm{CPU}}_{10}$) for a fixed batch size of 10. As shown in Fig. 5, memory consumption and execution times differ significantly between the models. The binning approach outperforms all other models, because of the highly optimized CNN implementations. The DGCNN model requires the largest amount of memory, while the model using the GravNet layers requires about 50% less. The GarNet model provides a good compromise of memory consumption with respect to performance. In terms of inference time, the binning model is the fastest and the graph-based models show a similar behaviour for small batch sizes on a GPU. The GarNet and the GravNet models benefit from parallelizing over a larger batch. In particular, the GarNet model is mostly sequential, which also explains the outstanding performance on a single CPU core, with almost a factor of 10 shorter inference time compared to the DGCNN model.

Conclusions
In this work, we introduced the GarNet and GravNet layers, which are distance-weighted graph networks capable of learning irregular patterns of sparse data, such as the detector hits in a particle physics detector with realistic geometry. Using as a benchmark problem the hit clustering in a highly granular calorimeter, we show how these network architectures offer a good compromise between clustering performance and computational resource needs, when compared to CNN-based and other graph-based networks. In the specific case considered here, the performance of the GarNet and GravNet models is comparable to the CNN and graph baselines. On the other hand, the simulated calorimeter in the benchmark study is only slightly irregular and can still be represented by an almost regular array. In more realistic applications, e.g. with the hexagonal sensors and the non-projective geometry of the future HGCAL detector of CMS, the difference in performance between the graph-based approaches and the CNN-based approaches is expected to increase further. This would make the GarNet approach a very efficient candidate for fast and accurate inference, and the GravNet approach a good candidate for high-performance reconstruction, with significantly lower resource requirements than the DGCNN model but similar performance for a similar number of free parameters.
It should also be noted that the GarNet and GravNet architectures make no specific assumption on the structure of the underlying data, and thus can be employed for many other applications related to particle and event reconstruction, such as tracking and jet identification. Exploring the extent of usability of these architectures will be the focus of follow-up work.

Note added
After the completion of this work, Ref. [28] appeared, discussing the application of a similar approach to the problem of jet tagging.