Explainable Equivariant Neural Networks for Particle Physics: PELICAN

PELICAN is a novel permutation equivariant and Lorentz invariant or covariant aggregator network designed to overcome common limitations found in architectures applied to particle physics problems. Compared to many approaches that use non-specialized architectures that neglect underlying physics principles and require very large numbers of parameters, PELICAN employs a fundamentally symmetry group-based architecture that demonstrates benefits in terms of reduced complexity, increased interpretability, and raw performance. We present a comprehensive study of the PELICAN algorithm architecture in the context of both tagging (classification) and reconstructing (regression) Lorentz-boosted top quarks, including the difficult task of specifically identifying and measuring the $W$-boson inside the dense environment of the Lorentz-boosted top-quark hadronic final state. We also extend the application of PELICAN to the tasks of identifying quark-initiated vs.~gluon-initiated jets, and a multi-class identification across five separate target categories of jets. When tested on the standard task of Lorentz-boosted top-quark tagging, PELICAN outperforms existing competitors with much lower model complexity and high sample efficiency. On the less common and more complex task of 4-momentum regression, PELICAN also outperforms hand-crafted, non-machine learning algorithms. We discuss the implications of symmetry-restricted architectures for the wider field of machine learning for physics.


Introduction
Identifying, reconstructing, and measuring the properties and dynamics of high-energy, short-distance particle phenomena is inherently an inference task, since direct access to the fundamental processes is often impossible due to the time and length scales at which they occur. The suite of detection techniques, pattern recognition algorithms, and measurement approaches used to perform this task inevitably imposes constraints on both the nature of the information used as well as on the form and structure of the results. Such constraints play a crucial role in the context of jet substructure measurements, in which detailed analysis is performed on the long-distance features of Lorentz-boosted particle decays, parton showering, and radiation patterns found in the collimated sprays of particles that form the jets themselves. In this work, we present a comprehensive analysis of a new approach to multiple jet substructure-based inference tasks using a machine learning (ML) architecture that fundamentally respects permutation and Lorentz-group symmetries: PELICAN, the permutation equivariant and Lorentz invariant or covariant aggregator network. Our approach thus imposes explicit physics-informed constraints on the system and consequently yields new insights and capabilities.
Decades of jet substructure research have yielded a wide range of approaches to performing inference tasks such as: distinguishing quark-initiated from gluon-initiated jets [1][2][3][4][5]; discriminating jets formed from Lorentz-boosted top quarks, Higgs bosons, and $W$/$Z$-bosons from the continuum background of jets formed from light quarks and gluons [6][7][8][9]; and dissecting and measuring the parton-shower structure of light-quark and gluon jets themselves [10][11][12][13][14][15]. Many approaches have been adopted to perform these tasks, including the direct use of discriminating high-level observables and multi-variate methods [16][17][18], as well as a growing number of ML architectures using a variety of latent-space representations. For a comprehensive overview of jet substructure measurements, see Refs. [19, 20], as well as Ref. [21] for a general review of ML methods in high-energy physics (including substructure measurements). As model complexity has grown, so too have questions regarding the relationship of both the methods and the constraints that they impose to the fundamental physical processes that they are used to model. In particular, the use of observables, architectures, and latent-space representations that adhere closely to the structure and dynamics of the physics processes under study has been found to provide not only enhanced performance, but also significant insights and improvements in interpreting the results [18, 22, 23]. Imbuing these models with knowledge of, or even fundamental respect for, the symmetry group structures of the system under study has thus become increasingly impactful in the study of jet substructure, especially in the context of ML models and various neural network (NN) architectures [24][25][26].
There are several common approaches to enforcing continuous symmetries in NNs. Data augmentation can be used to train a model to have a particular sparsity structure and become approximately symmetric. However, when model complexity and interpretability are of concern, as is the case in particle physics and jet substructure analyses, a different approach is helpful. Similar issues arise with approaches that use preprocessing/normalization, which moreover come with inherent ambiguities and discontinuities that can be detrimental for sufficiently complex tasks.
Traditionally, ML algorithms are evaluated based on basic performance metrics such as accuracy and computational cost. However, in contexts where the trained algorithms are treated not only as predictors or generators, but as actual models for some process (which is especially true in scientific applications), other metrics of model quality are valuable. Model complexity (defined, e.g., as the number of parameters), explainability, and interpretability can be important for making a viable physics model out of an ML algorithm. Further, certain problem-specific properties such as symmetries can be critical as well. Symmetries in ML are known to produce less complex models which respect basic geometrical rules and arguably provide more opportunities for interpretability and explainability (e.g. convolutional neural network (CNN) kernels are often interpreted as visual features). Even in realistic settings where the symmetries are merely approximate, symmetry-constrained architectures often outperform more general architectures in terms of pure accuracy (see e.g. Section 4), but even in cases where that is not true, symmetric architectures should not be discounted, given their other benefits. For these reasons, as advocated in Ref. [27], we have adopted an approach of building all symmetries directly into the PELICAN network architecture itself, similar to the inherent translational symmetry of CNNs.

Summary of results
In Section 2 we discuss equivariance in jet physics and introduce the tools we need to build an efficient equivariant architecture. In Section 3 we describe the architectures of PELICAN classifiers and regressors. Here we briefly summarize the main results presented in this work, corresponding to Sections 4 through 8.

Top-tagging with a PELICAN classifier
We train PELICAN top taggers using a public benchmark dataset to distinguish between top-quark jets and light-quark and gluon jets. These taggers achieve state-of-the-art performance on the benchmark with fewer learnable parameters than the previous highest-performing network. PELICAN top taggers with as few as 11k parameters outperform all non-equivariant networks in the benchmark. See Section 4 for details.

$W$-boson 4-momentum reconstruction with PELICAN
We train a PELICAN model using a custom dataset [28] of fully-hadronic top decays to reconstruct the full 4-momentum of the intermediate $W$-bosons. Specifically, PELICAN uses the 4-momenta of the top-quark jet constituents as inputs. PELICAN performs favorably in reconstructing the full $W$ momentum when compared with the Johns Hopkins (JH) top tagger [7], which produces $W$ candidates for the subset of jets that pass its tagging. PELICAN achieves better $p_T$, mass, and angular resolutions on JH top-tagged jets, and achieves comparable resolutions to the JH tagger even when evaluated on the full dataset. Additionally, we train a PELICAN model to reconstruct the 4-momentum of only the products of the $W \to q\bar{q}'$ decay which are contained within the jet. We discuss differences in performance and the effects of this choice of reconstruction targets in Section 5.

$W$-boson mass reconstruction with PELICAN
Mass reconstruction is a common particle physics analysis task, and any reconstruction algorithm should be robust and relatively free of bias. In Section 6 we discuss the nuances of PELICAN mass reconstruction targeting the $W$-bosons in the above-mentioned dataset [28] as an example. The results show that eliminating bias in the underlying dataset is required to produce an unbiased final algorithm. In the case of $W$ mass reconstruction, this is achieved by training PELICAN on a dataset with multiple values of $m_W$.

Explaining PELICAN weights
PELICAN's respect for the particle permutation and Lorentz symmetries inherent to particle datasets provides it with explainability and interpretability rarely found in particle physics machine learning applications. In Section 7 we investigate the rich penultimate layer of PELICAN and its discriminatory power. In particular, we discuss interpretations of PELICAN as a soft clustering and detector-unfolding algorithm of sorts.

IRC-safety and PELICAN
In particle physics, IRC-safety is an algorithmic concern which requires tools to be robust with respect to soft-particle emissions (infrared, IR) and collinear (C) splits, due to divergences in perturbative quantum chromodynamics (QCD). In Section 8 we modify PELICAN into IR-safe and IRC-safe versions and discuss their relative performance.

Equivariance and jet physics
This section aims to establish a clear connection between the group theory that underlies the PELICAN architecture and the implementation of this approach for both classification and regression, as described in Section 3.
Given a symmetry group $G$ and two sets $X$, $Y$ on which an action of $G$ is defined, a mapping $f: X \to Y$ is called equivariant if $f(g \cdot x) = g \cdot f(x)$ for any $g \in G$ and $x \in X$. In particular, if the action of $G$ on $Y$ happens to be trivial (i.e. $g \cdot y = y$ for all $g$, $y$), then $f$ is called invariant. In relativistic physics, equivariant maps are typically represented by tensors with equivariant spacetime indices treated via Einstein notation. For instance, the electromagnetic field tensor $F^{\mu}{}_{\nu}$ can be viewed as a Lorentz-equivariant mapping from covariant vector fields to contravariant ones. In this work we will be interested in tasks from jet physics that can be reduced to learning a Lorentz-equivariant map. In this section we review some basics of the Lorentz symmetry in the context of such tasks.

Lorentz symmetry and jets
Lorentz symmetry is one of the fundamental symmetries of the Standard Model of particle physics. The full Lorentz group $\mathrm{O}(1,3)$ can be defined as the set of linear transformations of 4-dimensional spacetime that preserve the Minkowski metric $\eta = \mathrm{diag}(1, -1, -1, -1)$. However, in this work we will restrict ourselves to the proper orthochronous subgroup $\mathrm{SO}^+(1,3)$ that preserves spatial and temporal orientations. Lorentz invariance is the mathematical encapsulation of the fact that the outcomes of physical phenomena don't depend on the inertial frame of the observer. In the context of particle accelerators, this boils down to the observation that all initial and final states of a particle interaction are the same in all inertial frames. This is formally reflected in the fact that the Standard Model of particle physics is Lorentz-invariant, and therefore any model of any physically relevant process encompassed by the Standard Model can be as well.
A couple of subtle points are worth addressing before applying Lorentz symmetry to experimental tasks in jet physics. Neither the actual particle detectors nor the software simulating particle decays and their detection are Lorentz-invariant. Reasons for this include: non-invariant corrections to perturbative computations in quantum chromodynamics (QCD); non-invariance of jet clustering algorithms; and practical limitations of detectors such as finite resolution and energy cutoffs. Nevertheless, it is still valid to learn Lorentz-invariant models from data obtained this way. Firstly, QCD is globally Lorentz-invariant and boosting the entire event does not change the outcome of the decay process. As long as inference is performed on data obtained in conditions similar to the conditions of the perturbative simulation, corrections from such effects as the running of the couplings with varying momentum scales are not a concern either. The same applies to jet clustering algorithms and the finite detection resolution: as long as the data used for inference was obtained in the same reference frame as the data used for training, the inference is valid and the outputs are expected to be Lorentz-equivariant. Finally, the fact that the detector itself introduces a fixed reference frame can be fully addressed without breaking the symmetry of the model by including the detector geometry among its inputs. This will be discussed below in Section 3.1.

Lorentz invariance
The classification task considered in this work is exactly Lorentz invariant. The physical content of this statement will be discussed below, but mathematically it simply means the following. If the inputs to the network are a collection of 4-vectors (energy-momentum vectors in our case) $p_1, \ldots, p_N$, the output is $F(p_1, \ldots, p_N)$, and $\Lambda \in \mathrm{SO}^+(1,3)$ is a Lorentz transformation, then $F(\Lambda p_1, \ldots, \Lambda p_N) = F(p_1, \ldots, p_N)$. (2.1) There are a few ways of constructing a machine learning model that satisfies a constraint of this kind. The simplest one is to hand-pick a set of invariant observables (such as particle masses, relative masses, particle identification labels, and charge) and feed them into a generic neural network architecture.
Another approach, inspired by convolutional networks, is to preserve group-equivariant latent representations in the hidden layers. In this case the neuron nonlinearity must be a Lorentz-equivariant operation; examples of this can be found in both the Lorentz Group Network (LGN) [25] and LorentzNet [26]. As in traditional CNNs used in image processing, equivariant latent representations, as opposed to invariant ones, can regularize the network via efficient weight-sharing and improve training.
Here, we take a slightly different approach. Given a set of 4-vector inputs $p_1, \ldots, p_N$, we compute a complete set of Lorentz invariants on that set. For classical groups, including the Lorentz group, the invariants constructed out of a collection of vectors in the fundamental representation are functions of only the pairwise invariant dot products (using the appropriate invariant quadratic form for the given symmetry group) and of square determinants (of, say, 4 column-vectors for the Lorentz group) [29]. Furthermore, if the invariant is required to be symmetric in the vector inputs, then it is a function of the dot products alone (see also the discussion in Ref. [30]). In short, all totally symmetric Lorentz invariants can be written in the form $F(p_1, \ldots, p_N) = \phi\left(\{p_i \cdot p_j\}_{i,j=1}^{N}\right)$. This is the first key idea used in our architecture. The first step performed by the input layer is the computation of the $N \times N$ array of dot products between the particle 4-momenta (also known as the Gram matrix). Note, however, that from simple dimension counting it is clear that the $N(N+1)/2$ components of the Gram matrix $\{p_i \cdot p_j\}$ cannot all be independent. The physical manifold inside this high-dimensional space is defined by the set of constraints $\det G_5 = 0$ for every $5 \times 5$ minor $G_5$ of the Gram matrix (that is, any matrix obtained from the original one by crossing out $N-5$ rows and $N-5$ columns). Moreover, a causally related set of points such as a particle jet will always satisfy $p_i \cdot p_j \geq 0$ for all $i$, $j$. Therefore a neural network whose input is an $N \times N$ matrix will learn the task only on this $(4N-6)$-dimensional submanifold of $\mathbb{R}^{N^2}$. The outputs of the trained model on the rest of the space will be uncontrollable and physically meaningless.
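As a numerical illustration of this first key idea, the sketch below (not PELICAN's actual implementation; the function names are ours) computes the Gram matrix of Minkowski dot products for a toy set of 4-momenta and checks that it is unchanged by a boost:

```python
# Sketch: the N x N Gram matrix of Minkowski dot products is Lorentz-invariant.
import numpy as np

ETA = np.diag([1.0, -1.0, -1.0, -1.0])  # Minkowski metric, signature (+,-,-,-)

def gram_matrix(p):
    """p: (N, 4) array of 4-momenta -> (N, N) array of dot products p_i . p_j."""
    return p @ ETA @ p.T

def boost_z(rapidity):
    """Proper orthochronous boost along the z-axis with the given rapidity."""
    ch, sh = np.cosh(rapidity), np.sinh(rapidity)
    L = np.eye(4)
    L[0, 0] = L[3, 3] = ch
    L[0, 3] = L[3, 0] = sh
    return L

p = np.array([[5.0, 1.0, 2.0, 3.0],
              [4.0, 0.5, 1.0, 2.0]])     # two toy 4-momenta (E, px, py, pz)
G = gram_matrix(p)
G_boosted = gram_matrix(p @ boost_z(0.7).T)  # boost every input vector
print(np.allclose(G, G_boosted))  # True: the Gram matrix is frame-independent
```

Any network built on top of this matrix is automatically Lorentz-invariant, since $\Lambda^T \eta \Lambda = \eta$ for any Lorentz transformation $\Lambda$.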

Permutation equivariance
Particle data are often interpreted as a point cloud, since there is no natural ordering on the vectors. For such problems it makes sense to use one of the permutation-invariant or -equivariant architectures. One of the simplest approaches is called Deep Sets [31], which has been applied to jet tagging [24] and even heavy-flavor tagging [32]. The fundamental fact used in Deep Sets is that any permutation-invariant continuous mapping of inputs $x_1, \ldots, x_N$ can be written in the form $\rho\left(\sum_i \phi(x_i)\right)$, where $\rho$ and $\phi$ can be approximated by neural networks.
The main limitation of permutation-invariant architectures such as Deep Sets is the difficulty of training. Since aggregation (summation over the particle index) happens only once, the Deep Sets architecture can struggle with modeling complex higher-order interactions between the particles [33]. The network representing $\phi$ is forced to be a relatively wide fully connected network, which makes it difficult to train.
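The Deep Sets form $\rho\left(\sum_i \phi(x_i)\right)$ can be sketched with fixed toy functions standing in for the two networks (`phi` and `rho` are illustrative names, not from Ref. [31]):

```python
# Minimal sketch of the Deep Sets form rho(sum_i phi(x_i)).
def phi(x):
    return (x, x * x)          # a 2-dimensional per-element feature

def rho(s):
    return s[0] + 0.5 * s[1]   # any function of the summed features

def deep_sets(xs):
    s = [sum(f) for f in zip(*(phi(x) for x in xs))]  # single aggregation
    return rho(s)

# Permutation invariance follows from the symmetric sum:
print(deep_sets([1.0, 2.0, 3.0]) == deep_sets([3.0, 1.0, 2.0]))  # True
```

The single sum is what makes the form invariant, and also what limits its ability to model higher-order interactions between set elements.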
The alternative to permutation-invariant architectures is provided by permutation-equivariant ones. Given a symmetry group $G$ (e.g. the group of permutations), a representation $(V, \rho)$ is a tuple where $V$ is a set and $\rho: G \times V \to V$ is a map that becomes a bijection $\rho_g = \rho(g, \cdot): V \to V$ for any fixed value of the first argument, with $\rho_e = \mathrm{id}$ and $\rho_{g^{-1}} = \rho_g^{-1}$. Given two representations $(V, \rho)$ and $(V', \rho')$ of a group $G$, a map $F: V \to V'$ is called equivariant if it intertwines the two representations, that is, $F(\rho_g(v)) = \rho'_g(F(v))$ for all $g \in G$ and $v \in V$. Equivariance is a key property of all convolutional networks; for example, in CNNs the convolution operation is inherently equivariant with respect to translations (up to edge effects). Similarly, Graph Neural Networks (GNNs) use permutation equivariance to enforce architectures that respect the underlying graph structure and don't exhibit false implicit biases that produce different outputs after a mere renaming of the graph vertices. In this context, we review the standard definition of a message passing layer, where the particles are treated as nodes in a graph (for example, the fully connected graph), and every layer of the network only updates the activation at every node. If we denote by $a_i$ the data assigned to node $i$, then the message passing layer will typically construct "messages" $m_{ij} = M(a_i, a_j)$ and then update each node by aggregating the messages coming from all neighbors of that node and combining the result with the original state of the node: $a'_i = U(a_i, \sum_j m_{ij})$. Sometimes the graph also possesses "edge data" $e_{ij}$ that can be incorporated into the message-forming stage.
Message passing architectures have been successfully applied to jet tagging, most prominently in Refs. [25, 26]. However, attempts to combine message passing with Lorentz invariance run into a major obstacle: as we have seen, the inputs to the network consist of nothing but edge data $e_{ij} = p_i \cdot p_j$. Traditional message passing would require a reduction of this set of inputs to a point cloud (with only one particle index), potentially restricting the set of possible higher-order interactions between the points. To avoid making these unnecessary choices, we employ the general permutation-equivariant layers suggested in Refs. [34, 35].
In the general setting, permutation equivariance is a constraint on mappings $F$ between arrays $T_{a_1 a_2 \cdots a_r}$ of any rank $r$, with every index $a_i \in \{1, \ldots, N\}$ referring to a particle label, whereby permutations of the particles "commute" with the map: $F(\sigma \cdot T) = \sigma \cdot F(T)$ for all permutations $\sigma$. Here, the action of permutations is "diagonal": $(\sigma \cdot T)_{a_1 \cdots a_r} = T_{\sigma^{-1}(a_1) \cdots \sigma^{-1}(a_r)}$. Graph Neural Networks explicitly implement this constraint for rank 1 arrays (node information). A higher-order generalization of the Message Passing layer can be defined as the Equivariant Layer: $T \mapsto \mathrm{Agg}(\mathrm{Msg}(T))$. Here, Msg is a node-wise nonlinear map ("message forming") shared between all nodes, and Agg is a general permutation-equivariant linear mapping ("aggregation") acting on the particle indices of the array. Note that whether Msg is node-wise and whether Agg is linear is somewhat ambiguous, based on how one separates the mappings into their components, which is why, in particular, the traditional formulation of message passing allows messages to be functions of pairs of nodes. In practice, our aggregation block will also involve a nonlinear activation function.

Elementary equivariant aggregators
It only remains to describe the exact structure of the equivariant aggregation layers defined above. Since the general case is presented in Refs. [34, 35], here we will only present the layers that we need for jet tagging. Since the input is an array of rank 2, the main equivariant layer for us is one that transforms arrays of rank 2 to other arrays of the same rank: $T_{ij} \mapsto T'_{ij}$. The space of all linear maps of this type turns out to be 15-dimensional. The basis elements of this space can be conveniently illustrated using binary arrays of rank 4. There are 15 such arrays $B^a_{ijkl}$, $a = 1, \ldots, 15$, and the action of the equivariant layer can be written as $T'_{ij} = \sum_{a=1}^{15} w_a \sum_{k,l} B^a_{ijkl} T_{kl}$. (2.6) The 15 aggregators are easy to visualize. This is done below for $N = 2$.
The smallest squares represent components of the input $2 \times 2$ array, and the larger $2 \times 2$ squares represent components of the output array. Dots represent the non-zero components of the binary tensors $B^a$, and every component of the output tensor is the result of aggregation over all inputs marked by the dots. Output components that lack any dots are set to a fixed constant, by default zero (the affine versions of these mappings include two such parameters: one constant for the diagonal and another for the remaining components). By "aggregation" we mean, in general, any symmetric function, but in practice it is usually a sum or mean. For example, the first aggregator is simply the identity map on matrices: the $(i, j)$ component of the output array is the result of aggregation over only the $(i, j)$ component of the input. The second aggregator realizes the transposition of arrays, $T'_{ij} = T_{ji}$. The following three aggregators represent various ways of embedding the diagonal of the input array in an equivariant way. It is easy to see that simultaneously swapping the two rows and the two columns of the input is equivalent to doing the same to the output, which confirms equivariance. These first 5 aggregators are "order zero" in $N$ because they do not actually perform any aggregation. Instead, they can be thought of as permutation-equivariant skip-connections.
The second group of 8 "order one" aggregators aggregate over $N$ components of the input by aggregating either over rows, columns, or the diagonal, and then embedding the result into the output array in all possible equivariant ways. Finally, the last 2 aggregators are the "order two" aggregators that aggregate over all $N^2$ components of the input.
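A few of these basis maps are easy to write down directly. The sketch below (our own illustrative code; the full 15-element basis is in Refs. [34, 35]) implements a representative subset and verifies numerically that each one commutes with a simultaneous relabeling of both particle indices:

```python
# Sketch: a subset of the 15 linear permutation-equivariant maps T_ij -> T'_ij.
import numpy as np

def lineq_2to2(T):
    """Return a few basis aggregators applied to a rank-2 array T."""
    N = T.shape[0]
    row = np.repeat(T.sum(axis=1, keepdims=True), N, axis=1)  # out[i,j] = sum_k T[i,k]
    col = np.repeat(T.sum(axis=0, keepdims=True), N, axis=0)  # out[i,j] = sum_k T[k,j]
    tot = np.full((N, N), T.sum())                            # out[i,j] = sum_{k,l} T[k,l]
    diag = np.diag(np.diag(T))                                # keep only the diagonal
    return [T, T.T, diag, row, col, tot]

def permute(T, perm):
    """Diagonal action of a permutation on both indices of a rank-2 array."""
    return T[np.ix_(perm, perm)]

rng = np.random.default_rng(0)
T = rng.normal(size=(4, 4))
perm = np.array([2, 0, 3, 1])
ok = all(np.allclose(permute(A, perm), B)
         for A, B in zip(lineq_2to2(T), lineq_2to2(permute(T, perm))))
print(ok)  # True: each map commutes with particle relabeling
```

The remaining basis elements are the other equivariant ways of embedding rows, columns, diagonals, and totals into the output array.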
If we allow aggregators to be nonlinear, then they can take the following form: the binary array $B^a$ selects a subset of the components of the input array, and then a general symmetric function $f_a$ is applied to that subset: $T'_{ij} = f_a\left(\{T_{kl} \mid B^a_{ijkl} = 1\}\right)$. In practice we define $f_a$ as the mean of its inputs, followed by an additional scaling by a factor of $N^{\alpha_a}/\bar{N}^{\alpha_a}$ with learnable exponents $\alpha_a$, where $\bar{N}$ is a constant representing the typical number of input vectors expected in the dataset, provided to the model as a hyperparameter.

Equivariance and Jet Physics
There are several reasons for enforcing the full Lorentz symmetry in our ML models. First and foremost, it is a fundamental symmetry of the space to which the inputs belong. Lorentz transformations represent the effect of switching between different inertial frames, and most fundamental processes in physics are independent of the choice of the observer's inertial frame: if a given collection of particles consists of the products of a decay of a top quark for one observer, then the same is true for all other observers. Nevertheless, some processes involved in generating and observing high-energy collision events break the Lorentz symmetry in subtle ways. At the fundamental level, the running of the couplings in QCD can cause Lorentz symmetry breaking in the parton shower distribution functions. Even the number of final decay products depends on the transverse boost of the initial parton-level particles. However, there is no question that both the original protons and the final (asymptotic) decay products are accurately represented by a collection of 4-vectors subject to the spacetime Lorentz symmetry: the asymptotic outcome of a collision event is independent of the observer's reference frame.
Another reason for symmetry-restricted modeling is that, from the geometric perspective, only some mathematical operations are permissible when working with objects that transform in a certain way under a symmetry group. A non-equivariant neural network effectively neglects the vector nature of the inputs by treating individual components of the input vectors as scalars. While improving network expressivity, non-equivariance fails to deliver physically interpretable models. Ultimately, a statement about equivariance is a statement about what the basic features of the data are: e.g. vectors are features, but the individual components of those vectors are not.
More relevant to the applications is the fact that both the simulation and the observation of collisions inevitably involve some degree of clustering. A particle detector is made of cells (e.g. calorimeters) of finite size and as such is unable to distinguish between some particles that are collinear or very close to collinear. Similarly, the standard algorithms for collision simulation typically perform jet clustering to closely reproduce the detector behavior. Clustering, of course, is not a Lorentz-invariant procedure: particle tracks that diverge by a small angle in one frame will diverge by a large angle in another, highly boosted frame. However, this limitation of Lorentz-invariant architectures is fairly minor. Since clustering is always done in a fixed laboratory frame, it is still reasonable to impose the full Lorentz symmetry on the resulting 4-vector data. So unless the pre-clustering data itself comes from multiple significantly different inertial frames, clustering does not interfere with the fundamental symmetry. Simply put, however a given set of 4-vectors is obtained and represented in a specific inertial frame, those vectors will respect the Lorentz symmetry.

PELICAN architecture
The PELICAN architecture is simplified with respect to LGN due to the use of the complete set of dot products between the input 4-momenta (see Section 2), and this has significant implications for both the overall architecture and the ease of training and interpretability of the network. This section discusses each of the primary components of the network, including the inputs and their embedding, the permutation and Lorentz equivariant blocks, and the output layers that determine the structure of the result, namely classification or 4-vector regression.

Dot Products and Beams
On the input side of the architecture, the first step is to compute all pairwise dot products of the input 4-momenta. Appended to the list of these 4-momenta are two auxiliary beam particles with 4-momenta $(1, 0, 0, \pm 1)$. This is helpful since the datasets we are using are all simulated in a fixed laboratory frame where the original proton-proton collision happens along the $z$-axis, and the auxiliary inputs restore this orientational knowledge. In particular, the dot products between constituents and beams give PELICAN access to the energies and transverse momenta of all constituents.
It is worth emphasizing that introducing beams in this manner allows us to fix a particular spatial orientation of the events without restricting or violating the global Lorentz symmetry inherent in the architecture. Indeed, if one were to treat the auxiliary beams as constant vectors of hyperparameters, then this would reduce the full Lorentz symmetry to merely rotations in the $xy$-plane and $z$-boosts. However, because the beams are fed into the network on an equal footing with all other inputs, they are properly treated as full-fledged 4-vectors that also transform under the global Lorentz symmetry. Thus, counter-intuitively, we let the network access individual energies, transverse momenta, and $z$-momenta while still preserving full Lorentz symmetry and all the computational benefits that come with it.
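The fact that the beam dot products expose lab-frame quantities is easy to verify: with beam vectors $b_\pm = (1, 0, 0, \pm 1)$, the invariants satisfy $p \cdot b_\pm = E \mp p_z$, so $E$ and $p_z$ are recoverable from invariants alone. A small numerical check (our own illustrative code):

```python
# Check: dot products with the beams b+/- = (1, 0, 0, +/-1) encode E and pz,
# since p.b+ = E - pz and p.b- = E + pz.
def minkowski_dot(p, q):
    return p[0]*q[0] - p[1]*q[1] - p[2]*q[2] - p[3]*q[3]

p = (5.0, 1.0, 2.0, 3.0)   # (E, px, py, pz)
b_plus  = (1.0, 0.0, 0.0,  1.0)
b_minus = (1.0, 0.0, 0.0, -1.0)

E  = 0.5 * (minkowski_dot(p, b_plus) + minkowski_dot(p, b_minus))
pz = 0.5 * (minkowski_dot(p, b_minus) - minkowski_dot(p, b_plus))
print(E, pz)  # 5.0 3.0
```

Combined with the mass invariant $p \cdot p = E^2 - p_z^2 - p_T^2$, this also gives the transverse momentum $p_T$.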
Input Embedding Next there is an embedding layer that applies the function $f_\alpha(x) = ((1 + x)^\alpha - 1)/\alpha$ to each dot product, with $C_0 - 2$ different values of the trainable parameter $\alpha$ (initialized to span the interval $[0.05, 0.5]$). Then the result goes through a masked BatchNorm2D layer. Finally, this array of scalars gets concatenated with two labels $L_i$, $L_j$ per dot product $d_{ij} = p_i \cdot p_j$ that indicate whether each of particles $i$ and $j$ is a beam or not. The label for a beam is chosen to be 1 and the label for all other particles is 0. At the end of this input block, we have a tensor of shape $[B, N_{\max}, N_{\max}, C_0]$ where the feature vector for each particle pair has the form $\left[\mathrm{BatchNorm2D}\left(f_{\alpha_1}(d_{ij}), \ldots, f_{\alpha_{C_0-2}}(d_{ij})\right), L_i, L_j\right]$.
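The embedding function $f_\alpha(x) = ((1+x)^\alpha - 1)/\alpha$ is a smooth compression of the large dynamic range of the dot products; in the limit $\alpha \to 0$ it reduces to $\log(1+x)$. A quick sketch (illustrative values, not the trained parameters):

```python
# Sketch of the input embedding f_alpha(x) = ((1+x)^alpha - 1)/alpha.
import math

def f_alpha(x, alpha):
    return ((1.0 + x) ** alpha - 1.0) / alpha

alphas = [0.05, 0.2, 0.5]                     # initial alphas span [0.05, 0.5]
d = 1000.0                                    # a typical large dot product
features = [f_alpha(d, a) for a in alphas]    # several compressed views of d

# As alpha -> 0 the embedding approaches log(1 + x):
print(abs(f_alpha(d, 1e-8) - math.log(1.0 + d)) < 1e-3)  # True
```

Learning several $\alpha$'s lets the network choose how aggressively to compress different dot-product scales.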

Permutation equivariant blocks
The main element of the equivariant architecture is the permutation-equivariant block transforming arrays of rank 2. Namely, we assume that the input tensor to the block has shape $[B, N_{\max}, N_{\max}, C_\ell]$, where $B$ is the batch size, $N_{\max}$ is the maximum number of jet constituents per event (with zero padding for events with fewer constituents), and $C_\ell$ is the number of input channels. We also use a binary mask of shape $[B, N_{\max}, N_{\max}]$ to appropriately exclude the zero padding from operations like BatchNorm and aggregation. The output of the block will be a similar tensor of shape $[B, N_{\max}, N_{\max}, C_{\ell+1}]$ with the same mask. As outlined above, the equivariant layer consists of a message block and an aggregation block. The message block is chosen to be a dense multilayer perceptron (MLP) acting on the channel dimension with a LeakyReLU activation and BatchNorm2D (normalization over the first three dimensions of the tensor, for each channel separately, followed by an affine transform with two learnable parameters per channel). Here we use a masked implementation of batch normalization so that the variable particle number is respected. The message block is then followed by Dropout that zeroes out each of the $B \times N_{\max}^2 \times C^{\mathrm{eq}}_\ell$ components independently with a certain probability. The aggregation block applies the 15 linear aggregation functions (LinEq$_{2\to 2}$) which, for each component of the output tensor, compute the mean over some subset of the components of the input tensor, as explained in Section 2.4. Note that this is a non-parametric transformation performed on each channel separately. Each of the $C^{\mathrm{eq}}_\ell \times 15$ resulting aggregation values is then independently multiplied by $N^\alpha / \bar{N}^\alpha$ with a trainable exponent $\alpha$ (initialized as a random float in $[0, 1]$), where $N$ is the number of particles in the corresponding event. This allows for some flexibility in the aggregation process; for example, $\alpha = 1$ returns the sum aggregation function, and combining multiple aggregators is known to boost accuracy [34].
Aggregation is followed by a dense layer that mixes the $C^{\mathrm{eq}}_\ell \times 15$ aggregators down to $C_{\ell+1}$ features. Due to the size of this layer, we employ a simple factorization to reduce the number of parameters. Namely, the weight tensor $W_{abc}$, where $a$ is the input channel index, $b$ is the basis index (1 to 15), and $c$ is the output channel index, can be replaced by a sum of two factorized terms. Here, the first term first mixes the 15 aggregators among each other for each output channel, and then mixes the channels. Similarly, the second term first mixes the 15 aggregators for each input channel, and then mixes the channels. The final result is a tensor of shape $[B, N_{\max}, N_{\max}, C_{\ell+1}]$, so these equivariant layers can be stacked multiple times.
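One plausible reading of this factorization (a sketch under our own assumptions; the exact form used by PELICAN may differ in detail) is $W_{abc} \approx M_{ac} S_{bc} + N_{ac} T_{ab}$, where $S$ mixes the 15 aggregators per output channel and $T$ mixes them per input channel, with $M$ and $N$ mixing the channels:

```python
# Hedged sketch of a rank-factorized mixing weight W[a, b, c]
# (input channel a, basis index b = 1..15, output channel c).
import numpy as np

C_in, C_out, B = 32, 32, 15
rng = np.random.default_rng(1)
M, S = rng.normal(size=(C_in, C_out)), rng.normal(size=(B, C_out))
N, T = rng.normal(size=(C_in, C_out)), rng.normal(size=(C_in, B))

# W[a,b,c] = M[a,c]*S[b,c] + N[a,c]*T[a,b]
W = np.einsum('ac,bc->abc', M, S) + np.einsum('ac,ab->abc', N, T)

full = C_in * B * C_out                        # parameters without factorization
factored = M.size + S.size + N.size + T.size   # parameters with factorization
print(W.shape, full, factored)  # (32, 15, 32) 15360 3008
```

The point of the factorization is the parameter count: roughly $2 C_{\mathrm{in}} C_{\mathrm{out}} + 15(C_{\mathrm{in}} + C_{\mathrm{out}})$ instead of $15\, C_{\mathrm{in}} C_{\mathrm{out}}$.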

Classification and 4-vector regression outputs
One of the strengths of the PELICAN architecture is the ability to easily switch from serving as a classification tool for discriminating between Lorentz-boosted top quarks and the QCD background to providing 4-vector regression results, such as momentum reconstruction.

PELICAN classifier
To build a classifier, aside from the Eq$_{2\to 2}$ equivariant layer one needs an Eq$_{2\to 0}$ layer that reduces the rank 2 array to permutation-invariant scalars. This layer involves just 2 aggregation functions instead of 15 (the trace and the total sum of the input square matrix), but is otherwise identical to the equivariant layer described in the previous section.
From the input block, the tensor is passed through a stack of equivariant Eq$_{2\to 2}$ layers, followed by the Eq$_{2\to 0}$ layer with dropout. This produces a tensor of shape $[B, C_{\rm out}]$. One final MLP mixes this down to just 2 classification weights per event. A cross-entropy loss function is then used for optimization.
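The invariant Eq$_{2\to 0}$ aggregation can be sketched as follows (a deliberate simplification of our own: the real layer also applies the $N^{\alpha}$ scaling and channel mixing):

```python
import numpy as np

def eq2to0(T):
    """The two permutation-invariant aggregations of the Eq2->0 layer:
    the trace and the total sum of an (N, N) array. Any simultaneous
    relabeling of the N rows and columns leaves both outputs unchanged."""
    return np.array([np.trace(T), T.sum()])
```

These two scalars per channel are what the final MLP mixes down to the 2 classification weights.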

PELICAN 4-vector regression
The same architecture can also be easily adapted for 4-vector regression tasks, such as momentum reconstruction. Any Lorentz-equivariant map from a collection of 4-vectors $p_1, \ldots, p_N$ to one (or several) 4-vectors has the form $p^\mu = \sum_{i=1}^{N} c_i\, p_i^\mu$, where the $c_i$'s are Lorentz-invariant functions [36]. Combining this with permutation invariance, we conclude that the multi-valued map $(p_1, \ldots, p_N) \mapsto (c_1, \ldots, c_N)$ must also be equivariant with respect to permutations of the inputs. The only change required to the architecture we introduced for classification is that Eq$_{2\to 0}$ must be replaced with Eq$_{2\to 1}$, and the final output layer must have only one output channel (assuming we are regressing on a single 4-vector). The Eq$_{2\to 1}$ layer is again identical to Eq$_{2\to 2}$ except that it uses only 4 linear aggregators: row sums, column sums, trace, and full sum. The architecture is summarized by the following diagram, where we treat the $p_i$ as the inputs and the $c_i$ as the outputs, keeping in mind formula (3.3), which lets us recover the final predicted vector.
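The structural claim (Lorentz-invariant coefficients times input 4-vectors) can be checked numerically; a sketch with an illustrative boost along the $z$-axis:

```python
import numpy as np

def regress_4vector(p, c):
    """Prediction of the form described in the text: sum_i c_i * p_i^mu,
    with p of shape (N, 4) in (E, px, py, pz) ordering and Lorentz-invariant
    coefficients c of shape (N,)."""
    return c @ p

def boost_z(rapidity):
    """Lorentz boost matrix along the z-axis for (E, px, py, pz) vectors."""
    ch, sh = np.cosh(rapidity), np.sinh(rapidity)
    return np.array([[ch, 0, 0, sh],
                     [0,  1, 0, 0],
                     [0,  0, 1, 0],
                     [sh, 0, 0, ch]])
```

Because the $c_i$ are invariant, boosting the inputs and then aggregating gives the same result as aggregating and then boosting, and permuting the constituents (together with their coefficients) leaves the output unchanged.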

Tagging jets from Lorentz boosted top quarks
This section presents the dataset, training approach, and results of using PELICAN as a classifier in the context of identifying Lorentz-boosted top quarks. Three different versions of PELICAN are discussed, each with a different size in terms of both the width of the network and the number of trainable parameters. Lastly, we present the dependence of the performance on the size of the training dataset, providing a quantitative relationship between the size of the network, the training dataset efficiency, and the resulting performance.

Classification dataset
We perform top-tagging on the reference dataset [37], which was also used in Ref. [8]. This dataset consists of 2M entries, each corresponding to a single hadronic top jet or the leading jet from a QCD dijet event. There are 1.2M training entries, 400k validation entries, and 400k testing entries. The events were generated with Pythia8, and the Delphes framework [38] was used for fast detector simulation in order to incorporate detector effects. For each jet, the 4-momenta of the 200 leading constituents are stored in Cartesian coordinates $(E, p_x, p_y, p_z)$, in order of decreasing $p_T$. This list is zero-padded, and all jets in the dataset have fewer than 200 constituents. The dataset does not contain any other information on the jet constituents, such as charge or spin.

Classification training procedure
The top-tagging model contains five Eq$_{2\to 2}$ blocks of identical shapes. We train three different versions of the model with different widths. The widest model has 132 input and 78 output channels on every messaging layer (the equivariant layer then produces $78 \times 15$ quantities, which get mixed down to 132 channels by a fully connected linear layer). The output MLP is just one layer that mixes 132 channels down to 2 classification weights. The number of jet constituents was capped at 80 (no noticeable performance gain was seen beyond that number). The dropout rate was 0.025, and the model was optimized using the AdamW optimizer [39] with a weight decay of 0.005. Training on the full dataset ran for 35 epochs with the same learning rate schedule as in Ref. [26]: 4 epochs of linear warm-up up to a learning rate of 0.001, followed by 28 epochs of CosineAnnealingLR with $T_0$ of 4 epochs and $T_{\rm mult} = 2$, and then 3 epochs of exponentially decaying learning rate with exponent $\gamma = 0.5$ per epoch. The three models were trained on Nvidia H100 GPUs with a batch size of 100, taking 0.43, 0.17, or 0.08 seconds per batch, respectively. Inference took 0.17, 0.07, or 0.04 seconds per batch. Batches were shuffled between epochs.
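The schedule can be sketched as a pure function of the epoch index; this is our own reimplementation from the description above (note that the warm-restart cycle lengths $4 + 8 + 16$ exactly fill the 28 cosine epochs):

```python
import math

def pelican_lr(epoch, base_lr=1e-3, warmup=4, cosine=28, T0=4, Tmult=2, gamma=0.5):
    """Hedged sketch of the schedule described in the text: linear warm-up,
    cosine annealing with warm restarts (T0, Tmult), then exponential decay."""
    if epoch < warmup:
        # linear warm-up to base_lr over the first `warmup` epochs
        return base_lr * (epoch + 1) / warmup
    if epoch < warmup + cosine:
        t = epoch - warmup
        # find the current restart cycle: lengths T0, T0*Tmult, T0*Tmult^2, ...
        length = T0
        while t >= length:
            t -= length
            length *= Tmult
        return 0.5 * base_lr * (1 + math.cos(math.pi * t / length))
    # exponential decay after the cosine phase
    return base_lr * gamma ** (epoch - warmup - cosine + 1)
```

In practice one would use `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts` for the middle phase; the standalone function above only illustrates the shape of the schedule.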

Classification results
Figure 2 shows the receiver operating characteristic, here represented by the background rejection as a function of the signal efficiency, for the classification performance. In Table 1 we compare the accuracy, area under the curve (AUC), and background rejection values at 30% signal efficiency between PELICAN and multiple existing ML top-taggers, including the previous state-of-the-art LorentzNet [26]. We trained three PELICAN top-taggers with layers of differing widths, with 208k, 48k, and 11k trainable parameters, respectively. The results are averaged over 5 random initialization seeds, and the uncertainties are given by the standard deviation. The large PELICAN model improves upon the LorentzNet result with a comparable number of parameters, and the medium model roughly matches LorentzNet despite having 5 times fewer parameters. Perhaps most remarkably, the small model with 11k parameters beats every pre-LorentzNet competitor despite having many times fewer parameters (up to 130 times fewer) than the other networks.
In addition to different model sizes, we also explore sample efficiency. Each of the three models above was trained on 0.5%, 1%, and 5% of the training data and compared to the original. For these, the training ran for 70 epochs with 60 epochs of CosineAnnealingLR instead of 28, and 6 epochs of exponential decay instead of 3. The results can be found in Table 2. Notice that at lower amounts of training data the differences in performance between models of different widths become much less significant, and at 1% and 0.5% of training data all three models fall within each other's uncertainty ranges. These results suggest that the larger PELICAN networks are able to learn a greater range of more subtle features from the training data and thus benefit from seeing a larger training dataset. On the other hand, the primary features are already learnable with just a few percent of the data. In particular, with 5% of the training data and only 11k learnable parameters, the PELICAN 25/15 version of the network achieves background rejection performance similar to that of ResNeXt, which uses 1.46M parameters trained on the full dataset.

𝑾-boson 4-momentum reconstruction
To test the equivariant regression architecture described in Section 3, we chose a task where the aim is to reconstruct (or predict) the full 4-momentum of the $W$-boson within the Lorentz-boosted top quark decay products. Specifically, we consider the same hadronic top quark decay that constitutes the signal in the top-tagging dataset, i.e. the two-step decay $t \to bW \to b q \bar{q}'$, followed by hadronization, showering, and detection. Our aim is to reconstruct the true 4-momentum of the $W$-boson given the full set of observed final state particles of the top decay, as represented by the jet constituents.

Regression dataset
The dataset used for the regression task consists of 1.5M $t\bar{t}$ events simulated with Pythia8, with 700k events for training, 200k events for validation, and 500k events for testing (with an additional 100k events set aside in a second testing set). From each event, we cluster anti-$k_T$ jets with $R = 0.8$ using FastJet [42] and select the jet nearest to the truth-level top quark in $(\eta, \phi)$, requiring the distance between the top quark and the jet to satisfy $\Delta R(\text{top quark, jet}) < 0.8$. This jet clustering is done both at truth level and using calorimeter tower objects produced by running the event through Delphes fast detector simulation using the ATLAS detector card. Thus, each event in the dataset corresponds to a single jet and includes information for truth-level particles such as the truth-level top quark; we may therefore use the terms jet and event interchangeably below, with the understanding that each event in the dataset has one and only one jet recorded. This dataset is publicly available via Zenodo [28], where a full description of the various data fields is provided. Here we provide only an overview of some key features:
1. There are two versions of the dataset, corresponding to truth- and reconstruction-level (Delphes) jets. The events are the same between versions, so the two can be compared event-by-event to study the effects of detector reconstruction on network training and performance.
2. In addition, each event contains the stable $W$-boson daughter particles. These are the truth-level, final state particles that are traced back to the $W$-boson by Pythia.
3. Each jet is tagged with the Johns Hopkins top tagger [7] (JH), as implemented in FastJet. This allows us to define a subpopulation of JH-tagged events, which we shall sometimes refer to as JH events. For jets that it tags as top quark jets, JH provides a $W$-boson candidate constructed from subjets.
4. Each jet is also tagged as to whether or not it is fully contained (FC). We define FC events as those where the $b$-quark, as well as the two quarks from the $W \to q\bar{q}'$ decay, are within $\Delta R < 0.6$ of the jet centroid (i.e. within 75% of the jet radius). In such cases almost all of the $W$ daughters are contained within the jet, and we can expect a good reconstruction of the $W$ momentum. FC events comprise 75% of the dataset.

Regression training procedure
Our model has 4 equivariant Eq$_{2\to 2}$ blocks. Each messaging layer takes in 132 channels and outputs 78 channels. Conversely, each equivariant aggregation layer has 78 input channels and outputs 132 channels.
The Eq$_{2\to 1}$ block has the same shape, and the final fully connected layer has shape $1 \times 132$. There are 210k parameters in total. Assuming $N$ non-zero input jet constituents, this produces $N$ scalar coefficients $c_i$ with zero-padding; these are the Lorentz invariants introduced in (3.3). The reconstructed 4-momentum is then computed via $p^\mu_{\rm reco} = \sum_{i=1}^{N} c_i\, p_i^\mu$ (5.1). The training regime for this task is essentially identical to the one for top-tagging: AdamW optimizer with weight decay of 0.01, 35 epochs in total with 4 epochs of warm-up and exponential learning rate decay for the last 3 epochs. The main difference is in the choice of the loss function $\mathcal{L}(p_{\rm reco}, p_{\rm target})$. Spacetime geometry allows for many choices of this function, which in turn will affect the shape of the loss landscape near $p_{\rm target}$, and hence the precision of various reconstructed features of the vector, such as the mass, energy, spatial momentum, transverse momentum, and direction. It is even possible to construct Lorentz-invariant loss functions to make the training process itself equivariant. Nevertheless, for the purpose of simultaneous reconstruction of the direction and the mass of the $W$-boson, we found a loss that uses all 4 components of the target vector to be very effective; it strikes a good balance between the precision of the reconstructed mass and spatial momentum.
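The paper's exact loss is not reproduced here; purely as an illustration of a loss that uses all four components and balances spatial and energy errors, one might write the following (the function and its weighting `alpha` are hypothetical, not parameters from the text):

```python
import numpy as np

def four_vector_loss(p_reco, p_target, alpha=1.0):
    """Illustrative 4-vector regression loss (not the one used by PELICAN):
    Euclidean distance between the spatial momenta plus a weighted absolute
    energy difference, averaged over the batch. Vectors are (E, px, py, pz)."""
    spatial = np.linalg.norm(p_reco[..., 1:] - p_target[..., 1:], axis=-1)
    energy = np.abs(p_reco[..., 0] - p_target[..., 0])
    return float((spatial + alpha * energy).mean())
```

Different choices of this kind trade off mass resolution against angular resolution, which is the sensitivity discussed above.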
A rarely discussed feature of this task is the choice of the target vector $p_{\rm target}$. Even though our ultimate inference target is the true $W$ momentum $p^W_{\rm true}$, it is not necessarily the best training target given the nature of the dataset. Detection and jet clustering involve multiple energy, momentum, and spatial cuts that exclude some decay products from the final jet. For instance, one of the three quarks in $t \to bW \to bq\bar{q}'$ might fall outside of the $R = 0.8$ radius of the jet clustering algorithm, in which case most of the decay products of that quark are likely to be absent from the event record. If many of the decay products of the $W$-boson are missing, then we lack the information necessary to make an accurate estimate of its true momentum, or even to identify which of the jet constituents belong to the $W$. This effect is often referred to as an acceptance issue, due to the finite purview of the final state reconstruction.
To alleviate this issue and provide better control over the inference stage, we propose an alternative target 4-vector that we call the contained true $W$ momentum $p^W_{\rm cont}$, equal to the total 4-momentum of the truth-level $W$ decay products that fall within the radius of the final reconstructed top jet. In the truth-level dataset, this is simply $p^W_{\rm cont} = \sum_{i \in W} p_i$, where the sum runs over the constituents whose parent is the $W$-boson and not the $b$-quark. In the Delphes dataset, however, there is no simple analytic relationship between $p^W_{\rm cont}$ and the jet constituents $p_i$; that is, the mapping of the truth-level information to the detector-level reconstruction is highly non-linear. Nonetheless, in either dataset this vector more accurately reflects the available information about the $W$-boson and allows us to make inferences not only about the $W$-boson itself, but also about the containment qualities of the event. This will be discussed further in Section 5.5 below. For reference, the true mass spectra of both $p^W_{\rm true}$ and $p^W_{\rm cont}$ are shown in Fig. 3. There, the top curve represents the spectrum over the entire dataset on a log scale, and the bottom curve shows the spectrum over FC events only, scaled linearly relative to the top curve; that is, the fraction of FC events in a given bin is given by the apparent height of the FC curve divided by the total height of the bin (heights are measured from the $x$-axis). The two mass spectra of FC events, in fact, match.

Table 3: Momentum reconstruction results for the Johns Hopkins (JH) tagger and PELICAN trained to reconstruct $p^W_{\rm true}$. We report the relative $p_T$ and mass resolutions, and the interquantile range for the angle $\Delta\theta$ between predicted and true momenta. PELICAN uncertainties are within the last significant digit.

For fully-contained (FC) events, the mass spectra are similar between the true and the contained $W$ mass, as expected. Non-FC events are mostly confined to a clear second peak at 13 GeV, corresponding to jets where one of the quarks from $W \to q\bar{q}'$ fell outside the jet, and a minor peak at $m^W_{\rm cont} = 0$ corresponding to jets containing no $W$ daughters at all. Given the above observations, we prepared two PELICAN models, one trained to reconstruct $p^W_{\rm true}$ and another trained to reconstruct $p^W_{\rm cont}$. Otherwise the two models are identical and are trained in the same way with the same loss function. We then compare the outputs of each model to $p^W_{\rm true}$ and analyze the benefits of the two choices of target.

Regression results for 𝒑 𝑾 true reconstruction
The results are summarized in Table 3. We quantify the precision of the reconstruction by the transverse momentum ($p_T$) and mass resolutions, given by half of the central 68th interquantile range of $(x_{\rm predict} - x_{\rm true})/x_{\rm true}$, where $x$ is $m$ or $p_T$. In addition, we report the lower 68th interquantile range for $\Delta\theta$, the spatial angle between the predicted and true momenta. Since there are no ML-based methods for this task, we use the $W$-boson identification of the Johns Hopkins top tagger [43], implemented in FastJet [42], as the baseline comparison. The tagger has a 36% efficiency on the truth-level dataset and 31% on the Delphes one. It can only identify $W$-boson candidates for jets it tags, so we report PELICAN results both on the JH-tagged jets only (PELICAN|JH) and on the full dataset (PELICAN). Moreover, we evaluate PELICAN on the population of FC events (PELICAN|FC). More than 99.9% of JH-tagged events contain all three true quarks within the jet radius, so this population represents an especially restricted and 'ideal' type of event. The results were evaluated over 5 training runs initialized with different random seeds, and the resolutions reported in Table 3 are consistent across the runs.
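The resolution metric is a direct computation on the relative error distribution; a sketch (the function name is ours):

```python
import numpy as np

def central68_resolution(pred, true):
    """Half of the central 68th interquantile range of (pred - true)/true,
    the resolution metric reported in Table 3."""
    rel = (pred - true) / true
    lo, hi = np.quantile(rel, [0.16, 0.84])
    return 0.5 * (hi - lo)
```

For a Gaussian relative-error distribution this reduces to the familiar one-standard-deviation resolution, but the interquantile definition is robust to non-Gaussian tails.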
There are significant differences in PELICAN's performance on the different sub-populations of events. In the direct comparison with the JH tagger, PELICAN|JH is 2-4 times more precise. However, even on the much larger class of FC events, PELICAN produces predictions with almost the same precision. The highest loss of precision occurs on non-FC events, where many of the $W$ decay products are missing from the jet, leading to lower average precision on the entire dataset. As we will discuss in Section 6, this result can be explained by interrogating the PELICAN weights and kinematic information directly.
In Fig. 4 we show the relative reconstructed $W$ masses for two of the models, one trained on truth data and one on Delphes data. The results also include the curve for the JH tagger's reconstruction, as well as PELICAN|JH and PELICAN|FC. The 68th interquantile ranges of these curves match the numbers in the mass column of Table 3. See Section 7 for further details on the causes of performance degradation in the Delphes case. For the complete set of results see Appendix A.

Regression results for 𝒑 𝑾 cont reconstruction
Now we train new models with the target vector set to the contained true $W$ momentum $p^W_{\rm cont}$, evaluate their precision by comparing the outputs to the true $W$ momentum $p^W_{\rm true}$, and compare the results to Table 3. As shown in Table 4, the resolutions for these models on JH-tagged and FC events are slightly worse than for the first set of models, in the Delphes case by 5-15%. The largest change is in non-FC events, leading to poor average resolutions on the whole dataset. Despite this, as we will now show, these models can in fact be better suited for real-world applications.

Discussion
To see the main benefit of this model, consider the behavior of the relative reconstructed mass shown in Fig. 5. PELICAN-reconstructed masses within the range of true $W$ masses are almost as precise on the full dataset as they are on FC events (see Fig. 5 near the peak at 1). The most prominent feature of these results is that, despite the slightly lower accuracies on FC events (at fixed width and depth of the network), the model trained to reconstruct $p^W_{\rm cont}$ accurately reproduces the mass spectrum of $p^W_{\rm cont}$ in Fig. 3 and therefore discriminates between FC and non-FC events, allowing us to perform post-inference event selections.
For instance, in the Delphes case, choosing a 55 GeV cutoff, 97% of all FC events have $m_{\rm reco} > 55$ GeV, and vice versa, 97% of all events with $m_{\rm reco} > 55$ GeV are FC. In this manner we can significantly improve the accuracy of the reconstruction without accessing the truth-level information that is needed to identify FC events. This comes at the cost of a modest reduction in signal efficiency, from the ostensible 100% down to 75%. Note that in the Delphes case, the set of FC events is contaminated with a small number of events with significant losses of $W$ decay products due to detector effects, but it can be refined by reducing the jet radius used in the definition of full containment. Consequently, we propose the following simple routine for real-world applications of these models. First, use the model trained targeting $p^W_{\rm cont}$ as an FC-tagger to refine the data. Then, apply the model targeting $p^W_{\rm true}$ to reconstruct the $W$-boson. We conclude that $p^W_{\rm cont}$ is the better target for many common reconstruction tasks where one is willing to sacrifice some signal efficiency (or to only fully measure the 4-momentum on a sub-sample of the identified events) in order to gain improved accuracy. In the following sections we will not present models trained on both targets; however, a complete set of metrics and results, including models targeting $p^W_{\rm true}$, can be found in Appendix A.
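The cut-based selection described above amounts to a purity/efficiency computation; a minimal sketch (the function and array names are illustrative):

```python
import numpy as np

def fc_selection_stats(m_reco, is_fc, cut=55.0):
    """Purity and efficiency of the m_reco > cut selection for tagging
    fully-contained (FC) events: purity = fraction of selected events
    that are FC; efficiency = fraction of FC events that are selected."""
    sel = m_reco > cut
    n_sel_fc = np.sum(sel & is_fc)
    purity = n_sel_fc / max(np.sum(sel), 1)
    efficiency = n_sel_fc / max(np.sum(is_fc), 1)
    return purity, efficiency
```

On the Delphes dataset described in the text, both quantities come out to roughly 97% at the 55 GeV cutoff.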

𝑾-boson mass measurement
As we saw above, PELICAN is able to reconstruct the mass of the $W$-boson, $m_W$, found within the dense environment of the complete decay products of a top quark jet. For truth-level datasets, the resolution of this $m_W$ reconstruction is below the natural width of the mass spectrum. In the Delphes case, the resolution is too wide to produce any substantial correlation between the true and reconstructed masses (see Appendix A for figures that demonstrate this). We would like to eliminate the possibility that the high concentration of true masses around 80 GeV allows PELICAN to effectively memorize a single number: the $W$-boson mass. In this section we therefore examine a more realistic reconstruction task, where the true mass of the target particle is unknown and the dataset uniformly covers a wide range of masses. The reconstruction task is otherwise identical to that of Section 5. Even though we could use a scalar-valued version of PELICAN to target the mass of the $W$-boson, the accuracy of that reconstruction would in fact suffer in comparison with the 4-vector model. This is simply because the 4-momentum contains more relevant information than the mass alone, since the direction and the energy of the particle are, in part, correlated with the mass. Thus the only new element in this experiment is the dataset, which now involves $W$-bosons of varying masses uniformly covering the range $m_W \in [65, 95]$ GeV. The dataset is otherwise identical to that used in Section 5, except that the $W$-boson mass is variable. This is achieved by combining multiple independently-produced datasets in which the generator-level value of $m_W$ was modified from its default value. Fig. 6 shows the resulting distribution of $W$-boson masses, as well as that of the sum of $W$ daughters contained within each jet.

Regression results for 𝒎 𝑾 reconstruction
The hyperparameters and the training regime used here are the same as in Section 5. Here we focus on the model trained to reconstruct the contained momentum $p^W_{\rm cont}$ (see Appendix A for the results for the model targeting $p^W_{\rm true}$). The outputs are then compared to the true $W$-boson momentum $p^W_{\rm true}$. The accuracies for the full 4-vector reconstruction are presented in Table 5. The largest loss of accuracy relative to Section 5 is, unsurprisingly, in the mass column. However, since the true mass now covers a much wider range, this still represents a significant improvement in the mass reconstruction capability. To demonstrate this better, we show the 2D correlations between target and reconstructed masses in Figures 7 and 8 for the models trained targeting $p^W_{\rm true}$ and $p^W_{\rm cont}$, respectively. We also differentiate between non-FC (left) and FC (right) events in the two sides of each panel in each figure.

Model complexity
The model examined above has 210k trainable parameters; however, even significantly smaller models achieve good accuracy. As an illustration, we compare the resolutions of three PELICAN models trained on the variable-mass dataset targeting $p^W_{\rm true}$. They are obtained from the original model by a proportional rescaling of the widths of all layers. The first model is the 210k-parameter one, with 132/78 channels, i.e. each messaging layer has 132 input and 78 output channels. The second model has 60/35 channels and 49k parameters. The third model has 25/15 channels and 11k parameters. The resolutions over the Delphes test dataset are reported in Table 6, and we observe that even the 11k-parameter model handily beats the JH method.

Discussion
In the Delphes dataset, we observe that for non-FC events (bottom left of Fig. 8), the reconstructed contained mass is only weakly correlated with the true contained mass (or with the true $W$ mass, as shown in Fig. 22 in Appendix A). However, in the quadrant where both masses exceed 55 GeV, we find a 65% correlation on FC events in the Delphes case. The most important type of error PELICAN makes here is when a non-FC event is assigned a high reconstructed mass, meaning that a mass near that of the true $W$ was assigned to a jet containing few of the $W$ decay products. Among all events with $m_{\rm reco} > 55$ GeV, 3.6% are non-FC, and they bring the correlation among that population down to 51% (the $p_T$, mass, and angular resolutions on this population closely track those of PELICAN|FC above). But since in practice we are interested in $m^W_{\rm true}$, the correlation between that and $m_{\rm reco}$ is higher, at 59% among events with $m_{\rm reco} > 55$ GeV. This is a significant improvement over the model trained on the original $m^W_{\rm true} \sim 80$ GeV Delphes dataset, and especially over non-ML methods such as the JH tagger (see Fig. 9). However, even a model trained on Delphes data to reconstruct $p^W_{\rm true}$ in fact achieves a 40% correlation with $m^W_{\rm true}$ on non-FC events (see Fig. 7), so FC-tagging may not be necessary. Overall, PELICAN provides a viable method for estimating particle masses.

PELICAN explainability
Despite the output of the PELICAN regression model ostensibly being a 4-vector (or multiple 4-vectors), the richer and more natural objects to treat as the output are the PELICAN weights $\{c_i\}$ introduced in Eq. (5.1). Each $c_i$ is attached to its corresponding input constituent $p_i$ due to permutation equivariance, and therefore encodes a scalar feature of that particle within the event. As we will show in this section, the behavior of these weights is key to the unique explainability and visualization features of the PELICAN architecture.
In essence, PELICAN is able to take a set of $N$ input 4-vectors and assign $N$ scalar features to them (of course, there can be several features per input as well) in a Lorentz-invariant way. This can be powerful in a variety of applications, but in the context of particle reconstruction the problem of finding the right values of the weights is similar to a soft clustering problem. Assuming an idealized dataset with perfect information about the decay products, the model should identify the decay products of the $W$-boson, assign $c_i = 1$ to them, and zero to all other constituents. This is analogous to what taggers like the Johns Hopkins top-tagger aim to do via jet clustering. However, since any five 4-vectors are linearly dependent, there is a continuous family of solutions $\{c_i\}$, and it is not clear a priori that PELICAN will prefer the clustering solution.

Distributions of PELICAN weights
In Fig. 10 we display the distributions of all PELICAN weights for the models from Section 5 trained targeting $p^W_{\rm true}$. We also mark each constituent as either a $W$- or a $b$-daughter. This yields several observations.

Figure 11 (Delphes data): Stacked histograms with proportional bin heights of all PELICAN weights computed over the testing dataset for the 4-vector reconstruction task from Section 5, using models trained targeting $p^W_{\rm true}$. Broken up into three populations by jet containment: FC events (all three truth-level quarks from the $t \to bW \to bq\bar{q}'$ process fall within the jet clustering radius); events where only the $b$-quark fell outside of the jet; and the remaining non-FC events.
Firstly, nearly all weights are either non-negative or only very slightly negative (e.g. above $-0.1$), with a very sharp peak at zero (the peak is entirely to the left of zero to very high precision: the bin $[-10^{-6}, 0)$ contains about 100 times more constituents than the bin $[0, 10^{-6})$). This is the first feature that justifies the interpretation of PELICAN as a soft clustering method. Since our inputs represent realistic events, all input 4-vectors are causally related, and in particular they belong to the future light cone, as does the target vector. This implies that no linear combination of these vectors with positive coefficients can produce a zero vector. The distributions therefore show that the PELICAN weights assigned to $W$-daughters are not "contaminated" with these degenerate combinations.
Secondly, the truth-level distribution is highly concentrated at 0 and 1 and very closely matches the binary clustering solution. That is, almost all constituents assigned weight 0 are $b$-daughters, and almost all of those assigned 1 are $W$-daughters. Nevertheless, 30% of $b$-daughters are assigned positive weights, prompting further investigation. Moreover, the distribution of $W$-daughter weights in the Delphes case is so spread out that it becomes difficult to explain it by a mere analogy with clustering.
We can delve more deeply into the weight distribution by evaluating sub-populations of weights based on jet containment. Fig. 11 shows the distributions of weights for FC events, events where only the $b$-quark fell outside the jet, and the remaining non-FC events. The majority of constituents at the high end of the weight scale belong to non-FC events. Similarly, the weights

Figure 13 (Delphes data): 2D histogram of PELICAN weights vs. constituent transverse momentum for the 4-vector reconstruction task from Section 5, using models trained targeting $p^W_{\rm cont}$. Only FC events are shown.
produced by the models trained targeting $p^W_{\rm cont}$, shown in Fig. 12, are more highly concentrated at 0 and 1, and have much lower and shorter "tails" on the right, especially among $b$-daughters. This is the first indication that PELICAN tends to up-weight some constituents in events where it does not have enough information for an accurate reconstruction.
This approach allows us to characterize the constituents that are being up-weighted. Fig. 13 shows the constituent weights as a function of the constituent's $p_T$. The main observation is that among high-energy ("hard") constituents with $p_T > 100$ GeV the weight distribution is much more binary, and the vast majority of constituents with weights falling away from the two peaks are soft, below 20 GeV. In the Delphes case PELICAN appears to down-weight high-energy $W$-daughters and up-weight soft constituents. Once again, loss of information in the form of detector effects appears to lead PELICAN to up-weight soft constituents.

Detector effects on PELICAN weights
While the truth-level PELICAN models reliably converge to a binary clustering solution, the weights in the Delphes case do not permit such a straightforward interpretation. To better understand their behavior, we ran additional experiments using custom datasets that exclude different components of the Delphes detector simulation one by one. Delphes performs the following steps: simulate the effect of the magnetic field on charged final-state particles; aggregate truth-level particle energies within each electromagnetic calorimeter (ECAL) and hadronic calorimeter (HCAL) detector cell; apply energy smearing by sampling a log-normal distribution; unify the ECAL and HCAL cells; apply spatial smearing by picking a uniformly random point within the detector cell; and construct the spatial momentum so that the resulting 4-vector, which represents a detector cell, is massless. We found that while each of these steps contributes to smearing out the truth-level distribution of PELICAN weights and shifting the peak downwards, the magnetic field is responsible for almost all of the differences between the truth-level and Delphes results.
The simulated magnetic field can deflect charged particles very significantly, enough to account for most of the error in PELICAN's reconstruction relative to the truth-level reconstruction. Our hypothesis for why this leads to lower PELICAN weights for hard constituents is the following. Deflected hard particles produce large errors in the direction, but not the energy, of the reconstruction; therefore one can down-weight them and compensate for the energy deficit using softer constituents. Moreover, by up-weighting softer constituents PELICAN can in fact partially correct the error in the direction, since the deflections of positively charged particles can be partially cancelled out by those of negatively charged particles.
An extra piece of evidence in support of this hypothesis can be found by modifying the loss function. If we re-train the model on the same Delphes dataset using a loss function consisting of a single energy term $|E_{\rm reco} - E_{\rm true}|$, we find a distribution of weights (see Fig. 14) nearly as bimodal as that of the original model trained on truth-level data (see Fig. 12). This indicates that the source of the error in PELICAN's reconstruction on Delphes data is overwhelmingly spatial. Out of all the steps that Delphes performs, only two are purely spatial: momentum smearing within one cell, and the simulated magnetic field. However, the detector cells (approximately $0.02 \times 0.02$ in $(\eta, \phi)$) are much smaller than the magnitude of PELICAN's typical angular error, and thus smearing cannot explain the error.

Event visualization
As we discussed above, despite being a single-vector regression model, PELICAN produces one feature per input constituent (namely the weight $c_i$), and these features become interpretable by virtue of Eq. 5.1. This gives us a unique opportunity to make event-level visualizations that provide insight into how PELICAN treats jet topology and how it compares to conventional methods such as the JH tagger's jet clustering.
In Fig. 16 we show an amalgamation of 200 events from the Delphes dataset from Section 5 projected onto the unit sphere. Each event was spatially rotated so that the position of the true $W$ boson within the image is fixed and the true $b$-quark lies directly below it. In one display the constituents are colored according to their parent being either the $W$ boson or the $b$-quark, and in the other they are colored based on their assigned PELICAN weight. The correlation between the two images is clear: $b$-daughters tend to be correctly assigned zero weight, whereas $W$-daughters have positive weights, with the hardest constituents having weights between 0.4 and 0.8.
In Fig. 15 we show a single event in the $(\eta, \phi)$ plane, with dot color and size dependent on the constituent energy. Note the reduced number of constituents in the Delphes display, and how some of the constituents are strongly deflected by the simulated magnetic field. The same event can be visualized in three more helpful ways. In addition to the parent-type and PELICAN visualizations introduced in Fig. 16, we can also extract the list of constituents that the JH tagger identifies as belonging to the $W$ boson and highlight them. Fig. 17 displays the same single event in all three ways. In addition, we add a special marker for the direction of the reconstructed $W$ boson. In the parent-type pane, this reconstruction is defined as $p_W = \sum_{i=1}^{N} f_i\, p_i$, where $f_i$ is the energy of the true $W$-daughters within constituent $i$ divided by the total energy of that constituent. In the JH and PELICAN panes, the marker shows the corresponding reconstructions obtained by those methods.
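For concreteness, the parent-type reference reconstruction $p_W = \sum_i f_i\, p_i$ can be sketched as follows. This is our own illustration; the array layout `[E, px, py, pz]` and the function name are assumptions.

```python
import numpy as np

def fraction_weighted_reco(constituents, w_daughter_energy):
    """Reference reconstruction p_W = sum_i f_i * p_i, where f_i is the
    true-W-daughter energy inside constituent i divided by its total energy.
    constituents: (N, 4) array of [E, px, py, pz]; w_daughter_energy: (N,)."""
    f = w_daughter_energy / constituents[:, 0]      # energy fractions f_i
    return (f[:, None] * constituents).sum(axis=0)  # weighted 4-vector sum

parts = np.array([[10.0, 3.0, 4.0, 8.0],
                  [ 6.0, 1.0, 2.0, 5.0]])
wE = np.array([10.0, 0.0])   # first constituent fully W-derived, second not at all
p_W = fraction_weighted_reco(parts, wE)
```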

IRC-safety and PELICAN
Perturbative computations in QCD suffer from divergences caused by two types of processes: soft emissions and collinear splittings. As a consequence, meaningful observables in this theory need to be insensitive to such processes, a requirement known as IRC-safety. In this section we provide a precise definition, give a characterization of IRC-safe Lorentz-invariant observables (see details in Appendix B), and describe modifications to the PELICAN architecture that make it IR-safe or IRC-safe.
Infrared safety (IR-safety) guarantees insensitivity to soft emissions, i.e. particles with relatively low energies and momenta. A family of continuous symmetric observables $f^{(N)}(p_1, \dots, p_N)$ is said to define an IR-safe observable $f$ if
$$\lim_{\epsilon \to 0^+} f^{(N+1)}(p_1, \dots, p_N, \epsilon q) = f^{(N)}(p_1, \dots, p_N)$$
for any $N$ and any $p_1, \dots, p_N, q$, where $\epsilon$ controls how infinitesimally small the considered soft emission $\epsilon q$ is.
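This definition can be checked numerically. The sketch below (our own illustration, not PELICAN code) contrasts an IR-safe observable, the squared jet mass, with the multiplicity, which is not IR-safe:

```python
import numpy as np

def minkowski_sq(p):
    return p[0]**2 - np.dot(p[1:], p[1:])

def jet_mass_sq(parts):   # IR-safe: essentially unaffected by a soft emission
    return minkowski_sq(np.sum(parts, axis=0))

def multiplicity(parts):  # not IR-safe: jumps when a soft particle is added
    return float(len(parts))

rng = np.random.default_rng(1)
parts = rng.normal(size=(5, 4))
parts[:, 0] = np.abs(parts[:, 0]) + 5.0          # make energies large and positive
soft = 1e-9 * np.array([1.0, 0.0, 0.0, 1.0])     # epsilon * q, with q on the light cone

with_soft = np.vstack([parts, soft])
```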
Collinear safety (C-safety) is a restriction on observables in perturbative QCD that arises from the divergent contributions of collinear emissions of gluons. A positive gluon mass would prevent such divergences, which is why C-safety concerns only massless particles. We can define C-safety formally as follows: an observable $f(p_1, \dots, p_N)$ is C-safe if, whenever two massless 4-momenta $p_1$ and $p_2$ become collinear (which happens for massless particles iff $p_1 \cdot p_2 = 0$), the value of $f$ depends only on the total momentum $p_1 + p_2$. Expressed even more explicitly, C-safety says that setting $p_1 = \lambda p$ and $p_2 = (1-\lambda) p$ for some 4-vector $p$ such that $p^2 = 0$ must lead to the same output regardless of the value of $\lambda$, i.e. $f(\lambda p, (1-\lambda) p, p_3, \dots, p_N)$ must be independent of $\lambda$.

In Appendix B we characterize IRC-safe Lorentz-invariant observables, but the following summary will suffice for the purpose of designing an IRC-safe version of PELICAN. First, a Lorentz-invariant observable (assumed to be consistently defined for any finite number $N$ of 4-vector inputs) is IR-safe if and only if it has no explicit dependence on the multiplicity $N$. More precisely, adding the zero 4-vector to the list of inputs should leave the output value invariant. Second, an IRC-safe Lorentz-invariant observable is one that is IR-safe and moreover depends on any of its massless inputs only through their total. E.g. if $p_1, p_2, p_3$ are fixed to be massless, then $f(p_1, p_2, p_3, p_4, \dots)$ must depend only on $p_1 + p_2 + p_3$. In particular, if all inputs are massless, then all IRC-safe invariant observables are functions of only the jet mass $m^2_{\rm jet} = \left(\sum_i p_i\right)^2$. Note, however, that such an observable can still depend on these vectors in an arbitrarily complex fashion away from the massless manifold.
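The massless-input statement can be illustrated numerically: splitting a massless 4-vector into a collinear pair leaves the jet mass, and hence any IRC-safe invariant of massless inputs, unchanged. A small self-contained check (our own illustration):

```python
import numpy as np

def minkowski_sq(p):
    return p[0]**2 - np.dot(p[1:], p[1:])

def jet_mass_sq(parts):
    return minkowski_sq(np.sum(parts, axis=0))

p = np.array([3.0, 0.0, 0.0, 3.0])   # massless 4-vector, p^2 = 0
others = np.array([[4.0, 1.0, 2.0, 1.0],
                   [5.0, -2.0, 0.0, 3.0]])

# replace p by the collinear pair (lam*p, (1-lam)*p) for several split fractions
vals = [jet_mass_sq(np.vstack([lam * p, (1 - lam) * p, others]))
        for lam in (0.1, 0.5, 0.9)]
# the output depends only on lam*p + (1-lam)*p = p, hence is independent of lam
```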
The original PELICAN architecture as introduced above is neither IR- nor C-safe. Below we modify the architecture to make it exactly IR-safe or IRC-safe and evaluate the implications.

IR-safe PELICAN
As shown above, IR-safety in Lorentz-invariant networks essentially requires the outputs to be independent of the multiplicity $N$. There are four ways in which the multiplicity shows up in PELICAN: 1. Scaling by factors of $N/\bar{N}$ in the equivariant block. This must be disabled for IR-safety.
2. Non-zero bias values in linear layers. Since the network is permutation-equivariant, the bias values are shared across jet constituents, which means that upon aggregation in the following equivariant layer they contribute multiples of $N$. All biases in all linear layers must be disabled for IR-safety.
3. The input embedding must map zero to zero, but our original choice already satisfies this. In addition, the activation function must also have a fixed point at zero. Our default choice, LeakyReLU, satisfies this as well.
4. Following an application of a PELICAN equivariant block, rows and columns corresponding to soft constituents will contain a combination of sums over all constituents. Even in the absence of biasing constants, this effectively increases the multiplicity with which these values enter the later aggregations. This can be resolved by making sure that rows and columns that are soft at the input remain soft throughout the whole network. Therefore we introduce soft masking, whereby the last 12 equivariant aggregators (those that don't preserve the softness of rows/columns) are followed by a multiplication by the vector of values $E_i/E$, where $E = \sum_{j=1}^{N} E_j$, scaled and clipped to be within $[-1, 1]$. In $\mathrm{Eq}_{2\to 2}$ this multiplication is applied both row-wise and column-wise, and in $\mathrm{Eq}_{2\to 1}$ it is applied component-wise.
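The soft-masking step above can be sketched as follows. This is a minimal illustration with hypothetical tensor shapes and a free `scale` parameter standing in for the scaling mentioned in the text; the actual PELICAN implementation differs in detail.

```python
import numpy as np

def soft_mask(agg_out, energies, scale=1.0):
    """Multiply aggregator outputs row- and column-wise by scaled, clipped
    energy fractions so that soft constituents stay soft through the network.
    agg_out: (N, N, C) equivariant features; energies: (N,)."""
    frac = energies / energies.sum()          # E_i / sum_j E_j
    m = np.clip(scale * frac, -1.0, 1.0)      # scaled and clipped to [-1, 1]
    return agg_out * m[:, None, None] * m[None, :, None]

N, C = 4, 2
rng = np.random.default_rng(2)
feats = rng.normal(size=(N, N, C))
E = np.array([10.0, 8.0, 5.0, 1e-8])          # last constituent is soft
masked = soft_mask(feats, E, scale=N)
```

After masking, the row and column belonging to the soft constituent are suppressed, so later aggregations see an effective multiplicity that excludes it.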
With these modifications, PELICAN becomes IR-safe. As we will see, this restriction leads to a modest reduction in the performance of PELICAN's predictions on our tasks of interest.

IRC-safe PELICAN
Adding C-safety to the architecture is much simpler. As stated above, the necessary requirement is that the output depend on massless inputs only through their sum. In PELICAN this can be achieved by inserting a linear permutation-equivariant layer with a mass-based soft mask immediately at the input (any nonlinear embedding has to be done later). Consider a case where $p_1, p_2$ are massless and the dot product matrix $\{d_{ij}\}$ is fed into such an equivariant layer. Most of the aggregators will compute sums over rows or columns, thus immediately producing C-safe quantities. However, several of the aggregators, including the identity, will not produce C-safe quantities. The dot product of two massless 4-vectors $p_i$, $p_j$ is $p_i \cdot p_j = E_i E_j (1 - \cos\theta_{ij})$. At small angles the approximately boost-invariant combination is then $E_i E_j \theta^2_{ij}$. The expansion of any permutation-symmetric function of such dot products into a series as above will differ from general EFPs as follows: every angle $\theta_{ij}$ must appear in the product an even number of times, and the pre-factor must consist of nothing but one factor of $E_i E_j$ for every factor of $\theta^2_{ij}$. In each EFP only a subset of terms will generally satisfy this condition, so a better representation of the Lorentz-invariant basis is warranted.
Let us start again with the most general Lorentz-invariant and permutation-invariant observable of $N$ 4-momenta. As we know, it is a function of only the dot products $d_{ij} = p_i \cdot p_j$. Such a function can be expanded at small arguments into terms of the form
$$T^{(N)}_G = {\sum_{i_1, \dots, i_V = 1}^{N}}{}' \ \prod_{(a,b) \in E(G)} d_{i_a i_b}, \qquad \text{(B.5)}$$
where $G$ is a multigraph with $V$ vertices and edge multiset $E(G)$, and the prime indicates that the terms where any two of the indices coincide are excluded. Including them would also produce a valid basis, but we leave them out to reduce the complexity of each polynomial. Such terms simply correspond to other multigraphs obtained from $G$ by vertex identification, and those already appear in the expansion with independent coefficients. Unlike in EFPs, we also allow multigraphs with loops since we are not restricting ourselves to purely massless inputs. However, if all inputs are massless, all graphs with loops will produce vanishing polynomials.
Now we can easily see how to enforce IRC-safety in these polynomials. IR-safety requires that sending any of the 4-momenta to zero, say $p_k \to 0$, must be equivalent to simply restricting the same expression to only the other $N-1$ particles. We observe that the only case when this is not already satisfied in the expression above is when the multigraph $G$ contains isolated vertices (vertices of degree zero). Isolated vertices lead to summations over dummy indices corresponding to those vertices, which results in an overall factor of some power of $N$. The multiplicity $N$ is not an IR-safe variable, and it is sufficient to ban graphs with isolated vertices to enforce IR-safety. We can thus drop the notation $T^{(N)}_G$ and simply write $T_G$, with the implicit understanding that (B.5) defines the full infinite family of observables for all $N$.

Now we address C-safety. Much like in EFPs, C-safety in $T_G$'s means that whenever, say, $p_1$ and $p_2$ are massless, each monomial that contains a certain number of indices equal to 1 or 2 must match its coefficient with any other monomial that differs only by flipping any number of the indices from 2 to 1 or vice versa. In the language of multigraphs, this means that each monomial that assigns a vertex of degree $d_1$ to $p_1$ and another vertex of degree $d_2$ to $p_2$ must in fact result as just one among all possible monomials obtained via vertex identification from a graph $G'$ where those two vertices have been "blown up" into $d_1 + d_2$ vertices of degree one. We also notice that this condition applies only to vertices that do not have any loops attached to them, because otherwise the corresponding terms would vanish as $p_1$ and $p_2$ become massless, thereby making any additional restrictions unnecessary. To summarize, any IRC-safe polynomial contains with equal coefficients all $T_G$'s with $G$'s obtained by any combination of vertex identifications from some multigraph $G'$ such that all of its loopless vertices are leaves (i.e. of degree one). We shall call such multigraphs loop-saturated. In this process it is sufficient to allow identifications of only the leaf vertices because all other vertex identifications will result in a polynomial independently defined by another loop-saturated multigraph. This leads to the definitions
$$\mathrm{MFP}_G = \sum_{i_1, \dots, i_V = 1}^{N} \ \prod_{(a,b) \in E(G)} d_{i_a i_b}, \qquad \mathrm{MFP}'_G = {\sum_{i_1, \dots, i_V = 1}^{N}}{}' \ \prod_{(a,b) \in E(G)} d_{i_a i_b},$$
where in the primed sum the coincidence of two or more indices $i_a = i_b$ is allowed only if their corresponding vertices ($a$ and $b$) are leaves of $G$. $\mathrm{MFP}_G$ is simply $\mathrm{MFP}'_G$ plus a linear combination of certain $\mathrm{MFP}'_H$'s with fewer vertices, namely such that $H$ is a graph obtained from $G$ by one or more vertex identifications, each of which involves at least one non-leaf vertex of $G$. Indeed, such identifications preserve the property of being loop-saturated, and all other identifications are already included in the definition of $\mathrm{MFP}'_G$. Note that, just like for EFPs, disconnected multigraphs correspond to products of their connected components (this holds for $\mathrm{MFP}_G$ but not for $\mathrm{MFP}'_G$): $\mathrm{MFP}_{G \sqcup H} = \mathrm{MFP}_G \cdot \mathrm{MFP}_H$ (B.8). As an example, if all inputs are known to be massless, then any connected multigraph with loops will produce a vanishing polynomial. And the only loopless loop-saturated connected multigraph is the graph consisting of just one edge, which evaluates to the jet mass: $\mathrm{MFP}_{\bullet\!-\!\bullet} = \sum_{i,j} d_{ij} = m^2_{\rm jet}$. Therefore in the massless case we generate all analytic functions of $m^2_{\rm jet}$, which is exactly what we expect based on the results of the previous section. Lastly, returning to the example from the last section, $d_{12} d_{23} d_{31}$: it can only be interpreted as an IR-safe polynomial if it corresponds to the triangle graph. However, such a graph is not loop-saturated, and therefore this polynomial is not IRC-safe despite being C-safe for $N = 3$ (which can be confirmed by observing that for $N = 4$ the triangle graph generates a non-C-safe polynomial), and the correct IRC-safe completion is simply $m^6_{\rm jet}$.
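The final claim can be verified numerically: the triangle polynomial $d_{12} d_{23} d_{31}$ is C-safe for $N = 3$ but not for $N = 4$. Below is a self-contained check (our own illustration; `triangle` sums the monomial over ordered triples of distinct indices):

```python
import numpy as np
from itertools import permutations

def mdot(p, q):                      # Minkowski dot product d_ij = p_i . p_j
    return p[0] * q[0] - np.dot(p[1:], q[1:])

def triangle(parts):                 # primed sum over distinct i,j,k of d_ij d_jk d_ki
    return sum(mdot(parts[i], parts[j]) * mdot(parts[j], parts[k]) * mdot(parts[k], parts[i])
               for i, j, k in permutations(range(len(parts)), 3))

def massless(v):                     # build a massless 4-vector from a 3-momentum
    v = np.asarray(v, dtype=float)
    return np.concatenate([[np.linalg.norm(v)], v])

q  = massless([1.0, 2.0, 2.0])
p3 = massless([0.5, -1.0, 0.3])
p4 = massless([-2.0, 0.4, 1.0])

def split(lam, *rest):               # replace q by the collinear pair lam*q, (1-lam)*q
    return [lam * q, (1 - lam) * q, *rest]

# N = 3: C-safe, because d(lam*q, (1-lam)*q) = 0 kills every triangle term
n3 = abs(triangle(split(0.2, p3)) - triangle(split(0.5, p3)))
# N = 4: not C-safe -- the value depends on the split fraction lam
n4 = abs(triangle(split(0.2, p3, p4)) - triangle(split(0.5, p3, p4)))
```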

B.2 Quantifying IRC-safety
Let $f^{(N)}$ be a Lorentz-invariant permutation-symmetric observable defined for a varying number $N$ of inputs, for instance the output of a PELICAN instance. We define the following difference operator that measures the obstruction to $f$ being IR-safe:
$$\Delta_{\rm IR} f^{(N)}(p_1, \dots, p_N) = f^{(N+1)}(p_1, \dots, p_N, 0) - f^{(N)}(p_1, \dots, p_N).$$
An IR-safe observable must always have $\Delta_{\rm IR} f = 0$, and lower values of this IR-robustness indicate lower sensitivity to soft particles. In practice, we can evaluate the ratio $(\Delta_{\rm IR} f^{(N)})/f^{(N)}$ and average it over a testing dataset to quantify the IR-safety of PELICAN.
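In practice the ratio can be estimated as in the following sketch, where a toy observable stands in for a trained PELICAN model (our own illustration; the observable and constants are assumptions):

```python
import numpy as np

def minkowski_sq(p):
    return p[0]**2 - np.dot(p[1:], p[1:])

# toy "observable": a smooth function mixing jet mass with multiplicity,
# hence deliberately not IR-safe
def f(parts):
    return minkowski_sq(np.sum(parts, axis=0)) + 0.1 * len(parts)

def ir_robustness(f, parts):
    """Estimate |Delta_IR f| / |f|: append the zero 4-vector and compare."""
    extended = np.vstack([parts, np.zeros(4)])
    return abs(f(extended) - f(parts)) / abs(f(parts))

rng = np.random.default_rng(3)
parts = rng.normal(size=(6, 4))
parts[:, 0] = np.abs(parts[:, 0]) + 5.0
r = ir_robustness(f, parts)   # strictly positive for this non-IR-safe observable
```

Averaging `r` over a testing dataset gives the dataset-level IR-robustness figure described above.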
For C-safety, we pick a massless 4-vector $q$ and define the differential operator
$$\delta_C f = \frac{d}{d\lambda}\, f(\lambda q, (1-\lambda) q, p_3, \dots, p_N),$$
which vanishes identically for C-safe observables. C-robustness can then be defined as the second-order analog of $\delta_C f$: namely, we pick a second vector $\tilde{p}$ such that $q \cdot \tilde{p} = 0$ (which means that $\tilde{p}$ is tangent to the light cone at $q$) and measure how quickly $\delta_C f$ deviates from zero as we deform $q$ in the direction of $\tilde{p}$. These operators can be written in terms of the derivatives $\partial_{ij}$ with respect to the dot products $d_{ij}$. Note that individual partial derivatives aren't really well-defined due to the fact that $\{d_{ij}\}$ is not a set of independent coordinates on the manifold of $N$ 4-vectors for $N \geqslant 5$. However, combinations such as the above are valid since they are nothing but a different way of expressing the original well-defined operator. We also immediately notice that if $N = 2$ and $p_1 \parallel p_2 \parallel q$, then $\delta_C f = 0$ automatically. Indeed, every Lorentz-invariant observable for two particles is C-safe due to the fact that $d_{12} = \frac{1}{2}\left[(p_1 + p_2)^2 - p_1^2 - p_2^2\right]$ depends only on the total momentum once the masses are fixed.

Figure 2 :
Figure 2: Performance of various machine learning architectures represented by the background rejection as a function of the signal efficiency.

The input data for the network are the 4-momenta of the 200 leading jet constituents. For use as possible regression targets and for defining jet containment (explained below), each event contains (a) the truth-level top quark that initiated the jet, (b) the bottom quark from top-quark decay, (c) the $W$-boson from top-quark decay, (d) the two quarks from subsequent $W$-boson decay ($W \to q\bar{q}'$),

Figure 3 :
Figure 3: Stacked histogram with proportional bin heights showing the mass spectrum of the two targets, the true $W$-boson momentum $p^{\rm true}_W$ and the contained true $W$ momentum $p^{\rm cont}_W$. The top curve represents the spectrum over the entire dataset on a log scale, and the bottom curve shows the spectrum over FC events only, scaled linearly relative to the top curve, i.e. the fraction of FC events in a given bin is given by the apparent height of the FC curve divided by the total height of the bin (heights are measured from the $x$-axis). The two mass spectra of FC events, in fact, match.

Figure 6 :
Figure 6: Stacked histogram with proportional bin heights (see description in Fig. 3) showing the mass spectrum of the two targets, $p^{\rm true}_W$ and $p^{\rm cont}_W$, in the variable $W$ mass dataset.

Figure 7 :
Figure 7: 2D histograms of true vs. reconstructed masses for models trained on the variable mass dataset targeting $p^{\rm true}_W$ (top: truth data; bottom: Delphes data), broken up into two populations based on jet containment (left: non-FC events; right: FC events).

Figure 8 :
Figure 8: 2D histograms of target vs. reconstructed masses for models trained targeting $p^{\rm cont}_W$ (top: truth data; bottom: Delphes data), broken up into two populations based on jet containment (left: non-FC events; right: FC events).

Figure 9 :
Figure 9: The JH tagger's reconstruction of the $W$ mass on the variable $W$ mass dataset, truth-level and Delphes versions. The correlation values are 47% and 25%, respectively.

Figure 10 :
Figure 10: Stacked histograms with proportional bin heights of all PELICAN weights computed over the testing dataset for the 4-vector reconstruction task from Section 5, using models trained targeting $p^{\rm true}_W$. Broken up into two populations: $W$-boson products and $b$-quark products. In the Delphes case, a constituent is considered a $W$ product if the corresponding calorimeter cell detected at least one true $W$ daughter.

Figure 12 :
Figure 12: Stacked histograms with proportional bin heights of all PELICAN weights for the 4-vector reconstruction task from Section 5, using models trained targeting $p^{\rm cont}_W$. Broken up into two populations by parent type.

Figure 14 :
Figure 14: Same as Fig. 12, but this time the model is trained using a single-term loss function proportional to $|E_{\rm reco} - E_{\rm cont}|$.

Figure 15 :
Figure 15: A single event viewed in the $(\eta, \phi)$ plane with color and size dependent on energy. The central cross marks the true $W$ boson, and the other three crosses mark the three true quarks from the process $t \to bW \to bq\bar{q}'$.

Figure 16 :
Figure 16: Composite event display of 200 events from the Delphes dataset from Section 5. Each event is transformed using a 3D rotation matrix such that the true $W$ boson ends up at $(\theta, \phi) = (\pi/2, 0)$ (white cross), and the true $b$-quark is directly below. PELICAN is rotationally invariant, so its output is unaffected by this normalization. Each dot is a Delphes constituent, and the dot size increases logarithmically with constituent energy. (a) Color reflects parent type: constituents that are fully derived from $W$-daughters are orange and those from $b$-daughters are purple; in the rare cases when the fraction of $W$-derived energy in a given calorimeter cell is between 0 and 1, the corresponding color is taken from the color scale in the right pane. (b) Color reflects the value of the PELICAN weight, clipped to the interval [0, 1], as shown in the legend. Note how the hardest $W$ constituents (largest dots) tend to have PELICAN weights between 0.5 and 1.

Figure 17 :
Figure 17: The same event as in Fig. 15 in the $(\eta, \phi)$ plane. (a) Constituents are colored according to the actual parent type; size increases with energy; the yellow cross marks the reconstruction obtained by summing all of the constituents that belong to the $W$ boson. (b) Constituents are colored according to whether they are tagged by the JH tagger as $W$-daughters or not; size increases with energy; the yellow cross marks the JH-reconstructed $W$ boson. (c) Constituents are colored according to their PELICAN weight clipped to the interval [0, 1]; size increases as the weight goes from 0 to 1 and decreases after that. Note how the soft Delphes $W$-constituents get assigned high PELICAN weights.

Figure 28 :
Figure 28: The addition of the dashed loops makes this multigraph loop-saturated: all non-leaf vertices come with at least one loop. Loop-saturated multigraphs enumerate all IRC-safe polynomials.

Table 1 :
Comparison of PELICAN models trained on different fractions of the training dataset. The subscripts indicate the width of the network: e.g. 132/78 means each Msg layer has 132 input and 78 output channels.

Table 2 :
Comparison of PELICAN models trained on different fractions of the training data.
Reconstructed $W$ mass relative to the true $W$ mass for the PELICAN model trained (on truth or Delphes data) targeting $p^{\rm true}_W$.

Table 4 :
PELICAN resolutions for models trained to reconstruct $p^{\rm cont}_W$. Resolutions are still obtained by comparing the model predictions to $p^{\rm true}_W$.

Table 5 :
PELICAN resolutions for models trained to reconstruct $p^{\rm cont}_W$ with variable $m_W$. Resolutions are still obtained by comparing the model predictions to $p^{\rm true}_W$.

Table 6 :
Comparison of PELICAN models of three different network widths trained to reconstruct $p^{\rm true}_W$ with variable $W$ mass. Tested on Delphes data.