An Efficient Lorentz Equivariant Graph Neural Network for Jet Tagging

Deep learning methods have been increasingly adopted to study jets in particle physics. Since symmetry-preserving behavior has been shown to be an important factor for improving the performance of deep learning in many applications, Lorentz group equivariance - a fundamental spacetime symmetry for elementary particles - has recently been incorporated into a deep learning model for jet tagging. However, the design is computationally costly due to the analytic construction of high-order tensors. In this article, we introduce LorentzNet, a new symmetry-preserving deep learning model for jet tagging. The message passing of LorentzNet relies on an efficient Minkowski dot product attention. Experiments on two representative jet tagging benchmarks show that LorentzNet achieves the best tagging performance and improves significantly over existing state-of-the-art algorithms. The preservation of Lorentz symmetry also greatly improves the efficiency and generalization power of the model, allowing LorentzNet to reach highly competitive performance when trained on only a few thousand jets. Code and models are available at \url{https://github.com/sdogsq/LorentzNet-release}.

However, a key aspect neglected by most of these deep learning models is the fundamental spacetime symmetry: the jet tagging result should not depend on the spatial orientation or the Lorentz boost of a jet. Starting from the work [51], symmetry-preserving deep learning models have been developed and have demonstrated their power in improving performance [52][53][54][55][56][57][58] and enhancing model interpretability [59] in many applications; e.g., the Deep Sets framework preserves permutation invariance and equivariance [52], [54,55] design gauge equivariant flows for sampling in lattice gauge theory, and [56][57][58] propose group equivariant models for physical or molecular dynamics prediction. For jet tagging, symmetry-preserving deep learning models have not been fully explored. Although many existing models apply a preprocessing step to reduce such dependence, e.g., a Lorentz boost along the collision axis and a rotation in the transverse plane so that the jet momentum always points in the same direction, this only works to a limited extent and cannot achieve full invariance to arbitrary Lorentz transformations. Models such as LoLa [32] and LBN [60] attempt to exploit the underlying Lorentz symmetry with specially designed neural network layers, but they cannot guarantee the invariance of the entire model. A better way to achieve independence from any Lorentz transformation is via equivariant neural networks. A neural network layer is Lorentz equivariant if its output transforms accordingly when the input undergoes a Lorentz transformation. Equivariant layers can therefore be stacked to build a symmetry-preserving deep neural network whose classification result is unchanged under any Lorentz transformation. In [61], LGN, a Lorentz equivariant neural network architecture built from tensor products of Lorentz group representations, was developed and demonstrated for the first time.
However, the tagging performance of LGN reported in [61] is unsatisfactory. One bottleneck is the slow computational speed due to the explicit calculation of the high-order tensors through consecutive tensor product operations over hidden layers.
In this article, we propose a new design of symmetry-preserving deep learning models for jet tagging. Using the particle cloud representation of a jet, we propose LorentzNet, which directly scalarizes the input 4-vectors to realize Lorentz symmetry. Specifically, we design Minkowski dot product attention, which aggregates the 4-vectors with weights learned from Lorentz-invariant geometric quantities under the Minkowski metric. The construction of LorentzNet is guided by a universal approximation theorem for Lorentz equivariant mappings, which ensures both the equivariance and the universality of LorentzNet. Compared to LGN, which requires computationally expensive tensor products of the geometric quantities to achieve expressiveness, LorentzNet only requires the Minkowski inner product of two vectors. Thus, it is more efficient in terms of both training and inference. Benchmarked on two representative jet tagging datasets, LorentzNet achieves the best tagging performance and improves significantly over existing state-of-the-art algorithms such as ParticleNet. Moreover, when we test the tagging performance with 5%, 1%, and 0.5% of the training samples, the degradation in performance for LorentzNet is smaller than that for ParticleNet, which clearly shows the benefit of the inductive bias brought by the Lorentz symmetry.
The rest of this paper is organized as follows. Sec. 2 reviews the theory of the Lorentz group. Sec. 3 introduces the architecture of LorentzNet. Sec. 4 details the experiments carried out on two jet tagging benchmarks to show the effectiveness of LorentzNet. Sec. 5 serves as the conclusion.

Preliminary
In this section, we introduce the foundation of the Lorentz group and describe the graph neural network for modelling particle clouds.

Notations
Following the jet representation in [37], we regard the constituent particles as a point cloud, i.e., an unordered, permutation invariant set of particles. Let V = (v_1, · · · , v_N) ∈ R^{N×4} denote the point cloud living in R^4, where N denotes the number of particles in a jet. Note that the number of particles may differ from jet to jet. ⟨·, ·⟩ and ‖·‖ denote the Minkowski inner product and Minkowski norm, respectively. ⊕ denotes the direct sum and ⊗ denotes the tensor product.

Theory of the Lorentz Group
Minkowski metric. Consider the 4-dimensional spacetime R^4 with basis {e_i}_{i=0}^3. We define a bilinear form η : R^4 × R^4 → R as follows. For u, v ∈ R^4, we set η(u, v) = u^T J v, where J = diag(1, −1, −1, −1) is the Minkowski metric. The Minkowski inner product of two vectors u = (t, x, y, z) and v = (t′, x′, y′, z′) is thus defined as

⟨u, v⟩ = η(u, v) = t t′ − x x′ − y y′ − z z′.

Lorentz transformation. The Lorentz group is defined to be the set of all matrices that preserve the bilinear form η. Restricting the inertial frames to be positively oriented and positively time-oriented, we obtain the proper orthochronous Lorentz group, denoted SO(1, 3)^+. The 3d spatial rotation group SO(3) is a subgroup of SO(1, 3)^+; in addition, SO(1, 3)^+ contains the Lorentz boosts. Given two inertial frames {e_i}_{i=0}^3 and {e′_i}_{i=0}^3, the relative velocity vector β and the boost factor γ = 1/√(1 − β²) relate the time directions of the two frames. If we perform a Lorentz boost along the x-spatial axis, the Lorentz transformation between these two frames is the matrix

Q = [[γ, −γβ, 0, 0], [−γβ, γ, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]].

Lorentz group equivariance. Let T_g : V → V and S_g : U → U be group actions of g ∈ G on sets V and U, respectively. We say a function φ : V → U is equivariant if φ(T_g(v)) = S_g(φ(v)) holds for all v ∈ V and g ∈ G.
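As a minimal numerical check of these definitions, the following sketch (NumPy, with illustrative 4-vector values) builds the Minkowski inner product and an x-axis boost matrix and verifies that the boost preserves the bilinear form:

```python
import numpy as np

# Minkowski metric J = diag(1, -1, -1, -1)
J = np.diag([1.0, -1.0, -1.0, -1.0])

def minkowski_inner(u, v):
    """<u, v> = t*t' - x*x' - y*y' - z*z'."""
    return u @ J @ v

def boost_x(beta):
    """Proper orthochronous Lorentz boost along the x-axis."""
    gamma = 1.0 / np.sqrt(1.0 - beta**2)
    Q = np.eye(4)
    Q[0, 0] = Q[1, 1] = gamma
    Q[0, 1] = Q[1, 0] = -gamma * beta
    return Q

u = np.array([5.0, 1.0, 2.0, 0.5])   # (E, px, py, pz), illustrative values
v = np.array([3.0, 0.5, 1.0, 0.2])
Q = boost_x(0.6)

# the bilinear form is preserved: <Qu, Qv> = <u, v>
assert np.isclose(minkowski_inner(Q @ u, Q @ v), minkowski_inner(u, v))
```

The same check holds for any |β| < 1, since Q^T J Q = J for every boost matrix.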
In this work, we only consider outputs that are scalars or 4-vectors. Therefore, we consider the following equivariance on a set of particles V ∈ R^{N×4}. Let Q be a Lorentz transformation; the Lorentz equivariance of φ(·) means φ(VQ^T) = φ(V)Q^T when the output is a set of 4-vectors (Q acting row-wise), and φ(VQ^T) = φ(V) when the output is a scalar. Note that when the output is a scalar, group equivariance reduces to group invariance.

Graph Neural Network for Particle Cloud
A jet can be represented as a graph when we regard the constituent particles as nodes. For the particle with index i, we use its 4-momentum vector v_i = (E_i, p_x^i, p_y^i, p_z^i) as the coordinate of node i in Minkowski space. We use s_i = (s_1^i, s_2^i, · · · , s_α^i) to denote the scalars, such as mass, charge, and particle identity information, which compose the node attributes. The feature vector f_i = v_i ⊕ s_i then contains the essential features for tagging. The graph is denoted G = (V, E), where V is the set of nodes and E is the set of edges. The edges characterize the message passing between two particles, hence the interaction of two individual sets of particle-wise features; if there is no such interaction, there is no edge between the two corresponding nodes. Here, we regard the graph as fully connected, since we do not assume any prior on the interactions among the particles.
Graph neural networks are natural models for learning representations of graph-structured data [62]. Given a graph G = (V, E) and L message passing steps in total, the l-th message passing step on the graph can be described as [63]

m_ij^l = M_l(h_i^l, h_j^l, e_ij),    h_i^{l+1} = U_l(h_i^l, Σ_{j∈N(i)} m_ij^l),

where h_i^0 = f_i is the input feature, e_ij is the edge feature, N(i) is the set of neighbors of node i, and M_l, U_l are neural networks. For a classification problem, the output ŷ can be obtained by applying the softmax function after decoding h^L. Directly applying this message passing to a jet cannot ensure Lorentz group equivariance, because the nonlinear neural networks are applied directly to the whole input features and ignore their intrinsic structure. Some variants of message passing are designed to satisfy continuous group symmetries. A common way is to project the inputs onto the basis of irreducible representations of the corresponding group, e.g., LGN [61], which satisfies Lorentz group equivariance. To ensure universality, however, high-order geometric tensors must be realized by LGN, which brings high computational cost (see detailed discussions in Appendix A).
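For concreteness, a generic (non-equivariant) message passing step of this form can be sketched as follows; the networks M_l and U_l are stand-ins (fixed random maps), not trained models, and the feature sizes are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8                                 # toy sizes: 5 nodes, 8-dim features
h = rng.normal(size=(N, d))

def make_net(in_dim, out_dim):
    # stand-in for a learned network: a fixed random linear map + tanh
    w = rng.normal(size=(in_dim, out_dim)) * 0.1
    return lambda z: np.tanh(z @ w)

M = make_net(2 * d, d)                      # message network M_l(h_i, h_j)
U = make_net(2 * d, d)                      # update network U_l(h_i, aggregate)

def message_passing_step(h):
    msgs = np.zeros_like(h)
    for i in range(len(h)):                 # fully connected graph: N(i) = all j != i
        for j in range(len(h)):
            if i != j:
                msgs[i] += M(np.concatenate([h[i], h[j]]))
    return U(np.concatenate([h, msgs], axis=1))

h1 = message_passing_step(h)
assert h1.shape == (N, d)
```

Because the aggregation is a sum over all neighbors, the step is permutation equivariant, which is exactly the set symmetry of the particle cloud.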
In the next section, we will introduce the architecture of LorentzNet which is built upon a universal approximation theorem of the Lorentz equivariant function and does not require explicitly calculating the higher-order representations.

Network Architecture
In this section, we illustrate the architecture of LorentzNet. The construction of the LorentzNet is based on the following universal approximation theorem for the Lorentz group equivariant continuous function.
Proposition 3.1. Any Lorentz group equivariant continuous function φ : R^{N×4} → R^4 can be approximated by functions of the form

φ(v_1, · · · , v_N) = Σ_{i=1}^N g_i(⟨v_j, v_k⟩_{j,k=1}^N) v_i,    (3.1)

where the g_i are continuous Lorentz-invariant scalar functions of the pairwise Minkowski inner products, and ⟨·, ·⟩ is the Minkowski inner product.
Proposition 3.1 provides a way to construct Lorentz group equivariant mapping with no need to calculate the high-order tensors. Instead, a Lorentz group equivariant continuous mapping can be constructed by the attention on v i with encoding the Minkowski dot products of v i with its neighbours. This motivates us to design the Minkowski dot product attention in LorentzNet, which will be introduced in the next section.
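The construction behind Proposition 3.1 can be illustrated numerically: weighting the input 4-vectors by any function built only from the pairwise Minkowski inner products yields a Lorentz-equivariant map. The following NumPy sketch uses an arbitrary toy weight function (tanh of the Gram matrix), not the learned attention:

```python
import numpy as np

J = np.diag([1.0, -1.0, -1.0, -1.0])        # Minkowski metric

def boost_x(beta):
    gamma = 1.0 / np.sqrt(1.0 - beta**2)
    Q = np.eye(4)
    Q[0, 0] = Q[1, 1] = gamma
    Q[0, 1] = Q[1, 0] = -gamma * beta
    return Q

def equivariant_map(V):
    """x_i' = sum_j g_ij v_j, with weights g_ij built only from <v_j, v_k>."""
    G = V @ J @ V.T                         # all pairwise Minkowski inner products
    W = np.tanh(G)                          # toy Lorentz-invariant weights g_ij
    return W @ V                            # transforms like the inputs

rng = np.random.default_rng(1)
V = rng.normal(size=(6, 4))
Q = boost_x(0.5)
# equivariance: transforming the inputs transforms the outputs the same way
assert np.allclose(equivariant_map(V @ Q.T), equivariant_map(V) @ Q.T)
```

The check works because the Gram matrix G is unchanged under V → V Q^T (since Q^T J Q = J), so the weights are invariant and the output inherits the transformation of the inputs.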

LorentzNet
We introduce the blocks in LorentzNet. As described in Fig. 1, LorentzNet is mainly constructed by the stack of Lorentz Group Equivariant block (LGEB) along with encoder and decoder layers.
Input layer. The inputs to the network are the 4-momenta of the particles from a collision event, possibly together with scalars associated with them (such as charge, particle identity, etc.). Using the notation of Section 2.3, the input is the set of vectors f_i = v_i ⊕ s_i. In the experiments in this paper, the scalars include the mass of the particle (i.e., √(E² − ‖p‖²)) and, where available, particle identity information.

Lorentz Group Equivariant Block. We use h^l = (h_1^l, h_2^l, · · · , h_N^l) to denote the node embedding scalars and x^l = (x_1^l, x_2^l, · · · , x_N^l) to denote the coordinate embedding vectors in the l-th LGEB layer. For l = 0, x_i^0 equals the input 4-momentum v_i and h_i^0 equals the embedded input of the scalar variables s_i. LGEB aims to learn deeper embeddings h^{l+1}, x^{l+1} from the current h^l, x^l. Motivated by Equation (3.1), the message passing of LorentzNet is written as follows. We use m_ij to denote the edge message between particles i and j; it encodes the scalar information of the pair, i.e.,

m_ij^l = φ_e(h_i^l, h_j^l, ψ(‖x_i^l − x_j^l‖²), ψ(⟨x_i^l, x_j^l⟩)),    (3.2)

where φ_e(·) is a neural network and ψ(·) = sgn(·) log(|·| + 1) in Equation (3.2) normalizes large numbers from broad distributions for ease of optimization. Besides the embeddings of the scalar features h_i^l and h_j^l, according to Proposition 3.1 the input of the neural network contains the Minkowski dot product ⟨x_i^l, x_j^l⟩. The term ‖x_i^l − x_j^l‖² is also included because the interaction between particles relies on it, so we add it as a prior feature for ease of learning. According to Equation (3.1), we design the Minkowski dot product attention as

x_i^{l+1} = x_i^l + c Σ_{j∈[N]} φ_x(m_ij^l) x_j^l,    (3.3)

where φ_x(·) ∈ R is a scalar function modeled by a neural network. To preserve equivariance, we cannot arbitrarily apply normalization tricks to control the scale of x_i^{l+1}; therefore, the hyperparameter c is introduced to control the scale of x_i^{l+1} and avoid scale explosion. This step captures the interactions of the i-th particle with the other particles via the ensemble of the 4-momenta of all particles.
Unlike most symmetry-preserving neural networks such as LGN and EGNN [56] (for E(n) equivariance), which only apply a nonlinear transformation (e.g., a neural network) to encode the radial distance, we also include the dot product ⟨x_i, x_j⟩ in m_ij according to Equation (3.1). The scalar features for particle i are updated as

h_i^{l+1} = h_i^l + φ_h(h_i^l, Σ_{j∈[N]} w_ij m_ij^l),

where φ_h(·) is also modeled by a neural network whose output dimension equals the dimension of h_i^{l+1}. For efficient computation, we aggregate the messages by the summation Σ_{j∈[N]} w_ij m_ij^l, where the weight w_ij = φ_m(m_ij^l) is learned by a neural network φ_m(·) that estimates the significance of the edge from node j to node i. This both ensures permutation invariance and eases the implementation for jets with different numbers of particles. This operation is also widely adopted in other types of graph neural networks [56, 63].
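A self-contained numerical sketch of one LGEB step follows. Here φ_e, φ_x, φ_m, φ_h are stand-ins (fixed random maps), not the trained networks of the paper, and the layer widths are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
J = np.diag([1.0, -1.0, -1.0, -1.0])          # Minkowski metric
N, d, c = 4, 8, 5e-3                          # toy sizes; c as in Eq. (3.3)
W_e = rng.normal(size=(2 * d + 2, d)) * 0.1   # stand-in for phi_e
W_x = rng.normal(size=(d, 1)) * 0.1           # stand-in for phi_x
W_m = rng.normal(size=(d, 1)) * 0.1           # stand-in for phi_m
W_h = rng.normal(size=(2 * d, d)) * 0.1       # stand-in for phi_h
psi = lambda z: np.sign(z) * np.log1p(np.abs(z))   # psi(.) = sgn(.) log(|.| + 1)

def lgeb_step(x, h):
    ip = x @ J @ x.T                                            # <x_i, x_j>
    d2 = np.diag(ip)[:, None] + np.diag(ip)[None, :] - 2 * ip   # ||x_i - x_j||^2
    m = np.zeros((N, N, d))
    for i in range(N):
        for j in range(N):
            z = np.concatenate([h[i], h[j], [psi(d2[i, j]), psi(ip[i, j])]])
            m[i, j] = np.tanh(z @ W_e)                          # edge message, Eq. (3.2)
    att = np.tanh(m @ W_x)[..., 0]                              # phi_x(m_ij)
    x_new = x + c * att @ x                                     # dot product attention, Eq. (3.3)
    w = np.tanh(m @ W_m)[..., 0]                                # edge weights w_ij
    agg = (w[..., None] * m).sum(axis=1)                        # sum_j w_ij m_ij
    h_new = h + np.tanh(np.concatenate([h, agg], axis=1) @ W_h) # scalar update
    return x_new, h_new

x0 = rng.normal(size=(N, 4))
h0 = rng.normal(size=(N, d))
x1, h1 = lgeb_step(x0, h0)
assert x1.shape == (N, 4) and h1.shape == (N, d)
```

Because every input to the stand-in networks is a Minkowski invariant, boosting x0 boosts x1 identically and leaves h1 unchanged, mirroring Proposition 3.2 below.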
Decoding layer. After a stack of L LGEB layers, we decode the node embeddings h^L = (h_1^L, · · · , h_N^L). Note that the information of x^{L−1} has already been passed into h^L through the messages m_ij^{L−1}; therefore, to avoid redundant information, we only decode h^L. We first use average pooling to obtain h_av = (1/N) Σ_{i∈[N]} h_i^L. A subsequent dropout layer is applied to h_av to prevent overfitting. A decoding block with two fully connected layers, followed by a softmax function, is used to generate the output for the binary classification task.
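The decoding step can be sketched as follows; the ReLU activation and random head weights are assumptions for illustration, and dropout is omitted since it is inactive at inference:

```python
import numpy as np

def decode(hL, W1, b1, W2, b2):
    """Average-pool the node embeddings, then two fully connected layers + softmax."""
    h_av = hL.mean(axis=0)                  # average pooling over the N particles
    z = np.maximum(0.0, h_av @ W1 + b1)     # first FC layer (ReLU assumed)
    logits = z @ W2 + b2                    # second FC layer -> 2 class logits
    e = np.exp(logits - logits.max())       # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
hL = rng.normal(size=(7, 72))               # a jet with 7 particles, 72-dim embeddings
p = decode(hL, rng.normal(size=(72, 72)) * 0.1, np.zeros(72),
           rng.normal(size=(72, 2)) * 0.1, np.zeros(2))
assert p.shape == (2,) and np.isclose(p.sum(), 1.0)
```

Average pooling makes the head independent of the particle ordering and of N, which is what allows jets of different sizes to share one classifier.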

Theoretical Analysis
In this section, we analyze the Lorentz group equivariance of LorentzNet.
Proposition 3.2. The coordinate embedding x l = (x l 1 , x l 2 , · · · , x l N ) are Lorentz group equivariant and the node embedding h l = (h l 1 , · · · , h l N ) are Lorentz group invariant.
Proof: Denote by Q the Lorentz transformation. If m_ij^l is invariant under Q for all i, j, l, then x_i^{l+1} is Lorentz group equivariant, because

Qx_i^l + c Σ_{j∈[N]} φ_x(m_ij^l) Qx_j^l = Q (x_i^l + c Σ_{j∈[N]} φ_x(m_ij^l) x_j^l) = Q x_i^{l+1}.

We now show the invariance of m_ij^l, starting from the input. Since the Minkowski norm and the Minkowski inner product are invariant under the Lorentz group, we have ⟨Qx_i^0, Qx_j^0⟩ = ⟨x_i^0, x_j^0⟩ and ‖Qx_i^0 − Qx_j^0‖² = ‖x_i^0 − x_j^0‖². Therefore, the inputs of φ_e are invariant under the transformation Q, and hence m_ij^0 is invariant. Recursively using the invariance of m_ij^l and the equivariance of x_i^l, we obtain the conclusion. As for the expressiveness of the LGEB structure, we have the following discussion. Because h_i is a function of the aggregation of the m_ij, it contains the information of all Minkowski dot products of the particle pairs. Therefore, the weights in the Minkowski dot product attention φ_x(·) in Equation (3.3) at the l-th layer (l > 1) have access to all Minkowski dot products of the particle pairs. According to Proposition 3.1, the input information of the attention factor is thus complete for expressiveness.

Implementation Details
The LorentzNet architecture used in this paper is shown in Fig. 1. It consists of 6 Lorentz group equivariant blocks (L = 6). The scalar embedding is implemented as one fully connected layer that maps the scalars to a latent space of width 72, i.e., Linear(scalar_num, 72). The decoding block maps the 72-dimensional pooled embedding to the 2-dimensional output, ending in Linear(72, 2). The dropout rate is 0.2. The hyperparameter c in Equation (3.3) is chosen to be 5 × 10⁻³ and 1 × 10⁻³ for the top tagging and quark-gluon tagging tasks, respectively, to achieve the best performance.
The networks are implemented with PyTorch, and training is performed on a cluster with four Nvidia Tesla V100 GPUs. A batch size of 32 is used on every single GPU for the LorentzNet architecture. The AdamW [65] optimizer, with a weight decay of 0.01, is used to minimize the cross-entropy loss. LorentzNet is trained for 35 epochs in total. First, a linear warm-up period of 4 epochs is applied to reach the initial learning rate of 1 × 10⁻³. Then a CosineAnnealingWarmRestarts [66] learning rate schedule with T_0 = 4, T_mult = 2 is adopted for the next 28 epochs. Finally, an exponential learning rate decay with γ = 0.5 is used for the last 3 epochs. We test the model on the validation dataset at the end of each training epoch, and the model with the highest validation accuracy is saved as our best model for the final test. Code and models are available at https://github.com/sdogsq/LorentzNet-release.
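The resulting learning rate trajectory can be sketched epoch-by-epoch in plain Python. This is a simplification: the actual PyTorch schedulers may step per iteration and keep their own warm-restart bookkeeping, so values can differ slightly:

```python
import math

def lr_at_epoch(epoch, base_lr=1e-3, eta_min=0.0):
    if epoch < 4:                                    # linear warm-up over 4 epochs
        return base_lr * (epoch + 1) / 4
    if epoch < 32:                                   # cosine annealing warm restarts
        t, T = epoch - 4, 4                          # restart periods 4, 8, 16 (T_mult = 2)
        while t >= T:
            t -= T
            T *= 2
        return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / T))
    return lr_at_epoch(31) * 0.5 ** (epoch - 31)     # exponential decay, gamma = 0.5

schedule = [lr_at_epoch(e) for e in range(35)]
assert math.isclose(schedule[3], 1e-3)               # warm-up reaches base_lr
assert math.isclose(schedule[4], 1e-3)               # each cosine restart starts at base_lr
```

The three phases cover 4 + 28 + 3 = 35 epochs, matching the training budget above.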

Experiment
The performance of the LorentzNet architecture is evaluated on two representative jet tagging tasks: top tagging and quark-gluon tagging. In this section, we show the benchmark results.

Datasets
Top tagging. We first evaluate LorentzNet on top tagging classification experiments with the publicly available reference dataset [67]. This dataset contains 1.2M training entries, 400k validation entries, and 400k testing entries. Each of these entries represents a single jet whose origin is either an energetic top quark, a light quark, or a gluon. The events are produced using the PYTHIA8 Monte Carlo event generator. The ATLAS detector response is modelled with the DELPHES software package.
The jets in the reference dataset are clustered using the anti-k_T algorithm with a radius of R = 0.8. For each jet, the 4-momenta are saved in Cartesian coordinates (E, p_x, p_y, p_z) for up to 200 constituent particles, selected by highest transverse momentum. Following the settings in [61], the colliding particle beams aligned along the z-axis are added. Each jet contains an average of 50 particles, and jets with fewer than 200 particles are zero-padded.
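This selection and padding step can be sketched as follows (a hypothetical helper; the dataset's own ordering and padding conventions are authoritative):

```python
import numpy as np

def pad_jet(p4s, max_particles=200):
    """Keep up to max_particles highest-pT constituents of a (n, 4) array of
    (E, px, py, pz), zero-pad to fixed length, and return a validity mask."""
    pt = np.hypot(p4s[:, 1], p4s[:, 2])           # transverse momentum
    sel = p4s[np.argsort(-pt)[:max_particles]]    # highest-pT first
    out = np.zeros((max_particles, 4))
    out[: len(sel)] = sel
    mask = np.zeros(max_particles, dtype=bool)
    mask[: len(sel)] = True
    return out, mask

rng = np.random.default_rng(0)
jet = rng.normal(size=(50, 4))                    # a toy jet with 50 constituents
padded, mask = pad_jet(jet)
assert padded.shape == (200, 4) and mask.sum() == 50
```

The mask lets downstream message passing ignore the zero-padded rows rather than treat them as real particles.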
Quark-gluon tagging. The second dataset is for quark-gluon tagging, i.e., discriminating jets initiated by quarks and by gluons. The signal (quark) and background (gluon) jets are generated with PYTHIA8 using the Z(→ νν) + (u, d, s) and Z(→ νν) + g processes, respectively. No detector simulation is performed. The final state non-neutrino particles are clustered into jets using the anti-k T algorithm with R = 0.4. This dataset consists of 2 million jets in total, with half signal and half background. We follow the setting in [37] to split 1.6M/200K/200K for training, validation, and testing.
One difference of the quark-gluon tagging dataset is that it includes not only the 4-momentum vector but also the type of each particle (i.e., electron, photon, pion, etc.). We use a one-hot encoding of the particle type as input scalars to LorentzNet. We include this task to test the performance of LorentzNet with this different type of input feature.
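A one-hot encoding of the particle type can be sketched as follows; the type vocabulary here is an assumption for illustration, not the dataset's actual list:

```python
# hypothetical particle-type vocabulary for illustration
TYPES = ["electron", "muon", "photon", "charged_hadron", "neutral_hadron"]

def one_hot(particle_type):
    """Encode a particle type as a one-hot vector over the vocabulary."""
    vec = [0.0] * len(TYPES)
    vec[TYPES.index(particle_type)] = 1.0
    return vec

# these scalars are appended to each particle's input features s_i
assert one_hot("photon") == [0.0, 0.0, 1.0, 0.0, 0.0]
```

Because the encoded types are Lorentz scalars, feeding them through h_i^0 leaves the equivariance of the coordinate branch untouched.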

Baselines and Tagging Performance
We compare the performance of LorentzNet with six baseline models: ResNeXt-50 [68], P-CNN [69], PFN [36], ParticleNet [37], EGNN [56], and LGN [61]. The tagging accuracy of these models, except EGNN, has been reported in [37] and [61]. In this paper, we also investigate their robustness under Lorentz transformations and their computational cost. To be self-contained, we briefly introduce them here. The ResNeXt-50 model is a 50-layer convolutional neural network with skip connections for image classification; representing jets as images, we can apply ResNeXt-50 to jet tagging. The P-CNN is a 14-layer 1D CNN using particle sequences as inputs. The P-CNN architecture was proposed in the CMS particle-based DNN boosted jet tagger and shows significant improvement over a traditional tagger using boosted decision trees and jet-level observables. The Particle Flow Network (PFN) and ParticleNet also treat a jet as an unordered set of particles. The PFN is based on the Deep Sets framework. ParticleNet is based on the Dynamic Graph Convolutional Neural Network with a carefully designed EdgeConv operation. We also include EGNN, which is E(4) equivariant, as a representative symmetry-preserving baseline. It differs from LorentzNet in that EGNN uses the Euclidean metric instead of the Minkowski metric. We compare LorentzNet with EGNN to show the necessity of Lorentz group symmetry. For a fair comparison, we set the number of parameters of EGNN to the same order as LorentzNet.
For ResNeXt, P-CNN, PFN, and ParticleNet, we follow the implementation in [37]. For LGN, we follow the implementation in [61] on top tagging. For the quark-gluon dataset, the implementation details are reported in Appendix A.
Tagging performance. The results for the top tagging dataset and the quark-gluon dataset are summarized in Table 1 and Table 2, respectively. We report the accuracy, the area under the ROC curve (AUC), the background rejection 1/ε_B at signal efficiencies ε_S = 0.3 and 0.5 (ε_B and ε_S are also known as the false positive and true positive rates, respectively), and the number of trainable parameters. The background rejection metric in particular is widely adopted to select the best jet tagging algorithm, as it is directly related to the expected contribution of the background [36, 37, 61]. From Table 1 and Table 2, we conclude that LorentzNet achieves state-of-the-art performance on both the top tagging dataset and the quark-gluon dataset in terms of accuracy, AUC, and background rejection at ε_S = 0.3 and 0.5. The results verify the effectiveness of LorentzNet compared with the baselines. Fig. 2 shows the background rejection at fine-grained signal efficiencies; the ROC curves of LorentzNet achieve the highest score at all selected signal efficiencies compared to the baselines. In particular, LorentzNet shows clear superiority over LGN, achieving a 4- to 5-fold improvement in background rejection. The results verify our discussion in Sec. 3.2.


Sample Efficiency
The benefit of preserving Lorentz group symmetry in jet tagging has not been studied in the literature. In theory, Lorentz group symmetry injects an inductive bias into the deep learning model that restricts the hypothesis space; this inductive bias can boost generalization and improve sample efficiency. Since the improvement in generalization performance (i.e., tagging accuracy) was shown in the previous section, here we study the robustness of LorentzNet when trained on smaller training sets to verify its sample efficiency. We compare the best-performing architectures with and without full Lorentz group symmetry (i.e., LorentzNet and ParticleNet). The inductive bias in ParticleNet covers only a subgroup of the Lorentz group, namely Lorentz boosts along the z-axis and rotations in the x-y plane, while LorentzNet is symmetric under the full Lorentz group. We randomly select 5%, 1%, and 0.5% fractions of the training data to train LorentzNet and ParticleNet on the top tagging dataset, and test their performance on the same test set of size 400k. The training strategy is kept the same as in the experiments on the full training data. The results are reported in Table 3. The gap in tagging accuracy and AUC between LorentzNet and ParticleNet becomes larger as the amount of training data decreases. The results clearly show the benefit of preserving Lorentz group symmetry in jet tagging.

Equivariance test
Another advantage of symmetry-preserving deep learning models is their robustness under Lorentz transformations. To verify this, we transform the test data by Lorentz boosts with different values of β along the x-axis, i.e., the (E, p_x) components of the 4-momentum vectors are mixed. As β becomes larger, the difference between the distributions of the training and test data grows. We evaluate models trained on the original training data, and the tagging accuracy on the boosted test data is reported in Fig. 3. The horizontal axis of Fig. 3 shows the value of β, and the vertical axis shows the tagging accuracy on the top tagging dataset under the Lorentz transformation with the corresponding β. The results show that the accuracy of LorentzNet and LGN on the boosted test data is robust over a large range of β, while the test accuracy of the non-equivariant models drops as β grows. This is consistent with special relativity: the fundamental quantities characterizing the particles do not change under a boost. Even compared with LGN, LorentzNet is more stable as β approaches 1; the instability of LGN is caused by rounding errors in floating-point arithmetic, as described in its original paper [61].
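The test-set transformation can be sketched as follows: an x-axis boost that mixes the (E, p_x) components of every particle while leaving each particle's invariant mass unchanged:

```python
import numpy as np

def boost_dataset_x(p4s, beta):
    """Apply one x-axis Lorentz boost to every (E, px, py, pz) row of p4s."""
    gamma = 1.0 / np.sqrt(1.0 - beta**2)
    out = p4s.copy()
    out[..., 0] = gamma * (p4s[..., 0] - beta * p4s[..., 1])   # E'
    out[..., 1] = gamma * (p4s[..., 1] - beta * p4s[..., 0])   # px'
    return out

rng = np.random.default_rng(0)
p = rng.normal(size=(10, 4))                     # toy stand-in for test 4-momenta
m2 = lambda q: q[..., 0] ** 2 - (q[..., 1:] ** 2).sum(axis=-1)  # invariant mass^2
assert np.allclose(m2(boost_dataset_x(p, 0.9)), m2(p))
```

A fully Lorentz-invariant tagger sees exactly the same invariant inputs before and after this transformation, which is why its accuracy curve in Fig. 3 stays flat in β.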

Ablation study
In this section, we report the results of an ablation study to further demonstrate the effectiveness of the components in LorentzNet. To show the importance of preserving Lorentz group equivariance, we directly use x_i^l and x_j^l as inputs of φ_e, which breaks the equivariance because x_i^l and x_j^l are not Lorentz-invariant variables; i.e., Equation (3.2) is replaced by

m_ij^l = φ_e(h_i^l, h_j^l, x_i^l, x_j^l).

We name this variant LorentzNet without equivariance (abbreviated LorentzNet (w/o)). We compare the performance of LorentzNet (w/o) with LorentzNet on the top tagging dataset. The training hyperparameters of LorentzNet (w/o) are kept the same as for LorentzNet, and LorentzNet (w/o) is trained until convergence. As shown in Table 4, LorentzNet (w/o) performs worse than LorentzNet, which shows the necessity of Lorentz group equivariance for the tagging performance.

Computational cost
We report the inference time of LorentzNet and the baseline models. The inference time of each model on both CPU and GPU, along with the number of parameters, is reported in Table 5. The number of trainable parameters of LorentzNet is of the same order as P-CNN and ParticleNet. Models are executed on a cluster with an Intel Xeon CPU E5-2698 v4 and an Nvidia Tesla V100 32GB GPU. All inference times are collected with a batch size of 100. As shown in Table 5, the inference time of LorentzNet on GPU is slightly larger than that of P-CNN, PFN, and ParticleNet, and comparable with ResNeXt. Notably, compared with LGN, LorentzNet is almost 5 times faster on GPU, even though LGN has far fewer trainable parameters. On CPU, LorentzNet is also faster than LGN. Both results demonstrate the efficiency of LorentzNet: since there is no need to compute high-order tensors, it is more efficient. See Appendix C for more computational comparisons of LorentzNet and LGN. The main computational cost of LorentzNet comes from its message passing on the fully connected graph, whose cost depends quadratically on the number of particles in a jet. This cost could be reduced by clustering the nodes, which we will explore in future studies.
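The timing protocol can be sketched as follows (a hypothetical harness; `model_fn` stands in for any trained tagger applied to a batch of 100 jets):

```python
import time

def time_inference(model_fn, batch, n_runs=10):
    """Average wall-clock time per batch, after one warm-up call."""
    model_fn(batch)                         # warm-up (e.g. JIT / cache effects)
    start = time.perf_counter()
    for _ in range(n_runs):
        model_fn(batch)
    return (time.perf_counter() - start) / n_runs

# usage with a trivial stand-in "model" on a batch of 100 items
avg = time_inference(lambda b: [x * 2 for x in b], list(range(100)))
assert avg >= 0.0
```

For GPU models, the call would additionally need a device synchronization before reading the clock, since GPU kernels launch asynchronously.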

Conclusion
We have presented LorentzNet, a Lorentz group equivariant graph neural network for jet tagging. Experiments on two representative jet tagging datasets show that LorentzNet achieves substantial performance improvement over all existing methods. The efficient design of the message passing to preserve the Lorentz symmetry significantly reduces both training and inference time compared to LGN. Moreover, the efficient design enhances the generalization power, allowing LorentzNet to reach highly competitive performance with only a few thousand jets for training. Although the effectiveness of LorentzNet is only demonstrated on two jet tagging tasks in this article, the symmetry-preserving architecture can be applied to a broad range of tasks in particle physics, such as regression of the true jet mass or even the full 4-momentum, generation of jets for fast simulation, classification of the entire collision event, mitigation of pileup interactions, and more. As ever-larger datasets become available in particle physics, we hope that LorentzNet will help enable more accurate prediction and accelerate the discovery of interesting new phenomena.
A Discussion of LGN

Ideally, the universal approximation property of LGN can be achieved if the order of the tensors realized by LGN is sufficiently high. However, in practice, the tensor order must be cut off for efficient computation, and in many applications there is no direct prior for determining this cutoff.
In terms of computational efficiency, one problem is that the tensor product raises the computational cost (Eq. (A.2)) compared with the simple dot product in LorentzNet. Another problem is that the tensor product operation is a weighted sum of products of the input features; without normalization, the output of the network is sensitive to the scale of the input, which leads to instability (e.g., value explosion or vanishing) in both the forward and backward passes. To ensure stability, the input 4-momenta were scaled by a factor of 0.005 on the top tagging dataset in the original paper [61]. The scaling factor is a hyperparameter to tune for different datasets, and there is no principled way to design a scaling or normalization that stabilizes training while preserving Lorentz symmetry. Therefore, compared with LGN, LorentzNet achieves a better tradeoff between approximation ability and computational efficiency.
In the experiments of LGN on the quark-gluon dataset, we follow its public code base at https://github.com/fizisist/LorentzGroupNetwork. The batch size is set to 48 and we run 53 epochs in total. All other parameters are aligned with the original paper [61]. The wall-clock time of running one epoch of LGN is more than 10 times larger than for the other baseline models, which is also reported in its original paper [61] (e.g., 7.5 hours per epoch with a batch size of 8 on top tagging). This computational cost is the main bottleneck for further evaluation of LGN.

B Relation with EGNN
The message passing of LorentzNet is also inspired by EGNN, which was proposed to ensure E(n) group equivariance. If the output is a scalar, the message passing of EGNN is written as

m_ij = φ_e(h_i^l, h_j^l, ‖x_i^l − x_j^l‖_E²),    h_i^{l+1} = φ_h(h_i^l, Σ_{j} m_ij),

where ‖·‖_E denotes the L2 norm under the Euclidean metric. We now discuss the differences between LorentzNet and EGNN. First, EGNN uses the Euclidean norm of the relative distance as the only scalar information about the vectors. Since relative distances cannot recover the angles (i.e., the inner products), its expressiveness is arguable [70]. In the E(n) equivariant setting, the inner product ⟨x_i, x_j⟩ cannot be added because of translation invariance; since we target Lorentz group symmetry, we can include the Minkowski inner product for information completeness. Second, EGNN does not include the coordinate update component for scalar prediction, while LorentzNet includes the Minkowski dot product attention for expressiveness. For a fair comparison in the experiments, we include the coordinate update step in EGNN to ensure that the differences are caused only by the different groups.
In the experiments of EGNN on the top and quark-gluon datasets, we follow its public code base at https://github.com/vgsatorras/egnn. We use a batch size of 128 and train for 35 epochs in total. The AdamW [65] optimizer, with a weight decay of 0.01, is used to minimize the cross-entropy loss. First, a linear warm-up period of 4 epochs is applied to reach the initial learning rate of 1 × 10⁻³. Then a CosineAnnealingWarmRestarts [66] learning rate schedule with T_0 = 4, T_mult = 2 is adopted for the next 28 epochs. Finally, an exponential learning rate decay with γ = 0.5 is used for the last 3 epochs. We test the model on the validation dataset at the end of each training epoch, and the model with the highest validation accuracy is saved as our best model for the final test.

C Computational Cost in Comparison to LGN

We report the inference time of LorentzNet and LGN at two different scales on both CPU and GPU, along with their parameter counts, in Table 6. The inference of LorentzNet (small) is 31 times faster than LGN (large) on CPU and 12 times faster on GPU, while their numbers of trainable parameters are approximately equal. Moreover, LorentzNet (small) is still 21 times faster than LGN on CPU and 10 times faster on GPU, even when the LGN network is smaller than LorentzNet (small). The comparison at an approximately equal number of parameters shows that LorentzNet is much more efficient.
We also report the performance of LorentzNet and LGN in Table 7. LorentzNet (small) still shows highly competitive performance even when its number of parameters is reduced to only 5% of the original model. We do not evaluate the performance of LGN (large) due to its slow training process. Implementation details: LorentzNet (small) follows the settings in Sec. 3.3 except that the width of the latent space is set to 16. LGN (large) follows the settings in [61] except that the numbers of channels are chosen as N