Improved Constraints on Effective Top Quark Interactions using Edge Convolution Networks

We explore the potential of Graph Neural Networks (GNNs) to improve the performance of high-dimensional effective field theory parameter fits to collider data beyond traditional rectangular cut-based differential distribution analyses. In this study, we focus on a SMEFT analysis of $pp \to t\bar t$ production, including top decays, where the linear effective field deformation is parametrised by thirteen independent Wilson coefficients. The application of GNNs allows us to condense the multidimensional phase space information available for the discrimination of BSM effects from the SM expectation by considering all available final state correlations directly. The number of contributing new physics couplings very quickly leads to statistical limitations when the GNN output is directly employed as an EFT discrimination tool. However, a selection based on minimising the SM contribution enhances the fit's sensitivity when reflected as a (non-rectangular) selection on the inclusive data samples that are typically employed when looking for non-resonant deviations from the SM by means of differential distributions.


Introduction
The search for new physics, albeit so far unsuccessful at the Large Hadron Collider (LHC), remains a cornerstone of the collider phenomenology programme. The lack of direct evidence for new states beyond the Standard Model (BSM) can be interpreted as an indication of a separation between the electroweak scale and the scale of new physics $\Lambda$. Integrating out BSM states can then be cast into a consistent and systematic extension of the SM by higher-dimensional operators $O$ with dimensions $[O] > 4$ [1]. Turning the argument around, constraints on any extension of the SM can be obtained by agnostically reflecting all a priori allowed operator deformations in the interpretation of LHC results up to a considered order in the $\Lambda^{-1}$ expansion. At operator level up to dimension six in Standard Model Effective Theory (SMEFT) [2][3][4][5][6][7], $\mathcal{L} = \mathcal{L}_{\text{SM}} + \sum_i (C_i/\Lambda^2)\, O_i$ (1.1), this programme has been comprehensively addressed from a theoretical perspective (see e.g. [8][9][10][11][12][13][14] and [15] for a recent review) and provides a theoretically consistent interpretation at this order in the $\Lambda^{-1}$ expansion. Related phenomenological (proof-of-principle) analyses have focussed on (combinations of) Higgs physics [16][17][18][19][20][21][22][23], electroweak precision observables [24][25][26], and the top sector [27][28][29][30][31][32][33][34], owing to good statistical and systematic control at past and present collider experiments, and their expected roles in BSM physics in general. The experimental implementation of such a strategy is far from trivial. The number of involved and independent effective interactions can be large, thus potentially limiting the sensitivity of a single, specific analysis. In parallel, systematic uncertainties can lead to weak constraints, giving only loose and perhaps non-perturbative limits when understood as UV constraints in concrete matching calculations.
There are two avenues to improve phenomenological sensitivity. Firstly, decreasing the theoretical and experimental uncertainties alongside systematics of the EFT and SM hypotheses, possibly via data-driven approaches, will lead to improved limits when more data become available (assuming agreement with the SM prevails). Lower limits on the direct evidence of new states, e.g. via s-channel production, are predominantly driven by the available LHC centre-of-mass energy. Therefore, the lower limit on $\Lambda$ in eq. (1.1), which is driven by the LHC's energy coverage, will not change dramatically in the future. Thus, any modelling improvement at scales $|Q^2| \ll \Lambda^2$, where the EFT expansion can be considered reliable, will be reflected in improved constraints on the Wilson coefficients (WCs) $C_i$ (modulo remaining blind directions).
Secondly, we can resort to a more comprehensive extraction of information from experimental data. Such strategies are highlighted in the recent resurgence of machine learning (ML) applications to particle physics [35][36][37][38][39][40][41][42][43][44][45] (in particular focusing also on experimental improvements [46,47]). 'Traditional' collider observables such as transverse momenta, angles and (pseudo)rapidities, alongside rectangular cuts on these, might not fully capture the exclusion potential when all ad hoc modifications of correlations are considered, which is the key motivation of the EFT approach (in particular this extends to the inclusion of systematic uncertainties [48]).
This paper is organised as follows. In section 2, we review the EFT operators relevant for this case study. We also detail our simulation, analysis and fit setup of $t\bar t$ production. Section 3 is devoted to the machine learning aspects of this study: we briefly outline our baseline cuts (taking the experimental analysis of [65] as guidance), review our ML setup, and discuss input parameters, training and classification. We highlight the performance improvements of an ML-informed top sector fit in section 4 and conclude in section 5.

Effective interactions for top pair production with leptonic decays
Any differential cross section that follows from eq. (1.1) can be written as $d\sigma = d\sigma_{\text{SM}} + \frac{C_i}{\Lambda^2}\, d\sigma^{(1)}_i + \frac{C_i C_j}{\Lambda^4}\, d\sigma^{(2)}_{ij}$ (2.1), with sums over repeated indices implied. The first term is the SM expectation and the second term is the contribution from the interference of the EFT and SM amplitudes. The third term represents the contribution from the EFT squared or cross-terms, which are $\Lambda^{-4}$ suppressed. In the following, we will limit ourselves to dimension-six (differential) cross sections $\sim \Lambda^{-2}$ that result from interference of the EFT and SM amplitudes. While this is a theoretically consistent approach, it also constitutes a conservative case for EFT limit setting: contributions $\sim \Lambda^{-4}$ typically show a dramatic momentum-transfer enhanced behaviour and are therefore relatively easy to constrain, even using standard approaches. Put differently, any sensitivity improvement that we can identify for the linearised approach will generalise to the inclusion of the $\sim \Lambda^{-4}$ terms in eq. (2.1).
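As a rough illustration of the linearised truncation described above, the following Python sketch builds a binned prediction from an SM histogram plus per-operator interference templates, keeping only terms linear in the Wilson coefficients. The function name, the operator labels `op_a` and `op_b`, and all numbers are hypothetical placeholders, not values from this analysis.

```python
def linear_eft_prediction(sm, interference, coeffs, lam_tev=1.0):
    """Binned prediction keeping only terms linear in C_i / Lambda^2.

    sm           -- SM expectation per bin
    interference -- dict: operator label -> SM-EFT interference per bin,
                    evaluated for C_i = 1 and Lambda = 1 TeV (assumed convention)
    coeffs       -- dict: operator label -> Wilson coefficient value
    lam_tev      -- new physics scale in TeV
    """
    scale = 1.0 / lam_tev**2
    prediction = list(sm)
    for label, c in coeffs.items():
        for k, slope in enumerate(interference[label]):
            prediction[k] += c * scale * slope
    return prediction


# Toy numbers only: three bins and two made-up operators.
sm_bins = [100.0, 40.0, 10.0]
interference_bins = {
    "op_a": [1.0, 2.0, 3.0],     # grows towards the tail
    "op_b": [-5.0, -2.0, -0.5],  # mostly a rate change
}
pred = linear_eft_prediction(sm_bins, interference_bins,
                             {"op_a": 2.0, "op_b": 1.0})
print(pred)
```

Because the prediction is linear in the coefficients, a fit only needs one interference template per operator rather than a scan over coefficient values.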

Analysis Setup and Fit Methodology
We use the SMEFTsim [66,67] implementation to include the effective operators, which is then interfaced with MadGraph5 [68] via FeynRules [69] and UFO [70] to generate the event samples at leading order (LO) (footnote 1) for $pp \to t\bar t$ production with top decays, eq. (2.2). We use a $\sqrt{s} = 13$ TeV analysis by the CMS collaboration [65] as inspiration to investigate (correlated) differential measurement results and representative data binning as given in table 1. SM predictions are injected as mock reference data for the luminosity $L_{\text{ref}} = 2.3~\text{fb}^{-1}$ of Ref. [65] and we scale statistical uncertainties relative to this luminosity, using $L_{\text{ref}}/L$ for extrapolations. Our implementation relies on Rivet [72,73], which processes events after showering with Pythia8 [74] before feeding them into the fit.
Footnote 1: In this work, we focus on GNN performance of EFT parameter fits and limit ourselves to a leading order analysis. We note that including higher order contributions for the SM hypothesis is crucial to obtain consistency with the measured data, but will not impact the qualitative results of this work. We have checked that the results of [...]
To avoid imposing any assumptions as to correlations (and to remove the chance that double-counting of events would artificially inflate sensitivity to EFT contributions), a single distribution is used where bin-to-bin correlations are included, and a single bin is used where they are not. In the absence of a full reference correlation/covariance matrix, the selection of the bin/distribution is made on a coefficient-by-coefficient basis, with the input with maximum deviation from a fixed point on that axis being selected. This maximum-sensitivity, minimum-correlation-assumption input subset is then used to determine individual and profiled bounds for the coefficient being studied. Where a normalised distribution is used we must drop a bin, as otherwise the covariance matrix will be singular. The dropped bin is chosen such that we obtain the most stable covariance matrix, with the bin with the largest uncertainty being dropped if there are multiple bins leading to an equivalently well-conditioned covariance matrix. In the following we will consider bounds for all relevant operators using the dimensionless 'bar' notation $\bar C_i = C_i\, v^2/\Lambda^2$ with the electroweak expectation value $v \simeq 246$ GeV.
In many standard analyses, cut-and-count techniques are used to restrict the phase space region in such a way that the SM contamination is minimised, which yields an increased new-physics sensitivity. However, rectangular cuts often yield inferior sensitivity compared to regions selected methodically by means of machine learning classifiers. In our scenario an efficient event-by-event classification using GNNs, separating the generated events into either pure SM or the SMEFT operators that sourced them, could lead to improvements on the bounds of WCs after imposing cuts on the output score of the network.
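The covariance-based fit input described above can be sketched as follows. This is a minimal, dependency-free illustration of a $\chi^2$ built from a residual vector and a covariance matrix, with the option of dropping one bin as is needed for a normalised distribution; the helper names and the toy numbers are assumptions made here, and the actual fit machinery of this work is not reproduced.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= f * a[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (a[i][n] - s) / a[i][i]
    return x


def chi2(data, theory, cov, drop=None):
    """chi^2 = r^T V^{-1} r, optionally with one bin (index `drop`) removed."""
    keep = [k for k in range(len(data)) if k != drop]
    r = [data[k] - theory[k] for k in keep]
    v = [[cov[i][j] for j in keep] for i in keep]
    y = solve(v, r)  # y = V^{-1} r without forming the inverse explicitly
    return sum(ri * yi for ri, yi in zip(r, y))


# Toy inputs: three bins with uncorrelated uncertainties (2, 3, 3).
data = [10.0, 20.0, 30.0]
theory = [12.0, 20.0, 27.0]
cov = [[4.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 9.0]]
chi2_all = chi2(data, theory, cov)
chi2_dropped = chi2(data, theory, cov, drop=2)

# Two fully specified correlated bins as a cross-check.
chi2_corr = chi2([1.0, 1.0], [0.0, 0.0], [[2.0, 1.0], [1.0, 2.0]])
```

Solving the linear system instead of inverting the matrix is a common numerical choice; a near-singular covariance matrix (the case the dropped bin is meant to avoid) would show up here as a very small pivot.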

Graph representation of events
In order to use a GNN as a classifier, the events need to be embedded in a graph structure with nodes, edges and features associated to observables of final states or reconstructed objects. While various different approaches are possible to construct a graph from the IR-safe, calibratable and detectable final states, we employ a physics-motivated strategy, creating graphs similar to the tree of the chain of eq. (2.2) (see footnote 3). Concretely, we pre-process the data samples and require at least two jets of transverse momentum $p_T(j) > 20$ GeV and pseudorapidity $|\eta| < 5$ that are not b-tagged. The event is vetoed if there are not at least two b-jets and one lepton in the central part of the detector ($|\eta(\ell)| < 2.5$), where the b-jets must also satisfy $p_T(b) > 20$ GeV. Subsequently, we embed the passed events into graphs using the following steps (see also fig. 1):
(i) Nodes: Firstly, the missing transverse momentum (MTM) is identified by balancing the net visible momenta, $-\vec p(\text{visible})$, neglecting the longitudinal components. A node is added corresponding to MTM. Then, for each lepton, we attempt to reconstruct the W four-momentum as a sum of the lepton's four-momentum and the MTM. The invariant mass of the W candidate is calculated and if it falls within [65,95] GeV a node is added, labelled $W_1$, as well as one for the b-jet $b_1$ that has the smallest separation $\Delta R = \sqrt{\Delta\eta^2 + \Delta\phi^2}$ from $W_1$. In the case where there are more lepton-MTM combinations with compatible invariant mass, the one closest to the W boson mass is selected. The top from the leptonic decay chain, $t_1$, is finally reconstructed from the four-momenta of $\ell$, $b_1$ and MTM and obtains its respective node. Following a similar procedure, we consider combinations of jets to find a pair with dijet invariant mass 70 GeV $\leq m(jj) \leq$ 90 GeV. If a pair is found we add nodes for the two jets $j_1$, $j_2$ and for the second boson $W_2$, otherwise we only add nodes for the two leading jets.
From the remaining b-jets a node is added for the leading one, b 2 , as well as for the second top t 2 whose four-momentum is reconstructed using b 2 , j 1 and j 2 . We scan over the remaining particles and if any are within ∆R < 0.8 of any of the identified or reconstructed objects we add a node that will be connected only to the nearby object.
(ii) Edges: The connections between the nodes create the adjacency matrix of the graph, and the nodes of the final states are connected to the ones of the reconstructed objects from which they are derived. We first connect the MTM and lepton to $W_1$ and subsequently, $W_1$ and $b_1$ are connected to the first top quark node. If a $W_1$ was not created then the aforementioned final states connect directly to $t_1$ (see footnote 4). Similarly, for the other leg of the decay chain, if $W_2$ was successfully reconstructed, we join its node with the two jets used to reconstruct it, and then $W_2$ and $b_2$ are connected to the top node. The jets are directly connected to the top if there is no node for $W_2$. Any node originating from the remaining final states is connected to the node of the object that satisfied $\Delta R < 0.8$.
Footnote 3: We also checked the fully connected graph and found low performance for the given network as the number of edges increases, which carry much less physics information. Hence the decay chain-like structure in the graph gives good performance.
Footnote 4: We expect that this will lead to a further enhancement of sensitivity when the $\Lambda^{-4}$ non-resonant contributions are considered.
(iii) Node features: After constructing the nodes and edges, we associate each node with a feature vector $[p_T, \eta, \phi, E, m, \text{PID}]$, whose entries represent transverse momentum, pseudorapidity, azimuthal angle, energy, mass and particle identification number, respectively.
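The kinematic ingredients of steps (i) and (ii) can be sketched in plain Python as below. This is only an illustration under stated assumptions: all function names are hypothetical, the MTM is treated as massless with vanishing longitudinal component, the node labels are strings chosen here, and when no $W_1$ is found the b-jet $b_1$ is assumed to connect directly to $t_1$ together with the lepton and MTM.

```python
import math


def four_vector(pt, eta, phi, mass=0.0):
    """Return (E, px, py, pz) from collider coordinates."""
    px, py, pz = pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta)
    return (math.sqrt(px * px + py * py + pz * pz + mass * mass), px, py, pz)


def add(*vectors):
    """Component-wise sum of four-vectors."""
    return tuple(sum(c) for c in zip(*vectors))


def inv_mass(v):
    e, px, py, pz = v
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))


def delta_r(eta1, phi1, eta2, phi2):
    """Separation sqrt(d_eta^2 + d_phi^2) with d_phi wrapped into [-pi, pi)."""
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)


def missing_transverse_momentum(visible):
    """Balance the visible transverse momenta; longitudinal part neglected."""
    px = -sum(v[1] for v in visible)
    py = -sum(v[2] for v in visible)
    return (math.hypot(px, py), px, py, 0.0)


def leptonic_w(lepton, mtm, window=(65.0, 95.0)):
    """W candidate from lepton + MTM if its mass lies inside the window (GeV)."""
    w = add(lepton, mtm)
    return w if window[0] <= inv_mass(w) <= window[1] else None


def decay_chain_edges(has_w1=True, has_w2=True):
    """Bi-directed edge list of the decay-chain-like graph."""
    if has_w1:
        edges = [("lepton", "W1"), ("MTM", "W1"), ("W1", "t1")]
    else:
        edges = [("lepton", "t1"), ("MTM", "t1")]
    edges.append(("b1", "t1"))
    if has_w2:
        edges += [("j1", "W2"), ("j2", "W2"), ("W2", "t2")]
    else:
        edges += [("j1", "t2"), ("j2", "t2")]
    edges.append(("b2", "t2"))
    return edges + [(b, a) for a, b in edges]  # add the reversed direction


# Toy event: one lepton balanced by the MTM (not a realistic configuration).
lepton = four_vector(40.0, 0.0, 0.0)
mtm = missing_transverse_momentum([lepton])
w_candidate = leptonic_w(lepton, mtm)
```

Extra nodes for remaining particles within $\Delta R < 0.8$ of a reconstructed object would be attached with one additional bi-directed edge each, using `delta_r` for the matching.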

Graph Neural Network with Edge Convolution
Convolution networks have seen a range of developments in the past few years, which have created the capability to employ multi-scale localised spatial features. However, Convolutional Neural Networks (CNNs) are limited to regular Euclidean data such as images. Recent GNN developments have overcome this limitation by generalising CNNs to operate on graph-structured data, facilitating the exploration of non-Euclidean domains of the data [75]. This was formalised as Message Passing Neural Networks (MPNNs) in Ref. [51] for supervised learning applications. We briefly describe the general paradigm of the MPNN, which we will specialise later to the edge convolution (EdgeConv) network used in this paper.
MPNNs have two main components: a message-passing phase and a graph readout layer. The message passing is defined as a mathematical operation between two nodes $i$ and $j$ with node features $\vec x^{(l)}_i$ and $\vec x^{(l)}_j$. We define $\vec x^{(l)}_{ij}$ as the edge connecting the nodes $i$ and $j$ at the $l$th time-step, where the vector sign represents the directed graph. A graph can be undirected or directed; we have used bi-directed graphs for this study. During the message-passing phase a message $\vec m^{(l)}_{ij}$ is calculated between the two nodes by the operation $\vec m^{(l)}_{ij} = M^{(l)}(\vec x^{(l)}_i, \vec x^{(l)}_j, \vec x^{(l)}_{ij})$. The message function $M^{(l)}$ can be a linear activation function or a multilayer perceptron (MLP), which is shared between the edges and is analogous to the convolution operation (here we use a linear activation function for the message function). Once the messages between all connected nodes have been calculated in a layer, each node feature is updated using an aggregation function, $\vec x^{(l+1)}_i = A(\{\vec m^{(l)}_{ij} \mid j \in N(i)\})$, where $N(i)$ are the nodes which are connected to the $i$th node and $A$ is a permutation-invariant function (for instance 'max', 'sum', or 'mean'). The vector $\vec x^{(l+1)}_i$ is the input to the next message-passing layer. For graph classification, after $L$ message-passing operations we perform a permutation-invariant graph readout operation $\square$ on the final node features, $\vec x_G = \square(\{\vec x^{(L)}_i \mid i \in G\})$, where $G$ denotes the input graph.
This gives us a fixed-length representation of (possibly) variable-length graphs, which feeds into a downstream neural network. We use an EdgeConv network in this study, which is ideally suited for exploiting the edge features from given node features. The edge convolution operation is defined with the message-passing function $\vec m^{(l)}_{ij} = \Theta(\vec x^{(l)}_j - \vec x^{(l)}_i) + \Phi(\vec x^{(l)}_i)$, where aggregation for each node is done using 'mean', after which the features of each node are updated. The linear layers $\Theta$ and $\Phi$ take the inputs and map them to spaces of identical dimension. We use $L = 2$ and mean graph readout.
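To make the message-passing, aggregation and readout steps concrete, here is a small dependency-free sketch of one EdgeConv layer with 'mean' aggregation followed by a mean readout. It assumes the message form $\Theta(\vec x_j - \vec x_i) + \Phi(\vec x_i)$ commonly used for EdgeConv layers (the sign convention is an assumption here), omits biases and non-linearities, and sets nodes without neighbours to zero; the function names and the toy graph are hypothetical.

```python
def linear(weights, vec):
    """Apply a bias-free linear map given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def edge_conv(features, neighbours, theta, phi):
    """One EdgeConv layer with mean aggregation.

    features   -- dict: node -> feature vector
    neighbours -- dict: node -> list of connected nodes
    theta, phi -- weight matrices mapping to the same output dimension
    """
    updated = {}
    for i, x_i in features.items():
        messages = []
        for j in neighbours.get(i, []):
            diff = [b - a for a, b in zip(x_i, features[j])]  # x_j - x_i
            messages.append([t + p for t, p in
                             zip(linear(theta, diff), linear(phi, x_i))])
        if messages:
            updated[i] = [sum(col) / len(messages) for col in zip(*messages)]
        else:
            updated[i] = [0.0] * len(theta)  # isolated node: choice made here
    return updated


def mean_readout(features):
    """Permutation-invariant graph readout: mean over all node features."""
    values = list(features.values())
    return [sum(col) / len(values) for col in zip(*values)]


# Toy bi-directed graph with two nodes and identity weight matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
nodes = {"a": [1.0, 0.0], "b": [3.0, 2.0]}
links = {"a": ["b"], "b": ["a"]}
layer_out = edge_conv(nodes, links, identity, identity)
graph_vector = mean_readout(layer_out)
```

With identity weights each message reduces to $\vec x_j$, so the two toy nodes simply exchange features; in a trained network $\Theta$ and $\Phi$ are learned and the layer is applied $L$ times before the readout.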

Network Architecture and Training
We use the Deep Graph Library [76] and PyTorch [77] to construct the graphs and the networks that classify the different EFT signal contributions and the SM 'background'. Models with different architectures are trained on data samples that consist of 70000 events for each class, with 80%, 10% and 10% used for training, validation and testing respectively. The network models considered incorporate EdgeConv layers followed by hidden linear layers, with ReLU used as the activation function for each layer. Probabilities for each class can be obtained from the output layer by applying the softmax function. We choose the categorical cross-entropy loss function for the multi-class classification problem and use the Adam optimiser with a learning rate of 0.001 to minimise the loss function. The learning rate decays by a factor of 0.1 if the loss function has not decreased for three consecutive epochs. We train the models for 100 epochs in mini-batches of 100 graphs, with an early stopping condition when no loss decrease has occurred for ten epochs. By varying the number of layers and nodes, and training the different models on the data, we find that the configuration of two EdgeConv layers of 60 nodes and one hidden linear layer of 40 nodes performs particularly well for our scenario. Any event used during training or validation is not considered further in any other part of this work. The loss and accuracy curves of the classifier have been checked to avoid overtraining. It is worth highlighting that we observe signs of overtraining when we consider deeper networks. The good performance of a relatively shallow network signifies that non-resonant physics is characterisable by relatively few phenomenological properties, which is consistent with the findings of traditional differential EFT fits (see in particular Ref. [19]). This observation will form the baseline of the qualitative discussion of a two-operator example in the next section.
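The learning-rate decay and early-stopping rules quoted above can be written down as a short, framework-independent sketch. It only tracks the schedule for a given list of monitored loss values; the function name is hypothetical, which loss is monitored is left open, and the exact counting conventions of library schedulers may differ from this plain reading of the text.

```python
def train_schedule(losses, lr=1e-3, factor=0.1, lr_patience=3,
                   stop_patience=10):
    """Return [(epoch, learning_rate)] for a sequence of monitored losses.

    The learning rate is multiplied by `factor` after `lr_patience`
    consecutive epochs without a new best loss, and training stops after
    `stop_patience` consecutive epochs without a new best loss.
    """
    best = float("inf")
    since_best = 0    # epochs since the best loss, for early stopping
    since_decay = 0   # epochs without improvement since the last decay
    history = []
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, since_best, since_decay = loss, 0, 0
        else:
            since_best += 1
            since_decay += 1
        if since_decay >= lr_patience:
            lr *= factor
            since_decay = 0
        history.append((epoch, lr))
        if since_best >= stop_patience:
            break  # early stopping
    return history


# A loss that improves once and then plateaus triggers one decay.
plateau = train_schedule([1.0, 0.9, 0.9, 0.9, 0.9])

# A flat loss stops training well before the nominal number of epochs.
flat = train_schedule([1.0] * 100)
```

In the flat example the learning rate is reduced three times before the stopping condition is met, which illustrates how the two patience settings interact.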

A minimal example
For illustration purposes, we first limit our study to a three-class classification problem. The network output in this example returns the probability of an event belonging to each of the three classes. An event is then assigned to the EFT/SM class with the greatest corresponding probability. Generalising this to a higher and critical number of WCs will be the focus of section 4.2.
The restriction employed in this section is motivated from the generic modifications that can be expected from EFT interactions. Momentum-dependent interactions will typically enhance the tails of momentum-dependent distributions compared to the SM, while interactions that modify SM couplings (feeding into, e.g., a modified top quark width) will predominantly lead to a modified inclusive rate with momentum-related distributions similar to the SM. We reflect this in our choice of operators for this section, eq. (4.1). The distributions of the hardest b jet for these operators are given in fig. 2. Correlated with the event's hardness are more central final states and characteristically modified angular and rapidity separations. Identifying the most appropriate superposition of physical observables is therefore critical for a particularly sensitive EFT analysis. We consider the two operators of eq. (4.1) as they exhibit a particularly distinguishable phenomenology, but they will also allow us to discuss the limitations of using different approaches to an ML-informed limit setting. In fig. 3 (left), the probabilities calculated for each event to be a result of each SMEFT insertion are shown. As can be expected, for SM events the probabilities of belonging to either operator class are both low (and the probability of belonging to the SM is high due to the normalisation of probabilities). The network is able to discriminate efficiently among the three classes and different regions can be efficiently removed by cuts on the two output probabilities. On the right in fig. 3 the Receiver Operating Characteristic (ROC) curves are shown. We calculate these in a one-vs-rest scheme by first binarising the labels and using the network score output for each WC. We also show an EFT vs SM ROC curve where all EFT labels are marked as signal and the SM as background. We construct this ROC curve using the summed scores for each new physics WC, which we later generalise when more than two contributions are switched on.
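The one-vs-rest and EFT-vs-SM ROC constructions described above can be sketched without external libraries as follows. The class ordering (SM first, then the two operator classes), the function names and the toy probabilities are assumptions made for illustration; tied scores are not treated specially in this simple version.

```python
def roc_curve(scores, is_signal):
    """Return (fpr, tpr) lists obtained by lowering the score threshold."""
    pairs = sorted(zip(scores, is_signal), key=lambda p: -p[0])
    n_sig = sum(1 for s in is_signal if s)
    n_bkg = len(is_signal) - n_sig
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for _, signal in pairs:
        if signal:
            tp += 1
        else:
            fp += 1
        tpr.append(tp / n_sig)
        fpr.append(fp / n_bkg)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((fpr[k + 1] - fpr[k]) * (tpr[k + 1] + tpr[k]) / 2.0
               for k in range(len(fpr) - 1))


# Toy network outputs: rows are (P_SM, P_op1, P_op2); labels 0 = SM.
probs = [
    (0.80, 0.10, 0.10),
    (0.60, 0.30, 0.10),
    (0.20, 0.70, 0.10),
    (0.50, 0.25, 0.25),
    (0.10, 0.10, 0.80),
    (0.30, 0.20, 0.50),
]
labels = [0, 0, 1, 1, 2, 2]

# One-vs-rest for operator class 1: binarise labels, use that class score.
fpr_1, tpr_1 = roc_curve([p[1] for p in probs], [l == 1 for l in labels])
auc_op1 = auc(fpr_1, tpr_1)

# EFT vs SM: every EFT label is signal, the score is the summed EFT output.
fpr_b, tpr_b = roc_curve([p[1] + p[2] for p in probs],
                         [l != 0 for l in labels])
auc_bsm = auc(fpr_b, tpr_b)
```

The same summed-score construction extends directly to more operator classes by summing over all non-SM outputs.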
To examine the improvement of the network performance for this simplified test case of two WCs modifying SM production, we performed a $\chi^2$ fit for each operator to yield bounds on the WCs. To construct the $\chi^2$ (for details see Ref. [28]), we use the distribution $p_T(b_1)$, the transverse momentum of the leading b-jet. To gain as much statistical control as possible, we also extrapolate the results to an integrated luminosity of $3~\text{ab}^{-1}$, in line with the expected performance of the High-Luminosity (HL) LHC. The qualitative pattern of results, however, is independent of the luminosity chosen. Performing this analysis on the full datasets gives the contours shown in black in fig. 5, establishing a baseline against which we can evaluate the improvement in the constraints from applying the GNN results.
To demonstrate the power of the GNN approach, we cut on the datasets based on the probability assigned by the network of belonging to a given class; only events with a probability greater than an optimised value of belonging to one operator class are used in the $\chi^2$ fit. The correlation of fig. 3 (left) allows us to select a threshold probability to cut on, which has the effect of substantially reducing the SM background and the contamination from the other operator, resulting in a relatively stronger signal effect and thus a tighter constraint on the WC for the operator for which the cut is performed. This is shown in the blue and red contours in fig. 5, where the values of the cuts have been tuned to give maximal performance for each operator respectively, whilst avoiding completely depleting bins in the SM $p_T(b_1)$ distribution, as doing so would lead to unrealistic bounds on the WCs when statistical control is lost.
Due to this optimisation the bounds on individual coefficients improve, yet the other coefficient is essentially free, with expectedly far worse performance than in the original case with the full dataset. To resolve this and improve the combined bounds, we consider the probability P(BSM), which is simply the sum of the network-assigned probabilities of each operator, i.e. the sum of the two operator-class probabilities for the two-operator classification considered here. This does indeed result in a combined bound that is superior to the original analysis. An alternative approach to formulating constraints is to directly employ the output of the GNN, i.e. using 2D histograms of the probabilities from the network (see, for example, the individual histograms from each contribution in fig. 4), in place of the $p_T(b_1)$ distributions of fig. 2. A $d$-dimensional classification can be converted into a $(d-1)$-dimensional probability histogram. This can act as a template for limit setting using the information that has condensed down the phenomenologically available information into the operator classification. Considering again the two operators of eq. (4.1), the fit uses these probability histograms in place of the $p_T(b_1)$ distributions, allowing the information of all three histograms to contribute. The resulting contour from this method is shown on the right plot of fig. 5. This method also improves the bounds on the WCs compared to the original $p_T(b_1)$ distribution analysis with no cuts on the probabilities required.
Figure 5. WC constraint contours at the 95% C.L. from $\chi^2$ fitting; in black from the data of the baseline selection of section 2 which also passes the network requirements. The left plot shows the contours from cuts on the NN scores at the optimal value of these score cuts, with the analysis performed using $p_T(b_1)$ distributions. The right plot shows the BSM score cut as in the left plot, along with the contour from the 2D score histogram of fig. 4 (with no score cuts) analysis, as well as an analysis using the 1D BSM score histogram. For details see text.
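A selection on the summed operator probability, combined with the requirement that no bin of the SM $p_T(b_1)$ histogram is emptied, can be sketched as below. The event format, the function names and all numbers are hypothetical; picking the tightest threshold that keeps every SM bin populated is used here only as a simple stand-in for the tuning described in the text.

```python
def select_by_bsm_score(events, threshold):
    """Keep pT values of events whose summed operator probability passes.

    events -- list of (pt_b1, probs) with probs = (P_SM, P_op1, P_op2, ...)
    """
    return [pt for pt, probs in events if sum(probs[1:]) > threshold]


def histogram(values, edges):
    """Count values in half-open bins [edges[k], edges[k+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(counts)):
            if edges[k] <= v < edges[k + 1]:
                counts[k] += 1
                break
    return counts


def tightest_cut(sm_events, edges, thresholds, min_count=1):
    """Largest threshold for which every SM bin keeps >= min_count events."""
    allowed = [
        t for t in thresholds
        if min(histogram(select_by_bsm_score(sm_events, t), edges)) >= min_count
    ]
    return max(allowed) if allowed else None


# Toy SM sample: (pT of leading b-jet in GeV, network probabilities).
sm_events = [
    (30.0, (0.9, 0.05, 0.05)),
    (35.0, (0.4, 0.3, 0.3)),
    (120.0, (0.3, 0.4, 0.3)),
    (150.0, (0.8, 0.1, 0.1)),
]
bin_edges = [0.0, 100.0, 200.0]
best_cut = tightest_cut(sm_events, bin_edges, [0.0, 0.5, 0.65])
kept = histogram(select_by_bsm_score(sm_events, best_cut), bin_edges)
```

A realistic optimisation would compare the resulting bounds for each threshold rather than only checking bin occupancy.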
This approach is feasible when we consider only a small (sub)set of the relevant interactions. Turning to the full $(d-1)$-dimensional histogram very quickly increases the statistical uncertainty. As can be seen from the qualitative similarity of the two approaches, a minimisation of the SM probability appears to be adequate for multi-dimensional EFT analyses, particularly at luminosities below $3~\text{ab}^{-1}$.
It should be noted that the one-dimensional P(BSM) histogram could be used to construct a $\chi^2$ as well, in order to obtain the contours on the WCs. However, the sensitivity is limited compared to the other approaches along certain directions, as shown in fig. 5, due to the loss of information in the projection of the two-dimensional output to a one-dimensional score. We therefore have not explored this approach further.

Fit constraints with GNN selections
Extending the qualitative discussion of the previous section to the thirteen-dimensional SMEFT parameter space, we show the Receiver Operating Characteristic (ROC) curves of the full classification in fig. 6. The ROC curves are calculated with the generalised procedure discussed above. Again we see that the network (see footnote 5) is capable of distinguishing operators adequately.
Footnote 5: By optimising the hyperparameters for this scenario we conclude that the architecture used for the two-operator case continues to perform particularly well. Deeper networks do not significantly improve the performance and often suffer from longer training times and overtraining.
Table 3. Maximum improvements in 2σ bounds via a cut on the ML score.
Starting from the baseline sensitivity as quoted in table 2 (see also section 2), we first show how contributing operators are impacted by imposing ML score cuts in fig. 6. Sizeable improvements can be obtained when the momentum enhancement is present (e.g. in the case of $\bar C^{33}_{uG}$). Similarly, the graph network performs well in discriminating the non-resonant top decay contributions, e.g. in the case of $\bar C^{33}_{uW}$. Improvements ranging between 5% and 60% are achievable in such instances (see table 3), depending on the operators under consideration; however, this always requires stringent cuts on the ML score to achieve a generic BSM-sensitive selection (before losing statistical control for score cuts approaching unity). Representative operator improvements as a function of the ML score are given in fig. 7. Operators showing a relatively small improvement are already under relatively good control via the inclusive rate and the baseline selection, which establishes good sensitivity to such non-SM interactions. In particular this holds for the $\bar C_G$ direction (which can be constrained in more adapted ways by exploiting multi-jet production [78,79]).
Since individual constraints focus on one operator while fixing the rest of the WCs to zero, it is common practice to profile over the rest of the WCs by determining their values such that the $\chi^2$ function is minimised. In the scenario where the analysis is particularly sensitive to the presence of any additional operator, a significant decrease in sensitivity will arise. We calculate the improvement in the case of profiled WCs which, as shown in fig. 7, remains similar to the individual WCs case. This is expected as the network selection removes background contributions but keeps new-physics effects. However, we note that the improvement on profiled bounds can be greater than on individual ones, as in fig. 7. This occurs when the cut on the EFT score selects a region where the impact on the bounds of a particular operator by the presence of additional ones is reduced, even though the robustness of one class against variations of others is not taken into account in our work.
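The difference between individual and profiled bounds can be illustrated for a linearised fit with two coefficients, where the $\chi^2$ is quadratic and the profiling can be done analytically. The sketch below assumes SM-like reference data, uncorrelated bin uncertainties and a one-parameter threshold of $\Delta\chi^2 = 3.84$ for a 95% C.L. interval; the function names and the toy templates are hypothetical and do not correspond to the operators of this study.

```python
import math


def fisher(slopes_a, slopes_b, sigmas):
    """Entry sum_k a_k b_k / sigma_k^2 of the quadratic chi^2 form."""
    return sum(a * b / s**2 for a, b, s in zip(slopes_a, slopes_b, sigmas))


def bounds(slopes_1, slopes_2, sigmas, delta_chi2=3.84):
    """Symmetric bounds on C_1: (individual, profiled over C_2).

    With data equal to the SM and predictions linear in the coefficients,
    chi^2 = F11 C1^2 + 2 F12 C1 C2 + F22 C2^2. Setting C2 = 0 gives the
    individual bound; minimising over C2 gives the profiled one.
    """
    f11 = fisher(slopes_1, slopes_1, sigmas)
    f22 = fisher(slopes_2, slopes_2, sigmas)
    f12 = fisher(slopes_1, slopes_2, sigmas)
    individual = math.sqrt(delta_chi2 / f11)
    profiled = math.sqrt(delta_chi2 / (f11 - f12**2 / f22))
    return individual, profiled


# Toy interference templates in two bins with unit uncertainties.
individual, profiled = bounds([2.0, 1.0], [1.0, 1.0], [1.0, 1.0])

# Orthogonal templates: profiling costs nothing.
ind_orth, prof_orth = bounds([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
```

The more the two templates overlap, the weaker the profiled bound becomes relative to the individual one; a selection that reduces this overlap can therefore help profiled bounds more than individual ones.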

Summary and Outlook
The absence of direct evidence for new physics beyond the Standard Model at the LHC is as surprising as it is challenging for particle physics. Turning to effective field theory methods with the aim of fingerprinting new physics through the observation of modifications of expected SM correlations in the plethora of LHC data is a well-motivated approach to experimentally challenge, and perhaps, overcome the current status quo. The multitude of ad hoc new physics interactions in the SMEFT approach demands tailored approaches to achieve the most sensitive limit setting. In this sense, limiting analyses to a handful of, albeit motivated, differential distributions is not beneficial for enhancing the sensitivity. Conversely, employing machine learning techniques that fingerprint and exploit correlations in data provides a highly adaptive avenue to enhance the overall sensitivity that can be achieved at the LHC but also other (future) collider experiments.
In this work, we have focused on employing GNNs for EFT limit setting. GNNs are particularly motivated approaches for this purpose as they allow us to directly reflect the graph structure which is imposed by EFT interactions in the classification and eventual limit setting. We base our analysis on the semileptonic $t\bar t$ final states, as this is a motivated phenomenological arena for the presence of new interactions, but also because we face a critically large Wilson coefficient parameter space for multi-label classification. We find that large improvements of the sensitivity become achievable when correlations are not yet fully exploited in the inclusive base selection. This demonstrates that machine learning of multi-labelled collider data provides an excellent avenue towards improving the sensitivity of EFT-related measurements at colliders. We find that this improvement translates from individual to profiled bounds; our results also indicate a strategic approach to improve profiled constraints by tensioning operators against each other, which is not directly accessible by minimising the SM probability, but highlights the relative operator probabilities as another avenue for future investigations. Along these lines, we also note that optimisations of the ML score can be achieved via different weightings of the individual class probabilities. This way more model-specific (i.e. matched) interpretations of EFT constraints can be included in the machine learning stage, which should lead, in principle, to further sensitivity enhancements.
We note that the results of our exploratory study presented here are based on a Monte Carlo analysis; the comparison of actual data with Monte Carlo predictions is affected by a range of theoretical and experimental uncertainties. While our results do not include such uncertainties, in principle it is possible to treat them via Generative Adversarial Neural Networks, e.g. [80,81]. Such an approach would discriminate between the different (labelled) hypotheses when the data is well-described by individual classes or superpositions of classes, effectively removing modelled uncertainty parameters from the classifier score. In general, this will lead to a decreased sensitivity compared to the idealised situation of the proof-of-principle analysis presented in this work. There are examples of such approaches to treat theoretical [48] and experimental [82] uncertainties. We leave modifications of the architecture presented in this paper along these lines for future work.