Searches for new physics with boosted top quarks in the MadAnalysis 5 and Rivet frameworks

High-momentum top quarks are a natural physical system in collider experiments for testing models of new physics, and jet substructure methods are key both to exploiting their largest decay mode and to assuaging resolution difficulties as the boosted system becomes increasingly collimated in the detector. For these methods to be used in new-physics interpretation studies, it is crucial that they are implemented in analysis frameworks that allow for the reinterpretation of LHC results, such as MadAnalysis 5 and Rivet. We describe the implementation of the HEPTopTagger algorithm in these two frameworks, and we exemplify the usage of the resulting functionalities to explore the sensitivity of boosted top reconstruction performance to new physics contributions from the Standard Model Effective Field Theory. The results of this study lead to important conclusions about the implicit assumption of Standard-Model-like top quark decays in associated collider analyses, and for the prospects to constrain the Standard Model Effective Field Theory via kinematic observables built from boosted semi-leptonic $t\bar{t}$ events selected using HEPTopTagger.


Introduction
Since the resurrection of jet-substructure methods as probes for new particles at the LHC [1,2], boosted topologies in which multiple decay products from heavy intermediate states fall into a single large-radius (large-R) jet have seen wide application in searches for new physics [3][4][5][6][7][8]. While not initially considered in the early days of the LHC, these jet substructure techniques are now widely used to extend the sensitivity of searches for new physics. This is particularly the case as the currently null results of those searches indicate that any relevant physics beyond the Standard Model (BSM) is most likely located at a large mass scale, featuring heavy particles whose production and decay would naturally yield highly-boosted lighter Standard Model (SM) objects.
Author e-mail addresses: jack.araz@durham.ac.uk, andy.buckley@glasgow.ac.uk, fuks@lpthe.jussieu.fr
Many collider signatures can benefit from the usage of jet substructure methods, as they can be generally applied to tag many SM and BSM particles when they are produced with a high Lorentz boost. Among these, the top quark is an important target, for two reasons. First, the top quark is the highest-mass fermion in the SM, featuring a Yukawa coupling value close to 1. This makes it a natural candidate to provide an explanation for the hierarchy problem, and to play the role of a mediator that couples to new-physics sectors (e.g. through the Higgs field). Second, boosted methods can provide better background-rejection power than a classic 'resolved' reconstruction of the top-quark kinematics. As a high-mass, colour-charged and non-hadronising particle, the top quark is the most complex SM resonance to reconstruct from fully resolved decay components. This not only requires highly performant b-tagging, but also suffers from either a complicated lepton and missing-momentum reconstruction or the resolution difficulties inherent to reconstructing a fully hadronic $bq\bar{q}'$ final state.
Jet-substructure methods offer a way to bypass many of the difficulties related to the reconstruction and identification of hadronically-decaying top quarks by relying on one single large-radius jet in place of three small-radius ones. In addition, such an option generally exploits the presence of two heavy SM particles' decay hierarchies within the large-R jet (the top quark itself and the W boson originating from its decay), together with information on the internal momentum and angular structure of all jet constituents (with or without b-tagging requirements) to disambiguate boosted top quarks from jets originating from pure QCD background processes. A prominent tool in such studies is the HEPTopTagger method [9,10], which pioneered this approach and has since gone through several rounds of enhancement such as the use of variable-radius jet clustering. In the meantime, more sophisticated and efficient top tagging methods have been developed. Typical examples are based on a classification of jets making use of the radiation pattern within a jet (also known as shower deconstruction) [11], on advanced machine learning techniques (we refer to Ref. [12] for an overview) relying on observables like the jet transverse momentum and mass, the dispersion of its constituents estimated through N-subjettiness variables [13,14], splitting scales [15], energy correlation functions [16,17], as well as on jet image analysis by means of neural networks [18][19][20] and image or language recognition techniques [21][22][23][24]. More recently, a series of machine-learning methods embedding Lorentz invariance [25,26] have additionally been proposed and explored. The HEPTopTagger method, however, remains an important benchmark in the top-tagging landscape, especially in the context of its use by the LHC experiments [27,28]. On the other hand, the related code has historically been unavailable for use in analysis prototyping and preservation within the two public analysis frameworks MadAnalysis 5 [29][30][31] and Rivet [32,33], which are widely used across the high-energy physics community. The goal of the present work is to fill this gap, and to document, through a few examples, its addition to both frameworks. It therefore also serves as a prototype interface for the integration of C++ versions of machine-learning taggers into these public analysis toolkits.
While most applications of boosted top-quark reconstruction have been aimed at direct searches for new physics, the lack of tangible evidence for new high-mass resonances motivates complementary studies of indirect routes through which BSM physics can manifest. A leading approach in this is that of the Standard Model Effective Field Theory (SMEFT), in which the explicit microscopic physics of a particular BSM model is replaced by an infinite set of higher-dimensional operators involving the SM fields and compatible with the SM symmetries [34][35][36]. The SMEFT is then an expansion in an energy scale Λ above which the effective theory breaks down and real new physics resides, so that new fields with masses comparable with Λ must be explicitly added to the model's Lagrangian. Details about the UV theory are encoded in the Wilson coefficients multiplying each operator, and the relevance of a specific new interaction is dictated (a priori) by the dimension of the corresponding operator (that is thus suppressed by some power of the effective scale). At dimension six, 84 (3045) parameters encode the leading BSM effects, assuming a flavour-blind (flavour-general) setup [37,38]. Constraints are typically derived in the model-independent space of the corresponding Wilson coefficients by investigating the possibility of small and often subtle deviations from the SM expectation. Among all operators, about twenty impact top physics under the simplifying assumption that new physics couples dominantly to bosons and to the left-handed doublet and right-handed up-type singlet of third-generation quarks [39].
Global SMEFT interpretations of measurements at the LHC in the top sector have recently been achieved by several groups [40][41][42][43][44][45]. These studies demonstrated in particular that dozens of SMEFT operators could be constrained (and therefore determined) simultaneously, sometimes correlating information originating from different sectors. It is nevertheless well known that signatures of processes involving boosted top quarks could be crucially relevant [46]. These indeed involve large momentum transfers, so that they are expected to exhibit the largest sensitivity to new-physics effects in the SMEFT. It is therefore natural to focus on high-momentum collider event categories involving the production of boosted top quarks, and to consider them as a promising avenue to statistically constrain the viable space of Wilson coefficients associated with top quark operators.
In the present study, we make use of the HEPTopTagger functionalities that we implemented in the Rivet and MadAnalysis 5 frameworks (together with the possibility of computing emulated reconstruction-level observables) to study the sensitivity of the LHC to top-related SMEFT operators, focusing on the production of a pair of boosted top quarks. However, the HEPTopTagger algorithm is designed to exploit as much as possible the kinematics of the SM decay of a boosted top quark. This raises the open question of how new physics effects arising from the introduction of non-zero top-quark SMEFT operators could modify these kinematics, and hence impact the performance of HEPTopTagger and, by inference, of any similar reconstruction method based on the topology arising from an SM top quark decay. As a straightforward application and keeping this in mind, we highlight important resulting issues for BSM interpretations.
In Section 2, we detail our technical developments in Rivet (Section 2.2) and MadAnalysis 5 (Section 2.3), and briefly explain how to use the codes for physics studies. In Section 3, we exemplify the usage of these developments to estimate the impact of new physics via effective SMEFT operators on HEPTopTagger performance, and how this affects the sensitivity of the present and future runs of the LHC (assuming an integrated luminosity of 300 fb$^{-1}$ and 3000 fb$^{-1}$ and varied levels of systematic errors) to these operators. We summarise our work in Section 4.
2 HEPTopTagger implementation in the Rivet and MadAnalysis 5 frameworks

Generalities
In its initial proposal [9], the HEPTopTagger algorithm is a purely deterministic top-tagging method in which boosted top reconstruction is solely achieved from the geometrical structure and properties of the constituents of a fat jet. It first defines a fat jet collection from an event final state by using the Cambridge-Aachen jet algorithm [47][48][49]. A procedure is next applied to all jets included in this collection, in order to decide whether they should be top-tagged.
In practice, each reconstructed fat jet is decomposed into several subjets by applying a mass drop criterion [50]. More precisely, jet clustering is iteratively undone so that each fat jet is split in two subjets, and the subjet with the smallest invariant mass is kept only if its invariant mass is large enough. Each resulting subjet is further decomposed in the same manner provided that its invariant mass is larger than some threshold. All possible triplets of jets belonging to the subjet collection obtained in this way are then filtered, and the five hardest filtered subjets are selected for boosted top quark reconstruction. These five subjets are reclustered into three subjets, which are thus assumed to originate from a top quark decay. Events are at this stage rejected if they do not include any resulting triplet with an invariant mass that is compatible with the top mass. Top tagging stems from several requirements that are imposed on the invariant masses of the different dijet pairs that could be formed from the three subjets of any boosted top quark candidate, in particular in order to ensure the compatibility with the presence of an intermediate W boson. Moreover, the transverse momentum of the top candidate is required to be at least 200 GeV.
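The final mass requirements can be illustrated with a deliberately simplified sketch. It is not the HEPTopTagger selection itself (Refs. [9,10] give the actual criteria): the `P4` and `TripletCuts` types, the function name, the 150-200 GeV window on the triplet mass, the mass values and the 15% tolerance on the W-to-top mass ratio are all assumptions chosen for illustration.

```cpp
#include <array>
#include <cmath>

// Hypothetical four-momentum type (GeV); not a Rivet or MadAnalysis 5 class.
struct P4 { double px, py, pz, e; };

P4 operator+(const P4& a, const P4& b) {
  return {a.px + b.px, a.py + b.py, a.pz + b.pz, a.e + b.e};
}

// Invariant mass, clamped to zero for slightly negative m^2 from rounding.
double mass(const P4& p) {
  const double m2 = p.e * p.e - p.px * p.px - p.py * p.py - p.pz * p.pz;
  return m2 > 0.0 ? std::sqrt(m2) : 0.0;
}

// Illustrative cut values (assumptions, not the tagger defaults).
struct TripletCuts {
  double mTopMin = 150.0, mTopMax = 200.0;  // window on the triplet mass
  double mW = 80.4, mTop = 172.3;           // reference masses
  double fW = 0.15;                         // relative width of the ratio window
};

// Simplified check: the triplet mass is top-like AND at least one subjet pair
// has m_ij / m_123 close to mW / mTop. The real tagger uses more refined
// conditions on the three pair masses.
bool isTopLikeTriplet(const std::array<P4, 3>& j,
                      const TripletCuts& c = TripletCuts{}) {
  const double m123 = mass(j[0] + j[1] + j[2]);
  if (m123 < c.mTopMin || m123 > c.mTopMax) return false;
  const double target = c.mW / c.mTop;
  const double lo = (1.0 - c.fW) * target;
  const double hi = (1.0 + c.fW) * target;
  const double ratios[3] = {mass(j[0] + j[1]) / m123,
                            mass(j[0] + j[2]) / m123,
                            mass(j[1] + j[2]) / m123};
  for (double r : ratios) {
    if (r > lo && r < hi) return true;
  }
  return false;
}
```

The cut values are kept in a separate structure so that they can be varied without touching the logic, which mirrors how the tagger parameters are exposed to the user in both frameworks.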
We refer to the original documentation [9,10] for a more comprehensive and quantitative presentation of the HEPTopTagger algorithm.
The performance of any top-quark tagger can be improved by using an increased set of input variables (as in most multi-variate methods), for which the explicit choices are made through a tuning process relative to a given reference. To this end, the HEPTopTagger method has been updated and now includes a variety of features that enhance the tagging efficiency and reduce the associated mistagging rates: it uses substructure mass-drop conditions [50], jet trimming [51] and pruning [4,52] algorithms, and filtering steps [2], in addition to the core requirement that the large-radius jet demonstrates the three-pronged structure characteristic of a boosted top-quark's hadronic decay. In the current version of the HEPTopTagger package, all these methods are used together in a multi-variate classification [53,54] which maximises the expected tagging performance.
Access to this tool and all its embedded features within public frameworks like MadAnalysis 5 or Rivet is thus crucial for prototyping and reproducing collider-event data analyses, a key activity in collider phenomenology. In the rest of this section, we discuss technical details about the embedding of HEPTopTagger in these two software tools, and describe how they can be used in practice. We rely on the latest public version of HEPTopTagger (i.e. its version 2 available from the webpage https://www.thphys.uni-heidelberg.de/~plehn/index.php?show=heptoptagger). Moreover, we have validated our implementations by comparing the results of a few test calculations obtained by using the two interfaced versions of HEPTopTagger to those returned by HEPTopTagger when used in a standalone mode.

Jet substructure tools in Rivet
The implementation of HEPTopTagger within Rivet has been designed on top of its existing jet-analysis toolkit, using the 'smearing projection' machinery that simulates kinematic and particle-identification misreconstruction through transfer functions, while preserving links between particle-level and reconstruction-level physics objects. When jet substructure methods are involved, dedicated smearing methods are required, as many observables (e.g. N-subjettinesses) are sensitive to angular correlations between the jet constituents. It is therefore necessary to model the detector's finite angular resolution to get a realistic detector response, including the inefficiencies related to the hadronic calorimeter. This is achieved, as detailed in Ref. [55], through the directional smearing of the pseudo-rapidity η and azimuthal angle ϕ variables defining the direction of every jet constituent. As angular deflections are more significant for constituents with a low transverse momentum $p_T$, this smearing is made $p_T$-dependent with greater angular stability at higher momentum. The specific form used, known to describe jet-substructure effects well on the public data, is angular smearing by a Gaussian with a mean of zero and a $p_T$-dependent standard deviation given by eq. (1). In that expression, α, β and γ are free parameters, set to 0.045, 0.013 and 31.15, respectively, from the fit detailed in Ref. [55]. Additionally, energy-resolution smearing was performed, using relative scaling by a Gaussian with a mean of 1 and a width $\sigma_E \sim 10\%$.
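The constituent-level smearing described above can be sketched in standalone C++ as follows. This is a minimal illustration under stated assumptions, not the Rivet implementation: the `Constituent` type and the `smearConstituents` function are hypothetical, constituents are treated as massless so that an energy rescaling at fixed direction rescales $p_T$ by the same factor, and the $p_T$-dependent angular width is passed in by the caller rather than hard-coded.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <random>
#include <vector>

// Minimal stand-in for a jet constituent (hypothetical type, not a Rivet class).
struct Constituent {
  double pt;   // transverse momentum in GeV
  double eta;  // pseudo-rapidity
  double phi;  // azimuthal angle in radians
};

// Smear every constituent: eta and phi are shifted by Gaussian noise of mean 0
// and width sigmaAngle(pt); pt is rescaled by a Gaussian of mean 1 and width
// sigmaE. Both widths are supplied by the caller.
std::vector<Constituent> smearConstituents(
    const std::vector<Constituent>& input,
    const std::function<double(double)>& sigmaAngle,
    double sigmaE, std::mt19937& rng) {
  const double twoPi = 2.0 * std::acos(-1.0);
  std::vector<Constituent> out;
  out.reserve(input.size());
  for (const Constituent& c : input) {
    Constituent s = c;
    const double sa = sigmaAngle(c.pt);
    if (sa > 0.0) {  // std::normal_distribution requires a positive width
      std::normal_distribution<double> angle(0.0, sa);
      s.eta += angle(rng);
      // Shift phi and wrap it back into [-pi, pi].
      s.phi = std::remainder(s.phi + angle(rng), twoPi);
    }
    if (sigmaE > 0.0) {
      std::normal_distribution<double> scale(1.0, sigmaE);
      // Guard against negative momenta from the Gaussian tail.
      s.pt *= std::max(0.0, scale(rng));
    }
    out.push_back(s);
  }
  return out;
}
```

Passing the angular width as a function keeps the sketch independent of the specific parametrisation, which can then be plugged in with the fitted values of α, β and γ.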
Our implementation of the HEPTopTagger method in Rivet relies on an object of the HTT class, normally to be declared as a member variable of an analysis (or projection) class,

    HTT _tagger;

The HTT class is defined in the header file "Rivet/Tools/RivetHTT.hh". All available parameters for this wrapper are initialised through an HTT::InputParameters object, that can be used for any further modification relevant for the needs of the user. A simple example is

    HTT::InputParameters parameters;
    parameters.mass_drop = 0.8;  // mass drop rate
    parameters.filt_N = 5;       // nr of filtered subjets
    _tagger.setParams(parameters);

The list of available parameters can be found in the definition of the C++ structure HTT::InputParameters in the file "Rivet/Tools/RivetHTT.hh". All parameters that are not explicitly initialised by the user keep their default values, which have been chosen according to Refs. [9,10]. During the execution of a Rivet analysis, a reclustered jet, instantiated as a Jet object, can be processed by the methods of the _tagger object, for example in

    _tagger.calc(fatjets[0]);

Here, fatjets[0] refers to the leading (i.e. highest-$p_T$) jet included in a vector of clustered jets called fatjets (the object fatjets being thus of type Jets). The computation gives access to various accessors that return a variety of information as native Rivet objects. This list of accessors is shown in Table 1.
For a practical example, we refer to the illustrative analysis that can be found in Rivet's "analyses/examples/EXAMPLE_HTT.cc" file.

Jet substructure tools in MadAnalysis 5
Since 2021, MadAnalysis 5 and its SFS framework for fast simulation of detector effects [56] have been equipped with jet substructure tools and methods. In particular, the smearing functionality implemented in the SFS framework allows for modifications of the properties of the jets' constituents, so that the SFS is suitable for the embedding of HEPTopTagger in a way similar to what was achieved for Rivet in Section 2.2. As the substructure branch is so far largely undocumented, we take the opportunity of the present work to provide some details on its functioning and on how to make use of the code to embed top tagging in a generic analysis.
When a jet reconstruction algorithm is turned on in MadAnalysis 5, a so-called 'primary' jet collection is built from a hadronised event. This primary jet collection is equivalent to the sole jet collection that used to be built in versions 1.X.Y of the code, which was documented in [31,56]. In practice, the code makes use of its interface with FastJet [57], that can be turned on from the MadAnalysis 5 command line interface by typing

    set main.fastsim.package = fastjet

A specific jet algorithm is then activated through the commands

    set main.fastsim.algorithm = <algorithm>
    set main.fastsim.<property> = <value>

The list of supported algorithms, together with the available properties, is provided in [56]. By default, the anti-$k_T$ jet algorithm [58] is considered, with a radius parameter R = 0.4 (radius) and a minimum $p_T$ value of 5 GeV (ptmin). The primary jet collection is identified by its jet identifier (or JetID), that is fixed to Ma5Jet by default. This identifier can be further modified through the command

    set main.fastsim.JetID = <new JetID>

Additional jet collections can be instantiated through the define jet_algorithm command, in which <JetID> refers to the identifier of the collection, <algo> to the associated clustering algorithm, and where any algorithm-specific parameter can be optionally fixed through comma-separated or space-separated equalities (otherwise default values are used). For instance, typing

    define jet_algorithm CA08 cambridge radius=0.8 ptmin=200

defines a jet collection named CA08, in which jets are reconstructed by means of the Cambridge-Aachen jet algorithm [47][48][49], with a radius parameter set to 0.8 and a minimum $p_T$ value of 200 GeV. Parameters can also be altered through specific commands, like for instance in

    set CA08.radius = 0.8
    set CA08.ptmin = 200

Once multiple jet collections are defined, constituent-based smearing is always applied to the properties of all final-state hadrons before the different reconstructions are performed. This contrasts with the setup in which a single collection is defined, as there users can decide to smear reconstructed objects instead of their constituents. Reconstruction efficiencies can also be provided from the command line interface (see [56]), but they will only be applied to the primary jet collection. This limiting behaviour can however be bypassed by employing the expert mode of the code, in which users implement their analysis directly in C++ (and are thus free to do whatever they want). We therefore focus only on this expert mode from now on.

At the level of the C++ code generated by MadAnalysis 5 (or implemented from scratch by expert users), the primary jet collection can be accessed through the standard accessor event.rec()->jets() (as described in [30,31]), and all jet collections (including the primary one) can be accessed through the accessor event.rec()->jets(<JetID>) (with <JetID> being the identifier referring to the collection). These accessors return a vector of pointers to constant RecJetFormat objects (or RecJet objects for short), the entire vector being also of the shorthand type RecJets.
In the version 2.0.X of MadAnalysis 5, a Substructure namespace has been implemented and includes wrappers to a large set of FastJet and FastJet Contrib functionalities. This substructure module allows for three standard infrared and collinear safe jet-clustering algorithms, that can be initialised as for instance through

    Substructure::Cluster cluster;
    cluster.Initialize(Substructure::antikt, 0.4, 20., /* isExclusive = */ false);

This initialises a Cluster object named cluster in which jet reconstruction relies on the anti-$k_T$ algorithm with parameter R = 0.4, and that selects reconstructed jets featuring $p_T > 20$ GeV. In order to make use of the Cambridge-Aachen or the generalised $k_T$ [57] algorithms, the first argument has to be set to Substructure::cambridge and Substructure::kt respectively. The next arguments are related to the two options available for the three supported algorithms (namely the radius parameter R and the minimum $p_T$ requirement applied on the reconstructed jets), and the last optional argument (isExclusive) indicates whether leptons and photons originating from hadron decays have to be included in their respective collections in addition to being considered as jet constituents (isExclusive = false), or not (isExclusive = true). Next, clustering is executed through the command

    cluster.Execute(<event>, <JetID>);

where <JetID> is the identifier of the jet collection to use to store the output of the clustering, and <event> is an EventFormat object pointing to the whole event. Smearing and reconstruction efficiencies are automatically included, if provided by the user (see Ref. [56]).
Clustered jets can be further manipulated, either one by one or all together. For instance, the first of the following commands defines a new collection FilteredJets as a sub-selection of all reconstructed (primary) jets satisfying $p_T > 20$ GeV and $|\eta| < 2.5$. The next two lines are dedicated to the initialisation of a new clustering method (the Cambridge-Aachen algorithm with a radius parameter R = 0.5, that is the sole parameter that can be specified here), with which those jets will be reclustered,

    RecJets FilteredJets = filter(event.rec()->jets(), 20.0, 2.5);
    Substructure::Recluster recluster;
    recluster.Initialize(Substructure::cambridge, 0.5);

Here, we assume that the primary jets have been clustered through some (unspecified) algorithm. Next, we make use of the Recluster object, a first time on the whole jet collection, and a second time specifically on the leading jet,

    RecJets ReclusteredJets = recluster.Execute(FilteredJets);
    const RecJet ReclusteredLeadingJet = recluster.Execute(FilteredJets[0]);

As another example, we now discuss jet reconstruction in which the radius parameter R is variable [59]. Such a method can be used from the Substructure wrapper through a dedicated object, named variableR in the following. At initialisation, the clustering type must be CALIKE (Cambridge-Aachen), KTLIKE ($k_T$ algorithm) or AKTLIKE (anti-$k_T$ algorithm), the parameters minR and maxR stand for the minimum and maximum radius values allowed, and the internal clustering strategy to be used by FastJet has to be among Best, N2Tiled, N2Plain, NNH or Native. We refer to Ref. [59] for more information. Reclustering then proceeds as above,

    RecJets variableRJets = variableR.Execute(FilteredJets);
    RecJets variableRLeadingJet = variableR.Execute(FilteredJets[0]);

In order to enable the usage of HEPTopTagger within MadAnalysis 5, the package must first be downloaded and linked to the code. This is achieved by typing in the MadAnalysis 5 command line interface

    install HEPTopTagger

once FastJet and FastJet Contrib are installed and available (which is achieved by typing in the interpreter the command install fastjet). When implementing an analysis in C++, the execution of HEPTopTagger is controlled from a dedicated structure called Substructure::HTT::InputParameters. The latter is defined in the file "tools/SampleAnalyzer/Interfaces/HEPTopTagger/HTT.h", together with all associated parameters and methods, and it is documented in the file "tools/SampleAnalyzer/Interfaces/HEPTopTagger/README.md". The example introduced in Section 2.2 can be taken over to initialise the tagger through this structure. As for the embedding into Rivet, the execution of the tagger gives access to a variety of accessors that allow for the exploration of the properties of the would-be top jet. Their list is given in Table 2. For more detailed practical examples on the usage of jet substructure techniques and HEPTopTagger within MadAnalysis 5, we refer to the tutorial available from https://github.com/MadAnalysis/tutorial_osu.

Exploring new physics effects with boosted top quarks in the SMEFT
In this section, we demonstrate the use of HEPTopTagger (version 2) within the Rivet and MadAnalysis 5 frameworks, and we study the potential impact of SMEFT operators on boosted top quark decays. The set of relevant operators that we consider is introduced in Section 3.1. In Section 3.2, we focus on the production of a semi-leptonically decaying $t\bar{t}$ pair to investigate how SMEFT deviations in the properties of boosted top quarks affect the performance of top taggers (through deviations from the taggers' expectations of SM-like top-quark decay properties). Next, we make use of our findings to derive in Section 3.3 the sensitivity of a typical analysis of boosted top-pair production and decay to various SMEFT operators poorly constrained by other means.

Theoretical framework
In the absence of any explicit evidence for new fields and interactions beyond the SM, effective field theories provide a natural path to scrutinising the impact of hypothetical BSM physics at the electroweak scale $\Lambda_{\rm EW}$. In this context, the SMEFT paradigm offers a very promising framework allowing for the exploration of heavy new physics. The SMEFT is an effective field theory expansion in an energy scale Λ that is assumed to satisfy $\Lambda \gg \Lambda_{\rm EW}$. The model Lagrangian is defined via a set $\{O_1, O_2, \ldots\}$ of higher-dimensional (i.e. non-renormalisable) operators in the SM fields. Assuming that the leading new-physics effects arise at dimension six, this Lagrangian reads

$$\mathcal{L} = \mathcal{L}_{\rm SM} + \sum_j \frac{C_j}{\Lambda^2}\, O_j , \qquad (2)$$

where $\mathcal{L}_{\rm SM}$ is the SM Lagrangian, and the Wilson coefficients $C_j$ encode the BSM details of the theory. Among the 3045 free parameters in this general SMEFT Lagrangian of eq. (2) [37,38], only a few are relevant for top-quark physics.
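As a reminder of how such a Lagrangian feeds into observables, any cross section computed with dimension-six operators inserted in the amplitude becomes a polynomial of degree two in the Wilson coefficients. The expression below is the standard expansion, written here under the assumption that only dimension-six operators are retained; the labels $\sigma_j^{\rm (int)}$ and $\sigma_{jk}^{\rm (quad)}$ are notation introduced for this note.

```latex
% sigma_SM          : pure SM prediction
% sigma_j^(int)     : interference of operator O_j with the SM amplitude
% sigma_jk^(quad)   : contribution quadratic in the new-physics amplitudes
\begin{equation*}
  \sigma \;=\; \sigma_{\rm SM}
  \;+\; \sum_j \frac{C_j}{\Lambda^2}\,\sigma_j^{\rm (int)}
  \;+\; \sum_{j \le k} \frac{C_j\, C_k}{\Lambda^4}\,\sigma_{jk}^{\rm (quad)} .
\end{equation*}
```

The linear term is suppressed by $\Lambda^{-2}$ and the quadratic one by $\Lambda^{-4}$, so that the relative size of the two depends on the energy scale probed by the process, which is one reason why boosted final states are of interest.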
We consider a scenario in which CP is conserved, and we next assume that new physics only couples to the weak doublet of left-handed top and bottom quarks (Q) and the right-handed weak singlets (t and b) of third-generation quarks (as well as to SM bosons). Moreover, bosonic operators leading to flavour-universal effects are discarded, we approximate the CKM matrix by the identity matrix, and all Yukawa couplings but those of the top and bottom quarks are neglected. In order to further reduce the number of free parameters, we consider a $U(2)_q \times U(2)_u \times U(2)_d$ flavour symmetry among the quarks of the first and second generations, in agreement with the principle of minimal flavour violation [60][61][62]. Differences between the first and second-generation quarks are thus ignored, and we subsequently introduce the generic notation q for a left-handed weak doublet of first-generation or second-generation quark fields, and u and d for the corresponding right-handed weak singlets of up-type and down-type quark fields.
In our analysis, we aim to leverage the detector-simulation capabilities of the MadAnalysis 5 and Rivet frameworks (including our implementation of HEPTopTagger) to realistically explore the effects of effective operators on the reconstruction performance of boosted top quarks. Among the full set of potentially impactful SMEFT operators [39], only eight are not too strongly constrained by other means [40][41][42][43][44][45], so that an investigation of pair-production and decay of boosted top quarks could offer new handles on them. They are given in eq. (3), in the notation of Ref. [42], where the matrices $T^A$ stand for the generators of $SU(3)_c$ in the fundamental representation, and the matrices $\sigma^I$ are the usual Pauli matrices.

Top tagging performance in the presence of non-vanishing SMEFT operators
In order to assess how non-zero values for the Wilson coefficients associated with the SMEFT operators of eq. (3) affect top-quark tagging performance, we make use of MadGraph5_aMC@NLO version 3.0.3 [63] to generate parton-level events describing top-antitop production and their semi-leptonic decay at the LHC (operating at a centre-of-mass energy of 13 TeV). We rely on leading-order matrix elements convolved with the leading-order set of NNPDF3.0 parton distribution functions [64] provided through the LHAPDF6 library [65]. For efficiency reasons, the Monte Carlo event generation was kinematically biased to high scales, and we required that the invariant mass of the produced $t\bar{t}$ system satisfies $m^{\rm truth}_{t\bar{t}} > 950$ GeV. These fixed-order events are matched with parton showering and hadronisation as modelled by Pythia version 8.2 [66]. Background events are generated with the same toolchain, but by considering the production of a leptonically-decaying W boson in association with a pair of b jets (and two additional jets), $pp \to W b\bar{b}$+jets.
Our canonical analysis was implemented in Rivet version 3 [33]. It employs FastJet version 3.3.3 [57] for event reconstruction, and HEPTopTagger version 2 [10] in its default configuration. We recall that the latter has been tuned on boosted top quarks with properties as expected from their SM production and decay, which may thus not be optimal for scenarios in which SMEFT effects change the properties of the produced tops. In our usage of HEPTopTagger, we turn on the 'optimal R' option. This allows the tagging algorithm to determine the minimum choice for the fat jet reconstruction radius to ensure that the reconstructed top jet includes a three-prong structure (as expected from standard top-quark decays).
Our event reconstruction is achieved by first defining a collection of 'small jets' through the clustering of all visible hadron-level final-state objects with a pseudo-rapidity $|\eta| < 4.5$, muons excepted. We use the anti-$k_T$ jet algorithm [58] with radius parameter R = 0.4, and then impose a minimum transverse-momentum requirement of $p_T > 30$ GeV on the reconstructed small jets. Next, we define a collection of 'fat jets' from the same hadron-level objects. This collection is constructed by using the Cambridge-Aachen algorithm [47][48][49] with a radius parameter R = 1.5. We impose a minimum transverse momentum requirement of $p_T > 200$ GeV on the reconstructed fat jets.
Lepton candidates (i.e. electrons and muons) are required to satisfy basic momentum and pseudo-rapidity criteria, $p_T > 10$ GeV and $|\eta| < 2.5$. At this stage, ΔR-based isolation is enforced in order to remove the overlap between the lepton collection and the two jet collections. We remove from the small-jet collection any small jet j lying in the vicinity of a lepton ℓ, at an angular distance $\Delta R(\ell, j) < 0.1$, and we then discard any lepton lying at a distance $\Delta R(\ell, j) < 0.4$ of any of the remaining small jets. Moreover, we define b jets as small jets with $p_T > 30$ GeV and with a ghost-associated b-hadron with $p_T > 5$ GeV [67,68].
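The two-step overlap removal between leptons and small jets can be sketched in standalone C++ as follows. This is an illustration only: the `Obj` type and the `deltaR` and `removeOverlaps` helpers are hypothetical and are not taken from the Rivet or MadAnalysis 5 interfaces, while the default thresholds are the ones quoted in the text.

```cpp
#include <cmath>
#include <vector>

// Hypothetical minimal object with the kinematics needed for Delta R.
struct Obj { double pt, eta, phi; };

// Angular distance, with the azimuthal difference wrapped into [-pi, pi].
double deltaR(const Obj& a, const Obj& b) {
  const double twoPi = 2.0 * std::acos(-1.0);
  const double dphi = std::remainder(a.phi - b.phi, twoPi);
  const double deta = a.eta - b.eta;
  return std::sqrt(deta * deta + dphi * dphi);
}

// Step 1: drop jets within drJet of any lepton.
// Step 2: drop leptons within drLepton of any surviving jet.
void removeOverlaps(std::vector<Obj>& leptons, std::vector<Obj>& jets,
                    double drJet = 0.1, double drLepton = 0.4) {
  std::vector<Obj> keptJets;
  for (const Obj& j : jets) {
    bool nearLepton = false;
    for (const Obj& l : leptons) {
      if (deltaR(l, j) < drJet) { nearLepton = true; break; }
    }
    if (!nearLepton) keptJets.push_back(j);
  }
  std::vector<Obj> keptLeptons;
  for (const Obj& l : leptons) {
    bool nearJet = false;
    for (const Obj& j : keptJets) {
      if (deltaR(l, j) < drLepton) { nearJet = true; break; }
    }
    if (!nearJet) keptLeptons.push_back(l);
  }
  jets = keptJets;
  leptons = keptLeptons;
}
```

The order of the two steps matters: a jet that essentially coincides with a lepton is removed first, so that it cannot cause the lepton itself to be discarded in the second step.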
After reconstruction, we select events whose topology is compatible with that expected from the production of a pair of boosted top quarks that decays semi-leptonically. We require that each selected event features one lepton with at least 50 GeV of transverse momentum, a minimum missing transverse energy $\slashed{E}_T > 30$ GeV, as well as at least two small b jets and two small light jets. Next, we reconstruct the leptonically-decaying W boson, which we consider to be on-shell. This assumption implies that the invariant mass of the system comprising the lepton and the missing momentum is equal to the mass $m_W$ of the W boson, which allows us to determine the longitudinal component $\slashed{p}_z$ of the missing momentum through eq. (4). In that expression, $\vec{p}_\ell = (p_{\ell,x}, p_{\ell,y}, p_{\ell,z})$ denotes the three-momentum of the lepton, $\vec{\slashed{p}} = (\slashed{p}_x, \slashed{p}_y, \slashed{p}_z)$ is the missing three-momentum, and $E_\ell$ stands for the energy of the lepton. From the solution to eq. (4), we can define the four-momentum of the leptonically-decaying W boson $W^{\rm rec}_L$. In the case where this equation has two solutions, we arbitrarily choose the smallest value for $\slashed{p}_z$. Moreover, when it has no solution, we set the associated discriminant to 0 and use the resulting solution.
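For orientation, the on-shell constraint can be solved explicitly for $\slashed{p}_z$. The short derivation below is a sketch under the assumption that the missing momentum is carried by a single massless neutrino; the shorthands $\mu$, $a$, $E_\nu$, $m_\ell$ and $\slashed{p}_T$ are introduced here for convenience and need not match the notation of eq. (4).

```latex
% Requires \usepackage{amsmath} and \usepackage{slashed}.
% Assumption: massless neutrino, E_\nu = \sqrt{\slashed{p}_T^2 + \slashed{p}_z^2},
% with \slashed{p}_T^2 = \slashed{p}_x^2 + \slashed{p}_y^2.
\begin{align*}
  m_W^2 &= (E_\ell + E_\nu)^2 - \big|\vec{p}_\ell + \vec{\slashed{p}}\big|^2
  \quad\Longrightarrow\quad
  E_\ell\, E_\nu = \mu + p_{\ell,z}\,\slashed{p}_z , \\
  \mu &\equiv \tfrac{1}{2}\big(m_W^2 - m_\ell^2\big)
        + p_{\ell,x}\,\slashed{p}_x + p_{\ell,y}\,\slashed{p}_y ,
  \qquad
  a \equiv E_\ell^2 - p_{\ell,z}^2 , \\
  0 &= a\,\slashed{p}_z^{\,2} - 2\,\mu\, p_{\ell,z}\,\slashed{p}_z
       + E_\ell^2\,\slashed{p}_T^{\,2} - \mu^2 , \\
  \slashed{p}_z^{\,\pm} &=
    \frac{\mu\, p_{\ell,z} \pm E_\ell \sqrt{\mu^2 - a\,\slashed{p}_T^{\,2}}}{a} .
\end{align*}
```

The quantity under the square root plays the role of the discriminant mentioned in the text: it is positive when two real solutions exist, and it turns negative when the measured missing transverse momentum is too large for the lepton and neutrino to combine into an on-shell W boson.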
In order to reconstruct the leptonically-decaying top quark, we match this reconstructed $W$ boson with one of the $b$ jets by minimising the difference between the top mass $m_t$ and the invariant mass $m[W^{\rm rec}_L \oplus b]$ of the system constituted of the reconstructed $W$ boson $W^{\rm rec}_L$ and the $b$ jet. This is achieved through a $\Delta\chi^2$ minimisation, with a mass-resolution parameter $\sigma = 40$ GeV. The $b$-jet matched in this leptonic-top reconstruction is denoted by $b_L$ in the following text.
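A minimal sketch of this matching step is given below. It assumes that the minimised quantity has the simple form $[(m[W^{\rm rec}_L \oplus b] - m_t)/\sigma]^2$, represents four-momenta as `(px, py, pz, E)` tuples, and uses $m_t = 172.5$ GeV as a placeholder value; all names are hypothetical.

```python
import math


def inv_mass(*vectors):
    """Invariant mass of the sum of (px, py, pz, E) four-vectors."""
    px = sum(v[0] for v in vectors)
    py = sum(v[1] for v in vectors)
    pz = sum(v[2] for v in vectors)
    energy = sum(v[3] for v in vectors)
    m2 = energy**2 - px**2 - py**2 - pz**2
    return math.sqrt(max(m2, 0.0))


def match_b_jet(w_rec, b_jets, m_top=172.5, sigma=40.0):
    """Return (index, chi2) of the b jet whose mass with W_rec is closest to m_top."""
    def chi2(b_jet):
        return ((inv_mass(w_rec, b_jet) - m_top) / sigma) ** 2

    best = min(range(len(b_jets)), key=lambda i: chi2(b_jets[i]))
    return best, chi2(b_jets[best])
```

With a single term in the sum, the value of $\sigma$ does not change which $b$ jet is selected; it only sets the scale of the returned $\chi^2$.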
Figure 1 illustrates the features of the reconstruction of the leptonic branch of the process. It shows the distribution in the invariant mass $m(W^{\rm rec}_L)$ of the reconstructed $W$ boson (upper panel) and that in the invariant mass $m(t^{\rm rec}_L)$ of the reconstructed top quark (lower panel). Predictions are displayed both for the $t\bar{t}$ signal (red) and the associated background (blue). These results demonstrate that most signal events exhibit an on-shell leptonically-decaying $W$ boson and an on-shell associated top quark. However, the tails of the distributions extend quite significantly away from the peak values for the two spectra. This originates from the inefficiencies inherent to the kinematic fit performed in eq. (4), which could lead to zero, one, or two solutions for $\slashed{p}_z$. Consequently, the reconstructed mass of the $W^{\rm rec}_L$ boson (upper panel of Figure 1) exhibits a plateau at values lower than the true $W$ mass. This impacted our choice for the numerical value of the resolution parameter used in the $\chi^2$ fit of eq. (5), which then leads to a quite broad peak around the true top mass for the distribution in the reconstructed top mass (lower panel of Figure 1).
In the next step of our analysis, we study to which extent a hadronically-decaying top quark can be reconstructed from the event's final state. We start from the fat-jet collection and discard any fat jet $J$ that lies at an angular distance $\Delta R(J, t^{\rm rec}_L) \leq 1.5$ from the reconstructed leptonically-decaying top quark $t^{\rm rec}_L$. Next, we discard all fat jets found near the $b_L$ jet, i.e. lying within an angular distance $\Delta R(J, b_L) \leq 1.5$. Finally, we reject events that do not comprise at least one fat jet that includes a (small) $b$-jet. This condition is implemented by requiring that there is a fat jet $J$ such that a $b$-jet different from the $b_L$ jet lies at a distance $\Delta R(J, b) < 1.5$ from it. We then test whether the hardest of the remaining fat jets is top-tagged by HEPTopTagger.
We now introduce a few useful quantities in order to assess the performance of HEPTopTagger. First, we classify a truth-level top quark as "on-shell" when its invariant mass is in the range $[m_t - 15~{\rm GeV}, m_t + 15~{\rm GeV}]$, and define the quantity $T_{t\bar{t}}$ as the number of $t\bar{t}$ events featuring two such on-shell top quarks. Next, we denote by $C_{t_H}$ the number of events for which the reconstructed hadronic top quark lies within an angular distance $\Delta R < 1.2$ from the corresponding truth-level object when the latter is on-shell. Similarly, $C_{t^{\rm rec}_L}$ stands for the number of events for which the reconstructed leptonically-decaying top quark lies at a distance $\Delta R < 1.2$ from its truth-level counterpart when it is on-shell. The quantities $C_{t_H}$ and $C_{t^{\rm rec}_L}$ hence refer to the number of events for which the reconstructed top quarks are matched with the corresponding truth-level objects, so that the reconstruction is deemed correct.
With the first set of three coloured columns displayed on the left of Figure 2, we show the resulting reconstruction efficiency $\varepsilon$, defined as the ratio of the number of events featuring correctly reconstructed hadronic and leptonic top quarks to the number of events including two truth-level on-shell top quarks. This efficiency is given when the baseline cuts described above are imposed (red), when an additional selection of $m^{\rm truth}_{t\bar{t}} > 1$ TeV is enforced (blue), and finally, when we require $m^{\rm truth}_{t\bar{t}} > 1.5$ TeV (green). The error bars represent the related Monte Carlo statistical uncertainty. We observe that about 50% of the SM events with on-shell $t\bar{t}$ production are correctly reconstructed, this number slightly increasing when we focus more deeply on the boosted regime (i.e. with a larger $m^{\rm truth}_{t\bar{t}}$ cut).
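The efficiency defined above can be evaluated from a list of per-event records as in the following sketch. The record keys (`m_truth_had`, `m_truth_lep`, `dr_had`, `dr_lep`), the value $m_t = 172.5$ GeV and the binomial form of the statistical uncertainty are assumptions introduced for this example only; events without a tagged hadronic top can be encoded with `dr_had = float("inf")`.

```python
import math


def is_on_shell(mass, m_top=172.5, window=15.0):
    """Truth-level on-shell criterion: |m - m_top| <= window (in GeV)."""
    return abs(mass - m_top) <= window


def reconstruction_efficiency(events, dr_max=1.2):
    """Fraction of on-shell ttbar events with both tops matched to truth.

    Returns (efficiency, binomial uncertainty). Each event is a dict with
    the truth masses of the two tops and the Delta R between each
    reconstructed top and its truth-level counterpart.
    """
    on_shell = [ev for ev in events
                if is_on_shell(ev["m_truth_had"]) and is_on_shell(ev["m_truth_lep"])]
    n_total = len(on_shell)
    if n_total == 0:
        raise ValueError("no event with two on-shell truth-level top quarks")
    n_pass = sum(1 for ev in on_shell
                 if ev["dr_had"] < dr_max and ev["dr_lep"] < dr_max)
    eff = n_pass / n_total
    return eff, math.sqrt(eff * (1.0 - eff) / n_total)
```

Dropping the requirement on `dr_had` in the numerator gives the corresponding efficiency for the leptonic branch alone.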
The efficiency, however, increases once one of the SMEFT operators of eq. (3) is turned on, as shown in the rest of Figure 2 (the dashed lines being guidelines for the comparison with the case of the SM). Here, the signal is simulated by implementing the Lagrangian and operators of eqs. (2) and (3) in FeynRules as specified in Refs. [69,70]. This is then used to generate a UFO [71] model to be used within MadGraph5_aMC@NLO, so that events could be generated through the same toolchain as that described at the beginning of this section. Whereas we include the interference of dimension-six contributions with SM diagrams, squared SMEFT contributions (thus formally of dimension-eight) are truncated away. The increase in efficiency observed in Figure 2 can be traced back not only to a slight increase in the signal cross section, but also to a change in the event topology enhancing HEPTopTagger's ability to correctly tag the boosted, hadronically-decaying top quark. To prove this statement, we display in Figure 3 the efficiency $\varepsilon_\ell$ of correctly tagging the leptonic top $t^{\rm rec}_L$ regardless of the hadronic branch of the events, i.e. the ratio of $C_{t^{\rm rec}_L}$ to $T_{t\bar{t}}$. As can be seen in this figure, the efficiency $\varepsilon_\ell$ is almost 100% for all considered scenarios (both in terms of new-physics setup and the parton-level $m^{\rm truth}_{t\bar{t}}$ cut). This confirms that the global suppression of the efficiency $\varepsilon$ shown in Figure 2 (relative to $\varepsilon_\ell$) originates solely from the tagging of the hadronic top quark, and is therefore related to the performance of HEPTopTagger. The latter can thus directly be assessed from the quantity $\varepsilon$, and it is different between SM $t\bar{t}$ events and those including the interference of top-related SMEFT operators with the SM.

Fig. 3 Same as in Figure 2, but for the efficiency associated with the reconstruction of one leptonic top quark, estimated relatively to the number of events containing two on-shell top quarks.

Our results demonstrate that the performance of HEPTopTagger could be strongly impacted by the physics model that is used as a reference during its tuning. Including effective operators such as those in eq. (3) favours the production of a boosted top-antitop pair more than in the SM, as expected from operators sensitive to the event's energy scale. While in this case the presence of operators not included in the HEPTopTagger tuning enhances the reconstruction efficiency, this is not generally true, and a tuning based on potential EFT contributions could find different optimal tagging parameters.
Importantly, analyses assuming SM-like HEPTopTagger reconstruction efficiencies would underestimate the reconstruction and tagging efficiency for any $t\bar{t}$ events in data that involve these operators, and would hence systematically overestimate the magnitude of the corresponding Wilson coefficient. This observation reinforces the importance of using operator-dependent reconstruction efficiencies in SMEFT fits to boosted top-quark data.
The presented efficiencies are, however, normalised to the number of events featuring an on-shell $t\bar{t}$ pair. The obtained increase in the tagging efficiency $\varepsilon$ in the presence of SMEFT operators may, therefore, also be related to a different probability of getting at least one off-shell top quark in the events. This problem is addressed by the Dalitz-plot heat-maps shown in Figure 4, which depict the on-shellness of the produced hadronic top quark. In these figures, we display the correlations between two ratios of invariant masses, $m_{13}/m_{123}$ and $m_{23}/m_{123}$. The three integers 1, 2 and 3 denote the three ($p_T$-ordered) subjets comprised in the hadronically-decaying boosted top quark, so that $m_{123}$ stands for the invariant mass of the three-subjet system, $m_{13}$ for the invariant mass of the system made of the leading and third subjets, and $m_{23}$ for that of the system made of the second and third subjets. We present results for events featuring on-shell top quarks only (left column) and for the entire generated samples (right column). Moreover, we explore the difference between the SM (top row), a scenario in which the $O^{3,1}_{Qq}$ operator of eq. (3) is turned on (middle row), and a scenario in which the $O^{3,8}_{Qq}$ operator of eq. (3) is turned on (bottom row).
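As an illustration of how the two Dalitz-plot coordinates are built, the sketch below computes $m_{13}/m_{123}$ and $m_{23}/m_{123}$ from three subjet four-momenta. The `(px, py, pz, E)` tuple representation and the function names are assumptions made for this example.

```python
import math


def inv_mass(*vectors):
    """Invariant mass of the sum of (px, py, pz, E) four-vectors."""
    px = sum(v[0] for v in vectors)
    py = sum(v[1] for v in vectors)
    pz = sum(v[2] for v in vectors)
    energy = sum(v[3] for v in vectors)
    return math.sqrt(max(energy**2 - px**2 - py**2 - pz**2, 0.0))


def dalitz_ratios(subjets):
    """Return (m13/m123, m23/m123) for three subjets, after pT ordering."""
    if len(subjets) != 3:
        raise ValueError("exactly three subjets are expected")
    # Order the subjets by decreasing transverse momentum.
    j1, j2, j3 = sorted(subjets, key=lambda v: math.hypot(v[0], v[1]), reverse=True)
    m123 = inv_mass(j1, j2, j3)
    return inv_mass(j1, j3) / m123, inv_mass(j2, j3) / m123
```

Because the subjets are re-ordered inside the function, the result does not depend on the order in which they are passed.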
As can be seen, the jet combinatorics are correctly resolved in most events in the case of the SM. The leading jet is most often that originating from the two-body $t \to Wb$ decay (with the $b$-tagging information being ignored), and the next two jets are those stemming from the hadronic $W$-boson decay. The distribution of the $m_{23}/m_{123}$ ratio is indeed concentrated around $m_W/m_t$ for the two subfigures of the upper row of Figure 4. The spread around this value is more pronounced when no restriction is enforced on the invariant mass of the top quarks at parton level, as observed from a comparison of the predictions shown in the top-left and top-right figures. This can be easily explained by the inefficiency of HEPTopTagger to correctly tag off-shell top jets, as, by default, the algorithm has been tuned on events featuring on-shell top quarks. This situation changes slightly when EFT operators are enabled (middle and lower rows of Figure 4). First, although the associated amplitude does not feature any intermediate $W$ boson (as the decay of the top quark proceeds via a single four-fermion operator), the interference with the SM diagrams (our predictions being truncated at dimension-six) is sufficient to keep the properties that the leading jet is the $b$-jet, and that the next two jets can be paired to reconstruct a hadronically-decaying $W$ boson. It is additionally noticeable that the effective operators considered affect the reconstructed top quark so that the latter is naturally more often on-shell (and more boosted due to the energy growth inherent to the effective-theory paradigm). Consequently, we can expect better performance of HEPTopTagger, which confirms what was already found in Figure 2.

Fig. 4 Dalitz plots depicting the invariant mass ratios $m_{13}/m_{123}$ and $m_{23}/m_{123}$, where the indices refer to a specific jet among those comprising the reconstructed hadronically-decaying top quark. We show predictions when the on-shellness of the top quark is enforced (left column) and when there is no restriction on the invariant mass of the top quarks at parton level (right column). We consider the case of the SM (top row) and that of scenarios with one SMEFT operator turned on, namely $O^{3,1}_{Qq}$ (middle row) and $O^{3,8}_{Qq}$ (bottom row).

Boosted tops as a probe to new physics in the SMEFT
In this section, we explore how the findings of Section 3.2 affect the sensitivity of the LHC to SMEFT effects originating from the operators of eq. (3). We begin by providing, in Table 3, the numbers of events surviving each of the selection cuts introduced in the previous section, both for the $t\bar{t}$ signal and the $Wb\bar{b}$ + jets background. Our results are normalised to an integrated luminosity of 300 fb$^{-1}$, and we additionally estimate the efficiencies associated with each cut, which we define as the ratio of the number of events surviving a given cut to the number of events surviving the previous cut. Whereas the last cut on the invariant mass of the reconstructed $t\bar{t}$ system (i.e. the ninth one in the table, $m^{\rm rec}_{t\bar{t}} > 950$ GeV) is not necessary for physics-analysis purposes, it is required to match the Monte Carlo signal-generation cut implemented in Section 3.2 (to enable a more efficient event-generation process in the boosted regime).
As already noticeable from the results introduced earlier in this manuscript, for instance from the invariant-mass spectra displayed in Figure 1, the events surviving the entire selection are dominated by signal events, which hence have large expected event counts. This is further reflected in the $S/B$ and $S/\sqrt{B}$ ratios provided as significance estimators in the lower rows of Table 3, these two metrics being evaluated in terms of the number of signal events $S$ and background events $B$ passing all the analysis cuts. The background is thus fully under control in our study, so a shape analysis can be implemented to study how kinematic distributions can be best used to constrain the SMEFT-operators' Wilson coefficients.
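The cut-by-cut efficiencies and the two significance estimators discussed above are simple ratios, as the following sketch shows. The function names are hypothetical, and any numbers fed to them in an example should be understood as placeholders rather than as the yields of Table 3.

```python
import math


def cut_efficiencies(yields):
    """Efficiency of each cut relative to the previous one.

    yields[0] is the number of events before the first cut, yields[i]
    the number surviving the i-th cut; all entries but the last must be
    non-zero.
    """
    return [after / before for before, after in zip(yields, yields[1:])]


def significance_estimators(n_signal, n_background):
    """Return the pair (S/B, S/sqrt(B)) for the events passing all cuts."""
    return n_signal / n_background, n_signal / math.sqrt(n_background)
```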
To do this, we first increase the final selection cut to maximise sensitivity by probing more deeply boosted top-antitop production. In the following, we hence consider either $m^{\rm rec}_{t\bar{t}} > 1$ TeV or $m^{\rm rec}_{t\bar{t}} > 1.5$ TeV. The sensitivity of the LHC to a given SMEFT operator is derived through the evaluation of a $\chi^2$ test-statistic in an asymptotic scheme that involves deviations of SMEFT predictions relative to the associated SM predictions for a given set of observables. Our analysis explores simultaneously the distributions of the following observables: the invariant mass $m^{\rm rec}_{t\bar{t}}$ of the di-top system; the transverse momentum $p_T(j^{R=1.5})$ of the leading fat-jet; the transverse momenta $p_T(j^{R=0.4}_1)$, $p_T(j^{R=0.4}_2)$ and $p_T(j^{R=0.4}_3)$ of the three leading small-$R$ jets; the transverse-momentum spectrum $p_T(t_H)$ of the reconstructed hadronic top quark; the transverse-momentum spectrum $p_T(t^{\rm rec}_L)$ of the reconstructed leptonic top quark; the rapidity difference $\Delta y(t^{\rm rec}_L, t_H)$ between the two reconstructed top quarks; and the azimuthal-angle difference $\Delta\varphi(t^{\rm rec}_L, t_H)$ between the two reconstructed top quarks.
In order to estimate the $\chi^2$ value associated with a specific SMEFT scenario, each of the nine histograms considered was divided into 25 bins (20 and 16 for the $\Delta y(t^{\rm rec}_L, t_H)$ and $\Delta\varphi(t^{\rm rec}_L, t_H)$ distributions respectively), and we calculated the $\chi^2$ of eq. (8), in which we sum over all bins and all histograms. The SM predictions are taken as the null hypothesis, $N^{\rm exp}_i$ hence denoting the expected number of events in the SM for a given observable and bin $i$, $N^{\rm obs}_i$ standing for the corresponding SMEFT predictions, and $\Delta_{\rm sys} N^{\rm obs}_i$ referring to the error on the SMEFT predictions. In other words, we enforce that the pseudo-data corresponding to the SM scenario (i.e. the origin of the Wilson-coefficient parameter space) corresponds to the background expectation with suppressed statistical and systematic fluctuations, which therefore constitutes an Asimov dataset. The above $\chi^2$ test is thus asymptotically equivalent to a profile likelihood ratio $\Delta\chi^2 = \chi^2_{\rm SMEFT} - \chi^2_{\rm best}$ for a given SMEFT scenario, with an implicit best-fit reference model evaluated in the case of the SM (therefore with $\chi^2_{\rm best} = 0$). Without explicitly performing any profiling, we thus estimate the sensitivity of a profile-likelihood fit by comparing the obtained $\chi^2$ values with that expected from a $\chi^2$ distribution with one degree of freedom. In practice, however, profiled constraints could be slightly weaker due to a less perfect fit of observed data to the background model.
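The procedure described above can be sketched as follows. The per-bin variance used here, $N^{\rm exp}_i + (\Delta_{\rm sys} N^{\rm obs}_i)^2$, is an assumption made for this illustration (a Poisson term plus a relative systematic term) and may differ from the exact denominator used in the analysis; the function names are hypothetical. For one degree of freedom, the cumulative $\chi^2$ distribution reduces to ${\rm erf}(\sqrt{\chi^2/2})$, so that no external library is needed.

```python
import math


def chi2_statistic(n_obs, n_exp, delta_sys=0.0):
    """Sum of squared pulls over all bins of all histograms.

    n_obs holds the SMEFT predictions and n_exp the SM expectations, with
    all histograms concatenated into flat sequences. The variance
    n_exp + (delta_sys * n_obs)^2 is an assumed form; bins with zero
    variance are skipped.
    """
    total = 0.0
    for obs, exp in zip(n_obs, n_exp):
        variance = exp + (delta_sys * obs) ** 2
        if variance > 0.0:
            total += (obs - exp) ** 2 / variance
    return total


def confidence_level_1dof(chi2):
    """Cumulative probability of a chi-square distribution with one d.o.f."""
    return math.erf(math.sqrt(chi2 / 2.0))
```

A scenario would then be considered within reach at the 68% confidence level when `confidence_level_1dof(chi2_statistic(...))` exceeds 0.68, which corresponds to a $\chi^2$ value of about 1.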
In Table 4, we provide information on the observable found to provide the strongest sensitivity to each SMEFT operator. The results are shown for the two cuts on the invariant mass considered, $m^{\rm rec}_{t\bar{t}} > 1$ TeV (upper panel of Table 4) and $m^{\rm rec}_{t\bar{t}} > 1.5$ TeV (lower panel of Table 4). Moreover, we consider LHC luminosities of 300 fb$^{-1}$ and 3000 fb$^{-1}$, and two different options for the amount of systematics $\Delta_{\rm sys}$ used in eq. (8). We take as a reference the ideal situation in which there are no systematic uncertainties ($\Delta_{\rm sys} = 0$), as well as a more realistic situation in which we set $\Delta_{\rm sys} = 10\%$. In our procedure to extract this information, we define the sensitivity on the basis of a 68% confidence level. When we consider a moderate definition of the boosted regime with $m^{\rm rec}_{t\bar{t}} > 1$ TeV, the sensitivity is always driven by the distribution in the transverse momentum of either the leptonically-decaying top quark ($p_T(t^{\rm rec}_L)$) or the lepton originating from the decay of this top quark ($p_T(\ell_1)$). The information brought by the hadronic branch of the event is found to be sub-leading for all SMEFT operators and systematic-uncertainty assumptions. However, the situation changes when the boosted regime is probed more deeply through the tighter cut $m^{\rm rec}_{t\bar{t}} > 1.5$ TeV. Here, both top quarks are reconstructed and tagged more accurately (in particular through the better performance of HEPTopTagger in a SMEFT scenario, see Section 3.2). This leads to an increased discovery potential through the use of a larger set of contributing observables. This statement is illustrated in the lower panel of the table, which displays a greater variability in the leading observable driving the sensitivity of the LHC to a given SMEFT operator, with the $O^{1,8}_{Qq}$, $O^{3,8}_{Qq}$, $O^{3,1}_{Qq}$, and $O^{8}_{Qd}$ operators now most sensitive to either hadronic-top or $t\bar{t}$-system observables.

Table 4 Observable driving the sensitivity of the LHC (at 68% confidence level) to a given SMEFT operator from eq. (3) (first column). We consider both a perfect situation without systematics ($\Delta_{\rm sys} = 0$, second and fourth columns), and one with 10% of systematics ($\Delta_{\rm sys} = 10\%$, third and fifth columns). Moreover, we present results for 300 fb$^{-1}$ and 3 ab$^{-1}$, and for an invariant mass cut of $m^{\rm rec}_{t\bar{t}} > 1$ TeV (upper panel) and $m^{\rm rec}_{t\bar{t}} > 1.5$ TeV (lower panel).
Our final projections of SMEFT Wilson-coefficient expected limits, assuming the SM, are shown in Figure 5. We derive the sensitivity of the LHC to each of the operators considered, making use of the procedure described above. We present bounds on the associated Wilson coefficients, both for an integrated luminosity of 300 fb$^{-1}$ (blue) and 3000 fb$^{-1}$ (red), and for the two options explored for the level of systematics, namely $\Delta_{\rm sys} = 0$ (shaded bars) and 10% (solid bars). In addition, we distinguish the case in which we pre-select at parton-level on-shell $t\bar{t}$ events (left subfigures) and that in which we analyse the full event sample generated (right subfigures). As for the previous discussion, we first implement a relatively inclusive requirement of 1 TeV on the invariant mass of the reconstructed $t\bar{t}$ system (upper row), as well as a more stringent $m^{\rm rec}_{t\bar{t}} > 1.5$ TeV cut (bottom row). We find limits on $|C/\Lambda|$ that lie in the 0.1-1 TeV$^{-1}$ range. This means that for Wilson coefficients satisfying $C \sim 1$, effective scales in the 1-5 TeV range can be probed. Conversely, for TeV-scale new physics, couplings of $O(0.1)$ can be reached. The bounds are found to be mildly more constraining with the increase in luminosity as well as with a harder cut on $m^{\rm rec}_{t\bar{t}}$, as expected, and the impact of off-shell top-antitop production is additionally found to be sub-leading. Such a sensitivity is of comparable size with that estimated on the basis of global fits (see e.g. predictions from Ref. [42]), which demonstrates the potential of including dedicated analyses of boosted top-quark pair production and decay in SMEFT global fits. Global fits of LHC Run 2 data indeed indicate that $|C/\Lambda|$ has to be smaller than about 0.1-1 TeV$^{-1}$ too. Our results should however additionally be compared with individual limits extracted from fits of a large set of observables when one SMEFT operator is considered at a time (for a fairer comparison). Such fits lead to bounds on $|C/\Lambda|$ of $O(0.1)$ TeV$^{-1}$ [44], which are thus comparable with the findings of Figure 5. Whereas exploiting boosted top-quark production is already known to have a strong constraining power on individual operators (for instance in the context of top dipole moments, where it has been shown to improve the bounds by a factor of a few [72]), a detailed quantitative analysis of its impact lies beyond the scope of this paper. Here, we have only investigated how using a specific boosted-top-quark channel could lead to a better assessment of the sensitivity of the LHC to top-quark-related SMEFT operators, thanks to a joint usage of a variety of potentially relevant observables and improved top-tagging capabilities in the SMEFT.

Conclusion and outlook
Jet substructure methods are known to be among the key players in the search for new phenomena beyond the Standard Model of particle physics. Among these, a set of dedicated techniques are related to the identification of jets originating from the hadronic decay of a boosted top quark. In this paper, we have reported the development of an interface between the HEPTopTagger package and two software tools widely used in the high-energy physics community, namely the MadAnalysis 5 and Rivet frameworks. Thanks to this development, the many users of these platforms now have the possibility to exploit boosted hadronically-decaying top quarks and their properties in analyses of high-energy physics events for the Large Hadron Collider and beyond.
We have briefly described these two implementations and how to use them. Our developments equip the Rivet toolkit from version 3.1.7, which is available from HepForge (see https://rivet.hepforge.org/), as well as the MadAnalysis 5 framework from version 2.0.4, available from GitHub (see https://github.com/MadAnalysis/madanalysis5/releases).
Moreover, detailed tutorials exploiting all the possibilities can be found in the "analyses/examples/EXAMPLE_HTT.cc" analysis file shipped with Rivet, as well as in the MadAnalysis 5 tutorial available from https://github.com/MadAnalysis/tutorial_osu.
To illustrate the power of these developments, we have considered the SMEFT framework in which new physics manifests through non-renormalisable operators in the Standard Model fields. We have focused on eight dimension-six, four-fermion operators relevant to the top-quark sector, chosen as they are not stringently constrained by current SMEFT global fits. The analysis of the production of pairs of boosted top quarks could therefore provide new handles on associated heavy BSM physics. We have explored this option by first investigating the performance of the HEPTopTagger algorithm in the presence of non-vanishing SMEFT operators. Whereas the algorithm is tuned on SM top-pair production and decay, we have observed that its performance improves further in the presence of the considered additional SMEFT operators in the model's Lagrangian. The energy dependence of the SMEFT operators considered indeed favours the production of very energetic boosted top quarks, with properties enhancing their tagging possibility by the HEPTopTagger method. This observation highlights the importance of considering new-physics effects upon reconstruction performance when attempting SMEFT parameter fits.
Secondly, we have investigated differential observables in boosted top-antitop production following HEPTopTagger tagging, to study how deviations from the Standard Model can best be used to isolate SMEFT effects emerging from the new operators. We have shown that a simple analysis based on HEPTopTagger could lead to bounds comparable with those stemming from other means to constrain SMEFT operators. We hope that this demonstrates the potential of the developments presented in this work and that they will serve the community well in the future.

Fig. 1 Invariant mass spectra relevant to the reconstruction of the leptonically-decaying top quark. We display the invariant mass $m(W^{\rm rec}_L)$ of the reconstructed $W$ boson (upper panel), as well as that ($m(t^{\rm rec}_L)$) of the reconstructed top quark (lower panel). Predictions are shown for both the $t\bar{t}$ signal (red) and the associated background (blue).

Fig. 5 Sensitivity of the LHC to the various SMEFT operators of eq. (3). We present predictions for 300 fb$^{-1}$ (blue) and 3000 fb$^{-1}$ (red), $\Delta_{\rm sys} = 0$ (shaded bars) and 10% (solid bars), and we distinguish an analysis of the full $t\bar{t}$ event sample generated (right column) and after enforcing on-shell top-antitop production (left column). Two analysis cuts on the invariant mass of the reconstructed top pair are imposed, $m^{\rm rec}_{t\bar{t}} > 1$ TeV (upper panel) and $m^{\rm rec}_{t\bar{t}} > 1.5$ TeV (lower panel).

Table 1
Accessors equipping the HEPTopTagger wrapper implemented in Rivet.
  const Jet topJet()          top-quark candidate (returned as a Jet object)
  const Jet bJet()            b-jet candidate (returned as a Jet object)
  const Jet wJet()            combined subjets compatible with a W-boson candidate (returned as a Jet object)
  const Jet w1Jet()           leading subjet constituting a W-boson candidate (returned as a Jet object)
  const Jet w2Jet()           sub-leading subjet constituting a W-boson candidate (returned as a Jet object)
  const PseudoJet topJet()    top-quark candidate (returned as a PseudoJet object)
  const PseudoJet bJet()      b-jet candidate (returned as a PseudoJet object)
  const PseudoJet wJet()      combined subjets compatible with a W-boson candidate (returned as a PseudoJet object)
  const PseudoJet w1Jet()     leading subjet constituting a W-boson candidate (returned as a PseudoJet object)
  const PseudoJet w2Jet()     sub-leading subjet constituting a W-boson candidate (returned as a PseudoJet object)
  bool passedMassCutTop()     Boolean indicating if the top jet has a mass compatible with the top mass
  bool passedMassCut2D()      Boolean indicating if the top jet satisfies two-dimensional mass plane requirements
  […]                         difference between the reconstructed top mass and the true top mass, $|m_{\rm rec} - m_t|$
  […]                         Boolean indicating if the top jet has a mass compatible with the top mass, satisfies two-dimensional mass plane requirements, and has a $p_T$ above some threshold
[…] algorithm, the first argument of the Initialize method needs […]
Fig. 2 Efficiency associated with the reconstruction of one leptonic and one hadronic top quark, estimated relatively to the number of events containing two on-shell top quarks. Results are shown after the analysis baseline cuts (red), an additional $m^{\rm truth}_{t\bar{t}} > 1$ TeV cut (blue), and an extra $m^{\rm truth}_{t\bar{t}} > 1.5$ TeV cut (green). We consider the case of the SM (first column), as well as when eight different SMEFT operators are turned on (next columns).


Table 3 Number of $t\bar{t}$ and $Wb\bar{b}$+jets SM events surviving each step of our analysis, presented together with their respective selection efficiency $\varepsilon$. The results are normalised to an integrated luminosity of 300 fb$^{-1}$. In the last row of the table, we provide two alternative means to assess the analysis significance, namely the $S/B$ and $S/\sqrt{B}$ ratios, where $S$ and $B$ are the number of $t\bar{t}$ and background events passing all cuts.