Optimisation of large-radius jet reconstruction for the ATLAS detector in 13 TeV proton–proton collisions

Jet substructure has provided new opportunities for searches and measurements at the LHC, and has seen continuous development since the optimisation of the large-radius jet definition used by ATLAS was performed during Run 1. A range of new inputs to jet reconstruction, pile-up mitigation techniques and jet grooming algorithms motivates an optimisation of large-radius jet reconstruction for ATLAS. In this paper, this optimisation procedure is presented, and the performance of a wide range of large-radius jet definitions is compared. The relative performance of these jet definitions is assessed using metrics such as their pile-up stability and their ability to identify hadronically decaying W bosons and top quarks with large transverse momenta. A new type of jet input object, called a 'unified flow object', is introduced which combines calorimeter- and inner-detector-based signals in order to achieve optimal performance across a wide kinematic range. Large-radius jet definitions are identified which significantly improve on the current ATLAS baseline definition, and their modelling is studied using pp collisions recorded by the ATLAS detector at √s = 13 TeV during 2017.


Introduction
High-energy particle collisions such as those produced in the Large Hadron Collider (LHC) at CERN can result in the production of massive particles (e.g. W/Z/H bosons and top quarks) with large Lorentz boosts. When such particles decay, their decay products become collimated, or 'boosted', in the direction of the progenitor particle. For massive particles that are sufficiently boosted, it is advantageous to reconstruct their hadronic decay products as a single large-radius (large-R) jet. Such large-R jets capture a characteristic, multi-pronged jet substructure from the two-body or three-body decays of hadronically decaying W, Z and H bosons and top quarks, which is distinct from the radiation pattern of a light-quark- or gluon-initiated jet. The substructure of boosted particle decays [1,2] allows powerful new approaches to be utilised in searches for physics beyond the Standard Model (BSM) [3-12] at high energy scales, and has enabled novel measurements of Standard Model processes [13-24].
The reconstruction of boosted hadronic systems is complicated by the presence of soft radiation from several sources, which degrades performance when reconstructing jet substructure observables. In particular, soft radiation from the underlying event and uncorrelated radiation from additional pp interactions concurrent with the hard-scattering event of interest (pile-up interactions) can degrade the jet mass resolution and other jet substructure quantities, which are critical to boosted object identification. These effects are amplified by the use of a large radius for jet reconstruction [25-28], which incorporates more uncorrelated energy. During Run 1, the average number of pile-up interactions per LHC bunch crossing was roughly 20. This number increased to ∼34 in the Run 2 dataset, although some events during this period were recorded with up to 70 pile-up interactions. The average number of pile-up collisions is expected to increase further during Run 3 and will reach ∼200 pile-up interactions during high-luminosity LHC operations [29]. As experimental conditions become more challenging, the choices made when reconstructing large-R jets will need to evolve to maintain optimal performance.
There is no single way to reconstruct a jet, and several choices must be made at the level of a physics analysis to define the jets which will be used. Jets at the LHC are typically reconstructed from some set of input objects ('jet inputs', or simply 'inputs' throughout) using a sequential recombination algorithm with a user-specified radius parameter (R). Once a jet input type is chosen, it may be preprocessed before jet reconstruction, for example, to mitigate the effects of pile-up. After jet reconstruction, a grooming algorithm may be applied to the jets which preferentially removes soft and/or wide-angled radiation from the reconstructed jet, to further suppress contributions from pile-up and the underlying event and to enhance the resolution of the jet mass and other substructure observables.
Large-R jets are typically reconstructed by ATLAS using the anti-k t algorithm [30] and a radius parameter R = 1.0. The choice of recombination scheme and radius parameter has been studied previously [31], and is not revisited in these studies. ATLAS large-R jet reconstruction has so far been based on topological cluster inputs reconstructed only using calorimeter-based energy measurements. These clusters provide excellent energy resolution, but do not accurately represent the positions of individual particles within jets with large transverse momentum (p T), particularly in areas where the energy density is large or the calorimeter granularity is coarse. This can result in degraded performance when the resolution of individual particles becomes relevant, for instance, when reconstructing the mass of showers which are so collimated that they are not spatially resolved by the ATLAS calorimeter's granularity. In order to better reconstruct the angular distributions of charged particles within jets, several particle-flow (PFlow) algorithms which were developed and commissioned by ATLAS during Run 2 are considered. These include a PFlow implementation designed to improve R = 0.4 jet performance at low p T [32], and a variant designed to reconstruct jet substructure at the highest transverse momenta, called Track-CaloClusters (TCCs) [7,33]. In this work, a union of PFlow and TCCs called 'Unified Flow Objects' (UFOs) is established to provide optimal performance across a wider kinematic range than is possible with either particle-flow objects (PFOs) or TCCs alone, which are each found to perform well in distinct kinematic regions. Jet inputs may also be preprocessed using one or several of the many input-object-level pile-up mitigation techniques which have been developed, such as constituent subtraction [34,35], Voronoi subtraction [36], SoftKiller [37], and pile-up per particle identification (PUPPI) [38].
Various input types and pile-up mitigation algorithms can be combined to create pile-up-robust inputs to jet reconstruction, adding further complexity to the search for optimal performance.
Grooming algorithms are another tool which may be used to remove undesirable radiation from jets after they have been reconstructed. The performance of several grooming algorithms was studied by ATLAS in detail using Run 1 data [39] and during preparations for Run 2 [40], including the jet trimming [41], pruning [42], and mass drop filtering [43] algorithms. Based on these studies, large-R jets groomed with the trimming algorithm using parameter choices of R sub = 0.2 and f cut = 5% were found to be optimal for ATLAS with Run 2 conditions. Since the completion of these studies, several additional jet grooming algorithms have been proposed, including the modified mass drop tagger (mMDT) [31,75], soft-drop (SD) [45], and recursive and bottom-up soft-drop (RSD and BUSD) [46] algorithms.
The development of new input objects, pile-up mitigation techniques and jet grooming algorithms by the experimental and phenomenological communities motivates a thorough reoptimisation of the large-R jet definition used by ATLAS. In this paper, the jet tagging and substructure performance of 171 distinct combinations of the different jet inputs, pile-up mitigation techniques and grooming algorithms is evaluated using Run 2 conditions. The performance of different jet definitions is compared in the context of several metrics, which quantify their tagging performance, their pile-up stability, and the sensitivity of their mass response to different jet substructure topologies. The performance in data is also studied to ensure the validity of the conclusions from the Monte Carlo studies.
The remaining sections of this document are structured as follows. The ATLAS detector is described in Sect. 2, along with aspects of the 2017 pp dataset and details of the simulated events used to perform these studies. An overview of the jet reconstruction techniques surveyed by these studies is provided in Sect. 3. Several metrics are used to determine the optimal jet definition, as well as to understand the behaviour of individual algorithms. Due to the large number of possible large-R jet definitions, a two-stage optimisation is performed to determine which of these exhibit the best performance. In the first stage, presented in Sect. 4, the metrics which will be used to evaluate the relative performance of all jet definitions are established by studying the performance of a limited set of jet definitions. The observations made from these comparisons motivate a union of the existing particle-flow and TCC input objects; this new input object type is presented in Sect. 5. The results of the complete survey of jet definitions are presented in Sect. 6. UFO-based definitions which perform consistently well are selected for further study. This smaller list of jet definitions, each of which improves on the current ATLAS baseline large-R jet definition, is calibrated using simulated events, and a more detailed comparison of their performance in terms of their tagging performance and jet p T and mass resolutions as well as their performance in data is made in Sect. 7. In an appendix, more details of the interaction between pile-up interactions and topological cluster formation are provided.

The ATLAS detector, data and simulated events
The ATLAS detector [47-49] consists of three principal subsystems.¹ The inner detector (ID) provides tracking of charged particles within |η| < 2.5 using silicon pixel and microstrip detectors, as well as a transition radiation tracker which provides a large number of hits in the ID's outermost layers in addition to particle identification information. This subsystem is immersed in an axial magnetic field generated by a 2 T solenoid. A sampling calorimeter surrounds the ID and barrel solenoid, providing energy measurements of electromagnetically and hadronically interacting particles within |η| < 4.9, and is followed by a muon spectrometer.
¹ ATLAS uses a right-handed coordinate system with its origin at the nominal interaction point (IP) in the centre of the detector and the z-axis along the beam pipe. The x-axis points from the IP to the centre of the LHC ring, and the y-axis points upward. Cylindrical coordinates (r, φ) are used in the transverse plane, φ being the azimuthal angle around the z-axis. The pseudorapidity is defined in terms of the polar angle θ as η = −ln tan(θ/2).
The electromagnetic showers of electrons and photons are measured with a high-granularity liquid argon (LAr) calorimeter, consisting of a barrel module within |η| < 1.475 and two endcaps covering 1.365 < |η| < 3.2. Hadronic showers are measured using a steel/scintillator tile calorimeter within |η| < 1.7 and with a pair of LAr/copper endcaps within 1.5 < |η| < 3.2. In the forward region, a LAr/copper and LAr/tungsten forward calorimeter measures showers of both kinds within 3.2 < |η| < 4.9.
The muon spectrometer is based on one barrel and two endcap superconducting toroidal magnets. Precision chambers provide measurements for all muons within |η| < 2.7, and separate trigger chambers allow the online selection of events with muons within |η| < 2.4.
As writing events to disk at the nominal LHC collision rate of 40 MHz is currently unfeasible, a two-level trigger system is used to select events for analysis. The hardware-based Level-1 trigger accepts events at a rate of ∼100 kHz using a subset of available detector information. The software-based High-Level Trigger then reduces the event rate to ∼1 kHz, which is retained for further analysis.
Studies presented in this paper utilise a dataset of proton–proton collisions delivered by the LHC in 2017 with centre-of-mass energy √s = 13 TeV and collected with the ATLAS detector. Data containing high-p T dijet events were selected using a single-jet trigger, and the leading anti-k t R = 1.0 jet is required to have p T above 600 GeV. All data are required to meet standard ATLAS quality criteria [50]; data taken during periods when detector subsystems were not functional, which contain significant contamination from detector noise, or where there were detector read-out problems are discarded. The resulting dataset has an integrated luminosity of 44.3 fb⁻¹ and an associated luminosity uncertainty of 2.4% [51], obtained using the LUCID-2 detector [52] for the primary luminosity measurements.
The simulated event samples used to perform these studies were generated using Pythia 8.186 [53,54] with the NNPDF2.3 LO [55] set of parton distribution functions (PDF), a p T-ordered parton shower, Lund string hadronisation [56,57], and the A14 set of tuned parameters (tune) [58]. These samples provide 'background' jets which originate from high-energy quark and gluon scattering (using a 2 → 2 matrix element), and 'signal' jets originating from high-p T W boson and top quark decays across a wide kinematic range. The signal W jets were produced using a BSM spin-1 W′ → WZ → qqqq model including only hadronic W and Z decays. The signal top quark jets are taken from a BSM Z′ → tt̄ model, where the top quarks may decay either hadronically or semileptonically. In order to remove dependence on the specific BSM physics models used to generate these jets, the p T spectrum of signal jets is always reweighted to match that of background jets [59]. Straightforward particle-level containment definitions are used to ensure that the signal jets provide samples of two- and three-pronged jet topologies: the decay partons of the W boson or top quark are required to be within ΔR = 0.75 of the particle-level jet axis. Top jets containing leptonic W boson decays are rejected using particle-level information.
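The reweighting of the signal p T spectrum to that of the background can be pictured as a per-bin ratio of histograms. The sketch below is purely illustrative: the bin edges, the toy p T values and the helper names (`bin_index`, `histogram`, `spectrum_weights`) are assumptions made for this example and are not taken from the ATLAS software.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Return the bin containing value, or None if it lies outside the edges."""
    idx = bisect_right(edges, value) - 1
    return idx if 0 <= idx < len(edges) - 1 else None


def histogram(values, edges, weights=None):
    """Fill a 1D histogram (optionally weighted) with the given bin edges."""
    counts = [0.0] * (len(edges) - 1)
    if weights is None:
        weights = [1.0] * len(values)
    for value, weight in zip(values, weights):
        idx = bin_index(value, edges)
        if idx is not None:
            counts[idx] += weight
    return counts


def spectrum_weights(signal_pt, background_pt, edges):
    """Per-jet weights that make the signal pT histogram match the background one."""
    h_sig = histogram(signal_pt, edges)
    h_bkg = histogram(background_pt, edges)
    # Bins without signal jets get a zero ratio (nothing to reweight there).
    ratio = [b / s if s > 0 else 0.0 for s, b in zip(h_sig, h_bkg)]
    weights = []
    for pt in signal_pt:
        idx = bin_index(pt, edges)
        weights.append(ratio[idx] if idx is not None else 0.0)
    return weights


# Toy example (GeV) with two pT bins: 300-400 and 400-500.
edges = [300.0, 400.0, 500.0]
signal_pt = [310.0, 320.0, 450.0]
background_pt = [350.0, 360.0, 370.0, 380.0, 420.0]
weights = spectrum_weights(signal_pt, background_pt, edges)
reweighted = histogram(signal_pt, edges, weights)
```

After reweighting, the weighted signal histogram coincides bin by bin with the background histogram, which is the property the procedure is meant to enforce.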
All simulated events were passed through the complete ATLAS detector simulation [60] based on Geant4 [61] using the FTFP_BERT_ATL model [60]. The effect of pile-up was modelled by overlaying the hard-scatter event with minimum-bias pp collisions generated by Pythia 8.210 with the A3 tune [62] and the NNPDF2.3 LO PDF set. The number of pile-up vertices was reweighted to match the data events, which have an average of 38 simultaneous interactions per bunch crossing in the 2017 dataset. Pile-up events are overlaid such that each subdetector reconstructs the effect of signals from adjacent bunch crossings ('out-of-time' pile-up) as well as those from the same bunch crossing as the hard-scatter event ('in-time' pile-up) [63].

Objects and algorithms
This section provides a brief overview of the different jet input object, pile-up mitigation and grooming options. All jets discussed in these studies are reconstructed using the anti-k t algorithm as implemented in FastJet [64] with radius parameter R = 1.0. All jets used in these results are required to have a minimum p T of 300 GeV, and to be within |η| < 1.2.
The complete set of jet input object types, pile-up mitigation and grooming algorithms surveyed is summarised in Table 1. In some cases, additional algorithms or settings were studied but were not found to produce results which differed significantly from those presented here. Notes have been made in Sect. 4 when appropriate regarding these omitted jet definitions, and they are indicated in Table 1 by an asterisk (*).

Stable generator-level particles
Particle-level jets, or 'truth jets', are reconstructed in simulated events at generator level. All detector-stable particles from the hard-scattering process with a lifetime τ in the laboratory frame such that cτ > 10 mm are used. Particles that are expected to leave only negligible energy depositions in the calorimeter, i.e. muons and neutrinos, are excluded.
Ungroomed particle-level jets are used as the reference objects for selections throughout these studies in order to ensure that the same set of reconstructed jets is selected for comparison, regardless of the jet input objects used in reconstruction or grooming algorithm applied. In studies of simulated jets, unless otherwise specified, ungroomed particle-level jets are geometrically matched (ΔR < 0.75) to ungroomed reconstructed jets, and kinematic selections are applied to the ungroomed particle-level jet four-vector.
Particle-level jets are also taken as the reference for simulation-based ATLAS jet calibrations, and for studies of the jet energy and mass resolution. In this circumstance, they are groomed using the same algorithm and parameters as the reconstructed jets to which they are being compared (Sect. 7).

Inner detector tracks
Tracks are reconstructed from charged-particle hits in the inner detector. In order to ensure that only well-reconstructed tracks from the hard scattering are used, track quality criteria are applied. The 'loose' quality working point is used, which places requirements on the number of silicon hits in each subdetector [65]. Tracks are associated to the primary vertex (PV) of the hard interaction by placing a requirement on the track distance of closest approach to the PV along the z-axis, |z0 sin θ| < 2.0 mm. The PV is selected as the vertex with the highest scalar sum of p T² of the tracks associated with it using transverse and longitudinal impact parameter requirements. In addition, tracks are required to have p T > 500 MeV and to be within the tracking volume (|η| < 2.5).
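The primary-vertex choice and the kinematic track requirements quoted above can be sketched in a few lines. This is a toy illustration only: the `Track` container, the representation of vertices as lists of track p T values and the function names are assumptions for this example, and the 'loose' hit-based quality criteria are not modelled.

```python
import math
from dataclasses import dataclass


@dataclass
class Track:
    pt: float   # transverse momentum in GeV
    eta: float  # pseudorapidity
    z0: float   # longitudinal impact parameter relative to the PV, in mm


def select_primary_vertex(vertices):
    """Index of the vertex with the largest scalar sum of track pT squared.

    `vertices` is a list of lists of track pT values (toy representation).
    """
    return max(range(len(vertices)), key=lambda i: sum(pt ** 2 for pt in vertices[i]))


def passes_selection(track, pt_min=0.5, eta_max=2.5, z0_sin_theta_max=2.0):
    """Apply the pT, |eta| and |z0 sin(theta)| requirements quoted in the text."""
    theta = 2.0 * math.atan(math.exp(-track.eta))  # from eta = -ln tan(theta/2)
    return (
        track.pt > pt_min
        and abs(track.eta) < eta_max
        and abs(track.z0 * math.sin(theta)) < z0_sin_theta_max
    )


# Toy vertices: four soft tracks versus one hard and one soft track.
pv_index = select_primary_vertex([[1.0, 1.0, 1.0, 1.0], [3.0, 0.5]])

central_displaced = Track(pt=5.0, eta=0.0, z0=3.0)  # |z0 sin(theta)| = 3.0 mm
forward_displaced = Track(pt=5.0, eta=2.0, z0=3.0)  # sin(theta) is about 0.27
```

The second toy vertex is chosen because the sum of p T² favours a few hard tracks over many soft ones, and the same z0 passes or fails depending on the polar angle of the track.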

Topological clusters
Jets reconstructed from ATLAS calorimeter information are built from 'topoclusters' [66], which are three-dimensional groupings of topologically connected calorimeter cells. Topoclusters are formed using iterated 'seed' and 'collect' steps based on the absolute value of the signal significance in a cell relative to the expected noise, σ noise , which considers both electronic noise and stochastic noise from pile-up interactions. Cells with signal significance over 4σ noise in an event are allowed to seed topocluster formation, and their neighbouring cells with significance over 2σ noise are subsequently included. This step is repeated until all adjacent cells have a significance below 2σ noise , at which point all neighbouring cells are added to the cluster (0σ noise ). If this process results in a cluster with two or more local energy maxima, a splitting algorithm is used to separate the showers. The energies of the resulting set of clusters are calibrated at the electromagnetic (EM) scale, and all clusters are taken to be massless.
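The seed-and-collect procedure can be illustrated on a small two-dimensional grid of cell significances. The sketch below is a simplification under stated assumptions: real topoclusters are built in three dimensions across calorimeter layers, seeds are processed in order of significance, and the cluster-splitting step is omitted. The function names and the toy grid are hypothetical.

```python
from collections import deque


def neighbours(cell, shape):
    """Yield the 4-connected neighbours of a (row, col) cell inside the grid."""
    row, col = cell
    for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < shape[0] and 0 <= c < shape[1]:
            yield (r, c)


def topocluster_420(significance, seed=4.0, grow=2.0):
    """Toy 4-2-0 clustering: seed above 4 sigma, grow above 2 sigma, add perimeter."""
    shape = (len(significance), len(significance[0]))

    def sig(cell):
        return abs(significance[cell[0]][cell[1]])

    assigned = set()
    clusters = []
    for row in range(shape[0]):
        for col in range(shape[1]):
            start = (row, col)
            if start in assigned or sig(start) <= seed:
                continue
            # Breadth-first growth through neighbours above the growth threshold.
            cluster = {start}
            assigned.add(start)
            queue = deque([start])
            while queue:
                cell = queue.popleft()
                for nb in neighbours(cell, shape):
                    if nb not in assigned and sig(nb) > grow:
                        assigned.add(nb)
                        cluster.add(nb)
                        queue.append(nb)
            clusters.append(cluster)
    # Final step: add every unassigned neighbouring cell, whatever its significance.
    for cluster in clusters:
        perimeter = {nb for cell in cluster for nb in neighbours(cell, shape)
                     if nb not in assigned}
        cluster |= perimeter
        assigned |= perimeter
    return clusters


grid = [
    [0.5, 0.1, 0.3, 0.2, 0.1],
    [0.2, 2.5, 5.0, 0.4, 0.1],
    [0.1, 0.3, 2.2, 0.2, 0.3],
    [0.2, 0.1, 0.1, 0.1, 4.5],
]
clusters = topocluster_420(grid)
```

In this toy grid the 5.0σ seed collects its two neighbours above 2σ and then their perimeter, while the isolated 4.5σ seed forms a second, smaller cluster.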
An additional calibration using the local cell weighting (LCW) scheme is applied to form clusters whose energy is calibrated at the correct particle-level scale [66]. This weighting scheme classifies energy depositions as either electromagnetic- or hadronic-like using a variety of cluster moments, and accounts for the non-compensating response of the calorimeter, out-of-cluster energy, and for energy deposited in the dead material within the detector.

Table 1. Summary of pile-up mitigation algorithms, jet inputs, and grooming algorithms, the abbreviated names used throughout this work, and the relevant parameters tested for each algorithm. UFOs are introduced in Sect. 5. Algorithms marked with an asterisk (*) were studied, but were not found to produce results significantly different from other configurations. Such results are not presented in these studies.

Particle-flow objects (PFOs)
Particle-flow (PFlow) reconstruction combines track- and calorimeter-based measurements and results in improved jet energy and mass resolution, and improved pile-up stability relative to jets reconstructed from topoclusters alone [32,67]. Double-counting of contributions from the momentum measurement of charged particles in the inner detector and their energy measurement from the calorimeters is removed using a cell-based energy subtraction.
The PFlow algorithm first attempts to match each selected track to a single topocluster in the calorimeter, using topoclusters calibrated to the EM scale, and tracks selected using the 'tight' quality working point [65]. The track momentum and the topocluster position are used to compute the expected energy deposition in the calorimeter by the particle that created the track. It is not uncommon for a single particle to deposit energy in multiple topoclusters. For each track/topocluster system, the PFlow algorithm evaluates the probability that the particle's energy was deposited in more than one topocluster, and may include additional topoclusters in the track/topocluster system if they are necessary to reconstruct the full shower energy. The expected energy deposited in the calorimeter by the particle that produced the track is subtracted, cell-by-cell, from the associated topoclusters. If the associated calorimeter energy following this subtraction is consistent with the expected shower fluctuations of a single particle, the remaining calorimeter energy is removed.
Topoclusters which are not matched to any tracks are assumed to contain energy deposited by neutral particles and are left unmodified. In the cores of jets, particles are often produced at higher energies and in dense environments, decreasing the advantages of using inner-detector-based measurements of charged particles. To account for this degradation of inner tracker performance, the shower subtraction is gradually disabled for tracks with momenta below 100 GeV if the energy E clus deposited in the calorimeter in a cone of size ΔR = 0.15 around the extrapolated track trajectory exceeds a threshold defined relative to E dep, the expected energy deposition from a charged pion. The subtraction is completely disabled for tracks with p T > 100 GeV when this condition is satisfied.
After the PFlow algorithm has run to completion, the collection of particle-flow objects (PFOs) consists of tracks, and both modified and unmodified topoclusters. Charged PFOs which are not matched to the PV are removed in order to reduce the contribution from pile-up; this procedure is referred to as 'Charged Hadron Subtraction' (CHS) [68,69].

Track-CaloClusters (TCCs)
Track-CaloClusters (TCCs) [33] were developed in the context of searches for massive BSM diboson resonances [7]. These constituents combine calorimeter- and inner-detector-based measurements in a manner which is optimised for jet substructure reconstruction performance in the highest-p T jets. Unlike PFlow, which uses the expected energy depositions of single particles to determine the contributions of individual tracks to clusters, the TCCs use the energy information from topoclusters and angular information from tracks.
The TCC algorithm starts by attempting to match each 'loose' track in the event (from both the hard-scatter and pile-up vertices) to topoclusters calibrated to the local hadronic scale in the calorimeter. In the case where one track matches one topocluster, the p T of the TCC object is taken from the topocluster, while its η and φ coordinates are taken from the track. In more complex situations where multiple tracks are matched to multiple topoclusters, several TCC objects are created (where the TCC multiplicity is equal to the track multiplicity): each TCC object is given some fraction of the momentum of the topocluster, where that fraction is determined from the ratios of momenta of the matched tracks. TCC angular properties (η, φ) are taken directly from the unmodified inner detector tracks, and their mass is set to zero.
As in PFlow reconstruction, unmatched topoclusters are included in the TCC objects as unmodified neutral objects.
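The sharing of topocluster momentum among matched tracks can be sketched for the simplest non-trivial case, in which several tracks are matched to a single topocluster. The example assumes that each track receives a fraction of the cluster p T proportional to its own p T; the full many-to-many matching is not reproduced, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Track:
    pt: float   # GeV
    eta: float
    phi: float


@dataclass
class TCC:
    pt: float          # taken from the calorimeter measurement
    eta: float         # taken from the matched track
    phi: float         # taken from the matched track
    mass: float = 0.0  # TCC objects are massless


def build_tccs(cluster_pt, matched_tracks):
    """Create one TCC per matched track, sharing the cluster pT by track-pT ratio."""
    total_track_pt = sum(track.pt for track in matched_tracks)
    return [
        TCC(pt=cluster_pt * track.pt / total_track_pt, eta=track.eta, phi=track.phi)
        for track in matched_tracks
    ]


# One 100 GeV topocluster matched to two tracks with pT of 30 and 10 GeV.
tracks = [Track(30.0, 0.10, 1.00), Track(10.0, 0.12, 1.03)]
tccs = build_tccs(100.0, tracks)
```

The two resulting objects carry 75 GeV and 25 GeV at the track positions, so the energy scale comes from the calorimeter and the angular information from the tracker.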

Jet-input-level pile-up mitigation algorithms
Prior to jet reconstruction, the set of input objects may be preprocessed by one or by a combination of several input-level pile-up mitigation algorithms. When reconstructing jets from topoclusters, these algorithms are applied to the entire set of inputs. When incorporating tracking information, the PV provides an additional, powerful method to reject charged particles from pile-up interactions. In this case, these additional pile-up mitigation algorithms are applied only to the neutral PFOs or TCCs in an event before jet finding.

Constituent subtraction (CS)
Constituent Subtraction [34] is a per-particle method of performing area subtraction [70] on jet input objects. The catchment area [26] of each input object is defined using ghost association: massless particles called 'ghosts' are overlaid on the event uniformly, with $p_{\mathrm{T}}$ satisfying

$$p_{\mathrm{T}}^{g} = A_{g} \times \rho,$$

where $A_{g}$, the area of the ghosts, is set to 0.01 and $p_{\mathrm{T}}^{g}$ corresponds to the expected contribution from pile-up radiation in a small η-φ area of 0.1 × 0.1. For each event, the pile-up energy density $\rho$ is estimated as the median of the $p_{\mathrm{T}}/A$ distribution of the R = 0.4 $k_t$ [71] jets in the event. These jets are reconstructed without a $p_{\mathrm{T}}$ requirement, but are required to be within |η| < 2.0. The total $p_{\mathrm{T}}$ of all of the ghosts is equal to the expected average pile-up contribution, based on the estimated value of $\rho$.
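After the ghosts have been added, the distance $\Delta R_{i,k}$ between each cluster $i$ and ghost $k$ is given by

$$\Delta R_{i,k} = \sqrt{(\eta_i - \eta_k)^2 + (\phi_i - \phi_k)^2}.$$

The cluster-ghost pairs are then sorted in order of ascending $\Delta R_{i,k}$, and the algorithm proceeds iteratively through each $(i, k)$ pair, modifying the $p_{\mathrm{T}}$ of each cluster and ghost by

$$\text{if } p_{\mathrm{T},i} \ge p_{\mathrm{T},k}: \quad p_{\mathrm{T},i} \to p_{\mathrm{T},i} - p_{\mathrm{T},k}, \quad p_{\mathrm{T},k} \to 0; \qquad \text{otherwise:} \quad p_{\mathrm{T},k} \to p_{\mathrm{T},k} - p_{\mathrm{T},i}, \quad p_{\mathrm{T},i} \to 0,$$

until $\Delta R_{i,k} > \Delta R_{\max}$, where $\Delta R_{\max}$ is a free parameter of the algorithm taken to be 0.25 in this study, based on studies of R = 0.4 jet performance [72]. Any ghosts remaining after the subtraction are eliminated.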
In the authors' description of this algorithm, a correction is also applied for the mass of the input object. Since all neutral ATLAS jet inputs are defined to be massless, this correction is unnecessary in the ATLAS implementation.
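A minimal sketch of the iterative subtraction is given below. It assumes that each ghost carries p T equal to the ghost area times ρ, that cluster-ghost distances are measured in the η-φ plane, and that for each pair the smaller of the two p T values is subtracted from both objects until the distance exceeds the maximum allowed value. The dictionary-based event representation, the function names and the toy value of ρ are assumptions for this example.

```python
import math


def delta_r(a, b):
    """Angular distance in the eta-phi plane, with phi wrapped to [-pi, pi]."""
    d_eta = a["eta"] - b["eta"]
    d_phi = math.remainder(a["phi"] - b["phi"], 2.0 * math.pi)
    return math.hypot(d_eta, d_phi)


def constituent_subtraction(clusters, ghosts, dr_max=0.25):
    """Subtract ghost pT from nearby clusters, closest pairs first."""
    clusters = [dict(c) for c in clusters]  # work on copies
    ghosts = [dict(g) for g in ghosts]
    pairs = sorted(
        (delta_r(c, g), i, k)
        for i, c in enumerate(clusters)
        for k, g in enumerate(ghosts)
    )
    for dr, i, k in pairs:
        if dr > dr_max:
            break  # all remaining pairs are further apart
        removed = min(clusters[i]["pt"], ghosts[k]["pt"])
        clusters[i]["pt"] -= removed
        ghosts[k]["pt"] -= removed
    # Ghosts are discarded; clusters with no pT left are dropped.
    return [c for c in clusters if c["pt"] > 0.0]


A_GHOST = 0.01           # ghost area quoted in the text
rho = 20.0               # toy pile-up density in GeV per unit area (assumption)
ghost_pt = A_GHOST * rho  # 0.2 GeV per ghost

clusters = [
    {"pt": 50.0, "eta": 0.0, "phi": 0.0},  # hard cluster
    {"pt": 0.3, "eta": 0.5, "phi": 0.5},   # soft, pile-up-like cluster
]
ghosts = [
    {"pt": ghost_pt, "eta": 0.05, "phi": 0.0},
    {"pt": ghost_pt, "eta": 0.5, "phi": 0.55},
    {"pt": ghost_pt, "eta": 0.55, "phi": 0.5},
    {"pt": ghost_pt, "eta": 2.0, "phi": 2.0},  # too far from any cluster
]
corrected = constituent_subtraction(clusters, ghosts)
```

In this toy event the hard cluster loses only the p T of its single nearby ghost, while the soft cluster is fully absorbed by the two ghosts around it and is removed.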

SoftKiller (SK)
The SoftKiller (SK) [37] algorithm applies a p T cut to input objects. This cut is chosen on an event-by-event basis such that the value of ρ after the selection is approximately zero. To achieve this, the event is divided into an η-φ grid of user-specified length scale, chosen to be 0.6, based on studies of R = 0.4 jet performance [72]. The p T cut is determined in order to make half of the grid spaces empty after it is applied (input objects are removed from all grid cells, not just the half which are empty following SK).
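One way to obtain such a cut is to take the median, over all grid cells, of the hardest input p T in each cell, counting empty cells as zero. The sketch below follows that idea under simplifying assumptions: the grid is built by rounding the number of cells up so that it tiles the acceptance, the acceptance |η| < 2.5 is a choice made for this example, and objects exactly at the threshold are removed. The function names are hypothetical.

```python
import math
from statistics import median


def softkiller_threshold(particles, grid_size=0.6, eta_max=2.5):
    """pT cut: the median over grid cells of the hardest pT found in each cell.

    `particles` is a list of (pt, eta, phi) tuples; empty cells count as pT = 0.
    """
    n_eta = math.ceil(2.0 * eta_max / grid_size)
    n_phi = math.ceil(2.0 * math.pi / grid_size)
    hardest = [0.0] * (n_eta * n_phi)
    for pt, eta, phi in particles:
        if abs(eta) >= eta_max:
            continue
        i_eta = min(int((eta + eta_max) / (2.0 * eta_max) * n_eta), n_eta - 1)
        i_phi = int((phi % (2.0 * math.pi)) / (2.0 * math.pi) * n_phi) % n_phi
        cell = i_eta * n_phi + i_phi
        hardest[cell] = max(hardest[cell], pt)
    return median(hardest)


def apply_softkiller(particles, **grid_options):
    """Remove every input whose pT does not exceed the event-level threshold."""
    cut = softkiller_threshold(particles, **grid_options)
    return [p for p in particles if p[0] > cut]


# Toy event: one 1 GeV pile-up-like input at the centre of every grid cell,
# plus a single hard 50 GeV input.
N_ETA, N_PHI = 9, 11  # grid dimensions for grid_size = 0.6 and |eta| < 2.5
particles = [
    (1.0, -2.5 + (i + 0.5) * 5.0 / N_ETA, (j + 0.5) * 2.0 * math.pi / N_PHI)
    for i in range(N_ETA)
    for j in range(N_PHI)
]
particles.append((50.0, 0.0, 1.0))
kept = apply_softkiller(particles)
```

Because the cut is derived from the whole event, it removes the soft inputs everywhere, including in the cell that contains the hard input.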
To account for detector-level effects, where input objects may not consist purely of hard-scatter or pile-up contributions (see appendix), the best performance is achieved by applying some form of area subtraction to input objects before applying SK. In these studies, SK is always applied to inputs after the CS algorithm; this combination is indicated as 'CS + SK'.
An alternative approach to assigning areas to jet input objects is based on Voronoi tessellation [36] and was studied both in isolation and in conjunction with the SoftKiller algorithm. Both variants of this alternative were found to perform similarly to the CS + SK results presented here.

Pile-up per particle identification (PUPPI)
'Pile-up per particle identification', or PUPPI [38], is a pile-up mitigation algorithm which assigns each input object $i$ a likelihood to have originated from a pile-up interaction based on its kinematic properties and proximity to charged hard-scatter particles matched to the event's PV. This likelihood is given by

$$\alpha_i = \log \sum_{j \neq i} \frac{p_{\mathrm{T},j}}{\Delta R_{ij}} \times \Theta\left(R_{\min} \le \Delta R_{ij} \le R_0\right),$$

where the index $j$ tracks the charged inputs matched to the PV, $R_0$ is the maximum radial distance at which inputs may be matched to each other, $R_{\min}$ is the minimum radial distance of matching, $\Delta R_{ij}$ is the angular distance between an input object and a charged hard-scatter particle, and $\Theta$ is the Heaviside step function. The value of $R_{\min}$ is generally taken to be very small, and is chosen to be 0.001 in these studies. The value of $R_0$ is chosen to be 0.3. Once $\alpha$ has been calculated for all input objects, the following quantity is determined:

$$\chi_i^2 = \Theta\left(\alpha_i - \bar{\alpha}_{\mathrm{PU}}\right) \times \frac{\left(\alpha_i - \bar{\alpha}_{\mathrm{PU}}\right)^2}{\sigma_{\mathrm{PU}}^2},$$

where $\bar{\alpha}_{\mathrm{PU}}$ is the mean value of $\alpha$ for all charged pile-up input objects in the event, and $\sigma_{\mathrm{PU}}$ is the RMS of that same distribution. The four-momentum of each neutral input $i$ is then weighted by

$$w_i = F_{\chi^2}\left(\chi_i^2\right),$$

where $F_{\chi^2}$ is the cumulative distribution function of the $\chi^2$ distribution, eliminating all neutral inputs $i$ whose calculated value of $\alpha_i$ is less than $\bar{\alpha}_{\mathrm{PU}}$.
In order to suppress additional noise, a $p_{\mathrm{T}}$ cut is applied to the remaining input objects after they have been reweighted. This cut is dependent on the number of reconstructed primary vertices ($N_{\mathrm{PV}}$), and is determined by

$$p_{\mathrm{T}}^{\mathrm{cut}} = a + b \times N_{\mathrm{PV}},$$

where the parameters $a$ and $b$ are user-specified. For these studies, the parameters are chosen to be $a$ = 200 MeV and $b$ = 14 MeV, based on studies of the R = 0.4 PFlow jet energy resolution.
While PUPPI could technically be applied to topoclusters, the principles of the algorithm depend strongly on the matching of neutral input objects to nearby charged particles from the hard-scatter event. It is therefore more effective for particle-flow-type algorithms. Due to the large number of free parameters, and since it has only been optimised for ATLAS PFlow jets with R = 0.4, PUPPI is only applied to PFlow jets.
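The per-object PUPPI quantities can be sketched as follows. Several choices here are assumptions rather than statements about the ATLAS implementation: the local shape variable is taken in the form of the original PUPPI proposal [38] (the logarithm of the summed p T/ΔR of nearby charged PV-matched inputs), the χ² distribution is assumed to have one degree of freedom, and the p T cut is assumed to be linear in N PV. The function names and toy inputs are hypothetical.

```python
import math


def delta_r(a, b):
    """Angular distance in the eta-phi plane, with phi wrapped to [-pi, pi]."""
    d_eta = a["eta"] - b["eta"]
    d_phi = math.remainder(a["phi"] - b["phi"], 2.0 * math.pi)
    return math.hypot(d_eta, d_phi)


def puppi_alpha(obj, charged_pv, r_min=0.001, r0=0.3):
    """Local shape variable: log of summed pT/dR of charged PV inputs within r0."""
    total = 0.0
    for charged in charged_pv:
        if charged is obj:
            continue
        dr = delta_r(obj, charged)
        if r_min <= dr <= r0:
            total += charged["pt"] / dr
    return math.log(total) if total > 0.0 else -math.inf


def puppi_weight(alpha, alpha_pu_mean, alpha_pu_rms):
    """Weight from the chi2 CDF (one degree of freedom assumed)."""
    if alpha <= alpha_pu_mean:
        return 0.0  # pile-up-like: the input is eliminated
    chi2 = (alpha - alpha_pu_mean) ** 2 / alpha_pu_rms ** 2
    return math.erf(math.sqrt(chi2 / 2.0))  # CDF of chi2 with 1 d.o.f.


def puppi_pt_cut(n_pv, a=0.200, b=0.014):
    """Assumed linear pT cut in GeV, using the a and b values quoted in the text."""
    return a + b * n_pv


neutral = {"pt": 2.0, "eta": 0.0, "phi": 0.0}
charged_pv = [
    {"pt": 10.0, "eta": 0.1, "phi": 0.0},  # within r0 of the neutral input
    {"pt": 50.0, "eta": 1.0, "phi": 0.0},  # too far away to contribute
]
alpha = puppi_alpha(neutral, charged_pv)
weight = puppi_weight(alpha, alpha_pu_mean=3.0, alpha_pu_rms=1.0)
```

With one degree of freedom, an input whose α lies one RMS above the pile-up mean receives a weight of about 0.68, and the weight approaches one as α moves further from the pile-up distribution.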

Trimming
Trimming [41] was designed to remove contamination from soft radiation in the jet by excluding regions of the jet where the energy flow originates mainly from the underlying event, pile-up, or initial-state radiation (ISR), in order to improve the resolution of the jet energy and mass measurements. In Run 1 [31], it was also found to be effective in mitigating the effects of pile-up on large-R jets. To trim a large-R jet, the jet constituents are reclustered into subjets of a user-specified radius R sub using the k t algorithm. Subjets with p T less than some user-specified fraction f cut of the p T of the original ungroomed jet are discarded: their constituents are removed from the final groomed jet.
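The subjet selection at the heart of trimming is a single comparison per subjet. The sketch below assumes that the reclustering into k t subjets has already been performed and that only the subjet p T values are needed; the function name and toy values are hypothetical.

```python
def trim(subjet_pts, jet_pt, f_cut=0.05):
    """Keep the subjets carrying at least a fraction f_cut of the ungroomed jet pT."""
    return [pt for pt in subjet_pts if pt >= f_cut * jet_pt]


# Toy 500 GeV jet with four R_sub = 0.2 subjets (pT in GeV); the threshold is 25 GeV.
kept_subjets = trim([300.0, 180.0, 15.0, 5.0], jet_pt=500.0)
```

The threshold is always computed from the p T of the original ungroomed jet, so it does not change as subjets are discarded.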

Pruning
Pruning [42] proposes a modification of the jet clustering sequence, which removes splittings that are assessed as likely to pull in soft radiation from pile-up interactions and the underlying event. This is achieved by determining a 'pruning radius' such that hard prongs fall into separate subjets, while discarding softer radiation outside of these prongs. The constituents of the large-R jet are reclustered using the Cambridge-Aachen (C/A) algorithm [73,74] to form an angle-ordered cluster sequence. At each step of the clustering sequence, the softer subjet is discarded if it is either too soft or wide-angled, enforced by requiring

$$z > z_{\mathrm{cut}} \quad \text{and} \quad \Delta R_{12} < R_{\mathrm{cut}} \times \frac{2 M_{12}}{p_{\mathrm{T},12}},$$

where $\Delta R_{12}$, $M_{12}$, and $p_{\mathrm{T},12}$ are respectively the angular distance, the mass, and the transverse momentum of the subjet pair at a given step in the clustering sequence, and $z = \min(p_{\mathrm{T},1}, p_{\mathrm{T},2}) / (p_{\mathrm{T},1} + p_{\mathrm{T},2})$. The parameters $R_{\mathrm{cut}}$ and $z_{\mathrm{cut}}$ are user-defined, and respectively control the amount of wide-angled and soft radiation which is removed by the pruning algorithm.

Soft-drop (SD)
Soft-drop [45] is a technique for removing soft and wide-angle radiation from a jet. In this algorithm, the constituents of the large-R jet are reclustered using the C/A algorithm, creating an angle-ordered jet clustering history. Then, the clustering sequence is traversed in reverse (starting from the widest-angled radiation and iterating towards the jet core). At each step in the clustering sequence, the kinematics of the splitting are tested with the condition

$$\frac{\min(p_{\mathrm{T},1}, p_{\mathrm{T},2})}{p_{\mathrm{T},1} + p_{\mathrm{T},2}} > z_{\mathrm{cut}} \left(\frac{\Delta R_{12}}{R}\right)^{\beta},$$

where the subscripts 1 and 2 respectively denote the harder and softer branches of the splitting, $\Delta R_{12}$ is their angular separation, $R$ is the jet radius parameter, and the parameters $z_{\mathrm{cut}}$ and $\beta$ dictate the amount of soft and wide-angled radiation which is removed. If the splitting fails this condition, the lower-$p_{\mathrm{T}}$ branch of the clustering history is removed, and the declustering process is repeated on the higher-$p_{\mathrm{T}}$ branch. If the condition is satisfied, the process terminates and the remaining constituents form the groomed jet.
If β = 0, SD suppresses radiation purely based on the p T, while larger values of β allow more soft radiation to remain within the groomed jet when it is sufficiently collinear. SD with β = 0 is equivalent to the modified Mass Drop Tagger (mMDT) algorithm [31,75]. SD grooming has an intrinsic quality which is not shared by the trimming or pruning algorithms: certain jet substructure observables are calculable beyond leading-logarithm accuracy following the application of SD [75-81].
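The declustering loop can be sketched on a toy binary clustering tree. The example assumes the usual form of the soft-drop condition, in which the momentum fraction of the softer branch is compared with z cut scaled by the angular separation (in units of the jet radius) raised to the power β. The `Node` class, the hand-built tree and the explicit parent p T values are hypothetical and stand in for a real C/A clustering history.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    pt: float
    eta: float
    phi: float
    left: Optional["Node"] = None   # the two branches of the last clustering step,
    right: Optional["Node"] = None  # or None for a single constituent


def soft_drop(node, z_cut=0.1, beta=0.0, r0=1.0):
    """Follow the harder branch until a splitting passes the soft-drop condition."""
    while node.left is not None and node.right is not None:
        hard, soft = sorted((node.left, node.right), key=lambda n: n.pt, reverse=True)
        z = soft.pt / (hard.pt + soft.pt)
        dr = math.hypot(hard.eta - soft.eta,
                        math.remainder(hard.phi - soft.phi, 2.0 * math.pi))
        if z > z_cut * (dr / r0) ** beta:
            return node  # hard splitting found: grooming stops here
        node = hard      # drop the softer branch and continue declustering
    return node


# Toy jet: a two-prong core (300 + 180 GeV) plus a 20 GeV emission at dR = 0.3.
core = Node(480.0, 0.0, 0.0,
            left=Node(300.0, -0.1, 0.0), right=Node(180.0, 0.15, 0.0))
jet = Node(500.0, 0.0, 0.0, left=core, right=Node(20.0, 0.3, 0.0))

groomed_beta0 = soft_drop(jet, beta=0.0)  # z = 0.04 fails z_cut = 0.1
groomed_beta1 = soft_drop(jet, beta=1.0)  # threshold is 0.1 * 0.3 = 0.03
```

With β = 0 the 20 GeV emission is removed and the groomed jet is the two-prong core, whereas with β = 1 the same emission is close enough to the core to pass the condition and is retained.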

Recursive soft-drop (RSD) and bottom-up soft-drop (BUSD)
The standard soft-drop algorithm aims to find the first hard splitting in the jet clustering history in order to define a groomed jet. In the case of a multi-pronged decay, this treatment may not be sufficient to remove enough soft radiation from the jet, since the SD condition may be satisfied before all of this radiation has been removed. A recursive extension of the SD algorithm ('recursive soft-drop,' or RSD) has been proposed [46], in which the algorithm continues recursively along the harder branch of the C/A clustering sequence until N hard splittings have been found. The case of N = 1 is equivalent to the standard SD algorithm, while for larger values of N , a larger fraction of the jet may be traversed by the grooming algorithm. When N = ∞, the entire C/A sequence is traversed by the grooming algorithm regardless of the number of hard splittings found.
Bottom-up soft-drop (BUSD) [46] instead incorporates the SD criteria within the jet clustering algorithm, similar to pruning. In these studies, the 'local' version of BUSD is implemented, which is applied after initial jet reconstruction. Using this approach, jets are reconstructed with the anti-k t algorithm, and then reclustered using a modified version of the C/A algorithm, where particles i and j with the smallest distance $d_{ij} = \Delta R_{ij}/R_0$ are combined to create a new pseudojet given by
$$p_{ij} = \begin{cases} \max(p_i, p_j) & \text{if } \dfrac{\min(p_{\mathrm{T},i}, p_{\mathrm{T},j})}{p_{\mathrm{T},i} + p_{\mathrm{T},j}} < z_{\mathrm{cut}} \left( \dfrac{\Delta R_{ij}}{R_0} \right)^{\beta}, \\ p_i + p_j & \text{otherwise,} \end{cases}$$
where $\max(p_i, p_j)$ denotes the four-momentum of the higher-p T particle of the pair. The results of applying local BUSD are expected to be similar to those of RSD with N = ∞, since both algorithms begin with the same set of constituents per jet and groom the entire C/A clustering sequence.
Other configurations for the SD family of algorithms were studied, including β = 2 grooming, but were not found to give results significantly different from those reported in detail.

Performance metrics
In order to survey the relative performance of all considered large-R jet definitions, several metrics must be established which probe relevant aspects of their behaviour in the context of large-R jet reconstruction and calibration by ATLAS. It is not feasible to calibrate each of the definitions studied (even with a simulation-based approach, as in Sect. 7), and so these metrics have been chosen in order to be robust against differences caused by calibration. The metrics selected include the tagging performance of high-p T W bosons and top quarks, the stability of the jets in the presence of pile-up interactions, and the degree to which a jet definition's mass scale depends on the signal- or background-like substructure of the jet.
In this section, the behaviour of each metric is illustrated using a reduced list of jet definitions that have been selected to highlight the interplay between different aspects of jet reconstruction. For each metric, jets reconstructed from topological clusters, particle-flow and track-calocluster input objects are compared, with and without pile-up mitigation. Two grooming algorithms are also compared for each jet input: trimming with R sub = 0.2 and f cut = 0.05, and soft-drop with β = 1.0 and z cut = 0.1. The trimming algorithm is chosen because it is the current baseline definition used by ATLAS. The soft-drop algorithm is chosen as an alternative which has demonstrated good performance, as is shown in Sect. 6.
Results of the complete survey of all jet definitions summarised in Table 1 are provided in Sect. 6.

Tagging performance
Many analyses using large-R jets rely on a tagger to distinguish between different types of jets, such as distinguishing between the decay of a high-p T , hadronically decaying top quark and a jet originating from a high-energy quark or gluon. Such boosted-particle taggers range in complexity from simple mass cuts to complex machine-learning algorithms [82][83][84]. While the complete optimisation of a jet tagger is outside the scope of this work, it is important to compare the tagging performance of different jet definitions in terms of their background rejection (defined as the reciprocal of the background-jet tagging efficiency) at fixed signal-jet tagging efficiency. This may be done using a simple tagger based on the jet mass and a jet substructure (JSS) observable. In order to study the tagging performance for different jet topologies, taggers are created for high-p T W bosons and top quarks by combining the jet mass with another jet substructure observable which is sensitive to either two- or three-pronged signal jet topologies.
The jet mass is defined by
$$m = \sqrt{\Big( \sum_i E_i \Big)^2 - \Big( \sum_i \vec{p}_i \Big)^2},$$
where the sums run over the constituents i of the jet, with energies $E_i$ and momenta $\vec{p}_i$. It is typically one of the most powerful variables that can be used to discriminate between different types of jets.
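As a concrete illustration, the jet mass can be computed from the four-momenta of the constituents as the invariant mass of their sum. The sketch below assumes constituents are given as (E, p x , p y , p z ) tuples; this representation and the function name are choices made for the example only.

```python
from math import sqrt


def jet_mass(constituents):
    """Invariant mass of the summed constituent four-momenta.

    `constituents` is an iterable of (E, px, py, pz) tuples (an assumed format).
    """
    e = px = py = pz = 0.0
    for e_i, px_i, py_i, pz_i in constituents:
        e += e_i
        px += px_i
        py += py_i
        pz += pz_i
    m2 = e * e - (px * px + py * py + pz * pz)
    # Guard against small negative values caused by floating-point rounding.
    return sqrt(max(m2, 0.0))
```

Two massless constituents that are exactly collinear give zero mass, while the mass grows with their opening angle, which is why wide-angle soft contributions have a visible effect on this observable.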
To tag boosted W decays, which have a two-pronged structure, the D 2 observable [85][86][87] is used with a choice of angular exponent β = 1.0. This observable is a ratio of three-point to two-point energy-energy correlation functions which has been used by ATLAS in W taggers since Run 1 [39,82].
For boosted top quark decays, which have a three-pronged structure, τ 32 with the winner-take-all axis configuration [88,89] is used. This observable is a ratio of two N-subjettiness variables, which tests the compatibility of a jet's substructure with a particular N -pronged hypothesis. ATLAS has incorporated τ 32 into its top taggers, whether simple or complex, since Run 1 [59,82].
Unlike a mass-only tagger, where more aggressive grooming can improve the jet mass resolution at the cost of grooming away additional information contained within a jet's soft radiation, a mass + JSS tagger relies on such soft radiation to achieve better background rejection. Such taggers are a more realistic approximation to the expected future tagging performance of any given jet definition (which will use more sophisticated techniques), and are amenable to this survey of many jet definitions.
For both the W and top taggers, the tagging algorithm proceeds similarly: first, a fixed signal-efficiency (ε sig ) mass window is selected, where the window is defined to be the minimum mass range which contains 68% of the signal mass distribution. This window should select the signal jet mass peak. A one-sided cut is then applied to D 2 or τ 32 , and background rejection (1/ε bkg ) is compared at a fixed signal efficiency taken to be ε sig = 50%. This signal efficiency working point is representative of taggers used by ATLAS in physics analysis, and the results were not found to depend strongly on the working point which was selected. The relative performance of various jet definitions in terms of their background rejection at a fixed signal efficiency point was noted to typically provide a consistent ordering of jet definitions before and after applying a simulation-based calibration, and so this metric was selected instead of possible alternatives such as the Receiver Operating Characteristic (ROC) curve integral.
The background rejection for the boosted W boson tagger is shown as a function of signal tagging efficiency in Fig. 1 for two p T bins: a low-p T bin (300 GeV < $p_{\mathrm{T}}^{\mathrm{true,\,ungroomed}}$ < 500 GeV), and a high-p T bin (1000 GeV < $p_{\mathrm{T}}^{\mathrm{true,\,ungroomed}}$ < 1500 GeV), where kinematic requirements are placed on the p T of the ungroomed particle-level jet which is associated with the detector-level jet under study (Sect. 3.1.1). The low-p T bin represents the regime where the W decay products are boosted just enough to be contained within a single large-R jet, while the high-p T bin represents the regime where the decay products are more collimated and may begin to merge. The performance in these two regions is expected to be different due to detector effects and algorithmic differences. Similarly, the background rejection of the top tagger is shown in Fig. 2, except the lower p T bin is chosen to be 500 GeV < $p_{\mathrm{T}}^{\mathrm{true,\,ungroomed}}$ < 1000 GeV, since the larger mass of the top quark results in less collimation of its decay products.
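The two-step tagger construction can be sketched as follows. This is an illustrative outline under stated assumptions, not the analysis code: jets are represented as (mass, JSS value) pairs, the one-sided cut is taken to be an upper cut on the substructure observable, and all function names are hypothetical.

```python
from math import ceil


def smallest_window(masses, fraction=0.68):
    """Narrowest mass interval containing `fraction` of the given masses."""
    xs = sorted(masses)
    k = ceil(fraction * len(xs))
    start = min(range(len(xs) - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def upper_cut_for_efficiency(signal, window, target=0.50):
    """JSS upper cut giving an overall signal efficiency of `target`."""
    lo, hi = window
    in_window = sorted(jss for m, jss in signal if lo <= m <= hi)
    n_keep = int(target * len(signal))
    if not 0 < n_keep <= len(in_window):
        raise ValueError("target efficiency not reachable inside the mass window")
    return in_window[n_keep - 1]


def efficiency(jets, window, cut):
    """Fraction of jets passing both the mass window and the JSS cut."""
    lo, hi = window
    n_pass = sum(1 for m, jss in jets if lo <= m <= hi and jss <= cut)
    return n_pass / len(jets)


def background_rejection(background, window, cut):
    """Reciprocal of the background efficiency (infinite if nothing passes)."""
    eff = efficiency(background, window, cut)
    return float("inf") if eff == 0 else 1.0 / eff
```

The window and the cut are both derived from the signal sample alone; the background sample only enters when the rejection is evaluated, so different jet definitions are compared at the same signal efficiency.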
Better alternatives to the baseline topocluster jet definition are clearly visible. At low p T , PFlow reconstruction results in the best performance for W boson and top tagging, while TCCs have a lower background rejection than topocluster jets. At high p T , TCCs provide a significantly better background rejection than the other options, although PFlow still provides an improvement over topocluster reconstruction.
The application of CS + SK pile-up mitigation has very little effect for the high-p T jets, but for the low-p T W tagger, it significantly improves the background rejection for soft-drop jets, which are more susceptible to pile-up than trimmed jets. This effect is seen for all three jet input types, but it is pronounced for topocluster inputs, which do not use tracking information to remove pile-up. Top tagging performance benefits more from adopting soft-drop grooming than W tagging: background rejection increases when tagging top

Pile-up stability
Two metrics are used to study the pile-up stability of jet definitions in order to determine which definitions are sufficiently insensitive to pile-up. The first quantifies the effect on the jet mass scale by studying how the W boson mass peak position changes as a function of pile-up, and provides a handle with which to assess the impact of pile-up on a jet's hard structure.
The second quantifies the impact on substructure observables by studying the pile-up dependence of W boson tagging efficiency, in order to quantify how pile-up contributions alter the soft radiation patterns within jets.
A related study of the effects of pile-up on topocluster reconstruction is presented in an appendix of this publication, utilising a new technique which propagates particle-level information about hard-scatter and pile-up energy depositions through the ATLAS reconstruction procedure.

Pile-up stability of the W boson jet mass peak position
Jet substructure observables such as the jet mass are particularly sensitive to pile-up; the contribution of pile-up to the jet mass scales approximately with the jet radius cubed [90]. Figure 3 shows a subset of the trimmed mass distribution of W jets in bins of N PV for various jet input object types, demonstrating that pile-up can visibly alter the average value and width of the jet mass distribution. This effect is quantified using a simple metric. In bins of N PV , the core of the W mass peak is iteratively fit with a Gaussian distribution. The trend of the fitted peak position versus N PV is then fit with a line. The slope of this line is a measure of the sensitivity of the jet mass to PU: a larger magnitude indicates larger pile-up sensitivity. The position of the W jet mass peak was found to be a more resilient metric when studying the performance of uncalibrated jet definitions than other possible choices, such as properties of the jet mass response.
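The peak-position metric can be sketched as below. Two simplifications are assumed for the sake of a dependency-free example: the iterative Gaussian fit to the core is replaced by an iterated mean inside a window of fixed width around the current peak estimate, and the width is estimated once from the interquartile range. The function names and the window size of 1.5σ are hypothetical choices.

```python
from statistics import fmean, median, quantiles


def core_peak(values, n_sigma=1.5, max_iter=50, tol=1e-6):
    """Estimate the peak position from the core of a distribution.

    Stand-in for an iterative Gaussian fit: the mean is recomputed inside
    a window of +/- n_sigma * sigma around the current estimate.
    """
    q1, _, q3 = quantiles(values, n=4)
    sigma = (q3 - q1) / 1.349          # Gaussian-equivalent width from the IQR
    mu = median(values)
    for _ in range(max_iter):
        core = [v for v in values if abs(v - mu) <= n_sigma * sigma]
        if not core:
            break
        new_mu = fmean(core)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


def fit_line(xs, ys):
    """Unweighted least-squares straight line; returns (slope, intercept)."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def peak_slope(masses_by_npv):
    """Slope of the fitted peak position versus N_PV bin centre."""
    centres = sorted(masses_by_npv)
    peaks = [core_peak(masses_by_npv[c]) for c in centres]
    return fit_line(centres, peaks)[0]
```

Here `masses_by_npv` maps each N PV bin centre to the list of W-jet masses in that bin; a slope close to zero corresponds to a pile-up-stable mass scale.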
The results of this fitting procedure are provided in Fig. 4 for the reduced set of jet definitions. The application of CS + SK pile-up mitigation is shown to stabilise trends in topocluster and PFlow jets, even for jet grooming algorithms which are most sensitive to the effects of pile-up such as soft-drop with topocluster jets. The fitted value of the W boson mass peak position decreases as a function of N PV for TCCs. This is related to TCC cluster splitting: as the number of pile-up interactions increases, the number of pile-up tracks also increases. Since these tracks are included in the energy-sharing step of the TCC algorithm, topoclusters are divided into more parts, and more energy is removed. Unlike PFlow and topocluster jet reconstruction, the pile-up stability of TCCs deteriorates after the application of CS + SK. Uncorrected PFlow and TCC jet reconstruction are less sensitive to pile-up than topocluster inputs, since they are able to remove the charged pile-up component via CHS.

Pile-up stability of a simple tagger
The second metric of pile-up stability quantifies the effect of pile-up on the tagging efficiency, which is impacted more by contributions from soft radiation to the tails of jet substructure observables. The D 2 variable is particularly sensitive to soft radiation, and so a W tagger is defined using the jet mass and D 2 (Sect. 4.1). For a sample of events with N PV < 15, a mass cut which results in a 68% signal efficiency is found, and then the D 2 cut that results in an overall signal efficiency of 50% is determined. Then, in bins of N PV , the signal efficiency of applying these cuts is evaluated. These signal efficiencies are plotted as a function of N PV and the trend is fit with a line. The slope of this line is indicative of pile-up sensitivity in the soft jet substructure of the jet definition. These slopes are shown for the reduced set of jet definitions in Fig. 5.
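The evaluation of this second metric can be sketched as follows, assuming the mass window and the D 2 cut have already been fixed on the low-N PV reference sample. Jets are represented as (N PV , mass, D 2 ) tuples and the D 2 requirement is taken to be an upper cut; these choices, the binning and the function names are assumptions of the example.

```python
def efficiency_vs_npv(jets, mass_window, d2_cut, npv_edges):
    """Signal efficiency of fixed cuts in bins of N_PV.

    Returns a list of (bin centre, efficiency) points; empty bins are skipped.
    """
    lo, hi = mass_window
    points = []
    for left, right in zip(npv_edges[:-1], npv_edges[1:]):
        in_bin = [(m, d2) for npv, m, d2 in jets if left <= npv < right]
        if not in_bin:
            continue
        n_pass = sum(1 for m, d2 in in_bin if lo <= m <= hi and d2 <= d2_cut)
        points.append((0.5 * (left + right), n_pass / len(in_bin)))
    return points


def fitted_slope(points):
    """Slope of an unweighted least-squares line through (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return sxy / sxx
```

A negative slope means that the tagger loses signal efficiency as pile-up increases; the cuts themselves are never re-tuned per bin, which is what makes the slope a measure of stability.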
As pile-up levels increase, the signal efficiency of the W tagger tends to decrease, although the opposite behaviour is often observed for TCC jets. Similarly to what was found when studying the W mass peak position metric (Sect. 4.2.1), topocluster inputs are the least stable. After pile-up mitigation, the pile-up stability of all inputs, including TCCs, improves. The trends in stability as a function of grooming algorithm are the same as for the W mass peak position.

Topological sensitivity
ATLAS calibrates large-R jets using a procedure which involves simulation-based and in situ methods [91]. For the simulation-based calibration, the average jet energy and mass scale in reconstructed jets are calibrated to the average scale of jets at particle level, using a sample of jets originating from light quarks and gluons (Sect. 7.1). These light-quark- and gluon-derived calibrations are also currently applied to all jets, including to signal jets (e.g. W /Z /H /t jets). Dependence of the jet energy and mass scale on the progenitor of the jet is undesirable: if the jet mass scale for signal and background jets with similar kinematics is different, then the signal jets will receive an incorrect calibration factor.
In order to examine the topology dependence of the jet mass scale for different jet definitions, the ratio of the mean value of the uncalibrated jet mass response, $R_m = m^{\mathrm{reco}}/m^{\mathrm{true}}$, for signal W jets to that of background jets is constructed within a bin of large-R jet p T , η and mass. Deviations from unity will result in non-closure in the mass response for signal jets following calibration (Sect. 7.1). This effect is relevant at low p T , where W jets may be contained within an R = 1.0 jet, but top quarks are not; therefore, only W jets and background jets are considered in this context. The baseline topocluster-based trimmed large-R jet definition used by the ATLAS experiment exhibits a difference for signal jets of 4% by this metric, and deviations from unity of 4% or less have not been found to be problematic at later stages of the calibration workflow [91], given the current level of calibration precision.
Figure 6 shows the jet mass response for signal and background jets built from topological clusters and groomed with either the trimming or soft-drop grooming algorithms. The low-p T bin, where this topological effect is most pronounced, is shown. A larger sensitivity to the signal- or background-like nature of the jet is observed for soft-drop grooming, which retains more soft radiation. The application of pile-up mitigation can exacerbate topological differences in the jet mass scale by altering the distribution of soft jet constituents differently depending on the jet's signal- or background-like topology.

Unified flow objects (UFOs)
After observing the behaviour of the jet input objects currently used by ATLAS in physics analyses (topoclusters, PFOs and TCCs), it is clear even from the reduced set of jet definitions (Sect. 4) that no single jet definition is optimal according to all metrics. While TCCs significantly improve tagging performance at high p T , their performance is typically worse than the baseline topocluster-based trimmed jet definition at low p T , and they are more sensitive to pileup than other definitions. Jets reconstructed from PFOs can improve on the baseline definition for the entire p T range, but their tagging performance is significantly worse than that of TCC jets at high p T when given the same grooming algorithm.
The relative performance of these jet definitions can be understood by reflecting on how different inputs are reconstructed. For low-p T particles, PFOs are designed to improve the correspondence between particles and reconstructed objects. However, as the particle p T increases or the environment close to the particle becomes dense, the inner detector's momentum resolution deteriorates, and so the PFlow subtraction algorithm is gradually disabled in order to avoid degradation of the jet energy resolution.
The cluster splitting scheme used for TCCs does not utilise a detailed understanding of the correlation between tracks and clusters, and instead is designed to resolve many (charged) particles without double counting their energy. When splitting low-energy topoclusters, this can result in an incorrect redistribution of the cluster's energy, while for high-energy clusters, the ability to resolve many particles increases the relative tagging performance of TCCs over other definitions. TCCs exhibit pile-up instabilities at low p T , where the mass scale decreases as the number of pile-up interactions increases. This trend is the opposite of what is observed for jets reconstructed from topoclusters and PFOs, and occurs because the TCC algorithm splits clusters into more components when additional tracks from pile-up interactions are present in the reconstruction procedure.
These observations motivate the development of a new jet input object, which combines desirable aspects of PFO and TCC reconstruction: the unified flow object (UFO).
The UFO reconstruction algorithm is illustrated in Fig. 7. The process begins by applying the standard ATLAS PFlow algorithm (Sect. 3.1.4). Charged PFOs which are matched to pile-up vertices are removed. The remaining PFOs are classified into different categories: neutral PFOs, charged PFOs which were used to subtract energy from a topocluster, and charged PFOs for which no subtraction was performed due to their high momentum or being located in a dense environment. Jet-input-level pile-up mitigation algorithms may now be applied to the neutral PFOs if desired. A modified version of the TCC splitting algorithm is then applied to the remaining PFOs: only tracks from the hard-scatter vertex are used as input to the splitting algorithm, in order to avoid pile-up instabilities. Any tracks which have been used for PFlow subtraction are not considered, as they have already been well-matched and their expected contributions have been subtracted from the energy in the calorimeter. The TCC algorithm then proceeds as described in Sect. 3.1.5, using the modified collection of tracks to split neutral and unsubtracted charged PFOs instead of topoclusters. This approach provides the maximum benefit of PFlow subtraction at lower particle p T , and cluster splitting where the benefit is maximal at high particle p T .
The performance of UFOs is illustrated in Figs. 8 and 9 according to the same metrics as for other jet input objects in Sect. 4. The increased tagging performance of UFOs is demonstrated across both the low and high p T ranges in Fig. 8, where their performance is superior to that of TCC jets at high p T , and becomes similar to that of PFlow jets as p T decreases. UFOs are naturally pile-up-stable due to the inclusion of only charged-particle tracks matched to the primary vertex, similar to the ATLAS PFlow algorithm. Figure 9 demonstrates the additional stability that an input-level pile-up mitigation algorithm such as CS + SK can offer when it is applied to neutral particles (calorimeter deposits), especially at low p T .
Fig. 6 Distribution of the jet mass response in W jets and q/g jets reconstructed from topoclusters. The mass response is constructed following application of the (a) trimming (R sub = 0.2, f cut = 0.05) or (b) soft-drop (β = 1.0, z cut = 0.1) grooming algorithms at both truth and detector level. Jet p T and η selections are made using the ungroomed particle-level large-R jet matched to each of the groomed detector-level large-R jets. The uncertainties from the fits are typically less than 0.005. A particle-level mass-window cut with 68% signal efficiency is applied to both the groomed signal and background jets.
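The bookkeeping performed before the splitting step can be sketched as follows. This is a schematic outline only: the `Track` and `PFO` classes, their boolean flags and the function name are hypothetical, and the cluster splitting itself (Sect. 3.1.5) is not implemented here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Track:
    pt: float
    from_hard_scatter: bool       # matched to the hard-scatter vertex
    used_for_subtraction: bool    # already consumed by the PFlow subtraction


@dataclass
class PFO:
    pt: float
    charged: bool
    from_pileup_vertex: bool = False   # only meaningful for charged PFOs
    subtracted: bool = False           # charged PFO used to subtract calorimeter energy


def prepare_ufo_inputs(
    pfos: List[PFO],
    tracks: List[Track],
    neutral_mitigation: Optional[Callable[[List[PFO]], List[PFO]]] = None,
) -> Tuple[List[PFO], List[PFO], List[Track]]:
    """Select the objects and tracks that enter the modified splitting step."""
    # 1. Remove charged PFOs matched to pile-up vertices.
    kept = [p for p in pfos if not (p.charged and p.from_pileup_vertex)]
    # 2. Classify the remaining PFOs.
    neutral = [p for p in kept if not p.charged]
    charged_subtracted = [p for p in kept if p.charged and p.subtracted]
    charged_unsubtracted = [p for p in kept if p.charged and not p.subtracted]
    # 3. Optionally apply input-level pile-up mitigation to the neutral PFOs.
    if neutral_mitigation is not None:
        neutral = neutral_mitigation(neutral)
    # 4. Only hard-scatter tracks not used for PFlow subtraction may split objects.
    splitting_tracks = [t for t in tracks
                        if t.from_hard_scatter and not t.used_for_subtraction]
    # Subtracted charged PFOs pass through unchanged; the rest are split.
    return charged_subtracted, neutral + charged_unsubtracted, splitting_tracks
```

Passing a function such as a CS + SK implementation as `neutral_mitigation` corresponds to the optional mitigation step; leaving it as `None` gives the unmitigated inputs.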
The topological dependence of UFOs is not enhanced relative to the other jet definitions previously studied, and options exist with sensitivity equal to or below that of the baseline topocluster-based trimmed definition which improve on other aspects of jet performance.

Performance survey
The metrics described in Sect. 4 are used to study the performance of all jet definitions listed in Table 1, with the addition of UFOs. This provides a more complete understanding of the interplay between the different aspects of jet reconstruction. The results are summarised in Figs. 10, 11, 12, 13 and 14.

Tagging performance
A comparison of the background rejection of the W tagger at the 50% signal tagging efficiency working point is shown in Fig. 10 for two p T bins: a low-p T bin (300 GeV < $p_{\mathrm{T}}^{\mathrm{true,\,ungroomed}}$ < 500 GeV), and a high-p T bin (1000 GeV < $p_{\mathrm{T}}^{\mathrm{true,\,ungroomed}}$ < 1500 GeV). Several trends are apparent from the performance of the taggers. As seen in Sect. 4, for a fixed grooming algorithm, PFO reconstruction improves on topocluster reconstruction for both p T bins, while TCCs improve background rejection even further at high p T . In both cases, UFO reconstruction is able to match or improve on the performance of other jet inputs for both p T bins. In general, pile-up mitigation improves W tagging performance for all input types. The effects of pile-up mitigation are more apparent at low p T , where soft pile-up radiation has a larger impact on the reconstruction of D 2 . At high p T , pile-up mitigation significantly improves the performance of TCC jets. This is related to the greater impact of pile-up mitigation for TCCs on the background mass distribution than the signal distribution, which increases the background rejection.
The tagging performance varies significantly among the different grooming algorithms and parameter choices. For trimming algorithms, smaller values of R sub or larger values of f cut result in reduced tagging performance, regardless of the jet input type. These parameter choices correspond to more aggressive grooming, indicating that some of the softer radiation is important for effectively tagging different types of jets. An analogous observation is made for SD jets, where small values of β, or large values of z cut generally result in degraded tagging performance.
A similar set of results is seen for the top tagger in Fig. 11. In the low-p T bin, PFlow jets typically outperform both topocluster and TCC jets, while TCC jets outperform the other input object types at high p T . Again, UFO jets are able to match or improve the performance compared to the other jet input types in both p T bins. Pile-up mitigation tends to improve results, particularly at low p T , as observed for W taggers, although in a few cases the background rejection deteriorates. The baseline trimming algorithm works well for all input object types, but at low p T , the background rejection may be improved by 50% by instead using a SD algorithm with lighter grooming. The standard SD algorithm with β = 1 and z cut = 0.1 works particularly well, although recursive and bottom-up variants can also provide comparable performance.
In general, the tagging performance of jets constructed out of UFOs matches or exceeds that of jets reconstructed out of any other input type.

Pile-up stability
The slopes of the fitted average W boson jet mass as a function of N PV are shown in Fig. 12 for each of the surveyed jet definitions. The uncertainties in the fitted slope values tend to be negligible compared to the differences between reported values. Among jet input types, PFOs and UFOs are the most pile-up-stable. PFOs, TCCs, and UFOs are all more pile-up-stable than topoclusters, due to the ability to easily remove charged particles from pile-up vertices. As discussed in Sect. 4, the fitted value of the TCC W mass peak position decreases as a function of N PV for most grooming algorithms, although for lighter grooming algorithms which are more affected by pile-up, the slope is sometimes positive. This effect is exacerbated by the use of CS + SK, and for CS + SK TCCs, all of the studied trends are negative.
There are significant differences in the pile-up stability of different jet grooming algorithms. In general, all studied configurations of trimming are stable. For SD, RSD and BUSD, stability depends on the parameter choice. Larger values of β, where more soft and wide-angled radiation is retained, have a larger pile-up dependence. As expected, for the same value of z cut , RSD and BUSD are more stable than the standard SD definition.
For all input types, with the exception of TCCs, jet-input-level pile-up mitigation techniques improve the pile-up stability of the jet definitions. Since too much energy is already subtracted for TCCs because of the inclusion of pile-up tracks in their reconstruction, any additional subtraction further degrades performance. For other jet inputs, the use of pile-up mitigation reduces the pile-up sensitivity so that it is better than or equivalent to the pile-up sensitivity from the baseline trimmed topocluster jet definition. This is true even for lightly groomed algorithms (e.g. RSD with z cut = 0.05, β = 1, N = 3), where CS + SK improves stability by a factor of 20. While PUPPI improves the pile-up stability of PFOs, the performance of CS + SK PFOs is better overall, sometimes by more than a factor of two. This improvement is seen for nearly all grooming algorithms. The pile-up stability of UFOs is similar to that of PFOs, which is expected since the modified TCC splitting step does not remove pile-up particles.
The change in signal efficiency of the D 2 tagger as a function of N PV is shown in Fig. 13. Uncertainties in the reported values from the fitting procedure tend to be negligible (sub-percent level). As pile-up levels increase, the signal efficiency of the W tagger tends to decrease. As observed when studying the W mass peak position metric, topocluster inputs are the least stable. The trends in stability as a function of grooming algorithm are the same as for the W mass position. While CS + SK is typically still more performant than PUPPI, the degree of improvement is not as large as that observed when studying the pile-up stability of the W jet mass peak position.

Topological sensitivity
In order to examine the topology dependence of the jet energy and mass scale for different jet definitions, the ratio of the mean value of the uncalibrated jet mass response for W jets to that of background jets is constructed. These values can be significantly different, as seen in Sect. 4. Deviations from unity will result in non-closure in the mass response following calibration. This effect is largest at low p T , where the reconstruction of W jets is relevant. As seen in Fig. 14, the baseline topocluster-based trimmed large-R jet definition used by the ATLAS experiment shows a score of around 4% in this metric, and so small deviations from unity are not problematic.
The topology dependence is increased by the application of jet-input-level pile-up mitigation algorithms. In general, TCCs show the most sensitivity, which can reach 20% after pile-up mitigation algorithms are applied. The topological sensitivity is increased for all inputs after the application of CS + SK, regardless of the grooming algorithm applied. This effect is generally lower for UFOs than for other jet

Comparison of calibrated jet definitions
The tagging performance of a jet definition will have the largest impact on the sensitivity of searches for new physics performed by ATLAS, and so it is the primary metric used to determine which definitions are important for further study.
The pile-up stability and topological sensitivity of the jet mass scale are also important, but since the performance of the baseline topocluster-based trimmed jet definition is still adequate, they are primarily used to distinguish between otherwise similar jet definitions. The primary motivation for choosing UFO-based definitions for further study is their W boson and top quark tagging performance.
Based on their optimal tagging performance over the entire kinematic range of interest, in addition to the increased pileup stability achieved by utilising tracking information in the jet definition, only jets reconstructed from UFOs are considered further. Several grooming algorithms are promising: soft-drop (β = 1.0, z cut = 0.1) jets perform well when tagging highp T top quarks, while the RSD (β = 1.0, z cut = 0.05, N = ∞) and BUSD (β = 1.0, z cut = 0.05) extensions provide further improvements for highp T W bosons. Trimmed UFO jets ( f cut = 0.05, R sub = 0.2) also provide competitive performance in certain regions. These four UFO jet definitions were selected for calibration and further study, as summarised in Table 2 in the category 'studied definitions.' (a) (b) Fig. 10 Background rejection at 50% signal efficiency for a tagger using the jet mass and D 2 for W boson jets at a low p T , and b high p T . Jet p T and η cuts before tagging are made using the ungroomed particle-level large-R jet matched to each of the groomed reconstructed large-R jets. The current baseline topocluster-based trimmed collection is indicated with a 7.1 Simulation-based jet energy and mass scale calibrations A simulation-based calibration is derived using Pythia dijet events for each of the UFO collections which were selected for further study, as well as for additional large-R jet definitions which will permit comparisons of each aspect of the jet definition which is studied. These jet definitions are listed in Table 2. This calibration follows the methodology in Ref. [91], and restores the average reconstructed jet p T and mass scales (JES, JMS) to those of the particle-level references. For each jet definition, a reference set of particle-level jets are reconstructed as described in Sect. 3.1.1, and the same grooming algorithm is applied as that used for the detectorlevel jet definition.
Detector-level jets are matched to particle-level jets using a procedure which minimises the distance R = ( φ) 2 + ( η) 2 . The p T and mass responses are defined respectively as R p T = p reco T / p true T and R m = m reco /m true , where the 'reco' quantities correspond to the value of the jet energy or mass before any calibration has been applied. The truth quantities are defined using particle-level jets, reconstructed following the procedure described in Sect. 3.1.1. The average response is determined using a Gaussian fit to the core of each response distribution.
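The matching and response definitions can be illustrated with a short sketch. Jets are represented here as dictionaries with keys `pt`, `eta`, `phi` and `m`, which is an assumption of the example, as are the function names. Each detector-level jet is simply paired with the closest particle-level jet; any maximum-distance requirement used in practice is not stated in the text and is therefore omitted.

```python
from math import hypot, pi


def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance, with the azimuthal difference wrapped to [-pi, pi)."""
    dphi = (phi1 - phi2 + pi) % (2 * pi) - pi
    return hypot(eta1 - eta2, dphi)


def responses(reco_jets, truth_jets):
    """Return (R_pT, R_m) for each detector-level jet.

    Each reco jet is matched to the particle-level jet at minimum Delta R.
    """
    out = []
    for reco in reco_jets:
        if not truth_jets:
            break
        truth = min(
            truth_jets,
            key=lambda t: delta_r(reco["eta"], reco["phi"], t["eta"], t["phi"]),
        )
        out.append((reco["pt"] / truth["pt"], reco["m"] / truth["m"]))
    return out
```

The average response in a given kinematic bin would then be extracted from the distribution of these ratios, for instance with a fit to its core.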
For the JES calibration, these fits are performed in bins of jet energy and detector pseudorapidity η det , defined as the jet pseudorapidity calculated relative to the geometrical centre of the ATLAS detector. This parameterisation yields (a) (b) Fig. 11 Background rejection at 50% signal efficiency for a tagger using the jet mass and τ 32 for top quark jets at a low p T , and b high p T . Jet p T and η cuts before tagging are made using the ungroomed particle-level large-R jet matched to each of the groomed reconstructed large-R jets. The current baseline topocluster-based trimmed collection is indicated with a a more accurate representation of the active calorimeter cells than that obtained when using the pseudorapidity calculated relative to the PV, and results in an improved evaluation of the calorimeter response. The JES correction factor, c JES = 1/R p T is smoothed in energy and η det , and is applied to the four-momentum of the reconstructed jet as a multiplicative scale factor. A correction to the jet η (' η' below) is also applied to correct for biases with respect to the particle-level reference in certain detector regions [92]. The JES correction is similar for each of the four CS + SK UFO jet definitions which are calibrated, regardless of the grooming algorithm which is applied.
Fig. 12 Pile-up dependence of the value of the fitted W boson mass peak at low p T . Jet p T and η cuts before tagging are made using the ungroomed particle-level large-R jet matched to each of the groomed reconstructed large-R jets. The current baseline topocluster-based trimmed collection is indicated in the figure. The z-axis colour range is based on the difference of the baseline collection from a slope of 0. This makes differences between definitions more discernible than those between very unstable collections, which may have values beyond the axis range.

Fig. 13 Pile-up dependence of a D 2 cut on the W boson jet selection efficiency at low p T . Jet p T and η cuts before tagging are made using the ungroomed particle-level large-R jet matched to each of the groomed reconstructed large-R jets. The current baseline topocluster-based trimmed collection is indicated in the figure. The z-axis colour range is based on the difference of the baseline collection from a slope of 0. This makes differences between definitions more discernible than those between very unstable collections, which may have values beyond the axis range.

After the JES correction has been applied, the jet mass scale calibration is derived using the same procedure in bins of E reco , η det , and log(m reco /E reco ). The jet mass calibration factor $c_{\mathrm{JMS}} = 1/R_m$ is applied only to the mass of the jet, keeping the jet energy fixed and thus allowing the p T to vary. This factor is also a smooth function of the large-R jet kinematics. The reconstructed large-R jet kinematics are thus given by:

\[
E^{\mathrm{reco}} = c_{\mathrm{JES}}\, E_0, \qquad
m^{\mathrm{reco}} = c_{\mathrm{JES}}\, c_{\mathrm{JMS}}\, m_0, \qquad
\eta^{\mathrm{reco}} = \eta_0 + \Delta\eta, \qquad
p_{\mathrm{T}}^{\mathrm{reco}} = c_{\mathrm{JES}}\, \frac{\sqrt{E_0^2 - c_{\mathrm{JMS}}^2\, m_0^2}}{\cosh(\eta_0 + \Delta\eta)},
\]

where the left-hand sides denote the fully calibrated jet and the quantities E 0 , m 0 and η 0 refer to the jet properties prior to any calibration, but following the jet grooming procedure. The JMS correction is mostly similar for each of the four CS + SK UFO jet definitions which are studied, but differences in the size of the correction become largest for massive jets at high p T .
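As a worked illustration of the calibration sequence described above, the following sketch applies a JES factor to the full four-momentum, adds the η correction, and then applies a JMS factor to the mass while holding the energy fixed, so that the p T is recomputed from E and m. The function name and the numerical values in the usage note are hypothetical; in practice the factors would be read from the smoothed response maps rather than passed in by hand.

```python
import math

def calibrate_jet(e0, m0, eta0, c_jes, c_jms, delta_eta=0.0):
    """Apply JES then JMS factors to an uncalibrated groomed jet (E0, m0, eta0).

    Returns (E, m, eta, pT) of the calibrated jet.
    """
    energy = c_jes * e0                 # JES scales the whole four-momentum
    mass = c_jes * c_jms * m0           # JMS acts on the mass only
    eta = eta0 + delta_eta              # additive pseudorapidity correction
    # Energy is held fixed in the JMS step, so |p| = sqrt(E^2 - m^2) changes.
    p_squared_over_jes2 = e0 ** 2 - (c_jms * m0) ** 2
    if p_squared_over_jes2 < 0.0:
        raise ValueError("calibrated mass exceeds calibrated energy")
    pt = c_jes * math.sqrt(p_squared_over_jes2) / math.cosh(eta)
    return energy, mass, eta, pt
```

For example, `calibrate_jet(1000.0, 80.0, 0.0, 1.1, 1.25)` scales the energy to 1100 GeV and the mass to 110 GeV, and the returned p T satisfies (p T cosh η)² + m² = E².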
Figure 15 presents the average jet mass response R m for jets with a particle-level jet mass equal to that of the W boson, for the four CS + SK UFO jet definitions which are calibrated. The response for large-R jets with this mass is obtained by directly taking a profile through the smoothed response maps. High-p T trimmed jets require a smaller calibration factor than jets which are groomed using the SD, RSD or BUSD algorithms. This indicates that there are differences in the high-p T behaviour of grooming algorithms: trimming removes more pile-up from jets at high p T , bringing the average JMS of these jets closer to particle level before calibration. All figures where JES + JMS calibrations have been applied to the large-R jet four-vector are labelled 'JES + JMS'.

Fig. 14 Ratio of the mean value of mass response in W jets to that in q/g jets at low p T . Kinematic selections before tagging are made using the ungroomed particle-level large-R jet matched to each of the groomed reconstructed large-R jets. The current baseline topocluster-based trimmed collection is indicated in the figure.

Jet mass and p T resolution
The expected large-R jet mass resolution, defined to be the 68% interquantile range divided by twice the median of the distribution, is shown in Fig. 16 for samples of signal jets. For these studies (as for all studies in this document), the baseline trimmed topocluster mass is used directly, rather than the combined mass [91] (which incorporates additional measurements from the inner tracking detector), allowing a direct comparison of the unmodified performance of the different jet definitions. In Fig. 16a and b, the resolution for all UFO jet definitions is shown to be better than for the baseline trimmed topocluster definition, particularly at high p T . The expected mass resolution of UFO jets is stable across the entire p T spectrum. In the low-p T region the mass resolution of UFO jets is typically similar to that of topocluster jets, while in the high-p T region, it more closely follows the behaviour of TCC jets. For hadronically decaying high-p T top quarks, UFOs improve the jet mass resolution relative to topocluster-based jets by 26%, and by 40% for high-p T hadronically decaying W bosons.
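The resolution definition quoted above can be written as a short function. This sketch assumes the definition is applied to a list of jet mass response values and that the 68% range is the central one, taken between the 16th and 84th percentiles; both the function name and that percentile choice are assumptions made for illustration.

```python
import statistics

def iqnr_resolution(responses):
    """68% interquantile range divided by twice the median.

    The central 68% is taken between the 16th and 84th percentiles, which is an
    assumption; the text only specifies the 68% range.
    """
    # quantiles(n=100) returns the 99 percentile cut points; index i-1 is the
    # i-th percentile.
    cuts = statistics.quantiles(responses, n=100, method="inclusive")
    q16, q50, q84 = cuts[15], cuts[49], cuts[83]
    return (q84 - q16) / (2.0 * q50)
```

For a Gaussian response distribution this quantity is close to σ divided by the mean, while being less sensitive to non-Gaussian tails than a fitted width.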
In order to help factorise the performance gains from various sources, comparisons of the jet mass resolution are also provided for several other calibrated jet definitions. At high p T the mass resolution of top quarks is better than that of W bosons due to the fact that W bosons are lighter, and their decay products are typically more collimated, making the calorimeter granularity relevant at lower values of p T . UFO jets outperform topocluster and TCC jets for both W boson and top quark jets. PFlow jets are also found to be more performant than topocluster and TCC jets for top quark jets, although their performance deteriorates for highly boosted W bosons. The trimming and soft-drop algorithms are compared for UFO jets with and without CS + SK pile-up mitigation in Fig. 16e and f. The application of CS + SK does not significantly alter the mass resolution of trimmed UFO jets; however, it is found to improve the mass resolution for soft-drop jets at low p T by nearly 40%.

The large-R jet p T resolution for background jets is shown in Fig. 17, determined as the one-standard-deviation width of Gaussian fits to the R p T distributions divided by their fitted mean. The p T resolution of trimmed topocluster jets is superior to that of either TCC trimmed jets or any of the UFO jet definitions studied. UFO jets do not use the LC correction because PFOs are reconstructed using topoclusters at the EM scale, which results in a degraded correlation between the particle-level and detector-level large-R jet p T . While TCC jets take topoclusters calibrated to the LC scale as input, the energy resolution of TCC trimmed jets is worse than for topocluster trimmed jets, while the UFO trimmed jet resolution is almost identical to the resolution of PFlow trimmed jets. This indicates that the energy resolution degradation of TCC is due to the inclusion of pile-up tracks in the energy sharing, since these are not included in the UFO implementation.
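The p T resolution definition above (fitted Gaussian width divided by fitted mean) can be illustrated with a simple core fit. The sketch below is not the fitting procedure used in the paper: it assumes NumPy is available and fits a parabola to the logarithm of the histogram counts in the core, which is equivalent to a Gaussian fit because ln N(x) is quadratic in x. The function name, the binning and the core definition (bins with at least 30% of the peak height) are hypothetical choices.

```python
import numpy as np

def pt_resolution(r_pt, n_bins=60, core_fraction=0.3):
    """Return (mean, sigma, sigma/mean) from a Gaussian fit to the core of R_pT."""
    counts, edges = np.histogram(np.asarray(r_pt, dtype=float), bins=n_bins)
    centres = 0.5 * (edges[:-1] + edges[1:])
    # Keep only the core bins, defined here relative to the peak height.
    core = counts >= core_fraction * counts.max()
    if core.sum() < 3:
        raise ValueError("not enough core bins for a quadratic fit")
    # ln N(x) = a x^2 + b x + c with a = -1/(2 sigma^2) and b = mean / sigma^2.
    # The uncertainty on ln N is about 1/sqrt(N), so the weights are sqrt(N).
    a, b, _ = np.polyfit(centres[core], np.log(counts[core]), 2,
                         w=np.sqrt(counts[core]))
    if a >= 0.0:
        raise ValueError("core is not peaked; Gaussian fit failed")
    sigma = np.sqrt(-0.5 / a)
    mean = -0.5 * b / a
    return mean, sigma, sigma / mean
```

Restricting the fit to the core keeps the result insensitive to the non-Gaussian tails of the response distribution, which is the reason a core fit is used for the average response as well.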

Jet mass + JSS tagging performance
In this section, a comparison of the tagging performance of the calibrated jet definitions is reported. Instead of considering a single efficiency working point (Sect. 4.1), the tagging performance is studied using ROC curves. Figures 18 and 19 show the tagger background rejection as a function of the tagger signal efficiency, using the same jet mass + jet substructure taggers discussed in Sect. 4.1: a fixed mass-window cut with 68% signal efficiency is applied, and then a one-sided D 2 or τ 32 cut is made to obtain the desired signal efficiency.
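The two-step tagger described above can be sketched as follows, assuming NumPy. The mass window is taken here as the central quantile interval containing 68% of the signal, and the one-sided cut is an upper cut on the substructure variable (signal-like jets have small D 2 or τ 32 ); both choices, along with the function name, are assumptions for illustration and may differ from the window definition used in the paper.

```python
import numpy as np

def two_step_tagger(sig_mass, sig_var, bkg_mass, bkg_var,
                    target_eff=0.5, window_eff=0.68):
    """Return (signal efficiency, background rejection) of a mass + JSS tagger."""
    sig_mass, sig_var, bkg_mass, bkg_var = (
        np.asarray(x, dtype=float)
        for x in (sig_mass, sig_var, bkg_mass, bkg_var))
    if target_eff > window_eff:
        raise ValueError("target efficiency cannot exceed the mass-window efficiency")
    # Step 1: fixed mass window containing window_eff of the signal.
    lo, hi = np.quantile(sig_mass, [0.5 - 0.5 * window_eff, 0.5 + 0.5 * window_eff])
    sig_in = (sig_mass >= lo) & (sig_mass <= hi)
    # Step 2: upper cut on the substructure variable, tuned on in-window signal
    # so that the overall signal efficiency reaches target_eff.
    frac = min(1.0, target_eff / sig_in.mean())
    cut = np.quantile(sig_var[sig_in], frac)
    sig_eff = np.mean(sig_in & (sig_var <= cut))
    bkg_eff = np.mean((bkg_mass >= lo) & (bkg_mass <= hi) & (bkg_var <= cut))
    rejection = np.inf if bkg_eff == 0.0 else 1.0 / bkg_eff
    return sig_eff, rejection
```

Calling this function for a range of `target_eff` values up to the mass-window efficiency traces out one ROC curve of background rejection against signal efficiency.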
When tagging high-p T , hadronically decaying W bosons (Fig. 18), the considered UFO definitions bring significant improvement over the LCTopo and TCC definitions. At high p T , UFOs outperform the baseline topocluster-based jet definition in terms of their background rejection by about 120% at a fixed signal-tagging efficiency of 50%. For high-p T , hadronically decaying top quarks (Fig. 19), UFO definitions outperform all other choices, improving the background rejection by 135% when compared with the baseline topocluster-based jet definition at a fixed signal-tagging efficiency of 50%. Use of the recursive or bottom-up soft-drop grooming algorithm is noted to further improve performance over the trimmed UFO definition by an additional 10% for a signal efficiency of 50%, and the application of CS + SK pile-up mitigation is also found to increase performance by roughly 10% when it is applied in conjunction with the soft-drop grooming algorithm.

Data-to-simulation comparisons
Robust modelling of jet substructure is crucial to reduce uncertainties related to Monte Carlo modelling of parton showers in physics analyses that rely on jet-substructure-based techniques. To verify the accuracy of the simulation, predictions were generated at the detector level for several jet observables and compared with data. Events are selected using the lowest unprescaled single large-R jet trigger. This trigger is fully efficient for ungroomed large-R jets with p T > 600 GeV. Data are required to pass a series of quality requirements and cleaning cuts. In addition, overlap removal and pile-up reweighting are applied.
Events are required to have at least one jet with a groomed jet p T above 600 GeV, and all jets are required to have p T > 600 GeV and |η| < 1.2. When studying the behaviour of τ 32 and D 2 , the jet mass is required to be greater than 40 GeV. Data and simulated events are required to pass the same event selection.
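The kinematic part of this selection can be expressed as a simple filter. The sketch below is only an illustration under stated assumptions: jets are represented as dictionaries with hypothetical keys `pt`, `eta` and `mass` (groomed quantities, in GeV where applicable), an event is kept if at least one jet passes, and the trigger, data-quality, cleaning, overlap-removal and pile-up reweighting steps are not modelled.

```python
def select_jets(jets, for_substructure=False):
    """Return the groomed large-R jets passing the kinematic selection.

    jets: list of dicts with keys 'pt' [GeV], 'eta' and 'mass' [GeV].
    for_substructure: also require m > 40 GeV, as used for D2 and tau32.
    """
    selected = [j for j in jets if j["pt"] > 600.0 and abs(j["eta"]) < 1.2]
    if for_substructure:
        selected = [j for j in selected if j["mass"] > 40.0]
    return selected

def event_passes(jets):
    """An event is kept if at least one jet passes the kinematic selection."""
    return len(select_jets(jets)) > 0
```

The same function would be applied to data and to simulated events, so that both pass an identical selection.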
The observed data are compared with simulated dijet events in Fig. 20. The jet mass, number of jet constituents, D 2 , and τ 32 are studied. Only statistical uncertainties are displayed, and the statistical uncertainty of the simulation is negligible compared to that of the data. In general, the level of agreement between data and simulation for the UFO jets is similar to that of topocluster trimmed jets, indicating that this level of agreement is tolerable for general use on ATLAS. The exception to this is the number of constituents, which is known to be modelled poorly [66]. The modelling is improved for UFO jets relative to topocluster-based trimmed jets, particularly at large constituent multiplicities.
The background rejection for the mass + JSS taggers described in Sect. 6 is shown in Fig. 21 as a function of the large-R jet p T , where taggers are created for each p T bin, using the 50% signal efficiency working point. For the W tagger, agreement between data and simulation is similar for all jet definitions, while for the top taggers, agreement is slightly worse for UFO jets than for the topocluster trimmed definition.

Fig. 21 Data-to-simulation comparisons of the background rejection for groomed jets for (a) the mass + D 2 W tagger and (b) the mass + τ 32 top tagger.

Concluding remarks
The development of jet substructure techniques has enabled new searches and measurements, boosting the sensitivity of the Large Hadron Collider experiments to the physics of and beyond the Standard Model. This paper has presented a set of performance comparisons in order to determine the most promising large-R jet definitions for use in future analyses, with a focus on comparing different jet input objects, pile-up mitigation algorithms and jet grooming algorithms.
A new type of jet input, called a Unified Flow Object, has been proposed which incorporates tracking information into jet substructure reconstruction by combining particle-flow reconstruction for low-p T particles and cluster splitting for particles at high p T and in dense environments. These UFO inputs can increase the background rejection of jet taggers across a wide kinematic range by up to 120% for a simple W tagger at 50% signal efficiency, and up to 135% for a simple top tagger at 50% signal efficiency when compared with the current baseline trimmed topocluster large-R jet definition. While the p T resolution of these jets is degraded relative to the baseline LCW topocluster-based ATLAS large-R jet definition due to the different topocluster energy scales used as input objects, UFO jets provide an improved jet mass resolution, with up to a 45% improvement at high p T for signal jets when compared with existing ATLAS large-R jet definitions.
The application of CS + SK pile-up mitigation has been shown to stabilise and augment performance as a function of the number of pile-up interactions, which will be crucial in the face of the difficult experimental conditions to come during future LHC data-taking periods. Pile-up mitigation increases the number of experimentally viable grooming configurations to include options which do not groom soft radiation aggressively enough to be considered with unmodified jet inputs.
Several promising grooming algorithms were compared using large-R CS + SK UFO jets. Definitions incorporating soft-drop grooming and its extensions, recursive soft-drop and bottom-up soft-drop, all outperform the baseline ATLAS trimming configuration in terms of high-p T W and top quark tagging using simple taggers. These collections are viable for general-purpose use in the challenging experimental conditions of the LHC only due to the improvements in jet inputs and pile-up mitigation algorithms. The soft-drop definition using z cut = 0.1 and angular exponent β = 1.0 outperforms all other candidates when identifying high-p T top quarks, and is competitive to within 5-10% of the considered RSD and BUSD options when tagging boosted W bosons. These jets also exhibit good pile-up stability and a tolerable sensitivity to topological effects, according to the metrics studied. This definition provides superior jet mass resolution for low-p T W jets when compared with RSD and BUSD options. Due to its wide range of applicability, it is concluded that the CS + SK UFO soft-drop (β = 1.0, z cut = 0.1) large-R jet definition provides the best performance for use as a general-purpose jet definition in ATLAS physics analyses.

Data Availability Statement
This manuscript has no associated data or the data will not be deposited. [Authors' comment: All ATLAS scientific output is published in journals, and preliminary results are made available in Conference Notes. All are openly available, without restriction on use by external parties beyond copyright law and the standard conditions agreed by CERN. Data associated with journal publications are also made available: tables and data from plots (e.g. cross section values, likelihood profiles, selection efficiencies, cross section limits, ...) are stored in appropriate repositories such as HEPDATA (http://hepdata.cedar.ac.uk/). ATLAS also strives to make additional material related to the paper available that allows a reinterpretation of the data in the context of new theoretical models. For example, an extended encapsulation of the analysis is often provided for measurements in the framework of RIVET (http://rivet.hepforge.org/).]

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Funded by SCOAP3.