ATLAS $b$-jet identification performance and efficiency measurement with $t\bar{t}$ events in $pp$ collisions at $\sqrt{s}=13$ TeV

The algorithms used by the ATLAS Collaboration during Run 2 of the Large Hadron Collider to identify jets containing $b$-hadrons are presented. The performance of the algorithms is evaluated in simulation and the efficiency with which these algorithms identify jets containing $b$-hadrons is measured in collision data. The measurement uses a likelihood-based method in a sample highly enriched in $t\bar{t}$ events. The topology of the $t \to W b$ decays is exploited to simultaneously measure both the jet flavour composition of the sample and the efficiency in a transverse momentum range from 20 GeV to 600 GeV. The efficiency measurement is subsequently compared with that predicted by the simulation. The data used in this measurement, corresponding to a total integrated luminosity of 80.5 fb$^{-1}$, were collected in proton-proton collisions during the years 2015 to 2017 at a centre-of-mass energy of $\sqrt{s}=13$ TeV. By simultaneously extracting both the efficiency and the jet flavour composition, this measurement significantly improves the precision compared to previous results, with uncertainties ranging from 1% to 8% depending on the jet transverse momentum.


Introduction
The identification of jets containing b-hadrons (b-jets) against the large background of jets containing c-hadrons but no b-hadron (c-jets) or containing neither b- nor c-hadrons (light-flavour jets) is of major importance in many areas of the physics programme of the ATLAS experiment [1] at the Large Hadron Collider (LHC) [2]. It has been decisive in the recent observations of the Higgs boson decay into bottom quarks [3] and of its production in association with a top-quark pair [4], and plays a crucial role in a large number of Standard Model (SM) precision measurements, studies of the Higgs boson properties, and searches for new phenomena.
The ATLAS Collaboration uses various algorithms to identify b-jets [5], referred to as b-tagging algorithms, when analysing data recorded during Run 2 of the LHC (2015-2018). These algorithms exploit the long lifetime, high mass and high decay multiplicity of b-hadrons as well as the properties of the b-quark fragmentation. Given a lifetime of the order of 1.5 ps (⟨cτ⟩ ≈ 450 µm), b-hadrons have a significant, measurable mean flight length, ⟨l⟩ = βγcτ, in the detector before decaying, generally leading to at least one vertex displaced from the hard-scatter collision point. The strategy developed by the ATLAS Collaboration is based on a two-stage approach. Firstly, low-level algorithms reconstruct the characteristic features of the b-jets via two complementary approaches: one that uses the individual properties of charged-particle tracks, later referred to as tracks, associated with a hadronic jet, and a second that combines the tracks to explicitly reconstruct displaced vertices. These algorithms, first introduced during Run 1 [5], have been improved and retuned for Run 2. Secondly, in order to maximise the b-tagging performance, the results of the low-level b-tagging algorithms are combined in high-level algorithms consisting of multivariate classifiers. The performance of a b-tagging algorithm is characterised by the probability of tagging a b-jet (the b-jet tagging efficiency, ε_b) and the probability of mistakenly identifying a c-jet or a light-flavour jet as a b-jet, labelled ε_c (ε_l). In this paper, the performance of the algorithms is quantified in terms of the c-jet and light-flavour jet rejections, defined as 1/ε_c and 1/ε_l, respectively.
The imperfect description of the detector response and physics modelling effects in Monte Carlo (MC) simulations necessitates the measurement of the performance of the b-tagging algorithms with collision data [6][7][8]. In this paper, the measurement of the b-jet tagging efficiency of the high-level b-tagging algorithms used in proton-proton (pp) collision data recorded during Run 2 of the LHC at √s = 13 TeV is presented. The corresponding measurements for c-jets and light-flavour jets, used in the measurement of the b-jet tagging efficiency to correct the simulation such that the overall tagging efficiency of c-jets and light-flavour jets matches that in data, are described elsewhere [7,8]. The production of tt pairs at the LHC provides an abundant source of b-jets by virtue of the high cross-section and the t → W b branching ratio being close to 100%. A very pure sample of tt events is selected by requiring that both W bosons decay leptonically, referred to as dileptonic tt decays in the following. A combinatorial likelihood approach is used to measure the b-jet tagging efficiency of the high-level b-tagging algorithms as a function of the jet transverse momentum (p_T). This version of the analysis builds upon the approach previously used by the ATLAS Collaboration [6], extending the method to derive additional constraints on the flavour composition of the sample, which reduces the uncertainties by up to a factor of two relative to the previous publication.
The paper is organised as follows. In Section 2, the ATLAS detector is described. Section 3 contains a description of the objects reconstructed in the detector which are key ingredients for b-tagging algorithms, while Section 4 describes the b-tagging algorithms and the evaluation of their performance in the simulation. The second part of the paper focuses on the b-jet tagging efficiency measurement carried out in collision data and the application of these results in ATLAS analyses. The data and simulated samples used in this work are described in Section 5. The event selection and classification performed for the measurement of the b-jet tagging efficiency are summarised in Section 6. The measurement technique is presented in Section 7 and the sources of uncertainties are described in Section 8. The results and their usage within the ATLAS Collaboration are discussed in Sections 9 and 10, respectively.

ATLAS detector
The ATLAS detector [1] at the LHC covers nearly the entire solid angle around the collision point. It consists of an inner tracking detector (ID) surrounded by a superconducting solenoid, electromagnetic and hadronic calorimeters and a muon spectrometer incorporating three large superconducting toroid magnets.
The ID consists of a high-granularity silicon pixel detector, which covers the vertex region and typically provides four measurements per track. The innermost layer, known as the insertable B-layer (IBL) [9], was added in 2014 and provides high-resolution hits at small radius to improve the tracking performance. For a fixed b-jet efficiency, the incorporation of the IBL improves the light-flavour jet rejection of the b-tagging algorithms by up to a factor of four [10]. The silicon pixel detector is followed by a silicon microstrip tracker (SCT) that typically provides eight measurements from four strip double layers. These silicon detectors are complemented by a transition radiation tracker (TRT), which enables radially extended track reconstruction up to pseudorapidity |η| = 2.0. The TRT also provides electron identification information based on the fraction of hits (typically 33 in the barrel and up to an average of 38 in the endcaps) above a higher energy-deposit threshold corresponding to transition radiation. The ID is immersed in a 2 T axial magnetic field and provides charged-particle tracking in the pseudorapidity range |η| < 2.5.
The calorimeter system covers the pseudorapidity range |η| < 4.9. Within the region |η| < 3.2, electromagnetic calorimetry is provided by barrel and endcap high-granularity lead/liquid-argon (LAr) sampling calorimeters, with an additional thin LAr presampler covering |η| < 1.8 to correct for energy loss in material upstream of the calorimeters. Hadronic calorimetry is provided by a steel/scintillator-tile calorimeter, segmented into three barrel structures within |η| < 1.7, and two copper/LAr hadronic endcap calorimeters. The solid angle coverage is completed with forward copper/LAr and tungsten/LAr calorimeter modules optimised for electromagnetic and hadronic measurements, respectively.
The muon spectrometer comprises separate trigger and high-precision tracking chambers measuring the deflection of muons in a magnetic field generated by superconducting air-core toroids. The precision chamber system covers the region |η| < 2.7 with three layers of monitored drift tubes, complemented by cathode-strip chambers in the forward region. The muon trigger system covers the range |η| < 2.4 with resistive-plate chambers in the barrel and thin-gap chambers in the endcap regions.
A two-level trigger system [11] is used to select interesting events. The first level of the trigger is implemented in hardware and uses a subset of detector information to reduce the event rate to a design value of at most 100 kHz. It is followed by a software-based trigger that reduces the event rate to a maximum of around 1 kHz for offline storage.
Jet flavour labels are attributed to the jets in the simulation. Jets are labelled as b-jets if they are matched to at least one weakly decaying b-hadron having p T ≥ 5 GeV within a cone of size ∆R = 0.3 around the jet axis. If no b-hadrons are found, c-hadrons and then τ-leptons are searched for, based on the same selection criteria. The jets matched to a c-hadron (τ-lepton) are labelled as c-jets (τ-jets). The remaining jets are labelled as light-flavour jets.

Algorithms for b-jet identification
This section describes the different algorithms used for b-jet identification and the evaluation of their performance in simulation. Low-level b-tagging algorithms fall into two broad categories. A first approach, implemented in the IP2D and IP3D algorithms [23] and described in Section 4.2.1, is inclusive and based on exploiting the large impact parameters of the tracks originating from the b-hadron decay. The second approach explicitly reconstructs displaced vertices. The SV1 algorithm [24], discussed in Section 4.2.2, attempts to reconstruct an inclusive secondary vertex, while the JetFitter algorithm [25], presented in Section 4.2.3, aims to reconstruct the full b- to c-hadron decay chain. These algorithms, first introduced during Run 1 [5], benefit from improvements and a new tuning for Run 2. To maximise the b-tagging performance, the low-level algorithm results are combined using multivariate classifiers. To this end, two high-level tagging algorithms have been developed. The first one, MV2 [23], is based on a boosted decision tree (BDT) discriminant, while the second one, DL1 [23], is based on a deep feed-forward neural network (NN). These two algorithms are presented in Sections 4.3.1 and 4.3.2, respectively.

Training and tuning samples
The new tuning and training strategies of the b-tagging algorithms for Run 2 are based on the use of a hybrid sample composed of tt and Z′ simulated events. Only tt decays with at least one lepton from a subsequent W-boson decay are considered in order to ensure a sufficiently large fraction of c-jets in the event whilst maintaining a jet multiplicity profile similar to that in most analyses. A dedicated sample of Z′ bosons decaying into hadronic jet pairs is included to optimise the b-tagging performance at high jet p_T. The cross-section of the hard-scattering process is modified by applying an event-by-event weighting factor to broaden the natural width of the resonance and widen the transverse momentum distribution of the jets produced in the hadronic decays up to a jet p_T of 1.5 TeV. The branching fractions of the decays are set to one-third each for the bb, cc and light-flavour quark pairs to give a p_T spectrum uniformly populated by jets of all flavours. The hybrid sample is obtained by selecting b-jets from tt events if the corresponding b-hadron p_T is below 250 GeV and from the Z′ sample if above, with a similar strategy applied for c-jets and light-flavour jets. More details about the production of the tt simulated sample, referred to as the baseline tt sample in the following, and the Z′ simulated sample are given in Section 5. Events with at least one jet are selected, excluding jets overlapping with a generator-level electron originating from a W- or Z-boson decay.
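The hybrid-sample selection described above reduces to a simple per-jet threshold on the matched b-hadron transverse momentum. A minimal sketch, with hypothetical jet records carrying that value (the paper does not prescribe any particular data layout):

```python
# Sketch of the hybrid training-sample selection (jet records are hypothetical):
# b-jets are taken from the ttbar sample when the matched b-hadron pT is below
# 250 GeV, and from the broadened Z' sample above that threshold.
def hybrid_sample(ttbar_jets, zprime_jets, threshold_gev=250.0):
    selected = [j for j in ttbar_jets if j["hadron_pt"] < threshold_gev]
    selected += [j for j in zprime_jets if j["hadron_pt"] >= threshold_gev]
    return selected

ttbar = [{"hadron_pt": 60.0}, {"hadron_pt": 300.0}]    # high-pT ttbar jet dropped
zprime = [{"hadron_pt": 400.0}, {"hadron_pt": 100.0}]  # low-pT Z' jet dropped
print([j["hadron_pt"] for j in hybrid_sample(ttbar, zprime)])
```

An analogous selection would be applied separately for c-jets and light-flavour jets.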

Algorithms based on impact parameters
There are two complementary impact-parameter-based algorithms, IP2D and IP3D [23]. The IP2D tagger makes use of the signed transverse impact parameter significance of tracks to construct a discriminating variable, whereas IP3D uses both the signed transverse and the longitudinal impact parameter significances of tracks in a two-dimensional template to account for their correlation. Probability density functions (pdfs) obtained from reference histograms of the signed transverse and longitudinal impact parameter significances of tracks associated with b-jets, c-jets and light-flavour jets are derived from MC simulation. The pdfs are computed in exclusive categories that depend on the hit pattern of the tracks to increase the discriminating power. The pdfs are used to calculate ratios of the b-jet, c-jet and light-flavour jet probabilities on a per-track basis. Log-likelihood ratio (LLR) discriminants are then defined as the sum over tracks of the logarithm of the per-track probability ratio for each jet-flavour hypothesis pair, e.g. ∑_{i=1}^{N} log(p_b/p_u) for the b-jet and light-flavour jet hypotheses, where N is the number of tracks associated with the jet and p_b (p_u) is the template pdf for the b-jet (light-flavour jet) hypothesis. The flavour probabilities of the different tracks contributing to the sum are assumed to be independent of each other. In addition to the LLR separating b-jets from light-flavour jets, two extra LLR functions are defined to separate b-jets from c-jets and c-jets from light-flavour jets, respectively. These three likelihood discriminants for both the IP2D and IP3D algorithms are used as inputs to the high-level taggers.
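The per-track LLR sum above can be sketched in a few lines. This is an illustration only: the per-track probabilities would in practice be looked up from the impact-parameter-significance template histograms per track category, which are not reproduced here.

```python
import math

# Hypothetical per-track template probabilities p_b and p_light, as would be
# looked up from the signed impact-parameter-significance pdfs per track category.
def llr(tracks):
    """IP3D-style jet discriminant: sum over tracks of log(p_b / p_light),
    assuming the flavour probabilities of the tracks are independent."""
    return sum(math.log(t["p_b"] / t["p_light"]) for t in tracks)

jet_tracks = [
    {"p_b": 0.40, "p_light": 0.10},  # large positive IP significance: b-like
    {"p_b": 0.05, "p_light": 0.20},  # prompt-looking track: light-like
]
print(llr(jet_tracks))
```

The two extra LLRs (b vs c, c vs light) would use the same sum with the corresponding template pair.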
Both the IP2D and IP3D algorithms benefited from a complete retuning prior to the 2017-2018 ATLAS data-taking period [23]. In particular, a reoptimisation of the track category definitions was performed, allowing the IBL hit pattern expectations and the next-innermost-layer information to be fully exploited. The rejection by the secondary vertex algorithms of tracks originating from photon conversions, long-lived particle decays (K_S, Λ) and interactions with detector material has also been improved. An additional set of new template pdfs was produced using a 50%/50% mixture of tt and Z′ simulated events for the extra track categories with no hits in the first two layers, which are populated by long-lived b-hadrons traversing the first layers before they decay. The tt sample is used to populate all remaining categories.

Secondary vertex finding algorithm
The secondary vertex tagging algorithm, SV1 [24], reconstructs a single displaced secondary vertex in a jet. The reconstruction starts by identifying the possible two-track vertices built with all tracks associated with the jet, while rejecting tracks that are compatible with the decay of long-lived particles (K_S or Λ), photon conversions or hadronic interactions with the detector material. The SV1 algorithm runs iteratively on all tracks contributing to the cleaned two-track vertices, trying to fit one secondary vertex. In each iteration, the track-to-vertex association is evaluated using a χ² test. The track with the largest χ² is removed and the vertex fit is repeated until an acceptable vertex χ² and a vertex invariant mass of less than 6 GeV are obtained. With this approach, the decay products from b- and c-hadrons are assigned to a single common secondary vertex.
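The iterative fit-and-prune loop can be sketched as follows; the vertex fit itself is replaced here by a toy one-dimensional stand-in (`toy_fit` and the record layout are invented for illustration, not part of the SV1 implementation):

```python
from dataclasses import dataclass

@dataclass
class Vertex:
    track_chi2: list  # per-track chi2 contribution to the vertex fit
    mass: float       # fitted vertex invariant mass in GeV

def toy_fit(tracks):
    """Stand-in for a real vertex fit: 1D 'positions', chi2 = squared residual."""
    mean = sum(tracks) / len(tracks)
    return Vertex([(t - mean) ** 2 for t in tracks], mass=2.0)

def iterative_vertex_fit(tracks, fit=toy_fit, chi2_cut=3.0, mass_cut_gev=6.0):
    """SV1-style loop: fit one common vertex, drop the worst track by chi2,
    repeat until the fit is acceptable and the vertex mass is below the cut."""
    tracks = list(tracks)
    while len(tracks) >= 2:
        vtx = fit(tracks)
        worst = max(range(len(tracks)), key=lambda i: vtx.track_chi2[i])
        if vtx.track_chi2[worst] <= chi2_cut and vtx.mass <= mass_cut_gev:
            return vtx, tracks
        tracks.pop(worst)  # remove the least compatible track and refit
    return None, tracks

# the outlier at 5.0 is removed; the three compatible tracks form the vertex
vtx, kept = iterative_vertex_fit([1.0, 1.1, 0.9, 5.0])
print(kept)
```

A real implementation would fit three-dimensional track parameters and compute the invariant mass from the associated track four-momenta.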
Several refinements in the track and vertex selection were made prior to the 2016-2017 ATLAS data-taking period to improve the performance of the algorithm, resulting in an increased pile-up rejection and an overall enhancement of the performance at high jet p_T [24]. Among the various algorithm improvements, additional track-cleaning requirements are applied for jets in the high-pseudorapidity region (|η| ≥ 1.5) to mitigate the negative influence of the increasing amount of detector material on the secondary vertex finding efficiency. The fake-vertex rate is also better controlled by limiting the algorithm to only consider the 25 highest-p_T tracks in the jets, which preserves all reconstructed tracks from b-hadron decays whilst limiting the influence of additional tracks in the jet. The selection of two-track vertex candidates prior to the χ² fit was also reoptimised. Extra candidate-cleaning requirements were introduced to further reduce the number of fake vertices and material interactions, such as the rejection of two-track vertex candidates with an invariant mass greater than 6 GeV, which are unlikely to originate from b- and c-hadron decays. Eight discriminating variables, including the number of tracks associated with the SV1 vertex, the invariant mass of the secondary vertex, its energy fraction (defined as the total energy of all the tracks associated with the secondary vertex divided by the energy of all the tracks associated with the jet), and the three-dimensional decay length significance are used as inputs to the high-level taggers. The b-tagging performance of the SV1 algorithm is evaluated using an LLR discriminant based on pdfs for the b-jet, c-jet and light-flavour jet hypotheses computed from three-dimensional histograms built from three SV1 output variables: the vertex mass, the energy fraction and the number of two-track vertices.

Topological multi-vertex finding algorithm
The topological multi-vertex algorithm, JetFitter [25], exploits the topological structure of weak b- and c-hadron decays inside the jet and tries to reconstruct the full b-hadron decay chain. A modified Kalman filter [26] is used to find a common line on which the primary, bottom and charm vertices lie, approximating the b-hadron flight path, as well as the vertex positions. With this approach, it is possible to resolve the b- and c-hadron vertices even when only a single track is attached to each of them.
Several improvements [25] were introduced in the current version of the JetFitter algorithm prior to the 2017-2018 ATLAS data-taking period. These include a reoptimisation of the track selection to better mitigate the effect of pile-up tracks, an improvement in the rejection of material interactions, and the introduction of a vertex-mass-dependent selection during the decay chain fit to increase the efficiency of tertiary vertex reconstruction. Eight discriminating variables, including the track multiplicity at the JetFitter displaced vertices, the invariant mass of tracks associated with these vertices, their energy fraction and their average three-dimensional decay length significance, are used as inputs to the high-level taggers. The b-tagging performance of the JetFitter algorithm is evaluated using an LLR discriminant based on likelihood functions combining pdfs extracted from some of the JetFitter output variables (vertex mass, energy fraction and decay length significance) and parameterised for each of the three jet flavours.

MV2
The MV2 algorithm [23] consists of a boosted decision tree (BDT) that combines the outputs of the low-level tagging algorithms described in Section 4.2 and listed in Table 1. The BDT is trained using the ROOT Toolkit for Multivariate Data Analysis (TMVA) [27] on the hybrid tt + Z′ sample. The kinematic properties of the jets, namely p_T and |η|, are included in the training in order to take advantage of the correlations with the other input variables. However, to avoid differences in the kinematic distributions of signal (b-jets) and background (c-jets and light-flavour jets) being used to discriminate between the different jet flavours, the b-jets and c-jets are reweighted in p_T and |η| to match the spectrum of the light-flavour jets. No kinematic reweighting is applied at the evaluation stage of the multivariate classifier. For the training, the c-jet fraction in the background sample is set to 7%, with the remainder composed of light-flavour jets. This allows the charm rejection to be enhanced whilst preserving a high light-flavour jet rejection. The BDT training hyperparameters of the MV2 tagging algorithm are listed in Table 2. They have been optimised to provide the best separation power between the signal and the background. The output discriminant of the MV2 algorithm for b-jets, c-jets and light-flavour jets evaluated with the baseline tt simulated events is shown in Figure 1(a).
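The training-time kinematic reweighting amounts to computing per-jet weights from the ratio of the binned (p_T, |η|) spectra. A minimal sketch, with invented toy binning and jet tuples (the actual binning is not specified here):

```python
from collections import Counter

# Sketch of the training-time kinematic reweighting (helper names are made up):
# per-jet weights that make the signal (pT, |eta|) spectrum match the
# light-flavour-jet one, so kinematics alone cannot separate the classes.
def kinematic_weights(sig_jets, ref_jets, bin_of):
    n_sig_bin = Counter(map(bin_of, sig_jets))
    n_ref_bin = Counter(map(bin_of, ref_jets))
    n_sig, n_ref = len(sig_jets), len(ref_jets)
    # weight = (reference bin fraction) / (signal bin fraction)
    return [(n_ref_bin[bin_of(j)] / n_ref) / (n_sig_bin[bin_of(j)] / n_sig)
            for j in sig_jets]

# toy binning: 50 GeV pT bins, unit |eta| bins; jets are (pT, eta) tuples
bin_of = lambda jet: (int(jet[0] // 50), int(abs(jet[1])))
b_jets = [(25.0, 0.5)] * 4                                      # all in one bin
light_jets = [(25.0, 0.5), (25.0, 0.5), (75.0, 0.5), (75.0, 0.5)]
print(kinematic_weights(b_jets, light_jets, bin_of))
```

Here the b-jets sit entirely in the low-p_T bin while the light-flavour reference is split evenly, so each b-jet receives weight 0.5.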

DL1
The second high-level b-tagging algorithm, DL1 [23], is based on a deep feed-forward neural network (NN) trained using Keras [28] with the Theano [29] backend and the Adam optimiser [30]. The DL1 NN has a multidimensional output corresponding to the probabilities for a jet to be a b-jet, a c-jet or a light-flavour jet. The topology of the network consists of a mixture of fully connected hidden layers and Maxout layers [31]. The input variables to DL1 consist of those used for the MV2 algorithm with the addition of the JetFitter c-tagging variables listed in Table 1. The latter relate to the dedicated properties of the secondary and tertiary vertices (distance to the primary vertex, invariant mass and number of tracks, energy, energy fraction, and rapidity of the tracks associated with the secondary and tertiary vertices). A jet p_T and |η| reweighting similar to the one used for MV2 is performed. The DL1 algorithm parameters, listed in Table 3, include the architecture of the NN, the number of training epochs, the learning rates and the training batch size. All of these are optimised in order to maximise the b-tagging performance. Batch normalisation [32] is added by default since it is found to improve the performance.
Training with multiple output nodes offers additional flexibility when constructing the final output discriminant by combining the b-jet, c-jet and light-flavour jet probabilities. Since all flavours are treated equally during training, the trained network can be used for both b-jet and c-jet tagging. In addition, the use of a multi-class network architecture provides the DL1 algorithm with a smaller memory footprint than BDT-based algorithms. The final DL1 b-tagging discriminant is defined as

D_DL1 = ln [ p_b / ( f_c · p_c + (1 − f_c) · p_light ) ],

where p_b, p_c, p_light and f_c represent, respectively, the b-jet, c-jet and light-flavour jet probabilities, and the effective c-jet fraction in the background training sample. Using this approach, the c-jet fraction in the background can be chosen a posteriori in order to optimise the performance of the algorithm. An optimised c-jet fraction of 8% is used to evaluate the performance of the DL1 b-tagging algorithm in this paper.
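The discriminant can be evaluated directly from the three network outputs. A short sketch (the probability values below are invented for illustration):

```python
import math

def dl1_discriminant(p_b, p_c, p_light, f_c=0.08):
    """DL1 b-tagging discriminant: log ratio of the b-jet probability to an
    f_c-weighted mixture of the c-jet and light-flavour-jet probabilities.
    f_c can be chosen after training; 8% is the value used in the paper."""
    return math.log(p_b / (f_c * p_c + (1.0 - f_c) * p_light))

# a b-like jet (illustrative network outputs): large positive discriminant
print(dl1_discriminant(p_b=0.80, p_c=0.10, p_light=0.10))
```

Raising f_c shifts the discriminant towards stronger c-jet rejection at the cost of light-flavour rejection, which is what allows the working balance to be tuned after training.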
The output discriminant of the DL1 b-tagging algorithm for b-jets, c-jets and light-flavour jets in the baseline tt simulated events is shown in Figure 1(b).

Algorithm performance
The evaluation of the performance of the algorithms is carried out using b-jet tagging single-cut operating points (OPs). These are based on a fixed selection requirement on the b-tagging algorithm discriminant distribution ensuring a specific b-jet tagging efficiency, ε_b, for the b-jets present in the baseline tt simulated sample. The selections used to define the single-cut OPs of the MV2 and the DL1 algorithms, as well as the corresponding c-jet, τ-jet and light-flavour jet rejections, are shown in Table 4. The MV2 and the DL1 discriminant distributions are also divided into five 'pseudo-continuous' bins, (O_k)_{k=1,...,5}, delimited by the selections used to define the b-jet tagging single-cut OPs for 85%, 77%, 70% and 60% efficiency, and bounded by the trivial 100% and 0% selections. The value of the pdf in each bin is called the b-jet tagging probability and labelled P_b in the following. The b-jet tagging efficiency of the ε_b = X% single-cut OP can then be defined as the sum of the b-jet tagging probabilities in the range [X%, 0%].

[Table 1 (excerpt), input variables to the high-level taggers: the likelihood ratios between the b-jet, c-jet and light-flavour jet hypotheses, log(P_b/P_light), log(P_b/P_c) and log(P_c/P_light); and the JetFitter c-tagging variables of the secondary (2nd) or tertiary (3rd) vertex: its transverse displacement; m_Trk(2nd/3rd vtx), the invariant mass of tracks associated with the vertex; the fraction of charged jet energy in the vertex; N_TrkAtVtx(2nd/3rd vtx), the number of associated tracks; and Y_trk^min, Y_trk^max, Y_trk^avg, the minimum, maximum and average rapidity of the tracks at the vertex.]
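The relation between the pseudo-continuous bins and the single-cut efficiencies can be made explicit with a small sketch; the per-bin probabilities below are invented, chosen only to be self-consistent with the OP definitions:

```python
# Illustrative (made-up) per-bin b-jet tagging probabilities P_b for the five
# pseudo-continuous bins, ordered from the loosest to the tightest bin; they
# sum to 1 by construction.
P_B = {"100-85": 0.15, "85-77": 0.08, "77-70": 0.07, "70-60": 0.10, "60-0": 0.60}
ORDER = ["100-85", "85-77", "77-70", "70-60", "60-0"]

def efficiency(op):
    """b-jet tagging efficiency of the X% single-cut OP: the sum of P_b over
    the bins tighter than the X% cut, i.e. the range [X%, 0%]."""
    start = {85: 1, 77: 2, 70: 3, 60: 4}[op]
    return sum(P_B[b] for b in ORDER[start:])

print(efficiency(70))
```

With these toy values each single-cut efficiency reproduces its nominal OP (e.g. the 70% OP sums the two tightest bins), mirroring how the calibrated per-bin probabilities are combined in practice.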
The light-flavour jet and c-jet rejections as a function of the b-jet tagging efficiency are shown in Figure 2 for the various low- and high-level b-tagging algorithms. This demonstrates the advantage of combining the information provided by the low-level taggers, where improvements in the light-flavour jet and c-jet rejections by factors of around 10 and 2.5, respectively, are observed at the ε_b = 70% single-cut OP of the high-level algorithms compared to the low-level algorithms. This figure also illustrates the different b-jet tagging efficiency range accessible with each low-level algorithm and thereby their complementarity in the multivariate combinations, with the performance of the DL1 and MV2 discriminants found to be similar. The two algorithms tag a highly correlated sample of b-jets, where the relative fraction of jets exclusively tagged by each algorithm is around 3% at the ε_b = 70% single-cut OP. The relative fractions of light-flavour jets exclusively mis-tagged by the MV2 or the DL1 algorithms at the ε_b = 70% single-cut OP reach 0.2% and 0.1%, respectively.
However, the additional JetFitter c-tagging variables used by DL1 bring around 30% and 10% improvements in the light-flavour jet and c-jet rejections, respectively, at the ε_b = 70% single-cut OP compared to MV2.

Data and simulated samples
In order to use data to evaluate the performance of the high-level b-tagging algorithms, a sample of events enriched in tt dileptonic decays is selected.
The analysis is performed with a pp collision data sample collected at a centre-of-mass energy of √s = 13 TeV during the years 2015, 2016 and 2017, corresponding to an integrated luminosity of 80.5 fb⁻¹ and a mean number of pp interactions per bunch crossing of 31.9. The uncertainty in the integrated luminosity is 2.0% [33], obtained using the LUCID-2 detector [34] for the primary luminosity measurements. All events used were recorded during periods when all relevant ATLAS detector components were functioning normally. The dataset was collected using triggers requiring the presence of a single, high-p_T electron or muon, with p_T thresholds that yield an approximately constant efficiency for leptons passing an offline selection of p_T ≥ 28 GeV. The baseline tt full-simulation sample was produced using Powheg Box v2 [35][36][37][38], where the first-gluon-emission cut-off scale parameter h_damp is set to 1.5 m_top, with m_top = 172.5 GeV used for the top-quark mass. Powheg Box was interfaced to Pythia 8.230 [39] with the A14 set of tuned parameters [40] and the NNPDF30NNLO (NNPDF2.3LO) [41,42] parton distribution functions in the matrix elements (parton shower). This set-up was found to produce the best modelling of the multiplicity of additional jets and of both the individual top-quark and the tt-system p_T [43].
Alternative tt simulation samples were generated using Powheg Box v2 interfaced to Herwig 7.0.4 [44] with the H7-UE-MMHT set of tuned parameters. The effects of initial- and final-state radiation (ISR, FSR) are explored by reweighting the baseline tt events in a manner that reduces (reduces and increases) the initial-state (final-state) parton shower radiation [43], and by using an alternative Powheg Box v2 + Pythia 8.230 sample with h_damp set to 3 m_top and the Var3c parameter variation group (described in Ref. [43]) increased, leading to increased ISR.
The majority of events with at least one 'fake' lepton in the selected sample arise from tt production where only one of the W bosons, which originated from a top-quark decay, decays leptonically. These fake leptons come from several sources, including non-prompt leptons produced from bottom or charm hadron decays, electrons arising from a photon conversion, jets misidentified as electrons, or muons produced from in-flight pion or kaon decays. This background is also modelled using the tt production described above. The rate of events with two fake leptons is found to be negligible.
Non-tt processes, which are largely subdominant in this analysis, can be classified into two types: those with two real prompt leptons from W or Z decays (dominant) and those where at least one of the reconstructed lepton candidates is 'fake' (subdominant). Backgrounds containing two real prompt leptons include single top production in association with a W boson (Wt), diboson production (WW, WZ, ZZ) where at least two leptons are produced in the electroweak boson decays, and Z+jets production with the Z boson decaying into leptons. The Wt single top production was modelled using Powheg Box v2 interfaced to Pythia 8.230 using the 'diagram removal' scheme [45,46] with the A14 set of tuned parameters and the NNPDF30NNLO (NNPDF2.3LO) [41,42] parton distribution functions in the matrix elements (parton shower). Diboson production with additional jets was simulated using Sherpa v2.2.1 [47,48] (for events where one boson decays hadronically) or Sherpa v2.2.2 (for events where no boson decays hadronically), using the NNPDF30NNLO PDF set [41]. This includes the 4ℓ, ℓℓℓν, ℓℓνν, ℓννν, ℓℓqq and ℓνqq final states, which cover WW, WZ and ZZ production including off-shell Z contributions. Z+jets production (including both Z → ττ and Z → ee/µµ) was modelled using Sherpa v2.2.1 with the NNPDF30NNLO PDF set. Processes with one real lepton include t-channel and s-channel single top production [49]. These processes were modelled with the same generator and parton shower combination as the Wt channel. W+jets production, with the W boson decaying into eν, µν or τν with the τ-lepton decaying leptonically, was modelled in a similar way to the Z+jets production described above.
Alternative samples of non-tt processes include Wt single top production using Powheg Box v2 interfaced to Herwig 7.0.4. The effects of ISR and FSR are evaluated by reweighting the baseline single-top events in a manner that either reduces or increases the parton shower radiation. An additional Wt sample, using Powheg Box v2 interfaced to Pythia 8.230 with the alternative 'diagram subtraction' scheme [45,46], is used to investigate the impact of the interference between tt and Wt production. Uncertainties in diboson and Z+jets production are estimated by reweighting the baseline samples, whereas uncertainties in processes with one real lepton are evaluated directly from data, as described later in Section 8.
As described in Section 4, the new Run 2 b-tagging algorithm training strategy is based on the use of a hybrid sample composed of both the baseline tt event sample and a dedicated sample of Z′ bosons decaying into hadronic jet pairs. This Z′ sample was generated using Pythia 8.212 with the A14 set of tuned parameters for the underlying event and the leading-order NNPDF2.3LO [42] parton distribution functions.
The EvtGen [50] package was used to handle the decay of heavy-flavour hadrons for all samples except those generated with Sherpa, for which the default configuration recommended by the Sherpa authors was used. All MC events have additional overlaid minimum-bias interactions, generated with Pythia 8.160 with the A3 set of tuned parameters [51] and the NNPDF2.3LO parton distribution functions, to simulate the pile-up background, and are weighted to reproduce the observed distribution of the average number of interactions per bunch crossing in the corresponding data sample. The nominal MC samples were processed through the full ATLAS detector simulation [52] based on GEANT4 [53], while most samples used for systematic uncertainty evaluation were processed with a faster simulation making use of parameterised showers in the calorimeters [54]. The simulated events were reconstructed using the same algorithms as the data.

Event selection and classification
A sample of events enriched in dileptonic tt decays is selected by requiring exactly two well-identified lepton candidates and two jets to be present in each event. Events are further classified on the basis of two topological variables to control processes including non-b-jets. The lepton definitions, event selection and classification are described in this section.

Lepton object definition
In addition to the objects reconstructed for b-tagging, described in Section 3, the event selection for the efficiency measurement requires electron and muon candidates, defined as follows.
Electron candidates are reconstructed from an isolated energy deposit in the electromagnetic calorimeter matched to an ID track [55]. Electrons are selected for inclusion in the analysis within the fiducial region of transverse energy E T ≥ 28 GeV and |η| < 2.47. Candidates within the transition region between the barrel and endcap electromagnetic calorimeters, 1.37 ≤ |η| < 1.52, are removed in order to avoid large trigger efficiency uncertainties in the turn-on region of the lowest p T trigger. A tight likelihood-based electron identification requirement is used to further suppress the background from multi-jet production. Isolation criteria are used to reject candidates coming from sources other than prompt decays from massive bosons (hadrons faking an electron signature, heavy-flavour decays or photon conversions). Scale factors (SFs), of order unity, derived in Z → e + e − events are applied to simulated events to account for differences in reconstruction, identification and isolation efficiencies between data and simulation. Electron energies are calibrated using the Z mass peak.
Muon candidates are reconstructed by combining tracks found in the ID with tracks found in the muon spectrometer [56]. Muons are selected for inclusion in the analysis within the fiducial region of transverse momentum p T ≥ 28 GeV and |η| < 2.5. If the event contains a muon reconstructed from high hit multiplicities in the muon spectrometer due to very energetic punch-through jets or from badly measured inner detector tracks in jets wrongly matched to muon spectrometer track segments, the whole event is vetoed. A tight muon identification requirement is applied to the muon candidates to further suppress the background. Isolation selections similar to the ones applied to the electron candidates are imposed to reject candidates coming from sources other than prompt massive boson decays (hadrons faking a muon signature or heavy-flavour decays). SFs of order unity, similar to those for electrons and derived in Z → µ + µ − events, are applied to account for differences in reconstruction, identification and isolation efficiencies between data and simulated events. Muon momenta are calibrated using the Z mass peak.
If electrons, muons or jets overlap with each other, all but one object must be removed from the event. The distance metric used to define overlapping objects is ∆R = √((∆φ)² + (∆y)²), where ∆y represents the rapidity difference. To prevent double-counting of electron energy deposits as jets, jets within ∆R = 0.2 of a reconstructed electron candidate are removed. If the nearest remaining jet is within ∆R = 0.4 of the electron, the electron is discarded. To reduce the background from muons from heavy-flavour decays inside jets, muons are required to be separated by ∆R ≥ 0.4 from the nearest jet. In cases where a muon and a jet are reconstructed within ∆R < 0.4, the muon is removed if the jet has at least three associated tracks; the jet is removed otherwise. This avoids an inefficiency for high-energy muons undergoing significant energy loss in the calorimeter.
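The overlap-removal sequence above can be sketched as follows. This is a minimal illustration using hypothetical dict-based objects carrying 'phi', rapidity 'y' and, for jets, a track count 'ntracks'; the actual ATLAS implementation operates on fully reconstructed physics objects.

```python
import math

def delta_r(a, b):
    """Separation dR = sqrt((dphi)^2 + (dy)^2) between two objects."""
    dphi = (a["phi"] - b["phi"] + math.pi) % (2 * math.pi) - math.pi
    return math.hypot(dphi, a["y"] - b["y"])

def overlap_removal(electrons, muons, jets):
    """Sequential overlap removal following the order described in the text."""
    # 1) remove jets within dR < 0.2 of an electron (double-counted deposits)
    jets = [j for j in jets if all(delta_r(j, e) >= 0.2 for e in electrons)]
    # 2) discard electrons within dR < 0.4 of the nearest remaining jet
    electrons = [e for e in electrons if all(delta_r(e, j) >= 0.4 for j in jets)]
    # 3) muon-jet ambiguity: drop the muon if the overlapping jet has at least
    #    three associated tracks, otherwise drop the jet
    kept_muons = []
    for m in muons:
        near = [j for j in jets if delta_r(m, j) < 0.4]
        if any(j["ntracks"] >= 3 for j in near):
            continue  # muon rejected (likely a heavy-flavour decay inside a jet)
        jets = [j for j in jets if j not in near]  # low-track-multiplicity jets rejected
        kept_muons.append(m)
    return electrons, kept_muons, jets
```

The ordering matters: electron-jet ambiguities are resolved before the muon-jet step so that muons are only compared against surviving jets.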

Event selection
To be considered in this analysis, events must have at least one lepton identified in the trigger system. This triggered lepton must match an offline electron or muon candidate. For each applicable trigger, scale factors are applied to the simulation in order to correct for known differences in trigger efficiencies between the simulation and collision data [11].
In order to reject backgrounds with fewer than two prompt leptons, exactly two reconstructed leptons with opposite charges are required. Contributions from backgrounds with Z bosons are reduced by requiring that one lepton is an electron and the other is a muon. The residual contribution from Z → ττ events, which populate the low mass region, is further reduced by considering only events with m eµ ≥ 50 GeV. The contribution from tt events with light-flavour jets from ISR or FSR or from W bosons is reduced by requiring exactly two reconstructed jets.
Since the aim of the study is to measure the b-jet tagging efficiency, it is useful to label simulated events according to the generator-level flavour of the two selected jets, following the definitions introduced in Section 3, instead of the physics process they originate from. Events with two b-jets (non-b-jets) are labelled bb (ll). Events with one selected b-jet and one non-b-jet are labelled bl events if the b-jet p T is larger than the non-b-jet p T and lb in the opposite case. According to the simulation, more than 90% of the non-b-jets are light-flavour jets, the rest being composed of c-jets, and more than 95% of the b-jets originate from a top-quark decay. The fraction of τ-jets is predicted to be negligible.
In order to create bb, bl, lb and ll-enriched regions in the selected sample, each of the two leptons is paired with a jet in an exclusive way to determine whether they originate from the same top-quark decay.
The pairing is performed such that it minimises (m²_{j1,i} + m²_{j2,j}), where j1 (j2) is the highest-p T (second-highest-p T ) jet, i, j are the two leptons and m_{j1,ℓ} (m_{j2,ℓ}) is the invariant mass of the system formed by the highest (second highest) p T jet and its associated lepton. Choosing the pairing that minimises this quantity relies on the fact that if the pairs of objects come from the same original particles then they are likely to have similar masses. Using the minimum of squared masses penalises asymmetric pairings with one high-mass lepton-jet pair, as well as combinations including two very high invariant masses, which are unlikely for pairs arising from top-quark decay. Events are required to have m_{j1,ℓ} ≥ 20 GeV and m_{j2,ℓ} ≥ 20 GeV in order to avoid configurations in which a soft jet and a soft lepton are close to each other, which are not well described by the simulation. The event classification based on these variables is described in more detail in the next section.
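The pairing criterion can be sketched as follows; this is a simplified illustration taking precomputed invariant masses as input (hypothetical format: a dict mapping (jet, lepton) pairs to masses in GeV), whereas the analysis computes them from the reconstructed four-momenta.

```python
def choose_pairing(masses):
    """Pick the exclusive lepton-jet pairing minimising m^2_{j1,i} + m^2_{j2,j}.

    `masses` maps (jet, lepton) labels to the invariant mass of that pair.
    With two jets and two leptons there are only two exclusive pairings.
    """
    pairings = [
        (("j1", "l1"), ("j2", "l2")),
        (("j1", "l2"), ("j2", "l1")),
    ]
    return min(pairings, key=lambda p: masses[p[0]] ** 2 + masses[p[1]] ** 2)
```

For example, a symmetric pairing with masses (60, 80) GeV gives 60² + 80² = 10000 GeV², beating the asymmetric alternative (150, 40) GeV with 150² + 40² = 24100 GeV², illustrating how the sum of squares penalises one high-mass lepton-jet pair.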
According to the simulation, about 85% of the events passing the selection are dileptonic tt events, about 65% of which are bb events. Single top production in association with a W boson accounts for 8% of the events, with about 30% of these events containing two selected b-jets. Diboson and Z+jet production represent respectively about 5% and 2% of the selected events, 85% of these events being ll events. Events originating from W+jets production are negligible (< 0.1%). The main source of non-b-jets therefore originates from tt bl or lb events, i.e. dileptonic tt events with a high-p T light-flavour jet originating from ISR or FSR. Figure 3 shows the level of agreement between data and simulation as a function of the p T and η of the selected jets as well as the expected fraction of tt events. The overall level of agreement between data and simulation is fairly good, although some mismodelling is present at high jet p T , possibly related to the modelling of the top-quark p T [57], which motivates the extraction of the b-jet tagging efficiency in jet p T bins. The distribution of the discriminant of the MV2 algorithm for events passing the selection is shown in Figure 4. Generally, good modelling is observed, indicating similar b-jet tagging efficiencies in data and simulation.

Event classification
The distributions of the m_{j1,ℓ} and m_{j2,ℓ} observables are shown in Figure 5. In the case of tt events with two b-jets, both m_{j1,ℓ} and m_{j2,ℓ} have an upper limit around m t = 172.5 GeV and are usually significantly smaller due to the undetected neutrino. This is generally not the case for bl, lb and ll events, which result in high m_{j1,ℓ} and/or m_{j2,ℓ} values more often at high jet p T . Therefore, the m_{j1,ℓ} and m_{j2,ℓ} observables discriminate between bb, bl, lb and ll events while being uncorrelated with the b-tagging discriminants, which do not make use of leptons outside jets.
The selected events are classified into 45 different bins according to the p T of the two jets, allowing the b-jet tagging efficiency to be measured as a function of the jet p T . In addition, in each leading jet p T , subleading jet p T bin (p T,1 , p T,2 ), the events are further classified into four bins according to the m_{j1,ℓ} and m_{j2,ℓ} values:
• m_{j1,ℓ}, m_{j2,ℓ} < 175 GeV, signal region (SR): high bb purity region used to measure the b-jet tagging efficiency,
• m_{j1,ℓ}, m_{j2,ℓ} ≥ 175 GeV, ll control region (CR LL ): high ll purity control region used to constrain the bb, bl, lb and ll fractions in the SR,
• m_{j1,ℓ} < 175 GeV, m_{j2,ℓ} ≥ 175 GeV, bl control region (CR BL ): high bl purity control region used to constrain the bb, bl, lb and ll fractions in the SR,
• m_{j1,ℓ} ≥ 175 GeV, m_{j2,ℓ} < 175 GeV, lb control region (CR LB ): high lb purity control region used to constrain the bb, bl, lb and ll fractions in the SR.
Finally, the events in the SR are further classified as a function of the pseudo-continuous binned b-tagging discriminant of the two jets, denoted w 1 and w 2 , as defined in Section 4.4.  These classifications result in a total of 1260 orthogonal categories. A schematic diagram illustrating the event categorisation is shown in Figure 6. The bb event purity in the signal regions for the different p T,1 , p T,2 bins is shown in Figure 7. The lowest purity (19%) occurs when both jets have very low p T ; however, the majority of bins have a bb event purity greater than 70% and the highest purity (where both jets have high p T ) reaches 93%. The CR LL , CR BL and CR LB control regions are enriched in their targeted backgrounds relative to the corresponding SR. Their purity in ll, bl and lb events varies across the p T,1 , p T,2 plane and ranges in the simulation from 30% to 90% (CR LL ), 32% to 79% (CR BL ) and 20% to 74% (CR LB ), respectively. The dominant background in each SR always benefits from a high-purity (i.e. ≥ 50%) control region.
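The region assignment based on the two lepton-jet masses can be sketched as a simple decision function (a minimal sketch; the full categorisation additionally bins events in p T,1 , p T,2 and, in the SR, in the two b-tagging discriminant bins):

```python
def classify_region(m_j1l, m_j2l, m_cut=175.0):
    """Assign an event to the signal or control regions from the two
    lepton-jet invariant masses (in GeV), using the 175 GeV boundaries."""
    below1, below2 = m_j1l < m_cut, m_j2l < m_cut
    if below1 and below2:
        return "SR"      # bb-enriched signal region
    if not below1 and not below2:
        return "CR_LL"   # ll-enriched control region
    return "CR_BL" if below1 else "CR_LB"
```

Because the boundaries are the same in every p T,1 , p T,2 bin, the four regions partition the selected sample with no overlap.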

Extraction of b-jet tagging efficiency
Once events have been selected and classified, the measurement of the b-jet tagging probabilities is performed. The precision of the previous ATLAS measurement [6] was limited by the uncertainty in the fractions of bb, bl, lb and ll events in the selected sample, which is driven by the modelling of top-quark pair production. The main novelty of this work in comparison to Ref.
[6] lies in the measurement method, which uses both signal and control region data to define a joint log-likelihood function allowing the simultaneous estimate of the b-jet tagging probabilities and flavour compositions. This new technique leads to a reduction in the total uncertainties by up to a factor of two, as discussed in Section 8. The general form of an extended binned log-likelihood function, after dropping terms that do not depend on the parameters to be estimated, is provided in Eq. (1):

ln L(Θ) = −ν_tot + Σ_{i=1}^{N_bins} n_i ln ν_i(Θ),    (1)

where ν_tot is the total expected number of events, Θ = (Θ_1, ..., Θ_m) is the list of parameters to be estimated, including the parameters of interest (POI) and nuisance parameters, ν_i (n_i) is the number of expected (observed) events in bin i, and N_bins bins are considered in total. In this work, the POIs are the b-jet tagging probabilities, P_b, introduced in Section 4.4. They are defined in this measurement per p T bin, i.e. as the conditional probabilities for a b-jet with a transverse momentum falling in one of the nine p T bins (T_m)_{m=1..9} of the measurement to have a b-tagging discriminant w falling in one of the five pseudo-continuous bins (O_k)_{k=1..5}. The b-jet tagging efficiency of the single-cut OP X in that jet p T bin, ε_b, relates to the POIs as

ε_b(T_m) = Σ_{k ≥ k_X} P_b(O_k | T_m),

i.e. the sum of the tagging probabilities over the discriminant bins above the cut value defining OP X. In each control region, the number of events in a given p T,1 , p T,2 bin (T_m, T_n) is written as the sum of the bb, bl, lb and ll yields expected in that bin (ν^{m,n}_{bb}, ν^{m,n}_{bl}, ν^{m,n}_{lb}, ν^{m,n}_{ll}), corrected by p T,1 , p T,2 dependent correction factors (c^{m,n}_{bb}, c^{m,n}_{bl}, c^{m,n}_{lb}, c^{m,n}_{ll}), forming the nuisance parameters:

ν_CR(T_m, T_n) = c^{m,n}_{bb} ν^{m,n}_{CR,bb} + c^{m,n}_{bl} ν^{m,n}_{CR,bl} + c^{m,n}_{lb} ν^{m,n}_{CR,lb} + c^{m,n}_{ll} ν^{m,n}_{CR,ll}.
In each signal region, the events are further binned according to the b-tagging discriminants of the two jets, w_1, w_2. The number of events expected in a given p T,1 , p T,2 , w_1 , w_2 bin (T_m, T_n, O_k, O_p) is thus written:

ν_SR(T_m, T_n, O_k, O_p) = c^{m,n}_{bb} ν^{m,n}_{SR,bb} P_b(O_k|T_m) P_b(O_p|T_n) + c^{m,n}_{bl} ν^{m,n}_{SR,bl} P_b(O_k|T_m) P_l(O_p|T_n) + c^{m,n}_{lb} ν^{m,n}_{SR,lb} P_l(O_k|T_m) P_b(O_p|T_n) + c^{m,n}_{ll} ν^{m,n}_{SR,ll} P_l(O_k|T_m) P_l(O_p|T_n),

where P_l is the effective b-jet tagging probability of the mix of c-jets and light-flavour jets predicted by the simulation in each p T,1 , p T,2 bin. The POIs and correction factors are estimated by minimising the negative log-likelihood function defined above with the MINUIT algorithm [58]. Both the POIs and correction factors are free parameters during the minimisation procedure. Signal and control region data are provided as input, as well as the P_l conditional probabilities, which are estimated from the MC simulation corrected to match data (see Section 1). The simulation is also used to determine the bb, bl, lb and ll yield fractions according to the type of region (SR, CRs), as the correction factors are defined as a function of p T,1 and p T,2 only.
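To make the likelihood construction concrete, the following is a drastically simplified toy: a single p T bin, a pure-bb sample, and only two discriminant categories (tagged/untagged), with a grid scan standing in for the MINUIT minimisation. None of these simplifications hold in the real 1260-category fit; the sketch only illustrates the shape of the extended binned negative log-likelihood.

```python
import math

def toy_nll(eps_b, n_tag, n_untag, nu_bb):
    """Toy extended binned negative log-likelihood (constant terms dropped)
    for one pT bin: each bb event contributes two b-jets, split into
    'tagged' and 'untagged' categories with probability eps_b."""
    nu_t = 2.0 * nu_bb * eps_b          # expected tagged b-jets
    nu_u = 2.0 * nu_bb * (1.0 - eps_b)  # expected untagged b-jets
    return (nu_t + nu_u) - n_tag * math.log(nu_t) - n_untag * math.log(nu_u)

def fit_eps_b(n_tag, n_untag, nu_bb):
    """Crude grid scan over eps_b, a stand-in for the MINUIT minimisation."""
    grid = [k / 1000.0 for k in range(1, 1000)]
    return min(grid, key=lambda e: toy_nll(e, n_tag, n_untag, nu_bb))
```

In this toy the maximum-likelihood estimate reduces analytically to the tagged fraction, n_tag / (n_tag + n_untag), which the scan recovers.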
The extraction method is validated using pseudo-data generated with a known flavour composition. The pseudo-data are created by combining events from either the nominal or alternative MC simulation, fluctuated according to the statistical uncertainty expected from the actual dataset. The input parameters of the minimisation procedure are taken from the nominal MC simulation in all cases. The sizes of the non-closure effects observed when using pseudo-data based on the nominal and alternative MC simulations are compared with the expected data statistical uncertainty (0.6-3.7%) and with the sum in quadrature of the expected data statistical uncertainty, the MC statistical uncertainty and the physics modelling uncertainties quoted for the final measurement (0.9-5.4%), respectively. The non-closure effects are found to be within uncertainties in each jet p T bin such that no additional uncertainty related to the signal extraction method is considered.

Uncertainties
Uncertainties affecting the measurement which originate from statistical sources are considered together with systematic uncertainties related to the detector calibration and physics modelling.
The data statistical uncertainty in the b-jet tagging probabilities, and their bin-to-bin correlations, are obtained from the error matrix returned by MINUIT [58] and propagated to the b-jet tagging efficiencies via a basis transformation. The data statistical uncertainty reaches about 4% (2%) for jets within 20 ≤ p T < 30 GeV (30 ≤ p T < 40 GeV), ranges from 1% to 3% for jet p T ≥ 140 GeV and is below 1% elsewhere.
The bootstrap resampling technique [59] is used to assess the MC statistical uncertainty by creating an ensemble of statistically equivalent measurements in which the weight of each simulated event used in the nominal measurement is multiplied by an additional term, randomly chosen for each event from a Poisson distribution with a mean of one. The standard deviation of the distribution of these measurements is taken as the MC statistical uncertainty. This method allows all correlations to be preserved and the uncertainty in the value of any parameter to be extracted. One hundred bootstrap replicas of each simulated sample are used for this evaluation. The MC statistical uncertainty in the b-jet tagging efficiencies is found to be non-negligible only for jet p T ≤ 40 GeV, where it reaches about 2% and 1% for jets within 20 GeV ≤ p T < 30 GeV and 30 GeV ≤ p T < 40 GeV, respectively.
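The Poisson-bootstrap reweighting can be sketched as below. This toy estimates the statistical uncertainty on a simple weighted mean rather than on the fitted tagging probabilities, and uses a Knuth sampler since the Python standard library has no Poisson generator; the sampler and the observable are illustrative choices, not the analysis code.

```python
import math
import random

def poisson_mean_one(rng):
    """Knuth sampler for a Poisson-distributed weight with mean one."""
    limit, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def bootstrap_std(values, weights, n_rep=100, seed=1):
    """Poisson-bootstrap estimate of the MC statistical uncertainty on a
    weighted mean: each replica multiplies every event weight by an
    independent Poisson(1) factor, preserving all correlations."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_rep):
        ws = [w * poisson_mean_one(rng) for w in weights]
        total = sum(ws)
        if total == 0.0:  # replica with no surviving events: skip
            continue
        reps.append(sum(w * v for w, v in zip(ws, values)) / total)
    mu = sum(reps) / len(reps)
    return math.sqrt(sum((r - mu) ** 2 for r in reps) / len(reps))
```

The standard deviation across replicas is the uncertainty estimate; because every replica reuses the same events with fluctuated weights, correlations between any derived quantities are preserved.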
The systematic uncertainties are derived by varying a parameter in the simulated events, repeating the complete analysis with this varied parameter and taking the difference between the updated measurement of the b-jet tagging efficiency or probability and the nominal measurement as the (bin-wise correlated) uncertainty. For b-jet tagging efficiencies, the bootstrap replicas of simulated events are then used to evaluate the MC statistical uncertainty in each systematic variation. Variations of the b-jet tagging efficiency that are not statistically significant undergo a bin-merging procedure over an increasing number of p T bins to improve their significance. Following this procedure, only statistically significant variations are considered as systematic uncertainties.
Uncertainty sources related to the energy scale and resolution of hadronic jets [19] encompass both the modelling of the detector response and the analysis techniques used to derive the calibration. The impact of the jet energy scale uncertainty is 4%-5% for jet p T ≤ 30 GeV, about 1% for 30 GeV ≤ p T < 40 GeV, and negligible elsewhere. It is dominated by the prediction of the quark/gluon origin of the light-flavour jets and by the difference in their energy response, as well as the difference in the calorimeter energy response as a function of η. The uncertainty originating from the jet energy resolution is negligible. Uncertainty sources related to the performance of the JVT algorithm [21], the b-tagging performance for light-flavour jets [8] and c-jets [7] as well as the modelling of pile-up interactions were investigated and found to be negligible, as were lepton-related uncertainties, including energy/momentum scale and resolution, identification, isolation, trigger and track-vertex association efficiency.
The uncertainty in the physics modelling of top-quark events is evaluated by changing the parton shower and hadronisation model from Pythia 8 to Herwig 7 and increasing or decreasing the amount of ISR and FSR within Pythia 8 [43]. The uncertainties originating from parton distribution functions (PDF) are quantified following the PDF4LHC recommendations [60]. An additional source of uncertainty originates from the mismodelling of the interference between single top Wt and tt production. It is evaluated by switching the nominal single-top simulation sample, based on the 'diagram removal' scheme, to the one based on the 'diagram subtraction' scheme [45]. The final tt modelling uncertainty reaches 3% (2%) for jet p T < 30 GeV (30 ≤ p T < 40 GeV) and about 1% for p T ≥ 40 GeV. It is dominated at low p T by PDF and ISR/FSR variations whereas at higher p T the choice of parton shower and hadronisation model is the dominant contribution. The single-top uncertainty reaches about 3% for jet p T < 30 GeV due to the parton shower and hadronisation model variation, and 1% for jet p T ≥ 250 GeV, where the uncertainty in the interference with tt events is the dominant contribution. It is below 1% elsewhere. The uncertainties associated with the modelling of top-quark events are reduced by up to a factor of two relative to the previous ATLAS analysis [6] due to the new b-jet tagging efficiency extraction method, which allows the bb event yield to be determined at a precision of a few percent in each p T,1 , p T,2 bin.
The uncertainty in the modelling of diboson and Z+jet production [61, 62] is evaluated by varying the total cross-section and the factorisation and renormalisation scales for these processes, as well as propagating the uncertainty from the PDF. The total cross-section is kept constant when performing the scale and PDF variations such that only the shapes of the kinematic distributions are impacted. The total cross-section is varied by ±6% (±5%) for Z+jets (diboson) production. The scale uncertainties are estimated simultaneously by varying the nominal values by a factor of two up and down and taking the largest deviations from the nominal predictions in each direction as uncertainties. PDF uncertainties are evaluated using the 100 bootstrap replicas provided with the NNPDF3.0 NNLO set [41], following the same method as outlined for the MC statistical uncertainty earlier in this section. The final diboson and Z+jets uncertainties are found to be negligible in the entire range covered by the analysis.
The number of events with a selected muon not originating from a Z- or W-boson decay is predicted by the simulation to be negligible after the event selection. This is due to the tight muon identification and isolation criteria applied. The number of events with a selected electron not originating from a Z- or W-boson decay passing the event selection (1NPel, for 1 non-prompt electron) is also predicted by the simulation to be very small but one order of magnitude higher, reaching about 0.3% of the total event yield after selection. An uncertainty in this yield is derived by comparing the number of data and MC events in an alternative region defined by requiring two same-sign (SS) leptons instead of opposite-sign (OS). The SS region is predicted by the simulation to have a composition that is 12% 1NPel events, with the remaining 88% of the sample coming from non-1NPel events, which are dominated by diboson production. The non-1NPel contribution is estimated from simulation and subtracted from the data. The remaining data events are then compared with MC predictions in bins of electron p T . The data-to-simulation ratio ranges from values close to 3 for p T < 120 GeV to values close to 1 for p T ≥ 300 GeV. These values are used as simulation-to-data scale factors to correct the yield of simulated 1NPel events in the OS region in order to estimate an uncertainty in the fake-lepton modelling. The b-jet tagging efficiency measurement is then repeated with these scale factors applied and compared with the nominal measurement. Differences of about 1% to 2% for jet p T < 40 GeV and negligible elsewhere are observed and accounted for as an additional systematic uncertainty.

Results
The goodness-of-fit is evaluated by computing a Pearson's χ 2 and comparing it with the number of degrees of freedom (ndf ) of the fit [63]. This procedure tests the hypothesis that the remaining differences between observed and expected yields post-fit originate only from the limited size of the dataset. The χ 2 /ndf value obtained for the nominal measurement is 0.98, corresponding to a p-value of about 0.65. This result illustrates the high goodness-of-fit already observed before accounting for the other sources of uncertainty discussed in Section 8.
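The goodness-of-fit statistic can be sketched directly; dividing by ndf = (number of bins) − (number of fitted parameters) gives the χ²/ndf figure quoted above (the p-value additionally requires the χ² survival function, omitted here to keep the sketch standard-library only).

```python
def pearson_chi2(observed, expected):
    """Pearson's chi-square between observed counts and post-fit expectations."""
    return sum((n - nu) ** 2 / nu for n, nu in zip(observed, expected))

def chi2_over_ndf(observed, expected, n_params):
    """Goodness-of-fit figure: chi^2 divided by the degrees of freedom."""
    ndf = len(observed) - n_params
    return pearson_chi2(observed, expected) / ndf
```

A value near one indicates that the residual data/model differences are compatible with the statistical fluctuations of the dataset.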
The bb, bl, lb and ll yield post-fit correction factors are of order unity, compatible with unity within uncertainties, and typically constrained within 2%-5% for bb, 5%-10% for bl, and 7%-20% for lb and ll.
The central values of the bb yield correction factors tend to be a few percent below unity, pointing to a slight underestimate of the number of light-flavour jets in the nominal simulation. The yield correction factors deviate more strongly from unity when running on the alternative simulated samples.
The b-jet tagging efficiency measurement for the ε b = 70% single-cut OP of the MV2 algorithm is presented in Figure 8(a) as a function of jet p T together with the efficiency derived from tt simulated events passing the signal region selection. The corresponding b-jet tagging efficiency simulation-to-data scale factors, defined as the ratio of the measured b-jet tagging efficiency to that derived from the simulation, are shown in Figure 8(b). Scale factors are derived for all single-cut OPs and for the DL1 tagger using the same technique, with similar results, as illustrated in Figure 9. The scale factors have values very close to one and are approximately constant throughout the entire p T range, illustrating the good modelling of the b-jet tagging performance. The b-jet tagging efficiency measurement for the ε b = 70% single-cut OP of the MV2 algorithm as a function of jet |η| and the corresponding simulation-to-data scale factors are presented in Figure 10. The b-jet tagging probability and efficiency measurement was also repeated considering only data and simulated events with either fewer or more than 28 additional pp interactions per bunch crossing, and separately for 2015-2016 and 2017 data. In all cases, consistent results were observed. The uncertainty in the efficiency measurement for the ε b = 70% single-cut OP of the MV2 tagger is summarised in Table 5. The total uncertainty reaches about 1% for 40 GeV ≤ p T < 250 GeV, where it is dominated by the uncertainty in the physics modelling of tt events and the data statistical uncertainty. At lower p T values (20 GeV ≤ p T < 40 GeV), the total uncertainty increases to 8% due to higher uncertainties in the jet energy scale, the modelling of tt and single-top-quark events, the limited number of data and MC events and the modelling of fake leptons. For jet p T ≥ 250 GeV, the uncertainty increases to about 3% due to the limited number of data events.
These observations are consistent across single-cut OPs and taggers.
Table 5: Breakdown of the systematic uncertainties in the b-jet tagging efficiency measurement for the 70% single-cut OP of the MV2 tagger as a function of the jet p T bin. The 'tt modelling' and the 'Single top modelling' uncertainties correspond to the sum in quadrature of the uncertainty in the parton shower, hadronisation model, initial-state and final-state radiation and PDF for tt and single top-quark production, respectively. The 'Single top modelling' uncertainties include an additional source originating from the interference between single top and tt production. 'Other sources' corresponds to the sum in quadrature of the uncertainties related to jet energy resolution, electron and muon performance, b-tagging performance for light-flavour jets and c-jets, JVT performance, diboson and Z+jet modelling (including normalisation and shape uncertainties) and pile-up modelling. All systematic uncertainties are fully correlated bin-by-bin whereas the statistical uncertainty correlations are evaluated following the procedures described in Section 8. In the case of correlated systematic uncertainties, the relative sign of the uncertainty in each bin is taken into account, even if not shown here.

The measurement of the b-jet tagging probabilities in the MV2 and DL1 algorithm output bins is presented in Figures 11(a) and 11(c) together with the b-jet tagging probabilities derived from tt simulated events passing the signal region selection. The probabilities are shown for jets with 110 GeV ≤ p T < 140 GeV, which is located close to the b-jet tagging efficiency maximum. The corresponding b-jet tagging probability scale factors are shown in Figures 11(b) and 11(d). The uncertainty in this measurement is summarised in Table 6. The total uncertainty varies from about 9% in the 100%-85% bin to about 1% in the 60%-0% bin.
It is driven by the tt modelling uncertainties and data statistics, which is consistent with the result reported for the 70% single-cut OP in this p T range.

Usage in ATLAS analysis
This section details how the simulation-to-data scale factors are incorporated into ATLAS physics analyses. Scale factors are smoothed, extrapolated beyond the jet p T range of the data measurement and corrected taking into account the generator dependence in the simulation. The number of systematic uncertainties is reduced while preserving the bin-by-bin correlations. The scale factors are then applied to ATLAS physics analyses by correcting the b-jet tagging response in simulation and by applying related uncertainties to the correction.

Table 6: Breakdown of the systematic uncertainties in the b-jet tagging probability measurement of the MV2 tagger as a function of the 'pseudo-continuous' bins for jets satisfying 110 ≤ p T < 140 GeV. The 'tt modelling' and the 'Single top modelling' uncertainties correspond to the sum in quadrature of the uncertainty in the parton shower, hadronisation model, initial-state and final-state radiation and PDF for tt and single-top-quark production, respectively. The 'Single top modelling' uncertainties include an additional source originating from the interference between single top and tt production. 'Other sources' corresponds to the sum in quadrature of the uncertainties related to jet energy resolution, electron and muon performance, b-tagging performance for light-flavour jets and c-jets, JVT performance, diboson and Z+jet modelling (including normalisation and shape uncertainties) and pile-up modelling. All systematic uncertainties are fully correlated bin-by-bin whereas the statistical uncertainty correlations are evaluated following the procedures described in Section 8. In the case of correlated systematic uncertainties, the relative sign of the uncertainty in each bin is taken into account, even if not shown here.

Smoothing
The simulation-to-data scale factors for single-cut OPs are smoothed in jet p T using a local polynomial kernel estimator with a bandwidth parameter of 0.2, following the procedure described in Ref. [6]. This procedure prevents distortions in the variables of interest induced by the application of the scale factors.
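A kernel smoother of this flavour can be sketched as below. This is a simplified Gaussian local-mean (Nadaraya-Watson) estimator standing in for the local polynomial kernel estimator of Ref. [6]; the actual bandwidth definition, polynomial order and uncertainty treatment follow that reference.

```python
import math

def kernel_smooth(x, y, x_eval, bandwidth=0.2):
    """Gaussian-kernel local-mean smoother: at each evaluation point, return
    the kernel-weighted average of the input scale-factor values."""
    smoothed = []
    for xe in x_eval:
        w = [math.exp(-0.5 * ((xi - xe) / bandwidth) ** 2) for xi in x]
        smoothed.append(sum(wi * yi for wi, yi in zip(w, y)) / sum(w))
    return smoothed
```

A local-mean smoother always returns values within the range of its inputs, so a flat set of scale factors is left unchanged while bin-to-bin fluctuations are damped.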

Extrapolation to high-p T jets
The analysis described in this paper provides a precise measurement of the b-jet tagging efficiency in data and compares it with the one obtained from MC simulation. Since only a small number of b-jets is available in data for jet p T above 400 GeV in dileptonic tt events, an alternative assessment of the uncertainty in the b-jet tagging efficiency in this range is developed to extend the single-cut OP calibration to the entire jet p T range probed by ATLAS physics analyses. Underlying quantities that are known to affect the b-tagging performance are varied in the simulation one by one and the b-jet tagging efficiency is recomputed in each case. The difference from the b-jet tagging efficiency obtained in the nominal simulation is then taken as an additional systematic uncertainty.
Four distinct sets of variables are considered, related to the reconstruction of tracks, the reconstruction of jets, the modelling of b-hadrons and the interaction of long-lived b-hadrons with the detector material. Among the uncertainties related to the reconstruction of tracks, the ones that are found to most affect the b-tagging performance are those related to the track impact-parameter resolution, the fraction of fake tracks, the description of the detector material, and the track multiplicity per jet. The uncertainty in the impact-parameter resolution includes the effects of alignment, dead modules and additional material not accurately modelled in the simulation. The uncertainty is derived from several event topologies, including dijet events where effects due to tracking in dense environments, such as in the cores of high-energy jets, are included [12]. No dedicated studies of samples enriched in high-energy b-jets, where collimated tracks from displaced decay vertices conspire to create a challenging environment for the track reconstruction algorithm, are included at this stage. The effect of the parton shower simulation and b-quark fragmentation function is evaluated by comparing the b-jet tagging efficiency with the one obtained from the alternative tt event simulations described in Section 5. In standard ATLAS MC simulations, interactions with detector material are simulated only for the decay products of the b-hadron and not for the b-hadron itself. Given that about 5% of the b-hadrons within b-jets with jet p T = 150 GeV decay after the innermost pixel detector layer, differences in the b-jet tagging efficiency at high p T are expected. In order to evaluate the size of the effect, the Z′ sample described in Section 5 was enhanced to include the interaction of b-hadrons with the detector material, and the b-jet tagging efficiency derived from this sample is compared with the one obtained from the nominal Z′ sample.
These sources of uncertainty are found to have a similar impact on the b-jet tagging efficiency of the MV2 and DL1 taggers in the jet p T range from 400 GeV to 1 TeV. In this jet p T regime, the modelling uncertainties are dominant, reaching 2% at p T ∼ 400 GeV and growing linearly to ∼ 4% at the TeV scale. The uncertainty due to the interaction with the detector material is also important, found to be ∼ 1% at p T ∼ 700 GeV and growing to ∼ 2% at ∼ 1 TeV. Other leading uncertainties include the jet energy scale and track impact-parameter resolution uncertainties, reaching about 2.5% and 1% at ∼ 1 TeV, respectively. At the TeV scale, the impact of the extrapolation uncertainty is different for MV2 and DL1, due to the differing efficiency profiles of the two b-taggers. The b-jet tagging efficiency of DL1 falls more steeply at high p T than that of MV2, which is approximately constant. This results in the jet energy scale uncertainty having a much larger impact for the DL1 tagger, due to the increased migration of jets between the p T bins.
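The one-by-one variation procedure described above can be sketched numerically. All variation names and efficiency values below are purely illustrative, not the measured ones:

```python
import numpy as np

# Nominal b-jet tagging efficiency and the efficiencies recomputed after
# varying one underlying quantity at a time (all values illustrative).
eff_nominal = 0.70
eff_varied = {
    "track_ip_resolution": 0.686,
    "fake_track_fraction": 0.695,
    "detector_material": 0.691,
    "b_fragmentation": 0.684,
}

# The shift of each recomputed efficiency from the nominal one is taken as an
# additional systematic uncertainty for that source.
shifts = {name: abs(eff - eff_nominal) for name, eff in eff_varied.items()}

# Combined effect, assuming the sources are independent: quadrature sum.
total_abs = np.sqrt(sum(s ** 2 for s in shifts.values()))
total_rel = total_abs / eff_nominal
```

The quadrature combination assumes the varied quantities are uncorrelated, which is how independent systematic sources are usually treated.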
The simulation-to-data scale factor measured in the highest p T bin considered in the collision data analysis is extrapolated to p T ≥ 400 GeV: the mean value and uncertainties after smoothing at p T = 400 GeV are assumed to remain valid at higher jet p T . An extrapolation uncertainty is then constructed as the sum in quadrature of all the uncertainties described above, rescaled in proportion to their respective values in the highest p T bin of the data measurement, and added in quadrature to the pre-existing uncertainties. The results of the smoothing and extrapolation procedures for the single cut OP scale factors are shown in Figure 12, where both the b-jet tagging efficiency as directly measured in data and its extrapolation are shown for the ε b = 70% single cut OP of the MV2 and DL1 taggers.
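The extrapolation bookkeeping can be sketched as follows; the component sizes, the 400 GeV bin edge and the variable names are illustrative assumptions, not the measured inputs:

```python
import numpy as np

# Scale factor and measurement uncertainty in the highest p_T bin of the
# data measurement, after smoothing (values illustrative).
sf_highest_bin = 0.98
sigma_measured = 0.03

# MC-based extrapolation components evaluated at high p_T, rescaled to their
# respective values in the highest measured bin (sizes illustrative:
# e.g. modelling, material interaction, jet energy scale).
components = np.array([0.02, 0.01, 0.025])

# Extrapolation uncertainty: quadrature sum of the components, itself added
# in quadrature to the pre-existing measurement uncertainty.
sigma_extrapolation = np.sqrt(np.sum(components ** 2))
sigma_total = np.hypot(sigma_measured, sigma_extrapolation)

def extrapolated_sf(pt_gev):
    """Central value held constant above the highest measured bin (assumed
    to end at 400 GeV), with the enlarged total uncertainty."""
    assert pt_gev >= 400.0
    return sf_highest_bin, sigma_total
```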

Generator dependence
The b-jet tagging efficiency in the simulation depends on several properties, such as the production fractions of the different b-hadron species, the fragmentation function and the number of additional charged particles near the b-hadron, which are not necessarily identical among the different MC event generators. Simulation-to-simulation scale factors are therefore derived to take into account differences in the b-jet tagging efficiency arising from the use of a fragmentation model different from that used to derive the simulation-to-data scale factors. The simulation-to-simulation scale factors are computed as ratios of the b-jet tagging efficiencies, in the same jet p T bins, of the alternative and nominal tt samples. For b-jets, they deviate from unity by 1% to 3% as a function of jet p T . These scale factors are applied when the b-jet tagging efficiency simulation-to-data scale factors are used with a sample produced with a showering generator different from the nominal tt generator that enters the denominator of the scale-factor calculation.
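As a minimal sketch of this correction (all efficiency and scale-factor values are invented for illustration), the per-bin ratio and its combination with the simulation-to-data scale factor can be written as:

```python
import numpy as np

# Binned b-jet tagging efficiencies in the same jet p_T bins for the nominal
# and an alternative parton-shower tt sample (values illustrative).
eff_nominal = np.array([0.71, 0.70, 0.68, 0.63])
eff_alternative = np.array([0.70, 0.69, 0.66, 0.62])

# Simulation-to-simulation scale factors: per-bin efficiency ratios
# (alternative over nominal).
sf_sim_to_sim = eff_alternative / eff_nominal

# The simulation-to-data scale factor is SF = eff_data / eff_nominal; when it
# is applied to a sample showered with the alternative generator, dividing by
# the simulation-to-simulation factor yields eff_data / eff_alternative.
sf_data = np.array([0.97, 0.98, 1.00, 1.02])  # illustrative
sf_for_alternative = sf_data / sf_sim_to_sim
```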

Reduction of systematic uncertainties
The individual application in a physics analysis of each independent systematic uncertainty included in Figure 12 would lead to a large number of variations. A method for reducing the total number of uncertainties while preserving the bin-by-bin correlations is provided for use in ATLAS physics analyses and is described in Ref. [6]. This is achieved by constructing the covariance matrix for each source of uncertainty and summing these matrices together; bin-by-bin correlations are kept as non-zero off-diagonal elements. The sum equals the total covariance matrix, which is symmetric and positive-definite, so an eigenvector decomposition is performed. The resulting number of variations equals the number of jet p T bins and is further reduced where eigenvalue variations are shown to have a negligible impact on a result.
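The reduction procedure can be sketched numerically; the per-source shifts below are invented for illustration, and the negligibility threshold is an arbitrary placeholder:

```python
import numpy as np

# Signed per-bin shifts of the scale factor for three illustrative, fully
# correlated uncertainty sources, in four jet p_T bins.
shifts = np.array([
    [0.010, 0.012, 0.015, 0.020],  # e.g. jet energy scale
    [0.008, 0.006, 0.004, 0.003],  # e.g. ttbar modelling
    [0.004, 0.004, 0.005, 0.006],  # e.g. track impact-parameter resolution
])

# Covariance matrix of each source (the outer product keeps bin-by-bin
# correlations as off-diagonal elements), summed over the sources.
cov = sum(np.outer(s, s) for s in shifts)

# Eigenvector decomposition of the symmetric total covariance matrix: one
# variation per jet p_T bin, each scaled by the square root of its eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)
variations = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))

# Variations with negligible eigenvalues can be dropped with little impact.
kept = variations[:, eigvals > 1e-12]
```

Note that the kept variations reproduce the total covariance matrix (and hence all bin-by-bin correlations) up to the discarded eigenvalues.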

Application to physics analyses
For each jet to which b-jet tagging is applied in ATLAS physics analyses, a weight is applied in simulation to match the tagging rate measured in data by the calibration analyses. The weight is jet-flavour dependent, and the calibration analysis described in this paper provides the baseline correction for jets labelled as b-jets. If the jet is tagged using a single cut OP in MC simulation, the weight is simply the smoothed simulation-to-data scale factor itself:

w = SF(p T ), (2)

where SF(p T ) is the smoothed b-jet tagging efficiency scale factor evaluated at a given p T . If the jet is not tagged, the weight becomes

w = (1 − ε b,data (p T )) / (1 − ε b,MC (p T )) = (1 − SF(p T ) ε b,MC (p T )) / (1 − ε b,MC (p T )), (3)

where ε b,MC (p T ) is the b-jet tagging efficiency in the simulation and ε b,data (p T ) = SF(p T ) ε b,MC (p T ). The latter form of Eq. (3) is adopted because, in this way, and by constructing high-granularity efficiency distributions, possible differences in the tagging rate induced by event topologies are minimised. These weights ensure that the number of events remains unchanged after the corrections are applied. The final event weight is then computed as the product of all jet weights. In cases where the physics analysis relies not on the single cut OP but on the pseudo-continuous bins of the discriminant distribution, Eq. (2) becomes dependent on the p T bin T m and the pseudo-continuous bin O k : SF(T m , O k ) is the b-jet tagging probability scale factor measured in the pseudo-continuous bin O k and p T bin T m , and Eq. (3) becomes unnecessary.
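The per-jet weighting for a single cut OP can be sketched as follows. This is an illustrative implementation of Eqs. (2) and (3), not the ATLAS software; the function name and the numerical values are assumptions:

```python
from math import prod

def jet_weight(is_tagged, sf, eff_mc):
    """Per-jet weight for a single cut OP.

    sf     -- smoothed simulation-to-data scale factor SF(pT)
    eff_mc -- b-jet tagging efficiency in simulation at the same pT
    """
    if is_tagged:
        return sf  # Eq. (2): tagged jets are reweighted by SF itself
    # Eq. (3): untagged jets get (1 - eff_data) / (1 - eff_mc), with
    # eff_data = SF * eff_mc, so the total number of jets is preserved.
    return (1.0 - sf * eff_mc) / (1.0 - eff_mc)

# The final event weight is the product of the per-jet weights
# (illustrative jets: (tagged?, SF, MC efficiency)).
jets = [(True, 0.98, 0.70), (False, 0.98, 0.70)]
event_weight = prod(jet_weight(t, sf, e) for t, sf, e in jets)
```

A quick check of the normalisation property mentioned in the text: the tagging-rate-weighted average of the two weights, eff_mc * SF + (1 - eff_mc) * w_untagged, equals exactly 1, so the event count is unchanged.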

Conclusion
Several b-tagging algorithms are used to analyse data recorded by the ATLAS experiment during Run 2 of the LHC. Their performance is evaluated in simulation, and the b-jet tagging efficiencies of the MV2 and DL1 algorithms are measured in pp collision data.
The b-jet identification strategy combines the results of low-level algorithms (IP2D, IP3D, SV1, JetFitter) into high-level algorithms based on multivariate classifiers (MV2, DL1). The low-level algorithms either exploit the large impact parameters of the tracks originating from the b-hadron decay products or attempt to directly reconstruct heavy-flavour hadron vertices. Large increases in light-flavour jet and c-jet rejection are obtained by the MV2 and DL1 algorithms compared to each individual low-level algorithm, illustrating the high complementarity of the latter and validating the overall strategy followed by the ATLAS Collaboration.
The b-jet tagging efficiencies of the MV2 and DL1 algorithms are measured in 80.5 fb −1 of proton-proton collision data collected by the ATLAS detector during 2015-2017 at a centre-of-mass energy √ s = 13 TeV. A high-purity sample of dileptonic tt events is obtained by retaining events with exactly one muon, one electron and two hadronic jets. Events are classified according to the transverse momentum of each of the two jets, as well as the jet-lepton invariant masses obtained when each jet is exclusively paired with the lepton most likely to originate from the same top-quark decay. A combinatorial likelihood approach is then used to simultaneously extract the jet flavour composition of the sample and the b-jet tagging probabilities for jets in a transverse momentum range from 20 to 600 GeV, from which the b-jet tagging efficiencies for single cut OPs are obtained. Simulation-to-data scale factors are computed by comparing the efficiency measured in collision data with that observed in the simulation. The measured simulation-to-data scale factors are close to unity, with a total uncertainty ranging from 1% to 8% for single cut OPs. The precision of the measurement, previously limited by the modelling of top-quark pair production, is now limited at low p T by the detector-related uncertainty in the jet energy scale, and at high p T by the size of the dataset, which will grow in the future. Further procedures are applied to the simulation-to-data scale factors to correct for generator dependences, smooth their shape, extrapolate beyond the range of the data measurement, reduce the number of nuisance parameters and enable their application in ATLAS analyses. This result demonstrates a significant improvement in the precision of the b-jet tagging efficiency measurement for the ATLAS experiment, which is typically limited by systematic uncertainties: the total uncertainty is reduced by up to a factor of two relative to previous ATLAS results.
This is achieved by the improvement in the measurement method, which simultaneously extracts the b-jet tagging efficiency and jet flavour composition.