Searches for lepton-flavour-violating decays of the Higgs boson into $e\tau$ and $\mu\tau$ in $\sqrt{s}=13$ TeV $pp$ collisions with the ATLAS detector

This paper presents direct searches for lepton flavour violation in Higgs boson decays, $H\rightarrow e\tau$ and $H\rightarrow\mu\tau$, performed using data collected with the ATLAS detector at the LHC. The searches are based on a data sample of proton-proton collisions at a centre-of-mass energy $\sqrt{s} = 13$ TeV, corresponding to an integrated luminosity of 138 fb$^{-1}$. Leptonic ($\tau \rightarrow \ell \nu_\ell \nu_\tau$) and hadronic ($\tau \rightarrow $ hadrons $ \nu_\tau$) decays of the $\tau$-lepton are considered. Two background estimation techniques are employed: the MC-template method, based on data-corrected simulation samples, and the Symmetry method, based on exploiting the symmetry between electrons and muons in the Standard Model backgrounds. No significant excess of events is observed and the results are interpreted as upper limits on lepton-flavour-violating branching ratios of the Higgs boson. The observed (expected) upper limits set on the branching ratios at 95% confidence level, $\mathcal{B}(H\rightarrow e\tau)<0.20\%$ (0.12%) and $\mathcal{B}(H\rightarrow \mu\tau)<0.18\%$ (0.09%), are obtained with the MC-template method from a simultaneous measurement of potential $H \rightarrow e\tau$ and $H \rightarrow\mu\tau$ signals. The best-fit branching ratio difference, $\mathcal{B}(H\rightarrow \mu\tau)- \mathcal{B}(H\rightarrow e\tau)$, measured with the Symmetry method in the channel where the $\tau$-lepton decays to leptons, is (0.25 $\pm$ 0.10)%, compatible with a value of zero within 2.5$\sigma$.


Introduction
One of the main goals of the Large Hadron Collider (LHC) physics programme at CERN is to discover physics beyond the Standard Model (SM). The discovery of a scalar Higgs boson at the LHC [1,2] has provided important insight into the mechanism of electroweak symmetry breaking [3][4][5][6][7][8] and made it possible to search for physics phenomena beyond the SM (BSM physics phenomena) in the Higgs sector. A possible sign of new physics would be the observation of lepton flavour violation (LFV) in decays of the Higgs boson into a pair of leptons with different flavours.
The most stringent bounds on the LFV decays of the Higgs boson, → and → , are derived from direct searches. These include a previous ATLAS search [29] which placed 95% confidence level (CL) upper limits on the branching ratios (B) of → and → at 0.47% and 0.28%, respectively, using data collected at √ = 13 TeV, corresponding to an integrated luminosity of 36.1 fb −1 . Likewise, the CMS Collaboration set 95% CL upper limits restricting the branching ratios to B ( → ) < 0.22% and B ( → ) < 0.15% using data collected at √ = 13 TeV, with an integrated luminosity of 137 fb −1 [30]. The ATLAS Collaboration performed a direct search for → decay and obtained a 95% CL upper limit on the branching ratio value of B ( → ) < 6.1 × 10 −5 [31], using data collected at √ = 13 TeV, with an integrated luminosity of 139 fb −1 . The most stringent indirect constraint on the → decay is derived from the results of searches for → decays [32], and a bound of B ( → ) < (10 −8 ) is obtained [33,34]. Figure 1: LFV decay schemes of the Higgs boson for the (a) ℓ ℓ ′ and (b) ℓ had final states. The off-diagonal Yukawa coupling term is indicated by the ℓ symbol. the branching ratios. A multivariate analysis (MVA) technique is developed for each final state in both methods to achieve maximum separation between signal and background.
Three statistical analyses are performed: one independent search for each of the → and → processes, and one simultaneous determination of the → and → signals. In the independent search for the → process, the → signal is assumed to be zero, and vice versa in the case of the → search. For each event category used in the statistical analysis, the MC-template method in the ℓ had final state is combined with either the Symmetry method or the MC-template method in the ℓ ℓ ′ final state. The method having the higher expected sensitivity is chosen. In the simultaneous determination of the → and → signals, the assumption about the absence of one of the two signals is removed. Consequently, the MC-template method is used for both the ℓ ℓ ′ and ℓ had final states.

ATLAS detector
The ATLAS detector [36] at the LHC covers nearly the entire solid angle around the collision point. 3 It consists of an inner tracking detector surrounded by a thin superconducting solenoid, electromagnetic and hadron calorimeters, and a muon spectrometer incorporating three large superconducting air-core toroidal magnets.
The inner-detector system is immersed in a 2 T axial magnetic field and provides charged-particle tracking in the range | | < 2.5. The high-granularity silicon pixel detector covers the vertex region and typically provides four measurements per track, the first hit normally being in the insertable B-layer installed before Run 2 [37,38]. It is followed by the silicon microstrip tracker, which usually provides eight measurements per track. These silicon detectors are complemented by the transition radiation tracker (TRT), which enables radially extended track reconstruction up to | | = 2.0. The TRT also provides electron identification information based on a likelihood method. 3 ATLAS uses a right-handed coordinate system with its origin at the nominal interaction point (IP) in the centre of the detector and the -axis along the beam pipe. The -axis points from the IP to the centre of the LHC ring, and the -axis points upwards. Cylindrical coordinates ( , ) are used in the transverse plane, being the azimuthal angle around the -axis. The pseudorapidity is defined in terms of the polar angle as = − ln tan( /2). Angular distance is measured in units of Δ ≡ √︁ (Δ ) 2 + (Δ ) 2 .
The calorimeter system covers the pseudorapidity range | | < 4.9. Within the region | | < 3.2, electromagnetic calorimetry is provided by barrel and endcap high-granularity lead/liquid-argon (LAr) calorimeters, with an additional thin LAr presampler covering | | < 1.8 to correct for energy loss in material upstream of the calorimeters. Hadron calorimetry is provided by the steel/scintillator-tile calorimeter, segmented into three barrel structures within | | < 1.7, and two copper/LAr hadron endcap calorimeters. The solid angle coverage is completed with forward copper/LAr and tungsten/LAr calorimeter modules optimised for electromagnetic and hadronic energy measurements respectively.
The muon spectrometer comprises separate trigger and high-precision tracking chambers measuring the deflection of muons in a magnetic field generated by the superconducting air-core toroidal magnets. The field integral of the toroids ranges between 2.0 and 6.0 T m across most of the detector. The region | | < 2.7 is covered with three layers of precision chambers composed of monitored drift tubes, complemented by cathode-strip chambers in the forward region, where the background is highest. The muon trigger system covers the range | | < 2.4 with resistive-plate chambers in the barrel, and thin-gap chambers in the endcap regions.
Interesting events are selected by the first-level trigger system implemented in custom hardware, followed by selections made by algorithms implemented in software in the high-level trigger [39]. The first-level trigger accepts events from the 40 MHz bunch crossings at a rate below 100 kHz, which is reduced to about 1 kHz by the high-level trigger and these events are recorded to disk.
An extensive software suite [40] is used in data simulation, in the reconstruction and analysis of real and simulated data, in detector operations, and in the trigger and data acquisition systems of the experiment.

Collision data and simulation samples
The dataset used for the searches consists of the LHC proton-proton collision data recorded by the ATLAS experiment at √ = 13 TeV during the period from 2015 to 2018. Events are selected for analysis only if they are of good quality and if all the relevant detector components are known to have been in good operating condition [41]. The total integrated luminosity of the analysed data is 138 fb −1 . The events considered were accepted by single-lepton or dilepton triggers [42][43][44][45]. The T thresholds of the single-lepton triggers were T > 27 (25) GeV and T > 27.3 (21) GeV for the 2016-2018 (2015) data-taking period. The T thresholds of the dilepton triggers were T > 18 GeV and T > 14.7 GeV. Simulated events are used to model → ℓ signal processes, as well as most of the backgrounds from SM processes. A summary of all the generators used for the simulation of the signal and background processes is shown in Table 1, and more details can be found in Ref. [46]. The measured Higgs boson mass of 125.09 GeV [47] is assumed in the calculation of the expected cross-sections and branching fractions. The Higgs-boson production cross-sections are fixed to the SM predictions [48] throughout this measurement.
The main Higgs boson production modes at the LHC are, in descending order of predicted cross-section, gluon-gluon fusion ( F), followed by vector-boson fusion (VBF), and associated and¯production. For the LFV signal, → and → , the F, VBF and production mechanisms are considered. The¯production process is not considered for the LFV signal due to its negligible contribution. The background contribution originating from the SM → and → decays is small and the SM predictions are assumed for the branching ratios. These two processes were modelled using the same simulation strategy as for the LFV signal, but the¯production mode was also included. Other background processes involve electroweak production of / bosons via VBF, Drell-Yan production of / in association with jets, and diboson, single top-quark and top-quark pair (¯) production.
All samples of simulated events were processed through the ATLAS detector simulation [49] based on Geant4 [50]. The effects of multiple interactions in the same and neighbouring bunch crossings (pile-up) were modelled by overlaying minimum-bias events, simulated using the soft QCD processes of Pythia 8.186 [51] with the A3 [52] set of tuned parameters and the NNPDF2.3lo [53] parton distribution functions (PDF).

Object reconstruction and event selection
The topology of the → and → events requires the reconstruction of electrons, muons, visible products of hadronically decaying leptons ( had-vis ), jets and missing transverse momentum.

Object reconstruction
The tracks measured in the inner detector are used to reconstruct the interaction vertices. The vertex with the highest sum of squared transverse momenta of associated tracks is taken as the primary vertex [68].
Electrons and photons are reconstructed from energy deposits in the electromagnetic calorimeters [69]. Electron candidates are matched to inner-detector tracks. They are required to satisfy the 'Loose' likelihoodbased identification criterion defined in Ref. [69], which has an efficiency of about 93%, and to have on calorimetric and track information for the MC-template method and on track information only for the Symmetry method, where a slightly larger track-T contribution around the muon is allowed in order to increase the number of events available for that method's data-driven background estimation.
Jets are reconstructed with the anti-algorithm [71,72] using a radius parameter = 0.4. The jetclustering input objects are based on particle flow [73] in the inner detector and the calorimeter. Cleaning criteria are used to identify jets arising from non-collision backgrounds or noise in the calorimeters [74], and events containing such jets are removed. Only jets with T > 20 GeV and | | < 4.5 are considered. To identify and reject jets that are not associated with the primary vertex of the hard interaction (pile-up jets), a 'jet-vertex tagger' (JVT) [75] algorithm is applied to jets with T < 60 GeV and | | < 2.5. To suppress pile-up jets in the forward region, a 'forward JVT' [76] algorithm which exploits jet shapes and topological jet correlations in pile-up interactions is applied to all jets with T < 60 GeV and | | > 2.5. Jets with | | < 2.5 containing -hadrons are identified using the DL1r -tagging algorithm [77][78][79]. The fixed 85% efficiency (measured in simulated¯events) working point is used. The rejection factors for -tagging a jet initiated by a -quark or light parton are 2.6 and 29 respectively at the 85% efficiency working point.
Leptonic -decays are reconstructed as electrons or muons. The had decays are composed of a neutrino and a set of visible decay products, most frequently one or three charged pions and up to two neutral pions. The reconstruction of the had-vis is seeded by jets reconstructed by the anti-algorithm [71], using calibrated topological clusters [80] as inputs, with a radius parameter of = 0.4 [81,82]. The jets form had-vis candidates and are additionally required to have T > 10 GeV and | | < 2.5. Reconstructed tracks are matched to had-vis candidates. To separate the had-vis candidates originating from hadronic decays from quark/gluon-initiated jets, a recurrent neural network (RNN) had-vis identification algorithm [83] is used. The had-vis objects are required to satisfy the 'Very Loose' had-vis identification criterion, which has an efficiency of 95% for simulated had decays. A separate multivariate discriminant (eBDT) [84] is employed to reject backgrounds arising from electrons that are misreconstructed as single-track had-vis .
The reconstructed electrons, muons, had-vis and jets used in this analysis are not built from a set of mutually exclusive tracks or calorimeter clusters; it is therefore possible that two different objects share most of their constituents. To resolve this ambiguity, an overlap removal procedure is applied. The T threshold of muons considered in the overlap removal with had-vis is lowered to 2 GeV. More details can be found in Ref. [46].
The missing transverse momentum vector, ì miss T , is reconstructed as the negative vector sum of the transverse momenta of light leptons, photons, had-vis , jets, and the 'soft-term'. The soft-term is calculated as the vectorial sum of the T of tracks matched to the primary vertex but not to a reconstructed light lepton, had-vis or jet [85]. The magnitude of ì miss T is referred to as the missing transverse momentum, miss T .

Event selection and categorisation
The two analysis channels are defined according to the decay mode. Events in the ℓ ℓ ′ channel contain exactly two light leptons of opposite-sign electric charges and different flavours, while events in the ℓ had channel contain exactly one light lepton and a had-vis with opposite-sign electric charges. To preserve event separation between the channels, events containing had-vis are vetoed in the ℓ ℓ ′ channel.
For each channel, a Baseline selection is applied, based on the properties of the light leptons, had-vis , missing transverse momentum, and on event properties as the absence of -tagged jets. Events satisfying the Baseline selection criteria are further classified into two statistically independent categories, VBF and veto events if 90 < vis ( , had-vis ) < 100 GeV non-VBF, based on the kinematic properties of the jets produced in association with the Higgs boson candidate. The event selection for all regions is summarised in Table 2. Background control regions (CRs) used in the analyses are described in Section 5 and summarised in Table 5.
In the ℓ ℓ ′ channel, the definition of the leading (ℓ 1 ) and subleading (ℓ 2 ) light leptons is based on the T -ordering of the light leptons in the laboratory frame. This ordering is used in the context of the rejection and estimation of backgrounds, where the T in the laboratory frame is relevant. Additionally, an approach based on the approximate Higgs boson rest frame is used for the classification of the events. The four-momentum of the Higgs boson rest frame is built from the two leptons and miss T four-momenta. The pseudorapidity component of the missing-momentum vector is assumed to be the same as that of the system formed by the two charged light leptons, and the resulting invariant mass of the Higgs boson system is constrained to be 125 GeV. The light lepton having the higher T in the approximate Higgs boson rest frame is called ℓ , and is assumed to be the lepton originating from the Higgs boson decay. The other charged light lepton is called ℓ , and is assumed to be the light lepton originating from the decay. Events are divided into the and final states depending on whether ℓ is the muon or the electron. Using this approach, the lepton misassignment for the Baseline event selection is reduced to about 5% (7%) for the F (VBF) production mode of the LFV signal, compared with a 11% (19%) misassignment obtained when the laboratory frame is used.
In the ℓ ℓ ′ channel of the MC-template method, the leading light lepton (ℓ 1 ) is required to have ℓ 1 T > 45 GeV, while events with lower ℓ 1 T values are used to obtain a region enriched with → events, as described in Section 5.1. The events must satisfy a requirement on the invariant mass of the two final-state leptons, 30 GeV < ℓ 1 ℓ 2 < 150 GeV, to reduce the single-top-quark and¯(hereafter collectively labelled 'topquark') background. Additionally, to reduce the top-quark background contribution, events with one or more identified -tagged jets are rejected. In order to suppress the contribution from backgrounds with non-prompt light leptons such as heavy-flavoured hadron decays and → processes, and to ensure compatibility with the primary vertex, additional requirements are imposed on the transverse impact parameter ( 0 ) significance and the longitudinal impact parameter ( 0 ), weighted by the normalized transverse momentum of the track (sin ). The 0 significance [78] of the ℓ 1 track is required to be smaller than 5, the combination | 0 sin | is required to be smaller than 0.5 mm. If ℓ 2 is an electron, the 0 significance of its track is required to be smaller than 10. When ℓ 2 = , the requirement 0.2 < track T / cluster T < 1.25 on the ratio of the T measured using only the inner detector, track T , to the T measured in the calorimeter, cluster T , aims to reduce the number of → background events, in which one of the muons deposits a significant fraction of its energy in the calorimeter.
In the ℓ ℓ ′ channel of the Symmetry method, the requirement on ℓ 1 T is reduced relative to the MC-template method to ℓ 1 T > 35 GeV to increase the number of events used in the Neural Network (NN) training, introduced in Section 6. To preserve a symmetric selection, electron | | requirements from the object reconstruction are also applied to muons. Therefore, selected muons are required to have | | < 1.37 or 1.52 < | | < 2.47. The requirement 30 GeV < ℓ 1 ℓ 2 < 150 GeV is used to reduce the top-quark background. The 0 significance of the tracks of both light leptons is required to be smaller than 10. No requirements on track T (ℓ 2 )/ cluster T (ℓ 2 ) are applied when ℓ 2 = because it would break the symmetry between electrons and muons.
In the ℓ had channel, angular selection criteria (cos Δ (ℓ, miss T ) + cos Δ ( had-vis , miss T )) > −0.35 and |Δ (ℓ, had-vis )| < 2 are imposed to reject + jets and multi-jet production processes. To reduce the topquark background contribution, events with one or more identified -tagged jets are rejected. Selected had-vis are required to satisfy the 'Tight' had-vis identification criterion. In the had channel, the reconstructed had-vis is required to pass the eBDT 'Medium' working point in order to suppress (→ ) + jets processes where one of the electrons is reconstructed as a had-vis object.
The VBF and non-VBF categories in each of the ℓ ℓ ′ and ℓ had final states give rise to a total of four signal regions (SRs) in each search. The VBF category is designed to enhance the sensitivity to the VBF Higgs boson production mode. Dedicated requirements are applied to the jet kinematics and topology of the two jets to separate VBF Higgs boson production from the other production modes. The leading and subleading T -ordered jets are denoted by j 1 and j 2 , respectively. The jj and Δ jj variables represent the invariant mass and -separation of the two leading jets, respectively. The non-VBF category contains events failing the VBF selection, and in the had channel, an additional requirement on vis ( , had-vis ) reduces the (→ ) + jets background component.

Background estimation
The → process is one of the dominant backgrounds in all the categories and channels. Other relevant backgrounds originate from + jets and multi-jet events with at least one jet misidentified as an electron, muon or had (referred to as 'misidentified' hereafter). The misidentified background in the two methods (MC-template and Symmetry) and the two analysis channels (ℓ ℓ ′ and ℓ had ) is evaluated using data-driven techniques. Additionally, top-quark decays form a non-negligible contribution, particularly in the ℓ ℓ ′ channel. Other background components such as diboson production ( , and ), → and → are also considered.
In the following, the background estimates are discussed in more detail for the MC-template ℓ ℓ ′ channel in Section 5.1, for the MC-template ℓ had channel in Section 5.2, and for the Symmetry-based ℓ ℓ ′ channel in Section 5.3.

MC-template ℓ ℓ ′ channel
In the MC-template ℓ ℓ ′ channel, the main background contributions arise from top-quark, → and diboson processes, and events with misidentified leptons. Sources of smaller backgrounds are → ℓℓ events and SM Higgs boson decays. In addition to simulation, data control and validation regions are used to estimate the background contributions, where possible. Backgrounds from Higgs boson processes such as → and → are expected to be small and are normalised to their SM predictions.
The top-quark processes contribute 34%-54% of the total background, depending on the category. For each category, the top-quark simulation is validated with data in a top-quark CR, statistically independent of the SRs. These CRs are defined by applying all the baseline criteria for non-VBF or VBF categorisation except for the -tagged jet veto, and 95% of their events are top-quark events. Good modelling by the top-quark simulation is observed, as shown in Figure 2 (a, b). To account for possible theoretical uncertainties in the production cross-section for the simulated samples, the top-quark CRs and two normalisation factors, one per category, are included in the statistical analysis (Section 8). Each normalisation factor is determined during the signal extraction.
The → events account for 23% (11%) of the total background in the non-VBF (VBF) SR. For each category, the → simulation is validated with data in a → CR, statistically independent of the SRs. These CRs are defined by requiring 35 GeV < ℓ 1 T < 45 GeV, and the → process contributes ∼65% (∼32%) of all events in the non-VBF (VBF) → CR. Good modelling in the → CRs is illustrated for the non-VBF case in Figure 2 (c, d). In the statistical analysis, the → CRs are used jointly with two independent normalisation factors to constrain the → normalisation in the SRs.
Diboson events form 19%-32% of the total background, depending on the category. The shape and normalisation of diboson process distributions are estimated from the simulation and validated with data in a dedicated validation region, where the diboson processes contribute ∼67% of the total background. This region is defined by applying the Baseline selection criteria and requiring ℓ 2 T > 30 GeV and the mass of the two leptons to satisfy 100 GeV < ℓ 1 ℓ 2 < 150 GeV. The transverse mass, T , calculated from the transverse momentum of the subleading light lepton and the miss T , is required to be greater than 30 GeV and jets with T > 30 GeV are rejected. Good modelling in the diboson validation region is observed, as illustrated in Figure 3.
The → process contributes sizeably only in the channel, where it represents up to 2% of the total background. The modelling of the → background by the simulation is validated in a dedicated → CR, which is not included in the statistical analysis. This CR is obtained by applying the Baseline selection, but with 35 GeV < ℓ 1 T < 45 GeV and with the dilepton pair mass near the boson mass peak, 75 GeV < ℓ 1 ℓ 2 < 100 GeV. Additionally, the separation between the subleading light lepton and the ì miss T is required to satisfy |Δ (ℓ 2 , miss T )| < 1.5. To ensure that the → CR is statistically independent of the SRs and the → CRs, the events are required to have 1.25 < track T (ℓ 2 )/ cluster T (ℓ 2 ) < 3. While no mismodelling of the MC template shape is found in the → CR, a global normalisation offset of 25% is observed. Therefore, the → event yields in the SRs are decreased by 25% and a normalisation uncertainty of that size is assigned to the → contribution in the statistical analysis.   A data-driven method is used to estimate the misidentified background contribution from events having at least one light lepton originating from heavy-flavour decays, photon conversion, a jet or a had misidentified as a light lepton. These events originate mostly from + jets, multi-jet production and top-quark processes. The method used is based on the assumption that the ratio of opposite-sign to same-sign light-lepton events is approximately constant when inverting the isolation quality requirement for the subleading    light lepton [29]. The ratio, named a transfer factor, is measured in regions enriched in misidentified background. These regions use the SRs' selection criteria, but the lepton isolation requirement is inverted for the subleading light lepton, and the electron is allowed to fail the 'Medium' identification requirement while still passing the 'Loose' requirement. To obtain the misidentified background prediction in the SRs, the transfer factor is applied to same-sign regions. Events in the same-sign regions are required to satisfy the same selection criteria as in the SRs, but have two light leptons with same-sign electric charges. Events containing two prompt light leptons (mostly from diboson events) form a small component of this background and are subtracted using simulation. The transfer factors are calculated independently in the and final states. In each of the two cases, transfer factors are obtained separately for events passing the single-lepton or dilepton trigger requirement, and separately for events passing or failing the -veto requirement.
The statistical uncertainty of the misidentified background is separated into four uncorrelated components, based on whether the event is triggered by a single-lepton or dilepton trigger and whether the event passed or failed the -veto requirement. The systematic components account for several effects. The uncertainty related to the subtraction of the prompt-lepton processes in the same-sign region is determined to be between 6% and 8%. The systematic uncertainty due to possible differences between the jet flavour composition in the misidentified background enriched CRs and the SRs is taken into account. To estimate this uncertainty, the transfer factors are calculated in misidentified background enriched CRs using the simulated samples of the main sources of fake leptons ( + jets and ). The transfer factors are applied to the simulated events passing the Baseline selection, except that two same-sign leptons instead of opposite-sign leptons are required. The obtained prediction for the BDT distribution (see Section 6) is    compared with the simulated events passing the Baseline selection. The difference between the two BDT distribution shapes is taken as the magnitude of the systematic uncertainty, which is found to vary between 10% and 30%. To cover potential residual differences between the simulation and data, a non-closure uncertainty is obtained by defining additional opposite-sign misidentified background enriched CRs. These regions are statistically independent of the SRs and use the same selection criteria as the Baseline selection but invert the requirement on 0 significance. The transfer factors are recalculated using these additional regions and the background predictions are re-evaluated using the new transfer factors. The difference between data and the background prediction in the additional opposite-sign misidentified background enriched CRs is assigned as a systematic uncertainty. The magnitude of the uncertainty ranges from 2% to 12%. The dependence of the transfer factor on the 0 significance requirement was investigated and found to be negligible. Figure 4 shows the level of agreement between the data and simulation in the non-VBF and VBF categories for a subset of variables with the highest discrimination power used in the MVA as described in Section 6.

MC-template ℓ had channel
In the MC-template ℓ had channel, the main background contributions arise from the → events and are estimated using the simulated events. The → events form 48%-67% of the total background depending on the category. The second-largest background consists of events containing a jet misidentified as had-vis and contributes 22%-30% of the total background. This misidentified background is evaluated using the Fake-Factor technique [88,89]. Contributions with had and a jet misidentified as an electron or a muon are estimated from simulation and found to be negligible. Backgrounds from other processes such as top-quark, → ℓℓ, dibosons, → and → are much smaller and are estimated from simulation.
The shape of the → background distribution is modelled with simulations, and its normalisation in the VBF and non-VBF categories is constrained by data in the BDT score distributions (see Section 6) in the SRs. The shape of the top-quark background distribution is modelled with simulation. The corresponding normalisation factors in the VBF and non-VBF categories are constrained using the ℓ ℓ ′ top-quark CRs in the simultaneous fit of the ℓ had and ℓ ℓ ′ channels.
Diboson events form 2%-5% of the total background, depending on the category. In the had channel, the → process contributes about 5%-6% of the total background in the non-VBF category, and 3%-4% in the VBF category. Its modelling is validated in a dedicated validation region, where the → event fraction is ∼65%. The → validation region is defined by applying the Baseline selection criteria and also requiring | ( )| < 0.1 and the collinear mass [87] to be near the peak: 90 GeV < coll ( , ) < 110 GeV. While no template mismodelling is observed in this region, a global normalisation offset of 13% is observed. Thus, an uncertainty of that size is assigned to the → contribution in the statistical analysis described in Section 8.
The distribution of the misidentified background component in the SR is obtained by multiplying the number of events that satisfy the SR selection criteria but fail the 'Tight' had-vis identification requirement by a Fake-Factor. The Fake-Factor is defined as the ratio of the number of jets misidentified as had-vis that satisfy the 'Tight' had-vis identification criteria to the number that do not, but still satisfy the 'Very    Loose' identification requirement. The Fake-Factor is parameterised as a function of the T and track multiplicity of the had-vis . Since the misidentified background originates from + jets and multi-jet production processes, two independent Fake-Factors, and QCD , are measured in dedicated + jets and multi-jet production CRs, respectively. These regions are defined to be statistically independent of the SRs by inverting several selection criteria. In particular, (cos Δ (ℓ, miss T ) + cos Δ ( had-vis , miss T )) < −0.35, T (ℓ, miss T ) > 60 GeV and T ( , miss T ) > 40 GeV for the + jets CR, and |Δ (ℓ, had-vis )| ≥ 2 and T (ℓ, miss T ) < 60 GeV for the multi-jet CR. To obtain the final estimate of the misidentified background contribution in the SRs, the combined Fake-Factor = QCD QCD + (1 − QCD ) is used. The fraction of multi-jet events in the misidentified background, QCD , is obtained by scaling the number of events in the multi-jet CR by the ratio of the number of events where the light lepton passes the isolation requirements to the number where it does not [46]. This ratio is measured in another region enriched in multi-jet events where the ℓ and had have the same electric charge.
The statistical uncertainties obtained from the Fake-Factor and QCD calculations are included in the statistical analysis. They are considered as independent sources of uncertainty and their magnitude varies between 2% and 3% in the VBF category, while it is < 0.4% in the non-VBF category. An additional uncertainty accounts for the residual difference between the misidentified background modelling and the data in the misidentified background enriched same-sign region defined by applying the SR selection criteria, but requiring the had and ℓ to have the same charge. The magnitude of the difference between the two regions varies from 2% to 10%, depending on the category, and it is assigned to both the positive and negative component of the systematic uncertainty. Figure 5 shows the level of agreement between the data and background modelling in the non-VBF and VBF categories for a subset of variables with the highest discrimination power used in the MVA as described in Section 6. Here the normalisation of signal and background yields is obtained by performing the simultaneous fit of the → and → signals (discussed in Section 8.2), based only on data in the ℓ had final state. The distributions show different levels of separation between signal and background, and reasonable modelling of the data within statistical and systematic uncertainties. The corresponding SR event yields are detailed in Table 8 in Appendix B.2.

Symmetry-based ℓ ℓ ′ channel
The data-driven / -symmetry method was first employed in Ref. [90]. This method is sensitive to the difference between B( → ) and B( → ), and it is based on the following two assumptions: 1. SM processes are symmetric under the exchange of prompt electrons with prompt muons to a good approximation. As a consequence, the kinematic distributions of prompt electrons and prompt muons are approximately the same; 2. Flavour-violating decays of the Higgs boson break this symmetry.
According to the / -symmetry assumption, the SM background is split equally between the and datasets. However, the → signal is mostly present in the dataset, since the T of the muon originating from the Higgs boson is typically larger than the T of the electron originating from the decay of the -lepton. Similarly, the → signal is mostly present in the dataset. Thus, in a search for → decays, the SM background can be estimated using the dataset, and vice versa.
Detector-related effects cause a distortion of the original / -symmetry and need to be accounted for. The first effect is due to events containing misidentified and non-prompt light leptons, collectively referred to as misidentified background in the following. These particles originate from misidentified jets, misclassified light leptons, hadronic decays within a jet, and electrons from photon conversion. They contribute differently to the and datasets. The second effect is due to the muons and electrons having different trigger, reconstruction, identifications and isolation efficiencies which depend differently on kinematic properties such as T , | |, and .

Events containing misidentified light leptons
The misidentified light leptons are subdivided into two contributions based on their estimation method: MC and FF for contributions estimated from the simulation and from the data using the Fake-Factor method, respectively.
Events contributing to MC include had that are misidentified as light leptons, muons that are misidentified as electrons, and photons that are misidentified as electrons. Events containing one or more misidentified light leptons are estimated from simulation, taking into account the following processes: → , top-quark, diboson, → , → and . Contributions from → and → are negligible. The most significant contribution to MC is from → where low-T muons are misidentified as electrons. The modelling of this process (total cross-section and shapes of various kinematic distributions) is validated in a dedicated validation region dominated by → events. This region's definition is similar to the one described in Section 5.1, replacing the requirement Δ (ℓ , miss T ) < 1.5 with a requirement of coll < 115 GeV. The systematic uncertainty associated with the normalisation of this background is determined to be 16%. Conservatively, this uncertainty is applied to all the MC contributions.
Events contributing to FF include jets misidentified as light leptons and light leptons originating from hadronic decays within a jet. These mostly originate from + jets processes. A smaller contribution originates from multi-jet production. This background is estimated using the Fake-Factor method where the Fake-Factors are measured in a dedicated + jets CR enriched in misidentified light leptons. This + jets CR is based on the selection of exactly three electrons or muons where two are required to be consistent with a -boson decay, while the third is assumed to originate from a jet which is misidentified either as an electron or muon and is used for the determination of the Fake-Factors. A difference in jet flavour composition between the + jets and Baseline regions could result in different Fake-Factors. This is taken into account by using a set of correction factors measured in simulation as the ratio of the Fake-Factors in the two regions.
The Fake-Factors and correction factors are binned as a function of lepton T , with electron Fake-Factors also being parameterised with respect to Δ , miss T . The product of the Fake-Factors and correction factors is applied as event weights, depending on the kinematic quantities of the non-prompt light leptons that fail to satisfy the lepton identification criteria. Contributions, where either the electron or muon is misidentified, are estimated separately and summed; the numbers of events where both light leptons are misidentified are treated specifically to avoid double counting.
The uncertainties associated with the Fake-Factor method consist of statistical and systematic components. The statistical component is estimated for each Fake-Factor bin separately and consists of the statistical uncertainties from the data and the subtracted prompt-lepton simulated yields within the + jets CR propagated to the Fake-Factors. The systematic component corresponds to the uncertainty from the subtraction of events containing three real light leptons, i.e. from and processes where the subtracted yields are varied by the theory cross-section uncertainties of both processes separately. Uncertainties associated with the correction factors include a contribution from the limited number of simulated events in the + jets and Baseline regions, and from the choice of a specific MC generator. In the statistical model (described in Section 8), the statistical sources are treated as uncorrelated between Fake-Factor or correction factor bins, the systematic sources as correlated between bins, and the MC subtraction-uncertainty also as correlated between electron and muon misidentified background.

Efficiency correction
The efficiency to measure a event ( ) or a event ( ) is given by the product of the event trigger efficiency and the muon and electron efficiencies. The trigger efficiency calculation and the associated systematic uncertainties are described in Ref. [91]. Simulated events are scaled using dedicated scale factors to account for the modelling of trigger efficiency in the simulation.
The lepton efficiency is a product of the reconstruction, identification and isolation efficiencies. For the muons, these are measured in Ref.
[92] along with their uncertainties and found to be independent of the event selection. For the electrons, regardless of the parameterisation used, the actual efficiencies are strongly dependent on the event selection, and need to be measured for the analysis-specific phase space. Simulated events are corrected to match the data efficiencies by using dedicated scale factors.
In this study, the electron efficiencies are estimated from the simulated events passing the Baseline event selection with the following two changes: the requirement ℓ 1 T > 35 GeV is relaxed to ℓ 1 T > 20 GeV, in order to increase the number of events, and only events that contain either one electron and one muon candidate, or two electron candidates, are considered.
The number of events available after applying this selection determines the statistical uncertainty associated with the electron efficiency measurement. Systematic uncertainties associated with this measurement are estimated by comparing the efficiencies obtained when using either the electron or muon as the tag lepton, and by raising the T selection for ℓ 1 back to 35 GeV. The efficiencies are scaled to account for possible differences observed between data and simulation, using the set of scale factors and associated uncertainties described in Ref. [69] available for each efficiency separately.
The efficiency corrections are applied per event, depending on the kinematic properties of the electron and muon. Each event in the sample is used to mimic an event in the sample with the same kinematic properties by scaling it with the efficiency ratio = / . The same procedure is applied with the inverted ratio when scaling to events. The procedure is tested in simulated events, as shown in Figure 6, by applying the efficiency correction to events and comparing them with events. By construction, the efficiency correction is only valid for events containing prompt leptons. In the simulation, events containing misidentified leptons are vetoed. In the data, contributions from both MC and FF are subtracted before the efficiency correction is applied. As shown in Figure 6, this procedure restores the symmetry between the two channels in simulation.

Background estimation
The number of background events in each dataset ( and ) is estimated as a sum of three contributions: a symmetric component estimated by applying the efficiency correction procedure to the events in the other dataset, MC and FF . Since the efficiency correction is only valid for true leptons, the contribution from events containing misidentified leptons is subtracted from the other dataset before applying the efficiency correction. Figure 7 compares the data and the background prediction for a selected set of kinematic variables with strong signal-vs-background discrimination power for the and datasets in the non-VBF and VBF categories. In the figures, the FF contribution is labelled as 'Misidentified', while MC contribution is labelled as 'Other'. The normalisation of signal and background yields is obtained by performing the independent fits of the → and → signals (discussed in Sections 8.1 and 8.3), based only on data in the ℓ ℓ ′ final state. The distributions show either → or → signal, with clear shape differences to the backgrounds, particularly for mass-based variables, and good agreement of the prediction with data within statistical and systematic uncertainties. The corresponding SR event yields are shown in Table 9 in Appendix B.3.     separates the signal from → , → ℓℓ and → backgrounds. The three BDTs were utilised because they provided better expected signal significance than a single BDT. To increase the number of events in the BDT training, the and the datasets are combined. The resulting BDT scores are combined into a single score using a linear weighted sum, with the weights optimised using the expected signal significance as a figure-of-merit.
In the ℓ had channel of the MC-template method, two BDTs are used in all cases except for the non-VBF category in the had channel, where three BDTs are used. Two/three BDTs were chosen instead of a single BDT because they provide better expected signal significance. In the case where three BDTs are trained, BDT ℓ had 1 discriminates the signal from the → events; BDT ℓ had 2 separates the signal from misidentified background; and BDT ℓ had 3 separates the signal from the rest of the backgrounds. In the case where two BDTs are trained, BDT ℓ had 1 ′ separates the signal from → events and BDT ℓ had 2 ′ separates the signal from the rest of the backgrounds. The resulting BDT scores are combined into a single score using a linear weighted sum in the non-VBF categories, and a quadratic weighted sum in the VBF categories. The latter provided a 4% improvement in the expected signal significance, used as a figure-of-merit, for the VBF categories; in the non-VBF categories, the linear combination provided the better expected signal significance.
In the ℓ ℓ ′ channel of the Symmetry method, the requirement on the jet invariant mass ( jj ) in the VBF category is lowered to 300 GeV to increase the size of the training set. The depth of the NN is chosen taking into account the number of events available for the training. For the non-VBF category, a multiclass classifier, with three output classes, is used. The three classes correspond to signal, flavour-symmetric SM background and misidentified background. The MC contribution is added to the flavour-symmetric SM background class since the distributions of its main processes, → , → and , are more similar to the distributions of the flavour-symmetric SM background processes than to the ones originating from jets misidentified as a lepton. For the VBF category, three single binary classification networks are trained and the resulting output distributions are combined via a linear combination to obtain a single discriminant. The three networks are trained to separate signal events from: (i) MC , → and → , (ii) top-quark, diboson and → and (iii) FF events.
Details of the MVA optimisation procedure, as well as the input variables used in the training, are described in Appendix C. The distributions of the resulting MVA discriminants are shown in Figures 8, 9 and 10 for the MC-template ℓ ℓ ′ , MC-template ℓ had and Symmetry ℓ ℓ ′ channels, respectively. The MC-template method distributions are obtained after performing the simultaneous fit of the → and → signals described in Section 8.2. The Symmetry method distributions use the set-up for independent searches for → and → (see Section 8.1), setting the other signal to zero in each case, and are based on ℓ ℓ ′ final-state data, which pass the Symmetry-method-specific selection. The conversion of the fit of the branching ratio difference into individual branching ratios results in positive contributions for → and negative contributions for → , as discussed in Section 5.3. This is indicated with a positive or negative sign in Figure 10.      Figure 9: BDT score distributions for the had (left) and the had (right) final states in the non-VBF (top) and VBF (bottom) categories of the MC-template ℓ had channel. The middle panel shows the ratio of data to prediction (background+signal). The bottom panel shows the residuals between data and the background after the fit. The hashed band indicates the total postfit uncertainty of the total predicted yields. The prediction for each sample is determined from the likelihood fit performed to measure → and → signals simultaneously, described in Section 8.2. The binning is shown as used in the likelihood fit. Overlaid prefit signal shapes assume B ( → ℓ ) = 0.1% and are enhanced by a factor 100 for visibility. The postfit signal contributions are considered as part of the predictions.   Figure 10: NN score distributions for the (left) and the (right) final states, in the non-VBF (top) and VBF (bottom) categories of the Symmetry ℓ ℓ ′ channel. The middle panel shows the ratio of data to prediction (background+signal). The bottom panel shows the residuals between data and the background after the fit. The hashed band indicates the total postfit uncertainty on the total predicted yields. The prediction for each sample is determined from the likelihood fit performed to measure → (left) and → (right) signals, described in Section 8.1. The binning is shown as used in the likelihood fit and points outside the displayed -axis range are indicated by arrows. Overlaid prefit signal shapes assume B ( → ℓ ) = 0.1% and are enhanced by a factor 100 for visibility. The postfit signal contributions are considered as part of the predictions. As a result of the conversion of the fits of branching ratio differences to individual branching ratios, the upward deviation in appears as the downward deviation in .

Systematic uncertainties
Systematic uncertainties affect the yields in the signal and control regions, as well as the shape of the MVA score distribution. They can be separated into three groups: experimental uncertainties, theoretical uncertainties for the backgrounds, and theoretical uncertainties for the signal. The systematic uncertainties related to the estimation of misidentified background are discussed in Section 5 for the MC-template ℓ ℓ ′ (Section 5.1), the MC-template ℓ had (Section 5.2) and the Symmetry method (Section 5.3).  [85]. Their energy scale and resolution uncertainties are taken into account as well. Experimental uncertainties affect the shape of the MVA score distribution, the background yields and the signal cross-section through their effects on the acceptance and the migration between different analysis categories. An additional uncertainty from the measurement of the luminosity [100, 101], amounting to 1.7%, is included.
In the MC-template channels, theoretical uncertainties are considered for the background processes estimated from simulation [46]. Their effect on the normalisation and shape of the MVA discriminant is considered in the statistical analysis. For + jets events, systematic uncertainties include those due to renormalisation ( r ), factorisation ( f ) and resummation scale ( qsf ), the jet-to-parton matching scheme (CKKW) [102], and the choice of s value and the PDFs. For the top-quark background, uncertainties related to the choice of matrix element and parton shower generators [103,104], the initial-and final-state radiation model [105], and the PDFs are considered [106]. For the diboson production processes, an uncertainty of 6% is assigned to the cross-section in the statistical analysis [107][108][109].
With the Symmetry method, these background contributions are estimated from data, and it is not necessary to consider the aforementioned theoretical uncertainties.
The Higgs boson production cross-section uncertainties are obtained from Ref. [110]. Effects on the signal expectations are treated as uncorrelated between production modes. Theoretical uncertainties affecting the F, VBF, , and¯production cross-sections are considered; more details can be found in Ref. [46]. The uncertainties include components for those estimated by varying the PDF or s value, or varying the choice of matrix element generator or parton shower and hadronisation model. For the matrix element variation, predictions by Powheg Box v2 are compared with those by MadGraph5_aMC@NLO [111]. The parton shower and hadronisation model variation replaces the nominal Pythia 8 simulation with Herwig 7 [103,104]. For both the MC-template and Symmetry methods, the Higgs boson production cross-section uncertainties are considered for SM decays as well as for the → and → signals ( F, VBF, , production modes only). Table 3 lists uncertainties for each measurement grouped by their respective sources. In the independent measurements of the → and → signal (discussed in Section 8.1), labelled as 1 POI, the results based on the Symmetry method are included. Hence, statistical uncertainties in the background estimate which result from the symmetric component of the background estimate play an important role. In the simultaneous measurement of the → and → signal (discussed in Section 8.2), labelled as 2 POI, the uncertainties are dominated by systematic sources, in particular the misidentified background. Table 3: Summary of the different sources of uncertainty affecting the observed B ( → ℓ ) and their impact as computed by the independent fits (1 POI) described in Section 8.1 and the simultaneous fit (2 POI) described in Section 8.2. The values in the table are multiplied by a factor 100 to improve their readability. Experimental uncertainties for reconstructed objects combine efficiency and energy/momentum scale and resolution uncertainties. 'Background sample size' includes the bin-by-bin statistical uncertainties in the simulated backgrounds as well as statistical uncertainties in misidentified backgrounds, which are estimated using data.

Statistical analysis and results
The statistical analysis uses a likelihood function L ( , ), constructed as a product of Poisson probability terms over all bins considered in the search. These include the MVA score distributions of all the SRs and the event yields in the ℓ ℓ ′ MC-template method CRs, when included, to constrain the normalisations of the major backgrounds estimated from simulation, in particular the top-quark and → background components. The likelihood function depends on the parameters of interest (POIs), and , defined as the branching ratios B ( → ) and B ( → ), respectively, and a set of nuisance parameters that encode the effect of systematic uncertainties on the signal and background expectations. All nuisance parameters are implemented in the likelihood function as Gaussian or log-normal constraints. The latter are used for normalisation factors to ensure that they are always positive.
The likelihood function is fitted to the data to test for the presence of a signal. Estimates of the POIs are calculated with the profile-likelihood-ratio test statistic˜ [112], and if no signal is found, the upper limits on the branching ratios are derived by using˜and the CL s method [113].
Three different statistical analyses are performed, differing in the POI definitions and relying on different inputs from the two background estimation methods: • An independent search for the → signal: a single POI, , is estimated in the fit combining the and had final states (discussed in Section 8.1). B( → ) is set to zero.
• An independent search for the → signal: a single POI, , is estimated in the fit combining the and had final states (discussed in Section 8.1). B( → ) is set to zero.
• A simultaneous measurement of the → and → signals: two POIs ( and ) are estimated in the simultaneous fit of the , had , and had final states (discussed in Section 8.2).
The independent searches combine the non-VBF and VBF categories of the ℓ had channel and the non-VBF category of the ℓ ℓ ′ channel from the MC-template method with the VBF category of the ℓ ℓ ′ channel from the Symmetry method. For the simultaneous measurements of → and → signals, the combined fit is performed with the regions from the MC-template method for both ℓ ℓ ′ and ℓ had final states. The SRs and CRs exploited in the different combined fits are specified in Table 4. In the fit combinations where a SR of the ℓ ℓ ′ channel from the MC-template method is considered, the corresponding top-quark and → CRs are also included.
Additionally, stand-alone fits of individual channels and categories are performed employing either background estimation method. Since the Symmetry method directly measures the branching ratio difference, the results of the Symmetry method can be converted into individual branching ratios with the assumption that the other signal is zero (employed in Section 8.1). Alternatively, the results from the simultaneous fits with the MC-template method can be utilised to calculate the branching ratio difference to allow a direct comparison with the Symmetry method results. The branching ratio difference determined from the stand-alone fits with either background estimation method for the ℓ ℓ ′ final state, together with their compatibility, is discussed in Section 8.3.

Independent searches for → and →
A fit with a single POI is performed for each search separately by combining the ℓ ℓ ′ and ℓ had final states. The best-fit branching ratios and upper limits are evaluated assuming B ( → ) = 0 for the → search and B ( → ) = 0 for the → search. Each search combines the ℓ had channel and the non-VBF category of the ℓ ℓ ′ channel from the MC-template method with the VBF category of the ℓ ℓ ′ channel from the Symmetry method. The choice of combining the MC-template and Symmetry methods is based on the expected sensitivity [112] of each individual category and selecting the method having higher significance. For the VBF category, the Symmetry method performs better than the MC-template method by 13% and 7% for the and final states, respectively. For the non-VBF category, the MC-template method performs better than the Symmetry method by 28% and 18% for the and final states, respectively. For the ℓ had channel, only the MC-template method is available.
Three normalisation factors are used to normalise the → production cross-section in the ℓ ℓ ′ and ℓ had SRs estimated with the MC-template method. A common normalisation factor is used for the non-VBF category SR of the ℓ ℓ ′ channel and the event yield of the dedicated → CR. In the ℓ had channel, two independent normalisation factors are used for the non-VBF and VBF categories. They are constrained by the data in the SRs. In general, top-quark processes form a very small background contribution in the ℓ had channel and no top-quark CR is defined in the ℓ had channel. Therefore, in the non-VBF category, where the MC-template method is used for both final states, a single normalisation factor is used to scale the top-quark processes' production cross-section jointly in the ℓ ℓ ′ and ℓ had final state, making use of the ℓ ℓ ′ top-quark CR. Since the Symmetry method is exploited in the ℓ ℓ ′ channel in the VBF category, it is not possible to constrain the normalisation of the top-quark background in the ℓ had final state by a CR and hence uncertainties are applied to the top-quark cross-section. Figure 11 shows the resulting 95% CL upper limits for the → and → searches, with the breakdown of the stand-alone fits from each channel. For the → ( → ) signal, a 1.9 (2.2 ) excess is observed. In the case of the → signal, this is driven mainly by the non-VBF category of the ℓ ℓ ′ final state based on the MC-template method. In the case of the → signal, it is driven by the non-VBF category of the ℓ had final state. The branching ratio of the LFV Higgs boson decay is related to the non-diagonal Yukawa coupling matrix elements [33] by the formula where Γ (SM) = 4.07 MeV [114] is the Higgs boson's width as predicted by the SM. The observed 95% CL upper limits on the branching ratio correspond to the following limits on the coupling matrix elements: √︁ | | 2 + | | 2 < 0.0014 and √︃ | | 2 + | | 2 < 0.0012. Figure 12 shows the limits on the individual coupling matrix elements ℓ and ℓ obtained from the independent fits in the two searches. The same figure also shows the limits from the → ℓ searches [33,115]. Compared with the indirect limits from the → ℓ searches, the direct limits obtained in this search are tighter by slightly less than one order of magnitude for and , while for and they are tighter by about a factor of ten. This clearly indicates the strength of direct searches at the LHC. In the case of → , the constraints are tighter than the naturalness limit preventing a non-hierarchical mass spectrum from large off-diagonal terms in the Yukawa coupling matrix: | ℓ ℓ | ≲ ℓ / 2 , where is the vacuum expectation value of the Higgs field [33]; in the case of → , the naturalness limit has not been reached yet. This is in line with the previous ATLAS search [29] based on a partial dataset at √ = 13 TeV, but with tighter limits.   Figure 12: Expected (red long-dashed line) and observed (solid blue line) 95% CL upper limits from the independent fits (1 POI) on the absolute value of the couplings ℓ and ℓ together with the most stringent indirect limits from → ℓ searches (dark purple region) for (a) ℓ = or (b) ℓ = . The short-dashed lines represent the limits corresponding to different branching ratios (0.01%, 0.2%, 1% and 10%), while the dotted line indicates the naturalness limit (denoted by n.l.).

Simultaneous measurement of → and → signal
A simultaneous measurement of the → and → signals is performed using the MC-template method, where it is possible to remove the assumption about the absence of a → signal in the fit of B( → ) and vice versa. The two POIs, corresponding to B( → ) and B( → ), are estimated in the simultaneous fit of the , had , and had final states. The analysis exploits eight SRs and eight CRs (one top-quark and one → CR for each ℓ ℓ ′ SR). The → and top-quark normalisation factors are correlated between the SRs and CRs, but not between the VBF and non-VBF categories. The top-quark normalisation factors are common to the , , had and had channels, while → normalisation factors are decorrelated between ℓ ℓ ′ and ℓ had final states. In total, two normalisation factors are used for the top-quark processes and four normalisation factors are used for the → processes. Figure 13 shows the resulting 95% CL upper limits for the → and → signals, together with the contribution from each category. For the → ( → ) signal, a 2.4 (1.6 ) excess is observed. In the case of the → signal, this is driven mainly by the non-VBF category of the ℓ ℓ ′ final state based on the MC-template method. In the case of the → signal, it is driven by the non-VBF category of the ℓ had final state. The result of the simultaneous fit of the → and → signals, along with the 68% and 95% CL contours, is shown in Figure 14. The result is found to be compatible with the SM prediction within 2.1 . Figure 15 shows the limits on the individual coupling matrix elements ℓ and ℓ obtained from the simultaneous fit of the two signals. The observed 95% CL upper limits on the branching ratios correspond to the following limits on the coupling matrix elements: √︁ | | 2 + | | 2 < 0.0013, and      Figure 15: Expected (red long-dashed line) and observed (solid blue line) 95% CL upper limits from the simultaneous fit (2 POI) of the two searches on the absolute value of the couplings ℓ and ℓ together with the most stringent indirect limits from → ℓ searches (dark purple region) for (a) ℓ = or (b) ℓ = . The short-dashed lines represent limits corresponding to different branching ratios (0.01%, 0.2%, 1% and 10%), while the dotted line indicates the naturalness limit (denoted by n.l.).

Measurement of the branching ratio difference in the ℓ ℓ ′ final state
The Symmetry method uses the dataset to estimate the background in the → search, and the dataset to estimate the background in the → search. Therefore, this method measures the difference between the two branching ratios, B ( → ) and B ( → ). In the search for → , the difference B ( → ) − B ( → ) is measured and in the search for → , the difference B ( → ) − B ( → ) is measured. The two measurements of the branching ratio difference with the Symmetry method are found to be compatible, taking into account the anti-correlation between the two measurements. A more detailed description can be found in Appendix D. The B ( → ) − B ( → ) measurement has the smaller expected uncertainty and is chosen as main result for the branching ratio difference from the Symmetry method. For the combination of the non-VBF and VBF categories, B ( → ) − B ( → ) measured by the Symmetry method is (0.25 ± 0.10)%, compatible with zero within 2.5 .
The branching ratio difference for the MC-template method is determined from the simultaneous fit of B ( → ) and B ( → ) following the approach discussed in Section 8.2, but based only on data in the ℓ ℓ ′ final state. The obtained difference B ( → ) − B ( → ) takes into account the correlation between the two branching ratio parameters from the fit. The difference B ( → ) − B ( → ) is measured to be (0.02 ± 0.12)% for the combination of the non-VBF and VBF categories, compatible with zero well within 1 .  The results of the branching ratio difference measurement are shown in Figure 16, displaying the results for non-VBF and VBF categories separately as well as combined. The Symmetry method favours a larger branching ratio for the → signal than for the → signal, with a significance of 2.5 for the combination of the non-VBF and VBF categories, driven mainly by the non-VBF category. The MC-template method observes a branching ratio difference compatible with zero for the combination of the non-VBF and VBF categories, driven also by the non-VBF category.
The measurement of the branching ratio difference with two independent background estimation methods in the ℓ ℓ ′ final state allows the results of the two methods to be checked for compatibility. The compatibility of the two background estimation methods is tested, taking into account the correlations between the Symmetry and MC-template methods. Due to the large overlap between the data that are used by the two methods, the data statistical uncertainties are correlated. Likewise, the two methods use the same simulated samples for signal, so the signal uncertainties are correlated. All other uncertainties are considered to be uncorrelated. For the MC-template method, the correlated uncertainties are fixed to their best-fit values and the uncertainties are re-evaluated considering only the uncorrelated sources. For the Symmetry method, the full uncertainty is considered. The results obtained with the MC-template and Symmetry methods are shown in Figure 16 and are found to be compatible within 2 .

Conclusion
Two direct searches for lepton-flavour-violating Higgs boson decays, → and → , are presented. They are based on √ = 13 TeV proton-proton collision data collected by the ATLAS experiment at the LHC, corresponding to an integrated luminosity of 138 fb −1 . Two complementary background estimation techniques are exploited. The first relies on simulation and is used in both the ℓ ℓ ′ and ℓ had channels. The second one relies on the Symmetry method and is applied in the ℓ ℓ ′ channel.
A simultaneous fit of possible → and → signals was performed combining the ℓ ℓ ′ and ℓ had channels, relying on inputs from the MC-template method only. In line with the independent fits, small excesses with respect to the SM background are observed, but below the threshold for evidence of a new signal. The observed (expected) 95% CL upper limits on the branching ratios of → and → are 0.20% (0.12%) and 0.18% (0.09%), respectively. The result of the simultaneous fit of → and → signals is found to be compatible with a branching ratio value of zero for both processes within 2.1 .
The measurement of the branching ratio difference with the Symmetry method, based only on data in the ℓ ℓ ′ final state, favours a larger branching ratio for the → signal than for the → signal, though the difference is not statistically significant. The best-fit value for the difference B ( → ) − B ( → ) obtained with the Symmetry method is (0.25 ± 0.10)%, which is compatible with a value of zero within 2.5 . It is also compatible within 2 with the best-fit value of (0.02 ± 0.12)%, obtained with the MC-template method in the ℓ ℓ ′ final state.
The observed limits based on the full dataset recorded by ATLAS at √ = 13 TeV are more stringent by factors of up to 2.5 (1.6) than the corresponding previous limits for the → ( → ) decay based on a partial dataset at √ = 13 TeV, while the expected sensitivity for the → ( → ) signal improves by a factor of about 3.1 (4.1). In addition to the approximately four times larger dataset, the main sources of improvement are significantly more sophisticated features of the analysis method. These include lepton assignment in the approximate Higgs boson rest frame for the ℓ ℓ ′ channel, use of the symmetry method for the ℓ ℓ ′ VBF category in the fit with one parameter of interest, introduction of a simultaneous fit of → and → signals (two parameters of interest), more advanced multivariate classifiers for signal extraction, and improved object reconstruction, in particular the identification algorithm.  [116]. Table 5 summarizes the control and validation regions (CR and VR respectively) used in the analysis to characterize the various background sources.

B Background and signal yields B.1 MC-template ℓ ℓ ′ channel
The signal and background event yields in all SRs and CRs obtained after performing the simultaneous fit of the → and → signals (Section 8.2) using only data from the ℓ ℓ ′ final state are detailed in Tables 6 and 7. Table 6: Observed event yields and predictions as computed by the fit in the signal region (SR) and control regions (CRs) of the channel. The prediction for each sample is determined from the likelihood fit performed to measure the → and → signals using the ℓ ℓ ′ data only (Section 8.2). The signal event yields are given for B ( → ) = 0.13% and B ( → ) = 0.17%, which correspond to the best-fit values obtained from the fit mentioned above. Uncertainties include statistical and systematic components. Uncertainties in the total background prediction include the effect of correlations between individual uncertainty sources as determined by the fit.  Table 7: Observed event yields and predictions as computed by the fit in the signal region (SR) and control regions (CRs) of the channel. The prediction for each sample is determined from the likelihood fit performed to measure the → and → signals using the ℓ ℓ ′ data only (Section 8.2). The signal event yields are given for B ( → ) = 0.13% and B ( → ) = 0.17%, which correspond to the best-fit values obtained from the fit mentioned above. Uncertainties include statistical and systematic components. Uncertainties in the total background prediction include the effect of correlations between individual uncertainty sources as determined by the fit.

B.2 MC-template ℓ had channel
The signal and background event yields in all SRs obtained after performing the simultaneous fit of the → and → signals (Section 8.2) using only data of the ℓ had final state are detailed in Table 8. Table 8: Observed event yields and predictions as computed by the fit in the signal regions (SRs) of the had and had channels. The prediction for each sample is determined from the likelihood fit performed to measure the → and → signals using the ℓ had data only (Section 8.2). The signal event yields are given for B ( → ) = 0.03% and B ( → ) = 0.09%, which correspond to the best-fit values obtained from the fit mentioned above. Uncertainties include statistical and systematic components. Uncertainties in the total background prediction include the effect of correlations between individual uncertainty sources as determined by the fit.

B.3 Symmetry-based ℓ ℓ ′ channel
The signal and background event yields in the SRs obtained after performing the independent fit (Section 8.1) using only data passing the symmetry-method specific ℓ ℓ ′ event selection are detailed in Table 9. The background contribution is divided into the symmetric, MC and FF components. Contributions from the other lepton channel are marked with the 'tilde' sign. Table 9: Observed event yields in the VBF and non-VBF categories for the → and → searches following the Symmetry-method, after the statistical analysis based on events passing the ℓ ℓ ′ Symmetry-method specific selection. The signal event yields are given for the best-fit branching ratio values based on the ℓ ℓ ′ data, B ( → ) = −0.33% and B ( → ) = 0.25%, where the fit of the difference is translated into individual branching ratios, by setting the other branching ratio explicitly to zero. This results in a negative shift for B ( → ), and a positive shift for B ( → ) (cf. Section 5.3). Contributions in the search for → coming from the selection are marked with the 'tilde' sign, those from the selection are listed without. For the → search, the notation is the same with interchanged regions. Uncertainties include statistical and systematic components. Uncertainties in the total background prediction include the effect of correlations between individual uncertainty sources as determined by the fit.

C MVA optimisation
The MVAs are trained using the four-momenta of the selected reconstructed particles as well as other derived observables, such as the invariant masses, angular separations and the miss T . In the VBF category, additional jet-related variables are used to exploit the characteristic VBF topology. The complete list of the variables used is shown in Table 10. For the MC-template method, the list of input variables is optimised by maximising the expected signal significance based on fits to Asimov datasets [112] considering statistical uncertainties only. For the Symmetry method, the hyperparameters and the input variables are optimised with respect to the expected signal significance by performing the statistical analysis taking into account all systematic uncertainties and using only ℓ ℓ ′ data (as described in Section 8). For each category, the lowest-ranked variables making marginal contributions to the expected sensitivity are removed. ). The MMC algorithm is tuned specifically to reconstruct the mass of the LFV Higgs boson. In the collinear approximation, the azimuthal angle between the prompt lepton from the Higgs boson and the miss T , Δ (ℓ , miss T ), is expected to be large for signal events, while the azimuthal angle between the visible products of the -lepton decay and the miss T , Δ ( , miss T ), is expected to be small for signal events. Other angular separations are included as well: Δ (ℓ , ), Δ (ℓ , ), Δ (ℓ , ), where refers to ℓ or had-vis in the corresponding channel. The Δ angular separations involving ℓ are evaluated in the approximate Higgs boson rest frame.
In the ℓ ℓ ′ channel, two vertex-related variables are also included because they provide additional discrimination power: the difference of the transverse impact parameters of the leptons (Δ 0 (ℓ 1 , ℓ 2 )) and the 0 significance of the lepton from the : ℓ 0 . The Δ discriminant [117] exploited in the ℓ ℓ ′ channels is expected to be close to zero if the decay products of -lepton are collinear and the transverse momentum of the Higgs boson can be neglected, while for the background events, this value deviates from zero. Additionally, in the Symmetry ℓ ℓ ′ channel, the -centrality, defined as -centrality(ℓ) = exp(−4/( j 1 − j 2 ) 2 · ( ℓ − 0.5( j 1 + j 2 )) 2 ), is included in the VBF category.
Different MVA techniques are used for the non-VBF and VBF categories to exploit the VBF topology, taking into account for the different limited training sample sizes.
For the MC-template method, the parameters used to configure the BDT are summarised in Table 11. The BDTs are trained using the Toolkit for Multivariate Data Analysis [93]. A -fold cross validation procedure [118] is implemented for the BDT training to avoid overtraining and to enable efficient usage of the available events in the training process. In both the ℓ ℓ ′ and ℓ had final states, 5-folds are used.
In the ℓ ℓ ′ channel of the Symmetry method, the fully connected deep NN training with Keras [94] and Tensorflow [95] as the backend uses supervised learning with categorical cross-entropy as the cost function and L2 weight regularisation to prevent overtraining, as well as 10-fold cross validation. The hyperparameter optimisation is carried out using the Optuna framework [119], where the L2 weight regularisation parameter is included in the hyperparameter optimisation. The full set of hyperparameters used for the NNs in the Symmetry ℓ ℓ ′ channel is summarised in Table 12.  Table 10: List of input variables used in the non-VBF and VBF categories. Those labelled with 'rest' are evaluated in the approximate Higgs boson rest frame as described in Section 4.2. The variable tot T , used only in the VBF category of the MC-template ℓ ℓ ′ channel, corresponds to the absolute value of the vectorial sum of the T of the two leptons, the two jets and the miss T , and it is used to veto the third jet. When a variable is used in both the ℓ ℓ ′ and ℓ had channels, the symbol refers to either ℓ or had-vis in the corresponding channel. Variable

D Compatibility of branching ratio differences
The compatibility of the two measured branching ratio differences, Δ = B ( → ) − B ( → ) and Δ = B ( → ) − B ( → ), in the Symmetry method is tested. Measuring Δ (Δ ) is equivalent to measuring ( ) when assuming ( ) = 0. This implies that the sum of the two fitted values for and should be consistent with zero. This was quantified using a 2 test, accounting for the error of the mean value of the upward and downward systematic variations of and , and the correlation coefficient between the and datasets. The latter is found to be = −0.80 using the bootstrap methodology [120] when considering the full systematic uncertainties in the fit. The results obtained in the non-VBF and VBF categories, as well as in the combined fit, indicate that the Δ and Δ measurements are compatible within 1.3 .
The branching ratio difference, Δ , from the Symmetry method can be compared with the value from the MC-template method, as discussed in Section 8.3. The result is displayed in Table 13. Table 13: Best-fit B ( → ) − B ( → ) values, given in %, obtained from the stand-alone fits of the ℓ ℓ ′ final state with either background estimation method. The results for non-VBF and VBF category are provided, as well as their combination. The last row shows the difference between the MC-template method and the Symmetry method. The results for the MC-template method are quoted once with the full uncertainty and once considering only the sources of uncertainty that are uncorrelated between the two analyses. The uncertainty in the difference between the two methods is calculated taking into account only the uncorrelated uncertainty sources for the MC-template method and the full uncertainty for Symmetry method.B