Search for a heavy Higgs boson decaying into two lighter Higgs bosons in the $\tau\tau$bb final state at 13 TeV

A search for a heavy Higgs boson H decaying into the observed Higgs boson h with a mass of 125 GeV and another Higgs boson h$_\mathrm{S}$ is presented. The h and h$_\mathrm{S}$ bosons are required to decay into a pair of tau leptons and a pair of b quarks, respectively. The search uses a sample of proton-proton collisions collected with the CMS detector at a center-of-mass energy of 13 TeV, corresponding to an integrated luminosity of 137 fb$^{-1}$. Mass ranges of 240-3000 GeV for $m_\mathrm{H}$ and 60-2800 GeV for $m_\mathrm{h_S}$ are explored in the search. No signal has been observed. Model independent 95% confidence level upper limits on the product of the production cross section and the branching fractions of the signal process are set with a sensitivity ranging from 125 fb (for $m_\mathrm{H}$ $=$ 240 GeV) to 2.7 fb (for $m_\mathrm{H}$ $=$ 1000 GeV). These limits are compared to maximally allowed products of the production cross section and the branching fractions of the signal process in the next-to-minimal supersymmetric extension of the standard model.


Introduction
The discovery of the Higgs boson (h) with a mass of 125 GeV at the CERN LHC [1][2][3] has turned the standard model (SM) of particle physics into a theory that could be valid up to the Planck scale. To date all properties of the observed particle are in agreement with the expectations of the SM within an experimental precision of 5-20% [4][5][6][7]. Despite its success in describing a wealth of phenomena, the SM falls short of addressing a number of fundamental theoretical questions and striking observations in nature. In this respect it is considered to be still incomplete. Supersymmetry (SUSY) postulates a bosonic (fermionic) partner particle for each SM fermion (boson), with the same quantum numbers as the corresponding SM particle apart from its (half-) integer spin [8,9]. The fact that to date no such SUSY particles have been observed implies that if SUSY were realized in nature it must be a broken symmetry. Apart from the prediction of a sizable number of new particles, SUSY requires the extension of the Brout-Englert-Higgs mechanism part [10][11][12][13][14][15] of the SM Lagrangian. In the minimal supersymmetric extension of the SM (MSSM) [16,17] one more SU(2) doublet of complex scalar fields is introduced with respect to the SM, leading to the prediction of two charged and three neutral Higgs bosons, one of which can be associated with h. A further extension of the MSSM by one additional complex scalar field S is theoretically well motivated, since it can solve the so called "µ-problem" of the MSSM [18]. It leads to the next-to-minimal supersymmetric SM (NMSSM), as reviewed, e.g., in Refs. [19,20]. Since S is a complex field, the number of predicted Higgs bosons increases by two, resulting in two charged and five neutral Higgs bosons, of which three are scalar and two are pseudoscalar in nature.
Many searches for additional Higgs bosons in the context of the MSSM have been performed by the LHC experiments. In the absence of signal, these have led to the exclusion of large parts of the MSSM parameter space for masses of the additional neutral Higgs bosons up to ≈2 TeV [21][22][23][24]. The parameter space of the NMSSM, on the other hand, is still largely unconstrained [25].
The current analysis focuses on the H → hh S decay of a heavy Higgs boson H into h and another neutral boson h S with a mass of m h S < m H − m h . It is based on the data recorded during the years 2016, 2017, and 2018 at a center-of-mass energy of 13 TeV with the CMS experiment, resulting in an integrated luminosity of 137 fb −1 . The search is inspired by the NMSSM, where h S could have a dominant admixture of the additional singlet field S, leading to a significant suppression of its couplings to SM particles and thus of its direct production at the LHC. In this case, the production of H and subsequent decay into hh S would become the dominant source for h S production. Despite the overall reduced coupling strengths to SM particles, the branching fractions of h S for its decay into SM particles are still expected to be similar to those of h. While here we use the NMSSM as a motivation, any other two Higgs doublet plus singlet model is equally relevant for the search.
A promising signature for the search is given by the decay of h into a pair of tau leptons and the decay of h S into a pair of b quarks, h(ττ)h S (bb). For better readability we will not distinguish fermions by particle or antiparticle in this final state in subsequent notation throughout the text. The decay into b quarks is chosen for its large branching fraction. The decay into tau leptons is chosen for its cleaner signature compared to the decay into b quarks. This search is restricted to H production from gluon fusion. The Feynman diagram for the process of interest is shown in Fig. 1. The search is performed in the mass ranges of 240 ≤ m H ≤ 3000 GeV and 60 ≤ m h S ≤ 2800 GeV. It is the first search for such a process at the LHC. No attempt is made to identify and treat specially boosted topologies, for which the h and h S decay products may not easily be spatially resolved. These can occur in parts of the explored mass ranges, e.g., for The paper is organized as follows. A brief introduction of the CMS detector and event reconstruction are given in Sections 2 and 3, respectively. The model used to describe the data is given in Section 4. The event selection and categorization are described in Section 5, followed by a discussion of the systematic uncertainties considered for the analysis of the data in Section 6. The results of the search are presented in Section 7. The paper is summarized in Section 8.

The CMS detector
The central feature of the CMS apparatus is a superconducting solenoid of 6 m internal diameter, providing a magnetic field of 3.8 T. Within the solenoid volume are a silicon pixel and strip tracker, a lead tungstate crystal electromagnetic calorimeter (ECAL), and a brass and scintillator hadron calorimeter (HCAL), each composed of a barrel and two endcap sections. Forward calorimeters extend the pseudorapidity (η) coverage provided by the barrel and endcap detectors. Muons are detected in gas-ionization chambers embedded in the steel flux-return yoke outside the solenoid.
The silicon tracker measures charged particles within the range of |η| < 2.5. During the LHC data-taking period up to 2017, the silicon tracker consisted of 1440 silicon pixel and 15 148 silicon strip detector modules. From 2017 on, the silicon pixel detector was upgraded to 1856 modules. For nonisolated particles with a transverse momentum of 1 < p T < 10 GeV with respect to the beam axis and |η| < 1.4, the track resolutions are typically 1.5% in p T and 25-90  µm in the transverse (longitudinal) impact parameter [26]. From 2017 on, the transverse impact parameter resolution improved to 20-60 µm when restricted to the same η range as before and 20-75 µm in the increased full η range [27].
The momentum resolution for electrons with p T ≈ 45 GeV from Z → ee decays ranges from 1.7 to 4.5%. It is generally better in the barrel region than in the endcaps, and also depends on the bremsstrahlung energy emitted by the electron traversing the material in front of the ECAL [28].
Muons are measured in the range of |η| < 2.4, with detection planes made using three technologies: drift tubes, cathode strip chambers, and resistive plate chambers. The relative p T resolution for muons with 20 < p T < 100 GeV is 1.3 to 2.0% in the barrel and better than 6% in the endcaps. In the barrel the relative p T resolution is better than 10% for muons with p T up to The contributions from backgrounds to the electron and muon selections are further reduced by requiring the corresponding lepton to be isolated from any hadronic activity in the detector. This property is quantified by an isolation variable where p e(µ) T corresponds to the electron or muon p T and ∑ p charged T , ∑ E neutral T , and ∑ E γ T to the p T (transverse energy E T ) sum of all charged particles, neutral hadrons, and photons, in a predefined cone of radius ∆R = (∆η) 2 + (∆φ) 2 around the lepton direction at the PV, where ∆φ (measured in radians) and ∆η correspond to the angular distances of the particle to the lepton in the azimuthal angle φ and η directions, respectively [28,29]. The chosen cone sizes are ∆R < 0.3 for electrons and 0.4 for muons. The lepton itself is excluded from the calculation. To mitigate any distortions from PU, only those charged particles whose tracks are associated with the PV are taken into account. Since for neutral hadrons and photons an unambiguous association to the PV or PU is not possible, an estimate of the contribution from PU (p PU T ) is subtracted from the sum of ∑ E neutral T and ∑ E γ T . This estimation is obtained from tracks not associated to the PV in the case of I µ rel and from the mean energy flow per unit area in the case of I e rel . In the case of negative values, the results are set to zero. For further characterization of an event, all reconstructed PF objects are used to form jets using the anti-k T jet finding algorithm with a distance parameter of 0.4. To identify jets resulting from the hadronization of b quarks (b jets) the DEEPJET algorithm is used as described in Refs. [39,40]. In this analysis a working point of this algorithm is chosen that corresponds to an expected b jet identification efficiency of ≈80% for an expected misidentification rate for jets originating from light quarks and gluons (c quarks) of 1% (15%) [41]. Jets with p T > 30 GeV and |η| < 4.7 and b jets with p T > 20 GeV and |η| < 2.4 (2.5) are used, where the value in parentheses corresponds to the selection after the upgrade of the silicon pixel detector from 2017 on.
Jets are also used as seeds for the reconstruction of hadronic τ decays (τ h ). This is done by further exploiting the substructure of the jets, using the hadrons-plus-strips algorithm, as described in Ref. [42]. For the analysis, the decays into one or three charged hadrons with up to two neutral pions with p T > 2.5 GeV are used. The neutral pions are reconstructed as strips with dynamic size in η-φ from reconstructed electrons and photons contained in the seeding jet, where the strip size varies as a function of the p T of the electron or photon candidate. The τ h decay mode is then obtained by combining the charged hadrons with the strips. To distinguish the τ h decays from jets originating from the hadronization of quarks or gluons, and from electrons, or muons, the DEEPTAU algorithm is used, as described in Ref. [43]. This algorithm exploits the information of the reconstructed event record, comprising tracking, impact parameter, and ECAL and HCAL cluster information; the kinematic and object identification properties of the PF candidates in the vicinity of the τ h candidate and the τ h candidate itself; and several characterizing quantities of the whole event. It results in a multiclassification output y DT α (α = τ, jet , e , µ) equivalent to a Bayesian probability of the τ h candidate to originate from a genuine tau, the hadronization of a quark or gluon, an isolated electron, or an isolated muon. From this output three discriminators are built according to For this analysis, a working point of D jet with a genuine τ h identification efficiency of 70% for a misidentification rate of 0.43% is chosen. For D e and D µ , depending on the ττ final Table 1: Background processes contributing to the event selection, as given in Section 5. The symbol corresponds to an electron or muon. The second column refers to the experimental signature in the analysis, the last three columns indicate the estimation methods used to model each corresponding signature, as described in Sections 4.1-4.3.
Estimation method Background process Final state signature τ-emb.
state, different working points with efficiencies of 80% and >99% and misidentification rates between 0.03% and 2.60% are chosen, respectively. It should be noted that the misidentification rate of D jet strongly depends on the p T and quark flavor of the misidentified jet, which is why this number should be viewed as an approximate estimate.
The pileup-per-particle identification algorithm [44] is applied to reduce the PU dependence of the p miss T observable, which is computed as the negative vectorial p T sum of the PF candidates, weighted by their probability to originate from the PV [45]. Its magnitude is referred to as p miss T . It is used for the estimation of the invariant mass of the two tau leptons before their decay, as discussed in Section 5.

Data model
The selection given in Section 5 targets the reconstruction of a pair of tau leptons originating from h with a mass of m τ τ = 125 GeV and a pair of b quarks originating from h S with a mass varying between 60 and 2800 GeV. For the τ pair the eτ h , µτ h and τ h τ h final states are used. The contribution of the eµ final state to the sensitivity of the search has been found to be negligible, which can be understood from the low ττ branching fraction and the overwhelmingly large background from t quark pair production (tt). In the eτ h and µτ h final states, the most abundant source of background after the selection is tt that can easily result in a signature with genuine leptons and b quarks. After selection the expected fraction of tt events in these final states is ≈70%. In the τ h τ h final state, events containing purely quantum chromodynamics (QCD) induced gluon and quark jets, referred to as QCD multijet production in the following, and the decay of Z bosons into tau leptons form the largest background sources with ≈35% each.
All SM background sources of relevance for this analysis are listed in Table 1. For the background modeling, three different methods are used depending on the interpreted signature after reconstruction: (i) ττ events are obtained from the τ-embedding method, discussed in Section 4.1; (ii) events with jets misidentified as τ h (jet → τ h ) are obtained from the F F -method, discussed in Section 4.2; (iii) all other background events and the signal events are obtained from full event simulation, discussed in Section 4.3.

The τ-embedding method
For all events in which the decay of a Z or two W bosons results in two genuine tau leptons, the τ-embedding method, as described in Ref. [46], is used. For this purpose, µµ events are selected in data. All energy deposits of the muons are removed from the event record and replaced by simulated tau lepton decays with the same kinematic properties as the selected muons. In this way the method relies only on the simulation of the well-understood tau lepton decay and its energy deposits in the detector, while all other parts of the event, such as the reconstructed jets, their identification as originating from the PV, the identification of b jets, or the non-τ related parts of p miss T , are obtained from data. This obviates the need to simulate complicated processes, such as parton showering, hadronization, underlying event, and event pileup, which are difficult to model in simulation, and results in an improved description of the data compared to the simulation of the full process. In turn, several simulation-to-data corrections, as detailed in Section 4.4, are not needed.
The selected muons predominantly originate from Z boson decays; however, contributions from other processes resulting in two genuine tau leptons, like tt or diboson production, are also covered by this model, where throughout the text diboson refers to any combination of two W or Z bosons. A detailed discussion of the selection of the original µµ events, the exact procedure itself, its range of validity, and related uncertainties can be found in Ref. [46]. For a selection with at least one jet identified as a b jet in the event, as described in Section 5, 84% of the µµ events selected for the τ-embedding method are expected to originate from Z boson decays, 14% from tt, and ≈2% from diboson production.

The F F -method
The main contributing processes to jet → τ h events are QCD multijet production, W bosons in association with jets (W+jets), and tt. These events are estimated using the F F -method, as described in Refs. [22,47]. For this purpose the complete kinematic phase space is split into the disjoint signal region (SR), application region (AR), and determination regions (DR i ). The SR and the AR differ only in the working point chosen for the identification of the τ h candidate, where for the AR a looser working point is chosen and the events from the SR are excluded. Three independent extrapolation factors F i F are derived for QCD multijet, W+jets production, and tt in three dedicated DR i , defined to enrich each corresponding process. The F i F are then used to estimate the yields N SR and kinematic properties of the combination of these backgrounds in the SR from the number of events N AR in the AR according to For this purpose the F i F are combined into a weighted sum, using the simulation-based estimation of the fractions w i of each process in the AR.
For the estimate of F QCD F , the charges of the two selected τ decay products are required to be of same sign. For the estimation of F W+jets F , a b jet veto and a high transverse mass of the lepton-p miss T system are required. The estimation of F tt F is obtained from simulation, with a selection of more than two jets, at least one b jet, and more than two leptons in an event. Each F i F is derived on an event-by-event basis, as a function of the p T of the τ h candidate, the p T of the second τ decay in the event, and the mass of the visible ττ decay products. All other processes but the enriched background process are estimated from the τ-embedding method or simulation and subtracted for this purpose. Each F i F is further subject to a number of nonclosure corrections derived from control regions in data to take sub-leading dependencies of the F i F into account.

Simulation
In the τ h τ h final state, the τ-embedding and F F -methods cover ≈95% of all expected background events. In the eτ h and µτ h final states the fractions of expected background events described by these two methods are ≈42%, each. All remaining events originate from processes like Z boson, tt, or diboson production, where at least one decay of a vector boson into an electron or muon is not covered by any of the two methods. These and the signal events are modeled using the simulation of the full processes.
The production of Z bosons in the ee and µµ final states is simulated at leading-order (LO) precision in the coupling strength α S , using the MADGRAPH5 aMC@NLO 2.2.2 (2.4.2) event generator [48,49] for the simulation of the data taken in 2016 (2017 and 2018). To increase the number of simulated events in regions of high signal purity, supplementary samples are generated with up to four outgoing partons in the hard interaction. For diboson production MADGRAPH5 aMC@NLO is used at next-to-LO (NLO) precision in α S . For tt and single t quark production samples are generated at NLO precision using POWHEG 2.0 [50][51][52][53][54][55]. The kinematic properties of single h production are simulated at NLO precision using POWHEG separately for the production via gluon fusion, vector boson fusion, or in association with a Z boson, W boson, or a top quark pair. For this purpose h is assumed to behave as expected from the SM.
When compared to data, Z boson, tt, and single t quark events in the tW channel are normalized to their cross sections at next-to-NLO precision in α S [56][57][58]. Single t quark production in the t-channel and diboson events are normalized to their cross sections at NLO precision in α S or higher [58][59][60].
The signal process H → hh S is generated using MADGRAPH5 aMC@NLO at LO precision. The analysis is restricted to H production via gluon fusion, which is expected to be dominant, e.g., in the NMSSM. Due to the two unknown masses involved in the decay, a two-dimensional grid of signal mass pairs is generated, resulting in 420 mass pairs spanning from 240 to 3000 GeV in m H and 60 to 2800 GeV in m h S , only taking pairs with m h S + 125 GeV ≤ m H into account.
For the generation of all signal and background processes, the NNPDF3.0 [61] (NNPDF3.1 [62]) parton distribution functions are used for the simulation of the data taken in 2016 (2017 and 2018). The description of the underlying event is parameterized according to the CUETP8M1 [63] and CP5 [64] tunes. Parton showering and hadronization, as well as the τ lepton decays, are modeled using the PYTHIA 8.230 event generator [65]. For all simulated events, additional inclusive inelastic pp collisions generated with PYTHIA are added according to the expected PU profile in data to take the effect of the observed PU into account. All events generated are passed through a GEANT4-based [66] simulation of the CMS detector and reconstructed using the same version of the CMS event reconstruction software as used for the data.

Corrections and control of the model
The capability of the model to describe the data is monitored in various control regions orthogonal to the signal and background classes defined in Section 5, and corrections and corresponding uncertainties are derived where necessary.
The following corrections equally apply to simulated and τ-embedded events, where the τ decay is also simulated. Since the simulation part for τ-embedded events happens under detector conditions, which are different from the case of fully simulated events, corrections and related uncertainties may differ, as detailed in Ref. [46]. Corrections are derived for residual differences between data and simulation in the efficiency of the selected triggers, the electron and muon tracking efficiency, and in the efficiency of the identification and isolation requirements for electrons and muons. These corrections are obtained in bins of p T and η of the corresponding lepton, using the "tag-and-probe" method, as described in Ref. [67], with Z → ee and Z → µµ events. They usually amount to not more than a few percent.
In a similar way, corrections are obtained for the efficiency of triggering on the τ h decay signature and for the τ h identification efficiency, following procedures as described in Ref. [42]. The latter are derived as a function of the p T of the τ h in four bins below 40 GeV and one bin above. For p T (τ h ) > 40 GeV, a correction is also derived for each τ h decay mode individually, which is used only in the τ h τ h final state. Corrections to the energy scale of the τ h decays and of electrons misidentified as τ h are derived for each year of data-taking and each τ h decay mode individually, from likelihood scans of discriminating observables, like the mass of the visible decay products of the τ h candidate, as detailed in Ref. [42]. For muons misidentified as τ h this effect has been observed to be negligible. For the trigger efficiency the correction is obtained from parametric fits to the trigger efficiency as a function of p T derived for each corresponding sample and data.
The following corrections only apply to fully simulated events. During the 2016 and 2017 data-taking, a gradual shift in the timing of the inputs of the ECAL L1 trigger in the region at |η| > 2.0 caused a specific trigger inefficiency. For events containing an electron (a jet) with p T larger than ≈50 (≈100) GeV, in the region of 2.5 < |η| < 3.0 the efficiency loss is 10-20%, depending on p T , η, and time. Corresponding corrections have been derived from data and applied to the simulation.
The jet energy is corrected to the expected response at the stable hadron level, using corrections measured in bins of the jet p T and η, as described in Ref. [31]. These corrections are usually not larger than 10-15%. Residual data-to-simulation corrections are applied to the simulated event samples. They usually range between subpercent level at high jet p T in the central part of the detector to a few percent in the forward region. A correction is applied to the direction and magnitude of p miss T based on differences between estimates of the hadronic recoil in Z → µµ events in data and simulation, as described in Ref. [45]. This correction is applied to the simulated Z boson, single h, and signal events, where a hadronic recoil against a single particle is well defined.
The efficiencies for genuine and misidentified b jets to pass the working points of the b jet identification discriminator, as given in Section 3, are determined from data, using tt events for genuine b jets and Z boson production in association with jets originating from light quarks or gluons. Data-to-simulation corrections are obtained for these efficiencies and used to correct the number of b jets in the simulation, as described in Ref. [39].
Data-to-simulation corrections are further applied to Z → ee (Z → µµ) events in the eτ h (µτ h ) and τ h τ h final states in which an electron (muon) is reconstructed as a τ h candidate, to account for residual differences in the e(µ) → τ h misidentification rate between data and simulation. Deficiencies in the modeling of Z boson events in the ee, µµ final states, due to the use of a LO simulation, are corrected for by reweighting the simulated Z → µµ events to data in bins of p µµ T and m µµ . In addition all simulated tt events are weighted to better match the top quark p T distribution, as observed in data [68].
The overall normalization of all backgrounds is constrained by dedicated event categories, obtained from neural network (NN) multiclassification, as described in Section 5. After the event selection and prior to the event classification, i.e., still at an inclusive state of the analysis, the marginal distributions and pairwise correlations, including self-correlations, of all input features to the NNs used for event classification are subject to extensive scrutiny. This is done exploiting goodness-of-fit tests, based on a saturated likelihood model [69] including all systematic uncertainties of the model and their correlations, as used for the signal extraction. This guarantees a good understanding of the input space to the NNs and the input distributions used for the statistical inference of the signal contribution.

Event selection
The L1 trigger decision is based on the identification of high-p T electrons or muons, reconstructed from a fast readout of the ECAL and muon detectors. A positive L1 trigger decision initiates the further reconstruction of the given event at the HLT. In the HLT step, the selection is based on the presence of a single electron or muon, an eτ h or µτ h pair, or a τ h τ h pair in the event. The addition of the single-electron or single-muon requirement to the list of triggers via a logical OR condition increases the overall acceptance of the online selection. In the offline selection further requirements on the p T , η, I e(µ) rel , and the D α discriminators are applied in addition to the object identification requirements described in Section 3, as summarized in Table 2.
In the eτ h (µτ h ) final state, an electron (muon) with at least 25 (20) GeV is required, if an event was selected by a trigger based on the presence of the eτ h (µτ h ) pair. If the event was selected Table 2: Offline requirements applied to electrons, muons, and τ h candidates used for the selection of the τ pair. The p T values in parentheses correspond to events selected by a singleelectron or single-muon trigger. These requirements depend on the year of data-taking. For D jet the efficiency and for D e(µ) the misidentification rates for the chosen working points are given in parentheses. A detailed discussion is given in the text.
by a single-electron trigger, the p T requirement on the electron is increased to 26-33 GeV depending on the data-taking period, to ensure a sufficiently high efficiency of the HLT selection. For muons, the p T requirement is increased to 23 (25) GeV for 2016 (2017 or 2018), if selected by a single-muon trigger. The electron (muon) is required to be contained in the central detector with |η| < 2.1, and to be isolated from any hadronic activity according to I e(µ) rel < 0.15. The τ h candidate is required to have |η| < 2.3 and p T > 35 GeV if selected by an eτ h (µτ h ) pair trigger, or p T > 30 GeV if selected by a single-electron (single-muon) trigger. In the τ h τ h final state, both τ h candidates are required to have |η| < 2.1 and p T > 40 GeV. The working points of the DEEPTAU discriminants, as described in Section 3, are chosen depending on the final state. Events with additional leptons fulfilling looser selection criteria are discarded to avoid the assignment of single events to more than one final state.
The selected τ decay candidates are required to be of opposite charge and to be separated by more than ∆R = 0.5 in the η-φ plane. The closest distance of their tracks to the PV is required to be d z < 0.2 cm along the beam axis. For electrons and muons, an additional requirement of d xy < 0.045 cm in the transverse plane is applied. In rare cases in which an extra τ h candidate fulfilling all selection requirements is found, the candidate with the higher score of D jet is chosen.
In addition to the tau lepton pair, at least one b jet fulfilling the selection criteria, as described in Section 3, is required. Events that contain only one b jet and no other jet are removed from the analysis. If more than two b jets exist, the pair is built from those that are leading in p T . If only one b jet exists the b pair is built using the b jet and the jet with its highest b jet score of the DEEPJET classifier. The energies of the jets used for the b pair are corrected using the multivariate energy-momentum regression described in Ref. [70].
This analysis selection is optimized for the reconstruction of events where the h and h S decay products are spatially resolved. Boosted topologies, which can occur in parts of the explored mass ranges, are not specifically targeted.

Event classification
All events retained by the selection described above are further sorted into five categories. One for signal, the other four are enriched with different backgrounds. This is done separately for each of the three final states and each of the three data-taking periods resulting in 45 categories. The background-enriched categories are used to further constrain systematic uncertainties in the background estimates during the statistical inference of the signal contribution. This categorization is based on NN multiclassification exploiting fully connected feed-forward NNs with two hidden layers of 200 nodes each, and five output nodes implemented in the software package TENSORFLOW [71]. The first four output nodes used to enrich the backgrounds comprise the following events: (i) events containing genuine τ pairs (labeled "ττ"); (ii) events with quark or gluon induced jets misidentified as τ h (labeled "jet → τ h "); (iii) top quark pair events where the intermediate W bosons decay into any combination of electrons and muons, or into a single τ and an electron or muon (not included in (i) or (ii); labeled as "tt"); (iv) events from remaining background processes that are of minor importance for the analysis and not yet included in any of the previous classes (labeled as "misc"). The processes in (iv) comprise diboson production, single t quark production, Z boson decays to electrons or muons, and single h production. For single h production, rates and branching fractions as predicted by the SM are assumed. Event classes (i) and (ii) are defined by final state or experimental signature of the contained events rather than explicit underlying physics processes. They are complemented by event classes (iii) and (iv) to characterize all background processes, which are of relevance for the analysis. The fifth event class, associated with the fith output node, contains the H → h(ττ)h S (bb) signal events (labeled as "signal"). This choice of event classes closely resembles the data model described in Section 4.
For each node in the hidden layers, the hyperbolic tangent is chosen as the activation function. The activation function for the output layer is chosen to be the softmax function allowing for a Bayesian conditional probability interpretation y (k) i of an event k to be associated to an event class i, given its input features x (k) to the NN. The highest value of y (k) i , max(y (k) i ), defines which class the event is associated with and will define the discriminator for the statistical inference of the signal contribution. All other outputs y (k) j , j = i are discarded from any further consideration so that any event is used only once for the statistical inference of the signal.
In the eτ h and µτ h final states, the input space to the NNs is spanned by 20 features x of an event including p T of the τ candidates and the jets forming the b quark pair; the mass and p T estimates of the τ pair, b quark pair, and ττbb system; the number of (b) jets; and further kinematic properties of the selected jets. For this purpose, a likelihood-based estimate of the ττ mass before decay [72] and a kinematic fit to the ττbb system for each given m H and m h S hypothesis, similar to the approach described in Ref. [73], are used. In the τ h τ h final state these features are complemented by the masses of the two jets used for the b quark pair system and their associated output values of the DEEPJET algorithm, to allow for a better discrimination of genuine b jets from light quark or gluon induced jets. All input features have been selected from a superset of variables describing the properties of the event exploiting a ranking of individual features and pairwise correlations of features, as described in Ref. [74].
Since the kinematic properties of the signal strongly vary across the probed ranges of m H and m h S a total of 68 NNs per final state are used for classification, which within each final state only differ by the kinematic properties of the signal that are used for training. For this purpose, adjacent sets of points in m h S and m H are combined into single signal classes. Up to four points in m h S are combined for single points in m H , for m H ≤ 1000 GeV. Beyond m H = 1000 GeV, all remaining points in m H and up to nine points in m h S are combined. The concrete grouping is a tradeoff between sensitivity and computational feasibility. Though it reduces the use of the invariant mass of the reconstructed b quark pair (m bb ) for the NN decision this grouping of mass points has only a small effect on the overall NN performance in separating signal from background, which can be understood by the following means: (i) correlated information, like the m H estimate and the χ 2 of the kinematic fit are used, in addition to m bb ; (ii) the fact that m bb is a peaking distribution for signal while not for background is still fully exploited by the NN; (iii) for m H > 1000 GeV the p T of the jets forming the b quark pair gains importance. Differences of the input features depending on the year of data-taking are taken into account by a conditional training using a one-hot encoding of the data-taking year in the NN training, such that the correct year of data taking obtains the value 1, while all other data-taking years obtain the value 0.
The parameters to be optimized during training are the weights ({w a }) and biases ({b b }) of the NN output functions y i . Before training the weights are initialized with random numbers using the Glorot initialization technique [75] with values drawn from a uniform distribution. The biases are initialized with zero. The trainings are then performed using randomly sampled batches of N = 30 events per event class, drawn from the training datasets using a balanced batch approach [76]. This approach has shown improved convergence properties on training samples with highly imbalanced lengths. The classification task is encoded in the NN loss function, chosen as the cross entropy which is to be minimized during the NN trainings. In Eq. (4), k runs over the events in the batch, on which L is evaluated. The NN prediction for event k to belong to category i is given by y To improve the generalization property of the NNs, two regularization techniques are introduced. Firstly, after each hidden layer a layer with a dropout probability of 30% is added. Secondly, the weights of the NNs are subject to an L2 (Tikhonov) regularization [78] with a regularization factor of 10 −5 .
After training, a very good separation between the background events and the signal events is achieved, with a purity and classification sensitivity for the correct signal class of typically more than 80%.

Systematic uncertainties
The uncertainty model used for the analysis comprises theoretical uncertainties, experimental uncertainties, and uncertainties due to the limited population of template distributions for the background model used for the statistical inference of the signal, as described in Section 7. The last group of uncertainties is incorporated for each bin of each corresponding template individually following the approach proposed in Ref. [79]. For this analysis, where the signal is expected to be concentrated to a few bins with low background expectation, these uncertainties can often range among those with the largest impact on the signal significance. All other uncertainties lead to correlated changes across bins either in the form of normalization changes or as general nontrivial shape-altering variations. Depending on the way they are derived, correlations may also arise across years, samples, or individual uncertainties.
The following uncertainties related to the level of control of the reconstruction of electrons, muons, and τ h candidates after selection apply to simulated and τ-embedded events. Unless stated otherwise they correspond to the uncertainties of the corrections described in Section 4.4 and are partially correlated across τ-embedded and simulated events.
• Uncertainties in the identification efficiency of electrons and muons amount to 2%, correlated across all years. Since no significant dependence on the p T or η of each corresponding lepton is observed these uncertainties are introduced as normalization uncertainties.
• With a similar reasoning, uncertainties in the electron and muon trigger efficiencies are also introduced as normalization uncertainties. They amount to 2% each. Due to differences in the trigger leg definitions they are treated as uncorrelated for singlelepton and two-object triggers. This may result in shape-altering effects in the overall model, since both triggers act on different regions in lepton p T .
• For fully simulated events an uncertainty in the electron energy scale is derived from the calibration of ECAL crystals, and applied on an event-by-event basis [28]. For τembedded events uncertainties of 0.50-1.25%, split by the ECAL barrel and endcap regions, are derived for the corrections described in Section 4.4. Due to the different ways the uncertainties are determined and differences in detector conditions they are treated as uncorrelated across simulated and τ-embedded events. They lead to shape-altering variations and are treated as correlated across years. The muon momentum (p µ ) is very precisely known [38]. A variation within the given uncertainties, depending on the muon η and p T has been checked to have no influence on the analysis.
• Uncertainties in the τ h -identification range between 3 and 5% in bins of τ h p T . Due to the nature of how they are derived these uncertainties are statistically dominated and therefore treated as uncorrelated across decay modes, p T bins, and years. The same is true for the uncertainties in the τ h -energy scale, which range from 0.2 to 1.1%, depending on the p T and the decay mode of the τ h . For the energy scale of electrons misidentified as τ h candidates, extra corrections are derived depending on the τ h p T and decay mode. Their uncertainties range from 1.0 to 2.5%. Concerning correlations the same statements apply as for the τ h -energy scale. All uncertainties discussed here for the τ h identification and energy scale lead to shape-altering variations. A generous variation of the momentum scale of muons misidentified as τ h has been checked to have a marginal effect on the analysis.
• Uncertainties in the τ h trigger efficiency are 5-10%, depending on the p T of the τ h . They are obtained from parametric fits to data and simulation, and lead to shapealtering effects. They are treated as uncorrelated across triggers and years.
Two further sources of uncertainty are considered for τ-embedded events [46]: • A 4% normalization uncertainty accounts for the level of control in the efficiency of the µµ selection in data, which is unfolded during the τ-embedding procedure. The dominant part of this uncertainty originates from the trigger used for selection and is treated as uncorrelated across years.
• Another shape and normalization-altering uncertainty in the yield of tt → µµ + X decays, which are part of the τ-embedded event samples, ranges between subpercent and 10%, depending on the event composition of the model. For this uncertainty, the number and shape of tt events contained in the τ-embedded event samples are estimated from simulation, for which the corresponding decay has been selected at the parton level. This estimate is then varied by ±10%.
For fully simulated events the following additional uncertainties apply: • Uncertainties in the e(µ) → τ h misidentification rate amount to 40% for electrons and range from 10 to 70% for muons. The relatively large size of these uncertainties originates from the rareness of these cases in the control regions that are used to measure these rates. They only apply to simulated Z → ee (µµ) events, which are of marginal importance for the analysis. The impact on the overall background yield is below the 1% level both in the eτ h and µτ h final states. The same is true for the uncertainty in the reweighting in the Z boson mass and p T , discussed in Section 4.4, which ranges from 10 to 20%.
• Uncertainties in the energy calibration and resolution of jets are applied with different correlations depending on their sources, comprising statistical limitations of the measurements used for calibration, the time-dependence of the energy measurements in data due to detector aging, and nonclosure corrections introduced to cover residual differences between simulation and data [31]. They range between subpercent level and O(10%), depending on the kinematic properties of the jets in the event. Similar uncertainties are applied for the identification rates for b jets and for the misidentification rates for light quark or gluon induced jets, which are of a similar range each [39,40].
• Depending on the process in consideration, two independent uncertainties in p miss T are applied. For processes that are subject to recoil corrections, i.e., Z boson production, h production, or signal, uncertainties in the calibration and resolution of the hadronic recoil are applied, ranging from 1 to 5%. For all other processes an uncertainty in p miss T is derived from the amount of unclustered energy in the event [45].
• A normalization uncertainty due to the timing shift of the inputs of the ECAL L1 trigger described in Section 4.4 amounts to 2-3%.
• A shape-altering uncertainty is derived in the reweighting of the top quark p T described in Section 4.4 by applying the correction with twice the required magnitude, thus overcorrecting, or not applying it at all. This uncertainty has only a very small effect on the final discriminator.
• The integrated luminosity is measured for each year of data-taking individually following procedures, as described in Ref. [80]. The luminosity measurements are known to a precision of 2.3 (2.5)% for 2017 [81] (2016 [82] and 2018 [83]). The corresponding normalization uncertainties comprise parts that are correlated and parts that are uncorrelated across the years.
• Since the search is not conducted within any particular model, no uncertainties on the production cross section or branching fractions of the signal need to be taken into account. Uncertainties in the signal acceptance are obtained from variations of the factorization and renormalization scales, as well as from sampling all relevant parameters for the estimation of the parton density distributions within their corresponding uncertainties, following procedures as outlined in [87]. The changes in acceptance due to the scale variations are observed to be less than 10%. They are shape altering, depending on the h and h S p T . The acceptance variations due to the sampling of the parton density distributions amount to normalization changes of 18%. Both uncertainties are correlated across years.
For the F F -method the following uncertainties apply: • The F i F and their corrections are subject to statistical fluctuations in each corresponding DR i . The corresponding uncertainties are split into a normalization and a shapealtering part and propagated into the final discriminator. They usually range between 3-5% and are treated as uncorrelated across the kinematic and topological bins they are derived in.
• Additional uncertainties are applied to cover corrections for non-closure effects and extrapolation factors, varying from a few percent to O(10%), depending on the kinematic properties of the τ h candidate and the topology of the event. These are both normalization and shape-altering uncertainties.
• An additional source of uncertainty concerns the subtraction of processes other than the enriched process in each corresponding DR i . These are subtracted from the data using simulated or τ-embedded events. The combined shape of the events to be removed is varied by 7%, and the measurements are repeated. The impacts of these variations are then propagated to the final discriminator as shape-altering uncertainties.
• An uncertainty in the estimation of the three main background fractions in the AR is estimated from a variation of each individual contribution by 7%, increasing or decreasing the remaining fractions such that the sum of all contributions remains unchanged. The amount of variation is motivated by the uncertainty in the production cross sections and acceptances of the involved processes and the constraint on the process composition that can be clearly obtained from the AR. The effect of this variation is observed to be very small, since usually one of the contributions dominates the event composition in the AR.
Due to their mostly statistical nature and differences across years, all uncertainties related to the F F -method are treated as uncorrelated across years. A summary of all systematic uncertainties that have been discussed in this section is given in Table 3.

Results
The model used to infer the signal from the data is defined by an extended binned likelihood of the form where i labels all bins of the distributions of the NN output functions max(y i ) of each of the five signal and background classes defined in Section 5. Split by three ττ final states and three years of data-taking this results in 45 individual input histograms, for each given pair of m H and m h S . The function P (k i |µ S i (m H , m h S , {θ j }) + B i ({θ j })) corresponds to the Poisson density to observe k i events in bin i for a prediction of S i signal and a total of B i background events. The parameter µ is a single scaling parameter of the signal.
Systematic uncertainties are incorporated as penalty terms for additional nuisance parameters {θ j } in the likelihood, appearing as a product with predefined probability density functions C(θ j |θ j ) to obtain a maximum likelihood estimateθ j for an assumed true value of θ j , during the statistical inference of the signal [88].
Sets of input distributions based on the NN classification for m H = 500 GeV and 100 ≤ m h S < 150 GeV in the µτ h , eτ h , and τ h τ h final states are shown in Figs. 2-4. For these figures, the data from all three years of data-taking have been combined. To retain the shape of the distributions of the y i in each category, the histogram bins have been divided by their widths, in the upper panels of each figure. As a Bayesian probability estimate the values of y i range from 0.2 to 1.0. The lower bound is given by the constraint that each event has to be associated to one of the five event categories. In each event category, the targeted processes are expected to have increasing purity with increasing values of y i . The points with error bars correspond to the data and the stacked filled histograms to the expectation from the background model. For the signal categories, the expectation for a signal with σ B(H → h(ττ)h S (bb)) = 200 or 50 fb, depending on the ττ final state, is also shown by a red line.
In the middle panels for all background categories, the purity estimated for the background template of each corresponding event category is shown. For the signal categories, the ratio of the indicated signal divided by the sum of all backgrounds is shown. In the lower panels of each figure, the observed numbers of events divided by the numbers of events expected from the background model are shown for each bin.
No signal-like excess is observed in any of the investigated mass combinations and 95% confidence level (CL) upper limits on the σ B(H → h(ττ)h S (bb)) of a potential signal are set following the modified frequentist approach as described in Refs. [89,90], using the same definition of the profile likelihood test statistic as defined in Refs. [88,91]: In Eq. (6),μ,θ j,µ , andθ j,μ indicate the maximum likelihood estimates of the corresponding parameters from the fit to the data and the index of q µ indicates that the fit to the data has been performed for a fixed value of µ. In the large number limit, the distribution of q µ can be approximated by analytic functions, from which the median and the uncertainty contours can be obtained as described in Ref. [92].
The observed and expected limits as a function of the tested values of m h S in a mass range from 240 ≤ m H ≤ 3000 GeV and for the combination of all ττ final states and the analyzed data from all years are shown in Fig. 5. The observed limits are given by the black points. The expected median values in the absence of signal are indicated by the dashed black line with the central 68 and 95% expected quantiles for the upper limit given by the green and yellow bands. They range from 125 fb for m H = 240 GeV and m h S = 85 GeV to 2.7 fb for m H = 1000 GeV and m h S = 350 GeV with a roughly flattening progression beyond. These limits are model independent. Since the analysis is not able to distinguish between scalar and pseudoscalar Higgs bosons, the limits are equally applicable to both cases. Residual differences on the detector acceptance for a scalar or pseudoscalar h S are expected to be small and well covered by the theoretical acceptance uncertainties discussed in Section 6.
It should be noted that neighboring points in m h S differ only slightly in the kinematic properties of the tested signal hypotheses. Groups of hypothesis tests based on the same NN trainings for classification are indicated by discontinuities in the limits, which are linearly connected otherwise to improve the visibility of common trends.
A summary of the observed limits for all tested pairs of m H and m h S is shown in Fig. 6, where the limits are given by the color code of the figure.
Maximally allowed values for σ B(H → h(ττ)h S (bb)) in the context of the NMSSM for given pairs m H and m h S , have been provided by the LHC Higgs Working Group, using the codes NMSSMTOOLS 5.5.0 [93] and NMSSMCALC [94], incorporating experimental constraints from measurements of the h properties, SUSY searches, B-meson physics and dark matter searches. The region in the plane spanned by m H and m h S where the observed limits fall below these maximally allowed values on σ B(H → h(ττ)h S (bb)) are indicated by a read hatched area. It corresponds to 400 ≤ m H 600 GeV and 60 ≤ m h S 200 GeV. For m(H) = 450 GeV and 60 ≤ m h S ≤ 80 GeV the observed limits are five times smaller than the maximally allowed values for σ B(H → h(ττ)h S (bb)). Tabulated results of this analysis are available in the Hep-Data database [95].

Summary
A search for a heavy Higgs boson H decaying into the observed Higgs boson h with a mass of 125 GeV and another Higgs boson h S has been presented. The h and h S bosons are required to decay into a pair of tau leptons and a pair of b quarks, respectively. The search uses a sample of proton-proton collisions collected with the CMS detector at a center-of-mass energy of 13 TeV, corresponding to an integrated luminosity of 137 fb −1 . Mass ranges of 240-3000 GeV for m H and 60-2800 GeV for m h S are explored in the search. No signal has been observed. Model independent 95% confidence level upper limits on the product of the production cross section and the branching fractions of the signal process are set with a sensitivity ranging from 125 fb (for m H = 240 GeV) to 2.7 fb (for m H = 1000 GeV). These limits have been compared to maximally allowed products of the production cross section and the branching fractions of the signal process in the next-to-minimal supersymmetric extension of the standard model. This is the first search for such a process carried out at the LHC. Table 3: Summary of systematic uncertainties discussed in the text. The first column indicates the source of uncertainty; the second the processes that it applies to; the third the variation; and the last how it is correlated with other uncertainties. A checkmark is given also for partial correlations. More details are given in the text.

Acknowledgments
We congratulate our colleagues in the CERN accelerator departments for the excellent performance of the LHC and thank the technical and administrative staffs at CERN and at other CMS institutes for their contributions to the success of the CMS effort. In addition, we gratefully acknowledge the computing centers and personnel of the Worldwide LHC Computing Grid and other centers for delivering so effectively the computing infrastructure essential to our analyses. Finally, we acknowledge the enduring support for the construction and operation of the LHC, the CMS detector, and the supporting computing infrastructure provided by the following funding agencies:       [16] P. Fayet, "Supergauge invariant extension of the Higgs mechanism and a model for the electron and its neutrino", Nucl. Phys. B 90 (1975)    [26] CMS Collaboration, "Description and performance of track and primary-vertex reconstruction with the CMS tracker", JINST 9 (2014) P10009, doi:10.1088/1748-0221/9/10/P10009, arXiv:1405.6569.
[27] CMS Collaboration, "Track impact parameter resolution for the full pseudorapidity coverage in the 2017 dataset with the CMS Phase-1 pixel detector", CMS Detector Performance Note CMS-DP-2020-049, 2020.