A 2D fit with background measurement constraints to boost the Higgs to 4l discovery potential at the LHC

A data-driven method for simultaneously extracting a potential Higgs to ZZ to 4e, 4mu, 2e2mu signal and its dominant backgrounds is presented. The method relies on a combined fit of the 2-lepton, Z*, and 4-lepton invariant masses. The fit is assisted by normalization of the Z+X backgrounds in data control regions. The Higgs discovery potential for the next few years of LHC running is presented. The demonstrated high sensitivity of the method makes it well suited to the searches performed by the ATLAS and CMS experiments.


Introduction
The discovery of the Standard Model (SM) Higgs boson is the major goal of the Large Hadron Collider (LHC). With the 2011 proton-proton experimental run now underway, LHC experiments are entering the main phase of the Higgs search. The Higgs boson mass is a free parameter in the SM; however, there is a strong expectation, motivated by precision electroweak data [1] and direct searches [2], that a low mass Higgs (114.4−186 GeV¹, at 95% confidence level) should be discovered at the LHC.
The experimentally cleanest signature for the discovery of the Higgs is its "golden" decay to four leptons (electrons and muons): H→ZZ(*)→4ℓ. The excellent energy resolution and linearity of the reconstructed electrons and muons lead to a narrow 4-lepton invariant mass peak on top of a smooth background. The expected signal to background ratio after all experimental requirements ranges approximately from 1/1 to 5/1 depending on the Higgs mass [3]. The major component of the background consists of the irreducible pp→ZZ(*)→4ℓ decays. The most challenging mass region is between 120 and 180 GeV, where one of the Z bosons is off-shell, giving low transverse momentum leptons. In this region backgrounds from pp→Zbb, Zcc→4ℓ (denoted as ZQQ) and, to a lesser extent, Zjj and ZjQ (j denotes a light quark which is mis-identified as a lepton) are significant, requiring tight lepton isolation cuts in order to reduce these backgrounds to levels well below the pp→ZZ(*) background.
In order to maximize the Higgs discovery potential using the H→4ℓ channel, the most accurate achievable knowledge of the background and its associated uncertainty is essential. The dominant pp→ZZ(*)→4ℓ background can be estimated (i) by theoretical predictions, (ii) by a combination of theoretical predictions and subsequent constraints using experimental LHC data, or (iii) by performing a sideband measurement and subsequently extrapolating to the signal region (SR). As discussed in [4], the first two methods suffer from theoretical uncertainties, the luminosity measurement uncertainty (only for method (i)) and systematic uncertainties such as lepton reconstruction efficiency. In addition, for low Higgs mass, the ZQQ contribution is significant, with a theoretical uncertainty ranging from 20 to 50% [9] (ideally it should be experimentally controlled with early LHC data).
Data-driven background extractions using the sideband measurement (iii) have been considered by both ATLAS and CMS [3], [4]. Such extractions are not as sensitive to theoretical and luminosity uncertainties, or to uncertainties due to lepton and isolation efficiencies, but for relatively low integrated luminosity they are limited by the number of events in the 4-lepton sideband [5]. Thus, a method providing a simultaneous measurement of a potential Higgs signal and all its background components is essential for the Higgs discovery in the most important LHC channel. This method should be able to give a realistic recipe for the extraction of all major components of the background (Zjj, ZjQ, ZQQ, ZZ(*)) while improving the power of the Higgs search. In this work we propose such a novel method applicable in early LHC measurements.
¹ Natural units are used throughout this paper.
Our main result is that, using direct measurements of all backgrounds and a combined fit of the 2-lepton and 4-lepton invariant masses, the Higgs should be discovered with 2-5 fb⁻¹ of integrated luminosity when combining the analyses of the two LHC experiments. The proposed method can be directly employed in the Higgs searches currently performed by the ATLAS and CMS experiments [6], [7].

Description of the Method
The main goal of the technique presented in this paper is to optimize the Higgs discovery potential using all available information from data by simultaneously extracting the Z+X→4ℓ and ZZ(*)→4ℓ yields, which are expected to fully dominate the 4ℓ continuum. This is achieved by combining a 2D fit with constraints on the normalizations of the Z+X backgrounds obtained from data control regions. The Z+X reducible background consists of three components: Zjj (Z+2 jets), ZjQ (Z+1 jet plus a lepton from a heavy flavour decay) and ZQQ. The jet contribution includes photons. The method will be applied for a low mass Higgs search (M_H < 2 × M_Z) and a high mass search (M_H ≥ 2 × M_Z).
In the study presented here the proton-proton collisions at a center of mass energy of 7 TeV were generated using PYTHIA [8]. A fast generic detector simulation was performed by smearing the generated values in order to include acceptance, resolution and efficiency effects. The ZQQ and ZQ fractions of Z+X were obtained using MCFM [9].
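The smearing step described above can be illustrated with a minimal Python sketch. It is not the simulation used in this study: the function name `smear_lepton`, the 2% relative resolution and the 90% efficiency are hypothetical values chosen only for illustration, and only the 7 GeV threshold is taken from the event selection in this paper.

```python
import random

def smear_lepton(pt_true, rel_resolution=0.02, efficiency=0.9,
                 pt_min=7.0, rng=random):
    """Return a smeared lepton pT in GeV, or None if the lepton is lost.

    All default values are illustrative assumptions, not the paper's.
    """
    # Efficiency: drop the lepton with probability (1 - efficiency).
    if rng.random() > efficiency:
        return None
    # Resolution: Gaussian smearing around the generated value.
    pt_reco = rng.gauss(pt_true, rel_resolution * pt_true)
    # Acceptance: the reconstructed lepton must pass the pT threshold.
    if pt_reco < pt_min:
        return None
    return pt_reco

rng = random.Random(42)
reco = [smear_lepton(40.0, rng=rng) for _ in range(10000)]
kept = [pt for pt in reco if pt is not None]
kept_fraction = len(kept) / len(reco)
mean_pt = sum(kept) / len(kept)
```

With these settings the kept fraction tracks the assumed efficiency and the mean reconstructed pT stays close to the generated 40 GeV, since the smearing is unbiased.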

Event selection
A typical event selection for the H→4ℓ analysis is as follows:
• Select events with at least 4 leptons with transverse momentum (p_⊥) > 7 GeV in pairs of same flavour and opposite charge. At least one of these leptons must have high p_⊥ > 20 GeV.
• Require one on-shell Z→2ℓ boson. The on-shell requirement is satisfied by an invariant mass (M_12) cut of ±15 GeV about the nominal Z mass. The 4-lepton requirement leads to an on-shell Z invariant mass distribution which has a small flat component of reducible backgrounds.
• The two leptons involved in the on-shell Z reconstruction are required to have a strict level of isolation. This is achieved using the tracker and the calorimeter. They are also required to originate from the interaction vertex.
• The remaining 2 same-flavour leptons have an invariant mass M_34, which is the key discriminating variable in this analysis. It will also serve as the ZZ data control region. The cuts on these 2 leptons depend on the mass hypothesis (high or low Higgs mass).
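The kinematic part of the selection listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the analysis code: leptons are treated as massless, the isolation and vertex requirements are omitted, and the choice of the second pair (the remaining same-flavour, opposite-charge pair with the highest p_⊥ sum) is an assumption made here. The names `Lepton`, `system_mass` and `select_event` are hypothetical.

```python
import math
from collections import namedtuple
from itertools import combinations

Lepton = namedtuple("Lepton", "flavour charge pt eta phi")
M_Z = 91.19  # GeV, nominal Z mass

def system_mass(leptons):
    """Invariant mass of a set of leptons, treated as massless."""
    e = sum(l.pt * math.cosh(l.eta) for l in leptons)
    px = sum(l.pt * math.cos(l.phi) for l in leptons)
    py = sum(l.pt * math.sin(l.phi) for l in leptons)
    pz = sum(l.pt * math.sinh(l.eta) for l in leptons)
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))

def select_event(leptons, pt_min=7.0, pt_lead=20.0, z_window=15.0):
    """Return (M12, M34, M4l) in GeV, or None if the event fails."""
    good = [l for l in leptons if l.pt > pt_min]
    if len(good) < 4 or max(l.pt for l in good) <= pt_lead:
        return None
    # Same-flavour, opposite-charge pairs.
    pairs = [(a, b) for a, b in combinations(good, 2)
             if a.flavour == b.flavour and a.charge == -b.charge]
    if not pairs:
        return None
    # On-shell Z: the pair closest to the nominal Z mass, within the window.
    z1 = min(pairs, key=lambda p: abs(system_mass(p) - M_Z))
    m12 = system_mass(z1)
    if abs(m12 - M_Z) > z_window:
        return None
    # Second pair: must not share a lepton with the on-shell Z (assumed rule).
    rest = [p for p in pairs if not set(p) & set(z1)]
    if not rest:
        return None
    z2 = max(rest, key=lambda p: p[0].pt + p[1].pt)
    return m12, system_mass(z2), system_mass(z1 + z2)

# Toy event: a Z at rest decaying to muons plus a 30 GeV electron pair.
event = [
    Lepton("mu", +1, 45.595, 0.0, 0.0),
    Lepton("mu", -1, 45.595, 0.0, math.pi),
    Lepton("e", +1, 15.0, 0.0, math.pi / 2),
    Lepton("e", -1, 15.0, 0.0, -math.pi / 2),
]
result = select_event(event)
```

For the toy event the function returns the three masses that enter the fit: M_12 at the Z mass, M_34 = 30 GeV and M_4ℓ equal to their sum, because both pairs are at rest.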
Non-Z reducible backgrounds (tt̄ and W+jets) after event selection are very small and can be subtracted by removing the smooth background under the Z mass peak. Leptons from light quarks are non-isolated, while those from heavy quarks (c, b) have in addition a significant impact parameter (normalized distance of closest approach) from the interaction vertex. In this work a selection using stringent isolation and impact parameter criteria is applied for the low Higgs mass search, whereas for the high Higgs mass search no selection is applied on the impact parameter and the isolation criteria are relaxed. Typical signal and background yields for the two event selections performed are shown in Table 1. The relaxed selection criteria for the high Higgs mass search are motivated by the fact that the Z+X background does not peak at high values of the 4-lepton mass, M_4ℓ. This background can be estimated by the 2D fit using the information in M_34.
The background components can be constrained using measurements in data control regions defined by a subset of the analysis cuts. Each of these data control regions is rich in a particular background component. The definition of such data control regions (Zjj, ZjQ) involves the two additional subleading leptons. Since lepton isolation cuts lead to a large rejection of the Zjj and ZjQ components, it is a common experimental practice to reverse such cuts in order to define data control regions rich in these backgrounds [6]. An example is given in the next section.

Table 1: Expected 4-lepton event yields (in the M_4ℓ mass range) per 1 fb⁻¹ of integrated luminosity after a low mass analysis selection (first column), and a high mass analysis selection (second column) for proton-proton collisions at a center of mass energy of 7 TeV. Detector effects such as acceptance, efficiencies and resolutions have been applied via a fast simulation (see text).

Lepton Isolation: a data control region example
In this section we present an example of a usage of a data control region to predict the reducible background in the signal region.
To demonstrate how these regions can be used for electrons to extract the normalizations of the Zjj and ZjQ backgrounds after all analysis cuts, we use an electromagnetic (EM) shower containment variable R_iso [6]. We define R_iso = E_in/E_tot as the ratio of the EM energy inside a 0.075×0.125 cone to the energy in a 0.125×0.125 cone. The distributions of R_iso for isolated electrons from Z bosons and for non-isolated leptons are shown in Figure 1. The shape of R_iso was obtained from existing measurements at the LHC [10]. For R_iso < 0.7 and given the cross sections of Table 1, the R_iso distribution is fully dominated by j→e fakes. This R_iso < 0.7 region provides a data control region for Zjj. The normalization of the Zjj background after all analysis cuts can be extracted if the Zjj Monte Carlo is normalized to the data events found in the data control region with R_iso < 0.7 (Figure 1 bottom), and subsequently the remaining analysis cuts are applied to the MC. This final step (the extrapolation with MC) relies on a good control of the additional discriminants used [6]. Typical such variables used by ATLAS and CMS are isolation, EM shower shape variables and impact parameter. The predicted number of Zjj background events in the signal region is expressed in terms of N^CR_Zjj, the number of Zjj events in the Zjj control region in MC, and ε_MC, the efficiency for Zjj events to pass the final cut selection.
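The control-region extrapolation described above can be sketched as follows. The form used here, a data-normalized control-region yield multiplied by the MC efficiency of the remaining cuts, is an assumption made for illustration and is not claimed to be the exact expression of this paper. The function names and all event counts are hypothetical.

```python
def r_iso(e_in, e_tot):
    """Shower containment: EM energy in the inner cone over the outer cone."""
    return e_in / e_tot

def in_zjj_control_region(e_in, e_tot, cut=0.7):
    """Control region rich in j->e fakes: R_iso below the cut."""
    return r_iso(e_in, e_tot) < cut

def predict_zjj_in_signal_region(n_data_cr, n_mc_cr, n_mc_pass):
    """Assumed extrapolation: N_SR = (N_data_CR / N_MC_CR) * N_MC_CR * eff_MC.

    n_data_cr : data events observed in the control region
    n_mc_cr   : Zjj MC events in the control region
    n_mc_pass : Zjj MC events that also pass the remaining analysis cuts
    """
    scale = n_data_cr / n_mc_cr   # data/MC normalization in the control region
    eff_mc = n_mc_pass / n_mc_cr  # MC efficiency of the remaining cuts
    return scale * n_mc_cr * eff_mc

# Hypothetical counts, for illustration only.
prediction = predict_zjj_in_signal_region(n_data_cr=480, n_mc_cr=400, n_mc_pass=2)
```

With these hypothetical counts the prediction is about 2.4 events. The quality of such an estimate rests on how well the MC describes the discriminants used in the extrapolation, as noted in the text.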

Extraction of background contributions
In this work we introduce a novel technique in which the signal and all background components are fitted simultaneously in the M_34 and M_4ℓ invariant mass observables. This is a 2D unbinned extended likelihood fit which exploits constraints coming from independent measurements of the Z+X backgrounds in data control regions. The method is applied to two separate searches: a low mass search (M_H < 2 × M_Z) where one Z is off-shell, and a high mass search (M_H ≥ 2 × M_Z) where both Z's can be on-shell. The two regions are dominated by different background fractions and shapes, thus a separate treatment is required. Figure 2 shows a simultaneous fit of the low Higgs mass M_34 and M_4ℓ distributions for pseudo-data corresponding to an integrated luminosity of 5 fb⁻¹ for a single LHC experiment. Figure 3 shows similar fits for the high mass search. The method exploits the fact that the Higgs signal makes a narrow peak in the M_4ℓ distribution, while the backgrounds are smoothly distributed. In the high mass region, the ZZ signal is clearly visible in M_34; however, it could be contaminated by a potential Higgs signal. These two components can be easily separated using the information in the M_4ℓ distribution (Figure 3 bottom). The Z+X component for the high Higgs mass search is significant, due to the relaxed selection used in this work, but it can be easily estimated using the information in the M_34 region (Figure 3 top). These example 2D fits are performed using the normalization constraints obtained from data, summarized below:
• The excellent separation between ZZ and the rest of the backgrounds in the M_34 distribution provides the fit with a powerful discriminant. This allows the extraction of normalizations for the ZZ and the rest of the backgrounds. In this study the ZZ and Z+X components are fitted with a Breit-Wigner convolved with a Gaussian resolution function and with an exponential, respectively.
• At low values of M_34 the ZZ* component is obtained from the ZZ normalization from data and the relative ZZ*/ZZ normalization from MC. The residual uncertainty on the ZZ* normalization is taken as a systematic, since there is no other experimental input to constrain ZZ*.
• The normalizations of the ZjQ and Zjj backgrounds can be constrained by measurements in the control regions. These normalizations have uncertainties coming from the experimental knowledge of the relevant discriminants used in the extrapolation from the control region to the signal region (see section 2.2).
• Finally, the normalization of ZQQ is left to float in the fit since it is fully unconstrained. Constraints on ZQQ are possible by requiring large impact parameters for the leptons, offering the possibility to directly measure the ZQQ component in Z+X. This is beyond the scope of this work.
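The structure of an extended unbinned likelihood with external normalization constraints, as used in the fit described above, can be sketched in plain Python. This is a toy under loud assumptions and not the fit of this paper: the 2D shapes are taken to factorize, the Higgs and ZZ M_34 shapes are flat, Z+X is treated as a single constrained component, every shape parameter, range and yield is invented for illustration, and the minimization is a crude one-dimensional scan rather than a proper fit.

```python
import math
import random

M34_RANGE = (12.0, 60.0)    # GeV, hypothetical fit range
M4L_RANGE = (110.0, 190.0)  # GeV, hypothetical fit range

def uniform_pdf(x, lo, hi):
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def gauss_pdf(x, mean, sigma):
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def expo_pdf(x, slope, lo, hi):
    """Falling exponential, normalized to unit area on [lo, hi]."""
    norm = (math.exp(-slope * lo) - math.exp(-slope * hi)) / slope
    return math.exp(-slope * x) / norm

# Toy factorized 2D shapes p(m34, m4l); all parameters are illustrative.
PDFS = {
    "higgs": lambda m34, m4l: uniform_pdf(m34, *M34_RANGE) * gauss_pdf(m4l, 150.0, 2.0),
    "zz": lambda m34, m4l: uniform_pdf(m34, *M34_RANGE) * uniform_pdf(m4l, *M4L_RANGE),
    "zx": lambda m34, m4l: expo_pdf(m34, 0.1, *M34_RANGE) * uniform_pdf(m4l, *M4L_RANGE),
}

def extended_nll(yields, pdfs, events, constraints=None):
    """-ln L = sum_c N_c - sum_i ln(sum_c N_c f_c(x_i)) + Gaussian constraints."""
    nll = sum(yields.values())  # extended (Poisson) term
    for m34, m4l in events:
        nll -= math.log(sum(n * pdfs[c](m34, m4l) for c, n in yields.items()))
    for c, (mean, sigma) in (constraints or {}).items():
        nll += 0.5 * ((yields[c] - mean) / sigma) ** 2  # control-region constraint
    return nll

def sample_expo(rng, slope, lo, hi):
    """Inverse-CDF sampling of the truncated exponential."""
    a, b = math.exp(-slope * lo), math.exp(-slope * hi)
    return -math.log(a - rng.random() * (a - b)) / slope

# Pseudo-data: 10 signal, 20 ZZ and 15 Z+X events (hypothetical yields).
rng = random.Random(1)
events = (
    [(rng.uniform(*M34_RANGE), rng.gauss(150.0, 2.0)) for _ in range(10)]
    + [(rng.uniform(*M34_RANGE), rng.uniform(*M4L_RANGE)) for _ in range(20)]
    + [(sample_expo(rng, 0.1, *M34_RANGE), rng.uniform(*M4L_RANGE)) for _ in range(15)]
)

# Z+X constrained to 15 +- 3 events, as if measured in a control region.
constraints = {"zx": (15.0, 3.0)}
scan = {
    s: extended_nll({"higgs": float(s), "zz": 20.0, "zx": 15.0}, PDFS, events, constraints)
    for s in range(0, 31)
}
best_signal = min(scan, key=scan.get)
```

The constraint term is what carries the control-region measurements into the fit: a component with no constraint entry, such as ZQQ in the text, would simply float. A realistic implementation would minimize over all yields and shape parameters at once with a dedicated fitting package.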
As shown in Figures 2 and 3, the potential Higgs signal and its associated backgrounds can be extracted in a single fit assisted by constraints on the components using data.

Results: Higgs discovery potential vs Luminosity
The 2D fit described in the previous section leads not only to an extraction of the signal and all of the major backgrounds (ZZ, ZQQ, ZjQ, Zjj→4ℓ), but also to the signal significance [11]. In this section we use the method to assess the SM Higgs discovery potential. For the extraction of the significance the RooStats package of the RooFit/ROOT analysis framework has been used [12], [13], [14].
The SM Higgs discovery potential depends on the luminosity and the mass of the Higgs. Here we assume a 7 TeV center of mass energy and an integrated luminosity ranging from 1 to 30 fb⁻¹. The significance can be calculated assuming the expected Higgs invariant mass distribution, including detector resolution and the natural width of the SM Higgs. The obtained significance as a function of mass and as a function of luminosity is presented in Figure 4 for a single LHC experiment. The low Higgs mass search is most sensitive around 150 GeV, where 3σ evidence should be found with 3 fb⁻¹ of integrated luminosity. The real strength of this channel is in the high mass search. Here one LHC experiment will need 2 fb⁻¹ in order to observe 3σ evidence for Higgs masses up to 300 GeV. This is an important result, since the method is data-driven and designed to maximize the use of the information provided by the data.
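The dependence of the significance on luminosity can be illustrated with a simple counting-experiment approximation. This is not the RooStats likelihood-based calculation used for the results in this paper: the sketch below uses the standard asymptotic formula for a single-bin counting experiment, and the per-fb⁻¹ yields are hypothetical.

```python
import math

def asimov_z(s, b):
    """Median expected significance for s signal on b background events
    (asymptotic counting-experiment formula, no systematic uncertainties)."""
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

def lumi_for_significance(z_target, s_per_fb, b_per_fb):
    """Luminosity in fb^-1 needed to reach z_target.

    Both yields scale linearly with luminosity, so Z scales as sqrt(L).
    """
    return (z_target / asimov_z(s_per_fb, b_per_fb)) ** 2

# Hypothetical yields per fb^-1 inside a mass window, for illustration only.
s_per_fb, b_per_fb = 1.5, 0.6
lumi_3sigma = lumi_for_significance(3.0, s_per_fb, b_per_fb)

# Two experiments with equal sensitivity behave like doubled luminosity.
lumi_3sigma_combined = lumi_3sigma / 2.0
```

The square-root scaling is the reason why combining two experiments of similar sensitivity roughly halves the luminosity each one needs for a given significance.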
In the remainder of this section we summarize the uncertainties of the method. At low luminosities, below 5 fb⁻¹, the main uncertainty is due to the limited data statistics in the M_34 and M_4ℓ plots. The expectation for the individual background components is as follows:
• ZZ: for low luminosities, we expect only a handful of ZZ events, for which we will have an excellent understanding of the M_34 shape thanks to existing Z→2ℓ data. This leads to an extracted normalization for ZZ and ZZ* that is statistically dominated. In addition, the ZZ* part is sensitive to theoretical and experimental uncertainties (such as control of lepton efficiencies).
• ZjQ and Zjj backgrounds: the uncertainties are smaller due to the high statistics in the data control regions. In this case the dominant uncertainties come from the control of the discriminating variables used to go from the data control regions to the Higgs signal region. These variables include shower shapes, lepton impact parameter and tracking quality. Since these backgrounds are subdominant (ZQQ and ZZ(*) dominate), their impact on the total systematic uncertainty is expected to be small.
• ZQQ background: it is significant in the low mass range. The 2D fit can constrain both the shape of ZQQ and its normalization. Thus the ZQQ normalization and shape uncertainties are also dominated by the statistics in M_34 and M_4ℓ, as well as by the systematic uncertainties on the other backgrounds.

Summary and Conclusions
In this paper, a data-driven method for extracting a potential Higgs signal and its dominant backgrounds in the search for Higgs boson decays to four leptons was presented. The method relies on a combined unbinned likelihood fit of the 2-lepton, Z(*), and 4-lepton invariant masses. The unique feature of the method is the attempt to maximize the use of the information provided in the data: the 2D fit is assisted by normalization of the main Z+X backgrounds in control regions. The Zjj and ZjQ components are predicted using data control regions defined by a subset of the analysis event selection. The 2D fit proceeds assuming these normalizations of Zjj and ZjQ and their uncertainties, leaving the ZQQ normalization free. The ZZ(*) background is normalized by the ZZ