A Deep Neural Network for Simultaneous Estimation of b Jet Energy and Resolution

We describe a method to obtain point and dispersion estimates for the energies of jets arising from b quarks produced in proton–proton collisions at an energy of s=13TeV\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sqrt{s}=13\,\text {TeV} $$\end{document} at the CERN LHC. The algorithm is trained on a large sample of simulated b jets and validated on data recorded by the CMS detector in 2017 corresponding to an integrated luminosity of 41 fb-1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\,\text {fb}^{-1}$$\end{document}. A multivariate regression algorithm based on a deep feed-forward neural network employs jet composition and shape information, and the properties of reconstructed secondary vertices associated with the jet. The results of the algorithm are used to improve the sensitivity of analyses that make use of b jets in the final state, such as the observation of Higgs boson decay to bb¯\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hbox {b}\bar{\hbox {b}}$$\end{document}.


Introduction
Following the discovery of the 125 GeV Higgs boson reported by the ATLAS and CMS Collaborations at the CERN LHC in 2012 [1][2][3], a rich research program was established to probe this new particle.The program includes the measurement of all production and decay modes that are accessible at the LHC.The decay of the Higgs boson into a pair of vector bosons was established with a statistical significance higher than five standard deviations individually for Z and W pairs using data collected at the LHC from 2011 to 2013 at center-of-mass energy of √ s = 7 or 8 TeV [4][5][6][7][8][9].A few years later, the combination of CMS data sets collected at 8 and 13 TeV was used to report the observation of Higgs boson decay to a pair of τ leptons [10], followed by the observation of the associated production of a Higgs boson with a top quark-antiquark pair (tt) [11,12].
Higgs boson decay to a b quark-antiquark pair (bb) was only recently announced by the CMS [13] and ATLAS [14] Collaborations, despite it being the dominant decay mode.This is because of the challenges associated with separating the signal from the large background of bb produced by quantum chromodynamics (QCD) processes.Good resolution of the reconstructed invariant masses of Higgs boson candidates is necessary to have a more favorable signal-to-background ratio.This is achieved in CMS by the method described in this paper, based on a deep neural network (DNN) that estimates the energy of jets originating from b quarks (b jets).Similar algorithms, using neural networks, were previously used by the CDF Collaboration at the Tevatron [15], and BDT-based energy regressions were used earlier by the CMS Collaboration to estimate the energy of b jets [16].
The characteristic of the approach described in this paper is the use of a regression algorithm that is implemented in a feed-forward neural network with six hidden layers trained on a very large data set, consisting of Monte Carlo (MC) simulated b jets.The algorithm has a considerably larger modeling capability than those used previously.This approach was made possible by leveraging recent advances in hardware accelerators, such as graphics processing units (GPU), and in modern packages for automatic differentiation to handle the otherwise expensive computations involved in this task.A minimization of a loss function that combines a Huber [17] and two quantile [18] loss terms, enables simultaneous training of point and dispersion estimators of the regression target without making any assumptions about the functional form of its distribution.The point estimator is used as a correction of the measured b jet energy while dispersion estimators are used to build a jet-by-jet resolution estimator.The method is validated on data collected by the CMS detector in 2017.
In the following, Section 2 describes the CMS detector and the data sets used for this work.The regression problem and the inputs are described in Section 3. In Section 4 the loss function is introduced, while the DNN architecture and its training are summarized in Section 5. Finally, the results are presented in Section 6, followed by the summary in Section 7.

The CMS detector and data sets
The central feature of the CMS detector is a superconducting solenoid of 6 m internal diameter, providing a magnetic field of 3.8 T. Within the solenoid volume are a silicon pixel and strip tracker, a lead tungstate crystal electromagnetic calorimeter (ECAL), and a brass and scintillator hadron calorimeter (HCAL), each composed of a barrel and two endcap sections.Forward calorimeters extend the pseudorapidity (η) coverage provided by the barrel and endcap detectors.Muons are detected in gas-ionization chambers embedded in the steel flux-return yoke outside the solenoid.A detailed description of the apparatus, together with a definition of the coordinate system used and the relevant kinematic variables, can be found in Ref. [19].
The particle-flow (PF) algorithm [20] used by CMS aims to reconstruct and identify each individual particle in an event, with an optimized combination of information from the various elements of the CMS detector.Photon energies are obtained from ECAL data.The candidate vertex with the largest value of summed physics-object p 2 T is taken to be the primary proton-proton (pp) interaction vertex.The energy of each electron in the event is determined from a combination of the electron momentum at the primary interaction vertex, as determined by the tracker, the energy of the corresponding ECAL cluster, and the energy sum of all bremsstrahlung photons spatially compatible with having originated from the electron.The momentum of each muon is obtained via the curvature of the corresponding track.The energy of each charged hadron is determined from a combination of momentum measured in the tracker and the matching ECAL and HCAL energy deposits, corrected for zero-suppression effects and for the response function of the calorimeters to hadronic showers.Finally, for a neutral hadron the energy is obtained from the corresponding HCAL corrected energies.The anti-k T algorithm [21,22] with a distance parameter of 0.4 is applied offline to the full set of PF candidates in order to cluster them into jets.The jet momentum is determined by the vectorial sum of all particle momenta in the jet.The jet energy resolution typically amounts to 15-20% at 30 GeV, 10% at 100 GeV, and 5% at 1 TeV [23].
Additional pp interactions within the same or nearby bunch crossings (pileup) can contribute unrelated particles to the jet.To mitigate the effects of pileup, charged particles with tracks originating from pileup vertices are discarded before jet reconstruction.Then, the residual contamination from neutral particles and charged particles without reconstructed tracks is estimated for each event and subtracted from the jet energy.Jet energy corrections are derived from simulation to bring the measured average response for jets in line with particle-level jets.In-situ measurements of the transverse momentum balance in dijet, photon+jet, Z+jet, and multijet events are used to account for residual differences between the jet energy scales in data and simulation [24].We refer to this correction algorithm as the baseline algorithm.
The DNN was trained on 100 million b jets from a simulated sample of tt events produced in pp collisions at √ s = 13 TeV, generated at next-to-leading-order (NLO) accuracy in perturbative QCD (pQCD) with the POWHEG v2 program [25].Predictions of the model were then tested on simulated events with b jets coming from a variety of physical processes in order to validate performance in all relevant kinematic regions.To this end, b jets from the decay of Higgs bosons produced in association with a Z boson, Z(→ + − )H(→ bb ), where is an electron or a muon, were generated with the MADGRAPH5 aMC@NLO generator [26] at NLO pQCD accuracy.Additionally, b jets from the decay of Higgs boson pairs produced either from gluon fusion or in the decay of a new, spin 0 resonance, with one Higgs boson decaying to a b quarkantiquark pair and the other to a pair of photons, H(→ bb )H(→ γγ), were generated with MADGRAPH5 aMC@NLO at leading-order accuracy in pQCD.
Finally, to validate the regression model on data, the output of the DNN for simulated b jets was compared to that obtained for b jets recorded by the CMS detector.The events used for this validation were recorded in 2017 with triggers [27] that require the presence of at least one lepton.This data set, corresponding to an integrated luminosity of 41 fb −1 , was further enriched in Z bosons produced in association with b jets.The simulated events come from a sample of Z bosons and up to two additional partons generated with MADGRAPH5 aMC@NLO at NLO accuracy in pQCD.
For all simulated events, PYTHIA 8.2 [28] with the CP5 tune [29] is used for parton showering and hadronization.The CMS detector response is simulated by the GEANT4 [30] package, and simulated pileup interactions are added to the hard-scattering process to match the distribution of pileup interactions observed in data, for which the observed mean number of interactions per bunch crossing is 32.

Energy regression and input features
In comparison to jets arising from light-flavor quarks or gluons, jets arising from b quarks have special characteristics that call for dedicated energy corrections.In particular, b jets contain b hadrons that can often decay to a final state with a charged lepton and a neutrino.The neutrinos, which only interact via the weak force, escape detection, leading to an underestimate of the b jet energy, with a corresponding degradation of energy resolution.As described in Section 2, the jet energy is reconstructed by clustering its constituents within a given distance parameter.Compared to jets originating from light-flavor quarks and gluons, b jets, because of their higher mass, tend to spread radially over a wider area in the η-φ plane, where φ is the azimuthal angle.This often leads to leakage of energy outside of the jet clustering region, further impacting the jet energy response and resolution.
The b jets used for the DNN training come from a sample of simulated top quark events.The top quark decays before hadronising with a branching fraction close to unity into a b jet and a W boson.At LHC energies, it provides a source of b jets that spans a large transverse momentum (p T ) spectrum and covers the full η acceptance of the detector.
Two definitions of jets are used in this study: "generator-level jets", clustered from stable particles produced by the MC generator that includes the contribution from the neutrino's momentum, and "reconstructed jets", clustered from reconstructed particle-flow candidates.The reconstructed b jets were matched to a generated b jet to avoid contamination by light flavoured jets and were selected by applying a minimum threshold for transverse momentum (p reco T > 15 GeV, p gen T > 15 GeV) and by requiring the pseudorapidity of the central axis of the reconstructed jet to be within the tracker acceptance (|η| < 2.4).The p reco T value is corrected with the baseline algorithm as described in Section 2. The regression target, y, used in this study is defined as the ratio of the transverse momentum of the generator-level jet, p gen T , to that of the reconstructed jet, p reco T , corrected by the baseline jet energy algorithm.Using this definition rather than using p gen T directly has the effect of greatly reducing the variance of the target while producing a numerical value of order 1.The distribution of the target for b jets from an MC simulated tt sample is shown in Fig. 1 (right).To improve the convergence of the training of the DNN, the target is further standardized by subtracting its median value and dividing it by its standard deviation.
The DNN training inputs provide information about the kinematics, shape and composition of reconstructed jets.The inputs consist of the following features: • jet kinematics: jet p T , η, mass, and transverse mass, defined as E 2 − p 2 z ; • information about pileup interactions: the median energy density in the event, ρ, corresponding to the amount of transverse momentum per unit area that is due to overlapping collisions [31]; • information about semileptonic decays of b hadrons when an electron or muon candidate is clustered within a jet: the transverse component of lepton momentum perpendicular to the jet axis, the distance ∆R = √ (∆η) 2 + (∆φ) 2 where φ is in radians, and a categorical variable that encodes information about the lepton candidate's flavour; • information about the secondary vertex, selected as the highest p T displaced vertex linked to the jet: number of tracks associated to the vertex, transverse momentum and mass (computed assigning the pion mass to all reconstructed tracks forming the secondary vertex); the distance between the collision vertex and the secondary vertex computed in three dimensional space with its associated uncertainty [32,33]; • jet composition: largest p T value of any charged hadron candidates, fractions of energy carried by jet constituents; namely electrons, photons, charged hadrons, neutral hadrons, and muons.These fractions are computed for the whole jet, and separately in five rings of ∆R around the jet axis (∆R = 0-0.05,0.05-0.1,0.1-0.2,0.2-0.3,0.3-0.4); • multiplicity of PF candidates clustered to form the jet; • information about jet energy sharing among the jet constituents computed as where i runs over all jet constituents.
This results in a total of 43 input features.No additional preprocessing is performed, apart from the input normalization provided by batch normalization [34] at the input layer of the DNN.

Loss function
The regression outputs are the mean estimator and the 25 and 75% quantiles of the target distribution.The mean estimator ( ŷ) is used as the correction to be applied to the reconstructed b jet energy, while half of the difference of the 75 and 25% quantiles is used as a per-jet estimator of the b jet energy resolution.
The Huber loss function is employed to learn the mean of the target distribution via a minization process.It is preferable to the mean squared error because of its reduced sensitivity to the tails of the target distribution.It is defined as: where z = y − ŷ, and δ is set to 1 in our case.To estimate the 25 and 75% quantiles of the target distribution, the quantile loss function is used: where τ = 0.25 (0.75) corresponds to the 25 (75)% quantile.
The complete loss function can then be written as: where E (x,y)∼p(x,y) denotes the expectation value when sampling (x, y) on the distribution p(x, y), x denotes the set of input features, and p(x, y) is the joint distribution of the input features and the target variables y in the training sample.The symbols ŷ(x), ŷ25% (x) and ŷ75% (x) denote the DNN outputs: ŷ(x) is the mean estimator, ŷ25% (x) and ŷ75% (x) are the 25 and 75% quantile estimators, respectively.

Neural network architecture
The model used for this study is a feed-forward, fully connected DNN with 6 hidden layers, 43 input features and 3 outputs: the energy correction and the 25 and 75% quantiles.As mentioned above, a batch normalization layer is applied at the DNN input.
Each hidden layer of the DNN is built from the following components: • Dense layer: defined as a linear combination of all outputs from the previous layer.
• Batch normalization layer: to transform the inputs to zero-mean and unit-variance.
• Dropout unit: an operation that zeroes a fixed fraction of randomly chosen nodes, used as a regularization handle.The dropout rate is one of the optimized hyperparameters of the DNN.
A small slope β = 0.2 was chosen for the LReLU to allow for a nonvanishing gradient over the domain of the function [35].The output layer has a linear activation function.The DNN is implemented using the KERAS package [36] with TENSORFLOW backend [37].Back-propagation is done using stochastic gradient descent with the Adam optimizer [38].

Hyperparameter optimization
To optimize the performance of the DNN, three hyperparameters are considered: the depth of the network architecture, the dropout rate, and the gradient descent learning rate.They were tuned using the cross-validation algorithm [39].The mean validation loss was used as the figure of merit for the optimization over a five-fold splitting of the training sample.The network has been trained on a single NVIDIA GeForce GTX 1080 Ti GPU.
The number of nodes in the last 3 hidden layers of the DNN was set to [512, 256, 128] respectively, while the number of nodes of the remaining layers was set to 1024.The mean validation loss varies by about 15% over the ensemble of chosen points.A number of configurations were found to provide comparable performance.Of these, the network with the smallest number of trainable parameters was chosen.The parameters and their values are: do = 0.1, lr = 0.001, and 6 hidden layers with [1024, 1024, 1024, 512, 256, 128] nodes.This architecture has a total of about 2.8 million trainable parameters.

Training set p T composition
The number of events as a function of the b jet p T spectrum in the training sample spans six orders of magnitude, as shown in Fig. 1 (left).This means that, during the training, the DNN is exposed to many more jets with low p T .In situations like this, one might expect worse performance for high-p T jets.To check if this is an issue, emphasis was given to the high p T part of the sample.About 95% of the jets with p T below 400 GeV were removed in order to reproduce the same exponential shape observed above 400 GeV.We found that the DNN trained on this subsample of events showed no improvement for high p T jets but did have up to 0.5% degradation of the relative jet energy resolution.For this reason, the final DNN is trained on the full sample.

Results
The performance of the b jet regression was evaluated by comparing the b jet energy resolution and scale (defined as the most probable value of the p gen T /p reco T distribution), before and after the energy correction, on a test sample that is statistically independent from those used for training and validation.Different physics processes were included in the test set in order to evaluate the performance of the algorithm on b jets with different kinematics.The processes employed in the test sample are: • tt: top quark-antiquark pair production (independent of the training data set), • Z(→ + − )H(→ bb ): associated production of a Higgs boson with a Z boson, where the Z boson decays to a pair of same-flavor, opposite-charge electrons or muons, and the Higgs boson decays to bb, • H(→ bb )H(→ γγ): double Higgs boson produced via gluon fusion with one Higgs boson decaying to bb, and the other to a pair of photons, assuming both standard model (SM) and beyond SM kinematics.In the latter case, the double Higgs signal originates from the decay of a spin-0 resonance with a mass of 500 or 700 GeV.
Figure 2 shows the 25, 40, 50, and 75% quantiles of the target distribution before and after applying the DNN b jet energy corrections, as a function of jet p T , η, and ρ.The results are obtained for b jets from the tt test sample.The 40% quantile has been found to be a good approximation of the most probable value of the target distribution.In addition, the 40% quantile validates the performance on a quantile not used in the training.It can be seen that after DNN corrections, the distribution becomes narrower, and its median and 40% quantile exhibit smaller dependence on jet p T , η, and the median event energy density ρ.The jet energy resolution, s, is estimated as half the difference between the 75% (q 75 ) and 25% (q 25 ) quantiles of the target distribution.To quantify the resolution improvement, we compared the relative jet energy resolution, s, defined as: where the resolution s is divided by q 40 , the most probable value estimated as the 40% quantile of the target distribution.The relative improvement on s for b jets for various physics processes is between 12 and 15%, as can be seen from Table 1. Figure 3 shows the value of s obtained for b jets from the tt test sample as a function of the generator-level p gen T (left), η (center), and ρ (right).The lower panels in Fig. 3 show the relative improvements resulting from the DNN energy correction.The observed behavior agrees with the expectation that the regression correction should optimize the jet energy resolution, while the baseline corrections aim for a flat response as a function of the jet generator-level p gen T and η.For all physics processes considered, the per-jet relative resolution improvement is around 12-18% for p T < 100 GeV, falling to around 5-9% for p T > 200 GeV.
Knowledge of jet energy resolution on a jet-by-jet basis can be exploited in analyses searching for resonant production of b jet pairs to increase their sensitivity.We have checked the correlation between the jet resolution s and the value of the per-jet resolution estimator, ŝ, provided by the DNN: To do this, the sample of b jets was split into several equally populated bins in ŝ.In each bin, the value of s is computed as half the difference between the q 75 and q 25 quantiles of the target distribution, and compared to the average resolution estimator ŝ .Figure 4 shows the   correlation between s and the ŝ values for the inclusive p T spectrum and for several bins in p T .A linear dependence with slope near unity confirms that the per-jet energy resolution estimator ŝ correctly represents the jet resolution.We observe that deviations of the slope from unity from the linear behavior are roughly compatible within 20% of the ŝ value.
While the improvements described above are quoted at the single jet level, many physics analyses use the invariant mass of the two b jet system as a discriminating variable for signal extraction.The improvement in the resolution of the dijet invariant mass is generally bigger than that for a single jet because the energy corrections effectively equalize the energy scale of the two jets, while also improving the jet resolution.To estimate the dijet resolution improvement events with two leptons and two jets were selected from the Z(→ + − )H(→ bb ) sample: jets were required to have p T larger than 20 GeV, absolute value of η below 2.4, and be compatible with the hadronisation of b quarks, referred to as "b-tagged" [33] jets in the following.The selection criteria for the b-tagged jets corresponds to a 70% b jet tagging efficiency with a 1% misidentification rate for light-flavor or gluon jets.Leptons were required to have a p T larger than 20 GeV, while the lepton pairs were required to be compatible with the decay of a Z boson, requiring their invariant mass to be within 20 GeV of the mass of the Z boson.The Z boson was required to have a transverse momentum larger than 150 GeV.An improvement of about 20% in the dijet invariant mass resolution in the Z(→ + − )H(→ bb ) sample can be observed in Fig. 5.
In addition, a dedicated study was performed to test how well the algorithm performance can be transferred from Monte Carlo simulations to the domain of pp collision data.A set of Z boson candidates decaying to a pair of charged leptons was extracted from pp collisions recorded    where two jets and two leptons were selected.Distributions are shown before (dotted blue) and after (solid red) applying the b jet energy corrections.A Bukin function [40] was used to fit the distribution.The fitted mean and width of the core of each distribution are displayed in the figure .by the CMS experiment in 2017.A standard set of requirements [24,41] was applied to select events with electron or muon pairs compatible with having originated from the decay of a Z boson.Events were further required to have at least one b-tagged jet.The jet with the largest p T was required to have |η| < 2, while the p T of the dilepton system was required to be larger than 100 GeV.The p T balance between the Z boson and the b-tagged jet candidate was enforced by requiring that extra jets have a p T less than 30% of the Z p T to suppress events with additional hadronic activity.Events satisfying these requirements were used to evaluate the agreement between data and MC simulations.In addition, the resolution of the jets was measured by extrapolating to zero additional hadronic activity following the methodology described in Ref. [24].
Figure 6 shows the ratio between the p T of the leading jet and that of the dilepton system for events in which the p T of the subleading jet is less than 15 GeV.The left and right panels show the distributions obtained before and after applying the DNN-based corrections, respectively.It can be seen that the effect of the corrections is to reduce the width of the distribution.Using the method detailed in Ref. [24], the double ratio of the relative jet resolution s measured in data and in simulated events was found to be 1.1 ± 0.1 before and after applying the DNN-based corrections.This confirms that the resolution improvement achieved in simulated events is successfully transferred to the data domain.

Summary
We have described an algorithm that makes it possible to obtain point and dispersion estimates of the energy of jets arising from b quarks in proton-proton collisions.We trained a deep, feedforward neural network, with inputs based on jet composition and shape information, and on properties of the associated reconstructed secondary vertex for a sample of simulated b jets arising from the decays of top quark-antiquark pairs.The neural network simultaneously finds robust mean, 25 and 75% quantile estimators for the energy of a b jet.The mean estimator is based on the Huber loss function and is used as an energy correction, while the 25 and 75% quantile estimators are used to build a jet-by-jet resolution estimator, defined as half the difference between these quantiles.
The DNN-based algorithm leverages the information contained in a large training data set consisting of nearly 100 million simulated b jets, and improves the resolution of the b jet energy by 12-15% relative to that which is found after baseline corrections.An improvement of about 20% is observed in the resolution of the invariant mass of b jet pairs resulting from the decay of a Higgs boson produced in association with a Z boson.Events containing a dilepton decay of a Z boson produced in association with a b jet are used to validate the performance of the algorithm on proton-proton collision data recorded with the CMS detector.The jet energy resolution improvement observed in data is consistent with that found in simulation.The resolution estimator is further shown to predict the resolution of b jets with an accuracy of 20% over a p T range between 30 and 350 GeV.
The results described here are being used by the CMS Collaboration in several physics analyses targeting final states containing b jets, including the observation of the Higgs boson decay to bb [13].

Figure 1 (Figure 1 :
Figure 1: (left) The p reco T distribution for reconstructed b jets in an MC tt sample.(right) Distribution of the regression target for the MC tt training sample.

Figure 2 :
Figure 2: The 25, 40, 50, and 75% quantiles are shown for the b jet energy scale p gen T /p reco T distribution before (blue dashdot) and after (red solid) applying the regression correction as a function of jet p T (left), η (center), and ρ (right).

Figure 3 :
Figure 3: Relative jet energy resolution, s, as a function of generator-level jet p gen T (left), η (center), and ρ (right) for b jets from tt MC events.The average p T of these b jets is 80 GeV.The blue stars and red squares represent s before and after the DNN correction, respectively.The relative difference ∆s/s baseline between the s values before and after DNN corrections is shown in the lower panels.

Figure 4 :
Figure 4: Correlation between jet energy resolution s and the average jet energy resolution estimator ŝ for b jets from tt MC events.The blue circles correspond to the inclusive p T spectrum, while the blue band represents 20% up and down variations of the fitted ŝ trend for the inclusive p T spectrum.The red stars correspond to jets with p T ∈ [30, 50] GeV, orange diamonds to p T ∈ [50, 70] GeV, and green crosses to p T ∈ [110,120] GeV.

Figure 6 :
Figure 6: Distribution of the ratio between the transverse momentum of the leading b-tagged jet and that of the dilepton system from the decay of the Z boson.Distributions are shown before (left) and after (right) applying the b jet energy corrections.The s values of the core distributions are included in the figures.The black points and histogram show the distributions for data and simulated events, respectively.

Table 1 :
Relative differences ∆s/s baseline between the s values obtained before and after applying the DNN energy correction for b jets produced in the different physics processes indicated.