Introduction

Following the discovery of the 125 \(\text {GeV}\) Higgs boson reported by the ATLAS and CMS Collaborations at the CERN LHC in 2012 [1,2,3], a rich research program was established to probe this new particle. The program includes the measurement of all production and decay modes that are accessible at the LHC. The decay of the Higgs boson into a pair of vector bosons was established with a statistical significance higher than five standard deviations individually for photon, Z boson, and W boson pairs, using data collected at the LHC in 2011 and 2012 at center-of-mass energies of \(\sqrt{s}=7\) and 8 \(\,\text {TeV}\) [4,5,6,7,8,9]. A few years later, the combination of CMS data sets collected at 8 and 13 \(\text {TeV}\) was used to report the observation of Higgs boson decay to a pair of \(\tau \) leptons [10], followed by the observation of the associated production of a Higgs boson with a top quark–antiquark pair (\(\hbox {t}\bar{\hbox {t}}\)) [11, 12].

Higgs boson decay to a b quark–antiquark pair (\(\hbox {b}\bar{\hbox {b}}\)) was announced only recently by the CMS [13] and ATLAS [14] Collaborations, despite it being the dominant decay mode. This is because of the challenges associated with separating the signal from the large background of \(\hbox {b}\bar{\hbox {b}}\) pairs produced by quantum chromodynamics (QCD) processes. Good resolution of the reconstructed invariant mass of Higgs boson candidates is necessary to achieve a more favorable signal-to-background ratio. This is accomplished in CMS by the method described in this paper, based on a deep neural network (DNN) that estimates the energy of jets originating from b quarks (b jets). Similar algorithms, using neural networks, were previously used by the CDF Collaboration at the Tevatron [15, 16], and energy regressions based on boosted decision trees (BDTs) were used earlier by the CMS Collaboration to estimate the energy of b jets [17].

The approach described in this paper uses a regression algorithm implemented as a feed-forward neural network with six hidden layers, trained on a very large data set of Monte Carlo (MC) simulated b jets. The algorithm has a considerably larger modeling capability than those used previously. This approach was made possible by leveraging recent advances in hardware accelerators, such as graphics processing units (GPUs), and in modern packages for automatic differentiation to handle the otherwise expensive computations involved in this task. Minimization of a loss function that combines a Huber [18] and two quantile [19] loss terms enables simultaneous training of point and dispersion estimators of the regression target without making any assumptions about the functional form of its distribution. The point estimator is used as a correction of the measured b jet energy, while the dispersion estimators are used to build a jet-by-jet resolution estimate. Estimates of the energy and of the per-object resolution had previously been obtained by the CMS Collaboration with BDT-based approaches [20,21,22], either by training separate regressions for the energy and resolution estimators, or by means of a semiparametric regression [20, 21], whose training relies on knowledge of the analytical shape of the target distribution. The novel characteristic of the algorithm described in this paper is the simultaneous training of the point and dispersion estimators without reference to an ansatz distribution for the regression target. This method is validated on data collected by the CMS detector in 2017.

In the following, Sect. 2 and Sect. 3 describe the CMS detector and the data sets used for this work. The regression problem and the inputs are described in Sect. 4. In Sect. 5, the loss function is introduced, while the DNN architecture and its training are summarized in Sect. 6. Finally, the results are presented in Sect. 7, followed by the summary in Sect. 8.

The CMS Detector

The central feature of the CMS detector is a superconducting solenoid of 6 m internal diameter, providing a magnetic field of 3.8 T. Within the solenoid volume are a silicon pixel and strip tracker, a lead tungstate crystal electromagnetic calorimeter (ECAL), and a brass and scintillator hadron calorimeter (HCAL), each composed of a barrel and two endcap sections. Forward calorimeters extend the pseudorapidity (\(\eta \)) coverage provided by the barrel and endcap detectors. Muons are detected in gas-ionization chambers embedded in the steel flux-return yoke outside the solenoid. A detailed description of the apparatus, together with a definition of the coordinate system used and the relevant kinematic variables, can be found in Ref. [23].

The particle-flow (PF) algorithm [24] used by CMS aims to reconstruct and identify each individual particle in an event, with an optimized combination of information from the various elements of the CMS detector. The energy of photons is obtained from the ECAL measurement. The candidate vertex with the largest value of summed physics-object \(p_{\mathrm {T}} ^2\) is taken to be the primary proton–proton (\(\hbox {p}{}{} \hbox {p}{}{} \)) interaction vertex. The energy of each electron in the event is determined from a combination of the electron momentum at the primary interaction vertex, as determined by the tracker, the energy of the corresponding ECAL cluster, and the energy sum of all bremsstrahlung photons spatially compatible with having originated from the electron. The momentum of each muon is obtained from the curvature of the corresponding track. The energy of each charged hadron is determined from a combination of its momentum measured in the tracker and the matching ECAL and HCAL energy deposits, corrected for zero-suppression effects and for the response function of the calorimeters to hadronic showers. Finally, the energy of each neutral hadron is obtained from the corresponding corrected ECAL and HCAL energies. The anti-\(k_{\mathrm {T}}\) algorithm [25, 26] with a distance parameter of 0.4 is applied offline to the full set of PF candidates to cluster them into jets. The jet momentum is determined as the vectorial sum of all particle momenta in the jet. The jet energy resolution typically amounts to 15–20% at 30 \(\text {GeV}\), 10% at 100 \(\text {GeV}\), and 5% at 1 \(\text {TeV}\) [27].

Additional \(\hbox {p}{}{} \hbox {p}{}{} \) interactions within the same or nearby bunch crossings (pileup) can contribute unrelated particles to the jet. To mitigate the effects of pileup, charged particles with tracks originating from pileup vertices are discarded before jet reconstruction. Then, the residual contamination from neutral particles and charged particles without reconstructed tracks is estimated for each event and subtracted from the jet energy. Jet energy corrections are derived from simulation to bring the measured average response for jets in line with particle-level jets. Neutrinos are not included in the clustering of particle-level jets. In situ measurements of the transverse momentum balance in dijet, photon+jet, \(\hbox {Z}{}{} \)+jet, and multijet events are used to account for residual differences between the jet energy scales in data and simulation [28]. We refer to this correction algorithm as the baseline algorithm.

Data Sets

The DNN was trained on 100 million b jets from a simulated sample of \(\hbox {t}\bar{\hbox {t}}\) events produced in pp collisions at \(\sqrt{s}=13\,\text {TeV} \), generated at next-to-leading-order (NLO) accuracy in perturbative QCD (pQCD) with the POWHEG v2 program [29]. Predictions of the model were then tested on simulated events with b jets coming from a variety of physics processes to validate the performance in all relevant kinematic regions. To this end, b jets from the decay of Higgs bosons produced in association with a Z boson, \(\hbox {Z}{}{} (\rightarrow \ell ^+\ell ^-)\hbox {H}{}{} (\rightarrow \hbox {b}\bar{\hbox {b}})\), where \(\ell \) is an electron or a muon, were generated with the MadGraph5_aMC@NLO generator [30] at NLO pQCD accuracy. Additionally, b jets from the decay of Higgs boson pairs produced either via gluon fusion or in the decay of a new spin-0 resonance, with one Higgs boson decaying to a b quark–antiquark pair and the other to a pair of photons, \(\hbox {H}{}{} (\rightarrow \hbox {b}\bar{\hbox {b}})\hbox {H}{}{} (\rightarrow {\upgamma }{}{} {\upgamma }{}{})\), were generated with MadGraph5_aMC@NLO at leading-order accuracy in pQCD.

Two definitions of jets are used in this study: “generator-level jets”, clustered from the stable particles produced by the MC generator, including the contribution from neutrino momenta, and “reconstructed jets”, clustered from reconstructed particle-flow candidates. The reconstructed b jets were matched to generator-level b jets to avoid contamination from light-flavor jets. For each reconstructed jet, the corresponding generator-level jet is found by spatial matching in the \(\eta \)–\(\phi \) plane, requiring the distance \(\varDelta R = \sqrt{\smash [b]{(\varDelta \eta )^2+(\varDelta \phi )^2}}\) (where \(\phi \) is the azimuthal angle in radians) to satisfy \(\varDelta R < 0.4\). The reconstructed b jets were then selected by applying minimum transverse momentum thresholds (\(p_{\mathrm {T}} ^\text {reco}> 15\) \(\,\text {GeV}\), \(p_{\mathrm {T}} ^{\text {gen}}> 15\) \(\,\text {GeV}\)) and by requiring the pseudorapidity of the central axis of the reconstructed jet to be within the tracker acceptance (\(|\eta | < 2.4\)).
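As an illustration, such a \(\varDelta R\) matching could be implemented as in the following minimal sketch (not CMS software), assuming NumPy arrays of jet coordinates; the function names are hypothetical:

```python
import numpy as np

def delta_phi(phi1, phi2):
    """Azimuthal-angle difference wrapped into (-pi, pi]."""
    dphi = phi1 - phi2
    return (dphi + np.pi) % (2 * np.pi) - np.pi

def match_gen_jet(reco_eta, reco_phi, gen_etas, gen_phis, max_dr=0.4):
    """Return the index of the closest generator-level jet within
    Delta R < max_dr, or None if no jet satisfies the requirement."""
    dr = np.sqrt((reco_eta - gen_etas) ** 2
                 + delta_phi(reco_phi, gen_phis) ** 2)
    i = int(np.argmin(dr))
    return i if dr[i] < max_dr else None
```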

Finally, to validate the regression model on data, the output of the DNN for simulated b jets was compared to that obtained for b jets recorded by the CMS detector. The events used for this validation were recorded in 2017 with triggers [31] that require the presence of at least one lepton. This data set, corresponding to an integrated luminosity of 41 \(\,\text {fb}^{-1}\), was further enriched in Z bosons produced in association with b jets. The corresponding simulated events come from a sample of Z bosons and up to two additional partons generated with MadGraph5_aMC@NLO at NLO accuracy in pQCD.

For all simulated events, PYTHIA 8.2 [32] with the CP5 tune [33] is used for parton showering and hadronization. The CMS detector response is simulated by the Geant4 [34] package, and simulated pileup interactions are added to the hard-scattering process to match the distribution of pileup interactions observed in data, for which the observed mean number of interactions per bunch crossing is 32.

Energy Regression and Input Features

In comparison to jets arising from light-flavor quarks or gluons, jets arising from b quarks have special characteristics that call for dedicated energy corrections. In particular, b jets contain b hadrons, which often decay to a final state with a charged lepton and a neutrino. The neutrinos, which interact only via the weak force, escape detection, leading to an underestimate of the b jet energy and a corresponding degradation of the energy resolution. As described in Sect. 2, the jet energy is reconstructed by clustering the jet constituents within a given distance parameter. Because of the larger mass of the b hadrons, b jets tend to spread radially over a wider area in the \(\eta \)–\(\phi \) plane than jets originating from light-flavor quarks and gluons. This often leads to leakage of energy outside of the jet clustering region, further degrading the jet energy response and resolution.

The b jets used for the DNN training come from a sample of simulated top quark events. The top quark decays, before hadronizing, into a W boson and a b quark with a branching fraction close to unity; the b quark then gives rise to a b jet. At LHC energies, top quark production provides a source of b jets that spans a large transverse momentum (\(p_{\mathrm {T}} \)) spectrum and covers the full \(\eta \) acceptance of the detector. The \(p_{\mathrm {T}} ^\text {reco}\) value is corrected with the baseline algorithm as described in Sect. 2. Figure 1 (upper) shows the distribution of \(p_{\mathrm {T}} ^\text {reco}\) for the selected b jets.

Fig. 1 (upper) The \(p_{\mathrm {T}} ^\text {reco}\) distribution for reconstructed b jets in an MC \(\hbox {t}\bar{\hbox {t}}\) sample. (lower) Distribution of the regression target for the MC \(\hbox {t}\bar{\hbox {t}}\) training sample

The regression target, y, used in this study is defined as the ratio of the transverse momentum of the generator-level jet, \(p_{\mathrm {T}} ^{\text {gen}}\), to that of the reconstructed jet, \(p_{\mathrm {T}} ^\text {reco}\), after the baseline jet energy corrections have been applied. Using this ratio rather than \(p_{\mathrm {T}} ^{\text {gen}}\) directly greatly reduces the variance of the target and produces a numerical value of order 1. The distribution of the target for b jets from an MC simulated \(\hbox {t}\bar{\hbox {t}}\) sample is shown in Fig. 1 (lower). To improve the convergence of the DNN training, the target is further standardized by subtracting its median value and dividing by its standard deviation.
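For illustration, a minimal sketch of this target construction and standardization, assuming NumPy arrays of matched \(p_{\mathrm {T}}\) values (the toy spectra below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
pt_gen = rng.uniform(30.0, 300.0, size=100_000)         # toy generator-level pT
pt_reco = pt_gen * rng.normal(0.95, 0.12, pt_gen.size)  # toy reconstructed pT

y = pt_gen / pt_reco                     # regression target, of order 1
y_std = (y - np.median(y)) / np.std(y)   # standardized target used in training
```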

The DNN training inputs provide information about the kinematics, shape, and composition of reconstructed jets. The inputs consist of the following features:

  • jet kinematics: jet \(p_{\mathrm {T}}\), \(\eta \), mass, and transverse mass, defined as \(\sqrt{\smash [b]{ E^2 - p_z ^2} }\);

  • information about pileup interactions: the median energy density in the event, \(\rho \), corresponding to the amount of transverse momentum per unit area that is due to overlapping collisions [35];

  • information about semileptonic decays of b hadrons when an electron or muon candidate is clustered within a jet: the component of the lepton momentum perpendicular to the jet axis, the distance \(\varDelta R = \sqrt{\smash [b]{(\varDelta \eta )^2+(\varDelta \phi )^2}}\) between the lepton and the jet axis, and a categorical variable that encodes the lepton candidate’s flavor;

  • information about the secondary vertex, selected as the highest-\(p_{\mathrm {T}}\) displaced vertex linked to the jet: the number of tracks associated with the vertex, its transverse momentum, and its mass (computed by assigning the pion mass to all reconstructed tracks forming the secondary vertex); the three-dimensional distance between the collision vertex and the secondary vertex, together with its associated uncertainty [36, 37];

  • jet composition: the largest \(p_{\mathrm {T}} \) value of any charged hadron candidate in the jet, and the fractions of the jet energy carried by each type of constituent, namely charged hadrons, neutral hadrons, muons, and an electromagnetic component coming from electrons and photons. These fractions are computed for the whole jet, and separately in five rings of \(\varDelta R\) around the jet axis (\(\varDelta R = \) 0–0.05, 0.05–0.1, 0.1–0.2, 0.2–0.3, 0.3–0.4);

  • multiplicity of PF candidates clustered to form the jet;

  • information about jet energy sharing among the jet constituents computed as

    $$\begin{aligned} \frac{\sqrt{\sum _ip_{\text T,i}^2}}{{\sum _ip_{\text T,i}}}, \end{aligned}$$
    (1)

    where i runs over all jet constituents.

This results in a total of 41 input features. No additional preprocessing is performed, apart from the input normalization provided by batch normalization [38] at the input layer of the DNN.
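As an example of one of these inputs, the jet energy sharing variable of Eq. (1) could be computed as in the following sketch (illustrative code, not part of the CMS reconstruction):

```python
import numpy as np

def energy_sharing(pt_constituents):
    """Eq. (1): sqrt(sum_i pT_i^2) / sum_i pT_i over the jet constituents.
    Close to 1 when a single constituent dominates, and 1/sqrt(N) for N
    equally hard constituents."""
    pt = np.asarray(pt_constituents, dtype=float)
    return np.sqrt(np.sum(pt**2)) / np.sum(pt)

print(energy_sharing([50.0, 10.0, 5.0, 2.0]))  # dominated by one hadron -> ~0.77
print(energy_sharing([10.0] * 16))             # 16 equal constituents -> 0.25
```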

Loss Function

A possible approach to such a regression problem is to develop separate dedicated regressions to obtain energy and per-object resolution estimators. If the target distribution can be parametrized analytically, one can instead use a semiparametric regression to obtain estimates of the function parameters. This method has been used by the CMS Collaboration to estimate the energy and resolution of electron and photon candidates [20, 21]. Whereas for photon and electron candidates the energy response can be parametrized by an analytically integrable function, this is less straightforward for b jets, making such an approach more expensive computationally. An alternative approach is to simultaneously obtain point and dispersion estimates of the b jet energy by defining a loss function that is completely agnostic to the target distribution. The correction to be applied to the reconstructed b jet energy can be obtained as the estimated mean, while the per-jet b jet energy resolution can be estimated as half the difference between the 75 and 25% quantiles. Therefore, the regression loss function should provide the mean estimator (\({\hat{y}}\)) and the 25 and 75% quantiles of the target distribution.

The Huber loss function is employed to learn the mean of the target distribution via a minimization process. It is preferable to the mean squared error because of its reduced sensitivity to the tails of the target distribution. It is defined as:

$$\begin{aligned} H_{\delta }(z) = {\left\{ \begin{array}{ll} \frac{1}{2} z ^2, &{}\text {if }|z| < \delta ;\\ \delta |z| - \frac{1}{2}\delta ^2, &{}\text {otherwise,} \end{array}\right. } \end{aligned}$$
(2)

where \(z = y -{\hat{y}}\), and \(\delta \) is set to 1 in our case. To estimate the 25 and 75% quantiles of the target distribution, the quantile loss function is used:

$$\begin{aligned} \rho _{\tau }(z) = {\left\{ \begin{array}{ll} \tau z, &{}\text {if }z > 0;\\ (\tau - 1) z, &{}\text {otherwise,} \end{array}\right. } \end{aligned}$$
(3)

where \(\tau \) = 0.25 (0.75) corresponds to the 25 (75)% quantile.

The complete loss function can then be written as:

$$\begin{aligned} \text {loss}({\hat{y}},{\hat{y}}_{25\%},{\hat{y}}_{75\%}) = E_{(x,y) \sim p(x,y)} [H_{1}( y - {\hat{y}}(x) )+\rho _{0.25}( y - {\hat{y}}_{25\%}(x)) +\rho _{0.75}( y -{\hat{y}}_{75\%}(x))], \end{aligned}$$
(4)

where \(E_{(x,y) \sim p(x,y)}\) denotes the expectation value when sampling \((x,y)\) from the distribution \(p(x,y)\), x denotes the set of input features, and \(p(x,y)\) is the joint distribution of the input features and the target variable y in the training sample. The symbols \({\hat{y}}(x)\), \({\hat{y}}_{25\%}(x)\), and \({\hat{y}}_{75\%}(x)\) denote the DNN outputs: \({\hat{y}}(x)\) is the mean estimator, and \({\hat{y}}_{25\%}(x)\) and \({\hat{y}}_{75\%}(x)\) are the 25 and 75% quantile estimators, respectively.
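A minimal TensorFlow sketch of the loss of Eq. (4) is given below, combining the Huber term of Eq. (2) (with \(\delta = 1\)) and the two quantile terms of Eq. (3); the three-column layout of the network outputs is an assumption made for illustration, not the published CMS implementation.

```python
import tensorflow as tf

def huber(z, delta=1.0):
    """Eq. (2): quadratic core for |z| < delta, linear tails otherwise."""
    abs_z = tf.abs(z)
    return tf.where(abs_z < delta,
                    0.5 * tf.square(z),
                    delta * abs_z - 0.5 * delta**2)

def quantile(z, tau):
    """Eq. (3): tau*z for z > 0, (tau - 1)*z otherwise."""
    return tf.where(z > 0.0, tau * z, (tau - 1.0) * z)

def combined_loss(y_true, y_pred):
    """Eq. (4); y_pred columns are assumed to be [mean, q25, q75]."""
    y = y_true[:, 0]
    return tf.reduce_mean(huber(y - y_pred[:, 0])
                          + quantile(y - y_pred[:, 1], 0.25)
                          + quantile(y - y_pred[:, 2], 0.75))
```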

Neural Network Architecture

The model used for this study is a feed-forward, fully connected DNN with 6 hidden layers, 41 input features, and 3 outputs: the energy correction and the 25 and 75% quantiles. As mentioned above, a batch normalization layer is applied at the DNN input.

Each hidden layer of the DNN is built from the following components:

  • Dense layer: defined as a linear combination of all outputs from the previous layer.

  • Batch normalization layer: transforms its inputs to zero mean and unit variance.

  • Dropout unit: an operation that zeroes a fixed fraction of randomly chosen nodes during the training, used as a regularization handle. The dropout rate is one of the optimized hyperparameters of the DNN.

  • Activation unit: we chose the “Leaky” Rectified Linear Unit (LReLU) [39]:

    $$\begin{aligned} \text {LReLU}(x) = {\left\{ \begin{array}{ll} x, &{}\hbox { if}\ x \ge 0;\\ \beta x, &{}\hbox { if}\ x < 0, \end{array}\right. } \end{aligned}$$
    (5)

    with \(\beta = 0.2\).

A small slope \(\beta \) = 0.2 was chosen for the LReLU to allow for a nonvanishing gradient over the domain of the function [39]. The output layer has a linear activation function. The DNN is implemented using the Keras package [40] with TensorFlow backend [41]. Back-propagation is done using stochastic gradient descent with the Adam optimizer [42].
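A hedged Keras sketch of such an architecture is shown below (illustrative, not the published CMS model): batch normalization at the input, hidden blocks of dense, batch normalization, dropout, and LReLU units, and a linear three-output layer trained with Adam. The combined_loss function refers to the loss sketch in Sect. 5; the hidden-layer sizes anticipate the values quoted in the next section.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features=41,
                hidden=(1024, 1024, 1024, 512, 256, 128),  # see next section
                dropout_rate=0.1, learning_rate=1e-3):
    inputs = keras.Input(shape=(n_features,))
    x = layers.BatchNormalization()(inputs)    # input normalization
    for n_nodes in hidden:                     # dense -> batch norm -> dropout -> LReLU
        x = layers.Dense(n_nodes)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(dropout_rate)(x)
        x = layers.LeakyReLU(0.2)(x)           # beta = 0.2 slope for x < 0
    outputs = layers.Dense(3, activation="linear")(x)  # mean, q25, q75 estimators
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate),
                  loss=combined_loss)          # loss sketch from Sect. 5
    return model
```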

Hyperparameter Optimization

To optimize the performance of the DNN, three hyperparameters are considered: the depth of the network architecture, the dropout rate, and the gradient descent learning rate. They were tuned using cross-validation [43], with the mean validation loss over a five-fold splitting of the training sample used as the figure of merit for the optimization. The network was trained on a single NVIDIA GeForce GTX 1080 Ti GPU.

Random sampling was used to select 50 of 120 grid points in hyperparameter space, where the grid is defined by the following:

  • dropout rate: \(do \in [0.1, 0.2, 0.3, 0.4]\).

  • learning rate: \(lr \in [10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}]\).

  • number of hidden layers: varied between 3 and 8.

The number of nodes in the last three hidden layers of the DNN was set to [512, 256, 128], respectively, while the number of nodes of the remaining layers was set to 1024. A number of configurations were found to provide comparable performance. Of these, the network with the smallest number of trainable parameters was chosen. The parameters and their values are: \(do = 0.1\), \(lr = 0.001\), and 6 hidden layers with [1024, 1024, 1024, 512, 256, 128] nodes. This architecture has a total of about 2.8 million trainable parameters.
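A sketch of this random sampling of the hyperparameter grid is given below; the train_and_validate helper is hypothetical and stands in for a five-fold cross-validated training.

```python
import itertools
import random

# Full grid: 4 dropout rates x 5 learning rates x 6 depths = 120 points
grid = list(itertools.product(
    [0.1, 0.2, 0.3, 0.4],            # dropout rate
    [1e-2, 1e-3, 1e-4, 1e-5, 1e-6],  # learning rate
    range(3, 9),                     # number of hidden layers
))
assert len(grid) == 120

random.seed(42)
for do, lr, n_layers in random.sample(grid, 50):
    val_loss = train_and_validate(do, lr, n_layers)  # hypothetical helper
    print(f"do={do}, lr={lr:g}, layers={n_layers}: mean val loss {val_loss:.4f}")
```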

Training Set \(p_{\mathrm {T}}\) Composition

The number of b jets per \(p_{\mathrm {T}}\) bin in the training sample spans six orders of magnitude, as shown in Fig. 1 (upper). This means that, during the training, the DNN is exposed to many more jets with low \(p_{\mathrm {T}}\). In situations like this, one might expect worse performance for high-\(p_{\mathrm {T}}\) jets. To check whether this is an issue, a training with greater emphasis on the high-\(p_{\mathrm {T}}\) part of the sample was performed: about 95% of the jets with \(p_{\mathrm {T}}\) below 400 \(\,\text {GeV}\) were removed, so that the spectrum below 400 \(\,\text {GeV}\) follows the same exponentially falling shape observed above that threshold. We found that the DNN trained on this subsample showed no improvement for high-\(p_{\mathrm {T}}\) jets, and suffered up to a 0.5% degradation of the relative jet energy resolution. For this reason, the final DNN is trained on the full sample.
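A minimal sketch of such a downsampling, assuming a NumPy array pt of training-jet transverse momenta (the toy spectrum below is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
pt = 15.0 + rng.exponential(60.0, size=1_000_000)  # toy falling pT spectrum

# Keep all jets above 400 GeV and a random ~5% of those below
keep = (pt >= 400.0) | (rng.random(pt.size) < 0.05)
subsample = pt[keep]
print(f"kept {keep.mean():.1%} of the jets")
```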

Results

The performance of the b jet regression was evaluated by comparing the b jet energy resolution and scale (defined as the most probable value of the \(p_{\mathrm {T}} ^{\text {gen}}/ p_{\mathrm {T}} ^\text {reco}\) distribution), before and after the energy correction, on a test sample that is statistically independent of those used for training and validation. Different physics processes were included in the test set to evaluate the performance of the algorithm on b jets with different kinematics. The processes employed in the test sample are:

  • \(\hbox {t}\bar{\hbox {t}}\): top quark–antiquark pair production (independent of the training data set),

  • \(\hbox {Z}{}{} (\rightarrow \ell ^+\ell ^-)\hbox {H}{}{} (\rightarrow \hbox {b}\bar{\hbox {b}})\): associated production of a Higgs boson with a Z boson, where the Z boson decays to a pair of same flavor, opposite-charge electrons or muons, and the Higgs boson decays to \(\hbox {b}\bar{\hbox {b}}\),

  • \(\hbox {H}{}{} (\rightarrow \hbox {b}\bar{\hbox {b}})\hbox {H}{}{} (\rightarrow {\upgamma }{}{} {\upgamma }{}{})\): Higgs boson pair production via gluon fusion, with one Higgs boson decaying to \(\hbox {b}\bar{\hbox {b}}\) and the other to a pair of photons, assuming both standard model (SM) and beyond-SM kinematics. In the latter case, the Higgs boson pair originates from the decay of a spin-0 resonance with a mass of 500 or 700 \(\text {GeV}\).

Figure 2 shows the 25, 40, 50, and 75% quantiles of the target distribution before and after applying the DNN b jet energy corrections, as a function of jet \(p_{\mathrm {T}}\), \(\eta \), and \(\rho \). The results are obtained for b jets from the \(\hbox {t}\bar{\hbox {t}}\) test sample. The 40% quantile has been found to be a good approximation of the most probable value of the target distribution; in addition, it provides a validation of the performance on a quantile not used in the training. After the DNN corrections, the distribution becomes narrower, and its median and 40% quantile exhibit a smaller dependence on jet \(p_{\mathrm {T}} \), \(\eta \), and the median event energy density \(\rho \).

Fig. 2 The 25, 40, 50, and 75% quantiles of the b jet energy scale \(p_{\mathrm {T}} ^{\text {gen}}/ p_{\mathrm {T}} ^\text {reco}\) distribution before (blue dash-dotted) and after (red solid) applying the regression correction, as a function of jet \(p_{\mathrm {T}}\) (left), \(\eta \) (center), and \(\rho \) (right). The \(\eta \) and \(\rho \) distributions are shown for jets with \(p_{\mathrm {T}}\) \(\in \) [70, 100] \(\,\text {GeV}\)

The jet energy resolution, \(\mathrm {s}\), is estimated as half the difference between the 75% (\(q_{75}\)) and 25% (\(q_{25}\)) quantiles of the target distribution. To quantify the resolution improvement, we compared the relative jet energy resolution, \(\overline{\mathrm {s}}\), defined as:

$$\begin{aligned} \overline{\mathrm {s}} \equiv \frac{\mathrm {s}}{q_{40}} = \frac{q_{75} - q_{25}}{2}\frac{1}{q_{40}}, \end{aligned}$$
(6)

where the resolution \(\mathrm {s}\) is divided by \(q_{40}\), the most probable value estimated as the 40% quantile of the target distribution. The relative improvement in \(\overline{\mathrm {s}}\) for b jets from the various physics processes is between 12 and 15%, as can be seen from Table 1. Figure 3 shows the value of \(\overline{\mathrm {s}}\) obtained for b jets from the \(\hbox {t}\bar{\hbox {t}}\) test sample as a function of the generator-level jet \(p_{\mathrm {T}} ^{\text {gen}}\) (left), \(\eta \) (center), and \(\rho \) (right). The lower panels in Fig. 3 show the relative improvements resulting from the DNN energy correction. The observed behavior agrees with the expectation that the regression correction should optimize the jet energy resolution, whereas the baseline corrections aim for a flat response as a function of the generator-level jet \(p_{\mathrm {T}} ^{\text {gen}}\) and \(\eta \). For all physics processes considered, the per-jet relative resolution improvement is around 12–18% for \(p_{\mathrm {T}} <100\,\text {GeV} \), falling to around 5–9% for \(p_{\mathrm {T}} >200\,\text {GeV} \). This translates into increased sensitivity for analyses that make use of b jets in the final state. The improvement in the b jet energy resolution brought by the regression is similar for b jets with and without associated leptons, demonstrating that the algorithm corrects not only for the undetected neutrinos in semileptonic decays of b hadrons, but also for effects that may be present in hadronic decays. In addition, the regression was shown to improve the response of light-flavor jets by about 3%.
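For concreteness, the quantities of Eq. (6) could be computed from an array of per-jet target values as in this sketch (the Gaussian toy below is illustrative only):

```python
import numpy as np

def relative_resolution(y):
    """Eq. (6): half the interquartile range divided by the 40% quantile."""
    q25, q40, q75 = np.quantile(y, [0.25, 0.40, 0.75])
    return 0.5 * (q75 - q25) / q40

rng = np.random.default_rng(0)
y = rng.normal(1.0, 0.15, size=100_000)  # toy target distribution
print(relative_resolution(y))            # roughly 0.105 for this Gaussian toy
```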

Fig. 3 Relative jet energy resolution, \(\overline{\mathrm {s}}\), as a function of generator-level jet \(p_{\mathrm {T}} ^{\text {gen}}\) (left), \(\eta \) (center), and \(\rho \) (right) for b jets from \(\hbox {t}\bar{\hbox {t}}\) MC events. The average \(p_{\mathrm {T}}\) of these b jets is 80 \(\text {GeV}\). The \(\eta \) and \(\rho \) distributions are shown for jets with \(p_{\mathrm {T}}\) \(\in \) [70, 100] \(\text {GeV}\). The blue stars and red squares represent \(\overline{\mathrm {s}}\) before and after the DNN correction, respectively. The relative difference \(\varDelta \overline{\mathrm {s}}/\overline{\mathrm {s}} _{\text {baseline}}\) between the \(\overline{\mathrm {s}}\) values before and after the DNN corrections is shown in the lower panels

Table 1 Relative differences \(\varDelta \overline{\mathrm {s}}/\overline{\mathrm {s}} _\text {baseline}\) between the \(\overline{\mathrm {s}}\) values obtained before and after applying the DNN energy correction for b jets produced in the different physics processes indicated

Knowledge of jet energy resolution on a jet-by-jet basis can be exploited in analyses searching for resonant production of b jet pairs to increase their sensitivity. We have checked the correlation between the jet resolution \(\mathrm {s}\) and the value of the per-jet resolution estimator, \(\hat{\mathrm {s}}\), provided by the DNN:

$$\begin{aligned} \hat{\mathrm {s}} \equiv \frac{1}{2}({\hat{y}}_{75\%} - {\hat{y}}_{25\%}). \end{aligned}$$
(7)

To do this, the sample of b jets was split into several equally populated bins in \(\hat{\mathrm {s}}\). In each bin, the value of \(\mathrm {s}\) is computed as half the difference between the \(q_{75}\) and \(q_{25}\) quantiles of the target distribution, and compared to the average resolution estimator \(\langle \hat{\mathrm {s}}\rangle \). Figure 4 shows the correlation between \(\mathrm {s}\) and \(\langle \hat{\mathrm {s}}\rangle \) for the inclusive \(p_{\mathrm {T}}\) spectrum and for several bins in \(p_{\mathrm {T}}\). A linear dependence with slope near unity confirms that the per-jet energy resolution estimator \(\hat{\mathrm {s}}\) correctly represents the jet resolution. We observe that deviations from the linear behavior are roughly within 20% of the \(\hat{\mathrm {s}}\) value.
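This check could be sketched as follows, assuming NumPy arrays y (per-jet target values) and s_hat (per-jet DNN resolution estimates); the helper name is hypothetical:

```python
import numpy as np

def calibration_points(y, s_hat, n_bins=10):
    """Split jets into equally populated bins of s_hat and, in each bin,
    compare the mean estimator <s_hat> with the measured s = (q75 - q25)/2."""
    order = np.argsort(s_hat)
    points = []
    for idx in np.array_split(order, n_bins):
        q25, q75 = np.quantile(y[idx], [0.25, 0.75])
        points.append((s_hat[idx].mean(), 0.5 * (q75 - q25)))
    return points
```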

Fig. 4 Correlation between the jet energy resolution \(\mathrm {s}\) and the average jet energy resolution estimator \(\langle \hat{\mathrm {s}}\rangle \) for b jets from \(\hbox {t}\bar{\hbox {t}}\) MC events. The blue circles correspond to the inclusive \(p_{\mathrm {T}}\) spectrum, while the blue band represents 20% up and down variations of the fitted \(\langle \hat{\mathrm {s}}\rangle \) trend for the inclusive \(p_{\mathrm {T}}\) spectrum. The red stars correspond to jets with \(p_{\mathrm {T}}\) \(\in \) [30, 50] \(\,\text {GeV}\), orange diamonds to \(p_{\mathrm {T}}\) \(\in \) [50, 70] \(\,\text {GeV}\), and green crosses to \(p_{\mathrm {T}}\) \(\in \) [110, 120] \(\,\text {GeV}\)

While the improvements described above are quoted at the single-jet level, many physics analyses use the invariant mass of the two-b-jet system as a discriminating variable for signal extraction. The improvement in the resolution of the dijet invariant mass is generally larger than that for a single jet, because the energy corrections effectively equalize the energy scale of the two jets, while also improving the jet resolution. To estimate the dijet resolution improvement, events with two leptons and two jets were selected from the \(\hbox {Z}{}{} (\rightarrow \ell ^+\ell ^-)\hbox {H}{}{} (\rightarrow \hbox {b}\bar{\hbox {b}})\) sample: jets were required to have \(p_{\mathrm {T}} \) larger than 20 \(\text {GeV}\), absolute value of \(\eta \) below 2.4, and to be compatible with the hadronization of b quarks, referred to as “b-tagged” [37] jets in the following. The selection criteria for the b-tagged jets correspond to a 70% b jet tagging efficiency with a 1% misidentification rate for light-flavor or gluon jets. Leptons were required to have \(p_{\mathrm {T}}\) larger than 20 \(\text {GeV}\), while the lepton pairs were required to be compatible with the decay of a Z boson by requiring their invariant mass to be within 20 \(\text {GeV}\) of the Z boson mass. The Z boson was required to have a transverse momentum larger than 150 \(\text {GeV}\). An improvement of about 20% in the dijet invariant mass resolution in the \(\hbox {Z}{}{} (\rightarrow \ell ^+\ell ^-)\hbox {H}{}{} (\rightarrow \hbox {b}\bar{\hbox {b}})\) sample can be observed in Fig. 5. A Bukin function [44] was used to fit the core of each distribution in Fig. 5, in the range [75, 165] \(\text {GeV}\) for the baseline distribution and [81, 160] \(\text {GeV}\) for the DNN-corrected one.

Fig. 5 Dijet invariant mass distributions for simulated samples of \(\hbox {Z}{}{} (\rightarrow \ell ^+\ell ^-)\hbox {H}{}{} (\rightarrow \hbox {b}\bar{\hbox {b}})\) events, where two jets and two leptons were selected. Distributions are shown before (dotted blue) and after (solid red) applying the b jet energy corrections. A Bukin function [44] was used to fit each distribution. The fitted mean and width of the core of each distribution are displayed in the figure

In addition, a dedicated study was performed to test how well the algorithm performance can be transferred from Monte Carlo simulations to the domain of pp collision data. A set of Z boson candidates decaying to a pair of charged leptons was extracted from pp collisions recorded by the CMS experiment in 2017. A standard set of requirements [28, 45] was applied to select events with electron or muon pairs compatible with having originated from the decay of a Z boson. Events were further required to have at least one b-tagged jet. The jet with the largest \(p_{\mathrm {T}} \) was required to have \(|\eta | < 2\), while the \(p_{\mathrm {T}}\) of the dilepton system was required to be larger than 100 \(\text {GeV}\). The \(p_{\mathrm {T}}\) balance between the Z boson and the b-tagged jet candidate was enforced by requiring that extra jets have a \(p_{\mathrm {T}}\) less than 30% of the Z \(p_{\mathrm {T}}\) to suppress events with additional hadronic activity. Events satisfying these requirements were used to evaluate the agreement between data and MC simulations. In addition, the resolution of the jets was measured by extrapolating to zero additional hadronic activity following the methodology described in Ref. [28].

Figure 6 shows the ratio between the \(p_{\mathrm {T}}\) of the leading jet and that of the dilepton system for events in which the \(p_{\mathrm {T}}\) of the subleading jet is less than 15 \(\text {GeV}\). The upper and lower panels show the distributions obtained before and after applying the DNN-based corrections, respectively. The effect of the corrections is to reduce the width of the distribution. Using the method detailed in Ref. [28], the double ratio of the relative jet resolution \(\overline{\mathrm {s}}\) measured in data and in simulated events was found to be \(1.1 \pm 0.1\) both before and after applying the DNN-based corrections. This validates that the resolution improvement achieved in simulated events is successfully transferred to the data domain.

Fig. 6 Distribution of the ratio between the transverse momentum of the leading b-tagged jet and that of the dilepton system from the decay of the Z boson. Distributions are shown before (upper) and after (lower) applying the b jet energy corrections. The \(\overline{\mathrm {s}}\) values of the core distributions are included in the figures. The black points and histogram show the distributions for data and simulated events, respectively.

Summary

We have described an algorithm that makes it possible to obtain point and dispersion estimates of the energy of jets arising from b quarks in proton–proton collisions. We trained a deep, feed-forward neural network on a sample of simulated b jets arising from the decays of top quark–antiquark pairs, with inputs based on jet composition and shape information and on the properties of the associated reconstructed secondary vertex. The neural network simultaneously provides robust estimators of the mean and of the 25 and 75% quantiles of the energy of a b jet. The mean estimator is based on the Huber loss function and is used as an energy correction, while the 25 and 75% quantile estimators are used to build a jet-by-jet resolution estimator, defined as half the difference between these quantiles.

The DNN-based algorithm leverages the information contained in a large training data set of nearly 100 million simulated b jets, and improves the resolution of the b jet energy by 12–15% relative to that obtained with the baseline corrections. An improvement of about 20% is observed in the resolution of the invariant mass of b jet pairs resulting from the decay of a Higgs boson produced in association with a Z boson. The resolution estimator is further shown to predict the resolution of b jets with an accuracy of 20% over a \(p_{\mathrm {T}}\) range between 30 and 350 \(\text {GeV}\). Events containing a dilepton decay of a Z boson produced in association with a b jet are used to validate the performance of the algorithm on proton–proton collision data recorded with the CMS detector. The jet energy resolution improvement observed in data is consistent with that found in simulation.

The results described here are being used by the CMS Collaboration in several physics analyses targeting the final states containing b jets, including the observation of the Higgs boson decay to \(\hbox {b}\bar{\hbox {b}}\) [13].