Boosting $H\to b\bar b$ with Machine Learning

High $p_T$ Higgs production at hadron colliders provides a direct probe of the internal structure of the $gg \to H$ loop, with the $H \to b\bar{b}$ decay offering the most statistics due to its large branching ratio. Despite the overwhelming QCD background, recent advances in jet substructure have put the observation of the $gg\to H \to b\bar{b}$ channel at the LHC within the realm of possibility. In order to enhance the sensitivity to this process, we develop a two-stream convolutional neural network, with one stream acting on jet information and one using global event properties. The neural network significantly increases the discovery potential of a Higgs signal, both for high $p_T$ Standard Model production and for possible beyond the Standard Model contributions. Unlike most studies of boosted hadronically decaying massive particles, the boosted Higgs search is unique because double $b$-tagging rejects nearly all background processes that do not have two hard prongs. In this context, which goes beyond state-of-the-art two-prong tagging, the network is studied to identify the origin of the additional information leading to the increased significance. The procedures described here are also applicable to related final states, where they can be used to identify additional sources of discrimination power that are not being exploited by current techniques.


Introduction
Even though it has been over five years since the discovery of the Higgs boson [1,2], the final state with the largest branching ratio (B(H → bb) ≈ 58% [3]) has not been probed with great precision. This process is difficult to measure due to large and nearly irreducible background processes; only recently has (V)H → bb been confirmed [4][5][6]. However, many interesting and largely untested features of high-p_T Higgs production [7][8][9][10][11] are challenging to probe with cleaner final states such as H → γγ or H → ZZ* → 4ℓ due to their low branching ratios. Boosted H → bb provides access to the highest-p_T Higgs bosons at the Large Hadron Collider (LHC); if they can be measured with good precision, a door leading beyond the Standard Model (BSM) could be opened.
Major advances in the use of jet substructure and machine learning techniques have revolutionized the ability to look for hadronic signals in increasingly extreme regions of phase space. Most analyses that exploit the hadronic decays of boosted heavy particles have so far used modern tools to purify the event selection but not to directly identify the main objects of interest. However, pioneering work by the ATLAS and CMS collaborations has used these techniques to directly measure boosted particle cross-sections [16][17][18][19] and to search directly for BSM particles [20][21][22]. In particular, the CMS collaboration has used single jets to search for the boosted H → bb decay [19], the first experimental result on the subject since the idea was originally proposed in Ref. [23] (albeit exclusively in the VH channel); this analysis has also been combined with other differential low-p_T data [24]. One reason for the long delay between conception and practical results was the need to develop advanced techniques for grooming [25][26][27], 2-prong tagging [28][29][30][31], jet four-vector calibrations [32,33], and boosted b-tagging [34][35][36][37][38][39][40][41].
The presence of multiple nearby boosted b quarks sets boosted Higgs identification apart from other boosted massive particle classification. This is because requiring two b-tagged subjets inside a larger jet necessarily requires that the parent jet has a two-prong structure. For boosted massive object identification, most of the jet substructure community has focused on n-prong taggers [28][29][30][31][42][43][44][45][46][47][48][49], which are not optimized for cases where n prongs are already present. By probing the full radiation pattern inside boosted boson decays, Ref. [50] showed that there is information beyond traditional n-prong tagging and even beyond traditional color flow observables [51]. This was also explored in the context of boosted Higgs boson decays in Ref. [49], which identified simple observables that capture the additional information. However, neither these studies nor the more recent Ref. [52], which considered generic quark and gluon jets as a background to H → bb, were explicitly predicated on subjet tagging as a baseline, and none of them probed global information beyond jet substructure. Until the present work, the full potential of information beyond n-prong taggers has not been demonstrated for concrete observables such as cross-sections or BSM coupling limits (see footnote 3).
Modern machine learning (ML) tools have shown great promise for using low-level [39,46,49,50,52, and high-level [87][88][89][90] information to classify hadronic final states at the LHC. These techniques must be adapted to cope with significant sparsity, large dynamic ranges, multi-channel inputs, and data that have no unique representation. Similar techniques have been demonstrated for full event classification with low-level [63,82] and high-level inputs [91][92][93].
In addition to the challenges related to the structure of the data, one of the key challenges for applying state-of-the-art techniques in practice is the need for a background estimation method. As in Ref. [77], boosted H → bb has a natural background estimation technique that uses the localization of the Higgs boson in the jet mass distribution. For this reason, the algorithms presented here may already be useful to enhance existing analysis efforts.
Footnote 3: The latest CMS bb and cc tagging techniques use machine learning approaches with a large number of particle- and vertex-level inputs [53]. These approaches could learn information beyond n-prong tagging. However, the background source for training is generic quark and gluon jets, not g → bb. The working point used in the boosted H search operates at the 1% mis-tag rate, while the rate of g → bb is comparable to or lower than this value (see e.g., Ref. [54]), so most of the tagger's effort must go to reducing the large non-g → bb background. Performance studies specifically with g → bb as the background show that the tagger reduces the g → bb background 3× more than the signal [55]. The equivalent performance shown later in this paper (Fig. 3) corresponds to reducing g → bb about 16× more than the signal. These numbers are not directly comparable because the latter is also after mass and two-prong tagging requirements (and is thus conservative). Therefore, the techniques presented in this paper are using more information, but further studies are required to understand how much more and what type of information is being used.
In this paper we use deep neural networks to examine the potential of using all the available information in boosted Higgs events. We use a two-stream convolutional neural network to combine jet substructure information with global event information and find significant gains from both components, demonstrating that the search can be greatly improved. Furthermore, we are able to identify the dominant source of jet substructure discrimination in terms of a simple observable.
This paper is organized as follows. Section 2 details the ML setup, including the preprocessing and architecture of the neural network. The neural network is then applied to the SM search for boosted H → bb in Section 3. Physics beyond the SM could introduce p_T-dependent effects that are enhanced for boosted Higgs bosons. Implications of the NN classifier for BSM physics are described in Section 4. The paper concludes in Section 5.

Machine Learning Architecture
This section describes our machine learning setup, with a focus on the neural network architecture and preprocessing.

Neural Network Architecture
Our neural network architecture is driven both by physics goals and by the desire to extract the maximal amount of information from the event. For the boosted H → bb topology, there are two physically distinct components to the events: the substructure of the hardest jet and the global event structure. Due to the color-singlet scalar nature of the Higgs, the radiation pattern within and around the bb jet is expected to differ from that of g → bb jets. Different production mechanisms can also result in different numbers and orientations of jets in the events. All of these aspects are investigated.
To incorporate both local and global information, a two-stream neural network is constructed. One stream acts on the full event information and the other acts on the image of the Higgs candidate jet. The two streams are then combined. This setup can be used to assess how much discrimination power can be obtained from the substructure and the global event separately, as well as in combination. A schematic of this two-stream architecture is shown in Fig. 1.
In order to account for the compact (periodic) nature of the detector in the φ direction, we use padding layers that take the leftmost few columns and append them to the right before each convolution (for all convolutional layers), effectively performing convolutions over a cylinder rather than over a square. Further details related to the image (pre)processing are discussed in Section 2.2.
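The φ wrapping described above can be sketched in a few lines of NumPy. This is only an illustration and not the implementation used for the results: the function name `pad_phi_periodic`, the choice of axis 1 as the φ direction, and the pad width of 4 columns (the amount a 5 × 5 filter needs to leave the φ width unchanged) are assumptions.

```python
import numpy as np

def pad_phi_periodic(image, n_cols=4, phi_axis=1):
    """Append the first n_cols slices along the phi axis to the far end.

    Hypothetical helper: a subsequent 'valid' convolution with a kernel of
    width n_cols + 1 then behaves as if the image were wrapped on a cylinder.
    """
    leading = np.take(image, np.arange(n_cols), axis=phi_axis)
    return np.concatenate([image, leading], axis=phi_axis)

# Toy 40 x 40 image with three channels (eta, phi, channel).
image = np.arange(40 * 40 * 3, dtype=float).reshape(40, 40, 3)
padded = pad_phi_periodic(image)
```

With these assumed settings the padded image has 44 columns in φ, so a 5-pixel-wide filter applied without further padding returns 40 columns again.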
The details of the convolution and pooling layers of each stream are as follows. Each convolutional filter is 5 × 5 and each pooling layer is 2 × 2, with rectified linear unit (ReLU) activations and a stride length of 1. The first convolutional layer in each stream has 32 filters, and the second convolutional layer in each stream has 64 filters. The dense layer at the end of each stream has 300 neurons. Finally, the dense layers from the two streams are fully connected to an output layer of one neuron with sigmoid activation. In total this gives 2.6 million trainable parameters in the network. We used the AdaDelta optimizer [94] with binary cross entropy as our loss function, and used the relatively simple Early Stopping method as a regularization technique, halting training when the significance improvement of the Higgs measurement at p_T^min = 450 GeV no longer increased (with a patience of 2 epochs). We arrived at this final model after testing the performance (measured by the significance improvement of the Higgs measurement at p_T^min = 450 GeV) using different optimizers (AdaDelta [94], AdaGrad [95], Adam [96]), different activation functions (mainly testing ReLU against leaky ReLU), and different regularization methods (dropout [97] vs. Early Stopping). Our training was performed using the Keras [98] Python neural network library with the TensorFlow [99] backend, on Nvidia GeForce 1080 Ti GPUs.
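The stopping rule with a patience of 2 epochs can be illustrated without any deep learning library. The sketch below assumes the monitored quantity is a per-epoch validation value of the significance improvement (larger is better); the function name and the example history are hypothetical and are not taken from the training used in this paper.

```python
def early_stopping_epoch(metric_per_epoch, patience=2):
    """Return (stop_epoch, best_epoch) for a metric that should increase.

    Training stops at the first epoch that lies `patience` epochs past the
    best value seen so far; otherwise it runs to the last epoch.
    """
    best_value = float("-inf")
    best_epoch = -1
    for epoch, value in enumerate(metric_per_epoch):
        if value > best_value:
            best_value, best_epoch = value, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(metric_per_epoch) - 1, best_epoch

# Hypothetical significance improvements on the validation set, one per epoch.
history = [1.2, 1.6, 1.9, 2.1, 2.0, 2.05, 2.3]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=2)
```

In this toy history the best value before stopping is reached at epoch 3 and training halts at epoch 5, so the later value of 2.3 is never seen. This is the usual trade-off of a short patience.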

Inputs and Preprocessing
The inputs to our neural network are jet images [56]. For each event, an image is created for each stream: one image is the full event image and the other is the image of the hardest jet (that has been double b-tagged). Both images are 40 × 40 pixels. For the jet image, the range (in η-φ space) is 2R × 2R, where R = 0.8 is the radius of the jet. The full event image covers effectively the entire η-φ cylinder (|η| < 5). Inspired by Ref. [60], both the jet and event images have three channels analogous to the RGB channels of a color image. The pixel intensities for the three channels correspond to the sum of the charged particle p_T, the sum of the neutral particle p_T, and the number of charged particles. As the neutral particle p_T is particularly sensitive to pileup, additional studies without this channel are included in the results.
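A minimal sketch of how such a three-channel image could be filled from a list of particles is given below. It assumes particles are provided as arrays of η, φ, p_T and a charged flag; the function name `make_image` and the toy particles are hypothetical, and the jet image would use coordinates relative to the jet axis with a range of (−R, R) in each direction.

```python
import numpy as np

def make_image(eta, phi, pt, is_charged, eta_range, phi_range, n_pix=40):
    """Return an (n_pix, n_pix, 3) image: charged pT, neutral pT, charged count."""
    eta, phi, pt = map(np.asarray, (eta, phi, pt))
    charged = np.asarray(is_charged, dtype=bool)
    bins = [np.linspace(*eta_range, n_pix + 1),
            np.linspace(*phi_range, n_pix + 1)]

    def hist(mask, weights):
        # Sum the weights (or count particles if weights is None) per pixel.
        h, _, _ = np.histogram2d(eta[mask], phi[mask], bins=bins, weights=weights)
        return h

    return np.stack([
        hist(charged, pt[charged]),     # channel 0: sum of charged pT
        hist(~charged, pt[~charged]),   # channel 1: sum of neutral pT
        hist(charged, None),            # channel 2: number of charged particles
    ], axis=-1)

# Toy event: two particles in the same pixel and one elsewhere.
eta = [0.1, 0.1, -2.0]
phi = [0.2, 0.2, 1.0]
pt = [10.0, 5.0, 20.0]
is_charged = [True, False, True]
event_image = make_image(eta, phi, pt, is_charged, (-5.0, 5.0), (-np.pi, np.pi))
```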
To ensure that the neural network is not learning spacetime symmetries, and to reduce the size of the input streams, the jet images are preprocessed in a similar way to previous studies, see e.g., Refs. [60,100]. In particular, all of the images are normalized (sum of intensities is unity) and standardized (zero-centered and divided by standard deviations). Prior to these steps, the jet images are also rotated so that the two subjets are aligned along the same axis in every image [50,56]. Details about the subjet identification and b-tagging are discussed in Section 3.1.
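The normalization and standardization steps can be sketched as follows. Two details are assumptions of this sketch rather than statements from the text: the normalization is applied per image and per channel, and the standardization is applied per pixel over the whole sample, with a small `eps` guarding against pixels that are empty in every image.

```python
import numpy as np

def normalize(images):
    """Scale each image (and channel) so that its pixel intensities sum to one."""
    totals = images.sum(axis=(1, 2), keepdims=True)
    return images / np.where(totals > 0, totals, 1.0)

def standardize(images, eps=1e-6):
    """Zero-center each pixel over the sample and divide by its standard deviation."""
    mean = images.mean(axis=0, keepdims=True)
    std = images.std(axis=0, keepdims=True)
    return (images - mean) / (std + eps)

# Toy sample of 8 images with shape (N, eta pixels, phi pixels, channels).
rng = np.random.default_rng(0)
images = rng.random((8, 40, 40, 3))
normed = normalize(images)
standardized = standardize(normed)
```

In practice the mean and standard deviation would be computed on the training sample and reused for the validation and test samples.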

Boosting Standard Model Higgs Tagging
This section studies the neural network performance in the context of improving the significance for the Standard Model boosted H → bb search.

Simulation Setup and Validation
Simulated pp collisions at √s = 13 TeV are generated using MadGraph5_aMC@NLO 2.6.2 [101] for the hard processes and showered with Pythia 8.226 [102]. Background events are generated using two-, three-, and four-jet events (pp > jj, pp > jjj and pp > jjjj) matched using the MLM approach [103]. In order to include finite top mass effects (and BSM contributions in Section 4), signal events are generated at one-loop order (pp > Hj [QCD] and pp > Hjj [QCD]), which in this case corresponds to the leading contribution. The overlap between the real emission from the matrix element and the parton shower is also accounted for using the MLM algorithm. Higher-order amplitudes for the signal process with full mass dependence are now becoming available [104][105][106][107][108][109][110]. These updates could slightly modify the numerical results but should not change the conclusions: they would primarily affect the overall rate and not the features exploited by the machine learning approach, which are primarily associated with the radiation pattern in the jet and the global event. Furthermore, in these studies, higher-loop finite top mass effects are found to be flat at high p_T and therefore do not significantly modify the shape of the p_T spectrum. Events are clustered and analyzed using FastJet 3.2.1 [111] and the FastJet contrib extensions. Following a CMS-like analysis [19], jets are clustered with the anti-k_t algorithm [112] with R = 0.8 and groomed with the soft drop algorithm [27] with β = 0 and z_cut = 0.1. Candidate Higgs jets are required to have transverse momentum p_T > 450 GeV and to satisfy a double b-tag. In general, b-tagging performance depends heavily on the exact experimental implementation and is detector specific. In this analysis, we use an approach similar to subjet b-tagging in ATLAS [113], which can be mimicked at particle level while assuming 100% b-tagging efficiency and perfect background rejection.
This introduces an O(1) correction to the cross-section, but does not qualitatively change the conclusions. The subjets of the large-R jets are ghost-associated [114] R = 0.2 anti-k_t jets. Such jets are declared b-tagged if they have a ghost-associated B hadron with p_T > 5 GeV. In addition to b-tagging, the leading double b-tagged jet (the Higgs candidate) is required to have −6.0 < ρ < −2.1, where ρ = log(m_SD² / p_T²). This range is chosen following Ref. [19] to avoid the deeply nonperturbative region as well as finite cone limitations in the jet clustering, although no re-optimization of this range was performed. Finally, the two-prong observable N_2 [29] is required to be ≤ 0.4. There is little dependence on the exact N_2 requirement, likely in part because of the two (b-tagged) subjet requirement. Figure 2 shows the m_SD distribution after applying the above selections. The overall rate, relative rates between processes, and general trends agree with the CMS analysis in Ref. [19].
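As a worked example of the kinematic part of this selection, the sketch below evaluates ρ and applies the p_T, ρ and N_2 requirements quoted above. It assumes a natural logarithm in the definition of ρ and that the soft drop mass and p_T are given in the same units; the function names are hypothetical and the double b-tag is not modeled.

```python
import math

def rho(m_sd, pt):
    """rho = log(m_SD^2 / pT^2); returns -inf for a massless groomed jet."""
    if m_sd <= 0:
        return float("-inf")
    return math.log(m_sd**2 / pt**2)

def passes_baseline(m_sd, pt, n2, pt_min=450.0, rho_min=-6.0, rho_max=-2.1, n2_max=0.4):
    """Kinematic part of the Higgs-candidate selection (b-tagging not included)."""
    return pt > pt_min and rho_min < rho(m_sd, pt) < rho_max and n2 <= n2_max

# A 125 GeV groomed jet at pT = 500 GeV has rho = 2 * log(0.25), inside the window.
example_rho = rho(125.0, 500.0)
example_pass = passes_baseline(125.0, 500.0, n2=0.3)
```

At fixed mass the upper edge of the ρ window acts as a lower bound on p_T, which is why a 200 GeV jet at p_T = 500 GeV fails the requirement while a 125 GeV jet passes.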
Since the goal of this paper is to emphasize the possible gains for this search using ML, we have made a number of simplifying assumptions, and therefore the exact reproduction of the CMS analysis is not our primary concern. We believe that none of these assumptions significantly change our quantitative conclusions, but they should be revisited with the full analyses in ATLAS and CMS. In particular, the tt background is ignored, the background fit is simplified, experimental effects relating to track reconstruction and b-tagging are ignored, and, as mentioned above, the Higgs cross-section is only computed at leading (one-loop) order. The top background is small but comparable to the Higgs signal, and since we have consistently ignored it for the pseudo-data and background, any residual contribution is a subleading effect from modeling uncertainty. Tracks reconstructed by ATLAS and CMS are excellent proxies for charged particles, though there are percent-level differences resulting from material interactions and pattern recognition ambiguities. These effects, as well as pileup, can slightly degrade b-tagging performance [39,84]. Once again, this is important to account for when setting a precise limit, but would not change the relative gains presented here.

Machine Learning Results
Having validated the setup specified in Section 3.1 against the public CMS results [19], the simulated events are now used as input to our two stream convolutional neural network to identify whether additional discrimination power can be obtained from the jet substructure, jet superstructure, and other global event properties.
Network training proceeds with 50 000 signal and background events passing the selection criteria from Section 3.1. The training-validation-test split that we used was 50 %-25 %-25 %. There is no requirement on the jet mass, as the entire spectrum is used to evaluate the significance. In practice, this could make traditional data-driven background estimation techniques more complex to use, though there have also been many techniques proposed to preserve the mass distribution [115][116][117][118].
The neural network performance is quantified using the significance improvement characteristic (SIC) curve. Such a curve is approximately equal to ε_s/√ε_b, where ε_s and ε_b are the signal and background efficiencies of the network selection, and quantifies the gain in significance over the baseline selection. Following a CMS-like analysis [19], the full significance is calculated using a binned likelihood fit treating the bin counts as Poisson-distributed random variables. This procedure assumes that the results are dominated by statistical uncertainties, which will always be true for the highest p_T bins. Data statistical uncertainties account for over half of the total uncertainty in Ref. [19], so this is a valid approximation. There is no fit to determine the background shape, which is taken directly from the simulation. Once again, this is valid in the statistics-limited regime.
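The approximate form of the SIC curve can be computed directly from network outputs. The sketch below assumes the usual definition ε_s/√ε_b, with events selected by requiring the network score to exceed a threshold; the function name and the toy scores are hypothetical. The results in this paper use the full binned likelihood significance rather than this approximation.

```python
import numpy as np

def sic_curve(signal_scores, background_scores, thresholds):
    """Return (eps_s, eps_b, sic) for selections of the form score > threshold."""
    signal_scores = np.asarray(signal_scores)
    background_scores = np.asarray(background_scores)
    eps_s = np.array([(signal_scores > t).mean() for t in thresholds])
    eps_b = np.array([(background_scores > t).mean() for t in thresholds])
    sic = np.full_like(eps_s, np.nan)   # undefined where no background survives
    ok = eps_b > 0
    sic[ok] = eps_s[ok] / np.sqrt(eps_b[ok])
    return eps_s, eps_b, sic

# Toy scores: a cut at 0.5 keeps 3/4 of the signal and 1/4 of the background.
eps_s, eps_b, sic = sic_curve(
    signal_scores=[0.9, 0.8, 0.7, 0.2],
    background_scores=[0.6, 0.3, 0.2, 0.1],
    thresholds=[0.0, 0.5],
)
```

A value of 1 corresponds to no change relative to the baseline selection, so the second threshold in this toy example improves the approximate significance by a factor of 1.5.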
The binned likelihood fit is performed in the mass range from 50 GeV to 197 GeV using bins of width 7 GeV. (The CMS analysis performs the same fit in the range from 40 GeV to 201 GeV with the same binning [19].) The corresponding SIC curve is shown in Fig. 3. A maximum significance gain of about 2.2 is achieved at a signal efficiency of about 25%. This means that if the significance with the nominal selection was 1 for a given dataset size, then after the application of the neural network the new significance would be 2.2. The maximum significance gain from the event stream only is about 1.4, while the same value for the jet stream only is about 2. This indicates that the jet information is much more important than the global information, though a significance gain of 1.4 is still important. Since pileup is not included in the simulation, it is important to show that the performance is similar when pileup-sensitive inputs are removed. The "no neutral layer" curve in Fig. 3 shows that the peak performance is robust and even better than the full network at high significance. Intuitively, a network with more information should not be able to do worse, though in practice this could occur due to weight sharing or from too few training examples. For reference, the β_3 observable proposed in Ref. [49], defined in terms of the n-subjettiness observables τ_n^(j) [28,42] with angular exponent j, is also shown in Fig. 3. This single observable captures a significant fraction of the total significance improvement, but there is still more information available from the full two-stream setup to boost the significance further. A further investigation into the information learned by the network is described in Section 3.3.1.
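For a statistics-only fit with a fixed background shape, the expected significance of independent Poisson bins can be estimated in closed form. The sketch below uses the standard Asimov approximation, Z = √(2 Σ_i [(s_i + b_i) ln(1 + s_i/b_i) − s_i]), which is an assumption of this illustration and not necessarily the exact procedure behind the quoted numbers. The binning follows the mass range and bin width stated above; the function name is hypothetical.

```python
import numpy as np

def asimov_significance(signal, background):
    """Expected discovery significance for independent Poisson bins (no systematics)."""
    s = np.asarray(signal, dtype=float)
    b = np.asarray(background, dtype=float)
    q0 = 2.0 * np.sum((s + b) * np.log1p(s / b) - s)
    return np.sqrt(q0)

# Bin edges for the fit range quoted in the text: 50, 57, ..., 197 GeV.
mass_edges = np.arange(50.0, 197.0 + 7.0, 7.0)
n_bins = len(mass_edges) - 1

# For s << b the result approaches s / sqrt(b) in each bin.
z_small_signal = asimov_significance([1.0], [10000.0])
```

Independent bins add in quadrature, so concentrating the signal in a few mass bins with low background is what drives the significance.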
To understand the impact of a gain of 2.2 in the SIC, the expected significance for the SM H → bb search is plotted as a function of integrated luminosity in Fig. 4. A center-of-mass energy of √s = 13 TeV is assumed through the end of LHC Run 3, which corresponds to about 300 fb⁻¹. The curves follow the statistical scaling of √(∫L dt), where L is the instantaneous luminosity. The current CMS result reported an observed (expected) significance of 1.5 (0.7) [19]. As anticipated from the agreement with the mass distribution (Fig. 2), the significance calculated using the simulation reported in Section 3.1 is very similar at 1.227. Without machine learning, "evidence" (3σ) may only be achieved after the full LHC dataset (up to 2023) and "observation" (5σ) may be possible only with the HL-LHC. In contrast, with the application of the neural network, evidence may be achievable with the full Run 2 (2015-2018) dataset (about 150 fb⁻¹) and observation may be possible well before the end of the LHC. This represents one of the main results of this paper, and emphasizes the possible gains to be had with ML.
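The luminosity scaling can be made concrete with a short calculation. The baseline significance of 1.227 and the gain of 2.2 are taken from the text, while the reference luminosity of 35.9 fb⁻¹ is an assumption of this sketch (the dataset size to which the baseline significance corresponds is not stated explicitly here). The function names are hypothetical.

```python
import math

def significance_at(lumi, z_ref, lumi_ref):
    """Statistics-only scaling: Z grows like the square root of the luminosity."""
    return z_ref * math.sqrt(lumi / lumi_ref)

def lumi_for(z_target, z_ref, lumi_ref):
    """Integrated luminosity needed to reach z_target under the same scaling."""
    return lumi_ref * (z_target / z_ref) ** 2

Z_REF = 1.227     # baseline significance quoted in the text
LUMI_REF = 35.9   # fb^-1; ASSUMED reference dataset size for Z_REF
GAIN = 2.2        # maximum significance gain quoted in the text

lumi_baseline_3sigma = lumi_for(3.0, Z_REF, LUMI_REF)
lumi_ml_3sigma = lumi_for(3.0, GAIN * Z_REF, LUMI_REF)
```

Under this scaling a gain of 2.2 in significance reduces the luminosity needed for any fixed target by a factor of 2.2² ≈ 4.8, independent of the assumed reference luminosity.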

What is the Neural Network Learning?
With a significant improvement from the neural network, it is interesting to investigate in more detail what information the machine is exploiting beyond the existing search. This section follows some of the procedures for such a study described in Ref. [50].
First, Fig. 5 shows the (first layer) convolutional filters from both streams of the network. Since both streams are actually three-channel images, there are three sets of filters for each case. While it is difficult to immediately recognize what the network is learning from these filters, there are some hints upon careful inspection. In particular, the event image filters have a small number of "hot spots." This may indicate that the network is learning to compute distances between prongs within jets and between jets. In contrast, the jet image filters have many active pixels with complex shapes. These filters are too small to span the typical subjet distance and so may be identifying the pattern of radiation between or around subjets. The following sections examine the two streams of the network in more detail.
Figure 5. The 32 filters from the first layer of the total event CNN in (a) and the jet substructure CNN in (b). The top row of filters corresponds to the charged p_T layer, the second row shows the neutral p_T layer, and the bottom row is for the charged-particle multiplicity channel.
... minimal gains, and the primary difference between the two decays is their color flow, shown in Fig. 6, with the Higgs being a color singlet and the gluon a color octet. The gluon radiates much more widely away from the dipole, as is clearly seen in the jet images in Fig. 5. Having identified from the neural network that significant discrimination power can be extracted from the jet, and building on the intuition from the jet images and our physical understanding of the decay channels that this information should be contained in the color flow, we now show that this additional discrimination power can largely be extracted using a simple observable that identifies the color flow. A number of observables exist to probe the color flow within a jet. Here we consider the recently introduced observable β_3, where τ_n^(j) is the n-subjettiness observable [37,38] with angular exponent j defined with the winner-takes-all axes [68].
In Fig. 7 we show an SIC curve comparing the performance of the β_3 observable with the full neural network architecture. The full neural network sets an upper bound on the achievable discrimination power, and we find that the majority of the improved discrimination power identified by the neural network is reproduced by the simple β_3 observable. This is promising for immediate application to LHC searches. It also supports our intuition that the dominant remaining information lies in the color flow. Since much effort has been given to two-prong tagging, and relatively limited attention has been paid to the study of color flow, we believe that variables such as β_3 may be more widely applicable to improving jet substructure searches.
Color flow for H → bb and g → bb, the main irreducible QCD background to our signal. The numbers 1 and 2 label different color lines.

Jet Substructure
As emphasized earlier, the H → bb search is different from other boosted hadronically decaying massive boson studies because the application of double b-tagging already enforces a two-prong topology. Therefore, two-prong tagging is not as useful. Studies to further optimize the event selection with N_2 confirm this expectation: little significance gain is possible using only this state-of-the-art two-prong tagging technique (see also Ref. [37]). One of the attractive features of jet images is that they can be directly inspected to visualize the information content. For example, Fig. 6 shows the average of the 100 most signal-like and most background-like jets, according to the neural network. The two-prong structure of both signal and background is clear in all three channels. The main difference between g → bb and H → bb is the orientation of the radiation between and around the two prongs. As expected due to the different color structure, the radiation pattern around the two prongs is more spread out for the gluon case. Figure 7 shows additional images that are split by their value of β_3. It is clear from the images that low β_3 values (background-like) pick out subjets with broader radiation patterns compared with high β_3 (signal-like) images. However, the top plot of Fig. 7 clearly indicates that β_3 is not the same as the neural network, so there is additional information to learn. Figure 8 tries to visualize the additional information. The distribution of β_3 in the signal is reweighted to be the same as the background so that β_3 by itself is not useful for discrimination. The average images for signal and background look very similar by eye, but the difference of the average images reveals interesting structure. These structures still show an enhanced radiation pattern around the subjets for the background relative to the signal; there is thus more color flow information available to learn than is captured by β_3 alone.
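The reweighting step can be sketched with a simple histogram ratio. This is a minimal illustration under stated assumptions: the weights are derived from binned densities with a hypothetical choice of 20 equal-width bins, and the toy Gaussian samples merely stand in for the β_3 distributions of signal and background.

```python
import numpy as np

def reweight_to_background(sig_values, bkg_values, bins=20):
    """Per-event signal weights that make the signal histogram match the background."""
    sig_values = np.asarray(sig_values, dtype=float)
    bkg_values = np.asarray(bkg_values, dtype=float)
    lo = min(sig_values.min(), bkg_values.min())
    hi = max(sig_values.max(), bkg_values.max())
    edges = np.linspace(lo, hi, bins + 1)
    sig_hist, _ = np.histogram(sig_values, bins=edges, density=True)
    bkg_hist, _ = np.histogram(bkg_values, bins=edges, density=True)
    # Ratio of densities per bin; bins without signal get weight zero.
    ratio = np.divide(bkg_hist, sig_hist, out=np.zeros_like(bkg_hist), where=sig_hist > 0)
    idx = np.clip(np.digitize(sig_values, edges) - 1, 0, bins - 1)
    return ratio[idx]

# Toy stand-ins for the observable in signal and background.
rng = np.random.default_rng(2)
sig = rng.normal(0.6, 0.1, 5000)
bkg = rng.normal(0.4, 0.1, 5000)
weights = reweight_to_background(sig, bkg)
```

Averaging the signal jet images with these weights then gives a signal sample whose distribution of the observable matches the background, so any remaining difference between the average images comes from other information.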
Drilling down into the information content of the jet images in more detail, perhaps using more of the techniques from Ref. [50], and understanding to what extent β_3 captures color flow and other effects are of great interest for future studies.

Global Event
While much attention has been devoted to the extraction of information from jet substructure, less has been paid to the extraction of discrimination power from the full event. At the same time, probing what is learned from event properties is more complicated than for the jet image due to the reduced symmetry. As with the jet substructure, due to the color-singlet nature of the Higgs, we expect that the color flows in signal and background events should be distinct. From our study, we find that while this information does not provide as much discrimination power as the jet substructure, it nonetheless provides an additional gain in significance. While several observables for discriminating global color flow have been proposed [51,[122][123][124], this is in general quite a challenging task. Furthermore, we expect that it would be quite topology dependent. Nevertheless, it would be interesting to study in more detail, since it has not received much attention. We believe that ML is an ideal technique for extracting complicated global event information that has not yet been exploited to its full potential in LHC searches.
We also highlight the efficacy of the padding that renders the convolutional layers of the neural network symmetric under rotations in φ by one pixel. This is a new feature of our neural network, which we find to be helpful for training stability. Figure 9 shows a typical signal image and how the neural network output changes as the event is rotated in φ. As desired, the network with the padding at every convolution layer is much more stable than the one without the padding. The padded network is not completely invariant under rotations in φ for two reasons: the dense layers at the end of the network break the φ symmetry, and for rotations at scales below a single pixel the discretization breaks the invariance. Figure 10 shows a similar trend after averaging over many events.
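The symmetry property of the padded convolutions can be checked numerically. The sketch below is an illustration only: it uses a single-channel toy image and a hypothetical helper `conv_phi_periodic` to verify that, with wrapping in φ, rotating the input by a whole number of pixels simply rotates the convolution output by the same amount.

```python
import numpy as np

def conv_phi_periodic(image, kernel):
    """'Valid' 2D cross-correlation after wrapping the image in phi (axis 1)."""
    k_eta, k_phi = kernel.shape
    wrapped = np.concatenate([image, image[:, : k_phi - 1]], axis=1)
    out = np.zeros((image.shape[0] - k_eta + 1, image.shape[1]))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(wrapped[i : i + k_eta, j : j + k_phi] * kernel)
    return out

rng = np.random.default_rng(1)
image = rng.random((40, 40))
kernel = rng.random((5, 5))
shift = 3  # rotation in phi, in pixels

rotated_then_conv = conv_phi_periodic(np.roll(image, shift, axis=1), kernel)
conv_then_rotated = np.roll(conv_phi_periodic(image, kernel), shift, axis=1)
```

The two results agree to floating-point precision, which is the sense in which the convolutional layers respect the φ symmetry. Any layer that is not itself translation-symmetric, such as a dense layer acting on the flattened output, can still produce an output that depends on the rotation.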

Recommendations for Future Searches
Due to the importance of the H → bb channel for probing the Higgs sector at the LHC, we conclude this section with some concrete recommendations for improving the LHC searches, reiterating the points found in this section. In particular, although the most power is gained from a neural network, we have shown that a large fraction of this information can be obtained through simple observables, which can immediately be implemented in current searches for boosted H → bb. A neural network using charged information could also be applied without requiring extensive calibration studies. A key component of what the network can learn, since the signal and background are already in a two-prong topology, is the color flow. Quantifying the additional information in the form of compact analytic observables is an interesting and important part of future work.

High-p T Higgs for BSM Physics
Beyond the discovery of the H → bb decay, a major motivation for the study of boosted H → bb final states in particular is that it allows one to study the structure of the gg → H process at high p_T. While in the Standard Model this process is primarily due to the contribution of a virtual top quark loop, the total cross section σ(gg → H) is only sensitive to the low-energy limit of this loop, in which it is extremely well approximated by a dimension-five operator with no dependence on m_t. At p_T ≳ 2m_t this is no longer true, as the physical momentum running through the loop is comparable to m_t. Observing the p_T dependence therefore allows potential new physics contributions to the loop, which are not observable in the total cross section, to be disentangled. This general observation has been explored in Refs. [7][8][9][10][11]. In this section we apply our machine learning techniques and illustrate how the improved significance for H → bb translates to improved bounds on BSM physics.
We are interested in probing new physics in the gg → H production loop that can be modeled by dimension-6 operators. Following Ref. [7], the operators modifying the gg → H production cross section are parameterized by the Wilson coefficients c_H, c_y, c_g, and c̃_g. Here G^a_µν is the QCD field strength and G̃^a_µν = (1/2) ε_µνσρ G^{a σρ} is its dual. After electroweak symmetry breaking, the induced operators affecting the coupling of the Higgs boson to tops and gluons are described by the couplings κ_t and κ_g and their CP-odd counterparts κ̃_t and κ̃_g, so that one degeneracy between c_H and the real part of c_y remains. In the following we will only be interested in CP-even terms and explicitly set κ̃_t = κ̃_g = 0, implicitly demanding that c_y be real.
Although the two couplings that we are interested in probing have distinct physical effects, namely κ_t acts as a correction to the top Yukawa while κ_g corrects the ggH coupling, the Higgs low-energy theorem [125,126] guarantees that they contribute to the inclusive Higgs production cross section as (κ_t + κ_g)² up to corrections of O(m_h²/m_t²). As shown in Ref. [7], this degeneracy is broken by the cross-section for H + jet production, which for a given p_T^min cut on the Higgs scales like σ(p_T^min)/σ_SM(p_T^min) = (κ_t + κ_g)² + δ κ_t κ_g + ε κ_g², where ε and δ are terms dependent on the p_T^min cut placed on the Higgs. These are given in Ref. [7] for tabulated values of p_T^min, computed at one-loop order with full m_t dependence. ... the (κ_t + κ_g)² term (provided that c_g ≠ 0). We can see this effect in Fig. 11. Consequently, performing a search for boosted Higgs bosons can provide bounds on the Wilson coefficient κ_g = c_g and the combination κ_t = 1 − Re(c_y) − c_H/2.
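The way a p_T^min cut breaks the degeneracy can be illustrated numerically. The sketch below assumes the quadratic form of Ref. [7], σ(p_T^min)/σ_SM(p_T^min) = (κ_t + κ_g)² + δ κ_t κ_g + ε κ_g², together with the relations κ_g = c_g and κ_t = 1 − Re(c_y) − c_H/2 stated above. The values of δ and ε used here are hypothetical placeholders and are not the tabulated values of Ref. [7].

```python
def kappas(c_g, re_c_y, c_H):
    """Map Wilson coefficients to (kappa_t, kappa_g) as stated in the text."""
    return 1.0 - re_c_y - c_H / 2.0, c_g

def boosted_ratio(kappa_t, kappa_g, delta, epsilon):
    """sigma(pT_min) / sigma_SM(pT_min) in the assumed quadratic parameterization."""
    return (kappa_t + kappa_g) ** 2 + delta * kappa_t * kappa_g + epsilon * kappa_g ** 2

# Hypothetical placeholder values for one pT_min cut (NOT from Ref. [7]).
DELTA, EPSILON = 0.5, 0.2

# A point that is invisible inclusively: kappa_t + kappa_g = 1.
kappa_t, kappa_g = kappas(c_g=0.1, re_c_y=0.1, c_H=0.0)
inclusive = (kappa_t + kappa_g) ** 2
boosted = boosted_ratio(kappa_t, kappa_g, DELTA, EPSILON)
```

For this point the inclusive rate equals the SM value while the boosted rate does not, which is the sense in which the high-p_T measurement separates κ_t from κ_g.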

Simulation
Our signal of interest is dominated by the interference of the SM $gg \to Hj(j)$ process with the higher-dimensional operators given above. Since MadGraph5_aMC@NLO is currently unable to compute the interference effects of processes that start at loop level, a modification of the typical MadGraph5_aMC@NLO procedure is necessary to correctly generate the processes we want to study. The effects of the operators parameterized by $c_g$ and $\tilde{c}_g$ are recovered by implementing a fictitious heavy top partner whose mass is set large enough (nominally 10 TeV) that a contact-operator approximation remains valid for all LHC processes, and whose coupling to the Higgs is tuned to give the correct dependence on the higher-dimension operator coefficient. The other operators are implemented as actual higher-dimension operators, as is conventional.

Results and interpretation
We apply our two-stream convolutional neural network to improve the bounds on the coupling $c_g$ and the combination $\mathrm{Re}(c_y) + c_H/2$ as compared with the standard search. In Fig. 12 we show the constraints from comparing the inclusive cross section to one with $p_T^{\min} = 650$ GeV. We see significant gains using machine learning, corresponding to the improved significance for the Higgs seen earlier.

To properly situate these results, we summarize the current bounds and future prospects for constraining these operator coefficients in the absence of a dedicated high-$p_T$ Higgs analysis. While theoretical constraints from general principles such as causality and locality do exist [127] (in particular, $\mathrm{Re}(c_y) + c_H/2 > 0$ seems to always be true when generated within quantum field theory), the best current bounds on the couplings $c_g$ and $\mathrm{Re}(c_y) + c_H/2$ come from a combination of the most recent inclusive Higgs cross section measurements [128] and a recent global fit of the Standard Model Effective Field Theory to all current Higgs and electroweak data performed in Ref. [129]. The inclusive Higgs measurement constrains a combination of couplings, for which the linearization $c_g - \mathrm{Re}(c_y) - c_H/2$ is an excellent approximation, to be $0.29 \pm 0.46$ at $3\sigma$ using 36.1 fb$^{-1}$ of data. The global fit provides current world averages (again with $3\sigma$ uncertainties) of $c_g = 0.10 \pm 0.30$, $\mathrm{Re}(c_y) = -4.7 \pm 7.8$, and $c_H = -1.1 \pm 1.8$, all consistent with zero at $2\sigma$. These results are primarily driven by the LHC Run II Higgs measurements, all using 36.1 fb$^{-1}$ (35.9 fb$^{-1}$) of data from ATLAS (CMS). Since the effect of possible other higher-dimension operators on the backgrounds is not included, these bounds should be interpreted with care. The dominant discriminating power is provided by looking for deviations in the $h \to WW^*, ZZ^*$ decays, with the most constraining bounds clearly being on the coefficient $c_g$.
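The statement that the quoted values are consistent with zero at $2\sigma$ can be cross-checked with simple arithmetic. The sketch below assumes symmetric Gaussian uncertainties, so that a quoted $3\sigma$ half-width is three times the $1\sigma$ error; that assumption and the helper name are ours, while the numbers are the ones quoted in the text.

```python
def pull_from_3sigma(central, err_3sigma):
    """Distance of the central value from zero in units of 1 sigma,
    assuming the quoted 3-sigma half-width is 3x a Gaussian 1-sigma error."""
    sigma = err_3sigma / 3.0
    return abs(central) / sigma


# (central value, 3-sigma half-width) as quoted in the text.
quoted = {
    "c_g - Re(c_y) - c_H/2 (inclusive)": (0.29, 0.46),
    "c_g (global fit)": (0.10, 0.30),
    "Re(c_y) (global fit)": (-4.7, 7.8),
    "c_H (global fit)": (-1.1, 1.8),
}

pulls = {name: pull_from_3sigma(c, e) for name, (c, e) in quoted.items()}

for name, pull in pulls.items():
    print(f"{name}: {pull:.2f} sigma from zero")
```

Under this Gaussian assumption every pull comes out below 2, in line with the statement in the text.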
Footnote 7. A recent combined fit from CMS using high-$p_T$ $H \to b\bar{b}$ as well as differential $H \to \gamma\gamma$ and $H \to ZZ^* \to 4\ell$ decays [24] finds bounds of $0.12 \pm 0.42$ on $c_g$ and of about 0.5 on $|\mathrm{Re}(c_y) + c_H/2|$ using our conventions. These are nearly competitive with the global fits on $c_g$ and would clearly improve the fits of the other bounds if included.

Conservatively assuming no improvement in the treatment of systematic or theoretical errors, analogous bounds to those discussed above with a full 3 ab$^{-1}$ dataset should be able to reduce uncertainties by a factor of 2 in the inclusive Higgs cross section linear combination and by 20 % to 25 % for the global fits. Comparing this to the projections of Fig. 12, our proposed analysis has the potential to exceed these conservative extrapolations on sensitivity by a factor of a few.
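To see what these extrapolations imply, one can ask how much of the current uncertainty must be independent of luminosity. The sketch below assumes a statistical component scaling as $1/\sqrt{L}$ added in quadrature to a fixed systematic component. This decomposition is our own simplified reading of "no improvement in systematic or theoretical errors", not one given in the analysis, and the function name is hypothetical.

```python
import math


def implied_fixed_fraction(reduction_factor, lumi_now, lumi_future):
    """Fraction of the current total uncertainty that must be
    luminosity-independent to yield the given reduction factor.

    Model (an assumption): total^2 = stat^2 * (lumi_now / L) + fixed^2,
    with the current total normalized to 1.
    """
    lumi_ratio = lumi_now / lumi_future
    target_sq = (1.0 / reduction_factor) ** 2
    stat_sq = (1.0 - target_sq) / (1.0 - lumi_ratio)
    if not 0.0 <= stat_sq <= 1.0:
        raise ValueError("reduction not reachable in this simple model")
    return math.sqrt(1.0 - stat_sq)


LUMI_NOW = 36.1       # fb^-1, as quoted in the text
LUMI_FUTURE = 3000.0  # fb^-1, i.e. 3 ab^-1

# Best case with purely statistical uncertainties.
stat_only_gain = math.sqrt(LUMI_FUTURE / LUMI_NOW)

fixed_for_factor_2 = implied_fixed_fraction(2.0, LUMI_NOW, LUMI_FUTURE)
fixed_for_20_percent = implied_fixed_fraction(1.0 / 0.80, LUMI_NOW, LUMI_FUTURE)

print(f"pure-statistics gain: x{stat_only_gain:.1f}")
print(f"fixed fraction for factor 2: {fixed_for_factor_2:.2f}")
print(f"fixed fraction for 20% gain: {fixed_for_20_percent:.2f}")
```

In this simplified model the quoted reductions fall well short of the pure-statistics gain, meaning a large share of the present uncertainty is treated as irreducible.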

Conclusions
In this paper, we have applied modern machine learning techniques to improve the search for the $H \to b\bar{b}$ decay at the LHC. This decay offers a powerful probe of BSM contributions to the $gg \to H$ loop at high $p_T$. Using our techniques, this process may be discoverable at the LHC (prior to the HL-LHC).
A new feature of our analysis is the use of a two-stream convolutional neural network, with one stream acting on the double $b$-tagged jet and the other acting on the global event information. This not only lets us exploit the maximal information in the event, combining jet substructure information and global information, but also lets us more easily identify the dominant physics features that the neural network is learning. In particular, we find that a significant fraction of this information is not contained in the recently proposed $\beta_3$ observable. Disentangling these differing sources of information is challenging in standard analyses, which use substructure observables nominally designed to identify two-prong substructure, although in the course of optimization they may become sensitive to other features as well. Resolving an event at multiple scales and in various regions of phase space is a generic technique that should enable significant improvements in other LHC searches. By probing the neural network in detail, it may also be possible to use neural networks as a guide to building compact, simple, analytical observables that nearly saturate the machine learning performance. With such tools in hand, increasingly extreme regions of phase space can be thoroughly explored.