Tagging the Higgs boson decay to bottom quarks with colour-sensitive observables and the Lund jet plane

We study the problem of distinguishing $b$-jets stemming from the decay of a colour singlet, such as the Higgs boson, from those originating from the abundant QCD background. In particular, as a case study, we focus on associate production of a vector boson and a Higgs boson decaying into a pair of $b$-jets, which has been recently observed at the LHC. We consider the combination of several theory-driven observables proposed in the literature, together with Lund jet plane images, in order to design an original $Hbb$ tagger. The observables are combined by means of standard machine learning algorithms, which are trained on events obtained with fast detector simulation techniques. We find that the combination of high-level single-variable observables with the Lund jet plane provides an excellent discrimination performance. We also study the dependence of the tagger on the invariant mass of the decaying particles, in order to assess the extension to a generic $Xbb$ tagger.


Introduction
Since the discovery of the Higgs boson at the Large Hadron Collider (LHC) by the ATLAS [1] and CMS [2] experiments in 2012, our understanding of the propera e-mail: luca.cavallini02@universitadipavia.it b e-mail: andrea.coccaro@ge.infn.it c e-mail: charanjit.kaur@bristol.ac.uk d e-mail: giulia.manco@pv.infn.it e e-mail: simone.marzani@ge.infn.it f e-mail: fabrizio.parodi@ge.infn.it g e-mail: daniela.rebuzzi@unipv.it h e-mail: alberto.rescia@desy.de i e-mail: giovanni.stagnitto@physik.uzh.ch ties of this particle has progressively evolved. In addition to the "golden" decay modes of the Higgs boson, H → γγ and H → 4l, in the past few years other decay channels have been observed, usually in association with particular production modes. For instance, due to its large branching ratio, the H → bb decay plays a central role in studies that aim at probing the structure of the Higgs couplings to the fermions. In this regard, one of the most interesting processes is the associated production of a Higgs boson H and a vector boson V (W or Z), with the vector boson decaying leptonically and the Higgs boson decaying hadronically into a pair of b-quarks, V (ll)H(bb): the decay products of the vector boson provide us with a clean experimental signature, as well as a recoil system for the Higgs particle. Both ATLAS and CMS experiments have reported the observation of the H → bb decay and of the V H production mode [3,4] and the ATLAS experiment reported the first cross-section measurements [5][6][7][8] targeting different regimes of reconstructed transverse momenta of the vector boson and in fiducial volumes, as defined by the simplified template cross-section framework [9]. The experimental focus is therefore shifting towards precision measurements of the kinematics of the H → bb decay channel and, as suggested in Ref. [10], additional differential information, and hence discrimination of this process against sources of backgrounds, is crucial for the sensitivity to beyond-the-Standard-Model operators. After the fragmentation and hadronisation process, the hard b-quarks produced by the Higgs boson decay are usually detected as two separate b-jets [11,12]. In simulation, a b-jet is defined by a suitable particle-level observable, based on the angular distance of B-hadrons with respect to the jet axis, or by ghost association [13,14]. On real data, b-jets are identified by means of dedicated b-tagging algorithms.
In order to make the most out of the large set of accumulated data, strategies to better discriminate the H(bb) process over the large QCD background (where the pair of b-quarks is produced by pure strong interaction, mostly by g(bb) collinear splitting) are being actively developed. The signal/background discrimination is especially compelling in the boosted regime, when the transverse momentum of the jets is much greater than their invariant mass: in such a situation, the b-quarks may be close in angle, and hence reconstructed as a single jet. Since the seminal work of Ref. [15], several jet substructure techniques-which aim to improve the discrimination performance by finding hard prongs inside a large radius jet -have been designed, tested and implemented in the analyses by the experimental collaborations (see, for instance Ref. [16] and references therein). Broadly speaking, two main strategies exist: design high-level theoreticallymotivated observables, sensitive to particular features of the signal distribution, which can be measured on data and used as single-variable discriminant; or produce some low-level representation of the jets (list of particles, calorimetric images,. . . ), to be used as input for machine learning (ML) techniques. We refer the reader to the recent literature about ML based approaches for H → bb tagging [17][18][19][20][21][22][23][24][25][26][27], which is continuously being updated in Ref. [28].
Our case of interest is particularly challenging because both signal and background feature a similar flavoured two-prong structure. However, the two processes have a different behaviour with respect to the QCD radiation pattern. Namely, in the signal case, the b-jets originate from the decay of a colour singlet, and thus radiation will be mostly contained within the two b-quark system. Instead, in the background case, we expect QCD radiation to be more diffuse, due to colour connections with the rest of the event. Therefore, we would like to exploit observables that are particularly sensitive to the colour flows in the event. In this paper we select several theory-driven single-valued observables (see Sect. 2) and a theoretically-motivated representation of a jet (specifically, the Lund jet plane [29]) and build a combined Hbb tagger, with the aim to exploit the best of both strategies. Such a combination is performed by means of standard ML algorithms, for different input choices. We use boosted decision trees (BDT) for single-valued observables and convolutional neural networks (CNN) for Lund jet images. BDTs have been part of HEP analyses for a long time [30]. CNNs are showing promising potential for image based data sets for various applications in HEP see e.g. [31][32][33].
Moreover, in our analysis we account for the experimental detection and reconstruction of the physical quantities of interest by performing a fast detector simulation on the generated Monte Carlo events, and we assess the impact of these so-called detector effects on individual variable distributions and on the overall performance of the tagger. Finally, in an ideal scenario we would like to apply the same tagger for the decay products of a generic colour singlet X, without any prior knowledge on the value of its mass, so as to design a global Xbb tagger. In this view, it would be desirable to keep the tagger uncorrelated with the invariant mass of the bquark pair, in order to ease its calibration on Z+jet events and to simplify the determination of the nonresonant background shape in data-driven approaches.
The paper is organised as follows. In Sect. 2 we briefly introduce the colour sensitive observables under study and the Lund jet plane. In Sect. 3 we discuss the event generation set-up and the selection cuts adopted in our analysis. In Sect. 4 we study the individual distributions of the colour sensitive observables and the output of the Lund jet plane CNN for the signal and the background processes, before and after detector simulation. Moreover, we assess the discrimination performance of the combination of several observables, by also including the Lund jet plane CNN output as an additional input. In Sect. 5 we discuss the BDT dependence on the invariant mass of the large radius jet used in the analysis (see Sect. 3), to determine the mass bias of our Hbb tagger. Finally, in Sect. 6 we draw our conclusions.

Description of the observables
We make use of high-level colour sensitive variables introduced in the literature in the past few years: jet pull and its projections, namely the pull angle, θ p [34,35], and the parallel and perpendicular components of the pull vector, t and t ⊥ [36,37]; the colour ring O [38]; D 2 [39,40] and the Lund jet plane [29]. We limit ourselves to a brief introduction of the relevant variables, and we refer the interested reader to the original papers.

Jet pull
Let us consider a hard jet J a . The pull vector t is the jet shape observable that is defined as where p Ta is the transverse momentum of the jet, and the sum runs over all the jet constituents. y and φ repre-sent rapidity and azimuthal angle, and r i is the distance vector between the jet axis and its i-th constituent in the y-φ plane The pull vector is sensitive to the different colour connections of the entire event in which the jet is formed. If we consider events with two hard jets (or subjets) J a and J b that originate from the decay of a colour singlet, additional QCD radiation tends to be emitted between the two jets, causing the pull vector of J a to point in the direction of J b and vice-versa. If instead the two jets originate from the decay of a colour octet, such as the gluon, then the pull vectors will instead tend to point in different directions. In order to make these considerations more quantitative, we can introduce suitable projections of the pull vector t along two directions: the one given by the unit vector which points from the centre of J a to the centre of J b t = t ·n , withn = 1 and the other generated by the unit vector perpendicular ton t ⊥ = t ·n ⊥ , withn ⊥ = 1 where in the above equations we have introduced ∆y = y a − y b and ∆φ = φ a − φ b . We also consider θ p , known as the pull angle, defined as Of all the variables built out of the jet pull vector, the pull angle has been shown to be one of the most effective discriminants of the two different colour configurations [34]. However, the comparison between experimental measurements of the pull angle and theoretical calculations has shown that this observable is not under good theoretical control [41][42][43]. This problem can be traced back to the fact that θ p is not infrared and collinear (IRC) safe, but only Sudakov safe [35] (for discussions about Sudakov safety, see Refs. [44][45][46]). Instead, t and t ⊥ are IRC safe observables, so it is interesting to assess whether individually or in some combination t and t ⊥ possess the same discriminating power of its non IRC safe counterpart. It is also interesting to study how the discriminating power of the pull vector variables is affected when combining the variables for both jets J a and J b . Thus, we will include all the three pull vector variables, for both jets J a and J b , as input to the ML algorithms. Furthermore, although θ p is not IRC safe, it is also included.
Finally, we note that there is potential overlap in including both t , t ⊥ and θ p because the jet pull is only a two-component vector. However, this is not an issue, since the machine learning algorithms are trained to be robust against interdependence between input variables.

Jet colour ring
The jet colour ring was introduced in Ref. [38], as an observable that is provably optimal, in certain kinematic limits. The starting point of its construction is the observation that, according to the Neyman-Pearson lemma, the ratio of the matrix elements squared for the signal and the background process should be monotonic to the optimal single-variable discriminant [47]. When considering a decay of a colour singlet as signal and a colour octet as background, with a subsequent gluon emission in the boosted regime, and working in the softcollinear limit approximation, the ratio simplifies to where the indices a and b refer to the hard partons, the index k to an additional (gluon) emission, and θ ij 's are the angles between them. The above considerations lead to the definition of the jet colour ring where now ∆ ij are distances between jets (or subjets) in the azimuth-rapidity plane. The observable name originates from its geometric interpretation: radiation from colour singlets will tend to fall between the two jets, leading to values of O < 1, while in the case of colour octets, we will tend to have O > 1.

D 2
The variable D 2 [39] is defined as the ratio of two normalized N -point energy correlation functions (ECFs) [48], e β k : where β is a parameter which we have set to β = 2.
The variable is usually calculated on a large radius jet, and is useful to discriminate 2-prong jets from 1-prong jets. Furthermore, because of its sensitivity to soft radiation at wide angles, D 2 also probes colour correlations and it is therefore useful to disentangle different colour configurations. However, we note that D 2 retains a correlation with the mass of the large radius jet; this may be a problem when designing a tagger free of any mass bias, see for instance [49]. We will come back to this aspect in Sect. 5.

Lund jet plane
The Lund jet plane is a theory-inspired representation of a jet [29]. It is formed parsing backwards the Cambridge-Aachen (C/A) [50,51] clustering history of the jet. The procedure starts by undoing the final clustering step and by recording the kinematics of the splitting. The primary Lund jet plane is obtained by iterating the above procedure, always following the hardest branch in each splitting and recording the azimuth-rapidity separation of the branches involved in the splitting and the relative transverse momentum of the emission. The Lund jet plane has been exploited in the context of vector boson [29], top [52] and Higgs [27] tagging. Furthermore, it has been successfully measured at AT-LAS [53] and first-principle theoretical predictions have been performed in Ref. [54].

Event simulation and selection
We generate about 300k events for the pp → H(bb)Z(ν ν ) signal and 4M events for the pp → bbν ν background processes, so as to have about 50k events remaining after all analysis cuts, in accordance with Table 1, as will be detailed below. In case of signal, bb pair is produced from the decay of the Higgs boson, while it comes mainly from QCD interaction in case of background. Fig. 1a and Fig. 1b show representative Feynman diagrams for the signal and the background processes, respectively. We generate hard events using MG5 aMC@NLO v2.8.3.2 [55], by imposing a 200 GeV cut on the p T for the neutrino pair in the final state. This is done to ensure that the events generated are firmly in the boosted regime. These parton-level events are subsequently showered in Pythia v8.305 [56], including MPI and underlying events, to produce particle-level events.
Finally, rather than simulating an entire detector, Delphes v3.5.0 is used to perform a fast detector simulation [57,58]. This allows us to understand how the discrimination power could be affected by real-life detector effects, without having to run a computationally expensive full simulation, which in addition is strongly detector dependent. From Delphes, we extract both the Monte Carlo truth of the event, containing the particle-level information, e.g. the same one would get from a perfectly efficient detector (henceforth referred to as truth), as well as the reconstructed events including the detector effects (henceforth referred to as reco).
The Delphes simulation is run using the ATLAS card with minor modifications, described below, to fit our needs. For the truth case, we consider all visible, stable particles with p T > 0.5 GeV. Instead, in the reco case, the jets are built using the simulated calorimeter towers and tracks. All electromagnetic calorimeter towers with energy E > 0.5 GeV and significance S > 2.0 and all hadronic calorimeter towers with energy E > 1.0 GeV and significance S > 2.0 are considered. Tracks are required to have p T > 0.5 GeV. Delphes uses FastJet v3.3.4 [59] to perform the jet clustering.
At this point the analysis is the same in both the truth case and the reco case. First, the constituents are clustered into jets with radius R = 1.0 using the antik T algorithm [60]. For each event, we choose the jet with the highest p T as the large radius jet. We only accept the event if the large jet has p T > 250 GeV and |y| < 1.5, because of the tracking detector acceptance.
We also cluster the constituents into smaller jets with radius R = 0.2. We identify those jets having ∆R < 0.8 from the large jet, and call these subjets. We then proceed to identify the b-subjets which originate from the b-partons, through a process known as b-labelling. We do this by first identifying the b-partons originating from the hard scattering in the event record, requiring a minimum p T of 5.0 GeV. For each b-parton, we compute the distance between each b-parton and subjet; the subjet which is closest to the b-parton, provided that the distance is below 0.2, is labelled as bsubjet. The association between a b-parton and b-subjet is unique.
For the event to be accepted, we require two blabelled subjets with p T > 10 GeV. The pull variables are calculated on these two b-labelled subjets, and D 2 is calculated on the large jet. For the colour ring to be defined, there must also be a third non-b-subjet within ∆R = 0.8 from the large jet. In a majority of cases, this third jet is not present. To avoid discarding too many events, in these cases we assign a default value of O = −1 to the colour ring. This allows for higher statistics, but also provides useful information to the machine learning algorithms.
For the Lund jet plane, we consider large radius jet constituents and re-cluster them using C/A algorithm. Considering the declustering history of this jet, we get the primary Lund jet plane. Considering 25 × 25 pixels for each image, we put 1 or 0 in a pixel depending on if (ln 1/∆, ln k T ) value of the splitting falls in that pixel or not. Our implementation of the Lund jet plane is based on the one present in fastjet-contrib [61] repository. Table 1 shows the percentage of events which pass the selections in all cases considered. The selection discards more background than signal events in both the truth and reco cases. The most important cut is the p T cut on the large radius jet, accounting for 60% of discarded events. The second most important cut is the rapidity cut on the large radius jet which rejects 10% of the events.

Discrimination performance
After event selection, we are ready to evaluate observables on the selected events. We first show in Fig. 2 the normalised distributions for the eight colour sensitive (CS) observables introduced in Sect. 2, both for signal and background, and at the truth and reco level. By just looking at the plots, we can appreciate the strong discrimination power of the O and D 2 , which is retained at the reco level. Instead, the observables related to the jet pull vector are more affected by detector effects: there is a visible difference in t i , t ⊥i and θ pi , both for the leading jet a and the sub-leading jet b, between the truth and the reco cases. In particular, the pull angle observables θ pa and θ pb seem to be good discriminants at the truth level, but detector effects noticeably flatten the signal distribution, hence leading to a worsening of the discrimination power. We then show in Fig. 3 the averaged Lund images for the signal and background process, in the truth and reco cases respectively. We note that detector effects lead to an overall decrease of the image quality, in the sense that the distinctive features of the truth case are still present, but in the reco case there is additional radiation for middle values of ∆ and k t for both the signal and the background events. However, the high density patch at large ∆ and high k t in the case of the signal is still clearly visible by eye also in the reco case.
After having determined the distributions of the CS observables and the Lund jet images, we now use these as inputs to ML algorithms in order to build combined classifiers. Specifically, we train a BDT 1 on the CS observables, whereas Lund images are classified using a CNN. More details about these methods and architectures are provided in Appendix A. The output distribution of the CNN Lund jet plane classifier (LP CNN ) is shown in Fig. 2. In the following, we consider also the combination of (some of) the CS and LP CNN observables, in order to improve the total discrimination power; in such cases, we adopt a two-step procedure, by using the output of the CNN Lund jet plane classifier as an additional input to the BDT.
In Fig. 4, we show the receiver operating characteristic (ROC) curves for several combinations of observables. Namely, we consider all the colour sensitive observables (CS) or just the D 2 and the colour ring (D 2 +CR), combined through a BDT; the LP CNN ; the combination of all the CS observables with the LP CNN (CS+LP CNN ), by means of the two-step procedure explained above. For each curve in Fig. 4, we report the  Table 2, both for the truth and the reco cases. A perfect classifier would have AUC = 1, whereas a random classifier is associated with AUC = 1/2. As expected, we first observe a worsening of the discrimination power by moving from the truth case to the reco case, as can also be seen by comparing the value in the left and in the right column of Table 2. However, the performance after taking into account detector ef-fects is still good for most of the combinations, close to 0.85 for the CS+LP CNN combination. Furthermore, we see that most of the discriminating power of the set of CS observables is actually coming from the combination of D 2 +CR alone. This is in agreement with what observed at the level of distributions at the beginning of Sect 4: the jet pull observables, including the pull angle, seem not to add any additional information useful for classification. At reco level, their discrimination power  is almost unnoticeable. Moving to the combinations involving the Lund jet plane, we observe that the Lund jet plane alone is performing better than the whole set of CS observables, especially in the region of high signal efficiencies. When we combine LP CNN with the CS observables, we see a noticeable improvement of the overall classification power, with a value of AUC equal to 0.893 in the truth case and 0.846 in the reco case. Finally, in Table 3 we rank the variables based on their importance in the BDT, both in the truth and reco case. The ranking presented here also includes the output of the Lund jet plane CNN as an additional input. LP CNN is the most discriminating variable, both in the truth and in the reco case. It is followed in order by D 2 and the colour ring O. The jet pull variables are all of similar importance, at the bottom of the ranking score. For the reconstructed case, O gains additional importance with respect to the pull variables.
We end this Section with a comment about the usage of the pull angle. As already mentioned in Sect. 2, the pull angle θ p is only Sudakov safe. One may wonder what is the effect of keeping only its (IRC-safe) projections t and t ⊥ , instead of using all the three observables t , t ⊥ and θ p as input in the BDT, as we have done. Unsurprisingly, by looking at the correlation matrix in the BDT, we observe a strong correlation between these variables. Given that the variables derived from the jet pull vector(s) have a small influence on the overall performance, a variant of the tagger could be conceived, with comparable performance, by dropping θ p among the list of inputs to the BDT.

Invariant mass dependence
Since our goal is to develop a tagger purely sensitive to the colour configuration of the decaying particle, in order to be applied to other contexts (such as Z or W boson hadronic decays), ideally our procedure should be insensitive to the invariant mass of decaying system, specifically to the invariant mass of the pair of b-jets.
In Fig. 5 we show the distribution of the invariant mass measured on the whole set of background events and on three subset of events, each of the same size, corresponding to signal-enriched, intermediate and background-enriched regions. These regions are defined by means of a set of cuts on different discriminant variables: D 2 alone (Fig. 5a), combined BDT with CS but D 2 (Fig. 5b), Lund plane CNN (Fig. 5c). We show results for the reco case only, since the ones in the truth case are similar. In an ideal scenario, a cut on the discriminating variable should not also concomitantly imply a cut on the invariant mass of the system, hence the curves for the three regions should overlap, and agree with the curve without any cut.
We find that the sensitivity to the invariant mass is introduced mainly through the D 2 observable, which is highly correlated to the value on the mass, as it is clear from Fig. 5a. Such a correlation has been already investigated in the literature [49,62]. By removing D 2 from the CS input variables of the BDT, the mass bias is greatly reduced, as can be observed by comparing Fig. 5a and Fig. 5b. However, given the fact that the D 2 observable is ranked as one of the most important (see Table 3), the removal of this variable comes at the price of loosing a good part of efficiency.
Finally, it is interesting to study whether the LP CNN alone retains or not a dependence on the invariant mass of the decaying system. This is shown in Fig. 5c. Unfortunately, the output of the LP CNN appears to be notably correlated to the invariant mass of the pair of b-jets. In order to better understand this behaviour, by looking at the correlation matrix of the BDT with the CS variables and the LP CNN as input, we note that LP CNN and D 2 are largely correlated. Hence, the behaviour we observe in Fig. 5c for LP CNN can be related to the known behaviour of D 2 of Fig. 5a. Given the fact that LP CNN is our best discriminating variable, consid- ering removing it from the combination comes at the price of loosing a good part of discrimination power.

Conclusions
In this paper, we have investigated the problem of distinguishing the b-jets originating from the decay of a colour singlet from those originating from the pure QCD background (mostly through g → bb collinear splitting). We have focused on the signal process pp → H(bb)Z(ν ν ), but we are confident that our strategy is valid in a more general context. Specifically, we have trained a BDT architecture on eight high-level, colour sensitive observables, in order to develop a combined colour tagger. We have also explored the discrimination performance of a CNN architecture trained on Lund jet plane images. Finally, we have combined the highlevel observables and the output of the Lund jet plane CNN in a common BDT architecture. We have also performed a fast detector simulation in order to better assess the experimental feasibility of this tagging strategy. Namely, we have compared individual distributions and the final performance of the tagger before and after the inclusion of detector effects.
We have found a good discrimination power for our combination of colour sensitive observables with the output of the Lund jet plane CNN (AUC = 0.893), slightly deteriorated when including detector effects (AUC = 0.846). The Lund jet plane alone has been proven to be a powerful Hbb tagger even in presence of detector effect, thus extending the results of Ref. [27], and when combined with theoretically motivated singlevariable observables, such as D 2 or the colour ring, the overall performance appreciably improves. In the end, we have shown that our tagger, which is a combination of several theory-driven single-variable observables with a representation of radiation pattern within a jet, is not only effective in theory, but also shows promising prospects for application to experimental analyses.
We have also studied to what extent the tagger is sensitive to the mass of the decaying particle. In the case of the colour sensitive observables, the mass bias has been shown to be ultimately due to the D 2 input variable, as already known from the literature. However, we have also found that the Lund plane CNN retains a large mass bias, and to the best of our knowledge this has not been pointed out in the literature so far. The elimination of these variables come at the cost of classification efficiency, especially in the case of the Lund plane CNN. Further studies are needed in order to understand how to remove such a mass bias. For instance, one could plan to explore techniques similar to the ones presented in Refs. [49,62], based on a debiasing a posteriori.
Even if a fast detector simulation offers a good starting point in order to assess the feasibility of a tagging strategy, in the end a full detector simulation within a more defined experimental context would be required. Such a more realistic scenario would also entail the inclusion of the efficiency of real-life existing b-tagging algorithms used by the experimental collaborations, whereas in this paper we have assumed a b-tagging algorithm with 100% efficiency. Further, although the main scope of this study is to assess the performance of relatively simple observables and ML architectures for Hbb tagging, it is interesting to investigate the importance  Table 4: BDT parameters used for the truth and reco data set.
of physics information beyond the primary Lund plane, which can be exploited by including planes originated by secondary splittings. In this context, we have made a preliminary investigation using the LundNet-5 model of Ref. [52], which employs graph neural networks, and obtaining an improved performance corresponding to AUC = 0.925 and AUC = 0.894, for truth and reco, respectively. We leave the implementation of these suggestions, as well as the opportunity to consider new high-level observables as input for our tagger, for future studies.
The Boosted Decision Tree (BDT) has been implemented using the ROOT Toolkit for Multivariate Analysis (TMVA) library [63]. The BDT is made up of 50 trees, with a maximum depth of 5 and minimum node size set to 2.5% of the total number of events. The Gini index is used as the optimisation criterion. AdaBoost has been chosen as the boosting model and the number of cuts is set to 80. We use a 50/50 train/test sample, and there is no downsampling.  We used CNN (implemented using Keras [64]) for the Lund jet images data set. Balanced data set for the binary classification is used; 70% for the training, 15% for the validation and 15% for the testing. We tried several models with different combinations of hyperparameters and the best architecture is described in the Table 5. Here N i , i = 1 . . . 4 denotes the number of filters in the corresponding convolutional layer. Filter size is 3 × 3 in all the convolutional layers. Activation function 'relu' and 'softmax' are used for the intermediate and last layer, respectively. For the optimisation, 'adam' optimiser is used. We used pooling (MaxPooling) operation after the second and fourth convolutional layer for image downsizing. Note that CNN architecture used is same for truth and reco data except the dropouts. The numbers mentioned in the brackets are the dropouts strength used for the reco data set.