Binary Discrimination Through Next-to-Leading Order

Binary discrimination between well-defined signal and background datasets is a problem of fundamental importance in particle physics. With detailed event simulation and the advent of extensive deep learning tools, identification of the likelihood ratio has typically been treated as a computational problem. However, this approach can obscure overtraining or excessive sensitivity to tuned features of the simulation that may not be well-defined theoretically. Here, we present the first analysis of binary discrimination for signal and background distributions whose likelihood ratio is infrared and collinear safe, and can therefore be calculated order-by-order in perturbation theory. We present explicit, general formulas for receiver operating characteristic curves and the area under them through next-to-leading order. As a demonstration of this formalism, we apply it to discrimination of highly-boosted Higgs decays from gluon splitting to bottom quarks. Effects at next-to-leading order are first sensitive to the flow of color in the jet and significantly modify the discrimination performance found at leading order. In the limit of infinite boost, these events can be perfectly discriminated because only the gluon will radiate at finite angles from the bottom quarks, and we find that large effects persist at energies accessible at the Large Hadron Collider. Next-to-leading order is therefore required to qualitatively understand results obtained with machine-learning methods.


Introduction
Given an ensemble of events from a particle collider experiment, such as the Large Hadron Collider, a natural first question is to what process or processes each individual event corresponds. Quantum mechanically, this question cannot be answered with certainty, of course, but it can be answered at a statistical level, with each event having some probability to have been a manifestation of one of the many possible processes. Reduced to a binary discrimination problem, in which all we have to classify is the probability of an event to be of signal or background type, it is well known by the Neyman-Pearson lemma [1] that the optimal discriminant observable is (monotonic in) the likelihood ratio, the ratio of the probability distributions of background to signal. Further, given the availability of high-fidelity data simulators [2][3][4] and the enormous dimensionality of particle physics events, this classification problem is naturally relegated to machine learning, whose use in particle physics has exploded recently. See Refs. [5][6][7][8][9][10][11][12][13][14] for reviews of the utility of machine learning in particle physics.
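As a toy illustration of the Neyman-Pearson lemma (an illustrative example of our own, not taken from the analysis below), consider a signal density p_S(x) = 2(1 − x) and a background density p_B(x) = 2x on [0, 1]. The likelihood ratio p_B/p_S = x/(1 − x) is monotonic in x, so a cut on x itself is optimal; any other observable yields a background-versus-signal efficiency curve that is nowhere lower, i.e., a larger area under it (with the convention used throughout this paper, smaller is better):

```python
# Toy illustration of the Neyman-Pearson lemma (illustrative densities,
# not from the paper): signal p_S(x) = 2(1-x), background p_B(x) = 2x on
# [0, 1]. The likelihood ratio p_B/p_S = x/(1-x) is monotonic in x, so
# cutting on x itself is optimal; any other observable gives a larger
# area under the background-vs-signal efficiency curve.

def roc_auc(discriminant, n=20000):
    """Area under the background-vs-signal efficiency curve for cuts
    of the form discriminant(x) < threshold."""
    dx = 1.0 / n
    pts = sorted(((i + 0.5) * dx for i in range(n)), key=discriminant)
    bkg, auc = 0.0, 0.0
    for x in pts:
        ds = 2.0 * (1.0 - x) * dx      # signal probability in this slice
        db = 2.0 * x * dx              # background probability in this slice
        auc += (bkg + 0.5 * db) * ds   # trapezoid step in signal efficiency
        bkg += db
    return auc

auc_opt = roc_auc(lambda x: x)               # optimal cut variable; -> 1/6
auc_sub = roc_auc(lambda x: (x - 0.3) ** 2)  # a deliberately worse observable
```

Any monotonic function of the likelihood ratio gives the same (optimal) curve, which is why only the ordering induced by the likelihood matters.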
Nearly uniquely among fields that employ machine learning for data analysis, high-energy particle physics has a precise underlying theoretical description from which calculations can be performed. Machine learning studies that exclusively employ simulated data therefore provide only an incomplete picture that can significantly obscure the features that are being learned to establish the likelihood for a particular problem of interest. Further, because event simulators include numerous models and parameters on top of a perturbative parton shower or fixed-order matrix elements, it can be unclear what the machine is learning and whether the features are robust physics or simply idiosyncrasies of the simulation. Therefore, in parallel with machine learning that pushes analysis boundaries, one would also desire predictions for discrimination performance that are well-defined theoretically, calculable from first principles, and systematically improvable. For classification of events or of individual jets, collimated streams of high-energy particles, at colliders, this requires that the likelihood ratio be infrared and collinear (IRC) safe [15,16], so that it can be predicted within the perturbation theory of quantum chromodynamics (QCD).
However, for many classification problems it can be argued that the likelihood is not IRC safe [17], or that IRC safety is only manifest once all-orders effects are included [18,19], and so a perturbative analysis can be impossible or at the very least obscured. Nevertheless, through the exercise of constructing an IRC safe likelihood ratio, theoretically-improved definitions of "signal" and "background" may present themselves, correspondingly rendering what was once deemed impossible now possible. Here, we present the first systematic analysis of IRC safe likelihoods for binary classification tasks order-by-order in perturbation theory. Because the likelihood ratio changes order-by-order, as the distribution of radiation in jets or events itself changes order-by-order, there is feedback and crosstalk between the perturbative expansion of the distribution of the likelihood ratio and the definition of the likelihood ratio itself. This results in an intricate perturbative expansion, but the properties required for the likelihood ratio to be IRC safe become obvious: the likelihood must be well-defined at lowest, leading order, and one must further map the phase space of emissions at higher orders down to that of leading order (LO) in an IRC safe way.
In this paper, we present results through next-to-leading order (NLO) in the strong coupling α_s for a generic likelihood ratio, for the signal and background distributions of the likelihood, and for its receiver operating characteristic (ROC) curve that quantifies discrimination power. As a concrete application of this framework, we apply it to the problem of discrimination of Higgs boson decays to bottom quarks, H → b b, from massive fragmentation of gluons to bottom quarks, g → b b, in the highly-boosted limit. We find that next-to-leading order is crucial for qualitative understanding of machine learning results, because that is the first order sensitive to the flow of color, and therefore to the radiation pattern of emissions off of the final-state bottom quarks. Along the way, we discuss the necessity of IRC safe NLO-to-LO phase space maps, robust and IRC safe flavor tagging of the bottom quarks, and the practicalities of numerical integration of infrared-divergent matrix elements. This correspondingly provides a foundation for future analyses to establish theoretical predictions for various other discrimination problems.

The ROC Curve Through NLO
In this section, we present the derivation of the ROC curve for optimal binary discrimination through next-to-leading order in the strong coupling, α_s. We assume that we have two samples of events, signal S and background B, and that their respective particle momentum distributions are measured on phase space Φ. An expansion in α_s requires the property of infrared and collinear safety, and so particle momentum is the only information that is accessible to our discriminant. The signal and background probability distributions on phase space can be expanded in powers of α_s as

p(Φ) = p^(0)(Φ^(0)) + (α_s/2π) p^(1)(Φ^(1)) + O(α_s^2) ,   (2.1)

where the superscript denotes the order of that term in α_s and the corresponding phase space for the number of particles produced at that order. As probability distributions, they must be normalized to integrate to unity, which requires dividing by their total integral over all of phase space. Starting from normalization of the leading-order distribution,

∫ dΦ^(0) p^(0)(Φ^(0)) = 1 ,   (2.2)

our normalization prescription for the contributions to the distribution at higher orders is defined via the expansion of the ratio

p(Φ) = [p^(0)(Φ^(0)) + (α_s/2π) p^(1)(Φ^(1)) + ⋯] / [1 + (α_s/2π) ∫ dΦ^(1) p^(1)(Φ^(1)) + ⋯]   (2.3)

= p^(0)(Φ^(0)) + (α_s/2π) [ p^(1)(Φ^(1)) − p^(0)(Φ^(0)) ∫ dΦ^(1) p^(1)(Φ^(1)) ] + ⋯ ,   (2.4)

where terms at higher orders in α_s are suppressed in the ellipses. At order α_s, there are two contributions whose sum integrates to 0. Because it lives in a higher-dimensional phase space, we refer to p^(1)(Φ^(1)) as the real contribution, and thus

−p^(0)(Φ^(0)) ∫ dΦ^(1) p^(1)(Φ^(1)) ≡ p^(0)(Φ^(0)) V^(1) ,   (2.5)

as the virtual contribution. Real, virtual, and mixed real-virtual contributions at higher orders are defined similarly, from simple Taylor expansion of Eq.
(2.4) to higher orders in α_s. By the Neyman-Pearson lemma [1], a function monotonic in the likelihood ratio is the optimal binary discrimination observable. The likelihood ratio defined as a function on phase space is the ratio of the background and signal distributions,

L̂(Φ) = p_B(Φ)/p_S(Φ) ,   (2.6)

so that L → 0 is the signal-rich region of phase space and pure background is pushed to L → ∞. Here, we use the caret notation L̂(Φ) to denote the likelihood as a particular function of phase space Φ, while L is its value. Because it is defined through perturbative distributions itself, the likelihood ratio has a perturbative expansion,

L̂(Φ) = L̂^(0)(Φ^(0)) + (α_s/2π) L̂^(1)(Φ^(1)) + O(α_s^2) ,   (2.7)

where

L̂^(1)(Φ^(1)) = p_B^(1)(Φ^(1))/p_S^(0)(Φ^(0)) − L̂^(0)(Φ^(0)) p_S^(1)(Φ^(1))/p_S^(0)(Φ^(0)) + (V_B^(1) − V_S^(1)) L̂^(0)(Φ^(0)) ,   (2.8)

and we have introduced notation for the likelihood at order n in the perturbative expansion. Note that this perturbative expansion of the likelihood ratio only makes sense if it is IRC safe. Specifically, the leading-order likelihood ratio L̂^(0)(Φ) must be IRC safe, as it is a ratio of distributions and appears as a factor at every order in the expansion.

Perturbative Expansion of the Cumulative Distribution of the Likelihood
With these perturbative expansions of the signal and background distributions and their likelihood ratio, we could work to calculate the distribution of the likelihood on the respective event samples. However, here we will forgo that analysis and jump directly to calculation of the receiver operating characteristic or ROC curve. The ROC curve is a signal versus background efficiency curve that quantifies the true and false positive rates for a sliding cut on the likelihood. It can be defined through the signal and background cumulative distributions of the likelihood ratio as

ROC(x) = Σ_B(Σ_S^{-1}(x)) ,   (2.9)

where Σ_B(L) is the cumulative distribution of the likelihood on the background sample and Σ_S^{-1}(x) is the inverse cumulative distribution of the likelihood on the signal sample, evaluated at quantile x ∈ [0, 1]. With the expansions established above, we will determine the expansion of the ROC curve through next-to-leading order in the coupling α_s.
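Numerically, the defining relation ROC(x) = Σ_B(Σ_S^{-1}(x)) can be sketched by inverting the signal cumulative distribution with bisection and evaluating the background cumulative distribution at the result. The cumulative distributions below are illustrative placeholders, not the perturbative ones constructed in this section:

```python
# Numerical sketch of ROC(x) = Sigma_B(Sigma_S^{-1}(x)): invert the signal
# cumulative distribution by bisection, then evaluate the background
# cumulative distribution there. The toy CDFs are illustrative placeholders.

def invert_cdf(cdf, x, lo, hi):
    """Solve cdf(L) = x for L on [lo, hi] by bisection (cdf monotone increasing)."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def roc(x, cdf_sig, cdf_bkg, lo, hi):
    return cdf_bkg(invert_cdf(cdf_sig, x, lo, hi))

sigma_S = lambda L: L        # toy signal cumulative on L in [0, 1]
sigma_B = lambda L: L * L    # toy background cumulative; then ROC(x) = x**2
```

The same two-step construction applies verbatim once the perturbative cumulative distributions are in hand.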
The first step is to determine the expansion of the cumulative distribution of the likelihood ratio, Σ(L). It is defined as the total probability for the likelihood to be less than a given value,

Σ(L) = ∫ dΦ p(Φ) Θ(L − L̂(Φ)) .   (2.10)

To determine its expansion in α_s, we expand both the probability p(Φ) and the likelihood L̂(Φ) on phase space, where

Σ(L) = ∫ dΦ^(0) p^(0)(Φ^(0)) Θ(L − L̂^(0)(Φ^(0))) + (α_s/2π) [ ∫ dΦ^(1) p^(1)(Φ^(1)) Θ(L − L̂^(0)(Φ^(0))) + V^(1) ∫ dΦ^(0) p^(0)(Φ^(0)) Θ(L − L̂^(0)(Φ^(0))) − ∫ dΦ p^(0)(Φ^(0)) L̂^(1)(Φ^(1)) δ(L − L̂^(0)(Φ^(0))) ] + O(α_s^2) .   (2.11)

The final line, containing dependence on the likelihood at next-to-leading order, implicitly mixes contributions that live in different phase spaces, so it is useful to expand it out to identify individual contributions. Note that

L̂^(1)(Φ^(1)) δ(L − L̂^(0)(Φ^(0))) = δ(L − L̂^(0)(Φ^(0))) [ ( p_B^(1)(Φ^(1)) − L p_S^(1)(Φ^(1)) ) / p_S^(0)(Φ^(0)) + (V_B^(1) − V_S^(1)) L ] .   (2.12)

Thus, the integral on the final line of Eq. (2.11) can be expressed as

∫ dΦ^(1) ( p^(0)(Φ^(0)) / p_S^(0)(Φ^(0)) ) [ p_B^(1)(Φ^(1)) − L p_S^(1)(Φ^(1)) ] δ(L − L̂^(0)(Φ^(0))) + (V_B^(1) − V_S^(1)) L ∫ dΦ^(0) p^(0)(Φ^(0)) δ(L − L̂^(0)(Φ^(0))) .   (2.13)

We can now put these results together to construct the signal and background cumulative distributions of the likelihood through next-to-leading order. For the background, we have

Σ_B(L) = ∫ dΦ^(0) p_B^(0)(Φ^(0)) Θ(L − L̂^(0)(Φ^(0))) + (α_s/2π) { ∫ dΦ^(1) p_B^(1)(Φ^(1)) Θ(L − L̂^(0)(Φ^(0))) + V_B^(1) ∫ dΦ^(0) p_B^(0)(Φ^(0)) Θ(L − L̂^(0)(Φ^(0))) − L ∫ dΦ^(1) [ p_B^(1)(Φ^(1)) − L p_S^(1)(Φ^(1)) ] δ(L − L̂^(0)(Φ^(0))) − (V_B^(1) − V_S^(1)) L ∫ dΦ^(0) p_B^(0)(Φ^(0)) δ(L − L̂^(0)(Φ^(0))) } + O(α_s^2) .   (2.14)

The signal distribution is

Σ_S(L) = ∫ dΦ^(0) p_S^(0)(Φ^(0)) Θ(L − L̂^(0)(Φ^(0))) + (α_s/2π) { ∫ dΦ^(1) p_S^(1)(Φ^(1)) Θ(L − L̂^(0)(Φ^(0))) + V_S^(1) ∫ dΦ^(0) p_S^(0)(Φ^(0)) Θ(L − L̂^(0)(Φ^(0))) − ∫ dΦ^(1) [ p_B^(1)(Φ^(1)) − L p_S^(1)(Φ^(1)) ] δ(L − L̂^(0)(Φ^(0))) − (V_B^(1) − V_S^(1)) L ∫ dΦ^(0) p_S^(0)(Φ^(0)) δ(L − L̂^(0)(Φ^(0))) } + O(α_s^2) .   (2.15)

Both of these distributions have an expansion in α_s, and we denote it in the same way as the underlying distributions on phase space, where

Σ(L) = Σ^(0)(L) + (α_s/2π) Σ^(1)(L) + O(α_s^2) .   (2.16)

This form explicitly demonstrates that we must apply a procedure to map next-to-leading order phase space dΦ^(1) onto leading-order phase space dΦ^(0), as there are leading-order phase space constraints within integrals over next-to-leading order distributions. This map must be IRC safe, which uniquely constrains the map in the soft or collinear regions of phase space. However, away from the singular boundaries, the map is ambiguous, and there are many possible choices. Perhaps the most natural choices are pairwise jet clustering algorithms, like k_T [20][21][22] or Cambridge/Aachen [23,24], and we will study some dependence on the choice of IRC safe map in Sec. 3. For now, however, we will leave such a map implicit.
Using the leading-order cumulative distribution of the likelihood,

Σ^(0)(L) = ∫ dΦ^(0) p^(0)(Φ^(0)) Θ(L − L̂^(0)(Φ^(0))) ,

the contribution to the cumulative distribution at next-to-leading order is

Σ_B^(1)(L) = ∫ dΦ^(1) p_B^(1)(Φ^(1)) [ Θ(L − L̂^(0)(Φ^(0))) − Σ_B^(0)(L) ] − L ∫ dΦ^(1) [ p_B^(1)(Φ^(1)) − L p_S^(1)(Φ^(1)) ] δ(L − L̂^(0)(Φ^(0))) + L p_B^(0)(L) ∫ dΦ^(1) [ p_B^(1)(Φ^(1)) − p_S^(1)(Φ^(1)) ]   (2.17)

for the background, and

Σ_S^(1)(L) = ∫ dΦ^(1) p_S^(1)(Φ^(1)) [ Θ(L − L̂^(0)(Φ^(0))) − Σ_S^(0)(L) ] − ∫ dΦ^(1) [ p_B^(1)(Φ^(1)) − L p_S^(1)(Φ^(1)) ] δ(L − L̂^(0)(Φ^(0))) + L p_S^(0)(L) ∫ dΦ^(1) [ p_B^(1)(Φ^(1)) − p_S^(1)(Φ^(1)) ]   (2.18)

for the signal distribution. Here, we have introduced the notation for the leading-order probability distribution of the likelihood,

p^(0)(L) ≡ ∫ dΦ^(0) p^(0)(Φ^(0)) δ(L − L̂^(0)(Φ^(0))) .   (2.19)

The Inverse Cumulative Distribution of the Likelihood
Unlike the (cumulative) distribution of the likelihood ratio, there does not exist a simple, direct calculation that determines the inverse cumulative distribution, as needed to evaluate the ROC curve. However, we can establish its perturbative expansion through its definition as the functional inverse of the cumulative distribution. Specifically, we require that

Σ(Σ^{-1}(x)) = x ,   (2.20)

and we can then match terms in the perturbative expansion order-by-order to satisfy this requirement. First, we expand the inverse cumulative distribution in powers of α_s as

Σ^{-1}(x) = Σ^{-1,(0)}(x) + (α_s/2π) Σ^{-1,(1)}(x) + O(α_s^2) .   (2.21)

Next, we take its argument to be the perturbative expansion of the cumulative distribution, where

Σ(Σ^{-1}(x)) = Σ^(0)(Σ^{-1,(0)}(x)) + (α_s/2π) [ Σ^(1)(Σ^{-1,(0)}(x)) + p^(0)(Σ^{-1,(0)}(x)) Σ^{-1,(1)}(x) ] + O(α_s^2) .   (2.22)

By definition, note that

Σ^(0)(Σ^{-1,(0)}(x)) = x ,   (2.23)

and so terms at each higher order in α_s must vanish. In particular, this requires that

Σ^(1)(Σ^{-1,(0)}(x)) + p^(0)(Σ^{-1,(0)}(x)) Σ^{-1,(1)}(x) = 0 .   (2.24)

By the inverse function theorem, note that the derivative of the inverse cumulative distribution is just the inverse of the probability distribution of the likelihood ratio,

d Σ^{-1,(0)}(x)/dx = 1/p^(0)(Σ^{-1,(0)}(x)) .   (2.25)

Using this result, the inverse cumulative distribution of the likelihood has the perturbative expansion

Σ^{-1}(x) = Σ^{-1,(0)}(x) + (α_s/2π) Σ^{-1,(1)}(x) + O(α_s^2) ,   (2.26)

where the NLO coefficient is

Σ^{-1,(1)}(x) = −Σ^(1)(Σ^{-1,(0)}(x)) / p^(0)(Σ^{-1,(0)}(x)) .   (2.27)

In these expressions, we assume that the leading-order inverse cumulative distribution, Σ^{-1,(0)}(x), is straightforward to evaluate.
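The inversion formula above, Σ^{-1,(1)}(x) = −Σ^(1)(L_0)/p^(0)(L_0) with L_0 = Σ^{-1,(0)}(x), can be checked numerically on toy functions: the first-order expansion of the inverse should reproduce the exact (root-found) inverse of the perturbed cumulative distribution up to second-order terms. The specific Σ^(0) and Σ^(1) below are illustrative choices:

```python
# Numeric check of the NLO inversion: for Sigma(L) = Sigma0(L) + eps*Sigma1(L),
# the inverse is Sigma^{-1}(x) = L0 - eps*Sigma1(L0)/p0(L0) + O(eps^2), with
# L0 = Sigma0^{-1}(x). The toy functions are illustrative, not the paper's.

eps = 0.01
sigma0 = lambda L: L * L            # LO cumulative on [0, 1]
p0 = lambda L: 2.0 * L              # LO density, dSigma0/dL
sigma1 = lambda L: L * (1.0 - L)    # NLO correction; vanishes at the endpoints

def exact_inverse(x):
    """Invert the full perturbed CDF by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if sigma0(mid) + eps * sigma1(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = 0.49
L0 = x ** 0.5                               # Sigma0^{-1}(x)
approx = L0 - eps * sigma1(L0) / p0(L0)     # first-order inverse
```

The residual between `approx` and the exact inverse scales as eps^2, confirming the order-by-order matching.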

The Perturbative Expansion of the ROC Curve
We now have all of the ingredients in place to construct the perturbative expansion of the ROC curve for the likelihood. The ROC curve is, again,

ROC(x) = Σ_B(Σ_S^{-1}(x)) .   (2.28)

Using the results for the inverse cumulative distribution from above, the ROC curve has the expansion

ROC(x) = Σ_B^(0)(Σ_S^{-1,(0)}(x)) + (α_s/2π) ROC^(1)(x) + O(α_s^2) .   (2.29)

The NLO contribution to the ROC curve of the likelihood ratio is then

ROC^(1)(x) = Σ_B^(1)(Σ_S^{-1,(0)}(x)) − ( p_B^(0)(Σ_S^{-1,(0)}(x)) / p_S^(0)(Σ_S^{-1,(0)}(x)) ) Σ_S^(1)(Σ_S^{-1,(0)}(x)) .   (2.30)

Using the compact results from the end of Sec. 2.1, the NLO correction to the ROC curve takes a compact form. Note that

p_B^(0)(L)/p_S^(0)(L) = L ,

simply from the definition of the likelihood. With this observation, note from Eqs. (2.17) and (2.18) that the terms that contain δ-functions explicitly cancel, leaving only the first terms in each, which contain Θ-functions and the leading-order cumulative distributions of the likelihood. Therefore, the NLO correction to the ROC curve is expressed as

ROC^(1)(x) = ∫ dΦ^(1) p_B^(1)(Φ^(1)) [ Θ(L_x − L̂^(0)(Φ^(0))) − Σ_B^(0)(L_x) ] − L_x ∫ dΦ^(1) p_S^(1)(Φ^(1)) [ Θ(L_x − L̂^(0)(Φ^(0))) − Σ_S^(0)(L_x) ] ,   (2.31)

where L_x ≡ Σ_S^{-1,(0)}(x). A useful rule-of-thumb for the discrimination power of the likelihood is the area under the ROC curve, or the AUC,

AUC = ∫_0^1 dx ROC(x) ,   (2.32)

which also admits a perturbative expansion. The first two contributions are

AUC^(0) = ∫_0^1 dx Σ_B^(0)(Σ_S^{-1,(0)}(x)) ,   (2.33)

AUC^(1) = ∫_0^1 dx ROC^(1)(x) = ∫ dL p_S^(0)(L) [ Σ_B^(1)(L) − L Σ_S^(1)(L) ] .   (2.34)
We note that the sign of the NLO contribution to the AUC is not well-defined, and so higher-order corrections can either increase or decrease the discrimination power as compared to leading order. However, the AUC will necessarily decrease, corresponding to better discrimination power, if the NLO correction to the background cumulative distribution is negative, Σ_B^(1)(L) < 0, and the NLO correction to the signal cumulative distribution is positive, Σ_S^(1)(L) > 0.


Discrimination of H → b b versus g → b b

As a concrete illustration of this framework, we will consider the problem of discrimination of Higgs boson decays from massive gluon splitting to bottom quark pairs, H → b b from g → b b, in highly-boosted jets. This problem has a long history in jet physics; a first solution jump-started the modern jet substructure program [25], and it has seen numerous studies within the context of machine learning more recently, e.g., Refs. [26][27][28][29][30][31][32][33][34][35][36]. It is known that the likelihoods for some discrimination problems are not IRC safe, for example, in hadronic decays of boosted top quarks versus massive QCD jets [17,37], or that IRC safety of the likelihood is only realized when all-orders effects are included, for example, in quark versus gluon jet discrimination [18,19]. However, H → b b versus g → b b discrimination is IRC safe for the following reasons. First, the particle content of the final states is identical, and so divergences associated with additional collinear radiation exactly cancel in the likelihood ratio of squared matrix elements. Second, there is no soft singularity associated with the splitting g → b b, and so its squared matrix element is regular in the soft region of phase space. Finally, the Higgs has a fixed mass, and so a collinear divergence of the b b pair in gluon splitting is regulated by a mass cut.
Because the final states of H → b b and g → b b are identical and indistinguishable, their contributions to the cross section for a given process should, in principle, be summed coherently at the amplitude level. However, the Higgs has a small decay width compared to its mass m_H and, working in the narrow width approximation, cannot interfere with other intermediate states. Further, we will work in the highly-boosted limit, in which the decay or splitting products are all highly collimated into a small angular region. In practice, the region of interest would be defined by a jet algorithm, typically anti-k_T [38], but here, for simplicity and generality, we remain agnostic to the particular jet algorithm employed. Instead, we work in the approximation that the jet energy E is sufficiently large that all particles produced would be captured by any jet algorithm, to leading order in m_H^2/E^2. Thus, our results will capture the universal behavior of this discrimination problem, and sensitivity to realistic jet energy, jet radius, contamination from initial-state radiation, or other effects will be left to future work. Therefore, this analysis represents an idealized limit of the possible discrimination power, given these simplifications.
Even within this simplified framework, there are other ambiguities that arise at next-to-leading order. Discrimination of H → b b decays from g → b b splittings implicitly assumes that jets have already been tagged as containing at least a bottom and an anti-bottom quark. Modern techniques for bottom hadron tagging have extremely impressive performance (see, e.g., Refs. [39,40]), but there are nevertheless some minimal, reasonable requirements on resolution. Specifically, the b and b must lie outside some minimal angle from one another to be distinguishable. At next-to-leading order, the matrix element for H → b bg decay has no divergence associated with the collinear b b limit, and so any angular cut can be ignored, to leading power in the value of the cut angle. However, the splitting g → b bg does have a b b collinear divergence, which correspondingly must be regulated. Inspired by Refs. [41,42], our procedure here is to simply require that the clustered b b pair has the largest invariant mass, or, equivalently, is clustered by the JADE algorithm [43,44] at the very last step. This constraint obviously has no effect at leading order, and at next-to-leading order it regulates the b b collinear divergence, so it accomplishes our needs here. However, this naive counting of bottomness flavor in each subjet is not IRC safe beyond next-to-leading order [45], and so a generalization would require a well-defined flavor prescription. There are several recent examples of IR and IRC safe flavor algorithms [46][47][48], and we leave a detailed study of their potential consequences for identification of H → b b decays to future work.

Discrimination at Leading Order
We begin by establishing the baseline discrimination power of these two processes at leading order. First, working in the highly-boosted limit, we note that leading-order collinear phase space is simply

dΦ^(0) ∝ ds dz δ(s − m_H^2) ,   (3.1)

where s is the invariant mass of the two final-state particles and z ∈ [0, 1/2] is the energy fraction of the softer of the two, as they are otherwise indistinguishable. We have included the constraint on the mass of the Higgs directly in the phase space. The squared matrix element for H → b b decay is flat, and so its normalized probability distribution on phase space is

p_H^(0)(z) = 2 ,   (3.2)

which integrates to 1 on phase space dΦ^(0) with z ∈ [0, 1/2]. The collinear splitting process g → b b is governed by the corresponding splitting function [49][50][51][52][53], represented by the leading-order squared matrix element proportional to

P_{g→b b}(z) = T_R [ z^2 + (1 − z)^2 ] ,   (3.3)

where T_R = 1/2 is the normalization of the matrices in the fundamental representation of SU(3) color. We track the complete prefactors of the matrix element because the normalization of the next-to-leading order contribution will depend on the prefactors at leading order, which we will address in the next section. The normalized distribution for g → b b splitting is then

p_g^(0)(z) = 3 [ z^2 + (1 − z)^2 ] ,   (3.4)

which integrates to 1 on dΦ^(0). The likelihood ratio is then just the ratio of the distributions,

L̂^(0)(z) = p_g^(0)(z)/p_H^(0)(z) = (3/2) [ z^2 + (1 − z)^2 ] .   (3.5)

Note that this is monotonic on z ∈ [0, 1/2], and so an equally good classifier is the energy fraction 1 − z itself. We choose 1 − z instead of z because both L̂^(0) and 1 − z are decreasing on z ∈ [0, 1/2]. This simplifies the calculation of the ROC curve to this order, as the cumulative distribution of 1 − z on Higgs events is

Σ_H^(0)(c) = 2c − 1 ,   (3.6)

for c ∈ [1/2, 1], and so its inverse is

Σ_H^{-1,(0)}(x) = (1 + x)/2 .   (3.7)

Figure 1. Left: Leading-order probability distributions of the energy fraction for g → b b splitting (solid green) and H → b b decay (dashed red). Right: Leading-order ROC curve of gluon fraction as a function of Higgs fraction (solid green) as compared to random (dashed gray).
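These leading-order ingredients can be sanity-checked directly, assuming only the flat Higgs distribution p_H(z) = 2 and the gluon splitting distribution p_g(z) = 3(z^2 + (1 − z)^2) on z ∈ [0, 1/2]:

```python
# Sanity checks on the leading-order ingredients: normalization of
# p_H(z) = 2 and p_g(z) = 3(z^2 + (1-z)^2) on z in [0, 1/2], and the
# likelihood ratio L(z) = p_g/p_H = (3/2)(z^2 + (1-z)^2), which decreases
# monotonically from 3/2 at z = 0 to 3/4 at z = 1/2.

p_H = lambda z: 2.0
p_g = lambda z: 3.0 * (z * z + (1.0 - z) ** 2)
lik = lambda z: p_g(z) / p_H(z)

n = 20000
dz = 0.5 / n
zs = [(i + 0.5) * dz for i in range(n)]
norm_H = sum(p_H(z) for z in zs) * dz   # midpoint rule; -> 1
norm_g = sum(p_g(z) for z in zs) * dz   # midpoint rule; -> 1
```

Monotonicity of the likelihood is what allows the energy fraction alone to serve as the optimal classifier at this order.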
The cumulative distribution for the gluon is

Σ_g^(0)(c) = 1 − 2(1 − c)^3 + 3(1 − c)^2 − 3(1 − c) ,   (3.8)

for c ∈ [1/2, 1]. The ROC curve to this order is therefore

ROC^(0)(x) = Σ_g^(0)(Σ_H^{-1,(0)}(x)) = (3x + x^3)/4 .   (3.9)

We have plotted the probability distributions and the ROC curve in Fig. 1.
The leading-order AUC is just as easy to calculate, with

AUC^(0) = ∫_0^1 dx (3x + x^3)/4 = 7/16 ≈ 0.44 ,   (3.10)

slightly smaller than the AUC of completely random event selection, 0.5. Compared to the AUCs for other discrimination problems calculated at lowest order, this is rather large. For example, the AUC for quark versus gluon jet discrimination at leading-logarithmic accuracy is [54]

AUC_{q vs. g} = 1/(1 + C_A/C_F) = 4/13 ≈ 0.31 ,   (3.11)

determined by the ratio of the adjoint, C_A, to the fundamental, C_F, quadratic Casimirs of SU(3) color.
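The closed forms above can be cross-checked by direct integration over the energy fraction: cutting z > z_c gives Higgs efficiency x = 1 − 2 z_c and gluon efficiency (3x + x^3)/4, whose area is 7/16; the quark/gluon comparison assumes the leading-logarithmic ROC form ROC(x) = x^(C_A/C_F):

```python
# Cross-check of the leading-order ROC curve and AUC by direct integration
# over z, plus the leading-logarithmic quark/gluon AUC from the assumed
# form ROC(x) = x**(C_A/C_F) with C_A/C_F = 3/(4/3) = 9/4.

def gluon_eff(z_c):
    # integral of 3(z^2 + (1-z)^2) over [z_c, 1/2] = 1 - (2 z_c^3 - 3 z_c^2 + 3 z_c)
    return 1.0 - (2.0 * z_c ** 3 - 3.0 * z_c ** 2 + 3.0 * z_c)

def roc_lo(x):
    z_c = (1.0 - x) / 2.0       # Higgs efficiency x = 1 - 2 z_c
    return gluon_eff(z_c)       # equals (3x + x^3)/4

n = 20000
auc_lo = sum(roc_lo((i + 0.5) / n) for i in range(n)) / n        # -> 7/16
auc_qg = sum(((i + 0.5) / n) ** 2.25 for i in range(n)) / n      # -> 4/13
```

Both quadratures agree with the quoted analytic values to high accuracy.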

Discrimination at Next-to-Leading Order
With the analysis at leading order established, we move to next-to-leading order. From the general results for the likelihood distributions and ROC curve at NLO from Sec. 2, there are two requirements for their practical calculation. First, we need an IRC safe map from NLO to LO phase space, so we can evaluate the leading-order cumulative distribution on the distribution of particles at next-to-leading order. Second, while the likelihood ratio is IRC safe, there will be divergences at intermediate steps of the calculation that must be appropriately isolated and subtracted to render the NLO contribution finite. We will describe our approach to both of these issues and present numerical results for the ROC curve for H → b b versus g → b b discrimination at next-to-leading order in this section.

Map from NLO to LO Phase Space
For the NLO-to-LO phase space map, we need NLO phase space in the first place. In the highly-boosted limit with fixed total invariant mass, differential three-body phase space can be expressed as [55,56]

dΦ^(1) = (4/(4π)^5) ( dz_b dz_b̄ dz_g ds_bb̄ ds_bg ds_b̄g / √(−Δ) ) δ(1 − z_b − z_b̄ − z_g) δ(m_H^2 − s_bb̄ − s_bg − s_b̄g) Θ(−Δ) ,   (3.12)

where Δ is the Gram determinant that restricts the invariants and energy fractions to the physical region. Here, s_ij is the invariant mass of particles i and j, and z_i is the energy fraction of particle i.
We will present the explicit form of the squared NLO matrix elements shortly, but even without them, we already know their infrared divergent structure and therefore what the NLO-to-LO map must regulate. Because we enforce a lower bound on the invariant mass of the b b pair, it has no collinear divergence, and so the only divergences arise when the gluon becomes collinear to either the b or the b, or when the gluon becomes soft. In the collinear limits, the NLO-to-LO map must cluster the collinear particles first for IRC safety. Thus, the simplest such NLO-to-LO map is just that from a sequential jet clustering algorithm, and here we will focus on the k_T family of algorithms. Specifically, we will study the Cambridge/Aachen (C/A) [23,24] algorithm. With the additional JADE algorithm requirement on the b b pair, the C/A mapping can be expressed in these phase space coordinates as

z = min(z_b + z_g, z_b̄) Θ(s_b̄g z_b − s_bg z_b̄) + min(z_b̄ + z_g, z_b) Θ(s_bg z_b̄ − s_b̄g z_b) ,   (3.13)

where we have used that the squared angle between particles i and j in the collinear limit is θ_ij^2 ∝ s_ij/(z_i z_j E^2) at fixed jet energy E. Here, z is the energy fraction of the softer of the two subjets after reclustering, which is also the one non-trivial LO phase space coordinate. Again, because any possible collinear divergence of the b b pair has been regulated by JADE, we do not need to consider clustering them together to return finite results at NLO. The clustering constraints can be expressed in the general form

Θ(s_b̄g g_alg(z_b̄, z_g) − s_bg g_alg(z_b, z_g)) ,   (3.14)

where g_alg is a function of energy fractions that determines the specific form of the IRC safe clustering algorithm; for C/A, g_alg(z_i, z_j) = 1/(z_i z_j).
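The C/A map just described can be sketched in a few lines: the gluon is first clustered with whichever quark it is closer to in angle, using θ_ij^2 ∝ s_ij/(z_i z_j) in the collinear limit, and the LO coordinate z is the energy fraction of the softer resulting subjet. The function and variable names here are our own:

```python
# Minimal sketch of the C/A NLO-to-LO map: cluster the gluon with the
# angularly nearer quark (theta_ij^2 ~ s_ij/(z_i z_j) in the collinear
# limit), then return the softer-subjet energy fraction z in [0, 1/2].
# Names are illustrative, not the paper's notation.

def lo_energy_fraction(z_b, z_bbar, z_g, s_bg, s_bbarg):
    if s_bg / (z_b * z_g) < s_bbarg / (z_bbar * z_g):
        hard = z_b + z_g           # gluon clusters with the b first
    else:
        hard = z_bbar + z_g        # gluon clusters with the anti-b first
    return min(hard, 1.0 - hard)   # softer subjet's energy fraction
```

In the soft-gluon limit z_g → 0 the map smoothly returns min(z_b, z_bbar), and in either quark-collinear limit it merges the collinear pair first, which is exactly the IRC safety requirement on the map.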

Subtraction Method for Numerical NLO Evaluation
From the expression for the cumulative distribution of the likelihood ratio in Eq. (2.11), the basic object that appears at next-to-leading order takes the form of either

∫ dΦ^(1) p^(1)(Φ^(1)) [ Θ(L − L̂^(0)(Φ^(0))) − Σ^(0)(L) ] ,   (3.15)

or its derivative with respect to L. The leading-order cumulative distributions can be directly evaluated using the results of the previous section. We have

Σ_H^(0)(L) = (4L/3 − 1)^{1/2} ,   Σ_g^(0)(L) = (3/4) (4L/3 − 1)^{1/2} + (1/4) (4L/3 − 1)^{3/2} ,

for L ∈ [3/4, 3/2]. As it is by itself IRC safe, we will consider Eq. (3.15) as the fundamental object we will calculate at NLO, and then from it construct the full NLO cumulative distributions according to the prescriptions established earlier.
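Inverting the leading-order likelihood L(z) = (3/2)(z^2 + (1 − z)^2) on z ∈ [0, 1/2] gives the closed forms Σ_H(L) = √(4L/3 − 1) and Σ_g(L) = (3u + u^3)/4 with u = √(4L/3 − 1), which can be verified against direct integration over z:

```python
# Verify the LO cumulative distributions of the likelihood against direct
# integration: Sigma(L) is the probability that (3/2)(z^2+(1-z)^2) < L,
# with Higgs density 2 and gluon density 3(z^2+(1-z)^2) on z in [0, 1/2].

import math

def sigma_direct(L, density, n=20000):
    dz = 0.5 / n
    total = 0.0
    for i in range(n):
        z = (i + 0.5) * dz
        if 1.5 * (z * z + (1.0 - z) ** 2) < L:
            total += density(z) * dz
    return total

u = lambda L: math.sqrt(4.0 * L / 3.0 - 1.0)
sigma_H = lambda L: u(L)                             # Higgs closed form
sigma_g = lambda L: (3.0 * u(L) + u(L) ** 3) / 4.0   # gluon closed form
```

Both closed forms vanish at L = 3/4 and reach 1 at L = 3/2, as required of cumulative distributions over the likelihood's range.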
The phase space constraints of the NLO-to-LO map are complicated enough that there is no hope of evaluating this analytically, so we must do at least some of the phase space integrals numerically. IRC safety, however, is not enough to ensure that intermediate steps of a calculation are free of divergences that ultimately cancel at the end. Thus, as written, with the expected divergences in the NLO distribution, Eq. (3.15) cannot be naively numerically evaluated. We can, then, add and subtract to the integrand a term that exactly cancels the divergences, which we denote as p^(1,sing)(Φ^(1)), in a way analogous to familiar subtraction methods, like the ERT [57], FKS [58], or Catani-Seymour [59] methods. We now have

∫ dΦ^(1) p^(1)(Φ^(1)) [ Θ(L − L̂^(0)(Φ^(0))) − Σ^(0)(L) ] = ∫ dΦ^(1) [ p^(1)(Φ^(1)) − p^(1,sing)(Φ^(1)) ] [ Θ(L − L̂^(0)(Φ^(0))) − Σ^(0)(L) ] + ∫ dΦ^(1) p^(1,sing)(Φ^(1)) [ Θ(L − L̂^(0)(Φ^(0))) − Σ^(0)(L) ] .

Now, the first integrand is explicitly finite over all of phase space and can be evaluated with standard Monte Carlo methods. Much of the integral on the final line, exclusively with the singular subtraction matrix element, can be evaluated analytically in dimensional regularization and the divergences explicitly removed. For brevity and clarity of presentation, we present the details of this calculation in App. A.
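The add-and-subtract logic can be illustrated in one dimension: to integrate a 1/z singularity against a smooth function in the plus-distribution sense, subtract the singular limit pointwise so the numerical integrand is finite, and add back the subtraction's (here vanishing) analytic integral. The test function f is an arbitrary smooth choice, not a physical matrix element:

```python
# One-dimensional sketch of the subtraction method: the plus-distribution
# integral of f(z) against [1/z]_+ on [0, 1] equals the integral of
# (f(z) - f(0))/z, whose integrand is finite as z -> 0 and therefore safe
# for naive numerical quadrature; the subtracted singular piece integrates
# to zero analytically, mimicking the cancellation against virtual terms.

def plus_dist_integral(f, n=100000):
    dz = 1.0 / n
    total = 0.0
    for i in range(n):
        z = (i + 0.5) * dz
        total += (f(z) - f(0.0)) / z * dz   # finite integrand everywhere
    return total

# example: f(z) = 1 + z gives the integral of z/z = 1 over [0, 1]
val = plus_dist_integral(lambda z: 1.0 + z)
```

In the full calculation the role of f(0) is played by the soft and collinear limits of the real-emission matrix elements, and the subtracted piece is integrated analytically in dimensional regularization.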
For the signal and background processes considered here, together with the requirement on b-tagging, the leading-power singular distribution p^(1,sing)(Φ^(1)) is constructed from the strongly-ordered soft and collinear limits of the corresponding real-emission matrix elements; its explicit form and integrals are collected in App. A.

NLO Matrix Elements
The final piece needed to numerically evaluate the likelihood distributions at NLO are the squared matrix elements at NLO themselves. For Higgs decay, the real, three-body decay matrix element can be straightforwardly calculated to be

|M(H → b bg)|^2 = |M(H → b b)|^2 (8πα_s C_F/m_H^2) [ s_bg/s_b̄g + s_b̄g/s_bg + 2 (s_bb̄ + s_bg)(s_bb̄ + s_b̄g)/(s_bg s_b̄g) ] ,

where C_F = 4/3 is the fundamental quadratic Casimir of SU(3) color. This relationship to the leading-order matrix element then means that the properly-normalized contribution to the phase space distribution at NLO is

p_H^(1)(Φ^(1)) = 2 (4π)^4 (C_F/m_H^2) [ s_bg/s_b̄g + s_b̄g/s_bg + 2 (s_bb̄ + s_bg)(s_bb̄ + s_b̄g)/(s_bg s_b̄g) ] .

Note, of course, that this matrix element is Lorentz-invariant and applies in any frame, including the highly-boosted collinear frame of our analysis.
For high-energy gluon splitting, the matrix element at leading power in the collinear limit takes the general form

|M(g → b bg)|^2 ∝ C_F P^ab_{g→b bg}(Φ^(1)) + C_A P^nab_{g→b bg}(Φ^(1)) ,

where P^ab_{g→b bg}(Φ^(1)) and P^nab_{g→b bg}(Φ^(1)) are the Abelian and non-Abelian splitting functions, as defined by their color prefactor. Here, C_A = 3 is the adjoint quadratic Casimir of SU(3) color. With the normalization of the leading-order distribution, the properly-normalized NLO contribution is then

p_g^(1)(Φ^(1)) = 3 (4π)^4 (1/m_H^2) [ C_F P^ab_{g→b bg}(Φ^(1)) + C_A P^nab_{g→b bg}(Φ^(1)) ] .   (3.25)

The individual collinear splitting functions can be found in Refs. [60,61]. Unlike the Higgs decay matrix element, these splitting functions only apply in the highly-boosted regime and have explicit dependence on non-Lorentz-invariant quantities, the particle energy fractions. This is of course because the gluon is massless and any non-zero mass ascribed to it only arises because of its high-energy fragmentation.
One issue with the non-Abelian splitting function that must be dealt with is that, in addition to its physical infrared divergences, it also has an unphysical "collinear" divergence when the angle of the gluon to either the b or b diverges, θ_bg^2, θ_b̄g^2 → ∞, at fixed s_bg, s_b̄g. This divergence exists because the color-octet gluon must recoil against the rest of the event, which itself carries a net octet color, and so the gluon can be preferentially emitted collinear to the rest of the event. Within our high-boost, collinear approximation, this is of course not accurately described, and anyway there will always be some finite jet radius that separates the jet from the rest of the event. Our solution to regulate this divergence is therefore to impose a maximal angle cut on the emitted gluon, where we demand that the squared angle of the gluon from either the b or b is no larger than 8 times the squared angle between the b and b themselves:

θ_bg^2, θ_b̄g^2 < 8 θ_bb̄^2 .

For reference, for jets in experiment with radius R ∼ 1, this roughly corresponds to identification of Higgs bosons with an energy about 5 to 6 times the Higgs mass, E ∼ 700 GeV. At any rate, the results are relatively weakly dependent on the particular value of this cut; varying it by a factor of two changes the values of the integrals at NLO by about 10% (i.e., the infinite-angle divergence is logarithmic). Additionally, because the Abelian splitting function proportional to C_F is integrable as the angles diverge, this constraint has a negligible effect on that component of the matrix element. In the truly infinite-boost limit, but with finite, non-zero jet radius, the angle between the b and b goes to 0.
As the Higgs is a color singlet, radiation from the b b dipole would also be restricted to a region of 0 angular size, assuming there is no contamination emitted from outside of the jet. By contrast, because the gluon is a color octet, it would still emit radiation at all angles throughout the jet. Thus, strictly speaking, in the infinite-boost limit, there would be sufficient information for perfect discrimination between Higgs decays and gluon fragmentation to bottom quarks. This distinction of radiation patterns in the highly-boosted limit was recently exploited in the construction of the jet color ring observable [62]. Therefore, for realistic and useful predictions for discrimination, we must assume that the energy of the jet is finite. As mentioned above, the dependence of the discrimination power on the boost is logarithmic, and so exhibits mild variation over experimentally-accessible jet energies.

Numerical Results
With these accumulated results, we can evaluate all necessary integrals to establish the cumulative distribution of the likelihood ratio on Higgs and gluon events. We perform numerical integration with the implementation of VEGAS [63,64] in the CUBA libraries [65]. The first thing we show is the functional form of the basic object appearing at next-to-leading order, Eq. (3.15), on both Higgs and gluon events. These are displayed in Fig. 2 over the range L ∈ [3/4, 3/2], the range of the leading-order likelihood ratio. There are a few properties to note. First, and most importantly, both integrals vanish at the endpoints L = 3/4, 3/2, which ensures that the probability distributions of the likelihood ratio through NLO remain unit normalized. Next, note that the gluon integral is negative over the domain of the likelihood, while the Higgs integral is positive over (nearly all of) the domain. Thus, in the probability distributions of the likelihood ratio, which are proportional to the derivatives of these integrals, effects at NLO tend to push the gluon distribution to larger L (where its derivative is positive) and the Higgs distribution to smaller L. This is a direct consequence of the flow of color: the color-octet gluon radiates more than the color-singlet Higgs boson, especially at wide angles, and this feature significantly improves discrimination power. We do expect, however, that contributions from next-to-next-to-leading order and higher are relatively small modifications of the NLO distribution because there are no qualitatively new physical effects that arise.
From these likelihood distributions, we can then evaluate the corresponding ROC curves through next-to-leading order, plotted in Fig. 4. This unambiguously demonstrates that discrimination is dramatically improved at NLO, and is starting to be comparable to results from machine learning a classifier on simulated data (see, e.g., Ref. [36]). To quantify the discrimination power in a single number, we can evaluate the AUC through next-to-leading order, where we find a value about a factor of 3 smaller than the leading-order result.

Conclusions
For binary classification problems in jet physics that admit a perturbative expansion in the strong coupling, α_s, we have formulated discrimination with the likelihood ratio through next-to-leading order. Through this analysis, we have explicitly identified the properties necessary for an infrared and collinear safe classifier, and applied the general results to the concrete example of discriminating Higgs decays from gluon splitting to bottom quarks in the highly-boosted limit. Next-to-leading order effects are vital for any qualitative understanding of the classifier because that is the first order at which sensitivity to the flow of color in the jets appears. In the true infinite-boost limit, and ignoring contamination, the classifier is perfect because only gluons will emit radiation at finite angles from the b b̄ pair. At high but experimentally accessible boosts at the Large Hadron Collider, these radiation effects are substantial, and decrease the area under the ROC curve by about a factor of 3 from the leading-order prediction. Only starting at next-to-leading order are these predictions qualitatively close to corresponding results from machine learning studies on simulated data.
This property also demonstrates that predictions at next-to-next-to-leading order (NNLO) are necessary for an accurate uncertainty estimate. This is similar to the giant K-factors that arise in studies of vector boson production [25, 67-69]. In that case, naive scale variation fails to account for next-to-leading order effects because, starting at next-to-leading order, there are qualitatively new processes that contribute, corresponding to distinct initial states, whose presence cannot be estimated by simple scale variation alone. Here, the "giant K-factor" arises because, starting at next-to-leading order, there are qualitatively new diagrams corresponding to the presence of radiation off of the hard, Born-level particles. Within the framework of this highly-boosted analysis, one could continue by including contributions from four-particle final states, as represented by, for example, 1 → 4 splitting functions. While these have been recently calculated [70, 71], it would be simpler, and likely a good approximation, to use strongly-ordered splitting functions, as employed here in Eq. (3.19), for practical numerical integration. The importance of these higher-order contributions would then establish the robustness of discrimination at NLO.
While the decay of the Higgs to bottom quarks has been observed [72-74], the decay of Higgs bosons to gluons has not, even though its branching fraction is about 10%. H → gg shares many features with H → b b̄ decays, but the lack of bottom quarks that can be identified and tagged means that backgrounds for H → gg are overwhelmingly large. Even though gluons from Higgs decay must be in a color-singlet combination, their larger individual color charge renders their corresponding jets "fuzzier" than those of quarks, and standard observables sensitive to the flow of color are much less sensitive [62, 75, 76]. At least for discrimination of H → gg decays from g → gg splitting, the IRC safe likelihood analysis here could provide necessary insight into both strict upper bounds on performance and new, useful observables for this problem. However, g → gg splitting has a soft divergence, unlike g → b b̄, and thus, even at leading order, there will be logarithmic sensitivity to the jet energy and radius that regulate the soft, wide-angle divergence.
Because many discrimination problems do not admit a likelihood ratio that is IRC safe, such as hadronic top quark decay [17], this analysis may seem of limited utility. However, in the top quark decay example, many of the leading-order final states initiated by QCD jets are dramatically different from the top. For example, while the decay t → b q q̄′ can look experimentally identical to, say, g → b b̄ g when the b is mistagged, theoretically these final states are distinguishable. This suggests a way forward: establish the ultimate theoretical performance by identifying background subprocesses that produce a final state as close as possible to that of the signal. Assuming perfect bottom hadron tagging, there is only one process initiated by light(er) QCD partons that mimics top decay; namely, b → b q q̄, the high-energy fragmentation of a bottom quark to three quarks. Both the top and its daughter W boson decay on-shell in the narrow-width approximation, and correspondingly imposing the top mass constraint on the background final state b q q̄ and the W mass constraint on the sub-state q q̄ completely eliminates infrared divergences and renders the likelihood ratio IRC safe. A theoretical analysis of top tagging with this approach might then quantitatively establish a hard upper bound on performance, and explain the saturation of discrimination from machine learning [77].
To isolate and subtract the divergences, we re-write phase space in angle-energy fraction coordinates for dΦ^(1). Here, E is the energy of the jet, ϕ is the azimuthal angle of the gluon about the line that passes through the b b̄ pair, and we have used the law of cosines in the invariant mass δ-function. In these coordinates, the distribution and clustering constraints can be written explicitly. We have left the angle θ²_bg implicit here for compactness. Also, note that we have explicitly fixed the LO phase space coordinate to be the energy fraction z_b.
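The invariant-mass relation enforced by the δ-function in these coordinates can be checked numerically: for two massless particles with energies z₁E and z₂E separated by opening angle θ, s = 2 z₁ z₂ E² (1 − cos θ), which reduces to z₁ z₂ E² θ² in the collinear limit. A minimal sketch (the kinematic values are illustrative assumptions, not values from the text):

```python
import math

def pair_invariant_mass_sq(z1, z2, E, theta):
    """Invariant mass squared of two massless particles with energies
    z1*E and z2*E separated by opening angle theta (exact law of cosines
    for lightlike momenta: s = 2 p1·p2 = 2 E1 E2 (1 - cos theta))."""
    return 2.0 * z1 * z2 * E**2 * (1.0 - math.cos(theta))

# Toy values: a 500 GeV jet, energy fractions z_b and z_g, a small angle.
E, zb, zg, theta = 500.0, 0.6, 0.1, 0.05
s_exact = pair_invariant_mass_sq(zb, zg, E, theta)
s_coll = zb * zg * E**2 * theta**2  # small-angle (collinear) approximation
```

At this angle the collinear approximation agrees with the exact relation to better than a part in a thousand, which is why the angle-energy fraction coordinates make the collinear divergence explicit.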
Combining these results, the integral of p^(1,sing)(Φ^(1)) over this phase space takes the form of Eq. (A.16), in which the collinear divergence is now explicit. We can isolate the divergence using the +-function expansion. The +-functions are defined to integrate to 0 on [0, m²_H/E²]; when integrated against a function g(x) that is analytic about x = 0, they are defined by the usual subtraction at x = 0. With this expansion, we can then perform the integration, keeping only those terms necessary through order ϵ⁰, arriving at Eq. (A.20). The divergent integral can then be explicitly evaluated. Note that the residual scale dependence establishes that the natural scale at which to evaluate α_s(µ²) is the Higgs mass, µ² = m²_H. The integral with the +-function can be massaged into a form in which both integrals are explicitly finite and can be integrated with standard Monte Carlo methods. Further, angles can be rescaled by m²_H/E² to eliminate any explicit dependence on either the Higgs mass or the jet energy.
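The +-prescription described here can be verified numerically after rescaling the physical interval [0, m²_H/E²] to the unit interval: integrating [1/x]_+ against an analytic g(x) amounts to subtracting g(0), which renders the x → 0 endpoint finite, and any constant integrates to zero. A minimal sketch:

```python
def plus_integral(g, n=200_000):
    """Numerically evaluate  ∫_0^1 [1/x]_+ g(x) dx = ∫_0^1 (g(x) - g(0))/x dx.

    The subtraction of g(0) renders the x -> 0 endpoint finite, which is
    exactly how the +-prescription makes the integral amenable to
    standard Monte Carlo (here: midpoint-rule) evaluation.
    """
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h  # midpoint rule avoids evaluating at x = 0
        total += (g(x) - g(0.0)) / x * h
    return total

# g(x) = 1 + x: the subtracted integrand is identically 1, so the answer is 1.
val = plus_integral(lambda x: 1.0 + x)
# A constant test function integrates to exactly zero against [1/x]_+.
val0 = plus_integral(lambda x: 2.0)
```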
Putting these results together and explicitly subtracting divergences, we find that

A.3 Abelian Soft Term
The Abelian soft term, the term that would exist in electromagnetism, has both soft and collinear divergences, and so will need to be regulated both in the angle θ²_bg and in the energy fraction z_g. To do this, we again express phase space in angle-energy fraction coordinates. We have already discussed the +-function expansion for the angle θ²_bg; we now introduce the +-function expansion for the energy fraction. The corresponding +-functions are defined to integrate to 0 on z_g ∈ [0, 1 − z_b], with the analogous subtraction against an analytic function g(z_g). Using the +-functions, the integral separates into several contributions; the third contribution subtracts the overlap of the double pole in ϵ from the first two contributions. For the k_T-type algorithms that we consider here, clustering with zero-energy particles reduces to a restriction on relative angles exclusively. Performing the integrals with the δ-functions, we have combined the collinear divergent term (with the δ(θ²_bg) factor) and the soft-and-collinear term (with both δ(θ²_bg) and δ(z_g) factors). The final integral is explicitly finite, and so can be evaluated with standard Monte Carlo methods. The second integral can be simplified by rescaling the angle.

This term has a soft divergence, z_g → 0, that must be regulated. In angle-energy fraction coordinates, the integral over phase space of this term is given in Eq. (A.43), and isolating the divergent piece yields Eq. (A.44). To separate angle and energy factors, we rescale both angles, after which the divergent term simplifies. As discussed in the previous section, the ϵ expansion of the factor z_b^ϵ (1 − z_b)^(−ϵ) has no term at order ϵ that contributes to the cumulative distribution of the likelihood. As such, this entire term is actually independent of z_b,
and so can be safely ignored. Therefore, the only contribution to this integral is

Figure 1: Left: Leading-order probability distributions in the energy fraction z for g → b b̄

Here, we have explicitly subtracted the soft-collinear region from both the bg and b̄g collinear regions, leaving it in the soft function contribution. There is no b b̄ collinear region because of the flavor tagging constraints. The soft contribution depends on the flow of color in the initial state, where, for the H → b b̄ and g → b b̄ processes, we have, respectively, the eikonal factors of Eq. (3.21); the recoverable g → b b̄ piece includes a color-octet term proportional to C_A (z_b s_b̄g + z_b̄ s_bg − z_g s_bb̄)/(z_g s_bg s_b̄g). (3.21)

Figure 3: Plots of the probability distribution of the likelihood ratio L at leading- (solid) and next-to-leading (dashed) order for Higgs events (left) and gluon events (right), both clustered with the C/A algorithm. At next-to-leading order, we use the central value α_s(m²_H) = 0.113, and the shaded bands represent sensitivity to variation of the scale of the coupling by a factor of 2.

Figure 4: Plots of the ROC curves for Higgs decays to bottom quarks vs. gluon splitting to bottom quarks at leading (solid) and next-to-leading (dashed) orders. At next-to-leading order, we use the central value α_s(m²_H) = 0.113, and the shaded bands represent sensitivity to variation of the scale of the coupling by a factor of 2. A random classifier is represented by the dashed gray line.