Gaining (Mutual) Information about Quark/Gluon Discrimination

Discriminating quark jets from gluon jets is an important but challenging problem in jet substructure. In this paper, we use the concept of mutual information to illuminate the physics of quark/gluon tagging. Ideal quark/gluon separation requires only one bit of truth information, so even if two discriminant variables are largely uncorrelated, they can still share the same"truth overlap". Mutual information can be used to diagnose such situations, and thus determine which discriminant variables are redundant and which can be combined to improve performance. Using both parton showers and analytic resummation, we study a two-parameter family of generalized angularities, which includes familiar infrared and collinear (IRC) safe observables like thrust and broadening, as well as IRC unsafe variants like $p_T^D$ and hadron multiplicity. At leading-logarithmic (LL) order, the bulk of these variables exhibit Casimir scaling, such that their truth overlap is a universal function of the color factor ratio $C_A/C_F$. Only at next-to-leading-logarithmic (NLL) order can one see a difference in quark/gluon performance. For the IRC safe angularities, we show that the quark/gluon performance can be improved by combining angularities with complementary angular exponents. Interestingly, LL order, NLL order, Pythia 8, and Herwig++ all exhibit similar correlations between observables, but there are significant differences in the predicted quark/gluon discrimination power. For the IRC unsafe angularities, we show that the mutual information can be calculated analytically with the help of a nonperturbative"weighted-energy function", providing evidence for the complementarity of safe and unsafe observables for quark/gluon discrimination.


Introduction
Jets are collimated sprays of hadrons that act as proxies for short-distance quarks and gluons. Because quarks and gluons have different color charges, they have different showering and fragmentation patterns, and one can exploit this information to discriminate quark-initiated jets from gluon-initiated jets on a statistical basis. Quark/gluon discrimination is one of the key goals of the jet substructure community [1][2][3]. A number of quark/gluon tagging methods have been pursued [4][5][6][7][8][9], with corresponding performance studies [10][11][12] at the Large Hadron Collider (LHC). Two seemingly conflicting themes have emerged from these quark/gluon discrimination studies (as well as from other tagging studies). An optimistic theme is that tagging performance can be substantially improved by combining multiple jet substructure observables, as advocated in Refs. [4,5]. A more pessimistic theme is that even if two discriminant variables are largely uncorrelated, their joint performance may not be much better than their individual performance. These dueling themes can be seen by comparing the results of recent tagging studies of boosted W bosons [13,14]. Since quark/gluon discrimination has so many potential physics applications, it is essential to understand why both of these themes can be true.
To achieve this goal, we pursue a twofold approach in this paper. First, we use the concept of "mutual information" to illuminate the statistical aspects of quark/gluon tagging. Mutual information characterizes the correlations between variables by counting the number of shared bits of information. 1 Ideal quark/gluon tagging requires just one bit of "truth" information (e.g. 0 = quark and 1 = gluon), so even if a variable has many bits of total information, those bits may or may not have much overlap with the truth. For an observable a and an equal admixture of quarks and gluons, the mutual information with the truth is I(T ; A) = da p q (a) 2 log 2 p q (a) p tot (a) + p g (a) 2 log 2 p g (a) p tot (a) , (1.1) where p q (p g ) is the probability distribution for quarks (gluons), and p tot = (p q + p g )/2. Since 0 ≤ I(T ; A) ≤ 1, we will sometimes refer to it as the "truth overlap". In essence, the conflicting themes above can be traced to the difference between the total information (measured by e.g. the Shannon entropy) and relevant information for quark/gluon discrimination (measured by I(T ; A)). Second, we will introduce a two-parameter family of discriminant variables to illuminate the physics aspects of quark/gluon tagging. We will call them "generalized angularities", which depend not only on an angular exponent β ≥ 0, but also on an energy weighting factor κ ≥ 0. They are defined as 1 To our knowledge, the only use of mutual information in the particle physics literature is Ref. [15], though it has been discussed recently in Ref. [16] as a robust measure of correlations. Elsewhere in high energy physics, mutual information is used in the study of (holographic) entanglement entropy (see e.g. [17][18][19][20] Figure 1: Visualization of the space of observables λ κ β , which includes several well-known jet observables used in quark/gluon discrimination: the line κ = 1 corresponds to the IRC safe angularities e β , the origin (β, κ) = (0, 0) to multiplicity, and (0,2) to p D T . Here, "width" at (1,1) refers also to broadening and girth, and "mass" at (2,1) refers to jet-mass-squared divided by energy (i.e. thrust).
where z i is the momentum fraction of particle i, R i is its rapidity/azimuth angle to a suitable axis, 2 and R 0 is the jet radius. These variables are infrared and collinear (IRC) safe only for κ = 1. As shown in Fig. 1, certain values of (β, κ) correspond to well-known observables: (0, 0) is particle multiplicity, (0, 2) is p D T [7,8,11], and the line (β, 1) are the (recoil-free) angularities e β [22,[28][29][30], including broadening/width/girth at β = 1 [24,31,32] and thrust at β = 2 [33]. 3 We will present analytic calculations and parton shower simulations to understand the quark/gluon discrimination power of the λ κ β variables. Our analytic calculations build on previous work calculating the substructure of quark and gluon jets [9,29,30,[34][35][36] and well as calculating angularities in e + e − event shapes [22,28,37]. For the IRC safe angularities (κ = 1), we will be able to analytically study the correlations between two angularities e α and e β up to next-to-leading logarithmic (NLL) accuracy. For the IRC unsafe angularities (κ = 1), we can use the techniques developed in Refs. [6,[38][39][40] to introduce a new nonperturbative object called the "weighted-energy function". For β = 0, the discrimination power depends on the details of this object and only the dependence on the jet p T and radius R 0 is calculable. However, for κβ 0.5, just the first (logarithmic) moment of this function enters at NLL. As long as these moments are 2 To have recoil-free observables, we use the winner-take-all axis [21][22][23]. The winner-take-all axis always coincides with one of the particles in the jet, so there is guaranteed to be at least one particle with Ri = 0. For β = 0, we define λ κ 0 = i z κ i . Though we will not discuss the issue of recoil [9,22,[24][25][26][27] in much detail, our analytic results require using recoil-free instead of recoil-sensitive angularities. 3 The recoil-free angularities are sometimes denoted as τ (β) [9]. The generalized angularities also have the honorific notation of φ κ β .
sufficiently small, we can predict the quark/gluon tagging performance for an individual λ κ β to NLL accuracy, as well as study correlations between two IRC unsafe angularities to NLL. Some of the results in this paper are well-known to experts, though perhaps not in the language we use here. On the statistical side, tagging performance is typically shown in terms of receiver operating characteristic (ROC) curves, which show the background mistag rate for a given signal efficiency. As discussed in App. A.1, ROC curves and mutual information are related to each other, and while ROC curves are perhaps more intuitive, mutual information has a closed form analytic definition and also has a nice visualization in terms of Venn diagrams. On the physics side, λ κ β is only a subset of the possible quark/gluon discriminants (see Ref. [5] for a catalog). Our goal is not to be exhaustive, but rather explain why different values of κ and β are sensitive to different properties of quarks and gluons. To maximize quark/gluon performance when combining variables, clearly one wants to pick variables that are sensitive to different physical effects.
In Sec. 2, we review the definition of mutual information, and use it to emphasize why the joint tagging performance of two observables is a separate concept from the correlations between two observables. In Sec. 3, we show that for observables with "Casimir scaling" behavior, the quark/gluon truth overlap is a universal function of the color factor ratio C A /C F . In Sec. 4, we discuss the general features of the λ κ β variables and show parton shower results for their mutual information. We then turn to analytic calculations, treating the IRC safe case of the angularities e β in Sec. 5 and the more general IRC unsafe case of λ κ β in Sec. 6. We conclude in Sec. 7.

Definition
Mutual information is a measure of the shared information content of two observables. For continuous distributions of variables a and b, the mutual information is (see e.g. [41]) where p(a, b) is the joint probability distribution (normalized to have unit integral), and p(a) ≡ db p(a, b) and p(b) ≡ da p(a, b) are the marginal probability distributions. 4 Here, we are using base 2 logarithms such that I(A; B) = 1 corresponds to one binary bit of shared information.
In order to visualize mutual information, it is helpful to rewrite I  Figure 2: Left: The mutual information I(A; B) between observables A and B is visualized as the area of the shaded overlap region in information space. Right: As a special case, we can consider the mutual information I(T ; A) between observable A and the truth T (i.e. the truth overlap).
Strictly speaking, the entropy (unlike the mutual information) is not well-defined for continuous observables, though it can be made sensible by binning the distributions. For discretevalued observables, we have can be interpreted as the "area" of the intersection A ∩ B in information space, and it useful for quantifying the degree of correlation between two variables, with low values corresponding less correlated variables.

Single Variable Discrimination
For a single variable a, we can quantify how well it performs as a signal/background discriminant by calculating how much mutual information it shares with the truth T . Consider an event sample with signal fraction f and background fraction (1 − f ), and let t = 0 for signal events and t = 1 for background events. Because t is a discrete variable, it has a well-defined Shannon entropy which is the number of available "truth bits". The most intuitive choice is f = 1/2 which yields H(T ) = 1, corresponding to one bit of truth (i.e. signal = 0 vs. background = 1). Without knowing the truth information, the measured a distribution would be where p 0 (p 1 ) is the normalized a probability distribution for signal (background) events.
With the addition of truth information, the joint probability distribution is (2.8) so the mutual information between A and the truth T (i.e. the truth overlap) is By Eq. (2.5) and shown in Fig. 2b, The mutual information I(T ; A) quantifies how well the variable a can separate signal and background, with I(T ; A) = 0 corresponding to no discrimination power and I(T ; A) = H(T ) corresponding to full discrimination power.
In this paper, we will focus almost exclusively on f = 1/2 (see Eq. (1.1)), such that 0 ≤ I(T ; A) ≤ 1. As explained in App. A.1, by varying f , I(T ; A) can be related to the ROC curves typically used to quantify discrimination power. Unlike ROC curves, there is a closed-form expression for the mutual information, which makes it better suited for analytical studies. When Eq. (2.9) is applied on a sample of events generated by a Monte Carlo program, the finite sample size requires one to replace the integral by a sum over bins. Care is needed to avoid biasing the mutual information from binning, which is discussed in detail in App. A.2.

Pairwise Correlations in Discrimination
Given two variables a and b, I(T ; A) and I(T ; B) quantify how well each performs individually as signal/background discriminants. Similarly, we can assess how well a and b perform as joint discriminant variables by calculating In the left figure, the mutual information with the truth (shaded) is complementary, such that combining A and B increases the truth overlap. This is not so in the right figure.
such that a and b always have the same or better joint discrimination than either variable individually. If I(T ; A, B) = max{I(T, A), I(T ; B)}, then there is no gain in quark/gluon discrimination in considering both a and b, and one of the two variables is redundant, at least for this purpose. To highlight the difference between I(A; B) and I(T ; A, B), consider Fig. 3 which shows two variables with a low degree of correlation (i.e. I(A; B) is relatively small). In the left example, a and b are both decent discriminant variables individually (i.e. A and B both have overlap with the truth T ), but they have considerably improved joint discrimination power (i.e. I(T ; A, B) is larger than both I(T ; A) and I(T ; B)). In the right example, despite the fact that a and b are largely uncorrelated (as measured by I(A; B)), there is no gain in discrimination power by considering a and b jointly (i.e. I(T ; A, B) = I(T ; A) = I(T ; B).) In the jet substructure literature, there are known examples of both situations in Fig. 3. The left example is the ideal case (see e.g. [13]), where two variables a and b give complementary information for discrimination. The right example is the more puzzling case (see e.g. [14]), where two variables exhibit comparable discrimination power, a low degree of correlation, yet little gain in performance when considered jointly. Using mutual information, it is straightforward to diagnose and visualize this situation, helping to identify redundant variables. (Of course, redundant but largely uncorrelated variables can still be helpful for other purposes, such as calibration.)

Quark/Gluon Discrimination from Casimir Scaling
As a simple example of using mutual information, consider an observable a that satisfies the property of "Casimir scaling". For such an observable, the normalized cumulative distribution for quarks (Σ q ) and gluons (Σ g ) can be written as where r(a) is any monotonically decreasing function of a, C F = 4/3 is the color factor for quarks, and C A = 3 is the color factor for gluons. As we will see in Sec. 5, the angularities e β obey Casimir scaling at leading-logarithmic (LL) order. In Sec. 6.2, we will even find that the generalized angularities λ κ β with κβ 0.5 obey Casimir scaling at LL as well. As discussed in Ref. [9], any observable that exhibits Casimir scaling has a ROC curve of where x is the quark jet efficiency and x C A /C F is the gluon jet mistag rate. This result follows from making a cut a < a cut , which keeps a fraction Σ q (a cut ) of quarks and a fraction Σ g (a cut ) = (Σ q (a cut )) C A /C F of gluons. Since (approximate) Casimir scaling is so ubiquitous among quark/gluon discriminants, this explains why so many discriminant variables have such similar performance. To improve performance, one has to probe the jet beyond just its overall color charge C i . We can understand this same feature from the point of view of mutual information by showing that the truth overlap I(T ; A) is a universal function of C A /C F for observables that exhibit Casimir scaling. 5 The probability distribution for a is just the derivative of the cumulative distribution with respect to a:

4)
If f is the fraction of quark jets in the sample and (1 − f ) is the fraction of gluon jets, the total probability distribution is , (3.5) and the truth overlap is By making the change of variables all dependence on the distribution r(a) can be removed and the integrals can be evaluated exactly. We find where 2 F 1 (a, b; c; z) is the hypergeometric function and ln is the natural logarithm. As advertised, for any observable exhibiting Casimir scaling, I(T ; A) is a universal function of Setting the quark fraction f equal to 1/2 and C A /C F = 9/4 for QCD, the mutual information for quark/gluon discrimination is This will be the baseline value to which all observables will be compared. Note that I(T, A) is quite far from 1 (i.e. a full truth bit), demonstrating the inherent challenge of quark/gluon tagging.

Generalized Angularities
Our analytic studies of quark/gluon separation will focus on the generalized angularities λ κ β defined in Eq. (1.2), repeated for convenience: Here z i is the energy fraction and θ i = R i /R 0 the angular fraction with respect to the jet radius R 0 , such that 0 ≤ z i , θ i ≤ 1. We measure the angles R i with respect to the recoil-free winner-take-all axis [21][22][23] and we use a jet algorithm that centers the jet on the winner-takeall axis, such that θ i ≤ 1 is strictly enforced. For the IRC safe angularities e β , it is known that a recoil-free axis improves quark/gluon discrimination power [9]. For the generalized angularities λ κ β , a recoil-free axis is crucial for the calculations with β κ, since it ensures that λ κ β measures the radiation pattern around the initiating hard quark or gluon and not the displacement (i.e. recoil) of the hard parton away from the jet axis.
These variables are effective quark/gluon discriminants because they probe the angular and energetic structure of jets, both of which are sensitive to the differing color factors between quarks and gluons, among other effects. Large β emphasizes wide-angle radiation whereas small β emphasizes collinear radiation. Large κ emphasizes harder hadrons, whereas small κ emphasizes softer hadrons. For reference, we highlight the κ = 1 and β = 0 cases: While λ 1 0 = 1 is a trivial observable, we can expand around κ = 1 to find lim κ→1 so when we present studies for λ 1 0 , we really mean lim κ→1 λ κ 0 , which is effectively the same as the observable i∈jet z i ln z i .
To get a feel for the performance of the various λ κ β , we can use parton shower simulations to estimate their quark/gluon truth overlap. We generate an equal admixture of quark and gluon jets (i.e. f = 1/2) from the processes qq → qq and gg → gg using Pythia 8.183 [42,43] and Herwig++ 2.6.3 [44,45] at the 8 TeV LHC. 6 The transverse momenta of the jets is required to be p T > 400 GeV with a jet radius of R 0 = 0.6. To avoid any effects from recoil [9,[24][25][26][27], we identify jets using 1-jettiness [46,47] as a jet finder [48], taking the winner-take-all axis [21][22][23] as the jet center. This style of jet finding always returns one perfectly circular jet cone, and FastJet 3.0.3 [49] code is available from the Nsubjettiness package through the FastJet contrib project (http://fastjet.hepforge.org/contrib/). 7 In Figs. 4a and 4b, we show the truth overlap I(T ; λ κ β ) from Eq. (2.9) for different choices of λ κ β . 8 Confirming the results of Refs. [4,5], one of the best single discriminant variables is λ 0 0 (i.e. hadron multiplicity). In Figs. 4c and 4d, we show the truth gain which is a measure of the information gain by using a second λ κ β in addition to λ 0 0 . We see that observables like λ 1 1 ≡ e 1 (i.e. width) and λ 2 0 (i.e. p D T ) do add additional information, in agreement with LHC performance studies [10][11][12].
Of course, these parton shower results should be taken as just illustrative, especially since it is known that Pythia 8 typically overestimates the quark/gluon separation power [12]. The differences between Pythia 8 and Herwig++ are quite striking, but the origin of the disagreement is not known at present. For this reason, we want to calculate I(T ; λ κ β ) from first principles to predict which observable (or combination of observables) has the best discrimination power, which is the subject of the next sections. 6 The choice of 8 TeV allows us to use the same event sample and event selection as Ref. [9]. Results at 14 TeV are qualitatively similar. 7 We thank T.J. Wilkason for providing a beta version of his code. 8 Figure 4: Parton shower study of quark/gluon discrimination for Pythia 8 (left) and Her-wig++ (right). Top: quark/gluon discrimination power of λ κ β as characterized by the truth overlap I(T ; λ κ β ). Bottom: improvement in discrimination power from supplementing multiplicity with λ κ β , ∆I(T ; λ 0 0 → λ 0 0 , λ κ β ) ≡ I(T ; λ 0 0 , λ κ β ) − I(T ; λ 0 0 ). The small solid boxes correspond to the dots indicated in Fig. 1, the wide dashed box indicates the IRC safe angularities e β , and "LL" in light yellow indicates the result from Casimir scaling (i.e. I(T ; λ κ β ) 0.1 from Eq. (3.9)).

IRC Safe Angularities
We start our analytic studies with the IRC safe limit κ = 1, corresponding to the recoil-free angularities e β ≡ λ 1 β . For all β > 0, these are IRC safe. To the order of accuracy of our calculations, e β are identical to the energy correlation functions C (β) 1 [9]. The case β = 1 is also known as width (or broadening or girth) and β = 2 is known as thrust (which is related to mass-squared at a fixed jet energy).
It was observed in Ref. [9] that the recoil-free angularities are good quark/gluon discriminants, with better performance at fixed β than the traditional recoil-sensitive angularities (i.e. angularities measured with respect to the jet momentum axis). The discrimination power of e β increased as the angular exponent β decreases towards zero, and we will verify this behavior from the mutual information viewpoint. In addition, using the double differential cross sections from Refs. [50,51], we can study the correlations between different angularities e α and e β to show how using additional information can improve tagging performance.
First, a general comment about these analytic predictions. For most jet algorithms, quark and gluon jets are only well-defined at lowest order in α s (see, however, Ref. [52]). Here, we will not address the subtlety of defining quark and gluon jets to all orders in perturbation theory, but instead adopt a pragmatic definition for this paper: a quark (gluon) jet is what results from the showering of a quark (gluon) parton. This was also the strategy used in Ref. [9], and is sufficient to NLL accuracy.

Truth Overlap for One Angularity
The properties of e β are particularly simple at LL accuracy. 9 The normalized cumulative distribution of the angularity e β was computed in, e.g. Ref. [9]: Here, C i is the color of the jet: C F = 4/3 for quarks and C A = 3 for gluons. This distribution satisfies the Casimir scaling property of Eq. (3.1), and therefore the truth overlap I(T ; e β ) is given by the formula in Eq. (3.8), independent of β.
To determine the β-dependence of I(T ; e β ), we have to go to next-to-leading logarithmic (NLL) accuracy, as in Ref. [9]. 10 We use the NLL distributions for the recoil-free angularities computed in Ref. [22] (which are identical to the NLL resummation of the energy correlation functions from Ref. [26]) and compute the mutual information of the angularities with truth. For β > 1, our NLL distributions correspond to the calculations for (recoil-sensitive) 9 We define logarithmic accuracy through the cumulative distribution of the observable of interest. For an observable e, the cumulative distribution has the expansion ln Σ(e) = αs ln 2 e + αs ln e + αs + O(α 2 s ) .
We define "LL" as keeping the leading terms in this expansion with the scaling αs ln 2 e ∼ 1. 10 We define "NLL" as the leading terms in the expansion of Eq. (5.1) with the scaling αs ln e ∼ 1. angularities performed in Ref. [30]. For reference, the cumulative distributions are given in App. B.1, and we determine I(T ; e β ) through numeric integration. The truth overlap I(T ; e β ) as a function of the angular exponent β is shown in Fig. 5 for f = 1/2. The left plot is from the NLL calculation and the right plot shows Pythia 8 and Herwig++, using the same event generation settings as in Sec. 4 (i.e. p T > 400 GeV and R 0 = 0.6). The LL result from Casimir scaling is plotted for reference. We see that I(T ; e β ) increases significantly as β decreases, showing that the quark/gluon discrimination improves. As discussed in Ref. [9], the qualitative β-dependence is the same at NLL compared to the two parton shower programs, but there are significant numerical differences. Part of that is because the NLL result is lacking effects like nonperturbative power corrections which modify the quark/gluon discrimination power. The large difference between Pythia 8 and Herwig++ has been seen in other contexts [12], and the underlying reason is as-of-yet unknown.

Truth Overlap for Two Angularities
We now turn to a study of the quark/gluon discrimination power of two angularities. This will highlight the analytic benefits of using mutual information (instead of ROC curves) to study correlated observables. Constructing the ROC curve for more than a single observable is a formidable challenge because contours of constant signal/background significance can be non-trivial functions of the observables. Typically, the procedure for determining the discrimination power is to use a multivariate analysis (MVA) such as a boosted decision tree. In contrast, mutual information is defined by simply integrating over the joint probability distribution, so all correlations between observables are automatically taken into account.
At LL accuracy, the double differential cross section of two angularities was computed in Ref. [50]. For angularities e α and e β with different angular exponents α > β, the double cumulative distribution is At this order, the angularities satisfy the inequalities While the LL distribution does exhibit Casimir scaling, it does so for a multivariate exponential function, so the analysis of Sec. 3 does not apply. The double differential cross section is defined by differentiating with explicit expressions in App. B.2. The function Θ 0 enforces the phase space restrictions in Eq. (5.4), and has to be outside of the derivatives. At NLL accuracy, the double differential cross section was computed in Ref. [51] by interpolating between effective theories at the e β = e α and (e α ) β = (e β ) α boundaries of phase space. The NLL expression is given in App. B.2 for reference, and an equivalent derivation using Soft-Collinear Effective Theory (SCET) [53][54][55][56] is given in App. C.2.
In Fig. 6 we show the truth overlap I(T ; e α , e β ), comparing the LL expression, the NLL expression, Pythia 8, and Herwig++. As before we have set the quark fraction f = 0.5, and the diagonal entries correspond to the single observable values from Fig. 5. From the baseline LL value of a single angularity in Eq. (3.9) (i.e. I(T ; e β ) 0.1) the truth overlap can be increased noticably even at LL. For example, for angularities e 2 and e 0.5 , the joint truth overlap is greater than 0.12 at LL. At NLL, the discrimination power uniformly rises, as expected from Fig. 5. Because our NLL expressions do not account for the nonperturbative region of phase space, one should be cautious interpreting the results for β 0.5. Turning to the parton showers, they give quite different prediction for I(T ; e α , e β ), with Pythia 8 even more optimistic than the NLL result and Herwig++ closer to the LL result.
The large numerical differences between these methods highlights the considerable theoretical uncertainties present in quark/gluon discrimination. It is important to note that these large differences are not present when trying to model quarks and gluons individually, and only arise in the context of discrimination. In App. D we show the mutual information I(e α ; e β ), which measures the degree of correlation between two angularities on separate quark and gluon samples. The four methods (LL, NLL, Pythia 8, and Herwig++) show Figure 6: The quark/gluon truth overlap for pairs of IRC safe angularities (e α , e β ). Top: the LL and NLL analytic calculations. Bottom: the Pythia 8 and Herwig++ parton showers. The single observable LL baseline (I(T ; e β ) 0.1) is indicated by light yellow. Note that the LL and NLL results are only trustable for β 0.5. Also, near the α = β diagonal, the NLL results suffer from numerical issues due to the small phase space allowed by Eq. (5.4). much closer agreement for I(e α ; e β ) than for I(T ; e α , e β ), suggesting that the truth overlap is more sensitive to subtle (and difficult to predict) differences between quark and gluon jets.
For completeness, in App. D we show the truth gain ∆I(T, e max → e α , e β ) ≡ I(T ; e α , e β ) − max{I(T ; e α ), I(T ; e β )}, (5.6) which makes it easier to see that there is improved quark/gluon discrimination power from measuring two angularities instead of just one. 11 Roughly speaking, pairs of angularities with the smallest values of I(e α ; e β ) (i.e. least correlation) lead to the largest increase in I(T ; e α , e β ) (i.e. discrimination power), though there is considerable variability. Because the four methods have different predictions for which angularities should be combined, care should be taken when using any of these methods to estimate quark/gluon discrimination performance.

IRC Unsafe Angularities
We now turn to the more interesting case of the IRC unsafe angularities with κ = 1. As seen in Sec. 4 and known in the literature, hadron multiplicity (λ 0 0 ) and p D T (λ 2 0 ) are effective quark/gluon discriminants. But much of the rest of the (κ, β) plane is still unexplored (apart from the angularity line at κ = 1).
One challenge to gaining an analytic understanding of the κ = 1 case is that λ κ β is collinear unsafe (unlike e β studied above). This introduces an intrinsic sensitivity to nonperturbative physics that describes how the emitted radiation is split into hadrons, prohibiting a purely perturbative calculation. That said, using the techniques developed in Refs. [6,[38][39][40], we can encode the nonperturbative information into a "weighted-energy function" which can be extracted from data. In fact, for β > 0, we will only need a few nonperturbative parameters (and not a whole function) to characterize the distributions.
Strictly speaking our calculations will only be valid for κ 0.5. The reason is that as κ → 0 the observable also becomes infrared unsafe, further complicating calculations (as discussed in the context of hadron multiplicities in e.g. Ref. [58,59]). Also note that the β = 0 and β > 0 regimes are very different in how they treat collinear radiation, so we will consider them separately. The approximate range of validity of the calculations are shown in Fig. 7.

The β = 0 Regime
We start with the β = 0 case with λ κ 0 . Recall from Eq. (4.4) that λ 1 0 effectively refers to the observable i z i ln z i . In order to study these observables, we need to introduce a nonperturbative object called the weighted-energy function F i κ (x, µ) that describes how the energy of a jet is distributed among its constituent hadrons. Here, i labels the flavor of the 11 Of course, while mutual information is helpful to characterize the possible gains from combining observables, a multivariate analysis is still needed to realize these gains in practice. As discussed in App. B.4, it is challenging to determine the optimal cuts analytically, even at LL. Alternatively, in the spirit of Ref. [57], one could use the ratio of the quark/gluon double differential distributions as a weighting factor.  Figure 7: The regions of the space of observables λ κ β that we calculate are shown in orange (β = 0 and κ 0.5, Sec. 6.1) and blue (βκ 0.5 and βκ/(1 − κ) 2 6, Sec. 6.2). As explained in Sec. 6.2, the funny shape of the blue region is due to a combination of perturbative and nonperturbative constraints.
jet. This object is similar to the charge distribution [38] and the track function [39,40] which describe other aspects of the fragmentation of quarks and gluons into hadrons.
The quark weighted-energy function has the following operator definition Here ψ is the quark field, with momentum fixed by the δ functions involving the momentum operatorp, H denotes a hadronic final state, and z h = (p 0 h +p 3 h )/k − is the momentum fraction carried by the hadron h ∈ H. (The only dependence on k − is through z.) There is a similar definition for the gluon weighted-energy function, and we have suppressed eikonal Wilson lines needed for gauge invariance. These functions are normalized such that As a point of reference, if the hadrons were weighted by their charge, then F i κ (x) would be the jet charge function D i (x, κ, µ) [38]. Alternatively, for κ = 1 and restricted to charged particles, this would be the track function T i (x, µ) [39,40].
At LO, the cross section differential in λ κ 0 for a parton of flavor i is simply meaning that at this order, F i κ (x, µ) gives the λ κ 0 distribution directly, with x = λ κ 0 . The dependence on the jet p T and jet radius R 0 enters through the scale choice µ = p T R 0 . At NLO, the cross section is [38] The ratio J ij /J i describes the perturbative splitting i → jk, where j has momentum fraction z (see also Refs. [60,61]). By including these NLO corrections, the perturbative uncertainty (µ-dependence) is reduced. The weighted-energy functions are purely nonperturbative, so in that sense, we are not really able to predict the quark/gluon discrimination power of the λ κ 0 variables. But F i κ (x, µ) does have a perturbative renormalization group evolution [38], Thus, one can measure F i κ (x, µ) at one scale (ideally in pure quark/gluon samples), and then evolve to a different scale. This DGLAP [62][63][64][65][66] evolution of F i κ corresponds to the emissions described by a parton shower in Monte Carlo programs. We have implemented these evolution equations for p D T (λ 2 0 ), reproducing the dependence on the jet p T observed in Pythia 8 and Herwig++, as shown in Fig. 8.
The weighted-energy functions are sufficient for understanding a single λ κ 0 , but if we want to study correlations between a pair of λ ρ 0 and λ κ 0 , then we would need a double weightedenergy function: This is defined analogously to Eq. (6.1), albeit with the double measurement This object also has a renormalization group evolution analogous to Eq. (6.5). In Sec. 6.3, we will use the fact that when we study correlations between the β > 0 angularities. Since our analytic calculations are limited by our lack of knowledge of the nonperturbative function F i κ , we close our discussion of β = 0 by simply showing the quark/gluon truth overlap extracted from Pythia 8 and Herwig++. In Fig. 9, we show the truth overlap of a single λ κ 0 and in Fig. 10 Figure 8: The weighted-energy function F i κ=2 (x) for p D T for u-quarks (red) and gluons (blue), extracted from Pythia 8 (left) and Herwig++ (right). The darker solid curve is the parton shower results extracted at the scale 40 GeV, the lighter solid curve is the evolution from 40 GeV to 400 GeV using Eq. (6.5), and the dotted curve is the parton shower results at 400 GeV. In all cases, we are incorporating the NLO corrections in Eq. (6.4), which is why Figure 9: The quark/gluon truth overlap for an individual generalized angularity λ κ 0 as a function of the energy-weighting power κ. Here, we are comparing the Pythia 8 and quite striking in illustrating the substantial difference in discrimination power predicted by Pythia 8 and Herwig++. Interestingly, the truth overlaps of Pythia 8 and Herwig++ seem to be related by a simple scaling of the value, with otherwise similar structures visible over the range of observables. This might point to the source of the discrepancy in the physics descriptions between Pythia 8 and Herwig++, but is beyond the scope of this paper.

The β > 0 Regime
For β > 0, the generalized angularities λ κ β are collinear safe with respect to emissions at θ = 0, but collinear unsafe with respect to wide-angle emissions. This is precisely the same situation as for the track thrust study in Ref. [40], so we can adapt those methods here. In particular, the jet can be described by a number of perturbative gluon emissions that can then be matched onto separate weighted-energy functions F i κ (x, µ). At (N)LL order, the emissions (including the weighted-energy functions) exponentiate, allowing us to predict the performance of λ κ β for quark/gluon discrimination. A jet with a single parton has λ κ β = 0, since the reference axis will align with that parton. For two partons, the winner-take-all axis will align with the harder parton. Ignoring fixedorder corrections, we can assume that the harder parton is the initiating parton (i.e. quark for a quark jet) such that the λ κ β distribution is determined by the emitted soft gluon. In the LO approximation for narrow jets, we can use splitting functions where P i→ig (z) is the splitting function for parton flavor i to emit a soft gluon with momentum fraction z. We can achieve (N)LL resummation by considering the strongly-ordered limit where the λ κ β distribution is determined by the hardest emissions in the jet. For the strongly-ordered limit to make sense, we assume that F g κ (x, µ) is non-singular and does not have support over a hierarchically large range in x, such that the emission with, say, the largest value of z κ θ β also typically has the largest value of λ κ β . This then allows us to use the logic of CAESAR [26] to determine the λ κ β distribution up to NLL order (ignoring non-global effects [67]). In the CAESAR approach, the one-gluon distribution in Eq. (6.9) is interpreted as the radiator function: where α s is now a running coupling evaluated to two-loop order in the CMW scheme [68]. This yields the cumulative distribution accurate to NLL where γ E is the Euler-Mascheroni constant, Γ is the gamma function, and R i ≡ −dR i /d ln λ κ β is the logarithmic derivative. The cross section is obtained via We can already learn a lot from Eq. (6.11) by considering just the LL limit. In that limit, we can drop the prefactor terms involving R , fix the coupling α s , and take P i→ig (z) = 2C i /z. In that case we find Because the bounds of integration for x depends on the value of the observable λ κ β , this integral cannot be simplified. However, because we assume that F g κ (x, µ) is non-singular, for small λ κ β we can expand the integral in powers of λ κ β . To leading logarithmic accuracy we then have where other terms are suppressed by powers of λ κ β . The logarithmic moments are defined as In Fig. 11, we show values of f g,i κ extracted from Pythia 8 and Herwig++ for a range of κ values. These moments are quite similar between the two parton showers, suggesting that their extraction is robust. Like in Ref. [40], we find that logarithmic moments of the nonperturbative function appears in the exponent of the cumulative distribution. Note that the only difference between quark and gluon jets is the color factor C i , since the same gluon-based f g,n κ appears for both kinds of jets. Thus, this observable satisfies Casimir scaling in the LL limit, yielding the mutual information discussed in Sec. 3. Strictly speaking, these f g,n κ terms are only relevant at NLL order, since they multiply at most single logarithms in the observable. 12 Therefore, we will drop the f g,n κ terms at LL order: Doing the full NLL calculation using Eq. (6.11) is straightforward with the help of two tricks. First, using the fact that we can rewrite the radiator function in Eq. (6.10) as 12 One could imagine power counting f g,n κ as being logarithmically enhanced instead of as O(1). In that case, however, one would need to keep track of every f g,n κ starting at NLL order, so we effectively return to the β = 0 case where the full F g κ (x, µ) function is needed.
whereR i is the radiator for the IRC safe angularity with exponent β/κ. Second, we only need to keep the first logarithmic moment f g,1 κ at NLL order, so when we do the x integral weighted by F g κ (x, µ), we can effectively replace x → exp(f g,1 κ ) (6.19) in the argument ofR i , up to log-suppressed terms. 13 Thus, the IRC unsafe radiator is simply where the IRC safe radiatorR i is given in App. B.1. We find it quite remarkable that we can relate an IRC unsafe distribution to an IRC safe one in this way, and we show that these same two tricks are valid in SCET in App. C.1. 14 Before showing analytic results, we need to comment on the range of validity of our calculation. Because the radiator scales like 1/(βκ), we can only trust this perturbative expression when βκ 0.5. In addition, the validity of our approach is limited to the region where, in absolute terms, the nonperturbative parameter f g,1 κ is smaller than the typical values of ln λ κ β . 15 From Fig. 11 we obtain the approximation f g,1 κ ≈ 3(1 − κ), and using Eq. (6.16), we expect the distribution to peak at ln 2 λ κ β = βκ/(2α s C i ). This suggests that our calculation holds for βκ (1 − κ) 2 > c, (6.21) where c = 18α s C i , which is 2.7 for quarks and 6.0 for gluons. We take the more restrictive value c = 6.0 when assessing the validity of the quark/gluon truth overlap, and this is the reason why the blue region in Fig. 7 is missing the upper left and lower right corners. In Fig. 12, we show the truth overlap I(T ; λ κ β ), comparing LL, NLL, Pythia 8, and Herwig++. As expected, there is no difference between different choices of κ and β at LL, with differences showing up first at NLL order. The NLL calculation breaks down in the upper left region (due to large values of f g,1 κ ) and in the lower left region (due to βκ being too small). Unfortunately, these are exactly the same regions where there is interesting behavior in the Pythia 8 and Herwig++ predictions. If we naively trust the NLL results outside of their 13 Note that this rescaling would replace f g,2 κ with (f g,1 κ ) 2 in Eq. (6.14) . Because the f g,2 κ term is formally beyond NLL accuracy, this is an allowed replacement.
14 It is perhaps even more remarkable that we can take logarithmic derivatives of the resulting Ri and get the right NLL expression for the multiple emissions piece. Ultimately, the only way we are able to justify this is via SCET, since the original CAESAR approach was only proven for IRC safe observables. Note that there is a Jacobian factor in R i , so you cannot directly relate the cumulative distributions for e β/κ and λ κ β / exp(f g,1 κ ), only the radiators. Because of the R i term, the discrimination power is not just a function of β/κ, and has non-trivial κ and f g,1 κ dependence at NLL. 15 Outside of this region, the nonperturbative effects become too large to be treated using just the logarithmic moments, and we would have to include the full F g κ (x, µ) function, as also mentioned in footnote 12. In principle, this would allow us to get beyond the "nonperturbative barrier" in Eq. (6.21), though we have not attempted such a calculation.  Figure 12: The quark/gluon truth overlap for an individual generalized angularity λ κ β . Top: the LL and NLL analytic calculations. Note that these calculations are singular at κ = 0 or β = 0. Bottom: the Pythia 8 and Herwig++ parton showers (identical to Figs. 4a and  4b). The solid boxes correspond to dots indicated in Fig. 1, the dashed box corresponds to the IRC safe angularities e β , and the grey dashed curve in the LL/NLL plots marks the range of validity of our calculations (i.e. the edge of the blue region in Fig. 7). range of validity, then starting from broadening (λ 1 1 ) and approaching multiplicity (λ 0 0 ), the NLL results show an increase in discrimination power, in agreement with the parton showers. However approaching p D T (λ 2 0 ), the NLL results also show an increase in discrimination power, which is the opposite behavior from the parton showers (until one reaches the actual β = 0 line). Of course, one should be wary of this extrapolation, since our calculations are lacking important nonperturbative corrections. 16

Two β > 0 Angularities
Because the resummed λ κ β distributions for β > 0 only depend on logarithmic moments of the weighted-energy function, we have an opportunity to analytically study the correlations between two generalized angularities λ ρ α and λ κ β . We can already gain a lot of insight from a LL study, and we can use the same tricks as for Eq. (6.20) to obtain an NLL result.
With the help of the double weighted-energy function in Eq. (6.6), we can define a double radiator function Again assuming that F g ρ,κ does not have any large hierarchies (such that the observables are dominated by the hardest emissions), then we can follow the logic of Ref. [50] and say that at LL accuracy 1 σ i where Θ 0 (λ ρ α , λ κ β ) enforces the phase space restrictions 17 This expression does not immediately generalize to NLL accuracy, since there is no (known) factorization theorem for double differential distributions over the full phase space. Instead we will exploit the interpolation technique of Ref. [51] to help find the NLL expression, as we did in Sec. 5.2. 16 Intuitively, the parton shower results make sense. Compared to gluons, quarks typically have smaller values of λ 1 1 but larger values of λ 2 0 . Thus, interpolating between λ 1 1 and λ 2 0 should yield a poor discriminant variable. In the NLL approach, small β is always favored, and calculational control is lost before the f g,1 κ parameter has a chance to reverse that trend. 17 These assume that nonperturbative physics do not affect the phase space, which is fine for LL accuracy. At NLL, we use Eq. (6.33) to adjust the phase space given the first logarithmic moments of the weighted-energy function.
Calculating the double radiator in the LL limit, we find for α/ρ > β/κ where we have used Eq. (6.8) to simplify the last line (accurate to leading power in λ ρ α and λ κ β ) and we have defined As in the case of a single λ κ β , strictly speaking the nonperturbative parameters f g,n ρ,κ only show up at subleading logarithmic order, and can be ignored to LL accuracy, Note that the exponents ρ and κ still have an effect on the discrimination power even though we are not accounting for the nonperturbative parameters at this order. The overall prefactor implies that when α and β are sufficiently different, we should only trust this distribution for ακ − βρ 0.5. (6.28) To achieve NLL accuracy, we need to combine the interpolation technique of Ref. [51] with the two tricks as we used to find the single generalized angularity distribution in Eq. (6.20). Following Ref. [51], the general form of the NLL distribution is where R i is the double radiator from Eq. (6.22) and R i is a multiple emissions term that is not given by any simple logarithmic derivative of R i . We start by considering the double radiator R i . Looking at the theta functions in Eq. (6.22), we see that the integration range for the double radiator is the same as for two IRC safe angularities with At NLL accuracy, we only need the first logarithmic moments of the weighted-energy function, so we can make the replacement Thus, the double radiator is whereR NLL i is the IRC safe double cumulative distribution for angularities with exponents α/ρ and β/κ, defined in App. B.2. Turning to the multiple emissions term R i , we need to find a function that interpolates between the logarithmic derivative functions R (λ ρ α ) and R (λ κ β ) on the boundaries of phase space. Since we already found the single radiator to NLL accuracy in Eq. (6.20), this interpolation is straightforward, and we give the explicit expressions in App. B.3. Finally, to have a properly normalized distribution, we have to apply the rescaling to the phase space constraints in Eq. (6.24) as well. Again, we find it remarkable that there is such a close relationship between IRC safe and IRC unsafe calculations at NLL order, and we validate this method in SCET in App. C.2. In Fig. 13, we show the truth overlap for the LL and NLL calculations, compared to results obtained from Pythia 8 and Herwig++. The LL and NLL calculations do not extend to the region where α or β is zero, which are left white in the plot. We caution the reader that some of these LL and NLL results extrapolate outside the range of validity in Eqs. (6.21) and (6.28). Comparing the various predictions, we observe similarities between the regions of minimal and maximal discrimination power, though the overall discrimination power is (again) larger for Pythia 8 and NLL than for Herwig++ and LL. The Pythia 8 and NLL results most clearly indicate that it is advantageous to pick one of the angular exponents α or β to be small. A notable difference between the predictions is that the LL and NLL calculations suggest that one should avoid the diagonal ρ = κ, whereas for Pythia 8, and to a lesser extent Herwig++, the maximum discrimination power is sometimes (surprisingly) close to it (see e.g. α = 1, β = 0.5).

Conclusions
Robust quark/gluon discrimination is a key goal for the jet substructure community, so to the extent possible, it is important to use first principles calculations to assess the challenges and opportunities. In this paper, we showed that mutual information is a powerful way to understand how variables are correlated, and whether or not that (lack of) correlation pertains to discrimination power. We also made progress in gaining analytic control over the tagging performance of the generalized angularities λ κ β . For the IRC safe angularities and the IRC unsafe angularities with β > 0, we calculated the quark/gluon truth overlap for a single angularity I(T ; λ κ β ) and for pairs of angularities I(T ; λ ρ α , λ κ β ) to NLL order. Ultimately, we want to extend our analysis to higher orders, but this would require a robust "truth" definition for a quark jet versus a gluon jet. While the strategy of Ref. [52] is one option to define the truth flavor of a jet, we would prefer a definition for which the jet constituents are the same as for traditional flavor-less jet algorithms. Of course, quark and gluon jets do not exist in isolation, and at some point, the color correlations between the jets will be relevant for characterizing the discrimination power. The techniques introduced recently in Ref. [69] should help in gaining analytic control over those color correlations. Since our NLL results are subject to large changes from scale variation, higher-order calculations will be crucial for robust uncertainty estimates.
Assuming we did have a suitable quark/gluon truth definition, then a key challenge for calculations beyond our present order is dealing with soft radiation, in particular the effect of non-global logarithms [67]. One option is to do quark/gluon tagging in concert with soft drop declustering [36] (a generalization of modified mass drop tagging [34,70]). The soft drop procedure removes soft radiation, and therefore removes non-global contributions to the jet. The cumulative distributions for a single soft-dropped angularity were already calculated in Ref. [36], where the distributions exhibited Casimir scaling at LL order. Using soft-dropped jet shapes for quark/gluon discrimination seems promising from both a theoretical and experimental point of view, and we leave a more detailed study to future work.
Based on our studies, we have two recommendations to the ATLAS and CMS experiments. The first recommendation is to make (unfolded) measurements of the recoil-free angularities distributions for a range of β values, ideally in purified quark/gluon samples. 18 The differences seen between Monte Carlo programs in Fig. 4 is worrisome, and while eventually calculations might be a guide to what these distributions should look like, in the short term e β measurements can be a key reference for tuning Monte Carlo programs, especially because the differences arise from effects that are formally beyond LL accuracy. The second recommendation is to measure more double differential distributions. 19 While we focused on mutual information with the truth in this paper, one would still like to understand the full correlation structure. Angularities are a good place to start, and double differential distributions of e α and e β would be quite valuable, especially with new calculational tools available to predict these correlations [50,51] (see App. D for example plots along these lines).
Finally, in the spirit of Refs. [6,[38][39][40], we have provided another example of how collinear unsafe (but soft safe) observables can be made calculable with the help of new nonperturbative objects. We introduced the weighted-energy functions F i κ (x), which allowed us to understand many aspects of the κ = 1 regime. Because the β > 0 angularities only depend on logarithmic moments of F i κ (x), they are the simplest to understand. But even the β = 0 angularities are within calculational control, since we can study the renormalization group behavior of, say, the p D T (λ 2 0 ) distribution. Of course, hadron multiplicity (λ 0 0 ) is not captured within our framework due to the presence of the soft singularity, but perhaps hadron multiplicities could be made 18 Initial results along these lines appear in Ref. [12] for the energy correlation function ratio C (β) 1 [9], though only the final quark/gluon performance is shown, not the extracted quark and gluon distributions. 19 There are double differential results in Ref. [13], but measured only in simulation and not in data.
analytically tractable by using soft drop declustering to remove soft radiation. We expect future studies will continue to improve (and improve our understanding of) quark/gluon discrimination.

Acknowledgments
We thank the participants of the Boost 2013 workshop for many inspiring discussions that led to this work. We also thank Jason Gallicchio

A.1 Relationship to the ROC Curve
The mutual information of an observable A with the truth T can be derived from the ROC curve of A. Although we phrase this discussion in terms of quark/gluon discrimination, it obviously carries over to other cases as well.
At the position (q, g) on the ROC curve, the region dq has a fraction dq of the quark jets. The fraction of gluon jets is given by the slope of the ROC curve, dg = g (q) dq. Looking at the definition of I(T ; A) in Eq. (2.9), we can write the integrals as da p q (a) ⇒ dq, da p g (a) ⇒ dg = dq g (q).
For a sample with quark fraction f , the ratios of the probability distributions are This leads to .
As a simple test of this formula, note that the ROC curve for an observable that satisfies Casimir scaling is g(q) = q (C A /C F ) . Plugging this into Eq. (A.3), we recover I(T ; A) from Eq. (3.8).
Inverting this relationship to obtain the ROC curve from the mutual information is not so easy, suggesting that mutual information is an easier concept to work with. Nevertheless it seems in principle possible, though we do not claim that the following simple-minded approach is optimal. Consider discretizing g (q) by treating it as a constant g i on the interval q ∈ [i/n bins , (i + 1)/n bins ] with i = 0, 1, . . . , n bins − 1. Because an ideal ROC curve is not only monotonically increasing (i.e. g (q) ≥ 0) but also convex (i.e. g (q) ≥ 0), this means that g i+1 ≥ g i , and we have a chance to find a set of equations that (uniquely) determine g i . Then, we can integrate g (q) in the usual way to find g(q). One set of equations is given by the n-th derivative of I(T ; A) evaluated at f = 1 (for n ≥ 2) while for the special case of n = 1 we can use In discrete form, these become a system of polynomial equations Because these equation are non-linear, this quickly becomes numerically unstable for n bins > 4, but gives a proof of principle that a solution can be found. Note that the condition that the ROC curve is convex is crucial, since otherwise each permutation of g i would also constitute a solution. For special functional forms, there are simpler strategies to find the ROC curve. For example, if it is known that g(q) = q c , then a practical way to estimate the exponent c is via

A.2 Subtleties of Finite Statistics and Binning
When dealing with a sample with a finite number of events, the Shannon entropy H is sensitive to the way in which the sample is binned. We will compute the leading effect this has on the entropy. Let N bins be the number of bins and let λ i be the average number of events in bin i. The total number of events in the sample is For sufficiently large N ev , the fluctuations within the bins should be independent and the number of events in bin i should be Poisson distributed about λ i . Thus, the expected entropy of the binned sample is where n i ranges over the possible observed events in bin i. Assuming that λ i 1 for all bins, we can expand the entropy about the average value n i = λ i in each bin. The lowest order term is the expected value of the entropy in the limit of infinite statistics The first order term vanishes because λ i is the average value. The second order term is the first non-trivial effect from finite sample size. The second derivative of the entropy factor is Then, the entropy is where we have used the fact that a Poisson distribution with average λ i has variance λ i . Higher terms in the expansion will depend on the distribution of λ i , but these contributions are suppressed by inverse powers of λ i . In calculating the mutual information I(A; B) on a single sample, the leading effect from finite statistics is where N X bins is the number of bins used to calculate H(X) and N X ev is the number of events in the corresponding sample. To avoid the leading bias term from binning when the sample sizes are the same (i.e. N A ev = N B ev = N AB ev ), it is important to take N AB bins (approximately) equal to N AB bins = N A bins + N B bins , rather than the naive choice N AB bins = N A bins N B bins . This is the strategy we used to make Figs. 14 and 15. Alternatively, if one wants to use the same binning in the A and B observables such that N AB bins = N A bins N B bins , then one has to adjust the number of events accordingly. This is the strategy we used to make Figs. 17 and 18, where we took where H q+g (A) is the entropy of the combined quark and gluon sample, H q (A) (H g (A)) is the entropy of the quark (gluon) sample, and f is the fraction of quarks in the combined sample. For finite bins and sample size, the expected entropy of the combined sample is Here, λ q i (λ g i ) is the average number of quark (gluon) jets in bin i, n q i (n g i ) is the corresponding number of observed jets, N q+g bins is the total number of bins, and N q (N g ) is the total number of quark (gluon) jets in the combined sample. Note that Using the same technique as before, we can find the leading effect from finite statistics on the entropy by expanding n q i and n g i about λ q i and λ g i to quadratic order. We find The resulting truth overlap I(T ; A) is where N q bins (N g bins ) is the number of bins in the pure quark (gluon) sample and N q (N g ) is the number of jets in the pure quark (gluon) sample.
This finite statistics bias can be reduced by using samples and binning such that the second line of Eq. (A.18) is zero. There are two useful cases to consider. When the combined sample is created by simply merging the pure samples (i.e. N q = N q and N g = N g ), the bias terms reduce to and so the bias can be removed by binning such that N q bins + N g bins = N q+g bins . Alternatively, if we choose to use the same binning for each sample (i.e. N q bins = N g bins = N q+g bins ≡ N bins ), the bias terms are δI(T ; A) = 1 2 ln 2 This is zero, if, for example, we take the combined sample to have half the number of events of the pure samples (i.e. N q /N q = N g /N g = 1/2). This later strategy is the one we used for all of the truth overlap plots in this paper.

B.1 One IRC Safe Angularity at NLL
The cross section for a single recoil-free angularity e β was derived in Refs. [22,26] (see also Ref. [30]), and we summarize the results here. To NLL order, the cumulative distribution for e β can be expressed as Here, we are using a slightly different notation from Ref. [26] and the body of the text (see Eq. (6.11)), with The reason for separating out the T function is that it contains terms that formally start at NLL order, such that we do not need to consider the logarithmic derivative T (e β ) in the prefactor.
In this notation, the radiator R(e β ) consists of the cusp pieces of the jet and soft function anomalous dimensions, while the function T (e β ) contains the non-cusp terms (for details on these anomalous dimensions see App. C.1). The logarithmic derivative R (e β ) is given by To NLL accuracy, the cusp anomalous dimensions are evaluated at two-loop order and the radiator is Here, the strong coupling constant is evaluated at the hard scale using two-loop running with n f = 5 from α s (m Z ) = 0.12. The observable is contained in λ = 2α s β 0 ln e β , the color factor of the jet is C i , the one-and two-loop beta functions are and the ratio of the two-loop to the one-loop cusp anomalous dimensions is For the non-cusp terms, the function T (e β ) is and γ i is the non-cusp anomalous dimension to one-loop. For quarks and gluons, they are For NLL accuracy, the logarithmic derivative R (e β ) only needs to be evaluated at one-loop: The cross section is obtained from the cumulative distribution in the standard way, We will verify this single differential calculation in the language of SCET in App. C.1.

B.2 Two IRC Safe Angularities at NLL
For two angularities e α and e β , the double cumulative cross section to NLL order was derived in Ref. [51]. Assuming α > β, to NLL order it can be written as where the functions R(e α , e β ), T (e α , e β ), and R(e α , e β ) are given below. As in Eq. (B.2), we are changing the notation from the text (see Eq. (6.29)) to separate out the T piece from the radiator. Note that this has a similar form to Eq. (B.1), albeit with functions that depend on two arguments. Unlike the single angularity case, R is not related to any logarithmic derivative of R. The radiator R(e α , e β ) is where U (z) = (1 + z) ln(1 + z). The non-cusp piece T (e α , e β ) is (B.14) The multiple emissions piece R(e α , e β ) is The power suppressed terms have been chosen such that the sum of the exponents of e α and e β is 1. The cross section is obtained from the cumulative distribution by differentiation and imposing the phase space constraint in Eq. (5.4), We will verify this double differential calculation in the language of SCET in App. C.2.

B.3 Two IRC Unsafe Angularities at NLL
For the IRC unsafe angularities, the double differential distribution takes the form where we are again using the notation change in Eq. (B.2). Using the rescaling trick in Eq. (6.32), we can determine the double radiator R i + γ i T i . For the multiple emissions term R i , we have to interpolate between expressions derived at the phase space boundaries [51]. This interpolation yields The final two terms, proportional to powers of λ ρ α and λ κ β , are formally power-suppressed over the entire phase space, but are necessary to satisfy the boundary conditions. When ρ = κ = 1, this reduces to the IRC safe case in Eq. (B.15).

B.4 Finding the ROC Curve at LL
In this paper, we focused on mutual information, but one can (in principle) use the same cross sections to determine the ROC curve for quark/gluon discrimination.

C Equivalent NLL Results from SCET
For the IRC safe angularities, Ref. [22] demonstrated that the SCET approach and CAESAR approach to resummation give the same single differential cross sections to NLL accuracy. In this appendix, we repeat the same exercise for the generalized angularities. We also show how to perform the double differential interpolation of Ref. [51] in the language of SCET.

C.1 One Generalized Angularity in SCET
The SCET calculation for the generalized angularities λ κ β mirrors that of track thrust [40]. We factorize the cross section into a hard function, jet functions, and a soft function which describe physics at the corresponding scales At NLL order, the cross section is completely generated by renormalization group evolution. For simplicity we evolve the hard and jet function to the soft scale, which results in a cumulative distribution of [37,40,71,72] The evolution kernels that enter here are which are given in terms of where r = α s (µ)/α s (µ 0 ). The coefficients of the beta function, the cusp, and the non-cusp anomalous dimensions are Note that the conventions for these coefficients differ from those in Sec. B.1, and that the nonperturbative effect described by f g,1 κ is included in the non-cusp anomalous dimension. To test the agreement with the IRC safe result in Sec. B.1, we use the central scale choice in Eq. (C.1), such that the SCET result for e β can be written as Up to terms that are beyond NLL order and ignoring logarithms of the jet radius R 0 , we find so the CAESAR and SCET results indeed agree at this order. One advantage of the SCET approach is that it separates the physics at different scales, allowing one to estimate the perturbative uncertainty of Eq. (C.2) by (independently) varying µ H , µ J , and µ S . For the IRC unsafe case in Sec. 6.2, we need to verify the rescaling hypothesis in Eq. (6.20), which is equivalent to where X is the function in the IRC unsafe case andX is the function in the IRC safe case (with angular exponent β/κ). Note that we can drop the f g,1 κ terms from T i and R i to NLL accuracy, and that there is a 1/κ Jacobian factor from the logarithmic derivative in R . Under this rescaling, the central scales are related as If the jet and soft scales swap from one boundary to the other boundary, we get This is of course strange from the point of view of factorization, but one should remember that the factorization theorem on the boundary does not hold in the interior. Using these interpolating scales, we can write down a candidate form for the double radiator, Imposing the boundary conditions in Eq. (C.12) to NLL order leads to a one parameter family of solutions. The simplest one is: (Remember that a different convention for β 0 is used here than in App. B.) The first term satisfies the boundary conditions on the cumulative distribution and the derivative boundary conditions at e α = e β . The second term is power suppressed, except at the boundary e α = (e β ) α/β , and is introduced to satisfy the derivative boundary conditions there. It is formally beyond the order we are working so other choices are possible. The single angularity result R (e α ) = η i J (µ J , µ S ) = −2C i /(β − 1)η Γ (µ J , µ S ) suggests an ansatz similar to Eq. (C.15) forR. This leads tõ The first two terms satisfies the boundary condition on the cumulative distribution and the derivative boundary condition at e α = e β . As in Eq. (C.17), the additional power suppressed terms take care of the derivative boundary conditions at e α = (e β ) α/β .
The interpolation for IRC unsafe angularities is a direct generalization of Eq. (C. 16) where we assume α/ρ > β/κ and the scales are modified to The interpolation of the non-cusp piece in Eq. (C.17) mostly carries over. The contribution from γ i H = −γ i J only involves the hard and jet scales, which are the same as before. However, the nonperturbative coefficients enter through K γ J (µ J , µ S ). Although one can build an interpolation similar to Eq. (C.17), we also need an interpolation between f g,1 It is therefore much more convenient to use the rescaling trick in Eq. (6.32). Finally, the multiple emissions contribution in Eq. (C.18) generalizes toR The terms on the first line satisfy the boundary condition for the cumulative distribution. The terms on the second and third line are power suppressed except at the boundaries (λ ρ α ) β = (λ κ β ) α and (λ ρ α ) κ = (λ κ β ) ρ , respectively, where their inclusion enforces the derivative boundary conditions. For ρ = κ the third line is absent, and for ρ = κ = 1 this reduces to the IRC safe case.

D Additional Plots
In this appendix, we present additional plots involving mutual information to complement the truth overlap plots in the main text. We also show some raw angularity distributions.
We first study the correlation between IRC safe angularities with different angular exponents, as measured by I(e α ; e β ). This is shown for a pure sample of quark jets in Fig. 14 and for a pure sample of gluon jets in Fig. 15   their angular exponents are close to one another, and become increasingly uncorrelated as the angular exponents move farther apart. This behavior can be understood from the definition of the angularities. For large values of angular exponent, the angularity is dominated by soft, wide-angle emissions because collinear emissions are suppressed by small angles raised to a high power. By contrast, at small values of the angular exponent, the angularity is dominated by hard collinear emissions. Thus, when two angularities have very different angular exponents, their values are dominated by different physics and so are largely uncorrelated. Unlike the case of the truth overlap in Fig. 5, there is broad agreement between LL, NLL, Pythia 8, and Herwig++ as far as the raw correlations are concerned.
Next, we want to understand better the degree to which two angularities have more truth overlap than one angularity. In Fig. 16, we plot ∆I(T, e max → e α , e β ) from Eq. (5.6), namely the pairwise truth overlap I(T ; e α , e β ) minus the truth overlap of the stronger angularity max{I(T ; e α ), I(T ; e β )}. The information gain is on the order of 10% (O(0.01) compared to a baseline truth overlap of O(0.1)). As in Fig. 5, there are quite substantial differences between the various methods. It is interesting that in Pythia 8 one can already achieve considerable gains in performance just off the diagonal, i.e. for observables that are not very different. This may be because when α and β are close (compare to Eq. and θ α i log θ i is similar to the optimal kernel found in Refs. [4,5]. Turning to the generalized angularities, in Figs. 17 and 18 we show the correlations on pure quark and gluon samples as measured by I(λ ρ α , λ κ β ). As in the IRC unsafe case, there is broad agreement in the overall degree of correlation, though one has to be mindful of the restricted range of validity of the (N)LL calculations. In Fig. 19, we show the improvement of using two generalized angularities compared to one. The LL and NLL calculations are not shown, since those calculations are not accurate enough to assess small differences. The comparison between Pythia 8 and Herwig++ is similar to the IRC safe case, with Pythia 8 being more optimistic about the gains possible by combining observables.
Finally, we show a few raw angularity distributions from the NLL calculation and both parton showers. In Fig. 20, we show single differential distributions for e 1 as an IRC safe example and λ 0. 6 2 as an IRC unsafe example. Note that the NLL calculations lack important hadronization corrections that are modelled by the parton showers and are particularly important to correctly describe the small angularity region. The NLL result for λ 0. 6 2 cuts off rather sharply at the low end due to our treatment of the QCD Landau pole. The Pythia 8 distributions are more peaked than the Herwig++ distributions, which is part of the reason why Pythia 8 predicts improved discrimination power compared to Herwig++. We then show the double differential distribution for e 1 and λ 0.6 2 in Fig. 21, showing only the halfmaximum contour for readability. There is an irreducible degree of correlation between these observables due to phase space constraints (see Eq. (6.24)), but one can see that, while the gluon contours are similar for Pythia 8 and Herwig++, the Pythia 8 contour for quarks    Figure 17: Correlation between two generalized angularities (λ ρ α , λ κ β ) on a pure quark sample. Top: the LL and NLL analytic calculations. Bottom: the Pythia 8 and Herwig++ parton showers. As in Fig. 13, β ∈ {0, 0.5, 1, 2} and we sweep 0 ≤ κ ≤ 2.  Figure 18: Same as Fig. 17, but for a pure gluon sample.   Figure 19: Improvement in the truth overlap by using two generalized angularities (λ ρ α , λ κ β ) as compared to only one, I(T ; λ ρ α , λ κ β )−max[I(T ; λ ρ α ), I(T ; λ κ β )]. We only show the Pythia 8 and Herwig++ parton showers since the LL and NLL analytic calculations are not sufficiently accurate to extract subtle differences in truth overlap. is significantly smaller than Herwig++. This explains the enhanced discrimination power predicted by Pythia 8. The NLL contours are much larger in size than either Pythia 8 or Herwig++ because the NLL distributions do not vanish at the phase space boundaries [51]. The analytic distribution will only vanish at the boundaries starting at NLL order, beyond the accuracy to which double differential cross sections have as-of-yet been computed.  Figure 20: Raw distributions of e 1 (top) and λ 0.6 2 (bottom) for the NLL calculation (left) and parton showers (right). Note that the NLL distributions lack hadronization corrections that are present in the parton showers, which affects small values of the angularities. plane for the NLL calculation (left) and parton showers (right). The contours correspond to the half maximum of d 2 σ/(d ln e 1 d ln λ 0. 6 2 ).