Towards Machine Learning Analytics for Jet Substructure

The past few years have seen a rapid development of machine-learning algorithms. While they surely augment performance, these complex tools are often treated as black boxes and may impair our understanding of the physical processes under study. The aim of this paper is to take a first step in the direction of applying expert knowledge in particle physics to calculate the optimal decision function and test whether it is achieved by standard training, thus making the aforementioned black box more transparent. In particular, we consider the binary classification problem of discriminating quark-initiated jets from gluon-initiated ones. We construct a new version of the widely used N-subjettiness, which features a simpler theoretical behaviour than the original one, while maintaining, if not exceeding, the discrimination power. We use these new observables as inputs to the simplest possible neural network, namely the one made of a single neuron, or perceptron, and we analytically study the network behaviour at leading logarithmic accuracy. We are able to determine under which circumstances the perceptron achieves optimal performance. We also compare our analytic findings to an actual implementation of a perceptron and to a more realistic neural network and find very good agreement.


Introduction
The CERN Large Hadron Collider (LHC) is continuously exploring the high-energy frontier. However, despite well-motivated theoretical expectations, no conclusive evidence of new particles or interactions, and no significant cracks in the Standard Model (SM) of particle physics, have been found. This challenges the particle physics community, theorists and experimenters alike, to find new ways to interrogate and interpret the data. In the context of analyses involving hadronic final states, jet substructure has emerged as an important tool for searches at the LHC. Consequently, a vibrant field of theoretical and experimental research has developed over the past ten years, producing a variety of studies and techniques [1]. Many grooming and tagging algorithms have been developed, successfully tested, and are currently used in experimental analyses [2][3][4][5][6][7].
The field of jet substructure is currently undergoing a revolution. The rapid development, within and outside academia, of machine-learning (ML) techniques is having a profound impact on particle physics in general and jet physics in particular. In the context of high-energy physics, ML algorithms are typically trained on a control sample, which can be either Monte Carlo pseudo-data or a high-purity dataset, and then applied to an unknown sample to classify its properties. These ideas have been exploited in particle physics for a long time. However, because of limitations on the algorithms' efficiency and on computing power, algorithms have so far been applied to relatively low-dimensional projections of the full radiation pattern that one wished to classify. Even so, such projections usually correspond to physically motivated observables, such as, e.g., the jet mass or jet shapes, and therefore limitations in performance were mitigated by physics understanding. The current ML revolution moves away from low-dimensional projections and exploits deep neural networks (NNs) to perform classification. Early progress was made on supervised classification of known particles [8][9][10][11][12], to the point where powerful network architectures are now available [13][14][15][16][17] and the focus lies on improving the stability [18][19][20][21][22] of these approaches. These techniques, developed by both theorists and experimentalists, have already been tested in the LHC environment, and dedicated studies have been performed by the ATLAS and CMS collaborations [23][24][25]. Despite its rapid development and the unquestionable improvements that ML brings to particle physics, it is often met with a certain degree of suspicion, because physicists often feel uneasy about relying on black-box techniques. We believe that the situation is not too dissimilar to the one we faced a decade ago, at the birth of jet substructure. At the time there was a plethora of new tagging algorithms, and a deeper understanding was reached only when theoretical studies based on QCD were performed, see e.g. [26][27][28][29][30][31][32][33][34][35][36][37][38].
A common expectation is that well-trained networks can achieve the performance of the likelihood, i.e. that they can learn a monotonic function of the likelihood ratio. In this paper, we are in the almost-unique position of being able to analytically derive the weights corresponding to a cut on the likelihood ratio for a realistic problem, and to test whether they agree with those of a network trained in the usual way. Inspired by the successful approach of the past, we want to anchor this investigation in perturbative QCD, with a first-principles approach: we first look at a situation where most studies can be performed analytically, and afterwards investigate how our findings in this simplest case carry over to a more general setup.
For this study, we focus on an issue that has received a great deal of both theoretical and experimental attention, namely the ability to disentangle jets that can be thought of as originating from the fragmentation of a high-energy quark (henceforth quark jets) from those originating from a gluon (henceforth gluon jets) [39]. In particular, in section 2, we introduce a variant of the well-known N-subjettiness variables [40] based on the primary Lund-plane declustering tree [41]. This primary N-subjettiness T_N is more amenable to an all-order QCD analysis, which we perform at leading logarithmic (LL) accuracy, while maintaining, if not improving, the quark/gluon discrimination power. This definition is such that, if we measure the N-subjettiness variables {T_1, ..., T_n}, then, at LL, a cut on the likelihood ratio simply corresponds to a cut on T_n. With this definition, and within the stated accuracy, we are able to write simple analytic expressions for the cumulative distributions and for the so-called area under the curve (AUC), which measures the discriminating power of the observable, for any value of n.¹ We then move in section 3 to use the aforementioned primary N-subjettiness variables as inputs to a NN. Because we expect the optimal discriminant to be a cut on T_n, we can start by considering the simplest possible NN, namely a single neuron, or perceptron, with n inputs and one output. This opens up the interesting possibility of performing analytic calculations that describe the behaviour of the perceptron at LL accuracy in QCD and allow us to determine the optimal decision function. The ability to fully control the behaviour of a ML technique analytically from first principles is the main novelty of this paper. In sections 4 and 5, we compare our theoretical findings with actual networks of increasing architectural complexity using pseudo-data, which we generate either according to a leading-logarithmic distribution in QCD (section 4) or using a general-purpose Monte Carlo parton shower (section 5). Finally, we draw our conclusions in section 6.

Quark/gluon tagging with N-subjettiness
In the framework of perturbative QCD, final-state jets, i.e. collimated sprays of hadrons, are formed by the fragmentation of highly energetic partons, which emerge from the hard scattering. While the association of a single parton, say a quark or a gluon, to a final-state jet is an intrinsically ambiguous, if not ill-defined, operation, essentially because of higher-order corrections, it is nevertheless possible to employ operational definitions that associate the value of a measurable quantity to an enriched sample of quarks or gluons [39,43]. Many such observables have been proposed in the literature, ranging from more traditional jet observables such as angularities [44] or energy-correlation functions [36], to observables that exploit more modern jet substructure techniques, such as the iterated soft-drop multiplicity [45] or the Les Houches multiplicity [46], to operational definitions inspired by data science [47,48].
A general approach to quantitatively assess the power of quark/gluon discrimination was put forward in Ref. [42]. The core idea of that approach is to measure multiple infra-red and collinear (IRC) safe observables on a single jet as a means to characterise the available emission phase space. The observables of choice were the N-subjettiness variables [40,49] τ_N, where the measurement of the first n variables {τ_1, ..., τ_n} is able to resolve n emissions in the jet.² The N-subjettiness shape is one of the most used jet substructure variables. As its name suggests, it aims to identify jets with an N-prong structure, taking inspiration from the event-shape N-jettiness [52]. In order to achieve this, a set of axes a_1, ..., a_N is introduced and the N-subjettiness is defined as

\tau_N = \frac{1}{p_t R_0^\beta} \sum_{i \in \text{jet}} p_{ti}\, \min\!\left( \Delta_{i a_1}^\beta, \dots, \Delta_{i a_N}^\beta \right),  (2.1)

where β is a free parameter often set to unity, p_t and R_0 are the jet transverse momentum and radius respectively, and \Delta_{i a_j} = \sqrt{\Delta y_{i a_j}^2 + \Delta\phi_{i a_j}^2} is the distance between particle i and the axis a_j in the azimuth-rapidity plane. Note that the axes a_j can be defined in several ways; for a review see for instance Ref. [1]. Crucially, IRC safety guarantees that N-subjettiness distributions can be meaningfully calculated in perturbative QCD. These distributions have been the subject of several theoretical investigations [29,30,53] (see also [34,54,55] for theoretical studies of the related observable D_2). The N-subjettiness distributions were calculated explicitly at LL accuracy, i.e. in the limit in which all emissions are strongly ordered, in Ref. [42] for the cases n = 1, 2, 3. Furthermore, the authors of Ref. [42] were able to calculate from first principles familiar quantities that enter statistical analyses. We start our discussion with a brief recap of the elements of Ref. [42] which are relevant for our current study.

¹ Expressions for cumulative distributions can also be obtained for the standard N-subjettiness definition, see e.g. [29,42], although their complexity increases with n to a point where other quantities, such as, for instance, the AUC, are no longer tractable analytically.
² There is indeed more information in a jet than what is captured by this set of N-subjettiness variables. One could study, for instance, an extended set of N-subjettiness variables [50], energy-flow polynomials [51], or counting observables, e.g. [45]. We thank Andrew Larkoski and Ben Nachman for comments on this point.
The likelihood ratio, which according to the Neyman-Pearson lemma [56] provides the best single-variable discriminant, is simply given by the ratio of the probability distributions computed for background (gluon) and signal (quark) jets:

\mathcal{L}(\tau_1, \dots, \tau_n) = \frac{p_g(\tau_1, \dots, \tau_n)}{p_q(\tau_1, \dots, \tau_n)}.  (2.2)

In order to assess the discriminating power of an observable, Receiver Operating Characteristic (ROC) curves are often considered. These curves show the background (gluon) efficiency against the signal (quark) efficiency and are remarkably useful to directly compare the performance of different tools. For a single-variable observable V ≡ V(τ_1, ..., τ_n), one can define the normalised cumulative distribution

\Sigma_i(v) = \int d\tau_1 \cdots d\tau_n\, p_i(\tau_1, \dots, \tau_n)\, \Theta\big( v - V(\tau_1, \dots, \tau_n) \big), \qquad i = q, g.  (2.4)

The ROC curve is then obtained as

\mathrm{ROC}(x) = \Sigma_g\big( \Sigma_q^{-1}(x) \big),  (2.5)

where x is the signal (quark) efficiency. The area under the ROC curve (AUC) can be used as a quantifier of the discrimination power, where AUC = 0 corresponds to perfect performance.³ For example, if one considers an IRC-safe observable v that only resolves one emission, e.g. v = τ_1, then the LL cumulative distributions for quarks and gluons simply differ by their colour factor, leading to the so-called Casimir scaling:

\Sigma_g(v) = \big[ \Sigma_q(v) \big]^{C_A/C_F}.

³ Note that one could have alternatively defined the ROC curve as showing the background rejection against the signal efficiency, i.e. ROC(x) = 1 − Σ_g(Σ_q^{-1}(x)). In this case perfect performance would correspond to AUC = 1.
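As a concrete numerical illustration of these definitions, the Casimir-scaling ROC curve and its AUC can be evaluated in a few lines of Python; this is a minimal sketch (the efficiency grid and sample spacing are arbitrary choices of ours, and we use our convention in which AUC = 0 is perfect):

```python
import numpy as np

CF, CA = 4.0 / 3.0, 3.0

# At LL, for a one-emission observable, the quark cumulative distribution
# plays the role of the signal efficiency x, and Casimir scaling gives the
# gluon (background) efficiency ROC(x) = x**(CA/CF).
x = np.linspace(0.0, 1.0, 10001)   # quark (signal) efficiency
roc = x ** (CA / CF)               # gluon (background) efficiency

auc = np.trapz(roc, x)             # AUC = 0 would be perfect discrimination
print(f"AUC = {auc:.4f}, analytic CF/(CF+CA) = {CF / (CF + CA):.4f}")
```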
The fact that τ_n resolves n emissions inside the jet results in a rather complicated structure, even in the limit where the emissions are strongly ordered. This is because the emissions that set the values of the observables τ_i, which are always gluons at LL accuracy, can either be primary emissions, i.e. they originate from the original hard parton, which could be a quark or a gluon, or they can originate from subsequent gluon splittings. If we consider, for instance, the case of a jet initiated by a hard quark, one ends up with n contributions with colour factors C_F^{n−i} C_A^{i}, i = 0, ..., n−1. Furthermore, a rather intricate resummation structure emerges, i.e. a Sudakov form factor with both C_F and C_A contributions, depending on the complete tower of n emissions. It is clear that this intricate structure does not facilitate analytic calculations, especially because, in this study, we are not interested in p_q and p_g as final results, but rather as inputs for subsequent calculations.
As a consequence, we find it convenient to introduce a variant of N-subjettiness that is sensitive, at LL accuracy, only to primary emissions, such that the distributions p_i are determined by strongly-ordered gluon emissions off the initial hard parton. We present this new observable in the next section.

Primary N-subjettiness
We define the new observable, primary N-subjettiness, as follows. Starting from a jet of radius R_0, one first builds the list of primary Lund declusterings [41]:

1. Recluster the jet constituents with the Cambridge/Aachen algorithm [57,58].
2. Iteratively undo the last step of the clustering, j → j_1 + j_2, with p_{t1} > p_{t2}. At each step i, record the transverse momentum \tilde p_{ti} of the softer subjet j_2 and the angular distance Δ_i between the two subjets, then repeat the declustering on the harder subjet j_1.
3. When the declustering terminates, i.e. when j is no longer coming from a j_1 + j_2 clustering, define \tilde p_{t0} as the transverse momentum of j.
From the set of transverse momenta, we can define the momentum fractions

z_i = \frac{\tilde p_{ti}}{\sum_{j=0}^{m} \tilde p_{tj}},  (2.7)

where we note that the final hard momentum \tilde p_{t0} is included in the normalisation. This produces a set of values (z_i, Δ_i), for i = 1, ..., m, that we order such that

z_1 \left( \frac{\Delta_1}{R_0} \right)^{\beta} \ge z_2 \left( \frac{\Delta_2}{R_0} \right)^{\beta} \ge \dots \ge z_m \left( \frac{\Delta_m}{R_0} \right)^{\beta}.  (2.8)

The primary N-subjettiness is then defined as⁴

T_N = \sum_{i=N}^{m} z_i \left( \frac{\Delta_i}{R_0} \right)^{\beta}.  (2.9)

Note that T_1 ≥ · · · ≥ T_n, as with the standard N-subjettiness τ_N. The primary N-subjettiness definition in Eq. (2.9) is very similar to the standard N-subjettiness definition, with the main difference that it is computed from primary Lund declusterings. The definition of the momentum fractions z_i is such that \sum_{i=0}^{m} z_i = 1.

⁴ Note that an alternative definition, equivalent at leading-logarithmic accuracy but different beyond, would be T_N = z_N (\Delta_N/R_0)^\beta.
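For concreteness, the following Python sketch computes the primary N-subjettiness values from a given list of primary Lund declusterings; the function name and the toy numbers are ours, and the tail-sum form of Eq. (2.9) is assumed:

```python
import numpy as np

def primary_nsubjettiness(z, delta, R0=0.4, beta=1.0):
    """Primary N-subjettiness from the (z_i, Delta_i) primary Lund
    declusterings, following Eq. (2.9): order the contributions
    z_i*(Delta_i/R0)**beta in decreasing order and take tail sums."""
    contrib = np.sort(np.asarray(z) * (np.asarray(delta) / R0) ** beta)[::-1]
    return np.cumsum(contrib[::-1])[::-1]        # T_1 >= T_2 >= ...

# toy example with three primary declusterings
T = primary_nsubjettiness(z=[0.2, 0.05, 0.01], delta=[0.35, 0.1, 0.3])
print(T)   # T[N-1] corresponds to T_N
```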

Primary N-subjettiness at leading-logarithmic accuracy
By construction, the leading-logarithmic (LL) expression for the n-dimensional differential distribution is obtained by considering strongly-ordered independent emissions off the originating hard parton (either a quark or a gluon). In this strongly-ordered limit, only the first term on the rhs of Eq. (2.9) should be kept, i.e. T_N ≈ z_N(Δ_N/R_0)^β. In that context, the structure of the leading-logarithmic resummation is particularly trivial: one gets a factor for each of the n emissions associated with T_1, ..., T_n, as well as a Sudakov factor vetoing any additional real emission with z(Δ/R_0)^β > T_n.⁵ This yields

p_i(T_1, \dots, T_n) = C_i^n \left[ \prod_{k=1}^{n} R'(T_k) \right] e^{-C_i R(T_n)}, \qquad T_1 \ge \dots \ge T_n,  (2.10)

where i = q, g refers to the flavour of the jet (with C_q = C_F and C_g = C_A) and the distribution is understood to be differential in log(1/T_1), ..., log(1/T_n). The radiator R, which has been stripped of its colour factor, is computed in the soft-and-collinear limit including running-coupling corrections, as appropriate for our LL accuracy:

R(T) = \int_0^{R_0^2} \frac{d\theta^2}{\theta^2} \int_0^1 \frac{dz}{z}\, \frac{\alpha_s(z\theta p_t)}{\pi}\, \Theta\!\left( z \left( \frac{\theta}{R_0} \right)^{\beta} > T \right).  (2.11)

We have also introduced R'(T) = dR(T)/d log(1/T), where log x always denotes the natural logarithm of x.
An important observation is the following. From Eq. (2.10) we note that, for the primary definition of N-subjettiness, the structure of the probability distributions at LL in QCD is the same for quark and gluon jets except for the colour factor, C_F or C_A, which appears as an overall factor both in the R' pre-factors and in the Sudakov exponent. (This is not the case with the standard definition of N-subjettiness.) Consequently, the likelihood ratio Eq. (2.2) at LL becomes

\mathcal{L}(T_1, \dots, T_n) = \left( \frac{C_A}{C_F} \right)^{n} e^{-(C_A - C_F)\, R(T_n)},  (2.12)

which is a monotonic function of T_n only. Therefore, at LL, a cut on the likelihood ratio is equivalent to a cut on T_n. The remarkable simplicity of this result is the strongest motivation for introducing primary N-subjettiness. This observable is thus the ideal laboratory to study analytically how a neural network that takes the primary N-subjettiness variables as inputs performs. Given the simplicity of the classifier, a cut on a single variable, we expect that even the simplest network, i.e. a single neuron, should be enough to reproduce the likelihood ratio. Studying more complex architectures, especially ones including one or more hidden layers, would be very interesting but is beyond the scope of this exploratory work. Analytic studies of a perceptron will be the topic of section 3 but, before moving to that, let us derive a couple of results that establish primary N-subjettiness as an appealing observable in its own right, rather than just a shortcut to more tractable analytic results. Firstly, it is interesting to obtain an analytic expression for the cumulative distribution with a cut T on T_n. As we have just seen, this is equivalent to a cut on the likelihood. We have

\Sigma_i(T) = \int dT_1 \cdots dT_n\, p_i(T_1, \dots, T_n)\, \Theta(T - T_n), \qquad i = q, g.  (2.13)

The latter expression splits naturally into two terms: if T_{n−1} < T, we simply find Σ_i(T_{n−1} < T); if T_{n−1} > T, the exponential in (2.10) factors out and we obtain

\Sigma_i^{(n)}(T) = \Sigma_i^{(n-1)}(T) + \frac{\big[ C_i R(T) \big]^{n-1}}{(n-1)!}\, e^{-C_i R(T)}.  (2.14)

By induction we arrive at

\Sigma_i^{(n)}(T) = e^{-C_i R(T)} \sum_{k=0}^{n-1} \frac{\big[ C_i R(T) \big]^{k}}{k!} = \frac{\Gamma\big( n, C_i R(T) \big)}{\Gamma(n)}.  (2.15)

It is also possible to find an analytic expression for the AUC. Exploiting the ROC curve definition in Eq. (2.5), we can write the AUC as an integral of the quark and gluon distributions:

\mathrm{AUC} = \int dT_{1q} \cdots dT_{nq}\, dT_{1g} \cdots dT_{ng}\; p_q(T_{1q}, \dots, T_{nq})\, p_g(T_{1g}, \dots, T_{ng})\, \Theta(T_{nq} - T_{ng}).  (2.16)

The details of the integration are not all trivial and are thus given in appendix A. Here we just quote the final result:

\mathrm{AUC} = \left( \frac{C_F}{C_F + C_A} \right)^{n} \sum_{k=0}^{n-1} \binom{n+k-1}{k} \left( \frac{C_A}{C_F + C_A} \right)^{k}.  (2.17)

We can compare the values given by this expression with the ones computed in [42] for the standard definition of N-subjettiness.
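Equations (2.15) and (2.17) can be cross-checked numerically: at LL, Eq. (2.10) implies that C_i R(T_n) is Gamma(n,1)-distributed, so a quick Monte Carlo in these variables suffices. A minimal sketch under these assumptions (colour factors, n and sample size are arbitrary choices):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
CF, CA, n = 4.0 / 3.0, 3.0, 3

# At LL, Eq. (2.10) implies that C_i R(T_n) follows a Gamma(n,1) distribution,
# so the AUC is independent of the detailed shape of the radiator R.
N = 10**6
Rq = rng.gamma(n, size=N) / CF     # R(T_n) for quark jets
Rg = rng.gamma(n, size=N) / CA     # R(T_n) for gluon jets

# T_nq > T_ng  <=>  R(T_nq) < R(T_ng), since R(T) decreases with T
auc_mc = np.mean(Rq < Rg)
auc_th = (CF / (CF + CA))**n * sum(comb(n + k - 1, k) * (CA / (CF + CA))**k
                                   for k in range(n))
print(f"Monte Carlo: {auc_mc:.4f}   Eq. (2.17): {auc_th:.4f}")
```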
We conclude that, at least at LL, a cut on the primary N-subjettiness T_n provides better quark/gluon discrimination power than a cut on the standard N-subjettiness. The comparison of the two definitions when evaluated on Monte Carlo generated pseudo-data will be discussed in section 5.

Fixed-coupling limit
In the previous section we obtained analytic expressions for primary N-subjettiness distributions that are valid at LL, including running-coupling effects. Henceforth, for the sake of simplicity, we are going to consider the fixed-coupling limit of Eq. (2.11). In this limit, the probability distribution for quarks and gluons, Eq. (2.10), becomes

p_i(T_1, \dots, T_n) = \prod_{k=1}^{n} \left[ \frac{2\alpha_s C_i}{\pi\beta} \log\frac{1}{T_k} \right] e^{-\frac{\alpha_s C_i}{\pi\beta} \log^2 T_n},  (2.18)

and the likelihood ratio, Eq. (2.12), consequently becomes

\mathcal{L} = \left( \frac{C_A}{C_F} \right)^{n} e^{-\frac{\alpha_s (C_A - C_F)}{\pi\beta} \log^2 T_n}.  (2.19)

In order to further simplify our notation, we can reabsorb the factor α_s/(πβ) in a redefinition of the variables T_i or, equivalently, we can define the colour factors C_F and C_A in units of α_s/(πβ). We will do the latter, introducing \tilde C_i = \alpha_s C_i/(\pi\beta). We will also express Eq. (2.18) in terms of different variables x_i = x_i(T_i), with the appropriate Jacobian factors:

p_i(x_1, \dots, x_n) = \tilde C_i^{\,n} \left[ \prod_{k=1}^{n} \left| \frac{\partial L_k}{\partial x_k} \right| \right] e^{-\tilde C_i L_n}, \qquad L_k \equiv \log^2 T_k(x_k).  (2.20)

In particular, we will consider the following three choices:

log-square inputs: \; p_i(L_1, \dots, L_n) = \tilde C_i^{\,n}\, e^{-\tilde C_i L_n}, \quad L_k = \log^2 T_k, \quad 0 \le L_1 \le \dots \le L_n;  (2.21)

log inputs: \; p_i(l_1, \dots, l_n) = \tilde C_i^{\,n} \left[ \prod_{k=1}^{n} 2 l_k \right] e^{-\tilde C_i l_n^2}, \quad l_k = \log\frac{1}{T_k}, \quad 0 \le l_1 \le \dots \le l_n;  (2.22)

linear inputs: \; p_i(T_1, \dots, T_n) = \tilde C_i^{\,n} \left[ \prod_{k=1}^{n} \frac{2}{T_k} \log\frac{1}{T_k} \right] e^{-\tilde C_i \log^2 T_n}, \quad 1 \ge T_1 \ge \dots \ge T_n.  (2.23)


Perceptron analytics

Generic considerations
In this section we investigate the simplest neural network, namely the perceptron, where the input layer is directly linked to the output through a single neuron, as depicted in Fig. 1a. Analytically, this means that the network output y is a function of a weighted linear combination of the inputs:

y = f(\vec a \cdot \vec x + b),  (3.1)

where \vec x is the vector of the input variables, \vec a is the vector of the weights of the neuron and b is a bias. The function f, called the activation function, allows us to introduce an element of non-linearity in the network. In this paper we focus on the sigmoid, which is widely adopted in the ML community. The sigmoid is defined as

f(x) = \frac{1}{1 + e^{-x}},  (3.2)

and it is shown in Fig. 1b. Alternative functional forms are possible. For instance, in this context, one could replace the sigmoid with a rectified version (hard sigmoid):

f(x) = \max\big( 0, \min(x, 1) \big).  (3.3)
The neural network learns the best choice of the weights by minimising the cost function, which quantifies the difference between the expected value of the network ŷ and the predicted value y (given by Eq. (3.1)). In this study we focus on quark versus gluon discrimination, which is an example of a binary classification problem, where we have ŷ = 0 (associated with quarks) or ŷ = 1 (associated with gluons). In this context, the cross-entropy loss is one of the most common choices for the loss function:

C(y, \hat y) = -\hat y \log y - (1 - \hat y) \log(1 - y).  (3.4)

This is what we will use in this paper. However, many of the results we obtain also apply to other loss functions, such as, for instance, the quadratic loss:

C(y, \hat y) = (y - \hat y)^2.  (3.5)

In order to train the NN, one usually starts with a so-called training sample, i.e. a collection of input vectors {\vec x^{(i)}}, each labelled as a quark jet or as a gluon jet. If we have a training sample of 2N inputs, equally divided between signal and background labels, we can write the cost function as:

C = \frac{1}{2N} \sum_{i=1}^{2N} C\big( y(\vec x^{(i)}), \hat y^{(i)} \big).  (3.6)

The input variables \vec x^{(i)} are generated according to a probability distribution p_i(\vec x), with p_q (p_g) being the probability distribution of the inputs \vec x for quark (gluon) jets. If the training sample is large, as it usually is, we can rewrite the above equation in the continuous limit:

C = \frac{1}{2} \int d\vec x \left[ p_q(\vec x)\, C\big( y(\vec x), 0 \big) + p_g(\vec x)\, C\big( y(\vec x), 1 \big) \right].  (3.7)

In a general classification problem, the probability distributions of the inputs are unknown. However, in the context of QCD studies, we can exploit expert knowledge: if we choose the input variables \vec x to be IRC-safe observables, we can apply the powerful machinery of perturbative quantum field theory to determine these distributions at a well-defined and, in principle, systematically improvable accuracy.
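In code, the continuous cost function of Eq. (3.7) is estimated by the sample average of the loss over labelled quark and gluon inputs. A minimal numpy sketch (the function names and the numerical guards are ours):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-np.clip(u, -500.0, 500.0)))

def cost(a, b, Xq, Xg, eps=1e-12):
    """Monte Carlo estimate of Eq. (3.7): cross-entropy loss averaged over
    quark inputs Xq (label 0) and gluon inputs Xg (label 1); rows are jets,
    columns are the n input variables."""
    yq = sigmoid(Xq @ a + b)                          # output on quark jets
    yg = sigmoid(Xg @ a + b)                          # output on gluon jets
    return 0.5 * (-np.mean(np.log(1.0 - yq + eps))    # C(y, 0) term
                  - np.mean(np.log(yg + eps)))        # C(y, 1) term
```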
In what follows, we will use the primary N-subjettiness variables {T_i} as inputs for our perceptron. Using the results obtained in section 2, we will evaluate Eq. (3.7) using probability distributions calculated at LL accuracy. We will then study the global minimum of the cost function. We will focus on the sigmoid activation function, Eq. (3.2), and the cross-entropy loss, Eq. (3.4), although analogous results can be obtained for the quadratic loss, Eq. (3.5). More specifically, our goal is to establish whether the set of weights \vec a and bias b that minimises Eq. (3.7) corresponds to a cut on T_n, which, at LL accuracy, is equivalent to a cut on the likelihood ratio, cf. Eq. (2.12). In practice, this amounts to checking that a_1 = · · · = a_{n−1} = 0. We will show that the ability of the perceptron to find the likelihood ratio crucially depends on the functional form of the input variables. We shall consider three cases, all based on the primary N-subjettiness, namely T_i, log T_i and log² T_i.

Minimisation of the cost function
In order to find the extrema of the cost function, Eq. (3.7), we consider its partial derivatives with respect to a generic weight a_i or the bias b. With a simple application of the chain rule, we find a set of n + 1 simultaneous equations

\int d\vec x \left[ p_q(\vec x)\, C'\big( y(\vec x), 0 \big) + p_g(\vec x)\, C'\big( y(\vec x), 1 \big) \right] f'(\vec a \cdot \vec x + b)\, x_i = 0,  (3.8)

where i = 0, 1, ..., n, with x_0 ≡ 1 corresponding to the derivative with respect to b, and the prime indicates the derivative of a function with respect to its (first) argument. In general, in order to solve the above system of simultaneous equations, we have to explicitly compute the n-dimensional integral in Eq. (3.8). However, in our case, it is possible to find a solution directly at the integrand level. To this purpose, we observe that when the equality

p_q(\vec x)\, C'\big( y(\vec x), 0 \big) + p_g(\vec x)\, C'\big( y(\vec x), 1 \big) = 0  (3.9)

is satisfied for any value of \vec x, then the system of equations is fulfilled. Since the cross-entropy loss, Eq. (3.4), satisfies C'(y, 1) = −1/y and C'(y, 0) = −C'(1 − y, 1), Eq. (3.9) can be simplified to

p_q(\vec x)\, f(\vec a \cdot \vec x + b) = p_g(\vec x)\, f(-\vec a \cdot \vec x - b).  (3.10)

We note that this result also holds, for instance, in the case of the quadratic loss. More specifically, we want to compute Eq. (3.8) for our probabilities p_i given by Eq. (2.20). We explicitly consider three cases, differing only by what is used as inputs to the perceptron: "log-square inputs", Eq. (2.21), "log inputs", Eq. (2.22), and "linear inputs", Eq. (2.23). In all three cases, we use the sigmoid activation function and the cross-entropy loss. Given the following identities for the sigmoid function:

f(-x) = 1 - f(x) \quad \text{and} \quad f'(x) = f(x)\, f(-x),  (3.11)

and the properties of the cross-entropy loss, Eq. (3.8) may be rewritten as:

\int d\vec x\; x_i \left[ p_q(\vec x)\, f(\vec a \cdot \vec x + b) - p_g(\vec x)\, f(-\vec a \cdot \vec x - b) \right] = 0, \qquad i = 0, 1, \dots, n.  (3.12)

Log-square inputs. Let us start by considering L_i = log² T_i as inputs to the perceptron. In this case the probability distributions for quarks and gluons in the fixed-coupling limit are given by Eq. (2.21). This is a very lucky scenario, because we can determine the minimum at the integrand level. Indeed, the condition Eq. (3.10) gives the constraint

\tilde C_F^{\,n}\, e^{-\tilde C_F L_n}\, f\Big( \sum_k a_k L_k + b \Big) = \tilde C_A^{\,n}\, e^{-\tilde C_A L_n}\, f\Big( -\sum_k a_k L_k - b \Big),  (3.13)

leading to the following solution

a_1 = \dots = a_{n-1} = 0, \qquad a_n = \tilde C_F - \tilde C_A, \qquad b = n \log\frac{\tilde C_A}{\tilde C_F}.  (3.14)

Hence, for the log-square inputs, the minimum of the cost function does agree with the optimal cut on T_n dictated by the likelihood. Note that, as C_F < C_A, the weight a_n is negative. This is expected: since the sigmoid function is monotonic and we have mapped the gluon (quark) sample to output 1 (0), see Eq. (3.6), whereas the gluon sample has larger T_i and thus smaller L_i, the negative sign of a_n restores the proper ordering between inputs and output of the perceptron.
If we restrict ourselves to the case n = 2, it is also possible to explicitly perform the integrals in Eq. (3.7) and arrive at an analytic expression for the cost function. We report this calculation in appendix B.
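The solution in Eq. (3.14) can also be checked numerically: sampling log-square inputs from Eq. (2.21) and evaluating the gradient of the cost, cf. Eq. (3.12), at the analytic weights should give zero within statistics. A sketch under these assumptions (the seed, n and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
CF, CA, n = 4.0 / 3.0, 3.0, 2     # colour factors in units of alpha_s/(pi*beta)

def sample_L(N, C):
    """L_k = log^2 T_k from Eq. (2.21): cumulative sums of Exp(C) variables."""
    return np.cumsum(rng.exponential(1.0 / C, size=(N, n)), axis=1)

Lq, Lg = sample_L(10**6, CF), sample_L(10**6, CA)

# analytic minimum, Eq. (3.14): a = (0, ..., 0, CF - CA), b = n log(CA/CF)
a = np.zeros(n); a[-1] = CF - CA
b = n * np.log(CA / CF)

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-np.clip(u, -500, 500)))
yq, yg = sigmoid(Lq @ a + b), sigmoid(Lg @ a + b)

# gradients of the cost; all should vanish within statistical fluctuations
for i in range(n):
    g = 0.5 * (np.mean(yq * Lq[:, i]) - np.mean((1.0 - yg) * Lg[:, i]))
    print(f"dC/da_{i+1} = {g:+.4f}")
print(f"dC/db   = {0.5 * (np.mean(yq) - np.mean(1.0 - yg)):+.4f}")
```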
Log inputs. We now turn our attention to logarithmic inputs l_i = −log T_i. In this case we are not able to determine the position of the minimum at the integrand level and we are forced to use the explicit constraints from Eq. (3.12). In particular, we want to check whether the likelihood condition a_1 = · · · = a_{n−1} = 0 is still a solution of this system of n + 1 equations. To this purpose, we use Eq. (3.12) with the probability distribution given by Eq. (2.22), and we set a_1 = · · · = a_{n−1} = 0, thus explicitly obtaining

\int_0^\infty dl_n\; I_i(l_n)\; 2 l_n \left[ \tilde C_F^{\,n}\, e^{-\tilde C_F l_n^2}\, f(a_n l_n + b) - \tilde C_A^{\,n}\, e^{-\tilde C_A l_n^2}\, f(-a_n l_n - b) \right] = 0,  (3.15)

where I_i(l_n) is the result of the integration over l_1, ..., l_{n−1}:

I_i(l_n) = \int_0^{l_n} dl_{n-1} \cdots \int_0^{l_2} dl_1 \; x_i \prod_{k=1}^{n-1} 2 l_k,  (3.16)

with x_0 = 1, x_i = l_i for i = 1, ..., n−1 and x_n = l_n. Up to an irrelevant overall multiplicative constant, these integrals are easily found to be

I_0(l_n) \propto l_n^{2(n-1)}, \qquad I_i(l_n) \propto l_n^{2n-1} \quad (i = 1, \dots, n).  (3.17)

Replacing this result in Eq. (3.15), we see that all of the derivatives with respect to a_i, i = 1, ..., n, give rise to the same equation, and thus the system of n + 1 simultaneous equations reduces to a system of just two independent equations, for any n. These two equations correspond to two lines in the (a_n, b) plane. If these two lines never cross, then the system has no solution, the minimum of the cost function is not at a_1 = · · · = a_{n−1} = 0, and the perceptron is thus not able to correctly reproduce the likelihood. If instead these lines meet at some (\bar a_n, \bar b), then the minimum of the cost function does correspond to the likelihood ratio. Despite numerous attempts, we have not been able to perform the final integration over l_n in Eq. (3.15) analytically. However, we can perform the integration numerically and plot the result in the (a_n, b) plane. This is done in Fig. 2a, where, without loss of generality, we have concentrated on the case n = 2. It is clear from the plot that the two curves do meet in a point and hence the perceptron is able to find the correct minimum, i.e. the one dictated by the likelihood with a_1 = 0. The explicit n-dependence of the coefficients a_n and b can be found numerically by solving the system of equations. Furthermore, it is possible to obtain analytically the scaling of a_n with respect to the colour factors, as detailed in appendix C. We find that, once a factor \sqrt{\tilde C_F} has been extracted, the resulting coefficient only depends on the ratio of colour factors:

a_n = \sqrt{\tilde C_F}\; F\!\left( \frac{\tilde C_A}{\tilde C_F} \right),  (3.18)

where F is a function that we have not determined.
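The crossing point in Fig. 2a can be reproduced numerically: for n = 2 the two independent conditions only differ by the power of l_n in the integrand, and a standard root finder locates their common zero. A sketch with our normalisation conventions (overall constants are irrelevant for the zero, and the initial guess is arbitrary):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import fsolve

CF, CA = 4.0 / 3.0, 3.0        # colour factors in units of alpha_s/(pi*beta)
sig = lambda u: 1.0 / (1.0 + np.exp(-u))

def constraint(power, a, b):
    """One of the two conditions from Eq. (3.15) for n = 2 with a_1 = 0:
    power = 3 for the bias equation, power = 4 for the weight equations."""
    f = lambda l: l**power * (CF**2 * np.exp(-CF * l**2) * sig(a * l + b)
                              - CA**2 * np.exp(-CA * l**2) * sig(-a * l - b))
    return quad(f, 0.0, 50.0)[0]

a2, b = fsolve(lambda p: [constraint(3, *p), constraint(4, *p)], x0=[-2.0, 1.0])
print(f"a_2 = {a2:.3f},  b = {b:.3f}")   # the crossing point of Fig. 2a
```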
Linear inputs. We now move on to consider linear inputs to the perceptron, i.e. the variables T_i directly. We follow the same logic as in the case of the logarithmic inputs, namely we want to check whether there is a minimum of the cost function satisfying a_1 = · · · = a_{n−1} = 0. Following the same steps as before, we have

\int_0^1 dT_n\; J_i(T_n)\; \frac{2}{T_n}\log\frac{1}{T_n} \left[ \tilde C_F^{\,n}\, e^{-\tilde C_F \log^2 T_n}\, f(a_n T_n + b) - \tilde C_A^{\,n}\, e^{-\tilde C_A \log^2 T_n}\, f(-a_n T_n - b) \right] = 0,  (3.19)

with

J_i(T_n) = \int_{T_n}^{1} dT_1 \int_{T_n}^{T_1} dT_2 \cdots \int_{T_n}^{T_{n-2}} dT_{n-1}\; x_i \prod_{k=1}^{n-1} \frac{2}{T_k}\log\frac{1}{T_k},  (3.20)

where x_0 = 1, x_i = T_i for i = 1, ..., n−1 and x_n = T_n. There is a crucial difference between the integrals in Eq. (3.20) and the corresponding ones in the case of the logarithmic inputs, Eq. (3.16): the cases with 1 ≤ i ≤ n lead to n different constraints. For the sake of simplicity, let us consider n = 2. We have

J_0(T_2) = \log^2 T_2, \qquad J_1(T_2) = 2\left( 1 - T_2 + T_2 \log T_2 \right), \qquad J_2(T_2) = T_2 \log^2 T_2.  (3.21)

Thus, all three equations appearing in (3.19), i.e. i = 0, 1, 2, provide independent conditions. The system has solutions if the corresponding three curves in the (a_2, b) plane meet at one point. By numerically performing the integrations, we can check whether this happens or not. This is done in Fig. 2b, where it is clear that the three curves do not intersect at a common point and hence the perceptron is unable to find the correct minimum dictated by the likelihood.

Let us summarise the findings of this section. Working in the leading-logarithmic approximation of QCD, we have analytically studied the behaviour of a perceptron with sigmoid activation and cross-entropy cost function, in the context of a binary classification problem. We have explicitly considered three variants of the primary N-subjettiness inputs: squared logarithms, logarithms and linear inputs. In the first two cases the minimum of the cost function captures the expected behaviour from the likelihood ratio, i.e. a_1 = · · · = a_{n−1} = 0. This does not happen with linear inputs, although the configuration a_1 = · · · = a_{n−1} = 0 is within the reach of the network. This is most likely due to the fact that the simple perceptron with linear inputs struggles to correctly learn the probability distributions, which are intrinsically logarithmic. We expect that a more complex network would be needed in this case, and this is shown explicitly in section 5.

Perceptron numerics
In this section we validate our analytic findings with an actual implementation of a perceptron. In practice, we have used a simple implementation based on [59], with a learning rate of 0.1 and a mini-batch size of 32.⁶ Because our first-principles analysis has been developed at LL accuracy, in order to numerically test the perceptron performance we generate a sample of pseudo-data according to the QCD LL distribution for quark and gluon jets. We consider the three input variants also used in the analytic study, namely square logarithms, logarithms and the linear version of the N-subjettiness inputs. In order to train our single-neuron network we use a sample of 1M events in the first two cases and 16M for the linear inputs, unless otherwise stated. Furthermore, we perform our study as a function of the number of N-subjettiness variables. Specifically, when we quote a given value of n, we imply that all T_i with i ≤ n have been used as inputs. The results of this study are collected in Fig. 3. Each plot shows the value of the network weights a_i and bias b after training, i.e. at the minimum of the cost function that has been found by the network through back-propagation and gradient descent. The plot on the left is for log-square inputs, the one in the middle for log inputs and the one on the right for linear inputs. The values of the weights determined by the network are shown as circles (for a_i with i < n), squares (for a_n) and triangles (for b), with the respective numerical uncertainties. In the case of log-square and log inputs, we also show the behaviour obtained from our analytic calculations in section 3. We find perfect agreement. In particular, the minimum of the cost function found by the network is consistent with a_1 = · · · = a_{n−1} = 0. This does not happen in the case of linear inputs, although the discrepancy is tiny.

[Figure 5: ROC curves obtained from a trained perceptron with different inputs, for n = 3 and n = 5. In order to highlight deviations from optimal performance, we show Eq. (4.1); the closer these curves are to zero, the better the performance.]
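Generating pseudo-data according to the LL distribution is straightforward in the fixed-coupling limit: in the variables L_k = log² T_k, the density of Eq. (2.21) is that of cumulative sums of independent exponentials. A minimal sketch of the kind of generator we have in mind (function name and parameter values are illustrative):

```python
import numpy as np

def sample_ll(n_jets, n, Ci, alphas=0.1, beta=1.0, rng=None):
    """Sample (T_1 >= ... >= T_n) from the LL fixed-coupling distribution:
    in the variables L_k = log^2 T_k, Eq. (2.21), the density is
    Ct^n exp(-Ct L_n) on 0 <= L_1 <= ... <= L_n with Ct = alphas*Ci/(pi*beta),
    i.e. the L_k are cumulative sums of n independent Exp(Ct) variables."""
    rng = rng or np.random.default_rng()
    Ct = alphas * Ci / (np.pi * beta)
    L = np.cumsum(rng.exponential(1.0 / Ct, size=(n_jets, n)), axis=1)
    return np.exp(-np.sqrt(L))    # T_k = exp(-sqrt(L_k)), so T_1 >= ... >= T_n

Tq = sample_ll(10**6, n=3, Ci=4.0 / 3.0)   # quark jets
Tg = sample_ll(10**6, n=3, Ci=3.0)         # gluon jets
```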
It is interesting to investigate whether the theoretical issues for the linear inputs have some visible effect on the network performance. In order to do so, we first perform a study of the perceptron convergence as a function of the training sample size N_train. As before, we repeat this study for each of the input variants previously introduced. We also consider two different values of n: for the case n = 3 we build our inputs from T_1, T_2, T_3, while for n = 5 we also include T_4 and T_5. In Fig. 4 we plot the cost function as a function of the training sample size. Fig. 4 is obtained by training the perceptron with a progressively increasing sample of pseudo-data generated on the fly according to the QCD LL distribution. At fixed values of N_train, the cost function C of the trained network is then evaluated on a test sample of fixed dimension. This procedure is iterated a number of times and, at the end, for each N_train, we take the average and standard deviation of the different runs as a measure of the central value and uncertainty. The plots clearly show that the convergence with linear inputs is slower than in the cases of log-squares and logs, exposing the fact that the single-neuron network struggles to learn intrinsically logarithmic distributions with linear inputs.
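The training procedure itself is standard; a bare-bones sketch of the single-neuron trainer we have in mind (mini-batch stochastic gradient descent with the cross-entropy loss; the function name and defaults are ours) reads:

```python
import numpy as np

rng = np.random.default_rng(2)

def train_perceptron(Xq, Xg, lr=0.1, batch=32, epochs=5):
    """Single neuron, sigmoid output, cross-entropy loss, mini-batch SGD.
    Labels: 0 for quark inputs Xq, 1 for gluon inputs Xg."""
    X = np.vstack([Xq, Xg])
    y = np.concatenate([np.zeros(len(Xq)), np.ones(len(Xg))])
    a, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for j in np.array_split(rng.permutation(len(X)), len(X) // batch):
            out = 1.0 / (1.0 + np.exp(-np.clip(X[j] @ a + b, -500, 500)))
            grad = out - y[j]               # d(loss)/d(pre-activation)
            a -= lr * (X[j].T @ grad) / len(j)
            b -= lr * grad.mean()
    return a, b
```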
Furthermore, with the same setup, we can study the actual performance of the network using ROC curves. To highlight deviations from the ideal case dictated by the likelihood ratio, the plots in Fig. 5 show

\Delta(x) = \frac{\mathrm{ROC}_{NN}(x)}{\mathrm{ROC}_{lik.}(x)} - 1,  (4.1)

where ROC_NN is the network ROC curve, while ROC_lik. is the ROC curve that corresponds to a cut on T_n. We perform this study for two different sizes of the training sample: 1M and 16M. We know from Fig. 4 that the former is enough to achieve convergence in the case of log-square and log inputs, but not for linear inputs. Thus, we expect to see larger differences in this case. This is indeed confirmed by the plots in Fig. 5. The dash-dot-dotted curve in blue, corresponding to linear inputs and a training done on a sample of 1M events, indicates that the difference with respect to the optimal case is of order 10% for quark efficiencies ε_q < 0.6 in the case of n = 3, and exceeds 30% in the same region for n = 5. If the training sample size is increased to 16M, the perceptron with linear inputs still performs worse than the other choices, although the difference after training is now small and never exceeds 1%. For comparison, we also show the results for N-subjettiness ratios as inputs: for n = 3, we include T_1, T_21 = T_2/T_1 and T_32 = T_3/T_2. In this case, the expected ideal cut on T_n cannot be reproduced by any choice of the perceptron weights a_i and bias b. The perceptron performance is consequently much worse, even after training on our largest sample.
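The quantity of Eq. (4.1) is easily computed from the network scores and from T_n itself; a sketch follows (the score orientation, with small values being quark-like, is our convention, and the commented scores are hypothetical):

```python
import numpy as np

def roc_ratio(score_q, score_g, Tn_q, Tn_g, x):
    """Eq. (4.1): ROC_NN(x)/ROC_lik(x) - 1 at quark efficiencies x.
    The likelihood reference is a simple cut on T_n."""
    def roc(sq, sg):
        cuts = np.quantile(sq, x)                 # cuts giving efficiency x
        return np.searchsorted(np.sort(sg), cuts) / len(sg)
    return roc(score_q, score_g) / roc(Tn_q, Tn_g) - 1.0

x = np.linspace(0.1, 0.9, 9)
# delta = roc_ratio(nn_q, nn_g, Tq[:, -1], Tg[:, -1], x)  # hypothetical scores
```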

Monte Carlo studies
In the previous section, we validated the results of section 3 using the same setting as for our analytic calculations, namely a single-neuron network fed with primary N-subjettiness variables (or functions thereof) distributed according to QCD predictions at leading-logarithmic accuracy. This setup is somewhat oversimplified, and one may wonder how our results compare, in terms of performance and convergence, to a fully-fledged neural network trained on Monte Carlo pseudo-data. In addition, we have adopted the primary N-subjettiness definition, whose nice analytic properties naturally justify its use in our theoretical studies. However, this alternative definition has so far been compared to the standard definition only in terms of AUC at LL accuracy. Even though the purpose of this work is not a detailed comparison of the two definitions, we need to make sure that primary N-subjettiness performs sensibly as a quark/gluon tagger.
The purpose of this section is to address some of these concerns, by extending the setup of section 4 to a more realistic scenario, both in terms of the generation of pseudo-data and of the network architecture employed in the analysis. First, we generate pseudo-data with Pythia 8.230 [60], including hadronisation and multi-parton interactions. We simulate dijet events qq → qq and gg → gg and we cluster jets with the anti-k_t algorithm [61], as implemented in FastJet [62], with jet radius R_0 = 0.4. We keep jets with p_t > 2 TeV and |y| < 4. We then compute τ_N and T_N (i.e. both N-subjettiness definitions) up to the desired N = n for each jet. We set the N-subjettiness parameter β to 1. For the standard N-subjettiness, we use the reference axes obtained from the exclusive k_t algorithm with winner-takes-all [63] recombination and a one-pass minimisation [49]. In addition to the linear, log-square and log functional forms and the N-subjettiness ratios already adopted in the previous section, we also consider a running-coupling-like input, obtained by evaluating the radiator R(τ_i) of Eq. (2.11) with a one-loop running coupling.⁷ The samples thus obtained are used to train either a single perceptron (as in the previous section) or a full neural network. The perceptron is trained using the same implementation based on Ref. [59], training over 50 epochs, with a learning rate of 0.5. The full network uses a dense architecture implemented in PyTorch 1.4.0. The number of inputs equals the number n of input N-subjettiness values (between 2 and 7). There are three hidden layers with 128 nodes per layer, and a LeakyReLU [64] activation function with negative slope 0.01 is applied after each of these layers. We train the network using the binary cross-entropy for a maximum of 50 epochs, using the Adam [65] optimiser with a learning rate of 0.005 and a mini-batch size of 32 events. During training, the area under the ROC curve is evaluated after each epoch on a statistically independent validation set, and training terminates if there is no improvement for 5 consecutive epochs [66]. The best model on the validation set is then selected and evaluated on another independent test set, for which results are reported. Data are split 6:2:2 between training, validation and testing. No extensive optimisation of the hyperparameters has been performed, and training usually terminates due to the early-stopping criterion before reaching the maximal number of epochs.
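A sketch of the architecture just described, in PyTorch (the class name is ours, and we use BCEWithLogitsLoss as a numerically stable form of the binary cross-entropy):

```python
import torch
import torch.nn as nn

class QGTagger(nn.Module):
    """Dense network as described above: n inputs, three hidden layers of
    128 nodes, LeakyReLU(0.01) after each hidden layer, single output."""
    def __init__(self, n_inputs):
        super().__init__()
        layers, d, width = [], n_inputs, 128
        for _ in range(3):
            layers += [nn.Linear(d, width), nn.LeakyReLU(0.01)]
            d = width
        layers += [nn.Linear(d, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = QGTagger(n_inputs=5)                             # n between 2 and 7
opt = torch.optim.Adam(model.parameters(), lr=0.005)
loss_fn = nn.BCEWithLogitsLoss()   # binary cross-entropy on the raw output
```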
We start with a comparison of the primary and the standard definitions of N-subjettiness in terms of network performance, shown in Fig. 6a. We see that the performance of networks trained with the standard definition is worse by 10-15% compared to the primary definition for mid-values of the quark efficiency, 0.4 ≲ ε_q ≲ 0.8, and comparable elsewhere, except for a small region at large quark efficiency, ε_q ≳ 0.9. Even though the benefit of using primary N-subjettiness is not as large as one could have expected based on our leading-logarithmic calculations, we still observe a performance improvement. We note also that at very small quark efficiency non-perturbative effects have a sizeable impact, invalidating our arguments purely based on a perturbative calculation. Furthermore, large ε_q corresponds to the regime where the cut on N-subjettiness is no longer small, so that our arguments based on the resummation of large logarithms no longer apply and one should instead consider the full structure of the hard matrix elements.⁸ That said, Fig. 6a shows that for most of the range in quark efficiency, a better discrimination is reached if we adopt the primary N-subjettiness definition.

⁷ We set α_s(p_t R_0) = 0.09, and we regulate the divergence at the Landau pole by freezing the coupling at a scale k_t such that 2α_s(p_t R_0) β_0 log(p_t R_0/k_t) = 0.99.
It is also interesting to compare the results obtained with different inputs in Fig. 6a. Although one would expect that a full NN should ultimately reach the performance of the likelihood independently of the choice of inputs, our practical setup still shows a mild dependence on this choice. The key point is that the favoured inputs agree with our analytic expectations from section 3, with logarithmic inputs (log τ_i, log² τ_i and R(τ_i)) showing an equally good performance, the linear inputs being only marginally worse, and the ratio inputs showing a (few percent) worse discriminating power. This shows that an understanding of the underlying physical behaviour of the problem one addresses is helpful for deep learning.⁹ We note however that, even though the convergence of the neural network is more delicate for the case of ratio inputs (and, to a lesser extent, for linear inputs), a performance similar to the optimal one reported in Fig. 6a for logarithmic inputs could be achieved with a careful optimisation of the hyperparameters.
We now move to Fig. 6b, where we compare the ROC curves obtained with a full neural network to those of a simple perceptron. In this plot we select the primary N-subjettiness definition and we display results for the usual input choices. The solid lines in Fig. 6b coincide with the ones in Fig. 6a. First, we observe that a perceptron trained with Monte Carlo pseudo-data performs worse compared to the full NN for all the considered input types. This is not surprising, as the arguments in section 3 are based on a simplified, leading-logarithmic, approach, and subleading corrections can be expected to come with additional complexity that a single perceptron fails to capture. It is actually remarkable that for ε_q ≲ 0.6 a single perceptron typically gives a performance which is only 10% worse than the full network. This is not true if we consider N-subjettiness ratio inputs, in which case the perceptron performance is sizeably worse than the performance we get by using the full NN. This is in agreement with the observation made towards the end of section 4, namely that a more complex network architecture is able to learn the correct weights for the ratio inputs, while the perceptron is not.
The behaviour of the other input types is relatively similar, although, quite surprisingly, linear inputs tend to give a slightly better performance than logarithmic ones. The origin of this is not totally clear. One can argue that the difference in performance between linear and logarithmic inputs was already small at leading-logarithmic accuracy and that several effects (subleading logarithmic effects, hard matrix-element corrections, non-perturbative effects, ...) can lift this approximate degeneracy. One should also keep in mind that our leading-logarithmic calculation is valid at small-enough quark/gluon efficiencies, while, as the efficiency grows, we become sensitive to contributions from other kinematic domains. We also note that if we increase the jet p_t, e.g. if we look at 20 TeV jets at a 100 TeV pp collider, the difference between logarithmic and linear inputs becomes again comparable, which is likely a sign that the phase space at larger p_t has an increased contribution from the resummation region. We finally note that a simple cut on T_n, which is optimal at LL, gives results comparable to the perceptron at large quark efficiencies, ε_q ≳ 0.7, but performs rather worse otherwise.
Next, we study how the convergence of the network is affected by the choice of inputs. In this context, convergence can mean two different things. Firstly, we can speak of convergence of the network training (as we did in Fig. 4). This is shown in Fig. 7a, where we plot the area under the ROC curve (AUC) as a function of the number of training epochs for a single perceptron. For this plot, we have used primary N-subjettiness and (mini-batch) stochastic gradient descent with a learning rate λ = 0.5. The better performance obtained for linear inputs is confirmed in Fig. 7a. However, this comes at a cost in convergence speed: where logarithmic inputs show an almost complete convergence after a single epoch, linear inputs converge much more slowly.¹⁰ We also see that using ratios of N-subjettiness as inputs yields a clearly worse AUC. Secondly, we can study convergence as a function of the size of the network, meaning either the network width at fixed number of hidden layers, Fig. 7b, or the network depth at fixed number of neurons per hidden layer, Fig. 7c. These plots also show that logarithmic inputs converge (marginally) faster than linear inputs. Interestingly, after a full training (with 7 hidden layers and 128 neurons per hidden layer), we recover a situation where logarithmic inputs give a slightly better performance compared to linear inputs, as our analytic studies suggested.
As a final test, we study in Fig. 8 the AUC as a function of the number n of input N-subjettiness variables. Results are shown for inputs distributed according to the leading-logarithmic QCD distributions (as for our analytic calculations in section 3 and for our tests in section 4) in Fig. 8a, as well as for a Pythia8 sample in Fig. 8b. We show the AUC for both the standard definition of N-subjettiness (red) and for our primary definition (blue), and in both cases we show the results of training either a single perceptron (open symbols) or a full NN (filled symbols). In the case of the Pythia8 sample, we also show the AUC corresponding to a simple cut on either T_n or τ_n. For the case of the leading-logarithmic distributions, we also provide the expectations from the actual likelihood ratio (dashed lines), obtained from Eq. (2.17). In both cases, the primary N-subjettiness performs better than the standard definition, although the difference, clearly visible for the leading-log distributions, is only marginal with the Pythia8 sample. Then, in the case of a leading-log distribution, the optimal performance of the primary N-subjettiness is already captured by a single perceptron, while for the standard N-subjettiness only the full NN is able to reach a performance on par with the likelihood expectation. For the Pythia8 sample, the training of the full NN leads to a reduction of the AUC compared to a single perceptron for both N-subjettiness definitions. The improvement is however smaller for the primary definition (∼ 10%) than for the standard one (∼ 15%). This is likely a remnant of the behaviour observed at leading-logarithmic accuracy. It is interesting to observe that the performance of a simple cut on T_n, which is optimal at LL, is comparable to the perceptron at small n, while it degrades as n increases. This is most likely due to the fact that at large n we become more sensitive to non-perturbative corrections, and consequently our LL approximation is no longer sufficient. Conversely, a cut on τ_n, while giving a performance comparable to that of the perceptron for n = 2, degrades more rapidly at larger n.

Conclusions and outlook
In this paper we have investigated the behaviour of a simple neural network made of just one neuron, in the context of a binary classification problem. The novelty of our study consists in exploiting the underlying theoretical understanding of the problem, in order to obtain firm analytical predictions about the perceptron performance.
In particular, we have addressed the question of distinguishing quark-initiated jets from gluon-initiated ones, exploiting a new variant of the (IRC-safe) N-subjettiness family of observables that we dubbed primary N-subjettiness. As the name suggests, these observables are only sensitive, at LL accuracy, to emissions of soft gluons off the original hard parton, while standard N-subjettiness also depends on secondary gluon splittings, already at LL. Thanks to the simple all-order behaviour of the observables {T_1, ..., T_n}, we have been able to determine that the optimal discriminant at LL, i.e. one which is monotonically related to the likelihood ratio, is just a cut on T_n. We have also been able to obtain analytic expressions for the ROC curve and for the area under it (AUC), which are standard figures of merit for a classifier's performance. Furthermore, we have found that, besides having nicer theoretical properties, T_N typically outperforms τ_N as a quark/gluon tagger in the range of quark efficiencies that is typically employed in phenomenological analyses.
The central part of this study has been the analytic investigation of the LL behaviour of a perceptron that takes primary N-subjettiness variables as inputs. We have considered a perceptron with a sigmoid as activation function and the cross-entropy as loss function. The main question we have raised is whether the values of the perceptron weights at the minimum of the cost function correspond to the aforementioned optimal LL classifier, i.e. a cut on the last primary N-subjettiness considered. We have been able to show analytically that this depends on the actual functional form of the inputs fed to the network. The perceptron is able to find the correct minimum if logarithms (or squared logarithms) of the N-subjettiness are passed, but fails to do so with linear inputs. This reflects the fact that the LL distributions of the inputs are indeed logarithmic. These analytic results are fully confirmed by a numerical implementation of the perceptron, which has been trained on pseudo-data generated according to a LL distribution. Furthermore, we have also found that, in the case of linear inputs, the learning of the perceptron is substantially slower than for logarithmic inputs. As a by-product, we were also able to find, in a few cases, closed analytic expressions for the network cost function.
Finally, we have considered a more realistic framework for our analysis. Firstly, we have trained the perceptron on pseudo-data generated with a general-purpose Monte Carlo parton shower, and we have obtained qualitative agreement with the analytic study. Secondly, we have observed that, when considering a full neural network, we obtain a comparable performance with all the variants of the inputs, with the possible exception of the N-subjettiness ratios, which require more fine-tuning to be brought to the same performance.
Our work highlights in a quantitative way the positive role that a first-principles understanding of the underlying physical phenomena plays in classification problems that employ ML techniques. Even if a similar degree of performance is ultimately achieved with a full NN regardless of the type of inputs, a knowledge of the underlying physics is helpful to build a simpler network, less dependent on a careful optimisation of the hyperparameters. Furthermore, we stress that the use of theoretically well-behaved observables has allowed us to perform our analysis at a well-defined, and in principle improvable, accuracy. We believe that this is an important step towards the consistent use of ML in experimental measurements that are ultimately aimed at a detailed comparison with theoretical predictions.
Our current analysis suffers from the obvious limitations of being conducted at LL and for just a very simple network. We believe that going beyond LL accuracy for primary N-subjettiness is not only interesting, but desirable. The first reason would be to see whether the simple resummation structure we find at LL accuracy survives at higher orders. Moreover, this very first analysis establishes primary N-subjettiness as a promising substructure observable. One is therefore tempted to also investigate its performance in the context of boosted vector-boson or top tagging, although one might expect these classification problems to benefit from constraints beyond primary emissions. Furthermore, in the framework of our analytic study of ML techniques, it would be interesting to see whether at higher accuracy one is still able to find a simple combination of architecture, cost function and network inputs that guarantees, analytically, optimal performance. More generally, we look forward to future work where perturbative QCD helps to design better ML techniques with a comfortable degree of analytic control.


A Details of the AUC calculation

In this appendix we provide the details of the calculation of the AUC, defined as the integral of the ROC curve given in Eq. (2.5):

\mathrm{AUC} = \int_0^1 dx\, \mathrm{ROC}(x) = \int_0^1 dx\, \Sigma_g\big( \Sigma_q^{-1}(x) \big),  (A.1)

where, for a given observable V, the cumulative distribution for signal or background as a function of a cut v reads as in Eq. (2.4), which we rewrite as:

\Sigma_i(v) = \int d\tau_1 \cdots d\tau_n\, p_i(\tau_1, \dots, \tau_n)\, \Theta\big( v - V(\tau_1, \dots, \tau_n) \big), \qquad i = q, g.  (A.2)

We now perform a change of variable from the efficiency x to the cut on the observable v, which results in the following Jacobian factor:

dx = dv\, \frac{d\Sigma_q(v)}{dv}.  (A.3)

Thus, we have

\mathrm{AUC} = \int dv\, \frac{d\Sigma_q(v)}{dv}\, \Sigma_g(v),  (A.4)

which, once the cumulative distributions are written explicitly in terms of p_q and p_g for V = T_n, is the expression presented in Eq. (2.16).
We now specialise to our case and consider the probability distributions p_q and p_g for primary N-subjettiness at LL accuracy, which are given in Eq. (2.10). We first integrate over the quark variables and, exploiting the result found in Eq. (2.15), we obtain

\int_0^1 dT_{1q} \cdots \int_0^{T_{n-1,q}} dT_{nq}\; p_q(T_{1q}, T_{2q}, \dots, T_{nq})\, \Theta(T_{nq} > T_{ng}) = 1 - \frac{\Gamma\big( n, C_F R(T_{ng}) \big)}{\Gamma(n)}.  (A.7)

The integral over T_{ng} can be performed using the following integration-by-parts identity:

\Gamma(n, x) = (n-1)\, \Gamma(n-1, x) + x^{n-1} e^{-x}.  (A.8)

We see that iteratively the following kind of integrals appears:

\int_0^\infty du\; u^{m}\, e^{-\left( 1 + \frac{C_F}{C_A} \right) u} = \frac{m!}{\left( 1 + \frac{C_F}{C_A} \right)^{m+1}}.  (A.9)

In the end we obtain a finite sum, which can be performed explicitly, leading to the result presented in Eq. (2.17).

B Analytic results for the cost function
In section 3 we have looked for a minimum of the cost function, Eq. (3.7), by computing from the very beginning the derivatives with respect to the NN weights. This procedure allowed us to determine analytically the position of the minimum for the case of log-square inputs.
In the case of logarithmic and linear inputs, although we have not been able to perform all the integrals analytically, we have nonetheless reduced the problem of determining the position of the minimum of the cost function to a one-dimensional integration, which we have performed numerically. An alternative approach would be to determine an analytic expression for the cost function, as a function of the NN weights and bias, which we would then have to minimise. As we shall see in this appendix, this approach is not as successful as the one presented in the main text. The integrals that appear in the determination of the cost function, Eq. (3.7), are rather challenging, and we have been able to solve them only in a couple of simple cases. Namely, we limit ourselves to the case n = 2 with log-square inputs, using the sigmoid as activation function, and we report results for the cross-entropy loss, Eq. (3.4), and the quadratic loss, Eq. (3.5). Even if the NN setup is minimal, the derivation of explicit expressions for the cost function may be instructive, as they represent a first-principles determination of a NN's behaviour, and they could be valuable in the context of comparisons between experimental measurements that make use of machine-learning algorithms and theoretical predictions.

B.1 Cross-entropy loss
We start by considering the cost function Eq. (3.7) with the cross-entropy loss. Explicitly:

C^{(XE)}(a_1, a_2, b) = -\frac{1}{2} \int_0^\infty dL_2 \int_0^{L_2} dL_1 \left[ \tilde C_F^2\, e^{-\tilde C_F L_2} \log f(-a_1 L_1 - a_2 L_2 - b) + \tilde C_A^2\, e^{-\tilde C_A L_2} \log f(a_1 L_1 + a_2 L_2 + b) \right].  (B.1)

Note that the replacement C_F ↔ C_A is equivalent to (a_1, a_2, b) ↔ (−a_1, −a_2, −b), due to the symmetries of the functions involved. We first observe that the integral over L_1 gives rise to a dilogarithm. In order to evaluate the integral over L_2, we make use of the result in Eq. (B.2), which expresses it in terms of hypergeometric functions. The analytic expression that we obtain for the cost function takes the form

C^{(XE)}(a_1, a_2, b) = \frac{1}{2a_1} \Big[ I^{(XE)}(\tilde C_F, a_2, b) - I^{(XE)}(\tilde C_F, a_1 + a_2, b) \Big] + \big( \tilde C_F \to \tilde C_A,\ (a_1, a_2, b) \to (-a_1, -a_2, -b) \big),  (B.3)

where the function I^{(XE)} collects the hypergeometric contributions. From section 3, we already know that the minimum is located at a_1 = 0. However, given the structure of the explicit result after integration, Eq. (B.3), it is highly nontrivial to recover the position of the minimum analytically, due to the presence of a_1 both in the denominator and in the arguments of the hypergeometric functions.

B.2 Quadratic loss
An analogous calculation can be performed in the case of the quadratic loss, Eq. (3.5). We have to calculate

C^{(Q)}(a_1, a_2, b) = \frac{1}{2} \int_0^\infty dL_2 \int_0^{L_2} dL_1 \left[ \frac{\tilde C_F^2\, e^{-\tilde C_F L_2}}{\big( 1 + e^{-a_1 L_1 - a_2 L_2 - b} \big)^2} + \frac{\tilde C_A^2\, e^{-\tilde C_A L_2}}{\big( 1 + e^{a_1 L_1 + a_2 L_2 + b} \big)^2} \right].  (B.4)

As in the case of the cross-entropy loss, the integral over L_1 is straightforward. We then perform the remaining integral over L_2 by means of the identity in Eq. (B.5). For illustration purposes, we fix \tilde C_A and \tilde C_F to C_A and C_F respectively. In Fig. 9 we plot each cost function around the minimum that we found. We see that the condition a_1 = 0 and the (negative) value a_2 = \tilde C_F − \tilde C_A ≈ −1.67 are indeed confirmed.

[Figure 9: Cost function as a function of a_1 and a_2 around the minimum point; b has been fixed accordingly to the value in Eq. (3.14).]
C Scaling of a_n in the log-inputs case
In this appendix we study the scaling of a_n with respect to the colour factors \tilde C_F and \tilde C_A. In order to simplify the notation, in this appendix we rename x ≡ l_n and a ≡ a_n.
We first expand in series around x = 0 the sigmoid functions which appear in Eq. (3.15):

f(ax + b) = \sum_{k=0}^{\infty} \frac{f^{(k)}(b)}{k!}\, a^k x^k,  (C.1)

so that Eq. (3.15) reduces to integrals of the form

\int_0^\infty dx\; x^{m+k}\, e^{-\tilde C_i x^2},  (C.2)

with m = 2n or m = 2n − 1. Given the following integration identity:

\int_0^\infty dx\; x^{m+k}\, e^{-\tilde C x^2} = \frac{1}{2}\, \Gamma\!\left( \frac{m+k+1}{2} \right) \tilde C^{-\frac{m+k+1}{2}},  (C.3)

we obtain:

\sum_{k=0}^{\infty} \frac{a^k}{k!}\, \Gamma\!\left( \frac{m+k+1}{2} \right) \left[ f^{(k)}(b)\, \tilde C_F^{\,n - \frac{m+k+1}{2}} - (-1)^k f^{(k)}(-b)\, \tilde C_A^{\,n - \frac{m+k+1}{2}} \right] = 0.  (C.4)

We now divide by \tilde C_F^{(2n-m-1)/2} and collect a factor \tilde C_F^{-k/2} in the second bracket:

\sum_{k=0}^{\infty} \frac{1}{k!} \left( \frac{a}{\sqrt{\tilde C_F}} \right)^{k} \Gamma\!\left( \frac{m+k+1}{2} \right) \left[ f^{(k)}(b) - (-1)^k f^{(k)}(-b) \left( \frac{\tilde C_A}{\tilde C_F} \right)^{n - \frac{m+k+1}{2}} \right] = 0.  (C.5)

Since this equation depends on a only through the combination a/\sqrt{\tilde C_F}, and on the colour factors only through their ratio, it suggests that

a_n = \sqrt{\tilde C_F}\; F\!\left( \frac{\tilde C_A}{\tilde C_F} \right),  (C.6)

which is the relation quoted in the main text.