Jet tagging made easy

We develop taggers for multi-pronged jets that are simple functions of jet substructure (so-called ‘subjettiness’) variables. These taggers can be approximately decorrelated from the jet mass in a quite simple way. Specifically, we use a Logistic Regression Design (LoRD) which, even being one of the simplest machine learning classifiers, shows a performance which surpasses that of simple variables used by the ATLAS and CMS Collaborations and is not far from more complex models based on neural networks. Contrary to the latter, our method allows for an easy implementation of tagging tasks by providing a simple and interpretable analytical formula with already optimised parameters.


Introduction
The tagging of boosted jets is an essential tool for the search of heavy new physics beyond the Standard Model (SM) that involves one or more coloured particles as final state. After the first works on jet substructure [1][2][3] many jet substructure variables [4][5][6][7][8] have been proposed to discriminate boosted Higgs bosons, top quarks and weak W/Z bosons from the QCD background composed of quark and gluon jets. (See also Refs. [10][11][12].) Multivariate taggers based on jet substructure variables [13,14] or directly using jet images (see Ref. [15] for a review) have also been employed. These taggers can be qualified as 'dedicated', namely they are designed to discriminate jets corresponding to a specific type of signal (Higgs, top, W/Z ) from the background. Despite achieving a good discrimination (especially for multivariate taggers), a drawback of dedicated taggers is the fact that they may not be sensitive to other types of jets different from those which they are designed for. This was made clear in Ref. [16]: the application of a wrong tagger to a four-pronged jet can cause the J. A. Aguilar-Saavedra: On leave of absence from Universidad de Granada, 18071 Granada, Spain. a e-mail: jaas@ugr.es (corresponding author) signal significance to plummet, even below the value when the tagger is not applied.
A first attempt to develop a 'generic' tagger, sensitive to a variety of complex jets, was done in Ref. [17], by training a neural network (NN) with two, three and four-pronged jets as signals, against the QCD background. Another complementary approach is brought by weakly-supervised and unsupervised methods, where little or no theoretical input is given and the NN learns to discriminate potential signals in different ways, for example by comparing different kinematical regions [18][19][20] or by a reduction of the space dimensionality using an autoencoder [21][22][23][24]. Despite all these multivariate methods are very effective in the discrimination, they have two inherent drawbacks: • Often, there is no obvious interpretation of how the multivariate methods work in separating signal from background. This is because the interpretability of the results greatly decreases in general with the complexity of the machine learning (ML) models. • More complex ML models require more complex implementations, which translates into more difficult testing and reproducibility of the results, especially when the literature does not provide all the necessary details of the implementation (which is typically the case, unfortunately).
As a complement to these methods, it is our purpose here to develop and test a set of simple taggers for multi-pronged jets, based on a set of variables proposed in Ref. [13], 1 , τ (2) 1 , . . . , τ (1/2) M−2 , τ (1) M−2 , τ (2) M−2 , τ (1) M−1 , τ (2) with M > 1 an integer. The subjettiness variables τ (β) n measure to which extent the radiation in a jet is clustered around n axes, with β an angular exponent (their precise definition is given in the next section). As shown in Ref. [13], the set of variables (1) allows to reconstruct the phase space of M partons within a jet, up to a global rotation.
Our main goal in this work is simplicity: the taggers obtained are simple functions of the subjettiness variables (1), with an approximate mass decorrelation à la DDT [25] which suffices to maintain the shape of the jet mass distribution to a large extent, after the application of the taggers. The taggers are developed using Logistic Regression Design (LoRD) to find the numerical parameters in the function that achieve the best discrimination, as described in Sect. 2 and Appendix A. There exist already in the literature taggers based on optimised products of subjettiness variables: a method to develop taggers with a scan over the parameter space was earlier introduced in Ref. [26], and complex neural networks were used in Ref. [27], for the development of taggers to discriminate two-pronged jets versus QCD jets, and for quark/gluon jet identification. With respect to those, the design of our taggers is much simpler, and also addresses the two issues mentioned above about interpretability and reproducibility.
We find that the discrimination power of our simple taggers in some cases largely surpasses the simple variables used by the ATLAS and CMS Collaborations in searches for new physics using jet substructure. Results are presented in Sect. 3. We also compare the results using LoRD with NNs using the same architecture as in Ref. [17]. In general, the NNs perform better, except for some signals which neither are trained for.
An important point in the design of the taggers is the kinematical region (i.e. jet mass and p T ) used for the optimisation. We address this issue in Sect. 4. While the dependence on p T J is marginal, the dependence on jet mass is more noticeable when one gets away from the design region. Our conclusions are presented in Sect. 5. In Appendix B we estimate the variance of the taggers obtained with the LoRD. A qualitative discussion about the interpretability and the intrinsic dimension of the datasets is discussed in Appendix C. In Appendix D we compare among different options for the design of the taggers, regarding the grooming (or not) of the jet mass, momentum and subjettiness variables. In Appendix E we summarise a few results for taggers without mass decorrelation, and in Appendix F we investigate the performance of the more complex taggers -the ones designed for four-pronged jets -for jets with less prongs.

LoRD of the taggers
The input to the taggers is given by a set of subjettiness variables (1) with M ≤ 9, 1 where with i labelling the particles in the jet, p T i their transverse momenta, R K i their lego-plot distance to the axis K = 1, . . . , N and p T J the jet transverse momentum. As in Ref. [13], in the computation of these variables we use the axes defined by exclusive k T algorithm [28,29] with standard Escheme recombination [30]. The proposed functional form for the taggers is with and ρ = log m 2 J / p 2 T J . The coefficients c β n are determined by Logistic Regression to optimise the discrimination between the signal(s) and the background (see Appendix A for a detailed description of this implementation); b is a parameter chosen to achieve an approximate mass decorrelation, and for convenience the tagger output is shifted by subtracting a fixed quantity a so that its average T , when evaluated on a reference background sample, vanishes. Notice that the sum (4) is equivalent to a product n,β that includes and generalises the commonly used ratios 1 and τ 32 = τ (1) 3 /τ (1) 2 . Several comments and clarifications regarding our procedure are now in order.
• Independently of the precise method used to determine c β n , a range of jet mass and p T has to be specified for the signal(s) and background. For any of these processes the distributions of the variables τ (β) n depend on m J and p T J . The coefficients c β n are then obtained to optimise the discrimination of signal and background within a given interval of jet mass and p T . • Using ungroomed jet mass and p T for the determination of c β n reduces the dependence of the resulting taggers on the intervals chosen, and slightly improves the mass decorrelation. Also, this is desirable in order not to stick to a particular grooming algorithm. We have also tried using the groomed mass and p T , and the discrimination power is similar. We also use τ (β) n of the ungroomed jets, since we find that the discrimination between signal and background is better. A comparison among these possibilities is made in Appendix D.
• For the mass decorrelation, evaluation of the tagger performance, etc., namely in all calculations except the tagger design itself, we use m J and p T J of the groomed jets.
The recursive soft drop [31] algorithm with parameters β = 1, z cut = 0.05, N = 3 is found to work very well for multi-pronged jets, avoiding the peak distortions and shifts that other algorithms produce [32]. • The parameter a is chosen to adjust T = 0 for a reference background sample with (groomed) p T J ≥ 250 GeV, and without any restriction on m J . For larger p T or when considering a narrow m J interval, the average is slightly shifted. This residual dependence may be accounted for by varying the tagger threshold, as done for example in Ref. [33] or, equivalently, by varying a as a function of m J and p T J . This sophistication however is not required for our discussion.
With the LoRD method we optimise the discrimination between quark/gluon jets (background) and multi-pronged decays of boosted colour singlet particles (signal). Quark and gluon jets are obtained by generating the parton-level processes pp → Zg, pp → Zq, with decay Z → νν, using MadGraph5 [34], and Pythia 8 [35]. In all cases the centre-of-mass energy is set to 13 TeV. The detector response is simulated with Delphes 3.4 [36] using the CMS detector card. Jets are reconstructed using the anti-k T algorithm [37] with radius R = 0.8. FastJet 3.2 [38] is used for jet reconstruction, grooming and calculation of the τ (β) n variables. For the signal we use fat jets resulting from the decay of neutral, colour-singlet particles into four, three and two quarks, considering the six processes with S a scalar and F a fermion. These processes are generated at parton level with Protos [39] and, subsequently passed through the parton shower, hadronisation and fast simulation chain. In order to remain as model-agnostic as possible, we implement decays of S and F with a flat matrix element, so that the decay weight of the different kinematical configurations only corresponds to the two-, three-or fourbody phase space. These signal Monte Carlo data are dubbed as Model Independent (MI) data, and its use is motivated by the need to sample phase space without model prejudice [17].
This choice is very effective to make the taggers learn prongness rather than other undesired feature. Consequently, the obtained taggers T can be used outside -though not very far from -the interval they have been designed for. We build taggers for four-pronged (4P), three-pronged (3P) and two-pronged (2P) signals using the corresponding set of signal processes in the first, second and third line of Eq. (6) as signal, versus the QCD background. For 4P taggers we select M = 9 in (1), while for 3P and 2P taggers it is enough to use M = 7 and M = 5, respectively. 2 We select three different kinematical regions (for ungroomed quantities) for the design of the taggers, labelled as follows: • hi80: p T J ≥ 1 TeV, m J ∈ [60, 100] GeV. For the backgrounds we set at the parton level a cut on the jet, p T ≥ 1 TeV (for hi80 and hi200) and p T ≥ 500 GeV (for lo80) in order to increase the efficiency of the event generation. After simulation, the cuts on p T J and m J are applied. For the signals we set M Z = 2.2 TeV for hi80 and hi200, and M Z = 1.1 TeV for lo80, in order to have a transverse momentum distribution close to the one that is subsequently required by the cut on p T J . The mass of the intermediate particles S, F is set to 80 GeV for hi80 and lo80 and 200 GeV for hi200. After the simulation, the cut on m J is applied to select the corresponding kinematical region.
In total, we develop nine taggers (2P, 3P and 4P for the three kinematical regions defined above), plus three alternate versions of the hi80 taggers for testing purposes. The size of the signal and background datasets used in the optimisation is collected in Table 1. The background events are approximately evenly divided among quark and gluons. The signal events for each (4P, 3P, 2P) signal class are approximately evenly divided among the two contributions listed in each line of Eq. (6). Finally, the two classes (signal and background) are mutually balanced among each other as well. The results for the coefficients for the different taggers are collected in Table 2. The variance of the obtained results is addressed in Appendix B, while a qualitative discussion about the interpretability and the intrinsic dimension of the datasets is discussed in Appendix C.
For the mass decorrelation and test of the tagger performance we use a sample of QCD dijet production generated with MadGraph in 100 GeV intervals of p T , starting at [200, 300] GeV and with the last one having p T ≥ 2.2 TeV. The different samples are hadronised and passed through the detector simulation, and then combined with a weight that corresponds to the cross section. For each interval we generate 2×10 5 events and keep both jets (leading and sub-leading) for the analysis, therefore our QCD sample comprises 8.4 million jets, spanning a very wide range of mass and p T . The application of a tight cut on the value ofT produces a noticeable modification in the lineshape of the QCD background versus the jet mass, as seen in Appendix E. This is a serious inconvenient in experimental searches for a bump in this distribution. For other searches that do not use the jet mass as final discriminator, maintaining the shape is still desirable in order to be able to use sidebands for the esti-mation of the background. We therefore perform an approximate mass decorrelation following the DDT [25] proposal. For each tagger we select the parameter b by fitting the calculating the average slope of T versus ρ, in the interval ρ ∈ [−6, −2], for the QCD background sample. Because this average also depends on p T J , we select p T J ∈ [500, 600] GeV, which gives good results when the dependence on ρ is not linear and the averages show some spread with p T J . Finally, the parameter a is adjusted so as to have T = 0 in the inclusive QCD sample with p T J ≥ 250 GeV. The values of b and a for each tagger are collected in the last two lines of Table 2.
For illustration, we show in the left panels of Fig. 1 the average T for the QCD background in several p T J bins, as a function of ρ. The top, middle and bottom panels correspond to the 4P, 3P and 2P hi80 taggers, respectively. For comparison, we also show T for the non-decorrelated tagger, for the inclusive sample. For the 4P tagger the mass decorrelation is very good: the three lines for T are almost coincident and horizontal. For the 3P and 2P taggers the decorrelation achieved with this simple prescription is poorer. The same Table 2 Numerical coefficients in (3) and (4) corresponding to the different taggers         shown in red. Right: Jet mass distribution for the QCD background, after increasingly tighter cuts on the tagger output trend is found with the hi200, lo80 and the alternate hi80 taggers. The jet mass distributions for the QCD background after increasingly tighter cuts on T are presented on the right panels of Fig. 1. We observe that the background lineshape is very well preserved, with results comparable to the best decorrelation methods examined Ref. [40]. One can notice two minor features: (i) When the background is suppressed by a factor around 100 (e.g. green curve with respect to black curve in Fig. 1-top right), a small increase is produced in the first bin [20,40] GeV. (ii) For the 3P and 2P taggers the slope of the QCD distribution (see right plots) slightly decreases after application of the cuts on T (coloured lines). This corresponds to the fact that the T distributions on the left panels are not as flat as for the 4P tagger.
We remark that this simple decorrelation with fixed b in (3) can easily be improved by taking b as a function of m J and p T J . As the numerical calculation ofT is very simple, this is a rather computer-inexpensive task. Our purpose here is to show that even a rough mass decorrelation with fixed b does most of the job to maintain the profile of the QCD jet mass distribution after the application of the tagger, even for very tight cuts on the tagger output. Refinements are always possible, see Ref. [25]. A final point that is of interest for practical applications is the stability of the results obtained by the LoRD. The stability of the classifier performance on test samples is very good, as shown in Appendix B. However, this does not guarantee that the T distributions for the background and signals are similar. With this purpose, we have used an alternate set of hi80 taggers obtained with different random seeds to check to which extent the T distributions are alike. The results, presented in Fig. 2, show a remarkable similarity between the alternate versions (v1 and v2) of the same tagger. No cut on jet mass is applied, and p T J ≥ 250 GeV is required on signal and background for consistency with the parton-level cut in the background generation. The details on the signals can be found in the next section. This stability is useful to be able to build a set of taggers that cover a very wide range of jet masses, if necessary. Notice also that, by design, the signals are expected to have higher values ofT than the background, therefore the background can be reduced with a lower cut on T .

Performance
We evaluate the performance of our taggers by selecting several new physics signals yielding multi-pronged jets, with q = u, d, s, c light quarks other than b. We generically denote the scalars with cascade decay into four quarks as S, and the scalars decaying into bb as A. The decays Z → SS, Z → A A can take place in any SM extension with an additional U(1) group and extra scalars, with the minimal consistent implementation given in Ref. [41]. The decays Z → W W can take place in left-right models [42,43]. We select M Z = 2.2 TeV and different values of M S and M A to test the hi80 and hi200 taggers. For three-pronged jets from Z → tt we have not found significant improvement over the simple ratio τ 32 and we omit the results for brevity. The generated signal samples have a minimum of 10 5 events, and for each event we use both jets, therefore the samples used have a minimum of 2×10 5 jets. The background sample is the same one with 8.4 × 10 6 events used for the mass decorrelation.
A meaningful assessment of the performance of the taggers can only be made within a given interval of jet mass and p T . (Other anomaly detection methods [18] report combined performances using the jet mass too as discriminator.) In all cases we require p T J ≥ 1 TeV, and we do not apply an upper cut on this variable because the distributions are in all cases concentrated towards smaller transverse momentum. The jet mass interval selected for hi80 taggers is m J ∈ [60, 100] GeV, and for hi200 taggers it is m J ∈ [160, 240] GeV. These jet mass window requirements reduce the QCD background by factors of 6.5 and 7.1, respectively.
We present in Fig. 3 the receiver operating characteristic (ROC) curves for signal efficiency versus background rejection of the hi80 taggers (left column) and hi200 taggers (right column), evaluated on different signals indicated in each plot. For comparison, we include the ROC curves for simple ratios τ nm ≡ τ (1) n /τ (1) m used in the literature. We also include the ROC curves for NN taggers trained on the same (groomed) mass and p T interval, using the same architecture of Ref. [17], with two fully connected hidden layers of 512 and 32 nodes, respectively. 3 For the NN taggers we do not perform any mass decorrelation. The results shown here for NNs are not fully comparable to those in Ref. [17] because here we select to apply the taggers on fixed intervals of groomed mass. The substructure of QCD jets with groomed mass e.g. m J ∈ [60, 100] GeV is not the same as for jets with ungroomed mass m J ∈ [60, 100] GeV. This can also be noticed by comparing with the results in Appendix B, obtained for jets with ungroomed mass m J ∈ [60, 100] GeV.
To better illustrate the effect of the tagging on the signalto-background significance S/ √ B (with S standing for signal and B for background) we define the significance improvement as with ε S , ε B the tagger efficiencies for signal and background, respectively. This is precisely the factor multiplying the luminosity-dependent ratio S/ √ B due to the tagging. We plot the lines (in gray) that correspond to several values of the significance improvement s. Notice that, for the mass inter- 2P taggers hi80 Fig. 2 Distributions of T for the QCD background (black) and selected signals (blue), for the hi80 taggers as in Table 2 (v1, solid lines) and alternate versions (v2, dashed lines) obtained with a different random seed vals selected, an additional improvement by a factor of 2 is brought by the jet mass cut, which might even be larger for more stringent cuts on m J . The first row of Fig. 3 shows the performance for S → A A → 4b, giving four-pronged jets with b quarks. A scalar undergoing this type of decay has been dubbed as 'stealth boson' because of its elusive nature [16]. We observe that τ -ratios, especially for M S = 80 GeV, fail to improve the signal significance, while the LoRD tagger can improve it by a factor of two. (For a S → A A → 4u signal the performance is similar.) The NN tagger reaches a higher significance improvement s = 3.
In the second row of Fig. 3 we study two-pronged jets from A → bb, which are harder to identify than W/Z bosons using jet substructure. The LoRD tagger performs better than the commonly used ratio τ 21 but, again, worse than the NN tagger.
In the third row of Fig. 3 we examine two signals without b quarks. On the left panel we have W → qq, for which as said the tagger T 2P has a better discrimination than for A → bb with the same mass. The performance of T 2P is half-way between the simplest τ 21 ratio and the more complex NN. On the right panel we show S → W W → 4q, giving a fourpronged jet with four light quarks. The performance of T 4P on this signal is quite good (better than for the analogue with four b quarks, top right panel) but it is still surpassed by the NN by a factor of two in terms of significance enhancement.
We also examine the performance of our taggers for jets containing electrons or photons as 'prongs', for which the taggers are not designed. We consider The decays S → A A → bbγ γ can take place for example in the model of Refs. [41,44]. N is a heavy neutral lepton such as the right-handed neutrinos introduced in left-right models, which undergoes a three-body decay mediated by an off-shell W R boson. We set M Z = 2.2 TeV as before, and show our results in Fig. 4. The left column corresponds to hi80 taggers and the right column to hi200 taggers.
In the top panels of Fig. 4 we show the performance of the T 3P and T 4P taggers for jets containing two b quark and two photons. These conspicuous jets have a shape that is approximately four-pronged, and the τ 42 ratio works well to distinguish them from QCD jets. (The ratio τ 43 is comparable, and other τ -ratios are worse; we do not show the corresponding lines for clarity.) The LoRD taggers also work well to discriminate these signals from the background. Remarkably, for higher jet masses the LoRD taggers provide a better discrimination than dedicated NNs, although their performance is similar to that of τ 42 . As it can be observed, T 3P has a better discrimination for M S = 80 GeV (top, left panel) and T 4P is better for M S = 200 GeV (top, right panel).
In the bottom panels of Fig. 4 we show the performance of the T 3P taggers for a signal that is not properly threepronged, since one of the 'prongs' is an electron rather than a jet. The taggers perform well for this signal for which they are not specifically designed, and better than the simple ratio τ 32 . Most surprisingly, the LoRD taggers perform better than the NN taggers, especially at m N = 200 GeV.
Overall, we find that the highest benefit of the LoRD taggers is achieved for four-pronged signals, for which they largely surpass the performance of simple τ -ratios and capture a good deal of the potential of a complex NN. Also, we remark that the taggers work remarkably well for signals for which they are not designed: (a) jets containing two b quarks plus two photons; (b) jets containing two light quarks plus an electron.
The performance of the LoRD taggers remains stable under moderate variations of the masses of the particle originating the fat jet. In the comparisons shown in Figs. 3 and 4, these masses were taken as 80 GeV for hi80 taggers and  Again, we observe that the performance of the hi200 T 4P tagger remains quite stable under moderate variations of the jet mass.

Dependence on input data kinematics
We address in this section the variation of the tagger performance for signals (and backgrounds) with masses or transverse momentum quite different from the ones used in the design. We select the two reference mass and p T intervals used in Figs GeV, p T J ≥ 1 TeV. We apply the hi80, hi200 and lo80 taggers to selected 4P, 3P and 2P signals. The results are shown in Fig. 6. The left column corresponds to the different taggers applied on signals in the kinematical region (a), and the right column to different taggers applied on signals in the region (b). From the comparison we can observe that: • The performance is quite stable with p T J (solid versus dotted lines in the left column). Here we apply taggers designed with p T J ≥ 1 TeV (hi80, solid lines) and p T J ≥ 500 GeV (lo80, dotted lines), on jets with p T J ≥ 1 TeV and we observe that the performance is quite similar for 4P taggers (top panel) and basically the same for 3P and 2P taggers. • For 4P taggers, the tagger efficiency is degraded when applying the wrong one, i.e. the hi200 tagger in the region (a) or the hi80 tagger in the region (b). For 3P taggers, nevertheless, the hi200 tagger is good in both (a) and (b) kinematical regions. For 2P taggers there is no appreciable difference in any case.
These results suggest that, if one is interested on a very wide range of jet masses, a set of two or three LoRD taggers with different masses can be used to cover that region without performance loss. The stability of the taggers resulting from the optimisation procedure, shown in Fig. 2, and the stability of the performance with small jet mass variations, shown in Fig. 5, ensure that this procedure is feasible and would

Conclusions
In this work we have used a logistic regression design (LoRD) to obtained simple taggers for multi-pronged jets based on jet substructure variables. These taggers can be approximately decorrelated from the jet mass. The application of the taggers keeps the shape of the jet mass distribution for the QCD background to a large extent. The best results are achieved for taggers for four-pronged (4P) signals, which precisely are the least covered in terms of available tools. In this case, the mass decorrelation is very good (see Fig. 1), the signal to background discrimination largely surpasses simple τ -ratios used in current new physics searches, and is not far from dedicated NNs. For boosted top quarks, the three-pronged (3P) taggers do not bring much improvement over the ratio τ 32 . 4 However, for boosted heavy neutrinos -which give jets that are not properly three-pronged -the LoRD taggers perform very well, even better than NNs trained on the same set of signals and backgrounds. A significance improvement by a factor up to 8 can be achieved. This fact is quite interesting since current searches for boosted heavy neutrinos [45] do not use any type of jet substructure analysis. We have also tested two-pronged (2P) taggers on a variety of signals, finding that the performance is half-way between the ratio τ 21 and a dedicated NN. LoRD T 3P and T 4P taggers are also sensitive to jets containing two b quarks plus two photons, a signature which is not experimentally covered [44].
We envisage two possible situations where the LoRD taggers may be very useful. The first one is when the development of a full-fledged multivariate tagger with mass decorrelation is not feasible. In this case, a LoRD tagger (or a handful of them) can easily be used, obtaining results that are not far from the ones that a multivariate method could bring.  The second situation is as a complement and cross-check of results obtained with more complex taggers as the ones based on deep neural networks. These methods are often a 'black box' whose results are difficult to test independently. Because the performance of the LoRD taggers is not far from NNs, they can be very useful as a robust test, especially in case any new physics signal involving fat jets is found at the LHC.

Data Availability Statement
This manuscript has associated data in a data repository. [Authors' comment: This manuscript has associated data in https://github.com/bzaldivarm/LoRD.] Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/. Funded by SCOAP 3 .

A Logistic Regression implementation
In this appendix we describe in detail our implementation of the Logistic Regression classifier for distinguishing signal from background.
The datasets consist of N pairs where t i are the binary labels of each event (signal or background) and x i are the input variables, in our concrete case x = log τ , where τ is the vector of subjettiness variables as defined in (1). We preprocess the data such that signal and background events are equal in number for the training set and validation sets. For the latter, we use 20% of the dataset, while the remaining 80% is used for training (note that the test data in our study are completely independent samples analysed a posteriori).
The Logistic Regression is a simple ML model for classification where the different classes of data points are assumed to be linearly separable in the space of input variables x. As a probabilistic discriminative model, its goal is to model the conditional probability p(C k |x i ) of a given class C k , with (k = 0, 1), given an input x i , and it does so with the following parametric form: where we identify C 0 as the signal class. In the above equation, σ is the sigmoid function, σ (z) = (1 + e −z ) −1 , and w, w 0 are the weights to be optimised. In our analysis we have dropped the 'intercept' w 0 , since we wanted to make a direct correspondence with the product defined in (5), which has shown to be a good signal-vs-background discriminator in the literature [26] for simple jet topologies. A non-zero intercept amounts to a global multiplicative constant in (5).
We have checked in any case that the inclusion of the intercept does not have a visible impact in the classification performance. On the other hand, the weights w are identified as the coefficients c β n in Eq. (4). The optimisation of the w parameters has been done as usual by minimising the socalled cross-entropy cost function: y i (w)) . (11) The implementation of the above procedure has been performed with the automatic differentiation tool Tensorflow v1.14 [46]. We have used the Adam stochastic gradient descent optimiser with learning rate 0.01, and otherwise default parameters, and using a mini-batch of size n = 200. Convergence is typically achieved at around 200 epochs, the training process taking around 3 seconds in an Intel(R) Core i5-5200U CPU @ 2.20GHz with 8GB of RAM.
It is worth noting that as a result of the above optimisation method, different random initialisations 5 of the weights w will lead to optimal weights having O(1) differences. Even if such differences will not have an important impact in the error rate of the validation test, they will lead to slightly different distributions for the background and the signal when applying the trained model to the test samples. An example of these variations is shown in Fig. 2.

B Estimating the variance of the predictions
Apart from the variability in the performance due to the stochastic optimisation procedure described in appendix A, one may wonder about the variability of the predictions associated to the statistical variance of the parameters w that best fit the data. A clear way to proceed would be to adopt a Bayesian approach, where a proposed prior distribution for the parameters is updated with data (via the likelihood) in order to obtain the posterior distribution, which can be used in a second step to obtain the probability distribution of the prediction for a given new input. However, even if for the ML model in consideration (Logistic Regression) this would be a tractable task, the procedure would still require the use of approximate inference techniques in order to approximate the posterior distribution mentioned above, as well as to approximate the predictive distribution.
Instead, we opt here for a less ambitious frequentist alternative to estimate the variance of the parameters, known in some literature as the bootstrapping method. The idea is very simple: (i) generate a set of B synthetic datasets identical in size to the original dataset; (ii) train the ML model on each of those, such as to have B sets of optimal parameters; (iii) from the previous step build the resulting distribution of the parameters, from which to obtain the desired variance.
In the procedure outlined above, a decision has to be made about how to generate the synthetic datasets. Ideally, one has at hand a 'generative model' (for example, the likelihood of the data) which it is possible to sample from by using the Fig. 7 Variability in the performance of 2P, 3P and 4P taggers (from left to right, respectively), obtained from B = 50 bootstrap samples of the original dataset optimal parameters obtained with the original dataset. However, in this work we have used a discriminative model (see appendix A), where the likelihood of the data is not modeled. Then, an alternative -followed in this work -is to sample directly from the original dataset 'with replacement', as it is customary in the bootstrapping literature. In other words, we generate B = 50 datasets by randomly choosing points from the original dataset, such that some of those points may be present more than once in the synthetic datasets.
Finally, as a shortcut to computing the predictive distribution, we proceed directly to evaluate the performance of the B set of fitted parameters under the test dataset. The results are illustrated in Fig. 7 for the hi80 kinematical region, and for test samples corresponding to an ungroomed mass interval m J ∈ [60, 100] GeV.
We observe that the variability in the predictions is very small for the three taggers, which is a desirable outcome in any statistical modeling. We note that the estimation of such variability may be underestimated at some extent due to the finiteness of the original dataset, which may bias the bootstrap samples.

C Interpretability of the results and dimensionality of the datasets
One of the main motivations of this work is to obtain simple enough (yet efficient) taggers, providing the interpretability of the results which is typically absent in more complex multivariate models as neural networks. Interpretability is intended here first of all in the sense of the parametric form of the decision boundary between signal and background, and clearly the model adopted in this work, Logistic Regres-sion, featuring a linear decision boundary in the input space, is more interpretable than more complex models. A different, more physical sense of interpretability could be explored, regarding the input variables themselves. The basis of subjettiness variables (1) is complete in the sense that it completely specifies the variables of M-body phase space (being 3M −4 in total) such as the transverse momenta, relative angles, etc. up to a global rotation. Consequently, it is expected that all the subjettiness variables play an important role in the signal/background discrimination for M-pronged jets.
Still, in this work we have used values of M that are higher than the number of prongs of the 'signal' jets, e.g. we have used M = 9 to build taggers for four-pronged jets. Therefore, one may wonder whether the full set of variables is required. We have verified that this is the case, by building 4P taggers where, instead of using all the τ (β) n variables of the basis (1), we restrict ourselves only to the subsets with either β = 1, β = 2 or β = 0.5. The performance of the resulting taggers are significantly worse for the hi80 test samples, having for a signal efficiency of 0.2 a corresponding background rejection an O(10) smaller than the 4P tagger with the full set of variables (cf. Fig. 4).
Note, however, that the fact that the full set of variables is required for optimal signal to background discrimination does not mean that the intrinsic dimensionality of the data is 3M − 4. There is a partial correlation among the different τ (β) n variables and thus the dataset is expected to lie in a smaller dimensional space. In order to check this we have performed a simple Principal Component Analysis (PCA) on the original hi80 training sample. The results show that the first two principal components (i.e. the two most important rotated variables) already contain around 86% of the original variance. The weights that each of the original variables have on those principal components are all similar, of O(0.1). Finally, if using only those two principal components as new input for the LoRD model (i.e. now having a 2-dimensional dataset) we obtain a performance on the hi80 samples which is very similar to the one obtained with the original (3M −4)dimensional dataset. This confirms our expectations that the information of the dataset is (a) contained in a hyperplane of (much) smaller dimensionality, while (b) being spread over all the 3M − 4 original variables. Further exploration of the interpretability of this kind of tagger is left for future work.

D Using groomed versus ungroomed quantities
As mentioned in Section 2, in the design of the taggers we use MI data as signal and QCD samples as background, within fixed intervals of ungroomed mass, and for p T above and around some fixed cut. In the test of the taggers, however,  Fig. 8 Comparison of the performance of the 4P hi80 tagger using groomed versus ungroomed quantities we use the groomed mass and p T , as it is usually done in the LHC experiments. We have found that the performance when using groomed quantities in the design is rather similar, as shown for example for the 4P hi80 tagger in Fig. 8. The black line represents our default choice, whereas the blue line corresponds to designing the tagger using groomed m J and p T , in the same intervals. On the other hand, using groomed subjettiness variables τ (β) n , both in the design and the test, significantly degrades the performance of the tagger (red line). The use of ungroomed τ (β) n is not a problem, actually it is the standard choice for the CMS Collaboration, which uses alternative methods such as PUPPI [47] to remove pile-up.

E Taggers without mass decorrelation
The application of the taggers without prior mass decorrelation produce a significant shaping of the jet mass spectrum for the QCD background. This is illustrated in Fig. 9 for two taggersT 4P designed in two different mass intervals. The peak-like structure produced is near 100 GeV in both cases, therefore the location of the bump is not related to the design mass interval,  Fig. 11 Comparison of the performance of 4P hi80 (left) and hi200 (right) taggers for several signals, from two-pronged to four-pronged for hi200. These examples highlight the benefit of the mass decorrelation even if it is not perfect, see Fig. 1. A drawback of the decorrelation procedure is a slightly worse discrimination between signal and background, as it can be seen in Fig. 10, for four-pronged (left), three-pronged (middle) and two-pronged (right) taggers, designed in the same kinematical region hi80.

F 4P as generic taggers
Despite being specifically designed for four-pronged signals, the 4P hi80 and hi200 taggers work well for jets with less than four quarks. One example has been already shown in Fig. 4: jets containing two b quarks and two photons. For completeness, we show here the performance for other signals : N → eqq, A → bb, and W → qq. The results are shown in Fig. 11. We also include four-pronged signals as a reference. The masses of the particles originating the jet are the same as taken in Figs. 3 and 4. For low masses, the 4P hi80 tagger performs well for all signals -actually, the one for which the discrimination is worse is the difficult four-pronged signal S → A A → 4b for which it is specifically designed. For high masses, the 4P hi200 tagger is not adequate for A → bb but it is quite good for N → eqq.