A generic anti-QCD jet tagger

New particles beyond the Standard Model might be produced with a very high boost, for instance if they result from the decay of a heavier particle. If these particles decay hadronically, their signature is a single massive fat jet, which is difficult to separate from QCD backgrounds. Jet substructure and machine learning techniques allow many specific boosted objects to be discriminated from QCD, but the scope of possibilities is very large, and a suite of dedicated taggers may not be able to cover every possibility, in addition to making experimental searches cumbersome. In this paper we describe a generic model-independent tagger that is able to discriminate a wide variety of hadronic boosted objects from QCD jets using N-subjettiness variables, with a significance improvement varying between 2 and 8. This is in addition to any improvement that might come from a cut on jet mass. Such a tagger can be used in model-independent searches for new physics yielding fat jets. We also show how such a tagger can be applied to signatures over a wide range of jet masses without sculpting the background distributions, which allows searches for new physics as bumps on jet mass distributions.

JHEP11(2017)163

Experimental analyses carried out by the ATLAS and CMS collaborations use dedicated taggers, in addition to the jet mass, to search for beyond the Standard Model (BSM) scenarios that can give rise to boosted top quarks, W/Z, or Higgs bosons. For example, shape variables such as the N-subjettiness ratio $\tau_{21}^{(1)}$ [7] and the energy correlation function $D_2^{(\beta=1)}$ [11] are very effective in distinguishing between QCD jets and two-pronged decays from W/Z, and the performance can be further improved by using a more complete set of jet substructure variables and a multivariate analysis [13]. As another example, the subjettiness ratio $\tau_{32}^{(1)}$ is used to identify jets from top quark decays. However, the inherent drawback of this approach is that, while these dedicated taggers are efficient in the discrimination of top quarks and W/Z hadronic decays from QCD jets, they may not be able to identify fat jets arising from the decay of BSM boosted particles.
New particles near the electroweak scale may exist and evade direct detection, for example, if their couplings to quarks and gauge bosons are small. They can still be produced in the decay of heavier particles and may have dominant decays into hadronic final states. Examples of such cases are neutral (pseudo-)scalars in models with left-right symmetry [26] and warped extra dimensional models with more than 2 branes [27,28] (see also refs. [29,30]). An explicit example of the limitations of dedicated jet taggers has been given in ref. [31], by considering a new 'stealth boson' S with a mass in the 100 GeV range and undergoing a cascade decay $S \to AA \to bbbb$ mediated by a lighter particle A. When S is boosted, so that the four b quarks merge into a single jet, the $\tau_{21}^{(1)}$ and $D_2^{(\beta=1)}$ variables used to tag massive SM bosons would 'see' the resulting four-pronged fat jet as a QCD jet. Consequently, new physics searches involving boosted hadronically-decaying W or Z bosons, e.g. diboson resonance searches, can be relatively blind to the analogous new physics processes (diboson-like resonances) involving one or two S particles of a mass around $M_{W,Z}$.
It is highly desirable that ATLAS and CMS searches are not restricted to a few simple benchmark models, but rather cover as many new physics signatures as possible. A broader scope for LHC searches becomes of the utmost importance given the absence of any convincing hint of new physics beyond the SM, as we still do not know how new physics may manifest at collider experiments. With that purpose, a generic 'anti-QCD' tagger that distinguishes QCD jets not only from W/Z hadronic decays, but also from generic BSM boosted objects, would be a useful tool. In this paper we address this problem and provide a proof of concept that this kind of tool can be developed (see also ref. [32], which pursues related ideas).
With this goal, we perform a multivariate analysis using a neural network (NN) that is trained to discriminate QCD jets from fat jets with two-, three- and four-pronged structure, arising from the decay of relatively light boosted particles. After describing our framework in section 2, we perform a simple analysis in section 3, to demonstrate the discrimination power for several examples of fat jets from boosted new particles against QCD jets. A comparison between the performance of generic and dedicated taggers is given in section 4. The decorrelation between the background rejection with tagging based on jet substructure and the jet mass requires a slightly more sophisticated analysis, which is presented in section 5. Our results are discussed in section 6. Some appendices are devoted to additional details of our analysis. In appendix A we study the dependence of the results on the number of input variables for the NNs. In appendices B and C we discuss how the results change when we modify the signal flavour composition, and the quark/gluon background composition, respectively. The dependence of the results on the NN architecture is explored in appendix D. In appendix E we examine the issue of whether the taggers only learn jet shapes or they also learn about different signal and background kinematics. A related issue is the dependence of the results on the specific model for hadronisation and showering; this is addressed in appendix F, where we compare the results using two Monte Carlo simulation codes. Finally, in appendix G we study for completeness the signals of light coloured boosted objects.
Following ref. [13], we characterise the jet substructure by a set of generalised N-subjettiness [10] variables,
$$\tau_N^{(\beta)} = \frac{1}{p_{TJ}} \sum_i p_{Ti} \min\left\{ \Delta R_{1i}^{\beta}, \Delta R_{2i}^{\beta}, \dots, \Delta R_{Ni}^{\beta} \right\}, \tag{2.1}$$
with i labelling the particles in the jet, $p_{Ti}$ their transverse momenta, $\Delta R_{Ki}$ their lego-plot distance to the axis K = 1, ..., N and $p_{TJ}$ the jet transverse momentum. As in ref. [13], in the computation of these variables we use the axes defined by the exclusive $k_T$ algorithm [33,34] with standard E-scheme recombination [35]. Ref. [13] proposed the following basis of observables,
$$\left\{ \tau_1^{(0.5)}, \tau_1^{(1)}, \tau_1^{(2)}, \tau_2^{(0.5)}, \tau_2^{(1)}, \tau_2^{(2)}, \dots, \tau_{M-2}^{(0.5)}, \tau_{M-2}^{(1)}, \tau_{M-2}^{(2)}, \tau_{M-1}^{(1)}, \tau_{M-1}^{(2)} \right\}, \tag{2.2}$$
motivated by the requirement to be able to fully reconstruct the (3M − 4)-dimensional phase space for a decay into M particles. They found that the discriminating power for Z-jets versus gluon and quark jets was saturated by considering up to 4-body phase space.
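The definition in eq. (2.1) and the counting of the basis in eq. (2.2) can be illustrated with a minimal Python sketch. It is not the code used in the analysis: the function names are hypothetical, the N axes are assumed to be supplied by the caller (in the paper they come from the exclusive $k_T$ algorithm), and the scalar sum of constituent $p_T$ is used as a stand-in for $p_{TJ}$.

```python
import math


def delta_r(eta1, phi1, eta2, phi2):
    """Lego-plot distance between two directions in the (eta, phi) plane."""
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi  # wrap into [-pi, pi)
    return math.hypot(eta1 - eta2, dphi)


def n_subjettiness(constituents, axes, beta):
    """tau_N^(beta) for constituents [(pT, eta, phi), ...] and N axes [(eta, phi), ...]."""
    # Assumption: the scalar pT sum of the constituents approximates p_TJ.
    pt_jet = sum(pt for pt, _, _ in constituents)
    total = 0.0
    for pt, eta, phi in constituents:
        # Each particle contributes its distance to the nearest axis, raised to beta.
        total += pt * min(delta_r(eta, phi, a_eta, a_phi) ** beta for a_eta, a_phi in axes)
    return total / pt_jet


def basis_labels(m_body):
    """(N, beta) labels spanning the (3M - 4)-dimensional M-body basis."""
    labels = [(n, beta) for n in range(1, m_body - 1) for beta in (0.5, 1.0, 2.0)]
    labels += [(m_body - 1, beta) for beta in (1.0, 2.0)]
    return labels
```

For instance, `len(basis_labels(7))` gives the 17 variables used for M = 7, and a toy jet whose two constituents each sit exactly on one of two axes has a vanishing two-axis N-subjettiness.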
Because we are also interested in higher-pronged decays, we explore a larger 17-dimensional basis with M = 7. This specific choice is motivated in appendix A. It is likely that a smaller, more carefully selected basis of substructure variables could be used with little degradation in discrimination power, but in this work we do not attempt to optimise this. We remark that, equivalently, a set of energy correlation functions [11] ECF(N, β) could also be used, but the calculations are much more computationally demanding when one considers higher N, as is required for the identification of multi-pronged boosted jets. The values of these variables are used as the input to a NN trained to discriminate quark/gluon jets from multi-pronged decays of boosted colour-singlet particles.
Quark and gluon jets are obtained by generating the parton-level processes $pp \to Zg$ and $pp \to Zq$, with $Z \to \nu\nu$, using MadGraph5 [36]. Event generation is followed by hadronisation and parton showering with Pythia 8 [37]. The detector response is simulated with Delphes 3.4 [38] using the CMS detector card. Jets are reconstructed using the anti-$k_T$ algorithm [39] with radius R = 0.8, as implemented in FastJet 3.2 [40].
For the signal we use fat jets resulting from the decay of neutral, colour-singlet particles into two, three and four quarks. For these, we consider the six processes of eqs. (2.3), with S a scalar and F a fermion. These processes are generated with Protos [41] and, in order to remain as model-agnostic as possible, we implement decays of S and F with a flat matrix element, so that the decay weight of the different kinematical configurations only corresponds to the two-, three- or four-body phase space. We will refer to these Monte Carlo data as Model Independent (MI) data in the following. Our choice is motivated by the need to sample phase space without model prejudice. For example, any specific choice of four-body decay topology, such as $1 \to 1 + 1 \to 2 + 2$, combined with a choice of masses for the intermediate particles, would only sample a part of four-body phase space, which varies with those mass choices. Therefore, training on specific cascade modes would introduce a model bias. Our choice to train on both light and b quarks has the same aim of making the tagger as model-agnostic as possible. Variations on this choice, either removing final states with b quarks, or adding signal processes with gluons (e.g. $S \to gg$) in the training, are discussed in appendix B.
Several new physics signal processes are generated to test whether the NN correctly identifies jets resulting from boosted multi-pronged particle decays, including some for which it is not trained. We use seven such processes [...], with $H_1^0$ a heavy scalar and $A^0$ a pseudo-scalar, $H^\pm$ a charged scalar and $Z'$, $W'$ additional vector bosons. All these new particles arise, for example, in left-right models. We consider hadronic decays of the top quarks, W/Z bosons and pseudo-scalars resulting from the $H_1^0$ and $H^\pm$ decays. We note that for W and Z hadronic decays the jet shapes are very similar, so for brevity we only consider the former. These processes are generated with MadGraph5, implementing the relevant interactions [26] in FeynRules [43] and using the universal FeynRules output [44] to interface with the event generator.
We treat the search for boosted BSM objects as a binary classification problem, with quark and gluon jets labelled as background and jets originating from boosted massive objects labelled as signal. Our NN classifiers are multilayer perceptrons, a simple fully connected architecture that is well suited for use with unstructured input data. These are implemented using Keras [45] with a TensorFlow backend [46]. We choose an architecture with two hidden layers, the first containing 512 nodes and the second containing 32 nodes, all using rectifier activation functions. (See appendix D for a few examples using alternative NN architectures.) The output layer is a single node with sigmoid activation. The input consists of the 17 $\tau_N^{(\beta)}$ variables, with some preprocessing applied. We use two kinds of preprocessing, which we discuss in more detail in section 3 and section 5 respectively. The first is a simple standardisation of the inputs, which we find significantly improves the training time, stability over variations of the initial seed, and discrimination performance of the trained tagger for simple architectures like the ones used here. The second approach relies on a more complicated transformation of the input data and allows a tagger to be sensitive to signals over a broad range of masses, while decorrelating background rejection from jet mass and $p_T$.
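To make the size of this architecture concrete, the following library-free Python sketch builds a 17 → 512 → 32 → 1 fully connected network with rectifier hidden layers and a sigmoid output, and counts its trainable parameters. It is only an illustration of the layer structure described above, not the Keras model used in the analysis; the function names and the uniform weight initialisation are assumptions made for the example.

```python
import math
import random

LAYER_SIZES = [17, 512, 32, 1]  # inputs, two hidden layers, single output node


def count_parameters(sizes):
    """Number of weights plus biases in a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def init_network(sizes, seed=0):
    """Random weights and zero biases (assumed initialisation, for illustration only)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, inputs):
    """Forward pass: rectifier on hidden layers, sigmoid on the output node."""
    x = list(inputs)
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, value) for value in z]
        else:
            x = [1.0 / (1.0 + math.exp(-value)) for value in z]
    return x[0]
```

With these layer sizes `count_parameters(LAYER_SIZES)` evaluates to 25665, which gives a sense of why such networks train in minutes on a laptop CPU.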
Except where otherwise specified, we train on equal numbers of background and signal events. Background is divided equally between quark and gluon jets (see appendix C for variations in this choice). Signal training data is divided equally between the six categories of MI events described in eqs. (2.3). 20% of this signal and background data is set aside for validation. We choose binary cross entropy as the loss function to be optimised, using the Root Mean Square Propagation (RMSProp) algorithm with a learning rate of $10^{-3}$. Additionally, if the loss as measured on validation data does not improve over three epochs, the learning rate is reduced by a factor of 10. Training is stopped after 100 epochs, or when validation loss has not improved in five epochs. Typically, we find that training in this manner takes several tens of epochs. We train five different copies of each NN in the same way but with different starting seeds, and pick the one which has the best performance as measured by area under the receiver operating characteristic (ROC) curve with validation data. All machine learning calculations were performed with an Intel(R) Core(TM) i5-6300U CPU @ 2.40 GHz with 8 GB of RAM. Our NNs take 1-10 minutes to train.
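The learning-rate reduction and stopping rules just described can be replayed on a list of validation losses with a short Python sketch. The bookkeeping below is one plausible reading of those rules and is an assumption: the counter for the learning-rate cut is reset after each cut, while the counter for stopping always counts epochs since the best validation loss. The function name `run_schedule` is hypothetical.

```python
def run_schedule(val_losses, lr=1e-3, lr_patience=3, stop_patience=5, max_epochs=100):
    """Replay per-epoch validation losses; return (epochs_run, final_learning_rate)."""
    best = float("inf")
    since_best = 0      # epochs since the best validation loss (used for stopping)
    since_lr_event = 0  # epochs since the last improvement or learning-rate cut
    epochs_run = 0
    for loss in val_losses[:max_epochs]:
        epochs_run += 1
        if loss < best:
            best = loss
            since_best = 0
            since_lr_event = 0
            continue
        since_best += 1
        since_lr_event += 1
        if since_best >= stop_patience:
            break  # no improvement in five epochs: stop training
        if since_lr_event >= lr_patience:
            lr /= 10.0  # no improvement over three epochs: reduce the learning rate
            since_lr_event = 0
    return epochs_run, lr
```

In this reading, a run whose validation loss stalls after the second epoch has its learning rate cut once and stops at the seventh epoch.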

A first approach to anti-QCD tagging
For this first simple analysis each of the N-subjettiness variables in eq. (2.1) is standardised by a linear transformation with constant coefficients, so that the resulting $\tau_N^{\mathrm{std}(\beta)}$ distribution for the QCD background (composed of equal parts of quark and gluon jets) has zero mean and unit standard deviation. We consider three benchmarks for the jet transverse momentum $p_{TJ}$ and mass $m_J$, and for each one we also select the $Z'$ resonance mass in eqs. (2.3) to yield MI data with a $p_{TJ}$ distribution close to the threshold and similar to the background (see appendix E). One tagger is built in each case: std500 ($p_{TJ} > 500$ GeV, $m_J \in [65, 105]$ GeV), std1000 ($p_{TJ} > 1000$ GeV, $m_J \in [65, 105]$ GeV) and std1500 ($p_{TJ} > 1500$ GeV, $m_J \in [350, 450]$ GeV). In the first two cases, the MI data used for training is generated with boosted particle masses $M_{S,F} = 80$ GeV, and in the last case with masses of 400 GeV. The jet mass $m_J$ and transverse momentum $p_{TJ}$ used here are those of the ungroomed jet, for reasons that we specify at the end of this section. The number of events used for the training is collected in table 1.
The std500 and std1000 taggers are used to investigate the discrimination power for jets coming from BSM boosted particles with a mass around the W mass. We use two regimes of $p_{TJ}$ ($p_T > 500$ GeV and $p_T > 1000$ GeV) to check the differences, and the extent to which the results are specific, or not, to a kinematical region. For each benchmark, a NN is trained and validated with MI and QCD data, and is then tested on a number of boosted jet topologies [eq. (3.2)], among them $W \to qq$, setting the parent $Z'$ resonance mass responsible for the processes in eq. (3.2) to 1100 GeV and 2200 GeV for the $p_T > 500$ GeV and $p_T > 1000$ GeV test samples respectively (the $p_T$ distributions of the QCD and signal jets generated in this way are very similar, see appendix E). The third and fourth lines in eq. (3.2) are two examples of the stealth boson S in ref. [31]. The results for the ROC curves giving the signal efficiency versus background rejection are presented in figure 1.
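The standardisation step can be sketched in a few lines of Python. The key point, taken from the text, is that the shift and scale are fitted on the QCD background only and then applied unchanged to signal jets; the function names and the toy numbers are hypothetical.

```python
import statistics


def fit_standardiser(background_values):
    """Mean and standard deviation of one tau variable, from QCD background only."""
    mean = statistics.fmean(background_values)
    std = statistics.pstdev(background_values)
    return mean, std


def standardise(values, mean, std):
    """Linear map that gives the fitted background zero mean and unit deviation."""
    return [(value - mean) / std for value in values]


# Toy values of a single tau variable (illustrative numbers, not from the paper).
qcd_tau = [0.10, 0.20, 0.30, 0.40]
signal_tau = [0.05, 0.12]

mean, std = fit_standardiser(qcd_tau)
qcd_std = standardise(qcd_tau, mean, std)
signal_std = standardise(signal_tau, mean, std)  # same transformation as background
```

In a full analysis the same fit would be repeated independently for each of the 17 input variables.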
To better illustrate the effect of the tagging on the signal-to-background significance $S/\sqrt{B}$, we define the significance improvement as the factor multiplying the luminosity-dependent ratio $S/\sqrt{B}$ due to the tagging, and indicate the lines (in dashed gray) that correspond to a significance improvement of 1, 2, 4 and 8. For comparison we also include the efficiency curve for the dedicated tagger $\tau_{21}^{(1)}$, applied to fat jets from W bosons. Several comments are in order.
1. The taggers perform better for jets with light quarks, either from W bosons or from stealth bosons decaying to four u quarks.
2. For W bosons the anti-QCD taggers represent a significant improvement over the dedicated tagger $\tau_{21}$, as also observed in ref. [13].
3. The discriminating power of the std1000 tagger, when applied to all jet topologies, outperforms the discriminating power of $\tau$ [...]
4. [...] with $\tau_{21}$, which is specifically designed for W bosons and actually may reduce the $S/\sqrt{B}$ ratio for this type of signals [31].
5. The taggers have a good discrimination for fat jets from $H_1^0 \to gg$, for which they are not trained.
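Since a tagger with signal efficiency $\varepsilon_S$ and background efficiency $\varepsilon_B$ rescales $S \to \varepsilon_S S$ and $B \to \varepsilon_B B$, the significance improvement defined above is $\varepsilon_S/\sqrt{\varepsilon_B}$. A minimal Python sketch follows, with hypothetical function names; the second function gives the background efficiency along a line of constant significance improvement, as drawn in the ROC plots.

```python
import math


def significance_improvement(signal_efficiency, background_efficiency):
    """Factor multiplying S/sqrt(B) after applying the tagger."""
    return signal_efficiency / math.sqrt(background_efficiency)


def background_efficiency_on_contour(signal_efficiency, improvement):
    """Background efficiency giving a fixed significance improvement."""
    return (signal_efficiency / improvement) ** 2
```

For example, keeping half of the signal while accepting one QCD jet in sixteen corresponds to a significance improvement of 2.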
The std1500 tagger is used to test the performance at higher jet masses and also the ability to distinguish more complex boosted signatures, using a $Z'/W'$ resonance mass of 3300 GeV. These topologies include a $1 \to 1 + 1 \to 1 + 3$ asymmetric cascade decay ($tb$), a $1 \to 1 + 1 \to 2 + 2$ cascade decay with different intermediate particle masses ($ZA^0$) and even six-pronged fat jets ($t\bar{t}$) for which the tagger is not trained. The ROC curves are presented in figure 2. As in the previous cases, the discrimination power is best for jets with light quarks, with a significance improvement up to a factor of 8 for $WW$. In addition, it is very good for the rest of the signals except for a resonance decaying to $gg$, for which it is not trained. The performance for $t\bar{t}$ is remarkable, especially if one considers that the tagger is trained with up to four-pronged MI data, and a merged $t\bar{t}$ jet has six quarks.
The significance improvement from the tagging of shape variables adds to that gained from a jet mass cut. For illustration, we show in table 2 the significance improvement for the signals that is achieved with the cuts $m_J \in [65, 105]$ GeV for $p_{TJ} > 500$ (1000) GeV, as in the std500 (std1000) tagger, and $m_J \in [350, 450]$ GeV for $p_{TJ} > 1500$ GeV, as in the std1500 tagger. The full significance improvement that can be achieved by the combination of jet mass and shape variables is obtained by multiplying the numbers in table 2 using the ungroomed jet mass with those that can be read from figures 1 and 2. The improvement is modest at low $m_J$ because the ungroomed jet mass distribution for QCD events is large there.
Finally, let us comment on our choice of ungroomed jet masses and $p_{TJ}$. There are several methods [47-50] to improve signal mass resolution by removing soft particles within the jet. This also tends to improve signal and background separation by shifting the mass spectrum of QCD jets to lower values. However, the results depend on the choice of algorithm and set of parameters, and choices that are optimised for boosted SM particles are often not satisfactory for complex and massive multi-pronged jet topologies such as those considered in this paper. To make this more precise, we compare the significance improvement coming from the jet mass cut in table 2 between ungroomed jets and jets trimmed [49] with the parameters $R_{\mathrm{sub}} = 0.2$, $f_{\mathrm{cut}} = 0.05$, commonly used for massive vector bosons W/Z. This algorithm and parameter choice work well for W bosons, as can be observed from table 2, but are too aggressive for many of the multi-pronged boosted objects, for which this groomer can significantly broaden and shift the signal peak. For example, for stealth bosons with $p_{TJ} > 500$ GeV (row 3, top panel in table 2) this groomer degrades the mass resolution. For all the complex signals from $H_1^0$ and $H^\pm$ decays with masses M = 400 GeV (lower panel, table 2), this degradation is more pronounced. Jet pruning [48] and soft drop [50] have a similar performance [31]. In any case, the selection of a grooming algorithm and parameter choice that works well for generic BSM objects is another interesting and unrelated issue, which deserves a dedicated study.

Dedicated versus generic taggers
It is naturally expected that a NN jet tagger trained on a specific signal will achieve a better discrimination for that signal than a generic tagger, but it will also have a worse performance than the generic tagger on other types of signals. In order to quantify these statements, we have trained two dedicated taggers:
(a) Tagger 'std1000 W': $p_{TJ} > 1000$ GeV, $m_J \in [65, 105]$ GeV, $M_{Z'} = 2200$ GeV, trained on $W \to qq$ and the QCD background.
(b) A second dedicated tagger [...], trained on $H_1^0 \to WW$ and the QCD background.
We show our results in figure 3. On the left panel we can observe that the dedicated tagger std1000 W has a slightly better discrimination power than the generic tagger std1000 for the W bosons it is trained on, but somewhat worse for four-pronged stealth bosons. We find that for final states with W bosons, the performance loss by using a generic tagger is rather small, and is more than compensated by the broader sensitivity to new physics signals. On the right panel it is apparent that the dedicated tagger is significantly better than the generic tagger for the $H_1^0 \to WW$ signal it is trained on, but it is considerably worse for other signals. Firstly, even though this tagger is specifically trained on a four-pronged signal ($WW$), its performance on a different four-pronged signal ($ZA^0$) is even worse than with the generic multi-pronged tagger. This fact illustrates that there can be large differences between signals even if they happen to share the same number of prongs, and justifies our choice to train on MI data rather than specific signal models. Secondly, the sensitivity to $t\bar{t}$ is completely degraded by using the dedicated $WW$ tagger, which actually deteriorates rather than enhances sensitivity to this signal.
Generic multivariate taggers are also found to discriminate the various signals from the QCD background better than the simple $\tau$-ratios that have commonly been used in new physics searches. In the left panel of figure 4 we compare the performance of the std1000 generic tagger (solid lines) to those of $\tau$-ratios [...]; for stealth boson signatures we also use $\tau_{42}^{(1)}$, which has also been used for boosted hadronic $H \to WW^*$ discrimination by the CMS collaboration [53]. In all three cases, the performance of the generic tagger is much better, but this is especially apparent for stealth bosons, in agreement with previous results [31]. In the right panel we do the comparison for more massive jets using the std1500 tagger and various selected $\tau$-ratios. Only for a $t\bar{t}$ signal, for which this tagger is not trained, are the performances comparable.
Altogether, the comparison of generic taggers with dedicated ones and simple $\tau$-ratios is very illustrative. For jet masses around the weak boson masses, there is a remarkable improvement for non-W signals with respect to a W-dedicated tagger, keeping nearly the same performance for W bosons, and in all cases, quite an improvement over $\tau_{MN}^{(1)}$. For heavier jet masses, the advantage of a generic tagger is still the broader sensitivity, though the performance of a dedicated tagger can be significantly better.

Mass decorrelation
It is desirable, although not compulsory, that a tagger based on jet substructure is decorrelated from the jet mass, in the sense that the tagging efficiency for the background has little dependence on $m_J$. When this happens, the jet tagging does not shape the $m_J$ distribution of the SM background [52]. This allows for data-driven background estimation using jet mass sidebands, and for the application of bump-hunting strategies on a jet mass distribution. The NNs described in the previous sections, when combined with a choice of threshold on the NN output, act as cuts in the 17-dimensional space of the $\tau_N^{(\beta)}$ variables which were used as NN inputs in the previous section. The efficiency of the tagger (with fixed threshold) on QCD events will necessarily vary with jet mass and $p_T$, and will result in a sculpting of jet mass distributions in ways that depend sensitively on the mass and $p_T$ of the jets on which it was trained.
In order to build a tagger with an efficiency on QCD jets that does not vary strongly with jet mass or $p_T$, there are three obvious possibilities. The simplest one is to apply to the NN output the approach already utilised in ref. [25]. In this case, the threshold on the NN output would be adjusted with jet mass and $p_T$, in such a way that background rejection is fixed. This approach has many advantages (first and foremost being simplicity), but a tagger used in this way that is optimised for signal discrimination at one mass will tend to have suboptimal performance for signals at different masses as the shapes of the input distributions vary, and the tagger might sculpt signal shapes and shift signal mass peaks. It might be required that a suite of taggers is trained, optimised at different mass points. This problem could be ameliorated if a basis of variables is found which are only weakly correlated with jet mass and $p_T$. A second approach, to be adopted in this section, involves preprocessing the $\tau$ input variables, but will have the advantage that a single tagger can be used with good signal discrimination over a wide range of masses and $p_T$. A third possibility would be to build a tagger that can learn to vary the region of $\tau_N^{(\beta)}$-space to cut as a function of jet mass. This was achieved in ref. [51] using an adversarial strategy designed to maintain mass decorrelation on QCD jets. This would leave open the question, however, of how to sample signal masses in training, in such a way that the tagger is not biased towards particular signal masses.
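The first possibility, a threshold that varies with jet mass so that the background efficiency stays fixed, can be sketched as follows. This is only an illustration of the idea under simple assumptions (scores already grouped by jet-mass bin, no ties between scores); the function names and bin labels are hypothetical and it is not the procedure of ref. [25].

```python
def threshold_for_efficiency(background_scores, target_efficiency):
    """Score cut such that about target_efficiency of the background has score >= cut."""
    ordered = sorted(background_scores, reverse=True)
    n_pass = max(1, round(target_efficiency * len(ordered)))
    return ordered[n_pass - 1]


def thresholds_by_bin(scores_by_bin, target_efficiency):
    """One threshold per jet-mass (or pT) bin, all with the same background efficiency."""
    return {
        label: threshold_for_efficiency(scores, target_efficiency)
        for label, scores in scores_by_bin.items()
    }


# Toy NN outputs for QCD jets in two mass bins (illustrative numbers only).
qcd_scores = {
    "low_mass": [0.1 * k for k in range(10)],
    "high_mass": [0.5 + 0.05 * k for k in range(10)],
}
cuts = thresholds_by_bin(qcd_scores, 0.2)
```

The two bins end up with different cuts but the same fraction of QCD jets passing, which is what prevents the selection from sculpting the background mass distribution.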
Let us consider a set of $\tau_N^{\mathrm{std}(\beta)}$ calculated using QCD jets selected within a certain jet mass and $p_T$ bin. Arranging the 17 $\tau$ variables into a vector, we define the transformed variables
$$\tau^{\mathrm{PCA}} = R^{-1} S R \, \tau ,$$
with R and S being 17 by 17 square matrices. R is a rotation matrix that diagonalises the symmetric covariance matrix calculated from this $\tau$-set. This matrix induces a rotation into a basis aligned with the principal component axes of the dataset. In this basis, all pairs of variables are linearly uncorrelated. This is equivalent to choosing a basis whose axes lie along the principal axes of a rigid body formed out of this distribution. We then standardise the data along these axes, so that along each principal axis the standard deviation of the data is 1. This is the action of the diagonal matrix S. We then invert the principal axis rotation with the action of $R^{-1}$. In practice, data should be binned according to jet mass and $p_T$, and a transformation matrix $M_i = R_i^{-1} S_i R_i$ must be determined for each bin i. Alternatively, one could define the PCA rescaling as a continuous function of $p_T$ and $m_J$ which could be fitted to binned data.
In the third column of figure 5 we plot $\tau^{\mathrm{PCA}}$ distributions for QCD. Firstly, it can be seen that much of the variation in these distributions with jet mass has been eliminated by the transformation. Secondly, thin directions have been stretched and fat directions have been squashed, as can be seen most clearly in the third row. This fact can aid in the training of the NN.
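The action of a matrix of the form $M = R^{-1} S R$ can be checked on a two-variable toy dataset, where the principal-axis rotation has a closed form. The Python sketch below is a simplified stand-in for the 17-dimensional case (for which one would normally use a linear-algebra library); the function names and data points are hypothetical. After the transformation the sample covariance is the identity, i.e. the data has unit standard deviation along every direction while staying in the original coordinate frame.

```python
import math


def covariance_2d(points):
    """Population covariance (cxx, cxy, cyy) of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    return cxx, cxy, cyy


def pca_rescaling_2d(points):
    """Matrix M = R^-1 S R for two variables, as ((m11, m12), (m21, m22))."""
    cxx, cxy, cyy = covariance_2d(points)
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)  # angle of the first principal axis
    c, s = math.cos(theta), math.sin(theta)
    # Variances along the principal axes, i.e. after the rotation R = [[c, s], [-s, c]].
    var1 = c * c * cxx + 2.0 * c * s * cxy + s * s * cyy
    var2 = s * s * cxx - 2.0 * c * s * cxy + c * c * cyy
    s1, s2 = 1.0 / math.sqrt(var1), 1.0 / math.sqrt(var2)  # diagonal entries of S
    # R^-1 S R written out explicitly (R^-1 is the transpose of R).
    m11 = c * c * s1 + s * s * s2
    m12 = c * s * (s1 - s2)
    m22 = s * s * s1 + c * c * s2
    return (m11, m12), (m12, m22)


def apply_matrix(matrix, point):
    (m11, m12), (m21, m22) = matrix
    x, y = point
    return (m11 * x + m12 * y, m21 * x + m22 * y)


toy = [(1.0, 2.0), (2.0, 2.9), (3.0, 4.2), (4.0, 4.8), (5.0, 6.1), (2.5, 3.0)]
transform = pca_rescaling_2d(toy)
rescaled = [apply_matrix(transform, p) for p in toy]
```

In a binned analysis one such matrix would be computed from the QCD sample in each bin and then applied to every jet, signal or background, falling into that bin.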
Therefore, the PCA tagger involves two different tasks:
1. To set up a transformation map between $\tau$ and $\tau^{\mathrm{PCA}}$, which requires a binning of MC data for the QCD background in $m_J$ and $p_{TJ}$. This map is used both when training the NN (with signal and background events) and when applying the tagger to test data.
2. To train the NN using $\tau^{\mathrm{PCA}}$ variables in some interval of $m_J$ and $p_{TJ}$, which might only be a subset of the entire domain of the transformation map.
In order to test whether a tagger trained on input data with this preprocessing will sculpt QCD jet mass distributions, we generate as test data 1,081,834 QCD jets (evenly split between gluon and quark jets) with $p_T > 1000$ GeV, and with no mass cut. The jet mass distribution for this data is given by the solid black lines in figure 6. For the PCA preprocessing of the $\tau_N^{(\beta)}$ variables, we bin the data by jet mass with variable bin sizes (as indicated by the bin widths in figure 6), in order to have similar numbers of events in each bin, and define a PCA transformation for each bin calculated from the data in that bin.
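A variable-width binning with similar numbers of events per bin can be obtained from the quantiles of the jet mass distribution. The following Python sketch shows one simple way to do this; the function names and the toy masses are hypothetical, and the actual bin edges used in the analysis are those shown in figure 6.

```python
def equal_population_edges(masses, n_bins):
    """Bin edges chosen so that each bin holds a similar number of jets."""
    ordered = sorted(masses)
    n = len(ordered)
    edges = [ordered[0]]
    for k in range(1, n_bins):
        edges.append(ordered[(k * n) // n_bins])  # k-th quantile of the sample
    edges.append(ordered[-1])
    return edges


def bin_index(mass, edges):
    """Index of the bin containing mass; the last bin includes its upper edge."""
    for i in range(len(edges) - 2):
        if mass < edges[i + 1]:
            return i
    return len(edges) - 2


# Toy jet masses in GeV (illustrative only): sparse at high mass, dense at low mass.
toy_masses = [60, 62, 65, 70, 80, 95, 110, 130, 160, 200, 260, 350]
edges = equal_population_edges(toy_masses, 3)
counts = [0, 0, 0]
for m in toy_masses:
    counts[bin_index(m, edges)] += 1
```

The bins are narrow where the distribution is dense and wide in the tail, so that the covariance matrix in each bin is estimated from comparable statistics.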
An additional sample of QCD jets, generated in the same manner as above, is set aside for use in training two new taggers. These taggers are trained on the $\tau^{\mathrm{PCA}}$ values of QCD and MI data selected only in a mass window, indicated by the shaded boxes in figure 6, to investigate if they will sculpt the QCD jet mass distribution around those windows and if they will still be sensitive to new physics signals outside of those windows. The cuts implemented on the training data and the parameters for the generation of the MI training data for these taggers are [...]. The sizes of the event samples used for training these taggers are given in the first two columns of table 3. The solid coloured lines in the first two rows of figure 6 indicate the jet mass distribution for the QCD test sample after selection by the taggers at varying thresholds. We find that there are no new spurious features introduced by application of either tagger. In order to test the sensitivity of these taggers to boosted resonance signals at different masses we simulate the following two signals [...], resulting from the decay of a 2.2 TeV resonance. The dashed lines in figure 6 indicate the results when these signals are injected into the QCD test sample, re-weighted to correspond to 1.2% and 0.7% of the size of the QCD sample, respectively. We see that both taggers not only succeed in not sculpting the QCD jet mass distribution, but they are also sensitive to BSM boosted objects outside of the mass range in which they were trained.
We also wish to test the effect of using a tagger in a $p_T$ region in which it was not trained. We therefore generate QCD data in the range 500 GeV < $p_T$ < 1000 GeV, binned in jet mass in the same way as the $p_T > 1000$ GeV data above. The $\tau$ distributions of this data determine the PCA transformations for data falling into these bins. We also generate MI data on which to train the following tagger [...]. The size of the event samples used for the training is given in the third column of table 3. This tagger is then applied to the data described above in the $p_T > 1000$ GeV bins. The results are shown in the third row of figure 6. We find that the performance of the tagger is not greatly sensitive to the $p_T$ and mass spectrum of jets used to train the tagger, so long as the data has been properly standardised along the principal component axes.

Discussion
The generic anti-QCD taggers we have developed in this work provide an alternative to usual taggers in LHC searches for new physics in the boosted regime, with the main advantage being their broad sensitivity to multi-pronged boosted signatures. This feature is of great interest as we do not yet know how new physics might manifest at the LHC. Indeed, new relatively light particles beyond the SM might exist and be produced with very high boosts, for instance if they result from the decay of a heavier particle. If these particles decay hadronically then their signature is a single massive fat jet which might be difficult to separate from QCD backgrounds with existing tools.
A generic anti-QCD tagger entails a compromise between a high rejection of the QCD background and a broad sensitivity to a variety of signals. As we have shown, a dedicated tagger has a better performance for the specific signal it is trained on, but it can be rather blind to other types of signals. In particular,
1. For jet masses around the weak boson masses, there is a remarkable improvement for BSM boosted signals (exemplified by stealth bosons) with respect to a W-dedicated tagger analogous to the one in ref. [13], while keeping nearly the same performance for W bosons. Both for W and stealth bosons, the generic tagger provides quite an improvement over simple $\tau$-ratios.
2. For heavier jet masses of a few hundred GeV, the advantage of a generic tagger is still the sensitivity to several multi-pronged signals, though the performance of a dedicated tagger can be significantly better.
In either case, final states involving several b quarks are harder to distinguish from the QCD background than those involving light quarks, but b tagging could also be used as an additional independent tool. Overall, we observe that searches for new resonances would greatly benefit from a generic tagger for hadronic boosted objects, perhaps complementing dedicated ones. (Dedicated taggers also have their place in specific analyses where one is not interested in other possible signatures, for example in $t\bar{t}$ measurements in the boosted regime.) A simple application of a generic tagger of this kind would be an extension of the existing searches for diboson resonances, which search for a resonance bump in a di-jet invariant mass distribution. The use of a generic tagger would make it possible to search for resonances decaying to a boosted SM boson and a boosted BSM boson. In this case, leptonic decays could be selected for the SM boson and the recoiling fat jet might be selected in a series of broad mass windows after selection by a generic tagger trained in each window. Alternatively, hadronic decays could also be selected for the SM boson, using standard tagging criteria. One recent example is given by ref. [54], which looks for XH decays of a heavy resonance, selecting $H \to bb$ for the Higgs boson and a two-pronged decay $X \to qq$ for X, with a set of overlapping mass windows for the new particle X and a standard tagger $D_2^{(\beta=1)}$. In this case, a generic tagger could be used to provide sensitivity not only to $X \to qq$ but to other topologies as well. A search could also be carried out for di-BSM bosons, requiring both bosons to have similar mass, and doing a scan over a series of broad mass windows.
Going beyond the discrimination of various signals against the QCD background, it may also be desirable to have a fixed background rejection as a function of the jet mass, for example to allow for data-driven background estimation using jet mass sidebands, and for the application of bump-hunting strategies on a jet mass distribution. This is a solved problem, and can be achieved by applying existing decorrelation techniques to the NN output. However, doing this in such a way as to also maintain good sensitivity to signals over a broad range of masses with a single tagger, and without signal-mass bias, is a more difficult problem. We have demonstrated that an approach based on standardising along the principal component axes gives satisfactory results in simulation. It is implemented by building a 'transformation map' in the two-dimensional plane of $m_J$ and $p_{TJ}$, using Monte Carlo simulation of the QCD background. This map relates the N-subjettiness variables $\tau$, which are the inputs to the tagger. This relation varies with the jet mass and $p_T$ and, in practice, it is enough to consider suitable bins in $m_J$ and $p_{TJ}$. This can be considered as an extension of the approach which has already been taken in a CMS search for light resonances decaying to quark pairs [25] to decorrelate the jet substructure tagger from the jet mass. In our case, the tagger is trained at some given $m_J$ and $p_{TJ}$ intervals, and it can be subsequently applied outside these intervals by using the map of transformations between the $\tau$ variables.
Although handling the number of variables used here for the transformation map of PCA-scaled taggers (17 × 17 for the correlation matrix) may seem a formidable task for an experimental analysis, a simpler approach will suffice. First, a five-body tagger has nearly the same performance, as seen in appendix A, which reduces the number of variables to 11 × 11.
Second, some optimisation by reducing the number of variables may be performed too, without sacrificing the performance. Indeed, in this work our goal has been to provide a proof of concept that anti-QCD taggers can be built, leaving the optimisation for future analyses.
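As a rough illustration of what standardising along the principal component axes means for the tagger inputs in a single ($m_J$, $p_{TJ}$) bin, the sketch below whitens a toy two-variable background sample. It is not the implementation used for the results in this paper: the restriction to two variables, the closed-form diagonalisation of the 2 × 2 covariance matrix, the toy data and all function names are assumptions made to keep the example self-contained. A realistic version would use the full set of inputs and one such transformation per bin of the map.

```python
import math
import random

def fit_pca_scaler(sample):
    """Fit a PCA standardisation on a list of (x, y) background points.

    Returns (mean, axes, sigmas): the sample mean, the two orthonormal
    principal axes and the standard deviation along each axis.
    """
    n = len(sample)
    mx = sum(p[0] for p in sample) / n
    my = sum(p[1] for p in sample) / n
    sxx = sum((p[0] - mx) ** 2 for p in sample) / n
    syy = sum((p[1] - my) ** 2 for p in sample) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in sample) / n
    # Rotation angle that diagonalises the symmetric 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    axes = ((c, s), (-s, c))
    # Variances along the rotated axes (eigenvalues of the covariance).
    var1 = c * c * sxx + 2.0 * c * s * sxy + s * s * syy
    var2 = s * s * sxx - 2.0 * c * s * sxy + c * c * syy
    return (mx, my), axes, (math.sqrt(var1), math.sqrt(var2))

def apply_pca_scaler(point, scaler):
    """Centre a point, project it on the principal axes and scale to unit width."""
    (mx, my), axes, sigmas = scaler
    dx, dy = point[0] - mx, point[1] - my
    return tuple((a[0] * dx + a[1] * dy) / sig for a, sig in zip(axes, sigmas))

# Toy, correlated "background" sample standing in for two tagger inputs
# in one bin of the map (values are illustrative only).
random.seed(1)
toy = []
for _ in range(2000):
    a = random.gauss(0.0, 1.0)
    b = random.gauss(0.0, 1.0)
    toy.append((0.30 + 0.10 * a, 0.20 + 0.08 * a + 0.03 * b))

scaler = fit_pca_scaler(toy)
scaled = [apply_pca_scaler(p, scaler) for p in toy]
```

After the transformation the toy background has zero mean, unit variance along each axis and no linear correlation. Signal jets in the same bin would be passed through the same `scaler`, which is fitted on background only.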
Either in its simplest versions (as in section 3) with standardised input, or in its mass-decorrelated versions (as in section 5) with PCA scaling, a generic anti-QCD tagger is a novel tool, whose implementation seems feasible, and which could greatly benefit experimental analyses. The final goal is quite ambitious: to enlarge the scope of new physics searches with SM boosted objects, so as to be sensitive to new physics yielding BSM boosted objects. This will constitute a leap forward in new physics searches at the energy frontier, and is well worth the effort.

B Effect of signal composition on training
The shape of a jet from a heavy quark such as a b quark is in general different from that of gluons and light quarks. We have included light and b quark jets in our MI data earlier, in an attempt to capture all possible shapes, but for simplicity we have not included gluons.
In this appendix we show how the results are affected if (i) one does not include b quarks in the training data, or (ii) one also includes gluons. For each case, we train taggers with a modified MI data set: for case (i) we use the subset of processes in eq. (2.3) that do not involve b quarks in the final state, while for case (ii) we use all the processes in eq. (2.3) and in addition the process $H_1^0 \to gg$. In all cases, we continue to use equal numbers of events for each of the three or seven categories of training signal data. We perform these studies in two kinematic regimes corresponding to those used for the std1000 and std1500 taggers in section 3:
We test the performance of the taggers on the signal processes
The masses indicated for $H_1^0$ and $A^0$ correspond to cases (a) and (b) respectively. The results are shown in figure 8 (top panel for b quarks and bottom panel for gluons).
Focusing first on the case of inclusion of b quarks in the training data, we find that for decays with only light (u) quarks in the final state, the inclusion of b quark MI data has no effect on tagging performance. For other decays, which include b quarks or gluons in the final state, taggers trained with b quarks do marginally better. Second, when (gg)-jets are added to the training data, we find that including this process marginally improves the discrimination power for (gg)-jets in both kinematic regimes studied. For other processes that have b or u quarks in the final state, the inclusion of these jets in training has a negligible effect on the performance.

C Background composition: effect of quark to gluon ratio
For simplicity, in the training of our taggers and their testing on Monte Carlo data, we have assumed that the background is composed of equal parts of quarks and gluons. This is obviously not the case in a real analysis, in which the relative ratio will depend not only on the final state considered, but also on the energies involved. In this appendix we explore how sensitive the results are to the precise ratio of quarks and gluons.
Figure 9. Effect of QCD quark to gluon ratio in training and test data. The performance of the different taggers (solid, dashed and dotted lines) is shown for several signals (in rows) and for different choices of quark to gluon ratios in the background test data (in columns).
We focus on $p_{TJ} > 1000$ GeV, $m_J \in [65, 105]$ GeV, as considered for the std1000 tagger, and train three taggers on the MI data in eq. (2.3) and the QCD background, with three ratios of quarks and gluons: $n_q = 10\,n_g$, $n_q = n_g$, $n_q = 0.1\,n_g$, corresponding to the solid, dashed and dotted lines in figure 9. We test these taggers for several signals, with a background composed of the same three ratios of quarks and gluons: $n_q = 10\,n_g$ (left column), $n_q = n_g$ (middle column), $n_q = 0.1\,n_g$ (right column). The signal processes
The main conclusion of this comparison is that the results do not depend much on the precise background composition, as seen from a glance at figure 9. In some cases the relative performance of the three taggers is as expected: for example, for the second process in (C.1) above, the tagger is (marginally) better when the background composition is the same in training and testing. But this is not the case for the third process in (C.1). For example, for $n_q = 0.1\,n_g$, the tagger trained with the 'inverse' ratio $n_q = 10\,n_g$ is slightly better. This suggests that changing the background composition also affects the way in which the tagger learns what is signal and what is background.
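To make the three background compositions concrete: for $n_q = r\,n_g$ the quark fraction of the sample is $r/(1+r)$, that is about 91%, 50% and 9% for the three ratios considered, and at a fixed tagger threshold the mistag rate of the mixture is the fraction-weighted average of the quark and gluon mistag rates. The short sketch below encodes this bookkeeping. The function names are illustrative, and the per-flavour efficiencies in the usage line are placeholder numbers, not results of this study.

```python
def quark_fraction(ratio):
    """Fraction of quark jets in a background sample with n_q = ratio * n_g."""
    return ratio / (1.0 + ratio)

def mixed_background_efficiency(eff_quark, eff_gluon, ratio):
    """Mistag rate of a quark/gluon mixture at a fixed tagger threshold."""
    f_q = quark_fraction(ratio)
    return f_q * eff_quark + (1.0 - f_q) * eff_gluon

# The three compositions used for training and testing in this appendix.
fractions = {ratio: quark_fraction(ratio) for ratio in (10.0, 1.0, 0.1)}

# Placeholder per-flavour mistag rates, for illustration only.
example = mixed_background_efficiency(eff_quark=0.2, eff_gluon=0.1, ratio=1.0)
```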

D NN architecture
The choice of architecture in any NN problem merits its own study. Throughout this paper we have used an architecture that gives robust results against variations in its depth (number of hidden layers) and breadth (number of nodes in a hidden layer). In this appendix we show that the results are very insensitive to variations of this choice. Apart from the architecture considered for our results in sections 3-5 (two fully connected hidden layers with 512 nodes and 32 nodes respectively, henceforth referred to as 512-32), we consider here two more architectures, 1024-32 and 512-512-32, in a self-explanatory notation. We consider the mass and $p_{TJ}$ ranges used in the definition of the std1000 and std1500 taggers, and train two taggers on the MI processes in eqs.

E Is the tagger learning shape or kinematics?
Although it has been shown that the taggers can efficiently discriminate various multipronged signals from the QCD background, a question remains whether this discrimination is solely based on jet shapes or there is also some effect from the different kinematics of the signals and the background. For example, we have already mentioned that the heavy Z and W resonance masses have been chosen in such a way that the $p_{TJ}$ distributions are similar to the background, but still there are some differences, which can be seen in figure 11 (left), between the distributions of the QCD background and two sample signals, W and stealth bosons. The same can be said about the jet mass, shown in the right panel: while the background distribution is rather flat, the signals concentrate around and slightly above the input resonance mass. We have tested the effect of the different $p_{TJ}$ and $m_J$ dependence by considering the discrimination of these two signals with reweighted distributions, using the std1000 tagger. (The reweighting of the signals makes them have the same two-dimensional ($p_{TJ}$, $m_J$) distributions as the background.) With this purpose, a two-dimensional binning in $p_{TJ}$ and $m_J$ is applied, with 25 GeV bins in $p_{TJ}$ and 5 GeV bins in $m_J$. The ranges of these variables are restricted to $p_{TJ} \in [1000, 1250]$ GeV and $m_J \in [75, 105]$ GeV, in order to avoid the appearance of a few events with too large weights that might bias the results. As can be seen from figure 11, still within those intervals there is significant variation of these two variables.
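The reweighting described above can be sketched as follows. This is a minimal illustration and not the code used for figure 12: the representation of events as ($p_{TJ}$, $m_J$) pairs, the function names, the toy events and the choice to give zero weight to signal events outside the ranges or in bins without background are all assumptions. Only the bin widths and ranges are taken from the text.

```python
from collections import Counter

PT_RANGE, PT_STEP = (1000.0, 1250.0), 25.0  # GeV, from the text
MJ_RANGE, MJ_STEP = (75.0, 105.0), 5.0      # GeV, from the text

def bin_index(pt, mj):
    """Return the (i, j) bin of an event, or None if it is outside the ranges."""
    in_pt = PT_RANGE[0] <= pt < PT_RANGE[1]
    in_mj = MJ_RANGE[0] <= mj < MJ_RANGE[1]
    if not (in_pt and in_mj):
        return None
    return (int((pt - PT_RANGE[0]) // PT_STEP), int((mj - MJ_RANGE[0]) // MJ_STEP))

def normalised_histogram(events):
    """Fraction of in-range events falling in each two-dimensional bin."""
    bins = (bin_index(pt, mj) for pt, mj in events)
    counts = Counter(b for b in bins if b is not None)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}

def signal_weights(signal, background):
    """Per-event weights that give the signal the background (pT, mJ) shape."""
    h_sig = normalised_histogram(signal)
    h_bkg = normalised_histogram(background)
    weights = []
    for pt, mj in signal:
        b = bin_index(pt, mj)
        # Assumption: out-of-range events are dropped by giving them weight 0.
        weights.append(0.0 if b is None else h_bkg.get(b, 0.0) / h_sig[b])
    return weights

# Toy events as (pT, mJ) pairs in GeV; values are illustrative only.
signal = [(1010.0, 81.0), (1012.0, 82.0), (1015.0, 83.0), (1110.0, 92.0)]
background = [(1005.0, 80.5), (1120.0, 91.0), (1115.0, 93.0), (1300.0, 80.0)]
weights = signal_weights(signal, background)
```

In this toy case three signal events share a bin that holds one third of the in-range background, so each is weighted down, while the single event in the other populated bin is weighted up. Restricting the ranges, as done in the text, limits how large such weights can become.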
The comparison between the ROC curves for the signals with the original and reweighted distributions, in both cases restricted to the mentioned $p_{TJ}$ and $m_J$ intervals, is presented in figure 12. The left panel shows the results for W bosons, also including the curves for $\tau_{21}^{(1)}$, and the right panel shows the results for stealth bosons. In both cases we observe that the differences between the results with the original and reweighted distributions are very small. Also, the ROC curves without reweighting (but with restricted $p_{TJ}$ and $m_J$ range) can be compared to those in figure 1 (right), to see that they are very similar. Overall, it is found that the influence of kinematics in the tagger learning, if any, is quite small.
Besides, we note that the variable $\tau_1^{(2)}$, which is an input to our taggers, is closely related to $(m_J/p_T)^2$ [55]. As discussed in section 5, our objective is to avoid as much as possible jet mass and $p_T$ being directly used as discriminating variables by the tagger. One may wonder whether this variable should have been excluded from our set in eq. (2.2). In order to test its influence on our results, we train a variant of the std1000 tagger on all 7-body variables except $\tau_1^{(2)}$ and compare its performance to the std1000 tagger in figure 13; we find no effect on the tagger performance. This can be understood because the leading dependence $\tau_1^{(2)} \sim (m_J/p_T)^2$ is the same for all signals and backgrounds at a given mass and $p_{TJ}$. Therefore, the standardisation of the inputs erases this dependence to a very large extent.
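The leading relation between $\tau_1^{(2)}$ and $(m_J/p_T)^2$ can be checked numerically for a narrow jet. The sketch below assumes massless constituents, a jet axis at the $p_T$-weighted centroid in rapidity and azimuth, and the unnormalised form $\tau_1^{(2)} = \sum_i p_{T,i}\,\Delta R_i^2 / \sum_i p_{T,i}$; the normalisation adopted in eq. (2.2) may differ by a constant factor. The constituent list is invented for illustration, and azimuthal wrap-around is ignored because the toy jet sits near zero azimuth.

```python
import math

# Toy constituents as (pT in GeV, rapidity, azimuth); illustrative values only.
constituents = [
    (400.0, 0.02, 0.01),
    (350.0, -0.03, 0.04),
    (200.0, 0.05, -0.06),
    (50.0, -0.08, 0.02),
]

def jet_mass_and_pt(parts):
    """Jet mass and pT from the four-momentum sum of massless constituents."""
    px = sum(pt * math.cos(phi) for pt, y, phi in parts)
    py = sum(pt * math.sin(phi) for pt, y, phi in parts)
    pz = sum(pt * math.sinh(y) for pt, y, phi in parts)
    energy = sum(pt * math.cosh(y) for pt, y, phi in parts)
    mass2 = energy * energy - px * px - py * py - pz * pz
    return math.sqrt(max(mass2, 0.0)), math.hypot(px, py)

def tau1_beta2(parts):
    """One-subjettiness with angular exponent 2, axis at the pT-weighted centroid."""
    sum_pt = sum(pt for pt, y, phi in parts)
    y0 = sum(pt * y for pt, y, phi in parts) / sum_pt
    phi0 = sum(pt * phi for pt, y, phi in parts) / sum_pt
    spread = sum(pt * ((y - y0) ** 2 + (phi - phi0) ** 2) for pt, y, phi in parts)
    return spread / sum_pt

mass, pt_jet = jet_mass_and_pt(constituents)
ratio = tau1_beta2(constituents) / (mass / pt_jet) ** 2
```

For a collimated jet such as this one, `ratio` is close to one, with deviations that grow with the angular size of the jet. Since this dependence is common to signal and background at fixed mass and $p_{TJ}$, it carries little discriminating information once the inputs are standardised.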

F Pythia versus Herwig
A serious challenge in the application of machine learning to jet physics in a real collider experiment is the question of whether the distributions of substructure variables are correctly modelled by simulation, and whether the performance of the tagger is robust under mismodelling. Designing approaches to bypass mismodelling fragility is an active area of research [56][57][58], but beyond the scope of this work. In this appendix, we restrict ourselves to investigating the variation of tagging performance when using data hadronised with Pythia (as used for our results in sections 3-5) and Herwig. For brevity, we focus on $p_{TJ} > 1000$ GeV, $m_J \in [65, 105]$ GeV, as used in the std1000 tagger. Two new taggers are trained on all processes in eq. (2.3) and the background; one of them is trained on data generated with Pythia and the other on data generated with Herwig. We test the performance of the two taggers on W bosons and stealth bosons with $M_{H_1^0} = 80$ GeV, $M_{A^0} = 30$ GeV. These test data (both signal and background) are generated twice, once with Pythia and once with Herwig. We show in figure 14 the results for in-sample tests (e.g. a Pythia-trained tagger tested on Pythia data) as well as out-of-sample tests (e.g. a Pythia-trained tagger tested on Herwig data).
We find that in general the performance is better on Pythia-generated data than on Herwig-generated data, and for the most part this is largely independent of which data the tagger was trained on. The exception is that the tagger trained on Pythia data has significantly worse performance on the Herwig data for stealth bosons, compared to the tagger trained on Herwig data. The differences between the Herwig and Pythia curves appear to arise mostly from the different modelling of the higher-order $\tau$ variables. This argument is confirmed by the study of the discriminating power of individual $\tau$ variables taken alone for the stealth boson signal, using Pythia and Herwig data. It is clearly seen that the differences due to showering are much more pronounced for $\tau_4^{(1)}$.
Because it is of the utmost importance that the performance of the tagger on QCD data be very well understood, it might be best to train such a tagger with real QCD data and, especially, to test its performance directly on data in suitable control regions. Significant uncertainties on signal efficiencies may remain, but these are less important than having an accurate prediction for the background.

G Boosted coloured jets
In this paper we have focused on jets resulting from the boosted decays of colour-singlet new particles, which might easily be missed in searches looking for their direct production due to a small production cross section. Light coloured particles have large production cross sections, and searches for signatures resulting from their direct pair production via QCD are typically highly constraining. However, in ref. [59] it was noted that for some decays of such particles, for example a vector-like quark (VLQ) decaying via a non-renormalisable operator into three light quarks, there are no meaningful LHC constraints from direct searches for these particles with masses between 100 GeV and 1000 GeV. For masses as low as a few hundred GeV, passing the LHC thresholds for jet-based searches may require these VLQs to be produced with high momentum, resulting in collimation of their decay products into a fat jet. Therefore, and also for completeness, it is of interest to see if the generic tagger which we have trained only on colour-singlet jets is sensitive also to these coloured jets. Further, this is the only three-pronged signal which we test our tagger on.
We simulate the pair production of VLQs $T\bar{T}$ with $T \to ccc$ using a UFO file generously provided by the authors of ref. [59], setting the T mass to 400 GeV. We impose a generation-level cut $H_T > 3$ TeV, where $H_T$ is the scalar sum of the $p_T$ of the quark decay products. The detector-level selection is made in the same way as described for the signals used to test the tagger std1500 in section 3, selecting the hardest jet. In figure 16, we show the performance of the generic tagger on this signal (solid line), as well as the performance of a dedicated tagger trained to discriminate T-jets from QCD jets. We find that the generic tagger has a moderate performance for this signal, with approximately 10% QCD efficiency at 50% signal efficiency. One might wonder if the sensitivity to such signals is lost by training the generic tagger only on colour-singlet jets. However, the dedicated tagger has only marginally better performance, which suggests that the reason for the moderate performance of the generic tagger is that this type of signal is intrinsically hard to distinguish from QCD jets.
An outline of a possible search strategy for this signature could be as follows. Select back-to-back dijet events passing some high $p_T$ threshold. Require the jet masses of the two jets to be close to each other and, after applying a tagger at some threshold, look for a bump in the average jet mass distribution of the two jets. This could be either a cut-and-count analysis in relatively wide jet mass bins, or a bump-hunting shape analysis on a smooth background distribution.
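For orientation, the working point quoted above for the VLQ signal (about 10% QCD efficiency at 50% signal efficiency) can be translated into a significance improvement. The sketch assumes the usual definition, signal efficiency divided by the square root of the background efficiency, which is the factor by which $S/\sqrt{B}$ changes after the tagger cut; the function name is illustrative.

```python
import math

def significance_improvement(signal_eff, background_eff):
    """Factor by which S/sqrt(B) changes after applying a tagger cut."""
    return signal_eff / math.sqrt(background_eff)

# Approximate working point read off for the generic tagger on T-jets.
value = significance_improvement(0.50, 0.10)
print(round(value, 2))  # 1.58
```

Under this assumption the improvement is about 1.6, which applies to a single tagged jet; in the dijet strategy outlined above the tagger would be applied to both jets.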
Open Access. This article is distributed under the terms of the Creative Commons Attribution License (CC-BY 4.0), which permits any use, distribution and reproduction in any medium, provided the original author(s) and source are credited.