Thinking outside the ROCs: Designing Decorrelated Taggers (DDT) for jet substructure

We explore the scale-dependence and correlations of jet substructure observables to improve upon existing techniques in the identification of highly Lorentz-boosted objects. Modified observables are designed to remove correlations from existing theoretically well-understood observables, providing practical advantages for experimental measurements and searches for new phenomena. We study such observables in $W$ jet tagging and provide recommendations for observables based on considerations beyond signal and background efficiencies.


Introduction
Techniques that aim to exploit the substructure of jets in order to identify highly Lorentz-boosted objects [1][2][3][4] have become an essential component of the LHC phenomenology toolkit. Several grooming and tagging algorithms, e.g. [5][6][7][8][9][10][11][12][13][14][15], have been developed, successfully tested, and are currently used in experimental analyses. Considerable theoretical progress has also been made, and calculations that describe the action of groomers and taggers on both background [16,17] and signal jets [18,19] have been performed. More recently, calculations have been extended to the interesting case in which a jet shape is measured in conjunction with a cut on the jet mass [20][21][22][23][24].
Despite this enormous amount of progress, experimental collaborations have yet to fully exploit these advances to reduce systematic uncertainties in analyses using substructure techniques. Much study has been focused on the relationship between numerous identification observables in order to construct optimal heavy object taggers. Dedicated phenomenological studies [4] and detailed analyses by CMS [25][26][27][28] and ATLAS [29][30][31][32] employing multivariate techniques were performed in order to understand how best to identify boosted W/Z bosons, top quarks and Higgs bosons, optimizing the statistical discrimination power in terms of background rejection and signal efficiency. Moreover, there has been recent interest in using computer vision techniques to combine individual calorimeter cells into non-linear optimal observables [33][34][35]. However, a quantitative study of the reduction of systematic uncertainties by taking advantage of theoretical improvements has not yet been performed.
In the following study, we aim to build a tagger based not only on statistical discrimination power, but also on the robust behavior of the inherent QCD background. This tagger will be designed such that, after applying a flat cut on the tagging variable, the shape of the QCD background jet mass distribution remains stable and flat. We demonstrate our methodology, entitled "designed decorrelated taggers (DDT)", by performing an example analysis in which hadronically decaying W boson jets are distinguished from quark- and gluon-initiated jets. The DDT approach is applicable to the identification of any heavy boosted objects, such as Z, H, and top jets.

Samples
The Monte Carlo samples used in this study were originally used for studies in the BOOST13 report [4]. Samples were generated at $\sqrt{s} = 8$ TeV for QCD dijets, and for $W^+W^-$ pairs produced in the decay of a scalar resonance. The QCD events were split into subsamples of gg and qq events, allowing for tests of discrimination of hadronic W bosons, quarks, and gluons. QCD samples were produced at leading order (LO) using MADGRAPH5 [36], while WW samples were generated using the JHU GENERATOR [37]. The samples were then showered through PYTHIA8 (version 8.176) [38] using the default tune 4C [39]. The samples were produced in exclusive $p_T$ bins of width 100 GeV at the parton level. The $p_T$ bins investigated in this report were 300-400 GeV, 500-600 GeV and 1.0-1.1 TeV.
The stable particles in the generator-level events are clustered into jets with the anti-$k_T$ jet algorithm [40] with three different distance parameters, $R$ = 0.4, 0.8, 1.2, using fastjet 3.1 [41,42]. No multiple parton interactions (or pileup) are included in these samples, although previous LHC measurements [43,44] have shown that grooming algorithms are more resilient to pileup effects than standard jet algorithms. Furthermore, it was shown in those measurements that the Monte Carlo simulation can accurately reproduce the data for regions of high jet mass, whereas there are disagreements below the Sudakov peak. The grooming algorithms, however, mitigate this disagreement very strongly as well. As such, we study jets with a grooming algorithm applied. The algorithms we have investigated are the "modified" mass-drop tagger (mMDT) [5,16] with $z_\mathrm{cut} = 0.1$, jet trimming [10] with $R_\mathrm{sub} = 0.3$ and $f_\mathrm{cut} = 0.1$, jet pruning [8,9], and soft drop [12] with $z_\mathrm{cut} = 0.1$ for both $\beta = 1$ and $\beta = 2$ (note that the case of $\beta = 0$ is equivalent to the mMDT). We have found that the conclusions are not strongly dependent on the groomer used, so we have used soft drop with $\beta = 0$ (mMDT) for most of our comparisons because its scaling behavior is smoother than that of other groomers [16].
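To make the role of $z_\mathrm{cut}$ and $\beta$ concrete, the following is a minimal sketch of the soft-drop condition applied to a single two-prong splitting. It is not the full algorithm, which walks back through the Cambridge/Aachen clustering history of the jet and is provided by the fastjet contrib packages. The function name and the toy momenta are hypothetical, chosen only for illustration.

```python
def passes_soft_drop(pt1, pt2, delta_r, z_cut=0.1, beta=0.0, r0=0.8):
    """Soft-drop condition for one splitting: z > z_cut * (delta_r / r0)**beta.

    pt1, pt2 : transverse momenta of the two branches
    delta_r  : angular distance between the branches
    r0       : jet distance parameter (0.8 assumed here)
    """
    z = min(pt1, pt2) / (pt1 + pt2)
    return z > z_cut * (delta_r / r0) ** beta

# Hypothetical soft, wide-angle emission carrying 5% of the momentum.
# With beta = 0 the angular factor is 1, so the condition is z > z_cut (mMDT-like).
kept_beta0 = passes_soft_drop(95.0, 5.0, 0.4, beta=0.0)
# With beta = 2 the threshold is lowered to 0.1 * (0.4 / 0.8)**2 = 0.025.
kept_beta2 = passes_soft_drop(95.0, 5.0, 0.4, beta=2.0)
print(kept_beta0, kept_beta2)
```

In this toy case the emission fails the $\beta = 0$ condition and passes the $\beta = 2$ one, illustrating that larger $\beta$ grooms less aggressively at small angles.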

Current taggers
Current heavy object jet substructure taggers employed by CMS and ATLAS often cut on some number of observables directly or through some algorithm. Take, for example, something similar to the CMS Run 1 W tagger that uses simple cuts on the N-subjettiness ratio $\tau_2/\tau_1$ [11] and the soft drop jet mass [12]. In this study, we consider the $\tau_2/\tau_1$ variable where the subjet axes are chosen using the $k_T$ one-pass axes optimization technique.
In order to distinguish hadronically decaying W bosons (which give rise to jets that are intrinsically two-pronged) from QCD background, a flat cut on $\tau_2/\tau_1$ is typically performed. As expected, this procedure greatly reduces the background, but it also leads to an unwanted sculpting of the soft drop jet mass distribution (an undesirable feature also discussed in Ref. [45]), as shown in Fig. 1.
Figure 1: Soft drop mass distribution ($z_\mathrm{cut} = 0.1$ and $\beta = 0$) for gluon jets after various cuts on $\tau_2/\tau_1$ ($\beta_\tau = 1$) for different jet $p_T$ bins: $p_T$ = 300-400 GeV (top left), $p_T$ = 500-600 GeV (top right), $p_T$ = 1-1.1 TeV (bottom left), and for the signal (bottom right); the signal distributions are stable versus $p_T$. The cuts on $\tau_2/\tau_1$ vary from 1.0 to 0.0 in steps of 0.02; the changing line styles for successive cuts are meant to visually aid the reader.
After cutting on $\tau_2/\tau_1$ to select jets which are two-pronged, the QCD background soft drop jet mass distribution becomes more peak-like in shape, making it harder to distinguish QCD jets from W jets, which also have a peak in the jet mass distribution. The shape of the sculpted jet mass distribution, and the location of this artificial peak, varies for different jet $p_T$ regions. This $p_T$-dependent sculpting of the jet mass distributions makes sideband methods of background estimation more difficult. In this case and in further examples, we primarily consider gluon-initiated jets, though performance with quark-initiated jets is similar. Differences will be explored in greater detail in future studies.
In Ref. [16] it was argued that flat QCD mass distributions could be obtained by tuning the value of the soft drop energy fraction threshold ($z_\mathrm{cut}$), and optimal values for quark- and gluon-initiated jets were analytically derived. However, the presence of the $\tau_2/\tau_1$ cut makes the situation more complex and requires the issue to be reconsidered.
Therefore, we propose additional criteria for determining a better tagging observable beyond pure statistical discrimination power. For similarly discriminant observables, we would like to find an observable which (1) is primarily uncorrelated with the groomed jet mass observable (or rather has complementary correlations as far as discrimination is concerned) and (2) maintains a desirable groomed mass behavior while scaling $p_T$. Observables satisfying these criteria would, after applying a rectangular cut, still produce a flat groomed jet mass distribution.

Shape observable scaling in QCD
We start our study of the correlations of substructure variables with the jet mass and $p_T$ by introducing the appropriate scaling variable for QCD jets:
$$\rho = \log\left(\frac{m^2}{p_T^2}\right). \tag{3.1}$$
Here we have differed from the typical definition of the jet $\rho$ by removing the jet distance parameter $R^2$ from the denominator of the definition. For now we keep $R = 0.8$ fixed and leave the study of the $R$ dependence for future work. Note that when we apply soft drop, we take the mass in Eq. (3.1) to be computed on the constituents of the soft-drop jet, while the transverse momentum is that of the original (ungroomed) jet.
We now compute, on both our background and signal samples, the average value of the N-subjettiness ratio $\tau_2/\tau_1$ (computed on the full jet) as a function of the soft-drop $\rho$. This is shown in Fig. 2, on the left. The signal W jets are shown in open circles while the background, here gluon jets, is shown in closed circles. The various colors are different bins in jet $p_T$. We note the typical behavior: $\tau_2/\tau_1$ for the signal tends to lower values than for the background, and the signal sits at a given value of $\rho$ set by the mass scale of the signal jet in a given $p_T$ bin. The signal tends to be fixed around the W mass, and thus shifts for different values of $p_T$, and is otherwise most concentrated in the dip region.
Now, let us focus on the background curves (solid points). We notice a strong dependence of $\tau_2/\tau_1$ on $\rho$, which is what causes the sculpting of the mass distributions shown in the previous section. However, we note that there exists a region in $\rho$ for which this relationship is conspicuously linear. This is an interesting behavior, which we will exploit shortly in Sec. 4. We also observe that, even in this linear region, there is still a residual $p_T$ dependence, which looks like, to a very good approximation, a constant shift. The behavior observed in Fig. 2 for the soft drop $\rho$ is also observed for other groomers, such as trimming and pruning, within the $p_T$ ranges considered.
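The profile described above (the mean of $\tau_2/\tau_1$ in bins of $\rho$) can be sketched in a few lines. This is a minimal illustration under stated assumptions: natural logarithms are assumed for $\rho$, the helper names `rho` and `profile` are hypothetical, and the three toy jets are invented numbers, not values from the samples used in this study.

```python
import math

def rho(mass, pt):
    """Scaling variable rho = log(m^2 / pT^2); mass and pT in the same units."""
    return math.log(mass ** 2 / pt ** 2)

def profile(xs, ys, edges):
    """Mean of ys in bins of xs; returns (bin_center, mean) for non-empty bins."""
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        for i in range(n_bins):
            if edges[i] <= x < edges[i + 1]:
                sums[i] += y
                counts[i] += 1
                break
    return [
        (0.5 * (edges[i] + edges[i + 1]), sums[i] / counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]

# Hypothetical toy jets: (groomed mass [GeV], ungroomed pT [GeV], tau2/tau1)
toy_jets = [(35.0, 350.0, 0.70), (70.0, 350.0, 0.50), (80.0, 400.0, 0.60)]
rhos = [rho(m, pt) for m, pt, _ in toy_jets]
tau21 = [t for _, _, t in toy_jets]
tau21_profile = profile(rhos, tau21, edges=[-6.0, -4.0, -2.0])
print(tau21_profile)
```

The groomed mass and the ungroomed $p_T$ enter `rho` separately, mirroring the convention stated in the text.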
At lower values of $\rho$, differences between the groomers become more apparent, most likely because in that region trimming and pruning acquire further sensitivity to soft physics [16]. Thus, in the current study, we concentrate on the soft-drop mass due to its stable behavior. The approximately linear relation between $\tau_2/\tau_1$ and $\rho$ can be (qualitatively) understood by noting that, in the case $\beta_\tau = 2$, $\tau_2$ essentially measures the subjet mass, while $\tau_1$ corresponds to the jet mass itself. This leads to an approximately linear relation between $\tau_2/\tau_1$ and $\rho$ in the region of the (soft-collinear) phase space where all-order effects can be neglected. Furthermore, Ref. [24] performed calculations for jet mass distributions in the presence of a $\tau_2/\tau_1$ cut to an accuracy which is close to next-to-leading logarithmic (NLL) accuracy. Although the calculation corresponding to the profile plot in Fig. 2 was not performed there, it could in principle be derived because the authors do provide the double differential distribution in $\tau_2/\tau_1$ and $\rho$. However, some important differences between our current set-up and that of Ref. [24] prevent us from using their results to get more quantitative insight into the behaviors we observe, beyond the existence of a region with linear correlation. First, Ref. [24] did not consider the soft drop $\rho$ and, second, the definition of N-subjettiness differs in the two studies, both in the angular exponent ($\beta_\tau = 1$ versus $\beta_\tau = 2$) and in the choice of axes.
We note that, at fixed coupling, all the transverse momentum dependence is accounted for in the definition of the shape and $\rho$. We have checked whether the origin of the $p_T$ dependence that we see in Fig. 2 (on the left) could be traced back to the transverse momentum used in the definition of $\rho$ (ungroomed vs groomed), but this was found not to be the case.
Running coupling contributions, as well as other subleading corrections, do introduce a $p_T$ dependence and they are likely to be responsible for the observed $p_T$ dependence. However, a quantitative understanding of these effects would require a calculation using the techniques of Ref. [24]. This goes beyond the scope of this work and for this study we limit ourselves to a phenomenological solution, while leaving a first-principle analysis for future work. Thus, in order to remove the constant $p_T$-dependent shift in the $\tau_2/\tau_1$ profile, we introduce a modified version of $\rho$:
$$\rho' = \rho + \log\left(\frac{p_T}{\mu}\right) = \log\left(\frac{m^2}{p_T\,\mu}\right).$$
This change of variable, together with the choice $\mu \sim 1$ GeV, appears to do an excellent job of removing the $p_T$ dependence, as shown in Fig. 2, on the right, though of course we note this is purely an empirical observation.
So far, we have only considered $\tau_2/\tau_1$ versus soft drop mass. We also note that a similar linear correlation exists between $\tau_2/\tau_1$ and other groomed masses, though this is not shown explicitly. We can also consider other shape variables, though we leave an exhaustive exploration of all shape variables to a later study. As an example, we also show the energy correlation functions $C_2^{\beta=1}$ and $D_2^{\beta=1}$ as a function of $\rho$ in Fig. 3. On the left, $C_2^{\beta=1}$ shows a relatively flat distribution versus $\rho$, which is desirable, although the behavior is not quite linear. On the right, $D_2^{\beta=1}$ is highly correlated with $\rho$. In both cases, the correlations have some $p_T$ dependence that is not trivially empirically determined.
By performing the transformation $\rho \to \rho'$, we have successfully accounted for most of the $p_T$ dependence of the profile distribution. Next we would like to perform a further transformation with the aim of flattening the profile dependence on $\rho'$, with the idea that this will in turn reduce the mass sculpting discussed earlier.
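As a small illustration of the change of variable, the sketch below evaluates $\rho$ and $\rho'$ for a fixed groomed mass in the three $p_T$ ranges studied here. It assumes natural logarithms and $\mu = 1$ GeV; the function names and the 80 GeV mass are hypothetical choices made only for the example.

```python
import math

def rho(mass, pt):
    """rho = log(m^2 / pT^2)."""
    return math.log(mass ** 2 / pt ** 2)

def rho_prime(mass, pt, mu=1.0):
    """rho' = rho + log(pT / mu) = log(m^2 / (pT * mu)); mu in the same units as pT."""
    return math.log(mass ** 2 / (pt * mu))

mass = 80.0  # hypothetical groomed jet mass in GeV
for pt in (350.0, 550.0, 1050.0):
    shift = rho_prime(mass, pt) - rho(mass, pt)
    # The shift between the two variables is log(pT / mu), independent of the mass.
    print(pt, round(rho(mass, pt), 3), round(rho_prime(mass, pt), 3), round(shift, 3))
```

The shift depends only on $p_T$, which is why it can absorb a $p_T$-dependent offset of the profile without changing its slope.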
In order to determine the transformation we are after, we concentrate on the region in which the relationship between $\tau_2/\tau_1$ and $\rho'$ is essentially linear. Thus, we introduce
$$\tau_{21}' = \tau_2/\tau_1 - M \times \rho',$$
where the slope $M$ is numerically fitted from Fig. 2 (red fit lines). The comparison between the $\tau_2/\tau_1$ and $\tau_{21}'$ distributions is shown in Fig. 4, for different jet $p_T$ bins. The transformed variable, $\tau_{21}'$, looks similar to the original variable $\tau_2/\tau_1$, although the correlation with the groomed mass is now practically removed. Figure 5 shows the profile of $\tau_{21}'$ as a function of $\rho'$ with the intended decorrelated behavior. We note that a $p_T$ dependence of the signal shape is introduced which is, in hindsight, expected given that the transformation takes advantage of scaling properties of the background. This can cause a $p_T$ dependence in the signal efficiency for a cut on $\tau_{21}'$ that is not present for the original $\tau_2/\tau_1$; however, we note this is not necessarily an undesirable feature. For example, as backgrounds decrease at higher $p_T$ it may be desirable to allow a larger signal efficiency, and this should be studied in more detail in the experiments within the context of particular analyses.
Now, we can explore the sculpting of the mass distributions when making a flat cut on $\tau_{21}'$. This is shown in Fig. 6, which should be contrasted with Fig. 1, which was obtained with a flat cut on $\tau_2/\tau_1$. Notice that now the sculpting of the mass distribution is considerably reduced, particularly in the region of interest where the W boson peak is. With a simple transformation, we can now preserve mass sidebands for background estimations and make robust predictions of the $p_T$ dependence of the backgrounds. The practical consequences of a well-behaved background shape will be explored in Section 5.
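The two steps above (fit a slope in the linear region, then subtract the linear trend) can be sketched as follows. This is a toy illustration, not the fit used for the figures: the slope is fitted here to individual toy jets rather than to a binned profile, the background is generated with an invented linear trend and Gaussian scatter, and all function names are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def tau21_ddt(tau21, rho_p, slope):
    """Decorrelated variable: tau21' = tau21 - M * rho'."""
    return tau21 - slope * rho_p

# Toy background restricted to an assumed linear region 1 < rho' < 5.
rng = random.Random(1)
rho_p = [rng.uniform(1.0, 5.0) for _ in range(5000)]
tau21 = [0.8 - 0.08 * r + rng.gauss(0.0, 0.1) for r in rho_p]  # invented trend

slope, intercept = fit_line(rho_p, tau21)
tau21_prime = [tau21_ddt(t, r, slope) for t, r in zip(tau21, rho_p)]

# After the transformation the fitted slope versus rho' vanishes by construction.
residual_slope, _ = fit_line(rho_p, tau21_prime)
print(round(slope, 4), residual_slope)
```

In a real analysis the fit range would be restricted to the region where the profile is observed to be linear, and the slope would be taken from the background sample only.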
Generally speaking, a non-linear dependence is not a technical obstacle to performing an observable transformation and we discuss this in Section 6; however, studying the behavior in a simple analytic regime allows us to better understand the underlying physical behavior. The final component to evaluating the success of the observable transformation is to understand the performance of the new observable in terms of rejecting backgrounds.

Performance of DDT
To evaluate the performance of the transformed variable we use the traditional receiver operating characteristic (ROC) curve, defined as the signal efficiency as a function of the background efficiency. A better discriminating tagger is characterized by higher signal efficiency and lower background efficiency. The discriminating performance of $\tau_2/\tau_1$ and the transformed $\tau_{21}'$ is shown on the left of Fig. 7 for jets within a soft drop mass window of 60-120 GeV (corresponding to the W signal mass region). From the ROC curve, we note that after transforming the variable the discriminating power does not degrade and even shows a modest improvement in this kinematic regime. We can see where this comes from in the right panel of Fig. 7. After cutting on the raw $\tau_2/\tau_1$, the QCD soft drop jet mass distribution is sculpted such that many of the jets surviving the cut fall into the W mass region. In contrast, cutting on $\tau_{21}'$ leaves a more linearly falling distribution which preserves the low sideband. The mass distributions on the right side of Fig. 7 are shown after a cut on the shape observable chosen to maintain a signal efficiency of 50%.
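A ROC curve of this kind can be built by scanning an upper cut on the shape variable and recording the fraction of signal and background jets that pass. The sketch below is a minimal version with invented toy values; in the study described above, the jets would first be required to fall in the soft drop mass window. The function name `roc_points` is hypothetical.

```python
def roc_points(signal, background, cuts):
    """Return (background efficiency, signal efficiency) for each cut 'x < cut'."""
    points = []
    for cut in cuts:
        eff_s = sum(1 for x in signal if x < cut) / len(signal)
        eff_b = sum(1 for x in background if x < cut) / len(background)
        points.append((eff_b, eff_s))
    return points

# Invented toy values of the shape variable for jets already inside the mass window.
toy_signal = [0.2, 0.3, 0.4, 0.5]
toy_background = [0.4, 0.6, 0.7, 0.9]

curve = roc_points(toy_signal, toy_background, cuts=[0.0, 0.45, 1.0])
print(curve)
```

Low values of the variable are treated as signal-like here, matching the use of an upper cut on $\tau_2/\tau_1$ to select two-pronged jets.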

Case studies
Currently, the systematic uncertainties in extracting the efficiency are large (and usually dominant) sources of uncertainty in SM and BSM analyses at the LHC [46][47][48][49][50][51][52]. There are several places where the improved scaling behavior can reduce these systematics, in addition to the performance improvements in the ROC curves shown in Fig. 7. We will present two improvements: the preservation of mass sidebands in the kinematic fit to extract the W tagging efficiency from semileptonic $t\bar{t}$ events, and the overall background estimate in diboson analyses. Both cases take advantage of the flatter background distributions to improve the uncertainties in shape-based fits.

Preservation of mass sidebands
The shape of the jet mass spectrum is used in the LHC experiments to determine the W tagging efficiency; for instance, CMS relies on a simultaneous fit to the jet mass in events that pass and fail the $\tau_{21}$ selection (where $\tau_{21} \equiv \tau_2/\tau_1$). However, as shown in Fig. 1, the $\tau_{21}$ selection significantly sculpts the background distribution in this variable. This can lead to significant fitted uncertainties when extracting the background normalization, and thus directly translates to large uncertainties in the W tagging efficiency measurement. By using $\tau_{21}'$ instead, a significant improvement is observed.
To demonstrate this, we examine two cases: modified mass drop tagging with $\tau_{21} < 0.45$, and modified mass drop tagging with a scale-dependent selection $\tau_{21} < 0.6 - 0.08 \times \rho'$, where $\rho' = \log\left(m^2/(p_T\,\mu)\right)$. The latter translates into a cut on $\tau_{21}' < 0.6$. These selections have approximately the same signal efficiency. For simplicity, the same signal and background MC samples are used as in the previous sections, but the events are weighted with an easily specifiable fraction of background jets. In this case, the background fraction for the entire sample is 40%. This gives a fraction of merged to unmerged W bosons comparable to that in a semileptonic $t\bar{t}$ selection at 13 TeV at the LHC, but allows us to easily tune the fraction. In addition, to mimic the approximate detector resolution, the intrinsic resolution of the $W \to q\bar{q}$ system is smeared with a Gaussian of width 10 GeV. This is indicative of the resolutions obtained at the CMS and ATLAS experiments. Figure 8 shows simple fits to the jet mass for 5000 MC events in the range $50 < m_J < 120$ GeV, after a selection on the N-subjettiness variable. The model is a double Gaussian, one for the QCD continuum and one for the W mass peak. The jet $p_T$ range considered is $p_T$ = 300-400 GeV, to give a typical $p_T$ range of the W bosons from top quark decays in SM $t\bar{t}$ production. The first fit shows the modified mass drop algorithm after $\tau_{21} < 0.45$. The second fit shows the modified mass drop algorithm after $\tau_{21}' < 0.6$. The fits successfully capture the mass of the W and the input width of 10%.
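The statement that the scale-dependent selection is a flat cut on the decorrelated variable can be written out in one step. The following is a short worked equation, assuming the decorrelated variable is defined as $\tau_{21}' = \tau_{21} - M\rho'$ and that the coefficient 0.08 in the quoted selection corresponds to $M = -0.08$ (an inference from the quoted cut, not a fitted value stated in the text):

```latex
\begin{align*}
\tau_{21} < 0.6 - 0.08\,\rho'
  &\iff \tau_{21} + 0.08\,\rho' < 0.6 \\
  &\iff \tau_{21} - M\rho' < 0.6 \qquad (M = -0.08) \\
  &\iff \tau_{21}' < 0.6 .
\end{align*}
```

For illustration only, taking natural logarithms and $\mu = 1$ GeV, a jet with $m = 80$ GeV and $p_T = 350$ GeV has $\rho' = \log(6400/350) \approx 2.91$, so the scale-dependent threshold on $\tau_{21}$ is approximately $0.6 - 0.08 \times 2.91 \approx 0.37$.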
It is interesting to note that the jet mass of the QCD jets after the $\tau_{21}'$ selection is significantly pushed below 10 GeV. In addition, the remaining distribution is flat. However, for the standard $\tau_{21}$ selection, the distribution is rising, with significantly more background under the W signal peak.
The background uncertainty on the fit is 6% when using the standard $\tau_{21}$ selection. However, it is reduced by a factor of two, to 3%, by using the $\tau_{21}'$ selection. This is driven by the fact that the fitter can more easily handle sidebands that are flatter, so the $\tau_{21}'$ variable outperforms the $\tau_{21}$ variable in this metric.
This would translate directly into a decreased systematic uncertainty for the LHC experiments. While newer and more clever algorithms can achieve better performance in MC simulations, this does not always translate directly to improvements in actual analyses due to the need to characterize the systematic uncertainties. We therefore propose this test as an appropriate metric to characterize the systematic performance of new substructure algorithms.

Diboson Background Estimate
The diboson background estimate for the LHC experiments is much the same as the extraction of the W tagging efficiency, except that the background fraction is significantly higher. We have chosen a value of 80% (integrated over the entire spectrum of events) as an indicative fraction, with the same number of events (5000). We have considered two different $p_T$ ranges, $p_T$ = 500-600 GeV and $p_T$ = 1000-1100 GeV. One somewhat obvious but important point is that as the $p_T$ increases, the Sudakov peak from QCD-generated jets shifts further to the right. As this occurs, the fits to discriminate boosted W bosons from QCD-generated jets are less and less able to distinguish between the categories. Figures 9 and 10 show fits similar to those in Fig. 8. However, the background fraction is raised from 40% to 80% (again integrated over the entire mass spectrum), and the $p_T$ ranges are set to $p_T$ = 500-600 GeV and $p_T$ = 1000-1100 GeV, respectively.
For the range $p_T$ = 500-600 GeV, there is a clear and significant improvement with the $\tau_{21}'$ variable, where the background uncertainty decreases from 15% to 6%. This is even more apparent for the range $p_T$ = 1000-1100 GeV, where the uncertainty decreases from 23% to 6%.

Generalized Scale Invariance
Decorrelation schemes can be extended beyond a pair of variables to decorrelate classes of many variables. Such a procedure can be used to allow for a class of variables to be merged into a single multivariate analysis discriminator (MVA), while preserving decorrelation against one or a set of variables that are further used in the analysis. Consider, for example, building an MVA W tagger using both $\tau_2/\tau_1$ and $C_2^{\beta=1}$. Both of these variables have correlations with $p_T$ and mass, so the resulting classifier that combines the variables will also be correlated with mass and $p_T$. Decorrelating the space of variables against mass and $p_T$ before or during the construction of the MVA can thus preserve the mass and $p_T$ invariance, resulting in an uncorrelated tagger. This idea has previously been pursued in b-physics utilizing an MVA that minimizes the mass dependence, while simultaneously constructing a classifier [53].
To build an example based on the previously presented studies, we split $\rho = \log(m^2/p_T^2)$ into its components $\log(m)$ and $\log(p_T)$. Combining this with either $C_2^{\beta=1}$ or $\tau_2/\tau_1$ gives a class of three variables, which we decorrelate into a set of three independent linear combinations of variables. The independent variables can be viewed as properties of the data which span the space of distinctive features. This space can be explored to further understand the behavior of the data. Additionally, a subset of the independent components can be merged through an MVA while maintaining the decorrelation of the remaining set of variables. In this way, mass sidebands or other sideband methods can be used on the merged MVA discriminator with the decorrelated variable.
As has previously been noted, decorrelating variables which are not implicitly linearly correlated is poorly defined [54]. We thus consider two generalized approaches that attempt to decorrelate discriminators that are not necessarily linearly correlated: Principal Component Analysis (PCA) of transformed variables and Independent Component Analysis (ICA).

Decorrelation by PCA and ICA
Given that a set of variables need not be linearly correlated, we consider a transformed variable. For this transformation, we train a gradient boosted decision tree [55] with the boosted W boson as signal and a high-$p_T$ QCD jet as background. This transformation places the variables into a space that enables the possibility of linearized correlations of the original variables.
The resulting correlation matrix of the transformed variables can be decorrelated through principal component analysis by taking the eigenvectors of the matrix. This yields a set of n independent vectors for an n-dimensional correlation matrix.
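The eigenvector step can be sketched with NumPy. This is a toy illustration under stated assumptions: the three inputs are drawn from invented Gaussian distributions with a built-in linear correlation, whereas in the text the inputs are first passed through the decision-tree transformation; the function name `pca_decorrelate` is hypothetical.

```python
import numpy as np

def pca_decorrelate(samples):
    """Diagonalize the correlation matrix of samples with shape (n_events, n_vars).

    Returns the eigenvalues, the eigenvectors (as columns), and the samples
    projected onto the eigenvectors.
    """
    standardized = (samples - samples.mean(axis=0)) / samples.std(axis=0)
    corr = np.corrcoef(standardized, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)  # symmetric matrix
    return eigenvalues, eigenvectors, standardized @ eigenvectors

# Invented toy inputs standing in for (tau2/tau1, log pT, log m).
rng = np.random.default_rng(0)
log_m = rng.normal(size=20000)
log_pt = rng.normal(size=20000)
tau21 = -0.6 * log_m + 0.3 * log_pt + 0.5 * rng.normal(size=20000)
samples = np.column_stack([tau21, log_pt, log_m])

eigenvalues, eigenvectors, components = pca_decorrelate(samples)

# The projected components are linearly uncorrelated: their correlation
# matrix is the identity up to floating-point precision.
residual = np.corrcoef(components, rowvar=False)
print(np.round(residual, 6))
```

Each column of `eigenvectors` gives the coefficients of one linear combination of the inputs; PCA removes only linear correlations, which is the motivation for transforming the variables first.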
The decorrelated vectors for the triplet of transformed $\tau_2/\tau_1$, $\log(p_T)$, and $\log(m)$ are shown in Fig. 11. The correlation of the resulting vectors is compared with a gradient boosted decision tree using all variables and with the transformed mass. From this correlation, we observe two discriminating dimensions and the $p_T$. These we can write as
$$v_1 = \log(m/\mu_1) + K_1\,(\tau_2/\tau_1), \tag{6.2}$$
$$v_2 = \tau_2/\tau_1 + K_2 \log\left(\frac{m^{3.5}}{p_T\,\mu_2^{2.5}}\right), \tag{6.3}$$
where $K_{1,2}$ are coefficients and $\mu_{1,2}$ are scales, typically $\mu_{1,2} \sim 1$ GeV, introduced to make the arguments of the logarithms dimensionless. The first variable corresponds to the transformed mass and the second corresponds to the transformed $\tau_2/\tau_1$. The second variable is not too different from the $\rho'$-decorrelated $\tau_2/\tau_1$.
An alternative decorrelation approach, known as independent component analysis (ICA), involves diagonalization of the matrix constructed by computing the pairwise mutual information of each pair of variables on the sample of QCD jets. This differs from previous approaches, which rely on the mutual information with respect to truth. Here, we focus on identifying features in the data and not necessarily on discriminating power. We perform the ICA with an algorithm that uses k-nearest neighbors to expedite the diagonalization process (MILCA) [56]. The right panel of Fig. 11 shows the ICA decomposed vectors. As with the transformed PCA, the ICA decorrelates the $p_T$; however, the interdependence between the mass and $\tau_2/\tau_1$ is stronger than in the transformed case.
Finally, the equivalent decorrelated matrix for a combined set of observables is shown in Fig. 12; here we show just the transformed PCA approach. From the combined set, we observe that the largest orthogonal set of discrimination power comes from $C_2^{\beta=1}$ as opposed to $\tau_2/\tau_1$. When comparing the two approaches, we have found that the variable-transformed PCA yields a performance more consistent with our previous observations.

Conclusion and Outlook
In this note, we explore the scale-dependence and correlations of jet substructure observables. The goal is not only to improve the statistical power of such observables, which we also demonstrate, but also to consider practical issues related to using such observables in searches for new physics. In order to design decorrelated taggers (DDT), we transform the shape observable, here $\tau_2/\tau_1 \to \tau_{21}'$, by decorrelating it from groomed mass observables while also factoring in the $p_T$ scale-dependence. In addition to improving the statistical discrimination between signal and background, we also preserve a robust, flat background shape, which behaves more stably when the background is scaled from lower $p_T$ bins to higher $p_T$ bins. We demonstrate the advantages of such an approach in various case studies, such as predicting background normalizations and determining heavy object tagging scale factors related to new physics searches.
The intention of this note is not to perform a detailed study of all possible heavy object taggers, but instead to introduce further considerations when designing taggers and to propose a method by which all considerations can be addressed, namely observable decorrelation. We leave studies related to variations on jet mass groomers and shape observables, R-scaling, quark-gluon fractions, scaling background predictions, behavior at extremely high $p_T$, and top tagging to future work. We have also explored more generic determinations of observable decorrelation with complex taggers using multivariate techniques and numerical principal component analysis.