1 Introduction

A hadronic jet is produced from an initial parton via a sequence of perturbative QCD branching interactions (the parton shower), followed by the non-perturbative conversion of partons to the hadrons we observe in experiments (hadronization). A Markov chain description of the parton shower suggests the spatial distribution of partons will exhibit some fractal character [1,2,3,4,5,6], and this will be inherited by the final hadron distribution (invoking local parton-hadron duality  [7]). However, true scale invariance of the hadron distribution within a jet is broken by the running of the branching probability, termination of the shower due to hadronization, and finite detector resolution. Here we define new observables to characterize jet branching structure, named Extended Fractal Observables (EFOs), which accommodate deviations from fractal structure through simple parametrizations. The idea is to apply box-counting techniques, used widely in the study of dynamical systems and scale invariant objects, to the substructure of QCD jets. Box counting has previously been employed in particle physics to calculate the fractal dimension of electromagnetic showers [8] for highly granular calorimetric reconstruction. Here, we extend the generality and information content of this technique in our characterization of QCD jets.

The motivation for this study is twofold. Firstly, we would like to characterize the spatial substructure of jets with a set of new observables. Secondly, we would like to demonstrate the use of such observables in the discrimination of quark and gluon jets. Quark and gluon discrimination has long been used as a tool to enhance the sensitivity of signatures with additional quarks [9,10,11,12]. In particular, Higgs production induced by weak boson fusion is enhanced due to the distinct signature of two additional hard quark jets in the gluon-dominated forward region of the detector [9, 13,14,15,16,17,18,19,20,21]. Quark and gluon tagging is also expected to be useful for physics searches beyond the Standard Model, including the detection of supersymmetric particles [22, 23]. Additionally, if well designed, these taggers can be further extended to the subjets of boosted boson signatures [24]. We demonstrate that modest improvements can be made to existing quark–gluon taggers by incorporating the new jet observables defined in this paper.

Finally, our construction of pixel-based jet observables resonates with the recent development of the jet image paradigm  [25, 26], in which the energy measured in each detector cell is interpreted as the intensity of a pixel in a 2D image. Within this approach, powerful machine-learning algorithms for classifying images have been brought to bear on a range of jet classification problems. This has included tagging boosted weak bosons [26, 27], boosted top quarks [28], and heavy-flavors [29, 30].

We define EFOs in the following section. In Sect. 3 we analyze the performance of these observables in quark–gluon discrimination, before concluding.

2 Extended fractal observables

The computation of the EFOs is performed on a jet by jet basis using a variation of the Minkowski-Bouligand (box-counting) dimension, as follows.

2.1 Variable definitions

To define our variables we implement a two-stage recipe: firstly, the jet cone is divided in the familiar \(\left( \eta ,\phi \right) \) angular coordinates into a square grid of cells, each cell having side-length \(\epsilon \). For a given scale \(\epsilon \), we count the number of cells \(N_{hits}\left( \epsilon \right) \) which register particle hits with a total transverse momentum greater than some pixel-level soft cutoff, in this study chosen to be \(p_{T}>1.0~\hbox {GeV}\). This low energy cut represents a limiting threshold due to detector resolution. This counting is iterated over a range of scales, as is illustrated in Fig. 1. The second stage is to fit smooth functions to the variation of \(y=\log N_{hits}\left( \epsilon \right) \) with \(x=\log \left( 1/\epsilon \right) \), and to extract the parameters of the fit as a set of (correlated) jet observables, which we call Extended Fractal Observables (EFOs). This is a generalization of the traditional box-counting method, in which only linear functions \(y=mx+c\) are fitted, with the gradient m identified as the fractal dimension [8].
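The first stage of this recipe can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `count_hits`, the use of NumPy, and the assumption that the constituent coordinates are given relative to the jet axis (so that the grid spans \([-R, R]\) in both directions) are all choices made here for the example.

```python
import numpy as np

def count_hits(eta, phi, pt, n_cells, radius=0.4, pt_cut=1.0):
    """Count grid cells whose summed pT exceeds pt_cut.

    eta, phi : constituent coordinates relative to the jet axis (assumed).
    n_cells  : number of cells per side, so the cell side is 2*radius/n_cells.
    """
    edges = np.linspace(-radius, radius, n_cells + 1)
    # Sum the transverse momentum falling into each cell of the grid.
    cell_pt, _, _ = np.histogram2d(eta, phi, bins=[edges, edges], weights=pt)
    # A cell registers a hit only if its total pT is above the cutoff.
    return int(np.count_nonzero(cell_pt > pt_cut))
```

Calling `count_hits` for a sequence of `n_cells` values gives \(N_{hits}\left( \epsilon \right) \) at each scale, which is the input to the fitting stage.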

Fig. 1

An illustration of the iterated box-counting procedure used to calculate fractal-based quantities on a set of points. The filled blue circles are the \(\left( \eta ,\phi \right) \) angular coordinates of the hadrons within a particular sample jet (in particular, this jet has total \(p_{T}=157~\hbox {GeV}\), and 30 constituent hadrons). The box-counting is illustrated for four sample scales, corresponding to successively finer \(\epsilon \) values of \(0.2, 0.1,0.067\,\text {and}\,0.05\). The cells registering particle hits are highlighted with red shading

Fig. 2

Left: logarithmic fits to \(\log N_{hits}\left( \epsilon \right) \) against \(\log \left( 1/\epsilon \right) \) for light quarks, bottom quarks, and gluons, of the form \(y=p_{0}+p_{1}x+p_{2}\log x\). The values of the fitted parameters \(\left\{ p_{i}\right\} \) define one possible set of Extended Fractal Observables. Right: fits to \(\log (N_{hits})\) against \(\log \left( 1/\epsilon \right) \) using an asymptotically saturating fitting function, specifically \(y=p_{0}+p_{1}\tanh (x-p_{2})\)

Fig. 3

Left: the ratio of \(\log (N_{hits})\) with respect to the quark values, for b-quarks and gluons, as a function of \(\log \left( 1/\epsilon \right) \). A linear fit is added for comparison. Right: the difference of \(\log (N_{hits})\) with respect to the quark values, for b-quarks and gluons. In the Modified Leading Logarithmic Approximation (MLLA), the differences in hadron multiplicity between quarks, b-quarks and gluons are predicted to be energy independent  [31]. The small but non-zero slopes in this plot reflect the fact that box-counting at a given angular scale probes spatial information in addition to the rate of splitting at the corresponding energy scale

Indeed, in Fig. 2 there is no distinct region of linear scaling, as would be needed to extract a fractal dimension. Rather, \(\log N_{hits}\left( \epsilon \right) \) levels off smoothly from large to small scales as saturation is approached, motivating a non-linear fit to extract whatever information this curve might encode about the jet. In particular, the hadronization region (i.e. at small \(\epsilon \)) carries non-perturbative information sensitive to the flavor of the jet. The observed curves are distinct between quarks, gluons and b-quarks, as summarized in Figs. 2 and 3. This scaling is a fundamental property of QCD resulting from the differences in the splitting of quarks and gluons. Further measurements of this scaling would allow for an alternative approach to extracting QCD properties such as the strong coupling constant [32, 33].

The generic plateauing curves in Fig. 2 can be fitted by almost any non-linear function (given a suitably restricted range in x), so we studied fit functions with at most three parameters, for speed and robustness of fitting. Fits were carried out simply by a binned \(\chi ^2\) minimization of the chosen function. Example fit functions included the following:

  1. logarithmic fits of the form \(y=p_{0}+p_{1}x+p_{2}\log x\).

  2. quadratic fits: \(y=p_{0}+p_{1}x+p_{2}x^{2}\).

  3. hyperbolic tangent fits: \(y=p_{0}+p_{1}\tanh (x-p_{2})\).

The values of the best fit parameters \(\left\{ p_{i}\right\} \) for each fitting function constitute three possible sets of EFOs. For a polynomial in \(x=\log (1/\epsilon )\), like the quadratic fit function, the fit reduces to a matrix inversion and thus has a well-defined convergence. The other two parametrizations are not polynomials, hence we perform a \(\chi ^2\) minimization.
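As an illustration of the second stage, the sketch below extracts \(\left\{ p_{i}\right\} \) for the logarithmic form. It is not the binned \(\chi ^2\) fit used in this study: it assumes equal uncertainties on every point and natural logarithms, and the function name `fit_log_efos` is hypothetical. It relies on the observation that \(y=p_{0}+p_{1}x+p_{2}\log x\) is linear in its parameters, so an unweighted least-squares solution can be obtained directly with NumPy.

```python
import numpy as np

def fit_log_efos(x, y):
    """Unweighted least-squares fit of y = p0 + p1*x + p2*log(x).

    x : values of log(1/epsilon), which must be positive.
    y : values of log(N_hits) at the same scales.
    Returns the array (p0, p1, p2).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix with one column per basis function: 1, x, log(x).
    design = np.column_stack([np.ones_like(x), x, np.log(x)])
    params, *_ = np.linalg.lstsq(design, y, rcond=None)
    return params
```

The hyperbolic tangent form is not linear in \(p_{2}\), so it would need an iterative minimizer instead.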

Functions which actually saturate, such as the hyperbolic tangent parametrization above, are more physically motivated because they can model the saturation itself (asymptoting to the jet multiplicity). However, for the range of box scales used in our study (of width \(\epsilon \ge 0.05\); see Sect. 2.2 below), and for all but the lowest \(p_T\) jets, the non-saturating fit functions also provide adequate models for the observed scaling. For the purpose of quark–gluon discrimination (see Sect. 3), the logarithmic fitting function was found to give the best discrimination performance of the three functions above (see Fig. 6 to compare the performance between the logarithmic and hyperbolic tangent fitting functions).

2.2 The range of box-counting scales

The range of angular scales \(\epsilon \) has been chosen by paving the jet cone with a square grid of \(N\times N\) cells, where the splitting scale N ranges in integer steps from 3 to 16. For each N, the angular scale is \(\epsilon =2R/N\), where R is the jet radius, in this study \(R=0.4\). The coarsest \(\epsilon \) scale chosen, corresponding to \(N=3\), is essentially the coarsest scale carrying potentially discriminating information (for \(N=2\) the jet cone would be divided into four quarters, all of which will register a hit for realistic jet shapes). The finest \(\epsilon \) scale chosen is \(\epsilon _{min}=0.8/16=0.05\), because this is approximately the angular detector resolution in both general-purpose LHC experiments, CMS and ATLAS [34, 35]. For the \(p_{T}\ge 100~\hbox {GeV}\) jets studied here, the number of hits is just beginning to saturate at this scale (see Fig. 2), so we are probing into the hadronization region prior to the flat plateau.
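The set of scales described above is small enough to write out explicitly. The following sketch (with the hypothetical helper name `box_scales`, and natural logarithms assumed for the fit variable) tabulates \(\epsilon =2R/N\) and \(x=\log \left( 1/\epsilon \right) \) for \(N=3,\ldots ,16\).

```python
import numpy as np

def box_scales(radius=0.4, n_min=3, n_max=16):
    """Return (N, epsilon, x) for an N x N grid paving the jet cone.

    epsilon = 2*radius/N is the cell side; x = log(1/epsilon) is the
    abscissa used in the fits (natural logarithm assumed here).
    """
    n_values = np.arange(n_min, n_max + 1)
    epsilons = 2.0 * radius / n_values
    return n_values, epsilons, np.log(1.0 / epsilons)
```

With the defaults this gives 14 scales, from \(\epsilon \approx 0.267\) at \(N=3\) down to \(\epsilon =0.05\) at \(N=16\); the four scales shown in Fig. 1 correspond to \(N=4, 8, 12\) and 16.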

Finally, we would like to highlight that these fractal-based observables are similar in spirit to calculating subjet rates of jets  [15, 36], given subjets clustered using the \(p_{T}\)-independent Cambridge-Aachen algorithm [37]. Both observables compute \(p_{T}\)-independent branching information on a succession of angular scales down to some threshold. And both observables perform what is essentially a further clustering on the substructure of the jet to extract this information pertaining to the branching history of the jet. In light of this, the EFO approach could be extended to utilize subjet counts (instead of hit grid cell counts) to assign scale-dependent multiplicities \(N(\epsilon )\).

2.3 Infrared and collinear safety

Preserving infrared and collinear safety ensures calculability in perturbative QCD. An observable is infrared (collinear) safe if its value is unchanged by the emission of soft (co-moving) particles. The EFOs, as defined in Sect. 2.1 with a pixel-level soft cutoff, are fully IRC safe.

Firstly, the box counting procedure is intrinsically collinear safe: if one particle splits into two particles with the same \(\left( \eta ,\phi \right) \) coordinates, we still count just one cell hit by both daughter particles, at any finite scale of probing. Hence collinear splittings will not affect the number of cells \(N_{hits}\left( \epsilon \right) \) to register particle hits at any choice of scale. On the other hand, infrared safety of the EFOs can only be engineered by imposing some low momentum cutoff to cleanse the jet of its soft constituents. However, this soft cutoff must be implemented consistently with collinear safety. If we simply discarded all soft hadrons with, say \(p_{T}<1~\hbox {GeV}\), this would spoil collinear safety. To see this, consider the following pathological example: if a particle with \(p_{T}=1.5~\hbox {GeV}\)  splits into two comoving particles with \(p_{T}=0.8~\hbox {GeV}\)  and \(p_{T}=0.7~\hbox {GeV}\), then both would be discarded by a particle-level soft cut, and so \(N_{hits}\left( \epsilon \right) \) would not be invariant under this collinear splitting.

This is remedied by defining a pixel-level (rather than particle-level) soft cutoff. That is, we only consider a cell to register a hit if it measures a total \(p_{T}\) greater than our soft cutoff of \(1~\hbox {GeV}\). This way, if the troublesome \(1.5~\hbox {GeV}\) particle in the example above splits collinearly into any number of daughters, the pixel still measures a total \(p_{T}\) of \(1.5~\hbox {GeV}\), and so registers a hit regardless of these splittings. Thus, box-counting with a pixel-level soft cutoff is fully IRC safe. In addition, a pixel-level rather than particle-level cut is more naturally realized experimentally since a pixel hit is consistent with an LHC detector cell.
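The pathological example can be checked numerically. The sketch below is purely illustrative: particles are represented by an integer cell index and a \(p_{T}\) value, and the two helper functions (names chosen here) implement the particle-level and pixel-level versions of the \(1~\hbox {GeV}\) cut.

```python
import numpy as np

PT_CUT = 1.0  # GeV

def hits_particle_cut(cell_index, pt):
    """Discard soft particles first, then count occupied cells."""
    cell_index = np.asarray(cell_index)
    pt = np.asarray(pt, dtype=float)
    return len(np.unique(cell_index[pt > PT_CUT]))

def hits_pixel_cut(cell_index, pt):
    """Sum pT per cell first, then count cells above the cut."""
    cell_pt = np.bincount(np.asarray(cell_index), weights=np.asarray(pt, dtype=float))
    return int(np.count_nonzero(cell_pt > PT_CUT))

# One 1.5 GeV particle, before and after a collinear splitting into 0.8 + 0.7 GeV.
before = ([0], [1.5])
after = ([0, 0], [0.8, 0.7])
```

Evaluating both functions on `before` and `after` shows that the particle-level count changes under the splitting, while the pixel-level count does not.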

Numerically, the performance of a quark–gluon discriminant built using the EFOs was found to be essentially insensitive to varying the value of this \(p_{T}\) cut (over values between \(0.1~\hbox {GeV}\) and \(1.0~\hbox {GeV}\)), suggesting the variables are not strongly shaped by the IR emission, at least in simulations. In the following section, a \(p_{T}\) cut of \(1~\hbox {GeV}\) is used throughout. Finally, we acknowledge that pixel-level cutoffs have been used previously in the context of jet images analyses (for example in  [25]) to ensure IRC safety in the same context.

3 Performance in quark–gluon discrimination

We now investigate whether these observables might be a useful new tool in the important and challenging problem of distinguishing light quarks from gluon jets.

3.1 Event generation and setup

In this study, we use QCD dijet samples at a center-of-mass energy of \(13~\hbox {TeV}\). Because previous quark–gluon studies have revealed that discrimination performance varies considerably between the different generators [9,10,11, 14, 38] (see footnote 1), we here produce and shower events (at leading order) using both Herwig++ (version 2.7.0 with tune UE-EE-5C) [39, 40] and Pythia 8 (version 8.185 with tune CUETP8M1) [41], with of order 150k events in each. Jets are clustered with the anti-\(k_{T}\) algorithm using the final state particles following showering and hadronization; a cone size of \(R=0.4\) and the FastJet code package [42] are used for the jet clustering. The EFOs (here computed using the logarithmic fitting function), along with a set of other established jet observables, have been computed for the highest \(p_{T}\) jet in each event. We define the flavor of that jet by matching to the highest-\(p_{T}\) parton within \(R<0.3\) of the jet axis, and classify the event as signal (background) if matched to a light quark (gluon) (see footnote 2).

Fig. 4
figure 4

Left: single variable performance ROC curves. The EFOs, minor axis, and \(p_{T}D\) are significantly more discriminating than multiplicity. The EFOs are most discriminating for high signal efficiency (\(\gtrsim 70\%\)), below which jet minor axis becomes most discriminating. Right: linear correlation coefficients between pairs of variables, for quark jets (the values are similar for gluon jets). We see only weak correlations between the EFOs and the three existing QGD variables

As a baseline for comparison, we shall consider the variables currently used by the Compact Muon Solenoid (CMS) quark–gluon tagger, which are [10]: (i) the total number of reconstructed particles in the jet (the multiplicity) [43]; (ii) the \(p_{T}D\) variable (\(C_{1}^{\beta =0}\)) [44],

$$\begin{aligned} p_{T}D=\frac{\sqrt{\varSigma _{i}p_{T,i}^{2}}}{\varSigma _{i}p_{T,i}}, \end{aligned}$$
(1)

where i sums over the constituents of the jet, which describes the distribution of transverse momentum between the particles in the jet; and (iii) \(\sigma _{2}\), the (\(p_{T}\)-weighted) semi-minor axis of the jet in the \((\eta ,\phi )\) plane [10], defined by

$$\begin{aligned} \sigma _{2}=(\lambda _{2} / \varSigma _{i}p_{T,i}^{2})^{1/2}, \end{aligned}$$
(2)

where \(\lambda _{2}\) is the smaller eigenvalue of the \(2\times 2\) symmetric matrix with components \(M_{11}=\varSigma _{i} p_{T,i}^{2} \varDelta \eta _{i}^{2}\), \(M_{22}=\varSigma _{i} p_{T,i}^{2} \varDelta \phi _{i}^{2}\), and \(M_{12}=-\varSigma _{i} p_{T,i}^{2} \varDelta \eta _{i} \varDelta \phi _{i}\). Throughout this study, we build multi-variable quark–gluon discriminants using a boosted decision tree (BDT), implemented using the Toolkit for Multivariate Analysis (TMVA) via adaptive boosting. The \(p_{T}\) of the quark and gluon samples are reweighted to match the exact same kinematics in both cases, so as to avoid selection biases induced by kinematic differences in the simulation.
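For concreteness, Eqs. (1) and (2) can be evaluated as in the following sketch. It assumes that \(\varDelta \eta _{i}\) and \(\varDelta \phi _{i}\) are the constituent offsets from the jet axis and are supplied by the caller; the function names are chosen here for illustration and are not taken from the CMS tagger code.

```python
import numpy as np

def pt_d(pt):
    """pTD = sqrt(sum_i pT_i^2) / sum_i pT_i, as in Eq. (1)."""
    pt = np.asarray(pt, dtype=float)
    return np.sqrt(np.sum(pt**2)) / np.sum(pt)

def sigma_2(pt, deta, dphi):
    """Semi-minor axis of Eq. (2) from the pT^2-weighted 2x2 matrix M."""
    w = np.asarray(pt, dtype=float) ** 2
    deta = np.asarray(deta, dtype=float)
    dphi = np.asarray(dphi, dtype=float)
    m11 = np.sum(w * deta**2)
    m22 = np.sum(w * dphi**2)
    m12 = -np.sum(w * deta * dphi)
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    lam = np.linalg.eigvalsh([[m11, m12], [m12, m22]])
    lam2 = max(lam[0], 0.0)  # guard against tiny negative round-off
    return np.sqrt(lam2 / np.sum(w))
```

A jet with a single constituent has `pt_d` equal to 1, and a jet whose constituents all lie along one line in the \((\eta ,\phi )\) plane has `sigma_2` equal to 0.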

3.2 Results

Fig. 5
figure 5

ROC curves for BDT discriminators constructed from various combinations of observables, as indicated by the legend, for events showered using both Herwig and Pythia with jet \(p_{T}\ge 100~\hbox {GeV}\). The discrimination is superior in Pythia. We see in both event generators that including the EFOs rather than multiplicity (which is used in the CMS tagger) yields a marginally better performance

Fig. 6
figure 6

Left: the relative gain for the three-variable taggers with respect to a baseline tagger using just \(p_{T}D\) and \(\sigma _2\), for the Herwig events (which yield more conservative discrimination estimates). The gain is also plotted for EFOs computed with the hyperbolic tangent fitting function specified in Sect. 2.1, for which the performance is worse. Right: for Pythia events. Note the wider range of the y-axis, to accommodate the larger gains found in Pythia

We first compare the discriminator performance of single variables and the correlations between them, before going on to compare multi-variable taggers built with and without inclusion of the new EFO observables.

We can measure discriminator performance by receiver operating characteristic (ROC) curves, which plot background rejection against signal efficiency. Roughly speaking, the more convex the curve, the better the performance. The left plot of Fig. 4, made using the Herwig samples, shows that the EFOs (see footnote 3) are individually well-discriminating, particularly if we seek high signal efficiency. Their performance is significantly better than that of the jet multiplicity variable.
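A ROC curve of this kind can be built from per-jet discriminator scores as sketched below. This is a generic illustration rather than the TMVA procedure used in the study: it assumes that larger scores are more signal-like (quark-like), and the function name `roc_points` is hypothetical.

```python
import numpy as np

def roc_points(signal_scores, background_scores):
    """Return (signal efficiency, background rejection) for each threshold.

    A jet is accepted if its score is >= the threshold. Signal efficiency is
    the accepted fraction of signal jets; background rejection is the
    rejected fraction of background jets.
    """
    sig = np.asarray(signal_scores, dtype=float)
    bkg = np.asarray(background_scores, dtype=float)
    thresholds = np.unique(np.concatenate([sig, bkg]))
    sig_eff = np.array([(sig >= t).mean() for t in thresholds])
    bkg_rej = np.array([(bkg < t).mean() for t in thresholds])
    return sig_eff, bkg_rej
```

Plotting `bkg_rej` against `sig_eff` gives a curve in the same convention as Fig. 4.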

The right plot of Fig. 4 presents the linear correlation coefficients (calculated using the TMVA toolkit) between the EFOs and the existing CMS quark–gluon tagger variables: multiplicity, \(p_{T}D\) and \(\sigma _{2}\). We also include a computation of the fractal dimension, which has been calculated from a linear fit over a small range of box scales. Strong correlations are present amongst the EFOs, as is natural given they are parameters derived from the same fit. However, their correlations with the other variables are no greater than 43% (for either quarks or gluons; see footnote 4). Interestingly, the EFOs are most highly correlated with \(\sigma _{2}\), not multiplicity as might have been expected. This evidence suggests the discrimination power of the EFOs is not simply a result of higher multiplicities in gluon jets, and therefore that the addition of these parameters to a quark–gluon discriminator might improve performance.

We find that replacing the multiplicity variable in the existing CMS quark–gluon tagger with the EFO variables yields a gain in discriminator performance, albeit only a modest one. This gain is seen using both Herwig and Pythia event generators (with the setup described above) in the ROCs presented in Fig. 5, which are for jets with \(p_{T}\ge 100~\hbox {GeV}\). We see the performance in Pythia is significantly better than Herwig for each combination of variables, consistent with previous studies [9,10,11, 14].

Moreover, the incremental gain upon replacing multiplicity with the EFOs is larger in Pythia than Herwig, so Herwig gives the more conservative estimate of the impact of including the EFOs. We see the gain in performance (relative to a baseline tagger using just \(p_{T}D\) and \(\sigma _2\)) more clearly in Fig. 6, with the left panel for Herwig and the right for Pythia. The gain is at the level of 1–2\(\%\) in the more conservative Herwig setup, and slightly larger in Pythia (note the different scaling of the y-axis). To emphasize a previous point, these gains were found to be stable across different values of the soft \(p_{T}\) cut. Finally, we investigated how the performance varies with energy scale, by performing the analysis in \(p_T\) bins of 50–100, 100–200, and 200–500\(~\hbox {GeV}\). Discrimination was found to increase with \(p_T\) in both Herwig and Pythia (see Fig. 7 for the Herwig results).

Fig. 7

Performance of a possible new quark–gluon tagger (using \(p_{T}D\), \(\sigma _2\), and the EFOs), in three \(p_T\) bins, for Herwig-generated dijet events. Quarks and gluons are found to be easier to distinguish using this tagger at higher \(p_T\)

Combining all four variables (multiplicity, \(p_{T}D\), \(\sigma _2\) and the EFOs) was seen to give no further improvement. This suggests all the information from multiplicity is captured by the EFOs (see footnote 5), while the converse is not true. In summary, we have presented evidence in this study that the Extended Fractal Observables provide an additional handle that captures the salient features of jet multiplicity, incorporates new information from showering and hadronization, and is also better behaved under IRC emission (see Sect. 2.3).

4 Conclusions

In this study we defined new jet observables, the Extended Fractal Observables, by a generalization of the box-counting method used in the study of fractal systems. Defined with a pixel-level low momentum cutoff, these observables are infrared and collinear safe. We have then sought to apply the EFOs to improve quark–gluon discrimination. At the generator level, we find some modest improvement in discrimination by gluon rejection when we replace multiplicity with the EFOs in the existing CMS tagger, across both Herwig++ and Pythia 8. Extending the performance of these new variables to include detector effects can naturally be performed in the LHC environment with the CMS Particle Flow algorithm [45] in conjunction with the PUPPI algorithm [46] to reconstruct particle candidates in the presence of high pile-up.

5 Outlook

This method of studying jet substructure is a new approach. As such, there are many directions in which we would like to proceed, including:

  1. Exploring particle hits in a 3-dimensional coordinate space spanned by \(\eta \), \(\phi \) and \(z^{-1}\), where z is the fractional transverse momentum of the jet constituent.

  2. Applying the EFOs beyond quark–gluon discrimination, for example to the identification of pile-up jets, or initial state radiation.

  3. These box-counting methods extend very naturally from the substructure of a single jet to a whole-event analysis. Such a novel approach may provide new insight into searches for new physics topologies such as those in supersymmetry or top quark pair production [47].

  4. Furthermore, box-counting analyses could provide a useful characterization of event shapes in heavy ion collisions, where studies of jet properties beyond jet reconstruction are traditionally difficult, but well motivated [48,49,50].

  5. Finally, we would like to emphasize that the calculation of EFOs on quark and gluon jets probes parton shower scaling that results from the QCD color factor ratio. Calculating EFOs on cosmic ray air shower profiles [51] could therefore help discriminate QCD-induced air showers from more interesting signals; of particular interest, showers induced by electroweak sphalerons. Experimentally, the calculation of EFOs in this air shower context is conceptually appealing: the 1660 individual Cerenkov detectors (spread over 3000 km\(^2\)) of the Pierre Auger Observatory in Argentina [52] would naturally function as the finest-scale cells in our box-counting algorithm. These techniques could therefore be useful in probing physics at energies far beyond that of the LHC.