Jet substructure templates: data-driven QCD backgrounds for fat jet searches

QCD is often the dominant background to new physics searches for which jet substructure provides a useful handle. Due to the challenges associated with modeling this background, data-driven approaches are necessary. This paper presents a novel method for determining QCD predictions using templates — probability distribution functions for jet substructure properties as a function of kinematic inputs. Templates can be extracted from a control region and then used to compute background distributions in the signal region. Using Monte Carlo, we illustrate the procedure with two case studies and show that the template approach effectively models the relevant QCD background. This work strongly motivates the application of these techniques to LHC data.


Introduction
Analyzing the substructure of jets has proven useful for a multitude of new physics scenarios. A variety of substructure observables and techniques have been proposed [1][2][3], with applications ranging from boosted-object tagging [4][5][6][7] to high-multiplicity searches that take advantage of accidental substructure [8][9][10]. Simultaneously, tremendous progress has been made in modeling QCD at the analytical level and with Monte Carlo (MC) event generators . These developments have been crucial in supporting efforts to perform Standard Model measurements, as well as searches for new physics, at the LHC. However, given the complexity of the LHC environment, data-driven background estimates for QCD remain a central component of any jet substructure study; new methods for determining these backgrounds are thus of great interest.
There are three complementary reasons why jet substructure is useful for new physics searches at the LHC. The first is that large-radius "fat" jets [36] naturally accommodate JHEP05(2014)005 grooming methods like trimming [37], pruning [38], and mass-drop filtering [4]. This is a significant advantage in the hadronic environment of the LHC, where the presence of the underlying event and pile-up makes mass reconstruction difficult. Second, teasing out the signatures of highly boosted states that undergo hadronic decays is difficult to impossible without jet substructure techniques. Maximizing sensitivity to very boosted final states will become increasingly important as rising trigger thresholds make low-p T regions of phase space inaccessible. Finally, the fat jet paradigm provides an efficient way to separate signal from background in multijet events. For example, the total jet mass of an event (the sum of the individual fat jet masses) is a particularly useful discriminator [8] due to the simple fact that QCD does not tend to yield large-mass jets.
Given the many successes of the fat jet approach and the practical importance of improving methods for obtaining data-driven estimates for Standard Model rates, it is natural to focus on the specific case of the multijet background to fat jet searches. In [10,39], a novel approach to computing multijet backgrounds was proposed that relies on the assumption that the properties of individual fat jets are independent of one another and universal -this is motivated by the approximate factorization of QCD jets [40]. Under this assumption, jet properties can be measured in signal-poor samples (e.g., inclusive dijets) and then extrapolated to the signal (e.g., multijet) region. This paper provides a concrete procedure for making such an extrapolation.
Typically, data-driven techniques require identifying a set of approximately uncorrelated variables to define control regions. Any non-negligible correlations require correction factors that are estimated with MC. Consequently, if one wants to perform cuts on nontrivially correlated variables, such as the mass and other substructure properties of a jet, the background estimate is no longer data-driven. In contrast, the template methods introduced in this paper naturally incorporate these correlations. As a result, searches can be performed with more handles than previously possible and with less dependence on MC.
Any realistic data-driven proposal must assess the statistical and systematic uncertainties associated with the background estimate. By taking advantage of kernel smoothing techniques, we develop a framework in which the statistical uncertainties associated with the template method can be reliably addressed. We demonstrate the feasibility of our approach by applying it to two example mock analyses.
The rest of the paper is organized as follows. Section 2 provides a general overview of the method. Section 3 continues the discussion with a more detailed and technical description. Section 4 presents two MC case studies to demonstrate how the method might be deployed in a realistic analysis. Finally, section 5 presents our conclusions. An appendix is included that gathers together many of the technical aspects of this study. A review of kernel smoothing is included in appendix A. The technical details of our MC generation and processing are discussed in appendix B. Code for implementing the approach presented in this paper is available at http://sourceforge.net/projects/kernelsmoother/.
To keep track of the terminology associated with our proposal, new terms are emphasized with bold red typeface where they are first introduced.

JHEP05(2014)005 2 Substructure templates
The approach advocated here is different from typical data-driven methods. In broad strokes, the usual strategy is to 1. define a control region that is expected to be signal-poor; 2. use this region to validate/normalize MC calculations; 3. use the validated MC to predict the backgrounds in the signal region for which a given search has been optimized.
The foundation of our approach, in contrast, involves constructing a data-driven template that models the substructure of fat jets. This template is then used to dress a sample of multi-fat jet events. 1 The selection criteria are then applied to the dressed sample to yield a background estimate in the signal region.

Separating kinematics and substructure
The templatesρ are transfer functions derived from data that encode the statistical distribution of output variables as a function of the input variable(s). In the limit of infinite statistics,ρ converges to the true distribution ρ. The outputs are jet substructure observables, such as the jet mass, while the inputs are kinematic variables. Assuming that the input variable is the jet's transverse momentum p T , then P m p T =ρ m p T dm (2.1) is the data-driven estimate for the probabilityP that a jet with the given p T has a mass between m and m + dm. Here, " " distinguishes input from output variables on the right and left sides, respectively, and hatted variables refer to estimated quantities. Templates can be used to compute data-driven backgrounds through the following procedure: 1. determine a control region that is expected to be signal-poor; 2. map out templates using data in the control region; 3. generate a MC sample in the signal region, disregarding everything except for the distribution of the input variable(s); 4. convolve the templates derived from the control sample with the MC sample; 5. apply cuts to this data/MC hybrid dataset to obtain a data-driven background estimate. 1 Note that other examples of "dressing" have been implemented in the past. For example, in the context of charged Higgs boson searches, ATLAS makes a data-driven background estimate by taking a control sample with muons in the final state and replacing the muons with simulated taus, a technique that ATLAS refers to as "embedding" [41]. This is essentially the inversion of our recipe. CMS has also used a version of dressing in searches for boosted t t resonances. In particular, they rely on data in a control region to determine the properties of their top tagger, which is then propagated to the signal region [42,43].

JHEP05(2014)005
This article develops a concrete formalism for determining QCD backgrounds in this way, with particular care being paid to the statistical properties of the results. In more detail, the procedure is formulated as follows. First, a suitable control region is defined, which is referred to as the training sample. The fat jets in this sample are used to construct a histogram for the substructure observable(s) of interest as a function of kinematic observable(s). In order to obtain a continuous template from a limited number of fat jets, the histogram must be smoothed. This is done using kernel smoothing techniques, which are described in detail in section 3 and appendix A. The statistical error associated with the kernel smoothing procedure is folded into the final error estimate for the number of background events surviving a given set of cuts.
The next step requires the use of MC integration and is the analog of the extrapolation performed in other data-driven approaches. To begin, a kinematic sample is generated with event generator MC. The kinematic sample consists of events with N j fat jets. This sample is then used to extract the kinematic distribution of the fat jet background -e.g., It is worth emphasizing that any jet substructure information in the kinematic sample goes unused. Instead, each fat jet in the kinematic sample is dressed with substructure information using the template determined from the previous step; we refer to this as Monte Carlo integration, which results in the dressed sample. Lastly, cuts are performed on the dressed data set to obtain a background estimate and an associated smoothing variance. Note that this approach incorporates the correlations between the kinematics of the various fat jets into the final result. As shown in section 4 using two explicit mock analyses, this allows non-trivial correlations to propagate to the dressed sample and reproduces the predictions from MC, within statistical uncertainties. To summarize, the proposed data-driven strategy is (this parallels the enumeration above): 1. determine a control region to obtain a training sample; 2. train a templateρ; 3. generate a kinematic sample using MC; 4. perform the integration thereby convolvingρ with the kinematic sample; 5. apply cuts to the resulting set of dressed events to obtain a data-driven background estimate. Figure 1 provides a pictorial representation of the procedure. The computation of smoothing variance is more involved and we postpone a detailed discussion to section 3.

Using jet factorization
Let us now take a moment to discuss the assumption of jet template independence in more detail. This assumption leads to a dramatic reduction in the dimension of the multijet JHEP05(2014)005 phase space and allows for data-driven background estimates to be made with minimal statistical error. The jet properties we study here can be separated into two categories. Those in the first category, denoted by the d-dimensional vector x, are the outputs of templates. Examples include jet mass, N -subjettiness, angularity, and other jet substructure observables. The physics of these observables is largely driven by the parton shower and, as such, is set by the kinematics of the hard parton. Physically, the corresponding templates can be mapped out by taking a parton-level event and showering it many times. The d observables in x need not be independent of each other, and in most cases they are not. For instance, if a jet's mass is anomalously large, the same jet will have non-trivial N -subjettiness moments.
The second category consists of kinematic variables, denoted by k. In this paper, we only consider the jet p T , although in principle the set k could be augmented with other variables, such as the pseudo-rapidity of the jet or pileup conditions at the time of the event. These kinematic properties cannot be modeled as template outputs, since the kinematics of jets in an event are strongly correlated with one another.
An N j -jet differential cross section is given as a function of the fat jet kinematics k i and substructure observables x i for i = 1 to N j . It can be written as 2 For a given k i , we assume that the substructure observable x i of the i th fat jet is independent of all the other x j and k j , with j = i. Then the right-most term in eq. (2.3) splits into a JHEP05(2014)005 product of N j terms: This factorized form of the differential cross section allows for background estimates to be computed by a convolution of separate kinematic and substructure parts and forms the basis of our data-driven proposal. Note that the independence assumption underlying eq. (2.4) dramatically reduces the dimensionality of the configuration space that needs to be explored. Consider, for example, the case where d = 1 and k = p T . Then, instead of using MC to estimate the QCD prediction for a 2N j -dimensional configuration space, only a N j -dimensional configuration space needs to be explored. If multiple jet properties are being studied, d > 1, then the reduction in dimensionality is even more dramatic, with a (d+1)×N j dimensional configuration space reduced to N j dimensions. The additional cost involved is creating the (d + 1)-dimensional templates ρ( x p T ). Of course, as d increases (keeping the size of the training sample fixed), the density estimation that underlies the construction of substructure templates becomes increasingly difficult and at some point the associated uncertainties will outweigh any gains from the reduced dimensionality. This is the price to pay for not including any a priori model for correlations between the various substructure observables; however, these minimal assumptions allow us to remain datadriven. Soft QCD effects can violate the factorization in eq. (2.4). These include underlying event, pile-up, and O(N −2 c ) corrections coming from perturbative QCD due to large-angle soft radiation. Jet grooming can minimize these effects. Throughout this paper, the fat jets are trimmed [37], as described in detail in appendix B. In general, it is important to consider the conditions under which eq. (2.4) is most likely to hold -for example, as a function of the fat jet radius.
There are many additional subleading effects that spoil the factorization in eq. (2.4). The most important of these is the quark-gluon composition of jets in an event. If the composition of the jets changes dramatically between the training sample and the signal region, then anomalous features may be introduced. In practice, quark-and gluon-initiated jets often have broadly similar features in the signal region and the composition does not change dramatically in many samples of interest. However, the templates taken from the two leading jets do not describe the third and fourth jets well. The introduction of quark and gluon templates may resolve this issue, as will be explored in future work [44].
Note that the three-momenta of the fat jets in the dressed data are automatically conserved because they are taken from MC. However, energy is not conserved since the fat jet masses are determined stochastically -the total energy of each dressed event is different from that of the kinematic event. In practice, we do not find that this is an issue for the cases studied here. The violation of energy conservation changes the center-of-mass energy of the event, but not appreciably because the ratio m j /p T < 1. These effects lead to correlations between the masses of the jets; how to account for this in the template approach remains an interesting open question.

Implementing templates
This section provides a detailed description of the procedure for generating substructure templates and calculating cut efficiencies. It is accompanied by appendix A, which reviews the basics of the statistical procedures used here. If the reader is new to kernel smoothing methods, we recommend that he or she take a moment to review appendix A before proceeding.

Constructing templates
The jets in the training sample are used to build a histogram. Technically, this is all that is needed to build the templates. However, due to the finite size of the training sample, the histograms might not be adequately populated and data-smoothing techniques are necessary.
The statistical technique of kernel smoothing is one way to approach the problem. In this procedure, every data point in the training sample is smoothed out to the adjoining region in the multi-dimensional space. The aggregate of these smooth contributions gives the final probability distribution function. Take, as an example, the jet mass distribution (solid purple) shown in figure 2. This distribution is derived by looking at the two leading jets with p T > 200 GeV in the full MC sample. Now imagine that we want to reproduce this distribution using a statistically limited sample -say 1% of the total number of events in the full MC sample. Applying kernel smoothing to the statistically limited sample reproduces the full distribution, to within errors.

JHEP05(2014)005
Take an event with N j jets, each with particular kinematic and substructure properties described by the vectors k and x, respectively. For example, if we are interested in each jet's p T , mass, and N -subjettiness ratio τ 21 , then the i th jet (ordered by p T ) is described by k j i = (p T ) j i and x j i = (m, τ 21 ) j i . We unify k and x into the single D-dimensional vector z, where D is the number of kinematic and substructure variables of interest. Then, the i th jet in event j is described by To build a training sample for the i th jet, we combine the i th jet from all events in the sample. Said another way, the training sample for the i th jet, denoted as T i , is the set where N T is the number of events with an i th jet. The normalized histogram corresponding to T i encodes a discrete probability distribution function with D dimensions. This probability distribution function may or may not be smooth, depending on the total number of events N T . This is where kernel smoothing comes into play -it allows us to define a smooth probability distribution function in a statistically robust manner. In particular, we can define the smoothed template for the i th jet,ρ i z , usinĝ where z(J) is the value of z for jet J, K is the kernel, and h is a matrix describing the kernel width. The choice of kernel is not too critical and we simply adopt the canonical choice of a Gaussian: (3.4) The template generated by the kernel smoothing procedure, eq. (3.3), is simply a sum of Gaussians around each data point in the training sample. The degree to which this procedure appropriately smooths out the distribution depends on the matrix of widths, h ij . Thus, the goal is to choose the smoothing parameters such that the estimated distribution function,ρ z , is as close as possible to the true distribution function, ρ z . Selecting the width matrix h ij of the kernel is one of the most important aspects of this procedure because an over/under-smoothed distribution may not accurately represent the true distribution function. A common practice is to use the "Asymptotic Mean Integrated Squared Error" (AMISE) metric described in appendix A. This gives rise to "Silverman's Rule-of-Thumb" [45], where h ij is (3.5) Here, c is an O(1) constant andσ ij is an estimator for the square root of the covariance matrix of ρ. For distributions that are not vastly different from the normal distribution, this provides a useful starting point for determining the kernel width. All three panels show the result of applying kernel smoothing to the full MC sample; this serves as an approximation to the "true" distribution. 3 Overlaid are templates that have been generated using only 1% of the full sample in order to demonstrate how kernel smoothing approximates the true distribution for different values of the kernel width. 4 The width is varied from the optimal value c = c AMISE [middle, green] to an over-smoothed value c = 1.5 c AMISE [left, red], and an under-smoothed value c = 0.7 c AMISE [right, orange].

JHEP05(2014)005
The smoothing procedure introduces two competing sources of error in the template, bias and variance, whose magnitudes are set by the amount of smoothing. These are defined as follows: where . . . is the expectation value with respect to the true distribution ρ. The variance is interpreted as the scatter inρ z when estimated from different samples drawn from ρ z , while the bias is a measure of the systematic displacement of the estimated distribution from the true distribution as a consequence of smoothing.
As discussed in more detail in appendix A, the presence of bias is problematic and must be corrected for. Throughout, we use a corrected templateρ , which has substantially reduced bias. To computeρ ( z), all that is needed is an estimate for b( z). We adopt an estimatorb z that is formed by comparing the twice-smoothed templatê This estimator is valid to leading order in the kernel width [47], as shown in eq. (A.24). The corrected template is then defined aŝ It is this corrected template that will be used to make background estimates in the signal region. This procedure is justified in appendix A, where the corrected template is shown to give consistent results for a test probability distribution.

Dressing an event
The previous section showed how to construct a template from a training sample. With this template in hand, one can calculate the efficiency for a kinematic event in the signal

JHEP05(2014)005
region to pass a given set of cuts. We will now describe this calculation in detail. Note that throughout the rest of this paper, unless stated otherwise, our substructure templates are universal. That is, we do not make any distinctions between, say, the 1 st and 2 nd jets (as ordered by p T ) in the training sample. Rather, a single template is constructed making use of all the jets in the training sample. The corrected templateρ ( z) constructed in the previous section does not distinguish between the kinematic variables k and substructure variables x. However, we wish to view the substructure variables as functionally dependent upon the kinematic input variables. That is, we want to convert the joint probability distributionρ ( x, k) into the conditional probability distributionρ ( x k) usinĝ for each k. The templateρ ( x k ) is normalized so that it acts as a probability distribution function for the substructure information of a jet with given kinematics k. Given a multi-fat jet event in the signal region, obtained from MC, we can apply cuts to the first and second jets, with specified kinematics k 1 and k 2 , respectively. The efficiency, , that this event passes the cuts iŝ where C = C x 1 , x 2 is the region of the ( x 1 , x 2 ) configuration space selected by the cuts. For instance, if x is the mass of a jet, then is the cut requiring that the sum of the two individual jet masses is larger than M J . While the integrand of eq. (3.12) is a product of templates, the region of integration may be a complicated function of multiple variables.

Calculating errors
This procedure is trivially generalized to a set of kinematic events, thus yielding a background prediction with error bars. The most straightforward use of the template is to start with a sample of jets, of size N e , where only the kinematics are specified. A dressed MC sample may be created by dressing each event with the template multiple times. Said another way, the integral in eq. (3.12) is performed by Monte Carlo integration; for each kinematic event, the template is used to produce n I dressed events. It is sufficient to take n I ∼ O(10 3 ). Schematically, to dress events with two kinematic objects, one converts each MC event into n I dressed events where each dressed event, labeled by α, has fully specified jet substructure, i.e., specific values of x 1 α and x 2 α that have been drawn uniformly from their domain. Since the x i α are JHEP05(2014)005 drawn uniformly (and not from the template itself), each dressed event is also associated with a corrected weight w α and an uncorrected weight w α : The normalizations of w and w are irrelevant, because they cancel when calculating efficiencies (see eq. (3.19)). The corrected weight is used for the background prediction, while the uncorrected weight is used to estimate the total bias for the predicted number of events.
To determine the final number of events that survive the cuts for a given search, one first computes the sum of corrected weights for each kinematic event before and after cuts: where The estimate for the efficiency that a given event i passes the cuts iŝ Note that this is just the discrete version of eq. (3.12). These efficiencies are then summed to obtain the final background prediction. The predicted number of background events after cuts,N e [C], isN The statistical uncertainty in the estimateN e [C] is challenging to determine because of the implicit dependence ofN e [C] on the training sample. This leads to important correlations between the cut efficiencies,ˆ [C], for different kinematic events. A convenient technique for computing errors makes use of the bootstrap (see e.g., [47][48][49][50]). An ensemble of datasets (resampled from the training sample) is used to calculate the corresponding set of templates. This yields an ensemble of estimates forN e [C] whose variance is calculated directly. While this procedure is computationally expensive, it results in a robust estimate of the statistical uncertainty, as will be demonstrated in appendix A.
In more detail, for an ensemble of resampled datasets, a corresponding set of templates ρ b is computed, with b ranging from 1 to N BS . These, in turn, are used to compute an ensemble of background predictions . The smoothing variance is given by: for more details). To make sure this is the case, we estimate the size of the residual biasσ B inherent in the estimateN e [C] by comparingN e [C] toN e [C], where the latter estimate uses the uncorrected weights: with The contribution of each kinematic event, i, to the bias is the difference between the two . Note that this difference is a signed quantity, and the biases from different kinematic events may add (in)coherently. A measure of the total biasσ B iŝ The probability distribution function for the bias is unknown a priori and is in general not Gaussian. In order forσ V to be a reliable estimate of the statistical uncertainty inN e [C], we must haveσ B σ V . In practice, we choose a kernel width such thatσ B is a factor of 2 to 5 less thanσ V . Because bias correction is done at the template-level,σ B is an imprecise measure of the residual bias inN e [C]. Consequently,σ B is taken to be unsigned and can be interpreted as a systematic error.
In addition to the statistical fluctuation of the templates, there is an additional systematic error σ sys coming from the mismatch between the template in the control region and the associated template that would be derived from a background-only sample in the signal region. For the examples presented in the next section, we computeσ V andσ B explicitly. These results demonstrate that σ sys is subdominant to the statistical errors associated with the template procedure, validating the approximation that the templates are independent.

Measurement uncertainties
The preceding discussion made no mention of measurement uncertainties. If one were interested in comparing substructure templates extracted from data to analytical calculations, then the size and nature of any detector effects would be an important consideration in designing an appropriate template procedure. Indeed, the presence of these uncertainties introduces a number of complications (see e.g., [51]). In our case, templates are used to make background estimates that are then directly compared to data in the signal region. Consequently, the substructure variables x are implicitly taken to be the measured values; the templates are derived using the measured data and extrapolated to make a prediction for cuts on measured variables. Furthermore, as long as detector-smearing is smooth, the kernel width can be taken to be smaller than the inherent detector resolution (and it will be in the limit of large statistics, see eq. (3.5)). This is not a problem, since what is being estimated is the convolution of the underlying kinematic distributions with the detector response.

JHEP05(2014)005
The measurement uncertainties for the input variables k are relevant if the kinematics are being modeled by MC. In this situation, it is important that the kinematic sample be passed through a reliable detector simulator prior to the integration step so that all of the input and output variables correspond to quantities measured by the detector. On the other hand, if the signal contamination in the region of interest is small (in the absence of any substructure cuts), it might be possible to take the kinematic sample directly from data, in which case these issues can be avoided.

Applications
This section is devoted to illustrating the data-driven template approach for two different searches. The smoothing varianceσ V for 8 TeV LHC data is estimated and shown to be under control. As we will see, the MC and template predictions are in excellent agreement.
The first example is a search for high multiplicity signals (see section 4.1). Fat jets from signal processes should exhibit accidental substructure [8][9][10] that would be suppressed for QCD backgrounds. We demonstrate that the jet mass and N -subjettiness of the two leading jets of a four-jet sample can be predicted using templates derived from a three-jet sample. For a second case study, the template method is applied to an existing ATLAS search for pair-produced gluinos undergoing an R-parity-violating (RPV) decay to three light quarks each (see section 4.2). The observed correlation between the masses of the two leading jets is captured using the template approach.
Because we do not have access to actual LHC data, we rely on MC events. Weighted events were generated using a variation on the binning procedure outlined in [52]. Specifically, two-, three-, and four-parton QCD events were generated using Madgraph5 [53]. These events were subsequently showered using Pythia8 [54]; matching was done using the MLM procedure. The particle-level events were grouped into cells to simulate the finite resolution of the calorimeter. Trimming was applied to these pixels to reduce sensitivity to pileup effects (which we do not model), and N -subjettiness was computed using the FastJet-contrib plugin [55,56]. For a full description of the simulation framework, see appendix B.

High multiplicity signals
This section demonstrates the application of template techniques to the types of highmultiplicity searches advocated in [8][9][10]. This case study does not reproduce these searches in exact detail. Instead, we study a simplified version of the event-subjettiness search in [9] in which the substructure of only the two leading jets is considered. Despite the modification, this example is still complex enough to constitute a real test of the template methodology.
As a preselection, each event is required to have at least N j = 4, R = 1.0 anti-k T [57] jets with p T > 200 GeV. Cuts are placed on observables that depend on the fat jets' substructure (but not on their kinematics): the sum of jet masses where τ ij is the N -subjettiness ratio τ j /τ i [55,56]. Note that there is no impediment to replacing event-subjettiness by one of the subjet counting observables proposed in [10], although the discrete nature of the latter would require modifications to the kernel smoothing procedure.
Requiring four fat jets is already extremely efficient at reducing the Standard Model background [8]. An exclusive three-jet sample is therefore expected to be signal-poor, making it an ideal training sample. The two leading jets in each event are used to fill a 3-dimensional templateρ (m, τ 21 p T ).
A good set of coordinates for the template is: The log-transformed variables are appropriate given the logarithmic evolution of the strong coupling constant with respect to the fat jet p T and the form of the collinear singularity that governs the generation of mass in QCD jets. The bandwidth is chosen to produce a small total bias,σ B σ V , while keeping the errorσ V as small as possible. Figure 3 shows the resulting templates, while figure 4 compares the 3-jet template with the 4-jet prediction for a fixed p T slice. These figures demonstrate that the substructure variables for the two leading jets in the exclusive 3-jet  The error bands only include the template errors, computed using the bootstrapped ensemble of templates. The lower panels show the deviation between the templates, normalized to the three-jet template. and 4-jet samples are equivalent within 10% in the high-mass region and a broad range of τ 21 . We have checked that this holds for a wide range of p T choices.
With the template in hand, the next step is to dress a kinematic sample, i.e., the integration step. An inclusive 4-jet cut, with each R = 1.0 anti-k T jet required to have p T > 200 GeV, fully characterizes the kinematic sample. The leading and subleading jets (but not the 3 rd or 4 th jets) in the kinematic sample are then dressed with the substructure template. The result is a dressed 4-jet sample in which the leading and subleading jets are associated with values of τ 21 and fat jet mass m 1,2 .
Using the dressed sample, one can compute the fraction of dressed events passing the cuts. We apply a preselection that m j > 20 GeV for each jet, ensuring that the results are IR-safe [35]. Another preselection cut of p T > 250 GeV is also applied to ensure insensitivity to the template boundary. The resulting differential distributions for M J (with T 21 < 0.3) and T 21 (with M J > 250 GeV) are shown in figure 5. These distributions are in excellent agreement with the MC. Several additional cuts are summarized in table 1; the template predictions agree with the MC to within template errors.
Note that only the first and second jets in the kinematic sample were dressed in this case study. Empirically, we find that the template derived from the two leading jets in the exclusive three-jet sample accurately models the two leading jets in the four-jet sample. While the template for the third jet is qualitatively different, it does provide a good model of the third jet in the four-jet sample, although this statement suffers from large statistical uncertainties given the size of the MC sample. The difference in the third-jet template is driven in large part by the average quark/gluon content of the third jet. It is likely that by  incorporating quark and gluon information (extracted from MC) into the determination of the templates, the procedure could be generalized to incorporate the third and even fourth jets into the analysis. Exploring such a generalization is left for future work [44].

Boosted three-jet resonances
The second case study is based on the ATLAS search for boosted gluinos undergoing RPV decays to three light quarks [58]. The ATLAS analysis proceeds by clustering the event into fat R = 1.0 anti-k T jets and narrow R = 0.4 jets. Several search regions are considered, each with a combination of cuts on narrow jet quantities (multiplicity, scalar sum p T ) and fat jet quantities (multiplicity, jet masses). In addition, a substructure requirement of τ 32 < 0.7 is imposed on both fat jets in order to select on the expected three-pronged structure of gluino decays. The case study presented here mirrors the fat jet analysis,

JHEP05(2014)005
but ignores the narrow jet selections. Note that we compute N -subjettiness using the "min-axes" algorithm, whereas ATLAS used the "k T -axes" algorithm.
To proceed with a background estimate as outlined in the previous two sections, we must first define a training sample from which we can construct a substructure template. As a preselection on the MC sample, each event is required to have at least two fat jets with p T > 320 GeV. Because the signal region consists of events with two high-mass fat jets, the training sample is defined by requiring at least one low mass jet with m j < 140 GeV. This ensures that signal contamination in the training sample is small. That is, for every pair of leading and subleading fat jets (j 1 , j 2 ) the three-dimensional templateρ (m, τ 32 p T ) is filled with j 2 whenever m j 1 < 140 GeV. The procedure is then repeated for (j 2 , j 1 ). The template is formed using the coordinateŝ as in the previous study. The next step is to define a kinematic sample. In general, the natural way to define the kinematic sample is to use the same cuts as the signal region; apart from substructure cuts, the two samples are defined identically. In the present case, this corresponds to choosing a sample generated by MadGraph5 at √ s = 8 TeV, with each event required to have at least two fat jets each with p T > 350 GeV and no requirements on the fat jets' masses or values of τ 32 . In the integration step, both of the fat jets in the kinematic sample are dressed with the templateρ (m, τ 32 p T ). This results in a dressed data set in which each event consists of a pair of fat jets, with given values of p T , m, and τ 32 .
To assess the efficacy of the template approach, the dressed data set is compared to the full MC sample. One interesting test is to compare the correlations present in the m j 1 vs. m j 2 plane in the kinematic sample with the results from the dressed data set. Figure 6 shows that the resulting jet mass distributions are indeed reproduced by the template estimate. This figure also provides a good qualitative match with the relevant plot published by ATLAS [58]. The dominant correlations between the two jet masses are captured by virtue of the p T dependence of theρ template.
As in the previous study, the dressed data set can also be used to estimate yields in the signal region after imposing substructure cuts. The predicted cross sections are listed in table 2 for a selection of cuts. There is excellent agreement between the MC and template predictions. One expects the smoothing variance at the LHC to be reduced when a large dataset is available for training the templates. Note that as the training sample increases, statistical uncertainties decrease more slowly than the familiar 1/ √ N T , see eq. (3.5).

Conclusions
Recent years have seen a revolution in fat jet techniques that could allow for the discovery of new physics in signal regions that would otherwise have remained obscured by QCD backgrounds. Realizing the full potential of this statement assumes that we can reliably  Table 2. Expected number of background events for the RPV case study, assuming an integrated luminosity of 20 fb −1 . The amount of smoothing used for each background estimate is specified in the first column (in units of c AMISE ). The errors for the template estimates are given byσ V , computed using eq. (3.20). The template and MC predictions agree to within the calculated errors. σ B is not normally distributed and therefore cannot be simply combined withσ V . model the relevant backgrounds. This strongly motivates developing new methods for data-driven background estimates, particularly for the challenging QCD background.
An attractive feature of the fat jet paradigm is that it offers a natural division of phase space into inputs (kinematics) and outputs (substructure variables) as a consequence of jet factorization. This work takes advantage of this division to develop a novel method of estimating QCD backgrounds from data. The method makes use of jet substructure templates that are obtained from a control region and used to dress a kinematic sample in the signal region. The implementation chosen here depends on kernel smoothing techniques, which are used to obtain a statistical error for the background prediction. For the two case studies in section 4, the QCD MC was treated as mock data and the predictions from kernel smoothing were shown to match the MC predictions. This illustrates that the systematic error introduced by treating the two leading jets as independent and identical is subdominant to the statistical error.
It is worth contrasting our data-driven proposal with existing approaches, such as the "ABCD" method. Ideally, the ABCD method uses a highly uncorrelated pair of variables in order to define signal and control regions. If correlations are non-negligible, MC is used to calculate a correction factor that is folded into the background extrapolation. Thus we should consider two cases in comparing the two methods. In the first case, uncorrelated variables can be found, and ABCD yields a data-driven background estimate without relying on MC. Because the systematics associated with template methods and the ABCD method are complementary, each provides a valuable crosscheck on the background determination. In the second case, uncorrelated variables cannot be found, and any ABCD estimate is strongly dependent on MC. Template-based methods, however, may still be able to provide a data-driven estimate. This latter situation is expected to hold in many jet substructure searches of interest, e.g., for searches involving cuts on jet mass and additional substructure observables (because these observables are in general highly correlated). This paper is just the start of exploring applications of the template approach. One important area of study is how templates depend on the quark/gluon composition of jets. The techniques in this paper can be generalized to include separate templates for quarks and gluons, which can be viewed as a discrete input label. To explore how the templates depend on quark/gluon content, we generated two MC samples of pure-quark and puregluon dijets and created the associated templates, shown in figure 7. Once a 20 GeV mass cut is imposed on each jet (dashed gray line), the quark and gluon mass distributions are very similar. However, there is clearly information that could potentially be incorporated into the procedure. Using separate quark/gluon templates will be a crucial step in applying these methods to third (or higher) jets in the event, which tend to be gluon jets.
There are other directions that deserve further study. It will be important to understand how the template estimates depend on the fat jet radius, and the impact of overlapping jets and pile-up. While the jet factorization assumption facilitates straightforward data-driven estimates, it would be interesting to explore what happens when this JHEP05(2014)005 condition is relaxed. This might include systematically improving the template method with input from perturbative QCD calculations. Finally, the template technique can be generalized to many different new physics searches, with different kinematic inputs and substructure variables.
Template functions provide a powerful way to estimate backgrounds in regions where direct MC calculations are prohibitively hard. This allows extrapolation of both MC and data into signal-rich regions. A deviation would be the first sign of new physics or, at worst, an indication of the breakdown of jet factorization or novel QCD effects not included in MC. No matter the result, we will learn something from the application of template methods to data.

A A kernel smoothing primer
This appendix provides a detailed introduction to kernel smoothing and presents the key derivations that justify the procedures used in the main body of the paper. It is divided into three parts: appendix A.1 reviews the basic formulae and the main properties of kernel smoothing; appendix A.2 details the various algorithmic choices that underlie our template procedure; and appendix A.3 validates the procedure on a test probability distribution. We chose our particular prescription for its relative simplicity, computational efficiency, and the fact that it works well in MC. More sophisticated variations are possible, including for example, adaptive kernel smoothing (which uses a variable bandwidth) and boundary kernels (which improve performance near the edge of the domain).
A basic statistical problem is to estimate a probability distribution ρ( z) from a random sample of finite size. The simplest strategy is to bin the data into a histogram. Histograms, however, are intrinsically non-smooth, and the resulting estimates can be very sensitive to the choice of binning scheme, especially when the sample size is small. Furthermore, the statistical errors associated with histograms can be difficult to compute.
A better alternative is to use kernel smoothing techniques [45,47,59,60]. Kernel smoothing replaces a histogram with a probability estimateρ( z), where z denotes a Ddimensional vector that can include kinematic and/or substructure variables. The weighting function, or kernel K( z), must balance between the degree of locality and the smoothness of the resulting estimate.
Given a training sample with N T independent data points { z 1 , z 2 , . . . , z N T } that have been obtained from the true distribution ρ( z), one can obtain an estimateρ of the true JHEP05(2014)005 density ρ with the following master formula: Here, K is the smoothing kernel (often chosen to be a Gaussian) and h is a non-degenerate matrix that controls the smoothness of the resultingρ by changing the width of the kernel. For a Gaussian kernel, 5 (A. 2) The density estimate is normalized, ρ d D z = 1, as long as the kernel is normalized to unity, K h d D z = 1. After fixing K to be a Gaussian, one still has to choose the bandwidth matrix h, which should be large enough that the resulting estimate is reasonably smooth, but small enough that important features are not washed-out. As we will see below, the choice of h is typically more important than the choice of kernel, as it controls the size of the bias and variance. Choosing the best value of h is challenging because different measures result in different choices of h (see e.g., [61]). The following subsections present a concrete prescription for choosing h that leads to good statistical behavior.

A.1 Deriving the AMISE bandwidth
To select an optimal bandwidth h, one must specify a metric by which to measure optimality. A common choice is the Mean Integrated Squared Error (MISE), defined as the expectation of the square of the error term: where the notation . . . denotes the expectation value with respect to the true distribution; for any statistical quantity f z 1 , . . . , z N T one defines In eq. (A.5), the MISE is split into two terms. The first is identified as the bias b( z ) squared and arises from the fact that kernel smoothing systematically smooths out peaks and valleys. The second term yields the variance v 2 ( z ), which encodes statistical fluctuations about the mean estimate ρ( z) . Eq. (A.5) encapsulates the classic bias-variance trade-off

JHEP05(2014)005
in statistics: in order to reduce bias, one must reduce the amount of smoothing and hence increase the variance; to reduce the variance, one has to smooth out more features and hence increase the bias.
Computing the bias and variance is difficult because each depends explicitly on the true distribution ρ one is trying to estimate. Given a large sample size, it usually suffices to consider leading-order expressions in the asymptotic limit of large N T . As the number of data points approaches infinity, the optimal bandwidth will approach zero, h AMISE → 0. As a result, it is useful to expand the MISE around the limit N T → ∞ and h → 0 to derive an asymptotic expression for the optimal bandwidth, h AMISE . Consider the average estimate ρ( z) . In the large N T limit, the summation in eq. (A.1) becomes a continuum integral: Expanding the integrand to second order in h and writing out all the indices explicitly yields where a symmetric kernel with K(s) = K(−s) has been assumed. For the Gaussian kernel, which satisfies d D s s i s j K s = δ ij , the leading-order bias is The interpretation is that the smoothing bias is most pronounced at peaks and valleys where the second derivative is large. Local maxima (minima) are estimated to be lower (higher) due to smoothing. A similar calculation can be performed for the variance: In the first summation, the i = j terms yield N T integrals over K 2 h . For i = j, each of the N 2 T − N T integrals factorizes into the square of d D z K h z − z , as z i and z j are JHEP05(2014)005 independent random variables drawn from the same distribution. In the second summation, independence of the z i yields a further N 2 T terms d D z K h z − z 2 . Gathering everything together, one has Until this point, the computation is exact. To proceed further, consider the limit h → 0, where K h z → δ z . The first term dominates because it contains a factor of δ 2 z . Therefore, in the asymptotic limit: where ρ z − h s ρ z to leading order in the last line. The variance is directly proportional to the true density in the neighborhood of z. For a Gaussian kernel, the integral evaluates to d D s K 2 (s) = (4π) − D 2 . The leading-order variance in the asymptotic limit is then Armed with asymptotic formulae for the bias and variance, we can compute the Asymptotic MISE (AMISE) by plugging eq. (A.10) and eq. (A.17) into eq. (A.5): The optimal bandwidth is determined by minimizing the AMISE; this derivation yields a bandwidth h AMISE that is correct to leading order in N −1 T . To explicitly minimize this expression, it is useful to rewrite the bias. Using integration by parts, eq. (A.10) is rewritten as where S ijkl is a totally symmetric rank-4 tensor. Any such tensor can be diagonalized by an orthogonal transformation; it is possible to perform a coordinate transformation z → y such that S ijkl = 1 4 (δ ij δ kl + δ ik δ jl + δ il δ jk )(4π) − D 2 , where the normalization factor assumes ρ has been transformed into standard normal form.

JHEP05(2014)005
The minimum occurs when H ij is proportional to the unit matrix δ ij . The optimal bandwidth in the asymptotic limit is then identified as 6 (A.21) Even in the asymptotic limit, the expression for the optimal bandwidth depends on the true distribution ρ z via dy i dz j . This dependence cannot be removed without making further assumptions. For a Gaussian distribution, the required transformation is simply y i = σ ij z j , where σ ij is the square root of the covariance matrix. For more general distributions that are not too far from the Gaussian distribution, one expects the correct transformation z → y to roughly correspond to rotating away the correlations and scaling the width of the distribution to unity. Hence, the following choice of the bandwidth should be near optimal for most reasonable probability distributions: Here, σ ij is the square root of the covariance matrix computed from the distribution ρ( z), and c is an N T -independent O(1) constant; eq. (A.22) is known as Silverman's Rule-of-Thumb [45]. In practice, σ ij is replaced by an estimateσ ij from data. As we will see in appendix A.3, variance estimation is more reliable than bias estimation in practice. It is therefore advantageous to choose a value for c that deliberately undersmoothsρ, so that the bias is smaller than the variance and the error estimate can be trusted.

A.2 Practical kernel smoothing
This subsection presents the specific algorithms and formulae used in this paper. In the applications we consider, kernel estimatesρ z are typically constructed from very large samples, and it is not always feasible to keep the z i for each event in memory. 7 To limit memory usage, the dataset is first binned with a small bin size h b h, and then kernel smoothing is performed on the binned dataset. Additionally, to save computational time, the kernel estimate is precomputed at each bin location and linear interpolation is used to computeρ z for intermediate values of z.
More explicitly, let q denote the center of a given bin and let q z be the function that sends z to the center of the corresponding bin. Given a training sample { z i } of size N T , denote the number of z i in each bin q as n q . The master formula for kernel smoothing, eq. (A.1), then becomeŝ

JHEP05(2014)005
where κ is a normalization constant. Eq. (A.23) only definesρ at the center of each bin; to extend the domain ofρ to all values of z, multi-dimensional linear interpolation is performed. In principle, the constant κ can be set to normalizeρ z to unity; when computing cut efficiencies this constant drops out, so it can be ignored in practice.
Combined with the rule-of-thumb for bandwidth selection given in eq. (A.22), the calculation ofρ z is now straightforward. Estimating the biasb z and variancev 2 z is less trivial. One could in principle substituteρ z for ρ z into eq. (A.10) and eq. (A.17) and use the resulting estimates, which are correct to leading order in h. We find, however, that it is preferable to use a different approach, which we now describe. 8 The bias can be estimated by performing kernel smoothing twice and comparing the result with that obtained from performing kernel smoothing once. Specifically, This procedure is equivalent to eq. (A.10) in the asymptotic limit [47]. We find that eq. (A.24) yields relatively stable results in the test simulation discussed in appendix A.3.
Note that the kernel smoothing in eq. (A.24) is performed using the same algorithm outlined above. 10 We use the bootstrap [47][48][49][50] to estimate the variance. The flexibility of the bootstrap is particularly well-suited to our case, where the variance needs to be propagated through complicated kinematic cuts. For a given statistic (e.g., can be the efficiency for the cuts of interest) and estimatorˆ , the bootstrap relies on creating a set of resamples of the observed dataset. Each resampled dataset then yields a value forˆ , the distribution of which is used to measure properties of the estimatorˆ (e.g., its variance).
To compute the variance, one first converts a given dataset z i into a set of bin counts {n q }. One then fluctuates each element of the set n q a total of N BS times, thus creating N BS fluctuated datasets, denoted by n q b , with b ∈ {1, . . . , N BS }. Each element of n q b is sampled from the Poisson distribution: 11 n q b ∼ Poisson n q .
(A.25) 8 For example, because eq. (A.10) requires an estimation of the second derivative of ρ, it involves the additional complication of selecting a bandwidth appropriate for derivative estimation. 9 In principle, eq. (A.24) depends on two distinct bandwidths. In practice, we find that it is sufficient to take these to be equal as determined using eq. (A.22). It should be kept in mind, however, that different bias estimators are possible and may perform better on real data. 10 One might worry that binning the data yields additional sources of bias not accounted for in eq. (A.24).
As long as h b h and linear interpolation is applied, however, this additional bias is negligible. In practice, we find that h b as large as h/2 still yields a consistent estimate. Of course, it is preferable to take h b as small as allowed by memory limitations. 11 For NT data points zi the standard bootstrap procedure generates each resampled dataset by drawing exactly NT samples from the empirical distribution (i.e., from the observed dataset) with replacement. Our Poisson fluctuation procedure (which we choose for computational convenience) is not identical to the standard bootstrap, but in practice we find that it provides robust estimates of the variance. The accuracy of this prescription deserves further scrutiny in any future experimental study.

JHEP05(2014)005
For each fluctuated dataset, one can compute a corresponding ρ z b using the same binned kernel smoothing procedure as in eq. (A.23): where, as before, κ b is a normalization constant, and N b is the total number of events in the b th fluctuated dataset. Finally, we obtain an estimate of the smoothing variance: (A.27) In practice, N BS 100 is sufficient for a reasonable estimate of the variance.
For our application, it is important that density estimatesρ( z) be associated with a confidence interval. However, due to the presence of asymptotic bias (see especially the discussions in [49,50]), a naive application of the smoothing variance in eq. (A.27) might lead us to construct incorrect confidence intervals that are centered at ρ( z) instead of at ρ( z). There are two approaches to this problem, both of which can be used to ensure the construction of properly centered confidence intervals. The first is to deliberately undersmooth the data so that bias is negligible in comparison to the root variance. The second is to explicitly correct for the bias. In practice, we find that it is necessary to use both approaches simultaneously. Relying solely on undersmoothing is inadequate because background estimates rely on estimates of ρ( z) at a variety of different values of z; to ensure small bias over the entire range of relevant values of z would lead to too much undersmoothing. Relying solely on bias correction is inadequate because of the difficulty of constructing an accurate bias estimator. By making use of both approaches simultaneously, it is possible to mitigate the problems of asymptotic bias. In practice, for any specified cut, we choose a bandwidth such thatσ V 2σ B , whereσ B is a measure of the total bias ofN e [C], see eq. (3.23). It is worth stressing the need to pay careful attention to these issues, because they play a central role in determining the robustness of the final statistical uncertainties associated with the template procedure. Note that more sophisticated approaches that, e.g. make use of adaptive kernel smoothing or implement bias correction at the level of [C], may allow for an improved treatment of asymptotic bias. To introduce explicit bias correction, we define the bias-corrected density esti- Note that d D zρ z = 1 as long asb( z) is the difference of two probability distributions as computed in eq. (A.24). The resultingρ is not necessarily positive everywhere. We find that this does not cause any problems for the cut efficiencies computed in this paper, but one should be aware of this issue. Eq. (A.27) can also be used to define the bias-corrected variance via the substitution 12ρ Using the bias-correctedρ ( z), one can construct a bias-corrected estimatorˆ , which has substantially reduced bias. Note that is still a biased estimator, but the bias is now of higher order in the bandwidth h. The cost of reducing bias is an increase in the variance, which can be reliably computed with the bootstrap.
Given the ensemble of bias-corrected estimators, the corresponding statistical error is computed with the bootstrap. As discussed in section 3.3, each of the ρ b is used as an input to generate an ensemble of bootstrapped results N e [C] 1 , . . . , N e [C] N BS for a given set of cuts C. The variance of the final prediction is estimated using this ensemble: (A.30) This is to be distinguished from the variance of the templates themselves, v 2 ( z).
The next subsection presents a validation study showing that the bias correction and variance estimates outlined above are statistically robust and lead to sound confidence intervals.

A.3 Validation
This subsection presents a validation study for our particular implementation of kernel smoothing. The first test evaluates the accuracy of the estimatesρ( z) by applying kernel smoothing to a known one-dimensional probability distribution ρ(x) with x ∈ [0, 1]. The second test evaluates the data-driven procedure for estimating cut efficiencies by applying cuts to the joint probability distribution ρ(x 1 , x 2 ) = ρ(x 1 )ρ(x 2 ). Both tests use the distribution ρ(x) ∼ x(1 − x) cos(8x) + 1.5 , illustrated in figure 8.  Figure 9. The bias and variance for the first validation test. Left: b(x) is the true bias (blue), b(x) is the bias estimate for a single trial experiment (green), and b (x) is the estimated bias averaged over all trial experiments (red). Right: the corresponding results for the square root of the bias-corrected variance, v 2 (x).

For each experiment
and v 2 i (x) using the methods outlined in appendix A.2.
The resulting kernel estimates are illustrated in figure 8. The bandwidth is chosen with Silverman's Rule-of-Thumb with a moderate amount of undersmoothing (c 0.9 c AMISE and h 0.04 in eq. (A.22)). The bin width is h b = 0.0025 h. The bias-corrected version on the right shows substantial reduction in the average bias (i.e., red matches blue). The variance, however, increases as can be seen by the increased error of the sample estimate, shown in green.
The performance of the bias and variance estimators is shown in figure 9. The left panel shows systematic underestimation of the bias. The bias, however, is subdominant to the variance as a consequence of undersmoothing. Since the bias is generally of the correct sign and order-of-magnitude, bias correction leads to confidence intervals that have improved probability coverage (as we will see explicitly below). Compared to the bias, the variance can be more reliably estimated, as seen in the right panel. It is likely that the bias estimator can be improved, which would be an important goal for any experimental study.
The second test goes a step further and allows us to evaluate the statistical performance of our procedure for estimating cut efficiencies. In particular it constitutes a test of JHEP05(2014)005 The test is formulated as follows: 1. Do N trials = 5000 trial experiments, with each experiment containing N = 500 events, {x 1 , x 2 , . . . , x N }, drawn from ρ(x).

For each trial experiment
Note thatσ V i does not include a bias correction when constructing s i .
5. Finally, calculate the variance of the estimates {ˆ [C] i }: and unit variance. As can be seen in the figure, bias correction significantly improves the Gaussianity of s (which corresponds to the bias-corrected estimateˆ ) as compared to s (which corresponds to the estimateˆ ). For x cut = 1.4 and with h = 0.06 the bias is roughly a factor of 20 smaller than the variance, but the difference between s and s is still non-negligible. For x cut = 1.8, h needs to be significantly smaller with h = 0.015 in order to achieve a subdominant bias. The bias in this case is roughly a factor of two smaller than the variance, and the bias correction results in a sizable shift from s to s . For both cuts, bias correction improves the probability coverage of the confidence interval for (especially its centeredness) and thus justifies the error bars given for the background estimates in section 4.
Finally, the performance of the estimatorsσ B andσ V is shown in figure 11. As is apparent, variance estimation is much more reliable than bias estimation. The bias estimatorσ B (as opposed to the bias estimatorb( z) used to defineρ ( z)) need not be very accurate because it is only used to ensure that we are in a statistical regime where the bias is subdominant so that confidence intervals have good probability coverage. Note that the performance ofσ B illustrated in the figure is for the particular cut value of x cut = 1.4; the agreement betweenσ B and σ B varies as a function of x cut , although the overall level of agreement shown here is representative.

B QCD Monte Carlo
This section describes the simulated data utilized in our studies. Up to four final-state partons were simulated at the generator-level, which were then merged with a shower. The final hadron-level output was pixelated to model the finite granularity of the calorimeter.  Finally, these pixels were clustered into fat anti-k T jets with R = 1.0. This level of detail in the simulation is sufficient for demonstrating the viability of our data-driven technique. Note that if we were considering backgrounds for searches involving missing energy, it would be important to also simulate detector-smearing effects.
In more detail, parton-level QCD events were generated using Madgraph5 version 5.1.5.10 at √ s = 8 TeV and merged with the parton shower via the MLM "MadGraph style" prescription using PYTHIA version 8.1.76 [54]. To more accurately model the large phase space for multijet production at the LHC, we employed a variation of the weighting technique discussed in [52]. Specifically, events were binned in both jet multiplicity (2 exclusive, 3 exclusive, and 4 inclusive jets) and in H T , see table 3.
The post-PYTHIA hadron-level information was then passed to a code based on FastJet version 3.0.2 [65,66]. The granularity of the calorimeter was simulated by creating a grid of cells in the η-φ plane with size 0.1 × 0.1 for |η| < 2.5 and 0.2 × 0.2 for 2.5 < |η| < 4.5; any particles with |η| > 4.5 were discarded. For each event, the energy of all particles that fell within that cell was summed. This gave the energy of the cell for the given event. The energy and position of these cells were combined into light-like four-vectors that were used as inputs to the FastJet routines.
Next, the cells were clustered into R = 1.0 anti-k T jets. In order to reduce sensitivity to the impact of pileup, the jets were groomed using the trimming algorithm with f cut = 0.05 and R trim = 0.3. The fat jets used in our study are the output of this trimming procedure. N -subjettiness was computed using the FastJet package [55,56] with the 'min axes' algorithm and β = 1.

JHEP05(2014)005
One of the main goals of section 4 is to demonstrate that the error bars associated with the smoothing procedure are reasonably small. As discussed above, the width of the kernel scales with the number of events in the training sample, N T , and the total AMISE scales as N T −4/(D+4) (see eq. (A.20)). As a result, we must understand the size of the training sample available at the LHC in order to make realistic projections for the statistical uncertainties of the template procedure. For very inclusive triggers, the event rate can be so large that it is not possible to write all events that pass selection cuts to tape.
To deal with this, the collaborations perform prescaling, i.e. they record only a fraction of the events and reweight them to account for events that went unrecorded. Prescaling is relevant because the weighted events come with larger statistical uncertainties than the naive expectation of √ N .