Improved Extrapolation Methods of Data-driven Background Estimation in High-Energy Physics

Data-driven methods of background estimation are often used to obtain more reliable descriptions of backgrounds. In hadron collider experiments, data-driven techniques are used to estimate backgrounds due to multi-jet events, which are difficult to model accurately. In this article, we propose an improvement on one of the most widely used data-driven methods in the hadron collision environment, the "ABCD" method of extrapolation. We describe the mathematical background behind the data-driven methods and extend the idea to propose improved general methods.


Introduction
The Standard Model (SM) of particle physics is compatible with almost all of the measurements from particle experiments. In contrast to these successes on Earth, astrophysical measurements seem to imply the existence of an energy component that cannot be explained by the SM, which poses a serious challenge.
Despite theoretical and experimental efforts, there is no direct evidence that any of the proposed solutions is correct. Moreover, it is not clear what direction should be taken in order to resolve the problem. Particles predicted by viable extensions of the SM are already excluded up to masses of many TeV at the LHC [1]. It may turn out that these new states are massive enough to be beyond the reach of the LHC for direct production. However, this does not exclude the possibility that interesting physics is waiting to be found in rarer and more complicated final states. For example, we may have to entertain the possibility of exotic final states [2,3], where new states appear as a continuum above backgrounds rather than as a resonance. In either case, better accuracy of background estimation is necessary.
For many processes of interest, automatic calculations to next-to-leading order (NLO) in strong interactions are accessible in modern Monte Carlo event generators [4]. However, even at NLO, theoretical uncertainties are larger than statistical uncertainties for many processes at the LHC. Moreover, as the number of final-state hadronic jets increases, the accuracy of even NLO calculations decreases [5]. Parton showering, hadronization, and underlying events have a smaller effect on the theoretical uncertainty, but are nevertheless not negligible.
To reduce the uncertainties related to background estimation, various data-driven estimation methods can be employed. Data-driven methods make use of the data in a "background"-dominated control region (CR) to estimate background contributions in the "signal" region (SR), where interesting events may be found. Interpolation using side-bands is the canonical method. In analyses involving hadron collision data, we often employ a data-driven method of extrapolation called "ABCD." It should be noted that data-driven methods do not entirely exclude the use of simulated data. In this article, we review the main idea behind data-driven methods and then extend it to find an improvement for the extrapolation method.

Data-driven methods of background estimation
The concept of estimating backgrounds from the data itself is nothing new. Important discoveries in the history of particle physics would not have been possible without such estimations, given that the underlying theory of particle interactions was not very well known or had large uncertainties [6]-[10].
While there are many ways that data-driven methods can be divided, in this article, we will group them into two categories. In the first category, there are data-driven methods that use interpolations from the measurements performed on the side-bands. These methods are used when we look for a new particle state in a restricted range of kinematic phase space (usually mass). In the second category, there are methods we use when straight interpolations are difficult to employ. The methods that use extrapolations based on information in signal-depleted regions fall in this group. An extrapolation method, called the "ABCD" method, is often used in hadron collider experiments, where predictions of multijet production processes have large uncertainties [11,12,13]. More complicated analyses could involve combinations of the two categories.

Interpolation methods
We briefly review the interpolation methods, which will give us ideas on how to extend and improve extrapolation methods. In an interpolation method, measurements are performed in the side-bands or CRs that surround the SR and the information is combined to estimate the backgrounds in the signal region. In the absence of other information, the minimal assumption is that the background would have a smooth distribution.
Let us take a one-dimensional example. Without loss of generality, we may assume that the signal region is $x_0 \sim x_0 + \Delta$. The number of background events in this region, for a background distribution described by $f(x)$, may be expressed as
$$F(x_0) \equiv \int_{x_0}^{x_0+\Delta} f(x)\,dx.$$
Let us take a simple side-band of equal width on either side of the signal region. The backgrounds in the left (right) side-band are $F(x_0-\Delta)$ ($F(x_0+\Delta)$), respectively. If we assume that the series expansion is valid, we can then express the entries in the side-bands as
$$F(x_0 \pm \Delta) = F(x_0) \pm \Delta\,F'(x_0) + \frac{\Delta^2}{2}F''(x_0) + O(\Delta^3).$$
From the two side-bands, the best estimate of $F(x_0)$ is obtained by taking the average of the two:
$$\hat{F}(x_0) = \frac{1}{2}\left[F(x_0-\Delta) + F(x_0+\Delta)\right] = F(x_0) + O(\Delta^2),$$
which is a well-known result. For a background whose distribution is of the form $f(x) = ax + b$, the answer is exact. However, for a shape that has higher-order terms, this approximation may not be enough. If we allow two side-bands on each side, the terms proportional to $\Delta^2$ can be eliminated.
The best estimate from two equal-width side-bands on each side is
$$\hat{F}(x_0) = \frac{1}{6}\left[-F(x_0-2\Delta) + 4F(x_0-\Delta) + 4F(x_0+\Delta) - F(x_0+2\Delta)\right],$$
which is accurate for a background distribution $f(x)$ that is locally a cubic function. One can easily understand this: with one side-band on each side, we can fit a line through the two measurement points for interpolation, and thus find the linear function exactly. With two side-bands on each side, we have four measurements, and can therefore fit a cubic function for interpolation. A similar idea can be adapted to a case with more than one dimension. Let us consider a rectangular signal region in $(x, y)$ space between $x_0 \sim x_0 + \Delta_x$ and $y_0 \sim y_0 + \Delta_y$. Altogether, we can use 8 side-bands: four on the sides of the rectangle and four on the corners. Without any prior knowledge of the background distributions, and using similar arguments as before, the best estimate for interpolation is
$$\hat{F}(x_0,y_0) = \frac{1}{2}\left[F(x_0-\Delta_x,y_0) + F(x_0+\Delta_x,y_0) + F(x_0,y_0-\Delta_y) + F(x_0,y_0+\Delta_y)\right] - \frac{1}{4}\left[F(x_0-\Delta_x,y_0-\Delta_y) + F(x_0-\Delta_x,y_0+\Delta_y) + F(x_0+\Delta_x,y_0-\Delta_y) + F(x_0+\Delta_x,y_0+\Delta_y)\right].$$
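The one-dimensional side-band interpolation can be checked numerically with a short sketch. The function names below are hypothetical, and the four-side-band weights (-1, 4, 4, -1)/6 are taken here as the cubic (Lagrange) interpolation weights through four equally spaced points, which is an assumption of this sketch about the intended combination. Polynomial backgrounds are used only so that the window integrals are exact.

```python
from numpy.polynomial import Polynomial


def make_window_counts(f: Polynomial, width: float):
    """Return F(x0) = integral of f over [x0, x0 + width] as a callable."""
    antiderivative = f.integ()
    return lambda x0: antiderivative(x0 + width) - antiderivative(x0)


def estimate_two_sidebands(F, x0: float, width: float) -> float:
    """Average of one side-band on each side of the signal window."""
    return 0.5 * (F(x0 - width) + F(x0 + width))


def estimate_four_sidebands(F, x0: float, width: float) -> float:
    """Cubic interpolation from two side-bands on each side (assumed weights)."""
    return (-F(x0 - 2 * width) + 4 * F(x0 - width)
            + 4 * F(x0 + width) - F(x0 + 2 * width)) / 6.0


# Hypothetical backgrounds: f(x) = 3 - 0.5 x and f(x) = 1 + 2x - 0.5x^2 + 0.1x^3
width = 1.0
F_linear = make_window_counts(Polynomial([3.0, -0.5]), width)
F_cubic = make_window_counts(Polynomial([1.0, 2.0, -0.5, 0.1]), width)

x0 = 2.0
print("linear, truth vs 2 side-bands:", F_linear(x0), estimate_two_sidebands(F_linear, x0, width))
print("cubic,  truth vs 2 side-bands:", F_cubic(x0), estimate_two_sidebands(F_cubic, x0, width))
print("cubic,  truth vs 4 side-bands:", F_cubic(x0), estimate_four_sidebands(F_cubic, x0, width))
```

For the linear background the two-side-band average reproduces the window integral, while for the cubic background only the four-side-band combination does, which mirrors the fit-a-line versus fit-a-cubic argument in the text.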

"ABCD" extrapolation methods
In background estimation using interpolation methods, the signal is completely surrounded by CRs that provide strong constraints. These methods are useful if the signal is localized. However, in searches for new physics signatures at large energies, the signal of interest is expected to populate regions of higher energy, mass, or jet multiplicity. In these cases, measurements based on the signal-depleted CRs must be extrapolated to the SR. We introduce the notation to be used for the extrapolation methods. We can use the extrapolation methods of background estimation if the distribution $f(x, y)$ is mostly factorizable in $x$ and $y$:
$$f(x, y) = P_x(x)\,P_y(y)\left[1 + \epsilon(x, y)\right],$$
where the non-independent component is contained in $\epsilon$. We assume that the non-independent part is small, $|\epsilon| \ll 1$. Then the integral over a rectangular region is mostly factorizable as well.
$$F(x_0, x_1, y_0, y_1) = S_x(x_0, x_1)\,S_y(y_0, y_1)\left[1 + \Sigma(x_0, x_1, y_0, y_1)\right],$$
where $\Sigma$ is the average value of $\epsilon$ over this range and depends on the amount of dependence between the two variables, $x$ and $y$. $S_x$ ($S_y$) is the integral of $P_x$ ($P_y$) in the range $x_0 \sim x_1$ ($y_0 \sim y_1$), respectively. For a fixed-width window, $x_1 = x_0 + \Delta_x$ and $y_1 = y_0 + \Delta_y$, $F$ is a function of $x_0$ and $y_0$, so we can omit the arguments $x_1$ and $y_1$ as
$$F(x_0, y_0) \equiv F(x_0, x_0 + \Delta_x, y_0, y_0 + \Delta_y).$$
An estimate of $F(x, y)$ is obtained by taking suitable products of the $F$s in the neighboring regions as
$$\hat{F}(x, y) = \frac{F(x - \Delta_x, y)\,F(x, y - \Delta_y)}{F(x - \Delta_x, y - \Delta_y)} = F(x, y)\left[1 + O(\Delta_x \Delta_y) + O(\Delta^3)\right],$$
where the $\Delta$'s stand for either $\Delta_x$ or $\Delta_y$. The $\Delta_x \Delta_y$ term would vanish if $\epsilon(x, y) \to 0$. Therefore, the error of the estimation depends on the degree of non-independence of $x$ and $y$. In this derivation, we do not assume that $S_x$ ($S_y$) varies slowly as a function of $x$ ($y$), respectively, but only that $\Sigma$ varies slowly enough that the series expansion is valid.
The method is often referred to as the "ABCD" method (Eq. 11) or matrix method. In an ABCD method, the two-dimensional phase space is divided into four regions, one of which is the SR, and the neighboring three regions are the CRs. The choice of the two control variables used for this purpose depends on the physics case of interest, but they should be as independent as possible. In hadron collision experiments, such extrapolation methods are used to estimate the backgrounds in a variety of settings. Usually, the signature of interest is expected at high energies or large particle multiplicities, so the interpolation methods cannot be used. It is in this regime that the need for these methods arises, because of large theoretical or experimental uncertainties in predictions using simulations or calculations. The data-driven approach can bypass many of these difficulties.
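The origin of the $\Delta_x \Delta_y$ error term can be made explicit with a short expansion. The following is a sketch rather than the original derivation: it assumes fixed-width windows, writes $\Sigma(x, y)$ as shorthand for $\Sigma(x, x+\Delta_x, y, y+\Delta_y)$, places the neighboring regions at $x - \Delta_x$ and $y - \Delta_y$, and introduces $g \equiv \ln(1+\Sigma)$ purely as a notational convenience.

```latex
\begin{align*}
F(x,y) &= S_x(x)\,S_y(y)\,e^{g(x,y)}, \qquad g(x,y) \equiv \ln\left[1+\Sigma(x,y)\right],\\
\ln \hat{F}(x,y) &= \ln F(x-\Delta_x,y) + \ln F(x,y-\Delta_y) - \ln F(x-\Delta_x,y-\Delta_y)\\
&= \ln\left[S_x(x)\,S_y(y)\right] + g(x-\Delta_x,y) + g(x,y-\Delta_y) - g(x-\Delta_x,y-\Delta_y)\\
&= \ln F(x,y) - \Delta_x\Delta_y\,\frac{\partial^2 g}{\partial x\,\partial y} + O(\Delta^3),\\
\hat{F}(x,y) &= F(x,y)\left[1 - \Delta_x\Delta_y\,\frac{\partial^2 \ln(1+\Sigma)}{\partial x\,\partial y} + O(\Delta^3)\right].
\end{align*}
```

The factors $S_x$ and $S_y$ cancel exactly in the ratio, so only $\Sigma$ needs to vary slowly, and the leading correction vanishes when $\Sigma$ is constant, in particular when $\epsilon \to 0$.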
The information from the three CRs, A, B, and C, is used to estimate the backgrounds in the signal region, D (Fig. 1). Generally, we can express the estimate of $F_D$ as
$$\hat{F}_D = \frac{F_B\,F_C}{F_A}\left[1 + O(\Delta^2)\right],$$
where the $\Delta$'s are either $x_1 - x_0$, $x_2 - x_1$, $y_1 - y_0$, or $y_2 - y_1$. When $x_2$ and/or $y_2$ is taken to infinity, the expansion is, in general, not valid unless $\Sigma = 0$, since $\Delta \to \infty$. However, even if $\Sigma \neq 0$, the expansion could still be valid under certain conditions. For the case $x_2 \to \infty$, if the distribution $P_x(x)$ falls sharply as $x$ increases, then Eq. 12 could still be valid. Recalling that $\Sigma$ is the average value of $\epsilon$ in the given region, we have $\Sigma(x_1, x_2, y_0, y_1) \approx \Sigma(x_1, x_1 + \delta_x, y_0, y_1)$, so the value of $x_2$ is not as relevant, since the data are distributed heavily towards lower values of $x$. Under these conditions,
where $\Sigma_i$ ($\Sigma_{ij}$) is the partial derivative with respect to the $i$th argument ($i$th and $j$th arguments), respectively, and the $\Delta$s are either $\Delta_{x1}$, $\Delta_{y1}$, or $\Delta_{y2}$. In summary, with the ABCD method, measurements in three regions neighboring the SR can be used to give a description accurate to $O(\Delta^2)$, given that the correlation between $x$ and $y$ is weak and the distribution falls sharply in $x$ and $y$.
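A minimal numerical sketch of the ABCD estimate follows. Everything specific in it is an assumption made for illustration: the region layout (A is taken as the low-x, low-y region, B and C as the two regions adjacent to D, and D as the high-x, high-y region), the bin edges, and the toy density exp(-x-y)(1 + alpha x y), which is not the distribution used in this article. The integrals are evaluated with a simple midpoint rule.

```python
import numpy as np


def rect_integral(density, x_lo, x_hi, y_lo, y_hi, n=400):
    """Midpoint-rule integral of density(x, y) over a rectangle."""
    dx = (x_hi - x_lo) / n
    dy = (y_hi - y_lo) / n
    x = x_lo + (np.arange(n) + 0.5) * dx
    y = y_lo + (np.arange(n) + 0.5) * dy
    xx, yy = np.meshgrid(x, y, indexing="ij")
    return float(np.sum(density(xx, yy)) * dx * dy)


def abcd_estimate(f_a, f_b, f_c):
    """Product-ratio estimate of region D from the three control regions."""
    return f_b * f_c / f_a


def toy_density(alpha):
    """Hypothetical density: factorizable part times a weak correlation term."""
    return lambda x, y: np.exp(-x - y) * (1.0 + alpha * x * y)


def compare(alpha, x_edges=(0.0, 1.0, 2.0), y_edges=(0.0, 1.0, 2.0)):
    """Return (ABCD estimate, true value) for region D under the assumed layout."""
    x0, x1, x2 = x_edges
    y0, y1, y2 = y_edges
    f = toy_density(alpha)
    f_a = rect_integral(f, x0, x1, y0, y1)  # low x, low y
    f_b = rect_integral(f, x0, x1, y1, y2)  # low x, high y
    f_c = rect_integral(f, x1, x2, y0, y1)  # high x, low y
    f_d = rect_integral(f, x1, x2, y1, y2)  # high x, high y (signal region)
    return abcd_estimate(f_a, f_b, f_c), f_d


for alpha in (0.0, 0.1, 0.5):
    estimate, truth = compare(alpha)
    print(f"alpha={alpha}: estimate={estimate:.5f}, truth={truth:.5f}")
```

With alpha = 0 the density factorizes and the estimate agrees with the true value; as alpha grows, the estimate drifts away from the truth, which is the effect of the non-independent component described above.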

Improving the data-driven extrapolation method
As was the case with interpolation, it is possible to improve the accuracy of extrapolation methods by including more CRs. We derive several new analytic results and provide some case studies to demonstrate their efficacy.

Extended ABCD methods
We assume that the SR is x > x 0 and y > y 0 (Fig. 2) and that the joint distribution in x and y is mostly factorizable. Then we can express the number of entries in the SR as x 0 ] and similarly in y, the accuracy can be improved as where ∆ stands for either ∆ x or ∆ y . With fixed-width CRs, terms up to ∆ 2 can be exactly canceled. Therefore, the effects of correlations among variables on the prediction are mitigated as well. In the appendix, we give an explicit expression for Eq. 14.
We can extend the idea further by using information in eight CRs (Fig. 2), where it is possible to get accuracy of order $O(\Delta^4)$:
However, having more CRs does not always result in reduced error. Since the method involves multiplication and division operations, statistical uncertainties due to the finite number of entries in each CR directly affect the uncertainty of the prediction. From practical considerations, it may be desirable to have fewer CRs, so we also derived an optimal expression for the case of five control regions, by allowing for two control region bins in either $x$ or $y$, but not in both. In the case of two control region bins in $x$, but one in $y$, the optimal combination of the control region measurements is
As before, the error depends on the assumption of weak correlations among the control variables $x$ and $y$, as described by $\epsilon(x, y)$. We also assume that $\epsilon(x, y)$ varies slowly enough to allow for the series expansion. While the results are derived for fixed-width bins, they can be applied to variable-width cases. Variable-width bins can be turned into fixed-width bins by locally stretching or squeezing the phase space of the control variables. As long as this operation does not invalidate the assumption of weak correlations, these methods are applicable.

Toy example
As a simple test, we apply the ABCD method and the extended ABCD method of Eq. 16 to a distribution that is smoothly decreasing in $x$ and $y$, but otherwise arbitrary. The distribution would be separable in $x$ and $y$ in the absence of the $x+y$ term, which provides some correlation between $x$ and $y$. For simplicity, the boundaries for the ABCD method are set to $x_0 = 1$, $x_1 = 2$, $x_2 = 3$, $y_0 = 1$, $y_1 = 1$, $y_2 = 2$. The true value of the area in D is $F_D = 0.1210$ for $\alpha = 0.5$, while the ABCD method (Eq. 12) yields 0.1247. The extended ABCD method with the left boundary at $x_{-1} = 0$ yields 0.1195. The extended ABCD method reduces the error in the prediction by a factor of 2.5 for this case. Fig. 3 shows how the predictions of the ABCD and extended ABCD methods change with $\alpha$. The bands represent the error terms of the respective methods given in the appendix. Since the distribution is known explicitly, the error terms can be calculated. As $\alpha \to 0$, both methods converge to 1, as expected, since the distribution becomes independent in $x$ and $y$.

$t\bar{t}$+multi-jets in hadronic channels
For the second case study, we apply various ABCD methods of background estimation to a simulated $t\bar{t}+jj$ sample. The $t\bar{t}$+multi-jets processes are backgrounds to many of the searches for physics beyond the Standard Model at the LHC [14,15]. While the calculation of $t\bar{t}+jj$ is available at next-to-leading order (NLO), it has larger theoretical uncertainties than desired by the experiments [5]. Furthermore, the uncertainties quoted in the literature are on the overall inclusive cross sections, but in some regions of phase space, the uncertainties on the differential cross sections could be even larger. It is difficult to envision improved calculations for these processes in the foreseeable future. Therefore, having a more reliable data-driven technique is important for these processes.
We generated one million events of the $pp \to t\bar{t}jj$ process at $\sqrt{s} = 14$ TeV with MG5aMC@NLO v2.61 at LO [4]. The extra partons are required to have $p_T > 20$ GeV and $|\eta| < 5.0$. The partons are hadronized with Pythia 8 [16]. Delphes 3 fast detector simulation and reconstruction were subsequently applied. The reconstructed jets are required to have $p_T > 30$ GeV and $|\eta| < 2.4$. We required events to contain no isolated lepton with $p_T > 20$ GeV and $|\eta| < 2.4$.
The distribution of the number of hadronic jets ($N_j$) and the number of b-tagged jets ($N_{bj}$) is shown in Fig. 4, and the number of entries in each bin is listed in Table 1.
The correlation coefficient of the two variables is 0.139; hence, they are weakly correlated. We apply the methods in Eqs. 14-16, taking $N_j$ and $N_{bj}$ as control variables. The SR is $N_j \geq 9$ and $N_{bj} \geq 4$. This could be applicable in a scenario where the signature of interest consists of multiple jets and multiple b-tagged jets.
The results of applying various extrapolation methods are shown in Table 2. The uncertainties in the predictions are statistical uncertainties due to the finite number of entries in the control regions. They are evaluated by an ensemble test in which the number of entries in each control region fluctuates according to a Poisson distribution. The extended ABCD methods allow for better predictions in terms of reduced deviation from the truth, at the cost of increased statistical uncertainties.
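The ensemble test described above can be sketched as follows for the basic ABCD ratio. This is an illustration under stated assumptions, not the procedure used to produce Table 2: the function name `abcd_ensemble`, the number of pseudo-experiments, the seed, and the control-region counts in the usage lines are all hypothetical.

```python
import numpy as np


def abcd_ensemble(n_a, n_b, n_c, n_trials=20000, seed=1):
    """Mean and spread of the ABCD prediction under Poisson fluctuations.

    n_a, n_b, n_c are the observed control-region counts; each one is
    fluctuated independently according to a Poisson distribution.
    """
    rng = np.random.default_rng(seed)
    a = rng.poisson(n_a, n_trials)
    b = rng.poisson(n_b, n_trials)
    c = rng.poisson(n_c, n_trials)
    valid = a > 0  # skip pseudo-experiments where the ratio is undefined
    predictions = b[valid] * c[valid] / a[valid]
    return float(predictions.mean()), float(predictions.std(ddof=1))


# Hypothetical control-region counts, chosen only for illustration
mean, spread = abcd_ensemble(n_a=10000, n_b=2000, n_c=500)
print(f"prediction = {mean:.1f} +/- {spread:.1f}")
```

Because the prediction is a product and ratio of counts, the relative spread is roughly the quadrature sum of the relative Poisson uncertainties of the individual regions, so every additional control region entering the formula adds to the statistical uncertainty.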
Next, we consider cases where the control variables are continuous. We take the scalar sum of jet transverse momenta ($H_T$) and the transverse momentum of the sixth leading jet ($p_{T6}$) as the control variables. The two variables are clearly correlated (correlation coefficient: 0.660), as shown in Fig. 5. We deliberately chose these variables to better exemplify the advantages of the extended ABCD methods.
Since the distribution drops rapidly as $H_T$ or $p_{T6}$ increases, we consider two different use cases. In the first case, the widths of the CRs and SR ($\Delta_x$) are wider than the widths of the distribution, and in the second case, the widths are similar to or smaller than the width of the distribution of each control variable (Fig. 5). Table 3 shows how the different regions are defined and the number of entries in the respective regions for the two cases. In the first case, the region of interest (D) has a lower limit on $H_T$. This could be a typical use case in hadron colliders, where we are interested in phenomena at high energies. In the second case, D is much narrower, and although this is not the most general use case, it is nonetheless interesting for illustration purposes. The bins are chosen such that the number of entries does not vary greatly among the different regions.
Table 4. Predictions of entries in region D for the two cases in Table 3. The errors quoted are the expected statistical uncertainties from pseudo-experiments.
In the first case, the ABCD method yields 4802 ± 122, while the extended ABCD method of Eq. 16 yields 9976 ± 488. The ABCD method is inadequate because of the correlation between $p_{T6}$ and $H_T$. In the second case, the ABCD method yields 3886 ± 128, while the extended ABCD method yields 4493 ± 291. In both cases, the presence of the A and C control regions provides an additional lever arm and allows us to better account for the dependence on $H_T$.
One of the important reasons to use the data-driven method is to reduce some of the systematic uncertainties. Through several case studies, we demonstrate that the extended ABCD methods provide estimates that are closer to the truth. For cases where independent variables are not easy to find, the extended ABCD method could still take into account some of the correlations. In many analyses, the normalization of the background is treated as a nuisance parameter to be constrained further by fitting to data. The extended ABCD methods can provide smaller uncertainty on the prior of the normalization and thus move towards reducing systematic uncertainties.

Conclusions
We propose extensions to the ABCD method of extrapolated background estimation by exploiting information from additional control regions. The extended ABCD methods could be useful when the control variables are not exactly independent, since they can mitigate the effects of correlations among the variables. Through several case studies, we demonstrate that they provide more accurate predictions at the cost of increased statistical uncertainties.
This work was supported in part by the Korean National Research Foundation (NRF) grants NRF-2018R1A2B6005043 and NRF-2020R1A2B5B02001726.