Quantifying the power of multiple event interpretations

A number of methods have been proposed recently which exploit multiple highly-correlated interpretations of events, or of jets within an event. For example, Qjets reclusters a jet multiple times and telescoping jets uses multiple cone sizes. Previous work has employed these methods in pseudo-experimental analyses and found that, with a simplified statistical treatment, they give sizable improvements over traditional methods. In this paper, the improvement gain from multiple event interpretations is explored with methods much closer to those used in real experiments. To this end, we derive a generalized extended maximum likelihood procedure. We study the significance improvement in Higgs to bb with both this method and the simplified method from previous analysis. With either method, we find that using multiple jet radii can provide substantial benefit over a single radius. Another concern we address is that multiple event interpretations might be exploiting similar information to that already present in the standard kinematic variables. By examining correlations between kinematic variables commonly used in LHC analyses and invariant masses obtained with multiple jet reconstructions, we find that using multiple radii is still helpful even on top of standard kinematic variables when combined with boosted decision trees. These results suggest that including multiple event interpretations in a realistic search for Higgs to bb would give additional sensitivity over traditional approaches.


Introduction
Both the ATLAS and CMS collaborations have recently released their full Run 1 analyses of the search for the H → bb decay mode [1,2]. With only Run 1 data, neither search was capable of finding this process at Standard Model rates. With the additional statistics from Run 2 data, H → bb will surely be observed. However, a precision measurement providing a meaningful extraction of the bottom and top Yukawa couplings with implications for Beyond the Standard Model physics will require these searches to increase their sensitivity beyond what might be achievable with currently used experimental techniques. Most of the proposed improvements involve looking in special kinematic regions where backgrounds are smaller [3] or computing new, physically-motivated observables [4,5].
In [6], a qualitatively new way to construct observables called "Qjets" was proposed. Instead of comparing a single observable between signal and background, Qjets proposed to look at the sensitivity of an observable to multiple interpretations of that observable generated by small variations in some parameter. In the original Qjets proposal, these variations were made perturbing around the original jet clustering algorithm: instead of always merging the two closest particles during jet clustering, the Qjets algorithm considers merging more distant pairs. This generates a distribution of highlycorrelated observables for each jet in each event. The width of this distribution for pruned jet mass [7], which was called volatility in [6], provides a strong signal-to-background discriminant in boosted Higgs or boosted W boson searches. Volatility has been measured in experiment [8,9] with similar results and discrimination power to simulation. An application of the Qjets method to event reconstruction was proposed in [10]. In [11], a simpler and faster way of using multiple event interpretations was proposed: simply compute the same observable using different values of the jet size R. Both [10] and [11] computed the reach of the H → bb search by combining the multiple interpretations, finding as much as a 46% improvement in significance over using a single interpretation.
The method used to estimate significance improvement in [6,10,11] is a natural generalization of cut-and-count for a single observable. Normally, an event either passes a set of cuts (z = 1) or does not (z = 0). With multiple interpretations, a fraction z of the interpretations pass the cuts. This fraction z is an observable, measurable in data and computable with Monte Carlo. In [6,10,11], the 1-dimensional distributions of z for signal and background were used to estimate the probability that a given set of of events could be explained by a fluctuation of the background only. The procedure is reviewed below in Section 3. Using this method, multiple event interpretations were shown in [6,10,11] to give significant improvement to search reaches.
One drawback of the method used in [6,10,11] is that it presupposes a knowledge of the background cross section. Many LHC analyses try to avoid taking cross sections from theory. Instead, they often use control regions to establish background normalizations, which are not necessarily precisely known in the specific regions of phase space exploited by the analysis. These control regions are typically defined to have minimal overlap with the signal regions. When using multiple event interpretations, however, it is generally not possible to define non-overlapping regions, since a single event can potentially cover a large range of values for the observable of interest. One goal of this paper is to generalize the extended maximum likelihood procedure, which fits to signal and background cross sections separately, to observables based on multiple event interpretations.
Another question one might have about multiple event interpretations is whether the improvement is due to features of the events which are already accounted for in the current search strategies. For example, most analyses (including the H → bb searches) use many kinematic variables in the event to maximize significance, often combined with sophisticated multivariate methods like neural networks and boosted decision trees. It has not been proven so far that the improvement observed when using multiple event interpretations is independent of what can be obtained by exploiting the kinematic features of the event exploited by multivariate discriminants. We also address this concern, by showing that multiple event interpretations can indeed improve the significance of a search when combined with kinematic variables. This paper is organized as follows. Section 2 describes the simulation and event selection we used. Section 3 gives a quick introduction to the statistical methods we discuss and their relative merits. In Section 4 we first review the typical extended maximum likelihood (EML) fit, emphasizing those aspects which break down when using multiple event interpretations. Then we derive the modifications in the EML formalism necessary to account for the statistical correlations among the different interpretations of the same event. Section 4.2 describes a two-dimensional extension of the likelihood fit that avoids modifications of the EML fit at the expense of adding complexity, and compares the performance of this extension to the results of the previous section. Section 5 compares the performance of multivariate analyses including kinematic information and multiple event interpretations, to understand whether multiple event interpretations indirectly make use of kinematic information already exploited in current LHC analyses. We summarize the results from all these methods in Table 1.

Monte Carlo simulation and event selection
We generate signal and background processes for proton-proton collisions at √ s = 8 TeV using Mad-Graph 5.1 [12], interfaced to Pythia 6.4 [13] to simulate the parton shower and non-perturbative effects such as underlying event and hadronization. ZH → e + e − bb events are generated and used for the signal, and Z(→ e + e − ) + bb events are used as background. High statistics are produced for these samples, but in quoting expected significances, the signal and backgrounds are normalized to 25 fb −1 . This normalization is also used for generating toy models used for estimating the uncertainties in the likelihood fits. Jets are clustered from stable particles with lifetimes above 10 ps (excluding neutrinos) using the anti-k t algorithm with different R parameters. Unlike in [10], these interpretations are built using a deterministic method, similar to the telescoping jets approach introduced in [11]. The event selection is a simplified version of that in [1] and requires 83 GeV< m ee < 99 GeV, p T (b lead ) > 45 GeV, p T (b sublead ) > 25 GeV and p lep T > 25 GeV. A jet is defined as a b-jet if it contains any decay products from the original b-quark. Studies are performed in two kinematic regions defined by the transverse momentum of the vector boson (p Z T < 120 GeV and p Z T > 120 GeV). The invariant mass distribution of the two leading jets in p T that are labeled as b-quarks is used for our singificance estimates.

Significance measures
In this section, we quickly review the essential differences between a cross-section based significance calculation, like the ones used in [6,10,11], and the extended likelihood method.
The approach used in [6,10,11] starts with an observable z, defined as the fraction of event interpretations satisfying a given set of cuts. In a normal analysis, z would be either 1 (event passes cuts) or 0 (event fails cuts). The insight of Qjets was that one can use multiple event interpretations to make z a rational number. One can then compute in Monte Carlo the probabilities ρ S (z) and ρ B (z) for finding certain z's (See Fig. 3 of [10] for some ρ(z) distributions). Then the probability that the data cannot be accounted for by a fluctuation of the background (measured in standard deviations of the signal away from the background) is given by where ρ data is the observed probability. See Section 5 of [10] for a derivation of this formula. In a simulation, we replace data-minus-expected-background in the numerator by signal: For terminological clarity, we call the estimate of significance using this method the crosssection (xs)-based significance. It is cross-section based since one needs to know how much background there should be in order to see a fluctuation above this amount. The xs-based significance can be applied to any observable, not just this z variable. It corresponds to an analysis which provides a weight function w(z) = ρ S (z)/ρ B (z), and measures a final observable N which is the sum of the weight of all observed events. In particular, it can be applied to multidimensional data, using w(z 1 , z 2 , · · · ) if the multidimensional distributions ρ S (z 1 , z 2 , · · · ) and ρ B (z 1 , z 2 , · · · ) are known. We show the relative significance using various choices of z i in Table 1.
Many LHC analyses are instead based on likelihood fits. In a likelihood fit, the distribution of some observable z in the data is used to fit the signal and background cross sections: While the xs-based significance is computed using the normalized probability distributions for signal and background and the expected background cross section, likelihood fits extract both signal and background cross sections directly from data. A likelihood fit which takes into account the Poisson fluctuations of signal and background is called an extended maximum likelihood (EML) fit; in the rest of this paper all references to "likelihood" fits include this extension as appropriate. In Section 4 we show how extended maximum likelihood can be computed from events with multiple interpretations.
In likelihood fits the ultimate observable of interest is σ S . For a discovery, we need the probability that σ S > 0. The analog of the significance in Eq. (3.1) is the expected value of σ S divided by the variance in the measured value of σ S on data sets with no signal. In other words: Note that while Eqs. (3.1) and (3.3) are both measures of the probability that a signal is there, the two methods have different priors so they cannot be directly compared. In the xs-based significance, the analysis knows the expected background rate, and counts any difference in the data from the background as observed signal. For an EML significance, both the signal rate and the background rate are fit. Since the xs-based analysis has access to more information it will often give higher significances. The likelihood fit is procedurally closer to experimental conditions, where backgrounds are never known with perfect accuracy and are estimated from sidebands.
The difference in priors between xs-based and EML can be illustrated by a simple example. Suppose we had only a single bin, and so the entirety of the data is the fact that N events were measured. The xs-based approach would report an observed value N , subtract the expected N B (an input from theory), and declare an observed signal rate of The likelihood fit, however, would simply fail, for it would be attempting to fit N = σ S + σ B for both σ S and σ B , and a single bin cannot fit two parameters. Thus the error would be infinite, and the significance would not be meaningful.

Maximum likelihood fits
Many methods exist to find the cross section fits σ S and σ B in Eq. (3.2) by maximizing the likelihood of the observed distributions. In this section we explore how to apply such methods to multipleinterpretation data. When using multiple interpretations, each event k = 1 · · · K gives us not a single number x k , as is the usual case, but a series of numbers x i k , with i = 1 · · · I indexing the interpretations. We consider two approaches to constructing a likelihood fit from these observables x i k , which we call the merged histogram likelihood fit and the multidimensional likelihood fit.
For the merged histogram fit, we combine all of the interpretations into a single one-dimensional distribution, as though we had not K events but K × I events. Then we can apply the usual fitting technology to this distribution, as we might for, say, an invariant mass distribution in a conventional analysis. This approach will converge upon the correct fit values in the limit of infinite statistics. However, because the distribution contains both highly-correlated contributions (from a single event) and statistically-uncorrelated ones (from different events), the existing technology to estimate the errors in the fit parameters will give vastly incorrect errors. We show how to correctly estimate the errors with this method in Section 4.1.
In the multidimensional likelihood fit, we treat the x i k as I observables in K events. This is possible when the multiple interpretations are indexed, as in telescoping jets where they correspond to different R, but not when they are generated by adding randomness to a jet algorithm, as in the original Qjets proposal. If the I interpretations are distinguishable, we can then do a fit to the K data points in an I-dimensional space. This is no different from a regular fit to multi-dimensional data, and so the usual fitting technology can be applied. As in other multidimensional fits, the highdimensionality of the space will quickly saturate the statistics either of data or simulation. To control for this, we consider three approaches. We first try severely limiting the number of interpretations, down to I = 2. This allows us to map out the full I-dimensional distribution over events and produce a nearly exact maximum likelihood solution. Second, we try limiting the number of bins, but taking I larger. Third, we try using multivariate methods, in particular boosted decision trees, in place of the exact likelihood. The multidimensional approaches are explored in Section 4.2. Numerical results are summarized in a Table 1 and discussed in Section 4.3.

Merged histogram likelihood fit
In this section we derive appropriate formulas for uncertainty estimation when highly correlated data from multiple event interpretations are merged into a single distribution. We first review general features of how likelihood fits are done, and then discuss how things are modified with multiple event interpretations.
In an extended maximum likelihood (EML) fit, a set of unknown parameters {a α } are estimated by maximizing the likelihood density over a sample of N data values, x n : where P (x; {a α }) is the probability density for the data value x when N data points are expected. It should be noted that N , the actual number of observed events, is generally not equal to N , due to Poisson fluctuations (N can take non-integer values). Since the normalization has to be constrained to the number of expected events, the following is true: where P is taken as a continuous probability density function in this equation and not evaluated at each data point. This shows one way that the EML fit differs from the standard maximum likelihood fit, for which this normalization is 1. The EML thus allows to fit for the shape of a distribution as well as for the normalization. The EML is thus widely used in searches, where the total number of events collected is subject to Poisson fluctuations. The difference between the standard maximum likelihood and the EML lies on the normalization condition, represented by the e −N term [14], which guarantees that the normalization of P (x n ; {a α }) does not increase arbitrarily in the maximization procedure, and that that normalization obeys Poisson statistics for N expected events. The log-likelihood for this function takes the form ln L =  The fitted parameters are then taken to be the set {â α } which maximize this likelihood: The errors on these parameters come from incoherent Poisson fluctuations on the number of observed events in each bin, δn(x). Thus the total error on each parameter, δa α , is given by the quadrature sum of the errors due to each bin fluctuation: The error due to fluctuations of a single bin can be computed by expanding the minimization equation: d ln L/dn(x) is just the contribution to ln L from a single event in bin x, which is given by ln P . From this we can compute the total error on parameters a α : Where M αβ is the second derivative matrix of ln L. Assuming n(x) has Poisson errors δn(x) = n(x), we get a final formula for the errors: Where C is the covariance matrix: C αβ ≡ ∂ aα ln P ∂ a β ln P .
In the simple case of only a single a, this reduces to: This formalism can be extended to understand the use of the EML for multiple event interpretations. For multiple event interpretations, correlations between points can be resolved by expressing the sum over K events as a sum over the multidimensional parameter space of x = {x 1 , x 2 , ..., x I }, where the new index refers to each event interpretation. For the analysis where all interpretations are combined in the same x distribution, P (x) = i P i (x) and P i (x) is the projection of the multidimensional distribution P ( x) along the interpretation i.
In this approach the likelihood function to be maximized is: And in the case of a single fit parameter a: In the case of interest, we have a signal distribution ρ S (x) and a background distribution ρ B (x) (each normalized so ρ(x)dx is the number of interpretations expected to pass the cuts). We are fitting to a predicted distribution P = N S ρ S + N B ρ B , so we have two parameters N S and N B . The likelihood is: And then the errors are given by: Using Eqs. (4.17)-(4.19) and (3.3), the improvement in the significance of the search can be estimated for an analysis using a likelihood fit.

Multidimensional likelihood fits
Rather than merging all the interpretations into a single distribution, it would clearly be smarter to keep the interpretations separate and exploit their correlations. Unfortunately, computing the exact likelihood from an I-dimensional space would require an exponentially large data set. For example, with I = 10 interpretations, and only B = 5 bins in each direction, we would need 10 million events just to have each bin populated with 1 event. This is of course just the usual curse of dimensionality for multivariate fits, which is present in any analysis with correlated observables. A popular approach is to replace the exact likelihood with boosted decision trees (BDTs), neural networks, or other sophisticated  algorithms. Another approach is just to take I and B small enough so that the dimensionality is not intractable. In this section, we compare these alternatives. If we take I = 2, then it is possible to populate the 2 dimensional space quite well. For example, the 2D distributions for signal and background m bb distributions for two different choices of R are shown in Figure 1. The figure shows the low p Z T sample clustered with anti-k t using R = 0.5 and R = 1.2. The results at high p Z T are qualitatively comparable except for an overall shift at higher masses for the background. The two invariant masses are clearly correlated, but also clearly not 100% correlated. To quantify the correlation, the linear correlation coefficients among some representative R values are shown in Figure 2. A picture emerges in which, particularly at high p Z T , the correlation between small R and large R interpretations is quite different for signal and background, with smaller correlations in the signal. The 2-dimensional data can be run through a regular likelihood fitting procedure, with no modification since each event is only contributing a single point. We show results in Table 1.
As an alternative to taking I = 2, we can produce a statistically tractable fitting problem by instead reducing the number of bins B. For instance, we consider B = 3 and I = 4, where we bin every interpretation into m bb < 110 GeV, 110 GeV < m bb < 140 GeV, and 140 GeV < m bb . Results for this approach are also shown in Table 1 Table 1 summarizes the results obtained with the different methods described in previous sections. The baseline is the third line of the table; for the xs-based results it uses ρ(m bb ) (see Section 3) and for the EML-based result uses a fit to the m bb distribution. The first two lines use the fraction of interpretations in a window of 110 GeV < m bb < 140 GeV to define z as in [11]. These two rows list the xs-based significance and the EML significance, both evaluated on the observable z given by the fraction of interpretations that lie in the window. For 1 R, z is either 0 or 1, and this is equivalent to a simple cut-and-count experiment. The use of the full m bb distribution (from 50 GeV to 200 GeV with an overflow bin on each side), even with just 1 R (the third row) is more powerful than the use of multiple event interpretations for the simple cut-and-count model (the second row). When using 5 R's (R = 0.4, 0.5, 0.6, 0.7 and R = 0.8) and pooling all interpretations in the same m bb distribution (the row labeled "merged"), there is a small gain when using the likelihood fit, despite some features of the distribution being washed out in the procedure. The next row shows the results when using 3 m bb bins (m bb < 110 GeV, 110 GeV < m bb < 140 GeV and m bb > 140 GeV), to reduce the dimensionality, and 4 R's (R = 0.4, 0.6, 0.8, 1.0). Using 3 bins provides non-negligible gains at high p T , but does not manage to perform better than a single interpretation at low p T . This indicates that at low p T the added bins do not help tell apart the different interpretations but that a p T -dependent choice of binning might be worth further exploration.

Results
The method that gives the second highest significance is the one using the full 2D distributions of m bb computed with R = 0.5 and R = 1.2. At high p T , we find 35% and 38% improvements with the xs-based and EML fit respectively. As long as the two radii are far enough apart that the correlations between the reconstructed invariant masses is not very high, the choice of the two radii does not impact significantly the observed improvement. The use of boosted decision trees 1 to combine the two radii does not do quite as well as the full 2D m bb distribution, suggesting that there is a loss of information in the construction of the BDT, at least in our implementation (the TMVA default).
The improvement in both the xs-based and the EML-based approaches is highest using the most radii which we combine using BDTs. We find up to a 52% improvement for xs-based or 41% for EML based in the high p T sample by combining 12 R's over using just a single R. Two additional entries in the table refer to the use of kinematic variables in the BDTs and are discussed in Section 5.

Comparison to standard observables
So far, we have seen that combining measurements of m bb computed by clustering with the anti-k t algorithm with different R's can have a sizable improvement over using m bb with a single R. It is natural to wonder whether the improvement is due to the exploitation of properties of the events which could be exploited equally well using more traditional kinematic variables. For example, as R increases, the jet momentum increases. Since the signal and background have different momentum dependence, one might imagine that the same gain could be realized simply by including the p T of the b-b system into the analysis. To explore whether multiple R values leads to improvement in the HZ → bb + − search, we compare the improvement using multiple event interpretations (multiple R's) to the improvement from standard kinematic observables.
For the kinematic variables, we take the set from the HZ LHC study in [4], Table 4. All observables are computed using jets reconstructed with the anti-k t algorithm with R = 0.7. The observables we consider are constructed from either the hadrons (b-jets): or the leptons (from the Z decay): and one variable dependent on the leptons and b-jets: In these expressions, η is rapidity, p T is transverse momentum, b 1 refers to the harder of the two b-jets and b 2 to the softer, and similarly for the leptons. These 12 variables are all input to a BDT and used to compute the significance just as we have for multiple event interpretations. We also try combining these kinematic variables with the 11 remaining masses in our study (computed using R = 0.4 to R = 1.5 in steps of 0.1). Results are shown on the last two lines of Table 1. We can draw a few interesting qualitative conclusions from this analysis. First, we see that the kinematic variables work relatively better at low p T than high p T . At high p T angular differences are smaller and there are, thus, fewer handles to distinguish signal from background. This is unlike the multiple event interpretations, for which improvements are more significant at high p T . Second, we see that using multiple event interpretations (multiple R's) still gives serious benefit on top of all of the kinematic variables. The improvement is more significant at high p T as expected from the previous observation.

Conclusions
In previous work [6,10,11] multiple event interpretations, in particular the reclustering of an event using different jet sizes, were shown to give sizable improvement in the potential significance for a H → bb search in association with a W or Z boson. In this paper we have attempted to refine those analyses using methodologies as close as possible to those used in experimental analyses at the LHC. For this purpose, a new expression of the likelihood has been developed to account for correlations across events populating several bins in one dimension. The improvement in the significance of the search with this treatment has been shown to be sizable. We find as much as ≈ 41% improvement when using 12 R's over a single R when the bb system has p T > 120 GeV and ≈ 30% improvement for p T < 120 GeV. We also explored whether the improvement from multiple event interpretations carries overlapping information to traditional kinematic variables or complementary information. To answer this question, we took 12 kinematic variables that have been demonstrated to be nearly optimal in a multivariate analysis [4] (and some of which were used in a recent CMS search) for H → bb and compared their efficacy to what we get from just using multiple R's and what we get by combining them. We find that adding the m bb at multiple interpretations gives a 12% improvement at low p T and 20% improvement at high p T over the kinematic variables at a single R. These improvements are particularly encouraging, since the phase space explored in this paper does not include boosted topologies and thus cannot benefit from otherwise highly successful jet substructure techniques [3,16,17]. This phase space is, however, quite relevant for finding the H → bb decay. In summary, these results strongly suggest that multiple interpretations can help in searches with realistic statistical methods.