A binned likelihood for stochastic models

Metrics of model goodness-of-fit, model comparison, and model parameter estimation are the main categories of statistical problems in science. Bayesian and frequentist methods that address these questions often rely on a likelihood function, which describes the plausibility of model parameters given observed data. In some complex systems or experimental setups, the outcome of a model cannot be predicted analytically, and Monte Carlo techniques are used instead. In this paper, we present a new analytic likelihood that takes into account Monte Carlo uncertainties and is appropriate for use in both large and small statistics regimes. Our formulation performs better than semi-analytic methods, prevents strong claims based on poorly estimated parameters, and results in better coverage properties than available methods.


Introduction
The use of Monte Carlo (MC) techniques to calculate non-trivial theoretical quantities and expectations in complex experimental settings is common practice in particle physics. A MC event is a single representation of what can be detected in data, and is typically generated from a single realization of the underlying physics parameters, θ. These events are often binned in some observable space and compared with the data. Since the generation process is stochastic, a particular θ used for generating the MC can lead to different outputs. This stochasticity introduces an uncertainty in the MC distributions. Furthermore, as production of large-statistics MC is often time consuming, reweighting is used to move from one hypothesis to another. In reweighting, a MC event with a weight previously assigned under the generation hypothesis, w(θ₀), is assigned a new weight, w(θ) [1]. It follows that MC uncertainties will be hypothesis dependent; thus, to do hypothesis testing, it is important to account for them. This is especially important for small-signal searches, performed far away from the Gaussian limit, where a modified-χ² may not be suitable [2]. A Poisson likelihood is a more appropriate statistical description of event counts, but a proper treatment of MC statistical uncertainties within it is less straightforward. Solutions to this problem have been discussed in the literature in the context of frequentist statistics by adding nuisance parameters [3][4][5]. A detailed probabilistic treatment of MC weights was discussed in [6]. However, [3,5,6] add additional time complexity, and [4] does not provide a full exposition on how to incorporate weighted MC. We present a new treatment, based on the Poisson likelihood, that is valid in the large and small statistics limits of data and MC, and suited for frequentist and Bayesian analyses.
Our likelihood accounts for statistical uncertainties due to MC, converges appropriately in the high-statistics limit, allows for arbitrary event-by-event reweighting, has better coverage properties, and is computationally efficient. An implementation of the likelihood described in this work can be found in [7].
The paper is organized as follows. In Sec. 2 we briefly review two common treatments available in the literature to account for MC statistical uncertainty. In Sec. 3 we define and discuss our new likelihood. In Sec. 4 we study the performance of the likelihood in an example and compare it to other treatments in the literature. Finally, in Sec. 5 we conclude.

The Poisson likelihood and previous work
In order to compare MC with data, events are often binned into distributions across a set of observables. For simplicity we focus on a single bin; in the absence of cross-bin-correlated systematic uncertainties the generalization to multiple bins is trivial. It is well known that the count of independent, rare natural processes can be described by the Poisson likelihood [8], given by

  L_Poisson(k | λ(θ)) = λ(θ)^k e^{−λ(θ)} / k! ,   (2.1)

where λ(θ) is the expected bin count for a hypothesis, which depends on parameters θ, and k is the number of observed data events. For weighted MC, often a direct substitution of λ(θ) = Σ_i w_i(θ) is used, where the w_i are the weights of each of the MC events. Then Eq. (2.1) can be rewritten as

  L_Poisson(k | w(θ)) = (Σ_i w_i(θ))^k e^{−Σ_i w_i(θ)} / k! ,   (2.2)

where the sum over i is performed across all MC events in the bin.
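As a concrete illustration, the weighted-MC Poisson likelihood above is straightforward to evaluate in log form; the following minimal Python sketch (function name is ours, not from the implementation in [7]) computes it for a single bin:

```python
import math

def log_poisson_like(k, weights):
    """log L_Poisson(k | lambda) for one bin, with the expectation
    estimated as lambda = sum_i w_i, the sum of MC event weights."""
    lam = sum(weights)
    return k * math.log(lam) - lam - math.lgamma(k + 1)

# Example: 7 observed events against MC weights summing to lambda = 6.0
ll = log_poisson_like(7, [1.5, 1.5, 1.5, 1.5])
```

Note that `math.lgamma(k + 1)` is used for log k!, which stays numerically stable for large counts.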
To treat MC statistical uncertainties in the small sample regime a modification of the Poisson likelihood was introduced in [3], which is briefly covered below. First, note that the expectation in a single bin is given by contributions from different physical processes, which we index by s. Then, the number of expected events can be written as

  λ(θ) = Σ_s N̂_s(θ),   (2.3)

where N̂_s is the expected number of MC events from process s that fall in the bin and the sum runs over s = 1, . . . , n_s, where n_s is the total number of relevant processes. This quantity can be compared with the data by using the likelihood in Eq. (2.1). The MC generation process can be modeled as having drawn N_s events out of a random process that simulates the physical process. Thus N_s is an estimator of the expected number of events from source s, N̂_s. The MC generation is a binomial process where we generate an arbitrary and fixed number of events per source of physical process, N_s,gen, and accept them into the bin of interest with probability β_s(θ), such that N̂_s(θ) = β_s(θ) N_s,gen. In the regime of rare processes, β_s(θ) ≪ 1, and a large number of generated events, N_s,gen ≫ 1, the expected counts in the bin can be approximated as Poisson distributed with parameter λ(θ) = Σ_s β_s(θ) N_s,gen = Σ_s N̂_s(θ). This is the case of interest, as it is when MC generation is expensive, so we can state that N_s is drawn from a Poisson process with mean N̂_s(θ). Profiling over the true number of MC events per process in the bin results in the Barlow-Beeston (BB) likelihood, given by [3]

  L_BB(k, {N_s} | θ) = max_{{N̂_s}} [ (λ(θ)^k e^{−λ(θ)} / k!) × Π_s (N̂_s^{N_s} e^{−N̂_s} / N_s!) ],   (2.4)

where λ(θ) is given by Eq. (2.3), N_s and N̂_s are the estimated and true MC counts in the bin respectively, and we have profiled over the n_s nuisance parameters denoted by {N̂_s}, s = 1, . . . , n_s. In the above formalism we have produced the MC at the natural rate, but this is not the case in weighted MC. The prescription is given by replacing Eq. (2.3) with

  λ(θ) = Σ_s w̄_s(θ) N̂_s,

where w̄_s(θ) are the average weights of MC events of source s that account for the differences between the MC generation and the target hypothesis of interest. In this case, the likelihood definition is still given by Eq. (2.4).
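For intuition, in the special case of a single unweighted source the profile over the nuisance parameter in Eq. (2.4) has a closed-form solution, N̂ = (k + N)/(w̄ + 1). The sketch below (our own illustration, not the general multi-source solver of [3]) evaluates the profiled single-source log-likelihood:

```python
import math

def log_bb_single(k, N, wbar):
    """Profiled single-source Barlow-Beeston log-likelihood.
    k: observed data count, N: MC count in the bin, wbar: average weight.
    The profile over the nuisance parameter Nhat has the closed-form
    maximizer Nhat = (k + N) / (wbar + 1)."""
    nhat = (k + N) / (wbar + 1.0)
    lam = wbar * nhat  # expected data count under the profiled nuisance
    return (k * math.log(lam) - lam - math.lgamma(k + 1)
            + N * math.log(nhat) - nhat - math.lgamma(N + 1))
```

The multi-source case has no such closed form in general and requires a numerical profile per bin, which is part of the time complexity noted above.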
In the large-statistics regime the Gaussian distribution is an appropriate description of the observed data. In this limit, the use of the Pearson-χ² as a test statistic [9] is common practice. For a single analysis bin the Pearson-χ² is defined as

  χ² = (k − λ(θ))² / λ(θ),

where λ(θ) = Σ_i w_i(θ) and the w_i are the weights of each of the MC events. The form of the Pearson-χ² arises from the fact that the Gaussian distribution of k is the large sample limit of a Poisson distribution, for which the expected statistical variance of the observation is given by λ(θ). Systematic uncertainties, under the assumption that they follow a Gaussian distribution and are independent between bins, can be included as

  χ² = (k − λ(θ))² / (λ(θ) + σ²_syst).
This method of incorporating systematic uncertainties, however, tends to overestimate them in shape-only analyses; see [10] for a recent discussion in the context of reactor neutrino anomalies. Similarly, one can include uncertainties to account for statistical fluctuations of the MC in the test statistic. In doing so the Gaussian behaviour is implicit and the modified-χ² reads

  χ²_mod = (k − λ(θ))² / (λ(θ) + σ²_mc),

where σ²_mc is the MC statistical uncertainty in the bin, given by

  σ²_mc = Σ_i w_i²(θ).

Note that this test statistic definition is not appropriate in the small statistics regime, as the data is no longer well described by a Gaussian distribution. If one uses a χ² test statistic in the small sample regime, one ought to calculate the test-statistic distribution properly to achieve appropriate coverage [11].
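The modified-χ² for one bin reduces to a few lines of Python (function name is ours, a sketch rather than any reference implementation):

```python
def modified_chi2(k, weights):
    """Modified chi^2 for one bin: (k - lambda)^2 / (lambda + sigma_mc^2),
    with lambda = sum_i w_i and sigma_mc^2 = sum_i w_i^2."""
    lam = sum(weights)
    sigma2_mc = sum(w * w for w in weights)
    return (k - lam) ** 2 / (lam + sigma2_mc)
```

With unit weights σ²_mc equals λ, so the MC term simply doubles the denominator relative to the Pearson-χ²; as weights shrink at fixed λ, σ²_mc vanishes and the Pearson form is recovered.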

Generalization of the Poisson likelihood
Using MC we can compute the expected event count, λ(θ), by adding the MC event weights as discussed in Sec. 2. Due to the finite nature of MC we can never recover exactly the intrinsic parameter, λ(θ), from which the MC events are drawn, but only construct estimators; the same is true for any observable of interest. As such, L_Poisson written in Eq. (2.2) is an approximation that presumes the MC realization matches the expectation. In general, we should rewrite L_Poisson as

  L_General(k | w(θ)) = ∫₀^∞ dλ (λ^k e^{−λ} / k!) P(λ | w(θ)),   (3.1)

where L_Poisson corresponds to the choice P(λ | w(θ)) = δ(λ − Σ_i w_i(θ)); but clearly this is an unrealistic assumption, as it presumes perfect knowledge of the parameter λ(θ) from a finite number of realizations. Instead, it is more appropriate to construct P(λ | w(θ)) based on the MC realization. This is given by

  P(λ | w(θ)) = L(λ | w(θ)) P(λ) / ∫₀^∞ dλ′ L(λ′ | w(θ)) P(λ′),   (3.2)

where P(λ) is a prior on λ that must be chosen appropriately and L(λ | w(θ)) is the likelihood of λ given w(θ). This is similar to [3,4], but instead of fitting λ as a nuisance parameter as in L_BB, Eq. (2.4), we marginalize over it as informed by the MC weights.

Likelihood construction
In this section we first construct L(λ | w(θ)) and then motivate an appropriate choice of P(λ). We will show that the likelihood can be written in terms of two quantities,

  μ = Σ_{i=1}^m w_i(θ)  and  σ² = Σ_{i=1}^m w_i²(θ),   (3.3)

for a bin with m MC events.
Assume that m is the outcome of sampling a Poisson distributed random variable M with probability mass function

  Pois(M = m; m̄) = m̄^m e^{−m̄} / m!,   (3.4)

where m̄ is the mean of the distribution. We will motivate a probability distribution of the expected number of data events, λ, by studying the case of identical weights. For identical weights, w ≡ w_i ∀i, the following equalities hold:

  μ = w m  and  σ² = w² m,  so that  m = μ²/σ²  and  w = σ²/μ.   (3.5)

We can interpret

  Pois(M = m; m̄) = Pois(M = μ/w; λ/w)   (3.6)

as a likelihood function of λ, as μ and σ fully specify w(θ) for identical weights. Assuming a uniform P(λ) and substituting Eq. (3.6) for L(λ | w(θ)) in Eq. (3.2), we obtain

  P(λ | μ, σ) = G(λ; α, β) = β^α λ^{α−1} e^{−βλ} / Γ(α),   (3.7)

where Γ is the gamma function and G the gamma distribution with shape parameter α and inverse-scale parameter β. These parameters are given in terms of μ and σ as

  α = μ²/σ² + 1  and  β = μ/σ².   (3.8)

With this choice of L(λ | w(θ)) and P(λ), we can rewrite L_General from Eq. (3.1) as

  L_Eff(k | w(θ)) = β^α Γ(k + α) / (k! (1 + β)^{k+α} Γ(α)),   (3.9)

where α and β depend on θ through w. The derivation above assumed identical weights. For arbitrary weights, μ is an outcome sampled from a compound Poisson distribution, which can be approximated by a scaled Poisson distribution by matching their first and second moments as outlined in [12]. Equation (3.3) can be written

  μ = m_Eff w_Eff  and  σ² = m_Eff w_Eff²,

where m_Eff is an effective number of MC events and w_Eff an effective weight. Both are given in terms of μ and σ by Eq. (3.5) as m_Eff = μ²/σ² and w_Eff = σ²/μ. They can be directly substituted into the derivation for identical weights, where μ is described by Pois(M = m_Eff; m̄) = Pois(μ/w_Eff; λ/w_Eff), corresponding to the scaled Poisson distribution described in [12]. Since m_Eff need not be an integer, we replace the factorial in Eq. (3.6) with a gamma function, as was done in Eq. (3.7). The derivation of L_Eff remains the same and is a valid approximation for arbitrary weights. It is possible to generalize the choice of α and β in Eq. (3.8) by choosing a particular form of P(λ). Since the distribution of interest is Poisson, a well-motivated choice of P(λ) is its conjugate prior, the gamma distribution [13]; also see [6] for a recent discussion.
Thus we set P(λ) = G(λ; a, b), where a and b are the shape and inverse-scale parameters of the gamma distribution respectively. These hyper-parameters dictate the distribution of the Poisson parameter λ [14]. In line with our previous discussion, the gamma distribution prior implies that Eq. (3.8) becomes

  α = μ²/σ² + a  and  β = μ/σ² + b.

The rest of the likelihood derivation remains the same. This allows the choice of specific values for a and b to satisfy certain properties. Equation (3.8) is obtained with a = 1 and b = 0, corresponding to the uniform prior discussed above. Another interesting choice is to require that the mean and variance of P(λ | μ, σ) match μ and σ² respectively. This can be achieved by setting a = b = 0, and we refer to this parameter assignment as L_Mean.
In the case of identical weights, L Mean is equivalent to Eq. (20) in [6]. Both choices are improper priors, as technically they are limiting cases of the gamma distribution. However, we can use them to obtain proper P(λ|µ, σ) distributions.
In [6] a convolutional approach is suggested for handling arbitrary weights. Each weighted MC event has P(λ_i | w_i) = G(λ_i; 1, 1/w_i), corresponding to the prior P(λ_i) = G(λ_i; 0, 0), such that λ = Σ_{i=1}^m λ_i. The likelihood L_Mean is a good analytic approximation of the more computationally expensive calculation given in [6]. The latter has time complexity O(k²m), where k and m are the number of data and MC events in the bin respectively. When assuming uniform priors, the convolutional approach does not recover Eq. (3.7) for identical weights, so it cannot be used as a generalization of L_Eff.

Likelihood convergence
In this section we will show that, if the relative uncertainty of the bin content vanishes as MC statistics increases, L Eff and L Mean both converge to L Poisson .
For positive weights w_i, the relative uncertainty σ/μ is bounded between zero and one. Uncertainty as large as the estimated quantity, σ/μ = 1, occurs if and only if m = 1. In the limit that σ/μ goes to zero, Eq. (3.7) converges to δ(λ − μ), and L_Eff and L_Mean both go to L_Poisson. We can see this by noting that the shape parameter, α, goes to infinity as the MC relative uncertainty goes to zero, turning the gamma distribution into a Gaussian distribution of mean α/β and variance α/β². This Gaussian converges to δ(λ − μ) in the limit of vanishing σ/μ. Substituting into Eq. (3.1), we recover L_Poisson.
It remains to be shown that the relative uncertainty of the bin content vanishes as MC statistics increase. For identical weights,

  σ/μ = (w √m) / (w m) = 1/√m → 0  as  m → ∞.

For arbitrary weights, the limit can be written in terms of the running averages of w_i and w_i² as

  σ²/μ² = ⟨w²⟩_m / (m ⟨w⟩_m²),

where ⟨w⟩_m is the average over the w_i and ⟨w²⟩_m the average over the w_i² for i ≤ m. This shows that as long as ⟨w²⟩_m does not grow much faster than ⟨w⟩_m², the limit will converge to zero. For most practical weight distributions, this should be the case. It is instructive to examine the behavior of L_Eff for the simple case of a single-bin fit with k = 100. As the general behavior of μ and σ can be complicated for arbitrary weights, we minimize l_Eff ≡ −2 ln L_Eff over μ for two possibilities: fixed σ and fixed σ/μ. In terms of Eq. (3.3), a sufficient but not necessary condition for constant σ/μ with varying μ is equal weights, and a necessary but not sufficient condition for constant σ with varying μ is m ≥ 2. Furthermore, a constant σ implies that μ ≤ σ √m. For arbitrary weights the correlation between μ and σ should lie somewhere in between the two cases.
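The bound 0 < σ/μ ≤ 1 and the 1/√m scaling for identical weights are easy to check numerically; a small sketch (helper name ours):

```python
import math

def rel_uncertainty(weights):
    """Relative MC uncertainty sigma/mu for one bin, with positive weights:
    sqrt(sum_i w_i^2) / sum_i w_i."""
    mu = sum(weights)
    sigma = math.sqrt(sum(w * w for w in weights))
    return sigma / mu

# Identical weights: sigma/mu = 1/sqrt(m), independent of the weight value,
# so quadrupling the MC statistics halves the relative uncertainty.
```

A single event (m = 1) saturates the bound at σ/μ = 1 regardless of its weight.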

Example and performance
In practice, likelihood constructions such as those discussed above are used to estimate physical parameters from data. As discussed in Sec. 1, weighted MC is often used to compute the likelihood of a particular physical scenario given the observed data. Statements are then made about the physical scenarios either by maximizing the likelihood or by examining the posterior distribution assuming some priors. We examine a toy experiment where we measure the mode, Ω, and normalization, Φ, of a Gaussian signal against a steeply falling inverse power-law background. The performance of L_Eff is evaluated and compared against other likelihoods.
For our toy experiment, we generate the true energies, E_t, of fake data events from a background falling as (E_t/100 GeV)^−3.07 and a Gaussian signal centered at Ω = 125 GeV with width 2 GeV and normalization Φ = 5013 for 1 arb. unit of livetime. Our imaginary detector is sensitive in the range from 100 to 160 GeV. To simulate the effect of a real detector, 5% (background) and 3% (signal) smearings are applied to E_t in order to obtain event-by-event reconstructed energies E_r. The MC generation is performed for signal and background separately, assuming inverse power-law distributions of (E_t/100 GeV)^−1 for signal and (E_t/100 GeV)^−2 for background. Reweighting of the MC can then be performed as a function of E_t and forward-folded onto distributions in E_r, over which the events are histogrammed and likelihoods evaluated. For all toy analyses, the background component and signal width are kept fixed to their true values. Only the signal peak Ω and normalization Φ are treated as free parameters.

Figure 2. Benchmark scenario. Leftmost panel: the underlying distribution that data is drawn from. Right three panels: observed data (black), L_Eff best-fit MC distributions (blue), and MC uncertainties (blue band), with increasing MC statistics from left to right. The background component and signal width are fixed, while the mode and normalization of the signal peak are fit by maximizing the likelihood with respect to those parameters.

Figure 2 shows the expectation in E_t as well as the data and L_Eff best-fit distributions in E_r. The leftmost panel shows the expectation for both signal and background assuming no smearing in E_t. The three other panels show the smeared, E_r, distribution for data (black) and the best-fit result from L_Eff for three different MC datasets (blue) of varying statistics. The smeared shape of the signal peak is clearly visible in data, but not in the lowest statistics MC.
As the MC increases in statistics, the best-fit MC can be seen to converge to the data. The best-fit values for the example shown in Fig. 2 are given in Table 1 for L_Eff and L_Poisson. As point estimators, both likelihoods return similar values. This is driven by the fact that the same underlying MC distribution is used to fit to the data. The effect of convoluting with P(λ | w(θ)) mostly serves to broaden the likelihood space, while preserving the maximum within the constraints described in Sec. 3.3. In the high-statistics limit, both likelihoods can be used for unbiased point estimation, provided that the likelihood space is smooth enough for standard minimization techniques to probe the global minimum.

Coverage
Due to the higher computational cost of generating frequentist confidence intervals by generating pseudodata to compute the test-statistic (TS) distribution, it is common to estimate frequentist coverage assuming Wilks' theorem. While this is usually valid in the high-statistics regime, in the case of finite MC a likelihood description based on MC expectations may lead to undercoverage even in the case of large data statistics. In this section, we will use TS = Δl. We evaluated the coverage properties, assuming Wilks' theorem, of the two-dimensional fit over (Ω, Φ) for several likelihood constructions. These include the modified-χ², L_Poisson, L_BB, L_Mean, and L_Eff. These five likelihoods were chosen on the basis of their computation speed and as tests of different approaches towards the treatment of weighted MC.
Several configurations were tested, all under the assumptions of the toy experiment described in Sec. 4.1. The data had livetime set to 1 arb. unit. The MC was generated for two different settings of the total number of events: 10³ and 10⁶. For each setting, 500 toy experiments were generated, their best fits found, and their Δl = l(θ_true) − l(θ̂) evaluated, where θ_true and θ̂ correspond to the true and best-fit (Ω, Φ) respectively. Each toy experiment was classified as covering θ_true at a specified level p if Δl < I(p; 2), where I is the inverse of the χ² cumulative distribution function and the 2 indicates the number of degrees of freedom. Figure 3 shows the percentage of times the true parameters were within the confidence intervals at level p as a function of the estimated coverage percentile for that level. The test was performed for two settings of MC statistics, N_MC = 10³ (left panel) and N_MC = 10⁶ (right panel), using the different likelihoods. First note that, as expected, the true coverage is highly dependent on MC statistics, with higher MC statistics leading towards improved agreement. In the case of N_MC = 10³, L_BB, L_Mean, modified-χ², and L_Poisson all undercover to varying degrees of severity. For N_MC = 10⁶, L_Poisson still undercovers but the other finite-MC likelihoods exhibit good agreement. In this test, L_Eff exhibits the best coverage properties.
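For a two-parameter fit the χ² inverse CDF used for the threshold has a simple closed form, I(p; 2) = −2 ln(1 − p), so the per-toy classification can be sketched as follows (function names ours):

```python
import math

def covers_truth(delta_l, p):
    """True if a toy experiment covers theta_true at level p for a
    two-parameter fit: delta_l < I(p; 2) = -2 ln(1 - p), where I is the
    inverse of the chi^2 CDF with 2 degrees of freedom."""
    return delta_l < -2.0 * math.log(1.0 - p)

def coverage_fraction(delta_ls, p):
    """Fraction of toy experiments whose best fit covers the truth at level p."""
    return sum(covers_truth(d, p) for d in delta_ls) / len(delta_ls)
```

Plotting `coverage_fraction` against p for each likelihood reproduces the kind of comparison shown in Figure 3; for other numbers of degrees of freedom the threshold needs a numerical χ² inverse CDF.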

Posterior distributions
It is also possible to use L_Eff in a Bayesian approach. Using Bayes' theorem, the posterior is proportional to

  P(θ | k) ∝ L_Eff(k | w(θ)) π(θ),

where π(θ) is a prior on the parameters. As evaluation of the normalization factor can be challenging, P(θ | k) can be approximated using a Markov Chain Monte Carlo (MCMC). For our toy example, we used emcee [15] to sample P(θ | k) under a uniform box prior for two different likelihood functions: L_Eff and L_Poisson. The sampling was performed using the data and MC sets described in Sec. 4.1. Figure 4 shows the posterior distributions of Ω and Φ of the signal peak. For each comparison, L_Eff (blue) and L_Poisson (orange) were sampled under a uniform prior using the same underlying data and MC. We used 20 walkers with 300 burn-in steps followed by 1000 steps as settings for emcee. The left and center columns show the marginal posterior distributions for the mode, Ω, and normalization, Φ, respectively. The true value is indicated by the dashed vertical line. The rightmost column shows the joint posterior distribution with 68% (solid) and 95% (dashed) contours. The truth is indicated by the star. With L_Poisson, the true value of the parameter is highly improbable for the lower MC-statistics cases of the top and middle rows. In contrast, the posterior evaluated using L_Eff has increased width due to the reduced MC statistics. Even for N_MC = 10⁶ (bottom row) the shape of the posterior evaluated using L_Poisson is narrower than that using L_Eff. Credible intervals estimated using L_Poisson would bias the result.
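As a self-contained stand-in for the ensemble sampling done with emcee above, a minimal random-walk Metropolis sketch (all names ours) illustrates the same posterior-sampling step for any log-posterior, e.g. one built from log L_Eff plus a log-prior:

```python
import math
import random

def metropolis(log_post, x0, step, n_steps, seed=0):
    """Minimal 1-D random-walk Metropolis sampler: a pedagogical stand-in
    for the ensemble sampling done with emcee in the text."""
    rng = random.Random(seed)
    x, lp = x0, log_post(x0)
    chain = []
    for _ in range(n_steps):
        xp = x + rng.gauss(0.0, step)  # symmetric Gaussian proposal
        lpp = log_post(xp)
        if math.log(rng.random()) < lpp - lp:  # Metropolis accept/reject
            x, lp = xp, lpp
        chain.append(x)
    return chain

# Example: sample a unit-Gaussian "posterior" and inspect its moments.
chain = metropolis(lambda x: -0.5 * x * x, 0.0, 1.0, 20000)
```

In practice an affine-invariant ensemble sampler such as emcee mixes far better for correlated multi-dimensional posteriors like (Ω, Φ); this sketch only shows the underlying accept/reject logic.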

Conclusion
The use of MC to estimate expected outcomes of physical processes is nowadays standard practice. By construction, MC distributions are sample observations and subject to statistical fluctuations. MC events are also typically weighted to a particular physics model, and these weights may not be uniform across all events in an observable bin. A direct comparison of MC distributions to data, typically performed via a Poisson likelihood or χ² minimization in which the expectation from MC is computed as a sum over weights in a particular observable bin, may be inappropriate in the case of low MC statistics. Such likelihoods neglect the intrinsic MC fluctuations and may lead to vastly underestimated parameter uncertainties. A better approach is to construct a likelihood that accounts for MC statistical uncertainties.
The main result of this work is an analytic extension of the Poisson likelihood that accounts for MC statistical uncertainty. This new likelihood, L_Eff, is motivated by treating the MC realization as an observation of a Poisson random variate, computing the likelihood of the expectation using the MC, and convoluting with the Poisson probability of the observed data. Our construction is computationally efficient, exhibits proper limiting behavior in both the high-statistics and equal-weights limits, and has excellent coverage properties. In our tests, it outperforms other treatments of MC statistical uncertainty.