Criteria for projected discovery and exclusion sensitivities of counting experiments

The projected discovery and exclusion capabilities of experiments are often quantified using the median expected $p$-value or its corresponding significance. We argue that this criterion leads to flawed results, which for example can counterintuitively project lessened sensitivities if the experiment takes more data or reduces its background. We discuss the merits of several alternatives to the median expected significance, both when the background is known and when it is subject to some uncertainty. We advocate for standard use of the "exact Asimov significance" $Z^{\rm A}$ detailed in this letter.

Introduction. Consider the problem of assessing the efficacy of a planned experiment that will measure event counts that could be ascribed either to a new physics signal or to a standard physics background. The criteria for discovery or exclusion of the signal can be quantified in terms of the $p$-value. In general, for a given experimental result, $p$ is the probability of obtaining a result of equal or greater incompatibility with a null hypothesis $H_0$. In high-energy physics searches, for example, the one-sided $p$-value results are usually reported in terms of the significance
$$Z = \Phi^{-1}(1 - p) = \sqrt{2}\, {\rm erf}^{-1}(1 - 2p), \qquad (1)$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution, and the criteria for discovery and exclusion have often been taken, somewhat arbitrarily, as $Z > 5$ ($p < 2.867 \times 10^{-7}$) and $p < 0.05$ ($Z > 1.645$), respectively. Here, we suppose for simplicity that both signal and background are governed by independent Poisson statistics with means $s$ and $b$ respectively, where $s$ is known and $b$ may be subject to some uncertainty. For assessing the prospects for discovery, one simulates many equivalent pseudo-experiments with data generated under the assumption $H_{\rm data} = H_{s+b}$ that both signal and background are present, obtaining observed event counts $n_1, n_2, n_3, \ldots$. One then calculates the $p$-value for each of those simulated experiments ($p_1, p_2, p_3, \ldots$) with respect to the null hypothesis $H_0 = H_b$ that only background is present. For exclusion, the roles of the two hypotheses are reversed; the pseudo-experiment data is generated under the assumption $H_{\rm data} = H_b$ that only background is present, and the null hypothesis $H_0 = H_{s+b}$ is that both signal and background are present, so that a different set of $p$-values is obtained. The challenge is to synthesize the results in the limit of a very large number of pseudo-experiments into a significance estimate $Z_{\rm disc}$ or $Z_{\rm excl}$. There is no agreement on this step, which is the primary focus of this letter.
A common measure [1] of the power of an experiment is the median expected significance $Z^{\rm med}$ for discovery or exclusion of some important signal (i.e., the median of $Z(p_1), Z(p_2), Z(p_3), \ldots$ for the simulated $p$-values). A reason to use the median (rather than the mean) is that eq. (1) is non-linear, so that the mean of a set of $Z$-values is not the same as the $Z$-value of the corresponding mean of $p$-values.
However, $Z^{\rm med}$ has a counter-intuitive flaw, which is most prominent when $s$ and $b$ are not too large, and especially for exclusion. As we show below, for a given fixed $s$, $Z^{\rm med}$ can actually significantly increase as $b$ increases. Similarly, for a given fixed $b$, $Z^{\rm med}$ can decrease as $s$ is increased. This leads to the paradoxical situation that an experiment could be judged worse, according to the $Z^{\rm med}$ criterion, if it acquires more data or if it reduces its background. In this letter, we discuss this problem and consider some alternatives to $Z^{\rm med}$.
Known background case. The Poisson probability of observing $n$ events, given a mean $\mu$, is $P(n|\mu) = e^{-\mu} \mu^n/n!$. Consider first the idealized case that the signal and background Poisson means $s$ and $b$ are both known exactly. One can then generate pseudo-experiment results for $n$, using $\mu = s + b$ for the discovery case and $\mu = b$ for the exclusion case. A large number of simulated pseudo-experiments can be generated randomly via Monte Carlo, as described in the Introduction. However, for all cases in this letter, it is equivalent but much more efficient and accurate to consider exactly once each result $n$ that can contribute non-negligibly, and then weight the results according to the probability of occurrence. The $p$-value for discovery, if $n$ events are observed, is
$$p_{\rm disc}(n) = \sum_{k=n}^{\infty} P(k|b) = \gamma(n, b)/\Gamma(n), \qquad (3)$$
while that for exclusion is
$$p_{\rm excl}(n) = \sum_{k=0}^{n} P(k|s+b) = \Gamma(n+1, s+b)/\Gamma(n+1), \qquad (4)$$
where $\Gamma(x)$, $\gamma(x, y)$, and $\Gamma(x, y)$ are the ordinary, lower incomplete, and upper incomplete gamma functions, respectively. The median $p$-value among the pseudo-experiments can now be converted, using eq. (1), to obtain $Z^{\rm med}_{\rm disc}(s, b)$ and $Z^{\rm med}_{\rm excl}(s, b)$.
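As an illustration, the exact $p$-values of eqs. (3) and (4) and the conversion of eq. (1) can be evaluated with standard special functions. The following sketch is our own (not from the letter's companion code), assuming SciPy, with function names chosen for illustration:

```python
# Sketch (not from the letter's code): exact p-values for a counting
# experiment with known signal mean s and background mean b, eqs. (3)-(4),
# and the p -> Z conversion of eq. (1). Requires SciPy.
from scipy.special import gammainc, gammaincc
from scipy.stats import norm, poisson

def z_value(p):
    """Significance Z = Phi^{-1}(1 - p), eq. (1)."""
    return norm.isf(p)

def p_disc(n, b):
    """Discovery p-value, eq. (3): P(>= n | b) = gamma(n, b)/Gamma(n)."""
    return gammainc(n, b)   # regularized lower incomplete gamma

def p_excl(n, s, b):
    """Exclusion p-value, eq. (4): P(<= n | s+b) = Gamma(n+1, s+b)/Gamma(n+1)."""
    return gammaincc(n + 1, s + b)   # regularized upper incomplete gamma

# Cross-check against the defining Poisson sums:
print(p_disc(5, 0.5), 1.0 - poisson.cdf(4, 0.5))   # should agree
print(p_excl(3, 6.0, 2.0), poisson.cdf(3, 8.0))    # should agree
```

The incomplete-gamma forms also accept non-integer $n$, which is what makes the exact Asimov evaluation discussed below possible.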
Some typical results for $Z^{\rm med}_{\rm disc}$ and $Z^{\rm med}_{\rm excl}$ as a function of $b$ are shown in Figure 1. They each have a "sawtooth" shape, rather than the monotonic behavior one might perhaps expect. This illustrates the unfortunate feature mentioned in the Introduction that the median expected $Z$ can increase with increasing $b$. As noted in [2,3] for $Z^{\rm med}_{\rm disc}$, the underlying reason is that the allowed values of $n$ are discrete (integers), causing the median to get "stuck" instead of varying continuously in response to changes in $s$ or $b$. We emphasize that this sawtooth behavior is exactly reproducible for any sufficiently large number of pseudo-experiments, and has nothing to do with randomness from insufficient sampling. It is more prominent for exclusion than for discovery, because the number of events relevant for the median pseudo-experiment is smaller. Also, note that for larger $b$, the sawteeth get closer together as the integer $n$ of the median gets larger, but the height of the sawtooth envelope remains significant. This is effectively a sort of practical randomness in $Z^{\rm med}$, as tiny changes in $s$ or $b$ will move one between the top and the bottom of the sawtooth envelope.
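The sawtooth can be reproduced by weighted enumeration rather than random sampling, as described in the text. A minimal sketch (our illustration; the function name is not from any package), assuming SciPy:

```python
# Weighted-enumeration sketch of the Z_med_excl sawtooth: enumerate outcomes
# n with their H_b Poisson weights and take the weighted median p-value.
import numpy as np
from scipy.special import gammaincc
from scipy.stats import norm, poisson

def z_med_excl(s, b, nmax=200):
    n = np.arange(nmax)
    w = poisson.pmf(n, b)            # exclusion pseudo-data generated under H_b
    p = gammaincc(n + 1, s + b)      # eq. (4); p increases with n
    n_med = np.searchsorted(np.cumsum(w), 0.5)   # weighted median outcome
    return norm.isf(p[n_med])

zs = [z_med_excl(3.0, b) for b in np.linspace(0.1, 3.0, 30)]
# Z_med_excl rises and falls with b instead of decreasing monotonically:
print(any(z2 > z1 + 1e-9 for z1, z2 in zip(zs, zs[1:])))
```

For example, as long as the median outcome is stuck at $n = 0$ (which holds for $b < \ln 2$), the median $p$-value is $e^{-(s+b)}$, which decreases, and hence $Z^{\rm med}_{\rm excl}$ increases, as $b$ grows.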
We now consider several alternatives to $Z^{\rm med}$. First, one can take the arithmetic mean of the $Z$-values directly, which we call $Z^{\rm mean}$. (In computing $Z^{\rm mean}_{\rm disc}$, we use $Z = 0$ for no observed events, $n = 0$. A reasonable alternative definition for both $Z^{\rm mean}_{\rm disc}$ and $Z^{\rm mean}_{\rm excl}$ would be to use $Z = 0$ for all outcomes $n$ that give a negative $Z$. That would give slightly larger values for $Z^{\rm mean}$, but usually negligibly so except when $Z^{\rm mean}$ is uninterestingly small anyway.) Second, one can take the arithmetic mean of the $p$-values, and then convert these to $Z$-values, which we call $Z^{p\,{\rm mean}}$. Third, one can consider the $Z$-value obtained for the mean $n$ (i.e., the average over the simulated $n_1, n_2, n_3, \ldots$); the use of the mean data for computing the expected significance has been employed in [5,6] and [2,3], and was called the Asimov data in the latter three references. Refs. [2,3] obtained an Asimov approximation to $Z^{\rm med}_{\rm disc}$:
$$Z^{\rm CCGV}_{\rm disc} = \sqrt{2 \left[ (s+b) \ln(1 + s/b) - s \right]}, \qquad (5)$$
and ref. [4] gave a similar result for exclusion:
$$Z^{\rm KM}_{\rm excl} = \sqrt{2 \left[ s - b \ln(1 + s/b) \right]}. \qquad (6)$$
These are both based on a likelihood ratio method approximation (valid in the limit of a large event sample) for $Z$ given in [7] in the context of $\gamma$-ray astronomy.
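The same weighted enumeration gives $Z^{\rm mean}$; a sketch (ours, using the $Z = 0$ convention for $n = 0$ noted above), assuming SciPy:

```python
# Sketch of Z_mean by weighted enumeration: average Z over outcomes n with
# their pseudo-data weights, assigning Z = 0 to n = 0 for discovery.
import numpy as np
from scipy.special import gammainc, gammaincc
from scipy.stats import norm, poisson

def z_mean_disc(s, b):
    nmax = int(s + b + 10.0 * np.sqrt(s + b + 1.0)) + 10
    n = np.arange(1, nmax)                   # n = 0 contributes Z = 0
    w = poisson.pmf(n, s + b)                # discovery pseudo-data: H_{s+b}
    p = np.clip(gammainc(n, b), 1e-300, 1.0 - 1e-12)   # eq. (3), guarded
    return np.sum(w * norm.isf(p))

def z_mean_excl(s, b):
    nmax = int(b + 10.0 * np.sqrt(b + 1.0)) + 10
    n = np.arange(nmax)
    w = poisson.pmf(n, b)                    # exclusion pseudo-data: H_b
    p = np.clip(gammaincc(n + 1, s + b), 1e-300, 1.0 - 1e-12)   # eq. (4)
    return np.sum(w * norm.isf(p))

print(z_mean_disc(6.0, 2.0), z_mean_excl(3.0, 1.0))
```

The clipping guards against $p$-values that underflow to 0 or round to 1 for outcomes far in the tails, whose weights are negligible anyway.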
In this letter, we propose instead to simply use for the Asimov approximation the exact $p$-values in eqs. (3) and (4) with $n$ replaced by its expected means,
$$\langle n \rangle_{\rm disc} = s + b, \qquad \langle n \rangle_{\rm excl} = b, \qquad (7)$$
so that
$$p^{\rm A}_{\rm disc} = \gamma(s + b,\, b)/\Gamma(s + b), \qquad (8)$$
$$p^{\rm A}_{\rm excl} = \Gamma(b + 1,\, s + b)/\Gamma(b + 1), \qquad (9)$$
which can be readily converted to $Z$-values using eq. (1). We call this the "exact Asimov significance" and denote it by $Z^{\rm A}$. Along with $Z^{\rm med}$, Figure 1 also shows $Z^{\rm mean}$ and $Z^{\rm A}$ for the discovery and exclusion cases, together with $Z^{\rm CCGV}_{\rm disc}$ and $Z^{\rm KM}_{\rm excl}$, as a function of $b$, for fixed $s = 3, 6, 12$. Both $Z^{\rm mean}$ and $Z^{\rm A}$ are within the $Z^{\rm med}$ sawtooth envelopes, but decrease monotonically with $b$. We conclude that they are both sensible measures of the expected significance. In the discovery case, $Z^{\rm mean}$ is generally slightly more conservative than $Z^{\rm A}$, and the reverse is true for the exclusion case. The previously known Asimov approximations $Z^{\rm CCGV}_{\rm disc}$ and $Z^{\rm KM}_{\rm excl}$ of refs. [2,3] and [4] are considerably less conservative, lying near the upper edges of the $Z^{\rm med}$ sawtooth envelopes.
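In code, $Z^{\rm A}$ amounts to evaluating eqs. (8)-(9) at non-integer arguments; a sketch (ours), assuming SciPy, with the eqs. (5)-(6) approximations included for comparison:

```python
# Exact Asimov significances, eqs. (8)-(9), evaluated at the non-integer
# mean outcomes <n>_disc = s + b and <n>_excl = b, versus eqs. (5)-(6).
import numpy as np
from scipy.special import gammainc, gammaincc
from scipy.stats import norm

def z_asimov_disc(s, b):
    return norm.isf(gammainc(s + b, b))            # eq. (8) converted via eq. (1)

def z_asimov_excl(s, b):
    return norm.isf(gammaincc(b + 1.0, s + b))     # eq. (9) converted via eq. (1)

def z_ccgv_disc(s, b):
    return np.sqrt(2.0 * ((s + b) * np.log(1.0 + s / b) - s))   # eq. (5)

def z_km_excl(s, b):
    return np.sqrt(2.0 * (s - b * np.log(1.0 + s / b)))         # eq. (6)

# The closed-form approximations are less conservative, as stated in the text:
print(z_asimov_disc(6.0, 2.0), z_ccgv_disc(6.0, 2.0))
print(z_asimov_excl(6.0, 2.0), z_km_excl(6.0, 2.0))
```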
Not shown in Fig. 1 is $Z^{p\,{\rm mean}}$, which we find is much lower than all of the others, due to being dominated by unlikely outcomes with large $p$-values; it is therefore not a reasonable measure of the expected significance. (Although we do not recommend its use, we note the amusing fact that $Z^{p\,{\rm mean}}_{\rm disc} = Z^{p\,{\rm mean}}_{\rm excl}$, the proof of which does not rely on the assumed probability distribution, and so also holds exactly in the case of an uncertain background discussed below.) One sometimes sees $s/\sqrt{b}$ used as an estimate, but this is much larger than the $Z$'s shown in Fig. 1 and, as is well known, is not a good estimate of the expected significance except when $b$ is large.
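The quoted equality of the mean $p$-values can be checked numerically; a small sketch (ours), assuming SciPy. Both sums enumerate the probability of the same event, namely that an independent background-only count is at least as large as a signal-plus-background count:

```python
# Numerical check that <p_disc> under H_{s+b} equals <p_excl> under H_b.
import numpy as np
from scipy.stats import poisson

s, b, nmax = 6.0, 2.0, 400
n = np.arange(nmax)
mean_p_disc = np.sum(poisson.pmf(n, s + b) * poisson.sf(n - 1, b))
mean_p_excl = np.sum(poisson.pmf(n, b) * poisson.cdf(n, s + b))
print(mean_p_disc, mean_p_excl)   # agree up to truncation error
```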
Uncertain background case. More realistically, the expected mean number of background counts can be subject to uncertainties of various sorts. In high-energy physics, the background uncertainty for a future experiment is often dominated by limitations in perturbative theoretical calculations or systematic effects, both of which are unknown (and indeed difficult to rigorously define) but can be roughly estimated or conjectured. There are also statistical uncertainties that will arise from a limited number of events in control or sideband regions. Here, we will consider, in part as a proxy for other types of uncertainties, the "on-off problem" (see for example [7][8][9][10][11][12]), in which the background is estimated by a measurement of $m$ Poisson events in a supposed background-only (off) region. The ratio of the background Poisson mean in this region to the background mean in the signal (on) region is assumed to be a known number $\tau$. The point estimates for the Poisson mean and the uncertainty of the background in the signal region are then
$$\hat b = m/\tau, \qquad \Delta b = \sqrt{m}/\tau. \qquad (10)$$
While this Poisson variance is certainly not a rigorous model for systematic or perturbative calculation uncertainties, we propose that it can also be used as a rough proxy for them, in the sense that a proposed estimate for $b$ and $\Delta b$ can be traded for $(m, \tau)$ in the on-off problem. We now assign probabilities $\Delta P$ to each possible count outcome $n$ in the on region, given $m$ events in the off region, following a hybrid Bayesian-frequentist approach by averaging [10][11][12][13][14] over the possible background means using a Bayesian posterior with a flat prior,
$$P(b|m, \tau) = \tau\, e^{-\tau b} (\tau b)^m/\Gamma(m+1) \qquad (11)$$
(normalized so that $\int_0^\infty db\, P(b|m, \tau) = 1$), from which we then find
$$\Delta P(n, m, \tau, s) = \int_0^\infty db\, P(b|m, \tau)\, P(n|s+b) = \frac{\tau^{m+1} e^{-s}}{(1+\tau)^{m+n+1}} \sum_{k=0}^{n} \frac{\Gamma(m+n-k+1)}{\Gamma(m+1)\,(n-k)!} \frac{[s(1+\tau)]^k}{k!}. \qquad (12)$$
Note that here the true background mean $b$ appears only as an integration variable, and that $\sum_{n=0}^{\infty} \Delta P(n, m, \tau, s) = 1$ for any $m, \tau, s$. The limit $\lim_{\tau \to \infty} \Delta P(n, m, \tau, s)$, with $m/\tau = \hat b$ held fixed, recovers the Poisson distribution $P(n|s + \hat b)$. The integral form in eq. (12) is valid for non-integer $n$ and $m$, which is needed both to define $Z^{\rm A}$ below and to account for the fact that an estimated $\hat b$ and $\Delta b$ may correspond to non-integer $m$. The summation form is more useful when $n$ is an integer; in the case $s = 0$ only the $k = 0$ term survives, and one can replace $n!$ by $\Gamma(n+1)$. The $p$-value for discovery is then
$$p_{\rm disc}(n, m, \tau) = \sum_{k=n}^{\infty} \Delta P(k, m, \tau, 0) = I_{1/(1+\tau)}(n, m+1), \qquad (13)$$
where $I_x(a, b)$ is the regularized incomplete beta function, and the last form is valid for non-integer $n$ and $m$.
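The posterior average of eq. (12) and the incomplete-beta form of eq. (13) can be cross-checked numerically; a sketch (ours, not from the letter's code), assuming SciPy:

```python
# Sketch: Delta P of eq. (12) by direct numerical integration over the
# flat-prior posterior of eq. (11), and the discovery p-value of eq. (13).
import numpy as np
from scipy import integrate
from scipy.special import betainc
from scipy.stats import gamma, poisson

def delta_p(n, m, tau, s):
    """Eq. (12): integrate P(b|m,tau) * P(n|s+b) over b."""
    post = lambda b: gamma.pdf(b, m + 1, scale=1.0 / tau)   # eq. (11)
    val, _ = integrate.quad(lambda b: post(b) * poisson.pmf(n, s + b), 0, np.inf)
    return val

def p_disc_onoff(n, m, tau):
    """Eq. (13): sum_{k>=n} Delta P(k, m, tau, 0) = I_{1/(1+tau)}(n, m+1)."""
    return betainc(n, m + 1, 1.0 / (1.0 + tau))

m, tau, s = 4, 2.0, 3.0
probs = [delta_p(n, m, tau, s) for n in range(60)]
print(sum(probs))                                                  # normalized to ~1
print(sum(n * p for n, p in enumerate(probs)), s + (m + 1) / tau)  # means agree
```

The second printed line checks the mean expected count, $s$ plus the posterior mean $(m+1)/\tau$ of the background.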
For exclusion, we similarly find
$$p_{\rm excl}(n, m, \tau, s) = \sum_{k=0}^{n} \Delta P(k, m, \tau, s), \qquad (14)$$
which, upon inserting the summation form of eq. (12), involves a double sum. It can be rewritten as a more efficient single sum when $n$ is an integer, and also in forms valid for non-integer $n$ and $m$; the latter follow from each other by integration by parts and have differing ease of numerical evaluation depending on the inputs.
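For integer $n$, eq. (14) can be assembled from the summation form of eq. (12); a log-space sketch (ours, standard library only), valid for $s > 0$:

```python
# Sketch: the summation form of eq. (12) evaluated in log space for
# numerical stability, and the exclusion p-value of eq. (14) for integer n.
import math

def delta_p_sum(n, m, tau, s):
    """Summation form of eq. (12) for integer n and s > 0."""
    pref = (m + 1) * math.log(tau) - (m + n + 1) * math.log(1.0 + tau) - s
    tot = 0.0
    for k in range(n + 1):
        tot += math.exp(pref
                        + math.lgamma(m + n - k + 1) - math.lgamma(m + 1)
                        - math.lgamma(n - k + 1)
                        + k * math.log(s * (1.0 + tau)) - math.lgamma(k + 1))
    return tot

def p_excl_onoff(n, m, tau, s):
    """Eq. (14): cumulative sum of Delta P up to n (a double sum overall)."""
    return sum(delta_p_sum(k, m, tau, s) for k in range(n + 1))

m, tau, s = 4, 2.0, 3.0
# n = 0 case has the closed form exp(-s) * (tau/(1+tau))**(m+1):
print(delta_p_sum(0, m, tau, s), math.exp(-s) * (tau / (1.0 + tau)) ** (m + 1))
print(p_excl_onoff(5, m, tau, s))
```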
We can now consider the expected significances in the case that $\hat b$ and $\Delta b$ have been fixed, corresponding either to a calculation of the background with limited accuracy, or to a measurement of $m$ for a given $\tau$. This is done by generating pseudo-experiments for $n$, distributed according to the probabilities $\Delta P(n, m, \tau, s)$ for discovery and $\Delta P(n, m, \tau, 0)$ for exclusion, and then evaluating the $p$-values according to eq. (13) for discovery and eq. (14) for exclusion. As before, we consider $Z^{\rm med}$, $Z^{\rm mean}$, and $Z^{\rm A}$ obtained from the allowed pseudo-experiment data, each as functions of $s$, $\hat b$, $\Delta b$. Here, $Z^{\rm A}$ is obtained by replacing $n$ by its mean expected values. For the discovery and exclusion cases respectively, we find these are
$$\langle n \rangle_{\rm disc} = \sum_{n=0}^{\infty} n\, \Delta P(n, m, \tau, s) = s + (m+1)/\tau, \qquad (16)$$
$$\langle n \rangle_{\rm excl} = \sum_{n=0}^{\infty} n\, \Delta P(n, m, \tau, 0) = (m+1)/\tau, \qquad (17)$$
where
$$m = \hat b^2/\Delta b^2, \qquad \tau = \hat b/\Delta b^2 \qquad (18)$$
follow from inverting eq. (10). Then
$$p^{\rm Asimov}_{\rm disc}(s, \hat b, \Delta b) = p_{\rm disc}(\langle n \rangle_{\rm disc}, m, \tau), \qquad p^{\rm Asimov}_{\rm excl}(s, \hat b, \Delta b) = p_{\rm excl}(\langle n \rangle_{\rm excl}, m, \tau, s), \qquad (19)$$
which are converted to $Z^{\rm A}_{\rm disc}$ and $Z^{\rm A}_{\rm excl}$ as usual. Note that the mean expected event count in the absence of signal, $\langle n \rangle_{\rm excl} = (m+1)/\tau$, is distinct from, and larger than, the measured background estimate $\hat b = m/\tau$. That it is larger can be understood heuristically as the statement that, for finite $\tau$, a given $m$ is more likely to have been a downward rather than an upward fluctuation of the true background. As an extreme example, if $m = 0$, this could be a downward fluctuation of a non-zero true background, but obviously it could not be an upward one. Given $(m, \tau)$, depending on the experimental situation there may be other justifiable probability density functions besides eq. (11), and the subsequent discussion carries through similarly for any other choice. If we had chosen a different Bayesian distribution in eq. (11), then the expression for the mean expected background count (in terms of $m$ and $\tau$) would change. For this reason, we prefer to give results directly in terms of the independent variable $\hat b = m/\tau$ corresponding to the direct measurement (or calculation) of the background.
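For discovery, the $Z^{\rm A}$ evaluation with an uncertain background then reduces to a few lines: trade $(\hat b, \Delta b)$ for $(m, \tau)$ by inverting eq. (10), and evaluate the eq. (13) $p$-value at the non-integer mean expected count. A sketch (ours), assuming SciPy:

```python
# Sketch: exact Asimov discovery significance with uncertain background,
# using the incomplete-beta form of the on-off discovery p-value, eq. (13).
from scipy.special import betainc
from scipy.stats import norm

def z_asimov_disc_onoff(s, bhat, db):
    tau = bhat / db**2               # inverting eq. (10)
    m = bhat**2 / db**2              # inverting eq. (10)
    n_asimov = s + (m + 1.0) / tau   # mean expected count for discovery
    p = betainc(n_asimov, m + 1.0, 1.0 / (1.0 + tau))   # eq. (13), non-integer n
    return norm.isf(p)

# Larger background uncertainty lowers the projected significance:
for db in (0.1, 0.4, 1.0):
    print(db, z_asimov_disc_onoff(6.0, 2.0, db))
```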
Refs. [3] and [4] had earlier provided Asimov approximations to the median discovery and exclusion significances, respectively. [Equations (5) and (6) above are the limits as $\Delta b \to 0$.] However, those are not directly comparable to our definitions when $\Delta b \neq 0$, since they take the (unknown) true background mean $b$ as input, rather than the point estimate $\hat b = m/\tau$ as we do here. If one ignores the distinction and considers $b = \hat b$, then $Z^{\rm A}_{\rm disc}$ and $Z^{\rm A}_{\rm excl}$ as defined in this letter give more conservative significances than those obtained from [3,4].
Results for $Z^{\rm med}$, $Z^{\rm mean}$, and $Z^{\rm A}$ for discovery and exclusion are shown in Figure 2 for $\Delta b/\hat b = 0.2$, this time with $s$ and $\hat b$ both taken proportional to an integrated luminosity factor $\int L\, dt$ which represents the temporal progress of the experiment. We consider fixed ratios $s/\hat b = 2, 10, 100$ for discovery and $0.5, 5$ for exclusion. Again, the sawtooth behavior of $Z^{\rm med}$ is evident, while $Z^{\rm mean}$ and $Z^{\rm A}$ both lie within or near its envelope, and can be taken as reasonable and monotonic measures of the expected discovery and exclusion capabilities. Note that $Z^{\rm A}_{\rm excl}$ is more conservative than $Z^{\rm med}_{\rm excl}$ or $Z^{\rm mean}_{\rm excl}$ at higher integrated luminosities, while $Z^{\rm mean}$ is slightly more conservative for discovery. As before, $Z^{p\,{\rm mean}}_{\rm disc} = Z^{p\,{\rm mean}}_{\rm excl}$, not shown, gives far smaller values and cannot be recommended. In Fig. 3, we show $Z^{\rm A}_{\rm disc}$ and $Z^{\rm A}_{\rm excl}$ for $\Delta b/\hat b = 0$, $0.2$, and $0.5$. Consistent with intuition, increasing the background uncertainty reduces the expected significances, with a much greater impact when $s/\hat b$ is smaller.
Conclusion. In this letter, we have critically examined the use of the median expected significance $Z^{\rm med}$ and possible alternatives. We find that either $Z^{\rm mean}$ or $Z^{\rm A}$ as defined and evaluated above would be a reasonable measure of the discovery and exclusion capabilities of counting experiments with known or uncertain backgrounds. They both give results that are similar to $Z^{\rm med}$, but are monotonic in the expected way with respect to changes in background and signal means and background uncertainties. They are also considerably more conservative than previous Asimov approximations, especially when the background is small. The exclusion case with low event counts, where the sawtooth behavior of $Z^{\rm med}_{\rm excl}$ is particularly prominent and problematic, is noteworthy, as the success of the Standard Model of particle physics suggests the future importance of limit-setting capabilities for experimental signals with small rates, including rare decays, non-standard interactions, new heavy particle production, and dark matter searches.
In comparing $Z^{\rm mean}$ and $Z^{\rm A}$, we note that there is no "correct" measure of the expected significance, since the various $Z$ definitions are simply different answers to different questions. The $Z^{\rm A}$ measure is typically slightly less conservative in evaluating discovery prospects, and more conservative for exclusion prospects, than $Z^{\rm mean}$. It may be simpler to extend $Z^{\rm A}$ to the case of experiments that feature more complex statistics than just integer counts of events. Also, the $Z^{\rm A}$ measure, based on the means of the data distributions, is often simpler to evaluate; in the counting experiments considered here, it requires only directly plugging into eqs. (8)-(9) for a known background, or eqs. (10) and (13)-(19) for an uncertain background. For these reasons, we advocate that $Z^{\rm A}$ be the standard significance measure for projected exclusion and discovery sensitivities in counting experiments.

Supplementary Material
As noted above, in the case where the background estimate is determined by the method of measuring $m$ in the "off region" and translating it to the "on region" through $\tau$, it is possible to consider different Bayesian priors for the true background mean $b$, rather than the flat prior chosen in the main text. For a simple two-parameter class of examples, consider
$$P_{\rm prior}(b) \propto b^q\, e^{-\theta b},$$
where $q = \theta = 0$ recovers the choice made in the main text. Then one finds a normalized Bayesian posterior distribution for the background, in place of eq. (11):
$$P(b|m, \tau) = (\tau + \theta)\, e^{-(\tau + \theta) b}\, [(\tau + \theta) b]^{m+q}/\Gamma(m + q + 1).$$
The calculations of $\Delta P$, $p_{\rm disc}$, and $p_{\rm excl}$ would then go through as before with the replacements $\tau \to \tau + \theta$ and $m \to m + q$, with the results still expressible in terms of the independent variables $\hat b$ and $\Delta b$ as defined by eq. (10). In particular, the mean expected background event count would become $(m + q + 1)/(\tau + \theta) = \hat b\, [1 + (q+1) \Delta b^2/\hat b^2]/[1 + \theta \Delta b^2/\hat b]$ in that case. However, in the absence of a compelling reason to the contrary, we consider the simple flat prior $q = \theta = 0$ to be preferred, as it successfully reproduces the frequentist result eq. (13) for $p_{\rm disc}$, as shown in [11,12]. In any case, the $Z^{\rm mean}$ and $Z^{\rm A}$ measures can be defined as above with any suitable choice of prior as dictated by realistic considerations.
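The two-parameter prior family can be checked numerically; a sketch (ours, with the prior written as $P \propto b^q e^{-\theta b}$ as inferred from the stated replacements), assuming SciPy:

```python
# Sketch: with prior proportional to b**q * exp(-theta*b), the posterior is
# a gamma distribution with shape m+q+1 and rate tau+theta, so its mean is
# (m+q+1)/(tau+theta), the generalized mean expected background count.
import numpy as np
from scipy import integrate
from scipy.stats import gamma

def posterior(b, m, tau, q, theta):
    return gamma.pdf(b, m + q + 1.0, scale=1.0 / (tau + theta))

m, tau, q, theta = 4, 2.0, 1.5, 0.3
mean, _ = integrate.quad(lambda b: b * posterior(b, m, tau, q, theta), 0, np.inf)
print(mean, (m + q + 1.0) / (tau + theta))   # should agree
```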
We now show some further results supplementary to our main discussion. In Fig. 4, we first show the probabilities $\Delta P(n, m, \tau, s)$ for discovery (left panel) and $\Delta P(n, m, \tau, 0)$ for exclusion (right panel), for a fixed $\hat b = m/\tau$, as a function of the event count $n$ in the signal (on) region, for various values of $\tau$. The lines for $\tau = \infty$ in both panels correspond to the Poisson distribution $P(n|\mu)$ with $\mu = s + \hat b$ for the discovery case, and $\mu = \hat b$ for the exclusion case. For a fixed $\hat b$, as $\tau$ gets larger, the $\Delta P$ distribution approaches the Poisson distribution, as expected.
Intuitively, we also expect the discovery and exclusion significance measures to decrease substantially as the background uncertainty gets larger. From Fig. 5, we see that the median expected significance, once again, suffers from the sawtooth behavior. However, the expected significances $Z^{\rm mean}$ and $Z^{\rm A}$ behave as we expect and, as argued above, can be taken as reasonable measures of the expected discovery and exclusion significances. Also, it is evident from the figure that the $(\Delta b, \hat b) \to (0, b)$ limit works out smoothly.
One can consider other measures as alternatives to the median, mean, or Asimov expected $Z$. For a large number of pseudo-experiments simulated for the discovery case, we can count the fraction of experiments that achieve a greater than $5\sigma$ discovery, and thus obtain a probability $P(Z_{\rm disc} > 5)$. In Fig. 6, we compare $P(Z_{\rm disc} > 5)$ for $\Delta b/\hat b = 0$ (left panel) and $0.5$ (right panel). As we expect, $P(Z_{\rm disc} > 5)$ decreases, more drastically for smaller $s/\hat b$, as the background uncertainty increases. However, this measure also shows a sawtooth behavior, rather than increasing monotonically with $s = \sigma_s \int L\, dt$. Similarly, Fig. 7 shows the probability of obtaining a greater than 95% CL exclusion in a large number of pseudo-experiments simulated for the exclusion case, $P(Z_{\rm excl} > 1.645)$, for $\Delta b/\hat b = 0$ (left panel) and $0.5$ (right panel). Once again, increasing the background uncertainty reduces $P(Z_{\rm excl} > 1.645)$, more drastically for smaller $s/\hat b$. And, as was the case with $P(Z_{\rm disc} > 5)$, this measure also shows a sawtooth behavior with respect to changes in $s = \sigma_s \int L\, dt$.
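For the known-background case, $P(Z_{\rm disc} > 5)$ is again a weighted enumeration rather than an actual simulation; a sketch (ours, not from the Zstats package), assuming SciPy:

```python
# Sketch: P(Z_disc > 5) for a known background, by summing the H_{s+b}
# weights of all outcomes whose discovery p-value clears the 5-sigma level.
import numpy as np
from scipy.stats import norm, poisson

def prob_z_disc_above(s, b, zthresh=5.0, nmax=300):
    n = np.arange(nmax)
    p = poisson.sf(n - 1, b)          # discovery p-value, eq. (3)
    w = poisson.pmf(n, s + b)         # pseudo-data generated under H_{s+b}
    return np.sum(w[p < norm.sf(zthresh)])

print([prob_z_disc_above(s, 1.0) for s in (5.0, 10.0, 20.0)])
```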
Finally, Fig. 8 shows the probability of obtaining a significance greater than a given $Z$ in a large number of pseudo-experiments simulated for both the discovery (left panel) and exclusion (right panel) cases, for fixed $(s, \hat b)$ and $\Delta b/\hat b = 0, 0.5$, as a function of $Z$. As expected, both $P(Z_{\rm disc} > Z)$ and $P(Z_{\rm excl} > Z)$ decrease with increasing $Z$ and with increasing background uncertainty. However, for smaller $s/\hat b$, the background uncertainty does not have much impact on the results. A Python implementation of the various significance measures for projected exclusion and discovery sensitivities in counting experiments examined in this letter, including the advocated $Z^{\rm A}$, is made available in the code repository Zstats at https://github.com/prudhvibhattiprolu/Zstats. To illustrate the usage of the code, the repository also contains short programs that produce the data in each of the figures in this paper. More information about all functions in this package can also be accessed using the Python help function.

FIG. 6. The probability of obtaining a significance $Z_{\rm disc} > 5$, corresponding to a greater than $5\sigma$ discovery, in a large number of pseudo-experiments generated for the discovery case, for fixed ratios $s/\hat b = 2, 5, 10$, and $50$, as a function of $s = \sigma_s \int L\, dt$, for $\Delta b/\hat b = 0$ (left) and $\Delta b/\hat b = 0.5$ (right).