Is the bump significant? An axion-search example

Many experiments in physics involve searching for a localized excess over background expectations in an observed spectrum. If the background is known and there is Gaussian noise, the amount of excess of successive observations can be quantified by the runs statistic taking care of the look-elsewhere effect. The distribution of the runs statistic under the background model is known analytically but the computation becomes too expensive for more than about a hundred observations. This work demonstrates a principled high-precision extrapolation from a few dozen up to millions of data points. It is most precise in the interesting regime when an excess is present. The method is verified for benchmark cases and successfully applied to real data from an axion search. The code that implements our method is available at https://github.com/fredRos/runs .


I. INTRODUCTION
We revisit the problem of searching for a bump at an unknown location in a spectrum.
Specifically we assume there are L observations {y i } and the index i provides an ordering for the data, for example time, mass, energy. . . . In our background model that has no bump, the observations independently follow a Gaussian or Normal distribution where the expectation µ and the standard deviation σ are known for every i. In our previous work [1], we introduced the runs test statistic T to check the consistency of the background model with the observations. If a discrepancy is found, more specific analyses can be carried out to decide if a signal is present and to determine the parameters of the signal.
The main motivation behind the runs statistic is that it automatically takes care of the look-elsewhere effect (also called the trials factor) that arises in some other methods that look for a narrow peaks, for example in the search for the Higgs boson at the LHC [2,3].
There, the profile-likelihood ratio statistic was employed [4] which requires fully specifying both the background and the signal model including dependence on unknown parameters to be estimated from the data. For reliable estimates of the look-elsewhere effect, asymptotic normality and principled extrapolation from small to large significance had to be used [5,6].
In comparison, the runs statistic does not require a signal model and does not rely on asymptotic normality of the likelihood but assumes the background is known exactly. In [1], it was demonstrated that in this setting the runs statistic leads to a more powerful test than the classic χ 2 test in this peak-fitting problem.
Recounting the definition of the runs statistic, consider the sequence of L observations as consisting of success and failure runs, where the observation i is a success if it is above the background expectation, y i ≥ µ i . The runs statistic T is defined as the largest value of χ 2 for any success run where R represents the set of indices in an individual success run. Using the cumulative F (T |L), the p value is the tail-area probability to find T larger than the observed value T obs , As an illustration, a sequence of L = 1000 independent standard Gauss distributed random numbers (shifted and scaled) is shown in Fig. 1, while the running value of χ 2 , χ 2 i , is shown in the second plot. Note that χ 2 i is 0 initially and as soon as a failure is encountered; i.e., y i < µ i . Otherwise, it is incremented by (y i − µ i ) 2 /σ 2 i for every success. In the case shown, the largest observed χ 2 of any run leads to T obs = 15.8. Using our results presented below, this T obs is equivalent to a p value of 0.36 which suggests good agreement with the background-only hypothesis.
The exact probability distribution for the runs statistic T has been derived in [1], and code is available on github [7] to calculate the cumulative of the test statistic in mathematica and C++. The calculation time grows rapidly with L (roughly as exp √ L /L), and for L ≈ 200 becomes too long for practical use on todays CPUs, even with multiple cores. We derive here a formula that allows us to use the results for moderate L ≈ 100 to extrapolate to very large L 10 6 with high accuracy in the region of interest where T obs is large such that the p value is very small. We have implemented the extrapolation formula in our code [7]. This allows the use of the runs statistic in very long sequences of measurements with the correct statistical distribution for the test statistic without relying on expensive and somewhat inaccurate Monte Carlo simulations.
An application of our run statistic is described in the last section of this paper. The setting is an axion search experiment [8], where we eventually expect to have of order 5 · 10 7 power measurements integrated over ∼ 2 kHz intervals covering a frequency range of approximately 100 GHz. The data will be acquired in approximately 50 MHz data sets. The axion signal is expected to be very narrow, possible one to several 2 kHz bins wide, but with unknown shape, and we wish to make the minimum number of assumptions in a first pass through the data. We intend to use the run statistic to identify candidate signals in this spectrum.
Once a candidate signal is identified, the experimental setup can be modified to increase the signal-to-noise ratio considerably. However, changing the setup and acquiring more data is time consuming, and can only be performed relatively rarely. It is therefore important to understand the statistical significance of a putative signal.

II. CUMULATIVE OF THE TEST STATISTIC FOR LARGE L
We assume there is a sequence of L observations. Consider a partition L = N l + N r into two segments, where N l and N r denote the left-hand and right-hand part. Suppose that in each of the segments considered separately, the test statistics T l and T r are both less than T obs . Then there are exactly two ways this can occur. Either T < T obs for the entire sequence, or there is a run that crosses the boundary (condition C) and its χ 2 in each segment is less than T obs but the combined χ 2 is the largest of any run (condition M) and exceeds T obs : We denote by F (T obs |L) ≡ P (T < T obs | N l , N r , L) the value of the cumulative probability for the test statistic T for a total of L observations. Since the events are assumed independent, we can factorize P (T l < T obs , T r < T obs |N l , N r , L) = P (T l < T obs |N l , into the left and right parts and rearrange to find F (T obs |L) = F (T obs |N l )F (T obs |N r ) − P (T l < T obs , T r < T obs , T ≥ T obs , C, M|N l , N r , L) . (7) For both N l and N r large, there typically are many runs so it is unlikely that a boundaryspanning run has the largest χ 2 . Then Eq. (7) shows that the cumulative of the whole sequence is essentially the product of cumulatives for each segment minus a small correction.
A. χ 2 distribution of runs starting at the boundary Since the boundary is fixed and we require for the correction term that the run cross the boundary, it must necessarily have the first result on each side above the expectation. Consider the run segment on the right of the boundary: we can use the law of total probability to calculate the probability density as the sum of probability densities for runs of different length times the χ 2 probability for that number of degrees of freedom: where r is the length of the success run and we require at least one success. We have P (r|N r ) = (1/2) r+1 for r < N r and P (r = N r |N r ) = (1/2) Nr . The reason for the +1 for r < N r is the requirement that the result is below expectation for the r + 1 st sample, which also has probability 1/2. P (χ 2 |r) is the usual chi-squared probability density for r degrees of freedom so that we find For a χ 2 from a particular run to be our test statistic T , the run must span the boundary (condition C) and its χ 2 must be the maximum value for any run in the L range (condition M), so χ 2 l + χ 2 r = T . We calculate the probability that a contiguous run spanning the boundary satisfies the conditions specified as We implement condition M by weighting each possible contribution χ 2 l + χ 2 r with the probability that this is the largest in the full range L, F (χ 2 l + χ 2 r |L). This is the quantity we seek to compute so we cannot evaluate this expression directly. But so as F (T obs |L) → 1 we can write where we define and H(χ 2 r |N r ) is the cumulative of h. H can be expressed in terms of the cumulative of P (χ 2 |r), and requires no numerical integral.
With the approximation x(T obs , L) = F (T obs |L), Eq. (7) now simplifies to or We expect this expression to become exact as F (T obs |L) → 1, and to show some discrepancies at smaller values of F (T obs |L) where the approximation employed in Eq. (10) is not valid. Since we underestimate the correction term, we overestimate F (T obs |L). The error is larger at values of T obs where F is finite but not close to 1. We evaluate this effect for some examples below. For the more interesting region where F → 1, we expect our approximation to be excellent.
The correction ∆(T obs |N l , N r ) is nearly independent of N l , N r in our region of interest (N l and N r large) due to the Bernoulli suppression of long runs in h which is ∝ 2 N l,r /2 ; cf.
Eq. (8). We provide numerical results supporting this claim in Sec. III.
Let us consider the special case N l = N r = N and assume that indeed Taking n = 2 as the base case, we can generalize to arbitrary n ≥ 2 by induction to arrive at our main result To highlight the hidden assumptions in this procedure, we write down the induction step from n → n + 1 in detail using N l = nN, N r = N and suppressing the T obs dependencies This implies that the approximation x(T obs , nN ) = F (T obs |nN ) is consistently employed n − 1 times and that we neglect contributions from runs longer than 2N which is acceptable for the same reasons that ∆ can be considered independent of N . For concreteness, in our With this scaling equation, we can use exact results for moderate values of N to find the p value for our T obs for very large nN .

B. Trials factor
In many applications, the look-elsewhere effect just amounts to multiplying the p value by the number of trials, or trials factor. For example in a finely binned histogram with n bins the p value for the entire histogram is just n times the largest p value of any single bin [9]. Following the histogram analogy, we consider a batch of N successive data points as one bin. Neglecting the denominator in Eq. (17), the overall p value if both p(T obs |N ) and ∆ are small. We identify the trials factor as the number of batches n and remark that the proportionality is only approximate in our application as it mildly depends on T obs and possibly N .

III. NUMERICAL TESTS
We first display the behavior of F (T |L) and P (T |L) for different L in Fig. 3. For L = 100, the exact calculation is used, while for larger L the scaling formula Eq. 17 is used. As is   To further study the accuracy of expression 17, we take the difference of the cumulative distributions, F (T |n·100)−F (T |2n·50) for n = 10, 100, 1000 and plot these versus the value of the test statistic in Fig. 5. We see here that the calculations agree to better than the per mil level, with the largest differences visible for moderate values of the cumulative. This is exactly the region where the approximation used in Eq. (10) was expected to show small deviations. The correction term as evaluated is too small, so that F (T |2n·50) > F (T |n·100).
The effect decreases at larger T . In the most interesting case, where the p value is small (F (T ) large), the calculations are in excellent agreement with negligible differences. Finally, we compare the calculated cumulative probabilities with results from Monte Carlo simulations using the MT19937 random number generator [10] available in the Gnu Scientific Library [11]. The Box-Muller algorithm from the Gnu Scientific Library is then  Carlo cumulative probability and compare to that calculated. We estimate the statistical uncertainty on the Monte Carlo result using the binomial probability standard deviation [13].
The results are shown in Fig. 6. The precision can be evaluated as where the inequality holds since F (T |N ) ≤ 1 and ∆(T ) ≥ 0, and ⊕ indicates addition in quadrature. To reach a given precision on p, we require that F (T |N ) and ∆(T ) be evaluated to an absolute precision /n. As an example, for L = 10 6 , N = 100, and = 10 −5 , we would need an absolute precision on F (T |N ) and ∆(T ) at the 10 −9 level. We have verified that this can be reached in practice: regarding F (T |N ), our two implementations [7] in mathematica and C++ agree at the 10 −15 level, and with 1D numerical integration ∆(T ) can be computed at the 10 −10 level.

IV. TEST CASE -(FAKE) AXION SEARCH
As an example of the use of our run statistic, we consider an experimental setup at the Max Planck Institute for Physics designed to search for axions in the mass range 40 − 400 µeV [8]. The measurement is effectively a power spectrum as a function of the frequency of emitted microwave radiation built from many independent measurements. The baseline signal is dominated by the thermal background (10 −19 W). A weak fake axion signal was injected; the location and width of the signal were unknown when the analysis was carried out. Although the shape (Gaussian) was known, this information was not used. The spectrum to be analyzed is shown in Fig. 7. The shape of the background spectrum and level of fluctuation was unknown, and had to be determined from the data by assuming that any possible signal would provide a much sharper feature than any change in background. In the example considered here, the spectrum consists of 24576 data points giving the integrated power in ≈ 2 kHz intervals. A signal is expected to be one or a few bins wide. To minimize the number of assumptions made about the signal shape, we scanned the spectrum with our run statistic to determine if there was a significant deviation from background expectations.
To determine the background shape and fluctuations, the full spectrum was partitioned into contiguous subsets and a second order polynomial fit was used to find the background shape. The residuals were used to extract the standard deviation of the fluctuations. Different lengths of the subsets were considered from a minimum of 96 measurements to 256 measurements. It was verified that the fluctuations relative to the fit function followed the expected Gaussian distribution.
The distribution of T for the 256 sets of 96 data points in shown in Fig. 8, and compared to the expectation from the exact calculation of [1] for N = 96. The largest value found was T obs = 57.3. For N = 96, that is if the total number of observations had been only 96, this has a p value of p = 6.4 · 10 −9 . To get the p value for all observations, we use the expression Eq. (17) and find a p value of p = 1.9 · 10 −6 , which is very small. The frequency at which this signal was found was indeed the frequency of the injected 'fake axion'. For comparison, the second most significant p value of a test statistic found in one of our runs was 2 · 10 −5 taking L = 96, which becomes p ≈ 6 · 10 −3 when taking the full 24576 samples into account.
These changes of the p value illustrate the importance of the look-elsewhere effect. To infer the p value from N for nN observations, the obvious guess for the trials factor in Eq. (22) is 256, which is too small by 20 % for the above numbers. Our more accurate result based on Eq. (17) is achieved for essentially the same computational effort.  To appreciate the usefulness of Eq. (17), we consider the computational cost. Proposition 1 from [1] states that computing the exact expression for the distribution of T for L = 24576 requires a sum over the enormous number of 2.6 · 10 169 integer partitions, something that is completely unfeasible with the best supercomputers today. But with our approximate result, we only need to compute the exact expression for N = 96 which requires 1.4 · 10 8 partitions, or a reduction of work by 161 orders of magnitude. The algorithm is of the streaming type and the partitions need not be stored but can be independently processed, so it makes ideal use of modern multi-core computers. On a desktop computer with an Intel i7-4700 CPU with four cores and eight openMP threads, the C++ implementation [7] of F (T |N = 100) is computed in 1.8 s. With some reduction in precision and range of validity, computing for N = 50 as the baseline requires less than 10 ms.

V. CONCLUSION
We presented an extension of our previous work [1] where we introduced the runs statistic T to detect bumps that are incompatible with a background model for ordered 1D data sets such as spectra. We derived the exact distribution of the statistic needed to compute a p value before but the expression could not be evaluated beyond about 100 data points within reasonable time. In this work, we describe an approximation Eq. (17) that takes the results from few data points and extrapolates to millions of points in a principled manner by dividing the data into chunks. In the region of interest where the observed value T obs is large and the p value is small, the approximation has both high accuracy and precision. We