Statistical interpretation of sterile neutrino oscillation searches at reactors

A considerable experimental effort is currently under way to test the persistent hints for oscillations due to an eV-scale sterile neutrino in the data of various reactor neutrino experiments. The assessment of the statistical significance of these hints is usually based on Wilks' theorem, whereby the assumption is made that the log-likelihood is $\chi^2$-distributed. However, it is well known that the preconditions for the validity of Wilks' theorem are not fulfilled for neutrino oscillation experiments. In this work we derive a simple asymptotic form of the actual distribution of the log-likelihood based on reinterpreting the problem as fitting white Gaussian noise. From this formalism we show that, even in the absence of a sterile neutrino, the expectation value for the maximum likelihood estimate of the mixing angle remains non-zero with attendant large values of the log-likelihood. Our analytical results are then confirmed by numerical simulations of a toy reactor experiment. Finally, we apply this framework to the data of the Neutrino-4 experiment and show that the null hypothesis of no-oscillation is rejected at the 2.6\,$\sigma$ level, compared to 3.2\,$\sigma$ obtained under the assumption that Wilks' theorem applies.

Despite the strong evidence for the existence of physics beyond the Standard Model (SM), searches for new particles at high-energy colliders have been so far unsuccessful. A possible explanation is that the new physics lies at relatively low scales, and that the dark and the visible sectors communicate through a very weakly-interacting particle. A particularly appealing scenario in this context is the addition of right-handed neutrinos to the SM, which could also source the SM neutrino masses. While right-handed neutrinos are singlets of the SM gauge group (and therefore sterile under the SM interactions), they may lead to observable signatures through their mixing with the SM neutrinos at a variety of experiments, depending on their masses. In particular, and motivated by the LSND anomaly [1], considerable experimental effort was directed towards the search for eV-scale right-handed neutrinos using short-baseline oscillation experiments over the past two decades, see Ref. [2] for a recent review. In this article we focus on reactor neutrino experiments, which have been playing a crucial role in neutrino physics since the discovery of the neutrino by Reines and Cowan [3], and have been central to the search for eV-scale sterile neutrinos.
Calculations of the anti-neutrino fluxes emitted from nuclear reactors performed in 2011 [4,5] have lead to the so-called "reactor anti-neutrino anomaly" [6], providing a hint for the existence of sterile neutrinos due to an observed mis-match between predicted and observed rates. This hint remains controversial up to today, see Refs. [7,8] for discussions of the latest developments on this issue. Therefore, modern experiments focus on the relative comparison of measured spectra at different baselines [9][10][11][12][13][14][15][16], which is more robust against uncertainties in flux predictions. Recent global fits of reactor data still find indications for sterile neutrino oscillations at the level of 2 − 3σ, even when using only spectral ratios [17][18][19][20]. Those indications are based purely on spectral distortions which may feature oscillatory patterns.
Note, however, that the necessary conditions to apply Wilks' theorem [21] are typically not fulfilled for sterile neutrino searches in oscillation experiments, which can lead to wrong results when evaluating significance or confidence levels based on χ 2 values. 1 Therefore, one should wonder if the hints may be coming from a mis-interpretation of the data. This has recently been pointed out in Refs. [27][28][29], where the authors have shown by way of Monte Carlo simulations that that there are indeed corrections and that the statistical significance of the hints is reduced. In this paper we attempt to go one step further and, besides providing analytical arguments that allow to understand the expected distribution for the test statistics, we also study the dependence of the observed corrections on relevant experimental parameters and numerical details of the analysis.
The outline of the paper is as follows. In section II we start with some general remarks, introduce a test statistic to evaluate the significance of the presence of sterile neutrino oscillations and give some qualitative arguments why it is likely that experiments find a hint for sterile neutrinos even if there are none. In section III we consider an idealized disappearance experiment and derive the expected distribution of the test statistics. We show that the above statement is a consequence of the fact that in a Fourier composition of white noise some frequency will appear with largest amplitude. This will allow us to make predictions for the expected distribution of the best fit points for sin 2 2θ. In section IV we perform numerical simulations of toy reactor experiments, and we study in detail the distribution of the test statistic as well as the location of the best-fit points. We also investigate the impact of various parameters, such as restriction to the physical region, the impact of systematics, or alternative χ 2 definitions. In section VI we consider as a case study the recent results for the Neutrino-4 experiment [16,30], which has reported a ∼ 3σ hint for sterile neutrino oscillations. We perform a Monte-Carlo study of the Neutrino-4 data and show that the significance for the presence of sterile neutrinos is somewhat lower than expected under the assumption of a χ 2 distribution of the test statistics, and we present the confidence regions obtained by explicit Neyman-Pearson construction based on the Feldman-Cousins prescription [31]. We conclude in section VI.

II. GENERAL REMARKS
Reactor neutrino experiments look for the disappearance of electron anti-neutrinos. In this work we assume that a single sterile neutrino is relevant for the phenomenology and we consider only experimental setups where the baseline is short enough such that oscillations due to the standard three-flavor mass-squared differences can be safely neglected. In this limit, sterile neutrino oscillations are described by an effective two-flavor survival probability: where E is the neutrino energy, L is the baseline, θ is an effective neutrino mixing angle, and ∆m 2 stands for the mass-squared splitting between the eV-scale mass state and the light SM neutrinos. In order to analyze the results of a given experiment a least-squares function of binned spectral data is considered, χ 2 (sin 2 2θ, ∆m 2 ). A common test statistic T for evaluating the hypothesis of the presence of sterile neutrino oscillations is the ∆χ 2 (or, equivalently, the likelihood ratio) between the best-fit point and the no-oscillation case: where sin 2 2θ and ∆m 2 indicate the parameter values at the χ 2 minimum. If Wilks' theorem [21] applies, T should be distributed as a χ 2 -distribution for 2 degrees of freedom (DOF), corresponding to the two minimized parameters. Indeed, for the problem at hand, there are several reasons to suspect that the necessary conditions for Wilks' theorem to apply are not fulfilled: First, there is a physical boundary for the mixing angle, sin 2 2θ ≥ 0. Second, the parameter ∆m 2 becomes undefined for sin 2 2θ → 0 and sin 2 2θ becomes unphysical for ∆m 2 → 0. Third, the cosine dependence on ∆m 2 of the oscillation probability in eq. (1) leads to a strong non-linear behavior. Therefore, significant deviations of the distribution of T from a χ 2 -distribution are expected a priori, see also [27,31].
As we will show in section III, for an idealized situation the distribution of √ T is the one of the maximum of N standard normal random variables, where N corresponds to an effective number of bins. We will give physical arguments, as to why for this type of experiments a non-vanishing value for sin 2 2θ at the best-fit is likely, with a relatively large value of T . In fact, its typical value is set by the size of the relative statistical uncertainty of the sample. In section IV we will compute the distribution of T for more realistic configurations and will always confirm rather large deviations from a χ 2 -distribution. This suggests that reliable statements about significance and confidence levels require explicit Monte-Carlo simulations, in agreement with previous results [27,28]. We will demonstrate this explicitly using the recent results from Neutrino-4 in section V.

III. EXPECTED DISTRIBUTION OF THE TEST STATISTICS
In this section we derive the expected test statistic T for a toy model. After making a series of assumptions that allow us to write the randomly fluctuated events in each bin as a discrete Fourier transform, we will proceed and minimize the χ 2 function analytically. This will provide us with an expression in terms of the Fourier coefficients of the expansion, which we can then substitute into eq. (2) to get an analytical expression for the expected test statistic.
A. Derivation of the test statistic for a toy model Let us consider a toy model for an oscillation disappearance experiment. We consider N bins in L/E and write the predicted event number in each bin as where p 0 i is the predicted number of events in case of no oscillations, P osc i is given in eq. (1), and the index i labels the bins in L/E. Let us now adopt the following assumptions: (a) We assume that there is no sterile neutrino in Nature, i.e., the observed data in bin i is given by the no-oscillation prediction plus a statistical fluctuation with variance σ 2 i : For Poisson statistics we would have σ i = p 0 i . For simplicity we assume that δd i are Gaussian.
(b) We assume that only shape information in L/E is used, but not on the absolute normalization. This applies to experiments where energy spectra are fitted leaving the overall normalization free, but also to setups or combinations of experiments where relative spectra at different baselines are considered.
The second assumption is implemented in a χ 2 by introducing a free pull parameter ξ, such that where minimization with respect to ξ is understood. Working to linear order in ξ and sin 2 2θ, and assuming that terms involving a sum over δd i or cos(∆m 2 L/2E i ) average to zero, one finds that ξ ≈ (1/2) sin 2 2θ minimizes the χ 2 . Using this together with assumption (a) above we find where n i ≡ δd i /σ i are independent standard normal random variables with n i = 0, n i n j = δ ij , see eq. (4).
Let us now adopt the additional simplifying assumptions to build a mathematical toy model: (c) We assume that the relative statistical error σ i /p 0 i has the same value for each bin and define the new parameter Although this is not strictly the case for a reactor experiment, in section IV we will see that it works relatively well for the experimental setups under consideration in this work. Furthermore we assume that bins have equal width in L/E and define Hence, j labels bins in L/E while the index κ labels discrete frequencies proportional to ∆m 2 . With this idealization, eq. (6) becomes We see that in this limit the sterile neutrino search is equivalent to fitting Gaussian whitenoise with a cosine function with the amplitude s and the frequency κ as free parameters. This form suggests to consider the discrete Fourier transform of the N random variables n i : with a κ , b κ ∈ R. Focusing on the cosine term, the coefficients a κ can be computed as Since n i are independent standard Gaussian variables, it is clear that a κ are random Gaussian variables as well, with where we have assumed that sums over cos ϕ κi (cos 2 ϕ κi ) average to 0 (N/2). Let us now look for the best fit point (ŝ,κ). We start by minimizing the χ 2 in eq. (9) with respect to s, for fixed κ. This giveŝ where in the last step we have inserted n i from eq. (10), using the fact that all terms average to zero except the one containing cos 2 ϕ κi . This implies that, for fixed κ,ŝ(κ) follows a Gaussian distribution with its mean and variance as given by eq. (12). Also, we see that, for fixed κ, the χ 2 is minimized by choosing s as the Fourier coefficient corresponding to the frequency κ. Inserting the Fourier transform from eq. (10) as well as the solution from eq. (13) into the χ 2 in eq. (9), we find where in the second step we have expanded the square and used the fact that only terms of the form cos 2 ϕ λi or sin 2 ϕ λi survive, while all mixed terms average to zero. Here, the constant C contains the b λ terms and is independent of κ.
Next, we minimize with respect to κ. Since eq. (14) is a sum of positive terms, the χ 2 would be minimal for κ =κ, such that aκ is the Fourier coefficient with the largest absolute value. However, considering the definition of s in eq. (7), we see that the physical requirement sin 2 2θ ≥ 0 implies s ≥ 0. Therefore, if the minimization is restricted to the physically allowed region we obtain For N Gaussian variables with a κ = 0, the probability that all a κ are negative is (1/2) N . Hence, for sufficiently large N it is very likely to obtain at least one positive a κ , such that eq. (15) leads to a positive best-fit point for the parameter s (and therefore for sin 2 2θ). For simplicity we neglect hereafter the unlikely case that none of the a κ is positive. Finally, let us consider the test statistic T = χ 2 (0, 0)−χ 2 (ŝ,κ) defined in eq. (2). Using eqs. (9) and (10) we find the χ 2 for the SM point as where C is the same constant as in eq. (14). Evaluating now the minimum of the χ 2 , χ 2 (ŝ, κ) as given in eq. (14) at k =κ, we finally obtain whereã κ ≡ N/2a κ are standard normal random variables, ã κ = 0, ã κãλ = δ κλ (see eq. (12)).

B. Discussion
Equations (15) and (17) are the main results of this section. The latter shows that the square-root of the test statistic T has the distribution of the maximum of N standard normal variables. It is proportional to the best fit amplitudeŝ, and hence, up to a normalization factor, the best-fit point in eq. (15) follows the same distribution. Distributions of this type are considered in the field of "extreme value statistics", see e.g., [32,33].
For the case of Gaussian variables of interest here, there exists a limiting distribution for N → ∞. It is based on the so-called Gumbel distribution e −e −z . Let x = max iãi , wherẽ a i are N standard normal variables. For finite N the cumulative probability distribution (CDF) F (x) can be approximated by [33]: with In fig. 1 (left) we show 1-CDF for the maximum of N standard normal variables obtained by numerical calculations (solid) compared to the approximate formula in eq. (18) (long-dashed) for N = 30 and 60. We see that they agree reasonably well for 1 − CDF 0.1, but start to deviate for smaller values. Indeed, the convergence to the Gumbel distribution goes only as 1/ log N [33]. Therefore, since the distribution can be easily calculated numerically, we will for the rest of the paper stick to the numerical method and denote this distribution by "Max. Gauss" in the following.
The important property of this distribution is, that small values of T are rather unlikely. In fig. 1 we compare 1-CDF as well as the probability density function (PDF) for √ T to the one for a χ 2 distribution. Indeed, if Wilks' theorem was applicable, T should be distributed as χ 2 with 2 DOF. Obviously, the conditions for Wilks' theorem to hold are badly violated in this case, for the reasons mentioned in section II. The peak at √ T ∼ 2 and the small probability to obtain √ T 1 indicates that, even if there is no sterile neutrino present in Nature, it is very likely to obtain a best-fit point with finite sin 2 2θ as well as relatively large value of T . This would lead to claiming a signal at relevant statistical significance, if evaluated with a χ 2 -distribution. The physical reason for this behavior can be understood from eq. (9): in a white noise spectrum it is very likely to find some frequency with sizable amplitude that is able to fit the data.
The expectation value for a random variable z with the CDF F (z) = e −e −z is given by (17) and (18) follows where the numbers hold for N ≈ 30 . . . 60. These values agree to a good accuracy with the mean values obtained numerically, and depend only weakly (logarithmically) on N . We can use these results to estimate the expectation value for sin 2 2θ. Let N be the total number of observed events. According to assumption (c) above we have p 0 i ≈ N /N and σ i = p 0 i . Then eq. (7) leads to where in the second relation we have used the numerical values from eq. (20). We see that up to a numerical factor, the expected best fit value for sin 2 2θ is set by the relative statistical error of the event sample. We will find this behavior in the simulations discussed in the following sections. From fig. 1 we see that there is a lower bound of √ T 1.5 at 99% CL for N = 30 . . . 60, which translates into a lower bound on sin 2 2θ according to eq. (21).
To conclude this section, we remark that the idealized situation considered here is certainly an over-simplification, and especially assumption (c) will not be satisfied in a realistic oscillation experiment. Nevertheless, these considerations capture the most relevant features and the results obtained here allow an intuitive understanding of the numerical results we are going to present below. In particular, the preference for the presence of sterile neutrino oscillations even in case of no true signal is predicted from those arguments, and allows a qualitative (in some cases even quantitative) understanding of the more realistic simulations discussed in the remainder of this paper.

IV. NUMERICAL SIMULATIONS FOR A TOY EXPERIMENT
A. Description of the simulation In order to verify the validity of the analytical approach presented in the previous section, we now proceed to perform a numerical simulation for a toy experiment. For this purpose, we choose a reactor disappearance experiment which aims to set a constraint on the sin 2 2θ − ∆m 2 parameter space from the observation ofν e →ν e oscillations. We consider generic shapes of the anti-neutrino flux and inverse beta-decay detection cross section. The distance between the reactor core and the detector is set to L = 10 m. In order to account for the finite size of the reactor core, the probability is averaged over a window ∆L = ±1 m: This ensures that fast oscillations are averaged-out at the detector. Unless otherwise stated, the exposure is set such that the total number of events is 1.5 × 10 4 . A binned χ 2 analysis is performed, using 43 bins in energy of equal size distributed between 2 and 8 MeV, and a Gaussian energy resolution of the form σ(E) = 0.03 E/MeV is applied to the event distributions. The experimental details outlined above have been chosen to lie in the same ballpark as for some of the running short baseline reactor disappearance experiments [12][13][14][15][16]. However, we have explicitly checked that changing any of these parameters does not qualitatively affect our results. Finally, we have assumed negligible backgrounds in our analysis for simplicity, in order to ease the interpretation of our results. Again in this case, we have checked that the inclusion of a sizable background component does not alter qualitatively our conclusions.
The results presented in this section have been obtained by simulating a large sample of pseudo-experiments, applying random statistical fluctuations to the expected event rates for the reactor experiment setup outlined above. Since here we are mostly interested in evaluating the significance of a potential positive signal, throughout this section we will generate random data under the null-hypothesis, that there is no sterile neutrino in Nature, i.e., for no oscillations. These are generated on a bin-per-bin basis, sampling a normal distribution with its mean set to the expected number of events in a given bin for the SM hypothesis p 0 i ≡ p i (θ = 0), and its width set to the associated statistical uncertainty p 0 i . For the large number of events considered here the Gaussian approxi-mation to the Poisson distribution is well justified. Unless otherwise stated, the number of pseudo-experiments simulated is set to 20,000 for each of the cases studied in this section. The sample of pseudo-experiments will then be used to determine the distribution of our test statistics T defined in eq. (2). In order to do so, for each pseudo-experiment a Poisson χ 2 function is built. For a set of parameters (θ, ∆m 2 ), it reads: where p i is the expected number of events in the i-th bin for θ and ∆m 2 (in the absence of statistical fluctuations), while d i is the "observed" number of events, i.e., the pseudo-data generated as described above. Here, ξ is a nuisance parameter, introduced in order to account for the systematic uncertainty in the prediction of the expected event rates. Once eq. (23) has been computed, a pull-term is added and the result is minimized over the nuisance parameter ξ: where σ sys stands for the prior uncertainty on the signal normalization. Here,ξ is a parameter introduced to account for the fact that the normalization of the signal is typically obtained from previous experimental data, which is also subject to statistical fluctuations. In order to account for the associated uncertainty, for each pseudo-experiment the value ofξ is drawn from a normal distribution centered at zero and with a width equal to σ sys , as for instance in Ref. [24]. For each pseudo-data realization we minimize the χ 2 in eq. (24) with respect to sin 2 2θ and ∆m 2 and calculate a value for the test statistic T = χ 2 (0, 0) − χ 2 min . From the ensemble of all simulated realizations we obtain then the expected distribution of T under the null-hypothesis of no oscillations. Equation (23) will be our default χ 2 definition. But we have also studied the case where the Poisson χ 2 in eq. (23) is replaced by a Gaussian χ 2 : where σ i = √ d i stands for the statistical uncertainty associated to the number of events observed in each bin. As we will show below, the Gaussian χ 2 can lead to different results for the distribution of T , unless one takes σ i = √ p i instead of √ d i . Only in the latter case the two χ 2 definitions eqs. (23) and (25) lead to the same results.
For the results presented in the following we assume a single baseline setup. However, the arguments presented in section III apply also to multi-baseline configurations, for instance when ratios of spectra at different baselines are considered, or in case of segmented detectors with additional L information. The derivation in section III relies only on general binning in (L/E) including also bins in L. We have verified explicitly that the simulation of a setup combining energy spectra at two different baselines leads to very similar results as the single-baseline configuration. This is also confirmed in section V, when we consider the Neutrino-4 experiment. Figure 2 presents the results of two simulations with different exposures: our default setup, with N = 1.5 × 10 4 total number of events (dark blue), and the same setup with 100 times more events, N = 1.5 × 10 6 (light blue). The left panel shows the distribution of the test statistic T . We observe a clear deviation from the χ 2 distribution, and a good agreement with the max. Gauss distribution derived in section III. The agreement is excellent for the high statistics case and, in particular, we obtain the best match when the number of bins for the max. Gauss distribution is set at N = 45, to be compared with the 43 spectral bins used in the simulation. The reason for this (small) difference is that for a more realistic spectrum, some of the assumptions from section III are only approximately fulfilled. In particular, assumption (c) (defined in section III A) requires that relative statistical errors are equal in all bins, which is obviously not true for a peaked spectrum as in reactor experiments. Therefore, N = 45 should be considered as the effective number of random standard normal variables, which leads to the best representation of the T distribution from simulation. In table I we show, for various values of the test statistic T , the significance which would be obtained by assuming a χ 2 distribution for 2 DOF (as we would expect if Wilks' theorem held) compared to the correct result following from the max. Gauss distribution . For comparison, the solid curves show the expected sensitivity at 95% CL assuming that Wilks' theorem holds (that is, the contours corresponding to ∆χ 2 = 5.99). These results have been obtained using a Poisson χ 2 , with no background, for 10% signal systematics, and restricting 0 < sin 2 2θ < 1 in the fit.

B. Results
for N = 45. For example, if a value of T = 11.83 is observed, we would exclude the SM at 3σ (p-value 0.27%) under the assumption of Wilks' theorem, while the correct significance would be only 2.48σ (p-value 1.3%). As a rule of thumb, we can see from the table that p-values are under-estimated by about a factor 5, and the number of σ gets reduced by roughly 0.5σ (except for low CL, where the difference is close to 1σ). The right panel in fig. 2 shows the distribution of the best-fit points obtained in the simulations. Although the pseudo-data has been generated under the no-oscillation hypothesis, we observe a clear preference for a non-vanishing value of sin 2 2θ. Note that actually none of the best fit points is located near the "true value" (sin 2 2θ = 0). Obviously sin 2 2θ and ∆m 2 are biased estimators in this case. Comparing the light and dark blue points we confirm the scaling of the value of the mixing angle at the best-fit with the relative statistical error 1/ √ N , as expected from the discussion in section III. Indeed, in the region 0.5 eV 2 ∆m 2 3 eV 2 the mean value of the sin 2 2θ best fit points agrees rather well with the prediction from eq. (21), as indicated by the vertical dashed lines. 2 However, outside this region of ∆m 2 the best-fit points lie at larger values for the mixing angle. The reason is that for extreme values of ∆m 2 the idealizations assumed in section III A do not apply. For example, the feature around ∆m 2 ∼ 0.4 eV 2 corresponds to ∆m 2 L/(2E) π at E 3 MeV and, as a result, the first minimum of the survival probability is located at the peak of the event spectrum. This corresponds roughly to the case where half an oscillation period fits into the effective energy range, and therefore corresponds to the minimal frequency which can be sampled by the data. In contrast, for high mass-squared differences the frequency becomes much higher than the bin width can capture and therefore corresponds to over-sampling of the data. Hence, in both cases we are leaving the domain of the discrete parameterization of ∆m 2 in terms of the index κ = 1, . . . , N adopted in section III A, see eq. (8), which leads to the observed deviations with respect to the estimate in eq. (21).
For comparison, we also show in the right panel of fig. 2 the expected sensitivity contours at 95% CL, obtained under the assumption that Wilks' theorem holds (solid lines). As can be seen from the figure, the best-fit points always lie very close to the expected sensitivity limit in this case and, for a sizeable fraction of the pseudo-experiments simulated, they lie to the right of the sensitivity curve, if naively computed assuming a ∆χ 2 for 2 DOF (as is usually the case in the literature).
Let us now discuss the impact of a systematic uncertainty on the overall normalization of the spectrum for the distribution of the test statistic T . Figure 3 shows the results obtained for different assumed priors on the systematic error for the signal, σ sys = 20%, 1% as well as the no-systematics case. In all cases we assume a total number of events of N = 1.5 × 10 4 . We see from the figure that the distribution for the no-systematic case is somewhat χ 2 -like, with a number of DOF between 1 and 2. Although some deviations from this behaviour are observed (due to the effect of the physical boundary sin 2 2θ ≥ 0 as well as the non-linearity of the model), the distribution is clearly different from a max. Gauss. The reason is that in section III we assumed that only shape information is used (assumption (b)), whereas in the absence of systematic errors the information on the total event rate is also available. In this case the model of fitting white noise, eq. (9), does not fully correspond to fitting the disappearance probability in eq. (1), which can only reduce the event numbers. In contrast, for the case σ sys = 20% the systematic uncertainty is much larger than the statistical one for the assumed event sample, σ sys 1/ √ N . This corresponds effectively to a free normalization in the fit and assumption (b) is satisfied. Correspondingly, we observe in fig. 3 a very good agreement with the max. Gauss distribution for this case. 3 For the 1% case we have σ sys 1/ √ N , 2 We have verified that the range of values of ∆m 2 where this is satisfied scales with the baseline as expected from ∆m 2 L = const. 3 Note that also for the 10% systematic assumed in fig. 2  which corresponds to an intermediate situation between fixed and free normalization. Figure 3 also shows the impact due to the treatment of systematics when simulating the random pseudo-data. Solid curves show the results obtained randomizing the central value of the pull parameter, i.e., for each realization of the pseudo-data we drawξ in eq. (24) from a Gaussian distribution with width σ sys (as outlined in section IV A). In contrast, for the dash-dotted curves the central value for the pull parameter is not randomized and kept fixed atξ = 0 for all pseudo-data samples. We see that this has a rather large impact on the √ T distribution as long as systematic and statistical errors are comparable (σ syst = 1%), whereas for an effectively free normalization (σ syst = 20%) the difference is largely reduced. The reason is that in the latter case the fit can always adjust the normalization within the statistical uncertainty, with negligible impact of the penalty term for the pull parameter.
(b) defined in section III A is fulfilled. FIG. 4. Impact on the distribution of √ T due to the choice of the χ 2 implementation. Solid curves correspond to a Poisson χ 2 , while dashed lines correspond to a Gauss χ 2 . The left panel shows the effect of increasing the systematic uncertainties from no systematics to an overall 10% signal normalization uncertainty. In the right panel we show the results by imposing different restrictions on the allowed range for sin 2 2θ, as indicated by the labels. In all cases, a total of 1.5 × 10 4 events are simulated for each pseudo-experiment. Dotted curves indicate the χ 2 distribution for 1 to 5 DOF from left to right.

C. Further studies of the properties of the T distribution
In this subsection we investigate in some detail additional properties of the distribution of the test statistic T . We start by discussing the impact of the two χ 2 implementations from eq. (23) (Poisson χ 2 ) versus eq. (25) (Gauss χ 2 ). Naively one expects that they should give similar results if the number of events per bin is 10. The differences on the resulting distributions for the two χ 2 implementations are shown in fig. 4.
In the left panel of fig. 4 we adopt different choices for the normalization uncertainty. Interestingly, we find for σ sys 2% notably differences, despite the rather large event number of N = 1.5 × 10 4 . For the 43 bins in our simulation this corresponds to about 350 events per bin on average, where the bin with the smallest number of events has a mean above 20 events. We have checked that for the no-systematic case the differences between Poisson and Gauss disappear only for N 10 5 . From the figure we also see that for N = 1.5 × 10 4 the differences between the Gauss and Poisson implementations disappear for large enough systematic uncertainty, when both cases approach the max. Gauss distribution. The origin of the different behavior is related to the assumption σ i = √ d i in eq. (25). We have confirmed that when we use instead the prediction, σ i = p 0 i , the Gauss and Poisson χ 2 implementations lead to identical results.
In the right panel of fig. 4 we study the impact of the physical boundary sin 2 2θ ≥ 0 in the case of N = 1.5 × 10 4 and no systematic uncertainty on the normalization. In the figure we compare three cases: sin 2 2θ ≥ 0 (blue lines), sin 2 2θ ≤ 0 (green lines), and a third case where no restriction is imposed on sin 2 2θ (red lines). For the Poisson χ 2 implementation we see that restricting the sign of sin 2 2θ has a notable impact on the distribution, but the effect is similar regardless of the sign of sin 2 2θ. In contrast, for the Gauss χ 2 implementation we observe significant differences between the cases sin 2 2θ ≥ 0 and sin 2 2θ ≤ 0. Again this is a consequence of using √ d i as the statistical error in the Gauss χ 2 , eq. (25): as d i includes statistical fluctuations, √ d i is not symmetric between upward and downward fluctuations, which leads to the asymmetric behavior with respect to the sign of sin 2 2θ. However, if the theoretical prediction is used as variance, the Gauss χ 2 becomes symmetric between upward and downward fluctuations. We have explicitly checked that in this case the dependence on the sign of sin 2 2θ disappears and we recover the result from the Poisson χ 2 . Surprisingly, these second order effects are not negligible even for N = 1.5 × 10 4 events.
Overall, we observe that the results obtained with a Gaussian χ 2 agree with the Poisson χ 2 as long as we use the square-root of the prediction to calculate the statistical uncertainty, while sizeable deviations occur when the square-root of the data is used. Let us remark, however, that as long as the distribution of the test statistic is numerically evaluated by Monte Carlo simulation, of course any reasonable χ 2 definition can be used (including also the Gauss χ 2 as defined in eq. (25)).
To summarize the results found in this section, we find that as long as σ sys 1/ √ N , the distribution of the test statistic T is sensitive to details of the analysis, such as size of systematics, treatment of systematics during randomization, χ 2 variants, physical boundaries. However, once σ sys 1/ √ N (i.e., for experiments where only shape information is used) the max. Gauss distribution seems to be a rather robust result.

V. APPLICATION: NEUTRINO-4 AS A CASE STUDY
The Neutrino-4 experiment [16,30] has recently claimed a possible indication of sterile neutrino oscillations with ∆m 2 7.2 eV 2 and sin 2 2θ 0.26. They report a statistical significance using their combined phase 1 and 2 data of 3.2σ [30]. Then an estimate of their systematic uncertainty is quoted, leading to a combined statistical/systematical significance of a positive sin 2 2θ of 2.8σ. In this section we use the Neutrino-4 results to illustrate the arguments presented above on a real-life example. We concentrate on the purely statistical aspect and will show that, in light of the discussions in the previous sections, the significance from statistical errors alone is already lower than the quoted 3.2σ.
Neutrino-4 uses a segmented detector, which allows to bin their data in both L and E. The data is binned using 9 bins in energy with width ∆E = 0.5 MeV starting at 2.3 MeV, and 24 bins in baseline with ∆L = 0.235 m starting at 6.25 m, resulting into a total of 216 bins in L/E. The bin width in energy of 0.5 MeV corresponds to the energy resolution of the detector [30]. Eventually, each consecutive group of 8 bins are combined together, leading to N = 27 data points. The observed data correspond to the ratio where d i stands for the observed number of events in bin i. Out of these, the first 19 bins are shown for the combined phase 1 and phase 2 data sets in Fig. 47 of [30] (blue points). Following the Neutrino-4 collaboration, we fit these 19 data points with the survival probability for a given L/E bin over the averaged probability: Here · i indicates the average over an energy interval ∆E, with the value of L/E set at the bin center of the corresponding (L/E)-bin i. The fit is performed with a simple Gaussian χ 2 definition, using the statistical errors read off from Fig. 47 of [30]. Note that, due to the particular way the fit is performed by Neutrino-4, using the ratios in eqs. (26) and (27) the analysis is only sensitive to spectral distortions in L/E, and therefore assumption (b) from section III A is fulfilled. With our fit we can reproduce to good accuracy the results from Ref. [30]. Our best-fit point is located at ∆m 2 = 8.84 eV 2 , sin 2 2θ = 0.42; however, we find a quasi-degenerate local minimum with ∆χ 2 = 5 × 10 −3 at ∆m 2 = 7.28 eV 2 , sin 2 2θ = 0.34, close to the best-fit point obtained by Neutrino-4. We explain this slight difference by the fact that the fit reported in Ref. [30] uses more information in L/E than available to us. This additional information seems to somewhat disfavor the local minimum around ∆m 2 9 eV 2 compared to the one at 7.25 eV 2 . Furthermore, we obtain for the χ 2 minimum and the test statistic T , i.e., the ∆χ 2 between no oscillations and the best-fit point: χ 2 min = 17.11 , T = 12.87 (Fig. 47 of Ref. [30]) , showing good agreement, especially for T . If evaluated under the assumption of Wilks' theorem with a χ 2 distribution with 2 DOF we would get from T = 12.9 a p-value of 1.58 × 10 −3 , corresponding to 3.16σ. In order to check this reasoning, we have generated a large sample of artificial data sets for Neutrino-4, under the null-hypothesis of no oscillations, in order to calculate the distribution of T explicitly. The result is shown in fig. 5, which shows significant deviations from a χ 2distribution. In agreement with the discussions above, the distribution of √ T is found to be more similar to the max. Gauss distribution. In this case we find that the effective N = 40 for the max. Gauss distribution providing the closest fit to the numerical distribution deviates substantially from the actual number of bins in the data, N = 19. Based on the numerical T distribution we obtain a p-value of 9.1 × 10 −3 (or 2.6σ), indicating that the actual statistical significance is clearly lower. 4 In addition, we have calculated confidence regions in the plane of sin 2 2θ and ∆m 2 by performing a Feldman-Cousins analysis [31]. For given values of sin 2 2θ and ∆m 2 we have generated many artificial data sets, assuming that these values are the true values in Nature. This allows us to compute the correct distribution of ∆χ 2 (sin 2 2θ, ∆m 2 ) = χ 2 (sin 2 2θ, ∆m 2 ) − χ 2 min (30) for each point in the parameter space. Comparing the value of ∆χ 2 exp obtained from the actual experimental data to the numerical distribution (sin 2 2θ, ∆m 2 ), we obtain the confidence level (CL) at which a particular point can be rejected. Repeating this procedure for the whole parameter space we obtain confidence regions at a given CL as the set of all points which are accepted at that CL. The results of this analysis are shown in fig. 6 as shaded regions for 68.3% (dark green), 95.45% (medium green), and 99.73% (light green) CL. Our regions are also compared to ∆χ 2 contours obtained under the assumption that ∆χ 2 follows a χ 2 distribution with 2 DOF, which would be the case if Wilks' theorem held. We clearly observe that true confidence regions are substantially larger than the ones based on the χ 2 distribution. In particular, the no-oscillation case is contained in the 3σ contour for the Monte-Carlo calculation, in agreement with the discussion of the test statistic T above. 5

VI. SUMMARY AND CONCLUSIONS
In this paper we have studied the statistical interpretation of sterile neutrino oscillation searches in the disappearance mode, specifically when no information on the absolute normalization of the signal is used. A priori there are several good reasons to expect that Wilks' theorem does not apply in this case: the presence of a physical boundary, the fact that the parameter space changes dimension if either ∆m 2 → 0 or sin 2 2θ → 0, and the highly non-linear dependence of the number of events on ∆m 2 . Not surprisingly, we do indeed find significant deviations. Although in this work we decided to focus on short-baseline reactor experiments as a case study, our results are more general. We find that this situation, under some assumptions, is equivalent to fitting Gaussian white noise with a single frequency of free amplitude. This allows us to express the distribution of the test statistic T to be the maximum of N Gaussian random variables where N is the effective number of bins ("max. Gauss distribution"). Therefore, this class of oscillation searches will always find a best-fit for a non-zero signal even if there is no oscillation in the data, with a non-negligible statistical significance if interpreted as if Wilks' theorem would apply. In other words, the parameters obtained at the minimum of the χ 2 are biased estimators in this case.
We then perform Monte Carlo simulations of a toy reactor disappearance experiment, to confirm that our analytic understanding carries over to a more realistic setting. The test statistic T we consider is equivalent to the log-likelihood either for a Gaussian likelihood or Poissonian likelihood. We find that, if the systematic uncertainty on the event normalization is comparable to (or smaller than) the statistical error of the event sample, the distribution function of T is rather sensitive on fine details of the chosen simulation and, in particular, on: whether the central value of the nuisance parameter is randomized or not, and whether a Gaussian or a Poissonian log-likelihood is used (despite the fairly large number of events per bin). Conversely, for experiments relying on shape information only (that is, when no information on the absolute normalization is used) the max. Gauss distribution is a rather robust prediction for the distribution of the test statistic.
Finally, we apply our understanding to the actual data of the Neutrino-4 experiment. We are able to reproduce the quantitative details of their analysis quite well if we assume that Wilks' theorem applies. However, in agreement with our arguments presented above we find that the test statistics shows significant deviations from a χ 2 distribution. In particular, we show by explicit Monte Carlo simulation that the significance of the claimed oscillation signal is reduced from 3.2 σ (p = 1.58 × 10 −3 ) to 2.6 σ (p = 9.1 × 10 −3 ), that is, the probability that this is a mere statistical fluctuation is about 6 times larger than that expected if Wilks' theorem were to hold. It should be noted that our Neutrino-4 analysis is based on statistical errors only, and that the inclusion of systematic effects may reduce the significance even further.
In summary, our results provide a simple, intuitive understanding on why and how shape-only oscillation searches are different from the usual case. Applied to Neutrino-4 we find a reduced significance for sterile neutrino oscillation, but not to the extent to completely dismiss this indication as a pure statistical fluctuation. It would be interesting to see how this type of analysis would play out in a global fit of all short-baseline reactor data, but this is beyond the scope of the present work.