Statistical significance of the sterile-neutrino hypothesis in the context of reactor and gallium data

We evaluate the statistical significance of the 3+1 sterile-neutrino hypothesis using $\nu_e$ and $\bar\nu_e$ disappearance data from reactor, solar and gallium radioactive source experiments. Concerning the latter, we investigate the implications of the recent BEST results. For reactor data we focus on relative measurements independent of flux predictions. For the problem at hand, the usual $\chi^2$-approximation to hypothesis testing based on Wilks' theorem has been shown in the literature to be inaccurate. We therefore present results based on Monte Carlo simulations, and find that this typically reduces the significance by roughly $1\,\sigma$ with respect to the na\"ive expectation. We find no significant indication in favor of sterile-neutrino oscillations from reactor data. On the other hand, gallium data (dominated by the BEST result) show more than $5\,\sigma$ of evidence supporting the sterile-neutrino hypothesis, favoring oscillation parameters in agreement with constraints from reactor data. This explanation is, however, in significant tension ($\sim 3\,\sigma$) with solar neutrino experiments. In order to assess the robustness of the signal for gallium experiments we present a discussion of the impact of cross-section uncertainties on the results.


Introduction
There are a number of long-standing experimental results in neutrino physics that defy Standard Model (SM) expectations and, at least individually, find an explanation in terms of new oscillations with a mass-squared splitting of around 1 eV²; for reviews see e.g., refs. [1,2]. This scale is much larger than the mass-squared splittings associated with the three neutrinos of the SM, and hence points to the existence of a fourth neutrino, which must be a singlet under the SM gauge group (i.e., sterile) in order to agree with the invisible decay width of the Z-boson [3]. A sterile neutrino would open a fermionic portal to a dark sector and, as such, is well motivated.
In 2011, a new piece of evidence appeared in the form of the so-called reactor antineutrino anomaly [4]: new evaluations of the prediction of the antineutrino flux from a reactor [5,6] resulted in a 6% increase of the expected neutrino signal. This implied that past experiments (which were in agreement with previous calculations) were in fact observing a deficit of the same magnitude. In parallel, renewed analyses of radiochemical source-calibration measurements done for the solar-neutrino experiments SAGE [7,8] and GALLEX [9,10] also indicated a deficit compatible with the result from reactors: the so-called gallium anomaly [11][12][13]. Thus, for the first time, there was considerable evidence of anomalous ν_e and ν̄_e disappearance.
Because of this, the past decade has seen a plethora of experimental efforts (both with reactor neutrinos and radioactive sources), as well as theoretical efforts in nuclear physics, to improve source and cross-section predictions. On the experimental side, a number of reactor experiments at short distances (6-25 m) from the reactor core have been conducted: DANSS [14], NEOS [15], PROSPECT [16], STEREO [17] and Neutrino-4 [18]. These experiments derive their results for sterile-neutrino oscillations by comparing the measured neutrino spectra at different distances from the reactor, and thus their results do not rely on reactor-flux modeling. While some of these reported evidence favoring the sterile-neutrino hypothesis, others did not, and thus the question of the statistical compatibility of the different data sets and their interpretation in the form of a global fit is relevant [19,20]. All of these experiments look for an oscillatory distortion in the ratio of measured spectra, and there are a number of subtleties that arise in their statistical interpretation [21][22][23][24], which render the usual approach based on Wilks' theorem [25] unreliable.
In parallel, in order to test the gallium anomaly, the BEST collaboration recently presented a new result of a two-zone gallium radiochemical measurement employing a chromium-51 source [26]. Although they did not observe a different rate in the two volumes, their results confirmed the earlier deficit with much smaller error bars. This new result brings the gallium anomaly to a new level, making it highly significant and raising the question of its compatibility with other data [27,28].
Since 2011, significant effort has gone into improving our understanding of reactor antineutrino fluxes, both from a theoretical perspective and from many measurements of beta-feeding functions of individual isotopes. These efforts have led to a much better understanding of the underlying physics, see e.g., ref. [29], but have not resolved the reactor antineutrino anomaly. Global analyses of past reactor neutrino data [30][31][32], precise neutrino-spectrum measurements [33] and fuel-evolution measurements [34] have been indicating for a while that there might be a problem with the prediction for uranium-235. The recent absolute flux determination from a reactor with only uranium-235 fission by the STEREO experiment [35] confirms this. The basis for the Huber-Mueller reactor fluxes are integral beta-spectrum measurements conducted in the 1980s at the Institut Laue-Langevin (ILL) [36][37][38]. Recently, the ratio of uranium-235 to plutonium-239 integral beta spectra was measured [39], indicating the possibility that the normalization of the original ILL data for uranium-235 was too high. Taking this new information into account, a global flux analysis [28] finds that there is no evidence for the reactor antineutrino anomaly in total rates. All data sets and methods of prediction agree in total rate if one replaces the normalization of uranium-235 with the new result of ref. [39]. Therefore, any support for the sterile-neutrino hypothesis from reactor neutrino data now would have to come from the observation of an actual oscillation by DANSS, NEOS, PROSPECT, STEREO or Neutrino-4. Note that the spectrum discrepancy between flux models and reactor neutrino measurements [40][41][42] around 5 MeV of neutrino energy, colloquially known as the 5-MeV bump, has no bearing on the question of a sterile neutrino for these experiments, since in taking the ratio of spectra measured at different distances, the bump effectively cancels.
Finally, a number of electron-neutrino disappearance results from accelerator experiments may also be interpreted in the context of a sterile neutrino. Specifically, both KARMEN and LSND [43][44][45] have provided bounds from a charged-current reaction on ¹²C. A measurement of the T2K experiment reported in ref. [46], whose main purpose was a determination of the charged-current electron-neutrino cross section and a characterization of the ν_e beam component, also leads to a mild indication of disappearance. A somewhat similar result appeared recently based on MicroBooNE data [47], see also ref. [48]. None of these results is, as of yet, very statistically significant, and they do not strongly favor oscillation over no oscillation or vice versa; we therefore do not include them in our fits. Similar remarks apply to the tritium beta-decay experiment KATRIN [49], whose exclusion depends strongly on whether the light neutrino masses can be neglected with respect to the mass of the much heavier fourth neutrino [49,50].
Let us also remark that additional eV-scale neutrinos face severe constraints from cosmology, both in regard to their early-time contributions as relativistic degrees of freedom and their late-time contributions to the energy density as hot dark matter, suppressing the formation of small-scale structure, see e.g., ref. [51] for a recent global analysis and refs. [1,2] for reviews and further references. However, any cosmological bound relies on the assumption of thermalization of the sterile neutrino in the early universe and may thus be evaded in an alternative cosmology. Our goal here is to present results that are robust with respect to cosmological assumptions.
In this paper, we present a global fit of up-to-date results from the DANSS, NEOS, PROSPECT, STEREO and Neutrino-4 reactor experiments, the past gallium results from SAGE and GALLEX, as well as the recent BEST result and solar-neutrino results, in the framework of one additional sterile neutrino (the so-called 3+1-oscillation hypothesis). The results presented are independent of reactor-flux predictions and are thus complementary to the results of ref. [28]. Since Wilks' theorem cannot be used reliably here [21][22][23][24], we perform a full Feldman-Cousins analysis [52].
The analysis results are presented in the main text; individual experiments, as well as details of the Feldman-Cousins analysis, are described in the appendices. Section 2 contains a description of the reactor data, as well as the results of their combination. In section 3, we discuss the gallium experiments and the dependence of their results on the charged-current cross section for neutrino capture. In section 4, we evaluate the statistical compatibility of the various data sets, paying particular attention to the compatibility between solar and gallium data, and present our global-fit results. In section 5, we present a summary of our findings and draw our conclusions.
2 Reactor experiments

2.1 Description of the data used

In the following, we provide a brief summary of the reactor experiments whose data we have used; detailed descriptions of our analyses, including the precise forms of the χ² functions and our techniques for generating pseudo data, are presented in appendix A.
DANSS At DANSS, the detector is located on a movable platform under an industrial 3.1 GW_th reactor [14]. It measures the neutrino spectrum at three baselines: bottom, middle and top (I ≡ B, M, T, respectively). The DANSS collaboration presents their data as ratios of the rates measured at the different detector positions,

R^i = n^i_B / n^i_T ,    (2.1)

where i indicates the energy bin and n^i_I = N^i_I / ∆t_I. Here, N^i_I stands for the event count in energy bin i and ∆t_I is the corresponding exposure time for each detector location I. A total of 36 energy bins are used in our analysis, spanning energies between 1.5 and 6 MeV. Exposure times and the background-subtracted rates for the top baseline were extracted from ref. [14, p. 12]. Doing this, we get a total event rate for the top baseline of Σ_i n^i_T = 4132.61 events per day, for energies between 1.5 and 6 MeV (in prompt energy). Details on our χ² implementation can be found in appendix A.1.
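As a concrete illustration of this ratio observable, the sketch below builds a bottom-to-top spectral ratio and its statistical error from invented Poisson counts; all numbers are hypothetical and are not DANSS data.

```python
import numpy as np

# Hypothetical illustration of a DANSS-style ratio observable:
# n^i_I = N^i_I / dt_I and R^i = n^i_B / n^i_T.  All counts and exposure
# times below are invented for the sketch; they are not DANSS data.
rng = np.random.default_rng(0)

n_bins = 36                                           # 1.5-6 MeV energy bins
N_top = rng.poisson(50_000, n_bins).astype(float)     # counts, top position
N_bottom = rng.poisson(30_000, n_bins).astype(float)  # counts, bottom position
dt_top, dt_bottom = 100.0, 120.0                      # exposure times (days)

n_top = N_top / dt_top          # daily rates per bin
n_bottom = N_bottom / dt_bottom
ratio = n_bottom / n_top

# Poisson errors of the two independent counts, propagated to the ratio
sigma_ratio = ratio * np.sqrt(1.0 / N_top + 1.0 / N_bottom)

print(ratio.round(3)[:4])
```

A virtue of this observable is that reactor-flux normalization uncertainties cancel bin by bin, which is why the analysis is independent of flux predictions.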
NEOS The NEOS detector consists of approximately 1000 L of homogeneous 0.5% Gd-loaded liquid scintillator. It is located at about 23.7 m from a commercial reactor core with a thermal power of 2.8 GW in the Hanbit nuclear-power complex in Yeonggwang, South Korea. The detector operated for 46 days with the reactor off and 180 days with the reactor on. NEOS measured 1976.7 ± 3.4 inverse-beta-decay candidates per day (with prompt energies between 1 and 10 MeV) during the reactor-on period, as opposed to 85.1±1.4 events in the reactor-off period. For our analysis, we take the data from ref. [15, fig. 3a]: the black points are assumed to be the background-subtracted counts and are normalized to the quoted total signal rate, 1976.7 − 85.1 = 1891.6 events/day.
The NEOS collaboration does a shape-only analysis using the neutrino spectrum from ref. [53] plus corrections due to different fission fractions (see appendix A.2 for details); therefore, we leave the normalization completely free in our fit. We build a binned χ², with 61 energy bins between 1 and 10 MeV.
PROSPECT The PROSPECT experiment uses a large, highly segmented detector. Segments are grouped into 10 different baselines (see fig. 39 of ref. [16]). Data are then reported as 10 energy spectra, one for each cluster of segments, with 16 energy bins. Our analysis benefits from the thorough data release accompanying ref. [16], including the background-subtracted spectra for each baseline bin and the covariance matrix describing them, which includes both statistical and systematic uncertainties. The χ² is calculated following ref. [16, eq. 11] (see also appendix A.3).
STEREO The STEREO detector contains six cells, each with a different effective baseline. The experiment was divided into two phases: for each cell, phase-I data are presented in ten bins from 1.625 to 6.625 MeV of prompt energy, while phase-II data are presented in eleven bins from 1.625 to 7.125 MeV (in both cases the bin size is 0.5 MeV). The data have been published alongside ref. [17]; however, the raw event rates have been normalized to the prediction under the no-oscillation hypothesis (after minimization over systematic uncertainties). Therefore, we also rescaled our predictions in order to compare to their data, using as the nuisance parameters (for the prediction under the no-oscillation hypothesis) the values from fig. 31 of ref. [17]. The STEREO analysis contains a total of 40 pulls: we are able to minimize 14 of these analytically, while the remaining 26 are minimized numerically. Further details on our implementation of STEREO can be found in appendix A.4.
Neutrino-4 Our analysis of Neutrino-4 [18] is based on the previous analysis performed in ref. [24]. Neutrino-4 also uses a segmented detector. The data are binned using nine 0.5-MeV energy bins from 2.3 to 6.8 MeV, and twenty-four 0.235-m baseline bins from 6.25 to 11.89 m. The resulting 216 bins in L/E are then clustered into consecutive groups of 8 bins, leading to a total of 27 data points. The analysis is performed using the ratios

R^i = d^i / ( (1/N) Σ_k d^k ) ,    (2.2)

where d^i is the number of events in the i-th bin and N = 27. In our analysis, we use the first 19 bins for the combined phase-1 and phase-2 data sets (the blue points in fig. 47 of the v2 arXiv preprint of ref. [18]). Note that although the event counts d^i are uncorrelated, the ratios R^i are correlated through the denominator in eq. (2.2). However, we have estimated these correlations and find that they do not change our results, so they will be neglected here. Further details on our χ² implementation and analysis can be found in appendix A.5 as well as in ref. [24].
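The clustering and ratio construction can be sketched as follows, with invented counts; the form of the denominator (the average over all 27 points) is our reading of the ratio described above and in ref. [24].

```python
import numpy as np

# Sketch of the Neutrino-4 observable with invented counts: 9 energy bins
# x 24 baseline bins give 216 L/E bins, clustered in consecutive groups
# of 8 into N = 27 points; each point is then divided by the average,
#   R_i = d_i / ((1/N) * sum_k d_k),
# our reading of the ratio construction (all numbers hypothetical).
rng = np.random.default_rng(1)

d_fine = rng.poisson(300, 9 * 24).astype(float)  # 216 L/E bins (toy)
d = d_fine.reshape(27, 8).sum(axis=1)            # cluster groups of 8
R = d / d.mean()                                 # ratios, correlated via mean

print(R.round(3)[:5])
```

The shared denominator is what induces the (small) correlations among the R_i mentioned in the text: a fluctuation in any single bin shifts all 27 ratios slightly.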

Results of the reactor analysis
For all the reactor data listed above, we can reproduce the results of the corresponding collaborations with good accuracy when adopting the same assumptions and statistical methods. This is shown in table 1, which compares our best-fit points to those of the respective collaborations, showing reasonable agreement in all cases. The χ² values at the global minimum (χ²_min) are also shown: as can be seen, our results match those obtained by the different collaborations very well in all cases. Note the slightly different number of dof between our analysis and the collaborations' for NEOS and PROSPECT, which is due to a different number of nuisance parameters used in the fit.
In the right-most part of the table, we evaluate the data's preference for the presence of sterile-neutrino oscillations by considering the test statistic

∆χ²_3ν = χ²_3ν − χ²_min ,    (2.3)

where χ²_3ν is the χ² value for the null hypothesis, sin²2θ = 0 (θ being the angle that parametrizes the mixing between the sterile neutrino and the light states), and χ²_min is the χ² value at the best-fit point. We use this test statistic to evaluate the p-value of the null hypothesis in the following way. For a given experiment, the distribution that ∆χ²_3ν follows can be obtained from Monte Carlo (MC) simulations: a set of pseudo experiments is randomly generated according to the expected statistical fluctuations of event rates in each bin (see appendix B.2 for details). The obtained distribution of the test statistic can then be compared to the experimental value in order to obtain the corresponding p-value. The p-values we obtain, as well as the corresponding number of Gaussian standard deviations, are given in table 1. Again, the comparison with the collaborations' corresponding results shows good agreement.

Table 1. Best-fit points and χ²_min values for each data set, the p-value of the null hypothesis and its equivalent number of Gaussian standard deviations, #σ. The last column shows the corresponding number of standard deviations when Wilks' theorem is assumed, i.e., assuming ∆χ²_3ν is distributed as χ²(2 dof). We also show our combined results for all reactors, as well as reactors with solar and reactors with gallium data. The gallium analysis uses the Kostensalo et al. cross section [54].
If Wilks' theorem held, then ∆χ²_3ν would follow a χ² distribution with 2 dof (corresponding to the two parameters, sin²2θ and ∆m², over which we minimize). In the last column of table 1, we show the number of standard deviations corresponding to that assumption, #σ^(W). Comparing with the previous column, we find that Wilks' theorem consistently overestimates the significance by about one standard deviation, confirming previous studies [21][22][23][24]. Therefore, in order to assess the significance for the presence of sterile neutrinos from these data, it is essential to study the distribution of the test statistic numerically.
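The gap between the Wilks and MC significances can be made concrete with a toy calculation. Here the "MC" distribution is only a stand-in (the maximum of n squared standard normals, the maximum-Gauss shape of ref. [24], with an arbitrary n = 100), not our actual pseudo-experiment ensemble.

```python
import numpy as np
from scipy import stats

# Toy comparison of significances for an observed Delta chi^2: Wilks'
# theorem (chi^2 with 2 dof) versus an empirical null distribution.  As
# a stand-in for the real pseudo-experiment ensemble we draw the maximum
# of n squared standard normals; n = 100 is an illustrative choice.
def sigma_from_p(p):
    # two-sided Gaussian convention for quoting #sigma
    return stats.norm.isf(p / 2.0)

dchi2_obs = 7.3                          # combined-reactor value from the text
p_wilks = stats.chi2.sf(dchi2_obs, df=2)

rng = np.random.default_rng(2)
toy = (rng.standard_normal((50_000, 100)) ** 2).max(axis=1)
p_toy = (toy >= dchi2_obs).mean()        # empirical p-value

print(f"Wilks:  {sigma_from_p(p_wilks):.2f} sigma")
print(f"toy MC: {sigma_from_p(p_toy):.2f} sigma")
```

With a heavy-tailed null distribution like this, the same ∆χ² value corresponds to a much larger p-value, i.e., a lower significance, which is the qualitative effect reported in table 1.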
Next, let us discuss the differences in the allowed confidence regions obtained in the sin²2θ-∆m² plane. Figure 1 shows the contours of constant ∆χ² = 6.18 and 11.83 for the various reactor experiments individually (as well as their combination), together with the constraint from solar neutrinos (see below), where

∆χ²(sin²2θ, ∆m²) = χ²(sin²2θ, ∆m²) − χ²_min .    (2.4)

Following the Feldman-Cousins (FC) prescription, for each point in parameter space a set of pseudo experiments generated under the corresponding oscillation hypothesis can be used to determine the expected distribution of ∆χ² as defined in eq. (2.4), which is then compared to the value measured experimentally in order to determine the confidence level (CL) at which that point can be rejected. The full allowed region in the sin²2θ-∆m² plane is then obtained by repeating this procedure for all points in parameter space. Confidence regions obtained in this way have the correct coverage by construction.
The colored bands in fig. 2 show such FC confidence regions, corresponding to 1, 2, and 3σ significance, for the individual reactor experiments (as well as for their combination, in the last panel), where the width of the bands indicates the 99% spread due to the finite size of the MC sample (see appendix B.4 for details). Comparing these bands to the lines of the same color, which correspond to ∆χ 2 contours assuming that Wilks' theorem applies, we find that in most cases the true CL is reduced approximately by one Gaussian standard deviation, similar to the null-hypothesis case. Also, we do not find closed confidence regions at 99.73% CL for any of the reactor experiments, in agreement with the p-values of the null hypothesis reported in table 1.
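The FC construction just described can be sketched for a one-parameter toy problem; a single Gaussian measurement of a parameter bounded to be non-negative stands in for the full oscillation fit, and all numbers are illustrative.

```python
import numpy as np

# Minimal Feldman-Cousins (FC) sketch for a one-parameter toy problem:
# a Gaussian measurement x ~ N(mu, 1) of a parameter with a physical
# boundary mu >= 0.  For each test value mu, the critical Delta chi^2 is
# taken from pseudo experiments generated under that hypothesis, so the
# resulting interval has correct coverage by construction.
rng = np.random.default_rng(3)

def delta_chi2(x, mu):
    # chi^2(mu) minus chi^2 at the boundary-respecting best fit max(x, 0)
    return (x - mu) ** 2 - (x - np.maximum(x, 0.0)) ** 2

def fc_accepted(mu, x_obs, cl=0.9545, n_pseudo=20_000):
    pseudo = rng.normal(mu, 1.0, n_pseudo)           # pseudo data under mu
    crit = np.quantile(delta_chi2(pseudo, mu), cl)   # MC critical value
    return delta_chi2(x_obs, mu) <= crit

x_obs = 0.8
region = [mu for mu in np.linspace(0.0, 4.0, 41) if fc_accepted(mu, x_obs)]
print(f"2-sigma FC interval: [{min(region):.1f}, {max(region):.1f}]")
```

Near the boundary the MC critical value differs from the naive χ² quantile, which is the one-dimensional analogue of the deviations from Wilks' theorem seen in the reactor fits.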
Although none of the individual experiments shows a clear signal at high significance, several of them favor the sterile-neutrino-oscillation hypothesis at a mild confidence level of around 2σ. The question therefore arises whether these indications add up to a significant result in a combined analysis. We saw already in fig. 1 that the allowed regions for the different experiments do not show a consistent overlapping pattern, and therefore an enhancement of the various hints is not expected from their combination. This is confirmed and quantified by the results of our global reactor MC analysis, reported in table 1 as well as in the bottom-right panel of fig. 2: the combined reactor fit gives ∆χ²_3ν = 7.3 with a p-value corresponding to 1.1σ. Hence, we find that reactor data are statistically compatible with the absence of sterile-neutrino oscillations. Again, we note that assuming Wilks' theorem we would still have a ≈ 2σ hint for sterile neutrinos, whereas the MC analysis reduces this to ≈ 1σ.
This effect is further illustrated in fig. 3, where we show the probability distribution function (PDF) and the survival function (SF) of the ∆χ²_3ν distribution for the combination of all reactor experiments, obtained numerically by generating 10⁵ MC data sets. We see from the plot that the distribution deviates considerably from the χ² distribution with 2 dof. On the other hand, in ref. [24] it has been shown that for this type of oscillation search, under certain idealizations, the SF of ∆χ²_3ν follows the so-called maximum-Gauss distribution, which is defined via the maximum value of n independent standard-normal variables. Therefore, in fig. 3 we also compare the numerical SF with this distribution for n = 80 and 120 (here n plays the role of an effective number of degrees of freedom, and has been chosen to match the shape of the SF). Note that small deviations from the maximum-Gauss distribution occur since some of the assumptions under which it has been derived are not fulfilled exactly; for instance, due to non-trivial systematic uncertainties, the data points in our fit are not all statistically independent.

Figure 3. PDF and SF of the ∆χ²_3ν distribution for the combination of all reactor experiments. We also plot the χ²(2 dof) SF and PDF for comparison (red), as well as the maximum-Gauss distribution for n = 80, 120, see ref. [24].
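Under the reading that the maximum-Gauss statistic is the maximum of n squared standard normals (so that it is directly comparable to a ∆χ²), its SF has a simple closed form. The sketch below, with illustrative values only, cross-checks that form by direct sampling and compares it to χ²(2 dof).

```python
import numpy as np
from scipy import stats

# Closed-form survival function of the maximum-Gauss statistic, read as
# T = max_i x_i^2 over n independent standard normals x_i:
#   SF(t) = 1 - P(x^2 <= t)^n = 1 - (2*Phi(sqrt(t)) - 1)^n.
# This reading of the definition is our assumption for the sketch.
def max_gauss_sf(t, n):
    return 1.0 - (2.0 * stats.norm.cdf(np.sqrt(t)) - 1.0) ** n

# Cross-check against direct sampling, and compare with chi^2(2 dof)
rng = np.random.default_rng(4)
sample = (rng.standard_normal((100_000, 80)) ** 2).max(axis=1)

t = 15.0
sf_formula = max_gauss_sf(t, 80)
sf_sampled = (sample >= t).mean()
print(sf_formula, sf_sampled, stats.chi2.sf(t, df=2))
```

At a fixed t the maximum-Gauss SF sits well above the χ²(2 dof) SF, which is the heavy tail responsible for the roughly one-standard-deviation reduction in significance.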

Gallium radioactive-source experiments
The gallium solar-neutrino detectors GALLEX and SAGE have been used to measure the neutrino rate from radioactive ⁵¹Cr and ³⁷Ar sources [7][8][9][10], leading to results consistently lower than expected. This so-called gallium anomaly could potentially find its explanation in sterile-neutrino oscillations [12,55]. Recently, the BEST collaboration has performed a dedicated source experiment using a ⁵¹Cr source embedded in a two-volume gallium detector, confirming the previous hints at high significance [26]; see also ref. [27].
We summarize the gallium data in table 2, quoting the ratios of observed to expected event rates. The errors in the table include only statistical and uncorrelated experimental errors. The errors for the two BEST data points are obtained by adding the statistical error given in tab. 1 of ref. [26] and the uncorrelated experimental systematic error of 2.6% in quadrature. We observe that the reported ratios deviate from unity by about 20%, confirmed by BEST at high significance.
An important ingredient in the interpretation of these results is the ν e -capture cross section on gallium. The ratios in table 2 are based on the traditional Bahcall 1997 cross sections [56], but for alternative cross sections they have to be rescaled accordingly.
Table 3. Cross sections for ν_e detection on gallium, weighted by the energy lines for neutrinos produced from a Cr and an Ar source, for our three representative cross-section models. Units are 10⁻⁴⁶ cm². We give the best estimate for the total cross section as well as the cross section corresponding to the ground-state (g.s.) transition only. For the Bahcall numbers we quote the (larger) upper error (see text); Bahcall does not provide an error estimate on σ_g.s.
Here we have considered two recent re-evaluations of the cross sections, from Kostensalo et al. [54] and Semenov [57], summarized in table 3. Note that in table 3 we give the best estimates for the total cross section for the ν_e-induced ⁷¹Ga → ⁷¹Ge transition, which receives contributions from the ground-state-to-ground-state (g.s.) transition as well as from transitions into excited states of ⁷¹Ge. Two remarks should be made at this point. First, the matrix element of the g.s. transition can be directly inferred from the electron-capture lifetime of ⁷¹Ge and the corresponding phase-space factor. Second, the excited-state contributions add incoherently and thus can only increase the cross section; that is, the g.s. transition provides a lower bound on the total cross section.

In order to analyze gallium data, we define the following χ² function:

χ²(sin²2θ, ∆m²) = min_ξ { Σ_i [(1 + ξ) P_i − R_i]² / σ_i² + ξ²/σ_ξ² } .    (3.1)

Here, R_i and σ_i are the observed ratios and their uncertainties from table 2, rescaled according to the adopted cross section, and P_i are the oscillation probabilities, averaged over the detector volumes and weighted by the neutrino-energy lines from the corresponding source. The pull parameter ξ describes the correlated uncertainty on the cross section. In order to set the uncertainty σ_ξ we proceed as follows: for a given cross-section model, we first minimize with respect to ξ, adopting the total uncertainty given in table 3; we then check whether the resulting cross section at the pull minimum is smaller than the ground-state contribution; if this is the case, then we switch to the smaller uncertainty of σ_g.s., making the pull stiffer. In this way, we take into account the asymmetries of the cross-section uncertainties. An analogous procedure is adopted when generating random values for the pull parameter for the MC studies.
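The pull-switching procedure can be sketched as follows. All inputs (ratios, errors, probabilities, cross-section uncertainties and the g.s. fraction) are invented toy numbers, and the χ² form in the comment reflects our reading of the construction described above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch of a gallium chi^2 with a cross-section pull xi (toy inputs):
#   chi2 = sum_i ((1+xi)*P_i - R_i)^2 / sigma_i^2 + (xi/sigma_xi)^2.
# Asymmetric-error rule from the text: fit first with the total error;
# if the pulled cross section drops below the g.s. contribution, refit
# with the stiffer g.s. error.
R = np.array([0.79, 0.77, 0.79, 0.77])        # observed/expected ratios (toy)
sigma = np.array([0.05, 0.05, 0.025, 0.025])  # uncorrelated errors (toy)
P = np.full(4, 0.80)                          # averaged survival probs (toy)
sig_total, sig_gs = 0.035, 0.020              # relative cross-section errors (toy)
gs_fraction = 0.90                            # g.s. share of total sigma (toy)

def fit_pull(sigma_xi):
    res = minimize_scalar(lambda xi: np.sum(((1 + xi) * P - R) ** 2 / sigma**2)
                          + (xi / sigma_xi) ** 2)
    return res.x, res.fun

xi, chi2 = fit_pull(sig_total)
if 1 + xi < gs_fraction:         # cross section pulled below the g.s. bound
    xi, chi2 = fit_pull(sig_gs)  # redo with the stiffer pull
print(round(xi, 4), round(chi2, 3))
```

With these toy numbers the data sit only slightly below the prediction, so the pull stays well above the g.s. bound and no switch occurs; a larger downward pull would trigger the stiffer error.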
In table 4, we report the results of our analysis of gallium data under the sterile-neutrino-oscillation hypothesis, adopting different assumptions about the detection cross section. We give the value of ∆χ²_3ν as defined in eq. (2.3) and the corresponding significance in units of Gaussian standard deviations. While previous results from GALLEX and SAGE show only a weak hint for sterile-neutrino oscillations, the significance of the BEST result is > 5σ, independent of the assumption on the cross section. Note that even the conservative ground-state-only analysis leads to more than 5σ significance. Let us stress that in order to calculate the significances in table 4, we assume that ∆χ²_3ν is distributed as χ²(2 dof). For the combined analysis and the Kostensalo et al. cross section we have tested this assumption with a high-statistics MC study (10⁹ pseudo experiments). We find that the ∆χ²_3ν distribution is more similar to χ²(1 dof). This reduction of the dof is related to the presence of the physical boundary sin²2θ ≥ 0; see e.g., ref. [58] for a discussion of this effect. The null hypothesis, when evaluated by Monte Carlo, yields a significance slightly higher than the corresponding value for a χ² with 2 dof (5.4σ, see table 4) and closer to the value for 1 dof, which is 5.7σ.

In the following, we will adopt the Kostensalo et al. [54] cross section as our default analysis for gallium data. In fig. 4 (left panel), we show the allowed regions in the sterile-neutrino parameter space for gallium data. An explanation of these measurements in terms of sterile-neutrino oscillations requires rather large mixing angles, 0.21 ≤ sin²2θ ≤ 0.47 at 2σ. We observe that in this case the FC confidence regions are in excellent agreement with the ∆χ² contours under Wilks' theorem assuming 2 dof. Many of the arguments leading to deviations from Wilks' theorem in the reactor case [24] do not apply here.
In particular, the effect is basically a constant suppression of the rate, without oscillatory behaviour. Since the allowed region appears far away from the physical boundary (which dominates the distribution under the null hypothesis, see above), we recover here the full freedom corresponding to the two parameters sin²2θ and ∆m².
In the right panel of fig. 4 we compare the region preferred by gallium with constraints from reactor data, solar neutrinos, the KATRIN experiment, and data on ν_e-¹²C scattering from the LSND and KARMEN experiments. In the next section, we investigate in detail the compatibility of gallium, reactor and solar-neutrino data. The KATRIN constraint shown in fig. 4 is taken from ref. [49], where the curve has been derived by fixing the lightest neutrino mass to zero. This constraint cuts off the allowed parameter space at large values of ∆m² ≳ 100 eV², but becomes quickly weaker for smaller mass-squared differences. We note that the KATRIN limit would be further relaxed if the assumption of neglecting the light neutrino masses is dropped [49,50]. The constraint from ν_e scattering on ¹²C from LSND [44] and KARMEN [43] is taken from the re-analysis performed in ref. [55]. This limit is in tension (at ∼2σ under Wilks' theorem) with the gallium region at large ∆m²; moreover, it was derived assuming that Wilks' theorem is applicable, which may lead to an overestimation of the exclusion regions, similar to the data sets studied here. Thus, due to the limited statistical power of the KATRIN and LSND/KARMEN constraints in the region of interest, we focus below on gallium, reactor, and solar data.

Figure 4. Left: allowed regions for gallium data. Dash-dotted curves are obtained under the assumption of Wilks' theorem (2 dof). Right: FC confidence regions for gallium, reactor, and solar data at 2σ. We superimpose the 95% exclusion limit from the KATRIN collaboration [49] (this curve fixes the lightest neutrino mass to zero) and the Wilks' 95% exclusion limit (2 dof) from ν_e-¹²C scattering from the LSND and KARMEN experiments [43,44], taken from ref. [55]. Dash-dotted lines are extrapolations assuming constant sensitivity in the mixing.

Global fit results and consistency tests
In this section, we combine the reactor and gallium data discussed in sections 2 and 3, respectively, and study their consistency. In addition, we take into account information from solar neutrinos, which provides an important constraint on large mixing angles. The solar-neutrino analysis adopted here is based on the simplified χ² construction from ref. [59], which offers an efficient way to include solar-neutrino data in a MC study. We refer the interested reader to that reference for details; a brief summary is provided in appendix A.6.

Table 1 shows the value of χ² per dof for the combination of reactor data, as well as for the combinations reactor + solar and reactor + gallium. The mild differences between the values of χ²_min in the last three rows of the table indicate that the usual goodness-of-fit test would not provide any insight concerning the mutual consistency of the different data sets, given the very large number of dof at hand. Therefore, in order to quantify the consistency between different data sets, we use the parameter goodness-of-fit (PG) test [60,61], based on the test statistic

χ²_PG = χ²_min,glob − Σ_k χ²_min,k ,    (4.1)

where the index k labels the data sets to be tested for consistency. The first term on the right-hand side corresponds to the global χ² minimum, whereas the second term is the sum of the χ² minima of each data set individually. Under Wilks' theorem, χ²_PG will follow a χ² distribution with Σ_k P_k − P dof, where P_k is the number of parameters on which the data set k depends, and P is the total number of parameters of the model [60,61]. For our cases of interest, we have P_reactor = P_gallium = P = 2, corresponding to sin²2θ and ∆m², while P_solar = 1: in the limit relevant here, solar data are independent of ∆m² and depend only on sin²2θ.

Let us first investigate the consistency of the solar and gallium data sets.
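In the Wilks approximation the PG test is a one-line calculation; the sketch below applies it to the solar-vs-gallium comparison, using the dof counting above and the χ²_PG values quoted in the text for the g.s.-only and Semenov cases.

```python
from scipy import stats

# Parameter goodness-of-fit (PG) test in the Wilks approximation:
# chi2_PG is compared to a chi^2 with dof = sum_k P_k - P.  For solar
# (P_solar = 1) vs gallium (P_gallium = 2) with P = 2 global parameters,
# dof = (1 + 2) - 2 = 1.  The chi2_PG values are those quoted in the text.
P_k = {"solar": 1, "gallium": 2}
P = 2
dof = sum(P_k.values()) - P  # = 1

results = {}
for label, chi2_pg in [("g.s. only", 11.0), ("Semenov", 18.0)]:
    p = stats.chi2.sf(chi2_pg, df=dof)
    results[label] = stats.norm.isf(p / 2.0)  # two-sided Gaussian equivalent
    print(f"{label}: chi2_PG = {chi2_pg}, p = {p:.1e}, {results[label]:.1f} sigma")
```

This reproduces the 3.3σ and 4.2σ tensions quoted for the two cross-section choices; the MC-based version of the test replaces the χ²(1 dof) reference distribution with one built from pseudo experiments.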
Figure 5 shows the ∆χ² profiles for gallium data under various assumptions on the detection cross section, compared to the one from solar data. The dash-dotted curves correspond to ∆χ²_gallium + ∆χ²_solar; the value of χ²_PG for the consistency of solar and gallium data is given by the minimum of the dash-dotted curves. We see that the Bahcall, Kostensalo and g.s. cross sections all give very similar results for χ²_PG. Even under the ground-state-only assumption, we would obtain χ²_PG = 11, corresponding to 3.3σ tension. For the Semenov cross section, we find χ²_PG = 18 and the tension is at the 4.2σ level. Also, note that here we use the GS98 solar model [62], which leads to conservative limits. If the AGSS09 model [62] were used instead, then the solar constraint would become significantly stronger [59], making the tension with gallium even more severe.
Next, let us study the consistency when reactor data are also considered. The results of various consistency checks are reported in table 5, where, for concreteness, we have adopted the Kostensalo et al. cross section for the gallium experiments. The last two columns show our results for the PG test, where the distribution of χ 2 PG is calculated using MC simulations. In doing so, the pseudo-data fluctuations are generated by using the prediction for the corresponding best-fit parameters as the pseudo-data mean. We find that both the reactor and solar data, as well as reactor and gallium data, are fully compatible. For comparison, the middle columns in table 5  Data set  obtained under the assumption of Wilks' theorem. Although the differences in p-value with respect to the MC result are small in the case of consistent data sets, we find a larger discrepancy in the case of inconsistent data sets. This is shown in the last two rows of the table, where our MC simulations show a reduction of ∼ 0.5σ with respect to the expected result under the assumption that Wilks' theorem holds. Given the strong tension between solar and gallium data, we will not combine these two data sets in the following but consider only the combinations reactor+solar and reactor+gallium separately. The best-fit points and p-values for the null hypothesis of these combinations can be found in the last rows of table 1. While the combination 3ν,obs (solar) Figure 6. Survival function of ∆χ 2 3ν under the null hypothesis for reactor data combined with solar (green) or with gallium (blue), for 10 6 runs. For comparison we also show the χ 2 (2 dof) SF (solid red line), the maximum-Gauss distribution for n = 60 (dash-dotted purple), see [24], and a linear extrapolation of the tail of the gallium SF (dash-dotted light blue).
of solar and reactor data shows no significant hint of sterile-neutrino oscillations, the reactor+gallium combination prefers oscillations over the null hypothesis with ∆χ²_3ν = 38.8. Assuming Wilks' theorem (2 dof), this would correspond to p_0^(W) = 3.8 × 10⁻⁹ (5.9σ). Such a high significance makes it impossible to calculate the p-value directly by MC.
In fig. 6, we show the MC distribution of the test statistic for the null hypothesis, ∆χ²_3ν, defined in eq. (2.3). The reactor+solar and reactor+gallium data sets have almost identical distributions, which differ significantly from a χ² distribution but have a shape similar to the maximum-Gauss distribution discussed in [24]. This suggests that in both cases the statistical properties are dominated by the reactor data set. We can use this observation to extrapolate the SF of ∆χ²_3ν to the observed value for reactor+gallium, assuming a maximum-Gauss distribution with n = 60, which gives a good fit to the MC distribution in the region where it can be simulated (cf. the purple dash-dotted curve in fig. 6). In any case, we confirm the 5σ indication in favor of sterile-neutrino oscillations from combined reactor+gallium data, which is of course driven by the BEST result. Under the adopted assumptions for the BEST analysis, this is a statistically robust result.
In fig. 7 we show the allowed regions for reactor+solar data (left) and reactor+gallium data (right). For the reactor+solar data we find no closed contours at 2σ for the FC analysis, consistent with the null hypothesis being allowed at 1.3σ. The main effect of the solar neutrino data is to exclude large mixing angles in the region ∆m² ≳ 10 eV². For the reactor+gallium analysis, the determination of the mixing angle, 0.11 < sin²2θ < 0.53 (3σ), is dominated by gallium data, whereas ∆m² is determined by Neutrino-4 reactor data. The best-fit point is located at sin²2θ = 0.32, ∆m² = 8.86 eV² (best fit, reactor+gallium data), (4.4) although several local minima are present within 2σ. In both panels we note again that the FC confidence regions are larger by roughly 1σ than the contours based on Wilks' theorem.

Summary and Conclusions
We have performed a global Feldman-Cousins analysis of the reactor short-baseline experiments DANSS, NEOS, PROSPECT, STEREO and Neutrino-4, past gallium data from SAGE and GALLEX and the recent result of the BEST experiment, as well as solar neutrino data. The analysis is performed in a 3+1-neutrino-oscillation framework.
In general, our results show that a full Feldman-Cousins treatment reduces all p-values by about 1σ as compared to a naïve application of Wilks' theorem, as shown in table 1.
Our main findings are that no reactor experiment individually favors oscillations at more than 2.4σ, with the strongest hint coming from Neutrino-4. Combined reactor data are consistent with no sterile-neutrino oscillations at 1.1σ, and solar neutrino data are fully compatible with this result. On the other hand, gallium data favor a rate deficit at more than 5σ, even under conservative choices for the cross-section uncertainty. This result is compatible with reactor data and would then point to values of ∆m² ≈ 7-12 eV², driven by the Neutrino-4 result. At the same time, the gallium and solar results are in tension, according to a parameter goodness-of-fit test [61], by more than 3σ. One important caveat to our findings is that we take each collaboration's statements about systematic uncertainties at face value.
In comparison to earlier global fits of reactor and gallium data, for instance refs. [19,20,32], we find that the newest data from DANSS, NEOS, PROSPECT and STEREO are consistent with the no-oscillation hypothesis; the evidence in favor of oscillations stems entirely from the latest BEST gallium data, consistent with hints at much weaker significance from previous gallium measurements and the Neutrino-4 reactor data, which all point to larger mixing angles and mass-squared splittings. These data sets are in mutual agreement, with a parameter goodness-of-fit p-value of 0.5. The reactor experiments that see no evidence for oscillations are simply insensitive to the region preferred by Neutrino-4 and the gallium data. Although the Neutrino-4 results have been challenged on experimental grounds [23,63-65], we take the results published by the collaboration as the basis for our analysis. Neutrino-4 alone prefers oscillations at 2.4σ (with a p-value of 1.5%), whereas the gallium data exclude the no-oscillation hypothesis at more than 5σ.

For gallium data, the main systematic is the cross section for the reaction ⁷¹Ga + ν_e → ⁷¹Ge + e⁻; we have performed a literature survey of several cross-section calculations. The difficulty arises from the excited-state contributions, since the ground-state-to-ground-state transition matrix element can be determined directly from the observed electron capture of ⁷¹Ge. The excited-state contributions, however, add incoherently and thus can only increase the cross section, making the effect larger. Even considering only the ground-state contribution to the cross section, we find a result that is significant at more than 5σ, see table 4. If this result is due to some systematic effect, it therefore seems unlikely to be related to the cross section.
Note that the results from the two detector volumes in BEST are practically identical and thus this result (as was the case for past gallium results) is an absolute rate measurement, leaving room for some common, as-yet-unidentified, systematic affecting the overall normalization of the event rates.
On the other hand, solar data strongly reject sterile-neutrino oscillations with large mixing angles for the whole range of mass-squared splittings relevant here. As a consequence, solar data are in significant tension with the gallium data (with a parameter goodness-of-fit of at least 3 σ, see table 5 and fig. 5). The evidence in favor of oscillation from reactors is much weaker and hence there is no tension in a global fit between solar and reactor data. The solar bound presumably could be evaded by constructing a model in which sterile neutrinos experience some anomalous matter effect in the Sun, but this goes beyond the 3+1 model considered here.
This leaves the question of how to test our findings. In the context of 3+1 oscillations, note that the oscillation length of a 4 MeV neutrino with ∆m² = 10 eV² is only about 0.5 m, making the direct observation of oscillations in a reactor experiment very challenging. The KATRIN experiment is, in principle, sensitive to sterile neutrinos down to ∆m² ∼ 10 eV², but a strong exclusion limit can only be obtained if a prior is imposed on the effective parameter, m²_{νe,eff} > 0. This prior is well motivated by physics, but systematic effects in beta-decay endpoint experiments have consistently led to negative measured values of m²_{νe,eff} (for a recent review see, e.g., ref. [66]). Given the strong tension with solar data, and the fact that large values of ∆m² ∼ 10 eV² are also in tension with standard cosmology, it is worthwhile to investigate the possibility that we are dealing with new physics different from a vanilla sterile neutrino, as well as to search for as-yet-unidentified experimental systematics.

A Details on the Experiment Simulations and Data Analyses
In this appendix, we describe our analyses of the reactor experiments, as well as our treatment of solar neutrino data. While our analyses of the reactor experiments are insensitive to the modeling of the (differential) flux dΦ/dE_ν and of the inverse-beta-decay cross section σ_IBD, we must make some choices for these in order to perform our simulations. When necessary, we simulate event rates using the Huber-Mueller spectra [5,6] and the Beacom-Vogel cross section [67], unless noted otherwise. We reiterate that our results are insensitive to these choices.

A.1 DANSS
The flux of neutrinos is determined using the fuel fission fractions provided by DANSS [14, p. 5]. We assume that the reconstructed energy E_rec is distributed as a Gaussian with mean E_ν, with the width encoded in the reconstruction matrix R(E_rec, E_ν). The width is taken from ref. [68, fig. 5], but increased by 10% to better match the results of the collaboration.
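As an illustration, a Gaussian response of this type can be discretized into a reconstruction matrix column by column. The sketch below integrates the Gaussian over each E_rec bin; the bin edges, test energy and the 10%-inflated width are placeholder numbers, not the actual DANSS values:

```python
from math import erf, sqrt

def reconstruction_column(e_true, e_rec_edges, sigma):
    """Fraction of events with true energy e_true that are reconstructed
    in each E_rec bin, for a Gaussian response of width sigma.
    This is one column of the reconstruction matrix R(E_rec, E_nu)."""
    # Gaussian CDF centred on e_true, evaluated at every bin edge
    cdf = [0.5 * (1.0 + erf((x - e_true) / (sqrt(2.0) * sigma)))
           for x in e_rec_edges]
    # probability content of each bin = difference of adjacent CDF values
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]

# placeholder numbers: 4 MeV true energy, 28 bins between 1 and 8 MeV,
# and a 10%*sqrt(E) resolution inflated by 10% as described in the text
edges = [1.0 + 0.25 * i for i in range(29)]
col = reconstruction_column(4.0, edges, 1.10 * 0.10 * sqrt(4.0))
```

Stacking such columns over a grid of true energies yields the full matrix R(E_rec, E_ν).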
Because the baseline is so short, the finite sizes of the reactor core and the detector need to be taken into account. The detector is assumed to be a cube of 1 m³ volume [69] and the reactor size is taken to be 4.5 m; this number is set slightly higher than the value quoted in ref. [14] in order to obtain a better match to the results of the collaboration, which we attribute to diagonally travelling neutrinos having a longer path length. For each baseline I ≡ B, M, T (bottom, middle, top), the predicted number of events is then computed as in eq. (A.2).

The DANSS analysis is performed in terms of the ratios between the numbers of events at the different locations, as outlined in section 2.1, see eq. (2.1). The predicted values for the ratios can be written in terms of the event rates in eq. (A.2); the no-oscillation predicted ratios R^3ν_1, R^3ν_2 can be found in ref. [14, p. 17]. For the covariance matrix we assume that the uncertainty of the event rates in each bin is statistics dominated (σ_{N^i_I} = √(N^i_I)); this describes correlations due to the fact that P^i_{1,2} contain common factors of N^i_{B,T}. The final ingredient in the computation of the χ² function for DANSS is the effect of systematic uncertainties. We introduce two nuisance parameters κ_{1,2}, one for each of the measured ratios, associated with the relative efficiency, and add a pull term of width σ_sys = 0.2% for each of them; D^i_a denotes the ratios measured by the experiment. The minimization over the nuisance parameters is performed analytically, noting that the problem amounts to minimizing a degree-two multinomial in them.
By following this procedure, we find good agreement between our χ 2 map and the result reported in ref. [14, p. 16] by the collaboration. A comparison of the best-fit points is given in table 1.
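For a single pull that enters the prediction linearly, the χ² is quadratic in the pull and the stationarity condition is linear. A minimal sketch (one efficiency pull and one data series; a simplification of the actual DANSS χ², which couples two pulls through correlated ratios):

```python
def minimize_quadratic_pull(data, pred, sigma, sigma_sys):
    """Analytic minimum of
        chi2(k) = sum_i (data_i - k*pred_i)^2 / sigma_i^2 + (k - 1)^2 / sigma_sys^2
    over the efficiency pull k: since chi2 is quadratic in k,
    d(chi2)/dk = 0 is a linear equation, solved in closed form."""
    a = sum(p * p / s ** 2 for p, s in zip(pred, sigma)) + 1.0 / sigma_sys ** 2
    b = sum(d * p / s ** 2 for d, p, s in zip(data, pred, sigma)) + 1.0 / sigma_sys ** 2
    k = b / a
    chi2 = sum((d - k * p) ** 2 / s ** 2 for d, p, s in zip(data, pred, sigma)) \
           + (k - 1.0) ** 2 / sigma_sys ** 2
    return k, chi2
```

With several pulls the stationarity conditions form a small linear system, which is solved the same way.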
To generate the pseudo data, we first compute the mean value n^i_I for each bin. Statistical fluctuations are then injected by randomly sampling a Gaussian distribution for each bin, centered around its mean value and with standard deviation σ_{I,i} = √(n^i_I/∆t_I).

A.2 NEOS
The NEOS analysis uses the neutrino spectrum reported by the Daya Bay collaboration, obtained from ref. [53]. Because of this complication, and because the detector response is not precisely known, we obtain our prediction by applying a ratio to the collaboration's no-oscillation prediction.
Let us first denote the collaboration's prediction in the absence of oscillations by S_{i,0}. We obtain S_{i,0} by dividing the extracted signal points (corresponding to the observed data minus the expected background) from ref. [15, fig. 3a] by the ratios from ref. [15, fig. 3c]. This fixes the normalization Σ_i S_{i,0} = 1891.6.
To obtain the prediction S_i for the sterile-neutrino hypothesis, we rescale the no-oscillation prediction by the ratio defined in eq. (A.8), where DB_i is a function defined for each bin i in reconstructed energy through eq. (A.9). Here, R(E_rec, E_ν) stands for the reconstruction matrix (see below for an exact definition), while [σ_IBD dΦ/dE_ν](E_ν) is the flux-weighted cross section in terms of the true neutrino energy, taken from ref. [53]. In eq. (A.8) we have also defined an oscillation probability weighted with the inverse baseline squared, where P_ee is the 3+1-oscillation probability at NEOS; L_min and L_max are the minimum and maximum baselines, respectively; and ρ(L) is the baseline distribution (extracted from ref. [70, p. 8]). Note that the Daya Bay spectrum from ref. [53] unfolds the three-neutrino oscillations and, since Daya Bay would be sensitive to both 3ν and 4ν oscillations, we need to remove distortions that would be introduced under a different hypothesis. For the large values of ∆m² considered in this work, we assume that the oscillations due to the new frequency introduced by the sterile neutrino average out at Daya Bay, which leads to the factor [1 − sin²2θ/2]⁻¹ in eq. (A.8). Corrections for the different isotope fractions are applied as per ref. [15].
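The baseline-weighted survival probability can be sketched numerically as follows (a discretized stand-in for the averaging in eq. (A.9); the baseline grid and distribution weights would come from ρ(L), and all inputs below are placeholders):

```python
from math import sin

def weighted_survival(sin2_2th, dm2, e_nu, baselines, rho):
    """Survival probability averaged over the baseline distribution with
    a 1/L^2 flux weight. baselines in m, e_nu in MeV, dm2 in eV^2;
    rho: list of baseline-distribution weights (same length as baselines)."""
    num = den = 0.0
    for L, w in zip(baselines, rho):
        # two-flavour-like 3+1 survival probability at this baseline
        p_ee = 1.0 - sin2_2th * sin(1.267 * dm2 * L / e_nu) ** 2
        num += w * p_ee / L ** 2
        den += w / L ** 2
    return num / den
```

In the paper part of this integral is done analytically; the brute-force sum above is only meant to show the structure of the weighting.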
The reconstruction matrix R(E_rec, E_ν) has two components. For the first, we take the approximation from ref. [71] for positron energy loss, eq. (A.11), with Z = 0.01. Here, the energy resolution is parametrized following ref. [72, fig. 3], and the average energy E_avg includes the non-linear detector response through a function f( · ) taken from the supplementary material of ref. [73], with E_p = E_ν − ∆_np + m_e, where m_e is the electron mass and ∆_np the neutron-proton mass difference. The choice of parametrization for the function f was made by tuning A E_p f(a E_p + b) to the results of the collaboration, as no detector response was provided. Besides eq. (A.11), we also consider a second contribution, eq. (A.14), from escaping 511-keV annihilation photons, with a relative height of 0.5 as per ref. [70, p. 15]. The final reconstruction matrix R(E_rec, E_ν) is computed by adding the contributions from eqs. (A.11) and (A.14). However, for the last bin we find that this approach does not provide good agreement with ref. [15]; the entry in this last bin is therefore estimated as a piecewise constant (see the inset of [15, fig. 3a]) such that the data/Daya Bay ratio is correctly reproduced. Finally, once the reconstruction matrix has been computed, the integral over reconstructed energy in eq. (A.9) is done analytically.

Following ref. [15], we define the NEOS covariance matrix as V_ij = V^syst_ij + V^stat_ij. The matrix V^stat accounts for the statistical errors, eq. (A.15), where B_i corresponds to the background events and S^obs_i to the observed background-subtracted events per bin i (that is, the black points in [15, fig. 3a]), while ∆t_on and ∆t_off correspond to the data-taking periods in days with the reactor on and off, respectively. In eq. (A.15) the first term is the Poisson error, while the second stems from the fact that the background rate was measured during the (relatively short) reactor-off period and is therefore not known precisely.
The contribution V^sys, on the other hand, is defined in terms of the matrix V^DB, constructed through a bicubic interpolation of the covariance matrix in ref. [53, table 13] and rescaled by a normalization constant N. As for systematic uncertainties, we include a nuisance parameter ξ for the overall flux normalization, but we add no penalty term for it to the χ²: it is left free in the fit. The NEOS χ² is then a standard covariance-matrix χ², where the minimization over ξ is done analytically. We compared our Bayesian exclusion curve (computed as in ref. [15, eq. (4)]) to the result of the collaboration; in this method, the unnormalized probability distribution for a fixed ∆m² is exp(−∆χ²/2). Our curve agrees extremely well with theirs in the range 1-2 eV².
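Profiling a free normalization out of a covariance-matrix χ² is a linear least-squares problem with a closed-form solution. A generic sketch in plain Python, with the inverse covariance matrix passed in directly:

```python
def profile_norm(data, pred, vinv):
    """Analytic minimisation of chi2 = (D - xi*S)^T V^-1 (D - xi*S) over
    the free normalisation xi (no penalty term):
        xi* = (S^T V^-1 D) / (S^T V^-1 S).
    vinv is the inverse covariance matrix as a nested list."""
    n = len(data)
    svd = sum(pred[i] * vinv[i][j] * data[j]
              for i in range(n) for j in range(n))
    svs = sum(pred[i] * vinv[i][j] * pred[j]
              for i in range(n) for j in range(n))
    xi = svd / svs
    chi2 = sum((data[i] - xi * pred[i]) * vinv[i][j] * (data[j] - xi * pred[j])
               for i in range(n) for j in range(n))
    return xi, chi2
```

This mirrors the structure of the analytic ξ minimization described above; the actual NEOS fit of course uses the full V_ij built from V^stat and V^sys.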

Due to our incomplete knowledge of the detector response, our fit does not perfectly reproduce the results of the collaboration: in particular, we obtain more local minima, and the location of our global minimum differs from the one obtained by the collaboration, see table 1. We attribute this to the unknown detector response, combined with the impact of systematic uncertainties (e.g., the energy scale), which are technically difficult to implement outside the collaboration. However, we have checked that the differences in χ² are relatively small. For example, the collaboration obtains two degenerate minima at (sin²2θ, ∆m²) = (0.05, 1.73 eV²) and (0.04, 1.30 eV²); for these two points of parameter space, our fit yields a χ² higher by approximately two units, see table 6.

A.3 PROSPECT
We approximate the detector response using the extensive public data release accompanying ref. [16], which contains efficiencies (ε_s) and response matrices (R^s_ij) for each segment s of the detector. For our analysis, we group the segments into baseline bins, which we index with l, according to the segment map given by the collaboration. The fit is performed using the GLoBES [74,75] libraries. For each bin l, the detector mass is set to the sum of the relative efficiencies of the segments contained in that bin. Moreover, its effective baseline is defined by adding the contribution of each segment, weighted by its relative efficiency w_s = ε_s / Σ_{s'} ε_{s'}. Here, the baseline distribution ρ_s(L) for each segment s is constructed by randomly selecting points in the reactor core and in the detector segment to form a density histogram (the segment map provided by the collaboration determines the detector geometry, while the reactor core is modeled as a cylinder of height 51 cm and radius 22 cm).
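The geometric sampling behind ρ_s(L) can be sketched as follows. The cylinder dimensions are those quoted above; the segment size and its distance from the core are hypothetical placeholders, not the PROSPECT geometry:

```python
import random
from math import sqrt, pi, cos, sin

def sample_baselines(d_center, n=10000, core_h=0.51, core_r=0.22,
                     seg=(0.15, 0.15, 0.15), seed=1):
    """Draw baseline samples between random points in a cylindrical
    reactor core and a box-shaped detector segment whose centre sits
    d_center metres away along x. seg is the (placeholder) segment size."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        # uniform point in the cylinder (sqrt gives uniform areal density)
        r = core_r * sqrt(rng.random())
        phi = 2.0 * pi * rng.random()
        src = (r * cos(phi), r * sin(phi), core_h * (rng.random() - 0.5))
        # uniform point in the box-shaped segment
        det = (d_center + seg[0] * (rng.random() - 0.5),
               seg[1] * (rng.random() - 0.5),
               seg[2] * (rng.random() - 0.5))
        out.append(sqrt(sum((a - b) ** 2 for a, b in zip(src, det))))
    return out

dist = sample_baselines(7.0)   # e.g. a segment ~7 m from the core
```

Histogramming `dist` yields the density histogram described in the text.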
The four-neutrino oscillation probability for a given bin l and neutrino energy E_ν is computed as P^l_ee(sin²2θ, ∆m²), where for each baseline bin l the oscillating term is averaged over the baseline distribution ρ_s(L). At a given baseline bin l and for an energy bin i, the predicted event rates for a set of oscillation parameters are proportional to the flux- and cross-section-weighted probability folded with the effective energy response matrix R^l_ij for bin l, where dΦ/dE_ν is the predicted neutrino flux and σ is the cross section. We have grouped the energy index i and the baseline index l into a combined index (il) as a bookkeeping device. The weights W_s account for the relative expected event rates in the segments; in principle, these vary as a function of the sterile-neutrino parameters, but we assume that such differences are negligible. The result from eq. (A.25) is then used to rescale the predicted signal event rates of the collaboration in the absence of oscillations (S⁰_(il), provided in the data release). Once the event rates have been computed, the χ² is calculated as in ref. [16, eq. 11], where V is the covariance matrix provided by the collaboration and O_(il) is the observed number of events in energy bin i for baseline bin l. In fig. 8 (left), we show our Feldman-Cousins allowed regions, with statistical fluctuations generated as explained in appendix B; we find very good agreement with the results reported by the collaboration.

A.4 STEREO
For each of the six cells in the STEREO detector, we calculate the effective baseline and the oscillation probability in a similar fashion to PROSPECT (described in appendix A.3), according to the geometry described in ref. [17]. We index these cells with l = 1, …, 6. In principle, light produced in a given cell can leak into its neighbors, potentially converting a single event of a given energy into multiple events of lower energy in several segments; we assume, however, that this cross-talk is negligible in our analysis.
The STEREO data set is divided into two phases, which we index with λ = I, II. We assume that all cells have the same energy resolution and that the cell masses are identical for both phases. The energy-resolution function can be read from ref. [76, fig. 2.19] for phase I and from ref. [17, fig. 11] for phase II. The detector non-linearities can be found in the same references; the extracted data for the non-linear detector responses are fit to quadratic functions.

As mentioned in the main text, the STEREO collaboration has provided their data normalized with respect to their best-fit no-oscillation hypothesis for each phase. As such, in order to compare our predictions with their data, we must normalize our predictions in the same way. To wit, for each phase λ, the prediction we employ for energy bin i and cell l is normalized following ref. [17], where η^λ_l is a pull introduced for the cell-uncorrelated energy-scale uncertainty, ξ^λ is its cell-correlated counterpart, and ζ^λ_l is a pull for the signal-normalization uncertainty. Here, N^{0,λ} is the prediction for the no-oscillation hypothesis with the fixed (optimized) pulls η̄^λ_l, ξ̄^λ, ζ̄^λ_l provided in ref. [17, fig. 31]. The χ² function for STEREO is computed by adding the contributions of the data from the two phases. Following ref. [17, eqs. (15, 19)], we include a set of free flux normalizations for each bin i (φ_i, common to the two phases) as well as a relative normalization uncertainty between the two phases (Φ_I, common to all bins). Note that while the ten lowest-energy bins are identical for the two phases, there is an additional high-energy bin in the phase-II data (i.e., N^II_Ebins = N^I_Ebins + 1); as such, the phase-II analysis depends on an additional flux nuisance parameter φ to which phase I is insensitive.
The final χ² function, eq. (A.32), is obtained after minimization over all nuisance parameters. Here, O^λ_{l,i} are the observed data points in baseline bin l and energy bin i [17,77], while σ^λ_{l,i} are the statistical errors on the data. The widths of the pull terms for the constrained nuisance parameters are set to the corresponding uncertainties quoted by the collaboration, while both φ_i and Φ_I are left completely unconstrained in the fit. Finally, we perform a change of variables (with, for consistency, β^II_l = ζ^II_l) that isolates the dependence on ξ^λ and Φ_I in the quadratic pull terms, which can then be minimized analytically. The remaining pulls are minimized numerically with a non-linear conjugate-gradient algorithm [78].
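A minimal non-linear conjugate-gradient minimizer of the Polak-Ribière type, applied to a toy two-pull χ², might look as below. This is a stand-in sketch, not the implementation of ref. [78]; the gradients are numerical and the line search is a crude backtracking:

```python
def minimize_cg(f, x0, tol=1e-10, max_iter=200):
    """Polak-Ribiere non-linear conjugate gradient with numerical
    gradients and a backtracking line search (toy illustration)."""
    def grad(x, h=1e-6):
        g = []
        for i in range(len(x)):
            xp, xm = list(x), list(x)
            xp[i] += h
            xm[i] -= h
            g.append((f(xp) - f(xm)) / (2.0 * h))
        return g

    x = list(x0)
    g = grad(x)
    d = [-gi for gi in g]
    for _ in range(max_iter):
        # halve the step until the function decreases along d
        t, fx = 1.0, f(x)
        while f([xi + t * di for xi, di in zip(x, d)]) >= fx and t > 1e-12:
            t *= 0.5
        x = [xi + t * di for xi, di in zip(x, d)]
        g_new = grad(x)
        if sum(gi * gi for gi in g_new) < tol:
            break
        # Polak-Ribiere beta, restarted (set to 0) when negative
        beta = max(0.0, sum(gn * (gn - go) for gn, go in zip(g_new, g))
                        / sum(go * go for go in g))
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        g = g_new
    return x

# hypothetical chi^2 for two energy-scale pulls with priors and a
# mild correlation between them (all numbers invented for illustration)
chi2 = lambda p: (p[0] - 0.3) ** 2 / 0.04 + (p[1] + 0.1) ** 2 / 0.01 \
                 + 0.5 * p[0] * p[1]
pulls = minimize_cg(chi2, [0.0, 0.0])
```

In practice one would use an established optimizer; the point of the sketch is only the structure of the CG update.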
Using our χ² implementation, we are able to reproduce the results of the collaboration to a reasonable degree of accuracy, both for the ∆χ² and for the 95% CL_s curves published in their data release [79]. A comparison of our FC 95% C.L. curve is shown in fig. 8 (right). While the shape of our exclusion curve is similar to the result of the collaboration, we do find a general reduction in our calculated exclusion. We think, however, that the level of agreement achieved is reasonable, given that:

1. The χ² is quite flat in the low-∆m² region. Consequently, small numerical differences, either in the low-∆m² region or around the global minimum, can lead to sizeable shifts in the locations of contours of constant ∆χ².

2. Refining our simulation beyond the current level would make it computationally too expensive.
3. Our detector response is only approximate. We use cell-and phase-specific descriptions of the non-linearity, but assume a common resolution for all cells for a given phase. This may affect the high-∆m 2 region in particular.
Finally, note that when generating pseudo data for STEREO, we sometimes obtain negative fluctuations for some of the bins in the high-energy tail. Whenever that happens, we simply discard the draw and redraw the fluctuations for all bins.

A.5 Neutrino-4
Our analysis of Neutrino-4 is based on ref. [24]. As outlined in section 2.1, the event rates are binned in L/E and grouped into 27 bins. In our analysis, we use the first 19 bins of this data set, shown as blue triangles in fig. 47 of v2 of the preprint of ref. [18].
The collaboration performed their analysis in terms of the ratio of the events in each bin to the total rate in all bins, as in eq. (2.2). The predicted event rates in each bin (in the absence of statistical fluctuations) can be computed as in eq. (A.35), where the average over each bin i is indicated by ⟨ · ⟩_i and takes into account both energy and baseline uncertainties. The fit to Neutrino-4 data is performed with a simple Gaussian χ² definition, where σ_i corresponds to the statistical uncertainty read off fig. 47 of ref. [18], while R^obs_i and R^pred_i refer to the observed and predicted ratios in each bin. Once the theoretical prediction has been computed for each bin, we generate statistical fluctuations in each bin according to a Gaussian centered at the theoretical prediction, N(R^pred_i, σ_i), where σ_i is the uncertainty extracted from the same figure of ref. [18].
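A stripped-down version of this ratio fit can be sketched as follows (the bin averaging of eq. (A.35) is omitted and all inputs are hypothetical):

```python
from math import sin

def ratio_chi2(sin2_2th, dm2, l_over_e, r_obs, sigma):
    """Gaussian chi^2 for ratios of the rate in each L/E bin to the
    bin-averaged rate, in the spirit of the Neutrino-4 analysis
    (resolution/bin averaging omitted in this sketch)."""
    # survival probability per L/E bin (l_over_e in m/MeV, dm2 in eV^2)
    p = [1.0 - sin2_2th * sin(1.267 * dm2 * x) ** 2 for x in l_over_e]
    mean = sum(p) / len(p)
    r_pred = [x / mean for x in p]
    return sum((o - t) ** 2 / s ** 2
               for o, t, s in zip(r_obs, r_pred, sigma))
```

By construction the predicted ratios average to one, so an overall rate normalization drops out, as in eq. (2.2).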
Let us mention that Neutrino-4's treatment of the detector energy resolution has been criticized in refs. [63-65]. In particular, a proper treatment of the energy resolution may reduce the reported signal significance by approximately 0.5σ [65] and require even larger values of the mixing angle. Additionally, the authors of ref. [23] have claimed that unaccounted-for correlations between energy and efficiency may induce an oscillation-like signature at Neutrino-4. In our analysis we do take into account the finite energy (as well as spatial) resolution, see eq. (A.35). In choosing the value of the resolution we follow our general strategy of reproducing as closely as possible the official oscillation results published by the collaboration. Specifically, the average in each bin is performed using an effective energy uncertainty ∆E_eff = E √[(∆L/L)² + (∆E/E)²], with ∆E = 500 keV and ∆L = 0.48 m.

A.6 Solar Neutrinos
The solar-neutrino analysis performed here is based on the simplified χ 2 construction from ref. [59], which in turn is based on the global fit from ref. [81]. While we refer the interested reader to the aforementioned references for details, here we provide a brief explanation of the solar analysis for completeness.
Our fit considers four data points, as in ref. [59], corresponding to the four oscillation probabilities r = (P^LE_ee, P^HE_ee, P^LE_ex, P^HE_ex). (A.37) Here, P_ee is the electron-neutrino survival probability and P_ex = P_eµ + P_eτ is the transition probability of electron neutrinos to the other active neutrino flavours. The labels LE (low energy) and HE (high energy) refer to the energy regions below and above the MSW resonance, respectively. Under the same set of assumptions adopted in [59], the probabilities in eq. (A.37) depend only on the angles θ12, θ13, θ14. Since it has been shown in ref. [55] that the determination of θ13 is unaffected by the presence of a sterile neutrino, throughout this work we fix the value of θ13 to the three-neutrino best-fit point from ref. [81], sin²θ13 = 0.0223. Moreover, in ref. [59] it was also shown that the most conservative bound on θ14 is achieved for sin²θ12 = 0.3125; thus, this value is also kept fixed throughout, for simplicity (we have explicitly checked that allowing θ12 to vary has a negligible impact on our results).
The χ² function for solar data is computed as in eq. (A.38), where O_r and P_r stand for the observed and predicted values of the probabilities in eq. (A.37), respectively, while V⁻¹_rs denotes the inverse covariance matrix, see eq. (A.40) below. The observations O_r, as well as the corresponding uncertainties σ_r and correlation matrix ρ, are taken from ref. [59, tab. 1]. In this work we use the values for the GS98 solar model [62], corresponding to the upper half of the table. Let us also define a relative covariance matrix X_rs for the observations (no sum over repeated indices). Following ref. [59], a good approximation to the full solar-neutrino fit is obtained by splitting the inverse covariance matrix V⁻¹_rs into a theoretical and an experimental part, as in eq. (A.40) (no sum over repeated indices), where S⁻¹ is the inverse of S. Using the inverse covariance matrix defined in eq. (A.40), the χ² for solar data can be computed as in eq. (A.38).
Regarding the generation of pseudo data, note that when P_r = O_r we have V_rs = X_rs, and the dependence on P_r drops out (see [59, eq. (2.12)]). Therefore, we use the Cholesky decomposition of X to generate the statistical fluctuations for the pseudo data, following the procedure outlined in appendix B.2.
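The Cholesky trick for generating correlated Gaussian fluctuations can be sketched in a few lines (plain Python; in practice one would use a linear-algebra library):

```python
import random
from math import sqrt

def cholesky(a):
    """Lower-triangular L with L L^T = A, for a symmetric
    positive-definite matrix A given as a nested list."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = sqrt(a[i][i] - s) if i == j else (a[i][j] - s) / L[j][j]
    return L

def correlated_fluctuations(mean, cov, rng):
    """Draw one pseudo-data vector with the given mean and covariance:
    mean + L * delta, with delta a vector of standard normals."""
    L = cholesky(cov)
    d = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + sum(L[i][k] * d[k] for k in range(i + 1))
            for i, m in enumerate(mean)]
```

Applied to X (or, for the reactor experiments, to the relevant bin covariance), this is exactly the fluctuation step of appendix B.2.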

B.1 Treatment of nuisance parameters
When generating pseudo data, in addition to statistical fluctuations on the event rates, we need to include fluctuations on the nuisance parameters as well. This means that the pseudo data should be generated using a set of "true" nuisance parameters, ξ true , as the mean for ξ. The question is what value should be taken for ξ true . Within a frequentist approach, a priori one should scan all possible values of ξ and study the distribution of the test statistic accordingly (that is, treat the nuisance parameters as if they were additional parameters of the model). However, in practice this is often not possible and a choice has to be made regarding the assumed values for ξ true .
For all reactor experiments considered in this work, we leave the overall normalization of the event rates completely free; as such, we do not consider fluctuations of these free pulls, fixing them to the values that provide the best fit to the observed data. We do so because a free normalization implies that no prior information on that parameter is being used. Therefore, since the only data to consider are those measured by the experiment, it seems plausible to use them to determine the flux normalization. In other words, we allow each experiment to "choose" its preferred value of the overall normalization of the event rates, given that we are only interested in the observation of an oscillatory signal in the spectrum. Note that this is done independently for each set of assumed "true" oscillation parameters, sin²2θ, ∆m².
For the rest of the nuisance parameters considered here, prior measurements are typically available and can be used to determine the relevant range of values to consider when generating the fluctuations. Thus, for each nuisance parameter we will generate Gaussian fluctuations around the best-fit values from previous measurements (i.e., the mean value as reported by the collaboration), taking the reported uncertainties as the width of the Gaussian.

B.2 Generation of pseudo data
We use the following procedure to generate pseudo data in our analysis:

(i) choose the parameters ∆m², sin²2θ for the hypothesis of new oscillations;

(ii) calculate the normalization-related nuisance parameters {η_min} that provide the best fit to the observed data for the assumed hypothesis;

(iii) generate fluctuations for the nuisance parameters subject to prior constraints, taking as central values their reported best-fit values ξ̄_k, i.e., ξ^pseudo_k = ξ̄_k + σ_{ξ_k} δ_k, (B.1) where δ_k are standard-normal fluctuations and ξ̄_k are the values to which the pulls are constrained by the penalty terms;

(iv) compute the prediction for the assumed hypothesis, using the fluctuated nuisance parameters and the calculated flux normalizations, and set this as the mean for the generation of pseudo data in each bin i;

(v) evaluate the covariance matrix V^pseudo for this pseudo-data mean;

(vi) calculate the Cholesky decomposition L of V^pseudo (note that if the χ² does not contain a covariance matrix, then L is diagonal, with the standard deviations of each bin as the diagonal entries);

(vii) inject fluctuations around the pseudo-data mean using L, i.e., add Σ_k L_jk δ_k to the mean in each bin j, where δ_k are standard-normal fluctuations.
B.3 Comment on the χ² distribution related to randomized nuisance parameters

Here we provide some justification for the way we generate pseudo data in the presence of constrained nuisance parameters. To this aim, we consider the simplified case of a single pull ξ describing a normalization uncertainty. Without loss of generality we set ξ̄ = 0, so that the prediction depends on the pull through an overall factor (1 + ξ) multiplying T_i(θ), where θ denotes generic model parameters. For simplicity, let us also assume that the data are a priori uncorrelated, with variances σ²_i, so that eq. (B.3) reduces to its diagonal form; the test statistic for the pseudo data, χ²_pseudo, is then given by eq. (B.7).

Let us now prove that, for fixed parameters θ, the test statistic χ²_pseudo is distributed as a χ² with N dof. First, it follows from eqs. (B.4) to (B.6) that the D^pseudo_i are multivariate Gaussian variables, with mean and covariances given by eq. (B.8), where Var(X_i, X_j) = ⟨X_i X_j⟩ − ⟨X_i⟩⟨X_j⟩ and δ_ij is the Kronecker delta. On the other hand, eq. (B.7) has the standard form of a pull-χ², which is known¹² to be equivalent to a covariance-matrix χ² with S_ij given in eq. (B.8). Hence, by a suitable variable transformation (involving the diagonalization of S), one can write χ²_pseudo = Σ^N_{i=1} ζ²_i, with the ζ_i being N independent standard-normal random variables; therefore χ²_pseudo follows a χ² distribution with N dof.

Remark 1: This proof generalizes in a straightforward way to the case of a nontrivial correlation matrix between the data points, and to the case of several pulls, as long as they enter the prediction linearly.
Remark 2: In this proof it has been essential to use the reported best-fit value ξ̄ as the central value around which to fluctuate the pull, as in eq. (B.1). (Recall that we have set ξ̄ = 0 in this section, without loss of generality.) If a mean value different from ξ̄ were used to generate the random pulls, the pseudo data would be biased and χ²_pseudo would follow a non-central χ² distribution.
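This statement can be checked numerically for a toy model with a single normalization pull and a unit-width penalty (our own illustration, not any of the experimental χ² functions): pseudo experiments generated as in eq. (B.1) should yield a minimized χ² whose mean is close to the number of bins N.

```python
import random

def chi2_min_with_pull(data, pred, sigma):
    """Pull chi^2 for a linear normalisation pull xi with unit-width penalty,
        chi2(xi) = sum_i (D_i - (1 + xi) T_i)^2 / sigma_i^2 + xi^2,
    minimised analytically over xi (the function is quadratic in xi)."""
    a = sum(t * t / s ** 2 for t, s in zip(pred, sigma)) + 1.0
    b = sum((d - t) * t / s ** 2 for d, t, s in zip(data, pred, sigma))
    xi = b / a
    return sum((d - (1.0 + xi) * t) ** 2 / s ** 2
               for d, t, s in zip(data, pred, sigma)) + xi ** 2

# pseudo experiments: fluctuate the pull around xi_bar = 0 (eq. (B.1))
# and add statistical fluctuations on top of the shifted prediction
rng = random.Random(42)
pred, sigma = [100.0] * 5, [10.0] * 5
trials = []
for _ in range(20000):
    xi_true = rng.gauss(0.0, 1.0)
    data = [(1.0 + xi_true) * t + rng.gauss(0.0, s)
            for t, s in zip(pred, sigma)]
    trials.append(chi2_min_with_pull(data, pred, sigma))
mean_chi2 = sum(trials) / len(trials)   # should be close to N = 5
```

As Remark 2 predicts, replacing `xi_true` by a draw centered away from ξ̄ = 0 biases the distribution toward a non-central χ².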

B.4 MC uncertainty of FC confidence intervals
We describe how we estimate the uncertainty of our FC confidence regions due to the finite size of the MC sample. Consider a point in parameter space with true p-value p. For a total of N MC trials, the number n of pseudo-data sets with ∆χ²_pseudo > ∆χ²_obs follows a binomial distribution with parameters N and p. Hence, we can define a p-value distribution via f(q) = N B(n; N, p), where q = n/N and B( · ; · , · ) is the binomial PDF. An α = 99% confidence interval [q_0, q_1] for this p-value can then be defined as in eq. (B.9), where F( · ) is the CDF of f( · ). For small p-values, the distribution is skewed. Note that, since the distribution f is discrete, the relations in eq. (B.9) are only approximate; they become exact in the large-N limit. The MC approximation for the p-value is n/N; we take this as the p parameter of the binomial distribution (i.e., we assume the MC approximation to be exact). To obtain the 99%-confidence uncertainty bands for a confidence region, we create a p-value map in the sin²2θ-∆m² plane and interpret the area between the contours at q_0 and q_1 as the 99%-confidence uncertainty on the β CL contour, with β = 1 − p.
¹²See ref. [82] for an explicit proof.

Note that this prescription, and in particular eq. (B.9), is independent of the specific experiment: it follows from the properties of the binomial distribution and depends only on the CL β = 1 − p, the MC sample size N, and the uncertainty level α. We plot some numerical values in fig. 9. The discontinuities visible in the bands for 3σ CL are due to the discrete nature of the binomial distribution. Note that for an MC sample size of 10³ it is not possible to calculate a meaningful 3σ confidence region at accuracy α = 0.99 or 0.9, as follows from q_0 = 0 for these parameters.
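The central interval of eq. (B.9) can be evaluated by a direct scan of the binomial CDF; a sketch with our own variable names, which also exhibits the q_0 = 0 behavior for small samples:

```python
def pvalue_band(n, N, alpha=0.99):
    """Central alpha-confidence interval [q0, q1] for the MC p-value
    estimate q = n/N, taking p = n/N as the exact binomial parameter
    (the approximation described in the text). Assumes 0 < n < N."""
    p = n / N
    lo_tail, hi_tail = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    pmf = (1.0 - p) ** N          # binomial pmf at k = 0
    cdf = pmf
    q0 = 0.0 if cdf >= lo_tail else None
    q1 = 0.0 if cdf >= hi_tail else None
    for k in range(1, N + 1):
        pmf *= (N - k + 1) / k * p / (1.0 - p)   # pmf recursion in k
        cdf += pmf
        if q0 is None and cdf >= lo_tail:
            q0 = k / N
        if q1 is None and cdf >= hi_tail:
            q1 = k / N
            break
    return q0, q1

band_1k = pvalue_band(10, 1000)     # 10 exceedances out of 10^3 trials
band_10k = pvalue_band(100, 10000)  # same p-value, 10x the sample
```

As expected, the band shrinks with the MC sample size at fixed p-value.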