Confidence in a neutrino mass hierarchy determination

In the next decade, a number of experiments will attempt to determine the neutrino mass hierarchy. Feasibility studies for such experiments generally determine the statistic \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \overline{{\varDelta {\chi^2}}} $\end{document}. As the hierarchy is a discrete choice, Δχ2 does not obey a one degree of freedom χ2 distribution and so the number of σ’s of sensitivity to the hierarchy is not the square root of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \overline{{\varDelta {\chi^2}}} $\end{document}. We present a simple Bayesian formula for the sensitivity to the hierarchy determination that can be expected from the median experiment as a function of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \overline{{\varDelta {\chi^2}}} $\end{document}.


JHEP01(2014)095
In the next two decades a number of reactor, accelerator and atmospheric neutrino experiments will attempt to determine the neutrino mass hierarchy, which is the sign of the mass difference ∆M 2 31 = M 2 3 − M 2 1 where M i is the ith eigenvalue of the neutrino mass matrix. If the sign is positive (negative), one says that the hierarchy is normal (inverted). Most of these experiments are still in the planning stages, where the key role is played by forecasts of the sensitivity of a given design to the hierarchy.
Such forecasts determine, either analytically or via Monte Carlo simulations, where χ 2 N (χ 2 I ) is the χ 2 statistic equal to a weighted sum of the squares of the differences between the data and predictions given the normal (inverted) hierarchy, choosing all of the nuisance parameters so as to minimize χ 2 N (χ 2 I ). The goal of these experiments is not to determine whether each of the hierarchies is consistent with the data, as would be usual in a frequentist approach. Rather, as it is already well accepted that precisely one of the hierarchies is manifested in nature, the goal of these experiments is to determine which of the hierarchies provides the best fit to the data. In this paper we will use the test statistic ∆χ 2 to answer this question as follows. We will define the best fit hierarchy to be that which yields the lowest value of χ 2 , and so the hierarchy determined by the experiment simply corresponds to the sign of ∆χ 2 .
The critical question is then, given ∆χ 2 , what is the sensitivity of a typical experiment to the hierarchy? In ref. [1] the authors showed that the most naive answer, the p value that would be obtained if ∆χ 2 satisfied a one degree of freedom χ 2 distribution, gives the incorrect answer. Indeed ∆χ 2 is not necessarily positive and so such a prescription would not even always be defined. In this note we will provide an analytic answer (13) to this question and will compare our answer to the results of simulations of Daya Bay II and disappearance data at NOνA.
Nested hypotheses. To begin, we will describe just why the p value is not the answer to the question stated above. Consider N data points {y i } generated by an experiment trying to determine an unknown quantity x. We will use the approximation in which these data points y i follow a Gaussian distribution peaked at y i (x) and σ i (x) are known functions of x. An experimenter is interested in two hypotheses. Hypothesis (A) states that x is a real number. Hypothesis (B) states that x = x 0 , for a particular real number x 0 . Clearly hypothesis (B) is a special case of hypothesis (A), so these hypotheses are said to be nested. In particular, (B) is obtained from (A) by fixing one, otherwise unconstrained, real number, the number x.
For any given value of x, the experimenter can define a statistic χ 2 (x) by simulating the experiment with that value of x and calculating the weighted sum of the squares of differences between his measured and simulated results -2 -

JHEP01(2014)095
The experimenter then determines a best fit x, for which χ 2 (x) is minimized. He then asks how compatible his results are with the hypothesis (B). To determine this, he calculates Unlike ∆χ 2 defined in eq. (1), δχ 2 is manifestly nonnegative, because x is defined so as to give the lowest value of χ 2 .
Just what value of δχ 2 should the experimenter expect? 75 years ago Wilks proved [2] that if hypothesis (B) is true then δχ 2 will obey a χ 2 distribution with a single degree of freedom. The experimenter can then determine a conditional probability that given (B), the experiment would have gone as badly as it did For example, if he found δχ 2 = 9, then p W would only be about 0.3%, and so a frequentist experimenter might conclude that he has ruled out (B) with 3σ of confidence.
Non-nested hypotheses. As described in ref. [1], the determination of the hierarchy is qualitatively different. The two hypotheses are the normal hierarchy (NH) and the inverted hierarchy (IH). These hypotheses are not nested, and they correspond to a discrete choice, not the fixing of a degree of freedom. So the conditions for Wilks' theorem are strongly violated. As was observed in general in ref. [3] and in this context in ref. [1], this means that the statistic ∆χ 2 defined in eq. (1) does not follow a χ 2 distribution. Just what distribution does ∆χ 2 follow? Let us begin with the simple case in which there are no nuisance parameters, which was applied to a toy model of the hierarchy determination in ref. [1].
An experiment will produce a set of numbers {y i }, which we assemble into a vector y. The normal and inverted hierarchies yield two theoretical estimates of this vector which we will denote y N and y I respectively. Again let us assume that the measured numbers y i are normally distributed about their mean with a variance σ 2 i , which for simplicity we take to be independent of the hierarchy. Without loss of generality, let us assume for the moment that the true hierarchy is normal. Then the measured numbers will be where g i is a standard Gaussian random variable. The statistic ∆χ 2 is then easily determined to be

JHEP01(2014)095
This identifies ∆χ 2 as a Gaussian distributed random variable with mean given by the first term on the right hand side and standard deviation given by the second term [1] Note that ∆χ 2 is the ∆χ 2 statistic without statistical fluctuations, for example it may be given by the theoretical spectra of ν e observed at a reactor experiment, of ν µ and ν µ at an iron calorimeter atmospheric neutrino experiment, or of ν e (ν e ) appearance at an accelerator experiment running in the neutrino (antineutrino) mode. In an atmospheric neutrino experiment one may use the spectra as a function of energy, zenith angle and even the inelasticity of the events [4]. ∆χ 2 is the statistic most often reported in the literature. We will now use eq. (8) to relate ∆χ 2 to three quantities of interest.
What is the probability that the hierarchy which yields the lowest χ 2 is indeed the true hierarchy?
Let us first consider the case in which the normal hierarchy is manifested in nature. The correct hierarchy will be determined by the experiment if ∆χ 2 > 0. The statistic ∆χ 2 is centered on the positive value ∆χ 2 with a standard deviation of 2 ∆χ 2 and so the closest negative value is ∆χ 2 /2 σ's from the mean, on one side of the distribution. For example, if ∆χ 2 = 9 then a negative value will be excluded at 1.5σ's on one side, yielding a probability of successfully determining the hierarchy of 93.3%, considerably less than the 99.7% that one may naively suspect just by taking the square root of ∆χ 2 . More generally, the probability of correctly determining the hierarchy is In a more standard terminology, p c is the sensitivity to the hierarchy of the binary classification test whose classification function is the sign of χ 2 . In ref. [5] the authors obtained a similar result which differs as a result of their formula (5.11) for the probability of success for a given ∆χ 2 .
If instead the inverted hierarchy is correct, the calculation proceeds identically. As we have approximated σ i to be hierarchy independent, the probability of success is identical for both hierarchies. This is the quantity quoted in a number of studies such as refs. [6][7][8].

Second, what is the sensitivity of a typical experiment to the hierarchy?
A "typical experiment" is one in which |∆χ 2 | obtains its average value |∆χ 2 |. As the probability of successfully determining the hierarchy is a monotonic function of ∆χ 2 , the average value of ∆χ 2 corresponds to the median value of the probability of success and so JHEP01(2014)095 we will refer to such experiments as median experiments. The sensitivity of the sign(χ 2 ) test to the hierarchy is the probabilty that a fit to the correct hierarchy yields a lower value of χ 2 than one to the wrong hierarchy. Since |∆χ 2 | is fixed, this is simply the probability that ∆χ 2 has the correct sign.
Again the calculation will proceed identically for both hierarchies, so we may restrict our attention to the case in which the normal hierarchy is correct. Therefore the question is, given that ∆χ 2 is positive and ∆χ 2 is equal to either ∆χ 2 or −∆χ 2 , what is the probability p v that ∆χ 2 = ∆χ 2 .
Let L ± be the likelihood, given the normal hierarchy, that ∆χ 2 = ±∆χ 2 , which is easily found using the fact that ∆χ 2 obeys a normal distribution centered at ∆χ 2 with standard deviation 2 ∆χ 2 . Using the fact that the distribution of ∆χ 2 is odd with respect to a change in the hierarchy, the Bayes factor for the normal hierarchy is In particular, symmetric Bayesian priors assigning a 50% chance to each hierarchy yield a posterior probability of correctly determining the hierarchy for median experiments, those in which |∆χ 2 | = |∆χ 2 |. For example, if ∆χ 2 = 9 then the probability that a median experiment correctly determines the hierarchy will be 98.9%. While this is better than the mean probability of success 93.3%, it still falls noticeably short of the 99.7% confidence which one might expect from Wilks' theorem. Given ∆χ 2 determined either from Monte Carlo simulations or from Asimov data, one may express the sensitivity to the hierarchy expected at a median experiment in terms of a number s of σ's. We will define s so that p v is given by the same formula as the p-value corresponding to sσ confidence Then using eq. (11) one easily finds that the number of σ's of sensitivity is This function is plotted in figure 1. For example, if ∆χ 2 = 9 then a median experiment determines the hierarchy with 2.3σ of confidence instead of the 3σ which might be expected. A general Bayesian prior of b and 1 − b for the normal and inverted hierarchies leads to a sensitivity Third, what is the probability p(s) that the hierarchy will be determined with a sensitivity of at least sσ? JHEP01(2014)095  Note first that for a general experimental outcome ∆χ 2 , the probability of success is independent of ∆χ 2 . Using this fact, an argument similar to those above leads to This function is plotted in figure 2.
Parallel nuisance parameters. In reality there is no single experimental result y N or y I which is predicted by a given hierarchy. The results also depend on a number of nuisance

JHEP01(2014)095
parameters, such as the neutrino mass matrix parameters and the flux normalization of the source. We will assemble these nuisance parameters into a vector x = {x i }.
If the final data consists of N numbers, such as the number of events in N energy bins, and if there are K nuisance parameters, then each hierarchy corresponds not to a point but to a K-dimensional subset of the N -dimensional vector space in which y is valued. The nuisance parameters x i are coordinates on these subsets. If the standard deviations σ i vary sufficiently slowly, then the inverse covariance matrix defines a metric on this space. Recall that, in the case of the normal (inverted) hierarchy, the nuisance parameters x are chosen to minimize χ 2 N (χ 2 I ). Geometrically, this minimization corresponds to choosing the point in each subset which is closest to y, the coordinates of the point are the nuisance parameters which minimize the corresponding χ 2 statistic.
In this framework, it is easy to combine data from multiple experiments. They can simply be added to y as new components. For example, one can combine a forecast spectrum of Daya Bay II with a value of the nuisance parameter θ 13 determined at Daya Bay and RENO by letting the first N − 2 components of y correspond to the ν e spectrum at Daya Bay II and the next two to the relative survival probabilities observed at Daya Bay and RENO. The single nuisance parameter θ 13 yields a curve in the N -dimensional space of observations for each hierarchy. The curve is parameterized by θ 13 . The last two coordinates of this curve are simply the relative survival probabilities expected at Daya Bay and RENO as a function of the parameter θ 13 . The χ 2 to be minimized is the distance in the full N dimensional space, so it automatically combines determinations of θ 13 at RENO, Daya Bay and Daya Bay II without the need for any penalty terms. Now let us make two approximations. First, we approximate y N and y I to be linear (or affine) functions of the nuisance parameters x, so that the subspaces corresponding to theoretical predictions are hyperplanes. The resulting models are called linear regression models. Model selection in one dimensional non-nested linear regression models was first studied in ref. [9]. Ref. [3] presented a statistic, generalizing ∆χ 2 , which is Gaussian distributed and distinguishes the models. The properties of this statistic, in the case of linear regression models, were determined in ref. [10].
One may object that the spectra are not indeed linear functions of the neutrino mass matrix. However the essential point is that they be approximately linear in a regime whose size is the precision to which an experiment can determine the nuisance parameters. This is a much easier criterion. Later we will compare our analytical results to simulations in which no such approximation is made, and we will see that the resulting error is small.
For now we will make the further approximation that one obtains the same value of ∆χ 2 for any value of the nuisance parameters chosen for the normal hierarchy if the nuisance parameters for the inverse hierarchy are chosen so as to minimize ∆χ 2 I . In other words, ∆χ 2 , is independent of the choice of the nuisance parameters so long as each χ 2 is properly minimized. Geometrically this means that the hyperplanes corresponding to the theoretical values y N and y I are parallel.
Again assume that the normal hierarchy is correct. If x T is the true value of the nuisance parameters, then the theoretical values of the observables y N will be linear functions y N i of x T . χ 2 N (χ 2 I ) is just the minimum distance squared from the observations JHEP01(2014)095 Figure 3. In this figure the hierarchy is normal and ∆χ 2 is independent of the nuisance parameters.
The two parallel lines are the expected measurements corresponding to various values of the nuisance parameters for the two hierarchies. As a result of statistical fluctuations y N i + σ i g i is measured instead of the theoretical value y N i . The parallel part of g determines the effect of this fluctuation on the best fit nuisance parameters and the perpendicular part its effect on ∆χ 2 . y i = y N i + σ i g i to the hyperplane corresponding to the normal (inverted) hierarchy. The statistical fluctuation vector g = σ i g i can be decomposed into a two vectors, g ⊥ and g such that g ⊥ is perpendicular to the hyperplanes and g is parallel.
To determine χ 2 N or χ 2 I , one must choose the nuisance parameters x at which it is minimized. χ 2 will be minimized for the choice of nuisance parameters x T + g . In other words, the parallel part of g yields the statistical error in the determination of the nuisance parameters. We have assumed that this error is the same for both hierarchies. For this choice of nuisance parameters, the theoretical predictions for y i are y N i + σ i g i and y I i + σ i g i in the cases of the two hierarchies. Now we are ready to calculate In the last step we used the identity  which follows from the fact that, using the metric 1/σ 2 i , the vector (y N i −y I i ) is perpendicular to the hyperplanes and so to σ i g i .
Just as in eq. (6), eq. (17) describes a normal distribution centered at ∆χ 2 and with standard deviation 2 ∆χ 2 . As a result, eqs. (9) and (13) for the probability of success and number of σ's of confidence in the median experiment remain correct.
General nuisance parameters. Of course, ∆χ 2 does depend on the nuisance parameters, and so the hyperplanes corresponding to the theoretical data are not parallel and the above results are only approximate. This fact was first noted in ref. [3], where it was concluded that as a result ∆χ 2 is not normally distributed. Its distribution leptokurtic. This observation can be intuitively understood as follows. Imagine that ∆χ 2 depends so strongly upon the nuisance parameters that a 1σ change in the nuisance parameters can reduce the confidence in the hierarchy determination by several σ's. As a result, most of the experiments in which the hierarchy determination is incorrect will be those in which the nuisance parameter is such that ∆χ 2 is much smaller. Thus the tails of the distribution of ∆χ 2 will grow as a result of those simulations in which the nuisance parameters take a nonstandard value. Clearly, this effect is only present in simulations in which the nuisance parameters are allowed to vary, and so simulations that fix the nuisance parameters will yield values of ∆χ 2 which, upon using eq. (13), overestimate the confidence of a hierarchy determination.
In ref. [3] the author proposed a new statistic which does follow a Gaussian distribution even in this more general setting. However, in the case of the hierarchy determinations planned in the near future, the angle between the hypersurfaces is actually quite small. This is reflected in the observation [12] that even a 1σ variation in θ 13 only leads to about a one third of a σ variation in the confidence. Therefore the approximate treatment of the ∆χ 2 statistic above is quite precise.
To illustrate this point, in figure 4 we present the distribution of the ∆χ 2 statistic in simulations which combine the ν e spectrum measured at Daya Bay II with MINOS' JHEP01(2014)095 Figure 5. As in figure 4, but using only a 1% determination of the atmospheric mass splitting. The simulations reported in the red and black curves use δ = 0 and π respectively, although the fitting is always performed assuming δ = π/2. As can be seen, if δ = π, the hierarchy determination will be more reliable [12,13].
4% determination of the atmospheric mass difference [11] and also with an optimistic 1% forecast determination at an upgraded NOνA . All of the nuisance parameters are fixed except for |∆M 2 32 |, which is chosen to minimize χ 2 I and χ 2 N as described above. Following [13] we have considered 6 years of exposure at a 20 kton detector for Daya Bay II which detects ν e via inverse β decay on the 10% of its mass consisting of free protons. The baselines and reactor fluxes are identical to ref. [13]. The leptonic CP-violating angle δ is set to π/2.
We find that the distribution of ∆χ 2 is indeed well approximated by a Gaussian distribution centered at ∆χ 2 with standard deviation 2 ∆χ 2 . ∆χ 2 ∼ 11 (20) for Daya Bay II with MINOS (NOνA) yielding 2.6σ (3.9σ) of confidence at the median experiment, with a rate of successfully determining the hierarchy of 94.6% (98.5%) in good agreement with eq. (9). In figure 5 we present the distribution of ∆χ 2 in simulations in which δ = 0 and π, although we always fit to a δ = π/2 theoretical mode as the appearance mode at T2K and NOνA cannot distinguish 0 and π [12,14]. At δ = 0 (π) we find ∆χ 2 = 17 (22) yielding 3.5σ (4.2σ) of confidence, confirming the expectations of ref. [15]. Despite the fact that the model used for fitting differs from that used to generate the data, the distribution of ∆χ 2 described in this paper approximates the simulated data well.
Frequentist sensitivity. A frequentist notion of sensitivity can be made well defined even in this context [16,17]. Immagine that an experiment measures ∆χ 2 . This differs from the expected ∆χ 2 for the normal (inverted) hierarchy by |∆χ 2 ∓ ∆χ 2 | which corresponds to a frequentist incompatibility of In particular, in the case of the median experiment with the true hierarchy, ∆χ 2 = ∆χ 2 . Therefore the inverted hierarchy is excluded at a confidence of ∆χ 2 σ's. In this sense it JHEP01(2014)095 might be tempting to ignore the results of this paper and to identify the frequentist incompatibility ∆χ 2 with the sensitivity to the hierarchy expected in a median experiment.
While such a definition of sensitivity is well-defined, it has a very unattractive feature. Consider an experiment with an expected ∆χ 2 = 16. The general arguments in this note imply that if the hierarchy is normal (inverted) then ∆χ 2 will follow a Gaussian distribution centered on 16 (−16) with a width of σ = 8. In the frequentist sense, the median experiment will yield |∆χ 2 | = 16 and so is incompatible with the false hierarchy with 4σ of confidence while the 98th percentile experiment will yield ∆χ 2 = 0 and so is incompatible with the false hierarchy with 2σ of confidence. An identification of the sensitivity to the hierarchy with the frequentist incompatibility would therefore imply that even the 98th percentile of experimental outcomes will yield a 2σ sensitivity to the hierarchy. Now consider the somewhat unlikely case in which due to statistical fluctuations, the results of this experiment are indeed in the 98th percentile, so that ∆χ 2 = 0. Now the experimentalist will be asked to provide the hierarchy with 2σ of confidence. Of course he cannot, the experiment has not yielded any preference for either hierarchy, even at the 2σ level that was promised for a 98th percentile experiment when the funding was requested. In this sense, the identification of the frequentist incompatibility with the sensitivity to the hierarchy is misleading: the sensitivity can be nonzero even when no information is obtained.
The basic problem with the application of the frequentist notion of sensitivity in this example is that both hierarchies have been ruled out with equal confidence. Ruling out both hierarchies can be useful when searching for new physics, testing assumptions regarding backgrounds, etc. Although in that case one would use a χ 2 test and not a ∆χ 2 test, as the latter is insensitive to effects that affect both hierarchies similarly. However, for the purpose of determining which hierarchy is manifested in nature it is reasonable to assume that one of the hierarchies is indeed correct. In this case one is led to the Bayesian constructions described in this note.