On the Bayesian approach to neutrino mass ordering

We study the framework of Bayesian statistics for analyzing the capabilities and results of future experiments looking to solve the issue of the neutrino mass ordering. Starting from the general scenario, we then give examples of the procedure for experiments with Gaussian and non-Gaussian distributions for the indicator. We describe in detail what can and cannot be said about the neutrino mass ordering and a future experiment's capabilities to determine it. Finally, we briefly comment on the application to other binary measurements, such as the determination of the octant of $\theta_{23}$.


I. INTRODUCTION
With the lepton mixing angle θ 13 being discovered to be large [1][2][3][4], there has been a surge of interest in the possibility of determining whether the neutrino mass ordering is normal (NO) or inverted (IO) in the next or next-to-next generation of neutrino experiments.
Many studies of the capabilities for doing this in different types of experiments have been performed, including atmospheric , reactor [27,[29][30][31][32][33][34][35][36][37][38][39][40][41][42][43][44][45][46], and long baseline [14,19,32,[47][48][49][50][51][52][53][54][55][56][57][58][59][60][61][62][63][64][65][66] neutrino experiments. Most of these studies take an approach where the neutrino mass ordering is determined in a frequentist manner with the typical square root of the test statistic used as a measure of the sensitivity as if the distribution of the test statistic was a χ 2 distribution with one degree of freedom. The issue of how to deal with the fact that Wilks' theorem [67] does not apply to a binary measurement has also been dealt with in different ways in several studies [40,45,68,69]. The statistical analysis of the ordering measurement can be performed in either a frequentist or a Bayesian statistics setting. Although the Bayesian method of model selection is ideally suited for the task, the neutrino community is traditionally frequentist and more accustomed to interpreting frequentist results. Thus, in choosing which approach to take, these two facts have to be weighted against each other.
It should be mentioned that the frequentist analysis can lead to some results that may be considered unappealing, such as the rejection of both hierarchies at high confidence, there is nothing wrong in performing it as long as proper care is taken in interpreting the results.
In this text we will concentrate on how to correctly perform the Bayesian analysis, both as a method of interpreting actual results as well as for judging the capabilities of future experiments. The frequentist approach will be discussed elsewhere [70].
The remainder of this paper has been organized as follows: We start by quickly reviewing the Bayesian approach to model selection in Sec. II. In Sec. III we then specialize to the situation where we have two models that we wish to compare, including definitions that we will use later as well as an analytical treatment of the case where a Gaussian approximation is valid. We continue by giving an example of a non-Gaussian situation in Sec. IV, where we also briefly discuss a semi-analytic approach for the case when the distributions are essentially sums of different Gaussian distributions. In this section we also discuss the prior dependence of Bayesian analyses and its implications for decision making in future neutrino oscillation experiments. Finally, in Sec. V, we discuss and summarize our results.

II. BAYESIAN MODEL SELECTION
Unlike frequentist statistics, which are only concerned with how probable outcomes were given a hypothesis, Bayesian statistics deal with our degree of belief in a hypothesis. This has both advantages and disadvantages, the main disadvantage being the introduction of a final result which may depend on the prior knowledge that is inserted into the analysis.
However, in some cases it may be of interest to actually consider the prior and changes to it as valuable tools at the design level of an experiment (see, e.g., Ref. [71]). Bayesian model selection is performed as follows: Consider a situation where we have to select among several different hypotheses H i , all of which are mutually exclusive. Before any experiment is performed, we assign a prior π i = P (H i ) to each hypothesis. These priors are quantifications of our degree of belief in the different hypotheses before having any experimental input and thus are typically chosen such that π i = π j unless some arguments can be applied for having greater belief in one hypothesis than in the others (or if some other experiment has already provided an indication that this is the case). In general where the equality only holds if we are absolutely convinced that one of the hypotheses is true. Once we have performed an experiment, we wish to update our degree of belief in the different hypotheses by computing how likely each hypothesis is, this quantity is the posterior probability P (H i ; t), i.e., the probability that H i is true given an observation t.
What we are really interested in are the posterior odds Central to Bayesian statistics, we now make use of Bayes' theorem to rewrite the odds as Here, L H i (t) is the likelihood of producing the observation t if the hypothesis H i is true.
Thus, the posterior odds are simply updates of the prior odds such that each hypothesis is weighted by how likely the outcome was given the hypothesis.
In many situations, the hypotheses that are being treated are not simple, but rather composed of a general model with one or several model parameters θ. In these cases, we also need to assign a parameter prior π H i (θ i ) quantifying how likely each realization of parameters are. The likelihood of the data is then in general dependent on the parameters θ i . However, the model likelihood is easily computable by weighting the likelihoods with the parameter prior In general, the Bayesian approach would consider the data as is without referencing a test statistic. However, by only considering a single observable T , we are going to simplify the analysis significantly and this is just what we will do below. Since we are not dealing with the frequentist notion of hypothesis testing, we will refrain from calling this T a test statistic and instead refer to it as an indicator. This indicator will be well suited for telling two hypotheses, H i and H j , apart if its distribution under the different hypotheses differ significantly. Keeping this in mind, we now concentrate on the case at hand, where we want to distinguish two given hypotheses (normal and inverted neutrino mass ordering). When discussing the neutrino mass ordering at future neutrino oscillation experiments below, we will assume that the indicator used is the quantity typically referred to as the ∆χ 2 : where χ 2 for given parameters is related to the likelihood L of the data according to χ 2 = −2 log(L), thus making T equivalent to the likelihood ratio of the best fit points in NO and IO. We will continue calling this quantity T , since it does typically not follow a χ 2 distribution.

III. SELECTING BETWEEN TWO HYPOTHESES
In particular, Bayesian model selection is very well suited for selecting between two mutually exclusive hypotheses where we are essentially certain that one of them has to be true.
Such is the case of the neutrino mass ordering as was discussed by Qian et al. [68]. In this section, we will largely follow their approach to analyze the Gaussian situation analytically in detail and also discuss the modifications introduced in a non-Gaussian scenario. We will later comment on the different possibilities of evaluating the merits future oscillation experiments based on Bayesian measures of sensitivity.
Let us first consider the general case, where the parameter space may be extended and the distribution of the indicator typically displays non-gaussianities regardless of the gaussianity of the underlying data. Once an experiment has been performed and resulted in a result T = t, we need to compute the posterior odds of the hypotheses H andH. By Bayes' theorem, this will be given by where p = p H is the degree of belief in NO after making the observation t and we have assumed p H + pH = 1. Thus, in order to compute the posterior odds, compute the general integral in Eq. (5). In general, this may be a non-trivial task to perform analytically.
However, it can be done numerically in a very straightforward fashion through simple Monte Carlo simulation as follows: 1. If H is a composite hypothesis, sample parameter values θ from the parameter prior π H (θ).
2. Generate a set of data under the assumption that H is true with parameters θ.
3. Compute and store the value of the indicator T given the data generated in 2.
4. Repeat 1 to 3 until a sufficient number of samples of T have been generated.
This procedure will result in a sample of the distribution of T under the assumption that H is true. In some ways, this procedure is very similar to computing the distribution of a test statistic in a frequentist analysis. The big difference is that we are allowed to compute the distribution for H as a whole rather than being restricted to doing it for a particular selection of parameters θ. The reason we can do this is that the prior π H (θ) has introduced a valid way of properly weighting the contributions from different parameter values.
Once the likelihoods are known as a function of t and the posterior odds computed, we can phrase our degree of belief in H by using either the Kass-Raftery [72] or Jeffrey [73] scales. For comparison, it is useful to define the quantity  For the remainder of this text, we will use the Kass-Raftery scale, see Tab. I, but also give the degree of belief for reference.

A. Evaluating the performance of a future experiment
The Bayesian framework gives us the freedom of computing a series of interesting probabilities already before an experiment has been performed. If we assume that H andH are the only two possible hypotheses which are equally probable before performing the experiment, then P (H) = P (H) = 0.5. As κ(T ) can be seen as the distribution of posterior odds once the experiment is performed. We can, for example, compute the probability of the experiment giving at least strong evidence (> 95.3%) for either ordering P (|κ| > 6) = P (|κ| > 6; H)P (H) + P (|κ| > 6;H)P (H).
This of course also includes the possibility of obtaining strong evidence for the wrong ordering (corresponding to a degree of belief < 4.7% for the correct ordering), so the more interesting quantity is where we have exchanged the conditioned probabilities for obtaining strong evidence for the probability of obtaining strong evidence for the correct hypothesis. Obviously Note that Eq. (10) does not assume H orH to be true (as was done in Ref. [68]), but is rather a weighted average of the results from both hypotheses with the weights given by the prior degree of belief in H andH, respectively. In the remainder of this paper, we will consider the situation where an equal prior probability of 0.5 has been assigned to both hypotheses. Furthermore, it should be pointed out that it is important to refrain from interpreting κ or posterior degree of belief p in terms of a number of σ, which is an inherently frequentist concept not present in a Bayesian analysis. The number of σ at which a frequentist analysis would reject a hypothesis is dependent on the p-value, the probability of obtaining a result more extreme than the observed one if the hypothesis is true. In contrast, the p computed in this Bayesian framework is the degree of belief that the hypothesis is true after an experiment is performed. These two concepts are fundamentally different and should not be confused.

B. Normal distributed indicators
In some limiting cases, the indicator T will follow a normal distribution for both H andH with the same standard deviation. This limit is well fulfilled, e.g., for future reactor neutrino experiments sensitive to the neutrino mass ordering as discussed in Refs. [68,69]. For easy applicability to this scenario, we will assume that T = N(±T 0 , 2 √ T 0 ), where N(µ, σ) is a normal distribution with mean µ and standard deviation σ and the + (−) is for NO (IO).
Clearly, any indicator which supplies normal distributions for T can be brought to this form by translations and rescalings as long as they have the same standard deviation. For the particular case of the neutrino mass ordering at reactors, we also have T 0 = ∆χ 2 , i.e., the ∆χ 2 = minH χ 2 − min H χ 2 for the Asimov data set for H [74], i.e., the data set where all observables are given by their expectation values.
Since the cumulative distribution and probability density functions of the normal distribution are known, we can obtain simple analytic expressions for several of the quantities and distributions mentioned above, in particular Here, F (p) and f (p) are the cumulative distribution and probability density functions for p (before the experiment is performed, once an observation has been made p is fixed) assuming that H is true, respectively. Thus, by definition where the last equality follows from symmetry, which agrees with Eq. (12). The probability density function f (p) is the analytic form of what is shown in the left panel of Fig. 3 in Ref. [68]. Note that since the probability of obtaining evidence at least at level κ 0 for the true hypothesis in this case is independent of the hypothesis (this is true for all cases where the indicator is anti-symmetric under the exchange of H andH) and thus P (evidence at least κ 0 for true hypothesis) = 1 Note that this quantity is the probability of obtaining evidence at least at level κ 0 for the true hypothesis before the experiment has been performed. Of course, once the experiment has been performed, we either have found the evidence or not. In particular, the quantity is the probability that the experimental outcome will favor the correct ordering. A priori, this quantity has little to do with the actual posterior odds assigned to H andH as claimed in Ref. [69]. If the experiment then measures t = 0 + ε, the posterior odds will be close to one (equal degree of belief in H andH) regardless of T 0 . In other terms, the experiment did not provide us with significant discrimination between H andH although we might or might not have expected it to do so. This is neither strange nor unwanted. In fact, this is To conclude this section, let us finally note that, since erf(0) = 0, the median experiment will give evidence of strength κ = T 0 . This can be understood from the facts that κ(t) = t and T is symmetric around ±T 0 .

IV. EXAMPLE OF A NON-GAUSSIAN SITUATION
As mentioned earlier, the general case is typically not analytically solvable but instead we must apply numerical simulations in a fashion similar to that found in frequentist statistics,  where one in general must simulate the distribution of the test statistic. However, contrary to the frequentist framework, the appearance of a probability measure on the parameters means that we will only need to perform one simulation for each hypothesis, rather than once for every parameter set. This detail in itself offers a significant reduction in the computing power needed to perform a test. In particular, the actual testing once the experiment has been performed turns into a simple matter of averaging the likelihood of the acquired data over each hypothesis to find the hypothesis likelihood and, through it, the posterior odds.
In this section, we will give an example of a situation where non-gaussianities play a role in the form of a simulation of the NOνA experiment [75]. In order to perform this analysis we will utilize the GLoBES software [76,77] along with selected parts of the MonteCUBES plug in [78]. Since we are mainly concerned with the statistical challenges, we simply use the predefined NOνA glb files from the GLoBES homepage [53,79] to describe the experiment.
For the neutrino oscillation parameters and their priors we impose the conditions displayed in Tab. II. We only allow ∆m 2 31 and δ to vary and fix the rest of the parameters. Again, this is done mainly for illustration purposes and in fact the deviation of θ 23 from its maximal value is an additional challenge for future oscillation experiments, which we will comment on more later. In addition, note that we use these priors only for sampling the parameters in order to find the indicator distributions. We then follow the steps described in the previous section in order to compute these distributions.
In Fig. 2, we show the cumulative distribution and probability density functions of the indicator T = min IO χ 2 − min NO χ 2 (we here use the built-in GLoBES χ 2 function, which is −2 log L for a given parameter set). As can be seen from the figure, the main difference from a Gaussian distribution is the appearance of large tail distributions in the direction of the simulated ordering. These tails arise mainly from regions of parameter space where the hierarchies are well separated. In this case, values of δ = ±π/2 are either close to the other ordering or the ones furthest removed from it. If the value of δ is such that it will be easy to separate the hierarchies, then we have a significant chance of obtaining evidence for the true ordering. However, this is where the prior on the parameter space of the models come into play and properly weights the probabilities of being in different regions of parameter space.

This is in some sense similar to the gain fraction argument for CP-violation presented in
Ref. [71], with the difference that we are now performing a full Bayesian analysis. From the probability density functions, it is now easy to construct the value of κ(t) and we show this quantity in Fig. 3. As can be seen in this figure, the posterior odds for a given value of the indicator T will typically be larger than the Gaussian t. This effect is due to the tails of the probability density functions. Since κ is related to the ratio of the probability density functions and these fall off much slower than in the Gaussian case on the side of the true ordering, it follows that this will generally be the case. If instead we had a situation where the tails were on the other side, then we would typically obtain weaker posterior odds for the same t.
It should be stressed that the results shown in Fig. 3 is the only important piece of information once the experiment has been performed as relating the measured value of the indicator with the posterior odds is straightforward using this figure. As expected, a measured value around t = 0 adds very little information on whether NO or IO is the true one.
Finally, we will now discuss how to judge the capabilities of a future experiment within the Bayesian framework and argue that the full distribution of the expected evidence should be considered rather than reducing it to a single number. This will roughly correspond to the typical sensitivity analysis that is usually performed in the frequentist setting. In order to do this, we compute the distributions of κ under the assumptions of NO and IO. This is done by simply inserting the indicator distributions for NO and IO, respectively, and thus obtaining the corresponding distributions of κ. These distributions, whose cumulative distribution functions are shown in Fig. 4  The alternative is to pick a desired level of evidence and just compare the experiments' probabilities of reaching at least that level. While the disadvantage of this procedure is a certain loss of information, the advantage is that this single number can be plotted against, e.g., experimental parameters such as the neutrino energy, baseline, or running time. We wish to stress the fact that, since the shape of the distribution of κ in general will vary depending on the experiment, a single number will typically be insufficient to describe the full situation. For example, a Gaussian experiment with T 0 = 4 would have a larger median evidence for the true ordering than our simulated NOνA experiment. However, the chances of getting lucky and obtain very strong evidence in such an experiment would be miniscule.
Thus, rather than tabulating several predefined numbers describing the distributions, we

A. Semi-analytic approximation
It was observed in Ref. [70], that for fixed values of the oscillation parameters (in particular δ), the distribution of T would still take on a Gaussian form with mean and standard deviation now given by the T 0 (θ) produced by that particular parameter set θ. If this holds and T 0 (θ) is known, then the probability density function of T will be given by where is the probability density function of N(T 0 , 2 √ T 0 ) and the corresponding expressions hold for inverted ordering. If T 0 (θ) has the same distribution in both hierarchies, i.e., for all T 0 , then It follows directly that also in this case, This is a generalization of the result found in Eq. (11) and any deviations from this result must be caused by non-Gaussianities or different distributions of T 0 (θ) in the different orderings.
The largest deviation from this appeared for the case of the NOνA experiment, which is the example we have been using so far in this work. For comparison, we show the expectation of this semi-analytic approximation with thin curves in the lower panel of Fig. 2. As can be seen, the approximation overestimates the value of the probability density function where the ordering is not the assumed one, while it underestimates it in the other tail. For producing this result, we have computed T 0 as a function of δ for both hierarchies independently and the results do not provide the same T 0 distributions although it is close enough for κ(t) not to show any significant deviations from t for |t| < 20. Thus, the deviations from this rule seen in Fig. 3 originate mainly from the non-Gaussianity of the NOνA distributions. Thus, using this semi-analytic approach for NOνA will therefore lead to conservative estimates of the NOνA capabilities. We also show the result of the approximation as thin black curves in Fig. 4. This figure confirms our expectation that the approximation will underestimate the capability of NOνA, although the main features such as the larger tail probability are still present, the probability of reaching a good degree of belief may be underestimated by as much as 10%.

B. Prior dependence
Just as with any Bayesian statement, the distribution of the posterior odds prior to an experiment is performed will typically be dependent on the parameter prior within the models. In the situation described above, the main prior impact arises from the flat prior on δ, while the actual prior chosen for ∆m 2 31 does not change the prediction significantly. This mainly occurs since the probability of generating data that will separate the hierarchies is highly δ dependent. This only reflects the fact that when considering what future experiment to built, the current knowledge of the model parameters should also be taken into account (see also Ref. [71]). For the CP-violating phase, most people would agree that a flat prior is a reasonable starting point as long as no data is available to favor one region of δ or the other. However, once there is a hint for a value of δ, it is only natural that the perceived probability of making a given measurement should change accordingly.
The prior dependence would play an even bigger role in a situation where there is no lower bound on the effect that is being searched for. In the case of the mass ordering, ∆m 2 31 = 0 has long since been ruled out. However, for the determination of the octant of θ 23 , the data of global fits are still very compatible with maximal mixing. Of course, the possibility to discriminate between the octants is crucially dependent on how far from maximal θ 23 is allowed to be and a prior which allows larger deviations will typically give a more optimistic prediction for the capabilities of an experiment to discover the octant. Let us also note that if the parameter knowledge is not very precise, leading to good perceived chances of a discovery, and a discovery is later not made once the experiment is performed, the experiment should still typically not be considered a failure as it then will tend to disfavor those parts of parameter space where a discovery would be easy, thus providing a better prior as input for the next experiment.
As an example of the dependence on the prior, we show the case where δ is fixed to 90 • in NO and −90 • in IO in Fig. 5. Due to these points being almost degenerate, essentially no chance remains for obtaining any level of reliable evidence for which ordering is the true one through the NOνA experiment.

V. SUMMARY AND DISCUSSION
We have discussed and given concrete examples of how to apply a Bayesian analysis to the problem of neutrino mass ordering determination. In doing so, we have described the correct procedures to obtain the Bayesian posterior odds for any experiment and discussed how Bayesian methods may be used to judge the capabilities of future neutrino oscillation experiments by extending the approach and argumentation presented by Qian et al. [68].
In short, before an experiment is performed, one may speak only about the expected distribution of the posterior odds of the two different orderings. Here, the actual value of the indicator, typically taken to be ∆χ 2 = min IO χ 2 − min NO χ 2 , and not only its sign is of importance. We want to point out that this is precisely what we would expect in a Bayesian setting, as changing a measurement by a small amount from ∆χ 2 = ε to −ε should not change the results significantly. We contrast this to the situation presented by Ciuffoli et al. Ref. [69], where the ordering is identified as normal if ∆χ 2 > 0 and the probability of obtaining ∆χ 2 > 0 is taken as the confidence level. This corresponds to attempting to determine the confidence level in a frequentist fashion, i.e., asking the question of how large the ratio of infinitely repeated experiments would arrive at the correct ordering. In the frequentist nomenclature, this is simply a computation of the confidence level provided by a test which takes zero as the critical value. As we have shown, in the symmetric Gaussian approximation, this probability only corresponds to the chance of the correct ordering being favored after the experiment is performed and says nothing about how favored it has to be (just that it has to be above 50%). A more interesting question than this would be to ask for the probability that an experiment will give at least strong or very strong evidence for the correct ordering, corresponding to a posterior degree of belief of at least 95.3% and 99.3%, respectively, which we have also computed for the case of a Gaussian distribution of the indicator.
The second question of Ref. [69] regards the confidence level obtained by a typical experiment. This is somewhat misguided, since confidence is an inherent frequentist concept and what is interesting in the Bayesian framework is simply the posterior odds. In particular, what is interpreted in Ref. [69] as the probability of success is actually the posterior probability of the correct ordering, which the authors then go on to interpret as a frequentist p-value. While the p-value in a frequentist setting is the probability of obtaining a more extreme result, the posterior probability evaluated for the median experiment is the degree of belief in the correct ordering if the data would turn out to be the median expected data.
Here we have also argued against the use of a number of σ for describing the strength of the evidence, since in the frequentist nomenclature the neutrino community is familiar with, this represents a measure of how much the data deviates from the expectation and not the degree of belief in a model. The third question raised in Ref. [69] is asking what the probability of achieving a certain level of evidence (although it is referred to as confidence), we have discussed this extensively throughout this text and the main results include Eq. (16) for the Gaussian approximation and Fig. 4 for the simulation of NOνA. Let us finally remark that this analysis by no means is specific for the neutrino mass ordering. It is equally appropriate also for other binary measurements such as which octant θ 23 belongs to, or any other binary (or even larger degeneracies) measurement in physics.
In particular, it should be expected that the octant degeneracy of θ 23 displays even more non-Gaussian behavior since, unlike for the neutrino mass ordering, the degenerate solutions are very weakly separated.