A Bayesian solution to the Behrens–Fisher problem

A simple solution to the Behrens–Fisher problem based on Bayes factors is presented, and its relation to the Behrens–Fisher distribution is explored. The construction of the Bayes factor is based on a simple hierarchical model, and it has a closed form based on the densities of general Behrens–Fisher distributions. Simple asymptotic approximations of the Bayes factor, which are functions of the Kullback–Leibler divergence between normal distributions, are given, and the Bayes factor is also proved to be consistent. Some examples and comparisons are also presented.


Introduction
The Behrens-Fisher problem is that of testing the equality of the means of two independent normal populations with unknown and arbitrary variances. The problem arises when the quotient between the variances is unknown. If this quotient is known, the problem is easily solved from both the frequentist and Bayesian approaches.
From a frequentist point of view, the Behrens-Fisher problem has the difficulty that standard homoscedastic normal theory cannot be applied due to the presence of the two unrelated variances, so that there is no exact p-value. Following this approach, and trying to find an approximate solution to the problem, two different but closely related methods have been pursued: the Neyman-Pearson theory of significance tests (Bartlett [2], Wald [19], Scheffé [17]) and the one based on the notion of generalized p-values (Tsui and Weerahandi [18], Weerahandi [20] and Witkovsky [22]). An extensive bibliography of such frequentist methods appears in Kim [12].
There is no agreed Bayesian solution to the Behrens-Fisher testing problem based on Bayes factors. This is mainly due to the fact that, as the priors commonly used (generally, reference priors) are improper, the Bayes factor to compare the null and the alternative hypotheses cannot be computed. To overcome this difficulty, in Moreno et al. [14] and Moreno and Girón [15] the problem is formulated as one of Bayesian model selection, and intrinsic and fractional prior distributions are generated in order to compute a proper Bayes factor.
In this paper the Behrens-Fisher problem is approached as a true testing problem based on the Bayes factor. To this end, the problem is presented as one of testing the homogeneity of the means of two normal populations, and a simple solution is provided by formulating a hierarchical model under the alternative hypothesis. The use of such a hierarchical model allows us to derive a proper Bayes factor, thus avoiding the difficulty that arises in its calculation when only improper distributions are used to compare models of different dimensions. The most important point of the paper is that the solution provided here is simpler than the one obtained with the use of intrinsic prior distributions, since it only involves one-dimensional integrals, while the numerical results of the Bayes factors obtained are very similar in both cases.
In addition to the simplicity of the solution obtained, it is important to note that the prior distributions considered for the hierarchical model parameters are basically the same as those in Jeffreys [11], which shows an interesting relationship between the proposed testing approach and the estimation approach. This possibly explains why it is possible to obtain an expression for the proposed Bayes factor in terms of the density of the Behrens-Fisher distribution.
The article is structured as follows. In Sect. 2, the problem is posed as a homogeneity test and a hierarchical model under the alternative hypothesis is formulated to derive a proper Bayes factor. In Sect. 3, the relationship between the Bayes factor and the Behrens-Fisher distribution is considered. In Sect. 4, a simple asymptotic approximation of the Bayes factor, related to the Kullback-Leibler divergence, is given, and the consistency of the Bayes factor is proved when both sample sizes grow to infinity at the same rate. Section 5 provides several examples and a comparison with other Bayesian and frequentist approaches.
Finally, Sect. 6 discusses some of the findings of the paper and suggests a more general solution in line with other approaches, such as the one based on intrinsic priors. Possible extensions of the presented results to the case of more than two samples and to the problem of comparing regression coefficients, i.e., testing their homogeneity under heteroscedasticity, are also addressed.

The Behrens-Fisher problem
In this section, the Behrens-Fisher problem is formulated as a problem of homogeneity of the means of two independent normal distributions with unknown and unequal variances.
Thus, we have, in principle, seven unknown parameters, μ, μ₁, μ₂, σ₁², σ₂², τ₁² and τ₂², to which prior information must be assigned. Under H₀, reference priors are considered, whereas under H₁ the use of a hierarchical model only for the means, together with objective priors for the common mean parameter μ and the unknown variances τ₁², τ₂², is proposed. The general form of the likelihood function for the data x₁ and x₂, with means μ₁ and μ₂ and variances σ₁² and σ₂², is

ℓ(μ₁, μ₂, σ₁², σ₂² | x₁, x₂) ∝ (σ₁²)^(−n₁/2) (σ₂²)^(−n₂/2) exp{−[ν₁s₁² + n₁(x̄₁ − μ₁)²]/(2σ₁²) − [ν₂s₂² + n₂(x̄₂ − μ₂)²]/(2σ₂²)},

where x̄₁, x̄₂ are the sample means, s₁² and s₂² are the unbiased estimates of the variances σ₁² and σ₂², respectively, and ν₁ = n₁ − 1 and ν₂ = n₂ − 1 are the degrees of freedom. Under H₀, the marginal of the data x₁, x₂, conditional on μ, σ₁² and σ₂², is the likelihood function with μ₁ and μ₂ both replaced by μ. This conditional marginal density of the data depends only on the nuisance parameters μ, σ₁² and σ₂², which are regarded as independent and are assigned reference priors, i.e., μ ⊥⊥ σ₁² ⊥⊥ σ₂² and h(μ, σ₁², σ₂²) ∝ c₁c₂/(σ₁² σ₂²). Integrating first with respect to σ₁² and σ₂², and then with respect to μ, the marginal under H₀ turns out to be

m₀(x₁, x₂) ∝ ∫ [ν₁s₁² + n₁(x̄₁ − μ)²]^(−n₁/2) [ν₂s₂² + n₂(x̄₂ − μ)²]^(−n₂/2) dμ.

Note that the integrals with respect to the population variances are analytic, but the integral with respect to μ has no closed-form expression. Under H₁, the sampling model has means μ₁, μ₂ and variances τ₁², τ₂², and the hierarchical model is given by setting, in the first hierarchy, μᵢ | μ, τᵢ² ∼ N(μ, τᵢ²), independently for i = 1, 2; the second hierarchy is τᵢ² | σᵢ² ∼ s(τᵢ² | σᵢ²), where s is any scale density function; and the third, and last, hierarchy is, as before, the reference prior for μ, σ₁² and σ₂². Note that the first hierarchy is related to the g-priors of Zellner and Siow [23] in the one-dimensional case when gᵢ = nᵢ, with the important difference that these priors are centered at μ, as the intrinsic priors are, instead of at 0, as is usually done.
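Up to analytic constants, the marginal under H₀ reduces to a one-dimensional integral over μ of the product of two Student t kernels centered at the sample means. The following Python sketch evaluates that integral by quadrature; the data summaries are hypothetical, and the omitted multiplicative constants are ignored.

```python
def h0_marginal_kernel(xbar1, n1, s1sq, xbar2, n2, s2sq, steps=4000):
    """Integral over mu of prod_i [nu_i s_i^2 + n_i (xbar_i - mu)^2]^(-n_i/2),
    i.e. the marginal of the data under H0 up to analytic constants."""
    nu1, nu2 = n1 - 1, n2 - 1

    def kernel(mu):
        return ((nu1 * s1sq + n1 * (xbar1 - mu) ** 2) ** (-n1 / 2) *
                (nu2 * s2sq + n2 * (xbar2 - mu) ** 2) ** (-n2 / 2))

    # composite Simpson rule on a wide interval around the two sample means
    a = min(xbar1, xbar2) - 10.0
    b = max(xbar1, xbar2) + 10.0
    h = (b - a) / steps
    total = kernel(a) + kernel(b)
    total += sum((4 if k % 2 else 2) * kernel(a + k * h) for k in range(1, steps))
    return total * h / 3.0

# the marginal is larger when the two sample means agree
print(h0_marginal_kernel(0.0, 10, 1.0, 0.0, 10, 1.0))
print(h0_marginal_kernel(0.0, 10, 1.0, 3.0, 10, 1.0))
```

As expected, the first call (coincident sample means) yields the larger value, since the two kernels then concentrate their mass on the same region.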
According to Lemma 1 below, and taking into account the second and third hierarchies of the model, it is easy to deduce that the prior distribution of τᵢ² is the same reference prior as that of σᵢ² for i = 1, 2, with the same constant cᵢ.
Lemma 1 Let σᵢ² have the reference prior π(σᵢ²) = cᵢ/σᵢ², with cᵢ any positive constant, and let the distribution of τᵢ² given σᵢ² belong to a scale family with density generator s₀, that is, s(τᵢ² | σᵢ²) = s₀(τᵢ²/σᵢ²)/σᵢ². Then the marginal prior of τᵢ² is π(τᵢ²) = cᵢ/τᵢ².

Proof The marginal prior is π(τᵢ²) = ∫ s₀(τᵢ²/σᵢ²) (1/σᵢ²) (cᵢ/σᵢ²) dσᵢ². If we make the change of variable y = τᵢ²/σᵢ² in the integral, then, after some obvious simplifications, π(τᵢ²) = (cᵢ/τᵢ²) ∫ s₀(y) dy = cᵢ/τᵢ².

Lemma 1 implies that the second and third hierarchies of the proposed model collapse into the reference prior π(τ₁², τ₂²) = c₁c₂/(τ₁² τ₂²), together with the reference prior for μ. In this way, the computation of the marginal under H₁ runs as follows: multiplying the likelihood, the first-hierarchy normal densities and these priors, and rearranging the quadratic forms of the exponent, the marginal of the data is obtained by integrating out this expression, first with respect to μ₁ and μ₂ and, then, with respect to τ₁² and τ₂². The remaining integral, with respect to μ, has no analytical closed form. Finally, after simplifying common terms, the Bayes factor for testing H₀ vs. H₁ turns out to be

B₀₁ = [(n₁+1)(n₂+1)]^(1/2) · ∫ ∏ᵢ [νᵢsᵢ² + nᵢ(x̄ᵢ − μ)²]^(−nᵢ/2) dμ / ∫ ∏ᵢ [νᵢsᵢ² + nᵢ(x̄ᵢ − μ)²/(nᵢ+1)]^(−nᵢ/2) dμ,   (1)

which can be written in simplified form as a ratio of two one-dimensional integrals of products of two Student t densities. It can be remarked that the application of Lemma 1 would have allowed us to formulate the hypothesis test (by abusing the notation just a little) with the same variances under both hypotheses, i.e., in the form H₀: μ₁ = μ₂ = μ; σ₁² and σ₂² arbitrary vs. H₁: μ₁ ≠ μ₂; σ₁² and σ₂² arbitrary.
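In practice the Bayes factor can be computed with one-dimensional quadrature. The Python sketch below follows the structure just described, writing the numerator and denominator as integrals of products of two Student t densities (squared scales sᵢ²/nᵢ and (nᵢ+1)sᵢ²/nᵢ, with νᵢ degrees of freedom), consistent with the Behrens-Fisher representation of Sect. 3. It is an illustration with hypothetical inputs, not the authors' code, and any constant factor of (1) not absorbed by the normalized densities is ignored.

```python
from math import lgamma, log, pi, exp

def t_pdf(x, loc, scale2, df):
    """Student t density with location loc, squared scale scale2, df degrees of freedom."""
    logc = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(pi * df * scale2)
    return exp(logc) * (1.0 + (x - loc) ** 2 / (df * scale2)) ** (-(df + 1) / 2)

def simpson(f, a, b, steps=2000):
    """Composite Simpson quadrature on [a, b]."""
    h = (b - a) / steps
    s = f(a) + f(b) + sum((4 if k % 2 else 2) * f(a + k * h) for k in range(1, steps))
    return s * h / 3.0

def bayes_factor(xbar1, n1, s1sq, xbar2, n2, s2sq):
    """B01 for H0: mu1 = mu2 against the hierarchical alternative (sketch)."""
    nu1, nu2 = n1 - 1, n2 - 1
    a = min(xbar1, xbar2) - 10.0
    b = max(xbar1, xbar2) + 10.0
    num = simpson(lambda m: t_pdf(m, xbar1, s1sq / n1, nu1) *
                            t_pdf(m, xbar2, s2sq / n2, nu2), a, b)
    den = simpson(lambda m: t_pdf(m, xbar1, (n1 + 1) * s1sq / n1, nu1) *
                            t_pdf(m, xbar2, (n2 + 1) * s2sq / n2, nu2), a, b)
    return num / den

print(bayes_factor(0.0, 10, 1.0, 0.0, 10, 1.0))  # > 1: equal sample means favor H0
print(bayes_factor(0.0, 10, 1.0, 5.0, 10, 1.0))  # < 1: distant sample means favor H1
```

Both integrals are over the single variable μ, which is the computational simplicity claimed for this solution.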
An interesting point to conclude this section is that μ₁ and μ₂ have the same improper marginal as μ, as is easily proved. Therefore, we can establish a close relationship between the proposed Bayesian approach and the one given by Jeffreys [11] as far as prior distributions are concerned. This is possibly the reason why an expression of the Bayes factor in (1) can be obtained in terms of the density function of the Behrens-Fisher distribution, which we study in the next section of the paper.

Relation with the Behrens-Fisher distribution
It is interesting to see that the integrands in both the numerator and the denominator of the Bayes factor in (1) are proportional to the product of the densities of two Student t distributions. In this section, we will prove that these integrals can be evaluated through the general form of a Behrens-Fisher distribution. To do this, we begin by recalling the definitions of the standard and generalized Behrens-Fisher distributions.
The standard Behrens-Fisher distribution with degrees of freedom f₁ and f₂ and angle φ is defined as the distribution of

b = t₁ sin φ − t₂ cos φ,

where t₁ and t₂ are independent random variates following Student t distributions with f₁ > 0 and f₂ > 0 degrees of freedom, respectively. It will be denoted by b ∼ BF(f₁, f₂, φ). The extension of this distribution to a location-scale family, denoted Be-Fi(μ, σ², f₁, f₂, φ) and defined as the distribution of μ + σ(t₁ sin φ − t₂ cos φ), is given in Girón et al. [9].
The following theorems are given in Girón et al. [9]. Theorem 1 shows that the generalized Behrens-Fisher distribution is a convolution of two general Student t distributions. Note that the general form t(μ, σ², ν) corresponds to a Student t distribution with location parameter μ, scale parameter σ² and ν degrees of freedom. Theorem 2 states that the generalized Behrens-Fisher distribution is a location mixture of Student t distributions with a Student t as mixing distribution.
As a consequence of Theorem 2, the probability density function of x ∼ Be-Fi(μ, σ², f₁, f₂, φ) can be expressed as

f(x) = ∫ f_t(x | θ, σ² cos²φ, f₂) f_t(θ | μ, σ² sin²φ, f₁) dθ,   (2)

where f_t(· | m, s², f) denotes the density of the t(m, s², f) distribution. Once we have introduced the above definitions and results, let us consider again the Bayes factor in expression (1).
Defining d = x̄₂ − x̄₁, the integral of the numerator in (1) is

∫ ∏ᵢ [νᵢsᵢ² + nᵢ(x̄ᵢ − μ)²]^(−nᵢ/2) dμ.

From a simple change of variable, δ = μ − x̄₂ + d, and taking into account formula (2), this integral is, up to constants, the probability density function evaluated at zero of a random variable distributed as a Behrens-Fisher distribution with location x̄₂ − x̄₁, scale s₁²/n₁ + s₂²/n₂, degrees of freedom ν₁, ν₂ and angle φ such that tan²φ = (s₁²/n₁)/(s₂²/n₂).
Thus, the integral of the numerator can be written, up to constants, as f_b₁(0), where f_b₁ is the probability density function of a random variable

b₁ ∼ Be-Fi(x̄₂ − x̄₁, s₁²/n₁ + s₂²/n₂, ν₁, ν₂, φ).   (3)

Following the same reasoning, the integral of the denominator in expression (1) can be expressed, up to constants, as f_b₂(0), where f_b₂ is the probability density function of a random variable

b₂ ∼ Be-Fi(x̄₂ − x̄₁, (n₁+1)s₁²/n₁ + (n₂+1)s₂²/n₂, ν₁, ν₂, ψ),

where ψ is an angle such that tan²ψ = [(n₁+1)s₁²/n₁] / [(n₂+1)s₂²/n₂].
Finally, the Bayes factor (1) turns out to be the ratio of these two densities evaluated at zero,

B₀₁ = f_b₁(0) / f_b₂(0).
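The Behrens-Fisher representation can be checked numerically: writing b₁, per the generalized Behrens-Fisher construction, as (x̄₂ − x̄₁) + t₁(s₁²/n₁)^(1/2) − t₂(s₂²/n₂)^(1/2) (the sign convention is immaterial by symmetry), a Monte Carlo estimate of its density at zero should agree with the overlap integral of the two Student t densities. A Python sketch with hypothetical inputs:

```python
import random
from math import lgamma, log, pi, exp, sqrt

def t_pdf(x, loc, scale2, df):
    """Student t density with location loc, squared scale scale2, df degrees of freedom."""
    logc = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(pi * df * scale2)
    return exp(logc) * (1.0 + (x - loc) ** 2 / (df * scale2)) ** (-(df + 1) / 2)

def simpson(f, a, b, steps=2000):
    """Composite Simpson quadrature on [a, b]."""
    h = (b - a) / steps
    s = f(a) + f(b) + sum((4 if k % 2 else 2) * f(a + k * h) for k in range(1, steps))
    return s * h / 3.0

def sample_t(df, rng):
    """Student t variate: standard normal over sqrt(chi-square/df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return z / sqrt(chi2 / df)

xbar1, n1, s1sq = 0.0, 10, 1.0
xbar2, n2, s2sq = 0.0, 10, 4.0
nu1, nu2 = n1 - 1, n2 - 1

# numerator of (1), up to constants: overlap integral of the two t densities
direct = simpson(lambda m: t_pdf(m, xbar1, s1sq / n1, nu1) *
                           t_pdf(m, xbar2, s2sq / n2, nu2), -10.0, 10.0)

# Monte Carlo estimate of the Behrens-Fisher density of b1 at zero
rng = random.Random(1)
eps, hits, N = 0.05, 0, 100_000
for _ in range(N):
    b1 = ((xbar2 - xbar1) + sample_t(nu1, rng) * sqrt(s1sq / n1)
          - sample_t(nu2, rng) * sqrt(s2sq / n2))
    if abs(b1) < eps:
        hits += 1
mc = hits / (N * 2 * eps)
print(direct, mc)  # the two estimates agree up to Monte Carlo error
```

The small-window estimate hits/(2Nε) is a crude density estimator at zero; it suffices to confirm the identity numerically.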

Asymptotic behavior: approximations and consistency
In this section we study some asymptotic properties of the proposed Bayes factor in order to understand its behavior when the sample sizes tend to infinity. We will also prove its consistency in the case where the sample sizes grow to infinity at the same rate. We begin by giving an approximation of the Behrens-Fisher distribution when its degrees of freedom grow indefinitely, which will allow us to derive an approximation of the Bayes factor studied in this article for large sample sizes. Taking into account Theorem 1 and the well known result that the Student t distribution is asymptotically normal when the degrees of freedom are large enough, the following theorem holds.

Theorem 3 If b ∼ Be-Fi(μ, σ², f₁, f₂, φ) and f₁, f₂ tend to ∞, then the distribution of b is approximately N(μ, σ²).

By the last theorem, the distribution of the random variate b₁ in (3) can be approximated by a normal with mean x̄₁ − x̄₂ and variance s₁²/n₁ + s₂²/n₂ if the sample sizes are large enough, whilst the distribution of the random variate b₂ can be approximated by a normal with mean x̄₁ − x̄₂ and variance s₁² + s₂².
Thus, if n₁ → +∞ and n₂ → +∞, an approximation of the Bayes factor in (3) is given by

B₀₁ ≈ f_N(0 | x̄₁ − x̄₂, s₁²/n₁ + s₂²/n₂) / f_N(0 | x̄₁ − x̄₂, s₁² + s₂²),

where f_N(· | m, v) denotes the N(m, v) density. An interesting fact about the asymptotic Bayes factor is that it can be expressed as a function of Kullback-Leibler divergences of normal distributions. In fact, if we take into account that the Kullback-Leibler divergence between the probability distributions of two normal random variables, N(m₁, v₁) and N(m₂, v₂), is

KL = (1/2)[log(v₂/v₁) + (v₁ + (m₁ − m₂)²)/v₂ − 1],

then the above asymptotic expression of the Bayes factor can be written in terms of such divergences, as a function of f_X, f_Y and f_Z, the probability density functions of suitably chosen normally distributed random variates X, Y and Z. One important property of the Bayes factor for the Behrens-Fisher problem is that it is consistent in the sense that, as n₁ and n₂ tend to infinity at the same rate, then, under the null hypothesis, the posterior probability of the true model goes to 1, or equivalently, the Bayes factor goes to infinity, and under the alternative hypothesis it goes to 0. The next theorem states this result in a more precise form. As a byproduct, we also obtain a simple approximation to the Bayes factor for large values of n₁ and n₂ when the growth rates are of the same order.
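The two ingredients of this asymptotic analysis, the ratio of normal densities at zero and the closed-form Kullback-Leibler divergence between univariate normals, can be sketched as follows. The exact three-density identity of the paper is not reproduced here, only its building blocks, and all inputs are hypothetical.

```python
from math import log, pi, exp, sqrt

def norm_pdf(x, mean, var):
    """Density of the N(mean, var) distribution."""
    return exp(-(x - mean) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)

def kl_normal(m1, v1, m2, v2):
    """Kullback-Leibler divergence KL(N(m1, v1) || N(m2, v2))."""
    return 0.5 * (log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def bf_asymptotic(d, n1, s1sq, n2, s2sq):
    """Large-sample approximation of B01: a ratio of two normal densities at zero."""
    return (norm_pdf(0.0, d, s1sq / n1 + s2sq / n2) /
            norm_pdf(0.0, d, s1sq + s2sq))

print(bf_asymptotic(0.0, 100, 1.0, 100, 1.0))  # > 1 at d = 0
print(kl_normal(0.0, 1.0, 0.0, 1.0))           # 0 for identical normals
```

Note that the approximation only requires the sufficient statistics, with no quadrature at all.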

Theorem 4 The Bayes factor is consistent under both the null and the alternative hypotheses, in the sense that if the model we are sampling from is the true one, the Bayes factor converges in probability to the corresponding limit: to infinity under H₀ and to zero under H₁.

Proof Assuming that the convergence towards infinity of n₁ and n₂ is of the same order, i.e., n₁ = m and n₂ = a · m (a > 0), the Bayes factor can be written, for large m, in the asymptotic normal-ratio form given above. Considering that s₁² and s₂² are consistent estimators of σ₁² and σ₂², respectively, the sample variances in that expression can be replaced by σ₁² and σ₂² for large values of m. Taking into account that the distribution of the difference of means x̄₂ − x̄₁ is approximately normal with mean μ₂ − μ₁ and variance σ₁²/n₁ + σ₂²/n₂, then, under the null hypothesis H₀: μ₁ = μ₂, the quadratic form in the exponent of the numerator remains bounded in probability while its normalizing constant grows at rate m^(1/2), so that the Bayes factor goes to infinity in probability. On the other hand, under the alternative hypothesis H₁: μ₁ ≠ μ₂, the exponent of the numerator decreases linearly in m, so that the Bayes factor goes to zero in probability exponentially fast.
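The consistency statement can be illustrated by simulation with the large-sample (normal-ratio) approximation of the Bayes factor; the seed, sample sizes and parameter values below are arbitrary choices for the sketch.

```python
import random
from math import pi, exp, sqrt

def norm_pdf(x, mean, var):
    """Density of the N(mean, var) distribution."""
    return exp(-(x - mean) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)

def bf_asymptotic(d, n1, s1sq, n2, s2sq):
    """Large-sample approximation of B01."""
    return (norm_pdf(0.0, d, s1sq / n1 + s2sq / n2) /
            norm_pdf(0.0, d, s1sq + s2sq))

def summaries(rng, mu, sigma, n):
    """Sample mean and unbiased variance of n draws from N(mu, sigma^2)."""
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    return xbar, s2

rng = random.Random(7)
m = 2000                                  # n1 = m, n2 = a*m with a = 1

xb1, v1 = summaries(rng, 0.0, 1.0, m)     # H0 true: mu1 = mu2 = 0
xb2, v2 = summaries(rng, 0.0, 2.0, m)
bf_h0 = bf_asymptotic(xb2 - xb1, m, v1, m, v2)

xb3, v3 = summaries(rng, 0.0, 1.0, m)     # H1 true: mu2 - mu1 = 2
xb4, v4 = summaries(rng, 2.0, 2.0, m)
bf_h1 = bf_asymptotic(xb4 - xb3, m, v3, m, v4)

print(bf_h0, bf_h1)  # bf_h0 exceeds bf_h1 by many orders of magnitude
```

Under H₁ the exponential term drives the Bayes factor to numerical zero already at moderate sample sizes.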

Examples: simulation study and calibration curves
If we define the statistic d = x̄₂ − x̄₁ as in Sect. 3, then the Bayes factor of equation (1), after a simple change of variable, can be rewritten as a function B₀₁(d, n₁, n₂, s₁², s₂²) of d and the rest of the sufficient statistics and, assuming that both hypotheses are equally likely a priori, the posterior probability of the null hypothesis is

Pr(H₀ | d, n₁, n₂, s₁², s₂²) = B₀₁ / (1 + B₀₁).   (4)
From these formulas, an interesting property of the Bayes factor and, consequently, of the posterior probability of the null hypothesis follows: both are symmetric and unimodal functions of d = x̄₂ − x̄₁, irrespective of the values of the sample sizes n₁ and n₂ and of the unbiased estimates of the variances s₁² and s₂². This implies that the acceptance regions of the Bayes test, as functions of the statistic d, are always symmetric intervals around 0. Thus, the acceptance region is A(d; n₁, n₂, s₁², s₂²) = {d : B₀₁(d, n₁, n₂, s₁², s₂²) ≥ 1} or, equivalently, A(d; n₁, n₂, s₁², s₂²) = {d : Pr(H₀ | d, n₁, n₂, s₁², s₂²) ≥ 1/2}. This behavior is illustrated with a well known example taken from Box and Tiao [4, pp. 107-109]: in a spinning modification experiment involving independent data from two normal distributions, with sample means x̄₁ = 50 and x̄₂ = 55, the value of the observed statistic is d = x̄₂ − x̄₁ = 55 − 50 = 5. It then follows, from formula (4), that the numerical value of the Bayes factor is B₀₁(x̄₁, n₁, s₁², x̄₂, n₂, s₂²) = 0.306118, and the posterior probability of the null hypothesis is Pr(H₀ | x̄₁, n₁, s₁², x̄₂, n₂, s₂²) = 0.234373. On the other hand, the acceptance intervals for the d statistic are obtained from the intersection of the Bayes factor or the posterior probability of the null hypothesis (regarded as functions of d) with the horizontal lines located at the values 1 and 1/2, respectively, as shown in Fig. 1. The acceptance interval is A(d; n₁, n₂, s₁², s₂²) = (−3.47197, 3.47197). As the observed value d = 5 does not belong to the acceptance interval, the null hypothesis is rejected.
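Since B₀₁ is symmetric and unimodal in d, the acceptance bound can be obtained numerically by solving B₀₁(d) = 1 for d > 0 with bisection. A Python sketch (with hypothetical summary statistics, not those of the Box and Tiao example; the Bayes factor is computed from the product-of-Student-t representation, ignoring constant factors of (1)):

```python
from math import lgamma, log, pi, exp

def t_pdf(x, loc, scale2, df):
    """Student t density with location loc, squared scale scale2, df degrees of freedom."""
    logc = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(pi * df * scale2)
    return exp(logc) * (1.0 + (x - loc) ** 2 / (df * scale2)) ** (-(df + 1) / 2)

def simpson(f, a, b, steps=2000):
    """Composite Simpson quadrature on [a, b]."""
    h = (b - a) / steps
    s = f(a) + f(b) + sum((4 if k % 2 else 2) * f(a + k * h) for k in range(1, steps))
    return s * h / 3.0

def bayes_factor_d(d, n1, s1sq, n2, s2sq):
    """B01 as a function of d = xbar2 - xbar1 (sample means taken as 0 and d)."""
    nu1, nu2 = n1 - 1, n2 - 1
    lo, hi = min(0.0, d) - 10.0, max(0.0, d) + 10.0
    num = simpson(lambda m: t_pdf(m, 0.0, s1sq / n1, nu1) *
                            t_pdf(m, d, s2sq / n2, nu2), lo, hi)
    den = simpson(lambda m: t_pdf(m, 0.0, (n1 + 1) * s1sq / n1, nu1) *
                            t_pdf(m, d, (n2 + 1) * s2sq / n2, nu2), lo, hi)
    return num / den

def acceptance_bound(n1, s1sq, n2, s2sq):
    """Bisection for the d* > 0 with B01(d*) = 1; the acceptance
    region is then the symmetric interval (-d*, d*)."""
    lo, hi = 0.0, 20.0
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if bayes_factor_d(mid, n1, s1sq, n2, s2sq) >= 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

dstar = acceptance_bound(10, 1.0, 10, 1.0)
print((-dstar, dstar))  # symmetric acceptance interval for the statistic d
```

Bisection is adequate here precisely because of the symmetry and unimodality of B₀₁ in d noted above.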
Next, we will establish a comparison of the proposed hierarchical Bayes factor with both the intrinsic Bayes factor, appearing in Moreno et al. [14] and Moreno and Girón [15], and the most commonly used frequentist test based on the Welch statistic.
The only statistic common to the intrinsic and hierarchical Bayes factors and to the p-values is d. For this reason, Table 1 displays, for the values of s₁², s₂², n₁ and n₂ considered in (5), the posterior probabilities of the null hypothesis obtained with both Bayes factors and the resulting p-value of Welch's t-test for the values of the d statistic reported in Moreno and Girón [15].
From Table 1 we can conclude that the two Bayesian procedures produce very similar results, with the intrinsic posterior probabilities being slightly higher than the hierarchical ones. This seems to indicate that the intrinsic Bayes factor slightly favors the null hypothesis, but there is general agreement between the reports provided by both procedures about accepting or rejecting the null hypothesis. More extensive analysis of the comparison of the three procedures for various sample sizes and different variance estimates should be carried out to gather more evidence about their behavior.
A striking difference between the intrinsic and hierarchical Bayes factors for the Behrens-Fisher problem and the corresponding ones for the two-sample normal homoscedastic problem is that, in the latter case, both Bayes factors are simple functions of the standard t statistic. This does not happen for the Behrens-Fisher problem. In fact, the test statistic of the Welch test is

t = d / (s₁²/n₁ + s₂²/n₂)^(1/2).

Nevertheless, the asymptotic approximation of our Bayes factor depends not only on t but also on the statistic u = d / (s₁² + s₂²)^(1/2) and on the ancillaries.
Only in the case of equal sample sizes, there is a one-to-one relationship between the asymptotic Bayes factor and the Welch statistic.
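For completeness, the Welch statistic and its approximate degrees of freedom (the usual Welch-Satterthwaite formula) are immediate to compute; the inputs below are hypothetical.

```python
from math import sqrt

def welch(xbar1, n1, s1sq, xbar2, n2, s2sq):
    """Welch's t statistic and its approximate (Welch-Satterthwaite) degrees of freedom."""
    v1, v2 = s1sq / n1, s2sq / n2
    t = (xbar2 - xbar1) / sqrt(v1 + v2)
    nu = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, nu

t, nu = welch(0.0, 20, 12.0, 5.0, 20, 12.0)
print(t, nu)  # with equal variances and sample sizes, nu = 2(n - 1)
```

With equal sample sizes and equal variance estimates the formula collapses to the homoscedastic degrees of freedom, which is the one-to-one case mentioned above.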
Next, we will try to further explore how an increase in sample sizes can influence the possible disagreements provided by the frequentist and Bayesian approaches presented in this article.
We know that the Welch t statistic is approximately distributed as a Student t distribution with ν degrees of freedom, where

ν = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/ν₁ + (s₂²/n₂)²/ν₂].

A calibration curve (the curve that measures the relationship between p-values and posterior probabilities of the null), introduced for normal linear models in Girón et al. [10], can be extended to the Behrens-Fisher problem using the common d statistic. Usually, the calibration curves vary with the sample sizes of the two samples (the ancillary statistics) and with the values of the unbiased estimates of the variances, s₁² and s₂². This implies that calibration curves for the Behrens-Fisher problem depend on too many parameters. Thus, to illustrate their form and behavior, we have chosen to compute only those that depend on the common varying sample size n = n₁ = n₂ for fixed values of s₁² and s₂². Figure 2 illustrates the shapes of the calibration curves for s₁² = 1 and s₂² = 25 and increasing values of n = 5, 10, 30 and 100. It also points to the fact that, as the sample size grows to infinity, the calibration curve converges to a constant equal to 1 for all p-values in (0, 1], except for p = 0, meaning that the posterior probability of the null hypothesis goes to 1 when sampling from it.
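A calibration curve can be traced by pairing, for each value of d, the p-value with the posterior probability of H₀. The Python sketch below uses the asymptotic (normal-ratio) form of the Bayes factor and replaces the Student t reference distribution of the Welch statistic by its normal approximation, so that it stays dependency-free; the resulting curves are therefore only rough versions of those in Fig. 2.

```python
from math import erfc, exp, pi, sqrt

def norm_pdf(x, mean, var):
    """Density of the N(mean, var) distribution."""
    return exp(-(x - mean) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)

def posterior_h0(d, n, s1sq, s2sq):
    """Posterior probability of H0 from the asymptotic Bayes factor,
    equal prior probabilities and common sample size n = n1 = n2."""
    bf = norm_pdf(0.0, d, s1sq / n + s2sq / n) / norm_pdf(0.0, d, s1sq + s2sq)
    return bf / (1.0 + bf)

def p_value(d, n, s1sq, s2sq):
    """Two-sided p-value of the Welch statistic, normal approximation."""
    t = abs(d) / sqrt(s1sq / n + s2sq / n)
    return erfc(t / sqrt(2.0))

s1sq, s2sq = 1.0, 25.0
for n in (5, 10, 30, 100):
    # one (p-value, posterior probability) point per value of d
    curve = [(p_value(d / 10.0, n, s1sq, s2sq),
              posterior_h0(d / 10.0, n, s1sq, s2sq)) for d in range(0, 51)]
    print(n, curve[0], curve[-1])
```

Plotting posterior probability against p-value for each n reproduces the qualitative behavior described in the text: the curves rise with n for fixed p.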
One important conclusion from the plot is that, for the same value of the posterior probability, the p-values decrease with the sample size. This fact has very important consequences in usual statistical practice: in order to match Bayesian and frequentist procedures, the α-level for accepting/rejecting the null hypothesis should take the sample size into account, in the sense that for very large sample sizes the α-level should be substantially decreased.

Discussion

In this paper, a simple hierarchical model under heteroscedasticity is considered to obtain a Bayes factor which is shown to be closely related to the densities of the general Behrens-Fisher distributions. A comparison with the Bayes factor for intrinsic priors shows minor differences, and both basically produce the same Bayesian answers to the Behrens-Fisher problem. The calibration curve for the hierarchical Bayes factor shows disagreement with frequentist p-values as the sample sizes increase, a common characteristic of many Bayes factors.
Another possible Bayes factor that could be envisaged, following the steps suggested in Subsection 4.9.1 of [8], would involve a modification of the presented hierarchical model by introducing a hyperparameter σ² and a new hierarchy linking the variances to this hyperparameter by means of a weakly informative prior scale distribution, say s(·), an approach related to the one considered in Berger et al. [3] and similar to that of Moreno et al. [16], as follows: the hierarchy of the means μᵢ given μ and their respective variances remains as before, and we add the following hierarchy of the variances given the common parameter σ²:

σ₁² ⊥⊥ σ₂² | σ², with σᵢ² | σ² ∼ s(σᵢ² | σ²), and τ₁² ⊥⊥ τ₂² | σ², with τᵢ² | σ² ∼ s(τᵢ² | σ²).
In this way, the parameters on which to assign improper reference priors would be μ and σ² under both hypotheses, so the resulting Bayes factor would be proper. This solution is a little less simple than the one proposed in this paper, since the resulting Bayes factor expression involves the computation of two-dimensional integrals. Simulation studies carried out using Gamma, Inverted Gamma, Half normal and Half Cauchy distributions as links show robustness in the results obtained and a high degree of similarity with those obtained with the Bayes factor described here. This new Bayes factor would be more in line with other objective or default Bayes factors, but the one presented here has the merit of being simple and does not differ much numerically from other possible solutions. Finally, the simple hierarchical Bayes factor obtained in the paper could be easily extended to the problem of comparing the means of more than two samples from normal populations with unequal and unknown variances, and to the problem of comparing the regression coefficients of two, and more than two, heteroscedastic normal linear models.