On the prevalence of information inconsistency in normal linear models

Informally, ‘information inconsistency’ is the property that has been observed in some Bayesian hypothesis testing and model selection scenarios whereby the Bayesian conclusion does not become definitive when the data seem to become definitive. An example is that, when performing a t test using standard conjugate priors, the Bayes factor of the alternative hypothesis to the null hypothesis remains bounded as the t statistic grows to infinity. The goal of this paper is to thoroughly investigate information inconsistency in various Bayesian testing problems. We consider precise hypothesis tests, one-sided hypothesis tests, and multiple hypothesis tests under normal linear models with dependent observations. Standard priors are considered, such as conjugate and semi-conjugate priors, as well as variations of Zellner’s g prior (e.g., fixed g priors, mixtures of g priors, and adaptive (data-based) g priors). It is shown that information inconsistency is a widespread problem using standard priors while certain theoretically recommended priors, including scale mixtures of conjugate priors and adaptive priors, are information consistent.


Introduction
When testing a hypothesis $H_0$ against an alternative hypothesis $H_1$, a common Bayesian tool is the Bayes factor, $B_{10}$, which quantifies the relative evidence (or odds) from the data for $H_1$ against $H_0$. A Bayes factor is called information inconsistent if, when the evidence for the alternative hypothesis appears to be overwhelming (in the sense that the observed effect under the alternative hypothesis becomes arbitrarily large), the Bayes factor converges to a constant $B^* < \infty$. This conflicting behavior, which was already noted by Jeffreys (1961), is also referred to as the information paradox (Liang et al. 2008). Note that we utilize the language of Bayes factors simply for convenience; everything could equivalently be stated in terms of posterior probabilities, e.g., there is information inconsistency if the posterior probability of $H_1$ is bounded away from 1 as the evidence for $H_1$ appears to be overwhelming.
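The equivalence between a bounded Bayes factor and a posterior probability bounded away from 1 can be made concrete with a minimal sketch. The function name and the value of the limiting Bayes factor below are hypothetical and chosen only for illustration; the prior odds default to 1 as an assumption.

```python
def posterior_prob_h1(b10: float, prior_odds: float = 1.0) -> float:
    """Posterior probability of H1 given the Bayes factor B10 and the
    prior odds P(H1)/P(H0) (assumed equal to 1 unless stated otherwise)."""
    posterior_odds = b10 * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical limiting Bayes factor B*: if B10 can never exceed b_star,
# the posterior probability of H1 can never exceed b_star / (1 + b_star).
b_star = 20.0
cap = posterior_prob_h1(b_star)
print(f"posterior probability of H1 is capped at {cap:.4f}")
```

With equal prior odds, a Bayes factor that saturates at $B^*$ therefore caps the posterior probability of $H_1$ at $B^*/(1 + B^*)$, no matter how extreme the data become.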
Example 1 A typical example of an information inconsistent Bayes factor arises when using Zellner's (1986) g prior for testing the regression coefficients in a linear regression model $y = \gamma 1_n + X_1\theta + \epsilon$, with $\epsilon \sim N_n(0, \sigma^2 I_n)$, where $y$ is a vector containing the $n$ responses, $\gamma$ is the intercept, $X_1$ is an $n \times r_1$ matrix containing the explanatory variables, $\theta$ is a vector with the $r_1$ unknown coefficients that are tested, $\sigma^2$ is the unknown error variance, $1_n$ is a vector of ones of length $n$, $0$ is a vector of zeros, $I_n$ is the identity matrix of size $n$, and $N_n$ denotes an $n$-dimensional normal (or Gaussian) distribution. When testing $H_0: \theta = 0$ versus $H_1: \theta \neq 0$ with the g prior, $\pi_0(\gamma, \sigma^2) \propto \sigma^{-2}$ and $\pi_1(\theta \mid \gamma, \sigma^2) = N_{r_1}(\theta \mid 0, g\sigma^2 (X_1'X_1)^{-1})$ and $\pi_1(\gamma, \sigma^2) \propto \sigma^{-2}$, for some fixed $g > 0$, the Bayes factor goes to $(1 + g)^{(n-r_1-1)/2} < \infty$ as the evidence against $H_0$ accumulates in the sense that $|\hat\theta| \to \infty$, where $\hat\theta$ denotes the least squares estimate of $\theta$ and $|\cdot|$ denotes the Euclidean norm of a vector (see also Berger and Pericchi 2001). Furthermore, it has also been reported that the g prior is information inconsistent when testing one-sided hypotheses (Mulder 2014a).
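The saturation in Example 1 can be illustrated numerically. The sketch below uses the closed-form g prior Bayes factor written in terms of the coefficient of determination $R^2$, as given in Liang et al. (2008), and assumes centered predictors and $\Sigma = I_n$; letting $|\hat\theta| \to \infty$ with the residual sum of squares fixed corresponds to $R^2 \to 1$. The function names and the values of $n$, $r_1$ and $g$ are hypothetical.

```python
import math


def log_bf10_g_prior(r2: float, n: int, r1: int, g: float) -> float:
    """Log Bayes factor of H1 against H0 under a fixed-g prior.

    Assumes an intercept with a flat prior, centered predictors and
    independent errors; r2 is the usual coefficient of determination.
    """
    return 0.5 * (n - 1 - r1) * math.log1p(g) - 0.5 * (n - 1) * math.log1p(g * (1.0 - r2))


def log_bf10_limit(n: int, r1: int, g: float) -> float:
    """Log of the limiting value (1 + g)^((n - r1 - 1)/2) reached as R^2 -> 1."""
    return 0.5 * (n - 1 - r1) * math.log1p(g)


n, r1, g = 10, 2, 10.0  # hypothetical illustration values
for r2 in (0.5, 0.9, 0.99, 0.999999):
    print(r2, math.exp(log_bf10_g_prior(r2, n, r1, g)))
print("limit", math.exp(log_bf10_limit(n, r1, g)))
```

The printed values increase with $R^2$ but never exceed the finite limit, which is the information inconsistency described in the example.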
In comparison with large sample inconsistency, which occurs when the evidence for the true hypothesis against another hypothesis does not go to infinity as the sample size grows, information inconsistency has not received much attention in the literature. In our view, both types of inconsistency are undesirable and should be avoided in general testing procedures. The goal of this paper is therefore to explore information inconsistency in the general setting of testing in the normal linear model with unknown variance. We will consider improper as well as proper priors; conjugate priors, scale mixtures of conjugate priors, independent priors, and adaptive priors; and precise null hypothesis testing, one-sided hypothesis testing, and multiple hypothesis testing. Throughout the paper, we also consider variations of Zellner's g prior (e.g., fixed g priors, mixtures of g priors, and adaptive (data-based) g priors) as this class of priors is commonly observed in the literature. We show that information inconsistency typically results when using 'standard' conjugate or independent semi-conjugate priors, while information consistency typically results when using more sophisticated scale mixture or adaptive priors. We also explore the practical consequences of information consistency, by investigating when information inconsistency starts to manifest itself and finding the limiting value of the Bayes factor. Note that having an unknown variance is crucial; we are not aware of any information inconsistency results for testing in the normal linear model with known variance.
The paper is organized as follows. First the linear regression model with dependent errors and some notation are introduced (Sect. 2). Subsequently, Sect. 3 explores information consistency when testing a precise hypothesis using various prior specifications, followed by one-sided hypothesis tests in Sect. 4 and a multiple hypothesis test in Sect. 5. We end the paper with some conclusions and recommendations in Sect. 6.

The linear regression model with dependent errors
Throughout this paper, the focus is on the linear regression model with dependent errors, $y = X\beta + \epsilon$, with $\epsilon \sim N_n(0, \sigma^2\Sigma)$, where the vector $y$ of length $n$ contains the responses, $X = [x_1 \ldots x_K]$ is an $n \times K$ matrix containing the $K$ predictor variables, $\beta$ contains the $K$ unknown regression coefficients ($n > K$), $\epsilon$ is a normally distributed error vector, $\sigma^2$ is an unknown common variance, and $\Sigma$ is a known positive definite matrix. Three different types of hypothesis tests will be considered. First, we consider the classical null hypothesis test of a set of linear restrictions on $\beta$ against an unrestricted alternative, i.e., $H_0: R\beta = 0_{r_1}$ versus $H_1: R\beta \neq 0_{r_1}$, where $R$ is an $r_1 \times K$ matrix of known constants ($r_1 \le K$). Second, we consider the corresponding one-sided hypothesis test of $H_0: R\beta \le 0_{r_1}$ versus $H_1: R\beta \nleq 0_{r_1}$, where "$\nleq$" means that at least one inequality goes in the other direction. Third, we briefly consider the multiple hypothesis test $H_0: R\beta = 0_{r_1}$ versus $H_1: R\beta \le 0_{r_1}$ (with $R\beta = 0_{r_1}$ excluded) versus $H_2: R\beta \nleq 0_{r_1}$. The precise Bayesian hypothesis test of a set of linear restrictions was also investigated by Bayarri and García-Donato (2007). A Bayesian hypothesis test with combinations of equality and one-sided constraints was, for instance, considered by Mulder et al. (2010).
The model is reparametrized so that the linear combination of the parameters of interest, $\theta = R\beta$, is perpendicular to the nuisance parameters, $\gamma = D\beta$, i.e., $(\theta', \gamma')' = T\beta$ with $T = [R', D']'$, where the $r_2 \times K$ matrix $D$ contains $r_2 = K - r_1$ independent rows of $P_R^\perp X'\Sigma^{-1}X$, and the orthogonal projection matrix is given by $P_R^\perp = I_K - R'(RR')^{-1}R$. Subsequently, the model can be written as $y = X_1\theta + X_0\gamma + \epsilon$, with $\epsilon \sim N_n(0, \sigma^2\Sigma)$, where $X_1$ contains the first $r_1$ columns of $XT^{-1}$ that are regressed on $\theta$ and $X_0$ contains the remaining $r_2$ columns of $XT^{-1}$ that are regressed on $\gamma$. The null hypothesis test can then be written as $H_0: \theta = 0$ versus $H_1: \theta \in \mathbb{R}^{r_1}$, and the one-sided hypothesis test can be written as $H_0: \theta \le 0$ versus $H_1: \theta \nleq 0$. Thus, the design matrix under the precise null hypothesis $H_0$ is denoted by $X_0$, and under the unconstrained alternative hypothesis $H_1$ in the precise test, it is denoted by $[X_0\ X_1]$. Further note that the ML estimates of $\theta$ and $\gamma$ are independent because $X_1'\Sigma^{-1}X_0 = 0$, a consequence of the choice of $D$. Throughout this paper, the free parameters under a hypothesis have a hypothesis index to make it explicit that the parameters under different hypotheses have different interpretations and therefore different priors. For example, the population variances under $H_0$ and $H_1$ are denoted by $\sigma_0^2$ and $\sigma_1^2$, respectively. Also, $\hat\theta$ will denote the maximum likelihood estimate of $\theta$.

Testing a precise hypothesis
The following definition will be used for information inconsistency when testing a precise hypothesis.
Definition 1 A Bayes factor, $B_{10}$, is called information inconsistent for testing $H_0: \theta = 0$ versus $H_1: \theta \neq 0$ if there exists a sequence of data with $|\hat\theta| \to \infty$ for which the Bayes factor $B_{10} \le B^*_{10} < \infty$. For normal linear models, this definition is equivalent to the more general formulation using the likelihood ratio $\Lambda_{10}$, as proposed by Bayarri et al. (2012). The definition implies that an information consistent Bayes factor and the classical likelihood ratio test (using the usual F or t statistic) result in identical conclusions as $\Lambda_{10} \to \infty$.

Conjugate priors
In the conjugate case, the conditional prior of $\theta \mid \sigma_1^2$ under $H_1$ has a multivariate normal distribution and the marginal prior of $\sigma_t^2$, for $t = 0$ or 1, has a scaled inverse Chi-squared distribution, resulting in $\pi_1(\theta \mid \sigma_1^2) = N_{r_1}(\theta \mid 0, \sigma_1^2\Omega)$ and $\pi_t(\sigma_t^2) = \text{inv-}\chi^2(\sigma_t^2 \mid s_t^2, \nu_t)$, referred to below as (2) and (3), where $s_0^2$ and $s_1^2$ are prior scale parameters and $\nu_0$ and $\nu_1$ are prior degrees of freedom for the error variance under the two different hypotheses $H_0$ and $H_1$, respectively. The scaled inverse Chi-squared distribution is used (instead of the inverse gamma distribution) because of the natural relation between the prior degrees of freedom $\nu_t$ and the sample size $n$ (Gelman et al. 2004). When setting the prior degrees of freedom equal to $\nu_t = 0$, we obtain the objective improper prior, $\pi_t(\sigma_t^2) \propto \sigma_t^{-2}$, for $t = 0$ or 1, and when additionally setting $\Omega = g(X_1'\Sigma^{-1}X_1)^{-1}$, we obtain Zellner's g prior.
The use of improper priors in testing for common "group invariant" parameters, such as the variances, is justified in Berger et al. (1998) and further discussed for the current testing problem in Bayarri et al. (2012). The conditional prior for $\theta$ is centered at the null value of 0, as is common in testing and model uncertainty, but any other (fixed) centering could be used without affecting the results that follow. Denoting the ML estimates by $\hat\theta = (X_1'\Sigma^{-1}X_1)^{-1}X_1'\Sigma^{-1}y$ and $\hat\gamma = (X_0'\Sigma^{-1}X_0)^{-1}X_0'\Sigma^{-1}y$ and the sum of squares by $s_y^2 = (y - X_1\hat\theta - X_0\hat\gamma)'\Sigma^{-1}(y - X_1\hat\theta - X_0\hat\gamma)$, a standard calculation yields the Bayes factor of $H_1$ against $H_0$, based on the conjugate priors in (2) and (3), referred to below as (4). The Bayes factor depends on both $\hat\theta$ and $s_y^2$, which are independent. We will thus assume that $s_y^2$ is fixed. The following result is immediate.
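To make the role of the prior degrees of freedom visible, the structure of the conjugate Bayes factor can be sketched as follows. This is a derivation under stated assumptions (a flat prior on $\gamma$ under both hypotheses, $\theta \mid \sigma_1^2 \sim N_{r_1}(0, \sigma_1^2\Omega)$, and scaled inverse Chi-squared priors with scales $s_t^2$ and degrees of freedom $\nu_t$ on $\sigma_t^2$), not a reproduction of the authors' display; the symbol $C$ is introduced here for all factors that do not depend on $\hat\theta$.

```latex
% Sketch under the assumptions stated above; C collects factors free of \hat\theta.
B_{10} \;=\; C \,
\frac{\bigl(\nu_0 s_0^2 + s_y^2 + \hat\theta' X_1'\Sigma^{-1}X_1\,\hat\theta\bigr)^{(n - r_2 + \nu_0)/2}}
     {\bigl(\nu_1 s_1^2 + s_y^2 + \hat\theta'\bigl[(X_1'\Sigma^{-1}X_1)^{-1} + \Omega\bigr]^{-1}\hat\theta\bigr)^{(n - r_2 + \nu_1)/2}}
% Both quadratic forms grow like |\hat\theta|^2, so for fixed s_y^2
%   B_{10} \asymp |\hat\theta|^{\,\nu_0 - \nu_1}  as  |\hat\theta| \to \infty :
%   \nu_0 < \nu_1 : B_{10} \to 0;  \nu_0 = \nu_1 : B_{10} \to \text{finite constant};  \nu_0 > \nu_1 : B_{10} \to \infty.
```

Under these assumptions the limiting behavior is governed entirely by the sign of $\nu_0 - \nu_1$, which is the trichotomy discussed in the remarks that follow. As a check, with $\nu_0 = \nu_1 = 0$, $r_2 = 1$ and $\Omega = g(X_1'\Sigma^{-1}X_1)^{-1}$ the ratio of the quadratic forms tends to $1 + g$ and $C$ contributes $(1+g)^{-r_1/2}$, which recovers the limit $(1+g)^{(n - r_1 - 1)/2}$ of Example 1.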
Remark 1 Setting $\nu_0 < \nu_1$ seems logical because it implies that the prior for $\sigma_1^2$ is more concentrated than the prior for $\sigma_0^2$ (consistent with a nonzero mean explaining some of the variation compared to a zero mean). This choice, however, results in a disastrously information inconsistent Bayes factor, with the conclusion being that the null hypothesis is certainly true when $|\hat\theta| \to \infty$.
Remark 2 Setting $\nu_0 = \nu_1$ is the usual choice, which still results in an information inconsistent Bayes factor. Note that the prior degrees of freedom would be set to 0 in the objective Bayesian approach. The impact of this inconsistency will be discussed below for the special case of the univariate t test.
Remark 3 Setting $\nu_0 > \nu_1$ would not be a logical choice because the prior for $\sigma_0^2$ is then more concentrated in the tails than the prior for $\sigma_1^2$, even though the regression coefficient $\theta$ under $H_1$ can explain some of the variation in the data. The resulting Bayes factor, however, is information consistent. A special case of this choice arises from setting the prior for the variance under $H_0$ equal to the conditional prior of the variance given $\theta = 0$ under $H_1$, i.e., $\pi_0(\sigma^2) = \pi_1(\sigma^2 \mid \theta = 0) = \text{inv-}\chi^2(\sigma^2 \mid \frac{\nu_1}{\nu_1 + r_1}s_1^2, \nu_1 + r_1)$, so that $\nu_0 = \nu_1 + r_1$. The Bayes factor can then be expressed through the Savage-Dickey density ratio (Dickey 1971), $B_{10} = \pi_1(\theta = 0)/\pi_1(\theta = 0 \mid y)$, where the marginal prior and the posterior of $\theta$ have a multivariate Student t distribution.
Remark 4 The definition of information inconsistency in this paper is a purely analytic definition: how does the function $B_{10}$ behave as $|\hat\theta| \to \infty$, while $s_y^2 > 0$ remains fixed? The statistical scenario in which this will most commonly arise is when $\theta$ itself grows increasingly large, with $\sigma^2$ staying constant, consistent with the notion that there should then be overwhelming evidence against $H_0$. Indeed, the definition of conditional Lindley's paradox in Som et al. (2016), which is closely related to information consistency, is formally based on the limiting behavior of parameters. We utilize the analytic version of information inconsistency because it captures the essential behavior without having to deal with probabilistic issues, and also because it is remarkably general in certain situations. For instance, with the standard objective prior having $\nu_0 = \nu_1 = 0$, one can divide through by $s_y^2$ in (4), and state information inconsistency in terms of the statistic $|\hat\theta|/s_y \to \infty$, which covers many possible situations in terms of the true parameters.

Practical implications for a univariate test under dependence
The practical importance of information inconsistency is explored for the objective prior with $\nu_1 = \nu_0 = 0$ for a univariate t test of $H_0: \theta = 0$ versus $H_1: \theta \neq 0$ with correlated data. Specifically, consider $r_1 = 1$, $r_2 = 0$, $X_1 = 1_n$, and $\Omega = 1$, with $\Sigma$ being the correlation matrix with identical correlations $\rho$ in the off-diagonal elements, so that $1_n'\Sigma^{-1}1_n = n/(1 + (n-1)\rho)$. The t-statistic, $t = \hat\theta\sqrt{1_n'\Sigma^{-1}1_n}\,/\,(s_y/\sqrt{n-1})$, then has a t-distribution with $n - 1$ degrees of freedom under $H_0$. The Bayes factor in (4) can then be expressed as a function of the t-statistic, namely $B_{10} = (1 + 1_n'\Sigma^{-1}1_n)^{-1/2}\left[\frac{1 + t^2/(n-1)}{1 + t^2/\{(n-1)(1 + 1_n'\Sigma^{-1}1_n)\}}\right]^{n/2}$ (5). The limiting value of the Bayes factor, as $|t|$ goes to infinity, is $(1 + 1_n'\Sigma^{-1}1_n)^{(n-1)/2} = \left(1 + \frac{n}{1 + (n-1)\rho}\right)^{(n-1)/2}$. Hence, the correlation can dramatically affect the situation. Table 1 provides the limiting value of the Bayes factor as $|t|$ goes to $\infty$ for different choices of the correlation $\rho$ and sample sizes varying from $n = 2$ to $n = 20$. The table also provides the Bayes factor when $t = 4$ to check whether inconsistency starts coming into play for a large t value. As comparisons, the corresponding two-sided p values are also provided, as well as the upper bound $B_{10} < 1/[-e\,p \log p]$, which is a bound over a large nonparametric class of priors [derived in Sellke et al. (2001)]. When there is zero correlation, the limit $(n+1)^{(n-1)/2}$ is large for sample sizes larger than 6, so that information inconsistency is not problematic from a practical point of view. For large correlations on the other hand, and especially when $\rho$ is close to 1, the limiting values can be quite small, arguing against the use of objective conjugate priors. Figure 1 displays the logarithm of the Bayes factor as a function of $\log_{10}(t)$ when using conjugate priors (solid lines) and $n = 7$, $\rho = .5$, $s_y^2 = n - 1 = 6$, $s_0^2 = s_1^2 = 1$, and different choices for the prior degrees of freedom, namely $(\nu_0, \nu_1) = (0, 0)$, $(1, 2)$ or $(2, 1)$. As can be seen, if $\nu_0 = \nu_1 = 0$, the logarithm of the Bayes factor converges to $\log_{10}(20.8) = 1.32$ (Table 1). Furthermore, if $\nu_0 < \nu_1$ (or $\nu_0 > \nu_1$), the evidence goes to $\infty$ for $H_0$ (or $H_1$) as $t \to \infty$, implying information inconsistency (or information consistency). The results are qualitatively similar when using other values for the prior scales.
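The numbers quoted above can be checked with a short sketch. The closed form used here is derived under the stated setup (objective prior $\nu_0 = \nu_1 = 0$, $\Omega = 1$, equicorrelated errors) and is an assumption of this illustration rather than a transcription of the paper's display; all function names are hypothetical. It reproduces the limit $20.8$ for $n = 7$, $\rho = .5$ and the zero-correlation limit $(n+1)^{(n-1)/2}$, and also evaluates the Sellke et al. (2001) bound $1/[-e\,p \log p]$ for a given p value.

```python
import math


def ones_quadratic_form(n: int, rho: float) -> float:
    """1_n' Sigma^{-1} 1_n for an n x n equicorrelation matrix with correlation rho."""
    return n / (1.0 + (n - 1) * rho)


def bf10_t_test(t: float, n: int, rho: float) -> float:
    """Bayes factor B10 as a function of the t statistic (sketch, nu0 = nu1 = 0, Omega = 1)."""
    w = ones_quadratic_form(n, rho)
    u = t * t / (n - 1)
    return (1.0 + w) ** -0.5 * ((1.0 + u) / (1.0 + u / (1.0 + w))) ** (n / 2)


def bf10_limit(n: int, rho: float) -> float:
    """Limiting value of B10 as |t| -> infinity."""
    return (1.0 + ones_quadratic_form(n, rho)) ** ((n - 1) / 2)


def sellke_bound(p: float) -> float:
    """Upper bound 1 / (-e p log p) on B10; only meaningful for p < 1/e."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound applies for 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))


print(round(bf10_limit(7, 0.5), 1))   # limit for n = 7, rho = .5
print(bf10_limit(7, 0.0))             # (n + 1)^((n - 1)/2) for n = 7
print(bf10_t_test(4.0, 7, 0.5))       # Bayes factor at t = 4
print(sellke_bound(0.05))             # bound for a p value of 0.05
```

The gap between `bf10_t_test(4.0, ...)` and `bf10_limit(...)` indicates how close a "large" t value already is to the ceiling for a given $n$ and $\rho$.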
It is natural to ask if information inconsistency also occurs if ρ is unknown. The answer is yes, as shown in the following lemma.
Lemma 2 If $\rho > 0$ is unknown with prior density $\pi(\rho)$, and the same priors are assumed for the other parameters, then, for $t^2 > n - 1$, the Bayes factor $B_{10}$ is bounded above by the value of (5) at $\rho = 0$, which converges to $(1 + n)^{(n-1)/2}$ as $|t| \to \infty$, implying information inconsistency.
Proof Calculus shows that, for $t^2 > n - 1$, (5) is a decreasing function of $\rho$ on $[0, 1]$ and hence is maximized at $\rho = 0$. We complete the proof by showing that Indeed, (6) is equivalent to (0), ending the proof.
The restriction to ρ > 0 is not necessary, but simplifies the proof.

Mixtures of conjugate priors
Although use of conjugate priors in testing is common, it has long been argued [starting with Jeffreys (1961)] that fatter-tailed prior distributions should be used. One such class that is increasingly popular is the class of scale mixtures of conjugate priors. This class results in information consistent Bayes factors if the tail of the prior on $g$ is thick enough, as shown by the following lemmas, which generalize the result in Liang et al. (2008).
Lemma 3 Let $\theta \mid g, \gamma_1, \sigma_1^2 \sim N_{r_1}(0, g\sigma_1^2\Omega)$, where $\sigma_1^2$ has the prior specified in (2) and $g$ has a prior with density $\pi(g)$. If $\nu_0 > \nu_1$, any $\pi(g)$ with positive support yields an information consistent $B_{10}$. The condition $\int_0^\infty (g + 1)^{(n - r_1 - r_2 + \nu_1)/2}\pi(g)\,dg = \infty$ is necessary and sufficient for information consistency whenever $\nu_0 = \nu_1$, and necessary whenever $\nu_0 < \nu_1$.
The maximum number of finite moments that the prior on $g$ can have to achieve information consistency increases with the sample size $n$ and decreases with the number of predictors $K = r_1 + r_2$. Lemma 3 gives us a complete description for all scale mixtures of conjugate priors whenever $\nu_0 \ge \nu_1$, but only gives us a necessary condition for information consistency for $\nu_0 < \nu_1$. The lemma below characterizes the behavior of polynomial-tailed priors on $g$ in this latter case and provides partial results for priors on $g$ with thinner-than-polynomial and thicker-than-polynomial tails.
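As a worked instance of the integral condition in Lemma 3, consider the hyper-g prior of Liang et al. (2008), $\pi(g) = \frac{a-2}{2}(1+g)^{-a/2}$ with $a > 2$. The following is a sketch of the calculation for the case $\nu_0 = \nu_1$, in which the condition is necessary and sufficient.

```latex
\int_0^\infty (1+g)^{(n - r_1 - r_2 + \nu_1)/2}\,\frac{a-2}{2}\,(1+g)^{-a/2}\,dg
  \;=\; \frac{a-2}{2}\int_0^\infty (1+g)^{(n - r_1 - r_2 + \nu_1 - a)/2}\,dg ,
% which diverges if and only if the exponent is at least -1:
\frac{n - r_1 - r_2 + \nu_1 - a}{2} \;\ge\; -1
  \quad\Longleftrightarrow\quad
  a \;\le\; n - r_1 - r_2 + \nu_1 + 2 .
```

For the objective choice $\nu_1 = 0$ with a single intercept ($r_2 = 1$), this reads $n \ge r_1 + a - 1$, so a heavier tail (smaller $a$) achieves information consistency at smaller sample sizes, while a point mass at a fixed $g$ always gives a finite integral and hence information inconsistency.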
Lemma 4 Suppose $\nu_0 < \nu_1$ and let $\theta \mid g, \gamma_1, \sigma_1^2 \sim N_{r_1}(0, g\sigma_1^2\Omega)$, where $\sigma_1^2$ has the prior specified in (2) and $g$ has a prior with density $\pi(g)$. Then, the following are true: [NB: All of the priors on $g$ considered in Liang et al. (2008) satisfy both conditions.]
Proof See "Appendix B".
Note that the Zellner and Siow (1980) prior (which was the first proposed information consistent prior for this situation) and the hyper-g prior (Liang et al. 2008) satisfy both conditions because they have polynomial tails.

Semi-conjugate prior
A feature of the conjugate prior that is sometimes questioned is the dependence induced between $\theta$ and $\sigma^2$; in objective Bayesian analysis, this is hard to avoid (only $\sigma$ is available to provide an objective scale for $\theta$), but it does seem rather arbitrary. For example, Moran et al. (2018) advocated the use of independent priors, as dependent conjugate priors may result in severe underestimation of the error variance in variable selection problems. Hence, it is of interest to also investigate information consistency using independent semi-conjugate priors, in which the normal prior for $\theta$ does not depend on $\sigma^2$. With these semi-conjugate priors, the Bayes factor is given by (7).
Lemma 5 As $|\hat\theta| \to \infty$, the Bayes factor in (7), based on the independent semi-conjugate prior, behaves as follows: $B_{10} \to 0$ if $\nu_0 < \nu_1$, $B_{10} \to 1$ if $\nu_0 = \nu_1$, and $B_{10} \to \infty$ if $\nu_0 > \nu_1$.
Note that, in the typical case of $\nu_0 = \nu_1$, we observe an even worse case of information inconsistency than for the conjugate prior because the relative evidence between $H_1$ and $H_0$ goes to 1 when there appears to be overwhelming evidence for $H_1$; in contrast, for the conjugate prior case, the limiting Bayes factor $B_{01}$, while nonzero, was at least exponentially small in $n$.
The intuition behind this result is that a very large $\hat\theta$ is equally unlikely under $H_1$ and $H_0$, due to the light-tailed normal prior for $\theta$ under $H_1$. Furthermore, the limits are the same as in the conjugate case if $\nu_0 \neq \nu_1$. Hence, the choice of the prior degrees of freedom plays a crucial role in information inconsistency, even when the variance is a priori independent of $\theta$. Figure 1 also displays the Bayes factor, based on the independence prior, as a function of $\log_{10}(t)$ for the univariate t test when the data correlation is $\rho = .5$ (dashed line). As can be seen, the Bayes factor based on the independence prior and the conjugate prior with the same hyperparameters is approximately equal for absolute t values smaller than approximately $\log_{10}(.5)$. For larger t values, the flatter tails of the independence priors start to have an effect, resulting in a decrease in the Bayes factor relative to the Bayes factor based on the conjugate priors.

Fatter-tailed independence priors
It is somewhat unfair to use an independent normal prior for model comparison here since, from Jeffreys (1961), the use of fatter-tailed priors has been recommended. To keep the discussion of fatter-tailed priors simple, we consider only the one-dimensional case (i.e., $r_2 = 0$) and restrict the prior $\pi_1(\theta)$ to be a t-distribution with mean 0, scale $\tau$ (fixed) and degrees of freedom $\nu$. Theorem 3.3 in Fan and Berger (1992) then gives the relevant limiting behavior as $|\hat\theta| \to \infty$, which in turn determines the behavior of $B_{10}$ as $|\hat\theta| \to \infty$. Since $n \ge 2$, if $0 < \nu < 1$ it will be true that $n + \nu_0 > \min\{n - 1 + \nu_1, \nu + 1\}$, so that $B_{10}$ will be information consistent. For the commonly used Cauchy prior ($\nu = 1$), information consistency also holds, except for the case when $n = 2$ and $\nu_0 = 0$ (this last corresponding to the objective prior for $\sigma_0^2$). It is interesting that information consistency does hold for this last case when $\pi_1(\theta)$ is chosen to be Cauchy$(0, \sigma_1)$ (cf. Liang et al. 2008) and $\nu_1 = 0$; thus, once again, insisting on prior independence of $\sigma_1^2$ and $\theta$ only appears to worsen the problem of information inconsistency.

Adaptive priors
Another approach to Bayesian hypothesis testing is to let the prior under H 1 adapt to the likelihood, as in George and Foster (2000) and Hansen and Yu (2001).
Example 2 For the g prior in the t test, when the statistic $t^2 = \hat\theta' X_1'\Sigma^{-1}X_1\hat\theta\,/\,\{s_y^2/(n-1)\} > 1$, the marginal likelihood under $H_1$ is maximized for the choice $g = \frac{n - r_2 - r_1}{r_1(n-1)}t^2 - 1$. The Bayes factor for this choice is information consistent. For a univariate t test, with $r_1 = 1$ and $r_2 = 0$, the resulting Bayes factor can be expressed as $B_{10} = \frac{1}{|t|}\left(\frac{n - 1 + t^2}{n}\right)^{n/2}$.
Proof See "Appendix D".
Lemma 6 establishes information consistency for all ν 0 and ν 1 . This is in contrast to the results in previous sections, where the behavior of B 10 depends (sometimes rather strongly) on ν 0 and ν 1 .
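The univariate case of Example 2 can be checked numerically. The sketch below assumes the objective prior $\nu_0 = \nu_1 = 0$ and a g prior with $\Omega = g(1_n'\Sigma^{-1}1_n)^{-1}$, for which the Bayes factor depends on the data only through $t$; the closed form for fixed $g$ is derived under these assumptions, and the function names are hypothetical. It confirms that $g = t^2 - 1$ maximizes the Bayes factor and that the maximized value matches the expression given in Example 2.

```python
def bf10_fixed_g(t: float, n: int, g: float) -> float:
    """B10 for the univariate test (r1 = 1, r2 = 0) under a fixed-g prior (sketch)."""
    u = t * t / (n - 1)
    return (1.0 + g) ** -0.5 * ((1.0 + u) / (1.0 + u / (1.0 + g))) ** (n / 2)


def g_adaptive(t: float) -> float:
    """Value of g maximizing the marginal likelihood under H1; truncated at 0 when t^2 <= 1."""
    return max(t * t - 1.0, 0.0)


def bf10_adaptive(t: float, n: int) -> float:
    """B10 with g replaced by its data-based maximizer."""
    return bf10_fixed_g(t, n, g_adaptive(t))


def bf10_adaptive_closed_form(t: float, n: int) -> float:
    """Expression from Example 2, valid for |t| > 1."""
    return (1.0 / abs(t)) * ((n - 1 + t * t) / n) ** (n / 2)


n = 7  # hypothetical sample size
for t in (2.0, 10.0, 100.0, 1000.0):
    print(t, bf10_adaptive(t, n), bf10_adaptive_closed_form(t, n))
```

Unlike the fixed-$g$ Bayes factor, the adaptive version grows without bound in $|t|$ (roughly like $|t|^{n-1}$), which is the information consistency claimed for this choice.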

One-sided hypothesis testing
The following definition will be used for information consistency for a one-sided testing problem.
Definition 2 A Bayes factor is information consistent, for a one-sided hypothesis test of $H_0: \theta \le 0$ versus $H_1: \theta \nleq 0$, if $B_{10} \to \infty$ as $|\hat\theta| \to \infty$ with at least one coordinate of $\hat\theta$ going to $\infty$, and $B_{10} \to 0$ as all coordinates of $\hat\theta$ go to $-\infty$. If this does not hold, the Bayes factor is called information inconsistent.

Conjugate prior
When testing nonnested hypotheses, it is common to formulate an encompassing prior $\pi$ on the joint space $\Theta = \Theta_0 \cup \Theta_1$ and specify truncations of this prior under $H_0$ and $H_1$ (e.g., Berger and Mortera 1999; Klugkist and Hoijtink 2007). As in the null hypothesis test, the encompassing conjugate prior (8) is centered on the boundary of the subspaces under investigation, i.e., $\pi(\theta \mid \sigma^2) = N_{r_1}(\theta \mid 0, \sigma^2\Omega)$, with a flat improper prior for $\gamma$. The priors under the nonnested hypotheses $H_t$, for $t = 0$ or 1, can then be expressed as $\pi_t(\theta \mid \sigma^2) = N_{r_1}(\theta \mid 0, \sigma^2\Omega)\,1_{\Theta_t}(\theta)/P_\pi(\theta \in \Theta_t \mid \sigma^2)$ (9), $\pi_t(\sigma^2) = \pi(\sigma^2)$, and $\pi_t(\gamma) = \pi(\gamma)$, with the denominator in (9) being equal to the conditional prior probability of $\Theta_t$ under the joint prior on $\Theta$, i.e., $P_\pi(\theta \in \Theta_t \mid \sigma^2) = \int_{\Theta_t} N(\theta \mid 0, \sigma^2\Omega)\,d\theta > 0$. The Bayes factor for the one-sided hypothesis test based on the conjugate priors can then be expressed as the ratio of posterior to prior odds under the encompassing prior, $B_{10} = \frac{P_\pi(\theta \in \Theta_1 \mid y)/P_\pi(\theta \in \Theta_1 \mid \sigma^2)}{P_\pi(\theta \in \Theta_0 \mid y)/P_\pi(\theta \in \Theta_0 \mid \sigma^2)}$ (10). The derivation is similar to that in Mulder (2014a). The prior and posterior probabilities that the constraints hold under the encompassing model can be computed as the proportion of draws satisfying the constraints. Also note that the conditional prior probability of $\theta \le 0$ is completely determined by the prior covariance matrix $\Omega$ and is independent of $\sigma^2$ [therefore, we can set $\sigma^2 = 1$ in (10)]. This is a direct result of centering the encompassing prior on the point of interest 0. For example, if $\Omega = I_{r_1}$, then $P_\pi(\theta \le 0 \mid \sigma^2) = 2^{-r_1}$, $\forall \sigma^2 > 0$. In the g prior, with prior covariance matrix $g\sigma^2(X_1'\Sigma^{-1}X_1)^{-1}$, the prior probability is completely determined by the covariance structure of the predictors.
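The "proportion of draws" computation mentioned above can be sketched with the standard library only. The example estimates the prior probability $P_\pi(\theta \le 0)$ for a zero-centered normal prior with covariance $\Omega$ (setting $\sigma^2 = 1$, which is harmless because the probability does not depend on $\sigma^2$). The helper names, the number of draws, and the seed are hypothetical choices; the final function assumes the usual encompassing-prior form of the Bayes factor as a ratio of posterior to prior odds.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a small positive definite matrix (list of lists)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def prob_all_nonpositive(omega, draws=40000, seed=0):
    """Monte Carlo estimate of P(theta <= 0) for theta ~ N(0, omega)."""
    rng = random.Random(seed)
    low = cholesky(omega)
    r = len(omega)
    hits = 0
    for _ in range(draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(r)]
        theta = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(r)]
        if all(x <= 0.0 for x in theta):
            hits += 1
    return hits / draws


def bf10_one_sided(post_prob_h0, prior_prob_h0):
    """B10 as posterior odds over prior odds of H1: theta not <= 0 against H0: theta <= 0
    (assumed encompassing-prior form)."""
    return ((1.0 - post_prob_h0) / (1.0 - prior_prob_h0)) / (post_prob_h0 / prior_prob_h0)


print(prob_all_nonpositive([[1.0, 0.0], [0.0, 1.0]]))  # close to 2^{-2}
print(prob_all_nonpositive([[1.0, 0.5], [0.5, 1.0]]))  # larger, due to positive correlation
```

Because $B_{10}$ in this form depends on the data only through the posterior probability of $\theta \le 0$, a posterior probability that stays bounded away from 0 and 1 forces the Bayes factor to stay bounded as well.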
Lemma 7 $P_\pi(\theta \le 0 \mid y)$ is bounded away from 0 and 1 for all $y$. Hence $B_{10}$ is information inconsistent. If $\hat\theta = cv$ and $c \to \infty$, then $P_\pi(\theta \le 0 \mid y)$ converges to a limit expressed through a random vector $\xi$ that has a multivariate t distribution.
Proof See "Appendix E".
The same result can be shown to hold (by essentially the same argument) if a proper conjugate prior is used for $\gamma$.

Fig. 2 The Bayes factor $B_{10}$ for the one-sided hypothesis test based on the conjugate prior (solid line) and independence prior (dashed line) as a function of t values when $n = 7$, $\rho = .5$, $s_y^2 = n - 1 = 6$, and setting the objective prior to be improper via $\nu = 0$.

In the limiting expression (11), which applies as $t \to \infty$, $T_\nu(\cdot)$ denotes the cdf of a univariate Student t distribution with $\nu$ degrees of freedom. Note that as $t \to -\infty$, $B_{10}$ converges to the reciprocal of (11). Table 2 provides the limiting values of the Bayes factors and the Bayes factors in the case of a relatively large t value of 4 for different sample sizes and correlations. When comparing Table 2 with Table 1, we can conclude that information inconsistency is considerably less problematic in practice for one-sided hypothesis testing than for the null hypothesis test. Finally, Fig. 2 (solid line) displays the Bayes factor for the one-sided hypothesis test as a function of the t value based on $n = 7$, $\rho = .5$, $s_y^2 = n - 1 = 6$, and the objective improper prior based on $\nu = 0$.

Mixtures of conjugate priors
We provide the following necessary and sufficient condition for information consistency for a scale mixture of conjugate normal priors in a one-sided hypothesis test.
Proof See "Appendix F".

Independence prior
The independence semi-conjugate encompassing prior is given by The truncated priors of θ under the nonnested hypotheses are as in (9), except that the normalizing constant P π (θ ∈ Θ t ) is the marginal prior probability of Θ t . The Bayes factor for the one-sided hypothesis test based on the independence prior can again be expressed as but note that the posterior probability is no longer available in closed form.

Lemma 9 As $|\hat\theta| \to \infty$ and at least one coordinate of $\hat\theta$ goes to $\infty$, the Bayes factor of $H_1: \theta \nleq 0$ versus $H_0: \theta \le 0$ based on the independence encompassing prior in (12) satisfies $B_{10} \to 1$.
Thus, as in null hypothesis testing, the independence prior results in a serious violation of information consistency because the evidence in the data for $H_1$ relative to $H_0$ goes to 1 when the evidence against $H_0$ appears to be overwhelming. For completeness, the Bayes factor for the one-sided hypothesis test is also displayed in Fig. 2 (dashed line), illustrating the extreme form of information inconsistency.

Adaptive priors
An adaptive prior can be specified where the prior covariance matrix of $\theta$ is adapted to the likelihood such that the Bayes factor is maximized for the hypothesis that is supported by the data (i.e., maximize $B_{01}$ if $\hat\theta \le 0$, and maximize $B_{10}$ elsewhere). Here we show that an adaptive g prior results in an information consistent Bayes factor.

Lemma 10 The Bayes factor based on the g prior, with $g_{\max} = \arg\max_g\{B_{01}\}$ if $\hat\theta \le 0$ and $g_{\max} = \arg\max_g\{B_{10}\}$ if $\hat\theta \nleq 0$, is information consistent for one-sided hypothesis testing.
Proof A proof is given in "Appendix H".
As shown in the proof, the choice for g that maximizes the Bayes factor is obtained by letting g go to ∞ (see also, Mulder 2014a). As a result of letting the prior variances go to infinity, the posterior is not shrunk toward the prior mean, which is sufficient to establish information consistency. Therefore, the methods of Mulder (2014b) and Gu et al. (2014) are also information consistent. A potential issue of letting g go to infinity is that the marginal likelihoods under H 0 and H 1 go to 0 in the limit. However because the Bayes factor in (10) converges to a limit where the posterior probabilities are computed using flat priors and the prior probabilities are based on the prior covariance structure, the outcome seems a reasonable default quantification of the relative evidence for a one-sided test.

Multiple hypothesis testing
Below we consider the definition for information (in)consistency in a multiple testing problem. The definition implies that a Bayes factor needs to be information consistent for both a precise test and a one-sided test. A graphical representation for the bivariate case can be found in Fig. 3.
As the conjugate and independent semi-conjugate priors resulted in information inconsistent Bayes factors for the one-sided hypothesis test, this automatically implies that these priors result in information inconsistency for the multiple hypothesis test. A specific case when using conjugate priors that is interesting to mention is when setting the prior degrees of freedom for $\sigma^2$ under $H_0$ larger than the prior degrees of freedom for $\sigma^2$ under the encompassing prior used to construct truncated priors under $H_1$ and $H_2$, i.e., $\nu_0 > \nu$. This results in information consistency for the precise hypothesis test (a consequence of Lemma 1) and information inconsistency for the one-sided test (a consequence of Lemma 7). To see that this results in undesirable behavior, consider a univariate multiple t test of $H_0: \theta = 0$ versus $H_1: \theta < 0$ versus $H_2: \theta > 0$. If we let $t \to \infty$, the support for $H_1$ against $H_0$ would go to $\infty$. Thus, as the effect goes to plus infinity, the evidence for the existence of a negative effect against no effect diverges.

Fig. 3 Graphical representation of the definition of an information consistent Bayes factor in a multiple testing problem of $H_0: \theta = 0$ versus $H_1: \theta \le 0$ and $\theta \neq 0$ (gray quadrant) versus $H_2: \theta \nleq 0$ (white quadrants). The directions of the arrows reflect directions of the limits. The evidence for $H_1$ against $H_0$ and $H_2$ should go to $\infty$ for limits in the lower left quadrant, and the evidence for $H_2$ against $H_0$ and $H_1$ should go to $\infty$ for the limits in the white quadrants, in order for the Bayes factor to be information consistent.
Finally, note that Lemmas 3 and 8 give the necessary and sufficient conditions for the mixing distribution of the scale mixture of conjugate priors to be information consistent in the multiple testing problem.

Conclusions
This paper explored the existence of information inconsistency when using conjugate priors, mixtures of g priors, independence priors, and adaptive g priors for precise testing, one-sided testing, and multiple hypothesis testing. An overview of our findings can be found in Table 3. "Normal" information inconsistency refers to a (typically large) limiting bound $B$ of the evidence against the null (i.e., $B_{10} \to B$). "Severe" refers to a limiting bound that is close to 1 (i.e., $B_{10} \approx 1$). "Disastrous" refers to infinite evidence in the opposite direction (i.e., $B_{10} \to 0$). "No" refers to no information inconsistency, thus, information consistency (i.e., $B_{10} \to \infty$). The first major conclusion is that information inconsistency is ubiquitous when typical conjugate priors are used in hypothesis testing and model selection in the normal linear model with unknown variance. (Again, the problem does not seem to arise in normal linear models with known variance.) It happens in standard null hypothesis testing and one-sided testing; it happens with proper and improper conjugate priors; and it happens with almost all independence conjugate priors. The practical importance of the problem varies over different situations; it will primarily be a practical problem when the sample is small relative to the number of free parameters and there is high correlation between the observations. But, even in other cases, we consider information inconsistency to be highlighting a logical flaw that might have other serious consequences and is, hence, something to be avoided.
The second major conclusion is that use of either fatter-tailed priors (including appropriate mixtures of g-priors) or adaptive priors typically results in information consistency. This is not as surprising as the almost complete lack of information consistency for conjugate priors, in that previous particular fatter-tailed priors (such as the Zellner-Siow prior) had been shown to be information consistent. Still, the generality in which such priors can be shown to be information consistent is highly comforting.
It should be noted that, when proper priors yield information inconsistency, this does not reveal a logical flaw in Bayesian analysis; if one truly believed the priors were correct, then one should behave in an information inconsistent manner. But one rarely knows accurately the features of the priors (such as their tail behaviors) that determine information inconsistency. Thus, the intuitive appeal of information consistency can be a significant aid in selecting such prior features.
Finally, information inconsistency is not limited to the normal linear model with unknown variance, as shown in the following example.
Example 3 Let y | θ ∼ Cauchy(θ, 1) and suppose that we want to test H 0 : θ = 0 against H 1 : θ ≠ 0. Under H 1 , assume that θ ∼ Cauchy(0, ψ). Then, as y → ∞, the Bayes factor in favor of H 1 over H 0 satisfies BF 10 → ψ(1 + ψ) < ∞, so the Bayes factor is information inconsistent.
This example also shows that information inconsistency does not depend, in general, on having an unknown scale parameter; here the scale parameter of the observation is known.
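The boundedness in Example 3 can be checked numerically. The following is a minimal sketch, not the authors' computation: it assumes that the second argument of Cauchy(·, ·) is a scale parameter, and the function names (`cauchy_pdf`, `bf10_closed_form`, `bf10_numeric`) are hypothetical. Under that assumption the marginal of y under H 1 is Cauchy(0, 1 + ψ), because independent Cauchy variables add their scales, so the Bayes factor in this sketch increases to 1 + ψ. A different parameterization of the prior scale would change the limiting constant but not the fact that the Bayes factor stays bounded.

```python
import math


def cauchy_pdf(x, loc=0.0, scale=1.0):
    """Density of a Cauchy(loc, scale) distribution, with `scale` a scale parameter."""
    z = (x - loc) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))


def bf10_closed_form(y, psi):
    """BF10 using the fact that the H1 marginal is Cauchy(0, 1 + psi) (scale assumption)."""
    return cauchy_pdf(y, 0.0, 1.0 + psi) / cauchy_pdf(y, 0.0, 1.0)


def bf10_numeric(y, psi, n=100_000):
    """BF10 with the H1 marginal obtained by midpoint-rule integration over the prior.

    The substitution theta = psi * tan(u), u in (-pi/2, pi/2), turns the prior
    measure into du / pi, so m1(y) = (1/pi) * integral of Cauchy(y | theta(u), 1) du.
    Intended for moderate |y|; very large |y| needs a finer grid.
    """
    h = math.pi / n
    total = 0.0
    for k in range(n):
        u = -math.pi / 2 + (k + 0.5) * h
        total += cauchy_pdf(y, psi * math.tan(u), 1.0)
    m1 = total * h / math.pi
    return m1 / cauchy_pdf(y, 0.0, 1.0)


if __name__ == "__main__":
    psi = 2.0  # hypothetical prior scale, chosen only for illustration
    for y in (1.0, 10.0, 1e3, 1e6):
        print(y, bf10_closed_form(y, psi))
```

The numerical integral is included only as a cross-check of the closed form at moderate values of y.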
Funding The first author was funded by the Netherlands Organization for Scientific Research.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Proof of Lemma 3
Denote: Throughout, we use the following notation for functions a, b: and only if a(g, θ ) b(g, θ ) and a(g, θ) b(g, θ ).
Before we prove Lemma 3, we prove an auxiliary result (Lemma 11).
Proof Consider the matrix factorization and take the eigendecomposition Ω −1/2 I −1 θ Ω −1/2 = O D O , where O is orthogonal and D diagonal with elements 0 < d l < d i < d u < ∞. Then, we can rewrite We can bound Similarly, we can find the lower bound
Now, we prove Lemma 3 arguing by cases.
Case ν 0 > ν 1 Applying the lower bound in Lemma 11, Since p 0 < p 1 , the term outside the integral goes to infinity as θ 2 → ∞, and by Fatou's lemma, lim inf which is clearly bounded away from 0 for any prior on g with positive support, so any such prior yields an information consistent B 10 whenever ν 0 > ν 1 .
Case ν 0 = ν 1 Applying the lower bound in Lemma 11 and Fatou's lemma as we did for the case ν 0 > ν 1 : The limit is O(1), so a sufficient condition for information consistency is $\int_0^\infty (g + d_l)^{(n - p_1 - r_1)/2}\,\pi(dg) = \infty$ or, equivalently (since $0 < d_l < \infty$), $\int_0^\infty (g + 1)^{(n - p_1 - r_1)/2}\,\pi(dg) = \infty$, as required.
Case ν 0 < ν 1 In this case, we apply the upper bound in Lemma 11: The term outside the integral goes to 0, so a necessary condition for information consistency is that the integral be infinite. We can bound the integral: so a necessary condition for information consistency is as required.

B Proof of Lemma 4
Throughout, we use the notation in "Appendix A".
Case 1. Suppose there exists M < ∞ such that for all g ≥ M, π(g) g −α for α > 1 and p 0 > p 1 . Then, we apply the lower bound in Lemma 11: Plugging in: Using the identity (which is satisfied because α < (n − p 0 − r 1 )/2 and p 0 > p 1 by assumption), the limit of the hypergeometric function as R 2 → 1 is a constant (by Gauss' theorem). From here, it is immediate to conclude that B 10 is information consistent whenever the lower bound is infinite, which occurs for α < (n − p 0 − r 1 )/2 + 1, as required.

D Proof of Lemma 6
Using the notation in "Appendix A" and applying Lemma 11: For g > 0, the right-hand side is maximized at g = max(0, so the adaptive prior is information consistent.
It is easy to see that ξ* lies in a fixed compact set C for any θ̂, from which it is immediate that P π (ξ ≤ 0 | y) is bounded away from 0 and 1. The second part of the lemma follows immediately from letting c → ∞ in the expression for ξ*.

F Proof of Lemma 8
Throughout, we use the notation in "Appendix A". Sufficient condition: We start with the case where there exists θ i → +∞; we treat the case where all θ i → −∞ later.
We can write: with h as defined in Lemma 11 (but noting that, in this case, the notation is ν 1 = ν). Letting p = ν − r 2 and using the upper bound in Lemma 11, we obtain lim From Lemma 7, we know that where ξ has a multivariate Student t distribution, with location and scale We factor where O is orthogonal and D is diagonal (with positive entries) as defined in Lemma 11. Therefore, for a fixed coordinate j, Using the same factorizations, we obtain w 2 ∝ θ ΩΩ θ for g > 0. Plugging this in and factorizing the denominator in m in a similar manner, we obtain If we choose a coordinate j such that w j > 0 (which exists by assumption), using the lower bound in Lemma 11, we obtain where T n− p is a central Student t with n − p degrees of freedom. Let ε > 0, then Therefore, we can plug in our bounds for m j and S j j , which are bounded away from 0 whenever g > 0. Using the tail bound and our previous work, we obtain Therefore, if the integral above is infinite, $\lim_{\|\hat\theta\|^2 \to \infty} P(\theta \le 0 \mid y) = 0$, as required.
Now we turn to the case where θ i → −∞ for all i, in which case we assume that w i < 0 for all i. Then, a Fréchet bound ensures that $P(\theta \le 0 \mid y) = P(\theta_1 \le 0, \theta_2 \le 0, \ldots, \theta_{r_1} \le 0 \mid y) \ge \sum_{i=1}^{r_1} P(\theta_i \le 0 \mid y) - (r_1 - 1)$. Therefore, if $\lim_{\|\hat\theta\|^2 \to \infty} P(\theta_i \ge 0 \mid y) = 0$ for all $1 \le i \le r_1$, then each term in the sum tends to 1, the lower bound tends to $r_1 - (r_1 - 1) = 1$, and hence $\lim_{\|\hat\theta\|^2 \to \infty} P(\theta \le 0 \mid y) = 1$.
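The Fréchet bound used in the last step, P(A 1 ∩ ... ∩ A r) ≥ Σ P(A i) − (r − 1), can be illustrated on small discrete examples. The sketch below is only a sanity check under stated assumptions: the events are generic (not the posterior events of the lemma), the joint distributions are randomly generated, and all function names are hypothetical.

```python
import random


def frechet_lower_bound(marginals):
    """Lower Frechet bound on P(A_1 and ... and A_r) given the marginals P(A_i)."""
    return max(0.0, sum(marginals) - (len(marginals) - 1))


def random_joint(r, rng, boost=1.0):
    """Random pmf over the 2**r outcomes; bit i of the outcome index flags event A_i.

    `boost` inflates the weight of the outcome in which all events occur, so that
    the marginals are close to 1 and the bound is non-trivial.
    """
    weights = [rng.random() for _ in range(2 ** r)]
    weights[-1] *= boost
    total = sum(weights)
    return [w / total for w in weights]


def joint_and_marginals(pmf, r):
    """Return P(all r events occur) and the list of marginals P(A_i)."""
    joint = pmf[2 ** r - 1]  # outcome index with all r bits set
    marginals = [sum(p for k, p in enumerate(pmf) if (k >> i) & 1) for i in range(r)]
    return joint, marginals


def bound_holds(r, seed, boost=1.0):
    """Check the Frechet inequality on one randomly generated joint distribution."""
    rng = random.Random(seed)
    joint, marginals = joint_and_marginals(random_joint(r, rng, boost), r)
    return joint >= frechet_lower_bound(marginals) - 1e-12


if __name__ == "__main__":
    print(all(bound_holds(3, seed, boost=50.0) for seed in range(100)))
    # As every marginal approaches 1, the lower bound approaches 1 as well.
    for eps in (0.1, 0.01, 0.001):
        print(eps, frechet_lower_bound([1.0 - eps] * 4))
```

The second loop mirrors the limiting argument: when every marginal probability tends to 1, the lower bound on the joint probability also tends to 1.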

H Proof of Lemma 10
The marginal posterior of θ in the joint space has a multivariate Student t distribution with mean $\frac{g}{g+1}\hat\theta$, scale matrix $(n - r_2)^{-1}\bigl(s_y^2 + (g+1)^{-1}\hat\theta^\top (X_1^\top \Sigma^{-1} X_1)\hat\theta\bigr)\frac{g}{g+1}(X_1^\top \Sigma^{-1} X_1)^{-1}$, and $n - r_2$ degrees of freedom. A change of variables to $\xi = \frac{g+1}{g}\theta$ results in a multivariate Student t distribution with mean $\hat\theta$, scale matrix $(n - r_2)^{-1}\bigl((1 + g^{-1})s_y^2 + g^{-1}\hat\theta^\top (X_1^\top \Sigma^{-1} X_1)\hat\theta\bigr)(X_1^\top \Sigma^{-1} X_1)^{-1}$, and $n - r_2$ degrees of freedom. Note that the posterior probability is invariant under this transformation, i.e., $P_\pi(\theta \le 0 \mid y) = P_\pi(\xi \le 0 \mid y)$, because ξ is a positive multiple of θ. Furthermore, it is important to note that the factor $(1 + g^{-1})s_y^2 + g^{-1}\hat\theta^\top (X_1^\top \Sigma^{-1} X_1)\hat\theta$ in the scale matrix of ξ is a monotonically decreasing function of g. Now it is easy to see that if $\hat\theta \le 0$, $P_\pi(\xi \le 0 \mid y)$ monotonically increases as the scales decrease, and if $\hat\theta \nleq 0$, $P_\pi(\xi \le 0 \mid y)$ monotonically decreases as the scales decrease. Thus, in order to maximize B 01 if $\hat\theta \le 0$, and to maximize B 10 if $\hat\theta \nleq 0$, we have to let g go to ∞. For completeness, note that, in the limit as g → ∞, the marginal posterior of θ in the joint space has a multivariate Student t distribution with mean $\hat\theta$, scale matrix $(n - r_2)^{-1} s_y^2 (X_1^\top \Sigma^{-1} X_1)^{-1}$, and $n - r_2$ degrees of freedom. Thus, even though a (data-based) adaptive prior is considered, the choice of g that maximizes the Bayes factor does not depend on the data. Note that taking the limit g → ∞ was already considered by Mulder (2014a) but not in the context of an adaptive prior.
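The monotonicity argument can be illustrated in the simplest setting. The sketch below makes several simplifying assumptions that are not in the text: θ is scalar (r 1 = 1), so that X 1' Σ −1 X 1 reduces to a positive number `a`; `df` stands for n − r 2; `s2` stands for s 2 y; and all numerical values and function names are hypothetical. The Student t distribution function is computed by a simple midpoint rule rather than a library routine.

```python
import math


def scale_factor(g, s2, q):
    """The g-dependent factor (1 + 1/g) * s2 + q / g in the scale of xi."""
    return (1.0 + 1.0 / g) * s2 + q / g


def t_cdf(x, df, n=20000):
    """CDF of a standard Student t with `df` degrees of freedom.

    Uses the substitution t = tan(u) and a midpoint rule on [0, atan(|x|)].
    """
    if x == 0.0:
        return 0.5
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    h = math.atan(abs(x)) / n
    area = 0.0
    for k in range(n):
        t = math.tan((k + 0.5) * h)
        density = math.exp(log_c - 0.5 * (df + 1) * math.log1p(t * t / df))
        area += density * (1.0 + t * t)  # (1 + t^2) is the Jacobian dt/du
    return 0.5 + math.copysign(area * h, x)


def prob_xi_nonpositive(g, theta_hat, s2, a, df):
    """P(xi <= 0 | y) when xi is a scalar Student t with location theta_hat."""
    q = theta_hat ** 2 * a
    scale = scale_factor(g, s2, q) / (df * a)
    return t_cdf(-theta_hat / math.sqrt(scale), df)


if __name__ == "__main__":
    grid = (0.5, 1.0, 10.0, 1000.0)
    for theta_hat in (-1.0, 1.0):
        print(theta_hat, [prob_xi_nonpositive(g, theta_hat, 1.0, 2.0, 10) for g in grid])
```

In this scalar sketch the probability P(ξ ≤ 0 | y) increases in g when the estimate is negative and decreases in g when it is positive, in line with the argument that the extreme value is attained as g → ∞.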