On the correspondence from Bayesian log-linear modelling to logistic regression modelling with $g$-priors

Consider a set of categorical variables where at least one of them is binary. The log-linear model that describes the counts in the resulting contingency table implies a specific logistic regression model, with the binary variable as the outcome. Within the Bayesian framework, the $g$-prior and mixtures of $g$-priors are commonly assigned to the parameters of a generalized linear model. We prove that assigning a $g$-prior (or a mixture of $g$-priors) to the parameters of a certain log-linear model designates a $g$-prior (or a mixture of $g$-priors) on the parameters of the corresponding logistic regression. By deriving an asymptotic result, and with numerical illustrations, we demonstrate that when a $g$-prior is adopted, this correspondence extends to the posterior distribution of the model parameters. Thus, it is valid to translate inferences from fitting a log-linear model to inferences within the logistic regression framework, with regard to the presence of main effects and interaction terms.


1 Introduction
Consider observations $v = \{v_1, \ldots, v_n\}$, and parameters $\theta = \{\theta_1, \ldots, \theta_n\}$ and $\phi = \{\phi_1, \ldots, \phi_n\}$. Following standard notation, $v_i$, $i = 1, \ldots, n$, follows a distribution that is a member of the exponential family when its probability function can be written as
$$f(v_i \mid \theta_i, \phi_i) = \exp\left\{ \frac{w_i}{\phi_i}\left( v_i \theta_i - b(\theta_i) \right) + c(v_i, \phi_i) \right\},$$
where $w = \{w_1, \ldots, w_n\}$ are known weights, and $\phi_i$ is the dispersion or scale parameter. With regard to first and second order moments, $E(v_i) = \mu_i = b'(\theta_i)$ and $\mathrm{Var}(v_i) = \phi_i b''(\theta_i)/w_i$. The variance function is defined as $V(\mu_i) = b''(\theta_i)$. A generalized linear model relates $\mu = \{\mu_1, \ldots, \mu_n\}$ to covariates by setting $g(\mu) = X\gamma$, where $g$ denotes the link function, $X$ the covariate design matrix, and $\gamma$ a vector of parameters. For a single $\mu_i$, we write $g(\mu_i) = X_{(i)}\gamma$, where $X_{(i)}$ denotes the $i$-th row of $X$.
Denote with $\mathcal{P}$ a finite set of $P$ categorical variables. Observations from $\mathcal{P}$ can be arranged as counts in a $P$-way contingency table. Denote the cell counts as $n_i$, $i = 1, \ldots, n$. A Poisson distribution is assumed for the counts, so that $E(n_i) = \mu_i$. A Poisson log-linear interaction model $\log(\mu) = X_{ll}\lambda$ is a generalized linear model that relates the expected counts to $\mathcal{P}$. Assuming that one of the categorical variables, denoted with $Y$, is binary, a logistic regression can also be fitted with $Y$ as the outcome and the remaining $P - 1$ variables as covariates. We write $\mathrm{logit}(p) = X_{lt}\beta$, $p = (p_1, \ldots, p_n)$, where $p_i$ denotes the conditional probability that $Y = 1$ given covariates $X_{lt(i)}$, and $\beta$ is a vector of parameters.
It is known (Agresti, 2002, Chapter 8) that when $\mathcal{P}$ contains a binary $Y$, a log-linear model $\log(\mu) = X_{ll}\lambda$ implies a specific logistic regression model with parameters $\beta$ defined uniquely by $\lambda$. Conversely, the logistic regression model for the conditional odds of $Y$ implies an equivalent log-linear model with arbitrary interaction terms between the covariates in the logistic regression, plus arbitrary main effects for these covariates. We provide a simple example to illustrate this result and clarify additional notation. Assume three categorical variables $X$, $Y$ and $Z$, with $Y$ binary. Following the notation in Agresti (2002), let $i, j, k$ be integer indices that describe the level of $X$, $Y$ and $Z$ respectively. For instance, as $Y$ is binary, $j = 0, 1$. Consider the log-linear model
$$\log(\mu_{ijk}) = \lambda + \lambda^X_i + \lambda^Y_j + \lambda^Z_k + \lambda^{XY}_{ij} + \lambda^{YZ}_{jk} + \lambda^{XZ}_{ik}, \qquad (M1)$$
where the superscript denotes the main effect or interaction term. The corresponding logistic regression model for the conditional odds of $Y$ is derived as follows,
$$\mathrm{logit}(p_{ik}) = \log\left(\frac{\mu_{i1k}}{\mu_{i0k}}\right) = (\lambda^Y_1 - \lambda^Y_0) + (\lambda^{XY}_{i1} - \lambda^{XY}_{i0}) + (\lambda^{YZ}_{1k} - \lambda^{YZ}_{0k}).$$
This is a logistic regression with parameters
$$\beta = \lambda^Y_1 - \lambda^Y_0, \qquad \beta^X_i = \lambda^{XY}_{i1} - \lambda^{XY}_{i0}, \qquad \beta^Z_k = \lambda^{YZ}_{1k} - \lambda^{YZ}_{0k}.$$
Considering identifiability corner point constraints, all elements in $\lambda$ with a zero subscript are set to zero, so that $\beta = \lambda^Y_1$, $\beta^X_i = \lambda^{XY}_{i1}$ and $\beta^Z_k = \lambda^{YZ}_{1k}$. This scales in a straightforward manner to larger log-linear models. For instance, if (M1) contained the three-way interaction $XYZ$, then the corresponding logistic regression model would contain the $XZ$ interaction, so that $\beta^{XZ}_{ik} = \lambda^{XYZ}_{i1k}$. To demonstrate that the correspondence between log-linear and logistic models is not bijective, it is straightforward to show that, for example, the log-linear model
$$\log(\mu_{ijk}) = \lambda + \lambda^X_i + \lambda^Y_j + \lambda^Z_k + \lambda^{XY}_{ij} + \lambda^{YZ}_{jk}$$
implies the same logistic regression as (M1).
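The mapping above can be checked numerically. The following sketch (with arbitrary illustrative parameter values, not taken from the paper) computes the expected cell counts $\mu_{ijk}$ of model (M1) under corner point constraints and verifies that the conditional log-odds of $Y$ equal $\lambda^Y_1 + \lambda^{XY}_{i1} + \lambda^{YZ}_{1k}$ for every covariate combination:

```python
import itertools
import numpy as np

# Illustrative check (not the paper's code): pick arbitrary values for the
# parameters of model (M1) under corner point constraints, compute log(mu_ijk),
# and confirm that the conditional log-odds of Y recover
# beta = lambda^Y_1, beta^X_i = lambda^XY_i1, beta^Z_k = lambda^YZ_1k.
nX, nZ = 3, 2  # levels of X (i = 0,1,2) and Z (k = 0,1); Y is binary (j = 0,1)
lam0 = 1.5
lamX = np.array([0.0, 0.4, -0.3])                # lambda^X_i, lambda^X_0 = 0
lamY = np.array([0.0, 0.7])                      # lambda^Y_j, lambda^Y_0 = 0
lamZ = np.array([0.0, -0.2])                     # lambda^Z_k, lambda^Z_0 = 0
lamXY = np.zeros((nX, 2)); lamXY[1:, 1] = [0.5, -0.6]        # lambda^XY_ij
lamYZ = np.zeros((2, nZ)); lamYZ[1, 1:] = [0.9]              # lambda^YZ_jk
lamXZ = np.zeros((nX, nZ)); lamXZ[1:, 1:] = [[0.3], [-0.1]]  # lambda^XZ_ik

def log_mu(i, j, k):
    return (lam0 + lamX[i] + lamY[j] + lamZ[k]
            + lamXY[i, j] + lamYZ[j, k] + lamXZ[i, k])

for i, k in itertools.product(range(nX), range(nZ)):
    # conditional log-odds of Y = 1 given X = i, Z = k; the XZ term cancels
    logit_p = log_mu(i, 1, k) - log_mu(i, 0, k)
    assert np.isclose(logit_p, lamY[1] + lamXY[i, 1] + lamYZ[1, k])
print("logit(p_ik) = lambda^Y_1 + lambda^XY_i1 + lambda^YZ_1k for all i, k")
```

Note that the $\lambda^{XZ}_{ik}$ term cancels in the difference, which is why the same logistic regression arises whether or not the $XZ$ interaction is present in the log-linear model.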
To simplify the analysis and notation, for the remainder of this manuscript we consider models specified under corner point constraints. Then, every logistic regression model parameter is defined uniquely by the corresponding log-linear model parameter. This suggests the following implication. Assume that after fitting a specific log-linear model, the credible interval for an element of $\lambda$ contains zero. When fitting the corresponding logistic regression model, the investigator will anticipate that the credible interval for the corresponding element of $\beta$ will also contain zero, as the two parameters are equal. Importantly, for this translation to hold, it is essential that the prior distribution for $\beta$ implied by the prior on $\lambda$ is the same as the distribution the investigator would assign to $\beta$ if they were to fit the logistic model directly. If the implied prior on $\beta$ is different to a directly assigned prior then, with regard to $\beta$, there is no correspondence between the Bayesian log-linear and logistic analyses, rendering any translation meaningless. This is where the contribution in this manuscript lies. Typically, lack of information for the parameters of a generalized linear model leads to a relatively flat but proper prior distribution, so that model determination based on Bayes factors is valid; O'Hagan (1995). A very popular choice for the parameters of log-linear and logistic regression models is the g-prior; see Sabanés Bové and Held (2011) and Overstall and King (2014a, 2014b). This type of prior was first proposed by Zellner (1986) for general linear models. In this context, it is known as Zellner's g-prior. Theorem 1 states that assigning to $\lambda$ the g-prior that is specific to log-linear modelling implies the g-prior specific to logistic modelling on the parameters $\beta$ of the corresponding logistic regression.
Under the reasonable assumption that an investigator who chooses a g-prior for λ would also choose a g-prior for β if they were to fit a logistic regression directly, inferences on the parameters of a log-linear model translate to inferences on the parameters of the corresponding logistic regression. The presence or absence of interaction terms in the log-linear model can inform on the relation between the binary Y and the other variables as described by logistic regression. In fact, in the first illustration in Section 4.1, we observe that the credible intervals of the corresponding λ and β parameters are very similar, almost identical considering simulation error.
Another implication of the fact that a certain log-linear model implies a specific logistic regression concerns model determination. Assume that posterior probabilities in a space of log-linear models show the prominence of a certain model. The prominent log-linear model describes a certain dependence structure between the categorical factors, including the relation of the binary Y with all other factors. Although we do not provide a rigorous probabilistic result, we believe it is sensible to anticipate that the logistic regression model that corresponds to the prominent log-linear model will also be prominent in the space of logistic regressions. This is because the corresponding logistic regression describes the dependence structure between Y and the other factors that is supported by the data, in accordance with the log-linear analysis. When g-priors are adopted for the parameters of the linear models, Theorem 1 reinforces this argument as it strengthens the correspondence between the log-linear and logistic analyses.
Therefore, it is reasonable to expect that results from a single log-linear model determination analysis may translate to prominent, or at the very least interesting, logistic regressions for any of the binary factors that formed the contingency table. In Section 4, in the analysis of simulated and real data, the prominent logistic regression models are always the ones suggested by the single log-linear analysis.
In Section 2, we provide the definition of the g-prior and mixtures of g-priors, and describe how the g-prior is derived for log-linear and logistic regression models. Section 3 contains the main contribution of this manuscript. In Section 4, the correspondence between log-linear and logistic regression models is illustrated using simulated and real data. We conclude with a discussion.
2 The g-prior and mixtures of g-priors

A g-prior for the parameters $\gamma$ of a generalized linear model is a multivariate Normal distribution $N(m_\gamma, g\Sigma_\gamma)$, constructed so that the prior variance is a scalar multiple $g$ of the inverse Fisher information matrix; see Liang et al. (2008) for a discussion on the choice of $g$. Following Ntzoufras et al. (2003) and Ntzoufras (2009, Chapter 7), the g-prior for the parameters of log-linear and logistic regression models is specified so that $m_\gamma = (m_{\gamma_1}, 0, \ldots, 0)^\top$, where $m_{\gamma_1}$ corresponds to the intercept and can be non-zero, and
$$\Sigma_\gamma = V(m^*)\,[g'(m^*)]^{2} \left( X^\top \mathrm{diag}(1/\phi_i)\, X \right)^{-1},$$
where $\mathrm{diag}(1/\phi_i)$ denotes a diagonal $n \times n$ matrix with non-zero elements $1/\phi_i$, and $m^* = g^{-1}(m_{\gamma_1})$.
The unit information prior is a special case of the g-prior, obtained by setting $g = N$, where $N$ denotes the total number of observations. It is constructed so that the information contained in the prior is equal to the amount of information in a single observation (Kass and Wasserman, 1995). Assuming that $g$ is a random variable with prior $f(g)$ leads to a mixture of g-priors, so that
$$f(\gamma) = \int N(m_\gamma, g\Sigma_\gamma)\, f(g)\, dg.$$
Mixtures of g-priors are also called hyper-g priors; see Sabanés Bové and Held (2011).
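Sampling from a mixture of g-priors is a two-stage procedure: draw $g$ from $f(g)$, then draw $\gamma \mid g$ from the multivariate Normal. The sketch below (an assumed setup, not the paper's code) uses an Inverse-Gamma density for $f(g)$, the family used in Section 4; the shape parameter and prior mean here are arbitrary illustrative choices, whereas the paper sets the expectation of $g$ equal to $N$:

```python
import numpy as np

# Minimal sketch: draws from the mixture f(gamma) = int N(m, g*Sigma) f(g) dg,
# with f(g) an Inverse-Gamma density.  All numerical values are illustrative.
rng = np.random.default_rng(1)

m = np.array([np.log(20.0), 0.0, 0.0])  # prior mean: non-zero intercept only
Sigma = np.diag([0.5, 1.0, 1.0])        # illustrative base covariance
a, b = 3.0, 200.0                       # IG(a, b) has mean b/(a - 1) = 100

def sample_mixture_g_prior(size):
    # g ~ IG(a, b): reciprocal of a Gamma(a, rate = b) draw
    g = 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=size)
    z = rng.standard_normal((size, len(m)))
    L = np.linalg.cholesky(Sigma)
    return m + np.sqrt(g)[:, None] * (z @ L.T)  # gamma | g ~ N(m, g * Sigma)

draws = sample_mixture_g_prior(5000)
print(draws.shape)  # (5000, 3)
```

The same two-stage structure underlies posterior computation with hyper-g priors, where $g$ is updated alongside $\gamma$.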
Log-linear regression: Consider counts $n_i$, $i = 1, \ldots, n$. Now, $N = \sum_{i=1}^{n} n_i$, $\phi_i = 1$ and $V(\mu_i) = \mu_i$. For the log-linear model $\log(\mu) = X_{ll}\lambda$, $g$ is the log link, and the g-prior is $\lambda \sim N(m_\lambda, g\Sigma_\lambda)$, with $m_\lambda = (\log(\bar{n}), 0, \ldots, 0)^\top$, $\bar{n} = N/n$ denoting the average cell count, so that $m^* = \bar{n}$ and
$$\Sigma_\lambda = \bar{n}^{-1}\left( X_{ll}^\top X_{ll} \right)^{-1} = (n/N)\left( X_{ll}^\top X_{ll} \right)^{-1}.$$
In addition, Dellaportas and Forster (1999) adopt this formulation, advocating that $\Sigma_\lambda = d\,(n/N)(X_{ll}^\top X_{ll})^{-1}$ with $d \in [1, 2]$. Considering the unit information prior, they suggest that $d = 2$ is a sensible choice, given the amount of information contained in the prior with regard to the whole parameter space.
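As a small worked example of the covariance just described (under our reading of the formula, with $d = 2$ as in Dellaportas and Forster, 1999), the following sketch builds a main-effects design for a $2 \times 2$ table under corner point constraints and computes $m_\lambda$ and $\Sigma_\lambda$; the cell counts are arbitrary:

```python
import itertools
import numpy as np

# Sketch: log-linear g-prior pieces Sigma_lambda = d * (n/N) * (X'X)^{-1} and
# m_lambda = (log(n-bar), 0, ..., 0)' for a 2 x 2 table, main-effects model.
cells = list(itertools.product([0, 1], [0, 1]))
X_ll = np.array([[1, i, j] for i, j in cells], dtype=float)  # intercept, X, Y

counts = np.array([25, 10, 15, 30])  # illustrative cell counts
n, N, d = len(counts), counts.sum(), 2.0

Sigma_lam = d * (n / N) * np.linalg.inv(X_ll.T @ X_ll)
m_lam = np.r_[np.log(N / n), np.zeros(X_ll.shape[1] - 1)]  # log average count
print(Sigma_lam.shape)  # (3, 3)
```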
Logistic regression: Assume that $y_i$, $i = 1, \ldots, n$, is the proportion of successes out of $t_i$ trials. Now, $N = \sum_{i=1}^{n} t_i$, $\phi_i = 1/t_i$ and $V(p_i) = p_i(1 - p_i)$. The logistic regression model is defined as $\mathrm{logit}(p) = X_{lt}\beta$, so that $X_{lt}$ is an $n \times n_\beta$ design matrix, and $g$ is the logit link. By approximating each $t_i$ with the average number of trials $\bar{t} = N/n$, as suggested by Ntzoufras et al. (2003), the g-prior is $\beta \sim N(m_\beta, g\Sigma_\beta)$, with $m_\beta = (0, \ldots, 0)^\top$, so that $m^* = 1/2$, $V(m^*) = 1/4$, and
$$\Sigma_\beta = 4\,\bar{t}^{-1}\left( X_{lt}^\top X_{lt} \right)^{-1} = 4\,(n/N)\left( X_{lt}^\top X_{lt} \right)^{-1}.$$
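Under our reading of this formula, $\Sigma_\beta$ is exactly the inverse Fisher information in which each $t_i$ is replaced by $\bar{t}$ and each $p_i$ is evaluated at $1/2$, where $V(p) = p(1-p) = 1/4$. The sketch below (illustrative design and trial counts) confirms the algebraic identity:

```python
import numpy as np

# Sketch: logistic g-prior covariance Sigma_beta = 4 * (n/N) * (X'X)^{-1},
# compared against the Fisher information with t_i -> t-bar and p_i = 1/2.
X_lt = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # intercept + covariate
t = np.array([40, 55, 25])                             # trials per row
n, N = len(t), t.sum()

Sigma_beta = 4.0 * (n / N) * np.linalg.inv(X_lt.T @ X_lt)

# Approximated Fisher information: sum_i t-bar * (1/4) * x_i x_i'
info_bar = (0.25 * (N / n)) * (X_lt.T @ X_lt)
print(np.allclose(Sigma_beta, np.linalg.inv(info_bar)))  # True
```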

3 Relation between logistic regression and log-linear models
We show that, given that a g-prior for log-linear models is assigned to $\lambda$, the implied prior for $\beta$ is the g-prior for logistic regression models, i.e. the one that would be assigned if the investigator considered the logistic regression model directly.

Theorem 1: A g-prior $\lambda \sim N(m_\lambda, g\Sigma_\lambda)$, specified as in Section 2 for log-linear models, implies the g-prior $N(m_\beta, g\Sigma_\beta)$, specified as in Section 2 for logistic regression models, on the parameters $\beta$ of the corresponding logistic regression.
Proof: See Appendix.
Corollary 1: A unit information prior $\lambda \sim N(m_\lambda, N\Sigma_\lambda)$ implies a unit information prior $N(m_\beta, N\Sigma_\beta)$ for the parameters $\beta$ of the corresponding logistic regression.
Corollary 1 follows directly from Theorem 1 by setting $g = N$. The following corollary concerns mixtures of g-priors. It is implicitly assumed that the investigator would adopt the same prior density $f(g)$ for both modelling approaches.

Corollary 2: A mixture of g-priors $f(\lambda) = \int N(m_\lambda, g\Sigma_\lambda) f(g)\, dg$ implies the mixture of g-priors $f(\beta) = \int N(m_\beta, g\Sigma_\beta) f(g)\, dg$ for the parameters $\beta$ of the corresponding logistic regression.

4 Illustrations
In all subsequent model determination studies, the Reversible Jump MCMC algorithm proposed in Papathomas et al. (2011) was employed. Unit information priors were adopted for the model parameters ($g = N$), and all models were assumed equally probable a priori. The size of the burn-in sample was $10^5$, followed by $2 \times 10^5$ iterations.
The two data sets were also analysed after adopting a mixture of g-priors, with an Inverse-Gamma density for $f(g)$ whose expectation is $N$. Results are not reported, as posterior model probabilities and credible intervals were very similar to results under the unit information prior. Altering the variance of $g$ within $[1, 100]$ did not produce a notable difference in inferences.

4.1 A simulation study
Consider six binary variables $\{Y, A, B, C, D, E\}$, generating data from 1000 subjects in accordance with the log-linear model
$$\log(\mu) = Y AB + Y CD + Y E. \qquad (M2)$$
Following the notation in Agresti (2002), a single letter denotes the presence of a main effect, two-letter terms denote the presence of the implied first-order interaction, and so on. The presence of an interaction between a set of variables implies the presence of all lower order interactions plus main effects for that set. For the generated data, model selection on the space of all possible graphical log-linear models assigns posterior probability 0.98 to the model above. In this study, the signal in the data with regard to the dependence structure between the factors is very strong, allowing us to demonstrate fully the direct correspondence between the two types of analysis.
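A simulation of this kind can be sketched as follows (illustrative parameter values, not those used in the paper): expected counts for the $2^6$ table are formed from a linear predictor in which $Y$ interacts with $\{A, B\}$, $\{C, D\}$ and $E$, with all implied lower order terms present, and cell counts are then drawn from the Poisson distribution:

```python
import itertools
import numpy as np

# Illustrative sketch (not the paper's simulation code): a 2^6 table with
# Y-interactions YAB, YCD, YE plus implied lower order terms.  All coefficient
# values below are arbitrary choices for illustration.
rng = np.random.default_rng(2)

cells = np.array(list(itertools.product([0, 1], repeat=6)))  # rows: Y,A,B,C,D,E
Y, A, B, C, D, E = cells.T

log_mu = (2.0 + 0.2 * Y + 0.1 * A - 0.1 * B + 0.3 * C + 0.0 * D - 0.2 * E
          + 0.6 * A * B + 0.5 * C * D                 # covariate interactions
          + 0.3 * Y * A + 0.2 * Y * B - 0.3 * Y * C - 0.2 * Y * D
          + 0.8 * Y * A * B - 0.7 * Y * C * D + 0.4 * Y * E)  # Y-interactions

counts = rng.poisson(np.exp(log_mu))  # Poisson sampling of the 64 cell counts
print(counts.shape)  # (64,)
```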
According to the discussion and results in Sections 1 and 3, we expect that a logistic regression where $Y$ is treated as the outcome should only contain the first-order interactions $AB$ and $CD$ plus the $E$ main effect. To verify this, we perform model comparison in the space of all possible graphical logistic regression models. The most likely logistic regression model (with posterior probability 0.59) is
$$\mathrm{logit}(p) = AB + CD + E. \qquad (M3)$$
This confirms our expectation. Credible intervals (CI) for the model parameters are shown in Table 1. The CIs for the corresponding $\lambda$ and $\beta$ parameters are almost identical, considering simulation error. For example, the CI for $\lambda^{YAB}_{1,1,1}$ is $(-1.60, -0.51)$, whilst the CI for $\beta^{AB}_{1,1}$ is $(-1.57, -0.48)$.
In Table 1, we also report on the most prominent logistic regression model when any of the $\{A, B, C, D, E\}$ is treated as the outcome. Similarly to the results above, the most prominent model is always the one suggested by the log-linear analysis. There is a direct correspondence between the CIs of the $\lambda$ and $\beta$ parameters. For example, when $A$ is the outcome, the CI for $\beta^B_1$ is $(0.94, 1.64)$, whilst the CI for $\lambda^{AB}_{1,1}$ is $(0.92, 1.65)$. The CIs of $\beta$ parameters that correspond to a main effect or interaction not present in the prominent log-linear model always contain zero. For example, when $A$ is the outcome, the CI for $\beta^C_1$ is $(-0.37, 0.22)$, as the $AC$ interaction is absent from the prominent log-linear model.
The deviance is not crucial within the Bayesian framework. Nevertheless, it is noted that, in terms of deviance, a logistic regression model is equivalent to the largest possible log-linear model that corresponds to it. In our example, (M3) attains a deviance of 20.178, whilst (M2) produces 33.707. The log-linear model that is equivalent in terms of deviance to (M3) is $\log(\mu) = ABCDE + Y AB + Y CD + Y E$.

4.2 A real data illustration

Edwards and Havránek (1985) presented a $2^6$ contingency table in which 1841 men were cross-classified by six binary risk factors $\{A, B, C, D, E, F\}$ for coronary heart disease. From Dellaportas and Forster (1999), the top two hierarchical models are
$$\log(\mu) = AC + BC + AD + AE + CE + DE + F, \qquad (M4)$$
and
$$\log(\mu) = AC + BC + AD + AE + BE + DE + F, \qquad (M5)$$
with associated posterior model probabilities 0.28 and 0.15 respectively, for unit information priors and $d = 2$.
Although the data do not provide a very strong signal in terms of log-linear model determination, we anticipate that in a logistic regression with outcome $A$, only the $C$, $D$ and $E$ main effects will not be zero. To verify this, model comparison is performed in the space of all possible graphical logistic regression models, including main effects. The most likely model (with probability 0.42) is
$$\mathrm{logit}(p) = B + C + D + E + F.$$
The posterior credible intervals associated with the $B$ and $F$ main effect parameters contain zero; see Table 2. All other credible intervals indicate clearly that the remaining main effects are not zero. This confirms our expectation.
In Table 2, we also report on the most prominent logistic regression model when any of {B, C, D, E, F } is treated as the outcome. As expected, a logistic regression that only contains main effects is always the prominent model. Credible intervals for two β elements do not contain zero in spite of what was suggested by (M4) and (M5). The two CIs are displayed with bold font in Table 2.
In fact, both CIs concern parameters that describe the relation between B and F when either is treated as the outcome. We believe this anomaly is observed because the signal in the data is relatively weak, as 57% of the probability in the log-linear model space corresponds to a large number of models, each one associated with posterior probability that is less than 0.08.

5 Discussion
In general, results on the relation between different statistical models are of interest, as they improve understanding and enhance the models' utility. Often, developments for one modelling framework are not readily available for the other. For example, Papathomas and Richardson (2014) comment on the relation between log-linear modelling and variable selection within clustering, in particular with regard to marginal independence, whilst logistic regression models are not considered in that report.
The mapping between log-linear and logistic regression model spaces is not bijective. As a result, posterior model probabilities for one model space do not readily translate to model probabilities for the other. It is obvious that probabilities in the logistic model space do not translate to log-linear model probabilities as, typically, multiple log-linear models correspond to a single logistic regression. The converse translation is also not straightforward. To calculate logistic regression model probabilities, the probabilities of the corresponding log-linear models should be added up for each logistic regression. This can be too cumbersome an exercise, especially for large model spaces.
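The aggregation step described above can be sketched as follows. The model names, probabilities, and the simple term-dropping rule are all hypothetical illustrations (for models written in the additive shorthand of Section 1, where each interaction involving $Y$ contributes the corresponding covariate term to the logistic regression for $Y$):

```python
from collections import defaultdict

# Sketch: logistic regression model probabilities obtained by summing the
# posterior probabilities of all log-linear models that map to the same
# logistic regression.  Models and probabilities below are hypothetical.
loglinear_posterior = {
    "X+Y+Z+XY+YZ+XZ": 0.40,
    "X+Y+Z+XY+YZ": 0.35,   # maps to the same logistic model as the line above
    "X+Y+Z+YZ": 0.25,
}

def to_logistic(model):
    # Keep only interaction terms involving Y; each such term XY, YZ, ...
    # contributes the corresponding covariate to the logistic regression for Y.
    terms = [t.replace("Y", "") for t in model.split("+") if "Y" in t and t != "Y"]
    return "logit(p) = " + " + ".join(["1"] + sorted(terms))

logistic_posterior = defaultdict(float)
for model, prob in loglinear_posterior.items():
    logistic_posterior[to_logistic(model)] += prob

print(dict(logistic_posterior))
```

Here the first two log-linear models collapse to the same logistic regression, which therefore accumulates probability 0.75, while the third maps to a smaller model with probability 0.25.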
In addition, posterior model probabilities in the logistic regression model space, calculated using log-linear model probabilities, are likely to be different to the ones derived when the investigator considers the logistic regression model space directly. Dellaportas et al. (2012) advocate that prior distributions for the model space and the model parameters should be set in tandem.
However, when performing model comparison, it is still standard practice to assign equal prior probabilities to all competing models; see, for example, Dellaportas and Forster (1999). This paradigm was adopted in the simulated data analysis presented in Section 4.1, where (M2) is associated with posterior probability 0.98. We would expect the corresponding logistic regression model to be associated with the same or higher probability, as it corresponds to (M2) plus other log-linear models. That was not the case (its posterior probability was 0.59) because equal prior probabilities were assigned to all possible logistic regressions, rather than the probabilities implied by the uniform prior in the log-linear model space.
Some of the established model search algorithms, such as the Shotgun Stochastic Search (Hans et al., 2007), do not aim to calculate exact posterior model probabilities. They aim to detect regions in the model space with models that are associated with relatively higher probability compared to others. Therefore, in spite of the limitations discussed above, the presented results on the correspondence between the two frameworks can also be useful when large model spaces are searched. A single log-linear model determination implies, at the very least, interesting logistic regressions or model subspaces for any of the binary factors.

Appendix. Proof of Theorem 1: To facilitate the proof, the following notation is introduced. Write the mapping between $\beta$ and $\lambda$ as $\beta = T\lambda$, where
$$T = \left( \lambda_{(1)}, \ldots, \lambda_{(n_{\lambda^Y})} \right)^\top,$$
and $\lambda_{(k)}$, $k = 1, \ldots, n_{\lambda^Y}$, is a vector of zeros with the exception of one element that is equal to one. This element is in the position of the $k$-th $\lambda$ parameter with a $Y$ in its superscript. With $n_{\lambda^Y}$ we denote the number of parameters in $\lambda$ with a $Y$ in their superscript.
To ease algebraic calculations, and without any loss of generality, rearrange the elements of $\lambda$, creating a new vector $\lambda_r$ in which the parameters with a $Y$ in their superscript appear first, so that $T$ changes accordingly to
$$T_r = \left( I \;\; 0 \right),$$
where $I$ is an $n_\beta \times n_\beta$ identity matrix and $n_\beta$ is the number of elements in $\beta$. The rows and columns of $X_{ll}$ are also rearranged accordingly to create $X_{rll}$, so that $\log(\mu) = X_{rll}\lambda_r$. We can now write $\beta = T_r \lambda_r$. For example, assume the log-linear model (M1) describes a $3 \times 2 \times 2$ contingency table. Then, the standard arrangement of the elements of $\lambda$ would be such that
$$\lambda = \left( \lambda, \lambda^X_1, \lambda^X_2, \lambda^Y_1, \lambda^Z_1, \lambda^{XY}_{11}, \lambda^{XY}_{21}, \lambda^{YZ}_{11}, \lambda^{XZ}_{11}, \lambda^{XZ}_{21} \right)^\top.$$
After rearranging,
$$\lambda_r = \left( \lambda^Y_1, \lambda^{XY}_{11}, \lambda^{XY}_{21}, \lambda^{YZ}_{11}, \lambda, \lambda^X_1, \lambda^X_2, \lambda^Z_1, \lambda^{XZ}_{11}, \lambda^{XZ}_{21} \right)^\top.$$
The g-prior, $\lambda \sim N(m_\lambda, g\Sigma_\lambda) \equiv N\left( (\log(\bar{n}), 0, \ldots, 0)^\top, g d\, (n/N) (X_{ll}^\top X_{ll})^{-1} \right)$, where $d \in [1, 2]$, translates to
$$\lambda_r \sim N(m_{\lambda_r}, g\Sigma_{\lambda_r}) \equiv N\left( (0, \ldots, 0, \log(\bar{n}), 0, \ldots, 0)^\top, g d\, (n/N) (X_{rll}^\top X_{rll})^{-1} \right),$$
where $\log(\bar{n})$ is the $(n_\beta + 1)$-th element in the mean vector. Then,
$$E(\beta) = T_r m_{\lambda_r} = (0, \ldots, 0)^\top = m_\beta.$$
Furthermore,
$$\mathrm{Var}(\beta) = T_r \left( g d\, (n/N) (X_{rll}^\top X_{rll})^{-1} \right) T_r^\top = 2 g d\, (n/N) \left( X_{lt}^\top X_{lt} \right)^{-1}.$$
For $d = 2$ (in accordance with Dellaportas and Forster, 1999), $\mathrm{Var}(\beta) = 4 g\, (n/N) (X_{lt}^\top X_{lt})^{-1}$. Thus,
$$\beta \sim N(m_\beta, g\Sigma_\beta),$$
which is the g-prior for the parameters of a logistic regression; see Section 2.
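The key step of the proof, namely that the top-left $n_\beta \times n_\beta$ block of $(X_{rll}^\top X_{rll})^{-1}$ equals $2(X_{lt}^\top X_{lt})^{-1}$, can be checked numerically. The sketch below constructs the rearranged design for model (M1) on a $3 \times 2 \times 2$ table under corner point constraints, as in the example above:

```python
import itertools
import numpy as np

# Numerical check: T_r (X_rll' X_rll)^{-1} T_r' = 2 (X_lt' X_lt)^{-1} for
# model (M1) on a 3 x 2 x 2 table, corner point constraints, with the
# Y-superscript parameters placed first in lambda_r.
rows_ll, rows_lt = [], []
for i, k in itertools.product(range(3), range(2)):
    rows_lt.append([1, i == 1, i == 2, k == 1])  # logistic design: 1, X, Z
    for j in (0, 1):
        # columns: lambda^Y_1, lambda^XY_11, lambda^XY_21, lambda^YZ_11
        y_part = [j, (i == 1) * j, (i == 2) * j, (k == 1) * j]
        # columns: lambda, lambda^X_1, lambda^X_2, lambda^Z_1, XZ_11, XZ_21
        rest = [1, i == 1, i == 2, k == 1, (i == 1) * (k == 1), (i == 2) * (k == 1)]
        rows_ll.append(y_part + rest)

X_rll = np.array(rows_ll, dtype=float)  # 12 x 10, Y-parameters first
X_lt = np.array(rows_lt, dtype=float)   # 6 x 4
n_beta = 4
T_r = np.hstack([np.eye(n_beta), np.zeros((n_beta, X_rll.shape[1] - n_beta))])

lhs = T_r @ np.linalg.inv(X_rll.T @ X_rll) @ T_r.T
rhs = 2.0 * np.linalg.inv(X_lt.T @ X_lt)
print(np.allclose(lhs, rhs))  # True
```

The factor of 2 arises because each covariate combination appears in two cells of the table (one for each level of $Y$), which doubles the information carried by the non-$Y$ columns; combined with $d = 2$ it yields the factor 4 in the logistic g-prior covariance.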