On the correspondence from Bayesian log-linear modelling to logistic regression modelling with g-priors

Consider a set of categorical variables where at least one of them is binary. The log-linear model that describes the counts in the resulting contingency table implies a specific logistic regression model, with the binary variable as the outcome. Within the Bayesian framework, the g-prior and mixtures of g-priors are commonly assigned to the parameters of a generalized linear model. We prove that assigning a g-prior (or a mixture of g-priors) to the parameters of a certain log-linear model designates a g-prior (or a mixture of g-priors) on the parameters of the corresponding logistic regression. By deriving an asymptotic result, and with numerical illustrations, we demonstrate that when a g-prior is adopted, this correspondence extends to the posterior distribution of the model parameters. Thus, it is valid to translate inferences from fitting a log-linear model to inferences within the logistic regression framework, with regard to the presence of main effects and interaction terms.

1 Introduction

Our manuscript concerns the g-prior and mixtures of g-priors. After data are collected, the prior f(γ) is updated to the posterior distribution f(γ|Data) via the conditional probability formula and Bayes theorem, so that,

f(γ | Data) ∝ f(Data | γ) f(γ).

For the prior distributions discussed above, closed-form expressions for the posterior distribution f(γ|Data) do not exist. The posterior is typically calculated using Markov chain Monte Carlo stochastic simulation, or Normal approximations (O'Hagan and Forster 2004).

It is known (Agresti 2002) that when P contains a binary Y, a log-linear model log(μ) = X_ll λ implies a specific logistic regression model with parameters β defined uniquely by λ. The logistic regression model for the conditional odds ratios for Y implies an equivalent log-linear model with arbitrary interaction terms between the covariates in the logistic regression, plus arbitrary main effects for these covariates. We provide a simple example to illustrate this result and clarify additional notation. Assume three categorical variables X, Y and Z, with Y binary. Let i, j, k be integer indices that describe the level of X, Y and Z, respectively. For instance, as Y is binary, j = 0, 1. Consider the log-linear model,

log(μ_ijk) = λ + λ^X_i + λ^Y_j + λ^Z_k + λ^{XY}_{ij} + λ^{XZ}_{ik} + λ^{YZ}_{jk},    (M1)

where the superscript denotes the main effect or interaction term. The corresponding logistic regression model for the conditional odds ratios for Y is derived as follows,

log{P(Y = 1 | X = i, Z = k) / P(Y = 0 | X = i, Z = k)} = log(μ_{i1k} / μ_{i0k})
    = (λ^Y_1 − λ^Y_0) + (λ^{XY}_{i1} − λ^{XY}_{i0}) + (λ^{YZ}_{1k} − λ^{YZ}_{0k}).

This is a logistic regression with parameters β = (β, β^X_i, β^Z_k), so that,

β = λ^Y_1 − λ^Y_0,   β^X_i = λ^{XY}_{i1} − λ^{XY}_{i0},   β^Z_k = λ^{YZ}_{1k} − λ^{YZ}_{0k}.

Considering identifiability corner point constraints, all elements in λ with a zero subscript are set to zero. Then, β = λ^Y_1, β^X_i = λ^{XY}_{i1} and β^Z_k = λ^{YZ}_{1k}. This scales in a straightforward manner to larger log-linear models. For instance, if (M1) contained the three-way interaction XYZ, then the corresponding logistic regression model would contain the XZ interaction, so that β^{XZ}_{ik} = λ^{XYZ}_{i1k} − λ^{XYZ}_{i0k}, and under corner point constraints, β^{XZ}_{ik} = λ^{XYZ}_{i1k}. If a factor does not interact with Y in the log-linear model, then this factor disappears from the corresponding logistic regression model. The correspondence between log-linear and logistic models is not bijective; it is straightforward to show that, for example, the log-linear model,

log(μ_ijk) = λ + λ^X_i + λ^Y_j + λ^Z_k + λ^{XY}_{ij} + λ^{YZ}_{jk},

implies the same logistic regression as (M1). More generally, the relation between β and λ can be described as β = Tλ, where T is an incidence matrix (Bapat 2011). In the context of this manuscript, matrix T has one row for each element of β, and one column for each element of λ. The elements of T are zero, except in the case where the element of β is defined by the corresponding element of λ. The number of rows of T cannot be greater than the number of columns. To simplify the analysis and notation, for the remainder of this manuscript we consider models specified under corner point constraints. Then, every logistic regression model parameter is defined uniquely by the corresponding log-linear model parameter, and the correspondence from a log-linear to a logistic regression model is direct.
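As a concrete illustration of the incidence matrix T, the following minimal sketch builds T for model (M1) on a 3 × 2 × 2 table under corner point constraints and maps λ to β. The parameter ordering and the numerical values are illustrative only; they are not part of the original analysis.

```python
import numpy as np

# Illustrative ordering of the log-linear parameters for model (M1) on a
# 3 x 2 x 2 table (X has levels 1, 2, 3; Y and Z are binary), under corner
# point constraints (parameters with a zero subscript are dropped).
lam_names = ["lambda", "X2", "X3", "Y1", "Z1",
             "XY21", "XY31", "XZ21", "XZ31", "YZ11"]

# Each logistic regression parameter and the log-linear parameter that
# defines it: beta = lambda_Y1, beta_X_i = lambda_XY_i1, beta_Z_1 = lambda_YZ_11.
beta_from_lambda = {"beta": "Y1", "beta_X2": "XY21",
                    "beta_X3": "XY31", "beta_Z1": "YZ11"}

# Incidence matrix T: one row per element of beta, one column per element
# of lambda; a single 1 marks the defining log-linear parameter.
T = np.zeros((len(beta_from_lambda), len(lam_names)))
for row, source in enumerate(beta_from_lambda.values()):
    T[row, lam_names.index(source)] = 1.0

lam = np.random.default_rng(1).normal(size=len(lam_names))  # toy values for lambda
beta = T @ lam                                               # beta = T lambda
print(dict(zip(beta_from_lambda, np.round(beta, 3))))
```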
The contribution of our manuscript is twofold. First, Theorem 1 states that assigning to λ the g-prior that is specific to log-linear modelling implies the g-prior specific to logistic modelling on the parameters β of the corresponding logistic regression. The log-linear model has to be the largest model that corresponds to the logistic regression, i.e. the model that contains all possible interaction terms between the categorical factors in P\{Y}. Second, under the reasonable assumption that an investigator who chooses a g-prior for λ would also choose a g-prior for β if they were to fit a logistic regression directly, inferences on the parameters of a log-linear model translate to inferences on the parameters of the corresponding logistic regression. Closed-form expressions for the posterior distributions do not exist. Wang and George (2007) utilize the Laplace approximation for generalized linear models, focusing on the approximation of the marginal likelihood for the purpose of variable selection. Theorem 2 shows that, asymptotically, the matching between the prior distributions of the corresponding parameters extends to the posterior distributions. It is then demonstrated with numerical illustrations that the presence or absence of interaction terms in the log-linear model can inform on the relation between the binary Y and the other variables as described by logistic regression. For example, assume that after fitting the log-linear model, the credible interval for an element of λ contains zero. When fitting the corresponding logistic regression model, the investigator will anticipate that the credible interval for the corresponding element of β will also contain zero. Importantly, for this translation to hold, it is essential that the prior distribution for β implied by the prior on λ is the same as the distribution the investigator would assign to β if they were to fit the logistic model directly. If the implied prior on β is not the same as a directly assigned prior, then, with regard to β, the correspondence from the Bayesian log-linear analysis to the logistic one becomes dubious. In both illustrations in Sect. 4, we observe that the credible intervals of the corresponding λ and β parameters are virtually identical, considering simulation error.
In Sect. 2, we provide the definition of the g-prior and mixtures of g-priors, and describe how the g-prior is derived for log-linear and logistic regression models. Section 3 contains the main contributions of this manuscript. In Sect. 4, the correspondence from a log-linear to a logistic regression model is illustrated using simulated and real data. We conclude with a discussion.

2 The g-prior and mixtures of g-priors
A g-prior for the parameters γ of a generalized linear model is a multivariate Normal distribution N(m_γ, gΣ_γ), constructed so that the prior variance is a multiple of the inverse Fisher information matrix by a scalar g. See Liang et al. (2008) for a discussion on the choice of g. In accordance with Ntzoufras et al. (2003) and Ntzoufras (2009), the g-prior for the parameters of log-linear and logistic regression models is specified so that m_γ = (m_{γ1}, 0, ..., 0)', where m_{γ1} corresponds to the intercept and can be nonzero, and,

Σ_γ = [ζ'(m*)]^2 V(m*) {X_d' diag(1/φ_i) X_d}^{-1},

where diag(1/φ_i) denotes a diagonal n × n matrix with nonzero elements 1/φ_i, X_d is the design matrix, ζ(·) is the link function, V(·) is the variance function, and m* = ζ^{-1}(m_{γ1}). The unit information prior is a special case of the g-prior, obtained by setting g = N, where N denotes the total number of observations. It is constructed so that the information contained in the prior is equal to the amount of information in a single observation (Kass and Wasserman 1995). Assuming that g is a random variable, with prior f(g), leads to a mixture of g-priors, so that,

f(γ) = ∫ N(γ; m_γ, gΣ_γ) f(g) dg.

Mixtures of g-priors are also called hyper-g priors (Sabanés Bové and Held 2011).
Log-linear regression. Consider counts n_i, i = 1, ..., n_ll, so that n_i ∼ Poisson(μ_i). Now, N = Σ_{i=1}^{n_ll} n_i, and V(μ_i) = μ_i with φ_i = 1. For the log-linear model, log(μ) = X_ll λ, and ζ(μ_i) = log(μ_i) so that ζ'(μ_i) = μ_i^{-1}. The g-prior is constructed as N(m_λ, gΣ_λ), where m_λ = (log(n̄), 0, ..., 0)'. Here, n̄ denotes the average cell count. The prior mean for the log-linear model intercept is also often set to zero (Dellaportas et al. 2012). (Note that altering the prior mean for the log-linear model intercept does not affect the validity of the theoretical results in Sect. 3. This is straightforward to deduce from the proof of Theorem 1 given in 'Appendix', as the prior mean for the log-linear intercept does not affect the implied distribution of the logistic regression parameters.) In addition,

Σ_λ = n̄^{-1} (X_ll' X_ll)^{-1}.

Logistic regression. Assume that y_i, i = 1, ..., n_lt, is the proportion of successes out of t_i trials, so that t_i y_i ∼ Binomial(t_i, p_i). Now, N = Σ_{i=1}^{n_lt} t_i, and V(p_i) = p_i(1 − p_i) with φ_i = 1/t_i. The logistic regression model is defined as logit(p) = X_lt β, so that X_lt is an n_lt × n_β design matrix, and ζ(p_i) = logit(p_i) so that ζ'(p_i) = {p_i(1 − p_i)}^{-1}. The g-prior is constructed as N(m_β, gΣ_β), where m_β = (0, ..., 0)'. Here, p* corresponds to m* in the general definition of the g-prior at the start of this section, so that p* = ζ^{-1}(m_{γ1}), where m_{γ1} is the first element of m_β which is zero. Thus, we obtain that p* = e^0/(e^0 + 1) = 0.5. By approximating each t_i with the average number of trials t̄, as suggested by Ntzoufras et al. (2003),

Σ_β = (4/t̄) (X_lt' X_lt)^{-1}.
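As a quick reference, the two prior covariance matrices above can be sketched numerically as follows; the function names are illustrative, and the expressions simply implement Σ_λ = n̄^{-1}(X_ll' X_ll)^{-1} and Σ_β = (4/t̄)(X_lt' X_lt)^{-1} as stated above.

```python
import numpy as np

def gprior_loglinear(X_ll, counts, g):
    """g-prior N(m_lambda, g * Sigma_lambda) for a Poisson log-linear model,
    with Sigma_lambda = (1/nbar) (X'X)^{-1} and nbar the average cell count."""
    nbar = np.mean(counts)
    Sigma = np.linalg.inv(X_ll.T @ X_ll) / nbar
    m = np.zeros(X_ll.shape[1])
    m[0] = np.log(nbar)          # prior mean log(nbar) for the intercept
    return m, g * Sigma

def gprior_logistic(X_lt, trials, g):
    """g-prior N(0, g * Sigma_beta) for a Binomial logistic regression,
    with Sigma_beta = (4/tbar) (X'X)^{-1} and tbar the average number of trials."""
    tbar = np.mean(trials)
    Sigma = 4.0 * np.linalg.inv(X_lt.T @ X_lt) / tbar
    return np.zeros(X_lt.shape[1]), g * Sigma
```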

3 Correspondence from log-linear to logistic regression models
Consider a set of categorical variables P that includes a binary variable Y. Assume a log-linear model that, in addition to the terms that involve Y, contains all possible interaction terms between the categorical factors in P\{Y}. We show that, given that a g-prior is assigned to the log-linear model parameters λ, the implied prior for β is a g-prior for logistic regression models, i.e. the one that would be assigned if the investigator considered the logistic regression model directly.
Theorem 1 Assume a g-prior λ ∼ N(m_λ, gΣ_λ) on the parameters of a log-linear model log(μ) = X_ll λ that contains all possible interaction terms between the categorical factors in P\{Y}. This prior implies a g-prior N(m_β, gΣ_β) for the parameters β of the corresponding logistic regression logit(p) = X_lt β.
Proof The proof is based on rearranging the rows and columns of X_ll, and partitioning so that one part of X_ll consists of the logistic design matrix X_lt, or replications of X_lt. We then show that the prior mean and variance of the elements of λ that correspond to β are those that would be assigned to β if the logistic regression was fitted directly. The complete proof is given in 'Appendix'.
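A quick numerical check of the matching stated in Theorem 1, on a 2 × 2 table (the smallest possible case) with assumed illustrative cell counts: the block of the implied log-linear prior variance that corresponds to β coincides with the directly assigned logistic g-prior variance.

```python
import numpy as np

# Cells of a 2 x 2 table ordered as (X, Y) = (0,0), (0,1), (1,0), (1,1).
# Log-linear design under corner point constraints: intercept, X, Y, XY.
X_ll = np.array([[1, 0, 0, 0],
                 [1, 0, 1, 0],
                 [1, 1, 0, 0],
                 [1, 1, 1, 1]], dtype=float)
# Logistic design for the two covariate patterns x = 0, 1: intercept, X.
X_lt = np.array([[1, 0],
                 [1, 1]], dtype=float)

counts = np.array([25.0, 40.0, 30.0, 55.0])      # illustrative cell counts
g, nbar = counts.sum(), counts.mean()            # unit information prior: g = N
tbar = counts.sum() / X_lt.shape[0]              # average number of trials

Sigma_lam = np.linalg.inv(X_ll.T @ X_ll) / nbar          # log-linear g-prior scale
Sigma_beta = 4.0 * np.linalg.inv(X_lt.T @ X_lt) / tbar   # logistic g-prior scale

idx = [2, 3]                                     # positions of (Y, XY) in lambda
implied = g * Sigma_lam[np.ix_(idx, idx)]        # prior variance implied for beta
direct = g * Sigma_beta                          # directly assigned logistic prior
print(np.allclose(implied, direct))              # True: the two priors agree
```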

Corollary 1 A unit information prior λ ∼ N(m_λ, NΣ_λ) implies a unit information prior N(m_β, NΣ_β) for the parameters β of the corresponding logistic regression.
Corollary 1 follows directly from Theorem 1 by setting g = N. The following Corollary concerns mixtures of g-priors. It is implicitly assumed that the investigator would adopt the same prior density f(g) for both modelling approaches.
Corollary 2 A mixture of g-priors such that λ|g ∼ N(m_λ, gΣ_λ), g ∼ f(g), implies a mixture of g-priors for the parameters β of the corresponding logistic regression, so that β|g ∼ N(m_β, gΣ_β), g ∼ f(g). This also follows from Theorem 1, which states that when λ|g ∼ N(m_λ, gΣ_λ), the conditional prior for β is β|g ∼ N(m_β, gΣ_β).
When the g-prior is utilized, it is common to assign a locally uniform Jeffreys prior (∝ 1) on the intercept, after the covariate columns of the design matrix have been centred to ensure orthogonality with the intercept (Liang et al. 2008). If one decides to adopt the approach where a flat prior is assigned to the intercept in both log-linear and logistic formulations, the correspondence between log-linear and logistic regression breaks, but only with regard to the intercept of the logistic regression. The prior on the log-linear intercept does not have a bearing on the implied prior for the logistic regression parameters, because the log-linear intercept does not contribute to the formation of the logistic regression parameters, as described in Sect. 1. After assigning a flat prior on the intercept of the log-linear model, all β parameters (including the intercept) are still Normal as linear combinations of Normal random variables, and the distribution of β is the one given by Theorem 1. For details, see the additional material in the proof of Theorem 1 in 'Appendix'. For an illustration, see Table 3 in Sect. 4.2.
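The centring step mentioned above can be sketched as follows; this is a generic illustration, not tied to any particular software used for the analyses in this manuscript.

```python
import numpy as np

def centre_covariates(X):
    """Centre the non-intercept columns of a design matrix so that they are
    orthogonal to the column of ones (the intercept)."""
    Xc = X.astype(float).copy()
    Xc[:, 1:] -= Xc[:, 1:].mean(axis=0)
    return Xc

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
Xc = centre_covariates(X)
print(np.allclose(Xc[:, 1:].T @ Xc[:, 0], 0))   # True: covariates orthogonal to intercept
```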
Closed-form expressions for the posterior distribution of the parameters of a generalized linear model do not exist. However, it is known (O'Hagan and Forster 2004) that a Normal approximation applies. Consider a g-prior for the parameters γ of the generalized linear model ζ(μ) = X_d γ, so that, for fixed g, γ ∼ N(m_γ, gΣ_γ). Given observations v = {v_1, ..., v_n}, the posterior distribution of γ is approximated by a Normal density, so that,

γ | v ∼ N( {(gΣ_γ)^{-1} + I(γ̂)}^{-1} {(gΣ_γ)^{-1} m_γ + I(γ̂) γ̂},  {(gΣ_γ)^{-1} + I(γ̂)}^{-1} ).    (1)

Here, γ̂ is the maximum likelihood estimate of γ, and I(γ) is the information matrix X_d' V X_d. For the log-linear model, the diagonal matrix V (denoted by V_log-linear) has diagonal elements exp{X_ll(i) λ̂}, i = 1, ..., n_ll. When the logistic regression is fitted, V_logistic has diagonal elements t_i exp{X_lt(i) β̂}{1 + exp(X_lt(i) β̂)}^{-2}, i = 1, ..., n_lt. Within the Bayesian framework, when fitting a generalized linear model, a large sample (n → ∞) will swamp the prior distribution, rendering it irrelevant for deriving posterior inferences (O'Hagan and Forster 2004). In practice, this can be true even for moderate sample sizes (say, of order 10^2 or larger), especially when the prior is not informative, which is typically the case with g-priors.
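The approximation in (1) can be written as a small function; the argument names below are illustrative, and V_hat holds the diagonal GLM weights evaluated at the MLE (exp{X_ll(i)λ̂} for the log-linear model, t_i p̂_i(1 − p̂_i) for the logistic regression).

```python
import numpy as np

def normal_approx_posterior(X, V_hat, gamma_hat, m0, Sigma0, g):
    """Normal approximation to the posterior of a GLM parameter vector:
    the N(m0, g * Sigma0) prior is combined with the likelihood approximated
    around the MLE gamma_hat, whose information matrix is X' diag(V_hat) X."""
    info = X.T @ (V_hat[:, None] * X)              # I(gamma_hat) = X' V X
    prior_prec = np.linalg.inv(g * Sigma0)         # prior precision (g Sigma0)^{-1}
    post_cov = np.linalg.inv(prior_prec + info)
    post_mean = post_cov @ (prior_prec @ m0 + info @ gamma_hat)
    return post_mean, post_cov
```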
Theorem 2 Consider a g-prior λ ∼ N(m_λ, gΣ_λ) on the parameters of a log-linear model log(μ) = X_ll λ that contains all possible interaction terms between the categorical factors in P\{Y}. Consider also the analogous g-prior N(m_β, gΣ_β) for the parameters β of the corresponding logistic regression logit(p) = X_lt β. For fixed g, and for a large sample, the posterior distribution of β, as given in (1), is approximately equal to the posterior distribution of the elements of λ that correspond to β.
Proof A partitioning similar to the one adopted for the proof of Theorem 1 is utilized. First, we show that, asymptotically, the posterior variance of β is the posterior variance of the elements of λ that correspond to β. Then, we do the same for the posterior means. The proof is based on the crucial assumption that for a large sample the contribution of the prior in deriving the posterior moments can be ignored. A standard result utilized in the proof is that, asymptotically, the Binomial distribution for a data point can be approximated by a Poisson distribution. The complete proof is given in 'Appendix'.
In the next section, we demonstrate with numerical illustrations that, for fixed g, the correspondence between the priors extends to posterior distributions, so that the posterior distribution of the logistic regression parameters matches the one of the corresponding log-linear model parameters. This is true even for relatively moderate sample sizes N , say a few hundred, and for standard choices of g such as g = N .

4 Illustrations
Unit information priors were adopted for the model parameters (g = N). The size of the burn-in sample was 10^4, followed by 5 × 10^5 iterations.

4.1 A simulation study
We simulate data from 1000 subjects, on six binary variables {Y, A, B, C, D, E}. Probabilities that correspond to the cells of the 2^6 contingency table are generated in accordance with the log-linear model, log(μ) = YAB + YCD + YE. Adopting the notation in Agresti (2002), a single letter denotes the presence of a main effect, two-letter terms denote the presence of the implied first-order interaction, and so on. The presence of an interaction between a set of variables implies the presence of all lower-order interactions plus main effects for that set. Cell counts are simulated according to the generated cell probabilities. Parameter values and the design matrix of the log-linear model used to generate the cell probabilities are given in Supplemental material, Section S2.
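A minimal sketch of this simulation step follows; the coefficients used below are illustrative stand-ins for the parameter values reported in the Supplemental material.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)

# All 2^6 cells for (Y, A, B, C, D, E).
cells = np.array(list(product([0, 1], repeat=6)))
Y, A, B, C, D, E = cells.T

# Linear predictor for log(mu) under YAB + YCD + YE, with toy coefficients
# (the values used in the paper are given in Supplemental material, Section S2).
eta = (0.4*Y + 0.3*A + 0.2*B + 0.4*C - 0.3*D + 0.1*E
       + 0.2*Y*A + 0.3*Y*B - 0.2*Y*C + 0.1*Y*D + 0.4*Y*E
       + 0.5*A*B - 0.4*C*D + 0.6*Y*A*B - 0.7*Y*C*D)

probs = np.exp(eta) / np.exp(eta).sum()        # cell probabilities
counts = rng.multinomial(1000, probs)          # simulated 2^6 contingency table
print(counts.sum())                            # 1000 subjects in total
```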
We fit to the simulated data the log-linear model that contains all possible interactions between the factors in {A, B, C, D, E}, together with the terms that involve Y,

log(μ) = ABCDE + YAB + YCD + YE.    (M2)

According to the discussion and results in Sects. 1 and 3, the corresponding logistic regression where Y is treated as the outcome only contains the first-order interactions AB and CD plus the main effect for E,

logit(p_{abcde}) = β + β^A_a + β^B_b + β^{AB}_{ab} + β^C_c + β^D_d + β^{CD}_{cd} + β^E_e.    (M3)

In Table 1, we present credible intervals (CIs) for the parameters of (M3) and the relevant parameters of (M2). The CIs for the corresponding λ and β parameters are almost identical, considering simulation error. For example, the CI for λ^{YCD}_{1,1,1} is (−2.01, −0.85), whilst the CI for β^{CD}_{1,1} is (−2.00, −0.84). In Table 2, we present minimum, maximum and quantile values for the t_i observations, for the logistic regression in Table 1. It is clear that the simulated data do not represent balanced Binomial experiments where t_i = t̄. The credible intervals listed in Table 1 demonstrate that the correspondence studied in this manuscript is very robust to departures from t_i = t̄. This is also demonstrated in the real data analysis presented in the next subsection, where the collected data do not represent balanced Binomial experiments when one of the factors is treated as the outcome. In Supplemental material, we present additional analyses on simulated data sets, including results on smaller samples, roughly one quarter the size of the data set analysed in this section. Inferences on the correspondence between the posterior distributions remain unchanged.
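When Y is treated as the outcome, the contingency table collapses into Binomial observations (y_i, t_i), one per covariate pattern, which is how the t_i values summarized in Table 2 arise. A small sketch of this step is given below; the helper is illustrative and reuses cells and counts from the simulation sketch above.

```python
import numpy as np

def table_to_binomial(counts, cells, y_col=0):
    """Collapse a contingency table into Binomial data (successes, trials),
    one pair per covariate pattern, treating the binary factor stored in
    column y_col of cells as the outcome."""
    covariates = np.delete(cells, y_col, axis=1)
    patterns, idx = np.unique(covariates, axis=0, return_inverse=True)
    trials = np.bincount(idx, weights=counts)
    successes = np.bincount(idx, weights=counts * (cells[:, y_col] == 1))
    return patterns, successes, trials

# patterns, y, t = table_to_binomial(counts, cells)
# The trials t typically differ across patterns, i.e. t_i != tbar.
```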

4.2 A real data illustration
Edwards and Havránek (1985) presented a 2^6 contingency table in which 1841 men were cross-classified by six binary risk factors {A, B, C, D, E, F} for coronary heart disease. The data were also analysed in Dellaportas and Forster (1999), where the top hierarchical model was log(μ) = AC + AD + AE + BC + CE + DE + F, with posterior model probability 0.28. In Table 3, we present CIs for the parameters of the fitted log-linear model. We also present CIs for the parameters of the corresponding logistic regression model when A is treated as the outcome, which contains main effects for C, D and E. We performed this analysis twice: once after considering the g-priors described in Sect. 2 (g = N), as in the previous illustration, and once after adopting a g-prior with a locally flat prior for the intercept. Under the g-prior described in Sect. 2, the CIs for the corresponding λ and β parameters (including the intercept) are almost identical, considering simulation error. For instance, the CI for both the coefficient of A in the log-linear model and the intercept in the logistic regression is (−0.59, −0.24). Under the flat prior for the intercepts, the correspondence breaks down with regard to the intercept in the logistic regression model. The CI for the coefficient of A in the log-linear model is (−0.59, −0.24), whilst the CI for the intercept of the corresponding logistic regression model is (−0.17, 0.02). Concurrently, the credible intervals for the remaining λ and β parameters remain virtually identical to one another.

5 Discussion
The correspondence we investigated is not unexpected, given the results in Agresti (2002) discussed in the Introduction, and also the link between the g-prior and Fisher's information matrix (Held et al. 2015), although this link is stronger for general linear models. Our investigation is also related to Consonni and Veronese (2008), where specifying a prior for the parameters of one model, and then transferring this specification to the parameters of another, is discussed. Of the four strategies considered in Consonni and Veronese (2008), the one directly linked to our manuscript is 'Marginalization', as the derived prior for the parameters of the logistic regression is the marginal prior of the relevant parameters of the log-linear model. Results on the relation between different statistical models are of interest, as they improve understanding and enhance the models' utility. Often, developments for one modelling framework are not readily available for the other. For example, Papathomas and Richardson (2016) comment on the relation between log-linear modelling and variable selection within clustering, in particular with regard to marginal independence, without examining logistic regression models.

Our numerical illustrations concern the g-prior, where the parameter g is fixed. To further explore the correspondence between the two modelling frameworks, we also considered the two hyper priors that are prominent in Liang et al. (2008): the Zellner–Siow prior [g ∼ IG(0.5, N/2)], and the prior introduced in Sect. 4.2 of that manuscript, with the suggested specification α = 3. Furthermore, the two data sets were analysed after adopting a mixture of g-priors such that g ∼ IG(a_g, b_g). We considered a_g = 2 + mean(g)^2/var(g) and b_g = mean(g) + mean(g)^3/var(g), in accordance with the specified prior moments mean(g) and var(g); this moment matching is sketched at the end of this section. We considered distinct Inverse-Gamma densities with markedly different expectations and variances, as well as the vague prior IG(0.1, 0.1). We observed that the correspondence does not hold exactly when a mixture of g-priors is adopted. This seems to be because the posterior distribution for g is different under the two modelling frameworks, something that affects, to a small but noticeable degree, the posterior credible intervals for the model parameters. For more details, see the analyses presented in Supplemental material.

Theoretical results in this manuscript refer to a specific log-linear model and the corresponding logistic regression model, for a given set of covariates. Therefore, our results should not be misinterpreted as licence to readily translate log-linear model selection inferences to inferences concerning logistic regression models. When performing model selection in a space of log-linear models, the prominent log-linear model describes a certain dependence structure between the categorical factors, including the relation of the binary Y with all other factors. The logistic regression that corresponds to the prominent log-linear model describes the dependence structure between Y and the other factors that is supported by the data in accordance with the log-linear analysis. Therefore, it is reasonable to expect that results from a single log-linear model determination analysis translate, at the very least, to interesting logistic regressions for any of the binary factors that formed the contingency table.
However, the mapping between log-linear and logistic regression model spaces is not bijective. Furthermore, posterior model probabilities depend on the prior on the model space, with various approaches for defining such a prior discussed in Dellaportas et al. (2012). For the simulated data analysed in Sect. 4.1, log-linear model YAB + YCD + YE has posterior probability 0.98, whilst the posterior probability of the corresponding logistic regression model (M3) is 0.59. Similar results from analysing the real data in Sect. 4.2, not presented here, also support this note of caution. In all model determination analyses, the Reversible Jump MCMC algorithm proposed in Papathomas et al. (2011) was employed. All possible graphical log-linear models were assumed equally likely a priori, as were all possible logistic graphical models for some given outcome.
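The Inverse-Gamma moment matching used in the discussion above amounts to the following small helper; the function name is illustrative and the numerical example is hypothetical.

```python
def ig_hyperparameters(mean_g, var_g):
    """Inverse-Gamma hyperparameters matching a specified prior mean and
    variance for g: a_g = 2 + mean^2/var and b_g = mean + mean^3/var."""
    a_g = 2.0 + mean_g**2 / var_g
    b_g = mean_g + mean_g**3 / var_g
    return a_g, b_g

# For example, a prior for g centred at the sample size with a large variance:
# ig_hyperparameters(1841.0, 1.0e6)
```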
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix
Proof of Theorem 1 To facilitate the proof, the following notation is introduced. Using the incidence matrix T discussed in Sect. 1, write the mapping between β and λ as β = Tλ, where

T = (λ^{(1)}, λ^{(2)}, ..., λ^{(n_λY)})',

and λ^{(k)}, k = 1, ..., n_λY, is a vector of zeros with the exception of one element that is equal to one. This element is in the position of the kth λ parameter with a Y in its superscript. With n_λY we denote the number of parameters in λ with a Y in their superscript. This is a more rigorous definition of T compared to the more descriptive definition in Sect. 1. To ease algebraic calculations, and without any loss of generality, rearrange the elements of λ, creating a new vector λ_r, so that T changes accordingly to T_r = (I 0), where I is an n_β × n_β identity matrix and n_β is the number of elements in β. The rows and columns of X_ll are also rearranged accordingly to create X_rll, so that,

X_rll = ( X*_lt    X_ll-lt
          0        X_ll-lt ),    (2)

where X_ll-lt is a square (n_ll/2 × n_ll/2) matrix. This is because we consider the log-linear model that, in addition to the terms that involve Y, contains all possible interaction terms between the categorical factors in P\{Y}. The number of parameters that correspond to the intercept, main effects and interactions for P\{Y} is n_ll/2.
Denote with j_1 = 2 the number of levels of the binary factor Y that becomes the outcome in the logistic regression model. With j_2 to j_q, 1 ≤ q ≤ P − 1, denote the number of levels of the q − 1 factors that are present in the log-linear model but disappear from the logistic regression model, as they do not interact with Y. Then, n_ll = 2 × j_2 × ··· × j_q × n_lt. When q = 1, all factors other than Y remain in the logistic regression model as covariates. When q = P − 1, the corresponding logistic regression model only contains the intercept. For instance, for a 2^P contingency table, n_ll = 2^q × n_lt, and for q = 1, n_ll = 2 × n_lt. Furthermore, X*_lt is an n_ll/2 × n_β matrix. By rearranging the rows of X_rll when necessary, we can write X*_lt as X*_lt = (X_lt X_lt ... X_lt)', where X_lt is repeated (j_1 − 1) × j_2 × ··· × j_q times. For example, for q = 1, X*_lt = X_lt. For q = 2, X_lt repeats j_2 times within X*_lt. We can now write β = T_r λ_r.

For example, assume the log-linear model (M1) describes a 3 × 2 × 2 contingency table. Then, q = 1 and, under corner point constraints, the standard arrangement of the elements of λ is λ = (λ, λ^X_2, λ^X_3, λ^Y_1, λ^Z_1, λ^{XY}_{21}, λ^{XY}_{31}, λ^{XZ}_{21}, λ^{XZ}_{31}, λ^{YZ}_{11})'. The rearranged vector λ_r places the n_β = 4 parameters with a Y in their superscript first, λ_r = (λ^Y_1, λ^{XY}_{21}, λ^{XY}_{31}, λ^{YZ}_{11}, λ, λ^X_2, λ^X_3, λ^Z_1, λ^{XZ}_{21}, λ^{XZ}_{31})', and X_rll takes the form (2) with X*_lt = X_lt. For another example, where q = 2, consider again model (M1) but now assume that the interaction YZ is not present in the log-linear model. Then, the Z factor disappears from the corresponding logistic regression model and, after rearranging, X*_lt = (X_lt X_lt)', i.e. X_lt is repeated j_2 = 2 times within X*_lt.

Under the rearrangement, the g-prior λ ∼ N((log(n̄), 0, ..., 0)', gΣ_λ) becomes λ_r ∼ N(m_λr, gΣ_λr), where log(n̄) is the (n_β + 1)th element in the mean vector m_λr and Σ_λr = n̄^{-1}(X_rll' X_rll)^{-1}. Then, the implied prior for β = T_r λ_r is Normal with mean T_r m_λr = (0, ..., 0)' = m_β. Furthermore, its variance is g T_r Σ_λr T_r' = g n̄^{-1} T_r (X_rll' X_rll)^{-1} T_r', the submatrix of g n̄^{-1}(X_rll' X_rll)^{-1} formed by its first n_β rows and columns. From (2),

X_rll' X_rll = ( X*_lt' X*_lt      X*_lt' X_ll-lt
                 X_ll-lt' X*_lt    2 X_ll-lt' X_ll-lt ).

From Lutkepohl (1996, p. 147), the submatrix H that is formed by the first n_β rows and columns of (X_rll' X_rll)^{-1} is,

H = (X*_lt' X*_lt)^{-1} + (X*_lt' X*_lt)^{-1} X*_lt' X_ll-lt {X_ll-lt' (2I − P_lt) X_ll-lt}^{-1} X_ll-lt' X*_lt (X*_lt' X*_lt)^{-1},

where P_lt = X*_lt (X*_lt' X*_lt)^{-1} X*_lt' is the projection matrix onto the columns of X*_lt. It is straightforward to verify that for a projection matrix P_lt and a constant c > 1,

(cI − P_lt)^{-1} = c^{-1} I + {c(c − 1)}^{-1} P_lt.

Therefore, (2I − P_lt) = (0.5I + 0.5P_lt)^{-1}, and consequently,

{X_ll-lt' (2I − P_lt) X_ll-lt}^{-1} = X_ll-lt^{-1} (0.5I + 0.5P_lt) (X_ll-lt')^{-1},

provided that X_ll-lt^{-1} exists. X_ll-lt is a square matrix of full rank. If X_ll-lt were not of full rank, then some of its columns would be linearly dependent. In turn, some of the columns of X_ll-lt' X_ll-lt would be linearly dependent, implying the same for columns of X_rll (see Eq. 2). This is not possible, as X_rll is a design matrix of full rank. Thus, X_ll-lt^{-1} exists and,

H = (X*_lt' X*_lt)^{-1} + (X*_lt' X*_lt)^{-1} X*_lt' (0.5I + 0.5P_lt) X*_lt (X*_lt' X*_lt)^{-1} = 2 (X*_lt' X*_lt)^{-1} = 2 {(j_1 − 1) j_2 ··· j_q}^{-1} (X_lt' X_lt)^{-1}.

Therefore, the implied prior variance for β is

g n̄^{-1} H = 2g n̄^{-1} (2 n_lt / n_ll) (X_lt' X_lt)^{-1} = g (4 n_lt / N) (X_lt' X_lt)^{-1} = g (4/t̄) (X_lt' X_lt)^{-1} = g Σ_β,

which is the g-prior for the parameters of a logistic regression, as described in Sect. 2. This completes the proof.
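The projection-matrix identity used in the proof can be checked numerically; the sketch below verifies (cI − P)^{-1} = c^{-1}I + {c(c − 1)}^{-1}P for a randomly generated projection matrix, which for c = 2 gives (2I − P)^{-1} = 0.5I + 0.5P.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 3))
P = X @ np.linalg.inv(X.T @ X) @ X.T          # projection onto the columns of X

c = 2.0
lhs = np.linalg.inv(c * np.eye(8) - P)
rhs = np.eye(8) / c + P / (c * (c - 1.0))     # (cI - P)^{-1} = (1/c) I + P / {c(c-1)}
print(np.allclose(lhs, rhs))                  # True; with c = 2 this is 0.5 I + 0.5 P
```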
Placing a flat prior on the intercept Assume that a flat prior is placed on the intercept of the log-linear model, after the design matrix has been centred to induce orthogonality between the intercept and the factors that form the contingency table. This does not alter the prior on the parameters of the corresponding logistic regression model. The proof follows along the lines of the proof of Theorem 1, if we express the parameters of the logistic regression model as β = T_{r,−1} λ_{r,−1}, where λ_{r,−1} denotes the vector of parameters λ_r without the intercept λ, and T_{r,−1} denotes matrix T_r without the corresponding all-zero column. The proof proceeds as above, replacing X_rll with X_{rll,−1}, where X_{rll,−1} is the former matrix without the column with all elements one. It is also required to replace X_ll-lt with X_{ll-lt,−1}, where X_{ll-lt,−1} is the former matrix without the column with all elements one.

Proof of Theorem 2
The proof utilizes quantities defined earlier in Sect. 3 and in the proof of Theorem 1. First, we will show that, asymptotically, the posterior variance of β is identical to the posterior variance of the elements of λ that correspond to β. Then, we will do the same for the posterior means. Consider a vector of cell counts n = (n_1, ..., n_{n_ll})', and the log-linear model log(μ) = X_ll λ. Then, asymptotically, from (1),

Var(λ | n) ≈ {(gΣ_λ)^{-1} + X_ll' V_log-linear X_ll}^{-1},

where λ̂ denotes the maximum likelihood estimate (MLE), at which V_log-linear is evaluated. After rearranging the rows and columns of X_ll, consider the log-linear model with linear predictor X_rll λ_r, for cell counts n_r, where n_r is n rearranged to correspond to X_rll. Now, V_1 denotes a diagonal matrix with nonzero elements exp(X*_lt(i) T_r λ̂_r), i = 1, ..., n_ll/2, and V_2 denotes a diagonal matrix with nonzero elements exp(X_ll-lt(i) λ̂_ll-lt), i = 1, ..., n_ll/2, where λ̂_ll-lt denotes the MLE for the elements of λ_r other than T_r λ_r. Now,

X_rll' V_log-linear X_rll = ( X*_lt' V_1 V_2 X*_lt      X*_lt' V_1 V_2 X_ll-lt
                              X_ll-lt' V_1 V_2 X*_lt    X_ll-lt' (V_1 V_2 + V_2) X_ll-lt ),

and, writing (gΣ_λr)^{-1} = (n̄/g) X_rll' X_rll, the posterior precision {Var(λ_r | n_r)}^{-1} has the same block structure with A_12 = (n̄/g)I + V_1 V_2 in place of V_1 V_2 and A_2 = (n̄/g)I + V_2 in place of V_2. From Lutkepohl (1996, p. 147), the submatrix H that is formed by the first n_β rows and columns of Var(λ_r | n_r) is,

H = {X*_lt' A_12 X*_lt − X*_lt' A_12 X_ll-lt [X_ll-lt' (A_12 + A_2) X_ll-lt]^{-1} X_ll-lt' A_12 X*_lt}^{-1}.

From Lutkepohl (1996, p. 29, line 6), the expression above simplifies to,

H = {X*_lt' (A_12^{-1} + A_2^{-1})^{-1} X*_lt}^{-1},

using the invertibility of X_ll-lt established in the proof of Theorem 1. Within the Bayesian framework, a large sample (N → ∞) will swamp the prior distribution, rendering it irrelevant for deriving posterior inferences (O'Hagan and Forster 2004). This can be viewed as equivalent to considering a flat non-informative prior, in our case assuming that g → ∞. For a sample size large enough to justify ignoring the contribution of the prior distribution in Var(λ|n), i.e. assuming that A_12 = V_1 V_2 and A_2 = V_2, asymptotically,

H ≈ {X_lt' V_1,reduced (I + V_1,reduced)^{-1} (Σ_k V_2,k) X_lt}^{-1}.

V_1,reduced denotes a diagonal matrix with elements exp(X_lt(i) T_r λ̂_r), i = 1, ..., n_lt. V_2,k, k = 1, ..., (j_1 − 1) × j_2 × ··· × j_q, denotes a diagonal matrix with elements exp(X_ll-lt(n_lt(k−1)+i) λ̂_ll-lt), i = 1, ..., n_lt. This expression simplifies as q becomes smaller, i.e. the fewer times X_lt is contained within X*_lt. For example, when X*_lt = X_lt, i.e. when q = 1 and all factors other than Y remain in the logistic regression, V_1,reduced = V_1.
The fitted values from the log-linear model reproduce the observed total number of trials for each covariate pattern of the logistic regression, so that {1 + V_1,reduced(i, i)} Σ_k V_2,k(i, i) ≈ t_i. Therefore, approximately,

Σ_k V_2,k ≈ t (I + V_1,reduced)^{-1},    (3)

where t is a diagonal matrix with diagonal elements the numbers of trials t_i. In matrix notation, we can now write that, asymptotically,

Var(T_r λ_r | n_r) = T_r Var(λ_r | n_r) T_r' = (I 0) Var(λ_r | n_r) (I 0)'
    = H ≈ {X_lt' t V_1,reduced (I + V_1,reduced)^{-2} X_lt}^{-1} = (X_lt' V_logistic X_lt)^{-1},

where V_logistic has diagonal elements t_i exp{X_lt(i) β̂}{1 + exp(X_lt(i) β̂)}^{-2}, i = 1, ..., n_lt. (X_lt' V_logistic X_lt)^{-1} is, asymptotically, the posterior variance of β when the logistic regression is fitted directly, and thus we have shown that the posterior variance of β is identical to the posterior variance of the elements of λ that correspond to β.
We will now show that, asymptotically, the posterior mean E(β | t, y) is the posterior mean of the elements of λ that correspond to β. For a sample large enough to justify ignoring the contribution of the prior in (1), we obtain that E(λ | n) ≈ I(λ̂)^{-1} I(λ̂) λ̂ = λ̂. Similarly, E(β | t, y) ≈ β̂. Therefore, E(T_r λ_r | n) ≈ T_r λ̂_r, and it is sufficient to show that β̂ = T_r λ̂_r. Closed-form expressions for the maximum likelihood estimators of the parameters of a generalized linear model do not exist. As a result, we will base the derivation of this result on the Iteratively Re-weighted Least Squares (IRLS) algorithm. This is the standard procedure for maximizing the likelihood when a generalized linear model is fitted. See Wood (2006) for more details. For a linear predictor X_d γ, this iterative process is based on the formula,

γ^{it+1} = {X_d' V(γ^{it}) X_d}^{-1} X_d' V(γ^{it}) ζ^{it},

where V(γ^{it}) is the diagonal matrix of GLM weights and ζ^{it} is the working response vector, both evaluated at the current estimate γ^{it}. For a log-linear model, ζ^{it} is denoted by ζ^{it}_log-linear, and its ith element, i = 1, ..., n_ll, is,

X_ll(i) λ^{it} + {n_i − exp(X_ll(i) λ^{it})} exp{−X_ll(i) λ^{it}}.

For a logistic regression model, ζ^{it} is denoted by ζ^{it}_logistic, and its ith element, i = 1, ..., n_lt, is,

X_lt(i) β^{it} + (y_i − p_i^{it}) {p_i^{it} (1 − p_i^{it})}^{-1},  where p_i^{it} = exp(X_lt(i) β^{it}) {1 + exp(X_lt(i) β^{it})}^{-1}.

For the log-linear model, the IRLS procedure is written as,

λ_r^{it+1} = {X_rll' V_log-linear(λ_r^{it}) X_rll}^{-1} X_rll' V_log-linear(λ_r^{it}) ζ^{it}_log-linear,

where V_log-linear(λ_r^{it}) is a diagonal matrix with diagonal elements exp{X_rll(i) λ_r^{it}}, i = 1, ..., n_ll. Algebraic operations similar to the ones carried out earlier show that {X_rll' V_log-linear(λ_r^{it}) X_rll}^{-1} partitions as,

( {X_lt' V_1,reduced (I + V_1,reduced)^{-1} (Σ_k V_2,k) X_lt}^{-1}    Δ_1
  Δ_1'                                                                Δ_2 ),

where Δ_1 and Δ_2 are matrices not relevant to this proof, and V_1,reduced and the V_2,k are now evaluated at λ_r^{it}. Furthermore, X_rll' V_log-linear(λ_r^{it}) partitions as,

( X*_lt' V_1 V_2      0
  X_ll-lt' V_1 V_2    X_ll-lt' V_2 ).

For the log-linear model, we write ζ_log-linear = (ζ*_lt', ζ_ll-lt')', where ζ*_lt corresponds to the first n_ll/2 rows of X_rll. Now, the first n_β elements of {X_rll' V_log-linear(λ_r^{it}) X_rll}^{-1} X_rll' V_log-linear(λ_r^{it}) ζ_log-linear, i.e. the ones that correspond to the logistic regression parameters, are given by combining the partitioned matrices above. Using the Poisson approximation to the Binomial distribution for the elements of ζ*_lt (assuming, without any loss of generality, that i < n_lt when indexing these elements), the updating step for T_r λ_r can be written in terms of X_lt, the numbers of trials t_i, the V_2,k matrices, and the current value T_r λ_r^{it}. If the logistic regression was to be fitted directly, obtaining the MLE would be based on the IRLS algorithm,

β^{it+1} = {X_lt' V_logistic(β^{it}) X_lt}^{-1} X_lt' V_logistic(β^{it}) ζ^{it}_logistic.

By replacing the sum of the elements of the V_2,k matrices with the approximate values given in (3), we observe that, asymptotically, the updating step is the same for both T_r λ_r and β. Thus, if the starting point for T_r λ_r is the same as the starting point for β, the iterative algorithm gives the same MLE for the logistic regression parameters and the corresponding log-linear model parameters. The IRLS algorithm is robust to different starting values when the likelihood is not flat. Therefore, asymptotically, β̂ ≈ T_r λ̂_r, and the proof is complete.
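For completeness, a minimal sketch of the IRLS iteration referenced above, for the Poisson log-linear case; the logistic version replaces the weights with t_i p_i(1 − p_i) and the working response with X_lt(i)β + (y_i − p_i)/{p_i(1 − p_i)}.

```python
import numpy as np

def irls_poisson(X, n, max_iter=50, tol=1e-10):
    """Iteratively re-weighted least squares for a Poisson log-linear model,
    following the updating formula described in the proof above."""
    lam = np.zeros(X.shape[1])
    lam[0] = np.log(n.mean())                        # start near log(nbar)
    for _ in range(max_iter):
        mu = np.exp(X @ lam)                         # fitted means
        W = mu                                       # GLM weights for the log link
        z = X @ lam + (n - mu) / mu                  # working response
        lam_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
        if np.max(np.abs(lam_new - lam)) < tol:
            return lam_new
        lam = lam_new
    return lam

# Example use with a log-linear design matrix X_ll and cell counts `counts`:
# lam_hat = irls_poisson(X_ll, counts)
```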