Abstract
Consider a set of categorical variables where at least one of them is binary. The log-linear model that describes the counts in the resulting contingency table implies a specific logistic regression model, with the binary variable as the outcome. Within the Bayesian framework, the g-prior and mixtures of g-priors are commonly assigned to the parameters of a generalized linear model. We prove that assigning a g-prior (or a mixture of g-priors) to the parameters of a certain log-linear model designates a g-prior (or a mixture of g-priors) on the parameters of the corresponding logistic regression. By deriving an asymptotic result, and with numerical illustrations, we demonstrate that when a g-prior is adopted, this correspondence extends to the posterior distribution of the model parameters. Thus, it is valid to translate inferences from fitting a log-linear model to inferences within the logistic regression framework, with regard to the presence of main effects and interaction terms.
1 Introduction
Consider observations \({\varvec{v}}=\{v_1,\ldots ,v_n\}\), parameters \(\varvec{\theta }=\{\theta _1,\ldots ,\theta _n\}\), and known quantities or nuisance parameters \(\varvec{\phi }=\{\phi _1,\ldots ,\phi _n \}\). Following standard notation, \(v_i, i=1,\ldots ,n\), follows a distribution that is a member of the exponential family when its probability function can be written as,
\[f(v_i|\theta _i,\phi _i)=\text{ exp }\left\{ \frac{w_i\,(v_i\theta _i-b(\theta _i))}{\phi _i}+c(v_i,\phi _i)\right\} ,\]
where \(\varvec{w}=\{w_1,\ldots ,w_n \}\) are known weights, and \(\phi _i\) is described as the dispersion or scale parameter. With regard to first- and second-order moments, \(\mu _i\equiv E(v_i)=b^{'}(\theta _i)\) and \(\text{ Var }(v_i)=\frac{\phi _i}{w_i} b^{''}(\theta _i)\). The variance function is defined as \(V(\mu _i) = b^{''}(\theta _i)\). A generalized linear model relates \(\varvec{\mu }=\{\mu _1,\ldots ,\mu _n \}\) to covariates by setting \(\zeta (\varvec{\mu })=X_d \varvec{\gamma }\), where \(\zeta \) denotes the link function, \(X_d\) the covariate design matrix, and \(\varvec{\gamma }\) a vector of parameters. For a single \(\mu _i\), we write \(\zeta (\mu _i)=X_{d(i)} \varvec{\gamma }\), where \(X_{d(i)}\) denotes the ith row of \(X_d\). So, \(\zeta \) is defined as a vector function \(\zeta \equiv \{\zeta _1,\ldots ,\zeta _n \}\) with n elements.
Denote with \(\mathcal {P}\) a finite set of P categorical variables. Observations from \(\mathcal {P}\) can be arranged as counts in a P-way contingency table. Denote the cell counts as \(n_i, i=1,\ldots ,n_{ll}\). We use the ‘ll’ indicator to allude to the log-linear model that will describe these counts. A Poisson distribution is assumed for the counts so that \(E(n_i)=\mu _i\). A Poisson log-linear interaction model \(\text{ log }(\varvec{\mu }) = X_{ll} \varvec{\lambda }\) is a generalized linear model that relates the expected counts to \(\mathcal {P}\). Assuming that one of the categorical variables, denoted with Y, is binary, a logistic regression can also be fitted with Y as the outcome, and all or some of the remaining \(P-1\) variables as covariates. We write, \(\text{ logit }(\varvec{p}) = X_{lt} \varvec{\beta }, \varvec{p}=(p_1,\ldots ,p_{n_{lt}})\), using the ‘lt’ indicator for the logistic model. Here, \(p_i\) denotes the conditional probability that \(Y=1\) given covariates \(X_{lt(i)}\), and \(\varvec{\beta }\) is a vector of parameters.
Within the Bayesian framework, a prior distribution \(f(\varvec{\gamma })\) is assigned to the parameters of the log-linear or logistic regression model. This can be an informative prior that incorporates prior information on the magnitude of the effect of the different covariates or interactions. Eliciting such a prior distribution is not straightforward, especially for the coefficients of interaction terms (Consonni and Veronese 2008). Typically, lack of information for the parameters of a generalized linear model leads to a relatively flat but proper prior distribution, so that model determination based on Bayes factors is valid (O’Hagan 1995). A popular choice among Bayesian statisticians is the g-prior or a mixture of g-priors, described in detail in Sect. 2. These are flexible priors designed to carry very little information so that inferences are driven by the observed data. See, for example, Wang and George (2007), Sabanès Bovè and Held (2011), Overstall and King (2014a, b) and Mukhopadhyay and Samantha (2016). This type of prior was first proposed by Zellner (1986) for general linear models. In this context, it is known as Zellner’s g-prior. Other priors have been proposed, especially for analyses where the focus is on model comparison and variable selection. Examples include the Jeffreys prior (Liang et al. 2008), the generalized hyper-g prior (Sabanès Bovè and Held 2011), and the expected-posterior priors and power-expected-posterior priors (Fouskakis et al. 2015). Our manuscript concerns the g-prior and mixtures of g-priors. After data are collected, the prior \(f(\varvec{\gamma })\) is updated to the posterior distribution \(f(\varvec{\gamma }|\text{ Data })\) via the conditional probability formula and Bayes Theorem, so that,
\[f(\varvec{\gamma }|\text{ Data })=\frac{f(\text{ Data }|\varvec{\gamma })\,f(\varvec{\gamma })}{f(\text{ Data })}\propto f(\text{ Data }|\varvec{\gamma })\,f(\varvec{\gamma }).\]
For the prior distributions discussed above, closed-form expressions for the posterior distribution \(f(\varvec{\gamma }|\text{ Data })\) do not exist. The posterior is typically calculated using Markov chain Monte Carlo stochastic simulation, or Normal approximations (O’Hagan and Forster 2004).
It is known (Agresti 2002) that when \(\mathcal {P}\) contains a binary Y, a log-linear model \(\text{ log }(\varvec{\mu }) = X_{ll} \varvec{\lambda }\) implies a specific logistic regression model with parameters \(\varvec{\beta }\) defined uniquely by \(\varvec{\lambda }\). The logistic regression model for the conditional odds ratios for Y implies an equivalent log-linear model with arbitrary interaction terms between the covariates in the logistic regression, plus arbitrary main effects for these covariates. We provide a simple example to illustrate this result and clarify additional notation. Assume three categorical variables X, Y, and Z, with Y binary. Let i, j, k be integer indices that describe the level of X, Y, and Z, respectively. For instance, as Y is binary, \(j=0,1\). Consider the log-linear model,
\[\text{ log }(\mu _{ijk})=\lambda + \lambda _{i}^{X} + \lambda _{j}^{Y} + \lambda _{k}^{Z} + \lambda _{ij}^{XY} + \lambda _{ik}^{XZ} + \lambda _{jk}^{YZ}, \qquad \text{(M1)}\]
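where the superscript denotes the main effect or interaction term. The corresponding logistic regression model for the conditional odds ratios for Y is derived as follows,
\[\text{ log }\left( \frac{P(Y=1|X=i,Z=k)}{P(Y=0|X=i,Z=k)}\right) =\text{ log }(\mu _{i1k})-\text{ log }(\mu _{i0k})=(\lambda _{1}^{Y}-\lambda _{0}^{Y})+(\lambda _{i1}^{XY}-\lambda _{i0}^{XY})+(\lambda _{1k}^{YZ}-\lambda _{0k}^{YZ}).\]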
This is a logistic regression with parameters, \(\varvec{\beta }=(\beta ,\beta _{i}^{X},\beta _{k}^{Z})\), so that, \(\beta =\lambda _{1}^{Y} - \lambda _{0}^{Y}, \beta _{i}^{X}=\lambda _{i1}^{XY} - \lambda _{i0}^{XY}\), and \(\beta _{k}^{Z}=\lambda _{1k}^{YZ} - \lambda _{0k}^{YZ}\). Considering identifiability corner point constraints, all elements in \(\varvec{\lambda }\) with a zero subscript are set to zero. Then, \(\beta =\lambda _{1}^{Y}, \beta _{i}^{X}=\lambda _{i1}^{XY} \) and \(\beta _{k}^{Z}=\lambda _{1k}^{YZ}\). This scales in a straightforward manner to larger log-linear models. For instance, if (M1) contained the three-way interaction XYZ, then the corresponding logistic regression model would contain the XZ interaction, so that, \(\beta _{ik}^{XZ}=\lambda _{i1k}^{XYZ} - \lambda _{i0k}^{XYZ}\), and under corner point constraints, \(\beta _{ik}^{XZ}= \lambda _{i1k}^{XYZ}\). If a factor does not interact with Y in the log-linear model, then this factor disappears from the corresponding logistic regression model. To demonstrate that the correspondence between log-linear and logistic models is not bijective, it is straightforward to show that, for example, the log-linear model, \(\text{ log }(\mu _{ijk})=\lambda + \lambda _{i}^{X} + \lambda _{j}^{Y} + \lambda _{k}^{Z} + \lambda _{ij}^{XY} + \lambda _{jk}^{YZ}\), implies the same logistic regression as (M1). More generally, the relation between \(\varvec{\beta }\) and \(\varvec{\lambda }\) can be described as \(\varvec{\beta }=\varvec{T}\varvec{\lambda }\), where \(\varvec{T}\) is an incidence matrix (Bapat 2011). In the context of this manuscript, matrix \(\varvec{T}\) has one row for each element of \(\varvec{\beta }\), and one column for each element of \(\varvec{\lambda }\). The elements of \(\varvec{T}\) are zero, except in the case where the element of \(\varvec{\beta }\) is defined by the corresponding element of \(\varvec{\lambda }\). 
The number of rows of \(\varvec{T}\) cannot be greater than the number of columns. To simplify the analysis and notation, for the remainder of this manuscript we consider models specified under corner point constraints. Then, every logistic regression model parameter is defined uniquely by the corresponding log-linear model parameter, and the correspondence from a log-linear to a logistic regression model is direct.
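The relation \(\varvec{\beta }=\varvec{T}\varvec{\lambda }\) can be illustrated with a minimal sketch for the three-variable example, assuming that X, Y and Z are all binary so that each term carries one free parameter under corner point constraints. The parameter ordering, the helper name `incidence_matrix` and the numerical values of \(\varvec{\lambda }\) are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical ordering of the corner-constrained parameters of (M1)
# when X, Y and Z are all binary (one free parameter per term).
lambda_names = ["1", "X", "Y", "Z", "XY", "XZ", "YZ"]
beta_names = ["1", "X", "Z"]


def incidence_matrix(lambda_names, beta_names, outcome="Y"):
    """Build T so that beta = T @ lambda under corner point constraints."""
    T = np.zeros((len(beta_names), len(lambda_names)))
    for r, b in enumerate(beta_names):
        # The logistic intercept pairs with the main effect of Y; every
        # other logistic term pairs with its interaction with Y.
        target = {outcome} if b == "1" else set(b) | {outcome}
        for c, lam in enumerate(lambda_names):
            if lam != "1" and set(lam) == target:
                T[r, c] = 1.0
    return T


T = incidence_matrix(lambda_names, beta_names)
lam = np.array([2.0, 0.3, -0.5, 0.1, 0.8, 0.4, -0.2])  # made-up values
beta = T @ lam  # picks out the Y, XY and YZ elements of lambda
print(T)
print(beta)
```

Each row of `T` contains a single one, and `T` has three rows and seven columns, in line with the remark that the number of rows cannot exceed the number of columns.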
The contribution of our manuscript is twofold. First, Theorem 1 states that assigning to \(\varvec{\lambda }\) the g-prior that is specific to log-linear modelling implies the g-prior specific to logistic modelling on the parameters \(\varvec{\beta }\) of the corresponding logistic regression. The log-linear model has to be the largest model that corresponds to the logistic regression, i.e. the model that contains all possible interaction terms between the categorical factors in \(\mathcal {P} {\setminus } \{ Y \}\). Second, under the reasonable assumption that an investigator who chooses a g-prior for \(\varvec{\lambda }\) would also choose a g-prior for \(\varvec{\beta }\) if they were to fit a logistic regression directly, inferences on the parameters of a log-linear model translate to inferences on the parameters of the corresponding logistic regression. Closed-form expressions for the posterior distributions do not exist. Wang and George (2007) utilize the Laplace approximation for generalized linear models, focusing on the approximation of the marginal likelihood for the purpose of variable selection. Theorem 2 shows that, asymptotically, the matching between the prior distributions of the corresponding parameters extends to the posterior distributions. It is then demonstrated by numerical illustrations that the presence or absence of interaction terms in the log-linear model can inform on the relation between the binary Y and the other variables as described by logistic regression. For example, assume that after fitting the log-linear model, the credible interval for an element of \(\varvec{\lambda }\) contains zero. When fitting the corresponding logistic regression model, the investigator will anticipate that the credible interval for the corresponding element of \(\varvec{\beta }\) will also contain zero. 
Importantly, for this translation to hold, it is essential that the prior distribution for \(\varvec{\beta }\) implied by the prior on \(\varvec{\lambda }\) is the same as the distribution the investigator would assign to \(\varvec{\beta }\) if they were to fit the logistic model directly. If the implied prior on \(\varvec{\beta }\) is not the same as a directly assigned prior then, with regard to \(\varvec{\beta }\), the correspondence from the Bayesian log-linear analysis to the logistic one becomes dubious. In both illustrations in Sect. 4, we observe that the credible intervals of the corresponding \(\varvec{\lambda }\) and \(\varvec{\beta }\) parameters are virtually identical considering simulation error.
In Sect. 2, we provide the definition of the g-prior and mixtures of g-priors and describe how the g-prior is derived for log-linear and logistic regression models. Section 3 contains the main contributions in this manuscript. In Sect. 4, the correspondence from a log-linear to a logistic regression model is illustrated using simulated and real data. We conclude with a discussion.
2 The g-prior and mixtures of g-priors
A g-prior for the parameters \(\varvec{\gamma }\) of a generalized linear model is a multivariate Normal distribution \(N(\varvec{m}_{\gamma },g \varSigma _{\gamma })\), constructed so that the prior variance is a multiple of the inverse Fisher information matrix by a scalar g. See Liang et al. (2008) for a discussion on the choice of g. In accordance with Ntzoufras et al. (2003) and Ntzoufras (2009), the g-prior for the parameters of log-linear and logistic regression models is specified so that, \(\varvec{m}_{\gamma }=(m_{\gamma _1},0,\ldots ,0)^{\top }\), where \(m_{\gamma _{1}}\) corresponds to the intercept and can be nonzero, and,
\[\varSigma _{\gamma }=V(m^{*})\,\zeta ^{'}(m^{*})^{2}\left( X_d^{\top }\,\text{ diag }(1/\phi _i)\,X_d\right) ^{-1},\]
where \(\text{ diag }(1/\phi _i)\) denotes a diagonal \(n\times n\) matrix with nonzero elements \(1/\phi _i\), and \(m^{*}=\zeta ^{-1}(m_{\gamma _1})\).
The unit information prior is a special case of the g-prior, obtained by setting \(g=N\), where N denotes the total number of observations. It is constructed so that the information contained in the prior is equal to the amount of information in a single observation (Kass and Wasserman 1995). Assuming that g is a random variable, with prior f(g), leads to a mixture of g-priors, so that,
\[\varvec{\gamma }|g\sim N(\varvec{m}_{\gamma },g \varSigma _{\gamma }),\qquad g\sim f(g).\]
Mixtures of g-priors are also called hyper-g priors (Sabanès Bovè and Held 2011).
Log-linear regression Consider counts \(n_i, i=1,\ldots ,n_{ll}\). Now, \(N=\sum \nolimits _{i=1}^{n_{ll}} n_i\), and,
\[f(n_i|\mu _i)=\text{ exp }\left\{ n_i\,\text{ log }(\mu _i)-\mu _i-\text{ log }(n_i !)\right\} ,\]
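with \(\theta _i = \text{ log }(\mu _i), b(\theta _i)=\mathrm{e}^{\theta _i}\) and \(c(n_i,\phi _i)=-\text{ log } (n_i !)\). Also, \(w_i \phi _{i}^{-1}=1\), so that \(w_i=1\) implies \(\phi _i=1\). Note that,
\[E(n_i)=b^{'}(\theta _i)=\mu _i,\qquad \text{ Var }(n_i)=b^{''}(\theta _i)=\mu _i,\qquad V(\mu _i)=\mu _i.\]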
For the log-linear model, \(\text{ log }(\varvec{\mu }) = X_{ll} \varvec{\lambda }\), and \(\zeta (\mu _i) = \text{ log }(\mu _i)\) so that \(\zeta ^{'}(\mu _i)=\mu _{i}^{-1}\). The g-prior is constructed as \(N(\varvec{m}_{\lambda },g \varSigma _{\lambda })\), where \(\varvec{m}_{\lambda }=(\text{ log }(\bar{n}),0,\ldots ,0)\). Here, \(\bar{n}\) denotes the average cell count. The prior mean for the log-linear model intercept is also often set to zero (Dellaportas et al. 2012). (Note that altering the prior mean for the log-linear model intercept does not affect the validity of the theoretical results in Sect. 3. This is straightforward to deduce from the proof of Theorem 1 given in ‘Appendix’, as the prior mean for the log-linear intercept does not affect the implied distribution of the logistic regression parameters.) In addition, as \(m^{*}=\bar{n}\), \(V(\bar{n})\zeta ^{'}(\bar{n})^{2}=\bar{n}^{-1}\) and \(\phi _i=1\),
\[\varSigma _{\lambda }=\frac{1}{\bar{n}}\left( X_{ll}^{\top }X_{ll}\right) ^{-1}=\frac{n_{ll}}{N}\left( X_{ll}^{\top }X_{ll}\right) ^{-1}.\]
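Logistic regression Assume that \(y_i, i=1,\ldots ,n_{lt}\), is the proportion of successes out of \(t_i\) trials. Now, \(N=\sum \nolimits _{i=1}^{n_{lt}} t_i\), and,
\[f(y_i|p_i)=\text{ exp }\left\{ t_i\left[ y_i\theta _i-\text{ log }(1+\mathrm{e}^{\theta _i})\right] +\text{ log } {t_i \atopwithdelims ()t_i y_i}\right\} ,\]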
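where \(\theta _i=\text{ logit }(p_i), b(\theta _i)=\text{ log }(1+\mathrm{e}^{\theta _i})\), and \(c(y_i,\phi _i)=\text{ log } {t_i \atopwithdelims ()t_i y_i}\). Also, \(w_i \phi _{i}^{-1}=t_i\), so that \(w_i=1\) implies \(\phi _i=t_{i}^{-1}\). Note that,
\[E(y_i)=b^{'}(\theta _i)=\frac{\mathrm{e}^{\theta _i}}{1+\mathrm{e}^{\theta _i}}=p_i,\]
and,
\[\text{ Var }(y_i)=\frac{\phi _i}{w_i}\,b^{''}(\theta _i)=\frac{p_i(1-p_i)}{t_i},\qquad V(p_i)=p_i(1-p_i).\]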
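The logistic regression model is defined as \(\text{ logit }(\varvec{p}) = X_{lt} \varvec{\beta }\), so that \(X_{lt}\) is a \(n_{lt} \times n_{\beta }\) design matrix, and \(\zeta (p_i) = \text{ logit }(p_i)\) so that \(\zeta ^{'}(p_i)=[p_i(1-p_i)]^{-1}\). The g-prior is \(N(\varvec{m}_{\beta },g \varSigma _{\beta })\), where \(\varvec{m}_{\beta }=(0,0,\ldots ,0)\), and,
\[\varSigma _{\beta }=V(p^{*})\,\zeta ^{'}(p^{*})^{2}\left( X_{lt}^{\top }\,\text{ diag }(t_i)\,X_{lt}\right) ^{-1}=[p^{*}(1-p^{*})]^{-1}\left( X_{lt}^{\top }\,\text{ diag }(t_i)\,X_{lt}\right) ^{-1}.\]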
Here, \(p^{*}\) corresponds to \(m^{*}\) in the general definition of the g-prior at the start of this section, so that \(p^{*}=\zeta ^{-1}(m_{\gamma _{1}})\), where \(m_{\gamma _{1}}\) is the first element of \(\varvec{m}_{\beta }\) which is zero. Thus, we obtain that \(p^{*}=\mathrm{e}^{0}/(\mathrm{e}^{0}+1)=0.5\). By approximating each \(t_i\) with the average number of trials \(\bar{t}\), as suggested by Ntzoufras et al. (2003),
\[\varSigma _{\beta }\approx \frac{4}{\bar{t}}\left( X_{lt}^{\top }X_{lt}\right) ^{-1}=\frac{4\,n_{lt}}{N}\left( X_{lt}^{\top }X_{lt}\right) ^{-1}.\]
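The construction of the two prior covariance matrices can be sketched numerically. The sketch assumes that the general covariance has the form \(V(m^{*})\zeta ^{'}(m^{*})^{2}(X_d^{\top }\text{ diag }(1/\phi _i)X_d)^{-1}\), i.e. the inverse Fisher information evaluated at a constant mean \(m^{*}\); this form is an assumption of the sketch. The function name `g_prior_cov`, the cell counts and the number of trials are hypothetical.

```python
import numpy as np


def g_prior_cov(X, phi, var_fn, dlink, m_star):
    """Assumed form: V(m*) * zeta'(m*)^2 * (X^T diag(1/phi) X)^{-1}."""
    XtWX = X.T @ (X / phi[:, None])  # X^T diag(1/phi) X
    return var_fn(m_star) * dlink(m_star) ** 2 * np.linalg.inv(XtWX)


# Log-linear case: 2 x 2 table under corner point constraints,
# columns (intercept, X, Y, XY); the counts are made up.
X_ll = np.array([[1, 0, 0, 0],
                 [1, 0, 1, 0],
                 [1, 1, 0, 0],
                 [1, 1, 1, 1]], dtype=float)
counts = np.array([12.0, 30.0, 25.0, 33.0])
n_bar = counts.mean()
Sigma_lambda = g_prior_cov(X_ll, np.ones(4),
                           lambda m: m,          # Poisson: V(mu) = mu
                           lambda m: 1.0 / m,    # log link: zeta'(mu) = 1/mu
                           n_bar)

# Logistic case with balanced trials t_i = t_bar, so phi_i = 1 / t_bar.
X_lt = np.array([[1, 0], [1, 1]], dtype=float)
t_bar = 50.0
Sigma_beta = g_prior_cov(X_lt, np.full(2, 1.0 / t_bar),
                         lambda p: p * (1 - p),
                         lambda p: 1.0 / (p * (1 - p)),
                         0.5)                    # p* = 0.5
```

Under these assumptions the first matrix reduces to \((n_{ll}/N)(X_{ll}^{\top }X_{ll})^{-1}\) and the second to \((4/\bar{t})(X_{lt}^{\top }X_{lt})^{-1}\).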
3 Correspondence from log-linear to logistic regression models
Consider a set of categorical variables \(\mathcal {P}\) that includes a binary variable Y. Assume a log-linear model that, in addition to the terms that involve Y, contains all possible interaction terms between the categorical factors in \(\mathcal {P} {\setminus } \{ Y \}\). We show that, given that a g-prior is assigned to the log-linear model parameters \(\varvec{\lambda }\), the implied prior for \(\varvec{\beta }\) is a g-prior for logistic regression models, i.e. the one that would be assigned if the investigator considered the logistic regression model directly.
Theorem 1
Assume a g-prior \(\varvec{\lambda }\sim N(\varvec{m}_{\lambda }, g \varSigma _{\lambda })\) on the parameters of a log-linear model \(\text{ log }(\varvec{\mu }) = X_{ll} \varvec{\lambda }\), that contains all possible interaction terms between the categorical factors in \(\mathcal {P} {\setminus } \{ Y \}\). This prior implies a g-prior \(N(\varvec{m}_{\beta }, g \varSigma _{\beta })\) for the parameters \(\varvec{\beta }\) of the corresponding logistic regression \(\text{ logit }(\varvec{p}) = X_{lt} \varvec{\beta }\).
Proof
The proof is based on rearranging the rows and columns of \(X_{ll}\), and partitioning so that one part of \(X_{ll}\) consists of the logistic design matrix \(X_{lt}\), or replications of \(X_{lt}\). We then show that the prior mean and variance of the elements of \(\varvec{\lambda }\) that correspond to \(\varvec{\beta }\) are the prior that would be assigned to \(\varvec{\beta }\) if the logistic regression was fitted directly. The complete proof is given in ‘Appendix’.\(\square \)
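The statement of Theorem 1 can be checked numerically on the three-variable example of Sect. 1 with binary X, Y and Z. The sketch below is not the proof; it assumes the covariance forms \(\varSigma _{\lambda }=(n_{ll}/N)(X_{ll}^{\top }X_{ll})^{-1}\) and \(\varSigma _{\beta }=(4n_{lt}/N)(X_{lt}^{\top }X_{lt})^{-1}\), and the total sample size is a hypothetical value.

```python
import numpy as np
from itertools import product

# Cells of a 2 x 2 x 2 table; the columns hold the levels of X, Y, Z.
cells = np.array(list(product([0, 1], repeat=3)), dtype=float)
x, y, z = cells.T
one = np.ones(len(cells))

# Log-linear design: saturated in (X, Z), plus the terms Y, XY and YZ.
X_ll = np.column_stack([one, x, z, x * z, y, x * y, y * z])

# Logistic design: one row per (X, Z) covariate pattern.
keep = y == 1
X_lt = np.column_stack([one[keep], x[keep], z[keep]])

N = 400.0  # hypothetical total sample size
n_ll, n_lt = X_ll.shape[0], X_lt.shape[0]
Sigma_lambda = (n_ll / N) * np.linalg.inv(X_ll.T @ X_ll)
Sigma_beta = (4 * n_lt / N) * np.linalg.inv(X_lt.T @ X_lt)

# Elements of lambda that define beta: Y, XY and YZ (columns 4, 5, 6).
idx = [4, 5, 6]
implied = Sigma_lambda[np.ix_(idx, idx)]
matches = np.allclose(implied, Sigma_beta)
print(matches)
```

The agreement relies on the log-linear design being saturated in the covariates: the rows with \(Y=1\) replicate the logistic design matrix, which is the partitioning used in the proof.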
Corollary 1
A unit information prior \(\varvec{\lambda }\sim N(\varvec{m}_{\lambda }, N \varSigma _{\lambda })\) implies a unit information prior \(N(\varvec{m}_{\beta }, N \varSigma _{\beta })\) for the parameters \(\varvec{\beta }\) of the corresponding logistic regression.
Corollary 1 follows directly from Theorem 1 by setting \(g=N\). The following Corollary concerns mixtures of g-priors. It is implicitly assumed that the investigator would adopt the same prior density f(g) for both modelling approaches.
Corollary 2
A mixture of g-priors so that \(\varvec{\lambda }| g \sim N(\varvec{m}_{\lambda }, g \varSigma _{\lambda }), g\sim f(g)\), implies a mixture of g-priors for the parameters \(\varvec{\beta }\) of the corresponding logistic regression, so that \(\varvec{\beta }| g \sim N(\varvec{m}_{\beta }, g \varSigma _{\beta }), g\sim f(g)\).
This also follows from Theorem 1, which states that when \(\varvec{\lambda }| g \sim N(\varvec{m}_{\lambda }, g \varSigma _{\lambda })\), the conditional prior for \(\varvec{\beta }\) is \(\varvec{\beta }| g \sim N(\varvec{m}_{\beta }, g \varSigma _{\beta })\).
When the g-prior is utilized, it is common to assign a locally uniform Jeffreys prior (\(\propto 1\)) on the intercept, after the covariate columns of the design matrix have been centred to ensure orthogonality with the intercept (Liang et al. 2008). If one decides to adopt the approach where a flat prior is assigned to the intercept in both log-linear and logistic formulations, the correspondence between log-linear and logistic regression breaks, but only with regard to the intercept of the logistic regression. The prior on the log-linear intercept does not have a bearing on the implied prior for the logistic regression parameters, because the log-linear intercept does not contribute to the formation of the logistic regression parameters, as described in Sect. 1. After assigning a flat prior on the intercept of the log-linear model, all \(\varvec{\beta }\) parameters (including the intercept) are still Normal as linear combinations of Normal random variables, and the distribution of \(\varvec{\beta }\) is the one given by Theorem 1. For details, see the additional material in the proof of Theorem 1 in ‘Appendix’. For an illustration, see Table 3 in Sect. 4.2.
Closed-form expressions for the posterior distribution of the parameters of a generalized linear model do not exist. However, it is known (O’Hagan and Forster 2004) that a Normal approximation applies. Consider a g-prior for the parameters \(\varvec{\gamma }\) of the generalized linear model, \(\zeta (\varvec{\mu })=X_d \varvec{\gamma }\), so that, for fixed g,
\[\varvec{\gamma }\sim N(\varvec{m}_{\gamma },g \varSigma _{\gamma }).\]
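Given observations \({\varvec{v}}=\{ v_1,\ldots ,v_n \}\), the posterior distribution of \(\varvec{\gamma }\) is approximated by a Normal density, so that,
\[f(\varvec{\gamma }|{\varvec{v}})\approx N\left( \left[ (g\varSigma _{\gamma })^{-1}+\mathcal{I}(\hat{\varvec{\gamma }})\right] ^{-1}\left[ (g\varSigma _{\gamma })^{-1}\varvec{m}_{\gamma }+\mathcal{I}(\hat{\varvec{\gamma }})\hat{\varvec{\gamma }}\right] ,\ \left[ (g\varSigma _{\gamma })^{-1}+\mathcal{I}(\hat{\varvec{\gamma }})\right] ^{-1}\right) . \qquad (1)\]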
Here, \(\hat{\varvec{\gamma }}\) is the maximum likelihood estimate of \({\varvec{\gamma }}\), and \(\mathcal{I}(\hat{\varvec{\gamma }})\) is the information matrix \(X_d^{\top } \mathcal{V} X_d\). For the log-linear model, the diagonal matrix \(\mathcal{V}\) (denoted by \(\mathcal{V}_{\text {log-linear}}\)) has diagonal elements \(\text{ exp }\{ X_{ll(i)} \hat{\varvec{\lambda }} \}, i=1,\ldots ,n_{ll}\). When the logistic regression is fitted, \(\mathcal{V}_{\mathrm{logistic}}\) has diagonal elements \(t_i \text{ exp }\{ X_{lt(i)} \hat{\varvec{\beta }} \} [1 + \text{ exp }\{ X_{lt(i)} \hat{\varvec{\beta }} \}]^{-2}, i=1,\ldots ,n_{lt}\), that is, the Binomial variance \(t_i \hat{p}_i (1-\hat{p}_i)\). Within the Bayesian framework, when fitting a generalized linear model, a large sample \((n \rightarrow \infty )\) will swamp the prior distribution, rendering it irrelevant for deriving posterior inferences (O’Hagan and Forster 2004). In practice, this can be true even for moderate sample sizes (say, of order \(10^2\) or larger), especially when the prior is not informative, which is typically the case with g-priors.
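How the prior is swamped as it becomes less informative can be sketched as follows. The sketch assumes that the Normal approximation combines the prior and the likelihood in the usual precision-weighted way, with posterior precision \((g\varSigma _{\gamma })^{-1}+\mathcal{I}(\hat{\varvec{\gamma }})\). The function name and all numerical inputs are hypothetical.

```python
import numpy as np


def normal_approx_posterior(gamma_hat, info, m, g, Sigma):
    """Precision-weighted Normal approximation (assumed form)."""
    prior_prec = np.linalg.inv(g * Sigma)
    post_cov = np.linalg.inv(info + prior_prec)
    post_mean = post_cov @ (info @ gamma_hat + prior_prec @ m)
    return post_mean, post_cov


# Made-up maximum likelihood estimate, information matrix and prior.
gamma_hat = np.array([0.4, -1.2])
info = np.array([[30.0, 10.0], [10.0, 20.0]])
m = np.zeros(2)
Sigma = np.eye(2)

# As g grows, the posterior mean moves towards the maximum likelihood
# estimate and the posterior covariance towards the inverse information.
for g in (1.0, 1e2, 1e8):
    mean, cov = normal_approx_posterior(gamma_hat, info, m, g, Sigma)
    print(g, mean)
```

The same limiting behaviour is obtained by holding g fixed and letting the information matrix grow with the sample size.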
Theorem 2
Consider a g-prior \(\varvec{\lambda }\sim N(\varvec{m}_{\lambda }, g \varSigma _{\lambda })\) on the parameters of a log-linear model \(\text{ log }(\varvec{\mu }) = X_{ll} \varvec{\lambda }\), that contains all possible interaction terms between the categorical factors in \(\mathcal {P} {\setminus } \{ Y \}\). Consider also the analogous g-prior \(N(\varvec{m}_{\beta }, g \varSigma _{\beta })\) for the parameters \(\varvec{\beta }\) of the corresponding logistic regression \(\text{ logit }(\varvec{p}) = X_{lt} \varvec{\beta }\). For fixed g, and for a large sample, the posterior distribution of \(\varvec{\beta }\), as given in (1), is approximately equal to the posterior distribution of the elements of \(\varvec{\lambda }\) that correspond to \(\varvec{\beta }\).
Proof
A partitioning similar to the one adopted for the proof of Theorem 1 is utilized. First, we show that, asymptotically, the posterior variance of \(\varvec{\beta }\) is the posterior variance of the elements of \(\varvec{\lambda }\) that correspond to \(\varvec{\beta }\). Then, we do the same for the posterior means. The proof is based on the crucial assumption that for a large sample the contribution of the prior in deriving the posterior moments can be ignored. A standard result utilized in the proof is that, asymptotically, the Binomial distribution for a data point can be approximated by a Poisson distribution. The complete proof is given in ‘Appendix’.\(\square \)
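When the contribution of the prior is ignored, the approximate posterior is determined by the maximum likelihood estimate and the inverse information matrix. A minimal sketch of this limiting case, on a hypothetical \(2\times 2\times 2\) table, fits both models by Newton-Raphson and compares the two quantities. The counts and the function `newton_fit` are illustrative assumptions, not the analysis of Sect. 4.

```python
import numpy as np
from itertools import product


def newton_fit(X, resp, trials=None, n_iter=50):
    """Newton-Raphson MLE for a Poisson (trials=None) or Binomial GLM
    with canonical link. Returns the estimate and the inverse
    information evaluated at the (converged) estimate."""
    if trials is None:
        beta = np.linalg.lstsq(X, np.log(resp), rcond=None)[0]
    else:
        beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        if trials is None:
            mean = np.exp(eta)
            weight = mean
        else:
            p = 1.0 / (1.0 + np.exp(-eta))
            mean, weight = trials * p, trials * p * (1.0 - p)
        info = X.T @ (weight[:, None] * X)
        beta = beta + np.linalg.solve(info, X.T @ (resp - mean))
    return beta, np.linalg.inv(info)


cells = np.array(list(product([0, 1], repeat=3)), dtype=float)  # X, Y, Z
x, y, z = cells.T
one = np.ones(len(cells))
counts = np.array([35.0, 20.0, 28.0, 41.0, 52.0, 17.0, 30.0, 44.0])  # made up

# Log-linear model saturated in (X, Z), plus Y, XY and YZ.
X_ll = np.column_stack([one, x, z, x * z, y, x * y, y * z])
lam_hat, cov_ll = newton_fit(X_ll, counts)

# Logistic regression of Y on X and Z, one row per (X, Z) pattern.
keep = y == 1
X_lt = np.column_stack([one[keep], x[keep], z[keep]])
successes = counts[keep]
trials = counts[keep] + counts[~keep]
beta_hat, cov_lt = newton_fit(X_lt, successes, trials)

print(lam_hat[4:], beta_hat)
```

In this sketch the elements of \(\hat{\varvec{\lambda }}\) for Y, XY and YZ coincide with \(\hat{\varvec{\beta }}\), and the corresponding block of the inverse information matches that of the logistic regression, which is the limiting form of the correspondence stated in Theorem 2.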
In the next section, we demonstrate with numerical illustrations that, for fixed g, the correspondence between the priors extends to posterior distributions, so that the posterior distribution of the logistic regression parameters matches the one of the corresponding log-linear model parameters. This is true even for relatively moderate sample sizes N, say a few hundred, and for standard choices of g such as \(g=N\).
4 Illustrations
Unit information priors were adopted for the model parameters (\(g=N\)). The size of the burn-in sample was \(10^4\), followed by \(5\times 10^5\) iterations.
4.1 A simulation study
We simulate data from 1000 subjects, on six binary variables \(\{Y,A,B,C,D,E\}\). Probabilities that correspond to the cells of the \(2^6\) contingency table are generated in accordance with the log-linear model, \(\text{ log }(\varvec{\mu })=YAB+YCD+YE\). Adopting the notation in Agresti (2002), a single letter denotes the presence of a main effect, a two-letter term denotes the presence of the implied first-order interaction, and so on. The presence of an interaction between a set of variables implies the presence of all lower-order interactions plus main effects for that set. Cell counts are simulated according to the generated cell probabilities. Parameter values and the design matrix of the log-linear model used to generate the cell probabilities are given in Supplemental material, Section S2.
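The data-generating step can be sketched as follows, assuming corner point constraints and multinomial sampling of the 1000 subjects. The parameter values are drawn at random purely for illustration and are not the values of Section S2; the helper `hierarchical_terms` is hypothetical.

```python
import numpy as np
from itertools import combinations, product


def hierarchical_terms(generators):
    """All main effects and interactions implied by the generators."""
    terms = set()
    for gen in generators:
        for size in range(1, len(gen) + 1):
            terms.update("".join(c) for c in combinations(gen, size))
    return sorted(terms, key=lambda t: (len(t), t))


variables = "YABCDE"
cells = np.array(list(product([0, 1], repeat=len(variables))))
terms = hierarchical_terms(["YAB", "YCD", "YE"])

# Corner point constraints: the column of a term is the product of the
# 0/1 indicators of the variables it involves.
cols = [np.ones(len(cells))]
for term in terms:
    idx = [variables.index(v) for v in term]
    cols.append(cells[:, idx].prod(axis=1))
X = np.column_stack(cols)

rng = np.random.default_rng(1)
lam = rng.normal(0.0, 0.5, size=X.shape[1])  # hypothetical parameter values
probs = np.exp(X @ lam)
probs /= probs.sum()                         # cell probabilities
counts = rng.multinomial(1000, probs)        # simulated 2^6 table
```

The generators \(YAB\), \(YCD\) and \(YE\) expand to six main effects, seven first-order interactions and two second-order interactions, so that the design matrix has 16 columns including the intercept.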
We fit to the simulated data the log-linear model,
According to the discussion and results in Sects. 1 and 3, the corresponding logistic regression where Y is treated as the outcome only contains the first-order interactions AB and CD plus the main effect for E,
\[\text{ logit }(\varvec{p})=AB+CD+E. \qquad \text{(M3)}\]
In Table 1, we present credible intervals (CIs) for the parameters of (M3) and the relevant parameters of (M2). The CIs for the corresponding \(\varvec{\lambda }\) and \(\varvec{\beta }\) parameters are almost identical, considering simulation error. For example, the CI for \(\lambda _{1,1,1}^{YCD}\) is \((-2.01,-0.85)\), whilst the CI for \(\beta _{1,1}^{CD}\) is \((-2.00,-0.84)\).
In Table 2, we present minimum, maximum and quantile values for the \(t_i\) observations, for the logistic regression in Table 1. It is clear that the simulated data do not represent balanced Binomial experiments where \(t_i=\bar{t}\). The credible intervals listed in Table 1 demonstrate that the correspondence studied in this manuscript is very robust to departures from \(t_i = \bar{t}\). This is also demonstrated in the real data analysis presented in the next subsection, where the collected data do not represent balanced Binomial experiments when one of the factors is treated as the outcome. In Supplemental material, we present additional analyses on simulated data sets, including results on smaller samples, roughly one quarter the size of the data set analysed in this section. Inferences on the correspondence between the posterior distributions remain unchanged.
4.2 A real data illustration
Edwards and Havránek (1985) presented a \(2^{6}\) contingency table in which 1841 men were cross-classified by six binary risk factors \(\{A, B, C, D, E, F\}\) for coronary heart disease. The data were also analysed in Dellaportas and Forster (1999), where the top hierarchical model was, \(\text{ log }(\varvec{\mu })=AC+AD+AE+BC+CE+DE+F\), with posterior model probability 0.28. In Table 3, we present CIs for the parameters of the log-linear model,
We also present CIs for the parameters of the corresponding logistic regression model when A is treated as the outcome,
We performed this analysis twice: once under the g-priors described in Sect. 2 (\(g=N\)), as in the previous illustration, and once after adopting a g-prior with a locally flat prior for the intercept. Under the g-prior described in Sect. 2, the CIs for the corresponding \(\varvec{\lambda }\) and \(\varvec{\beta }\) parameters (including the intercept) are almost identical, considering simulation error. For instance, the CI for both the coefficient of A in the log-linear model and the intercept in the logistic regression is \((-0.59, -0.24)\). Under the flat prior for the intercepts, the correspondence breaks down with regard to the intercept in the logistic regression model. The CI for the coefficient of A in the log-linear model is \((-0.59, -0.24)\), whilst the CI for the intercept of the corresponding logistic regression model is \((-0.17, 0.02)\). At the same time, the credible intervals for the coefficients of C, D and E in the logistic regression model are almost identical to the corresponding CIs for AC, AD and AE in the log-linear model, with differences due to simulation error.
5 Discussion
The correspondence we investigated is not unexpected, given the results in Agresti (2002) discussed in the Introduction, and also the link between the g-prior and Fisher’s information matrix (Held et al. 2015), although this link is stronger for general linear models. Our investigation is also related to Consonni and Veronese (2008), who discuss specifying a prior for the parameters of one model and then transferring this specification to the parameters of another. Of the four strategies considered in Consonni and Veronese (2008), the one directly linked to our manuscript is ‘Marginalization’, as the derived prior for the parameters of the logistic regression is the marginal prior of the relevant parameters of the log-linear model. Results on the relation between different statistical models are of interest, as they improve understanding and enhance the models’ utility. Often, developments for one modelling framework are not readily available for the other. For example, Papathomas and Richardson (2016) comment on the relation between log-linear modelling and variable selection within clustering, in particular with regard to marginal independence, without examining logistic regression models.
Our numerical illustrations concern the g-prior, where the parameter g is fixed. To further explore the correspondence between the two modelling frameworks, we also considered the two hyper priors that are prominent in Liang et al. (2008). These are the Zellner–Siow prior [IG(0.5, N/2)], and the prior introduced in the aforementioned manuscript in Sect. 4.2, with the suggested specification \(\alpha =3\). Furthermore, the two data sets were analysed after adopting a mixture of g-priors such that, \(g\sim \text{ IG }(a_{g}, b_{g})\). We considered \(a_g = 2+\text{ mean }(g)^2/\text{ var }(g)\) and \(b_g = \text{ mean }(g) + \text{ mean }(g)^3/\text{ var }(g)\), in accordance with the specified prior moments \(\text{ mean }(g)\) and \(\text{ var }(g)\). We considered distinct Inverse Gamma densities with markedly different expectations and variances, as well as the vague prior IG(0.1, 0.1). We observed that the correspondence does not hold exactly when a mixture of g-priors is adopted. This seems to be because the posterior distribution for g is different under the two modelling frameworks, something that affects, to a small but noticeable degree, the posterior credible intervals for the model parameters. For more details, see the analyses presented in Supplemental material.
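The moment matching for the Inverse Gamma hyper prior can be verified with a short sketch. It assumes the shape and scale parametrization in which \(\text{ IG }(a,b)\) has mean \(b/(a-1)\) for \(a>1\) and variance \(b^2/\{(a-1)^2(a-2)\}\) for \(a>2\). The function names and the chosen prior moments are hypothetical.

```python
import math


def inverse_gamma_from_moments(mean_g, var_g):
    """Shape and scale matching the stated expressions for a_g and b_g."""
    a = 2.0 + mean_g ** 2 / var_g
    b = mean_g + mean_g ** 3 / var_g
    return a, b


def inverse_gamma_moments(a, b):
    """Mean and variance of IG(a, b), shape a > 2 and scale b (assumed)."""
    mean = b / (a - 1.0)
    var = b ** 2 / ((a - 1.0) ** 2 * (a - 2.0))
    return mean, var


# Hypothetical prior moments for g.
a_g, b_g = inverse_gamma_from_moments(mean_g=50.0, var_g=400.0)
recovered_mean, recovered_var = inverse_gamma_moments(a_g, b_g)
print(a_g, b_g, recovered_mean, recovered_var)
```

Note that the vague prior IG(0.1, 0.1) has shape below one, so it has no finite mean and cannot be obtained from this moment matching.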
Theoretical results in this manuscript refer to a specific log-linear model and the corresponding logistic regression model, for a given set of covariates. Therefore, our results should not be misinterpreted as licence to readily translate log-linear model selection inferences to inferences concerning logistic regression models. When performing model selection in a space of log-linear models, the prominent log-linear model describes a certain dependence structure between the categorical factors, including the relation of the binary Y with all other factors. The logistic regression that corresponds to the prominent log-linear model describes the dependence structure between Y and the other factors that is supported by the data in accordance with the log-linear analysis. Therefore, under reasonable expectation, results from a single log-linear model determination analysis may translate, at the very least, to interesting logistic regressions for any of the binary factors that formed the contingency table. However, the mapping between log-linear and logistic regression model spaces is not bijective. Furthermore, posterior model probabilities depend on the prior on the model space, with various different approaches for defining such a prior discussed in Dellaportas et al. (2012). For the simulated data analysed in Sect. 4.1, log-linear model \(YAB+YCD+YE\) has posterior probability 0.98, whilst the posterior probability of the corresponding logistic regression model (M3) is 0.59. Similar results from analysing the real data in Sect. 4.2, not presented here, also support this note of caution. In all model determination analyses, the Reversible Jump MCMC algorithm proposed in Papathomas et al. (2011) was employed. All possible graphical log-linear models were assumed equally likely a priori, as were all possible logistic graphical models for some given outcome.
References
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, Hoboken
Bapat RB (2011) Graphs and matrices. Springer, Hindustan Book Agency, New Delhi
Consonni G, Veronese P (2008) Compatibility of prior specifications across linear models. Stat Sci 23:332–353
Dellaportas P, Forster JJ (1999) Markov chain Monte Carlo model determination for hierarchical and graphical log-linear models. Biometrika 86:615–633
Dellaportas P, Forster JJ, Ntzoufras I (2012) Joint specification of model space and parameter space prior distributions. Stat Sci 27:232–246
Edwards D, Havránek T (1985) A fast procedure for model search in multi-dimensional contingency tables. Biometrika 72:339–351
Fouskakis D, Ntzoufras I, Draper D (2015) Power-expected-posterior priors for variable selection in Gaussian linear models. Bayesian Anal 10:75–107
Held L, Sabanés Bové D, Gravestock I (2015) Approximate Bayesian model selection with the deviance statistic. Stat Sci. http://www.imstat.org/sts/future_papers.html. Accessed 17 Mar 2016
Kass RE, Wasserman L (1995) A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J Am Stat Assoc 90:928–934
Liang F, Paulo R, Molina G, Clyde MA, Berger JO (2008) Mixtures of \(g\)-priors for Bayesian variable selection. J Am Stat Assoc 103:410–423
Lutkepohl H (1996) Handbook of matrices. Wiley, Chichester
Mukhopadhyay M, Samanta T (2016) A mixture of \(g\)-priors for variable selection when the number of regressors grows with the sample size. Test. doi:10.1007/s11749-016-0516-0
Ntzoufras I, Dellaportas P, Forster JJ (2003) Bayesian variable and link determination for generalized linear models. J Stat Plan Inference 111:165–180
Ntzoufras I (2009) Bayesian modeling using WinBUGS. Wiley, Hoboken
O’Hagan A (1995) Fractional Bayes factors for model comparison. J R Stat Soc Ser B 57:99–138
O’Hagan A, Forster JJ (2004) Bayesian inference, 2nd edn. vol 2B of ‘Kendall’s Advanced Theory of Statistics’. Arnold, London
Overstall A, King R (2014a) A default prior distribution for contingency tables with dependent factor levels. Stat Methodol 16:90–99
Overstall A, King R (2014b) Conting: an R package for Bayesian analysis of complete and incomplete contingency tables. J Stat Softw 58:1–27
Papathomas M, Richardson S (2016) Exploring dependence between categorical variables: benefits and limitations of using variable selection within Bayesian clustering in relation to log-linear modelling with interaction terms. J Stat Plan Inference 173:47–63
Papathomas M, Dellaportas P, Vasdekis VGS (2011) A novel reversible jump algorithm for generalized linear models. Biometrika 98:231–236
Rohatgi VK (1976) An introduction to probability theory and mathematical statistics. Wiley, New York
Sabanés Bové D, Held L (2011) Hyper-g priors for generalized linear models. Bayesian Anal 6:387–410
Wang X, George EI (2007) Adaptive Bayesian criteria in variable selection for generalized linear models. Stat Sinica 17:667–690
Wood SN (2006) Generalized additive models: an introduction with R. Chapman and Hall/CRC, New York
Zellner A (1986) On assessing prior distributions and Bayesian regression analysis with \(g\)-prior distributions. In: Goel PK, Zellner A (eds) Bayesian inference and decision techniques: essays in honor of Bruno de Finetti. North-Holland/Elsevier, Amsterdam, pp 233–243
Acknowledgements
The author wishes to thank Professor Petros Dellaportas and Dr. Antony Overstall for useful discussions during the preparation of this manuscript, and two reviewers and the editors for comments that helped to improve the manuscript.
Appendix
Proof of Theorem 1
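To facilitate the proof, the following notation is introduced. Using the incidence matrix \(\varvec{T}\) discussed in Sect. 1, write the mapping between \(\varvec{\beta }\) and \(\varvec{\lambda }\) as \(\varvec{\beta }=\varvec{T}\varvec{\lambda }\), where
\[ \varvec{T}=\left( \varvec{\lambda }_{(1)}, \varvec{\lambda }_{(2)}, \ldots , \varvec{\lambda }_{(n_{\lambda _{Y}})} \right) ^{\top }, \]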
and \(\varvec{\lambda }_{(k)}, k=1, \ldots , n_{\lambda _{Y}}\), is a vector of zeros with the exception of one element that is equal to one. This element is in the position of the kth \(\varvec{\lambda }\) parameter with a Y in its superscript. With \(n_{\lambda _{Y}}\) we denote the number of parameters in \(\varvec{\lambda }\) with a Y in their superscript. This is a more rigorous definition of \(\varvec{T}\) compared to the more descriptive definition in Sect. 1. To ease algebraic calculations, and without any loss of generality, rearrange the elements of \(\varvec{\lambda }\), creating a new vector \(\varvec{\lambda }_{r}\), so that \(\varvec{T}\) changes accordingly to, \( \varvec{T}_{r}=\left( \begin{array}{ll} \varvec{I}&\mathbf{0} \end{array} \right) , \) where \(\varvec{I}\) is an \(n_{\beta }\times n_{\beta }\) identity matrix and \(n_{\beta }\) is the number of elements in \(\varvec{\beta }\). The rows and columns of \(X_{ll}\) are also rearranged accordingly to create \(X_{rll}\), so that,
\[ X_{rll}=\left( \begin{array}{ll} X_{lt}^{*} &amp; X_{{ll}{\text {-}}{lt}} \\ \mathbf{0} &amp; X_{{ll}{\text {-}}{lt}} \end{array} \right) . \quad (2) \]
\(X_{{ll}{\text {-}}{lt}}\) is a square \((n_{ll}/2 \times n_{ll}/2)\) matrix. This is because we consider the log-linear model that, in addition to the terms that involve Y, contains all possible interaction terms between the categorical factors in \(\mathcal {P} {\setminus } \{ Y \}\). The number of parameters that correspond to the intercept, main effects and interactions for \(\mathcal {P} {\setminus } \{ Y \}\) is \(n_{ll}/2\).
Denote with \(j_1=2\) the number of levels of the binary factor Y that becomes the outcome in the logistic regression model. With \(j_2\) to \(j_q, 1\le q \le P-1\) denote the number of levels of the \(q-1\) factors that are present in the log-linear model but disappear from the logistic regression model as they do not interact with Y. Then, \(n_{ll}=2\times j_2 \times \cdots \times j_q \times n_{lt}\). When \(q=1\), all factors other than Y remain in the logistic regression model as covariates. When \(q=P-1\), the corresponding logistic regression model only contains the intercept. For instance, for a \(2^P\) contingency table, \(n_{ll}=2^q \times n_{lt}\), and for \(q=1, n_{ll}=2\times n_{lt}\). Furthermore, \(X_{lt}^{*}\) is a \(n_{ll}/2 \times n_{\beta }\) matrix. By rearranging the rows of \(X_{rll}\) when necessary, we can write \(X_{lt}^{*}\) as \(X_{lt}^{*}=(X_{lt}^{\top } X_{lt}^{\top } \ldots X_{lt}^{\top })^{\top }\), where \(X_{lt}^{\top }\) is repeated \( (j_1-1) \times j_2 \times \cdots \times j_q\) times. For example, for \(q=1, X_{lt}^{*}=X_{lt}\). For \(q=2, X_{lt}\) repeats \(j_2\) times within \(X_{lt}^{*}\).
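The dimension bookkeeping above can be illustrated with a minimal sketch. It assumes a hypothetical \(2\times 2\times 3\) table in which the binary factor A interacts with Y and the three-level factor B does not, so that \(q=2\) and \(j_2=3\). The function name `stack_logistic_design` and the example design matrix are illustrative choices, not part of the original derivation.

```python
from math import prod

def stack_logistic_design(x_lt, dropped_levels):
    """Return X_lt^* by stacking X_lt once per cell of the dropped factors."""
    # Number of repeats is (j_1 - 1) * j_2 * ... * j_q, with j_1 = 2.
    repeats = prod(dropped_levels)  # prod([]) == 1 covers the case q = 1
    return [row[:] for _ in range(repeats) for row in x_lt]

# Hypothetical logistic design: intercept and one binary covariate A.
x_lt = [[1, 0], [1, 1]]
dropped_levels = [3]          # levels j_2 of the factor B that does not interact with Y

x_star = stack_logistic_design(x_lt, dropped_levels)
n_lt = len(x_lt)
n_ll = 2 * prod(dropped_levels) * n_lt   # n_ll = 2 * j_2 * ... * j_q * n_lt

print(n_ll, len(x_star))      # number of cells, and number of rows of X_lt^*
```

In this sketch \(X_{lt}^{*}\) has \(n_{ll}/2\) rows, and passing an empty list of dropped levels returns \(X_{lt}\) itself, matching the case \(q=1\).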
We can now write \(\varvec{\beta }=\varvec{T}_{r} \varvec{\lambda }_{r}\). For example, assume the log-linear model (M1) describes a \(3\times 2\times 2\) contingency table. Then, \(q=1\), and the standard arrangement of the elements of \(\varvec{\lambda }\) would be such that,
After rearranging,
For another example, where \(q=2\), consider again model (M1) but now assume that the interaction YZ is not present in the log-linear model. Then, the Z factor will disappear from the corresponding logistic regression model, and after rearranging,
The g-prior,
translates to,
where \(\text{ log }(\bar{n})\) is the \((n_{\beta }+1)\)th element in the mean vector. Then,
Furthermore,
From (2),
From Lutkepohl (1996, p. 147), the submatrix H that is formed by the first \(n_{\beta }\) rows and columns of \((X_{rll}^{\top } X_{rll})^{-1}\) is,
Now, \(P_{lt}\equiv X_{lt}^{*}(X_{lt}^{* \top } X_{lt}^{*})^{-1} X_{lt}^{* \top }\) is the projection matrix for \(X_{lt}^{*}\). It is straightforward to verify that for a projection matrix \(P_{lt}\) and a constant c,
Therefore, \((2 \varvec{I}- P_{lt})=(0.5 \varvec{I}+ 0.5 P_{lt})^{-1}\), and consequently,
\(X_{{ll}{\text {-}}{lt}}\) is a square matrix of full rank. If \(X_{{ll}{\text {-}}{lt}}\) were not of full rank, then some of its columns would be linearly dependent. In turn, some of the columns of \(\left( \begin{array}{l} X_{{ll}{\text {-}}{lt}} \\ X_{{ll}{\text {-}}{lt}} \end{array} \right) \) would be linearly dependent, implying the same for columns of \(X_{rll}\) (see Eq. 2). This is not possible as \(X_{rll}\) is a design matrix of full rank. Thus, \(X_{{ll}{\text {-}}{lt}}^{-1}\) exists and,
Therefore,
Thus,
which is the g-prior for the parameters of a logistic regression, as described in Sect. 2. This completes the proof.\(\square \)
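Two algebraic facts used in the proof can be checked in exact arithmetic on a small case. The sketch below assumes a hypothetical \(2\times 2\times 2\) table for factors Y, A and B, the log-linear model \(YA+AB\) under corner-point constraints, and a rearranged design matrix with the cells for \(Y=1\) in the first \(n_{ll}/2\) rows and the columns for the Y terms first. The helper functions and the example matrices are illustrative assumptions, and the code is a numerical check rather than a substitute for the derivation.

```python
from fractions import Fraction

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse(m):
    """Exact Gauss-Jordan inverse of a nonsingular square matrix."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[pivot] = aug[pivot], aug[c]
        scale = aug[c][c]
        aug[c] = [v / scale for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

# Cells of (A, B) are ordered (0,0), (1,0), (0,1), (1,1).
x_lt = [[1, 0], [1, 1]]                    # logistic design: intercept, A
x_star = x_lt + x_lt                       # X_lt^*: X_lt repeated j_2 = 2 times
x_rest = [[1, 0, 0, 0], [1, 1, 0, 0],      # X_{ll-lt}: columns 1, A, B, AB
          [1, 0, 1, 0], [1, 1, 1, 1]]
zeros = [[0, 0] for _ in range(4)]
# Assumed block form: (X_lt^*  X_{ll-lt}) stacked over (0  X_{ll-lt}).
x_rll = [a + b for a, b in zip(x_star, x_rest)] + [z + b for z, b in zip(zeros, x_rest)]

# Leading n_beta x n_beta block H of (X_rll' X_rll)^{-1}.
gram_inv = inverse(matmul(transpose(x_rll), x_rll))
H = [row[:2] for row in gram_inv[:2]]
gram = matmul(transpose(x_star), x_star)
two_inv = [[2 * v for v in row] for row in inverse(gram)]

# Projection identity: (2I - P)(0.5I + 0.5P) = I for a projection matrix P.
P = matmul(matmul(x_star, inverse(gram)), transpose(x_star))
identity = [[int(i == j) for j in range(4)] for i in range(4)]
lhs = [[2 * identity[i][j] - P[i][j] for j in range(4)] for i in range(4)]
rhs = [[(identity[i][j] + P[i][j]) / 2 for j in range(4)] for i in range(4)]
product = matmul(lhs, rhs)

print(H == two_inv, product == identity)
```

Under the assumed block form, the leading block of \((X_{rll}^{\top } X_{rll})^{-1}\) in this example equals \(2(X_{lt}^{* \top } X_{lt}^{*})^{-1}\), which is proportional to \((X_{lt}^{\top } X_{lt})^{-1}\), the matrix that appears in a g-prior for the logistic regression parameters.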
Placing a flat prior on the intercept
Assume that a flat prior is placed on the intercept of the log-linear model, after the design matrix has been centred to induce orthogonality between the intercept and the factors that form the contingency table. This does not alter the prior on the parameters of the corresponding logistic regression model. The proof follows along the lines of the proof of Theorem 1, if we express the parameters of the logistic regression model as \(\varvec{\beta }=\varvec{T}_{r-1} \varvec{\lambda }_{r-1}\), where \(\varvec{T}_{r-1}\) denotes matrix \(\varvec{T}_{r}\) without the first column with all elements zero, and \(\varvec{\lambda }_{r-1}\) denotes the vector of parameters \(\varvec{\lambda }_{r}\) without the intercept \(\lambda \). The proof proceeds as above, replacing \(X_{rll}\) with \(X_{rll-1}\), where \(X_{rll-1}\) is the former matrix without the column with all elements one. It is also required to replace \(X_{{ll}{\text {-}}{lt}}\) with \(X_{{ll}{\text {-}}{lt}-1}\), where \(X_{{ll}{\text {-}}{lt}-1}\) is the former matrix without the column with all elements one.
Proof of Theorem 2
The proof utilizes quantities defined earlier in Sect. 3 and in the proof of Theorem 1. First, we will show that, asymptotically, the posterior variance of \(\varvec{\beta }\) is identical to the posterior variance of the elements of \(\varvec{\lambda }\) that correspond to \(\varvec{\beta }\). Then, we will do the same for the posterior means.
Consider a vector of cell counts \(\varvec{n}=\{n_1, \ldots , n_{ll} \}\), and the log-linear model \(\text{ log }(\varvec{\mu }) = X_{ll} \varvec{\lambda }\). Then, asymptotically,
where \(\hat{\varvec{\lambda }}\) denotes the maximum likelihood estimate (MLE). After rearranging the rows and columns of \(X_{ll}\), consider the log-linear model with linear predictor \(X_{rll} \varvec{\lambda }_{r}\), for cell counts \(\varvec{n}_{r}\), where \(\varvec{n}_{r}\) is \(\varvec{n}\) rearranged to correspond to \(X_{rll}\). Now,
\(\mathcal{V}_{1}\) denotes a diagonal matrix with nonzero elements \(\text{ exp }(X_{lt(i)}^{*} (\varvec{T}_{r} \hat{\varvec{\lambda }}_{r})), i=1, \ldots , n_{ll}/2\). \(\mathcal{V}_{2}\) denotes a diagonal matrix with nonzero elements \(\text{ exp }(X_{{ll}{\text {-}}{lt}(i)} \hat{\varvec{\lambda }}_{{ll}{\text {-}}{lt}}), i=1, \ldots , n_{ll}/2\), where \(\hat{\varvec{\lambda }}_{{ll}{\text {-}}{lt}}\) denotes the MLE for \(\varvec{\lambda }_{r} {\setminus } \varvec{T}_{r} \varvec{\lambda }_{r}\). Now,
where \(A_{12}=\frac{N}{g n_{ll}} \varvec{I}+ \mathcal{V}_{1} \mathcal{V}_{2} \) and \(A_2=\frac{N}{g n_{ll}} \varvec{I}+\mathcal{V}_{2}\). From Lutkepohl (1996, p. 147), the submatrix H that is formed by the first \(n_{\beta }\) rows and columns of \(\text{ Var }(\varvec{\lambda }_{r} | \varvec{n}_{r})\) is,
From Lutkepohl (1996, p. 29, line 6), the expression above simplifies to,
Within the Bayesian framework, a large sample \((N \rightarrow \infty )\) will swamp the prior distribution, rendering it irrelevant for deriving posterior inferences (O’Hagan and Forster 2004). This can be viewed as equivalent to considering a flat non-informative prior, in our case assuming that \(g \rightarrow \infty \). For a sample size large enough to justify ignoring the contribution of the prior distribution in \(\text{ Var }(\varvec{\lambda }| \varvec{n})\), i.e. assuming that \(A_{12}=\mathcal{V}_{1} \mathcal{V}_{2} \) and \(A_2=\mathcal{V}_{2}\), asymptotically,
\(\mathcal{V}_{1, \mathrm{reduced}}\) denotes a diagonal matrix with elements \(\text{ exp }(X_{lt(i)} (\varvec{T}_{r} \hat{\varvec{\lambda }}_{r})), i=1, \ldots , n_{lt}\). \(\mathcal{V}_{2, k}, k=1, \ldots , (j_1-1)\times j_2\times \cdots \times j_q\), denotes a diagonal matrix with elements \(\text{ exp }(X_{{ll}{\text {-}}{lt}(n_{lt}(k-1)+i)} \hat{\varvec{\lambda }}_{{ll}{\text {-}}{lt}})\). This expression simplifies as q becomes smaller, i.e. the fewer times \(X_{lt}\) is contained within \(X_{lt}^{*}\). For example, when \(X_{lt}^{*}=X_{lt}\), i.e. when \(q=1\) and all factors other than Y remain in the logistic regression, \(\mathcal{V}_{1,\mathrm{reduced}}=\mathcal{V}_{1}\).
We now utilize the standard result (see, for example, Rohatgi 1976, p. 200) that, asymptotically, the Binomial distribution \(\mathrm{Bin}(t_i, \frac{\text{ exp }(X_{lt(i)}^{*} (\varvec{T}_{r} \varvec{\lambda }_{r}))}{1+\text{ exp }(X_{lt(i)}^{*} (\varvec{T}_{r} \varvec{\lambda }_{r}))})\) of a data point \(t_{i} y_{i}, i=1, \ldots , n_{lt}\), can be approximated by a Poisson distribution \(\mathrm{Poisson}(t_i \frac{\text{ exp }(X_{lt(i)}^{*} (\varvec{T}_{r} \varvec{\lambda }_{r}))}{1+\text{ exp }(X_{lt(i)}^{*} (\varvec{T}_{r} \varvec{\lambda }_{r}))})\). The Binomial observation \(t_i-t_i \times y_i\) is formed by adding \((j_1-1)\times j_2\times \cdots \times j_q\) independent Poisson cell counts. Considering the Poisson log-linear model, \(t_i-t_i y_i\) follows the Poisson distribution,
Therefore, approximately,
In matrix notation, we can now write that, asymptotically,
where \(\varvec{t}\) is a diagonal matrix with diagonal elements the number of trials \(t_i\), and \(\mathcal{V}_{\mathrm{logistic}}\) has diagonal elements \(t_i \text{ exp }\{ X_{lt(i)} \hat{\varvec{\beta }} \} [1 + \text{ exp }\{ X_{lt(i)} \hat{\varvec{\beta }} \}]^{-2}, i=1, \ldots , n_{lt}\). \((X_{lt}^{\top } \mathcal{V}_{\mathrm{logistic}} X_{lt})^{-1}\) is, asymptotically, the posterior variance of \(\varvec{\beta }\) when the logistic regression is fitted directly. Thus, we have shown that the posterior variance of \(\varvec{\beta }\) is asymptotically identical to the posterior variance of the elements of \(\varvec{\lambda }\) that correspond to \(\varvec{\beta }\).
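The Poisson approximation to the Binomial distribution invoked above can be examined numerically. The following sketch computes the total variation distance between \(\mathrm{Bin}(t, p)\) and \(\mathrm{Poisson}(tp)\) for a fixed expected count \(tp=5\) and an increasing number of trials. The function names and the chosen values of t and p are illustrative assumptions, not quantities taken from the analyses in this manuscript.

```python
import math

def binom_pmf(k, t, p):
    """Binomial probability, computed on the log scale for numerical stability."""
    log_pmf = (math.lgamma(t + 1) - math.lgamma(k + 1) - math.lgamma(t - k + 1)
               + k * math.log(p) + (t - k) * math.log(1 - p))
    return math.exp(log_pmf)

def poisson_pmf(k, rate):
    """Poisson probability, computed on the log scale."""
    return math.exp(-rate + k * math.log(rate) - math.lgamma(k + 1))

def total_variation(t, p):
    """Total variation distance between Bin(t, p) and Poisson(t * p)."""
    rate = t * p
    diff = sum(abs(binom_pmf(k, t, p) - poisson_pmf(k, rate)) for k in range(t + 1))
    # Poisson mass above t, where the Binomial has no support.
    tail = max(0.0, 1.0 - sum(poisson_pmf(k, rate) for k in range(t + 1)))
    return 0.5 * (diff + tail)

# Expected count fixed at 5 (hypothetical), with p = 0.5, 0.05 and 0.005.
distances = [total_variation(t, 5.0 / t) for t in (10, 100, 1000)]
print(distances)
```

With the expected count held fixed, the distance shrinks as the success probability decreases, which is the regime in which the approximation is most accurate.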
We will now show that, asymptotically, the posterior mean \(E(\varvec{\beta }| \varvec{t}, \varvec{y})\) is the posterior mean of the elements of \(\varvec{\lambda }\) that correspond to \(\varvec{\beta }\). For a sample large enough to justify ignoring the contribution of the prior in (1), we obtain that, \(E(\varvec{\lambda }| \varvec{n})\simeq \mathcal{I}(\hat{\lambda })^{-1} \mathcal{I}(\hat{\lambda }) \hat{\lambda } = \hat{\lambda }\). Similarly, \(E(\varvec{\beta }| \varvec{t}, \varvec{y})\simeq \hat{\varvec{\beta }}\). Therefore, \(E(\varvec{T}_{r} \varvec{\lambda }_{r} | \varvec{n})\simeq \varvec{T}_{r} \hat{\varvec{\lambda }}_{r}\), and it is sufficient to show that \(\hat{\varvec{\beta }}=\varvec{T}_{r} \hat{\varvec{\lambda }}_{r}\). In general, closed-form expressions for the maximum likelihood estimators of the parameters of a generalized linear model do not exist. As a result, we will base the derivation of this result on the Iteratively Reweighted Least Squares (IRLS) algorithm. This is the standard procedure for maximizing the likelihood when a generalized linear model is fitted. See Wood (2006) for more details. For a linear predictor \(X_d \varvec{\gamma }\), this iterative process is based on the formula,
For a log-linear model, \(\varvec{\zeta }^{it}\) is denoted by \(\varvec{\zeta }^{it}_{\text {log-linear}}\), and its ith element, \(i=1, \ldots , n_{ll}\), is,
For a logistic regression model, \(\varvec{\zeta }^{it}\) is denoted by \(\varvec{\zeta }^{it}_{\mathrm{logistic}}\), and its ith element, \(i=1, \ldots , n_{lt}\), is,
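For the log-linear model, the IRLS procedure is written as,
\[ \varvec{\lambda }_{r}^{it+1}=(X_{rll}^{\top } \mathcal{V}_{\text {log-linear}}(\varvec{\lambda }^{it}_{r}) X_{rll})^{-1} X_{rll}^{\top } \mathcal{V}_{\text {log-linear}}(\varvec{\lambda }^{it}_{r}) \varvec{\zeta }^{it}_{\text {log-linear}}, \]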
where \(\mathcal{V}_{\text {log-linear}}\) is a diagonal matrix with diagonal elements \(\text{ exp }\{ X_{rll(i)} \hat{\varvec{\lambda }}_{r} \}, i=1, \ldots , n_{ll}\). Algebraic operations similar to the ones carried out earlier show that \((X_{rll}^{\top } \mathcal{V}_{\text {log-linear}}(\varvec{\lambda }^{it}) X_{rll})^{-1}\) partitions as,
where \(\Omega _1\) and \(\Omega _2\) are matrices not relevant to this proof. Furthermore, \(X_{rll}^{\top } \mathcal{V}_{\text {log-linear}}(\varvec{\lambda }^{it}_{r})\) partitions as,
For the log-linear model, we write \(\varvec{\zeta }_{\text {log-linear}}=(\varvec{\zeta }_{lt}^{*\top } \varvec{\zeta }_{{ll}{\text {-}}{lt}}^{\top })^{\top }\), where \(\varvec{\zeta }_{lt}^{*}\) corresponds to the first \(n_{ll}/2\) rows of \(X_{rll}\). Now, the first \(n_{\beta }\) elements of \((X_{rll}^{\top } \mathcal{V}_{\text {log-linear}}(\varvec{\lambda }^{it}) X_{rll})^{-1} X_{rll}^{\top } \mathcal{V}_{\text {log-linear}}(\varvec{\lambda }^{it}_{r}) \varvec{\zeta }_{\text {log-linear}}\), i.e. the ones that correspond to the logistic regression parameters, are given by,
The ith element of \(\varvec{\zeta }_{lt}^{*}, i=1, \ldots , n_{ll}/2\), is,
The ith element of \(\varvec{\zeta }_{{ll}{\text {-}}{lt}}, i=1, \ldots , n_{ll}/2\), is,
It is straightforward to show that \([\mathcal{V}_{1} \mathcal{V}_{2} \varvec{\zeta }_{lt}^{*} + \mathcal{V}_{2} \varvec{\zeta }_{{ll}{\text {-}}{lt}}]\) is, approximately, a vector of zeros. To show this, consider, without loss of generality, the ith element of this vector,
Due to the Poisson approximation to the Binomial distribution,
Thus, the elements of vector \([\mathcal{V}_{1} \mathcal{V}_{2} \varvec{\zeta }_{lt}^{*} + \mathcal{V}_{2} \varvec{\zeta }_{{ll}{\text {-}}{lt}}]\) are all approximately zero, and the first \(n_{\beta }\) elements of \((X_{rll}^{\top } \mathcal{V}_{\text {log-linear}}(\varvec{\lambda }^{it}) X_{rll})^{-1} X_{rll}^{\top } \mathcal{V}_{\text {log-linear}}(\varvec{\lambda }^{it}) \varvec{\zeta }_{\text {log-linear}}\) are approximately equal to,
Using the Poisson approximation to the Binomial distribution, for the ith element of \(\varvec{\zeta }_{lt}^{*}\), and assuming without any loss of generality that \(i<n_{lt}\),
Thus,
Therefore, the updating step for \(\varvec{T}_{r} \varvec{\lambda }_{r}\) is,
If the logistic regression were to be fitted directly, obtaining the MLE would be based on the IRLS algorithm,
By replacing the sum of the elements of the \(\mathcal{V}_{2, k}\) matrices with the approximate values given in (3), we observe that, asymptotically, the updating step is the same for both \(\varvec{T}_{r} \varvec{\lambda }_{r}\) and \(\varvec{\beta }\). Thus, if the starting point for \(\varvec{T}_{r} \varvec{\lambda }_{r}\) is the same as the starting point for \(\varvec{\beta }\), the iterative algorithm would give the same MLE for the logistic regression parameters and the corresponding log-linear model parameters. The IRLS algorithm is robust to different starting values when the likelihood is not flat. Therefore, asymptotically, \(\hat{\varvec{\beta }}\simeq \varvec{T}_{r} \hat{\varvec{\lambda }}_{r}\) and the proof is complete.\(\square \)
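The correspondence between the two sets of maximum likelihood estimates can be illustrated with a small IRLS sketch. It assumes a hypothetical \(2\times 2\times 2\) table for Y, A and B with made-up counts, the log-linear model \(YA+YB+AB\) under corner-point constraints, and the logistic regression of Y on the main effects of A and B. The functions `solve`, `irls`, `poisson_working` and `logistic_working`, as well as the starting values, are illustrative choices and not the implementation used for the analyses in this manuscript.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    x = [0.0] * n
    for c in reversed(range(n)):
        x[c] = (m[c][n] - sum(m[c][k] * x[k] for k in range(c + 1, n))) / m[c][c]
    return x

def irls(X, eta, working, iters=50):
    """Repeat the weighted least-squares update gamma = (X'WX)^{-1} X'W z."""
    cols = range(len(X[0]))
    for _ in range(iters):
        w, z = working(eta)  # weights and working response at the current predictor
        xtwx = [[sum(wi * r[a] * r[b] for wi, r in zip(w, X)) for b in cols] for a in cols]
        xtwz = [sum(wi * r[a] * zi for wi, r, zi in zip(w, X, z)) for a in cols]
        gamma = solve(xtwx, xtwz)
        eta = [sum(x * g for x, g in zip(r, gamma)) for r in X]
    return gamma

# Hypothetical counts; cells of (A, B) ordered (0,0), (1,0), (0,1), (1,1).
ab = [(0, 0), (1, 0), (0, 1), (1, 1)]
y1 = [20, 35, 15, 40]                      # counts with Y = 1
y0 = [30, 25, 45, 10]                      # counts with Y = 0
counts = y1 + y0
trials = [a + b for a, b in zip(y1, y0)]

X_lt = [[1, a, b] for a, b in ab]          # logistic design: intercept, A, B
# Rearranged log-linear design: columns Y, YA, YB, then 1, A, B, AB; rows with Y = 1 first.
X_rll = ([[1, a, b, 1, a, b, a * b] for a, b in ab]
         + [[0, 0, 0, 1, a, b, a * b] for a, b in ab])

def poisson_working(eta):
    mu = [math.exp(e) for e in eta]
    return mu, [e + (n - m) / m for e, n, m in zip(eta, counts, mu)]

def logistic_working(eta):
    p = [1 / (1 + math.exp(-e)) for e in eta]
    w = [t * q * (1 - q) for t, q in zip(trials, p)]
    return w, [e + (s - t * q) / wi for e, s, t, q, wi in zip(eta, y1, trials, p, w)]

lam = irls(X_rll, [math.log(n) for n in counts], poisson_working)
beta = irls(X_lt, [math.log(s / (t - s)) for s, t in zip(y1, trials)], logistic_working)
print(lam[:3], beta)
```

In this example the first three elements of the fitted log-linear parameter vector, which correspond to the Y terms, agree with the fitted logistic regression coefficients to numerical precision.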
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Papathomas, M. On the correspondence from Bayesian log-linear modelling to logistic regression modelling with g-priors. TEST 27, 197–220 (2018). https://doi.org/10.1007/s11749-017-0540-8
DOI: https://doi.org/10.1007/s11749-017-0540-8
Keywords
- Categorical variables
- Contingency tables
- Mixtures of g-priors
- Prior correspondence
- Posterior correspondence