# The alarming problems of confounding equivalence using logistic regression models in the perspective of causal diagrams


## Abstract

### Background

Confounders can produce spurious associations between exposure and outcome in observational studies. Most epidemiologists habitually adjust for confounders with a logistic regression model, although this approach has problems of accuracy and precision. It is therefore important to highlight the problems of logistic regression and to search for an alternative method.

### Methods

Four causal diagram models were defined to summarize confounding equivalence. Both theoretical proofs and simulation studies were performed to verify whether conditioning on different confounding equivalence sets had the same bias-reducing potential and then to select the optimum adjusting strategy, in which logistic regression model and inverse probability weighting based marginal structural model (IPW-based-MSM) were compared. The “*do*-calculus” was used to calculate the true causal effect of exposure on outcome, then the bias and standard error were used to evaluate the performances of different strategies.

### Results

Adjusting for different confounding equivalence sets, as judged by identical Markov boundaries, produced different bias-reducing potential in the logistic regression model. For the sets that satisfied G-admissibility, adjusting for the set including all the confounders gave a bias equivalent to that of adjusting for the set containing the parent nodes of the outcome, whereas the bias after adjusting for the parent nodes of the exposure was not equivalent to them. In addition, all causal effect estimates from logistic regression were biased, although the estimate after adjusting for the parent nodes of the exposure was nearest to the true causal effect. In contrast, conditioning on different confounding equivalence sets had the same bias-reducing potential under IPW-based-MSM. Compared with logistic regression, the IPW-based-MSM obtained unbiased causal effect estimates when the adjusted confounders satisfied G-admissibility, and the optimal strategy was to adjust for the parent nodes of the outcome, which gave the highest precision.

### Conclusions

All adjustment strategies through logistic regression were biased for causal effect estimation, while IPW-based-MSM could always obtain unbiased estimates when the adjusted set satisfied G-admissibility. Thus, IPW-based-MSM is recommended for adjusting for a confounder set.

## Keywords

Confounding equivalence; Logistic regression model; Inverse probability weighting based marginal structural model; Simulation study; Causal diagrams

## Abbreviations

- ACE
Average causal effect

- c-equivalence
Confounding equivalence

- IPW-based-MSM
Inverse probability weighting based marginal structural model

- OR
Odds ratio

## Background

Causal inference, a key task in epidemiology, aims to discover the causal relationship between exposure and outcome. Theoretically, causality is the difference in outcome caused by a change in exposure, which can be obtained by ‘*do-calculus*’ in observational studies [1]. In practice, however, because the exposure cannot be intervened on in analytic epidemiology, confounders inevitably distort the estimated causal effect of exposure on outcome [2, 3, 4, 5]. For the majority of epidemiologists, adjusting for confounders using a logistic regression model is the routine method for dichotomous outcomes [6, 7, 8, 9, 10]. Although some studies have verified that different adjustment strategies in logistic regression models might lead to different magnitudes of bias (the estimate minus the true causal effect) and precision [8, 11], logistic regression is still the most commonly used strategy in analytic epidemiologic studies. This is mainly attributed to vague knowledge about the behaviour of the logistic regression model. For causal inference in observational studies, the inverse probability weighting based marginal structural model (IPW-based-MSM) has been confirmed as an unbiased approach to causal effect estimation when adjusting for measured confounders [12, 13, 14]. Unfortunately, the advantages of IPW-based-MSM are not recognized by most epidemiologists. Furthermore, for both logistic regression and IPW-based-MSM, the selection of adjustment variable sets remains a big challenge. Fortunately, the concept of confounding equivalence (*c-equivalence*) proposed by Judea Pearl might help us to select adjustment strategies [15].

The concept of *c-equivalence* was introduced to determine whether two variable sets are equally valuable for adjustment, namely, whether adjustment for one set is guaranteed to have the same asymptotic bias as adjustment for the other [15]. Tests for *c-equivalence* are fairly easy to perform through a necessary and sufficient condition [15, 16], and they can also be implemented by propensity score methods [17]. This provides a strategy for selecting adjustment variable sets when using logistic regression models and IPW-based-MSMs, and helps to clarify whether adjusting for different *c-equivalent* sets has the same bias-reducing potential.

In this paper, four causal diagram models were defined to summarize *c-equivalence* and to assess the performance of logistic regression models and IPW-based-MSMs under the framework of *c-equivalence*. Both theoretical proofs and simulation studies were performed to determine whether adjusting for *c-equivalent* sets had the same bias-reducing potential and to observe their precision in logistic regression models and IPW-based-MSMs, respectively, and further to compare the performance of *c-equivalence* between these two models by assessing their accuracy (bias) and precision (standard error). Our aim was to highlight the problems of *c-equivalence* when using the logistic regression model, as well as the advantages of IPW-based-MSM.

## Methods

### *C-equivalence* and its test

Let *X*, *Y* and *Z* be three disjoint subsets of discrete variables, and let *P*(*x*, *y*, *z*) be their joint distribution. The causal effect of *X* on *Y* can be defined as \( P\left(y| do(x)\right)=\sum \limits_zP\left(y|x,z\right)P(z) \) [5, 18, 19], where a sufficient set *Z* is chosen to include variables judged as “confounders” [16, 20, 21]. In this framework, two confounder sets *T* and *Z* are *c-equivalent* if \( \sum \limits_tP\left(y|x,t\right)P(t)=\sum \limits_zP\left(y|x,z\right)P(z) \) ∀*x*, *y*. This means that adjustment for set *T* or *Z* would produce the same asymptotic bias relative to the target causal effect quantity [15]. To state the necessary and sufficient condition of *c-equivalence*, it is first necessary to define the G-admissibility of a variable set *S*: *S* is G-admissible if it satisfies the back-door criterion [19], that is, 1) no element of *S* is a descendant of *X*; and 2) the elements of *S* block every path between *X* and *Y* that contains an arrow into *X*. The other ingredient of *c-equivalence* is the Markov boundary [15], defined as follows: let *S* _{m} be the minimal subset of *S* that satisfies the condition (*X* ⊥ *S*| *S* _{m})_{G}. This means that measurement of *S* _{m} renders *X* independent of all other members of *S*, and no proper subset of *S* _{m} has this property. The necessary and sufficient condition for *T* and *Z* to be *c-equivalent* is that at least one of the following holds: 1) *T* _{m} = *Z* _{m}; or 2) *T* and *Z* are both G-admissible [15].
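The defining equality of *c-equivalence* can be checked numerically on a small example. The sketch below assumes the structure used later for scenario 1 (*Z*→*T*, *Z*→*X*, *T*→*Y*, *X*→*Y*) and hypothetical parameter values chosen only for illustration; the function names `joint_table` and `adjusted_risk` are ours. It enumerates the exact joint distribution and evaluates the adjustment formula for the sets {*Z*} and {*T*}.

```python
from itertools import product
from math import exp


def expit(u):
    return 1.0 / (1.0 + exp(-u))


def bern(p, v):
    """Probability that a Bernoulli(p) variable takes the value v (0 or 1)."""
    return p if v == 1 else 1.0 - p


def joint_table(pi=0.5, c0=1.0, c1=1.0, c2=2.0, c3=1.0, a0=-1.0, a1=-0.5, a2=-0.5):
    """Exact joint P(z, t, x, y) for the DAG Z -> T, Z -> X, T -> Y, X -> Y.

    All parameter values are illustrative assumptions.
    """
    table = {}
    for z, t, x, y in product((0, 1), repeat=4):
        table[(z, t, x, y)] = (
            bern(pi, z)
            * bern(expit(c0 * z + a1), t)
            * bern(expit(c1 * z + a2), x)
            * bern(expit(c3 * x + c2 * t + a0), y)
        )
    return table


def adjusted_risk(table, x, s_index):
    """sum_s P(Y=1 | X=x, S=s) P(S=s); S is the key position s_index (0 = Z, 1 = T)."""
    total = 0.0
    for s in (0, 1):
        p_s = sum(p for k, p in table.items() if k[s_index] == s)
        p_xs = sum(p for k, p in table.items() if k[s_index] == s and k[2] == x)
        p_yxs = sum(p for k, p in table.items()
                    if k[s_index] == s and k[2] == x and k[3] == 1)
        total += (p_yxs / p_xs) * p_s
    return total


table = joint_table()
for x in (0, 1):
    # Adjusting for {Z} and for {T} should give the same interventional risk.
    print(x, adjusted_risk(table, x, 0), adjusted_risk(table, x, 1))
```

Because both sets are G-admissible in this diagram, the two columns agree up to floating-point error, which is the equality in the definition above evaluated for *y* = 1.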

As an example, Fig. 1 illustrates the four typical causal diagrams with simple and complex confounding paths for the target causal path *X*→*Y* [22]. For instance, Fig. 1d contains three confounding paths *X*←*W*→*Y*, *X*←*Z*→*W*→*Y* and *X*←*W*←*T*→*Y*, with three corresponding confounders *Z*, *W* and *T* [22, 23]. Theoretically, conditioning on {*Z, W*}, {*T, W*} or {*Z, T, W*} can achieve the same bias-reducing potential [23]. Thus, they are *c-equivalent*.

### Statistical methods for adjusting for confounders

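The stabilized weight for subject *i* is \( {sw}_i=\frac{P\left(X={x}_i\right)}{P\left(X={x}_i|Z={z}_i\right)} \), where *Z* is a set of variables which are considered to be confounders. By weighting the original observations using the stabilized weights (*sw* _{i}), we can fit the following marginal structural model to estimate the causal effect of *X* on *Y*,

\( \mathrm{logit}\kern0.1em P\left({Y}_x=1\right)={\beta}_0^{MSM}+{\beta}_1^{MSM}x \),

where the causal effect of *X* on *Y* is \( {\beta}_1^{MSM} \).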
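A minimal sketch of this weighting step is given below for a binary exposure and a single binary confounder. It assumes that the probabilities in the weights are estimated by sample proportions, and it uses the fact that for binary *X* the marginal structural model is saturated, so its weighted estimate of \( {\beta}_1^{MSM} \) equals the log odds ratio of the weighted 2×2 table. The function names and the simulated data are hypothetical.

```python
import numpy as np


def stabilized_weights(x, z):
    """sw_i = P(X = x_i) / P(X = x_i | Z = z_i) for binary x and z (sample proportions)."""
    x = np.asarray(x, dtype=int)
    z = np.asarray(z, dtype=int)
    p_x1 = x.mean()
    p_x1_given_z = np.array([x[z == 0].mean(), x[z == 1].mean()])[z]
    numerator = np.where(x == 1, p_x1, 1.0 - p_x1)
    denominator = np.where(x == 1, p_x1_given_z, 1.0 - p_x1_given_z)
    return numerator / denominator


def msm_log_or(x, y, sw):
    """Log odds ratio of the weighted 2x2 table of X and Y.

    For binary X this equals beta_1 of logit P(Y_x = 1) = b0 + b1 * x
    fitted by weighted maximum likelihood with weights sw.
    """
    x = np.asarray(x)
    y = np.asarray(y)

    def cell(xv, yv):
        return sw[(x == xv) & (y == yv)].sum()

    return float(np.log(cell(1, 1) * cell(0, 0) / (cell(1, 0) * cell(0, 1))))


# Hypothetical data in which Z confounds X -> Y (coefficients are illustrative).
rng = np.random.default_rng(0)
n = 5000
z = rng.binomial(1, 0.5, size=n)
x = rng.binomial(1, 1 / (1 + np.exp(-(z - 0.5))))
y = rng.binomial(1, 1 / (1 + np.exp(-(x + z - 1.0))))

sw = stabilized_weights(x, z)
beta1_msm = msm_log_or(x, y, sw)
print(sw.mean(), beta1_msm)
```

With proportions estimated from the same sample, the stabilized weights average to one, which is a convenient sanity check before fitting the weighted model.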

### Theoretical derivation for bias-reducing potential of *c-equivalence* under logistic regression model

Taking Fig. 1a as an example, we deduced whether adjusting for different *c-equivalence* sets had the same bias-reducing potential under logistic regression by the following procedure.

1) Based on the necessary and sufficient condition, *A* _{1} = {*Z*}, *A* _{2} = {*T*} and *A* _{3} = {*Z, T*} satisfied G-admissibility, thus they were equivalent, as denoted by *A* _{1} ≈ *A* _{2} ≈ *A* _{3}.

2) Calculated the true causal effect (*ACE* ^{log(OR)}) of *X* on *Y* through the average causal effect (*ACE*) on the scale of the logarithm of the odds ratio (OR).

3) Calculated the effects \( {\beta}_X^{A_1} \), \( {\beta}_X^{A_2} \) and \( {\beta}_X^{A_3} \) of *X* on *Y* by logistic regression after conditioning on *A* _{1}, *A* _{2} and *A* _{3}, respectively.

4) Calculated the biases \( {\beta}_X^{A_1}-{ACE}^{\log (OR)} \), \( {\beta}_X^{A_2}-{ACE}^{\log (OR)} \) and \( {\beta}_X^{A_3}-{ACE}^{\log (OR)} \), and then deduced whether \( {\beta}_X^{A_1}-{ACE}^{\log (OR)}={\beta}_X^{A_2}-{ACE}^{\log (OR)}={\beta}_X^{A_3}-{ACE}^{\log (OR)} \).
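One way to write the target quantity of this procedure is sketched below. It assumes that *ACE* ^{log(OR)} denotes the marginal log odds ratio under intervention on a binary *X*, and it uses the admissible set {*T*} of Fig. 1a in the adjustment formula; this is a reading of the text above rather than an expression taken from the appendix.

```latex
\[
ACE^{\log(OR)}
  = \log
    \frac{P\left(Y=1 \mid do(X=1)\right) \,/\, P\left(Y=0 \mid do(X=1)\right)}
         {P\left(Y=1 \mid do(X=0)\right) \,/\, P\left(Y=0 \mid do(X=0)\right)},
\qquad
P\left(y \mid do(x)\right) = \sum_{t} P\left(y \mid x, t\right) P(t).
\]
```

Under this reading the comparison is between a marginal quantity and the conditional coefficients \( {\beta}_X^{A_k} \) of the logistic models.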

### Simulation

Taking the four typical causal diagrams (Fig. 1), which cover the general forms of *c-equivalence*, as examples, a series of simulation studies were performed to determine whether adjusting for *c-equivalent* sets had the same bias-reducing potential, to observe their precision in logistic regression models and IPW-based-MSMs, respectively, and to compare the performance of *c-equivalence* between these two models through assessing their accuracy and precision.

Four simulation scenarios were considered, under the assumptions that: 1) all variables were binary and followed Bernoulli distributions; and 2) the effects of parent nodes on their child nodes were positive and log-linearly additive. Logistic regression models were used to simulate child nodes from their corresponding parent nodes.

For scenario 1 (Fig. 1a), the simulated data were generated as follows. Let *P*(*Z* = 1) = *π*. Then, *P*(*T* = 1| *Z*) = exp(*c* _{0} *Z* + *α* _{1})/(1 + exp(*c* _{0} *Z* + *α* _{1})) was used to derive the probability of child node *T* from its parent node *Z*. Similarly, *P*(*X* = 1| *Z*) = exp(*c* _{1} *Z* + *α* _{2})/(1 + exp(*c* _{1} *Z* + *α* _{2})) and *P*(*Y* = 1| *X*, *T*) = exp(*c* _{3} *X* + *c* _{2} *T* + *α* _{0})/(1 + exp(*c* _{3} *X* + *c* _{2} *T* + *α* _{0})) were used to obtain the probabilities of *X* = 1 and *Y* = 1, respectively, where the parameters *α* _{0}, *α* _{1} and *α* _{2} denoted the intercepts of *Y*, *T* and *X*, respectively, and each effect parameter (*c* _{0}, *c* _{1}, *c* _{2}, *c* _{3}) referred to the effect of the parent node on its corresponding child node. Simulated data were generated for 1000 subjects by the above procedure.
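The data-generating steps for scenario 1 can be sketched as follows. The simulations in this study were run in R; the Python version below is only an illustration, and the default parameter values (`pi`, `c0` to `c3`, `a0` to `a2`) and the function name `simulate_scenario1` are assumptions rather than the settings actually used.

```python
import numpy as np


def expit(u):
    return 1.0 / (1.0 + np.exp(-u))


def simulate_scenario1(n=1000, pi=0.5, c0=1.0, c1=1.0, c2=2.0, c3=1.0,
                       a0=-1.0, a1=-0.5, a2=-0.5, seed=0):
    """Draw n subjects from the DAG Z -> T, Z -> X, T -> Y, X -> Y.

    a0, a1, a2 are the intercepts of Y, T and X; c0..c3 are the edge effects.
    All default values are illustrative.
    """
    rng = np.random.default_rng(seed)
    z = rng.binomial(1, pi, size=n)                    # P(Z = 1) = pi
    t = rng.binomial(1, expit(c0 * z + a1))            # T from its parent Z
    x = rng.binomial(1, expit(c1 * z + a2))            # X from its parent Z
    y = rng.binomial(1, expit(c3 * x + c2 * t + a0))   # Y from its parents X and T
    return {"Z": z, "T": t, "X": x, "Y": y}


data = simulate_scenario1()
print({name: values.mean() for name, values in data.items()})
```

Each child node is drawn from a Bernoulli distribution whose success probability is the logistic function of its parents, mirroring the three expressions above.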

In this scenario (Fig. 1a), variable sets *A* _{1} = {*Z*}, *A* _{2} = {*T*} and *A* _{3} = {*Z, T*} satisfied the necessary and sufficient conditions of *c-equivalence*; thus, *A* _{1} ≈ *A* _{2} ≈ *A* _{3}. Therefore, we compared three adjustment strategies with the following six models,

model 1: \( \mathrm{logit}\left(p\left(Y=1|X,{A}_1\right)\right)={{\widehat{\beta}}^{A_1}}_0+{\widehat{\beta}}_X^{A_1}X+{{\widehat{\beta}}^{A_1}}_ZZ \).

model 2: \( \mathrm{logit}\left(p\left(Y=1|X,{A}_2\right)\right)={{\widehat{\beta}}^{A_2}}_0+{\widehat{\beta}}_X^{A_2}X+{{\widehat{\beta}}^{A_2}}_TT \).

model 3: \( \mathrm{logit}\left(p\left(Y=1|X,{A}_3\right)\right)={{\widehat{\beta}}^{A_3}}_0+{\widehat{\beta}}_X^{A_3}X+{{\widehat{\beta}}^{A_3}}_TT+{{\widehat{\beta}}^{A_3}}_ZZ \).

model 4: \( \mathrm{logit}\kern0.1em P\left({Y}_x^{A_1}=1\right)={\widehat{\beta}}_0^{MSM\_{A}_1}+{\widehat{\beta}}_x^{MSM\_{A}_1}x \), with weights \( {sw}_i^{A_1}=\frac{P\left(X={x}_i\right)}{P\left(X={x}_i|{A}_1={a}_{1i}\right)} \).

model 5: \( \mathrm{logit}\kern0.1em P\left({Y}_x^{A_2}=1\right)={\widehat{\beta}}_0^{MSM\_{A}_2}+{\widehat{\beta}}_x^{MSM\_{A}_2}x \), with weights \( {sw}_i^{A_2}=\frac{P\left(X={x}_i\right)}{P\left(X={x}_i|{A}_2={a}_{2i}\right)} \).

model 6: \( \mathrm{logit}\kern0.1em P\left({Y}_x^{A_3}=1\right)={\widehat{\beta}}_0^{MSM\_{A}_3}+{\widehat{\beta}}_x^{MSM\_{A}_3}x \), with weights \( {sw}_i^{A_3}=\frac{P\left(X={x}_i\right)}{P\left(X={x}_i|{A}_3={a}_{3i}\right)} \).

where \( {\widehat{\beta}}_X^{A_1} \), \( {\widehat{\beta}}_X^{A_2} \), \( {\widehat{\beta}}_X^{A_3} \) and \( {\widehat{\beta}}_x^{MSM\_{A}_1} \), \( {\widehat{\beta}}_x^{MSM\_{A}_2} \), \( {\widehat{\beta}}_x^{MSM\_{A}_3} \) denoted the causal effect estimates after conditioning on *A* _{1}, *A* _{2} and *A* _{3} by logistic regression and IPW-based-MSM, respectively. Given the true causal effect \( A\widehat{C}{E}^{\log (OR)} \) calculated by *do-calculus*, the biases (\( {\widehat{\beta}}_X^{A_1}-A\widehat{C}{E}^{\log (OR)} \), \( {\widehat{\beta}}_X^{A_2}-A\widehat{C}{E}^{\log (OR)} \), \( {\widehat{\beta}}_X^{A_3}-A\widehat{C}{E}^{\log (OR)} \), \( {\widehat{\beta}}_x^{MSM\_{A}_1}-A\widehat{C}{E}^{\log (OR)} \), \( {\widehat{\beta}}_x^{MSM\_{A}_2}-A\widehat{C}{E}^{\log (OR)} \), \( {\widehat{\beta}}_x^{MSM\_{A}_3}-A\widehat{C}{E}^{\log (OR)} \)) and their corresponding standard errors (\( \mathrm{SE}\left({\widehat{\beta}}_X^{A_1}\right) \), \( \mathrm{SE}\left({\widehat{\beta}}_X^{A_2}\right) \), \( \mathrm{SE}\left({\widehat{\beta}}_X^{A_3}\right) \), \( \mathrm{SE}\left({\widehat{\beta}}_x^{MSM\_{A}_1}\right) \), \( \mathrm{SE}\left({\widehat{\beta}}_x^{MSM\_{A}_2}\right) \) and \( \mathrm{SE}\left({\widehat{\beta}}_x^{MSM\_{A}_3}\right) \)) were used to identify whether adjusting for the different *c-equivalent* sets *A* _{1}, *A* _{2} or *A* _{3} still produced the same bias reduction under the logistic regression model and IPW-based-MSM, and further to evaluate their accuracy and precision.
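As an illustration of how the biases of models 1 to 3 can be computed, the sketch below fits the three logistic regressions with a small Newton-Raphson routine and compares the coefficient of *X* with the true marginal log odds ratio obtained from the back-door formula. The parameter values, the large sample size (chosen to keep sampling noise small) and the helper names `fit_logit`, `risk_do` and `logit` are assumptions for illustration only.

```python
import numpy as np


def expit(u):
    return 1.0 / (1.0 + np.exp(-u))


def logit(p):
    return np.log(p / (1.0 - p))


def fit_logit(y, covariates, n_iter=25):
    """Newton-Raphson fit of a logistic regression; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(y))] + list(covariates))
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = expit(design @ beta)
        gradient = design.T @ (y - p)
        hessian = (design * (p * (1.0 - p))[:, None]).T @ design
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


# Illustrative parameter values (assumptions, not the settings of this study).
pi, c0, c1, c2, c3 = 0.5, 1.0, 1.0, 2.0, 1.0
a0, a1, a2 = -1.0, -0.5, -0.5

rng = np.random.default_rng(1)
n = 200_000  # large n so that sampling noise is small
z = rng.binomial(1, pi, size=n)
t = rng.binomial(1, expit(c0 * z + a1))
x = rng.binomial(1, expit(c1 * z + a2))
y = rng.binomial(1, expit(c3 * x + c2 * t + a0))

# True marginal log OR from the back-door formula with the admissible set {T}.
p_t1 = pi * expit(c0 + a1) + (1 - pi) * expit(a1)


def risk_do(xv):
    return p_t1 * expit(c3 * xv + c2 + a0) + (1 - p_t1) * expit(c3 * xv + a0)


ace_log_or = logit(risk_do(1)) - logit(risk_do(0))

beta_a1 = fit_logit(y, [x, z])[1]       # model 1: adjust for A1 = {Z}
beta_a2 = fit_logit(y, [x, t])[1]       # model 2: adjust for A2 = {T}
beta_a3 = fit_logit(y, [x, t, z])[1]    # model 3: adjust for A3 = {Z, T}
for name, b in (("A1", beta_a1), ("A2", beta_a2), ("A3", beta_a3)):
    print(name, "estimate", b, "bias", b - ace_log_or)
```

The coefficient at index 1 is the one for *X* because `fit_logit` places the intercept first and the covariates in the order given.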

For scenario 2 (Fig. 1b), simulation data sets were created in the same way as in scenario 1. In this scenario, *A* _{1} = {*Z, W*}, *A* _{2} = {*T, W*} and *A* _{3} = {*Z, T, W*} satisfied G-admissibility; thus, *A* _{1} ≈ *A* _{2} ≈ *A* _{3}. Therefore, three corresponding logistic regression models and three corresponding IPW-based-MSMs conditional on *A* _{1}, *A* _{2} or *A* _{3} were constructed to identify whether the *c-equivalent* sets had identical biases and to evaluate their precision. In addition, *B* _{1} = {*Z*} was *c-equivalent* to *B* _{2} = {*Z, T*}, namely, *B* _{1} ≈ *B* _{2}, due to their identical Markov boundary, written as *B* _{1m} = *B* _{2m} = {*Z*}. Therefore, four corresponding models conditioning on *B* _{1} or *B* _{2} were used to calculate the biases and standard errors.

In scenario 3 (Fig. 1c), the simulated data was generated in the same way as in scenario 1. In addition, the sets *A* _{1} = {*Z*} ≈ *A* _{ 2 } = {*Z*, *T*} and *B* _{1} = {*W*} ≈ *B* _{2} = {*W,V*} were separately *c-equivalent* due to *A* _{1m} = *A* _{2m} = {*Z*} and *B* _{1m} = *B* _{2m} = {*W*}. As *A* _{1} ≈ *A* _{ 2 } and *B* _{1} ≈ *B* _{2} were identical in the *c-equivalence* mechanism, it was sufficient to analyze one group to explore the *c-equivalence* mechanism of the identical Markov boundary. Thus, we constructed two logistic regression models and two IPW-based-MSMs conditioning on *A* _{1} or *A* _{2} to explore their *c-equivalence* and to evaluate their precision. Furthermore, as variables sets *C* _{1} = {*Z*,*W*}, *C* _{2} = {*T*,*V*} and *C* _{3} = {*Z*,*W*,*T*,*V*} blocked all back-door paths from *X* to *Y*, they were admissible and equivalent, *C* _{1} ≈ *C* _{2} ≈ *C* _{3}. Therefore, the six corresponding models conditional on *C* _{1}, *C* _{2} or *C* _{3} were performed to identify biases and precisions.

For scenario 4 (Fig. 1d), following the path directions, simulation data sets were created in the same way as in scenario 1. *A* _{1} = {*Z, W*}, *A* _{2} = {*T, W*} and *A* _{3} = {*Z, T, W*} satisfied G-admissibility; thus, *A* _{1} ≈ *A* _{2} ≈ *A* _{3}. The three corresponding logistic regression models and three IPW-based-MSMs conditional on *A* _{1}, *A* _{2} or *A* _{3} were used to observe the biases and precision.

For each of the 4 simulation scenarios, we varied the effect of one specific edge while keeping the others fixed, with 1000 simulation repetitions for each setting. The R programming language (http://cran.r-project.org/) was used to conduct the statistical simulations.
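The repetition scheme can be sketched as a loop over values of one edge effect. The Python example below is an illustration only (the study itself used R): it varies the effect of *T* on *Y* in scenario 1, applies the IPW-based-MSM adjusting for *A* _{2} = {*T*}, and summarizes the mean bias and the empirical standard deviation of the estimates over repetitions. The grid of `c2` values, the other parameter values, and the use of the empirical standard deviation as the precision measure are assumptions.

```python
import numpy as np


def expit(u):
    return 1.0 / (1.0 + np.exp(-u))


def logit(p):
    return np.log(p / (1.0 - p))


def msm_estimate(x, y, s):
    """IPW-based-MSM log OR for binary x and y, adjusting for one binary covariate s."""
    p_x1 = x.mean()
    p_x1_s = np.array([x[s == 0].mean(), x[s == 1].mean()])[s]
    sw = np.where(x == 1, p_x1 / p_x1_s, (1 - p_x1) / (1 - p_x1_s))

    def cell(xv, yv):
        return sw[(x == xv) & (y == yv)].sum()

    return np.log(cell(1, 1) * cell(0, 0) / (cell(1, 0) * cell(0, 1)))


def run_setting(c2, n=1000, reps=1000, seed=0,
                pi=0.5, c0=1.0, c1=1.0, c3=1.0, a0=-1.0, a1=-0.5, a2=-0.5):
    """Mean bias and empirical SE of the MSM estimate adjusting for {T}."""
    rng = np.random.default_rng(seed)
    p_t1 = pi * expit(c0 + a1) + (1 - pi) * expit(a1)
    risk = [p_t1 * expit(c3 * xv + c2 + a0) + (1 - p_t1) * expit(c3 * xv + a0)
            for xv in (0, 1)]
    truth = logit(risk[1]) - logit(risk[0])  # true marginal log OR
    estimates = np.empty(reps)
    for r in range(reps):
        z = rng.binomial(1, pi, size=n)
        t = rng.binomial(1, expit(c0 * z + a1))
        x = rng.binomial(1, expit(c1 * z + a2))
        y = rng.binomial(1, expit(c3 * x + c2 * t + a0))
        estimates[r] = msm_estimate(x, y, t)
    return estimates.mean() - truth, estimates.std(ddof=1)


# Vary the effect of T on Y while the other edge effects stay fixed.
results = {c2: run_setting(c2) for c2 in (0.5, 1.0, 2.0)}
for c2, (bias, se) in results.items():
    print(f"c2={c2}: mean bias={bias:.3f}, empirical SE={se:.3f}")
```

The same loop structure applies to any other edge: only the parameter being varied and the adjustment set passed to the estimator change.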

## Results

### Theoretical results for bias-reducing potential of *c-equivalence* under logistic regression model

Scenario 1 (Fig. 1a) was considered as a typical diagram for deducing whether adjusting for different *c-equivalent* sets resulted in the same bias reduction under the logistic regression models. In this causal diagram, *A* _{1} = {*Z*}, *A* _{2} = {*T*} and *A* _{3} = {*Z, T*} composed the *c-equivalence* group, which satisfied G-admissibility.

For the group *A* _{1} ≈ *A* _{2} ≈ *A* _{3} of *c-equivalence*, the true causal effect of *X* on *Y* was calculated as \( {ACE}^{\log (OR)} \). By conditioning on *A* _{1} = {*Z*}, the effect of *X* on *Y* was equal to \( {\beta}_X^{A_1} \); the effect of *X* on *Y* when conditioning on *A* _{2} = {*T*} was equal to \( {\beta}_X^{A_2} \); and the effect of *X* on *Y* when conditioning on *A* _{3} = {*T*, *Z*} was equal to \( {\beta}_X^{A_3} \).

After a series of derivations (Additional file 1: Appendix), we obtained \( {\beta}_X^{A_2}={\beta}_X^{A_3} \) under any condition, suggesting that the bias reduction after adjusting for the *c-equivalent* sets *A* _{2} ≈ *A* _{3} was equivalent under the logistic regression model. Moreover, \( {\beta}_X^{A_1}={\beta}_X^{A_2}={\beta}_X^{A_3} \) only if *c* _{2} = 0 or *c* _{3} = 0, indicating that the bias reduction after adjusting for *A* _{1}, *A* _{2} and *A* _{3} was equivalent only in this situation. However, \( {\beta}_X^{A_1}<{\beta}_X^{A_2}={\beta}_X^{A_3} \) if *c* _{2} ≠ 0 and *c* _{3} > 0, and \( {\beta}_X^{A_1}>{\beta}_X^{A_2}={\beta}_X^{A_3} \) if *c* _{2} ≠ 0 and *c* _{3} < 0, which indicates unequal bias reduction after adjusting for the *c-equivalent* sets *A* _{1} ≈ *A* _{2} ≈ *A* _{3} when both *c* _{2} and *c* _{3} were not equal to zero (for more details, see Appendix).
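A short worked step shows why the first equality is to be expected under the data-generating model of scenario 1 (Fig. 1a). This is a sketch under that specific model, not the derivation of Additional file 1; \( {\beta}_Z^{A_3} \) denotes the coefficient of *Z* in model 3.

```latex
\begin{align*}
P\left(Y=1 \mid X, T, Z\right) &= P\left(Y=1 \mid X, T\right)
  && \text{(the only parents of } Y \text{ are } X \text{ and } T\text{)} \\
\mathrm{logit}\, P\left(Y=1 \mid X, T\right) &= \alpha_0 + c_3 X + c_2 T
  && \text{(data-generating model)} \\
\Rightarrow \quad \beta_X^{A_2} &= \beta_X^{A_3} = c_3, \qquad \beta_Z^{A_3} = 0 .
\end{align*}
```

Conditioning on *A* _{1} = {*Z*} instead averages \( P\left(Y=1|X,T\right) \) over \( P\left(T|Z\right) \) within each stratum of *Z*, which in general changes the odds ratio when *c* _{2} ≠ 0 and *c* _{3} ≠ 0.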

### Simulation results

#### Scenario 1

Under the logistic regression models, adjusting for the *c-equivalent* set *A* _{2} or *A* _{3} resulted in approximately equal biases, but the bias after adjusting for set *A* _{1} was not equal to them. Moreover, the strategy of adjusting for *A* _{1} achieved the minimum bias. When adjusting for confounders by IPW-based-MSM, the estimates of all the strategies were approximately equal and unbiased. Panels c and d of Fig. 2 and Additional file 2: Figure S1 showed that adjusting for *A* _{2} by IPW-based-MSM achieved the highest precision in all situations. Thus, compared with logistic regression models, the IPW-based-MSM produced an unbiased causal effect estimate and the highest precision in this scenario. The optimal adjustment strategy was conditioning on *A* _{2}. Although the estimates from the logistic regression models were biased, adjusting for *A* _{1} produced the result nearest to the true causal effect.

When the effect of *Z* on *T* was varied with the other parameters fixed, the simulation results indicated that the biases of all six models (models 1–6) tended to be stable (Fig. 2a). Similar performance was observed when varying the effect of *Z* on *X* (Additional file 2: Figure S1a). However, when the effect of *T* on *Y* was varied with the other parameters held constant, the bias showed a linearly increasing trend after adjusting for set *A* _{2} or *A* _{3} under the logistic regression model, but was approximately zero after adjusting for set *A* _{1}. In contrast, the biases remained stable under IPW-based-MSM (Fig. 2b). We observed similar trends as the effect of *X* on *Y* increased (Additional file 2: Figure S1b).

#### Scenario 2

For the *c-equivalent* subsets *A* _{1} = {*Z, W*}, *A* _{2} = {*T*, *W*} and *A* _{3} = {*Z, T, W*}, we observed that the bias after adjusting for set *A* _{2} was similar to that of *A* _{3} but not to that of *A* _{1}, and the strategy of adjusting for *A* _{1} achieved the minimum bias under the logistic regression models, as shown in panels a and b of Fig. 3, Additional file 3: Figure S2 and Additional file 4: Figure S3. Adjustment for any of the *c-equivalent* subsets through IPW-based-MSM had the same bias-reducing potential, and the estimates were unbiased. Panels c and d of these figures showed that adjusting for *A* _{2} under IPW-based-MSM achieved the highest precision in all situations. Thus, conditioning on any *c-equivalent* set that satisfied G-admissibility through IPW-based-MSM produced an unbiased causal effect estimate, and adjustment for *A* _{2} was the best strategy. When using logistic regression models to adjust for confounders, the optimal adjustment strategy was adjusting for variable subset *A* _{1}.

In the logistic regression models, with the other parameters held constant, the bias increased with the effect of *T* on *Y* when adjusting for *A* _{2} or *A* _{3}, whereas it increased in the opposite direction when adjusting for *A* _{1} (Fig. 3a). All three models showed increasing biases as the effect of *W* on *Y* increased (Fig. 3b). Similar performance was observed when varying the effect of *X* on *Y* (Additional file 3: Figure S2b). When the effect of *Z* on *T* was varied with the other parameters fixed, the simulation results indicated that the biases of all three adjustment strategies tended to be stable (Additional file 3: Figure S2b). We observed similar trends with the increase of the effect of *Z* on *X* (Additional file 4: Figure S3a) or the effect of *W* on *X* (Additional file 4: Figure S3b). When adjusting for confounders through IPW-based-MSM, the biases of all three adjustment strategies tended to be stable in all situations.

For the *c-equivalent* subsets *B* _{1} = {*Z*} and *B* _{2} = {*Z, T*}, panels a and b of Fig. 4, Additional file 5: Figure S4 and Additional file 6: Figure S5 showed that adjusting for the *c-equivalent* set *B* _{1} or *B* _{2} resulted in different bias reduction, and the bias of adjusting for *B* _{1} was less than that of adjusting for *B* _{2} under the logistic regression models. For IPW-based-MSM, the biases were equivalent after adjusting for *B* _{1} or *B* _{2}. Panels c and d of these figures showed that adjusting for *B* _{2} through IPW-based-MSM resulted in higher precision.

With the other parameters held constant, the bias increased with the effect of *T* on *Y* when adjusting for set *B* _{2}, whereas it was stable after adjusting for *B* _{1} under logistic regression. A stable trend also appeared after adjusting for either set through IPW-based-MSM (Fig. 4a). Similar performance was observed when varying the effect of *X* on *Y* (Additional file 5: Figure S4b). When the effect of *W* on *Y* was varied with the other parameters fixed, the simulation results indicated that the biases of the four models showed an increasing trend (Fig. 4b). Similar trends were observed as the effect of *W* on *X* increased (Additional file 6: Figure S5b). When the effect of *Z* on *T* was varied with the other parameters fixed, the biases of the four models were stable (Additional file 5: Figure S4a). Similar performance was observed when varying the effect of *Z* on *X* (Additional file 6: Figure S5a).

#### Scenario 3

For the *c-equivalent* subsets *A* _{1} = {*Z*} and *A* _{2} = {*Z*, *T*}, Fig. 5, Additional file 7: Figure S6 and Additional file 8: Figure S7 showed that adjusting for the *c-equivalent* set *A* _{1} or *A* _{2} resulted in different bias reduction, and the bias after adjusting for *A* _{1} was less than that after adjusting for *A* _{2} under the logistic regression models. In contrast, the biases were equal after conditioning on *A* _{1} and *A* _{2} via IPW-based-MSM. In terms of the standard error, adjusting for *A* _{2} through IPW-based-MSM resulted in higher precision.

For the *c-equivalent* subsets *C* _{1} = {*Z*, *W*}, *C* _{2} = {*T*, *V*} and *C* _{3} = {*Z*, *W*, *T*, *V*}, the simulation results (Fig. 6, Additional file 9: Figure S8 and Additional file 10: Figure S9) showed that adjusting for the variable set *C* _{2} resulted in a bias similar to that of set *C* _{3} but not to that of *C* _{1}, and the strategy of adjusting for *C* _{1} resulted in the minimum bias under the logistic regression models. However, the estimates of all strategies under IPW-based-MSM were approximately equivalent and unbiased. For the standard error, conditioning on *C* _{2} by IPW-based-MSM resulted in the minimum standard error in all situations. Thus, IPW-based-MSM was a better method than logistic regression for controlling for confounders. The optimal adjustment strategy was conditioning on *C* _{2} by IPW-based-MSM in this scenario. In addition, adjusting for *C* _{1} produced the result that was nearest to the true causal effect under the logistic regression model.

#### Scenario 4

Under the logistic regression models, adjusting for the *c-equivalent* set *A* _{2} or *A* _{3} resulted in similar bias reduction, but the bias after adjusting for *A* _{1} was not equal to them, and the strategy of adjusting for *A* _{1} achieved the minimum bias. Conditioning on any confounding set through IPW-based-MSM had the same bias reduction and produced unbiased estimates. In terms of the standard error, we observed that adjusting for *A* _{2} by IPW-based-MSM resulted in higher precision in all situations. Thus, IPW-based-MSM produced unbiased causal effect estimates after conditioning on any *c-equivalent* set, and the strategy of adjusting for *A* _{2} achieved the highest precision in this scenario. When using logistic regression models to adjust for confounders, adjusting for variable subset *A* _{1} produced the minimum bias.

## Discussion

In this paper, we focused on the 4 typical causal diagrams shown in Fig. 1 to assess the performance of logistic regression models and IPW-based-MSMs with respect to *c-equivalence*. The necessary and sufficient condition for *T* and *Z* to be *c-equivalent* proposed by Pearl is that at least one of the following holds [15]: 1) *T* _{m} = *Z* _{m}; or 2) *T* and *Z* are G-admissible. Our results revealed that *c-equivalent* sets satisfying condition 1) (e.g., *B* _{1} = {*Z*} and *B* _{2} = {*Z, T*} in scenario 2) had different bias reduction under logistic regression. For condition 2), adjusting for the set including all confounders had approximately the same bias reduction as adjusting for the set containing the parent nodes of *Y*, while adjusting for the set containing the parent nodes of *X* was not equivalent to adjusting for the two sets above. However, under the framework of IPW-based-MSM, conditioning on any *c-equivalent* set, as judged by the necessary and sufficient condition, still had the same bias reduction. In summary, adjusting for different *c-equivalent* sets under logistic regression could produce different bias reduction, whereas when using IPW-based-MSM, the estimates of all strategies were approximately equivalent.

Adjusting for more confounders improves the accuracy and precision of estimation in classic linear regression [28, 29]. Nevertheless, including more confounders in a logistic regression model usually leads to less bias and lower precision [30]. Our study showed that adjusting for the set containing the parent nodes of *X* had the minimum bias in logistic regression. With regard to the standard error, adjusting for a set with fewer confounders improved precision. Under the framework of IPW-based-MSM, we observed that adjusting for any set satisfying condition 2) gave unbiased estimates, and conditioning on the set containing all parent nodes of *Y* achieved the highest precision in all situations. In summary, compared with logistic regression, the IPW-based-MSM produced unbiased causal effect estimates when the adjusted variable sets satisfied condition 2), and the optimal adjustment strategy was conditioning on the parent nodes of the outcome *Y*, which achieved the highest precision. Although the estimates obtained by logistic regression were biased, the estimate after adjusting for the parent nodes of the exposure *X* was nearest to the true causal effect.

The true causal effect of exposure on outcome calculated by “*do-calculus*” is defined in terms of marginal probability distributions. However, the conditional treatment effects estimated from logistic regression model differ from the true causal effect [31, 32]. Logistic regression estimates do not behave like linear regression estimates. They are affected by omitted variables, even when those variables are unrelated to the independent variables in the model [11]. The use of IPW-based-MSM could lead to a more precise estimation of causal effects.

The discrepancy between the marginal OR and the conditional OR, even in the absence of confounders, is known as the non-collapsibility of the OR [4, 33]. The non-collapsibility effect depends on a variety of parameters, e.g., the effect of the exposure and the prevalence and effect of the covariate [4, 33]. According to our results, the differences in estimates between the logistic regression model and IPW-based-MSM were numerically equal to the non-collapsibility effect. However, the discrepancy in estimates between these two models differed after adjusting for different *c-equivalent* sets, possibly because these sets contain different variables.
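Non-collapsibility can be seen in a small exact calculation. The sketch below assumes a binary covariate *T* that affects *Y* but is independent of *X* (so it is not a confounder), with hypothetical values for the coefficients and for *P*(*T* = 1); it compares the conditional log OR within strata of *T* with the marginal log OR obtained after averaging the risks over *T*.

```python
from math import exp, log


def expit(u):
    return 1.0 / (1.0 + exp(-u))


def logit(p):
    return log(p / (1.0 - p))


# Illustrative values: T is independent of X, so T is not a confounder.
c3, c2, a0, p_t1 = 1.0, 2.0, -1.0, 0.5


def risk(xv, tv):
    """P(Y = 1 | X = xv, T = tv) under a logistic outcome model."""
    return expit(c3 * xv + c2 * tv + a0)


# Conditional log OR: the same in both strata of T.
conditional = [logit(risk(1, tv)) - logit(risk(0, tv)) for tv in (0, 1)]

# Marginal log OR: average the risks over T first, then take the log OR.
marginal_risk = [p_t1 * risk(xv, 1) + (1 - p_t1) * risk(xv, 0) for xv in (0, 1)]
marginal = logit(marginal_risk[1]) - logit(marginal_risk[0])

print("conditional log OR per stratum:", conditional)
print("marginal log OR:", marginal)
```

The conditional log OR equals `c3` in each stratum, while the marginal log OR is closer to zero even though no confounding is present, which is the gap that separates a conditional logistic coefficient from a marginal causal log OR.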

## Conclusions

In conclusion, the bias reduction differed after adjusting for *c-equivalent* sets under the logistic regression model, whereas it was approximately equivalent when using IPW-based-MSM. All adjustment strategies through logistic regression were biased, while IPW-based-MSM could always obtain unbiased estimates when the adjusted set satisfied G-admissibility. Thus, for adjusting for a confounder set, we recommend IPW-based-MSM rather than the logistic regression model.

## Notes

### Acknowledgements

We would like to thank the reviewers and academic editors for providing us with constructive comments and suggestions and also wish to acknowledge our colleagues for their invaluable work. In addition, I have benefited greatly from suggestions provided by the group of biostatistics at Shandong University. I am also grateful to the support of the National Natural Science Foundation of China.

### Funding

This work was supported by grants from the National Natural Science Foundation of China (grant numbers 81773547 and 81573259).

### Availability of data and materials

Not applicable

### Authors’ contributions

YYY, HKL, YXL and FZX conceived and designed the study. YYY performed the simulation and theoretical proof, and HKL refined the theoretical proof. YYY, HKL, XRS, PS, TTW, YL and ZSY drafted and revised the manuscript. All authors read and approved the final manuscript.

### Ethics approval and consent to participate

Not applicable

### Consent for publication

Not applicable

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.