Background

Controlling unobserved confounding is a great challenge when estimating the causal effect of an exposure on an outcome of interest in observational studies [1,2,3,4]. Several techniques such as traditional regression model, marginal structure model, adjustment, stratification, inverse probability weighing (IPW), matching based on propensity score cannot deal with unobserved confounding [5,6,7,8,9]. The observed data distribution may have compatibility with many contradictory causal explanations due to the existence of unobserved confounding. In this circumstance, we say the causal estimand is not identified. On the contrary, when the causal estimand can be obtained entirely from observable probability distributions, we say the query is identified.

Some methods have been developed to alleviate the problems caused by unobserved confounding. Instrumental variable analysis (IVA) is the commonly used method to eliminate unobserved confounding [10]. But in practice, choosing valid instruments (IVs) is a stumbling block in IVA [11,12,13]. The difference-in-differences (DID) is contingent on the availability of repeated outcomes in both periods, but invokes strict parallel trend assumptions, i.e., confounders varying across the groups are time invariant and time-varying confounders are group invariant [14, 15]. Regression discontinuity design (RDD) is a quasi-experimental pretest-posttest design for controlling unobserved confounding by assigning a cutoff or threshold above or below to a treatment [16]. Nevertheless, RDD requires that treatment assignment is sufficiently randomized at the threshold [17]. Negative controls are widely used in epidemiologic practice to detect the presence of unobserved confounding. While a valid negative control outcome needs to be influenced by the same unobserved confounders of the exposure effects on the outcome in view, although not directly influenced by the exposure. But this approach fails to obtain causal estimation of the exposure on the outcome [18]. Moreover, strict assumptions of the methods above usually may be violated in practice and impose restrictions on their generalization.

In this article, we propose a novel method to control unobserved confounding through double confounders with two values satisfying a non-linear condition on the exposure. Under the assumption of ignorable treatment assignment, causal effects can be identified and estimated using commonly generalized moment estimate model. Furthermore, we relax the assumption that observed and unobserved confounders are independent in sensitivity analysis and observe that even when the correlations between binary observed confounders and unobserved confounders are relatively weak, we still obtain the almost unbiased causal effect estimation. Additionally, we explore the statistical properties of this method by a simulation study and compare with the traditional regression approach only adjusting for observed confounders. Finally, we apply this method to a cohort from a follow-up survey (136,895 individuals) from 2007 to 2015 in Jining, China to exam the causal associations of BMI on other factors, including SBP, DBP, FBG, TG, TC, HDL and LDL.

Methods

Notation and preliminaries

Throughout, we let X, Y, C1, C2 and U denote the treatment, outcome, two observed confounders and unobserved confounders, respectively (Fig. 1). Following the convention in causal inference, we use Y(x) to denote the potential outcome of Y under an intervention which sets X to x, and the observed outcome Y is a realization of the potential outcome under the exposure actually received: Y = Y(x) when X = x. We focus on the average causal effect (ACE) of X on Y which is the difference in expectation of potential outcome at two different exposure levels 0 and 1, for instance, ACEX → Y = E(Y(1) − Y(0)) for a binary exposure. The conditional ignorability assumption Y(x) ⊥ x ∣ C1, C2 is conventionally made in causal inference, but it does not hold in the present of unmeasured confounding U. In this case, latent ignorability Y(x) ⊥ x ∣ C1, C2, U is more reasonable, allowing for an unobserved confounder U that captures the source of non-ignorability of the exposure mechanism. The model with continuous exposure and outcome can be developed as follows.

$$ X=F\left({C}_1,{C}_2\right)+\varphi \left(U,{\varepsilon}_X\right) $$
(1)
$$ Y={\beta}_0+{\beta}_1X+{\beta}_2{C}_1+{\beta}_3{C}_2+\phi \left(U,{\varepsilon}_Y\right) $$
(2)

where εY and εX are two mutually independent random errors with means 0 and variances \( {\sigma}_Y^2 \) and \( {\sigma}_X^2 \) respectively, φ(·) and ϕ(·) are two arbitrary functions. The causal effect of X on Y is interpreted as β1. Similarly, for binary exposure and outcome, we can also construct corresponding linear probability model (LPM) for P(X = 1| C1, C2, U), P(Y = 1| C1, C2, U), respectively. In addition, the “complementary log link” of transformation of Y (i.e. − log(1 − Risk), Risk denotes cumulative completion risk) is also appropriate for our method [19].

Fig. 1
figure 1

Causal diagram for X (treatment) and Y (outcome) with C1, C2 (observed confounders) and U (other unobserved confounders)

Identification and estimation of causal effects with double binary confounders

Obviously, the estimation of regression model adjusting for C1 and C2 is biased when there exists an unobserved confounder U. Unfortunately, the no unobserved confounding assumption usually not be satisfied in practical studies. Next, we explain how to identify and estimate the causal effect by relaxing the no unobserved confounding assumption in four cases with different types of exposure and outcome (continuous or binary). It will be shown below that causal effect β1 in (2) is not identifiable if the function F(X| C1, C2) is linear with respect to C1 and C2.

Within the causal framework provided by Fig. 1, we propose four assumptions and discuss the necessary and sufficient condition for identification of parameters in the model (2).

Assumption 1: E(ϕ(U, εY)) = 0.

Assumption 2: (C1, C2) ⊥ ϕ(U, εY), i.e. (C1, C2) are not associated with any confounder (U) of the exposure–outcome relationship and random errors εY.

Assumption 3: The effect of (X, C1, C2) on Y is linear (or no interaction).

Assumption 4: The effect of (C1, C2) on X is non-linear (e.g. with an interaction effect or the quadratic term of C1 and C2 on X).

Under Assumptions 1 and 2, E(ϕ(U, εY)| C1, C2) = 0 is satisfied. The Assumption 2 is the same as the exchangeability assumption in instrumental variable (IV) analysis – an IV is not associated with any confounder of the exposure–outcome relationship. In addition, a valid IV must have a main effect on X and no direct effect on Y, which are called the Relevance and the Exclusion restriction assumption, respectively [20]. In our method, (C1, C2) can be regarded as two near-IVs with a non-linear effect on X and a linear effect on Y as stated in the Assumption 3 and 4.

Theorem 1: The causal effect β1of X on Y in model (2) is identifiable if and only if the Assumptions 1–4 are satisfied.

See Appendix A for the proof of Theorem 1.

Causal effect is identified based on above four assumptions. And various estimation approaches have been applied to estimate the causal effect, including moment estimation, maximum likelihood estimation and generalized moment estimate model (GMM) with different assumptions [21,22,23]. In this section, we aim to find an efficient estimation of causal effect in the model (2).

In our estimation approach, the pivotal “orthogonality condition” is the independence (C1, C2) ⊥ ϕ(U, εY), which implies the following equation:

$$ E\left[\left(Y-{\beta}_0-{\beta}_1X-{\beta}_2{C}_1-{\beta}_3{C}_2\right)f\left({C}_1,{C}_2\right)\right]=\mathbf{0}, $$
(3)

where f(·) = (f1(·), ⋯, fK(·))T is an arbitrary vector function and 0 is a K × 1 zero vector.

For the case of confounders C1 and C2 with two values, we chose the function f(C1, C2) = (δ(C1 = 0, C2 = 0), δ(C1 = 0, C2 = 1), δ(C1 = 1, C2 = 0), δ(C1 = 1, C2 = 1))T, we have

$$ E\left[\left(Y-{\beta}_0-{\beta}_1X-{\beta}_2{C}_1-{\beta}_3{C}_2\right){f}^{\ast}\left({C}_1,{C}_2\right)\right]=\mathbf{0},\kern0.5em $$
(4)

where 0 is a 4 × 1 zero vector. Define β = (β0, β1, β2, β3)T,

$$ {R}^{\ast }=\left(\begin{array}{l}E\left( Y\delta \left({C}_1=0,{C}_2=0\right)\right)\\ {}E\left( Y\delta \left({C}_1=0,{C}_2=1\right)\right)\\ {}E\left( Y\delta \left({C}_1=1,{C}_2=0\right)\right)\\ {}E\left( Y\delta \left({C}_1=1,{C}_2=1\right)\right)\end{array}\right),\kern0.5em {Q}^{\ast }=\left(\begin{array}{l}E\left(\delta \left({C}_1=0,{C}_2=0\right)\right)\kern0.5em E\left( X\delta \left({C}_1=0,{C}_2=0\right)\right)\kern0.5em E\left({C}_1\delta \left({C}_1=0,{C}_2=0\right)\right)\kern0.5em E\left({C}_2\delta \left({C}_1=0,{C}_2=0\right)\right)\ \\ {}E\left(\delta \left({C}_1=0,{C}_2=0\right)\right)\kern0.5em E\left( X\delta \left({C}_1=0,{C}_2=1\right)\right)\kern0.5em E\left({C}_1\delta \left({C}_1=0,{C}_2=1\right)\right)\kern0.75em E\left({C}_2\delta \left({C}_1=0,{C}_2=1\right)\right)\kern0.5em \\ {}E\left(\delta \left({C}_1=0,{C}_2=0\right)\right)\kern0.5em E\left( X\delta \left({C}_1=1,{C}_2=0\right)\right)\kern0.5em E\left({C}_1\delta \left({C}_1=1,{C}_2=0\right)\right)\kern0.75em E\left({C}_2\delta \left({C}_1=1,{C}_2=0\right)\right)\ \\ {}E\left(\delta \left({C}_1=0,{C}_2=0\right)\right)\kern0.5em E\left( X\delta \left({C}_1=1,{C}_2=1\right)\right)\kern0.75em E\left({C}_1\delta \left({C}_1=1,{C}_2=1\right)\right)\kern0.9000001em E\left({C}_2\delta \left({C}_1=1,{C}_2=1\right)\right)\kern0.5em \end{array}\right), $$

the Eq. (4) can be rewritten as R − Qβ = 0. Solve for GMM estimator

$$ \hat{{\boldsymbol{\beta}}^{\ast }}={\hat{Q^{\ast}}}^{-1}\hat{R^{\ast }}, $$

where the elements of \( \hat{Q^{\ast }} \) and \( \hat{R^{\ast }} \) are sample means about X, C1, C2 and C1, C2, Y, respectively. Thus \( \hat{{\boldsymbol{\beta}}^{\ast }} \) is a valid estimator only if Q has full rank. The determinant of the matrix Q is

$$ {\displaystyle \begin{array}{l}\det \left({Q}^{\ast}\right)=P\left({C}_1=0,{C}_2=0\right)P\left({C}_1=0,{C}_2=1\right)P\left({C}_1=1,{C}_2=0\right)P\left({C}_1=1,{C}_2=1\right)\\ {}\cdot \left\{\left[E\left(X|{C}_1=0,{C}_2=1\right)-E\left(X|{C}_1=0,{C}_2=0\right)\right]-\left[E\left(X|{C}_1=1,{C}_2=1\right)-E\left(X|{C}_1=1,{C}_2=0\right)\right]\right\}.\end{array}} $$

Note that the nonlinearity of E[X| C1, C2] with respect to C1 and C2 implies

$$ \kern0.1em \left[E\left(X|{C}_1=0,{C}_2=1\right)-E\left(X|{C}_1=0,{C}_2=0\right)\right]-\left[E\left(X|{C}_1=1,{C}_2=1\right)-E\left(X|{C}_1=1,{C}_2=0\right)\right]\ne 0. $$

Furthermore, the estimated causal effect \( \hat{\beta_1} \) is

$$ \hat{\beta_1}=\frac{\left[E\left(Y|{C}_1=1,{C}_2=1\right)-E\left(Y|{C}_1=1,{C}_2=0\right)\right]-\left[E\left(Y|{C}_1=0,{C}_2=1\right)-E\left(Y|{C}_1=0,{C}_2=0\right)\right]}{\left[E\left(X|{C}_1=0,{C}_2=1\right)-E\left(X|{C}_1=0,{C}_2=0\right)\right]-\left[E\left(X|{C}_1=1,{C}_2=1\right)-E\left(X|{C}_1=1,{C}_2=0\right)\right]}. $$

The estimated causal effect \( \hat{\beta_1} \) is reasonable compared with the “ratio estimator” of a binary IV when C1 and C2 are regarded as two near-IVs. Denote the sub-sample average of y and x by \( {\overline{y}}_1 \) and \( {\overline{x}}_1 \) when z = 1 and by \( {\overline{y}}_0 \) and \( {\overline{x}}_0 \) when z = 0. Then \( \varDelta y/\varDelta z={\overline{y}}_1-{\overline{y}}_0 \) and \( \varDelta x/\varDelta z={\overline{x}}_1-{\overline{x}}_0 \), and “ratio estimator” of an IV [24] is

$$ {\hat{\beta}}_{IV}=\frac{{\overline{y}}_1-{\overline{y}}_0}{{\overline{x}}_1-{\overline{x}}_0}. $$

Define g(β) = E[(Y − β0 − β1X − β2C1 − β3C2)f(C1, C2)], and β can be identified only when rank(Q) = l + 2, l = 2, where l denotes the number of the observed confounders. For any f(·) that makes rank(Q) = l + 2, a GMM estimate of β is

$$ \hat{\boldsymbol{\beta}}=\arg\ \min \left\{\hat{g}\left(\boldsymbol{\beta} \right)\hbox{'}\hat{W}\hat{g}\left(\boldsymbol{\beta} \right)\right\}={\left(\hat{Q}\hbox{'}\hat{W}\hat{Q}\right)}^{-1}\hat{Q}\hbox{'}\hat{W}\hat{R}, $$
(5)

where \( \hat{W} \) is denoted as a symmetric and positive definite weight matrix for K ≥ (l + 2), K = rank(Q). The elements of \( \hat{g}\left(\boldsymbol{\beta} \right),\hat{Q} \) and \( \hat{R} \) are sample means of the corresponding elements of g(β), Q and R respectively [20].

If \( \hat{W}\to W \) as N → ∞ in probability where W is positive semi-definite and N denotes the sample size. The main general properties about GMM estimators under the appropriate regularity conditions are that,

1. \( \hat{\boldsymbol{\beta}}\to \boldsymbol{\beta} \) in probability as N → ∞ where β denotes the true parameter.

2. \( \sqrt{N}\left(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta} \right) \) converges in distribution to a normal distribution with mean zero and variance \( {\left({Q}^T WQ\right)}^{-1}{Q}^T WE\left[f\left({C}_1,{C}_2\right)f{\left({C}_1,{C}_2\right)}^T\right] WQ{\left({Q}^T WQ\right)}^{-1}{\sigma}_{res}^2 \), where \( {\sigma}_{res}^2 \) is the variance of ϕ(U, εY).

Then we describe a GMM estimator with the efficient instrument (a function of C1, C2) proposed by Newey and McFadden [25], which has the minimum variance among all estimators satisfying the Eq. (4). It also shown that there is an efficient estimator \( \hat{\boldsymbol{\beta}} \) when the instrument f(C1, C2) is defined as

$$ {f}^{\mathrm{eff}}\left({C}_1,{C}_2\right)={\left\{E\left[\frac{\partial \psi \left(X,{C}_1,{C}_2,Y\right)}{\partial \boldsymbol{\beta}}\right]|{C}_1,{C}_2\right\}}^T={\left(1,E\left[X|{C}_1,{C}_2\right],{C}_1,{C}_2\right)}^T. $$

From (5), the efficient GMM estimator of β is defined as \( {\hat{\boldsymbol{\beta}}}^{\mathrm{eff}}={\left({\left({\hat{Q}}^{\mathrm{eff}}\right)}^T\hat{W}{\hat{Q}}^{\mathrm{eff}}\right)}^{-1}{\left({\hat{Q}}^{\mathrm{eff}}\right)}^T\hat{W}{\hat{R}}^{\mathrm{eff}} \), where \( {\hat{Q}}^{\mathrm{eff}} \) and \( {\hat{R}}^{\mathrm{eff}} \) are the sample means of

$$ {\hat{Q}}^{\mathrm{eff}}=\left[\begin{array}{l}\kern1em 1\kern4em E(X)\kern3.75em E\left({C}_1\right)\kern1.5em E\left({C}_2\right)\\ {}E(X)\kern0.75em E\left[E{\left(X|{C}_1,{C}_2\right)}^2\right]\kern0.75em E\left({C}_1X\right)\kern0.75em E\left({C}_2X\right)\\ {}E\left({C}_1\right)\kern2.5em E\left({C}_1X\right)\kern3em E\left({C}_1^2\right)\kern1.5em E\left({C}_1{C}_2\right)\\ {}E\left({C}_2\right)\kern2.5em E\left({C}_2X\right)\kern2.75em E\left({C}_1{C}_2\right)\kern0.75em E\left({C}_2^2\right)\end{array}\right]\kern0.5em {\hat{R}}^{\mathrm{eff}}=\left[\begin{array}{l}\kern2.75em E(Y)\\ {}E\left[ YE\left(X|{C}_1,{C}_2\right)\right]\\ {}\kern2.25em E\left({C}_1Y\right)\\ {}\kern2.25em E\left({C}_2Y\right)\end{array}\right]. $$

The matrix Qeff equals E[f(C1, C2)f(C1, C2)T] proved in Appendix B and has full rank if the nonlinearity condition in Theorem 1 holds. Since Qeff and the positive definite \( \hat{W} \) for l = 2 are full rank, \( {\hat{\boldsymbol{\beta}}}^{\mathrm{eff}} \) is a valid estimator which can be simplified as

$$ {\hat{\boldsymbol{\beta}}}^{\mathrm{eff}}={\left({\hat{Q}}^{\mathrm{eff}}\right)}^{-1}{\hat{W}}^{-1}{\left[{\left({\hat{Q}}^{\mathrm{eff}}\right)}^T\right]}^{-1}{\left({\hat{Q}}^{\mathrm{eff}}\right)}^T\hat{W}{\hat{R}}^{\mathrm{eff}}={\left({\hat{Q}}^{\mathrm{eff}}\right)}^{-1}{\hat{R}}^{\mathrm{eff}}. $$
(6)

We find \( {\hat{\boldsymbol{\beta}}}^{\mathrm{eff}} \) does not rely on the choice of \( \hat{W} \) at this point. From the above property of the GMM estimator, the asymptotic variance of \( {\hat{\boldsymbol{\beta}}}^{\mathrm{eff}} \) is easily obtained by Qeff and \( {\sigma}_{res}^2 \)

$$ AVAR\left({\hat{\boldsymbol{\beta}}}^{\mathrm{eff}}\right)={\left({\hat{Q}}^{\mathrm{eff}}\right)}^{-1}E\left[{f}^{\mathrm{eff}}\left({C}_1,{C}_2\right){f}^{\mathrm{eff}}{\left({C}_1,{C}_2\right)}^T\right]{\left({\hat{Q}}^{\mathrm{eff}}\right)}^{-1}{\sigma}_Y^2={\left({\hat{Q}}^{\mathrm{eff}}\right)}^{-1}{\sigma}_{res}^2. $$

Additionally, we show for two binary observed confounders that any f(·) which makes the Eq. (4) have the unique solution leads to the same estimator of parameters as that obtained by f(C1, C2) in Appendix B. Therefore, for the case of two binary observed confounders, our estimator \( \hat{{\boldsymbol{\beta}}^{\ast }} \) is efficient and it is not necessary to choose an extra function f(·) to improve the efficiency.

Extend to the case of more than two binary confounders

Note that the method proposed is an easy tool to detect the causal effect in the absence of enough confounding information. Because of a growing appreciation of the power gains of multivariate association analyses, more than two covariates are generally selected to analysis in practice. In this section, for the case of more than two binary confounders, we discuss the identification of parameter β1 in the following model

$$ Y={\beta}_0+{\beta}_1X+{\beta}_2{C}_1+{\beta}_3{C}_2+\cdots +{\beta}_{l+1}{C}_l+\phi \left(U,{\varepsilon}_Y\right). $$
(7)

Define β = (β0, β1, β2, ⋯, βl + 1)T, Eq. (7) can be rewritten as Q × β = R, where Q = (1, E(X| C1, C2, ⋯Cl), C1, C2, ⋯Cl) denotes a 2l × (l + 2) matrix, β is a (l + 2) vector. According to the equation Q × β = R, parameter β1 is identifiable if K = rank(Q) ≥ (l + 2), such that the matrix Q has full column rank. We extend Assumptions 2–4 to the case of more than two binary confounders as follows:

Assumption 2*: (C1, C2, ⋯, Cl) ⊥ ϕ(U, εY), i.e. (C1, C2, ⋯, Cl) are not associated with any confounder (U) of the exposure–outcome relationship and random errors εY.

Assumption 3*: The effect of (X, C1, C2, ⋯, Cl) on Y is linear.

Assumption 4*: The effect of (C1, C2, ⋯, Cl) on X is non-linear (e.g. with an interaction effect or the quadratic term of (C1, C2, ⋯, Cl) on X).

Thus, under the Assumption 1 and the Assumptions 2*-4*, the causal effect of X on Y can be identified.

Similar to the case of two binary observed confounders C1 and C2, different f(·) for C1, C2, ⋯, Cl in (7) leads to different estimators. Here, we have

$$ E\left[\left(Y-{\beta}_0-{\beta}_1X-{\beta}_2{C}_1-{\beta}_3{C}_2-\cdots -{\beta}_{l+1}{C}_l\right)f\left({C}_1,{C}_2,\cdots, {C}_l\right)\right]=\mathbf{0}. $$
(8)

Under the conditional expectation (8), the efficient GMM estimator \( {\hat{\boldsymbol{\beta}}}^{\mathrm{eff}} \) of β is \( {\hat{\boldsymbol{\beta}}}^{\mathrm{eff}}={\left({\hat{Q}}^{\mathrm{eff}}\right)}^{-1}{\hat{R}}^{\mathrm{eff}} \), with

$$ {f}^{\mathrm{eff}}\left({C}_1,\cdots {C}_l\right)={\left\{E\left[\frac{\partial \psi \left(X,{C}_1,\cdots {C}_l,Y\right)}{\partial \boldsymbol{\beta}}\right]|{C}_1,\cdots {C}_l\right\}}^T={\left(1,E\left[X|{C}_1,\cdots {C}_l\right],{C}_1,\cdots {C}_l\right)}^T. $$

From the above property of the GMM estimator, we can obtain the asymptotic variance of \( {\hat{\boldsymbol{\beta}}}^{\mathrm{eff}} \) in this case

$$ AVAR\left({\hat{\boldsymbol{\beta}}}^{\mathrm{eff}}\right)={\left({\hat{Q}}^{\mathrm{eff}}\right)}^{-1}E\left[{f}^{\mathrm{eff}}\left({C}_1,\cdots, {C}_l\right){f}^{\mathrm{eff}}{\left({C}_1,\cdots, {C}_l\right)}^T\right]{\left({\hat{Q}}^{\mathrm{eff}}\right)}^{-1}{\sigma}_Y^2={\left({\hat{Q}}^{\mathrm{eff}}\right)}^{-1}{\sigma}_{res}^2, $$

where \( {\hat{Q}}^{\mathrm{eff}} \) and \( {\hat{R}}^{\mathrm{eff}} \) are the sample means of

$$ {Q}^{eff}=\left[\begin{array}{l}\kern1em 1\kern5.25em E(X)\kern4em E\left({C}_1\right)\kern1em \cdots \kern0.75em E\left({C}_l\right)\\ {}E(X)\kern0.75em E\left[E{\left(X|{C}_1,\cdots, {C}_l\right)}^2\right]\kern0.75em E\left({C}_1X\right)\kern0.5em \cdots \kern0.5em E\left({C}_lX\right)\\ {}E\left({C}_1\right)\kern3.5em E\left({C}_1X\right)\kern3.5em E\left({C}_1^2\right)\kern0.75em \cdots \kern0.5em E\left({C}_1{C}_l\right)\\ {}\kern1em \vdots \kern6.25em \vdots \kern6.25em \vdots \kern5.25em \vdots \\ {}E\left({C}_l\right)\kern3.5em E\left({C}_lX\right)\kern3.25em E\left({C}_1{C}_l\right)\kern0.5em \cdots \kern0.75em E\left({C}_l^2\right)\end{array}\right],\kern0.5em {R}^{eff}=\left[\begin{array}{l}\kern2.75em E(Y)\\ {}E\left[ YE\left(X|{C}_1,\cdots, {C}_l\right)\right]\\ {}\kern2.25em E\left({C}_1Y\right)\\ {}\kern3.75em \vdots \\ {}\kern2.25em E\left({C}_lY\right)\end{array}\right]. $$

Simulations

In order to investigate the performance of our method in simulation study, as well as to determine in which scenario it performs well or badly in comparison with the traditional regression approach adjusting for observed confounders, we perform simulations with four cases. In our simulations, data are generated based on the causal diagram depicted in Fig. 1.

In each case (exposure and outcome are continuous or binary respectively, denoted as Simulation A: exposure and outcome are both continuous; Simulation B: exposure is binary while outcome is continuous; Simulation C: exposure is continuous while outcome is binary; Simulation D: exposure and outcome are both binary), we take into account two scenarios where C1, C2 and U are independent, C1, C2 and U are correlated respectively (For Simulation A, two scenarios denotes as Simulation A1, Simulation A2, respectively, similar notation for Simulation B, C, D). We performed simulation to compare performances of three methods including crude association without adjusting for C1, C2 and U (model 1), model of adjusting for C1 and C2 (model 2) and the method we proposed here (model 3). We simulated baseline covariates and a quantitative exposure in a large population consisting of 2000 subjects in Simulation A, 10000 subjects in Simulation B and 100,000 subjects in Simulation C, D.

The data of Simulation A1 were generated for each individual in the following procedure:

  1. 1.

    C1~Bernoulli(N, 0.5), C2~Bernoulli(N, 0.5).

  2. 2.

    U~N(0, 1).

  3. 3.

    εX~U(0, 1), εY~U(0, 1).

  4. 4.

    X = 0.6C1 + 0.4C2 + 0.6C1C2 + 0.2U + εX.

  5. 5.

    Y = X + 0.4C1 + 0.3C2 + 0.1U + εY.

The data of Simulation A2 were generated for each individual in the following procedure:

  1. 1.

    \( \left({C}_1,{C}_2,U\right)\sim N\left(\mu, \Sigma \right),\mu =\left(\begin{array}{l}0\\ {}0\\ {}0\end{array}\right),\Sigma =\left[\begin{array}{l}1\kern1em \mathrm{d}\kern1em \mathrm{c}\\ {}\mathrm{d}\kern1em 1\kern1em \mathrm{c}\\ {}\mathrm{c}\kern1em \mathrm{c}\kern1em 1\end{array}\right],c=0.1,d=0.2 \).

  2. 2.

    Draw C1 from the first column of multivariate normal distribution, and discretize it into a binary variable with probabilityP(C1 = 1) = 0.5.

  3. 3.

    Draw C2 from the second column of multivariate normal distribution, and discretize it into a binary variable with probabilityP(C2 = 1) = 0.5.

  4. 4.

    Draw U from the third column of multivariate normal distribution.

  5. 5.

    εX~U(0, 1), εY~U(0, 1).

  6. 6.

    X = 0.6C1 + 0.4C2 + 0.6C1C2 + 0.2U + εX.

  7. 7.

    Y = X + 0.4C1 + 0.3C2 + 0.1U + εY.

The data of Simulation B1-B2, Simulation C1-C2, Simulation D1-D2 were similarly generated for each individual. Details are listed in the Appendix C.

Data application

Numerous epidemiological studies have evaluated the relationships between BMI and SBP, DBP, FBG, TG, TC, HDL and LDL, respectively, but the causal effects are inconclusive due to the existence of unobserved confounding [26,27,28]. Hence, we use the method proposed in this article to evaluate the potential causal effect of BMI on SBP, DBP, FBG, TG, TC, HDL and LDL compared with the traditional regression approach adjusting for age (discrete) and gender, adjusting for age (continuous) and gender and adjusting for age (discrete), gender and other health factors (including SBP, DBP, FBG, TC, TG, HDL and LDL) using data from a follow-up survey in Jining, Shandong Province.

Results

Simulation results

We separately varied across the main effect value α1 of C1 on X, the main effect value α2 of C2 on X, the interaction effect α3 of C1 and C2 on X, the causal effect β1 of X on Y, the confounding effect β2 of C1 on Y, the correlation coefficient c between C1 and C2 as well as correlation coefficient d between C1 and U.

We changed one parameter at a time while keeping all others at their basic values, when N = 2000 for Simulation A1-A2, N = 10,000 for Simulation B1-B2, N = 100,000 for Simulation C1-C2 and N = 100,000 for Simulation D1-D2. Results showed the estimated bias, standard error (SE) and the mean squared error (MSE) from the three models for varied effects of C1 on X, the interaction effects of C1 and C2 on X, U on X, X on Y, C1 on Y, U on Y, and the correlations among C1, C2 and U. The results showed the estimates of the model without any adjustment (model 1) and the model adjusting for both C1 and C2 (model 2) were biased. When we used the method we proposed (model 3) to analyze simulated data sets, the estimates were unbiased and had acceptable standard error. And our method had better MSE than other methods (i.e. model 1 and model 2). Figures 2 and 3 showed the results of Simulation A1, Fig. 4 showed the partial results of Simulation A2, the rest of results of Simulation A2 and other six different scenarios were showed in Figure S1, S2, S3, S4, S5, S6, S7, S8, S9, S10, S11, S12, S13 (Figure S1-S2 for Simulation A2, Figure S3-S4 for Simulation B1, Figure S5, S6, S7 for Simulation B2, Figure S8-S9 for Simulation C1, Figure S10 for Simulation C2, Figure S11-S12 for Simulation D1, Figure S13 for Simulation D2). Specifically, when varying across the effects of C1 on Y, the biases of unadjusted model (model 1) had a significant positive linear correlation with β2 and the ones of adjustment for C1 and C2 (model 2) remained stable. When varying across the effects of U on Y, the biases of both model 1 and model 2 linearly increased. Our method still had unbiased causal effect estimates and lower MSE than other two methods. Similarly, when varying across the effects of C1, U on X, respectively, the biases of model 1 and model 2 monotonically varied. Furthermore, the larger interaction effect of C1 and C2 on X, the causal effect estimation had higher precision for our proposed method (model 3). In order to make sure model 3 remains the smallest MSE among three models, we suggested the interaction effect of C1 and C2 on X moderately larger. Certainly, the bias did not change significantly from the basic scenario in any of the models when we changed the causal effect of X on Y. Moreover, when there existed an correlation between observed confounders C1 and C2, our method still got unbiased causal effect estimated. We still obtained the almost unbiased causal effect estimation if P(C1 = 1) = P(C2 = 1) = 0.5 when observed confounder C1 (or C2) and unobserved confounders U were correlated.

Fig. 2
figure 2

The Simulation A1 result. Results shows the estimated biases, SE and MSE from the 3 models for varied effects of aC1 on X, b the interaction effect of C1 and C2 on X, cU on X

Fig. 3
figure 3

The Simulation A1 result. Results shows the estimated biases, SE and MSE from the 3 models for varied effects of aX on Y, bC1 on Y, cU on Y

Fig. 4
figure 4

The Simulation A2 result. Results shows the estimated biases, SE and MSE from the 3 models for varied effects of a the correlation between X and C1, b the correlation between C1 and C2

Compared to the situation of continuous variables, sample size N needs to be larger for discrete exposure and outcome variable. Under the linear probability model setting of X, C1, C2, U on Y (i.e. 0 ≤ P(Y| X, C1, C2, U) ≤ 1), the effects of C1 on X, C2 on X, the interaction effect of C1 and C2 on X, U on X should be relatively small. Furthermore, our method had better MSE than model 2, and was similar with model 1, the estimates remained unbiased. We use R program (version 3.6.1) to reproduce all simulations and analyses which are available on Github (https://github.com/LULIU1816/Two-binary-confounder).

Data application results

In this section, we used the method proposed in this article to evaluate the potential causal effects of BMI on SBP, DBP, FBG, TG, TC, HDL and LDL using data from a follow-up survey in Jining, Shandong Province. Furthermore, we selected age and gender as near-IVs compared with the traditional regression approaches adjusting for age (discrete) and gender, adjusting for age (continuous) and gender and adjusting for age (discrete), gender and other health factors (including SBP, DBP, FBG, TC, TG, HDL and LDL). The cohort recruited 136,895 individuals aged 20- years between 2007 and 2015. In order to avoid reverse causation, we selected the data of BMI, age, gender in 2014 and SBP, DBP, FBG, TG, TC, HDL and LDL in 2015(N = 7013), respectively. Age and gender were divided into two discrete variables 0, 1. Age was discretized by median 40 as 0 and 1. Subject characteristics showed at Table 1. Proportion of female was 41.01%. Kolmogorov-Smirnov test is one of normality tests when the sample size over 5000. The median SBP was 120 (95% CI: 109–160) mmHg, the median DBP was 75 (95% CI: 66–101) mmHg. The median FBG was 5.20 (95% CI: 4.80–8.20) mmol/L, the median TC was 4.64 (95% CI: 4.01–6.76) mmol/L, the median TG was 1.04 (95% CI: 0.60–4.67) mmol/L, the median HDL was 1.29 (95% CI: 1.10–1.87) mmol/L, the median LDL was 2.76 (95% CI: 2.23–4.30) mmol/L. And, the median BMI was 24.68 (95% CI: 21.83–32.07) kg/m2.

Table 1 Subjects characteristics in Jining, Shandong Province

Results of our proposed method and the traditional regression approach adjusting for age and gender are showed in Table 2. Firstly, we examine whether two independent or correlated available confounders satisfy the non-linear condition on the exposure. Results confirmed the interaction between age and gender on BMI (P = 2.82 × 10−9). Traditional regression approach implied the significant associations between SBP, DBP, FBG, TG, TC, HDL, LDL and BMI, respectively. BMI had obviously positive correlation with SBP, DBP, FBG, TG, TC and LDL while obviously negative correlation with HDL. However, our proposed method revealed the causal effect of BMI on indicators including SBP, DBP, TG, TC, LDL. In addition, the causal effects of BMI on HDL and FBG were not significant. SBP increased 1.60 (95% CI: 0.99–2.93) mmol/L with per 1- kg/m2 higher BMI and DBP increased 0.37 (95% CI: 0.03–0.76) mmol/L with per 1- kg/m2 higher BMI. Moreover, 1- kg/m2 increase in BMI had potential causality with a 1.61-SD increase in TC (β, 1.61; 95% CI: 0.96–2.97), a 1.66-SD increase in TG (β, 1.66; 95% CI: 0.91–55.30) and a 2.01-SD increase in LDL (β, 2.01; 95% CI: 1.09–4.31). However, BMI had no potential causality with HDL (β, − 0.20; 95% CI: − 1.71-1.44). The effect of FBG per 1- kg/m2 higher BMI was 0.56 (95% CI: − 0.24-2.18). To conclude, the potential causal effects of BMI on SBP, DBP, FBG, TG, TC, HDL and LDL were almost consistent with previous studies. Furthermore, the results showed the robustness of the novel method.

Table 2 The causal effect of BMI on SBP, DBP, FBG, TG, TC, HDL and LDL

Discussion

In this paper, we present a simple and intuitive method to identify and correct for confounding bias in observation studies using a confounder–exposure nonlinear condition. In cases where the independence assumption between observed confounder and unobserved confounder is violated, a sensible approach shows the almost unbiased causal effect estimation if P(C1 = 1) = P(C2 = 1) = 0.5. In this sense, our proposed method can be viewed very much as a tool for identifying causal effect in epidemiology.

To identify and estimate causal effect with unobserved confounders, different approaches require different untestable assumptions, such as DID needs an untestable common trend assumption. Fortunately, the approach proposed in this paper requires the nonlinearity assumption of at least two observed confounders (C1, C2, ⋯, Cl) and treatment variable X, as well as the conditional expectation of unobserved confounders given observed confounders U is 0. Furthermore, the observed data contain (C1, C2, ⋯, Cl) and X which can be utilized to test our nonlinearity assumption. This is exclusive to our method compared with other methods.

For binary variable, the parameter estimates from the LPM can be directly interpreted as the effect of the exposure on the prevalence rate of the outcome which is consistent with the ACE. Conversely, logistic regression or log-linear regression are not applicable in causal inference, since they do not provide a direct interpretation of bivariate associations. However, error terms of LPM with ordinary least square estimation are heteroskedastic and predicted probability can be above 1 or below 0. Therefore, we adopt GMM to deal with these problems. Additionally, our proposed method has no assumption on the independence among observed confounders. This can be widely used in epidemiology to obtain the causal inference of the exposure on the outcome.

Another advantage is the accessibility of two observed confounders satisfying nonlinear condition. Further, we find out that using GMM still fairly reliable in our proposed method if the number of observed confounders is over two.

A number of factors must be considered before implementing the method. First, the independence assumption between observed and unobserved confounders is essential for causal estimate correction. One strategy to overcome this limitation is to select other observed confounders with the probabilities of binary values 0.5. Second, interaction between observed confounders and unobserved confounders on exposure is sufficiently strong. Finally, our method cannot identify the reverse causation. Requiring a priori knowledge or using the idea of Cross-lagged Panel Analysis may avoid this problem. Identifying causal effects across studies of differing design can therefore prove valuable in further research, whilst agreement with the result of Mendelian Randomization and randomization experiment strengthens confidence in the resulting findings and subsequent inference.

Conclusions

In conclusions, we propose a novel method to control unobserved confounding through double or more binary confounders satisfying a non-linear condition on the exposure which is easy to access. In particular, our method can handle general cases regardless of a continuous or categorical exposure and outcome. Various simulations show better estimation performance by our approach and suggest that our method will be more widely used in observational studies to explore causal association.