1 Introduction

Treatment effect estimation plays an essential role in decision making in various domains, such as healthcare, economic policy, and education. The goal of treatment effect estimation is to estimate the effect of an action taken by a decision maker. The main difficulty of treatment effect estimation based on observational data is that the treatment assignment is not randomized, which is often referred to as observational or selection bias. For example, elderly people might be more likely to receive a drug treatment than younger people. In this example, age is a variable that affects both the treatment assignment and the outcome; such a variable is called a confounding variable. We need to identify such confounding variables to mitigate the bias and obtain appropriate treatment effect estimates.

In the context of treatment effect estimation, many studies assume that the observational data include all the confounding variables. However, this assumption is too strong to be realistic, because we cannot always obtain sufficient information about individuals to guarantee that all the confounding variables are observed. Confounding variables that are not included in the observational data are often referred to as hidden confounding variables. For example, private and sensitive individual information such as income might be difficult to obtain, yet such a variable can affect both the treatment assignment and the outcome. Without knowing the confounding variables, it is impossible to know the true treatment effect, and treating proxy variables as confounding variables leads to incorrect estimands (Rothman et al., 2008; Louizos et al., 2017). Fig. 1 illustrates a graphical model of the data generation process. In this graphical model, it is indispensable to infer \({\mathbf {z}}\) correctly to know the true treatment effect. Prior studies have relied on the strong assumption that the nature of the hidden confounding variables, such as the number of their categories, is known beforehand (Cai and Kuroki, 2008). These assumptions limit the applicability of these approaches.

Recently, the Causal Effect Variational Autoencoder (CEVAE), a variational autoencoder (VAE)-based method, has been successfully applied to treatment effect estimation in the presence of hidden confounding variables (Louizos et al., 2017; Zhang et al., 2021). One of the advantages of the VAE is that it can recover a large class of hidden confounding variable models thanks to the expressive power of neural networks (Tran et al., 2015). In contrast, earlier studies required knowledge of the nature of the hidden confounding variables, such as the number of their categories.

Xu et al. (2021) employed a deep learning-based technique, but they also assumed that variables that affect only the treatment assignment can be distinguished from variables that affect only the outcome. This assumption requires prior knowledge and seems unrealistic.

However, a recent theoretical analysis revealed that the global optimum of the VAE evidence lower bound (ELBO) does not correctly model the data generation process (Zhao et al., 2019), because the VAE focuses too heavily on the reconstruction loss; this problem becomes more severe when the input variables have a much higher dimensionality than the latent variables. To mitigate this problem, InfoVAE (Zhao et al., 2019), which adds a mutual information regularizer to the VAE loss function, was proposed.

This phenomenon clearly arises in VAE-based methods for treatment effect estimation and makes recovering hidden confounding variables with a VAE difficult. We first show that there exist datasets for which the optimal solution of VAE-based methods, such as CEVAE (Louizos et al., 2017), does not give the correct treatment effect. This is a severe limitation: even though these models are expressive enough to recover the hidden confounding variables, there is no guarantee that they do so when they achieve their optimal solution.

To mitigate these problems, we propose the hidden confounding variable matching VAE, which combines a VAE with information regularization and matching to give appropriate treatment effect estimates. The proposed method obtains the correct treatment effect when it achieves the optimal solution of its loss function, even under the existence of hidden confounding variables. We summarize the contributions of this study as follows:

  • To the best of our knowledge, this is the first work to show that the optimal solution of naive VAE-based methods does not yield the correct average treatment effect (ATE) for certain types of datasets.

  • We propose an effective method based on information regularization and a matching algorithm to mitigate hidden confounding variables and bias, with a theoretical guarantee.

  • In experiments using semi-synthetic and synthetic datasets, the proposed method significantly outperformed existing methods.

Fig. 1 A graphical model for treatment effect estimation with hidden confounding variables. The hidden confounding variable \({\mathbf {z}}\) affects both the treatment assignment and the outcome. Treating the proxy variables \({\mathbf {x}}\) as ordinary confounding variables gives incorrect treatment effect estimates

2 Related work

2.1 Treatment effect estimation

Treatment effect estimation plays an essential role in decision making across various domains, such as healthcare (Eichler et al., 2016; Sekhon, 2009), economic policy (LaLonde, 1986), and education (Zhao and Heffernan, 2017). We outline important studies, ranging from established methods to modern deep learning-based methods. The goal of treatment effect estimation is to understand the effect of a specific action, i.e., a treatment. One of the classical methods for treatment effect estimation is matching (Rubin, 1973; Abadie & Imbens, 2006; King & Nielsen, 2019). Matching methods estimate the counterfactual outcome of each individual from its nearest neighbors in terms of covariates. Because the curse of dimensionality makes finding appropriate nearest neighbors difficult, propensity score matching, which defines nearest neighbors in terms of the propensity score, was developed (Rosenbaum & Rubin, 1983, 1985). Tree-based methods, such as random forests and Bayesian additive regression trees (BART), have also been applied (Chipman et al., 2010; Hill, 2011).

Recently, deep learning-based methods have been successfully applied to the treatment effect estimation problem (Shalit et al., 2017; Johansson et al., 2016; Yao et al., 2018; Yoon et al., 2018; Louizos et al., 2017; Zhang et al., 2021; Guo et al., 2020; Harada and Kashima, 2020, 2021). Counterfactual regression (CFR) encourages the representations of the treatment and control groups, extracted by neural networks, to be close to each other (Shalit et al., 2017). Perfect matching combines neural networks and propensity score matching (Schwab et al., 2018), and counterfactual propagation, which integrates matching and graph-based semi-supervised learning, estimates treatment effects using a large number of unlabeled individuals (Harada and Kashima, 2020). In particular, VAE-based methods (Louizos et al., 2017; Zhang et al., 2021) have been developed to mitigate the hidden confounding variable problem; they aim to recover hidden confounding variables through the strong expressive power of neural networks. Network-structured data have also been utilized to infer hidden confounding variables effectively (Guo et al., 2020).

2.2 VAE

The VAE is one of the most famous deep generative models (Kingma and Welling, 2013) and has been widely employed in various domains, such as computer vision (Liu et al., 2017), natural language processing (Miao et al., 2016), and chemoinformatics (Liu et al., 2018). One of the advantages of VAE-based generative models is their strong expressive power based on neural networks. The VAE has also been successfully applied to treatment effect estimation (Louizos et al., 2017; Zhang et al., 2021); the idea is to recover a joint distribution that includes the hidden confounding variables, expressed as latent variables, and then estimate the treatment effect. However, a recent theoretical analysis revealed that the VAE can ignore the latent variables at the global optimum of its loss function (Zhao et al., 2019). Hence, due to the nature of the VAE loss function, VAE-based treatment effect estimation methods face the issue that they do not necessarily provide the correct treatment effect estimates even when their loss function achieves the optimal solution, which we discuss in this paper.

Our goal is to fill the gap between the theoretical analysis of VAEs and VAE-based treatment effect estimation methods by proposing an efficient method that provides a theoretical guarantee on the treatment effect even when there are hidden confounding variables.

3 Problem statement

In this section, we state the problem setting of treatment effect estimation. Suppose \(\mathbf{x}_i \in {\mathcal {X}} \subset {\mathbb {R}}^{d_{\mathbf {x}}}\) is the \(d_{\mathbf {x}}\)-dimensional proxy variable vector of the i-th individual, \(t_i \in {{\mathcal {T}}}=\{0,1\}\) is the binary treatment applied to the i-th individual, and \(y^{t_i}_i\in {\mathcal {Y}}\subset {\mathbb {R}}\) is the outcome of the i-th individual. We omit the subscript i of a variable when the variable can represent any individual. Given a dataset \({\mathcal {D}}:=(\mathbf{x}_i, t_i, y^{t_i}_i)_{i=1}^{N}\), which includes N individuals, our goal is to estimate the conditional ATE (CATE) and the ATE, defined as:

$$\begin{aligned} \text {CATE}({\mathbf {x}}_i):= {\mathbb {E}}[y^{1}_i\mid {\mathbf {x}}_i, \text {do}(t=1)] - {\mathbb {E}}[y^{0}_i\mid {\mathbf {x}}_i, \text {do}(t=0)], \quad \text {ATE}:= {\mathbb {E}}[\text {CATE}({\mathbf {x}}_i)]. \end{aligned}$$
(1)

We make some basic assumptions in this study: (i) stable unit treatment value: the outcome of each instance is not affected by the treatments assigned to other instances; (ii) unconfoundedness: the treatment assignment to an individual is independent of the outcome given the hidden confounding variables; (iii) overlap: each individual has a positive probability of receiving each treatment; (iv) smoothness: individuals who have similar hidden confounding variables have similar outcomes; (v) noisy proxy variables: the hidden confounding variables can be recovered from noisy proxy variables.

4 Preliminaries

We briefly introduce some notable deep generative models based on VAE as preliminaries for clarity.

VAE (Kingma and Welling, 2013) is a widely used deep generative model that sets the prior distribution to the standard normal distribution. It maximizes the ELBO, which consists of a reconstruction loss and a Kullback-Leibler (KL) divergence loss. The distributions \(p_{\theta _{\mathbf {x}}}\) and \(q_\phi\) are usually parameterized by neural networks:

$$\begin{aligned} p(\mathbf{z}_i)= & {} \prod _{j=1}^{d_{\mathbf {z}}}{\mathcal {N}}(z_{ij}\mid 0, 1); p_{\theta _{\mathbf {x}}}(\mathbf{x}_i\mid \mathbf{z}_i) = \prod _{j=1}^{d_{\mathbf {x}}}p_{\theta _{\mathbf {x}}}(x_{ij}\mid \mathbf{z}_{i}); \end{aligned}$$
(2)
$$\begin{aligned} {\mathcal {L}}_{\text {ELBO}}= & {} \sum _{i=1}^{N} {\mathbb {E}}_{q_{\phi }(\mathbf{z}_i\mid \mathbf{x}_i)} [ \log p_{\theta _{\mathbf {x}}}(\mathbf{x}_i\mid \mathbf{z}_i) + \log p(\mathbf{z_i}) -\log q_\phi (\mathbf{z}_i\mid \mathbf{x}_i)] \end{aligned}$$
(3)
$$\begin{aligned}= & {} \sum _{i=1}^{N}\left( {\mathbb {E}}_{q_{\phi }(\mathbf{z}_i\mid \mathbf{x}_i)}[ \log p_{\theta _{\mathbf {x}}}(\mathbf{x}_i\mid \mathbf{z}_i)] -\text {KL}(q_{\phi }({\mathbf {z}}_i\mid {\mathbf {x}}_i), p({\mathbf {z}}))\right) . \end{aligned}$$
(4)
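To make Eqs. (2)–(4) concrete, the following is a minimal PyTorch sketch of a Gaussian VAE and its negative ELBO; all class, layer, and variable names here are our own illustration under the stated unit-variance assumptions, not part of any published implementation.

```python
import torch
import torch.nn as nn

class GaussianVAE(nn.Module):
    """Gaussian prior p(z) = N(0, I) and Gaussian posterior q(z|x), cf. Eq. (2)."""
    def __init__(self, d_x, d_z, d_h=50):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_x, d_h), nn.ELU())
        self.enc_mu = nn.Linear(d_h, d_z)
        self.enc_logvar = nn.Linear(d_h, d_z)
        self.dec = nn.Sequential(nn.Linear(d_z, d_h), nn.ELU(), nn.Linear(d_h, d_x))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def negative_elbo(model, x):
    """Negative of Eqs. (3)-(4) with a unit-variance Gaussian decoder."""
    x_hat, mu, logvar = model(x)
    rec = 0.5 * ((x - x_hat) ** 2).sum(dim=1)                    # -log p(x|z) + const
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=1)  # KL(q || N(0, I))
    return (rec + kl).mean()
```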

InfoVAE (Zhao et al., 2019) is a VAE with a mutual information regularization term. The mutual information term boils down to a divergence between the prior distribution and the marginal (aggregated) posterior distribution, and the function to be optimized is written as

$$\begin{aligned} {\mathcal {L}}_{\text {InfoVAE}}=\sum _{i=1}^{N}{\mathbb {E}}_{q_{\phi }(\mathbf{z}_i\mid \mathbf{x}_i)} [\log p_{\theta _{\mathbf {x}}}(\mathbf{x}_i\mid \mathbf{z}_i) - \text {KL}(q_{\phi }({\mathbf {z}}_i\mid {\mathbf {x}}_i), p({\mathbf {z}}))] - D(q_\phi ({{\mathbf {z}}}), p({{\mathbf {z}}})), \end{aligned}$$
(5)

where \(D(q_\phi ({{\mathbf {z}}}), p({{\mathbf {z}}}))\) is a divergence between the two distributions \(p(\mathbf{z})\) and \(q_\phi (\mathbf{z})\), and any divergence can be used given that \(D(q_\phi ({{\mathbf {z}}}), p({{\mathbf {z}}}))=0\) if and only if \(q_\phi (\mathbf{z})=p(\mathbf{z})\) (Zhao et al., 2019).
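One common instantiation of D in InfoVAE is the maximum mean discrepancy (MMD). The following is a minimal sketch (our own illustration; the RBF kernel and the fixed bandwidth are assumptions) of a biased MMD estimate between posterior samples and prior samples:

```python
import torch

def rbf_kernel(a, b, bandwidth=1.0):
    # Pairwise RBF kernel matrix between the rows of a and the rows of b.
    sq = ((a.unsqueeze(1) - b.unsqueeze(0)) ** 2).sum(dim=2)
    return torch.exp(-sq / (2.0 * bandwidth ** 2))

def mmd_divergence(z_q, z_p, bandwidth=1.0):
    """Biased MMD^2 estimate between samples z_q ~ q_phi(z) and z_p ~ p(z)."""
    return (rbf_kernel(z_q, z_q, bandwidth).mean()
            - 2.0 * rbf_kernel(z_q, z_p, bandwidth).mean()
            + rbf_kernel(z_p, z_p, bandwidth).mean())

# Usage: z_q from the encoder, z_p = torch.randn_like(z_q) from the prior.
```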

CEVAE (Louizos et al., 2017) is a recently proposed VAE-based method for CATE and ATE estimation, which aims to identify the treatment effect under the presence of hidden confounding variables. To correctly identify the treatment effect, we need to deal with hidden confounding variables. Like many previous studies, CEVAE assumes that such hidden confounding variables can be recovered from proxy variables. It takes \({\mathbf {x}}_i, t_i, y^{t_i}_i\) as inputs to infer the hidden confounding variables \({\mathbf {z}}_i\):

$$\begin{aligned} p_{\theta _t}(t_i\mid \mathbf{z}_{i})= & {} \text {Bern}(h (g(\mathbf{z}_i))), \end{aligned}$$
(6)
$$\begin{aligned} p_{\theta _y}(y^{t_i}_i\mid \mathbf{z}_i, t_i)= & {} {\mathcal {N}}(\mu ={\hat{\mu }}_i, \sigma ^{2}=1); {{{\hat{\mu }}}}_i= t_i f_1(\mathbf{z}_i) +(1-t_i)f_0(\mathbf{z}_i), \end{aligned}$$
(7)
$$\begin{aligned} q_\phi (\mathbf{z}_i\mid \mathbf{x}_i, t_i, y^{t_i}_i)= & {} \prod _{j=1}^{d_{\mathbf {z}}}{\mathcal {N}}(\mu _{ij}={{\bar{\mu }}}_{ij}, \sigma ^{2}_j=\bar{\sigma }^{2}_{ij}), \end{aligned}$$
(8)
$$\begin{aligned} \bar{\varvec{\mu }}_i= & {} t_i\bar{\varvec{\mu }}_{t=1,i}+(1-t_i)\bar{\varvec{\mu }}_{t=0,i}, \quad \bar{\varvec{\sigma }}^{2}_i=t_i {\varvec{\sigma }}^{2}_{t=1,i}+(1-t_i){\varvec{\sigma }}^{2}_{t=0,i}, \end{aligned}$$
(9)
$$\begin{aligned} \bar{\varvec{\mu }}_{t=0,i}, {\varvec{\sigma }}^{2}_{t=0,i}= & {} f_3\circ f_2 ({\mathbf {x}}_i, y_i), \bar{\varvec{\mu }}_{t=1,i}, {\varvec{\sigma }}^{2}_{t=1,i} = f_4\circ f_2 ({\mathbf {x}}_i, y_i), \end{aligned}$$
(10)

where h(x) is the sigmoid function defined as \(h(x):=\frac{1}{1+e^{-x}}\), and g, \(f_0\), \(f_1\), \(f_2\), \(f_3\), and \(f_4\) are neural networks. The variational lower bound is given as

$$\begin{aligned} {\mathcal {L}}_{\text {ELBO(CEVAE)}}=\sum _{i=1}^{N}{\mathbb {E}}_{q_{\phi }({\mathbf {z}}_i\mid {\mathbf {x}}_i, t_i, y_i)}[\log p_{\theta _{{\mathbf {x}},t}}({\mathbf {x}}_i, t_i\mid {\mathbf {z}}_i) +\log p_{\theta _y}(y^{t_i}_i\mid t_i, {\mathbf {z}}_i) -\text {KL}(q_{\phi }({\mathbf {z}}_i\mid {\mathbf {x}}_i, t_i, y_i), p({\mathbf {z}}))], \end{aligned}$$
(11)

where \(\log p_{\theta _{{\mathbf {x}},t}}({\mathbf {x}}_i, t_i\mid {\mathbf {z}}_i)=\log p_{\theta _{\mathbf {x}}}({\mathbf {x}}_i\mid {\mathbf {z}}_i) +\log p_{\theta _t}(t_i\mid {\mathbf {z}}_i)\). To produce outcomes for new individuals, CEVAE would need their treatment assignments and outcomes beforehand. It therefore employs two auxiliary loss functions to handle new individuals. Finally, the objective function of CEVAE is given as

$$\begin{aligned} {\mathcal {L}}_{\text {CEVAE}}={\mathcal {L}}_{\text {ELBO(CEVAE)}}+\sum _{i=1}^{N}\log q(t_i\mid {\mathbf {x}}_i) + \log q(y^{t_i}_i \mid {\mathbf {x}}_i, t_i). \end{aligned}$$
(12)
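For concreteness, the following is a sketch of the decoder heads in Eqs. (6)–(7); the class name, layer sizes, and tensor shapes are our own assumptions, and the published CEVAE implementation differs in details.

```python
import torch
import torch.nn as nn

class CEVAEDecoderHeads(nn.Module):
    """Sketch of p(t|z) and p(y|t,z) in Eqs. (6)-(7): a Bernoulli treatment
    head g and two outcome heads f0, f1 selected by the treatment."""
    def __init__(self, d_z, d_h=50):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(d_z, d_h), nn.ELU(), nn.Linear(d_h, 1))
        self.f0 = nn.Sequential(nn.Linear(d_z, d_h), nn.ELU(), nn.Linear(d_h, 1))
        self.f1 = nn.Sequential(nn.Linear(d_z, d_h), nn.ELU(), nn.Linear(d_h, 1))

    def forward(self, z, t):
        # t has shape (N, 1) with entries in {0, 1}.
        t_logit = self.g(z)                             # Bern(h(g(z))), Eq. (6)
        mu_hat = t * self.f1(z) + (1 - t) * self.f0(z)  # mu_hat in Eq. (7)
        return t_logit, mu_hat
```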

5 CEVAE fails to estimate CATE

Treatment effect estimation with hidden confounding variables is an essential problem. CEVAE (Louizos et al., 2017) enables us to estimate the treatment effect with hidden confounding variables without strong assumptions, because a VAE can recover a large function class; prior studies made strong assumptions, such as on the properties of the proxy variables and the hidden confounding variables. CEVAE can identify the CATE and the ATE when it recovers the joint distribution \(p(\mathbf{z}, \mathbf{x}, t, y)\).

Theorem 1

We can recover the CATE and the ATE when we recover the joint distribution \(p(\mathbf{z}, \mathbf{x}, t, y)\) in Fig. 1 (Louizos et al., 2017).

Proof

The proof is completed by applying the rules of do-calculus to Fig. 1. See the CEVAE paper for the details (Louizos et al., 2017). \(\square\)

However, one of the major drawbacks of previous VAE-based methods, including CEVAE, is that they are not guaranteed to recover the hidden confounding variables when they achieve the optimal solution, even though they have the capacity to do so. As a motivating example, we first note that there is a dataset for which the optimal solution of CEVAE does not give the correct CATE and ATE for new individuals. Note that we consider the case in which we use only the proxy variables \({\mathbf {x}}\), because assuming that we have the correct outcomes y for new individuals is unrealistic.

Theorem 2

Suppose we have a dataset \({\mathcal {D}}=\{{\mathbf {x}}_i, t_i, y^{t_i}_i\}^{N}_{i=1}\), where \({\mathbf {z}}_i\sim {\mathcal {N}}(0, 1)\), \({\mathbf {x}}_i\sim {\mathcal {N}}({\mathbf {z}}_i, 1)\), \(t_i\sim \mathrm {Bern}(\rho _t)\), and \(y_i\sim {\mathcal {N}}({\mathbb {I}}(Cz_i>0)t_i, 1)\), where \(\rho _t\) is the probability of receiving treatment and C is a constant. Suppose we only observe \({\mathbf {x}}_i=1\) or \({\mathbf {x}}_i=-1\) and \(y_i=1\) or \(y_i=-1\). The optimal solution of CEVAE for this dataset does not give the correct CATE and ATE.

Proof

Appendix.
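For intuition, the following is a numpy sketch of the generating process in Theorem 2 (our own illustration; \(\rho _t\) and C are free parameters, and the theorem further restricts attention to the observations with \({\mathbf {x}}_i=\pm 1\) and \(y_i=\pm 1\)).

```python
import numpy as np

def theorem2_data(n, rho_t=0.5, c=1.0, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, 1.0, size=n)        # hidden confounder z ~ N(0, 1)
    x = rng.normal(z, 1.0)                  # noisy proxy x ~ N(z, 1)
    t = rng.binomial(1, rho_t, size=n)      # t ~ Bern(rho_t)
    y = rng.normal((c * z > 0) * t, 1.0)    # y ~ N(I(Cz > 0) t, 1)
    # Theorem 2 then considers only the observations with x in {+1, -1}
    # and y in {+1, -1}.
    return x, t, y
```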

This result demonstrates that naive VAE-based methods are insufficient to recover hidden confounding variables and estimate the treatment effect. Because there are numerous situations in which observational data are limited and over-fitting to the observational data may occur, we need to treat this problem carefully. Here we demonstrated a specific dataset; we leave the proof of a more general form for future work.

6 InfoCEVAE with hidden confounding variables matching

The phenomenon described above arises because a VAE pushes probability masses away from each other and focuses too heavily on the reconstruction loss. This becomes more crucial when the proxy variables are high-dimensional and the hidden confounding variables are few compared to the proxy variables (i.e., \(d_{\mathbf {x}}\gg d_{\mathbf {z}}\)), especially when data are limited. Some readers might think that a larger number of proxy variables makes the unconfoundedness assumption, i.e., the assumption of no hidden confounding, more reasonable; however, we usually cannot guarantee that there are no hidden confounding variables in practice, and moreover, we sometimes never have access to the hidden confounding variables (e.g., variables containing sensitive private information) even when we can easily obtain some proxy variables.

The straightforward way to obtain the correct ATE using VAE-based methods is to exploit the theoretical analysis of InfoVAE (Zhao et al., 2019), which adds a mutual information regularization term to the original ELBO of the VAE.

The ELBO of InfoCEVAE is obtained by adding the information regularization term to that of CEVAE:

$$\begin{aligned} {\mathcal {L}}= & {} \sum _{i=1}^{N} {\mathbb {E}}_{q_{\phi }(\mathbf{z}_i\mid \mathbf{x}_i, t_i, y_i)} [\log p_{\theta _{{\mathbf {x}},t}}(\mathbf{x}_i, t_i \mid {\mathbf {z}}_i)+ \log p_{\theta _y}(y^{t_i}_i\mid t_i, {\mathbf {z}}_i) \nonumber \\&- \text {KL}(q_\phi (\mathbf{z}_i\mid {\mathbf {x}}_i, t_i, y^{t_i}_i), p({{\mathbf {z}}}) )] - D(q_\phi ({\mathbf {z}}), p(\mathbf{z})). \end{aligned}$$
(13)

We can employ one of several divergence measures D between two probability distributions, such as the 2-Wasserstein distance, provided that \(D(q({\mathbf {z}}), p({\mathbf {z}}))=0\) if and only if \(q({\mathbf {z}})=p({\mathbf {z}})\). We use the 2-Wasserstein distance as D; for two Gaussian distributions with diagonal covariances, it is written as:

$$\begin{aligned} D({\mathcal {N}}(\mu _1, \sigma _1), {\mathcal {N}}(\mu _2, \sigma _2)) = \Vert \mu _1-\mu _2\Vert ^2 + \Vert \sigma _1 - \sigma _2\Vert ^2. \end{aligned}$$
(14)
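Under diagonal-Gaussian posteriors, Eq. (14) takes only a few lines of code. In the sketch below (our own simplification; function names are assumptions), the aggregated posterior \(q_\phi ({\mathbf {z}})\) is approximated by the batch mean and standard deviation of sampled latent codes:

```python
import torch

def w2_gaussian_sq(mu1, sigma1, mu2, sigma2):
    """Squared 2-Wasserstein distance between diagonal Gaussians, Eq. (14);
    sigma denotes standard deviations."""
    return ((mu1 - mu2) ** 2).sum() + ((sigma1 - sigma2) ** 2).sum()

def w2_to_prior(z_samples):
    """Estimate D(q_phi(z), p(z)) against the N(0, I) prior by fitting a
    diagonal Gaussian to a batch of sampled latent codes (our simplification;
    the aggregated posterior is only approximately Gaussian)."""
    mu_q, sigma_q = z_samples.mean(dim=0), z_samples.std(dim=0)
    return w2_gaussian_sq(mu_q, sigma_q,
                          torch.zeros_like(mu_q), torch.ones_like(sigma_q))
```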

We can then obtain the correct CATE and ATE when the model achieves the optimal solution of the objective function, i.e., \(q_\phi (\mathbf{z})=p(\mathbf{z})\).

Theorem 3

The optimal solution of InfoCEVAE gives the correct CATE and ATE.

Proof

According to the proposition of InfoVAE (Zhao et al., 2019), at the optimal solution we have \(p_\theta (y\mid t,{\mathbf {z}})=p(y\mid t,{\mathbf {z}})\) and \(q_\phi ({\mathbf {z}}\mid {\mathbf {x}},t,y)=p({\mathbf {z}}\mid {\mathbf {x}}, t, y)\). Therefore,

$$\begin{aligned}&{\widehat{CATE}}(\mathbf{x})=p_\theta (y\mid t=1, \mathbf{x})-p_\theta (y\mid t=0, \mathbf{x}) \end{aligned}$$
(15)
$$\begin{aligned}&=\int _{{\mathcal {Z}}}p_\theta (y\mid t=1, {\mathbf {z}})q_\phi ({\mathbf {z}}\mid \mathbf{x},t=0,y) - p_\theta (y\mid t=0, {\mathbf {z}})q_\phi ({\mathbf {z}}\mid \mathbf{x},t=1,y)dz \end{aligned}$$
(16)
$$\begin{aligned}&=\int _{{\mathcal {Z}}}p(y\mid t=1, {\mathbf {z}})p({\mathbf {z}}\mid \mathbf{x},t=0,y) - p(y\mid t=0, {\mathbf {z}})p({\mathbf {z}}\mid \mathbf{x},t=1,y)dz \end{aligned}$$
(17)
$$\begin{aligned}&=\int _{{\mathcal {Z}}}p(y\mid do(t=1), {\mathbf {z}})p({\mathbf {z}}\mid \mathbf{x},do(t=0),y)\nonumber \\&\quad - p(y\mid do(t=0), {\mathbf {z}})p({\mathbf {z}}\mid \mathbf{x},do(t=1),y)dz \end{aligned}$$
(18)
$$\begin{aligned}&=p(y\mid \mathbf{x}, do(t=1))-p(y\mid \mathbf{x}, do(t=0)) \end{aligned}$$
(19)
$$\begin{aligned}&=CATE({\mathbf {x}}). \end{aligned}$$
(20)

\(\square\)

However, this naive approach requires that we obtain the correct outcome function, i.e., \(p(y\mid {\mathbf {z}},t)=p_\theta (y\mid {\mathbf {z}},t)\), as well as the propensity score function \(p(t\mid {\mathbf {z}})\). Obtaining the correct outcome function is challenging, especially when we need to account for observational bias. Suppose we obtain \(q_\phi ({\mathbf {z}})=p({\mathbf {z}})\); then our goal is to recover the joint distribution \(\int _{z}q_\phi ({\mathbf {z}}, {\mathbf {x}}, t, y)dz=p({\mathbf {x}}, t,y)\). Therefore we need to ensure that \(q({\mathbf {x}},t,y\mid {\mathbf {z}})=p({\mathbf {x}},t,y\mid {\mathbf {z}})\). Hence, to achieve the optimal solution of InfoCEVAE, we need to learn \(\theta\) such that \(p_\theta ({\mathbf {x}},t,y\mid {\mathbf {z}})=p({\mathbf {x}}, t, y\mid {\mathbf {z}})\), which means we need to learn the correct outcome function from skewed observational data alone. This is almost impossible without modification. The estimator of \(\theta _y\) given observational data is

$$\begin{aligned} \theta _y^{obs}&=\text {argmin}_{\theta _y\in \Theta }-\frac{1}{N}\sum _{i=1}^{N} {\mathbb {E}}_{q_{\phi }({\mathbf {z}}_i\mid {\mathbf {x}}_i, t_i, y_i)}[\log p_{\theta _y}(y_i\mid t_i,{\mathbf {z}}_i)] \end{aligned}$$
(21)
$$\quad \begin{aligned}&\simeq \text {argmin}_{\theta _y\in \Theta }-{\mathbb {E}}_{p_{{{\mathcal {D}}_{\text {train}}}}(t,y)} [{\mathbb {E}}_{q_\phi {({\mathbf {z}}_i\mid {\mathbf {x}},t_i,y_i)}} [\log p_{\theta _y}(y_i\mid t_i,{\mathbf {z}}_i)]]. \end{aligned}$$
(22)

However, this estimator is not consistent because of observational bias caused by hidden confounding variables.

$$\begin{aligned} \lim _{N\rightarrow \infty }\theta _y^{obs}&= \text {argmin}_{\theta _y\in \Theta }-{\mathbb {E}}_{p_{{{\mathcal {D}}_{\text {train}}}}(t,y)}[{\mathbb {E}}_{q_\phi ({{\mathbf {z}}\mid {\mathbf {x}},t,y})} [\log p_{\theta _y}(y_i\mid t_i,{\mathbf {z}}_i)]] \end{aligned}$$
(23)
$$\begin{aligned}&\ne \text {argmin}_{\theta _y\in \Theta }-{\mathbb {E}}_{p(t)p(y)}[{\mathbb {E}}_{q_\phi {({\mathbf {z}}\mid {\mathbf {x}},t,y)}} [\log p_{\theta _y}(y_i\mid t_i,{\mathbf {z}}_i)]]. \end{aligned}$$
(24)
$$\begin{aligned} \because p_{{{\mathcal {D}}_{\text {train}}}}(t, y)&=\int _{{\mathcal {Z}}}p(y\mid t, {\mathbf {z}})p(t\mid {\mathbf {z}})p({\mathbf {z}})d{\mathbf {z}}\ne \int _{{\mathcal {Z}}}p(y\mid t, {\mathbf {z}})p(t)p({\mathbf {z}})d{\mathbf {z}}\end{aligned}$$
(25)
$$\begin{aligned}&=p(t)p(y). \end{aligned}$$
(26)

Note that we assume the treatment assignment is randomized when evaluating the model. To resolve this problem, we propose an effective algorithm based on latent variables and a matching algorithm. Note that InfoCEVAE guarantees the correct treatment effect when it achieves the optimal solution, although that solution is difficult to reach; CEVAE, in contrast, cannot provide the correct treatment effect even when it achieves its optimal solution.

6.1 Hidden confounding variables matching

To mitigate the above issue, we aim to recover the hidden confounding variables from the proxy variables alone, rather than also using the outcomes as CEVAE does. This approach is reasonable because the assumption that hidden confounding variables can be recovered from proxy variables alone is quite valid when the proxy variables are high-dimensional (Zhang et al., 2021). Moreover, the advantage of using only proxy variables is that we do not need to predict outcomes for new individuals. Hence, the hidden confounding variables are inferred as:

$$\begin{aligned} q_{\phi }({\textbf {z}}_i\mid {\mathbf {x}}_i)=\prod ^{d_{\textbf {z}}}_{j=1}{\mathcal {N}}(\mu =\mu _{ij}, \sigma =\sigma _{ij}); p({\textbf {z}}_i)={\mathcal {N}}(0, 1). \end{aligned}$$
(27)

The ELBO is given as:

$$\begin{aligned} {\mathcal {L}}_{\text {InfoCEVAE}}= & {} \sum _{i=1}^{N}{\mathbb {E}}_{q_{\phi }({\mathbf {z}}_i\mid {\mathbf {x}}_i)}[\log p_{\theta _{{\mathbf {x}},t}}({\mathbf {x}}_i, t_i\mid {\mathbf {z}}_i) +\log p_{\theta _{y}}(y_i\mid t_i, {\mathbf {z}}_i) \nonumber \\&- \text {KL}(q_{\phi }({\mathbf {z}}_i\mid {\mathbf {x}}_i), p({\mathbf {z}}))] - \lambda D(q_{\phi }({\mathbf {z}}), p({\mathbf {z}})), \end{aligned}$$
(28)

where \(\lambda\) is a hyper-parameter that controls the strength of regularization.
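Putting Eq. (28) together, the following sketch computes the full loss for one minibatch (our own names throughout; `div` is the divergence value \(D(q_\phi ({\mathbf {z}}), p({\mathbf {z}}))\), e.g., from the 2-Wasserstein sketch above, and unit-variance Gaussians give squared-error reconstruction terms up to constants):

```python
import torch
import torch.nn.functional as F

def info_cevae_loss(x_hat, x, t_logit, t, mu_y, y, mu_z, logvar_z, div, lam=1.0):
    """Negative of Eq. (28): reconstruction terms, KL term, and lambda * D."""
    rec_x = 0.5 * ((x - x_hat) ** 2).sum(dim=1)                  # -log p(x|z)
    rec_t = F.binary_cross_entropy_with_logits(
        t_logit.squeeze(-1), t.float(), reduction="none")        # -log p(t|z)
    rec_y = 0.5 * (y - mu_y.squeeze(-1)) ** 2                    # -log p(y|t,z)
    kl = 0.5 * (mu_z ** 2 + logvar_z.exp() - logvar_z - 1).sum(dim=1)
    return (rec_x + rec_t + rec_y + kl).mean() + lam * div
```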

For bias mitigation, we propose latent variable matching, which uses the latent variables to match individuals. Thanks to the theoretical advantage of InfoCEVAE, we can find matches based on some metric over the latent variables. By nearest neighbor matching, we construct the counterfactual outcome for each individual i as

$$\begin{aligned} {\hat{y}}^{{\bar{t}}_i}_i = \frac{1}{k}\sum _{j\in \text {NN}({\mathbf {z}}_i, k)}y^{t_j}_j, \end{aligned}$$
(29)

where \(\text {NN}({\mathbf {z}}_i, k) = \{ i_1,\ldots ,i_k\}\) is the set of indices of the k nearest neighbors of \({\mathbf {z}}_i\) under a similarity measure, and \({\bar{t}}_i\in {\mathcal {T}}\) denotes the treatment opposite to \(t_i\). Here, we consider two variants of nearest neighbor selection: (i) the Euclidean distance between the means of the two latent variables; (ii) propensity score matching. The advantage of (i) is that it uses all the information in the latent variables and does not need to infer the propensity score, whereas it might fail to find good matches when the latent variables are high-dimensional. The pros and cons of (ii) are the opposite of those of (i). Note that under the smoothness assumption and when we achieve the optimal solution of InfoCEVAE, both hidden confounding variable matching methods yield consistent estimators. We compute the log-likelihood of the counterfactual outcomes as

$$\begin{aligned} {\mathcal {L}}_{\text {cf}}=\sum _{i=1}^{N} {\mathbb {E}}_{q_{\phi }({\mathbf {z}}_i\mid {\mathbf {x}}_i)}[\log p({\hat{y}}^{{\bar{t}}_i}_i\mid {\bar{t}}_i, {\mathbf {z}}_i)]. \end{aligned}$$
(30)
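The nearest-neighbor step in Eq. (29) can be sketched with scikit-learn as follows (our own illustration of variant (i), Euclidean matching on the posterior means; `k` and all variable names are assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def matched_counterfactuals(z_mean, t, y, k=5):
    """For each individual, estimate the counterfactual outcome as the mean
    outcome of its k nearest latent-space neighbors in the other group, Eq. (29)."""
    y_cf = np.empty_like(y, dtype=float)
    for group in (0, 1):
        this, other = (t == group), (t != group)
        nn = NearestNeighbors(n_neighbors=k).fit(z_mean[other])
        _, idx = nn.kneighbors(z_mean[this])        # indices NN(z_i, k)
        y_cf[this] = y[other][idx].mean(axis=1)     # average over k neighbors
    return y_cf
```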

Finally, the objective function to be optimized is given as

$$\begin{aligned} {\mathcal {L}}={\mathcal {L}}_{\text {cf}} +{\mathcal {L}}_{\text {InfoCEVAE}}. \end{aligned}$$
(31)

Theorem 4

The optimal solution of InfoCEVAE with hidden confounding variable matching gives a consistent treatment effect estimator under the smoothness assumption.

Proof

According to the theorem of InfoVAE (Zhao et al., 2019), we obtain the correct posterior when we achieve the optimal solution. Using the correct hidden confounding variables, we can obtain correct counterfactual outcomes under the smoothness assumption. Using the correct counterfactual outcomes as well as the factual outcomes, we obtain a consistent estimator, which yields the correct ATE. \(\square\)

7 Experiments

We validated the performance of the proposed method, especially for the case in which there are hidden confounding variables. We first introduce the datasets used in the experiments and then detail the experimental settings.

7.1 Datasets

Real-world datasets are rarely available due to the counterfactual nature of the treatment effect estimation problem. We therefore employed a widely used semi-synthetic dataset and a synthetic dataset in this experiment.

7.1.1 News dataset (Johansson et al., 2016)

This is a dataset of opinions of media consumers on news articles (Johansson et al., 2016). It contains 5,000 news articles and outcomes generated from the NY Times corpus. Each article is consumed on desktop (\(t=0\)) or mobile (\(t=1\)), and it is assumed that media consumers prefer to read some articles on mobile rather than on desktop. Following previous research (Johansson et al., 2016), we set the scale parameter for the outcome to 200. Each article is generated by a topic model and represented as a bag-of-words vector; the vocabulary size is 3,477. As preprocessing, we apply principal component analysis (PCA) with \(d_{\mathbf {z}}=30\). To simulate the hidden confounding variable setting, we generate proxy variables from the variables after PCA. More concretely, we treat these variables as the hidden confounding variables \(z_{ij}\) and generate the proxy variables as

$$\begin{aligned}&x_{i, j\times 1, \ldots , j\times d_{\text {proxy}}}\sim {\mathcal {N}}(z_{ij}, \sigma ^{2}_{\mathbf {z}}), \end{aligned}$$
(32)
$$\begin{aligned}&{\mathbf {x}}_i=[x_{i,1}, \ldots , x_{i, 30\times d_{\text {proxy}}}], \end{aligned}$$
(33)

where \(\sigma _{\mathbf {z}}\) is the standard deviation of all the variables after PCA, \(d_{\text {proxy}}\) denotes the number of proxy variables per hidden confounding variable, and \([\,]\) represents concatenation. We set \(d_{\text {proxy}}\) to 30 for the News dataset.

7.1.2 Synthetic dataset

The Synthetic dataset is a benchmark generated for this study. It includes 5,000 individuals, binary treatments, and continuous outcomes. We generated the dataset according to the following procedure:

$$\begin{aligned}&z_{ij}\sim {\mathcal {N}}(0, 1)\ \ (j=1,\ldots ,5), \end{aligned}$$
(34)
$$\begin{aligned}&x_{i, j\times 1, \ldots , j\times d_{\text {proxy}}}\sim {\mathcal {N}}(10z_{ij}, 1), \end{aligned}$$
(35)
$$\begin{aligned}&{\mathbf {x}}_i=[x_{i,1}, \ldots , x_{i, 5\times d_{\text {proxy}}}], \end{aligned}$$
(36)
$$\begin{aligned}&t_i\sim \text {Bern}\left( h\left( \alpha \sum _{j=1}^{5}z_{ij}\right) \right) ,\quad y_i\sim {\mathcal {N}}\left( 3\,{\mathbb {I}}\left( \sum _{j=1}^{5}z_{ij}\ge 0\right) \times t_i + 5t_i,\ 1\right) , \end{aligned}$$
(37)

where \(\alpha \ge 0\) controls the strength of the observational bias, and \({\mathbb {I}}(x)\) is an indicator function that is 1 if x is true and 0 otherwise. Recall that h is the sigmoid function. A larger \(\alpha\) means a severer observational bias, and setting \(\alpha\) to 0 corresponds to a randomized controlled trial. We clamped the treatment assignment probability to the range [0.01, 0.99]. We vary \(d_{\text {proxy}}\) from 10 to 500 for the Synthetic dataset. Unless otherwise stated, we report the results for \(d_{\text {proxy}}=500\).
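The following is a numpy sketch of this generating process (our own code, following Eqs. (34)–(37) as written above; the clamping of the treatment probability follows the text):

```python
import numpy as np

def synthetic_dataset(n=5000, d_proxy=500, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, 1.0, size=(n, 5))                  # Eq. (34)
    # Eqs. (35)-(36): d_proxy noisy replicates per hidden confounder, concatenated.
    x = rng.normal(10.0 * np.repeat(z, d_proxy, axis=1), 1.0)
    s = z.sum(axis=1)
    p_t = np.clip(1.0 / (1.0 + np.exp(-alpha * s)), 0.01, 0.99)  # h, then clamp
    t = rng.binomial(1, p_t)                               # Eq. (37)
    y = rng.normal(3.0 * (s >= 0) * t + 5.0 * t, 1.0)
    return x, t, y, z
```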

In the experiments, we investigated the robustness against the bias strength by changing the value of \(\alpha\).

7.2 Experimental settings

We split all the individuals into train, validation, and test data with proportions of \(20\%\), \(40\%\), and \(40\%\), respectively. Note that we especially focus on the case in which the training data are limited, because over-fitting becomes severer. For all neural network models, including the VAE-based methods, we use two-layer neural networks. We set the number of neurons (i.e., the dimensionality of the representations) to 50 for TARNet and CFR. We use the ELU function (Clevert et al., 2015) as the activation function for all neural networks.

As evaluation metrics, we employ ATE error defined as

$$\begin{aligned} \epsilon _{\text {ATE}}=\frac{1}{N}\sum _{i=1}^{N}\left| y^{1}_i -y^{0}_i -\left( {\hat{y}}^{1}_i - {\hat{y}}^{0}_i\right) \right| \end{aligned}$$

and the precision in estimation of heterogeneous effect (PEHE) used in previous studies (Hill, 2011; Johansson et al., 2016). \(\epsilon _{\text {PEHE}}\) is the estimation error of the individual treatment effects and is defined as

$$\begin{aligned} \sqrt{\epsilon _{\mathrm{PEHE}}}=\sqrt{\frac{1}{N}\sum _{i=1}^{N}(y^{1}_i -y^{0}_i -({\hat{y}}^{1}_i - {\hat{y}}^{0}_i))^{2}}. \end{aligned}$$
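Both metrics are straightforward to compute when the true potential outcomes are available, as in our (semi-)synthetic data; a short numpy sketch (function and argument names are our own):

```python
import numpy as np

def ate_error(y1, y0, y1_hat, y0_hat):
    """epsilon_ATE: mean absolute error of the estimated individual effects."""
    return np.mean(np.abs((y1 - y0) - (y1_hat - y0_hat)))

def sqrt_pehe(y1, y0, y1_hat, y0_hat):
    """sqrt(epsilon_PEHE): root mean squared error of the estimated effects."""
    return np.sqrt(np.mean(((y1 - y0) - (y1_hat - y0_hat)) ** 2))
```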

The hyper-parameters are tuned based on the prediction loss on the observed outcomes of the validation data. We draw the hyper-parameter \(\lambda\) log-uniformly at random from the range \([10^{-3}, 10^{3}]\) ten times, and the final value is selected based on the prediction loss on the outcomes of the validation data. For CEVAE, we compute the ELBO on the validation data and use the model at the epoch at which the validation ELBO achieves its maximum value. We report the average results of 10 trials on the Synthetic dataset and 20 trials on the News dataset.
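The log-uniform sampling of \(\lambda\) amounts to a one-liner (a sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
# Ten log-uniform draws of lambda over [1e-3, 1e3].
lambdas = 10.0 ** rng.uniform(-3.0, 3.0, size=10)
```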

7.3 Baseline methods

We compare the proposed method with the following baseline methods, including VAE-based methods. Unless otherwise stated, we use the concatenation of the proxy variables and the treatment assignment, coded as a one-hot vector, as the input to the predictive models of (i) and (ii).

  (i) Ridge is ordinary linear regression with L2 regularization.

  (ii) Random forest (RF) (Breiman, 2001) and BART (Chipman et al., 2010; Hill, 2011) are predictive models based on decision trees.

  (iii) TARNet (Shalit et al., 2017) is a deep neural network model with shared layers for representation learning and separate layers for outcome prediction for treatment and control instances. Counterfactual regression (CFR) (Shalit et al., 2017) is a state-of-the-art deep neural network model based on balanced representations between treatment and control instances; we use the Wasserstein distance variant.

  (iv) CEVAE (Louizos et al., 2017) is a VAE-based treatment effect estimation method.

7.4 Results

We first assess the full results in comparison with the baseline methods, and then investigate how the performance changes as we vary the number of proxy variables and the strength of the observational bias. Table 1 compares the proposed method with the baseline methods. Overall, the proposed method significantly outperforms the baseline methods. On the News dataset, both variants of the proposed method show significant improvements over the baseline methods. On the Synthetic dataset, the proposed method with propensity score matching works better. This result makes sense because the propensity score and the outcome are strongly correlated in this dataset. In contrast, the proposed method with Euclidean matching does not work well, because individuals that are near in terms of the Euclidean distance of the hidden confounding variables do not necessarily form good matches unless a large number of individuals is available. While predictive performance deteriorates as the selection bias becomes stronger, the proposed method is robust to selection bias and consistently outperforms the baseline methods. Figures 2 and 3 show the change in predictive performance as we vary the strength of bias \(\alpha\) and the number of proxy variables \(d_{\text {proxy}}\). Whereas the baseline methods suffer from observational bias, the proposed method is robust to it. Moreover, although the baseline methods achieve only limited improvement, the proposed method can exploit high-dimensional proxy variables to improve its predictive performance.

Table 1 Performance comparison on the News dataset and the Synthetic dataset in terms of PEHE and ATE. Lower is better.
Fig. 2 Performance comparison as the observational bias \(\alpha\) changes. Lower is better. Whereas the baseline methods suffer from observational bias and their performance degrades, the proposed method is robust to the observational bias and almost entirely surpasses the baseline methods in both metrics. In particular, the proposed method consistently shows favorable performance in terms of the ATE error

Fig. 3 Performance comparison as the number of proxy variables changes. Lower is better. While the baseline methods do not improve their predictive performance as the number of proxy variables increases, the proposed method with propensity score matching achieves almost entirely the best results, with especially significant gains in \(\sqrt{\epsilon _{\text {PEHE}}}\)

8 Conclusion

In this study, we considered the treatment effect estimation problem with hidden confounding variables using VAEs. VAEs have been used to recover hidden confounding variables by making use of their large capacity. We first pointed out that the optimal solution of CEVAE does not yield the correct ATE. We then proposed an efficient algorithm that recovers hidden confounding variables and estimates the treatment effect by making use of mutual information regularization and matching techniques. Experiments on semi-synthetic and synthetic datasets demonstrated the effectiveness of the proposed method, especially when the proxy variables are high-dimensional but hidden confounding variables remain.