Adversarial Balancing-based Representation Learning for Causal Effect Inference with Observational Data

Learning causal effects from observational data greatly benefits a variety of domains such as health care, education and sociology. For instance, one could estimate the impact of a new drug on specific individuals to assist clinical planning and improve survival rates. In this paper, we focus on the problem of estimating the Conditional Average Treatment Effect (CATE) from observational data. The challenges for this problem are two-fold: on the one hand, we have to derive a causal estimator to estimate the causal quantity from observational data, where there exists confounding bias; on the other hand, we have to deal with the identification of CATE when the distributions of covariates in the treatment and control groups are imbalanced. To overcome these challenges, we propose a neural network framework called Adversarial Balancing-based representation learning for Causal Effect Inference (ABCEI), based on recent advances in representation learning. To ensure the identification of CATE, ABCEI uses adversarial learning to balance the distributions of covariates in the treatment and control groups in the latent representation space, without any assumption on the form of the treatment selection/assignment function. In addition, during the representation learning and balancing process, highly predictive information from the original covariate space might be lost. ABCEI tackles this information loss problem by preserving useful information for predicting causal effects under the regularization of a mutual information estimator. The experimental results show that ABCEI is robust against treatment selection bias, and matches or outperforms the state-of-the-art approaches. Our experiments show promising results on several datasets, representing different health care domains among others.


Introduction
Many domains of science require the inference of causal effects, including healthcare (Casucci et al., 2017), economics and marketing (LaLonde, 1986; Smith and Todd, 2005), sociology (Morgan and Harding, 2006) and education (Zhao and Heffernan, 2017). For instance, medical scientists must know whether a new medicine is beneficial for patients; teachers want to know whether their teaching plan significantly improves the grades of students; economists need to evaluate whether a policy reduces unemployment rates. Due to the broad application of machine learning models in these domains, properly estimating causal effects is an important task for machine learning research.
The classical method to estimate causal effects is the Randomized Controlled Trial (RCT) (Autier and Gandini, 2007), where we have to maintain two statistically identical groups and randomly assign treatments to each individual to observe the outcomes. However, RCTs can be time-consuming, expensive, or unethical (e.g., for studying the effect of smoking on health). Hence, causal effect inference through observational studies is needed (Benson and Hartz, 2000). The core issue of causal effect inference from observational data is the identification problem: given a set of assumptions and the non-experimental data, whether it is possible to derive a model that correctly estimates the strength of a causal effect from certain quantities.
In this paper, our aim is to build a machine learning model that is able to estimate the Conditional Average Treatment Effect (CATE) from observational data. There are several challenges for this task. First, there might be spurious associations between treatments and outcomes, caused by confounding variables: variables that affect both the treatment variables and the outcome variables. For example, patients with more personal wealth are in a better position to get new medicines, and at the same time their wealth increases the likelihood that they survive. Due to the existence of confounding bias, it is nearly impossible to build the estimator by directly modeling the relations between treatments and outcomes. Strong ignorability in the Potential Outcome framework (Rubin, 2005) provides a way to estimate the causal quantities using the adjustment estimand with statistical quantities. In order to satisfy ignorability in practical studies, methods have been derived to match/balance the covariates, e.g., based on mutual information between treatment variables and covariates (Sun and Nikolaev, 2016), or based on propensity scores (Dehejia and Wahba, 2002). However, these methods are only feasible for the estimation of the Average Treatment Effect (ATE) or the Average Treatment effect on the Treated (ATT). Pearl (2009) proposes a criterion based on graphical models to select admissible covariates for ignorability. Throughout this paper, we assume that all variables in the causal system can be observed and measured, so that the causal effects we are interested in are identifiable from the observational data. This assumption allows us to build causal quantity estimators for each outcome system conditioning on the covariates.
Another challenge for CATE estimation is that in observational studies we can only observe the factual outcomes; the counterfactual outcomes can never be observed. When there is treatment selection bias, the imbalanced distributions of covariates in the treatment and control groups lead to biased CATE estimates due to the generalization error (Swaminathan and Joachims, 2015). Various techniques have been proposed to tackle this problem. Yao et al. (2018) propose to use hard samples to preserve local similarity information from the covariate space in the latent representation space. Their hard sample mining process is highly dependent on the propensity score model, and is not robust when the propensity score model is misspecified. Imai and Ratkovic (2014) and Ning et al. (2018) propose estimators that remain robust even when the propensity score model is not correctly specified. Kallus (2018a,b) and Ozery-Flato et al. (2018) propose to generate balancing weights for data samples that minimize a selected imbalance measure in covariate space. Shalit et al. (2017) derive upper bounds on the estimation error by considering both covariate balancing and potential outcomes. However, highly predictive information might be lost in the reweighting or balancing processes of these methods.
To address these problems, we propose a framework (cf. Figure 1) that generates balanced representations while preserving highly predictive information in latent space, without using propensity scores. We design a two-player adversarial game between an encoder, which transforms covariates into latent representations, and a discriminator, which distinguishes representations of the control group from those of the treatment group. Unlike in the classical GAN framework, here the 'true distribution' (the latent representations of the control group) must also be generated by the encoder. On the other hand, to prevent losing useful information during the balancing process, we use a mutual information estimator to constrain the encoder to preserve highly predictive information (Hjelm et al., 2018). The outcome data are also considered in this unified framework to specify the causal effect predictor.
Technically, the unified framework encodes the input covariates into a latent representation space, and builds estimators that predict the treatment outcomes from those representations. There are three components on top of the encoder in our model: (1) mutual information estimation: an estimator is specified to estimate and maximize the mutual information between representations and covariates; (2) adversarial balancing: the encoder plays an adversarial game with a discriminator, trying to fool the discriminator by minimizing the discrepancies between the distributions of representations from the treatment and control groups; (3) treatment outcome prediction: a predictor over the latent space is employed to estimate the treatment outcomes. By jointly optimizing the three components via back-propagation, we obtain a robust estimator of the CATE. The overarching architecture of our framework is shown in Figure 1. In summary, our main contributions are:
1. We propose a novel model: Adversarial Balancing-based representation learning for Causal Effect Inference (ABCEI) with observational data. ABCEI addresses information loss and treatment selection bias by learning highly informative and balanced representations in latent space.
2. A neural network encoder is constrained by a mutual information estimator to minimize the information loss between representations and the input covariates, which preserves highly predictive information for causal effect inference.
3. We employ an adversarial learning method to balance representations between treatment and control groups, which deals with the treatment selection bias problem without any assumption on the form of the treatment selection function, unlike, e.g., the propensity score method.
4. We conduct various experiments on synthetic and real-world datasets.
ABCEI outperforms most of the state-of-the-art methods on benchmark datasets. We show that ABCEI is robust against different experimental settings. By supporting mini-batch training, ABCEI can be applied to large-scale datasets.

Problem Setup
Assume an observational dataset {X, T, Y}, with covariate matrix X ∈ R^{n×k}, binary treatment vector T ∈ {0, 1}^n, and treatment outcome vector Y ∈ R^n.
Here, n denotes the number of observed units, and k denotes the number of covariates in the dataset. For each unit u, we have k covariates x_1, ..., x_k, associated with one treatment variable t ∈ {0, 1} and one treatment outcome y. According to the Rubin-Neyman causal model (Rubin, 2005), two potential outcomes y_0, y_1 exist for treatments 0 and 1, respectively. We call y_t the factual outcome, denoted by y_f, and y_{1−t} the counterfactual outcome, denoted by y_cf.
Assuming there is a joint distribution P(x, t, y_0, y_1), we make the following assumptions:

Assumption 1 (Strong Ignorability) Conditioning on x, the potential outcomes y_0, y_1 are independent of t, which can be stated as: (y_0, y_1) ⊥⊥ t | x.
Assumption 2 (No Interference) The treatment outcome of each individual is not affected by the treatment assignments of other units.

Assumption 3 (Consistency) The potential outcome y_t of each individual equals the observed outcome y if the actual treatment received is t, which can be represented as: y = y_t if T = t, ∀t.
Assumption 4 (Positivity) For all sets of covariates and for all treatments, the probability of treatment assignment is always strictly larger than 0 and strictly smaller than 1, which can be expressed as: 0 < P(t|x) < 1, ∀t, ∀x.
Assumption 1 indicates that all confounders are observed, i.e., no unmeasured confounder is present. Hence, by controlling for X, we can remove the confounding bias. Assumption 4 allows us to estimate the CATE for any x in the covariate space. Under these assumptions, we can formalize the definition of CATE for our task:

Definition 1 For unit u with covariates x_u, the Conditional Average Treatment Effect is: CATE(u) = E[y_1 − y_0 | x_u].

We can now define the Average Treatment Effect (ATE) and the Average Treatment effect on the Treated (ATT) as: ATE = E_x[CATE(u)] and ATT = E_x[CATE(u) | t = 1].

Because the joint distribution P(x, t, y_0, y_1) is unknown, we can only estimate CATE(u) from observational data. A function over the covariate space X can be defined as f : X × {0, 1} → Y. The estimate of CATE(u) can now be defined:

Definition 2 Given an observational dataset {X, T, Y} and a function f, for unit u, the estimate of CATE(u) is: f(x_u, 1) − f(x_u, 0).

In order to properly accomplish the task of CATE estimation, we need to find an optimal function over the covariate space for both systems (t = 1 and t = 0).
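Definition 2 reduces CATE estimation to fitting an outcome function f. The following sketch shows the resulting plug-in estimates of CATE, ATE, and ATT; the linear f here is a hypothetical stand-in for a learned model, not the paper's estimator:

```python
import numpy as np

def estimate_effects(f, X, T):
    """Plug-in effect estimates from a fitted outcome function f(X, t)."""
    cate = f(X, 1) - f(X, 0)      # per-unit estimate from Definition 2
    ate = cate.mean()             # average over all units
    att = cate[T == 1].mean()     # average over the treated units only
    return cate, ate, att

# Hypothetical outcome model with a constant treatment effect of 2.0
f = lambda X, t: X.sum(axis=1) + 2.0 * t

X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
T = np.array([1, 0, 1])
cate, ate, att = estimate_effects(f, X, T)
# cate is [2.0, 2.0, 2.0]; ate and att are both 2.0
```

With a constant-effect model the three quantities coincide; heterogeneous effects would make CATE vary per unit while ATE/ATT remain averages.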

Proposed Method
In order to overcome the challenges in CATE estimation, we build our model on recent advances in representation learning. We propose to define a function Φ : X → H and a function Ψ : H → Y. Then we have Y_T = f(X, T) = Ψ(Φ(X), T) = Ψ(h, T). Instead of directly estimating the treatment outcome conditioned on covariates, we first use an encoder to learn latent representations of the covariates; we simultaneously learn these latent representations and estimate the treatment outcomes. However, the function f would still suffer from information loss and treatment selection bias, unless we constrain the encoder Φ to learn balanced representations while preserving useful information.

Mutual Information Estimation
Consider the information loss when transforming covariates into latent space. The non-linear statistical dependencies between variables can be captured by mutual information (MI) (Kinney and Atwal, 2014). Thus we use the MI between latent representations and original covariates as a measure to account for information loss. We denote the joint distribution between covariates and representations by P_Xh and the product of marginals by P_X ⊗ P_h. From the viewpoint of Shannon information theory, mutual information can be represented as a Kullback-Leibler (KL) divergence: I(X; h) = D_KL(P_Xh || P_X ⊗ P_h). It is hard to compute MI in continuous and high-dimensional spaces, but one can capture a lower bound of MI with the Donsker-Varadhan representation of the KL-divergence (Donsker and Varadhan, 1983):

Theorem 1 (Donsker-Varadhan) D_KL(P_Xh || P_X ⊗ P_h) ≥ sup_{Ω∈C} E_{P_Xh}[Ω(x, h)] − log E_{P_X⊗P_h}[e^{Ω(x,h)}].

Here, C denotes the set of unconstrained functions Ω.
Proof Given a fixed function Ω, we can define a distribution G by: dG = (e^Ω / E_{P_X⊗P_h}[e^Ω]) d(P_X ⊗ P_h). Equivalently, we have: log (dG / d(P_X ⊗ P_h)) = Ω − log E_{P_X⊗P_h}[e^Ω]. Then by construction, we have: E_{P_Xh}[Ω] − log E_{P_X⊗P_h}[e^Ω] = E_{P_Xh}[log (dG / d(P_X ⊗ P_h))] = D_KL(P_Xh || P_X ⊗ P_h) − D_KL(P_Xh || G) ≤ D_KL(P_Xh || P_X ⊗ P_h). When the distribution G is equal to P_Xh, this bound is tight.
Inspired by Mutual Information Neural Estimation (MINE) (Belghazi et al., 2018), we propose to establish a neural network estimator for MI. Specifically, let Ω be a function X × H → R parametrized by a deep neural network. We then have:

I(X; h) ≥ I_Ω(X; h) = E_{P_Xh}[Ω(x, h)] − log E_{P_X⊗P_h}[e^{Ω(x,h)}].  (1)

By distinguishing the joint distribution from the product of marginals, the estimator Ω approximates the MI with arbitrary precision. In practice, as shown in Figure 2, we concatenate the input covariates X with representations h one by one to create positive samples (samples from the true joint distribution). Then, we randomly shuffle X on the batch axis to create fake input covariates X̃. Representations h are concatenated with the fake input X̃ to create negative samples (samples from the product of marginals). From Equation (1) we can derive the loss function for the MI estimator:

L_ΦΩ = −(E_{P_Xh}[Ω(x, h)] − log E_{P_X⊗P_h}[e^{Ω(x,h)}]).
Information loss can be diminished by simultaneously optimizing the encoder Φ and the MI estimator Ω to minimize L_ΦΩ iteratively via gradient descent.
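A minimal numerical sketch of the batch estimate behind this procedure (NumPy only; a fixed bounded critic stands in for the trained network Ω, and shuffling along the batch axis approximates the product of marginals):

```python
import numpy as np

rng = np.random.default_rng(0)

def dv_lower_bound(omega, X, h):
    """One-batch estimate of the Donsker-Varadhan bound on I(X; h).

    Positive pairs come from the joint distribution; negative pairs
    couple h with a batch-shuffled copy of X, approximating the
    product of marginals.
    """
    X_fake = X[rng.permutation(len(X))]            # shuffled covariates
    joint_term = omega(X, h).mean()
    marginal_term = np.log(np.exp(omega(X_fake, h)).mean())
    return joint_term - marginal_term              # lower-bounds the MI

X = rng.normal(size=(512, 4))
h = 2.0 * X                                        # highly informative 'encoding'
omega = lambda a, b: np.tanh((a * b).sum(axis=1))  # bounded stand-in critic
mi_estimate = dv_lower_bound(omega, X, h)          # clearly positive here
```

In ABCEI the critic Ω is itself a neural network trained to maximize this bound, and the encoder Φ is optimized jointly with it by minimizing L_ΦΩ.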

Adversarial Balancing
The representations of the treatment and control groups are denoted by h(t = 1) and h(t = 0), corresponding to the input covariate groups X(t = 1) and X(t = 0). To make CATE estimation reliable, the discrepancy between the distributions of the treatment and control groups must be reduced. To decrease this discrepancy, we propose an adversarial learning method that constrains the encoder to learn treatment and control representations with balanced distributions. We build an adversarial game between a discriminator D and the encoder Φ, inspired by the framework of Generative Adversarial Networks (GAN) (Goodfellow et al., 2014). In the classical GAN framework, a source of noise is mapped to a generated image by a generator, and a discriminator is trained to distinguish whether an input sample comes from the true image distribution or the synthetic one produced by the generator. The aim of the classical GAN is to train a reliable discriminator to distinguish fake and real images, and to use this discriminator to train a generator that fools it.
In our adversarial game: (1) we draw a noise vector z ∼ P(z) with the same length as the latent representations, where P(z) can be a spherical Gaussian or a uniform distribution; (2) we separate the representations by treatment assignment, forming two distributions P_h(t=1) and P_h(t=0); (3) we train a discriminator D to distinguish the concatenated vectors [z, h(t = 1)] and [z, h(t = 0)] from the treatment and control groups; (4) we optimize the encoder Φ to generate balanced representations that fool the discriminator.
According to the architecture of ABCEI, the encoder is associated with the MI estimator Ω, the treatment outcome predictor Ψ, and the adversarial discriminator D. This means that the training process iteratively adjusts each of these components, which aggravates the well-known instability of GAN training. To stabilize training, we adopt the framework of Wasserstein GAN with gradient penalty (Gulrajani et al., 2017). By removing the sigmoid output layer and applying a gradient penalty to samples between the distributions of the treatment and control groups, we search for a discriminator D satisfying the 1-Lipschitz constraint. Our adversarial game then takes the form:

min_Φ max_D E_{h∼P_h(t=0)}[D([z, h])] − E_{h∼P_h(t=1)}[D([z, h])],

where the Lipschitz constraint is enforced by the gradient penalty β E_{h̄∼P_penalty}[(‖∇_h̄ D([z, h̄])‖_2 − 1)^2], and P_penalty is the distribution acquired by uniformly sampling along straight lines between pairs of samples from P_h(t=0) and P_h(t=1). The adversarial learning process is illustrated in Figure 3.
This ensures that the encoder Φ is smoothly trained to generate balanced representations. The training objectives for the discriminator and the encoder are, respectively:

L_D = E_{h∼P_h(t=1)}[D([z, h])] − E_{h∼P_h(t=0)}[D([z, h])] + β E_{h̄∼P_penalty}[(‖∇_h̄ D([z, h̄])‖_2 − 1)^2],
L_Φ = E_{h∼P_h(t=0)}[D([z, h])] − E_{h∼P_h(t=1)}[D([z, h])].
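The sampling of P_penalty and the shape of the discriminator loss can be sketched as follows. To keep the example self-contained the critic is linear, so its input gradient is analytic; in ABCEI the critic is an MLP and the gradient comes from automatic differentiation. Treating the control group as the 'real' distribution follows the framing earlier in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def discriminator_loss(w, h_ctrl, h_trt, beta=10.0):
    """WGAN-GP style critic loss for a linear critic D(h) = h @ w."""
    # Wasserstein term: score control (real) high, treatment (fake) low.
    wasserstein = (h_trt @ w).mean() - (h_ctrl @ w).mean()
    # P_penalty: uniform points on lines between control/treatment pairs.
    eps = rng.uniform(size=(len(h_ctrl), 1))
    h_hat = eps * h_ctrl + (1.0 - eps) * h_trt
    # For a linear critic the gradient at every h_hat is simply w.
    grad_norm = np.linalg.norm(w)
    penalty = beta * (grad_norm - 1.0) ** 2
    return wasserstein + penalty

h_ctrl = rng.normal(size=(128, 8))            # representations, t = 0
h_trt = rng.normal(loc=0.5, size=(128, 8))    # representations, t = 1
w = np.full(8, 1.0 / np.sqrt(8.0))            # unit-norm critic: zero penalty
loss = discriminator_loss(w, h_ctrl, h_trt)
```

A unit-norm linear critic already satisfies the 1-Lipschitz constraint, so the penalty vanishes and the loss reduces to the (negated) Wasserstein term; a trained MLP critic is pushed toward the same property by the penalty.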

Treatment Outcome Prediction
The final step of CATE estimation is to predict the treatment outcomes from the learned representations. We establish a neural network predictor, which takes the latent representations and treatment assignments of units as input to predict the outcomes: ŷ_t = Ψ(h, t). The loss function of this training objective is:

L_ΦΨ = (1/n) Σ_{u=1}^{n} (y_u − Ψ(Φ(x_u), t_u))^2 + λ R(Ψ).

Here, R is a regularization on Ψ penalizing model complexity, weighted by λ.
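As a small illustration of this factual-outcome objective (the L2 penalty is one possible choice of R, and λ is the regularization weight; both are assumptions for the sketch):

```python
import numpy as np

def prediction_loss(y_true, y_pred, psi_weights, lam=1e-4):
    """Factual-outcome loss: MSE plus an L2 complexity penalty on Psi."""
    mse = np.mean((y_true - y_pred) ** 2)
    reg = lam * sum(np.sum(w ** 2) for w in psi_weights)
    return mse + reg

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.5, 2.5])
weights = [np.array([[0.5, -0.5]]), np.array([1.0])]  # toy predictor weights
loss = prediction_loss(y_true, y_pred, weights, lam=0.0)
# with lam = 0 this is the plain MSE: (0.0 + 0.25 + 0.25) / 3
```

Only factual outcomes enter the loss; the counterfactual prediction f(x, 1−t) is never penalized directly, which is why the balancing and MI terms are needed.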
Learning Optimization

W.r.t. the architecture in Figure 1, we minimize L_ΦΩ, L_Φ, and L_ΦΨ, respectively, to iteratively optimize the parameters of the global model. The optimization steps are handled with the stochastic method Adam (Kingma and Ba, 2014), training the model as outlined in Algorithm 1. Optimization details and a computational complexity analysis are given in the supplementary material.
Algorithm 1: ABCEI

Experiments

Due to the lack of counterfactual treatment outcomes in observational data, it is difficult to validate and test the performance of causal effect inference methods. In this paper, we adopt two ways to construct datasets suitable for validating and testing causal inference methods: one is to use simulated or semi-simulated treatment outcomes, e.g., the IHDP dataset (Hill, 2011); the other is to use RCT datasets and add a non-randomized component to generate imbalanced datasets, e.g., the Jobs dataset (LaLonde, 1986; Smith and Todd, 2005). We employ five benchmark datasets: IHDP, Jobs, Twins (Louizos et al., 2017), ACIC (Dorie et al., 2019) and MIMIC-III (Johnson et al., 2016, 2019). For IHDP, Jobs, Twins, ACIC, and MIMIC-III, the experimental results are averaged over 1000, 100, 100, 7700, and 100 train/validation/test splits respectively, with split sizes 60%/30%/10%. The implementation of our method is based on Python and TensorFlow (Abadi et al., 2016). All experiments in this paper are conducted on a cluster with 1x Intel Xeon E5 2.2GHz CPU, 4x Nvidia Tesla V100 GPU and 256GB RAM. The source code of our algorithms is available on GitHub.

Details of Datasets
IHDP The Infant Health and Development Program (IHDP) studies the impact of specialist home visits on future cognitive test scores. Covariates in this semi-simulated dataset are collected from a real-world randomized experiment. The treatment selection bias is created by removing a subset of the treatment group. We use setting 'A' in (Dorie, 2016) to simulate treatment outcomes. The dataset includes 747 units (608 control and 139 treated), each associated with 25 covariates.
Jobs The Jobs dataset (LaLonde, 1986; Smith and Todd, 2005) is built from a randomized study of a job training program, combined with a non-randomized comparison component; only the randomized component provides ground truth.

Twins The Twins dataset is created based on the "Linked Birth / Infant Death Cohort Data" by NBER. Inspired by Almond et al. (2005), we employ a matching algorithm to select twin births in the USA between 1989-1991. By doing this, we obtain units associated with 43 covariates including education, age and race of the parents, birth place, marital status of the mother, the month in which prenatal care began, total number of prenatal visits, and other variables indicating demographic and health conditions. We only select same-sex twins who both weigh less than 2000g. For the treatment variable, we use t = 0 to indicate the lighter twin and t = 1 to indicate the heavier twin. We take the mortality of each twin in their first year of life as the treatment outcome, inspired by Louizos et al. (2017). Finally, we have a dataset consisting of 12,828 pairs of twins, whose mortality rate is 19.02% for the lighter twin and 16.54% for the heavier twin. Hence, we have observational treatment outcomes for both treatments. In order to simulate selection bias, we selectively observe one of the two twins based on the covariates of the unit as follows: t|x ∼ Bernoulli(σ(w^T x + n)), where w ∼ N(0, 0.1 · I) and n ∼ N(1, 0.1).
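The selection-bias simulation for Twins can be sketched as follows (taking 0.1 as the variance of the weight and noise distributions, which is one reading of the notation, and random covariates as stand-ins for the real data):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_assignment(X):
    """t|x ~ Bernoulli(sigmoid(w^T x + n)), w ~ N(0, 0.1*I), n ~ N(1, 0.1)."""
    n_units, k = X.shape
    w = rng.normal(0.0, np.sqrt(0.1), size=k)
    noise = rng.normal(1.0, np.sqrt(0.1), size=n_units)
    p = 1.0 / (1.0 + np.exp(-(X @ w + noise)))   # sigmoid
    return rng.binomial(1, p)                    # 1 = heavier twin observed

X = rng.normal(size=(1000, 43))   # stand-in covariates; Twins has 43
t = simulate_assignment(X)        # covariate-dependent, biased assignment
```

Because the assignment probability depends on x, the observed treatment and control groups have systematically different covariate distributions, which is exactly the imbalance ABCEI is designed to correct.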
ACIC The Atlantic Causal Inference Conference (ACIC) benchmark (Dorie et al., 2019) is derived from real-world data, with 4802 observations and 58 covariates. There are 77 datasets, simulated with different treatment selection and outcome functions. Each dataset is generated with 100 independent random replications. In this benchmark, different settings are considered, such as degrees of non-linearity, treatment selection bias, and magnitude of the treatment outcome.
MIMIC-III This benchmark is created from MIMIC-III, a database comprising deidentified healthcare data of patients in critical care units. We select patient samples with their demographic information as well as the first observed laboratory measurements from chemistry or hematology. After filtering samples with missing values, the benchmark consists of 7413 samples with 25 covariates. We investigate the effect of the prescription amount during the first day in the critical care unit on the length of stay in the ICU. We choose a binary treatment, where 0 represents a small amount of prescriptions and 1 represents a large amount. The treatment outcomes are simulated by y|x, t ∼ w^T x + βt + n, where n ∼ N(0, 1), w ∼ N(0_25, 0.5 · (Σ + Σ^T)), and Σ ∼ U((−1, 1)^{25×25}). The treatment assignments are simulated by t|x ∼ Bernoulli(σ(s^T x + m)), where m ∼ N(0, 0.1) and s ∼ N(0_25, 0.1 · I).

Evaluation Metrics
Since the ground truth CATE is known for the IHDP dataset and the MIMIC-III benchmark, we can employ the Precision in Estimation of Heterogeneous Effect (PEHE) (Hill, 2011) as the evaluation metric for CATE estimation:

ε_PEHE = (1/n) Σ_{u=1}^{n} (f(x_u, 1) − f(x_u, 0) − CATE(u))^2.

Subsequently, we can evaluate the precision of ATE estimation based on the estimated CATE. For the Jobs dataset, because we only know part of the ground truth (the randomized component), we cannot evaluate the performance of ATE estimation. Following Shalit et al. (2017), we evaluate the precision of ATT estimation and the policy risk,

R_pol(π) = 1 − (E[y_1 | π(x) = 1] · p(π = 1) + E[y_0 | π(x) = 0] · p(π = 0)),

where π is the treatment policy derived from the estimator. In this paper, we consider π(x_u) = 1 when f(x_u, 1) − f(x_u, 0) > 0. For the Twins dataset, because we only know the observed treatment outcome for each unit, we follow Louizos et al. (2017) and use the area under the ROC curve (AUC) as the evaluation metric. For the ACIC dataset, we follow Ozery-Flato et al. (2018) and use the RMSE of the ATE estimate as the performance metric.
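PEHE and the derived ATE error are straightforward to compute once ground-truth effects are available; a small sketch (the squared form of PEHE is shown here, while its square root is what many papers report):

```python
import numpy as np

def pehe(cate_true, cate_est):
    """Precision in Estimation of Heterogeneous Effect (squared form)."""
    return np.mean((cate_est - cate_true) ** 2)

def ate_error(cate_true, cate_est):
    """Absolute error of the ATE implied by the estimated CATE."""
    return abs(cate_est.mean() - cate_true.mean())

cate_true = np.array([1.0, 2.0, 3.0])
cate_est = np.array([1.5, 2.0, 2.5])
# pehe -> (0.25 + 0.0 + 0.25) / 3; ate_error -> |2.0 - 2.0| = 0.0
```

Note that a zero ATE error does not imply a zero PEHE: over- and under-estimates of individual effects can cancel in the average, which is why PEHE is the stricter metric.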

Baseline Methods
We compare with the following baselines: least squares regression using the treatment as a feature (OLS/LR_1); separate least squares regressions for each treatment (OLS/LR_2); balancing linear regression (BLR) and balancing neural network (BNN) (Johansson et al., 2016, 2018); MMD measures using an RBF kernel (MMD-V1, MMD-V2) (Kallus, 2018a,b); and adversarial balancing with a cross-validation procedure (ADV-LR/SVM/MLP) (Ozery-Flato et al., 2018). We show the quantitative comparison between our method and the state-of-the-art baselines. Experimental results (in-sample and out-of-sample) on the IHDP, Jobs and Twins datasets are reported. Specifically, we use ABCEI* to represent our model without the mutual information estimation component, and ABCEI** to represent our model without the adversarial learning component.

Results
Experimental results are shown in Tables 1, 2 and 3. It would be unsound to report statistical test results over the numbers in these tables; due to the varying (un-)availability of ground truth, we must resort to reporting different evaluation measures per dataset, over which it would not be appropriate to aggregate in a single statistical hypothesis test. However, one can see that ABCEI performs best in ten out of twelve cases, not only by the best number in the column, but often also by a non-overlapping empirical confidence interval with that of the best competitor (cf. the reported standard deviations). This provides evidence that ABCEI is a substantial improvement over the state of the art. Due to the existence of treatment selection bias, regression-based methods suffer from high generalization error. Nearest-neighbor-based methods consider unit similarity to overcome selection bias, but cannot achieve balance globally. Recent advances in representation learning bring improvements in causal effect estimation. Unlike CFR-Wass, BNN, and SITE, ABCEI considers both the information loss and the balancing problem. The mutual information estimator ensures that the encoder learns representations preserving useful information from the original covariate space. The adversarial learning component constrains the encoder to learn balanced representations. This causes ABCEI to achieve better performance than the baselines. We also report the performance of our model without the mutual information estimator or without adversarial learning, as ABCEI* and ABCEI**, respectively. From the results we can see that performance suffers when either of these components is left out, which demonstrates the importance of combining adversarial learning and mutual information estimation in ABCEI.

Table 2: In-sample and out-of-sample results with means and standard errors on the Twins dataset (AUC: higher = better; ε_ATE: lower = better).
In Figure 4, we compare ABCEI with recent balancing methods on the ACIC benchmark. As we can see, the variance of the representation learning methods is lower than that of methods reweighting samples in covariate space. We also find that the adversarial balancing methods perform better on ATE estimation. ABCEI combines the advantages of adversarial balancing with preserving predictive information in latent space, which makes it outperform the other baselines.

Training details
We adopt ELU (Clevert et al., 2015) as the non-linear activation function unless otherwise specified. We employ fully-connected hidden layers of various numbers and sizes across the networks: four layers of size 200 for the encoder network; two layers of size 200 for the mutual information estimator network; three layers of size 200 for the discriminator network; and finally, three layers of size 100 for the predictor network, following the structure of TARnet (Shalit et al., 2017). The gradient penalty weight β is set to 10.0, and the regularization weight is set to 0.0001.
In each training step, we first minimize L_ΦΩ by simultaneously optimizing Φ and Ω with one gradient descent step. Then the representations h are passed to the discriminator, and we minimize L_D by optimizing D with three gradient descent steps, in order to obtain a stable discriminator. Next, we use the discriminator D to train the encoder Φ by minimizing L_Φ with one gradient descent step. Finally, the encoder Φ and the predictor Ψ are optimized simultaneously by minimizing L_ΦΨ.
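The alternating schedule above can be summarized as a skeleton; the four step functions here are stand-ins that merely record the call order, whereas in practice each performs an Adam update on the corresponding loss:

```python
calls = []  # records the order of optimization moves for illustration

encode = lambda X: X                                    # stand-in for Phi
step_mi = lambda X, h: calls.append("mi")               # minimizes L_PhiOmega
step_discriminator = lambda h, t: calls.append("disc")  # minimizes L_D
step_encoder_adv = lambda h, t: calls.append("adv")     # minimizes L_Phi
step_prediction = lambda h, t, y: calls.append("pred")  # minimizes L_PhiPsi

def train_step(X, t, y, disc_steps=3):
    h = encode(X)
    step_mi(X, h)                 # 1) tighten the MI lower bound
    for _ in range(disc_steps):   # 2) stabilize the discriminator first
        step_discriminator(h, t)
    step_encoder_adv(h, t)        # 3) encoder fools the discriminator
    step_prediction(h, t, y)      # 4) fit the factual outcomes

train_step(X=[[0.0]], t=[0], y=[0.0])
# calls is now ["mi", "disc", "disc", "disc", "adv", "pred"]
```

Running the discriminator updates several times per encoder update mirrors the common WGAN practice of keeping the critic close to optimal before the generator (here, the encoder) moves.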

Hyper-parameter optimization
Because we cannot observe counterfactuals in observational datasets, standard cross-validation methods are not feasible. We follow the hyper-parameter optimization criterion of Shalit et al. (2017), with early stopping with regard to the lower bound on the validation set. The hyper-parameter search space is detailed in Table 4, and the optimal hyper-parameter settings for each benchmark dataset are given in Table 5.

Computational complexity
Assuming the mini-batch size is n and the number of epochs is m, the computational complexity of our model is O(mn(Φ_h Φ_w^2 + Ω_h Ω_w^2 + D_h D_w^2 + Ψ_h Ψ_w^2)). Here, Φ_h, Ω_h, D_h, Ψ_h denote the number of layers and Φ_w, Ω_w, D_w, Ψ_w the number of neurons per layer in the neural networks Φ, Ω, D, Ψ, respectively.

Robustness Analysis on Selection Bias
To investigate the performance of our model under varying levels of selection bias, we generate toy datasets by varying the discrepancy between the treatment and control groups. We draw 8,000 samples with ten covariates x ∼ N(µ_0, 0.5 · (Σ + Σ^T)) as the control group, where Σ ∼ U((−1, 1)^{10×10}). Then we draw 2,000 samples from x ∼ N(µ_1, 0.5 · (Σ + Σ^T)) as the treatment group. By adjusting µ_1, we generate treatment groups with varying selection bias, which can be measured by the KL-divergence between the two groups. For the outcomes, we generate y|x ∼ w^T x + n, where n ∼ N(0_{2×1}, 0.1 · I_{2×2}) and w ∼ U((−1, 1)^{10×2}). In Figure 5, we show the robustness of ABCEI in comparison with CFR-Wass, BART, and SITE. The reported experimental results are averaged over 100 test sets. From the figure, we can see that with increasing KL-divergence, our method achieves more stable performance. We do not visualize standard deviations as they are negligibly small.
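With a shared covariance matrix, the KL-divergence between the two Gaussian groups has a closed form (half a Mahalanobis distance), which is how shifting µ_1 translates into a controlled level of bias; a sketch using an identity covariance for simplicity instead of the symmetrized matrix in the text:

```python
import numpy as np

def gaussian_kl_shared_cov(mu0, mu1, cov):
    """KL(N(mu0,cov) || N(mu1,cov)) = 0.5 (mu1-mu0)^T cov^{-1} (mu1-mu0)."""
    d = np.asarray(mu1) - np.asarray(mu0)
    return 0.5 * d @ np.linalg.solve(cov, d)

k = 10
cov = np.eye(k)                  # identity covariance for illustration
mu0 = np.zeros(k)
kls = [gaussian_kl_shared_cov(mu0, np.full(k, s), cov) for s in (0.0, 0.5, 1.0)]
# shifting mu_1 away from mu_0 grows the divergence: [0.0, 1.25, 5.0]
```

The divergence grows quadratically with the mean shift, so even modest shifts of µ_1 produce substantially more imbalanced toy datasets.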

Robustness Analysis on Mutual Information Estimation
To investigate the impact of minimizing the information loss on causal effect learning, we block the adversarial learning component and train our model on the IHDP dataset. We record the values of the estimated MI and ε_PEHE in each epoch. In Figure 6, we report the experimental results averaged over 1000 test sets. We can see that with increasing MI, the mean squared error decreases and reaches a stable region. But without the adversarial balancing component, ε_PEHE cannot be lowered further, due to the selection bias. This result indicates that even though the estimators benefit from highly predictive information, they still suffer if imbalance is ignored.

Balancing Performance of Adversarial Learning
In Figure 7, we visualize the learned representations on the IHDP and Jobs datasets using t-SNE. We can see that, compared to CFR-Wass, the coverage of the treatment group over the control group in the representation space learned by our method is better. This showcases the degree to which adversarial balancing improves the performance of ABCEI, especially for population-level causal effect (ATE, ATT) inference.

Related Work
Studies on causal effect inference give us insight into the true data-generating process and allow us to answer what-if questions. The core issue of causal effect inference is the identifiability problem given some data and a set of assumptions (Tian and Pearl, 2002). Such data include experimental data from Randomized Controlled Trials (RCTs) and non-experimental data collected from historic observations. Due to the difficulties of conducting RCTs, we mainly focus on causal effect inference based on observational data. There is also much research on the identifiability of determining cause from effect (Mooij et al., 2016; Marx and Vreeken, 2019).
In this paper, we focus on assessing the strength of causal effects under assumptions about the causal relations. Confounding bias can create spurious correlations between variables and complicates the identification of causal effects from observational data. The strong ignorability assumption in the Potential Outcome framework (Rubin, 2005) provides a way to remove confounding bias and make causal effect inference possible with observational data. For practical applications, some studies focus on matching-based methods (Ho et al., 2011) to create comparable groups for causal effect inference. Various similarity measures have been applied to achieve better matching results and reduce the estimation error; e.g., Mahalanobis distance and propensity score matching methods have been proposed for population causal effect inference (Rubin, 2001; Diamond and Sekhon, 2013). An information theory-driven approach uses mutual information as the similarity measure (Sun and Nikolaev, 2016).
Recent studies employ deep representation learning methods to derive models that satisfy the conditional ignorability assumption (Li and Fu, 2017), thereby making the Conditional Average Treatment Effect identifiable. E.g., Johansson et al. (2016) propose to use a single neural network, with the concatenation of representations and the treatment variable as input, to predict the potential outcomes. Shalit et al. (2017) propose to train separate models for the different treatment outcome systems, together with a measure based on an integral probability metric to bound the generalization error. Yao et al. (2018) propose to employ hard samples to preserve local similarity in order to achieve better balancing results. The main difference between ABCEI and the state-of-the-art representation learning based methods is two-fold: on the one hand, by employing adversarial learning, our balancing method does not need any assumptions on the treatment selection function; on the other hand, the transformation from the original covariate space to the latent space might lead to information loss. In our framework, a mutual information estimator is employed to enforce that the encoder preserves as much highly predictive information as possible.
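One common family of neural mutual information estimators (e.g., MINE) optimizes the Donsker-Varadhan lower bound; the sketch below evaluates that bound with a fixed, untrained bilinear critic on synthetic data. All names, shapes, and the critic form are assumptions for illustration, not necessarily the exact estimator used in our framework, where the critic would be a neural network trained jointly with the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: covariates x and representations z that depend on x.
n, d, k = 1000, 6, 4
x = rng.normal(size=(n, d))
z = x @ rng.normal(size=(d, k)) + 0.1 * rng.normal(size=(n, k))

def dv_bound(scores_joint, scores_marginal):
    """Donsker-Varadhan lower bound on MI, given critic scores on paired
    (joint) samples and shuffled (product-of-marginals) samples."""
    return scores_joint.mean() - np.log(np.exp(scores_marginal).mean())

# A fixed bilinear critic T(x, z) = (x W) . z; training the critic to
# maximize the bound tightens the MI estimate.
W = rng.normal(scale=0.1, size=(d, k))
T = lambda xs, zs: np.sum((xs @ W) * zs, axis=1)

# Shuffling z breaks the pairing, giving samples from p(x) p(z).
mi_lb = dv_bound(T(x, z), T(x, z[rng.permutation(n)]))
```

Maximizing such a bound with respect to the encoder encourages the representations to retain information about the original covariates.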
From the graphical point of view, there are other difficulties for the identification of causal effects, e.g., selection bias (Correa et al., 2019). Some research (Bareinboim and Pearl, 2012) proposes the use of instrumental variables for the identification of causal effects. In this paper, we assume there exists only confounding bias, so that removing it makes the causal effect identifiable. Since strong ignorability cannot be verified from data, controlling for covariates may not remove confounding bias when there exist unobserved confounders. Some research proposes to estimate causal effects by using proxy variables (Louizos et al., 2017), where a modified variational autoencoder structure is employed to identify the causal effect from observational data. In this paper, we assume that all confounders can be measured, so that our method is sufficient for the identifiability of the CATE.

Conclusions
We propose a novel model for causal effect inference with observational data, called ABCEI, which is built on deep representation learning methods. ABCEI focuses on balancing latent representations from treatment and control groups through a two-player adversarial game. We use a discriminator to distinguish the representations of the two groups. By adjusting the encoder parameters, we aim to find an encoder that can fool the discriminator, which ensures that the distributions of treatment and control representations are as similar as possible. Our balancing method does not make any assumption on the form of the treatment selection function. With the mutual information estimator, we preserve highly predictive information from the original covariate space in the latent space. Experimental results on benchmark and synthetic datasets demonstrate that ABCEI achieves robust and substantially better performance than the state of the art.
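The two-player game described above can be sketched in a minimal form: a linear "encoder" tries to fool a logistic "discriminator" that predicts treatment from the representations. This is a toy illustration with assumed shapes, learning rates, and update rules, not our full implementation (which uses deep networks and additional loss terms).

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

# Hypothetical toy data: a mean shift between groups emulates selection bias.
n, d, k = 256, 10, 8
t = rng.integers(0, 2, size=(n, 1)).astype(float)   # treatment indicator
x = rng.normal(size=(n, d)) + 0.5 * t               # biased covariates

W_e = rng.normal(scale=0.1, size=(d, k))            # linear "encoder"
w_d = np.zeros((k, 1))                              # logistic "discriminator"
lr = 0.1

for step in range(500):
    phi = x @ W_e                                   # representations
    p = sigmoid(phi @ w_d)                          # P(treated | phi)

    # Discriminator step: gradient descent on BCE, learning to
    # distinguish treated from control representations.
    w_d += -lr * phi.T @ (p - t) / n

    # Encoder step: descend the BCE with flipped labels, i.e. try to
    # make the discriminator mislabel the groups.
    g_phi = (sigmoid(phi @ w_d) - (1 - t)) @ w_d.T / n
    W_e += -lr * x.T @ g_phi

# If balancing succeeds, the discriminator ends up near chance level.
acc = float(np.mean((sigmoid(x @ W_e @ w_d) > 0.5) == t.astype(bool)))
```

In the full model, the same alternating updates are applied to neural encoder and discriminator networks, jointly with the outcome-prediction and mutual-information objectives.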
In future work, we will explore further connections between causal inference and related methods in domain adaptation (Daume III and Marcu, 2006) and counterfactual learning (Swaminathan and Joachims, 2015). Natural extensions would be to consider multiple treatment assignments or the existence of hidden confounders.

Fig. 1
Fig. 1 Deep neural network architecture of ABCEI for causal effect inference.
The Jobs dataset studies the effect of job training on employment status. It consists of a non-randomized component from observational studies and a randomized component based on the National Supported Work program. The randomized component includes 722 units (425 control and 297 treated) with seven covariates, and the non-randomized component (PSID comparison group) includes 2490 control units.

Fig. 5
Fig. 5 PEHE on datasets with varying treatment selection bias. ABCEI is comparatively robust.

Fig. 6
Fig. 6 Mutual information (MI) between representations and original covariates, as well as PEHE, in each epoch. As MI increases, PEHE decreases.

Fig. 7
Fig. 7 t-SNE visualization of treatment and control groups on the IHDP and Jobs datasets. The blue dots are treated units, and the green dots are control units. The left figures show the units in the original covariate space, the middle figures show representations learned by ABCEI, and the right figures show representations learned by CFR-Wass; notice how the latter has control unit clusters unbalanced by treatment observations.

Table 1
In-sample and out-of-sample results with means and standard errors on the IHDP and Jobs datasets (lower = better).

Table 3
In-sample and out-of-sample results with means and standard errors on the MIMIC-III benchmark (lower = better).

Table 4
Search space of hyper-parameters.