Recent advances in statistical methodologies in evaluating program for high-dimensional data

The era of big data brings opportunities and challenges to developing new statistical methods and models to evaluate social programs or economic policies or interventions. This paper provides a comprehensive review on some recent advances in statistical methodologies and models to evaluate programs with high-dimensional data. In particular, four kinds of methods for making valid statistical inferences for treatment effects in high dimensions are addressed. The first one is the so-called doubly robust type estimation, which models the outcome regression and propensity score functions simultaneously. The second one is the covariate balance method to construct the treatment effect estimators. The third one is the sufficient dimension reduction approach for causal inferences. The last one is the machine learning procedure directly or indirectly to make statistical inferences to treatment effect. In such a way, some of these methods and models are closely related to the de-biased Lasso type methods for the regression model with high dimensions in the statistical literature. Finally, some future research topics are also discussed.


§1 Introduction
With the availability and development of statistical methodologies and models, it has become more and more important to evaluate the effect of social and economic policies. Indeed, an accurate policy evaluation can improve the pre-design, implementation, and post-adjustment of the policies. Modern policy evaluation methods basically rely on the potential outcome framework developed by Rubin (1973Rubin ( , 1974Rubin ( , 1977. The difficulty of the treatment effect estimation under Rubin's framework lies in the fact that one can not observe the two potential Benefiting from the advances of modern statistics and computing technology, it is easier than ever for researchers to obtain massive amounts of data, which brings new opportunities and challenges to developing new policy evaluation methods. On the one hand, the emergence and utilization of rich data sets provide new sources and materials to obtain more accurate policy evaluation, for example, rich sets of covariates make the unconfoundedness assumption more credible and help to improve the accuracy of the identification. On the other hand, the emergence of big data has not fundamentally solved the dilemma that counterfactual outcomes of the individuals can not be directly observed in policy evaluation and raises more challenges to the statistical inference of treatment effect, for example, which covariate (or which functional form of the covariate) should be included in the underlying models.
It is well known that making valid statistical inference under high-dimensional settings, where there are possibly more covariates than the sample size, is a nontrivial task in both the statistics and econometrics literature. Due to the high dimensionality effect, the traditional estimators would perform poorly when the dimension of the covariates is greater than or/and increasing with the sample size. Recently, there have been some breakthroughs made by researchers in statistics and econometrics for high-dimensional inference. For example, Javanmard and Montanari (2014), Van de Geer et al. (2014), and Zhang and Zhang (2014) propose the de-biased least absolute shrinkage and selection operator (Lasso) methods from different perspectives to construct confidence intervals for the coefficients in regression models with high-dimensional data. The main idea within the de-biased Lasso methods is to add a compensation part to the classical Lasso estimator as in Tibshirani (1996) to remove the bias caused by the regularization. Meanwhile, Belloni, Chernozhukov and Hansen (2014), Farrell (2015), and Belloni et al. (2017) construct valid inference of treatment effects relying on the estimation of both the outcome regression (capturing the relationship between the outcome variables and the covariates) and propensity score (capturing the relationship between the treatment variable and the covariates) models in high-dimensional settings.
In addition to the unconfoundedness assumption, sparsity assumption, permitting to apply model selection methods, is needed in high dimensions. Generally, researchers would adopt the approximate sparsity assumptions for the regression models. The approximate sparsity assumptions allow for imperfect model selection that the true regression functions can be represented by a linear combination of a set of covariates (or functional transforms of the covariates), whose number is much smaller than the sample size, with a small nonzero approximate error. In contrast to perfect model selection, the approximate sparsity is less restrictive in the sense that it contains the situation where there are some moderate but nonzero covariates.
Typically, the existing methods of making causal inference in high dimensions can be divided into four classes. The first class is to make inference of treatment effect based on the doubly robust type estimation, which needs to model the outcome regression and propensity score (PS) functions simultaneously. For example, Belloni, Chernozhukov, and Hansen (2014) propose the double selection method, applying the Lasso method to two equations, respectively, to make valid inference of the treatment effect under the framework of partially linear model, Farrell (2015) focuses on the doubly robust average treatment effect (ATE) estimators with multivalued treatments and applies the group Lasso method as in Yuan and Lin (2006) to both the outcome regression and propensity score models to make robust inference on ATE, and Belloni et al. (2017) provide inferential procedures for a variety of treatment effect parameters via the Neyman orthogonality assumption, implicitly containing the doubly robust ATE estimator as a special case. The second class is to involve the idea of covariate balance in the procedure of making causal inferences. For example, Athey, Imbens and Wager (2018) propose the approximate residual balancing method to make valid inference by assigning the stable balance weight as in Zubizarreta (2015) to the regression residuals, Ning, Peng and Imai (2020) propose the high-dimensional covariate balancing propensity score (CBPS) method to combine the model selection method with the CBPS method as in Imai and Ratkovic (2014), and Tan (2020b) proposes the model-assisted inference for treatment effects by applying regularization to the calibrated loss function. The third class lies in the utilization of the sufficient dimension reduction (SDR) method. For example, Ma et al. (2019) propose the sparse SDR method to estimate the outcome regression and propensity score models and use the DR estimator to make causal inferences. The last one is to make causal inferences by ingeniously exploiting machine learning methods directly or indirectly. For example, Wager and Athey (2018) adopt the random forest algorithm to estimate the treatment effect by thinking of the individuals in the same leaf of a causal tree as having come from randomization experiment and derive the large sample theories, and Athey et al. (2019) propose the generalized random forest method to estimate any quantity that can be identified via local moment conditions and derive the asymptotic consistency and normality results. Finally, high-dimensionality settings within the panel data framework proposed by Hsiao, Ching and Wan (2012) are also considered by Carvalho, Masini and Medeiros (2018), and Shi and Huang (2021). The reader is referred to Cai (2021) and the references therein to have a comprehensive review on recent developments in estimating treatment effects for panel data.
The rest of the paper is organized as follows. Section 2 presents the notations of the treatment effect and introduces the DR estimators which are popular for both the low-dimensional and high-dimensional settings. Section 3 gives a brief review on causal inference methods based on the outcome regression and propensity score models in high dimensions. Section 4 is devoted to introducing inferential methods of adopting the covariate balance methodology to estimate treatment effect with high-dimensional data. Section 5 investigates the utilization of SDR methods for causal inferences. Section 6 describes the machine learning methods that are directly or indirectly used to estimate treatment effects. Section 7 concludes the paper. §2 Treatment Effect Estimators with Unconfoundedness Suppose that we have a binary treatment variable D taking the value of 1 if the individual is in the treated group and taking the value of 0 if the individual is in the control group. Let Y (1) and Y (0) be two potential outcomes corresponding to the treated and untreated status of the individual, respectively. The observed outcome of interest Y is defined as The ATE is defined as the mean difference between the two potential outcomes, It is supposed that there is also a p × 1 vector of covariates X representing the characteristics of the individual. The propensity score determining the conditional probability of receiving treatment is defined as π(x) = P (D = 1|X = x). To identify the ATE, the socalled unconfoundedness and overlap assumptions (Rosenbaum and Rubin, 1983) are commonly adopted; that is, is a random sample from the population (X, D, Y ), where n is the sample size.
In this section, it is devoted to briefly introducing the three types of widely-used ATE estimators; that is, the OR estimator, the IPW estimator and the DR estimator. The OR estimator is based on the relationship between the outcome variable and the covariates. Under the assumption of unconfoundedness, the ATE can be identified by for j = 0 and 1, and the corresponding OR estimator for ATE is given as∆ whereμ j (X i ) is a consistent estimate of µ j (X i ) for j = 0 and 1. It is clear that the IPW estimator is based on the relationship between the treatment variable and the covariates. Under the identification assumptions, the ATE can be identified by and the corresponding IPW estimator for ATE is defined aŝ whereπ(X i ) is a consistent estimate of π(X i ). Furthermore, the ATE can also be identified by ] . ( Note that when π(x) is correctly specified, E (3) holds if at least one of the OR and PS models is correctly specified. By combining the OR model µ j (X), j = 0, 1 and the PS model π(X), the DR estimator for ATE is constructed ] .
(4) The DR estimator is robust in the sense that if at least one of the OR and PS models is correctly specified, the resulting estimator is consistent. Among all the three types of treatment effect estimators, one can see that the DR estimator plays an important role in the statistical inferences of the ATE in high-dimensional settings in the next section. §3 Inference of Treatment Effects Based on OR and PS Models Belloni, Chernozhukov and Hansen (2014) propose the so-called double selection method under the framework of partially linear model

Double Selection Method
and h(·) are two unknown functions, and u i and v i are unobserved disturbances. To combine the high-dimensional settings with the partially linear model, the approximate sparsity assumptions are made to the two functions f (·) and h(·); that is, where the numbers of non-zero components in β or and β ps are less than s, an integer much smaller than the sample size n, and e or,i and e ps,i are small non-zero approximation error terms. For this structural model, Belloni, Chernozhukov and Hansen (2014) introduce the "post-double-selection" (Post-DS) procedure to make valid inference of the treatment effect α.
Three steps are included in the Post-DS procedure: (i) apply the feasible Lasso model selection method to regress Y i on X i in the first step; (ii) apply the feasible Lasso model selection method to regress D i on X i in the second step; (iii) run the OLS procedure by regressing Y i on D i and the combination of the selected covariates X i in the previous two steps. More generally, Belloni, Chernozhukov and Hansen (2014) apply the Post-DS procedure to study the ATE and average treatment effect on the treated (ATT) in the data-rich environment based on the DR estimators in Section 5 of their paper, with the main idea similar to the paper by Belloni et al. (2017).

Remark 1:
Note that at the second step above, it is assumed that D i is a linear probability model of X i , which might be inappropriate. Also, other forms of f (x) can be considered.

Robust Inference Based on DR Estimators
Under the unconfoundedness and overlap assumptions, Farrell (2015) focuses on the DR ATE estimators in the framework of multivalued treatments and proposes a procedure for making robust inference arguing that the DR estimators are not only robust to model misspecification but also robust to model selection. The robustness property in high-dimensional settings permits the existence of the selection errors in modeling both the outcome regression functions and propensity scores without affecting the valid inference of the ATE estimators. Similar to Belloni, Chernozhukov and Hansen (2014), the approximate sparsity assumptions, allowing for imperfect model section, are made in the model selection stage. Robust inference on the ATE is then obtained via a model selection procedure similar to the Post-DS as in Belloni, Chernozhukov and Hansen (2014). First, apply the group Lasso method 1 to the regression functions and propensity scores, respectively, and then, refit the generalized linear regression model for the regression functions and propensity scores with the union of the selected covariates in the previous step. Belloni et al. (2017) provide a more general framework to make causal inference for a variety of treatment effect parameters such as (local) ATE and (local) quantile treatment effects. As argued by Belloni et al. (2017), perfect model selection may be too restrictive to be satisfied in some real applications, and in order to provide valid inference for the treatment effect, the approximate sparsity assumptions allowing for small nonzero approximation error are made for the outcome regression and propensity score functions. With these assumptions, machine learning techniques, such as Lasso and Post-Lasso 2 (Belloni and Chernozhukov, 2013) methods, can be used to conduct the model selection. Furthermore, two conditions are required to proceed. One is the so-called Neyman orthogonality condition, which ensures that the estimating equations are first-order insensitive to the perturbations in the nuisance components, for instance, the outcome regression and propensity score functions. The other one is no overfitting condition requiring that the nuisance components should be estimated at a relatively slower o(n −1/4 ) rate to insure small estimation bias. As a concrete example of this general framework, it can be verified that the DR ATE estimator (4) satisfies the Neyman orthogonality condition, and the estimated outcome regression and propensity score functions via Lasso or Post-Lasso model selection method satisfy the no overfitting condition automatically. In other words, to estimate and infer the ATE in data-rich environment, one can use the DR estimator with the nuisance components estimated via Lasso or Post-Lasso method and rely on the multiplier bootstrap procedure as described in Section 3.

CBW Methods
In recent years, some CBW methods have been proposed by researchers to improve the finite sample size performance of the traditional IPW estimators. Broadly speaking, there are two ways to achieve the covariate balance among different groups: one depends on modeling the propensity score; see, e.g., the papers by Imai and Ratkovic (2014) and Sant'Anna, Song and Xu (2020), and the other directly aims at the balance weights via optimization designs; see, e.g., the papers by Zubizarreta (2015) and Chan, Yam and Zhang (2016). To have a general idea of the CBW method, by the definition of the propensity score and the law of iterated expectations, it can be shown that where g(·) : R p → R m for some m ≥ 1 is a continuous and integrable function. It can be seen from (5) that the propensity score has the property of balancing the moment of any functional form of the covariates among the treated, control and combined groups. Imai and Ratkovic (2014) exploit the finite moment conditions of (5) and propose the CBPS method, while Sant'Anna, Song and Xu (2020) make full use of (5) to consider the infinite moment conditions and propose the integrated propensity score procedure. For example, Imai and Ratkovic (2014) first assume a logistic model for the propensity score as where β ∈ Θ ⊆ R p is a p-dimensional vector of parameters. Then, the unknown parameter β is estimated byβ = arg min where H is an m × m positive definite matrix and g(X i ).
Once the propensity score is estimated, the ATE estimator based on the CBPS method can be obtained via (2).
Viewing the inverse propensity score of (5) as a weight, through careful optimization designs, Zubizarreta (2015) proposes the stable balance weighting (SBW) method to choose the minimum sample variance of the weights that balance the covariates, while Chan, Yam and Zhang (2016) propose the calibration balance weighting method to choose the weights that minimize the calibration distance and balance the covariates as well. Specifically, the SBW weights in Zubizarreta (2015) can be obtained by solving the following optimization problem for j = 0 and 1, min wji ∑ i:Di=j where n 0 and n 1 are the numbers of units in the control and treated groups, respectively, X ik is the k-th covariate of unit i,X k is the mean of the k-th covariate in the combined group, and δ jk , j = 0, 1; k = 1, · · · , p, are pre-specified small positive numbers. Once the stable balance weights w 1i and w 0i are obtained, the weighted estimator of the ATE based on SBW method is given by∆ In high-dimensional settings, another line of making valid inferences of treatment effects is to take the idea of covariate balancing into the construction of the estimators.
As argued by Athey, Imbens and Wager (2018), the penalized linear regression step can be effective at capturing the main effects while the weighting regression residuals step can be effective at capturing additional small effects and makes compensations to the previous step, leading to an effective procedure for estimating the ATE in high-dimensional cases. Different from the methods discussed in Section 3, the approximate residual balancing method does not rely on modeling the relationship between the treatment variable and the covariates, so that the sparsity assumption for the propensity score is not required. The ATE estimator considered in Athey, Imbens and Wager (2018) is closely related to the DR estimators if one views the inverse propensity score in the DR estimator of (4) as a covariate balance weight. Also, as argued by Athey, Imbens and Wager (2018), the approximate residual balancing method is closely related to the de-biased Lasso studies for the regression model of high dimensions in the statistical literature.

High-Dimensional CBPS
Ning, Peng and Imai (2020) generalize the CBPS method by Imai and Ratkovic (2014) to introduce model selection methods into covariate balancing procedures, termed as highdimensional CBPS (HD-CBPS). The CBPS method does not rely on modeling the outcome regression, while the HD-CBPS method requires the sparsity assumptions for both the outcome regression and the propensity score models to allow for model selections. To adjust for the bias caused by regularized estimation, Ning, Peng and Imai (2020) apply the covariate balancing procedure to the covariates selected from the penalized outcome regression model to obtain calibrated propensity score estimators. For the estimation of µ 1 = E[Y (1)], Ning, Peng and Imai (2020) first estimate the propensity score function via penalized logistic regression model where λ > 0 is a tuning parameter, and estimate the outcome regression function by penalized regression modelα = argmin where λ ′ > 0 is a tuning parameter, and then, they calibrate the estimated propensity score by balancing the covariates selected in the outcome regression model as follows, .
Although the strong covariate balance property as in Imai and Ratkovic (2014) may not be held with high dimensions, Ning, Peng and Imai (2020) argue that once the outcome regression model is properly estimated, the HD-CBPS method would still approximately satisfy weak covariate balancing property in the sense that the covariate balancing property holds for the linear combination of the covariates. Ning, Peng and Imai (2020) also show that the HD-CBPS procedure is robust to the misspecification of either the outcome regression or the propensity score model and extend the linear outcome regression model to generalized linear cases.

Model-Assisted Inference for Treatment Effects with Regularized Calibrated Estimation
Tan (2020a) proposes a calibrated estimation procedure for the propensity score from the perspective of the loss function. Similar to the CBW method based on the propensity score model, a logistic regression model is assumed to the propensity score. Tan (2020a) argues that solving the covariate balance equation between the treated (or control) and combined groups is equivalent to minimizing the calibrated loss function, which is a key component in Tan (2020aTan ( , 2020b. To derive an ATE estimator in the high-dimensional settings, Tan (2020a) adds a Lasso penalty to the calibrated loss function. By plugging the minimal solution of the regularized calibrated loss function into the logistic regression model, one obtains the calibrated propensity score estimator and leads to a regularized calibrated IPW estimator for the ATE. Finally, Tan (2020a) illustrates the advantages of the proposed methods over the conventional maximum likelihood method via a simulation study.
Tan (2020b) proposes a DR type regularized calibrated estimation procedure for the ATE with high-dimensional data. The regularized calibrated ATE estimator provides not only DR point estimators but also model-assisted confidence intervals. The term "model-assisted" means that the confidence intervals are valid when the propensity score function is correctly specified or when a linear outcome regression model is assumed and correctly specified. The model-assisted confidence intervals property makes the proposed method in Tan (2020b) one step further when compared with the methods introduced in Section 3. Under the unconfoundedness and overlap assumptions, the DR ATE estimators are used in Tan (2020b) and the calibrated loss function with a Lasso penalty as in Tan (2020a) is used for the propensity score model while a carefully chosen weighted least-square loss function with Lasso penalty is used for the outcome regression model.
Clearly, causal inference with covariate balance can either circumvent the modeling of propensity score function or make inference robust to the misspecification of the propensity score model. This is certainly an advantage over the methods based on the DR estimators in Section 3. §5 Causal Inference Based on SDR A third line of making causal inference with high dimensions is the SDR method, which captures full information of covariates through a linear combination of the whole covariates. Indeed, Huang and Chan (2017) propose the joint SDR method for the ATE, where the "joint" means that the proposed method aims at finding a linear combination B ⊤ X to simultaneously satisfy Once the parameter matrix B is estimated, denoted asB, the ATE is then estimated by the outcome regression estimator (1), where the outcome regression model is estimated by sieve approach with respect toBX i . Furthermore, Luo, Zhu and Ghosh (2017) propose regression-based SDR to estimate the ATE, where the focus is to find unknown parameter matrix with r(j) ≤ p, j = 0, 1. The minimum average variance estimation method as in Xia et al. (2002) is adopted to estimate the parameter matrix B j , j = 0, 1 and to recover the outcome regression function simultaneously.
However, the theoretical results in Huang and Chan (2017) and Luo, Zhu and Ghosh (2017) are derived only for fixed dimensions. Recently, Ma et al. (2019) propose the sparse SDR method for causal inference of ATE with high dimensions. Notice that both the outcome regression and propensity score models can be viewed as conditional expectations. Let W be a random variable that can be taken as the potential outcomes Y (j), j = 0, 1 or the treatment variable D. The goal of the sparse SDR method is to find a p × r sparse parameter matrix B such that Although the inference of the ATE is based on the DR estimator, the sparse SDR method does not rely on the restrictive parametric modeling assumptions on the outcome regression and propensity score functions. In this regard, the modeling strategy is similar to the generalized random forest method introduced in the next section. §6 Machine Learning Methods The fourth line for making valid inference of treatment effects lies in the machine learning, especially the random forest methods. Random forest, firstly introduced by Breiman (2001), is known as one of the most attractive machine learning methods. Similar to the bagging method, developed by Breiman (1996), random forest is an ensemble algorithm consisting of many decision or regression trees and making an average of the results produced by different trees. In the bagging method, all covariates are used to generate a single tree while in the random forest method, only a fraction of covariates are used to generate a single tree. This small modification makes the trees in the random forest more diverse than those in the bagging algorithm and improves the generalization property of the resulting model.
It is well known that the traditional machine learning methods are good at prediction rather than inference. However,  propose the causal forest method to introduce the random forest method into the causal inference and derive the consistency and asymptotical normality results as well as asymptotic confidence intervals for the treatment effect estimators. The treatment effects are estimated via the causal forest consisting of many causal trees which can be viewed as the nearest-neighbor methods with the leaves in the causal tree as the neighborhood metrics. An important advantage of the causal forest over the nearestneighbor methods is that the weights in the causal forest are data-driven allowing for more flexibility. Thinking of the leaves in a causal tree as small enough, the individuals in the same leaf can be treated as randomly assigned so that the simple calculation between the treated and control individuals as in the randomization experiment can be applied. The causal forest then averages the results produced by the causal trees to obtain the treatment effect estimate.
Recently, Athey, Tibshirani and Wager (2019) propose the generalized random forest method that adopts the random forest algorithm to estimate various types of conditional expectations. Obviously, the causal forest method by Wager and Athey (2018) directly aims at the treatment effect estimation while the generalized random forest method is indirectly used to estimate the treatment effect through intermediate parameters. Estimating conditional expectations via the random forest algorithm can be viewed as the nonparametric method. In classical nonparametric kernel methods, a series of weights are determined by the kernel function whose values vary with the distance to the given conditional variable. In practice, these classical methods suffer from the "curse of dimensionality" problem and may lead to poor performance as the dimensions of the covariates increase. However, the weights assigned to the response variable in Athey, Tibshirani and Wager (2019) are determined by the random forest method, making it possible to deal with the high-dimensional settings. The basic idea in Athey, Tibshirani and Wager (2019) is similar to that in Wager and Athey (2018) by viewing the leaves in the generating trees as the closeness metrics. Furthermore, the large sample theories of consistency and asymptotic normality are derived in Athey, Tibshirani and Wager (2019). Although the theoretical results in both Wager and Athey (2018) and Athey, Tibshirani and Wager (2019) are derived for low-dimensional settings, as long as the sparsity assumptions are made to the true model, the algorithms would still work for the high dimensions since the irrelevant covariates would be ignored by the forests, which is implicitly pointed out in Section 3 of Wager and Athey (2018).
The advantage of the random forest methods is that they are flexible and do not rely on postulating specific models for the outcome regression and potential outcome functions. Specifically, the random forest-based methods can be viewed as nonparametric estimations that can deal with the curse of dimensionality issue.

Remark 2:
In addition to the random forest or the generalized random forest as mentioned above, it should be very attractive and demanding to develop other machine learning methods to study causal inferences, which is clearly warranted to investigate. §7 Conclusion In the big data environment, on the one hand, it can make the unconfoundedness assumption more credible, on the other hand, one needs to determine which covariate (or its functional form) should be included in the true model. The inference of treatment effects based on the DR estimators is valid in high-dimensional settings, indicating that the DR estimator has the self de-biased property. Another line of making valid inference of treatment effects with highdimensional data is to exploit the covariate balance property. The utilization of the covariate balance property can either circumvent modeling the propensity score or relax the model specification assumptions. Furthermore, one can base on the SDR method to make causal inference in high dimensions and one advantage of this method is that it does not require consistency for variable selection. Finally, one can directly or indirectly make inference of treatment effect via the machine learning algorithms such as random forest. The approximate residual balancing method and the inference of treatment effects based on the DR estimators are closely related to the de-biased Lasso method for the regression model with high dimensions in the statistical literature.
The DR estimators for causal inference need to consider model selection in both the outcome regression and propensity score functions. However, the estimators based on the covariate balance methods can ease the dependence on the PS models by using the auxiliary information. The estimators based on the SDR method do not rely on restrictive assumptions on OR and PS models. The methods to estimate the treatment effects via the random forest algorithm are flexible and do not rely on modeling the OR and PS functions. For empirical applications, the first two methods are recommended when one believes that it is reasonable to model the OR and PS functions in parametric forms, while if one has no idea of what functional forms of the OR and PS models should be, it is better to try the last two methods.
Currently, the causal inference in high dimensions mainly focuses on the ATE estimate. How to make valid inference for quantile treatment effects besides the Neyman orthogonality condition? Can one generalize the existing methods to the other setups, such as, differencein-differences, local average treatment effects and regression discontinuity design? Do other