Targeted smoothing parameter selection for estimating average causal effects

The non-parametric estimation of average causal effects in observational studies often relies on controlling for confounding covariates through smoothing regression methods such as kernel, splines or local polynomial regression. Such regression methods are tuned via smoothing parameters which regulates the amount of degrees of freedom used in the fit. In this paper we propose data-driven methods for selecting smoothing parameters when the targeted parameter is an average causal effect. For this purpose, we propose to estimate the exact expression of the mean squared error of the estimators. Asymptotic approximations indicate that the smoothing parameters minimizing this mean squared error converges to zero faster than the optimal smoothing parameter for the estimation of the regression functions. In a simulation study we show that the proposed data-driven methods for selecting the smoothing parameters yield lower empirical mean squared error than other methods available such as, e.g., cross-validation.


Introduction
In observational studies where the interest lies in estimating the average causal effect of a binary treatment z on an outcome of interest y, non-parametric estimators are typically based on controlling for confounding covariates x with smoothing regression methods (nearest neighbour, kernel, splines, local polynomial regression, series estimators; see, e.g., the review by Imbens and Wooldridge, 2009). A useful modeling framework in this context was introduced by Neyman (1923) and Rubin (1974), where in particular two potential outcomes are considered for each unit in the study, the outcome that would be observed if the unit is treated, y(1), and the outcome that would be observed if the unit is not treated, y(0). The causal effect at the unit level is defined as y(1) − y(0). section a novel data-driven method. Section 3 presents a simulation study. The paper is concluded in Section 4.

Model and estimation 2.1 Neyman-Rubin model for causal inference
Suppose we have n units i in a study, a random sample from a population of interest for which we observe a binary treatment assignment z i , a real valued outcome y i and a set of covariates x i . Thus, 1 if unit i recieves treatment 1, 0 if unit i recieves treatment 0 (possibly no treatment, control group).
The causal effect of treatment z i = 1 versus treatment z i = 0 on the response variable y for unit i is defined as τ i = y i (1) − y i (0), with y i (1) and y i (0) the potential outcomes for unit i, i.e. y i (1) is the response that would be observed for unit i if given treatment z i = 1 and y i (0) the response if given treatment z i = 0. The observed response for unit i is then y i = y i (0)(1 − z i ) + y i (1)z i . The individual causal effect τ i is not observable since unit i can only receive one of the two treatments. Typically, the parameter of interest is a population average causal effect, If treatment assignment is not randomized, τ is identified if we have available a set of covariates x i = (x i1 , . . . , x id ) T not affected by treatment assignment and such that the following assumptions hold, y i (1), y i (0) ⊥ ⊥ z i |x i , often called unconfoundedness assumption, and 0 < Pr(z i = 1|x i ) < 1, often called overlap assumption. We have unconfoundedness if all covariates affecting both treatment assignment and the potential outcomes are included in x i . The assumption of overlap states that, for a unit with covariate vector x i , the probability of receiving either treatment should be bounded away from 0. Under these assumptions identifiability of τ is then a consequence of In the sequel we focus on the case d = 1 since when d > 1, the covariate vector x i can be replaced by a scalar, e.g., p(x i ) = Pr(z i = 1|x i ), the propensity score (Rosenbaum andRubin, 1983, Hansen, 2008). Indeed, Rosenbaum and Rubin (1983) showed that it is sufficient to condition on the propensity score, i.e. under the above assumptions we have y i (1), y i (0) ⊥ ⊥ z i |p(x i ), and 0 < Pr(z i = 1|p(x i )) < 1. In applications the propensity score need to be modelled and fitted to the data. Typically parametric models are used, although these do not need to be correctly specified as shown in Waernbaum (2010).

Estimating average causal effects
Note that the assumption of constant conditional variance could be relaxed without changing in essence the results of this paper. We consider this assumption to alleviate the notational burden. From (1), we have that Thus, a natural way to estimate τ is to first estimate the two regression functions β 1 (x i ) and β 0 (x i ), based on the treated and the non-treated, respectively, and then take the average over all the observed x i s of the differences between the estimated functions. This estimator of τ is called the imputation estimator in Imbens et al. (2005). They use series estimators for estimating the regression functions but any smoother, e.g. nearest neighbour, kernel, splines and local polynomial regression (Fan and Gijbels, 1996, p. 14-45), may be used.
Denote y 0 = (y 0 1 , . . . , y 0 n 0 ) T and x 0 = (x 0 1 , . . . , x 0 n 0 ) T the observed response and covariate for the n 0 units with treatment z i = 0, and similarly y 1 = (y 1 1 , . . . , y 1 n 1 ) T and x 1 = (x 1 1 , . . . , x 1 n 1 ) T for the n 1 units with treamtment z i = 1. The smoothers cited above are linear in the sense that the corresponding estimator of β j (x) = (β j (x 1 ), . . . , β j (x n )) T , can be written asβ the smoothing matrix regressing y j on x j , using smoothing parameter h j . The imputation estimator of τ mentioned above iŝ In this paper we base our results on a specific linear smoother, the local linear regression smoother, although we anticipate that most results should hold for any other linear smoother. Local linear regression (Cleveland, 1979;Fan and Gijbels, 1996), consists in fitting a straight line at every x i , i = 1, . . . , n, using only the part of data that is deemed to be sufficiently close to the target point x i . Consider estimating the regression function β j (·), j = 1, 0. The fit, at x i , iŝ is a kernel function such that K(u)du = 1 and uK(u)du = 0. An example is the tricube kernel defined as The definition of b ji , i = 1, . . . , n, depends on the type of bandwidth we use. With a constant bandwidth b j1 = · · · = b jn = h j . For a nearest neighbor type bandwidth, assuming no ties, b ji is the Euclidian distance from x i to the (h j n j ):th nearest among the x j k :s for x j k = x i , h j ∈ [1/n j , 1] , k = 1, . . . , n j , and the smoothing parameter h j is the proportion of observations being used to produce the local fit.

Mean squared errors
Many smoothing parameter selection methods are developed with the purpose of estimating the regression function β j (x i ), j = 1, 0, and attempts to select the smoothing parameter minimizing the average conditional mean squared error: (2) One frequently used selection procedure that attempts to select the smoothing parameter minimizing (2) is leave-one-out cross-validation. In this setting, cross-validation selects the smoothing parameter h j minimizing Asymptotically, for local linear regression, the smoothing parameter minimizing (2) is proportional to n −1/5 j (Fan, 1992), and, hence, proportional to n −1/5 since n j = n Pr(z = j) + o p (n). However, it is known that for estimating a functional of β j (x i ) such as E(β j (x i )), the smoothing parameter minimizing (2) is not optimal, in the sense that it does not result in √ n-consistent estimation of the functional (e.g., Cheng, 1994). Imbens et al. (2005) suggest that one should select h 0 and h 1 by minimizing the conditional mean squared error of 1 We argue that, in order to estimate τ optimally, it may be more suitable to select the combination of (h 1 , h 0 ) minimizing the conditional mean squared error ofτ imp Note that Hence, criterion (5) differs from (4) when both average bias terms in the latter expression are different from zero.

Asymptotics
Asymptotic approximations can be used to describe optimal bandwidth choices as the sample size tends to infinity. The results presented here are deduced in Appendix B, where regularity conditions also used in Ruppert and Wand (1994) are given. For local linear regression with constant bandwidth such that h j → 0 and nh j → ∞ as n → ∞ we have the following approximations for the conditional bias and variance of 1 and with constants j (x) the m:th derivative of the function β j (x) and f (x) is the density of x. Hence, and Let us first consider the optimal smoothing parameter for estimating E(β j (x)) and assume nh 3 j → 0 as n → ∞, j = 0, 1. An asymptotic approximation to the bandwidth minimizing (8) is Hence, the optimal rate of convergence is here faster than n −1/5 , the optimal rate for the estimation of the regression function β j (·). A similar result was shown in Cheng (1994) for kernel regression. Turning to the minimization of (9), this must be done simultaneously in h 0 and h 1 . A reasonable assumption, however, is that these two smoothing parameters have same rate of convergence to zero. Under this assumption we may replace h 1 by ch 0 , for c a constant, in (9). Minimizing the latter for h 0 yields as above an optimal rate of convergence for h 0 (and hence h 1 ) of n −2/5 . Another related result, deduced from (6) and (7), is that as The results above show that selecting the smoothing parameters minimizing (4) will lead to √ n-consistent estimation of τ. This is in accordance with previous results (e.g., Speckman, 1988) where it was shown that asymptotic undersmoothing of the regression function is needed for the √ n−consistent estimation of a functional of the regression function. Imbens et al. (2005) propose the following estimator of (4)

Estimating MSEs
wherep j = (1/p(x j 1 ), . . . , 1/p(x j n j )) T and I n j is the n j × n j identity matrix. It is worth noting that one need to estimate the propensity score (Waernbaum, 2010), in addition to σ 2 ε , in order to use this selection procedure. The error variance σ 2 ε may be estimated byσ where h ε could be equal to h j or selected separately, see e.g. Opsomer et al. (1995) for further discussion on this issue. We propose below novel double smoothing estimators of (4) and (5), respectively: and where g 1 , g 0 are pilot smoothing parameters selected for estimating β 1 and β 0 well, typically using cross-validation. The double smoothing (DS) estimation concept was utilized by Härdle et al. (1992), although for the estimation of the entire regression function β j (·). A difference between MSE INR β j and MSE DŜ β j is that the former is based on an asymptotic approximation of (4) while the double smoothing estimator targets (4) directly.

Simulation study
In this section, we study the finite sample properties of different methods for the selection of nearest neighbor type bandwidths, and in particular the resulting MSE when estimating the average causal effect τ.  (17).

Design of the study
Data were generated according to the model , 500, 1000. Since z i is a Bernoulli draw dependent on x i generated from a uniform distribution, n 1 and n 0 are stochastic. Table 1 and Figure 1 display the six designs generated. Bandwidths h 1 , h 0 considered are 40 equally spaced values within the intervals [0.1, 1] for n = 100, 200 and [0.02, 1] for n = 500, 1000, and, e.g., h = 0.1 implies using 10% of the data for the local fits. The true error variance, σ 2 ε , is used in (14), and (15) and (16) as well as the true propensity score, p(x) in (14). For the DS estimators in (15) and (16) the pilot bandwidths are chosen by leave-one-out cross-validation.
The criteria in (2), (3), (4), (5), (14), (15) and (16) and are computed for every bandwidth, 40 values, in the interval. For the minimizing bandwidthsτ imp is computed. Due to computer time constraint, we use 200 replicates. On the other hand, we reduce noise in the simulation results by making use of the control variate method withτ ols , the mean of the fitted values resulting from estimating τ(x) by ordinary least squares with correctly specified model, as control variate. Ifτ ols is positively correlated withτ imp thenτ c =τ imp − (τ ols − τ) has the same mean asτ imp but lower variance. For instance, for n = 1000 such correlations varied between 0.47 and 0.95 and most of them were larger than 0.8. All computations are made in R (R Development Core Team, 2010). Studying bandwidth selection by simulation is computationally demanding and this study was made possible by the use of the High Performance Computing Center North  Table  1. The first column displays β 1 (x i ) (solid line), β 0 (x i ) (dashed) and τ(x i ) (dotted), and the second column displays p(x i ).

Results
Results for n = 500 and 1000 are displayed in Figures 2-5 (Appendix A). More detailed results (also for n = 100, 200) are not displayed to save space but can be obtained from the authors. Note first that we can compute the smoothing parameter values minimizing (2), (4) and (5), labeled M y , M β and M τ , respectively, because we know the data generating mechanisms.
We see in Figures 2-5 that the double smoothing methods introduced, (15) and (16), labeled DS β and DS τ respectively, mimic quite well their target in terms of selected smoothing parameters. This is not the case for (14), labeled INR, whose selected smoothing parameters are not in accordance with the target M β . The results are further summarized in Tables 2-3. Table 2 summarizes MSE results given in Figures 2-5 (for n = 500, 1000) for the theoretical criteria M β , M τ and M y , by indicating which criterion yielded lowest MSE for the estimation of τ. We see that M τ always results in smallest MSE, which is, in most cases, significantly smaller than the second smallest MSE (achieved by M β except for Design 3, n = 200, and Design 5, n = 1000). Both M τ and M β result in significantly smaller MSE than M y in all cases but three (Design 3, n = 200, 1000, Design 5, n = 1000). Table 3 gives information on MSE (similar to  Table 2), where comparisons are made between the data-driven criteria DS β , DS τ , INR and CV. We see that double smoothing does not always yields lowest MSE, although CV is most often outperformed by the methods targeting the estimation of functional averages (DS and INR −for design 2 where INR performed best, CV was also outperformed by DS).
Finally, note that the propensity scores used in the designs of this study are rather extreme in the sense that they may yield probabilities near zero and one. We have also run these experiments by damping these propensity scores to let them vary only between 0.2 and 0.8. The results where similar qualitatively with double smoothing performing better. and M y . Stars indicate that the method has significantly lower MSE than the next best method, with "*" for a 5% level test and "**" for a 1% level test.

Design
Minimum MSE obtained by n 100 200 500 1000 1 M τ M * Table 3: MSE comparison: The table displays the method yielding lowest MSE among DS β , DS τ , INR and CV. Stars indicate that the method has significantly lower MSE than the next best method, with "*" for a 5% level test and "**" for a 1% level test.

Design
Minimum MSE obtained by n 100 200

Conclusion
In this paper we have proposed double smoothing methods for selecting smoothing parameters that target the estimation of functional averages where the latter are average causal effects of interest. In our numerical experiments cross-validation is often outperformed by double smoothing as we expected since the latter criterion is optimized for the estimation of functions underlying the average causal effect, and not the average itself. The methods proposed and studied here have large applicability, and are, for instance, straightforward to adapt to non-parametric estimators based on instruments as those introduced in Frölich (2007).

Asymptotics
In order to derive the results of Section 3.2 we focus on local linear regression with constant bandwidth. We use further the following assumptions.
(A1) The kernel K is a compactly supported, bounded kernel such that u 2 K(u)du = 0.
In addition, all odd-order moments of K vanish, that is u l K(u)du = 0 for all nonnegative odd integers l.
(A2) The covariate x has density f . The pointx is in the interior of supp( f ) = {x ∈ R : f (x) > 0}. Atx, f is continuously differentiable and all second-order derivatives of β j , j = 0, 1, are continuous.
We have Under ( and where . It follows from (18) and the fact that Using (19) we have Now, According to Ruppert and Wand (1994, eq. (2.11)) Noting that