Targeted smoothing parameter selection for estimating average causal effects

The non-parametric estimation of average causal effects in observational studies often relies on controlling for confounding covariates through smoothing regression methods such as kernel, spline or local polynomial regression. Such regression methods are tuned via smoothing parameters, which regulate the number of degrees of freedom used in the fit. In this paper we propose data-driven methods for selecting smoothing parameters when the targeted parameter is an average causal effect. For this purpose, we propose to estimate the exact expression of the mean squared error of the estimators. Asymptotic approximations indicate that the smoothing parameters minimizing this mean squared error converge to zero faster than the optimal smoothing parameters for the estimation of the regression functions. In a simulation study we show that the proposed data-driven methods for selecting the smoothing parameters yield lower empirical mean squared error than other available methods such as cross-validation.


Introduction
In observational studies where the interest lies in estimating the average causal effect of a binary treatment z on an outcome of interest y, non-parametric estimators are typically based on controlling for confounding covariates x with smoothing regression methods (kernel, splines, local polynomial regression, series estimators; see, e.g., the reviews by Imbens 2004, and Imbens and Wooldridge 2009). A useful modeling framework in this context was introduced by Neyman (1923) and Rubin (1974), where in particular two potential outcomes are considered for each unit in the study: the outcome that would be observed if the unit is treated, y(1), and the outcome that would be observed if the unit is not treated, y(0). The causal effect at the unit level is defined as y(1) − y(0). Population parameters are targeted by the inference, and we focus here on average causal effects of the type E(y(1) − y(0)), where the expectation is taken over a given population of interest. Inference on such expectations is complicated by the fact that the two potential outcomes are never both observed for a given unit in the sample (a missing data problem), and assumptions, e.g., on the missingness mechanism, must be made in order for the parameter of interest to be identified. In this paper, we consider situations, described in Sect. 2, where the causal effect conditional on an observed covariate x (or a score function summarizing a set of observed covariates), E(y(1) | x) − E(y(0) | x), is identified and can be estimated by fitting two curves, functions of x, E(y(1) | x, z = 1) and E(y(0) | x, z = 0), non-parametrically. An estimate of the targeted average causal effect is obtained by averaging the estimated curves over the relevant distribution for x to target E(y(1) − y(0)) = E(E(y(1) | x)) − E(E(y(0) | x)), where the missing outcomes are imputed by predictions from the fitted curves. A tuning parameter for each fitted curve is used to regulate the smoothness of the fit.
Cheng (1994) showed that when using kernel regression to estimate the average of a curve, say here E(E(y(1) | x)), with y(1) missing for some units as described above, the optimal (in mean squared error, MSE, sense) smoothing parameter for the estimation of the regression curve E(y(1) | x, z = 1) is not optimal for the estimation of the average E(E(y(1) | x)). More precisely, the optimal rate of convergence towards zero of the smoothing parameter (as the sample size increases) differs between the two situations, and one typically needs to asymptotically undersmooth E(y(1) | x, z = 1) when targeting E(E(y(1) | x)). We show in this paper that a similar result holds when using local linear regression instead of kernel regression, and when two curves (implying the choice of two tuning parameters) are fitted and then averaged to target E(y(1) − y(0)).
As a main contribution of the paper, we propose a novel data-driven method for selecting the smoothing parameters which minimize the mean squared error of non-parametric estimators of the average causal effect. Imbens et al. (2005) also propose a data-driven method based on the estimation of this mean squared error. The two estimators are, however, different. While Imbens et al. (2005) estimate an asymptotic approximation of the population MSE, which involves the estimation of the propensity score (the probability of ending up in one of the treatment groups, say z = 1, given the covariates), our estimator targets the exact population MSE by using a double smoothing technique previously used by Härdle et al. (1992) for estimating regression curves and by Häggström (2013) in semi-parametric additive models. Note that Frölich (2005) also derived asymptotic approximations of the MSE to obtain smoothing parameter selectors, although those were outperformed by cross-validation in finite sample simulations. With simulations we study the finite sample properties of the different data-driven methods. The results suggest that the cross-validation choice, which is known to be optimal in the MSE sense for estimating smooth curves (Fan 1992), can indeed be improved upon by using either Imbens et al. (2005) or our proposal, with the latter often being superior.
In the next section we introduce the potential outcome framework dating back to Neyman (1923) and Rubin (1974), which allows us to define the parameter of interest, the average causal effect, and commonly used identifying assumptions and estimators. The selection of smoothing parameters is discussed in Sect. 3, where we present asymptotic results based on the use of local linear regression. We also introduce in this section a novel data-driven method. Section 4 presents a simulation study. The paper is concluded in Sect. 5.

Model and estimation
2.1 Neyman-Rubin model for causal inference

Suppose we have n units indexed by i = 1, ..., n. For each unit i a binary treatment z_i is assigned. Further, each unit i is characterised by two potential outcomes y_i(1) and y_i(0), where y_i(1) is the response that is observed if the unit is given treatment z_i = 1 and y_i(0) the response if the unit is given treatment z_i = 0. Only one treatment assignment is possible for each unit and, therefore, only one of the two potential outcomes is observed. Denote by y_i = z_i y_i(1) + (1 − z_i) y_i(0) the observed outcome. We assume in the sequel that the n units correspond to a random sample from the distribution law of the random variables (y_i(1), y_i(0), z_i, x_i), and that only (y_i, z_i, x_i) is actually observed. We use the same notation to denote random variables and their realisations, letting the context make the distinction. The parameter of interest herein is an average causal effect, τ = E(y_i(1) − y_i(0)). If treatment assignment is not randomized, τ is identified if we have available a vector of covariates x_i = (x_i1, ..., x_id)^T not affected by treatment assignment and such that the following assumptions hold: y_i(1), y_i(0) ⊥⊥ z_i | x_i, often called the unconfoundedness assumption, and 0 < Pr(z_i = 1 | x_i) < 1, often called the overlap assumption. The sign ⊥⊥ is used here to mean "is independent of" (Dawid 1979). We have unconfoundedness if all covariates affecting both treatment assignment and the potential outcomes are included in x_i. This is a strong assumption which must be based on subject-matter reasoning. A sensitivity analysis with respect to this assumption is often advocated (e.g., de Luna and Lundin 2014). The assumption of overlap states that, for a unit with covariate vector x_i, the probability of receiving either treatment should be bounded away from 0. This assumption can be investigated empirically (e.g., Imbens and Wooldridge 2009).
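The observed-data structure just described can be sketched in a few lines of Python (a toy illustration; the variable names, distributions and seed are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
y1 = rng.normal(1.0, 1.0, n)   # potential outcome under treatment, y_i(1)
y0 = rng.normal(0.0, 1.0, n)   # potential outcome under control, y_i(0)
z = rng.integers(0, 2, n)      # binary treatment assignment z_i

# Only one potential outcome is observed per unit (the missing data problem):
y = z * y1 + (1 - z) * y0
```

The unit-level effects y1 − y0 are never all observed, which is why τ = E(y(1) − y(0)) must be inferred under the unconfoundedness and overlap assumptions above.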
Under these assumptions, identifiability of τ is then a consequence of

τ = E(E(y_i | x_i, z_i = 1)) − E(E(y_i | x_i, z_i = 0)). (1)

In the sequel we focus on the case d = 1 since, when d > 1, the covariate vector x_i can be replaced by a scalar, e.g., p(x_i) = Pr(z_i = 1 | x_i), the propensity score (Rosenbaum and Rubin 1983; Hansen 2008). Indeed, Rosenbaum and Rubin (1983) showed that it is sufficient to condition on the propensity score, i.e., under the above assumptions we have y_i(1), y_i(0) ⊥⊥ z_i | p(x_i), and 0 < Pr(z_i = 1 | p(x_i)) < 1. In applications the propensity score needs to be modelled and fitted to the data, and such situations are considered in the simulation study of Sect. 4. Typically parametric models are used to fit the propensity score, although these do not need to be correctly specified, as shown in Waernbaum (2010). Note also that covariate selection procedures may be used to reduce the dimensionality of x_i (de Luna et al. 2011).

Estimating average causal effects
Let β_0(x_i) = E(y_i | z_i = 0, x_i) and β_1(x_i) = E(y_i | z_i = 1, x_i) be unknown smooth functions, with Var(y_i | x_i, z_i) = σ_ε², i = 1, ..., n. Note that the assumption of constant conditional variance could be relaxed without changing in essence the results of this paper. We consider this assumption to alleviate the notational burden; the non-constant variance case is further discussed in the concluding section. From (1), we have that τ = E(β_1(x_i) − β_0(x_i)). Thus, a natural way to estimate τ is to first estimate the two regression functions β_1(x_i) and β_0(x_i), based on the treated and the non-treated, respectively, and then take the average over all the observed x_i s of the differences between the estimated functions. This estimator of τ is called the imputation estimator in Imbens et al. (2005). They use series estimators for estimating the regression functions, but any smoother, e.g., kernel, splines or local polynomial regression (Fan and Gijbels 1996, pp. 14-45), may be used.
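The imputation estimator can be sketched as follows, using a Gaussian-kernel (Nadaraya-Watson) smoother as a stand-in for whichever linear smoother one prefers; the function names and the toy data-generating model are our own assumptions, not the paper's:

```python
import numpy as np

def nw_smooth(x_train, y_train, x_eval, h):
    """Nadaraya-Watson kernel regression (Gaussian kernel), standing in
    for any linear smoother with bandwidth h."""
    w = np.exp(-0.5 * ((x_eval[:, None] - x_train[None, :]) / h) ** 2)
    return (w @ y_train) / w.sum(axis=1)

def tau_imputation(x, y, z, h0, h1):
    """Imputation estimator: fit beta_1 on the treated, beta_0 on the
    controls, then average the difference of the fits over all x_i."""
    b1 = nw_smooth(x[z == 1], y[z == 1], x, h1)
    b0 = nw_smooth(x[z == 0], y[z == 0], x, h0)
    return np.mean(b1 - b0)

# Toy check: randomized treatment, true effect tau = 2
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 2000)
z = rng.binomial(1, 0.5, 2000)
y = np.sin(2 * x) + 2 * z + rng.normal(0, 0.1, 2000)
tau_hat = tau_imputation(x, y, z, h0=0.1, h1=0.1)
```

In the toy check the treatment is randomized, so tau_hat should land close to 2; with confounding, the estimator instead relies on the unconfoundedness and overlap assumptions of Sect. 2.1.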
Denote by y^0 = (y_1^0, ..., y_{n_0}^0)^T and x^0 = (x_1^0, ..., x_{n_0}^0)^T the observed responses and covariates for the n_0 units with treatment z_i = 0, and similarly y^1 = (y_1^1, ..., y_{n_1}^1)^T and x^1 = (x_1^1, ..., x_{n_1}^1)^T for the n_1 units with treatment z_i = 1. The smoothers cited above are linear in the sense that the corresponding estimator of β_j(x) = (β_j(x_1), ..., β_j(x_n))^T can be written as β̂_j(x) = S_{h_j} y^j, where S_{h_j} is the smoothing matrix regressing y^j on x^j, using smoothing parameter h_j. The imputation estimator of τ mentioned above is

τ̂_imp = (1/n) Σ_{i=1}^n (β̂_1(x_i) − β̂_0(x_i)).

In this paper we base our results on a specific linear smoother, the local linear regression smoother, although we anticipate that most results should hold for any other linear smoother.
Local linear regression (Cleveland 1979; Fan and Gijbels 1996) consists in fitting a straight line at every x_i, i = 1, ..., n, using only the part of the data that is deemed to be sufficiently close to the target point x_i. The local fit at x_i is obtained by weighted least squares, with the weight of observation x_k^j given by K((x_k^j − x_i)/b_ji), where K is a kernel function such that ∫K(u)du = 1 and ∫uK(u)du = 0. An example is the tricube kernel, defined as K(u) = (70/81)(1 − |u|³)³ for |u| ≤ 1 and K(u) = 0 otherwise. The definition of b_ji, i = 1, ..., n, depends on the type of bandwidth we use. With a constant bandwidth, b_j1 = ... = b_jn = h_j. For a nearest neighbour type bandwidth, assuming no ties, b_ji is the Euclidean distance from x_i to the (h_j n_j):th nearest among the x_k^j s with x_k^j ≠ x_i, h_j ∈ [1/n_j, 1], k = 1, ..., n_j, and the smoothing parameter h_j is the proportion of observations being used to produce the local fit.
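A minimal sketch of a local linear fit with the tricube kernel and a nearest-neighbour bandwidth (our own implementation, assuming no ties among the covariate values):

```python
import numpy as np

def tricube(u):
    """Tricube kernel: K(u) = (70/81)(1 - |u|^3)^3 on |u| <= 1."""
    u = np.abs(u)
    return np.where(u < 1, (70 / 81) * (1 - u**3) ** 3, 0.0)

def local_linear(x_train, y_train, x0, span=0.5):
    """Local linear fit at the target point x0 with a nearest-neighbour
    bandwidth: span (= h_j) is the proportion of observations used."""
    k = max(2, int(np.ceil(span * len(x_train))))
    d = np.abs(x_train - x0)
    b = np.sort(d)[k - 1]                # distance to the k:th nearest neighbour
    w = tricube(d / b)
    # Weighted least squares fit of a line centred at x0; the intercept
    # is the local linear estimate of the regression function at x0.
    X = np.column_stack([np.ones_like(x_train), x_train - x0])
    lhs = X.T @ (w[:, None] * X)
    rhs = X.T @ (w * y_train)
    return np.linalg.solve(lhs, rhs)[0]
```

Because a weighted least squares line reproduces linear data exactly whatever the weights, local linear regression avoids the boundary bias of kernel regression (Fan and Gijbels 1996).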

Mean squared errors
Many smoothing parameter selection methods are developed with the purpose of estimating the regression functions β_j(x_i), j = 0, 1, i = 1, ..., n, and attempt to select the smoothing parameter minimizing the average conditional mean squared error

MSE_{β_j}(h_j) = (1/n_j) Σ_{i=1}^{n_j} E[(β̂_j(x_i^j) − β_j(x_i^j))² | x^j]. (2)

One frequently used selection procedure that attempts to select the smoothing parameter minimizing (2) is leave-one-out cross-validation. In this setting, cross-validation selects the smoothing parameter h_j minimizing

CV(h_j) = (1/n_j) Σ_{i=1}^{n_j} (y_i^j − β̂_j^{(−i)}(x_i^j))², (3)

where β̂_j^{(−i)} denotes the fit computed with observation i left out. Asymptotically, for local linear regression, the smoothing parameter minimizing (2) is proportional to n_j^{−1/5} (Fan 1992) and, hence, proportional to n^{−1/5} since n_j = n Pr(z = j) + o_p(n). However, it is known that for estimating a functional of β_j(x_i) such as E(β_j(x_i)), the smoothing parameter minimizing (2) is not optimal, in the sense that it does not result in √n-consistent estimation of the functional (e.g., Cheng 1994). Imbens et al. (2005) suggest that one should select h_0 and h_1 by minimizing the conditional mean squared error of (1/n) Σ_{i=1}^n β̂_j(x_i), summed over j = 0, 1, denoted (4). We argue that, in order to estimate τ optimally, it may be more suitable to select the combination (h_0, h_1) minimizing the conditional mean squared error of τ̂_imp, denoted (5). Note that the conditional bias of τ̂_imp is the difference of the two average bias terms entering (4), while its conditional variance is the sum of the two variance terms. Hence, criterion (5) differs from (4) when both average bias terms in the latter expression are different from zero.
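For any linear smoother, the leave-one-out criterion (3) can be computed without refitting, since the deleted residual equals (y_i − ŷ_i)/(1 − S_ii). A sketch, with a row-normalized Gaussian-kernel smoother standing in for the local linear smoother (names and test data are our own):

```python
import numpy as np

def smoother_matrix(x, h):
    """Row-normalized Gaussian kernel weights: the smoother matrix S_h
    of a Nadaraya-Watson fit, standing in for any linear smoother."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return w / w.sum(axis=1, keepdims=True)

def loocv_score(x, y, h):
    """Leave-one-out CV score; for this smoother the deleted residual
    is exactly (y_i - yhat_i) / (1 - S_ii)."""
    S = smoother_matrix(x, h)
    resid = (y - S @ y) / (1 - np.diag(S))
    return np.mean(resid**2)

def select_bandwidth(x, y, grid):
    """Return the grid value minimizing the LOOCV criterion."""
    return min(grid, key=lambda h: loocv_score(x, y, h))

# Toy data: the oversmoothing bandwidth 0.5 should not be selected
x = np.linspace(0, 1, 60)
y = np.sin(4 * x) + np.random.default_rng(5).normal(0, 0.1, 60)
h_cv = select_bandwidth(x, y, [0.02, 0.05, 0.1, 0.5])
```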

Asymptotics
Asymptotic approximations can be used to describe optimal bandwidth choices as the sample size tends to infinity. The results presented here are deduced in the Appendix, Sect. 6.2, where regularity conditions also used in Ruppert and Wand (1994) are given. For local linear regression with constant bandwidth such that h_j → 0 and nh_j → ∞ as n → ∞, we obtain approximations for the conditional bias and variance of (1/n) Σ_{i=1}^n β̂_j(x_i), denoted (6) and (7), with leading constants depending on the kernel, the design density and the curvature of β_j; the resulting approximate MSE criteria corresponding to (4) and (5) are denoted (8) and (9), respectively. Let us first consider the optimal smoothing parameter for estimating E(β_j(x)) and assume nh_j³ → 0 as n → ∞, j = 0, 1. An asymptotic approximation to the bandwidth minimizing (8) shows that the optimal bandwidths are of order n^{−2/5}, so that the optimal bandwidths for the estimation of the average functional τ are smaller than the optimal bandwidths for the estimation of the regression functions β_j(·), the latter being of order n^{−1/5}. Thus, the regression functions must be undersmoothed when the target of the inference is τ. A similar result was shown in Cheng (1994) for kernel regression. Turning to the minimization of (9), this must be done simultaneously in h_0 and h_1. A reasonable assumption, however, is that these two smoothing parameters have the same rate of convergence to zero. Under this assumption we may replace h_1 by ch_0, for c a constant, in (9). Minimizing the latter in h_0 yields, as above, an optimal bandwidth of order n^{−2/5}.
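The order n^{−2/5} can be read off from a generic bias-variance trade-off; the following sketch uses unspecified positive constants C_1, C_2, C_3 (our notation, not the paper's exact expressions (6)-(9)):

```latex
% Curve estimation: squared bias O(h^4), variance O(1/(nh))
\mathrm{MSE}_{\beta}(h) \approx C_1 h^4 + \frac{C_2}{nh}
\;\Rightarrow\; 4 C_1 h^3 = \frac{C_2}{n h^2}
\;\Rightarrow\; h_{\mathrm{opt}} \propto n^{-1/5}.

% Average functional: the O(1/(nh)) variance term averages out,
% leaving a variance of order 1/n plus an h-dependent term O(1/(n^2 h))
\mathrm{MSE}_{\tau}(h) \approx C_1 h^4 + \frac{\sigma^2}{n} + \frac{C_3}{n^2 h}
\;\Rightarrow\; 4 C_1 h^3 = \frac{C_3}{n^2 h^2}
\;\Rightarrow\; h_{\mathrm{opt}} \propto n^{-2/5}.
```

The h-independent σ²/n term is why the average can be estimated at the √n rate even though the curves themselves cannot.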
Another related result, deduced from (6) and (7), is that, as n → ∞, if h_j ∝ n^r for −1 < r < −1/4, then the bias of τ̂_imp is asymptotically negligible at the √n scale (see Appendix, Sect. 6.2). The results above show that selecting the smoothing parameters minimizing (4) will lead to √n-consistent estimation of τ. This is in accordance with previous results (e.g., Speckman 1988), where it was shown that asymptotic undersmoothing of the regression function is needed for the √n-consistent estimation of a functional of the regression function.

Estimating MSEs

Imbens et al. (2005) propose the following estimator, (14), of (4), for j = 0, 1, where p̂_j = (1/p̂(x_1^j), ..., 1/p̂(x_{n_j}^j))^T and I_{n_j} is the n_j × n_j identity matrix. It is worth noting that one needs to estimate the propensity score (Waernbaum 2010), in addition to σ_ε², in order to use this selection procedure. The error variance σ_ε² may be estimated by a residual-based estimator, (15), in which y = (y^{0T}, y^{1T})^T and h_{εj}, j = 0, 1, could be equal to h_j or selected separately; see, e.g., Opsomer et al. (1995) for further discussion of this issue.
We propose below novel double smoothing estimators, (16) and (17), of (4) and (5), respectively, where g_0, g_1 are pilot smoothing parameters. Because the purpose of these pilot parameters is to estimate β_0 and β_1, respectively, we suggest selecting them by leave-one-out cross-validation; see (3). In specific situations one may want to check whether the results are sensitive to changes in the choice of the pilot parameters. The double smoothing (DS) estimation concept was utilized by Härdle et al. (1992), although for the estimation of the entire regression function β_j(·). One could, as mentioned by Härdle et al. (1992), specify the pilot bandwidths as g_j = n_j^{−c}, for an appropriate constant c, which would result in good asymptotic performance. This would also reduce the computational burden of the method, although a relevant choice of the arbitrary constant c is problematic. Finally, note that a difference between the INR estimator of (4) and the DS estimator is that the former is based on an asymptotic approximation of (4), while the double smoothing estimator targets (4) directly.
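The structure of a double smoothing criterion can be sketched as follows: a pilot fit with bandwidth g stands in for the unknown β_j when estimating the bias of the candidate fit with bandwidth h, and a plug-in term estimates the variance. This is a structural sketch only, with a Gaussian-kernel smoother and our own names; the paper's exact criteria (16) and (17) differ in detail:

```python
import numpy as np

def gauss_S(x_train, x_eval, h):
    """Row-normalized Gaussian kernel smoother weights (a stand-in
    linear smoother; the paper uses local linear regression)."""
    w = np.exp(-0.5 * ((x_eval[:, None] - x_train[None, :]) / h) ** 2)
    return w / w.sum(axis=1, keepdims=True)

def ds_mse(x_j, y_j, x_all, h, g, sigma2):
    """Double-smoothing MSE estimate for the average of betahat_j over
    all x_i: bias of the candidate fit (bandwidth h) is estimated
    against a pilot fit (bandwidth g), variance by a sigma2 plug-in."""
    pilot_train = gauss_S(x_j, x_j, g) @ y_j     # pilot fit at group-j points
    pilot_all = gauss_S(x_j, x_all, g) @ y_j     # pilot fit at all x_i
    S_h = gauss_S(x_j, x_all, h)                 # candidate smoother
    # Estimated bias of the average: smooth the pilot curve with the
    # candidate smoother and compare with the pilot curve itself.
    bias = np.mean(S_h @ pilot_train - pilot_all)
    # Plug-in variance of the average of the candidate fit.
    a = S_h.mean(axis=0)                         # weight of each y_j in the average
    var = sigma2 * np.sum(a**2)
    return bias**2 + var

# Toy evaluation on synthetic group-j data
x1 = np.linspace(0, 1, 80)
y1 = np.sin(3 * x1) + np.random.default_rng(6).normal(0, 0.1, 80)
x_all = np.linspace(0, 1, 120)
mse_h = ds_mse(x1, y1, x_all, h=0.15, g=0.05, sigma2=0.01)
```

In practice the criterion would be evaluated over a grid of candidate bandwidths h and minimized, with g chosen by cross-validation as suggested above.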

Simulation study
In this section, we study the finite sample properties of different methods for the selection of constant and nearest neighbour type bandwidths, and in particular the resulting empirical MSE when estimating the average causal effect τ .

Design of the study
Data were generated according to six designs, with n = 100, 200, 500, 1,000. Since z_i is a Bernoulli draw dependent on x_i, which is generated from a uniform distribution, n_1 and n_0 are stochastic. Table 1 and Fig. 1 display the six designs generated. The bandwidths h_0 and h_1 considered are, for the constant bandwidth setting, 40 equally spaced values within the interval [h_min, 2π], where h_min is the smallest bandwidth value such that at least 10 observations are used for the local fits. For the nearest neighbour bandwidth setting, we consider 40 equally spaced values within the intervals [0.1, 1] for n = 100, 200 and [0.02, 1] for n = 500, 1,000; here, e.g., h = 0.1 implies using 10 % of the data for the local fits. The propensity score, p(x), in (14) is estimated by logistic regression with a correctly specified model for Designs 1-3 (i.e., glm(z ~ x, family=binomial) in R) and a misspecified model for Designs 4-6 (i.e., glm(z ~ I(sin(2*x))+I(cos(x))+x+I(x^2), family=binomial) in R). The variance estimator (15) is used in (14), (16) and (17), with h_{εj}, j = 0, 1, selected by leave-one-out cross-validation (3). These cross-validation bandwidths are also used as pilot bandwidths in the DS estimators (16) and (17).
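A hedged sketch of one such design step in Python, generating treatment from a logistic propensity model and refitting it by maximum likelihood; the coefficients and seed below are illustrative stand-ins, not the paper's Table 1 designs:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(0, 2 * np.pi, n)              # uniform covariate
p_true = 1 / (1 + np.exp(-(0.5 * x - 1.5)))   # hypothetical propensity model
z = rng.binomial(1, p_true)                   # Bernoulli treatment given x_i
n1, n0 = int(z.sum()), n - int(z.sum())       # stochastic group sizes

def logistic_fit(X, z, iters=25):
    """Maximum likelihood logistic regression via Newton-Raphson
    (what glm(..., family=binomial) computes in R)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        W = p * (1 - p)
        beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (z - p))
    return beta

X = np.column_stack([np.ones(n), x])          # correctly specified: z ~ x
p_hat = 1 / (1 + np.exp(-X @ logistic_fit(X, z)))
```

The fitted p_hat would then be plugged into criterion (14); a misspecified design replaces the linear predictor with the transformed covariates shown in the R formula above.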
The criteria (2), (3), (4), (5), (14), (16) and (17) are computed for each of the 40 bandwidth values in the interval. For the minimizing bandwidths, τ̂_imp is computed. Due to computer time constraints, we use 200 replicates. On the other hand, we reduce noise in the simulation results by making use of the control variate method (see, e.g., Wilson 1984), with τ̂_ols, the mean of the fitted values resulting from estimating τ(x) by ordinary least squares with a correctly specified model, as control variate. If τ̂_ols is positively correlated with τ̂_imp, then τ̂_c = τ̂_imp − (τ̂_ols − τ) has the same mean as τ̂_imp but smaller variance.
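The control variate device can be illustrated with a toy Monte Carlo; the correlation structure below is artificial, chosen only to mimic two positively correlated estimator replicates:

```python
import numpy as np

def control_variate(tau_imp, tau_ols, tau):
    """Control-variate adjustment tau_c = tau_imp - (tau_ols - tau):
    same mean as tau_imp when tau_ols is unbiased, and lower variance
    when the two estimators are positively correlated (Wilson 1984)."""
    return tau_imp - (tau_ols - tau)

# Toy illustration with artificially correlated estimator replicates
rng = np.random.default_rng(4)
tau = 2.0
shared = rng.normal(0, 1, 10_000)                    # common simulation noise
tau_imp = tau + shared + rng.normal(0, 0.5, 10_000)  # noisy estimator draws
tau_ols = tau + shared + rng.normal(0, 0.5, 10_000)  # positively correlated draws
tau_c = control_variate(tau_imp, tau_ols, tau)
```

The shared noise cancels in τ̂_c, which is what reduces the Monte Carlo variance relative to τ̂_imp alone.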

Results
Results for n = 500 and 1,000 are displayed in Tables 2 and 3, for both constant and nearest neighbour bandwidths, and in Figs. 2, 3, 4 and 5 (Appendix, Sect. 6.1) for nearest neighbour bandwidths. Due to the similarity of bandwidth selection patterns, and to save space, analogous figures with results for constant bandwidths are not included. These figures and more detailed results (also for n = 100, 200), likewise left out to save space, can be obtained from the authors. Note first that we can compute the smoothing parameter values minimizing (2), (4) and (5), labeled M_y, M_β and M_τ, respectively, because we know the data generating mechanisms. We see in Figs. 2, 3, 4 and 5 that the double smoothing methods introduced, (16) and (17), labeled DS_β and DS_τ respectively, mimic their targets quite well in terms of selected smoothing parameters. This is not the case for (14), labeled INR, whose selected smoothing parameters are not in accordance with the target M_β. Table 2 summarizes empirical MSE results for the theoretical criteria M_β, M_τ and M_y, by indicating which criterion yielded the lowest MSE for the estimation of τ. For constant bandwidths, the smallest MSE is most often obtained by M_β or M_τ and the largest MSE is most often obtained by M_y. However, only in 17 and 25 % of the cases, respectively, do M_τ and M_β result in significantly lower MSE than M_y. For nearest neighbour bandwidths, we see that M_τ always results in the smallest MSE for n = 200, 500, 1,000, which is, in half of the cases, significantly smaller than the second smallest MSE (achieved by M_β). Both M_τ and M_β result in significantly smaller MSE than M_y in a majority of cases (71 and 67 %, respectively). Table 3 gives analogous empirical MSE information (similar to Table 2), where comparisons are made between the data-driven criteria DS_β, DS_τ, INR and CV.
For both the constant and nearest neighbour bandwidth settings, we see that double smoothing does not always yield the lowest empirical MSE, although CV is most often outperformed by the methods targeting the estimation of functional averages (DS and INR; for Design 2, where INR performed best, CV was also outperformed by DS_τ but not by DS_β).
Finally, note that the propensity scores used in the designs of this study are rather extreme in the sense that they may yield probabilities near zero and one. We have also run these experiments after damping the propensity scores so that they vary only between 0.2 and 0.8. The results were qualitatively similar, with double smoothing often performing better.

Conclusion
In this paper we have proposed double smoothing methods for selecting smoothing parameters that target the estimation of functional averages, where the latter are average causal effects of interest. In our numerical experiments cross-validation is often outperformed by double smoothing, as we expected, since the cross-validation criterion is optimized for the estimation of the functions underlying the average causal effect, and not for the average itself. The methods proposed and studied here have wide applicability and are, for instance, straightforward to adapt to non-parametric estimators based on instruments, such as those introduced in Frölich (2007). Finally, note that results similar to those obtained here should hold under a non-constant variance assumption (Andrews 1991; Ruppert and Wand 1994). In such cases the estimation of σ_ε² needs to be replaced by estimators of Var(y_i | x_i, z_i = 0) and Var(y_i | x_i, z_i = 1), e.g., using linear smoothers when regressing y_i² on x_i for the units with z_i = 0 and z_i = 1, respectively.
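Under non-constant variance, the group-wise conditional variances can be estimated by smoothing both y and y² and combining, along the lines the paragraph above suggests; a sketch with a Nadaraya-Watson smoother and our own function names:

```python
import numpy as np

def nw(x_train, y_train, x_eval, h):
    """Gaussian-kernel (Nadaraya-Watson) smoother, a stand-in linear smoother."""
    w = np.exp(-0.5 * ((x_eval[:, None] - x_train[None, :]) / h) ** 2)
    return (w @ y_train) / w.sum(axis=1)

def cond_var(x_j, y_j, x_eval, h):
    """Estimate Var(y | x, z = j) = E(y^2 | x) - E(y | x)^2 by smoothing
    y and y^2 separately within treatment group j."""
    m1 = nw(x_j, y_j, x_eval, h)
    m2 = nw(x_j, y_j**2, x_eval, h)
    return np.maximum(m2 - m1**2, 0.0)   # truncate at zero
```

These group-wise variance estimates would then replace the single σ̂_ε² plug-in in the MSE criteria.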
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Appendix: Asymptotics
In order to derive the results of Sect. 3.2 we focus on local linear regression with constant bandwidth, under the regularity conditions used in Ruppert and Wand (1994). Theorem 2.1 of Ruppert and Wand (1994) provides the conditional bias and variance of the local linear estimator, giving expansion (19). It follows from (19) and the fact that n_j = (−1)^{j+1} Σ_{i=1}^n z_i + n(1 − j) that, for h_j ∝ n^r, the scaled average bias satisfies

n^{1/2} B_1(1) n^{2r} + o_p(n^{1/2} n^{2r}) = O(n^{1/2+2r}) + o_p(n^{1/2+2r}),

which tends to zero when r < −1/4.