1 Introduction

In observational studies where the interest lies in estimating the average causal effect of a binary treatment \(z\) on an outcome of interest \(y\), non-parametric estimators are typically based on controlling for confounding covariates \(x\) with smoothing regression methods (kernel, splines, local polynomial regression, series estimators; see, e.g., the reviews by Imbens 2004, and Imbens and Wooldridge 2009). A useful modeling framework in this context was introduced by Neyman (1923) and Rubin (1974), in which two potential outcomes are considered for each unit in the study: the outcome that would be observed if the unit is treated, \(y(1)\), and the outcome that would be observed if the unit is not treated, \(y(0)\). The causal effect at the unit level is defined as \(y(1)-y(0)\). Population parameters are targeted by the inference, and we focus here on average causal effects of the type \(E(y(1)-y(0))\), where the expectation is taken over a given population of interest. Inference on such expectations is complicated by the fact that the two potential outcomes are not observed for all units in the sample (a missing data problem), and assumptions, e.g., on the missingness mechanism, must be made in order for the parameter of interest to be identified. In this paper, we consider situations, described in Sect. 2, where the causal effect conditional on an observed covariate \(x\) (or a score function summarizing a set of observed covariates), \(E(y(1)\mid x)-E(y(0)\mid x)\), is identified and can be estimated by fitting two curves in \(x\), \(E(y(1)\mid x,z=1)\) and \(E(y(0)\mid x,z=0)\), non-parametrically. An estimate of the targeted average causal effect is obtained by averaging the estimated curves over the relevant distribution for \(x\) to target \(E(y(1)-y(0))=E(E(y(1)\mid x))-E(E(y(0)\mid x))\), where the missing outcomes are imputed by predictions from the fitted curves. A tuning parameter for each fitted curve is used to regulate the smoothness of the fit. Cheng (1994) showed that when kernel regression is used to estimate the average of a curve, say here \(E(E(y(1)\mid x))\), with \(y(1)\) missing for some units as described above, the optimal (in the mean squared error, MSE, sense) smoothing parameter for the estimation of the regression curve \(E(y(1)\mid x,z=1)\) is not optimal for the estimation of the average \(E(E(y(1)\mid x))\). More precisely, the optimal rate of convergence of the smoothing parameter towards zero (as the sample size increases) differs between the two situations, and one typically needs to asymptotically undersmooth \(E(y(1)\mid x,z=1)\) when targeting \(E(E(y(1)\mid x))\). We show in this paper that a similar result holds when local linear regression is used instead of kernel regression, and when two curves (implying the choice of two tuning parameters) are fitted and then averaged to target \(E(y(1)-y(0))\).

As the main contribution of the paper, we propose a novel data-driven method for selecting the smoothing parameters that minimize the mean squared error of non-parametric estimators of the average causal effect. Imbens et al. (2005) also propose a data-driven method based on the estimation of this mean squared error. The two estimators are, however, different. While Imbens et al. (2005) estimate an asymptotic approximation of the population MSE, which involves estimating the propensity score (the probability of ending up in one of the treatment groups, say \(z=1\), given the covariates), our estimator targets the exact population MSE by using a double smoothing technique previously employed by Härdle et al. (1992) for estimating regression curves and by Häggström (2013) in semi-parametric additive models. Note that Frölich (2005) also derived asymptotic approximations of the MSE to obtain smoothing parameter selectors, although those were outperformed by cross-validation in finite sample simulations. With simulations we study the finite sample properties of the different data-driven methods. The results suggest that the cross-validation choice, which is known to be optimal in the MSE sense for estimating smooth curves (Fan 1992), can indeed be improved upon by using either Imbens et al. (2005) or our proposal, with the latter often being superior.

In the next section we introduce the potential outcome framework dating back to Neyman (1923) and Rubin (1974), which allows us to define the parameter of interest, the average causal effect, as well as commonly used identifying assumptions and estimators. The selection of smoothing parameters is discussed in Sect. 3, where we present asymptotic results based on local linear regression and introduce the novel data-driven method. Section 4 presents a simulation study. The paper is concluded in Sect. 5.

2 Model and estimation

2.1 Neyman–Rubin model for causal inference

Suppose we have \(n\) units indexed by \(i=1,\ldots ,n\). For each unit \(i\) a binary treatment \(z_i\) is assigned:

$$\begin{aligned} z_i=\begin{cases} 1 & \text {if unit } i \text { receives treatment 1},\\ 0 & \text {if unit } i \text { receives treatment 0}. \end{cases} \end{aligned}$$

Further, each unit \(i\) is characterised by two potential outcomes \(y_i(1)\) and \(y_i(0)\), where \(y_i(1)\) is the response that is observed if the unit is given treatment \(z_i=1\) and \(y_i(0)\) the response if the unit is given treatment \(z_i=0\). Only one treatment assignment is possible for each unit and, therefore, only one of the two potential outcomes is observed. Denote by \(y_i=y_i(0)(1-z_i)+y_i(1)z_i\) the observed outcome. Finally, each unit has a vector of \(d\) background characteristics (covariates) \(\mathbf {x}_{i}=(x_{i1},\ldots , x_{id})^{T}\). We assume in the sequel that the \(n\) units correspond to a random sample from the distribution law of the random variables \((y_i(1),y_i(0),z_i,\mathbf {x}_{i})\), and that only \((y_i,z_i,\mathbf {x}_{i})\) is actually observed. We use the same notation to denote random variables and their realisations, letting the context make the distinction.

The parameter of interest herein is an average causal effect,

$$\begin{aligned} \tau =E\big (y_i(1)-y_i(0)\big ). \end{aligned}$$

If treatment assignment is not randomized, \(\tau \) is identified if we have available a vector of covariates \(\mathbf {x}_{i}=(x_{i1},\ldots , x_{id})^{T}\) not affected by treatment assignment and such that the following assumptions hold,

$$\begin{aligned} y_i(1), y_i(0) \perp \!\!\!\perp z_i|\mathbf {x}_i, \end{aligned}$$

often called the unconfoundedness assumption, and

$$\begin{aligned} 0<\Pr (z_i=1|\mathbf {x}_i)<1, \end{aligned}$$

often called the overlap assumption. The sign \(\perp \!\!\!\perp \) is used here to mean “is independent of” (Dawid 1979). We have unconfoundedness if all covariates affecting both treatment assignment and the potential outcomes are included in \(\mathbf {x}_i\). This is a strong assumption, which must be based on subject-matter reasoning. A sensitivity analysis with respect to this assumption is often advocated (e.g., de Luna and Lundin 2014). The overlap assumption states that, for a unit with covariate vector \(\mathbf {x}_i\), the probability of receiving each of the two treatments is strictly positive. This assumption can be investigated empirically (e.g., Imbens and Wooldridge 2009). Under these assumptions, identifiability of \(\tau \) is then a consequence of

$$\begin{aligned} \tau&=E\big (y_i(1)-y_i(0)\big )\nonumber \\&=E\big (E(y_i(1)|\mathbf {x}_i)-E(y_i(0)|\mathbf {x}_i)\big ) \nonumber \\&=E\big (E(y_i(1)|z_i=1,\mathbf {x}_i)-E(y_i(0)|z_i=0,\mathbf {x}_i)\big ) \nonumber \\&=E\big (E(y_i|z_i=1,\mathbf {x}_i)-E(y_i|z_i=0,\mathbf {x}_i)\big ). \end{aligned}$$
(1)

In the sequel we focus on the case \(d=1\) since, when \(d>1\), the covariate vector \(\mathbf {x}_i\) can be replaced by a scalar, e.g., \(p(\mathbf {x}_i)=\Pr (z_i=1|\mathbf {x}_i)\), the propensity score (Rosenbaum and Rubin 1983; Hansen 2008). Indeed, Rosenbaum and Rubin (1983) showed that it is sufficient to condition on the propensity score, i.e., under the above assumptions we have \( y_i(1), y_i(0) \perp \!\!\!\perp z_i|p(\mathbf {x}_{i}), \) and \( 0<\Pr (z_{i}=1|p(\mathbf {x}_{i}))<1. \) In applications the propensity score needs to be modelled and fitted to the data, and such situations are considered in the simulation study of Sect. 4. Typically, parametric models are used to fit the propensity score, although these need not be correctly specified, as shown in Waernbaum (2010). Note also that covariate selection procedures may be used to reduce the dimensionality of \(\mathbf {x}_{i}\) (de Luna et al. 2011).

2.2 Estimating average causal effects

Let \(\beta _0(x_i)=E(y_i|z_i=0,{x}_i)\) and \(\beta _1(x_i)=E(y_i|z_i=1,{x}_i)\) be unknown smooth functions, and let \(Var(y_i|x_i, z_i)=\sigma _{\varepsilon }^2\), \(i=1, \ldots , n\). The assumption of constant conditional variance could be relaxed without essentially changing the results of this paper; we make it to alleviate the notational burden. The non-constant variance case is further discussed in the concluding section. From (1), we have that

$$\begin{aligned} \tau =E\big (\beta _1(x_i)\big )-E\big (\beta _0(x_i)\big ). \end{aligned}$$

Thus, a natural way to estimate \(\tau \) is to first estimate the two regression functions \(\beta _1(x_i)\) and \(\beta _0(x_i)\), based on the treated and the non-treated, respectively, and then take the average over all the observed \(x_i\)s of the differences between the estimated functions. This estimator of \(\tau \) is called the imputation estimator in Imbens et al. (2005). They use series estimators for estimating the regression functions but any smoother, e.g., kernel, splines and local polynomial regression (Fan and Gijbels 1996, pp. 14–45), may be used.

Denote by \(\mathbf {y}^0=(y_{1}^0,\ldots ,y_{n_0}^0)^T\) and \(\mathbf {x}^0=(x_{1}^0,\ldots ,x_{n_0}^0)^T\) the observed responses and covariates for the \(n_0\) units with treatment \(z_i=0\), and similarly \(\mathbf {y}^1=(y_{1}^1,\ldots ,y_{n_1}^1)^T\) and \(\mathbf {x}^1=(x_{1}^1,\ldots ,x_{n_1}^1)^T\) for the \(n_1\) units with treatment \(z_i=1\). The smoothers cited above are linear in the sense that the corresponding estimator of \(\beta _j(\mathbf {x})=(\beta _j(x_1),\ldots ,\beta _j(x_n))^T\) can be written as

$$\begin{aligned} \hat{\beta }_{j}^{h_j}(\mathbf {x})&=S_{j}^{h_j}[\mathbf {x}]\mathbf {y}^{j}, \ \ j=0,1, \end{aligned}$$

where \(\mathbf {x}=(\mathbf {x}^{0T},\mathbf {x}^{1T})^T\) and \(S_{j}^{h_j}[\mathbf {x}]\) is the smoothing matrix regressing \(\mathbf {y}^j\) on \(\mathbf {x}^j\), using smoothing parameter \(h_j\). The imputation estimator of \(\tau \) mentioned above is

$$\begin{aligned} \hat{\tau }^{imp}=\frac{1}{n}\sum _{i=1}^n\hat{\tau }^{imp}(x_i)=\frac{1}{n}\sum _{i=1}^n\big (\hat{\beta }_{1}^{h_1}(x_i)-\hat{\beta }_{0}^{h_{0}}(x_i)\big ). \end{aligned}$$
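
To fix ideas, here is a minimal R sketch of \(\hat{\tau }^{imp}\), using the built-in loess function (a local linear regression smoother with tricube kernel and nearest-neighbour span, of the type described below) as the linear smoother; the function name tau_imp and its arguments are ours, not part of the original estimator.

## Minimal sketch of the imputation estimator; loess() plays the role of the
## linear smoother, with span = h_j acting as a nearest-neighbour smoothing
## parameter (the proportion of observations used in each local fit)
tau_imp <- function(y, z, x, h1, h0) {
  ctrl <- loess.control(surface = "direct")   # allow prediction at every x
  fit1 <- loess(y ~ x, subset = z == 1, span = h1, degree = 1, control = ctrl)
  fit0 <- loess(y ~ x, subset = z == 0, span = h0, degree = 1, control = ctrl)
  ## impute both potential outcomes for all n units and average the difference
  mean(predict(fit1, newdata = data.frame(x = x)) -
       predict(fit0, newdata = data.frame(x = x)))
}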

In this paper we base our results on a specific linear smoother, the local linear regression smoother, although we anticipate that most results should hold for any other linear smoother.

Local linear regression (Cleveland 1979; Fan and Gijbels 1996) consists of fitting a straight line at every \(x_{i}\), \(i=1,\ldots ,n\), using only the part of the data deemed sufficiently close to the target point \(x_{i}\). Consider estimating the regression function \(\beta _j(\cdot )\), \(j=0, 1\). The fit, at \(x_i\), is

$$\begin{aligned} \hat{\beta }_j^{h_j}(x_i)=\mathbf {e}_{1}^T(\mathbf {X}_{i}^{jT}\mathbf {W}_i^{h_j}\mathbf {X}_{i}^{j})^{-1}\mathbf {X}_{i}^{jT}\mathbf {W}_i^{h_j}\mathbf {y}^j=S_j^{h_j}[x_i]\mathbf {y}^j \end{aligned}$$

where \(\mathbf {e}_{1}=(1,0)^T\),

$$\begin{aligned} \mathbf {X}_{i}^{j}= \left( \begin{array}{c@{\quad }c} 1 &{} \big (x_{1}^j-x_i\big ) \\ \vdots &{} \vdots \\ 1 &{} \big (x_{n_j}^j-x_i\big ) \end{array} \right) \end{aligned}$$

and

$$\begin{aligned} \mathbf {W}_i^{h_j}=\text{ diag }(K\big ((x_{1}^j-x_i)/b_{ji}\big )/b_{ji}, \ldots , K\big ((x_{n_j}^j-x_i)/b_{ji}\big )/b_{ji}). \end{aligned}$$

\(K(\cdot )\) is a kernel function such that \(\int K(u) du=1\) and \(\int u K(u)du=0 \). An example is the tricube kernel defined as

$$\begin{aligned} K(u)=\begin{cases} \frac{70}{81}\big (1-|u|^{3}\big )^{3}, & \text {if } |u|<1,\\ 0, & \text {if } |u|\ge 1. \end{cases} \end{aligned}$$

The definition of \(b_{ji}\), \(i=1,\ldots ,n\), depends on the type of bandwidth used. With a constant bandwidth, \(b_{j1}=\cdots =b_{jn}=h_j\). For a nearest neighbour type bandwidth, assuming no ties, \(b_{ji}\) is the Euclidean distance from \(x_i\) to the \((h_jn_j)\):th nearest of the \(x_{k}^j\):s with \(x_{k}^j\ne x_i\), \(k=1,\ldots ,n_j\), where \(h_j\in [1/n_j,1]\) is the smoothing parameter, interpreted as the proportion of observations used to produce the local fit.
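
To make the above concrete, the following is a minimal R sketch of the smoothing vector \(S_j^{h_j}[x_i]\) in the constant bandwidth case; the function names tricube and ll_weights are ours.

## Tricube kernel and the local linear smoothing vector
## e_1^T (X^T W X)^{-1} X^T W of Sect. 2.2, for constant bandwidth h
tricube <- function(u) ifelse(abs(u) < 1, (70/81) * (1 - abs(u)^3)^3, 0)
ll_weights <- function(x0, xj, h) {
  X <- cbind(1, xj - x0)              # local design matrix X_i^j
  w <- tricube((xj - x0) / h) / h     # diagonal of the weight matrix W_i^h
  ## the local linear fit at x0 is ll_weights(x0, xj, h) %*% y^j
  drop(solve(t(X * w) %*% X, t(X * w))[1, ])
}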

3 Selection of smoothing parameters

3.1 Mean squared errors

Many smoothing parameter selection methods are developed for the purpose of estimating the regression functions \(\beta _j(x_i)\), \(j=0, 1\), \(i=1, \ldots , n\), and attempt to select the smoothing parameter minimizing the average conditional mean squared error:

$$\begin{aligned}&\frac{1}{n_j}\sum _{i=1}^{n_j}E\big (y_i^j-\hat{\beta }_j^{h_j}(x_{i}^j)|\mathbf {x}^j\big )^2\nonumber \\&\quad = \frac{1}{n_j}\sum _{i=1}^{n_j}Var\big (\hat{\beta }_j^{h_j}(x_{i}^j)|\mathbf {x}^j\big )+\frac{1}{n_j}\sum _{i=1}^{n_j}E\big (\hat{\beta }_j^{h_j}(x_{i}^j)-\beta _j(x_i^j)|\mathbf {x}^j\big )^2\nonumber \\&\quad =\frac{\sigma _{\varepsilon }^2}{n_j}\sum _{i=1}^{n_j}S_{j}^{h_j}[x_{i}^j]S_{j}^{h_j}[x_{i}^j]^T+\frac{1}{n_j}\sum _{i=1}^{n_j}\bigg (S_{j}^{h_j}[x_{i}^j]\beta _j(\mathbf {x}^j)-\beta _j(x_{i}^j)\bigg )^2. \end{aligned}$$
(2)

One frequently used selection procedure that attempts to select the smoothing parameter minimizing (2) is leave-one-out cross-validation. In this setting, cross-validation selects the smoothing parameter \(h_j\) minimizing

$$\begin{aligned} \frac{1}{n_j}\sum _{i=1}^{n_j}\big (y_{i}^j-\hat{\beta }_j^{h_j,-i}\big (x_{i}^j\big )\big )^2, \end{aligned}$$
(3)

where \(\hat{\beta }_j^{h_j,-i}(x_{i}^j)\) is the cross-validated estimate at \(x_{i}^j\) computed without \((x_{i}^j,y_{i}^j)\). Asymptotically, for local linear regression, the smoothing parameter minimizing (2) is proportional to \(n_j^{-1/5}\) (Fan 1992), and, hence, proportional to \(n^{-1/5}\) since \(n_j=n\Pr (z=j)+o_p(n)\). However, it is known that for estimating a functional of \(\beta _j(x_i)\) such as \(E(\beta _j(x_i))\), the smoothing parameter minimizing (2) is not optimal, in the sense that it does not result in \(\sqrt{n}\)-consistent estimation of the functional (e.g., Cheng 1994).
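
For reference, a minimal R sketch of criterion (3) follows, reusing ll_weights from Sect. 2.2; the name cv_score and the grid search are ours.

## Leave-one-out cross-validation score (3): the fit at x_i^j is computed
## without the pair (x_i^j, y_i^j)
cv_score <- function(h, xj, yj) {
  pred <- sapply(seq_along(xj),
                 function(i) ll_weights(xj[i], xj[-i], h) %*% yj[-i])
  mean((yj - pred)^2)
}
## e.g., h1_cv <- grid[which.min(sapply(grid, cv_score, xj = x1s, yj = y1s))]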

Imbens et al. (2005) suggest that one should select \(h_0\) and \(h_1\) by minimizing the conditional mean squared error of \(\frac{1}{n}\sum _{i=1}^n\hat{\beta }_j^{h_j}(x_i)\), for \(j=0,1\) respectively, i.e.,

$$\begin{aligned} MSE_{\bar{\hat{\beta }}_j}&=\frac{\sigma _{\varepsilon }^2}{n^2}\sum _{i=1}^n\sum _{k=1}^nS_{j}^{h_j}[x_i]S_{j}^{h_j}[x_k]^{T}\nonumber \\&\quad +\frac{1}{n^2}\bigg [\sum _{i=1}^n\bigg (S_{j}^{h_j}[x_i]\beta _j(\mathbf {x}^j)-\beta _j(x_i)\bigg )\bigg ]^2. \end{aligned}$$
(4)

We argue that, in order to estimate \(\tau \) optimally, it may be more suitable to select the combination (\(h_0, h_1\)) minimizing the conditional mean squared error of \(\hat{\tau }^{imp}\):

$$\begin{aligned} MSE_{\hat{\tau }}&=\frac{\sigma _{\varepsilon }^2}{n^2}\sum _{i=1}^{n}\sum _{j=1}^n\bigg (S_1^{h_1}[x_i]S_{1}^{h_1}[x_j]^T+S_{{0}}^{h_{0}}[x_i]S_{0}^{h_{0}}[x_j]^T\bigg )\nonumber \\&\quad +\bigg [\frac{1}{n}\sum _{i=1}^n\bigg (\big (S_{1}^{h_1}[x_i]\beta _1(\mathbf {x}^1)-\beta _1(x_i)\big )-\big (S_{0}^{h_0}[x_i]\beta _0(\mathbf {x}^0)-\beta _0(x_i)\big )\bigg )\bigg ]^2.\nonumber \\ \end{aligned}$$
(5)

Note that

$$\begin{aligned} MSE_{\hat{\tau }}&=MSE_{\bar{\hat{\beta }}_1}+MSE_{\bar{\hat{\beta }}_0}-2\bigg (\frac{1}{n}\sum _{i=1}^n\big (S_{1}^{h_1}[x_i]\beta _1(\mathbf {x}^1)-\beta _1(x_i)\big )\bigg )\\&\quad \ \times \bigg (\frac{1}{n}\sum _{i=1}^n\big (S_{0}^{h_0}[x_i]\beta _0(\mathbf {x}^0)-\beta _0(x_i)\big )\bigg ). \end{aligned}$$

Hence, criterion (5) differs from (4) when both average bias terms in the latter expression are different from zero.

3.2 Asymptotics

Asymptotic approximations can be used to describe optimal bandwidth choices as the sample size tends to infinity. The results presented here are deduced in Appendix, Sect. 6.2, where regularity conditions also used in Ruppert and Wand (1994) are given. For local linear regression with constant bandwidth such that \(h_j\rightarrow 0\) and \(nh_j \rightarrow \infty \) as \(n\rightarrow \infty \) we have the following approximations for the conditional bias and variance of \(\frac{1}{n}\sum _{i=1}^n\hat{\beta }_j^{h_j}(x_i)\). For \(j=0, 1\),

$$\begin{aligned} E\bigg (\frac{1}{n}\sum _{i=1}^n\hat{\beta }_j^{h_j}(x_i)-\frac{1}{n}\sum _{i=1}^n\beta _j(x_i)|\mathbf {x}\bigg )=B_1(j)h_j^2+o_p\big (h_j^2\big ), \end{aligned}$$
(6)

and

$$\begin{aligned} Var\bigg (\frac{1}{n}\sum _{i=1}^n\hat{\beta }_j^{h_j}(x_i)|\mathbf {x}\bigg )&=\frac{V_1(j)}{n}+\frac{V_2(j)}{n^2h_j}+V_3(j)\frac{h_j^2}{n}\nonumber \\&\quad +o_p\big (n^{-1}+n^{-2}h_j^{-1}+n^{-1}h_j^2\big ), \end{aligned}$$
(7)

with constants

$$\begin{aligned} B_1(j)&=\frac{1}{2}\int \beta _j^{(2)}(x)f(x)dx\int u^2K(u)du,\\ V_1(j)&=\sigma _{\varepsilon }^2\int \frac{f(x)}{Pr(z=j|x)}dx,\\ V_2(j)&=\sigma _{\varepsilon }^2\int K(u)^2du\int \frac{1}{Pr(z=j|x)}dx,\\ V_3(j)&=-2\sigma _{\varepsilon }^2\int u^2K(u)du\int \frac{f^{(1)}(x)^2}{f(x)Pr(z=j|x)}dx\\&\quad \ \, -2\sigma _{\varepsilon }^2\int u^2K(u)du\int \frac{f^{(1)}(x)P^{(1)}(z=j|x)}{Pr(z=j|x)^2}dx,\\ \end{aligned}$$

where \(\beta _j^{(m)}(x)\) denotes the \(m\):th derivative of the function \(\beta _j(x)\) and \(f(x)\) is the density of \(x\). Hence,

$$\begin{aligned} MSE_{\bar{\hat{\beta }}_j}&=\frac{V_1(j)}{n}+\frac{V_2(j)}{n^2h_j}+V_3(j)\frac{h_j^2}{n}+B_1^2(j)h_j^4\nonumber \\&\quad +o_p\big (n^{-1}+n^{-2}h_j^{-1}+n^{-1}h_j^2+h_j^4\big ) \end{aligned}$$
(8)

and

$$\begin{aligned} MSE_{\hat{\tau }}&=\frac{V_1(1)+V_1(0)}{n}+\frac{V_2(1)}{n^2h_1}+\frac{V_2(0)}{n^2h_0}\nonumber \\&\quad +V_3(1)\frac{h_1^2}{n}+V_3(0)\frac{h_0^2}{n}+B_1^2(1)h_1^4\nonumber \\&\quad +B_1^2(0)h_0^4-2B_1(1)B_1(0)h_1^2h_0^2\nonumber \\&\quad +o_p\big (n^{-1}+n^{-2}h_1^{-1}+n^{-2}h_0^{-1}+n^{-1}h_1^2\nonumber \\&\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,+h_0^2n^{-1}+h_1^4+h_0^4+h_1^2h_0^2\big ). \end{aligned}$$
(9)

Let us first consider the optimal smoothing parameter for estimating \(E(\beta _j(x))\) and assume \(nh_j^3\rightarrow 0\) as \(n\rightarrow \infty \), \(j=0,1\). An asymptotic approximation to the bandwidth minimizing (8) is

$$\begin{aligned} h_j^{opt}=\text {arg}\min _{h_j}\frac{V_2(j)}{n^2h_j}+B_1^2(j)h_j^4=\bigg (\frac{V_2(j)}{4B_1^2(j)}\bigg )^{1/5}n^{-2/5}. \end{aligned}$$
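
The closed form follows from the first-order condition in \(h_j\),

$$\begin{aligned} \frac{d}{dh_j}\bigg (\frac{V_2(j)}{n^2h_j}+B_1^2(j)h_j^4\bigg )=-\frac{V_2(j)}{n^2h_j^2}+4B_1^2(j)h_j^3=0 \quad \Longleftrightarrow \quad h_j^5=\frac{V_2(j)}{4B_1^2(j)n^2}. \end{aligned}$$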

Hence, the optimal bandwidths are of order \(n^{-2/5}\), so the optimal bandwidths for the estimation of the average functional \(\tau \) are smaller than the optimal bandwidths for the estimation of the regression functions \(\beta _j(\cdot )\), the latter being of order \(n^{-1/5}\). Thus, the regression functions must be undersmoothed when the target of the inference is \(\tau \). A similar result was shown in Cheng (1994) for kernel regression. Turning to the minimization of (9), this must be done simultaneously in \(h_0\) and \(h_1\). A reasonable assumption, however, is that these two smoothing parameters have the same rate of convergence to zero. Under this assumption we may replace \(h_1\) by \(ch_0\), for \(c\) a constant, in (9). Minimizing the latter in \(h_0\) yields, as above, an optimal bandwidth of order \(n^{-2/5}\).

Another related result, deduced from (6) and (7), is that as \(n \rightarrow \infty \), if \(h_j\propto n^r\), for \(-1< r < -1/4\), then (see Appendix, Sect. 6.2)

$$\begin{aligned} E\left[ \sqrt{n}(\bar{\hat{\beta }}_j-E(\beta _j(x_i)))\mid \mathbf {x}\right]&= o_p(1), \end{aligned}$$
(10)
$$\begin{aligned} E\left[ \sqrt{n}(\hat{\tau }^{imp}-\tau )\mid \mathbf {x}\right]&= o_p(1), \end{aligned}$$
(11)
$$\begin{aligned} Var\left[ \sqrt{n}(\bar{\hat{\beta }}_j-E(\beta _j(x_i)))\mid \mathbf {x}\right]&= V_1(j)+o_p(1), \end{aligned}$$
(12)
$$\begin{aligned} Var\left[ \sqrt{n}(\hat{\tau }^{imp}-\tau )\mid \mathbf {x}\right]&= V_1(0)+V_1(1)\nonumber \\&\,+o_p(1). \end{aligned}$$
(13)

The results above show that selecting the smoothing parameters minimizing (4) will lead to \(\sqrt{n}\)-consistent estimation of \(\tau \). This is in accordance with previous results (e.g., Speckman 1988), where it was shown that asymptotic undersmoothing of the regression function is needed for the \(\sqrt{n}\)-consistent estimation of a functional of the regression function.

3.3 Estimating MSEs

Imbens et al. (2005) propose the following estimator of (4), for \(j=0, 1\),

$$\begin{aligned} \widehat{MSE}_{\bar{\hat{\beta }}_j}^{INR}&=\frac{\hat{\sigma _{\varepsilon }}^2}{n^2}\sum _{i=1}^n\sum _{k=1}^nS_{j}^{h_j}[x_i]S_{j}^{h_j}[x_k]^{T}\nonumber \\&\quad +\frac{1}{n^2}\bigg [\sum _{i=1}^{n_j}\frac{1}{\hat{p}(x_{i}^j)}\bigg (y_{i}^j-\hat{\beta }_j^{h_j}(x_{i}^j)\bigg )\bigg ]^2\nonumber \\&\quad -\frac{\hat{\sigma _{\varepsilon }}^2}{n^2}\hat{{\mathbf p}}_j^T\bigg (I_{n_j}-S_{j}^{h_j}[\mathbf {x}^j]\bigg )\bigg (I_{n_j}-S_{j}^{h_j}[\mathbf {x}^j]\bigg )^T\hat{{\mathbf p}}_j, \end{aligned}$$
(14)

where \(\hat{{\mathbf p}}_j=(1/\hat{p}(x_{1}^j),\ldots ,1/\hat{p}(x_{n_j}^j))^T\) and \(I_{n_j}\) is the \(n_j\times n_j\) identity matrix. It is worth noting that one needs to estimate the propensity score (Waernbaum 2010), in addition to \(\sigma _{\varepsilon }^2\), in order to use this selection procedure. The error variance \(\sigma _{\varepsilon }^2\) may be estimated by

$$\begin{aligned} \hat{\sigma _{\varepsilon }}^2=\frac{\Big [\mathbf {y}-\Big (\hat{\beta }_0^{h_{\varepsilon _0}}(\mathbf {x}^0)^T, \hat{\beta }_1^{h_{\varepsilon _1}}(\mathbf {x}^1)^T\Big )^T\Big ]^T\Big [\mathbf {y}-\Big (\hat{\beta }_0^{h_{\varepsilon _0}}(\mathbf {x}^0)^T, \hat{\beta }_1^{h_{\varepsilon _1}}(\mathbf {x}^1)^T\Big )^T\Big ]}{n-\text {trace}\big (2S_0^{h_{\varepsilon _0}}[\mathbf {x}^0]-S_0^{h_{\varepsilon _0}}[\mathbf {x}^0]S_0^{h_{\varepsilon _0}}[\mathbf {x}^0]^T\big )-\text {trace}\big (2S_1^{h_{\varepsilon _1}}[\mathbf {x}^1]-S_1^{h_{\varepsilon _1}}[\mathbf {x}^1]S_1^{h_{\varepsilon _1}}[\mathbf {x}^1]^T\big )}, \end{aligned}$$
(15)

where \(\mathbf {y}=(\mathbf {y}^{0T},\mathbf {y}^{1T})^T\) and \(h_{\varepsilon _j}\), \(j=0, 1\), could be equal to \(h_j\) or selected separately, see, e.g., Opsomer et al. (1995) for further discussion on this issue.
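
A minimal R sketch of (15) follows, building the smoother matrices row-wise from the ll_weights function of Sect. 2.2; all names are ours.

## Error variance estimator (15); (y0s, x0s) are the responses/covariates of
## the controls, (y1s, x1s) those of the treated, h_e0 and h_e1 the bandwidths
sigma2_hat <- function(y0s, x0s, y1s, x1s, h_e0, h_e1) {
  S0 <- t(sapply(x0s, ll_weights, xj = x0s, h = h_e0))  # n0 x n0 smoother matrix
  S1 <- t(sapply(x1s, ll_weights, xj = x1s, h = h_e1))  # n1 x n1 smoother matrix
  rss <- sum((y0s - S0 %*% y0s)^2) + sum((y1s - S1 %*% y1s)^2)
  df  <- length(y0s) + length(y1s) -
    sum(diag(2 * S0 - S0 %*% t(S0))) - sum(diag(2 * S1 - S1 %*% t(S1)))
  rss / df
}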

We propose below novel double smoothing estimators of (4) and (5), respectively:

$$\begin{aligned} \widehat{MSE}_{\bar{\hat{\beta }}_j}^{DS}&=\frac{\hat{\sigma _{\varepsilon }}^2}{n^2}\sum _{i=1}^n\sum _{k=1}^nS_{j}^{h_j}[x_i]S_{j}^{h_j}[x_k]^{T}\nonumber \\&\quad +\frac{1}{n^2}\bigg [\sum _{i=1}^n\bigg (S_{j}^{h_j}[x_i]\hat{\beta }_j^{g_j}(\mathbf {x}^j)-\hat{\beta }_j^{g_j}(x_i)\bigg )\bigg ]^2, \end{aligned}$$
(16)

and

$$\begin{aligned} \widehat{MSE}_{\hat{\tau }}^{DS}&=\frac{\hat{\sigma _{\varepsilon }}^2}{n^2}\sum _{i=1}^{n}\sum _{j=1}^n\bigg (S_{1}^{h_1}[x_i]S_{1}^{h_1}[x_j]^T+S_{0}^{h_{0}}[x_i]S_{0}^{h_{0}}[x_j]^T\bigg )\nonumber \\&\quad +\bigg [\frac{1}{n}\sum _{i=1}^n\bigg (\big (S_{1}^{h_1}[x_i]\hat{\beta }_1^{g_1}(\mathbf {x}^1)-\hat{\beta }_1^{g_1}(x_i)\big )-\big (S_{0}^{h_{0}}[x_i]\hat{\beta }_{0}^{g_{0}}(\mathbf {x}^{0})-\hat{\beta }_{0}^{g_{0}}(x_i)\big )\bigg )\bigg ]^2, \end{aligned}$$
(17)

where \(g_{0}, g_{1}\) are pilot smoothing parameters. Because the purpose of these pilot parameters is to estimate \(\beta _{0}\) and \(\beta _{1}\), respectively, we suggest selecting them by leave-one-out cross-validation; see (3). In specific situations one may want to check whether the results are sensitive to the choice of the pilot parameters. The double smoothing (DS) estimation concept was utilized by Härdle et al. (1992), although for the estimation of the entire regression function \(\beta _j(\cdot )\). One could, as mentioned by Härdle et al. (1992), specify the pilot bandwidths as \(g_j=n_j^{-c}\) for an appropriate constant \(c\), which would yield good asymptotic performance. This would also reduce the computational burden of the method, although a relevant choice of the arbitrary constant \(c\) is problematic. Finally, note that a difference between \(\widehat{MSE}_{\bar{\hat{\beta }}_j}^{INR}\) and \(\widehat{MSE}_{\bar{\hat{\beta }}_j}^{DS}\) is that the former is based on an asymptotic approximation of (4), while the double smoothing estimator targets (4) directly.
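
As an illustration of the proposal, a minimal R sketch of criterion (17) follows, again reusing ll_weights; the function name, argument names, and the passing of precomputed pilot fits are our own choices.

## Double smoothing criterion (17) for a candidate pair (h1, h0); x holds all
## n covariate values, x1s/x0s those of the treated/controls, b1g_*/b0g_* the
## pilot fits hat-beta_j^{g_j} at the indicated points, sigma2 an estimate (15)
mse_tau_ds <- function(h1, h0, x, x1s, x0s,
                       b1g_at_x1, b0g_at_x0, b1g_at_x, b0g_at_x, sigma2) {
  S1 <- t(sapply(x, ll_weights, xj = x1s, h = h1))  # n x n1, rows S_1^{h_1}[x_i]
  S0 <- t(sapply(x, ll_weights, xj = x0s, h = h0))  # n x n0, rows S_0^{h_0}[x_i]
  n  <- length(x)
  var_term  <- sigma2 / n^2 * (sum(S1 %*% t(S1)) + sum(S0 %*% t(S0)))
  bias_term <- mean((S1 %*% b1g_at_x1 - b1g_at_x) -
                    (S0 %*% b0g_at_x0 - b0g_at_x))^2
  var_term + bias_term
}
## (h1, h0) is then selected by minimizing mse_tau_ds over a grid of pairs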

4 Simulation study

In this section, we study the finite sample properties of different methods for the selection of constant and nearest neighbour type bandwidths, and in particular the resulting empirical MSE when estimating the average causal effect \(\tau \).

4.1 Design of the study

Data were generated according to the model

$$\begin{aligned} y_i=\beta _0(x_i)+\tau (x_i) z_i+\varepsilon _i,\ \ \ i=1,\ldots , n, \end{aligned}$$
(18)

with \(x_{i} \sim \text {Uniform}(0,2\pi )\), \(z_{i}|x_i \sim \text {Bernoulli}(p(x_i))\), \(\varepsilon _i \sim \text {Normal}(0,\sigma _{\varepsilon }^2)\), \(\tau (x_i)=\beta _1(x_i)-\beta _0(x_i)\), \(\sigma _{\varepsilon }^2\approx Var\big (\beta _0(x_i)+\tau (x_i) z_i\big )\), and \(n=100,200,500,1{,}000\). Since \(z_i\) is a Bernoulli draw dependent on \(x_i\), itself generated from a uniform distribution, \(n_1\) and \(n_0\) are stochastic. Table 1 and Fig. 1 display the six designs generated. The bandwidths \(h_0\) and \(h_1\) considered are, in the constant bandwidth setting, 40 equally spaced values within the interval \([h_{min}, 2\pi ]\), where \(h_{min}\) is the smallest bandwidth value such that at least 10 observations are used for the local fits. In the nearest neighbour bandwidth setting, we consider 40 equally spaced values within the intervals \([0.1, 1]\) for \(n=100,200\) and \([0.02, 1]\) for \(n=500,1{,}000\); e.g., \(h=0.1\) implies using 10 % of the data for the local fits. The propensity score, \(p(x)\), in (14) is estimated by logistic regression with a correctly specified model for Designs 1–3 (i.e., glm(z~x, family=binomial) in R) and a misspecified model for Designs 4–6 (i.e., glm(z~I(sin(2*x))+I(cos(x))+x+I(x^2), family=binomial) in R). The variance estimator (15) is used in (14), (16) and (17) with \(h_{\varepsilon _j}\), \(j=0, 1\), selected by leave-one-out cross-validation (3). These cross-validation bandwidths are also used as pilot bandwidths in the DS estimators (16) and (17).
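
For concreteness, a minimal R sketch of one replicate drawn from model (18) follows; the functions playing the roles of \(\beta _0(x)\), \(\tau (x)\) and \(p(x)\) below are illustrative placeholders, not the designs of Table 1.

## One sample from model (18) with hypothetical design functions
n     <- 500
x     <- runif(n, 0, 2 * pi)
p     <- plogis(x - pi)          # hypothetical propensity score p(x)
z     <- rbinom(n, 1, p)
beta0 <- sin(x)                  # hypothetical beta_0(x)
tau_x <- 1 + cos(x)              # hypothetical tau(x)
y     <- beta0 + tau_x * z + rnorm(n, sd = 1)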

Fig. 1 Designs 1–6 (from top to bottom) used to generate data as specified in Table 1. The first column displays \(\beta _1(x_{i})\) (solid line), \(\beta _0(x_{i})\) (dashed) and \(\tau (x_i)\) (dotted), and the second column displays \(p(x_i)\)

Table 1 Specification of the six designs used to generate data according to model (18)

The criteria (2), (3), (4), (5), (14), (16) and (17) are computed for each of the 40 bandwidth values in the interval. For the minimizing bandwidths, \(\hat{\tau }^{imp}\) is computed. Due to computer time constraints, we use 200 replicates. To compensate, we reduce the noise in the simulation results by making use of the control variate method (see, e.g., Wilson 1984), with \(\hat{\tau }^{ols}\), the mean of the fitted values resulting from estimating \(\tau (x)\) by ordinary least squares with a correctly specified model, as control variate. If \(\hat{\tau }^{ols}\) is positively correlated with \(\hat{\tau }^{imp}\), then \(\hat{\tau }^c=\hat{\tau }^{imp}-(\hat{\tau }^{ols}-\tau )\) has the same mean as \(\hat{\tau }^{imp}\) but lower variance. For instance, for \(n=1{,}000\) such correlations varied between 0.39 and 0.96 (median \(=\) 0.82, IQR \(=\) 0.18). Results based on the raw replicates are similar to the results reported here utilizing the control variate method, except for an increase in noise. All computations are made in R (R Core Team 2014). Studying bandwidth selection by simulation is computationally demanding, and this study was made possible by the use of the High Performance Computing Center North (HPC2N) at Umeå University.
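
A minimal R sketch of the control variate adjustment for one replicate, continuing the data-generating sketch above; the lm formula is a hypothetical correctly specified form and all names are ours.

## Control variate adjustment: tau_hat_imp is the imputation estimate, e.g.,
## tau_imp(y, z, x, h1, h0) from Sect. 2.2, and tau_true the known design value
d           <- data.frame(y = y, x = x, z = z)
fit_ols     <- lm(y ~ x + z + x:z, data = d)  # hypothetical correct OLS model
tau_hat_ols <- mean(predict(fit_ols, transform(d, z = 1)) -
                    predict(fit_ols, transform(d, z = 0)))
tau_hat_c   <- tau_hat_imp - (tau_hat_ols - tau_true)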

4.2 Results

Results for \(n=500\) and 1,000 are displayed in Tables 2 and 3, for both constant and nearest neighbour bandwidths, and in Figs. 2, 3, 4 and 5 (Appendix, Sect. 6.1) for nearest neighbour bandwidths. Due to the similarity of the bandwidth selection patterns, and to save space, analogous figures with results for constant bandwidths are not included. These figures, together with more detailed results (also for \(n=100,200\)) likewise omitted to save space, can be obtained from the authors. Note first that we can compute the smoothing parameter values minimizing (2), (4) and (5), labeled M\(_y\), M\(_{\beta }\) and M\(_{\tau }\), respectively, because we know the data generating mechanisms.

Table 2 MSE comparison: the table displays the method yielding the lowest MSE (in the estimation of \(\tau \)) among M\(_{\beta }\), M\(_{\tau }\) and M\(_{y}\), when either constant or nearest neighbour bandwidths are used

We see in Figs. 2, 3, 4 and 5 that the double smoothing methods introduced, (16) and (17), labeled DS\(_{\beta }\) and DS\(_{\tau }\) respectively, mimic their targets quite well in terms of selected smoothing parameters. This is not the case for (14), labeled INR, whose selected smoothing parameters are not in accordance with the target M\(_{\beta }\). Table 2 summarizes the empirical MSE results for the theoretical criteria M\(_{\beta }\), M\(_{\tau }\) and M\(_y\), by indicating which criterion yielded the lowest MSE for the estimation of \(\tau \). For constant bandwidths, the smallest MSE is most often obtained by M\(_{\beta }\) or M\(_{\tau }\) and the largest MSE is most often obtained by M\(_y\). However, only in 17 and 25 % of the cases, respectively, do M\(_{\tau }\) and M\(_{\beta }\) result in significantly lower MSE than M\(_y\). For nearest neighbour bandwidths, we see that M\(_\tau \) always results in the smallest MSE for \(n=200, 500, 1{,}000\), which is, in half of the cases, significantly smaller than the second smallest MSE (achieved by M\(_\beta \)). Both M\(_{\tau }\) and M\(_{\beta }\) result in significantly smaller MSE than M\(_y\) in a majority of cases (71 and 67 %, respectively). Table 3 gives analogous information on the empirical MSE, where comparisons are made between the data-driven criteria DS\(_{\beta }\), DS\(_{\tau }\), INR and CV. In both the constant and nearest neighbour bandwidth settings, we see that double smoothing does not always yield the lowest empirical MSE, although CV is most often outperformed by the methods targeting the estimation of functional averages (DS and INR); for Design 2, where INR performed best, CV was also outperformed by DS\(_{\tau }\) but not by DS\(_{\beta }\).

Table 3 MSE comparison: the table displays the method yielding the lowest MSE (in the estimation of \(\tau \)) among DS\(_{\beta }\), DS\(_{\tau }\), INR and CV, when either constant or nearest neighbour bandwidths are used

Finally, note that the propensity scores used in the designs of this study are rather extreme in the sense that they may yield probabilities near zero and one. We have also run these experiments after damping the propensity scores to let them vary only between 0.2 and 0.8. The results were qualitatively similar, with double smoothing often performing better.

5 Conclusion

In this paper we have proposed double smoothing methods for selecting smoothing parameters that target the estimation of functional averages, where the latter are average causal effects of interest. In our numerical experiments, cross-validation is often outperformed by double smoothing. This is as expected, since the cross-validation criterion is tailored to the estimation of the functions underlying the average causal effect, and not to the average itself. The methods proposed and studied here have wide applicability and are, for instance, straightforward to adapt to non-parametric estimators based on instruments such as those introduced in Frölich (2007). Finally, note that results similar to those obtained here should hold under a non-constant variance assumption (Andrews 1991; Ruppert and Wand 1994). In such cases the estimation of \(\sigma _{\varepsilon }^2\) needs to be replaced by estimators of \(Var(y_i|x_i, z_i=0)\) and \(Var(y_i|x_i, z_i=1)\), e.g., using linear smoothers when regressing \(y_i^2\) on \(x_i\) for the units with \(z_i=0\) and \(z_i=1\), respectively.