1 Introduction

In observational studies where the interest lies in estimating the average causal effect of a binary treatment \(z\) on an outcome of interest \(y\), non-parametric estimators are typically based on controlling for confounding covariates \(x\) with smoothing regression methods (kernel, splines, local polynomial regression, series estimators; see, e.g., the reviews by Imbens 2004, and Imbens and Wooldridge 2009). A useful modeling framework in this context was introduced by Neyman (1923) and Rubin (1974), in which two potential outcomes are considered for each unit in the study: the outcome that would be observed if the unit is treated, \(y(1)\), and the outcome that would be observed if the unit is not treated, \(y(0)\). The causal effect at the unit level is defined as \(y(1)-y(0)\). Population parameters are targeted by the inference, and we focus here on average causal effects of the type \(E(y(1)-y(0))\), where the expectation is taken over a given population of interest. Inference on such expectations is complicated by the fact that the two potential outcomes are not observed for all units in the sample (a missing data problem), and assumptions, e.g., on the missingness mechanism, must be made in order for the parameter of interest to be identified. In this paper, we consider situations, described in Sect. 2, where the causal effect conditional on an observed covariate \(x\) (or a score function summarizing a set of observed covariates), \(E(y(1)\mid x)-E(y(0)\mid x)\), is identified and can be estimated by fitting two curves in \(x\), \(E(y(1)\mid x,z=1)\) and \(E(y(0)\mid x,z=0)\), non-parametrically. An estimate of the targeted average causal effect is obtained by averaging the estimated curves over the relevant distribution for \(x\) to target \(E(y(1)-y(0))=E(E(y(1)\mid x))-E(E(y(0)\mid x))\), where the missing outcomes are imputed by predictions from the fitted curves. A tuning parameter for each fitted curve is used to regulate the smoothness of the fit. Cheng (1994) showed that when kernel regression is used to estimate the average of a curve, say here \(E(E(y(1)\mid x))\), with \(y(1)\) missing for some units as described above, the optimal (in the mean squared error, MSE, sense) smoothing parameter for the estimation of the regression curve \(E(y(1)\mid x,z=1)\) is not optimal for the estimation of the average \(E(E(y(1)\mid x))\). More precisely, the optimal rate of convergence of the smoothing parameter towards zero (as the sample size increases) differs between the two situations, and one typically needs to asymptotically undersmooth \(E(y(1)\mid x,z=1)\) when targeting \(E(E(y(1)\mid x))\). We show in this paper that a similar result holds when local linear regression is used instead of kernel regression, and when two curves (implying the choice of two tuning parameters) are fitted and then averaged to target \(E(y(1)-y(0))\).

As the main contribution of the paper, we propose a novel data-driven method for selecting the smoothing parameters that minimize the mean squared error of non-parametric estimators of the average causal effect. Imbens et al. (2005) also propose a data-driven method based on the estimation of this mean squared error. The two estimators are, however, different. While Imbens et al. (2005) estimate an asymptotic approximation of the population MSE, which involves estimating the propensity score (the probability of ending up in one of the treatment groups, say \(z=1\), given the covariates), our estimator targets the exact population MSE by using a double smoothing technique previously employed by Härdle et al. (1992) for estimating regression curves and by Häggström (2013) in semi-parametric additive models. Note that Frölich (2005) also derived asymptotic approximations of the MSE to obtain smoothing parameter selectors, although those were outperformed by cross-validation in finite sample simulations. With simulations we study the finite sample properties of the different data-driven methods. The results suggest that the cross-validation choice, which is known to be optimal in the MSE sense for estimating smooth curves (Fan 1992), can indeed be improved upon by using either Imbens et al. (2005) or our proposal, with the latter often being superior.

In the next section we introduce the potential outcome framework dating back to Neyman (1923) and Rubin (1974), which allows us to define the parameter of interest, the average causal effect, as well as commonly used identifying assumptions and estimators. The selection of smoothing parameters is discussed in Sect. 3, where we present asymptotic results based on local linear regression and introduce the novel data-driven method. Section 4 presents a simulation study. The paper is concluded in Sect. 5.

2 Model and estimation

2.1 Neyman–Rubin model for causal inference

Suppose we have \(n\) units indexed by \(i=1,\ldots ,n\). For each unit \(i\) a binary treatment \(z_i\) is assigned:

$$\begin{aligned} z_i=\begin{cases} 1 & \text {if unit } i \text { receives treatment 1},\\ 0 & \text {if unit } i \text { receives treatment 0}. \end{cases} \end{aligned}$$

Further, each unit \(i\) is characterised by two potential outcomes \(y_i(1)\) and \(y_i(0)\), where \(y_i(1)\) is the response that is observed if the unit is given treatment \(z_i=1\) and \(y_i(0)\) the response if the unit is given treatment \(z_i=0\). Only one treatment assignment is possible for each unit and, therefore, only one of the two potential outcomes is observed. Denote by \(y_i=y_i(0)(1-z_i)+y_i(1)z_i\) the observed outcome. Finally, each unit has a vector of \(d\) background characteristics (covariates) \(\mathbf {x}_{i}=(x_{i1},\ldots , x_{id})^{T}\). We assume in the sequel that the \(n\) units correspond to a random sample from the distribution law of the random variables \((y_i(1),y_i(0),z_i,\mathbf {x}_{i})\), and that only \((y_i,z_i,\mathbf {x}_{i})\) is actually observed. We use the same notation to denote random variables and their realisations, letting the context make the distinction.

The parameter of interest herein is an average causal effect,

$$\begin{aligned} \tau =E\big (y_i(1)-y_i(0)\big ). \end{aligned}$$

If treatment assignment is not randomized, \(\tau \) is identified if we have available a vector of covariates \(\mathbf {x}_{i}=(x_{i1},\ldots , x_{id})^{T}\) not affected by treatment assignment and such that the following assumptions hold,

$$\begin{aligned} y_i(1), y_i(0) \perp \!\!\!\perp z_i|\mathbf {x}_i, \end{aligned}$$

often called the unconfoundedness assumption, and

$$\begin{aligned} 0<\Pr (z_i=1|\mathbf {x}_i)<1, \end{aligned}$$

often called the overlap assumption. The sign \(\perp \!\!\!\perp \) is used here to mean “is independent of” (Dawid 1979). We have unconfoundedness if all covariates affecting both treatment assignment and the potential outcomes are included in \(\mathbf {x}_i\). This is a strong assumption, which must be based on subject-matter reasoning. A sensitivity analysis with respect to this assumption is often advocated (e.g., de Luna and Lundin 2014). The overlap assumption states that, for a unit with covariate vector \(\mathbf {x}_i\), the probability of receiving each of the two treatments is strictly positive. This assumption can be investigated empirically (e.g., Imbens and Wooldridge 2009). Under these assumptions, identifiability of \(\tau \) is then a consequence of

$$\begin{aligned} \tau&=E\big (y_i(1)-y_i(0)\big )\nonumber \\&=E\big (E(y_i(1)|\mathbf {x}_i)-E(y_i(0)|\mathbf {x}_i)\big ) \nonumber \\&=E\big (E(y_i(1)|z_i=1,\mathbf {x}_i)-E(y_i(0)|z_i=0,\mathbf {x}_i)\big ) \nonumber \\&=E\big (E(y_i|z_i=1,\mathbf {x}_i)-E(y_i|z_i=0,\mathbf {x}_i)\big ). \end{aligned}$$
(1)

In the sequel we focus on the case \(d=1\) since, when \(d>1\), the covariate vector \(\mathbf {x}_i\) can be replaced by a scalar, e.g., \(p(\mathbf {x}_i)=\Pr (z_i=1|\mathbf {x}_i)\), the propensity score (Rosenbaum and Rubin 1983; Hansen 2008). Indeed, Rosenbaum and Rubin (1983) showed that it is sufficient to condition on the propensity score, i.e., under the above assumptions we have \( y_i(1), y_i(0) \perp \!\!\!\perp z_i|p(\mathbf {x}_{i}), \) and \( 0<\Pr (z_{i}=1|p(\mathbf {x}_{i}))<1. \) In applications the propensity score needs to be modelled and fitted to the data, and such situations are considered in the simulation study of Sect. 4. Typically, parametric models are used to fit the propensity score, although these need not be correctly specified, as shown in Waernbaum (2010). Note also that covariate selection procedures may be used to reduce the dimensionality of \(\mathbf {x}_{i}\) (de Luna et al. 2011).

2.2 Estimating average causal effects

Let \(\beta _0(x_i)=E(y_i|z_i=0,{x}_i)\) and \(\beta _1(x_i)=E(y_i|z_i=1,{x}_i)\) be unknown smooth functions, and let \(Var(y_i|x_i, z_i)=\sigma _{\varepsilon }^2\), \(i=1, \ldots , n\). The assumption of constant conditional variance could be relaxed without essentially changing the results of this paper; we make it to alleviate the notational burden. The non-constant variance case is further discussed in the concluding section. From (1), we have that

$$\begin{aligned} \tau =E\big (\beta _1(x_i)\big )-E\big (\beta _0(x_i)\big ). \end{aligned}$$

Thus, a natural way to estimate \(\tau \) is to first estimate the two regression functions \(\beta _1(x_i)\) and \(\beta _0(x_i)\), based on the treated and the non-treated, respectively, and then take the average over all the observed \(x_i\)s of the differences between the estimated functions. This estimator of \(\tau \) is called the imputation estimator in Imbens et al. (2005). They use series estimators for estimating the regression functions but any smoother, e.g., kernel, splines and local polynomial regression (Fan and Gijbels 1996, pp. 14–45), may be used.

Denote by \(\mathbf {y}^0=(y_{1}^0,\ldots ,y_{n_0}^0)^T\) and \(\mathbf {x}^0=(x_{1}^0,\ldots ,x_{n_0}^0)^T\) the observed responses and covariates for the \(n_0\) units with treatment \(z_i=0\), and similarly \(\mathbf {y}^1=(y_{1}^1,\ldots ,y_{n_1}^1)^T\) and \(\mathbf {x}^1=(x_{1}^1,\ldots ,x_{n_1}^1)^T\) for the \(n_1\) units with treatment \(z_i=1\). The smoothers cited above are linear in the sense that the corresponding estimator of \(\beta _j(\mathbf {x})=(\beta _j(x_1),\ldots ,\beta _j(x_n))^T\) can be written as

$$\begin{aligned} \hat{\beta }_{j}^{h_j}(\mathbf {x})&=S_{j}^{h_j}[\mathbf {x}]\mathbf {y}^{j}, \ \ j=0,1, \end{aligned}$$

where \(\mathbf {x}=(\mathbf {x}^{0T},\mathbf {x}^{1T})^T\) and \(S_{j}^{h_j}[\mathbf {x}]\) is the smoothing matrix regressing \(\mathbf {y}^j\) on \(\mathbf {x}^j\), using smoothing parameter \(h_j\). The imputation estimator of \(\tau \) mentioned above is

$$\begin{aligned} \hat{\tau }^{imp}=\frac{1}{n}\sum _{i=1}^n\hat{\tau }^{imp}(x_i)=\frac{1}{n}\sum _{i=1}^n\big (\hat{\beta }_{1}^{h_1}(x_i)-\hat{\beta }_{0}^{h_{0}}(x_i)\big ). \end{aligned}$$
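
To fix ideas, here is a minimal R sketch of \(\hat{\tau }^{imp}\), using the built-in loess function (a local linear regression smoother with tricube kernel and nearest-neighbour span, of the type described below) as the linear smoother; the function name tau_imp and its arguments are ours, not part of the original estimator.

## Minimal sketch of the imputation estimator; loess() plays the role of the
## linear smoother, with span = h_j acting as a nearest-neighbour smoothing
## parameter (the proportion of observations used in each local fit)
tau_imp <- function(y, z, x, h1, h0) {
  ctrl <- loess.control(surface = "direct")   # allow prediction at every x
  fit1 <- loess(y ~ x, subset = z == 1, span = h1, degree = 1, control = ctrl)
  fit0 <- loess(y ~ x, subset = z == 0, span = h0, degree = 1, control = ctrl)
  ## impute both potential outcomes for all n units and average the difference
  mean(predict(fit1, newdata = data.frame(x = x)) -
       predict(fit0, newdata = data.frame(x = x)))
}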

In this paper we base our results on a specific linear smoother, the local linear regression smoother, although we anticipate that most results should hold for any other linear smoother.

Local linear regression (Cleveland 1979; Fan and Gijbels 1996) consists of fitting a straight line at every \(x_{i}\), \(i=1,\ldots ,n\), using only the part of the data deemed sufficiently close to the target point \(x_{i}\). Consider estimating the regression function \(\beta _j(\cdot )\), \(j=0, 1\). The fit, at \(x_i\), is

$$\begin{aligned} \hat{\beta }_j^{h_j}(x_i)=\mathbf {e}_{1}^T(\mathbf {X}_{i}^{jT}\mathbf {W}_i^{h_j}\mathbf {X}_{i}^{j})^{-1}\mathbf {X}_{i}^{jT}\mathbf {W}_i^{h_j}\mathbf {y}^j=S_j^{h_j}[x_i]\mathbf {y}^j \end{aligned}$$

where \(\mathbf {e}_{1}=(1,0)^T\),

$$\begin{aligned} \mathbf {X}_{i}^{j}= \left( \begin{array}{c@{\quad }c} 1 &{} \big (x_{1}^j-x_i\big ) \\ \vdots &{} \vdots \\ 1 &{} \big (x_{n_j}^j-x_i\big ) \end{array} \right) \end{aligned}$$

and

$$\begin{aligned} \mathbf {W}_i^{h_j}=\text{ diag }(K\big ((x_{1}^j-x_i)/b_{ji}\big )/b_{ji}, \ldots , K\big ((x_{n_j}^j-x_i)/b_{ji}\big )/b_{ji}). \end{aligned}$$

\(K(\cdot )\) is a kernel function such that \(\int K(u) du=1\) and \(\int u K(u)du=0 \). An example is the tricube kernel defined as

$$\begin{aligned} K(u)=\begin{cases} \frac{70}{81}\big (1-|u|^{3}\big )^{3}, & \text {if } |u|<1,\\ 0, & \text {if } |u|\ge 1. \end{cases} \end{aligned}$$

The definition of \(b_{ji}\), \(i=1,\ldots ,n\), depends on the type of bandwidth used. With a constant bandwidth, \(b_{j1}=\cdots =b_{jn}=h_j\). For a nearest neighbour type bandwidth, assuming no ties, \(b_{ji}\) is the Euclidean distance from \(x_i\) to the \((h_jn_j)\):th nearest of the \(x_{k}^j\):s with \(x_{k}^j\ne x_i\), \(k=1,\ldots ,n_j\), where \(h_j\in [1/n_j,1]\) is the smoothing parameter, interpreted as the proportion of observations used to produce the local fit.
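
To make the above concrete, the following is a minimal R sketch of the smoothing vector \(S_j^{h_j}[x_i]\) in the constant bandwidth case; the function names tricube and ll_weights are ours.

## Tricube kernel and the local linear smoothing vector
## e_1^T (X^T W X)^{-1} X^T W of Sect. 2.2, for constant bandwidth h
tricube <- function(u) ifelse(abs(u) < 1, (70/81) * (1 - abs(u)^3)^3, 0)
ll_weights <- function(x0, xj, h) {
  X <- cbind(1, xj - x0)              # local design matrix X_i^j
  w <- tricube((xj - x0) / h) / h     # diagonal of the weight matrix W_i^h
  ## the local linear fit at x0 is ll_weights(x0, xj, h) %*% y^j
  drop(solve(t(X * w) %*% X, t(X * w))[1, ])
}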

3 Selection of smoothing parameters

3.1 Mean squared errors

Many smoothing parameter selection methods are developed for the purpose of estimating the regression functions \(\beta _j(x_i)\), \(j=0, 1\), \(i=1, \ldots , n\), and attempt to select the smoothing parameter minimizing the average conditional mean squared error:

$$\begin{aligned}&\frac{1}{n_j}\sum _{i=1}^{n_j}E\big (y_i^j-\hat{\beta }_j^{h_j}(x_{i}^j)|\mathbf {x}^j\big )^2\nonumber \\&\quad = \frac{1}{n_j}\sum _{i=1}^{n_j}Var\big (\hat{\beta }_j^{h_j}(x_{i}^j)|\mathbf {x}^j\big )+\frac{1}{n_j}\sum _{i=1}^{n_j}E\big (\hat{\beta }_j^{h_j}(x_{i}^j)-\beta _j(x_i^j)|\mathbf {x}^j\big )^2\nonumber \\&\quad =\frac{\sigma _{\varepsilon }^2}{n_j}\sum _{i=1}^{n_j}S_{j}^{h_j}[x_{i}^j]S_{j}^{h_j}[x_{i}^j]^T+\frac{1}{n_j}\sum _{i=1}^{n_j}\bigg (S_{j}^{h_j}[x_{i}^j]\beta _j(\mathbf {x}^j)-\beta _j(x_{i}^j)\bigg )^2. \end{aligned}$$
(2)

One frequently used selection procedure that attempts to select the smoothing parameter minimizing (2) is leave-one-out cross-validation. In this setting, cross-validation selects the smoothing parameter \(h_j\) minimizing

$$\begin{aligned} \frac{1}{n_j}\sum _{i=1}^{n_j}\big (y_{i}^j-\hat{\beta }_j^{h_j,-i}\big (x_{i}^j\big )\big )^2, \end{aligned}$$
(3)

where \(\hat{\beta }_j^{h_j,-i}(x_{i}^j)\) is the cross-validated estimate at \(x_{i}^j\) computed without \((x_{i}^j,y_{i}^j)\). Asymptotically, for local linear regression, the smoothing parameter minimizing (2) is proportional to \(n_j^{-1/5}\) (Fan 1992), and, hence, proportional to \(n^{-1/5}\) since \(n_j=n\Pr (z=j)+o_p(n)\). However, it is known that for estimating a functional of \(\beta _j(x_i)\) such as \(E(\beta _j(x_i))\), the smoothing parameter minimizing (2) is not optimal, in the sense that it does not result in \(\sqrt{n}\)-consistent estimation of the functional (e.g., Cheng 1994).
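
For reference, a minimal R sketch of criterion (3) follows, reusing ll_weights from Sect. 2.2; the name cv_score and the grid search are ours.

## Leave-one-out cross-validation score (3): the fit at x_i^j is computed
## without the pair (x_i^j, y_i^j)
cv_score <- function(h, xj, yj) {
  pred <- sapply(seq_along(xj),
                 function(i) ll_weights(xj[i], xj[-i], h) %*% yj[-i])
  mean((yj - pred)^2)
}
## e.g., h1_cv <- grid[which.min(sapply(grid, cv_score, xj = x1s, yj = y1s))]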

Imbens et al. (2005) suggest that one should select \(h_0\) and \(h_1\) by minimizing the conditional mean squared error of \(\frac{1}{n}\sum _{i=1}^n\hat{\beta }_j^{h_j}(x_i)\), for \(j=0,1\) respectively, i.e.,

$$\begin{aligned} MSE_{\bar{\hat{\beta }}_j}&=\frac{\sigma _{\varepsilon }^2}{n^2}\sum _{i=1}^n\sum _{k=1}^nS_{j}^{h_j}[x_i]S_{j}^{h_j}[x_k]^{T}\nonumber \\&\quad +\frac{1}{n^2}\bigg [\sum _{i=1}^n\bigg (S_{j}^{h_j}[x_i]\beta _j(\mathbf {x}^j)-\beta _j(x_i)\bigg )\bigg ]^2. \end{aligned}$$
(4)

We argue that, in order to estimate \(\tau \) optimally, it may be more suitable to select the combination (\(h_0, h_1\)) minimizing the conditional mean squared error of \(\hat{\tau }^{imp}\):

$$\begin{aligned} MSE_{\hat{\tau }}&=\frac{\sigma _{\varepsilon }^2}{n^2}\sum _{i=1}^{n}\sum _{j=1}^n\bigg (S_1^{h_1}[x_i]S_{1}^{h_1}[x_j]^T+S_{{0}}^{h_{0}}[x_i]S_{0}^{h_{0}}[x_j]^T\bigg )\nonumber \\&\quad +\bigg [\frac{1}{n}\sum _{i=1}^n\bigg (\big (S_{1}^{h_1}[x_i]\beta _1(\mathbf {x}^1)-\beta _1(x_i)\big )-\big (S_{0}^{h_0}[x_i]\beta _0(\mathbf {x}^0)-\beta _0(x_i)\big )\bigg )\bigg ]^2.\nonumber \\ \end{aligned}$$
(5)

Note that

$$\begin{aligned} MSE_{\hat{\tau }}&=MSE_{\bar{\hat{\beta }}_1}+MSE_{\bar{\hat{\beta }}_0}-2\bigg (\frac{1}{n}\sum _{i=1}^n\big (S_{1}^{h_1}[x_i]\beta _1(\mathbf {x}^1)-\beta _1(x_i)\big )\bigg )\\&\quad \ \times \bigg (\frac{1}{n}\sum _{i=1}^n\big (S_{0}^{h_0}[x_i]\beta _0(\mathbf {x}^0)-\beta _0(x_i)\big )\bigg ). \end{aligned}$$

Hence, criterion (5) differs from (4) when both average bias terms in the latter expression are different from zero.

3.2 Asymptotics

Asymptotic approximations can be used to describe optimal bandwidth choices as the sample size tends to infinity. The results presented here are deduced in Appendix, Sect. 6.2, where regularity conditions also used in Ruppert and Wand (1994) are given. For local linear regression with constant bandwidth such that \(h_j\rightarrow 0\) and \(nh_j \rightarrow \infty \) as \(n\rightarrow \infty \) we have the following approximations for the conditional bias and variance of \(\frac{1}{n}\sum _{i=1}^n\hat{\beta }_j^{h_j}(x_i)\). For \(j=0, 1\),

$$\begin{aligned} E\bigg (\frac{1}{n}\sum _{i=1}^n\hat{\beta }_j^{h_j}(x_i)-\frac{1}{n}\sum _{i=1}^n\beta _j(x_i)|\mathbf {x}\bigg )=B_1(j)h_j^2+o_p\big (h_j^2\big ), \end{aligned}$$
(6)

and

$$\begin{aligned} Var\bigg (\frac{1}{n}\sum _{i=1}^n\hat{\beta }_j^{h_j}(x_i)|\mathbf {x}\bigg )&=\frac{V_1(j)}{n}+\frac{V_2(j)}{n^2h_j}+V_3(j)\frac{h_j^2}{n}\nonumber \\&\quad +o_p\big (n^{-1}+n^{-2}h_j^{-1}+n^{-1}h_j^2\big ), \end{aligned}$$
(7)

with constants

$$\begin{aligned} B_1(j)&=\frac{1}{2}\int \beta _j^{(2)}(x)f(x)dx\int u^2K(u)du,\\ V_1(j)&=\sigma _{\varepsilon }^2\int \frac{f(x)}{Pr(z=j|x)}dx,\\ V_2(j)&=\sigma _{\varepsilon }^2\int K(u)^2du\int \frac{1}{Pr(z=j|x)}dx,\\ V_3(j)&=-2\sigma _{\varepsilon }^2\int u^2K(u)du\int \frac{f^{(1)}(x)^2}{f(x)Pr(z=j|x)}dx\\&\quad \ \, -2\sigma _{\varepsilon }^2\int u^2K(u)du\int \frac{f^{(1)}(x)P^{(1)}(z=j|x)}{Pr(z=j|x)^2}dx,\\ \end{aligned}$$

where \(\beta _j^{(m)}(x)\) denotes the \(m\):th derivative of the function \(\beta _j(x)\) and \(f(x)\) is the density of \(x\). Hence,

$$\begin{aligned} MSE_{\bar{\hat{\beta }}_j}&=\frac{V_1(j)}{n}+\frac{V_2(j)}{n^2h_j}+V_3(j)\frac{h_j^2}{n}+B_1^2(j)h_j^4\nonumber \\&\quad +o_p\big (n^{-1}+n^{-2}h_j^{-1}+n^{-1}h_j^2+h_j^4\big ) \end{aligned}$$
(8)

and

$$\begin{aligned} MSE_{\hat{\tau }}&=\frac{V_1(1)+V_1(0)}{n}+\frac{V_2(1)}{n^2h_1}+\frac{V_2(0)}{n^2h_0}\nonumber \\&\quad +V_3(1)\frac{h_1^2}{n}+V_3(0)\frac{h_0^2}{n}+B_1^2(1)h_1^4\nonumber \\&\quad +B_1^2(0)h_0^4-2B_1(1)B_1(0)h_1^2h_0^2\nonumber \\&\quad +o_p\big (n^{-1}+n^{-2}h_1^{-1}+n^{-2}h_0^{-1}+n^{-1}h_1^2\nonumber \\&\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,+h_0^2n^{-1}+h_1^4+h_0^4+h_1^2h_0^2\big ). \end{aligned}$$
(9)

Let us first consider the optimal smoothing parameter for estimating \(E(\beta _j(x))\) and assume \(nh_j^3\rightarrow 0\) as \(n\rightarrow \infty \), \(j=0,1\). An asymptotic approximation to the bandwidth minimizing (8) is

$$\begin{aligned} h_j^{opt}=\text {arg}\min _{h_j}\frac{V_2(j)}{n^2h_j}+B_1^2(j)h_j^4=\bigg (\frac{V_2(j)}{4B_1^2(j)}\bigg )^{1/5}n^{-2/5}. \end{aligned}$$
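
The closed form follows from the first-order condition in \(h_j\),

$$\begin{aligned} \frac{d}{dh_j}\bigg (\frac{V_2(j)}{n^2h_j}+B_1^2(j)h_j^4\bigg )=-\frac{V_2(j)}{n^2h_j^2}+4B_1^2(j)h_j^3=0 \quad \Longleftrightarrow \quad h_j^5=\frac{V_2(j)}{4B_1^2(j)n^2}. \end{aligned}$$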

Hence, the optimal bandwidths are of order \(n^{-2/5}\), so the optimal bandwidths for the estimation of the average functional \(\tau \) are smaller than the optimal bandwidths for the estimation of the regression functions \(\beta _j(\cdot )\), the latter being of order \(n^{-1/5}\). Thus, the regression functions must be undersmoothed when the target of the inference is \(\tau \). A similar result was shown in Cheng (1994) for kernel regression. Turning to the minimization of (9), this must be done simultaneously in \(h_0\) and \(h_1\). A reasonable assumption, however, is that these two smoothing parameters have the same rate of convergence to zero. Under this assumption we may replace \(h_1\) by \(ch_0\), for \(c\) a constant, in (9). Minimizing the latter in \(h_0\) yields, as above, an optimal bandwidth of order \(n^{-2/5}\).

Another related result, deduced from (6) and (7), is that as \(n \rightarrow \infty \), if \(h_j\propto n^r\), for \(-1< r < -1/4\), then (see Appendix, Sect. 6.2)

$$\begin{aligned} E\left[ \sqrt{n}(\bar{\hat{\beta }}_j-E(\beta _j(x_i)))\mid \mathbf {x}\right]&= o_p(1), \end{aligned}$$
(10)
$$\begin{aligned} E\left[ \sqrt{n}(\hat{\tau }^{imp}-\tau )\mid \mathbf {x}\right]&= o_p(1), \end{aligned}$$
(11)
$$\begin{aligned} Var\left[ \sqrt{n}(\bar{\hat{\beta }}_j-E(\beta _j(x_i)))\mid \mathbf {x}\right]&= V_1(j)+o_p(1), \end{aligned}$$
(12)
$$\begin{aligned} Var\left[ \sqrt{n}(\hat{\tau }^{imp}-\tau )\mid \mathbf {x}\right]&= V_1(0)+V_1(1)\nonumber \\&\,+o_p(1). \end{aligned}$$
(13)

The results above show that selecting the smoothing parameters minimizing (4) will lead to \(\sqrt{n}\)-consistent estimation of \(\tau \). This is in accordance with previous results (e.g., Speckman 1988), where it was shown that asymptotic undersmoothing of the regression function is needed for the \(\sqrt{n}\)-consistent estimation of a functional of the regression function.

3.3 Estimating MSEs

Imbens et al. (2005) propose the following estimator of (4), for \(j=0, 1\),

$$\begin{aligned} \widehat{MSE}_{\bar{\hat{\beta }}_j}^{INR}&=\frac{\hat{\sigma _{\varepsilon }}^2}{n^2}\sum _{i=1}^n\sum _{k=1}^nS_{j}^{h_j}[x_i]S_{j}^{h_j}[x_k]^{T}\nonumber \\&\quad +\frac{1}{n^2}\bigg [\sum _{i=1}^{n_j}\frac{1}{\hat{p}(x_{i}^j)}\bigg (y_{i}^j-\hat{\beta }_j^{h_j}(x_{i}^j)\bigg )\bigg ]^2\nonumber \\&\quad -\frac{\hat{\sigma _{\varepsilon }}^2}{n^2}\hat{{\mathbf p}}_j^T\bigg (I_{n_j}-S_{j}^{h_j}[\mathbf {x}^j]\bigg )\bigg (I_{n_j}-S_{j}^{h_j}[\mathbf {x}^j]\bigg )^T\hat{{\mathbf p}}_j, \end{aligned}$$
(14)

where \(\hat{{\mathbf p}}_j=(1/\hat{p}(x_{1}^j),\ldots ,1/\hat{p}(x_{n_j}^j))^T\) and \(I_{n_j}\) is the \(n_j\times n_j\) identity matrix. It is worth noting that one needs to estimate the propensity score (Waernbaum 2010), in addition to \(\sigma _{\varepsilon }^2\), in order to use this selection procedure. The error variance \(\sigma _{\varepsilon }^2\) may be estimated by

$$\begin{aligned} \hat{\sigma _{\varepsilon }}^2=\frac{\Big [\mathbf {y}-\Big (\hat{\beta }_0^{h_{\varepsilon _0}}(\mathbf {x}^0)^T, \hat{\beta }_1^{h_{\varepsilon _1}}(\mathbf {x}^1)^T\Big )^T\Big ]^T\Big [\mathbf {y}-\Big (\hat{\beta }_0^{h_{\varepsilon _0}}(\mathbf {x}^0)^T, \hat{\beta }_1^{h_{\varepsilon _1}}(\mathbf {x}^1)^T\Big )^T\Big ]}{n-\text {trace}\big (2S_0^{h_{\varepsilon _0}}[\mathbf {x}^0]-S_0^{h_{\varepsilon _0}}[\mathbf {x}^0]S_0^{h_{\varepsilon _0}}[\mathbf {x}^0]^T\big )-\text {trace}\big (2S_1^{h_{\varepsilon _1}}[\mathbf {x}^1]-S_1^{h_{\varepsilon _1}}[\mathbf {x}^1]S_1^{h_{\varepsilon _1}}[\mathbf {x}^1]^T\big )}, \end{aligned}$$
(15)

where \(\mathbf {y}=(\mathbf {y}^{0T},\mathbf {y}^{1T})^T\) and \(h_{\varepsilon _j}\), \(j=0, 1\), could be equal to \(h_j\) or selected separately, see, e.g., Opsomer et al. (1995) for further discussion on this issue.
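
A minimal R sketch of (15) follows, building the smoother matrices row-wise from the ll_weights function of Sect. 2.2; all names are ours.

## Error variance estimator (15); (y0s, x0s) are the responses/covariates of
## the controls, (y1s, x1s) those of the treated, h_e0 and h_e1 the bandwidths
sigma2_hat <- function(y0s, x0s, y1s, x1s, h_e0, h_e1) {
  S0 <- t(sapply(x0s, ll_weights, xj = x0s, h = h_e0))  # n0 x n0 smoother matrix
  S1 <- t(sapply(x1s, ll_weights, xj = x1s, h = h_e1))  # n1 x n1 smoother matrix
  rss <- sum((y0s - S0 %*% y0s)^2) + sum((y1s - S1 %*% y1s)^2)
  df  <- length(y0s) + length(y1s) -
    sum(diag(2 * S0 - S0 %*% t(S0))) - sum(diag(2 * S1 - S1 %*% t(S1)))
  rss / df
}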

We propose below novel double smoothing estimators of (4) and (5), respectively:

$$\begin{aligned} \widehat{MSE}_{\bar{\hat{\beta }}_j}^{DS}&=\frac{\hat{\sigma _{\varepsilon }}^2}{n^2}\sum _{i=1}^n\sum _{k=1}^nS_{j}^{h_j}[x_i]S_{j}^{h_j}[x_k]^{T}\nonumber \\&\quad +\frac{1}{n^2}\bigg [\sum _{i=1}^n\bigg (S_{j}^{h_j}[x_i]\hat{\beta }_j^{g_j}(\mathbf {x}^j)-\hat{\beta }_j^{g_j}(x_i)\bigg )\bigg ]^2, \end{aligned}$$
(16)

and

$$\begin{aligned} \widehat{MSE}_{\hat{\tau }}^{DS}&=\frac{\hat{\sigma _{\varepsilon }}^2}{n^2}\sum _{i=1}^{n}\sum _{j=1}^n\bigg (S_{1}^{h_1}[x_i]S_{1}^{h_1}[x_j]^T+S_{0}^{h_{0}}[x_i]S_{0}^{h_{0}}[x_j]^T\bigg )\nonumber \\&\quad +\bigg [\frac{1}{n}\sum _{i=1}^n\bigg (\big (S_{1}^{h_1}[x_i]\hat{\beta }_1^{g_1}(\mathbf {x}^1)-\hat{\beta }_1^{g_1}(x_i)\big )-\big (S_{0}^{h_{0}}[x_i]\hat{\beta }_{0}^{g_{0}}(\mathbf {x}^{0})-\hat{\beta }_{0}^{g_{0}}(x_i)\big )\bigg )\bigg ]^2, \end{aligned}$$
(17)

where \(g_{0}, g_{1}\) are pilot smoothing parameters. Because the purpose of these pilot parameters is to estimate \(\beta _{0}\) and \(\beta _{1}\), respectively, we suggest selecting them by leave-one-out cross-validation; see (3). In specific situations one may want to check whether the results are sensitive to the choice of the pilot parameters. The double smoothing (DS) estimation concept was utilized by Härdle et al. (1992), although for the estimation of the entire regression function \(\beta _j(\cdot )\). One could, as mentioned by Härdle et al. (1992), specify the pilot bandwidths as \(g_j=n_j^{-c}\) for an appropriate constant \(c\), which would yield good asymptotic performance. This would also reduce the computational burden of the method, although a relevant choice of the arbitrary constant \(c\) is problematic. Finally, note that a difference between \(\widehat{MSE}_{\bar{\hat{\beta }}_j}^{INR}\) and \(\widehat{MSE}_{\bar{\hat{\beta }}_j}^{DS}\) is that the former is based on an asymptotic approximation of (4), while the double smoothing estimator targets (4) directly.
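
As an illustration of the proposal, a minimal R sketch of criterion (17) follows, again reusing ll_weights; the function name, argument names, and the passing of precomputed pilot fits are our own choices.

## Double smoothing criterion (17) for a candidate pair (h1, h0); x holds all
## n covariate values, x1s/x0s those of the treated/controls, b1g_*/b0g_* the
## pilot fits hat-beta_j^{g_j} at the indicated points, sigma2 an estimate (15)
mse_tau_ds <- function(h1, h0, x, x1s, x0s,
                       b1g_at_x1, b0g_at_x0, b1g_at_x, b0g_at_x, sigma2) {
  S1 <- t(sapply(x, ll_weights, xj = x1s, h = h1))  # n x n1, rows S_1^{h_1}[x_i]
  S0 <- t(sapply(x, ll_weights, xj = x0s, h = h0))  # n x n0, rows S_0^{h_0}[x_i]
  n  <- length(x)
  var_term  <- sigma2 / n^2 * (sum(S1 %*% t(S1)) + sum(S0 %*% t(S0)))
  bias_term <- mean((S1 %*% b1g_at_x1 - b1g_at_x) -
                    (S0 %*% b0g_at_x0 - b0g_at_x))^2
  var_term + bias_term
}
## (h1, h0) is then selected by minimizing mse_tau_ds over a grid of pairs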

4 Simulation study

In this section, we study the finite sample properties of different methods for the selection of constant and nearest neighbour type bandwidths, and in particular the resulting empirical MSE when estimating the average causal effect \(\tau \).

4.1 Design of the study

Data were generated according to the model

$$\begin{aligned} y_i=\beta _0(x_i)+\tau (x_i) z_i+\varepsilon _i,\ \ \ i=1,\ldots , n, \end{aligned}$$
(18)

with \(x_{i} \sim \text {Uniform}(0,2\pi )\), \(z_{i}|x_i \sim \text {Bernoulli}(p(x_i))\), \(\varepsilon _i \sim \text {Normal}(0,\sigma _{\varepsilon }^2)\), \(\tau (x_i)=\beta _1(x_i)-\beta _0(x_i)\), \(\sigma _{\varepsilon }^2\approx Var\big (\beta _0(x_i)+\tau (x_i) z_i\big )\), and \(n=100,200,500,1{,}000\). Since \(z_i\) is a Bernoulli draw dependent on \(x_i\), itself generated from a uniform distribution, \(n_1\) and \(n_0\) are stochastic. Table 1 and Fig. 1 display the six designs generated. The bandwidths \(h_0\) and \(h_1\) considered are, in the constant bandwidth setting, 40 equally spaced values within the interval \([h_{min}, 2\pi ]\), where \(h_{min}\) is the smallest bandwidth value such that at least 10 observations are used for the local fits. In the nearest neighbour bandwidth setting, we consider 40 equally spaced values within the intervals \([0.1, 1]\) for \(n=100,200\) and \([0.02, 1]\) for \(n=500,1{,}000\); e.g., \(h=0.1\) implies using 10 % of the data for the local fits. The propensity score, \(p(x)\), in (14) is estimated by logistic regression with a correctly specified model for Designs 1–3 (i.e., glm(z~x, family=binomial) in R) and a misspecified model for Designs 4–6 (i.e., glm(z~I(sin(2*x))+I(cos(x))+x+I(x^2), family=binomial) in R). The variance estimator (15) is used in (14), (16) and (17) with \(h_{\varepsilon _j}\), \(j=0, 1\), selected by leave-one-out cross-validation (3). These cross-validation bandwidths are also used as pilot bandwidths in the DS estimators (16) and (17).
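
For concreteness, a minimal R sketch of one replicate drawn from model (18) follows; the functions playing the roles of \(\beta _0(x)\), \(\tau (x)\) and \(p(x)\) below are illustrative placeholders, not the designs of Table 1.

## One sample from model (18) with hypothetical design functions
n     <- 500
x     <- runif(n, 0, 2 * pi)
p     <- plogis(x - pi)          # hypothetical propensity score p(x)
z     <- rbinom(n, 1, p)
beta0 <- sin(x)                  # hypothetical beta_0(x)
tau_x <- 1 + cos(x)              # hypothetical tau(x)
y     <- beta0 + tau_x * z + rnorm(n, sd = 1)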

Fig. 1 Designs 1–6 (from top to bottom) used to generate data as specified in Table 1. The first column displays \(\beta _1(x_{i})\) (solid line), \(\beta _0(x_{i})\) (dashed) and \(\tau (x_i)\) (dotted), and the second column displays \(p(x_i)\)

Table 1 Specification of the six designs used to generate data according to model (18)

The criteria (2), (3), (4), (5), (14), (16) and (17) are computed for each of the 40 bandwidth values in the interval. For the minimizing bandwidths, \(\hat{\tau }^{imp}\) is computed. Due to computer time constraints, we use 200 replicates. To compensate, we reduce the noise in the simulation results by making use of the control variate method (see, e.g., Wilson 1984), with \(\hat{\tau }^{ols}\), the mean of the fitted values resulting from estimating \(\tau (x)\) by ordinary least squares with a correctly specified model, as control variate. If \(\hat{\tau }^{ols}\) is positively correlated with \(\hat{\tau }^{imp}\), then \(\hat{\tau }^c=\hat{\tau }^{imp}-(\hat{\tau }^{ols}-\tau )\) has the same mean as \(\hat{\tau }^{imp}\) but lower variance. For instance, for \(n=1{,}000\) such correlations varied between 0.39 and 0.96 (median \(=\) 0.82, IQR \(=\) 0.18). Results based on the raw replicates are similar to the results reported here utilizing the control variate method, except for an increase in noise. All computations are made in R (R Core Team 2014). Studying bandwidth selection by simulation is computationally demanding, and this study was made possible by the use of the High Performance Computing Center North (HPC2N) at Umeå University.
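
A minimal R sketch of the control variate adjustment for one replicate, continuing the data-generating sketch above; the lm formula is a hypothetical correctly specified form and all names are ours.

## Control variate adjustment: tau_hat_imp is the imputation estimate, e.g.,
## tau_imp(y, z, x, h1, h0) from Sect. 2.2, and tau_true the known design value
d           <- data.frame(y = y, x = x, z = z)
fit_ols     <- lm(y ~ x + z + x:z, data = d)  # hypothetical correct OLS model
tau_hat_ols <- mean(predict(fit_ols, transform(d, z = 1)) -
                    predict(fit_ols, transform(d, z = 0)))
tau_hat_c   <- tau_hat_imp - (tau_hat_ols - tau_true)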

4.2 Results

Results for \(n=500\) and 1,000 are displayed in Tables 2 and 3, for both constant and nearest neighbour bandwidths, and in Figs. 2, 3, 4 and 5 (Appendix, Sect. 6.1) for nearest neighbour bandwidths. Due to the similarity of the bandwidth selection patterns, and to save space, analogous figures with results for constant bandwidths are not included. These figures, together with more detailed results (also for \(n=100,200\)) likewise omitted to save space, can be obtained from the authors. Note first that we can compute the smoothing parameter values minimizing (2), (4) and (5), labeled M\(_y\), M\(_{\beta }\) and M\(_{\tau }\), respectively, because we know the data generating mechanisms.

Table 2 MSE comparison: the table displays the method yielding the lowest MSE (in the estimation of \(\tau \)) among M\(_{\beta }\), M\(_{\tau }\) and M\(_{y}\), when either constant or nearest neighbour bandwidths are used

We see in Figs. 2, 3, 4 and 5 that the double smoothing methods introduced, (16) and (17), labeled DS\(_{\beta }\) and DS\(_{\tau }\) respectively, mimic their targets quite well in terms of selected smoothing parameters. This is not the case for (14), labeled INR, whose selected smoothing parameters are not in accordance with the target M\(_{\beta }\). Table 2 summarizes the empirical MSE results for the theoretical criteria M\(_{\beta }\), M\(_{\tau }\) and M\(_y\), by indicating which criterion yielded the lowest MSE for the estimation of \(\tau \). For constant bandwidths, the smallest MSE is most often obtained by M\(_{\beta }\) or M\(_{\tau }\) and the largest MSE is most often obtained by M\(_y\). However, only in 17 and 25 % of the cases, respectively, do M\(_{\tau }\) and M\(_{\beta }\) result in significantly lower MSE than M\(_y\). For nearest neighbour bandwidths, we see that M\(_\tau \) always results in the smallest MSE for \(n=200, 500, 1{,}000\), which is, in half of the cases, significantly smaller than the second smallest MSE (achieved by M\(_\beta \)). Both M\(_{\tau }\) and M\(_{\beta }\) result in significantly smaller MSE than M\(_y\) in a majority of cases (71 and 67 %, respectively). Table 3 gives analogous information on the empirical MSE, where comparisons are made between the data-driven criteria DS\(_{\beta }\), DS\(_{\tau }\), INR and CV. In both the constant and nearest neighbour bandwidth settings, we see that double smoothing does not always yield the lowest empirical MSE, although CV is most often outperformed by the methods targeting the estimation of functional averages (DS and INR); for Design 2, where INR performed best, CV was also outperformed by DS\(_{\tau }\) but not by DS\(_{\beta }\).

Table 3 MSE comparison: the table displays the method yielding the lowest MSE (in the estimation of \(\tau \)) among DS\(_{\beta }\), DS\(_{\tau }\), INR and CV, when either constant or nearest neighbour bandwidths are used

Finally, note that the propensity scores used in the designs of this study are rather extreme in the sense that they may yield probabilities near zero and one. We have also run these experiments after damping the propensity scores to let them vary only between 0.2 and 0.8. The results were qualitatively similar, with double smoothing often performing better.

5 Conclusion

In this paper we have proposed double smoothing methods for selecting smoothing parameters that target the estimation of functional averages, where the latter are average causal effects of interest. In our numerical experiments, cross-validation is often outperformed by double smoothing. This is as expected, since the cross-validation criterion is tailored to the estimation of the functions underlying the average causal effect, and not to the average itself. The methods proposed and studied here have wide applicability and are, for instance, straightforward to adapt to non-parametric estimators based on instruments such as those introduced in Frölich (2007). Finally, note that results similar to those obtained here should hold under a non-constant variance assumption (Andrews 1991; Ruppert and Wand 1994). In such cases the estimation of \(\sigma _{\varepsilon }^2\) needs to be replaced by estimators of \(Var(y_i|x_i, z_i=0)\) and \(Var(y_i|x_i, z_i=1)\), e.g., using linear smoothers when regressing \(y_i^2\) on \(x_i\) for the units with \(z_i=0\) and \(z_i=1\), respectively.