# Targeted smoothing parameter selection for estimating average causal effects

## Abstract

The non-parametric estimation of average causal effects in observational studies often relies on controlling for confounding covariates through smoothing regression methods such as kernel, spline or local polynomial regression. Such regression methods are tuned via smoothing parameters, which regulate the degrees of freedom used in the fit. In this paper we propose data-driven methods for selecting smoothing parameters when the targeted parameter is an average causal effect. For this purpose, we propose to estimate the exact expression of the mean squared error of the estimators. Asymptotic approximations indicate that the smoothing parameters minimizing this mean squared error converge to zero faster than the optimal smoothing parameter for the estimation of the regression functions. In a simulation study we show that the proposed data-driven methods for selecting the smoothing parameters yield lower empirical mean squared error than other available methods such as cross-validation.

## Keywords

Causal inference, Double smoothing, Local linear regression

## 1 Introduction

In observational studies where the interest lies in estimating the average causal effect of a binary treatment \(z\) on an outcome of interest \(y\), non-parametric estimators are typically based on controlling for confounding covariates \(x\) with smoothing regression methods (kernel, splines, local polynomial regression, series estimators; see, e.g., the reviews by Imbens 2004, and Imbens and Wooldridge 2009). A useful modeling framework in this context was introduced by Neyman (1923) and Rubin (1974), where in particular two potential outcomes are considered for each unit in the study, the outcome that would be observed if the unit is treated, \(y(1)\), and the outcome that would be observed if the unit is not treated, \(y(0)\). The causal effect at the unit level is defined as \(y(1)-y(0)\). Population parameters are targeted by the inference, and we focus here on average causal effects of the type \(E(y(1)-y(0))\), where the expectation is taken over a given population of interest. Inference on such expectations is complicated by the fact that the two potential outcomes are not observed for all units in the sample (missing data problem) and assumptions, e.g., on the missingness mechanism must be made in order for the parameter of interest to be identified. In this paper, we consider situations described in Sect. 2, where the causal effect conditional on an observed covariate \(x\) (or a score function summarizing a set of observed covariates), \(E(y(1)\mid x)-E(y(0)\mid x)\), is identified and can be estimated by fitting two curves, functions of \(x\), \(E(y(1)\mid x,z=1)\) and \(E(y(0)\mid x,z=0)\) non-parametrically. An estimate of the targeted average causal effect is obtained by averaging the estimated curves over the relevant distribution for \(x\) to target \(E(y(1)-y(0))=E(E(y(1)\mid x))-E(E(y(0)\mid x))\), where the missing outcomes are imputed by predictions from the fitted curves. 
A tuning parameter for each fitted curve is used to regulate the smoothness of the fit. Cheng (1994) showed that when kernel regression is used to estimate the average of a curve, say here \(E(E(y(1)\mid x))\), with missing \(y(1)\) for some units, as described above, the smoothing parameter that is optimal (in the mean squared error, MSE, sense) for the estimation of the regression curve \(E(y(1)\mid x,z=1)\) is not optimal for the estimation of the average \(E(E(y(1)\mid x))\). More precisely, the optimal rate at which the smoothing parameter converges to zero (as the sample size increases) differs between the two situations, and one typically needs to asymptotically undersmooth \(E(y(1)\mid x,z=1)\) when targeting \(E(E(y(1)\mid x))\). We show in this paper that a similar result holds when using local linear regression instead of kernel regression, and when two curves (implying the choice of two tuning parameters) are fitted and then averaged to target \(E(y(1)-y(0))\).
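The fit-then-average construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `local_linear_fit` and `impute_ace`, the Gaussian kernel, and the choice to keep observed outcomes and impute only the missing potential outcome are all assumptions made for the example.

```python
import math

def local_linear_fit(x0, xs, ys, h):
    """Local linear estimate of E(y | x = x0), Gaussian kernel, bandwidth h."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        d = x - x0
        w = math.exp(-0.5 * (d / h) ** 2)  # kernel weight of this observation
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * y
        t1 += w * d * y
    denom = s0 * s2 - s1 * s1
    if denom <= 1e-12 * s0 * s2:
        # Degenerate local design: fall back to the kernel-weighted mean.
        return t0 / s0
    # Intercept of the weighted least squares line centred at x0.
    return (s2 * t0 - s1 * t1) / denom

def impute_ace(xs, ys, zs, h1, h0):
    """Average of y(1) - y(0), with each missing potential outcome imputed.

    h1 and h0 are the two smoothing parameters, one per fitted curve
    (hypothetical argument names).
    """
    x1 = [x for x, z in zip(xs, zs) if z == 1]
    y1 = [y for y, z in zip(ys, zs) if z == 1]
    x0 = [x for x, z in zip(xs, zs) if z == 0]
    y0 = [y for y, z in zip(ys, zs) if z == 0]
    total = 0.0
    for x, y, z in zip(xs, ys, zs):
        y_treated = y if z == 1 else local_linear_fit(x, x1, y1, h1)
        y_control = y if z == 0 else local_linear_fit(x, x0, y0, h0)
        total += y_treated - y_control
    return total / len(xs)

# Noise-free toy data (an assumption): y(1) = 3 + x and y(0) = 1 + x,
# so the average causal effect is 2.
xs = [i / 10 for i in range(40)]
zs = [i % 2 for i in range(40)]
ys = [(3 + x) if z == 1 else (1 + x) for x, z in zip(xs, zs)]
ace_hat = impute_ace(xs, ys, zs, h1=0.5, h0=0.5)
```

Because local linear regression reproduces straight lines exactly, the toy example recovers the effect up to rounding error. With curved regression functions the two bandwidths matter, which is the selection problem the paper addresses.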

As a main contribution of the paper, we propose a novel data-driven method for selecting the smoothing parameters that minimize the mean squared error of non-parametric estimators of the average causal effect. Imbens et al. (2005) also propose a data-driven method based on the estimation of this mean squared error. The two estimators are, however, different. While Imbens et al. (2005) estimate an asymptotic approximation of the population MSE, which involves the estimation of the propensity score (the probability of ending up in one of the treatment groups, say \(z=1\), given the covariates), our estimator targets the exact population MSE by using a double smoothing technique previously used by Härdle et al. (1992) for estimating regression curves and by Häggström (2013) in semi-parametric additive models. Note that Frölich (2005) also derived an asymptotic approximation of the MSE to obtain smoothing parameter selectors, although those were outperformed by cross-validation in finite sample simulations. We study the finite sample properties of the different data-driven methods with simulations. The results suggest that the cross-validation choice, which is known to be optimal in the MSE sense for estimating smooth curves (Fan 1992), can indeed be improved by using either the method of Imbens et al. (2005) or our proposal, with the latter often being superior.
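For orientation, the cross-validation benchmark mentioned above can be sketched as a leave-one-out grid search for a single regression curve. This is an illustrative sketch only: the function names, the Gaussian kernel, the toy data and the bandwidth grid are assumptions, and the criterion shown targets the curve itself, not the average causal effect, so it is not the double smoothing method proposed in the paper.

```python
import math
import random

def local_linear_fit(x0, xs, ys, h):
    """Local linear estimate of E(y | x = x0), Gaussian kernel, bandwidth h."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        d = x - x0
        w = math.exp(-0.5 * (d / h) ** 2)
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * y
        t1 += w * d * y
    denom = s0 * s2 - s1 * s1
    if denom <= 1e-12 * s0 * s2:
        return t0 / s0  # degenerate local design: kernel-weighted mean
    return (s2 * t0 - s1 * t1) / denom

def loo_cv_score(xs, ys, h):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    n = len(xs)
    sse = 0.0
    for i in range(n):
        x_rest = xs[:i] + xs[i + 1:]
        y_rest = ys[:i] + ys[i + 1:]
        sse += (ys[i] - local_linear_fit(xs[i], x_rest, y_rest, h)) ** 2
    return sse / n

def select_bandwidth_cv(xs, ys, grid):
    """Bandwidth in the grid with the smallest leave-one-out score."""
    return min(grid, key=lambda h: loo_cv_score(xs, ys, h))

# Toy data (an assumption): a sine curve observed with Gaussian noise.
rng = random.Random(1)
xs = [2 * math.pi * i / 59 for i in range(60)]
ys = [math.sin(x) + rng.gauss(0, 0.1) for x in xs]
grid = [0.2, 0.5, 1.0, 50.0]
h_cv = select_bandwidth_cv(xs, ys, grid)
```

The largest grid value makes the fit nearly a global straight line, so it serves as an oversmoothing reference that cross-validation should reject on curved data.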

In the next section we introduce the potential outcome framework dating back to Neyman (1923) and Rubin (1974), which allows us to define the parameter of interest, the average causal effect, and commonly used identifying assumptions and estimators. The selection of smoothing parameters is discussed in Sect. 3, where we present asymptotic results based on the use of local linear regression. We also introduce in this section a novel data-driven method. Section 4 presents a simulation study. The paper is concluded in Sect. 5.

## 2 Model and estimation

### 2.1 Neyman–Rubin model for causal inference

### 2.2 Estimating average causal effects

## 3 Selection of smoothing parameters

### 3.1 Mean squared errors

### 3.2 Asymptotics

### 3.3 Estimating MSEs

## 4 Simulation study

In this section, we study the finite sample properties of different methods for the selection of constant and nearest neighbour type bandwidths, and in particular the resulting empirical MSE when estimating the average causal effect \(\tau \).

### 4.1 Design of the study

Specification of the six designs used to generate data according to model (18)

\(Design\) | \(\beta _1(x_i)\) | \(\beta _0(x_i)\) |
---|---|---|
1 | \(4\pi +5-2\pi x_i+x_i^2+5\sin (2x_i)-4\cos (x_i)\) | \(\sin (2x_i)-4\cos (x_i)+5\) |
2 | \(4\big (x_{i}+\sin (x_{i})+\sin (2x_{i})\big )+3\) | \(2\big (x_{i}+\sin (x_{i})+\sin (2x_{i})\big )+3\) |
3 | \(4\pi -\pi x_i+\frac{x_i^2}{2}\) | \(\pi x_i-\frac{x_i^2}{2}\) |
4 | \(4\pi -\pi x_i+\frac{x_i^2}{2}\) | \(\pi x_i-\frac{x_i^2}{2}\) |
5 | \(4\pi +5-2\pi x_i+x_i^2+5\sin (2x_i)-4\cos (x_i)\) | \(\sin (2x_i)-4\cos (x_i)+5\) |
6 | \(10+x_i(2\pi -x_i)\times \sin (2\pi (2\pi +0.05)/(x_i+0.05))\) | \(8+1.5\sin (2x_i-4)+6\exp (-16(2x_i-2.5)^2)\) |

\(Design\) | \(\tau (x_i)\) | \(p(x_i)\) |
---|---|---|
1 | \(4\pi -2\pi x_i+x_i^2+4\sin (2x_i)\) | \([e^{-3.5+x_i}]/[1+e^{-3.5+x_i}]\) |
2 | \(2x_i+2\sin (x_i)+2\sin (2x_i)\) | \([e^{-3.5+x_i}]/[1+e^{-3.5+x_i}]\) |
3 | \(4\pi -2\pi x_i+x_i^2\) | \([e^{-3.5+x_i}]/[1+e^{-3.5+x_i}]\) |
4 | \(4\pi -2\pi x_i+x_i^2\) | \((5\sin {2x_i}-4\cos {x_i}+4\pi -2\pi x_i+x_i^2)/11.3\) |
5 | \(4\pi -2\pi x_i+x_i^2+4\sin (2x_i)\) | \((5\sin {2x_i}-4\cos {x_i}+4\pi -2\pi x_i+x_i^2)/11.3\) |
6 | \(2+x_i(2\pi -x_i)\sin (\frac{2\pi (2\pi +0.05)}{x_i+0.05})-1.5\sin (2x_i-4)+6\exp (-16(2x_i-2.5)^2)\) | \((5\sin {2x_i}-4\cos {x_i}+4\pi -2\pi x_i+x_i^2)/11.3\) |

The criteria (2), (3), (4), (5), (14), (16) and (17) are computed for each of the 40 bandwidth values in the interval. For the minimizing bandwidths, \(\hat{\tau }^{imp}\) is computed. Owing to computing-time constraints, we use 200 replicates. To compensate, we reduce noise in the simulation results with the control variate method (see, e.g., Wilson 1984), using as control variate \(\hat{\tau }^{ols}\), the mean of the fitted values obtained by estimating \(\tau (x)\) by ordinary least squares with a correctly specified model. If \(\hat{\tau }^{ols}\) is sufficiently positively correlated with \(\hat{\tau }^{imp}\), then \(\hat{\tau }^c=\hat{\tau }^{imp}-(\hat{\tau }^{ols}-\tau )\) has the same mean as \(\hat{\tau }^{imp}\) but lower variance. For instance, for \(n=1{,}000\) such correlations varied between 0.39 and 0.96 (Median \(=\) 0.82, IQR \(=\) 0.18). Results based on the raw replicates are similar to the results reported here with the control variate method, except for an increase in noise. All computations are made in R (R Core Team 2014). Studying bandwidth selection by simulation is computationally demanding, and this study was made possible by the use of the High Performance Computing Center North (HPC2N) at Umeå University.
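The control variate adjustment can be illustrated with synthetic replicates. Since \(Var(\hat{\tau }^c)=Var(\hat{\tau }^{imp})+Var(\hat{\tau }^{ols})-2Cov(\hat{\tau }^{imp},\hat{\tau }^{ols})\), the adjustment reduces variance exactly when the covariance exceeds half the variance of the control variate. The sketch below is not the paper's simulation: the function name `control_variate_adjust`, the true value and the way the two estimators are generated (a shared error component plus independent noise) are assumptions chosen only to produce strongly correlated replicates.

```python
import random
import statistics

def control_variate_adjust(tau_imp, tau_ols, tau_true):
    """Return tau_c = tau_imp - (tau_ols - tau_true), replicate by replicate."""
    return [ti - (to - tau_true) for ti, to in zip(tau_imp, tau_ols)]

rng = random.Random(0)
tau_true = 2.0   # assumed true average causal effect
n_rep = 200      # number of replicates, as in the study

# Synthetic replicates: both estimators share an error component,
# so they are strongly positively correlated.
shared = [rng.gauss(0, 0.5) for _ in range(n_rep)]
tau_ols = [tau_true + s for s in shared]
tau_imp = [tau_true + s + rng.gauss(0, 0.2) for s in shared]

tau_c = control_variate_adjust(tau_imp, tau_ols, tau_true)
var_raw = statistics.variance(tau_imp)  # variance of the raw replicates
var_cv = statistics.variance(tau_c)     # variance after the adjustment
print(f"raw variance: {var_raw:.4f}, adjusted variance: {var_cv:.4f}")
```

In this construction the shared component cancels in the adjusted replicates, leaving only the independent noise, which is why the adjusted variance is much smaller than the raw one.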

### 4.2 Results

MSE comparison: the table displays the method yielding the lowest MSE (in the estimation of \(\tau \)) among M\(_{\beta }\), M\(_{\tau }\) and M\(_{y}\), when either constant or nearest neighbour bandwidths are used

\(Design\) | \(n=100\) | \(n=200\) | \(n=500\) | \(n=1{,}000\) |
---|---|---|---|---|
Constant bandwidth | | | | |
1 | M\(_{y}^{**}\) | M\(_{\tau }^{**}\) | M\(_{\beta }^{**}\) | M\(_{\tau }^{**}\) |
2 | M\(_{\beta }\) | M\(_{\beta }\) | M\(_{\beta }\) | M\(_{\beta }\) |
3 | M\(_{\tau }\) | M\(_{\tau }\) | M\(_{\tau }^{**}\) | M\(_{\beta }^{**}\) |
4 | M\(_{\tau }\) | M\(_{\tau }\) | M\(_{y}\) | M\(_{y}^{*}\) |
5 | M\(_{\tau }\) | M\(_{\beta }^{**}\) | M\(_{y}\) | M\(_{\beta }^{**}\) |
6 | M\(_{y}\) | M\(_{\beta }\) | M\(_{\tau }\) | M\(_{\tau }\) |
Nearest neighbour bandwidth | | | | |
1 | M\(_{\tau }^{**}\) | M\(_{\tau }^{**}\) | M\(_{\tau }^{**}\) | M\(_{\tau }^{**}\) |
2 | M\(_{\tau }\) | M\(_{\tau }\) | M\(_{\tau }\) | M\(_{\tau }\) |
3 | M\(_{\tau }^{**}\) | M\(_{\tau }\) | M\(_{\tau }^{**}\) | M\(_{\tau }^{**}\) |
4 | M\(_{\beta }\) | M\(_{\tau }^{*}\) | M\(_{\tau }\) | M\(_{\tau }\) |
5 | M\(_{\beta }^{**}\) | M\(_{\tau }^{**}\) | M\(_{\tau }^{**}\) | M\(_{\tau }^{**}\) |
6 | M\(_{\beta }\) | M\(_{\tau }\) | M\(_{\tau }\) | M\(_{\tau }\) |

MSE comparison: the table displays the method yielding the lowest MSE (in the estimation of \(\tau \)) among DS\(_{\beta }\), DS\(_{\tau }\), INR and CV, when either constant or nearest neighbour bandwidths are used

\(Design\) | \(n=100\) | \(n=200\) | \(n=500\) | \(n=1{,}000\) |
---|---|---|---|---|
Constant bandwidth | | | | |
1 | DS\(_{\beta }^{**}\) | INR\(^{**}\) | DS\(_{\tau }^{**}\) | DS\(_{\beta }^{**}\) |
2 | CV | INR | INR | DS\(_{\tau }\) |
3 | INR\(^{**}\) | INR | DS\(_{\tau }\) | DS\(_{\tau }^{**}\) |
4 | DS\(_{\tau }\) | DS\(_{\tau }\) | DS\(_{\beta }\) | DS\(_{\beta }\) |
5 | DS\(_{\beta }\) | DS\(_{\beta }\) | DS\(_{\beta }\) | DS\(_{\beta }\) |
6 | CV | DS\(_{\beta }\) | DS\(_{\tau }\) | DS\(_{\tau }^{**}\) |
Nearest neighbour bandwidth | | | | |
1 | DS\(_{\beta }\) | DS\(_{\tau }\) | DS\(_{\tau }\) | DS\(_{\beta }\) |
2 | INR | INR\(^{*}\) | INR\(^{*}\) | INR\(^{**}\) |
3 | CV | CV | CV\(^{**}\) | CV\(^{**}\) |
4 | INR | DS\(_{\tau }\) | DS\(_{\tau }\) | DS\(_{\tau }^{**}\) |
5 | INR | DS\(_{\tau }\) | DS\(_{\tau }\) | DS\(_{\tau }^{*}\) |
6 | INR | INR | DS\(_{\tau }\) | CV\(^{*}\) |

Finally, note that the propensity scores used in the designs of this study are rather extreme in the sense that they may yield probabilities near zero and one. We have also run these experiments with damped propensity scores that vary only between 0.2 and 0.8. The results were qualitatively similar, with double smoothing often performing better.
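To see how extreme the design propensity scores are, one can evaluate the two \(p(x_i)\) specifications from the design table on a grid. The sketch below is only an illustration: the function names are hypothetical, and the covariate range \([0, 2\pi ]\) is an assumption, since the support of \(x_i\) is not stated in this section.

```python
import math

def p_logistic(x):
    """Propensity score of designs 1-3: logistic in x with intercept -3.5."""
    return math.exp(-3.5 + x) / (1.0 + math.exp(-3.5 + x))

def p_trig(x):
    """Propensity score of designs 4-6, as given in the design table."""
    numerator = (5 * math.sin(2 * x) - 4 * math.cos(x)
                 + 4 * math.pi - 2 * math.pi * x + x ** 2)
    return numerator / 11.3

# Assumed covariate range [0, 2*pi]; this is not taken from the source.
grid = [2 * math.pi * i / 1000 for i in range(1001)]

for name, p in (("logistic", p_logistic), ("trigonometric", p_trig)):
    values = [p(x) for x in grid]
    print(f"{name}: min = {min(values):.3f}, max = {max(values):.3f}")
```

Printing the minimum and maximum over the grid gives a quick check of how close each specification comes to zero and one on the assumed range.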

## 5 Conclusion

In this paper we have proposed double smoothing methods for selecting smoothing parameters that target the estimation of functional averages, here average causal effects. In our numerical experiments cross-validation is often outperformed by double smoothing, as we expected, since cross-validation is optimized for the estimation of the functions underlying the average causal effect and not for the average itself. The methods proposed and studied here are widely applicable and are, for instance, straightforward to adapt to non-parametric estimators based on instruments, such as those introduced in Frölich (2007). Finally, note that results similar to those obtained here should hold under a non-constant variance assumption (Andrews 1991; Ruppert and Wand 1994). In such cases the estimator of \(\sigma _{\varepsilon }^2\) needs to be replaced by estimators of \(Var(y_i|x_i, z_i=0)\) and \(Var(y_i|x_i, z_i=1)\), e.g., using linear smoothers when regressing \(y_i^2\) on \(x_i\) for the units with \(z_i=0\) and \(z_i=1\), respectively.
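One way to read the suggestion of regressing \(y_i^2\) on \(x_i\) is through the standard variance decomposition, written here for treatment group \(z\in \{0,1\}\). The plug-in form on the second line, and the hatted notation, are an assumed reading and not a formula taken from the paper:

\[
\begin{aligned}
Var(y_i\mid x_i, z_i=z) &= E(y_i^2\mid x_i, z_i=z)-\{E(y_i\mid x_i, z_i=z)\}^2,\\
\widehat{Var}(y_i\mid x_i, z_i=z) &= \hat{E}(y_i^2\mid x_i, z_i=z)-\{\hat{E}(y_i\mid x_i, z_i=z)\}^2,
\end{aligned}
\]

where \(\hat{E}(y_i^2\mid x_i, z_i=z)\) is the fit of a linear smoother of \(y_i^2\) on \(x_i\) among the units with \(z_i=z\), and \(\hat{E}(y_i\mid x_i, z_i=z)\) is the corresponding fitted regression curve. In finite samples such a difference can be negative at some \(x_i\), so in practice it may need to be truncated at zero.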

## Notes

### Acknowledgments

We are grateful to Yanyuan Ma and Sara Sjöstedt-de Luna for comments that have helped us to improve the paper. We acknowledge the financial support of the Swedish Research Council through the Swedish Initiative for Research on Microdata in the Social and Medical Sciences (SIMSAM), the Ageing and Living Condition Program and Grant 70246501.

## References

- Andrews DWK (1991) Asymptotic optimality of generalized \(C_L\), cross-validation, and generalized cross-validation in regression with heteroskedastic errors. J Econom 47:359–377
- Cheng PE (1994) Nonparametric estimation of mean functionals with data missing at random. J Am Stat Assoc 89:81–87
- Cleveland WS (1979) Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc 74:829–836
- Dawid AP (1979) Conditional independence in statistical theory. J R Stat Soc Ser B Stat Methodol 41:1–31
- de Luna X, Lundin M (2014) Sensitivity analysis of the unconfoundedness assumption with an application to an evaluation of college choice effects on earnings. J Appl Stat 41:1–18
- de Luna X, Waernbaum I, Richardson TS (2011) Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika 98:861–875
- Fan J (1992) Design-adaptive nonparametric regression. J Am Stat Assoc 87:998–1004
- Fan J, Gijbels I (1996) Local polynomial modelling and its applications. Chapman and Hall, London
- Frölich M (2005) Matching estimators and optimal bandwidth choice. Stat Comput 15:197–215
- Frölich M (2007) Nonparametric IV estimation of local average treatment effects with covariates. J Econom 139:35–75
- Häggström J (2013) Bandwidth selection for backfitting estimation of semiparametric additive models: a simulation study. Comput Stat Data Anal 62:136–148
- Hansen B (2008) The prognostic analogue of the propensity score. Biometrika 95:481–488
- Härdle W, Hall P, Marron J (1992) Regression smoothing parameters that are not far from their optimum. J Am Stat Assoc 87:227–233
- Imbens GW (2004) Nonparametric estimation of average treatment effects under exogeneity: a review. Rev Econ Stat 86:4–29
- Imbens GW, Newey W, Ridder G (2005) Mean-squared-error calculations for average treatment effects. IEPR Working Papers 05.34, Institute of Economic Policy Research (IEPR). http://dornsife.usc.edu/IEPR/Working%20Papers/IEPR_05.34_%5bImbens.Newey.Ridder%5d.pdf
- Imbens GW, Wooldridge JM (2009) Recent developments in the econometrics of program evaluation. J Econ Lit 47:5–86
- Neyman J (1923) On the application of probability theory to agricultural experiments. Essay on principles. Section 9. (1990), translated (with discussion). Stat Sci 5:465–480
- Opsomer JD, Sheather S, Wand M (1995) An effective bandwidth selector for local least squares regression. J Am Stat Assoc 90:1257–1270
- R Core Team (2014) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
- Rosenbaum P, Rubin D (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70:41–55
- Rubin DB (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 66:688–701
- Ruppert D, Wand M (1994) Multivariate locally weighted least squares regression. Ann Stat 22:1346–1370
- Speckman P (1988) Kernel smoothing in partial linear models. J R Stat Soc Ser B Stat Methodol 50:413–436
- Waernbaum I (2010) Propensity score model specification for estimation of average treatment effects. J Stat Plan Inference 140:1948–1956
- Wilson JR (1984) Variance reduction techniques for digital simulation. Am J Math Manag Sci 4:277–312

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.