
Mitigating the impact of measurement error when using penalized regression to model exposure in two-stage air pollution epidemiology studies

Published in Environmental and Ecological Statistics.

Abstract

Air pollution epidemiology studies often implement a two-stage approach. Exposure models are built using observed monitoring data to predict exposure at participant locations where the true exposure is unobserved, and the predictions are then used to estimate the health effect. This induces measurement error which may bias the estimated health effect and affect its standard error. The impact of measurement error depends on assumed data generating mechanisms and the approach used to estimate and predict exposure. A paradigm wherein the exposure surface is fixed and the subject and monitoring locations are random has been previously motivated, but corresponding measurement error methods exist only when modeling exposure with simple, low-rank, unpenalized regression splines. We develop a comprehensive treatment of measurement error when modeling exposure with high-but-fixed-rank penalized regression splines. If sufficiently rich, these models well-approximate full-rank methods such as universal kriging while remaining asymptotically tractable. We describe the implications of penalization for measurement error, motivate choosing the penalty to optimize health effect inference, derive an asymptotic bias correction, and provide a simple non-parametric bootstrap to account for all sources of variability. We find that highly parameterizing the exposure model results in severely biased and inefficient health effect inference if no penalty is used. Choosing the penalty to mitigate measurement error yields much less bias and better efficiency, and can lead to better confidence interval coverage than other common penalty selection methods. Combining the bias correction with the non-parametric bootstrap yields accurate coverage of nominal 95 % confidence intervals.


References

  • Abdi H (2010) Partial least squares regression and projection on latent structure regression (PLS regression). Wiley Interdiscip Rev Comput Stat 2(1):97–106

  • Bergen S, Sheppard L, Sampson P, Kim S, Richards M, Vedal S, Kaufman J, Szpiro A (2013) A national prediction model for \(\text{ PM }_{2.5}\) component exposures and measurement error-corrected health effect inference. Environ Health Perspect 121(9):1017–1025

  • Carroll R (2006) Measurement error in nonlinear models: a modern perspective. CRC Press, Boca Raton

  • Cefalu M, Dominici F (2014) Does exposure prediction bias health-effect estimation? The relationship between confounding adjustment and exposure prediction. Epidemiology 25(4):583–590

  • Chan S, Van Hee V, Bergen S, Szpiro A, Oron A, DeRoo L, London S, Marshall J, Kaufman J, Sandler D (accepted) Long term air pollution exposure and blood pressure in the Sister Study. Environ Health Perspect

  • Cressie N (1993) Statistics for spatial data. Wiley, New York

  • Cressie N, Johannesson G (2008) Fixed rank kriging for very large spatial data sets. J R Stat Soc Ser B (Stat Methodol) 70(1):209–226

  • Efron B, Tibshirani R (1993) An introduction to the bootstrap, vol 57. Chapman & Hall/CRC, Boca Raton

  • Green P, Silverman B (1994) Nonparametric regression and generalized linear models: a roughness penalty approach. Chapman & Hall, London

  • Gryparis A, Paciorek C, Zeka A, Schwartz J, Coull B (2009) Measurement error caused by spatial misalignment in environmental epidemiology. Biostatistics 10(2):258–274

  • Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning, 2nd edn. Springer, New York

  • Hodges J (2013) Richly parameterized linear models: additive, spatial, and time series models using random effects. Chapman & Hall/CRC, Boca Raton

  • Kammann E, Wand M (2003) Geoadditive models. J R Stat Soc Ser C (Appl Stat) 52(1):1–18

  • Kim S, Sheppard L, Kim H (2009) Health effects of long-term air pollution: influence of exposure prediction methods. Epidemiology 20(3):442–450

  • Lopiano K, Young L, Gotway C (2013) Estimated generalized least squares in spatially misaligned regression models with Berkson error. Biostatistics 14(4):737–751

  • Mancl L, DeRouen T (2001) A covariance estimator for GEE with improved small-sample properties. Biometrics 57(1):126–134

  • Miller K, Siscovick D, Sheppard L, Shepherd K, Sullivan J, Anderson G et al (2007) Long-term exposure to air pollution and incidence of cardiovascular events in women. N Engl J Med 356(5):447–458

  • NIEHS (2013) The Sister Study. http://www.sisterstudy.org/

  • Paciorek C (2007) Bayesian smoothing with Gaussian processes using Fourier basis functions in the spectralGP library. J Stat Softw 19(2)

  • Peng R (2013) Measurement error in air pollution epidemiology: guidance for uncertain times. Environmetrics 24(8):529–530

  • Ruppert D, Wand M, Carroll R (2003) Semiparametric regression. Cambridge University Press, Cambridge

  • Sampson P, Richards M, Szpiro A, Bergen S, Sheppard L, Larson T et al (2013) A regionalized national universal kriging model using partial least squares regression for estimating annual \(\text{ PM }_{2.5}\) concentrations in epidemiology. Atmos Environ 75:383–392

  • Shao J (2003) Mathematical statistics, 2nd edn. Springer, Berlin

  • Szpiro A, Paciorek C (2013) Measurement error in two-stage analyses, with application to air pollution epidemiology. Environmetrics 24(8):501–517

  • Szpiro A, Sheppard L, Lumley T (2011) Efficient measurement error correction with spatially misaligned data. Biostatistics 12(4):610–623

  • van der Vaart A (1998) Asymptotic statistics. Cambridge University Press, Cambridge

  • Wakefield J (2013) Bayesian and frequentist regression methods. Springer, New York

  • White H (1980) A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48(4):817–838

  • Wood S (2003) Thin plate regression splines. J R Stat Soc Ser B (Stat Methodol) 65(1):95–114

  • Yu Y, Ruppert D (2002) Penalized spline estimation for partially linear single-index models. J Am Stat Assoc 97(460):1042–1054


Acknowledgments

We thank the many participants and study staff comprising the Sister Study who contributed to making this study possible. This research was supported by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences (Z01 ES044005). Additional support by the NIEHS was provided through R01-ES009411, 5P50ES015915, 5R01ES020871, and T32 ES015459. Although the research described in this presentation has been funded wholly or in part by the US Environmental Protection Agency through R831697 and RD-83479601 to the University of Washington, it has not been subjected to the agency’s required peer and policy review and therefore does not necessarily reflect the views of the agency, and no official endorsement should be inferred.

Author information

Correspondence to Silas Bergen.

Additional information

Handling Editor: Bryan F. J. Manly.

Appendices

Appendix 1: Asymptotic definitions of moments

The following definitions of asymptotic moments were adapted by Szpiro and Paciorek (2013) from Shao (2003), for use in a setting very similar to ours.

Let \(v_1, v_2, \ldots \) be a sequence of random variables and let \(a_1, a_2, \ldots \) be a sequence of positive numbers such that \(\lim _{n\rightarrow \infty } a_n = \infty \). Let \(\vartheta \) be a real number.

Asymptotic mean. Suppose \(v\) is such that \(E|v| <\infty \) and we can write \((v_n-\vartheta ) ={\tilde{v}}_n + v'_n\) with \(E({\tilde{v}}_n) = 0\) and \(a_nv_n' \rightarrow _d v\). Then we denote \(E_{[a_n]}(v_n-\vartheta )= E(v)\) and call \(E(v)/a_n\) an order \(1/a_n\) asymptotic mean of \((v_n-\vartheta )\).

Asymptotic variance. Suppose \(v\) is such that \(Cov(v) < \infty \) and \(\sqrt{a_n}(v_n-\vartheta )\rightarrow _d v\). Then we denote \(Cov_{[a_n]}(v_n)=Cov(v)\) and call \(Cov_{[a_n]}(v_n)/a_n\) an order \(1/a_n\) asymptotic covariance of \(v_n\).
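To illustrate the asymptotic mean definition with a standard example (not taken from the paper): let \(v_n = {\bar{X}}_n^2\) for i.i.d. \(X_i\) with mean \(\mu \) and variance \(\sigma ^2\), and take \(\vartheta = \mu ^2\) and \(a_n = n\). Writing

$$\begin{aligned} v_n - \vartheta = \underbrace{2\mu ({\bar{X}}_n - \mu )}_{{\tilde{v}}_n} + \underbrace{({\bar{X}}_n - \mu )^2}_{v'_n}, \end{aligned}$$

we have \(E({\tilde{v}}_n) = 0\) and \(n({\bar{X}}_n - \mu )^2 \rightarrow _d \sigma ^2\chi ^2_1\), so \(E(v) = \sigma ^2\) and \(\sigma ^2/n\) is an order \(1/n\) asymptotic mean of \(v_n - \vartheta \).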

Appendix 2: Statement and proof of Lemma 1

Let \({\mathbf {r}}^\perp ({\mathbf {s}})\) contain elements \((r_k({\mathbf {s}}) - {\varTheta }({\mathbf {s}})^T\varphi _k)\), where \(\varphi _k=\hbox {argmin}_{\omega } \int (r_k({\mathbf {s}}) - {\varTheta }({\mathbf {s}})^T\omega )^2 dG({\mathbf {s}})\) for \(k \in \{1,\ldots ,p+q\}\). Let \(w^\perp _\lambda ({\mathbf {s}}_i)={\mathbf {r}}^\perp ({\mathbf {s}}_i)^T{\varvec{\gamma }}_{\lambda }\) and \({\hat{w}}^\perp _\lambda ({\mathbf {s}}_i) = {\mathbf {r}}^\perp ({\mathbf {s}}_i)^T {\hat{{\varvec{\gamma }}}}_{\lambda }\). Then, with:

$$\begin{aligned} f({\hat{{\varvec{\gamma }}}}_{\lambda }) = \frac{\int w^\perp _\lambda ({\mathbf {s}}) {\hat{w}}^\perp _\lambda ({\mathbf {s}})dG({\mathbf {s}})}{\int {\hat{w}}^\perp _\lambda ({\mathbf {s}})^2 dG({\mathbf {s}})} + \frac{\int u_\lambda ^B({\mathbf {s}}){\hat{w}}^\perp _\lambda ({\mathbf {s}})dG({\mathbf {s}})}{\int {\hat{w}}^\perp _\lambda ({\mathbf {s}})^2 dG({\mathbf {s}})}, \end{aligned}$$

\({\hat{\beta }}_{n^*} = \beta f({\hat{{\varvec{\gamma }}}}_{\lambda })\). Let \(\mathbf {h}_\lambda \) and \({\mathbf {H}}_\lambda \) denote the gradient and the Hessian matrix, respectively, of \(f({\hat{{\varvec{\gamma }}}}_{\lambda })\) evaluated at \({\varvec{\gamma }}_\lambda \). Let

$$\begin{aligned} \psi _\lambda ^B=\frac{\int u_\lambda ^B({\mathbf {s}}) w^\perp _\lambda ({\mathbf {s}})dG({\mathbf {s}})}{\int w^\perp _\lambda ({\mathbf {s}})^2dG({\mathbf {s}})}, \end{aligned}$$

and

$$\begin{aligned} \psi _\lambda ^C =\frac{1}{n^*}\left\{ \mathbf {h}_\lambda ^T E_{[n^*]}({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda ) + tr\left( {\mathbf {H}}_\lambda Cov_{[n^*]}({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda )\right) \right\} . \end{aligned}$$
  1. (a)
    $$\begin{aligned} \frac{1}{n^*}E_{[n^*]} \left( \frac{{\hat{\beta }}_{n^*}-\beta }{\beta } - \psi _\lambda ^B\right) = \psi _\lambda ^C \end{aligned}$$

    is an asymptotic expectation of \(\left( ({\hat{\beta }}_{n^*}-\beta )/\beta -\psi _\lambda ^B\right) \).

  2. (b)
    $$\begin{aligned} \frac{1}{n^*}Var_{[n^*]} \left( \frac{{\hat{\beta }}_{n^*}-\beta }{\beta }\right) = \frac{1}{n^*}\left( \mathbf {h}_\lambda ^TCov_{[n^*]}({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda ) \mathbf {h}_\lambda \right) \end{aligned}$$

    is an asymptotic variance of \(({\hat{\beta }}_{n^*}-\beta )/\beta \).

Proof

It is easy to see that

$$\begin{aligned} {\hat{\beta }}_{n,n^*} = \frac{\sum _{i=1}^n {\hat{w}}^\perp _\lambda ({\mathbf {s}}_i) y_i}{\sum _{i=1}^n {\hat{w}}^\perp _\lambda ({\mathbf {s}}_i) ^2}. \end{aligned}$$

Each health outcome measurement can be written

$$\begin{aligned} y_i = \beta w^\perp _\lambda ({\mathbf {s}}_i) +\beta (w_\lambda ({\mathbf {s}}_i)-w^\perp _\lambda ({\mathbf {s}}_i)) + \beta (x_i-w_\lambda ({\mathbf {s}}_i)) + {\varvec{\beta }}^T_Z{\mathbf {z}}_i + \varepsilon _i, \end{aligned}$$

which implies

$$\begin{aligned} {\hat{\beta }}_{n,n^*}&= \beta \left( \frac{\sum _1^n{\hat{w}}^\perp _\lambda ({\mathbf {s}}_i)w^\perp _\lambda ({\mathbf {s}}_i)}{\sum _1^n {\hat{w}}^\perp _\lambda ({\mathbf {s}}_i)^2}+ \frac{\sum _1^n {\hat{w}}^\perp _\lambda ({\mathbf {s}}_i)u_\lambda ^B({\mathbf {s}}_i)}{\sum _1^n {\hat{w}}^\perp _\lambda ({\mathbf {s}}_i)^2} +\frac{\sum _1^n {\hat{w}}^\perp _\lambda ({\mathbf {s}}_i)(w_\lambda ({\mathbf {s}}_i)-w^\perp _\lambda ({\mathbf {s}}_i))}{\sum _1^n {\hat{w}}^\perp _\lambda ({\mathbf {s}}_i)^2}\right) \\&\quad {} + \frac{\sum _1^n {\hat{w}}^\perp _\lambda ({\mathbf {s}}_i){\varvec{\beta }}^T_Z{\mathbf {z}}_i}{\sum _1^n {\hat{w}}^\perp _\lambda ({\mathbf {s}}_i)^2} + \frac{\sum _1^n{\hat{w}}^\perp _\lambda ({\mathbf {s}}_i)\varepsilon _i}{\sum _1^n {\hat{w}}^\perp _\lambda ({\mathbf {s}}_i)^2}. \end{aligned}$$

The numerator in each of the final three terms on the right-hand side of this expression has mean zero, since every random variable being summed over has mean zero. The numerator of the third term has mean zero because each element of \({\mathbf {r}}({\mathbf {s}})\) is orthogonal to each element of \({\mathbf {r}}({\mathbf {s}})-{\mathbf {r}}^\perp ({\mathbf {s}})\) by construction. The fourth numerator has mean zero because \({\mathbf {z}}_i={\varTheta }({\mathbf {s}}_i) + {\varvec{\zeta }}_i\), the elements of \({\mathbf {r}}^\perp ({\mathbf {s}}_i)\) are orthogonal to \({\varTheta }({\mathbf {s}}_i)\) by construction, and each element of \({\varvec{\zeta }}_i\) has mean zero and is independent of every element of \({\mathbf {r}}^\perp ({\mathbf {s}}_i)\). Finally, the fifth numerator has mean zero because every \(\varepsilon _i\) has mean zero and is independent of everything else. Letting \(n \rightarrow \infty \), the weak law of large numbers implies

$$\begin{aligned} {\hat{\beta }}_{n^*}&= \beta \frac{\int w^\perp _\lambda ({\mathbf {s}}) {\hat{w}}^\perp _\lambda ({\mathbf {s}})dG({\mathbf {s}})}{\int {\hat{w}}^\perp _\lambda ({\mathbf {s}})^2 dG({\mathbf {s}})} + \beta \frac{\int u_\lambda ^B({\mathbf {s}}){\hat{w}}^\perp _\lambda ({\mathbf {s}})dG({\mathbf {s}})}{\int {\hat{w}}^\perp _\lambda ({\mathbf {s}})^2 dG({\mathbf {s}})} \\&= \beta f({\hat{{\varvec{\gamma }}}}_\lambda ). \end{aligned}$$

If we write \(b_\lambda =\int u_\lambda ^B({\mathbf {s}}){\mathbf {r}}^\perp ({\mathbf {s}})dG({\mathbf {s}})\) and \({\mathbf {A}}= \int {\mathbf {r}}^\perp ({\mathbf {s}}){\mathbf {r}}^\perp ({\mathbf {s}})^T dG({\mathbf {s}})\) we can express this as

$$\begin{aligned} f({\hat{{\varvec{\gamma }}}}_\lambda ) = ({\varvec{\gamma }}_\lambda ^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_\lambda )({\hat{{\varvec{\gamma }}}}_\lambda ^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_\lambda )^{-1} + (b_\lambda ^T{\hat{{\varvec{\gamma }}}}_{\lambda })({\hat{{\varvec{\gamma }}}}_\lambda ^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_\lambda )^{-1}, \end{aligned}$$

and we are now prepared to perform a Taylor expansion of \(f({\hat{{\varvec{\gamma }}}}_{\lambda })\) around \({\varvec{\gamma }}_\lambda \). The gradient of \(f({\hat{{\varvec{\gamma }}}}_{\lambda })\) is:

$$\begin{aligned} D f({\hat{{\varvec{\gamma }}}}_\lambda )&= {}-2({\varvec{\gamma }}_\lambda ^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_\lambda )({\hat{{\varvec{\gamma }}}}_\lambda ^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_\lambda )^{-2}{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_\lambda + ({\hat{{\varvec{\gamma }}}}_\lambda ^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_\lambda )^{-1}{\mathbf {A}}{\varvec{\gamma }}_\lambda \\&{}-2(b_\lambda ^T{\hat{{\varvec{\gamma }}}}_{\lambda })({\hat{{\varvec{\gamma }}}}_{\lambda }^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_{\lambda })^{-2}{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_{\lambda }+({\hat{{\varvec{\gamma }}}}_\lambda ^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_\lambda )^{-1}b_\lambda . \end{aligned}$$

The Hessian is:

$$\begin{aligned} D^2f({\hat{{\varvec{\gamma }}}}_{\lambda })&= {}-2({\varvec{\gamma }}_\lambda ^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_{\lambda })({\hat{{\varvec{\gamma }}}}_{\lambda }^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_{\lambda })^{-2}{\mathbf {A}}+ 8({\varvec{\gamma }}_\lambda ^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_{\lambda })({\hat{{\varvec{\gamma }}}}_{\lambda }^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_{\lambda })^{-3}{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_{\lambda }{\hat{{\varvec{\gamma }}}}_{\lambda }^T{\mathbf {A}}\\&{}-2({\hat{{\varvec{\gamma }}}}_{\lambda }^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_{\lambda })^{-2}{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_{\lambda }{\varvec{\gamma }}_\lambda ^T{\mathbf {A}}-2({\hat{{\varvec{\gamma }}}}_{\lambda }^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_{\lambda })^{-2}{\mathbf {A}}{\varvec{\gamma }}_\lambda {\hat{{\varvec{\gamma }}}}_{\lambda }^T{\mathbf {A}}\\&{} -2(b_\lambda ^T{\hat{{\varvec{\gamma }}}}_{\lambda })({\hat{{\varvec{\gamma }}}}_{\lambda }^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_{\lambda })^{-2}{\mathbf {A}}+ 8(b_\lambda ^T{\hat{{\varvec{\gamma }}}}_{\lambda })({\hat{{\varvec{\gamma }}}}_{\lambda }^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_{\lambda })^{-3}{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_{\lambda }{\hat{{\varvec{\gamma }}}}_{\lambda }^T{\mathbf {A}}\\&{} -2({\hat{{\varvec{\gamma }}}}_{\lambda }^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_{\lambda })^{-2}b_\lambda {\hat{{\varvec{\gamma }}}}_{\lambda }^T{\mathbf {A}}-2({\hat{{\varvec{\gamma }}}}_{\lambda }^T{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_{\lambda })^{-2}{\mathbf {A}}{\hat{{\varvec{\gamma }}}}_{\lambda }b_\lambda ^T. \end{aligned}$$

Evaluating these quantities at \({\hat{{\varvec{\gamma }}}}_{\lambda }={\varvec{\gamma }}_\lambda \) yields:

$$\begin{aligned} f({\varvec{\gamma }}_\lambda )&= 1+ (b_\lambda ^T{\varvec{\gamma }}_\lambda )({\varvec{\gamma }}_\lambda ^T{\mathbf {A}}{\varvec{\gamma }}_\lambda )^{-1}\\ \\ D f({\hat{{\varvec{\gamma }}}}_{\lambda })|_{{\varvec{\gamma }}_\lambda }&= {}-({\varvec{\gamma }}_\lambda ^T{\mathbf {A}}{\varvec{\gamma }}_\lambda )^{-1}{\mathbf {A}}{\varvec{\gamma }}_\lambda \\&{}-2(b_\lambda ^T{\varvec{\gamma }}_\lambda )({\varvec{\gamma }}_\lambda ^T{\mathbf {A}}{\varvec{\gamma }}_\lambda )^{-2}{\mathbf {A}}{\varvec{\gamma }}_\lambda +({\varvec{\gamma }}_\lambda ^T{\mathbf {A}}{\varvec{\gamma }}_\lambda )^{-1}b_\lambda \\ \frac{1}{2}D^2f({\hat{{\varvec{\gamma }}}}_{\lambda })|_{{\varvec{\gamma }}_\lambda }&= {} -({\varvec{\gamma }}_\lambda ^T{\mathbf {A}}{\varvec{\gamma }}_\lambda )^{-1}{\mathbf {A}}+ 2({\varvec{\gamma }}_\lambda ^T{\mathbf {A}}{\varvec{\gamma }}_\lambda )^{-2}{\mathbf {A}}{\varvec{\gamma }}_\lambda {\varvec{\gamma }}_\lambda ^T{\mathbf {A}}\\&{} -(b_\lambda ^T{\varvec{\gamma }}_\lambda )({\varvec{\gamma }}_\lambda ^T{\mathbf {A}}{\varvec{\gamma }}_\lambda )^{-2}{\mathbf {A}}+ 4(b_\lambda ^T{\varvec{\gamma }}_\lambda )({\varvec{\gamma }}_\lambda ^T{\mathbf {A}}{\varvec{\gamma }}_\lambda )^{-3}{\mathbf {A}}{\varvec{\gamma }}_\lambda {\varvec{\gamma }}_\lambda ^T{\mathbf {A}}\\&{} -({\varvec{\gamma }}_\lambda ^T{\mathbf {A}}{\varvec{\gamma }}_\lambda )^{-2}b_\lambda {\varvec{\gamma }}_\lambda ^T{\mathbf {A}}-({\varvec{\gamma }}_\lambda ^T{\mathbf {A}}{\varvec{\gamma }}_\lambda )^{-2}{\mathbf {A}}{\varvec{\gamma }}_\lambda b_\lambda ^T. \end{aligned}$$

We let \(\mathbf {h}_\lambda = D f({\hat{{\varvec{\gamma }}}}_{\lambda })|_{{\varvec{\gamma }}_\lambda }\) and \({\mathbf {H}}_\lambda =\frac{1}{2}D^2f({\hat{{\varvec{\gamma }}}}_{\lambda })|_{{\varvec{\gamma }}_\lambda }\), and then Taylor expansion of \({\hat{\beta }}_{n^*}\) about \({\varvec{\gamma }}_\lambda \) gives

$$\begin{aligned} {\hat{\beta }}_{n^*}&= \beta f({\hat{{\varvec{\gamma }}}}_{\lambda }) = \beta +\beta (b_\lambda ^T{\varvec{\gamma }}_\lambda )({\varvec{\gamma }}_\lambda ^T{\mathbf {A}}{\varvec{\gamma }}_\lambda )^{-1} \\&+\,\beta \left\{ \mathbf {h}_\lambda ^T({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda ) +({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda )^T{\mathbf {H}}_\lambda ({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda )\right\} + o_p(1/n^*). \end{aligned}$$

Noting that \((b_\lambda ^T{\varvec{\gamma }}_\lambda )({\varvec{\gamma }}_\lambda ^T{\mathbf {A}}{\varvec{\gamma }}_\lambda )^{-1} = \psi _\lambda ^B\), we have:

$$\begin{aligned} \frac{({\hat{\beta }}_{n^*} -\beta )}{\beta } = \psi _\lambda ^B + \mathbf {h}_\lambda ^T({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda ) +({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda )^T{\mathbf {H}}_\lambda ({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda ) + o_p(1/n^*). \end{aligned}$$

Thus,

$$\begin{aligned} E_{[n^*]}\left( \frac{({\hat{\beta }}_{n^*} -\beta )}{\beta } - \psi _\lambda ^B\right)&= \mathbf {h}_\lambda ^TE_{[n^*]}({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda ) +tr({\mathbf {H}}_\lambda Cov_{[n^*]}({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda )); \\ Var_{[n^*]}\left( \frac{({\hat{\beta }}_{n^*} -\beta )}{\beta } \right)&= \mathbf {h}_\lambda ^TCov_{[n^*]}({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda )\mathbf {h}_\lambda ; \end{aligned}$$

as desired. Here we again emphasize how critical it is that the exposure model is fixed-rank. If \(q\) is not fixed, \(({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda )\) does not converge in distribution. The fact that \(q\) does not depend on \(n^*\) enables us to invoke standard asymptotic theory in deriving a bias expression useful for measurement error correction for finite \(n^*\).

Appendix 3: Estimation of Lemma 1 quantities

In order to use Lemma  1 to estimate the bias and variance of \({\hat{\beta }}_{n^*}\), we need to estimate \(\psi _\lambda ^B\) (the bias from Berkson-like error); \(\mathbf {h}_\lambda \) and \({\mathbf {H}}_\lambda \) (the gradient and Hessian, respectively, of \({\hat{\beta }}_{n^*}\) with respect to \({\hat{{\varvec{\gamma }}}}_{\lambda }\)), and the moments of \(({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda )\) from available data.

We emphasize that all integrals in Lemma 1 are with respect to \(G({\mathbf {s}})\). Because we assume that the monitoring and subject locations are both drawn from \(G({\mathbf {s}})\), we can approximate these integrals with respect to \(G({\mathbf {s}})\) by summing over the empirical distribution of either monitoring or subject locations. However, for some quantities, such as \(\int u_\lambda ^B({\mathbf {s}})w_\lambda ^\perp ({\mathbf {s}})dG({\mathbf {s}})\), estimates of the integrand are only available at subject locations. In these instances we must use the empirical distribution of monitoring locations to approximate \(G({\mathbf {s}})\).

We also note that \(\psi _\lambda ^B, \mathbf {h}_\lambda \), and \(\mathbf {H}_\lambda \) are functions of \({\varvec{\gamma }}_\lambda \). To estimate \({\varvec{\gamma }}_\lambda \) we plug in its consistent estimator, \({\hat{{\varvec{\gamma }}}}_{\lambda }\).

The numerator of \(\psi _\lambda ^B\), \(\int u^B_\lambda ({\mathbf {s}}) w_\lambda ^\perp ({\mathbf {s}}) dG({\mathbf {s}})\), can only be estimated using data from monitoring locations, since that is where we can observe approximate values of \(u^B_\lambda \) and \(w_\lambda ^\perp \). Since \(u^B_\lambda ({\mathbf {s}}_i) = x_i - {\mathbf {r}}({\mathbf {s}}_i)^T{\varvec{\gamma }}_\lambda \) is the difference between the observed and predicted exposure values, and the prediction model was also fit at monitoring locations, we were concerned that simply plugging in residuals from the penalized regression would lead to an underestimate of the numerator of \(\psi _\lambda ^B\). Accordingly, we also considered 10-fold and leave-one-out cross-validation to estimate \(u^B_\lambda \), but found that this overestimated \(\psi _\lambda ^B\), especially for small values of \(\lambda \). For small \(\lambda \), the numerator of \(\psi _\lambda ^B\) is close to zero since \(u^B_\lambda \) is nearly orthogonal to \(w_\lambda ({\mathbf {s}})\), and cross-validation broke this near-orthogonality. We therefore opted for the more straightforward approach of directly plugging in the residuals. The denominator of \(\psi _\lambda ^B\), \(\int w^\perp _\lambda ({\mathbf {s}})^2dG({\mathbf {s}})\), can be estimated by summing over the empirical distribution of either the subject or monitoring locations. Because there are often many more subjects than monitors, it makes sense to use the subject locations to approximate this integral; this is what we did in our simulations and data analysis.
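A minimal sketch of this plug-in estimate (illustrative code, not the authors' implementation; the array names are assumptions: u_hat holds penalized-regression residuals at the monitoring locations, while w_perp_mon and w_perp_sub hold estimated \({\hat{w}}^\perp _\lambda \) at the monitoring and subject locations, respectively):

```python
import numpy as np

def estimate_psi_B(u_hat, w_perp_mon, w_perp_sub):
    """Plug-in estimate of psi_lambda^B.

    Numerator: int u_lambda^B * w_perp dG, approximated over the empirical
    distribution of monitoring locations (the only places where residuals
    u_lambda^B are observable).
    Denominator: int (w_perp)^2 dG, approximated over the (typically much
    larger) set of subject locations.
    """
    numer = np.mean(u_hat * w_perp_mon)
    denom = np.mean(w_perp_sub ** 2)
    return numer / denom
```

The two integrals are deliberately approximated over different empirical distributions, mirroring the text: residuals exist only at monitors, while the denominator can exploit the denser subject locations.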

Following Yu and Ruppert (2002), we estimate \(Cov({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda )\) using a sandwich form that accounts for having \(\lambda >0\). In some applications it is reasonable to ignore \(\lambda \) when constructing asymptotic standard errors, because it converges to zero as the sample size increases. However, we have found that explicitly accounting for \(\lambda \) yields better finite-sample estimates of \(Cov({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda )\). If we set

$$\begin{aligned} {\mathbf {B}}\!=\! \int \! \left\{ \left( {\mathbf {r}}({\mathbf {s}})\left( {\varPhi }({\mathbf {s}}) \!-\! {\mathbf {r}}({\mathbf {s}})^T{\varvec{\gamma }}_\lambda \right) \!-\! \lambda {\mathbf {D}}{\varvec{\gamma }}_\lambda \right) \left( {\mathbf {r}}({\mathbf {s}})\!\left( {\varPhi }({\mathbf {s}}) \!-\! {\mathbf {r}}({\mathbf {s}})^T{\varvec{\gamma }}_\lambda \right) \!-\! \lambda {\mathbf {D}}{\varvec{\gamma }}_\lambda \right) ^T\right\} dG({\mathbf {s}}) \end{aligned}$$

and \({\mathbf {A}}= \lambda {\mathbf {D}}+ \int {\mathbf {r}}({\mathbf {s}}){\mathbf {r}}({\mathbf {s}})^TdG({\mathbf {s}})\), then \(\sqrt{n^*}({\hat{{\varvec{\gamma }}}}_{\lambda }- {\varvec{\gamma }}_\lambda ) \rightarrow _d N(0,{\mathbf {V}})\), where \({\mathbf {V}}= {\mathbf {A}}^{-1}{\mathbf {B}}{\mathbf {A}}^{-1}\) is a model-robust covariance of \({\hat{{\varvec{\gamma }}}}_{\lambda }\) that incorporates the reduced variance arising from \(\lambda >0\). We can estimate \({\mathbf {V}}\) by plugging in consistent estimates of all the quantities: \({\hat{{\varvec{\gamma }}}}_{\lambda }\) for \({\varvec{\gamma }}_\lambda \), and \({\mathbf {R}}({\mathbf {s}}^*)^T{\mathbf {R}}({\mathbf {s}}^*)\) whenever an estimate of \(\int {\mathbf {r}}({\mathbf {s}}){\mathbf {r}}({\mathbf {s}})^TdG({\mathbf {s}})\) is needed. Note, however, that although this covariance estimate effectively accounts for the reduction in variance induced by \(\lambda \), it retains all the characteristics of a sandwich covariance estimate for unpenalized regression; in particular, it may be biased in small samples. Mancl and DeRouen (2001) showed that, in the context of generalized estimating equations and assuming a correctly specified mean model with no penalty, inversely weighting each residual in the \({\hat{{\mathbf {B}}}}\) portion of the estimated covariance matrix by one minus the influence of each data point reduces the bias of the covariance estimate. Although we make fewer assumptions, we similarly found that weighting each residual term in \({\hat{{\mathbf {B}}}}\) by \(1-h_{ii}\) reduced the bias of our sandwich covariance estimates for smaller sample sizes, where \(h_{ii} = {\mathbf {r}}({\mathbf {s}}^*_i)^T(n^*\lambda {\mathbf {D}}+ {\mathbf {R}}({\mathbf {s}}^*)^T{\mathbf {R}}({\mathbf {s}}^*))^{-1}{\mathbf {r}}({\mathbf {s}}^*_i)\).
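This sandwich estimator can be sketched numerically as follows. This is an illustrative sketch under assumed conventions, not the authors' code: empirical means stand in for the \(G\)-integrals, the penalty enters the normal equations as \(n^*\lambda {\mathbf {D}}\), and the \(1-h_{ii}\) correction is applied as an inverse weight on each residual, in the spirit of Mancl and DeRouen (2001).

```python
import numpy as np

def sandwich_cov(R, x, lam, D, small_sample=True):
    """Sketch of the lambda-aware sandwich covariance V = A^{-1} B A^{-1}.

    R: (n, k) basis matrix at monitoring locations; x: (n,) exposures;
    lam: penalty parameter; D: (k, k) penalty matrix.  Returns V, the
    asymptotic covariance of sqrt(n)*(gamma_hat - gamma), so Cov(gamma_hat)
    is approximately V / n.
    """
    n = R.shape[0]
    # Penalized least squares: (R'R + n*lam*D) gamma = R'x.
    gamma = np.linalg.solve(R.T @ R + n * lam * D, R.T @ x)
    e = x - R @ gamma
    if small_sample:
        # Leverage h_ii = r_i' (n*lam*D + R'R)^{-1} r_i; inversely weight
        # each residual by (1 - h_ii) to reduce small-sample bias.
        Minv = np.linalg.inv(n * lam * D + R.T @ R)
        h = np.einsum('ij,jk,ik->i', R, Minv, R)
        e = e / (1.0 - h)
    A = lam * D + (R.T @ R) / n
    # Row i of S is the estimating-function term r_i * e_i - lam * D @ gamma.
    S = R * e[:, None] - lam * (D @ gamma)[None, :]
    B = (S.T @ S) / n
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv
```

With \(\lambda = 0\) and `small_sample=False` this reduces to the usual heteroskedasticity-robust sandwich of White (1980).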

It remains to describe estimation of \(\mathbf {h}_\lambda , {\mathbf {H}}_\lambda \), and \(E_{[n^*]}({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda )\). Estimation of \(\mathbf {h}_\lambda \) and \({\mathbf {H}}_\lambda \) only requires estimation of \(b_\lambda , {\mathbf {A}}\) and \({\varvec{\gamma }}_\lambda \). \(b_\lambda \) was estimated as described in the text, using residuals from the penalized regression fits to the monitoring data to estimate \(u_\lambda ^B\). \({\mathbf {A}}\) was estimated by summing over plug-ins of \({\mathbf {r}}^\perp ({\mathbf {s}}_i){\mathbf {r}}^\perp ({\mathbf {s}}_i)^T\) at subject locations, and \({\varvec{\gamma }}_\lambda \) was estimated with \({\hat{{\varvec{\gamma }}}}_{\lambda }\).

To estimate \(E_{[n^*]}({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda )\) we used an approach very similar to the one outlined in Szpiro and Paciorek (2013). For arbitrary \(m_j\) let \({\mathbf {M}}\) denote the \(n^*\times n^*\) diagonal matrix where the \(j\)th diagonal element is \(m_j\). If we let \(m_j=1/n^*\) for \(j=1,\ldots ,n^*\) and \(\Lambda \) the \((p+q) \times (p+q)\) matrix \(\lambda {\mathbf {D}}\), then we note that

$$\begin{aligned} {\hat{{\varvec{\gamma }}}}_{\lambda }= \kappa (m_1,\ldots ,m_{n^*})= ({\mathbf {R}}({\mathbf {s}}^*)^T{\mathbf {M}}{\mathbf {R}}({\mathbf {s}}^*) + \Lambda )^{-1}{\mathbf {R}}({\mathbf {s}}^*)^T{\mathbf {M}}{\mathbf {x}}^*. \end{aligned}$$

From here we want to take the expectation of \({\hat{{\varvec{\gamma }}}}_{\lambda }\) over repeated realizations \(m_j^*\) of the \(m_j\), where each realization re-weights the rows of \({\mathbf {R}}({\mathbf {s}}^*)\) and the elements of \({\mathbf {x}}^*\). Heuristically, we use \(\{{\mathbf {s}}_1,\ldots ,{\mathbf {s}}_{n^*}\}\) to approximate the support of \(G(\cdot )\) and the weights \(1/n^*\) to approximate \(G(\cdot )\), the probability distribution over this support, and we treat repeated sampling with replacement of monitoring locations from \(\{{\mathbf {s}}_1,\ldots ,{\mathbf {s}}_{n^*}\}\) as an approximation to repeated sampling from \(G(\cdot )\). This is equivalent to sampling new \(m_j^*\) from a multinomial distribution with probabilities \(p_j=1/n^*\). Each realization of the \(m_j^*\) yields new coefficients \({\hat{{\varvec{\gamma }}}}_{\lambda }^*\), and we treat the asymptotic mean of \(({\hat{{\varvec{\gamma }}}}_{\lambda }^*-{\hat{{\varvec{\gamma }}}}_{\lambda })\equiv (\kappa (m_1^*,\ldots ,m_{n^*}^*) -\kappa (m_1,\ldots ,m_{n^*}))\) as an approximation to the quantity we are truly interested in, the asymptotic mean of \(({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda )\). This is justified more formally in Section 20.1 of van der Vaart (1998). We find the asymptotic distribution (and hence mean) of \((\kappa (m_1^*,\ldots ,m_{n^*}^*) -\kappa (1/n^*,\ldots ,1/n^*))\) using a multinomial Taylor expansion, perturbing \(\{m_1^*,\ldots ,m_{n^*}^*\}\) about their means \(\{1/n^*,\ldots ,1/n^*\}\).

Following Szpiro and Paciorek (2013), we see that this implies

$$\begin{aligned} E_{[n^*]}({\hat{{\varvec{\gamma }}}}_{\lambda }-{\varvec{\gamma }}_\lambda ) \approx \frac{1}{2} \left( \frac{1}{n^*} - \frac{1}{(n^*)^2}\right) \sum _{j=1}^{n^*} \frac{ \partial ^2 \kappa }{ \partial m_j^2} - \frac{1}{2} \frac{1}{(n^*)^2} \sum _{j,k=1;j\ne k}^{n^*} \frac{\partial ^2 \kappa }{\partial m_j \partial m_k} \end{aligned}$$

where, with \({\mathbf {R}}^*={\mathbf {R}}({\mathbf {s}}^*)\) for simplicity of notation:

$$\begin{aligned} \frac{\partial ^2 \kappa }{\partial m_j \partial m_k}&= ({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {R}}^{*T} \frac{\partial {\mathbf {M}}}{\partial m_k}{\mathbf {R}}^*({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {R}}^{*T}\frac{\partial {\mathbf {M}}}{\partial m_j}{\mathbf {R}}^*({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {x}}^* \\&\quad -\,({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {R}}^{*T}\frac{\partial ^2 {\mathbf {M}}}{\partial m_j\partial m_k}{\mathbf {R}}^*({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {x}}^* \\&\quad +\,({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {R}}^{*T} \frac{\partial {\mathbf {M}}}{\partial m_j}{\mathbf {R}}^*({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {R}}^{*T}\frac{\partial {\mathbf {M}}}{\partial m_k}{\mathbf {R}}^*({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {x}}^* \\&\quad -\,({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {R}}^{*T}\frac{\partial {\mathbf {M}}}{\partial m_j}{\mathbf {R}}^*({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {R}}^{*T}\frac{\partial {\mathbf {M}}}{\partial m_k}{\mathbf {x}}^* \\&\quad -\,({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {R}}^{*T}\frac{\partial {\mathbf {M}}}{\partial m_k}{\mathbf {R}}^*({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {R}}^{*T}\frac{\partial {\mathbf {M}}}{\partial m_j}{\mathbf {x}}^* \\&\quad +\,({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {R}}^{*T}\frac{\partial ^2 {\mathbf {M}}}{\partial m_j\partial m_k}{\mathbf {x}}^* \\&= ({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {r}}^*_k{\mathbf {r}}^{*T}_k({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {r}}^*_j{\mathbf {r}}_j^{*T}({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {x}}^* - 0 \\&\quad +\,({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {r}}^*_j{\mathbf {r}}^{*T}_j({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {r}}^*_k{\mathbf {r}}_k^{*T}({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {x}}^* \\&\quad -\,({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {r}}_j^{*}{\mathbf {r}}_j^{*T}({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {r}}_k^*x_k^* \\&\quad -\,({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {r}}_k^{*}{\mathbf {r}}_k^{*T}({\mathbf {R}}^{*T}{\mathbf {M}}{\mathbf {R}}^* + \Lambda )^{-1}{\mathbf {r}}_j^*x_j^*, \end{aligned}$$

with the two terms involving \(\partial ^2 {\mathbf {M}}/\partial m_j\partial m_k\) vanishing because \({\mathbf {M}}\) is linear in \({\mathbf {m}}\).
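The simplified form of this mixed partial can be checked numerically. The following sketch, under the assumption \({\mathbf {M}}=\mathrm {diag}({\mathbf {m}})\) with \(\Lambda \) held fixed and with small randomly generated stand-ins for \({\mathbf {R}}^*\), \({\mathbf {x}}^*\), and \(\Lambda \) (all dimensions hypothetical), compares the closed-form expression against a central finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3                       # number of monitors, spline basis dimension
R = rng.normal(size=(n, p))       # stands in for R* = R(s*)
x = rng.normal(size=n)            # stands in for x*
Lam = np.eye(p)                   # stands in for the penalty matrix Lambda

def kappa(m):
    """Penalized weighted least squares: (R'MR + Lambda)^{-1} R'M x, M = diag(m)."""
    M = np.diag(m)
    return np.linalg.solve(R.T @ M @ R + Lam, R.T @ M @ x)

m0 = np.ones(n)
j, k = 1, 4

# Closed-form mixed partial from the display above; the two terms with
# second derivatives of M are zero because M = diag(m) is linear in m.
Ainv = np.linalg.inv(R.T @ np.diag(m0) @ R + Lam)
b = R.T @ np.diag(m0) @ x
rj, rk = R[j], R[k]               # rows of R* as vectors
closed = (Ainv @ np.outer(rk, rk) @ Ainv @ np.outer(rj, rj) @ Ainv @ b
          + Ainv @ np.outer(rj, rj) @ Ainv @ np.outer(rk, rk) @ Ainv @ b
          - Ainv @ np.outer(rj, rj) @ Ainv @ rk * x[k]
          - Ainv @ np.outer(rk, rk) @ Ainv @ rj * x[j])

# Central finite-difference approximation to the same mixed partial
h = 1e-4
ej, ek = np.eye(n)[j], np.eye(n)[k]
fd = (kappa(m0 + h*ej + h*ek) - kappa(m0 + h*ej - h*ek)
      - kappa(m0 - h*ej + h*ek) + kappa(m0 - h*ej - h*ek)) / (4 * h**2)

print(np.max(np.abs(closed - fd)))  # agreement up to finite-difference error
```

Agreement to several decimal places confirms that the term-by-term simplification above is consistent with the definition of \(\kappa \) as a penalized weighted least squares estimator.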

Appendix 4: Further details on Sister Study analysis

This section provides further details on the Sister Study analysis originally performed by Chan et al. (accepted) and re-analyzed here using our measurement error methodology.

The original analysis used \(\hbox {PM}_{2.5}\) predictions from a regionalized universal kriging model (Sampson et al. 2013). This model divided the nation into three regions (West Coast, West, and East) and fit a universal kriging model in each of the three regions using PLS to model geographic features of \(\hbox {PM}_{2.5}\).

The primary result of Chan et al. (accepted) was based on data from the 43,629 Sister Study participants whose residences in the year 2006 could be identified and geo-coded (all such participants resided in the lower 48 states). The health model used ordinary least squares to estimate the health effect of \(\hbox {PM}_{2.5}\) on SBP. Health model covariates used for adjustment included demographic variables (age and race); socio-economic variables (household income, education, marital status, working more than 20 h a week outside the home, perceived stress score, and socio-economic status z-score); important spatial features associated with pollution (urban-rural continuum code and a 10-degree-of-freedom thin-plate spline for latitude and longitude); cardiovascular disease risk factors (body mass index, waist-to-hip ratio, smoking status, alcohol use, self-reported history of diabetes, and self-reported history of hypercholesterolemia); and blood pressure medication use.

All of these adjustment covariates were included in the health models described in this paper.
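To make the two-stage structure concrete, the following is a minimal simulated sketch, not the Sister Study implementation: all dimensions and variable names are hypothetical, the penalty is a fixed ridge term rather than a selected one, and the health model omits the adjustment covariates listed above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions: monitors, subjects, spline basis size
n_mon, n_sub, p = 200, 500, 10
R_mon = rng.normal(size=(n_mon, p))   # exposure basis at monitor locations
R_sub = rng.normal(size=(n_sub, p))   # same basis at subject locations
gamma = rng.normal(size=p)

x_mon = R_mon @ gamma + rng.normal(scale=0.5, size=n_mon)  # observed exposures
x_sub_true = R_sub @ gamma                                 # unobserved truth
beta_true = 1.0
y = beta_true * x_sub_true + rng.normal(size=n_sub)        # health outcome

# Stage 1: penalized regression spline fit to the monitoring data,
# then prediction of exposure at subject locations
lam = 1.0
coef = np.linalg.solve(R_mon.T @ R_mon + lam * np.eye(p), R_mon.T @ x_mon)
x_hat = R_sub @ coef

# Stage 2: OLS of the health outcome on the *predicted* exposure;
# the prediction step is what induces measurement error
X = np.column_stack([np.ones(n_sub), x_hat])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0][1]
print(beta_hat)
```

With a rich monitoring network the plug-in estimate lands near the true health effect; the paper's concern is quantifying and correcting the bias and extra variability this plug-in step introduces when the exposure model is highly parameterized.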


Cite this article

Bergen, S., Szpiro, A.A. Mitigating the impact of measurement error when using penalized regression to model exposure in two-stage air pollution epidemiology studies. Environ Ecol Stat 22, 601–631 (2015). https://doi.org/10.1007/s10651-015-0314-y
