Perils and prospects of using aggregate area level socioeconomic information as a proxy for individual level socioeconomic confounders in instrumental variables regression

Hsu, Jesse Yenchih; Lorch, Scott A.; Small, Dylan S.

doi:10.1007/s10742-012-0095-9

Published: 01 June 2012

Health Services and Outcomes Research Methodology, Volume 12, pages 119–140 (2012)

Jesse Yenchih Hsu^1,2, Scott A. Lorch^2,3,4 & Dylan S. Small^1

Abstract

A frequent concern in making statistical inference for causal effects of a policy or treatment based on observational studies is that there are unmeasured confounding variables. The instrumental variable method is an approach to estimating a causal relationship in the presence of unmeasured confounding variables. A valid instrumental variable needs to be independent of the unmeasured confounding variables, so it is important to control for any measured confounding variable that is correlated with the instrument. In health services research, socioeconomic status variables are often considered as confounding variables. In recent studies, distance to a specialty care center has been used as an instrument for the effect of specialty care vs. general care. Because the instrument may be correlated with socioeconomic status variables, it is important that socioeconomic status variables are controlled for in the instrumental variables regression. However, health data sets often lack individual socioeconomic information but contain area average socioeconomic information from the US Census, e.g., average income or education level in a county. We study the effects on the bias of the two stage least squares estimates in instrumental variables regression when using an area-level variable as a controlled confounding variable that may be correlated with the instrument. We propose an aggregated instrumental variables regression based on Wald’s method of grouping, under the assumption that the grouping is independent of the errors. We present simulation results and an application to a study of perinatal care for premature infants.

References

Abadie, A.: Semiparametric instrumental variable estimation of treatment response models. J. Econometr. 113, 231–263 (2003)
American Academy of Pediatrics, Committee on Fetus and Newborn: Levels of neonatal care. Pediatrics 114(5), 1341–1347 (2004)
Angrist, J.D.: Grouped-data estimation and testing in simple labor-supply models. J. Econometr. 47, 243–266 (1991)
Angrist, J.D., Imbens, G.W., Rubin, D.B.: Identification of causal effects using instrumental variables. J. Am. Stat. Assoc. 91(434), 444–455 (1996)
Angrist, J.D., Krueger, A.B.: Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments. Working Paper 8456, National Bureau of Economic Research (2001)
Baiocchi, M., Small, D.S., Lorch, S., Rosenbaum, P.R.: Building a stronger instrument in an observational study of perinatal care for premature infants. J. Am. Stat. Assoc. 105(492), 1285–1296 (2010)
Brookhart, M.A., Schneeweiss, S.: Preference-based instrumental variable methods for the estimation of treatment effects: assessing validity and interpreting results. Int. J. Biostat. 3(1), Article 14 (2007)
Card, D., Krueger, A.B.: Does school quality matter? Returns to education and the characteristics of public schools in the United States. J. Polit. Econ. 100(1), 1–40 (1992)
Cifuentes, J., Bronstein, J., Phibbs, C.S., Phibbs, R.H., Schmitt, S.K., Carlo, W.A.: Mortality in low birth weight infants according to level of neonatal care at hospital of birth. Pediatrics 109(5), 745–751 (2002)
Geronimus, A.T., Bound, J.: Use of census-based aggregate variables to proxy for socioeconomic group: evidence from national samples. Am. J. Epidemiol. 148(5), 475–486 (1998)
Geronimus, A.T., Bound, J., Neidert, L.J.: On the validity of using census geocode characteristics to proxy individual socioeconomic characteristics. J. Am. Stat. Assoc. 91(434), 529–537 (1996)
Hernán, M.A., Robins, J.M.: Instruments for causal inference: an epidemiologist’s dream? Epidemiology 17(4), 360–372 (2006)
Holland, P.W.: Causal inference, path analysis, and recursive structural equations models. Sociol. Methodol. 18, 449–484 (1988)
Joffe, M.M., Small, D., Ten Have, T., Brunelli, S., Feldman, H.I.: Extended instrumental variables estimation for overall effects. Int. J. Biostat. 4(1), Article 4 (2008)
Krieger, N.: Overcoming the absence of socioeconomic data in medical records: validation and application of a census-based methodology. Am. J. Public Health 82(5), 703–710 (1992)
Krieger, N., Chen, J.T., Waterman, P.D., Rehkopf, D.H., Subramanian, S.V.: Race/ethnicity, gender, and monitoring socioeconomic gradients in health: a comparison of area-based socioeconomic measures – the public health disparities geocoding project. Am. J. Public Health 93(10), 1655–1671 (2003)
Krieger, N., Chen, J.T., Waterman, P.D., Soobader, M.-J., Subramanian, S.V., Carson, R.: Choosing area based socioeconomic measures to monitor social inequalities in low birth weight and childhood lead poisoning: the public health disparities geocoding project (US). J. Epidemiol. Commun. Health 57, 186–199 (2003)
Lipsitz, S., Fitzmaurice, G.: Generalized estimating equations for longitudinal data analysis. In: Fitzmaurice, G., Davidian, M., Verbeke, G., Molenberghs, G. (eds.), Longitudinal Data Analysis, pp. 43–78. CRC/Chapman & Hall, Boca Raton, FL (2009)
Lorch, S.A., Baiocchi, M., Ahlberg, C.E., Small, D.S.: The differential impact of delivery hospital on the outcomes of premature infants. Pediatrics (in press) (2012)
Lorch, S.A., Myers, S., Carr, B.: The regionalization of pediatric health care. Pediatrics 126(6), 1182–1190 (2010)
Mayer, S.E., Jencks, C.: Growing up in poor neighborhoods: how much does it matter? Science 243(4897), 1441–1445 (1989)
McClellan, M., McNeil, B.J., Newhouse, J.P.: Does more intensive treatment of acute myocardial infarction in the elderly reduce mortality? J. Am. Med. Assoc. 272(1), 859–866 (1994)
Neyman, J.: On the application of probability theory to agricultural experiments (translated and edited by D.M. Dabrowska and T. P. Speed). Stat. Sci. 5(4), 465–480 (1990)
Pearl, J.: Causality: Models, Reasoning, and Inference. Cambridge University Press, New York (2000)
Phibbs, C.S., Baker, L.C., Caughey, A.B., Danielsen, B., Schmitt, S.K., Phibbs, R.H.: Level and volume of neonatal intensive care and mortality in very-low-birth-weight infants. New Engl. J. Med. 356, 2165–2175 (2007)
Phibbs, C.S., Mark, D.H., Luft, H.S., Peltzman-Rennie, D.J., Garnick, D.W., Lichtenberg, E., McPhee, S.J.: Choice of hospital for delivery: a comparison of high-risk and low-risk women. Health Serv. Res. 28(2), 201–222 (1993)
Phibbs, C.S., Robinson, J.C.: A variable-radius measure of local hospital market structure. Health Serv. Res. 28(3), 313–324 (1993)
Prais, S.J., Aitchison, J.: The grouping of observations in regression analysis. Rev. Int. Stat. Inst. 22(1/3), 1–22 (1954)
Rogowski, J.A., Horbar, J.D., Staiger, D.O., Kenny, M., Carpenter, J., Geppert, J.: Indirect vs direct hospital quality indicators for very-low-birth-weight infants. J. Am. Med. Assoc. 291(2), 202–209 (2004)
Rosenbaum, P.R., Rubin, D.B.: Discussion of “on state education statistics”: a difficulty with regression analyses of regional test score averages. J. Edu. Stat. 10(4), 326–333 (1985)
Rubin, D.B.: Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 66(5), 688–701 (1974)
Rubin, D.B.: Statistics and causal inference: comment: which ifs have causal answers. J. Am. Stat. Assoc. 81(396), 961–962 (1986)
Stock, J.H., Wright, J.H., Yogo, M.: A survey of weak instruments and weak identification in generalized method of moments. J. Bus. Econ. Stat. 20(4), 518–529 (2002)
Theil, H.: Principles of Econometrics. Wiley, New York (1971)
Wald, A.: The fitting of straight lines if both variables are subject to error. Ann. Math. Stat. 11(3), 284–300 (1940)

Acknowledgments

The authors thank the Editors and the referees for helpful comments. This work was supported by the National Science Foundation (Measurement, Methodology and Statistics program) grant # NSF 0961971. This work was also supported by Maternal and Child Health Bureau (MCHB) grant # R40 MC05474-01-00 and by Agency for Healthcare Research and Quality (AHRQ) grant # R01 HS 01569.

Author information

Authors and Affiliations

Department of Statistics, Wharton School, University of Pennsylvania, 400 Jon M. Huntsman Hall, 3730 Walnut Street, Philadelphia, PA, 19104-6302, USA
Jesse Yenchih Hsu & Dylan S. Small
Center for Outcomes Research, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA
Jesse Yenchih Hsu & Scott A. Lorch
Department of Pediatrics, School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Scott A. Lorch
Division of Neonatology, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA
Scott A. Lorch

Corresponding author

Correspondence to Jesse Yenchih Hsu.

Appendix: Consistency of the TSLS estimator from the aggregate IV regression and its variance estimate

In this section, we show the consistency of $\hat{\beta}^{agg}$ obtained from the aggregate IV regression in Sect. 4 and provide an estimate of its variance.

Let $({\bf Y}^{agg}, {\bf X}^{agg}, {\bf D}^{agg}, {\bf Z}^{agg})$ denote the vectors of aggregated $({\bf Y}, {\bf X}, {\bf D}, {\bf Z})$, and let $\underline{{\bf W}}^{agg} = [{\bf D}^{agg}, {\bf X}^{agg}]$ and $\underline{{\bf A}}^{agg} = [{\bf Z}^{agg}, {\bf X}^{agg}]$. We underline $(\underline{{\bf W}}^{agg}, \underline{{\bf A}}^{agg})$ to distinguish them from the $({\bf W}^{agg}, {\bf A}^{agg})$ used in Sect. 3, in which ${\bf W}^{agg} = [{\bf D}, {\bf X}^{agg}]$ and ${\bf A}^{agg} = [{\bf Z}, {\bf X}^{agg}]$. Let $({\bf H}_{Y}, {\bf H}_{A}, {\bf H}_{W})$ be the aggregation errors for $({\bf Y}, {\bf A}, {\bf W})$, where ${\bf H}_{Y} = {\bf Y} - {\bf Y}^{agg},\, {\bf H}_{A} = {\bf A} - \underline{{\bf A}}^{agg}$, and ${\bf H}_{W} = {\bf W} - \underline{{\bf W}}^{agg}$. The TSLS estimator $\hat{\beta}^{agg}$ obtained from all aggregate variables is

$$ \begin{aligned} \hat{\beta}^{agg} &= (\underline{{\bf A}}^{agg^{T}} \underline{{\bf W}}^{agg})^{-1} \underline{{\bf A}}^{agg^{T}}{\bf Y}^{agg} \\ &= (\underline{{\bf A}}^{agg^{T}} \underline{{\bf W}}^{agg})^{-1} \underline{{\bf A}}^{agg^{T}}\{(\underline{{\bf W}}^{agg} + {\bf H}_{W}) \beta - {\bf H}_{Y} + \varepsilon\} \\ &= \beta + (\underline{{\bf A}}^{agg^{T}} \underline{{\bf W}}^{agg})^{-1} \underline{{\bf A}}^{agg^{T}}({\bf H}_{W} \beta - {\bf H}_{Y} + \varepsilon) \end{aligned} $$

(10)
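The estimator in (10) can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: the data-generating design, the sample sizes, the coefficient values and the helper `group_mean` are all hypothetical, and aggregation is assumed to mean replacing each variable by its group (area) mean. Because the groups here have equal size, using one row per group gives the same estimate as repeating each group mean for every unit.

```python
import numpy as np

rng = np.random.default_rng(0)
J, n_j = 500, 20                      # hypothetical: 500 areas, 20 units each
n = J * n_j
group = np.repeat(np.arange(J), n_j)  # area label of each unit (sorted)
beta = np.array([2.0, -1.0])          # hypothetical true effects of D and X

# Hypothetical design: instrument Z and confounder X share an area-level
# component, so they are correlated; U is an unmeasured confounder.
z_area = rng.normal(size=J)
x_area = 0.5 * z_area + rng.normal(size=J)
Z = z_area[group] + rng.normal(size=n)
X = x_area[group] + rng.normal(size=n)
U = rng.normal(size=n)
D = Z + 0.5 * X + U + rng.normal(size=n)   # treatment depends on U
eps = U + rng.normal(size=n)               # error depends on U, not on Z
Y = beta[0] * D + beta[1] * X + eps

def group_mean(v):
    """Mean of v within each area (relies on equal, contiguous groups)."""
    return v.reshape(J, n_j).mean(axis=1)

# Aggregate every variable, then apply the just-identified IV formula
# (A_agg' W_agg)^{-1} A_agg' Y_agg from (10).
Y_agg = group_mean(Y)
W_agg = np.column_stack([group_mean(D), group_mean(X)])
A_agg = np.column_stack([group_mean(Z), group_mean(X)])
beta_agg = np.linalg.solve(A_agg.T @ W_agg, A_agg.T @ Y_agg)

# Individual-level OLS for contrast; it ignores the unmeasured confounder U.
W = np.column_stack([D, X])
beta_ols = np.linalg.lstsq(W, Y, rcond=None)[0]

print("aggregate IV:", beta_agg)
print("naive OLS:   ", beta_ols)
```

In this design the grouping is unrelated to the error term, which is the assumption the consistency argument relies on; the OLS fit is included only to show what the unmeasured confounder does when no instrument is used.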

If we can show that $\underline{{\bf A}}^{agg^{T}}({\bf H}_{W} \beta - {\bf H}_{Y} + \varepsilon) \xrightarrow{p}{\bf 0}$ in (10), then $\hat{\beta}^{agg}$ is a consistent estimator for $\beta$. We can write the aggregate matrices $({\bf Y}^{agg}, \underline{{\bf A}}^{agg}, \underline{{\bf W}}^{agg})$ as $({\bf G}{\bf Y}, {\bf G}{\bf A}, {\bf G}{\bf W})$. The matrix ${\bf G}$ is a block-diagonal grouping matrix, where

$$ {\bf G} = \left[\begin{array}{llll} {\bf G}_{1} & {\bf 0} & \cdots & {\bf 0} \\ {\bf 0} & {\bf G}_{2} & \cdots & {\bf 0} \\ \vdots & \vdots & \ddots & \vdots \\ {\bf 0} & {\bf 0} & \cdots & {\bf G}_{J}\\ \end{array}\right] \quad \hbox{and} \quad {\bf G}_{j} = \left[\begin{array}{lll} 1/n_{j} & \cdots & 1/n_{j} \\ \vdots & \ddots & \vdots \\ 1/n_{j} & \cdots & 1/n_{j}\\ \end{array}\right]. $$

Here $J$ is the number of groups and ${\bf G}_{j}$ is the $n_{j} \times n_{j}$ block for group $j$, which contains $n_{j}$ units.
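The grouping matrix can be built directly for a toy example. This is a minimal sketch under assumed inputs: the function name `grouping_matrix`, the group sizes and the data vector are hypothetical, and units are assumed to be ordered so that members of the same group are adjacent.

```python
import numpy as np

def grouping_matrix(group_sizes):
    """Block-diagonal G whose j-th block is n_j x n_j with every entry 1/n_j."""
    n = sum(group_sizes)
    G = np.zeros((n, n))
    start = 0
    for n_j in group_sizes:
        G[start:start + n_j, start:start + n_j] = 1.0 / n_j
        start += n_j
    return G

group_sizes = [2, 3]                       # hypothetical: two groups
G = grouping_matrix(group_sizes)

y = np.array([1.0, 3.0, 2.0, 4.0, 6.0])    # hypothetical outcomes
y_agg = G @ y                              # each entry becomes its group mean

is_symmetric = np.allclose(G, G.T)
is_idempotent = np.allclose(G @ G, G)
print(y_agg, is_symmetric, is_idempotent)
```

Multiplying by the matrix replaces every unit's value with the mean of its group, so applying it twice changes nothing further.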

Thus, $\underline{{\bf A}}^{agg^{T}}({\bf H}_{W}\beta - {\bf H}_{Y} + \varepsilon)$ can be written as ${\bf A}^{T}{\bf G}^{T}\{({\bf W} - {\bf G}{\bf W})\beta - ({\bf Y} - {\bf G}{\bf Y}) + \varepsilon\}$. Since ${\bf G}$ is a symmetric and idempotent matrix, ${\bf G}^{T}({\bf W} - {\bf G}{\bf W}) = {\bf G}{\bf W} - {\bf G}{\bf G}{\bf W} = {\bf 0}$, and likewise ${\bf G}^{T}({\bf Y} - {\bf G}{\bf Y}) = {\bf 0}$. Also, ${\bf A}^{T}{\bf G}^{T}\varepsilon$ converges in probability to zero because of the assumption of independence between ${\bf G}$ and $\varepsilon$. The variance of $\hat{\beta}^{agg}$ is

$$ \begin{aligned} Var\left(\hat{\beta}^{agg}\right) &= Var\left\{(\underline{{\bf A}}^{agg^{T}} \underline{{\bf W}}^{agg})^{-1} \underline{{\bf A}}^{agg^{T}} ({\bf H}_{W} \beta - {\bf H}_{Y} + \varepsilon)\right\} \\ & = ({\bf A}^{T}{\bf G}{\bf W})^{-1}{\bf A}^{T}{\bf G} \times Var\left\{({\bf W} - {\bf G}{\bf W})\beta - ({\bf Y} - {\bf G}{\bf Y}) + \varepsilon\right\} \times {\bf G}{\bf A}({\bf W}^{T}{\bf G}{\bf A})^{-1} \\ &= ({\bf A}^{T}{\bf G}{\bf W})^{-1}{\bf A}^{T}{\bf G} \times Var\left\{({\bf G}{\bf Y} - {\bf G}{\bf W} \beta)\right\} \times {\bf G}{\bf A}({\bf W}^{T}{\bf G}{\bf A})^{-1} \\ &= ({\bf A}^{T}{\bf G}{\bf W})^{-1}{\bf A}^{T}{\bf G} \times Var({\bf G}\varepsilon) \times {\bf G}{\bf A}({\bf W}^{T}{\bf G}{\bf A})^{-1}, \end{aligned} $$

(11)

where $Var({\bf G} \varepsilon)$ can be estimated by

$$ \begin{aligned} \widehat{Var}({\bf G}\varepsilon) &= \left[\begin{array}{llll} \hat{\Sigma}_{1} & {\bf 0} & \cdots & {\bf 0} \\ {\bf 0} & \hat{\Sigma}_{2} & \cdots & {\bf 0} \\ \vdots & \vdots & \ddots & \vdots \\ {\bf 0} & {\bf 0} & \cdots & \hat{\Sigma}_{J}\\ \end{array}\right] \quad \hbox{and} \quad \\ \hat{\Sigma}_{j} &= \left[\begin{array}{ccc} (n_{j}-1)^{-1}\sum_{i=1}^{n_{j}}(y_{j}^{agg}-[z_{j}^{agg}, x_{j}^{agg}]\hat{\beta}^{agg})^{2} & \cdots & (n_{j}-1)^{-1}\sum_{i=1}^{n_{j}}(y_{j}^{agg}-[z_{j}^{agg}, x_{j}^{agg}]\hat{\beta}^{agg})^{2} \\ \vdots & \ddots & \vdots \\ (n_{j}-1)^{-1}\sum_{i=1}^{n_{j}}(y_{j}^{agg}-[z_{j}^{agg}, x_{j}^{agg}]\hat{\beta}^{agg})^{2} & \cdots & (n_{j}-1)^{-1}\sum_{i=1}^{n_{j}}(y_{j}^{agg}-[z_{j}^{agg}, x_{j}^{agg}]\hat{\beta}^{agg})^{2}\\ \end{array}\right]. \end{aligned} $$
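The sandwich form in (11) with this plug-in estimate can be sketched numerically. This is an illustration under stated assumptions rather than the authors' implementation: the simulated design and all variable names are hypothetical, groups have equal size, and the group residual is computed as $y_{j}^{agg}$ minus the aggregated treatment and covariate times $\hat{\beta}^{agg}$ (that is, from $\underline{{\bf W}}^{agg}$), which is one reading of the displayed formula.

```python
import numpy as np

rng = np.random.default_rng(1)
J, n_j = 200, 5                        # hypothetical: 200 groups of 5 units
n = J * n_j
group = np.repeat(np.arange(J), n_j)

# Hypothetical data: Z instrument, X covariate, U unmeasured confounder.
Z = rng.normal(size=J)[group] + rng.normal(size=n)
X = rng.normal(size=J)[group] + rng.normal(size=n)
U = rng.normal(size=n)
D = Z + 0.5 * X + U + rng.normal(size=n)
Y = 2.0 * D - 1.0 * X + U + rng.normal(size=n)

W = np.column_stack([D, X])
A = np.column_stack([Z, X])

# Grouping matrix with equal-sized blocks filled with 1/n_j.
G = np.kron(np.eye(J), np.full((n_j, n_j), 1.0 / n_j))
W_agg, A_agg, Y_agg = G @ W, G @ A, G @ Y

# Aggregate IV point estimate.
beta_hat = np.linalg.solve(A_agg.T @ W_agg, A_agg.T @ Y_agg)

# Plug-in estimate of Var(G eps): block j is constant, with every entry equal
# to (n_j - 1)^{-1} * sum_i r_j^2 = n_j * r_j^2 / (n_j - 1), where r_j is the
# group-level residual (assumption: residual taken with respect to W_agg).
resid_agg = Y_agg - W_agg @ beta_hat            # constant within each group
c = n_j / (n_j - 1) * resid_agg ** 2
same_group = group[:, None] == group[None, :]
V_hat = same_group * c[:, None]

# Sandwich: (A'GW)^{-1} A'G V_hat G A (W'GA)^{-1}; G is symmetric, so the
# right-hand bread is the transpose of the left-hand one.
bread = np.linalg.inv(A.T @ G @ W)
var_hat = bread @ A.T @ G @ V_hat @ G @ A @ bread.T
se_hat = np.sqrt(np.diag(var_hat))

print("estimate:", beta_hat)
print("std. err:", se_hat)
```

Forming the full $n \times n$ matrices keeps the code close to the notation; for large data sets one would work with group-level sums instead, which give the same quantities without storing ${\bf G}$.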

About this article

Cite this article

Hsu, J.Y., Lorch, S.A. & Small, D.S. Perils and prospects of using aggregate area level socioeconomic information as a proxy for individual level socioeconomic confounders in instrumental variables regression. Health Serv Outcomes Res Method 12, 119–140 (2012). https://doi.org/10.1007/s10742-012-0095-9

Received: 30 November 2011
Revised: 13 May 2012
Accepted: 16 May 2012
Published: 01 June 2012
Issue Date: June 2012
DOI: https://doi.org/10.1007/s10742-012-0095-9
