Abstract
Variable importance is a key statistical issue in exposure mixtures, as it allows a ranking of exposures as potential targets for intervention, and helps to identify bad actors within a mixture. In settings where mixtures have many constituents or high between-constituent correlations, estimators of importance can be subject to bias or high variance. Current approaches to assessing variable importance have major limitations, including reliance on overly strong or incorrect constraints or assumptions, excessive model extrapolation, or poor interpretability, especially regarding practical significance. We sought to overcome these limitations by applying an established doubly robust, machine learning-based approach to estimating variable importance in a mixtures context. This method reduces model extrapolation, appropriately controls confounding, and provides both interpretability and model flexibility. We illustrate its use with an evaluation of the relationship between telomere length, a measure of biologic aging, and exposure to a mixture of polychlorinated biphenyls (PCBs), dioxins, and furans among 979 US adults from the National Health and Nutrition Examination Survey (NHANES). In contrast with standard approaches for mixtures, our approach selected PCB 180 and PCB 194 as important contributors to telomere length. We hypothesize that this difference could be due to residual confounding in standard methods that rely on variable selection. Further empirical evaluation of this method is needed, but it is a promising tool in the search for bad actors within a mixture.
Funding
This study was funded by the Intramural Research Program of the National Institutes of Health, NCI, Division of Cancer Epidemiology and Genetics (Grant No. Z01CP010119), and by the Intramural Research Program of the National Institutes of Health, NIEHS (Grant No. Z01ES044005).
Cite this article
Keil, A.P., O’Brien, K.M. Considerations and Targeted Approaches to Identifying Bad Actors in Exposure Mixtures. Stat Biosci (2023). https://doi.org/10.1007/s12561-023-09409-2
DOI: https://doi.org/10.1007/s12561-023-09409-2