Abstract
Estimating sample size and statistical power is an essential part of good epidemiological study design. Closed-form formulas exist for simple hypothesis tests but not for the advanced statistical methods designed for exposure mixture studies. Estimating power with Monte Carlo simulations is flexible and applicable to these methods. However, coding a simulation is not straightforward for inexperienced programmers, and it is often hard for a researcher to manually specify the multivariate associations among exposures needed to set up a simulation. To simplify this process, we present the R package mpower for power analysis of observational studies of environmental exposure mixtures using recently developed mixture analysis methods. The components of mpower are also versatile enough to accommodate any mixture methods developed in the future. The package allows users to simulate realistic exposure data and mixed-type covariates based on public datasets, such as the National Health and Nutrition Examination Survey, or on existing datasets from prior studies. Users can generate power curves to assess the trade-offs among sample size, effect size, and power of a design. This paper presents tutorials and examples of power analysis using mpower.
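The general idea of simulation-based power estimation can be illustrated with a minimal sketch. It does not use mpower or reproduce its interface: it is written in Python with only the standard library, and the function names (`simulate_once`, `estimate_power`), the single-exposure linear model, the effect size of 0.3, and the large-sample z-test are all assumptions chosen for illustration. Power is estimated as the fraction of simulated datasets in which the null hypothesis is rejected, and repeating this over several sample sizes gives a crude power curve.

```python
import random
import statistics


def simulate_once(n, beta, rng):
    """Simulate one dataset from y = beta * x + noise and test H0: slope = 0."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta * xi + rng.gauss(0.0, 1.0) for xi in x]
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    # Assumption: normal approximation to the t distribution (two-sided 5% level).
    return abs(slope / se) > 1.96


def estimate_power(n, beta, n_sims=500, seed=1):
    """Monte Carlo power: share of simulated datasets that reject H0."""
    rng = random.Random(seed)
    rejections = sum(simulate_once(n, beta, rng) for _ in range(n_sims))
    return rejections / n_sims


# A crude power curve over sample size for a hypothetical effect size of 0.3.
power_curve = {n: estimate_power(n, beta=0.3) for n in (25, 50, 100, 200)}
for n, power in power_curve.items():
    print(f"n = {n:4d}  estimated power = {power:.2f}")
```

The same loop structure applies to more complex analysis methods: only the data-generating step and the rejection rule change, which is what makes the Monte Carlo approach flexible.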
Data Availability
Publicly available on GitHub (https://github.com/phuchonguyen/mpower).
Code Availability
Publicly available on GitHub (https://github.com/phuchonguyen/mpower).
Funding
This work was partially supported by Grants R01ES027498 and R01ES028804 of the National Institute of Environmental Health Sciences of the United States National Institutes of Health.
Author information
Contributions
AHH devised and supervised the project. PHN developed the software and examples and wrote the first draft of the manuscript. SME provided critical feedback and helped shape features of the software. All authors contributed to the final manuscript.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Consent to Participate
Not applicable.
Consent for Publication
Not applicable.
Ethical Approval
Not applicable.
Appendix A Estimated Signal-to-Noise Ratio as a Function of m
We will estimate the SNR of the following data-generating process using different values for m:
Since the predictors are independent standard normal random variables and the noise variance is 1, we can calculate the true SNR as \([0.3^2(1) + 0.3^2(1)]/1 = 0.18\). Figure 7 shows the estimated SNR and its standard error from 1000 bootstrap replicates for \(m \in \{500, 5000, 50000, 10000, 200000\}\). A larger m results in a more precise estimate. When the mixture model is defined based on resampling, it may not be possible to choose a large m without duplicating observations and underestimating the signal.
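The effect of m on precision can be checked with a small simulation. The sketch below is in Python with only the standard library and is not mpower's estimator. It assumes the data-generating process is \(y = 0.3x_1 + 0.3x_2 + \epsilon\) with \(\epsilon \sim N(0, 1)\), which is consistent with the SNR calculation above but is not confirmed by the text. The estimator (sample variance of the signal divided by sample variance of the noise), the function names, and the use of independent replicates in place of the bootstrap are likewise illustrative assumptions.

```python
import random
import statistics


def estimate_snr(m, rng, beta=(0.3, 0.3), noise_sd=1.0):
    """Estimate SNR = Var(signal) / Var(noise) from m simulated observations."""
    signal, noise = [], []
    for _ in range(m):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Assumed model: y = 0.3 * x1 + 0.3 * x2 + noise.
        signal.append(beta[0] * x1 + beta[1] * x2)
        noise.append(rng.gauss(0.0, noise_sd))
    return statistics.variance(signal) / statistics.variance(noise)


def snr_spread(m, n_reps=40, seed=7):
    """Mean and standard deviation of the SNR estimate over independent replicates."""
    rng = random.Random(seed)
    estimates = [estimate_snr(m, rng) for _ in range(n_reps)]
    return statistics.fmean(estimates), statistics.stdev(estimates)


mean_small, sd_small = snr_spread(500)
mean_large, sd_large = snr_spread(5000)
print(f"m = 500:  mean = {mean_small:.3f}, sd = {sd_small:.4f}")
print(f"m = 5000: mean = {mean_large:.3f}, sd = {sd_large:.4f}")
```

Under these assumptions both means should sit close to the true value of 0.18, with the spread shrinking as m grows.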
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Nguyen, P.H., Herring, A.H. & Engel, S.M. Power Analysis of Exposure Mixture Studies Via Monte Carlo Simulations. Stat Biosci (2023). https://doi.org/10.1007/s12561-023-09385-7
DOI: https://doi.org/10.1007/s12561-023-09385-7