Abstract
For survival outcomes in cancer and many other complex diseases, gene–environment (G-E) interactions have been established as essentially important. G-E interaction analysis can be roughly classified as marginal or joint, depending on the number of G variables analyzed at a time. In this study, we focus on joint analysis, which better reflects disease biology but is statistically more challenging. Many approaches have been developed for joint G-E interaction analysis of survival outcomes and have led to important findings. However, quite a few of these methods lack rigorous statistical development and therefore rest on weak theoretical ground. To fill this knowledge gap, we consider joint G-E interaction analysis under the Cox model. Sparse group penalization is adopted to regularize estimation and to select important main effects and interactions. The “main effects, interactions” variable selection hierarchy, which has been strongly advocated in recent literature, is satisfied. Advancing significantly beyond some published studies, we rigorously establish consistency properties under high dimensionality. An effective computational algorithm is developed. Simulation demonstrates the competitive performance of the proposed approach, and analysis of The Cancer Genome Atlas (TCGA) data on stomach adenocarcinoma (STAD) further demonstrates its practical utility.
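To make the ingredients named in the abstract concrete, the following is a minimal sketch of a penalized Cox objective for joint G-E analysis. It is not the article's implementation: the abstract does not state the exact penalty, so a lasso-type sparse group penalty is assumed here, and all function names, argument names, and the tuning parameters `lam1` and `lam2` are hypothetical. Each gene's main effect and its interactions form one group, and only the interactions receive an additional individual penalty, which is one common way to respect the “main effects, interactions” hierarchy.

```python
import numpy as np

def linear_predictor(E, G, alpha, beta, gamma):
    """eta_i = E_i'alpha + G_i'beta + sum_{j,k} G_ij * E_ik * gamma_jk."""
    # (G @ gamma)[i, k] = sum_j G_ij * gamma_jk, then weight by E_ik and sum over k.
    return E @ alpha + G @ beta + ((G @ gamma) * E).sum(axis=1)

def neg_log_partial_likelihood(time, event, eta):
    """Negative Cox log partial likelihood with Breslow-style risk sets."""
    # at_risk[i, l] is True when subject l is still at risk at time_i.
    at_risk = time[None, :] >= time[:, None]
    shift = eta.max()  # subtract the maximum to stabilize the exponentials
    risk_sum = (at_risk * np.exp(eta - shift)[None, :]).sum(axis=1)
    log_risk = shift + np.log(risk_sum)
    # Only observed events (event == 1) contribute to the partial likelihood.
    return -np.sum(event * (eta - log_risk))

def sparse_group_penalty(beta, gamma, lam1, lam2):
    """Assumed penalty: group norm per gene plus an L1 term on interactions."""
    # Group j collects the main effect beta_j and its interactions gamma_j.
    group_norms = np.sqrt(beta ** 2 + (gamma ** 2).sum(axis=1))
    return lam1 * group_norms.sum() + lam2 * np.abs(gamma).sum()

def objective(time, event, E, G, alpha, beta, gamma, lam1, lam2):
    """Scaled negative log partial likelihood plus the assumed penalty."""
    eta = linear_predictor(E, G, alpha, beta, gamma)
    n = len(time)
    loss = neg_log_partial_likelihood(time, event, eta) / n
    return loss + sparse_group_penalty(beta, gamma, lam1, lam2)
```

In this sketch `E` is an n-by-q matrix of environmental variables, `G` is an n-by-p matrix of genetic variables, and `gamma` is a p-by-q matrix of interaction coefficients. The environmental main effects `alpha` are left unpenalized, which is also an assumption. Minimizing such an objective would require a dedicated algorithm; the abstract says only that an effective one is developed, so none is sketched here.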
Acknowledgements
The authors thank the Editor, Associate Editor, and two referees for their insightful comments, which have led to a significant improvement of this article. This study is partly supported by the National Bureau of Statistics of China (2022LZ34), the National Natural Science Foundation of China (11971404, 72071169, 71988101, 82204153), the National Social Science Foundation of China (21&ZD146), and the NIH (CA204120, CA121974, and CA196530).
About this article
Cite this article
Fang, K., Li, J., Xu, Y. et al. Gene–environment interaction analysis under the Cox model. Ann Inst Stat Math 75, 931–948 (2023). https://doi.org/10.1007/s10463-023-00871-9
DOI: https://doi.org/10.1007/s10463-023-00871-9