Abstract
This paper presents a selective survey of recent developments in statistical inference and multiple testing for high-dimensional regression models, including linear and logistic regression. We examine the construction of confidence intervals and hypothesis tests for various low-dimensional objectives such as regression coefficients and linear and quadratic functionals. The key technique is to construct debiased and desparsified estimators for the targeted low-dimensional objectives and to estimate their uncertainty. Besides covering the motivations for and intuitions behind these statistical methods, we discuss their optimality and adaptivity in the context of high-dimensional inference. We also review recent developments in statistical inference based on multiple regression models and advances in large-scale multiple testing for high-dimensional regression. The R package SIHR implements some of the high-dimensional inference methods discussed in this paper.
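To make the debiasing idea concrete, the following is a minimal sketch in Python, assuming a linear model with Gaussian design and a single regression coefficient as the target. It is not the SIHR implementation or its API. The function names (`lasso_cd`, `debiased_coef`, `simulate`), the tuning parameters, the simulated data, and the crude noise-level estimate are all illustrative assumptions. The sketch fits an initial Lasso estimator, builds a projection direction from a nodewise Lasso regression of the target covariate on the remaining covariates, corrects the initial estimate along that direction, and reports a plug-in standard error.

```python
import numpy as np


def soft_threshold(a, t):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=100):
    """Minimize (1/(2n))||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = np.array(y, dtype=float)  # running residual y - Xb (a copy of y at b = 0)
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]  # partial residual without coordinate j
            b[j] = soft_threshold(X[:, j] @ r / n, lam) / col_sq[j]
            r -= X[:, j] * b[j]
    return b


def debiased_coef(X, y, j, lam, lam_node):
    """Debiased estimate and standard error for coefficient j (illustrative)."""
    n, p = X.shape
    beta_hat = lasso_cd(X, y, lam)  # initial, biased Lasso fit
    others = np.delete(np.arange(p), j)
    # Projection direction: residual of x_j after a Lasso regression on the rest.
    gamma = lasso_cd(X[:, others], X[:, j], lam_node)
    z = X[:, j] - X[:, others] @ gamma
    resid = y - X @ beta_hat
    denom = z @ X[:, j]
    est = beta_hat[j] + z @ resid / denom  # bias-corrected estimate
    # Assumption: crude noise estimate from the Lasso residuals.
    df = max(n - np.count_nonzero(beta_hat), 1)
    sigma_hat = np.sqrt(resid @ resid / df)
    se = sigma_hat * np.linalg.norm(z) / abs(denom)
    return est, se


def simulate(seed=0, n=200, p=300):
    """Simulated sparse linear model with p > n; the true first coefficient is 2."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:3] = [2.0, -1.5, 1.0]
    y = X @ beta + rng.standard_normal(n)
    lam = np.sqrt(2.0 * np.log(p) / n)  # assumed tuning, not data-driven
    return debiased_coef(X, y, j=0, lam=lam, lam_node=0.5 * lam)


est, se = simulate()
print("debiased estimate:", est)
print("approximate 95% CI:", (est - 1.96 * se, est + 1.96 * se))
```

The interval uses the normal approximation `est ± 1.96 * se`. In practice the tuning parameters and the noise level would be chosen in a data-driven way, for example by cross-validation or the scaled Lasso.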
Acknowledgements
The research of Yin Xia was supported in part by NSFC Grant 12022103. The research of Tony Cai was supported in part by NSF Grant DMS-2015259 and NIH grant R01-GM129781. The research of Zijian Guo was partly supported by the NSF grants DMS 1811857 and 2015373 and NIH grants R01GM140463 and R01LM013614.
About this article
Cite this article
Cai, T.T., Guo, Z. & Xia, Y. Statistical inference and large-scale multiple testing for high-dimensional regression models. TEST 32, 1135–1171 (2023). https://doi.org/10.1007/s11749-023-00870-1
DOI: https://doi.org/10.1007/s11749-023-00870-1
Keywords
- Confidence interval
- Debiasing
- False discovery rate
- Hypothesis testing
- Linear functionals
- Quadratic functionals
- Simultaneous inference