Abstract
A one-step semiparametric bootstrap procedure is constructed to estimate the distribution of estimators after model selection and of model averaging estimators with data-dependent weights. The method is generally applicable to non-normal models, and misspecification is allowed for all candidate parametric models. The semiparametric bootstrap estimator is shown to be consistent within specific regions in which the good and the bad candidate models are separated. Simulation studies illustrate that the bootstrap procedure leads to short confidence intervals with good coverage.
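To fix ideas, the general idea of bootstrapping after model selection can be sketched in a few lines. The following is a hedged toy illustration only (a pairs bootstrap with AIC-based selection in a simple linear model); it is not the paper's one-step semiparametric procedure, and all names and the data-generating setup are illustrative assumptions:

```python
import numpy as np

# Toy data: linear model with true slope 0.5 (an assumed setup).
rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

def fit_and_select(x, y):
    """Fit intercept-only vs intercept+slope OLS and pick by AIC.

    Returns the post-selection slope estimate (0 if the small model wins).
    """
    n = len(y)
    # Model 0: intercept only
    rss0 = np.sum((y - y.mean()) ** 2)
    # Model 1: intercept + slope
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss1 = np.sum((y - X @ beta) ** 2)
    # Gaussian-likelihood AIC up to an additive constant
    aic0 = n * np.log(rss0 / n) + 2 * 1
    aic1 = n * np.log(rss1 / n) + 2 * 2
    return beta[1] if aic1 < aic0 else 0.0

# Pairs bootstrap: resample (y_i, x_i) jointly, redo selection each time.
B = 500
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot[b] = fit_and_select(x[idx], y[idx])

# Percentile interval for the post-selection slope
lo, hi = np.percentile(boot, [2.5, 97.5])
```

Note that redoing the selection step inside every resample is what distinguishes this from naively bootstrapping the selected model alone; the paper's semiparametric scheme refines this naive pairs-bootstrap sketch.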
References
Aerts M, Claeskens G (2001) Bootstrap tests for misspecified models, with application to clustered binary data. Comput Stat Data Anal 36(3):383–401
Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Petrov B, Csáki F (eds) Second international symposium on information theory. Akadémiai Kiadó, Budapest, pp 267–281
Bachoc F, Preinerstorfer D, Steinberger L (2020) Uniformly valid confidence intervals post-model-selection. Ann Stat 48(1):440–463
Belloni A, Chernozhukov V (2013) Least squares after model selection in high-dimensional sparse models. Bernoulli 19(2):521–547
Berk R, Brown L, Buja A et al (2013) Valid post-selection inference. Ann Stat 41(2):802–837
Camponovo L (2015) On the validity of the pairs bootstrap for lasso estimators. Biometrika 102(4):981–987
Charkhi A, Claeskens G (2018) Asymptotic post-selection inference for the Akaike information criterion. Biometrika 105(3):645–664
Claeskens G (1999) Smoothing techniques and bootstrap methods for multiparameter likelihood models. Ph.D. Thesis, Limburgs Universitair Centrum, Diepenbeek
Claeskens G, Hjort NL (2003) The focused information criterion. J Am Stat Assoc 98(464):900–916. With discussion and a rejoinder by the authors
Danilov D, Magnus JR (2004) On the harm that ignoring pretesting can cause. J Econom 122(1):27–46
Efron B (2014) Estimation and accuracy after model selection. J Am Stat Assoc 109(507):991–1007
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360
Garcia-Angulo A, Claeskens G (2023) Exact uniformly most powerful post-selection confidence distributions. Scand J Stat 50:358–382
Garcia-Angulo A, Claeskens G (2023) Optimal finite sample post-selection confidence distributions in generalized linear models. J Stat Plan Inference 222:66–77
Giurcanu MC (2012) Bootstrapping in non-regular smooth function models. J Multivar Anal 111:78–93
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference and prediction, 2nd edn. Springer, Berlin
Hjort NL, Claeskens G (2003) Frequentist model average estimators. J Am Stat Assoc 98(464):879–899
Hu F, Zidek JV (1995) A bootstrap based on the estimating equations of the linear model. Biometrika 82(2):263–275
Iverson HK, Randles RH (1989) The effects on convergence of substituting parameter estimates into U-statistics and other families of statistics. Probab Theory Relat Fields 81(3):453–471
Kabaila P (2009) The coverage properties of confidence regions after model selection. Int Stat Rev 77(3):405–414
Kabaila P, Welsh AH, Abeysekera W (2016) Model-averaged confidence intervals. Scand J Stat 43(1):35–48
Lee SMS, Wu Y (2018) A bootstrap recipe for post-model-selection inference under linear regression models. Biometrika 105(4):873–890
Leeb H, Pötscher BM (2008) Can one estimate the unconditional distribution of post-model-selection estimators? Economet Theor 24(2):338–376
Lehmann EL, Romano JP (2022) Bootstrap and subsampling methods. Springer, Cham, pp 863–918
Lu W, Goldberg Y, Fine JP (2012) On the robustness of the adaptive lasso to model misspecification. Biometrika 99(3):717–731
Pötscher BM (2009) Confidence sets based on sparse estimators are necessarily large. Sankhyā Indian J Stat Ser A 71(1):1–18
Rao RR (1962) Relations between weak and uniform convergence of measures with applications. Ann Math Stat 33(2):659–680
Rossouw JE, Du Plessis JP, Benadé AJ et al (1983) Coronary risk factor screening in three rural communities. The CORIS baseline study. S Afr Med J Suid-Afrikaanse Tydskrif Vir Geneeskunde 64(12):430–436
Sin CY, White H (1996) Information criteria for selecting possibly misspecified parametric models. J Econom 71(1):207–225
Taylor J, Tibshirani R (2018) Post-selection inference for l1-penalized likelihood models. Can J Stat 46(1):41–61
Tian X, Taylor J (2018) Selective inference with a randomized response. Ann Stat 46(2):679–710
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodol) 58(1):267–288
Tibshirani RJ, Rinaldo A, Tibshirani R et al (2018) Uniform asymptotic inference and the bootstrap after model selection. Ann Stat 46(3):1255–1287
Wang H, Zhou SZF (2013) Interval estimation by frequentist model averaging. Commun Stat Theory Methods 42(23):4342–4356
White H (1994) Estimation, inference and specification analysis. Cambridge University Press, Cambridge
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429
Acknowledgements
Support from the Research Foundation Flanders and the KU Leuven Research Fund Project C16/20/002 is acknowledged. The resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation - Flanders (FWO) and the Flemish Government.
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Appendix. Proofs
1.1 Proof of Proposition 1
Proof
Let \(\Delta \Psi _{M,n} = \Psi _{M,n} - \Psi _{M_\textrm{full},n} = c \{\ell _{n}(\varvec{Y}_n;\hat{\theta }_{n}) - \ell _{n}(\varvec{Y}_n;\hat{\theta }_{M,n})\} - (\kappa _{M_\textrm{full},n}-\kappa _{M,n})\). By (A1)–(A3)(i,iii), \(\hat{\theta }_{M,n}-\tilde{\theta }_{M,n}={o}_{a.s.}(1)\). By (A3)(ii,iv) and Proposition 4.2(a) of Sin and White (1996), if \(\kappa _{M,n}= {o}_p(n)\), then \(\Delta \Psi _{M,n} = n \Delta _{M,n} + {o}_p(n)\). If, additionally, \(\liminf _{n \rightarrow \infty } \Delta _{M,n} > 0 \), then, also by Proposition 4.2(a) of Sin and White (1996), \(\lim _{n \rightarrow \infty } P(\Delta \Psi _{M,n}>0)=1\) for \(M \in \mathcal {I}\). On the other hand, for all \(M' \notin \mathcal {I}\), \(\Delta \Psi _{M',n} \rightarrow 0\) as \(n \rightarrow \infty \) at different rates. Therefore, \(\lim _{n \rightarrow \infty } P(\Delta \Psi _{M,n}>\Delta \Psi _{M',n})=1\) for all \(M \in \mathcal {I}\) and \(M' \notin \mathcal {I}\), which implies \(\lim _{n \rightarrow \infty } P(\Psi _{M,n}>\Psi _{M',n})=1\). Given conditions (A4)(a)–(c) on the weight functions, \(\lim _{n \rightarrow \infty } P(W_{M,n}(\Psi _{M'',n}, M''\in \mathcal {M}) < W_{M',n}(\Psi _{M'',n}, M''\in \mathcal {M}))=1\) for all \(M \in \mathcal {I} \) and at least one \(M'\notin \mathcal {I}\). By condition (d), as \(\Psi _{M',n} \rightarrow _p \infty \) and given that \(\lim _{n \rightarrow \infty } P(\Psi _{M,n}>\Psi _{M',n})=1\), \(\widehat{W}_{M',n} \rightarrow _p 1\) for at least one \(M'\notin \mathcal {I}\). Finally, by condition (e), \(\widehat{W}_{M,n} ={o}_p(1)\) as \(n \rightarrow \infty \) for \(M \in \mathcal {I}\). This completes the proof.
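The separation argument above can be checked numerically. The sketch below uses a toy Gaussian location problem with an AIC-type penalty (\(c=2\)); this setup is an assumption for illustration and is not taken from the paper. It shows the criterion difference \(\Delta \Psi _{M,n}\) growing at rate \(n\) for a misspecified candidate model, which is what drives \(\lim _{n \rightarrow \infty } P(\Delta \Psi _{M,n}>0)=1\):

```python
import numpy as np

# Toy setup (assumed): data ~ N(1, 1); the "bad" candidate model fixes
# the mean at 0, while the full model estimates it.  With c = 2,
# Delta Psi = 2 * (ll_full - ll_bad) - (kappa_full - kappa_bad),
# which here equals roughly n * ybar^2 - 2 and so grows linearly in n.
rng = np.random.default_rng(1)

def delta_psi(n):
    y = rng.normal(loc=1.0, scale=1.0, size=n)
    ll_full = -0.5 * np.sum((y - y.mean()) ** 2)  # mean estimated
    ll_bad = -0.5 * np.sum(y ** 2)                # mean fixed at 0
    penalty_diff = 2 * (1 - 0)                    # one extra parameter
    return 2 * (ll_full - ll_bad) - penalty_diff

d_small, d_large = delta_psi(200), delta_psi(2000)
# d_large / d_small is roughly 10, reflecting the O(n) growth rate
```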
1.2 Proof of Theorem 1
Proof
For a \((\mathcal {S},\tau _n)\) convergent sequence of pseudo-true parameters \((\tilde{\theta }_n)\), we write
By Definition 1, for \(M \in \mathcal {S}\) it holds that \(|\tilde{\theta }_{n} - \tilde{\theta }_{M,n}|= {o}_p(n^{-1/2})\). Further, for \(M \in \mathcal {S}\cup \mathcal {I}\) it holds by Sin and White (1996, Prop 4.1(a)) that \(\hat{\theta }_{M,n} - \tilde{\theta }_{M,n} = O_p(n^{-1/2})\). This, combined with Proposition 1, which yields that \(\widehat{W}_{M,n}={o}_p(1)\) for \(M\in \mathcal {I}\), and with Assumption (A6), guarantees that the sum over \(M\in \mathcal {I}\) is \(o_p(1)\). Therefore, using the results obtained so far,
For each \(M \in \mathcal {S}\), under (A1)–(A3) the strong consistency of \(\hat{\theta }_{M,n}\) is guaranteed (see White 1994, Th 3.6). For the \(|M|\times |M|\) matrix \(\Sigma _{M,n}(\tilde{\theta }_{M,n})=[J_{M,n} (\tilde{\theta }_{M,n})]^{-1} K_{M,n}(\tilde{\theta }_{M,n}) [J_{M,n}(\tilde{\theta }_{M,n})]^{-1}\), since \(|\tilde{\theta }_{n} - \tilde{\theta }_{M,n}|={o}_p(n^{-1/2})\) for each \(M \in \mathcal {S}\), it holds by (A3)(vi) that \(E[\Sigma _{M,n}(\tilde{\theta }_{M,n}) - \Sigma _{M,n}(\Pi _M\tilde{\theta }_{n})] \rightarrow 0 \). Due to the convergence of \(\tilde{\theta }_n\) to \(\tilde{\theta }_\infty \), the limit of the covariance matrix is \(\Sigma _M= J_M(\Pi _M \tilde{\theta }_\infty )^{ -1} K_M(\Pi _M \tilde{\theta }_\infty )J_M(\Pi _M \tilde{\theta }_\infty )^{ -1}\). Then, as \(n\) tends to infinity, there is convergence in distribution
where \(Z\sim N_{p}(\varvec{0},I_{p})\). Moreover, using the form of the information criterion as in (5) and Taylor expansions of the log likelihood in model \(M\) around the full model’s log likelihood, it follows from assumption (A3) that for all \(M\in \mathcal {S}\) there is joint convergence of all \(n^{1/2}\Pi _M(\hat{\theta }_{M,n}-\tilde{\theta }_{M,n})\) and all weights \(\widehat{W}_{M,n}\) to corresponding limiting distributions. Consequently, for \(M\in \mathcal {S}\), \(\hat{\Lambda }_{M,n}\) can be expressed in the limit as a function of the same variable \(Z\), leading to the limit version of the weights \(\mathcal {W}_{M}(Z)\). Combining the above results, an application of the continuous mapping theorem leads to the result of Theorem 1.
1.3 Proof of Theorem 2
Proof
We use the notation \(E^*\), Var\(^*\) and \(d^*\) to represent the bootstrap expectation, variance and convergence in distribution, conditionally on \(\varvec{Y}_n\) and X. The \(|M'|\times |M'|\) matrices \(K^*_{M',n}(\theta )\) and \(J^*_{M',n}(\theta )\) are defined in the same way as \(K_{M',n}(\theta )\) and \(J_{M',n}(\theta )\) though using the bootstrap data \((Y_{i^*},x_{i^*}^\top )\) with \(i^*\in S^*\), the set of n integers taken randomly with replacement from \(\{1,\ldots ,n\}\). The sequence of pseudo-true parameters \((\tilde{\theta }_n)\) is \((\mathcal {S},\tau _n)\) convergent.
Following the proof of Theorem 9.1 of Claeskens (1999) for the one-step semiparametric bootstrap, by (A3)(v), the choice of some \(\delta >0\) and the strong consistency of \(\hat{\theta }_{M',n}\) for each \(M' \in \mathcal {S}\), it holds for any \(|M'|\)-dimensional vector \(\varvec{v}_{|M'|}\) with norm equal to 1 that
The resampling scheme ensures that \(E^*[K^*_{M',n}(\hat{\theta }_{M',n})] = K_{M',n}(\hat{\theta }_{M',n})\). Applying Theorem 2.9 of Iverson and Randles (1989) to each of the \(|M'|^2\) components of \(K_{M',n}(\hat{\theta }_{M',n})\), by assumptions (A3)(iv,vi), the strong consistency of \(\hat{\theta }_{M',n}\) and the fact that \(\hat{\theta }_{M',n}-\tilde{\theta }_{n} = {O}_p(n^{-1/2})\) (using Proposition 4.1 of Sin and White (1996) and the proof of Theorem 1), it follows that, conditional on \(\varvec{Y}_n\) and \(X\), \(E^*[K^*_{M',n}(\hat{\theta }_{M',n})]-K_{M',n}(\tilde{\theta }_{n})\) converges to zero as \(n \rightarrow \infty \). Using (12) with a \(|M'|\)-dimensional vector \(\varvec{v}_{|M'|}\), this implies Lyapunov’s condition,
By the Cramér-Wold theorem then,
Also by Theorem 2.9 of Iverson and Randles (1989), \(E^*[J_{M',n}^*(\hat{\theta }_{M',n})] -J_{M',n}(\tilde{\theta }_n) \rightarrow 0 \) and Var\(^* [J_{M',n}^*(\hat{\theta }_{M',n})]= {O}_p(n^{-1})\) such that \(J_{M',n}^*(\hat{\theta }_{M',n})-J_{M'}(\tilde{\theta }_\infty ) \rightarrow 0 \) in bootstrap probability.
Combining the above results, by Slutsky’s theorem, as n tends to infinity,
with \(Z \sim N_p( \varvec{0},I_p)\).
Since \(\overline{\mathcal {S}}\) is a subset of \(\overline{\mathcal {M}}\) that is closed under intersections, we define \(M_{\min }\) as the most parsimonious model in \(\overline{\mathcal {S}}\), that is, \(|M_{\min }|< |M'|\) for all \(M' \in \overline{\mathcal {S}}\setminus M_{\min }\). Because of the closedness under intersections, \(M_{\min }\) is a submodel of each \(M' \in \overline{\mathcal {S}}{\setminus } M_{\min }\). Then, \(\hat{\theta }^{(M_{\min })}_{M',n} = \hat{\theta }_{M_{\min },n}\) and by Sin and White (1996, Prop. 4.1), \(\hat{\theta }_{M_{\min },n}= \tilde{\theta }_{M_{\min },n} + {O}_p(n^{-1/2})\).
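As a side illustration of the structural point above (with a toy model class assumed for this sketch only), closedness under intersections indeed forces the most parsimonious model to be a submodel of every other model in the class:

```python
# Toy illustration (assumed model class, not from the paper): models are
# encoded as index sets; the class below is closed under intersections.
models = [frozenset({1, 2}), frozenset({1, 3}), frozenset({1}),
          frozenset({1, 2, 3})]

# Closedness: every pairwise intersection stays within the class.
closed = all(a & b in models for a in models for b in models)

# The most parsimonious model, analogous to M_min in the proof.
M_min = min(models, key=len)

# Closedness under intersections makes M_min a submodel of every model.
is_submodel_of_all = all(M_min <= M for M in models)
```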
Now, we study the bootstrap information criteria and the weights. The semiparametric sampling scheme is such that for any \(j^*\in S^*\), \(E^*[ \dot{\ell }_{M_\textrm{full}}(Y_{j^*};\hat{\theta }_{M',n})] = n^{-1}\dot{\ell }_{M_\textrm{full}}(\varvec{Y}_{n};\hat{\theta }_{M',n})\) and \(E^*[ \ddot{\ell }_{M_\textrm{full}}(Y_{j^*};\hat{\theta }_{M',n})] = n^{-1}\ddot{\ell }_{M_\textrm{full}}(\varvec{Y}_{n};\hat{\theta }_{M',n})\). Therefore, the resulting difference tends to zero as \(n \rightarrow \infty \). Under (A3)(ii,iv) and (A4) for the weight functions, the bootstrap weight converges in bootstrap distribution to the random variable \(\mathcal {W}_{M'}(Z)\) as \(n \rightarrow \infty \). An application of Polya’s theorem (e.g. Rao 1962, Lemma 3.2) yields that for \(M_{\min }\)
Finally, by Proposition 4.2 (b) of Sin and White (1996), if \(\widetilde{\Psi }_{M,n}\) is an information criterion satisfying condition (A5) and \(\hat{\iota }_{M,n}\) is a weight that also satisfies condition (A4), then \(\hat{\iota }_{M,n}\) consistently selects and assigns higher weight to \(M_{\min }\) such that \(\hat{\iota }_{M',n}=o_p(1)\) for \(M' \in \overline{\mathcal {S}}{\setminus } M_{\min }\). Theorem 2 follows by combining the above results.
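The resampling identity \(E^*[K^*_{M',n}(\hat{\theta }_{M',n})] = K_{M',n}(\hat{\theta }_{M',n})\) used in the proof of Theorem 2 can be verified numerically. The following hedged sketch uses toy scores from a Gaussian mean model (an assumed setup for illustration only) and checks that the Monte Carlo bootstrap expectation of the average squared score matches its sample counterpart:

```python
import numpy as np

# Toy scores: for the N(theta, 1) model the score of observation i at
# theta_hat is (y_i - theta_hat).  The empirical K is then the average
# squared score (a 1x1 matrix here).
rng = np.random.default_rng(7)
n = 300
y = rng.normal(loc=0.3, size=n)
theta_hat = y.mean()
scores = y - theta_hat
K_hat = np.mean(scores ** 2)

# Under uniform resampling with replacement, E*[K*] equals K_hat exactly;
# we approximate E* by averaging over B bootstrap resamples.
B = 4000
K_star = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    K_star[b] = np.mean(scores[idx] ** 2)
# np.mean(K_star) should be close to K_hat, up to Monte Carlo error
```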
Cite this article
Garcia-Angulo, A.C., Claeskens, G. Bootstrap for inference after model selection and model averaging for likelihood models. Metrika (2024). https://doi.org/10.1007/s00184-024-00956-2