
On Hodges’ superefficiency and merits of oracle property in model selection

Annals of the Institute of Statistical Mathematics

Abstract

The oracle property of model selection procedures has attracted a large volume of favorable publications, but it has also been criticized as ineffective and misleading in applications. Those criticisms, however, appear to have been largely ignored by much of the popular statistical literature, despite their serious implications. In this paper, we present a new type of Hodges' estimator that easily yields model selection procedures with the oracle and other desired properties, yet can readily be seen to perform poorly on parts of the parameter space that are fixed and independent of the sample size. Consequently, the merits of the oracle property for model selection, as extensively advocated in the literature, are questionable and possibly overstated. In particular, because only elementary mathematics is needed to establish it, this finding leads to new insights into the merits of the oracle property and exposes some crucial but overlooked facts about model selection procedures.



Author information


Corresponding author

Correspondence to Xianyi Wu.

Additional information

This work was partially supported by NSFC (Grant No. 71771089) and the 111 Project (Grant No. B14019).

A Appendix: Proofs of the theorems

A.1 Proof of Theorem 1

Proof

Note first that \(\hat{\theta }_{n}\overset{p}{\rightarrow }\theta \). For any \(\theta \not =c\), the condition \(a_{n}=o(1)\) implies that, for any \(\varepsilon >0\),

$$\begin{aligned} \mathrm{Pr}_{\theta }(r_{n}\Vert \breve{\theta }_{n}(c)-\hat{\theta }_{n}\Vert >\varepsilon )&\le \mathrm{Pr}_{\theta }(\Vert \hat{\theta }_{n}-c\Vert \le a_{n})\le \mathrm{Pr}_{\theta }(\Vert \theta -c\Vert -\Vert \hat{\theta }_{n}-\theta \Vert \le a_{n})\\&=\mathrm{Pr}_{\theta }(\Vert \hat{\theta }_{n}-\theta \Vert \ge \Vert \theta -c\Vert -a_{n})\rightarrow 0\quad \hbox {as }n\rightarrow \infty . \end{aligned}$$

Thus, \(r_{n}(\breve{\theta }_{n}(c)-\theta )=r_{n}(\breve{\theta }_{n}(c)-\hat{\theta }_{n})+r_{n}(\hat{\theta }_{n}-\theta )\overset{d}{\rightarrow }Z\). For \(\theta =c\), thanks to \(r_{n}a_{n}\rightarrow \infty \),

$$\begin{aligned} \mathrm{Pr}_{c}(r_{n}\Vert \breve{\theta }_{n}(c)-c\Vert>\varepsilon )\le \mathrm{Pr}_{c} (\breve{\theta }_{n}(c)\not =c) =\mathrm{Pr}_{c}(r_{n}\Vert \hat{\theta }_{n}-c\Vert >r_{n}a_{n}) \rightarrow 0. \end{aligned}$$
(19)

This shows \(r_{n}(\breve{\theta }_{n}(c)-c)\overset{p}{\rightarrow }0\). \(\square \)
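To make the two regimes in this proof concrete, the following is a minimal simulation sketch, assuming a univariate normal mean model with \(\hat{\theta }_{n}\) the sample mean, \(r_n=\sqrt{n}\), \(a_n=n^{-1/4}\) and \(c=0\); the function names and all numerical settings here are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def hodges(theta_hat, c, a_n):
    """Hodges-type estimator: keep theta_hat unless it lies within a_n of c."""
    return np.where(np.abs(theta_hat - c) <= a_n, c, theta_hat)

def scaled_error(theta, n, c=0.0, reps=20000):
    # Sample-mean model: theta_hat ~ N(theta, 1/n), so r_n = sqrt(n) and Z ~ N(0, 1).
    # a_n = n**(-1/4) satisfies a_n -> 0 and r_n * a_n = n**(1/4) -> infinity.
    r_n, a_n = np.sqrt(n), n ** (-0.25)
    theta_hat = theta + rng.standard_normal(reps) / r_n
    return r_n * (hodges(theta_hat, c, a_n) - theta)

for theta in (0.0, 0.5):            # theta = c = 0 (superefficient point) vs. a fixed theta != c
    for n in (100, 10_000, 1_000_000):
        sd = scaled_error(theta, n).std()
        print(f"theta={theta:.1f}  n={n:>9}  sd of r_n*(breve - theta) = {sd:.3f}")
# As n grows, the scaled error collapses to 0 at theta = 0 (the case theta = c in the
# proof) and its standard deviation approaches 1, the sd of Z, at theta = 0.5.
```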

A.2 Proof of Theorem 2

Proof

Because only the asymptotic distribution of \({\check{\theta }}_{n, b}\) is of concern here, we may, without loss of generality, treat the simple case where V is known and \({\check{\theta }}_{n, b}\) is defined by

$$\begin{aligned} {\check{\theta }}_{n, b}= {\hat{\theta }}_{n,b}+{V}_{bb}^{-1}{V}_{b\bar{b}} ({\hat{\theta }}_{n,\bar{b}}-c_{{\bar{b}}}) \quad \hbox {and}\quad {\check{\theta }}_n(b)=({\check{\theta }}_{n,b}',c_{{\bar{b}}}')' \end{aligned}$$
(20)

with the convention \({\check{\theta }}_{n,\{1,2, \ldots ,d\}}={\check{\theta }}_{n}(\{1,2, \ldots ,d\}) ={\hat{\theta }}_{n}\).

The first assertion is obvious, so here we only prove (9). With the definition of \(b(\theta )\) in (4), it is clear that \(\theta _{b(\theta )}\) is the sub-vector \((\theta _j: \theta _j\ne c_j)\) of \(\theta \) and \(\theta _{\bar{b}(\theta )}=(c_j:j\in \bar{b}(\theta ))=c_{\bar{b}(\theta )}\). Note that, by (20), \({\check{\theta }}_{n,b(\theta )}={\hat{\theta }}_{n,b(\theta )}+{V}_{b(\theta )b(\theta )}^{-1}{V}_{b(\theta )\bar{b}(\theta )}({\hat{\theta }}_{n,\bar{b}(\theta )}-c_{\bar{b}(\theta )})\) and hence \({\check{\theta }}_n(b(\theta ))=({\check{\theta }}_{n,b(\theta )}',c_{{\bar{b}}(\theta )}')'\) are only pseudo-estimators that depend on the unknown parameter \(\theta \). However,

$$\begin{aligned} r_n({\check{\theta }}_{n,b(\theta )}-\theta _{b(\theta )})&=V_{b(\theta ),b(\theta )}^{-1}(V_{b(\theta )b(\theta )}\quad V_{b(\theta ),\bar{b}(\theta )})r_n\left( \begin{array}{c} {\hat{\theta }}_{n,{b}(\theta )}-\theta _{b(\theta )}\\ {\hat{\theta }}_{n,\bar{b}(\theta )}-\theta _{\bar{b}(\theta )}\end{array}\right) \nonumber \\&\overset{d}{\rightarrow } V_{{b}(\theta ),{b}(\theta )}^{-1}(V_{{b}(\theta ){b}(\theta )}\quad V_{{b}(\theta ),\bar{b}(\theta )})Z =\check{Z}_{b(\theta )}, \end{aligned}$$
(21)

where the components of Z have been rearranged according to the order in \(\theta \) and \(\check{Z}_{b(\theta )}\) is derived from (7) by replacing b with \(b(\theta )\).

For any \(\theta \), by comparing (6) and (20),

$$\begin{aligned} \mathrm{Pr}_\theta \left( r_n\Vert {\tilde{\theta }}_n(c)-{\check{\theta }}_{n} (b(\theta ))\Vert>\varepsilon \right)&=\mathrm{Pr}_\theta \left( r_n\Vert {\tilde{\theta }} _n(c)-({\check{\theta }}_{n,b(\theta )}', c_{{\bar{b}}(\theta )}')'\Vert >\varepsilon \right) \\&\le \mathrm{Pr}_\theta \left( {\tilde{\theta }}_n(c)\ne ({\check{\theta }}_ {n,b(\theta )}',c_{{\bar{b}}(\theta )}')'\right) \\&\le \mathrm{Pr}_\theta (b_n(c)\ne b(\theta )). \end{aligned}$$

Since \(\{b_n(c)\ne b(\theta )\}=\bigcup _{j\in b(\theta )}\{|{\hat{\theta }}_{nj}-c_j|\le a_{nj}\}\bigcup _{j\in {\bar{b}}(\theta )}\{|{\hat{\theta }}_{nj}-c_j|> a_{nj}\}\),

$$\begin{aligned} \mathrm{Pr}_\theta&\left( r_n\Vert {\tilde{\theta }}_n(c)-{\check{\theta }}_{n}(b(\theta )) \Vert>\varepsilon \right) \\&\le \mathrm{Pr}_\theta \left( \bigcup _{j\in b(\theta )}\{|{\hat{\theta }}_{nj}-c_j|\le a_{nj}\}\bigcup _{j\in {\bar{b}}(\theta )}\{|{\hat{\theta }}_{nj}-c_j|> a_{nj}\} \right) \\&\le \sum _{j\in b(\theta )}\mathrm{Pr}_\theta \{|{\hat{\theta }}_{nj}-c_j|\le a_{nj}\}+\sum _{j\in {\bar{b}}(\theta )} \mathrm{Pr}_\theta \{|{\hat{\theta }}_{nj}-c_j|> a_{nj}\} \\&\le \sum _{j\in b(\theta )}\mathrm{Pr}_\theta \{|{\hat{\theta }}_{nj}-\theta _j|\ge |\theta _j-c_j|-a_{nj}\} +\sum _{j\in {\bar{b}}(\theta )}\mathrm{Pr}_\theta \{r_n|{\hat{\theta }}_{nj}-c_j|> r_na_{nj}\}\\&\rightarrow 0 \quad \hbox {as }n\rightarrow \infty \end{aligned}$$

under the conditions on \(a_n\). Thus, \(r_n({\tilde{\theta }}_n(c)-{\check{\theta }}_{n}(b(\theta )))=o_p(1)\). Combining this with (21), we get

$$\begin{aligned} r_n({\tilde{\theta }}_n(c)-\theta )=r_n({\tilde{\theta }}_n(c) -{\check{\theta }}_{n}(b(\theta ))) +r_n({\check{\theta }}_{n}(b(\theta ))-\theta )\overset{d}{\rightarrow } \left( \begin{array}{c}\check{Z}_{b(\theta )}\\ 0\end{array}\right) \end{aligned}$$
(22)

under \(\mathrm{Pr}_\theta \). Next, examine the case \(b(\theta )=\emptyset \), that is, \(\theta =c\). Analogously to (19), under \(\mathrm{Pr}_c\) and the condition \(r_n\min \limits _{1\le j\le d} a_{nj}\rightarrow \infty \),

$$\begin{aligned}&\mathrm{Pr}_{c}(r_{n}\Vert {\tilde{\theta }}_{n}(c)-c\Vert>\varepsilon )\le \mathrm{Pr}_{c} ({\tilde{\theta }}_{n}(c)\not =c)\nonumber \\&\quad \le \sum _{j=1}^d\mathrm{Pr}_{c}(r_{n}|\hat{\theta }_{nj}-c_j|>r_{n}a_{nj}) \rightarrow 0 \quad \hbox {as }n\rightarrow \infty . \end{aligned}$$
(23)

Hence \(r_{n}\Vert {\tilde{\theta }}_{n}(c)-c\Vert \overset{p}{\rightarrow }0\). This completes the proof. \(\square \)
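For readers who want to see this construction in action, here is a minimal numerical sketch, assuming (as the comparison of (6) with (20) suggests) that \({\tilde{\theta }}_n(c)\) applies the correction in (20) with \(b=b_n(c)=\{j:|{\hat{\theta }}_{nj}-c_j|>a_{nj}\}\), that V is known, and that V may be taken as the covariance of the limit Z; the function name theta_tilde and all numerical settings are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def theta_tilde(theta_hat, V, c, a_n):
    """Component-wise selector b_n(c) plus the block-corrected estimator of (20).

    theta_hat : d-vector estimator of theta
    V         : d x d matrix appearing in the correction of (20), taken as known
    c         : d-vector of candidate values
    a_n       : d-vector of thresholds with a_n -> 0 and r_n * min(a_n) -> infinity
    """
    theta_hat, c, a_n = map(np.asarray, (theta_hat, c, a_n))
    b = np.flatnonzero(np.abs(theta_hat - c) > a_n)     # selected set b_n(c)
    bbar = np.setdiff1d(np.arange(theta_hat.size), b)   # its complement
    out = c.astype(float)                               # coordinates in bbar are fixed at c
    if b.size and bbar.size:
        # correction term V_{bb}^{-1} V_{b,bbar} (theta_hat_bbar - c_bbar), as in (20)
        corr = np.linalg.solve(V[np.ix_(b, b)],
                               V[np.ix_(b, bbar)] @ (theta_hat[bbar] - c[bbar]))
        out[b] = theta_hat[b] + corr
    elif b.size:                                        # b = {1,...,d}: the convention gives theta_hat
        out[b] = theta_hat[b]
    return out

# Illustrative run: two coordinates of theta differ from c = 0 and one equals it.
rng = np.random.default_rng(1)
n, theta = 10_000, np.array([1.0, -0.5, 0.0])
Sigma = np.array([[1.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
theta_hat = rng.multivariate_normal(theta, Sigma, size=n).mean(axis=0)  # r_n = sqrt(n)
print(theta_tilde(theta_hat, Sigma, c=np.zeros(3), a_n=np.full(3, n ** -0.25)))
# With high probability the third coordinate is set exactly to 0, while the first two
# remain close to theta_hat, in line with the limit in (22).
```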

Cite this article

Wu, X., Zhou, X. On Hodges’ superefficiency and merits of oracle property in model selection. Ann Inst Stat Math 71, 1093–1119 (2019). https://doi.org/10.1007/s10463-018-0670-0
