
Model Selection for High-Dimensional Problems

Handbook of Financial Econometrics and Statistics

Abstract

High-dimensional data analysis is becoming increasingly important to both academics and practitioners in finance and economics, but it is also very challenging because the number of variables or parameters associated with such data can be larger than the sample size. Recently, several variable selection approaches have been developed to help select significant variables and construct a parsimonious model simultaneously. In this chapter, we first provide an overview of model selection approaches in the context of penalized least squares. We then review independence screening, a recently developed method for analyzing ultrahigh-dimensional data where the number of variables or parameters can be exponentially larger than the sample size. Finally, we discuss and advocate multistage procedures that combine independence screening and variable selection and that may be especially suitable for analyzing high-frequency financial data.

Penalized least squares seek to keep important predictors in a model while penalizing coefficients associated with irrelevant predictors. As such, under certain conditions, penalized least squares can lead to a sparse solution for linear models and achieve asymptotic consistency in separating relevant variables from irrelevant ones. Independence screening selects relevant variables based on certain measures of marginal correlations between candidate variables and the response.
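As a minimal illustration of this idea (not code from the chapter), the following R sketch ranks predictors by their absolute marginal correlation with the response and keeps the top d of them; the function name screen_by_correlation and the cutoff d are assumptions for illustration.

screen_by_correlation <- function(x, y, d) {
  w <- abs(cor(x, y))                      # marginal correlation of each column of x with y
  order(w, decreasing = TRUE)[seq_len(d)]  # indices of the d top-ranked predictors
}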


References

  • Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csaki (Eds.), Second international symposium on information theory (Vol. 1, pp. 267–281). Budapest: Akademiai Kiado.
  • Allen, D. (1974). The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16, 125–127.
  • Anderson, H. M., & Vahid, F. (2007). Forecasting the volatility of Australian stock returns. Journal of Business and Economic Statistics, 25, 76–90.
  • Antoniadis, A., & Fan, J. (2001). Regularization of wavelet approximations (with discussion). Journal of the American Statistical Association, 96, 939–967.
  • Bai, J., & Ng, S. (2008). Forecasting economic time series using targeted predictors. Journal of Econometrics, 146, 304–317.
  • Bellman, R. (1961). Adaptive control processes: A guided tour. Princeton: Princeton University Press.
  • Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Annals of Statistics, 24, 2350–2383.
  • Campbell, J. Y., Lo, A. W., & MacKinlay, A. C. (1997). The econometrics of financial markets. Princeton: Princeton University Press.
  • Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n (with discussion). Annals of Statistics, 35, 2313–2404.
  • Chen, X., Zou, C., & Cook, R. D. (2010). Coordinate-independent sparse sufficient dimension reduction and variable selection. Annals of Statistics, 38, 3696–3723.
  • Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
  • Cook, R. D. (1998). Regression graphics: Ideas for studying regressions through graphics. New York: John Wiley & Sons.
  • Cook, R. D., & Weisberg, S. (1991). Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86, 328–332.
  • Craven, P., & Wahba, G. (1979). Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31, 377–403.
  • Demartines, P., & Hérault, J. (1997). Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets. IEEE Transactions on Neural Networks, 8, 148–154.
  • Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression (with discussion). Annals of Statistics, 32, 409–499.
  • Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.
  • Fan, J., & Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. Proceedings of the International Congress of Mathematicians, European Mathematical Society, Zurich, III, 595–622.
  • Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70, 849–911.
  • Fan, J., & Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Annals of Statistics, 38, 3567–3604.
  • Fan, J., Samworth, R., & Wu, Y. (2009). Ultrahigh dimensional feature selection: Beyond the linear model. Journal of Machine Learning Research, 10, 1829–1853.
  • Fan, J., Feng, Y., Samworth, R., & Wu, Y. (2010). SIS: Sure Independence Screening. R package version 0.6. http://CRAN.R-project.org/package=SIS
  • Fan, J., Feng, Y., & Song, R. (2011a). Nonparametric independence screening in sparse ultra-high dimensional additive models. Journal of the American Statistical Association, 106, 544–557.
  • Fan, J., Lv, J., & Qi, L. (2011b). Sparse high-dimensional models in economics. Annual Review of Economics, 3, 291–317.
  • Frank, I. E., & Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35, 109–148.
  • Friedman, J. (2008). Fast sparse regression and classification. International Journal of Forecasting, 28, 722–738.
  • Goto, S., & Xu, Y. (2010). On mean variance portfolio optimization: Improving performance through better use of hedging relations (Working paper). University of Rhode Island.
  • Hall, P., & Miller, H. (2009). Using generalized correlation to effect variable selection in very high dimensional problems. Journal of Computational and Graphical Statistics, 18, 533–550.
  • Hastie, T. (1994). Principal curves and surfaces. DTIC Document.
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York: Springer.
  • Hoerl, A., & Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
  • Huang, J.-Z., & Shi, Z. (2010). Determinants of bond risk premia (American Finance Association 2011 Denver meetings paper). Penn State University. http://ssrn.com/abstract=1573186
  • Ji, P., & Jin, J. (2012). UPS delivers optimal phase diagram in high dimensional variable selection. Annals of Statistics, 40, 73–103.
  • Li, K. C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 285–299.
  • Li, L. (2007). Sparse sufficient dimension reduction. Biometrika, 94, 603–613.
  • Li, R., Zhong, W., & Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107, 1129–1139.
  • Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15, 661–675.
  • Mardia, K. V., Kent, J. T., & Bibby, J. M. (1980). Multivariate analysis. London: Academic Press.
  • Miller, A. J. (2002). Subset selection in regression (2nd ed.). New York: Chapman & Hall/CRC.
  • Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
  • Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica, 7, 221–264.
  • Stock, J. H., & Watson, M. W. (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97, 1167–1179.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
  • Wang, H., Li, R., & Tsai, C. L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94, 553–568.
  • Zhang, C. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38, 894–942.
  • Zhou, J., & He, X. (2008). Dimension reduction based on constrained canonical correlation and variable filtering. Annals of Statistics, 36, 1649–1668.
  • Zhu, L., Li, L., Li, R., & Zhu, L. (2011). Model-free feature screening for ultrahigh dimensional data. Journal of the American Statistical Association, 106, 1464–1475.
  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.
  • Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301–320.
  • Zou, H., & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36, 1509–1533.
  • Zou, H., & Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics, 37, 1733–1751.


Author information

Correspondence to Wei Zhong.

Appendix 1: One-Step Sparse Estimates

This appendix outlines the local linear approximation (LLA) algorithm, discussed in Sect. 77.2.3, for solving penalized least squares with a broad class of penalty functions. For illustration, we focus on the nondifferentiable, nonconvex L0.5 and SCAD penalties introduced in Sect. 77.2.2, as both are ideally suited to the LLA algorithm (Zou and Li 2008). Recall that these two penalty functions are given by Eqs. (77.18) and (77.23), respectively. Both penalties are concave on (0, ∞) and singular at the origin, so the resulting penalized least squares objectives are nonconvex. We show below that, in linear regression, this optimization problem can be reduced to solving penalized least squares with an L1 penalty.

We begin with the following one-step LLA estimator:

$$ \widehat{\beta}=\arg\min_{\beta}\left\{\frac{1}{2}\left\Vert \mathbf{y}-\mathbf{X}\beta \right\Vert^{2}+n\sum_{j=1}^{p} p_{\lambda}'\left(\left|\tilde{\beta}_j\right|\right)\left|\beta_j\right|\right\}, $$
(77.40)

where the initial estimate \( \tilde{\beta} \) is usually taken to be the ordinary least squares estimator when n > p. Theoretically, \( \tilde{\beta} \) can be any root-n-consistent estimator of β. When n < p, the plain-vanilla L1 estimator (LASSO) is employed instead.
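The following R lines sketch one way to obtain such an initial estimate; the helper name get_initial is an assumption for illustration, and the LASSO branch uses the "lars" package discussed below.

library(lars)

get_initial <- function(x, y) {
  if (nrow(x) > ncol(x)) {
    as.numeric(coef(lm(y ~ x - 1)))            # ordinary least squares (no intercept)
  } else {
    cv <- cv.lars(x, y, type = "lasso", plot.it = FALSE)
    s  <- cv$index[which.min(cv$cv)]           # fraction minimizing the cross-validated error
    as.numeric(coef(lars(x, y, type = "lasso"), s = s, mode = "fraction"))
  }
}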

Following Zou and Li (2008), we discuss separately the algorithms for the objective function Q(β) under two types of penalty functions, corresponding to the L0.5 and SCAD penalties, respectively. In both cases the underlying idea is the same: use a data transformation to reduce the penalized least squares problem to a standard L1-regularized problem, for which efficient algorithms are available. In particular, the LARS algorithm (Efron et al. 2004) has been widely applied to obtain the entire solution path of the LASSO and of forward stagewise regression. The corresponding R package can be downloaded from http://www.stanford.edu/~hastie/swData.htm#SvmP.

Consider first the first type of penalty functions.

Type 1. The tuning parameter can be disentangled from the penalty function; that is, \( p_{\lambda}(\theta)=\lambda p(\theta) \) with \( p'(\theta)>0 \) for θ > 0. Examples include the bridge penalties \( p_{\lambda}(|\theta|)=\lambda {|\theta|}^{q} \) for 0 < q < 1 and the logarithm penalty \( p_{\lambda}(|\theta|)=\lambda \log |\theta| \).

Algorithm 1

  • Step 1. Create the working data \( {\mathbf{x}}_j^{*}={\mathbf{x}}_j/p_{\lambda}'\left(\left|\tilde{\beta}_j\right|\right) \), for j = 1, …, p.

  • Step 2. Apply the LARS algorithm to solve the penalized L1 regression:

    $$ {\widehat{\beta}}^{*}=\arg\min_{\beta}\left\{\frac{1}{2}\left\Vert \mathbf{y}-{\mathbf{X}}^{*}\beta \right\Vert^{2}+n\lambda \sum_{j=1}^{p}\left|\beta_j\right|\right\}. $$
  • Step 3. Let the final one-step estimator be \( {\widehat{\beta}}_j={\widehat{\beta}}_j^{*}/p_{\lambda}'\left(\left|\tilde{\beta}_j\right|\right) \) for j = 1, …, p, as sketched below.
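A minimal R sketch of these three steps for the L0.5 penalty follows, assuming a predictor matrix x, a response y, and a nonzero initial estimate beta0 are available; the constant in \( p_{\lambda}' \) is absorbed into the tuning parameter, so the weights reduce to the adaptive-LASSO weights \( {|\tilde{\beta}_j|}^{-0.5} \) discussed in the next paragraph.

library(lars)

w     <- abs(beta0)^(-0.5)                           # weights proportional to p'(|beta0_j|) for the L0.5 penalty
xstar <- sweep(x, 2, w, "/")                         # Step 1: working data x*_j = x_j / w_j
fit   <- lars(xstar, y, type = "lasso")              # Step 2: L1 path on the working data
beta.star <- coef(fit, s = 0.5, mode = "fraction")   # coefficients at one illustrative point on the path
beta.hat  <- beta.star / w                           # Step 3: transform back to the original scale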

As this shows, through the one-step sparse estimator, the LLA for the L0.5 penalty is equivalent to the adaptive LASSO with \( {\widehat{w}}_j={\left|{\tilde{\beta}}_j\right|}^{-0.5} \) (see Eq. 4 in Zou 2006). When computing the LARS estimates, the tuning parameter λ can be chosen using the cv.lars routine in the "lars" package. For example, the command

cv.lars(xstar, y, K = 5, plot.it = TRUE, se = TRUE, type = "lasso")

instructs R to plot the fivefold cross-validated mean squared prediction error (MSE) for different values of λ. We can then pick the λ with the smallest MSE and identify the LARS step corresponding to that value of λ. The main function of the package,

lars(xstar, y, type = "lasso")

provides the entire sequence of coefficients and fits, starting from zero to the least squares fit.
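To make the tuning step concrete, the lines below sketch one way to pick the value with the smallest cross-validated MSE from the cv.lars output and to read off the corresponding coefficients from the LARS path; the objects xstar and y are those constructed above.

cvfit <- cv.lars(xstar, y, K = 5, plot.it = TRUE, se = TRUE, type = "lasso")
s.opt <- cvfit$index[which.min(cvfit$cv)]             # fraction with the smallest cross-validated MSE
fit   <- lars(xstar, y, type = "lasso")               # full LASSO path on the working data
beta.star <- coef(fit, s = s.opt, mode = "fraction")  # coefficients at the selected tuning value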

Next, consider the second type of penalty functions.

Type 2. The derivative \( p_{\lambda}'(\theta) \) equals zero for some values of θ, and the regularization parameter λ cannot be separated from \( p_{\lambda}(\theta) \). An example is the SCAD penalty, whose first derivative is

$$ p_{\lambda}'\left(\theta \right)=\lambda \left\{ I\left(\theta \le \lambda \right)+\frac{{\left( a\lambda -\theta \right)}_{+}}{\left(a-1\right)\lambda } I\left(\theta >\lambda \right)\right\}, $$
(77.41)

where θ > 0, \( p_{\lambda}(0)=0 \), and a > 2.
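As a small illustration, the SCAD derivative in Eq. (77.41) can be coded directly; a = 3.7 is the value commonly recommended by Fan and Li (2001), and the function name scad_deriv is an assumption for illustration.

scad_deriv <- function(theta, lambda, a = 3.7) {
  # Eq. (77.41): equal to lambda for theta <= lambda, decaying linearly on (lambda, a*lambda], zero beyond
  lambda * ((theta <= lambda) + pmax(a * lambda - theta, 0) / ((a - 1) * lambda) * (theta > lambda))
}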

We define \( U=\left\{j: p_{\lambda}'\left(\left|\tilde{\beta}_j\right|\right)=0\right\} \) and \( V=\left\{j: p_{\lambda}'\left(\left|\tilde{\beta}_j\right|\right)>0\right\} \). Accordingly, we write \( \mathbf{X}=\left[{\mathbf{X}}_U,{\mathbf{X}}_V\right] \) and \( \widehat{\beta}={\left({\widehat{\beta}}_U^{\mathrm{T}},{\widehat{\beta}}_V^{\mathrm{T}}\right)}^{\mathrm{T}} \).

Algorithm 2

  • Step 1. Create the working data \( {\mathbf{x}}_j^{*}={\mathbf{x}}_j\lambda /p_{\lambda}'\left(\left|\tilde{\beta}_j\right|\right) \) for j ∈ V, and set \( {\mathbf{x}}_j^{*}={\mathbf{x}}_j \) for j ∈ U. Let H_U be the projection matrix onto the space spanned by \( \left\{{\mathbf{x}}_j^{*}, j\in U\right\} \). Compute \( \tilde{\mathbf{y}}=\left(I-{H}_U\right)\mathbf{y} \) and \( {\tilde{\mathbf{X}}}_V^{*}=\left(I-{H}_U\right){\mathbf{X}}_V^{*} \).

  • Step 2. Apply the LARS algorithm to solve the penalized L1 regression:

    $$ {\widehat{\beta}}_V^{*}=\arg\min_{\beta}\left\{\frac{1}{2}\left\Vert \tilde{\mathbf{y}}-{\tilde{\mathbf{X}}}_V^{*}\beta \right\Vert^{2}+n\lambda \sum_{j\in V}\left|\beta_j\right|\right\}. $$
  • Step 3. Compute \( {\widehat{\beta}}_U^{*}={\left({\mathbf{X}}_U^{*\mathrm{T}}{\mathbf{X}}_U^{*}\right)}^{-1}{\mathbf{X}}_U^{*\mathrm{T}}\left(\mathbf{y}-{\mathbf{X}}_V^{*}{\widehat{\beta}}_V^{*}\right) \). Then, as sketched below, the final one-step estimator \( \widehat{\beta} \) is obtained by

$$ {\widehat{\beta}}_U={\widehat{\beta}}_U^{*} \quad \text{and} \quad {\widehat{\beta}}_j={\widehat{\beta}}_j^{*}\lambda /p_{\lambda}'\left(\left|\tilde{\beta}_j\right|\right) \quad \text{for } j\in V. $$
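A minimal R sketch of Algorithm 2 follows, assuming x, y, an initial estimate beta0, a fixed value lambda, the scad_deriv() function sketched above, and nonempty index sets U and V; the value s = 0.5 in the LARS step is only an illustrative point on the path.

library(lars)

w <- scad_deriv(abs(beta0), lambda)                           # p'_lambda(|beta0_j|)
U <- which(w == 0); V <- which(w > 0)                         # unpenalized and penalized index sets
xU     <- x[, U, drop = FALSE]
xVstar <- sweep(x[, V, drop = FALSE], 2, w[V] / lambda, "/")  # Step 1: x*_j = x_j * lambda / p'_lambda(|beta0_j|)
proj_out <- function(z) z - xU %*% solve(crossprod(xU), crossprod(xU, z))  # (I - H_U) z
ytil  <- as.numeric(proj_out(y))
xVtil <- proj_out(xVstar)
fit        <- lars(xVtil, ytil, type = "lasso")               # Step 2: L1 regression on the residualized data
betaV.star <- coef(fit, s = 0.5, mode = "fraction")
betaU.star <- solve(crossprod(xU), crossprod(xU, y - xVstar %*% betaV.star))  # Step 3
beta.hat    <- numeric(ncol(x))
beta.hat[U] <- as.numeric(betaU.star)
beta.hat[V] <- betaV.star * lambda / w[V]                     # transform the penalized block back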

We note that the cross-validation routine above is not directly applicable here, because the sets U and V change as λ varies. In other words, different values of the tuning parameter lead to different transformations of the data, so the SCAD-type penalty requires redoing the transformation and solving the one-step estimator for each fixed λ. Alternatively, we can determine λ with a criterion-based approach such as the BIC tuning parameter selector. Indeed, Wang et al. (2007) show that the commonly used generalized cross-validation tends to overshoot the correct number of nonzero coefficients, whereas BIC identifies the true model consistently. For practitioners, it is more convenient to directly call the procedure getfinalSCADcoef included in the "SIS" package (Fan et al. 2010); the option tune.method = c("AIC", "BIC") specifies the selection criterion.
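The lines below sketch a BIC-type score in the spirit of Wang et al. (2007); the grid of λ values and the helper name one_step_fit() (returning the one-step estimate for a given λ) are assumptions for illustration.

bic_score <- function(y, x, beta_hat) {
  n   <- length(y)
  rss <- sum((y - x %*% beta_hat)^2)
  log(rss / n) + sum(beta_hat != 0) * log(n) / n   # log residual variance plus a BIC penalty on model size
}

# lambda.grid <- seq(0.05, 1, by = 0.05)
# scores     <- sapply(lambda.grid, function(l) bic_score(y, x, one_step_fit(x, y, l)))  # one_step_fit() is hypothetical
# lambda.bic <- lambda.grid[which.min(scores)]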

Finally, we use a simple numerical example to illustrate the performance of the one-step sparse estimates. The simulated data are generated by executing the following commands in R 2.15.0:

set.seed(0)
b <- c(4, 4, 4, -6 * sqrt(2))
n <- 150
p <- 300
x <- matrix(rnorm(n * p, mean = 0, sd = 1), n, p)
y <- x[, 1:4] %*% b + rnorm(150)

The one-step sparse estimates with the L0.5 and SCAD penalties are summarized as follows. Both methods select all four significant variables, while the L0.5 penalty also falsely selects one noise variable. The estimated nonzero coefficients under the two methods are:

$$ \begin{array}{l}{b}_{\mathrm{SCAD}}=\left[3.986,\ 3.937,\ 3.944,\ -8.369\right],\\ {}{b}_{L_{0.5}}=\left[3.942,\ 3.964,\ 3.997,\ -8.388,\ 0.237\right].\end{array} $$

Note that this single example does not imply that SCAD outperforms the L0.5 penalty, because a valid Monte Carlo comparison usually requires at least 1,000 simulated data sets.


Copyright information

© 2015 Springer Science+Business Media New York

About this entry

Cite this entry

Huang, JZ., Shi, Z., Zhong, W. (2015). Model Selection for High-Dimensional Problems. In: Lee, CF., Lee, J. (eds) Handbook of Financial Econometrics and Statistics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-7750-1_77


  • DOI: https://doi.org/10.1007/978-1-4614-7750-1_77


  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-7749-5

  • Online ISBN: 978-1-4614-7750-1

