Abstract
High-dimensional data analysis is increasingly important to both academics and practitioners in finance and economics, but it is also challenging because the number of variables or parameters associated with such data can exceed the sample size. Several variable selection approaches have recently been developed to select significant variables and simultaneously construct a parsimonious model. In this chapter, we first provide an overview of model selection approaches in the context of penalized least squares. We then review independence screening, a recently developed method for analyzing ultrahigh-dimensional data, where the number of variables or parameters can be exponentially larger than the sample size. Finally, we discuss and advocate multistage procedures that combine independence screening with variable selection and that may be especially suitable for analyzing high-frequency financial data.
Penalized least squares seek to keep important predictors in a model while penalizing coefficients associated with irrelevant predictors. As such, under certain conditions, penalized least squares can lead to a sparse solution for linear models and achieve asymptotic consistency in separating relevant variables from irrelevant ones. Independence screening selects relevant variables based on certain measures of marginal correlations between candidate variables and the response.
References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csaki (Eds.), Second international symposium on information theory (Vol. 1, pp. 267–281). Budapest: Akademiai Kiado.
Allen, D. (1974). The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16, 125–127.
Anderson, H. M., & Vahid, F. (2007). Forecasting the volatility of Australian stock returns. Journal of Business and Economic Statistics, 25, 76–90.
Antoniadis, A., & Fan, J. (2001). Regularization of wavelet approximations (with discussion). Journal of the American Statistical Association, 96, 939–967.
Bai, J., & Ng, S. (2008). Forecasting economic time series using targeted predictors. Journal of Econometrics, 146, 304–317.
Bellman, R. (1961). Adaptive control processes: A guided tour. Princeton: Princeton University Press.
Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Annals of Statistics, 24, 2350–2383.
Campbell, J. Y., Lo, A. W., & MacKinlay, A. C. (1997). The econometrics of financial markets. Princeton: Princeton University Press.
Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n (with discussion). Annals of Statistics, 35, 2313–2404.
Chen, X., Zou, C., & Cook, R. D. (2010). Coordinate-independent sparse sufficient dimension reduction and variable selection. Annals of Statistics, 38, 3696–3723.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Cook, R. D. (1998). Regression graphics: Ideas for studying regressions through graphics. New York: John Wiley & Sons.
Cook, R. D., & Weisberg, S. (1991). Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86, 328–332.
Craven, P., & Wahba, G. (1979). Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31, 377–403.
Demartines, P., & Hérault, J. (1997). Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets. IEEE Transactions on Neural Networks, 8, 148–154.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression (with discussion). Annals of Statistics, 32, 409–499.
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.
Fan, J., & Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. Proceedings of the International Congress of Mathematicians, European Mathematical Society, Zurich, III, 595–622.
Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70, 849–911.
Fan, J., & Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics, 38, 3567–3604.
Fan, J., Samworth, R., & Wu, Y. (2009). Ultrahigh dimensional feature selection: Beyond the linear model. Journal of Machine Learning Research, 10, 1829–1853.
Fan, J., Feng, Y., Samworth, R., & Wu, Y. (2010). SIS: Sure Independence Screening. R package version 0.6. http://CRAN.R-project.org/package=SIS
Fan, J., Feng, Y., & Song, R. (2011a). Nonparametric independence screening in sparse ultra-high dimensional additive models. Journal of the American Statistical Association, 106, 544–557.
Fan, J., Lv, J., & Qi, L. (2011b). Sparse high-dimensional models in economics. Annual Review of Economics, 3, 291–317.
Frank, I. E., & Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35, 109–148.
Friedman, J. (2008). Fast sparse regression and classification. International Journal of Forecasting, 28, 722–738.
Goto, S., & Xu, Y. (2010). On mean variance portfolio optimization: Improving performance through better use of hedging relations (Working paper), University of Rhode Island.
Hall, P., & Miller, H. (2009). Using generalized correlation to effect variable selection in very high dimensional problems. Journal of Computational and Graphical Statistics, 18, 533–550.
Hastie, T. (1994). Principal curves and surfaces. DTIC Document.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York: Springer.
Hoerl, A., & Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
Huang, J.-Z., & Shi, Z. (2010). Determinants of bond risk premia. (American Finance Association 2011 Denver meetings paper). Penn State University. http://ssrn.com/abstract=1573186.
Ji, P., & Jin, J. (2012). UPS delivers optimal phase diagram in high dimensional variable selection. Annals of Statistics, 40, 73–103.
Li, K. C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 285–299.
Li, L. (2007). Sparse sufficient dimension reduction. Biometrika, 94, 603–613.
Li, R., Zhong, W., Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107, 1129–1139.
Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15, 661–675.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (1980). Multivariate analysis. London: Academic.
Miller, A. J. (2002). Subset selection in regression (2nd ed.). New York: Chapman & Hall/CRC.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica, 7, 221–264.
Stock, J. H., & Watson, M. W. (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97, 1167–1179.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Wang, H., Li, R., & Tsai, C. L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94, 553–568.
Zhang, C. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38, 894–942.
Zhou, J., & He, X. (2008). Dimension reduction based on constrained canonical correlation and variable filtering. Annals of Statistics, 36, 1649–1668.
Zhu, L., Li, L., Li, R., & Zhu, L. (2011). Model-free feature screening for ultrahigh dimensional data. Journal of the American Statistical Association, 106, 1464–1475.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301–320.
Zou, H., & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36, 1509–1533.
Zou, H., & Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics, 37, 1733–1751.
Appendix 1: One-Step Sparse Estimates
This appendix outlines the local linear approximation (LLA) algorithm, discussed in Sect. 77.2.3, for finding a solution of penalized least squares for a broad class of penalty functions. For illustration, we focus on the nondifferentiable, nonconvex L0.5 and SCAD penalties introduced in Sect. 77.2.2, as both are ideally suited to the LLA algorithm (Zou and Li 2008). Recall that these two penalty functions are given by Eqs. (77.18) and (77.23), respectively. Both penalties are concave on (0, ∞) and singular at the origin, so the resulting objective functions are nonconvex. We show below that in linear regressions this optimization problem can be reduced to solving penalized least squares with an L1 penalty.
We begin with the following one-step LLA estimator:
$$ \widehat{\beta}= \arg\underset{\beta }{ \min}\left\{\frac{1}{2}{\left\Vert \mathrm{y}-\mathbf{X}\beta \right\Vert}^2+ n{\displaystyle \sum_{j=1}^p p{\prime}_{\lambda}\left(\left|{\tilde{\beta}}_j\right|\right)\left|{\beta}_j\right|}\right\}, $$
where the initial estimate \( \tilde{\beta} \) is in practice usually taken to be the ordinary least squares estimator when n > p. In theory, \( \tilde{\beta} \) can be any root-n-consistent estimator of β. When n < p, the plain vanilla L1 estimator (LASSO) can be employed instead.
Following Zou and Li (2008), we discuss separately the algorithms for the objective function Q(β) under two types of penalty functions, corresponding to L0.5 and SCAD, respectively. The idea underlying both algorithms is the same: use a data transformation to reduce the penalized least squares to a standard L1 regularization problem, for which efficient algorithms are available. In particular, the LARS algorithm (Efron et al. 2004) has been widely applied to obtain the entire solution path of LASSO and forward stagewise regressions. The relevant R package can be downloaded from http://www.stanford.edu/~hastie/swData.htm#SvmP.
Consider first penalty functions of the first type.
Type 1. The tuning parameter can be disentangled from the penalty function; that is, p_λ(θ) satisfies p_λ(θ) = λp(θ) with p′(θ) > 0. Examples include the bridge penalties p_λ(|θ|) = λ|θ|^q for 0 < q < 1 and the logarithm penalty p_λ(|θ|) = λ log|θ|.
Algorithm 1

- Step 1. Create working data \( {\mathrm{x}}_j^{*}={\mathrm{x}}_j/p{\prime}_{\lambda}\left(\left|{\tilde{\beta}}_j\right|\right) \), for j = 1, …, p.
- Step 2. Apply the LARS algorithm to solve the penalized L1 regression:
$$ {\widehat{\beta}}^{*}= \arg\underset{\beta }{ \min}\left\{\frac{1}{2}{\left\Vert \mathrm{y}-{\mathbf{X}}^{*}\beta \right\Vert}^2+ n\lambda {\displaystyle \sum_{j=1}^p\left|{\beta}_j\right|}\right\}. $$
- Step 3. Let the final one-step estimator be \( {\widehat{\beta}}_j={\widehat{\beta}}_j^{*}/p{\prime}_{\lambda}\left(\left|{\tilde{\beta}}_j\right|\right) \) for j = 1, …, p.
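The three steps above can be sketched in code. The following Python sketch is our illustration, not the chapter's R implementation: scikit-learn's coordinate-descent Lasso stands in for LARS, the initial estimate is ordinary least squares (so it assumes n > p), and the bridge derivative p′(t) = qt^(q−1) plays the role of the penalty derivative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def one_step_lla(X, y, lam, q=0.5):
    """One-step LLA for the bridge (L_q) penalty via a reweighted L1 fit.

    Step 1: rescale each column by the penalty derivative at an initial
    estimate; Step 2: solve a plain L1 problem; Step 3: rescale back.
    """
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]   # initial OLS estimate (n > p)
    w = q * np.abs(beta0) ** (q - 1.0)             # p'(|beta0_j|) for p(t) = t^q
    Xs = X / w                                     # Step 1: x_j* = x_j / p'(|b_j|)
    # Step 2: L1 regression on the transformed design (Lasso in place of LARS)
    fit = Lasso(alpha=lam, fit_intercept=False, max_iter=50000).fit(Xs, y)
    return fit.coef_ / w                           # Step 3: rescale back
```

As in the adaptive LASSO interpretation below, columns with small initial estimates receive a heavy penalty and tend to be dropped, while columns with large initial estimates are only lightly shrunk.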
As we can see, through the one-step sparse estimator, LLA for the L0.5 penalty is equivalent to the adaptive LASSO with weights \( {\widehat{w}}_j={\left|{\tilde{\beta}}_j\right|}^{-0.5} \) (see Eq. 4 in Zou 2006). In computing the LARS estimates, the tuning parameter λ can be chosen using the cv.lars routine in the "lars" package. For example, the following command
instructs R to plot the fivefold cross-validated mean squared prediction error (MSE) for different values of λ. We can then pick the λ with the smallest MSE and identify the step in the LARS path corresponding to that λ. The main function of the package
provides the entire sequence of coefficients and fits, starting from zero to the least squares fit.
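As an assumed illustration of the same cross-validated choice of λ (in Python with scikit-learn's LassoCV rather than the lars package; the simulated design below is ours, not the chapter's), one might write:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Assumed toy design: three informative predictors out of ten.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
beta = np.zeros(10)
beta[:3] = [2.0, -1.0, 1.5]
y = X @ beta + 0.5 * rng.standard_normal(200)

# Five-fold cross-validation over a grid of penalty values, mirroring
# the cv.lars workflow of picking the lambda with the smallest CV MSE.
cv_fit = LassoCV(cv=5, fit_intercept=False).fit(X, y)
best_lambda = cv_fit.alpha_   # lambda minimizing the CV mean squared error
```

LassoCV computes the whole solution path internally, much as lars returns the full sequence of coefficients from zero to the least squares fit.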
Next, consider penalty functions of the second type.
Type 2. The derivative p′_λ(θ) is zero for some values of θ, and the regularization parameter λ cannot be separated from p_λ(θ). An example is the SCAD penalty, whose first derivative is
$$ p{\prime}_{\lambda}\left(\theta \right)=\lambda \left\{ I\left(\theta \le \lambda \right)+\frac{{\left( a\lambda -\theta \right)}_{+}}{\left(a-1\right)\lambda } I\left(\theta >\lambda \right)\right\}, $$
where θ > 0, p_λ(0) = 0, and a > 2.
We define \( U=\left\{j: p{\prime}_{\lambda}\left(\left|{\tilde{\beta}}_j\right|\right)=0\right\} \) and \( V=\left\{j: p{\prime}_{\lambda}\left(\left|{\tilde{\beta}}_j\right|\right)>0\right\} \). Accordingly, we partition X = [X_U, X_V] and write \( \widehat{\beta}={\left({\widehat{\beta}}_U^{\mathrm{T}},{\widehat{\beta}}_V^{\mathrm{T}}\right)}^{\mathrm{T}} \).
Algorithm 2

- Step 1. Create working data \( {\mathrm{x}}_j^{*}={\mathrm{x}}_j\lambda /p{\prime}_{\lambda}\left(\left|{\tilde{\beta}}_j\right|\right) \), for j ∈ V. Let H_U be the projection matrix onto the space spanned by {x_j, j ∈ U}. Compute \( \tilde{\mathrm{y}}=\left(I-{H}_U\right)\mathrm{y} \) and \( {\tilde{\mathbf{X}}}_V^{*}=\left(I-{H}_U\right){\mathbf{X}}_V^{*} \).
- Step 2. Apply the LARS algorithm to solve the penalized L1 regression:
$$ {\widehat{\beta}}_V^{*}= \arg\underset{\beta }{ \min}\left\{\frac{1}{2}{\left\Vert \tilde{\mathrm{y}}-{\tilde{\mathbf{X}}}_V^{*}\beta \right\Vert}^2+ n\lambda {\displaystyle \sum_{j\in V}\left|{\beta}_j\right|}\right\}. $$
- Step 3. Compute \( {\widehat{\beta}}_U^{*}={\left({\mathbf{X}}_U^{\mathrm{T}}{\mathbf{X}}_U\right)}^{-1}\;{\mathbf{X}}_U^{\mathrm{T}}\left(\mathrm{y}-{\mathbf{X}}_V^{*}{\widehat{\beta}}_V^{*}\right) \). Then the final one-step estimator \( \widehat{\beta} \) is obtained by
$$ {\widehat{\beta}}_U={\widehat{\beta}}_U^{*},\qquad {\widehat{\beta}}_j=\lambda {\widehat{\beta}}_j^{*}/p{\prime}_{\lambda}\left(\left|{\tilde{\beta}}_j\right|\right)\ \mathrm{for}\ j\in V. $$
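The steps above can be sketched as follows. This Python sketch is our illustration, not the chapter's R code: scikit-learn's Lasso replaces LARS, the initial estimate is ordinary least squares (assuming n > p), and a = 3.7 is the conventional SCAD default.

```python
import numpy as np
from sklearn.linear_model import Lasso

def scad_deriv(t, lam, a=3.7):
    """First derivative of the SCAD penalty at t >= 0."""
    return lam * np.where(t <= lam, 1.0,
                          np.maximum(a * lam - t, 0.0) / ((a - 1) * lam))

def one_step_scad(X, y, lam, a=3.7):
    n, p = X.shape
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]   # initial root-n estimate (OLS)
    d = scad_deriv(np.abs(beta0), lam, a)
    U, V = d == 0.0, d > 0.0                       # unpenalized / penalized sets
    beta = np.zeros(p)
    XU = X[:, U]
    # Step 1: rescale penalized columns and project out the unpenalized block
    XVs = X[:, V] * (lam / d[V])
    if XU.shape[1] > 0:
        H = XU @ np.linalg.solve(XU.T @ XU, XU.T)  # projection matrix H_U
        y_t, XVt = y - H @ y, XVs - H @ XVs
    else:
        y_t, XVt = y, XVs
    # Step 2: L1 regression on the transformed data (Lasso in place of LARS)
    if XVt.shape[1] > 0:
        bV = Lasso(alpha=lam, fit_intercept=False, max_iter=50000).fit(XVt, y_t).coef_
        beta[V] = bV * (lam / d[V])                # back-transform to original scale
    # Step 3: refit the unpenalized block by least squares
    if XU.shape[1] > 0:
        r = y - X[:, V] @ beta[V]
        beta[U] = np.linalg.solve(XU.T @ XU, XU.T @ r)
    return beta
```

Coefficients whose initial estimates exceed aλ fall in U and are left unpenalized, which is what gives SCAD its nearly unbiased estimates for large effects.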
We note that the cross-validation routine is not directly applicable in this situation, because the sets U and V change as λ varies; that is, different values of the tuning parameter lead to different transformations of the observations. As a result, the SCAD-type penalty requires rerunning the cross-validation and resolving the one-step estimator for each fixed λ. Alternatively, we can determine λ by a criterion-based approach such as the BIC tuning parameter selector. Indeed, Wang et al. (2007) show that the commonly used generalized cross-validation tends to overshoot the correct number of nonzero coefficients, whereas BIC can be used to consistently identify the true model. For practitioners, it is often more convenient to call the procedure getfinalSCADcoef in the "SIS" package (Fan et al. 2010); the option tune.method = c("AIC", "BIC") specifies the selection criterion.
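A minimal sketch of a BIC-based tuning parameter selector in the spirit of Wang et al. (2007), written in Python as our illustration (not the SIS package's implementation), takes the degrees of freedom as the number of nonzero coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso

def bic_select(X, y, lambdas):
    """Pick lambda by BIC = n*log(RSS/n) + df*log(n), df = #nonzero coefs."""
    n = len(y)
    best = (np.inf, None, None)
    for lam in lambdas:
        b = Lasso(alpha=lam, fit_intercept=False, max_iter=50000).fit(X, y).coef_
        rss = np.sum((y - X @ b) ** 2)
        df = np.count_nonzero(b)
        bic = n * np.log(rss / n) + df * np.log(n)
        if bic < best[0]:
            best = (bic, lam, b)
    return best[1], best[2]   # (selected lambda, fitted coefficients)
```

Because each λ is evaluated on a fully refitted estimator, the same loop structure accommodates the SCAD case, where the data transformation itself changes with λ.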
Finally, we use a simple numerical example to illustrate the performance of the one-step sparse estimates. In this example, simulated data are generated by executing the following command in R 2.15.0:
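The original R command is not reproduced here. Purely as an assumed illustration (the design below is ours, not the chapter's), a sparse linear model with four significant variables can be simulated as:

```python
import numpy as np

# Assumed design (not the chapter's actual command): n = 100 observations,
# p = 20 predictors, four significant variables, Gaussian noise.
rng = np.random.default_rng(42)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[0, 1, 2, 3]] = [3.0, 1.5, -2.0, 2.0]   # four significant variables
y = X @ beta + rng.standard_normal(n)
```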
The one-step sparse estimates under the L0.5 and SCAD penalties are summarized as follows. Both selection methods include all four significant variables, but the L0.5 penalty falsely selects one noise variable. The estimated nonzero coefficients under the two methods are reported in the following:
Note that this example alone does not imply that SCAD outperforms the L0.5 penalty, because a valid Monte Carlo simulation study usually requires at least 1,000 simulated data sets.
© 2015 Springer Science+Business Media New York
Huang, JZ., Shi, Z., Zhong, W. (2015). Model Selection for High-Dimensional Problems. In: Lee, CF., Lee, J. (eds) Handbook of Financial Econometrics and Statistics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-7750-1_77
Print ISBN: 978-1-4614-7749-5
Online ISBN: 978-1-4614-7750-1