Abstract
High-dimensional data analysis is increasingly important to both academics and practitioners in finance and economics, but it is also challenging because the number of variables or parameters associated with such data can exceed the sample size. Several variable selection approaches have recently been developed to select significant variables and simultaneously construct a parsimonious model. In this chapter, we first provide an overview of model selection approaches in the context of penalized least squares. We then review independence screening, a recently developed method for analyzing ultrahigh-dimensional data, where the number of variables or parameters can be exponentially larger than the sample size. Finally, we discuss and advocate multistage procedures that combine independence screening with variable selection and that may be especially suitable for analyzing high-frequency financial data.
Penalized least squares seek to keep important predictors in a model while penalizing coefficients associated with irrelevant predictors. As such, under certain conditions, penalized least squares can lead to a sparse solution for linear models and achieve asymptotic consistency in separating relevant variables from irrelevant ones. Independence screening selects relevant variables based on certain measures of marginal correlations between candidate variables and the response.
References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csaki (Eds.), Second international symposium on information theory (Vol. 1, pp. 267–281). Budapest: Akademiai Kiado.
Allen, D. (1974). The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16, 125–127.
Anderson, H. M., & Vahid, F. (2007). Forecasting the volatility of Australian stock returns. Journal of Business and Economic Statistics, 25, 76–90.
Antoniadis, A., & Fan, J. (2001). Regularization of wavelet approximations (with discussion). Journal of the American Statistical Association, 96, 939–967.
Bai, J., & Ng, S. (2008). Forecasting economic time series using targeted predictors. Journal of Econometrics, 146, 304–317.
Bellman, R. (1961). Adaptive control processes: A guided tour. Princeton: Princeton University Press.
Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Annals of Statistics, 24, 2350–2383.
Campbell, J. Y., Lo, A. W., & MacKinlay, A. C. (1997). The econometrics of financial markets. Princeton: Princeton University Press.
Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n (with discussion). Annals of Statistics, 35, 2313–2404.
Chen, X., Zou, C., & Cook, R. D. (2010). Coordinate-independent sparse sufficient dimension reduction and variable selection. Annals of Statistics, 38, 3696–3723.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Cook, R. D. (1998). Regression graphics: Ideas for studying regressions through graphics. New York: John Wiley & Sons.
Cook, R. D., & Weisberg, S. (1991). Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86, 328–332.
Craven, P., & Wahba, G. (1979). Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31, 377–403.
Demartines, P., & Hérault, J. (1997). Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets. IEEE Transactions on Neural Networks, 8, 148–154.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression (with discussion). Annals of Statistics, 32, 409–499.
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.
Fan, J., & Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. Proceedings of the International Congress of Mathematicians, European Mathematical Society, Zurich, III, 595–622.
Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70, 849–911.
Fan, J., & Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics, 38, 3567–3604.
Fan, J., Samworth, R., & Wu, Y. (2009). Ultrahigh dimensional feature selection: Beyond the linear model. Journal of Machine Learning Research, 10, 1829–1853.
Fan, J., Feng, Y., Samworth, R., & Wu, Y. (2010). SIS: Sure Independence Screening. R package version 0.6. http://CRAN.R-project.org/package=SIS
Fan, J., Feng, Y., & Song, R. (2011a). Nonparametric independence screening in sparse ultra-high dimensional additive models. Journal of the American Statistical Association, 106, 544–557.
Fan, J., Lv, J., & Qi, L. (2011b). Sparse high-dimensional models in economics. Annual Review of Economics, 3, 291–317.
Frank, I. E., & Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35, 109–148.
Friedman, J. (2008). Fast sparse regression and classification. International Journal of Forecasting, 28, 722–738.
Goto, S., & Xu, Y. (2010). On mean variance portfolio optimization: Improving performance through better use of hedging relations (Working paper), University of Rhode Island.
Hall, P., & Miller, H. (2009). Using generalized correlation to effect variable selection in very high dimensional problems. Journal of Computational and Graphical Statistics, 18, 533–550.
Hastie, T. (1994). Principal curves and surfaces. DTIC Document.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York: Springer.
Hoerl, A., & Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
Huang, J.-Z., & Shi, Z. (2010). Determinants of bond risk premia. (American Finance Association 2011 Denver meetings paper). Penn State University. http://ssrn.com/abstract=1573186.
Ji, P., & Jin, J. (2012). UPS delivers optimal phase diagram in high dimensional variable selection. Annals of Statistics, 40, 73–103.
Li, K. C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 285–299.
Li, L. (2007). Sparse sufficient dimension reduction. Biometrika, 94, 603–613.
Li, R., Zhong, W., Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107, 1129–1139.
Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15, 661–675.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (1980). Multivariate analysis. London: Academic.
Miller, A. J. (2002). Subset selection in regression (2nd ed.). New York: Chapman & Hall/CRC.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica, 7, 221–264.
Stock, J. H., & Watson, M. W. (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97, 1167–1179.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Wang, H., Li, R., & Tsai, C. L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94, 553–568.
Zhang, C. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38, 894–942.
Zhou, J., & He, X. (2008). Dimension reduction based on constrained canonical correlation and variable filtering. Annals of Statistics, 36, 1649–1668.
Zhu, L., Li, L., Li, R., & Zhu, L. (2011). Model-free feature screening for ultrahigh dimensional data. Journal of the American Statistical Association, 106, 1464–1475.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301–320.
Zou, H., & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36, 1509–1533.
Zou, H., & Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics, 37, 1733–1751.
Appendix 1: One-Step Sparse Estimates
This appendix outlines the local linear approximation (LLA) algorithm, discussed in Sect. 77.2.3, for finding a solution of penalized least squares for a broad class of penalty functions. For illustration, we focus on the nondifferentiable, nonconvex L0.5 and SCAD penalties introduced in Sect. 77.2.2, as both are ideally suited to the LLA algorithm (Zou and Li 2008). Recall that these two penalty functions are given by Eqs. (77.18) and (77.23), respectively. Both penalties are concave on (0, ∞) and singular at the origin, so the resulting objective functions are nonconvex. We show below that in linear regressions this optimization problem can be reduced to solving penalized least squares with an L1 penalty.
We begin with the following one-step LLA estimator:
$$ \widehat{\beta}= \arg\underset{\beta }{ \min}\left\{\frac{1}{2}{\left\Vert \mathrm{y}-\mathbf{X}\beta \right\Vert}^2+ n{\displaystyle \sum_{j=1}^p p{\prime}_{\lambda}\left(\left|{\tilde{\beta}}_j\right|\right)\left|{\beta}_j\right|}\right\}, $$
where the initial estimate \( \tilde{\beta} \) is in practice usually taken to be the ordinary least squares estimator when n > p. In theory, \( \tilde{\beta} \) can be any root-n-consistent estimator of β. When n < p, the plain vanilla L1 estimator (LASSO) can be employed instead.
Following Zou and Li (2008), we discuss separately the algorithms for the objective function Q(β) under two types of penalty functions, corresponding to L0.5 and SCAD, respectively. The idea underlying both algorithms is the same: use a data transformation to reduce the penalized least squares to a standard L1 regularization problem, for which efficient algorithms are available. In particular, the LARS algorithm (Efron et al. 2004) has been widely applied to obtain the entire solution path of LASSO and forward stagewise regressions. The relevant R package can be downloaded from http://www.stanford.edu/~hastie/swData.htm#SvmP.
Consider first penalty functions of the first type.
Type 1. The tuning parameter can be disentangled from the penalty function; that is, p_λ(θ) satisfies p_λ(θ) = λp(θ) with p′(θ) > 0. Examples include the bridge penalties p_λ(|θ|) = λ|θ|^q for 0 < q < 1 and the logarithm penalty p_λ(|θ|) = λ log|θ|.
Algorithm 1

- Step 1. Create working data \( {\mathrm{x}}_j^{*}={\mathrm{x}}_j/p{\prime}_{\lambda}\left(\left|{\tilde{\beta}}_j\right|\right) \), for j = 1, …, p.
- Step 2. Apply the LARS algorithm to solve the penalized L1 regression:
$$ {\widehat{\beta}}^{*}= \arg\underset{\beta }{ \min}\left\{\frac{1}{2}{\left\Vert \mathrm{y}-{\mathbf{X}}^{*}\beta \right\Vert}^2+ n\lambda {\displaystyle \sum_{j=1}^p\left|{\beta}_j\right|}\right\}. $$
- Step 3. Let the final one-step estimator be \( {\widehat{\beta}}_j={\widehat{\beta}}_j^{*}/p{\prime}_{\lambda}\left(\left|{\tilde{\beta}}_j\right|\right) \) for j = 1, …, p.
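The three steps above can be sketched in code. The following Python sketch is our illustration, not the chapter's R implementation: scikit-learn's coordinate-descent Lasso stands in for LARS, the initial estimate is ordinary least squares (so it assumes n > p), and the bridge derivative p′(t) = qt^(q−1) plays the role of the penalty derivative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def one_step_lla(X, y, lam, q=0.5):
    """One-step LLA for the bridge (L_q) penalty via a reweighted L1 fit.

    Step 1: rescale each column by the penalty derivative at an initial
    estimate; Step 2: solve a plain L1 problem; Step 3: rescale back.
    """
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]   # initial OLS estimate (n > p)
    w = q * np.abs(beta0) ** (q - 1.0)             # p'(|beta0_j|) for p(t) = t^q
    Xs = X / w                                     # Step 1: x_j* = x_j / p'(|b_j|)
    # Step 2: L1 regression on the transformed design (Lasso in place of LARS)
    fit = Lasso(alpha=lam, fit_intercept=False, max_iter=50000).fit(Xs, y)
    return fit.coef_ / w                           # Step 3: rescale back
```

As in the adaptive LASSO interpretation below, columns with small initial estimates receive a heavy penalty and tend to be dropped, while columns with large initial estimates are only lightly shrunk.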
As we can see, through the one-step sparse estimator, LLA for the L0.5 penalty is equivalent to the adaptive LASSO with weights \( {\widehat{w}}_j={\left|{\tilde{\beta}}_j\right|}^{-0.5} \) (see Eq. 4 in Zou 2006). In computing the LARS estimates, the tuning parameter λ can be chosen using the cv.lars routine in the "lars" package. For example, the following command
instructs R to plot the fivefold cross-validated mean squared prediction error (MSE) for different values of λ. We can then pick the λ with the smallest MSE and identify the step in the LARS path corresponding to that λ. The main function of the package
provides the entire sequence of coefficients and fits, starting from zero to the least squares fit.
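As an assumed illustration of the same cross-validated choice of λ (in Python with scikit-learn's LassoCV rather than the lars package; the simulated design below is ours, not the chapter's), one might write:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Assumed toy design: three informative predictors out of ten.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
beta = np.zeros(10)
beta[:3] = [2.0, -1.0, 1.5]
y = X @ beta + 0.5 * rng.standard_normal(200)

# Five-fold cross-validation over a grid of penalty values, mirroring
# the cv.lars workflow of picking the lambda with the smallest CV MSE.
cv_fit = LassoCV(cv=5, fit_intercept=False).fit(X, y)
best_lambda = cv_fit.alpha_   # lambda minimizing the CV mean squared error
```

LassoCV computes the whole solution path internally, much as lars returns the full sequence of coefficients from zero to the least squares fit.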
Next, consider penalty functions of the second type.
Type 2. The derivative p′_λ(θ) is zero for some values of θ, and the regularization parameter λ cannot be separated from p_λ(θ). An example is the SCAD penalty, whose first derivative is
$$ p{\prime}_{\lambda}\left(\theta \right)=\lambda \left\{ I\left(\theta \le \lambda \right)+\frac{{\left( a\lambda -\theta \right)}_{+}}{\left(a-1\right)\lambda } I\left(\theta >\lambda \right)\right\}, $$
where θ > 0, p_λ(0) = 0, and a > 2.
We define \( U=\left\{j: p{\prime}_{\lambda}\left(\left|{\tilde{\beta}}_j\right|\right)=0\right\} \) and \( V=\left\{j: p{\prime}_{\lambda}\left(\left|{\tilde{\beta}}_j\right|\right)>0\right\} \). Accordingly, we partition X = [X_U, X_V] and write \( \widehat{\beta}={\left({\widehat{\beta}}_U^{\mathrm{T}},{\widehat{\beta}}_V^{\mathrm{T}}\right)}^{\mathrm{T}} \).
Algorithm 2

- Step 1. Create working data \( {\mathrm{x}}_j^{*}={\mathrm{x}}_j\lambda /p{\prime}_{\lambda}\left(\left|{\tilde{\beta}}_j\right|\right) \), for j ∈ V. Let H_U be the projection matrix onto the space spanned by {x_j, j ∈ U}. Compute \( \tilde{\mathrm{y}}=\left(I-{H}_U\right)\mathrm{y} \) and \( {\tilde{\mathbf{X}}}_V^{*}=\left(I-{H}_U\right){\mathbf{X}}_V^{*} \).
- Step 2. Apply the LARS algorithm to solve the penalized L1 regression:
$$ {\widehat{\beta}}_V^{*}= \arg\underset{\beta }{ \min}\left\{\frac{1}{2}{\left\Vert \tilde{\mathrm{y}}-{\tilde{\mathbf{X}}}_V^{*}\beta \right\Vert}^2+ n\lambda {\displaystyle \sum_{j\in V}\left|{\beta}_j\right|}\right\}. $$
- Step 3. Compute \( {\widehat{\beta}}_U^{*}={\left({\mathbf{X}}_U^{\mathrm{T}}{\mathbf{X}}_U\right)}^{-1}\;{\mathbf{X}}_U^{\mathrm{T}}\left(\mathrm{y}-{\mathbf{X}}_V^{*}{\widehat{\beta}}_V^{*}\right) \). Then the final one-step estimator \( \widehat{\beta} \) is obtained by
$$ {\widehat{\beta}}_U={\widehat{\beta}}_U^{*},\qquad {\widehat{\beta}}_j=\lambda {\widehat{\beta}}_j^{*}/p{\prime}_{\lambda}\left(\left|{\tilde{\beta}}_j\right|\right)\ \mathrm{for}\ j\in V. $$
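The steps above can be sketched as follows. This Python sketch is our illustration, not the chapter's R code: scikit-learn's Lasso replaces LARS, the initial estimate is ordinary least squares (assuming n > p), and a = 3.7 is the conventional SCAD default.

```python
import numpy as np
from sklearn.linear_model import Lasso

def scad_deriv(t, lam, a=3.7):
    """First derivative of the SCAD penalty at t >= 0."""
    return lam * np.where(t <= lam, 1.0,
                          np.maximum(a * lam - t, 0.0) / ((a - 1) * lam))

def one_step_scad(X, y, lam, a=3.7):
    n, p = X.shape
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]   # initial root-n estimate (OLS)
    d = scad_deriv(np.abs(beta0), lam, a)
    U, V = d == 0.0, d > 0.0                       # unpenalized / penalized sets
    beta = np.zeros(p)
    XU = X[:, U]
    # Step 1: rescale penalized columns and project out the unpenalized block
    XVs = X[:, V] * (lam / d[V])
    if XU.shape[1] > 0:
        H = XU @ np.linalg.solve(XU.T @ XU, XU.T)  # projection matrix H_U
        y_t, XVt = y - H @ y, XVs - H @ XVs
    else:
        y_t, XVt = y, XVs
    # Step 2: L1 regression on the transformed data (Lasso in place of LARS)
    if XVt.shape[1] > 0:
        bV = Lasso(alpha=lam, fit_intercept=False, max_iter=50000).fit(XVt, y_t).coef_
        beta[V] = bV * (lam / d[V])                # back-transform to original scale
    # Step 3: refit the unpenalized block by least squares
    if XU.shape[1] > 0:
        r = y - X[:, V] @ beta[V]
        beta[U] = np.linalg.solve(XU.T @ XU, XU.T @ r)
    return beta
```

Coefficients whose initial estimates exceed aλ fall in U and are left unpenalized, which is what gives SCAD its nearly unbiased estimates for large effects.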
We note that the cross-validation routine is not directly applicable in this situation, because the sets U and V change as λ varies; that is, different values of the tuning parameter lead to different transformations of the observations. As a result, the SCAD-type penalty requires rerunning the cross-validation and resolving the one-step estimator for each fixed λ. Alternatively, we can determine λ by a criterion-based approach such as the BIC tuning parameter selector. Indeed, Wang et al. (2007) show that the commonly used generalized cross-validation tends to overshoot the correct number of nonzero coefficients, whereas BIC can be used to consistently identify the true model. For practitioners, it is often more convenient to call the procedure getfinalSCADcoef in the "SIS" package (Fan et al. 2010); the option tune.method = c("AIC", "BIC") specifies the selection criterion.
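A minimal sketch of a BIC-based tuning parameter selector in the spirit of Wang et al. (2007), written in Python as our illustration (not the SIS package's implementation), takes the degrees of freedom as the number of nonzero coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso

def bic_select(X, y, lambdas):
    """Pick lambda by BIC = n*log(RSS/n) + df*log(n), df = #nonzero coefs."""
    n = len(y)
    best = (np.inf, None, None)
    for lam in lambdas:
        b = Lasso(alpha=lam, fit_intercept=False, max_iter=50000).fit(X, y).coef_
        rss = np.sum((y - X @ b) ** 2)
        df = np.count_nonzero(b)
        bic = n * np.log(rss / n) + df * np.log(n)
        if bic < best[0]:
            best = (bic, lam, b)
    return best[1], best[2]   # (selected lambda, fitted coefficients)
```

Because each λ is evaluated on a fully refitted estimator, the same loop structure accommodates the SCAD case, where the data transformation itself changes with λ.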
Finally, we use a simple numerical example to illustrate the performance of the one-step sparse estimates. In this example, simulated data are generated by executing the following command in R 2.15.0:
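The original R command is not reproduced here. Purely as an assumed illustration (the design below is ours, not the chapter's), a sparse linear model with four significant variables can be simulated as:

```python
import numpy as np

# Assumed design (not the chapter's actual command): n = 100 observations,
# p = 20 predictors, four significant variables, Gaussian noise.
rng = np.random.default_rng(42)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[0, 1, 2, 3]] = [3.0, 1.5, -2.0, 2.0]   # four significant variables
y = X @ beta + rng.standard_normal(n)
```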
The one-step sparse estimates under the L0.5 and SCAD penalties are summarized as follows. Both selection methods include all four significant variables, but the L0.5 penalty falsely selects one noise variable. The estimated nonzero coefficients under the two methods are reported in the following:
Note that this example alone does not imply that SCAD outperforms the L0.5 penalty, because a valid Monte Carlo simulation study usually requires at least 1,000 simulated data sets.
© 2015 Springer Science+Business Media New York
Huang, JZ., Shi, Z., Zhong, W. (2015). Model Selection for High-Dimensional Problems. In: Lee, CF., Lee, J. (eds) Handbook of Financial Econometrics and Statistics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-7750-1_77
Print ISBN: 978-1-4614-7749-5
Online ISBN: 978-1-4614-7750-1