Abstract
With the advance of modern technologies, high-dimensional data have become prevalent in computational biology. The number of variables p is very large, and in many applications p exceeds the number of observational units n. Such high dimensionality and the unconventional small-n-large-p setting pose new challenges to statistical analysis methods. Dimension reduction, which aims to reduce the predictor dimension prior to any modeling efforts, offers a potentially useful avenue to tackle such high-dimensional regression. In this chapter, we review a number of commonly used dimension reduction approaches, including principal component analysis, partial least squares, and sliced inverse regression. For each method, we review its background and its applications in computational biology, discuss both its advantages and limitations, and offer enough operational details for implementation. A numerical example analyzing a microarray survival data set is given to illustrate applications of the reviewed reduction methods.
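As a rough illustration of reducing the predictor dimension before modeling, the following minimal sketch applies principal component analysis to simulated data with p larger than n and then fits ordinary least squares on the leading component scores. It is not taken from the chapter: the function name `pca_reduce`, the simulated data, and the choices n = 40, p = 500, d = 3 are all assumptions made for the example.

```python
import numpy as np

def pca_reduce(X, d):
    """Project the rows of X (n x p) onto the first d principal components.

    Hypothetical helper for illustration; uses the thin SVD so that it
    also works when p > n and the sample covariance matrix is singular.
    """
    Xc = X - X.mean(axis=0)                      # center each predictor
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :d] * s[:d]                    # n x d reduced predictors
    loadings = Vt[:d].T                          # p x d component directions
    return scores, loadings

# Simulated small-n-large-p data (assumed values, not from the chapter).
rng = np.random.default_rng(0)
n, p, d = 40, 500, 3
X = rng.normal(size=(n, p))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

scores, loadings = pca_reduce(X, d)

# Regress the response on the d reduced predictors plus an intercept.
Z = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
```

The reduction step here ignores the response, so the leading components need not be the directions most relevant to y; partial least squares and sliced inverse regression differ from this sketch precisely in using the response when forming the reduced predictors.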
Acknowledgments
This work was supported in part by National Science Foundation grant DMS 0706919.
Copyright information
© 2010 Humana Press, a part of Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Li, L. (2010). Dimension Reduction for High-Dimensional Data. In: Bang, H., Zhou, X., van Epps, H., Mazumdar, M. (eds) Statistical Methods in Molecular Biology. Methods in Molecular Biology, vol 620. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-60761-580-4_14
DOI: https://doi.org/10.1007/978-1-60761-580-4_14
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-60761-578-1
Online ISBN: 978-1-60761-580-4
eBook Packages: Springer Protocols