Journal of Statistical Theory and Practice, Volume 12, Issue 2, pp 450–461

Dimension reduction of gene expression data

  • Jaylen Lee
  • Shannon Ciccarello
  • Mithun Acharjee
  • Kumer Das


DNA methylation of specific dinucleotides has been shown to be strongly linked with tissue age. The goal of this research is to explore different analysis techniques for microarray data in order to build a more effective predictor of age from DNA methylation levels. Specifically, this study compares elastic net regression with four latent variable methods (principal component regression, supervised principal component regression, Y-aware principal component regression, and partial least squares regression) on their ability to predict tissue age from DNA methylation levels. The elastic net model performs better than the latent variable models when fewer than ten principal components are considered for each method, but Y-aware principal component regression predicts more accurately (with a reasonably low testing RMSE) and captures more of the desired structure when the number of principal components increases to 20. Coding limitations prevented conclusive results about how the performance of supervised principal component regression changes as the number of components increases.
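To make the comparison between ordinary and Y-aware principal component regression concrete, the following is a minimal sketch on synthetic data, not the study's code or data. It assumes that "Y-aware" means rescaling each centered predictor by the slope of its univariate regression against the response before extracting components; the data-generating process, the function names `fit_pcr` and `predict_pcr`, and the choice of two components are all illustrative assumptions.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def fit_pcr(X, y, k, y_aware=False):
    """Fit a regression of y on the top-k principal components of scaled X.

    y_aware=False: each column is scaled to unit variance (ordinary PCR).
    y_aware=True:  each column is multiplied by the slope of the univariate
                   regression of y on that column (assumed Y-aware scaling).
    """
    mu, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - mu, y - y_mean
    if y_aware:
        scale = (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)
    else:
        scale = 1.0 / Xc.std(axis=0)
    Z = Xc * scale
    # Principal directions are the right singular vectors of the scaled data.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    V = Vt[:k].T
    coef, *_ = np.linalg.lstsq(Z @ V, yc, rcond=None)
    return {"mu": mu, "scale": scale, "V": V, "coef": coef, "y_mean": y_mean}


def predict_pcr(model, X):
    """Project new data onto the stored components and predict y."""
    Z = (X - model["mu"]) * model["scale"]
    return Z @ model["V"] @ model["coef"] + model["y_mean"]


# Synthetic stand-in for methylation data (hypothetical, for illustration):
# 20 features drift weakly with age; 180 features share three strong latent
# factors that are unrelated to age and dominate the total variance.
rng = np.random.default_rng(0)
n, p_signal, p_noise = 200, 20, 180
age = rng.uniform(20, 80, size=n)
X_signal = 0.01 * (age[:, None] - 50) + rng.normal(scale=0.2, size=(n, p_signal))
factors = rng.normal(size=(n, 3))
loadings = rng.normal(size=(3, p_noise))
X_noise = factors @ loadings + rng.normal(size=(n, p_noise))
X = np.hstack([X_signal, X_noise])

train, test = slice(0, 100), slice(100, 200)
k = 2  # deliberately small number of components

standard = fit_pcr(X[train], age[train], k, y_aware=False)
aware = fit_pcr(X[train], age[train], k, y_aware=True)

rmse_standard = rmse(age[test], predict_pcr(standard, X[test]))
rmse_y_aware = rmse(age[test], predict_pcr(aware, X[test]))
print(f"test RMSE, ordinary PCR (k={k}): {rmse_standard:.2f}")
print(f"test RMSE, Y-aware PCR  (k={k}): {rmse_y_aware:.2f}")
```

In this constructed setting the leading ordinary components track the high-variance nuisance factors, while the Y-aware scaling shrinks predictors that are unrelated to age, so the first components align with the response. Real methylation data need not behave this way, and the number of components would normally be chosen by cross-validation.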


Keywords

Principal component analysis; DNA methylation; elastic net regression; Y-aware PCR; supervised PCR; PLS regression

AMS Subject Classification

62H25 62J99 62N86 





Copyright information

© Grace Scientific Publishing, 20 Middlefield Ct, Greensboro, NC 27455 2018

Authors and Affiliations

  • Jaylen Lee (1)
  • Shannon Ciccarello (2)
  • Mithun Acharjee (3)
  • Kumer Das (3), corresponding author

  1. Department of Mathematics and Statistics, James Madison University, Harrisonburg, USA
  2. Department of Mathematics and Statistics, Hollins University, Roanoke, USA
  3. Department of Mathematics, Lamar University, Beaumont, USA
