Abstract
The main goal of metabolomics research is to reveal informative metabolites or biomarkers, which can be treated as a variable selection problem. Several measures, such as regression coefficients (RC), weights, and variable importance in projection (VIP), have been widely used to assess variable importance when building a partial least squares linear discriminant analysis (PLS-LDA) classification model. A set of metabolites is then selected by ranking the metabolites and applying a threshold value. However, these measures do not take into account the combination effect among a subset of variables, which can bias the results. In this work, a strategy named variable importance analysis based on random variable combination (VIAVC) is developed for the statistical assessment of variable importance. The VIAVC framework has three main parts: (1) a novel variable sampling method, called binary matrix resampling, guarantees that each variable is selected with the same probability and generates a population of different variable combinations; (2) the importance of each variable is assessed by the percent decrease or increase in the area under the receiver operating characteristic curve (AUC) when the variable is excluded from the PLS-LDA model; (3) informative variables are retained iteratively, and the rank of the final remaining variables is output. Applications to three metabolic datasets show that VIAVC performs better than other methods, including RC, VIP, and subwindow permutation analysis. The MATLAB code implementing VIAVC is available in the supplemental materials.
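Parts (1) and (2) of the framework can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation, and several choices are assumptions made only for illustration: the sampling matrix is built by giving every column the same number of ones and shuffling each column independently; a difference-of-class-means projection stands in for PLS-LDA; the AUC is computed on the training samples rather than by cross-validation; and the change in AUC is reported as a plain difference rather than a percentage. All function names (`binary_sampling_matrix`, `auc`, `centroid_scores`, `auc_change`) are hypothetical.

```python
import random

def binary_sampling_matrix(n_rows, n_vars, ratio=0.5, seed=0):
    """Build an n_rows x n_vars 0/1 matrix; each row is one variable combination.

    Assumption: every column gets the same number of ones and is shuffled
    independently, so each variable appears in equally many combinations.
    """
    rng = random.Random(seed)
    n_ones = round(n_rows * ratio)
    columns = []
    for _ in range(n_vars):
        col = [1] * n_ones + [0] * (n_rows - n_ones)
        rng.shuffle(col)
        columns.append(col)
    # Transpose the list of columns into a list of rows.
    return [list(row) for row in zip(*columns)]

def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def centroid_scores(X, y, cols):
    """Stand-in classifier (NOT PLS-LDA): project samples onto the
    difference of the two class means, using only the columns in `cols`."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    w = [sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
         for j in cols]
    return [sum(wj * x[j] for wj, j in zip(w, cols)) for x in X]

def auc_change(X, y, matrix, var):
    """Mean AUC difference (with `var` minus without `var`) over all
    combinations that contain `var` and at least one other variable."""
    diffs = []
    for row in matrix:
        cols = [j for j, bit in enumerate(row) if bit]
        if var not in cols or len(cols) < 2:
            continue
        without = [j for j in cols if j != var]
        diffs.append(auc(centroid_scores(X, y, cols), y)
                     - auc(centroid_scores(X, y, without), y))
    return sum(diffs) / len(diffs) if diffs else 0.0
```

A positive value from `auc_change` suggests that removing the variable lowers the AUC, so the variable carries information in combination with the others; a value near zero or below suggests it is uninformative or interfering. A faithful reimplementation would replace `centroid_scores` with a PLS-LDA model and evaluate the AUC by cross-validation.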
Acknowledgments
This work was financially supported by the National Nature Foundation Committee of P.R. China (Grants No. 21275164, 21465016, 21175157 and 21375151) and by the Fundamental Research Funds for the Central Universities of Central South University (Grant No. 2014zzts014). The studies were approved by the university’s review board.
Conflict of interest
The authors declare that there are no conflicts of interest.
Compliance with ethical requirements
All clinical experiments of the GC–MS data were approved by Xiangya Institutional Human Subjects Committee. All clinical experiments of the NMR data were approved by the Alberta Cancer Board Research Ethics Board.
Cite this article
Yun, YH., Liang, F., Deng, BC. et al. Informative metabolites identification by variable importance analysis based on random variable combination. Metabolomics 11, 1539–1551 (2015). https://doi.org/10.1007/s11306-015-0803-x
DOI: https://doi.org/10.1007/s11306-015-0803-x