Informative metabolites identification by variable importance analysis based on random variable combination

Yun, Yong-Huan; Liang, Fu; Deng, Bai-Chuan; Lai, Guang-Bi; Vicente Gonçalves, Carlos M.; Lu, Hong-Mei; Yan, Jun; Huang, Xin; Yi, Lun-Zhao; Liang, Yi-Zeng

doi:10.1007/s11306-015-0803-x

Original Article
Published: 17 May 2015

Volume 11, pages 1539–1551, (2015)
Metabolomics

Yong-Huan Yun¹,
Fu Liang¹,
Bai-Chuan Deng¹,
Guang-Bi Lai²,
Carlos M. Vicente Gonçalves³,
Hong-Mei Lu¹,
Jun Yan¹,
Xin Huang¹,
Lun-Zhao Yi⁴ &
Yi-Zeng Liang¹

Abstract

The main target of metabolomics research is to reveal informative metabolites or biomarkers, a task that can be treated as variable selection. Several measures, such as regression coefficients (RC), weights, and variable importance in projection (VIP), have been widely used to assess variable importance when building a partial least squares-linear discriminant analysis (PLS-LDA) classification model. A set of metabolites is then selected by ranking the metabolites and applying a threshold value. However, these measures do not take into account the combination effect among a subset of variables, which leads to bias in the results. In this work, a strategy named variable importance analysis based on random variable combination (VIAVC) is developed for the statistical assessment of variable importance. The VIAVC framework has three main parts: (1) a novel variable sampling method, called binary matrix resampling, guarantees that each variable is selected with the same probability and generates a population of different variable combinations; (2) the importance of each variable is assessed by the percent decrease or increase in the area under the receiver operating characteristic curve (AUC) when the variable is excluded from the PLS-LDA modeling; (3) informative variables are retained iteratively, and the rank of the final remaining informative variables is output. Applications to three metabolic datasets illustrate that VIAVC performs better than other methods, including RC, VIP and subwindow permutation analysis. The MATLAB code for implementing VIAVC is available in the supplemental materials.
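Parts (1) and (2) of the framework can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation, and several pieces are assumptions made only for illustration: the function names, the toy data, the choice to give every column of the binary matrix the same number of ones before shuffling it, and above all the stand-in classifier, which projects samples onto the class-mean difference instead of fitting PLS-LDA. The AUC here is also computed on the training samples themselves, with no cross-validation, and the iterative retention of part (3) is left out.

```python
import random


def binary_matrix_sampling(n_rows, n_vars, ratio=0.5, seed=0):
    """Return an n_rows x n_vars 0/1 matrix in which every column holds the
    same number of ones, so each variable is sampled equally often."""
    rng = random.Random(seed)
    n_ones = round(n_rows * ratio)
    columns = []
    for _ in range(n_vars):
        col = [1] * n_ones + [0] * (n_rows - n_ones)
        rng.shuffle(col)  # permute each column independently
        columns.append(col)
    # Transpose: each row is one random variable combination.
    return [list(row) for row in zip(*columns)]


def auc(scores, labels):
    """AUC via the Mann-Whitney rank statistic; ties count as 0.5."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid_scores(X, y, subset):
    """Stand-in classifier (NOT PLS-LDA): score each sample by projecting
    it onto the difference of class means over the chosen variables."""
    pos = [x for x, lab in zip(X, y) if lab == 1]
    neg = [x for x, lab in zip(X, y) if lab == 0]
    w = {k: sum(r[k] for r in pos) / len(pos)
            - sum(r[k] for r in neg) / len(neg) for k in subset}
    return [sum(w[k] * x[k] for k in subset) for x in X]


def percent_auc_change(X, y, matrix, j):
    """Mean percent change in AUC when variable j is dropped from every
    sampled combination that contains it (negative: dropping j hurts)."""
    changes = []
    for row in matrix:
        if row[j] != 1:
            continue
        with_j = [k for k, bit in enumerate(row) if bit]
        without_j = [k for k in with_j if k != j]
        a_with = auc(centroid_scores(X, y, with_j), y)
        a_without = auc(centroid_scores(X, y, without_j), y)
        changes.append(100.0 * (a_without - a_with) / a_with)
    return sum(changes) / len(changes)


# Hypothetical toy data: variable 0 separates the classes,
# variables 1 and 2 have identical class means and carry no signal.
X = [[5.0, 1.0, 2.0],
     [6.0, 3.0, 0.0],
     [1.0, 3.0, 0.0],
     [2.0, 1.0, 2.0]]
y = [1, 1, 0, 0]

matrix = binary_matrix_sampling(n_rows=10, n_vars=3, ratio=0.5, seed=1)
for j in range(3):
    print(j, percent_auc_change(X, y, matrix, j))
```

On this toy data, excluding variable 0 lowers the AUC from 1.0 to 0.5 in every combination that contains it, a change of -50 percent, while excluding variable 1 or 2 leaves the AUC unchanged. Because each variable is evaluated across many random combinations, its importance reflects how it behaves alongside other variables and not only on its own.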

Acknowledgments

This work is financially supported by the National Nature Foundation Committee of P.R. China (Grant Nos. 21275164, 21465016, 21175157 and 21375151) and by the Fundamental Research Funds for the Central Universities of Central South University (Grant No. 2014zzts014). The studies have the approval of the university’s review board.

Conflict of interest

The authors declare that there are no conflicts of interest.

Compliance with ethical requirements

All clinical experiments for the GC–MS data were approved by the Xiangya Institutional Human Subjects Committee. All clinical experiments for the NMR data were approved by the Alberta Cancer Board Research Ethics Board.

Author information

Authors and Affiliations

College of Chemistry and Chemical Engineering, Central South University, Changsha, 410083, China
Yong-Huan Yun, Fu Liang, Bai-Chuan Deng, Hong-Mei Lu, Jun Yan, Xin Huang & Yi-Zeng Liang
Heilongjiang University of Chinese Medicine, Harbin, 150040, Heilongjiang, China
Guang-Bi Lai
Department of Chemistry, Faculty of Mathematics and Natural Sciences, University of Bergen, Bergen, 5020, Norway
Carlos M. Vicente Gonçalves
Yunnan Food Safety Research Institute, Kunming University of Science and Technology, Kunming, 650500, China
Lun-Zhao Yi

Corresponding authors

Correspondence to Lun-Zhao Yi or Yi-Zeng Liang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOCX 52 kb)

Supplementary material 2 (ZIP 25 kb)

About this article

Cite this article

Yun, YH., Liang, F., Deng, BC. et al. Informative metabolites identification by variable importance analysis based on random variable combination. Metabolomics 11, 1539–1551 (2015). https://doi.org/10.1007/s11306-015-0803-x

Received: 14 September 2014
Accepted: 07 May 2015
Published: 17 May 2015
Issue Date: December 2015
DOI: https://doi.org/10.1007/s11306-015-0803-x
