Informative metabolites identification by variable importance analysis based on random variable combination

  • Original Article
Metabolomics

Abstract

A main goal of metabolomics research is to reveal informative metabolites or biomarkers, a task that can be treated as variable selection. Several measures, such as the regression coefficient (RC), weights and variable importance in projection (VIP), are widely used to assess variable importance when building a partial least squares linear discriminant analysis (PLS-LDA) classification model. A set of metabolites is then selected by ranking the metabolites and applying a fixed threshold. However, these measures do not take into account the combination effect among a subset of variables, which leads to biased results. In this work, a strategy named variable importance analysis based on random variable combination (VIAVC) is developed for the statistical assessment of variable importance. The VIAVC framework has three main parts: (1) a novel variable sampling method, called binary matrix resampling, generates a population of different variable combinations while guaranteeing that each variable is selected with the same probability; (2) the importance of each variable is assessed by the percent decrease or increase in the area under the receiver operating characteristic curve when the variable is excluded from PLS-LDA modeling; (3) informative variables are retained iteratively, and the rank of those that finally remain is output. Applications to three metabolic datasets illustrate that VIAVC performs better than other methods, including RC, VIP and subwindow permutation analysis. The MATLAB code implementing VIAVC is available in the supplemental materials.
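The first two parts of the framework can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation, and it rests on several assumptions. The function names (binary_matrix_sampling, auc, mean_diff_scores, auc_drop_on_exclusion) and the ratio and seed parameters are hypothetical. The binary matrix is built by giving every column the same number of ones and shuffling each column independently. A simple class-mean-difference projection stands in for PLS-LDA. The importance is reported as an absolute AUC difference, not a percentage, and is computed on the training data without cross-validation.

```python
import random


def binary_matrix_sampling(n_models, n_vars, ratio=0.5, seed=0):
    """Return an n_models x n_vars 0/1 matrix (list of rows).

    Every column holds the same number of ones, so each variable is
    selected in the same number of sub-models (rows).
    """
    rng = random.Random(seed)
    ones = round(n_models * ratio)
    columns = []
    for _ in range(n_vars):
        col = [1] * ones + [0] * (n_models - ones)
        rng.shuffle(col)  # shuffle each column independently
        columns.append(col)
    # Transpose so that each row is one variable combination.
    return [list(row) for row in zip(*columns)]


def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic (ties count 0.5)."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_diff_scores(X, y, cols):
    """Stand-in classifier (NOT PLS-LDA): project each sample onto the
    difference of the two class means, using only the columns in cols."""
    pos = [row for row, t in zip(X, y) if t == 1]
    neg = [row for row, t in zip(X, y) if t == 0]
    w = {j: sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
         for j in cols}
    return [sum(w[j] * row[j] for j in cols) for row in X]


def auc_drop_on_exclusion(X, y, M):
    """For each variable, average AUC(with) - AUC(without) over the
    sub-models in M that contain it; positive values suggest the
    variable helps the models it takes part in."""
    n_vars = len(M[0])
    drops = []
    for j in range(n_vars):
        diffs = []
        for row in M:
            cols = [k for k, bit in enumerate(row) if bit]
            rest = [k for k in cols if k != j]
            if j not in cols or not rest:
                continue  # variable absent, or nothing left after removing it
            with_j = auc(mean_diff_scores(X, y, cols), y)
            without_j = auc(mean_diff_scores(X, y, rest), y)
            diffs.append(with_j - without_j)
        drops.append(sum(diffs) / len(diffs) if diffs else 0.0)
    return drops
```

Under these assumptions, calling auc_drop_on_exclusion(X, y, binary_matrix_sampling(200, n_vars)) returns one value per variable. A variable whose removal consistently lowers the AUC gets a positive value, one whose removal raises it gets a negative value, and the third part of the framework would use such values to decide which variables to retain in the next iteration.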



Acknowledgments

This work was financially supported by the National Nature Foundation Committee of P.R. China (Grant Nos. 21275164, 21465016, 21175157 and 21375151) and by the Fundamental Research Funds for the Central Universities of Central South University (Grant No. 2014zzts014). The studies were approved by the university’s review board.

Conflict of interest

The authors declare that there are no conflicts of interest.

Compliance with ethical requirements

All clinical experiments that produced the GC–MS data were approved by the Xiangya Institutional Human Subjects Committee. All clinical experiments that produced the NMR data were approved by the Alberta Cancer Board Research Ethics Board.

Author information

Corresponding authors

Correspondence to Lun-Zhao Yi or Yi-Zeng Liang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOCX 52 kb)

Supplementary material 2 (ZIP 25 kb)

Cite this article

Yun, YH., Liang, F., Deng, BC. et al. Informative metabolites identification by variable importance analysis based on random variable combination. Metabolomics 11, 1539–1551 (2015). https://doi.org/10.1007/s11306-015-0803-x

  • DOI: https://doi.org/10.1007/s11306-015-0803-x
