Monte-Carlo methods for determining optimal number of significant variables. Application to mouse urinary profiles

  • Original Article
Metabolomics

Abstract

Three methods for variable selection are described, namely the t-statistic, Partial Least Squares Discriminant Analysis (PLS-DA) weights and PLS-DA regression coefficients, with the aim of determining which variables are the most significant markers for discriminating between two groups; a variable’s level of significance is related to the magnitude of its statistic. Monte-Carlo methods are employed to determine the empirical significance of variables by randomly permuting the class membership 5000 times to obtain null distributions and comparing the observed statistic for each variable with the null distribution. Seven simulated datasets, each consisting of 200 samples divided equally between two classes and 300 variables, are constructed: in one dataset there are no induced correlations between variables; in two datasets correlations are induced but there is no induced separation between the classes; and in four datasets separation is induced by selecting 20 of the variables to be discriminators. In addition, two metabolomic datasets were analysed, consisting of GC-MS profiles of urinary extracts from mice, to determine the effect of stress and the effect of diet on the urinary chemosignal. It is shown that the t-statistic combined with Monte-Carlo permutations provides similar results to PLS weights. PLS regression coefficients find the fewest markers but, for the simulations, give the lowest false-positive rates.
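The permutation procedure summarised above can be illustrated with a minimal Python sketch for the t-statistic case only. It is not the authors' software. Several details are assumptions made for illustration: the pooled-variance form of the t-statistic, the Gaussian simulation with a mean shift of 1.0 on the 20 discriminators, the `(count + 1) / (n_perm + 1)` form of the empirical p-value, the use of a separate null distribution per variable, and the 0.05 threshold. The function names are hypothetical.

```python
import numpy as np


def t_statistics(X, y):
    """Pooled-variance two-sample t-statistic for every column of X (assumed form)."""
    a, b = X[y == 0], X[y == 1]
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(axis=0, ddof=1)
                  + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(pooled_var * (1 / na + 1 / nb))


def permutation_p_values(X, y, n_perm=5000, seed=0):
    """Empirical p-value per variable from randomly permuted class labels."""
    rng = np.random.default_rng(seed)
    observed = np.abs(t_statistics(X, y))
    exceed = np.zeros(X.shape[1])
    for _ in range(n_perm):
        # Shuffling y breaks any link between class membership and the data,
        # so each pass contributes one draw from the null distribution.
        null = np.abs(t_statistics(X, rng.permutation(y)))
        exceed += null >= observed
    # Adding 1 to numerator and denominator avoids p = 0 (assumed convention).
    return (exceed + 1) / (n_perm + 1)


def simulate(n_samples=200, n_vars=300, n_markers=20, shift=1.0, seed=0):
    """Two equal classes of uncorrelated Gaussian variables.

    The first n_markers columns are shifted in class 1 to act as
    discriminators; the shift size is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_samples // 2)
    X = rng.standard_normal((n_samples, n_vars))
    X[y == 1, :n_markers] += shift
    return X, y


X, y = simulate()
# 1000 permutations keep the demo fast; the abstract reports 5000.
p = permutation_p_values(X, y, n_perm=1000)
selected = np.flatnonzero(p < 0.05)  # 0.05 is an assumed threshold
print(f"{len(selected)} of {X.shape[1]} variables flagged as significant")
```

In a simulation like this the true discriminators are known, so the flagged variables outside the first 20 columns give a direct count of false positives, which is the kind of comparison the abstract reports across methods.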

Acknowledgements

We thank Dr. Sarah Dixon and Dr. Yun Xu of the Centre of Chemometrics for developing software used in this project and for valuable discussions. This work was sponsored by ARO Contract DAAD19-03-1-0215. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

Author information

Corresponding author

Correspondence to Richard G. Brereton.

About this article

Cite this article

Wongravee, K., Lloyd, G.R., Hall, J. et al. Monte-Carlo methods for determining optimal number of significant variables. Application to mouse urinary profiles. Metabolomics 5, 387–406 (2009). https://doi.org/10.1007/s11306-009-0164-4

  • DOI: https://doi.org/10.1007/s11306-009-0164-4
