Abstract
Three methods for variable selection are described, namely the t-statistic, Partial Least Squares Discriminant Analysis (PLS-DA) weights and PLS-DA regression coefficients, with the aim of determining which variables are the most significant markers for discriminating between two groups: a variable’s level of significance is related to the magnitude of its statistic. Monte-Carlo methods are employed to determine the empirical significance of variables, by randomly permuting the class membership 5000 times to obtain null distributions, and comparing the observed statistic for each variable with its null distribution. Seven simulated datasets, each consisting of 200 samples divided equally between two classes and 300 variables, are constructed; in one dataset there are no induced correlations between variables, in two datasets correlations are induced but there is no induced separation between the classes, and in four datasets separation is induced by selecting 20 of the variables to be discriminators. In addition, two metabolomic datasets were analysed, consisting of the GCMS of urinary extracts from mice, to determine the effect of stress and the effect of diet on the urinary chemosignal. It is shown that the t-statistic combined with Monte-Carlo permutations provides similar results to PLS weights. PLS regression coefficients find the fewest markers but, for the simulations, give the lowest false positive rates.
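The permutation procedure summarised in the abstract can be illustrated with a minimal Python sketch for the t-statistic case only. It is not the authors' software: the pooled-variance form of the t-statistic, the two-sided comparison of absolute values, the "+1" correction in the empirical p-value, the shift of 1.5 standard deviations applied to the 20 discriminators, and all function and variable names are assumptions made for illustration. The example uses 1000 permutations to keep the run short, whereas the abstract specifies 5000.

```python
import numpy as np


def t_statistics(X, y):
    """Pooled-variance two-sample t-statistic for each column of X (y holds 0/1 labels)."""
    a, b = X[y == 0], X[y == 1]
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(axis=0, ddof=1)
                  + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(pooled_var * (1 / na + 1 / nb))


def permutation_p_values(X, y, n_perm=5000, seed=0):
    """Empirical two-sided p-value per variable from random permutations of the class labels."""
    rng = np.random.default_rng(seed)
    observed = np.abs(t_statistics(X, y))
    exceed = np.zeros(X.shape[1])
    for _ in range(n_perm):
        # Shuffling the labels breaks any link between class and variables,
        # so each pass contributes one draw from the null distribution.
        null = np.abs(t_statistics(X, rng.permutation(y)))
        exceed += null >= observed
    # Assumed convention: count the observed statistic as one member of the null set.
    return (exceed + 1) / (n_perm + 1)


# Simulated data loosely following the abstract: 200 samples split equally
# between two classes, 300 variables, the first 20 made discriminatory.
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 300))
X[y == 1, :20] += 1.5  # assumed effect size, not taken from the paper

p_values = permutation_p_values(X, y, n_perm=1000)
n_selected = int(np.sum(p_values < 0.01))
print("variables with empirical p < 0.01:", n_selected)
```

The same loop could in principle be reused for PLS-DA weights or regression coefficients by replacing `t_statistics` with a function that fits the model and returns one value per variable, at the cost of refitting the model for every permutation.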
Acknowledgements
We thank Dr. Sarah Dixon and Dr. Yun Xu of the Centre of Chemometrics for developing software used in this project and for valuable discussions. This work was sponsored by ARO Contract DAAD19-03-1-0215. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
Cite this article
Wongravee, K., Lloyd, G.R., Hall, J. et al. Monte-Carlo methods for determining optimal number of significant variables. Application to mouse urinary profiles. Metabolomics 5, 387–406 (2009). https://doi.org/10.1007/s11306-009-0164-4
DOI: https://doi.org/10.1007/s11306-009-0164-4