Abstract
Accurately measured classification performance matters not only for correctly assigning samples to pre-designated classes, such as disease versus control, but also for ensuring that the biological question can be answered competently. There has been minimal investigation within the metabolomics literature of pre-treatment methods and their influence on classification accuracy. The standard approach to pre-treatment prior to classification modelling often uses methods such as autoscaling, which places all variables on a comparable scale and thus helps to achieve separation of two or more groups (target classes). This is often undertaken without any prior investigation into the influence of the pre-treatment method on the data and on the supervised learning techniques employed. Whilst this is useful in many cases for deriving essential information such as predictive ability or visual interpretation, as shown in this study the standard approach is not always the most suitable option available. Here, a study has been conducted to investigate the influence of six pre-treatment methods (autoscaling, range, level, Pareto and vast scaling, as well as no scaling) on four classification models, namely principal components-discriminant function analysis (PC-DFA), support vector machines (SVM), random forests (RF) and k-nearest neighbours (kNN), using three publicly available metabolomics data sets. We have demonstrated that the choice of pre-treatment method can greatly affect the interpretation of the statistical modelling outputs. The results show that data pre-treatment is context dependent and that no single method was superior for all the data sets used.
Whilst vast scaling produced the most robust models in terms of classification rate for PC-DFA of both NMR spectroscopy data sets, in general we conclude that vast scaling and autoscaling produced similar results, both superior to the other four pre-treatment methods, on the NMR and GC–MS data sets. We therefore recommend vast scaling as the primary pre-treatment method, as it appears to be more stable and robust across all the classifiers applied in this study.
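As an illustration of the six pre-treatment options named above, the following is a minimal sketch that scales a single variable (one column of a data matrix). It assumes the definitions given by van den Berg et al. (2006), cited in the reference list; the abstract itself does not state the formulas, so this may differ in detail from what the study used. The function name `scale_column` and the method labels are hypothetical, chosen only for this example.

```python
from statistics import mean, stdev


def scale_column(values, method="auto"):
    """Scale one variable by the named pre-treatment method.

    Formulas are assumed from van den Berg et al. (2006); every method
    except "none" mean-centres the variable before dividing.
    """
    if method == "none":
        return list(values)  # no scaling: data returned unchanged

    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)

    if method == "auto":      # unit variance
        divisor = s
    elif method == "range":   # biological range of the variable
        divisor = max(values) - min(values)
    elif method == "level":   # relative to the mean level
        divisor = m
    elif method == "pareto":  # square root of the standard deviation
        divisor = s ** 0.5
    elif method == "vast":    # autoscaling weighted by mean / sd
        divisor = s * s / m
    else:
        raise ValueError(f"unknown method: {method}")

    # A constant variable (s == 0) or a zero mean (level, vast) gives a
    # zero divisor; this sketch lets the ZeroDivisionError propagate.
    return [(v - m) / divisor for v in values]


if __name__ == "__main__":
    intensities = [2.0, 4.0, 6.0]
    for name in ("none", "auto", "range", "level", "pareto", "vast"):
        print(name, scale_column(intensities, name))
```

Vast scaling multiplies the autoscaled value by the ratio of mean to standard deviation, so variables with a small relative standard deviation receive more weight than under autoscaling alone.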
References
Allwood, J. W., Cheung, W., Xu, Y., Mumm, R., De Vos, R. C. H., Biais, B., et al. (2014). Metabolomics in melon: a new opportunity for aroma analysis. Phytochemistry, 99, 61–72.
Alsberg, B. K., Goodacre, R., Rowland, J. J., & Kell, D. B. (1997). Classification of pyrolysis mass spectra by fuzzy multivariate rule induction-comparison with regression, K-nearest neighbour, neural and decision-tree methods. Analytica Chimica Acta, 348, 389–407.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Bro, R., & Smilde, A. K. (2003). Centering and scaling in component analysis. Journal of Chemometrics, 17, 16–33.
Broadhurst, D. I., & Kell, D. B. (2006). Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics, 2, 171–196.
Brown, M., Dunn, W. B., Ellis, D. I., Goodacre, R., Handl, J., Knowles, J. D., et al. (2005). A metabolome pipeline: from concept to data to knowledge. Metabolomics, 1, 39–51.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121–167.
Craig, A., Cloarec, O., Holmes, E., Nicholson, J. K., & Lindon, J. C. (2006). Scaling and normalization effects in NMR spectroscopic metabonomic data sets. Analytical Chemistry, 78, 2262–2267.
Dunn, W. B., Broadhurst, D. I., Atherton, H. J., Goodacre, R., & Griffin, J. L. (2011). Systems level studies of mammalian metabolomes: the roles of mass spectrometry and nuclear magnetic resonance spectroscopy. Chemical Society Reviews, 40, 387–426.
Efron, B. (1979). 1977 Rietz Lecture. Bootstrap methods: another look at the Jackknife. Annals of Statistics, 7, 1–26.
Efron, B., & Gong, G. (1983). A Leisurely look at the Bootstrap, the Jackknife, and cross-validation. American Statistician, 37, 36–48.
Eriksson, L., Johansson, E., Kettaneh-Wold, N., & Wold, S. (2001). Multi- and Megavariate data analysis: principles and applications. Umeå: Umetrics Academy.
Fiehn, O. (2002). Metabolomics - the link between genotypes and phenotypes. Plant Molecular Biology, 48, 155–171.
Fiehn, O., Kopka, J., Dormann, P., Altmann, T., Trethewey, R. N., & Willmitzer, L. (2000). Metabolite profiling for plant functional genomics. Nature Biotechnology, 18, 1157–1161.
Fiehn, O., Robertson, D., Griffin, J., van der Werf, M., Nikolau, B., Morrison, N., et al. (2007). The metabolomics standards initiative (MSI). Metabolomics, 3, 175–178.
Goodacre, R., Vaidyanathan, S., Dunn, W. B., Harrigan, G. G., & Kell, D. B. (2004). Metabolomics by numbers: acquiring and understanding global metabolite data. Trends in Biotechnology, 22, 245–252.
Goodacre, R., Broadhurst, D., Smilde, A. K., Kristal, B. S., Baker, J. D., Beger, R., et al. (2007). Proposed minimum reporting standards for data analysis in metabolomics. Metabolomics, 3, 231–241.
Gromski, P. S., Xu, Y., Correa, E., Ellis, D. I., Turner, M. L., & Goodacre, R. (2014a). A comparative investigation of modern feature selection and classification approaches for the analysis of mass spectrometry data. Analytica Chimica Acta, 829, 1–8.
Gromski, P. S., Xu, Y., Kotze, H. L., Correa, E., Ellis, D. I., Armitage, E. G., et al. (2014b). Influence of missing values substitutes on multivariate analysis of metabolomics data. Metabolites, 4, 433–452.
Hardy, N. W., & Taylor, C. F. (2007). A roadmap for the establishment of standard data exchange structures for metabolomics. Metabolomics, 3, 243–248.
Haug, K., Salek, R. M., Conesa, P., Hastings, J., de Matos, P., Rijnbeek, M., et al. (2014). MetaboLights-an open-access general-purpose repository for metabolomics studies and associated meta-data. Nucleic Acids Research, 41, D781–D786.
Hollywood, K., Brison, D. R., & Goodacre, R. (2006). Metabolomics: current technologies and future trends. Proteomics, 6, 4716–4723.
Ismail, A. A., & Gill, G. V. (1999). The epidemiology of Type 2 diabetes and its current measurement. Best Practice & Research. Clinical Endocrinology & Metabolism, 13, 197–220.
Karatzoglou, A., Meyer, D., & Hornik, K. (2006). Support Vector Machines in R. Journal of Statistical Software, 15, 1–28.
Kell, D. B., & Goodacre, R. (2014). Metabolomics and systems pharmacology: why and how to model the human metabolic network for drug discovery. Drug Discovery Today, 19, 171–182.
Keller, J. M., Gray, M. R., & Givens, J. A. (1985). A fuzzy K-nearest neighbor algorithm. IEEE Transactions on Systems, Man, and Cybernetics, 15, 580–585.
Keun, H. C., Ebbels, T. M. D., Antti, H., Bollard, M. E., Beckonert, O., Holmes, E., et al. (2003). Improved analysis of multivariate data by variable stability scaling: application to NMR-based metabolic profiling. Analytica Chimica Acta, 490, 265–276.
Kusano, M., Fukushima, A., Arita, M., Jonsson, P., Moritz, T., Kobayashi, M., et al. (2007). Unbiased characterization of genotype-dependent metabolic regulations by metabolomic approach in Arabidopsis thaliana. BMC Systems Biology, 1, 53.
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2, 18–22.
Mamas, M., Dunn, W. B., Neyses, L., & Goodacre, R. (2011). The role of metabolites and metabolomics in clinically applicable biomarkers of disease. Archives of Toxicology, 85, 5–17.
Manly, B. F. J. (1986). Multivariate Statistical Methods: a primer. New York: Chapman and Hall.
Oliver, S. G., Winson, M. K., Kell, D. B., & Baganz, F. (1998). Systematic functional analysis of the yeast genome. Trends in Biotechnology, 16, 373–378.
Salek, R. M., Maguire, M. L., Bentley, E., Rubtsov, D. V., Hough, T., Cheeseman, M., et al. (2007). A metabolomic comparison of urinary changes in type 2 diabetes in mouse, rat, and human. Physiological Genomics, 29, 99–108.
Salek, R. M., Steinbeck, C., Viant, M. R., Goodacre, R., & Dunn, W. B. (2013). The role of reporting standards for metabolite annotation and identification in metabolomic studies. GigaScience, 2, 13.
Sansone, S.-A., Schober, D., Atherton, H. J., Fiehn, O., Jenkins, H., Rocca-Serra, P., et al. (2007). Metabolomics standards initiative: ontology working group work in progress. Metabolomics, 3, 249–256.
Schuhmacher, R., Krska, R., Weckwerth, W., & Goodacre, R. (2013). Metabolomics and metabolite profiling. Analytical and Bioanalytical Chemistry, 405, 5003–5004.
Sumner, L. W., Amberg, A., Barrett, D., Beale, M. H., Beger, R., Daykin, C. A., et al. (2007). Proposed minimum reporting standards for chemical analysis. Metabolomics, 3, 211–221.
R Core Team (2013) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org/. Accessed 6 Nov 2012.
Todeschini, R. (1989). k-Nearest neighbour method: the influence of data transformations and metrics. Chemometrics and Intelligent Laboratory Systems, 6, 213–220.
van den Berg, R. A., Hoefsloot, H. C. J., Westerhuis, J. A., Smilde, A. K., & van der Werf, M. J. (2006). Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics, 7(1), 142.
Vapnik, V. N. (1998). Statistical Learning Theory. New York: John Wiley & Sons.
Wehrens, R. (2011). Chemometrics with R - multivariate data analysis in the natural sciences and life sciences. Berlin Heidelberg: Springer-Verlag.
Westerhuis, J. A., van Velzen, E. J. J., Hoefsloot, H. C. J., & Smilde, A. K. (2008). Discriminant Q(2) (DQ(2)) for improved discrimination in PLSDA models. Metabolomics, 4, 293–296.
Westerhuis, J. A., van Velzen, E. J. J., Hoefsloot, H. C. J., & Smilde, A. K. (2010). Multivariate paired data analysis: multilevel PLSDA versus OPLSDA. Metabolomics, 6, 119–128.
Winder, C. L., Cornmell, R., Schuler, S., Jarvis, R. M., Stephens, G. M., & Goodacre, R. (2011). Metabolic fingerprinting as a tool to monitor whole-cell biotransformations. Analytical and Bioanalytical Chemistry, 399, 387–401.
Xu, Y., Zomer, S., & Brereton, R. G. (2006). Support Vector Machines: a recent method for classification in chemometrics. Critical Reviews in Analytical Chemistry, 36, 177–188.
Zacharias, H. U., Schley, G., Hochrein, J., Klein, M. S., Koeberle, C., Eckardt, K.-U., et al. (2013). Analysis of human urine reveals metabolic changes related to the development of acute kidney injury following cardiac surgery. Metabolomics, 9, 697–707.
Acknowledgments
The authors would like to thank PhastID (Grant Agreement No: 258238), a European project supported within the Seventh Framework Programme for Research and Technological Development, for funding and for the studentship for PSG.
Conflict of interest
The authors have no conflicts of interest to declare.
Compliance with ethical requirements
This study analysed previously collected data from human participants who had provided informed consent. The ethical considerations are described in detail in the two primary research papers by Salek et al. (2007) and Zacharias et al. (2013).
Cite this article
Gromski, P.S., Xu, Y., Hollywood, K.A. et al. The influence of scaling metabolomics data on model classification accuracy. Metabolomics 11, 684–695 (2015). https://doi.org/10.1007/s11306-014-0738-7
DOI: https://doi.org/10.1007/s11306-014-0738-7