The influence of scaling metabolomics data on model classification accuracy

Gromski, Piotr S.; Xu, Yun; Hollywood, Katherine A.; Turner, Michael L.; Goodacre, Royston

doi:10.1007/s11306-014-0738-7

Original Article
Published: 08 October 2014

Metabolomics, volume 11, pages 684–695 (2015)

Piotr S. Gromski¹, Yun Xu¹, Katherine A. Hollywood³, Michael L. Turner² & Royston Goodacre¹

Abstract

Correctly measured classification accuracy matters not only for assigning samples to pre-designated classes, such as disease versus control, but also for ensuring that the biological question can be answered competently. We recognised that there has been minimal investigation within the metabolomics literature of pre-treatment methods and their influence on classification accuracy. The standard approach to pre-treatment prior to classification modelling often uses methods such as autoscaling, which places all variables on a comparable scale and thus allows separation of two or more groups (target classes). This is often undertaken without any prior investigation into the influence of the pre-treatment method on the data and on the supervised learning techniques employed. Whilst the standard approach is useful in many cases for deriving essential information such as predictive ability or visual interpretation, this study shows that it is not always the most suitable option available. Here, we investigate the influence of six pre-treatment methods (autoscaling, range scaling, level scaling, Pareto scaling and vast scaling, as well as no scaling) on four classification models, namely principal components-discriminant function analysis (PC-DFA), support vector machines (SVM), random forests (RF) and k-nearest neighbours (kNN), using three publicly available metabolomics data sets. We demonstrate that the choice of pre-treatment method can greatly affect the interpretation of the statistical modelling outputs. The results show that data pre-treatment is context dependent and that no single method was superior for all the data sets used. Whilst vast scaling produced the most robust models in terms of classification rate for PC-DFA of both NMR spectroscopy data sets, in general we conclude that vast scaling and autoscaling produced similar results, both superior to those of the other four pre-treatment methods, on the NMR and GC–MS data sets. We therefore recommend vast scaling as the primary pre-treatment method, as it appears to be more stable and robust across all the classifiers used in this study.
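The abstract names the scaling methods without defining them. The sketch below illustrates one common set of definitions, assumed here to be those of van den Berg et al. (2006): each method mean-centres a variable and then multiplies it by a method-specific factor. The function names `scale_variable` and `scale_matrix` are hypothetical, and treating "no scaling" as returning the raw values unchanged is an assumption. This is not the authors' code.

```python
import math
import statistics


def scale_variable(values, method="auto"):
    """Scale one variable (one column of the data matrix).

    Definitions assumed from van den Berg et al. (2006); illustrative only.
    Constant variables (sd = 0) and zero means are not handled and will
    raise ZeroDivisionError for the methods that divide by them.
    """
    if method == "none":
        # Assumption: "no scaling" leaves the raw values untouched.
        return list(values)

    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)

    if method == "auto":        # unit variance
        factor = 1.0 / sd
    elif method == "range":     # divide by the observed range
        factor = 1.0 / (max(values) - min(values))
    elif method == "pareto":    # divide by the square root of the sd
        factor = 1.0 / math.sqrt(sd)
    elif method == "level":     # divide by the mean
        factor = 1.0 / mean
    elif method == "vast":      # autoscaling multiplied by mean / sd
        factor = mean / (sd * sd)
    else:
        raise ValueError(f"unknown scaling method: {method!r}")

    return [(v - mean) * factor for v in values]


def scale_matrix(rows, method="auto"):
    """Apply scale_variable to every column of a samples-by-variables table."""
    columns = [list(col) for col in zip(*rows)]
    scaled = [scale_variable(col, method) for col in columns]
    return [list(row) for row in zip(*scaled)]


if __name__ == "__main__":
    column = [2.0, 4.0, 6.0]  # mean 4, sample sd 2
    for name in ("none", "auto", "range", "pareto", "level", "vast"):
        print(name, scale_variable(column, name))
```

When such scaling feeds a classifier, the mean and scaling factor would normally be estimated on the training samples only and then applied to the test samples; the sketch omits that step for brevity.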

References

Allwood, J. W., Cheung, W., Xu, Y., Mumm, R., De Vos, R. C. H., Biais, B., et al. (2014). Metabolomics in melon: a new opportunity for aroma analysis. Phytochemistry, 99, 61–72.
Alsberg, B. K., Goodacre, R., Rowland, J. J., & Kell, D. B. (1997). Classification of pyrolysis mass spectra by fuzzy multivariate rule induction-comparison with regression, K-nearest neighbour, neural and decision-tree methods. Analytica Chimica Acta, 348, 389–407.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Bro, R., & Smilde, A. K. (2003). Centering and scaling in component analysis. Journal of Chemometrics, 17, 16–33.
Broadhurst, D. I., & Kell, D. B. (2006). Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics, 2, 171–196.
Brown, M., Dunn, W. B., Ellis, D. I., Goodacre, R., Handl, J., Knowles, J. D., et al. (2005). A metabolome pipeline: from concept to data to knowledge. Metabolomics, 1, 39–51.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121–167.
Craig, A., Cloarec, O., Holmes, E., Nicholson, J. K., & Lindon, J. C. (2006). Scaling and normalization effects in NMR spectroscopic metabonomic data sets. Analytical Chemistry, 78, 2262–2267.
Dunn, W. B., Broadhurst, D. I., Atherton, H. J., Goodacre, R., & Griffin, J. L. (2011). Systems level studies of mammalian metabolomes: the roles of mass spectrometry and nuclear magnetic resonance spectroscopy. Chemical Society Reviews, 40, 387–426.
Efron, B. (1979). 1977 Rietz Lecture. Bootstrap methods: another look at the Jackknife. Annals of Statistics, 7, 1–26.
Efron, B., & Gong, G. (1983). A Leisurely look at the Bootstrap, the Jackknife, and cross-validation. American Statistician, 37, 36–48.
Eriksson, L., Johansson, E., Kettaneh-Wold, N., & Wold, S. (2001). Multi- and Megavariate data analysis: principles and applications. Umeå: Umetrics Academy.
Fiehn, O. (2002). Metabolomics - the link between genotypes and phenotypes. Plant Molecular Biology, 48, 155–171.
Fiehn, O., Kopka, J., Dormann, P., Altmann, T., Trethewey, R. N., & Willmitzer, L. (2000). Metabolite profiling for plant functional genomics. Nature Biotechnology, 18, 1157–1161.
Fiehn, O., Robertson, D., Griffin, J., van der Werf, M., Nikolau, B., Morrison, N., et al. (2007). The metabolomics standards initiative (MSI). Metabolomics, 3, 175–178.
Goodacre, R., Vaidyanathan, S., Dunn, W. B., Harrigan, G. G., & Kell, D. B. (2004). Metabolomics by numbers: acquiring and understanding global metabolite data. Trends in Biotechnology, 22, 245–252.
Goodacre, R., Broadhurst, D., Smilde, A. K., Kristal, B. S., Baker, J. D., Beger, R., et al. (2007). Proposed minimum reporting standards for data analysis in metabolomics. Metabolomics, 3, 231–241.
Gromski, P. S., Xu, Y., Correa, E., Ellis, D. I., Turner, M. L., & Goodacre, R. (2014a). A comparative investigation of modern feature selection and classification approaches for the analysis of mass spectrometry data. Analytica Chimica Acta, 829, 1–8.
Gromski, P. S., Xu, Y., Kotze, H. L., Correa, E., Ellis, D. I., Armitage, E. G., et al. (2014b). Influence of missing values substitutes on multivariate analysis of metabolomics data. Metabolites, 4, 433–452.
Hardy, N. W., & Taylor, C. F. (2007). A roadmap for the establishment of standard data exchange structures for metabolomics. Metabolomics, 3, 243–248.
Haug, K., Salek, R. M., Conesa, P., Hastings, J., de Matos, P., Rijnbeek, M., et al. (2014). MetaboLights-an open-access general-purpose repository for metabolomics studies and associated meta-data. Nucleic Acids Research, 41, D781–D786.
Hollywood, K., Brison, D. R., & Goodacre, R. (2006). Metabolomics: current technologies and future trends. Proteomics, 6, 4716–4723.
Ismail, A. A., & Gill, G. V. (1999). The epidemiology of Type 2 diabetes and its current measurement. Best Practice & Research. Clinical Endocrinology & Metabolism, 13, 197–220.
Karatzoglou, A., Meyer, D., & Hornik, K. (2006). Support Vector Machines in R. Journal of Statistical Software, 15, 1–28.
Kell, D. B., & Goodacre, R. (2014). Metabolomics and systems pharmacology: why and how to model the human metabolic network for drug discovery. Drug Discovery Today, 19, 171–182.
Keller, J. M., Gray, M. R., & Givens, J. A. (1985). A fuzzy K-nearest neighbor algorithm. IEEE Transactions on Systems, Man, and Cybernetics, 15, 580–585.
Keun, H. C., Ebbels, T. M. D., Antti, H., Bollard, M. E., Beckonert, O., Holmes, E., et al. (2003). Improved analysis of multivariate data by variable stability scaling: application to NMR-based metabolic profiling. Analytica Chimica Acta, 490, 265–276.
Kusano, M., Fukushima, A., Arita, M., Jonsson, P., Moritz, T., Kobayashi, M., et al. (2007). Unbiased characterization of genotype-dependent metabolic regulations by metabolomic approach in Arabidopsis thaliana. BMC Systems Biology, 1, 53.
Liaw, A., & Wiener, M. (2002). Classification and Regression by randomForest. R News, 2, 18–22.
Mamas, M., Dunn, W. B., Neyses, L., & Goodacre, R. (2011). The role of metabolites and metabolomics in clinically applicable biomarkers of disease. Archives of Toxicology, 85, 5–17.
Manly, B. F. J. (1986). Multivariate Statistical Methods: a primer. New York: Chapman and Hall.
Oliver, S. G., Winson, M. K., Kell, D. B., & Baganz, F. (1998). Systematic functional analysis of the yeast genome. Trends in Biotechnology, 16, 373–378.
Salek, R. M., Maguire, M. L., Bentley, E., Rubtsov, D. V., Hough, T., Cheeseman, M., et al. (2007). A metabolomic comparison of urinary changes in type 2 diabetes in mouse, rat, and human. Physiological Genomics, 29, 99–108.
Salek, R. M., Steinbeck, C., Viant, M. R., Goodacre, R., & Dunn, W. B. (2013). The role of reporting standards for metabolite annotation and identification in metabolomic studies. GigaScience, 2, 13.
Sansone, S.-A., Schober, D., Atherton, H. J., Fiehn, O., Jenkins, H., Rocca-Serra, P., et al. (2007). Metabolomics standards initiative: ontology working group work in progress. Metabolomics, 3, 249–256.
Schuhmacher, R., Krska, R., Weckwerth, W., & Goodacre, R. (2013). Metabolomics and metabolite profiling. Analytical and Bioanalytical Chemistry, 405, 5003–5004.
Sumner, L. W., Amberg, A., Barrett, D., Beale, M. H., Beger, R., Daykin, C. A., et al. (2007). Proposed minimum reporting standards for chemical analysis. Metabolomics, 3, 211–221.
R Core Team (2013) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org/. Accessed 6 Nov 2012.
Todeschini, R. (1989). k-Nearest neighbour method: the influence of data transformations and metrics. Chemometrics and Intelligent Laboratory Systems, 6, 213–220.
van den Berg, R. A., Hoefsloot, H. C. J., Westerhuis, J. A., Smilde, A. K., & van der Werf, M. J. (2006). Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics, 7(1), 142.
Vapnik, V. N. (1998). Statistical Learning Theory. New York: John Wiley & Sons.
Wehrens, R. (2011). Chemometrics with R - multivariate data analysis in the natural sciences and life sciences. Berlin Heidelberg: Springer-Verlag.
Westerhuis, J. A., van Velzen, E. J. J., Hoefsloot, H. C. J., & Smilde, A. K. (2008). Discriminant Q² (DQ²) for improved discrimination in PLSDA models. Metabolomics, 4, 293–296.
Westerhuis, J. A., van Velzen, E. J. J., Hoefsloot, H. C. J., & Smilde, A. K. (2010). Multivariate paired data analysis: multilevel PLSDA versus OPLSDA. Metabolomics, 6, 119–128.
Winder, C. L., Cornmell, R., Schuler, S., Jarvis, R. M., Stephens, G. M., & Goodacre, R. (2011). Metabolic fingerprinting as a tool to monitor whole-cell biotransformations. Analytical and Bioanalytical Chemistry, 399, 387–401.
Xu, Y., Zomer, S., & Brereton, R. G. (2006). Support Vector Machines: a recent method for classification in chemometrics. Critical Reviews in Analytical Chemistry, 36, 177–188.
Zacharias, H. U., Schley, G., Hochrein, J., Klein, M. S., Koeberle, C., Eckardt, K.-U., et al. (2013). Analysis of human urine reveals metabolic changes related to the development of acute kidney injury following cardiac surgery. Metabolomics, 9, 697–707.

Acknowledgments

The authors would like to thank PhastID (Grant Agreement No: 258238), a European project supported within the Seventh Framework Programme for Research and Technological Development, for funding and for the studentship for PSG.

Conflict of interest

The authors have no conflicts of interest to declare.

Compliance with ethical requirements

This study analysed previously collected data from human participants who had provided informed consent. The ethical issues are described in detail in the two primary research papers (Salek et al. 2007; Zacharias et al. 2013).

Author information

Authors and Affiliations

School of Chemistry, Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester, M1 7DN, UK
Piotr S. Gromski, Yun Xu & Royston Goodacre
School of Chemistry, The University of Manchester, Brunswick Street, Manchester, M13 9PL, UK
Michael L. Turner
Faculty of Life Science, Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester, M1 7DN, UK
Katherine A. Hollywood

Corresponding author

Correspondence to Piotr S. Gromski.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOCX 56 kb)

About this article

Cite this article

Gromski, P.S., Xu, Y., Hollywood, K.A. et al. The influence of scaling metabolomics data on model classification accuracy. Metabolomics 11, 684–695 (2015). https://doi.org/10.1007/s11306-014-0738-7

Received: 09 July 2014
Accepted: 26 September 2014
Published: 08 October 2014
Issue Date: June 2015
DOI: https://doi.org/10.1007/s11306-014-0738-7
