The influence of scaling metabolomics data on model classification accuracy

  • Original Article
Metabolomics

Abstract

Correctly measured classification accuracy is important not only for properly classifying pre-designated classes, such as disease versus control, but also for ensuring that the biological question can be answered competently. We recognised that there has been minimal investigation of pre-treatment methods and their influence on classification accuracy within the metabolomics literature. The standard approach to pre-treatment prior to classification modelling often uses methods such as autoscaling, which places all variables on a comparable scale and thus allows separation of two or more groups (target classes). This is often undertaken without any prior investigation into the influence of the pre-treatment method on the data and on the supervised learning techniques employed. Whilst this is in many cases useful for deriving essential information such as predictive ability or visual interpretation, as shown in this study the standard approach is not always the most suitable option available. Here, a study has been conducted to investigate the influence of six pre-treatment methods (autoscaling, range, level, Pareto and vast scaling, as well as no scaling) on four classification models: principal components-discriminant function analysis (PC-DFA), support vector machines (SVM), random forests (RF) and k-nearest neighbours (kNN), using three publicly available metabolomics data sets. We have demonstrated that different pre-treatment methods can greatly affect the interpretation of the statistical modelling outputs. The results have shown that data pre-treatment is context dependent and that there was no single superior method for all the data sets used. Whilst we did find that vast scaling produced the most robust models in terms of classification rate for PC-DFA of both NMR spectroscopy data sets, in general we conclude that vast scaling and autoscaling produced similar results that were superior to those of the other four pre-treatment methods on both NMR and GC–MS data sets. It is therefore our recommendation that vast scaling be the primary pre-treatment method to use, as this method appears to be more stable and robust across all the different classifiers used in this study.
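
For readers unfamiliar with the pre-treatment methods named in the abstract, the following is a minimal sketch of how they are commonly defined (for example in van den Berg et al. 2006, listed in the references). It is not the authors' implementation. The function names `scale_column` and `scale_matrix` are hypothetical, the sample standard deviation is assumed, and treating "no scaling" as returning the data unchanged is an assumption.

```python
from math import sqrt
from statistics import mean, stdev


def scale_column(values, method="auto"):
    """Mean-centre one variable and divide by a method-specific factor.

    Assumes the column is not constant and, for level and vast scaling,
    that its mean is non-zero (otherwise the factor would be zero).
    """
    if method == "none":
        # Assumption: "no scaling" leaves the raw values untouched.
        return list(values)

    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)

    if method == "auto":      # unit variance for every variable
        factor = s
    elif method == "range":   # divide by the observed range
        factor = max(values) - min(values)
    elif method == "pareto":  # divide by the square root of the s.d.
        factor = sqrt(s)
    elif method == "level":   # divide by the mean
        factor = m
    elif method == "vast":    # autoscaling multiplied by mean / s.d.
        factor = s * s / m
    else:
        raise ValueError(f"unknown scaling method: {method}")

    return [(v - m) / factor for v in values]


def scale_matrix(rows, method="auto"):
    """Scale a samples-by-variables matrix one variable (column) at a time."""
    columns = zip(*rows)
    scaled = [scale_column(list(col), method) for col in columns]
    return [list(row) for row in zip(*scaled)]
```

Calling `scale_matrix(data, "vast")` on a list of sample rows returns a matrix of the same shape. Because vast scaling weights each autoscaled variable by its mean divided by its standard deviation, variables with a small relative standard deviation receive more weight than under autoscaling.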

References

  • Allwood, J. W., Cheung, W., Xu, Y., Mumm, R., De Vos, R. C. H., Biais, B., et al. (2014). Metabolomics in melon: a new opportunity for aroma analysis. Phytochemistry, 99, 61–72.

  • Alsberg, B. K., Goodacre, R., Rowland, J. J., & Kell, D. B. (1997). Classification of pyrolysis mass spectra by fuzzy multivariate rule induction-comparison with regression, K-nearest neighbour, neural and decision-tree methods. Analytica Chimica Acta, 348, 389–407.

  • Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.

  • Bro, R., & Smilde, A. K. (2003). Centering and scaling in component analysis. Journal of Chemometrics, 17, 16–33.

  • Broadhurst, D. I., & Kell, D. B. (2006). Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics, 2, 171–196.

  • Brown, M., Dunn, W. B., Ellis, D. I., Goodacre, R., Handl, J., Knowles, J. D., et al. (2005). A metabolome pipeline: from concept to data to knowledge. Metabolomics, 1, 39–51.

  • Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121–167.

  • Craig, A., Cloarec, O., Holmes, E., Nicholson, J. K., & Lindon, J. C. (2006). Scaling and normalization effects in NMR spectroscopic metabonomic data sets. Analytical Chemistry, 78, 2262–2267.

  • Dunn, W. B., Broadhurst, D. I., Atherton, H. J., Goodacre, R., & Griffin, J. L. (2011). Systems level studies of mammalian metabolomes: the roles of mass spectrometry and nuclear magnetic resonance spectroscopy. Chemical Society Reviews, 40, 387–426.

  • Efron, B. (1979). 1977 Rietz Lecture. Bootstrap methods: another look at the Jackknife. Annals of Statistics, 7, 1–26.

  • Efron, B., & Gong, G. (1983). A leisurely look at the Bootstrap, the Jackknife, and cross-validation. American Statistician, 37, 36–48.

  • Eriksson, L., Johansson, E., Kettaneh-Wold, N., & Wold, S. (2001). Multi- and Megavariate data analysis: principles and applications. Umeå: Umetrics Academy.

  • Fiehn, O. (2002). Metabolomics - the link between genotypes and phenotypes. Plant Molecular Biology, 48, 155–171.

  • Fiehn, O., Kopka, J., Dormann, P., Altmann, T., Trethewey, R. N., & Willmitzer, L. (2000). Metabolite profiling for plant functional genomics. Nature Biotechnology, 18, 1157–1161.

  • Fiehn, O., Robertson, D., Griffin, J., van der Werf, M., Nikolau, B., Morrison, N., et al. (2007). The metabolomics standards initiative (MSI). Metabolomics, 3, 175–178.

  • Goodacre, R., Vaidyanathan, S., Dunn, W. B., Harrigan, G. G., & Kell, D. B. (2004). Metabolomics by numbers: acquiring and understanding global metabolite data. Trends in Biotechnology, 22, 245–252.

  • Goodacre, R., Broadhurst, D., Smilde, A. K., Kristal, B. S., Baker, J. D., Beger, R., et al. (2007). Proposed minimum reporting standards for data analysis in metabolomics. Metabolomics, 3, 231–241.

  • Gromski, P. S., Xu, Y., Correa, E., Ellis, D. I., Turner, M. L., & Goodacre, R. (2014a). A comparative investigation of modern feature selection and classification approaches for the analysis of mass spectrometry data. Analytica Chimica Acta, 829, 1–8.

  • Gromski, P. S., Xu, Y., Kotze, H. L., Correa, E., Ellis, D. I., Armitage, E. G., et al. (2014b). Influence of missing values substitutes on multivariate analysis of metabolomics data. Metabolites, 4, 433–452.

  • Hardy, N. W., & Taylor, C. F. (2007). A roadmap for the establishment of standard data exchange structures for metabolomics. Metabolomics, 3, 243–248.

  • Haug, K., Salek, R. M., Conesa, P., Hastings, J., de Matos, P., Rijnbeek, M., et al. (2014). MetaboLights-an open-access general-purpose repository for metabolomics studies and associated meta-data. Nucleic Acids Research, 41, D781–D786.

  • Hollywood, K., Brison, D. R., & Goodacre, R. (2006). Metabolomics: current technologies and future trends. Proteomics, 6, 4716–4723.

  • Ismail, A. A., & Gill, G. V. (1999). The epidemiology of Type 2 diabetes and its current measurement. Best Practice & Research. Clinical Endocrinology & Metabolism, 13, 197–220.

  • Karatzoglou, A., Meyer, D., & Hornik, K. (2006). Support Vector Machines in R. Journal of Statistical Software, 15, 1–28.

  • Kell, D. B., & Goodacre, R. (2014). Metabolomics and systems pharmacology: why and how to model the human metabolic network for drug discovery. Drug Discovery Today, 19, 171–182.

  • Keller, J. M., Gray, M. R., & Givens, J. A. (1985). A fuzzy K-nearest neighbor algorithm. IEEE Transactions on Systems, Man, and Cybernetics, 15, 580–585.

  • Keun, H. C., Ebbels, T. M. D., Antti, H., Bollard, M. E., Beckonert, O., Holmes, E., et al. (2003). Improved analysis of multivariate data by variable stability scaling: application to NMR-based metabolic profiling. Analytica Chimica Acta, 490, 265–276.

  • Kusano, M., Fukushima, A., Arita, M., Jonsson, P., Moritz, T., Kobayashi, M., et al. (2007). Unbiased characterization of genotype-dependent metabolic regulations by metabolomic approach in Arabidopsis thaliana. BMC Systems Biology, 1, 53.

  • Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2, 18–22.

  • Mamas, M., Dunn, W. B., Neyses, L., & Goodacre, R. (2011). The role of metabolites and metabolomics in clinically applicable biomarkers of disease. Archives of Toxicology, 85, 5–17.

  • Manly, B. F. J. (1986). Multivariate Statistical Methods: a primer. New York: Chapman and Hall.

  • Oliver, S. G., Winson, M. K., Kell, D. B., & Baganz, F. (1998). Systematic functional analysis of the yeast genome. Trends in Biotechnology, 16, 373–378.

  • Salek, R. M., Maguire, M. L., Bentley, E., Rubtsov, D. V., Hough, T., Cheeseman, M., et al. (2007). A metabolomic comparison of urinary changes in type 2 diabetes in mouse, rat, and human. Physiological Genomics, 29, 99–108.

  • Salek, R. M., Steinbeck, C., Viant, M. R., Goodacre, R., & Dunn, W. B. (2013). The role of reporting standards for metabolite annotation and identification in metabolomic studies. GigaScience, 2, 13.

  • Sansone, S.-A., Schober, D., Atherton, H. J., Fiehn, O., Jenkins, H., Rocca-Serra, P., et al. (2007). Metabolomics standards initiative: ontology working group work in progress. Metabolomics, 3, 249–256.

  • Schuhmacher, R., Krska, R., Weckwerth, W., & Goodacre, R. (2013). Metabolomics and metabolite profiling. Analytical and Bioanalytical Chemistry, 405, 5003–5004.

  • Sumner, L. W., Amberg, A., Barrett, D., Beale, M. H., Beger, R., Daykin, C. A., et al. (2007). Proposed minimum reporting standards for chemical analysis. Metabolomics, 3, 211–221.

  • R Core Team. (2013). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0, http://www.R-project.org/. Accessed 6 Nov 2012.

  • Todeschini, R. (1989). k-Nearest neighbour method: the influence of data transformations and metrics. Chemometrics and Intelligent Laboratory Systems, 6, 213–220.

  • van den Berg, R. A., Hoefsloot, H. C. J., Westerhuis, J. A., Smilde, A. K., & van der Werf, M. J. (2006). Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics, 7(1), 142.

  • Vapnik, V. N. (1998). Statistical Learning Theory. New York: John Wiley & Sons.

  • Wehrens, R. (2011). Chemometrics with R - multivariate data analysis in the natural sciences and life sciences. Berlin Heidelberg: Springer-Verlag.

  • Westerhuis, J. A., van Velzen, E. J. J., Hoefsloot, H. C. J., & Smilde, A. K. (2008). Discriminant Q(2) (DQ(2)) for improved discrimination in PLSDA models. Metabolomics, 4, 293–296.

  • Westerhuis, J. A., van Velzen, E. J. J., Hoefsloot, H. C. J., & Smilde, A. K. (2010). Multivariate paired data analysis: multilevel PLSDA versus OPLSDA. Metabolomics, 6, 119–128.

  • Winder, C. L., Cornmell, R., Schuler, S., Jarvis, R. M., Stephens, G. M., & Goodacre, R. (2011). Metabolic fingerprinting as a tool to monitor whole-cell biotransformations. Analytical and Bioanalytical Chemistry, 399, 387–401.

  • Xu, Y., Zomer, S., & Brereton, R. G. (2006). Support Vector Machines: a recent method for classification in chemometrics. Critical Reviews in Analytical Chemistry, 36, 177–188.

  • Zacharias, H. U., Schley, G., Hochrein, J., Klein, M. S., Koeberle, C., Eckardt, K.-U., et al. (2013). Analysis of human urine reveals metabolic changes related to the development of acute kidney injury following cardiac surgery. Metabolomics, 9, 697–707.

Acknowledgments

The authors would like to thank PhastID (Grant Agreement No: 258238), a European project supported within the Seventh Framework Programme for Research and Technological Development, for funding and for the studentship for PSG.

Conflict of interest

The authors have no conflicts of interest to declare.

Compliance with ethical requirements

This study analysed previously collected data involving human participants who had provided informed consent. These ethical issues are described in detail in the two primary research papers (Salek et al. 2007; Zacharias et al. 2013).

Author information

Corresponding author

Correspondence to Piotr S. Gromski.

Electronic supplementary material

Supplementary material 1 (DOCX 56 kb)

About this article

Cite this article

Gromski, P.S., Xu, Y., Hollywood, K.A. et al. The influence of scaling metabolomics data on model classification accuracy. Metabolomics 11, 684–695 (2015). https://doi.org/10.1007/s11306-014-0738-7

  • DOI: https://doi.org/10.1007/s11306-014-0738-7
