Abstract
Machine learning is a form of artificial intelligence (AI) that gives computers the ability to learn without being explicitly programmed; in practice, it refers to the ability of computer programs to adapt when exposed to new data. Here we examine the use of machine learning with untargeted metabolomics data, when it is appropriate to use, and the questions it can answer. We provide an example workflow for training and testing a simple binary classifier, a multiclass classifier, and a support vector machine using the Waikato Environment for Knowledge Analysis (Weka), a toolkit for machine learning. This workflow should provide a framework for greater integration of machine learning into metabolomics studies.
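The train-and-test idea behind a binary classifier can be illustrated with a minimal sketch. It is not the Weka workflow the protocol describes: it assumes synthetic intensity values, and it uses a nearest-centroid classifier in plain Python as a stand-in for the Weka classifiers. All function names, class labels, and parameters below are hypothetical.

```python
import random

def make_samples(n_per_class, n_features, shift, rng):
    """Build synthetic 'intensity' rows; the 'case' class has its first feature shifted."""
    samples = []
    for label, offset in (("control", 0.0), ("case", shift)):
        for _ in range(n_per_class):
            row = [rng.gauss(10.0, 1.0) for _ in range(n_features)]
            row[0] += offset
            samples.append((row, label))
    return samples

def train_centroids(train):
    """Training step: compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in train:
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, row):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def sqdist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sqdist(centroids[label]))

def holdout_accuracy(samples, test_fraction, rng):
    """Shuffle, hold out a test set, train on the rest, and score the held-out rows."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    centroids = train_centroids(train)
    correct = sum(predict(centroids, row) == label for row, label in test)
    return correct / len(test)

rng = random.Random(0)  # fixed seed so the split is reproducible
data = make_samples(n_per_class=30, n_features=5, shift=8.0, rng=rng)
accuracy = holdout_accuracy(data, test_fraction=0.3, rng=rng)
print(f"hold-out accuracy: {accuracy:.2f}")
```

The point of the sketch is that the classifier is scored only on rows it never saw during training, which is what any train/test workflow, including one built in Weka, relies on to give an honest accuracy estimate.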
Acknowledgments
The authors acknowledge that this work was part of the DOE Joint BioEnergy Institute (http://www.jbei.org), supported by the US Department of Energy, Office of Science, Office of Biological and Environmental Research, through contract DE-AC02-05CH11231 between Lawrence Berkeley National Laboratory and the US Department of Energy.
Electronic Supplementary Material
Supplementary File 1
Example data files containing mass spectrometry-based intensity (relative abundance) information for metabolites, in both .csv and .arff formats (ZIP 524 kb)
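For readers unfamiliar with the relationship between the two formats, the following is a minimal sketch of how a CSV intensity table could be rewritten as ARFF, the text format Weka reads. It assumes a header row, numeric feature columns, and a single nominal class column; the column names (`mz_180_06`, `mz_146_05`, `group`), the values, and the helper `csv_to_arff` are hypothetical and do not describe the layout of the supplementary files.

```python
import csv
import io

def csv_to_arff(csv_text, relation, class_column):
    """Convert CSV text (header row, numeric features, one class column) to ARFF text.

    Assumes attribute names and class labels contain no spaces or commas;
    names with such characters would need quoting in a real ARFF file.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0]
    body = [r for r in rows[1:] if r]  # skip blank lines
    class_idx = header.index(class_column)
    classes = sorted({r[class_idx] for r in body})

    lines = [f"@relation {relation}", ""]
    for i, name in enumerate(header):
        if i == class_idx:
            # Nominal attribute: list the allowed class labels in braces.
            lines.append(f"@attribute {name} {{{','.join(classes)}}}")
        else:
            lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    lines += [",".join(r) for r in body]
    return "\n".join(lines) + "\n"

# Hypothetical two-sample table: two metabolite features and a class label.
example_csv = """mz_180_06,mz_146_05,group
1520.5,300.1,control
1498.2,870.4,treated
"""
arff_text = csv_to_arff(example_csv, relation="metabolite_intensities", class_column="group")
print(arff_text)
```

The ARFF header declares each column's type up front, which is why the class labels must be collected before the data rows are written.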
Copyright information
© 2019 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Heinemann, J. (2019). Machine Learning in Untargeted Metabolomics Experiments. In: Baidoo, E. (eds) Microbial Metabolomics. Methods in Molecular Biology, vol 1859. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-8757-3_17
DOI: https://doi.org/10.1007/978-1-4939-8757-3_17
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-8756-6
Online ISBN: 978-1-4939-8757-3
eBook Packages: Springer Protocols