Machine Learning in Untargeted Metabolomics Experiments

  • Joshua Heinemann
Part of the Methods in Molecular Biology book series (MIMB, volume 1859)


Machine learning is a form of artificial intelligence (AI) that gives computers the ability to learn from data without being explicitly programmed; in practice, it refers to the ability of computer programs to adapt when exposed to new data. Here we examine the application of machine learning to untargeted metabolomics data, when its use is appropriate, and which questions it can answer. We provide an example workflow for training and testing a simple binary classifier, a multiclass classifier, and a support vector machine using the Waikato Environment for Knowledge Analysis (Weka), a toolkit for machine learning. This workflow should provide a framework for greater integration of machine learning into metabolomics studies.

Key words

Machine learning; Untargeted metabolomics; Supervised learning



The authors would also like to acknowledge that this work was part of the DOE Joint BioEnergy Institute, supported by the US Department of Energy, Office of Science, Office of Biological and Environmental Research, through contract DE-AC02-05CH11231 between Lawrence Berkeley National Laboratory and the US Department of Energy.

Supplementary material (493 kb)
Supplementary File 1 Example data files containing mass spectrometry based intensity (relative abundance) information for metabolites in both .csv and .arff format (ZIP 524 kb)
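
The relationship between the two supplied formats can be illustrated with a minimal sketch that converts a CSV intensity table into Weka's ARFF layout (an `@relation` line, one `@attribute` line per column, then `@data`). The actual column layout of the supplementary files is not described here, so the column names (`mz_101`, `mz_202`, `class`), the relation name, and the function `csv_to_arff` are hypothetical, and the sketch assumes that every column other than the class label holds numeric intensities and that column names contain no spaces or special characters.

```python
import csv
import io


def csv_to_arff(csv_text, relation="metabolomics", class_column="class"):
    """Convert CSV text (header row plus one row per sample) to ARFF text.

    Assumption: all columns except `class_column` are numeric intensities.
    """
    # Skip blank lines, which csv.reader returns as empty lists.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, samples = rows[0], rows[1:]
    class_idx = header.index(class_column)

    # The class label becomes a nominal attribute listing every observed value.
    labels = sorted({row[class_idx] for row in samples})

    lines = [f"@relation {relation}", ""]
    for i, name in enumerate(header):
        if i == class_idx:
            lines.append(f"@attribute {name} {{{','.join(labels)}}}")
        else:
            lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    lines += [",".join(row) for row in samples]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical two-feature, two-sample table for illustration only.
    example = (
        "mz_101,mz_202,class\n"
        "1200.5,30.1,control\n"
        "980.0,45.7,treated\n"
    )
    print(csv_to_arff(example))
```

Declaring the class column as a nominal attribute, rather than a string, is what lets a Weka classifier treat it as the prediction target; a real conversion would also need to quote names containing spaces and handle missing values.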



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, USA
  2. Joint BioEnergy Institute, Emeryville, USA
