Volume 6, Issue 1, pp 18–26

Functional prediction of unidentified lipids using supervised classifiers

  • Laxman Yetukuri
  • Jarkko Tikka
  • Jaakko Hollmén
  • Matej Orešič
Original Article


Mass spectrometry (MS)-based metabolomics studies often require handling of both identified and unidentified metabolite data. To avoid bias in data interpretation, the data analysis should include all available data. A practical challenge in exploratory metabolomics analysis is therefore how to interpret changes related to unidentified peaks. In this paper, we address this challenge by predicting the class membership of unknown peaks, applying and comparing multiple supervised classifiers on selected lipidomics datasets. The classifiers employed are k-nearest neighbours (k-NN), support vector machines (SVM), partial least squares discriminant analysis (PLS-DA) and Naive Bayes, which are known to be effective and efficient in predicting labels for unseen data. Here, class label predictions are sought for unidentified lipid profiles obtained from high-throughput global screening in an Ultra Performance Liquid Chromatography Mass Spectrometry (UPLC™/MS) experimental setup. Our investigation reveals that the k-NN and SVM classifiers outperform both the PLS-DA and Naive Bayes classifiers. The Naive Bayes classifier performs worst of all the models, which seems logical because lipids are highly co-regulated and so violate the Naive Bayes assumption that features are conditionally independent given the class. Label predictions on which k-NN and SVM agree can serve as a good starting point for exploring the full data, thereby facilitating exploratory studies where label information is critical for data interpretation.
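The idea of labelling unidentified peaks from identified ones, and keeping only the labels on which two classifiers agree, can be illustrated with a minimal sketch. This is not the authors' implementation: the profiles, the class labels "TG" and "PC", and all function names are hypothetical, and a simple nearest-centroid rule stands in for the SVM so that the example needs only the Python standard library.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training profiles."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def centroid_predict(train_X, train_y, x):
    """Stand-in second classifier (NOT an SVM): label of the nearest class centroid."""
    sums, counts = {}, Counter(train_y)
    for profile, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(profile))
        for i, value in enumerate(profile):
            acc[i] += value
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}
    return min(centroids, key=lambda lab: dist(centroids[lab], x))


def consensus_labels(train_X, train_y, unknown_X, k=3):
    """Keep a predicted label only when both classifiers agree; otherwise None."""
    result = []
    for x in unknown_X:
        a = knn_predict(train_X, train_y, x, k)
        b = centroid_predict(train_X, train_y, x)
        result.append(a if a == b else None)
    return result


# Hypothetical toy data: each row is one lipid's intensity profile across samples.
identified = [[1.0, 1.1, 0.9], [1.2, 0.9, 1.0], [0.9, 1.0, 1.1],
              [3.0, 3.2, 2.9], [3.1, 2.8, 3.0], [2.9, 3.0, 3.1]]
labels = ["TG", "TG", "TG", "PC", "PC", "PC"]  # hypothetical class labels
unidentified = [[1.1, 1.0, 1.0], [3.0, 3.0, 3.0]]

predicted = consensus_labels(identified, labels, unidentified)
print(predicted)
```

Peaks for which the two classifiers disagree are returned as None, so they can be set aside rather than given an uncertain label.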


Keywords: Lipidomics, Mass spectrometry, Machine learning, k-NN, SVM, PLS-DA, Naive Bayes



This project was supported by the Academy of Finland (Decision # 111338).



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Laxman Yetukuri (1)
  • Jarkko Tikka (2)
  • Jaakko Hollmén (2)
  • Matej Orešič (1)
  1. VTT Technical Research Centre of Finland, Espoo, Finland
  2. Department of Information and Computer Science, Helsinki University of Technology, TKK, Espoo, Finland
