Functional prediction of unidentified lipids using supervised classifiers
Mass spectrometry (MS)-based metabolomics studies often require handling both identified and unidentified metabolite data. To avoid bias in data interpretation, the analysis should ideally include all available data. A practical challenge in exploratory metabolomics analysis is therefore how to interpret changes related to unidentified peaks. In this paper, we address this challenge by predicting the class membership of unknown peaks, applying and comparing multiple supervised classifiers on selected lipidomics datasets. The classifiers employed are k-nearest neighbours (k-NN), support vector machines (SVM), partial least squares discriminant analysis (PLS-DA) and Naive Bayes, all of which are known to be effective and efficient at predicting labels for unseen data. Here, class label predictions are sought for unidentified lipid profiles obtained from high-throughput global screening in an Ultra Performance Liquid Chromatography Mass Spectrometry (UPLC™/MS) experimental setup. Our investigation reveals that the k-NN and SVM classifiers outperform both the PLS-DA and Naive Bayes classifiers. The Naive Bayes classifier performs worst of all the models, which seems logical because lipids are highly co-regulated and therefore violate the Naive Bayes assumption that features are conditionally independent given the class. Label predictions on which k-NN and SVM agree can serve as a good starting point for exploring the full data, and thereby facilitate exploratory studies where label information is critical for data interpretation.
Keywords: Lipidomics, Mass spectrometry, Machine learning, k-NN, SVM, PLS-DA, Naive Bayes
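The label-prediction idea described in the abstract can be illustrated with a minimal k-NN sketch in plain Python. It is not the authors' implementation: the toy profiles, the class names ("TG", "PC"), the Euclidean distance, the choice of k = 3 and the leave-one-out check are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length intensity profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_profiles, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest labelled profiles."""
    order = sorted(
        range(len(train_profiles)),
        key=lambda i: euclidean(train_profiles[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(profiles, labels, k=3):
    """Leave-one-out accuracy on the identified (labelled) peaks only."""
    hits = 0
    for i in range(len(profiles)):
        rest_profiles = profiles[:i] + profiles[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(rest_profiles, rest_labels, profiles[i], k) == labels[i]:
            hits += 1
    return hits / len(profiles)


# Hypothetical toy data: each row is one peak's scaled intensity across
# four samples. Real lipidomics profiles would have many more samples.
identified = [
    [1.0, 1.2, 0.9, 1.1],
    [1.1, 1.3, 1.0, 1.2],
    [0.9, 1.1, 0.8, 1.0],
    [-1.0, -0.8, -1.1, -0.9],
    [-1.2, -0.9, -1.0, -1.1],
    [-0.9, -1.0, -1.2, -0.8],
]
labels = ["TG", "TG", "TG", "PC", "PC", "PC"]  # assumed class names

# Unidentified peaks whose class membership is to be predicted.
unidentified = [
    [1.05, 1.15, 0.95, 1.05],
    [-1.1, -0.95, -1.05, -1.0],
]

predictions = [knn_predict(identified, labels, q, k=3) for q in unidentified]
print(predictions)
print(loo_accuracy(identified, labels, k=3))
```

Here the identified peaks act as the training set and the unidentified peaks as the unseen data. A comparison along the lines of the abstract would repeat the same evaluation for each classifier and keep only the predictions on which the best-performing models agree.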
This project was supported by the Academy of Finland (Decision # 111338).