Normalization and integration of large-scale metabolomics data using support vector regression
Untargeted metabolomics studies for biomarker discovery often involve hundreds to thousands of human samples. Data acquisition for such large sample sets has to be divided into several batches and may span months to several years. The signal drift of metabolites during data acquisition (intra- and inter-batch) is unavoidable and is a major confounding factor for large-scale metabolomics studies.
We aim to develop a data normalization method to reduce unwanted variations and integrate multiple batches in large-scale metabolomics studies prior to statistical analyses.
We developed a normalization and integration method for large-scale metabolomics data based on a machine learning algorithm, support vector regression (SVR). An R package named MetNormalizer was developed and is provided for data processing using SVR normalization.
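The general idea of regression-based drift correction can be illustrated with a minimal sketch. This is not the MetNormalizer implementation. It assumes, beyond what is stated above, that pooled quality control (QC) samples are injected at regular intervals, that each peak's drift is modeled from injection order alone, and that scikit-learn's `SVR` is an acceptable stand-in. The function names, hyperparameters, and simulated data are hypothetical.

```python
import numpy as np
from sklearn.svm import SVR


def svr_normalize_peak(intensity, injection_order, is_qc):
    """Correct one peak's drift with an SVR fitted on QC injections only.

    Assumption: drift depends only on injection order. The predicted drift
    must stay positive for the division below to be meaningful.
    """
    intensity = np.asarray(intensity, dtype=float)
    order = np.asarray(injection_order, dtype=float).reshape(-1, 1)
    is_qc = np.asarray(is_qc, dtype=bool)

    # Express QC intensities relative to their median so the target is near 1.
    qc_median = np.median(intensity[is_qc])
    model = SVR(kernel="rbf", C=10.0, epsilon=0.01, gamma="scale")
    model.fit(order[is_qc], intensity[is_qc] / qc_median)

    # Predicted relative drift at every injection, QC and study sample alike.
    drift = model.predict(order)
    return intensity / drift


def rsd_percent(values):
    """Relative standard deviation (sample SD / mean) in percent."""
    values = np.asarray(values, dtype=float)
    return 100.0 * values.std(ddof=1) / values.mean()


# Simulated peak: 60 injections, every 5th is a QC, signal drops by a third.
rng = np.random.default_rng(0)
order = np.arange(60)
is_qc = order % 5 == 0
raw = 1000.0 * (1.2 - 0.4 * order / 59) * (1 + rng.normal(0, 0.02, order.size))
corrected = svr_normalize_peak(raw, order, is_qc)

rsd_before = rsd_percent(raw[is_qc])
rsd_after = rsd_percent(corrected[is_qc])
```

In this toy setting, comparing `rsd_before` with `rsd_after` shows how much of the QC variation the fitted drift curve accounts for. A real workflow would apply the correction peak by peak and would need to handle batch boundaries explicitly.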
After SVR normalization, the proportion of metabolite ion peaks with relative standard deviations (RSDs) less than 30% increased to more than 90% of the total peaks, which is much better than that achieved by other common normalization methods. The reduction of unwanted analytical variations helps to improve the performance of multivariate statistical analyses, both unsupervised and supervised, in terms of classification and prediction accuracy, so that subtle metabolic changes in epidemiological studies can be detected.
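The RSD criterion used above can be sketched in a few lines. The example assumes that each peak's RSD is computed across repeated injections of the same sample (for instance QC injections), which the text does not spell out. The peak names and intensities are made up for illustration.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation: sample SD divided by the mean, in percent."""
    m = mean(values)
    if m == 0:
        raise ValueError("RSD is undefined for a zero mean")
    return 100.0 * stdev(values) / m


def fraction_below(peaks, threshold=30.0):
    """Fraction of peaks whose RSD is below the threshold (in percent)."""
    rsds = [rsd_percent(v) for v in peaks.values()]
    return sum(r < threshold for r in rsds) / len(rsds)


# Hypothetical intensities of two peaks across four repeated injections.
peaks = {
    "peak_A": [100, 102, 98, 101],   # stable signal, low RSD
    "peak_B": [100, 180, 40, 120],   # unstable signal, high RSD
}
share = fraction_below(peaks)
```

Here `share` is the quantity the results refer to: the proportion of all peaks whose RSD falls under the 30% cutoff.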
SVR normalization can effectively remove the unwanted intra- and inter-batch variations, and is much better than other common normalization methods.
Keywords: Metabolomics, Data normalization, Data integration, Support vector regression, Quality control
This work is financially supported by the Interdisciplinary Research Center on Biology and Chemistry (IRCBC), Chinese Academy of Sciences (CAS), and the National Natural Science Foundation of China (Grants 21575151 and 81573246). Z.-J. Z. is supported by the Thousand Youth Talents Program (The Recruitment Program of Global Youth Experts from the Chinese government). This work is also partially supported by an Agilent Technologies Thought Leader Award.
Compliance with ethical standards
Conflict of interest
The authors declare no competing financial interest.
All institutional and national guidelines for the care and use of biological samples were followed. The data acquired were in accordance with appropriate ethical requirements.
Research involving human participants
The human study was approved by the ethics committee of Shandong Cancer Hospital affiliated to Shandong University, Shandong Province, China.
Written informed consent was obtained from all participants involved in this study.