MetaClean: a machine learning-based classifier for reduced false positive peak detection in untargeted LC–MS metabolomics data

Chetnik, Kelsey; Petrick, Lauren; Pandey, Gaurav

doi:10.1007/s11306-020-01738-3

Original Article
Published: 21 October 2020

Volume 16, article number 117 (2020)

Abstract

Introduction

Despite the availability of several pre-processing software packages, poor peak integration remains a prevalent problem in untargeted metabolomics data generated using liquid chromatography high-resolution mass spectrometry (LC–MS). As a result, the output of these packages may retain incorrectly calculated metabolite abundances that propagate into downstream analyses.

Objectives

To address this problem, we propose a computational methodology that combines machine learning and peak quality metrics to filter out low-quality peaks.

Methods

Specifically, we comprehensively and systematically compared the performance of 24 different classifiers, generated by combining eight classification algorithms with three sets of peak quality metrics, on the task of distinguishing reliably integrated peaks from poorly integrated ones. These classifiers were compared with a relative standard deviation (RSD) cut-off applied to pooled quality-control (QC) samples, which aims to remove peaks with analytical error.
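The RSD cut-off used as the baseline can be illustrated with a minimal sketch. It assumes RSD is the sample standard deviation divided by the mean, expressed as a percentage, and that the cut-off is a user-chosen parameter. The function names, feature names, and abundance values are hypothetical and are not taken from the MetaClean package.

```python
from statistics import mean, stdev

def rsd_percent(values):
    """RSD as a percentage: sample standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        return float("inf")  # RSD is undefined; treat as failing any cut-off
    return 100.0 * stdev(values) / m

def filter_by_rsd(qc_abundances, cutoff_percent):
    """Keep features whose RSD across pooled QC injections is <= the cut-off."""
    return {
        feature: values
        for feature, values in qc_abundances.items()
        if rsd_percent(values) <= cutoff_percent
    }

# Hypothetical abundances of three features across four pooled QC injections.
qc = {
    "feature_A": [100.0, 102.0, 98.0, 100.0],
    "feature_B": [100.0, 50.0, 150.0, 100.0],
    "feature_C": [10.0, 11.0, 9.0, 10.0],
}

kept = filter_by_rsd(qc, cutoff_percent=30.0)
print(sorted(kept))
```

In this toy input, feature_B varies strongly between injections and is the only feature dropped. Note that such a filter only measures reproducibility across QC injections; it says nothing about peak shape, which is the gap the classifiers are meant to address.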

Results

The best-performing classifier was found to be a combination of the AdaBoost algorithm and a set of 11 peak quality metrics previously explored in untargeted metabolomics and proteomics studies. As a complementary approach, applying our framework to peaks retained after filtering by 30% RSD across pooled QC samples further identified poorly integrated peaks that RSD filtering alone did not remove. An R implementation of these classifiers and the overall computational approach is available as the MetaClean package at https://CRAN.R-project.org/package=MetaClean.
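To make the idea of boosting over peak quality metrics concrete, the following is a minimal from-scratch sketch of AdaBoost with one-feature threshold rules (decision stumps). It is not the MetaClean implementation. The abstract does not list the 11 metrics, so the two metric columns, their values, and the labels (+1 for a well-integrated peak, -1 for a poorly integrated one) are hypothetical.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[j] >= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] >= thr else -sign

def adaboost_fit(X, y, n_rounds=10):
    """Fit a weighted ensemble of stumps; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        stump = train_stump(X, y, w)
        if stump[0] >= 0.5:
            break  # no stump is better than chance; stop boosting
        err = max(stump[0], 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        if stump[0] == 0:
            break  # training data already perfectly separated
        # Increase the weight of misclassified peaks, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical peaks described by two made-up quality metrics in [0, 1].
X = [[0.9, 0.8], [0.8, 0.9], [0.85, 0.7], [0.2, 0.3], [0.1, 0.4], [0.3, 0.2]]
y = [1, 1, 1, -1, -1, -1]

model = adaboost_fit(X, y)
print(adaboost_predict(model, [0.95, 0.9]), adaboost_predict(model, [0.15, 0.1]))
```

The toy data are separable by a single threshold, so boosting stops after one round; real peak data would typically need many rounds and a held-out test set to assess performance.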

Conclusion

Our work represents an important step forward in developing an automated tool for filtering out unreliable peak integrations in untargeted LC–MS metabolomics data.

Data availability

The metabolomics data and metadata analyzed in this paper are available via the Metabolomics Workbench (https://www.metabolomicsworkbench.org/) Study IDs ST000726 and ST000695, and via MetaboLights (https://www.ebi.ac.uk/metabolights/) Study IDs MTBLS354 and MTBLS306.

Code availability

The MetaClean R package developed in this study is available at https://CRAN.R-project.org/package=MetaClean.

References

Alpaydin, E. (2014). Introduction to machine learning (3rd ed.). London: The MIT Press.
Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79. https://doi.org/10.1214/09-SS054
Borgsmüller, N., Gloaguen, Y., Opialla, T., Blanc, E., Sicard, E., Royer, A.-L., et al. (2019). WiPP: Workflow for improved peak picking for gas chromatography-mass spectrometry (GC-MS) data. Metabolites, 9(9), 171. https://doi.org/10.3390/metabo9090171
Broadhurst, D., Goodacre, R., Reinke, S. N., Kuligowski, J., Wilson, I. D., Lewis, M. R., & Dunn, W. B. (2018). Guidelines and considerations for the use of system suitability and quality control samples in mass spectrometry assays applied in untargeted clinical metabolomic studies. Metabolomics, 14(6), 72. https://doi.org/10.1007/s11306-018-1367-3
Calvo, B., & Santafé, G. (2016). scmamp: Statistical comparison of multiple algorithms in multiple problems. The R Journal, 8(1), 248.
Chong, J., Wishart, D. S., & Xia, J. (2019). Using MetaboAnalyst 4.0 for comprehensive and integrative metabolomics data analysis. Current Protocols in Bioinformatics, 68(1), e86. https://doi.org/10.1002/cpbi.86
Coble, J. B., & Fraga, C. G. (2014). Comparative evaluation of preprocessing freeware on chromatography/mass spectrometry data for signature discovery. Journal of Chromatography A, 1358, 155–164. https://doi.org/10.1016/j.chroma.2014.06.100
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30. https://doi.org/10.5555/1248547.1248548
Dunn, W. B., Broadhurst, D., Begley, P., Zelena, E., Francis-McIntyre, S., Anderson, N., et al. (2011). Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry. Nature Protocols, 6(7), 1060–1083. https://doi.org/10.1038/nprot.2011.335
Eshghi, S. T., Auger, P., & Mathews, W. R. (2018). Quality assessment and interference detection in targeted mass spectrometry data using machine learning. Clinical Proteomics. https://doi.org/10.1186/s12014-018-9209-x
Haug, K., Cochrane, K., Nainala, V. C., Williams, M., Chang, J., Jayaseelan, K. V., & O’Donovan, C. (2019). MetaboLights: A resource evolving in response to the needs of its scientific community. Nucleic Acids Research. https://doi.org/10.1093/nar/gkz1019
Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(1), 1–26.
Lever, J., Krzywinski, M., & Altman, N. S. (2016). Points of Significance: Classification evaluation. Nature Methods, 13(8), 603–604. https://doi.org/10.1038/nmeth.3945
Libiseller, G., Dvorzak, M., Kleb, U., Gander, E., Eisenberg, T., Madeo, F., et al. (2015). IPO: A tool for automated optimization of XCMS parameters. BMC Bioinformatics, 16(1), 118. https://doi.org/10.1186/s12859-015-0562-8
Mahieu, N. G., Spalding, J. L., & Patti, G. J. (2016). Warpgroup: Increased precision of metabolomic data processing by consensus integration bound analysis. Bioinformatics (Oxford, England), 32(2), 268–275. https://doi.org/10.1093/bioinformatics/btv564
MetaboLights. (2016a). MTBLS354: Lipid metabolites as potential diagnostic and prognostic biomarkers for acute community acquired pneumonia. Retrieved March 4, 2020, from https://www.ebi.ac.uk/metabolights/MTBLS354.
MetaboLights. (2016b). MTBLS306: Metabolic profiling of submaximal exercise at a standardised relative intensity in healthy adults. Retrieved September 4, 2020, from https://www.ebi.ac.uk/metabolights/MTBLS306.
Metabolomics Workbench. (2017a). PR000523, ST000726. https://doi.org/10.21228/M82D6X
Metabolomics Workbench. (2017b). PR000492, ST000625. https://doi.org/10.21228/M8G31N
Muhsen Ali, A., Burleigh, M., Daskalaki, E., Zhang, T., Easton, C., & Watson, D. G. (2016). Metabolomic profiling of submaximal exercise at a standardised relative intensity in healthy adults. Metabolites, 6(1), 9. https://doi.org/10.3390/metabo6010009
Myers, O. D., Sumner, S. J., Li, S., Barnes, S., & Du, X. (2017). Detailed investigation and comparison of the XCMS and MZmine 2 chromatogram construction and chromatographic peak detection methods for preprocessing mass spectrometry metabolomics data. Analytical Chemistry, 89(17), 8689–8695. https://doi.org/10.1021/acs.analchem.7b01069
Pluskal, T., Castillo, S., Villar-Briones, A., & Orešič, M. (2010). MZmine 2: Modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinformatics, 11(1), 395. https://doi.org/10.1186/1471-2105-11-395
Rafiei, A., & Sleno, L. (2015). Comparison of peak-picking workflows for untargeted liquid chromatography/high-resolution mass spectrometry metabolomics data analysis. Rapid Communications in Mass Spectrometry, 29, 119–127. https://doi.org/10.1002/rcm.7094
Schiffman, C., Petrick, L., Perttula, K., Yano, Y., Carlsson, H., Whitehead, T., et al. (2019). Filtering procedures for untargeted LC-MS metabolomics data. BMC Bioinformatics, 20(1), 334. https://doi.org/10.1186/s12859-019-2871-9
Smith, C. A., Want, E. J., O’Maille, G., Abagyan, R., & Siuzdak, G. (2006). XCMS: Processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Analytical Chemistry, 78(3), 779–787. https://doi.org/10.1021/ac051437y
Sud, M., Fahy, E., Cotter, D., Azam, K., Vadivelu, I., Burant, C., et al. (2016). Metabolomics Workbench: An international repository for metabolomics data and metadata, metabolite standards, protocols, tutorials and training, and analysis tools. Nucleic Acids Research, 44, D463–D470. https://doi.org/10.1093/nar/gkv1042
To, K. K. W., Lee, K.-C., Wong, S. S. Y., Sze, K.-H., Ke, Y.-H., Lui, Y.-M., et al. (2016). Lipid metabolites as potential diagnostic and prognostic biomarkers for acute community acquired pneumonia. Diagnostic Microbiology and Infectious Disease, 85(2), 249–254. https://doi.org/10.1016/j.diagmicrobio.2016.03.012
Uppal, K., Soltow, Q. A., Strobel, F. H., Pittard, W. S., Gernert, K. M., Yu, T., & Jones, D. P. (2013). xMSanalyzer: Automated pipeline for improved feature detection and downstream analysis of large-scale, non-targeted metabolomics data. BMC Bioinformatics, 14(1), 15. https://doi.org/10.1186/1471-2105-14-15
Want, E. J., Wilson, I. D., Gika, H., Theodoridis, G., Plumb, R. S., Shockcor, J., et al. (2010). Global metabolic profiling procedures for urine using UPLC–MS. Nature Protocols, 5(6), 1005–1018. https://doi.org/10.1038/nprot.2010.50
Whalen, S., Pandey, O. P., & Pandey, G. (2016). Predicting protein function and other biomedical characteristics with heterogeneous ensembles. Methods, 93, 92–102. https://doi.org/10.1016/j.ymeth.2015.08.016
Yang, P., Yang, Y. H., Zhou, B. B., & Zomaya, A. Y. (2010). A review of ensemble methods in bioinformatics. Current Bioinformatics, 5(4), 296–308. https://doi.org/10.2174/157489310794072508
Yu, T., Park, Y., Johnson, J. M., & Jones, D. P. (2009). apLCMS—adaptive processing of high-resolution LC/MS data. Bioinformatics, 25(15), 1930–1936. https://doi.org/10.1093/bioinformatics/btp291
Zhang, W., & Zhao, P. X. (2014). Quality evaluation of extracted ion chromatograms and chromatographic peaks in liquid chromatography/mass spectrometry-based metabolomics data. BMC Bioinformatics, 15(Suppl 11), S5. https://doi.org/10.1186/1471-2105-15-S11-S5

Acknowledgements

The authors would like to thank Miao Yu and Gabriel Hoffman for helping to develop, and Yan-chak Li for testing, the MetaClean R package. They also thank the authors of the Zhang and Zhao (2014) study for sharing the Matlab code implementing their peak quality metrics.

Funding

This work was supported by the National Institute of Environmental Health Sciences (NIEHS) Grant Nos. U2CES030859, P30ES23515, R01ES031117, and R21ES030882, and the National Institute of General Medical Sciences (NIGMS) Grant No. R01GM114434.

Author information

Authors and Affiliations

Department of Genetics and Genomic Sciences and Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Kelsey Chetnik & Gaurav Pandey
Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Lauren Petrick
Institute for Exposomics Research, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Lauren Petrick & Gaurav Pandey

Contributions

KC performed research, analyzed data, and wrote the paper. LP conceived and designed the study, curated data, interpreted the results, and wrote the paper. GP conceived and designed the study, analyzed data, interpreted the results, and wrote the paper. All the authors read and approved the manuscript.

Corresponding authors

Correspondence to Lauren Petrick or Gaurav Pandey.

Ethics declarations

Conflicts of interest

All authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human and/or animal participants performed by any of the authors, and only utilizes publicly available data and software.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 490 kb)

About this article

Cite this article

Chetnik, K., Petrick, L. & Pandey, G. MetaClean: a machine learning-based classifier for reduced false positive peak detection in untargeted LC–MS metabolomics data. Metabolomics 16, 117 (2020). https://doi.org/10.1007/s11306-020-01738-3

Received: 09 April 2020
Accepted: 13 October 2020
Published: 21 October 2020
DOI: https://doi.org/10.1007/s11306-020-01738-3
