MetaClean: a machine learning-based classifier for reduced false positive peak detection in untargeted LC–MS metabolomics data

  • Original Article
Metabolomics

Abstract

Introduction

Despite the availability of several pre-processing software packages, poor peak integration remains a prevalent problem in untargeted metabolomics data generated using liquid chromatography–high-resolution mass spectrometry (LC–MS). As a result, the output of these packages may retain incorrectly calculated metabolite abundances that propagate into downstream analyses.

Objectives

To address this problem, we propose a computational methodology that combines machine learning and peak quality metrics to filter out low-quality peaks.

Methods

Specifically, we systematically compared the performance of 24 classifiers, generated by combining eight classification algorithms with three sets of peak quality metrics, on the task of distinguishing reliably integrated peaks from poorly integrated ones. These classifiers were also compared against a relative standard deviation (RSD) cut-off in pooled quality-control (QC) samples, which aims to remove peaks with high analytical error.
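The RSD cut-off used as the baseline can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes RSD is computed as 100 × (sample standard deviation / mean) of a peak's intensities across the pooled QC injections, and that a peak is kept when its RSD does not exceed the cut-off. The function names, the peak identifiers, and the intensity values are all hypothetical.

```python
from statistics import mean, stdev


def rsd_percent(intensities):
    """RSD (%) of one peak across pooled QC injections.

    Assumption: RSD = 100 * sample SD (n - 1 denominator) / mean.
    """
    m = mean(intensities)
    if m == 0:
        return float("inf")  # treat an all-zero peak as failing any cut-off
    return 100.0 * stdev(intensities) / m


def filter_by_rsd(qc_table, cutoff=30.0):
    """Keep only peaks whose QC RSD is at or below `cutoff` (in percent)."""
    return {
        peak: values
        for peak, values in qc_table.items()
        if rsd_percent(values) <= cutoff
    }


# Hypothetical integrated intensities for two peaks in four pooled QC runs.
qc_table = {
    "peak_A": [100.0, 102.0, 98.0, 101.0],  # stable across QC injections
    "peak_B": [100.0, 40.0, 160.0, 90.0],   # highly variable
}

kept = filter_by_rsd(qc_table)
print(sorted(kept))
```

With these made-up values only `peak_A` survives the 30% cut-off. Note that such a filter looks only at reproducibility of the integrated area, so a peak that is consistently mis-integrated in every QC run can still pass.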

Results

The best-performing classifier combined the AdaBoost algorithm with a set of 11 peak quality metrics previously explored in untargeted metabolomics and proteomics studies. As a complementary approach, applying our framework to peaks retained after filtering at 30% RSD across pooled QC samples further identified poorly integrated peaks that were not removed by RSD filtering alone. An R implementation of these classifiers and the overall computational approach is available as the MetaClean package at https://CRAN.R-project.org/package=MetaClean.
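The complementary use of the two filters can be pictured as a two-stage pipeline. The sketch below assumes each peak already carries a precomputed QC RSD and a label predicted by a trained peak-quality classifier; the field names (`rsd_percent`, `predicted`), the `"pass"`/`"fail"` labels, and the example values are hypothetical and do not reflect the MetaClean package's actual interface.

```python
def two_stage_filter(peaks, rsd_cutoff=30.0):
    """Apply an RSD cut-off first, then a classifier's predicted label.

    Returns (retained, flagged): peaks that pass both stages, and peaks that
    pass the RSD stage but are predicted to be poorly integrated.
    """
    after_rsd = [p for p in peaks if p["rsd_percent"] <= rsd_cutoff]
    retained = [p for p in after_rsd if p["predicted"] == "pass"]
    flagged = [p for p in after_rsd if p["predicted"] != "pass"]
    return retained, flagged


# Hypothetical peaks with precomputed QC RSD and classifier predictions.
peaks = [
    {"id": "peak_A", "rsd_percent": 4.2, "predicted": "pass"},
    {"id": "peak_B", "rsd_percent": 12.5, "predicted": "fail"},
    {"id": "peak_C", "rsd_percent": 48.0, "predicted": "pass"},
]

retained, flagged = two_stage_filter(peaks)
print([p["id"] for p in retained], [p["id"] for p in flagged])
```

In this toy input, `peak_C` is removed by the RSD stage, while `peak_B` is reproducible enough to pass it and is caught only by the classifier stage, which is the kind of case the second stage is meant to address.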

Conclusion

Our work represents an important step forward in developing an automated tool for filtering out unreliable peak integrations in untargeted LC–MS metabolomics data.

Data availability

The metabolomics data and metadata analyzed in this paper are available via the Metabolomics Workbench (https://www.metabolomicsworkbench.org/) under Study IDs ST000726 and ST000695, and via MetaboLights (https://www.ebi.ac.uk/metabolights/) under Study IDs MTBLS354 and MTBLS306.

Code availability

The MetaClean R package developed in this study is available at https://CRAN.R-project.org/package=MetaClean.

Acknowledgements

The authors would like to thank Miao Yu and Gabriel Hoffman for helping to develop the MetaClean R package, and Yan-chak Li for testing it. They also thank the authors of the Zhang et al. (2014) study for sharing the MATLAB code implementing their peak quality metrics.

Funding

This work was supported by the National Institute of Environmental Health Sciences (NIEHS) Grant Nos. U2CES030859, P30ES23515, R01ES031117, and R21ES030882, and the National Institute of General Medical Sciences (NIGMS) Grant No. R01GM114434.

Author information

Contributions

KC performed the research, analyzed data, and wrote the paper. LP conceived and designed the study, curated data, interpreted the results, and wrote the paper. GP conceived and designed the study, analyzed data, interpreted the results, and wrote the paper. All authors read and approved the manuscript.

Corresponding authors

Correspondence to Lauren Petrick or Gaurav Pandey.

Ethics declarations

Conflicts of interest

All authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human and/or animal participants performed by any of the authors, and only utilizes publicly available data and software.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary file 1 (DOCX, 490 KB)

About this article

Cite this article

Chetnik, K., Petrick, L. & Pandey, G. MetaClean: a machine learning-based classifier for reduced false positive peak detection in untargeted LC–MS metabolomics data. Metabolomics 16, 117 (2020). https://doi.org/10.1007/s11306-020-01738-3

  • DOI: https://doi.org/10.1007/s11306-020-01738-3
