The local-balanced model for improved machine learning outcomes on mass spectrometry data sets and other instrumental data

Desaire, Heather; Patabandige, Milani Wijeweera; Hua, David

doi:10.1007/s00216-020-03117-2

Research Paper
Published: 13 February 2021

Volume 413, pages 1583–1593, (2021)
Analytical and Bioanalytical Chemistry

Heather Desaire¹,
Milani Wijeweera Patabandige¹ &
David Hua¹

This article has been updated

Abstract

One unifying challenge when classifying biological samples with mass spectrometry data is overcoming sample-to-sample variability so that differences between groups, such as between a healthy set and a disease set, can be identified. Similarly, when the same sample is re-analyzed under identical conditions, instrument signals can fluctuate by more than 10%. This signal inconsistency makes it difficult to identify subtle differences across a set of samples, and it weakens the mass spectrometrist’s ability to leverage data in domains as diverse as proteomics, metabolomics, glycomics, and imaging. We selected challenging data sets in the fields of glycomics, mass spectrometry imaging, and bacterial typing to study the problem of within-group signal variability and adapted a 30-year-old statistical approach to address it. The solution, the “local-balanced model,” uses balanced subsets of training data to classify test samples. This analysis strategy was assessed on ESI-MS data of IgG-based glycopeptides, MALDI-MS imaging data of endogenous lipids, and MALDI-MS data of bacterial proteins. Two preliminary examples on non-mass spectrometry data sets are also included to show the potential generality of the method outside the field of MS analysis. We demonstrate that this approach is superior to simple normalization methods, generalizable to multiple mass spectrometry domains, and potentially appropriate in fields as diverse as physics and satellite imaging. In some cases, improvements in classification can be dramatic, with accuracy rising from 60% with normalization alone to over 90% with the additional development described herein.
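The abstract describes the method only as classifying test samples with balanced subsets of training data, so the following is a minimal sketch of one possible reading, not the authors' implementation. It assumes that "local" means the training samples nearest to the test sample, that "balanced" means the same number of neighbors from every class, and it swaps in a simple nearest-centroid rule as the local classifier. The function name `local_balanced_predict`, the parameter `k_per_class`, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def local_balanced_predict(train_X, train_y, x, k_per_class=3):
    """Classify x from a local, class-balanced subset of the training data.

    Assumption (not from the paper): the subset is the k nearest training
    samples from EACH class, and the local classifier is nearest-centroid.
    """
    # Group training samples by class label.
    by_class = defaultdict(list)
    for features, label in zip(train_X, train_y):
        by_class[label].append(features)

    # Use the same neighbor count for every class so the subset is balanced,
    # even when one class has fewer than k_per_class samples.
    k = min(k_per_class, min(len(rows) for rows in by_class.values()))

    best_label, best_distance = None, math.inf
    for label, rows in by_class.items():
        # The k samples of this class that lie closest to the test sample.
        nearest = sorted(rows, key=lambda row: math.dist(row, x))[:k]
        # Centroid of those local samples, computed feature by feature.
        centroid = [sum(column) / len(nearest) for column in zip(*nearest)]
        distance = math.dist(centroid, x)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label


# Hypothetical, imbalanced toy data: six "healthy" samples, two "disease".
train_X = [
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.0], [1.0, 1.2], [0.8, 0.9],
    [3.0, 3.0], [3.2, 2.9],
]
train_y = ["healthy"] * 6 + ["disease"] * 2

print(local_balanced_predict(train_X, train_y, [2.9, 3.1]))
print(local_balanced_predict(train_X, train_y, [1.0, 1.05]))
```

Capping the neighbor count at the size of the smallest class keeps the local subset balanced even when the full training set is not, which is the property the abstract emphasizes. Any other classifier could be trained on the same subset in place of the centroid rule.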

Change history

16 April 2020
Springer Nature’s version of this paper was updated to present the correct Electronic Supplementary Material files.

Funding

This work was funded by the NIH through grant R35GM130354 to H.D.

Author information

Authors and Affiliations

Department of Chemistry, University of Kansas, Lawrence, KS, 66045, USA
Heather Desaire, Milani Wijeweera Patabandige & David Hua

Corresponding author

Correspondence to Heather Desaire.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

ESM 1 (TXT 3.66 kb)
ESM 2 (TXT 1.06 kb)
ESM 3 (TXT 1.74 kb)
ESM 4 (DOCX 14.6 kb)

About this article

Cite this article

Desaire, H., Patabandige, M.W. & Hua, D. The local-balanced model for improved machine learning outcomes on mass spectrometry data sets and other instrumental data. Anal Bioanal Chem 413, 1583–1593 (2021). https://doi.org/10.1007/s00216-020-03117-2

Received: 12 September 2020
Revised: 17 November 2020
Accepted: 08 December 2020
Published: 13 February 2021
Issue Date: March 2021
DOI: https://doi.org/10.1007/s00216-020-03117-2
