Skip to main content
Log in

Benchmarking mass spectrometry based proteomics algorithms using a simulated database

  • Short Communication
  • Published:
Network Modeling Analysis in Health Informatics and Bioinformatics Aims and scope Submit manuscript

Abstract

Protein sequencing algorithms process data from a variety of instruments that has been generated under diverse experimental conditions. Currently there is no way to predict the accuracy of an algorithm for a given data set. Most of the published algorithms and associated software has been evaluated on limited number of experimental data sets. However, these performance evaluations do not cover the complete search space the algorithm and the software might encounter in real-world. To this end, we present a database of simulated spectra that can be used to benchmark any spectra to peptide search engine. We demonstrate the usability of this database by bench marking two popular peptide sequencing engines. We show wide variation in the accuracy of peptide deductions and a complete quality profile of a given algorithm can be useful for practitioners and algorithm developers. All benchmarking data is available at https://users.cs.fiu.edu/~fsaeed/Benchmark.html

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4

Data availability

The proposed database is available at: https://users.cs.fiu.edu/ fsaeed/Benchmark.html

References

  • Aebersold R, Mann M (2003) Mass spectrometry-based proteomics. Nature 422(6928):198

    Article  Google Scholar 

  • Arbelaez P, Maire M, Fowlkes C, Malik J (2011) Contour detection and hierarchical image segmentation. IEEE Trans Pattern Anal Mach Intell 33(5):898–916

    Article  Google Scholar 

  • Diament BJ, Noble WS (2011) Faster request searching for peptide identification from tandem mass spectra. J Proteome Res 10(9):3871–3879

    Article  Google Scholar 

  • Ebhardt HA, Root A, Sander C, Aebersold R (2015) Applications of targeted proteomics in systems biology and translational medicine. Proteomics 15(18):3193–3208

    Article  Google Scholar 

  • Elias JE, Gygi SP (2007) Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods 4(3):207–214

    Article  Google Scholar 

  • Freytag S, Tian L, Ingrid L, Ng M, Bahlo M (2018) Comparison of clustering tools in r for medium-sized 10x genomics single-cell RNA-sequencing data. F1000Research 7

  • Gul Awan M, Saeed F (2016) MS-reduce: an ultrafast technique for reduction of big mass spectrometry data for high-throughput processing. Bioinformatics 32(10):1518–1526

    Article  Google Scholar 

  • Gul Awan M, Saeed F (2018) Mass-simulator: a highly configurable simulator for generating ms/ms datasets for benchmarking of proteomics algorithms. Proteomics 18(20):1800206

    Article  Google Scholar 

  • Iglesias-Gato D, Wikström P, Tyanova S, Lavallee C, Thysell E, Carlsson J, Hägglöf C, Cox J, Andrén O, Stattin P et al (2016) The proteome of primary prostate cancer. Eur Urol 69(5):942–952

    Article  Google Scholar 

  • Käll L, Canterbury JD, Weston J, Noble WS, MacCoss MJ (2007) Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat Methods 4(11):923

    Article  Google Scholar 

  • Käll L, Storey JD, MacCoss MJ, Noble WS (2008) Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J Proteome Res 7(01):29–34

    Article  Google Scholar 

  • Keller A, Nesvizhskii AI, Kolker E, Aebersold R (2002) Empirical statistical model to estimate the accuracy of peptide identifications made by ms/ms and database search. Anal Chem 74(20):5383–5392

    Article  Google Scholar 

  • Kong AT, Leprevost FV, Avtonomov DM, Mellacheruvu D, Nesvizhskii AI (2017) Msfragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat Methods 14(5):513

    Article  Google Scholar 

  • Ma B (2015) Novor: real-time peptide de novo sequencing software. J Am Soc Mass Spectrom 26(11):1885–1894

    Article  Google Scholar 

  • McIlwain S, Tamura K, Kertesz-Farkas A, Grant CE, Diament B, Frewen B, Howbert JJ, Hoopmann MR, Kall L, Eng JK et al (2014) Crux: rapid open source protein tandem mass spectrometry analysis. J Proteome Res 13(10):4488–4491

    Article  Google Scholar 

  • PedroM C, Bengt F (2016) Emerging systems biology approaches in nanotoxicology: towards a mechanism-based understanding of nanomaterial hazard and risk. Toxicol Appl Pharmacol 299:101–111

    Article  Google Scholar 

  • Saeed F (2015) Big data proteogenomics and high performance computing: Challenges and opportunities. In Signal and information processing (GlobalSIP). In: 2015 IEEE Global Conference on. IEEE, pp 141–145

  • Savitski MM, Wilhelm M, Hahne H, Kuster B, Bantscheff M (2015) A scalable approach for protein false discovery rate estimation in large proteomic data sets. Mol Cell Proteom 14(9):2394–2404

    Article  Google Scholar 

  • Shteynberg D, Deutsch EW, Lam H, Eng JK, Sun Z, Tasman N, Mendoza L, Moritz RL, Aebersold R, Nesvizhskii AI (2011) iprophet: multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates. Mol Cell Proteom 10(12):M111-007690

    Article  Google Scholar 

  • Tsai T-H, Song E, Zhu R, Di Poto C, Wang M, Luo Y, Varghese RS, Tadesse MG, Ziada DH, Desai CS et al (2015) LC-MS/MS-based serum proteomics for identification of candidate biomarkers for hepatocellular carcinoma. Proteomics 15(13):2369–2381

    Article  Google Scholar 

  • Zhenqin W, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Leswing K, Pande V (2018) Moleculenet: a benchmark for molecular machine learning. Chem Sci 9(2):513–530

    Article  Google Scholar 

Download references

Acknowledgements

Research reported in this paper was supported by NIGMS of the National Institutes of Health under award number: R01GM134384. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Fahad Saeed was further supported by the National Science Foundations (NSF) under the Award Numbers NSF CAREER OAC-1925960. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Science Foundation.

Author information

Authors and Affiliations

Authors

Contributions

MA devised the method and database, wrote the software for generation of the database and wrote the manuscript. AA performed experiments and analyzed the results. FS proposed the initial idea, designed the experiments, supervised the research, and wrote and edited the manuscript.

Corresponding author

Correspondence to Fahad Saeed.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Code availability

Proposed databases were generated using the MaSS-Simulator software (Gul Awan and Saeed 2018).

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (PDF 1599 KB)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Awan, M.G., Awan, A.G. & Saeed, F. Benchmarking mass spectrometry based proteomics algorithms using a simulated database. Netw Model Anal Health Inform Bioinforma 10, 23 (2021). https://doi.org/10.1007/s13721-021-00298-3

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s13721-021-00298-3

Keywords

Navigation