
Algorithms for Database-Dependent Search of MS/MS Data

  • Protocol
Mass Spectrometry Data Analysis in Proteomics

Part of the book series: Methods in Molecular Biology (MIMB, volume 1007)

Abstract

The frequently used bottom-up strategy for identifying proteins and their associated modifications nowadays typically generates thousands of MS/MS spectra, which are normally matched automatically against a protein sequence database. Search engines that take MS/MS spectra and a protein sequence database as input are referred to as database-dependent search engines. Many programs, both commercial and freely available, exist for database-dependent search of MS/MS spectra, and most of them have excellent user documentation. The aim here is therefore to outline the algorithmic strategies behind different search engines rather than to provide software user manuals. The process of database-dependent search can be divided into search strategy, peptide scoring, protein scoring, and finally protein inference. Most efforts in the literature have gone into comparing results from different software rather than discussing the underlying algorithms. Such practical comparisons can be cluttered by suboptimal implementations, and the observed differences are frequently caused by software parameter settings that were not set properly to allow a fair comparison. In other words, an algorithmic idea can still be worth considering even if its software implementation has been shown to be suboptimal. The aim of this chapter is therefore to split the algorithms for database-dependent searching of MS/MS data into the above steps so that the different algorithmic ideas become more transparent and comparable. Most search engines provide good implementations of the first three data analysis steps, whereas the final step, protein inference, is much less developed in most search engines and is in many cases performed by external software. The final part of this chapter illustrates how protein inference is built into the VEMS search engine and discusses SIR, a stand-alone program for protein inference that can import Mascot search results.
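As a rough illustration of the four steps named in the abstract (search strategy, peptide scoring, protein scoring, protein inference), the following Python sketch strings together deliberately simple stand-ins for each step. It is not the method of VEMS, SIR, Mascot, or any other engine mentioned here. The tryptic cleavage rule, the shared-peak count, the summed protein score, the greedy parsimony step, and every function name and tolerance value are assumptions chosen only to keep the example small.

```python
from collections import defaultdict

# Monoisotopic residue masses in Da (standard values; subset of residues for brevity).
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
           "D": 115.02694, "K": 128.09496, "E": 129.04259, "F": 147.06841,
           "R": 156.10111}
WATER, PROTON = 18.01056, 1.00728

def digest(seq):
    """Search strategy (assumed): cleave after K/R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        last = i == len(seq) - 1
        if last or (aa in "KR" and seq[i + 1] != "P"):
            peptides.append(seq[start:i + 1])
            start = i + 1
    return peptides

def peptide_mass(pep):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE[a] for a in pep) + WATER

def fragment_mzs(pep):
    """Theoretical singly charged b- and y-ion m/z values."""
    b = [sum(RESIDUE[a] for a in pep[:i]) + PROTON for i in range(1, len(pep))]
    y = [sum(RESIDUE[a] for a in pep[i:]) + WATER + PROTON for i in range(1, len(pep))]
    return b + y

def score_peptide(pep, peaks, tol=0.02):
    """Peptide scoring (assumed): count theoretical fragments matched by a peak."""
    return sum(any(abs(mz - p) <= tol for p in peaks) for mz in fragment_mzs(pep))

def search(spectra, database, precursor_tol=0.01):
    """Pick the best-scoring candidate peptide per spectrum.

    spectra: list of (neutral precursor mass, list of fragment m/z values).
    database: dict mapping protein name to sequence.
    """
    index = defaultdict(set)  # peptide -> proteins that contain it
    for name, seq in database.items():
        for pep in digest(seq):
            index[pep].add(name)
    hits = []
    for precursor_mass, peaks in spectra:
        candidates = [p for p in index
                      if abs(peptide_mass(p) - precursor_mass) <= precursor_tol]
        if candidates:
            best = max(candidates, key=lambda p: score_peptide(p, peaks))
            hits.append((best, score_peptide(best, peaks)))
    return hits, index

def protein_scores(hits, index):
    """Protein scoring (assumed): sum the best score of each distinct peptide."""
    best = {}
    for pep, s in hits:
        best[pep] = max(s, best.get(pep, 0))
    scores = defaultdict(int)
    for pep, s in best.items():
        for prot in index[pep]:
            scores[prot] += s
    return dict(scores)

def infer_proteins(hits, index):
    """Protein inference (assumed): greedy smallest set explaining all peptides."""
    remaining = {pep for pep, _ in hits}
    chosen = []
    while remaining:
        prots = sorted({p for pep in remaining for p in index[pep]})
        top = max(prots, key=lambda p: sum(p in index[pep] for pep in remaining))
        chosen.append(top)
        remaining = {pep for pep in remaining if top not in index[pep]}
    return chosen

# Toy usage: two hypothetical proteins and noise-free synthetic spectra.
database = {"P1": "GASPVKTLNDR", "P2": "TLNDR"}
spectra = [(peptide_mass(p), fragment_mzs(p)) for p in ("GASPVK", "TLNDR")]
hits, index = search(spectra, database)
print(hits, protein_scores(hits, index), infer_proteins(hits, index))
```

In this toy case the peptide TLNDR is shared by both hypothetical proteins, so the greedy step reports only P1, which also explains GASPVK. Ambiguity of this kind is what makes protein inference a separate problem from peptide and protein scoring.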



Acknowledgments

R.M. is supported by Fundação para a Ciência e a Tecnologia (FCT) Ciência 2007. IPATIMUP is an Associate Laboratory of the Portuguese Ministry of Science, Technology and Higher Education and is partially supported by FCT. R.M. is further supported by FCT grants (PTDC/QUI-BIQ/099457/2008 and PTDC/EIA-EIA/099458/2008).


Copyright information

© 2013 Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Matthiesen, R. (2013). Algorithms for Database-Dependent Search of MS/MS Data. In: Matthiesen, R. (eds) Mass Spectrometry Data Analysis in Proteomics. Methods in Molecular Biology, vol 1007. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-392-3_5


  • DOI: https://doi.org/10.1007/978-1-62703-392-3_5


  • Publisher Name: Humana Press, Totowa, NJ

  • Print ISBN: 978-1-62703-391-6

  • Online ISBN: 978-1-62703-392-3

  • eBook Packages: Springer Protocols
