Algorithms for Database-Dependent Search of MS/MS Data

Matthiesen, Rune

doi:10.1007/978-1-62703-392-3_5

Part of the book series: Methods in Molecular Biology ((MIMB,volume 1007))

Abstract

The frequently used bottom-up strategy for identifying proteins and their associated modifications nowadays typically generates thousands of MS/MS spectra, which are normally matched automatically against a protein sequence database. Search engines that take MS/MS spectra and a protein sequence database as input are referred to as database-dependent search engines. Many programs, both commercial and freely available, exist for database-dependent search of MS/MS spectra, and most of them have excellent user documentation. The aim here is therefore to outline the algorithmic strategies behind different search engines rather than to provide software user manuals. The process of database-dependent search can be divided into search strategy, peptide scoring, protein scoring, and finally protein inference. Most efforts in the literature have gone into comparing results from different software rather than discussing the underlying algorithms. Such practical comparisons can be cluttered by suboptimal implementations, and the observed differences are frequently caused by software parameter settings that were not set properly to allow a fair comparison. In other words, an algorithmic idea can still be worth considering even if the software implementation has been demonstrated to be suboptimal. The aim of this chapter is therefore to split the algorithms for database-dependent searching of MS/MS data into the above steps so that the different algorithmic ideas become more transparent and comparable. Most search engines provide good implementations of the first three data analysis steps mentioned above, whereas the final step, protein inference, is much less developed in most search engines and is in many cases performed by external software. The final part of this chapter illustrates how protein inference is built into the VEMS search engine and discusses SIR, a stand-alone program for protein inference that can import a Mascot search result.
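
To make the four steps named in the abstract concrete, the following is a minimal Python sketch on toy data. It is not the method of VEMS, SIR, Mascot, or any other engine: the function names, the tolerances, the shared-peak-count score, the summed protein score, and the greedy parsimony rule are all simplifying assumptions chosen for illustration. Only the residue, water, and proton masses are standard monoisotopic values.

```python
from collections import defaultdict

# Monoisotopic residue masses (Da); only the residues used in the toy data.
MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "L": 113.08406, "D": 115.02694,
        "K": 128.09496, "E": 129.04259, "F": 147.06841, "R": 156.10111}
WATER, PROTON = 18.01056, 1.00728

def digest(protein):
    """Toy tryptic digest (assumption): cleave after K or R, no missed cleavages."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MASS[aa] for aa in peptide) + WATER

def fragment_mz(peptide):
    """Singly charged b- and y-ion m/z values for every backbone cleavage."""
    ions = []
    for i in range(1, len(peptide)):
        ions.append(sum(MASS[aa] for aa in peptide[:i]) + PROTON)          # b ion
        ions.append(sum(MASS[aa] for aa in peptide[i:]) + WATER + PROTON)  # y ion
    return ions

def search(spectrum, precursor_mass, database, prec_tol=0.01, frag_tol=0.02):
    """Steps 1-2: keep candidates within the precursor tolerance, then score
    each by the number of theoretical fragments matched in the spectrum."""
    best = None
    for sequence in database.values():
        for peptide in digest(sequence):
            if abs(peptide_mass(peptide) - precursor_mass) > prec_tol:
                continue
            score = sum(any(abs(mz - peak) <= frag_tol for peak in spectrum)
                        for mz in fragment_mz(peptide))
            if best is None or score > best[1]:
                best = (peptide, score)
    return best

def protein_scores(peptide_hits, database):
    """Step 3 (assumption): protein score = sum of scores of its peptides."""
    scores = defaultdict(int)
    for peptide, score in peptide_hits:
        for protein_id, sequence in database.items():
            if peptide in digest(sequence):
                scores[protein_id] += score
    return dict(scores)

def infer_proteins(peptide_hits, database):
    """Step 4 (assumption): greedy parsimony, repeatedly choosing the protein
    that explains the most still-unexplained peptides."""
    unexplained = {peptide for peptide, _ in peptide_hits}
    covers = {pid: unexplained & set(digest(seq)) for pid, seq in database.items()}
    chosen = []
    while unexplained:
        pid = max(sorted(covers), key=lambda k: len(covers[k] & unexplained))
        if not covers[pid] & unexplained:
            break  # remaining peptides map to no protein in the database
        chosen.append(pid)
        unexplained -= covers[pid]
    return chosen

# Hypothetical two-protein database; "VTLR" is shared between both entries.
database = {"P1": "GASPKVTLR", "P2": "VTLRDEFK"}
# Simulated noise-free spectra built from the theoretical fragments.
hits = [search(fragment_mz(p), peptide_mass(p), database) for p in ("GASPK", "VTLR")]
print(hits, protein_scores(hits, database), infer_proteins(hits, database))
```

In this toy case the shared peptide VTLR contributes to the score of both proteins, but the parsimony step reports only the protein needed to explain all identified peptides, which is the kind of ambiguity that protein inference has to resolve.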

Acknowledgments

R.M. is supported by Fundação para a Ciência e a Tecnologia (FCT) Ciência 2007. IPATIMUP is an Associate Laboratory of the Portuguese Ministry of Science, Technology and Higher Education and is partially supported by FCT. R.M. is further supported by FCT grants (PTDC/QUI-BIQ/099457/2008 and PTDC/EIA-EIA/099458/2008).

Author information

Authors and Affiliations

Institute of Molecular Pathology and Immunology of the University of Porto (IPATIMUP), Porto, Portugal
Rune Matthiesen

Authors

Rune Matthiesen

Editor information

Editors and Affiliations

Inst. Molecular Pathology & Immunology, Universidade do Porto, Porto, Portugal
Rune Matthiesen

About this protocol

Cite this protocol

Matthiesen, R. (2013). Algorithms for Database-Dependent Search of MS/MS Data. In: Matthiesen, R. (eds) Mass Spectrometry Data Analysis in Proteomics. Methods in Molecular Biology, vol 1007. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-392-3_5

DOI: https://doi.org/10.1007/978-1-62703-392-3_5
Published: 29 March 2013
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-62703-391-6
Online ISBN: 978-1-62703-392-3
eBook Packages: Springer Protocols
