Abstract
The frequently used bottom-up strategy for identifying proteins and their associated modifications nowadays typically generates thousands of MS/MS spectra, which are normally matched automatically against a protein sequence database. Search engines that take MS/MS spectra and a protein sequence database as input are referred to as database-dependent search engines. Many programs, both commercial and freely available, exist for database-dependent search of MS/MS spectra, and most of them have excellent user documentation. The aim here is therefore to outline the algorithmic strategies behind different search engines rather than to provide software user manuals. The process of database-dependent search can be divided into search strategy, peptide scoring, protein scoring, and finally protein inference. Most efforts in the literature have been put into comparing results from different software rather than discussing the underlying algorithms. Such practical comparisons can be cluttered by suboptimal implementation, and the observed differences are frequently caused by software parameter settings that have not been set properly to allow a fair comparison. In other words, an algorithmic idea can still be worth considering even if the software implementation has been demonstrated to be suboptimal. The aim of this chapter is therefore to split the algorithms for database-dependent searching of MS/MS data into the above steps so that the different algorithmic ideas become more transparent and comparable. Most search engines provide good implementations of the first three data analysis steps mentioned above, whereas the final step of protein inference is much less developed in most search engines and is in many cases performed by external software. The final part of this chapter illustrates how protein inference is built into the VEMS search engine and discusses a stand-alone program, SIR, for protein inference that can import a Mascot search result.
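To make the first two steps named above concrete, the following is a minimal Python sketch of a search strategy (selecting candidate peptides by precursor mass) followed by peptide scoring (counting matched fragment ions). It is an illustration under simplifying assumptions, not the algorithm of VEMS, Mascot, or any other engine: the function names, the tolerances, the fully tryptic digestion without missed cleavages, the absence of modifications, and the shared-peak-count score are all hypothetical choices made for this example. Protein scoring and protein inference are not covered.

```python
# Monoisotopic amino acid residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_peptides(protein, min_len=5):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return [p for p in peptides if len(p) >= min_len]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def fragment_mz(peptide):
    """Singly charged b- and y-ion m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(peptide)):
        ions.append(sum(masses[:i]) + PROTON)           # b_i
        ions.append(sum(masses[-i:]) + WATER + PROTON)  # y_i
    return ions


def shared_peak_count(peptide, peaks, fragment_tol):
    """Number of theoretical fragments with an observed peak within tolerance."""
    return sum(
        any(abs(mz - peak) <= fragment_tol for peak in peaks)
        for mz in fragment_mz(peptide)
    )


def search(peaks, precursor_mass, database, precursor_tol=0.02, fragment_tol=0.5):
    """Score every candidate whose mass matches the precursor; best hit first."""
    hits = []
    for protein_id, sequence in database.items():
        for peptide in tryptic_peptides(sequence):
            if abs(peptide_mass(peptide) - precursor_mass) <= precursor_tol:
                score = shared_peak_count(peptide, peaks, fragment_tol)
                hits.append((score, peptide, protein_id))
    return sorted(hits, reverse=True)


# Toy database with invented sequences, and a noise-free synthetic spectrum.
database = {"P1": "MKTAYIAKQRLGSPEPTIDEK", "P2": "GASVLLNDKWFHYCMR"}
target = "LGSPEPTIDEK"
hits = search(fragment_mz(target), peptide_mass(target), database)
```

In this sketch the precursor mass filter plays the role of the search strategy, since it restricts which database peptides are scored at all, and the shared peak count is one of the simplest possible peptide scores. Real search engines differ in both steps, which is why separating them makes the underlying ideas easier to compare.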
Acknowledgments
R.M. is supported by Fundação para a Ciência e a Tecnologia (FCT) Ciência 2007. IPATIMUP is an Associate Laboratory of the Portuguese Ministry of Science, Technology and Higher Education and is partially supported by FCT. R.M. is further supported by FCT grants (PTDC/QUI-BIQ/099457/2008 and PTDC/EIA-EIA/099458/2008).
Copyright information
© 2013 Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Matthiesen, R. (2013). Algorithms for Database-Dependent Search of MS/MS Data. In: Matthiesen, R. (eds) Mass Spectrometry Data Analysis in Proteomics. Methods in Molecular Biology, vol 1007. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-392-3_5
DOI: https://doi.org/10.1007/978-1-62703-392-3_5
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-62703-391-6
Online ISBN: 978-1-62703-392-3
eBook Packages: Springer Protocols