Database Search Engines: Paradigms, Challenges and Solutions

  • Kenneth Verheggen
  • Lennart Martens
  • Frode S. Berven
  • Harald Barsnes
  • Marc Vaudel
Part of the Advances in Experimental Medicine and Biology book series (AEMB, volume 919)


The first step in identifying proteins from mass spectrometry-based shotgun proteomics data is to infer peptides from tandem mass spectra, a task generally achieved using database search engines. This chapter introduces the basic principles of database search engines, with a focus on open-source software, and demonstrates their use through the freely available SearchGUI interface. It also discusses general issues related to sequence database searching and shows how to minimize their impact.


Keywords: Peptide identification; Search engines; Shotgun proteomics; Sequence database searching



PSM: Peptide Spectrum Match
PTM: Post-Translational Modification



K.V. acknowledges the support of Ghent University. L.M. acknowledges the support of Ghent University (Multidisciplinary Research Partnership “Bioinformatics: from nucleotides to networks”) and the IWT SBO grant ‘InSPECtor’ (120025). H.B. is supported by the Research Council of Norway.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Kenneth Verheggen (1, 2)
  • Lennart Martens (1, 2)
  • Frode S. Berven (3, 4, 5)
  • Harald Barsnes (3), corresponding author
  • Marc Vaudel (3)

  1. Department of Medical Protein Research, VIB, Ghent, Belgium
  2. Department of Biochemistry, Faculty of Medicine and Health Sciences, Ghent University, Ghent, Belgium
  3. Proteomics Unit, Department of Biomedicine, University of Bergen, Bergen, Norway
  4. KG Jebsen Centre for Multiple Sclerosis Research, Department of Clinical Medicine, University of Bergen, Bergen, Norway
  5. Norwegian Multiple Sclerosis Competence Centre, Department of Neurology, Haukeland University Hospital, Bergen, Norway
