Proteomics pp 245-259

Part of the Methods in Molecular Biology™ book series (MIMB, volume 564)

Algorithms and Databases



The capacity of proteomics methods and mass spectrometry instrumentation to generate data has grown substantially in recent years. This growth in data volume has in turn led to an increased reliance on software to identify peptide or protein sequences from the recorded mass spectra. Diverse algorithms can be applied to process these data, each performing a specific task such as spectrum quality filtering, spectral clustering and merging, assigning a sequence to a spectrum, or assessing the validity of these assignments.

The key algorithms in mass spectral processing pipelines are the ones that assign a sequence to a spectrum. The most commonly used variants of these depend crucially on the information contained in the sequence database, which they use as the basis for identification. Since these sequence databases are constructed in different ways and can therefore vary substantially in the amount and type of data they contain, they are also discussed here.
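To illustrate why sequence assignment depends so strongly on the database, the following is a minimal sketch of the first step of a generic database search: in silico tryptic digestion followed by precursor mass matching. It is not the method of any particular search engine. The function names, the toy database, and the 0.01 Da tolerance are assumptions made for illustration, and scoring of fragment ions is omitted entirely.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the residue sum


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def candidate_peptides(database, observed_mass, tolerance=0.01):
    """Return (protein name, peptide) pairs whose mass matches the precursor.

    `database` is a hypothetical dict mapping protein names to sequences.
    """
    hits = []
    for name, sequence in database.items():
        for peptide in tryptic_digest(sequence):
            if abs(peptide_mass(peptide) - observed_mass) <= tolerance:
                hits.append((name, peptide))
    return hits


# Toy database and a precursor mass taken from one of its peptides.
toy_db = {"protA": "MKAGRPLKGG", "protB": "GGAK"}
precursor = peptide_mass("AGRPLK")
matches = candidate_peptides(toy_db, precursor)
```

A spectrum can only be assigned a peptide that is present in the database being searched: running the same query against a database lacking `protA` returns no candidates, however good the spectrum is.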

Key words

Sequence database, Search algorithm, Mass spectrum, Clustering, Merging, Quality assignment, Tandem-MS, Identification, Protein, Peptide



Copyright information

© Humana Press, a part of Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. EMBL Outstation - Hinxton, European Bioinformatics Institute, Hinxton, Cambridge
