Methods to Calculate Spectrum Similarity

  • Şule Yilmaz
  • Elien Vandermarliere
  • Lennart Martens
Part of the Methods in Molecular Biology book series (MIMB, volume 1549)


Scoring functions that assess spectrum similarity play a crucial role in many computational mass spectrometry algorithms. These functions compare an experimentally acquired fragmentation (MS/MS) spectrum against one of two types of target MS/MS spectrum: either a theoretical MS/MS spectrum derived from a peptide in a sequence database, or another, previously acquired MS/MS spectrum. The former comparison is typically encountered in database searching, while the latter is used in spectrum clustering and spectral library searching. Acquired and theoretical MS/MS spectra are most commonly compared using cross-correlation or probability-derived scoring functions, while two acquired MS/MS spectra are typically compared using a normalized dot product, especially in spectral library search algorithms. In addition to these scoring functions, Pearson’s or Spearman’s correlation coefficients, mean squared error, or median absolute deviation scores can be used for the same purpose. Here, we describe and evaluate these scoring functions with regard to their ability to assess spectrum similarity for theoretical versus acquired, and acquired versus acquired spectra.
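As a rough illustration of the acquired-versus-acquired case, the Python sketch below computes a normalized dot product, Pearson's and Spearman's correlation coefficients, and a mean squared error between two binned spectra. It is a minimal sketch under stated assumptions, not the implementation evaluated in the chapter: the helper names, the 1.0 m/z bin width, the 2000 m/z upper limit, and the example peak lists are all hypothetical, and real tools typically apply further preprocessing (noise filtering, intensity scaling) before scoring. The median absolute deviation score is left out because its exact definition is not given in this text.

```python
import math
from statistics import mean


def bin_spectrum(peaks, bin_width=1.0, max_mz=2000.0):
    """Sum peak intensities into fixed-width m/z bins.

    `peaks` is a list of (m/z, intensity) pairs. The bin width and the
    upper m/z limit are illustrative assumptions.
    """
    n_bins = int(math.ceil(max_mz / bin_width))
    bins = [0.0] * n_bins
    for mz, intensity in peaks:
        index = int(mz / bin_width)
        if 0 <= index < n_bins:
            bins[index] += intensity
    return bins


def normalized_dot_product(a, b):
    """Cosine of the angle between two intensity vectors (0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def pearson(a, b):
    """Pearson's correlation coefficient (0 if either vector is constant)."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(a, b):
    """Spearman's correlation: Pearson's correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))


def mean_squared_error(a, b):
    """Mean of squared per-bin intensity differences (lower means more similar)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Two made-up peak lists that share two of their three peaks.
spectrum_a = [(175.1, 40.0), (262.2, 100.0), (375.2, 60.0)]
spectrum_b = [(175.1, 35.0), (262.2, 90.0), (489.3, 20.0)]
a, b = bin_spectrum(spectrum_a), bin_spectrum(spectrum_b)

print("normalized dot product:", normalized_dot_product(a, b))
print("Pearson:", pearson(a, b))
print("Spearman:", spearman(a, b))
print("mean squared error:", mean_squared_error(a, b))
```

Note that the dot product and the correlation coefficients are unaffected by a uniform rescaling of either spectrum, whereas the mean squared error is not, so intensities would normally be brought to a common scale before it is used.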

Key words

Mass spectrometry; Scoring functions; Spectrum similarity; Database searching; Spectrum library



Copyright information

© Springer Science+Business Media LLC 2017

Authors and Affiliations

  • Şule Yilmaz (1, 2, 3)
  • Elien Vandermarliere (1, 2, 3)
  • Lennart Martens (1, 2, 3)
  1. Medical Biotechnology Center, VIB, Ghent, Belgium
  2. Department of Biochemistry, Ghent University, Ghent, Belgium
  3. Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
