Challenges in Computational Analysis of Mass Spectrometry Data for Proteomics

  • Survey

Journal of Computer Science and Technology

Abstract

Mass spectrometry is an analytical technique for determining the composition of a sample. It has recently become a primary tool in proteomics research for protein identification and quantification and for characterizing post-translational modifications. Both the size and the complexity of the data produced by this experimental technique pose substantial computational challenges for data analysis. This article reviews some of these challenges and serves as an entry point for readers who want to study the area.

Author information

Corresponding author

Correspondence to Bin Ma.

Additional information

This work is supported by the National High-Tech Research and Development 863 Program of China under Grant No. 2008AA02Z313, NSERC RGPIN under Grant No. 238748-2006, and a start-up grant from the University of Waterloo.

About this article

Cite this article

Ma, B. Challenges in Computational Analysis of Mass Spectrometry Data for Proteomics. J. Comput. Sci. Technol. 25, 107–123 (2010). https://doi.org/10.1007/s11390-010-9309-1

  • DOI: https://doi.org/10.1007/s11390-010-9309-1
