Detection of Unknown Amino Acid Substitutions Using Error-Tolerant Database Search

  • Sven H. Giese
  • Franziska Zickmann
  • Bernhard Y. Renard
Part of the Methods in Molecular Biology book series (MIMB, volume 1362)


Recent studies have demonstrated that mass spectrometry-based variant detection is feasible. Typically, either genomic variant databases or transcript data are used to construct customized target databases for the identification of single-amino acid variants (SAAVs) in mass spectrometry data. However, both approaches require additional data to identify SAAVs. Here, we discuss the application of an error-tolerant peptide search engine such as BICEPS for identifying variants based exclusively on standard UniProt databases. This avoids unnecessary and redundant extensions of the search space. The workflow provides an unbiased view of the data: the search space is not limited to known variants, and no additional data are required. In a subsequent step, a second identification search is performed to verify the initially identified variant peptides and to aggregate information at the protein level.
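
The idea of an error-tolerant search can be illustrated with a minimal sketch: given a database peptide and an observed precursor mass, list the single amino acid substitutions whose mass difference explains the shift. This is not the BICEPS algorithm, which scores fragment spectra; it is only a precursor-mass illustration. The function names and the 0.01 Da tolerance are assumptions made for this example, and the masses are standard monoisotopic residue masses.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the residue sum


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def single_substitution_candidates(peptide, observed_mass, tol=0.01):
    """Hypothetical helper: return (position, original, replacement)
    tuples (1-based positions) for every single substitution whose mass
    shift matches the observed shift within tol (Da)."""
    delta = observed_mass - peptide_mass(peptide)
    hits = []
    for pos, orig in enumerate(peptide, start=1):
        for repl, mass in RESIDUE_MASS.items():
            if repl == orig:
                continue
            # Isobaric residues (L and I) cannot be told apart by mass.
            if abs((mass - RESIDUE_MASS[orig]) - delta) <= tol:
                hits.append((pos, orig, repl))
    return hits


if __name__ == "__main__":
    # Simulate an observed variant peptide carrying a D->E substitution.
    observed = peptide_mass("PEPTIEE")
    print(single_substitution_candidates("PEPTIDE", observed))
```

A precursor-mass match alone is ambiguous, since several substitutions share the same mass shift. This is one reason a subsequent verification search, as described in the abstract, is useful.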

Key words

  • Mass spectrometry
  • Variant peptide identification
  • Error-tolerant peptide identification
  • Single-amino acid variations
  • Single-nucleotide variants
  • Proteomics



The authors gratefully acknowledge financial support by the Deutsche Forschungsgemeinschaft (DFG), grant number RE3474/2-1 to BYR.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Sven H. Giese (1, 2, 3)
  • Franziska Zickmann (1)
  • Bernhard Y. Renard (1)

  1. Research Group Bioinformatics (NG4), Robert Koch-Institute, Berlin, Germany
  2. Department of Bioanalytics, Institute of Biotechnology, Technische Universität Berlin, Berlin, Germany
  3. Wellcome Trust Centre for Cell Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, UK
