Assessing Significance of Peptide Spectrum Matches in Proteomics: A Multiple Testing Approach

Statistics in Biosciences

Abstract

In the analysis of data from proteomic mass spectrometry experiments, an important issue is determining which of the observed peptide spectrum matches (PSMs) represent true positives. We view this problem through a multiple testing framework and develop procedures for deciding which PSMs are true. A key feature that distinguishes the problem from the differential expression problem in microarray analysis is that the null distribution can potentially be estimated from the data. However, this renders many of the asymptotic results in the statistical literature invalid. We prove some new key results for this problem using empirical process theory. We also develop a new multiple testing procedure that employs multivariate information from the peptide sequence searches. The proposed methods are studied using a real data set as well as simulated data.
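
As a rough illustration of the multiple testing view described in the abstract, the following Python sketch estimates a null distribution from data and then applies a false discovery rate cutoff. It is not the procedure developed in the article, which is not shown in this preview. It assumes a separate set of null ("decoy") match scores is available, assumes higher scores indicate better matches, and uses the standard Benjamini-Hochberg step-up rule as a generic stand-in. The function names `empirical_p_values` and `benjamini_hochberg` and the example scores are hypothetical.

```python
import bisect


def empirical_p_values(target_scores, decoy_scores):
    """P-value of each target score against an empirical null built from decoy scores.

    Assumption: higher score means a better match, and decoy scores behave
    like scores of false matches.
    """
    decoys = sorted(decoy_scores)
    n = len(decoys)
    p_values = []
    for s in target_scores:
        # Count decoy scores that are at least as large as the target score.
        n_ge = n - bisect.bisect_left(decoys, s)
        # The +1 terms keep every p-value strictly positive.
        p_values.append((n_ge + 1) / (n + 1))
    return p_values


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the null is rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value is under its threshold.
        if p_values[i] <= alpha * rank / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Made-up scores, for illustration only.
decoy_scores = list(range(1, 100))
target_scores = [150, 120, 50, 10]
p_values = empirical_p_values(target_scores, decoy_scores)
accepted = benjamini_hochberg(p_values, alpha=0.05)
```

Because the p-values here come from an estimated null rather than a known one, the usual guarantees of the step-up rule need not carry over directly, which is the difficulty the abstract points to.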

Author information

Corresponding author

Correspondence to Debashis Ghosh.

Cite this article

Ghosh, D. Assessing Significance of Peptide Spectrum Matches in Proteomics: A Multiple Testing Approach. Stat Biosci 1, 199–213 (2009). https://doi.org/10.1007/s12561-009-9012-3
