Abstract
In the analysis of data from proteomic mass spectrometry experiments, an important issue is determining which of the observed peptide spectrum matches (PSMs) represent true positives. We view this problem through a multiple testing framework and develop procedures for identifying true PSMs. A key feature that makes the problem relatively distinct from the differential expression problem in microarray analysis is that the null distribution can potentially be estimated from the data. However, this invalidates many of the asymptotic results from the statistical literature. We prove some new key results for this problem using empirical process theory. We also develop a new multiple testing procedure that employs multivariate information from the peptide sequence searches. The proposed methods are studied using a real data set as well as simulated data.
Cite this article
Ghosh, D. Assessing Significance of Peptide Spectrum Matches in Proteomics: A Multiple Testing Approach. Stat Biosci 1, 199–213 (2009). https://doi.org/10.1007/s12561-009-9012-3