Abstract
In the pursuit of personalized medicine, i.e., the individual treatment of a patient, many medical decisions should ideally be supported by biomarkers that help to establish a diagnosis, prediction, or prognosis. Proteomic biomarkers are of special interest because they can be detected not only in tissue samples but often also, with little effort, in diverse body fluids. Statistical methods play an important role in the discovery and validation of proteomic biomarkers: they are needed in the planning of experiments, in the processing of raw signals, and in the final data analysis. This review gives an overview of the most frequent experimental settings, including sample size considerations, and focuses on exploratory data analysis and classifier development.
Acknowledgements
The author would like to thank Prof Olga Vitek (Northeastern University, Boston) for very helpful comments on a previous version of the manuscript.
Copyright information
© 2016 Springer Science+Business Media New York
About this protocol
Cite this protocol
Jung, K. (2016). Statistical Aspects in Proteomic Biomarker Discovery. In: Jung, K. (eds) Statistical Analysis in Proteomics. Methods in Molecular Biology, vol 1362. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3106-4_19
DOI: https://doi.org/10.1007/978-1-4939-3106-4_19
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-3105-7
Online ISBN: 978-1-4939-3106-4
eBook Packages: Springer Protocols