Abstract
New technologies allow for high-dimensional profiling of patients. For instance, genome-wide gene expression analysis in tumors or in blood is feasible with microarrays if all transcripts are known, or, without this restriction, with high-throughput RNA sequencing. Other technologies, such as NMR fingerprinting, allow for high-dimensional profiling of metabolites in blood or urine. Such technologies open up novel possibilities for molecular diagnostics. In clinical profiling studies, researchers aim to predict disease type, survival, or treatment response for new patients from their high-dimensional profiles. In doing so, they encounter a series of obstacles and pitfalls. We review fundamental issues from machine learning and recommend a procedure for the computational aspects of a clinical profiling study.
Copyright information
© 2017 Springer Science+Business Media New York
About this protocol
Cite this protocol
Lottaz, C., Gronwald, W., Spang, R., Engelmann, J.C. (2017). High-Dimensional Profiling for Computational Diagnosis. In: Keith, J. (ed) Bioinformatics. Methods in Molecular Biology, vol 1526. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-6613-4_12
DOI: https://doi.org/10.1007/978-1-4939-6613-4_12
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-6611-0
Online ISBN: 978-1-4939-6613-4
eBook Packages: Springer Protocols