Statistical Genetics for Genomic Data Analysis

  • Reference work entry
Springer Handbook of Engineering Statistics

Part of the book series: Springer Handbooks (SHB)

Abstract

In this chapter, we briefly summarize emerging statistical concepts and approaches that have recently been developed and applied to the analysis of genomic data, such as microarray gene expression data. The first section introduces the general background and critical issues in the statistical sciences for genomic data analysis. The second section describes a novel concept of statistical significance, the false discovery rate (FDR), which is the proportion of false positives among all positive findings and has been suggested as a way to control the large number of false positives that arise in large-scale screening of biological data. The third section introduces two recent statistical testing methods: the significance analysis of microarrays (SAM) and local pooled error (LPE) tests. The LPE test, which gains power by pooling error information from genes at adjacent local intensity ranges, is particularly useful for analyzing microarray data with limited replication. The fourth section introduces analysis of variance (ANOVA) and heterogeneous error model (HEM) approaches, which have been suggested for analyzing microarray data obtained from multiple experimental and/or biological conditions. The last two sections describe data exploration and discovery tools broadly termed unsupervised learning and supervised learning. The former include several multivariate statistical methods for investigating coexpression patterns of multiple genes; the latter are used as classification methods to discover genetic markers for predicting important subclasses of human diseases. Most of the statistical software packages for the approaches introduced in this chapter are freely available at the open-source bioinformatics software web site Bioconductor (http://www.bioconductor.org/).
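To make the false discovery rate idea more concrete, here is a minimal sketch of the Benjamini-Hochberg step-up procedure, one widely used way to control the FDR across many simultaneous tests. The preview does not show how the chapter itself computes or controls the FDR, so this should be read as an illustration under that assumption. The function name `benjamini_hochberg`, the `alpha` parameter, and the example p-values are hypothetical and chosen only for this sketch.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Flag hypotheses rejected by the Benjamini-Hochberg step-up rule.

    Illustrative sketch only: p_values is a list of per-gene p-values,
    alpha is the target false discovery rate (both hypothetical names).
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value satisfies p_(k) <= alpha * k / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank

    # Reject every hypothesis ranked at or below k (the step-up behavior).
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# Made-up p-values for four genes, for illustration only.
example_p = [0.02, 0.03, 0.035, 0.9]
print(benjamini_hochberg(example_p, alpha=0.05))
```

Note the step-up behavior: a p-value that misses its own rank-specific threshold can still be rejected if a larger p-value further down the sorted list meets its threshold. This differs from a family-wise correction such as Bonferroni, which compares every p-value against the single threshold `alpha / m`.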

Abbreviations

AUC: area under the receiver operating characteristic curve

CIM: cluster-image map

FDR: false discovery rate

FWER: family-wise error rate

HEM: heterogeneous error model

LPE: local pooled error

LR: logistic regression

MAD: median absolute deviation

MiPP: misclassification penalized posterior

QDA: quadratic discriminant analysis

SAM: significance analysis of microarrays

Author information

Corresponding author

Correspondence to Jae Lee.

Copyright information

© 2006 Springer-Verlag

About this entry

Cite this entry

Lee, J. (2006). Statistical Genetics for Genomic Data Analysis. In: Pham, H. (ed.) Springer Handbook of Engineering Statistics. Springer Handbooks. Springer, London. https://doi.org/10.1007/978-1-84628-288-1_32

  • DOI: https://doi.org/10.1007/978-1-84628-288-1_32

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-85233-806-0

  • Online ISBN: 978-1-84628-288-1

  • eBook Packages: Engineering (R0)
