Statistical Genetics for Genomic Data Analysis

Lee, Jae

doi:10.1007/978-1-84628-288-1_32

Part of the book series: Springer Handbooks (SHB)

Abstract

In this chapter, we briefly summarize the statistical concepts and approaches that have recently been developed for and applied to the analysis of genomic data such as microarray gene expression data. The first section introduces the general background and critical issues in the statistical sciences for genomic data analysis. The second section describes a novel concept of statistical significance, the so-called false discovery rate (the rate of false positives among all positive findings), which has been suggested as a way to control the error rate when large-scale biological screening analyses produce numerous false positives. The third section introduces two recent statistical testing methods: the significance analysis of microarrays (SAM) test and the local pooled error (LPE) test. The latter, which is strengthened by pooling error information from adjacent genes within local intensity ranges, is particularly useful for analyzing microarray data with limited replication. The fourth section introduces analysis of variance (ANOVA) and heterogeneous error model (HEM) approaches that have been suggested for analyzing microarray data obtained from multiple experimental and/or biological conditions. The last two sections describe data exploration and discovery tools broadly termed unsupervised learning and supervised learning. The former approaches include several multivariate statistical methods for investigating coexpression patterns of multiple genes, and the latter approaches are used as classification methods to discover genetic markers for predicting important subclasses of human diseases. Most of the statistical software packages for the approaches introduced in this chapter are freely available at the open-source bioinformatics software web site Bioconductor (http://www.bioconductor.org/).
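As a rough illustration of the false discovery rate idea summarized above, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure, one widely used way to control the FDR. It is not taken from the chapter itself: the function name `benjamini_hochberg`, the FDR level `q`, and the p-values are all hypothetical and chosen only for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected at FDR level q.

    Sketch of the Benjamini-Hochberg step-up procedure; assumes the
    p-values come from independent (or positively dependent) tests.
    """
    m = len(p_values)
    # Order hypothesis indices by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    return sorted(order[:k_max])


# Hypothetical p-values for ten genes (illustrative only, not real data).
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p, q=0.05)
print(rejected)  # indices of genes declared significant
```

Because the procedure is step-up, a gene whose p-value misses its own rank threshold can still be rejected when a gene with a larger p-value meets a later threshold. A plain per-gene cutoff of 0.05 would have flagged five genes in this toy example, whereas the FDR-controlling rule flags only the two strongest.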

Abbreviations

AUC: area under the receiver operating characteristic curve
CIM: cluster-image map
FDR: false discovery rate
FWER: family-wise error rate
HEM: heterogeneous error model
LPE: local pooled error
LR: logistic regression
MAD: median absolute deviation
MiPP: misclassification penalized posterior
QDA: quadratic discriminant analysis
SAM: significance analysis of microarrays

Author information

Authors and Affiliations

Public Health Sciences, University of Virginia, PO Box 800717, 22908, Charlottesville, VA, USA
Jae Lee

Corresponding author

Correspondence to Jae Lee.

Editor information

Editors and Affiliations

Department of Industrial and Systems Engineering, Rutgers, The State University of New Jersey, 96 Frelinghuysen Road, 08854, Piscataway, NJ, USA
Prof. Hoang Pham

Cite this entry

Lee, J. (2006). Statistical Genetics for Genomic Data Analysis. In: Pham, H. (eds) Springer Handbook of Engineering Statistics. Springer Handbooks. Springer, London. https://doi.org/10.1007/978-1-84628-288-1_32

DOI: https://doi.org/10.1007/978-1-84628-288-1_32
Publisher Name: Springer, London
Print ISBN: 978-1-85233-806-0
Online ISBN: 978-1-84628-288-1
eBook Packages: Engineering, Engineering (R0)
