Challenges and Approaches to Statistical Design and Inference in High-Dimensional Investigations

  • Gary L. Gadbury
  • Karen A. Garrett
  • David B. Allison
Part of the Methods in Molecular Biology™ book series (MIMB, volume 553)


Advances in modern technologies have facilitated high-dimensional experiments (HDEs) that generate tremendous amounts of genomic, proteomic, and other “omic” data. HDEs involving whole-genome sequences and polymorphisms, expression levels of genes, protein abundance measurements, and combinations thereof have driven the development of new approaches to analyzing HDE data. Such situations demand creative approaches to statistical inference, estimation, prediction, classification, and study design. The novel and challenging biological questions asked of HDE data have led to the development of many specialized analytic techniques. This chapter discusses some of the unique statistical challenges facing investigators studying high-dimensional biology and describes some approaches being developed by statistical scientists. We focus in particular on the increasing interest in testing multiple propositions simultaneously, on inferential indicators appropriate to the types of questions biologists are interested in, and on the need for replication of results across independent studies, investigators, and settings. A key consideration throughout is the challenge of providing methods that a statistician judges to be sound and a biologist finds informative.

Key words

FDR, genomics, high-dimensional, microarray, multiple testing, statistics



D. Allison and G. Gadbury acknowledge the support from NIH Grant U54 CA100949. K. Garrett acknowledges the support from NSF Grants DBI-0421427, DEB-0516046, and EF-0525712, and DOE Grant DE-FG02-04ER63892. Gadbury and Garrett are grateful to programs supporting research from the Ecological Genomics Institute, Kansas State University, and the NSF Long Term Ecological Research Program at Konza Prairie, Kansas, and the Kansas Agricultural Experiment Station (Contribution No. 09-118-B).



Copyright information

© Humana Press, a part of Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Gary L. Gadbury (1)
  • Karen A. Garrett (2)
  • David B. Allison (3)
  1. Department of Statistics, Kansas State University, Manhattan, USA
  2. Department of Plant Pathology, Kansas State University, Manhattan, USA
  3. Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, USA
