Abstract
This core chapter introduces the theory behind the advanced statistical methods applied in the later chapters on QTL mapping and GWAS analysis; more basic statistical methods are covered in the Appendix. Section 3.2 deals with multiple testing. It first covers classical procedures, such as the Bonferroni correction, as well as approaches based on permutation and resampling, which guarantee control of the familywise error rate (FWER). It then discusses more modern techniques, such as the Benjamini-Hochberg procedure for controlling the false discovery rate (FDR), followed by a more advanced theoretical discussion of optimal multiple testing strategies in high dimensions. The second part of the chapter is concerned with model selection. Section 3.3 introduces the basic concepts of likelihood and then recapitulates the derivation of Akaike’s information criterion (AIC) from information-theoretic principles. AIC is then compared with the Bayesian information criterion (BIC) in the context of Bayesian model selection. The chapter explains why both AIC and BIC fail in a high-dimensional setting and presents several modifications of BIC designed to control either FWER or FDR. It ends by discussing various further approaches to model selection in high dimensions.
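To make the contrast between FWER and FDR control concrete, the following minimal sketch applies the Bonferroni correction and the Benjamini-Hochberg step-up rule to the same list of p-values. It is an illustration only, not code from the chapter: the function names, the level alpha = 0.05 and the example p-values are assumptions chosen for demonstration, and the FDR guarantee of the Benjamini-Hochberg rule is assumed to hold under independence or positive dependence of the test statistics.

```python
from typing import List


def bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject H_i when p_i <= alpha / m (controls the FWER at level alpha)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Step-up rule: find the largest k with p_(k) <= k * alpha / m and
    reject the k hypotheses with the smallest p-values."""
    m = len(p_values)
    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank  # keep the largest rank that meets its threshold
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Hypothetical p-values, for illustration only.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print("Bonferroni rejections:", sum(bonferroni(p)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(p)))
```

On this hypothetical input Bonferroni rejects only the smallest p-value, whereas the Benjamini-Hochberg rule rejects the two smallest, which reflects the less stringent error criterion it targets.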
References
Abramovich, F., Benjamini, Y., Donoho, D.L., Johnstone, I.M.: Adapting to unknown sparsity by controlling the false discovery rate. Ann. Stat. 34, 584–653 (2006)
Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19(6), 716–723 (1974)
Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Proceedings of the 2nd International Symposium on Information Theory, pp. 267–281 (1973)
Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995)
Benjamini, Y., Hochberg, Y.: On the adaptive control of the false discovery rate in multiple testing with independent statistics. J. Educ. Behav. Stat. 25, 60–83 (2000)
Benjamini, Y., Yekutieli, D.: The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29(4), 1165–1188 (2001)
Bera, A.K., Bilias, Y.: Rao’s score, Neyman’s \(C(\alpha )\) and Silvey’s LM tests: an essay on historical developments and some new results. J. Stat. Plan. Infer. 97, 9–44 (2001)
Birgé, L., Massart, P.: Gaussian model selection. J. Eur. Math. Soc. (JEMS) 3, 203–268 (2001)
Bogdan, M., Chakrabarti, A., Frommlet, F., Ghosh, J.K.: Asymptotic Bayes-optimality under sparsity of some multiple testing procedures. Ann. Stat. 39, 1551–1579 (2011)
Bogdan, M., Frommlet, F., Szulc, P., Tang, H.: Model selection approach for genome wide association studies in admixed populations. Technical Report (2013)
Bogdan, M., Ghosh, J.K., Doerge, R.W.: Modifying the Schwarz Bayesian information criterion to locate multiple interacting quantitative trait loci. Genetics 167, 989–999 (2004)
Bogdan, M., Ghosh, J.K., Tokdar, S.T.: A comparison of the Simes-Benjamini-Hochberg procedure with some Bayesian rules for multiple testing. In: Balakrishnan, N., Peña, E., Silvapulle, M.J. (eds.) Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen, IMS collections, vol. 1, pp. 211–230. Beachwood, Ohio (2008)
Bogdan, M., van den Berg, E., Sabatti, C., Su, W., Candès, E.J.: SLOPE—Adaptive Variable Selection via Convex Optimization. Ann. Appl. Stat. 9, 1103–1140 (2015)
Bogdan, M., van den Berg, E., Su, W., Candès, E.J.: Statistical estimation and testing via the sorted \(\ell _1\) norm. arXiv:1310.1969 (2013)
Bogdan, M., Żak-Szatkowska, M., Ghosh, J.K.: Selecting explanatory variables with the modified version of Bayesian Information criterion. Qual. Reliab. Eng. Int. 24, 627–641 (2008)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Broberg, P.: A comparative review of estimates of the proportion unchanged genes and the false discovery rate. BMC Bioinform. 6, 199 (2005)
Broman, K.W., Speed, T.P.: A model selection approach for the identification of quantitative trait loci in experimental crosses. J. Roy. Stat. Soc.: Ser. B (Stat. Meth.) 64(4), 641–656 (2002)
Bühlmann, P., van de Geer, S.: Statistics for High-Dimensional Data. Springer, Heidelberg (2011)
Burnham, K.P., Anderson, D.R.: Model Selection and Multimodel Inference, 2nd edn. Springer, New York (2002)
Cai, T., Jin, J.: Optimal rates of convergence for estimating the null and proportion of non-null effects in large-scale multiple testing. Ann. Stat. 38, 100–145 (2010)
Candès, E.J., Plan, Y.: Near-ideal model selection by \(\ell _1\) minimization. Ann. Stat. 37, 2145–2177 (2009)
Chipman, H., George, E.I., McCulloch, R.E.: The practical implementation of Bayesian model selection. In: Lahiri, P. (ed.) Model Selection (IMS Lecture Notes), pp. 65–116. Beachwood, OH (2001)
Chun, H., Keles, S.: Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J. Roy. Stat. Soc.: Ser. B (Stat. Meth.) 72(1), 3–25 (2010)
Churchill, G.A., Doerge, R.W.: Empirical threshold values for quantitative trait mapping. Genetics 138, 963–971 (1994)
De Leeuw, J., Hornik, K., Mair, P.: Isotone optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and active set methods. J. Stat. Softw. 32(5), 1–24 (2009)
Do, K., Müller, P., Tang, F.: A Bayesian mixture model for differential gene expression. Appl. Stat. 54, 627–644 (2005)
Doerge, R.W., Churchill, G.A.: Permutation tests for multiple loci affecting a quantitative character. Genetics 142, 285–294 (1996)
Donoho, D., Tanner, J.: Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Phil. Trans. R. Soc. A 367, 4273–4293 (2009)
Dudoit, S., Shaffer, J.P., Boldrick, J.C.: Multiple hypothesis testing in microarray experiments. Stat. Sci. 18, 71–103 (2003)
Dudoit, S., van der Laan, M.J.: Multiple Testing Procedures with Applications to Genomics. Springer, New York (2008)
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2), 407–499 (2004)
Efron, B., Tibshirani, R., Storey, J.D., Tusher, V.: Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. 96, 1151–1160 (2001)
Efron, B., Tibshirani, R.: Empirical Bayes methods and false discovery rates for microarrays. Genet. Epidemiol. 23, 70–86 (2002)
Efron, B.: Microarrays, empirical Bayes and the two-group model. Stat. Sci. 23(1), 1–22 (2008)
Ferreira, J.A., Zwinderman, A.H.: On the Benjamini-Hochberg method. Ann. Stat. 34(4), 1827–1849 (2006)
Foster, D.P., Stine, R.A.: Local asymptotic coding and the minimum description length. IEEE Trans. Inf. Theor. 45, 1289–1293 (1999)
Frank, I.E., Friedman, J.H.: A statistical view of some chemometrics regression tools. Technometrics 35, 109–148 (1993)
Frommlet, F., Bogdan, M.: Some optimality properties of FDR controlling rules under sparsity. Technical Report (2012)
Frommlet, F., Chakrabarti, A., Murawska, M., Bogdan, M.: Asymptotic Bayes optimality under sparsity for generally distributed effect sizes under the alternative. arXiv:1005.4753 (2011)
Genovese, C., Wasserman, L.: A stochastic process approach to false discovery control. Ann. Stat. 32, 1035–1061 (2004)
Genovese, C., Wasserman, L.: Operating characteristics and extensions of the false discovery rate procedure. J. Roy. Stat. Soc. Ser. B 64, 499–517 (2002)
George, E.I., Foster, D.P.: Calibration and empirical Bayes variable selection. Biometrika 87, 731–747 (2000)
Ghosh, J.K., Samanta, T.: Model selection—an overview. Curr. Sci. 80, 1135–1144 (2001)
Hochberg, Y., Tamhane, A.C.: Multiple Comparison Procedures. Wiley, New York (1987)
Hochberg, Y.: A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75, 800–803 (1988)
Hoerl, A.E., Kennard, R.W.: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970)
Holm, S.: A simple sequentially rejective Bonferroni test procedure. Scand. J. Stat. 6, 65–70 (1979)
Hsu, J.C.: Multiple Comparisons: Theory and Methods. Chapman and Hall, New York (1996)
James, W., Stein, C.: Estimation with quadratic loss. Proc. Fourth Berkeley Symp. Math. Stat. Prob. 1, 361–379 (1961)
Jin, J., Cai, T.C.: Estimating the null and the proportion of non-null effects in large-scale multiple comparisons. J. Am. Stat. Assoc. 102, 495–506 (2007)
Johnstone, I.M., Silverman, B.W.: EbayesThresh: R programs for empirical Bayes thresholding. J. Stat. Softw. 12(8) (2005)
Johnstone, I.M., Silverman, B.W.: Needles and straw in haystacks: empirical Bayes estimates of possibly sparse sequences. Ann. Stat. 32, 1594–1649 (2004)
Korn, E.L., Troendle, J.F., McShane, L.M., Simon, R.: Controlling the number of false discoveries: application to high-dimensional genomic data. J. Stat. Plan. Infer. 124(2), 379–398 (2004)
Kullback, S.: Information Theory and Statistics. John Wiley and Sons, New York (1959)
Lehmann, E.L., Romano, J.P.: Generalizations of the familywise error rate. Ann. Stat. 33, 1138–1154 (2005)
Lehmann, E.L., Romano, J.P.: Testing Statistical Hypotheses. Springer, New York (2005)
Lehmann, E.L., D’Abrera, H.J.M.: Nonparametrics: Statistical Methods Based on Ranks. McGraw-Hill, New York (1975)
Marcus, R., Peritz, E., Gabriel, K.R.: On closed testing procedures with special reference to ordered analysis of variance. Biometrika 63, 655–660 (1976)
Martin, R., Tokdar, S.T.: A nonparametric empirical Bayes framework for large-scale multiple testing. Biostatistics. 13, 427–439 (2012)
Müller, P., Parmigiani, G., Rice, K.: FDR and Bayesian multiple comparisons rules. In: Proceedings of the Valencia/ISBA 8th World Meeting on Bayesian Statistics. Oxford University Press (2007)
Neuvial, P., Roquain, E.: On false discovery rate thresholding for classification under sparsity. Ann. Stat. 40, 2572–2600 (2012)
Neyman, J., Pearson, E.: On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. R. Soc. Ser. A 231, 289–337 (1933)
Rao, C.R., Wu, Y.: On model selection. In: Lahiri, P. (ed.) Model selection (IMS Lecture Notes), pp. 1–57. Beachwood, OH (2001)
Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
Scott, J.G., Berger, J.O.: An exploration of aspects of Bayesian multiple testing. J. Stat. Plan. Inf. 136, 2144–2162 (2006)
Seber, G.A.F., Lee, A.J.: Linear Regression Analysis. John Wiley and Sons (2003)
Seeger, P.: A note on a method for the analysis of significance en masse. Technometrics. 10, 586–593 (1968)
Shaffer, J.P.: Multiple hypothesis testing. Annu. Rev. Psychol. 46, 561–584 (1995)
Simes, R.J.: An improved Bonferroni procedure for multiple tests of significance. Biometrika 73(3), 751–754 (1986)
Stein, C.: Inadmissibility of the usual estimator for the mean of a multivariate distribution. Proc. Third Berkeley Symp. Math. Stat. Prob. 1, 197–206 (1956)
Storey, J.D.: The positive false discovery rate: a Bayesian interpretation and the q-value. Ann. Stat. 31(6), 2013–2035 (2003)
Storey, J.D.: A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B 64, 479–498 (2002)
Sun, T., Zhang, C.-H.: Scaled sparse linear regression. Biometrika 99(4), 879–898 (2012)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B 58(1), 267–288 (1996)
Tibshirani, R., Knight, K.: The covariance inflation criterion for adaptive model selection. J. Roy. Stat. Soc. B 55, 757–796 (1999)
Westfall, P.H., Young, S.S.: Resampling-Based Multiple Testing. Wiley, New York (1993)
Wettenhall, J.M., Smyth, G.K.: limmaGUI: a graphical user interface for linear modeling of microarray data. Bioinformatics 20(18), 3705–3706 (2004)
Wold, H.: Estimation of principal components and related models by iterative least squares. In: Krishnaiah, P.R. (ed.) Multivariate Analysis, pp. 391–420. Academic Press, New York (1966)
Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. Roy. Stat. Soc. Ser. B 68(1), 49–67 (2006)
Żak-Szatkowska, M., Bogdan, M.: Modified versions of Bayesian information criterion for sparse generalized linear models. Comput. Stat. Data Anal. 55, 2908–2924 (2011)
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. B 67(2), 301–320 (2005)
© 2016 Springer-Verlag London
About this chapter
Cite this chapter
Frommlet, F., Bogdan, M., Ramsey, D. (2016). Statistical Methods in High Dimensions. In: Phenotypes and Genotypes. Computational Biology, vol 18. Springer, London. https://doi.org/10.1007/978-1-4471-5310-8_3
DOI: https://doi.org/10.1007/978-1-4471-5310-8_3
Publisher Name: Springer, London
Print ISBN: 978-1-4471-5309-2
Online ISBN: 978-1-4471-5310-8
eBook Packages: Computer Science, Computer Science (R0)