Abstract
Motivated by applications in genomics, we study in this paper four interrelated high-dimensional hypothesis testing problems on dependence structures among multiple populations. A new test statistic is constructed for testing the global hypothesis that multiple covariance matrices are equal, and its limiting null distribution is established. Correction methods are introduced to improve the accuracy of the test for finite samples. It is shown that the proposed tests are powerful against sparse alternatives and enjoy certain optimality properties. We then propose a multiple testing procedure for simultaneously testing the equality of the entries of the covariance matrices across multiple populations. The proposed method is shown to control the false discovery rate. A simulation study demonstrates that the proposed tests maintain the desired error rates under the null and have good power under the alternative. The methods are also applied to a Novartis multi-tissue analysis. In addition, testing and support recovery of submatrices of multiple covariance matrices are studied.
Acknowledgements
Funding was provided by the “Recruitment Program of Global Experts” Youth Project, the startup fund from Fudan University, and the National Science Foundation of China (Grant No. 11690013).
About this article
Cite this article
Xia, Y. Testing and support recovery of multiple high-dimensional covariance matrices with false discovery rate control. TEST 26, 782–801 (2017). https://doi.org/10.1007/s11749-017-0533-7
DOI: https://doi.org/10.1007/s11749-017-0533-7
Keywords
- Correction
- Extreme value distribution
- High-dimensional test
- Limiting null distribution
- Multiple testing
- Sparsity