Abstract
Estimation and hypothesis testing for the covariance matrix in high dimensions are challenging problems because the traditional multivariate asymptotic theory is no longer valid. When the dimension is larger than, or increasing with, the sample size, standard likelihood-based tests for the covariance matrix perform poorly. Existing high-dimensional tests are either computationally expensive or have very weak control of the type I error. In this paper, we propose a test procedure, CRAMP (covariance testing using random matrix projections), for testing hypotheses involving one or more covariance matrices using random projections. Randomly projecting the high-dimensional data into lower-dimensional subspaces alleviates the curse of dimensionality, allowing the use of traditional multivariate tests. An extensive simulation study compares CRAMP against asymptotics-based high-dimensional test procedures, and an application of the proposed method to two gene expression data sets is presented.
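The project-then-test idea can be sketched in a few lines. The toy version below is illustrative only, not the authors' implementation: it uses John's (1972) sphericity test as the low-dimensional test, draws projections with orthonormal rows via a QR decomposition, and combines p-values by simple averaging; all function names are ours.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

def random_projection(p, k, rng):
    """Draw a k x p projection with orthonormal rows (QR of a Gaussian matrix)."""
    G = rng.standard_normal((p, k))
    Q, _ = np.linalg.qr(G)   # p x k, orthonormal columns
    return Q.T               # k x p, satisfies R R^T = I_k

def john_sphericity_pvalue(X):
    """John (1972) sphericity test p-value for n x k data X, k small."""
    n, k = X.shape
    S = np.cov(X, rowvar=False)
    D = S / (np.trace(S) / k) - np.eye(k)
    U = np.trace(D @ D) / k
    stat = n * k * U / 2.0                    # ~ chi2 with k(k+1)/2 - 1 df under H0
    return chi2.sf(stat, k * (k + 1) // 2 - 1)

def cramp_sketch(X, k=5, M=100, rng=rng):
    """Average the low-dimensional test's p-values over M random projections."""
    p = X.shape[1]
    pvals = [john_sphericity_pvalue(X @ random_projection(p, k, rng).T)
             for _ in range(M)]
    return np.mean(pvals)

# Under H0 (spherical covariance) each projected p-value is roughly Unif(0,1),
# so the averaged p-value concentrates near 1/2.
X = rng.standard_normal((50, 200))   # n = 50 observations, p = 200 >> n
print(cramp_sketch(X))
```

The paper's actual procedure calibrates the averaged p-value via its null distribution (see the appendix); the sketch only shows the projection-and-combine mechanics.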
References
Achlioptas D (2001) Database-friendly random projections. In: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’01, pp 274–281, New York, NY, USA. Association for Computing Machinery. ISBN 1581133618
Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc National Acad Sci 96(12):6745–6750. ISSN 0027-8424
Anderson TW (2003) An introduction to multivariate statistical analysis, 3rd edn. Wiley Series in Probability and Statistics. ISBN 978-0-471-36091-9
Ayyala DN (2020) High-dimensional statistical inference: Theoretical development to data analytics (Chapter 6), volume 43 of Handbook of Statistics, pp. 289–335. Elsevier. https://doi.org/10.1016/bs.host.2020.02.003
Burr M, Gao S, Knoll F (2018) Optimal bounds for Johnson-Lindenstrauss transformations. J Mach Learn Res 19:1–22
Cai T, Liu W, Xia Y (2013) Two-sample covariance matrix testing and support recovery in high-dimensional and sparse settings. J Am Stat Assoc 108(501):265–277
Cai TT, Li H, Liu W, Xie J (2012) Covariate-adjusted precision matrix estimation with an application in genetical genomics. Biometrika 100(1):139–156, 11. ISSN 0006-3444. https://doi.org/10.1093/biomet/ass058
Cannings TI (2021) Random projections: data perturbation for classification problems. WIREs Comput Stat 13(1):e1499. https://doi.org/10.1002/wics.1499
Cannings TI, Samworth RJ (2017) Random-projection ensemble classification. J R Stat Soc Ser B (Stat Methodol) 79(4):959–1035
Chen SX, Zhang LX, Zhong PS (2010) Tests for high-dimensional covariance matrices. J Am Stat Assoc 105(490):810–819
Fisher TJ (2012) On testing for an identity covariance matrix when the dimensionality equals or exceeds the sample size. J Stat Plann Inference 142(1):312–326
Fisher TJ, Sun X, Gallagher CM (2010) A new test for sphericity of the covariance matrix for high dimensional data. J Multivar Anal 101(10):2554–2570
Hu J, Bai Z (2016) A review of 20 years of naive tests of significance for high-dimensional mean vectors and covariance matrices. Sci China Math 59:2281–2300
John S (1972) The distribution of a statistic used for testing sphericity of normal distributions. Biometrika 59(1):169–173
Johnson WB, Lindenstrauss J (1984) Extensions of Lipschitz mappings into a Hilbert space. Contemp Math 26:189–206
Ledoit O, Wolf M (2002) Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size. Ann Stat 30(4):1081–1102
Li J, Chen SX (2012) Two sample tests for high-dimensional covariance matrices. Ann Stat 40(2):908–940
Lopes M, Jacob L, Wainwright MJ (2011) A more powerful two-sample test in high dimensions using random projection. In: Advances in Neural Information Processing Systems, pp 1206–1214
Nagao H (1973) On some test criteria for covariance matrix. Ann Stat 1(4):700–709
Qian M, Tao L, Li E, Tian M (2020) Hypothesis testing for the identity of high-dimensional covariance matrices. Stat Probab Lett 161:108699
Rencher AC, Christensen WF (2012) Methods of multivariate analysis, 3rd edn. Wiley. ISBN 9781118391686
Schclar A, Rokach L (2009) Random projection ensemble classifiers. In: Filipe J, Cordeiro J (eds) Enterprise information systems. Springer, Berlin, pp 309–316
Schott JR (2007) A test for the equality of covariance matrices when the dimension is large relative to the sample sizes. Comput Stat Data Anal 51(12):6535–6542
Srivastava MS, Yanagihara H, Kubokawa T (2014) Tests for covariance matrices in high dimension with less sample size. J Multivar Anal 130:289–309
Thanei G-A, Heinze C, Meinshausen N (2017) Random projections for large-scale regression, pp 51–68. Springer International Publishing, Cham. ISBN 978-3-319-41573-4. https://doi.org/10.1007/978-3-319-41573-4_3
van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(86):2579–2605
Wu T-L, Li P (2020) Projected tests for high-dimensional covariance matrices. J Stat Plann Inference, 207:73–85. ISSN 0378-3758
Zhao SD, Cai TT, Li H (2014) Direct estimation of differential networks. Biometrika 101(2):253–268. ISSN 0006-3444. https://doi.org/10.1093/biomet/asu009
Appendix
Proof of Theorem 1
The proof of Theorem 1 is along the same lines as the proof of Theorem 2 in Srivastava et al. (2014). To show that the distribution of \(\overline{\pi }_U\) is independent of \(\sigma \), define \(\mathbf {X}^*_{m; i} = \mathcal {R}_m \mathbf {X}_i, i = 1, \ldots , n, m = 1, \ldots , M\) as the projection of the \(i^\mathrm{th}\) observation using the \(m^\mathrm{th}\) random projection matrix. Then we have \(\mathcal {S}^*_m = \mathcal {R}_m \mathcal {S} \mathcal {R}_m^{\top }\),
where \(\mathcal {S}\) and \(\mathcal {S}_m^*\) are the sample covariance matrices of the original and projected observations respectively. From equation (17), the p-values based on M i.i.d. random projection matrices are
First, since the random projection matrices are independent, the p-values \(\pi _1, \ldots , \pi _M\) are independent and identically distributed conditional on the data \(\mathcal {X} = \{\mathbf {X}_1, \ldots , \mathbf {X}_n\}\) and \(\mathcal {Y} = \{\mathbf {Y}_1, \ldots , \mathbf {Y}_m \}\). The identical distribution follows from the orthogonality of the projection matrices, which preserves the spherical covariance structure (\(\mathcal {R} \left( \sigma ^2 \mathcal {I}_p \right) \mathcal {R}^{\top } = \sigma ^2 \mathcal {I}_k\)). Additionally, we can write
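The covariance-preservation identity invoked here can be checked numerically. A minimal sketch, assuming the projection is drawn with orthonormal rows via a QR decomposition (one common construction; the paper's generation scheme may differ):

```python
import numpy as np

rng = np.random.default_rng(42)
p, k, sigma2 = 100, 4, 2.5

# Draw a k x p projection with orthonormal rows: R R^T = I_k.
G = rng.standard_normal((p, k))
Q, _ = np.linalg.qr(G)   # p x k, orthonormal columns
R = Q.T                  # k x p

# Orthonormal rows preserve a spherical covariance:
# R (sigma^2 I_p) R^T = sigma^2 (R R^T) = sigma^2 I_k.
projected_cov = R @ (sigma2 * np.eye(p)) @ R.T
print(np.allclose(projected_cov, sigma2 * np.eye(k)))  # True
```

Because the projected covariance is again spherical with the same \(\sigma^2\), each projection faces the same null hypothesis, which is what makes the conditional p-values identically distributed.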
where the expected value is with respect to the distribution of the observations and the probability is with respect to the randomness of the projection matrix.
By the conditional independence of \(\pi _1, \ldots , \pi _M\) and the central limit theorem, we have a normal approximation to the probability in (A.20)
Hence the probability \(P \left[ \overline{\pi } < u \right] \) can be approximated using only the moments of \(U | \mathcal {X}, \mathcal {Y}\). Under the null hypothesis \(H_{0S}\), the variable \(U | \mathcal {X}, \mathcal {Y}\) is defined as
The uniform distribution follows from the standard property of p-values under the null hypothesis and does not depend on \(\sigma ^2\). Using this property, we shall show that the distributions of \(E_{\mathcal {R}} \left[ U | \mathcal {X}, \mathcal {Y} \right] \) and \(\mathrm{var}_{\mathcal {R}} \left[ U | \mathcal {X}, \mathcal {Y} \right] \) with respect to \(\mathcal {X}, \mathcal {Y}\) are also independent of \(\sigma ^2\).
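The normal approximation above is easy to visualize: under \(H_0\) the conditional p-values behave like i.i.d. Unif(0, 1) draws, so their average \(\overline{\pi}\) is approximately \(N(1/2, 1/(12M))\). A quick Monte Carlo sketch (the values of M and the replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
M, reps = 100, 20000

# Each row: M i.i.d. Unif(0,1) "p-values"; pi_bar is their mean.
pi_bar = rng.uniform(size=(reps, M)).mean(axis=1)

# CLT: pi_bar ~ approx N(1/2, 1/(12 M)), since var(Unif(0,1)) = 1/12.
print(pi_bar.mean())            # close to 0.5
print(pi_bar.var() * 12 * M)    # close to 1.0
```

In the actual procedure the \(\pi_m\) are only conditionally i.i.d. given the data, so the moments of \(U | \mathcal{X}, \mathcal{Y}\) replace the exact uniform moments used in this sketch.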
Let W denote the expected value of \(U | \mathcal {X}, \mathcal {Y}\) with respect to \(\mathcal {R}\),
where the integral is with respect to the distribution of the random projection matrix \(\mathcal {R}\). While the exact value of the integral is not important, note that by equation (A.22) the integrand is independent of \(\sigma ^2\). Since the random projection matrices are generated independently of the distribution of the observations, the variable W is independent of \(\sigma ^2\). For any \(m \ge 1\), the \(m^\mathrm{th}\) moment of W is given by
Interchanging the integrals by Fubini’s theorem, we have
By the construction of U in equation (A.22), the integral \(\left\{ \int U_{\mathcal {R}_1} \ldots U_{\mathcal {R}_m} \, dF_{\mathcal {X}, \mathcal {Y}} \right\} \) is independent of \(\sigma ^2\). Therefore, all moments of W are independent of \(\sigma ^2\) which implies that the distribution of W is independent of \(\sigma ^2\).
Similarly, it can be shown that the distribution of \(\mathrm{var}_{\mathcal {R}} \left( U| \mathcal {X}, \mathcal {Y} \right) \) is also independent of \(\sigma ^2\). From the independence of the mean and variance, we have the distributions of
are independent of \(\sigma ^2\). Finally, combining this independence with equation (A.21), we have
with the right hand side independent of \(\sigma ^2\). Taking expected values with respect to \(\mathcal {X}\) and \(\mathcal {Y}\), we have
By equation (A.25), the right hand side in (A.26) is also independent of \(\sigma ^2\), completing the proof. \(\square \)
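The scale invariance established by Theorem 1 can be illustrated numerically. In the sketch below (ours, not the paper's code), John's statistic standardizes the sample covariance by its average eigenvalue, so scaling the data by any \(\sigma\) leaves every projected p-value, and hence \(\overline{\pi}\), unchanged:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)

def john_pvalue(X):
    """John (1972) sphericity p-value; scale-invariant in X by construction."""
    n, k = X.shape
    S = np.cov(X, rowvar=False)
    D = S / (np.trace(S) / k) - np.eye(k)   # S is standardized, so sigma cancels
    stat = n * np.trace(D @ D) / 2.0
    return chi2.sf(stat, k * (k + 1) // 2 - 1)

p, k, n, M = 150, 4, 40, 50
X = rng.standard_normal((n, p))

pv_unit, pv_scaled = [], []
for _ in range(M):
    Q, _ = np.linalg.qr(rng.standard_normal((p, k)))
    R = Q.T                                   # same projection for both scales
    pv_unit.append(john_pvalue(X @ R.T))      # sigma^2 = 1
    pv_scaled.append(john_pvalue((5.0 * X) @ R.T))  # sigma^2 = 25

# The averaged p-value pi_bar is the same at both scales.
print(np.allclose(np.mean(pv_unit), np.mean(pv_scaled)))  # True
```
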
Proof of Theorem 2
Invariance of the distribution of the two-sample test statistic can be shown along the same lines as the proof above. Apart from the computation of the test statistic, the rest of the argument remains unchanged, since the p-value of the Box M test statistic also follows a standard uniform distribution under the null hypothesis. Hence in Algorithm 2, \(\pi _m \sim \mathrm{Unif}(0, 1)\) under \(H_0\), independent of the choice of \(\Sigma \). \(\square \)
Ayyala, D.N., Ghosh, S. & Linder, D.F. Covariance matrix testing in high dimension using random projections. Comput Stat 37, 1111–1141 (2022). https://doi.org/10.1007/s00180-021-01166-4