
Covariance matrix testing in high dimension using random projections


Abstract

Estimation and hypothesis testing for the covariance matrix in high dimensions are challenging problems, as the traditional multivariate asymptotic theory is no longer valid. When the dimension is larger than or increasing with the sample size, standard likelihood-based tests for the covariance matrix have poor performance. Existing high-dimensional tests are either computationally expensive or have very weak control of type I error. In this paper, we propose a test procedure, CRAMP (covariance testing using random matrix projections), for testing hypotheses involving one or more covariance matrices using random projections. Randomly projecting the high-dimensional data into lower-dimensional subspaces alleviates the curse of dimensionality, allowing for the use of traditional multivariate tests. An extensive simulation study is performed to compare CRAMP against asymptotics-based high-dimensional test procedures. An application of the proposed method to two gene expression data sets is presented.
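To make the projection step concrete, the following is a minimal sketch of the idea, assuming Gaussian data and an orthonormalized Gaussian projection matrix; the dimensions and generator are illustrative choices, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 1000, 10           # sample size; original and projected dimensions

# n observations in p dimensions with p >> n, so the p x p sample covariance
# is singular and classical likelihood-based tests do not apply directly.
X = rng.standard_normal((n, p))

# Random projection matrix R (k x p) with orthonormal rows, so that
# R (sigma^2 I_p) R^T = sigma^2 I_k preserves the identity structure.
Q, _ = np.linalg.qr(rng.standard_normal((p, k)))
R = Q.T

# The projected data are n observations in k < n dimensions; the k x k sample
# covariance is nonsingular and traditional multivariate tests become usable.
X_proj = X @ R.T
S_proj = np.cov(X_proj, rowvar=False)
```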


Notes

  1. http://genomics-pubs.princeton.edu/oncology/affydata/index.html.

  2. https://portal.gdc.cancer.gov/.

References

  • Achlioptas D (2001) Database-friendly random projections. In: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’01, pp 274–281, New York, NY, USA. Association for Computing Machinery. ISBN 1581133618

  • Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci 96(12):6745–6750. ISSN 0027-8424

  • Anderson TW (2003) An introduction to multivariate statistical analysis. Wiley Series in Probability and Statistics, 3rd edn. ISBN 978-0-471-36091-9

  • Ayyala DN (2020) High-dimensional statistical inference: Theoretical development to data analytics (Chapter 6), volume 43 of Handbook of Statistics, pp. 289–335. Elsevier. https://doi.org/10.1016/bs.host.2020.02.003

  • Burr M, Gao S, Knoll F (2018) Optimal bounds for Johnson-Lindenstrauss transformations. J Mach Learn Res 19:1–22


  • Cai T, Liu W, Xia Y (2013) Two-sample covariance matrix testing and support recovery in high-dimensional and sparse settings. J Am Stat Assoc 108(501):265–277


  • Cai TT, Li H, Liu W, Xie J (2012) Covariate-adjusted precision matrix estimation with an application in genetical genomics. Biometrika 100(1):139–156. ISSN 0006-3444. https://doi.org/10.1093/biomet/ass058

  • Cannings TI (2021) Random projections: data perturbation for classification problems. WIREs Comput Stat 13(1):e1499. https://doi.org/10.1002/wics.1499


  • Cannings TI, Samworth RJ (2017) Random-projection ensemble classification. J R Stat Soc Ser B (Stat Methodol) 79(4):959–1035


  • Chen SX, Zhang LX, Zhong PS (2010) Tests for high-dimensional covariance matrices. J Am Stat Assoc 105(490):810–819


  • Fisher TJ (2012) On testing for an identity covariance matrix when the dimensionality equals or exceeds the sample size. J Stat Plann Inference 142(1):312–326


  • Fisher TJ, Sun X, Gallagher CM (2010) A new test for sphericity of the covariance matrix for high dimensional data. J Multivar Anal 101(10):2554–2570


  • Hu J, Bai Z (2016) A review of 20 years of naive tests of significance for high-dimensional mean vectors and covariance matrices. Sci China Math 59:2281–2300


  • John S (1972) The distribution of a statistic used for testing sphericity of normal distributions. Biometrika 59(1):169–173


  • Johnson WB, Lindenstrauss J (1984) Extensions of Lipschitz mappings into a Hilbert space. Contemp Math 26:189–206


  • Ledoit O, Wolf M (2002) Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size. Ann Stat 30(4):1081–1102


  • Li J, Chen SX (2012) Two sample tests for high-dimensional covariance matrices. Ann Stat 40(2):908–940


  • Lopes M, Jacob L, Wainwright MJ (2011) A more powerful two-sample test in high dimensions using random projection. In: Advances in Neural Information Processing Systems 24, pp 1206–1214

  • Nagao H (1973) On some test criteria for covariance matrix. Ann Stat 1(4):700–709


  • Qian M, Tao L, Li E, Tian M (2020) Hypothesis testing for the identity of high-dimensional covariance matrices. Stat Probab Lett 161:108699


  • Rencher AC, Christensen WF (2012) Methods of multivariate analysis, 3rd edn. Wiley. ISBN 9781118391686

  • Schclar A, Rokach L (2009) Random projection ensemble classifiers. In: Filipe J, Cordeiro J (eds) Enterprise information systems. Springer, Berlin, pp 309–316


  • Schott JR (2007) A test for the equality of covariance matrices when the dimension is large relative to the sample sizes. Comput Stat Data Anal 51(12):6535–6542


  • Srivastava MS, Yanagihara H, Kubokawa T (2014) Tests for covariance matrices in high dimension with less sample size. J Multivar Anal 130:289–309


  • Thanei G-A, Heinze C, Meinshausen N (2017) Random projections for large-scale regression, pp 51–68. Springer International Publishing, Cham. ISBN 978-3-319-41573-4. https://doi.org/10.1007/978-3-319-41573-4_3

  • van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(86):2579–2605


  • Wu T-L, Li P (2020) Projected tests for high-dimensional covariance matrices. J Stat Plann Inference 207:73–85. ISSN 0378-3758

  • Zhao SD, Cai TT, Li H (2014) Direct estimation of differential networks. Biometrika 101(2):253–268. ISSN 0006-3444. https://doi.org/10.1093/biomet/asu009

Download references

Author information


Corresponding author

Correspondence to Deepak Nag Ayyala.

Appendix

Proof of Theorem 1

The proof of Theorem 1 is along the same lines as the proof of Theorem 2 in Srivastava et al. (2014). To show that the distribution of \(\overline{\pi }_U\) is independent of \(\sigma \), define \(\mathbf {X}^*_{m; i} = \mathcal {R}_m \mathbf {X}_i, i = 1, \ldots , n, m = 1, \ldots , M\) as the projection of the \(i^\mathrm{th}\) observation using the \(m^\mathrm{th}\) random projection matrix. Then we have

$$\begin{aligned} \mathrm{var} \left( \mathbf {X}^*_{m;1}, \ldots , \mathbf {X}^*_{m; n} \right) = \mathcal {S}^*_m = \mathcal {R}_m \mathcal {S} \mathcal {R}_m^{\top }, \end{aligned}$$

where \(\mathcal {S}\) and \(\mathcal {S}_m^*\) are the sample covariance matrices of the original and projected observations respectively. From equation (17), the p-values based on M i.i.d. random projection matrices are

$$\begin{aligned} \pi _m = 1 - F_{\chi ^2_{\nu }} \left( \frac{1}{k} \mathrm{tr}\left\{ \frac{\mathcal {S}_m^*}{ \mathrm{tr}\mathcal {S}_m^* / k} - \mathcal {I}_k \right\} ^2 \right) . \end{aligned}$$
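As a sketch of how these per-projection p-values might be computed (an illustration, not the authors' code: the degrees of freedom \(\nu \) and any sample-size scaling come from equation (17), which is not reproduced here, so \(\nu = k(k+1)/2 - 1\) is used purely as a placeholder):

```python
import numpy as np
from scipy.stats import chi2

def projection_pvalues(X, k, M, nu=None, seed=None):
    """Sketch of the per-projection p-values pi_m. Placeholder assumptions:
    nu defaults to k(k+1)/2 - 1, and any sample-size scaling from equation
    (17) of the paper is omitted."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    if nu is None:
        nu = k * (k + 1) // 2 - 1                  # placeholder degrees of freedom
    pvals = np.empty(M)
    for m in range(M):
        Q, _ = np.linalg.qr(rng.standard_normal((p, k)))  # R_m^T, orthonormal cols
        S_m = np.cov(X @ Q, rowvar=False)          # k x k projected covariance S*_m
        D = S_m / (np.trace(S_m) / k) - np.eye(k)  # scale-free deviation from I_k
        T_m = np.trace(D @ D) / k                  # statistic in the display above
        pvals[m] = 1.0 - chi2.cdf(T_m, df=nu)      # pi_m = 1 - F_{chi^2_nu}(T_m)
    return pvals
```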

First, since the random projection matrices are independent, the p-values \(\pi _1, \ldots , \pi _M\) are independent and identically distributed conditional on the data \(\mathcal {X} = \{\mathbf {X}_1, \ldots , \mathbf {X}_n\}\) and \(\mathcal {Y} = \{\mathbf {Y}_1, \ldots , \mathbf {Y}_m \}\). This holds because the orthogonality of the projection matrices preserves the covariance structure (\(\mathcal {R} \left( \sigma ^2 \mathcal {I}_p \right) \mathcal {R}^{\top } = \sigma ^2 \mathcal {I}_k\)). Additionally, we can write

$$\begin{aligned} P \left[ \overline{\pi }< u \right] = \mathbb {E}_{\mathcal {X}, \mathcal {Y}} \left\{ P_{\mathcal {R}} \left[ \overline{\pi } < u | \mathcal {X}, \mathcal {Y} \right] \right\} , \end{aligned}$$
(A.20)

where the expected value is with respect to the distribution of the observations and the probability is with respect to the randomness of the projection matrix.

By the conditional independence of \(\pi _1, \ldots , \pi _M\) and the central limit theorem, we have a normal approximation to the probability in (A.20):

$$\begin{aligned} \lim \limits _{M \rightarrow \infty } \left| P \left[ \overline{\pi } < u \right] - \Phi \left( \frac{u - \mathbb {E}_{\mathcal {R}} \left[ U | \mathcal {X}, \mathcal {Y} \right] }{ \mathrm{var}_{\mathcal {R}} \left[ U | \mathcal {X}, \mathcal {Y} \right] } \right) \right| = 0. \end{aligned}$$
(A.21)

Hence the probability \(P \left[ \overline{\pi } < u \right] \) can be approximated using only the moments of \(U | \mathcal {X}, \mathcal {Y}\). Under the null hypothesis \(H_{0S}\), the variable \(U | \mathcal {X}, \mathcal {Y}\) is defined as

$$\begin{aligned} U | \mathcal {X}, \mathcal {Y} = 1 - F_{\chi ^2_{\nu }} \left( \frac{1}{k} \mathrm{tr}\left\{ \frac{\mathcal {S}_m^*}{ \mathrm{tr}\mathcal {S}_m^* / k} - \mathcal {I}_k \right\} ^2 \, \Big | \, \mathcal {X}, \mathcal {Y} \right) \sim \mathrm{Unif}(0,1). \end{aligned}$$
(A.22)

The uniform distribution follows from the standard property of p-values under the null hypothesis, which holds independently of \(\sigma ^2\). Using this property, we shall show that the distributions of \(\mathbb {E}_{\mathcal {R}} \left[ U | \mathcal {X}, \mathcal {Y} \right] \) and \(\mathrm{var}_{\mathcal {R}} \left[ U | \mathcal {X}, \mathcal {Y} \right] \) with respect to \(\mathcal {X}, \mathcal {Y}\) are also independent of \(\sigma ^2\).
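This scale invariance is easy to see numerically. Continuing the sketch above (same placeholder assumptions), rescaling the data by any \(\sigma \) leaves every \(\pi _m\), and hence \(\overline{\pi }\), unchanged, because the statistic depends on \(\mathcal {S}_m^*\) only through the scale-free ratio:

```python
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 1000))

for sigma in (1.0, 5.0):
    # Same seed -> same projection matrices; S_m scales by sigma^2, but the
    # ratio S_m / (tr S_m / k) is unchanged, so the p-values coincide exactly.
    pbar = projection_pvalues(sigma * X, k=10, M=200, seed=2).mean()
    print(sigma, pbar)
```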

Let W denote the expected value of \(U | \mathcal {X}, \mathcal {Y}\) with respect to \(\mathcal {R}\),

$$\begin{aligned} W = \mathbb {E}_{\mathcal {R}} \left[ U | \mathcal {X}, \mathcal {Y} \right]&= \int u \, dP_{\mathcal {R}} \nonumber \\&= \int \left[ 1 - F_{\chi ^2_{\nu }} \left( \mathrm{tr}\left\{ \frac{\mathcal {S}_m^*}{ \mathrm{tr}\mathcal {S}_m^* / k} - \mathcal {I}_k \right\} ^2 | \mathcal {X}, \mathcal {Y} \right) \right] \, dP_{\mathcal {R}} \end{aligned}$$
(A.23)

where the integral is with respect to the distribution of the random projection matrix \(\mathcal {R}\). While the exact value of the integral is not important, note that by equation (A.22) the integrand is independent of \(\sigma ^2\). As the random projection matrices are generated independently of the distribution of the observations, we can conclude that the variable W is independent of \(\sigma ^2\). For any \(m \ge 1\), the \(m^{\mathrm{th}}\) moment of W is given by

$$\begin{aligned} \mathbb {E}_{\mathcal {X}, \mathcal {Y}} \left[ W^m \right] = \int W^m \, dF_{\mathcal {X}, \mathcal {Y}}&= \int \mathbb {E}_{\mathcal {R}} \left[ U| \mathcal {X}, \mathcal {Y} \right] ^m \, dF_{\mathcal {X}, \mathcal {Y}} \\&= \int \mathbb {E}_{\mathcal {R}} \left[ U| \mathcal {X}, \mathcal {Y} \right] \times \cdots \times \mathbb {E}_{\mathcal {R}} \left[ U| \mathcal {X}, \mathcal {Y} \right] \, dF_{\mathcal {X}, \mathcal {Y}} \\&= \int \left\{ \int U_{\mathcal {R}_1} \, dP_{\mathcal {R}_1} \right\} \cdots \left\{ \int U_{\mathcal {R}_m} \, dP_{\mathcal {R}_m} \right\} \, dF_{\mathcal {X}, \mathcal {Y}} \end{aligned}$$

Interchanging the integrals by Fubini’s theorem, we have

$$\begin{aligned} \mathbb {E}_{\mathcal {X}, \mathcal {Y}} \left[ W^m \right] = \int \cdots \int \left\{ \int U_{\mathcal {R}_1} \ldots U_{\mathcal {R}_m} \, dF_{\mathcal {X}, \mathcal {Y}} \right\} \, dP_{\mathcal {R}_1} \cdots dP_{\mathcal {R}_m} \end{aligned}$$
(A.24)

By the construction of U in equation (A.22), the integral \(\left\{ \int U_{\mathcal {R}_1} \ldots U_{\mathcal {R}_m} \, dF_{\mathcal {X}, \mathcal {Y}} \right\} \) is independent of \(\sigma ^2\). Therefore, all moments of W are independent of \(\sigma ^2\), which implies that the distribution of W is independent of \(\sigma ^2\).

Similarly, it can be shown that the distribution of \(\mathrm{var}_{\mathcal {R}} \left( U| \mathcal {X}, \mathcal {Y} \right) \) is also independent of \(\sigma ^2\). Since both the conditional mean and variance are independent of \(\sigma ^2\), the distributions of

$$\begin{aligned} \Phi \left[ \frac{ u - \mathbb {E}_{\mathcal {R}} \left\{ U | \mathcal {X}, \mathcal {Y} \right\} }{ \mathrm{var}_{\mathcal {R}} \left\{ U | \mathcal {X}, \mathcal {Y} \right\} } \right] \text{ and } \quad \mathbb {E}_{\mathcal {X}, \mathcal {Y}} \left\{ \Phi \left[ \frac{ u - \mathbb {E}_{\mathcal {R}} \left\{ U | \mathcal {X}, \mathcal {Y} \right\} }{ \mathrm{var}_{\mathcal {R}} \left\{ U | \mathcal {X}, \mathcal {Y} \right\} } \right] \right\} \end{aligned}$$
(A.25)

are independent of \(\sigma ^2\). Finally, combining this independence with equation (A.21), we have

$$\begin{aligned}&\lim \limits _{M \rightarrow \infty } P_\mathcal {R} \left[ \overline{\pi }< u \, | \, \mathcal {X}, \mathcal {Y} \right] = \Phi \left[ \frac{ u - \mathbb {E}_{\mathcal {R}} \left\{ U | \mathcal {X}, \mathcal {Y} \right\} }{ \mathrm{var}_{\mathcal {R}} \left\{ U | \mathcal {X}, \mathcal {Y} \right\} } \right] , \end{aligned}$$

with the right hand side independent of \(\sigma ^2\). Taking expected values with respect to \(\mathcal {X}\) and \(\mathcal {Y}\), we have

$$\begin{aligned} \lim \limits _{M \rightarrow \infty } P \left[ \overline{\pi } < u \right] = \mathbb {E}_{\mathcal {X}, \mathcal {Y}} \left\{ \Phi \left[ \frac{ u - \mathbb {E}_{\mathcal {R}} \left\{ U | \mathcal {X}, \mathcal {Y} \right\} }{ \mathrm{var}_{\mathcal {R}} \left\{ U | \mathcal {X}, \mathcal {Y} \right\} } \right] \right\} . \end{aligned}$$
(A.26)

By equation (A.25), the right hand side in (A.26) is also independent of \(\sigma ^2\), completing the proof. \(\square \)

Proof of Theorem 2

Invariance of the distribution of the two-sample test statistic can be shown along the same lines as the proof above. Apart from the computation of the test statistic, the rest of the argument remains the same, since the p-value of the Box M test also follows a standard uniform distribution under the null hypothesis. Hence in Algorithm 2, \(\pi _m \sim \mathrm{Unif}(0, 1)\) under \(H_0\), independently of the choice of \(\Sigma \). \(\square \)
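For reference, here is a sketch of the classical Box M statistic applied to projected samples, in its standard textbook chi-square approximation (e.g. Rencher and Christensen 2012); this is an illustration of the building block, not the paper's exact formulation.

```python
import numpy as np
from scipy.stats import chi2

def box_m_pvalue(samples):
    """Box M test for equality of covariance matrices across g groups.
    `samples` is a list of (n_i x k) arrays, e.g. projected observations."""
    g = len(samples)
    k = samples[0].shape[1]
    v = np.array([s.shape[0] - 1 for s in samples])           # per-group dof
    covs = [np.cov(s, rowvar=False) for s in samples]         # unbiased S_i
    S_pool = sum(vi * Si for vi, Si in zip(v, covs)) / v.sum()
    # M = sum(v_i) log|S_pool| - sum v_i log|S_i|, via stable log-determinants
    M = v.sum() * np.linalg.slogdet(S_pool)[1] - sum(
        vi * np.linalg.slogdet(Si)[1] for vi, Si in zip(v, covs))
    # Box's small-sample correction factor and chi-square degrees of freedom
    c = (np.sum(1.0 / v) - 1.0 / v.sum()) * (2 * k**2 + 3 * k - 1) / (
        6.0 * (k + 1) * (g - 1))
    df = 0.5 * k * (k + 1) * (g - 1)
    return 1.0 - chi2.cdf((1.0 - c) * M, df)   # plays the role of pi_m in Algorithm 2
```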


Cite this article

Ayyala, D.N., Ghosh, S. & Linder, D.F. Covariance matrix testing in high dimension using random projections. Comput Stat 37, 1111–1141 (2022). https://doi.org/10.1007/s00180-021-01166-4
