Bayesian hypothesis testing for equality of high-dimensional means using cluster subspaces

  • Original Paper
  • Computational Statistics

Abstract

The classical Hotelling’s \(T^2\) test and Bayesian hypothesis tests break down when comparing two high-dimensional population means, because the pooled sample covariance matrix is singular when the model dimension \(p\) exceeds the sample size \(n\). In this paper, we develop a simple closed-form Bayesian testing procedure based on a split-and-merge technique. Specifically, we adopt the subspace clustering technique to split the high-dimensional data into lower-dimensional random spaces so that the Bayes factor can be implemented. We then use the geometric mean to merge the results of the Bayesian tests into a novel test statistic. We carry out simulation studies to compare the performance of the proposed test with several existing ones in the literature. Finally, two real-data applications are provided for illustrative purposes.
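The merge step can be illustrated with a minimal sketch. It assumes only what the abstract states, namely that per-subspace Bayes factors are combined by a geometric mean; the function name `merge_bayes_factors` and the numerical values are hypothetical, and the exact form of the paper's test statistic is not reproduced here.

```python
import math

def merge_bayes_factors(log_bfs):
    """Geometric mean of Bayes factors, computed from their logarithms."""
    if not log_bfs:
        raise ValueError("need at least one subspace Bayes factor")
    # exp(mean(log BF)) equals the geometric mean and avoids overflow in a raw product.
    return math.exp(sum(log_bfs) / len(log_bfs))

# Hypothetical Bayes factors from three subspaces (illustrative values only).
subspace_bfs = [2.0, 8.0, 4.0]
merged = merge_bayes_factors([math.log(b) for b in subspace_bfs])
print(merged)
```

Working on the log scale is a design choice rather than part of the method: products of many Bayes factors can overflow or underflow in floating point, while their logarithms stay well behaved.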



Acknowledgements

The authors would like to acknowledge the comments and suggestions from an Associate Editor and two reviewers, which have substantially improved the quality of the manuscript. The work of Dr. Min Wang was partially supported by the Internal Research Awards (INTRA) program from the UTSA Vice President for Research, Economic Development, and Knowledge Enterprise at the University of Texas at San Antonio.

Author information

Corresponding author

Correspondence to Min Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Derivation of Equation (9)

Since the minimal sufficient statistics \({\textbf {D}}\), \({\textbf {A}}\), and \({\textbf {S}}\) are independent, the marginal likelihood function of the data under \(H_{0}\) with \(\varvec{\theta }_{0} = (\varvec{\mu }, \varvec{\Sigma })\) integrated out is given by

$$\begin{aligned} m_{0}(\textrm{Data})&= \int \int f({\textbf {D}}, {\textbf {A}}, {\textbf {S}}\mid \varvec{\delta }=\varvec{0}, \varvec{\mu }, \varvec{\Sigma })\pi (\varvec{\mu }, \varvec{\Sigma }) d\varvec{\mu }d\varvec{\Sigma }\\
&=\int \int {\textbf {N}}_{p}({\textbf {D}} \mid \varvec{0}, n_\delta ^{-1}\varvec{\Sigma }) {\textbf {N}}_{p}({\textbf {A}} \mid \varvec{\mu }, n^{-1}\varvec{\Sigma })\\
&\quad \times {\textbf {W}}_{p}\left( (n-2){\textbf {S}}\mid n-2, \varvec{\Sigma }\right) \left| \varvec{\Sigma }\right| ^{-\frac{p+1}{2}} d \varvec{\mu }d\varvec{\Sigma }\\
&=\int {\textbf {N}}_p({\textbf {D}}\mid \varvec{0}, n_\delta ^{-1}\varvec{\Sigma }){\textbf {W}}_p\left( (n-2){\textbf {S}}\mid n-2, \varvec{\Sigma }\right) \left| \varvec{\Sigma }\right| ^{-\frac{p+1}{2}}\\
&\quad \times \left\{ \int {\textbf {N}}_p(\varvec{\mu }\mid {\textbf {A}}, n^{-1}\varvec{\Sigma }) d \varvec{\mu }\right\} d \varvec{\Sigma }\\
&=\int {\textbf {N}}_p({\textbf {D}}\mid \varvec{0}, n_\delta ^{-1}\varvec{\Sigma }){\textbf {W}}_p\left( (n-2){\textbf {S}}\mid n-2, \varvec{\Sigma }\right) \left| \varvec{\Sigma }\right| ^{-\frac{p+1}{2}} d \varvec{\Sigma }\\
&=\int \frac{1}{(2\pi )^{\frac{p}{2}}\left| n_\delta ^{-1}\varvec{\Sigma }\right| ^{1/2}} \exp \left\{ -\frac{1}{2}{\textbf {D}}^T(n_\delta ^{-1}\varvec{\Sigma })^{-1}{\textbf {D}}\right\} \\
&\quad \times \frac{\left| (n-2){\textbf {S}}\right| ^{\frac{n-p-3}{2}}\exp \left\{ -\frac{1}{2}\text{tr}[(n-2) {\textbf {S}}\varvec{\Sigma }^{-1}]\right\} }{2^{\frac{p(n-2)}{2}}\pi ^{\frac{p(p-1)}{4}}\left| \varvec{\Sigma }\right| ^{\frac{n-2}{2}}\prod _{j=1}^{p}\Gamma [0.5(n-1-j)]}\left| \varvec{\Sigma }\right| ^{-\frac{p+1}{2}}d\varvec{\Sigma }\\
&=\frac{n_\delta ^{p/2}\left| (n-2){\textbf {S}}\right| ^{\frac{n-p-3}{2}}}{(2\pi )^{\frac{p}{2}}2^{\frac{p(n-2)}{2}}\pi ^{\frac{p(p-1)}{4}}\prod _{j=1}^{p}\Gamma [0.5(n-1-j)]}\\
&\quad \times \int \left| \varvec{\Sigma }\right| ^{-\frac{n+p}{2}}\exp \bigg \{-\frac{1}{2}\text{tr}[(n-2){\textbf {S}}\varvec{\Sigma }^{-1}] -\frac{n_\delta }{2}{\textbf {D}}^T\varvec{\Sigma }^{-1}{\textbf {D}}\bigg \}d\varvec{\Sigma }\\
&\propto A_0 n_\delta ^{p/2}\left| (n-2){\textbf {S}}+n_\delta {\textbf {D}}{\textbf {D}}^T\right| ^{-\frac{n-1}{2}}\\
&\propto A_0 n_\delta ^{p/2}\left\{ 1+n_\delta {\textbf {D}}^T[(n-2){\textbf {S}}]^{-1}{\textbf {D}}\right\} ^{-\frac{n-1}{2}}, \end{aligned}$$

where \( A_0=\frac{\left| (n-2){\textbf {S}}\right| ^{\frac{n-p-3}{2}}}{(2\pi )^{\frac{p}{2}}2^{\frac{p(n-2)}{2}} \pi ^{\frac{p(p-1)}{4}}\prod _{j=1}^{p}\Gamma [0.5(n-1-j)]}\).
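The two proportionality steps above rest on two standard identities, sketched here in the notation of this appendix. The symbols \(\nu \) and \({\textbf {B}}\) are introduced only for this remark, \(\Gamma _p\) denotes the multivariate gamma function, and \({\textbf {S}}\) is assumed to be positive definite.

$$\begin{aligned} \int _{\varvec{\Sigma }>0}\left| \varvec{\Sigma }\right| ^{-\frac{\nu +p+1}{2}}\exp \left\{ -\frac{1}{2}\text{tr}({\textbf {B}}\varvec{\Sigma }^{-1})\right\} d\varvec{\Sigma }&=2^{\frac{\nu p}{2}}\Gamma _p\left( \frac{\nu }{2}\right) \left| {\textbf {B}}\right| ^{-\frac{\nu }{2}},\\
\left| (n-2){\textbf {S}}+n_\delta {\textbf {D}}{\textbf {D}}^T\right|&=\left| (n-2){\textbf {S}}\right| \left\{ 1+n_\delta {\textbf {D}}^T[(n-2){\textbf {S}}]^{-1}{\textbf {D}}\right\} . \end{aligned}$$

The first is the normalizing integral of the inverse-Wishart density, applied with \(\nu =n-1\) and \({\textbf {B}}=(n-2){\textbf {S}}+n_\delta {\textbf {D}}{\textbf {D}}^T\), which matches the exponent \(-\frac{n+p}{2}\) of \(\left| \varvec{\Sigma }\right| \). The second is the matrix determinant lemma. The factor \(\left| (n-2){\textbf {S}}\right| ^{-\frac{n-1}{2}}\) absorbed in the final step does not involve \(n_\delta \), so it appears identically in both marginal likelihoods and cancels in the Bayes factor.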

Under \(H_1\), the marginal likelihood function of the data with \(\varvec{\theta }_1 = (\varvec{\delta }, \varvec{\mu }, \varvec{\Sigma })\) integrated out is given by

$$\begin{aligned} m_{1}(\textrm{Data})&=\int \int \int f({\textbf {D}}, {\textbf {A}}, {\textbf {S}}\mid \varvec{\delta }, \varvec{\mu }, \varvec{\Sigma })\pi (\varvec{\delta }, \varvec{\mu }, \varvec{\Sigma })d\varvec{\delta }d\varvec{\mu }d\varvec{\Sigma }\\
&=\int \int \int f({\textbf {D}}, {\textbf {A}}, {\textbf {S}}\mid \varvec{\delta }, \varvec{\mu }, \varvec{\Sigma })\pi (\varvec{\mu }, \varvec{\Sigma })\pi (\varvec{\delta }\mid g, \varvec{\Sigma }, H_1)d\varvec{\delta }d\varvec{\mu }d\varvec{\Sigma }\\
&=\int \int \int {\textbf {N}}_p({\textbf {D}}\mid \varvec{\delta }, n_\delta ^{-1}\varvec{\Sigma }){\textbf {N}}_p({\textbf {A}}\mid \varvec{\mu }, n^{-1}\varvec{\Sigma }){\textbf {W}}_p\left( (n-2){\textbf {S}}\mid n-2, \varvec{\Sigma }\right) \\
&\quad \times \left| \varvec{\Sigma }\right| ^{-\frac{p+1}{2}}{\textbf {N}}_p(\varvec{\delta }\mid \varvec{0}, g\varvec{\Sigma }/n_\delta )d\varvec{\delta }d\varvec{\mu }d\varvec{\Sigma }\\
&=\int \int \left\{ \int {\textbf {N}}_p({\textbf {D}}\mid \varvec{\delta }, n_\delta ^{-1}\varvec{\Sigma }) {\textbf {N}}_p(\varvec{\delta }\mid \varvec{0}, g\varvec{\Sigma }/n_\delta )d\varvec{\delta }\right\} \\
&\quad \times {\textbf {N}}_p({\textbf {A}}\mid \varvec{\mu }, n^{-1}\varvec{\Sigma }){\textbf {W}}_p\left( (n-2){\textbf {S}}\mid n-2, \varvec{\Sigma }\right) \left| \varvec{\Sigma }\right| ^{-\frac{p+1}{2}} d\varvec{\mu }d\varvec{\Sigma }\\
&=\int {\textbf {N}}_p({\textbf {D}}\mid \varvec{0}, n_\tau ^{-1}\varvec{\Sigma }) {\textbf {W}}_p\left( (n-2){\textbf {S}}\mid n-2, \varvec{\Sigma }\right) \left| \varvec{\Sigma }\right| ^{-\frac{p+1}{2}}\\
&\quad \times \left\{ \int {\textbf {N}}_p(\varvec{\mu }\mid {\textbf {A}}, n^{-1}\varvec{\Sigma }) d\varvec{\mu }\right\} d\varvec{\Sigma }\ \ \textrm{with}\ n_\tau ^{-1}=n_\delta ^{-1}+g/n_\delta \\
&=\int {\textbf {N}}_p({\textbf {D}}\mid \varvec{0}, n_\tau ^{-1}\varvec{\Sigma }) {\textbf {W}}_p\left( (n-2){\textbf {S}}\mid n-2, \varvec{\Sigma }\right) \left| \varvec{\Sigma }\right| ^{-\frac{p+1}{2}} d\varvec{\Sigma }\\
&=\int \frac{1}{(2\pi )^{\frac{p}{2}}\left| n_\tau ^{-1}\varvec{\Sigma }\right| ^{1/2}}\exp \left\{ -\frac{1}{2}{\textbf {D}}^T(n_\tau ^{-1}\varvec{\Sigma })^{-1}{\textbf {D}}\right\} \\
&\quad \times \frac{\left| (n-2){\textbf {S}}\right| ^{\frac{n-p-3}{2}}\exp \big \{-\frac{1}{2}\text{tr}[(n-2) {\textbf {S}}\varvec{\Sigma }^{-1}]\big \}}{2^{\frac{p(n-2)}{2}}\pi ^{\frac{p(p-1)}{4}}\left| \varvec{\Sigma }\right| ^{\frac{n-2}{2}}\prod _{j=1}^{p}\Gamma [0.5(n-1-j)]}\left| \varvec{\Sigma }\right| ^{-\frac{p+1}{2}}d\varvec{\Sigma }\\
&=\frac{n_\tau ^{p/2}\left| (n-2){\textbf {S}}\right| ^{\frac{n-p-3}{2}}}{(2\pi )^{\frac{p}{2}}2^{\frac{p(n-2)}{2}}\pi ^{\frac{p(p-1)}{4}}\prod _{j=1}^{p}\Gamma [0.5(n-1-j)]}\\
&\quad \times \int \left| \varvec{\Sigma }\right| ^{-\frac{n+p}{2}}\exp \left\{ -\frac{1}{2} \text{tr}[(n-2){\textbf {S}}\varvec{\Sigma }^{-1}]-\frac{n_\tau }{2}{\textbf {D}}^T\varvec{\Sigma }^{-1}{\textbf {D}}\right\} d\varvec{\Sigma }\\
&\propto A_0 n_\tau ^{p/2}\left| (n-2){\textbf {S}}+n_\tau {\textbf {D}}{\textbf {D}}^T\right| ^{-\frac{n-1}{2}}\\
&\propto A_0 n_\tau ^{p/2}\left\{ 1+n_\tau {\textbf {D}}^T[(n-2){\textbf {S}}]^{-1}{\textbf {D}}\right\} ^{-\frac{n-1}{2}}. \end{aligned}$$

The resulting Bayes factor for the hypothesis testing problem in (4) is given by

$$\begin{aligned} \mathrm {BF_{10}}&=\frac{m_{1}(\textrm{Data})}{m_{0}(\textrm{Data})}\\
&=\frac{A_0 n_\tau ^{p/2}\left\{ 1+n_\tau {\textbf {D}}^T[(n-2){\textbf {S}}]^{-1}{\textbf {D}}\right\} ^{-\frac{n-1}{2}}}{A_0 n_\delta ^{p/2}\left\{ 1+n_\delta {\textbf {D}}^T[(n-2){\textbf {S}}]^{-1}{\textbf {D}}\right\} ^{-\frac{n-1}{2}}}\\
&=\left( \frac{n_\tau }{n_\delta }\right) ^{p/2}\left( \frac{1+\frac{n_\tau }{n-2} {\textbf {D}}^T{\textbf {S}}^{-1}{\textbf {D}}}{1+\frac{n_\delta }{n-2}{\textbf {D}}^T{\textbf {S}}^{-1}{\textbf {D}}}\right) ^{-\frac{n-1}{2}}\\
&=(1+g)^{\frac{n-p-1}{2}}\left( 1+\frac{g}{1+\frac{pf^{\star }}{n-p-1}}\right) ^{-\frac{n-1}{2}}\\
&=(1+g)^{\frac{n-p-1}{2}}(1+qg)^{-\frac{n-1}{2}}, \end{aligned}$$

where \(q=\big (1+\frac{pf^{\star }}{n-p-1}\big )^{-1}\) with \(f^{\star }=\frac{n-p-1}{(n-2)p}n_{\delta }{\textbf {D}}^T{\textbf {S}}^{-1}{\textbf {D}}\).
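As a numerical sanity check on the simplification above, the following sketch evaluates the fixed-\(g\) Bayes factor on the log scale in two ways: from the final closed form \((1+g)^{\frac{n-p-1}{2}}(1+qg)^{-\frac{n-1}{2}}\) and from the unsimplified ratio involving \(n_\tau \) and \(n_\delta \). The function names and input values are hypothetical, and the quadratic form \({\textbf {D}}^T{\textbf {S}}^{-1}{\textbf {D}}\) is assumed to be supplied by the caller as the scalar `quad`.

```python
import math

def log_bf10_fixed_g(quad, n, p, n_delta, g):
    """log BF_10 for fixed g from the closed form; quad = D^T S^{-1} D."""
    f_star = (n - p - 1) / ((n - 2) * p) * n_delta * quad
    q = 1.0 / (1.0 + p * f_star / (n - p - 1))
    return 0.5 * (n - p - 1) * math.log1p(g) - 0.5 * (n - 1) * math.log1p(q * g)

def log_bf10_ratio_form(quad, n, p, n_delta, g):
    """Same quantity from the ratio of marginal likelihoods, before simplification."""
    n_tau = n_delta / (1.0 + g)  # from 1/n_tau = 1/n_delta + g/n_delta
    x = quad / (n - 2)
    return (0.5 * p * math.log(n_tau / n_delta)
            - 0.5 * (n - 1) * (math.log1p(n_tau * x) - math.log1p(n_delta * x)))

# Illustrative inputs only (not taken from the paper); requires n > p + 1.
n, p, n_delta, g, quad = 20, 3, 5.0, 4.0, 1.2
closed = log_bf10_fixed_g(quad, n, p, n_delta, g)
ratio = log_bf10_ratio_form(quad, n, p, n_delta, g)
print(closed, ratio)
```

The two values should agree up to floating-point rounding. When `quad` is zero, \(q=1\) and both forms reduce to \(-\frac{p}{2}\log (1+g)\).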

For notational simplicity, we let \( b = (p+1)/2, a= (n-1)/2, z=\left( 1-{1}/{q}\right) (p+1)/(n+1)\). By specifying the prior in (8) for the hyperparameter g, the Bayes factor with g integrated out is given by

$$\begin{aligned} \mathrm {BF_{10}}&=\int _{\frac{n+1}{p+1}-1}^{\infty }(1+qg)^{-\frac{n-1}{2}}(1+g)^{\frac{n-p-1}{2}}\frac{1}{2}\bigg (\frac{n+1}{p+1}\bigg )^{1/2}(1+g)^{-3/2}dg\\
&=\int _{0}^{1}\bigg (\frac{n+1}{p+1}\frac{1}{t}\bigg )^{\frac{n-p-4}{2}}\bigg [1+q\bigg (\frac{n+1}{p+1}\frac{1}{t}-1\bigg )\bigg ]^{-\frac{n-1}{2}}\frac{1}{2}\bigg (\frac{n+1}{p+1}\bigg )^{1/2}\left| -\frac{n+1}{p+1}\frac{1}{t^2}\right| dt\ \ \textrm{with}\ t=\frac{n+1}{(p+1)(1+g)}\\
&=\frac{1}{2}\bigg (\frac{n+1}{p+1}\bigg )^{\frac{n-p-1}{2}}\int _{0}^{1}t^{-\frac{n-p}{2}}\bigg (1-q+\frac{n+1}{p+1}q\frac{1}{t}\bigg )^{-\frac{n-1}{2}}dt\\
&=\frac{1}{2}\bigg (\frac{n+1}{p+1}\bigg )^{\frac{n-p-1}{2}}\int _{0}^{1}t^{-\frac{n-p}{2}+\frac{n-1}{2}}\bigg [\frac{n+1}{p+1}q+(1-q)t\bigg ]^{-\frac{n-1}{2}}dt\\
&=\frac{1}{2}\bigg (\frac{n+1}{p+1}\bigg )^{\frac{n-p-1}{2}}\int _{0}^{1}t^{\frac{p-1}{2}}\bigg (\frac{n+1}{p+1}q\bigg )^{-\frac{n-1}{2}}\bigg (1+\frac{1-q}{q}\frac{p+1}{n+1}t\bigg )^{-\frac{n-1}{2}}dt\\
&=\frac{1}{2}\bigg (\frac{n+1}{p+1}\bigg )^{-\frac{p}{2}}q^{-\frac{n-1}{2}} \int _{0}^{1}t^{\frac{p-1}{2}}\bigg (1+\frac{1-q}{q}\frac{p+1}{n+1}t\bigg )^{-\frac{n-1}{2}}dt\\
&=\frac{1}{2}\bigg (\frac{n+1}{p+1}\bigg )^{-\frac{p}{2}}q^{-\frac{n-1}{2}} \int _{0}^{1}t^{(\frac{p-1}{2}+1)-1}\bigg [1-\bigg (1-\frac{1}{q}\bigg )\frac{p+1}{n+1}t\bigg ]^{-\frac{n-1}{2}}dt\\
&=\frac{1}{2}\bigg (\frac{n+1}{p+1}\bigg )^{-\frac{p}{2}}q^{-\frac{n-1}{2}} \int _{0}^{1}t^{b-1}(1-tz)^{-a}dt\\
&=\frac{1}{2}\bigg (\frac{n+1}{p+1}\bigg )^{-\frac{p}{2}}q^{-\frac{n-1}{2}} \frac{\Gamma (b)\Gamma (c-b)}{\Gamma (c)}{}_2 F_1(a, b; c; z)\ \ \textrm{with}\ c=b+1\\
&=\frac{1}{2}\bigg (\frac{n+1}{p+1}\bigg )^{-\frac{p}{2}}q^{-\frac{n-1}{2}} \frac{\Gamma (\frac{p+1}{2})}{\Gamma (\frac{p+1}{2}+1)}{}_2 F_1(a, b; c; z)\\
&=\frac{1}{2}\bigg (\frac{n+1}{p+1}\bigg )^{-\frac{p}{2}}q^{-\frac{n-1}{2}} \frac{1}{\frac{p+1}{2}}{}_2 F_1\bigg (\frac{n-1}{2}, \frac{p+1}{2}; \frac{p+3}{2}; \bigg (1-\frac{1}{q}\bigg )\frac{p+1}{n+1}\bigg )\\
&=\bigg (\frac{n+1}{p+1}\bigg )^{-\frac{p}{2}}\bigg (1+\frac{pf^{\star }}{n-p-1}\bigg )^{\frac{n-1}{2}}\frac{1}{p+1}\\
&\quad \times {}_2 F_1\bigg (\frac{n-1}{2}, \frac{p+1}{2}; \frac{p+3}{2}; \bigg (-\frac{pf^{\star }}{n-p-1}\bigg )\frac{p+1}{n+1}\bigg ). \end{aligned}$$

To facilitate the computation of the \({}_2 F_1\) function above, we apply the Pfaff transformation \({}_2 F_1(a, b; c; z)=(1-z)^{-a}~{}_2 F_1(a, c-b; c; \frac{z}{z-1})\) and obtain the Bayes factor in (9) after some algebra.
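The following sketch illustrates why the transformation helps and checks the derivation numerically. Since \(z\le 0\) here, the transformed argument \(z/(z-1)\) lies in \([0,1)\), where the Gauss series has only positive terms. The code compares the closed form, evaluated through the Pfaff-transformed series, with a direct midpoint-rule integration over \(g\). The shorthand `u` for \(pf^{\star }/(n-p-1)\), the function names, and the input values are assumptions made for illustration; this is not the authors' implementation, and the exact arrangement of (9) is not reproduced.

```python
import math

def hyp2f1_series(a, b, c, w, tol=1e-15, max_terms=10000):
    """Gauss series for 2F1(a, b; c; w), valid for |w| < 1."""
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * w
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def bf10_closed_form(u, n, p):
    """Integrated Bayes factor via the Pfaff-transformed 2F1; u = p f* / (n - p - 1)."""
    a, b, c = (n - 1) / 2, (p + 1) / 2, (p + 3) / 2
    r = (p + 1) / (n + 1)
    z = -u * r
    w = z / (z - 1.0)  # lies in [0, 1) whenever u >= 0
    f = (1.0 - z) ** (-a) * hyp2f1_series(a, c - b, c, w)
    return r ** (p / 2) * (1.0 + u) ** a * f / (p + 1)

def bf10_quadrature(u, n, p, steps=20000):
    """Midpoint-rule integral of the fixed-g Bayes factor against the prior on g."""
    q = 1.0 / (1.0 + u)
    ratio = (n + 1) / (p + 1)
    g0 = ratio - 1.0  # lower limit of the support of g
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) / steps
        g = g0 + s / (1.0 - s)  # maps s in (0, 1) to g in (g0, infinity)
        log_f = 0.5 * (n - p - 4) * math.log1p(g) - 0.5 * (n - 1) * math.log1p(q * g)
        total += math.exp(log_f) / (1.0 - s) ** 2  # Jacobian dg/ds
    return 0.5 * math.sqrt(ratio) * total / steps

# Illustrative inputs only (not taken from the paper).
closed = bf10_closed_form(0.3, 20, 3)
numeric = bf10_quadrature(0.3, 20, 3)
print(closed, numeric)
```

For inputs like these, the two values should agree to several significant digits. The series converges slowly when the transformed argument is close to 1, which happens for large `u`; a library routine for \({}_2 F_1\) would be a more robust choice in that regime.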

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, F., Hai, Q. & Wang, M. Bayesian hypothesis testing for equality of high-dimensional means using cluster subspaces. Comput Stat 39, 1301–1320 (2024). https://doi.org/10.1007/s00180-023-01366-0

  • DOI: https://doi.org/10.1007/s00180-023-01366-0