Abstract
The classical Hotelling’s \(T^2\) test and Bayesian hypothesis tests break down when comparing two high-dimensional population means, because the pooled sample covariance matrix is singular when the model dimension \(p\) exceeds the sample size \(n\). In this paper, we develop a simple closed-form Bayesian testing procedure based on a split-and-merge technique. Specifically, we adopt the subspace clustering technique to split the high-dimensional data into lower-dimensional random subspaces in which the Bayes factor can be computed. We then use the geometric mean to merge the resulting Bayes factors into a novel test statistic. We carry out simulation studies to compare the performance of the proposed test with that of several existing tests in the literature. Finally, two real-data applications are provided for illustrative purposes.
Acknowledgements
The authors would like to acknowledge the comments and suggestions from an Associate Editor and two reviewers, which have substantially improved the quality of the manuscript. The work of Dr. Min Wang was partially supported by the Internal Research Awards (INTRA) program from the UTSA Vice President for Research, Economic Development, and Knowledge Enterprise at the University of Texas at San Antonio.
Appendix: Derivation of Equation (9)
Since the minimal sufficient statistics \({\textbf {D}}\), \({\textbf {A}}\), and \({\textbf {S}}\) are independent, the marginal likelihood function of the data under \(H_{0}\) with \(\varvec{\theta }_{0} = (\varvec{\mu }, \varvec{\Sigma })\) integrated out is given by
where \( A_0=\frac{\left| (n-2){\textbf {S}}\right| ^{\frac{n-p-3}{2}}}{(2\pi )^{\frac{p}{2}}2^{\frac{p(n-2)}{2}} \pi ^{\frac{p(p-1)}{4}}\prod _{j=1}^{p}\Gamma [0.5(n-1-j)]}\).
Under \(H_1\), the marginal likelihood function of the data with \(\varvec{\theta }_1 = (\varvec{\delta }, \varvec{\mu }, \varvec{\Sigma })\) integrated out is given by
The resulting Bayes factor for the hypothesis testing problem in (4) is given by
where \(q=\big (1+\frac{pf^{\star }}{n-p-1}\big )^{-1}\) with \(f^{\star }=\frac{n-p-1}{(n-2)p}n_{\delta }{\textbf {D}}^T{\textbf {S}}^{-1}{\textbf {D}}\).
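As a short worked step, substituting \(f^{\star }\) into the definition of \(q\) cancels the factors \(p\) and \(n-p-1\), so that \(q\) depends on the data only through the quadratic form \({\textbf {D}}^T{\textbf {S}}^{-1}{\textbf {D}}\). This uses only the two definitions above:

\[
\frac{pf^{\star }}{n-p-1}
= \frac{p}{n-p-1}\cdot \frac{n-p-1}{(n-2)p}\, n_{\delta }{\textbf {D}}^T{\textbf {S}}^{-1}{\textbf {D}}
= \frac{n_{\delta }{\textbf {D}}^T{\textbf {S}}^{-1}{\textbf {D}}}{n-2},
\qquad
q=\Big (1+\frac{n_{\delta }{\textbf {D}}^T{\textbf {S}}^{-1}{\textbf {D}}}{n-2}\Big )^{-1}.
\]

Assuming \({\textbf {S}}\) is positive definite, \(n_{\delta }>0\), and \(n>2\), the quadratic form is nonnegative and hence \(0<q\le 1\).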
For notational simplicity, we let \(a=(n-1)/2\), \(b=(p+1)/2\), and \(z=\left( 1-{1}/{q}\right) (p+1)/(n+1)\). By specifying the prior in (8) for the hyperparameter \(g\), the Bayes factor with \(g\) integrated out is given by
To facilitate the computation of the \({}_2 F_1\) function above, we apply the Pfaff transformation \({}_2 F_1(a, b; c; z)=(1-z)^{-a}~{}_2 F_1(a, c-b; c; \frac{z}{z-1})\) and obtain the Bayes factor in (9) after some algebra.
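The following is a minimal sketch of why the Pfaff transformation helps numerically, not the authors' implementation. For \(z\le 0\), the transformed argument \(z/(z-1)\) lies in \([0,1)\), so the Gauss power series converges even when \(|z|\ge 1\). The function names `hyp2f1_series` and `hyp2f1_pfaff`, the tolerance, and the parameter values in the usage lines are illustrative assumptions, since the value of \(c\) is not stated in this appendix.

```python
def hyp2f1_series(a, b, c, z, tol=1e-15, max_terms=10000):
    """Gauss power series for 2F1(a, b; c; z); only valid for |z| < 1."""
    if abs(z) >= 1:
        raise ValueError("power series requires |z| < 1")
    term = 1.0   # k-th series term, starting with k = 0
    total = 1.0
    for k in range(max_terms):
        # Ratio of consecutive terms: (a+k)(b+k) / ((c+k)(k+1)) * z
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * z
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total


def hyp2f1_pfaff(a, b, c, z):
    """2F1(a, b; c; z) for z <= 0 via the Pfaff transformation.

    Uses 2F1(a, b; c; z) = (1 - z)^(-a) * 2F1(a, c - b; c; z / (z - 1)).
    """
    if z > 0:
        raise ValueError("this sketch only handles z <= 0")
    w = z / (z - 1.0)  # lies in [0, 1) whenever z <= 0
    return (1.0 - z) ** (-a) * hyp2f1_series(a, c - b, c, w)


# Illustrative check against the closed form 2F1(1, 1; 2; z) = -log(1 - z) / z.
# At z = -3 the direct series diverges, but the transformed one converges.
value = hyp2f1_pfaff(1.0, 1.0, 2.0, -3.0)
```

Convergence of the transformed series slows as \(z/(z-1)\) approaches 1, that is, for very large \(|z|\); a production implementation would more likely rely on a tested special-function library.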
Cite this article
Chen, F., Hai, Q. & Wang, M. Bayesian hypothesis testing for equality of high-dimensional means using cluster subspaces. Comput Stat 39, 1301–1320 (2024). https://doi.org/10.1007/s00180-023-01366-0