Abstract
This paper addresses the problem of testing for differences between two networks with community structure. Existing methods lose effectiveness when the networks become sparse. We propose a test statistic that combines the method of Wu and Hu (2024) with a resampling process. The proposed test statistic is effective when the entries of the community-wise edge probability matrices are of order \(\Omega (\log n/n)\), where \(n\) denotes the network size. We derive the asymptotic null distribution of the test statistic and provide a guarantee of asymptotic power against the alternative hypothesis. To evaluate the performance of the proposed test statistic, we conduct simulations and provide real data examples. The results indicate that the proposed test statistic performs well for both dense and sparse networks.
Data availability
The dataset used in this paper is publicly available, with references provided in the text.
References
Abbe, E. (2018). Community Detection and Stochastic Block Models: Recent Developments. Journal of Machine Learning Research, 18(177), 1–86.
Amini, A. A., Chen, A., Bickel, P. J., & Levina, E. (2013). Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics, 41(4), 2097–2122.
Bassett, D. S., Bullmore, E., Verchinski, B. A., Mattay, V. S., Weinberger, D. R., & Meyer-Lindenberg, A. (2008). Hierarchical Organization of Human Cortical Networks in Health and Schizophrenia. Journal of Neuroscience, 28(37), 9239–9248.
Bickel, P., Choi, D., Chang, X., & Zhang, H. (2013). Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. The Annals of Statistics, 41(4), 1922–1943.
Bickel, P. J., & Sarkar, P. (2016). Hypothesis testing for automated community detection in networks. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1), 253–273.
Chen, K., & Lei, J. (2018). Network Cross-Validation for Determining the Number of Communities in Network Data. Journal of the American Statistical Association, 113(521), 241–251.
Chen, J., & Yuan, B. (2006). Detecting functional modules in the yeast protein-protein interaction network. Bioinformatics (Oxford, England), 22(18), 2283–2290.
Chen, L., Zhou, J., & Lin, L. (2021). Hypothesis testing for populations of networks. Communications in Statistics - Theory and Methods, 0(0), 1–24.
Chernozhukov, V., Chetverikov, D., & Kato, K. (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6), 2786–2819.
Dong, Z., Wang, S., & Liu, Q. (2020). Spectral based hypothesis testing for community detection in complex networks. Information Sciences, 512, 1360–1371.
Fan, J., & Jiang, T. (2019). Largest entries of sample correlation matrices from equi-correlated normal populations. The Annals of Probability, 47(5), 3321–3374.
Gangrade, A., Venkatesh, P., Nazer, B., & Saligrama, V. (2019). Efficient near-optimal testing of community changes in balanced stochastic block models. Advances in Neural Information Processing Systems 32.
Gao, C., Ma, Z., Zhang, A. Y., & Zhou, H. H. (2017). Achieving Optimal Misclassification Proportion in Stochastic Block Models. Journal of Machine Learning Research, 18(60), 1–45.
Ghoshdastidar, D., Gutzeit, M., Carpentier, A., & von Luxburg, U. (2020). Two-sample hypothesis testing for inhomogeneous random graphs. Annals of Statistics, 48(4), 2208–2229.
Ghoshdastidar, D., & von Luxburg, U. (2018). Practical Methods for Graph Two-Sample Testing. Advances in Neural Information Processing Systems 31.
Holland, P. W., Laskey, K. B., & Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social Networks, 5(2), 109–137.
Hu, J., Qin, H., Yan, T., & Zhao, Y. (2020). Corrected Bayesian Information Criterion for Stochastic Block Models. Journal of the American Statistical Association, 115(532), 1771–1783.
Hu, J., Zhang, J., Qin, H., Yan, T., & Zhu, J. (2020). Using Maximum Entry-Wise Deviation to Test the Goodness of Fit for Stochastic Block Models. Journal of the American Statistical Association, 0(0), 1–10.
Ji, P., & Jin, J. (2016). Coauthorship and citation networks for statisticians. The Annals of Applied Statistics, 10(4), 1779–1812.
Ji, P., Jin, J., Ke, Z. T., & Li, W. (2021). Co-citation and Co-authorship Networks of Statisticians. Journal of Business & Economic Statistics, 0(0), 1–17.
Jin, J. (2015). Fast community detection by SCORE. The Annals of Statistics, 43(1), 57–89.
Jing, B.-Y., Li, T., Ying, N., & Yu, X. (2022). Community detection in sparse networks using the symmetrized laplacian inverse matrix (slim). Statistica Sinica, 32, 1–22.
Krishna Reddy, P., Kitsuregawa, M., Sreekanth, P., & Srinivasa Rao, S. (2002). A graph based approach to extract a neighborhood customer community for collaborative filtering. In S. Bhalla (Ed.), Databases in Networked Information Systems (pp. 188–200). Berlin, Heidelberg: Springer.
Le, C. M., & Levina, E. (2022). Estimating the number of communities by spectral methods. Electronic Journal of Statistics, 16(1), 3315–3342.
Le, C. M., Levina, E., & Vershynin, R. (2017). Concentration and regularization of random graphs. Random Structures & Algorithms, 51(3), 538–561.
Leadbetter, M. R., Lindgren, G., & Rootzén, H. (1983). Extremes and Related Properties of Random Sequences and Processes. New York, NY: Springer Series in Statistics. Springer.
Lei, J. (2016). A goodness-of-fit test for stochastic block models. Annals of Statistics, 44(1), 401–424.
Lei, J., & Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. Annals of Statistics, 43(1), 215–237.
Li, T., Levina, E., & Zhu, J. (2020). Network cross-validation by edge sampling. Biometrika, 107(2), 257–276.
Ma, X., Wang, B., & Yu, L. (2018). Semi-supervised spectral algorithms for community detection in complex networks based on equivalence of clustering methods. Physica A: Statistical Mechanics and its Applications, 490, 786–802.
Newman, M. E. J., & Leicht, E. A. (2007). Mixture models and exploratory analysis in networks. Proceedings of the National Academy of Sciences, 104(23), 9564.
Newman, M. E. J. (2006). Finding community structure in networks using the eigenvectors of matrices. Physical Review E. Statistical, Nonlinear, and Soft Matter Physics, 74(3), 036104–19.
Pal, S., & Zhu, Y. (2021). Community detection in the sparse hypergraph stochastic block model. Random Structures & Algorithms, 59(3), 407–463.
Pontes, B., Giráldez, R., & Aguilar-Ruiz, J. S. (2015). Biclustering on expression data: A review. Journal of Biomedical Informatics, 57, 163–180.
Rohe, K., Chatterjee, S., & Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4), 1878–1915.
Rossi, L., & Magnani, M. (2015). Towards effective visual analytics on multiplex and multilayer networks. Chaos, Solitons & Fractals, 72, 68–76.
Saldaña, D. F., Yu, Y., & Feng, Y. (2017). How Many Communities Are There? Journal of Computational and Graphical Statistics, 26(1), 171–181.
Tang, M., Athreya, A., Sussman, D. L., Lyzinski, V., Park, Y., & Priebe, C. E. (2017). A Semiparametric Two-Sample Hypothesis Testing Problem for Random Graphs. Journal of Computational and Graphical Statistics, 26(2), 344–354.
Tang, M., Athreya, A., Sussman, D. L., Lyzinski, V., & Priebe, C. E. (2017). A nonparametric two-sample hypothesis testing problem for random graphs. Bernoulli, 23(3), 1599–1630.
Wang, Y. X. R., & Bickel, P. J. (2017). Likelihood-based model selection for stochastic block models. Annals of Statistics, 45(2), 500–528.
Westveld, A. H., & Hoff, P. D. (2011). A mixed effects model for longitudinal relational and network data, with applications to international trade and conflict. The Annals of Applied Statistics, 5(2A).
Wu, Q., & Hu, J. (2024). Two-sample test of stochastic block models. Computational Statistics & Data Analysis, 192, 107903.
Wu, Y., Lan, W., Feng, L., & Tsai, C.-L. (2022). Testing stochastic block models via the maximum sampling entry-wise deviation. Manuscript.
Zhang, B., Li, H., Riggins, R. B., Zhan, M., Xuan, J., Zhang, Z., Hoffman, E. P., Clarke, R., & Wang, Y. (2009). Differential dependency network analysis to identify condition-specific topological changes in biological networks. Bioinformatics, 25(4), 526–532.
Acknowledgements
The authors would like to thank the Editor, Associate Editor, and the two referees for their insightful comments.
Funding
Jiang Hu was partially supported by National Natural Science Foundation of China (Grant Nos. 12171078, 12292980, and 12292982), National Key R & D Program of China No. 2020YFA0714102 and Fundamental Research Funds for the Central Universities No. 2412023YQ003.
Appendices
Detailed proofs of Theorems 1 and 2
We first introduce the following lemma, which is used in the main proof.
Lemma 1
(Theorem 1.5.1 in Leadbetter et al. (1983)) Let \(Z_1, Z_2, \ldots , Z_{n^{*}}\) be a sequence of independent random variables with common distribution function \(F(x)\) for \(-\infty< x < \infty\). Let \(z_1, z_2, \ldots , z_{n^{*}}\) be a sequence of real numbers, and let \(0< \tau < 1\). Define \(M_n = \max \{Z_1, Z_2, \ldots , Z_{n^{*}}\}\). Then, the equality
\[ \lim _{n^{*} \rightarrow \infty } P(M_n \le z_{n^{*}}) = e^{-\tau } \]
holds true if and only if
\[ \lim _{n^{*} \rightarrow \infty } n^{*}\{1 - F(z_{n^{*}})\} = \tau . \]
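As an informal check of Lemma 1, the sketch below uses the standard exponential distribution, which is an assumption made only for illustration and does not appear in the paper. With \(F(x) = 1 - e^{-x}\) and threshold \(z_{n^{*}} = \log (n^{*}/\tau )\), the tail quantity \(n^{*}\{1 - F(z_{n^{*}})\}\) equals \(\tau\) exactly, and \(P(M_n \le z_{n^{*}}) = F(z_{n^{*}})^{n^{*}}\) can be compared directly with \(e^{-\tau }\). The function name and the value of \(\tau\) are hypothetical.

```python
import math


def max_cdf_at_threshold(n_star: int, tau: float) -> float:
    """Exact P(max of n_star i.i.d. Exp(1) draws <= z), with z = log(n_star / tau)."""
    z = math.log(n_star / tau)       # chosen so that n_star * (1 - F(z)) == tau
    cdf_at_z = 1.0 - math.exp(-z)    # F(z) for the standard exponential
    return cdf_at_z ** n_star        # independence: P(max <= z) = F(z) ** n_star


tau = 0.5  # illustrative value in (0, 1)
limit = math.exp(-tau)
for n_star in (10, 1_000, 100_000):
    # The middle column should approach the last column as n_star grows.
    print(n_star, max_cdf_at_threshold(n_star, tau), limit)
```

The exponential case is convenient because both sides of the equivalence are available in closed form, so no simulation error enters the comparison.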
Proof of Theorem 1
In this section, we present the proof of Theorem 1.
First, we bound the estimation error of the following equation:
Next, define
In accordance with Eq. A1, we can derive that
Under Assumption 3, we have \(K = o(\sqrt{\log n})\) and \(MK=o(n)\). Hence, if \(F_{n,0} = O_P(\sqrt{\log MK})\), then we have
Thus to prove Theorem 1, it is sufficient to show that
To derive the asymptotic null distribution of \(F_{n}\), we define the \(\sigma\)-field \({\mathcal {G}} =\sigma \{ \hat{\rho }_{ik,0}: 1 \le i \le n; 1 \le k \le K \}\) that is generated from the auxiliary quantity \(\hat{\rho }_{ik,0}\) in \(F_{n,0}\). Let \(\tilde{\rho }_{ik}\) be the observed value of \(\hat{\rho }_{ik,0}\). Then, conditional on \({\mathcal {G}}\), the \(\hat{\rho }_{ik,0}\)s are independent and identically distributed with the probability \(P(\hat{\rho }_{ik,0} = \tilde{\rho }_{ik}) = \frac{1}{nK}\) for any \(1 \le i \le n\) and \(1 \le k \le K\). Note that \(\hat{\gamma }_{mk,0}\) is calculated from \(\hat{\rho }_{ik,0}\) in \(F_{n,0}\), and we denote by \(\tilde{\gamma }_{mk}\) the observed value of \(\hat{\gamma }_{mk,0}\) calculated from \(\tilde{\rho }_{ik}\). Let \(\bar{\rho } = \frac{1}{nK} \sum _{i=1}^n \sum _{k=1}^K \tilde{\rho }_{ik}\) and \(\bar{\gamma } = \frac{1}{MK} \sum _{m=1}^M\sum _{k=1}^K \tilde{\gamma }_{mk}\).
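The conditional distribution described above is that of a uniform draw from the \(nK\) observed values, so the conditional mean of a draw is \(\bar{\rho }\) and a centered draw has conditional mean zero. The following minimal sketch checks this with exact rational arithmetic; the numerical values and the function name are hypothetical and serve only as an illustration.

```python
from fractions import Fraction


def conditional_mean_of_centered_draw(observed):
    """E(rho_hat - rho_bar | G) when rho_hat is uniform on the observed values."""
    values = [v for row in observed for v in row]  # flatten the n x K array
    prob = Fraction(1, len(values))                # P(rho_hat = rho_tilde_ik) = 1 / (nK)
    rho_bar = sum(values) * prob                   # grand mean over all nK entries
    return sum(prob * (v - rho_bar) for v in values)


# Hypothetical observed values for n = 3 nodes and K = 2 communities.
rho_tilde = [
    [Fraction(3, 10), Fraction(-1, 5)],
    [Fraction(7, 10), Fraction(1, 10)],
    [Fraction(-2, 5), Fraction(1, 2)],
]
print(conditional_mean_of_centered_draw(rho_tilde))  # prints 0
```

Using `Fraction` avoids floating-point rounding, so the centered mean is exactly zero rather than approximately zero.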
As a result, \(E(\hat{\rho }_{ik,0} - \bar{\rho } | {\mathcal {G}}) = 0\) and \(E(\hat{\gamma }_{mk,0} - \bar{\gamma } | {\mathcal {G}}) = 0\). By Corollary 2.1 in Chernozhukov et al. (2013) and conditional on \({\mathcal {G}}\), as \(\min \{n, M, S\} \rightarrow \infty\), we can derive that
where \(U = (U_1, \ldots , U_{MK})^\top \in {\mathbb {R}}^{MK}\) is a Gaussian random vector with mean 0 and covariance matrix \(\text {cov}(\hat{\gamma }_{mk,0})\), and c and C are some finite positive constants. Define \(\check{\gamma } = \text {vec}(\hat{\gamma }_{mk,0}) = (\check{\gamma }_1, \ldots , \check{\gamma }_{MK})^\top \in {\mathbb {R}}^{MK}\) for \(1 \le m \le M\) and \(1 \le k \le K\). As a result, the above inequality can be rewritten as
as \(\min \{n, M, B\} \rightarrow \infty\).
Next, we calculate \(\text {cov}(\check{\gamma })\). Since the diagonal elements of \(\text {cov}(\check{\gamma })\) are 1 by definition, it suffices to compute \(\text {corr}(\check{\gamma })\). Denote by \(\hat{\rho }_{.k,0}\) the vector containing all elements \(\hat{\rho }_{ik,0}\) in block \(k\). In addition, let \(\Lambda _{m_s} = (\lambda _{m_s,1}, \lambda _{m_s,2}, \ldots , \lambda _{m_s,n})^\top\) be random vectors that are independently generated from \(\text {Bernoulli}(n, \frac{S}{n})\), that is, with \(n\) Bernoulli entries of success probability \(\frac{S}{n}\), for \(1 \le m_s \le M\), and that are independent of \(\hat{\rho }_{.k,0}\). Thus, for \(i = 1, 2, \ldots , n\), \(\lambda _{m_s,i}\) follows the Bernoulli distribution with probability \(\frac{S}{n}\), which implies that \(E(\lambda _{m_s,i}) = E(\lambda ^2_{m_s,i}) = \frac{S}{n}\). As a result, we obtain that \(\hat{\gamma }_{m_sk,0} = \Lambda ^\top _{m_s}\hat{\rho }_{.k,0}/ \sqrt{S}\). Then, for any \(1 \le m_s, m_l \le M\) with subscripts \(s \ne l\), it can be shown that
It can be seen that the correlations between \(\check{\gamma }_{s}\) and \(\check{\gamma }_{l}\) are all equal for any \(s \ne l\). According to Fan and Jiang (2019), we can rewrite \(\max _{1 \le l \le MK} U_l\) as follows:
where \(\tilde{U}_0, \tilde{U}_1, \ldots , \tilde{U}_{MK}\) are independent and identically distributed standard normal variables. Now we can write (A2) as
Note that, under the null hypothesis we have \({\mathbb {E}}(\bar{\gamma }) = 0\). Next we derive \(\text {var}(\bar{\gamma })\).
Under Assumption 3, we have \(\text {var}(\bar{\gamma })\rightarrow 0\), as \(\min \{n, M, B\} \rightarrow \infty\).
This implies \(\bar{\gamma } = o_p(1)\). Additionally, under Assumption 3, we have \(\delta _{\gamma }=o(1)\).
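For intuition on the equal-correlation step, a standard construction (assumed here for illustration; the paper's own display is not reproduced) writes unit-variance Gaussian variables with common correlation \(\delta\) as \(\sqrt{\delta }\,\tilde{U}_0 + \sqrt{1-\delta }\,\tilde{U}_l\), where \(\tilde{U}_0, \tilde{U}_1, \ldots\) are independent standard normal variables. The sketch below simulates two such coordinates and compares their empirical correlation with \(\delta\). The value of \(\delta\), the seed, and the function names are hypothetical.

```python
import math
import random


def equicorrelated_pair(delta, rng):
    """Draw (U_1, U_2) with unit variances and correlation delta."""
    shared = rng.gauss(0.0, 1.0)                 # common factor, plays the role of U~_0
    scale_shared = math.sqrt(delta)
    scale_own = math.sqrt(1.0 - delta)
    u1 = scale_shared * shared + scale_own * rng.gauss(0.0, 1.0)
    u2 = scale_shared * shared + scale_own * rng.gauss(0.0, 1.0)
    return u1, u2


def empirical_correlation(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / n
    var_y = sum((y - mean_y) ** 2 for _, y in pairs) / n
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)   # fixed seed so the run is reproducible
delta = 0.3              # illustrative common correlation
pairs = [equicorrelated_pair(delta, rng) for _ in range(50_000)]
print(empirical_correlation(pairs), delta)   # the two numbers should be close
```

When \(\delta\) tends to zero, the shared term \(\sqrt{\delta }\,\tilde{U}_0\) becomes negligible and the coordinates behave like independent standard normals, which appears to be the role played by \(\delta _{\gamma }=o(1)\) in the argument.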
As a result, we can rewrite (A3) as
as \(\min \{n, M, B\} \rightarrow \infty\).
For any \(x \in {\mathbb {R}}\), denote \(u = \frac{1}{2} \log (MK) - \log \log (MK) + x\). We then have
Subsequently, by Lemma 1, as \(\min \{n, M, B\} \rightarrow \infty\), we have
for any \(x \in {\mathbb {R}}\). Combining with (A4), we obtain
Accordingly,
Since
we have
This completes the proof of Theorem 1.
\(\square\)
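The sampling weights used in the covariance calculation of the proof above can also be illustrated numerically. The sketch below assumes that each \(\lambda _{m_s,i}\) is an independent Bernoulli indicator with probability \(S/n\), as stated in the proof, and forms \(\Lambda ^\top _{m_s}\hat{\rho }_{.k,0}/\sqrt{S}\) for a hypothetical vector of values. The sizes, the seed, the stand-in values for \(\hat{\rho }_{.k,0}\), and the function names are illustrative only. Because the indicators take values in \(\{0, 1\}\), \(\lambda ^2 = \lambda\), so the first and second sample moments coincide and should both be close to \(S/n\).

```python
import math
import random


def draw_weights(n, s, rng):
    """One weight vector Lambda: n independent Bernoulli(s / n) indicators."""
    return [1 if rng.random() < s / n else 0 for _ in range(n)]


def sampled_statistic(weights, rho, s):
    """Lambda^T rho / sqrt(S) for one weight vector and one block's values."""
    return sum(w * r for w, r in zip(weights, rho)) / math.sqrt(s)


rng = random.Random(1)   # fixed seed so the run is reproducible
n, s, repeats = 200, 20, 2_000
rho = [rng.gauss(0.0, 1.0) for _ in range(n)]   # hypothetical stand-in for rho_hat_{.k,0}

draws = [draw_weights(n, s, rng) for _ in range(repeats)]
flat = [w for weights in draws for w in weights]
first_moment = sum(flat) / len(flat)                   # estimates E(lambda)
second_moment = sum(w * w for w in flat) / len(flat)   # estimates E(lambda^2)
gammas = [sampled_statistic(weights, rho, s) for weights in draws]

print(first_moment, second_moment, s / n)
```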
Proof of Theorem 2
In this section, we present the proof of Theorem 2.
Note that
From the discussion in the proof of Theorem 1, we can obtain that
where
By the result of (A5), we have that
Moreover, from Assumptions 1, 2, 3 and \(\max \limits _{1 \le i \le n,1 \le j \le n}\vert B_{1,g_{i}g_{j}}-B_{2,g_{i}g_{j}} \vert = \Omega (\frac{\log n}{n}\sqrt{\frac{K}{S}})\), we have that
Thus, we finally obtain that
for any positive constant c.
This completes the proof of Theorem 2. \(\square\)
About this article
Cite this article
Wu, Q., Hu, J. Two-sample test of stochastic block models via the maximum sampling entry-wise deviation. J. Korean Stat. Soc. (2024). https://doi.org/10.1007/s42952-024-00260-9
DOI: https://doi.org/10.1007/s42952-024-00260-9