Two-sample test of stochastic block models via the maximum sampling entry-wise deviation

Wu, Qianyong; Hu, Jiang

doi:10.1007/s42952-024-00260-9

Research Article
Published: 03 March 2024

Journal of the Korean Statistical Society (2024)

Qianyong Wu¹ &
Jiang Hu¹

Abstract

The paper discusses a statistical problem related to testing for differences between two networks with community structures. While existing methods have been proposed, they encounter challenges and do not perform effectively when the networks become sparse. We propose a test statistic that combines a method proposed by Wu and Hu (2024) and a resampling process. Specifically, the proposed test statistic proves effective under the condition that the community-wise edge probability matrices have entries of order $\Omega (\log n/n)$, where n denotes the network size. We derive the asymptotic null distribution of the test statistic and provide a guarantee of asymptotic power against the alternative hypothesis. To evaluate the performance of the proposed test statistic, we conduct simulations and provide real data examples. The results indicate that the proposed test statistic performs well for both dense and sparse networks.

Data availability

The dataset used in this paper is publicly available, with references provided in the text.

References

Abbe, E. (2018). Community Detection and Stochastic Block Models: Recent Developments. Journal of Machine Learning Research, 18(177), 1–86.
Amini, A. A., Chen, A., Bickel, P. J., & Levina, E. (2013). Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics, 41(4), 2097–2122.
Bassett, D. S., Bullmore, E., Verchinski, B. A., Mattay, V. S., Weinberger, D. R., & Meyer-Lindenberg, A. (2008). Hierarchical Organization of Human Cortical Networks in Health and Schizophrenia. Journal of Neuroscience, 28(37), 9239–9248.
Bickel, P., Choi, D., Chang, X., & Zhang, H. (2013). Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. The Annals of Statistics, 41(4), 1922–1943.
Bickel, P. J., & Sarkar, P. (2016). Hypothesis testing for automated community detection in networks. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1), 253–273.
Chen, K., & Lei, J. (2018). Network Cross-Validation for Determining the Number of Communities in Network Data. Journal of the American Statistical Association, 113(521), 241–251.
Chen, J., & Yuan, B. (2006). Detecting functional modules in the yeast protein-protein interaction network. Bioinformatics (Oxford, England), 22(18), 2283–2290.
Chen, L., Zhou, J., & Lin, L. (2021). Hypothesis testing for populations of networks. Communications in Statistics - Theory and Methods 0(0), 1–24.
Chernozhukov, V., Chetverikov, D., & Kato, K. (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6), 2786–2819.
Dong, Z., Wang, S., & Liu, Q. (2020). Spectral based hypothesis testing for community detection in complex networks. Information Sciences, 512, 1360–1371.
Fan, J., & Jiang, T. (2019). Largest entries of sample correlation matrices from equi-correlated normal populations. The Annals of Probability, 47(5), 3321–3374.
Gangrade, A., Venkatesh, P., Nazer, B., & Saligrama, V. (2019). Efficient near-optimal testing of community changes in balanced stochastic block models. Advances in Neural Information Processing Systems 32.
Gao, C., Ma, Z., Zhang, A. Y., & Zhou, H. H. (2017). Achieving Optimal Misclassification Proportion in Stochastic Block Models. Journal of Machine Learning Research, 18(60), 1–45.
Ghoshdastidar, D., Gutzeit, M., Carpentier, A., & von Luxburg, U. (2020). Two-sample hypothesis testing for inhomogeneous random graphs. Annals of Statistics, 48(4), 2208–2229.
Ghoshdastidar, D. & von Luxburg, U. (2018). Practical Methods for Graph Two-Sample Testing. Advances in Neural Information Processing Systems 31.
Holland, P. W., Laskey, K. B., & Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social Networks, 5(2), 109–137.
Hu, J., Qin, H., Yan, T., & Zhao, Y. (2020). Corrected Bayesian Information Criterion for Stochastic Block Models. Journal of the American Statistical Association, 115(532), 1771–1783.
Hu, J., Zhang, J., Qin, H., Yan, T. & Zhu, J. (2020). Using Maximum Entry-Wise Deviation to Test the Goodness of Fit for Stochastic Block Models. Journal of the American Statistical Association 0(0), 1–10.
Ji, P., & Jin, J. (2016). Coauthorship and citation networks for statisticians. The Annals of Applied Statistics, 10(4), 1779–1812.
Ji, P., Jin, J., Ke, Z.T., & Li, W. (2021). Co-citation and Co-authorship Networks of Statisticians. Journal of Business & Economic Statistics 0(0), 1–17
Jin, J. (2015). Fast community detection by SCORE. The Annals of Statistics, 43(1), 57–89.
Jing, B.-Y., Li, T., Ying, N., & Yu, X. (2022). Community detection in sparse networks using the symmetrized laplacian inverse matrix (slim). Statistica Sinica, 32, 1–22.
Krishna Reddy, P., Kitsuregawa, M., Sreekanth, P., & Srinivasa Rao, S. (2002). A graph based approach to extract a neighborhood customer community for collaborative filtering. In S. Bhalla (Ed.), Databases in Networked Information Systems (pp. 188–200). Berlin, Heidelberg: Springer.
Le, C. M., & Levina, E. (2022). Estimating the number of communities by spectral methods. Electronic Journal of Statistics, 16(1), 3315–3342.
Le, C. M., Levina, E., & Vershynin, R. (2017). Concentration and regularization of random graphs. Random Structures & Algorithms, 51(3), 538–561.
Leadbetter, M. R., Lindgren, G., & Rootzén, H. (1983). Extremes and Related Properties of Random Sequences and Processes. New York, NY: Springer Series in Statistics. Springer.
Lei, J. (2016). A goodness-of-fit test for stochastic block models. Annals of Statistics, 44(1), 401–424.
Lei, J., & Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. Annals of Statistics, 43(1), 215–237.
Li, T., Levina, E., & Zhu, J. (2020). Network cross-validation by edge sampling. Biometrika, 107(2), 257–276.
Ma, X., Wang, B., & Yu, L. (2018). Semi-supervised spectral algorithms for community detection in complex networks based on equivalence of clustering methods. Physica A: Statistical Mechanics and its Applications, 490, 786–802.
Newman, M. E. J., & Leicht, E. A. (2007). Mixture models and exploratory analysis in networks. Proceedings of the National Academy of Sciences, 104(23), 9564.
Newman, M.E.J. (2006). Finding community structure in networks using the eigenvectors of matrices. Physical Review E. Statistical, Nonlinear, and Soft Matter Physics 74(3), 036104–19.
Pal, S., & Zhu, Y. (2021). Community detection in the sparse hypergraph stochastic block model. Random Structures & Algorithms, 59(3), 407–463.
Pontes, B., Giráldez, R., & Aguilar-Ruiz, J. S. (2015). Biclustering on expression data: A review. Journal of Biomedical Informatics, 57, 163–180.
Rohe, K., Chatterjee, S., & Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4), 1878–1915.
Rossi, L., & Magnani, M. (2015). Towards effective visual analytics on multiplex and multilayer networks. Chaos, Solitons & Fractals, 72, 68–76.
Saldaña, D. F., Yu, Y., & Feng, Y. (2017). How Many Communities Are There? Journal of Computational and Graphical Statistics, 26(1), 171–181.
Tang, M., Athreya, A., Sussman, D. L., Lyzinski, V., Park, Y., & Priebe, C. E. (2017). A Semiparametric Two-Sample Hypothesis Testing Problem for Random Graphs. Journal of Computational and Graphical Statistics, 26(2), 344–354.
Tang, M., Athreya, A., Sussman, D. L., Lyzinski, V., & Priebe, C. E. (2017). A nonparametric two-sample hypothesis testing problem for random graphs. Bernoulli, 23(3), 1599–1630.
Wang, Y. X. R., & Bickel, P. J. (2017). Likelihood-based model selection for stochastic block models. Annals of Statistics, 45(2), 500–528.
Westveld, A.H., & Hoff, P.D. (2011). A mixed effects model for longitudinal relational and network data, with applications to international trade and conflict. The Annals of Applied Statistics, 5(2A)
Wu, Q., & Hu, J. (2024). Two-sample test of stochastic block models. Computational Statistics & Data Analysis, 192, 107903.
Wu, Y., Lan, W., Feng, L., & Tsai, C.-L. (2022). Testing stochastic block models via the maximum sampling entry-wise deviation. Manuscript.
Zhang, B., Li, H., Riggins, R. B., Zhan, M., Xuan, J., Zhang, Z., Hoffman, E. P., Clarke, R., & Wang, Y. (2009). Differential dependency network analysis to identify condition-specific topological changes in biological networks. Bioinformatics, 25(4), 526–532.

Acknowledgements

The authors would like to thank the Editor, Associate Editor, and the two referees for their insightful comments.

Funding

Jiang Hu was partially supported by National Natural Science Foundation of China (Grant Nos. 12171078, 12292980, and 12292982), National Key R & D Program of China No. 2020YFA0714102 and Fundamental Research Funds for the Central Universities No. 2412023YQ003.

Author information

Authors and Affiliations

KLASMOE and School of Mathematics & Statistics, Northeast Normal University, 5268 Renmin Street, Changchun, 130024, Jilin, China
Qianyong Wu & Jiang Hu

Corresponding author

Correspondence to Jiang Hu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Detailed proofs of Theorems 1 and 2

We first state a lemma that is used in the main proof.

Lemma 1

(Theorem 1.5.1 in Leadbetter et al. (1983)) Let $Z_1, Z_2, \ldots , Z_{n^{*}}$ be independent and identically distributed random variables with common distribution function $F(x)$ for $-\infty< x < \infty$. Let $\{z_{n^{*}}\}$ be a sequence of real numbers, let $0 \le \tau \le \infty$, and define $M_{n^{*}} = \max \{Z_1, Z_2, \ldots , Z_{n^{*}}\}$. Then

$$\begin{aligned} \lim _{n^{*} \rightarrow \infty } P(M_{n^{*}} \le z_{n^{*}}) = e^{-\tau } \end{aligned}$$

holds if and only if

$$\begin{aligned} \lim _{n^{*} \rightarrow \infty } n^{*}\left( 1 - F(z_{n^{*}})\right) = \tau . \end{aligned}$$
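As a small illustration of Lemma 1, the sketch below uses a case where everything is available in closed form. It assumes (purely for illustration, not from the paper) that the variables are standard exponential, so that $1 - F(z) = e^{-z}$, and picks the threshold $z = \log (n^{*}/\tau )$ so that $n^{*}(1 - F(z)) = \tau$ exactly. The function name `max_cdf_at_threshold` is hypothetical.

```python
import math


def max_cdf_at_threshold(n_star, tau):
    """Exact P(max of n_star i.i.d. Exp(1) variables <= z), with z = log(n_star / tau).

    Illustrative assumption: Exp(1) variables, so 1 - F(z) = exp(-z) and the
    threshold satisfies n_star * (1 - F(z)) = tau by construction.
    Requires 0 < tau < n_star.
    """
    z = math.log(n_star / tau)
    tail = math.exp(-z)              # 1 - F(z) = tau / n_star
    return (1.0 - tail) ** n_star    # independence: P(max <= z) = F(z) ** n_star


if __name__ == "__main__":
    tau = 0.7
    for n_star in (10, 1_000, 1_000_000):
        print(n_star, max_cdf_at_threshold(n_star, tau), math.exp(-tau))
```

The printed probabilities approach $e^{-\tau }$ as $n^{*}$ grows, which is the direction of the lemma used later in the proof.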

Proof of Theorem 1

In this section, we present the proof of Theorem 1.

First, we bound the error incurred by replacing the block-wise edge probabilities with their estimates in the variance normalization:

$$\begin{aligned} \begin{aligned}&\max \limits _{1 \le u \le K,1 \le v \le K} \vert \frac{\sqrt{B_{1,uv}(1-B_{1,uv})+B_{2,uv}(1-B_{2,uv})}}{\sqrt{\hat{B}_{1,uv}(1-\hat{B}_{1,uv}) +\hat{B}_{2,uv}(1-\hat{B}_{2,uv})}}-1\vert \\&\quad = \max \limits _{1 \le u \le K,1 \le v \le K} \Bigg \vert \frac{\sqrt{B_{1,uv}(1-B_{1,uv})+B_{2,uv}(1-B_{2,uv})}}{\sqrt{\hat{B}_{1,uv} (1-\hat{B}_{1,uv})+\hat{B}_{2,uv}(1-\hat{B}_{2,uv})}}\\&\quad -\frac{\sqrt{\hat{B}_{1,uv}(1-\hat{B}_{1,uv}) +\hat{B}_{2,uv}(1-\hat{B}_{2,uv})}}{\sqrt{\hat{B}_{1,uv}(1-\hat{B}_{1,uv}) +\hat{B}_{2,uv}(1-\hat{B}_{2,uv})}}\Bigg \vert \\&\quad = O_p( \frac{K}{\log n}). \end{aligned} \end{aligned}$$

(A1)

Next, define

$$\begin{aligned} \begin{aligned}&F_{n,0} \triangleq \max \limits _{1 \le m \le M,1 \le k \le K} \vert \hat{\gamma }_{mk,0} \vert \\&\quad =\max \limits _{1 \le m \le M,1 \le k \le K} \vert \frac{\hat{\rho }_{m_1k,0}+\hat{\rho }_{m_2k,0}+\cdots +\hat{\rho }_{m_Sk,0}}{\sqrt{S}}\vert \\&\quad =\max \limits _{1 \le m \le M,1 \le k \le K} \Bigg \vert \frac{1}{\sqrt{S}}\sum _{s=1}^{S} \frac{1}{\sqrt{ \#( \ g^{-1}(k)\backslash \left\{ i_{m_s}\right\} )}} \\&\quad \times \sum _{j\in g^{-1}(k)\backslash \left\{ i_{m_s}\right\} }\frac{A_{1,ij}-A_{2,ij}}{\sqrt{ B_{1, g_{i} g_{j}}(1- B_{1, g_{i} g_{j}})+ B_{2,{g}_{i} g_{j}}(1- B_{2, g_{i} g_{j}})}}\Bigg \vert . \end{aligned} \end{aligned}$$

By (A1), we obtain

$$\begin{aligned} \begin{aligned} F_{n}&\triangleq \max \limits _{1 \le m \le M,1 \le k \le K} \vert \hat{\gamma }_{mk} \vert \\&=\max \limits _{1 \le m \le M,1 \le k \le K} \vert \frac{\hat{\rho }_{m_1k}+\hat{\rho }_{m_2k} +\cdots +\hat{\rho }_{m_Sk}}{\sqrt{S}}\vert \\&=\max \limits _{1 \le m \le M,1 \le k \le K} \Bigg \vert \frac{1}{\sqrt{S}}\sum _{s=1}^{S} \frac{1}{\sqrt{ \#( \ \hat{g}^{-1}(k)\backslash \left\{ i_{m_s}\right\} )}} \\&\quad \times \sum _{j\in \hat{g}^{-1}(k)\backslash \left\{ i_{m_s}\right\} }\frac{A_{1,ij} -A_{2,ij}}{\sqrt{ \hat{B}_{1, \hat{g}_{i} \hat{g}_{j}}(1- \hat{B}_{1, \hat{g}_{i} \hat{g}_{j}}) + \hat{B}_{2,{\hat{g}}_{i} \hat{g}_{j}}(1- \hat{B}_{2, \hat{g}_{i} \hat{g}_{j}})}}\Bigg \vert \\&=\max \limits _{1 \le m \le M,1 \le k \le K} \Bigg \vert \frac{1}{\sqrt{S}}\sum _{s=1}^{S} \frac{\sum _{j\in g^{-1}(k)\backslash \left\{ i_{m_s}\right\} }\frac{A_{1,ij}-A_{2,ij}}{\sqrt{ B_{1, g_{i} g_{j}}(1- B_{1, g_{i} g_{j}})+ B_{2,{g}_{i} g_{j}}(1- B_{2, g_{i} g_{j}})}}}{\sqrt{ \#( \ g^{-1}(k)\backslash \left\{ i_{m_s}\right\} )}} \\&\quad \quad \times \frac{\sqrt{B_{1,g_{i}g_{j}}(1-B_{1,g_{i}g_{j}})+B_{2,g_{i}g_{j}} (1-B_{2,g_{i}g_{j}})}}{\sqrt{\hat{B}_{1,g_{i}g_{j}}(1-\hat{B}_{1,g_{i}g_{j}})+\hat{B}_{2,g_{i}g_{j}} (1-\hat{B}_{2,g_{i}g_{j}})}}\Bigg \vert \\&= F_{n,0}\left( 1+O_p\left( \frac{K}{\log n}\right) \right) . \end{aligned} \end{aligned}$$
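For concreteness, the quantity $F_n$ displayed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the community labels `g` are taken as given (0-based), the block probabilities are estimated by simple block-wise edge averages, each of the $M$ index sets is drawn with replacement, and a small `eps` guards against a zero variance estimate. None of these choices is confirmed by this appendix, and the function names are hypothetical.

```python
import math
import random


def block_probabilities(A, g, K):
    """Block-wise edge averages as a plug-in estimate of the K x K matrix (assumed estimator)."""
    n = len(A)
    edges = [[0.0] * K for _ in range(K)]
    pairs = [[0] * K for _ in range(K)]
    for i in range(n):
        for j in range(n):
            if i != j:
                edges[g[i]][g[j]] += A[i][j]
                pairs[g[i]][g[j]] += 1
    return [[edges[u][v] / pairs[u][v] if pairs[u][v] else 0.0 for v in range(K)]
            for u in range(K)]


def max_sampling_deviation(A1, A2, g, K, M, S, rng, eps=1e-12):
    """Sketch of F_n = max over (m, k) of |gamma_hat_{mk}| for two adjacency matrices."""
    n = len(A1)
    B1 = block_probabilities(A1, g, K)
    B2 = block_probabilities(A2, g, K)
    members = [[j for j in range(n) if g[j] == k] for k in range(K)]

    def rho(i, k):
        # Standardized row sum of A1 - A2 over block k, excluding node i itself.
        others = [j for j in members[k] if j != i]
        if not others:
            return 0.0
        b1, b2 = B1[g[i]][k], B2[g[i]][k]
        scale = math.sqrt(max(b1 * (1 - b1) + b2 * (1 - b2), eps))
        total = sum(A1[i][j] - A2[i][j] for j in others)
        return total / (scale * math.sqrt(len(others)))

    F = 0.0
    for _ in range(M):
        sample = [rng.randrange(n) for _ in range(S)]  # assumption: with replacement
        for k in range(K):
            gamma = sum(rho(i, k) for i in sample) / math.sqrt(S)
            F = max(F, abs(gamma))
    return F
```

A call such as `max_sampling_deviation(A1, A2, g, K=2, M=50, S=10, rng=random.Random(0))` returns a single non-negative number; identical inputs `A1 == A2` give exactly zero because every entry-wise difference vanishes.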

Under Assumption 3, we have $K = o(\sqrt{\log n})$ and $MK=o(n)$, so that $\frac{K}{\log n}\sqrt{\log (MK)} = o(1)$. Hence, if $F_{n,0} = O_p(\sqrt{\log (MK)})$, then

$$\begin{aligned} F_n = F_{n,0}+o_p(1). \end{aligned}$$

Thus to prove Theorem 1, it is sufficient to show that

$$\begin{aligned} P\left( F^{2}_{n,0} - 2\log (MK) + \log \log (MK) \le x \right) \rightarrow \exp \left( -\frac{1}{\sqrt{\pi }}e^{-x/2}\right) . \end{aligned}$$

To derive the asymptotic null distribution of $F_{n}$, we define the $\sigma$-field ${\mathcal {G}} =\sigma \{ \hat{\rho }_{ik,0}: 1 \le i \le n; 1 \le k \le K \}$ generated by the auxiliary quantities $\hat{\rho }_{ik,0}$ in $F_{n,0}$. Let $\tilde{\rho }_{ik}$ be the observed value of $\hat{\rho }_{ik,0}$. Then, conditional on ${\mathcal {G}}$, the $\hat{\rho }_{ik,0}$s are independent and identically distributed with the probability $P(\hat{\rho }_{ik,0} = \tilde{\rho }_{ik}) = \frac{1}{nK}$ for any $1 \le i \le n$ and $1 \le k \le K$. Since $\hat{\gamma }_{mk,0}$ is calculated from the $\hat{\rho }_{ik,0}$ in $F_{n,0}$, we denote by $\tilde{\gamma }_{mk}$ the observed value of $\hat{\gamma }_{mk,0}$ calculated from the $\tilde{\rho }_{ik}$. Let $\bar{\rho } = \frac{1}{nK} \sum _{i=1}^n \sum _{k=1}^K \tilde{\rho }_{ik}$ and $\bar{\gamma } = \frac{1}{MK} \sum _{m=1}^M\sum _{k=1}^K \tilde{\gamma }_{mk}$.

As a result, $E(\hat{\rho }_{ik,0} - \bar{\rho } | {\mathcal {G}}) = 0$ and $E(\hat{\gamma }_{mk,0} - \bar{\gamma } | {\mathcal {G}}) = 0$. By Corollary 2.1 in Chernozhukov et al. (2013), conditional on ${\mathcal {G}}$ and as $\min \{n, M, S\} \rightarrow \infty$, we obtain

$$\begin{aligned}&\sup _{x \in {\mathbb {R}}} \left| P\left( \max _{1 \le m \le M, 1 \le k \le K} \frac{1}{\sqrt{S}} \sum _{s=1}^{S}\hat{\rho }_{m_sk,0}- \sqrt{S}\bar{\rho } \le x\right) - P\left( \max _{1 \le l \le MK} U_l \le x\right) \right| \\&\quad \le CS^{-c} \rightarrow 0, \end{aligned}$$

where $U = (U_1, \ldots , U_{MK})^\top \in {\mathbb {R}}^{MK}$ is a Gaussian random vector with mean 0 and covariance matrix $\text {cov}(\hat{\gamma }_{mk,0})$, and c and C are some finite positive constants. Define $\check{\gamma } = \text {vec}(\hat{\gamma }_{mk,0}) = (\check{\gamma }_1, \ldots , \check{\gamma }_{MK})^\top \in {\mathbb {R}}^{MK}$ for $1 \le m \le M$ and $1 \le k \le K$. As a result, the above inequality can be rewritten as

$$\begin{aligned} \sup _{x \in {\mathbb {R}}} \left| P\left( \max _{1 \le l \le MK} \check{\gamma }_l -\bar{\gamma } \le x\right) - P\left( \max _{1 \le l \le MK} U_l\le x\right) \right| \le CS^{-c} \rightarrow 0 \end{aligned}$$

(A2)

as $\min \{n, M, S\} \rightarrow \infty$.

Next, we calculate $\text {cov}(\check{\gamma })$. Since the diagonal elements of $\text {cov}(\check{\gamma })$ are 1 by definition, it suffices to compute $\text {corr}(\check{\gamma })$. Denote by $\hat{\rho }_{.k,0}$ the vector containing all elements $\hat{\rho }_{ik,0}$ in block $k$. In addition, let $\Lambda _{m_s} = (\lambda _{m_s,1}, \lambda _{m_s,2}, \ldots , \lambda _{m_s,n})^T$ be a random vector whose entries are generated independently from the Bernoulli distribution with success probability $\frac{S}{n}$ for $1 \le m_s \le M$, independently of $\hat{\rho }_{.k,0}$. Thus, for $i = 1, 2, \ldots , n$, we have $E(\lambda _{m_s,i}) = E(\lambda ^2_{m_s,i}) = \frac{S}{n}$. As a result, we obtain $\hat{\gamma }_{m_sk,0} = \Lambda ^T_{m_s}\hat{\rho }_{.k,0}/ \sqrt{S}$. Then, for any $1 \le m_s, m_l \le M$ with $s \ne l$, it can be shown that

$$\begin{aligned} \delta _{\gamma } \triangleq \text {corr}(\hat{\gamma }_{m_sk,0}, \hat{\gamma }_{m_lk,0})&=\frac{\text {cov}(\hat{\gamma }_{m_sk,0},\hat{\gamma }_{m_lk,0})}{\sqrt{\text {var}(\hat{\gamma }_{m_sk,0})}\sqrt{\text {var}(\hat{\gamma }_{m_lk,0})}} =\frac{E(\Lambda ^T_{m_s}\hat{\rho }_{.k,0}\Lambda ^T_{m_l}\hat{\rho }_{.k,0})}{E(\Lambda ^T_{m_s} \hat{\rho }_{.k,0}\Lambda ^T_{m_s}\hat{\rho }_{.k,0})}\\&=\frac{\sum _{i=1}^{n}E(\lambda _{m_s,i}\lambda _{m_l,i}\hat{\rho }^{2}_{ik,0})}{\sum _{i=1}^{n}E(\lambda ^2_{m_s,i}\hat{\rho }^{2}_{ik,0})}=\frac{\sum _{i=1}^{n}E(\lambda _{m_s,i} \lambda _{m_l,i})E(\hat{\rho }^{2}_{ik,0})}{\sum _{i=1}^{n}E(\lambda ^2_{m_s,i})E(\hat{\rho }^{2}_{ik,0})}\\&=\frac{S}{n}. \end{aligned}$$
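The value $\delta _{\gamma } = S/n$ can be checked by a small Monte Carlo experiment. The sketch below is an illustration only: it assumes the entries of $\hat{\rho }_{.k,0}$ are independent standard normal variables (a stand-in with mean zero and unit variance, not something stated in the paper) and draws two independent Bernoulli$(S/n)$ selection vectors for the same $\hat{\rho }_{.k,0}$. The function name `simulated_correlation` is hypothetical.

```python
import math
import random


def simulated_correlation(n, S, reps, seed=0):
    """Monte Carlo estimate of corr(gamma_s, gamma_l) for two independent Bernoulli masks."""
    rng = random.Random(seed)
    p = S / n
    xs, ys = [], []
    for _ in range(reps):
        # Assumed stand-in for rho_hat: i.i.d. N(0, 1) entries, redrawn every replicate.
        rho = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Each mask keeps entry i with probability S / n, independently of the other mask.
        gamma_s = sum(r for r in rho if rng.random() < p) / math.sqrt(S)
        gamma_l = sum(r for r in rho if rng.random() < p) / math.sqrt(S)
        xs.append(gamma_s)
        ys.append(gamma_l)
    mx, my = sum(xs) / reps, sum(ys) / reps
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / reps
    vx = sum((x - mx) ** 2 for x in xs) / reps
    vy = sum((y - my) ** 2 for y in ys) / reps
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    print(simulated_correlation(n=50, S=10, reps=20000), 10 / 50)
```

With $n = 50$ and $S = 10$ the estimate should land near $S/n = 0.2$, up to Monte Carlo error.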

It can be seen that the correlations between $\check{\gamma }_{s}$ and $\check{\gamma }_{l}$ are all equal for any $s \ne l$. According to Fan and Jiang (2019), we can rewrite $\max _{1 \le l \le MK} U_l$ as follows:

$$\begin{aligned} \max _{1 \le l \le MK} U_l = \sqrt{\delta _{\gamma }}\tilde{U}_0 + \sqrt{1-\delta _{\gamma }} \max _{1 \le l \le MK} \tilde{U}_l, \end{aligned}$$

where $\tilde{U}_0, \tilde{U}_1, \ldots , \tilde{U}_{MK}$ are independent and identically distributed standard normal variables. Now we can write (A2) as

$$\begin{aligned}&\sup _{x \in {\mathbb {R}}} \left| P\left( \max _{1 \le l \le MK} \check{\gamma }_l -\bar{\gamma } \le x\right) \right. \nonumber \\&\quad \left. -P\left( \sqrt{\delta _{\gamma }}\tilde{U}_0 + \sqrt{1-\delta _{\gamma }} \max _{1 \le l \le MK} \tilde{U}_l\le x\right) \right| \le CS^{-c} \rightarrow 0 \end{aligned}$$

(A3)
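The equi-correlated representation behind (A3) can be verified directly. Writing $U_l = \sqrt{\delta _{\gamma }}\tilde{U}_0 + \sqrt{1-\delta _{\gamma }}\tilde{U}_l$ with independent standard normal $\tilde{U}_0, \tilde{U}_1, \ldots , \tilde{U}_{MK}$ (a worked check added here for the reader, using only the quantities defined above), we have for $l \ne l'$

$$\begin{aligned} \text {var}(U_l)&= \delta _{\gamma }\,\text {var}(\tilde{U}_0) + (1-\delta _{\gamma })\,\text {var}(\tilde{U}_l) = \delta _{\gamma } + (1-\delta _{\gamma }) = 1,\\ \text {cov}(U_l, U_{l'})&= \delta _{\gamma }\,\text {var}(\tilde{U}_0) + (1-\delta _{\gamma })\,\text {cov}(\tilde{U}_l, \tilde{U}_{l'}) = \delta _{\gamma }. \end{aligned}$$

Because the term $\sqrt{\delta _{\gamma }}\tilde{U}_0$ is shared by every coordinate, it can be pulled out of the maximum, which gives the stated form of $\max _{1 \le l \le MK} U_l$.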

Note that, under the null hypothesis we have ${\mathbb {E}}(\bar{\gamma }) = 0$. Next we derive $\text {var}(\bar{\gamma })$.

$$\begin{aligned} \text {var}(\bar{\gamma })&= E(\bar{\gamma }^2) = E \left[ \left( \frac{1}{MK} \sum _{m=1}^{M} \sum _{k=1}^{K} \tilde{\gamma }_{mk}\right) ^2 \right] \\&= \frac{1}{M^2K^2} \left[ E \sum _{m_1=1}^{M} \sum _{m_2=1}^{M} \sum _{k_1=1}^{K} \sum _{k_2=1}^{K} \tilde{\gamma }_{m_1k_1} \tilde{\gamma }_{m_2k_2} \right] \\&= E[\tilde{\gamma }_{m_1k}\tilde{\gamma }_{m_1k}I(k_1=k_2=k)] +E[\tilde{\gamma }_{mk}I(m_1=m_2=m,k_1=k_2=k)]\\&= \delta _{\gamma } \frac{M(M-1)}{M^2} + \frac{1}{MK}\\&=\frac{S}{n}\frac{M(M-1)}{M^2} + \frac{1}{MK}. \end{aligned}$$

Under Assumption 3, we have $\text {var}(\bar{\gamma })\rightarrow 0$, as $\min \{n, M, S\} \rightarrow \infty$.

This implies $\bar{\gamma } = o_p(1)$. Additionally under Assumption 3, we have $\delta _{\gamma }=o(1)$.

As a result, we can rewrite (A3) as

$$\begin{aligned} \sup _{x \in {\mathbb {R}}} \left| P(\max _{1 \le l \le MK} \check{\gamma }_l\le x)- P( \max _{1 \le l \le MK} \tilde{U}_l\le x)\right| \rightarrow 0, \end{aligned}$$

(A4)

as $\min \{n, M, S\} \rightarrow \infty$.

For any $x \in {\mathbb {R}}$, let $u = \sqrt{2 \log (MK) - \log \log (MK) + x}$, which is well defined for all sufficiently large $MK$. By the standard Gaussian tail approximation, we then have

$$\begin{aligned} P \left( |N(0, 1) |\ge u \right)&\sim \frac{2}{\sqrt{2 \pi }\,u}\exp \left( -\frac{u^2}{2}\right) \\&\sim \frac{1}{\sqrt{\pi }}\frac{1}{\sqrt{\log MK}} \exp \left( -\frac{1}{2}\{2\log MK-\log \log MK+x\} \right) \\&\sim \frac{1}{\sqrt{\pi }}\frac{\exp \left( -\frac{x}{2} \right) }{MK}. \end{aligned}$$
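The last approximation can be checked numerically. The sketch below takes $u^2 = 2\log (MK) - \log \log (MK) + x$, as in the exponent of the display above, evaluates the exact two-sided normal tail through the complementary error function, and compares $MK \cdot P(|N(0,1)| \ge u)$ with the limit $e^{-x/2}/\sqrt{\pi }$. The function names are hypothetical, and the agreement is only approximate because the error decays at a logarithmic rate in $MK$.

```python
import math


def scaled_tail(MK, x):
    """MK * P(|N(0,1)| >= u) with u**2 = 2*log(MK) - log(log(MK)) + x (needs u**2 > 0)."""
    u = math.sqrt(2.0 * math.log(MK) - math.log(math.log(MK)) + x)
    two_sided_tail = math.erfc(u / math.sqrt(2.0))  # exact P(|N(0,1)| >= u)
    return MK * two_sided_tail


def tail_limit(x):
    """Claimed limit of the scaled tail: exp(-x/2) / sqrt(pi)."""
    return math.exp(-x / 2.0) / math.sqrt(math.pi)


if __name__ == "__main__":
    for MK in (1e3, 1e6, 1e12):
        print(MK, scaled_tail(MK, 0.0), tail_limit(0.0))
```

By Lemma 1, this scaled tail plays the role of $\tau$ for the maximum of $MK$ independent squared standard normal variables.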

Subsequently, by Lemma 1, as $\min \{n, M, S\} \rightarrow \infty$, we have

$$\begin{aligned} P \left( \max _{1 \le l \le MK} \tilde{U}_l^2 - 2 \log (MK) + \log \log (MK)\le x \right) \rightarrow \exp \left( -\frac{1}{\sqrt{\pi }}\exp \left( -\frac{x}{2} \right) \right) , \end{aligned}$$

for any $x \in {\mathbb {R}}$. Combining with (A4), we obtain

$$\begin{aligned} P \left( \max _{1 \le l \le MK} \check{\gamma }_l^2 - 2 \log (MK) + \log \log (MK)\le x \right) \rightarrow \exp \left( -\frac{1}{\sqrt{\pi }}\exp \left( -\frac{x}{2} \right) \right) . \end{aligned}$$

(A5)

Accordingly,

$$\begin{aligned} P \left( F^2_{n,0} - 2 \log (MK) + \log \log (MK)\le x \right) \rightarrow \exp \left( -\frac{1}{\sqrt{\pi }}\exp \left( -\frac{x}{2} \right) \right) . \end{aligned}$$

Since

$$\begin{aligned} F_n = F_{n,0}+o_p(1), \end{aligned}$$

we have

$$\begin{aligned} P \left( F^2_{n} - 2 \log (MK) + \log \log (MK)\le x \right) \rightarrow \exp \left( -\frac{1}{\sqrt{\pi }}\exp \left( -\frac{x}{2} \right) \right) . \end{aligned}$$

This completes the proof of Theorem 1.

$\square$
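The limiting distribution in Theorem 1 is easy to evaluate in practice. The following sketch assumes the test rejects for large values of $F^2_{n} - 2\log (MK) + \log \log (MK)$ and uses the limit $\exp (-\pi ^{-1/2}e^{-x/2})$ to obtain an asymptotic critical value and p-value. The function names are hypothetical and the decision rule is an assumption, since the testing procedure itself is not restated in this appendix.

```python
import math


def limit_cdf(x):
    """Limiting distribution function exp(-exp(-x/2) / sqrt(pi)) from Theorem 1."""
    return math.exp(-math.exp(-x / 2.0) / math.sqrt(math.pi))


def critical_value(alpha):
    """Solve limit_cdf(q) = 1 - alpha for q, with 0 < alpha < 1."""
    return -2.0 * math.log(-math.sqrt(math.pi) * math.log(1.0 - alpha))


def asymptotic_p_value(F_n, M, K):
    """One-sided asymptotic p-value for an observed F_n (assumed rejection for large values)."""
    MK = M * K
    centered = F_n ** 2 - 2.0 * math.log(MK) + math.log(math.log(MK))
    return 1.0 - limit_cdf(centered)


if __name__ == "__main__":
    print(critical_value(0.05))
    print(asymptotic_p_value(F_n=5.0, M=100, K=3))
```

Inverting the limit gives the closed form $q_{\alpha } = -2\log \{-\sqrt{\pi }\log (1-\alpha )\}$, so no simulation is needed to calibrate the test at a given level.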

Proof of Theorem 2

In this section, we present the proof of Theorem 2.

Note that

$$\begin{aligned} \begin{aligned} \hat{\gamma }_{mk}&= \frac{\hat{\rho }_{m_1k}+\hat{\rho }_{m_2k}+\cdots +\hat{\rho }_{m_Sk}}{\sqrt{S}}\\&= \frac{1}{\sqrt{S}}\sum _{s=1}^{S} \frac{1}{\sqrt{ \#( \ \hat{g}^{-1}(k)\backslash \left\{ i_{m_s}\right\} )}} \\&\quad \times \sum _{j\in \hat{g}^{-1}(k)\backslash \left\{ i_{m_s}\right\} }\frac{A_{1,ij}-A_{2,ij}}{\sqrt{ \hat{B}_{1, \hat{g}_{i} \hat{g}_{j}}(1- \hat{B}_{1, \hat{g}_{i} \hat{g}_{j}})+ \hat{B}_{2, \hat{g}_{i} \hat{g}_{j}}(1- \hat{B}_{2, \hat{g}_{i} \hat{g}_{j}})}} \\&= \frac{1}{\sqrt{S}}\sum _{s=1}^{S} \frac{1}{\sqrt{ \#( \ g^{-1}(k)\backslash \left\{ i_{m_s}\right\} )}} \\&\quad \times \sum _{j\in g^{-1}(k)\backslash \left\{ i_{m_s}\right\} }\left( \frac{A_{1,ij}-B_{1,g_{i}g_{j}}-(A_{2,ij}-B_{2,g_{i}g_{j}})}{\sqrt{B_{1,g_{i}g_{j}} (1-B_{1,g_{i}g_{j}})+B_{2,g_{i}g_{j}}(1-B_{2,g_{i}g_{j}})}}\right. \\&\quad +\left. \frac{B_{1,g_{i}g_{j}}-B_{2,g_{i}g_{j}}}{\sqrt{B_{1,g_{i}g_{j}}(1-B_{1,g_{i}g_{j}}) +B_{2,g_{i}g_{j}}(1-B_{2,g_{i}g_{j}})}}\right) \\&\quad \quad \quad \times \frac{\sqrt{B_{1,g_{i}g_{j}}(1-B_{1,g_{i}g_{j}})+B_{2,g_{i}g_{j}} (1-B_{2,g_{i}g_{j}})}}{\sqrt{\hat{B}_{1,g_{i}g_{j}}(1-\hat{B}_{1,g_{i}g_{j}}) +\hat{B}_{2,g_{i}g_{j}}(1-\hat{B}_{2,g_{i}g_{j}})}}. \end{aligned} \end{aligned}$$

From the discussion in the proof of Theorem 1, we can obtain that

$$\begin{aligned} \begin{aligned} F_{n}&=\max \limits _{1 \le m \le M,1 \le k \le K} \vert \hat{\gamma }_{mk} \vert \\&=\max \limits _{1 \le m \le M,1 \le k \le K} \Bigg \vert \frac{1}{\sqrt{S}} \sum _{s=1}^{S} \frac{1}{\sqrt{ \#( \ g^{-1}(k)\backslash \left\{ i_{m_s}\right\} )}}\\&\quad \times \sum _{j\in g^{-1}(k)\backslash \left\{ i_{m_s}\right\} } \left( \frac{A_{1,ij}-B_{1,g_{i}g_{j}}-(A_{2,ij}-B_{2,g_{i}g_{j}})}{\sqrt{B_{1,g_{i}g_{j}}(1-B_{1,g_{i}g_{j}})+B_{2,g_{i}g_{j}}(1-B_{2,g_{i}g_{j}})}}\right. \\&\quad \quad +\left. \frac{B_{1,g_{i}g_{j}}-B_{2,g_{i}g_{j}}}{\sqrt{B_{1,g_{i}g_{j}} (1-B_{1,g_{i}g_{j}})+B_{2,g_{i}g_{j}}(1-B_{2,g_{i}g_{j}})}}\right) \Bigg \vert (1+ o_p(1))\\&\ge (l_1-l_2)(1+o_p(1)), \end{aligned} \end{aligned}$$

where

$$\begin{aligned} l_1&= \max \limits _{1 \le m \le M,1 \le k \le K} \Bigg \vert \frac{1}{\sqrt{S}}\sum _{s=1}^{S} \frac{1}{\sqrt{ \#( \ g^{-1}(k)\backslash \left\{ i_{m_s}\right\} )}} \\&\quad \times \sum _{j\in g^{-1}(k)\backslash \left\{ i_{m_s}\right\} }\frac{B_{1,g_{i}g_{j}}-B_{2,g_{i}g_{j}}}{\sqrt{B_{1,g_{i}g_{j}} (1-B_{1,g_{i}g_{j}})+B_{2,g_{i}g_{j}}(1-B_{2,g_{i}g_{j}})}} \Bigg \vert ,\\ l_2&= \max \limits _{1 \le m \le M,1 \le k \le K} \Bigg \vert \frac{1}{\sqrt{S}} \sum _{s=1}^{S} \frac{1}{\sqrt{ \#( \ g^{-1}(k)\backslash \left\{ i_{m_s}\right\} )}}\\&\quad \times \sum _{j\in g^{-1}(k)\backslash \left\{ i_{m_s}\right\} }\frac{A_{1,ij}-B_{1,g_{i}g_{j}} -(A_{2,ij}-B_{2,g_{i}g_{j}})}{\sqrt{B_{1,g_{i}g_{j}}(1-B_{1,g_{i}g_{j}})+B_{2,g_{i}g_{j}}(1-B_{2,g_{i}g_{j}})}} \Bigg \vert . \end{aligned}$$

By (A5), we have

$$\begin{aligned} l_2=O_p(\sqrt{\log (MK)}). \end{aligned}$$

Moreover, from Assumptions 1, 2, 3 and $\max \limits _{1 \le i \le n,1 \le j \le n}\vert B_{1,g_{i}g_{j}}-B_{2,g_{i}g_{j}} \vert = \Omega (\frac{\log n}{n}\sqrt{\frac{K}{S}})$, we have that

$$\begin{aligned} \frac{l_1}{\sqrt{\log MK}}&\ge \sqrt{\frac{Sn}{K}}\sqrt{\frac{n}{\log n}}\,\Omega \left( \frac{\log n}{n}\sqrt{\frac{K}{S}}\right) \Big/ \sqrt{\log MK}\\&= \frac{\sqrt{\log n}}{\sqrt{\log MK}}\rightarrow \infty , \quad \text {as } n\rightarrow \infty . \end{aligned}$$
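The simplification in the last display comes from cancelling the factors of $S$, $K$ and $n$. Spelled out as a worked step (added here for the reader, using only the rates appearing above):

$$\begin{aligned} \sqrt{\frac{Sn}{K}}\sqrt{\frac{n}{\log n}}\cdot \frac{\log n}{n}\sqrt{\frac{K}{S}}&= \sqrt{\frac{Sn}{K}\cdot \frac{K}{S}}\cdot \sqrt{\frac{n}{\log n}}\cdot \frac{\log n}{n} \\&= \sqrt{n}\cdot \frac{\sqrt{\log n}}{\sqrt{n}} = \sqrt{\log n}. \end{aligned}$$

Hence the $\Omega (\cdot )$ term contributes at least a constant multiple of $\sqrt{\log n}$ to the lower bound on $l_1$.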

Thus, we finally obtain that

$$\begin{aligned} P(T\ge c\log (MK))\rightarrow 1, \end{aligned}$$

for any positive constant c.

This completes the proof of Theorem 2. $\square$

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wu, Q., Hu, J. Two-sample test of stochastic block models via the maximum sampling entry-wise deviation. J. Korean Stat. Soc. (2024). https://doi.org/10.1007/s42952-024-00260-9

Received: 23 September 2023
Accepted: 24 January 2024
Published: 03 March 2024
DOI: https://doi.org/10.1007/s42952-024-00260-9
