Introduction

In network inference applications, it is important to detect community structure, i.e., to cluster vertices into potential blocks. In many cases, however, it can be prohibitively expensive to observe the entire graph, especially when the graph is large. Consider, for example, a network in which vertices represent landline phones and edges indicate whether a call was made between two phones. Depending on the size of the network, in terms of the number of vertices, it can be extremely expensive to check whether a call was made for every pair of phones. Suppose instead that one can use the information carried by a partially observed graph, in which only a small number of phone pairs have been verified, to identify the phones that play a more important role in forming communities. Then, given limited resources, one can check only those phone pairs and still detect the potential block structure. It thus becomes essential to identify the vertices that have the most impact on block structure and to check only whether there are edges between them, saving significant resources while still recovering the block structure.

Many classical methods consider only the adjacency or Laplacian matrix for community detection (Fortunato and Hric 2016). By contrast, covariate-aware methods also take vertex covariates into consideration for the inference, relying on either variational methods (Choi et al. 2012; Roy et al. 2019; Sweet 2015) or spectral approaches (Binkiewicz et al. 2017; Huang and Feng 2018; Mele et al. 2022; Mu et al. 2022). None of these methods, however, address the problem of clustering vertices in partially observed graphs. To address this issue, existing work proposes various random and adaptive sampling strategies that minimize the information lost in the data reduction (Yun and Proutiere 2014; Purohit et al. 2017).

We propose a dynamic network sampling scheme to optimize block recovery for the stochastic blockmodel (SBM) when we have only limited resources to check whether there are edges between certain selected vertices. The innovation of our approach is the application of Chernoff information: to our knowledge, this is the first time it has been applied to network sampling problems. Motivated by the Chernoff analysis, we not only propose a dynamic network sampling scheme to optimize block recovery, but also provide the framework and justification for using Chernoff information in subsequent inference for graphs.

The structure of this article is summarized as follows. Section 2 reviews relevant models for random graphs and the basic idea of spectral methods. Section 3 introduces the notion of Chernoff analysis for analytically measuring the performance of block recovery. Section 4 presents our dynamic network sampling scheme and theoretical results. Section 5 provides simulations and real data experiments that measure the algorithms’ performance in terms of actual block recovery results. Section 6 discusses the findings and presents some open questions for further investigation. The Appendix provides technical details for our theoretical results.

Models and spectral methods

In this work, we are interested in the inference task of block recovery (community detection). To model the block structure in edge-independent random graphs, we focus on the SBM and the generalized random dot product graph (GRDPG).

Definition 1

(Generalized Random Dot Product Graph Rubin-Delanchy et al. 2022) Let \({\textbf{I}}_{d_+ d_-} = {\textbf{I}}_{d_+} \bigoplus \left( -{\textbf{I}}_{d_-} \right)\) with \(d_+ \ge 1\) and \(d_- \ge 0\). Let F be a d-dimensional inner product distribution with \(d = d_+ + d_-\) on \({\mathcal {X}} \subset {\mathbb {R}}^d\) satisfying \({\textbf{x}}^\top {\textbf{I}}_{d_+ d_-} {\textbf{y}} \in [0, 1]\) for all \({\textbf{x}}, {\textbf{y}} \in {\mathcal {X}}\). Let \({\textbf{A}}\) be an adjacency matrix and \({\textbf{X}} = [{\textbf{X}}_1, \cdots , {\textbf{X}}_n]^\top \in {\mathbb {R}}^{n \times d}\) where \({\textbf{X}}_i \sim F\), i.i.d. for all \(i \in \{ 1, \cdots , n \}\). Then we say \(({\textbf{A}}, {\textbf{X}}) \sim \text {GRDPG}(n, F, d_+, d_-)\) if for any \(i, j \in \{ 1, \cdots , n \}\)

$$\begin{aligned} {\textbf{A}}_{ij} \sim \text {Bernoulli}({\textbf{P}}_{ij}) \qquad \text {where} \qquad {\textbf{P}}_{ij} = {\textbf{X}}_{i}^\top {\textbf{I}}_{d_+ d_-} {\textbf{X}}_j. \end{aligned}$$
(1)

Definition 2

(K-block Stochastic Blockmodel Graph Holland et al. 1983) The K-block stochastic blockmodel (SBM) graph is an edge-independent random graph with each vertex belonging to one of K blocks. It can be parametrized by a block connectivity probability matrix \({\textbf{B}} \in (0, 1)^{K \times K}\) and a vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\) summing to unity. Let \({\textbf{A}}\) be an adjacency matrix and \(\varvec{\tau }\) be a vector of block assignments with \(\tau _i = k\) if vertex i is in block k (occurring with probability \(\pi _k\)). We say \(({\textbf{A}}, \varvec{\tau }) \sim \text {SBM}(n, {\textbf{B}}, \varvec{\pi })\) if for any \(i, j \in \{ 1, \cdots , n \}\)

$$\begin{aligned} {\textbf{A}}_{ij} \sim \text {Bernoulli}({\textbf{P}}_{ij}) \qquad \text {where} \qquad {\textbf{P}}_{ij} = {\textbf{B}}_{\tau _i \tau _j}. \end{aligned}$$
(2)

Remark 1

The SBM is a special case of the GRDPG model. Let \(({\textbf{A}}, \varvec{\tau }) \sim \text {SBM}(n, {\textbf{B}}, \varvec{\pi })\) as in Definition 2 where \({\textbf{B}} \in (0, 1)^{K \times K}\) with \(d_+\) strictly positive eigenvalues and \(d_-\) strictly negative eigenvalues. To represent this SBM in the GRDPG model, we can choose \(\varvec{\nu }_1, \cdots , \varvec{\nu }_K \in {\mathbb {R}}^d\) where \(d = d_+ + d_-\) such that \(\varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu }_\ell = {\textbf{B}}_{k \ell }\) for all \(k, \ell \in \{ 1, \cdots , K \}\). For example, we can take \(\varvec{\nu } = {\textbf{U}}_B |{\textbf{S}}_B|^{1/2}\) where \({\textbf{B}} = {\textbf{U}}_B {\textbf{S}}_B {\textbf{U}}_B^\top\) is the spectral decomposition of \({\textbf{B}}\) after re-ordering. Then we have the latent position of vertex i as \({\textbf{X}}_i = \varvec{\nu }_k\) if \(\tau _i = k\).
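
To make Remark 1 concrete, the following sketch (illustrative Python, not code from the original article) recovers latent positions from a block connectivity probability matrix via the re-ordered spectral decomposition; the final assertion checks that \(\varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu }_\ell = {\textbf{B}}_{k \ell }\).

```python
import numpy as np

def latent_positions_from_B(B, tol=1e-10):
    """Latent positions nu (K x d) with nu @ Ipq @ nu.T == B, where
    Ipq = diag(I_{d+}, -I_{d-}) is built from the signs of B's eigenvalues."""
    evals, evecs = np.linalg.eigh(B)
    keep = np.abs(evals) > tol                  # drop numerically zero eigenvalues
    evals, evecs = evals[keep], evecs[:, keep]
    order = np.argsort(-evals)                  # re-order: positive eigenvalues first
    evals, evecs = evals[order], evecs[:, order]
    nu = evecs * np.sqrt(np.abs(evals))         # nu = U_B |S_B|^{1/2} as in Remark 1
    Ipq = np.diag(np.sign(evals))               # the signature matrix I_{d+ d-}
    return nu, Ipq

# Small 2-block example; here both eigenvalues are positive, so d+ = 2, d- = 0.
B = np.array([[0.5, 0.2],
              [0.2, 0.3]])
nu, Ipq = latent_positions_from_B(B)
assert np.allclose(nu @ Ipq @ nu.T, B)          # nu_k^T I_{d+d-} nu_l = B_{kl}
```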

The parameters of these models can be estimated via spectral methods (Von Luxburg 2007), which have been widely used in random graph models for community detection (Lyzinski et al. 2014, 2016; McSherry 2001; Rohe et al. 2011). Two particular spectral embedding methods, adjacency spectral embedding (ASE) and Laplacian spectral embedding (LSE), are popular since they enjoy nice properties including consistency (Sussman et al. 2012) and asymptotic normality (Athreya et al. 2016; Tang and Priebe 2018).

Definition 3

(Adjacency Spectral Embedding) Let \({\textbf{A}} \in \{0, 1 \}^{n \times n}\) be an adjacency matrix with eigendecomposition \({\textbf{A}} = \sum _{i=1}^{n} \lambda _i {\textbf{u}}_i {\textbf{u}}_i^\top\) where \(|\lambda _1| \ge \cdots \ge |\lambda _n|\) are the magnitude-ordered eigenvalues and \({\textbf{u}}_1, \cdots , {\textbf{u}}_n\) are the corresponding orthonormal eigenvectors. Given the embedding dimension \(d < n\), the adjacency spectral embedding (ASE) of \({\textbf{A}}\) into \({\mathbb {R}}^d\) is the \(n \times d\) matrix \(\mathbf {{\widehat{X}}} = {\textbf{U}}_A |{\textbf{S}}_A|^{1/2}\) where \({\textbf{S}}_A = \text {diag}(\lambda _1, \ldots , \lambda _d)\) and \({\textbf{U}}_A = [{\textbf{u}}_1 | \cdots | {\textbf{u}}_d]\).
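
Definition 3 translates directly into a few lines of code; the following is a minimal illustrative sketch. For large sparse graphs one would instead use a truncated eigendecomposition such as scipy.sparse.linalg.eigsh.

```python
import numpy as np

def ase(A, d):
    """Adjacency spectral embedding (Definition 3): embed the symmetric
    adjacency matrix A into R^d via the d largest-magnitude eigenpairs."""
    evals, evecs = np.linalg.eigh(A)
    top = np.argsort(-np.abs(evals))[:d]        # |lambda_1| >= ... >= |lambda_d|
    S_A, U_A = evals[top], evecs[:, top]
    return U_A * np.sqrt(np.abs(S_A))           # Xhat = U_A |S_A|^{1/2}
```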

Remark 2

There are various methods for choosing the embedding dimension (Hastie et al. 2009; Jolliffe and Cadima 2016); we adopt the simple and efficient profile likelihood method (Zhu and Ghodsi 2006) to automatically identify the “elbow” of the scree plot, i.e., the cut-off between the signal dimensions and the noise dimensions.
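
A sketch of this criterion, under the common formulation in which the sorted scree values are modeled as two Gaussian groups sharing a variance and the split maximizing the profile log-likelihood is taken as the elbow (an illustrative implementation, not the article's code):

```python
import numpy as np
from scipy.stats import norm

def zhu_ghodsi_elbow(eigvals):
    """Profile-likelihood elbow (Zhu and Ghodsi 2006): model the sorted scree
    values as two Gaussian groups with a common variance and return the split
    maximizing the profile log-likelihood."""
    x = np.sort(np.abs(eigvals))[::-1]
    n = len(x)
    best_q, best_ll = 1, -np.inf
    for q in range(1, n):
        s1, s2 = x[:q], x[q:]
        # pooled maximum-likelihood variance under the common-variance model
        var = (((s1 - s1.mean()) ** 2).sum() + ((s2 - s2.mean()) ** 2).sum()) / n
        if var <= 0:
            continue                            # degenerate split, skip
        ll = (norm.logpdf(s1, s1.mean(), np.sqrt(var)).sum()
              + norm.logpdf(s2, s2.mean(), np.sqrt(var)).sum())
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q                               # chosen embedding dimension d
```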

Chernoff analysis

To analytically measure the performance of algorithms for block recovery, we consider the notion of Chernoff information among other possible metrics. Chernoff information has the advantages of being independent of the clustering procedure, i.e., it can be derived regardless of which clustering method is used, and of being intrinsically related to the Bayes risk (Tang and Priebe 2018; Athreya et al. 2017; Karrer and Newman 2011).

Definition 4

(Chernoff Information Chernoff 1952, 1956) Let \(F_1\) and \(F_2\) be two continuous multivariate distributions on \({\mathbb {R}}^d\) with density functions \(f_1\) and \(f_2\). The Chernoff information is defined as

$$\begin{aligned} \begin{aligned} C(F_1, F_2)&= - \log \left[ \inf _{t \in (0,1)} \int _{{\mathbb {R}}^d} f_1^t({\textbf{x}}) f_2^{1-t}({\textbf{x}}) d{\textbf{x}} \right] \\&= \sup _{t \in (0, 1)} \left[ - \log \int _{{\mathbb {R}}^d} f_1^t({\textbf{x}}) f_2^{1-t}({\textbf{x}}) d{\textbf{x}} \right] . \end{aligned} \end{aligned}$$
(3)

Remark 3

Consider the special case where we take \(F_1 = {\mathcal {N}}(\varvec{\mu }_1, \varvec{\Sigma }_1)\) and \(F_2 = {\mathcal {N}}(\varvec{\mu }_2, \varvec{\Sigma }_2)\); then the corresponding Chernoff information is

$$\begin{aligned} C(F_1, F_2) = \sup _{t \in (0, 1)} \left[ \frac{1}{2} t (1-t) (\varvec{\mu }_1 - \varvec{\mu }_2)^\top \varvec{\Sigma }_t^{-1} (\varvec{\mu }_1 - \varvec{\mu }_2) + \frac{1}{2} \log \frac{|\varvec{\Sigma }_t |}{|\varvec{\Sigma }_1 |^t |\varvec{\Sigma }_2 |^{1-t}} \right] , \end{aligned}$$
(4)

where \(\varvec{\Sigma }_t = t \varvec{\Sigma }_1 + (1-t) \varvec{\Sigma }_2\).
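
Since Eq. (4) is a one-dimensional optimization over t, it is straightforward to evaluate numerically; a minimal sketch (illustrative, using scipy's bounded scalar minimizer):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_gaussian(mu1, Sigma1, mu2, Sigma2):
    """Chernoff information between N(mu1, Sigma1) and N(mu2, Sigma2), Eq. (4):
    the supremum over t in (0, 1) of the quadratic plus log-determinant terms."""
    mu1, mu2 = np.atleast_1d(mu1), np.atleast_1d(mu2)
    Sigma1, Sigma2 = np.atleast_2d(Sigma1), np.atleast_2d(Sigma2)
    diff = mu1 - mu2

    def neg_obj(t):
        St = t * Sigma1 + (1 - t) * Sigma2      # Sigma_t of Remark 3
        quad = 0.5 * t * (1 - t) * diff @ np.linalg.solve(St, diff)
        _, logdet_t = np.linalg.slogdet(St)
        _, logdet_1 = np.linalg.slogdet(Sigma1)
        _, logdet_2 = np.linalg.slogdet(Sigma2)
        return -(quad + 0.5 * (logdet_t - t * logdet_1 - (1 - t) * logdet_2))

    res = minimize_scalar(neg_obj, bounds=(1e-6, 1 - 1e-6), method="bounded")
    return -res.fun
```

When \(\varvec{\Sigma }_1 = \varvec{\Sigma }_2 = \varvec{\Sigma }\), the supremum is attained at \(t = 1/2\) and Eq. (4) reduces to \(\frac{1}{8} (\varvec{\mu }_1 - \varvec{\mu }_2)^\top \varvec{\Sigma }^{-1} (\varvec{\mu }_1 - \varvec{\mu }_2)\).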

The comparison of block recovery via Chernoff information is based on the statistical information between the limiting distributions of the blocks: smaller statistical information implies less ability to discriminate between the different blocks of the SBM. To that end, we also review the limiting results of ASE for the SBM, which are essential for investigating Chernoff information.

Theorem 1

(CLT of ASE for SBM Rubin-Delanchy et al. 2022) Let \(({\textbf{A}}^{(n)}, {\textbf{X}}^{(n)}) \sim \text {GRDPG}(n, F, d_+, d_-)\) be a sequence of adjacency matrices and associated latent positions of a d-dimensional GRDPG as in Definition 1 from an inner product distribution F where F is a mixture of K point masses in \({\mathbb {R}}^d\), i.e.,

$$\begin{aligned} F = \sum _{k=1}^{K} \pi _k \delta _{\varvec{\nu }_k} \qquad \text {with} \qquad \forall k, \; \pi _k > 0 \quad \text {and} \quad \sum _{k=1}^{K} \pi _k = 1, \end{aligned}$$
(5)

where \(\delta _{\varvec{\nu }_k}\) is the Dirac delta measure at \(\varvec{\nu }_k\). Let \(\Phi ({\textbf{z}}, \varvec{\Sigma })\) denote the cumulative distribution function (CDF) of a multivariate Gaussian distribution with mean \({\varvec{0}}\) and covariance matrix \(\varvec{\Sigma }\), evaluated at \({\textbf{z}} \in {\mathbb {R}}^d\). Let \(\mathbf {{\widehat{X}}}^{(n)}\) be the ASE of \({\textbf{A}}^{(n)}\) with \(\mathbf {{\widehat{X}}}^{(n)}_i\) as the i-th row (same for \({\textbf{X}}^{(n)}_i\)). Then there exists a sequence of matrices \({\textbf{M}}_n \in {\mathbb {R}}^{d \times d}\) satisfying \({\textbf{M}}_n {\textbf{I}}_{d_+ d_-} {\textbf{M}}_n^\top = {\textbf{I}}_{d_+ d_-}\) such that for all \({\textbf{z}} \in {\mathbb {R}}^d\) and fixed index i,

$$\begin{aligned} {\mathbb {P}} \left\{ \sqrt{n} \left( {\textbf{M}}_n \mathbf {{\widehat{X}}}^{(n)}_i - {\textbf{X}}^{(n)}_i \right) \le {\textbf{z}} \; \big | \; {\textbf{X}}^{(n)}_i = \varvec{\nu }_k \right\} \rightarrow \Phi ({\textbf{z}}, \varvec{\Sigma }_k), \end{aligned}$$
(6)

where for \(\varvec{\nu } \sim F\)

$$\begin{aligned} \varvec{\Sigma }_k = \varvec{\Sigma }(\varvec{\nu }_k) = {\textbf{I}}_{d_+ d_-} \varvec{\Delta }^{-1} {\mathbb {E}} \left[ \left( \varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu } \right) \left( 1-\varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu } \right) \varvec{\nu } \varvec{\nu }^\top \right] \varvec{\Delta }^{-1} {\textbf{I}}_{d_+ d_-}, \end{aligned}$$
(7)

with

$$\begin{aligned} \varvec{\Delta } = {\mathbb {E}} \left[ \varvec{\nu } \varvec{\nu }^\top \right] . \end{aligned}$$
(8)

For a K-block SBM, let \({\textbf{B}} \in (0, 1)^{K \times K}\) be the block connectivity probability matrix and \(\varvec{\pi } \in (0, 1)^K\) be the vector of block assignment probabilities. Given an n-vertex instantiation of the SBM parameterized by \({\textbf{B}}\) and \(\varvec{\pi }\), for sufficiently large n, the large-sample optimal error rate for estimating the block assignments using ASE can be measured via Chernoff information as (Tang and Priebe 2018; Athreya et al. 2017)

$$\begin{aligned} \rho = \min _{k \ne \ell } \sup _{t \in (0, 1)} \left[ \frac{1}{2} n t (1-t) (\varvec{\nu }_k - \varvec{\nu }_\ell )^\top \varvec{\Sigma }_{k\ell }^{-1}(t) (\varvec{\nu }_k - \varvec{\nu }_\ell ) + \frac{1}{2} \log \frac{|\varvec{\Sigma }_{k \ell }(t) |}{|\varvec{\Sigma }_k |^t |\varvec{\Sigma }_\ell |^{1-t}} \right] , \end{aligned}$$
(9)

where \(\varvec{\Sigma }_{k\ell }(t) = t \varvec{\Sigma }_k + (1-t) \varvec{\Sigma }_\ell\), \(\varvec{\Sigma }_k = \varvec{\Sigma }(\varvec{\nu }_k)\) and \(\varvec{\Sigma }_\ell = \varvec{\Sigma }(\varvec{\nu }_\ell )\) are defined as in Eq. (7). Also note that as \(n \rightarrow \infty\), the logarithm term in Eq. (9) will be dominated by the other term. Then we have the approximate Chernoff information as

$$\begin{aligned} \rho \approx \min _{k \ne \ell } C_{k ,\ell }({\textbf{B}}, \varvec{\pi }), \end{aligned}$$
(10)

where

$$\begin{aligned} C_{k ,\ell }({\textbf{B}}, \varvec{\pi }) =\sup _{t \in (0, 1)} \left[ t (1-t) (\varvec{\nu }_k - \varvec{\nu }_\ell )^\top \varvec{\Sigma }_{k\ell }^{-1}(t) (\varvec{\nu }_k - \varvec{\nu }_\ell ) \right] . \end{aligned}$$
(11)
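
For a point-mass mixture, the expectations in Eqs. (7)–(8) are finite sums over the blocks, so \(\rho\) is directly computable. The following sketch (with illustrative function names, reusing latent_positions_from_B from the Remark 1 sketch) evaluates \(\varvec{\Sigma }(\varvec{\nu }_k)\) and the approximate Chernoff information of Eqs. (10)–(11):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sigma_k(nu, Ipq, pi, k):
    """Limiting covariance Sigma(nu_k) of Eq. (7) for an SBM viewed as a GRDPG
    with point-mass latent positions nu (K x d) and block probabilities pi."""
    Delta = (nu * pi[:, None]).T @ nu           # Delta = E[nu nu^T], Eq. (8)
    p = nu[k] @ Ipq @ nu.T                      # Bernoulli means nu_k^T Ipq nu_l
    w = pi * p * (1 - p)                        # block-wise variance weights
    E = (nu * w[:, None]).T @ nu                # E[p(1-p) nu nu^T]
    Dinv = np.linalg.inv(Delta)
    return Ipq @ Dinv @ E @ Dinv @ Ipq

def chernoff_rho(nu, Ipq, pi):
    """Approximate Chernoff information rho = min_{k != l} C_{k,l}, Eqs. (10)-(11);
    also returns the minimizing block pair."""
    K = len(pi)
    Sig = [sigma_k(nu, Ipq, pi, k) for k in range(K)]
    rho, active = np.inf, None
    for k in range(K):
        for l in range(k + 1, K):
            diff = nu[k] - nu[l]
            def neg_obj(t):
                St = t * Sig[k] + (1 - t) * Sig[l]   # Sigma_{kl}(t)
                return -t * (1 - t) * diff @ np.linalg.solve(St, diff)
            res = minimize_scalar(neg_obj, bounds=(1e-6, 1 - 1e-6), method="bounded")
            C_kl = -res.fun
            if C_kl < rho:
                rho, active = C_kl, (k, l)
    return rho, active
```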

We also introduce the following two notions, which will be used when we describe our dynamic network sampling scheme.

Definition 5

(Chernoff-active Blocks) For a K-block SBM parametrized by the block connectivity probability matrix \({\textbf{B}} \in (0, 1)^{K \times K}\) and the vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\), the Chernoff-active blocks \((k^*, \ell ^*)\) are defined as

$$\begin{aligned} (k^*, \ell ^*) = \arg \min _{k \ne \ell } C_{k ,\ell }({\textbf{B}}, \varvec{\pi }), \end{aligned}$$
(12)

where \(C_{k ,\ell }({\textbf{B}}, \varvec{\pi })\) is defined as in Eq. (11).
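
In code, the Chernoff-active pair is exactly the arg min already tracked by the chernoff_rho sketch above; for a given model \(({\textbf{B}}, \varvec{\pi })\):

```python
# Chernoff-active blocks (Definition 5) of a given K-block model (B, pi),
# reusing latent_positions_from_B and chernoff_rho from the sketches above.
rho, (k_star, l_star) = chernoff_rho(*latent_positions_from_B(B), pi)
```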

Definition 6

(Chernoff Superiority) For K-block SBMs, given two block connectivity probability matrices \({\textbf{B}}, {\textbf{B}}^\prime \in (0, 1)^{K \times K}\) and a vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\), let \(\rho _B\) and \(\rho _{B^\prime }\) denote the Chernoff information obtained as in Eq. (10) corresponding to \({\textbf{B}}\) and \({\textbf{B}}^\prime\), respectively. We say that \({\textbf{B}}\) is Chernoff superior to \({\textbf{B}}^\prime\), denoted \({\textbf{B}} \succ {\textbf{B}}^\prime\), if \(\rho _B > \rho _{B^\prime }\).

Remark 4

If \({\textbf{B}}\) is Chernoff superior to \({\textbf{B}}^\prime\), then we can have a better block recovery from \({\textbf{B}}\) than from \({\textbf{B}}^\prime\). In addition, Chernoff superiority is transitive, which is straightforward from the definition.

Dynamic network sampling

We start our analysis with the unobserved block connectivity probability matrix \({\textbf{B}}\) for the SBM, and then illustrate how to adapt the proposed methods to real applications where we instead observe the adjacency matrix \({\textbf{A}}\).

Consider the K-block SBM parametrized by the block connectivity probability matrix \({\textbf{B}} \in (0, 1)^{K \times K}\) and the vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\) with \(K > 2\). Given an initial sampling parameter \(p_0 \in (0, 1)\), the initial sample is drawn uniformly at random, i.e.,

$$\begin{aligned} {\textbf{B}}_0 = p_0 {\textbf{B}}. \end{aligned}$$
(13)

This initial sampling simulates the case where one observes only a partial graph containing a small portion of the edges, instead of the entire graph with all existing edges.

Theorem 2

For K-block SBMs, given two block connectivity probability matrices \({\textbf{B}}, p{\textbf{B}} \in (0, 1)^{K \times K}\) with \(p \in (0, 1)\) and a vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\), we have \({\textbf{B}} \succ p {\textbf{B}}\).

The proof of Theorem 2 can be found in the Appendix. As an illustration, consider a 4-block SBM parametrized by the block connectivity probability matrix \({\textbf{B}}\) as

$$\begin{aligned} {\textbf{B}} = \begin{bmatrix} 0.04 &{} 0.08 &{} 0.10 &{} 0.18 \\ 0.08 &{} 0.16 &{} 0.20 &{} 0.36 \\ 0.10 &{} 0.20 &{} 0.25 &{} 0.45 \\ 0.18 &{} 0.36 &{} 0.45 &{} 0.81 \end{bmatrix}. \end{aligned}$$
(14)

Figure 1 shows Chernoff information \(\rho\) as in Eq. (10) corresponding to \({\textbf{B}}\) as in Eq. (14) and \(p {\textbf{B}}\) for \(p \in (0, 1)\). In addition, Fig. 1a assumes \(\varvec{\pi } = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})\) and Fig. 1b assumes \(\varvec{\pi } = (\frac{1}{8}, \frac{1}{8}, \frac{3}{8}, \frac{3}{8})\). As suggested by Theorem 2, for any \(p \in (0, 1)\) we have \(\rho _{B} > \rho _{pB}\) and thus \({\textbf{B}} \succ p {\textbf{B}}\).
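
Under the sketches above, Theorem 2 can also be checked numerically for this \({\textbf{B}}\) (an illustrative verification, reusing latent_positions_from_B and chernoff_rho from the earlier sketches):

```python
import numpy as np

# Numerical check of Theorem 2 for the B of Eq. (14).
B = np.array([[0.04, 0.08, 0.10, 0.18],
              [0.08, 0.16, 0.20, 0.36],
              [0.10, 0.20, 0.25, 0.45],
              [0.18, 0.36, 0.45, 0.81]])
pi = np.full(4, 0.25)
rho_B, _ = chernoff_rho(*latent_positions_from_B(B), pi)
for p in (0.2, 0.5, 0.8):
    rho_pB, _ = chernoff_rho(*latent_positions_from_B(p * B), pi)
    assert rho_pB < rho_B                       # B is Chernoff superior to pB
```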

Fig. 1

Chernoff information \(\rho\) as in Eq. (10) corresponding to \({\textbf{B}}\) as in Eq. (14) and \(p {\textbf{B}}\) for \(p \in (0, 1)\)

Now given dynamic network sampling parameter \(p_1 \in (0, 1-p_0)\), the baseline sampling scheme can proceed uniformly at random again, i.e.,

$$\begin{aligned} {\textbf{B}}_1 = {\textbf{B}}_0 + p_1 {\textbf{B}} = (p_0 + p_1) {\textbf{B}}. \end{aligned}$$
(15)

This dynamic network sampling simulates the situation where, after observing the partial graph with only a small portion of the edges, one is given limited resources to sample some extra edges. Since the budget covers only another small portion of the edges, one would benefit from identifying the vertex pairs that have the most influence on the community structure. In other words, the baseline sampling scheme just chooses vertex pairs at random, without using the information from the initially observed graph; our goal is to design an alternative scheme that optimizes this dynamic network sampling procedure, so that one can achieve better block recovery even when resources permit observing only a partial graph with a small portion of the edges.

Corollary 1

For K-block SBMs, given a block connectivity probability matrix \({\textbf{B}} \in (0, 1)^{K \times K}\) and a vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\), we have \({\textbf{B}} \succ {\textbf{B}}_1 \succ {\textbf{B}}_0\), where \({\textbf{B}}_0\) is defined as in Eq. (13) with \(p_0 \in (0, 1)\) and \({\textbf{B}}_1\) is defined as in Eq. (15) with \(p_1 \in (0, 1-p_0)\).

The proof of Corollary 1 can be found in the Appendix. This corollary implies that we can have a better block recovery from \({\textbf{B}}_1\) than from \({\textbf{B}}_0\).

Assumption 1

The Chernoff-active blocks after initial sampling are unique, i.e., there exists a unique pair \(\left( k_0^*, \ell _0^* \right) \in \{(k, \ell ) \; | \; 1 \le k < \ell \le K \}\) such that

$$\begin{aligned} \left( k_0^*, \ell _0^* \right) = \arg \min _{k \ne \ell } C_{k ,\ell }({\textbf{B}}_0, \varvec{\pi }), \end{aligned}$$
(16)

where \({\textbf{B}}_0\) is defined as in Eq. (13) and \(\varvec{\pi }\) is the vector of block assignment probabilities.

To improve on this baseline sampling scheme, we concentrate on the Chernoff-active blocks \(\left( k_0^*, \ell _0^* \right)\) after initial sampling, assuming Assumption 1 holds. Instead of sampling from the entire block connectivity probability matrix \({\textbf{B}}\) as the baseline sampling scheme in Eq. (15) does, we sample only the entries associated with the Chernoff-active blocks. As a competitor to \({\textbf{B}}_1\), our Chernoff-optimal dynamic network sampling scheme is then given by

$$\begin{aligned} \widetilde{{\textbf{B}}}_1 = {\textbf{B}}_0 + \frac{p_1}{\left( \pi _{k_0^*} + \pi _{\ell _0^*}\right) ^2 } {\textbf{B}} \circ {\textbf{1}}_{k_0^*, \ell _0^*}, \end{aligned}$$
(17)

where \(\circ\) denotes the Hadamard product, \(\pi _{k_0^*}\) and \(\pi _{\ell _0^*}\) denote the block assignment probabilities for blocks \(k_0^*\) and \(\ell _0^*\) respectively, and \({\textbf{1}}_{k_0^*, \ell _0^*}\) is the \(K \times K\) binary matrix with 0’s everywhere except for 1’s associated with the Chernoff-active blocks \(\left( k_0^*, \ell _0^* \right)\), i.e., for any \(i, j \in \{1, \cdots , K \}\)

$$\begin{aligned} {\textbf{1}}_{k_0^*, \ell _0^*}[i, j] = {\left\{ \begin{array}{ll} 1 &{} \text {if} \;\; (i, j) \in \left\{ \left( k_0^*, k_0^* \right) , \; \left( k_0^*, \ell _0^* \right) , \; \left( \ell _0^*, k_0^* \right) , \; \left( \ell _0^*, \ell _0^* \right) \right\} \\ 0 &{} \text {otherwise} \end{array}\right. } . \end{aligned}$$
(18)

Note that the multiplier \(\frac{1}{\left( \pi _{k_0^*} + \pi _{\ell _0^*}\right) ^2}\) on \(p_1 {\textbf{B}} \circ {\textbf{1}}_{k_0^*, \ell _0^*}\) assures that we sample the same number of potential edges with \(\widetilde{{\textbf{B}}}_1\) as we do with \({\textbf{B}}_1\) in the baseline sampling scheme. In addition, to avoid over-sampling with respect to \({\textbf{B}}\), i.e., to ensure \(\widetilde{{\textbf{B}}}_1[i, j] \le {\textbf{B}}[i, j]\) for any \(i, j \in \{1, \cdots , K \}\), we require

$$\begin{aligned} p_1 \le p_1^{\text {max}} = \left( 1 - p_0 \right) \left( \pi _{k_0^*} + \pi _{\ell _0^*}\right) ^2. \end{aligned}$$
(19)
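
Putting Eqs. (17)–(19) together, the construction admits a short sketch (illustrative names; active is the Chernoff-active pair \((k_0^*, \ell _0^*)\) and pi is \(\varvec{\pi }\)):

```python
import numpy as np

def chernoff_optimal_Btilde1(B, pi, p0, p1, active):
    """Construct B~_1 of Eq. (17), spending the p1 budget only on entries
    touching the Chernoff-active pair, with the guard of Eq. (19)."""
    k, l = active
    p1_max = (1 - p0) * (pi[k] + pi[l]) ** 2    # Eq. (19): no over-sampling
    assert 0 < p1 <= p1_max
    K = B.shape[0]
    one = np.zeros((K, K))                      # binary matrix of Eq. (18)
    one[np.ix_([k, l], [k, l])] = 1.0
    return p0 * B + (p1 / (pi[k] + pi[l]) ** 2) * B * one   # Hadamard product
```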

Assumption 2

For K-block SBMs, given a block connectivity probability matrix \({\textbf{B}} \in (0, 1)^{K \times K}\) and a vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\), let \(p_1^* \in (0, p_1^{\text {max}}]\) be the smallest positive \(p_1 \le p_1^{\text {max}}\) such that

$$\begin{aligned} \arg \min _{k \ne \ell } C_{k ,\ell }(\widetilde{{\textbf{B}}}_1, \varvec{\pi }) \end{aligned}$$
(20)

is not unique where \(p_1^{\text {max}}\) is defined as in Eq. (19) and \(\widetilde{{\textbf{B}}}_1\) is defined as in Eq. (17). If the arg min is always unique, let \(p_1^* = p_1^{\text {max}}\).

For any \(p_1 \in (0, p_1^*)\), we can have a better block recovery from \(\widetilde{{\textbf{B}}}_1\) than from \({\textbf{B}}_1\), i.e., our Chernoff-optimal dynamic network sampling scheme is better than the baseline sampling scheme in terms of block recovery.

As an illustration, consider the 4-block SBM with initial sampling parameter \(p_0 = 0.01\) and block connectivity probability matrix \({\textbf{B}}\) as in Eq. (14). Figure 2 shows the Chernoff information \(\rho\) as in Eq. (10) corresponding to \({\textbf{B}}\) as in Eq. (14), \({\textbf{B}}_0\) as in Eq. (13), \({\textbf{B}}_1\) as in Eq. (15), and \(\widetilde{{\textbf{B}}}_1\) as in Eq. (17) with dynamic network sampling parameter \(p_1 \in (0, p_1^*)\) where \(p_1^*\) is defined as in Assumption 2. In addition, Fig. 2a assumes \(\varvec{\pi } = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})\) and Fig. 2b assumes \(\varvec{\pi } = (\frac{1}{8}, \frac{1}{8}, \frac{3}{8}, \frac{3}{8})\). Note that for any \(p_1 \in (0, p_1^*)\) we have \(\rho _{B}> \rho _{{\widetilde{B}}_1}> \rho _{B_1} > \rho _{B_0}\) and thus \({\textbf{B}} \succ \widetilde{{\textbf{B}}}_1 \succ {\textbf{B}}_1 \succ {\textbf{B}}_0\). That is, in terms of Chernoff information, given the same amount of resources, the proposed Chernoff-optimal dynamic network sampling scheme yields better block recovery results; in other words, to reach the same level of performance in terms of Chernoff information, it needs fewer resources.

Fig. 2

Chernoff information \(\rho\) as in Eq. (10) corresponding to \({\textbf{B}}\) as in Eq. (14), \({\textbf{B}}_0\) as in Eq. (13), \({\textbf{B}}_1\) as in Eq. (15), and \(\widetilde{{\textbf{B}}}_1\) as in Eq. (17) with initial sampling parameter \(p_0 = 0.01\) and dynamic network sampling parameter \(p_1 \in (0, p_1^*)\) where \(p_1^*\) is defined as in Assumption 2

As described earlier, it may be the case that \(p_1^* < p_1^{\text {max}}\), at which point the Chernoff-active blocks change to \((k_1^*, \ell _1^*)\). This potential non-uniqueness of the Chernoff arg min is a consequence of our dynamic network sampling scheme. In the case of \(p_1 > p_1^*\), our Chernoff-optimal dynamic network sampling scheme is adapted as

$$\begin{aligned} \widetilde{{\textbf{B}}}_1^* = {\textbf{B}}_0 + \left( p_1 - p_1^* \right) {\textbf{B}} + \frac{p_1^*}{\left( \pi _{k_0^*} + \pi _{\ell _0^*}\right) ^2 } {\textbf{B}} \circ {\textbf{1}}_{k_0^*, \ell _0^*}. \end{aligned}$$
(21)

Similarly, the multiplier \(\frac{1}{\left( \pi _{k_0^*} + \pi _{\ell _0^*}\right) ^2}\) on \(p_1^* {\textbf{B}} \circ {\textbf{1}}_{k_0^*, \ell _0^*}\) assures that we sample the same number of potential edges with \(\widetilde{{\textbf{B}}}_1^*\) as we do with \({\textbf{B}}_1\) in the baseline sampling scheme. In addition, to avoid over-sampling with respect to \({\textbf{B}}\), i.e., \(\widetilde{{\textbf{B}}}_1^*[i, j] \le {\textbf{B}}[i, j]\) for any \(i, j \in \{1, \cdots , K \}\), we require

$$\begin{aligned} p_1 \le p_{11}^{\text {max}} = 1 - p_0 - \frac{p_1^*}{\left( \pi _{k_0^*} + \pi _{\ell _0^*}\right) ^2 } + p_1^*. \end{aligned}$$
(22)

For any \(p_1 \in [p_1^*, p_{11}^{\text {max}}]\), we can have a better block recovery from \(\widetilde{{\textbf{B}}}_1^*\) than from \({\textbf{B}}_1\), i.e., our Chernoff-optimal dynamic network sampling scheme is again better than the baseline sampling scheme in terms of block recovery.
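
Analogously to the earlier sketch for \(\widetilde{{\textbf{B}}}_1\), the adapted matrix of Eqs. (21)–(22) can be constructed as follows (again with illustrative names):

```python
import numpy as np

def chernoff_optimal_Btilde1_star(B, pi, p0, p1, p1_star, active):
    """Construct B~_1^* of Eq. (21) for p1 >= p1*: spend p1* on the
    Chernoff-active pair and the remainder p1 - p1* uniformly, with the
    over-sampling guard of Eq. (22)."""
    k, l = active
    p11_max = 1 - p0 - p1_star / (pi[k] + pi[l]) ** 2 + p1_star   # Eq. (22)
    assert p1_star <= p1 <= p11_max
    K = B.shape[0]
    one = np.zeros((K, K))                      # binary matrix of Eq. (18)
    one[np.ix_([k, l], [k, l])] = 1.0
    return (p0 * B + (p1 - p1_star) * B
            + (p1_star / (pi[k] + pi[l]) ** 2) * B * one)
```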

As an illustration, consider a 4-block SBM with initial sampling parameter \(p_0 = 0.01\) and block connectivity probability matrix \({\textbf{B}}\) as in Eq. (14). Figure 3 shows the Chernoff information \(\rho\) as in Eq. (10) corresponding to \({\textbf{B}}\) as in Eq. (14), \({\textbf{B}}_0\) as in Eq. (13), \({\textbf{B}}_1\) as in Eq. (15), and \(\widetilde{{\textbf{B}}}_1^*\) as in Eq. (21) with dynamic network sampling parameter \(p_1 \in [p_1^*, p_{11}^{\text {max}}]\) where \(p_1^*\) is defined as in Assumption 2 and \(p_{11}^{\text {max}}\) is defined as in Eq. (22). In addition, Fig. 3a assumes \(\varvec{\pi } = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})\) and Fig. 3b assumes \(\varvec{\pi } = (\frac{1}{8}, \frac{1}{8}, \frac{3}{8}, \frac{3}{8})\). Note that for any \(p_1 \in [p_1^*, p_{11}^{\text {max}}]\) we have \(\rho _{B}> \rho _{{\widetilde{B}}_1^*}> \rho _{B_1} > \rho _{B_0}\) and thus \({\textbf{B}} \succ \widetilde{{\textbf{B}}}_1^* \succ {\textbf{B}}_1 \succ {\textbf{B}}_0\). That is, the adapted Chernoff-optimal dynamic network sampling scheme still yields better block recovery results, in terms of Chernoff information, given the same amount of resources.

Fig. 3

Chernoff information \(\rho\) as in Eq. (10) corresponding to \({\textbf{B}}\) as in Eq. (14), \({\textbf{B}}_0\) as in Eq. (13), \({\textbf{B}}_1\) as in Eq. (15), and \(\widetilde{{\textbf{B}}}_1^*\) as in Eq. (21) with initial sampling parameter \(p_0 = 0.01\) and dynamic network sampling parameter \(p_1 \in [p_1^*, p_{11}^{\text {max}}]\) where \(p_1^*\) is defined as in Assumption 2 and \(p_{11}^{\text {max}}\) is defined as in Eq. (22)

Now we illustrate how the proposed Chernoff-optimal dynamic network sampling scheme can be applied in practice. We summarize the uniform dynamic sampling scheme (baseline) as Algorithm 1 and our Chernoff-optimal dynamic network sampling scheme as Algorithm 2. Recall that given a potential edge set E and an initial sampling parameter \(p_0 \in (0, 1)\), we have the initial edge set \(E_0 \subset E\) with \(|E_0 |= p_0 |E |\). The goal is to dynamically sample new edges from the potential edge set so that we can have a better block recovery given limited resources.

Algorithm 1: Uniform dynamic network sampling (baseline)
Algorithm 2: Chernoff-optimal dynamic network sampling
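
Since the algorithm listings appear as figures, we also include a minimal sketch of how Algorithm 2 might be realized on an observed graph. It reuses ase, latent_positions_from_B, and chernoff_rho from the earlier sketches; the edge oracle query_edge and all other names are illustrative assumptions, not the article's code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def chernoff_dynamic_sampling(A_obs, observed, query_edge, d, K, budget, rng):
    """Sketch of Algorithm 2: cluster the partially observed graph, estimate
    (B, pi), locate the Chernoff-active pair, then spend the query budget on
    unobserved vertex pairs touching those two estimated blocks."""
    n = A_obs.shape[0]
    # Steps 3-4 (as in Algorithm 1): ASE of the observed graph, then GMM.
    tau = GaussianMixture(K, random_state=0).fit_predict(ase(A_obs, d))
    pi_hat = np.bincount(tau, minlength=K) / n
    # Plug-in estimate of B from the pairs already checked in each block pair
    # (0.5 is an arbitrary fill for block pairs with no observed entries).
    B_hat = np.full((K, K), 0.5)
    for k in range(K):
        for l in range(K):
            blk = A_obs[np.ix_(tau == k, tau == l)]
            obs = observed[np.ix_(tau == k, tau == l)]
            if obs.any():
                B_hat[k, l] = blk[obs].mean()
    # Chernoff-active pair of the estimated model (Definition 5).
    _, (ks, ls) = chernoff_rho(*latent_positions_from_B(B_hat), pi_hat)
    # Spend the budget checking unobserved pairs touching the active blocks.
    in_active = np.isin(tau, [ks, ls])
    cand = [(i, j) for i in range(n) for j in range(i + 1, n)
            if in_active[i] and in_active[j] and not observed[i, j]]
    for idx in rng.permutation(len(cand))[:budget]:
        i, j = cand[idx]
        A_obs[i, j] = A_obs[j, i] = query_edge(i, j)
        observed[i, j] = observed[j, i] = True
    # Final block recovery on the augmented graph: ASE followed by GMM again.
    return GaussianMixture(K, random_state=0).fit_predict(ase(A_obs, d))
```

The baseline of Algorithm 1 corresponds to the same pipeline with the candidate set replaced by all unobserved pairs.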

Experiments

Simulations

In addition to the Chernoff analysis, we also evaluate our Chernoff-optimal dynamic network sampling scheme via simulations. In particular, consider the 4-block SBM parameterized by the block connectivity probability matrix \({\textbf{B}}\) as in Eq. (14) and dynamic network sampling parameter \(p_1 \in (0, p_{11}^{\text {max}}]\) where \(p_{11}^{\text {max}}\) is defined as in Eq. (22). We fix the initial sampling parameter \(p_0 = 0.01\). For each \(p_1 \in (0, p_1^*)\) where \(p_1^*\) is defined as in Assumption 2, we simulate 50 adjacency matrices with \(n = 12000\) vertices from \({\textbf{B}}_1\) as in Eq. (15) and \(\widetilde{{\textbf{B}}}_1\) as in Eq. (17) respectively. For each \(p_1 \in [p_1^*, p_{11}^{\text {max}}]\), we simulate 50 adjacency matrices with \(n = 12000\) vertices from \({\textbf{B}}_1\) as in Eq. (15) and \(\widetilde{{\textbf{B}}}_1^*\) as in Eq. (21) respectively. In addition, Fig. 4a assumes \(\varvec{\pi } = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})\), i.e., 3000 vertices in each block, and Fig. 4b assumes \(\varvec{\pi } = (\frac{1}{8}, \frac{1}{8}, \frac{3}{8}, \frac{3}{8})\), i.e., 1500 vertices in two of the blocks and 4500 vertices in the other two blocks. We then apply ASE \(\circ\) GMM (Steps 3 and 4 in Algorithm 1) to recover block assignments and adopt the adjusted Rand index (ARI) to measure the performance. Figure 4 shows the ARI (mean\(\pm\)stderr) associated with \({\textbf{B}}_1\) for \(p_1 \in (0, p_{11}^{\text {max}}]\), \(\widetilde{{\textbf{B}}}_1\) for \(p_1 \in (0, p_1^*)\), and \(\widetilde{{\textbf{B}}}_1^*\) for \(p_1 \in [p_1^*, p_{11}^{\text {max}}]\), where the dashed lines denote \(p_1^*\). Note that we can have a better block recovery from \(\widetilde{{\textbf{B}}}_1\) and \(\widetilde{{\textbf{B}}}_1^*\) than from \({\textbf{B}}_1\), which agrees with our results from the Chernoff analysis.

Fig. 4

Simulations for 4-block SBM parameterized by block connectivity probability matrix \({\textbf{B}}\) as in Eq. (14) with initial sampling parameter \(p_0 = 0.01\) and dynamic network sampling parameter \(p_1 \in (0, p_{11}^{\text {max}}]\) where \(p_{11}^{\text {max}}\) is defined as in Eq. (22). The dashed lines denote \(p_1^*\) which is defined as in Assumption 2

Now we compare the performance of Algorithms 1 and 2 via actual block recovery results. In particular, we start with the 4-block SBM parameterized by the block connectivity probability matrix \({\textbf{B}}\) as in Eq. (14). We consider dynamic network sampling parameter \(p_1 \in (0, 1-p_0)\) where \(p_0\) is the initial sampling parameter. For each \(p_1\), we simulate 50 adjacency matrices with \(n = 4000\) vertices and retrieve the associated potential edge sets. We fix the initial sampling parameter \(p_0 = 0.15\) and randomly sample initial edge sets. We then apply both algorithms to estimate the block assignments and adopt ARI to measure the performance. Figure 5 shows the ARI (mean\(\pm\)stderr) of the two algorithms for \(p_1 \in (0, 0.85)\), where Fig. 5a assumes \(\varvec{\pi } = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})\), i.e., 1000 vertices in each block, and Fig. 5b assumes \(\varvec{\pi } = (\frac{1}{8}, \frac{1}{8}, \frac{3}{8}, \frac{3}{8})\), i.e., 500 vertices in two of the blocks and 1500 vertices in the other two blocks. Note that both algorithms tend to perform better as \(p_1\) increases, i.e., as we sample more edges, and Algorithm 2 consistently recovers more accurate block structure than Algorithm 1. That is, given the same amount of resources, the proposed Chernoff-optimal dynamic network sampling scheme yields better block recovery results; in other words, to reach the same level of performance in terms of the empirical clustering results, it needs fewer resources.

Fig. 5

Simulations for 4-block SBM parameterized by block connectivity probability matrix \({\textbf{B}}\) as in Eq. (14) with initial sampling parameter \(p_0 = 0.15\) and dynamic network sampling parameter \(p_1 \in (0, 0.85)\)

Real data

We also evaluate the performance of Algorithms 1 and 2 in real applications. We first conduct experiments on a diffusion MRI connectome dataset (Priebe et al. 2019) containing 114 graphs (connectomes) estimated by the NDMG pipeline (Kiar et al. 2018). Each vertex in these graphs (the number of vertices n varies from 23728 to 42022) has a {Left, Right} hemisphere label and a {Gray, White} tissue label. We consider the 4 potential blocks {LG, LW, RG, RW}, where L and R denote the Left and Right hemisphere labels, and G and W denote the Gray and White tissue labels. Here we take initial sampling parameter \(p_0 = 0.25\) and dynamic network sampling parameter \(p_1 = 0.25\). Let \(\Delta = \text {ARI(Algo2)} - \text {ARI(Algo1)}\) where ARI(Algo1) and ARI(Algo2) denote the ARI when we apply Algorithms 1 and 2 respectively. The hypothesis test in Eq. (23) yields a p-value of 0.0184. Figure 6 shows the algorithms’ comparative performance via boxplot and histogram.

$$\begin{aligned} H_0: \; \text {median}(\Delta ) \le 0 \qquad \text {versus} \qquad H_A: \; \text {median}(\Delta ) > 0. \end{aligned}$$
(23)
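
The article does not name the specific test behind this p-value; one plausible choice is a one-sided Wilcoxon signed-rank test of Eq. (23), sketched below with placeholder data standing in for the 114 per-graph ARI differences.

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder for the 114 per-graph differences ARI(Algo2) - ARI(Algo1);
# the real values come from the connectome experiments above.
rng = np.random.default_rng(0)
deltas = rng.normal(loc=0.02, scale=0.05, size=114)

stat, pvalue = wilcoxon(deltas, alternative="greater")  # H_A: median > 0
```
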
Fig. 6

Algorithms’ comparative performance on diffusion MRI connectome data via ARI with initial sampling parameter \(p_0 = 0.25\) and dynamic network sampling parameter \(p_1 = 0.25\)

Furthermore, we test our algorithms on a Microsoft Bing entity dataset (Agterberg et al. 2020). There are 2 graphs in this dataset, each with 13535 vertices. We treat the block assignments estimated from the complete graph as ground truth. We consider initial sampling parameters \(p_0 \in \left\{ 0.2, \; 0.3 \right\}\) and dynamic network sampling parameters \(p_1 \in \left\{ 0, \; 0.05, \; 0.1, \; 0.15, \; 0.2 \right\}\). For each \(p_1\), we sample 100 times and compare the overall performance of Algorithms 1 and 2. Figure 7 shows the results, where the ARI is reported as mean (±stderr).

Fig. 7

Algorithms’ comparative performance on Microsoft Bing entity data via ARI with different initial sampling parameters \(p_0\) and dynamic network sampling parameters \(p_1\)

We also conduct real data experiments with 2 social network datasets.

  • LastFM Asia social network dataset (Leskovec and Krevl 2014; Rozemberczki and Sarkar 2020): vertices (the number of vertices \(n = 7624\)) represent LastFM users from Asian countries and edges (the number of edges \(e = 27806\)) represent mutual follower relationships. We treat the 18 user locations, derived from the country field of each user, as the potential blocks.

  • Facebook large page-page network dataset (Leskovec and Krevl 2014; Rozemberczki et al. 2019): vertices (the number of vertices \(n = 22470\)) represent official Facebook pages and edges (the number of edges \(e = 171002\)) represent mutual likes. We treat the 4 page types {Politician, Governmental Organization, Television Show, Company}, defined by Facebook, as the potential blocks.

We consider initial sampling parameters \(p_0 \in \left\{ 0.15, \; 0.35 \right\}\) and dynamic network sampling parameters \(p_1 \in \left\{ 0.05, \; 0.1, \; 0.15, \; 0.2, \; 0.25 \right\}\). For each \(p_1\), we sample 100 times and compare the overall performance of Algorithms 1 and 2. Figure 8 shows the results, where the ARI is reported as mean (±stderr). Again, the results suggest that given the same amount of resources, the proposed Chernoff-optimal dynamic network sampling scheme yields better block recovery results; in other words, to reach the same level of performance in terms of the empirical clustering results, it needs fewer resources.

Fig. 8

Algorithms’ comparative performance on social network data via ARI with different initial sampling parameter \(p_0\) and dynamic network sampling parameter \(p_1\)

Discussion

We propose a dynamic network sampling scheme to optimize block recovery for the SBM when we have only a limited budget to observe a partial graph. Theoretically, we provide justification for the proposed Chernoff-optimal dynamic sampling scheme via Chernoff information. Practically, we evaluate the performance, in terms of block recovery (community detection), of our method on several real datasets, including a diffusion MRI connectome dataset, a Microsoft Bing entity graph transitions dataset, and social network datasets. Both the theoretical and practical results suggest that our method can identify the vertices that have the most impact on block structure and check only whether there are edges between them, saving significant resources while still recovering the block structure.

Since the Chernoff-optimal dynamic sampling scheme depends on the initial clustering results to identify the Chernoff-active blocks and construct the dynamic edge set, its performance can suffer when the initial clustering is far from ideal. One future direction is to design strategies that reduce this dependency and make the proposed scheme more robust.