Endogenous and Exogenous Models
Let \(\textbf{Y}\) denote the \(n \times n\) symmetric adjacency matrix associated with an undirected binary network without self–loops, so that \(y_{vu} = y_{uv} = 1\) if nodes \(v = 2, \ldots, n\) and \(u = 1, \ldots, v-1\) are connected, and \(y_{vu} = y_{uv} = 0\) otherwise. The absence of self–loops implies that the diagonal entries of \(\textbf{Y}\) are not considered for inference. Recalling our discussion in Section 1, we consider a stochastic representation partitioning the nodes into exhaustive and non–overlapping groups, where nodes in the same group are characterized by equal patterns of edge formation. More specifically, let \(\textbf{z}=(z_{1}, \ldots, z_{n})^{\intercal} \in \mathcal{Z}\) be the vector of cluster membership indicators for the \(n\) nodes, with \(\mathcal{Z}\) being the space of all possible group assignments, so that \(z_{v} = h\) if and only if the \(v\)th node belongs to the \(h\)th cluster. Letting \(H\) be the number of non–empty groups in \(\textbf{z}\), we denote by \(\boldsymbol{\Theta}\) the \(H \times H\) symmetric matrix of block probabilities, with generic element \(\theta_{hk} \in (0,1)\) indexing the distribution of the edges between the nodes in cluster \(h\) and those in cluster \(k\). To characterize block–connectivity structures within the network, we assume
$$ \begin{array}{@{}rcl@{}} (y_{vu} \mid z_{v}=h,z_{u}=k, \theta_{hk}) \sim \text{Bern}(\theta_{hk}), \end{array} $$
independently for each \(v = 2, \ldots, n\) and \(u = 1, \ldots, v-1\), with \(\theta_{hk} \sim \text{Beta}(a,b)\), independently for every \(h = 1, \ldots, H\) and \(k = 1, \ldots, h\). This formulation recalls the classical Bayesian sbm specification (Nowicki and Snijders, 2001) and leverages a stochastic equivalence property that relies on the conditional independence of the edges, whose distribution depends on the cluster memberships of the associated nodes. Indeed, by marginalizing out the beta–distributed block probabilities, which are typically treated as nuisance parameters in the sbm (e.g. Kemp et al., 2006; Schmidt and Morup, 2013), the likelihood for \(\textbf{Y}\) given \(\textbf{z}\) is
$$ p(\textbf{Y} \mid \textbf{z})=\prod\limits_{h=1}^{H}\prod\limits_{k=1}^{h} \frac{\mathrm{B}(a+m_{hk}, b+ \bar{m}_{hk})}{\mathrm{B}(a,b)}, $$
(2.1)
where \(m_{hk}\) and \(\bar{m}_{hk}\) denote the number of edges and non–edges among nodes in clusters \(h\) and \(k\), respectively, whereas \(\mathrm{B}(\cdot,\cdot)\) is the beta function. Expression (2.1) is derived by exploiting beta–binomial conjugacy and, as we will clarify later in the article, is fundamental for computing Bayes factors and for developing a collapsed Gibbs sampler which updates only the endogenous cluster assignments while treating the block probabilities as nuisance parameters. Moreover, as is clear from Eq. 2.1, \(p(\textbf{Y} \mid \textbf{z})\) is invariant under relabeling of the cluster indicators. Therefore \(p(\textbf{Y} \mid \textbf{z})\) is equal to \(p(\textbf{Y} \mid \tilde{\textbf{z}})\) for any relabeling \(\tilde{\textbf{z}}\) of \(\textbf{z}\), meaning that the Bayes factors computed from these quantities are also invariant under relabeling. Hence, in the rest of the paper, \(\textbf{z}\) will denote any element of the equivalence class of its relabelings, whereas \(\mathcal{Z}\) will correspond to the space of all the partitions of \(\{1,\dots,n\}\).
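For concreteness, the following minimal Python sketch simulates a small network from the sbm above and evaluates the collapsed log–likelihood in Eq. 2.1 via beta–binomial conjugacy. This is an illustrative implementation, distinct from the code accompanying the paper; the function names, hyperparameters \(a = b = 1\) and the toy network below are arbitrary choices.

```python
# Illustrative sketch: simulate a small sbm and evaluate the collapsed
# log-likelihood log p(Y | z) of Eq. 2.1 (block probabilities integrated out).
import numpy as np
from scipy.special import betaln

def simulate_sbm(z, theta, rng):
    """Symmetric binary adjacency matrix without self-loops, with edge
    probabilities given by the block matrix theta."""
    n = len(z)
    Y = np.zeros((n, n), dtype=int)
    for v in range(1, n):
        for u in range(v):
            Y[v, u] = Y[u, v] = rng.binomial(1, theta[z[v], z[u]])
    return Y

def log_p_Y_given_z(Y, z, a=1.0, b=1.0):
    """Collapsed log-likelihood of Eq. 2.1."""
    z = np.asarray(z)
    labels = np.unique(z)
    logp = 0.0
    for i, h in enumerate(labels):
        for k in labels[: i + 1]:
            rows, cols = np.where(z == h)[0], np.where(z == k)[0]
            edges = int(Y[np.ix_(rows, cols)].sum())
            if h == k:
                edges //= 2  # each within-block dyad is counted twice
                pairs = len(rows) * (len(rows) - 1) // 2
            else:
                pairs = len(rows) * len(cols)
            m_hk, mbar_hk = edges, pairs - edges  # edges and non-edges
            logp += betaln(a + m_hk, b + mbar_hk) - betaln(a, b)
    return logp

rng = np.random.default_rng(0)
z = np.repeat([0, 1], 10)  # two clusters of 10 nodes each
theta = np.array([[0.8, 0.1], [0.1, 0.7]])
Y = simulate_sbm(z, theta, rng)
print(log_p_Y_given_z(Y, z))
```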
Recalling Section 1, our goal is to develop a formal Bayesian test to assess whether assuming \(\textbf{z}\) known and equal to an exogenous assignment vector \(\textbf{z}^{*}\) produces an effective characterization of all the block structures in \(\textbf{Y}\), relative to what would be obtained by letting \(\textbf{z}\) be unknown, random and endogenously determined by the stochastic equivalence relations in \(\textbf{Y}\). The first hypothesized model \({\mathscr{M}}^{*}\) can be naturally represented via a sbm as in Eq. 2.1 with a fixed and known exogenous partition \(\textbf{z}^{*}\), whereas the second model \({\mathscr{M}}\) requires a flexible prior distribution for the indicators \(\textbf{z}\) in Eq. 2.1 which is able to reveal the endogenous grouping structure induced by the block–connectivity patterns in \(\textbf{Y}\), without imposing strong parametric constraints. A natural option would be to consider a Dirichlet–multinomial prior as in classical sbms (Nowicki and Snijders, 2001), but such a specification requires the choice of the total number of groups, which is typically unknown. This issue is usually circumvented by relying on bic metrics that require estimation of multiple sbms (e.g. Saldana et al., 2017). To avoid these computational costs and increase flexibility, we rely on a Bayesian nonparametric solution that induces a full–support prior on the total number \(H\) of non–empty groups in \(\textbf{z}\). This enables learning of \(H\), which is not guaranteed to coincide with the number \(H^{*}\) of non–empty groups in \(\textbf{z}^{*}\). A widely used prior in the context of sbms is the crp (Aldous, 1985), which leads to the so–called infinite relational model (Kemp et al., 2006; Schmidt and Morup, 2013). Under such a prior, each group attracts new nodes in proportion to its size, and the formation of new groups depends only on the size of the network and on a tuning parameter \(\alpha > 0\). More specifically, under model \({\mathscr{M}}\), we assume the following prior over the cluster indicator of the \(v\)th node, given the memberships \(\textbf{z}_{-v}=(z_{1},\ldots,z_{v-1},z_{v+1},\ldots,z_{n})^{\intercal}\) of the others:
$$ \begin{array}{@{}rcl@{}} \text{pr}(z_{v}=h \mid \textbf{z}_{-v}) = \left\{\begin{array}{ll} \frac{n_{h,-v}}{n-1+\alpha}& \quad \text{if} \ \ h=1, \ldots, H_{-v}, \\ \frac{\alpha}{n-1+\alpha} & \quad \text{if} \ \ h=H_{-v}+1. \end{array}\right. \end{array} $$
(2.2)
In Eq. 2.2, \(H_{-v}\) is the number of non–empty groups in \(\textbf{z}_{-v}\), the integer \(n_{h,-v}\) is the total number of nodes in cluster \(h\), excluding node \(v\), whereas \(\alpha > 0\) denotes a concentration parameter controlling the expected number of non–empty clusters. The urn representation in Eq. 2.2 is induced by the joint probability mass function \(p(\textbf{z})=\alpha^{H}[\prod_{h=1}^{H}(n_{h}-1)!][\prod_{v=1}^{n}(v-1+\alpha)]^{-1}\), which depends on \(\textbf{z}\) only through the cluster sizes \(n_{1}, \ldots, n_{H}\), and hence shows that the crp is exchangeable. See also Gershman and Blei (2012) for an overview of crp priors.
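Both the urn scheme and the joint probability mass function translate directly into code. The sketch below is illustrative (the value of \(\alpha\) is an arbitrary choice): it samples a partition sequentially via Eq. 2.2 and evaluates \(\log p(\textbf{z})\) using the identity \(\sum_{v=1}^{n}\log(v-1+\alpha)=\log\Gamma(\alpha+n)-\log\Gamma(\alpha)\).

```python
# Illustrative sketch of the crp prior: sequential sampling via the urn
# scheme in Eq. 2.2, and the joint log-probability log p(z).
import numpy as np
from scipy.special import gammaln

def sample_crp(n, alpha, rng):
    """Assign n nodes to clusters via the predictive rule in Eq. 2.2."""
    z = np.zeros(n, dtype=int)
    counts = [1]  # node 1 starts the first cluster
    for v in range(1, n):
        probs = np.append(counts, alpha) / (v + alpha)  # Eq. 2.2
        h = rng.choice(len(probs), p=probs)
        if h == len(counts):
            counts.append(1)  # node v + 1 opens a new cluster
        else:
            counts[h] += 1
        z[v] = h
    return z

def log_p_crp(z, alpha):
    """log p(z) = H log(alpha) + sum_h log(n_h - 1)! - sum_v log(v - 1 + alpha)."""
    _, n_h = np.unique(z, return_counts=True)
    H, n = len(n_h), len(z)
    return (H * np.log(alpha) + gammaln(n_h).sum()  # gammaln(n_h) = log (n_h - 1)!
            + gammaln(alpha) - gammaln(alpha + n))  # = -sum_v log(v - 1 + alpha)

z = sample_crp(20, alpha=1.0, rng=np.random.default_rng(1))
print(z, log_p_crp(z, alpha=1.0))
```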
Bayesian Testing
To compare the ability of the endogenous (\({\mathscr{M}}\)) and exogenous (\({\mathscr{M}}^{*}\)) formulations to characterize the block structures in \(\textbf{Y}\), we define a formal Bayesian test relying on the Bayes factor. More specifically, assuming that the two competing models are equally likely a priori, i.e. \(p({\mathscr{M}})=p({\mathscr{M}}^{*})\), we compare \({\mathscr{M}}\) against \({\mathscr{M}}^{*}\) via
$$ {\mathcal{B}}_{\mathcal{M},\mathcal{M}^{*}}=\frac{p(\textbf{Y} \mid \mathcal{M})}{p(\textbf{Y} \mid \mathcal{M}^{*})}=\frac{{\sum}_{\textbf{z} \in \mathcal{Z}} p(\textbf{Y} \mid \textbf{z}) p(\textbf{z})}{p(\textbf{Y} \mid \textbf{z}^{*})}, $$
(2.3)
where \({\sum}_{\textbf{z} \in \mathcal{Z}} p(\textbf{Y} \mid \textbf{z}) p(\textbf{z})\) and \(p(\textbf{Y} \mid \textbf{z}^{*})\) are the marginal likelihoods of \(\textbf{Y}\) under \({\mathscr{M}}\) and \({\mathscr{M}}^{*}\), respectively. Recalling, e.g., Kass and Raftery (1995), Eq. 2.3 defines a formal Bayesian procedure to assess evidence against \({\mathscr{M}}^{*}\) relative to \({\mathscr{M}}\), with high values suggesting that the exogenous assignments in \(\textbf{z}^{*}\) are not as effective in characterizing the endogenous block structures in \(\textbf{Y}\) as the posterior for \(\textbf{z}\) under \({\mathscr{M}}\). Under the assumption that \(p({\mathscr{M}})=p({\mathscr{M}}^{*})\), the Bayes factor in Eq. 2.3 coincides with the posterior odds \(p({\mathscr{M}} \mid \textbf{Y})/p({\mathscr{M}}^{*} \mid \textbf{Y})\). When \(p({\mathscr{M}}) \neq p({\mathscr{M}}^{*})\), it suffices to rescale \({\mathscr{B}}_{{\mathscr{M}},{\mathscr{M}}^{*}}\) by \(p({\mathscr{M}})/p({\mathscr{M}}^{*})\) to assess posterior evidence against \({\mathscr{M}}^{*}\) relative to \({\mathscr{M}}\).
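For interpretation, conventional evidence thresholds such as those of Kass and Raftery (1995) can be applied to the (estimated) Bayes factor on the \(2\log\) scale. The helper below is a convenience for reading output, not part of the methodology; the function name and category strings are our own.

```python
# Convenience helper: interpret the Bayes factor of Eq. 2.3 on the
# 2*log scale of Kass and Raftery (1995).
import math

def evidence_against_M_star(log_bf):
    """Categorize 2*log B_{M,M*}; large positive values favor the
    endogenous model M over the exogenous model M*."""
    t = 2.0 * log_bf
    if t < 0:
        return "evidence favors M*"
    if t <= 2:
        return "not worth more than a bare mention"
    if t <= 6:
        return "positive evidence against M*"
    if t <= 10:
        return "strong evidence against M*"
    return "very strong evidence against M*"

print(evidence_against_M_star(math.log(25.0)))  # strong evidence against M*
```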
To evaluate (2.3), note that \(p(\textbf{Y} \mid \textbf{z}^{*})\) can be computed by evaluating (2.1) at \(\textbf{z} = \textbf{z}^{*}\). In contrast, model \({\mathscr{M}}\) requires the calculation of \(p(\textbf{Y} \mid \textbf{z})\) and \(p(\textbf{z})\) for every \(\textbf{z} \in \mathcal{Z}\). Although both quantities can be evaluated in closed form as discussed in Section 2.1, this approach is computationally impractical due to the huge cardinality of the set \(\mathcal{Z}\), thus requiring alternative strategies relying on Monte Carlo estimation of \(p(\textbf{Y} \mid {\mathscr{M}}) = {\sum}_{\textbf{z} \in \mathcal{Z}} p(\textbf{Y} \mid \textbf{z}) p(\textbf{z})\). Here, we consider the harmonic mean approach (Newton and Raftery, 1994; Raftery et al., 2007), thus obtaining
$$ \begin{array}{@{}rcl@{}} \hat{p}(\textbf{Y} \mid \mathcal{M})= \left[\frac{1}{R} \sum\limits_{r=1}^{R} \frac{1}{p(\textbf{Y} \mid \textbf{z}^{(r)})} \right]^{-1}, \end{array} $$
(2.4)
where \(\textbf{z}^{(1)}, \ldots, \textbf{z}^{(R)}\) are samples from the posterior distribution of \(\textbf{z}\) and \(p(\textbf{Y} \mid \textbf{z}^{(r)})\) is the likelihood in Eq. 2.1 evaluated at \(\textbf{z} = \textbf{z}^{(r)}\), for every \(r = 1, \ldots, R\). Although recent refinements have been proposed to address some shortcomings of the harmonic mean estimator (e.g. Lenk, 2009; Pajor, 2017), here we consider the original formula, which is computationally more tractable and has proved stable in our simulations and applications; see Figs. 2 and 4.
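Since the collapsed log–likelihoods in Eq. 2.1 can be strongly negative, Eq. 2.4 is best evaluated on the log scale, via \(\log \hat{p}(\textbf{Y} \mid {\mathscr{M}}) = \log R - \text{logsumexp}(-\ell_{1}, \ldots, -\ell_{R})\) with \(\ell_{r} = \log p(\textbf{Y} \mid \textbf{z}^{(r)})\). A minimal sketch, reusing the hypothetical `log_p_Y_given_z` from the earlier snippet:

```python
# Log-scale evaluation of the harmonic mean estimator in Eq. 2.4.
import numpy as np
from scipy.special import logsumexp

def log_harmonic_mean(log_liks):
    """log of Eq. 2.4, given the R collapsed log-likelihoods l_r."""
    log_liks = np.asarray(log_liks)
    return np.log(len(log_liks)) - logsumexp(-log_liks)

# Given Gibbs samples z_samples from p(z | Y) (Section 3):
# log_liks = [log_p_Y_given_z(Y, z_r) for z_r in z_samples]
# log_p_hat = log_harmonic_mean(log_liks)
```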
Leveraging (2.1) and (2.4), our estimate of the Bayes factor in Eq. 2.3 is
$$ \begin{array}{@{}rcl@{}} \hat{\mathcal{B}}_{\mathcal{M},\mathcal{M}^{*}}=\frac{\hat{p}(\textbf{Y} \mid \mathcal{M})}{p(\textbf{Y} \mid \mathcal{M}^{*})}=\frac{\left[\frac{1}{R} {\sum}_{r=1}^{R} {\prod}_{h=1}^{H^{(r)}}{\prod}_{k=1}^{h}\frac{\text{\small B}(a,b)}{\text{\small B}(a+m^{(r)}_{hk}, b+ \bar{m}^{(r)}_{hk})} \right]^{-1}}{{\prod}_{h=1}^{H^{*}}{\prod}_{k=1}^{h} \frac{\text{\small B}(a+m^{*}_{hk}, b+ \bar{m}^{*}_{hk})}{\text{\small B}(a,b)}}, \end{array} $$
(2.5)
where \(m^{(r)}_{hk}\) and \(\bar{m}^{(r)}_{hk}\) are the counts of edges and non–edges among nodes in groups \(h\) and \(k\) induced by the \(r\)th mcmc sample of \(\textbf{z}\), whereas \(m^{*}_{hk}\) and \(\bar{m}^{*}_{hk}\) denote the number of edges and non–edges among the nodes in clusters \(h\) and \(k\) induced by the exogenous assignments \(\textbf{z}^{*}\). Finally, \(H^{(r)}\) and \(H^{*}\) are the total numbers of unique labels in \(\textbf{z}^{(r)}\) and \(\textbf{z}^{*}\), respectively. Section 3 describes the collapsed Gibbs algorithm to sample the assignment vectors \(\textbf{z}^{(1)}, \ldots, \textbf{z}^{(R)}\) from the posterior \(p(\textbf{z} \mid \textbf{Y})\) under model \({\mathscr{M}}\). These samples are required to compute (2.5) and, as discussed in Section 2.3, also allow inference on the posterior distribution of the endogenous partition.
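Combining the two previous sketches, the log of the estimate in Eq. 2.5 follows immediately; the snippet below assumes the hypothetical `log_p_Y_given_z` and `log_harmonic_mean` defined above and Gibbs samples `z_samples` from Section 3.

```python
# Log of the Bayes factor estimate in Eq. 2.5: log-numerator via the
# harmonic mean of Eq. 2.4, log-denominator via Eq. 2.1 at z = z_star.
def log_bayes_factor_hat(Y, z_samples, z_star, a=1.0, b=1.0):
    log_liks = [log_p_Y_given_z(Y, z, a, b) for z in z_samples]
    return log_harmonic_mean(log_liks) - log_p_Y_given_z(Y, z_star, a, b)
```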
Inference and Uncertainty Quantification on the Endogenous Partition
When the Bayes factor discussed in Section 2.2 provides evidence in favor of model \({\mathscr{M}}\), it is of interest to study the posterior distribution of \(\textbf{z}\) leveraging the Gibbs samples \(\textbf{z}^{(1)}, \ldots, \textbf{z}^{(R)}\). Common strategies address this goal by first computing the posterior co–clustering matrix \(\textbf{C}\), with elements \(c_{vu} = c_{uv}\) measuring the relative frequency of the Gibbs samples in which nodes \(v = 2, \ldots, n\) and \(u = 1, \ldots, v-1\) are in the same cluster, and then applying a standard clustering procedure to such a similarity matrix. However, this approach provides only a point estimate of \(\textbf{z}\) and, hence, fails to quantify posterior uncertainty. Legramanti et al. (2020) recently filled this gap by adapting the novel inference methods for Bayesian clustering in Wade and Ghahramani (2018) to the network field. These strategies rely on the variation of information (vi) metric, which quantifies the distance between two partitions by comparing their individual and joint entropies.
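For reference, the co–clustering matrix just described is straightforward to compute from the Gibbs samples; a minimal illustrative sketch:

```python
# Posterior co-clustering matrix C: entry (v, u) is the fraction of Gibbs
# samples in which nodes v and u are assigned to the same cluster.
import numpy as np

def coclustering_matrix(z_samples):
    Z = np.asarray(z_samples)            # R x n array of sampled labels
    C = np.zeros((Z.shape[1], Z.shape[1]))
    for z in Z:
        C += (z[:, None] == z[None, :])  # indicator of co-clustering
    return C / Z.shape[0]
```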
Under this framework, a point estimate \(\hat{\textbf{z}}\) for \(\textbf{z}\) coincides with the partition having the lowest posterior averaged vi distance from all the other clusterings. Moreover, a \(1-\delta\) credible ball around \(\hat{\textbf{z}}\) can be obtained by collecting all those partitions with a vi distance from \(\hat{\textbf{z}}\) below a given threshold, with this threshold chosen to guarantee the smallest–size ball containing at least \(1-\delta\) posterior probability. Such inference is useful to complement the results of the test in Section 2.2. Namely, to gain further reassurance about the output of the proposed test, we may also study whether the exogenous clustering \(\textbf{z}^{*}\) is plausible under the posterior distribution of the endogenous partition \(\textbf{z}\) by checking whether \(\textbf{z}^{*}\) lies inside the credible ball around \(\hat{\textbf{z}}\). Refer to Wade and Ghahramani (2018), Legramanti et al. (2020) and to the code at https://github.com/danieledurante/TESTsbm for more details on the aforementioned inference methods and their implementation.
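A sketch of the vi distance, and of a simple approximation to the point estimate \(\hat{\textbf{z}}\), is below. For tractability, the minimization is restricted to the sampled partitions, whereas Wade and Ghahramani (2018) discuss searches over a larger space; the function names are our own.

```python
# Variation of information between two partitions (in nats), and an
# approximate vi point estimate restricted to the sampled partitions.
import numpy as np

def vi_distance(z1, z2):
    """VI(z1, z2) = H(z1) + H(z2) - 2 I(z1, z2)."""
    n = len(z1)
    joint = {}
    for h, k in zip(z1, z2):
        joint[(h, k)] = joint.get((h, k), 0) + 1
    n1, n2 = {}, {}                       # marginal cluster sizes
    for (h, k), c in joint.items():
        n1[h] = n1.get(h, 0) + c
        n2[k] = n2.get(k, 0) + c
    H1 = -sum(c / n * np.log(c / n) for c in n1.values())
    H2 = -sum(c / n * np.log(c / n) for c in n2.values())
    I = sum(c / n * np.log(n * c / (n1[h] * n2[k])) for (h, k), c in joint.items())
    return H1 + H2 - 2.0 * I

def vi_point_estimate(z_samples):
    """Sampled partition with the smallest average vi distance from the others."""
    avg = [np.mean([vi_distance(z, w) for w in z_samples]) for z in z_samples]
    return z_samples[int(np.argmin(avg))]
```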
Finally, although the block probabilities are integrated out, a plug–in estimate for these quantities can be easily obtained. Indeed, by leveraging beta–binomial conjugacy, we have that \((\theta_{hk} \mid \textbf{Y},\textbf{z}) \sim \text{Beta}(a+m_{hk}, b+\bar{m}_{hk})\). Hence, a plug–in estimate of the block probability \(\theta_{hk}\), for \(h=1,\ldots,\hat{H}\) and \(k = 1, \ldots, h\), is
$$ \begin{array}{@{}rcl@{}} \hat{\theta}_{hk}=\mathbb{E}[\theta_{hk} \mid \textbf{Y},\hat{\textbf{z}}]= \frac{a+\hat{m}_{hk}}{a+\hat{m}_{hk}+b+\hat{\bar{m}}_{hk}}, \end{array} $$
where \(\hat{m}_{hk}\) and \(\hat{\bar{m}}_{hk}\) denote the number of edges and non–edges between nodes in groups \(h\) and \(k\), respectively, induced by the posterior point estimate \(\hat{\textbf{z}}\) of \(\textbf{z}\).
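Computationally, these plug–in estimates only require the block counts induced by \(\hat{\textbf{z}}\); a minimal sketch, mirroring the counting logic of the hypothetical `log_p_Y_given_z` above:

```python
# Plug-in posterior means of the block probabilities given z_hat.
import numpy as np

def plugin_block_probabilities(Y, z_hat, a=1.0, b=1.0):
    z_hat = np.asarray(z_hat)
    labels = np.unique(z_hat)
    theta_hat = np.zeros((len(labels), len(labels)))
    for i, h in enumerate(labels):
        for j, k in enumerate(labels[: i + 1]):
            rows, cols = np.where(z_hat == h)[0], np.where(z_hat == k)[0]
            edges = int(Y[np.ix_(rows, cols)].sum())
            if h == k:
                edges //= 2  # within-block dyads counted twice
                pairs = len(rows) * (len(rows) - 1) // 2
            else:
                pairs = len(rows) * len(cols)
            m, mbar = edges, pairs - edges
            theta_hat[i, j] = theta_hat[j, i] = (a + m) / (a + m + b + mbar)
    return theta_hat
```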