Bayesian Testing for Exogenous Partition Structures in Stochastic Block Models

Network data often exhibit block structures characterized by clusters of nodes with similar patterns of edge formation. When such relational data are complemented by additional information on exogenous node partitions, these sources of knowledge are typically included in the model to supervise the cluster assignment mechanism or to improve inference on edge probabilities. Although these solutions are routinely implemented, there is a lack of formal approaches to test if a given external node partition is in line with the endogenous clustering structure encoding stochastic equivalence patterns among the nodes in the network. To fill this gap, we develop a formal Bayesian testing procedure which relies on the calculation of the Bayes factor between a stochastic block model with known grouping structure defined by the exogenous node partition and an infinite relational model that allows the endogenous clustering configurations to be unknown, random and fully revealed by the block-connectivity patterns in the network. A simple Markov chain Monte Carlo method for computing the Bayes factor and quantifying uncertainty in the endogenous groups is proposed. This routine is evaluated in simulations and in an application to study exogenous equivalence structures in brain networks of Alzheimer's patients.

model-based representations for inference on grouping structures, with the stochastic block model (sbm) (Holland et al., 1983;Nowicki and Snijders, 2001) providing the most notable contribution within this class. Such a statistical model expresses the edge probabilities as a function of the node assignments to groups and of block probabilities among such groups, thus allowing inference on more general block-connectivity patterns beyond community structures. The success of sbms in different fields has motivated various extensions (e.g. Kemp et al., 2006;Airoldi et al., 2008;Karrer and Newman, 2011;Geng et al., 2019) and detailed theoretical studies on their asymptotic properties (e.g. Zhao et al., 2012;Gao et al., 2018;van der Pas and van der Vaart, 2018;Ghosh et al., 2019); see Schmidt and Morup (2013); Abbe (2017); Lee and Wilkinson (2019) and the references therein for a comprehensive overview.
When node-specific attributes are available, the above block models have been generalized in different directions to incorporate such external information in the edge formation mechanism. Common proposals address this goal via the inclusion of nodal attributes within the generative model for the cluster assignments (e.g. Tallberg, 2004;White and Murphy, 2016;Newman and Clauset, 2016;Stanley et al., 2019), or by defining the edge probabilities as a function of block-specific parameters, as in classical sbms, and of pairwise similarity measures among node attributes (e.g. Mariadassou et al., 2010;Choi et al., 2012;Sweet, 2015;Roy et al., 2019).
Such formulations are powerful approaches to assist the cluster assignment mechanism and, typically, improve the estimation of the edge probabilities. However, when categorical node attributes are available, less attention has been paid to the development of formal Bayesian testing procedures to assess whether these exogenous partitions identified by an external grouping variable are in line with the endogenous partition revealed by the block-connectivity behaviors in the network. For example, in structural brain network applications it is often of interest to understand if exogenous anatomical partitions of the brain regions can accurately characterize the endogenous block structures of brain networks (e.g. Sporns, 2013;Faskowitz et al., 2018). This goal could be partially addressed by the previously-mentioned models via inference on the posterior distribution for the parameters regulating the effect of the node-specific attributes, but these formulations are prone to identifiability and computational issues. Motivated by the above discussion, we propose a formal yet simple Bayesian testing procedure to compare a stochastic block model with known grouping structure, fixed according to a given exogenous node partition, and an infinite relational model (Kemp et al., 2006) where the node assignments are unknown, random and modeled through a Chinese Restaurant Process (crp) prior (Aldous, 1985), which allows the total number of non-empty clusters H to be inferred. Such a Bayesian nonparametric representation allows flexible learning of the endogenous clustering configurations as revealed by the common connectivity behaviors within the network and, hence, provides a suitable reference model against which to assess the ability of a pre-specified exogenous partition to characterize the block-connectivity structures within the network. In a sense, our goal is related to those of Bianconi et al. (2009) andPeel et al. (2017). However, such contributions compute, under a frequentist perspective, the entropy of a stochastic block model whose groups coincide with the external node partition and compare it with the distribution of the entropies derived under the same network with grouping structure given by a random permutation of the exogenous node labels. Besides taking a Bayesian approach, our procedure quantifies proximities to endogenous block structures rather than studying departures from a random partition. This allows, as a byproduct, inference on node groupings supported by the data. In fact, leveraging the recent inference methods for Bayesian clustering (Wade and Ghahramani, 2018) brought into the network field by Legramanti et al. (2020), we complement the results of the proposed testing procedures with an analysis of the credible balls for the grouping structure under the infinite relational model.
In Section 2 we describe the proposed testing procedure, based on the calculation of the Bayes factor (e.g. Kass and Raftery, 1995) among the two competing models, and discuss methods for uncertainty quantification on the inferred endogenous clustering. In Section 3, we derive a collapsed Gibbs sampler to obtain samples from the posterior of the endogenous partition, thus allowing Monte Carlo estimation of the marginal likelihood (Newton and Raftery, 1994;Raftery et al., 2007) required to compute the Bayes factor. As illustrated in simulations in Section 4 and in an application to Alzheimer's brain networks in Section 5, the Gibbs sampler is also useful to perform inference on the endogenous groups. Codes to implement the proposed methods can be found at https://github.com/danieledurante/TESTsbm.

Endogenous and exogenous models
Let Y denote an n × n symmetric adjacency matrix associated with an undirected binary network without self-loops, so that y vu = y uv = 1 if nodes v = 2, . . . , n and u = 1, . . . , v − 1 are connected, and y vu = y uv = 0 otherwise. The absence of self-loops implies that the diagonal entries of Y are not considered for inference.
Recalling our discussion in Section 1, we consider a stochastic representation partitioning the nodes into exhaustive and non-overlapping groups, where nodes in the same group are characterized by equal patterns of edge formation. More specifically, let z = (z 1 , . . . , z n ) ∈ Z be the vector of cluster membership indicators for the n nodes, with Z being the space of all the possible group assignments, so that z v = h if and only if the vth node belongs to the hth cluster. Letting H be the number of non-empty groups in z, we denote with Θ the H × H symmetric matrix of block probabilities with generic elements θ hk ∈ (0, 1) indexing the distribution of the edges between the nodes in cluster h and those in cluster k. To characterize block-connectivity structures within the network, we assume independently for each v = 2, . . . , n and u = 1, . . . , v − 1, with θ hk ∼ Beta(a, b), independently for every h = 1, . . . , H and k = 1, . . . , h. This formulation recalls the classical Bayesian sbm specification (Nowicki and Snijders, 2001) and leverages a stochastic equivalence property that relies on the conditional independence of the edges, whose distribution depends on the cluster membership of the associated nodes. Indeed, by marginalizing out the beta-distributed block probabilities, which are typically treated as nuisance parameters in the sbm (e.g. Kemp et al., 2006;Schmidt and Morup, 2013), the likelihood for Y given z is where m hk andm hk denote the number of edges and non-edges among nodes in clusters h and k, respectively, whereas B(·, ·) is the beta function. Expression (1) is derived by exploiting beta-binomial conjugacy, and, as we will clarify later in the article, is fundamental to compute Bayes factors and to develop a collapsed Gibbs sampler which updates only the endogenous cluster assignments while treating the block probabilities as nuisance parameters. Moreover, as is clear from equation (1), p(Y | z) is invariant under relabeling of the cluster indicators. Therefore p(Y | z) is equal to p(Y |z) for any relabelingz of z, meaning that also the Bayes factors computed from these quantities are invariant under relabeling. Hence, in the rest of the paper, z will denote any element of the equivalence class of its relabelings, whereas Z will correspond to the space of all the partitions of {1, . . . , n}.
Recalling Section 1, our goal is develop a formal Bayesian test to assess whether assuming z as known and equal to an exogenous assignment vector z * produces an effective characterization of all the block structures in Y, relative to what would be obtained by letting z unknown, random and endogenously determined by the stochastic equivalence relations in Y. The first hypothesized model M * can be naturally represented via a sbm as in (1) with a fixed and known exogenous partition z * , whereas the second model M requires a flexible prior distribution for the indicators z = (z 1 , . . . , z n ) which is able to reveal the endogenous grouping structure induced by the block-connectivity patterns in Y, without imposing strong parametric constraints. A natural option would be to consider a Dirichlet-multinomial prior as in classical sbms (Nowicki and Snijders, 2001), but such a specification requires the choice of the total number of groups, which is typically unknown. This issue is usually circumvented by relying on bic metrics that require estimation of multiple sbms (e.g. Saldana et al., 2017). To avoid these computational costs and increase flexibility, we rely on a Bayesian nonparametric solution that induces a full-support prior on the total number H of non-empty groups in z. This enables learning of H, which is not guaranteed to coincide with the number H * of non-empty groups in z * . A widely used prior in the context of sbms is the crp (Aldous, 1985), which leads to the so-called infinite relational model (Kemp et al., 2006;Schmidt and Morup, 2013). Under this prior each group attracts new nodes in proportion to its size, and the formation of new groups depends only on the size of the network and on a tuning parameter α > 0. More specifically, under model M, we assume the following prior over cluster indicators for the vth node, given the memberships (2) In (2), H −v is the number of non-empty groups in z −v , the integer n h,−v is the total number of nodes in cluster h, excluding the vth one, whereas α > 0 denotes a concentration parameter controlling the expected number of non-empty clusters. The urn representation in equation (2) is induced by the joint probability mass ] −1 , which shows that the crp is exchangeable. See also Gershman and Blei (2012) for an overview of crp priors.

Bayesian testing
To compare the ability of the endogenous (M) and exogenous (M * ) formulations in characterizing the block structures in Y, we define a formal Bayesian test relying on the Bayes factor. More specifically, assuming that the two competing models are equally likely a priori, i.e.
where z∈Z p(Y | z)p(z) and p(Y | z * ) are the marginal likelihoods of Y under M and M * . Recalling, e.g., Kass and Raftery (1995), equation (3)  To evaluate equation (3), note that the quantity p(Y | z * ) can be computed by evaluating (1) at z = z * . In contrast, model M requires the calculation of p(Y | z) and p(z) for every z ∈ Z. Although both quantities can be evaluated in closed form as discussed in Section 2.1, this approach is computationally impractical due to the huge cardinality of the set Z, thus requiring alternative strategies relying on Monte Carlo estimation of p(Y|M) = z∈Z p(Y|z)p(z) via importance sampling methods. Here, we consider the harmonic mean approach (Newton and Raftery, 1994;Raftery et al., 2007), thus obtaininĝ where z (1) , . . . , z (R) are samples from the posterior distribution of z and p(Y | z (r) ) is the likelihood in (1) evaluated at z = z (r) , for every r = 1, . . . , R. The harmonic mean approach is a consistent strategy to evaluate marginal likelihoods and, due to its simplicity, is widely implemented. Although recent refinements have been proposed to address some shortcomings of the harmonic estimate (e.g. Lenk, 2009;Pajor, 2017), here we consider the original formula which is computationally more tractable and has proved stable in our simulations and applications.
Leveraging equations (1) and (4), our estimate of the Bayes factor in (3) iŝ where m (r) hk andm (r) hk refer to the counts of edges and non-edges among nodes in groups h and k induced by the rth mcmc sample of z, whereas m * hk andm * hk denote the number of edges and non-edges among the nodes in clusters h and k induced by the exogenous assignments z * . Finally, H (r) and H * are the total numbers of unique labels in z (r) and z * . Section 3 describes the collapsed Gibbs algorithm to sample the assignment vectors z (1) , . . . , z (R) from the posterior p(z | Y) under model M. These samples are required to compute (5) and, as discussed in Section 2.3, also allow inference on the posterior distribution of the endogenous partitions.

Inference and uncertainty quantification on the endogenous partition
When the Bayes factor discussed in Section 2.2 provides evidence in favor of model M, it is of interest to study the posterior distribution of z leveraging the Gibbs samples z (1) , . . . , z (R) . Common strategies address this goal by first computing the posterior co-clustering matrix C with elements c vu = c uv measuring the relative frequency of the Gibbs samples in which nodes v = 2, . . . , n and u = 1, . . . , v − 1 are in the same cluster, and then apply a standard clustering procedure to such a similarity matrix. However, this approach provides only a point estimate of z and, hence, fails to quantify posterior uncertainty. Legramanti et al. (2020) recently covered this gap by adapting the novel inference methods for Bayesian clustering in Wade and Ghahramani (2018) to the network field. These strategies rely on the variation of information (vi) metric, which quantifies distances between two partitions by comparing their individual and joint entropies.
Under this framework, a point estimateẑ for z coincides with that partition having the lowest posterior averaged vi distance from all the other clusterings, whereas a 1 − δ credible ball aroundẑ is obtained by collecting all those partitions with a vi distance fromẑ below a given threshold, with this threshold chosen Algorithm 1: One step of the Gibbs sampler for z under M for v = 1, . . . , n do Update each z v conditionally on z −v and Y as follows 1. Remove node v from the node set.
2. If no other node belongs to the cluster of v, such a cluster is removed. 3. Reorder the cluster indices so that 1, . . . , H −v are non-empty and sample z v from the categorical variable with full-conditional probabilities return z = (z 1 , . . . , z n ) to guarantee the smallest-size ball containing at least 1 − δ posterior probability. Such inference is useful to complement the results of the test in Section 2.2. Namely, to get further reassurance about the output of the proposed test, we may also study whether the exogenous clustering z * is plausible under the posterior distribution for the endogenous partition z by checking if z * lies inside the credible ball aroundẑ. Refer to Wade and Ghahramani (2018); Legramanti et al. (2020) and to the codes at https://github.com/danieledurante/ TESTsbm for more details on the aforementioned inference methods and their implementation.
Finally, although the block probabilities are integrated out, a plug-in estimate for these quantities can be easily obtained. Indeed, leveraging beta-binomial conjugacy, (θ hk | Y, z) ∼ Beta(a + m hk , b +m hk ). Hence, a plug-in estimate of the block probabilities θ hk for h = 1, . . . ,Ĥ and k = 1, . . . , h iŝ wherem hk andm hk denote the number of edges and non-edges between nodes in groups h and k, respectively, induced by the posterior point estimateẑ of z.

Posterior Computation via Collapsed Gibbs Sampling
The posterior samples of z under model (1) with crp prior (2) can be obtained via a simple collapsed Gibbs sampler which updates the group assignment of each node v conditioned on those of the others by sampling from the full-conditional distribution p(z v | Y, z −v ) (Schmidt and Morup, 2013). By collapsing out the beta priors for the block probabilities, this procedure reduces the computational time in avoiding the updating of θ hk for h = 1, . . . , H and k = 1, . . . , h, while improving mixing (Neal, 2000).
Algorithm 1 provides the detailed steps of one cycle of the Gibbs sampler. Note that since equation (1) is the joint probability for a large set of binary edges, manipulating this quantity within Algorithm 1 and in computing the Bayes factor in (5) may lead to practical difficulties due to the need to deal with quantities very close to zero. In these settings, we suggest to work with the logarithm, when possible, and to exploit the , where d usually coincides with max i ν i .

Simulation Studies
We consider an illustrative simulation to assess the performance of the new inference procedures presented in Section 2, and to evaluate the ability of model M to recover underlying endogenous partition structures.
Consistent with this goal, we simulate a symmetric binary adjacency matrix Y from a stochastic block model to induce a uniform prior on the block probabilities. This choice is theoretically supported (e.g. Ghosh et al., 2019) and has been widely employed in routine implementations of sbms (Nowicki and Snijders, 2001;Kemp et al., 2006;Geng et al., 2019). As for α in prior (2), we set it equal to 1 following default specifications of the crp, thus circumventing the need to include a hyper-prior which could affect mixing and convergence of Algorithm 1. Such a default specification has proved effective both in simulations and in applications, and we found the results robust to moderate changes in α. For instance, setting α = 0.1 or α = 10 did not change the final conclusions of our testing procedures.   To assess the quality of such strategies, we consider four external assignment vectors z 0 , z 1 , z 2 and z 3 evaluated in Table 1. In particular, z 0 denotes the true generative partition, z 1 is obtained by a random permutation of the indices in z 0 , while z 2 and z 3 define a refined and a coarsened partitioning of z 0 , respectively, in which each cluster is either divided in two additional ones (z 2 ) or collapsed with others to form a single group (z 3 ). Due to this, we expect to obtain evidence in favor of the exogenous partition only in the scenario with z * = z 0 . Table 1 confirms our expectations when compared with the thresholds in Kass and Raftery (1995). Note that, althoughẑ = z 0 , we obtain a negative Bayes factor in the first scenario, which leads to a strong preference for model M * relative to M. Indeed, even if the point estimateẑ for z under M exactly recovers z 0 , there is still some amount of posterior uncertainty induced by the crp prior on z. On the contrary, M * is defined in the first scenario by conditioning on the true underlying partition with no uncertainty, thus providing a formulation much closer to the true data-generative mechanism relative to M. All the remaining exogenous partitions z 1 , z 2 and z 3 are, instead, not as effective as model M in characterizing the endogenous block structures within Y. As expected, this is especially true for the random partition (z 1 ), but also those obtained from refinements (z 2 ) or coarsening (z 3 ) operations on z 0 are not plausible according to the results of the tests. Such results confirm the ability of our procedures to provide accurate conclusions under various configuration of z * . For instance, although the partition z 2 still leads to homogenous blocks in Y, the additional refinements in z 2 provide an unnecessary addition of further groups which are not required to characterize the block-connectivity patterns in Y, thus leading the test to provide evidence in favor of M rather than M * when z * = z 2 .
The vi distances between the estimatedẑ under model M and the four exogenous partitions confirm the results of the tests. In particular, the only external assignment vector with a vi distance fromẑ less than the estimated 0.428 threshold of the 95% credible ball aroundẑ is z 0 .

Application to Brain Networks of Alzheimer's Individuals
There is an intensive research effort aimed at finding the sources of the Alzheimer's disease in human brain networks. Such an increasing interest is motivated by recent developments in brain imaging technologies and by the constant growth of elderly population in the age interval mostly affected by Alzheimer's, thus  Fig. 3 Graphical illustration of a representative brain network Y for Alzheimer's individuals. Brain regions are re-ordered and partitioned in blocks according to the estimated endogenous assignmentsẑ. Black and white cells denote edges and non-edges, respectively, whereas the first two colored columns represent the two exogenous anatomical brain partitions into hemispheres and lobes.
making such a disease a major concern, both in terms of disability and mortality, especially for countries with longer life expectancy (Ashford et al., 2011a,b;Stam, 2014). Here, we focus on studying structural brain networks encoding presence or absence of white matter fibers among anatomical regions in human brains. Such connectivity data have been a source of major interest in several recent studies mostly focused on topological summary measures of Alzheimer's brains and on how these measures change as the disease progresses (Daianu et al., 2013;Sulaimany et al., 2017;John et al., 2017;Mårtensson et al., 2018). Instead, we consider a different perspective by studying the endogenous block structures in a representative Alzheimer's brain network, while assessing whether exogenous region partitions of interest can effectively characterize the block structures of such a network.
Consistent with the above goal, we apply methods in Sections 2-3 to the 68×68 binary adjacency matrix Y encoding presence or absence of white matter fibers among anatomical regions in a representative Alzheimer's brain network provided by Sulaimany et al. (2017). In this study, brain regions are defined by the Desikan atlas (Desikan et al., 2006), which provides additional information on hemisphere and lobe memberships (Kang et al., 2012); see Sulaimany et al. (2017) for additional details on the construction of Y. Figure 3 provides a graphical representation of Y with brain regions suitably reordered and organized in blocks according to the estimated endogenous assignmentsẑ. The latter are obtained by considering the same mcmc settings and hyper-parameters of the simulation study, which proved effective and robust also in this application; see Figure   4. As shown in Figure 3, we learnĤ = 12 endogenous groups equally divided between the two hemispheres and showing an overall coherence of the partition structure across left and right regions. As expected, there is an evident block-connectivity within hemispheres, although some groups also display a tendency to connect across hemispheres. For example, brain regions in the frontal lobe tend to create two highly interconnected  clusters, one in each hemisphere, with these two blocks showing also a preference to create bridges among the two hemispheres. Despite these anatomical homophily structures, as highlighted in Figure 3 and in Table 2, hemisphere and lobe partitions are not sufficient to fully characterize the endogenous block structures in Alzheimer's brains. There are, in fact, various sub-blocks within each hemisphere and these clusters typically comprise regions in different lobes.
We conclude by studying if the clustering structures inferred from representative brains of individuals in three ordered stages of cognitive decline can effectively explain the endogenous block structures in Alzheimer's brains. To accomplish this goal, we first apply Algorithm 1 to the representative adjacency matrices of individuals characterized by normal aging, early and late cognitive decline (Sulaimany et al., 2017), and then quantify, via the Bayes factors in Table 2, whether these partitions are also effective in modeling the block structures within the Alzheimer's brain. Although 2 logB M,M * is above the threshold in Kass and Raftery (1995) suggesting strong evidence against this hypothesis for all three stages, it is interesting to notice how 2 logB M,M * decreases as cognitive decline progresses towards Alzheimers' disease. This means that the inferred partitions could be used, with caution, as a diagnostic strategy to identify the progress of the disease.
The vi distances betweenẑ and these external partitions confirm the evidence provided by the Bayes factors, thus providing further support to our conclusions.
To further validate the suitability of M as a flexible model for Y, we also compute the in-sample missclassification error when predicting each y vu withθẑ v ,ẑ u . Such a measure is 0.1, thus confirming that M can be regarded as a suitable model for this application.

Discussion and Future Developments
In this contribution we propose a formal Bayesian testing procedure to assess the ability of a fixed exogenous node partition to characterize a network structure, with respect to an infinite relational model. To accomplish this goal we compare an harmonic mean estimate of the marginal likelihood under this latter representation with the one induced by a stochastic block model conditioned on the external partition of interest. From a computational perspective, we rely on a collapsed Gibbs sampler which additionally allows Bayesian inference and uncertainty quantification on endogenous partitions. As illustrated in simulations and applications to brain networks, our proposal provides a simple yet effective procedure to obtain further insights regarding the effects of node attributes on network structures.
There are several directions for future developments. For example, weighted networks comprising counts or continuous measures of strength in the relationship can be easily incorporated within our strategy by simply replacing the likelihood in (1) with a suitable one. This can be obtained by leveraging Poisson-gamma or Gaussian-Gaussian conjugacy, as done for the beta-binomial case. Moreover, while throughout the paper we have considered the problem of testing model M against model M * given a single observed network Y, one may be interested in the same test given a sample of N exchangeable networks. This is feasible under our proposed framework and only requires to substitute p(Y | z) in (1) with p(Y 1 , . . . , Y N | z). It is also possible to compare two exogenous partitions, rather than an exogenous and an endogenous one. This task is even simpler than the one analyzed in this article, since the likelihood in (1) can be computed in closed form for both the external partitions under comparison, thus avoiding the need of mcmc methods. For example, one may be interested in comparing an external assignment z * with a random permutation of the indices in such a vector to assess whether z * offers improvements in modeling network block structures or has no effect.
Therefore, the perspective taken by Bianconi et al. (2009) andPeel et al. (2017) can be seen as a special case of our more general solution.