Bayesian estimation of the latent dimension and communities in stochastic blockmodels

Sanna Passino, F. and N. A. Heard

Spectral embedding of adjacency or Laplacian matrices of undirected graphs is a common technique for representing a network in a lower dimensional latent space, with optimal theoretical guarantees. The embedding can be used to estimate the community structure of the network, with strong consistency results in the stochastic blockmodel framework. One of the main practical limitations of standard algorithms for community detection from spectral embeddings is that the number of communities and the latent dimension of the embedding must be specified in advance. In this article, a novel Bayesian model for simultaneous and automatic selection of the appropriate dimension of the latent space and the number of blocks is proposed. Extensions to directed and bipartite graphs are discussed. The model is tested on simulated and real world network data, showing promising performance for recovering latent community structure.


Introduction
A network can be represented as a graph G = (V, E), where V is a set of nodes and E ⊆ V × V is the set of edges indicating the pairs of nodes which have interacted. The graph can be characterised by its adjacency matrix A ∈ {0, 1}^{n×n}, where n = |V| and, for 1 ≤ i, j ≤ n, A_ij = 1{(i, j) ∈ E}, such that A_ij = 1 if a link between nodes i and j exists, and A_ij = 0 otherwise. The graph is said to be undirected if (i, j) ∈ E ⇐⇒ (j, i) ∈ E, in which case A is constrained to be symmetric; otherwise, the graph is said to be directed. It will be assumed that a node cannot link to itself, implying that A is a hollow matrix.
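As a concrete illustration, an adjacency matrix with these properties can be assembled from an edge list; a minimal numpy sketch (the function name is an illustrative choice, not from the paper):

```python
import numpy as np

def adjacency_matrix(n, edges, directed=False):
    """Build the n x n adjacency matrix of a graph from an edge list.

    Self-loops are discarded, so A is hollow; for undirected graphs each
    edge (i, j) is mirrored, making A symmetric.
    """
    A = np.zeros((n, n), dtype=int)
    for i, j in edges:
        if i == j:
            continue  # no self-loops: A is hollow
        A[i, j] = 1
        if not directed:
            A[j, i] = 1
    return A

A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 2), (0, 3)])
```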
Latent space models (Hoff et al., 2002; Handcock et al., 2007; Krivitsky et al., 2009) represent a flexible approach to statistical analysis of networks: each node i is assigned a latent position x_i in a d-dimensional latent space X, and edges between pairs of nodes are typically generated independently (independent-edge graph), with the probability of observing a link between nodes i and j obtained through a kernel function ψ : X × X → [0, 1] of the respective latent positions: P(A_ij = 1) = ψ(x_i, x_j). Different ideas and techniques for embedding observed graphs into low-dimensional spaces are explored in the literature (for a survey see, for example, Cai et al., 2018). Random dot product graphs (RDPGs) (Nickel, 2006; Young and Scheinerman, 2007, 2008) are a popular class of latent position models, where X ⊆ R^d and the function ψ(·, ·) is an inner product ⟨·, ·⟩ on X × X. RDPGs are analytically tractable and have therefore been extensively studied; a survey of the existing statistical inference techniques is presented in Athreya et al. (2017).
The stochastic blockmodel (SBM) (Holland et al., 1983) is the classical statistical model for clustering graphs (Snijders and Nowicki, 1997; Nowicki and Snijders, 2001). Assuming K communities, each node is assigned a community membership z_i ∈ {1, ..., K} with probabilities θ from the (K − 1) probability simplex. The probability of a link only depends on the community allocations z_i and z_j of the two nodes: given a symmetric matrix B ∈ [0, 1]^{K×K} of inter-community probabilities, independently P(A_ij = 1) = B_{z_i z_j}. The likelihood for an observed symmetric adjacency matrix A is therefore

L(A | B, z) = ∏_{i<j} B_{z_i z_j}^{A_ij} (1 − B_{z_i z_j})^{1 − A_ij}.    (1.1)

Stochastic blockmodels have appealing statistical properties, and can approximate any independent-edge network model well if the number of communities is sufficiently large (Bickel and Chen, 2009; Wolfe and Olhede, 2013). Limits and bounds for community detection in stochastic blockmodels, from an information-theoretic perspective, have been extensively studied and explored in the literature (Abbe et al., 2016; Abbe, 2018). SBMs have also been successfully extended in different directions: allowing for multiple overlapping communities (Airoldi et al., 2008), incorporating degree corrections (Karrer and Newman, 2011), capturing node popularity (Sengupta and Chen, 2018), Bayesian models (Peixoto, 2018; van der Pas and van der Vaart, 2018) and dynamically evolving models (Xu and Hero III, 2013; Matias and Miele, 2017; Pensky and Zhang, 2017; Ludkin et al., 2018). Stochastic blockmodels can also be easily represented as random dot product graphs: each community is assigned a latent position, which is common to all the nodes belonging to the cluster, and B is obtained from the inner products of those positions. Hence, in this framework, d = rank(B) ≤ K.
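The likelihood (1.1) can be evaluated directly from A, B and z; a minimal numpy sketch, summing over the dyads i < j only, since the graph is undirected (the function name is illustrative):

```python
import numpy as np

def sbm_loglik(A, B, z):
    """Log-likelihood (1.1) of a symmetric, hollow adjacency matrix A under
    a SBM with inter-community probability matrix B and allocations z:
    sum over i < j of A_ij log B_{z_i z_j} + (1 - A_ij) log(1 - B_{z_i z_j})."""
    n = A.shape[0]
    P = B[np.ix_(z, z)]            # P_ij = B_{z_i z_j}
    iu = np.triu_indices(n, k=1)   # i < j: each dyad counted once
    a, p = A[iu], P[iu]
    return np.sum(a * np.log(p) + (1 - a) * np.log(1 - p))
```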
Spectral clustering (von Luxburg, 2007) provides a consistent statistical estimation procedure for the latent positions in SBMs (Rohe et al., 2011; Sussman et al., 2012; Fishkind et al., 2013; Lyzinski et al., 2014; Lei and Rinaldo, 2015; Lyzinski et al., 2017) and, more generally, in random dot product graphs (Tang et al., 2013; Sussman et al., 2014). Community memberships in the stochastic blockmodel can be consistently estimated using K-means clustering of the rows of the embedding (Rohe et al., 2011). Rubin-Delanchy et al. (2017) directly link spectral embedding to the generalised random dot product graph (GRDPG), an extension of the RDPG, and advocate Gaussian mixture modelling (GMM) of the rows of the embedding, especially when the Laplacian matrix is used. Alternatives to spectral clustering include variational methods (Celisse et al., 2012) and pseudo-likelihood approaches (Amini et al., 2013). SBMs have been extended to the directed case (Wang and Wong, 1987; Rohe et al., 2016), and appropriate embeddings for co-clustering, in most cases based on the singular value decomposition (SVD), have been derived in the literature (Dhillon, 2001; Malliaros and Vazirgiannis, 2013; Zheng and Skillicorn, 2015).
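Community recovery via K-means on the rows of an embedding can be sketched in a few lines; a minimal numpy implementation of Lloyd's algorithm with a deterministic farthest-point initialisation (the function name and initialisation scheme are illustrative choices, not from the references above):

```python
import numpy as np

def kmeans_rows(X, K, n_iter=100):
    """Cluster the rows of an n x d spectral embedding X into K communities
    using Lloyd's algorithm."""
    # deterministic initialisation: start from the first row, then
    # repeatedly add the row farthest from the current centroids
    centroids = [X[0]]
    for _ in range(1, K):
        dmin = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[dmin.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        # reassign each row to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        for k in range(K):
            if np.any(z == k):          # update non-empty centroids only
                centroids[k] = X[z == k].mean(axis=0)
    return z
```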
One of the practical issues of spectral embedding, and more generally of all graph embedding algorithms (Cai et al., 2018), is that they require a suitable prespecified latent dimensionality d (usually d ≪ |V|) as input, and, subsequently, a suitable number of clusters K, chosen conditionally on the previous choice of d. For a practical example of this procedure on a real-world network, see Priebe et al. (2019). In general, in spectral clustering, similarly to what practitioners do in principal component analysis (PCA), the investigator examines the scree-plot of the eigenvalues and chooses the dimension based on the location of elbows in the plot (Jolliffe, 2002), or uses the eigengap heuristic (see, for example, von Luxburg, 2007). Automatic methods for thresholding have also been suggested (Zhu and Ghodsi, 2006; Chatterjee, 2015). A relevant body of literature is also devoted to methods for the selection of the number of communities in stochastic blockmodels (Zhao et al., 2011; Bickel and Sarkar, 2016; Lei, 2016; Newman and Reinert, 2016; Franco Saldaña et al., 2017; Riolo et al., 2017; Wang and Bickel, 2017; Chen and Lei, 2018; Rastelli et al., 2018). Often, practitioners simply set d = K, assuming that B has full rank in the stochastic blockmodel framework. Under the full rank assumption, d = K may be estimated as the number of eigenvalues of A which are larger than √n (Chatterjee, 2015; Lei, 2016). In this setting, hypothesis tests have also been proposed (Zhao et al., 2011; Bickel and Sarkar, 2016; Lei, 2016). In this work, the problem of selecting d is approached from the perspective of variable selection in model-based clustering (Fraley and Raftery, 2002; Lau and Green, 2007), which is widely studied in the literature (Fowlkes et al., 1988; Law et al., 2004; Raftery and Dean, 2006; Maugis et al., 2009; Andrews and McNicholas, 2014).
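The eigenvalue-counting estimator of d = K mentioned above can be sketched directly; this assumes an undirected graph and the stated criterion of counting eigenvalues of A larger than √n (Chatterjee, 2015; Lei, 2016):

```python
import numpy as np

def estimate_d_threshold(A):
    """Estimate d = K, under the full-rank assumption on B, as the number
    of eigenvalues of the symmetric adjacency matrix A exceeding sqrt(n)."""
    n = A.shape[0]
    eigvals = np.linalg.eigvalsh(A)      # eigenvalues of symmetric A
    return int(np.sum(eigvals > np.sqrt(n)))
```

For instance, a graph consisting of two disjoint 20-node cliques has exactly two eigenvalues (both equal to 19) above √40 ≈ 6.3, so the estimator returns 2.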
Similarly, the problem of correctly selecting the number of clusters is also common to K-means and GMMs, since the number of components in the mixture must usually be specified in advance. This parameter is commonly chosen by minimising an information criterion (for example, AIC or BIC). Numerous other techniques have been proposed for GMMs with an unknown number of components (Mengersen and Robert, 1996; Richardson and Green, 1997; Stephens, 2000; Nobile, 2004; Zhang et al., 2004; Dellaportas and Papageorgiou, 2006; Nobile and Fearnside, 2007; Miller and Harrison, 2018).
Clearly, the sequential approach to estimating d and K is suboptimal, and it is desirable to estimate the two parameters jointly. This article addresses the problem in a Bayesian framework, proposing a novel methodology to automatically select d and K simultaneously. Techniques for selection of K in GMMs will be incorporated within the spectral embedding framework, allowing K and d, the number of communities and the dimension of the latent positions, to be random and learned from the data. A structured Bayesian model for simultaneously inferring the dimension of the latent space, the number of communities, and the community allocations is proposed. The model is based on asymptotic results (Tang and Priebe, 2018; Rubin-Delanchy et al., 2017) on the leading and informative components of spectral embeddings, obtained for d fixed and known. The asymptotic theory is combined with realistic assumptions about the remaining components of the embedding, empirically tested and justified on simulated data. Furthermore, extensions to the directed and bipartite case will be discussed. The proposed model has multiple advantages: the latent dimension d and number of communities K are modelled separately, and the Bayesian framework allows for automatic selection of the two parameters. The model also allows estimation of d even when d < K, and gives insights into the goodness-of-fit of the stochastic blockmodel on observed network data, based on the embedding. The method is tested on simulated data and applied to real-world computer and transportation networks. It should be noted that Yang et al. (2019) have simultaneously and independently proposed a similar inferential procedure within a frequentist framework.
The article is organised as follows: Section 2 introduces adjacency and Laplacian spectral embeddings and the GRDPG. The novel Bayesian model for selection of the appropriate dimension of the latent space is discussed in Section 3, followed by a careful illustration of posterior inference procedures in Section 4. Section 5 discusses the effects of the curse of dimensionality on the model, and suggests a remedy. Extensions of the model are presented in Section 6, and results and applications are finally discussed in Section 7.

GRDPG and spectral embeddings
The adjacency matrix of an undirected graph is usually embedded into a latent space of dimension d using one of two different procedures: the adjacency spectral embedding or the Laplacian spectral embedding. Suppose A ∈ {0, 1}^{n×n} is the symmetric adjacency matrix of an undirected graph with n nodes.
Definition 1 (Adjacency spectral embedding, ASE). For d ∈ {1, ..., n}, consider the spectral decomposition A = ΓΛΓᵀ + Γ⊥Λ⊥Γ⊥ᵀ, where Λ is the d × d diagonal matrix containing the top d eigenvalues in magnitude, in decreasing order, Γ is the n × d matrix containing the corresponding orthonormal eigenvectors, and the matrices Λ⊥ and Γ⊥ contain the remaining n − d eigenvalues and eigenvectors. The d-dimensional adjacency spectral embedding of A is X = Γ|Λ|^{1/2} ∈ R^{n×d}, where the operator |·| applied to a matrix returns the absolute value of its entries.
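A minimal numpy sketch of Definition 1 (the function name is illustrative):

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """d-dimensional ASE: X = Gamma |Lambda|^{1/2}, keeping the d
    eigenvalues of the symmetric matrix A that are largest in magnitude."""
    eigvals, eigvecs = np.linalg.eigh(A)
    idx = np.argsort(np.abs(eigvals))[::-1][:d]   # top d by magnitude
    return eigvecs[:, idx] * np.sqrt(np.abs(eigvals[idx]))
```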
Definition 2 (Laplacian spectral embedding, LSE). For d ∈ {1, ..., n}, consider the (modified) normalised Laplacian matrix L = D^{−1/2}AD^{−1/2}, where D is the diagonal matrix of node degrees, and its spectral decomposition; the d-dimensional Laplacian spectral embedding is obtained from the top d eigenvalues in magnitude of L, analogously to Definition 1. The modified Laplacian D^{−1/2}AD^{−1/2} (Rohe et al., 2011) is preferred to the more common I_n − D^{−1/2}AD^{−1/2} since the eigenvalues of the former lie in (−1, 1), providing a convenient interpretation for disassortative networks (Rubin-Delanchy et al., 2016). Also, using the Laplacian embedding is preferable in sparse regimes, whereas the adjacency embedding should be used for relatively dense graphs (Cape et al., 2018). A case-study with discussion of the differences between community structures obtained from ASE and LSE is presented in Priebe et al. (2019).
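The Laplacian embedding differs from the ASE only in first forming D^{−1/2}AD^{−1/2}; a sketch under the assumption that all node degrees are positive (function name illustrative):

```python
import numpy as np

def laplacian_spectral_embedding(A, d):
    """d-dimensional LSE based on the modified normalised Laplacian
    L = D^{-1/2} A D^{-1/2}, whose eigenvalues are bounded by 1 in magnitude.
    Assumes every node has at least one edge (positive degree)."""
    deg = A.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(deg))
    L = Dinv @ A @ Dinv
    eigvals, eigvecs = np.linalg.eigh(L)
    idx = np.argsort(np.abs(eigvals))[::-1][:d]   # top d by magnitude
    return eigvecs[:, idx] * np.sqrt(np.abs(eigvals[idx]))
```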
In this work, the stochastic blockmodel will be interpreted as a specific case of a generalised random dot product graph (GRDPG) (Rubin-Delanchy et al., 2017). The GRDPG, a generic latent position model for graphs, is defined below.
Definition 3 (Generalised random dot product graph, GRDPG). Let d₊, d₋ be non-negative integers such that d₊ + d₋ = d, and let I(d₊, d₋) = diag(1, ..., 1, −1, ..., −1), with d₊ ones followed by d₋ minus ones. Let F be a probability measure on X ⊆ R^d, A ∈ {0, 1}^{n×n} be a symmetric matrix and X = (x_1, ..., x_n)ᵀ, with x_i iid ∼ F and, for i < j, independently P(A_ij = 1) = x_iᵀ I(d₊, d₋) x_j. To represent the K-community stochastic blockmodel as a GRDPG model, F can be chosen to have mass concentrated at μ_1, ..., μ_K ∈ R^d such that μ_iᵀ I(d₊, d₋) μ_j = B_ij for all i, j ∈ {1, ..., K}. For estimation of the latent positions in a SBM, interpreted as a GRDPG, Rubin-Delanchy et al. (2017) suggest the following algorithm, based on Gaussian mixture modelling (see, for example, Fraley and Raftery, 2002).
Intuitively, the algorithm approximately holds because, taking a graph with m nodes and restricting attention to the first n nodes, with n < m, the rows of the embedding are asymptotically Gaussian around suitably transformed community-specific locations:

√m (Q_m x̂_i − μ_{z_i}) →d N(0, Σ(μ_{z_i})),

where Q_m is a matrix from the indefinite orthogonal group O(d₊, d₋) and Σ(μ_{z_i}) can be analytically computed (Rubin-Delanchy et al., 2017). The result holds for d fixed and known, but in this work it is of interest to treat d as a random, unknown parameter. If an m-dimensional embedding is considered, with m > d, then asymptotic theory implies an approximate normal distribution with non-zero means and an elliptical covariance within each cluster for the top-d components of the embedding; but, to the best of our knowledge, no theoretical results have been obtained for the remaining m − d columns. It is therefore necessary to propose a model for the remaining part of the embedding, which will be carefully described in Section 3, and empirically justified and assessed in Section 7.1.
It should be noted that it is only possible to estimate the vectors {μ_j} up to an indefinite orthogonal transformation: for any matrix Q ∈ O(d₊, d₋), (Qμ_i)ᵀ I(d₊, d₋) (Qμ_j) = μ_iᵀ I(d₊, d₋) μ_j, which implies that the likelihood (1.1) is invariant to any such transformation. This is inconsequential, however, as knowledge of the true transformation would not change any inferences about the network.
A second source of non-identifiability in the GRDPG interpretation of the SBM is the "uniqueness up to artificial dimension blow-up" (Cape et al., 2018): in the stochastic blockmodel setting, this essentially means that any matrix B ∈ [0, 1]^{K×K} with rank d can be obtained as an inner product between latent positions of arbitrarily large dimension.

A Bayesian model for SBM embeddings
For simplicity, the embeddings will be generically denoted X = (x_1, ..., x_n)ᵀ ∈ R^{n×m}. In this article, m is always assumed to be fixed and obtained from a preprocessing step. Choosing an appropriate value of m is arguably much easier than choosing the correct d, and, in the proposed model, the correct d can be recovered for any choice of m, as long as d ≤ m. Let X_{:d} denote the first d columns of X, and X_{d:} the remaining m − d columns. The notation x_{i,:d} denotes the first d elements (x_{i1}, ..., x_{id}) of the vector x_i, and similarly x_{i,d:} denotes the last m − d elements (x_{i,d+1}, ..., x_{im}).
Suppose a latent dimension d, K communities, and latent community assignments z = (z_1, ..., z_n). The latent positions of nodes within each community are assumed to be generated from an m-dimensional community-specific Gaussian distribution, where the first d components are modelled differently from the remaining m − d: the initial components x_{i,:d} are assumed to have unconstrained mean vector μ_k ∈ R^d and positive definite d × d covariance matrix Σ_k; in contrast, for x_{i,d:}, two constraints are imposed: the mean is the (m − d)-dimensional vector of zeros, and the covariance is a diagonal matrix with positive entries σ²_k = (σ²_{k,d+1}, ..., σ²_{k,m}). The validation of the model assumptions will be discussed in Section 7.1. For mathematical convenience, conjugate priors can be placed on the parameters as follows:

x_{i,:d} | d, z_i = k, μ_k, Σ_k ∼ N_d(μ_k, Σ_k),
x_{i,d:} | d, z_i = k, σ²_k ∼ N_{m−d}(0, diag(σ²_k)),
(μ_k, Σ_k) | d ∼ NIW_d(0, κ_0, ν_0 + d − 1, Δ_d),
σ²_{k,j} ∼ Inv-χ²(λ_0, σ²_{0j}), j = d + 1, ..., m,
z_i | θ ∼ Multinoulli(θ), θ | K ∼ Dirichlet(α, ..., α),
d | K, z ∼ p(d | K_∅) with support {1, ..., K_∅}, K ∼ p(K),    (3.1)

where K_∅ ≤ K is the number of nonempty communities. Note that the inverse Wishart has been partially re-parametrised using a parameter ν_0 > 0; the dimension of the corresponding matrix is then added to obtain the required constraint ν_0 + d − 1 > d − 1 on the degrees of freedom ν (in this case, ν = ν_0 + d − 1) of the distribution. Also, note that m can be generically chosen to be equal to K, when fixed, or equal to n to have the maximum possible dimension of the embedding. Since one of the specific objectives of the analysis is to learn the number of components of the mixture, rather than density estimation, the widely used infinite Bayesian nonparametric mixtures are not appropriate in this context (Miller and Harrison, 2014); a prior on the number of communities K is therefore used in this paper. Additionally, as a second difference from Yang et al. (2019), the full model proposed here will also incorporate a second-level community cluster structure on the vectors of variances σ²_k, which will be introduced in Section 5.
A cartoon representation of the model is given in Figure 1. Note that the condition d ≤ K is explicitly enforced in (3.1); more specifically, d ≤ K_∅, which avoids an artificial matching between d and K through empty clusters, which are given non-zero probability mass under the Dirichlet-Multinoulli prior on (θ, z). One can also model d and K separately in an analogous way, changing the prior p(d) to, for example, a distribution supported on {1, ..., m} (3.2), independently of K and z; this will later be referred to as the unconstrained model. The alternative prior (3.2) is particularly useful in practical applications and provides a useful interpretation of d: when d ≤ K, then d = rank(B), but when d > K, the observed data might deviate from the stochastic blockmodel assumption, providing a useful diagnostic for model validation and goodness-of-fit. The likelihood associated with the spectral embedding X ∈ R^{n×m} obtained from a stochastic blockmodel can be expressed as

p(X | d, z, {μ_k, Σ_k, σ²_k}) = ∏_{i=1}^n φ_d(x_{i,:d}; μ_{z_i}, Σ_{z_i}) φ_{m−d}(x_{i,d:}; 0, diag(σ²_{z_i})),

where φ(·) denotes the (possibly multivariate) Gaussian density function. Hence the posterior, up to a normalising constant, has the form of this likelihood multiplied by the priors in (3.1). The NIW_d(0, κ_0, ν_0 + d − 1, Δ_d) prior on the pair (μ_k, Σ_k) is conjugate, and yields a normal inverse Wishart conditional posterior given (d, z, X_{:d}). By standard methods for inference in a multivariate Gaussian mixture model with NIW prior, the covariance matrix Σ_k can be explicitly integrated out from the posterior to obtain the density of a multivariate Student t distribution with ν_{n_k} degrees of freedom (3.3). Hence, μ_k can easily be resampled in a simple Gibbs sampling step, conditional on the current value of d and on the community allocations z. In this work, the location vectors μ_k are collapsed out too, but the distribution (3.3) is instructive for presenting other distributional results below, and could also be used if the objective of the analysis is to recover the explicit form of the latent positions.
In a multivariate Gaussian model with normal inverse Wishart prior, it is also possible to analytically express the marginal likelihood of the observed data.Here, conditioning on a community-speci c Gaussian, on the assignments z and on the dimension of the latent space d: where :d is the subset of rows of X :d such that z i = k.Given the Inv-χ 2 (λ 0 , σ 2 0 ) prior, the posterior for σ 2 j,k is Inv-χ 2 (λ n k , s , where Similar calculations give the full marginal likelihood for the remaining portion of the embedding Finally, if d is considered as a nuisance parameter: which is easily computed using (3.4) and (3.5).Also, note that the probabilities θ associated to the community assignment can be easily integrated out, resulting in the following marginal likelihood, conditional on K: The distributional results presented in (3.3), (3.4), and (3.6) (for a proof, see, for example, Murphy, 2007) are the building blocks for the MCMC sampler which is used to make Bayesian inference on the model parameters of interest.
Since the full posterior is not analytically tractable, inference is performed using MCMC sampling with trans-dimensional moves (Green, 1995). The main objective of the analysis is to cluster the nodes; therefore the locations μ_k, the variance parameters Σ_k and σ²_k, and the community probabilities θ are considered as nuisance parameters and integrated out. Essentially, in this type of collapsed Gibbs sampler (Liu, 1994), four moves are available (Richardson and Green, 1997; Zhang et al., 2004), described in the subsequent four subsections.

Change in the community allocations
A fully collapsed Gibbs update for each community assignment is available:

p(z_i = k | z_{−i}, X, d, K) ∝ p(z_i = k | z_{−i}, K) p(x_i | {x_j}_{j≠i: z_j = k}, d).    (4.1)

In the special case where d = K_∅ and n_{z_i} = 1, the full conditional distribution for z_i assigns probability one to retaining the same value, since the model does not permit d > K_∅. Otherwise, from (3.6), the first term p(z_i = k | z_{−i}, K) is the Dirichlet-multinomial predictive probability of cluster k, with the cluster counts computed excluding node i. Similarly, the remaining term in (4.1), p(x_i | {x_j}_{j≠i: z_j = k}, d), can be obtained as the ratio of marginal likelihoods

p(x_i | {x_j}_{j≠i: z_j = k}, d) = p(X^{(k)} | d) / p(X^{(k),−i} | d),    (4.3)

where the additional superscript −i denotes a cluster quantity that is computed excluding the allocation z_i of the i-th node. The ratio (4.3) can be decomposed as the product of two ratios of marginal likelihoods. Using (3.4), the first ratio is equivalent to a multivariate Student t density (4.4) (Murphy, 2007). The second ratio, which accounts for the last m − d dimensions, is a product of univariate Student t densities (4.5).

Split or merge two communities
To vary the number of communities, move proposals inspired by sequentially-allocated merge-split sampling (Dahl, 2003) are used here, but alternative choices are available (see, for example, Jain and Neal, 2004; Bouchard-Côté et al., 2017). Two indices i and j are sampled at random from the n nodes and, without loss of generality, assume z_i ≤ z_j. If z_i = z_j, then the single cluster is split, whereas if z_i < z_j the two clusters are merged. In both move types, node i remains in the same cluster, denoted z'_i = z_i. In the merge move, all elements of cluster z_j are reassigned to cluster z_i (with any higher-indexed clusters subsequently decremented). For the split move, node j is first reassigned to cluster K' = K + 1, with new allocation z'_j = K'; then, in random order, the remaining nodes currently allocated to cluster z_i are randomly reassigned to clusters z'_i or K' with probability proportional to their predictive distribution from the generative model (4.3). Denoting by q(K', z' | K, z) the resulting product of renormalised predictive densities from these reallocations, the acceptance probability for a split move, for example, is

min{1, [p(X | d, K', z') p(d | K', z') p(z' | K') p(K')] / [p(X | d, K, z) p(d | K, z) p(z | K) p(K) q(K', z' | K, z)]}.    (4.6)

Note that the ratio of densities for X in the acceptance ratio (4.6) depends only upon the rows of the matrix corresponding to the cluster being split (or, similarly, merged), and furthermore these expressions decompose as products of terms for the first d and remaining m − d components (cf. (4.4), (4.5)).

Create or remove an empty community
Adding or removing an empty community whilst fixing z corresponds to proposing K' = K + 1 or K' = K − 1 respectively, although the latter proposal is not possible if K = K_∅, meaning there are currently no empty communities. The acceptance probability is simply

min{1, q_∅ [p(K') p(z | K')] / [p(K) p(z | K)]},

where the proposal ratio is q_∅ = 2 if K = K_∅, q_∅ = 0.5 if K' = K_∅, and q_∅ = 1 otherwise.

Change in the latent dimension
This move is only required when the latent dimension d is not marginalised out. A new value d' is proposed from a distribution q(d' | d) and accepted with probability

min{1, [p(X | d', K, z) p(d' | K, z) q(d | d')] / [p(X | d, K, z) p(d | K, z) q(d' | d)]}.
Notably, if d' > d, the ratio p(X | d', K, z)/p(X | d, K, z) only depends on the first d' components of the embedding, since the last m − d' components remain independent by (3.1).

Inferring communities
Markov chain Monte Carlo samplers for mixture models with a varying number of clusters are well known to be affected by label switching (Jasra et al., 2005), since the likelihood is invariant to permutations of the cluster labels. However, the estimated posterior similarity between nodes i and j,

π̂_ij = (1/M) Σ_{s=1}^M 1{z_i^{(s)} = z_j^{(s)}},

computed across M MCMC samples, is invariant to label switching. Communities can be estimated from the MCMC chains using the posterior similarity matrix and the PEAR method (maximisation of the posterior expected adjusted Rand index; Fritsch and Ickstadt, 2009, R package mcclust). Alternatively, if a configuration with a fixed number of clusters K is required, the clusters can be estimated using hierarchical clustering with average linkage, using 1 − π̂_ij as the distance measure (Medvedovic et al., 2004). Many alternatives to this method have been proposed in the literature (Rand, 1971; Binder, 1978; Dahl, 2006).
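Computing π̂ from stored MCMC allocations is straightforward; a numpy sketch (function name illustrative):

```python
import numpy as np

def posterior_similarity(Z):
    """Estimated posterior similarity matrix from MCMC output.

    Z is an M x n integer array whose s-th row holds the community
    allocations z^(s); entry (i, j) of the result estimates P(z_i = z_j)
    and is invariant to permutations of the cluster labels."""
    M, n = Z.shape
    psm = np.zeros((n, n))
    for z in Z:
        psm += z[:, None] == z[None, :]   # indicator of co-clustering
    return psm / M
```

The matrix 1 − π̂ can then be passed as a distance matrix to average-linkage hierarchical clustering, as described above.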

Second-level clustering of community variances
Empirical analysis of the model (3.1) for simulations from the stochastic blockmodel shows that identifying and clearly separating the K clusters in X_{d:} is particularly difficult for the sampler when d ≪ m. The problem is particularly evident when m = n and d is small. In this case, it has been assessed empirically that the within-cluster variances of the true communities in simulated datasets seem to converge to similar values, such that σ²_{k,j} ≈ σ²_{ℓ,j} for j ≫ d and k ≠ ℓ. Therefore, when m is large enough, the selected model tends to be under-specified: the correct dimension d is identified, but the true number of communities K is underestimated. This is also one of the main reasons why it is not advisable to directly fit a Gaussian mixture model on X ∈ R^{n×m} and allow K to be random, ignoring the role of d.
The problem is illustrated in Figure 2, which shows the results of performing ASE for a simulation of n = 500 nodes from a stochastic blockmodel containing five equally probable communities with well-separated mean locations. The within-cluster variances of three of the five communities fluctuate around the same values on each dimension larger than d. For dimensions larger than approximately 150, four of the five clusters have the same variance on the subsequent dimensions. Therefore, when m = n, the MCMC sampler selects the MAP estimate K̂ = 2 for parsimony, and increases the variance of the Gaussian distributions on the first two dimensions, on which the clusters are well separated.
The solution proposed here is to assume shared variance parameters between some of the clusters for dimensions larger than d. Specifically, each community k ∈ {1, ..., K} is assigned a second-level cluster allocation v_k ∈ {1, ..., H}, with H ≤ K. If v_k = v_ℓ, then σ²_{k,j} = σ²_{ℓ,j} for all j > d. Formally, the second-level allocations v_k are given a Dirichlet-Multinoulli prior with mixing probabilities φ, analogous to the prior on (θ, z). Essentially, the vector v = (v_1, ..., v_K) defines a clustering of communities. Figure 3 depicts a cartoon example of this extended model. Note that if H = 1 and all the communities are assigned to the same second-level cluster, the problem of selecting d essentially reduces to an ordinal version of the feature selection problem in clustering (Raftery and Dean, 2006; Maugis et al., 2009).
Under this extended model, the posterior distribution for σ²_{k,j} changes due to the aggregation of communities in the second level. Under the Inv-χ²(λ_0, σ²_{0j}) prior, the posterior is again a scaled inverse chi-squared distribution, with parameters now computed from all the communities sharing the same second-level allocation. Calculations similar to (3.5) give the correct form of the marginal likelihood for the right-hand side of the matrix, restricted to a given value of v_k. Clearly, φ can again be marginalised out, yielding the corresponding Dirichlet-multinomial marginal likelihood for v.
The MCMC sampler described in Section 4 must be slightly adapted. For the Gibbs sampling move in Section 4.1, the product of univariate Student t densities in (4.5) is modified using the appropriate parameter pairs implied by the second-level allocations. For the change in dimension, the likelihood p(X | d, K, z, v) should be computed using the shared variances and the allocations v. When an empty community is proposed, as in Section 4.3, the ratio p(v' | K')/p(v | K) must be added, limited to the second-level allocation of the additional community; the value v_k for the proposed empty cluster can simply be chosen at random from {1, ..., H}. Finally, for the split-merge move in Section 4.2, if z_i = z_j for the two selected nodes, then v_{z'_i} = v_{z'_j} after the split move. Alternatively, if z_i ≠ z_j, then the new value of v_k after the merge is sampled at random from {v_{z_i}, v_{z_j}}.
Finally, three additional moves are required: resampling the second-level cluster allocations v using a Gibbs sampling step; proposing a second-level split-merge move; and adding or removing an empty second-level cluster. When φ and the parameters of the Gaussian distributions are marginalised out, the second-level allocations are resampled according to

p(v_k = h | v_{−k}, X, d, K, z) ∝ p(v_k = h | v_{−k}) p(X^{(k)}_{d:} | {X^{(ℓ)}_{d:}}_{ℓ≠k: v_ℓ = h}, d, z),    (5.1)

where the independence assumption between X_{:d} and X_{d:} is used. The calculations for the second term in (5.1) are similar to (4.3), expressed as a ratio of marginal likelihoods, which can be computed using (3.5). The second-level split-merge move and the proposal of an empty second-level cluster follow the same guidelines as in Sections 4.2 and 4.3.
Potentially, the model could be extended further using the same reasoning: from the plot in Figure 2, it is clear that the different clusters begin to share the same variance at different points in the plot. Empirically, all the variances approximately converge to the same values at large dimensions, and it is therefore possible to identify a (K − 1)-vector of discrete points in {d, d + 1, ..., m} at which different community variances coalesce. For the plot in Figure 2, such a vector could be (d, d, d, 150, n), with d = 2 and n = m = 500.

Extension to directed and bipartite graphs
In a directed graph G = (V, E), (i, j) ∈ E does not necessarily imply (j, i) ∈ E, meaning the corresponding adjacency matrix A ∈ {0, 1}^{n×n} is not, in general, symmetric. Directed graphs are useful for representing directed interaction networks, such as email traffic patterns: knowing that individual i broadcasts emails to individual j does not immediately imply that j also issues communications to i.
Definition 4 (Adjacency embedding of a directed graph). Given a directed graph with adjacency matrix A ∈ {0, 1}^{n×n}, and a positive integer d, 1 ≤ d ≤ n, consider the singular value decomposition A = ÛD̂V̂ᵀ + Û⊥D̂⊥V̂⊥ᵀ, where D̂ ∈ R₊^{d×d} is the diagonal matrix containing the top d singular values in decreasing order, Û ∈ R^{n×d} and V̂ ∈ R^{n×d} contain the corresponding left and right singular vectors, and the matrices D̂⊥, Û⊥ and V̂⊥ contain the remaining n − d singular values and vectors. The d-dimensional directed adjacency embedding of A in R^d is defined as the pair X̂ = ÛD̂^{1/2}, X̂' = V̂D̂^{1/2}.
Writing X = UD^{1/2} and X' = VD^{1/2}, the rows of X characterise the activities of each node as a source, and the rows of X' characterise the same nodes as destinations.
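A sketch of Definition 4 via the truncated SVD (function name illustrative):

```python
import numpy as np

def directed_adjacency_embedding(A, d):
    """Directed adjacency embedding: the pair (X, X') = (U D^{1/2}, V D^{1/2})
    from the d-truncated SVD of A; rows of X describe nodes as sources,
    rows of X' describe them as destinations. A may be rectangular, as in
    the bipartite case."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    sqrt_s = np.sqrt(s[:d])             # D^{1/2} on the top d singular values
    return U[:, :d] * sqrt_s, Vt[:d].T * sqrt_s
```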
The model in (3.1) can easily be adapted to directed graphs. Treating the embeddings X and X' as independent, each is modelled separately using the same Gaussian structure and prior distributions (3.1), except for three parameters which are initially assumed common to both embeddings: the latent dimension d, the number of communities K, and the vector z of node assignments to those communities.
In some contexts it will be more relevant to consider different community membership structures for the same set of nodes when considering them as source nodes or destination nodes. In this case, let K denote the number of source communities and K' the number of destination communities; similarly, let z denote the assignments of nodes to source communities, and z' the allocations to destination communities. Jointly learning z and z' (as well as d) is a problem commonly known as co-clustering, and the corresponding network model is known as the stochastic co-blockmodel (ScBM) (Rohe et al., 2016). Given an asymmetric matrix B ∈ [0, 1]^{K×K'}, then P(A_ij = 1) = B_{z_i z'_j}. From a random dot product graph perspective, it is assumed that B_{z_i z'_j} is obtained as an inner product of community-specific latent positions for sources and destinations. The Bayesian model for ScBMs can easily be represented as a separate model for X and X', of the form given in (3.1), with the latent dimension d of the embedding now the only common parameter.
Inference via MCMC can be performed in an equivalent way to the method described in Section 4; the only difference is in the expression of the acceptance ratio for a change in the shared latent dimension d, but the procedure can exploit the results obtained in Section 4.4, using the independence of the two embeddings, so that the likelihood factorises as p(X | d, ·) p(X' | d, ·), where all the marginal likelihoods can be equivalently obtained from (3.4). Furthermore, the model can be appropriately modified when d ≪ m to include the second-level cluster allocations proposed in Section 5. Finally, in bipartite graphs, the observed nodes can be partitioned into two sets V and V', with V ∩ V' = ∅. Assume that V plays the role of the set of source nodes and V' of the set of destination nodes. Bipartite graphs are usually represented by a rectangular bi-adjacency matrix A ∈ {0, 1}^{n×n'}, with n = |V| and n' = |V'|. In this case, it is still possible to apply the methods described in this section to the SVD embedding obtained from the rectangular matrix A. Note that the ScBM extends trivially to the bipartite graph case, which is essentially a special case of a directed graph, with the cluster configurations for source and destination nodes now inescapably unrelated and each node possessing only one latent representation in R^d.

Applications and results
The Bayesian latent feature network models described in this article have been applied to both simulated and real world network data from undirected, directed and bipartite graphs. The real network data analysed are an undirected network obtained from the Santander bike hires in London, and the Enron Email Dataset; details are given in the corresponding sections.
The model and MCMC sampler have been tested using different combinations of the hyperparameters, showing robustness to the prior choice. Inferential performance is sensitive to extreme values of the variance parameters, relative to their prior mean, but otherwise robust. In practice, the expectation of the prior for the variance parameters should therefore be chosen to be on the same scale as the observed data. The default settings for the MCMC sampler used in the next sections are the usual uninformative values κ0 = ν0 = λ0 = α = β = 1, and ω = δ = 0.1. For the proposal of a change in dimension (cf. Section 4.4), ξ = 0.8. The algorithms were run for a total of M = 500,000 samples with a burn-in of 25,000, for a number of different chains to check for convergence. The cluster configuration has been initialised using K-means for some pre-specified K, usually chosen according to the scree-plot criterion. The second-level clusters have been initialised setting H = K. In order to set the prior covariances to a realistic value, the correlations in the ∆d matrices are set to zero, and the elements on the diagonal of ∆d to the average within-cluster variance based on the K-means cluster configuration. Similarly, the prior σ²0j values have been set to the total variance of the corresponding column of the embedding.
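The initialisation described above can be sketched as follows. This is a minimal illustration, not the authors' code: it uses scikit-learn's KMeans on a hypothetical m = 5 dimensional embedding, returns the initial allocations, a diagonal ∆ built from the average within-cluster variances, and column-wise total variances for the σ²0j priors.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))  # hypothetical m-dimensional embedding, m = 5

def initialise(X, K):
    """K-means initialisation of the cluster configuration, with data-driven
    prior scales: correlations set to zero (diagonal Delta), diagonal entries
    set to the average within-cluster variance, and sigma^2_{0j} set to the
    total variance of each embedding column."""
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    z = km.labels_
    within = np.mean([X[z == k].var(axis=0) for k in range(K)], axis=0)
    sigma2_0 = X.var(axis=0)
    return z, np.diag(within), sigma2_0

z, Delta, sigma2_0 = initialise(X, K=3)
```

Setting the prior scales from the K-means fit keeps them on the same scale as the observed embedding, matching the robustness guidance above.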

Synthetic data and model validation
A stochastic blockmodel can be simulated starting from a matrix B ∈ [0, 1]^(K×K) containing the probabilities of connection between communities, and a vector θ of community allocation probabilities. Here, each element B_kl of B was generated from a Beta(1.2, 1.2) distribution, which produces communities with a moderate level of separation. For an undirected graph, K = K′ and the constraint B_kl = B_lk is imposed; similarly, for directed graphs with a shared cluster configuration (cf. Section 6), K = K′.
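The simulation scheme above can be written as a short numpy sketch (an illustration of the standard SBM generative process, with the Beta(1.2, 1.2) draws symmetrised by averaging for the undirected case):

```python
import numpy as np

def simulate_sbm(n, B, theta, rng=None, directed=False):
    """Simulate an SBM adjacency matrix from inter-community probabilities
    B (K x K) and community allocation probabilities theta."""
    rng = np.random.default_rng(rng)
    K = len(theta)
    z = rng.choice(K, size=n, p=theta)   # community allocations
    P = B[z][:, z]                       # edge probabilities P_ij = B_{z_i z_j}
    A = (rng.random((n, n)) < P).astype(int)
    if not directed:                     # symmetrise: keep upper-triangular draws
        A = np.triu(A, 1)
        A = A + A.T
    np.fill_diagonal(A, 0)               # hollow matrix: no self-loops
    return A, z

rng = np.random.default_rng(42)
B = rng.beta(1.2, 1.2, size=(5, 5))
B = (B + B.T) / 2                        # impose B_kl = B_lk for an undirected graph
A, z = simulate_sbm(500, B, theta=np.full(5, 0.2), rng=0)
```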
The latent dimension d can be interpreted as the rank of the matrix B, and a random matrix B generated from independent beta draws has full rank with probability 1. Therefore, to simulate d < K, a low-rank approximation of B must be used to generate the embedding. For undirected graphs, a truncated spectral decomposition can be used: B̃ = Γ_d Λ_d Γ_d⊤ (recall Definition 1). Similarly, for the directed and bipartite graphs, the truncated SVD is an appropriate approximation: B̃ = U_d D_d V_d⊤ (see Definition 4). Note that under this low-rank approximation, it must be checked that each element B̃_kl ∈ [0, 1].
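For the undirected case, the truncated spectral decomposition and the [0, 1] check can be sketched as below; the 3×3 block probability matrix is hypothetical, chosen only so that its rank-2 truncation stays in [0, 1].

```python
import numpy as np

def low_rank_B(B, d):
    """Rank-d approximation of a symmetric block probability matrix via the
    truncated spectral decomposition B ~ Gamma_d Lambda_d Gamma_d^T, keeping
    the d eigenvalues largest in absolute value. Raises if the approximation
    leaves [0, 1], in which case B should be redrawn."""
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(np.abs(vals))[::-1][:d]
    B_d = vecs[:, idx] @ np.diag(vals[idx]) @ vecs[:, idx].T
    if (B_d < 0).any() or (B_d > 1).any():
        raise ValueError("rank-d approximation leaves [0, 1]; redraw B")
    return B_d

# hypothetical full-rank B with K = 3, truncated to d = 2
B = np.array([[0.6, 0.4, 0.5],
              [0.4, 0.6, 0.5],
              [0.5, 0.5, 0.7]])
B2 = low_rank_B(B, d=2)
```

The rank-deficient B̃ can then be passed to the SBM simulator in place of B to generate graphs with d < K.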
Figure 4 shows results for a synthetic undirected graph with d = 2 and K = 5. The scatterplot of the first two columns of the adjacency embedding X, plotted in Figure 4a, shows well-separated clusters, which can be suitably modelled using a Gaussian mixture. Figure 4b shows the next two dimensions. Clearly, the community mean locations are significantly different from zero in just the first two dimensions, and this is further illustrated in Figure 4c. Similarly, in Figure 4d, the difference between the within-cluster and overall variance is evident only in the first two dimensions, after which the two quantities are of the same order of magnitude. The within-cluster variances differ across communities, suggesting that it is appropriate to have cluster-specific values of σ²_{j,k} for j > d; this phenomenon can also be witnessed in Figure 4b. Nevertheless, it also seems appropriate to use a second-level clustering with H = 3, since the variances of three of the five communities are approximately the same for dimensions larger than d = 2. Also, in Figure 4e, the within-cluster correlations between X1 and X2 suggest dependence for at least one of the clusters. On the other hand, the sample within-cluster correlations for X_{d:} tend to be small and centred around 0, suggesting that the assumption of independence is appropriate in that part of the model (3.1). Finally, from Figure 4f, the marginal likelihood strongly favours the true value d = 2, resulting in a posterior distribution essentially consisting of a point mass at the true value.
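The diagnostics behind Figures 4c and 4d can be sketched numerically: for each embedding dimension, compare the cluster mean locations with zero, and the average within-cluster variance with the overall variance. The embedding below is a hypothetical stand-in, with signal in the first two columns only.

```python
import numpy as np

def dimension_diagnostics(X, z):
    """Per-dimension diagnostics: cluster mean locations, and the ratio of
    average within-cluster variance to overall variance. For columns beyond
    the true latent dimension d, means should be near zero and the variance
    ratio near one."""
    K = z.max() + 1
    means = np.array([X[z == k].mean(axis=0) for k in range(K)])
    within = np.mean([X[z == k].var(axis=0) for k in range(K)], axis=0)
    return means, within / X.var(axis=0)

# hypothetical embedding: 3 communities, signal in the first 2 of 6 columns
rng = np.random.default_rng(3)
z = rng.integers(0, 3, size=300)
centres = rng.normal(scale=3, size=(3, 2))
X = np.hstack([centres[z] + rng.normal(scale=0.3, size=(300, 2)),
               rng.normal(scale=0.3, size=(300, 4))])
means, ratio = dimension_diagnostics(X, z)
```

Large drops in the ratio, and cluster means far from zero, both flag the informative dimensions, mirroring what Figures 4c and 4d display graphically.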
Figure 5 shows results for a simulated bipartite graph with separate community structures for nodes as sources and destinations, with d = 2, K = 5 and K′ = 3. Again, the clusters in Figures 5a and 5b are well-separated and can be easily estimated using the Gaussian mixture model, both for sources and receivers. From Figure 5c, the zero-mean assumption for the columns with index larger than d seems to hold even for a relatively small number of nodes per community. A similar plot can be produced for X′, showing a similar pattern. In Figure 5d, the marginal likelihood strongly favours the true value d = 2, which again results in a point mass posterior centred at the true value. Similar considerations hold for variances and correlations, with results similar to the plots in Figures 4d and 4e for the undirected graph.
Overall, for synthetic data, the model seems robust and able to detect the correct d and K in a variety of different settings.

Undirected graphs: Santander bikes
The Santander Cycle hire scheme is a bike sharing system implemented in central London. Transport for London periodically releases data on the bike hires.1 Considering this as a network, the nodes correspond to bike sharing stations, and an undirected edge between stations i and j is drawn if at least one ride between the two stations is completed within the time period considered. In this example, one week of data was considered, from 5 September until 11 September 2018. The total number of stations used is n = 783, and the total number of undirected edges is |E| = 96,060, implying the adjacency matrix is fairly dense.
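The construction of the adjacency matrix from the ride records can be sketched as follows; the trip list is hypothetical, and station identifiers are assumed to be integer-encoded.

```python
import numpy as np

def adjacency_from_trips(trips, n):
    """Build a binary, symmetric, hollow adjacency matrix from a list of
    (start_station, end_station) rides: A_ij = 1 if at least one ride between
    stations i and j was completed in the observation window."""
    A = np.zeros((n, n), dtype=int)
    for i, j in trips:
        if i != j:                  # skip rides starting and ending at the same dock
            A[i, j] = A[j, i] = 1   # undirected: a ride in either direction suffices
    return A

# hypothetical ride records between 4 stations
trips = [(0, 1), (1, 0), (2, 3), (1, 1), (0, 2)]
A = adjacency_from_trips(trips, n=4)
```

Repeated rides between the same pair contribute a single edge, which is why a week of data already yields a fairly dense binary adjacency matrix.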
The results of the Bayesian inferential procedure, using the unconstrained prior (3.2) for d, applied to the adjacency and Laplacian embeddings for the Santander bike network are presented in Figure 6. The initial value of K was set to 10, with m = 25, but similar estimates were obtained using different starting points for K and different values of m. It is interesting to note the different shapes of the posterior barplots of K∅ and H∅, Figures 6a and 6b, showing that the second-level clustering is crucial to obtain a more accurate model fit when the adjacency embedding is used. On the other hand, for the Laplacian embedding, the posteriors for K∅ and H∅ are extremely similar, suggesting that the second-level clustering is not required for m = 25. The maximum a posteriori (MAP) values d = 11 (adjacency) and d = 12 (Laplacian) clearly correspond to the elbows in the corresponding scree-plots in Figures 6c and 6d.
Note that, especially in the case of the adjacency embedding, d and K have similar values, showing that the graph might be well described by a stochastic blockmodel. Similarly, the constrained model results in a larger number of small clusters; the posterior of K∅ essentially reduces to the rescaled probability mass function obtained from the unrestricted model, constrained such that d ≤ K∅, since the posterior for d is approximately a point mass.
The resulting estimated clustering for the unconstrained model (3.2) based on the adjacency embedding and K = 11 (the MAP for d), plotted in Figure 7, shows a clear structure: the largest communities have approximately the same extension, with a few exceptions. This is expected, since the bikes are free for the first 30 minutes and have limited speed, and are therefore used for short journeys. Two clusters are significantly smaller than the others, and correspond to tourist areas around Westminster, Covent Garden and Buckingham Palace. On the other hand, two clusters have a large geographical extension, and cover the East and West London areas. For the adjacency embedding, the MAP clustering obtained from the restricted model is almost identical. The PEAR method (Fritsch and Ickstadt, 2009) suggests K = 7 communities instead. Similarly, if the Laplacian embedding is used, the MAP clustering structure suggested by PEAR has K = 7 communities for the unconstrained model (3.2) for d, and K = 12 for the constrained model (3.1).
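The point estimate of the cluster allocations used in Figure 7 — hierarchical clustering with distance 1 − π̂ij, where π̂ij is the posterior probability that nodes i and j share a cluster — can be sketched with scipy. The posterior similarity matrix below is a small hypothetical example with two clear blocks; average linkage is an assumption, as the linkage rule is not specified here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_from_psm(psm, K):
    """Point estimate of the cluster allocations from a posterior similarity
    matrix psm (psm[i, j] estimates P(z_i = z_j | data)), via hierarchical
    clustering with distance 1 - psm, as in Medvedovic et al. (2004)."""
    dist = 1.0 - psm
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=K, criterion="maxclust")

# hypothetical posterior similarity matrix: two clear blocks of 3 nodes
psm = np.full((6, 6), 0.1)
psm[:3, :3] = psm[3:, 3:] = 0.9
np.fill_diagonal(psm, 1.0)
labels = cluster_from_psm(psm, K=2)
```

The same similarity matrix also feeds PEAR-style criteria, which choose K by maximising the expected adjusted Rand index rather than fixing it in advance.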

Directed graphs: Enron Email Dataset
Next, the algorithm is applied to a directed network: the Enron Email Dataset.2 The Enron database is a corpus of emails sent by the employees of the Enron corporation. The version of the Enron data analysed here is described in Priebe et al. (2005), and consists of n = 184 nodes and 3,010 directed edges. A directed edge i → j is drawn if employee i sent an email to employee j.
The results of analysing this network are presented in Figure 8. The initial value of K was set to 10, with m = 25, but again similar results were obtained using different starting points for K and different values of m. The plots in Figures 8a and 8b show the posteriors obtained without the second-level clustering. With the second-level clustering, the posteriors for K∅, presented in Figures 8c and 8d, have a more symmetric shape, and the MAP latent dimension is d = 6. As before, the MAP for K is d + 1 = 7, providing some evidence for the possibility rank(B) < K.
From Figure 8e, the selected MAP values d = 6 and d = 9 for the models with and without second-level clustering seem to be a trade-off between the two most popular criteria for selection of the appropriate latent dimension: the eigengap heuristic suggests d = 5 if the second-largest difference is considered, and the elbow in the scree-plot is approximately located around d ≈ 15.
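The eigengap heuristic referred to above can be sketched in a few lines; the singular values below are hypothetical, constructed so that the largest gap sits after the first value and the second-largest gap suggests d = 5, as in the Enron discussion.

```python
import numpy as np

def eigengap(singvals, skip_first=0):
    """Eigengap heuristic: propose d at the largest gap between consecutive
    singular values (optionally skipping past the first, often dominant,
    gap, as done for the Enron network above)."""
    gaps = np.diff(singvals)      # negative, since singular values decrease
    order = np.argsort(gaps)      # most negative first = largest drops
    return order[skip_first] + 1  # +1: d counts dimensions, not gaps

# hypothetical decreasing singular values of an adjacency matrix
s = np.array([9.0, 6.5, 6.0, 5.8, 5.7, 3.4, 3.0, 2.9])
print(eigengap(s))                # 1: biggest drop is after the first value
print(eigengap(s, skip_first=1))  # 5: second-largest gap suggests d = 5
```

Unlike the posterior for d, the heuristic gives no uncertainty quantification, which is one motivation for the Bayesian treatment.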

Conclusion
In this article, a novel Bayesian model has been proposed for automatic and simultaneous estimation of the number of communities and latent dimension of stochastic blockmodels, interpreted as special cases of generalised random dot product graphs. The Bayesian framework allows the number of communities K and latent dimension d to be treated as random variables, with associated posterior distributions. The postulated model is based on asymptotic results in the theory of network embeddings and random dot product graphs, and has been validated on synthetic datasets, showing good performance at recovering the latent parameters and communities. The model has been extended to directed and bipartite graphs, using SVD embeddings and allowing for co-clustering.
Overall, the main advantage of the proposed methodology is to allow an arbitrarily large value of m, the number of columns (dimension) of the embedding at the first stage of the analysis, and then to treat d and K separately, allowing for the case d = rank(B) < K, which is often overlooked in the literature. Problems arising from overspecification of m are tackled using a second-level clustering procedure. Also, the model provides an automated procedure and criterion to select the dimension of the embedding and an appropriate number of communities. If d is not constrained to be less than or equal to K, the model also provides empirical evidence of the goodness-of-fit of a stochastic blockmodel for the observed data. Results on real world network datasets show encouraging performance in recovering the correct d, when compared to commonly used heuristic methods, and the community structure.

Figure 1: Cartoon example for the generating process of the embedding of an 11-node stochastic blockmodel GRDPG with K = 5 communities and latent dimension d = 3.

Figure 3: Cartoon for the generating process of the embedding of an 11-node stochastic blockmodel GRDPG with K = 5 communities and latent dimension d = 3, with common variance for communities 1-2 and 3-4 on the right hand side of the matrix.

Figure 6: Posterior distributions and scree-plots for the Santander bike network data using adjacency and Laplacian embeddings, for the unconstrained model (3.2). MAP estimates of d are plotted in red.

Figure 7: Santander bike sharing stations in London and maximum a posteriori estimates of the cluster allocations of the stations, obtained using hierarchical clustering with distance 1 − π̂ij (Medvedovic et al., 2004), K = 11. Stations in the same convex hull share the same cluster.

Figure 8: Posterior distributions of K∅, H∅ for the Enron data. MAP estimates of d are in red.