Latent structure blockmodels for Bayesian spectral graph clustering

Spectral embedding of network adjacency matrices often produces node representations lying approximately on low-dimensional submanifolds. In particular, hidden substructure is expected to arise when the graph is generated from a latent position model. Furthermore, the presence of communities within the network might generate community-specific submanifold structures in the embedding, but this is not explicitly accounted for in most statistical models for networks. In this article, a class of models called latent structure block models (LSBM) is proposed to address such scenarios, allowing for graph clustering when community-specific one-dimensional manifold structure is present. LSBMs focus on a specific class of latent space model, the random dot product graph (RDPG), and assign a latent submanifold to the latent positions of each community. A Bayesian model for the embeddings arising from LSBMs is discussed, and shown to perform well on simulated and real-world network data. The model is able to correctly recover the underlying communities lying on one-dimensional manifolds, even when the parametric form of the underlying curves is unknown, achieving remarkable results on a variety of real data.


Introduction
Network-valued data are commonly observed in many real-world applications. They are typically represented by graph adjacency matrices, consisting of binary indicators summarising which nodes are connected. Spectral embedding (Luo et al., 2003) is often the first preprocessing step in the analysis of graph adjacency matrices: the nodes are embedded onto a low-dimensional space via eigendecompositions or singular value decompositions. This work discusses a Bayesian network model for clustering the nodes of the graph in the embedding space, when community-specific substructure is present. Therefore, the proposed methodology can be classified among spectral graph clustering methodologies, which have been extensively studied in the literature. Recent examples include Priebe et al. (2019); Pensky and Zhang (2019); Yang et al. (2021); Sanna Passino et al. (2021). More generally, spectral clustering is an active research area, extensively studied from both a theoretical and an applied perspective (for some examples, see Couillet and Benaych-Georges, 2016; Hofmeyr, 2019; Zhu et al., 2019; Amini and Razaee, 2021; Han et al., 2021). In general, research efforts in spectral graph clustering (Rohe et al., 2011; Sussman et al., 2012; Lyzinski et al., 2014) have primarily focused on studying its properties under simple community structures, for example the stochastic block model (SBM, Holland et al., 1983). On the other hand, in practical applications, it is common to observe more complex community-specific submanifold structure (see, for example, Sanna Passino et al., 2021), which is not captured by spectral graph clustering models at present. In this work, a Bayesian model for spectral graph clustering under such a scenario is proposed, and applied to a variety of real-world networks.
The proposed methodology is able to successfully recover the underlying communities even in complex cases with substantial overlap between the groups. Recent work from Dunson and Wu (2021) also demonstrates the potential of Gaussian processes for manifold inference from noisy observations. Geometries arising from network models have been extensively studied in the literature (see, for example, Asta and Shalizi, 2015; McCormick and Zheng, 2015; Smith et al., 2019). The approach in this paper is also related to the findings of Priebe et al. (2017), where a semiparametric Gaussian mixture model with quadratic structure is used to model the embeddings arising from a subset of the neurons in the Drosophila connectome. In particular, the Kenyon Cell neurons are not well represented by a stochastic blockmodel, which is otherwise appropriate for other neuron types. Consequently, those neurons are modelled via a continuous curve in the latent space, estimated via semiparametric maximum likelihood (Kiefer and Wolfowitz, 1956; Lindsay, 1983). However, the real-world examples presented in Section 5 of this article require community-specific curves in the latent space for all the communities, not only for a subset of the nodes. Furthermore, estimation can be conducted within the Bayesian paradigm, which conveniently allows marginalisation of the model parameters.
This article is organised as follows: Section 2 formally introduces RDPGs and spectral embedding methods. Sections 3 and 4 present the main contributions of this work: the latent structure blockmodel (LSBM), and Bayesian inference for the communities of LSBM embeddings. The method is applied to simulated and real-world networks in Section 5, and the article is then concluded with a discussion in Section 6.
Random dot product graphs and latent structure models

Mathematically, a network G = (V, E) is represented by a set V = {1, 2, ..., n} of nodes and an edge set E, such that (i, j) ∈ E if node i forms a link with node j. If (i, j) ∈ E ⇒ (j, i) ∈ E, the network is undirected; otherwise it is directed. If the node set is partitioned into two sets V_1 and V_2, V_1 ∩ V_2 = ∅, such that E ⊆ V_1 × V_2, the graph is bipartite. A network can be summarised by the adjacency matrix A ∈ {0, 1}^{n×n}, such that A_ij = 𝟙_E{(i, j)}, where 𝟙_·{·} denotes the indicator function. Latent position models (LPMs, Hoff et al., 2002) for undirected graphs postulate that, conditional on node-specific latent positions x_1, ..., x_n ∈ X, the edges are sampled independently with P(A_ij = 1 | x_i, x_j) = κ(x_i, x_j), for a kernel function κ: X × X → [0, 1]. A special case of LPMs is the random dot product graph (RDPG, Young and Scheinerman, 2007), defined below.
Definition 1 (Random dot product graph). For an integer d > 0, let F be an inner product distribution on X ⊂ R^d, such that for all x, x′ ∈ X, 0 ≤ x⊤x′ ≤ 1. Also, let X = (x_1, ..., x_n) ∈ X^n and A ∈ {0, 1}^{n×n} be symmetric. Then (A, X) ∼ RDPG_d(F) if x_1, ..., x_n ∼ F independently and, independently for each pair i < j,

P(A_ij = 1 | x_i, x_j) = x_i⊤x_j.     (1)

Given a realisation of the adjacency matrix A, the first inferential objective is to estimate the latent positions x_1, ..., x_n. In RDPGs, the latent positions are inherently unidentifiable; in particular, multiplying the latent positions by any orthogonal matrix Q ∈ O(d), the orthogonal group of dimension d, leaves the inner product (1) unchanged: x⊤x′ = (Qx)⊤(Qx′). Therefore, the latent positions can only be estimated up to orthogonal transformations. Under such a restriction, the latent positions are consistently estimated via spectral embedding methods. In particular, the adjacency spectral embedding (ASE), defined below, has convenient asymptotic properties.
Definition 2 (ASE – Adjacency spectral embedding). For a given integer d ∈ {1, ..., n} and a binary symmetric adjacency matrix A ∈ {0, 1}^{n×n}, the d-dimensional adjacency spectral embedding (ASE) of A is

X̂ = [x̂_1, ..., x̂_n]⊤ = Γ Λ^{1/2},

where Λ is the d × d diagonal matrix containing the absolute values of the d largest eigenvalues of A in magnitude, in decreasing order, and Γ is the n × d matrix containing the corresponding orthonormal eigenvectors.
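As an illustration, the ASE of Definition 2 amounts to a few lines of linear algebra. The sketch below (helper name and simulation are illustrative, not from the paper) embeds a rank-one RDPG and recovers the latent positions up to sign.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """d-dimensional ASE: X_hat = Gamma |Lambda|^(1/2), from the d
    eigenvalues of A largest in magnitude (cf. Definition 2).
    Hypothetical helper, not from the paper's codebase."""
    evals, evecs = np.linalg.eigh(A)            # A is symmetric
    idx = np.argsort(np.abs(evals))[::-1][:d]   # top-d by magnitude
    return evecs[:, idx] * np.sqrt(np.abs(evals[idx]))   # n x d embedding

# Example: rank-one latent structure, recovered up to sign
rng = np.random.default_rng(0)
x = rng.uniform(0.2, 0.8, size=200)             # latent positions in [0,1]
P = np.outer(x, x)                              # P_ij = x_i x_j
A = (rng.uniform(size=P.shape) < P).astype(float)
A = np.triu(A, 1); A = A + A.T                  # symmetric, hollow
X_hat = adjacency_spectral_embedding(A, 1)
```

The leading scaled eigenvector of A is strongly correlated (up to sign, reflecting the orthogonal non-identifiability) with the true latent positions.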
Alternatively, the Laplacian spectral embedding is also frequently used, and considers a spectral decomposition of the modified Laplacian D −1/2 AD −1/2 instead, where D is the degree matrix. In this work, the focus will be mainly on ASE. Using ASE, the latent positions are consistently estimated up to orthogonal rotations, and a central limit theorem is available (Athreya et al., 2016;Rubin-Delanchy et al., 2017;Athreya et al., 2018).
Theorem 1 (ASE-CLT – ASE central limit theorem). Let (A^(n), X^(n)) ∼ RDPG_d(F), n = 1, 2, ..., be a sequence of adjacency matrices and corresponding latent positions, and let X̂^(n) be the d-dimensional ASE of A^(n). For an integer m > 0, and for sequences of points x_1, ..., x_m ∈ X and u_1, ..., u_m ∈ R^d, there exists a sequence of orthogonal matrices Q_1, Q_2, ... ∈ O(d) such that:

lim_{n→∞} P{ ⋂_{i=1}^m [ n^{1/2}(Q_n x̂_i^(n) − x_i) ≤ u_i ] } = ∏_{i=1}^m Φ{u_i, Σ(x_i)},

where Φ{·, ·} is the CDF of a d-dimensional normal distribution with mean zero and the given covariance matrix, and Σ(·) is a covariance matrix which depends on the true value of the latent position.
Therefore, informally, it could be assumed that, for n large, the estimated latent position x̂_i is approximately normally distributed around the true latent position x_i, up to an orthogonal transformation. Theorem 1 thus provides strong theoretical justification for the choice of normal likelihood assumed in the Bayesian model presented in the next section.
This article is mainly concerned with community detection in RDPGs when the latent positions lie on community-specific one-dimensional submanifolds S_k ⊂ R^d, k = 1, ..., K. The proposed methodology builds upon latent structure models (LSMs, Athreya et al., 2021), a subset of RDPGs. In LSMs, it is assumed that the latent positions of the nodes are determined by draws from a univariate underlying distribution G on [0, 1], inducing a distribution F supported on a structural support univariate submanifold S ⊂ R^d. In particular, the distribution F on S is the distribution of the vector-valued transformation f(θ) of a univariate random variable θ ∼ G, where f: [0, 1] → S is a function mapping draws from θ to S. The inverse function f^{-1}: S → [0, 1] could be interpreted as the arc-length parametrisation of S. In simple terms, each node is assigned a draw θ_i from the underlying distribution G, representing how far along the submanifold S the corresponding latent position lies, such that:

x_i = f(θ_i),  θ_i ∼ G.     (2)

Therefore, conditional on f, the latent positions are uniquely determined by the θ_i's. If the graph is directed or bipartite, each node is assigned two latent positions x_i and x′_i, and the random dot product graph model takes the form P(A_ij = 1 | x_i, x′_j) = x_i⊤x′_j. In this case, the singular value decomposition of A is used in place of the eigendecomposition.
Definition 3 (DASE – Directed adjacency spectral embedding). For an integer d ∈ {1, ..., n} and an adjacency matrix A ∈ {0, 1}^{n×n}, the d-dimensional directed adjacency spectral embeddings (DASE) X̂ = [x̂_1, ..., x̂_n]⊤ and X̂′ = [x̂′_1, ..., x̂′_n]⊤ of A are

X̂ = U S^{1/2},  X̂′ = V S^{1/2},

where S is the d × d diagonal matrix containing the d largest singular values of A, in decreasing order, and U and V are the n × d matrices containing the corresponding left and right singular vectors.
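A corresponding sketch for DASE uses the truncated singular value decomposition; the rectangular adjacency matrix mimics the bipartite case discussed below (all names illustrative, not from the paper).

```python
import numpy as np

def directed_spectral_embedding(A, d):
    """DASE (Definition 3): X_hat = U S^(1/2), X_hat' = V S^(1/2),
    from the top-d singular triplets of A. Illustrative sketch only."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    S_sqrt = np.sqrt(s[:d])
    return U[:, :d] * S_sqrt, Vt[:d, :].T * S_sqrt

rng = np.random.default_rng(1)
# Rectangular adjacency matrix, as for a bipartite graph
A = (rng.uniform(size=(120, 80)) < 0.3).astype(float)
X_hat, X_hat_prime = directed_spectral_embedding(A, 2)
```

By construction, X̂ X̂′⊤ equals the best rank-d approximation of A, so the inner product of the two embeddings reconstructs the truncated SVD.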
Note that ASE and DASE could also be interpreted as instances of multidimensional scaling (MDS, Torgerson, 1952; Shepard, 1962). Clearly, DASE could also be used on rectangular adjacency matrices arising from bipartite graphs. Furthermore, Jones and Rubin-Delanchy (2020) prove the equivalent of Theorem 1 for DASE, demonstrating that DASE also provides a consistent procedure for estimating the latent positions in directed or bipartite graphs. Analogous constructions to undirected LSMs could be posited for directed or bipartite models, where the x_i's and x′_i's lie on univariate structural support submanifolds.

Latent structure blockmodels
This section extends LSMs, explicitly allowing for community structure. In particular, it is assumed that each node is assigned a latent community membership z_i ∈ {1, ..., K}, i = 1, ..., n, and each community is associated with a different one-dimensional structural support submanifold S_k ⊂ R^d, k = 1, ..., K. Therefore, it is assumed that F = ∑_{k=1}^K η_k F_k is a mixture distribution with components F_1, ..., F_K supported on the submanifolds S_1, ..., S_K, with weights η_1, ..., η_K such that η_k ≥ 0 for each k and ∑_{k=1}^K η_k = 1. Given community allocations z = (z_1, ..., z_n), the latent positions are obtained as

x_i | z_i ∼ F_{z_i},

where, similarly to (2), F_{z_i} is the distribution of the community-specific vector-valued transformation f_{z_i}(θ) of a univariate random variable θ ∼ G, which is instead shared across communities. The vector-valued functions f_k = (f_{k,1}, ..., f_{k,d}): G → S_k, k = 1, ..., K, map the latent draw from the distribution G to S_k. The resulting model will be referred to as the latent structure blockmodel (LSBM). Note that, for generality, the support of the underlying distribution G is assumed here to be G ⊆ R. Furthermore, G is common to all the nodes, and the pair (θ_i, z_i), where θ_i ∼ G, uniquely determines the latent position x_i through f_{z_i}, such that x_i = f_{z_i}(θ_i). Note that, in the framework described above, the submanifolds S_1, ..., S_K are one-dimensional, corresponding to curves, but the LSBM could be extended to higher-dimensional underlying subspaces, postulating draws from a multivariate distribution G supported on G ⊆ R^p with 1 ≤ p < d. Since the latent structure blockmodel is a special case of the random dot product graph, the LSBM latent positions are also estimated consistently via ASE, up to orthogonal transformations, conditional on knowing the functions f_k(·).
Therefore, by Theorem 1, approximately:

x̂_i | z_i, θ_i ∼ N_d{f_{z_i}(θ_i), Σ(f_{z_i}(θ_i))/n}.     (3)

Special cases of the LSBM include the stochastic blockmodel (SBM, Holland et al., 1983), the degree-corrected SBM (DCSBM, Karrer and Newman, 2011), and other more complex latent structure models with clustering structure, as demonstrated in the following examples. Note that, despite the similarity in name, LSBMs are different from latent block models (LBMs; see, for example, Keribin et al., 2015; Wyse et al., 2017), which instead generalise the SBM to bipartite graphs.
Example 1 (Stochastic blockmodel, SBM). In an SBM (Holland et al., 1983), the edge probability is determined by the community allocations of the nodes: A_ij ∼ Bernoulli(B_{z_i z_j}), where B ∈ [0, 1]^{K×K} is a matrix of probabilities for connections between communities. An SBM characterised by a non-negative definite matrix B of rank d can be expressed as an LSBM, assigning community-specific latent positions x_i = ν_{z_i} ∈ R^d, such that B_{kℓ} = ν_k⊤ν_ℓ for each pair (k, ℓ). Therefore, f_k(θ_i) = ν_k, with each θ_i = 1 for identifiability. It follows that f_{k,j}(θ_i) = ν_{k,j}.
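The RDPG representation of an SBM in Example 1 can be made concrete: factorise a non-negative definite B as B_{kℓ} = ν_k⊤ν_ℓ via its eigendecomposition, then sample edges through the inner product kernel. A minimal sketch, with an arbitrary example B:

```python
import numpy as np

# A rank-2 non-negative definite B factorises as B_kl = nu_k . nu_l,
# giving the constant latent curves f_k(theta) = nu_k of Example 1.
B = np.array([[0.50, 0.20],
              [0.20, 0.40]])
evals, evecs = np.linalg.eigh(B)
nu = evecs * np.sqrt(np.clip(evals, 0, None))   # rows nu_1, nu_2 in R^2

# Sample an SBM through its RDPG representation
rng = np.random.default_rng(2)
z = rng.integers(0, 2, size=300)                # community allocations
X = nu[z]                                       # x_i = nu_{z_i}
P = X @ X.T                                     # P_ij = B_{z_i z_j}
A = (rng.uniform(size=P.shape) < P).astype(float)
A = np.triu(A, 1); A = A + A.T                  # symmetric, hollow
```

The empirical within-block edge frequency concentrates around the corresponding entry of B, as expected.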
Example 2 (Degree-corrected SBM, DCSBM). DCSBMs (Karrer and Newman, 2011) extend SBMs, allowing for heterogeneous degree distributions within communities. The edge probability depends on the community allocations of the nodes and on degree-correction parameters θ_i ∈ R_+ for each node, such that A_ij ∼ Bernoulli(θ_i θ_j B_{z_i z_j}). In a latent structure blockmodel interpretation, the latent positions are x_i = θ_i ν_{z_i}, corresponding to f_k(θ_i) = θ_i ν_k: the latent positions of each community lie on rays through the origin.

Example 3 (Quadratic latent structure blockmodel). For an LSBM with quadratic f_k(·), it can be postulated that, conditional on a community allocation z_i, each component of the latent position is a quadratic function of θ_i, f_{k,j}(θ) = α_{k,j}θ² + β_{k,j}θ + γ_{k,j}. The pairs (f_k, θ_i) are not uniquely identifiable; a possible solution is to fix each β_{k,1} = 1. Figure 3 shows example embeddings arising from the three models described above. From these plots, it is evident that taking into account the underlying structure is essential for successful community detection.

Bayesian modelling of LSBM embeddings
Under the LSBM, the inferential objective is to recover the community allocations z = (z_1, ..., z_n) given a realisation of the adjacency matrix A. Assuming normality of the rows of the ASE for LSBMs (3), the inferential problem consists of making joint inference about z and the latent functions f_k = (f_{k,1}, ..., f_{k,d}): G → R^d. The prior for z follows a Categorical-Dirichlet structure:

z_i | η ∼ Categorical(η),  η = (η_1, ..., η_K) ∼ Dirichlet(ν, ..., ν),     (4)

where ν ∈ R_+ and η_k ≥ 0, k ∈ {1, ..., K}, with ∑_{k=1}^K η_k = 1. Following the ASE-CLT in Theorem 1, the estimated latent positions are assumed to be drawn from Gaussian distributions centred at the underlying function values. Conditional on the pair (θ_i, z_i), the following distribution is postulated for x̂_i:

x̂_i | z_i, θ_i, f, σ² ∼ N_d{f_{z_i}(θ_i), diag(σ²_{z_i})},     (5)

where σ²_k = (σ²_{k,1}, ..., σ²_{k,d}) ∈ R_+^d is a community-specific vector of variances. Note that, for simplicity, the components of the estimated latent positions are assumed to be independent. This assumption loosely corresponds to the k-means clustering approach, which has been successfully deployed in spectral graph clustering under the SBM (Rohe et al., 2011). Here, the same idea is extended to a functional setting. Furthermore, for tractability, (5) assumes that the variance of x̂_i does not depend on x_i, but only on the community allocation z_i.
For a full Bayesian model specification, prior distributions are required for the latent functions and the variances. The most popular prior for functions is the Gaussian process (GP; see, for example, Rasmussen and Williams, 2006). Here, for each community k, the j-th dimension of the true latent positions is assumed to lie on a one-dimensional manifold described by a function f_{k,j} with a hierarchical GP-IG prior, placing an inverse gamma (IG) prior on the variance:

f_{k,j} | σ²_{k,j} ∼ GP(0, σ²_{k,j} ξ_{k,j}),  σ²_{k,j} ∼ IG(a_0, b_0),     (6)

where ξ_{k,j}(·, ·) is a positive semi-definite kernel function and a_0, b_0 ∈ R_+. Note that the terminology "kernel" is used in the literature for both the GP covariance function ξ_{k,j}(·, ·) and the function κ(·, ·) used in LPMs (cf. Section 2), but their meanings are fundamentally different. In particular, κ: X × X → [0, 1] is a component of the graph generating process, and is assumed here to be the inner product, corresponding to RDPGs. On the other hand, ξ_{k,j}: R × R → R is the scaled covariance function of the GP prior on the unknown function f_{k,j}, which is used for modelling the observed graph embeddings, or equivalently the embedding generating process. There are no restrictions on the possible forms of ξ_{k,j}, except positive semi-definiteness. Overall, the approach is similar to the overlapping mixture of Gaussian processes method (Lázaro-Gredilla et al., 2012). The class of models that can be expressed in the form (6) is vast, and includes, for example, polynomial regression and splines, under a conjugate normal-inverse-gamma prior for the regression coefficients. For example, consider any function that can be expressed in the form f_{z_i,j}(θ_i) = φ_{z_i,j}(θ_i)⊤ w_{z_i,j}, for some community-specific basis functions φ_{k,j}: R → R^{q_{k,j}}, q_{k,j} ∈ Z_+, and corresponding coefficients w_{k,j} ∈ R^{q_{k,j}}.
If the coefficients are given a normal-inverse-gamma prior, w_{k,j} | σ²_{k,j} ∼ N(0, σ²_{k,j} Δ_{k,j}), where Δ_{k,j} ∈ R^{q_{k,j}×q_{k,j}} is a positive definite matrix, then f_{k,j} takes the form (6), with the kernel function

ξ_{k,j}(θ, θ′) = φ_{k,j}(θ)⊤ Δ_{k,j} φ_{k,j}(θ′).     (7)

Considering the examples in Section 3, the SBM (cf. Example 1) corresponds to ξ_{k,j}(θ, θ′) = Δ_{k,j}, Δ_{k,j} ∈ R_+, whereas the DCSBM (cf. Example 2) corresponds to ξ_{k,j}(θ, θ′) = θθ′Δ_{k,j}, Δ_{k,j} ∈ R_+. For the quadratic LSBM (cf. Example 3), the GP kernel takes the form ξ_{k,j}(θ, θ′) = (1, θ, θ²)Δ_{k,j}(1, θ′, θ′²)⊤ for a positive definite scaling matrix Δ_{k,j} ∈ R^{3×3}. The LSBM specification is completed with a prior for each θ_i, which specifies the unobserved location of the latent position x_i along its submanifold curve: θ_i ∼ N(μ_θ, σ²_θ), for μ_θ ∈ R, σ²_θ ∈ R_+.
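The dot-product kernels above are easy to materialise as Gram matrices; the sketch below builds ξ for the SBM, DCSBM and quadratic cases of Section 3, assuming identity scaling matrices Δ_{k,j} purely for illustration (the function name is hypothetical).

```python
import numpy as np

def gram(phi, theta, Delta=None):
    """Gram matrix of the dot-product kernel (7):
    [Xi]_{l,l'} = phi(theta_l)^T Delta phi(theta_l')."""
    Phi = np.stack([phi(t) for t in theta])      # n x q basis matrix
    if Delta is None:
        Delta = np.eye(Phi.shape[1])             # assumed scaling, for illustration
    return Phi @ Delta @ Phi.T

theta = np.linspace(0.1, 1.0, 5)
Xi_sbm = gram(lambda t: np.array([1.0]), theta)            # SBM: constant curves
Xi_dc = gram(lambda t: np.array([t]), theta)               # DCSBM: rays through origin
Xi_quad = gram(lambda t: np.array([1.0, t, t**2]), theta)  # quadratic LSBM
```

Each Gram matrix is positive semi-definite by construction, as required of a GP covariance.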

Posterior and marginal distributions
The posterior distribution for (f_{k,j}, σ²_{k,j}) has the same GP-IG structure as (6), with updated parameters:

f_{k,j} | σ²_{k,j}, X̂, z, θ ∼ GP(μ_{k,j}, σ²_{k,j} ξ′_{k,j}),  σ²_{k,j} | X̂, z, θ ∼ IG(a_k, b_{k,j}).

Writing C_{k,j} = Ξ_{k,j}(θ_k, θ_k) + I_{n_k}, the parameters are updated as follows:

μ_{k,j}(θ) = Ξ_{k,j}(θ, θ_k) C_{k,j}^{-1} X̂_{k,j},
ξ′_{k,j}(θ, θ′) = ξ_{k,j}(θ, θ′) − Ξ_{k,j}(θ, θ_k) C_{k,j}^{-1} Ξ_{k,j}(θ_k, θ′),
a_k = a_0 + n_k/2,  b_{k,j} = b_0 + X̂_{k,j}⊤ C_{k,j}^{-1} X̂_{k,j}/2,     (8)

where n_k = ∑_{i=1}^n 𝟙_k{z_i}, X̂_{k,j} ∈ R^{n_k} is the subset of values of X̂_j for which z_i = k, and θ_k ∈ R^{n_k} is the vector θ restricted to the entries such that z_i = k. Furthermore, Ξ_{k,j} is the vector-valued and matrix-valued extension of ξ_{k,j}, such that [Ξ_{k,j}(θ, θ′)]_{ℓ,ℓ′} = ξ_{k,j}(θ_ℓ, θ′_{ℓ′}). The structure of the GP-IG yields an analytic expression for the posterior predictive distribution for a new observation x̂_{*,j} with latent parameter θ_* in community k:

x̂_{*,j} | θ_*, X̂, z, θ ∼ t_{2a_k}{μ_{k,j}(θ_*), σ_{k,j}(θ_*)},  σ²_{k,j}(θ_*) = (b_{k,j}/a_k){1 + ξ′_{k,j}(θ_*, θ_*)},     (9)

where t_ν(μ, σ) denotes a Student's t distribution with ν degrees of freedom, mean μ and scale parameter σ. Furthermore, the prior probabilities η for the community assignments can be integrated out, obtaining

p(z) = Γ(Kν)/Γ(n + Kν) ∏_{k=1}^K Γ(n_k + ν)/Γ(ν).     (10)

The two distributions (9) and (10) are key components of the Bayesian inference algorithm discussed in the next section.
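The posterior predictive computation under a conjugate GP-IG model of this kind can be sketched as follows, assuming the parameterisation in which the observation noise and the kernel share the scale σ², so that the predictive is a Student's t. This is a generic conjugate-GP sketch under that assumption, not the paper's exact implementation.

```python
import numpy as np

def gp_ig_predictive(y, K, k_star, k_ss, a0=1.0, b0=0.001):
    """Posterior predictive t parameters for a new point under the GP-IG
    model: f ~ GP(0, sigma^2 * xi), sigma^2 ~ IG(a0, b0), with observation
    noise variance sigma^2. Sketch with assumed parameterisation."""
    n = len(y)
    C = K + np.eye(n)                       # scaled prior plus noise covariance
    alpha = np.linalg.solve(C, y)
    mu = k_star @ alpha                     # predictive location
    a_n = a0 + n / 2.0                      # IG shape update
    b_n = b0 + 0.5 * y @ alpha              # IG rate update
    var_scale = (b_n / a_n) * (k_ss + 1.0 - k_star @ np.linalg.solve(C, k_star))
    return mu, np.sqrt(var_scale), 2.0 * a_n    # location, scale, dof

# Toy check on a linear dot-product kernel xi(t, t') = t * t'
theta = np.array([0.0, 0.5, 1.0])
y = 2.0 * theta                             # noiseless linear data
K = np.outer(theta, theta)
mu, sd, dof = gp_ig_predictive(y, K, k_star=0.75 * theta, k_ss=0.75**2)
```

Because the noise variance here equals the kernel scale, the predictive mean is shrunk towards zero relative to the noiseless value 1.5, and the degrees of freedom are 2a_n = 2(a_0 + n/2).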

Posterior inference
After marginalisation of the pairs (f k,j , σ 2 k,j ) and η, inference is limited to the community allocations z and latent parameters θ. The marginal posterior distribution p(z, θ |X) is analytically intractable; therefore, inference is performed using collapsed Metropolis-within-Gibbs Markov Chain Monte Carlo (MCMC) sampling. In this work, MCMC methods are used, but an alternative inferential algorithm often deployed in the community-detection literature is variational Bayesian inference (see, for example, Latouche et al., 2012), which is also applicable to GPs (Cheng and Boots, 2017).
For the community allocations z, the Gibbs sampling step uses the following decomposition:

p(z_i = k | z^{-i}, X̂, θ) ∝ p(z_i = k | z^{-i}) p(x̂_i | z_i = k, z^{-i}, X̂^{-i}, θ),

where the superscript −i denotes that the i-th row (or element) is removed from the corresponding matrix (or vector). Using (10), the first term is

p(z_i = k | z^{-i}) = (n_k^{-i} + ν)/(n − 1 + Kν),

where n_k^{-i} is the number of nodes other than i assigned to community k. For the second term, using (9), the posterior predictive distribution for x̂_i given z_i = k can be written as the product of d independent Student's t distributions:

p(x̂_i | z_i = k, z^{-i}, X̂^{-i}, θ) = ∏_{j=1}^d t_{2a_k^{-i}}{x̂_{i,j}; μ^{-i}_{k,j}(θ_i), σ^{-i}_{k,j}(θ_i)},     (11)

where the quantities μ^{-i}_{k,j}, ξ^{-i}_{k,j}, a^{-i}_k and b^{-i}_{k,j} are calculated as described in (8), excluding the contribution of the i-th node.
In order to mitigate identifiability issues, it is necessary to assume that some of the parameters are known a priori. For example, assuming for each community k that f_{k,1}(θ_i) = θ_i, corresponding to a linear model in θ with no intercept and unit slope in the first dimension, gives the predictive distribution

p(x̂_{i,1} | z_i = k, z^{-i}, X̂^{-i}, θ) = t_{2a_k^{-i}}{x̂_{i,1}; θ_i, (b^{-i}_{k,1}/a^{-i}_k)^{1/2}},     (12)

replacing the first factor in (11). Finally, for updates to θ_i, a standard Metropolis-within-Gibbs step can be used. For a proposed value θ* sampled from a proposal distribution q(· | θ_i), the acceptance probability is

α(θ*, θ_i) = min{ 1, [p(x̂_i | z, θ*, X̂^{-i}) p(θ*) q(θ_i | θ*)] / [p(x̂_i | z, θ_i, X̂^{-i}) p(θ_i) q(θ* | θ_i)] }.     (13)

The proposal distribution q(θ* | θ_i) used in this work is a normal distribution N(θ_i, σ²_*), σ²_* ∈ R_+, implying that the ratio of proposal distributions in (13) cancels by symmetry.
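The Metropolis-within-Gibbs update for θ_i with a symmetric Gaussian random walk can be sketched as follows; the log_target argument stands in for the (unnormalised) log full conditional of θ_i, and is replaced here by a toy standard normal purely for checking purposes.

```python
import numpy as np

def mh_step_theta(theta_i, log_target, sigma_prop=0.1, rng=None):
    """One Metropolis-within-Gibbs update for theta_i with a symmetric
    Gaussian random-walk proposal, so the proposal ratio cancels.
    log_target is the unnormalised log full conditional; generic sketch."""
    rng = rng or np.random.default_rng()
    theta_star = rng.normal(theta_i, sigma_prop)
    log_alpha = log_target(theta_star) - log_target(theta_i)
    if np.log(rng.uniform()) < min(0.0, log_alpha):
        return theta_star                 # accept the proposed value
    return theta_i                        # reject: keep the current value

# Toy target: standard normal full conditional
rng = np.random.default_rng(3)
log_target = lambda t: -0.5 * t**2
chain = [0.0]
for _ in range(5000):
    chain.append(mh_step_theta(chain[-1], log_target, sigma_prop=1.0, rng=rng))
chain = np.array(chain[1000:])            # discard burn-in
```

The retained samples have mean and standard deviation close to the target's 0 and 1, confirming the sampler leaves the full conditional invariant.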

Inference on the number of communities K
So far, it has been assumed that the number of communities K is known. The LSBM prior specification (4) naturally admits a prior distribution on the number of communities K. Following Sanna Passino and Heard (2020), a geometric prior could be assumed:

p(K) = (1 − ω)^{K−1} ω,  K = 1, 2, ...,

where ω ∈ (0, 1). The MCMC algorithm is then augmented with two additional moves for posterior inference on K: (i) split or merge two communities; and (ii) add or remove an empty community. An alternative approach when K is unknown could be a nonparametric overlapping mixture of Gaussian processes (Ross and Dy, 2013). For simplicity, in the next two sections, it will initially be assumed that all communities have the same functional form, corresponding, for example, to the same basis functions φ_{k,j}(·) for dot product kernels (7). Then, in Section 4.4, the algorithm will be extended to admit a prior distribution on the community-specific kernels.
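Assuming a geometric prior p(K) = (1 − ω)^{K−1}ω with ω = 0.1, as in Sanna Passino and Heard (2020) (the exact prior form is an assumption here), the prior ratio entering the acceptance probability for adding a community is particularly simple:

```python
import numpy as np

# Geometric prior on the number of communities: p(K) = (1 - omega)^(K-1) * omega
omega = 0.1

def log_prior_K(K):
    return (K - 1) * np.log(1.0 - omega) + np.log(omega)

# Prior ratio p(K+1)/p(K) for an "add empty community" move: always 1 - omega
ratio_add = np.exp(log_prior_K(4) - log_prior_K(3))
```

Since p(K+1)/p(K) = 1 − ω for every K, the prior mildly penalises each additional community, with smaller ω giving a flatter prior over K.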

Split or merge two communities
In this case, the proposal distribution follows Sanna Passino and Heard (2020). First, two nodes i and j are sampled at random. For simplicity, assume z_i ≤ z_j. If z_i ≠ z_j, then the two corresponding communities are merged into a unique cluster: all nodes in community z_j are assigned to z_i. Otherwise, if z_i = z_j, the cluster is split into two communities, proposed as follows: (i) node i is assigned to community z_i (z*_i = z_i), and node j to community K* = K + 1 (z*_j = K*); (ii) the remaining nodes in community z_i are allocated in random order to clusters z*_i or z*_j according to their posterior predictive distributions (11) or (12), restricted to the two communities, and calculated sequentially. It follows that the proposal distribution q(K*, z* | K, z) for a split move corresponds to the product of renormalised posterior predictive distributions, leading to the acceptance probability

α{(K*, z*), (K, z)} = min{ 1, [p(X̂ | K*, z*, θ) p(z* | K*) p(K*) q(K, z | K*, z*)] / [p(X̂ | K, z, θ) p(z | K) p(K) q(K*, z* | K, z)] },     (14)

where p(X̂ | K, z, θ) is the marginal likelihood. Note that the ratio of marginal likelihoods only depends on the two communities involved in the split or merge move. Under (5) and (6), the community-specific marginal on the j-th dimension is:

X̂_{k,j} | K, z, θ ∼ t_{2a_0}{0, (b_0/a_0)[Ξ_{k,j}(θ_k, θ_k) + I_{n_k}]},     (15)

where the notation is identical to (8), and the Student's t distribution is n_k-dimensional, with location the identically zero vector. The full marginal p(X̂ | K, z, θ) is the product of the marginals (15) over all dimensions and communities. If f_{k,1}(θ_i) = θ_i, cf. (12), the marginal likelihood on the first dimension is instead

X̂_{k,1} | K, z, θ ∼ t_{2a_0}{θ_k, (b_0/a_0) I_{n_k}}.     (16)

Add or remove an empty community
When adding or removing an empty community, the acceptance probability is

min{ 1, [p(z | K*) p(K*)] / [p(z | K) p(K)] × q_∅ },

where q_∅ = q(K | K*, z)/q(K* | K, z) is the proposal ratio, equal to: (i) q_∅ = 2 if the proposed number of clusters K* equals the number of non-empty communities in z; (ii) q_∅ = 0.5 if there are no empty clusters in z; and (iii) q_∅ = 1 otherwise. Note that the acceptance probability is identical to Sanna Passino and Heard (2020, Section 4.3), and does not depend on the marginal likelihoods.

Inference with different community-specific kernels
When communities are assumed to have different functional forms, it is necessary to introduce a prior distribution p(ξ_k), ξ_k = (ξ_{k,1}, ..., ξ_{k,d}), on the GP kernels, supported on one or more classes of possible kernels K. Under this formulation, a proposal to change the community-specific kernel can be introduced. Conditional on the allocations z, the k-th community is assigned a kernel ξ* = (ξ*_1, ..., ξ*_d) with probability

p(ξ* | X̂, K, z, θ) ∝ p(ξ*) ∏_{j=1}^d p(X̂_{k,j} | K, z, θ, ξ*_j),

normalised over ξ* ∈ K, where p(X̂_{k,j} | K, z, θ, ξ*_j) is the marginal (15) calculated under kernel ξ*_j. The prior distribution p(ξ*) could also be used as the proposal for the kernel of an empty community (cf. Section 4.3.2). Similarly, in the merge move (cf. Section 4.3.1), the GP kernel could be sampled at random from the two kernels assigned to z_i and z_j, correcting the acceptance probability (14) accordingly.

Applications and results
Inference for the LSBM is tested on synthetic LSBM data and on three real-world networks. As discussed in Section 4.2, the first dimension X̂_1 is assumed to be linear in θ_i, with no intercept and unit slope. It follows that θ_i is initialised to x̂_{i,1} + ε_i, where ε_i ∼ N(0, σ²_ε), for a small σ²_ε, usually equal to 0.01. Note that such an assumption links the proposed Bayesian model for LSBM embeddings to Bayesian errors-in-variables models (Dellaportas and Stephens, 1995). In the examples in this section, the kernel function is assumed to be of the dot product form (7), with Zellner's g-prior such that Δ_{k,j} = n²{Φ_{k,j}(θ)⊤Φ_{k,j}(θ)}^{-1}, where Φ_{k,j}(θ) ∈ R^{n×q_{k,j}} is the matrix whose i-th row corresponds to φ_{k,j}(θ_i). For the remaining parameters: a_0 = 1, b_0 = 0.001, μ_θ = ∑_{i=1}^n x̂_{i,1}/n, σ²_θ = 10, ν = 1, ω = 0.1. The community allocations are initialised using k-means with K groups, unless otherwise specified. The final cluster configuration is estimated from the output of the MCMC algorithm described in Section 4.2, using the estimated posterior similarity between nodes i and j, π̂_ij = P̂(z_i = z_j | X̂) = ∑_{s=1}^M 𝟙_{z_{i,s}}{z_{j,s}}/M, where M is the total number of posterior samples and z_{i,s} is the s-th sample for z_i. The posterior similarity is not affected by the issue of label switching (Jasra et al., 2005). The clusters are subsequently estimated using hierarchical clustering with average linkage, with distance measure 1 − π̂_ij (Medvedovic et al., 2004). The quality of the estimated clustering with respect to the true partition, when available, is evaluated using the Adjusted Rand Index (ARI, Hubert and Arabie, 1985). The results presented in this section are based on M = 10,000 posterior samples, with a burn-in period of 1,000 iterations.
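The posterior similarity construction and the subsequent average-linkage step can be sketched as below (the helper name is illustrative; scipy supplies the hierarchical clustering). The toy samples deliberately permute the cluster labels across MCMC draws, showing that π̂ is invariant to label switching.

```python
import numpy as np
from scipy.cluster.hierarchy import average, fcluster
from scipy.spatial.distance import squareform

def cluster_from_samples(Z, K):
    """Estimate a partition from MCMC allocation samples Z (M x n) via the
    posterior similarity matrix and average-linkage hierarchical clustering
    with distance 1 - pi_hat. Illustrative helper."""
    M, n = Z.shape
    pi_hat = np.zeros((n, n))
    for s in range(M):
        pi_hat += (Z[s][:, None] == Z[s][None, :])   # co-clustering indicator
    pi_hat /= M                                      # label-switching invariant
    D = 1.0 - pi_hat
    np.fill_diagonal(D, 0.0)
    link = average(squareform(D, checks=False))      # average linkage
    return fcluster(link, t=K, criterion='maxclust')

# Toy samples: two stable clusters whose labels are permuted across draws
Z = np.array([[0, 0, 0, 1, 1, 1],
              [1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1]])
labels = cluster_from_samples(Z, 2)
```

Despite the label permutation in the second draw, the recovered partition places the first three and last three nodes in two distinct clusters.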
The performance of the inferential algorithm is also compared to alternative methods for spectral graph clustering: Gaussian mixture models (GMM; which rely on the assumption of an SBM generative model) on the ASE X̂, GMMs on the row-normalised ASE X̂ (see, for example, Rubin-Delanchy et al., 2017), and GMMs on a spherical coordinate transformation of the embedding (SCSC, see Sanna Passino et al., 2021). Note that the two latter methods postulate a DCSBM generative model. The ARI is averaged across 1,000 initialisations of the GMM inferential algorithm. Furthermore, the proposed methodology is compared to a kernel-based clustering technique using Gaussian processes: the parsimonious Gaussian process model (PGP) of Bouveyron et al. (2015), fitted using the EM algorithm (PGP-EM). In all the implementations of PGP-EM in Section 5, the RBF kernel is used, with variance parameter chosen via grid search, with the objective of maximising the resulting ARI. The responsibilities in the EM algorithm are initialised using the predictive probabilities obtained from a Gaussian mixture model fitted on the ASE or row-normalised ASE, chosen according to the largest ARI. Also, the LSBM is compared to the hierarchical Louvain algorithm for graphs (HLouvain in Table 3), corresponding to hierarchical clustering by successive instances of the Louvain algorithm (Blondel et al., 2008). Finally, the LSBM is also compared to hierarchical clustering with complete linkage and Euclidean distance (HClust in Table 3), applied on X̂.

Table 1: ARI for communities estimated using LSBM and alternative methodologies on the embeddings in Figure 3.
The results are presented in Table 1. The performance of the LSBM appears to be on par with alternative methodologies for SBMs and DCSBMs. This is expected, since the LSBM does not have a competitive advantage over alternative methodologies for inference under such models. In particular, inferring LSBMs with constant latent functions using the algorithm in Section 4.2 is equivalent to Bayesian inference in Gaussian mixture models with normal-inverse-gamma priors. For DCSBMs, the LSBM is only marginally outperformed by spectral clustering on spherical coordinates (SCSC, Sanna Passino et al., 2021). On the other hand, LSBMs largely outperform competing methodologies when quadratic latent structure is observed in the embedding. It must be remarked that the coefficients of the latent functions used to generate the quadratic latent structure model in Figure 3c were chosen to approximately reproduce the curves in the cyber-security example in Figure 2. Despite the apparent simplicity of the quadratic LSBM used in this simulation, its practical relevance is evident from plots of real-world network embeddings, and it is therefore important to devise methodologies that estimate its communities correctly. Furthermore, LSBMs can also be used when the number of communities K is unknown. In the examples in this section, inference on K using the MCMC algorithm overwhelmingly suggests K = 2, the correct value used in the simulation.

Simulated data: Hardy-Weinberg LSBM
Second, the performance of the LSBM is assessed on data simulated from a Hardy-Weinberg LSBM. In Athreya et al. (2021) and Trosset et al. (2020), a Hardy-Weinberg latent structure model is used, with G = [0, 1] and f(θ) = {θ², 2θ(1 − θ), (1 − θ)²}. A permutation of the coordinates of the Hardy-Weinberg curve is considered here to introduce community structure. A graph with n = 1,000 nodes is simulated, with K = 2 communities of equal size. Each node is assigned a latent score θ_i ∼ Uniform(0, 1), which is used to obtain the latent position x_i through the mapping f_{z_i}(θ_i), where each f_k corresponds to a different permutation of the coordinates of the Hardy-Weinberg curve. Using the latent positions, the graph adjacency matrix is then generated under the random dot product graph kernel P(A_ij = 1 | x_i, x_j) = x_i⊤x_j. The resulting scatterplot of the latent positions estimated via ASE is shown in Figure 4. For visualisation, the estimated latent positions X̂ have been aligned to the true underlying latent positions X using a Procrustes transformation (see, for example, Dryden and Mardia, 2016).
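A sketch of this simulation is given below; the specific coordinate permutation defining the second community is an assumption for illustration, since the paper's exact choice is not reproduced here.

```python
import numpy as np

# Simulate a two-community Hardy-Weinberg LSBM: community 1 uses the curve
# f(t) = (t^2, 2t(1-t), (1-t)^2); community 2 uses a coordinate permutation
# of it (the permutation below is assumed, for illustration only).
rng = np.random.default_rng(4)
n = 1000
z = np.repeat([0, 1], n // 2)                  # two equal-sized communities
theta = rng.uniform(0.0, 1.0, size=n)          # latent scores

def hw(t):
    """Hardy-Weinberg curve f(t) = (t^2, 2t(1-t), (1-t)^2)."""
    return np.stack([t**2, 2 * t * (1 - t), (1 - t)**2], axis=-1)

X = hw(theta)
X[z == 1] = X[z == 1][:, [1, 0, 2]]            # assumed permutation for community 2
P = X @ X.T                                    # RDPG inner product kernel
A = (rng.uniform(size=P.shape) < P).astype(float)
A = np.triu(A, 1); A = A + A.T                 # symmetric adjacency matrix
```

Since the Hardy-Weinberg coordinates are non-negative and sum to one, all inner products lie in [0, 1], so P is a valid matrix of edge probabilities.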
The inferential procedure is first run assuming that the parametric form of the underlying latent function is fully known. Therefore, the kernels are set to ξ_{k,j}(θ, θ′) = (1, θ, θ²)Δ_{k,j}(1, θ′, θ′²)⊤, k = 1, 2, j = 1, 2, 3. Figure 5a shows the best-fitting curves for the two estimated communities after MCMC, which are almost indistinguishable from the true underlying latent curves. The Markov chain was initialised by setting θ_i = |x̂_{i,1}|^{1/2}, and obtaining initial values of the allocations z from k-means. The resulting ARI is 0.7918, which is not perfect since some of the nodes at the intersection between the two curves are not classified correctly, but corresponds to approximately 95% of nodes correctly classified. The number of communities is estimated to be K = 2 by the inferential algorithm, the correct value used in the simulation.
If the underlying functional relationship is unknown, a realistic guess can be obtained by examining the scatterplots of the embedding. The scatterplots in Figure 4 show that, assuming linearity in θ_i on X̂_1 with no intercept and unit slope, a quadratic or cubic polynomial function is required to model X̂_2 and X̂_3. Therefore, for the purposes of the MCMC inference algorithm in Section 4.2, f_{k,2}(·) and f_{k,3}(·) are assumed to be cubic functions, corresponding to the Gaussian process kernel ξ_{k,j}(θ, θ′) = (1, θ, θ², θ³)Δ_{k,j}(1, θ′, θ′², θ′³)⊤, j ∈ {2, 3}, whereas f_{k,1}(θ) = θ, k = 1, ..., K. The curves corresponding to the estimated clustering are plotted in Figure 5b. Also in this case, the algorithm is able to approximately recover the curves that generated the graph, and the correct number of communities K = 2. The imperfect choice of the latent functions makes the ARI decrease to 0.6687, which still corresponds to over 90% of nodes correctly classified. Table 2 presents further comparisons with the alternative techniques discussed in the previous section, showing that only the LSBM recovers a good portion of the communities. This is hardly surprising, since the alternative methodologies do not take into account the quadratic latent functions generating the latent positions, and therefore fail to capture the underlying community structure.

Undirected graphs: Harry Potter enmity graph
The LSBM is also applied to the Harry Potter enmity graph, an undirected network with $n = 51$ nodes representing characters of J.K. Rowling's series of fantasy novels. In the graph, $A_{ij} = A_{ji} = 1$ if the characters $i$ and $j$ are enemies, and 0 otherwise. A degree-corrected stochastic blockmodel represents a reasonable assumption for such a graph (see, for example, Modell and Rubin-Delanchy, 2021): Harry Potter, the main character, and Lord Voldemort, his antagonist, attract many enemies, resulting in large degrees, whereas their followers are expected to have lower degrees. The graph might be expected to contain $K = 2$ communities, since Harry Potter's friends are usually Lord Voldemort's enemies in the novels. Theorem 1 suggests that the embeddings for a DCSBM are expected to appear as rays passing through the origin. Therefore, $f_k(\cdot)$ is assumed to be composed of linear functions such that $f_{k,1}(\theta) = \theta$ and $\xi_{k,2}(\theta,\theta') = \theta\theta'\,\Delta_{k,2}$. The estimated posterior distribution of the number of non-empty clusters is plotted in Figure 6: $K = 2$ appears to be the most appropriate number of clusters. The resulting 2-dimensional ASE embedding of $A$ and the estimated clustering obtained using the linear LSBM are pictured in Figure 7a. The two estimated communities roughly correspond to the Houses of Gryffindor and Slytherin. The inferential algorithm often admits $K = 3$, with a singleton cluster for Quirinus Quirrell, who appears to lie between the two main linear groups (cf. Figure 7a). Alternatively, if a polynomial form for $f_k(\cdot)$ is unknown, a more flexible model is represented by regression splines, which can also be expressed in the form (7). A common choice for $\phi_{k,j}(\cdot)$ is a cubic truncated power basis $\phi_{k,j} = (\phi_{k,j,1},\dots,\phi_{k,j,6})$ such that $\phi_{k,j,1}(\theta) = \theta$, $\phi_{k,j,2}(\theta) = \theta^2$, $\phi_{k,j,3}(\theta) = \theta^3$, $\phi_{k,j,3+\ell}(\theta) = (\theta - \tau_\ell)_+^3$, $\ell = 1, 2, 3$, where $(\tau_1, \tau_2, \tau_3)$ are knots, and $(\cdot)_+ = \max\{0,\cdot\}$.
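The cubic truncated power basis can be sketched in a few lines of NumPy; the interior-knot placement below (equispaced in the range of $\theta$) is illustrative rather than taken from the paper:

```python
import numpy as np

def truncated_power_basis(theta, knots):
    """Cubic truncated power basis through the origin:
    (theta, theta^2, theta^3, (theta - tau_1)_+^3, ..., (theta - tau_L)_+^3)."""
    theta = np.asarray(theta, dtype=float)
    cols = [theta, theta ** 2, theta ** 3]
    cols += [np.maximum(theta - tau, 0.0) ** 3 for tau in knots]
    return np.column_stack(cols)

# Three equispaced interior knots in the range of theta (illustrative choice).
theta = np.linspace(0.0, 2.0, 100)
knots = np.linspace(theta.min(), theta.max(), 5)[1:-1]
B = truncated_power_basis(theta, knots)  # design matrix, shape (100, 6)
```

Note that every basis function vanishes at $\theta = 0$, so any spline built from this basis passes through the origin, matching the ray structure expected under a DCSBM.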
In this application, the knots were selected as three equispaced points in the range of $\hat{X}_1$. The results are plotted in Figure 7b. Using either functional form, the algorithm is clearly able to recover the two communities, meaningfully clustering Harry Potter's and Lord Voldemort's followers. The network is publicly available in the GitHub repository efekarakus/potter-network.

Directed graphs: Drosophila connectome
LSBMs are also useful for clustering the larval Drosophila mushroom body connectome (Eichler et al., 2017), a directed graph representing connections between $n = 213$ neurons in the brain of a species of fly. The right hemisphere mushroom body connectome contains $K = 4$ groups of neurons: Kenyon Cells, Input Neurons, Output Neurons and Projection Neurons. If two neurons are connected, then $A_{ij} = 1$, otherwise $A_{ij} = 0$, forming an asymmetric adjacency matrix $A \in \{0,1\}^{n\times n}$. The network has been extensively analysed in Priebe et al. (2017) and Athreya et al. (2018). Following these analyses, after applying the DASE (Definition 3) with $d = 3$, a joint concatenated embedding $\hat{Y} = [\hat{X}, \hat{X}'] \in \mathbb{R}^{n\times 2d}$ is obtained from $A$. Based on the analysis of Priebe et al. (2017), three of the communities (Input Neurons, Output Neurons and Projection Neurons) are assumed to correspond to a stochastic blockmodel, resulting in Gaussian clusters, whereas the Kenyon Cells form a quadratic curve with respect to the first dimension of the embedding space. Therefore, the kernel functions implied by Priebe et al. (2017) are $\xi_{1,j}(\theta,\theta') = (\theta,\theta^2)\,\Delta_{1,j}\,(\theta',\theta'^2)^\top$, $j = 2,\dots,2d$, with $f_{1,1}(\theta) = \theta$, for the first community (corresponding to the Kenyon Cells), and $\xi_{k,j}(\theta,\theta') = \Delta_{k,j}$, $k = 2, 3, 4$, $j = 1,\dots,2d$, for the three remaining groups of neurons.
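A directed spectral embedding of this kind can be sketched with a truncated SVD, assuming the usual $\hat{X} = U S^{1/2}$ scaling convention; the random adjacency matrix below is a placeholder standing in for the connectome data:

```python
import numpy as np

def dase(A, d):
    """Directed adjacency spectral embedding: rank-d truncated SVD of A,
    returning the source embedding X and destination embedding X'."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    scale = np.sqrt(S[:d])
    X = U[:, :d] * scale        # out-embedding (rows of A)
    X_prime = Vt[:d].T * scale  # in-embedding (columns of A)
    return X, X_prime

# Joint concatenated embedding Y = [X, X'] used for clustering.
rng = np.random.default_rng(1)
A = (rng.random((213, 213)) < 0.1).astype(float)  # placeholder adjacency
X, X_prime = dase(A, d=3)
Y = np.hstack([X, X_prime])  # shape (213, 2 * 3)
```

By construction $X X'^\top$ equals the best rank-$d$ approximation of $A$, which is what makes the concatenated embedding a natural input for clustering directed graphs.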
Following the discussion in Priebe et al. (2017), the LSBM inferential procedure is initialised using $k$-means with $K = 6$ clusters, three of which are merged to obtain $K = 4$ initial groups. Note that, since the output of $k$-means is invariant to permutations of the $K = 4$ community labels, careful relabelling of the initial values is necessary to ensure that the Kenyon Cells are effectively assigned to the first community, which assumes a quadratic functional form. The most appropriate relabelling is chosen here by repeatedly initialising the model with all possible permutations of the labels, and selecting the permutation that maximises the marginal likelihood, which is analytically available in closed form under the Bayesian model (6) for LSBMs (see, for example, Rasmussen and Williams, 2006). The results obtained after MCMC sampling are plotted in Figure 8. The estimated clustering has ARI 0.8643, corresponding to only 10 misclassified nodes out of 213. From the scatterplots in Figure 8, it appears that the embedding for the Input Neurons, Output Neurons and Projection Neurons could also be represented using linear functions. Furthermore, a quadratic curve for the Kenyon Cells might be too restrictive. The LSBM framework allows specification of all such choices.
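The relabelling step can be sketched as an exhaustive search over the $K!$ label permutations; the toy score function below is a stand-in for the closed-form marginal likelihood, which is not reproduced here:

```python
from itertools import permutations

import numpy as np

def best_relabelling(z, score_fn, K):
    """Relabel initial cluster allocations z so that community labels match
    the intended kernel assignment: try all K! permutations of the labels
    and keep the permutation with the highest score."""
    best_z, best_score = None, -np.inf
    for perm in permutations(range(K)):
        z_perm = np.array([perm[label] for label in z])
        s = score_fn(z_perm)
        if s > best_score:
            best_z, best_score = z_perm, s
    return best_z, best_score

# Toy score favouring label 0 for the first half of the nodes (a stand-in
# for the LSBM marginal likelihood evaluated at the initial allocation).
z_init = np.array([1, 1, 1, 0, 0, 0])
score = lambda z: -np.abs(z[:3]).sum() - np.abs(z[3:] - 1).sum()
z_best, _ = best_relabelling(z_init, score, K=2)  # -> [0, 0, 0, 1, 1, 1]
```

With $K = 4$ this search evaluates only $4! = 24$ candidates, so the exhaustive strategy is cheap despite its factorial complexity.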
The performance of the LSBM is also compared to alternative clustering methods in Table 3. In particular, Gaussian mixture models with $K = 4$ components were fitted on the concatenated DASE embedding $\hat{Y} = [\hat{X}, \hat{X}']$ (standard spectral clustering; see, for example, Rubin-Delanchy et al., 2017), on its row-normalised version $\tilde{Y}$ (Ng et al., 2001; Qin and Rohe, 2013), and on a transformation to spherical coordinates (Sanna Passino et al., 2021). The ARI is averaged over 1000 different initialisations. Furthermore, the LSBM is also compared to PGP (Bouveyron et al., 2015), hierarchical Louvain adapted to directed graphs (Dugué and Perez, 2015), and hierarchical clustering on $\hat{Y}$. The results in Table 3 show that the LSBM largely outperforms the alternative clustering techniques, which are not able to account for the non-linearity in the community corresponding to the Kenyon Cells.
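The two embedding transformations used by the competing methods can be sketched as follows; the spherical-coordinates map uses one common hyperspherical convention, which may differ in detail from the transform in the cited work:

```python
import numpy as np

def row_normalise(X):
    """Project each embedding row onto the unit sphere (zero rows are kept)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms > 0, norms, 1.0)

def to_spherical(X):
    """Map each row of a d-dimensional embedding to d-1 angular coordinates,
    discarding the radius (one standard hyperspherical convention)."""
    n, d = X.shape
    angles = np.empty((n, d - 1))
    for j in range(d - 1):
        tail = np.linalg.norm(X[:, j + 1:], axis=1)
        angles[:, j] = np.arctan2(tail, X[:, j])
    return angles

Xn = row_normalise(np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 1.0]]))
ang = to_spherical(np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))
```

Both transformations remove the radial component of the embedding, which is exactly the degree-related variation that the kernels $\xi_{k,2}(\theta,\theta') = \theta\theta'\,\Delta_{k,2}$ model explicitly in the LSBM.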
Next, the MCMC algorithm is run for $M = 50,000$ iterations with a burn-in period of 5,000, with the objective of inferring $K$, assuming a discrete uniform prior on the kernels used in Figures 8 and 9. The estimated posterior barplots for the number of non-empty clusters $K$ are plotted in Figure 10. It appears that the choice of a prior on kernels admitting a quadratic curve for the Kenyon Cells and Gaussian clusters for the remaining groups is overly simplistic, since some of the communities appear to have a linear structure (cf. Figure 8). Hence, the number of communities $K$ is incorrectly estimated under this setting (cf. Figure 10a). On the other hand, when a mixture distribution of cubic and linear kernels is used as prior on $\xi_k$, $K$ is correctly estimated (cf. Figure 10b), confirming the impression from Figure 9 that such kernels provide a better fit to the Drosophila connectome.

Bipartite graphs: ICL computer laboratories
The LSBM methodology is finally applied to a bipartite graph obtained from computer network flow data collected at Imperial College London (ICL). The source nodes are $|V_1| = n = 439$ client machines located in four computer laboratories in different departments at Imperial College London, whereas the destination nodes are $|V_2| = 60,635$ internet servers, contacted via HTTP and HTTPS in January 2020 by one or more of the 439 client computers. A total of 717,912 edges are observed. The inferential objective is to identify the location of the machines in the network, represented by their department, from the realisation of the rectangular adjacency matrix $A$, where $A_{ij} = 1$ if at least one connection is observed between client computer $i \in V_1$ and server $j \in V_2$, and $A_{ij} = 0$ otherwise. It can be assumed that $K = 4$, representing the departments of Chemistry, Civil Engineering, Mathematics, and Medicine. After taking the DASE of $A$, the machines are clustered using the LSBM. The value $d = 5$ is selected using the criterion of Zhu and Ghodsi (2006), choosing the second elbow of the scree-plot of singular values. Computer network graphs of this kind have been observed to present quadratic curves in the embedding, as demonstrated, for example, by Figure 2 in the introduction, which refers to a different set of machines. Therefore, it seems reasonable to assume that $\xi_{k,j}(\theta,\theta') = (\theta,\theta^2)\,\Delta_{k,j}\,(\theta',\theta'^2)^\top$, $j = 2,\dots,d$, which implies quadratic functions passing through the origin, and $f_{k,1}(\theta) = \theta$. The quadratic model with $K = 4$ is fitted to the 5-dimensional embedding, obtaining ARI 0.9402, corresponding to just 9 misclassified nodes. The results are plotted in Figure 11, with the corresponding best-fitting quadratic curves obtained from the estimated clustering. The result appears remarkable, considering that the communities are highly overlapping. Running the MCMC algorithm assuming $K$ unknown shows that the number of clusters is correctly estimated (cf. Figure 12), overwhelmingly suggesting $K = 4$, corresponding to the correct number of departments. These LSBM results are also compared to alternative methodologies in Table 4. The LSBM achieves the best performance in clustering the nodes, followed by Gaussian mixture modelling of the row-normalised adjacency spectral embedding. The GMM on $\hat{X}$ sometimes converges to competitive solutions, reaching ARI up to 0.94, but usually converges to sub-optimal solutions, as demonstrated by the average ARI of 0.7608.
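The dimension selection step can be sketched as follows; this is one common reading of the Zhu and Ghodsi (2006) profile-likelihood criterion (two Gaussian groups with a shared variance), applied twice to obtain the second elbow, with illustrative singular values rather than the ICL data:

```python
import numpy as np

def zhu_ghodsi(svals):
    """Profile-likelihood elbow criterion of Zhu and Ghodsi (2006): for each
    candidate split q, model the sorted singular values as two Gaussian
    groups with a common variance, and pick the q maximising the profile
    log-likelihood."""
    svals = np.sort(np.asarray(svals, dtype=float))[::-1]
    n = len(svals)
    lls = np.full(n - 1, -np.inf)
    for q in range(1, n):
        s1, s2 = svals[:q], svals[q:]
        ss = ((s1 - s1.mean()) ** 2).sum() + ((s2 - s2.mean()) ** 2).sum()
        sigma2 = ss / n
        if sigma2 <= 0:
            continue
        lls[q - 1] = -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * n
    return int(np.argmax(lls)) + 1  # position of the elbow

# Second elbow: apply the criterion again to the values past the first elbow.
svals = np.array([10.0, 9.5, 4.0, 3.8, 3.5, 1.0, 0.9, 0.8, 0.7])
e1 = zhu_ghodsi(svals)
e2 = e1 + zhu_ghodsi(svals[e1:])
```

With the illustrative scree profile above, the first elbow separates the two dominant singular values and the second elbow lands after the middle group, mirroring how $d = 5$ was selected for the ICL embedding.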
As before, if a parametric form for $f_k(\cdot)$ is unknown, regression splines can be used, for example the truncated power basis (16), with three equispaced knots in the range of $\hat{X}_1$. The results for the first three dimensions are plotted in Figure 13. The communities are recovered correctly, and the ARI is 0.9360, corresponding to 10 misclassified nodes.

Figure 11: Scatterplots of $\{\hat{X}_2, \hat{X}_3, \hat{X}_4, \hat{X}_5\}$ against $\hat{X}_1$, coloured by department, with best-fitting quadratic curves passing through the origin.

Conclusion
An extension of the latent structure model (Athreya et al., 2021) for networks has been introduced, along with inferential procedures based on Bayesian modelling of the spectrally embedded nodes. The model, referred to as the latent structure blockmodel (LSBM), allows latent positions to live on community-specific univariate structural support manifolds, using flexible Gaussian process priors. Under the Bayesian paradigm, most model parameters can be integrated out, and inference can be performed efficiently. The proposed modelling framework can also be used when the number of communities is unknown and there is uncertainty around the choice of the Gaussian process kernels, encoded by a prior distribution. The performance of the model has been evaluated on simulated and real-world networks. In particular, excellent results have been obtained on complex clustering tasks concerning the Drosophila connectome and the Imperial College NetFlow data, where a substantial overlap between communities is observed. Despite these challenges, the proposed methodology is still able to recover a correct clustering. Overall, this work provides a modelling framework for graph embeddings arising from random dot product graphs where nodes are suspected to belong to community-specific lower-dimensional subspaces. In particular, this article discusses the case of curves, which are one-dimensional structural support submanifolds. The methodology has been demonstrated to recover the correct clustering structure even when the parametric form of the underlying structure is unknown, using a flexible Gaussian process prior on the unknown functions. In particular, regression splines with a truncated power basis have been used, showing good performance in recovering the underlying curves.
In the model proposed in this work, it has been assumed that the variance only depends on the community allocation. This enables marginalisation of most of the parameters, leading to efficient inference. On the other hand, this is potentially an oversimplification, since the ASE-CLT (Theorem 1) establishes that the covariance structure depends on the latent position. Further work should study efficient algorithms for estimating the parameters when an explicit functional form dependent on $\theta_i$ is incorporated in the covariance.

Code
A Python library reproducing the results in this paper is available in the GitHub repository fraspass/lsbm.

Figure 13: Scatterplots of $\{\hat{X}_2, \hat{X}_3, \hat{X}_4, \hat{X}_5\}$ against $\hat{X}_1$, coloured by department, and estimated cubic truncated power splines passing through the origin.