Nonnegative Bayesian nonparametric factor models with completely random measures for community detection

We present a Bayesian nonparametric Poisson factorization model for modeling network data with an unknown and potentially growing number of overlapping communities. The construction is based on completely random measures and allows the number of communities to either increase with the number of nodes at a specified logarithmic or polynomial rate, or be bounded. We develop asymptotics for the number of nodes and the degree distribution of the network and derive a Markov chain Monte Carlo algorithm for targeting the exact posterior distribution for this model. The usefulness of the approach is illustrated on various real networks.


Introduction
Non-negative matrix factorization (NMF) methods (Lee and Seung, 2001;Paatero and Tapper, 1994) aim to find a latent representation of a positive n × m matrix A as a sum of K non-negative factors. For integer-valued data, Poisson factorization models (Dunson and Herring, 2005) offer a flexible probabilistic framework for non-negative matrix factorization, and have found wide applicability in signal processing (Cemgil, 2009;Virtanen et al., 2008) or recommender systems (Gopalan et al., 2015;Ma et al., 2011). In this paper, we focus on the application to network analysis, where m = n and the n × n count matrix A represents the number of directed or undirected interactions between n individuals; the latent factors may be interpreted as latent and potentially overlapping communities (Ball et al., 2011), such as sport team members or other social activities circles. We also consider binary data where the matrix represents the existence or absence of a directed or undirected link between individuals. The estimated latent factors can be used for the prediction of missing links/interactions, or for interpretation of the uncovered latent community structure.
Poisson factorization approaches require the user to set the number K of latent factors, which is typically assumed to be independent of the sample size n. To address this problem, Zhou et al. (2012), Gopalan et al. (2014) and Zhou (2015) proposed Bayesian nonparametric approaches that allow the number of latent factors to be estimated from the data, and to grow unboundedly with the size n of the matrix. In particular, Gopalan et al. (2014) and Zhou (2015), considered a Poisson factorization model where the positive weights (r k ) k≥1 represent the importance of community k, and v ik > 0 represents the level of affiliation of individual i to community k. Gopalan et al. (2014) and Zhou (2015), extending work from Titsias (2008), assume that the weights (r k ) are the jumps of a gamma process, ensuring the sum in equation (1) is almost surely finite. Using properties of Poisson random variables, the model (1) can be equivalently represented as for 1 ≤ i, j ≤ n. The latent count variables Z ijk may be interpreted as the number of latent interactions between two individuals i and j via community k, the overall number A ij of interactions being the sum of those community interactions. For example, two members of the same company who also play sport together may meet five times at the company, and twice at the sport center, resulting in seven interactions overall. The overall number of communities k that generated at least one interaction between the n individuals is termed the number of active communities. For the gamma process Poisson factor model (Zhou, 2015), the number of active communities K n grows logarithmically with the number n of individuals. The logarithmic growth assumption may be too restrictive. For example, the number of active communities may actually be unknown but bounded above; alternatively, it may increase at a rate faster or slower than logarithmic.
In this paper, we consider generalizations of the gamma process Poisson factorization model, using completely random measures (CRM) (Kingman, 1967). CRMs offer a flexible and tractable modeling framework (Lijoi and Prünster, 2010). The proposed models fit in the class of multivariate generalized Indian Buffet process priors recently developed by James (2017) and are also related to compound completely random measures (Griffin and Leisen, 2017). We consider that (r k ) are the points of Poisson point process with mean measure ρ. Depending on the properties of this measure, the number of active communities K n is either (i) bounded, with a random upper bound, (ii) unbounded and grows sub-polynomially (e.g. log n or log log n) or (iii) unbounded and grows as n 2σ , for some σ ∈ (0, 1). For the implementation, we focus in particular on the generalized gamma process (Brix, 1999) where a single parameter flexibly controls all three behaviors.
The article is organized as follows. In Section 2, we describe the statistical model for count and binary matrices. The asymptotic properties of the model are derived in Section 3. In particular, we relate the asymptotic growth of the number of active features to the regular variation properties of the measure ρ. In Section 4 we derive a Markov chain Monte Carlo algorithm for posterior inference that does not require any approximation to the original model. In Section 5 we consider applications of our approach to overlapping community detection and link detection in networks, considering real network data with up to tens of thousands of nodes.

General construction
We present here the model for directed count or binary observations, but the model can be straightforwardly adapted to undirected interactions. Let (r k ) k=1,2..., be the points of a Poisson point process with σ-finite mean measure ρ on (0, ∞), and assume that v ik , i = 1, . . . , n, k ≥ 1, are independent and identically distributed from some probability distribution F on R + = [0, ∞). The variable v ik can be interpreted as the level of affiliation of an individual i to community k, and r k to the importance of that community.
For count data (A ij ), where A ij denotes the number of directed interactions from node i to node j, we consider the Poisson factorization model Denoting Λ ij = ∞ k=1 r k v ik v jk the Poisson rate for A ij , the n×n rate matrix Λ (n) = (Λ ij ) 1≤i,j≤n admits the following factorization as an infinite sum of rank-1 matrices where v 1:n,k = (v 1k , . . . , v nk ) . For the model to be well specified, the sum in the right-handside of Equation (5) needs to be almost surely finite. A necessary and sufficient condition is A sufficient set of conditions 1 , which we will assume to hold in the rest of this article, is that ρ is a Lévy measure and F has finite second moment, that is In this case, the community affiliations and weights for n nodes can be conveniently represented by a completely random measure G = k≥1 r k δ v 1:n,k on R n + with mean measure ρ(dr)F n (dv 1 , . . . , dv n ) where F n denotes the nth product measure of F ; see Kingman (1967) and Lijoi and Prünster (2010) for background on CRMs and their applications. If the Lévy measure is finite, that is, if ∞ 0 ρ(dr) < ∞ then the number of points (r k ), and therefore the number of communities, is almost surely finite. Otherwise, when ρ(dr) = ∞, the number of communities is infinite.
When we have binary observation (Y ij ), we treat the count matrix (A ij ) as a latent variable, and consider that Y ij = 1 Aij >0 as in (Caron and Fox, 2017;Zhou, 2015). Integrating out (A ij ), this leads to the following model for binary observations

Specific model
In the inference and experimental part, we use the following choice for the ρ and F . The Lévy measure ρ is taken to be that of a generalized gamma process (GGP, see Hougaard (1986), Brix (1999)) where σ 0 ∈ (−∞, 1), κ > 0 and τ > 0. When σ 0 = 0, we obtain a gamma process, and the model corresponds to that of Zhou (2015). When σ 0 < 0, the Lévy measure is finite, while when σ 0 ≥ 0, the Lévy measure is infinite.

Related work
The model introduced in this section can be seen from different perspectives that nicely connect it to the existing literature. First, the model can be seen as obtained from a functional of a CRM.
Recall the definition of the CRM G in Eq. (7). Define the n × n matrix Λ (n) as the following functional of G where h(u) = uu .Alternatively, this can be interpreted in the framework of compound completely random measures (Griffin and Leisen, 2017). For each 1 ≤ i, j ≤ n, denote G ij = k≥1 r k v ik v jk δ ζ k where ζ k are some community locations in some domain Θ, iid from some distribution H, irrelevant here. Then (G ij ) 1≤i,j≤n are compound CRMs on Θ and Λ ij = G ij (Θ). In the same vein, the model can also be interpreted as an instance of the class introduced by (James, 2014, Section 5). Denote Z (n) k the n × n matrix with entries Z ijk . Then the matrix-valued process k≥1 Z (n) k δ ζ k is a draw from a multivariate Indian buffet process. Finally, as mentioned in the introduction, the model admits as a special case the Poisson factorization based on the gamma process of Zhou (2015).

Asymptotic Properties
In this section we study the asymptotic properties of the proposed class of models, and in particular the growth rate of the number of active communities as the sample size n grows, and the asymptotic proportion of communities of a given size. For a given sequence (r k ) k≥1 and ijk where n ≥ 1, 1 ≤ i, j ≤ n, k ≥ 1 respectively the number of directed interactions and the number of community directed interactions distributed from Equations (2) and (3). We consider two different asymptotic settings • Constrained setting. For any 1 ≤ m ≤ n, and 1 ≤ i, j ≤ m, A ij . In this setting, we suppose that the connections between the already observed nodes remain unchanged. It is equivalent to assuming that there is an infinite but fixed graph and A (n) represents the connections between the n first nodes of that graph.
• Unconstrained setting. This setting is more general, and we only assume that A All the results of this section, otherwise stated, hold for the unconstrained setting. We indicate when a stronger result holds in the constrained setting. All proofs are given in Appendix A.

General model
Let d (n) k be the degree of the community/feature k, corresponding to the number of interactions amongst n individuals due to community k, and defined as A community is active if d (n) k ≥ 1. The number of active communities is therefore defined as Denote K n,j the number of communities with degree j ≥ 1 Note that under the constrained setting, d k , K n and ≥j K n, are all almost surely increasing with the sample size n, whereas this is not necessarily the case for the unconstrained setting.
Proposition 3.1. Under Assumptions (A1) and (A2), the number of active communities K n is a Poisson random variable with mean The number K n,j of communities with degree j is also Poisson distributed, with mean Finally, for j ≥ 1, k≥j K n,k , the number of communities with degree at least , is also Poisson distributed with mean k≥j Ψ k (n).
In the rest of the section we relate the asymptotic behavior of quantities of interest to the properties of the mean measure ρ. Let consider the tail Lévy intensity defined as We assume that ρ is a regularly varying function at 0, that is where σ ∈ [0, 1) and is a slowly varying function verifying lim t→+∞ (at)/ (t) = 1 for all a > 0. Examples of slowly varying functions include functions converging to a constant, log a t for any t, log log t, etc. Note that the CRM is finite activity if and only if σ = 0 and (t) → C < ∞. Now, let us consider the asymptotic behavior of the number of active communities K n .
Proposition 3.2. Let K n be the number of active communities. Then for 0 ≤ σ < 1, as n tends to infinity, where m f = vF (dv). Additionally, for 0 < σ < 1, If we further assume that the sequence (K n ) n≥1 is almost surely non-decreasing (as in the constrained setting), then (15) holds for σ = 0 and (t) → ∞ as well. In the finite activity case, that is as n tends to infinity, where K ∞ is a Poisson random variable with mean ρ(0). The above convergence holds in distribution for the unconstrained setting and almost surely for the constrained setting.
Proposition 3.3. Let K n,j be the number of communities of degree j. Then for 0 < σ < 1 and any j ≥ 1, as n tends to infinity. Therefore, as n tends to infinity. This corresponds to a power-law behavior as for large j. If we further assume that for all k ≥ 1, j≥k K n,j n≥1 is non-decreasing (constrained setting), then (17) holds also for σ = 0 and (t) → ∞.
Finally, let c (n) (k, k ) denote the cosine between the corresponding affiliation vectors This coefficient gives a measure of the overlap between two communities k and k . By the law of large numbers, for any k = k ,

Specific case of the GGP
In the case of the GGP, we have is a slowly varying function at infinity. The results of the previous subsection therefore apply. For simplicity, we state the results for the constrained setting. We have, almost surely as n → ∞ Therefore, σ 0 governs the asymptotic behavior of the number of active communities. K n is bounded with a random upper bound (σ 0 < 0), increases logarithmically (σ 0 = 0) or polynomially (σ 0 > 0). In the polynomial case, σ 0 also controls the power-law exponent of the proportion of communities of a given size. The parameter κ is an overall linear scaling parameter. Finally, the parameter α governs the amount of overlapping between two communities.

Simulation, posterior characterization and inference
In this section we describe the marginal distribution and conditional characterization of the model. Building on these, we derive an exact sampler for simulating from the model, and a Markov chain Monte Carlo algorithm to approximate the posterior distribution. Importantly, the sampler targets the distribution of interest and does not require any truncation or approximation. For simplicity of exposition, we assume that ρ and F are absolutely continuous with respect to the Lebesgue measure, with ρ(dr) = ρ(r)dr and F (dx) = f (x)dx.

Marginal distribution and simulation
For a fixed n, recall that K n denotes the number of active communities. Let (( r 1 , v 1:n,1 ), . . . , ( r Kn , v 1:n,Kn )) be the subsequence of (r k , v 1:n,k ) such that community k is active, meaning that 1≤i,j≤n Z ijk ≥ 1, arranged in random order. Let Z ijk be the number of community interactions corresponding to the active community ( r k , v 1:n,k ). Note that Let Z k = ( Z ijk ) 1≤i,j≤n . Using Proposition 5.2 of James (2017), we obtain the following lemma.
Sampling from the conditional distribution (21) can be done efficiently by first sampling the number of multiedges i,j Z i,j,k from a truncated Poisson with mean r k ( i v i,k ) 2 , then sampling iid the end nodes of the edges proportionally to the affiliation vector. Simulating from the conditional distribution (20) can be more challenging since it requires sampling a n + 1 dimensional vector. However, if we suppose that the affiliations are Gamma distributed, the problem reduces to sampling ( r k , i v i,k ), which is a two dimensional vector, and independently sample the normalized affiliations from a Dirichlet distribution. Indeed, if the affiliations are Gamma distributed, we consider the following change of variable.
This gives the following algorithm for exact simulation from the model.
is the Laplace exponent, Dirichlet(α, . . . , α) denotes the standard Dirichlet distribution and Gamma(x; a, b) denotes the probability density function of a Gamma random variable with parameters a and b, evaluated at x. In the case of the GGP, the Laplace exponent is One can sample from Eq. (24) and (25) using rejection.

Posterior characterization
Using Proposition 5.1 in (James, 2017), one can characterize the conditional distribution of the CRM G given the latent community counts Z ijk .
..,Kn , the CRM G has the same distribution as and (r k ,ṽ 1:n,k ) k=1,...,Kn are independent of G and iid with density In the case where f is a gamma pdf, we can use the same reparameterization as in the previous subsection with ( ς k , ϕ 1:n,k ) in place of v 1:n,k . This leads to the following conditional distributions.

Slice sampler for posterior inference
We recall that θ denote the set of hyperparameters of the mean measure ρ and pdf f . To simplify the presentation, here we suppose that we observe the complete adjacency matrix A, which means that we observe a directed and weighted graph with no missing (hidden) edge. The objective is obtain samples distributed from the conditional distribution In the Appendix, we show how to do inference when we only observe a partial graph (with missing edges to predict) that can be directed or undirected, weighted or binary. In order to leverage the Poisson factorization construction, we augment the model with the latent community counts Z k . Additionally, to deal with the unknown number of active communities K n , we use auxiliary slice variables, similarly to other Gibbs sampler for Bayesian nonparametric models (Favaro and Teh, 2013;Kalli et al., 2011;Walker, 2007). For each directed pair (i, j) such that A ij ≥ 1 consider the scalar latent variable and denote s = min ij s ij . Note that by definition, r k ≥ s for all k = 1, . . . , K n . Let G = k r k δ v 1:n,k 1 r k ≥s := Kn k=1 r k δ v 1:n,k be the CRM corresponding to the set of active or inactive communities with weight r k ≥ s, of (almost surely finite) cardinality K n ≥ K n . Denote Z ijk ≥ 0 the associated community interactions, and Z k = (Z ijk ). The data augmented slice sampler draws samples asymptotically distributed from p((Z k ) k=1,...,Kn , G, θ, s | A).
The main steps of the algorithm are as follows.
1. For each directed pair (i, j) such that A ij ≥ 1, Update (Z ijk ) k=1,...,Kn given the rest of the variables, 2. Update the hyperparameters θ given the rest of the variables, 3. Update (G, s) given the rest of the variables.
The details of each step are given in Appendix B. Each iteration of the Gibbs sampler has a time complexity scaling in K n S where S is the number of nonzero entries of the matrix. Therefore, the algorithm takes advantage of the sparsity of the networks. Additionally, each entry of the sparse graph can be dealt with independently, making the algorithm straightforwardly parallelizable.

Experiments
We implement the algorithm described in the previous section with the GGP-Gamma scores model. We assign Gamma priors on the hyperparameters κ, τ, α with parameters (0.1, 0.1). We fix β = 1. We allow up to a linear growth of the number of communities, corresponding to σ < 0.5, for small datasets and use a Gamma prior with parameter (0.1, 0.1) on 1 − 2σ. For larger datasets, we restrict σ < 0.25, meaning that the number of communities cannot grow at a faster rate than √ n. This is obtained by using a Gamma prior with parameter (0.1, 0.1) on 1 − 4σ.

Synthetic dataset
We first run the algorithm on a synthetic dataset simulated from our model, to check that the algorithm can recover the true parameters. We sample a directed and unweighted graph from the GGP-gamma model with size n = 800 and σ = 0.2, κ = 1, τ = 0.15, α = 0.05, β = 0.2. The number of edges of the obtained graph is 20198, and the true number of active communities is 42.
We run three chains in parallel with 500, 000 iterations, with 250, 000 iterations for burn-in. We show in Figure 1 trace plots of the number of active communities K n and parameter σ showing the MCMC algorithm can recover these parameters.

Political blogs
The polblogs network (Adamic and Glance, 2005) is the network of the American political blogosphere in February 2005. It is a directed unweighted graph, where there is an edge (i, j) if blog i cites blog j. It is composed of 1490 nodes and 19025 edges. For each node, some ground truth information about its political affiliation (republican/democrat) is known. We will use this dataset in order to illustrate the role of the parameter α in the model. As indicated in Section 3 this parameter tunes the amount of overlapping between the communities. A smaller value enforces less overlap between communities. We run three chains with 500, 000 iterations. The posterior samples of σ for three different values of α are in also shown in Figure 2. The model allows overlapping communities but, for visualization purposes, it is useful to obtain an associated partition of the nodes. For each iteration, one can cluster the nodes by assigning each node to the community where it is most active. That is, at iteration t of the MCMC algorithm, define for i = 1, . . . , n ik } the cluster membership of node i. We then compute an approximate Bayesian point estimate c = ( c 1 , . . . , c n ) of the partition of the nodes, using Binder's loss function (Lau and Green, 2007). Nodes are reordered according to their estimated membership c, and Figure 2 shows the densities of connection between and within clusters for three different values of α. Depending on the amount of overlapping, we obtain two (α = 0.8), three (α = 0.4) or four (α = 0.2) communities. In order to interpret those communities, we calculate in Table 1 for each community the proportion of interactions between democrat blogs, between a democrat and a republican blog, and between two republican blogs. For α = 0.8, there are two estimated communities which can clearly be identified as democrat (community #1) and republican (community 2). For α = 0.4, we have three communities. One is mostly associated to democrat blogs (#1) while the other two correspond to a split of the republican blogs into right (#2) and center-right (#3) groups. For α = 0.2, we obtain a further split of the democrat blogs into left (#1) and center-left (#2) groups. Increasing the value of α therefore leads to a finer and finer partition of the nodes.

Wikipedia topcast
The network is a partial web graph of Wikipedia hyperlinks collected in September 2011 (Klymko et al., 2014). It is a directed unweighted graph where an edge (i, j) corresponds to a citation from a page i to page j. We restrict it to the first 3000 nodes, and the associated 5687 edges. We run three MCMC chains for 200, 000 iterations. Trace plots of the number of active communities and parameter σ are given in Figure 3. Figure 4 shows the adjacency matrix reordered by communities, as explained in the previous section. In order to check that the learnt communities/features are meaningful, we report in Figures the proportions of webpages associated to a given category within a given community/feature (note that a webpage can be associated to multiple categories hence the proportion do not sum to 1). Note that, while the approach is able to estimate the latent block-structure, this dataset has the particularity of having star nodes, a feature that is not captured by our model.

Deezer
The dataset was collected from the music streaming service Deezer in November 2017 (Rozemberczki et al., 2018). It represents the friendship network of a subset of Deezer users from Romania. It is an undirected unweighted graph where nodes represent the users and edges are the mutual friendships. There are 41773 nodes and 125826 edges. We run three chains with 100000 iterations each. Posterior histograms of the number of active communities and σ are given in Figure 7. The algorithms finds around 45 communities/features for this dataset. The reordered adjacency matrix and block densities based on the point estimate of the partition are given in Figure 8. Now we can reorder the nodes using approximate MAP clustering as previously. We obtain the following adjacency matrix For each individual in the network, a list of musical genres liked by that person are available. There are in total 84 distinct genres. We represent in Figure 9 the proportion of individuals who liked a subset of the 84 genres for three different communities where the interpretation in terms of genres is quite clear. The overall proportion of individuals liking a given genre is shown at the bottom of Figure 9. If the bar is red, this indicates that the proportion is 10% higher in the community than in the population. If the bar is blue, this means the proportion is 10% lower. Community 11 can be interpreted as R&B, Community 8 as Dance, and Community 3 as Rock music. For some of the communities, not reported here, the interpretation in terms of the liked genres is less clear, and may be due to other covariates.

Discussion
The model presented in this paper assumed the same parameter β for each node. We can also consider a degree corrected version of the model, similarly to Zhou (2015), where each node is assigned a different parameter β i > 0 and then defining Z ijk ∼ Poisson( . It is unclear however if a MCMC sampler targeting the exact posterior distribution could be implemented, and one may need to resort to some truncation approximation as in Zhou (2015).
The count matrix (A ij ) is infinitely exchangeable, hence the model presented in this article lead to asymptotically dense graphs. That is, 1≤i,j≤n A ij n 2 as n tends to infinity. In order to obtain sparse graphs, we could consider two different strategies. The first solution consists in dropping the infinite exchangeability property and take β (n) i → ∞ with n, then the number of edges will behave as (n/β (n) ) 2 (we can for instance take β (n) i = √ n for any node i to obtain a linear growth of the number of edges). The model would still be finitely exchangeable for any fixed n, but not projective anymore. The second solution would be to consider the different notion of infinite exchangeability developed in (Caron and Fox, 2017) and consider (β i ) i as a realization of a Poisson point process.
Finally, we presented a model for count (and binary) data. The results build on the additive contributions of the communities, which is why we chose the Poisson distribution on the entries of the adjacency matrix (A ij ). We can generalize to non count data using other probability distributions which are closed under convolution. For example, one could consider
Lemma A.2. (Pollard, 2015, Exercise 15) Let X be a Poisson random variable with parameter λ . For any t > 0 Lemma A.3. Let (X n ) n≥1 be a sequence of Poisson random variables with mean (µ n ) n≥1 . If log n = o(µ n ) then X n ∼ µ n almost surely as n tends to infinity.
Lemma A.4. For any x, y ≥ 0, we have the following bound Proof. The bound is trivial when y ≤ 1. Consider the case y > 1. For all x, the function y → e −xy is convex hence f x (y) = e −xy −1 y is a monotonically non-decreasing function of y therefore e −xy − 1 y ≥ e −x − 1 for y ≥ 1.

A.2. Proofs of Section 3
Proof of Proposition 3.1. The result for K n is proved in James (2014) in the general context of GIBP. We provide here the details of the proof for K n , which can be straightforwardly adapted to K n,j . First, let us remark that the bound (33) together with assumptions (A1) and (A2) imply Ψ(n) < ∞. For s < 0, Then, since and the last part is integrable, we can use Campbell's theorem (Kingman, 1993) to get: We can prove similarly that K n,d is a Poisson random variable with mean Ψ d (n) and that The assumption Ψ d (n) < ∞ is also sufficient in this case to apply Campbell's theorem.
Besides, since v → 1 − e −rv is increasing, by Markov's inequality we have for any > 0 where ψ(t) = (1 − e −rt )ρ(dr) is the Laplace exponent. Furthermore, by the law of large numbers, Therefore, under Assumption (A4), Lemma A.1 implies as n tends to infinity.
Proof of Proposition 3.3. As for Proposition 3.2, we only need to show that for d ≥ 1, Therefore the proof is very similar to the one of Proposition 3.2. However, there are some technicalities we need to address since here v → v d d! e −rv is neither convex nor decreasing. Like previously, we will lower bound and upper bound Ψ d (n) by two quantities that are equivalent to σΓ(d−σ) d! n 2σ µ 2σ (n 2 ). Let us first introduce some notations. Let Let us notice that the law of large numbers gives us that P(S n ∈ B ) → 1.

Lower bound: We have that
Therefore, using Lemma A.1 and since P(S n ∈ B ) → 1, we have for n large enough

Upper bound: We have that
We find that for n large enough, Therefore, we only need to prove that E[1 Sn ∈B ϕ d (r, S 2 n )]ρ(r)dr = o(n 2σ (n 2 )).
In order to do so, we split the integral with respect to r in two parts, an integral over (0, 1 n 2 ) and an integral over ( 1 n 2 , ∞) and show that both are o(n 2σ (n 2 )). Since ϕ d (r, v) ≤ 1, where the last line follows from the law of large numbers and Assumption (A4). Besides, where the last inequality holds for n large enough by Assumption (A4) and Lemma A.1. Now, we have that n 2 ) n≥1 is uniformly integrable. Besides, using the law of large numbers, the sequence converges almost surely, and hence in probability, to 0. Therefore, lim n E[1 Sn ∈B S 2 n n 2 ] = 0, which concludes the proof.
For σ = 0, the previous computations for the upper bound give that almost surely, Ψ d (n) = o( (n 2 )) = o(Ψ(n)). Now let D > 1, fore, similarly to the proof for σ = 0 for (K n ) n , we find that Therefore, we finally find that

Appendix B: Gibbs sampler
As mentioned in the main text, the observed graph can be directed or undirected, binary or count, and can have missing entries we would like to predict. Denote by B the observed graph. Here we describes the steps of a Gibbs algorithm with stationary distribution p(K n , ( r k , v 1:n,k ) k=1,...,Kn , θ | B).
Notice that observing the full matrix B = A corresponds to a weighted and directed graph with no missing entry. Let I denote the set of all possible edges. In the directed case, I = {(i, j) | 1 ≤ i, j ≤ n} and on the undirected case I = {(i, j) | 1 ≤ i ≤ j ≤ n}. We say that (i, j) is not observed if we don't know the value of A i,j . Remark that (i, j) can be observed and still A i,j = 0.
Denote O the set of all observed entries and O c = I O, the set on unobserved entry. For all unobserved entry (i, j) ∈ O c , set B i,j = −1 Additionally, to deal with the unknown number of active communities K n , we use auxiliary slice variables s i,j for all (i, j) ∈ I, details are given in the following paragraphs. Denote s the smallest non-zero slice variable s i,j for (i, j) ∈ I. By definition of the slice variables, r k ≥ s for all k = 1, . . . , K n . Let G = k r k δ v 1:n,k 1 r k ≥s := Kn k=1 r k δ v 1:n,k be the CRM corresponding to the set of active or inactive communities with weight r k ≥ s, of (almost surely finite) cardinality K n ≥ K n . Denote Z ijk ≥ 0 the associated community interactions, and Z k = (Z ijk ).

B.1. Directed graph
For each observed pair (i, j) ∈ O, we define the slice variable as if A ij > 0 and s ij = 0 otherwise. For each non observed entry (i, j) ∈ O c , we define s i,j by (37) otherwise.
B.1.1. Gibbs sampler step 1 for weighted graph on observed entries Updating ( Z k ) k=1,...,Kn |(s, G), θ, B on observed entries indexes We sample (Z l ) l=1,..,Kn associated to all atoms of G and keep only the non empty communities. For every (i, j) ∈ O such that A i,j > 0. define the random variable m ij = min {l| Z ijl ≥1} r l . Then, writing the joint distribution it comes that independently for every such (i, j), where Mult is the multinomial distribution and p ijl = To simplify the notations, let us suppose that the atoms of G are in decreasing order. Remark that the indexing of Z is different from the one of Z, the second corresponding to the one of the truncated random measure. For each observed edge (i, j) independently, we can proceed in 4 phases for this step.
1. Sample m ij from the locations of G such that P(m ij = r L ) ∝

B.1.2. Gibbs sampler step 1 for unweighted graph on observed entries
In this setting we observe a binary matrix B ij = 1 Aij >0 . Then the first step of the Gibbs sampler is modified and becomes: Updating ( Z k ) k=1,...,Kn |(s, G), θ, B on observed entries indexes For each unobserved entry (i, j) ∈ O c , knowing s ij , we define L 0 = max{k | r k > s ij }. In the undirected graph, we suppose that for i = j, B ij = A ij + A ji and B ii = A ii . Besides, in this setting we actually don't need to sample Z ijk for all (i, j, k) but only Z ijk + Z jik . For each observed pair (i, j) ∈ O, we define the slice variable as if B ij > 0 and s ij = 0 otherwise. For each non observed entry (i, j) ∈ O c , we define s ij by (37) if { k | Z ijk + Z jik ≥ 1} = ∅ and s ij |( r k , Z ijk + Z jik ) k=1,...,Kn ∼ Unif(0, 1) otherwise. Then Step 2 and 3 remain unchanged. For step 1, simply replace p ijk by 2p ijk for i = j.
B.3. Proofs for the Gibbs sampler step 1

B.3.1. Weighted graph
Here will give the posterior distribution of the count matrices and show that ..,Kn |s, G, θ, B In order to do so, we derive the RHS posterior distribution. Let us first notice that given G, sampling the non zero counts and corresponding locations is equivalent to sampling (Z k ) k for k ∈ N. As stated previously, we can treat each edge (i, j) independently. Therefore, we sample the sequence (Z ijk ) k . Here we suppose that the communities come with decreasing activity order. Let the random variable L = max{k | Z ijk > 0} (supposing that the (r k ) k are decreasing). And This shows how we can sample in three steps these variables. Let us remark that the second part corresponds to the distribution of a zero truncated binomial and that the third part corresponds to the distribution of a multinomial. We also notice that only the elements of G are actually needed.

B.3.3. Prediction
Here we show how to update the missing entries we try to predict. Let us recall that for a predicted count, if it is positive, we define the slice variable as previously. However, if the count is equal to zero, then the slice variable is simply uniform over [0, 1]. Now let L 0 = max{k | r k > s ij } P((Z ijk ) k |s, G, θ) ∝ P((Z ijk ) k |G, θ) × P(s ij |(Z ijk ) k , G, θ) Using B.3.2, it comes that Therefore, here we proceed in two steps, first we sample the binomial 1 Aij =0 with parameter Then, conditioning on the event A ij = 0, we use B.3.2 to proceed.

B.4. Proof for the Gibbs step 2
Here we show how we can update the parameters θ = (κ, σ, τ, α, β) using a Metropolis-Hastings update. First, let us derive the posterior distribution of the hyperparameters.
We write G = G + K c=1r c δṽ c where G is the non observed part. And we note G the restriction of G to the locations which intensity is larger than min s. (f (v i )dv i ) ρ(r)dr.