Implicit consensus clustering from multiple graphs

Dealing with relational learning generally relies on tools modeling relational data. An undirected graph can represent these data with vertices depicting entities and edges describing the relationships between the entities. These relationships can be well represented by multiple undirected graphs over the same set of vertices with edges arising from different graphs catching heterogeneous relations. The vertices of those networks are often structured in unknown clusters with varying properties of connectivity. These multiple graphs can be structured as a three-way tensor, where each slice of tensor depicts a graph which is represented by a count data matrix. To extract relevant clusters, we propose an appropriate model-based co-clustering capable of dealing with multiple graphs. The proposed model can be seen as a suitable tensor extension of mixture models of graphs, while the obtained co-clustering can be treated as a consensus clustering of nodes from multiple graphs. Applications on real datasets and comparisons with multi-view clustering and tensor decomposition methods show the interest of our contribution.


Introduction
Relational data are ubiquitous in various fields (web, biology, neurology, sociology, communication, economics, etc.), and their accessibility has kept increasing in recent years. These data, as a whole, form a network formalized by a graph, where each node is an entity, and each edge is a connection between a pair of nodes; this graph can be directed or not. We find this situation in various scientific publications; the relationships between documents can often be described as multiple graphs with different types of links. In fact, several relationships, such as co-terms, co-authors, co-keywords, and co-references between documents can be used. The objective of this work is to address the clustering of multiple graphs. This is a graph mining task of clustering vertices into several groups in the presence of multiple types of proximity relations. We could hypothesize that the combination of different information that arises from multiple graphs may improve the clustering results. For instance, two documents which share a number of words and/or have one or more authors in common and/or quote each other, are likely to deal with the same topic. Incorporating this additional information leads us to consider a tensor representation of the data.
To deal with multiple graphs, various models and methods, following different approaches, have been proposed to analyze these networks. In Banerjee et al. (2007) and Tang et al. (2009), the authors proposed a multi-way clustering framework for relational data, where different types of entities are simultaneously clustered, based not only on their intrinsic attribute values, but also on the multiple relations between the entities. Other works use a spectral decomposition-based approach relying on the combination of adjacency matrices (Tang et al. 2009; Chen et al. 2017; Nie et al. 2017). In these works, clustering is not the main objective of the proposed approaches; nevertheless, it can be deduced from the decomposition results.
On the other hand, one of the most used methods in this context is the Stochastic Block Model (SBM) (Nowicki and Snijders 2001), which is a probabilistic approach. SBM is commonly used for network modeling and for discovering the latent community structures of a graph. It provides a statistical approach able to organize a data matrix, symmetric or not, into homogeneous blocks. This leads us to consider SBM (Daudin et al. 2008) as a particular case of the Latent Block Model (LBM) proposed by Govaert and Nadif (2003, 2006) and extended in (Shan and Banerjee 2008; Govaert and Nadif 2013), which models any kind of data matrix, not necessarily square or symmetric. In other words, the clustering of a graph, directed or not, is in fact a particular case of co-clustering (Dhillon et al. 2003; Labiod and Nadif 2014; Salah and Nadif 2019; Affeldt et al. 2021). In this work, we consider graphs represented by adjacency matrices assimilated to contingency tables. Thus, considering the previous example of document clustering, the relations between documents (co-terms, co-authors, etc.) are count data and can be represented by particularly sparse contingency tables. Many works in the literature show the interest of the Poisson distribution for graph theory and for the clustering of random graphs (Janson 1987; Daudin et al. 2008).
To the best of our knowledge, this is the first attempt to formulate a model-based co-clustering for sparse three-way data. To this end, we rely on the latent block model (Govaert and Nadif 2013) for its flexibility to consider any data matrices. Figure 1 presents a binary three-way dataset constructed from multiple graphs, and the expected results in terms of co-clustering. This leads us to consider our objective as a problem of consensus clustering from different sources.

Fig. 1 Goal of clustering of multiple graphs
Consensus clustering, also called cluster ensembles, refers to the situation in which different clusterings have been obtained from a dataset, and it is desired to find a single consensus clustering that is a better fit in some sense than the existing clusterings. Thereby, consensus clustering aims to reconcile clustering information about the same dataset coming from different runs of the same algorithm or from different algorithms. This kind of consensus is referred to, in the paper, as explicit consensus clustering. On the other hand, we aim to obtain a consensus clustering from different sources (slices) with the same algorithm, as in our case. We refer to this type of consensus clustering as implicit consensus clustering. The key contributions of this work are:
- We first establish the links between the Poisson Latent Block Model (PLBM) and the Poisson Stochastic Block Model (PSBM). Then we show the interest of considering PLBM rather than PSBM.
- We propose a Sparse PLBM (SPLBM), a suitable probabilistic model for the clustering of multiple graphs. Then we derive an EM-type learning algorithm.
- We perform extensive numerical experiments and compare our proposal with multi-view and tensor decomposition methods.
- Finally, using the ensemble method, we show that the proposed algorithm, which can be viewed as an implicit consensus clustering for multiple graphs, is more effective than the explicit clustering obtained by traditional consensus clustering methods.
The remainder of this paper is organized as follows. In Sect. 2, we present related work and show the strong points of our approach. Section 3 reviews PLBM, shows the limits of traditional PSBM and describes Sparse PLBM (SPLBM). Sect. 4 discusses the extension of SPLBM to consider multiple graphs. In Sect. 5, we present a variational Expectation-Maximization algorithm. Sect. 6 is devoted to evaluating our approach. Finally, Sect. 7 concludes the paper and gives some directions for future research.

Related work
Although SBM is popular in social network analysis, when dealing with count data and because of degree heterogeneity, the traditional SBM fails to detect relevant clusters of edges when addressing the community detection problem (Qiao et al. 2017). Thereby, several authors have developed a degree-corrected SBM. In Karrer and Newman (2011), using a Poisson SBM, the authors introduced a parameter $\theta_i$ controlling the expected degree of vertex $i$. They consider that each $x_{ij}$ with $i \neq j$ is distributed according to $\mathrm{Poisson}(\theta_i \theta_j \delta_{k\ell})$, where $\delta_{k\ell}$ is the expected value of the adjacency matrix for the vertices $i$ and $j$ lying in block $(k, \ell)$, while $x_{ii}$ is distributed according to $\mathrm{Poisson}(\frac{1}{2} \theta_i^2 \delta_{kk})$. Doing so, and under some constraints on the $\theta_i$'s, they proposed the Degree-Corrected SBM (DC-SBM) clustering algorithm for an undirected graph on $n$ vertices, possibly including self-edges. Furthermore, they established the equivalence between the maximization of the log-likelihood and the maximization of mutual information used as an objective function for clustering bipartite graphs (Dhillon et al. 2003). It is important to emphasize that the model proposed in Karrer and Newman (2011) is similar to that proposed by Nadif and Govaert (2005), where the authors also showed this connection with the maximization of mutual information; they proposed the Croinfo algorithm, as illustrated in Fig. 2. In fact, the objective function maximized by DC-SBM, which can also be used for the co-clustering of an undirected graph, is associated with a constrained Poisson LBM commonly used in the co-clustering context; see, e.g., Ailem et al. (2017a, b) and Role et al. (2019). To sum up, DC-SBM implies that the data are generated according to a Poisson LBM with $\mathcal{P}(x_{ij}; x_{i.} x_{.j} \gamma_{k\ell})$, where $\mathcal{P}(x_{ij}; \lambda) = \frac{e^{-\lambda} \lambda^{x_{ij}}}{x_{ij}!}$, and in which the proportions of the classes of the nodes are assumed to be equal.
In addition, although the DC-SBM and Croinfo algorithms are different, their objective is the same, and the clustering considered is based on an approach similar to that of traditional hard clustering algorithms; for more detail, the reader can refer to recent works (Govaert and Nadif 2013, 2018).
In our contribution, we structure graphs as three-way data, and clustering is the principal objective. We propose an extension of LBM to tackle the co-clustering of multiple undirected/directed graphs where each cell of the diagonal is not necessarily equal to an even number, as conventionally considered in community detection. To do this, we adopt an EM-type approach, that is, one based on the Expectation-Maximization algorithm (Dempster et al. 1977; McLachlan and Peel 2000) and not on Classification EM (Celeux and Govaert 1992). Furthermore, we will show that this approach can be viewed as an implicit consensus clustering from multiple graphs.

Poisson latent and stochastic block models
Given an $n \times d$ data matrix $X = (x_{ij},\ i \in I = \{1, \ldots, n\};\ j \in J = \{1, \ldots, d\})$, it is assumed that there exists a partition on $I$ and a partition on $J$. A pair of partitions $(Z, W)$ will represent a partition of $I \times J$ into $g \times m$ blocks. The partition $Z$ for rows can be represented by a label vector $(z_1, \ldots, z_n)$ where $z_i \in \{1, \ldots, g\}$, or by a binary matrix $Z = (z_{ik}) \in \{0, 1\}^{n \times g}$ satisfying $\sum_{k=1}^{g} z_{ik} = 1$. In the same manner, the partition $W$ for columns can be represented by a label vector $(w_1, \ldots, w_d)$ where $w_j \in \{1, \ldots, m\}$, or by a binary matrix $W = (w_{j\ell}) \in \{0, 1\}^{d \times m}$ satisfying $\sum_{\ell=1}^{m} w_{j\ell} = 1$.
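As an illustration of the two equivalent representations of a partition, the following minimal sketch converts label vectors into binary indicator matrices and checks the row-sum constraint. The function name `labels_to_indicator` and the toy data are ours, and labels are 0-based here, whereas the text uses 1-based labels.

```python
import numpy as np

def labels_to_indicator(labels, n_clusters):
    """Turn a label vector (values in 0..n_clusters-1) into a binary matrix."""
    labels = np.asarray(labels)
    Z = np.zeros((labels.size, n_clusters), dtype=int)
    Z[np.arange(labels.size), labels] = 1
    return Z

z = [0, 2, 1, 2]   # row labels, n = 4 and g = 3
w = [1, 0, 1]      # column labels, d = 3 and m = 2
Z = labels_to_indicator(z, 3)
W = labels_to_indicator(w, 2)

# Each row of Z and W sums to one, as required for a partition.
assert (Z.sum(axis=1) == 1).all() and (W.sum(axis=1) == 1).all()

# Z^T X W aggregates the entries of X over the g x m blocks defined by (Z, W).
X = np.arange(12).reshape(4, 3)
block_sums = Z.T @ X @ W
```

The product `Z.T @ X @ W` is a convenient way to obtain block totals from a pair of partitions, which is the quantity most block-model estimators are built on.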

Poisson latent block model (PLBM)
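Denoting $\mathcal{Z}$ and $\mathcal{W}$ the sets of possible labels $Z$ for $I$ and $W$ for $J$, the marginal density function $f(X; \Omega)$ of the Poisson Latent Block Model (PLBM) (Govaert and Nadif 2018) can be written

$$f(X; \Omega) = \sum_{(Z, W) \in \mathcal{Z} \times \mathcal{W}} \prod_{i,k} \pi_k^{z_{ik}} \prod_{j,\ell} \rho_\ell^{w_{j\ell}} \prod_{i,j,k,\ell} \left( \mathcal{P}(x_{ij}; x_{i.} x_{.j} \gamma_{k\ell}) \right)^{z_{ik} w_{j\ell}} \tag{1}$$

where $\mathcal{P}(x; \lambda)$ denotes the Poisson probability mass function with mean $\lambda$, $x_{i.}$ and $x_{.j}$ are the row and column margins, and $\Omega = (\pi, \rho, \gamma)$, with $\pi = (\pi_1, \ldots, \pi_g)$ and $\rho = (\rho_1, \ldots, \rho_m)$, where $(\pi_k = P(z_{ik} = 1),\ k = 1, \ldots, g)$ and $(\rho_\ell = P(w_{j\ell} = 1),\ \ell = 1, \ldots, m)$ are the mixing proportions of row and column clusters respectively, and $\gamma = (\gamma_{k\ell};\ k = 1, \ldots, g,\ \ell = 1, \ldots, m)$. For this model, the complete data are taken to be the vector $(X, Z, W)$, where the unobservable $Z$ and $W$ are the labels; the resulting complete-data log-likelihood can be written as follows:

$$L_C(Z, W, \Omega) = \sum_{i,k} z_{ik} \log \pi_k + \sum_{j,\ell} w_{j\ell} \log \rho_\ell + \sum_{i,j,k,\ell} z_{ik} w_{j\ell} \log \mathcal{P}(x_{ij}; x_{i.} x_{.j} \gamma_{k\ell}).$$

To estimate $\Omega$, we consider the EM algorithm (Dempster et al. 1977). However, the E-step using the log-likelihood of (1) directly is intractable due to the dependence structure among the rows and columns. Govaert and Nadif (2005) suggest a variational approximation relying on the interpretation of EM due to Neal and Hinton (1998). This leads to maximizing the following lower bound of the log-likelihood criterion:

$$F_C(\tilde{Z}, \tilde{W}, \Omega) = L_C(\tilde{Z}, \tilde{W}, \Omega) + H(\tilde{Z}) + H(\tilde{W}) \tag{2}$$

where $L_C(\tilde{Z}, \tilde{W}, \Omega)$ is the fuzzy complete-data log-likelihood, and $H(\tilde{Z}) = -\sum_{i,k} \tilde{z}_{ik} \log \tilde{z}_{ik}$ with $P(z_{ik} = 1 | X) = \tilde{z}_{ik}$, and $H(\tilde{W}) = -\sum_{j,\ell} \tilde{w}_{j\ell} \log \tilde{w}_{j\ell}$ with $P(w_{j\ell} = 1 | X) = \tilde{w}_{j\ell}$, are the entropies.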
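To make the variational criterion concrete, the sketch below evaluates a lower bound of the form "fuzzy complete-data log-likelihood plus the two entropies". It assumes, as stated in the related-work discussion, that each entry follows a Poisson distribution with mean $x_{i.} x_{.j} \gamma_{k\ell}$; the function names and the random toy data are ours, and the code is meant as an illustration, not as the authors' implementation.

```python
import math
import numpy as np

def poisson_logpmf(x, lam):
    """log P(x; lam) = -lam + x log(lam) - log(x!), for lam > 0."""
    log_fact = np.vectorize(math.lgamma)(x + 1.0)
    return -lam + x * np.log(lam) - log_fact

def entropy(P, eps=1e-12):
    """H(P) = -sum p log p, with 0 log 0 treated as 0."""
    return -np.sum(P * np.log(np.clip(P, eps, None)))

def plbm_lower_bound(X, Zt, Wt, pi, rho, gamma):
    """L_C(Zt, Wt) + H(Zt) + H(Wt), with Poisson means x_i. x_.j gamma_kl."""
    margins = np.outer(X.sum(axis=1), X.sum(axis=0))   # x_i. * x_.j
    L_C = np.sum(Zt * np.log(pi)) + np.sum(Wt * np.log(rho))
    for k in range(gamma.shape[0]):
        for l in range(gamma.shape[1]):
            logp = poisson_logpmf(X, margins * gamma[k, l])
            L_C += np.sum(np.outer(Zt[:, k], Wt[:, l]) * logp)
    return L_C + entropy(Zt) + entropy(Wt)

rng = np.random.default_rng(0)
X = (rng.poisson(2.0, size=(6, 5)) + 1).astype(float)  # +1 keeps margins positive
Zt = rng.dirichlet(np.ones(2), size=6)                 # soft row memberships
Wt = rng.dirichlet(np.ones(2), size=5)                 # soft column memberships
pi, rho = Zt.mean(axis=0), Wt.mean(axis=0)
gamma = np.full((2, 2), 1.0 / X.sum())
value = plbm_lower_bound(X, Zt, Wt, pi, rho, gamma)
```

With hard (0/1) memberships both entropies vanish and the bound reduces to the complete-data log-likelihood, which is why the entropy terms can be read as the price paid for soft assignments.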

Poisson stochastic block model
As mentioned earlier, the Poisson SBM, and even DC-SBM, are particular cases of the Poisson LBM insofar as the latter can model matrices that are symmetric or not, directed or undirected graphs, numbers of row clusters and column clusters that are not necessarily equal ($g \neq m$), and cluster proportions that are equal or not. Therefore, the transition from LBM to SBM is easy to show: for an undirected graph, the maximization of (2) reduces to the maximization of a criterion whose third term concerns the diagonal of $X$. The main differences between both models are: a) with the Poisson SBM, this third term is ignored and the degree of nodes is not taken into account, unlike with LBM; b) with the Poisson LBM, $\gamma_{k\ell}$ depends only on the block $k\ell$ and not on the margins. Starting from PLBM, we will see next how to take into account the sparsity often present in graphs.

PLBM for sparse data: sparse PLBM (SPLBM)
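Recently, in Ailem et al. (2017b), the authors proposed a generative mixture model for co-clustering document-term matrices, referred to as SPLBM. With this model, they assume that for each diagonal block $kk$ the values $x_{ij} \sim \mathrm{Poisson}(\lambda_{ij})$ with $\lambda_{ij} = x_{i.} x_{.j} \sum_{k} z_{ik} w_{jk} \gamma_{kk}$, and for each block $k\ell$ with $k \neq \ell$, $x_{ij} \sim \mathrm{Poisson}(\lambda_{ij})$ where the parameter $\lambda_{ij}$ takes the following form:

$$\lambda_{ij} = x_{i.} x_{.j} \sum_{k, \ell \neq k} z_{ik} w_{j\ell} \gamma_{k\ell}.$$

Assuming $\forall \ell \neq k,\ \gamma_{k\ell} = \gamma$ amounts to supposing that all blocks outside the diagonal share the same parameter. SPLBM has been designed from the ground up to deal with data sparsity problems. As a consequence, in addition to seeking homogeneous blocks, it also filters out homogeneous but noisy ones due to the sparsity of the data. The pdf of SPLBM can be written as follows:

$$f(X; \Omega) = \sum_{(Z, W) \in \mathcal{Z} \times \mathcal{W}} \prod_{i,k} \pi_k^{z_{ik}} \prod_{j,k} \rho_k^{w_{jk}} \prod_{i,j,k} \left( \mathcal{P}(x_{ij}; x_{i.} x_{.j} \gamma_{kk}) \right)^{z_{ik} w_{jk}} \prod_{i,j,k,\ell \neq k} \left( \mathcal{P}(x_{ij}; x_{i.} x_{.j} \gamma) \right)^{z_{ik} w_{j\ell}}.$$

Assuming that the complete data are $(X, Z, W)$, the complete-data log-likelihood $L_C(Z, W, \Omega)$ takes the following form:

$$L_C(Z, W, \Omega) = \sum_{i,k} z_{ik} \log \pi_k + \sum_{j,k} w_{jk} \log \rho_k + \sum_{i,j,k} z_{ik} w_{jk} \log \mathcal{P}(x_{ij}; x_{i.} x_{.j} \gamma_{kk}) + \sum_{i,j,k,\ell \neq k} z_{ik} w_{j\ell} \log \mathcal{P}(x_{ij}; x_{i.} x_{.j} \gamma).$$

To estimate the parameters $\Omega$, $Z$ and $W$, a variational EM has been proposed (Ailem et al. 2017b) to maximize (2).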
Note that although SPLBM is a co-clustering model, we can derive from it a graph clustering algorithm operating on an adjacency matrix (symmetric or not). Thereby, when dealing with undirected graphs and starting with the same initialization of $z$ and $w$ ($z^{(0)} = w^{(0)}$), we obtain the same row and column clusters, which is essential for the undirected graph clustering problem.
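The diagonal-versus-noise parameterization can be sketched as follows. The code assumes that the Poisson mean of an entry is the product of its row and column margins with a block-specific parameter on diagonal blocks and a single shared parameter elsewhere; the function name `splbm_poisson_means`, the toy graph and the parameter values are hypothetical.

```python
import numpy as np

def splbm_poisson_means(X, z, w, gamma_diag, gamma_off):
    """Poisson means under an SPLBM-style parameterization (hard labels).

    Entry (i, j) gets x_i. * x_.j * gamma_diag[k] when z[i] == w[j] == k,
    and x_i. * x_.j * gamma_off otherwise (shared off-diagonal parameter).
    """
    z, w = np.asarray(z), np.asarray(w)
    margins = np.outer(X.sum(axis=1), X.sum(axis=0))
    same_block = z[:, None] == w[None, :]
    block_param = np.where(same_block, np.asarray(gamma_diag)[z][:, None], gamma_off)
    return margins * block_param

# Undirected toy graph with two communities; rows and columns share the labels.
X = np.array([[0, 3, 0, 1],
              [3, 0, 1, 0],
              [0, 1, 0, 2],
              [1, 0, 2, 0]])
z = w = [0, 0, 1, 1]
lam = splbm_poisson_means(X, z, w, gamma_diag=[0.02, 0.03], gamma_off=0.005)
```

Using the same label vector for rows and columns mirrors the remark above: with a symmetric adjacency matrix and identical initializations, the row and column partitions coincide.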

PSBM, PLBM and SPLBM for graphs
Although PLBM can deal with sparse matrices, SPLBM can be more suitable for them (Fig. 3). It is designed to seek a diagonal block structure and to capture the most reliable associations between row and column clusters. SPLBM assumes that each diagonal block (or co-cluster) is generated according to a Poisson distribution with its own specific parameter, and that each non-diagonal co-cluster, representing noise, is generated according to a Poisson distribution with a parameter shared by all non-diagonal co-clusters. In Fig. 4, we report the graphical models of the Poisson models discussed in the paper.
To clarify expectations and the impact of this parameterization, on the political blogs dataset, we applied the clustering algorithms derived from SBM, PLBM, and SPLBM, using 30 random initializations, and measured the clustering accuracy. Figure 5 shows the interest of SPLBM, which takes into account the sparsity often present in a graph network.
The properties of this parameterization prompt us to adopt it for co-clustering with multiple graphs, as illustrated in Fig. 1. Next, to avoid confusion between the rows and columns, which are identical in our case, we still keep the notations using the $z_{ik}$'s and $w_{j\ell}$'s. The presented models PSBM, PLBM, and SPLBM deal with adjacency matrices (2D data matrices) to tackle the problem of graph clustering. In the sequel, we deal with multiple graphs organised as a 3D data matrix, in which each matrix depicts a graph.

Three-way tensor characteristics
A tensor is a multidimensional array, also known as an $N$-way or $N$th-order tensor. A tensor can be viewed as an element of the tensor product of $N$ vector spaces (Kolda and Bader 2009). This notion of tensors should not be confused with tensors in physics and mathematics, such as stress and strain tensors (Frankel 2012). A three-way tensor, or third-order tensor, has three dimensions and therefore three indices, as shown in Fig. 6. A first-order tensor is a vector, a second-order tensor is a matrix, and tensors of order three or higher are called higher-order tensors.
The notation used here is very close to that introduced by Kiers (2000) for third-order tensors. Scalars are represented by lowercase letters, e.g. $x$, and vectors by bold lowercase letters, e.g. $\mathbf{x}$. Matrices are denoted by bold capital letters, e.g. $\mathbf{X}$. Finally, tensors are indicated by bold capital Euler letters, e.g. $\mathcal{X}$.
The $i$th element of a vector $\mathbf{x}$ is denoted by $x_i$, the element $(i, j)$ of a matrix by $x_{ij}$, and $x^b_{ij}$ represents the element $(i, j, b)$ of a tensor. The order of a tensor is its number of dimensions, also called ways or modes. A first-order tensor is a vector, a second-order tensor is a matrix, and a third-order tensor is a cuboid. In the case of a matrix $\mathbf{X}$, a row and a column can be denoted by $\mathbf{x}_{i:}$ and $\mathbf{x}_{:j}$, respectively. In the case of a three-way tensor, $\mathbf{x}_{ij:}$, $\mathbf{x}_{i:b}$, and $\mathbf{x}_{:jb}$ represent the vectors of the three different modes, respectively. As we consider frontal slices, the tensor can be represented as in Fig. 7c; this is the most often chosen representation. For convenience, in the following, we will denote the tensor entry $\mathbf{x}_{ij:}$ by $\mathbf{x}_{ij}$. In the sequel, we aim to extract homogeneous sub-tensors from three-way data.
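The indexing conventions above map directly onto array slicing. The short sketch below uses a NumPy array of shape (n, n, v) as a stand-in for the tensor, with 0-based indices; the variable names are ours.

```python
import numpy as np

n, v = 4, 3
X = np.arange(n * n * v).reshape(n, n, v)   # X[i, j, b] plays the role of x_ij^b

frontal_slice = X[:, :, 1]   # graph b = 1: an n x n matrix
tube = X[0, 2, :]            # x_{ij:} for (i, j) = (0, 2): one value per graph
row_fiber = X[0, :, 1]       # x_{i:b}
col_fiber = X[:, 2, 1]       # x_{:jb}

# The tensor is fully described by its v frontal slices.
slices = [X[:, :, b] for b in range(v)]
```

Stacking the frontal slices back along the third axis recovers the original array, which is the representation used when each slice is the adjacency matrix of one graph.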

Definition of the proposed model
We extend SPLBM to three-way tensor data, leading to Tensor SPLBM (TSPLBM). The proposed model seeks not only to discover homogeneous tube co-clusters (three-dimensional co-clusters) but also to discover important blocks and ignore noisy ones. Thereby, TSPLBM makes it possible to discover a diagonal co-cluster structure, whose co-clusters are tubes (through all slices) of the three-way tensor. This makes it more useful for sparse tensors with high sparsity, close to 90%, as shown in the experiments. TSPLBM provides a better partitioning than the classical co-clustering algorithm applied on each slice of the tensor separately, or than a consensus clustering applied to these independent results.

Fig. 8 The three-way tensor structure
Our proposal, Tensor SPLBM, considers a 3D data matrix $\mathcal{X} = [\mathbf{x}_{ij}] \in \mathbb{R}^{n \times n \times v}$, where $n$ is the number of nodes and $v$ the number of graphs (slices). Figure 1 presents tensor data with $v$ graphs. As $\mathcal{X}$ is symmetric per slice $b$, when $i = j$ we have $z_{ik} = w_{jk}$, and for $k = 1, \ldots, g$ we have $\pi_k = \rho_k$. This leads to the fuzzy complete-data log-likelihood and to the lower bound of the log-likelihood criterion, denoted $F_C(\tilde{Z}, \Omega)$ (see "Appendix A" for more details). Thus, to estimate $\Omega$ and $\tilde{Z}$, from which we can deduce $Z$, we optimize $F_C(\tilde{Z}, \Omega)$, in which $H(\tilde{Z}) = -\sum_{i,k} \tilde{z}_{ik} \log \tilde{z}_{ik}$ is the entropy. After some algebraic calculations, the criterion can be simplified, up to a constant (see "Appendix B" for more details).

Variational inference
To estimate the parameters of the model, we rely on the variational EM algorithm (Govaert and Nadif 2005), and we extend it to multiple graphs. In the sequel, the proposed algorithm is referred to as TSPLBM.
E-step. It consists in computing, for all $i, j, k$, the posterior probabilities $\tilde{z}_{ik}$ and $\tilde{z}_{jk}$ given the estimated parameters $\Omega$. As $\sum_k \tilde{z}_{ik} = \sum_k \tilde{z}_{jk} = 1$, using the corresponding Lagrangians, up to terms which are not a function of $\tilde{z}_{ik}$, leads to an update of $\log \tilde{z}_{ik}$ expressed through the terms $P_{ijb}$, which are described, in a simple form (Algorithm 1), in "Appendix C", and where $\tilde{z}^{(t)}_{ik}$ represents the value of $\tilde{z}_{ik}$ at the previous iteration $(t)$.
M-step. Given the previously computed posterior probabilities $\tilde{Z}$, the M-step consists in updating, $\forall k$, the parameters $\pi_k$, $\gamma^b_{kk}$ and $\gamma^b$. First, taking into account the constraint $\sum_k \pi_k = 1$, it is easy to show that $\pi_k = \frac{\sum_i \tilde{z}_{ik}}{n}$. Secondly, the updates of $\gamma^b_{kk}$ and $\gamma^b$ are easy to obtain for all $b, k$ ("Appendix C").
The TSPLBM algorithm (Algorithm 1) for multiple graphs alternates the two previously described Expectation and Maximization steps until the change in the objective function value (4) is small or null. At convergence, a hard co-clustering, where each data point either belongs to a cluster completely or not, is deduced from the $\tilde{z}_{ik}$'s using the maximum a posteriori principle: $\forall i$, $z_i \in \{1, \ldots, g\}$ is given by $z_i = \arg\max_{k} \tilde{z}_{ik}$. The computational complexity of the TSPLBM algorithm scales linearly with the number of non-zero entries. Let us denote by $nz$ the number of non-zero entries in $\mathcal{X}$, $it$ the number of iterations, $g$ the number of clusters and $v$ the number of slices; the computational complexity is $O(it \cdot g \cdot v \cdot nz)$.
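Two of the steps above are simple enough to be written down directly: the update of the proportions in the M-step and the maximum a posteriori rule applied at convergence. The sketch below covers only these two steps (the updates of the block parameters are not reproduced here); function names and the toy posterior matrix are ours, and cluster labels are 0-based.

```python
import numpy as np

def update_proportions(Zt):
    """M-step update of the proportions: pi_k = sum_i z~_ik / n."""
    return Zt.sum(axis=0) / Zt.shape[0]

def map_labels(Zt):
    """Hard clustering at convergence: z_i = argmax_k z~_ik."""
    return np.argmax(Zt, axis=1)

# Toy posterior probabilities for n = 4 nodes and g = 3 clusters.
Zt = np.array([[0.9, 0.1, 0.0],
               [0.2, 0.7, 0.1],
               [0.1, 0.1, 0.8],
               [0.6, 0.3, 0.1]])
pi = update_proportions(Zt)
labels = map_labels(Zt)
```

Because each row of the posterior matrix sums to one, the updated proportions automatically satisfy the sum-to-one constraint.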

Experiments
The objective of our experiments is fivefold. First, we discuss some connections between TSPLBM and multi-view clustering (Sect. 6.2). Secondly, we evaluate the interest of considering multiple graphs simultaneously with TSPLBM, unlike tensor decomposition methods that consider a reduced matrix arising from multiple graphs (Sect. 6.3). Thirdly, we evaluate the impact of considering multiple graphs (Sect. 6.4). Fourthly, we show how we can harness the results obtained by TSPLBM (Sect. 6.5). Finally, we show that TSPLBM can be viewed as an implicit consensus clustering and propose a solution to increase its clustering performance in an ensemble method framework (Sect. 6.6).

Algorithm 1: TSPLBM
Input: $\mathcal{X}$, $g$.
Initialization: $\tilde{Z}^{(0)}$ randomly and compute $\Omega^{(0)}$.
Repeat the E-step and the M-step until the objective function value (4) change is small, or there is no change.
Return $\tilde{Z}$, $\Omega$.

Datasets description and pre-processing
We used eight datasets with different numbers of graphs (slices) and clusters. Table 1 shows the characteristics of the datasets in terms of the type of instances (image or image+text), the number of graphs/slices (#Graphs), the number of instances (#Nodes), the number of clusters (#Clusters) and the rate of sparsity. We selected four benchmark datasets commonly used to compare multi-view clustering methods, namely UC-digits, 3sources, BBC, and 100leaves. Further, we constructed four datasets for multiple graphs clustering, namely DBLP1, DBLP3, Nus-Wide-8, and Amazon-products-10. Hereafter, we describe each dataset in detail:
- UC-digits consists of 2000 images of handwritten digits (ten classes corresponding to the digits 0-9) described by six views: Fourier coefficients of the character shapes, profile correlations, Karhunen-Loève coefficients, pixel averages, Zernike moments, and morphological features.
- 3sources consists of 169 news texts reported by three news sources: BBC, Reuters, and The Guardian.
- BBC consists of 658 documents from BBC news, split into four segments and addressing five different topics.
- 100leaves consists of 1600 images from one hundred plant species, described by shape descriptor, fine-scale margin, and texture histogram features.
- DBLP1 consists of 2223 papers published in three different journals and described by words from the title, words from the abstract, and authors.
- DBLP3 is similar to the DBLP1 dataset but includes 12,550 papers from ten journals.
- Nus-Wide-8 consists of 2738 images from Flickr addressing eight topics and described by tags, Color Histogram (CH), Color Correlogram (CORR), Edge Direction Histogram (EDH), Wavelet Texture (WT), and block-wise color moments (CW55).
- Amazon-products-10 consists of 9897 product images from ten product categories, described by words of the product title, words of the product description, LBP features, Haralick features, Gabor features, and co-viewed and co-purchased products.

Figure 9 shows all graphs (slices) reorganized according to the true partition into 10 classes.
In these tensor datasets, each graph (slice) can be assimilated to an adjacency matrix representing similarities between nodes (objects). Note that the TSPLBM model considers count or binary adjacency matrices. Thereby, in order to apply TSPLBM to image datasets, where graphs represent similarities between images according to each type of feature, we had to convert these matrices into binary adjacency matrices (1 if the similarity is higher than the 97% quantile and 0 otherwise). In this way, we were able to study the robustness of our algorithm even when one or many slices in the original data do not respect the expected structure (binary or count data).
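The binarization rule described above can be sketched in a few lines. The text does not state whether the diagonal is included when computing the quantile, so this illustration simply uses all entries of the similarity matrix; the function name `binarize_similarity` and the random toy similarities are ours.

```python
import numpy as np

def binarize_similarity(S, q=0.97):
    """Return a 0/1 adjacency matrix: 1 where S exceeds its q-quantile."""
    threshold = np.quantile(S, q)
    return (S > threshold).astype(int)

rng = np.random.default_rng(0)
F = rng.normal(size=(50, 8))    # hypothetical feature vectors for 50 images
S = F @ F.T
S = (S + S.T) / 2               # enforce exact symmetry of the similarities
A = binarize_similarity(S)      # binary slice, one such matrix per feature type
```

By construction at most about 3% of the entries are kept, which is consistent with the high sparsity levels targeted by the model.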

TSPLBM versus multi-view clustering
Multi-view clustering (MvC) (Bickel and Scheffer 2004) aims to perform clustering from diverse sources or domains, where each object (instance) is described by several sets of features (or views). MvC methods are used in several applications, such as image clustering, where different kinds of features are available. They make it possible to take into account the information arising from each view. Because of the diversity of feature sets, each view can be converted to a symmetric instances × instances similarity/dissimilarity matrix. This brings us back to a tensor representation of these views, where each of them is a graph whose edges are continuous. Thereby, even though each view is not a count matrix, we compared TSPLBM (after binarisation) with two recent and effective algorithms, SwMV (Nie et al. 2017) and MultiNMF (Liu et al. 2013). We consider six of the eight datasets, for which we have or can apply these two algorithms. We followed the same experimental procedure as for TSPLBM, with 30 runs, and we computed the average of ACC, NMI, and Purity (Sripada and Rao 2011). For MultiNMF, we picked up the results in terms of ACC and NMI that are available in Wang et al. (2020) and Wang et al. (2015). Table 2 reports the results obtained on the six multi-view datasets. SwMV does a better job than MultiNMF; it achieves good results on UC-digits and 100Leaves. However, SwMV could not give the clustering for DBLP1. On the other hand, TSPLBM achieves markedly better results than SwMV on the four datasets.
Overall, from these experiments, even with binary edges, we observe that TSPLBM gives encouraging results compared with SwMV and MultiNMF applied on graphs with continuous edges.
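The conversion of views into a tensor of graphs can be illustrated as follows. The similarity measure is not specified in the text; cosine similarity is used here purely as an assumption, and the function names and the random views are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(F):
    """Symmetric instances x instances cosine similarity for one view F (n x p)."""
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    U = F / np.where(norms == 0, 1.0, norms)
    S = U @ U.T
    return (S + S.T) / 2            # enforce exact symmetry

def views_to_tensor(views):
    """Stack one similarity graph per view into an n x n x v tensor."""
    return np.stack([cosine_similarity_matrix(F) for F in views], axis=2)

rng = np.random.default_rng(1)
views = [rng.normal(size=(5, 4)),   # view 1: 5 instances, 4 features
         rng.normal(size=(5, 7))]   # view 2: same instances, 7 features
T = views_to_tensor(views)          # shape (5, 5, 2), continuous edges
```

Views with different numbers of features end up as slices of the same size, which is what allows them to be treated jointly as multiple graphs over the same set of nodes.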

TSPLBM versus tensor decomposition approaches
Tensor decomposition methods have long been the most popular way to deal with tensor data $\mathcal{X} \in \mathbb{R}^{n \times n \times v}$ (Kolda and Bader 2009). Even if they are not devoted to clustering, they can contribute to this task. Actually, these methods return a factor matrix in $\mathbb{R}^{n \times r}$ ($r$ is a given rank) that can be used for clustering. Thus, we applied a list of suitable clustering algorithms to this factor matrix: Kmeans++ (Arthur and Vassilvitskii 2007), Spectral Clustering (SC) (Ng et al. 2001), and the EM algorithm (Dempster et al. 1977) derived from the diagonal Gaussian Mixture Model (GMM) available in the Scikit-Learn package. Thereby, we compared the sparse tensor co-clustering algorithm TSPLBM with PARAFAC (Harshman and Lundy 1994) and Tucker decomposition (Tucker 1966) on the six datasets presented in the previous section. We used different ranks (10, 20, and 50) and performed 30 runs with random initialization. We then computed ACC, NMI, and Purity by averaging over all runs.
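To illustrate how a decomposition yields an n × r factor matrix usable for clustering, the sketch below computes the leading left singular vectors of the mode-1 unfolding of the tensor, as in a truncated higher-order SVD. This is a stand-in chosen for brevity and is not the PARAFAC or Tucker fitting procedure used in the experiments; the function name `mode1_factor` and the random tensor are ours.

```python
import numpy as np

def mode1_factor(X, r):
    """Leading r left singular vectors of the mode-1 unfolding of X (n x n x v)."""
    n = X.shape[0]
    unfolded = X.reshape(n, -1)      # n x (n * v); column order does not affect U
    U, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    return U[:, :r]

rng = np.random.default_rng(2)
X = rng.poisson(1.0, size=(6, 6, 3)).astype(float)
A = mode1_factor(X, r=2)             # one r-dimensional embedding per node
```

Each row of the factor matrix can then be passed to any vector clustering method (k-means, spectral clustering, a Gaussian mixture), which is the two-stage pipeline that TSPLBM is compared against.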
Figure 10 reports the clustering results obtained on the six datasets by the different tensor-based algorithms (PARAFAC, Tucker decomposition, and TSPLBM) and by the clustering algorithms applied to the obtained tensor decomposition. The results concern tensor decomposition approaches with a rank equal to 10 (the results for ranks 20 and 50 are similar). We observe that, in most cases, TSPLBM does a better job than the PARAFAC and Tucker decomposition methods. For the 3sources and Caltech-7 datasets, PARAFAC and Tucker decomposition with GMM obtain close results in terms of Purity and Accuracy, but TSPLBM achieves higher performance in terms of NMI.
To compare the computing time of TSPLBM and tensor decomposition approaches, we report in Fig. 11 the execution time in seconds. We notice that for the four datasets 3sources, BBC, DBLP1, and UC-digits, TSPLBM is close to all other approaches in terms of execution time. However, with Nus-Wide-8 and 100Leaves, the execution time is higher; this is due to the dataset size and the number of clusters. Note, however, that in Fig. 10 TSPLBM outperforms tensor decomposition approaches by approximately 25 points of ACC on both datasets.

TSPLBM versus PSBM, PLBM, and SPLBM
In this section, we aim to evaluate the impact of considering multiple graphs simultaneously in terms of clustering. To this end, we compare TSPLBM with PSBM, PLBM, and SPLBM that consider the slices separately (Sect. 3.4).

Fig. 11 Time complexity analysis
We performed 30 random initializations and computed the Accuracy and Normalized Mutual Information (NMI) (Strehl and Ghosh 2002) metrics by averaging over all runs. The clustering accuracy (ACC) discovers the one-to-one relationship between two partitions and measures the extent to which each cluster contains data points from the corresponding class. NMI, in contrast, is based on Mutual Information (MI) and measures the amount of information retrieved, considering our knowledge about the clusters and the results obtained by a clustering method, while respecting the proportions of clusters. For lack of space, we focus on DBLP1, DBLP3, Nus-Wide-8 and Amazon-products-10. Figure 12 reports the performances of the four algorithms PSBM, PLBM, SPLBM, and TSPLBM. PSBM, PLBM, and SPLBM are applied on each slice $\mathbf{X}^b$ separately, unlike TSPLBM, which is applied on $\mathcal{X}$ considering all graphs simultaneously. We notice that, in most cases, TSPLBM is better than the other algorithms applied to each graph and achieves the best trade-off. TSPLBM includes all graphs, including those with a very complex structure. DBLP3 obtains the lowest results due to the complex structure of the dataset, composed of 12K papers with very close or complementary topics in computer science. We observe that PLBM and SPLBM do a better job than PSBM for all datasets on the more informative slices. It is also worth noting that PLBM performs well in terms of Accuracy on DBLP1 and in terms of NMI on DBLP3. TSPLBM performs a natural consensus when considering all slices and yields a unique partition at the end, with good clustering results.
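For reference, the two metrics can be sketched as follows. The accuracy is computed here by exhaustive search over one-to-one matchings, which is only practical for a small number of clusters (an assignment solver would replace it in practice), and the NMI uses the geometric-mean normalization of Strehl and Ghosh (2002). Function names and the toy labels are ours; the NMI is undefined when either partition has a single cluster.

```python
import itertools
import numpy as np

def contingency(true, pred):
    """Class x cluster count table."""
    t_ids, t = np.unique(true, return_inverse=True)
    p_ids, p = np.unique(pred, return_inverse=True)
    C = np.zeros((t_ids.size, p_ids.size), dtype=int)
    np.add.at(C, (t, p), 1)
    return C

def clustering_accuracy(true, pred):
    """Best one-to-one matching of clusters to classes, by exhaustive search."""
    C = contingency(true, pred)
    g = max(C.shape)
    P = np.zeros((g, g), dtype=int)          # pad to a square table
    P[:C.shape[0], :C.shape[1]] = C
    best = max(sum(P[i, perm[i]] for i in range(g))
               for perm in itertools.permutations(range(g)))
    return best / len(true)

def nmi(true, pred):
    """I(true; pred) / sqrt(H(true) * H(pred))."""
    pij = contingency(true, pred).astype(float)
    pij /= pij.sum()
    pi, pj = pij.sum(axis=1), pij.sum(axis=0)
    nz = pij > 0
    mi = np.sum(pij[nz] * np.log(pij[nz] / np.outer(pi, pj)[nz]))
    h_true = -np.sum(pi * np.log(pi))
    h_pred = -np.sum(pj * np.log(pj))
    return mi / np.sqrt(h_true * h_pred)

acc = clustering_accuracy([0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0])
score = nmi([0, 0, 1, 1], [1, 1, 0, 0])
```

Both metrics are invariant to a relabeling of the clusters, which is why a partition that merely swaps the two label names still obtains the maximal NMI.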

Interpretation of multiple graph clustering results
This part aims to analyze the obtained topics and to demonstrate how the proposed model can help the user interpret the obtained clusters using a visualization method. To illustrate this, we rely on the Nus-Wide-8 dataset. On the topics-tags matrix, we performed Correspondence Analysis (CA) (Benzecri 1973; Nenadic and Greenacre 2007). The choice of CA is due to the connection between the Poisson distribution, mutual information, and the chi-square statistic on which CA is based; see, e.g., Govaert and Nadif (2018). The topics-tags matrix $Z^\top M$ is constructed from the image-tags matrix $M$ based on the topics (or partition) $Z$ obtained by TSPLBM. Figure 13 shows the projection of the tags and topics on the first two dimensions of CA, including the top tags in terms of contribution to the CA results.
We can notice that some topics are close and others are very different. For instance, topic 3 about weddings is opposed to topics 8 and 6, about snow and temple, along the first and the second dimension respectively. On the other hand, we can see that topics 1 and 2, about plants and animals, are close. Figure 14 presents the tags whose contribution is important. We show the frequencies of each term for each topic. For topics 2 and 5 (pink and purple color respectively), Based on the co-tags graph and the obtained topics, we construct a graph of image clusters linked by edges representing the intensity of joint tags between all topics; this can be computed by $Z^\top H Z$, where $Z$ is obtained by TSPLBM and $H$ is the co-tags matrix. We can notice that some topics have a strong relationship, such as plants-snow and town-persons. On the other hand, some topics have a weak link, such as animals-town and animals-temple. This representation highlights that some tags are used with confused meaning. In this context, it is possible to use tensor models for tag completion and tag correction (Tang et al. 2017; Veit et al. 2017).
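The two aggregated matrices used in this interpretation step are plain matrix products with the indicator matrix of the partition. In the sketch below, the labels, the image-tags counts, and the construction of the co-tags matrix as shared-tag counts between images are all hypothetical toy choices made for illustration.

```python
import numpy as np

labels = np.array([0, 0, 1, 2, 1])   # hypothetical partition of 5 images into 3 topics
Z = np.eye(3, dtype=int)[labels]     # 5 x 3 indicator matrix

M = np.array([[1, 0, 1, 0],          # hypothetical image x tag counts
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 0, 2, 1]])
topic_tags = Z.T @ M                 # topics x tags matrix, the input of CA

H = M @ M.T                          # assumed co-tags graph between images
np.fill_diagonal(H, 0)               # ignore self-links
topic_graph = Z.T @ H @ Z            # joint-tag intensity between topics
```

The resulting topic graph is symmetric whenever the co-tags matrix is, and its off-diagonal entries are the edge weights between image clusters discussed above.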

Discussion: implicit consensus versus explicit consensus
In the first part of our experiments, we observed that TSPLBM applied to all slices simultaneously is, in most cases, better than the other algorithms. As we are in an unsupervised context, we have found it helpful to run the algorithm from several different random initializations and keep the best result in terms of maximum log-likelihood over all runs. This is the usual procedure in clustering. Next, we study why and how this procedure can be improved. Figure 15 shows the 30 runs sorted according to the Normalized log-likelihood (NL), which is the objective function of TSPLBM. We also draw the ACC and NMI curves for the 30 runs. We observe that for DBLP1, the runs leading to the maximal NL are also the best runs in terms of clustering (ACC and NMI). However, this does not hold for all datasets; for instance, some of the runs with the highest NL achieve lower ACC and NMI. This problem is recurrent with unsupervised methods, where the best runs in terms of the objective function are not necessarily the best ones in terms of clustering. On the other hand, we may see the proposed model as an implicit consensus model for graph clustering, and it is tempting to compare it to ensemble-based clustering methods.
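The selection procedure discussed above can be illustrated with a short sketch. The run records, field names, and numeric values are toy placeholders chosen for illustration, not results from the experiments; they only show how the run with the highest NL may differ from the run with the highest NMI.

```python
def rank_runs(runs):
    """Sort runs by decreasing normalized log-likelihood (NL)."""
    return sorted(runs, key=lambda r: r["nl"], reverse=True)


# Toy values for illustration only, not results from the paper.
runs = [
    {"seed": 0, "nl": -1.31, "nmi": 0.52},
    {"seed": 1, "nl": -1.18, "nmi": 0.58},
    {"seed": 2, "nl": -1.22, "nmi": 0.63},
]

best_by_nl = rank_runs(runs)[0]                   # what can be selected without labels
best_by_nmi = max(runs, key=lambda r: r["nmi"])   # only known when labels exist
print(best_by_nl["seed"], best_by_nmi["seed"])
```

Since NMI requires ground-truth labels, only the NL ranking is available in a truly unsupervised setting, which motivates combining several good runs rather than trusting a single one.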

Ensemble method
The first works on consensus or ensemble classification emerged in the context of supervised learning; see for instance (Maclin and Opitz 1997; Schapire 2003; Dietterich 2000). However, only the majority voting type algorithms work on the model output level, and the most well-known classification ensemble approaches are based on different variants of voting (Bauer and Kohavi 1999; Crammer et al. 2008; Gao et al. 2009). This approach has been extended to unsupervised learning (Strehl and Ghosh 2002; Vega-Pons and Ruiz-Shulcloper 2011). A clustering ensemble, also known as consensus clustering or clustering aggregation, is defined in the same manner as for classification (Hanczar and Nadif 2012; Alqurashi and Wang 2019; Yu et al. 2019). It consists in combining multiple clustering models (partitions) into a single consolidated partition, which we refer to as explicit consensus clustering. In other words, from $r$ partitions $\{Z_1, Z_2, Z_3, \ldots, Z_r\}$, a consensus clustering leads to a unique partition $Z^*$. Many approaches exist, based on different consensus functions; see for instance (Strehl and Ghosh 2002; Hanczar and Nadif 2012; Affeldt et al. 2020a, b). Strehl and Ghosh (2002) introduced three ensemble clustering methods that can produce a consensus partition. All of them consider the consensus problem on a hypergraph representation of the set of partitions. More specifically, each partition is represented by a binary classification matrix (with objects in rows and clusters in columns), and the concatenation of all these matrices defines the hypergraph. Figure 16 presents this matrix and the different steps used to combine the graphs of clusters arising from the different partitions into a unique graph. To this end, we rely on the three hypergraph clustering-based approaches proposed by Strehl and Ghosh (2002), namely the Cluster-based Similarity Partitioning Algorithm (CSPA), the HyperGraph Partitioning Algorithm (HGPA), and the Meta-CLustering Algorithm (MCLA).
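The hypergraph representation can be made concrete with a small sketch. It builds the concatenated binary membership matrix and the pairwise similarity used by CSPA, that is, the fraction of partitions in which two objects fall in the same cluster. The function names are hypothetical, and the final graph-partitioning step of CSPA is deliberately omitted.

```python
def indicator_matrix(labels):
    """Binary classification matrix: objects in rows, clusters in columns."""
    clusters = sorted(set(labels))
    return [[1 if label == c else 0 for c in clusters] for label in labels]


def hypergraph(partitions):
    """Concatenate the indicator matrices of all partitions column-wise."""
    blocks = [indicator_matrix(p) for p in partitions]
    n = len(partitions[0])
    return [sum((block[i] for block in blocks), []) for i in range(n)]


def cspa_similarity(partitions):
    """S = (1/r) H H^T: share of partitions that co-cluster each pair."""
    H = hypergraph(partitions)
    r = len(partitions)
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / r for row_j in H]
        for row_i in H
    ]


# Three toy partitions of four objects.
partitions = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
for row in cspa_similarity(partitions):
    print([round(v, 2) for v in row])
```

A consensus partition is then obtained by clustering the objects according to this similarity matrix; any similarity-based clustering algorithm could play that role in a quick experiment.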
To improve the clustering results of TSPLBM, we adopt the ensemble approach. In the next part, we explore how implicit consensus clustering through TSPLBM behaves compared to explicit consensus through cluster ensembles of multiple graphs. Figure 17 depicts the approach used to compare TSPLBM with the clustering ensemble methods proposed by Strehl and Ghosh (2002). To do this, we used the implementation provided by the Python package Cluster_Ensembles. It relies on CSPA, HGPA, and MCLA and returns the best result in terms of the mean NMI between the obtained consensus clustering $Z^*$ and the different clustering solutions $\{Z_1, Z_2, Z_3, \ldots, Z_r\}$. Thereby, with TSPLBM, we select the top ten runs maximizing the log-likelihood and then carry out the consensus using the cluster-ensemble methods. With SPLBM, PLBM, and PSBM, we proceed in two steps. The first step is the same as that used with TSPLBM: on each graph, we select the top ten runs and apply the cluster-ensemble methods. The second step consists in applying another clustering consensus between graphs to obtain a unique partition. Note that, with TSPLBM, this consensus between graphs is implicitly provided by the algorithm.
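The selection rule described above, keeping the consensus candidate with the highest mean NMI against the base partitions, can be sketched as follows. This is a stand-in written for illustration and does not call or reproduce the API of the Cluster_Ensembles package; the candidate names and the toy partitions are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """NMI with the geometric-mean normalization (Strehl and Ghosh 2002)."""
    n = len(a)
    pa, pb = Counter(a), Counter(b)
    mi = sum(
        c / n * log(c * n / (pa[x] * pb[y]))
        for (x, y), c in Counter(zip(a, b)).items()
    )
    denom = sqrt(entropy(a) * entropy(b))
    return mi / denom if denom > 0 else 0.0


def average_nmi(candidate, partitions):
    """Mean NMI between one candidate consensus and all base partitions."""
    return sum(nmi(candidate, p) for p in partitions) / len(partitions)


def pick_consensus(candidates, partitions):
    """Return the name of the candidate with the highest mean NMI."""
    return max(candidates, key=lambda name: average_nmi(candidates[name], partitions))


# Base partitions, e.g. the runs kept after ranking by log-likelihood.
partitions = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
# Hypothetical outputs of two different consensus functions.
candidates = {"candidate_a": [0, 0, 1, 1], "candidate_b": [0, 1, 0, 1]}
print(pick_consensus(candidates, partitions))
```

Because the criterion only compares partitions with each other, it needs no ground-truth labels and stays usable in the unsupervised setting.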
Figure 17 (right) reports the results obtained in terms of NMI using the comparison approach described above. TSPLBM achieves the highest NMI on all datasets. SPLBM performs better than or similarly to PLBM on three datasets, while PSBM obtains the lowest NMI on all datasets. These results can be explained by the fact that the implicit consensus achieved by TSPLBM is optimized within the objective function of the algorithm, unlike the explicit consensus, where the partitions are obtained separately.

Conclusion
We established some connections between Poisson SBM, together with its degree-corrected version DC-SBM, and Poisson LBM, which is commonly used for the co-clustering of contingency tables. We justified the extension of the latter to deal with the clustering of multiple graphs. To take into account the sparsity of the tensor, we modified the parametrization of the model and proposed a Tensor SPLBM (TSPLBM). We thereby derived an EM-like learning algorithm, also called TSPLBM, capable of clustering tensor data. On real datasets of text and image graphs, we have shown that TSPLBM is better than the cited baseline algorithms in terms of clustering.
On the other hand, the proposed clustering algorithm TSPLBM can be seen as an implicit consensus clustering for multiple graphs. To reinforce the idea that TSPLBM can be used in this sense, we carried out a comparative study with explicit consensus through ensemble clustering methods. Experiments on several real graph datasets highlight the effectiveness of TSPLBM. Thereby, this work gives an extra dimension to LBM as an ensemble method. Our approach has made it possible to propose an EM-like learning algorithm. It is also possible to develop a Classification EM-like version: all that is needed is to insert a classification step between the E and M steps. This could lead to an extension of DC-SBM for multiple graphs.
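The classification step mentioned above can be sketched as follows. In a Classification EM scheme, the posterior memberships produced by the E-step are hardened into a partition before the M-step; the sketch assumes these posteriors are available as rows of probabilities, and the function name is hypothetical.

```python
def classification_step(posteriors):
    """Replace each row of posterior memberships by a one-hot assignment
    to its most probable cluster (ties go to the lowest cluster index)."""
    hard = []
    for row in posteriors:
        k = max(range(len(row)), key=row.__getitem__)
        hard.append([1.0 if j == k else 0.0 for j in range(len(row))])
    return hard


# Toy posteriors for two objects and three clusters.
posteriors = [[0.2, 0.7, 0.1], [0.5, 0.5, 0.0]]
print(classification_step(posteriors))
```

The M-step is then computed from these hard memberships instead of the soft ones, which typically speeds up convergence at the cost of a cruder approximation.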
Our work opens different avenues for future research. First, in our proposal, we have considered a Poisson model. However, other distributions and other model variants can be developed and compared to recent approaches relying on mixture models and applied to image clustering. When a data point has different representations, the authors propose to maximize a joint probability with multiple representations that can be generated by diverse methods such as kernel functions or data embedding methods. The model incorporates prior information about the data and uses it to set preferences for these representations. Second, the proposed model can be extended by incorporating Must-Link and Cannot-Link relationships, based on Hidden Markov Random Fields, to deal with semi-supervised learning problems such as those addressed in Wu et al. (2021) and Li et al. (2021). Finally, in our proposal, the number of clusters has been assumed to be known. It would be interesting to propose an extension of some criteria, such as the Integrated Completed Likelihood (ICL) criterion, already used with SBM (Daudin et al. 2008).
Funding Open Access funding enabled and organized by Projekt DEAL. Our work is funded by the German Federal Ministry for Economic Affairs and Energy (BMWi) under Grant Agreement Number 01MK20008F (Service-Meister).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A. Appendix: Proof of (4)
The marginal density function f (X; ) of TSPLBM can be written as:

Thus, the complete-data log-likelihood function is given by:

Hence, the aim is to maximize the following lower bound of the log-likelihood criterion: F_C( Z, W, ) = L_C( Z, W, ) + H( Z) + H( W), where L_C( Z, W, ) is the fuzzy complete-data log-likelihood function. As X is symmetric per slice b, when $i = j$ we have $z_{ik} = w_{jk}$, and for $k = 1, \ldots, g$ we have $\pi_k = \rho_k$ and H( Z) = H( W). Then the objective function to optimize takes the following form:

where L kb c = x b kk log(

C. Appendix: E-step and M-step
E-step
To obtain the expression of $z_{ik}$, we maximize (4) with respect to $z_{ik}$, subject to the constraint $\sum_k z_{ik} = 1$. The corresponding Lagrangian, up to terms which are not a function of $z_{ik}$, is given by:

Taking derivatives with respect to $z_{ik}$, we obtain:

Setting this derivative to zero yields:

and we obtain a simple update of $z_{ik}$ as follows