Model-based clustering of multiple networks with a hierarchical algorithm

Rebafka, Tabea

doi:10.1007/s11222-023-10329-w


Original Paper
Published: 07 November 2023

Volume 34, article number 32, (2024)

Statistics and Computing

Tabea Rebafka^1,2


Abstract

The paper tackles the problem of clustering multiple networks, directed or not, that do not share the same set of vertices, into groups of networks with similar topology. A statistical model-based approach based on a finite mixture of stochastic block models is proposed. A clustering is obtained by maximizing the integrated classification likelihood criterion. This is done by a hierarchical agglomerative algorithm that starts from singleton clusters and successively merges clusters of networks. As such, a sequence of nested clusterings is computed that can be represented by a dendrogram, providing valuable insights into the collection of networks. Using a Bayesian framework, model selection is performed in an automated way, since the algorithm stops when the best number of clusters is attained. The algorithm is computationally efficient when carefully implemented. The aggregation of clusters requires a means to overcome the label-switching problem of the stochastic block model and to match the block labels of the networks. To address this problem, a new tool is proposed based on a comparison of the graphons of the associated stochastic block models. The clustering approach is assessed on synthetic data. An application to a set of ecological networks illustrates the interpretability of the obtained results.

References

Amini, A.A., Chen, A., Bickel, P.J., Levina, E.: Pseudo-likelihood methods for community detection in large sparse networks. Ann. Stat. 41(4), 2097–2122 (2013)
Bickel, P.J., Chen, A.: A nonparametric view of network models and Newman–Girvan and other modularities. Proc. Natl. Acad. Sci. 106(50), 21068–21073 (2009)
Biernacki, C., Celeux, G., Govaert, G.: Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22(7), 719–725 (2000)
Bollobás, B., Borgs, C., Chayes, J., Riordan, O.: Directed scale-free graphs. In: SODA ’03 Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 132–139 (2003)
Botella, C., Dray, S., Matias, C., Miele, V., Thuiller, W.: An appraisal of graph embeddings for comparing trophic network architectures. Methods Ecol. Evol. 13(1), 203–216 (2022)
Chabert-Liddell, S.C., Barbillon, P., Donnet, S.: Learning common structures in a collection of networks. An application to food webs (2022)
Côme, E., Latouche, P.: Model selection and clustering in stochastic block models based on the exact integrated complete data likelihood. Stat. Model. 15(6), 564–589 (2015)
Daudin, J.J., Picard, F., Robin, S.: A mixture model for random graphs. Stat. Comput. 18(2), 173–183 (2008)
Donnat, C., Holmes, S.: Tracking network dynamics: a survey using graph distances. Ann. Appl. Stat. 12(2), 971–1012 (2018)
Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 97(458), 611–631 (2002)
Frühwirth-Schnatter, S., Malsiner-Walli, G.: From here to infinity: sparse finite versus Dirichlet process mixtures in model-based clustering. Adv. Data Anal. Classif. 13, 33–64 (2019)
Gärtner, T.: A survey of kernels for structured data. ACM SIGKDD Explor. Newsl 5(1), 49–58 (2003)
le Gorrec, L., Knight, P.A., Caen, A.: Learning network embeddings using small graphlets. Soc. Netw. Anal. Min. 12(20), 1–20 (2022)
Hamilton, W.L., Ying, R., Leskovec, J.: Representation learning on graphs: methods and applications. IEEE Data Eng. Bull. 40(3), 52–74 (2017)
Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2, 193–218 (1985)
Isella, L., Stehlé, J., Barrat, A., Cattuto, C., Pinton, J.F., den Broeck, W.V.: What’s in a crowd? Analysis of face-to-face behavioral networks. J. Theor. Biol. 271(1), 166–180 (2011)
Le, C.M., Levin, K., Levina, E.: Estimating a network from multiple noisy realizations. Electron. J. Stat. 12(2), 4697–4740 (2018)
Leger, J.B.: Blockmodels: A R-package for estimating in latent block model and stochastic block model, with various probability functions, with or without covariates (2016)
Liu, J.: Monte Carlo Strategies in Scientific Computing. Springer, Berlin (2008)
Lovász, L., Szegedy, B.: Limits of dense graph sequences. J. Combin. Theory Ser. B 96(6), 933–957 (2006)
Mantziou, A., Lunagomez, S., Mitra, R.: Bayesian model-based clustering for multiple network data (2023)
Matias, C., Robin, S.: Modeling heterogeneity in random graphs through latent space models: a selective review. Esaim Proc. Surv. 47, 55–74 (2014)
McLachlan, G., Krishnan, T.: The EM algorithm and extensions, 2nd edn. Wiley series in probability and statistics, Wiley (2008)
McLachlan, G., Peel, D.: Finite Mixture Models. Wiley Series in Probability and Statistics. Wiley-Interscience (2000)
Mehta, N., Duke, L.C., Rai, P.: Stochastic blockmodels meet graph neural networks. In: Proceedings of the 36th International Conference on Machine Learning, Vol. 97, pp. 4466–4474 (2019)
Mukherjee, S.S., Sarkar, P., Lin, L.: On clustering network-valued data. In: Advances in Neural Information Processing Systems, Vol. 30 (2017)
Nowicki, K., Snijders, T.A.B.: Estimation and prediction for stochastic blockstructures. J. Am. Stat. Assoc. 96(455), 1077–1087 (2001)
Peixoto, T.: Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models. Phys. Rev. E 89(1), 012804 (2014)
Poisot, T., Baiser, B., Dunne, J.A., Kéfi, S., Massol, F., Mouquet, N., Romanuk, T.N., Stouffer, D.B., Wood, S.A., Gravel, D.: Mangal - making ecological network analysis simple. Ecography 39(4), 384–390 (2016)
Robert, C.P.: The Bayesian Choice: A Decision-theoretic Motivation, 2nd edn. Springer, New York (2007)
Rohe, K., Chatterjee, S., Yu, B.: Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Stat. 39(4), 1878–1915 (2011)
Sabanayagam, M., Vankadara, L.C., Ghoshdastidar, D.: Graphon based clustering and testing of networks: Algorithms and theory. In: The Tenth International Conference on Learning Representations (2022)
Shervashidze, N., Vishwanathan, S., Petri, T., Mehlhorn, K., Borgwardt, K.: Efficient graphlet kernels for large graph comparison. In: JMLR Workshop and Conference Proceedings: AISTATS, pp 488–495 (2009)
Shimada, Y., Hirata, Y., Ikeguchi, T., Aihara, K.: Graph distance for complex networks. Sci. Rep. 6, 34944 (2016)
Signorelli, M., Wit, E.C.: Model-based clustering for populations of networks. Stat. Model. 20(1), 9–29 (2019)
Stanley, N., Shai, S., Taylor, D., Mucha, P.J.: Clustering network layers with the strata multilayer stochastic block model. IEEE Trans. Netw. Sci. Eng. 3(2), 95–105 (2016)
Titterington, D., Smith, A., Makov, U.: Statistical Analysis of Finite Mixture Distributions. Wiley, New York (1985)
Weber-Zendrera, A., Sokolovska, N., Soula, H.A.: Functional prediction of environmental variables using metabolic networks. Sci. Rep. 11, 12192 (2021)
Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Yu, P.S.: A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32(1), 4–24 (2021)
Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks? In: International Conference on Learning Representations (2019)
Young, J.G., Kirkley, A., Newman, M.E.J.: Clustering of heterogeneous populations of networks. Phys. Rev. E 105(1), 041312 (2022)


Acknowledgements

Work partly supported by the Grant ANR-18-CE02-0010 of the French National Research Agency ANR (project EcoNet).

Author information

Authors and Affiliations

Sorbonne Université, Université Paris Cité, CNRS, Laboratoire de Probabilités Statistique et Modélisation (LPSM), Paris, France
Tabea Rebafka
INRAE, MaIAGE, Jouy-en-Josas, France
Tabea Rebafka

Authors

Tabea Rebafka

Contributions

All work for this paper was done by Tabea Rebafka.

Corresponding author

Correspondence to Tabea Rebafka.

Ethics declarations

Conflict of interest

The author declares no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Details on the update of the node labels

Here we detail the efficient computation of the ICL change $\Delta _{m^*, i^*}^{\rightarrow h}$ incurred by moving node $i^*$ of network $m^*$ from its current block g to block h, in the case where the move does not empty block g.

Changes in the statistics. Let $s^{(m^*)}_{k}$ be the count statistic before the swap and $\vec {s}^{(m^*)}_{k}$ its value after the swap. We use the same notation for all other statistics. Clearly, $\vec {s}^{(m^*)}_{g}=s^{(m^*)}_{g}-1$ and $\vec {s}^{(m^*)}_{h}=s^{(m^*)}_{h}+1$, while the other terms remain unchanged. Define

$$\begin{aligned} \delta _{k,\cdot i^*}=\sum _{i\ne i^*}Z^{(m^*)}_{i,k}A^{(m^*)}_{i,i^*},\qquad \delta _{\ell , i^* \cdot }=\sum _{j\ne i^*}Z^{(m^*)}_{j,\ell }A^{(m^*)}_{i^*,j}. \end{aligned}$$

Then, for any $k,\ell \in \llbracket K\rrbracket $,

$$\begin{aligned} \vec {a}^{(m^*)}_{k,\ell }&=a^{(m^*)}_{k,\ell } -\mathbb {1}_{k=g}\delta _{\ell , i^* \cdot } +\mathbb {1}_{k=h}\delta _{\ell , i^* \cdot } \\&\quad -\mathbb {1}_{\ell =g}\delta _{k, \cdot i^*} +\mathbb {1}_{\ell =h}\delta _{k, \cdot i^*}. \end{aligned}$$
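The update of the edge counts can be checked numerically. The following is a minimal sketch, assuming a single directed network stored as a dense 0/1 NumPy adjacency matrix with zero diagonal and node labels coded as integers 0, ..., K-1; the function names `edge_counts` and `move_node_edge_counts` are hypothetical and not taken from the paper.

```python
import numpy as np

def edge_counts(A, z, K):
    """a[k, l] = number of edges from block k to block l (A has zero diagonal)."""
    Z = np.eye(K, dtype=int)[z]              # one-hot membership matrix, n x K
    return Z.T @ A @ Z

def move_node_edge_counts(a, A, z, i_star, h):
    """Edge counts after moving node i_star from block g = z[i_star] to block h."""
    g = z[i_star]
    K = a.shape[0]
    others = np.arange(len(z)) != i_star
    # delta_in[k]  = edges from the other nodes of block k to i_star
    # delta_out[l] = edges from i_star to the other nodes of block l
    delta_in = np.bincount(z[others], weights=A[others, i_star], minlength=K)
    delta_out = np.bincount(z[others], weights=A[i_star, others], minlength=K)
    a_new = a.astype(float).copy()
    a_new[g, :] -= delta_out                 # row g loses the out-edges of i_star
    a_new[h, :] += delta_out                 # row h gains them
    a_new[:, g] -= delta_in                  # column g loses the in-edges of i_star
    a_new[:, h] += delta_in                  # column h gains them
    return a_new

# Small random directed graph (illustrative values only).
rng = np.random.default_rng(0)
n, K = 12, 3
A = (rng.random((n, n)) < 0.3).astype(int)
np.fill_diagonal(A, 0)
z = rng.integers(0, K, size=n)
h = (z[4] + 1) % K
a_fast = move_node_edge_counts(edge_counts(A, z, K), A, z, i_star=4, h=h)
```

Only the two vectors `delta_in` and `delta_out` are needed, so the cost of one move is linear in the number of nodes rather than quadratic.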

When considering the matrix $(a^{(m^*)}_{k,\ell })_{k,\ell }$, only the g-th and h-th row and the g-th and h-th column change when moving $i^*$ from g to h. We introduce the number of possible dyads from nodes in block k to nodes in block $\ell $ in graph m defined as

$$\begin{aligned} r^{(m)}_{k,\ell } =\sum _{i\ne j}Z^{(m)}_{i,k}Z^{(m)}_{j,\ell } = \left\{ \begin{array}{ll} s^{(m)}_{k}s^{(m)}_{\ell }&{}\quad \text {if }k\ne \ell \\ s^{(m)}_{k}(s^{(m)}_{k}-1)&{}\quad \text {if }k=\ell \end{array}\right. \end{aligned}$$

Then $b^{(m)}_{k,\ell }=r^{(m)}_{k,\ell }-a^{(m)}_{k,\ell }$ and

$$\begin{aligned}&\vec {r}^{(m^*)}_{k,\ell }=r^{(m^*)}_{k,\ell } -s^{(m^*)}_{\ell }\mathbb {1}_{k=g} +s^{(m^*)}_{\ell }\mathbb {1}_{k=h} -s^{(m^*)}_{k}\mathbb {1}_{\ell =g}\\&\quad +s^{(m^*)}_{k}\mathbb {1}_{\ell =h} +2\mathbb {1}_{k=g,\ell =g} -\mathbb {1}_{k=g,\ell =h} -\mathbb {1}_{k=h,\ell =g}, \end{aligned}$$

and $\vec {b}_{k,\ell }^{(m^*)}=\vec {r}_{k,\ell }^{(m^*)}-\vec {a}_{k,\ell }^{(m^*)}$. For any $m\ne m^*$, the statistics remain unchanged, that is, $\vec {a}_{k,\ell }^{(m)}=a_{k,\ell }^{(m)}$, $\vec {b}_{k,\ell }^{(m)}=b_{k,\ell }^{(m)}$ and $\vec {r}_{k,\ell }^{(m)}=r_{k,\ell }^{(m)}$. Finally, we define the function $\Psi :\mathbb {R}_+\times {\mathbb {Z}}\rightarrow \mathbb {R}$ as

$$\begin{aligned} \Psi (a,z)&=\log \left( \frac{\Gamma (a+z)}{\Gamma (a)}\right) \mathbb {1}\{a+z>0\}. \end{aligned}$$
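The closed-form change of the dyad counts and the function $\Psi$ are simple enough to verify in a few lines. The sketch below assumes block sizes given as a plain integer vector and $g\ne h$; the names `dyad_counts`, `dyad_count_change` and `psi` are hypothetical.

```python
from math import lgamma, log
import numpy as np

def dyad_counts(s):
    """r[k, l] = s_k * s_l off the diagonal and s_k * (s_k - 1) on it."""
    s = np.asarray(s)
    return np.outer(s, s) - np.diag(s)

def dyad_count_change(s, g, h):
    """Closed-form r_after - r_before when one node moves from block g to h (g != h)."""
    s = np.asarray(s)
    K = len(s)
    d = np.zeros((K, K), dtype=int)
    d[g, :] -= s          # -s_l * 1{k = g}
    d[h, :] += s          # +s_l * 1{k = h}
    d[:, g] -= s          # -s_k * 1{l = g}
    d[:, h] += s          # +s_k * 1{l = h}
    d[g, g] += 2
    d[g, h] -= 1
    d[h, g] -= 1
    return d

def psi(a, z):
    """Psi(a, z) = log(Gamma(a + z) / Gamma(a)) if a + z > 0, and 0 otherwise."""
    return lgamma(a + z) - lgamma(a) if a + z > 0 else 0.0

# Illustrative block sizes: one node moves from block 0 to block 2.
s_before = [3, 2, 4]
s_after = [2, 2, 5]
change = dyad_count_change(s_before, g=0, h=2)
```

Working with `lgamma` rather than the Gamma function itself avoids overflow for large counts, which is presumably why the ICL variation is expressed through log-ratios.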

First case: K does not change. Suppose that $i^*$ is not the last vertex in block g, that is, $\sum _{m}\sum _iZ^{(m)}_{i,g}>1$. Then, moving node $i^*$ to another block h does not empty block g and the number of blocks K remains unchanged. In this case, the ICL variation is given by

$$\begin{aligned}&\Delta _{m^*, i^*}^{\rightarrow h} \\&\quad =\sum _{(k,\ell )\in I_{g,h}} \left\{ \log \left( \frac{\Gamma (\eta +\sum _m \vec {a}_{k,\ell }^{(m)})\Gamma (\zeta +\sum _m \vec {b}_{k,\ell }^{(m)})}{\Gamma (\eta +\zeta +\sum _m \vec {r}_{k,\ell }^{(m)})}\right) \right. \\&\quad \quad \left. - \log \left( \frac{\Gamma (\eta +\sum _m a_{k,\ell }^{(m)})\Gamma (\zeta +\sum _m b_{k,\ell }^{(m)})}{\Gamma (\eta +\zeta +\sum _m r_{k,\ell }^{(m)})}\right) \right\} \\&\quad \quad + \sum _{k\in \{g,h\}} \left\{ \log \left( \Gamma (\alpha +\sum _m \vec {s}_{k}^{(m)})\right) \right. \\&\quad \quad \left. -\log \left( \Gamma (\alpha +\sum _m s_{k}^{(m)})\right) \right\} \\&\quad =\sum _{(k,\ell )\in I_{g,h}} \left\{ \Psi \left( \eta +\sum _m a_{k,\ell }^{(m)}, \vec {a}_{k,\ell }^{(m^*)}-a_{k,\ell }^{(m^*)}\right) \right. \\&\quad \quad +\left. \Psi \left( \zeta +\sum _m b_{k,\ell }^{(m)}, \vec {b}_{k,\ell }^{(m^*)}-b_{k,\ell }^{(m^*)}\right) \right. \\&\quad \quad \left. -\Psi \left( \eta +\zeta +\sum _m r_{k,\ell }^{(m)}, \vec {r}_{k,\ell }^{(m^*)}-r_{k,\ell }^{(m^*)}\right) \right\} \\&\quad \quad + \log \left( \frac{ \alpha +\sum _m s_{h}^{(m)}}{\alpha +\sum _m s_{g}^{(m)}-1}\right) , \end{aligned}$$

(6)

where $ I_{g,h} =\left\{ (k,\ell )\in \llbracket K\rrbracket ^2, k\in \{g,h\} \text { or } \ell \in \{g,h\} \right\} . $
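As a sanity check of formula (6), the ICL variation can be compared with a direct difference of log-Gamma terms. The sketch below is restricted to a single network ($M=1$), recomputes the statistics after the move for clarity rather than speed, and uses arbitrary hyperparameter values $\eta =\zeta =\alpha =0.5$; all function names are hypothetical.

```python
from math import lgamma, log
import numpy as np

def block_stats(A, z, K):
    """Block sizes s, edge counts a, non-edge counts b and dyad counts r."""
    Z = np.eye(K, dtype=int)[z]
    s = Z.sum(axis=0)
    a = Z.T @ A @ Z                          # A is assumed to have zero diagonal
    r = np.outer(s, s) - np.diag(s)
    return s, a, r - a, r

def psi(x, d):
    return lgamma(x + d) - lgamma(x) if x + d > 0 else 0.0

def label_terms(s, a, b, r, eta, zeta, alpha):
    """Terms of the criterion that depend on the node labels when K is fixed."""
    K = len(s)
    total = sum(lgamma(alpha + s[k]) for k in range(K))
    for k in range(K):
        for l in range(K):
            total += (lgamma(eta + a[k, l]) + lgamma(zeta + b[k, l])
                      - lgamma(eta + zeta + r[k, l]))
    return total

def delta_icl(A, z, i_star, h, K, eta=0.5, zeta=0.5, alpha=0.5):
    """ICL variation via Psi, assuming block g = z[i_star] keeps at least one node."""
    g = z[i_star]
    s, a, b, r = block_stats(A, z, K)
    z_new = z.copy()
    z_new[i_star] = h
    _, a2, b2, r2 = block_stats(A, z_new, K)
    delta = log((alpha + s[h]) / (alpha + s[g] - 1))
    for k in range(K):
        for l in range(K):
            if k in (g, h) or l in (g, h):   # index set I_{g,h}
                delta += psi(eta + a[k, l], a2[k, l] - a[k, l])
                delta += psi(zeta + b[k, l], b2[k, l] - b[k, l])
                delta -= psi(eta + zeta + r[k, l], r2[k, l] - r[k, l])
    return delta

rng = np.random.default_rng(1)
z = np.array([0, 0, 0, 1, 1, 2, 2, 2])
A = (rng.random((8, 8)) < 0.4).astype(int)
np.fill_diagonal(A, 0)
delta = delta_icl(A, z, i_star=0, h=1, K=3)
```

In an efficient implementation the statistics after the move would come from the closed-form updates above instead of a full recomputation; the comparison with `label_terms` only serves to validate the formula.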

1.2 Details on the efficient computation of $\Delta _{c,c'}$

Here it is shown how to evaluate $\Delta _{c,c'}$ efficiently. Denote by ${\mathcal {U}}_{c\cup c'}$ the cluster labels after merging clusters c and $c'$, that is, $U_{c\cup c'}^{(m)}=\min \{c,c'\}$ if $m\in I_c\cup I_{c'}$ and $U_{c\cup c'}^{(m)}=U^{(m)}$ otherwise. Likewise, denote by ${\mathcal {Z}}_{c\cup c'}$ the node labels after aggregation and relabeling with ${\mathcal {Z}}^{(\ell )}_{c\cup c'} =\{{\hat{\sigma }}_{\ell }({\textbf{Z}}^{(j)}), j \in I_{\ell }\}$ for $\ell \in \{c,c'\}$, where ${\hat{\sigma }}_{\ell }$ are the permutations that match the block labels. For convenience, denote by $\beta (x,y) = \log \left( \frac{\Gamma (x)\Gamma (y)}{\Gamma (x+y)}\right) $ the logarithm of the Beta function of x and y. Moreover, for any $c\in \llbracket C\rrbracket , (k,l)\in \llbracket K_c\rrbracket ^2$, denote

$$\begin{aligned} {\textbf{s}}^{(c)}_{k}= \sum _{m\in I_{c}} s_{k}^{(m)},\quad {\textbf{a}}^{(c)}_{k,l}= \sum _{m\in I_{c}} a_{k,l}^{(m)},\quad {\textbf{b}}^{(c)}_{k,l}= \sum _{m\in I_{c}} b_{k,l}^{(m)}. \end{aligned}$$

Then $\Delta _{c,c'}=\textrm{ICL}^{\text {mix}}({\mathcal {A}}, {\mathcal {U}}_{c\cup c'}, {\mathcal {Z}}_{c\cup c'})- \textrm{ICL}^{\text {mix}}({\mathcal {A}}, {\mathcal {U}}, {\mathcal {Z}})$ is given by

$$\begin{aligned}&\Delta _{c,c'} = \sum _{(k,\ell ) } \beta \left( \eta +{\textbf{a}}^{(c)}_{{\hat{\sigma }}_{c}^{-1}(k),{\hat{\sigma }}_{c}^{-1}(\ell )}+ {\textbf{a}}^{(c')}_{{\hat{\sigma }}_{c'}^{-1}(k),{\hat{\sigma }}_{c'}^{-1}(\ell )},\right. \\&\quad \quad \left. \zeta + {\textbf{b}}^{(c)}_{{\hat{\sigma }}_{c}^{-1}(k),{\hat{\sigma }}_{c}^{-1}(\ell )}+ {\textbf{b}}^{(c')}_{{\hat{\sigma }}_{c'}^{-1}(k), {\hat{\sigma }}_{c'}^{-1}(\ell )}\right) \\&\quad \quad - \sum _{(k,\ell ) } \beta \left( \eta +{\textbf{a}}^{(c)}_{k,\ell }, \zeta +{\textbf{b}}^{(c)}_{k,\ell }\right) - \sum _{(k,\ell ) }\beta \left( \eta + {\textbf{a}}^{(c')}_{k,\ell }, \zeta +{\textbf{b}}^{(c')}_{k,\ell }\right) \\&\quad \quad +\sum _{k} \log \left( \Gamma (\alpha +{\textbf{s}}_{{\hat{\sigma }}_{c}^{-1}(k)}^{(c)}+{\textbf{s}}_{{\hat{\sigma }}_{c'}^{-1}(k)}^{(c')})\right) - \log \left( \Gamma (\alpha +{\textbf{s}}_{k}^{(c)})\right) \\&\qquad -\log \left( \Gamma (\alpha +{\textbf{s}}_{k}^{(c')})\right) +\log \left( \frac{\Gamma (\lambda +|I_{c}|+|I_{c'}|)}{\Gamma (\lambda + |I_{c}|)\Gamma (\lambda +|I_{c'}|)}\right) \\&\quad \quad +\log \left( \frac{\Gamma (K_{c}\alpha + \sum _{m\in I_{c}}n^{(m)})\Gamma (K_{c'}\alpha + \sum _{m\in I_{c'}}n^{(m)})}{\Gamma \left( K_{\max }\alpha + \sum _{m\in I_{c}\cup I_{c'}}n^{(m)}\right) }\right) \\&\quad \quad +K_{\min }^2 \beta (\eta ,\zeta ) +K_{\min }\log \left( \Gamma (\alpha )\right) \\&\quad \quad + \beta \left( (C-1)\lambda , \lambda \right) + \log \left( \frac{\Gamma (C\lambda +M)}{\Gamma ((C-1)\lambda +M)}\right) , \end{aligned}$$

(7)

where $K_{\max }=\max \{K_{c},K_{c'}\}$ and $K_{\min }=\min \{K_{c},K_{c'}\}$ are the maximal and minimal number of blocks in the clusters c and $c'$.
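To illustrate how the aggregated statistics enter $\Delta _{c,c'}$, the sketch below evaluates only the three sums of $\beta$ terms over block pairs, in the simplified case where both clusters have the same number of blocks. The remaining terms of (7) are left out. The arguments `inv_c` and `inv_cp` are assumed to hold the inverse matching permutations, so that `inv_c[k]` stands for ${\hat{\sigma }}_{c}^{-1}(k)$; all names and the hyperparameter defaults are hypothetical.

```python
from math import lgamma, log
import numpy as np

def log_beta(x, y):
    """beta(x, y) = log(Gamma(x) Gamma(y) / Gamma(x + y))."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def sum_log_beta(a, b, eta, zeta):
    """Sum of beta(eta + a[k, l], zeta + b[k, l]) over all block pairs (k, l)."""
    return sum(log_beta(eta + x, zeta + y)
               for x, y in zip(np.ravel(a), np.ravel(b)))

def merge_block_term(a_c, b_c, a_cp, b_cp, inv_c, inv_cp, eta=0.5, zeta=0.5):
    """Block-pair part of Delta_{c,c'} for two clusters with equally many blocks."""
    a_c, b_c, a_cp, b_cp = (np.asarray(x) for x in (a_c, b_c, a_cp, b_cp))
    ic, icp = np.asarray(inv_c), np.asarray(inv_cp)
    # Relabel the blocks of each cluster, then add the aggregated statistics.
    a_merged = a_c[np.ix_(ic, ic)] + a_cp[np.ix_(icp, icp)]
    b_merged = b_c[np.ix_(ic, ic)] + b_cp[np.ix_(icp, icp)]
    return (sum_log_beta(a_merged, b_merged, eta, zeta)
            - sum_log_beta(a_c, b_c, eta, zeta)
            - sum_log_beta(a_cp, b_cp, eta, zeta))

# Toy aggregated counts for two clusters with K = 2 blocks each.
a_c, b_c = [[5, 1], [0, 7]], [[1, 9], [10, 1]]
a_cp, b_cp = [[6, 1], [1, 4]], [[0, 8], [8, 2]]
matched = merge_block_term(a_c, b_c, a_cp, b_cp, [0, 1], [0, 1])
mismatched = merge_block_term(a_c, b_c, a_cp, b_cp, [0, 1], [1, 0])
```

Because only per-cluster sums of the statistics are needed, evaluating a candidate merge does not require revisiting the individual networks.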

1.3 Supplement to the analysis of ecological networks

Figure 11 illustrates the clustering of the food webs obtained with the alternative graph moments method by Mukherjee et al. (2017). The resulting clustering differs markedly from the one obtained by our graph clustering procedure.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Rebafka, T. Model-based clustering of multiple networks with a hierarchical algorithm. Stat Comput 34, 32 (2024). https://doi.org/10.1007/s11222-023-10329-w


Received: 19 January 2023
Accepted: 10 October 2023
Published: 07 November 2023
DOI: https://doi.org/10.1007/s11222-023-10329-w
