Constrained expectation-maximisation for inference of social graphs explaining online user–user interactions

  • Original Article
  • Published in Social Network Analysis and Mining

Abstract

Current network inference algorithms fail to generate graphs with edges that can explain whole sequences of node interactions in a given dataset or trace. To quantify how well an inferred graph can explain a trace, we introduce feasibility, a novel quality criterion, and suggest that it is linked to the result’s accuracy. In addition, we propose CEM-*, a network inference method that guarantees 100% feasibility on online social media traces and is a non-trivial extension of the Expectation-Maximization algorithm developed by Newman (Nature Phys 14:67–75, 2018). We propose a set of linear optimization updates that incorporate auxiliary variables and feasibility constraints; the latter take into consideration all the hidden paths that are possible between users based on their timestamps of interaction and guide the inference toward feasibility. We provide two CEM-* variations, which assume either an Erdős–Rényi (ER) or a Stochastic Block Model (SBM) prior for the underlying graph’s unknown distribution. Extensive experiments on one synthetic and one real-world Twitter dataset show that, for both priors, CEM-* generates a posterior distribution of graphs that explains the whole trace while being closer to the ground truth. As an additional benefit, the SBM prior infers and clusters users simultaneously during optimization. CEM-* outperforms baseline and state-of-the-art methods in terms of feasibility, run-time, and precision of the inferred graph and communities. Finally, we propose a heuristic that adapts the inference to lower feasibility requirements and show how it affects the precision of the result.


Notes

  1. According to the Twitter API documentation of a Tweet object, “retweets of retweets do not show representations of the intermediary retweet, but only the original Tweet.” https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet.

  2. https://github.com/effrosyni-papanastasiou/constrained-em.

  3. https://pypi.org/project/PuLP/.

  4. The highest value is marked with boldface and the second highest value is underlined. max scc: maximum strongly connected component.

  5. N/A in the Tables refers to results not being available after 48 h.

References

  • Blondel V, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 10:P10008


  • Bourigault S, Lamprier S, Gallinari P (2016) Representation learning for information diffusion through social networks: an embedded cascade model. In: Proceedings of the 9th ACM international conference on web search and data mining, pp 573–582

  • Daley DJ, Gani J (1999) Epidemic modelling: an introduction. Cambridge University Press, Cambridge


  • Daneshmand H, Gomez-Rodriguez M, Song L, Schoelkopf B (2014) Estimating diffusion network structures: recovery conditions, sample complexity and soft-thresholding algorithm. In: Proceedings of the 31st international conference on machine learning (ICML)

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc 39(1):1–22


  • Firestone SM, Hayama Y, Lau MSY, Yamamoto T, Nishi T, Bradhurst RA, Demirhan H, Stevenson MA, Tsutsui T (2020) Transmission network reconstruction for foot-and-mouth disease outbreaks incorporating farm-level covariates. PLoS ONE 15(7):e0235660


  • Fraisier O, Cabanac G, Pitarch Y, Besançon R, Boughanem M (2018) #Elysee2017fr: the 2017 French presidential campaign on Twitter. In: Proceedings of the 12th international AAAI conference on web and social media

  • Friedman N, Linial M, Nachman I, Pe’er D (2000) Using Bayesian networks to analyze expression data. J Comput Biol 7:601–620


  • Giesecke K, Schwenkler G, Sirignano JA (2020) Inference for large financial systems. Math Finance 30:3–46


  • Giovanidis A, Baynat B, Magnien C, Vendeville A (2021) Ranking online social users by their influence. IEEE/ACM Trans Netw 29(5):2198–2214


  • Gomez-Rodriguez M, Leskovec J, Krause A (2012) Inferring networks of diffusion and influence. ACM Trans Knowl Discov Data (TKDD) 5(4):1–37


  • Goyal A, Bonchi F, Lakshmanan LV (2010) Learning influence probabilities in social networks. In: Proceedings of the 3rd ACM international conference on Web search and data mining, pp 241–250

  • Harris JW, Stöcker H (1998) Handbook of mathematics and computational science. Springer, Berlin


  • He X, Liu Y (2017) Not enough data? Joint inferring multiple diffusion networks via network generation priors. In: Proceedings of the 10th ACM international conference on web search and data mining, pp 465–474

  • Holland PW, Laskey KB, Leinhardt S (1983) Stochastic blockmodels: first steps. Soc Netw 5:109–137

  • Jin W, Qu M, Jin X, Ren X (2019) Recurrent event network: autoregressive structure inference over temporal knowledge graphs

  • Lagnier C, Denoyer L, Gaussier E, Gallinari P (2013) Predicting information diffusion in social networks using content and user’s profiles. In: European conference on information retrieval, pp 74–85

  • Le CM, Levin K, Levina E (2018) Estimating a network from multiple noisy realizations. Electron J Stat 12:4697–4740


  • Lokhov A (2016) Reconstructing parameters of spreading models from partial observations. Adv Neural Inf Process Syst 29:3467–3475


  • Newman ME (2018) Network structure from rich but noisy data. Nature Phys 14:67–75


  • Newman ME (2018) Estimating network structure from unreliable measurements. Phys Rev E 98(6):062321


  • Papanastasiou E, Giovanidis A (2021) Bayesian inference of a social graph with trace feasibility guarantees. In: Proceedings of the 2021 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM ’21)

  • Peel L, Peixoto TP, De Domenico M (2022) Statistical inference links data and theory in network science. Nature Commun 13(1):6794


  • Peixoto TP (2019) Network reconstruction and community detection from dynamics. Phys Rev Lett 123(12):128301


  • Saito K, Nakano R, Kimura M (2008) Prediction of information diffusion probabilities for independent cascade model. In: International conference on knowledge-based and intelligent information and engineering systems, pp 67–75

  • Wang Z, Chen C, Li W (2019) Information diffusion prediction with network regularized role-based user representation learning. ACM Trans Knowl Discov Data (TKDD) 13:1–23


  • Wu J, Xia J, Gou F (2022) Information transmission mode and IoT community reconstruction based on user influence in opportunistic social networks. Peer-to-Peer Netw Appl 15:1398–1416


  • Wu X, Kumar A, Sheldon D, Zilberstein S (2013) Parameter learning for latent network diffusion. In: Proceedings of the 23rd international joint conference on artificial intelligence, pp 2923–2930

  • Zhang X, Zhang ZK, Wang W, Hou D, Xu J, Ye X, Li S (2021) Multiplex network reconstruction for the coupled spatial diffusion of infodemic and pandemic of COVID-19. Int J Digit Earth 4:401–423


  • Zhang Y, Lyu T, Zhang Y (2018) Cosine: community-preserving social network embedding from information diffusion cascades. In: Proceedings of the AAAI conference on artificial intelligence, vol 32


Acknowledgements

An earlier version of this paper was presented at the 2021 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2021), held virtually on 9–11 November 2021. This work is funded by the ANR (French National Agency of Research) through the “FairEngine” project under grant ANR-19-CE25-0011.

Author information


Contributions

Author 1 wrote the main manuscript text and prepared all figures. All authors reviewed the manuscript.

Corresponding author

Correspondence to Effrosyni Papanastasiou.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: CEM-er

For the E-step, we modify the Newman algorithm by taking the expectation over the set of random variables \(Y_{ij}\) at both sides of (10):

$$\begin{aligned}&{\mathbb {E}}[\log {P}(\theta \text { }|\text { } \mathcal {T})] \ge {\mathbb {E}} \Big [\sum _{{{\textbf {A}}}} q({{\textbf {A}}}) \log \frac{{P}({{\textbf {A}}}, \theta \text { }|\text { } \mathcal {T})}{q({{\textbf {A}}})}\Big ] \nonumber \\&\quad = \sum _{{{\textbf {A}}}} q({{\textbf {A}}})\big ({\mathbb {E}}[\log {P}({{\textbf {A}}}, \theta \text { }|\text { } \mathcal {T})] - \log q({{\textbf {A}}})\big ). \end{aligned}$$
(28)

To find \({\mathbb {E}}[\log {P}({{\textbf {A}}},\theta \text { }|\text { }\mathcal {T})]\), we substitute (9) into (8). Setting \(\Gamma ={P}(\theta )/{P}(\mathcal {T})\), the expectation of the log of (8) becomes:

$$\begin{aligned}&{\mathbb {E}}[\log {P}({{\textbf {A}}},\theta \text { }|\text { }\mathcal {T})] = \log \Gamma + \sum _{i \ne j} \Big [A_{ij} \Big (\log \rho + {\mathbb {E}}[Y_{ij}]\log \alpha \nonumber \\&\quad +(M_{ij}-{\mathbb {E}}[Y_{ij}])\log {(1 - \alpha )}\Big ) +(1-A_{ij})\Big (\log (1-\rho ) \nonumber \\&\quad + {\mathbb {E}}[Y_{ij}]\log \beta +(M_{ij} - {\mathbb {E}}[Y_{ij}])\log {(1 - \beta )}\Big )\Big ]. \end{aligned}$$
(29)

Then, substituting (7) into (29), and (29) into (28), we get:

$$\begin{aligned}&{\mathbb {E}}[\log {P}(\theta \text { }|\text { } \mathcal {T})] \ge \sum _{{{\textbf {A}}}} q({{\textbf {A}}})\log \frac{D_{ij}}{q({{\textbf {A}}})} \end{aligned}$$
(30)
$$\begin{aligned}&\text {where, } D_{ij} = \Gamma \prod _{i \ne j}{\left[ \rho \alpha ^{M_{ij}\sigma _{ij}}{(1 - \alpha )}^{M_{ij}(1-\sigma _{ij})}\right] }^{A_{ij}} \nonumber \\&\quad \times {\left[ (1-\rho )\beta ^{M_{ij}\sigma _{ij}}{(1 - \beta )}^{M_{ij}(1-\sigma _{ij})}\right] }^{1-A_{ij} }. \end{aligned}$$
(31)

For the M-step of the EM algorithm, the function that we want to maximize is \({\mathbb {E}}[\log {P}(\theta \text { }|\text { } \mathcal {T})]\). To do so, we need to find the unknown values, \(q({{\textbf {A}}})\) and \(\theta =\){\(\alpha , \beta , \rho , \varvec{\sigma }\)}, that maximize the expectation on the left-hand side of (11), under the feasibility constraints on the parameter set \(\theta \). Of these, only the \(\sigma _{ij}\) have a nontrivial constraint set, specified in (5) and (6).

Solution with respect to \({\varvec{q}}({{\textbf {A}}})\). We notice that the choice of \(q({{\textbf {A}}})\) that achieves equality (i.e. maximizes the right-hand side) in (11) is:

$$\begin{aligned} q({{\textbf {A}}}) = \dfrac{D_{ij}}{\sum _{{{\textbf {A}}}}D_{ij}}, \end{aligned}$$
(32)

which leads us to Eq. (13).
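To make the E-step concrete: because \(\Gamma \) and the factors of all other pairs cancel in (32), each marginal edge probability \(Q_{ij}\) reduces to the ratio of the “edge” term of \(D_{ij}\) to the sum of its “edge” and “no-edge” terms. A minimal NumPy sketch of this computation (function and variable names are ours, not taken from the paper’s code):

```python
import numpy as np

def edge_posterior(M, sigma, rho, alpha, beta):
    """Posterior edge probability Q_ij under the ER prior.

    Ratio of the "edge" term of D_ij (Eq. 31) to the sum of the
    "edge" and "no-edge" terms; Gamma cancels out.  M and sigma are
    (n, n) arrays, rho/alpha/beta are scalars.
    """
    E = M * sigma  # expected number of explained interactions per pair
    edge = rho * alpha**E * (1 - alpha)**(M - E)
    no_edge = (1 - rho) * beta**E * (1 - beta)**(M - E)
    return edge / (edge + no_edge)

# Toy trace statistics for two users (values are illustrative only).
M = np.array([[0.0, 3.0], [1.0, 0.0]])      # interaction counts
sigma = np.array([[0.0, 0.8], [0.2, 0.0]])  # current sigma estimates
Q = edge_posterior(M, sigma, rho=0.1, alpha=0.7, beta=0.05)
```

The pair with many interactions mostly marked as explained (\(\sigma \) high) receives a posterior edge probability close to 1, as expected.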

Solution with respect to \(\varvec{\alpha }, \varvec{\beta }, \varvec{\rho }\). To maximize the right-hand side of (11) in terms of the parameter \(\alpha \), we differentiate it with respect to \(\alpha \) and set the derivative equal to zero (while holding \(\sigma _{ij}\) and q constant):

$$\begin{aligned} \sum _{i \ne j} Q_{ij} M_{ij} \left( \frac{\sigma _{ij}}{\alpha } - \frac{1-\sigma _{ij}}{1-\alpha }\right) = 0. \end{aligned}$$
(33)

After rearranging, we get the updates shown in Eq. (15), and we repeat likewise for \(\beta \) and \(\rho \).
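Solving (33) for \(\alpha \) yields a closed-form weighted average, and repeating the derivation for \(\beta \) swaps \(Q_{ij}\) for \(1-Q_{ij}\). A small sketch of these updates (our naming; \(Q\), \(M\) and \(\sigma \) are \(n \times n\) arrays):

```python
import numpy as np

def update_alpha_beta(Q, M, sigma):
    """Closed-form M-step updates obtained by solving Eq. (33) and its
    analogue for beta:
        alpha = sum_ij Q_ij M_ij sigma_ij     / sum_ij Q_ij M_ij
        beta  = sum_ij (1-Q_ij) M_ij sigma_ij / sum_ij (1-Q_ij) M_ij
    """
    alpha = (Q * M * sigma).sum() / (Q * M).sum()
    beta = ((1 - Q) * M * sigma).sum() / ((1 - Q) * M).sum()
    return alpha, beta

# Illustrative values only.
Q = np.array([[0.0, 0.9], [0.4, 0.0]])
M = np.array([[0.0, 3.0], [2.0, 0.0]])
sigma = np.array([[0.0, 0.8], [0.5, 0.0]])
alpha, beta = update_alpha_beta(Q, M, sigma)
```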

Solution with respect to \(\varvec{\sigma }_{ij}\). If we take into account that \(Q_{ij} = \sum _{{{\textbf {A}}}}q({{\textbf {A}}})A_{ij}\) and also that \(\sum _{{{\textbf {A}}}} q({{\textbf {A}}}) = 1 \), by rearranging the right-hand side of (11), the problem becomes equivalent to maximizing:

$$\begin{aligned}&\sum _{{{\textbf {A}}}} q({{\textbf {A}}}) \sum _{i \ne j} \sigma _{ij} M_{ij}\left( A_{ij}\log \frac{\alpha }{1-\alpha } + (1-A_{ij})\log \frac{\beta }{1-\beta }\right) \ \nonumber \\&\quad = \sum _{i \ne j} \sigma _{ij} M_{ij}\left( Q_{ij}\log \frac{\alpha }{1-\alpha }+ (1-Q_{ij})\log \frac{\beta }{1-\beta }\right) . \end{aligned}$$
(34)

This leads us to the constrained optimization problem of Eq. (17).
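Since the objective in (34) is linear in the \(\sigma _{ij}\), the constrained problem of Eq. (17) can be handed to an off-the-shelf LP solver; footnote 3 points to the PuLP package. The sketch below is a simplified stand-in, not the paper’s implementation: `parent_sets` is a hypothetical encoding of the feasibility constraints (5)–(6), requiring each observed interaction to be explainable by at least one active hidden parent edge.

```python
import numpy as np
import pulp

def solve_sigma(Q, M, alpha, beta, parent_sets):
    """Linear-programming M-step for sigma (objective of Eq. (34)).

    `parent_sets` is a hypothetical stand-in for constraints (5)-(6):
    for each observed interaction, the sigma's of its possible hidden
    parent edges (i, j) must sum to at least 1.
    """
    n = Q.shape[0]
    # Coefficient of sigma_ij in the linear objective (34).
    c = M * (Q * np.log(alpha / (1 - alpha))
             + (1 - Q) * np.log(beta / (1 - beta)))
    prob = pulp.LpProblem("sigma_step", pulp.LpMaximize)
    s = {(i, j): pulp.LpVariable(f"s_{i}_{j}", 0, 1)
         for i in range(n) for j in range(n) if i != j}
    prob += pulp.lpSum(c[i, j] * s[i, j] for (i, j) in s)
    for pairs in parent_sets:  # hypothetical feasibility constraints
        prob += pulp.lpSum(s[i, j] for (i, j) in pairs) >= 1
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    sigma = np.zeros((n, n))
    for (i, j), var in s.items():
        sigma[i, j] = var.value()
    return sigma

# Toy example: two users, one interaction that must be explained.
Q = np.array([[0.0, 0.95], [0.05, 0.0]])
M = np.array([[0.0, 2.0], [2.0, 0.0]])
sigma_star = solve_sigma(Q, M, alpha=0.7, beta=0.05,
                         parent_sets=[[(0, 1), (1, 0)]])
```

The edge with high posterior probability carries a positive objective coefficient, so the solver activates it and the feasibility constraint is satisfied without forcing the unlikely reverse edge.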

Appendix B: CEM-sbm

For the E-step of the EM algorithm, we modify the Newman algorithm by taking the expectation over the set of random variables \(Y_{ij}\) at both sides of (20):

$$\begin{aligned}&{\mathbb {E}}[\log {P}(\theta \text { }| \text { } \mathcal {T})] \ge {\mathbb {E}} \Big [\sum _{{{\textbf {A}}}} q({{\textbf {A}}}, {{\textbf {g}}}) \log \frac{{P}({{\textbf {A}}}, {{\textbf {g}}}, \theta \text { }|\text { }\mathcal {T})}{q({{\textbf {A}}}, {{\textbf {g}}})}\Big ] \nonumber \\&\quad = \sum _{{{\textbf {A}}}} q({{\textbf {A}}}, {{\textbf {g}}})\big ({\mathbb {E}}[\log {P}({{\textbf {A}}}, {{\textbf {g}}}, \theta \text { }|\text { }\mathcal {T})] - \log q({{\textbf {A}}}, {{\textbf {g}}})\big ). \end{aligned}$$
(35)

To find \({\mathbb {E}}[\log {P}({{\textbf {A}}}, {{\textbf {g}}}, \theta \text { }| \text { } \mathcal {T})]\), we substitute (19) into (18). Setting \(\Gamma ={P}(\theta )/{P}(\mathcal {T})\), the expectation of the log of (18) becomes:

$$\begin{aligned}&{\mathbb {E}}[\log {P}({{\textbf {A}}}, {{\textbf {g}}}, \theta \text { }|\text { } \mathcal {T})] = \log \Gamma + \sum _{\begin{array}{c} {i \ne j}\\ {g_{i} = g_{j}} \end{array}} \Big [A_{ij} \Big (\log p + {\mathbb {E}}[Y_{ij}]\log \alpha \nonumber \\&\quad +(M_{ij}-{\mathbb {E}}[Y_{ij}])\log {(1 - \alpha )}\Big ) +(1-A_{ij})\Big (\log (1-p) \nonumber \\&\quad + {\mathbb {E}}[Y_{ij}]\log \beta +(M_{ij} - {\mathbb {E}}[Y_{ij}])\log {(1 - \beta )}\Big )\Big ] \nonumber \\&\quad + \sum _{\begin{array}{c} {i \ne j}\\ {g_{i} \ne g_{j}} \end{array}} \Big [A_{ij} \Big (\log q + {\mathbb {E}}[Y_{ij}]\log \alpha \nonumber \\&\quad +(M_{ij}-{\mathbb {E}}[Y_{ij}])\log {(1 - \alpha )}\Big ) +(1-A_{ij})\Big (\log (1-q) \nonumber \\&\quad + {\mathbb {E}}[Y_{ij}]\log \beta +(M_{ij} - {\mathbb {E}}[Y_{ij}])\log {(1 - \beta )}\Big )\Big ]. \end{aligned}$$
(36)

By substituting (7) into (36), and then (36) into (35), we get:

$$\begin{aligned} {\mathbb {E}}[\log {P}(\theta \text { }|\text { } \mathcal {T})] \ge \sum _{{{\textbf {A}}}} q({{\textbf {A}}}, {{\textbf {g}}})\log \frac{D({{\textbf {A}}}, {{\textbf {g}}})}{q({{\textbf {A}}}, {{\textbf {g}}})}, \end{aligned}$$
(37)

where,

$$\begin{aligned}&D({{\textbf {A}}}, {{\textbf {g}}})= \Gamma \prod _{\begin{array}{c} {i \ne j}\\ {g_{i} = g_{j}} \end{array}}{\left[ p \alpha ^{M_{ij}\sigma _{ij}}{(1 - \alpha )}^{M_{ij}(1-\sigma _{ij})}\right] }^{A_{ij}} \nonumber \\&\quad \times {\left[ (1-p)\beta ^{M_{ij}\sigma _{ij}}{(1 - \beta )}^{M_{ij}(1-\sigma _{ij})}\right] }^{1-A_{ij}} \nonumber \\&\quad \times \prod _{\begin{array}{c} {i \ne j} \\ {g_{i} \ne g_{j}} \end{array}}{\left[ q \alpha ^{M_{ij}\sigma _{ij}}{(1 - \alpha )}^{M_{ij}(1-\sigma _{ij})}\right] }^{A_{ij}} \nonumber \\&\quad \times {\left[ (1-q)\beta ^{M_{ij}\sigma _{ij}}{(1 - \beta )}^{M_{ij}(1-\sigma _{ij})}\right] }^{1-A_{ij}}. \end{aligned}$$
(38)

For the M-step of EM, we maximize the expectation \({\mathbb {E}}[\log {P}(\theta \text { } |\text { } \mathcal {T})]\) as we did in the CEM-er case.

Solution with respect to \({\varvec{q}}({{\textbf {A}}}, {{\textbf {g}}})\). We notice that the choice of \(q({{\textbf {A}}}, {{\textbf {g}}})\) that achieves equality (i.e. maximizes the right-hand side) in (37) is:

$$\begin{aligned} q({{\textbf {A}}}, {{\textbf {g}}}) = \dfrac{D({{\textbf {A}}}, {{\textbf {g}}})}{\sum _{{{\textbf {A}}}}D({{\textbf {A}}}, {{\textbf {g}}})}. \end{aligned}$$
(39)

From (39), in a similar fashion to Newman’s method [Eq. (13), 20], and because \(\Gamma \) cancels out, we get:

$$\begin{aligned}&q({{\textbf {A}}}, {{\textbf {g}}}) = \prod _{i \ne j, (g_{i} = g_{j})}Q_{ij}(g_{i},g_{j})^{A_{ij}}(1-Q_{ij}(g_{i},g_{j}))^{1-A_{ij}} \nonumber \\&\quad \prod _{i \ne j, (g_{i} \ne g_{j})}Q_{ij}(g_{i},g_{j})^{A_{ij}}(1-Q_{ij}(g_{i},g_{j}))^{1-A_{ij}}. \end{aligned}$$
(40)

Hence, given Eq. (38), the values of \(Q_{ij}\) are found to be the ones in Eq. (21) and (22).
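Concretely, (40) factorizes over node pairs, and the per-pair probabilities \(Q_{ij}(g_{i},g_{j})\) of Eqs. (21) and (22) differ from the ER case only in the prior density: p for within-group pairs and q for across-group pairs. A sketch under that reading (naming is ours):

```python
import numpy as np

def edge_posterior_sbm(M, sigma, g, p, q, alpha, beta):
    """Posterior edge probabilities Q_ij(g_i, g_j) under the SBM prior:
    same form as the ER case, but the prior density is p when
    g_i == g_j and q otherwise (cf. Eqs. 21-22).
    """
    prior = np.where(g[:, None] == g[None, :], p, q)
    E = M * sigma
    edge = prior * alpha**E * (1 - alpha)**(M - E)
    no_edge = (1 - prior) * beta**E * (1 - beta)**(M - E)
    return edge / (edge + no_edge)

# Three users, users 0 and 1 in the same group (illustrative values).
g = np.array([0, 0, 1])
M = np.full((3, 3), 2.0); np.fill_diagonal(M, 0.0)
sigma = np.full((3, 3), 0.5); np.fill_diagonal(sigma, 0.0)
Q = edge_posterior_sbm(M, sigma, g, p=0.3, q=0.05, alpha=0.7, beta=0.05)
```

With identical trace statistics, within-group pairs receive a higher posterior edge probability purely because of the denser prior, which is what couples inference and clustering.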

Our goal is to find the unknown parameters \(\theta =\){\(\alpha , \beta , p, q, \varvec{\sigma }\)} that maximize the right-hand side of (37), given the maximizing distribution for \(q({{\textbf {A}}}, {{\textbf {g}}})\) in (39), and hence the values of \(Q_{ij}(g_{i},g_{j})\) in (40).

Solution with respect to \(\varvec{\alpha }, \varvec{\beta }, {\varvec{p}}, {\varvec{q}}\). To maximize the right-hand side of (37) in terms of parameter \(\alpha \), we differentiate the equation with respect to \(\alpha \) and we set it equal to zero (while holding the rest of the parameters \(\theta \) constant):

$$\begin{aligned} \sum _{i \ne j} Q_{ij}(g_{i},g_{j}) M_{ij} \left( \frac{\sigma _{ij}}{\alpha } - \frac{1-\sigma _{ij}}{1-\alpha }\right) = 0. \end{aligned}$$
(41)

After rearranging, we get the value in Eq. (23). By repeating the same procedure for \(\beta \), we get Eq. (24). Likewise, differentiating the right-hand side of (37) with respect to p and setting it equal to zero, we get:

$$\begin{aligned} \sum _{{{\textbf {A}}}} q({\textbf {A, g}}) \sum _ {\begin{array}{c} {i \ne j}\\ {g_{i} = g_{j}} \end{array}} \left( \frac{A_{ij}}{p} - \frac{1-A_{ij}}{1-p} \right) = 0. \end{aligned}$$
(42)

This is how we get the updates for p in Eq. (25), and, likewise, for q in Eq. (26).
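Solving (42) gives p as the average posterior edge probability over ordered within-group pairs, and the analogous equation gives q as the average over across-group pairs. A sketch of both updates (our naming; g is a length-n array of group labels):

```python
import numpy as np

def update_p_q(Q, g):
    """M-step updates for the SBM densities, solving Eq. (42) and its
    analogue: p is the mean of Q_ij over ordered within-group pairs
    (i != j), q the mean over ordered across-group pairs.
    """
    n = len(g)
    same = (g[:, None] == g[None, :]) & ~np.eye(n, dtype=bool)
    diff = g[:, None] != g[None, :]
    return Q[same].mean(), Q[diff].mean()

# Illustrative posterior edge probabilities and group labels.
Q = np.array([[0.0, 0.9, 0.1],
              [0.8, 0.0, 0.2],
              [0.1, 0.3, 0.0]])
g = np.array([0, 0, 1])
p, q = update_p_q(Q, g)
```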

Solution with respect to \(\varvec{\sigma }_{ij}\). If we take into account that \(Q_{ij}(g_{i},g_{j}) =\sum _{{{\textbf {A}}}}q({{\textbf {A}}}, {{\textbf {g}}})A_{ij}\) and also that \(\sum _{{{\textbf {A}}}} q({{\textbf {A}}}, {{\textbf {g}}}) = 1 \), by rearranging the right-hand side of (37), the problem becomes equivalent to maximizing:

$$\begin{aligned}&\sum _{{{\textbf {A}}}} q({{\textbf {A}}}, {{\textbf {g}}}) \sum _{i \ne j} \sigma _{ij} M_{ij}\left( A_{ij}\log \frac{\alpha }{1-\alpha } \right. \nonumber \\&\qquad \left. + (1-A_{ij})\log \frac{\beta }{1-\beta }\right) \nonumber \\&\quad = \sum _{i \ne j} \sigma _{ij} M_{ij}\left( Q_{ij}(g_{i},g_{j})\log \frac{\alpha }{1-\alpha }\right. \nonumber \\&\qquad \left. + (1-Q_{ij}(g_{i},g_{j}))\log \frac{\beta }{1-\beta }\right) . \end{aligned}$$
(43)

This leads us to the optimization problem of Eq. (27) through which we can find the \(\sigma _{ij}\) values.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Papanastasiou, E., Giovanidis, A. Constrained expectation-maximisation for inference of social graphs explaining online user–user interactions. Soc. Netw. Anal. Min. 13, 41 (2023). https://doi.org/10.1007/s13278-023-01037-4

