Web-scale provenance reconstruction of implicit information diffusion on social media


Fast, massive, and viral data diffused on social media affects a large share of the online population, and thus the (prospective) information diffusion mechanisms behind it are of great interest to researchers. The (retrospective) provenance of such data is equally important because it contributes to the understanding of the relevance and trustworthiness of the information. Furthermore, computing provenance in a timely way is crucial for particular use cases and practitioners, such as online journalists who need to assess specific pieces of information promptly. Social media currently provide insufficient mechanisms for provenance tracking, publication, and generation, while the state of the art in social media research focuses mainly on explicit diffusion mechanisms (like retweets on Twitter or reshares on Facebook). Implicit diffusion mechanisms remain understudied because they are difficult to capture and to understand properly. On the technical side, the state of the art in provenance reconstruction evaluates small datasets after the fact, sidestepping the scale and speed requirements of current social media data. In this paper, we investigate the mechanisms of implicit information diffusion by computing its fine-grained provenance. We prove that explicit mechanisms are insufficient to capture influence, and our analysis unravels a significant part of implicit interactions and influence in social media. Our approach works incrementally and can be scaled up to cover a truly Web-scale scenario like major events. We can process datasets consisting of up to several millions of messages on a single machine at rates that cover bursty behaviour, without compromising result quality. In doing so, we provide online journalists, and social media users in general, with fine-grained provenance reconstruction that sheds light on implicit interactions not captured by social media providers. These results are provided in an online fashion, which also allows for fast relevance and trustworthiness assessment.
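To make the idea of incremental, similarity-based provenance reconstruction concrete, the following is a minimal sketch and not the method of the paper, which the abstract does not specify. It assumes that an implicit source of a message can be approximated as the most similar earlier message, measured by cosine similarity over word counts. The function names (`vectorize`, `cosine`, `reconstruct`) and the threshold of 0.6 are hypothetical choices made only for this illustration.

```python
from collections import Counter
from math import sqrt


def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-split message."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reconstruct(messages, threshold=0.6):
    """Process (message_id, text) pairs in arrival order.

    Returns {message_id: source_id or None}. A message is linked to the
    most similar earlier message whose similarity reaches `threshold`
    (an assumed value); ties go to the earliest candidate. Messages with
    no sufficiently similar predecessor are treated as origins.
    """
    seen = []        # (message_id, vector) for every message processed so far
    provenance = {}
    for message_id, text in messages:
        vector = vectorize(text)
        best_id, best_sim = None, 0.0
        for earlier_id, earlier_vector in seen:
            sim = cosine(vector, earlier_vector)
            if sim >= threshold and sim > best_sim:
                best_id, best_sim = earlier_id, sim
        provenance[message_id] = best_id
        seen.append((message_id, vector))
    return provenance


if __name__ == "__main__":
    stream = [
        ("m1", "earthquake hits the city center"),
        ("m2", "lovely weather today"),
        ("m3", "Earthquake hits the city center now"),
    ]
    print(reconstruct(stream))
```

Each message is handled once, in arrival order, which is what makes the sketch incremental. The linear scan over all earlier messages would not scale to millions of messages; a real system would need an index or clustering structure to bound the number of comparisons.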





Corresponding author

Correspondence to Io Taxidou.


Cite this article

Taxidou, I., Lieber, S., Fischer, P.M. et al. Web-scale provenance reconstruction of implicit information diffusion on social media. Distrib Parallel Databases 36, 47–79 (2018). https://doi.org/10.1007/s10619-017-7211-3


  • Provenance
  • Information diffusion
  • Incremental clustering
  • Social media
  • Influence