An Offline–Online Visual Framework for Clustering Memes in Social Media

Part of the Lecture Notes in Social Networks book series (LNSN)


The amount of data generated in Online Social Networks (OSNs) is increasing every day. Extracting and understanding trending topics and events from this vast amount of data is an important area of research in OSNs. This paper proposes a novel clustering framework to detect the spread of memes in OSNs in real time. The Offline–Online meme clustering framework computes similarity scores between different elements of Reddit submissions using Wikipedia concepts as external knowledge, text semantic similarity, and a modified version of the Jaccard coefficient, and it provides two strategies for combining those scores. The two combination strategies are: (1) automatically computing the similarity score weighting factors for five elements of a submission, and (2) allowing users to engage in the clustering process through a visualization prototype, where they can filter out outlier submissions, modify submission class labels, or assign different similarity score weighting factors to the elements of a submission. In the offline step, the framework performs a one-pass clustering of existing OSN data, calculating and summarizing each cluster's statistics using Wikipedia concepts. In the online step, it assigns new streaming data points to the appropriate clusters using a modified version of online k-means. The experimental results show that using Wikipedia as external knowledge together with text semantic similarity improves the speed and accuracy of meme clustering compared with the baselines. For the online clustering process, a damped window model is suitable for online streaming environments because it not only requires low prediction and training costs but also assigns more weight to recent data and popular topics.
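
The online step mentioned in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the class name `DampedOnlineKMeans`, the `decay` parameter, the base-2 fading function, and the use of plain numeric vectors with squared Euclidean distance are all assumptions made for illustration. The chapter itself works with similarity scores over submission elements and Wikipedia concepts, and its exact update rule is not given here.

```python
class DampedOnlineKMeans:
    """Illustrative online k-means with damped-window (exponential) fading.

    Hypothetical sketch: names, the fading function and the distance
    measure are assumptions, not taken from the chapter.
    """

    def __init__(self, centroids, decay=0.01):
        # Initial centroids would come from an offline clustering pass.
        self.centroids = [list(c) for c in centroids]
        # Faded point count per cluster (each starts with weight 1).
        self.weights = [1.0] * len(self.centroids)
        # Timestamp of the last update of each cluster.
        self.last_update = [0.0] * len(self.centroids)
        self.decay = decay

    @staticmethod
    def _sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def assign(self, point, timestamp):
        """Assign a new point to its nearest cluster and update that cluster."""
        k = min(
            range(len(self.centroids)),
            key=lambda i: self._sq_dist(self.centroids[i], point),
        )
        # Damped window: older evidence fades as 2^(-decay * elapsed_time).
        elapsed = max(0.0, timestamp - self.last_update[k])
        faded = self.weights[k] * 2.0 ** (-self.decay * elapsed)
        new_weight = faded + 1.0
        # Weighted mean of the faded centroid and the new point.
        self.centroids[k] = [
            (faded * c + x) / new_weight
            for c, x in zip(self.centroids[k], point)
        ]
        self.weights[k] = new_weight
        self.last_update[k] = timestamp
        return k


if __name__ == "__main__":
    model = DampedOnlineKMeans([[0.0, 0.0], [10.0, 10.0]], decay=1.0)
    print(model.assign([2.0, 0.0], timestamp=1.0))
    print(model.centroids[0])
```

Because the stored weight of a cluster fades with elapsed time, a recent point moves the centroid more than it would under a plain running mean, which is the sense in which a damped window favours recent data. Each assignment costs one pass over the centroids.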


Keywords: Online algorithms; Clustering memes; Social media; Semantic Jaccard index; Wikipedia entity linking; Visual analysis



The research was funded in part by the Natural Sciences and Engineering Research Council of Canada, International Development Research Centre, Ottawa, Canada, Social Sciences and Humanities Research Council of Canada, CNPq, and FAPESP (Brazil).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Faculty of Computer Science, Dalhousie University, Halifax, Canada
  2. Ted Rogers School of Management, Ryerson University, Toronto, Canada
  3. University of São Paulo-USP, ICMC, São Carlos, Brazil
