Abstract
The amount of data generated in Online Social Networks (OSNs) is increasing every day. Extracting and understanding trending topics and events from this vast amount of data is an important area of research in OSNs. This paper proposes a novel clustering framework to detect the spread of memes in OSNs in real time. The Offline–Online meme clustering framework exploits similarity scores between different elements of Reddit submissions and two strategies for combining those scores, drawing on Wikipedia concepts as external knowledge, text semantic similarity, and a modified version of the Jaccard coefficient. The two combination strategies are: (1) automatically computing the similarity score weighting factors for five elements of a submission and (2) allowing users to engage in the clustering process through a visualization prototype, where they can filter out outlier submissions, modify submission class labels, or assign different similarity score weighting factors to the elements of a submission. The Offline–Online clustering process first performs a one-pass clustering of existing OSN data, calculating and summarizing each cluster's statistics using Wikipedia concepts. The online component then assigns new streaming data points to the appropriate clusters using a modified version of online k-means. The experimental results show that using Wikipedia as external knowledge together with text semantic similarity improves the speed and accuracy of meme clustering compared to the baselines. For the online clustering process, a damped window model is suitable for online streaming environments: it not only requires low prediction and training costs but also assigns more weight to recent data and popular topics.
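As a rough illustration of the online step described above, the following minimal sketch shows how an online k-means update might be combined with a damped window. It is not the chapter's implementation: the class name `DampedOnlineKMeans`, the `decay` parameter, the base-2 fading function, and the use of squared Euclidean distance on numeric vectors (instead of the chapter's Wikipedia-concept and semantic similarity scores) are all assumptions made for the example.

```python
class DampedOnlineKMeans:
    """Illustrative online k-means with exponentially decayed cluster weights.

    Assumption: points are plain numeric vectors compared by squared
    Euclidean distance; the chapter itself uses other similarity scores.
    """

    def __init__(self, centroids, decay=0.01):
        # Initial centroids would come from an offline clustering pass.
        self.centroids = [list(c) for c in centroids]
        self.weights = [1.0] * len(self.centroids)
        self.decay = decay  # hypothetical decay rate of the damped window
        self.last_t = 0.0

    def update(self, point, t):
        """Assign `point` (arriving at time `t`) to a cluster and update it."""
        # Damped window: fade every cluster weight by 2^(-decay * elapsed).
        fade = 2.0 ** (-self.decay * (t - self.last_t))
        self.weights = [w * fade for w in self.weights]
        self.last_t = t

        # Find the nearest centroid.
        def sq_dist(i):
            return sum((c - x) ** 2 for c, x in zip(self.centroids[i], point))

        k = min(range(len(self.centroids)), key=sq_dist)

        # Weighted running mean: a faded cluster moves further toward new data.
        w = self.weights[k]
        self.centroids[k] = [
            (w * c + x) / (w + 1.0) for c, x in zip(self.centroids[k], point)
        ]
        self.weights[k] = w + 1.0
        return k


model = DampedOnlineKMeans([[0.0, 0.0], [10.0, 10.0]], decay=1.0)
first = model.update([1.0, 1.0], t=1.0)
second = model.update([9.0, 9.0], t=1.0)
```

Because old weights shrink over time, clusters that keep receiving submissions retain high weight, while stale clusters adapt quickly when new data arrives. This is one simple way to give recent data and popular topics more influence.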
Acknowledgements
The research was funded in part by the Natural Sciences and Engineering Research Council of Canada, International Development Research Centre, Ottawa, Canada, Social Sciences and Humanities Research Council of Canada, CNPq, and FAPESP (Brazil).
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Dang, A., Moh’d, A., Gruzd, A., Milios, E., Minghim, R. (2017). An Offline–Online Visual Framework for Clustering Memes in Social Media. In: Kaya, M., Erdoǧan, Ö., Rokne, J. (eds) From Social Data Mining and Analysis to Prediction and Community Detection. Lecture Notes in Social Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-51367-6_1
DOI: https://doi.org/10.1007/978-3-319-51367-6_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-51366-9
Online ISBN: 978-3-319-51367-6
eBook Packages: Computer Science, Computer Science (R0)