Clustering memes in social media streams

  • Mohsen JafariAsbagh
  • Emilio Ferrara
  • Onur Varol
  • Filippo Menczer
  • Alessandro Flammini
Original Article

Abstract

The problem of clustering content in social media has pervasive applications, including the identification of discussion topics, event detection, and content recommendation. Here, we describe a streaming framework for online detection and clustering of memes in social media, specifically Twitter. A pre-clustering procedure, namely protomeme detection, first isolates atomic tokens of information carried by the tweets. Protomemes are then aggregated, based on multiple similarity measures, to obtain memes as cohesive groups of tweets reflecting actual concepts or topics of discussion. The clustering algorithm takes into account various dimensions of the data and metadata, including natural language, the social network, and the patterns of information diffusion. As a result, our system can build clusters of semantically, structurally, and topically related tweets. The clustering process is based on a variant of Online K-means that incorporates a memory mechanism, used to “forget” old memes and replace them over time with new ones. We evaluate the framework on a dataset of Twitter trending topics: over a 1-week period, we systematically determined whether our algorithm was able to recover the trending hashtags. We show that the proposed method outperforms baseline algorithms that only use content features, as well as a state-of-the-art event detection method that assumes full knowledge of the underlying follower network. Finally, we show that our online learning framework is flexible, because it is independent of the adopted clustering algorithm, and well suited to a streaming scenario.
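The idea of an online K-means variant with a memory mechanism can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `ForgetfulOnlineKMeans`, the single cosine similarity over bag-of-token vectors, the fixed `threshold`, and the rule of overwriting the least recently updated cluster are all assumptions made for illustration. The abstract states only that old memes are forgotten and replaced over time, and that several similarity measures are combined.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    if not a or not b:
        return 0.0
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


class ForgetfulOnlineKMeans:
    """Hypothetical sketch: at most k clusters, updated one item at a time."""

    def __init__(self, k, threshold):
        self.k = k                  # maximum number of clusters kept in memory
        self.threshold = threshold  # assumed minimum similarity to join a cluster
        self.centroids = []         # one Counter (unnormalized token sums) per cluster
        self.last_seen = []         # step at which each cluster was last updated
        self.step = 0

    def add(self, vector):
        """Assign one item (token -> weight mapping) and return its cluster index."""
        self.step += 1
        vec = Counter(vector)
        sims = [cosine(vec, c) for c in self.centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= self.threshold:
            # Close enough to an existing cluster: fold the item into its centroid.
            self.centroids[best].update(vec)
            idx = best
        elif len(self.centroids) < self.k:
            # Free slot available: start a new cluster from this item.
            self.centroids.append(vec)
            self.last_seen.append(self.step)
            idx = len(self.centroids) - 1
        else:
            # Memory is full: "forget" the least recently updated cluster
            # (an assumed replacement rule) and reuse its slot.
            idx = min(range(self.k), key=self.last_seen.__getitem__)
            self.centroids[idx] = vec
        self.last_seen[idx] = self.step
        return idx
```

With `k=2` and `threshold=0.5`, two items sharing most of their tokens fall into the same cluster, an unrelated item opens a second cluster, and a further unrelated item overwrites whichever cluster has gone longest without an update. A fuller version would replace `cosine` with a combination of content, network, and diffusion similarities, as the abstract describes.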


Copyright information

© Springer-Verlag Wien 2014

Authors and Affiliations

  • Mohsen JafariAsbagh (1)
  • Emilio Ferrara (1)
  • Onur Varol (1)
  • Filippo Menczer (1)
  • Alessandro Flammini (1)

  1. School of Informatics and Computing, Indiana University, Bloomington, USA
