Clustering memes in social media streams

  • Original Article
  • Social Network Analysis and Mining

Abstract

The problem of clustering content in social media has pervasive applications, including the identification of discussion topics, event detection, and content recommendation. Here, we describe a streaming framework for online detection and clustering of memes in social media, specifically Twitter. A pre-clustering procedure, namely protomeme detection, first isolates atomic tokens of information carried by the tweets. Protomemes are thereafter aggregated, based on multiple similarity measures, to obtain memes as cohesive groups of tweets reflecting actual concepts or topics of discussion. The clustering algorithm takes into account various dimensions of the data and metadata, including natural language, the social network, and the patterns of information diffusion. As a result, our system can build clusters of semantically, structurally, and topically related tweets. The clustering process is based on a variant of Online K-means that incorporates a memory mechanism, used to “forget” old memes and replace them over time with the new ones. The evaluation of our framework is carried out using a dataset of Twitter trending topics. Over a 1-week period, we systematically determined whether our algorithm was able to recover the trending hashtags. We show that the proposed method outperforms baseline algorithms that only use content features, as well as a state-of-the-art event detection method that assumes full knowledge of the underlying follower network. We finally show that our online learning framework is flexible, due to its independence of the adopted clustering algorithm, and best suited to work in a streaming scenario.
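The memory mechanism mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `ForgetfulOnlineKMeans`, the fixed similarity `threshold`, the use of cosine similarity over term counts alone, and the rule of evicting the least recently updated cluster are all assumptions made for illustration. The abstract only states that old memes are forgotten and replaced over time, and that several similarity measures are combined.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ForgetfulOnlineKMeans:
    """Hypothetical online clustering with at most k clusters.

    Assumption: an item joins its most similar cluster if the similarity
    reaches `threshold`; otherwise it starts a new cluster, evicting the
    least recently updated one when k clusters already exist.
    """

    def __init__(self, k, threshold):
        self.k = k
        self.threshold = threshold
        self.clusters = []  # each: {"centroid": Counter, "last_update": int}
        self.clock = 0      # logical time, advanced once per item

    def add(self, tokens):
        """Assign one item (a list of tokens) and return its cluster index."""
        self.clock += 1
        vec = Counter(tokens)

        # Find the most similar existing cluster, if any.
        best, best_sim = None, -1.0
        for idx, cluster in enumerate(self.clusters):
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = idx, sim

        if best is not None and best_sim >= self.threshold:
            # Summed counts act as an unnormalized centroid; cosine
            # similarity ignores the scale, so no division is needed.
            self.clusters[best]["centroid"].update(vec)
            self.clusters[best]["last_update"] = self.clock
            return best

        new_cluster = {"centroid": vec, "last_update": self.clock}
        if len(self.clusters) < self.k:
            self.clusters.append(new_cluster)
            return len(self.clusters) - 1

        # "Forget" the cluster that has gone longest without an update.
        stale = min(range(self.k), key=lambda i: self.clusters[i]["last_update"])
        self.clusters[stale] = new_cluster
        return stale
```

Each call to `add` takes the tokens of one incoming item and returns the index of the cluster it was placed in. A fixed threshold keeps the sketch short; an adaptive threshold or a time-based sliding window would be equally plausible ways to realize forgetting.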

Notes

  1. Examples of hashtag injections include hijacked campaigns such as those by McDonald's and the NYPD (CNBC 2013; BBC 2014).

  2. Term vectors might or might not include retweets; in all our experiments, we include retweets. Our framework also makes no assumption about the language of the tweets, so it can work with multiple languages.

  3. Note that \(R_{p}\) is not necessarily a subset of \(U_{p}\) when only a sample of the tweets is considered in the stream; the sample may include a retweeted message but not the original one.

References

  • Aggarwal C, Subbian K (2012) Event detection in social streams. In: Proceedings of SIAM international conference on data mining, 2012

  • Albers S, Leonardi S (1999) Online algorithms. ACM Comput Surv 31(3)

  • Babcock B, Babu S, Datar M, Motwani R, Widom J (2002) Models and issues in data stream systems. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2002. ACM, New York, pp 1–16

  • Bakshy E, Hofman J, Mason W, Watts D (2011) Everyone’s an influencer: quantifying influence on twitter. In: Proceedings of the 4th ACM international conference on web search and data mining, 2011. ACM, New York, pp 65–74

  • Banerjee A, Ghosh J (2004) Frequency-sensitive competitive learning for scalable balanced clustering on high-dimensional hyperspheres. IEEE Trans Neural Netw 15(3):702–719

  • BBC (2014) NYPD Twitter campaign ‘backfires’ after hashtag hijacked. http://www.bbc.com/news/technology-27126041

  • Becker H, Naaman M, Gravano L (2010) Learning similarity metrics for event identification in social media. In: Proceedings of the 3rd ACM international conference on web search and data mining, 2010. ACM, New York, pp 291–300

  • Becker H, Naaman M, Gravano L (2011) Beyond trending topics: real-world event identification on twitter. In: Proceedings of the 5th international AAAI conference on weblogs and social media, 2011

  • Blum A (1998) On-line algorithms in machine learning. Springer, Berlin

  • Cao F, Ester M, Qian W, Zhou A (2006) Density-based clustering over an evolving data stream with noise. In: 2006 SIAM conference on data mining, 2006, pp 328–339

  • Cataldi M, Caro LD, Schifanella C (2013) Personalized emerging topic detection based on a term aging model. ACM Trans Intell Syst Technol 5(1):7

  • Cesa-Bianchi N (2006) Prediction, learning, and games. Cambridge University Press, Cambridge

  • Chew C, Eysenbach G (2010) Pandemics in the age of twitter: content analysis of tweets during the 2009 H1N1 outbreak. PLoS One 5(11):e14118

  • CNBC (2013) #McFail? McDonald’s Twitter campaign gets hijacked. http://www.cnbc.com/id/46132132

  • Conover M, Ratkiewicz J, Francisco M, Gonçalves B, Menczer F, Flammini A (2011) Political polarization on twitter. In: ICWSM, 2011

  • Conover MD, Davis C, Ferrara E, McKelvey K, Menczer F, Flammini A (2013) The geospatial characteristics of a social movement communication network. PLoS One 8(3):e55957

  • Conover MD, Ferrara E, Menczer F, Flammini A (2013) The digital evolution of Occupy Wall Street. PLoS One 8(5):e64679

  • Danon L, Díaz-Guilera A, Duch J, Arenas A (2005) Comparing community structure identification. J Stat Mech Theory Exp 2005(09):P09008

  • Ferrara E, JafariAsbagh M, Varol O, Qazvinian V, Menczer F, Flammini A (2013) Clustering memes in social media. In: Proceedings of the 2013 IEEE/ACM international conference on advances in social networks analysis and mining, 2013. IEEE/ACM, pp 548–555

  • Ferrara E, Varol O, Davis C, Menczer F, Flammini A (2014) The rise of social bots. arXiv preprint arXiv:1407.5225

  • Ferrara E, Varol O, Menczer F, Flammini A (2013) Traveling trends: social butterflies or frequent fliers? In: Proceedings of the first ACM conference on Online social networks, 2013. ACM, pp 213–222

  • Fiat A, Woeginger G (1998) Online algorithms: the state of the art. Springer, Heidelberg

  • Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Rec 34(2):18–26

  • Gama J, Gaber MM (2007) Learning from data streams. Springer, Berlin

  • Gama J, Rodrigues PP, Spinosa EJ, de Carvalho ACPLF (2010) Knowledge discovery from data streams. Chapman and Hall/CRC, Boca Raton

  • Golder S, Huberman B (2006) Usage patterns of collaborative tagging systems. J Inf Sci 32(2):198–208

  • Golder SA, Macy MW (2011) Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures. Science 333(6051):1878–1881

  • Hong L, Davison B (2010) Empirical study of topic modeling in twitter. In: Proceedings of the 1st workshop on social media analytics, 2010. ACM, New York, pp 80–88

  • Kranen P, Reidl F, Villaamil FS, Seidl T (2011) Hierarchical clustering for real-time stream data with noise. In: Proceedings of the 23rd international conference on scientific and statistical database management (SSDBM 2011), Portland, Oregon, USA, 2011. Springer, Heidelberg, pp 405–413

  • Kwak H, Lee C, Park H, Moon S (2010) What is twitter, a social network or a news media? In: Proceedings of the 19th international conference on world wide web, 2010. ACM, New York, pp 591–600

  • Lancichinetti A, Fortunato S, Kertész J (2009) Detecting the overlapping and hierarchical community structure in complex networks. N J Phys 11(3):033015

  • Lehmann J, Gonçalves B, Ramasco J, Cattuto C (2012) Dynamical classes of collective attention in twitter. In: Proceedings of the 21st international conference on world wide web, 2012, pp 251–260

  • Leskovec J, Backstrom L, Kleinberg J (2009) Meme-tracking and the dynamics of the news cycle. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, 2009. ACM, New York, pp 497–506

  • Marcus A, Bernstein M, Badar O, Karger D, Madden S, Miller R (2011) Twitinfo: aggregating and visualizing microblogs for event exploration. In: Proceedings of the 2011 annual conference on human factors in computing systems, 2011. ACM, New York, pp 227–236

  • Mei Q, Cai D, Zhang D, Zhai C (2008) Topic modeling with network regularization. In: Proceedings of the 17th international conference on world wide web, 2008. ACM, New York, pp 101–110

  • Meilă M (2007) Comparing clusterings—an information based distance. J Multivar Anal 98(5):873–895

  • Metaxas P, Mustafaraj E (2010) From obscurity to prominence in minutes: political speech and real-time search. In: Proceedings of web science: extending the frontiers of society on-line, 2010

  • Mika P (2007) Ontologies are us: a unified model of social networks and semantics. Web Seman Sci Serv Agents World Wide Web 5(1):5–15

  • Morales A, Losada J, Benito R (2012) Users structure and behavior on an online social network during a political protest. Phys A Stat Mech Appl 391(21):5244–5253

  • Nematzadeh A, Ferrara E, Flammini A, Ahn Y-Y (2014) Optimal network modularity for information diffusion. Phys Rev Lett 113(8):088701

  • Porter M (1980) An algorithm for suffix stripping. Program 14(3):130–137

  • Pramod S, Vyas O (2012) Data stream mining: a review on windowing approach. Glob J Comput Sci Technol Softw Data Eng 12(11):26–30

  • Ratkiewicz J, Conover M, Meiss M, Gonçalves B, Patil S, Flammini A, Menczer F (2011) Truthy: mapping the spread of astroturf in microblog streams. In: Proceedings of the 20th international conference companion on world wide web, 2011. ACM, New York, pp 249–252

  • Sayed-Mouchaweh M, Lughofer E (2012) Learning in non-stationary environments. Springer, New York

  • Sayyadi H, Hurst M, Maykov A (2009) Event detection and tracking in social streams. In: Proceedings of the 3rd international AAAI conference on weblogs and social media, 2009

  • Shalev-Shwartz S (2011) Online learning and online convex optimization. Found Trends Mach Learn 4(2):107–194

  • Simmons M, Adamic LA, Adar E (2011) Memes online: extracted, subtracted, injected, and recollected. In: Proceedings of the 5th international AAAI conference on weblogs and social media, 2011. AAAI, Barcelona

  • Skoric M, Poor N, Liao Y, Tang S (2011) Online organization of an offline protest: from social to traditional media and back. In: Proceedings of the 44th Hawaii international conference on system sciences, 2011

  • Thom D, Bosch H, Koch S, Worner M, Ertl T (2012) Spatiotemporal anomaly detection through visual analysis of geolocated twitter messages. In: IEEE Pacific visualization symposium, pp 41–48

  • Tsur O, Rappoport A (2012) What’s in a hashtag?: content based prediction of the spread of ideas in microblogging communities. In: Proceedings of the fifth ACM international conference on Web search and data mining, 2012. ACM, New York, pp 643–652

  • Varol O, Ferrara E, Ogan CL, Menczer F, Flammini A (2014) Evolution of online user behavior during a social upheaval. In: Proceedings of the 2014 ACM conference on Web science, 2014. ACM, New York, pp 81–90

  • Wu S, Hofman J, Mason W, Watts D (2011) Who says what to whom on twitter. In: Proceedings of the 20th international conference on world wide web, 2011. ACM, New York, pp 705–714

  • Xie L, Natsev A, Kender JR, Hill M, Smith JR (2011) Visual memes in social media: tracking real-world news in youtube videos. In: Proceedings of the 19th ACM international conference on multimedia, 2011. ACM, New York, pp 53–62

  • Yang L, Sun T, Zhang M, Mei Q (2012) We know what@ you# tag: does the dual role affect hashtag adoption? In: Proceedings of the 21st international conference on World Wide Web, 2012. ACM, New York, pp 261–270

  • Yih W, Qazvinian V (2012) Measuring word relatedness using heterogeneous vector space models. In: Proceedings of annual conference of the North American chapter of ACL, 2012

  • Zhong S (2005) Efficient online spherical k-means clustering. In: Proceedings of the 2005 IEEE international joint conference on neural networks, IJCNN’05, vol 5. IEEE, pp 3180–3185

Author information

Corresponding author

Correspondence to Emilio Ferrara.

About this article

Cite this article

JafariAsbagh, M., Ferrara, E., Varol, O. et al. Clustering memes in social media streams. Soc. Netw. Anal. Min. 4, 237 (2014). https://doi.org/10.1007/s13278-014-0237-x

  • DOI: https://doi.org/10.1007/s13278-014-0237-x
