Scalable Twitter user clustering approach boosted by Personalized PageRank

  • Anup Naik
  • Hideyuki Maeda
  • Vibhor Kanojia
  • Sumio Fujita
Regular Paper


Twitter has been analyzed in connection with many interesting and challenging problems, one of which is clustering its users by interest. Many graph clustering approaches look at either the structure or the contents of the graph. However, on real-world complex data such as Twitter data, structural approaches may produce many separate user clusters that share similar interests. Content-based clustering approaches also produce inferior results on Twitter data because tweets have a limited number of characters and contain a lot of garbled data. Hence, these clustering approaches cannot be used directly on Twitter data in practical applications. In the study reported in this paper, we clustered Twitter users on the basis of their interests, looking at both the structure of the graph generated from Twitter data and the contents of the tweets. In short, we (1) clustered Twitter users with an unsupervised structural approach, (2) merged similar clusters with a content-based approach, (3) expanded the graph and ranked users with Personalized PageRank, and (4) determined the topic to which each cluster belongs from hashtag frequency. Combining these approaches gave better results than existing techniques and is suitable for practical applications.
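The ranking step mentioned in the abstract can be illustrated with a minimal sketch of Personalized PageRank computed by power iteration. This is not the authors' implementation: the function name `personalized_pagerank`, the edge direction (follower to followee), the damping factor `alpha=0.85`, the fixed iteration count, and the handling of users with no outgoing edges are all assumptions made for illustration. The teleport distribution is concentrated on a seed set, standing in for the members of one cluster.

```python
from collections import defaultdict


def personalized_pagerank(edges, seeds, alpha=0.85, iterations=50):
    """Rank nodes relative to a seed set (hypothetical helper, not from the paper).

    edges: iterable of (source, target) pairs, e.g. (follower, followee).
    seeds: non-empty set of nodes that receive all teleport probability.
    """
    out_links = defaultdict(list)
    nodes = set(seeds)
    for src, dst in edges:
        out_links[src].append(dst)
        nodes.update((src, dst))

    # Teleport vector: uniform over the seeds, zero elsewhere.
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)

    for _ in range(iterations):
        # Every node starts with its share of the teleport mass.
        nxt = {n: (1.0 - alpha) * teleport[n] for n in nodes}
        for n in nodes:
            targets = out_links.get(n)
            if targets:
                # Split the damped score evenly over outgoing edges.
                share = alpha * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Assumption: a node without outgoing edges returns
                # its damped score to the seed set.
                for s in seeds:
                    nxt[s] += alpha * rank[n] / len(seeds)
        rank = nxt
    return rank


if __name__ == "__main__":
    toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
    scores = personalized_pagerank(toy_edges, seeds={"a"})
    for user, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(user, round(score, 3))
```

Because all teleport mass returns to the seeds, users that cannot be reached from the seed set score zero, which is what makes the ranking specific to one cluster rather than global.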


Twitter, Social graph, Clustering, PageRank, Personalized PageRank, Crowdsourcing, Unsupervised learning, Community detection, Discounted cumulative gain



We thank the real-time search team at Yahoo! JAPAN for all their support in carrying out this work. We thank all the people involved in evaluation of the results, without which this work would have been incomplete.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2017

Authors and Affiliations

  1. Yahoo Japan Corporation, Tokyo, Japan
