Wiki-MID: A Very Large Multi-domain Interests Dataset of Twitter Users with Mappings to Wikipedia

  • Giorgia Di Tommaso
  • Stefano Faralli
  • Giovanni Stilo
  • Paola Velardi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11137)


This paper presents Wiki-MID, a LOD-compliant (Linked Open Data) multi-domain interests dataset for training and testing recommender systems, together with the methodology used to create it from Twitter messages in English and Italian. The English dataset includes an average of 90 multi-domain preferences per user, covering music, books, movies, celebrities, sport, politics and more, for about half a million users traced over six months in 2017. Preferences are either extracted from the messages of users who use Spotify, Goodreads and similar content-sharing platforms, or induced from their “topical” friends, i.e., followees that represent an interest rather than a social relation between peers. In addition, preferred items are matched with the Wikipedia articles that describe them. This unique feature of the dataset provides a means to categorize preferred items by exploiting semantic resources linked to Wikipedia, such as the Wikipedia Category Graph, DBpedia, BabelNet and others.


Keywords: Semantic recommenders, Twitter, Wikipedia, Users’ interest



This work has been supported by the IBM Faculty Award #2305895190 and by the MIUR under grant “Dipartimenti di eccellenza 2018–2022” of the Department of Computer Science of Sapienza University.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Giorgia Di Tommaso (1)
  • Stefano Faralli (2)
  • Giovanni Stilo (1), corresponding author
  • Paola Velardi (1)

  1. Department of Computer Science, University La Sapienza of Rome, Rome, Italy
  2. Unitelma-Sapienza, Rome, Italy
