TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets

  • Pavlos Fafalios
  • Vasileios Iosifidis
  • Eirini Ntoutsi
  • Stefan Dietze
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10843)


Publicly available social media archives facilitate research in a variety of fields, such as data science, sociology and the digital humanities, where Twitter has emerged as one of the most prominent sources. However, obtaining, archiving and annotating large numbers of tweets is costly. In this paper, we describe TweetsKB, a publicly available corpus of currently more than 1.5 billion tweets, spanning almost 5 years (Jan’13–Nov’17). Tweet metadata as well as extracted entities, hashtags, user mentions and sentiment information are exposed using established RDF/S vocabularies. In addition to describing the extraction and annotation process, we present use cases that illustrate scenarios for entity-centric information exploration, data integration and knowledge discovery facilitated by TweetsKB.
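As a rough illustration of what an entity-centric view over RDF-annotated tweets might look like, the following minimal sketch stores a few made-up annotations as subject-predicate-object triples and looks up the tweets and hashtags associated with an entity. All identifiers, the `ex:` property names and the sample values are hypothetical placeholders; they do not reproduce the vocabularies or data of the corpus itself.

```python
from collections import Counter

# Toy triples (subject, predicate, object). Every name and value here is
# a hypothetical placeholder, not the schema or content of the corpus.
triples = [
    ("ex:tweet/1", "ex:createdAt", "2016-03-01"),
    ("ex:tweet/1", "ex:mentionsEntity", "dbr:Berlin"),
    ("ex:tweet/1", "ex:hasHashtag", "travel"),
    ("ex:tweet/1", "ex:positiveSentiment", 0.75),
    ("ex:tweet/2", "ex:createdAt", "2016-03-02"),
    ("ex:tweet/2", "ex:mentionsEntity", "dbr:Paris"),
    ("ex:tweet/2", "ex:hasHashtag", "travel"),
    ("ex:tweet/3", "ex:createdAt", "2016-03-05"),
    ("ex:tweet/3", "ex:mentionsEntity", "dbr:Berlin"),
    ("ex:tweet/3", "ex:hasHashtag", "travel"),
    ("ex:tweet/3", "ex:hasHashtag", "food"),
]


def tweets_mentioning(triples, entity):
    """Return the sorted identifiers of tweets annotated with the entity."""
    return sorted(
        {s for s, p, o in triples if p == "ex:mentionsEntity" and o == entity}
    )


def hashtags_for_entity(triples, entity):
    """Count hashtags that occur in tweets annotated with the entity."""
    tweets = set(tweets_mentioning(triples, entity))
    return Counter(
        o for s, p, o in triples if p == "ex:hasHashtag" and s in tweets
    )


berlin_tweets = tweets_mentioning(triples, "dbr:Berlin")
berlin_hashtags = hashtags_for_entity(triples, "dbr:Berlin")
```

In practice such lookups would typically be written as SPARQL queries against a triple store; the in-memory list is used here only to keep the sketch self-contained.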


Twitter, RDF, Entity linking, Sentiment analysis, Social media archives



The work was partially funded by the European Commission for the ERC Advanced Grant ALEXANDRIA under grant No. 339233 and the H2020 Grant No. 687916 (AFEL project), and by the German Research Foundation (DFG) project OSCAR (Opinion Stream Classification with Ensembles and Active leaRners).



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. L3S Research Center, University of Hannover, Hannover, Germany
