TWORPUS – An Easy-to-Use Tool for the Creation of Tailored Twitter Corpora

  • Alexander Bazo
  • Manuel Burghardt
  • Christian Wolff
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8105)


In this paper, we present Tworpus, an easy-to-use tool for the creation of tailored Twitter corpora. Tworpus allows scholars to create corpora without having to know about the Twitter Application Programming Interface (API) and related technical aspects. At the same time, our tool complies with Twitter’s “rules of the road” on how to use tweet data. Corpora may be composed in various sizes and for specific scenarios, as the Tworpus interface provides controls for filtering and gathering customized collections of tweets, which may serve as the basis for subsequent analyses.


Keywords: Twitter API, web corpora, social media corpora, corpus tool, corpus creation





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Alexander Bazo (1)
  • Manuel Burghardt (1)
  • Christian Wolff (1)
  1. Media Informatics Group, University of Regensburg, Regensburg, Germany
