
TWORPUS – An Easy-to-Use Tool for the Creation of Tailored Twitter Corpora

  • Conference paper
Language Processing and Knowledge in the Web

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8105)


In this paper we present Tworpus, an easy-to-use tool for the creation of tailored Twitter corpora. Tworpus allows scholars to create corpora without having to know about the Twitter Application Programming Interface (API) and related technical aspects. At the same time, our tool complies with Twitter’s “rules of the road” on how to use tweet data. Corpora of various sizes can be compiled for specific scenarios, as the Tworpus interface provides controls for filtering and gathering customized collections of tweets, which may serve as the basis for subsequent analyses.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bazo, A., Burghardt, M., Wolff, C. (2013). TWORPUS – An Easy-to-Use Tool for the Creation of Tailored Twitter Corpora. In: Gurevych, I., Biemann, C., Zesch, T. (eds) Language Processing and Knowledge in the Web. Lecture Notes in Computer Science, vol 8105. Springer, Berlin, Heidelberg.

  • DOI:

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40721-5

  • Online ISBN: 978-3-642-40722-2

  • eBook Packages: Computer Science (R0)
