Abstract
In this paper we present Tworpus, an easy-to-use tool for the creation of tailored Twitter corpora. Tworpus allows scholars to create corpora without having to know about the Twitter Application Programming Interface (API) and related technical aspects. At the same time, our tool complies with Twitter’s “rules of the road” on how to use tweet data. Corpora may be composed in various sizes and for specific scenarios, as the Tworpus interface provides controls for filtering and gathering customized collections of tweets, which may serve as the basis for subsequent analyses.
Keywords
- Twitter API
- web corpora
- social media corpora
- corpus tool
- corpus creation
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bazo, A., Burghardt, M., Wolff, C. (2013). TWORPUS – An Easy-to-Use Tool for the Creation of Tailored Twitter Corpora. In: Gurevych, I., Biemann, C., Zesch, T. (eds) Language Processing and Knowledge in the Web. Lecture Notes in Computer Science, vol 8105. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40722-2_3
DOI: https://doi.org/10.1007/978-3-642-40722-2_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40721-5
Online ISBN: 978-3-642-40722-2
eBook Packages: Computer Science (R0)