Internet Corpora: A Challenge for Linguistic Processing

Abstract

Natural language processing tools are mostly developed for and optimized on newspaper texts, and often show a substantial performance drop when applied to other types of texts such as Twitter feeds, chat data or Internet forum posts. We explore a range of easy-to-implement methods of adapting existing part-of-speech taggers to improve their performance on Internet texts. Our results show that these methods can improve tagger performance substantially.

Notes

  1. “Analyse und Instrumentarien zur Beobachtung des Schreibgebrauchs im Deutschen”, http://www.schreibgebrauch.de

  2. http://web-harvest.sourceforge.net/

  3. Text Encoding Initiative, http://www.tei-c.org/

  4. The writer may also have a Swiss-German background, where “heiss” is the correct spelling.

Acknowledgement

This work is part of the BMBF-funded project “Analyse und Instrumentarien zur Beobachtung des Schreibgebrauchs im Deutschen.” We thank our student assistants Jana Ott, Ali Abbas, Jakob Prange and Maximilian Wolf for their support in the annotation and evaluation of our data sets.

Author information

Corresponding author

Correspondence to Andrea Horbach.

About this article

Cite this article

Horbach, A., Thater, S., Steffen, D. et al. Internet Corpora: A Challenge for Linguistic Processing. Datenbank Spektrum 15, 41–47 (2015). https://doi.org/10.1007/s13222-014-0172-z

Keywords

  • Natural language processing
  • Part-of-speech tagging
  • Computer-mediated communication