Datenbank-Spektrum, Volume 15, Issue 1, pp 41–47

Internet Corpora: A Challenge for Linguistic Processing

  • Andrea Horbach
  • Stefan Thater
  • Diana Steffen
  • Peter M. Fischer
  • Andreas Witt
  • Manfred Pinkal
Focus article (Schwerpunktbeitrag)

Abstract

Natural language processing tools are mostly developed for and optimized on newspaper texts, and often show a substantial performance drop when applied to other types of texts such as Twitter feeds, chat data or Internet forum posts. We explore a range of easy-to-implement methods of adapting existing part-of-speech taggers to improve their performance on Internet texts. Our results show that these methods can improve tagger performance substantially.

Keywords

  • Natural language processing
  • Part-of-speech tagging
  • Computer-mediated communication

Notes

Acknowledgement

This work is part of the BMBF-funded project “Analyse und Instrumentarien zur Beobachtung des Schreibgebrauchs im Deutschen.” We thank our student assistants Jana Ott, Ali Abbas, Jakob Prange and Maximilian Wolf for their support in the annotation and evaluation of our data sets.

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Andrea Horbach (1)
  • Stefan Thater (1)
  • Diana Steffen (1)
  • Peter M. Fischer (2)
  • Andreas Witt (2)
  • Manfred Pinkal (1)

  1. Universität des Saarlandes, Saarbrücken, Germany
  2. IDS, Mannheim, Germany
