Datenbank-Spektrum, Volume 15, Issue 1, pp 41–47

Internet Corpora: A Challenge for Linguistic Processing

  • Andrea Horbach
  • Stefan Thater
  • Diana Steffen
  • Peter M. Fischer
  • Andreas Witt
  • Manfred Pinkal
Focus Article

Abstract

Natural language processing tools are mostly developed for and optimized on newspaper texts, and often show a substantial performance drop when applied to other types of texts such as Twitter feeds, chat data or Internet forum posts. We explore a range of easy-to-implement methods of adapting existing part-of-speech taggers to improve their performance on Internet texts. Our results show that these methods can improve tagger performance substantially.

Keywords

Natural language processing, Part-of-speech tagging, Computer-mediated communication

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Andrea Horbach (1)
  • Stefan Thater (1)
  • Diana Steffen (1)
  • Peter M. Fischer (2)
  • Andreas Witt (2)
  • Manfred Pinkal (1)

  1. Universität des Saarlandes, Saarbrücken, Germany
  2. IDS, Mannheim, Germany