The Szeged Corpus: A POS Tagged and Syntactically Annotated Hungarian Natural Language Corpus

  • Dóra Csendes
  • János Csirik
  • Tibor Gyimóthy
Conference paper

DOI: 10.1007/978-3-540-30120-2_6

Part of the Lecture Notes in Computer Science book series (LNCS, volume 3206)
Cite this paper as:
Csendes D., Csirik J., Gyimóthy T. (2004) The Szeged Corpus: A POS Tagged and Syntactically Annotated Hungarian Natural Language Corpus. In: Sojka P., Kopeček I., Pala K. (eds) Text, Speech and Dialogue. TSD 2004. Lecture Notes in Computer Science, vol 3206. Springer, Berlin, Heidelberg

Abstract

The Szeged Corpus is a manually annotated natural language corpus comprising 1.2 million word entries plus 225 thousand punctuation marks. This makes it the largest manually processed Hungarian textual database; it serves as reference material for further research in natural language processing (NLP) and as a training database for machine learning algorithms and other software applications. Processing of the corpus texts has so far included morpho-syntactic analysis, POS tagging, and shallow syntactic parsing. Semantic information was also added to a pre-selected section of the corpus to support automated information extraction (IE).

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Dóra Csendes (1)
  • János Csirik (1)
  • Tibor Gyimóthy (1)
  1. Department of Informatics, University of Szeged, Szeged, Hungary
