The Szeged Corpus: A POS Tagged and Syntactically Annotated Hungarian Natural Language Corpus

  • Conference paper
Text, Speech and Dialogue (TSD 2004)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3206)

Abstract

The Szeged Corpus is a manually annotated natural language corpus comprising 1.2 million word entries plus 225 thousand punctuation marks. This makes it the largest manually processed Hungarian textual database, serving both as reference material for further research in natural language processing (NLP) and as a training database for machine learning algorithms and other software applications. Processing of the corpus texts has so far included morpho-syntactic analysis, POS tagging and shallow syntactic parsing. Semantic information was also added to a pre-selected section of the corpus to support automated information extraction (IE).

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Csendes, D., Csirik, J., Gyimóthy, T. (2004). The Szeged Corpus: A POS Tagged and Syntactically Annotated Hungarian Natural Language Corpus. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2004. Lecture Notes in Computer Science, vol 3206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30120-2_6

  • DOI: https://doi.org/10.1007/978-3-540-30120-2_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23049-6

  • Online ISBN: 978-3-540-30120-2

  • eBook Packages: Springer Book Archive
