The Szeged Corpus: A POS Tagged and Syntactically Annotated Hungarian Natural Language Corpus

Csendes, Dóra; Csirik, János; Gyimóthy, Tibor

doi:10.1007/978-3-540-30120-2_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3206)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

The Szeged Corpus is a manually annotated natural language corpus comprising 1.2 million word entries plus 225 thousand punctuation marks. At this size, it is the largest manually processed Hungarian textual database, serving both as reference material for further research in natural language processing (NLP) and as a training database for machine learning algorithms and other software applications. Processing of the corpus texts has so far included morpho-syntactic analysis, POS tagging and shallow syntactic parsing. Semantic information was also added to a pre-selected section of the corpus to support automated information extraction (IE).

Author information

Authors and Affiliations

Department of Informatics, University of Szeged, H-6720, Szeged, Árpád tér 2., Hungary
Dóra Csendes, János Csirik & Tibor Gyimóthy

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Masaryk University, Botanická 68a, CZ-602 00, Brno, Czech Republic
Ivan Kopeček
Faculty of Informatics, Department of Computer Graphics and Design, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Karel Pala

Cite this paper

Csendes, D., Csirik, J., Gyimóthy, T. (2004). The Szeged Corpus: A POS Tagged and Syntactically Annotated Hungarian Natural Language Corpus. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2004. Lecture Notes in Computer Science, vol 3206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30120-2_6

DOI: https://doi.org/10.1007/978-3-540-30120-2_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23049-6
Online ISBN: 978-3-540-30120-2
eBook Packages: Springer Book Archive
