Abstract
The Szeged Corpus is a manually annotated natural language corpus comprising 1.2 million word entries plus 225 thousand punctuation marks. At this size, it is the largest manually processed Hungarian textual database, serving both as reference material for further research in natural language processing (NLP) and as a learning database for machine learning algorithms and other software applications. Language processing of the corpus texts has so far included morpho-syntactic analysis, POS tagging and shallow syntactic parsing. Semantic information was also added to a pre-selected section of the corpus to support automated information extraction (IE).
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Csendes, D., Csirik, J., Gyimóthy, T. (2004). The Szeged Corpus: A POS Tagged and Syntactically Annotated Hungarian Natural Language Corpus. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2004. Lecture Notes in Computer Science, vol 3206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30120-2_6
DOI: https://doi.org/10.1007/978-3-540-30120-2_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23049-6
Online ISBN: 978-3-540-30120-2
eBook Packages: Springer Book Archive