TSD 2013: Text, Speech, and Dialogue, pp. 575–582
Using Low-Cost Annotation to Train a Reliable Czech Shallow Parser
Abstract
A bushbank is a relatively new concept: a type of annotated corpus in which annotation is driven by automatic tools and the task of human annotators is limited to accepting or rejecting parts of the tools' output. This makes it possible to obtain annotated corpora of considerable size at relatively low cost.
In this paper we ask whether the Czech Bushbank is reliable enough to be used for an NLP task in place of a traditional corpus with high annotation rigour. We evaluate three different parsers using its shallow syntactic annotation, including a CRF chunker originally made for Polish. The results are very promising, suggesting that many practical applications could benefit from low-cost annotation.
Keywords
corpus annotation, shallow parsing, Czech