Combining Manual and Automatic Annotation of a Learner Corpus
Conference paper
Abstract
We present an approach to building a learner corpus of Czech, manually corrected and annotated with error tags using a complex grammar-based taxonomy of errors in spelling, morphology, morphosyntax, lexicon and style. This grammar-based annotation is supplemented by a formal classification of errors based on surface alternations. To supply additional information about non-standard or ill-formed expressions, we aim at a synergy of manual and automatic annotation, deriving information from the original input and from the manual annotation.
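As a rough illustration of what a formal classification based on surface alternations might look like, the sketch below compares a learner's original word form with its manually corrected counterpart and assigns a coarse label. The label names (`capitalisation`, `diacritics`, `missing_char`, and so on) and the function names are hypothetical choices made for this example and are not the tag set used in the paper; the ordering of the checks is likewise an assumption.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, e.g. 'včera' -> 'vcera'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def classify_alternation(original, corrected):
    """Return a coarse, hypothetical label for the surface difference
    between a learner's form and its manual correction."""
    if original == corrected:
        return "identical"
    if original.lower() == corrected.lower():
        return "capitalisation"
    if strip_diacritics(original) == strip_diacritics(corrected):
        return "diacritics"

    # One character of the correction is absent from the learner's form.
    if len(original) + 1 == len(corrected):
        for i in range(len(corrected)):
            if corrected[:i] + corrected[i + 1:] == original:
                return "missing_char"

    # The learner's form contains one extra character.
    if len(original) == len(corrected) + 1:
        for i in range(len(original)):
            if original[:i] + original[i + 1:] == corrected:
                return "redundant_char"

    if len(original) == len(corrected):
        diffs = [i for i in range(len(original)) if original[i] != corrected[i]]
        if len(diffs) == 1:
            return "substituted_char"
        if (
            len(diffs) == 2
            and diffs[1] == diffs[0] + 1
            and original[diffs[0]] == corrected[diffs[1]]
            and original[diffs[1]] == corrected[diffs[0]]
        ):
            return "transposed_chars"

    # Anything more complex is left for manual, grammar-based annotation.
    return "other"


if __name__ == "__main__":
    pairs = [("vcera", "včera"), ("knha", "kniha"), ("kinha", "kniha")]
    for learner_form, correction in pairs:
        print(learner_form, correction, classify_alternation(learner_form, correction))
```

Labels of this kind can be derived automatically once the corrected form is available, which is one way the manual correction and the original input could jointly feed an automatic annotation step.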
Keywords
learner corpora, error annotation, Czech, morphology, syntax
Copyright information
© Springer-Verlag Berlin Heidelberg 2012