Combining Manual and Automatic Annotation of a Learner Corpus

  • Tomáš Jelínek
  • Barbora Štindlová
  • Alexandr Rosen
  • Jirka Hana
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7499)

Abstract

We present an approach to building a learner corpus of Czech, manually corrected and annotated with error tags using a complex grammar-based taxonomy of errors in spelling, morphology, morphosyntax, lexicon and style. This grammar-based annotation is supplemented by a formal classification of errors based on surface alternations. To supply additional information about non-standard or ill-formed expressions, we aim at a synergy of manual and automatic annotation, deriving information from the original input and from the manual annotation.

Keywords

learner corpora, error annotation, Czech, morphology, syntax

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Tomáš Jelínek (1)
  • Barbora Štindlová (2)
  • Alexandr Rosen (1)
  • Jirka Hana (3)

  1. Faculty of Arts, Charles University in Prague, Czech Republic
  2. Faculty of Education, Technical University of Liberec, Czech Republic
  3. Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic
