Abstract

The need for data about the acquisition of Czech by non-native learners prompted the compilation of the first learner corpus of Czech. After introducing its basic design and parameters, including a multi-tier manual annotation scheme and error taxonomy, we focus on the more technical aspects: the transcription of handwritten source texts, the annotation process, and options for exploiting the result, together with the tools used for these tasks and the reasoning behind their choice. To support or even replace manual annotation, we assign some error tags automatically and use automatic annotation tools (a tagger and a spell checker).

Notes

  1. See http://utkl.ff.cuni.cz/learncorp/ for links and more details.

  2. See, e.g. Díaz-Negrillo et al. (2010), Meurers (2009), Dickinson and Ragheb (2009), Hirschmann et al. (2007).

  3. Only a few learner corpora use error tags to classify errors, e.g. Fitzpatrick and Seegmiller (2001), Granger (2003), Abuhakema et al. (2009). For an overview see, e.g. Štindlová (2011) or https://www.uclouvain.be/en-cecl-lcworld.html.

  4. Two annotation layers, each with error labels belonging to categories of several types, are also used by Dickinson and Ledbetter (2012) in an annotation scheme for Hungarian. However, their two layers serve a slightly different purpose, namely to distinguish corrections of errors detectable directly in the learner text from adjustments of the text that those corrections make necessary.

  5. http://ufal.mff.cuni.cz/jazz/pml/.

  6. http://www.tei-c.org.

  7. http://purl.org/net/feat/.

  8. http://platform.netbeans.org/.

  9. The spell checker matched human corrections at Tier 1 with an accuracy of 74 %. Only forms unrecognized by a morphological analyzer were considered in the test. See Rosen et al. (2013) for details.

  10. See http://utkl.ff.cuni.cz/learncorp/ for links to all available resources.

  11. http://www.perldancer.org.

  12. http://www.postgresql.org.

  13. Ott and Ziai (2010) report that in texts produced by learners of German the main functor-argument relation types can generally be identified with precision and recall in the range of 80–90 %. This is an encouraging result, but success will necessarily depend on the proficiency level of the learners.

References

  • Abuhakema, G., Feldman, A., & Fitzpatrick, E. (2009). ARIDA: An Arabic interlanguage database and its applications: A pilot study. Journal of the National Council of Less Commonly Taught Languages (NCOLCTL), 7, 161–184.

  • Díaz-Negrillo, A., Meurers, D., Valera, S., & Wunsch, H. (2010). Towards interlanguage POS annotation for effective learner corpora in SLA and FLT. Language Forum, 36(1–2), 139–154. Special Issue on Corpus Linguistics for Teaching and Learning. In Honour of John Sinclair. http://purl.org/dm/papers/diaz-negrillo-et-al-09.html.

  • Dickinson, M., & Ragheb, M. (2009). Dependency annotation for learner corpora. In Proceedings of the Eighth international workshop on treebanks and linguistic theories (TLT-8), Milan, Italy. http://cl.indiana.edu/~md7/papers/dickinson-ragheb09.pdf.

  • Dickinson, M., & Ledbetter, S. (2012). Annotating errors in a Hungarian learner corpus. In Proceedings of the Eighth international conference on language resources and evaluation (LREC 2012), Istanbul, Turkey. http://www.lrec-conf.org/proceedings/lrec2012/pdf/758_Paper.pdf.

  • Fitzpatrick, E., & Seegmiller, S. (2001). The Montclair electronic language learner database. In Proceedings of the international conference on computing and information technologies (ICCIT). http://www.montclair.edu/media/montclairedu/chss/departments/linguistics/iccitmeld.pdf.

  • Granger, S. (1999). Use of tenses by advanced EFL learners: Evidence from error-tagged computer corpus. In H. Hasselgård, & S. Oksefjell (Eds.), Out of Corpora: Studies in honour of Stig Johansson, Atlanta, Amsterdam. http://hdl.handle.net/2078.1/76322.

  • Granger, S. (2003). Error-tagged learner corpora and CALL: A promising synergy. CALICO Journal, 20(3), 465–480.

  • Hana, J., Rosen, A., Škodová, S., & Štindlová, B. (2010). Error-tagged learner corpus of Czech. In Proceedings of the fourth linguistic annotation workshop, Uppsala, Sweden. http://utkl.ff.cuni.cz/~rosen/public/hanaetal_law2010.pdf.

  • Hirschmann, H., Doolittle, S., & Lüdeling, A. (2007). Syntactic annotation of non-canonical linguistic structures. In Proceedings of corpus linguistics 2007, Birmingham. http://ucrel.lancs.ac.uk/publications/CL2007/paper/128_Paper.

  • Jelínek, T., Štindlová, B., Rosen, A., & Hana, J. (2012). Combining manual and automatic annotation of a learner corpus. In P. Sojka, A. Horák, I. Kopeček, & K. Pala (Eds.), Text, speech and dialogue: Proceedings of the 15th international conference TSD 2012, no. 7499 in Lecture notes in computer science (pp. 127–134). Springer.

  • Leńko-Szymańska, A. (2004). Demonstratives as anaphora markers in advanced learners’ English. In G. Aston, S. Bernardini, & D. Stewart (Eds.), Corpora and language learners (pp. 89–107). Amsterdam: John Benjamins.

  • Meurers, D. (2009). On the automatic analysis of learner language: Introduction to the special issue. CALICO Journal, 26(3), 469–473. http://purl.org/dm/papers/meurers-09.html.

  • Ott, N., & Ziai, R. (2010). Evaluating dependency parsing performance on German learner language. In Proceedings of the ninth international workshop on treebanks and linguistic theories (TLT9), NEALT proceeding series. http://drni.de/zap/ott-ziai-10.

  • Richter, M. (2010). Pokročilý korektor češtiny (An advanced spell checker of Czech). Master’s thesis, Faculty of Mathematics and Physics, Charles University, Prague.

  • Rosen, A., Hana, J., Štindlová, B., & Feldman, A. (2013). Evaluating and automating the annotation of a learner corpus. In Language resources and evaluation: Special issue on resources and tools for language learners, pp. 1–28. http://dx.doi.org/10.1007/s10579-013-9226-3.

  • Schmidt, T. (2009). Creating and working with spoken language corpora in EXMARaLDA. In V. Lyding (Ed.), LULCL II: Lesser used languages and computer linguistics II (pp. 151–164). http://www.eurac.edu/Org/LanguageLaw/Multilingualism/Projects/LULCL_II_proceedings.htm.

  • Schmidt, T., Wörner, K., Hedeland, H., & Lehmberg, T. (2011). New and future developments in EXMARaLDA. In Multilingual resources and multilingual applications. Proceedings of the GSCL conference 2011, Hamburg. http://www.exmaralda.org/files/Exmaralda_GSCL2011.

  • Spoustová, D., Hajič, J., Votrubec, J., Krbec, P., & Květoň, P. (2007). The best of two worlds: Cooperation of statistical and rule-based taggers for Czech. In Proceedings of the workshop on Balto-Slavonic natural language processing 2007 (pp. 67–74). Prague, Czechia: Association for Computational Linguistics.

  • Šebesta, K. (2010). Korpusy češtiny a osvojování jazyka (Corpora of Czech and language acquisition). Studie z aplikované lingvistiky/Studies in Applied Linguistics, 1, 11–34.

  • Štindlová, B. (2011). Evaluace chybové anotace v žákovském korpusu češtiny (Evaluation of error mark-up in a learner corpus of Czech). PhD thesis, Charles University, Faculty of Arts, Prague.

  • Štindlová, B., Škodová, S., Hana, J., & Rosen, A. (2013). A learner corpus of Czech: Current state and future directions. In S. Granger, G. Gilquin, & F. Meunier (Eds.), Twenty years of learner corpus research: Looking back, moving ahead. Presses Universitaires de Louvain, Louvain-la-Neuve, Corpora and Language in use: Proceedings 1.

  • Waibel, B. (2008). Phrasal verbs: German and Italian learners of English compared. Saarbrücken: VDM.

Acknowledgments

The corpus was one of the tasks of the project Innovation of Education in the Field of Czech as a Second Language (Project no. CZ.1.07/2.2.00/07.0259), a part of the operational programme Education for Competitiveness, funded by the European Structural Funds (ESF) and the Czech government. The development of the tools and the data format was partially funded by Grants no. P406/10/P328 and P406/2010/0875 of the Grant Agency of the Czech Republic. This work is also partially supported within the programme Large Research, Development and Innovation Infrastructures of the Czech Ministry of Education, Youth and Sports, the project ‘The Czech National Corpus’, no. LM2011023.

Author information

Corresponding author

Correspondence to Alexandr Rosen.

About this article

Cite this article

Hana, J., Rosen, A., Štindlová, B. et al. Building a learner corpus. Lang Resources & Evaluation 48, 741–752 (2014). https://doi.org/10.1007/s10579-014-9278-z

  • DOI: https://doi.org/10.1007/s10579-014-9278-z
