Text Corpus with Errors

  • Karel Pala
  • Pavel Rychlý
  • Pavel Smrž
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2807)


This paper describes a Czech text corpus (Chyby) containing various kinds of errors: spelling, typographical, grammatical, stylistic, and lexical. We explain how Chyby was built and how the errors in it were discovered, marked, and annotated. We present the classification of the errors and give statistics on the error types. The tools for annotating the errors are also described. To the best of our knowledge, this is the first text corpus of this sort prepared for Czech.


Keywords (added by machine, not by the authors): Word Form; Annotation Scheme; Text Corpus; Subordinate Clause; Annotate Corpus





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Karel Pala
  • Pavel Rychlý
  • Pavel Smrž

All authors: Faculty of Informatics, Masaryk University, Brno, Czech Republic
