Combining Confidence Score and Mal-rule Filters for Automatic Creation of Bangla Error Corpus: Grammar Checker Perspective

  • Bibekananda Kundu
  • Sutanu Chakraborti
  • Sanjay Kumar Choudhury
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7182)


This paper describes a novel approach to automatically creating a Bangla error corpus for training and evaluating grammar checker systems. The procedure begins by automatically generating a large number of erroneous sentences from a set of grammatically correct sentences. A statistical Confidence Score Filter selects suitable samples from the generated sentences, such that sentences with less probable word sequences receive lower confidence scores and vice versa. A rule-based Mal-rule Filter, used with an HMM-based semi-supervised POS tagger, collects the sentences that have improper tag sequences. Combining the two filters makes the approach robust, so that no valid construction is selected into the synthetically generated error corpus. Although the present work focuses on the most frequent grammatical errors in written Bangla, a detailed taxonomy of grammatical errors in Bangla is also presented, with the aim of increasing the coverage of the error corpus in future. The proposed approach is language independent and could easily be applied to create similar corpora in other languages.
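The two-filter idea in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the bigram model with add-one smoothing, the length-normalised log-probability used as the confidence score, the threshold, the toy English data, the function names, and the choice to require both filters to agree. POS tags are passed in directly; the HMM-based tagger from the paper is not reproduced.

```python
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count unigrams and bigrams over whitespace-tokenised correct sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def confidence_score(sentence, unigrams, bigrams):
    """Average bigram log-probability with add-one smoothing (assumed scoring scheme)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    vocab_size = len(unigrams)
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        log_prob += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return log_prob / (len(tokens) - 1)


def violates_mal_rule(tags, mal_rules):
    """Return True if any forbidden tag n-gram occurs contiguously in the tag sequence."""
    for rule in mal_rules:
        n = len(rule)
        for i in range(len(tags) - n + 1):
            if tuple(tags[i:i + n]) == tuple(rule):
                return True
    return False


def select_errors(candidates, unigrams, bigrams, mal_rules, threshold):
    """Keep candidates that score below the threshold AND match a mal-rule.

    candidates is a list of (sentence, pos_tags) pairs. Requiring both
    filters to agree is an assumption about how they are combined.
    """
    selected = []
    for sentence, tags in candidates:
        low_confidence = confidence_score(sentence, unigrams, bigrams) < threshold
        if low_confidence and violates_mal_rule(tags, mal_rules):
            selected.append(sentence)
    return selected
```

A candidate such as "sleeps cat the", tagged VERB NOUN DET, would be kept under a hypothetical mal-rule ("NOUN", "DET") when its score falls below the chosen threshold, while the well-formed "the cat sleeps" would be rejected by both filters.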


Automatic Error Corpora Creation, Confidence Score, Mal-rule, Grammar Checking





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Bibekananda Kundu (1, 2)
  • Sutanu Chakraborti (2)
  • Sanjay Kumar Choudhury (1)
  1. Language Technology, Centre for Development of Advanced Computing, Kolkata, India
  2. Department of Computer Science and Engineering, Indian Institute of Technology, Chennai, India
