The WikEd Error Corpus: A Corpus of Corrective Wikipedia Edits and Its Application to Grammatical Error Correction

  • Roman Grundkiewicz
  • Marcin Junczys-Dowmunt
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8686)

Abstract

This paper introduces the freely available WikEd Error Corpus. We describe how the corpus was mined from Wikipedia revision histories, along with its content and format. The corpus consists of more than 12 million sentences with a total of 14 million edits of various types. As one possible application, we show that WikEd can be successfully adapted to improve a strong baseline in the task of grammatical error correction for the writing of English-as-a-Second-Language (ESL) learners by 2.63%. Used together with an ESL error corpus, the combined system gains 1.64% over the ESL-trained system.

Keywords

error corpus, Wikipedia revision histories, grammatical error correction

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Roman Grundkiewicz (1)
  • Marcin Junczys-Dowmunt (1)
  1. Faculty of Mathematics and Computer Science, Adam Mickiewicz University in Poznań, Poznań, Poland
