Skip to main content

The WikEd Error Corpus: A Corpus of Corrective Wikipedia Edits and Its Application to Grammatical Error Correction

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNAI,volume 8686)

Abstract

This paper introduces the freely available WikEd Error Corpus. We describe the data mining process from Wikipedia revision histories, corpus content and format. The corpus consists of more than 12 million sentences with a total of 14 million edits of various types. As one possible application, we show that WikEd can be successfully adapted to improve a strong baseline in a task of grammatical error correction for English-as-a-Second-Language (ESL) learners’ writings by 2.63%. Used together with an ESL error corpus, a composed system gains 1.64% when compared to the ESL-trained system.

Keywords

  • error corpus
  • Wikipedia revision histories
  • grammatical error correction

This is a preview of subscription content, access via your institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • DOI: 10.1007/978-3-319-10888-9_47
  • Chapter length: 13 pages
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
eBook
USD   69.99
Price excludes VAT (USA)
  • ISBN: 978-3-319-10888-9
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
Softcover Book
USD   89.99
Price excludes VAT (USA)

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Brockett, C., Dolan, W.B., Gamon, M.: Correcting ESL Errors Using Phrasal SMT Techniques. In: Proceedings of ACL, pp. 249–256. ACL (2006)

    Google Scholar 

  2. Buck, C., Heafield, K., van Ooyen, B.: N-gram Counts and Language Models from the Common Crawl. In: Proceedings of LREC (2014)

    Google Scholar 

  3. Cahill, A., Madnani, N., Tetreault, J.R., Napolitano, D.: Robust Systems for Preposition Error Correction Using Wikipedia Revisions. In: Proceedings of NAACL: HLT, pp. 507–517. ACL (2013)

    Google Scholar 

  4. Dahlmeier, D., Ng, H.T.: Better Evaluation for Grammatical Error Correction. In: Proceedings of NAACL: HLT, pp. 568–572. ACL (2012)

    Google Scholar 

  5. Dahlmeier, D., Ng, H.T., Wu, S.M.: Building a Large Annotated Corpus of Learner English: The NUS Corpus of Learner English. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 22–31. ACL (2013)

    Google Scholar 

  6. Foster, J., Andersen, O.E.: GenERRate: Generating Errors for Use in Grammatical Error Detection. In: Proceedings of the Fourth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 82–90. ACL (2009)

    Google Scholar 

  7. Grundkiewicz, R.: Automatic Extraction of Polish Language Errors from Text Edition History. In: Habernal, I., Matoušek, V. (eds.) TSD 2013. LNCS (LNAI), vol. 8082, pp. 129–136. Springer, Heidelberg (2013)

    Google Scholar 

  8. Junczys-Dowmunt, M., Grundkiewicz, R.: The AMU System in the CoNLL-2014 Shared Task: Grammatical Error Correction by Data-Intensive and Feature-Rich Statistical Machine Translation. In: Proceedings of CoNLL: Shared Task. ACL (2014)

    Google Scholar 

  9. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: Open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL 2007, pp. 177–180. ACL (2007)

    Google Scholar 

  10. Leacock, C., Chodorow, M., Gamon, M., Tetreault, J.: Automated Grammatical Error Detection for Language Learners. Morgan and Claypool Publishers (2010)

    Google Scholar 

  11. Levenshtein, V.I.: Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady 10 (1966)

    Google Scholar 

  12. Maier, D.: The Complexity of Some Problems on Subsequences and Supersequences. J. ACM 25(2), 322–336 (1978)

    CrossRef  MATH  MathSciNet  Google Scholar 

  13. Max, A., Wisniewski, G.: Mining Naturally-occurring Corrections and Paraphrases from Wikipedia’s Revision History. In: Proceedings of LREC (2010)

    Google Scholar 

  14. Miłkowski, M.: Automated Building of Error Corpora of Polish. In: Corpus Linguistics, Computer Tools, and Applications — State of the Art, pp. 631–639. Peter Lang (2008)

    Google Scholar 

  15. Mizumoto, T., Hayashibe, Y., Komachi, M., Nagata, M., Matsumoto, Y.: The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings. In: Proceedings of COLING 2012: Posters, pp. 863–872 (2012)

    Google Scholar 

  16. Mizumoto, T., Komachi, M., Nagata, M., Matsumoto, Y.: Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners. In: IJCNLP, pp. 147–155 (2011)

    Google Scholar 

  17. Ng, H.T., Wu, S.M., Briscoe, T., Hadiwinoto, C., Susanto, R.H., Bryant, C.: The CoNLL-2014 Shared Task on Grammatical Error Correction. In: Proceedings of CoNLL: Shared Task. ACL (2014)

    Google Scholar 

  18. Ng, H.T., Wu, S.M., Wu, Y., Hadiwinoto, C., Tetreault, J.: The CoNLL-2013 Shared Task on Grammatical Error Correction. In: Proceedings of CoNLL: Shared Task, pp. 1–12. ACL (2013)

    Google Scholar 

  19. Wagner, J., Foster, J., van Genabith, J.: A Comparative Evaluation of Deep and Shallow Approaches to the Automatic Detection of Common Grammatical Errors. In: Proceedings of EMNLP-CoNLL, pp. 112–121. ACL (2007)

    Google Scholar 

  20. Yuan, Z., Felice, M.: Constrained Grammatical Error Correction using Statistical Machine Translation. In: Proceedings of CoNLL: Shared Task, pp. 52–61. ACL (2013)

    Google Scholar 

  21. Zesch, T.: Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History. In: Proceedings of EACL, pp. 529–538 (2012)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and Permissions

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Grundkiewicz, R., Junczys-Dowmunt, M. (2014). The WikEd Error Corpus: A Corpus of Corrective Wikipedia Edits and Its Application to Grammatical Error Correction. In: Przepiórkowski, A., Ogrodniczuk, M. (eds) Advances in Natural Language Processing. NLP 2014. Lecture Notes in Computer Science(), vol 8686. Springer, Cham. https://doi.org/10.1007/978-3-319-10888-9_47

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-10888-9_47

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10887-2

  • Online ISBN: 978-3-319-10888-9

  • eBook Packages: Computer ScienceComputer Science (R0)