Bucking the trend: improved evaluation and annotation practices for ESL error detection systems

Abstract

The last decade has seen an explosion in the number of people learning English as a second language (ESL). In China alone, it is estimated to be over 300 million (Yang in Engl Today 22, 2006). Even in predominantly English-speaking countries, the proportion of non-native speakers can be very substantial. For example, the US National Center for Educational Statistics reported that nearly 10 % of the students in the US public school population speak a language other than English and have limited English proficiency (National Center for Educational Statistics (NCES) in Public school student counts, staff, and graduate counts by state: school year 2000–2001, 2002). As a result, the last few years have seen a rapid increase in the development of NLP tools to detect and correct grammatical errors so that appropriate feedback can be given to ESL writers, a large and growing segment of the world’s population. As a byproduct of this surge in interest, there have been many NLP research papers on the topic, a Synthesis Series book (Leacock et al. in Automated grammatical error detection for language learners. Synthesis lectures on human language technologies. Morgan Claypool, Waterloo 2010), a recurring workshop (Tetreault et al. in Proceedings of the NAACL workshop on innovative use of NLP for building educational applications (BEA), 2012), and a shared task competition (Dale et al. in Proceedings of the seventh workshop on building educational applications using NLP (BEA), pp 54–62, 2012; Dale and Kilgarriff in Proceedings of the European workshop on natural language generation (ENLG), pp 242–249, 2011). Despite this growing body of work, several issues affecting the annotation for and evaluation of ESL error detection systems have received little attention. In this paper, we describe these issues in detail and present our research on alleviating their effects.

Notes

  1. One notable exception is the work of Dahlmeier and Ng (2011), which reimplemented several leading article and preposition error detection methods and compared them on a common corpus of ESL student writing.

  2. http://www.mturk.com.

  3. http://www.crowdflower.com.

  4. There is a third error type, omission (“we are fond ϕ beer”), that is a topic for our future research.

  5. The study in Eeg-Olofsson and Knutsson (2003) had a small evaluation, and it is unclear whether multiple annotators were used.

  6. http://www.cambridge.org/elt.

  7. http://langbank.engl.polyu.edu.hk/corpus/clec.html.

  8. Gamon et al. (2008) did not have a scheme for annotating preposition errors to create a gold-standard corpus, but did use one for the similar task of verifying a system’s output in preposition error detection.

  9. When spelling and grammar annotations were included, kappa ranged from 0.474 to 0.773.

  10. We also experimented with 50 judgments per sentence, but agreement and kappa improved only negligibly.

  11. The only restriction on the Turkers was that they be physically located in the USA.

  12. Any conclusions drawn in this paper pertain only to these specific instantiations of the two systems.

  13. The difference between unweighted and weighted measures can vary depending on the distribution of agreement.

  14. http://bit.ly/crowdgrammar.
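Note 13 contrasts unweighted and weighted agreement measures. As a quick illustration (not taken from the paper; the function and the rating data below are invented for demonstration), here is a minimal pure-Python sketch of Cohen's kappa with and without linear weights:

```python
from collections import Counter

def cohen_kappa(a, b, weights=None):
    """Cohen's kappa for two annotators over the same items.

    weights=None gives the standard (unweighted) kappa; weights="linear"
    penalizes each disagreement by its distance on a numeric ordinal scale.
    """
    n = len(a)
    cats = sorted(set(a) | set(b))
    obs = Counter(zip(a, b))          # observed joint label counts
    ma, mb = Counter(a), Counter(b)   # marginal counts per annotator

    def w(i, j):
        if weights == "linear":
            return abs(i - j)
        return 0.0 if i == j else 1.0

    # kappa = 1 - (observed weighted disagreement / chance-expected weighted disagreement)
    observed = sum(w(i, j) * obs[(i, j)] / n for i in cats for j in cats)
    expected = sum(w(i, j) * (ma[i] / n) * (mb[j] / n)
                   for i in cats for j in cats)
    return 1.0 - observed / expected

# Two hypothetical annotators rating ten sentences on a 1-4 ordinal scale.
a = [1, 2, 2, 3, 4, 1, 2, 3, 4, 4]
b = [1, 2, 3, 3, 4, 2, 2, 3, 3, 4]
print(cohen_kappa(a, b))            # unweighted
print(cohen_kappa(a, b, "linear"))  # linearly weighted
```

Because every disagreement in this toy data is only one category apart, the linearly weighted kappa comes out higher than the unweighted one, illustrating how the two measures can diverge depending on how disagreements are distributed.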

References

  1. Akkaya, C., Conrad, A., Wiebe, J., & Mihalcea, R. (2010). Amazon Mechanical Turk for subjectivity word sense disambiguation. In Proceedings of the NAACL workshop on creating speech and language data with Amazon’s Mechanical Turk (pp. 195–203).

  2. Bennett, P. N., Chandrasekar, R., Chickering, M., Ipeirotis, P. G., Law, E., Mityagin, A., Provost, F. J., & von Ahn, L. (2009). Proceedings of the ACM SIGKDD Workshop on Human Computation. Paris, France: ACM.

  3. Bitchener, J., Young, S., & Cameron, D. (2005). The effect of different types of corrective feedback on ESL student writing. Journal of Second Language Writing, 14, 191–205.

  4. Brockett, C., Dolan, W. B., & Gamon, M. (2006). Correcting ESL errors using phrasal SMT techniques. In Proceedings of the joint conference of the international committee on computational linguistics and the association for computational linguistics (COLING-ACL), Sydney, Australia (pp. 249–256).

  5. Callison-Burch, C. (2009). Fast, cheap, and creative: Evaluating translation quality using Amazon’s Mechanical Turk. In Proceedings of the conference on empirical methods in natural language processing (EMNLP) (pp. 286–295).

  6. Callison-Burch, C., & Dredze, M. (Eds.) (2010). Proceedings of the NAACL workshop on creating speech and language data with Amazon’s Mechanical Turk.

  7. Chandrasekar, R., Chi, E., Chickering, M., Ipeirotis, P. G., Mason, W., Provost, F. J., Tam, J., & von Ahn, L. (2010). Proceedings of the 2010 ACM SIGKDD Workshop on Human Computation. Washington, DC: ACM.

  8. Chodorow, M., Tetreault, J., & Han, N.-R. (2007). Detection of grammatical errors involving prepositions. In Proceedings of the fourth ACL-SIGSEM workshop on prepositions.

  9. Dale, R., & Kilgarriff, A. (2011). Helping our own: The HOO 2011 pilot shared task. In Proceedings of the European workshop on natural language generation (ENLG) (pp. 242–249).

  10. Dahlmeier, D., & Ng, H. T. (2011). Grammatical error correction with alternating structure optimization. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (ACL-HLT) (pp. 915–923). Portland, Oregon, USA.

  11. Dahlmeier, D., & Ng, H. T. (2012). A beam-search decoder for grammatical error correction. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL) (pp. 568–578). Jeju Island, Korea.

  12. Dale, R., Anisimoff, I., & Narroway, G. (2012). HOO 2012: A report on the preposition and determiner error correction shared task. In Proceedings of the seventh workshop on building educational applications using NLP (BEA) (pp. 54–62).

  13. De Felice, R., & Pulman, S. (2008). A classifier-based approach to preposition and determiner error correction in L2 English. In Proceedings of the international conference on computational linguistics (COLING) (pp. 169–176).

  14. De Felice, R., & Pulman, S. (2009). Automatic detection of preposition errors in learner writing. CALICO Journal: Special Issue of the 2008 CALICO Workshop on Automatic Analysis of Learner Language, 26(3), 512–528.

  15. Eeg-Olofsson, J., & Knutsson, O. (2003). Automatic grammar checking for second language learners—the use of prepositions. In Proceedings of the Nordic conference of computational linguistics (NODALIDA).

  16. Evanini, K., Higgins, D., & Zechner, K. (2010). Using Amazon Mechanical Turk for transcription of non-native speech. In Proceedings of the NAACL workshop on creating speech and language data with Amazon’s Mechanical Turk (pp. 53–56).

  17. Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., & Dredze, M. (2010). Annotating named entities in twitter data with crowdsourcing. In Proceedings of the NAACL workshop on creating speech and language data with Amazon’s Mechanical Turk (pp. 80–88).

  18. Fort, K., Adda, G., & Cohen, K. B. (2011). Amazon Mechanical Turk: Gold mine or coal mine? Computational Linguistics, 37(2), 413–420.

  19. Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 277–296.

  20. Gamon, M. (2010). Using mostly native data to correct errors in learners’ writing. In Proceedings of the conference of the North American chapter of the association for computational linguistics (NAACL). Los Angeles.

  21. Gamon, M., Gao, J., Brockett, C., Klementiev, A., Dolan, W. B., Belenko, D., & Vanderwende, L. (2008). Using contextual speller techniques and language modeling for ESL error correction. In Proceedings of the international joint conference on natural language processing (IJCNLP). Hyderabad, India.

  22. Gillick, D., & Liu Y. (2010). Non-expert evaluation of summarization systems is risky. In Proceedings of the NAACL workshop on creating speech and language data with Amazon’s Mechanical Turk (pp. 148–151).

  23. Han, N.-R., Chodorow, M., & Leacock, C. (2006). Detecting errors in English article usage by non-native speakers. Natural Language Engineering, 12, 115–129.

  24. Hermet, M., & Désilets, A. (2009). Using first and second language models to correct preposition errors in second language authoring. In Proceedings of the workshop on building educational applications using NLP (BEA).

  25. Hermet, M., Désilets, A., & Szpakowicz, S. (2008). Using the Web as a linguistic resource to automatically correct lexico-syntactic errors. In Proceedings of the international conference on language resources and evaluation (LREC). Marrakech, Morocco.

  26. Irvine, A., & Klementiev, A. (2010). Using Mechanical Turk to annotate lexicons for less commonly used languages. In Proceedings of the NAACL workshop on creating speech and language data with Amazon’s Mechanical Turk (pp. 108–113).

  27. Izumi, E., Uchimoto, K., & Isahara, H. (2004). The overview of the SST speech corpus of Japanese learner English and evaluation through the experiment on automatic detection of learners’ errors. In Proceedings of the international conference on language resources and evaluation (LREC). Lisbon, Portugal.

  28. Izumi, E., Uchimoto, K., Saiga, T., Supnithi, T., & Isahara, H. (2003). Automatic error detection in the Japanese learners’ English spoken data. In Proceedings of the conference of the association for computational linguistics.

  29. Leacock, C., Chodorow, M., Gamon, M., & Tetreault, J. (2010). Automated grammatical error detection for language learners. Synthesis lectures on human language technologies. Waterloo: Morgan Claypool.

  30. Levin, B. (1993). English verb classes and alternations: A preliminary investigation. Chicago, IL: University of Chicago Press.

  31. Madnani, N. (2010). The circle of meaning: From translation to paraphrasing and back. PhD thesis. Department of Computer Science, University of Maryland College Park.

  32. Mason W., & Watts, D. J. (2009). Financial incentives and the “Performance of Crowds”. In Proceedings of the ACM SIGKDD workshop on human computation (HCOMP).

  33. Nagata, R., Kawai, A., Morihiro, K., & Isu, N. (2006). A feedback-augmented method for detecting errors in the writing of learners of English. In Proceedings of the joint conference of the international committee on computational linguistics and the association for computational linguistics (COLING-ACL) (pp. 241–248). Sydney, Australia.

  34. National Center for Educational Statistics (NCES). (2002). Public school student counts, staff, and graduate counts by state: school year 2000–2001.

  35. Novotney, S., & Callison-Burch, C. (2010). Cheap, fast and good enough: Automatic speech recognition with non-expert transcription. In Proceedings of the 2010 conference of the North American chapter of the association for computational linguistics (NAACL) (pp. 207–215).

  36. Paolacci, G., & Warglien, M. (2010). Experimental turk: A blog on social science experiments on Amazon Mechanical Turk. http://experimentalturk.wordpress.com.

  37. Park, A. Y., & Levy, R. (2011) Automated whole sentence grammar correction using a noisy channel model. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (ACL-HLT) (pp. 934–944). Portland, Oregon, USA.

  38. Rizzolo, N., & Roth, D. (2007). Modeling discriminative global inference. In Proceedings of the first IEEE international conference on semantic computing (ICSC) (pp. 597–604). Irvine, California.

  39. Rozovskaya, A., & Roth, D. (2010a). Annotating ESL errors: Challenges and rewards. In Proceedings of the NAACL workshop on innovative use of NLP for building educational applications (BEA).

  40. Rozovskaya, A., & Roth, D. (2010b). Generating confusion sets for context-sensitive error correction. In Proceedings of the conference on empirical methods in natural language processing (EMNLP).

  41. Rozovskaya, A., & Roth, D. (2010c). Training paradigms for correcting errors in grammar and usage. In Proceedings of human language technologies: The conference of the North American chapter of the association for computational linguistics.

  42. Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast—but is it good? evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing (EMNLP) (pp. 254–263).

  43. Stolcke, A. (2002). SRILM: An extensible language modeling toolkit. In Proceedings of the international conference on spoken language processing (ICSLP) (pp. 257–286).

  44. Tetreault, J., Burstein, J., & Leacock, C. (Eds.) (2012). Proceedings of the NAACL workshop on innovative use of NLP for building educational applications (BEA).

  45. Tetreault, J., & Chodorow, M. (2008). The ups and downs of preposition error detection in ESL writing. In Proceedings of the international conference on computational linguistics (COLING).

  46. Wang, R., & Callison-Burch, C. (2010). Cheap facts and counter-facts. In Proceedings of the NAACL workshop on creating speech and language data with Amazon’s Mechanical Turk (pp. 163–167).

  47. Yang, J. (2006). Learners and users of English in China. English Today, 22(2), 3–10.

  48. Yi, X., Gao, J., & Dolan, W. B. (2008). A web-based English proofing system for English as a second language users. In Proceedings of the international joint conference on natural language processing (IJCNLP) (pp. 619–624). Hyderabad, India.

  49. Zaidan, O. F., & Callison-Burch, C. (2010). Predicting human-targeted translation edit rate via untrained human annotators. In Proceedings of the conference of the North American chapter of the association for computational linguistics (NAACL) (pp. 369–372).

Acknowledgments

We would first like to thank our two experts, Sarah Ohls and Waverly VanWinkle, for their many hours of hard work. We would also like to acknowledge Lei Chen, Keelan Evanini, Jennifer Foster, Derrick Higgins and the two anonymous reviewers for their helpful comments and feedback.

Author information

Corresponding author

Correspondence to Nitin Madnani.

About this article

Cite this article

Tetreault, J., Chodorow, M. & Madnani, N. Bucking the trend: improved evaluation and annotation practices for ESL error detection systems. Lang Resources & Evaluation 48, 5–31 (2014). https://doi.org/10.1007/s10579-013-9243-2

Keywords

  • NLP
  • Grammatical error detection systems
  • Evaluation
  • Annotation
  • Crowdsourcing