
Language Resources and Evaluation, Volume 48, Issue 1, pp 5–31

Bucking the trend: improved evaluation and annotation practices for ESL error detection systems

  • Joel Tetreault
  • Martin Chodorow
  • Nitin Madnani (corresponding author)
Special Issue: Resources for language learning

Abstract

The last decade has seen an explosion in the number of people learning English as a second language (ESL). In China alone, the number of ESL learners is estimated to be over 300 million (Yang in Engl Today 22, 2006). Even in predominantly English-speaking countries, the proportion of non-native speakers can be substantial. For example, the US National Center for Educational Statistics reported that nearly 10% of the students in the US public school population speak a language other than English and have limited English proficiency (National Center for Educational Statistics (NCES) in Public school student counts, staff, and graduate counts by state: school year 2000–2001, 2002). As a result, the last few years have seen a rapid increase in the development of NLP tools to detect and correct grammatical errors so that appropriate feedback can be given to ESL writers, a large and growing segment of the world’s population. As a byproduct of this surge in interest, there have been many NLP research papers on the topic, a Synthesis Series book (Leacock et al. in Automated grammatical error detection for language learners. Synthesis lectures on human language technologies. Morgan & Claypool, Waterloo 2010), a recurring workshop (Tetreault et al. in Proceedings of the NAACL workshop on innovative use of NLP for building educational applications (BEA), 2012), and a shared task competition (Dale et al. in Proceedings of the seventh workshop on building educational applications using NLP (BEA), pp 54–62, 2012; Dale and Kilgarriff in Proceedings of the European workshop on natural language generation (ENLG), pp 242–249, 2011). Despite this growing body of work, several issues affecting the annotation for and evaluation of ESL error detection systems have received little attention. In this paper, we describe these issues in detail and present our research on alleviating their effects.
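To make the evaluation setting concrete, the following is a minimal illustrative sketch, not taken from the paper itself, of how an error detection system is conventionally scored against a single human-annotated gold standard using precision, recall, and F1. All positions and values below are hypothetical.

```python
# Illustrative sketch only: scoring an error detection system against a
# single gold annotation. The flagged positions are made-up data, not
# results from the paper.

def evaluate(system_flags, gold_flags):
    """Compare the token positions a system flags as errors against
    the positions a human annotator marked as errors."""
    tp = len(system_flags & gold_flags)   # correctly flagged errors
    fp = len(system_flags - gold_flags)   # false alarms
    fn = len(gold_flags - system_flags)   # missed errors
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical positions of preposition errors in a learner essay.
system = {3, 7, 12, 18}        # positions the system flagged
gold = {3, 12, 18, 25, 31}     # positions the annotator marked

p, r, f = evaluate(system, gold)
print(f"precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
```

Scoring against one annotator's judgments in this way is precisely the kind of conventional practice whose annotation and evaluation pitfalls the paper examines.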

Keywords

NLP · Grammatical error detection systems · Evaluation · Annotation · Crowdsourcing


Acknowledgments

We would first like to thank our two experts, Sarah Ohls and Waverly VanWinkle, for their many hours of hard work. We would also like to acknowledge Lei Chen, Keelan Evanini, Jennifer Foster, Derrick Higgins and the two anonymous reviewers for their helpful comments and feedback.

References

  1. Akkaya, C., Conrad, A., Wiebe, J., & Mihalcea, R. (2010). Amazon Mechanical Turk for subjectivity word sense disambiguation. In Proceedings of the NAACL workshop on creating speech and language data with Amazon’s Mechanical Turk (pp. 195–203).
  2. Bennett, P. N., Chandrasekar, R., Chickering, M., Ipeirotis, P. G., Law, E., Mityagin, A., Provost, F. J., & von Ahn, L. (2009). Proceedings of the ACM SIGKDD workshop on human computation. Paris, France: ACM.
  3. Bitchener, J., Young, S., & Cameron, D. (2005). The effect of different types of corrective feedback on ESL student writing. Journal of Second Language Writing, 14, 191–205.
  4. Brockett, C., Dolan, W. B., & Gamon, M. (2006). Correcting ESL errors using phrasal SMT techniques. In Proceedings of the joint conference of the international committee on computational linguistics and the association for computational linguistics (COLING-ACL) (pp. 249–256). Sydney, Australia.
  5. Callison-Burch, C. (2009). Fast, cheap, and creative: Evaluating translation quality using Amazon’s Mechanical Turk. In Proceedings of the conference on empirical methods in natural language processing (EMNLP) (pp. 286–295).
  6. Callison-Burch, C., & Dredze, M. (Eds.) (2010). Proceedings of the NAACL workshop on creating speech and language data with Amazon’s Mechanical Turk.
  7. Chandrasekar, R., Chi, E., Chickering, M., Ipeirotis, P. G., Mason, W., Provost, F. J., Tam, J., & von Ahn, L. (2010). Proceedings of the 2010 ACM SIGKDD workshop on human computation. Washington, DC: ACM.
  8. Chodorow, M., Tetreault, J., & Han, N.-R. (2007). Detection of grammatical errors involving prepositions. In Proceedings of the fourth ACL-SIGSEM workshop on prepositions.
  9. Dale, R., & Kilgarriff, A. (2011). Helping our own: The HOO 2011 pilot shared task. In Proceedings of the European workshop on natural language generation (ENLG) (pp. 242–249).
  10. Dahlmeier, D., & Ng, H. T. (2011). Grammatical error correction with alternating structure optimization. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (ACL-HLT) (pp. 915–923). Portland, Oregon, USA.
  11. Dahlmeier, D., & Ng, H. T. (2012). A beam-search decoder for grammatical error correction. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL) (pp. 568–578). Jeju Island, Korea.
  12. Dale, R., Anisimoff, I., & Narroway, G. (2012). HOO 2012: A report on the preposition and determiner error correction shared task. In Proceedings of the seventh workshop on building educational applications using NLP (BEA) (pp. 54–62).
  13. De Felice, R., & Pulman, S. (2008). A classifier-based approach to preposition and determiner error correction in L2 English. In Proceedings of the international conference on computational linguistics (COLING) (pp. 169–176).
  14. De Felice, R., & Pulman, S. (2009). Automatic detection of preposition errors in learner writing. CALICO Journal: Special Issue of the 2008 CALICO Workshop on Automatic Analysis of Learner Language, 26(3), 512–528.
  15. Eeg-Olofsson, J., & Knutsson, O. (2003). Automatic grammar checking for second language learners—the use of prepositions. In Proceedings of the Nordic conference of computational linguistics (NODALIDA).
  16. Evanini, K., Higgins, D., & Zechner, K. (2010). Using Amazon Mechanical Turk for transcription of non-native speech. In Proceedings of the NAACL workshop on creating speech and language data with Amazon’s Mechanical Turk (pp. 53–56).
  17. Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., & Dredze, M. (2010). Annotating named entities in Twitter data with crowdsourcing. In Proceedings of the NAACL workshop on creating speech and language data with Amazon’s Mechanical Turk (pp. 80–88).
  18. Fort, K., Adda, G., & Cohen, K. B. (2011). Amazon Mechanical Turk: Gold mine or coal mine? Computational Linguistics, 37(2), 413–420.
  19. Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 277–296.
  20. Gamon, M. (2010). Using mostly native data to correct errors in learners’ writing. In Proceedings of the conference of the North American chapter of the association for computational linguistics (NAACL). Los Angeles.
  21. Gamon, M., Gao, J., Brockett, C., Klementiev, A., Dolan, W. B., Belenko, D., & Vanderwende, L. (2008). Using contextual speller techniques and language modeling for ESL error correction. In Proceedings of the international joint conference on natural language processing (IJCNLP). Hyderabad, India.
  22. Gillick, D., & Liu, Y. (2010). Non-expert evaluation of summarization systems is risky. In Proceedings of the NAACL workshop on creating speech and language data with Amazon’s Mechanical Turk (pp. 148–151).
  23. Han, N.-R., Chodorow, M., & Leacock, C. (2006). Detecting errors in English article usage by non-native speakers. Natural Language Engineering, 12, 115–129.
  24. Hermet, M., & Désilets, A. (2009). Using first and second language models to correct preposition errors in second language authoring. In Proceedings of the workshop on building educational applications using NLP (BEA).
  25. Hermet, M., Désilets, A., & Szpakowicz, S. (2008). Using the Web as a linguistic resource to automatically correct lexico-syntactic errors. In Proceedings of the international conference on language resources and evaluation (LREC). Marrakech, Morocco.
  26. Irvine, A., & Klementiev, A. (2010). Using Mechanical Turk to annotate lexicons for less commonly used languages. In Proceedings of the NAACL workshop on creating speech and language data with Amazon’s Mechanical Turk (pp. 108–113).
  27. Izumi, E., Uchimoto, K., & Isahara, H. (2004). The overview of the SST speech corpus of Japanese learner English and evaluation through the experiment on automatic detection of learners’ errors. In Proceedings of the international conference on language resources and evaluation (LREC). Lisbon, Portugal.
  28. Izumi, E., Uchimoto, K., Saiga, T., Supnithi, T., & Isahara, H. (2003). Automatic error detection in the Japanese learners’ English spoken data. In Proceedings of the conference of the association for computational linguistics (ACL).
  29. Leacock, C., Chodorow, M., Gamon, M., & Tetreault, J. (2010). Automated grammatical error detection for language learners. Synthesis lectures on human language technologies. Waterloo: Morgan & Claypool.
  30. Levin, B. (1993). English verb classes and alternations: A preliminary investigation. Chicago, IL: University of Chicago Press.
  31. Madnani, N. (2010). The circle of meaning: From translation to paraphrasing and back. PhD thesis, Department of Computer Science, University of Maryland, College Park.
  32. Mason, W., & Watts, D. J. (2009). Financial incentives and the “Performance of Crowds”. In Proceedings of the ACM SIGKDD workshop on human computation (HCOMP).
  33. Nagata, R., Kawai, A., Morihiro, K., & Isu, N. (2006). A feedback-augmented method for detecting errors in the writing of learners of English. In Proceedings of the joint conference of the international committee on computational linguistics and the association for computational linguistics (COLING-ACL) (pp. 241–248). Sydney, Australia.
  34. National Center for Educational Statistics (NCES). (2002). Public school student counts, staff, and graduate counts by state: School year 2000–2001.
  35. Novotney, S., & Callison-Burch, C. (2010). Cheap, fast and good enough: Automatic speech recognition with non-expert transcription. In Proceedings of the 2010 conference of the North American chapter of the association for computational linguistics (NAACL) (pp. 207–215).
  36. Paolacci, G., & Warglien, M. (2010). Experimental Turk: A blog on social science experiments on Amazon Mechanical Turk. http://experimentalturk.wordpress.com.
  37. Park, A. Y., & Levy, R. (2011). Automated whole sentence grammar correction using a noisy channel model. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (ACL-HLT) (pp. 934–944). Portland, Oregon, USA.
  38. Rizzolo, N., & Roth, D. (2007). Modeling discriminative global inference. In Proceedings of the first IEEE international conference on semantic computing (ICSC) (pp. 597–604). Irvine, California.
  39. Rozovskaya, A., & Roth, D. (2010a). Annotating ESL errors: Challenges and rewards. In Proceedings of the NAACL workshop on innovative use of NLP for building educational applications (BEA).
  40. Rozovskaya, A., & Roth, D. (2010b). Generating confusion sets for context-sensitive error correction. In Proceedings of the conference on empirical methods in natural language processing (EMNLP).
  41. Rozovskaya, A., & Roth, D. (2010c). Training paradigms for correcting errors in grammar and usage. In Proceedings of human language technologies: The conference of the North American chapter of the association for computational linguistics (NAACL-HLT).
  42. Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing (EMNLP) (pp. 254–263).
  43. Stolcke, A. (2002). SRILM: An extensible language modeling toolkit. In Proceedings of the international conference on spoken language processing (ICSLP) (pp. 257–286).
  44. Tetreault, J., Burstein, J., & Leacock, C. (Eds.) (2012). Proceedings of the NAACL workshop on innovative use of NLP for building educational applications (BEA).
  45. Tetreault, J., & Chodorow, M. (2008). The ups and downs of preposition error detection in ESL writing. In Proceedings of the international conference on computational linguistics (COLING).
  46. Wang, R., & Callison-Burch, C. (2010). Cheap facts and counter-facts. In Proceedings of the NAACL workshop on creating speech and language data with Amazon’s Mechanical Turk (pp. 163–167).
  47. Yang, J. (2006). Learners and users of English in China. English Today, 22(2), 3–10.
  48. Yi, X., Gao, J., & Dolan, W. B. (2008). A web-based English proofing system for English as a second language users. In Proceedings of the international joint conference on natural language processing (IJCNLP) (pp. 619–624). Hyderabad, India.
  49. Zaidan, O. F., & Callison-Burch, C. (2010). Predicting human-targeted translation edit rate via untrained human annotators. In Proceedings of the conference of the North American chapter of the association for computational linguistics (NAACL) (pp. 369–372).

Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Joel Tetreault (1)
  • Martin Chodorow (2)
  • Nitin Madnani (1, corresponding author)

  1. Educational Testing Service, Princeton, USA
  2. Hunter College of CUNY, New York, USA
