Language Resources and Evaluation, Volume 50, Issue 1, pp. 67–93

The joint student response analysis and recognizing textual entailment challenge: making sense of student responses in educational applications

  • Myroslava O. Dzikovska
  • Rodney D. Nielsen
  • Claudia Leacock
Original Paper

Abstract

We present the results of the joint student response analysis (SRA) and 8th recognizing textual entailment challenge. The challenge aimed to bring together researchers from the educational natural language processing and computational semantics communities. The SRA task is to assess student responses to questions in the science domain, focusing on the correctness and completeness of the response content. Nine teams took part in the challenge, submitting a total of 18 runs that used methods and features adapted from previous research on automated short answer grading, recognizing textual entailment and semantic textual similarity. We provide an extended analysis of the results, focusing on the impact of evaluation metrics, application scenarios, and the methods and features used by the participants. We conclude that additional research is required before syntactic dependency features and external semantic resources can be leveraged effectively for this task, possibly because existing resources offer limited coverage of scientific domains. However, each of the three approaches that adjusted features and models to the application scenario improved system performance, meriting further investigation by the research community.

Keywords

Student response analysis · Short answer scoring · Recognizing textual entailment · Semantic textual similarity

Acknowledgments

The research reported here was supported by the US ONR Award N000141410733 and by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A120808 to the University of North Texas. The opinions expressed are those of the authors and do not represent views of the Institute of Education Sciences or the U.S. Department of Education. The authors would like to thank Chris Brew for the discussion and suggestions related to the paper organization. We thank the three anonymous reviewers for their helpful comments.

Copyright information

© Springer Science+Business Media Dordrecht 2015

Authors and Affiliations

  • Myroslava O. Dzikovska (1)
  • Rodney D. Nielsen (2)
  • Claudia Leacock (3)
  1. School of Informatics, University of Edinburgh, Edinburgh, UK
  2. University of North Texas, Denton, USA
  3. McGraw Hill Education/CTB, Monterey, USA