Going deeper: Automatic short-answer grading by combining student and question models

  • Yuan Zhang
  • Chen Lin
  • Min Chi


As educational technologies have rapidly become more powerful and more prevalent, especially from the 2010s onward, the automated grading of natural-language responses has become a major area of research. In this work, we apply the classic student and domain/question models that are widely used in intelligent tutoring systems to the task of automatic short-answer grading (ASAG). ASAG is the process of applying natural language processing techniques to assess student-authored short answers; conventional ASAG systems focus mainly on the student answers themselves and are therefore referred to as answer-based. In recent years, deep learning models have gained great popularity across a wide range of domains. While classic machine learning methods have been widely applied to ASAG, deep learning models, as far as we know, have not, probably because the lexical features extracted from short answers provide limited information. In this work, we explore the effectiveness of one deep learning model, the deep belief network (DBN), on the task of ASAG. Overall, our results on a real-world corpus demonstrate that (1) adding student and question models to the conventional answer-based approach can greatly enhance ASAG performance, and (2) deep learning models such as DBNs can be productively applied to the task of ASAG.
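To make the combined approach concrete, the sketch below illustrates the general idea (not the authors' actual system): per-answer lexical features are concatenated with hypothetical student-model and question-model features, and a DBN-style classifier, here approximated by greedily stacked restricted Boltzmann machines with a logistic-regression layer on top, predicts the grade. All feature names, dimensions, and hyperparameters are illustrative assumptions.

```python
# Illustrative DBN-style grader for ASAG (a sketch, not the paper's implementation).
# Assumption: each answer is represented by a [0, 1]-scaled feature vector that
# concatenates lexical features with student-model and question-model features.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

n_answers = 200
# Hypothetical feature layout per answer:
# 30 lexical features + 5 student-model features + 5 question-model features.
X = rng.random((n_answers, 40))
y = rng.integers(0, 2, size=n_answers)  # 1 = correct, 0 = incorrect

model = Pipeline([
    # Two stacked RBMs learn successive hidden representations (greedy layer-wise
    # pretraining); the logistic layer maps the top representation to a grade.
    ("rbm1", BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=20, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=16, learning_rate=0.05, n_iter=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
grades = model.predict(X)  # one predicted grade per answer
```

Note that, unlike a full DBN, this pipeline omits joint fine-tuning of the stacked layers after pretraining; it is only meant to show how answer, student, and question features can be fed to one deep model.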


Keywords: Automatic short-answer grading · Machine learning · Deep belief network



This research was supported by the NSF Grants #1432156: ‘Educational Data Mining for Individualized Instruction in STEM Learning Environments’, #1651909: ‘CAREER: Improving Adaptive Decision Making in Interactive Learning Environment’, #1660878: ‘MetaDash: A Teacher Dashboard Informed by Real-Time Multichannel Self-Regulated Learning Data’ and #1726550: ‘Integrated Data-driven Technologies for Individualized Instruction in STEM Learning Environments’.

Copyright information

© Springer Nature B.V. 2020

Authors and Affiliations

  1. Department of Computer Science, North Carolina State University, Raleigh, USA