
Paraphrase Generation and Supervised Learning for Improved Automatic Short Answer Grading

  • ARTICLE
  • Published in: International Journal of Artificial Intelligence in Education

Abstract

We consider the reference-based approach to Automatic Short Answer Grading (ASAG), in which a student's textual constructed response is scored by comparing it to a teacher-provided reference answer. A single reference answer cannot cover the variety of student answers, as it contains only specific examples of correct responses. Considering other language variants of the reference answer can handle this variability and improve scoring accuracy. Alternative reference answers are possible, but creating them manually is expensive and time-consuming. In this paper, we address two issues. First, we need to automatically generate varied reference answers that can handle the diversity of student answers. Second, we should provide an accurate grading model that improves sentence-similarity computation using multiple reference answers. Our proposed approach therefore comprises two components. First, we provide a sequence-to-sequence deep learning model that generates plausible paraphrased reference answers conditioned on the provided reference answer. Second, we propose a supervised grading model based on sentence-embedding features; the grading model enriches its features with the multiple reference answers to improve accuracy. Experiments are conducted in both Arabic and English. They show that the paraphrase generator produces accurate paraphrases. Using multiple reference answers, the proposed grading model achieves a Root Mean Square Error (RMSE) of 0.6955 and a Pearson correlation of 88.92% on the Arabic dataset, and an RMSE of 0.7790 and a Pearson correlation of 73.50% on the English dataset. While fine-tuning pre-trained transformers on the English dataset provided state-of-the-art performance (RMSE: 0.7620), our approach yields comparable results.
Because it is simple to construct, load, and embed into a question engine, and has low computational complexity, the proposed approach can be easily integrated into a Learning Management System to support the assessment of short answers.
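The grading pipeline sketched in the abstract — comparing a student answer to several reference variants via sentence embeddings, then evaluating with RMSE and Pearson correlation — can be illustrated as follows. This is a minimal sketch, not the authors' implementation: it assumes answers have already been mapped to embedding vectors, and the function names (`similarity_features`, `rmse`, `pearson`) and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_features(student_vec, reference_vecs):
    """Compare a student answer to every reference variant (the original
    teacher answer plus generated paraphrases); the max and mean
    similarities can serve as features of a supervised grading model."""
    sims = [cosine(student_vec, r) for r in reference_vecs]
    return {"max_sim": max(sims), "mean_sim": sum(sims) / len(sims)}

def rmse(y_true, y_pred):
    """Root Mean Square Error between gold and predicted scores."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def pearson(x, y):
    """Pearson correlation between gold and predicted scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy usage with 3-dimensional stand-in embeddings:
# one original reference answer plus one generated paraphrase.
references = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
student = [0.9, 0.4, 0.2]
feats = similarity_features(student, references)
```

A grader trained on such features would then be evaluated exactly as in the abstract, by computing `rmse(gold, predicted)` and `pearson(gold, predicted)` over the test set.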



Data Availability

The datasets used during the current study are available at:

  • Al-Raisi Arabic Dataset: http://www.cs.cmu.edu/~fraisi/arabic/arparallel/
  • The Quora English Dataset: https://github.com/jakartaresearch/quora-question-pairs
  • AR-ASAG Dataset (2020): https://data.mendeley.com/datasets/dj95jh332j/1
  • Mohler et al. (2011) Dataset: https://web.eecs.umich.edu/~mihalcea/downloads.html

Notes

  1. https://huggingface.co/gpt2

  2. https://huggingface.co/docs/transformers/model_doc/bart

  3. https://github.com/google-research/text-to-text-transfer-transformer

  4. http://vectors.nlpl.eu/repository/

  5. http://www.cs.cmu.edu/~fraisi/arabic/arparallel/

  6. https://github.com/jakartaresearch/quora-question-pairs

  7. https://fasttext.cc/docs/en/english-vectors.html

  8. https://github.com/Grader-ASAG

  9. https://data.mendeley.com/datasets/dj95jh332j/1

  10. https://web.eecs.umich.edu/~mihalcea/downloads.html

References

  • Ab Aziz, M. J., Ahmad, F. D., Ghani, A. A. A., & Mahmod, R. (2009). Automated marking system for short answer examination (AMS-SAE). 2009 IEEE Symposium on Industrial Electronics & Applications (ISIEA), 1, 47–51. https://doi.org/10.1109/ISIEA.2009.5356500


  • Adams, O., Roy, S., & Krishnapuram, R. (2016). Distributed vector representations for unsupervised automatic short answer grading. In Proceedings of the 3rd Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA2016) (pp. 20–29). https://aclanthology.org/W16-4904. Accessed 22 Feb 2022.

  • Agarwal, R., Khurana, V., Grover, K., Mohania, M., & Goyal, V. (2022). Multi-Relational Graph Transformer for Automatic Short Answer Grading. NAACL 2022 - 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 2001–2012. https://doi.org/10.18653/v1/2022.naacl-main.146

  • Alkhatib, M., & Shaalan, K. (2018). Paraphrasing Arabic metaphor with neural machine translation. Procedia Computer Science, 142, 308–314. https://doi.org/10.1016/j.procs.2018.10.493


  • Al-Raisi, F., Bourai, A., & Lin, W. (2018a). Neural symbolic arabic paraphrasing with automatic evaluation. Computer Science & Information Technology, 01–13. https://doi.org/10.5121/CSIT.2018.80601

  • Al-Raisi, F., Lin, W., & Bourai, A. (2018b). A monolingual parallel corpus of Arabic. Procedia Computer Science, 142, 334–338. https://doi.org/10.1016/J.PROCS.2018.10.487


  • Ashton, H. S., Beevers, C. E., Milligan, C. D., Schofield, D. K., Thomas, R. C., & Youngson, M. A. (2005). Moving beyond objective testing in online assessment. In Online Assessment and Measurement: Case Studies from Higher Education, K-12 and Corporate (pp. 116–128). IGI Global. https://doi.org/10.4018/978-1-59140-497-2.ch008

  • Azad, S., Chen, B., Fowler, M., West, M., & Zilles, C. (2020). Strategies for deploying unreliable AI graders in high-transparency high-stakes exams. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 12163 LNAI, 16–28. https://doi.org/10.1007/978-3-030-52237-7_2

  • Babych, B. (2014). Automated MT evaluation metrics and their limitations. Tradumàtica: Tecnologies de La Traducció, 12, 464. https://doi.org/10.5565/rev/tradumatica.70


  • Bahdanau, D., Cho, K. H., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings. https://arxiv.org/abs/1409.0473v7. Accessed 24 Feb 2022.

  • Beckman, K., Apps, T., Bennett, S., Dalgarno, B., Kennedy, G., & Lockyer, L. (2019). Self-regulation in open-ended online assignment tasks: The importance of initial task interpretation and goal setting. Studies in Higher Education. https://doi.org/10.1080/03075079.2019.1654450


  • Bloom, B. S. (1984). Taxonomy of educational objectives book 1: Cognitive domain. In nancybroz.com. http://nancybroz.com/nancybroz/Literacy_I_files/BloomIntro.doc. Accessed 31 Aug 2021.

  • Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146. https://doi.org/10.1162/tacl_a_00051


  • Brown, S., & Glasner, A. (Eds.). (1999). Assessment matters in higher education: Choosing and using diverse approaches. https://eric.ed.gov/?id=ED434545. Accessed 24 Feb 2021.

  • Burrows, S., Gurevych, I., & Stein, B. (2015). The eras and trends of automatic short answer grading. In International Journal of Artificial Intelligence in Education (Vol. 25, Issue 1, pp. 60–117). Springer New York LLC. https://doi.org/10.1007/s40593-014-0026-8

  • Cahuantzi, R., Chen, X., & Güttel, S. (2021). A comparison of LSTM and GRU networks for learning symbolic sequences. http://eprints.maths.manchester.ac.uk/. Accessed 25 May 2023.

  • Carbonell, J., & Goldstein, J. (1998). Use of MMR, diversity-based reranking for reordering documents and producing summaries. SIGIR Forum (ACM Special Interest Group on Information Retrieval), 335–336. https://doi.org/10.1145/290941.291025

  • Carneiro, T., Da Nobrega, R. V. M., Nepomuceno, T., Bian, G. B., De Albuquerque, V. H. C., & Filho, P. P. R. (2018). Performance analysis of google colaboratory as a tool for accelerating deep learning applications. IEEE Access, 6, 61677–61685. https://doi.org/10.1109/ACCESS.2018.2874767


  • Chaganty, A. T., Mussmann, S., & Liang, P. (2018). The price of debiasing automatic metrics in natural language evaluation. ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers), 1, 643–653. https://doi.org/10.48550/arxiv.1807.02202

  • Chen, M., Tang, Q., Wiseman, S., & Gimpel, K. (2020). Controllable paraphrase generation with a syntactic exemplar. ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 5972–5984. https://doi.org/10.18653/v1/p19-1599

  • Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 1724–1734. https://doi.org/10.3115/v1/d14-1179

  • Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. https://arxiv.org/abs/1412.3555v1. Accessed 20 Dec 2022.

  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference (vol. 1, pp. 4171–4186). https://github.com/tensorflow/tensor2tensor. Accessed 27 Sept 2022.

  • Dice, L. R. (1945). Measures of the amount of ecologic association between species. Ecology, 26(3), 297–302. https://doi.org/10.2307/1932409


  • Dzikovska, M., Steinhauser, N., Farrow, E., Moore, J., & Campbell, G. (2014). BEETLE II: Deep natural language understanding and automatic feedback generation for intelligent tutoring in basic electricity and electronics. International Journal of Artificial Intelligence in Education, 24(3), 284–332. https://doi.org/10.1007/s40593-014-0017-9


  • Gaddipati, S. K., Nair, D., & Plöger, P. G. (2020). Comparative evaluation of pretrained transfer learning models on automatic short answer grading. https://arxiv.org/abs/2009.01303v1. Accessed 27 May 2023.

  • Gomaa, W. H., & Fahmy, A. A. (2020). Ans2vec: A scoring system for short answers. Advances in Intelligent Systems and Computing, 921, 586–595. https://doi.org/10.1007/978-3-030-14118-9_59


  • Goyal, T., & Durrett, G. (2020). Neural Syntactic Preordering for Controlled Paraphrase Generation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 238–252. https://doi.org/10.18653/v1/2020.acl-main.22

  • Gupta, A., Agarwal, A., Singh, P., & Rai, P. (2018). A deep generative framework for paraphrase generation. 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, 5149–5156. https://doi.org/10.5555/3504035.3504666

  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/NECO.1997.9.8.1735


  • Hsu, S., Wentin, T., Zhang, Z., & Fowler, M. (2021). Attitudes surrounding an imperfect AI autograder. Conference on Human Factors in Computing Systems - Proceedings. https://doi.org/10.1145/3411764.3445424


  • Huang, S., Wu, Y., Wei, F., & Luan, Z. (2019). Dictionary-guided editing networks for paraphrase generation. 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, 6546–6553. https://doi.org/10.1609/AAAI.V33I01.33016546

  • Huang, X., Bidart, R., Khetan, A., & Karnin, Z. (2022). Pyramid-BERT: Reducing complexity via successive core-set based token selection. Proceedings of the Annual Meeting of the Association for Computational Linguistics, 1, 8798–8817. https://doi.org/10.18653/v1/2022.acl-long.602


  • Islam, A., & Inkpen, D. (2008). Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data, 2(2), 1–25. https://doi.org/10.1145/1376815.1376819


  • Jayashankar, S., & Sridaran, R. (2017). Superlative model using word cloud for short answers evaluation in eLearning. Education and Information Technologies, 22(5), 2383–2402. https://doi.org/10.1007/s10639-016-9547-0


  • Jordan, S. (2013). E-assessment: Past, present and future. New Directions, 9(1), 87–106. https://doi.org/10.11120/ndir.2013.00009


  • Jordan, S., & Butcher, P. (2013). Does the Sun orbit the Earth? Challenges in using short free-text computer-marked questions. In HEA STEM Annual Learning and Teaching Conference 2013: Where Practice and Pedagogy Meet. http://www.heacademy.ac.uk/events/detail/2012/17_18_Apr_HEA_STEM_2013_Conf_Bham. Accessed 1 June 2021.

  • Kazemnejad, A., Salehi, M., & Soleymani Baghshah, M. (2020). Paraphrase generation by learning how to edit from samples. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, (pp. 6010–6021). https://doi.org/10.18653/v1/2020.acl-main.535

  • Khan, S., & Khan, R. A. (2019). Online assessments: Exploring perspectives of university students. Education and Information Technologies, 24(1), 661–677. https://doi.org/10.1007/s10639-018-9797-0


  • Kingma, D. P., & Ba, J. L. (2015). Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings. https://doi.org/10.48550/arxiv.1412.6980. Accessed 20 Feb 2022.

  • Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings. https://arxiv.org/abs/1312.6114v10

  • Kreutzer, J., Caswell, I., Wang, L., Wahab, A., Van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., Setyawan, M., Sarin, S., Samb, S., Sagot, B., Rivera, C., Rios, A., Papadimitriou, I., Osei, S., Suarez, P. O., … Adeyemi, M. (2022). Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10, 50–72. https://doi.org/10.1162/tacl_a_00447

  • Kumar, S., Chakrabarti, S., & Roy, S. (2017). Earth mover’s distance pooling over siamese LSTMs for Automatic short answer grading. IJCAI International Joint Conference on Artificial Intelligence, 0, 2046–2052. https://doi.org/10.24963/ijcai.2017/284


  • Kumar, A., Ahuja, K., Vadapalli, R., & Talukdar, P. (2020). Syntax-guided controlled generation of paraphrases. Transactions of the Association for Computational Linguistics, 8, 330–345. https://doi.org/10.1162/tacl_a_00318


  • Kumaran, V. S., & Sankar, A. (2015). Towards an automated system for short-answer assessment using ontology mapping. International Arab Journal of E-Technology, 4(1), 17–24. http://www.iajet.org/documents/vol.4/no.1/3.pdf. Accessed 17 Feb 2022.

  • Lai, H., Mao, J., Toral, A., & Nissim, M. (2022). Human Judgement as a Compass to Navigate Automatic Metrics for Formality Transfer. HumEval 2022 - 2nd Workshop on Human Evaluation of NLP Systems, Proceedings of the Workshop, 102–115. https://doi.org/10.18653/v1/2022.humeval-1.9

  • Lavie, A. (2010). Evaluating the output of machine translation systems. In AMTA 2010 - 9th Conference of the Association for Machine Translation in the Americas. https://www.cs.cmu.edu/~alavie/Presentations/MT-Evaluation-MT-Summit-Tutorial-19Sep11.pdf. Accessed 3 Mar 2022.

  • Lavie, A., & Agarwal, A. (2007). METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, June (pp. 228–231). https://aclanthology.org/W07-0734/. Accessed 20 Feb 2022.

  • Leacock, C., & Chodorow, M. (2003). C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4), 389–405. https://doi.org/10.1023/A:1025779619903


  • Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7871–7880. https://doi.org/10.18653/v1/2020.acl-main.703

  • Marvaniya, S., Foltz, P., Saha, S., Sindhgatta, R., Dhamecha, T. I., & Sengupta, B. (2018). Creating scoring rubric from representative student answers for improved short answer grading. International Conference on Information and Knowledge Management, Proceedings, 993–1002. https://doi.org/10.1145/3269206.3271755

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems. https://arxiv.org/abs/1310.4546v1

  • Mohler, M., & Mihalcea, R. (2009). Text-to-text semantic similarity for automatic short answer grading. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics on - EACL ’09, 567–575. https://doi.org/10.3115/1609067.1609130

  • Mohler, M., Bunescu, R., & Mihalcea, R. (2011). Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 752–762.

  • Moodle. (2011). Regular expression short-Answer question type. https://docs.moodle.org/310/en/Regular_Expression_Short-Answer_question_type. Accessed 27 Dec 2020.

  • Nagoudi, E. M. B., Elmadany, A., & Abdul-Mageed, M. (2022). AraT5: Text-to-text transformers for arabic language generation. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 628–647. https://doi.org/10.18653/v1/2022.acl-long.47

  • Napoles, C., Sakaguchi, K., Post, M., & Tetreault, J. (2015). Ground Truth for Grammaticality Correction Metrics. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 588–593. https://doi.org/10.3115/v1/p15-2097

  • Noorbehbahani, F., & Kardan, A. A. (2011). The automatic assessment of free text answers using a modified BLEU algorithm. Computers and Education, 56(2), 337–345. https://doi.org/10.1016/j.compedu.2010.07.013


  • Omran, A. M. B., & Ab Aziz, M. J. (2013). Automatic essay grading system for short answers in English language. Journal of Computer Science, 9(10), 1369–1382. https://doi.org/10.3844/jcssp.2013.1369.1382


  • Ott, N., Ziai, R., & Meurers, D. (2012). Creation and analysis of a reading comprehension exercise corpus (pp. 47–69). John Benjamins Publishing Company. https://doi.org/10.1075/hsm.14.05ott


  • Ouahrani, L., & Bennouar, D. (2020). AR-ASAG an Arabic dataset for automatic short answer grading evaluation. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) (pp. 2634–2643). https://aclanthology.org/2020.lrec-1.321. Accessed 13 Dec 2021.

  • Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, (pp. 311–318). https://doi.org/10.3115/1073083.1073135

  • Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In NAACL HLT 2018 - 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, (vol. 1, pp. 2227–2237). https://doi.org/10.18653/v1/n18-1202

  • Prakash, A., Hasan, S. A., Lee, K., Datla, V., Qadir, A., Liu, J., & Farri, O. (2016). Neural paraphrase generation with stacked residual LSTM networks - ACL anthology. In Proceedings of {COLING} 2016, the 26th International Conference on Computational Linguistics: Technical Papers (pp. 2923–2934). https://aclanthology.org/C16-1275/. Accessed 19 Feb 2022.

  • Pribadi, F. S., Permanasari, A. E., & Adji, T. B. (2018). Short answer scoring system using automatic reference answer generation and geometric average normalized-longest common subsequence (GAN-LCS). Education and Information Technologies, 23(6), 2855–2866. https://doi.org/10.1007/S10639-018-9745-Z


  • Qiu, R. G. (2019). A systemic approach to leveraging student engagement in collaborative learning to improve online engineering education. International Journal of Technology Enhanced Learning, 11(1), 1–19. https://dl.acm.org/doi/10.5555/3302810.3302811. Accessed 19 Feb 2022.


  • Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI Technical Report. https://www.bibsonomy.org/bibtex/273ced32c0d4588eb95b6986dc2c8147c/jonaskaiser. Accessed 30 May 2023.


  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9. https://github.com/codelucas/newspaper. Accessed 30 May 2023.

  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2020). Language models are unsupervised multitask learners. OpenAI Blog, 1(May), 1–7. https://github.com/codelucas/newspaper. Accessed 30 May 2023.

  • Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21, 1–67. https://doi.org/10.48550/arxiv.1910.10683


  • Ramachandran, L., & Foltz, P. (2015). Generating reference texts for short answer scoring using graph-based summarization. 10th Workshop on Innovative Use of NLP for Building Educational Applications, BEA 2015 at the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2015, 207–212. https://doi.org/10.3115/v1/w15-0624

  • Ramachandran, L., Cheng, J., & Foltz, P. (2015). Identifying patterns for short answer scoring using graph-based lexico-semantic text matching. Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, 97–106. https://doi.org/10.3115/v1/W15-0612

  • Real, R., & Vargas, J. M. (1996). The probabilistic basis of Jaccard’s index of similarity. In Systematic Biology (Vol. 45, Issue 3, pp. 380–385). Taylor and Francis Inc. https://doi.org/10.1093/sysbio/45.3.380

  • Rocchio, J. (1971). Relevance feedback in information retrieval. In G. Salton (Ed.), The SMART Retrieval System: Experiments in Automatic Document Processing (pp. 313–323). Prentice-Hall. https://www.bibsonomy.org/bibtex/1c18d843e34fe4f8bd1d2438227857225/bsmyth

  • Saha, S., Dhamecha, T. I., Marvaniya, S., Sindhgatta, R., & Sengupta, B. (2018). Sentence level or token level features for automatic short answer grading?: use both. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 10947 LNAI, 503–517. https://doi.org/10.1007/978-3-319-93843-1_37

  • Sakaguchi, K., Heilman, M., & Madnani, N. (2015). Effective feature integration for automated short answer scoring. NAACL HLT 2015 - 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference.

  • Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513–523. https://doi.org/10.1016/0306-4573(88)90021-0


  • Schneider, J., Richner, R., & Riser, M. (2023). Towards trustworthy AutoGrading of short, multi-lingual, multi-type answers. International Journal of Artificial Intelligence in Education, 33(1), 88–118. https://doi.org/10.1007/s40593-022-00289-z


  • Scikit-learn. (2019). scikit-learn: Machine learning in Python (v0.21.0). https://scikit-learn.org/stable/

  • Shermis, M. D. (2015). Contrasting state-of-the-art in the machine scoring of short-form constructed responses. Educational Assessment, 20(1), 46–65. https://doi.org/10.1080/10627197.2015.997617


  • Sukkarieh, J. Z., & Blackmore, J. (2009). c-rater: Automatic Content Scoring for Short Constructed Responses. In Proceedings of the 22nd International FLAIRS Conference. Association for the Advancement of Artificial Intelligence (pp. 290–295). https://www.ets.org/research/policy_research_reports/publications/chapter/2009/imsb. Accessed 26 Mar 2022

  • Sultan, M. A., Salazar, C., & Sumner, T. (2016). Fast and easy short answer grading with high accuracy. 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2016 - Proceedings of the Conference, 1070–1075. https://doi.org/10.18653/v1/n16-1123

  • Sun, J., Ma, X., & Peng, N. (2021). AESOP: Paraphrase generation with adaptive syntactic control. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 5176–5189. https://doi.org/10.18653/v1/2021.emnlp-main.420

  • Sychev, O., Anikin, A., & Prokudin, A. (2020). Automatic grading and hinting in open-ended text questions. Cognitive Systems Research, 59, 264–272. https://doi.org/10.1016/j.cogsys.2019.09.025


  • Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2003, 252–259. https://doi.org/10.3115/1073445.1073478

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5999–6009.

  • Vijayakumar, A. K., Cogswell, M., Selvaraju, R. R., Sun, Q., Lee, S., Crandall, D., & Batra, D. (2018). Diverse beam search: Decoding diverse solutions from neural sequence models. The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 7371–7379.

  • Whitelock, D., & Bektik, D. (2018). Progress and Challenges for Automated Scoring and Feedback Systems for Large-Scale Assessments (pp. 1–18). https://doi.org/10.1007/978-3-319-53803-7_39-1

  • Wubben, S., van den Bosch, A., & Krahmer, E. (2010). Paraphrase generation as monolingual translation: Data and evaluation. In Belgian/Netherlands Artificial Intelligence Conference. http://ilk.uvt.nl/. Accessed 22 Feb 2022.

  • Xu, P., Kumar, D., Yang, W., Zi, W., Tang, K., Huang, C., Cheung, J. C. K., Prince, S. J. D., & Cao, Y. (2021). Optimizing deeper transformers on small datasets. ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 2089–2102. https://doi.org/10.18653/v1/2021.acl-long.163

  • Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., & Raffel, C. (2021). mT5: A massively multilingual pre-trained text-to-text transformer. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 483–498. https://doi.org/10.18653/v1/2021.naacl-main.41

  • Yang, Q., Huo, Z., Shen, D., Cheng, Y., Wang, W., Wang, G., & Carin, L. (2020). An end-to-end generative architecture for paraphrase generation. EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 3132–3142. https://doi.org/10.18653/v1/d19-1309

  • Zahran, M. A., Magooda, A., Mahgoub, A. Y., Raafat, H., Rashwan, M., & Atyia, A. (2015). Word representations in vector space and their applications for Arabic. In A. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing: 16th International Conference, CICLing 2015 (Vol. 9041, pp. 430–443). Springer International Publishing. https://doi.org/10.1007/978-3-319-18111-0_32

  • Zeng, D., Zhang, H., Xiang, L., Wang, J., & Ji, G. (2019). User-oriented paraphrase generation with keywords controlled network. IEEE Access, 7, 80542–80551. https://doi.org/10.1109/ACCESS.2019.2923057


  • Zhao, J., Zhu, T., & Lan, M. (2014). ECNU: One stone two birds: Ensemble of heterogenous measures for semantic relatedness and textual entailment. Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 271–277. https://doi.org/10.3115/v1/s14-2044

  • Ziai, R., Ott, N., & Meurers, D. (2012). Short Answer Assessment : Establishing Links Between Research Strands. Proceedings of the Seventh Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, 2(2005), 190–200.


Acknowledgements

The authors would like to thank Selena LAMARI, Oussama HAMEL, Ahmed Hadjersi, and Oussama Benguergoura for their technical help during the experimentation phase.

Funding

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Author information

Authors and Affiliations

Authors

Contributions

All authors worked collaboratively to formulate research questions, conduct the search, select data, and perform analysis, experiments, and discussion. The corresponding author worked on writing the initial draft. All authors reviewed, read, and approved the final manuscript.

Corresponding author

Correspondence to Leila Ouahrani.

Ethics declarations

Competing Interests

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Ouahrani, L., Bennouar, D. Paraphrase Generation and Supervised Learning for Improved Automatic Short Answer Grading. Int J Artif Intell Educ (2024). https://doi.org/10.1007/s40593-023-00391-w

