Abstract
Automated grading of free-text exam responses is a challenging task, owing to factors such as the scarcity of training data and bias in the graders' ground-truth labels. In this paper, we focus on the automated grading of free-text responses. We formulate the problem as binary classification with two class labels: low-grade and high-grade. We present a benchmark of four machine learning methods using three experiment protocols on two real-world datasets: one from Cyber-crime exams in Arabic and one from Data Mining exams in English, the latter presented for the first time in this work. By reporting various metrics for both binary classification and answer ranking, we illustrate the benefits and drawbacks of the benchmarked methods. Our results suggest that standard models with individual word representations can, in some cases, achieve predictive performance competitive with deep neural language models that use context-based representations, on both the binary classification and answer-ranking formulations of free-text response grading. Lastly, we discuss the pedagogical implications of our findings by identifying potential pitfalls and challenges in building predictive models for such tasks.
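As an illustration of the formulation above, the sketch below binarises numeric exam scores into low- and high-grade labels at a pass threshold and evaluates predicted scores with one classification metric and one ranking metric. The synthetic scores, the threshold of 5.0, and the choice of F1 and ROC-AUC are assumptions for illustration only; the paper benchmarks its own models, protocols, and metric suite.

```python
# Illustrative sketch (not the paper's exact protocol): binarising
# numeric exam scores into low-/high-grade labels and evaluating a
# scorer on both a classification and an answer-ranking metric.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical ground-truth scores on a 0-10 scale, plus noisy predictions.
true_scores = rng.uniform(0, 10, size=100)
pred_scores = true_scores + rng.normal(0, 2, size=100)

# Binary formulation: high-grade (1) if the score reaches the pass mark.
threshold = 5.0  # assumed pass mark, for illustration
y_true = (true_scores >= threshold).astype(int)
y_pred = (pred_scores >= threshold).astype(int)

print("F1 (binary classification):", round(f1_score(y_true, y_pred), 3))
# Ranking quality: how well raw scores order high- above low-grade answers.
print("ROC-AUC (answer ranking):", round(roc_auc_score(y_true, pred_scores), 3))
```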
Supported by the AutoGrade project of Stockholm University.
Notes
1. We use TfidfVectorizer for feature extraction, with all parameters left at their default values.
3. The size of the flattened representation restricts us from running the models for more repetitions.
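To make note 1 concrete, below is a minimal sketch of such a pipeline in scikit-learn, with TfidfVectorizer left entirely at its defaults. The toy responses, the labels, and the choice of LogisticRegression as the downstream classifier are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of a TF-IDF baseline for binary (low-/high-grade)
# free-text response classification. Data and classifier are
# illustrative assumptions, not the paper's exact configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical free-text responses with binary grade labels
# (1 = high-grade, 0 = low-grade).
responses = [
    "Phishing is a social engineering attack that steals credentials.",
    "idk something about email",
    "K-means partitions data into k clusters by minimising variance.",
    "clusters are groups",
]
labels = [1, 0, 1, 0]

# TfidfVectorizer with all parameters at their defaults, as in note 1.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, responses, labels, cv=2, scoring="f1")
print(f"Mean F1 across folds: {scores.mean():.3f}")
```

With default parameters, TfidfVectorizer lowercases the input and builds unigram tf-idf features with l2 normalisation, which is the configuration referred to in note 1.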
Acknowledgements
This work was supported by the AutoGrade project (https://datascience.dsv.su.se/projects/autograding.html) of the Dept. of Computer and Systems Sciences at Stockholm University.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Ljungman, J., et al. (2021). Automated Grading of Exam Responses: An Extensive Classification Benchmark. In: Soares, C., Torgo, L. (eds.) Discovery Science. DS 2021. Lecture Notes in Computer Science, vol. 12986. Springer, Cham. https://doi.org/10.1007/978-3-030-88942-5_1
DOI: https://doi.org/10.1007/978-3-030-88942-5_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88941-8
Online ISBN: 978-3-030-88942-5
eBook Packages: Computer Science, Computer Science (R0)