Abstract
Short answer scoring (SAS) is the task of grading short texts written by learners. In recent years, deep-learning-based approaches have substantially improved the performance of SAS models, but how to guarantee high-quality predictions remains a critical issue when applying such models in the education field. Towards guaranteeing high-quality predictions, we present the first study exploring a human-in-the-loop framework that minimizes grading cost while guaranteeing grading quality by allowing a SAS model to share the grading task with a human grader. Specifically, by introducing a confidence estimation method that indicates the reliability of model predictions, one can guarantee scoring quality by using only high-reliability predictions as scoring results and deferring low-reliability predictions to human graders. In our experiments, we investigate the feasibility of the proposed framework using multiple confidence estimation methods and multiple SAS datasets. We find that our human-in-the-loop framework allows automatic scoring models and human graders to achieve the target scoring quality.
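The routing idea described in the abstract can be summarized as a simple confidence-threshold rule. The following is a minimal sketch of that idea, not the authors' implementation; the use of the maximum softmax posterior as the confidence score, the threshold value, and all names (`Prediction`, `route_predictions`) are illustrative assumptions.

```python
# Minimal sketch of confidence-based routing between a SAS model and human graders.
# Assumption: the model's softmax posterior for the predicted score serves as the
# confidence estimate, and `threshold` is chosen (e.g., on a development set) so that
# automatically accepted answers meet the acceptable scoring error.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    answer_id: str
    predicted_score: int
    confidence: float  # e.g., maximum softmax probability of the SAS model


def route_predictions(preds: List[Prediction],
                      threshold: float) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into automatically accepted ones and ones deferred to humans."""
    auto_scored = [p for p in preds if p.confidence >= threshold]
    needs_human = [p for p in preds if p.confidence < threshold]
    return auto_scored, needs_human


# Example: only high-confidence predictions are used as final scores;
# the rest are sent to human graders, trading grading cost for scoring quality.
preds = [
    Prediction("ans1", predicted_score=3, confidence=0.97),
    Prediction("ans2", predicted_score=1, confidence=0.62),
]
auto_scored, needs_human = route_predictions(preds, threshold=0.9)
```

Under this rule, raising the threshold shifts more answers to human graders (higher cost, higher guaranteed quality), while lowering it does the opposite.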
Notes
- 2. We assume that the acceptable scoring error is determined by test administrators.
- 6. We used pretrained BERT models from https://huggingface.co/bert-base-uncased for English and https://github.com/cl-tohoku/bert-japanese for Japanese.
- 8. We use the posterior probability because there is no significant difference in performance among the three methods, and the posterior is the most widely used confidence estimate (see the sketch following these notes).
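As a concrete illustration of notes 6 and 8, the sketch below loads the English pretrained BERT mentioned above via the Hugging Face `transformers` library and uses the maximum softmax posterior of a sequence-classification head as the confidence score. The classification head (randomly initialized here), the number of score labels, and the example answer are illustrative assumptions, not the authors' setup.

```python
# Sketch: posterior-probability confidence from a BERT-based scorer (assumptions noted above).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# num_labels is hypothetical: one class per possible score (e.g., 0-3 points).
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)
model.eval()

inputs = tokenizer("an example short answer", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, num_labels)
probs = torch.softmax(logits, dim=-1)

predicted_score = int(probs.argmax(dim=-1))  # the score the model would assign
confidence = float(probs.max())              # posterior of the predicted score,
                                             # used as the confidence estimate
```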
Acknowledgments
This work was supported by JSPS KAKENHI Grant Numbers JP22H00524, JP19K12112, and JP21H04901.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Funayama, H., Sato, T., Matsubayashi, Y., Mizumoto, T., Suzuki, J., Inui, K. (2022). Balancing Cost and Quality: An Exploration of Human-in-the-Loop Frameworks for Automated Short Answer Scoring. In: Rodrigo, M.M., Matsuda, N., Cristea, A.I., Dimitrova, V. (eds) Artificial Intelligence in Education. AIED 2022. Lecture Notes in Computer Science, vol 13355. Springer, Cham. https://doi.org/10.1007/978-3-031-11644-5_38
DOI: https://doi.org/10.1007/978-3-031-11644-5_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-11643-8
Online ISBN: 978-3-031-11644-5
eBook Packages: Computer Science, Computer Science (R0)