Abstract
Multiple-choice questions with item-writing flaws can negatively impact student learning and skew analytics. These flaws are often present in student-generated questions, making it difficult to assess their quality and suitability for classroom use. Existing methods for evaluating multiple-choice questions often focus on machine-readability metrics, without considering their intended use within course materials or their pedagogical implications. In this study, we compared the performance of a rule-based method we developed against a machine learning-based method utilizing GPT-4 on the task of automatically assessing multiple-choice questions for 19 common item-writing flaws. Analyzing 200 student-generated questions from four different subject areas, we found that the rule-based method correctly detected 91% of the flaws identified by human annotators, compared to 79% for GPT-4. We demonstrated the effectiveness of both methods in identifying common item-writing flaws in student-generated questions across subject areas. The rule-based method can accurately and efficiently evaluate multiple-choice questions from multiple domains, outperforming GPT-4 and going beyond existing metrics that do not account for the educational use of such questions. Finally, we discuss the potential of these automated methods to improve question quality based on the identified flaws.
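To give a concrete sense of what rule-based flaw detection looks like, the sketch below checks a question against three flaws drawn from common item-writing guidelines. This is a minimal illustration, not the paper's actual rule engine: the flaw names, the 1.5x length threshold, and the MCQ structure are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class MCQ:
    stem: str
    options: list[str]
    answer_index: int  # index of the keyed (correct) option

def detect_flaws(q: MCQ) -> list[str]:
    """Return the names of any item-writing flaws the rules fire on."""
    flaws = []
    lowered = [o.strip().lower() for o in q.options]

    # Rule: "all of the above" / "none of the above" options are a
    # commonly cited item-writing flaw.
    if any(o in ("all of the above", "none of the above") for o in lowered):
        flaws.append("aota_nota_option")

    # Rule: negatively worded stems (NOT, EXCEPT) tend to confuse students.
    stem_words = q.stem.lower().replace("?", "").split()
    if "not" in stem_words or "except" in stem_words:
        flaws.append("negative_stem")

    # Rule: the keyed option should not be conspicuously longer than the
    # distractors (the 1.5x threshold here is an illustrative choice).
    lengths = [len(o) for o in q.options]
    longest_distractor = max(
        n for i, n in enumerate(lengths) if i != q.answer_index
    )
    if lengths[q.answer_index] > 1.5 * longest_distractor:
        flaws.append("longest_option_is_correct")

    return flaws

if __name__ == "__main__":
    q = MCQ(
        stem="Which of the following is NOT a mammal?",
        options=["Dolphin", "Bat",
                 "Salmon, a cold-blooded fish that lays eggs", "Whale"],
        answer_index=2,
    )
    print(detect_flaws(q))  # ['negative_stem', 'longest_option_is_correct']
```

Because each rule is an explicit predicate over the stem and options, a detector in this style labels a question with any subset of the flaw catalogue (a multi-label setup) and its judgments remain inspectable, in contrast to prompting a large language model for the same assessment.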
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Moore, S., Nguyen, H.A., Chen, T., Stamper, J. (2023). Assessing the Quality of Multiple-Choice Questions Using GPT-4 and Rule-Based Methods. In: Viberg, O., Jivet, I., Muñoz-Merino, P., Perifanou, M., Papathoma, T. (eds) Responsive and Sustainable Educational Futures. EC-TEL 2023. Lecture Notes in Computer Science, vol 14200. Springer, Cham. https://doi.org/10.1007/978-3-031-42682-7_16
DOI: https://doi.org/10.1007/978-3-031-42682-7_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42681-0
Online ISBN: 978-3-031-42682-7
eBook Packages: Computer Science, Computer Science (R0)