Abstract
Through design and development research (DDR), we aimed to create a validated automatic question generation (AQG) system using large language models (LLMs) like ChatGPT, enhanced by prompt engineering techniques. While AQG has become increasingly integral to online learning for its efficiency in generating questions, issues such as inconsistent question quality and the absence of transparent and validated evaluation methods persist. Our research focused on creating a prompt engineering protocol tailored for AQG. This protocol underwent several iterations of refinement and validation to improve its performance. By gathering validation scores and qualitative feedback on the produced questions and the system’s framework, we examined the effectiveness of the system. The study findings indicate that our combined use of LLMs and prompt engineering in AQG produces questions with statistically significant validity. Our research further illuminates academic and design considerations for AQG in English education: (a) certain question types might not be optimal for generation via ChatGPT, and (b) ChatGPT sheds light on the potential for collaborative AI-teacher efforts in question generation, especially within English education.
Data Availability
The data supporting the findings of this study exist, but access is restricted because they include personal interviews and sensitive material. As a result, the data cannot be made publicly available. However, the authors will evaluate data access requests individually, taking into account the reasonableness of the request and whether proper rights have been obtained.
Notes
This link is connected to the AQG manual in this research. https://docs.google.com/document/d/1h23DtAVeKHd1AiUvTlpVN3AG073xRg4vygpu4-o82-s/edit?usp=sharing
This link is connected to the AQG manual in this research. https://docs.google.com/document/d/1h23DtAVeKHd1AiUvTlpVN3AG073xRg4vygpu4-o82-s/edit?usp=sharing
This link is connected to the AQG Example Output in this research. https://docs.google.com/document/d/1h23DtAVeKHd1AiUvTlpVN3AG073xRg4vygpu4-o82-s/edit?usp=sharing
Acknowledgements
This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00368, A Neural-Symbolic Model for Knowledge Acquisition and Inference Techniques).
Ethics declarations
Conflict of Interest
None.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix 1
An appendix contains supplementary information that is not an essential part of the text itself but that may help provide a more comprehensive understanding of the research problem, or information that is too cumbersome to include in the body of the paper. We therefore provide a link to the full manual in the footnote (Footnote 2).
1.1 Introduction to ChatGPT
ChatGPT is a powerful tool developed by OpenAI that can understand and generate human-like responses in conversations. It can help generate questions for different purposes, including educational assessments. ChatGPT works by learning from a large amount of text data, which helps it understand grammar, vocabulary, and the meaning of words in different contexts. This learning process is analogous to training a brain to understand language.
Once ChatGPT has learned from the training data, it can be fine-tuned for specific tasks like generating questions. During the fine-tuning process, it learns to generate questions that are relevant and appropriate based on a given passage or topic.
To use ChatGPT for question generation, you provide it with a passage of text, and it uses its knowledge and understanding of language to create questions that test the reader's comprehension of the material. It tries to generate questions that make sense based on the information in the passage.
However, it's important to know that ChatGPT has some limitations. Sometimes it may generate incorrect or nonsensical responses, and it can have difficulty understanding complex or ambiguous queries. Also, it relies on patterns it has learned from the training data and may generate responses that sound plausible but are actually incorrect.
Despite these limitations, ChatGPT has great potential as a tool for generating questions. By carefully using and validating its responses, it can assist in creating contextually appropriate and varied questions, making assessments more effective and efficient.
In summary, ChatGPT is an exciting tool that can understand and generate human-like responses. It can be used to generate questions that test understanding, but it's important to be aware of its limitations and use it wisely in order to benefit from its capabilities in question generation.
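The usage described above, where a passage is supplied together with an instruction, can be sketched in a few lines of Python. This is only an illustrative sketch: the helper name `build_prompt` and the blank-line separator are assumptions made for the example, not part of the manual, and the resulting string would still need to be pasted into or sent to ChatGPT by whatever means the user prefers.

```python
def build_prompt(passage: str, instruction: str) -> str:
    """Join a reading passage and a question-generation instruction.

    Hypothetical helper: the manual describes prompts as 'Passage / instruction';
    the blank-line separator used here is an assumption for illustration.
    """
    passage = " ".join(passage.split())  # collapse stray whitespace and newlines
    instruction = instruction.strip()
    if not passage or not instruction:
        raise ValueError("both passage and instruction are required")
    return f"{passage}\n\n{instruction}"


# Shortened passage text, used only to keep the example small.
passage = "The public is only modestly science literate."
instruction = (
    "Make a yes-no question of identifying information that is explicitly "
    "shown in the text. After questions, give 'yes or no' choice option"
)
prompt = build_prompt(passage, instruction)
print(prompt)
```

Keeping the passage and the instruction as separate arguments makes it easy to reuse one passage with several instructions, which is how the example outputs in this appendix are organized.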
1.2 Example passage
The most plausible explanation for the science communication problem is the public’s limited capacity to comprehend science. The public is only modestly science literate. About half, we are regularly reminded, understand that the earth orbits the sun in a year as opposed to a day; less than a quarter knows that nitrogen is the most common gas in the earth’s atmosphere; less than 10% can make sense of a two-by-two contingency table essential to determining the ratio of true to false positives when assessing medical test results. So how can members of the public possibly be expected to understand what scientists are saying when scientists try to explain complex issues like climate change or nuclear power? More importantly, members of the public do not think the way scientists do. They rely on rapid, intuitive, affect-driven sources of information processing to the exclusion of the deliberate, conscious, analytic ones essential to making appropriate judgments of risk. As a result, they tend to overestimate the magnitude of more emotionally charged disasters (e.g., terrorist attacks) and discount more consequential but more temporally or emotionally remote ones (e.g., the impact of human-caused climate change). They also are more likely to rely on defective heuristics, such as crediting the opinions of their peers, a form of reasoning that can trigger self-reinforcing states of polarization.
1.3 Example output
We also provide a link to the full example outputs in the footnote (Footnote 3).
1.3.1 Example 1
A1 | Identify / y-n, alternative, t-f / multiple-choice |
Prompt | Passage / make a yes–no question of identifying information that is explicitly shown in the text. After questions, give ‘yes or no’ choice option |
Question | Is the public’s reliance on rapid, intuitive, affect-driven sources of information processing a result of their limited capacity to comprehend science? Yes or no: No |
1.3.2 Example 2
A1 | Identify / y-n, alternative, t-f / multiple-choice |
Prompt | Passage / Make a multiple choice, ‘wh’ question asking you to choose what is the topic of the passage |
Question | What is the main topic discussed in the passage? A) The public’s limited capacity to comprehend science B) The Earth’s orbit around the Sun C) Nitrogen in the atmosphere D) Two-by-two contingency tables for medical testing E) Heuristics and polarization in the public’s thinking about science-related issues |
1.3.3 Example 3
E4 | Inference / cloze / multiple-choice |
Prompt | Passage / Make a cloze test with multiple-choice at the sentence that contains information that can be implicitly inferred from the text. Rewrite the whole text with the cloze. Put the options at the end of the text |
Question | Please fill in the blank with the correct answer: The most plausible explanation for the science communication problem is the public’s limited capacity to comprehend science. The public is only modestly science literate. About half, we are regularly reminded, understands that the earth orbits the sun in a year as opposed to a day; less than a quarter knows that nitrogen is the most common gas in the earth’s atmosphere; less than 10 percent can make sense of a two-by-two contingency table essential to determining the ratio of true to false positives when assessing medical test results. So how can members of the public possibly be expected to understand what scientists are saying when scientists try to explain complex issues like climate change or nuclear power? More importantly, members of the public do not think the way scientists do. () sources of information processing to the exclusion of the deliberate, conscious, analytic ones essential to making appropriate judgments of risk. As a result, they tend to overestimate the magnitude of more emotionally charged disasters (e.g., terrorist attacks) and discount more consequential but more temporally or emotionally remote ones (e.g., the impact of human-caused climate change). They also are more likely to rely on defective heuristics, such as crediting the opinions of their peers, a form of reasoning that can trigger self-reinforcing states of polarization A) They rely on rapid, intuitive, affect-driven B) They rely on slow, logical, affect-neutral C) They rely on a combination of rapid, intuitive, affect-driven and deliberate, conscious, analytic D) They rely solely on deliberate, conscious, analytic |
1.3.4 Example 4
D5 | Sentiment / wh-q / open-ended |
Prompt | Passage / Make an open-ended question that asks the main characters' sentiments in the P1, using 'wh' question form |
Question | What were Claire and Tony's feelings towards the situation with the oil-covered bird on the beach and the boys trying to feed it? |
1.3.5 Example 5
C8 | Irrelevant Sentence / wh-q / multiple-choice |
Prompt | Passage / Put an irrelevant sentence somewhere in the text. Rewrite the text including the irrelevant sentence. Make a test question that requires me to choose an irrelevant sentence from the text |
Question | The most plausible explanation for the science communication problem is the public’s limited capacity to comprehend science. The public is only modestly science literate. About half, we are regularly reminded, understands that the earth orbits the sun in a year as opposed to a day; less than a quarter knows that nitrogen is the most common gas in the earth’s atmosphere; less than 10 percent can make sense of a two-by-two contingency table essential to determining the ratio of true to false positives when assessing medical test results. The moon is made of cheese. So how can members of the public possibly be expected to understand what scientists are saying when scientists try to explain complex issues like climate change or nuclear power? More importantly, members of the public do not think the way scientists do. They rely on rapid, intuitive, affect-driven sources of information processing to the exclusion of the deliberate, conscious, analytic ones essential to making appropriate judgments of risk. As a result, they tend to overestimate the magnitude of more emotionally charged disasters (e.g., terrorist attacks) and discount more consequential but more temporally or emotionally remote ones (e.g., the impact of human-caused climate change). They also are more likely to rely on defective heuristics, such as crediting the opinions of their peers, a form of reasoning that can trigger self-reinforcing states of polarization Test question: Which sentence in the passage is irrelevant to the topic of science communication? A) "About half, we are regularly reminded, understands that the earth orbits the sun in a year as opposed to a day" B) "The moon is made of cheese" C) "More important still, members of the public do not think the way scientists do." D) "They also are more likely to rely on defective heuristics, such as crediting the opinions of their peers" |
1.3.6 The teacher validation results of each prompt
Prompt | Mean | Variation | CVI | IRA (each item) | IRA |
A1 | 3.50 | 0.29 | 1.00 | 1 | 0.76 |
B1 | 3.38 | 0.55 | 0.88 | 0.88 | |
C1 | 3.88 | 0.13 | 1.00 | 1 | |
D1 | 3.38 | 0.27 | 1.00 | 1 | |
E1 | 3.63 | 0.27 | 1.00 | 1 | |
F1 | 3.25 | 0.50 | 0.88 | 0.88 | |
A2 | 2.88 | 0.41 | 0.75 | 0.75 | |
B2 | 3.25 | 0.21 | 1.00 | 1 | |
C2 | 3.50 | 0.29 | 1.00 | 1 | |
D2 | 3.25 | 0.21 | 1.00 | 1 | |
E2 | 3.38 | 0.27 | 1.00 | 1 | |
F2 | 3.00 | 0.57 | 0.75 | 0.75 | |
A3 | 3.13 | 0.70 | 0.75 | 0.75 | |
B3 | 3.25 | 0.79 | 0.75 | 0.75 | |
C3 | 3.50 | 0.29 | 1.00 | 1 | |
D3 | 3.13 | 0.41 | 0.88 | 0.88 | |
A4 | 2.75 | 0.79 | 0.50 | 0.5 | |
B4 | 3.25 | 0.50 | 0.88 | 0.88 | |
C4 | 3.75 | 0.21 | 1.00 | 1 | |
D4 | 3.38 | 0.55 | 0.88 | 0.88 | |
E4 | 3.38 | 0.55 | 0.88 | 0.88 | |
F4 | 3.00 | 1.43 | 0.63 | 0.63 | |
A5 | 3.25 | 1.07 | 0.88 | 0.88 | |
B5 | 3.63 | 0.27 | 1.00 | 1 | |
C5 | 3.63 | 0.27 | 1.00 | 1 | |
D5 | 3.63 | 0.27 | 1.00 | 1 | |
E5 | 3.88 | 0.13 | 1.00 | 1 | |
F5 | 3.25 | 0.50 | 0.88 | 0.88 | |
A6 | 3.25 | 0.50 | 0.88 | 0.88 | |
B6 | 3.38 | 0.55 | 0.88 | 0.88 | |
C6 | 3.50 | 0.29 | 1.00 | 1 | |
D6 | 3.50 | 0.29 | 1.00 | 1 | |
E6 | 3.50 | 0.29 | 1.00 | 1 | |
F6 | 3.13 | 1.27 | 0.75 | 0.75 | |
C8 | 3.50 | 0.57 | 0.88 | 0.88 | |
C9 | 3.00 | 1.14 | 0.75 | 0.75 | |
C10 | 3.38 | 0.55 | 0.88 | 0.88 | |
Average | 3.35 | | 0.89 | | |
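To make the Mean, Variation, and CVI columns concrete, the following sketch computes the three values for a single prompt from a list of ratings. Several points are assumptions rather than facts stated in this appendix: that eight teachers rated each prompt on a 4-point scale, that the CVI is the proportion of raters giving a score of 3 or 4, and that the Variation column is a sample variance. The function name `item_stats` and the ratings themselves are hypothetical and are not the study's data.

```python
from statistics import mean, variance


def item_stats(ratings: list[int], relevant_from: int = 3) -> dict[str, float]:
    """Summarise one prompt's ratings: mean, sample variance, and CVI.

    Assumes CVI = share of raters scoring at or above `relevant_from`.
    """
    agree = sum(1 for r in ratings if r >= relevant_from)
    return {
        "mean": round(mean(ratings), 2),
        "variance": round(variance(ratings), 2),  # sample variance (n - 1)
        "cvi": round(agree / len(ratings), 2),
    }


# Hypothetical ratings from eight raters on a 4-point scale (not study data).
ratings = [3, 3, 3, 3, 4, 4, 4, 4]
stats = item_stats(ratings)
print(stats)  # {'mean': 3.5, 'variance': 0.29, 'cvi': 1.0}
```

Under these assumptions, the hypothetical ratings give the same three values as the A1 row, which suggests, but does not confirm, that the table was computed in this way.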
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lee, U., Jung, H., Jeon, Y. et al. Few-shot is enough: exploring ChatGPT prompt engineering method for automatic question generation in english education. Educ Inf Technol (2023). https://doi.org/10.1007/s10639-023-12249-8
DOI: https://doi.org/10.1007/s10639-023-12249-8